Unsupervised Text-to-Sound Mapping via Embedding Space Alignment

Luke Dzwonczyk and Carmine-Emanuele Cella

Abstract

This work focuses on developing an artistic tool that performs an unsupervised mapping between text and sound, converting an input text string into a series of sounds from a given sound corpus. Using a pre-trained sound embedding model and a separate, pre-trained text embedding model, the goal is to find a mapping between the two feature spaces. Because our approach is unsupervised, any sound corpus can be used with the system. The tool performs text-to-sound retrieval, creating a sound file in which each word of the input text is mapped to a single sound in the corpus, and the resulting sounds are concatenated to play sequentially. We experiment with three different mapping methods and perform quantitative and qualitative evaluations of the outputs. Our results demonstrate the potential of unsupervised methods for creative applications in text-to-sound mapping.
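
To make the retrieval step concrete, here is a minimal sketch. It assumes the word embeddings have already been mapped into the sound feature space and that each corpus embedding is paired with an audio segment at a common sample rate; the cosine nearest-neighbour lookup and the file-writing details are illustrative assumptions, not the tool's exact implementation.

```python
import numpy as np
import soundfile as sf

def retrieve_and_concatenate(word_embs, sound_embs, segments, sr, out_path="output.wav"):
    """For each word embedding (already mapped into the sound space), pick the
    nearest corpus embedding by cosine similarity and concatenate the matching
    audio segments in word order."""
    # Normalise rows so that a dot product equals cosine similarity.
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    s = sound_embs / np.linalg.norm(sound_embs, axis=1, keepdims=True)
    nearest = np.argmax(w @ s.T, axis=1)              # one corpus sound per word
    audio = np.concatenate([segments[i] for i in nearest])
    sf.write(out_path, audio, sr)
    return nearest
```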

Text Input

The old pond,
A frog jumps in:
Plop!

- Bashō

And the days are not full enough
And the nights are not full enough
And life slips by like a field mouse
Not shaking the grass

- Ezra Pound

Audio Files

The sample below uses the Bashō haiku as the text input. The sound corpus consists of seven audio files, each a recording of an individual singer in a choir performing Aftonen by the Swedish composer Hugo Alfvén. With the pre-processing method "Grain 1000ms", each sound file is sliced into 1000 ms segments before being embedded in the sound feature space.

  • Text Input: Bashō
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Cluster
  • Sound Preprocessing: Grain 1000ms
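
A minimal sketch of the "Grain 1000ms" pre-processing as described above: each file is cut into fixed-length grains, and each grain is then embedded separately. The loading library and the handling of the final, shorter grain are assumptions.

```python
import librosa

def slice_grains(path, grain_ms=1000):
    """Slice an audio file into fixed-length grains (e.g. 1000 ms) that will
    each be embedded individually in the sound feature space."""
    y, sr = librosa.load(path, sr=None, mono=True)
    hop = int(sr * grain_ms / 1000)
    grains = [y[i:i + hop] for i in range(0, len(y), hop)]
    # Drop a trailing grain that is much shorter than the target length.
    if len(grains) > 1 and len(grains[-1]) < hop // 2:
        grains = grains[:-1]
    return grains, sr
```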

With the "Onsets" pre-processing method, the sound files are sliced into segments at detected onsets in the audio, which often leads to much shorter segments and, in turn, shorter outputs. The sound corpus used here is a sample pack of heavily processed bowed electric guitar.

  • Text Input: Bashō
  • Sound Corpus: Mothman
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Cluster
  • Sound Preprocessing: Onsets
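
A comparable sketch for the "Onsets" pre-processing, assuming an off-the-shelf onset detector (librosa's, here); the tool's actual detector and parameters may differ. Each segment runs from one detected onset to the next, so segment lengths follow the material rather than a fixed grain size.

```python
import numpy as np
import librosa

def slice_on_onsets(path):
    """Slice an audio file at detected onsets, producing variable-length
    segments that run from one onset to the next."""
    y, sr = librosa.load(path, sr=None, mono=True)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")
    bounds = np.concatenate(([0], onsets, [len(y)]))
    segments = [y[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    return segments, sr
```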

The following two outputs are generated using the same text input (Ezra Pound) and sound corpus (Choir), but with different mapping methods: Identity and ICP.

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 1000ms

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: ICP
  • Sound Preprocessing: Grain 1000ms
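
For context: the Identity method presumably applies no learned transformation between the two embedding spaces, while ICP iteratively refines an alignment between them. The toy loop below assumes ICP stands for iterative closest point and that both spaces already share a dimensionality; it alternates nearest-neighbour matching with an orthogonal Procrustes fit and is a sketch, not the tool's exact procedure.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def icp_align(text_embs, sound_embs, n_iters=20):
    """Toy iterative-closest-point alignment between two embedding clouds.
    Alternates (1) matching each text embedding to its nearest sound
    embedding and (2) solving an orthogonal Procrustes problem for the
    rotation that best maps the text cloud onto its matches. Assumes the
    two spaces share the same dimensionality."""
    W = np.eye(text_embs.shape[1])                  # start from the identity map
    for _ in range(n_iters):
        mapped = text_embs @ W
        # Nearest sound embedding for each mapped text embedding.
        dists = ((mapped[:, None, :] - sound_embs[None, :, :]) ** 2).sum(axis=-1)
        matches = sound_embs[dists.argmin(axis=1)]
        # Best orthogonal map from the original text embeddings to the matches.
        W, _ = orthogonal_procrustes(text_embs, matches)
    return W
```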

These two outputs are generated using the same text input (Ezra Pound) and sound corpus (Choir), but with different text encoders: fastText and RoBERTa. Note the effect of a static (fastText) versus a contextual (RoBERTa) text encoder on the output. With fastText, the word "and" maps to exactly the same sound each time it is repeated. With RoBERTa, each repetition of "and" maps to the same "oh" vowel, but sung by a different singer.

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 1000ms

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: RoBERTa
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 1000ms
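
To illustrate the static-versus-contextual distinction (our example, independent of the tool): a contextual encoder such as RoBERTa gives each occurrence of "and" a slightly different vector, so repetitions can retrieve different corpus sounds, whereas a static encoder such as fastText assigns one fixed vector per word type.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

line = "And the days are not full enough And the nights are not full enough"
with torch.no_grad():
    enc = tok(line, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]       # one vector per token

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
and_idx = [i for i, t in enumerate(tokens) if t.lstrip("Ġ").lower() == "and"]
v1, v2 = hidden[and_idx[0]], hidden[and_idx[1]]
print(torch.cosine_similarity(v1, v2, dim=0))        # high, but not exactly 1.0

# A static encoder (e.g. fastText via ft = fasttext.load_model("cc.en.300.bin"))
# would return ft.get_word_vector("and") for every occurrence: identical vectors,
# hence the identical retrieved sound noted above.
```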

The following sample is generated with a different sound corpus (samples from the Polish Radio Experimental Studio) and a shorter grain size (500 ms).

  • Text Input: Ezra Pound
  • Sound Corpus: Polish Radio Experimental Studio samples
  • Sound Encoder: MuQ
  • Text Encoder: RoBERTa
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 500ms