Unsupervised Text-to-Sound Mapping via Embedding Space Alignment

Luke Dzwonczyk and Carmine-Emanuele Cella

Abstract

This work focuses on developing an artistic tool that performs an unsupervised mapping between text and sound, converting an input text string into a series of sounds from a given sound corpus. Using a pre-trained sound embedding model and a separate, pre-trained text embedding model, the goal is to find a mapping between the two feature spaces. Because our approach is unsupervised, any sound corpus can be used with the system. The tool performs text-to-sound retrieval, creating a sound file in which each word of the input text is mapped to a single sound in the corpus, and the resulting sounds are concatenated to play sequentially. We experiment with three different mapping methods and perform quantitative and qualitative evaluations of the outputs. Our results demonstrate the potential of unsupervised methods for creative applications in text-to-sound mapping.
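
To make the retrieval step concrete, here is a minimal sketch. It assumes the word embeddings have already been mapped into the sound feature space and that each corpus embedding is paired with an audio segment at a common sample rate; the cosine nearest-neighbour lookup and the file-writing details are illustrative assumptions, not the tool's exact implementation.

```python
import numpy as np
import soundfile as sf

def retrieve_and_concatenate(word_embs, sound_embs, segments, sr, out_path="output.wav"):
    """For each word embedding (already mapped into the sound space), pick the
    nearest corpus embedding by cosine similarity and concatenate the matching
    audio segments in word order."""
    # Normalise rows so that a dot product equals cosine similarity.
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    s = sound_embs / np.linalg.norm(sound_embs, axis=1, keepdims=True)
    nearest = np.argmax(w @ s.T, axis=1)              # one corpus sound per word
    audio = np.concatenate([segments[i] for i in nearest])
    sf.write(out_path, audio, sr)
    return nearest
```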

Text Input

The old pond,
A frog jumps in:
Plop!

- Bashō

And the days are not full enough
And the nights are not full enough
And life slips by like a field mouse
Not shaking the grass

- Ezra Pound

Audio Files

The sample below uses the Bashō haiku as the text input. The sound corpus consists of seven audio files, each a recording of an individual singer in a choir performing Aftonen by the Swedish composer Hugo Alfvén. With the pre-processing method "Grain 1000ms", each sound file is sliced into 1000 ms segments before being embedded in the sound feature space.

  • Text Input: Bashō
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Cluster
  • Sound Preprocessing: Grain 1000ms
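
A minimal sketch of the "Grain 1000ms" pre-processing as described above: each file is cut into fixed-length grains, and each grain is then embedded separately. The loading library and the handling of the final, shorter grain are assumptions.

```python
import librosa

def slice_grains(path, grain_ms=1000):
    """Slice an audio file into fixed-length grains (e.g. 1000 ms) that will
    each be embedded individually in the sound feature space."""
    y, sr = librosa.load(path, sr=None, mono=True)
    hop = int(sr * grain_ms / 1000)
    grains = [y[i:i + hop] for i in range(0, len(y), hop)]
    # Drop a trailing grain that is much shorter than the target length.
    if len(grains) > 1 and len(grains[-1]) < hop // 2:
        grains = grains[:-1]
    return grains, sr
```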

With the "Onsets" pre-processing method, the sound files are sliced into segments at detected onsets in the audio, which often leads to much shorter segments and, in turn, shorter outputs. The sound corpus used here is a sample pack of heavily processed bowed electric guitar.

  • Text Input: Bashō
  • Sound Corpus: Mothman
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Cluster
  • Sound Preprocessing: Onsets
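
A comparable sketch for the "Onsets" pre-processing, assuming an off-the-shelf onset detector (librosa's, here); the tool's actual detector and parameters may differ. Each segment runs from one detected onset to the next, so segment lengths follow the material rather than a fixed grain size.

```python
import numpy as np
import librosa

def slice_on_onsets(path):
    """Slice an audio file at detected onsets, producing variable-length
    segments that run from one onset to the next."""
    y, sr = librosa.load(path, sr=None, mono=True)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")
    bounds = np.concatenate(([0], onsets, [len(y)]))
    segments = [y[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    return segments, sr
```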

The following two outputs are generated using the same text input (Ezra Pound) and sound corpus (Choir), but with different mapping methods: Identity and ICP.

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 1000ms

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: ICP
  • Sound Preprocessing: Grain 1000ms
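
For context: the Identity method presumably applies no learned transformation between the two embedding spaces, while ICP iteratively refines an alignment between them. The toy loop below assumes ICP stands for iterative closest point and that both spaces already share a dimensionality; it alternates nearest-neighbour matching with an orthogonal Procrustes fit and is a sketch, not the tool's exact procedure.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def icp_align(text_embs, sound_embs, n_iters=20):
    """Toy iterative-closest-point alignment between two embedding clouds.
    Alternates (1) matching each text embedding to its nearest sound
    embedding and (2) solving an orthogonal Procrustes problem for the
    rotation that best maps the text cloud onto its matches. Assumes the
    two spaces share the same dimensionality."""
    W = np.eye(text_embs.shape[1])                  # start from the identity map
    for _ in range(n_iters):
        mapped = text_embs @ W
        # Nearest sound embedding for each mapped text embedding.
        dists = ((mapped[:, None, :] - sound_embs[None, :, :]) ** 2).sum(axis=-1)
        matches = sound_embs[dists.argmin(axis=1)]
        # Best orthogonal map from the original text embeddings to the matches.
        W, _ = orthogonal_procrustes(text_embs, matches)
    return W
```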

These two outputs are generated using the same text input (Ezra Pound) and sound corpus (Choir), but with different text encoders: fastText and RoBERTa. Note the effect of a static (fastText) versus a contextual (RoBERTa) text encoder on the output. With fastText, the word "and" maps to exactly the same sound each time it is repeated. With RoBERTa, each repetition of "and" maps to the same "oh" vowel, but sung by a different singer.

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: fastText
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 1000ms

  • Text Input: Ezra Pound
  • Sound Corpus: Choir
  • Sound Encoder: MuQ
  • Text Encoder: RoBERTa
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 1000ms
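
To illustrate the static-versus-contextual distinction (our example, independent of the tool): a contextual encoder such as RoBERTa gives each occurrence of "and" a slightly different vector, so repetitions can retrieve different corpus sounds, whereas a static encoder such as fastText assigns one fixed vector per word type.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

line = "And the days are not full enough And the nights are not full enough"
with torch.no_grad():
    enc = tok(line, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]       # one vector per token

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
and_idx = [i for i, t in enumerate(tokens) if t.lstrip("Ġ").lower() == "and"]
v1, v2 = hidden[and_idx[0]], hidden[and_idx[1]]
print(torch.cosine_similarity(v1, v2, dim=0))        # high, but not exactly 1.0

# A static encoder (e.g. fastText via ft = fasttext.load_model("cc.en.300.bin"))
# would return ft.get_word_vector("and") for every occurrence: identical vectors,
# hence the identical retrieved sound noted above.
```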

The following sample is generated with a different sound corpus (samples from the Polish Radio Experimental Studio) and a shorter grain size (500 ms).

  • Text Input: Ezra Pound
  • Sound Corpus: Polish Radio Experimental Studio samples
  • Sound Encoder: MuQ
  • Text Encoder: RoBERTa
  • Mapping Method: Identity
  • Sound Preprocessing: Grain 500ms