Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

HF García, O Nieto, J Salamon, B Pardo… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Sketch2Sound, a generative audio model capable of creating high-quality
sounds from a set of interpretable time-varying control signals: loudness, brightness, and
pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic
imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be
implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only
40k steps of fine-tuning and a single linear layer per control, making it more lightweight than …

Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

H Flores García, O Nieto, J Salamon, B Pardo… - arXiv e …, 2024 - ui.adsabs.harvard.edu
We present Sketch2Sound, a generative audio model capable of creating high-quality
sounds from a set of interpretable time-varying control signals: loudness, brightness,
and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic
imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be
implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only
40k steps of fine-tuning and a single linear layer per control, making it more lightweight than …