Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
We present Sketch2Sound, a generative audio model capable of creating high-quality
sounds from a set of interpretable time-varying control signals: loudness, brightness, and
pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic
imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be
implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only
40k steps of fine-tuning and a single linear layer per control, making it more lightweight than …
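
As a rough illustration of the two ideas in the abstract (interpretable per-frame controls, and a single linear layer per control), here is a minimal NumPy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the helper names (`frame_signal`, `loudness_db`, `spectral_centroid_hz`, `LinearControl`, `condition_latents`), the use of RMS level in dB as "loudness" and spectral centroid as "brightness", the frame sizes, and the choice to add the projected controls to the latents. Pitch is left out because it would need a separate pitch tracker, and the controls are assumed to be already aligned to the latent frame rate.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=256):
    """Slice a 1-D waveform into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def loudness_db(frames, eps=1e-8):
    """Per-frame RMS level in dB (assumed stand-in for the loudness control)."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)

def spectral_centroid_hz(frames, sr, eps=1e-8):
    """Per-frame spectral centroid in Hz (assumed stand-in for brightness)."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + eps)

class LinearControl:
    """A single linear layer mapping a per-frame control to the latent width."""
    def __init__(self, in_dim, latent_dim, rng):
        self.W = rng.normal(0.0, 0.02, size=(in_dim, latent_dim))
        self.b = np.zeros(latent_dim)

    def __call__(self, c):
        # c has shape (n_frames, in_dim); result has shape (n_frames, latent_dim)
        return c @ self.W + self.b

def condition_latents(latents, controls, layers):
    """Add each projected control to the latents (hypothetical conditioning scheme)."""
    out = latents.copy()
    for name, c in controls.items():
        out = out + layers[name](c)
    return out
```

In use, one would frame a vocal imitation, compute the control curves, reshape each to `(n_frames, 1)`, and pass them with one `LinearControl` per control to `condition_latents`. In a real system the linear layers would be trained during fine-tuning rather than left at their random initial values.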