Spleeter contains pre-trained models for:
• vocals/accompaniment separation.
• 4-stem separation as in SiSec (Stöter, Liutkus, & Ito, 2018) (vocals, bass, drums, and
other).
• 5-stem separation with an extra piano stem (vocals, bass, drums, piano, and other). It
is, to the authors’ knowledge, the first released model to perform such a separation.
The pre-trained models are U-nets (Jansson et al., 2017) and follow specifications similar to
those in (Prétet, Hennequin, Royo-Letelier, & Vaglio, 2019). The U-net is an encoder/decoder
Convolutional Neural Network (CNN) architecture with skip connections. We used 12-layer
U-nets (6 layers for the encoder and 6 for the decoder). A U-net estimates a
soft mask for each source (stem). The training loss is the L1 norm between the masked input
mix spectrogram and the source-target spectrogram. The models were trained on Deezer’s
internal datasets (notably the Bean dataset used in (Prétet et al., 2019)) with the Adam
optimizer (Kingma & Ba, 2014). Training took approximately a full week on a single Graphics
Processing Unit (GPU).
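As an illustration, the masking loss can be sketched in NumPy. This is a simplified sketch, not Spleeter’s actual TensorFlow implementation; the function name, shapes, and toy data are illustrative only:

```python
import numpy as np

def l1_mask_loss(mix_mag, target_mags, masks):
    """L1 loss between masked mixture spectrograms and source targets.

    mix_mag:     (freq, time) magnitude spectrogram of the mixture
    target_mags: list of (freq, time) source-target magnitude spectrograms
    masks:       list of (freq, time) soft masks in [0, 1], one per stem
    """
    return sum(
        np.abs(mask * mix_mag - target).mean()
        for mask, target in zip(masks, target_mags)
    )

# Toy check: masks that exactly reproduce the targets give zero loss.
rng = np.random.default_rng(0)
mix = rng.random((513, 100))
masks = [np.full_like(mix, 0.6), np.full_like(mix, 0.4)]
targets = [m * mix for m in masks]
print(l1_mask_loss(mix, targets, masks))  # → 0.0
```

In training, the masks would be the U-net outputs and the targets the isolated-stem spectrograms, with gradients flowing back through the mask estimates.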
Separation is then performed from the estimated source spectrograms, using either soft
masking or multi-channel Wiener filtering.
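The soft-masking step can likewise be sketched in NumPy. This is a simplified single-channel sketch under illustrative shapes (multi-channel Wiener filtering, the alternative mentioned above, is more involved and omitted here):

```python
import numpy as np

def soft_masks(est_mags, eps=1e-10):
    """Ratio masks from per-stem magnitude estimates; they sum to ~1 per bin."""
    est = np.asarray(est_mags)                      # (n_stems, freq, time)
    return est / (est.sum(axis=0, keepdims=True) + eps)

def separate(mix_stft, est_mags):
    """Apply each soft mask to the complex mixture STFT."""
    return soft_masks(est_mags) * mix_stft          # broadcasts over stems

# Toy check: the masked stems sum back to the mixture.
rng = np.random.default_rng(1)
mix_stft = rng.random((513, 50)) + 1j * rng.random((513, 50))
est_mags = rng.random((2, 513, 50))                 # e.g. vocals / accompaniment
stems = separate(mix_stft, est_mags)
print(np.allclose(stems.sum(axis=0), mix_stft, atol=1e-6))  # → True
```

Each stem’s waveform would then be recovered by an inverse STFT of its masked spectrogram.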
Training and inference are implemented in TensorFlow, which makes it possible to run the
code on a Central Processing Unit (CPU) or a GPU.