1. Introduction
Vocal performance is an important art form in music. The task of singing voice separation is to isolate the vocals from an audio mixture that also contains other instrumental sounds, which help define the harmony, rhythm, and genre. Singing voice separation is often the first step towards many application-oriented vocal processing tasks, including pitch correction, voice beautification, and style transfer, as implemented in mobile apps such as WeSing and Smule. It is also often a preprocessing step for other research tasks such as singer identification (), lyrics alignment (), and tone analysis ().
There are various scenarios in which video recordings of singing performances are available, such as operas, music videos (MVs), and self-recorded singing activities. In pop music, creative visual performances give artists a substantial competitive advantage. Moreover, thanks to the rapid growth of Internet bandwidth and the number of smartphone users, videos of singing activities are becoming popular on video sharing platforms such as TikTok and Instagram.
Visual information, e.g., lip movement, has been incorporated into speech signal processing and has shown benefits in tasks such as audiovisual speech separation (), enhancement (), and recognition (). Visual information has also been incorporated into music analysis (), such as source association (; , ), source separation (), multi-pitch analysis (), playing technique analysis (), cross-modal retrieval (), and generation (; ). For singing performances, however, little work has been done. It is reasonable to expect that visual information would also help analyze singing activities and, in particular, separate singing voices from background music, given that the mouth movements and facial expressions of the singer are often correlated with fluctuations of the singing voice signal. The advantages of audiovisual analysis over audio-only analysis are best shown on songs with multiple vocal sources but only one target vocal source for separation, e.g., songs with backing vocals in the accompaniment. However, to what extent the incorporation of visual information helps singing voice separation remains an open question. Different from speech signals, singing voices (except in rap music) often contain prolonged vowels and less frequent consonants (), which match mouth movements less apparently (). Furthermore, some musically important fluctuations of the singing voice, such as pitch modulations, show little, if any, correlation with mouth movements ().
Therefore, it is our intention to answer the following research question in this paper: Can visual information about the singer improve singing voice separation, and if yes, how much? It is noted that while traditional singing voice separation tasks (e.g., MIREX, SiSEC2018, or MDX2021) define all vocal components in a song as the singing voice, in this work we define it as separating the solo singing voice from the accompaniment, where the accompaniment may contain backing vocals. We argue that our definition makes sense for many songs as it separates the solo part, typically presenting the main melody, from accompaniment, typically presenting harmony. Separating the solo voice enables many applications such as solo vocal pitch correction () and vocal effect generation for the soloist without affecting the backing vocal sources. The solo singing voice separation problem is somewhat similar to speech enhancement with babble noise (). However, music accompaniment is typically much louder and richer in timbre than background noise in speech enhancement settings. In addition, music accompaniment, especially backing vocals, shows very strong correlations with the solo vocal signal. These factors make the problem at hand very challenging.
To answer this research question, we design an audiovisual neural network model to separate the solo singing voice from the accompaniment, which may contain backing vocals. The network takes both the audio mixture signal and the mouth region of the singing video as input. The audio processing sub-network is based on MMDenseLSTM (Takahashi et al., 2018), the champion of SiSEC2018 (the latest music separation campaign with blind evaluations as of mid-2021). The visual processing sub-network uses convolutional and LSTM layers to encode the mouth movements of the singer. The audio and visual encodings are fused before being used to reconstruct the magnitude spectrogram of the solo singing voice. The training target of the proposed audiovisual network is to minimize the Mean-Square-Error (MSE) loss of the magnitude spectrogram reconstruction of the solo singing voice. To help the network learn audiovisual correlations of singing activities, we add extra vocal signals unrelated to the solo singer to the audio mixture during training. To investigate the benefits of visual information, we compare the proposed audiovisual model with several state-of-the-art audio-based singing separation methods and an audiovisual speech enhancement method. We further vary the architecture and input of the visual processing sub-network to compare their performance.
One challenge we encounter in this work is the lack of audiovisual datasets of singing performances. For training, this can be addressed by randomly mixing solo singing videos downloaded from the Internet with unrelated accompaniment music: we download a cappella audition vocal performance videos and randomly mix their audio with other accompaniment resources to generate mixtures. We name this the Audition-RandMix dataset and partition it into training, validation, and test subsets. For evaluation on real songs, however, we need audiovisual recordings of singing with the relevant accompaniment music in separate tracks. To the best of our knowledge, no such dataset exists. Therefore, we record a new audiovisual dataset named URSing, where singers are recruited to sing along with prepared accompaniment tracks in front of a camera.
We conduct experiments on both the Audition-RandMix test set and the URSing dataset. Results on both sets show that the proposed audiovisual method outperforms baseline methods in most test conditions, whether or not the accompaniment tracks contain backing vocals. We further conduct subjective evaluations on a cappella video performances in the wild to demonstrate the advantages of our proposed method.
The contributions of this paper include:
- The first work to incorporate visual information with a state-of-the-art music source separation framework to address the singing voice separation problem,
- A proposal of solo voice separation where backing vocal components, if they exist, are regarded as accompaniment tracks, which better fits many application scenarios, and
- The first audiovisual singing performance dataset, URSing, free for download.
2. Related Work
2.1 Singing Voice Separation
Early methods for singing voice separation include non-negative matrix factorization (), adaptive Bayesian modeling (, ), robust principal component analysis (; ), and auto-correlation (). Some methods address the singing separation problem using extra information such as vocal pitches () or voice activities (). More recently, deep learning based methods have been proposed to model convolutional () or recurrent structures (; ) over magnitude spectral representations of music signals. Some works also learn to reconstruct spectral phases in addition to magnitudes (; ), while others work directly on time-domain waveforms with an end-to-end training strategy (; ). Official blind evaluations and comparisons of these methods can be found in the results of SiSEC2018 (), where the best performing method, MMDenseLSTM (Takahashi et al., 2018), combines a DenseNet structure with a recurrent structure to process magnitude spectrograms. Since then, more systems with comparable or better results have been proposed and open-sourced, such as Open-Unmix (), Spleeter (), D3Net (), DEMUCS (), and LaSAFT (). More recently proposed music separation systems can also be found in the AICrowd Music Demixing Challenge, the next official music separation contest following SiSEC2018.
2.2 Audiovisual Source Separation
Most audiovisual separation works target speech signals. For speech separation, one challenge is the permutation problem, where the separated components need to be assigned to the correct talkers. Lu et al. () specifically address this problem by applying the visual information in a post-processing step that adjusts the separation mask. The same group later fuses the visual information into an audio-based deep clustering framework, yielding an audiovisual deep clustering model for speech separation (). Another work is described by Ephrat et al. (), where the input is the mixture spectrogram together with the face embeddings of all speakers appearing in the audio sample, and the training target is the complex mask that recovers the complex spectrogram of each speaker from the original spectrogram. It is noted that speech separation algorithms typically assume a noiseless or mildly noisy environment in which the speech signals are mixed. In addition, the speech signals to be separated are typically assumed to come from different speakers. Neither assumption holds in solo singing separation, as the background music is often quite strong and the backing vocals may come from the same singer as the soloist ().
Speech enhancement aims at separating speech signals from background noise. It is more relevant to singing voice separation from background music considering the foreground-background relations of sources. Hou et al. () address the speech enhancement problem using a two-stream structure that takes both noisy speech and frames of the cropped mouth regions as inputs to compute their features. These features are then concatenated by a fusion network which also outputs corresponding clean speech and reconstructed mouth regions. Another audiovisual speech enhancement work proposed by Afouras et al. () uses 1D convolutional layers to reconstruct the magnitude spectrogram of the clean speech and uses it to further estimate its phase spectrogram. The input of the visual branch is the feature embeddings from the lip region that are pre-trained on lip reading tasks.
Less work has been proposed for audiovisual music separation. Parekh et al. () apply non-negative matrix factorization (NMF) to separate string ensembles, where the bowing motions are used to derive additional constraints on the activation of audio dictionary elements. This method, however, is only evaluated on randomly assembled video scenes of string instruments where distinct bowing motions of each player are clearly captured. Zhao et al. () propose to learn static audiovisual correspondences with cross-modal source localization. The correlation between each pixel in a given video frame and the sound component can then be constructed. Follow-up works on separating music sources include recognizing the audiovisual correspondence from visual motions () and gestures () in musical instrument performances. Similar works have been proposed by Gao and Grauman () and Tzinis et al. (), where correspondences between audio and video are learned in an unsupervised manner to guide source separation. This line of research achieves promising results in audiovisual music separation for musical instrument performances, but not yet on singing voice separation.
3. Method
3.1 Network Architecture
The proposed model takes the input of the magnitude spectrogram of an audio mixture (solo vocal + music accompaniment) and the mouth region of the video frames corresponding to the solo vocal. The output is the separated magnitude spectrogram of the solo vocal. It builds upon a state-of-the-art audio separation model named MMDenseLSTM () with a video front-end model. The MMDenseLSTM model performs multi-scale processing on the input mixture spectrogram through a sequence of downsample convolutional dense blocks followed by a sequence of upsample convolutional dense blocks. The downsample blocks encode the input into a feature space, while the upsample blocks decode it to recover the target source magnitude spectrogram. Skip connections are added at each scale, similar to that in the U-net (). This “encoder-decoder” structure with skip connections is widely applied in several music separation models (; ; ). The video front-end model extracts visual features from mouth movements, which are fused with the encoded audio feature. The network structure is illustrated in Figure 1. We explain each part of the model in detail as follows.
3.1.1 Audio Separation Model
The audio separation model described in this section is the same as that proposed by Takahashi et al. (), except that we adjust the downsample/upsample parameters for audiovisual fusion when visual inputs are applied, and we drop the LSTM structure. This follows the observation that adding the LSTM structure did not bring substantial improvement in SiSEC2018, whereas it would significantly increase the number of parameters for audiovisual fusion. Each module is described below, followed by a code sketch after the list:
- Dense Block. It applies a stack of 2D convolutional layers, and the output feature maps of all layers are concatenated along the channel dimension. This structure reuses feature maps from previous layers and greatly reduces the model size.
- Compression layer. It is a convolutional layer with 1 × 1 kernels applied right after each dense block, which improves model compactness. We use a compression ratio of 0.2, which means that the number of feature maps (channels) is reduced by 80% after each compression layer.
- Downsample/Upsample. These layers are applied after the compression layers to resize the feature maps without changing the number of channels. Downsample layers are average pooling with 2 × 2 kernels after the first compression layer and 1 × 2 kernels in the following layers. In other words, downsampling is performed along both the time and frequency dimensions in the first layer, but only along the frequency dimension in the other layers. Symmetrically, the upsample layers apply transposed convolutions, with 2 × 2 kernels and strides at the last upsample layer and 1 × 2 at the other layers. Different from Takahashi et al. (2018), where downsampling/upsampling always addresses both the time and frequency dimensions at multiple scales, our strategy downsamples/upsamples the time dimension only once, so that the audio stream has the same frame rate as the video stream. The encoded audio spectrogram feature is denoted as S_A ∈ ℝ^(M×T×F), with channel (M), downsampled time (T), and frequency (F) dimensions. As in the U-net structure of Takahashi et al. (), skip connections are applied as concatenations between corresponding layers with the same feature map size.
- Multi-Band. Following Takahashi and Mitsufuji (), we also equally divide the spectrogram into a low-frequency sub-band and a high-frequency sub-band and apply the above-mentioned U-net encoder-decoder on each sub-band. The dense blocks of the low-frequency sub-band have a higher channel number. Detailed parameters are given by Takahashi and Mitsufuji ().
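To make the encoder structure concrete, the following is a minimal PyTorch sketch of one encoder scale (dense block, 1 × 1 compression, and downsampling). The layer count, growth rate, and pooling placement are illustrative assumptions; the exact hyperparameters follow the original MMDenseLSTM paper.

```python
# Minimal sketch of a dense block, compression layer, and downsampling (PyTorch);
# layer counts and growth rate are assumptions, not the paper's exact settings.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1)))
            ch += growth_rate               # feature maps of all layers are concatenated
        self.out_ch = ch

    def forward(self, x):                   # x: (batch, channels, time, freq)
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class EncoderScale(nn.Module):
    """One encoder scale: dense block -> 1x1 compression -> pooling."""
    def __init__(self, in_ch, compression=0.2, pool=(1, 2)):
        super().__init__()
        self.dense = DenseBlock(in_ch)
        comp_ch = max(1, int(self.dense.out_ch * compression))   # keep 20% of channels
        self.compress = nn.Conv2d(self.dense.out_ch, comp_ch, kernel_size=1)
        self.pool = nn.AvgPool2d(pool)      # (2, 2) at the first scale, (1, 2) afterwards
        self.out_ch = comp_ch

    def forward(self, x):
        skip = self.compress(self.dense(x))  # also reused for the U-net skip connection
        return self.pool(skip), skip
```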
While MMDenseLSTM was the best performing model in SiSEC2018, new models have been proposed since then. In this paper, we nevertheless build our audio sub-network on MMDenseLSTM, for two reasons. First, since SiSEC2018 there has not been a public music separation contest running blind evaluations of different methods. Therefore, MMDenseLSTM remains the most reliably evaluated audio separation framework for building our audiovisual separation model, although it may no longer achieve the highest performance; we emphasize reliability over cutting-edge techniques here, as this is the first study on audiovisual vocal separation. Second, MMDenseLSTM has a small model size, which makes it an ideal sub-network for our audiovisual fusion model given the relatively small size of audiovisual singing performance datasets. Table 1 compares the model sizes of MMDenseLSTM and other music separation models.
Method | # Parameters (×10⁶) |
---|---|
UMX | 8.5 |
Spleeter | 19.7 |
Demucs | 38 |
MMDenseLSTM | 1.22 |
AVDCNN | 11.3 |
Proposed | 2.05 |
3.1.2 Video Front-End Model
We propose to apply a visual branch to parse the input video stream and fuse it with the encoded audio features. The video stream is a sequence of mouth region RGB images from consecutive video frames. The video front-end model has four convolutional layers, followed by a fully connected layer, an LSTM layer, and another fully connected layer, with the parameters Conv2D@16 (channel number is 16), Conv2D@16, Conv2D@32, Conv2D@32, FC@256, LSTM@128, and FC@N, where N is the dimension of the encoded feature vector for each video frame. An input video stream with T frames results in a feature map S_V ∈ ℝ^(N×T×1). There is no pooling operation along the time dimension, so the temporal information is preserved.
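A minimal PyTorch sketch of this Conv2D+LSTM front-end is shown below. The kernel sizes, pooling, and single-channel (grayscale) 64 × 64 input crops are assumptions for illustration; only the layer widths (16, 16, 32, 32, FC@256, LSTM@128, FC@N) follow the description above.

```python
# Minimal sketch of the Conv2D+LSTM video front-end; kernel/pooling choices are assumed.
import torch
import torch.nn as nn

class VideoFrontEnd(nn.Module):
    def __init__(self, feat_dim=128):                 # feat_dim corresponds to N
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fc1 = nn.Linear(32 * 4 * 4, 256)         # 64x64 input -> 4x4 after 4 poolings
        self.lstm = nn.LSTM(256, 128, batch_first=True)
        self.fc2 = nn.Linear(128, feat_dim)

    def forward(self, frames):                        # frames: (B, T, 64, 64), grayscale
        B, T, H, W = frames.shape
        x = self.conv(frames.reshape(B * T, 1, H, W))
        x = self.fc1(x.flatten(1)).reshape(B, T, -1)
        x, _ = self.lstm(x)                           # temporal modeling, no pooling over time
        return self.fc2(x)                            # (B, T, N): one feature per video frame
```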
3.1.3 Audiovisual Fusion
The extracted visual feature map S_V ∈ ℝ^(N×T×1) from the video branch is fused with the encoded audio spectrogram feature map S_A ∈ ℝ^(M×T×F). Such a fusion is usually a concatenation operation, with the mismatched dimension flattened or broadcast. In our work, the visual feature map S_V is broadcast along the third (frequency) dimension and then concatenated with the audio feature to obtain the audiovisual feature S_AV ∈ ℝ^(L×T×F), where L = M + N is the concatenated channel dimension. Note that the temporal correspondence between the audio and video branches is preserved during this fusion; this differs from some works where audiovisual fusion is performed on feature maps that aggregate information along time.
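The fusion step amounts to a broadcast followed by a channel-wise concatenation, as in the following sketch (PyTorch, with assumed tensor layouts of (batch, channel, time, frequency) for audio and (batch, time, feature) for video):

```python
# Minimal sketch of the broadcast-and-concatenate audiovisual fusion.
import torch

def fuse(audio_feat, video_feat):
    # audio_feat: (B, M, T, F) encoded spectrogram feature
    # video_feat: (B, T, N)    per-frame visual feature
    B, M, T, F = audio_feat.shape
    v = video_feat.permute(0, 2, 1).unsqueeze(-1)   # (B, N, T, 1)
    v = v.expand(B, v.shape[1], T, F)               # broadcast along the frequency axis
    return torch.cat([audio_feat, v], dim=1)        # (B, M + N, T, F)
```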
3.2 Training
We train the model to predict the magnitude spectrogram of the source signal and use the original mixture's phase to recover the time-domain waveform. Many spectral-domain source separation methods, especially those for speech signals, use a spectrogram mask as the training target; this mask is then multiplied element-wise with the mixture signal's magnitude spectrogram to recover the source magnitude spectrogram. For music separation, some recent works train networks to directly output the source magnitude spectrogram (; Takahashi et al., 2018) using a Mean-Squared-Error (MSE) loss. In our work, we also use the MSE loss on the magnitude spectrogram, but our network first outputs a mask, computed through a sigmoid function to have a value range of [0, 1], which is then multiplied with the input spectrogram to obtain the separated source spectrogram. We find that this mask computation step is beneficial for our audiovisual separation model; a comparative experiment is presented in Section 5.4.
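The following sketch illustrates this training objective: the network output is interpreted as mask logits, squashed by a sigmoid, multiplied with the mixture magnitude spectrogram, and compared to the target solo-vocal magnitude with an MSE loss. Variable names are hypothetical.

```python
# Minimal sketch of the masked MSE training objective.
import torch
import torch.nn.functional as F

def masked_mse_loss(mask_logits, mix_mag, target_mag):
    mask = torch.sigmoid(mask_logits)     # mask values constrained to [0, 1]
    est_mag = mask * mix_mag              # estimated solo-vocal magnitude spectrogram
    return F.mse_loss(est_mag, target_mag)
```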
Compared to the audio mixture input, the visual input provides much less information about the source signals; therefore, the training loss may not be propagated back sufficiently into the visual branch, making the audiovisual network difficult to train. One way to address this is to explicitly learn audiovisual matching, either through pre-training () or early audiovisual fusion (). Another way might be to add visual reconstruction as an additional training target, leading to a chimera-like network structure ().
In this work, we address this problem by adding extra vocal components to the original mixture; these components are unrelated to the mouth movements and thus are not included in the target vocal spectrogram. This is similar to adding an additional speaker to the training data in audiovisual speech separation (), which forces the model to learn audiovisual correlations after the fusion and to separate only the vocal components related to the visual input. Note that in the training samples all of the vocal and accompaniment components are randomly mixed, so neither the extra vocal components nor the solo vocal components have harmonic relations with the accompaniment tracks. In the experiments, we show that this strategy of training with randomly generated vocal-accompaniment pairs performs decently on real songs.
4. Dataset
There are several audiovisual datasets for music performances (; ; ), but they are all about musical instrument performances. Since there is no publicly available audiovisual singing voice dataset containing isolated vocal tracks, we collect our own data for training and evaluating the proposed method.
4.1 Audition-RandMix
This dataset contains random mixtures of solo vocals, other vocals, and instrumental accompaniment. Each component is independently collected and randomly mixed. To collect solo vocals with videos, we curated 491 YouTube videos of solo singing performances by querying the YouTube search API with the keyword "Academic Acappella Audition". We only selected video excerpts where the singer faces the camera and sings without accompaniment. The total length of these excerpts is about 8 hours. This set of data is referred to as "A Cappella Audition Vocals (AAV)". We then randomly chose instrumental accompaniment tracks (the accompaniment tracks from the MUSDB18 dataset) and mixed them with the solo singing excerpts to create singing-accompaniment mixtures. To prepare the extra vocal components, we also downloaded 2 hours of choral recordings from YouTube, which are acoustically similar to the backing vocals in some pop songs.
The randomly mixed samples are used for training, validation, and evaluation. Before mixing, the vocals in AAV are divided into training, validation, and evaluation sets roughly in an 8:1:1 ratio (50 tracks for evaluation). The instrumental accompaniment tracks from MUSDB18 (which covers a wide range of music genres and instrument types) are also divided into the three sets following the official split (also 50 tracks for evaluation). Mixing is then applied to each split independently to form the training, validation, and evaluation sets. The volume of each track is normalized using its root-mean-square (RMS) value. For the training and validation sets, each track is split into short samples (around 2.5 seconds) for random mixing, resulting in a large number of mixed samples. We do not balance the volume of each individual sample, so the mixtures may have different SNRs. During training, for half of the training and validation samples we add extra vocal components that are unrelated to the mouth movements, to encourage the model to learn audiovisual correlations. Half of the extra vocal components are solo vocals from other unrelated singers in the AAV dataset, and the other half are samples from the choral recordings. We apply a random gain between -6 dB and 0 dB to the extra vocal components, based on the observation that backing vocals are typically softer than the solo vocal in most songs. For evaluation, mixing is performed on a random bijection between the 50 vocals and 50 instrumental accompaniments. For each mixture, we pick a 30-second excerpt (with both vocal and accompaniment present) for evaluation, following the same strategy as the MUSDB18 dataset. This set is referred to as "Audition-RandMix" in the following experiments. To explore the model performance in more challenging cases, we also randomly add extra vocals to the same 50 mixtures following the same strategy as for the training set; this set is referred to as "Audition-RandMix (v+)".
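As a rough illustration, the following sketch shows how one training mixture could be assembled under the description above (RMS normalization and a random -6 dB to 0 dB gain for the extra vocal). The target RMS value and function names are assumptions, not taken from the original implementation.

```python
# Minimal sketch of the random mixing used to build training samples, assuming
# waveforms already cut to ~2.5 s clips.
import numpy as np

def rms_normalize(x, target_rms=0.1):
    rms = np.sqrt(np.mean(x ** 2) + 1e-8)
    return x * (target_rms / rms)

def make_mixture(solo_vocal, accompaniment, extra_vocal=None, rng=np.random):
    solo = rms_normalize(solo_vocal)
    acc = rms_normalize(accompaniment)
    mix = solo + acc
    if extra_vocal is not None:                  # added for half of the samples
        gain_db = rng.uniform(-6.0, 0.0)         # backing vocals softer than the solo
        mix = mix + rms_normalize(extra_vocal) * (10 ** (gain_db / 20))
    return mix, solo                             # the target is the solo vocal only
```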
Note that all the samples in this condition are artificial mixtures that cannot represent real songs, since vocals and accompaniments are unrelated. However, training on randomly mixed samples has been found still helpful for separating real songs (), and artificial mixtures have also been used as evaluation data for music separation tasks ().
4.2 URSing
To evaluate the proposed method on more realistic singing performances, we create the University of Rochester Multi-Modal Singing Performance Dataset (URSing). In this paper, we only use the URSing dataset for evaluation. The creation process is briefly described below.
4.2.1 Singer Recruiting
Singers are students at the University of Rochester. An audition is performed to filter out singers who cannot sing in tune. Each participant receives $5 for each recorded song and is allowed to record up to 5 songs. Each singer signed a consent form authorizing the release of the dataset for research purposes. In total, 22 singers participated in the recording, including 11 male and 11 female singers.
4.2.2 Piece Selection
To ensure high recording efficiency, the singers pick their own songs and their favorite accompaniment tracks to sing along with. Most are commercial songs. We do not put constraints on song genre, but we filter out songs whose accompaniment tracks have low sound quality.
4.2.3 Recording
To ensure synchronization, the singers listen to the accompaniment track through earphones while recording their singing voice. Their voices are recorded using an AT2020 condenser microphone hosted by Logic Pro X, and their videos are recorded using an iPhone 11. The recording is conducted in a semi-anechoic sound booth. A sample photo and the floor plan of the sound booth are shown in Figure 2.
4.2.4 Post-processing
For each solo vocal recording we use the following plug-ins to simulate the typical audio production procedure in commercial recordings: a) static noise reduction (Klevgrand Brusfri and Waves X-noise), b) pitch refinement (Melodyne), c) sound compression (Fabfilter Pro-C 2), and d) reverberation (Fabfilter Pro-R). We also adjust the vocal volume to balance it with the accompaniment track. Beyond this, we do not perform any other editing on the audio recording (e.g., time warping or rhythmic refinement) to preserve the synchronization with the visual performance. To synchronize the audio recording captured by the AT2020 microphone with the video recording captured by the smartphone, we use the audio recording captured by the built-in microphone of the smartphone as the bridge, through cross correlation.
4.2.5 Annotation
Since mouth movements are the most relevant visual cue for the singing performance, we provide annotations of the mouth regions in the dataset. The annotation is performed using the Dlib library (), an automatic tool for facial landmark detection, followed by manual checking. The mouth region is represented as a square bounding box whose side length equals 1.2 times the maximum horizontal distance among all mouth landmarks.
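A minimal sketch of this annotation step is shown below, using Dlib's standard 68-point facial landmark model (mouth points 48-67). The predictor file path is an assumption, and in our pipeline the automatically detected boxes are additionally checked manually as described above.

```python
# Minimal sketch of mouth-region annotation with Dlib's 68-point landmarks.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed path

def mouth_bounding_box(gray_frame):
    faces = detector(gray_frame, 1)
    if not faces:
        return None
    shape = predictor(gray_frame, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0)
    side = 1.2 * (pts[:, 0].max() - pts[:, 0].min())   # 1.2 x max horizontal extent
    half = side / 2
    return int(cx - half), int(cy - half), int(cx + half), int(cy + half)
```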
The URSing dataset contains 65 songs, totaling 4 hours of audiovisual recordings of singing performance. For each song, we provide:
- The audio recording of the solo singing voice (in WAV, 44.1 kHz, 16 bits, mono).
- The corresponding accompaniment audio track (in WAV, 44.1 kHz, 16 bits, mono or stereo).
- The video recording of the soloist’s upper body (in MP4, 1080P portrait, 29.97 FPS).
- The annotations of mouth regions for each video frame.
Note that when we prepare the accompaniment tracks, we do not avoid the tracks containing backing vocals, as they are the challenging and useful cases to study in this paper. Example video frames and cropped mouth region pictures from the annotations are provided in Figure 3.
We also choose a set of 30-second excerpts where both the solo vocal and the accompaniment tracks are prominent to form a benchmark evaluation set. Specifically, for each of the 65 songs, we choose one 30-second excerpt without backing vocals and one with backing vocals, if such excerpts are available. We provide this information in the metadata. This results in 54 excerpts whose accompaniment tracks contain only instrumental components (referred to as "URSing" in the following experiments) and 26 excerpts whose accompaniment tracks also contain backing vocals (referred to as "URSing (v+)"). The latter, presumably, are more challenging for solo vocal separation and more useful for showing the advantages of audiovisual methods. Since we do not use any songs from URSing for training, we only use these 30-second excerpts for evaluation in this paper.
5. Experiments
5.1 Implementation Details
For the audiovisual singing videos, audio is downsampled to 32 kHz. We use a frame length of 1024 and a hop size of 640 samples (20 ms) for spectrogram calculation. Magnitude spectrograms are converted to a logarithmic scale and then normalized along each frequency bin; this increases the weight of contributions from high frequencies. Video data is converted to 25 FPS (equivalent to a 40 ms frame hop). For the original singing performance videos, the mouth regions are cropped as square bounding boxes using the Dlib library () and then resized to 64 × 64. RGB video frames are converted to grayscale and normalized to zero mean and unit variance. The feature dimension N for each video frame is set to 128. Each training sample is 2.56 seconds long, containing 128 audio frames and 64 video frames. The input/output audio spectrogram has the shape of 2 × 128 × 513 (channels × frames × frequency bins), and each input video stream has the shape of 64 × 64 × 64 (frames × width × height). We use RMSProp optimization with a learning rate of 0.01; the learning rate decays by a factor of 0.8 every 5 epochs. We use a batch size of 8 for training on a TITAN X GPU with 11.9 GB of graphics memory. It takes about 40 hours to train for 50 epochs. We adopt early stopping when the validation loss does not decrease for 10 consecutive epochs.
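For illustration, the audio input features described above could be computed as in the following sketch (32 kHz audio, 1024-point STFT, 640-sample hop, log magnitude, per-bin normalization). Computing the normalization statistics per clip, and the use of librosa, are assumptions.

```python
# Minimal sketch of the log-magnitude spectrogram input features.
import numpy as np
import librosa

def log_mag_spectrogram(wav, sr=32000, n_fft=1024, hop=640):
    mag = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop))   # (513, frames)
    log_mag = np.log1p(mag)
    mean = log_mag.mean(axis=1, keepdims=True)                     # per frequency bin
    std = log_mag.std(axis=1, keepdims=True) + 1e-8
    return (log_mag - mean) / std
```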
For evaluations, we calculate the signal-to-distortion ratio (SDR) between the separated vocal waveforms and the ground-truth ones using the BSS Eval Toolbox V4, the same as the evaluation measure applied in SiSEC2018. Specifically, for each 30-sec evaluation excerpt, we calculate the median SDR over all 1-sec audio segments.
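A sketch of this evaluation procedure, assuming the museval implementation of BSS Eval v4, is shown below; the accompaniment estimate could, for example, be obtained as the mixture minus the vocal estimate.

```python
# Minimal sketch of median SDR computation with 1-second windows (BSS Eval v4 via museval).
import numpy as np
import museval

def median_vocal_sdr(vocal_ref, acc_ref, vocal_est, acc_est, sr=32000):
    # all inputs: (num_samples, num_channels) waveforms
    references = np.stack([vocal_ref, acc_ref])     # (2, num_samples, num_channels)
    estimates = np.stack([vocal_est, acc_est])
    sdr, isr, sir, sar = museval.evaluate(references, estimates,
                                          win=sr, hop=sr)   # 1-second segments
    return np.nanmedian(sdr[0])                     # median SDR of the vocal source
```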
5.2 Baselines
We first use the original mixture recording (referred to as "MIX" in the experiments) as the separated vocal for evaluation on our datasets. This sets a lower bound on the separation results, obtained without any separation technique. We then apply two oracle filtering techniques that utilize the ground-truth source signals. The ideal binary mask (IBM) assigns each time-frequency bin to the predominant source. The ideal ratio mask (IRM) distributes the power of each time-frequency bin among the sources according to the power ratio of the ground-truth sources. IBM and IRM set upper bounds for time-frequency masking-based source separation methods.
We then compare our proposed method with several audio-based music separation methods as baselines.
- UMX (). An open-source separation method known as "Open-Unmix". The model employs a BLSTM structure and is trained on the MUSDB18 dataset.
- Spleeter (). An open-source music separation method with a CNN+U-net model trained on their in-house dataset of 24,097 songs.
- Demucs. An open-source music separation method with a U-net and LSTM structure that processes the signal in the waveform domain. It achieved the best separation performance among all open-source methods to date.
- Spleeter-train. The same model as “Spleeter” but trained on our Audition-RandMix dataset using the same conditions as those for our proposed audiovisual method as a direct comparison.
- Demucs-train. The same model as “Demucs” but trained on our Audition-RandMix dataset.
- MMDenseLSTM (). The method that achieved the best results in SiSEC2018, even without training on extra data. We implemented this method from scratch. Our implementation has been validated by achieving similar vocal separation performance on the MUSDB18 test set. We then trained this model on our Audition-RandMix dataset as a direct comparison.
We also implement an audiovisual speech enhancement method named AVDCNN, proposed by Hou et al. (). This method applies 2D CNNs to take the noisy speech and the mouth region of a visual recording as inputs, and fuses the encoded audio and visual features to output the enhanced speech signal as well as reconstructed video frames of mouth movements. After the fusion layers, we use LSTM layers instead of the fully connected layers used by Hou et al. (), which achieves higher performance in our experimental scenarios.
We choose audiovisual speech enhancement instead of audiovisual speech separation as the baseline, because we believe that speech enhancement is more relevant to singing voice separation from background music in terms of foreground-background relations of sources, as explained in Section 2.2. In addition, audiovisual speech separation usually assumes the availability of the video recordings of all talkers, while in our setting, only the video of the solo singing voice is used.
We present the model sizes of all the models in Table 1.
5.3 Objective Evaluation of Separation Results
We evaluate the comparison methods on the four test sets described in Section 4: Audition-RandMix, Audition-RandMix (v+), URSing, and URSing (v+). Again, “v+” means that the accompaniments contain vocal components. Boxplots of SDR results are shown in Figure 4, where each data point in the boxplots is the median SDR of the separated vocal of all 1-sec segments of a 30-sec excerpt. The horizontal line inside each box indicates the median value across all excerpts. Several interesting observations can be made from the results.
5.3.1 Benefits of Visual Information
The proposed method outperforms the audio-based separation baselines on most of the evaluation sets. This shows the advantage of incorporating visual information about the singer's mouth movements for solo singing voice separation. However, Spleeter and Demucs slightly outperform our proposed system on the URSing set. We believe this is because they are trained on much larger in-house datasets (e.g., 24,097 songs totaling 79 hours for Spleeter). This is supported by the fact that Spleeter-train and Demucs-train, the same baseline models trained on our dataset for a fair comparison, do not outperform our proposed method. We suggest that this is because our proposed model (and MMDenseLSTM) has a much smaller model size than the other audio baseline methods, making it less prone to overfitting given a small training set.
Comparing songs with backing vocals (Audition-RandMix (v+) and URSing (v+)) to songs without backing vocals (Audition-RandMix and URSing), we can see that the advantage of the proposed method is more pronounced on songs with backing vocals. Wilcoxon signed-rank tests show that the improvement of the proposed method over MMDenseLSTM is significant on both Audition-RandMix (v+) and URSing (v+), with p values of 5.1 × 10⁻³ and 2.3 × 10⁻², respectively. We argue that this is because audio-only methods, although trained to separate only the target vocal (the strongest vocal) in the experiments, often confuse the target vocal with other vocals. The proposed audiovisual method, in contrast, learns to separate only the vocal signals that are correlated with the solo singer's mouth movements.
We argue that the improvement is more pronounced on Audition-RandMix (v+) than on URSing (v+) for two reasons: 1) the backing vocals in URSing (v+) are not as strong as the intentionally added backing vocals in Audition-RandMix (v+), and 2) the backing vocals in URSing (v+) often overlap with the solo vocals and share the same lyrics, showing high correlations with the mouth movements of the solo singer, while the added backing vocals in Audition-RandMix (v+) are unrelated to the solo vocal.
Figure 5 shows one 10-sec sample as an extreme case to compare the spectrograms of the audio-based MMDenseLSTM method and the proposed audiovisual method when backing vocal components are strong (e.g., the middle part of the sample). We also show the mouth movement in several frames throughout this excerpt. It can be seen that MMDenseLSTM recognizes the backing vocal components in the middle frames as the solo vocal, while the audiovisual method suppresses those components significantly.
On songs without backing vocals, the proposed method still outperforms the baselines. Subjective listening by the authors suggests that the visual information helps to reduce high-frequency percussive sounds leaking into the separated solo vocal, as these sounds do not correlate well with mouth movements.
5.3.2 Superiority of Proposed Audiovisual Architecture
The proposed method significantly outperforms the audiovisual speech enhancement baseline on all evaluation sets. Note that this baseline is trained and evaluated on the same dataset as the proposed method. This shows the superiority of the proposed network architecture for the solo singing voice separation task, for which we argue there are two main reasons. First, the proposed model utilizes the commonly used U-net structure with skip connections, which generally achieves good results in music separation (; ; ). Second, our audiovisual fusion scheme preserves the temporal correspondence, which prevents a substantial increase in the number of trainable parameters in the fusion layer. This is important given that the DenseNet-based audio sub-network has a small model size. The choice among different video sub-networks, however, does not make much difference to the separation performance, as analyzed in Section 5.4.
5.3.3 Limitations and Room for Improvement
The SDR values in Figure 4 are much higher than those reported in SiSEC2018. For example, MMDenseLSTM reaches around 10 dB on URSing but only around 7 dB in SiSEC2018 (method "TAK1" in ()). We attribute this to the fact that the songs used in SiSEC2018 (i.e., the MUSDB18 dataset) contain professionally recorded, mixed, and mastered vocals. They often contain complex components such as polyphonic vocals, background humming, and strong reverberation, and they are mixed by professional music producers to be intentionally better fused into the background music. In contrast, the ground-truth vocals in our datasets are solo vocals recorded in controlled environments with limited vocal effects added. It is reasonable to believe that the benefits of visual information could be further demonstrated on more professionally produced songs. In addition, the performance difference between the Audition-RandMix test sets and the URSing test sets is small for all methods, including the oracle results. This shows that randomly mixed songs, although lacking harmonic and rhythmic coherence, are not easier to separate than more realistically mixed songs, suggesting that it may be reasonable to use randomly mixed songs for training () and evaluation (). Whether this still holds for professionally produced songs, however, remains an open question.
On the other hand, there is still some gap between the proposed method and the oracle results on the SDR metric in our evaluation sets. It is likely that this gap will be even bigger on professionally produced songs. This suggests that much work can be done to improve the separation performance. We have more discussion in Section 6.
5.4 Comparison of Different Video Front-End Models
To investigate the key factors of the audiovisual separation framework and its robustness, we replace the proposed Conv2D+LSTM video front-end with several other widely used visual feature extraction frameworks:
- No-mask. This experiment has the same video branch, but without a mask layer after the audiovisual fusion.
- Conv3D. The Conv3D model was first proposed by Tran et al. () and achieved the best video action recognition results at the time of publication. It takes all video frames of a sample as a feature map, with a fourth dimension of size 64 added as the channel dimension. We then apply two Conv2D layers (with channel dimensions 128 and 256) on each frame to match the channel dimension of the Conv3D output. After a pooling operation and fully connected layers, we obtain a video feature with the same dimensionality, S_Conv3D ∈ ℝ^(N×T×1). Note that in this structure, the temporal information is only parsed by the very first Conv3D layer, since no recurrent network is applied.
- Dense+LSTM. Different from the proposed model, we replace the Conv2D layers with dense blocks from the DenseNet structure. The dense block was first proposed by Huang et al. () and achieved significant improvements on image object recognition benchmarks with a smaller model size and lower computational cost. Here, each dense block has 2 layers with a growth rate of 12. A Conv2D layer with 1 × 1 kernels is then applied to compress the channel count to 32, resulting in the same feature dimensionality as the proposed CNN+LSTM model before feeding into the FC@256 layer.
- Lip-reading. This variation uses a pre-trained model proposed by Petridis et al. () for the lip reading task on the LRW dataset (). The original model structure consists of Conv3D, ResNet-34, and GRU. We only use the pre-trained model to extract the visual feature to integrate into our proposed audiovisual source separation model.
A comparison of the different video front-end models is shown in Figure 6. The proposed (Conv2D+LSTM) model achieves the highest SDR values in most cases, although some video front-end models perform similarly. Applying a mask layer is critical: without it, the audiovisual method even falls below the audio-based method. Note that for the audio-based baseline (MMDenseLSTM), we have also experimented with and without a mask layer, and it made no difference to the separation results. The Conv3D framework slightly degrades the performance, but still outperforms the audio-based baseline (MMDenseLSTM). One reason for this drop may be that this framework has no recurrent structure, so the temporal evolution of visual information is only processed by the Conv3D layers. As the Conv3D structure takes the raw mouth frames as input, it may be sensitive to mouth position changes caused by landmark detection errors. The model pre-trained on lip reading ranks the worst among the audiovisual models. This is because the lip reading model was trained on the LRW dataset, where each sample contains several words but only one word around the center frames is annotated as the training target. This makes the model attend only to the middle frames of a video excerpt, providing limited guidance for singing voice separation and even degrading performance below the audio-based methods. We have also experimented with fine-tuning the pre-trained lip reading model on our separation task, but it does not improve the separation performance beyond our proposed video front-end model, possibly because lip movements in speech and singing are different.
5.5 Subjective Evaluation on Professional A Cappella Songs
In this section, we further evaluate the benefits of the visual information incorporated in our proposed method on real a cappella songs in the wild. We collect 35 audiovisual a cappella recordings from YouTube. This collection represents extreme cases where all accompaniment components are vocals (except for several cases where additional percussive instruments are also present), allowing us to study how advantageous the proposed audiovisual method is when the audio-based method is very likely to fail. Here we use the MMDenseLSTM baseline as the audio-based method for comparison. Most of these songs are choral performances with a solo singer accompanied by harmonic vocals and/or vocal beatbox, while some are performances with multiple solo singers. We only keep videos where the solo singer's mouth is visible and clear, without video shot transitions, for at least 10 seconds. A sample frame of one song is shown in Figure 7 with the mouth region of the targeted solo singer highlighted.
As we do not have access to the source tracks, we cannot evaluate the separation performance using common objective evaluation metrics. Instead, we conduct a subjective evaluation of the source separation quality (, ) with 51 subjects. Some subjects are students or faculty from the University of Rochester; others are members of the International Society for Music Information Retrieval (ISMIR) community. Statistics of the subjects' music background are shown in Figure 8. Each survey asks a subject to rate 7 of the 35 songs, and a subject may take more than one survey; ratings from the same subject are averaged to avoid bias. The evaluations are conducted remotely through a web interface, and subjects are required to have a quiet listening environment. For each song, the subjects first watch a 10-second excerpt of the original performance and then watch the same video twice, with the solo singing voice separated by the two methods in a random order, and rate the separation quality. Because of the variation across songs, the original recording serves as a reference for a consistent scoring scheme. For each video we also highlight the mouth region of the target solo singer (see Figure 7) to help subjects focus on the corresponding solo voice. The specific evaluation questions are:
- Question 1: What do you think about the overall separation quality for the targeted singer?
- Question 2: What do you think about the separation quality in terms of removing backing vocal accompaniments in the separated solo voice?
- Question 3: What do you think about the separation quality in terms of not introducing artifacts into the separated solo voice?
The subjects need to answer each question using a scale from 1 to 5, where “1” represents Very bad and “5” represents Very good. The three questions are related to the common definitions of the three objective source separation evaluation metrics, SDR, SIR, and SAR, respectively.
The results of the subjective evaluations are presented in Figure 9. According to the collected responses for Question 1, the proposed audiovisual method is rated significantly higher than the baseline audio-based method (Wilcoxon signed-rank test shows a p value of 3.5 × 10⁻³¹); the average rating is raised from 3.1 to 3.9. For Question 2, the difference is even more significant, as the average rating is increased from 2.6 to 3.8 (with a p value of 3.1 × 10⁻⁴⁵), showing that the proposed method is especially beneficial for removing backing vocals from the mixture. Regarding the artifacts introduced into the separated solo vocals in Question 3, both methods achieve a rating between "neutral" and "good", and the difference is not statistically significant (with a p value of 0.46).
5.6 Ablation Studies on Non-Informative Visual Input
To further investigate how the incorporation of visual information affects the separation performance, in this section, we substitute the visual input (i.e., mouth region of the solo singer) with some non-informative content.
- Constant. We feed the visual branch with constant zero values all the time.
- White-noise. We feed the visual branch with white noise.
- Mismatch. The input of the visual branch is the mouth region video of an unrelated singer to provide misleading information about the singing activity.
- Random-scenes. In this case, we collect a singing performance video from the “Who Sang It Better” program on YouTube and randomly crop the video frames as input of the visual branch. The video consists of selfie recordings from several singers, and the cropped regions contain random scenes including microphones, other parts of the singers, or background scenes.
Figure 10 shows the separation results for the different experimental settings. The model performance always degrades below the audio-based baseline MMDenseLSTM when fed with irrelevant or misleading information, suggesting that a non-informative visual input is harmful for separation; this is likely because our training data was not augmented with such noise. It also shows that the video branch is an essential part of our model. The performance degradation caused by white noise or a mismatched singer is more noticeable than that caused by a constant input or random scenes. This may be because the model is more likely to overfit to irrelevant visual fluctuations, while a constant visual input is more easily ignored. Among these cases, the input of random scenes is the most likely to occur in real scenarios, when the singer's mouth region is not shown or is occluded in the video. Without a preprocessing method to filter out such irrelevant scenes, these would be failure cases for the proposed model. Nonetheless, in all of these circumstances the separation performance still achieves a median SDR above 5 dB in most cases, suggesting that the audio branch is dominant in the model inference. Compared with the "No-mask" results in Figure 6, this also confirms our claim in Section 5.4 that the mask layer helps to improve the model robustness, even when the visual input is less informative.
6. Discussion
Our proposed method is the first to address audiovisual separation for singing performances, and there are still many aspects to improve and many areas to explore. First, we did not build our model upon the latest state-of-the-art audio separation methods, for the reasons described in Section 3.1.1. Other techniques, such as time-domain () and transformer-based () models, or different audiovisual fusion methods, may further improve the performance. Second, in this paper we collected the Audition-RandMix data from the Internet for training and recorded the URSing dataset for evaluation. While recording audiovisual singing performances with ground-truth tracks is challenging, collecting randomly mixed data for training is much easier, since there are many solo singing performance videos on the Internet. It has been shown that using randomly mixed data is beneficial for training music separation models (), so one could potentially improve the audiovisual vocal separation results by collecting more randomly mixed data for training. Third, another promising direction is to build the audiovisual structure on top of a pre-trained audio separation model, where the audio sub-network can be pre-trained on tens of thousands of songs with audio recordings only. Fourth, as discussed in Section 5.6, there can be failure cases when the mouth regions are occluded or wrongly detected. As attention models are known to work well on multi-modal fusion problems (), the preprocessing step of cropping mouth regions could be replaced by an attention-based mechanism that learns to focus on the mouth region. Last but not least, it is worth investigating how other kinds of visual information, such as facial expressions, body gestures, and movements, could help with the analysis of the singing voice.
7. Conclusion
In this paper, we proposed an audiovisual approach to the solo singing voice separation problem that analyzes both the audio signal and the mouth movements of the solo singer in the visual signal. To evaluate the proposed method, we created the URSing dataset, the first publicly available dataset of audiovisual singing performances recorded in isolation for singing voice separation research. We also collected solo singing recordings from YouTube for training. Both objective evaluations on our prepared singing recordings and a subjective evaluation on professionally produced a cappella songs in the wild showed that the proposed method outperforms state-of-the-art audio-based methods. The advantages of the proposed method are especially pronounced when the accompaniment track contains backing vocals, which are difficult for audio-based methods to separate from the solo vocals.