1. Introduction
In the real world, speech is often corrupted by noise and/or reverberation. Speech enhancement aims to extract the clean speech and suppress the noise and reverberation components, and it is one of the core problems in audio signal processing. It has been reported that multi-channel speech enhancement (MCSE) tends to achieve superior performance compared with monaural speech enhancement, owing to the additional spatial information [1]. Therefore, multi-channel speech enhancement has been widely applied as a preprocessor in video conferencing systems, automatic speech recognition (ASR) systems, and smart TVs. In the past forty years, several beamforming-based [2] and blind-source-separation-based [3] methods have been developed. Deep neural networks (DNNs) are artificial neural networks (ANNs) with multiple hidden layers between the input and output layers. Owing to their strong nonlinear modeling ability, DNNs have been widely used in a variety of audio tasks, such as emotion recognition, ASR, and speech enhancement/separation. Recently, DNNs have facilitated the research in MCSE, yielding notable performance improvements over conventional statistical beamforming techniques [4,5,6,7,8,9,10,11].
Considering the success of DNNs in the single-channel speech enhancement (SCSE) area, a straightforward strategy is to extend previous SCSE models to extract spatial features, either heuristically or implicitly [4,5,6,7,8,9]. This paradigm is prone to cause nonlinear speech distortion, such as spectral black holes in low signal-to-noise ratio (SNR) scenarios, since the advantage of spatial filtering with microphone-array beamforming is not fully exploited to null the directional interference and suppress the ambient noise [10,11]. Another category follows a cascade-style regime. To be specific, in the first stage, a single-channel network is adopted to predict the mask of each acoustic channel in parallel, followed by the steering-vector estimation and the noise spatial covariance matrix (SCM) calculation. In the second stage, a traditional beamformer, such as the minimum variance distortionless response (MVDR) or generalized eigenvalue (GEV) beamformer, is adopted for spatial filtering [10,12,13,14]. These methods have shown their effectiveness in ASR, since ASR can tolerate a latency of hundreds of milliseconds. However, for many practical applications, such as speech communication, hearing aids, and hearing transparency, the latency should be much lower, e.g., no more than 20 ms [15], and the performance of these methods may degrade significantly in such low-latency systems. Moreover, the performance heavily depends on the mask estimation accuracy, which can drop considerably in complex acoustic scenarios.
As a solution, an intuitive tactic is to enforce the network to directly predict the beamforming weights, which can be done in either the time domain [16,17] or the frequency domain [11,18,19,20]. Nonetheless, according to signal theory, the desired beam pattern is required to form its main lobe toward the target direction and, meanwhile, form a null toward the interference direction, which tends to be difficult from the optimization perspective, especially in low-SNR scenarios. Moreover, slight errors in the estimated weights can lead to severe distortions of the beam pattern and, thus, degrade the performance of the algorithm.
In this paper, we design a neural filter in the beamspace domain, rather than the spatial domain, for real-time multi-channel speech enhancement. In detail, the multi-channel signals are first processed by a set of pre-defined fixed beamformers, yielding a beam set that uniformly samples various directions in space. Then, a network is utilized to learn the spectro-temporal-spatial discriminative features of the target speech and noise, with the aim of generating bin-level filtering coefficients that automatically weight the beam set. Note that, different from the previous neural beamformer literature [8,11], where the output weights are applied directly to the multi-channel input signals, here the predicted coefficients filter the noise component of each pre-generated beam and fuse the beams. We dub it a neural beamspace-domain filter, to distinguish it from the existing neural beamformers. The rationale of this network design is three-fold.
- The target signal can be pre-extracted with the fixed beamformer, and its dominant part should exist within at least one directional beam, serving as an SNR-improved prior on the target that guides the subsequent beam fusion process. An interference-dominant beam can be obtained when a beam steers toward the interference direction, providing an interference prior for better discrimination in a spatial-spectral sense. Besides, the target and interference components may coexist within each beam, while their distributions change dynamically due to their spectral differences. Therefore, the beam set can be viewed as a reasonable candidate to indicate both the spectral and the spatial characteristics.
- In addition to the beam-pattern design in the spatial domain, the proposed system can also learn the spectral characteristics of the interference components and cancel the residual noise in the spectral domain, completing the enhancement in both the spatial and the spectral domains. This yields a higher performance upper bound than a neural spatial network that only performs filtering in the spatial domain.
- From the optimization standpoint, a small error in the beamforming weights may lead to serious distortion of the beam pattern, whereas an error in the beamspace-domain weights only leaks some undesired components, which has a much smaller direct impact on the performance of the system. Therefore, the beamspace-domain filter is more robust.
As the beam set only samples the space discretely, information loss tends to arise due to the limited spatial resolution at low frequencies, which causes speech distortion. To this end, a residual branch is designed to refine the fused beam. We emphasize that, although the multi-beam concept is used in both [21] and this study, the two are very different: [21] is, in essence, a parallel single-beam enhancement process, whereas the proposed system can be regarded as a filter-and-fusion process over the multi-beam set. Experiments conducted on the DNS-Challenge corpus [22] show that the proposed neural beam filter outperforms previous state-of-the-art (SOTA) baselines.
Our main contributions are summarized as follows:
- We propose a novel multi-channel speech enhancement scheme in the beamspace domain. To the best of our knowledge, this is the first work that shows the effectiveness of a neural beamspace-domain filter for multi-channel speech enhancement.
- We introduce the residual U-Net into the convolutional encoder-decoder architecture to improve the feature representation capability. A weight estimator module is designed to predict the time-frequency bin-level filter coefficients, and a residual refinement module is designed to refine the estimated spectrum.
- We validate the superiority of the proposed framework by comparing it with state-of-the-art algorithms in both the directional-interference and the diffuse-noise scenarios. The evaluation results demonstrate the superiority and potential of the proposed method.
The remainder of the paper is organized as follows. We describe the proposed neural beam filter in Section 2. The experimental settings and results are given in Section 3 and Section 4, respectively. Finally, we draw some conclusions in Section 5.
2. Materials and Methods
The aim of this work is to develop a real-time multi-channel speech enhancement system that extracts the clean speech and suppresses the noise and reverberant components. The noisy mixtures are recorded by the microphones of an array, and the spectra of these signals are used as the inputs of the proposed system. The system comprises three modules, namely the fixed beamforming module (FBM), the beam filtering module (BFM), and the residual refinement module (RRM). The enhanced speech is then obtained and transmitted to the telecommunication circuit and/or the speech recognition system. The proposed system is presented in Figure 1.
2.1. Signal Model
Considering an $M$-channel microphone array placed in noisy-reverberant environments, the signal received at the $m$-th microphone can be represented by:

$$y_m(n) = x_m(n) + r_m(n) + v_m(n), \quad m = 0, 1, \ldots, M-1,$$

where $y_m(n)$ is the noisy mixture, $x_m(n)$ is the direct-path signal of the speech source together with its early reflections, $r_m(n)$ is the late reverberant speech, and $v_m(n)$ is the received noise.
By the $N$-point short-time Fourier transform (STFT), the physical model in the time-frequency domain can be expressed as:

$$Y_m(t, f) = X_m(t, f) + R_m(t, f) + V_m(t, f),$$

where $Y_m(t, f)$, $X_m(t, f)$, $R_m(t, f)$, and $V_m(t, f)$ are the STFTs of $y_m(n)$, $x_m(n)$, $r_m(n)$, and $v_m(n)$, respectively; $t$ refers to the index of frames and $f$ to the index of frequency bins. Considering the conjugate symmetry of the spectrum in frequency, $f \in \{0, 1, \ldots, N/2\}$ is chosen throughout this paper. $X_m(t, f)$ corresponds to the direct-path signal of the speech source and its early reflections, and $R_m(t, f)$ to the late reverberant speech.
In this paper, the aim of the proposed algorithm is to extract the direct-path plus early-reflected components $X_0(t, f)$ from the multi-channel input signals $\{Y_m(t, f)\}_{m=0}^{M-1}$, assuming that the 0-th microphone is chosen as the reference microphone and defining the reflections within the first 100 ms after the direct sound as the early reverberation. From now on, we will omit the indices $(t, f)$ when no confusion arises. The above process can be formulated as:

$$\tilde{X}_0 = \mathcal{F}\big(Y_0, Y_1, \ldots, Y_{M-1}; \Phi\big),$$

where $\mathcal{F}(\cdot)$ denotes the mapping function of the proposed system and $\Phi$ is its parameter set.
After transforming by inverse STFT (iSTFT), the enhanced time-domain signal can be reconstructed by the overlap-add (OLA) method.
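To make the analysis-synthesis pipeline concrete, a minimal Python sketch is given below. The FFT size, hop size, and window are illustrative assumptions and not the configuration specified in this paper; torch.stft/torch.istft are used as stand-ins for the STFT and the iSTFT with OLA reconstruction.

```python
import torch

# Illustrative STFT configuration (assumed, not taken from the paper).
N_FFT, HOP, WIN = 320, 160, 320
window = torch.hann_window(WIN)

def analysis(y):
    """y: (M, samples) multi-channel waveform -> (M, F, T) complex STFT Y_m(t, f)."""
    return torch.stft(y, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                      window=window, center=True, return_complex=True)

def synthesis(X, length):
    """X: (F, T) complex spectrum of the enhanced speech -> time-domain waveform (OLA)."""
    return torch.istft(X, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                       window=window, center=True, length=length)

# Round trip on a random 9-channel mixture (1 s at 16 kHz).
y = torch.randn(9, 16000)
Y = analysis(y)                        # shape: (9, N_FFT // 2 + 1, T)
x_hat = synthesis(Y[0], y.shape[-1])   # reconstruct the reference channel
```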
2.2. Forward Stream
Figure 1 shows the overall diagram of the proposed architecture, which consists of three components, namely fixed beamforming module (FBM), beam filtering module (BFM), and residual refinement module (RRM).
In FBM, the fixed beamformer is employed to sample the space uniformly and obtain multiple beams steering toward different directions. The beam set is denoted by $\{B_d(t, f)\}_{d=1}^{D}$, where $D$ denotes the number of resultant beams. The process is thus given by:

$$B_d(t, f) = \mathcal{B}_d\big(Y_0(t, f), \ldots, Y_{M-1}(t, f)\big),$$

where $\mathcal{B}_d(\cdot)$ denotes the $d$-th fixed beamformer, detailed in Section 2.3. We concatenate the beam set along the channel dimension, serving as the input of BFM, denoted by $\mathbf{B} \in \mathbb{R}^{2D \times T \times F}$; here, 2 means that both the real and imaginary (RI) parts are considered. As the multi-beam representation captures both spectral and spatial characteristics, BFM is adopted to learn the spectro-temporal-spatial discriminative information between speech and interference and to assign the filter weights $\{W_d(t, f)\}_{d=1}^{D}$ to each beam. It is worth noting that, as the beam set samples the space only discretely, information loss tends to arise due to the limited spatial resolution. To alleviate this problem, the complex spectrum of the reference channel is also incorporated into the input and, similar to [23], a complex residual $S_r(t, f)$ is estimated with RRM, which aims to compensate for the inherent information loss of the filtered spectrum. This process can be presented as:

$$\{W_d(t, f)\}_{d=1}^{D} = \mathrm{BFM}(\mathbf{B}, Y_0), \qquad S_r(t, f) = \mathrm{RRM}(\mathbf{B}, Y_0),$$

where $\mathrm{BFM}(\cdot)$ and $\mathrm{RRM}(\cdot)$ denote the mapping functions of the two modules. By applying the estimated weights $\{W_d(t, f)\}$ to filter the beams $\{B_d(t, f)\}$ and then summing them along the channel axis, the fused beam $B_F(t, f)$ can be obtained by:

$$B_F(t, f) = \sum_{d=1}^{D} W_d(t, f) \times B_d(t, f),$$

where $\times$ denotes the complex-valued multiplication operator. We then add the filtered beam and the estimated complex residual together, to obtain the final output $\tilde{X}_0(t, f)$, i.e.,

$$\tilde{X}_0(t, f) = B_F(t, f) + S_r(t, f).$$
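The beam filtering and fusion described above can be summarized by the short sketch below; the tensor shapes are assumptions for illustration, and the random tensors stand in for the FBM beams and the BFM/RRM outputs.

```python
import torch

D, T, F = 10, 100, 161  # assumed: 10 beams, 100 frames, 161 frequency bins

B = torch.randn(D, T, F, dtype=torch.complex64)   # beams B_d(t, f) from the FBM
W = torch.randn(D, T, F, dtype=torch.complex64)   # T-F bin-level weights from the BFM
S_r = torch.randn(T, F, dtype=torch.complex64)    # complex residual from the RRM

B_F = (W * B).sum(dim=0)   # complex weighting of each beam, then summation over d
X_tilde = B_F + S_r        # final estimate of the reference-channel speech spectrum
```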
2.3. Fixed Beamforming Module
In this module, the fixed beamformer is leveraged to transform the input multi-channel mixtures into several beams, which steer toward different looking directions and uniformly sample the space. As the fixed beamformer is data-independent, it is robust in adverse environments and has low computational complexity. Moreover, filtering the multi-channel mixtures with fixed beamformers makes our system less sensitive to the array geometry. In this paper, we choose the super-directivity (SD) beamformer as the default beamformer, owing to its high directivity [24]. Note that other fixed beamformers can also be adopted, which is out of the scope of this paper. Assuming the target directional angle is $\theta_d$, the weights of the SD beamformer can be calculated as:

$$\mathbf{w}_d(f) = \frac{\boldsymbol{\Gamma}_{\epsilon}^{-1}(f)\,\mathbf{a}(\theta_d, f)}{\mathbf{a}^{H}(\theta_d, f)\,\boldsymbol{\Gamma}_{\epsilon}^{-1}(f)\,\mathbf{a}(\theta_d, f)},$$

where $\mathbf{a}(\theta_d, f)$ is the steering vector toward $\theta_d$, $(\cdot)^{H}$ is the conjugate (Hermitian) transpose operator, and $\boldsymbol{\Gamma}_{\epsilon}(f) = \boldsymbol{\Gamma}(f) + \epsilon\mathbf{I}$ denotes the covariance matrix of a diffuse noise field with diagonal loading $\epsilon\mathbf{I}$ to control the white noise gain. Note that the diagonal-loading level often needs to be chosen carefully, to strike a good balance between the white noise gain and the array gain [25]. In this paper, the diagonal-loading level is fixed to a constant, and its impact on performance will be studied in the near future. The $(i, j)$-th element of $\boldsymbol{\Gamma}(f)$ represents the coherence between the signals received by the two microphones with indices $i$ and $j$ in an isotropic diffuse field, which can be formulated as:

$$\Gamma_{ij}(f) = \mathrm{sinc}\!\left(\frac{2\pi f f_s d_{ij}}{N c}\right),$$

where $\mathrm{sinc}(x) = \sin(x)/x$, $d_{ij}$ is the distance between the $i$-th and $j$-th microphones, $c$ is the speed of sound, and $f_s$ is the sampling rate. Defining $\mathbf{Y}(t, f) = [Y_0(t, f), Y_1(t, f), \ldots, Y_{M-1}(t, f)]^{T}$, the output of the $d$-th SD beamformer can be expressed as:

$$B_d(t, f) = \mathbf{w}_d^{H}(f)\,\mathbf{Y}(t, f).$$
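A minimal NumPy sketch of this SD beamformer is given below. The far-field steering-vector model, the array/FFT parameters, and the diagonal-loading value are illustrative assumptions rather than the exact configuration of the paper.

```python
import numpy as np

def sd_weights(theta, mic_pos, n_fft=320, fs=16000, c=343.0, loading=1e-5):
    """Super-directive weights w_d(f) for one look direction.
    theta: look direction [rad]; mic_pos: (M,) microphone positions along the array [m].
    Returns an array of shape (F, M), F = n_fft // 2 + 1."""
    M = len(mic_pos)
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft           # physical frequencies [Hz]
    d_ij = np.abs(mic_pos[:, None] - mic_pos[None, :])       # pairwise distances (M, M)
    w = np.zeros((len(freqs), M), dtype=np.complex128)
    for k, f in enumerate(freqs):
        gamma = np.sinc(2.0 * f * d_ij / c)                  # diffuse-field coherence matrix
        gamma = gamma + loading * np.eye(M)                  # diagonal loading
        tau = mic_pos * np.cos(theta) / c                    # far-field propagation delays
        a = np.exp(-2j * np.pi * f * tau)                    # steering vector a(theta, f)
        gi_a = np.linalg.solve(gamma, a)                     # Gamma^{-1} a
        w[k] = gi_a / (a.conj() @ gi_a)                      # Gamma^{-1} a / (a^H Gamma^{-1} a)
    return w

def apply_beamformer(w, Y):
    """B_d(t, f) = w_d^H(f) Y(t, f): Y has shape (M, F, T), output has shape (F, T)."""
    return np.einsum('fm,mft->ft', w.conj(), Y)

# Example: a 9-microphone ULA with 4 cm spacing, beam steered broadside.
mics = np.arange(9) * 0.04
w_d = sd_weights(theta=np.pi / 2, mic_pos=mics)
```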
2.4. Beam Filter Module
As shown in Figure 1, the beam filter module (BFM) consists of a causal convolutional encoder-decoder (CED) architecture and a weight estimator (WE). The encoder comprises six gated-linear-unit residual U-Net (GLU-RSU) blocks, which consecutively halve the feature size and extract high-level features, as described in Section 2.4.2. The decoder is the mirror version of the encoder, except that all convolution operations are replaced by their deconvolutional counterparts (dubbed DeconvGLU). Similar to [26], a stack of squeezed temporal convolutional networks (S-TCNs) is inserted as the bottleneck of the CED, to model the temporal correlations among adjacent frames. After that, the weight estimator simulates the filter generation process, where T-F bin-level filter coefficients are assigned to each beam. To be specific, the output embedding tensor of the decoder is first normalized by layer normalization (LN), and then an LSTM is employed to update the feature frame by frame, with ReLU serving as the intermediate nonlinear activation function. The weights $\{W_d(t, f)\}$ are obtained after the output linear layer and are then applied to each beam to obtain the target beam.
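The weight estimator can be sketched as follows; the paper specifies only the LN, LSTM, ReLU, and linear-layer ordering, so the feature sizes, LSTM width, and reshaping below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightEstimator(nn.Module):
    """Maps the decoder output to T-F bin-level complex weights for D beams."""
    def __init__(self, in_ch=2, n_freq=161, n_beams=10, hidden=256):
        super().__init__()
        feat = in_ch * n_freq
        self.norm = nn.LayerNorm(feat)
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)   # frame-by-frame update
        self.act = nn.ReLU()
        self.proj = nn.Linear(hidden, 2 * n_beams * n_freq)   # RI parts of D weights per bin
        self.n_beams, self.n_freq = n_beams, n_freq

    def forward(self, x):                                     # x: (B, C, T, F) decoder embedding
        b, c, t, f = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h = self.act(self.lstm(self.norm(h))[0])
        w = self.proj(h).view(b, t, 2, self.n_beams, self.n_freq)
        return torch.complex(w[:, :, 0], w[:, :, 1]).permute(0, 2, 1, 3)  # (B, D, T, F)

# Usage: weights = WeightEstimator()(decoder_out), with decoder_out of shape (B, 2, T, 161).
```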
2.4.1. CED Architecture
The convolutional encoder-decoder architecture is widely used in speech enhancement [27]. It consists of a convolutional encoder followed by a corresponding decoder: the encoder is a stack of convolutional layers, and the decoder is a stack of deconvolutional layers in the reverse order. A convolutional layer uses a filter, namely a kernel, to extract local patterns from the low-level input features and map them to high-level embeddings; it is widely used in computer vision [28], natural language processing [29], and acoustic signal processing [30,31]. A deconvolutional layer is a special convolutional layer that maps low-resolution features back to features with the input feature size. The symmetric CED structure ensures that the output has the same shape as the input, which is naturally suitable for the speech enhancement task.
2.4.2. GLU-RSU Block
The GLU-RSU block consists of a convolutional gated linear unit (ConvGLU) [32], batch normalization (BN), a parametric ReLU (PReLU), and a residual U-Net (RSU) [33], as shown in Figure 2.
First, the input feature $\mathbf{X}$ is passed through a ConvGLU, which obtains a better modeling capacity than a plain convolutional layer, owing to the learnable dynamic feature selection performed by its gating mechanism; this can be expressed as:

$$\mathbf{U} = (\mathbf{W}_1 * \mathbf{X} + \mathbf{b}_1) \odot \sigma(\mathbf{W}_2 * \mathbf{X} + \mathbf{b}_2),$$

where $*$ is the convolution operator, $\odot$ is the Hadamard product operator, $\mathbf{W}_1$ and $\mathbf{W}_2$ are the weights of the two convolutional layers, $\mathbf{b}_1$ and $\mathbf{b}_2$ are their biases, and $\sigma(\cdot)$ is the sigmoid function.
Then, the U-Net in the RSU is used to recalibrate the feature distribution, by modeling the spectral features at different scales and extracting intra-beam time-frequency discrimination through successive downsampling. Finally, a residual connection is utilized to mitigate the gradient-vanishing problem. The above process can be formulated as:

$$\mathbf{O} = \tilde{\mathbf{U}} + \mathcal{U}(\tilde{\mathbf{U}}),$$

where $\tilde{\mathbf{U}}$ denotes the ConvGLU output after BN and PReLU, $\mathcal{U}(\cdot)$ denotes the U-Net inside the RSU, and $\mathbf{O}$ is the output of the GLU-RSU block.
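A simplified PyTorch sketch of the GLU-RSU block is shown below; kernel sizes, strides, channel counts, and the depth of the inner U-Net are illustrative assumptions, and the inner U-Net is reduced to a single down/up level for brevity.

```python
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Gated 2-D convolution: (W1 * X + b1) ⊙ sigmoid(W2 * X + b2)."""
    def __init__(self, in_ch, out_ch, kernel=(2, 3), stride=(1, 2)):
        super().__init__()
        self.main = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=(1, 1))
        self.gate = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=(1, 1))

    def forward(self, x):                            # x: (B, C, T, F)
        t = x.shape[2]
        a = self.main(x)[:, :, :t]                   # drop the look-ahead frame -> causal
        g = self.gate(x)[:, :, :t]
        return a * torch.sigmoid(g)

class TinyRSU(nn.Module):
    """Reduced residual U-Net: one down/up level plus an identity shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, (1, 3), (1, 2), padding=(0, 1))
        self.mid = nn.Conv2d(ch, ch, (1, 3), (1, 1), padding=(0, 1))
        self.up = nn.ConvTranspose2d(ch, ch, (1, 3), (1, 2),
                                     padding=(0, 1), output_padding=(0, 1))
        self.act = nn.PReLU()

    def forward(self, x):
        u = self.act(self.down(x))                   # downsample along frequency
        u = self.act(self.mid(u))
        u = self.up(u)[..., :x.shape[-1]]            # upsample and crop to match
        return x + u                                 # residual connection

class GLURSUBlock(nn.Module):
    """ConvGLU -> BN -> PReLU -> RSU, as one encoder block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.glu = ConvGLU(in_ch, out_ch)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()
        self.rsu = TinyRSU(out_ch)

    def forward(self, x):
        return self.rsu(self.act(self.bn(self.glu(x))))

# Example: one block halving the frequency dimension of a 22-channel input
# (e.g., RI parts of 10 beams plus the reference spectrum).
feat = torch.randn(2, 22, 50, 161)
out = GLURSUBlock(22, 64)(feat)                      # -> (2, 64, 50, 81)
```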
2.4.3. Squeezed Temporal Convolutional Network
The TCN is used to effectively capture the temporal dependence of speech. Compared with a recurrent neural network (RNN), a TCN can perform inference in parallel and achieve better performance by utilizing 1-D dilated convolutions. The S-TCN is a lightweight TCN that consists of several squeezed temporal convolutional modules (S-TCMs). From Figure 3, one can see that an S-TCM includes an input point convolution, a gated depth-wise dilated convolution (GDD-Conv), and an output point convolution, where the input and output point convolutions squeeze and restore the feature dimension, respectively. The GDD-Conv differs from the depth-wise dilated convolution in a traditional TCM in several respects. First, the dilated causal convolution (DC-Conv) in GDD-Conv uses fewer channels, which can still represent the information effectively, owing to the time-frequency sparsity of the speech spectrum. Moreover, GDD-Conv introduces a gating branch to facilitate the information flow in the gradient back-propagation process. The gating branch utilizes the sigmoid activation function to map the output of DC-Conv to (0, 1), which modulates the feature distribution of the main branch. Note that PReLU and normalization layers are inserted between adjacent convolutional layers, to facilitate network convergence.
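A condensed PyTorch sketch of one S-TCM is given below; the channel sizes, kernel length, dilation, and the residual connection around the module are assumptions that follow common TCN practice rather than the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class STCM(nn.Module):
    """Squeezed TCM: point conv -> gated depth-wise dilated causal conv -> point conv."""
    def __init__(self, channels=256, squeezed=64, kernel=5, dilation=2):
        super().__init__()
        self.pad = (kernel - 1) * dilation                      # left padding -> causal
        self.in_pc = nn.Conv1d(channels, squeezed, 1)           # squeeze the feature dim
        self.norm1 = nn.BatchNorm1d(squeezed)
        self.act1 = nn.PReLU()
        self.dd_main = nn.Conv1d(squeezed, squeezed, kernel,
                                 dilation=dilation, groups=squeezed)
        self.dd_gate = nn.Conv1d(squeezed, squeezed, kernel,
                                 dilation=dilation, groups=squeezed)
        self.norm2 = nn.BatchNorm1d(squeezed)
        self.act2 = nn.PReLU()
        self.out_pc = nn.Conv1d(squeezed, channels, 1)          # restore the feature dim

    def forward(self, x):                                       # x: (B, C, T)
        h = self.act1(self.norm1(self.in_pc(x)))
        h = nn.functional.pad(h, (self.pad, 0))                 # pad past frames only
        h = self.dd_main(h) * torch.sigmoid(self.dd_gate(h))    # gated DC-Conv
        h = self.act2(self.norm2(h))
        return x + self.out_pc(h)                               # residual over the module
```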
2.5. Residual Refinement Module
Since the SD beamformer tends to amplify white noise to maintain the array gain at low frequencies, the weighted beam output of BFM often contains considerable residual noise, which needs to be further suppressed to improve speech quality. Meanwhile, speech distortion is often introduced by the mismatch between the predefined main-beam direction and the true direction of the target speech, because the number of fixed beamformers is limited. To refine the target beam, a residual refinement module (RRM) is proposed, which comprises a decoder similar to that of BFM and a residual block (ResBlock) containing three residual convolution modules, as shown in Figure 1. The output of the S-TCN serves as the input feature of RRM. After decoding with the stacked decoder blocks, the output tensor is concatenated with the original complex spectrum of the reference microphone $Y_0$ and fed to a point convolution that squeezes the feature dimension to 16. Then, a series of residual convolution modules is applied, each comprising a plain convolutional layer, BN, PReLU, and an identity shortcut connection. Finally, the complex residual spectrum $S_r$ is derived by an output 1 × 1 convolution, which reduces the channel dimension to 2, and is applied to refine the filtered beam output $B_F$.
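The ResBlock part of RRM can be sketched as follows; the 16-channel squeeze, the three residual convolution modules, and the two-channel RI output follow the text, whereas the kernel size, stride, and input channel count are assumptions.

```python
import torch
import torch.nn as nn

class ResConvModule(nn.Module):
    """Plain conv -> BN -> PReLU with an identity shortcut connection."""
    def __init__(self, ch=16, kernel=(1, 3)):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel, stride=1, padding=(0, 1))
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.bn(self.conv(x)))

class ResBlock(nn.Module):
    """Decoder output + reference spectrum -> complex residual spectrum (RI as 2 channels)."""
    def __init__(self, in_ch=4):                               # assumed: 2 decoder + 2 reference channels
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, 16, 1)                 # point conv to 16 channels
        self.body = nn.Sequential(*[ResConvModule(16) for _ in range(3)])
        self.out = nn.Conv2d(16, 2, 1)                         # 1x1 conv -> RI residual

    def forward(self, dec_out, y0_ri):                         # (B, C, T, F), (B, 2, T, F)
        h = self.squeeze(torch.cat([dec_out, y0_ri], dim=1))
        return self.out(self.body(h))                          # (B, 2, T, F)
```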
2.6. Loss Function
In this paper, a two-stage training method is used. First, we train BFM with a magnitude-regularized complex-spectrum loss, which is defined as:

$$\mathcal{L}_{\mathrm{BFM}} = \big\| B_F - X_0 \big\|_F^2 + \lambda \,\big\| |B_F| - |X_0| \big\|_F^2,$$

where $\lambda$ is the regularization factor and is empirically set to 0.5. The first term and the second term in the loss function are, respectively, the complex-spectrum mean squared error (MSE) loss and the magnitude-spectrum MSE loss.

Then, we freeze the parameters of BFM when training RRM, and the same form of loss is utilized:

$$\mathcal{L}_{\mathrm{RRM}} = \big\| \tilde{X}_0 - X_0 \big\|_F^2 + \lambda \,\big\| |\tilde{X}_0| - |X_0| \big\|_F^2.$$

Note that both the estimates and the targets are power-compressed, to improve the speech enhancement performance. The power compression can be expressed as:

$$X^{c} = |X|^{\beta} e^{j\theta_X},$$

where the compression factor $\beta$ is set to 0.5 [34].
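The two losses and the power compression can be sketched as follows; λ = 0.5 and β = 0.5 follow the text, while the exact weighting of the two terms and the mean reduction are assumptions.

```python
import torch

def power_compress(x, beta=0.5):
    """Compress the magnitude of a complex spectrum while keeping its phase."""
    return torch.polar(x.abs() ** beta, x.angle())

def spectrum_loss(est, target, lam=0.5, beta=0.5):
    """Complex-spectrum MSE plus lambda-weighted magnitude-spectrum MSE,
    computed on power-compressed spectra (mean reduction assumed)."""
    est_c, tgt_c = power_compress(est, beta), power_compress(target, beta)
    complex_mse = (est_c - tgt_c).abs().pow(2).mean()
    mag_mse = (est_c.abs() - tgt_c.abs()).pow(2).mean()
    return complex_mse + lam * mag_mse

# Stage 1: loss = spectrum_loss(B_F, X0); stage 2 (BFM frozen): loss = spectrum_loss(X_tilde, X0).
```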
2.7. Datasets
In this paper, we construct two datasets for the performance evaluation: one for the directional-interference situation and one for the diffuse-noise situation. The DNS-Challenge corpus (https://github.com/microsoft/DNS-Challenge (accessed on 22 April 2022)) [22] is selected and convolved with multi-channel room impulse responses (RIRs), which represent the transfer functions between the sound source and the microphones of the array, to generate multi-channel pairs for the experiments. To be specific, the clean clips are randomly sampled from the neutral clean speech set [22], which includes about 562 h of speech from 11,350 speakers. We split it into two non-overlapping parts, for training and testing. The noise clips in the DNS-Challenge corpus are selected as the directional interference, and the utterances in the TIMIT corpus [35] are used to construct a diffuse babble noise field.
For the directional-interference situation, around 20,000 noise types in the DNS-Challenge corpus are selected as interference sources in the training phase, with a total duration of about 55 h [23,26]. For testing, three types of unseen noise are chosen, namely babble and factory1 noise taken from NOISEX92 [36] and cafe noise taken from CHiME3 [37]. We generate the RIRs with the image method [38] for a uniform linear array with 9 microphones, where the distance between two adjacent microphones is around 4 cm. The room size is sampled from 3 × 3 × 2.5 m³ to 10 × 10 × 3 m³, and the reverberation time RT60 ranges from 0.05 s to 0.7 s. The source direction is sampled randomly, and the distance between the source and the array center ranges from 0.5 m to 3.0 m. The signal-to-noise ratio (SNR) ranges from −6 dB to 6 dB.
For the diffuse babble noise situation, we select the utterances of 480 speakers from the TIMIT corpus for training and validation, while the utterances of the remaining speakers are used to generate the test diffuse noise. In total, 72 different speakers are randomly selected and assigned to 72 directions, to simulate the diffuse babble noise. The SNR ranges from −6 dB to 6 dB with a 1 dB interval. The settings of the room size, RT60, and speaker location are the same as above.
In total, about 80,000 and 4000 multi-channel noisy and reverberant mixtures are generated for training and validation, respectively, in each situation. For the testing set, 150 pairs are generated for each SNR case. Note that the speakers and the room sizes used for testing are unseen in both the training and validation sets.
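The mixture creation at a given SNR can be sketched as follows; this is a single-channel illustration, whereas the actual pipeline mixes multi-channel reverberant speech and noise obtained by convolving the sources with the simulated RIRs.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-8):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + eps
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: draw an SNR uniformly from [-6, 6] dB, as in the training setup.
rng = np.random.default_rng(0)
speech, noise = rng.standard_normal(16000), rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, rng.uniform(-6.0, 6.0))
```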
4. Results and Discussion
We choose the perceptual evaluation of speech quality (PESQ) [41] and the extended short-time objective intelligibility (ESTOI) [42] as objective metrics to compare the performance of the different models. The PESQ score evaluates the speech quality of the enhanced utterance and is computed from the clean and the enhanced speech; its value ranges from −0.5 to 4.5, and a higher PESQ score indicates better perceptual quality. The ESTOI score evaluates speech intelligibility; the higher the ESTOI score, the better the intelligibility.
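Both metrics are available in open-source Python packages; the sketch below assumes the third-party pesq and pystoi packages (not necessarily the implementations used in the paper), 16 kHz signals, and wide-band PESQ.

```python
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def evaluate(clean, enhanced, fs=16000):
    """clean / enhanced: 1-D float arrays of time-domain speech sampled at 16 kHz.
    Returns (wide-band PESQ, ESTOI)."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')
    estoi_score = stoi(clean, enhanced, fs, extended=True)
    return pesq_score, estoi_score
```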
4.1. Results Comparison in the Directional Interference Case
The objective results of the different SE systems are shown in Table 2, Table 3, and Table 4. For comparison, the number of beams D is set to 10, which means that the spatial sampling resolution is 20°. We evaluate these systems in terms of PESQ and ESTOI.
From these tables, several observations can be made. First, compared with SCSE-extension-based multi-channel speech enhancement approaches, such as MC-ConvTasNet, end-to-end neural spatial filters, such as FaSNet-TAC and MIMO-UNet, yield notable and consistent performance improvements, thanks to the linear filtering of the multi-channel signals, which can reduce speech distortion. For example, compared with MC-ConvTasNet under the cafe noise, FaSNet-TAC achieves improvements of 0.21 in PESQ and 2.83% in ESTOI, and MIMO-UNet achieves improvements of 0.20 in PESQ and 4.96% in ESTOI. Second, the proposed system outperforms the neural beamforming-based approaches by a large margin in all cases. For example, compared with FaSNet-TAC, our system achieves improvements of 0.61 in PESQ and 13.63% in ESTOI for the cafe noise, respectively; moreover, it outperforms MIMO-UNet by 0.62 in PESQ and 11.50% in ESTOI. This demonstrates the superiority of filtering the beams over the best time- and frequency-domain neural spatial filters. This is because designing weights for beams is easier to optimize than approximating the desired beam pattern. Moreover, the speech- and noise-dominant beams help the network learn their discriminative features. Finally, the noise-dominant beams capture the noise characteristics, enabling better cancellation of the residual noise in the speech-dominant beams.
Figure 4 shows the spectrograms of the speech corrupted by the cafe interference and of the corresponding processed utterances. One can see that the proposed method yields better noise suppression and less speech distortion than the baselines. In particular, the proposed algorithm recovers the harmonic structure well and introduces less speech distortion in the high-frequency (around 4 kHz to 8 kHz), low-SNR bins, since the fixed beamformer, which has good spatial resolution at high frequencies, can better suppress the high-frequency interference.
4.2. Results Comparison in the Diffuse Babble Noise Case
The evaluation results of the different multi-channel speech enhancement models in the diffuse babble noise scenario are shown in Table 5.
It can be seen that the trend of the model performance is similar to that in the directional-interference scenario. The neural spatial filters, such as FaSNet-TAC and MIMO-UNet, are consistently superior to MC-ConvTasNet. The proposed algorithm significantly outperforms all baseline systems in the PESQ and ESTOI metrics at every SNR. For example, going from FaSNet-TAC to the proposed system, average improvements of 0.63 and 14.70% are achieved in terms of PESQ and ESTOI, respectively. Moreover, it improves over the MIMO-UNet baseline by 0.42 in PESQ and 9.28% in ESTOI, on average. This demonstrates the superiority of the proposed system over the best neural spatial filters in the diffuse babble noise case.
4.3. Ablation Analysis
We also validate the roles of FBM, BFM, and RRM. Table 6 shows the average results over the three directional interferences in each case.
To analyze the effectiveness of FBM, we evaluate another two candidates for D, namely 7 (30°) and 19 (10°); D = 7 (30°) means that the space is sampled every 30° and each main beam width is about 30°, and D = 19 (10°) is defined analogously. It can be seen that the performance of the neural beam filter gradually improves with the increase in D, which reveals the importance of FBM. However, the relative performance improvement shrinks as the spatial sampling interval becomes progressively smaller, even though a mismatch between the beam pointing and the source direction remains, which indicates that the proposed model is robust to direction mismatch, whereas a spatial filter is more sensitive to direction estimation errors.
To show the effectiveness of BFM, we visualize the norm of the estimated complex weights in Figure 5. The input signals are a mixture of a speech source and a Factory1 noise source radiating from two different directions. We can see that larger weights are assigned to the beams steering toward the neighborhood of the target direction, while the beams steering toward other directions, including the interference direction, are given small weights during speech activity, and all weights are small in the non-speech segments.
In addition, the proposed system with RRM achieves a PESQ improvement of 0.13 and an ESTOI improvement of 2.09%. Comparing the visualization results of the model with and without RRM in Figure 6, one can see that the residual noise components are further suppressed at low frequencies and some missing speech components are recovered, which confirms the effectiveness of RRM in the proposed system.
Finally, we find that using the RSU after the (De)ConvGLU achieves significant performance improvements over using the (De)ConvGLU only, with average improvements of 0.21 in PESQ and 3.53% in ESTOI in the cafe-interference scenario, demonstrating that the U-Net can extract a stronger discriminative feature characterization by modeling the multi-beam information at different scales.
5. Conclusions
Speech signals are often distorted by background noise and reverberation in daily listening environments. Such distortions severely degrade speech intelligibility and quality for human listeners and make automatic speech recognition more difficult. In this paper, we propose a causal neural beamspace-domain filter for real-time multi-channel speech enhancement, to recover the clean speech from the noisy mixtures received by a microphone array. It comprises three components, namely FBM, BFM, and RRM. First, FBM is adopted to separate the sources coming from different directions. Then, BFM estimates the filter weights by jointly learning the spectro-temporal-spatial discriminability of speech and interference. Finally, RRM is adopted to refine the weighted beam output.
From the experimental results, we have the following conclusions:
- The proposed system achieves better speech quality and intelligibility than previous SOTA approaches in the directional interference case.
- In the diffuse babble noise scenario, our method also achieves better performance than previous systems.
- From the spectrograms of BFM and RRM, one can see that RRM helps to recover the missing components in the output of BFM.
- From the ablation study, the RSU is able to learn stronger discriminative features and improve the performance.
Video conferencing has become a crucial part of our daily social interactions due to COVID-19. The proposed method can be used to suppress noise and reverberation during a video conference, improving speech quality and intelligibility. Moreover, it can also be applied to human-machine interaction systems and mobile communication devices.
Future work could concentrate on designing an MCSE system for the multiple-speaker scenario based on the proposed method. Moreover, a more effective feature extraction module for BFM can be explored.