Distant, Multichannel Speech Recognition Using Microphone Array Coding and Cloud-Based Beamforming with a Self-Attention Channel Combinator

D Sharma, D Jones, S Kruchinin, R Gong, PA Naylor
2023 57th Asilomar Conference on Signals, Systems, and Computers, 2023 - ieeexplore.ieee.org
Distant Automatic Speech Recognition (ASR) holds the promise of a more natural human-machine interface, and using multiple microphones to acquire speech in such environments often leads to better ASR accuracy. The benefits come from encoding spatial information, which can be used to enhance the speech and estimate the direction of sound arrival. Current ASR systems are based on end-to-end models that require considerable computational resources and are typically deployed in the cloud, which requires the use of a CODEC to help reduce the transmission bandwidth. We present a multichannel speech coding scheme specifically adapted for microphone array signals; unlike typical speech codecs, this scheme preserves the phase relationships of the signals so that the spatial information can be exploited in the cloud. We explore the use of a frequency-domain relative transfer function estimator as part of the CODEC. We also explore the use of a modified discrete cosine transform based Self-Attention Channel Combinator (SACC) front-end for ASR and show that the time-domain signal after SACC processing leads to significant improvements in C50. Furthermore, we show that preprocessing the array signals with a de-reverberation method leads to a lower word error rate (WER) and more accurate direction-of-arrival (DOA) estimation.