Improve Multichannel Speech Recognition with Temporal and Spatial Information

Yu ZHANG; Pengyuan ZHANG; Qingwei ZHAO

doi:10.1587/transinf.2017EDL8268

Regular Section


Keywords: multichannel speech recognition, long short-term memory, attention mechanism, generalized cross correlation

2018 Volume E101.D Issue 7 Pages 1963-1967

Abstract

In this letter, we explore the use of spatio-temporal information in one unified framework to improve the performance of multichannel speech recognition. Generalized cross correlation (GCC) serves as spatial feature compensation, and an attention mechanism across time is embedded within long short-term memory (LSTM) neural networks. Experiments on the AMI meeting corpus show that the proposed method provides an 8.2% relative improvement in word error rate (WER) over the model trained directly on the concatenation of multiple microphone outputs.
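
As a rough illustration of the spatial feature named above, the following is a minimal sketch of a GCC computation between two microphone frames. It is not the authors' implementation: the PHAT weighting, the function name `gcc_phat`, and the `max_lag` parameter are assumptions made here for illustration, since the abstract only states that GCC is used.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """GCC with PHAT weighting (an assumed variant) for two equal-length frames.

    Returns 2 * max_lag + 1 values for lags -max_lag..max_lag (max_lag >= 1).
    A peak at a positive lag means x is delayed relative to y.
    """
    n = len(x) + len(y)                  # zero-pad so the correlation is linear, not circular
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12       # PHAT: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    # Negative lags sit at the end of the inverse FFT output; move them to the front.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

# Illustrative usage: a white-noise frame and a copy delayed by 3 samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(400)
delayed = np.concatenate((np.zeros(3), ref[:-3]))
gcc = gcc_phat(delayed, ref, max_lag=10)
peak_lag = int(np.argmax(gcc)) - 10      # index 0 corresponds to lag -10
```

In a multichannel setup, a vector like `gcc` could be computed per microphone pair and per frame and then appended to the acoustic features; how the letter actually combines them is not specified in the abstract.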
