Multi-scale feature based convolutional neural networks for large vocabulary speech recognition
T Fu, X Wu - … IEEE International Conference on Multimedia and …, 2017 - ieeexplore.ieee.org
T Fu, X Wu
2017 IEEE International Conference on Multimedia and Expo (ICME), 2017 • ieeexplore.ieee.org
Deep learning has brought a breakthrough in speech recognition performance. Speech recognition systems based on deep neural networks have achieved state-of-the-art performance on various speech recognition tasks. Almost all of these systems use Mel-frequency cepstral coefficients or Mel-scale log-filterbank coefficients, both of which are based on the short-time Fourier transform. Although these features are designed around the characteristics of human hearing, spectral representations based on the short-time Fourier transform still carry an inherent tradeoff between temporal and frequency resolution. In this paper, we propose a multi-scale method to mitigate this tradeoff and a model architecture that analyzes speech at multiple scales. Experiments are conducted on the TIMIT and HKUST corpora. We compare the proposed multi-scale features with traditional features across a number of configurations. Experimental results show that the proposed model architecture obtains a significant performance improvement.
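The resolution tradeoff described in the abstract can be illustrated with a minimal sketch. It is not the paper's feature pipeline or model architecture, which the abstract does not specify. It only computes short-time Fourier transform magnitudes of one signal at several window lengths with a shared hop. The function names, the window lengths (10, 25, and 50 ms), the 10 ms hop, and the 440 Hz test tone are all assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(signal, win_length, hop_length):
    """Magnitude STFT with a Hann window.

    Returns an array of shape (num_frames, win_length // 2 + 1).
    """
    window = np.hanning(win_length)
    num_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + win_length] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_scale_spectrograms(signal, sample_rate, win_ms=(10, 25, 50), hop_ms=10):
    """Compute one spectrogram per window length (hypothetical helper).

    A short window localizes events in time but has coarse frequency bins;
    a long window does the opposite. The hop is shared so that frames from
    different scales advance at the same rate.
    """
    hop_length = int(sample_rate * hop_ms / 1000)
    spectrograms = {}
    for ms in win_ms:
        win_length = int(sample_rate * ms / 1000)
        spectrograms[ms] = stft_magnitude(signal, win_length, hop_length)
    return spectrograms


# Illustrative input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

specs = multi_scale_spectrograms(signal, sample_rate)
for ms, spec in specs.items():
    win_length = int(sample_rate * ms / 1000)
    bin_width_hz = sample_rate / win_length  # frequency resolution of this scale
    print(f"window={ms} ms  shape={spec.shape}  bin width={bin_width_hz:.0f} Hz")
```

In this sketch the 10 ms window gives 100 Hz bins while the 50 ms window gives 20 Hz bins, so each scale trades temporal precision for frequency precision. A multi-scale front end would feed several such views to the network rather than committing to a single window length.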