
Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1816

Comparative Analysis of State-of-the-Art Speech Recognition Models for Low-Resource Marathi Language

Suhas Waghmare¹; Chirag Brahme²; Siddhi Panchal³; Numaan Sayed⁴; Mohit Goud⁵
¹Guide; ²,³,⁴,⁵BEAIDS; New Horizon Institute of Technology and Management, University of Mumbai, India

Abstract: In this research, we present a comparative analysis of two state-of-the-art speech recognition models, Whisper by OpenAI and XLSR Wave2vec by Facebook, applied to the low-resource Marathi language. Leveraging the Common Voice 16 dataset, we evaluated the performance of these models using the word error rate (WER) metric. Our findings reveal that the Whisper (Small) model achieved a WER of 45%, while the XLSR Wave2vec model obtained a WER of 71%. This study sheds light on the capabilities and limitations of current speech recognition technologies for low-resource languages and provides valuable insights for further research and development in this domain.

Keywords: Speech Recognition, State-of-the-Art Models, Whisper, XLSR Wave2vec, Marathi Language, Low-Resource.

I. INTRODUCTION

In recent years, speech recognition technology has made remarkable progress, primarily due to the development of sophisticated deep learning models. Among these models, Whisper by OpenAI and XLSR Wave2vec by Facebook have demonstrated impressive capabilities in transcribing speech into text across various languages and domains. Although these models have shown high performance in general, their effectiveness in low-resource language settings remains a subject of ongoing research and scrutiny. To address this research gap, this study aims to provide a detailed comparative analysis of the performance of two state-of-the-art speech recognition models, Whisper and XLSR Wave2vec, specifically applied to the Marathi language. Marathi is a low-resource language spoken by millions in India, and it presents unique challenges for speech recognition due to limited available data and linguistic variations. To assess the performance of these models in the Marathi language domain, we will use the Common Voice 16 dataset, which is a valuable resource for training and testing speech recognition systems.

We will use the word error rate (WER) as the chosen metric for performance assessment, which is a widely used measure that quantifies the accuracy of transcription by comparing predicted text with ground truth transcripts. Through this investigation, we aim to elucidate the strengths and limitations of the Whisper and XLSR Wave2vec models in the context of Marathi speech recognition. Our analysis will include a detailed evaluation of the models' performance in terms of accuracy, robustness, and efficiency. We will also examine the impact of different factors, such as data size, model architecture, and training strategies, on their performance. By shedding light on the comparative performance of these models, we seek to provide insights that can inform further advancements in speech recognition technology, particularly for low-resource languages like Marathi. Ultimately, such insights are crucial for enabling the development of more inclusive and effective speech recognition systems that cater to diverse linguistic communities worldwide.
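To make the metric concrete (this worked example is ours, not from the paper): WER counts the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the predicted transcript into the reference, divided by the number of words (N) in the reference, i.e. WER = (S + D + I) / N. A minimal sketch using the open-source jiwer package:

    import jiwer  # open-source WER implementation

    reference = "this is the ground truth transcript"   # what was actually said
    hypothesis = "this is ground truth transcripts"     # what the model predicted
    # One deletion ("the") and one substitution ("transcripts") against 6 reference words:
    print(jiwer.wer(reference, hypothesis))             # 2 / 6 = 0.33, i.e. a WER of about 33%

In practice the reference and hypothesis would be Marathi sentences from the test set; the arithmetic is identical.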
II. DATASET

Common Voice is a publicly available dataset that contains recordings of people reading sentences in multiple languages. These recordings are made available for free to researchers and developers to train speech recognition systems. By collecting voice recordings from volunteers around the world, Common Voice aims to create more inclusive and accurate speech recognition models. Researchers and developers can use this dataset to advance the development of speech technology for various purposes, including accessibility, language learning, and voice-controlled devices. The Common Voice dataset includes various data fields, such as 'client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', and 'variant'. For our purposes, we only use the 'audio' and 'sentence' fields. We have a total of 7016 samples for the Marathi language, which we split into a combined training and validation dataset of 4906 samples and a separate test dataset of 2212 samples. This approach optimizes the use of the small dataset we have available.
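As a point of reference (this sketch is ours, not the authors' code), the loading and splitting described above could be done with the Hugging Face datasets library roughly as follows; the dataset identifier mozilla-foundation/common_voice_16_0, the use of the predefined splits, and the access requirements (the dataset is gated on the Hub) are assumptions for illustration:

    # Illustrative sketch: load the Marathi portion of Common Voice 16 and keep only
    # the 'audio' and 'sentence' fields; pool train+validation as in the paper's split.
    from datasets import load_dataset, concatenate_datasets, Audio

    cv = load_dataset("mozilla-foundation/common_voice_16_0", "mr")    # "mr" = Marathi
    train_val = concatenate_datasets([cv["train"], cv["validation"]])  # combined train+validation pool
    test = cv["test"]                                                  # held-out test set

    keep = ["audio", "sentence"]
    train_val = train_val.remove_columns([c for c in train_val.column_names if c not in keep])
    test = test.remove_columns([c for c in test.column_names if c not in keep])

    # Both Whisper and XLSR-Wav2Vec2 expect 16 kHz input.
    train_val = train_val.cast_column("audio", Audio(sampling_rate=16_000))
    test = test.cast_column("audio", Audio(sampling_rate=16_000))

The exact split sizes reported above (4906 and 2212) depend on the dataset release and any filtering applied, so they are not hard-coded here.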


III. DATA PREPROCESSING

After obtaining the dataset, we utilize the Whisper Feature Extractor on the audio samples. The feature extractor follows a series of steps for processing the audio signal. In the first step, the audio array is normalized to a length of 30 seconds by either truncating or padding it. Then, we apply the windowing process, where the signal is divided into overlapping frames of fixed length. Each frame is typically overlapped with the preceding and succeeding frames to ensure continuity. Next, we apply the Fast Fourier Transform to each windowed frame, which converts the time-domain signal into the frequency domain. This results in a series of complex-valued frequency bins representing the magnitude and phase of different frequency components. We then calculate the power spectrum by taking the squared magnitude of each complex-valued frequency bin. This yields the power spectral density of the audio signal. After that, we multiply the obtained power spectrum by a set of triangular filter banks. These filter banks are spaced evenly on the Mel-frequency scale, which better approximates human auditory perception of sound. Then, we take the logarithm of the filter bank energies and, finally, compute the spectrograms. These spectrograms are then passed into the Whisper model. In the case of the XLSR-Wav2Vec model, we extract the features, which are then converted to logits for further computation.
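A minimal sketch of this feature-extraction step (ours, not the paper's code), assuming the Hugging Face transformers feature extractors and the openai/whisper-small and facebook/wav2vec2-large-xlsr-53 checkpoints:

    # Illustrative sketch: log-Mel input features for Whisper, normalized raw waveform for XLSR.
    from transformers import WhisperFeatureExtractor, Wav2Vec2FeatureExtractor

    whisper_fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
    xlsr_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")

    def prepare(batch):
        audio = batch["audio"]
        # Whisper: pad/truncate to 30 s, window, FFT, Mel filter banks, log -> (80, 3000) array
        batch["input_features"] = whisper_fe(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        # XLSR-Wav2Vec2: zero-mean, unit-variance normalized raw waveform
        batch["input_values"] = xlsr_fe(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_values[0]
        return batch

    train_val = train_val.map(prepare)   # assumes the splits from Section II
    test = test.map(prepare)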
to-end speech recognition in english and mandarin. In
IV. MODEL ARCHITECTURE

A. Whisper Model
This system is based on the transformer architecture, which consists of encoder blocks and decoder blocks with an attention mechanism that propagates information between them. The system takes an audio recording, splits it into 30-second chunks, and processes them one by one. For each 30-second chunk, it encodes the audio using the encoder section while preserving the position of each spoken word. The decoder then leverages this encoded information to determine what was said, predicting tokens that correspond to the individual words. It repeats this process for the next word, conditioning on the same encoded information as well as the previously predicted words, which helps it choose the next word that makes the most sense.

B. XLSR-Wav2Vec2
The XLSR-Wav2Vec2 model architecture consists of various essential components for efficient speech processing. The core of this architecture comprises convolutional layers that play a crucial role in transforming the raw waveform input into a latent representation denoted as Z. These convolutional layers extract critical features from the input audio, capturing important characteristics of the signal. After this initial processing, transformer layers are used to create contextualized representations denoted as C. These transformer layers help the model capture long-range dependencies and contextual information within the audio sequence, facilitating accurate and nuanced understanding of the input. Finally, a linear projection layer is applied to refine the representation and prepare it for downstream tasks such as speech recognition or speaker identification. Together, these components form a comprehensive architecture that enables the XLSR-Wav2Vec2 model to achieve state-of-the-art performance in cross-lingual speech recognition and related tasks.
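For illustration, the two architectures described above could be invoked for inference roughly as follows (our sketch, not the authors' code). The openai/whisper-small checkpoint matches the Whisper (Small) variant evaluated in this paper; the XLSR checkpoint name is a placeholder, since decoding with Wav2Vec2 requires a CTC head fine-tuned on Marathi text, and the language/forced_decoder_ids arguments depend on the transformers version:

    # Illustrative inference sketches for both models.
    import torch
    from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                              Wav2Vec2Processor, Wav2Vec2ForCTC)

    # --- Whisper: encoder-decoder transformer, autoregressive token prediction ---
    w_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    w_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    def transcribe_whisper(audio_array, sampling_rate=16_000):
        inputs = w_processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
        prompt_ids = w_processor.get_decoder_prompt_ids(language="marathi", task="transcribe")
        with torch.no_grad():
            ids = w_model.generate(inputs.input_features, forced_decoder_ids=prompt_ids)
        return w_processor.batch_decode(ids, skip_special_tokens=True)[0]

    # --- XLSR-Wav2Vec2: conv feature encoder (Z) -> transformer context (C) -> linear CTC head ---
    xlsr_checkpoint = "your-org/wav2vec2-xlsr-marathi"   # placeholder, not a real checkpoint name
    x_processor = Wav2Vec2Processor.from_pretrained(xlsr_checkpoint)
    x_model = Wav2Vec2ForCTC.from_pretrained(xlsr_checkpoint)

    def transcribe_xlsr(audio_array, sampling_rate=16_000):
        inputs = x_processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
        with torch.no_grad():
            logits = x_model(inputs.input_values).logits   # per-frame scores over the character vocabulary
        pred_ids = torch.argmax(logits, dim=-1)            # greedy choice per frame
        return x_processor.batch_decode(pred_ids)[0]       # collapses repeats and removes CTC blanks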
V. RESULTS AND CONCLUSION

Two speech recognition models, the Whisper model by OpenAI and the XLSR Wave2vec model by Facebook, were tested on Marathi language data. The Whisper model achieved a WER of 45%, indicating higher accuracy than the XLSR Wave2vec model, which obtained a WER of 71%. The Whisper model was found to be more effective in low-resource settings, highlighting the potential of advanced neural network architectures in addressing challenges associated with speech recognition in underrepresented languages.
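To make the evaluation procedure concrete, a corpus-level WER computation over the test split from Section II could look like the sketch below (ours, not the authors' code); the pipeline generation arguments are version-dependent assumptions, and the same loop would be run with the XLSR model to obtain its score:

    # Illustrative evaluation sketch: corpus-level WER of Whisper (Small) on the Marathi test split.
    import jiwer
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small",
                   generate_kwargs={"language": "marathi", "task": "transcribe"})

    references, hypotheses = [], []
    for sample in test:                                   # 'test' split from Section II
        references.append(sample["sentence"])
        hypotheses.append(asr(sample["audio"]["array"])["text"])

    print("WER:", jiwer.wer(references, hypotheses))      # the paper reports 45% for Whisper (Small)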
REFERENCES

[1]. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182. PMLR.
[2]. Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
[3]. Billa, J. (2018). ISI ASR system for the low resource speech recognition challenge for Indian languages. In INTERSPEECH, pages 3207–3211.
[4]. Chung, Y.-A., Zhang, Y., Han, W., Chiu, C.-C., Qin, J., Pang, R., and Wu, Y. (2021). W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. arXiv preprint arXiv:2108.06209.
[5]. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
[6]. Shetty, V. M. and NJ, M. S. M. (2020). Improving the performance of transformer based low resource speech recognition for Indian languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8279–8283. IEEE.
