Comparative Analysis of State-of-the-Art Speech Recognition Models For Low-Resource Marathi Language
Abstract:- In this research, we present a comparative analysis of two state-of-the-art speech recognition models, Whisper by OpenAI and XLSR Wave2vec by Facebook, applied to the low-resource Marathi language. Leveraging the Common Voice 16 dataset, we evaluated the performance of these models using the word error rate (WER) metric. Our findings reveal that the Whisper (Small) model achieved a WER of 45%, while the XLSR Wave2vec model obtained a WER of 71%. This study sheds light on the capabilities and limitations of current speech recognition technologies for low-resource languages and provides valuable insights for further research and development in this domain.

Keywords:- Speech Recognition, State-of-the-Art Models, Whisper, XLSR Wave2vec, Marathi Language, Low-Resource.

I. INTRODUCTION

In recent years, speech recognition technology has made remarkable progress, primarily due to the development of sophisticated deep learning models. Among these models, Whisper by OpenAI and XLSR Wave2vec by Facebook have demonstrated impressive capabilities in transcribing speech into text across various languages and domains. Although these models perform well in general, their effectiveness in low-resource language settings remains a subject of ongoing research and scrutiny. To address this gap, this study provides a detailed comparative analysis of the performance of two state-of-the-art speech recognition models, Whisper and XLSR Wave2vec, applied specifically to the Marathi language. Marathi is a low-resource language spoken by millions in India, and it presents unique challenges for speech recognition due to limited available data and linguistic variation. To assess the performance of these models on Marathi, we use the Common Voice 16 dataset, a valuable resource for training and testing speech recognition systems.

We use the word error rate (WER) as the metric for performance assessment: a widely used measure that quantifies transcription accuracy by comparing the predicted text with ground-truth transcripts. Through this investigation, we aim to elucidate the strengths and limitations of the Whisper and XLSR Wave2vec models in the context of Marathi speech recognition. Our analysis includes a detailed evaluation of the models' accuracy, robustness, and efficiency, and examines the impact of factors such as data size, model architecture, and training strategy on their performance. By comparing these models, we seek to provide insights that can inform further advances in speech recognition technology, particularly for low-resource languages like Marathi. Such insights are crucial for building more inclusive and effective speech recognition systems that serve diverse linguistic communities worldwide.

II. DATASET

Common Voice is a publicly available dataset of recordings of people reading sentences in multiple languages, released free of charge so that researchers and developers can train speech recognition systems. By collecting voice recordings from volunteers around the world, Common Voice aims to enable more inclusive and accurate speech recognition models, supporting applications such as accessibility, language learning, and voice-controlled devices. Each sample carries several data fields, including 'client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', and 'variant'; for our purposes, we use only the 'audio' and 'sentence' fields. We have a total of 7016 samples for the Marathi language, which we split into a combined
training and validation dataset of 4906 samples and a separate
test dataset of 2212 samples. This split makes the most of the limited data available for Marathi.
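The field selection and hold-out split described above can be sketched in plain Python. The records below are hypothetical stand-ins for the Common Voice schema (real entries carry decoded audio and are typically supplied by a dataset loader rather than constructed by hand), and the 31% test fraction is an illustrative approximation of the 4906 training/validation versus 2212 test split used here.

```python
import random

# Hypothetical mock records mirroring the Common Voice fields listed above;
# real samples contain actual waveforms and speaker metadata.
records = [
    {"client_id": f"c{i}", "path": f"clip_{i}.mp3",
     "audio": f"<waveform {i}>", "sentence": f"sentence {i}",
     "up_votes": 2, "down_votes": 0, "age": "", "gender": "",
     "accent": "", "locale": "mr", "segment": "", "variant": ""}
    for i in range(100)
]

# Keep only the two fields used in this study.
samples = [{"audio": r["audio"], "sentence": r["sentence"]} for r in records]

# Shuffle deterministically, then hold out roughly 31% as a test set,
# approximating the 4906 train/validation vs 2212 test proportions.
random.Random(42).shuffle(samples)
n_test = round(len(samples) * 0.31)
test_set = samples[:n_test]
train_val_set = samples[n_test:]
```

Fixing the random seed keeps the split reproducible across runs, which matters when comparing two models on the same held-out data.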
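The WER metric used for evaluation throughout this study is the word-level edit distance between the predicted and reference transcripts, normalized by the number of reference words. The snippet below is a minimal illustrative implementation; in practice, a library such as jiwer computes the same quantity.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a single substituted word in a three-word reference yields a WER of 1/3; note that insertions can push WER above 100%, which is why very poor transcriptions are not capped at 1.0.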