A Lightweight Task-Agreement Meta Learning for Low-Resource Speech Recognition

Chen, Yaqi; Zhang, Hao; Zhang, Wenlin; Qu, Dan; Yang, Xukui

doi:10.1007/s11063-024-11661-6

Published: 05 July 2024

Volume 56, article number 210, (2024)

Neural Processing Letters

Yaqi Chen^1, Hao Zhang^1, Wenlin Zhang^{1,2}, Dan Qu^{1,2} & Xukui Yang^{1,2}

Abstract

Meta-learning has proven to be a powerful paradigm for transferring knowledge from prior tasks to facilitate the quick learning of new tasks in automatic speech recognition. However, the differences between languages (tasks) lead to variations in task learning directions, causing harmful competition for the model’s limited resources. To address this challenge, we introduce task-agreement multilingual meta-learning (TAMML), which adopts the gradient agreement algorithm to guide the model parameters towards a direction where tasks exhibit greater consistency. However, the computation and storage cost of TAMML grows dramatically as the model’s depth increases. To address this, we further propose a simplification called TAMML-Light, which uses only the output layer for gradient calculation. Experiments on three datasets demonstrate that TAMML and TAMML-Light outperform standard meta-learning approaches. Furthermore, TAMML-Light removes at least 80$\%$ of the additional computation cost that TAMML introduces.

1 Introduction

Automatic speech recognition (ASR) has made remarkable advancements, but it relies heavily on large amounts of annotated data, which are expensive and impractical to collect for low-resource languages [1]. Recently, large pre-trained models [2, 3] have made significant advancements, enabling and bootstrapping ASR applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. One of the simplest and most effective approaches is multilingual transfer learning (MTL-ASR) [4,5,6]. Based on the theory that different languages share the same semantic space, MTL-ASR learns a common semantic representation from multiple languages to enable quick adaptation to target languages. However, MTL-ASR has limited interaction between different languages during training, and it focuses on using knowledge from multiple languages to supplement the target-language knowledge. Therefore, its generalization to the target language is weak, especially in low-resource settings. In contrast, meta-learning [7] pursues high-level shared features in multilingual learning, so the model can better capture the correlation between tasks and improve its overall learning ability. As a result, multilingual meta-learning (MML-ASR) shows faster adaptation and better results under low-resource settings [8, 9], and it has also been widely applied in other low-resource speech domains [10,11,12].

However, a common issue in multilingual meta-learning is task conflict. In the context of MML-ASR, each language represents a distinct task. Because different languages originate from diverse regions and have distinct cultural characteristics and pronunciation systems, the directions of their task gradients differ. These differences foster detrimental competition for the model’s limited resources, leading to the task-conflict problem. Current ASR research mainly focuses on problems such as task imbalance [13], training instability [14] and inefficient learning [15], while the task-conflict problem has no good solution yet. Recently, the gradient agreement algorithm [16] has shown good experimental results in addressing task conflict in computer vision. It assigns a weight to each task in the meta-optimization stage, determined by the consistency between the task gradient and the average gradient of a batch of tasks.

However, since the gradient agreement algorithm uses the gradient of the whole model for weight calculation, the computation and storage costs grow significantly as the depth of the model increases. Applying the algorithm directly to the speech domain further increases the computing cost: compared with image classification, speech signals are sparser and often accompanied by noise, so speech recognition tasks often require more resources to develop a practical and reliable system. It is well known that the model’s output layer (head) frequently changes significantly when adapting to downstream tasks [17]. Therefore, we propose employing the model’s head to calculate the task weights, because the features learned by the head are more characteristic of the task itself. We compare the similarity (Wasserstein distance) between the task weight distribution obtained from the whole model (shown in Fig. 2) and that obtained from each submodule. As shown in Fig. 1, the weight distribution obtained from the output layer is closest to that of the whole model, indicating that the model’s head is a good substitute for the whole model in weight calculation.

In this paper, we validate the effectiveness of task-agreement multilingual meta-learning (TAMML), which adapts the gradient agreement algorithm [16] to multilingual meta-learning. Based on the above analysis, we further propose a simple and effective lightweight variant, TAMML-Light. TAMML-Light uses the gradients of the model’s head instead of those of the whole model for weight calculation, which greatly reduces computation and storage costs. Because this cost does not increase with the depth of the model, TAMML-Light can be applied easily and efficiently to deeper and larger structures.

Experiments on a large number of low-resource languages from three datasets show that TAMML and TAMML-Light achieve better results than standard meta-learning. Specifically, TAMML-Light surpasses meta-learning by more than 7$\%$ in relative CER on four languages of OpenSLR. Further analysis shows that TAMML-Light removes at least 80$\%$ of the additional computing time and storage overhead introduced by TAMML, with almost the same performance. As the depth of the model increases, the reduction in computation and storage overhead becomes even more significant.

2 Preliminary

2.1 Multilingual Learning ASR

Our multilingual speech recognition model utilizes the joint CTC-Attention [18] architecture, which has also been used in previous studies [9, 13]. As illustrated in Fig. 2, the model consists of three parts: the encoder and the decoder in the Transformer, as well as a connectionist temporal classification (CTC) module, which can guide the model to learn good alignments and speed up convergence. The loss function can be expressed as a weighted sum of the decoding loss ${L}_{att}$ and the CTC loss ${L}_{ctc}$:

$$\begin{aligned} L_{asr}=\lambda {{L}_{ctc}}+(1-\lambda ){{L}_{att}} \end{aligned}$$

(1)

where the hyper-parameter $\lambda $ denotes the weight of CTC loss. To overcome the challenge posed by different language symbols, Byte Pair Encoding (BPE) [19] is employed to generate sub-words as multilingual modeling units. Training transcripts from all languages are combined together to generate the multilingual symbol vocabulary, instead of merging each language’s symbol vocabulary directly. As a result, similar sub-words are shared among different languages, which is very beneficial for similar languages to learn common information. Similar to prior work [20], we substitute the language symbol $<S\_LANG>$ for the start token $<S>$ at the beginning of the original sub-word sequence, which can alleviate the language confusion.
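The weighted sum in Eq. (1) can be illustrated with a minimal sketch. The function name and the numeric loss values are hypothetical and chosen only for illustration; the default $\lambda = 0.3$ is the value reported later in the implementation details.

```python
def joint_ctc_attention_loss(ctc_loss: float, att_loss: float, lam: float = 0.3) -> float:
    """Eq. (1): L_asr = lam * L_ctc + (1 - lam) * L_att."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * att_loss


# Hypothetical per-batch loss values, for illustration only.
loss = joint_ctc_attention_loss(ctc_loss=2.0, att_loss=1.0, lam=0.3)
print(round(loss, 6))
```

A larger $\lambda$ puts more emphasis on the CTC branch, while a smaller $\lambda$ favors the attention decoder.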

2.2 Multilingual Meta Learning

Meta-learning [7] is a powerful approach that enables models to acquire meta-knowledge from diverse training tasks, facilitating rapid adaptation to novel tasks. Therefore, it’s especially suitable for low-resource speech recognition. Multilingual meta-learning, on the other hand, leverages generic meta-knowledge gained from numerous training episodes across multiple source languages, thereby facilitating learning in low-resource target languages.

Suppose the dataset is a set of N languages $D_{s}=\{D_{s}^{i}\}_{i=1}^{N}$, where each language $D_s^i$ is composed of speech-text pairs. In contrast to traditional machine learning, meta-learning uses tasks instead of data instances as its training samples. For the i-th language, we sample tasks $T_i$ from $D_s^i$ and split $T_i$ into two subsets, the support set $T_{sup}^i$ and the query set $T_{query}^i$. The ASR model f is parameterized by $\theta $. After multilingual pretraining, the model aims to adapt to the low-resource target languages $D_t$. Multilingual meta-learning can be described as a bilevel optimization problem.

First, the base learner learns every task from the initial meta-learner $\theta $ in the inner loop. Concretely, the adapted model parameters $\theta _{i}$ are updated from $\theta $ by performing gradient descent on the support set $T_{sup}^{i}$ ($i=1,2,...,N$):

$$\begin{aligned} {{\theta }_{i}}=\theta -\alpha \nabla _{\theta } {{L}}({{{\theta };T_{sup}^i }}) \end{aligned}$$

(2)

where $\alpha $ is the learning rate of the inner loop, and L is the loss function computed by using Eq. (1).

Secondly, the meta-learner integrates the knowledge of each base learner in the outer loop. Specifically, the meta model parameters $\theta $ are updated by calculating all the task query losses using the adapted model parameters ${{\theta }_{i}}$ over the query set ${T_{query}^{i}}$:

$$\begin{aligned} \theta \leftarrow \theta -\beta \sum \limits _{{i}}{\nabla _{\theta } {{L}^\mathrm{{meta}}}({{{{\theta }_{i}};T_{query}^i)}}} \end{aligned}$$

(3)

where $\beta $ is the learning rate of the outer loop. Such a complete meta-update process is also referred to as an episode.

From Eqs. (2) and (3), we can see that meta-learning requires computing second-order derivatives with respect to $\theta $, which is computationally expensive. Therefore, the first-order MAML algorithm (FOMAML) is adopted, as in [8] and [10]. In this way, Eq. (3) can be reformulated as:

$$\begin{aligned} \theta \leftarrow \theta -\beta \sum \limits _{{i}}{\nabla _{{{\theta }_{i}}}{{L}^\mathrm{{meta}}}({{\theta _i};T_{query}^i)}} \end{aligned}$$

(4)
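One episode of the first-order update in Eqs. (2) and (4) can be sketched as follows. This is a minimal illustration on toy quadratic losses rather than the ASR loss of Eq. (1); the helper names (`sgd_step`, `fomaml_episode`, `quadratic_grad`) and the task centers are assumptions introduced only for this example.

```python
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def sgd_step(theta: Vector, grad: Vector, lr: float) -> Vector:
    """Plain gradient-descent step: theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


def fomaml_episode(theta: Vector, support_grads: List[GradFn],
                   query_grads: List[GradFn], alpha: float, beta: float) -> Vector:
    """One meta-update: inner step per task (Eq. 2), then first-order outer step (Eq. 4)."""
    meta_grad = [0.0] * len(theta)
    for sup_grad, qry_grad in zip(support_grads, query_grads):
        theta_i = sgd_step(theta, sup_grad(theta), alpha)  # Eq. (2)
        g = qry_grad(theta_i)  # gradient taken at theta_i, no second-order term
        meta_grad = [m + gi for m, gi in zip(meta_grad, g)]
    return sgd_step(theta, meta_grad, beta)  # Eq. (4)


def quadratic_grad(center: Vector) -> GradFn:
    """Gradient of the toy loss 0.5 * ||theta - center||^2 (hypothetical task)."""
    return lambda theta: [t - c for t, c in zip(theta, center)]


# Two toy tasks; support and query sets share the same loss here for brevity.
grads = [quadratic_grad([1.0, 0.0]), quadratic_grad([0.0, 1.0])]
new_theta = fomaml_episode([0.0, 0.0], grads, grads, alpha=0.1, beta=0.1)
print(new_theta)
```

Because the query gradient is evaluated at $\theta _i$ and applied directly to $\theta $, no derivative of the inner step is needed, which is what removes the second-order cost.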

3 Lightweight Task-Agreement Multilingual Meta Learning

Meta-learning averages the query gradients of all tasks in the meta-optimization phase to update the model parameters, which means that all tasks contribute equally to the update. This ignores the task-conflict problem and leads to unsatisfactory results. We first introduce the task-agreement multilingual meta-learning (TAMML) algorithm, which adjusts the contribution of each task using the gradient agreement algorithm during the meta-optimization phase. Given the support set $T_{sup}^i$ and query set $T_{query}^i$ of a task, the task-agreement meta-learning algorithm can be expressed as follows:

$$\begin{aligned} \theta \leftarrow \theta - \beta \sum \limits _{i=1}^{N} {\omega _i}\nabla _{\theta } L^{\mathrm{meta}}({\theta _i};T_{query}^i) \end{aligned}$$

(5)

$$\begin{aligned} \mathrm{s.t.}\quad {\theta _i} = \theta - \alpha \nabla _{\theta } L(\theta ;T_{sup}^i) \end{aligned}$$

(6)

where $\omega _i$ represents the weight of the i-th task.

Let $g_i$ denote the gradient of the i-th task, specifically given by ${g_i} = {\theta _i} - \theta $. We introduce $g_v$ to denote the average gradient across all tasks, i.e., ${g_v} = \frac{1}{N}\sum \limits _i^N {{g_i}}$. If the i-th task gradient $g_i$ conflicts with the average gradient of all tasks $g_v$, the i-th task should contribute less than the other tasks, and vice versa. Therefore, $\omega _i$ needs to be proportional to the inner product of the task gradient and the average gradient, $g_i^T{g_v}$. Moreover, the weights need to satisfy $\sum \limits _i {{\omega _i}} = 1$. The weight $\omega _i$ can therefore be defined as:

$$\begin{aligned} {\omega _i} = \frac{{g_i^T{g_v}}}{{\sum \limits _{k \in N} { |{g_k^T{g_v}} |} }} = \frac{{\sum \limits _{j \in N} {g_i^T{g_j}} }}{{\sum \limits _{k \in N} { |{\sum \limits _{j \in N} {g_k^T{g_j}} } |} }} \end{aligned}$$

(7)

In this way, if the i-th task gradient aligns with the average gradient, its weight $\omega _i$ increases. If not, its weight decreases. With this insight, TAMML pushes the model parameters in a direction where tasks have more consistency, which reduces the competition between different languages and improves the learning efficiency.
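The weight rule of Eq. (7) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, the toy gradient vectors, and the uniform fallback for a zero denominator are assumptions made for the example.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def agreement_weights(task_grads: List[Vector]) -> List[float]:
    """Eq. (7): w_i = g_i^T g_v / sum_k |g_k^T g_v|, with g_v the mean of the g_i."""
    n = len(task_grads)
    dim = len(task_grads[0])
    g_avg = [sum(g[d] for g in task_grads) / n for d in range(dim)]
    scores = [dot(g, g_avg) for g in task_grads]
    denom = sum(abs(s) for s in scores)
    if denom == 0.0:
        # Assumption: fall back to uniform weights when every inner product is zero.
        return [1.0 / n] * n
    return [s / denom for s in scores]


# Toy g_i = theta_i - theta for three tasks; the third opposes the other two.
weights = agreement_weights([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
print(weights)
```

Note that, with the absolute values in the denominator of Eq. (7), the weights sum to exactly 1 only when every inner product is non-negative; a task whose gradient opposes the average receives a negative weight, as the third toy task does here.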

However, when $g_i$ denotes the gradient of the entire model, computation and storage costs grow dramatically as the model’s depth increases. To address this, we further propose a lightweight variant of TAMML called TAMML-Light. Mathematically, let $g = ({g^1},{g^2},...,{g^l})$ be the gradients for all l layers of the network. It is well known that the model’s head can effectively capture task-specific characteristics. Moreover, the weight distribution obtained from the output layer closely approximates that obtained from the entire model. Therefore, we propose skipping the weight calculation for the network body and focusing solely on the model head. Specifically, TAMML-Light uses the gradient of the last (output) layer $g^l$ to calculate the weight in Eq. (7). In this way, TAMML-Light significantly reduces computation and storage costs, and the cost remains the same as the depth of the model increases.
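The difference between the two variants can be sketched by changing only which part of the per-task gradient enters Eq. (7). In this hedged illustration each task gradient is a list of per-layer vectors with the output layer last; that layout, the function names, and the toy numbers are assumptions for the example.

```python
from typing import List

Vector = List[float]
LayerGrads = List[Vector]  # one flat vector per layer, output layer last


def agreement_weights(vectors: List[Vector]) -> List[float]:
    """Eq. (7) applied to whichever vectors are passed in."""
    n, dim = len(vectors), len(vectors[0])
    avg = [sum(v[d] for v in vectors) / n for d in range(dim)]
    scores = [sum(x * y for x, y in zip(v, avg)) for v in vectors]
    denom = sum(abs(s) for s in scores)
    return [s / denom for s in scores] if denom else [1.0 / n] * n


def full_model_weights(per_task: List[LayerGrads]) -> List[float]:
    """TAMML-style: concatenate every layer, so cost grows with depth."""
    return agreement_weights([[x for layer in g for x in layer] for g in per_task])


def head_only_weights(per_task: List[LayerGrads]) -> List[float]:
    """TAMML-Light-style: use only the last (output) layer gradient."""
    return agreement_weights([g[-1] for g in per_task])


per_task = [
    [[0.1, 0.2], [0.0, 0.1], [1.0, 0.0]],
    [[0.2, 0.1], [0.1, 0.0], [1.0, 0.0]],
    [[-0.1, 0.0], [0.0, 0.0], [-1.0, 0.0]],
]
light = head_only_weights(per_task)
full = full_model_weights(per_task)
print(light, full)
```

In the head-only variant the vectors passed to Eq. (7) have the size of the output layer regardless of how many body layers the model has, which is the property the text relies on.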

Table 1 Results of low-resource ASR on IARPA BABEL in terms of CER ($\%$)

4 Experiment

4.1 Dataset

Our experiments are based on IARPA BABEL [21], OpenSLR and Common Voice [22]. For IARPA BABEL, we constructed two datasets: Babel3 and Babel6. To construct Babel3, we selected three languages: Bengali (61.76 h), Tagalog (84.56 h), and Zulu (62.13 h). To construct Babel6, we added three languages to Babel3: Turkish (77.18 h), Lithuanian (42.52 h), and Guarani (43.03 h). We also selected four languages as target languages: Vietnamese (87.72 h), Tamil (68.36 h), Swahili (44.39 h), and Kurmanji (42.08 h). For the OpenSLR dataset, we selected nine languages as source languages for pre-training (OpenSLR-9): Gujarati (7.89 h), Colombian Spanish (7.58 h), Tamil (7.08 h), Peruvian Spanish (9.22 h), Kannada (8.68 h), Southern English (5 h), Chilean Spanish (7.15 h), Galician (10.31 h), and Basque (13.86 h). We fine-tuned the model on seven very low-resource target languages: Argentinian Spanish (SLR-61), Malayalam (SLR-63), Marathi (SLR-66), Nigerian English (SLR-70), Venezuelan Spanish (SLR-75), Burmese (SLR-80), and Yoruba (SLR-86). The data duration of these target languages ranges from 2.32 h to 8.08 h. For each source language, we used 80$\%$ of its data for training and 20$\%$ for validation. For each target language, we used 60$\%$ of its data for training, 10$\%$ for validation, and 30$\%$ for testing. For Common Voice, we selected six source languages (CV-6): English, German, French, Italian, Portuguese, and Swedish. Each language has about 5, 3, and 3 h of data for training, validation, and testing, respectively.

4.2 Implementation Details

Our multilingual speech recognition model utilizes the joint CTC-attention [23] architecture, which has also been used in previous studies [9, 13]. For IARPA BABEL, the model uses a 6-layer VGG convolutional network, and the Transformer consists of 4 encoder blocks and 2 decoder blocks. Each block comprises four attention heads, 512 hidden units, and 2048 feed-forward hidden units. The weight for the CTC loss is $\lambda =0.3$. The model is trained using a batch size of 128, with 64 examples allocated to the support set and 64 examples to the query set. The experiments are conducted using two Tesla V100 16GB GPUs. The SGD algorithm is used for the task-update stage in meta-learning, and the Adam [24] optimizer is used for the rest of the optimization. We set the warmup steps to 12,000 and the scale factor to 0.5. For OpenSLR-9, the model consists of one encoder block and one decoder block. We set the warmup steps to 1000 and the scale factor to 0.5, and train with a batch size of 96, with 48 examples allocated to the support set and 48 examples to the query set. The other settings, including the hardware and the optimizers, are the same as for IARPA BABEL. When fine-tuning on target languages, we set the batch size to 128. We evaluate our model using beam search with a beam width of 6. For Common Voice (CV-6), the model uses two encoder blocks and two decoder blocks of the Transformer architecture, and each block comprises eight attention heads. Character error rate (CER) and word error rate (WER) are employed as the evaluation metrics. We train for 200 epochs and adopt an early-stopping strategy with a patience of three epochs. During inference, we average the best five checkpoints for evaluation.
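The checkpoint averaging mentioned at the end of the paragraph above can be sketched as an element-wise mean over saved parameter sets. This is a generic illustration under stated assumptions: checkpoints are represented as plain dictionaries of flat float lists, the function name is hypothetical, and how the best five checkpoints are selected is not specified in the text and is left out.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of the parameters across the given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name, values in checkpoints[0].items():
        averaged[name] = [
            sum(ckpt[name][j] for ckpt in checkpoints) / n
            for j in range(len(values))
        ]
    return averaged


# Two toy checkpoints standing in for the selected ones.
ckpts = [{"w": [1.0, 2.0], "b": [0.0]}, {"w": [3.0, 4.0], "b": [1.0]}]
avg = average_checkpoints(ckpts)
print(avg)
```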

5 Experiment Results

5.1 Results on Low-Resource ASR

Results on IARPA BABEL As shown in Table 1, compared to monolingual training, both multilingual transfer learning (MTL-ASR) and multilingual meta-learning (MML-ASR) improve the ASR performance under different combinations of pretraining languages. Moreover, by pushing the model parameters in a direction where tasks have more consistency, TAMML outperforms MML-ASR for all languages, showing its effectiveness. In general, our method also achieves better performance than adversarial meta sampling (AMS) [13], and it is simpler because it does not require training an extra sampler.

Results on OpenSLR To verify the effectiveness of our method under very low-resource conditions, we pre-trained the model on OpenSLR-9 and fine-tuned it on four target languages. Table 2 reports the results on OpenSLR-9. TAMML improves on MML-ASR by over 7$\%$ in relative CER. Moreover, TAMML-Light achieves performance similar to TAMML.

Table 2 Results of low-resource ASR on OpenSLR-9 in terms of CER ($\%$)

Results on Common Voice Combined with WavLM Large-scale pre-trained models have continuously shown superior performance, so we conducted an experiment on Common Voice to explore whether meta-learning still works when combined with a large-scale model. We use two types of features for comparison: 43-dimensional MFCC features and 1024-dimensional features extracted from WavLM-Large. Table 3 presents the WER of six target languages on CV-6.

MFCC features performed poorly, and the model could learn almost no useful knowledge from them. In contrast, WavLM features significantly improved monolingual performance. MML-ASR remained effective, although its improvement was smaller than without the pre-trained model. Furthermore, TAMML-Light and TAMML still outperformed MML-ASR, with relative improvements of 4$\%$ and 5$\%$, respectively.

Table 3 Results of low-resource ASR combined with WavLM on CV-6 in terms of WER ($\%$)

Statistical Significance Test The superior performance of one system over another may result from the random nature of the test data rather than reflect the systems’ true performance. To reach a more reliable conclusion, it is essential to test the statistical significance of the difference between the two systems. We applied three significance tests using the SCTK toolkit provided by NIST: the matched-pair sentence-segment word error (MP) test [25], the signed paired comparison speaker word accuracy (SI) test [26], and the Wilcoxon signed-rank speaker word accuracy (WI) test [26]. To account for the potential influence of speakers, we treated each sample as belonging to a distinct speaker during the calculations. First, we aligned the recognition results with the reference transcriptions using the sclite tool within SCTK. Subsequently, the alignment results from both methods were passed to the sc_stats tool for testing. Table 4 presents the results of comparing TAMML and TAMML-Light with MML-ASR on six target languages.

All three statistical significance tests indicate that, at a significance level of 5$\%$, there is a significant difference between the recognition results of TAMML and MML-ASR. However, there is no significant difference between TAMML-Light and MML-ASR on Catalan and Kabyle under the WI test. This is mainly because WI focuses on the rank ordering of differences and is less sensitive to their specific numerical magnitudes. TAMML-Light also performs less well on Catalan and Kabyle, which aligns with our experimental results in Table 3. Overall, the results of the statistical significance tests are consistent with the WER results, providing further evidence of the effectiveness of our method.

In summary, we thoroughly evaluated TAMML and TAMML-Light using datasets of varying quality, different numbers of languages, varying amounts of data, and in combination with a large-scale pre-trained model. Across these different experimental settings, TAMML and TAMML-Light achieved consistent improvements for low-resource target languages, showing superior generalization and robustness.

Table 4 Statistical significance test between different methods

5.2 Ablation Studies

Different Meta Learning Methods The preceding experiments were based on FOMAML because of its good performance and high efficiency. We also analyzed the performance of other gradient-based meta-learning methods, namely MAML [27] and ANIL [28]. Table 5 reports the CER of three target languages in OpenSLR when using different meta-learning methods. MAML outperforms FOMAML and ANIL, and TAMML-Light improves all three meta-learning methods.

Comparison Among Different Weight-Adjusting Methods We further compare the performance of different weight-adjusting methods. (1) The weights for every task are set to 1/N and are equal (MML-ASR). (2) The weights for every task are assigned at random (TWR). (3) The weights for disagreement tasks are increased (TWI). (4) The weights for disagreement tasks are set to 0, meaning the conflicting task is dropped (TWD). (5) Our proposed method, TAMML-Light, which increases the weights of agreement tasks.

As shown in Table 5, TWR performs poorly because its weights are random, and TWD is ineffective because it discards some task information. TWI is slightly better than MML-ASR, indicating that inconsistent tasks may still contain useful knowledge. However, TAMML-Light significantly outperforms the other methods, demonstrating that pushing the model parameters in a task-agreement direction helps the model acquire more meta-knowledge.
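Three of the compared baselines can be sketched as simple weight rules. This is a hedged illustration only: the text does not say how the random weights are drawn, how TWD weights the remaining tasks, or how much TWI increases the weights, so the choices below (normalized uniform draws, equal weights for the kept tasks, TWI omitted) and all function names are assumptions.

```python
import random
from typing import List


def uniform_weights(n: int) -> List[float]:
    """MML-ASR baseline: every task gets weight 1/N."""
    return [1.0 / n] * n


def random_weights(n: int, rng: random.Random) -> List[float]:
    """TWR sketch: random draws, normalized to sum to 1 (assumed scheme)."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0.0 else uniform_weights(n)


def drop_conflicting_weights(scores: List[float]) -> List[float]:
    """TWD sketch: zero weight for tasks with a negative agreement score g_i^T g_v.

    Assumption: the remaining tasks share the weight equally.
    """
    keep = [1.0 if s >= 0.0 else 0.0 for s in scores]
    total = sum(keep)
    return [k / total for k in keep] if total > 0.0 else uniform_weights(len(scores))


scores = [0.5, 0.52, -0.24]  # hypothetical inner products with the average gradient
twd = drop_conflicting_weights(scores)
twr = random_weights(3, random.Random(0))
print(uniform_weights(3), twr, twd)
```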

Table 5 Ablation study results on OpenSLR in terms of CER ($\%$)

5.3 Method Analysis

Different Scales of Training Data We evaluated the performance of TAMML and TAMML-Light using different proportions of SLR-86’s training data. As shown in Fig. 3, the performance of TAMML-Light is similar to TAMML under varied training data proportions, showing the stability of TAMML-Light’s performance.

The Computational and Storage Cost of TAMML and TAMML-Light We evaluated the computational cost and the performance of TAMML and TAMML-Light on OpenSLR-9. As shown in Table 6, TAMML-Light achieves performance similar to TAMML. However, TAMML-Light reduces the average iteration time from 5.08 s to 4.67 s, which removes over 80$\%$ of the additional computation time introduced by TAMML, and it also significantly reduces FLOPS and storage. Furthermore, unlike TAMML, the cost of TAMML-Light does not grow as the model’s depth increases, so in deeper model structures TAMML-Light can save even more of the computational overhead incurred by TAMML.
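The 80$\%$ figure can be read as the ratio of the time saved by TAMML-Light to the time added by TAMML. As a hedged illustration, let $t_{base}$ denote the average iteration time of the meta-learning baseline without task weighting; this symbol and its use as the reference point are assumptions, since the baseline time is not given in the text. With $5.08 - t_{base} > 0$, the stated claim corresponds to:

$$\begin{aligned} \frac{5.08 - 4.67}{5.08 - t_{base}} = \frac{0.41}{5.08 - t_{base}}> 0.8 \;\Longleftrightarrow \; 5.08 - t_{base}< 0.5125 \;\Longleftrightarrow \; t_{base} > 4.5675\ \mathrm{s} \end{aligned}$$

Under this reading, the claim holds for any baseline iteration time above roughly 4.57 s, which would leave TAMML-Light with less than about 0.10 s of overhead per iteration.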

Table 6 Performance comparison and resource analysis

Visualization of Model Learning Dynamics The dynamics of the model training process are depicted in Fig. 4. TAMML-Light and TAMML accelerate training, reducing the number of epochs to convergence by over 20$\%$ and 30$\%$, respectively.

6 Conclusion

In this work, we adapt the gradient agreement algorithm to multilingual meta-learning and validate its effectiveness. To reduce the computational cost, we further propose TAMML-Light. Extensive experimental results demonstrate that TAMML-Light effectively enhances the few-shot learning ability of meta-learning while making considerable savings in computation and storage. In the future, we plan to explore adaptive task-gradient adjustment strategies in meta-learning.

References

[1] Baevski A, Zhou H, Mohamed A-r, Auli M (2020) wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv:2006.11477
[2] Pratap V, Sriram A, Tomasello P, Hannun AY, Liptchinsky V, Synnaeve G, Collobert R (2020) Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. arXiv:2007.03001
[3] Hsu W-N, Bolte B, Tsai Y-HH, Lakhotia K, Salakhutdinov R, Mohamed A-r (2021) HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans Audio Speech Lang Proc 29:3451–3460
[4] Luo J, Wang J, Cheng N, Zheng Z, Xiao J (2022) Adaptive activation network for low resource multilingual speech recognition. In: 2022 Int Jt Conf Neural Netw (IJCNN), pp. 1–7
[5] Hou W, Dong Y, Zhuang B, Yang L, Shi J, Shinozaki T (2020) Large-scale end-to-end multilingual speech recognition and language identification with multi-task learning. In: Interspeech
[6] Madhavaraj A, Ganesan RA (2022) Data and knowledge-driven approaches for multilingual training to improve the performance of speech recognition systems of Indian languages. arXiv:2201.09494
[7] Hospedales TM, Antoniou A, Micaelli P, Storkey AJ (2020) Meta-learning in neural networks: a survey. IEEE Trans Pattern Anal Mach Intell 44:5149–5169
[8] Hsu J, Chen Y, Lee H (2020) Meta learning for end-to-end low-resource speech recognition. In: ICASSP, pp. 7844–7848
[9] Hou W, Zhu H, Wang Y, Wang J, Qin T, Xu R, Shinozaki T (2021) Exploiting adapters for cross-lingual low-resource speech recognition. IEEE/ACM Trans Audio Speech Lang Proc 30:317–329
[10] Winata GI, Cahyawijaya S, Liu Z, Lin Z, Madotto A, Xu P, Fung P (2020) Learning fast adaptation on cross-accented speech recognition. In: Meng H, Xu B, Zheng TF (eds.) Interspeech, pp. 1276–1280. https://doi.org/10.21437/Interspeech.2020-0045
[11] Chopra S, Mathur P, Sawhney R, Shah RR (2021) Meta-learning for low-resource speech emotion recognition. In: ICASSP, pp. 6259–6263. https://doi.org/10.1109/ICASSP39728.2021.9414373
[12] Klejch O, Fainberg J, Bell P, Renals S (2019) Speaker adaptive training using model agnostic meta-learning. In: ASRU, pp. 881–888. https://doi.org/10.1109/ASRU46091.2019.9003751
[13] Xiao Y, Gong K, Zhou P, et al (2021) Adversarial meta sampling for multilingual low-resource speech recognition. Proceed AAAI Conf Artif Intell 35(16):14112–14120
[14] Singh S, Wang R, Hou F (2022) Improved meta learning for low resource speech recognition. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4798–4802
[15] Hou W, Wang Y, Gao S, Shinozaki T (2021) Meta-adapter: Efficient cross-lingual adaptation with meta-learning. In: ICASSP, pp. 7028–7032. https://doi.org/10.1109/ICASSP39728.2021.9414959
[16] Eshratifar AE, Eigen D, Pedram M (2018) Gradient agreement as an optimization objective for meta-learning. CoRR arXiv:1810.08178
[17] Zhao J, Zhang W (2022) Improving automatic speech recognition performance for low-resource languages with self-supervised models. IEEE J Sel Top Signal Proc 16:1227–1241
[18] Kim S, Hori T, Watanabe S (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839
[19] Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725
[20] Zhou S, Xu S, Xu B (2018) Multilingual end-to-end speech recognition with a single transformer on low-resource languages. arXiv:1806.05059
[21] Gales MJF, Knill KM, Ragni A, Rath SP (2014) Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED. In: 4th Workshop on Spoken Language Technologies for Under-resourced Languages, SLTU 2014, pp. 16–23. St. Petersburg, Russia, May 14-16, 2014
[22] Ardila R, Branson M, Davis K, Henretty M, Kohler M, Meyer J, Morais R, Saunders L, Tyers FM, Weber G (2019) Common Voice: A massively-multilingual speech corpus
[23] Kim S, Hori T, Watanabe S (2016) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839
[24] Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Bengio Y, LeCun Y (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. arXiv:1412.6980
[25] Gillick L, Cox SJ (1989) Some statistical issues in the comparison of speech recognition algorithms. In: International Conference on Acoustics, Speech, and Signal Processing, pp. 532–535. IEEE
[26] Pallet DS, Fisher WM, Fiscus JG (1990) Tools for the analysis of benchmark speech recognition tests. In: International Conference on Acoustics, Speech, and Signal Processing, pp. 97–100. IEEE
[27] Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: Precup D, Teh YW (eds.) ICML, vol. 70, pp. 1126–1135
[28] Raghu A, Raghu M, Bengio S, Vinyals O (2020) Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No.62171470), Natural Science Foundation of Henan Province of China (No.232300421240), and Henan Zhongyuan Science and Technology Innovation Leading Talent Project (No.234200510019).

Author information

Authors and Affiliations

School of Information Systems Engineering, Information Engineering University, Science Street, Zhengzhou, 450000, Henan, China
Yaqi Chen, Hao Zhang, Wenlin Zhang, Dan Qu & Xukui Yang
Laboratory for Advanced Computing and Intelligence Engineering, Wuxi, 214000, Jiangsu, China
Wenlin Zhang, Dan Qu & Xukui Yang

Contributions

Yaqi Chen: wrote the paper and conducted the experiments. Hao Zhang: wrote the introduction and designed some experiments. Wenlin Zhang: acquired funding and revised the paper. Dan Qu: acquired funding and revised the paper. Xukui Yang: revised the paper and designed some experiments.

Corresponding authors

Correspondence to Dan Qu or Xukui Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chen, Y., Zhang, H., Zhang, W. et al. A Lightweight Task-Agreement Meta Learning for Low-Resource Speech Recognition. Neural Process Lett 56, 210 (2024). https://doi.org/10.1007/s11063-024-11661-6

Accepted: 31 May 2024