1 Introduction

Automatic speech recognition (ASR) has made remarkable advancements, but it relies heavily on large amounts of annotated data, which is expensive and impractical to collect for low-resource languages [1]. Recently, large pretrained models [2, 3] have made significant advancements, enabling and bootstrapping ASR applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. One of the simplest and most effective approaches is multilingual transfer learning (MTL-ASR) [4,5,6]. Based on the theory that different languages share the same semantic space, MTL-ASR learns a common semantic representation from multiple languages to enable quick adaptation to target languages. However, MTL-ASR has limited interaction between different languages during training, and it focuses on supplementing the target-language knowledge with knowledge from multiple languages. Therefore, its generalization to the target language is weak, especially in low-resource settings. In contrast, meta-learning [7] pursues high-level shared features in multilingual learning, so the model can better understand and capture the correlation between tasks and improve its overall learning ability. Therefore, multilingual meta-learning (MML-ASR) shows faster adaptation and better results under low-resource settings [8, 9], and it is also widely applied in other low-resource speech domains [10,11,12].

However, a common issue in multilingual meta-learning is task conflict. In the context of MML-ASR, each language represents a distinct task. As different languages originate from diverse regions and have distinct cultural characteristics and pronunciation systems, the directions of their task gradients vary. These differences foster detrimental competition for the model’s limited resources, leading to the task-conflict problem. Current ASR research mainly focuses on problems such as task imbalance [13], training instability [14] and inefficient learning [15], while no good solution exists for the task-conflict problem. Recently, the gradient agreement algorithm [16] has shown good experimental results in solving the task-conflict problem in computer vision. It assigns a weight to each task in the meta-optimization stage, which is determined by the consistency between the task gradient and the average gradient of a batch of tasks.

Fig. 1 Weight distribution distance between different submodules and the model

However, since the gradient agreement algorithm uses the gradient of the whole model for weight calculation, the calculation and storage costs grow significantly as the depth of the model increases. Applying the algorithm directly to the speech domain further increases the computing cost: compared with the inputs of image classification tasks, speech signals are sparser and often accompanied by noise interference, so a practical and reliable speech recognition system often requires more resources. It is well known that the model’s output layer (head) frequently changes significantly when adapting to downstream tasks [17]. Therefore, we propose employing the model’s head for calculating the task weights, because the features learned by the model’s head are more characteristic of the task itself. We compare the similarity (Wasserstein distance) of the task weight distribution between the whole model (shown in Fig. 2) and each submodule. As shown in Fig. 1, the weight distribution obtained with the output layer is the closest to that obtained with the whole model, showing that the model’s head is a good alternative to the whole model in weight calculation.

In this paper, we validate the effectiveness of task-agreement multilingual meta-learning (TAMML), which adapts the gradient agreement algorithm [16] to multilingual meta-learning. Based on the above analysis, we further recommend a simple and effective lightweight variant, TAMML-Light. TAMML-Light uses the gradients of the model’s head instead of the gradients of the whole model for weight calculation, which greatly reduces calculation and storage costs. Because this cost does not increase with the depth of the model, TAMML-Light can be applied easily and efficiently to deeper and larger structures.

Experiments on a large number of low-resource languages on three datasets show that TAMML and TAMML-Light achieve better results than meta-learning. Specifically, TAMML-Light surpasses meta-learning by more than 7\(\%\) in relative CER on four languages of OpenSLR. Further analysis shows that TAMML-Light removes at least 80\(\%\) of the additional computing time and storage overhead introduced by TAMML, with almost the same performance. As the depth of the model increases, the reduction in computation and storage overhead becomes even more significant.

2 Preliminary

2.1 Multilingual Learning ASR

Our multilingual speech recognition model utilizes the joint CTC-Attention [18] architecture, which has also been used in previous studies [9, 13]. As illustrated in Fig. 2, the model consists of three parts: the encoder and the decoder in the Transformer, as well as a connectionist temporal classification (CTC) module, which can guide the model to learn good alignments and speed up convergence. The loss function can be expressed as a weighted sum of the decoding loss \({L}_{att}\) and the CTC loss \({L}_{ctc}\):

$$\begin{aligned} L_{asr}=\lambda {{L}_{ctc}}+(1-\lambda ){{L}_{att}} \end{aligned}$$
(1)

where the hyper-parameter \(\lambda \) denotes the weight of CTC loss. To overcome the challenge posed by different language symbols, Byte Pair Encoding (BPE) [19] is employed to generate sub-words as multilingual modeling units. Training transcripts from all languages are combined together to generate the multilingual symbol vocabulary, instead of merging each language’s symbol vocabulary directly. As a result, similar sub-words are shared among different languages, which is very beneficial for similar languages to learn common information. Similar to prior work [20], we substitute the language symbol \(<S\_LANG>\) for the start token \(<S>\) at the beginning of the original sub-word sequence, which can alleviate the language confusion.
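The weighted sum in Eq. (1) and the language-token substitution described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `<S_XX>` token spelling are assumptions, and the default \(\lambda = 0.3\) is simply the value reported later for IARPA BABEL.

```python
from typing import List


def joint_ctc_attention_loss(l_ctc: float, l_att: float, lam: float = 0.3) -> float:
    """Eq. (1): interpolate the CTC loss and the attention-decoder loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_ctc + (1.0 - lam) * l_att


def add_language_token(subwords: List[str], lang: str) -> List[str]:
    """Replace a leading <S> with a language-specific start token.

    The "<S_XX>" spelling is a hypothetical rendering of <S_LANG>.
    If the sequence has no leading <S>, the token is simply prepended.
    """
    body = subwords[1:] if subwords and subwords[0] == "<S>" else list(subwords)
    return [f"<S_{lang.upper()}>"] + body


# Toy values chosen only to make the arithmetic easy to follow.
loss = joint_ctc_attention_loss(l_ctc=2.0, l_att=1.0)  # 0.3 * 2.0 + 0.7 * 1.0
tokens = add_language_token(["<S>", "hel", "lo"], "en")
```

Setting `lam` to 0 or 1 recovers pure attention or pure CTC training, which is a convenient sanity check when tuning the interpolation weight.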

Fig. 2 Illustration of the multilingual learning model architecture based on joint CTC-attention

2.2 Multilingual Meta Learning

Meta-learning [7] is a powerful approach that enables models to acquire meta-knowledge from diverse training tasks, facilitating rapid adaptation to novel tasks. Therefore, it’s especially suitable for low-resource speech recognition. Multilingual meta-learning, on the other hand, leverages generic meta-knowledge gained from numerous training episodes across multiple source languages, thereby facilitating learning in low-resource target languages.

Suppose the dataset is a set of N languages \(D_{s}=\{D_{s}^{i}\}_{i=1}^{N}\), where each language \(D_s^i\) is composed of speech-text pairs. In contrast to traditional machine learning, meta-learning uses tasks instead of data instances as its training samples. For the i-th language, we sample a task \(T_i\) from \(D_s^i\) and then split \(T_i\) into two subsets, the support set \(T_{sup}^i\) and the query set \(T_{query}^i\). The ASR model f is parameterized by \(\theta \). After multilingual pretraining, the model aims to adapt to the low-resource target languages \(D_t\). Multilingual meta-learning can be described as a bilevel optimization problem.
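The task construction just described can be sketched as follows. This is an illustrative sketch under assumptions: the record type, the function names, and uniform random sampling without replacement are not specified in the text, and the toy corpus uses made-up file names.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (audio path, transcript); a hypothetical record type


def sample_task(pairs: Sequence[Pair], n_support: int, n_query: int,
                rng: random.Random) -> Tuple[List[Pair], List[Pair]]:
    """Draw one task T_i and split it into disjoint support and query sets."""
    batch = rng.sample(list(pairs), n_support + n_query)
    return batch[:n_support], batch[n_support:]


def sample_episode(source: Dict[str, Sequence[Pair]], n_support: int,
                   n_query: int, seed: int = 0) -> Dict[str, Tuple[List[Pair], List[Pair]]]:
    """Build one task per source language for a single meta-update episode."""
    rng = random.Random(seed)
    return {lang: sample_task(pairs, n_support, n_query, rng)
            for lang, pairs in source.items()}


# Toy corpus with made-up file names, only to exercise the functions.
toy = {lang: [(f"{lang}_{k}.wav", f"text {k}") for k in range(10)]
       for lang in ("bn", "tl", "zu")}
episode = sample_episode(toy, n_support=4, n_query=4)
```

Sampling the support and query sets from one draw without replacement keeps them disjoint, so the query loss measures how well the adapted parameters generalize within the language.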

Firstly, the base learner learns every task from the initial meta-learner \(\theta \) in the inner loop. Concretely, the adapted model parameters \(\theta _{i}\) are updated from \(\theta \) by performing gradient descent on the support set \(T_{sup}^{i}\) (\({i}=1,2,...,N\)):

$$\begin{aligned} {{\theta }_{i}}=\theta -\alpha \nabla _{\theta } {{L}}({{{\theta };T_{sup}^i }}) \end{aligned}$$
(2)

where \(\alpha \) is the learning rate of the inner loop, and L is the loss function computed by using Eq. (1).

Secondly, the meta-learner integrates the knowledge of each base learner in the outer loop. Specifically, the meta model parameters \(\theta \) are updated by calculating all the task query losses using the adapted model parameters \({{\theta }_{i}}\) over the query set \({T_{query}^{i}}\):

$$\begin{aligned} \theta \leftarrow \theta -\beta \sum \limits _{{i}}{\nabla _{\theta } {{L}^\mathrm{{meta}}}({{{{\theta }_{i}};T_{query}^i)}}} \end{aligned}$$
(3)

where \(\beta \) is the learning rate of the outer loop. Such a complete meta-update process is also referred to as an episode.

From Eqs. (2) and (3), we can see that meta-learning requires the computation of second-order derivatives with respect to \(\theta \), which is computationally expensive. Therefore, following [8] and [10], we adopt the first-order MAML algorithm (FOMAML), under which Eq. (3) is reformulated as:

$$\begin{aligned} \theta \leftarrow \theta -\beta \sum \limits _{{i}}{\nabla _{{{\theta }_{i}}}{{L}^\mathrm{{meta}}}({{\theta _i};T_{query}^i)}} \end{aligned}$$
(4)
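To make the two loops concrete, the sketch below runs one FOMAML episode on a toy problem. It is only an illustration under stated assumptions: the quadratic loss \(0.5\Vert \theta - c\Vert ^2\) stands in for the ASR loss of Eq. (1), each task is represented by a target vector, and the learning rates are arbitrary.

```python
from typing import List

Vector = List[float]


def grad(theta: Vector, target: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||theta - target||^2."""
    return [t - c for t, c in zip(theta, target)]


def fomaml_step(theta: Vector, supports: List[Vector], queries: List[Vector],
                alpha: float, beta: float) -> Vector:
    """One episode: Eq. (2) for every task, then the first-order update of Eq. (4)."""
    meta_grad = [0.0] * len(theta)
    for sup, qry in zip(supports, queries):
        g_sup = grad(theta, sup)
        theta_i = [t - alpha * g for t, g in zip(theta, g_sup)]  # Eq. (2)
        # First-order approximation: differentiate the query loss at theta_i
        # and ignore how theta_i itself depends on theta.
        g_qry = grad(theta_i, qry)
        meta_grad = [m + g for m, g in zip(meta_grad, g_qry)]
    return [t - beta * m for t, m in zip(theta, meta_grad)]  # Eq. (4)


theta = [0.0, 0.0]
supports = [[1.0, 0.0], [0.0, 1.0]]  # toy "support sets", one target per task
queries = [[1.0, 0.0], [0.0, 1.0]]   # toy "query sets"
new_theta = fomaml_step(theta, supports, queries, alpha=0.1, beta=0.1)
```

With these values each task moves \(\theta \) by 0.1 toward its own target in the inner loop, and the summed query gradients move the meta-parameters to approximately (0.09, 0.09).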

3 Lightweight Task-Agreement Multilingual Meta Learning

Meta-learning averages the query gradients of all tasks in the meta-optimization phase to update the model parameters, so all tasks contribute equally to the update. This ignores the task-conflict problem and leads to unsatisfactory results. We first introduce the task-agreement multilingual meta-learning (TAMML) algorithm, which adjusts the contribution of each task using the gradient agreement algorithm during the meta-optimization phase. Specifically, given the support set \(T_{sup}^i\) and the query set \(T_{query}^i\) of each task, the task-agreement meta-learning algorithm can be expressed as follows:

$$\begin{aligned} \theta \leftarrow \theta - \beta \sum \limits _{i=1}^{N} {{\omega _i}{\nabla _{\theta }} {L}^{\mathrm{meta}}({\theta _i};T_{query}^i)} \end{aligned}$$
(5)
$$\begin{aligned} \mathrm{s.t.}\quad {\theta _i} = \theta - \alpha \nabla _{\theta } L(\theta ;T_{sup}^i) \end{aligned}$$
(6)

where \(\omega _i\) represents the weight of the i-th task.

Let \(g_i\) denote the gradient of the i-th task, specifically given by \({g_i} = {\theta _i} - \theta \). We introduce \(g_v\) to denote the average gradient across all tasks, i.e., \({g_v} = \frac{1}{N}\sum \limits _{i=1}^N {{g_i}}\). If the i-th task gradient \(g_i\) conflicts with the average gradient of all tasks \(g_v\), the i-th task should contribute less than other tasks, and vice versa. Therefore, \(\omega _i\) needs to be proportional to the inner product of the task gradient and the average gradient, \(g_i^T{g_v}\). Moreover, the weights need to be normalized so that \(\sum \limits _i { |{\omega _i} |} = 1\). So the weight \(\omega _i\) can be defined as:

$$\begin{aligned} {\omega _i} = \frac{{g_i^T{g_v}}}{{\sum \limits _{k \in N} { |{g_k^T{g_v}} |} }} = \frac{{\sum \limits _{j \in N} {g_i^T{g_j}} }}{{\sum \limits _{k \in N} { |{\sum \limits _{j \in N} {g_k^T{g_j}} } |} }} \end{aligned}$$
(7)

In this way, if the i-th task gradient aligns with the average gradient, its weight \(\omega _i\) increases. If not, its weight decreases. With this insight, TAMML pushes the model parameters in a direction where tasks have more consistency, which reduces the competition between different languages and improves the learning efficiency.
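Eq. (7) can be checked numerically with a short sketch. The code below is an illustration, not the authors' implementation: gradients are plain Python lists, the function names are hypothetical, and the uniform fallback for an all-zero denominator is an assumption added only to keep the function total.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def agreement_weights(task_grads: List[Vector]) -> List[float]:
    """Eq. (7): w_i = g_i^T g_v / sum_k |g_k^T g_v|."""
    n = len(task_grads)
    dim = len(task_grads[0])
    g_v = [sum(g[d] for g in task_grads) / n for d in range(dim)]  # average gradient
    scores = [dot(g, g_v) for g in task_grads]
    norm = sum(abs(s) for s in scores)
    if norm == 0.0:
        # Fallback chosen for this sketch only; the text does not cover this case.
        return [1.0 / n] * n
    return [s / norm for s in scores]


# Two tasks agree on the update direction, the third points the opposite way.
grads = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
w = agreement_weights(grads)
```

In this toy case the two agreeing tasks each receive a weight of about 1/3 and the conflicting task receives about -1/3, so the absolute values of the weights sum to one, as the denominator of Eq. (7) implies.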

However, when \(g_i\) denotes the gradient of the entire model, computation and storage costs grow dramatically as the model’s depth increases. To address this, we further propose a lightweight variant of TAMML called TAMML-Light. Mathematically, let \(g = ({g^1},{g^2},...,{g^l})\) be the gradients for all layers of the network. It is well known that the model’s head can effectively capture task-specific characteristics. Moreover, the weight distribution obtained from the output layer closely matches that obtained from the entire model. Therefore, we propose skipping the weight calculation for the network body and focusing solely on the model head. Specifically, TAMML-Light uses the gradient of the last output layer \(g^l\) to calculate the weight in Eq. (7). In this way, TAMML-Light significantly reduces computation and storage costs, and the cost remains the same as the depth of the model increases.
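The difference between the two variants can be sketched by representing each task gradient as a list of per-layer vectors with the head last. This is a contrived toy under assumptions: the layer sizes, the gradient values, and the function names are all hypothetical, and the example only shows the mechanics, not the empirical closeness reported in Fig. 1.

```python
from typing import List

Layers = List[List[float]]  # one flat gradient vector per layer, head last


def agreement_weights(task_grads: List[List[float]]) -> List[float]:
    """Eq. (7) applied to one flat gradient vector per task."""
    n, dim = len(task_grads), len(task_grads[0])
    g_v = [sum(g[d] for g in task_grads) / n for d in range(dim)]
    scores = [sum(x * y for x, y in zip(g, g_v)) for g in task_grads]
    norm = sum(abs(s) for s in scores)
    return [s / norm for s in scores] if norm else [1.0 / n] * n


def tamml_weights(tasks: List[Layers]) -> List[float]:
    """Full variant: concatenate the gradients of every layer."""
    return agreement_weights([[x for layer in t for x in layer] for t in tasks])


def tamml_light_weights(tasks: List[Layers]) -> List[float]:
    """Light variant: use only the last (head) layer, g^l."""
    return agreement_weights([t[-1] for t in tasks])


def stored_values(tasks: List[Layers], light: bool) -> int:
    """Number of gradient entries that must be kept to compute the weights."""
    return sum(len(t[-1]) if light else sum(len(layer) for layer in t) for t in tasks)


tasks = [
    [[0.1, 0.0], [0.0, 0.1], [1.0, 0.0]],
    [[0.0, 0.1], [0.1, 0.0], [1.0, 0.0]],
    [[0.1, 0.1], [0.0, 0.0], [-1.0, 0.0]],
]
w_full = tamml_weights(tasks)
w_light = tamml_light_weights(tasks)
```

Adding body layers to `tasks` increases `stored_values(tasks, light=False)` but leaves `stored_values(tasks, light=True)` unchanged, which mirrors the claim that the cost of the light variant does not grow with depth.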

Table 1 Results of low-resource ASR on IARPA BABEL in terms of CER (\(\%\))

4 Experiment

4.1 Dataset

Our experiments are based on IARPA BABEL [21], OpenSLR, and Common Voice [22]. For IARPA BABEL, we constructed two datasets: Babel3 and Babel6. To construct Babel3, we selected three languages: Bengali (61.76 h), Tagalog (84.56 h), and Zulu (62.13 h). To construct Babel6, we added three languages to Babel3: Turkish (77.18 h), Lithuanian (42.52 h), and Guarani (43.03 h). We also selected four languages as target languages: Vietnamese (87.72 h), Tamil (68.36 h), Swahili (44.39 h), and Kurmanji (42.08 h). For the OpenSLR dataset, we selected nine languages as source languages for pre-training (OpenSLR-9): Gujarati (7.89 h), Colombian Spanish (7.58 h), Tamil (7.08 h), Peruvian Spanish (9.22 h), Kannada (8.68 h), Southern English (5 h), Chilean Spanish (7.15 h), Galician (10.31 h), and Basque (13.86 h). We fine-tuned the model on seven very low-resource target languages: Argentinian Spanish (SLR-61), Malayalam (SLR-63), Marathi (SLR-66), Nigerian English (SLR-70), Venezuelan Spanish (SLR-75), Burmese (SLR-80), and Yoruba (SLR-86). The data duration of these target languages ranges from 2.32 h to 8.08 h. For each source language, we used 80\(\%\) of its data for training and 20\(\%\) for validation. For each target language, we used 60\(\%\) of its data for training, 10\(\%\) for validation, and 30\(\%\) for testing. For Common Voice, we selected six source languages (CV-6): English, German, French, Italian, Portuguese, and Swedish. For each language, the training, validation, and test sets contain about 5, 3, and 3 h of data, respectively.
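The 80/20 and 60/10/30 partitions can be reproduced with a small helper. This is a sketch under assumptions: the text does not say whether the split is random, per utterance, or per duration, so the code simply shuffles utterance identifiers with a fixed seed; the identifiers themselves are made up.

```python
import random
from typing import List, Sequence


def split_dataset(items: Sequence[str], fractions: Sequence[float],
                  seed: int = 0) -> List[List[str]]:
    """Shuffle once and cut into consecutive parts sized by `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for k, frac in enumerate(fractions):
        # The last part takes the remainder so that no item is lost to rounding.
        is_last = k == len(fractions) - 1
        end = len(shuffled) if is_last else start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


utterances = [f"utt_{k:03d}" for k in range(100)]  # hypothetical utterance ids
src_train, src_valid = split_dataset(utterances, (0.8, 0.2))
tgt_train, tgt_valid, tgt_test = split_dataset(utterances, (0.6, 0.1, 0.3))
```

Fixing the seed keeps the partitions reproducible across runs, which matters when several methods are compared on the same very small test sets.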

4.2 Implementation Details

Our multilingual speech recognition model utilizes the joint CTC-attention [23] architecture, which has also been used in previous studies [9, 13]. For IARPA BABEL, the model uses a 6-layer VGG convolutional network, and the Transformer consists of 4 encoder blocks and 2 decoder blocks. Each block comprises four attention heads, 512 hidden units, and 2048 feed-forward hidden units. The weight for the CTC loss is \(\lambda =0.3\). The model is trained using a batch size of 128, with 64 examples allocated to the support set and 64 examples to the query set. The experiments are conducted using two Tesla V100 16GB GPUs. The SGD algorithm is used for the task-update stage in meta-learning, and the Adam [24] optimizer is used for the rest of the optimization. We set the warmup steps to 12,000 and the scale factor to 0.5. For OpenSLR-9, the model consists of one encoder block and one decoder block. We set the warmup steps to 1000 and the scale factor to 0.5. The model is trained using a batch size of 96, with 48 examples allocated to the support set and 48 examples to the query set. The other settings are the same as for IARPA BABEL. When fine-tuning on target languages, we set the batch size to 128. We evaluated our model using beam search with a beam width of 6. For Common Voice (CV-6), the model uses two encoder blocks and two decoder blocks of the Transformer architecture, and each block comprises eight attention heads. In the experiments, character error rate (CER) and word error rate (WER) are employed as the evaluation metrics. We trained for 200 epochs and adopted an early-stopping strategy with a patience of three epochs. During inference, we average the best five checkpoints for evaluation.
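Checkpoint averaging, the last step mentioned above, can be sketched as follows. This is an illustration under assumptions: checkpoints are modelled as dictionaries of flat parameter lists, "best" is taken to mean lowest validation error (the text does not state the ranking criterion), and all values are made up.

```python
from typing import Dict, List, Tuple

Checkpoint = Dict[str, List[float]]  # parameter name -> flat list of values


def best_k(scored: List[Tuple[float, Checkpoint]], k: int = 5) -> List[Checkpoint]:
    """Pick the k checkpoints with the lowest validation error."""
    ranked = sorted(scored, key=lambda pair: pair[0])
    return [ckpt for _, ckpt in ranked[:k]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of checkpoints that share the same keys and sizes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    first = checkpoints[0]
    return {name: [sum(c[name][j] for c in checkpoints) / n
                   for j in range(len(values))]
            for name, values in first.items()}


# Hypothetical (validation error, checkpoint) pairs.
errors = [0.30, 0.25, 0.40, 0.20, 0.35, 0.50]
scored = [(err, {"w": [float(i + 1)]}) for i, err in enumerate(errors)]
averaged = average_checkpoints(best_k(scored, k=5))
```

Here the checkpoint with error 0.50 is excluded and the remaining five parameter values (1 to 5) average to 3.0.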

5 Experiment Results

5.1 Results on Low-Resource ASR

Results on IARPA BABEL. As shown in Table 1, compared to monolingual training, both multilingual transfer learning (MTL-ASR) and multilingual meta-learning (MML-ASR) improve the ASR performance under different combinations of pretraining languages. Moreover, by pushing the model parameters in a direction where tasks are more consistent, TAMML outperforms MML-ASR for all languages, showing its effectiveness. In general, our method also achieves better performance than adversarial meta sampling (AMS) [13], and our TAMML is simpler because it does not require training an extra sampler.

Results on OpenSLR. To verify the effectiveness of our method under very low-resource conditions, we pre-trained the model on OpenSLR-9 and fine-tuned it on four target languages. Table 2 reports the results on OpenSLR-9. TAMML further improves MML-ASR by over 7\(\%\) in relative CER. Moreover, TAMML-Light achieves performance similar to that of TAMML.
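For readers who want to reproduce the metric, the sketch below computes CER from an edit distance and the relative reduction between two systems. It is a generic illustration, not the scoring pipeline used in the experiments, and the CER values in the example are hypothetical rather than taken from Table 2.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


example_cer = cer("kitten", "sitting")        # 3 edits over 6 reference characters
example_rel = relative_reduction(20.0, 18.5)  # hypothetical CERs in percent
```

A drop from a hypothetical 20.0\(\%\) to 18.5\(\%\) CER is a 1.5-point absolute reduction but a 7.5\(\%\) relative reduction, which is the kind of figure quoted in the text.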

Table 2 Results of low-resource ASR on OpenSLR-9 in terms of CER (\(\%\))

Results on Common Voice Combined with WavLM. Large-scale pre-trained models have continuously shown superior performance, so we conducted an experiment on Common Voice to explore whether meta-learning still works when combined with a large-scale model. We use two types of features for contrast: 43-dimensional MFCC features and 1024-dimensional features extracted from WavLM-large. Table 3 presents the WER of six target languages on CV-6.

MFCC features performed poorly and made it almost impossible for the model to learn useful knowledge. In contrast, WavLM features significantly enhanced the performance of monolingual training. MML-ASR was still effective, although its improvement was smaller than without the pretrained model. Furthermore, TAMML-Light and TAMML still outperformed MML-ASR by relative improvements of 4\(\%\) and 5\(\%\), respectively.

Table 3 Results of low-resource ASR combined with WavLM on CV-6 in terms of WER (\(\%\))

Statistical Significance Test. The superior performance of one system compared to another may be a result of the random nature of the test data rather than an accurate reflection of the systems’ true performance. To reach a more reliable conclusion, it is essential to test the statistical significance of the difference between the two systems. We applied three significance tests using the SCTK toolkit provided by NIST: the matched-pair sentence segment word error (MP) test [25], the signed paired comparison speaker word accuracy (SI) test [26], and the Wilcoxon signed-rank speaker word accuracy (WI) test [26]. To account for the potential influence of speakers, we treated each sample as belonging to a distinct speaker during the calculations. First, we aligned the recognition results with the reference transcripts using the sclite tool within SCTK. Subsequently, the alignment results from both methods were passed to the sc_stats tool for testing. Table 4 presents the test results comparing TAMML and TAMML-Light against MML-ASR on six target languages.

All three statistical significance tests indicate that, at a significance level of 5\(\%\), there is a significant difference between the recognition results of TAMML-ASR and MML-ASR. However, there is no significant difference between TAMML-Light and MML-ASR in Catalan and Kabyle in terms of the WI metric. This is mainly because WI focuses on the rank ordering of differences and is less sensitive to their specific numerical values. In addition, TAMML-Light performs relatively poorly in Catalan and Kabyle, which aligns with our experimental results in Table 3. Overall, the results of the statistical significance tests are consistent with the WER results, providing further evidence of the effectiveness of our method.
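The idea behind a sign-based paired comparison can be sketched with an exact two-sided sign test. This is a generic textbook version written for illustration; it is not the SCTK implementation, whose exact procedure and tie handling may differ, and the per-speaker accuracies below are hypothetical.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Exact two-sided sign test on paired scores; ties are discarded."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of at most k wins for the weaker side under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical per-speaker word accuracies for two systems.
system_a = [0.81, 0.79, 0.85, 0.90, 0.77, 0.83, 0.88, 0.80, 0.86, 0.74]
system_b = [0.78, 0.75, 0.84, 0.87, 0.76, 0.80, 0.85, 0.79, 0.83, 0.75]
p_value = sign_test_p_value(system_a, system_b)
significant = p_value < 0.05
```

System A wins on nine of the ten speakers, giving p = 22/1024 (about 0.021), so the difference is significant at the 5\(\%\) level in this toy case. Because only the sign of each difference is used, the test ignores how large the individual gains are.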

In summary, we thoroughly evaluated TAMML and TAMML-Light using datasets of varying quality, different numbers of languages, varying amounts of data, and in combination with a large-scale pre-trained model. Across these experimental settings, TAMML and TAMML-Light achieved consistent improvements for low-resource target languages, showing superior generalization and robustness.

Table 4 Statistical significance test between different methods

5.2 Ablation Studies

Different Meta-Learning Methods. The preceding experiments were based on FOMAML because of its good performance and high efficiency. We also analyzed the performance of other gradient-based meta-learning methods, namely MAML [27] and ANIL [28]. Table 5 reports the CER of three target languages in OpenSLR when using different meta-learning methods. It can be observed that MAML outperforms FOMAML and ANIL, and that TAMML-Light improves all meta-learning methods.

Comparison Among Different Weight-Adjusting Methods. We further compare the performance of different weight-adjusting methods: (1) the weights of all tasks are equal and set to 1/N (MML-ASR); (2) the weights of all tasks are assigned at random (TWR); (3) the weights of disagreement tasks are increased (TWI); (4) the weights of disagreement tasks are set to 0, meaning that conflicting tasks are dropped (TWD); (5) our proposed method, TAMML-Light, which increases the weights of agreement tasks.

As shown in Table 5, TWR performs poorly because of its random weights, and TWD is ineffective because it discards some task information. TWI is slightly better than MML-ASR, indicating that inconsistent tasks may still contain useful knowledge. However, our TAMML-Light significantly outperforms the other methods, demonstrating that pushing the model parameters in a task-agreement direction helps the model acquire more meta-knowledge.
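Several of the compared strategies can be sketched as functions of the per-task agreement scores \(g_i^T g_v\). This is an illustration under assumptions: the normalization of the random weights and the equal sharing among retained tasks in the dropping strategy are choices made for this sketch, TWI is omitted because the text does not specify how much the weights are increased, and the scores are made up.

```python
import random
from typing import List


def uniform_weights(scores: List[float]) -> List[float]:
    """MML-ASR baseline: every task gets 1/N."""
    return [1.0 / len(scores)] * len(scores)


def random_weights(scores: List[float], rng: random.Random) -> List[float]:
    """TWR-style weights: random values normalized to sum to one (assumed)."""
    raw = [rng.random() for _ in scores]
    total = sum(raw)
    return [r / total for r in raw]


def drop_conflicting(scores: List[float]) -> List[float]:
    """TWD-style weights: zero for negative agreement, equal share otherwise (assumed)."""
    keep = [s >= 0.0 for s in scores]
    kept = sum(keep)
    if kept == 0:
        return uniform_weights(scores)
    return [1.0 / kept if k else 0.0 for k in keep]


def agreement_weights(scores: List[float]) -> List[float]:
    """Eq. (7): scores divided by the sum of their absolute values."""
    norm = sum(abs(s) for s in scores)
    return [s / norm for s in scores] if norm else uniform_weights(scores)


scores = [0.4, 0.2, -0.2]  # hypothetical values of g_i^T g_v for three tasks
w_uniform = uniform_weights(scores)
w_random = random_weights(scores, random.Random(0))
w_drop = drop_conflicting(scores)
w_agree = agreement_weights(scores)
```

With these scores the dropping strategy yields [0.5, 0.5, 0.0], while the agreement rule yields approximately [0.5, 0.25, -0.25]: it ranks the two agreeing tasks by how strongly they agree and still uses the conflicting task, with a negative sign.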

Table 5 Ablation study results on OpenSLR in terms of CER (\(\%\))

5.3 Method Analysis

Different Scales of Training Data. We evaluated the performance of TAMML and TAMML-Light using different proportions of SLR-86’s training data. As shown in Fig. 3, the performance of TAMML-Light is similar to that of TAMML under varied training data proportions, showing the stability of TAMML-Light’s performance.

Fig. 3 Different proportions of SLR-86’s training data

The Computational and Storage Cost of TAMML and TAMML-Light. We evaluated the computational cost and the performance of TAMML and TAMML-Light on OpenSLR-9. As shown in Table 6, TAMML-Light achieves performance similar to that of TAMML. At the same time, TAMML-Light reduces the average iteration time from 5.08 s to 4.67 s, which removes over 80\(\%\) of the extra computation time introduced by TAMML, and it also significantly reduces FLOPS and storage. Furthermore, unlike TAMML, the cost of TAMML-Light does not grow as the model’s depth increases. In deeper model structures, TAMML-Light can therefore save even more of the computational overhead incurred by TAMML.
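One way to read the 80\(\%\) figure is as the share of TAMML's added iteration time, measured against a baseline without task weighting, that TAMML-Light removes. The sketch below works through that arithmetic. The 5.08 s and 4.67 s values come from the text; the 4.60 s baseline is a hypothetical placeholder, since the baseline time is reported only in Table 6.

```python
def overhead_reduction(t_base: float, t_full: float, t_light: float) -> float:
    """Share of the time added by the full method that the light variant removes."""
    added = t_full - t_base
    if added <= 0.0:
        raise ValueError("the full method must be slower than the baseline")
    return (t_full - t_light) / added


def min_baseline_for(t_full: float, t_light: float, share: float) -> float:
    """Smallest baseline time for which the removed share is at least `share`."""
    return t_full - (t_full - t_light) / share


# 4.60 s is a hypothetical baseline; 5.08 s and 4.67 s are quoted in the text.
example_share = overhead_reduction(t_base=4.60, t_full=5.08, t_light=4.67)
implied_baseline = min_baseline_for(t_full=5.08, t_light=4.67, share=0.8)
```

Under this reading, a removed share above 80\(\%\) requires the baseline iteration time to be at least about 4.57 s, and the hypothetical 4.60 s baseline gives a share of roughly 85\(\%\).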

Table 6 Performance comparison and resource analysis

Visualization of Model Learning Dynamics. The dynamics of the model training process are depicted in Fig. 4. It can be observed that TAMML-Light and TAMML accelerate the training process and reduce the number of epochs needed for convergence by over 20\(\%\) and 30\(\%\), respectively.

Fig. 4 Visualization of training process

6 Conclusion

In this work, we adapt the gradient agreement algorithm to multilingual meta-learning and validate its effectiveness. To reduce the computational cost, we further propose TAMML-Light. Extensive experimental results demonstrate that TAMML-Light effectively enhances the few-shot learning ability of meta-learning while considerably reducing computation and storage costs. In the future, we plan to adjust the task gradients adaptively in meta-learning.