Devang Kulshreshtha, Saket Dingliwal, Brady Houston, Nikolaos Pappas, Srikanth Ronanki

Sequential Editing for Lifelong Training of Speech Recognition Models

Abstract

Automatic Speech Recognition (ASR) traditionally assumes known domains, but adding data from a new domain raises concerns about the computational cost of retraining models on both existing and new domains. Fine-tuning solely on the new domain risks Catastrophic Forgetting (CF). To address this, Lifelong Learning (LLL) algorithms have been proposed for ASR. Prior research has explored techniques such as Elastic Weight Consolidation, Knowledge Distillation, and Replay, all of which necessitate either additional parameters or access to prior domain data. We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. Unlike previous methods, our approach requires neither access to prior datasets nor the introduction of extra parameters. Our study demonstrates up to 15% Word Error Rate Reduction (WERR) over the fine-tuning baseline, and superior efficiency over other LLL techniques, on the CommonVoice English multi-accent dataset.

keywords:

speech recognition, lifelong learning, model editing, multi-accent ASR

1 Introduction

Recently, the field of speech recognition (and machine learning/AI in general) has trended toward large foundational models trained on very large, diverse datasets covering many domains. Despite this trend, it is still common in industrial settings, after a foundational/base model has been initially trained, to gradually train it on new domains or categories. In multidialect ASR, for example, this could mean improving the performance of the model on a single dialect (or subset of dialects) or adding a previously unseen dialect/accent to the model. Both of these goals would typically be achieved by fine-tuning the base/foundational model, possibly with the addition/substitution of some model parameters, on new training data.

Fine-tuning often comes with a cost: the model's performance on the domains seen during earlier training can degrade due to catastrophic forgetting [1]. Returning to the example of adding a new dialect/accent to a multidialect ASR model, this degradation can be troublesome if the model is expected to perform well on the new dialect and on all previously seen dialects. A common way to mitigate the catastrophic forgetting observed when fine-tuning on a new dialect is to re-train the model with both the new data and the already-seen data. This, of course, can be extremely costly, especially in the era of very large models and training datasets. In addition, the previously seen data is not always available in practical settings. Lifelong learning (or continual learning) approaches have been shown to alleviate this catastrophic forgetting effect in a wide variety of machine learning models and tasks, including ASR.

The simplest lifelong learning approach is Experience Replay: when a new domain is added via fine-tuning, a subset of the original training data is also included [2] (or possibly all of the training data [3]). However, this comes with the obvious downside of being inefficient, as each new domain being added requires more and more replay data. Adding multi-task training objectives to encourage the model to retain information on previous domains, as in Elastic Weight Consolidation [1] and Knowledge Distillation, can also be effective in ASR [4] and other tasks [5], but these objectives may show limited ability to scale to many new domains. The above approaches also require either additional parameters or previous domain data to mitigate Catastrophic Forgetting (CF). More recently, several studies [6, 7, 8] investigated the manipulation and/or combination of fine-tuned model parameters with base model parameters for creating multi-task models, avoiding the need to re-use data or implement more complicated multi-task training approaches. However, these methods tend to degrade in quality when applied to a large number of tasks [8].

Figure 1: Sequential Model Editing for Lifelong ASR. At each time step $t$, the current model $\theta_{t-1}$ is fine-tuned on data $\mathcal{D}_{t}$ to obtain $\hat{\theta_{t}}$. Then the task vector $\tau_{t}$ is computed. Finally, the new model is obtained by merging $\tau_{t}$ with $\theta_{t-1}$ as $\theta_{t}=\theta_{t-1}+\lambda\cdot(\tau_{t})$.

In this paper, we explore model editing approaches for sequential training of ASR models that overcome limitations in prior work in terms of requiring additional parameters or access to prior domain data. Specifically, we investigate sequential editing of the original model continually trained on all the previous datasets. Here, we focus on two representative methods, namely Task Arithmetic [7], which uses basic arithmetic operations to combine checkpoints from different tasks, and TIES-Merging [8], which addresses issues that arise when task vectors (i.e. models) are combined, such as sign conflicts and small weights. These approaches have been explored for out-of-domain generalization [7, 9], multi-task learning [7, 8], and transfer learning [10], but not for lifelong learning to the best of our knowledge. At every sequential step, we assume access only to the new data source, which is a challenging setting for existing continual learning methods [11]. For evaluation, we focus on effectively learning multiple English dialects in an incremental fashion while preserving the performance on previously seen dialects on the CommonVoice English dataset. The main contributions can be summarized as follows:

• We propose a novel Sequential Model Editing approach that can be used for lifelong training of ASR models without relying on prior domain datasets or additional training and/or parameters.
• Our approach results in 15% WER-reduction (WERR) over the fine-tuning baseline on CommonVoice English multi-accent dataset, compared to 6% WERR achieved by previously proposed baselines.

Algorithm 1 Sequential Model Editing for Lifelong ASR

1: Input: Data sources $\mathcal{D}=[\mathcal{D}_{0},\mathcal{D}_{1},\dots,\mathcal{D}_{T}]$, $\lambda$, and $k$
2: Output: Lifelong ASR model $\theta^{*}$
3: Init: $\theta_{0}\leftarrow$ Train on $\mathcal{D}_{0}$
4: for $t\leftarrow 1$ to $T$ do
5:     $\hat{\theta_{t}}\leftarrow$ Fine-tune $\theta_{t-1}$ on $\mathcal{D}_{t}$
6:     $\tau_{t}\leftarrow\hat{\theta_{t}}-\theta_{t-1}$    $\triangleright$ Task Arithmetic
7:     if TIES-Merging then
8:         $\tau_{t}\leftarrow$ ties_merging_procedure$(\tau_{t},k)$
9:     end if
10:     $\theta_{t}\leftarrow\theta_{t-1}+\lambda\cdot(\tau_{t})$
11: end for
12: $\theta^{*}\leftarrow\theta_{T}$

2 Lifelong Learning for ASR

ASR systems often consist of a Conformer-based CTC model [12, 13] that takes in an audio sequence and outputs a text sequence. These models are typically trained with paired audio and text data.

Let $\theta_{0}$ be the parameters of such a model, trained on a large set of audio-text pairs represented by $\mathcal{D}_{0}$. In many practical scenarios, the abilities of the model are expanded by targeting new domains/accents/languages. More formally, let $\mathcal{D}_{1},\ldots,\mathcal{D}_{T}$ be a sequence of $T$ data sources that are incrementally used to update the model, with each dataset targeting a particular domain. As the capabilities of the ASR model expand to these new domains sequentially, it is desirable to retain the performance on the older domains. Also, with the recent use of massive volumes of data for training, it often becomes prohibitively difficult to store and maintain all the data sources. Similarly, for some publicly available models, training datasets are not released or are behind paywalls, thereby making it challenging to adapt the model to new domains without catastrophic forgetting.

To address these practical challenges, we define the goal of Lifelong Learning (LLL) for ASR as learning the optimal model parameters $\theta^{*}$ that perform well on all the data sources $\mathcal{D}_{0}\ldots\mathcal{D}_{T}$, where the data sources are obtained sequentially and, at any time step $t$, only the data source $\mathcal{D}_{t}$ is accessible, with no access to past or future data sources. This constraint rules out the trivial multi-task solutions, which assume simultaneous access to data from all domains to train the model. Therefore, we propose a novel sequential model-editing based approach, as summarized in Algorithm 1 and Figure 1.

3 Sequential Model Editing

Model editing refers to the paradigm of adding new functionality and behaviors to pre-trained neural models by manipulating the parameters or outputs, without the need for expensive retraining. In particular, [7] defines editing neural networks based on task vectors, which encode the information necessary to do well on a given task. They obtain such vectors by taking the weights of a model fine-tuned on a task and subtracting the corresponding pre-trained weights. They show that performing simple arithmetic operations on these task vectors can adapt a model to a new task or negate an undesirable behavior. In this work, we propose a novel Sequential Model Editing approach that leverages task vectors to expand the abilities of our ASR model without the need to access old data or to introduce any training loss functions and/or additional model parameters.

At any given stage $t$ of the update of the ASR model, we have access to the model parameters $\theta_{t-1}$ and the new domain/accent with data source $\mathcal{D}_{t}$. The objective is to learn a new set of model parameters $\theta_{t}$ that perform well on the new domain/accent while maintaining the model's original capabilities. Therefore, the problem can be simplified to learning a task vector $\tau_{t}$ for the new domain/accent and then leveraging model editing to update the ASR model. The task vector $\tau_{t}$ represents information specific to the new domain that was missing in the previous model checkpoint $\theta_{t-1}$. First, we fine-tune the model on the new data source to arrive at an intermediate model checkpoint $\hat{\theta_{t}}$. Next, we explore two different ways of creating task vectors from this checkpoint. These two versions of Sequential Model Editing are defined as follows:

(1) Task Arithmetic [7]: In this version, the task vector is defined by simply taking the element-wise difference between $\hat{\theta_{t}}$ and $\theta_{t-1}$ , i.e., $\tau_{t}=\hat{\theta_{t}}-\theta_{t-1}$ .

(2) TIES-Merging [8]: Since the number of parameters in a model can be very large, the task vector in the previous version is equally large, and many of its values are of low magnitude. Therefore, redundant parameters are removed from $\tau_{t}$ in this version. Specifically, the top-$k$% of values by magnitude are retained, while the bottom $(100-k)$% are set to 0. Although the TIES-Merging procedure [8] involves more complex operations to create the final aggregate task vector when multiple tasks are involved, we omit those details in this work as our sequential model editing procedure involves only a single task at a time.
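The trimming step described above can be illustrated with a small, hedged sketch in plain Python. The function name `trim_task_vector`, the `keep_fraction` argument, the parameter names, and the representation of a task vector as a dictionary of flat lists are all assumptions made for illustration. The sketch also assumes a single magnitude threshold computed over all parameters jointly, which the text does not specify.

```python
def trim_task_vector(tau, keep_fraction):
    """Zero all but the largest-magnitude entries of a task vector.

    tau: dict mapping a (hypothetical) parameter name to a flat list of floats.
    keep_fraction: fraction of entries to retain, in (0, 1].
    """
    # Collect the magnitudes of every entry, largest first.
    magnitudes = sorted(
        (abs(x) for values in tau.values() for x in values), reverse=True
    )
    n_keep = max(1, round(keep_fraction * len(magnitudes)))
    # Magnitude of the n_keep-th largest entry acts as a global threshold.
    threshold = magnitudes[n_keep - 1]
    # Entries tied with the threshold are all kept.
    return {
        name: [x if abs(x) >= threshold else 0.0 for x in values]
        for name, values in tau.items()
    }


# Toy task vector with six entries; half of them are retained.
tau = {
    "encoder.weight": [0.50, -0.01, 0.02, -0.40],
    "encoder.bias": [0.03, -0.30],
}
trimmed = trim_task_vector(tau, keep_fraction=0.5)
print(trimmed)
```

In this toy case the three largest-magnitude entries (0.50, -0.40, -0.30) survive and the rest become zero, while the signs of the retained entries are unchanged.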

Finally, the task vector created by either of the two versions is added back to the model to create the final checkpoint $\theta_{t}$ :

\theta_{t}=\theta_{t-1}+\lambda\cdot(\tau_{t})

$\lambda=1$ corresponds to fine-tuning, which leads to Catastrophic Forgetting, while $\lambda=0$ means no model update. The optimal $\lambda$ is chosen via held-out validation, balancing these two extremes.
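The loop in Algorithm 1 and the two extremes of $\lambda$ can be sketched as follows. This is a minimal illustration and not the training code used in the paper: `fine_tune` is a hypothetical callable standing in for real ASR fine-tuning, model parameters are represented as dictionaries of flat lists, and the toy "datasets" are simply target parameter values that the stand-in fine-tuning step jumps to.

```python
def task_vector(finetuned, base):
    """tau_t = theta_hat_t - theta_{t-1}, computed element-wise."""
    return {n: [f - b for f, b in zip(finetuned[n], base[n])] for n in base}


def apply_edit(base, tau, lam):
    """theta_t = theta_{t-1} + lam * tau_t."""
    return {n: [b + lam * t for b, t in zip(base[n], tau[n])] for n in base}


def sequential_editing(theta_0, datasets, fine_tune, lam, trim=None):
    """Run one edit per data source; only the current source is used."""
    theta = theta_0
    for data in datasets:
        theta_hat = fine_tune(theta, data)   # intermediate checkpoint
        tau = task_vector(theta_hat, theta)  # Task Arithmetic
        if trim is not None:                 # optional TIES-style trimming
            tau = trim(tau)
        theta = apply_edit(theta, tau, lam)
    return theta


# Toy stand-in for fine-tuning: jump straight to a per-domain optimum.
def toy_fine_tune(theta, target):
    return {n: list(target[n]) for n in theta}


theta_0 = {"w": [0.0, 0.0]}
targets = [{"w": [2.0, 0.0]}, {"w": [0.0, 4.0]}]

no_update = sequential_editing(theta_0, targets, toy_fine_tune, lam=0.0)
plain_ft = sequential_editing(theta_0, targets, toy_fine_tune, lam=1.0)
edited = sequential_editing(theta_0, targets, toy_fine_tune, lam=0.5)
print(no_update, plain_ft, edited)
```

With `lam=0.0` the model never moves from `theta_0`, with `lam=1.0` it ends at the last fine-tuned checkpoint (plain sequential fine-tuning), and intermediate values keep part of what each earlier step contributed.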

Similarities with Model Averaging for ASR: Weight averaging is a well-known technique in lifelong learning for ASR [14, 15], where a weighted average of a previous and adapted model is computed. Task Arithmetic can be seen as a form of weight averaging, but our model editing approach extends this by incorporating additional operations, similar to TIES-Merging, and other advanced editing techniques ([16, 17]).
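To make the connection to weight averaging explicit, the Task Arithmetic update can be expanded as below. This short derivation assumes the untrimmed task vector $\tau_{t}=\hat{\theta_{t}}-\theta_{t-1}$; once entries of $\tau_{t}$ are zeroed, as in the TIES-Merging variant, the identity no longer holds for every parameter.

```latex
\theta_{t}
  = \theta_{t-1}+\lambda\cdot(\hat{\theta_{t}}-\theta_{t-1})
  = (1-\lambda)\cdot\theta_{t-1}+\lambda\cdot\hat{\theta_{t}}
```

Under that assumption, the edited model is a weighted average of the previous and the fine-tuned checkpoints, with weight $\lambda$ on the fine-tuned one.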

4 Experiments

4.1 Data

We use the CommonVoice English ASR data [18] partitioned by accents. We adopt the data settings outlined in [19] and retrieve the data from their open-source GitHub repository (https://github.com/StevenVdEeckt/online-cl-for-asr). Our lifelong learning experiments incrementally improve ASR performance across six accents: US, ENG, AUS, IND, SCO, and IRE. The initial model $\theta_{0}$ is trained on US data ( $\mathcal{D}_{0}$ ), and accents are added in the order US $\rightarrow$ ENG $\rightarrow$ AUS $\rightarrow$ IND $\rightarrow$ SCO $\rightarrow$ IRE. This order starts with the largest dataset (US) to create a strong base model, and subsequent accents are sequenced randomly. Dataset specifics are shown in Table 1.

Notation	Accent	Country	Train	Dev	Test
$\mathcal{D}_{0}$	US	United States	470	1.4	1.6
$\mathcal{D}_{1}$	ENG	England	152	1.2	1.2
$\mathcal{D}_{2}$	AUS	Australia	78	1	1.4
$\mathcal{D}_{3}$	IND	India	104	1.3	1.6
$\mathcal{D}_{4}$	SCO	Scotland	17	1	1.3
$\mathcal{D}_{5}$	IRE	Ireland	10	1	1.4

Table 1: CV English dataset duration (hrs) per accent.

4.2 Model Architecture

We use a 12-layer CTC Conformer model with 8 self-attention heads, a 1024-dimensional feedforward layer, and an input/output size of 80, following the approach of [12]. The models directly predict subword targets derived from a sentence-piece model trained on the initial US dialect data $\mathcal{D}_{0}$, with a total vocabulary size of 512. The initial training on the $\mathcal{D}_{0}$ data source spans 60 epochs with a learning rate of 5e-3. Subsequently, for every addition of new accent data, the models undergo an additional 10 epochs of training with a reduced learning rate of 5e-4. To enhance ASR inference, a 4-gram language model is trained on combined data from all accents and employed during beam search. We use the ESPnet library [20] for ASR and KenLM [21] for LM training. All models are updated with the Adam optimizer with a weight decay of 0.1.

Model Editing: We assign $\lambda=0.4$ for Task Arithmetic, and for TIES-Merging, we set $\lambda=0.6$ and $k=0.5$ consistently across all time steps. These specific values are determined through evaluation on the development set at stage $t=1$ , and we maintain them unchanged for future stages. Although we did not explore varying $\lambda$ or $k$ for each stage here, such an exploration remains a potential avenue for future research.
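Selecting $\lambda$ on a development set amounts to a small grid search over candidate values. The sketch below illustrates this under stated assumptions: `select_lambda` and `dev_score` are hypothetical names, the candidate grid is illustrative, and the toy scoring function is a stand-in for a real dev-set WER evaluation rather than anything measured in this work.

```python
def select_lambda(theta_prev, tau, candidates, dev_score):
    """Return the candidate lambda whose edited model scores lowest on dev."""
    def edited(lam):
        # theta = theta_prev + lam * tau, element-wise
        return {
            n: [p + lam * d for p, d in zip(theta_prev[n], tau[n])]
            for n in theta_prev
        }
    return min(candidates, key=lambda lam: dev_score(edited(lam)))


# Toy setup: one scalar weight; the old domains prefer w = 0, the new one w = 1.
theta_prev = {"w": [0.0]}
tau = {"w": [1.0]}


def toy_dev_score(theta):
    # Stand-in for dev WER: error on old domains is weighted twice as heavily.
    w = theta["w"][0]
    return 2.0 * (w - 0.0) ** 2 + (w - 1.0) ** 2


best_lambda = select_lambda(
    theta_prev, tau, candidates=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    dev_score=toy_dev_score,
)
print(best_lambda)
```

In practice `dev_score` would decode the development sets with the edited checkpoint; the same routine could be rerun at each stage if a per-stage $\lambda$ were desired.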

Method		US	ENG	AUS	IND	SCO	IRE	AWER	WERR (%)
Baselines	Fine-tune	13.2	11.5	8.9	16.9	9	7.8	11.2	-
	UOE	12.3	10.9	8.4	15.4	8.2	7.5	10.5	6.3
	CLRL-Tuning	12.9	12	9.7	18	9.2	8	11.6	-3.5
Model Editing	Task Arithmetic	12.1	9.8	9	14.8	9.1	6.4	10.2	9.1
Model Editing	TIES-Merging	11.3	8.8	8.2	14.2	8.8	5.9	9.5	15
Oracle	Sep. Model	12.9	9.8	6.3	12.1	7.7	7.1	9.3	16.8
Oracle	Multi-task	13	9.6	6.2	13.4	7.3	7.2	9.4	15.7

Table 2: WER ($\downarrow$) on the CV English test set after learning the six tasks (i.e. accents) in sequence.

4.3 Baselines

• Fine-tuning: This involves fine-tuning the previous checkpoint on new accent data, and is expected to be highly susceptible to catastrophic forgetting (CF).
• Randomly Layer-wise (CLRL) Tuning [22]: This approach randomly fine-tunes only $M<N$ out of $N$ encoder layers on new data while keeping the remaining $N-M$ layers frozen to mitigate CF. We set $M=1$ as it yields optimal results based on the referenced paper.
• Update Only Encoders (UOE) [23]: This method involves updating only linear layers of Conformer encoders to prevent CF during incremental domain adaptation. Here, linear layers refer specifically to the weight matrices of the Feedforward Network and attention module within a Conformer block.
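As a rough illustration of the layer-wise baseline, the random choice of tunable encoder layers can be sketched as below. The function name `pick_trainable_layers` and the seed are hypothetical, and the setting of one tunable layer out of 12 is an assumption based on the 12-layer encoder used in this work; the referenced method may select layers differently in detail.

```python
import random


def pick_trainable_layers(num_layers, num_trainable, seed=None):
    """Randomly choose which encoder layers are fine-tuned; the rest stay frozen."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_layers), num_trainable))


num_layers = 12
trainable = pick_trainable_layers(num_layers, num_trainable=1, seed=0)
frozen = [i for i in range(num_layers) if i not in trainable]
print("trainable:", trainable, "frozen:", frozen)
```

In a real training setup, the frozen indices would be used to disable gradient updates for the corresponding encoder blocks before fine-tuning on the new accent.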

Other conventional LLL methods like Experience Replay [2] require access to the old data at every stage and are therefore not directly comparable to our methods (which explicitly aim to relax this requirement). However, we do benchmark against two even stronger upper bounds (oracles): (1) Multi-task model, which trains the ASR model on the pooled data from all accents $\mathcal{D}_{combined}=\cup_{t=0}^{T}\mathcal{D}_{t}$. This helps to better understand the gap between our methods and the best method when all the datasets are available at every stage. (2) Sep. Model, which trains a separate ASR model for each accent dataset independently and hence uses more parameters than our methods.

4.4 Metrics

We report the WER per task, the average WER across seen accents (AWER), and the WER reduction (WERR %) across seen accents compared to the fine-tuning baseline. In this context, "seen" accents are those whose data source has been used in any stage of training. For instance, at time step $t=2$, the AWER is computed as the average of the WERs of the base accent (US) and the next two accents (ENG, AUS) on the test sets of $\mathcal{D}_{0},\mathcal{D}_{1},\mathcal{D}_{2}$, respectively.
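The two aggregate metrics can be computed as in the sketch below. The exact WERR formula is not spelled out in the text, so the relative reduction in AWER with respect to the fine-tuning baseline is an assumption here; the function names are also illustrative. The per-accent WERs are copied from the Fine-tune and TIES-Merging rows of Table 2.

```python
def average_wer(wers):
    """AWER: unweighted mean of the per-accent WERs over all seen accents."""
    return sum(wers) / len(wers)


def werr_percent(baseline_awer, system_awer):
    """Relative WER reduction (%) with respect to the baseline."""
    return 100.0 * (baseline_awer - system_awer) / baseline_awer


# Per-accent WERs after t=5 (US, ENG, AUS, IND, SCO, IRE), from Table 2.
fine_tune = [13.2, 11.5, 8.9, 16.9, 9.0, 7.8]
ties_merging = [11.3, 8.8, 8.2, 14.2, 8.8, 5.9]

awer_ft = average_wer(fine_tune)
awer_ties = average_wer(ties_merging)
werr = werr_percent(awer_ft, awer_ties)
print(round(awer_ft, 1), round(awer_ties, 1), round(werr, 1))
```

Under this assumed definition, the rounded values agree with the AWER (11.2 and 9.5) and WERR (15) reported for these two rows.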

Figure 2: Evolution of WER on seen accents for various approaches as new accents are added incrementally.

5 Results

5.1 Catastrophic Forgetting - Model Editing vs Baselines

Table 2 presents the results of the sequential lifelong learning experiments after the last step ( $t=5$ ), i.e. after exposure to all six English tasks (accents).

The average WER (AWER) across all dialects of the oracle multi-task model, obtained by training the ASR model on all data, is 9.4, while conventional sequential fine-tuning results in an AWER of 11.2. This gap underscores the catastrophic forgetting problem. The recently introduced Update Only Encoders (UOE) method [23] achieves a 6.3% WERR over the fine-tuning baseline. However, the CLRL-Tuning method [22] performs below this baseline for most accents. Both approaches are effective in scenarios with fewer tasks (see Section 5.2) but degrade as more tasks are added.

Notably, both of our sequential editing methods show improvement over the baseline. Task Arithmetic and TIES-Merging yield 9.1% and 15.0% WERR, respectively, almost reaching the performance of the oracle upper bound methods. The superiority of TIES-Merging, with lower overall WER than Task Arithmetic, underscores the importance of employing additional steps during merging of task vectors as proposed in [8].

5.2 Stage-wise Analysis

We evaluate the performance of various approaches at each time step of task addition, ranging from $t=1$ (introducing $\mathcal{D}_{1}$ ) to $t=5$ (introducing $\mathcal{D}_{5}$ ). The AWER at each time step is computed, and Figure 2 visually represents the results.

Our observations indicate that previously proposed methods, such as UOE and CLRL-Tuning, exhibit superiority over the baseline fine-tuning and even match the performance of our model editing approach in the initial stages up to $t=3$ . However, with the incorporation of additional accented data, these approaches start encountering the issue of forgetting, and the model editing approaches consistently outperform all baselines. This underscores the scalability of these model-editing approaches for sequential lifelong learning, with the potential for further enhancements as more accents are introduced.

5.3 Incremental improvements in Model Editing

To assess the incremental gains from model editing at each time step $t$, we compare the intermediate fine-tuned checkpoint $\hat{\theta}_{t}$ with the edited checkpoint $\theta_{t}$, where $\theta_{t}=\theta_{t-1}+\lambda(\tau_{t})$. Table 3 presents the AWER at each time step for both the fine-tuned and the edited (TIES-Merging) checkpoints. Note that this fine-tuned checkpoint differs from the conventional fine-tuning baseline, since it is fine-tuned from the previously edited checkpoint. The table shows that incorporating task vectors consistently improves performance over fine-tuning at all stages, with WERR gains between 1.5% and 14%.

Model/ $T$	$t=1$	$t=2$	$t=3$	$t=4$	$t=5$
Intermediate ( $\hat{\theta_{t}}$ )	11.7	14.2	14.1	12.8	10.8
Edited ( $\theta_{t}$ )	10.9	13.8	13.9	11	9.5
WERR (%)	3.4	2.8	1.5	14	12

Table 3: AWER at every time step for the intermediate fine-tuned vs. edited checkpoint.

5.4 Choosing the optimal $\lambda$ to mitigate CF

To analyze the impact of the scaling factor $\lambda$ in the Task Arithmetic technique, we conducted an ablation study by varying $\lambda$ during $t=2$ . The results, illustrated in Figure 3, reveal impacts across previous accents, new accents, and the average of both.

Notably, $\lambda=1$, or full fine-tuning, results in catastrophic forgetting for previous accents, leading to the worst performance. Conversely, this setting yields the best performance for the new accent. Intriguingly, a $\lambda$ value of 0.2 emerges as optimal, in contrast to the fixed value of 0.4 used for all Task Arithmetic-based model merging. This finding suggests the potential for tuning $\lambda$ separately for each time step and even tailoring it to different task vectors. We leave this avenue for further exploration in future work.

Figure 3: Variation in WER for previously seen accents vs. the new accent at $t=2$ for different values of $\lambda$.

6 Conclusion

We address the challenge of adapting Automatic Speech Recognition (ASR) to new domains through Lifelong Learning (LLL). Traditional methods face computational inefficiencies and the risk of Catastrophic Forgetting (CF) during fine-tuning. While previous LLL techniques exist, our study proposes Sequential Model Editing, a novel approach that does not require previous datasets or additional parameters. Empirical results show up to a 15% Word Error Rate Reduction (WERR) over the fine-tuning baseline and superior efficiency compared to other LLL techniques on the CV English multi-accent dataset. This approach effectively mitigates CF and maintains high performance across diverse domains.

One avenue for future research involves experimenting with varying values of the scaling factor $\lambda$ at different time steps, which could yield further improvements, as suggested by the ablation in Section 5.4. Another avenue is exploring the theoretical foundations that contribute to the superior performance of model editing in the context of lifelong learning. Additionally, we intend to explore recently proposed editing techniques, including Drop And REscale [16] and Soft Merging of Experts [17].

7 Acknowledgements

We thank Veera Raghavendra Elluru for his constant feedback during the course of this work, as well as during the rebuttal phase of the paper.

References

[1] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[2] D. Isele and A. Cosgun, “Selective experience replay for lifelong learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[3] B. Li, R. Pang, Y. Zhang, T. N. Sainath, T. Strohman, P. Haghani, Y. Zhu, B. Farris, N. Gaur, and M. Prasad, “Massively multilingual asr: A lifelong learning solution,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6397–6401.
[4] B. Houston and K. Kirchhoff, “Continual learning for multi-dialect acoustic models,” 2020.
[5] M. H. Phan, T.-A. Ta, S. L. Phung, L. Tran-Thanh, and A. Bouzerdoum, “Class similarity weighted knowledge distillation for continual semantic segmentation,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16 845–16 854.
[6] M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, and L. Schmidt, “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” 2022.
[7] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” 2023.
[8] P. Yadav, D. Tam, L. Choshen, C. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” 2023.
[9] X. Jin, X. Ren, D. Preotiuc-Pietro, and P. Cheng, “Dataless knowledge fusion by merging weights of language models,” ArXiv, vol. abs/2212.09849, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:254877510
[10] M. Matena and C. Raffel, “Merging models with fisher-weighted averaging,” ArXiv, vol. abs/2111.09832, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:244345933
[11] R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, and L. Page-Caccia, “Online continual learning with maximal interfered retrieval,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/15825aee15eb335cc13f9b559f166ee8-Paper.pdf
[12] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
[14] S. Vander Eeckt et al., “Rehearsal-free online continual learning for automatic speech recognition,” Proceedings Interspeech 2023, pp. 944–948, 2023.
[15] S. V. Eeckt et al., “Weight averaging: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition,” arXiv preprint arXiv:2210.15282, 2022.
[16] L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are super mario: Absorbing abilities from homologous models as a free lunch,” 2024.
[17] M. Muqeeth, H. Liu, and C. Raffel, “Soft merging of experts with adaptive routing,” 2023.
[18] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
[19] S. V. Eeckt et al., “Rehearsal-free online continual learning for automatic speech recognition,” arXiv preprint arXiv:2306.10860, 2023.
[20] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[21] K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, C. Callison-Burch, P. Koehn, C. Monz, and O. F. Zaidan, Eds. Edinburgh, Scotland: Association for Computational Linguistics, Jul. 2011, pp. 187–197. [Online]. Available: https://aclanthology.org/W11-2123
[22] Z. Wang, F. Hou, and R. Wang, “Clrl-tuning: A novel continual learning approach for automatic speech recognition.”
[23] Y. Takashima, S. Horiguchi, S. Watanabe, P. García, and Y. Kawaguchi, “Updating only encoders prevents catastrophic forgetting of end-to-end asr models,” 2022.