Lifelong Language Pretraining with Distribution-Specialized Experts

Chen, Wuyang; Zhou, Yanqi; Du, Nan; Huang, Yanping; Laudon, James; Chen, Zhifeng; Cui, Claire

arXiv:2305.12281 (cs)

[Submitted on 20 May 2023]

Abstract: Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only introducing a limited number of extra experts while keeping the computation cost constant, our model can steadily adapt to data distribution shifts while preserving the previous knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
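The two ideas named in the abstract, adding experts while keeping per-token computation constant and regularizing so earlier behavior is preserved, can be illustrated with a toy sketch. This is not the paper's implementation: `ToyMoE`, `add_experts`, and `preservation_penalty` are hypothetical names, the experts are plain linear scorers, and the squared-difference penalty is a generic stand-in for whatever regularizer the authors use. The only point shown is that with top-k routing the number of experts evaluated per input depends on `top_k`, not on how many experts exist.

```python
import copy
import math
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class ToyMoE:
    """Toy mixture of linear experts with top-k routing (illustrative only)."""

    def __init__(self, dim, num_experts, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.gates = []    # one gating vector per expert
        self.experts = []  # one weight vector per expert
        self.add_experts(num_experts)

    def add_experts(self, n):
        """Grow capacity: append n new experts and their gating vectors."""
        for _ in range(n):
            self.gates.append([self.rng.uniform(-1, 1) for _ in range(self.dim)])
            self.experts.append([self.rng.uniform(-1, 1) for _ in range(self.dim)])

    def forward(self, x, top_k=1):
        """Return (output, number of experts actually evaluated)."""
        scores = [dot(g, x) for g in self.gates]
        chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        # Softmax over the chosen experts only, shifted for numerical stability.
        peak = max(scores[i] for i in chosen)
        weights = [math.exp(scores[i] - peak) for i in chosen]
        total = sum(weights)
        output = sum(w / total * dot(self.experts[i], x) for w, i in zip(weights, chosen))
        return output, len(chosen)


def preservation_penalty(old_model, new_model, inputs, top_k=1):
    """Assumed regularizer: mean squared gap between old and new outputs."""
    gaps = [
        (new_model.forward(x, top_k)[0] - old_model.forward(x, top_k)[0]) ** 2
        for x in inputs
    ]
    return sum(gaps) / len(gaps)


moe = ToyMoE(dim=4, num_experts=2)
snapshot = copy.deepcopy(moe)  # frozen copy of the model before expansion
moe.add_experts(2)             # capacity grows from 2 to 4 experts
sample = [0.5, -0.2, 0.1, 0.9]
print(len(moe.experts), moe.forward(sample)[1])
print(preservation_penalty(snapshot, moe, [sample]))
```

In this sketch, a training loop would add the penalty to the loss on new data, so the expanded model is discouraged from drifting away from the snapshot on inputs from earlier distributions.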

Comments:	ICML 2023
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2305.12281 [cs.CL]
	(or arXiv:2305.12281v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.12281

Submission history

From: Wuyang Chen
[v1] Sat, 20 May 2023 21:15:19 UTC (2,385 KB)
