Towards Effective and Efficient Continual Pre-training of Large Language Models

Chen, Jie; Chen, Zhipeng; Wang, Jiapeng; Zhou, Kun; Zhu, Yutao; Jiang, Jinhao; Min, Yingqian; Zhao, Wayne Xin; Dou, Zhicheng; Mao, Jiaxin; Lin, Yankai; Song, Ruihua; Xu, Jun; Chen, Xu; Yan, Rui; Wei, Zhewei; Hu, Di; Huang, Wenbing; Wen, Ji-Rong

arXiv:2407.18743 (cs)

[Submitted on 26 Jul 2024]

Abstract: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report on continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original ones, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets. Specifically, we synthesize multidisciplinary scientific question and answer (QA) pairs based on related web pages, and subsequently incorporate this synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present tuning experiments with a relatively small model, TinyLlama, and employ the derived findings to train the backbone model. Extensive experiments on a number of evaluation benchmarks show that our approach substantially improves the performance of the backbone models, including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting the original capacities. Our model, data, and code are available at this https URL.

Comments:	16 pages, 10 figures, 16 tables
Subjects:	Computation and Language (cs.CL)
MSC classes:	68T50
ACM classes:	I.2.7
Cite as:	arXiv:2407.18743 [cs.CL]
	(or arXiv:2407.18743v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.18743

Submission history

From: Jie Chen
[v1] Fri, 26 Jul 2024 13:55:21 UTC (880 KB)
