Matching domain experts by training from scratch on domain knowledge

Luo, Xiaoliang; Sun, Guangzhi; Love, Bradley C.

arXiv:2405.09395 (q-bio)

[Submitted on 15 May 2024 (v1), last revised 2 Jul 2024 (this version, v2)]

Abstract: Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model with a next-word prediction objective on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than LLMs trained on trillions of tokens, these small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded both when they were trained from scratch using a tokenizer specifically trained on neuroscience text and when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
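To give a sense of the scale mentioned in the abstract, the following is a minimal sketch that tallies the parameters of a GPT-2 style decoder and relates them to the 1.3 billion training tokens. It assumes the standard GPT-2 small configuration (12 layers, hidden size 768, context length 1024, vocabulary size 50257, tied input and output embeddings); these values are not stated in the abstract, and a model using the neuroscience-specific tokenizer may have a different vocabulary size and therefore a somewhat different count. The function name `gpt2_param_count` is illustrative only.

```python
def gpt2_param_count(vocab_size=50257, n_ctx=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 style decoder.

    Assumes tied input/output embeddings and a 4x MLP expansion,
    as in the standard GPT-2 small configuration (an assumption here).
    """
    token_emb = vocab_size * n_embd          # token embedding matrix
    pos_emb = n_ctx * n_embd                 # learned position embeddings
    layer_norm = 2 * n_embd                  # scale and bias
    # Attention: fused QKV projection plus output projection, with biases.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, with biases.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attn + mlp  # two layer norms per block
    # Final layer norm is applied once after the last block.
    return token_emb + pos_emb + n_layer * per_layer + layer_norm


total = gpt2_param_count()
tokens = 1.3e9  # training tokens reported in the abstract
print(f"{total:,} parameters")
print(f"{tokens / total:.1f} training tokens per parameter")
```

Under these assumptions the count comes to roughly 124 million, consistent with the "124M-parameter" figure, and the training set amounts to on the order of ten tokens per parameter.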

Comments:	ICML 2024 (Large Language Models and Cognition)
Subjects:	Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2405.09395 [q-bio.NC]
	(or arXiv:2405.09395v2 [q-bio.NC] for this version)
	https://doi.org/10.48550/arXiv.2405.09395

Submission history

From: Xiaoliang Luo
[v1] Wed, 15 May 2024 14:50:51 UTC (165 KB)
[v2] Tue, 2 Jul 2024 16:42:48 UTC (370 KB)
