Self-Alignment Pretraining for Biomedical Entity Representations
Fangyu Liu♣, Ehsan Shareghi♦,♣, Zaiqiao Meng♣, Marco Basaldella♥∗, Nigel Collier♣
♣ Language Technology Lab, TAL, University of Cambridge
♦ Department of Data Science & AI, Monash University
♥ Amazon Alexa
♣ {fl399, zm324, nhc30}@cam.ac.uk
♦ [email protected]   ♥ [email protected]
                                            scientific language                                  social media language
                                            NCBI         BC5CDR-d     BC5CDR-c     MedMentions   AskAPatient   COMETA
model                                       @1    @5     @1    @5     @1    @5     @1    @5      @1    @5      @1    @5
vanilla BERT (Devlin et al., 2019)          67.6  77.0   81.4  89.1   79.8  91.2   39.6  60.2    38.2  43.3    40.4  47.7
  + SapBERT                                 91.6  95.2   92.7  95.4   96.1  98.0   52.5  72.6    68.4  87.6    59.5  76.8
BioBERT (Lee et al., 2020)                  71.3  84.1   79.8  92.3   74.0  90.0   24.2  38.5    41.4  51.5    35.9  46.1
  + SapBERT                                 91.0  94.7   93.3  95.5   96.6  97.6   53.0  73.7    72.4  89.1    63.3  77.0
BlueBERT (Peng et al., 2019)                75.7  87.2   83.2  91.0   87.7  94.1   41.6  61.9    41.5  48.5    42.9  52.9
  + SapBERT                                 90.9  94.0   93.4  96.0   96.7  98.2   49.6  73.1    72.4  89.4    66.0  78.8
ClinicalBERT (Alsentzer et al., 2019)       72.1  84.5   82.7  91.6   75.9  88.5   43.9  54.3    43.1  51.8    40.6  61.8
  + SapBERT                                 91.1  95.1   93.0  95.7   96.6  97.7   51.5  73.0    71.1  88.5    64.3  77.3
SciBERT (Beltagy et al., 2019)              85.1  88.4   89.3  92.8   94.2  95.5   42.3  51.9    48.0  54.8    45.8  66.8
  + SapBERT                                 91.7  95.2   93.3  95.7   96.6  98.0   50.1  73.9    72.1  88.7    64.5  77.5
UmlsBERT (Michalopoulos et al., 2020)       77.0  85.4   85.5  92.5   88.9  94.1   36.1  55.8    44.4  54.5    44.6  53.0
  + SapBERT                                 91.2  95.2   92.8  95.5   96.6  97.7   52.1  73.2    72.6  89.3    63.4  76.9
PubMedBERT (Gu et al., 2020)                77.8  86.9   89.0  93.8   93.0  94.6   43.9  64.7    42.5  49.6    46.8  53.2
  + SapBERT                                 92.0  95.6   93.5  96.0   96.5  98.2   50.8  74.4    70.5  88.9    65.9  77.9

supervised SOTA                             91.1  93.9   93.2  96.0   96.6  97.2   OOM   OOM     87.5  -       79.0  -
PubMedBERT                                  77.8  86.9   89.0  93.8   93.0  94.6   43.9  64.7    42.5  49.6    46.8  53.2
  + SapBERT                                 92.0  95.6   93.5  96.0   96.5  98.2   50.8  74.4    70.5  88.9    65.9  77.9
  + SapBERT (Adapter, 13%)                  91.5  95.8   93.6  96.3   96.5  98.0   50.7  75.0†   67.5  87.1    64.5  74.9
  + SapBERT (Adapter, 1%)                   90.9  95.4   93.8† 96.5†  96.5  97.9   52.2† 74.8    65.7  84.0    63.5  74.2
  + SapBERT (fine-tuned)                    92.3  95.5   93.2  95.4   96.5  97.9   50.4  73.9    89.0† 96.2†   75.1 (81.1†)  85.5 (86.1†)
BioSyn                                      91.1  93.9   93.2  96.0   96.6  97.2   OOM   OOM     82.6  87.0    71.3  77.8
  + (init. w/) SapBERT                      92.5† 96.2†  93.6  96.2   96.8  98.4†  OOM   OOM     87.6  95.6    77.0  84.2
Table 1: Top: comparison of 7 BERT-based models before and after SapBERT pretraining (+ SapBERT). All results in this section are from unsupervised learning (not fine-tuned on task data). The gradient of green indicates the improvement over the base model (the darker, the larger). Bottom: SapBERT vs. SOTA results. Blue and red denote unsupervised and supervised models. Bold and underline denote the best and second-best results in each column. "†" denotes statistically significantly better than the supervised SOTA (t-test, p < 0.05). On COMETA, the results inside the parentheses add the supervised SOTA's dictionary back-off technique (Basaldella et al., 2020). "-": not reported in the SOTA paper. "OOM": out-of-memory (192GB+).
figurations specified in App. §B.4). For the other hyper-parameters used, please see App. §C.2.

Evaluation Data and Protocol. We experiment on 6 different English MEL datasets: 4 in the scientific domain (NCBI, Doğan et al. 2014; BC5CDR-c and BC5CDR-d, Li et al. 2016; MedMentions, Mohan and Li 2018) and 2 in the social media domain (COMETA, Basaldella et al. 2020 and AskAPatient, Limsopatham and Collier 2016). Descriptions of the datasets and their statistics are provided in App. §A. We report Acc@1 and Acc@5 (denoted as @1 and @5) for evaluating performance. In all experiments, SapBERT denotes further pretraining with our self-alignment method on UMLS. At the test phase, for all SapBERT models we use nearest neighbour search without further fine-tuning on task data (unless stated otherwise). Except for numbers reported in previous papers, all results are the average of five runs with different random seeds.
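To make the retrieval protocol concrete, the sketch below encodes mentions and dictionary names with a BERT encoder and ranks candidates by nearest-neighbour search. It is only an illustration, not the released implementation: the PubMedBERT checkpoint name is taken from Tab. 5, max_length follows Tab. 9, and the [CLS] pooling, dot-product similarity and helper names are assumptions.

# Illustrative nearest-neighbour evaluation (Acc@1/@5) for MEL.
# Assumptions: [CLS] pooling, dot-product similarity; checkpoint from Tab. 5.
import torch
from transformers import AutoModel, AutoTokenizer

CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def encode(names, batch_size=128):
    """Embed a list of strings with the encoder's [CLS] vector."""
    chunks = []
    for i in range(0, len(names), batch_size):
        batch = tokenizer(names[i:i + batch_size], padding=True,
                          truncation=True, max_length=25, return_tensors="pt")
        chunks.append(model(**batch).last_hidden_state[:, 0])
    return torch.cat(chunks)

def acc_at_k(mentions, gold_cuis, dict_names, dict_cuis, k=5):
    """Acc@1 / Acc@k by ranking every dictionary name for each mention."""
    scores = encode(mentions) @ encode(dict_names).T
    topk = scores.topk(k, dim=1).indices
    hit1 = hitk = 0
    for row, gold in zip(topk, gold_cuis):
        preds = [dict_cuis[j] for j in row.tolist()]
        hit1 += preds[0] == gold
        hitk += gold in preds
    return hit1 / len(mentions), hitk / len(mentions)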
Fine-Tuning on Task Data. The red rows in Tab. 1 are results of models (further) fine-tuned on the training sets of the six MEL datasets. Similar to pretraining, a positive pair list is generated by traversing the combinations of each mention and all ground truth synonyms, where mentions are from the training set and ground truth synonyms are from the reference ontology. We use the same optimiser and learning rates but train with a batch size of 256 (to accommodate the memory of 1 GPU). On scientific language datasets, we train for 3 epochs, while on AskAPatient and COMETA we train for 15 and 10 epochs respectively. For BioSyn on social media language datasets, we empirically found that 10 epochs work best. Other configurations are the same as in the original BioSyn paper.
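A minimal sketch of this positive-pair construction is given below; the variable names (e.g. ontology_synonyms) are illustrative rather than the names used in the actual training code.

# Build the fine-tuning positive-pair list: each training mention is paired
# with every synonym of its gold concept taken from the reference ontology.
def build_positive_pairs(train_examples, ontology_synonyms):
    """
    train_examples:    iterable of (mention_text, gold_cui)
    ontology_synonyms: dict mapping CUI -> list of synonym strings
    returns:           list of (anchor, positive, cui) pairs
    """
    pairs = []
    for mention, cui in train_examples:
        for synonym in ontology_synonyms.get(cui, []):
            pairs.append((mention, synonym, cui))
    return pairs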
3.2 Main Results and Analysis

*BERT + SapBERT (Tab. 1, top). We illustrate the impact of SapBERT pretraining over 7 existing BERT-based models (*BERT = {BioBERT, PubMedBERT, ...}). SapBERT obtains consistent improvement over all *BERT models across all datasets, with larger gains (by up to 31.0% absolute Acc@1 increase) observed in the social media domain. While SciBERT is the leading model before applying SapBERT, PubMedBERT+SapBERT performs the best afterwards.

SapBERT vs. SOTA (Tab. 1, bottom). We take PubMedBERT+SapBERT (w/wo fine-tuning) and compare against various published SOTA results (see App. §C.1 for a full listing of 10 baselines)
which all require task supervision. For the scientific language domain, the SOTA is BioSyn (Sung et al., 2020). For the social media domain, the SOTA are Basaldella et al. (2020) and GenRank (Xu et al., 2020) on COMETA and AskAPatient respectively. All these SOTA methods combine BERT with heuristic modules such as tf-idf, string matching and an information retrieval system (i.e. Apache Lucene) in a multi-stage manner.

Measured by Acc@1, SapBERT achieves new SOTA with statistical significance on 5 of the 6 datasets, and on the dataset where it is not significantly better (BC5CDR-c) it performs on par with the SOTA (96.5 vs. 96.6). Interestingly, on scientific language datasets, SapBERT outperforms the SOTA without any task supervision (fine-tuning mostly leads to overfitting and performance drops). On social media language datasets, unsupervised SapBERT lags behind the supervised SOTA by large margins, highlighting the well-documented complex nature of social media language (Baldwin et al., 2013; Limsopatham and Collier, 2015, 2016; Basaldella et al., 2020; Tutubalina et al., 2020). However, after fine-tuning on the social media datasets (using the MS loss introduced earlier), SapBERT outperforms the SOTA significantly, indicating that knowledge acquired during the self-aligning pretraining can be adapted to a shifted domain without much effort.
The Adapter Variant. As an option for parameter-efficient pretraining, we explore a variant of SapBERT using a recently introduced training module named Adapter (Houlsby et al., 2019). While maintaining the same pretraining scheme with the same SapBERT online mining + MS loss, instead of training from the full model of PubMedBERT, we insert new Adapter layers between the Transformer layers of the fixed PubMedBERT and only train the weights of these Adapter layers. In our experiments, we use the enhanced Adapter configuration by Pfeiffer et al. (2020). We include two variants where the trained parameters are 13.22% and 1.09% of those of the full SapBERT variant. The Adapter variants of SapBERT achieve comparable performance to full-model tuning on scientific datasets but lag behind on social media datasets (Tab. 1). The results indicate that more parameters are needed in pretraining for knowledge transfer to a shifted domain, in our case the social media datasets.
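The sketch below illustrates the bottleneck adapter idea of Houlsby et al. (2019): a small down-/up-projection with a residual connection whose parameters are the only ones updated while the pretrained encoder stays frozen. It is an illustration under assumptions (e.g. the bottleneck size and GELU activation are free choices); the Pfeiffer et al. (2020) configuration used for SapBERT differs in placement and initialisation details.

import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a (frozen) Transformer sub-layer."""
    def __init__(self, hidden_size=768, bottleneck=64):  # bottleneck size is a free choice
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # the residual keeps the frozen model's representation intact at initialisation
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def train_adapters_only(encoder, adapters):
    """Freeze the pretrained encoder; keep only adapter weights trainable."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True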
The Impact of Online Mining (Eq. (1)). As suggested in Tab. 2, switching off the online hard-pair mining procedure causes a large performance drop in @1 and a smaller but still significant drop in @5. This is due to the presence of many easy and already well-separated samples in the mini-batches. These uninformative training examples dominate the gradients and harm the learning process.

configuration           @1            @5
Mining switched-on      67.2          80.3
Mining switched-off     52.3 (↓14.9)  76.1 (↓4.2)

Table 2: This table compares PubMedBERT+SapBERT's performance with and without online hard mining on COMETA (zeroshot general).
Integrating SapBERT in Existing Systems. SapBERT can be easily inserted into existing BERT-based MEL systems by initialising the systems with SapBERT pretrained weights. We use the SOTA scientific language system, BioSyn (originally initialised with BioBERT weights), as an example and show that the performance is boosted across all datasets (last two rows, Tab. 1).
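In practice the integration amounts to pointing the existing system's encoder loader at the SapBERT weights instead of its original BERT checkpoint. The sketch below is illustrative only, and the checkpoint path is a placeholder for wherever the self-alignment-pretrained weights are stored.

from transformers import AutoModel, AutoTokenizer

SAPBERT_WEIGHTS = "path/to/sapbert_checkpoint"   # placeholder path

def load_mel_encoder(weights=SAPBERT_WEIGHTS):
    """Return a (tokenizer, encoder) pair initialised from SapBERT weights."""
    tokenizer = AutoTokenizer.from_pretrained(weights)
    encoder = AutoModel.from_pretrained(weights)
    return tokenizer, encoder

# e.g. hand `encoder` to BioSyn (or any other BERT-based MEL system)
# in place of the BioBERT encoder it would normally construct.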
4 Conclusion

We present SapBERT, a self-alignment pretraining scheme for learning biomedical entity representations. We highlight the consistent performance boost achieved by SapBERT, obtaining new SOTA on all six widely used MEL benchmarking datasets. Strikingly, without any fine-tuning on task-specific labelled data, SapBERT already outperforms the previous supervised SOTA (sophisticated hybrid entity linking systems) on multiple datasets in the scientific language domain. Our work opens new avenues to explore for general-domain self-alignment (e.g. by leveraging knowledge graphs such as DBpedia). We plan to incorporate other types of relations (i.e., hypernymy and hyponymy) and extend our model to sentence-level representation learning. In particular, our ongoing work using a combination of SapBERT and Adapter is a promising direction for tackling sentence-level tasks.

Acknowledgements

We thank the three reviewers and the Area Chair for their insightful comments and suggestions. FL is supported by a Grace & Thomas C.H. Chan Cambridge Scholarship. NC and MB would like to acknowledge funding from Health Data Research UK as part of the National Text Analytics project.
References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP), pages 356–364, Nagoya, Japan. Asian Federation of Natural Language Processing.
Marco Basaldella, Fangyu Liu, Ehsan Shareghi, and Nigel Collier. 2020. COMETA: A corpus for medical entity linking in the social media. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3122–3137, Online. Association for Computational Linguistics.
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267–D270.
Allan Peter Davis, Cynthia J Grondin, Robin J Johnson, Daniela Sciaky, Roy McMorran, Jolene Wiegers, Thomas C Wiegers, and Carolyn J Mattingly. 2019. The comparative toxicogenomics database: update 2019. Nucleic Acids Research, 47:D948–D954.
Allan Peter Davis, Thomas C Wiegers, Michael C Rosenstein, and Carolyn J Mattingly. 2012. MEDIC: a practical disease vocabulary used at the comparative toxicogenomics database. Database.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.
Kevin Donnelly. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics, 121:279.
Jennifer D'Souza and Vincent Ng. 2015. Sieve-based entity linking for the biomedical domain. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP) (Volume 2: Short Papers), pages 297–302, Beijing, China. Association for Computational Linguistics.
Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Russ R Salakhutdinov. 2005. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pages 513–520.
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing. arXiv:2007.15779.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
Zongcheng Ji, Qiang Wei, and Hua Xu. 2020. BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings, 2020:269.
Donghyeon Kim, Jinhyuk Lee, Chan Ho So, Hwisang Jeon, Minbyul Jeong, Yonghwa Choi, Wonjin Yoon, Mujeen Sung, and Jaewoo Kang. 2019. A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access, 7:73729–73740.
Robert Leaman and Zhiyong Lu. 2016. TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics, 32:2839–2846.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Sunwon Lee, Donghyeon Kim, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, et al. 2016. BEST: next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PloS One, 11:e0164680.
Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016.
Nut Limsopatham and Nigel Collier. 2015. Adapting phrase-based machine translation to normalise medical terms in social media messages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1675–1680, Lisbon, Portugal. Association for Computational Linguistics.
Nut Limsopatham and Nigel Collier. 2016. Normalising medical concepts in social media texts by learning semantic representation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1014–1023.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
George Michalopoulos, Yuanxin Wang, Hussam Kaka, Helen Chen, and Alex Wong. 2020. UmlsBERT: Clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus. arXiv preprint arXiv:2010.10391.
Sunil Mohan and Donghui Li. 2018. MedMentions: A large biomedical corpus annotated with UMLS concepts. In Automated Knowledge Base Construction.
Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 2019 Workshop on Biomedical Natural Language Processing, pages 58–65.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.
Minh C Phan, Aixin Sun, and Yi Tay. 2019. Robust representation learning of biomedical names. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3275–3285.
Kirk Roberts, Matthew S Simpson, Ellen M Voorhees, and William R Hersh. 2015. Overview of the TREC 2015 clinical decision support track. In TREC.
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823.
Elliot Schumacher, Andriy Mulyar, and Mark Dredze. 2020. Clinical concept linking with contextualized neural representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8585–8592.
Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, and Raghav Mani. 2020. BioMegatron: Larger biomedical domain language model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4700–4706, Online. Association for Computational Linguistics.
Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. 2020. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6398–6407.
Mujeen Sung, Hwisang Jeon, Jinhyuk Lee, and Jaewoo Kang. 2020. Biomedical entity representations with synonym marginalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3641–3650, Online. Association for Computational Linguistics.
Elena Tutubalina, Artur Kadurin, and Zulfat Miftahutdinov. 2020. Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models. In Proceedings of the 28th International Conference on Computational Linguistics (COLING).
Elena Tutubalina, Zulfat Miftahutdinov, Sergey Nikolenko, and Valentin Malykh. 2018. Medical concept normalization in social media posts with recurrent neural networks. Journal of Biomedical Informatics, 84:93–102.
Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics.
Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. 2019. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030.
Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics, 87:12–20.
Dustin Wright, Yannis Katsis, Raghav Mehta, and Chun-Nan Hsu. 2019. NormCo: Deep disease normalization for biomedical knowledge base construction. In Automated Knowledge Base Construction.
Dongfang Xu, Zeyu Zhang, and Steven Bethard. 2020. A generate-and-rank framework with semantic type regularization for biomedical concept normalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8452–8464.
A Evaluation Datasets Details

We divide our experimental datasets into two categories: (1) scientific language datasets, where the data is extracted from scientific papers, and (2) social media language datasets, where the data comes from social media forums like reddit.com. For an overview of the key statistics, see Tab. 3.

A.1 Scientific Language Datasets

The chemical mentions are mapped into the Comparative Toxicogenomics Database (CTD) (Davis et al., 2019) chemical dictionary. We denote the disease and chemical mention sets as “BC5CDR-d” and “BC5CDR-c” respectively. For NCBI and BC5CDR we use the same data and evaluation protocol as Sung et al. (2020).

MedMentions (Mohan and Li, 2018) is a very-large-scale entity linking dataset containing over 4,000 abstracts and over 350,000 mentions linked to UMLS 2017AA. According to Mohan and Li (2018), training TaggerOne (Leaman and Lu, 2016), a very popular MEL system, on a subset of MedMentions requires >900 GB of RAM. Its massive number of mentions and, more importantly, the reference ontology used (UMLS 2017AA has 3M+ concepts) make the application of most MEL systems infeasible. However, through our metric learning formulation, SapBERT can be applied to MedMentions with minimal effort.

A.2 Social-Media Language Datasets

AskAPatient (Limsopatham and Collier, 2016) includes 17,324 adverse drug reaction (ADR) annotations collected from askapatient.com blog posts. The mentions are mapped to 1,036 medical concepts grounded onto SNOMED-CT (Donnelly, 2006) and AMT (the Australian Medicines Terminology). For this dataset, we follow the 10-fold evaluation protocol stated in the original paper.

COMETA (Basaldella et al., 2020) is a recently released large-scale MEL dataset that specifically focuses on MEL in the social media domain, containing around 20k medical mentions extracted from health-related discussions on reddit.com. Mentions are mapped to SNOMED-CT. We use the “stratified (general)” split and follow the evaluation protocol of the original paper.
Table 3: This table contains basic statistics of the MEL datasets used in the study. C denotes the set of concepts;
S denotes the set of all surface forms / synonyms of all concepts in C; M denotes the set of mentions / queries.
COMETA (s.g.) and (z.g.) are the stratified (general) and zeroshot (general) split respectively.
Table 4: A list of baselines on the 6 different MEL datasets, including both scientific and social media language ones. The last
row collects reported numbers from the best performing models. “∗” denotes results produced using official released code. “-”
denotes results not reported in the cited paper. “OOM” means out-of-memory.
B.2 Comparing Loss Functions

We use COMETA (zeroshot general) as a benchmark for selecting learning objectives. Note that this split of COMETA is different from the stratified-general split used in Tab. 4. It is very challenging (so differences in performance are easy to see) and also does not directly affect the model's performance on the other datasets. The results are listed in Tab. 6. Note that online mining is switched on for all models here.

loss                                                @1    @5
cosine loss (Phan et al., 2019)                     55.1  64.6
max-margin triplet loss (Basaldella et al., 2020)   64.6  74.6
NCA loss (Goldberger et al., 2005)                  65.2  77.0
Lifted-Structure loss (Oh Song et al., 2016)        62.0  72.1
InfoNCE (Oord et al., 2018; He et al., 2020)        63.3  74.2
Circle loss (Sun et al., 2020)                      66.7  78.7
Multi-Similarity loss (Wang et al., 2019)           67.2  80.3

Table 6: This table compares loss functions used for SapBERT pretraining. Numbers reported are on COMETA (zeroshot general).

The cosine loss was used by Phan et al. (2019) for learning UMLS synonyms with LSTM models. The max-margin triplet loss was used by Basaldella et al. (2020) for training MEL models; a very similar (though not identical) hinge loss was used by Schumacher et al. (2020) for clinical concept linking. InfoNCE has been very popular in self-supervised and contrastive learning (Oord et al., 2018; He et al., 2020). Lifted-Structure loss (Oh Song et al., 2016) and NCA loss (Goldberger et al., 2005) are two classic metric learning objectives. Multi-Similarity loss (Wang et al., 2019) and Circle loss (Sun et al., 2020) are two recently proposed metric learning objectives that have been considered SOTA on large-scale visual recognition benchmarks.
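For reference, the sketch below shows the core of the Multi-Similarity loss (Wang et al., 2019) that performs best in Tab. 6, using the hyper-parameters selected in Tab. 9 (α = 2, β = 50, ε = 0.5). The pair-filtering step of the original formulation and the online mining of Eq. (1) are omitted, so this is an illustration rather than the exact training objective.

import torch

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, eps=0.5):
    """embeddings: (B, d), L2-normalised; labels: (B,) concept IDs."""
    sims = embeddings @ embeddings.T
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    diag = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
    pos_mask, neg_mask = same & ~diag, ~same
    losses = []
    for i in range(len(labels)):
        pos, neg = sims[i][pos_mask[i]], sims[i][neg_mask[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue  # no in-batch positive or negative for this anchor
        pos_term = torch.log1p(torch.exp(-alpha * (pos - eps)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (neg - eps)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean() if losses else sims.new_zeros(())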
B.3 Details of Adapters

In Tab. 7 we list the number of parameters trained in the three Adapter variants along with full-model tuning for easy comparison.
model                                     URL
vanilla BERT (Devlin et al., 2019)        https://huggingface.co/bert-base-uncased
BioBERT (Lee et al., 2020)                https://huggingface.co/dmis-lab/biobert-v1.1
BlueBERT (Peng et al., 2019)              https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
ClinicalBERT (Alsentzer et al., 2019)     https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
SciBERT (Beltagy et al., 2019)            https://huggingface.co/allenai/scibert_scivocab_uncased
UmlsBERT (Michalopoulos et al., 2020)     https://www.dropbox.com/s/qaoq5gfen69xdcc/umlsbert.tar.xz?dl=0
PubMedBERT (Gu et al., 2020)              https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext

Table 5: This table lists the URLs of the models used in this study.
Table 7 (column headings): method, reduction rate, #params, #params / #params in BERT.
hardware specification
RAM 192 GB
CPU Intel Xeon W-2255 @3.70GHz, 10-core 20-threads
GPU NVIDIA GeForce RTX 2080 Ti (11 GB) × 4
C Other Details

C.1 The Full Table of Supervised Baseline Models

The full table of supervised baseline models is provided in Tab. 4.
hyper-parameter                                          search space
learning rate for pretraining & fine-tuning SapBERT      {1e-4, 2e-5∗, 5e-5, 1e-5, 1e-6}
pretraining batch size                                   {128, 256, 512∗, 1024}
pretraining training iterations                          {10k, 20k, 30k, 40k, 50k (1 epoch)∗, 100k (2 epochs)}
fine-tuning epochs on scientific language datasets       {1, 2, 3∗, 5}
fine-tuning epochs on AskAPatient                        {5, 10, 15∗, 20}
fine-tuning epochs on COMETA                             {5, 10∗, 15, 20}
max_seq_length of BERT tokenizer                         {15, 20, 25∗, 30}
λ in Online Mining                                       {-0.05, -0.1, -0.2∗, -0.3}
α in MS loss                                             {1, 2 (Wang et al., 2019)∗, 3}
β in MS loss                                             {40, 50 (Wang et al., 2019)∗, 60}
ε in MS loss                                             {0.5∗, 1 (Wang et al., 2019)}
α in max-margin triplet loss                             {0.05, 0.1, 0.2 (Basaldella et al., 2020)∗, 0.3}
softmax scale in NCA loss                                {1 (Goldberger et al., 2005), 5, 10, 20∗, 30}
α in Lifted-Structure loss                               {0.5∗, 1 (Oh Song et al., 2016)}
τ (temperature) in InfoNCE                               {0.07 (He et al., 2020)∗, 0.5 (Oord et al., 2018)}
m in Circle loss                                         {0.25 (Sun et al., 2020)∗, 0.4 (Sun et al., 2020)}
γ in Circle loss                                         {80 (Sun et al., 2020), 256 (Sun et al., 2020)∗}

Table 9: This table lists the search space for the hyper-parameters used. "∗" marks the values used for reporting results.
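Gathered in one place, the starred values correspond to a configuration like the following. This is only a convenience summary of Tab. 9 in code form, not a file shipped with the implementation, and the key names are illustrative.

# Starred (selected) hyper-parameter values from Tab. 9, for convenience.
SELECTED_HPARAMS = {
    "learning_rate": 2e-5,                 # pretraining & fine-tuning SapBERT
    "pretraining_batch_size": 512,
    "pretraining_iterations": 50_000,      # roughly 1 epoch
    "finetune_epochs": {"scientific": 3, "askapatient": 15, "cometa": 10},
    "max_seq_length": 25,
    "online_mining_lambda": -0.2,
    "ms_loss": {"alpha": 2, "beta": 50, "epsilon": 0.5},
    "triplet_loss_alpha": 0.2,
    "nca_softmax_scale": 20,
    "lifted_structure_alpha": 0.5,
    "infonce_temperature": 0.07,
    "circle_loss": {"m": 0.25, "gamma": 256},
}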
Figure 3: Same as Fig. 1 in the main text, but generated with a higher resolution. (Panel labels: PubMedBERT; PubMedBERT + SapBERT.)