
Self-Alignment Pretraining for Biomedical Entity Representations

Fangyu Liu♣, Ehsan Shareghi♦,♣, Zaiqiao Meng♣, Marco Basaldella♥∗, Nigel Collier♣

♣ Language Technology Lab, TAL, University of Cambridge
♦ Department of Data Science & AI, Monash University    ♥ Amazon Alexa

{fl399, zm324, nhc30}@cam.ac.uk
[email protected]    [email protected]

Abstract

Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking, where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.[1]

Figure 1: The t-SNE (Maaten and Hinton, 2008) visualisation of UMLS entities under PubMedBERT (BERT pretrained on PubMed papers) and PubMedBERT+SapBERT (PubMedBERT further pretrained on UMLS synonyms). The biomedical names of different concepts are hard to separate in the heterogeneous embedding space (left). After the self-alignment pretraining, the same concept's entity names are drawn closer to form compact clusters (right).

1 Introduction

Biomedical entity[2] representation is the foundation for a plethora of text mining systems in the medical domain, facilitating applications such as literature search (Lee et al., 2016), clinical decision making (Roberts et al., 2015) and relational knowledge discovery (e.g. chemical-disease, drug-drug and protein-protein relations, Wang et al. 2018). The heterogeneous naming of biomedical concepts poses a major challenge to representation learning. For instance, the medication Hydroxychloroquine is often referred to as Oxichlorochine (alternative name), HCQ (in social media) and Plaquenil (brand name).

MEL addresses this problem by framing it as a task of mapping entity mentions to unified concepts in a medical knowledge graph.[3] The main bottleneck of MEL is the quality of the entity representations (Basaldella et al., 2020). Prior works in this domain have adopted very sophisticated text pre-processing heuristics (D'Souza and Ng, 2015; Kim et al., 2019; Ji et al., 2020; Sung et al., 2020) which can hardly cover all the variations of biomedical names.

∗ Work conducted prior to joining Amazon.
[1] For code and pretrained models, please visit: https://github.com/cambridgeltl/sapbert.
[2] In this work, biomedical entity refers to the surface forms of biomedical concepts, which can be a single word (e.g. fever), a compound (e.g. sars-cov-2) or a short phrase (e.g. abnormal retinal vascular development).
[3] Note that we consider only the biomedical entities themselves and not their contexts, also known as medical concept normalisation/disambiguation in the BioNLP community.
In parallel, self-supervised learning has shown tremendous success in NLP via leveraging the masked language modelling (MLM) objective to learn semantics from distributional representations (Devlin et al., 2019; Liu et al., 2019). Domain-specific pretraining on biomedical corpora (e.g. BioBERT, Lee et al. 2020 and BioMegatron, Shin et al. 2020) has made much progress in biomedical text mining tasks. Nonetheless, representing medical entities with the existing SOTA pretrained MLMs (e.g. PubMedBERT, Gu et al. 2020), as suggested in Fig. 1 (left), does not lead to a well-separated representation space.

To address the aforementioned issue, we propose to pretrain a Transformer-based language model on the biomedical knowledge graph of UMLS (Bodenreider, 2004), the largest interlingua of biomedical ontologies. UMLS contains a comprehensive collection of biomedical synonyms in various forms (UMLS 2020AA has 4M+ concepts and 10M+ synonyms which stem from over 150 controlled vocabularies including MeSH, SNOMED CT, RxNorm, Gene Ontology and OMIM).[4] We design a self-alignment objective that clusters synonyms of the same concept. To cope with the immense size of UMLS, we sample hard training pairs from the knowledge base and use a scalable metric learning loss. We name our model Self-aligning pretrained BERT (SapBERT).

Being both simple and powerful, SapBERT obtains new SOTA performance across all six MEL benchmark datasets. In contrast with the current systems which adopt complex pipelines and hybrid components (Xu et al., 2020; Ji et al., 2020; Sung et al., 2020), SapBERT applies a much simpler training procedure without requiring any pre- or post-processing steps. At test time, a simple nearest neighbour search is sufficient for making a prediction. When compared with other domain-specific pretrained language models (e.g. BioBERT and SciBERT), SapBERT also brings substantial improvement, by up to 20% accuracy, across all tasks. The effectiveness of the pretraining in SapBERT is especially highlighted in the scientific language domain, where SapBERT outperforms the previous SOTA even without fine-tuning on any MEL datasets. We also provide insights on pretraining's impact across domains and explore pretraining with fewer model parameters by using a recently introduced ADAPTER module in our training scheme.

Figure 2: The distribution of similarity scores for all sampled PubMedBERT representations in a mini-batch. The left graph shows the distribution of + and - pairs which are easy and already well-separated. The right graph illustrates the larger overlap between the two groups generated by the online mining step, making them harder and more informative for learning.

2 Method: Self-Alignment Pretraining

We design a metric learning framework that learns to self-align synonymous biomedical entities. The framework can be used both for pretraining on UMLS and for fine-tuning on task-specific datasets. We use an existing BERT model as our starting point. In the following, we introduce the key components of our framework.

Formal Definition. Let (x, y) ∈ X × Y denote a tuple of a name and its categorical label. For the self-alignment pretraining step, X × Y is the set of all (name, CUI[5]) pairs in UMLS, e.g. (Remdesivir, C4726677); while for the fine-tuning step, it is formed by an entity mention and its corresponding mapping from the ontology, e.g. (scratchy throat, 102618009). Given any pair of tuples (x_i, y_i), (x_j, y_j) ∈ X × Y, the goal of the self-alignment is to learn a function f(·; θ): X → R^d parameterised by θ. Then, the similarity ⟨f(x_i), f(x_j)⟩ (in this work we use cosine similarity) can be used to estimate the resemblance of x_i and x_j (i.e., high if x_i, x_j are synonyms and low otherwise). We model f by a BERT model with its output [CLS] token regarded as the representation of the input.[6] During the learning, a sampling procedure selects the informative pairs of training samples and uses them in the pairwise metric learning loss function (introduced shortly).

[4] https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/statistics.html
[5] In UMLS, CUI is the Concept Unique Identifier.
[6] We tried multiple strategies including first-token, mean-pooling, [CLS] and also NOSPEC (recommended by Vulić et al. 2020) but found no consistent best strategy (the optimal strategy varies across different *BERTs).
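To make the formal definition concrete, the sketch below shows one way to realise f with an off-the-shelf BERT encoder from the Hugging Face transformers library, taking the output [CLS] vector as f(x) and scoring name pairs with cosine similarity. It is a minimal illustration, not the authors' released implementation; the checkpoint name is the PubMedBERT entry from Tab. 5, and max_length follows Tab. 9.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Any BERT-Base checkpoint can be used; PubMedBERT is the strongest base model in Tab. 1.
CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def f(names, max_length=25):
    """Embed a list of entity names: the [CLS] output is taken as f(x)."""
    batch = tokenizer(names, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]      # (batch, hidden)
    return torch.nn.functional.normalize(cls, dim=-1)   # unit norm, so dot product = cosine

# Cosine similarity between two surface forms of the same concept.
emb = f(["hydroxychloroquine", "plaquenil"])
print(float(emb[0] @ emb[1]))
```

During pretraining the same encoder is used with gradients enabled; the no_grad decorator above is only appropriate at test time.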
Online Hard Pairs Mining. We use an online hard triplet mining condition to find the most informative training examples (i.e. hard positive/negative pairs) within a mini-batch for efficient training (Fig. 2). For biomedical entities, this step can be particularly useful as most examples can be easily classified, while a small set of very hard ones cause the most challenge to representation learning.[7] We start by constructing all possible triplets for all names within the mini-batch, where each triplet is of the form (x_a, x_p, x_n). Here x_a is called the anchor, an arbitrary name in the mini-batch; x_p is a positive match of x_a (i.e. y_a = y_p) and x_n a negative match of x_a (i.e. y_a ≠ y_n). Among the constructed triplets, we select all triplets that violate the following condition:

    \| f(x_a) - f(x_p) \|_2 < \| f(x_a) - f(x_n) \|_2 + \lambda,    (1)

where λ is a pre-set margin. In other words, we only consider triplets with the negative sample closer to the positive sample by a margin of λ. These are the hard triplets, as their original representations were far from correct. Every hard triplet contributes one hard positive pair (x_a, x_p) and one hard negative pair (x_a, x_n). We collect all such positive and negative pairs and denote them as P, N. A similar but not identical triplet mining condition was used by Schroff et al. (2015) for face recognition to select hard negative samples. Switching off this mining process causes a drastic performance drop (see Tab. 2).

[7] Most of Hydroxychloroquine's variants are easy: Hydroxychlorochin, Hydroxychloroquine (substance), Hidroxicloroquina, but a few can be very hard: Plaquenil and HCQ.
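As a reference, here is a minimal PyTorch sketch of this mining step (our reading of Eq. (1), not the authors' released code): all triplets in the mini-batch are checked against the condition in a vectorised way, and the mined hard positive and hard negative pairs are returned as boolean masks. The default margin follows Tab. 9 (λ = −0.2).

```python
import torch

def mine_hard_pairs(emb, labels, lam=-0.2):
    """Collect hard positive/negative pairs from a mini-batch (Eq. (1)).

    emb:    (B, d) name embeddings f(x) for the mini-batch
    labels: (B,)   CUIs encoded as integers
    Returns boolean masks hard_pos, hard_neg of shape (B, B), where entry [a, j]
    marks (x_a, x_j) as a mined hard positive / hard negative pair.
    """
    dist = torch.cdist(emb, emb)                     # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos_mask, neg_mask = same & ~eye, ~same
    # violates[a, p, n] is True when triplet (x_a, x_p, x_n) violates Eq. (1),
    # i.e. the positive is not closer than the negative by the margin.
    # Note: the (B, B, B) tensor is fine for a sketch; chunk over anchors for large batches.
    violates = (dist.unsqueeze(2) >= dist.unsqueeze(1) + lam) \
               & pos_mask.unsqueeze(2) & neg_mask.unsqueeze(1)
    hard_pos = pos_mask & violates.any(dim=2)    # (anchor, positive) pairs
    hard_neg = neg_mask & violates.any(dim=1)    # (anchor, negative) pairs
    return hard_pos, hard_neg
```

Off-the-shelf miners (e.g. in the pytorch-metric-learning package) implement similar conditions; the sketch above only spells out the one stated in Eq. (1).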
Loss Function. We compute the pairwise cosine similarity of all the BERT-produced name representations and obtain a similarity matrix S ∈ R^{|X_b| × |X_b|}, where each entry S_ij corresponds to the cosine similarity between the i-th and j-th names in the mini-batch b. We adapted the Multi-Similarity loss (MS loss, Wang et al. 2019), a SOTA metric learning objective in visual recognition, for learning from the positive and negative pairs:

    \mathcal{L} = \frac{1}{|X_b|} \sum_{i=1}^{|X_b|}
        \left(
            \frac{1}{\alpha} \log \Big( 1 + \sum_{n \in N_i} e^{\alpha (S_{in} - \epsilon)} \Big)
          + \frac{1}{\beta}  \log \Big( 1 + \sum_{p \in P_i} e^{-\beta (S_{ip} - \epsilon)} \Big)
        \right),    (2)

where α, β are temperature scales; ε is an offset applied on the similarity matrix; and P_i, N_i are the indices of the positive and negative samples of the anchor i.[8]

While the first term in Eq. 2 pushes negative pairs away from each other, the second term pulls positive pairs together. This dynamic allows for a re-calibration of the alignment space using the semantic biases of synonymy relations. The MS loss leverages similarities among and between positive and negative pairs to re-weight the importance of the samples. The most informative pairs receive more gradient signal during training and thus can better use the information stored in the data.

[8] We explored several loss functions such as InfoNCE (Oord et al., 2018), NCA loss (Goldberger et al., 2005), simple cosine loss (Phan et al., 2019) and max-margin triplet loss (Basaldella et al., 2020) but found our choice to be empirically better. See App. §B.2 for a comparison.
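The sketch below (a minimal reading of Eq. (2), not the released implementation) applies the MS loss to the masks produced by the mining sketch above, and shows how one pretraining step could look with the optimiser settings of §3.1 (AdamW, learning rate 2e-5, weight decay 1e-2); α = 2, β = 50 and ε = 0.5 follow Tab. 9. The commented names (encoder_forward, names, cuis) are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(emb, hard_pos, hard_neg, alpha=2.0, beta=50.0, eps=0.5):
    """Eq. (2) computed over the mined hard pairs; S is the cosine similarity matrix."""
    z = F.normalize(emb, dim=-1)
    S = z @ z.T                                                    # (B, B) cosine similarities
    neg = torch.log1p((hard_neg * torch.exp(alpha * (S - eps))).sum(dim=1)) / alpha
    pos = torch.log1p((hard_pos * torch.exp(-beta * (S - eps))).sum(dim=1)) / beta
    return (neg + pos).mean()

# One pretraining step: `encoder_forward` returns f(x) with gradients enabled;
# `names` / `cuis` hold a mini-batch of 256 offline positive pairs flattened to 512 names.
# optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5, weight_decay=1e-2)
# emb = encoder_forward(names)
# hard_pos, hard_neg = mine_hard_pairs(emb, cuis)
# loss = multi_similarity_loss(emb, hard_pos, hard_neg)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```

The paper additionally trains with PyTorch automatic mixed precision (torch.cuda.amp), which is omitted above for brevity.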
3 Experiments and Discussions

3.1 Experimental Setups

Data Preparation Details for UMLS Pretraining. We download the full release of the UMLS 2020AA version.[9] We then extract all English entries from the MRCONSO.RRF raw file and convert all entity names into lowercase (duplicates are removed). Besides the synonyms defined in MRCONSO.RRF, we also include tradenames of drugs as synonyms (extracted from MRREL.RRF). After pre-processing, a list of 9,712,959 (name, CUI) entries is obtained. However, random batching on this list can lead to very few (if any) positive pairs within a mini-batch. To ensure sufficient positives are present in each mini-batch, we generate offline positive pairs in the format (name1, name2, CUI), where name1 and name2 share the same CUI label. This can be achieved by enumerating all possible combinations of synonym pairs with common CUIs. For balanced training, any concept with more than 50 positive pairs is randomly trimmed to 50 pairs. In the end we obtain a training list with 11,792,953 pairwise entries.

UMLS Pretraining Details. During training, we use AdamW (Loshchilov and Hutter, 2018) with a learning rate of 2e-5 and a weight decay rate of 1e-2. Models are trained on the prepared pairwise UMLS data for 1 epoch (approximately 50k iterations) with a batch size of 512 (i.e., 256 pairs per mini-batch). We train with Automatic Mixed Precision (AMP)[10] as provided in PyTorch 1.7.0. This takes approximately 5 hours on our machine (configurations specified in App. §B.4). For other hyper-parameters used, please see App. §C.2.

[9] https://download.nlm.nih.gov/umls/kss/2020AA/umls-2020AA-full.zip
[10] https://pytorch.org/docs/stable/amp.html
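The offline positive-pair generation described above can be sketched as follows (a simplified illustration with hypothetical variable names; parsing of MRCONSO.RRF/MRREL.RRF is omitted):

```python
import itertools
import random
from collections import defaultdict

def build_positive_pairs(name_cui_list, max_pairs_per_cui=50, seed=42):
    """Enumerate synonym pairs sharing a CUI; cap each concept at 50 pairs (cf. §3.1)."""
    rng = random.Random(seed)
    by_cui = defaultdict(set)
    for name, cui in name_cui_list:
        by_cui[cui].add(name.lower())                  # lowercase and deduplicate
    pairs = []
    for cui, names in by_cui.items():
        combos = list(itertools.combinations(sorted(names), 2))
        if len(combos) > max_pairs_per_cui:
            combos = rng.sample(combos, max_pairs_per_cui)
        pairs.extend((n1, n2, cui) for n1, n2 in combos)
    return pairs

# e.g. build_positive_pairs([("remdesivir", "C4726677"), ("GS-5734", "C4726677")])
```

Capping each CUI at 50 pairs keeps frequent concepts, which can have hundreds of synonyms, from dominating the training list.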
                                          scientific language                                    social media language
                                          NCBI         BC5CDR-d     BC5CDR-c     MedMentions     AskAPatient   COMETA
model                                     @1    @5     @1    @5     @1    @5     @1    @5        @1    @5      @1    @5
vanilla BERT (Devlin et al., 2019)        67.6  77.0   81.4  89.1   79.8  91.2   39.6  60.2      38.2  43.3    40.4  47.7
 + SapBERT                                91.6  95.2   92.7  95.4   96.1  98.0   52.5  72.6      68.4  87.6    59.5  76.8
BioBERT (Lee et al., 2020)                71.3  84.1   79.8  92.3   74.0  90.0   24.2  38.5      41.4  51.5    35.9  46.1
 + SapBERT                                91.0  94.7   93.3  95.5   96.6  97.6   53.0  73.7      72.4  89.1    63.3  77.0
BlueBERT (Peng et al., 2019)              75.7  87.2   83.2  91.0   87.7  94.1   41.6  61.9      41.5  48.5    42.9  52.9
 + SapBERT                                90.9  94.0   93.4  96.0   96.7  98.2   49.6  73.1      72.4  89.4    66.0  78.8
ClinicalBERT (Alsentzer et al., 2019)     72.1  84.5   82.7  91.6   75.9  88.5   43.9  54.3      43.1  51.8    40.6  61.8
 + SapBERT                                91.1  95.1   93.0  95.7   96.6  97.7   51.5  73.0      71.1  88.5    64.3  77.3
SciBERT (Beltagy et al., 2019)            85.1  88.4   89.3  92.8   94.2  95.5   42.3  51.9      48.0  54.8    45.8  66.8
 + SapBERT                                91.7  95.2   93.3  95.7   96.6  98.0   50.1  73.9      72.1  88.7    64.5  77.5
UmlsBERT (Michalopoulos et al., 2020)     77.0  85.4   85.5  92.5   88.9  94.1   36.1  55.8      44.4  54.5    44.6  53.0
 + SapBERT                                91.2  95.2   92.8  95.5   96.6  97.7   52.1  73.2      72.6  89.3    63.4  76.9
PubMedBERT (Gu et al., 2020)              77.8  86.9   89.0  93.8   93.0  94.6   43.9  64.7      42.5  49.6    46.8  53.2
 + SapBERT                                92.0  95.6   93.5  96.0   96.5  98.2   50.8  74.4      70.5  88.9    65.9  77.9

supervised SOTA                           91.1  93.9   93.2  96.0   96.6  97.2   OOM   OOM       87.5  -       79.0  -
PubMedBERT                                77.8  86.9   89.0  93.8   93.0  94.6   43.9  64.7      42.5  49.6    46.8  53.2
 + SapBERT                                92.0  95.6   93.5  96.0   96.5  98.2   50.8  74.4      70.5  88.9    65.9  77.9
 + SapBERT (ADAPTER13%)                   91.5  95.8   93.6  96.3   96.5  98.0   50.7  75.0†     67.5  87.1    64.5  74.9
 + SapBERT (ADAPTER1%)                    90.9  95.4   93.8† 96.5†  96.5  97.9   52.2† 74.8      65.7  84.0    63.5  74.2
 + SapBERT (fine-tuned)                   92.3  95.5   93.2  95.4   96.5  97.9   50.4  73.9      89.0† 96.2†   75.1 (81.1)  85.5 (86.1†)
BioSyn                                    91.1  93.9   93.2  96.0   96.6  97.2   OOM   OOM       82.6  87.0    71.3  77.8
 + (init. w/) SapBERT                     92.5† 96.2†  93.6  96.2   96.8  98.4†  OOM   OOM       87.6  95.6    77.0  84.2

Table 1: Top: Comparison of 7 BERT-based models before and after SapBERT pretraining (+ SapBERT). All results in this section are from unsupervised learning (not fine-tuned on task data). The gradient of green (in the original table) indicates the improvement compared to the base model (the deeper, the more). Bottom: SapBERT vs. SOTA results. Blue and red denote unsupervised and supervised models. Bold and underline denote the best and second best results in the column. "†" denotes statistically significantly better than the supervised SOTA (T-test, ρ < 0.05). On COMETA, the results inside the parentheses add the supervised SOTA's dictionary back-off technique (Basaldella et al., 2020). "-": not reported in the SOTA paper. "OOM": out-of-memory (192GB+).

Evaluation Data and Protocol. We experiment on 6 different English MEL datasets: 4 in the scientific domain (NCBI, Doğan et al. 2014; BC5CDR-c and BC5CDR-d, Li et al. 2016; MedMentions, Mohan and Li 2018) and 2 in the social media domain (COMETA, Basaldella et al. 2020 and AskAPatient, Limsopatham and Collier 2016). Descriptions of the datasets and their statistics are provided in App. §A. We report Acc@1 and Acc@5 (denoted as @1 and @5) for evaluating performance. In all experiments, SapBERT denotes further pretraining with our self-alignment method on UMLS. At the test phase, for all SapBERT models we use nearest neighbour search without further fine-tuning on task data (unless stated otherwise). Except for numbers reported in previous papers, all results are the average of five runs with different random seeds.
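Concretely, the test-time protocol amounts to dense nearest-neighbour search: all surface forms in the reference ontology are embedded once, each query mention is embedded, and the CUIs of the top-k most cosine-similar ontology names are compared against the gold CUI. A minimal sketch (reusing the encoder f defined earlier; the dictionary and query lists are hypothetical placeholders):

```python
def accuracy_at_k(queries, gold_cuis, dict_names, dict_cuis, k=5):
    """Acc@1 / Acc@k via nearest-neighbour search over the ontology dictionary."""
    # For very large ontologies (e.g. MedMentions), embed the dictionary in batches instead.
    dict_emb = f(dict_names)               # (N_dict, d), unit-normalised
    query_emb = f(queries)                 # (N_query, d)
    topk = (query_emb @ dict_emb.T).topk(k, dim=1).indices
    hits1 = hitsk = 0
    for gold, idx in zip(gold_cuis, topk.tolist()):
        preds = [dict_cuis[j] for j in idx]
        hits1 += preds[0] == gold
        hitsk += gold in preds
    return hits1 / len(gold_cuis), hitsk / len(gold_cuis)

# acc1, acc5 = accuracy_at_k(test_mentions, test_cuis, onto_names, onto_cuis)
```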
Fine-Tuning on Task Data. The red rows in Tab. 1 are results of models (further) fine-tuned on the training sets of the six MEL datasets. Similar to pretraining, a positive pair list is generated by traversing the combinations of each mention and all of its ground truth synonyms, where mentions come from the training set and ground truth synonyms come from the reference ontology. We use the same optimiser and learning rates but train with a batch size of 256 (to accommodate the memory of 1 GPU). On scientific language datasets we train for 3 epochs, while on AskAPatient and COMETA we train for 15 and 10 epochs respectively. For BioSyn on social media language datasets, we empirically found that 10 epochs work best. Other configurations are the same as in the original BioSyn paper.

3.2 Main Results and Analysis

*BERT + SapBERT (Tab. 1, top). We illustrate the impact of SapBERT pretraining over 7 existing BERT-based models (*BERT = {BioBERT, PubMedBERT, ...}). SapBERT obtains consistent improvement over all *BERT models across all datasets, with larger gains (by up to 31.0% absolute Acc@1 increase) observed in the social media domain. While SciBERT is the leading model before applying SapBERT, PubMedBERT+SapBERT performs the best afterwards.
SapBERT vs. SOTA (Tab. 1, bottom). We take PubMedBERT+SapBERT (with and without fine-tuning) and compare against various published SOTA results (see App. §C.1 for a full listing of 10 baselines), all of which require task supervision. For the scientific language domain, the SOTA is BioSyn (Sung et al., 2020). For the social media domain, the SOTA are Basaldella et al. (2020) and Gen-Rank (Xu et al., 2020) on COMETA and AskAPatient respectively. All these SOTA methods combine BERT with heuristic modules such as tf-idf, string matching and an information retrieval system (i.e. Apache Lucene) in a multi-stage manner.

Measured by Acc@1, SapBERT achieves a new SOTA with statistical significance on 5 of the 6 datasets, and on the dataset where SapBERT is not significantly better (BC5CDR-c), it performs on par with the SOTA (96.5 vs. 96.6). Interestingly, on scientific language datasets, SapBERT outperforms the SOTA without any task supervision (fine-tuning mostly leads to overfitting and performance drops). On social media language datasets, unsupervised SapBERT lags behind the supervised SOTA by large margins, highlighting the well-documented complex nature of social media language (Baldwin et al., 2013; Limsopatham and Collier, 2015, 2016; Basaldella et al., 2020; Tutubalina et al., 2020). However, after fine-tuning on the social media datasets (using the MS loss introduced earlier), SapBERT outperforms the SOTA significantly, indicating that knowledge acquired during the self-aligning pretraining can be adapted to a shifted domain without much effort.

The ADAPTER Variant. As an option for parameter-efficient pretraining, we explore a variant of SapBERT using a recently introduced training module named ADAPTER (Houlsby et al., 2019). While maintaining the same pretraining scheme, with the same SapBERT online mining + MS loss, instead of training the full PubMedBERT model, we insert new ADAPTER layers between the Transformer layers of the fixed PubMedBERT and only train the weights of these ADAPTER layers. In our experiments, we use the enhanced ADAPTER configuration by Pfeiffer et al. (2020). We include two variants whose trained parameters are 13.22% and 1.09% of the full SapBERT variant. The ADAPTER variant of SapBERT achieves comparable performance to full-model-tuning on scientific datasets but lags behind on social media datasets (Tab. 1). The results indicate that more parameters are needed in pretraining for knowledge transfer to a shifted domain, in our case the social media datasets.

The Impact of Online Mining (Eq. (1)). As suggested in Tab. 2, switching off the online hard pairs mining procedure causes a large performance drop in @1 and a smaller but still significant drop in @5. This is due to the presence of many easy and already well-separated samples in the mini-batches. These uninformative training examples dominated the gradients and harmed the learning process.

configuration            @1            @5
Mining switched-on       67.2          80.3
Mining switched-off      52.3 (↓14.9)  76.1 (↓4.2)

Table 2: This table compares PubMedBERT+SapBERT's performance with and without online hard mining on COMETA (zeroshot general).

Integrating SapBERT in Existing Systems. SapBERT can be easily inserted into existing BERT-based MEL systems by initialising the systems with SapBERT pretrained weights. We use the SOTA scientific language system, BioSyn (originally initialised with BioBERT weights), as an example and show that the performance is boosted across all datasets (last two rows, Tab. 1).

4 Conclusion

We present SapBERT, a self-alignment pretraining scheme for learning biomedical entity representations. We highlight the consistent performance boost achieved by SapBERT, obtaining a new SOTA on all six widely used MEL benchmarking datasets. Strikingly, without any fine-tuning on task-specific labelled data, SapBERT already outperforms the previous supervised SOTA (sophisticated hybrid entity linking systems) on multiple datasets in the scientific language domain. Our work opens new avenues to explore for general-domain self-alignment (e.g. by leveraging knowledge graphs such as DBpedia). We plan to incorporate other types of relations (i.e., hypernymy and hyponymy) and extend our model to sentence-level representation learning. In particular, our ongoing work using a combination of SapBERT and ADAPTER is a promising direction for tackling sentence-level tasks.

Acknowledgements

We thank the three reviewers and the Area Chair for their insightful comments and suggestions. FL is supported by the Grace & Thomas C.H. Chan Cambridge Scholarship. NC and MB would like to acknowledge funding from Health Data Research UK as part of the National Text Analytics project.
References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP), pages 356–364, Nagoya, Japan. Asian Federation of Natural Language Processing.

Marco Basaldella, Fangyu Liu, Ehsan Shareghi, and Nigel Collier. 2020. COMETA: A corpus for medical entity linking in the social media. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3122–3137, Online. Association for Computational Linguistics.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267–D270.

Allan Peter Davis, Cynthia J Grondin, Robin J Johnson, Daniela Sciaky, Roy McMorran, Jolene Wiegers, Thomas C Wiegers, and Carolyn J Mattingly. 2019. The comparative toxicogenomics database: update 2019. Nucleic Acids Research, 47:D948–D954.

Allan Peter Davis, Thomas C Wiegers, Michael C Rosenstein, and Carolyn J Mattingly. 2012. MEDIC: a practical disease vocabulary used at the comparative toxicogenomics database. Database.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.

Kevin Donnelly. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics, 121:279.

Jennifer D'Souza and Vincent Ng. 2015. Sieve-based entity linking for the biomedical domain. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP) (Volume 2: Short Papers), pages 297–302, Beijing, China. Association for Computational Linguistics.

Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Russ R Salakhutdinov. 2005. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pages 513–520.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing. arXiv:2007.15779.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Zongcheng Ji, Qiang Wei, and Hua Xu. 2020. BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings, 2020:269.

Donghyeon Kim, Jinhyuk Lee, Chan Ho So, Hwisang Jeon, Minbyul Jeong, Yonghwa Choi, Wonjin Yoon, Mujeen Sung, and Jaewoo Kang. 2019. A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access, 7:73729–73740.

Robert Leaman and Zhiyong Lu. 2016. TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics, 32:2839–2846.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Sunwon Lee, Donghyeon Kim, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, et al. 2016. BEST: next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS ONE, 11:e0164680.

Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016.

Nut Limsopatham and Nigel Collier. 2015. Adapting phrase-based machine translation to normalise medical terms in social media messages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1675–1680, Lisbon, Portugal. Association for Computational Linguistics.

Nut Limsopatham and Nigel Collier. 2016. Normalising medical concepts in social media texts by learning semantic representation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1014–1023.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

George Michalopoulos, Yuanxin Wang, Hussam Kaka, Helen Chen, and Alex Wong. 2020. UmlsBERT: Clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus. arXiv preprint arXiv:2010.10391.

Sunil Mohan and Donghui Li. 2018. MedMentions: A large biomedical corpus annotated with UMLS concepts. In Automated Knowledge Base Construction.

Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 2019 Workshop on Biomedical Natural Language Processing, pages 58–65.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Minh C Phan, Aixin Sun, and Yi Tay. 2019. Robust representation learning of biomedical names. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3275–3285.

Kirk Roberts, Matthew S Simpson, Ellen M Voorhees, and William R Hersh. 2015. Overview of the TREC 2015 clinical decision support track. In TREC.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823.

Elliot Schumacher, Andriy Mulyar, and Mark Dredze. 2020. Clinical concept linking with contextualized neural representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8585–8592.

Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, and Raghav Mani. 2020. BioMegatron: Larger biomedical domain language model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4700–4706, Online. Association for Computational Linguistics.

Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. 2020. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6398–6407.

Mujeen Sung, Hwisang Jeon, Jinhyuk Lee, and Jaewoo Kang. 2020. Biomedical entity representations with synonym marginalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3641–3650, Online. Association for Computational Linguistics.

Elena Tutubalina, Artur Kadurin, and Zulfat Miftahutdinov. 2020. Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models. In Proceedings of the 28th International Conference on Computational Linguistics (COLING).
Elena Tutubalina, Zulfat Miftahutdinov, Sergey Nikolenko, and Valentin Malykh. 2018. Medical concept normalization in social media posts with recurrent neural networks. Journal of Biomedical Informatics, 84:93–102.

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics.

Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. 2019. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030.

Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics, 87:12–20.

Dustin Wright, Yannis Katsis, Raghav Mehta, and Chun-Nan Hsu. 2019. NormCo: Deep disease normalization for biomedical knowledge base construction. In Automated Knowledge Base Construction.

Dongfang Xu, Zeyu Zhang, and Steven Bethard. 2020. A generate-and-rank framework with semantic type regularization for biomedical concept normalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8452–8464.

A Evaluation Datasets Details

We divide our experimental datasets into two categories: (1) scientific language datasets, where the data is extracted from scientific papers, and (2) social media language datasets, where the data comes from social media forums like reddit.com. For an overview of the key statistics, see Tab. 3.

A.1 Scientific Language Datasets

NCBI disease (Doğan et al., 2014) is a corpus containing 793 fully annotated PubMed abstracts and 6,881 mentions. The mentions are mapped into the MEDIC dictionary (Davis et al., 2012). We denote this dataset as "NCBI" in our experiments.

BC5CDR (Li et al., 2016) consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 diseases and 3,116 chemical-disease interactions. The disease mentions are mapped into the MEDIC dictionary like the NCBI disease corpus. The chemical mentions are mapped into the Comparative Toxicogenomics Database (CTD) (Davis et al., 2019) chemical dictionary. We denote the disease and chemical mention sets as "BC5CDR-d" and "BC5CDR-c" respectively. For NCBI and BC5CDR we use the same data and evaluation protocol as Sung et al. (2020).[11]

MedMentions (Mohan and Li, 2018) is a very-large-scale entity linking dataset containing over 4,000 abstracts and over 350,000 mentions linked to UMLS 2017AA. According to Mohan and Li (2018), training TaggerOne (Leaman and Lu, 2016), a very popular MEL system, on a subset of MedMentions requires >900 GB of RAM. Its massive number of mentions and, more importantly, the reference ontology used (UMLS 2017AA has 3M+ concepts) make the application of most MEL systems infeasible. However, through our metric learning formulation, SapBERT can be applied on MedMentions with minimal effort.

A.2 Social-Media Language Datasets

AskAPatient (Limsopatham and Collier, 2016) includes 17,324 adverse drug reaction (ADR) annotations collected from askapatient.com blog posts. The mentions are mapped to 1,036 medical concepts grounded onto SNOMED-CT (Donnelly, 2006) and AMT (the Australian Medicines Terminology). For this dataset, we follow the 10-fold evaluation protocol stated in the original paper.[12]

COMETA (Basaldella et al., 2020) is a recently released large-scale MEL dataset that specifically focuses on MEL in the social media domain, containing around 20k medical mentions extracted from health-related discussions on reddit.com. Mentions are mapped to SNOMED-CT. We use the "stratified (general)" split and follow the evaluation protocol of the original paper.[13]

B Model & Training Details

B.1 The Choice of Base Models

We list all the versions of the BERT models used in this study, linking to the specific versions, in Tab. 5. Note that we exhaustively tried all official variants of the selected models and chose the best performing ones. All BERT models refer to the BERT-Base architecture in this paper.

[11] https://github.com/dmis-lab/BioSyn
[12] https://zenodo.org/record/55013
[13] https://www.siphs.org/corpus
dataset                     NCBI      BC5CDR-d   BC5CDR-c   MedMentions   AskAPatient    COMETA (s.g.)   COMETA (z.g.)
Ontology                    MEDIC     MEDIC      CTD        UMLS 2017AA   SNOMED & AMT   SNOMED          SNOMED
C_searched ⊊ C_ontology?    ✗         ✗          ✗          ✗             ✓              ✗               ✗
|C_searched|                11,915    11,915     171,203    3,415,665     1,036          350,830         350,830
|S_searched|                71,923    71,923     407,247    14,815,318    1,036          910,823         910,823
|M_train|                   5,134     4,182      5,203      282,091       15,665.2       13,489          14,062
|M_validation|              787       4,244      5,347      71,062        792.6          2,176           1,958
|M_test|                    960       4,424      5,385      70,405        866.2          4,350           3,995

Table 3: This table contains basic statistics of the MEL datasets used in the study. C denotes the set of concepts; S denotes the set of all surface forms / synonyms of all concepts in C; M denotes the set of mentions / queries. COMETA (s.g.) and (z.g.) are the stratified (general) and zeroshot (general) splits respectively.

                                                 NCBI        BC5CDR-d    BC5CDR-c    MedMentions   AskAPatient   COMETA
model                                            @1    @5    @1    @5    @1    @5    @1    @5      @1    @5      @1    @5
Sieve-Based (D'Souza and Ng, 2015)               84.7  -     84.1  -     90.7  -     -     -       -     -       -     -
WordCNN (Limsopatham and Collier, 2016)          -     -     -     -     -     -     -     -       81.4  -       -     -
WordGRU+TF-IDF (Tutubalina et al., 2018)         -     -     -     -     -     -     -     -       85.7  -       -     -
TaggerOne (Leaman and Lu, 2016)                  87.7  -     88.9  -     94.1  -     OOM   OOM     -     -       -     -
NormCo (Wright et al., 2019)                     87.8  -     88.0  -     -     -     -     -       -     -       -     -
BNE (Phan et al., 2019)                          87.7  -     90.6  -     95.8  -     -     -       -     -       -     -
BERTRank (Ji et al., 2020)                       89.1  -     -     -     -     -     -     -       -     -       -     -
Gen-Rank (Xu et al., 2020)                       -     -     -     -     -     -     -     -       87.5  -       -     -
BioSyn (Sung et al., 2020)                       91.1  93.9  93.2  96.0  96.6  97.2  OOM   OOM     82.6* 87.0*   71.3* 77.8*
Dict+SOILOS+Neural (Basaldella et al., 2020)     -     -     -     -     -     -     -     -       -     -       79.0  -
supervised SOTA                                  91.1  93.9  93.2  96.0  96.6  97.2  OOM   OOM     87.5  -       79.0  -

Table 4: A list of baselines on the 6 different MEL datasets, including both scientific and social media language ones. The last row collects the reported numbers from the best performing models. "*" denotes results produced using officially released code. "-" denotes results not reported in the cited paper. "OOM" means out-of-memory.

B.2 Comparing Loss Functions

We use COMETA (zeroshot general) as a benchmark for selecting learning objectives. Note that this split of COMETA is different from the stratified-general split used in Tab. 4. It is very challenging (so differences in performance are easy to see) and also does not directly affect the model's performance on other datasets. The results are listed in Tab. 6. Note that online mining is switched on for all models here.

loss                                                @1     @5
cosine loss (Phan et al., 2019)                     55.1   64.6
max-margin triplet loss (Basaldella et al., 2020)   64.6   74.6
NCA loss (Goldberger et al., 2005)                  65.2   77.0
Lifted-Structure loss (Oh Song et al., 2016)        62.0   72.1
InfoNCE (Oord et al., 2018; He et al., 2020)        63.3   74.2
Circle loss (Sun et al., 2020)                      66.7   78.7
Multi-Similarity loss (Wang et al., 2019)           67.2   80.3

Table 6: This table compares loss functions used for SapBERT pretraining. Numbers reported are on COMETA (zeroshot general).

The cosine loss was used by Phan et al. (2019) for learning UMLS synonyms with LSTM models. The max-margin triplet loss was used by Basaldella et al. (2020) for training MEL models. A very similar (though not identical) hinge loss was used by Schumacher et al. (2020) for clinical concept linking. InfoNCE has been very popular in self-supervised and contrastive learning (Oord et al., 2018; He et al., 2020). Lifted-Structure loss (Oh Song et al., 2016) and NCA loss (Goldberger et al., 2005) are two classic metric learning objectives. Multi-Similarity loss (Wang et al., 2019) and Circle loss (Sun et al., 2020) are two recently proposed metric learning objectives and have been considered SOTA on large-scale visual recognition benchmarks.
model                                     URL
vanilla BERT (Devlin et al., 2019)        https://huggingface.co/bert-base-uncased
BioBERT (Lee et al., 2020)                https://huggingface.co/dmis-lab/biobert-v1.1
BlueBERT (Peng et al., 2019)              https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
ClinicalBERT (Alsentzer et al., 2019)     https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
SciBERT (Beltagy et al., 2019)            https://huggingface.co/allenai/scibert_scivocab_uncased
UmlsBERT (Michalopoulos et al., 2020)     https://www.dropbox.com/s/qaoq5gfen69xdcc/umlsbert.tar.xz?dl=0
PubMedBERT (Gu et al., 2020)              https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext

Table 5: This table lists the URLs of the models used in this study.

B.3 Details of ADAPTERs

In Tab. 7 we list the number of parameters trained in the ADAPTER variants along with full-model-tuning for easy comparison.

method              reduction rate   #params    #params / #params in BERT
ADAPTER13%          1                14.47M     13.22%
ADAPTER1%           16               0.60M      1.09%
full-model-tuning   -                109.48M    100%

Table 7: This table compares the number of parameters trained in the ADAPTER variants and in full-model-tuning.
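For readers unfamiliar with adapters, the sketch below illustrates the basic idea behind the variants in Tab. 7: small bottleneck layers with a residual connection are added inside each Transformer layer and are the only weights updated, while the PubMedBERT backbone stays frozen. This is a generic Houlsby/Pfeiffer-style bottleneck for illustration only, not the exact Pfeiffer et al. (2020) configuration used in the experiments; the "reduction rate" column of Tab. 7 corresponds to the reduction argument below, and the wiring of the adapters into the encoder's forward pass is omitted.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project, residual connection."""
    def __init__(self, hidden_size=768, reduction=16):
        super().__init__()
        bottleneck = hidden_size // reduction
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Adapter-style SapBERT pretraining: freeze the backbone and train only the adapters;
# the mining + MS loss recipe of §2 is unchanged.
# for p in encoder.parameters():
#     p.requires_grad = False
# adapters = nn.ModuleList(BottleneckAdapter(reduction=16) for _ in range(12))
# optimizer = torch.optim.AdamW(adapters.parameters(), lr=2e-5, weight_decay=1e-2)
```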

B.4 Hardware Configurations

All our experiments are conducted on a server with the specifications listed in Tab. 8.

hardware   specification
RAM        192 GB
CPU        Intel Xeon W-2255 @3.70GHz, 10-core 20-threads
GPU        NVIDIA GeForce RTX 2080 Ti (11 GB) × 4

Table 8: Hardware specifications of the machine used.

C Other Details

C.1 The Full Table of Supervised Baseline Models

The full table of supervised baseline models is provided in Tab. 4.

C.2 Hyper-Parameters Search Scope

Tab. 9 lists the hyper-parameter search space used to obtain the final settings. Note that the chosen hyper-parameters yield the overall best performance but might be sub-optimal on any single dataset. We also balanced the memory limit against model performance.

C.3 A High-Resolution Version of Fig. 1

We show a clearer version of the t-SNE embedding visualisation in Fig. 3.
hyper-parameter                                        search space
learning rate for pretraining & fine-tuning SapBERT    {1e-4, 2e-5*, 5e-5, 1e-5, 1e-6}
pretraining batch size                                 {128, 256, 512*, 1024}
pretraining training iterations                        {10k, 20k, 30k, 40k, 50k (1 epoch)*, 100k (2 epochs)}
fine-tuning epochs on scientific language datasets     {1, 2, 3*, 5}
fine-tuning epochs on AskAPatient                      {5, 10, 15*, 20}
fine-tuning epochs on COMETA                           {5, 10*, 15, 20}
max_seq_length of the BERT tokenizer                   {15, 20, 25*, 30}
λ in online mining                                     {-0.05, -0.1, -0.2*, -0.3}
α in MS loss                                           {1, 2 (Wang et al., 2019)*, 3}
β in MS loss                                           {40, 50 (Wang et al., 2019)*, 60}
ε in MS loss                                           {0.5*, 1 (Wang et al., 2019)}
α in max-margin triplet loss                           {0.05, 0.1, 0.2 (Basaldella et al., 2020)*, 0.3}
softmax scale in NCA loss                              {1 (Goldberger et al., 2005), 5, 10, 20*, 30}
α in Lifted-Structure loss                             {0.5*, 1 (Oh Song et al., 2016)}
τ (temperature) in InfoNCE                             {0.07 (He et al., 2020)*, 0.5 (Oord et al., 2018)}
m in Circle loss                                       {0.25 (Sun et al., 2020)*, 0.4 (Sun et al., 2020)}
γ in Circle loss                                       {80 (Sun et al., 2020), 256 (Sun et al., 2020)*}

Table 9: This table lists the search space for the hyper-parameters used. "*" marks the value used for reporting results.
Figure 3: Same as Fig. 1 in the main text, but generated with a higher resolution. (Panels: PubMedBERT; PubMedBERT + SapBERT.)
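For readers who want to produce a visualisation in the style of Fig. 1/Fig. 3, the sketch below is one straightforward way to do it (assuming the encoder f from the method section and a hypothetical list of (name, CUI) samples; it is not the script used to generate the paper's figures):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_entity_space(names, cuis, out_path="tsne.png"):
    """2-D t-SNE of entity-name embeddings, coloured by concept (CUI)."""
    emb = f(names).cpu().numpy()                        # (N, d) embeddings from the encoder
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
    cui_to_id = {c: i for i, c in enumerate(sorted(set(cuis)))}
    colors = np.array([cui_to_id[c] for c in cuis])
    plt.figure(figsize=(6, 6))
    plt.scatter(xy[:, 0], xy[:, 1], c=colors, cmap="tab20", s=8)
    plt.axis("off")
    plt.savefig(out_path, dpi=300, bbox_inches="tight")

# Running this once with the base PubMedBERT encoder and once with the
# SapBERT-pretrained encoder reproduces the left/right contrast of Fig. 1.
```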
