Sapbert Medical Domain Hard Positive Negatives
Fangyu Liu♣ , Ehsan Shareghi♦,♣ , Zaiqiao Meng♣ , Marco Basaldella♥∗, Nigel Collier♣
♣Language Technology Lab, TAL, University of Cambridge
♦Department of Data Science & AI, Monash University ♥Amazon Alexa
{fl399, zm324, nhc30}
[email protected] ♥ [email protected]
                                        scientific language                                     social media language
model                                   NCBI          BC5CDR-d      BC5CDR-c      MedMentions   AskAPatient   COMETA
                                        @1     @5     @1     @5     @1     @5     @1     @5     @1     @5     @1            @5
vanilla BERT (Devlin et al., 2019)      67.6   77.0   81.4   89.1   79.8   91.2   39.6   60.2   38.2   43.3   40.4          47.7
+ SAPBERT                               91.6   95.2   92.7   95.4   96.1   98.0   52.5   72.6   68.4   87.6   59.5          76.8
BIOBERT (Lee et al., 2020)              71.3   84.1   79.8   92.3   74.0   90.0   24.2   38.5   41.4   51.5   35.9          46.1
+ SAPBERT                               91.0   94.7   93.3   95.5   96.6   97.6   53.0   73.7   72.4   89.1   63.3          77.0
BLUEBERT (Peng et al., 2019)            75.7   87.2   83.2   91.0   87.7   94.1   41.6   61.9   41.5   48.5   42.9          52.9
+ SAPBERT                               90.9   94.0   93.4   96.0   96.7   98.2   49.6   73.1   72.4   89.4   66.0          78.8
CLINICALBERT (Alsentzer et al., 2019)   72.1   84.5   82.7   91.6   75.9   88.5   43.9   54.3   43.1   51.8   40.6          61.8
+ SAPBERT                               91.1   95.1   93.0   95.7   96.6   97.7   51.5   73.0   71.1   88.5   64.3          77.3
SCIBERT (Beltagy et al., 2019)          85.1   88.4   89.3   92.8   94.2   95.5   42.3   51.9   48.0   54.8   45.8          66.8
+ SAPBERT                               91.7   95.2   93.3   95.7   96.6   98.0   50.1   73.9   72.1   88.7   64.5          77.5
UMLSBERT (Michalopoulos et al., 2020)   77.0   85.4   85.5   92.5   88.9   94.1   36.1   55.8   44.4   54.5   44.6          53.0
+ SAPBERT                               91.2   95.2   92.8   95.5   96.6   97.7   52.1   73.2   72.6   89.3   63.4          76.9
PUBMEDBERT (Gu et al., 2020)            77.8   86.9   89.0   93.8   93.0   94.6   43.9   64.7   42.5   49.6   46.8          53.2
+ SAPBERT                               92.0   95.6   93.5   96.0   96.5   98.2   50.8   74.4   70.5   88.9   65.9          77.9

supervised SOTA                         91.1   93.9   93.2   96.0   96.6   97.2   OOM    OOM    87.5   -      79.0          -
PUBMEDBERT                              77.8   86.9   89.0   93.8   93.0   94.6   43.9   64.7   42.5   49.6   46.8          53.2
+ SAPBERT                               92.0   95.6   93.5   96.0   96.5   98.2   50.8   74.4   70.5   88.9   65.9          77.9
+ SAPBERT (ADAPTER 13%)                 91.5   95.8   93.6   96.3   96.5   98.0   50.7   75.0†  67.5   87.1   64.5          74.9
+ SAPBERT (ADAPTER 1%)                  90.9   95.4   93.8†  96.5†  96.5   97.9   52.2†  74.8   65.7   84.0   63.5          74.2
+ SAPBERT (FINE-TUNED)                  92.3   95.5   93.2   95.4   96.5   97.9   50.4   73.9   89.0†  96.2†  75.1 (81.1)   85.5 (86.1†)
BIOSYN                                  91.1   93.9   93.2   96.0   96.6   97.2   OOM    OOM    82.6   87.0   71.3          77.8
+ (init. w/) SAPBERT                    92.5†  96.2†  93.6   96.2   96.8   98.4†  OOM    OOM    87.6   95.6   77.0          84.2
Table 1: Top: Comparison of 7 BERT-based models before and after SAPBERT pretraining (+ SAPBERT). All
results in this section are from unsupervised learning (not fine-tuned on task data). The gradient of green indicates
the improvement compared to the base model (the deeper the more). Bottom: SAPBERT vs. SOTA results. Blue
and red denote unsupervised and supervised models. Bold and underline denote the best and second best results
in the column. “†” denotes statistically significantly better than supervised SOTA (T-test, ρ < 0.05). On COMETA,
the results in parentheses additionally use the supervised SOTA’s dictionary back-off technique (Basaldella et al.,
2020). “-”: not reported in the SOTA paper. “OOM”: out-of-memory (192GB+).
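The @1 and @5 columns of Tab. 1 are top-k accuracies obtained with nearest-neighbour search over name embeddings. A minimal sketch of that protocol follows. It assumes cosine similarity over precomputed embeddings, and it counts a hit when any of the k nearest dictionary names carries the gold concept; both choices are assumptions, and the function names, concept IDs and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def acc_at_k(mention_vecs, gold_concepts, name_vecs, name_concepts, k):
    """Fraction of mentions whose gold concept is among the k nearest names."""
    hits = 0
    for vec, gold in zip(mention_vecs, gold_concepts):
        # Rank every dictionary name by similarity to the mention embedding.
        ranked = sorted(
            range(len(name_vecs)),
            key=lambda j: cosine(vec, name_vecs[j]),
            reverse=True,
        )
        top_concepts = {name_concepts[j] for j in ranked[:k]}
        hits += gold in top_concepts
    return hits / len(mention_vecs)


# Toy dictionary of three names, each mapped to a hypothetical concept ID.
name_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
name_concepts = ["C1", "C2", "C3"]

# Two toy mentions with their gold concepts.
mention_vecs = [[0.9, 0.1], [0.6, 0.8]]
gold_concepts = ["C1", "C2"]

acc1 = acc_at_k(mention_vecs, gold_concepts, name_vecs, name_concepts, k=1)
acc2 = acc_at_k(mention_vecs, gold_concepts, name_vecs, name_concepts, k=2)
```

In this toy setup the second mention is closest to the name of a different concept, so it only counts as correct once k is raised to 2.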
figurations specified in App. §B.4). For other hyper-parameters used, please view App. §C.2.

Evaluation Data and Protocol. We experiment on 6 different English MEL datasets: 4 in the scientific domain (NCBI, Doğan et al. 2014; BC5CDR-c and BC5CDR-d, Li et al. 2016; MedMentions, Mohan and Li 2018) and 2 in the social media domain (COMETA, Basaldella et al. 2020 and AskAPatient, Limsopatham and Collier 2016). Descriptions of the datasets and their statistics are provided in App. §A. We report Acc@1 and Acc@5 (denoted as @1 and @5) for evaluating performance. In all experiments, SAPBERT denotes further pretraining with our self-alignment method on UMLS. At the test phase, for all SAPBERT models we use nearest neighbour search without further fine-tuning on task data (unless stated otherwise). Except for numbers reported in previous papers, all results are the average of five runs with different random seeds.

Fine-Tuning on Task Data. The red rows in Tab. 1 are results of models (further) fine-tuned on the training sets of the six MEL datasets. Similar to pretraining, a positive pair list is generated through traversing the combinations of mention and all ground truth synonyms where mentions are from the training set and ground truth synonyms are from the reference ontology. We use the same optimiser and learning rates but train with a batch size of 256 (to accommodate the memory of 1 GPU). On scientific language datasets, we train for 3 epochs while on AskAPatient and COMETA we train for 15 and 10 epochs respectively. For BIOSYN on social media language datasets, we empirically found that 10 epochs work the best. Other configurations are the same as the original BIOSYN paper.

3.2 Main Results and Analysis

*BERT + SAPBERT (Tab. 1, top). We illustrate the impact of SAPBERT pretraining over 7 existing BERT-based models (*BERT = {BIOBERT, PUBMEDBERT, ...}). SAPBERT obtains consistent improvement over all *BERT models across all datasets, with larger gains (by up to 31.0% absolute Acc@1 increase) observed in the social media domain. While SCIBERT is the leading model before applying SAPBERT, PUBMEDBERT+SAPBERT performs the best afterwards.

SAPBERT vs. SOTA (Tab. 1, bottom). We take PUBMEDBERT+SAPBERT (w/wo fine-tuning) and compare against various published SOTA results (see App. §C.1 for a full listing of 10 baselines)
which all require task supervision. For the scientific language domain, the SOTA is BIOSYN (Sung et al., 2020). For the social media domain, the SOTA are Basaldella et al. (2020) and GEN-RANK (Xu et al., 2020) on COMETA and AskAPatient respectively. All these SOTA methods combine BERT with heuristic modules such as tf-idf, string matching and information retrieval system (i.e. Apache Lucene) in a multi-stage manner.

Measured by Acc@1, SAPBERT achieves new SOTA with statistical significance on 5 of the 6 datasets and for the dataset (BC5CDR-c) where SAPBERT is not significantly better, it performs on par with SOTA (96.5 vs. 96.6). Interestingly, on scientific language datasets, SAPBERT outperforms SOTA without any task supervision (fine-tuning mostly leads to overfitting and performance drops). On social media language datasets, unsupervised SAPBERT lags behind supervised SOTA by large margins, highlighting the well-documented complex nature of social media language (Baldwin et al., 2013; Limsopatham and Collier, 2015, 2016; Basaldella et al., 2020; Tutubalina et al., 2020). However, after fine-tuning on the social media datasets (using the MS loss introduced earlier), SAPBERT outperforms SOTA significantly, indicating that knowledge acquired during the self-aligning pretraining can be adapted to a shifted domain without much effort.

The ADAPTER Variant. As an option for parameter efficient pretraining, we explore a variant of SAPBERT using a recently introduced training module named ADAPTER (Houlsby et al., 2019). While maintaining the same pretraining scheme with the same SAPBERT online mining + MS loss, instead of training from the full model of PUBMEDBERT, we insert new ADAPTER layers between Transformer layers of the fixed PUBMEDBERT, and only train the weights of these ADAPTER layers. In our experiments, we use the enhanced ADAPTER configuration by Pfeiffer et al. (2020). We include two variants where trained parameters are 13.22% and 1.09% of the full SAPBERT variant. The ADAPTER variant of SAPBERT achieves comparable performance to full-model-tuning in scientific datasets but lags behind in social media datasets, Tab. 1. The results indicate that more parameters are needed in pretraining for knowledge transfer to a shifted domain, in our case, the social media datasets.

The Impact of Online Mining (Eq. (1)). As suggested in Tab. 2, switching off the online hard pairs mining procedure causes a large performance drop in @1 and a smaller but still significant drop in @5. This is due to the presence of many easy and already well-separated samples in the mini-batches. These uninformative training examples dominated the gradients and harmed the learning process.

configuration         @1             @5
Mining switched-on    67.2           80.3
Mining switched-off   52.3 (↓14.9)   76.1 (↓4.2)

Table 2: This table compares PUBMEDBERT+SAPBERT’s performance with and without online hard mining on COMETA (zeroshot general).

Integrating SAPBERT in Existing Systems. SAPBERT can be easily inserted into existing BERT-based MEL systems by initialising the systems with SAPBERT pretrained weights. We use the SOTA scientific language system, BIOSYN (originally initialised with BIOBERT weights), as an example and show the performance is boosted across all datasets (last two rows, Tab. 1).

4 Conclusion

We present SAPBERT, a self-alignment pretraining scheme for learning biomedical entity representations. We highlight the consistent performance boost achieved by SAPBERT, obtaining new SOTA in all six widely used MEL benchmarking datasets. Strikingly, without any fine-tuning on task-specific labelled data, SAPBERT already outperforms the previous supervised SOTA (sophisticated hybrid entity linking systems) on multiple datasets in the scientific language domain. Our work opens new avenues to explore for general domain self-alignment (e.g. by leveraging knowledge graphs such as DBpedia). We plan to incorporate other types of relations (i.e., hypernymy and hyponymy) and extend our model to sentence-level representation learning. In particular, our ongoing work using a combination of SAPBERT and ADAPTER is a promising direction for tackling sentence-level tasks.

Acknowledgements

We thank the three reviewers and the Area Chair for their insightful comments and suggestions. FL is supported by Grace & Thomas C.H. Chan Cambridge Scholarship. NC and MB would like to acknowledge funding from Health Data Research UK as part of the National Text Analytics project.

Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.
A Evaluation Datasets Details
We divide our experimental datasets into two categories (1) scientific language datasets where the data is extracted from scientific papers and (2) social media language datasets where the data is coming from social media forums like
For an overview of the key statistics, see Tab. 3.

released large-scale MEL dataset that specifically focuses on MEL in the social media domain, containing around 20k medical mentions extracted from health-related discussions on
Mentions are mapped to SNOMED-CT. We use the “stratified (general)” split and follow the evaluation protocol of the original paper.
Table 3: This table contains basic statistics of the MEL datasets used in the study. C denotes the set of concepts;
S denotes the set of all surface forms / synonyms of all concepts in C; M denotes the set of mentions / queries.
COMETA (s.g.) and (z.g.) are the stratified (general) and zeroshot (general) split respectively.
Table 4: A list of baselines on the 6 different MEL datasets, including both scientific and social media language ones. The last
row collects reported numbers from the best performing models. “∗” denotes results produced using official released code. “-”
denotes results not reported in the cited paper. “OOM” means out-of-memory.
B.2 Comparing Loss Functions
We use COMETA (zeroshot general) as a benchmark for selecting learning objectives. Note that this split of COMETA is different from the stratified-general split used in Tab. 4. It is very challenging (so differences in performance are easy to see) and also does not directly affect the model’s performance on other datasets. The results are listed in Tab. 6. Note that online mining is switched on for all models here.

loss                                               @1     @5
cosine loss (Phan et al., 2019)                    55.1   64.6
max-margin triplet loss (Basaldella et al., 2020)  64.6   74.6
NCA loss (Goldberger et al., 2005)                 65.2   77.0
Lifted-Structure loss (Oh Song et al., 2016)       62.0   72.1
InfoNCE (Oord et al., 2018; He et al., 2020)       63.3   74.2
Circle loss (Sun et al., 2020)                     66.7   78.7
Multi-Similarity loss (Wang et al., 2019)          67.2   80.3

Table 6: This table compares loss functions used for SAPBERT pretraining. Numbers reported are on COMETA (zeroshot general).

The cosine loss was used by Phan et al. (2019) for learning UMLS synonyms for LSTM models. The max-margin triplet loss was used by Basaldella et al. (2020) for training MEL models. A very similar (though not identical) hinge-loss was used by Schumacher et al. (2020) for clinical concept linking. InfoNCE has been very popular in self-supervised learning and contrastive learning (Oord et al., 2018; He et al., 2020). Lifted-Structure loss (Oh Song et al., 2016) and NCA loss (Goldberger et al., 2005) are two very classic metric learning objectives. Multi-Similarity loss (Wang et al., 2019) and Circle loss (Sun et al., 2020) are two recently proposed metric learning objectives and have been considered as SOTA on large-scale visual recognition benchmarks.

B.3 Details of ADAPTERs
In Tab. 7 we list the number of parameters trained in the three ADAPTER variants along with full-model-tuning for easy comparison.
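To make the relation between an ADAPTER's reduction rate and its trainable-parameter count concrete, here is a rough back-of-the-envelope sketch. It assumes a plain bottleneck adapter (down-projection, non-linearity, up-projection, each with a bias), one adapter per Transformer layer, a hidden size of 768, 12 layers and roughly 110M base-model parameters. All of these are illustrative assumptions, not the configuration reported in this paper, so the resulting ratios are not expected to reproduce the percentages quoted in the main text.

```python
def bottleneck_adapter_params(hidden_size, reduction_rate):
    """Weights and biases of one down-projection plus one up-projection."""
    bottleneck = hidden_size // reduction_rate
    down = hidden_size * bottleneck + bottleneck   # W_down and b_down
    up = bottleneck * hidden_size + hidden_size    # W_up and b_up
    return down + up


def trainable_fraction(hidden_size, num_layers, reduction_rate, base_params):
    """Adapter parameters as a fraction of the (frozen) base model's size."""
    adapters = num_layers * bottleneck_adapter_params(hidden_size, reduction_rate)
    return adapters / base_params


# Assumed BERT-base-like dimensions; every value below is hypothetical.
HIDDEN_SIZE = 768
NUM_LAYERS = 12
BASE_PARAMS = 110_000_000

for rate in (1, 16):
    frac = trainable_fraction(HIDDEN_SIZE, NUM_LAYERS, rate, BASE_PARAMS)
    print(f"reduction rate {rate}: {frac:.2%} of the base model is trained")
```

A larger reduction rate shrinks the bottleneck, so the trained share of the model falls roughly in proportion.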
model URL
vanilla BERT (Devlin et al., 2019)
BIOBERT (Lee et al., 2020)
BLUEBERT (Peng et al., 2019)
CLINICALBERT (Alsentzer et al., 2019)
SCIBERT (Beltagy et al., 2019)
UMLSBERT (Michalopoulos et al., 2020)
PUBMEDBERT (Gu et al., 2020)
Table 5: This table lists the URL of models used in this study.
method reduction rate #params #params in BERT
hardware specification
RAM 192 GB
CPU Intel Xeon W-2255 @3.70GHz, 10-core 20-threads
GPU NVIDIA GeForce RTX 2080 Ti (11 GB) × 4
C Other Details
C.1 The Full Table of Supervised Baseline
The full table of supervised baseline models is provided in Tab. 4.
hyper-parameters                                      search space
learning rate for pretraining & fine-tuning SAPBERT   {1e-4, 2e-5, 5e-5, 1e-5, 1e-6}
pretraining batch size                                {128, 256, 512∗, 1024}
pretraining training iterations                       {10k, 20k, 30k, 40k, 50k (1 epoch)∗, 100k (2 epochs)}
fine-tuning epochs on scientific language datasets    {1, 2, 3∗, 5}
fine-tuning epochs on AskAPatient                     {5, 10, 15∗, 20}
fine-tuning epochs on COMETA                          {5, 10∗, 15, 20}
max_seq_length of BERT tokenizer                      {15, 20, 25∗, 30}
λ in Online Mining                                    {-0.05, -0.1, -0.2∗, -0.3}
α in MS loss                                          {1, 2 (Wang et al., 2019)∗, 3}
β in MS loss                                          {40, 50 (Wang et al., 2019)∗, 60}
in MS loss                                            {0.5∗, 1 (Wang et al., 2019)}
α in max-margin triplet loss                          {0.05, 0.1, 0.2 (Basaldella et al., 2020)∗, 0.3}
softmax scale in NCA loss                             {1 (Goldberger et al., 2005), 5, 10, 20∗, 30}
α in Lifted-Structured loss                           {0.5∗, 1 (Oh Song et al., 2016)}
τ (temperature) in InfoNCE                            {0.07 (He et al., 2020)∗, 0.5 (Oord et al., 2018)}
m in Circle loss                                      {0.25 (Sun et al., 2020)∗, 0.4 (Sun et al., 2020)}
γ in Circle loss                                      {80 (Sun et al., 2020), 256 (Sun et al., 2020)∗}
Table 9: This table lists the search space for the hyper-parameters used. ∗ marks the values used for the reported results.
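For orientation on how the MS-loss entries in Tab. 9 act, here is a minimal sketch of a Multi-Similarity-style loss over one mini-batch, following the general form in Wang et al. (2019). This section does not show the loss equation, so the mapping of Tab. 9's symbols onto the positive and negative terms is an assumption, and the parameters use the neutral, hypothetical names `scale_pos`, `scale_neg` and `offset`. Online hard pair mining is deliberately left out: every same-concept and different-concept pair in the batch contributes.

```python
import math


def multi_similarity_loss(sim, labels, scale_pos, scale_neg, offset):
    """Mean over anchors of a soft positive term plus a soft negative term.

    sim[i][j] is the similarity between batch items i and j;
    labels[i] is the concept ID of item i.
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos = [sim[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [sim[i][j] for j in range(n) if labels[j] != labels[i]]
        # Positives are penalised when their similarity falls below the offset.
        pos_term = math.log1p(
            sum(math.exp(-scale_pos * (s - offset)) for s in pos)
        ) / scale_pos
        # Negatives are penalised when their similarity rises above the offset.
        neg_term = math.log1p(
            sum(math.exp(scale_neg * (s - offset)) for s in neg)
        ) / scale_neg
        total += pos_term + neg_term
    return total / n


def toy_similarity(labels, same, different):
    """Similarity matrix with one value for same-concept pairs, one for the rest."""
    return [[same if a == b else different for b in labels] for a in labels]


labels = ["C1", "C1", "C2", "C2"]  # hypothetical concept IDs
well_separated = toy_similarity(labels, same=0.9, different=0.1)
poorly_separated = toy_similarity(labels, same=0.3, different=0.7)

# Scales and offset chosen to echo starred values in Tab. 9 (mapping assumed).
loss_good = multi_similarity_loss(well_separated, labels, 2.0, 50.0, 0.5)
loss_bad = multi_similarity_loss(poorly_separated, labels, 2.0, 50.0, 0.5)
```

The batch whose synonyms are already close and whose non-synonyms are far apart yields the smaller loss, which is the behaviour a self-alignment objective relies on.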
Figure 3: Same as Fig. 1 in the main text, but generated with a higher resolution.