Research Article
Mol-BERT: An Effective Molecular Representation with BERT for
Molecular Property Prediction
1 Hunan Vocational College of Electronic and Technology, Changsha 410220, China
2 College of Information Science and Engineering, Hunan University, Changsha 410082, China
Copyright © 2021 Juncai Li and Xiaofei Jiang. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques either focus on designing novel molecular representations or on combining them with advanced models. However, researchers have paid less attention to the potential benefit of massive unlabeled molecular data (e.g., ZINC), and the task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models in natural language processing, drug molecules can, to some extent, be naturally viewed as a language. In this paper, we investigate how to adapt the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained to generate embeddings of molecular substructures, using four million unlabeled drug SMILES (i.e., from ZINC 15 and ChEMBL 27). The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct experiments on 4 widely used molecular datasets. In comparison to traditional and state-of-the-art baselines, the results illustrate that Mol-BERT outperforms current sequence-based methods and achieves at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
knowledge. Besides, the generated hash bit vectors make it difficult to biologically interpret the relationship between chemical properties and molecular structures.

Inspired by the remarkable achievements of deep learning in a variety of domains, including computer vision [14] and natural language processing [15, 16], deep learning has also gained considerable attention for molecular property prediction. Existing molecular representation methods can be broadly divided into two categories: sequence-based and graph-based approaches. For sequence-based methods, the simplified molecular-input line-entry system (SMILES) is the most common linear notation, encoding the molecular topology on the basis of chemical rules [17]. Several methods take the SMILES representation as input and apply successful sequence models (e.g., recurrent neural networks) to obtain molecular representations [18], but this line of work suffers from insufficient labeled data for specific molecular tasks. More recently, researchers have adopted unsupervised pretraining strategies from natural language processing (NLP) to learn contextual information from large unlabeled molecular datasets. For example, an unsupervised machine learning method named Mol2vec was developed to learn vector representations of molecular substructures [19], and SMILES-BERT was proposed to pretrain a model through a masked SMILES recovery task using attention-based transformer layers [20]. These pretrained methods focus on the contextual information of molecular sequences, but they hardly consider the molecular substructures (e.g., functional groups) that contribute essentially to molecular properties [21, 22].

On the other hand, graph neural networks (GNNs) have been adopted to explore graph-based representations for molecular property prediction [23–25]. Graph convolutions were the first work to apply convolutional layers to encode molecular graphs into neural fingerprints [26]. Since then, many efforts have been made to extend a variety of GNNs to property prediction tasks. For example, the weave featurization encodes chemical features to form molecule-level representations [27], and some methods extend graph attention networks [28] to learn aggregation weights [25, 29]. Moreover, to better encode the interactions between atoms, a message passing neural network named MPNN was designed to utilize the attributed features of both atoms and edges [30]. More recently, DMPNN [31] and CMPNN [32] were introduced to further leverage the attributed information of nodes and edges during message passing. Although graph-based models have achieved great performance on molecular graph representation, they seldom make use of the vast amount of available biological sequence data.

Recently, substantial pretrained models [33–37] trained on large corpora of unlabeled data have been shown to learn universal representations that benefit various downstream tasks, including protein sequence representation [38, 39], biomedical text mining [40, 41], and chemical reaction prediction [42]. Advances in pretrained models have demonstrated their powerful ability to extract information from unlabeled sequences, which raises a tantalizing question: can we develop a pretrained model to extract useful molecular substructure information from massive SMILES sequence datasets? To help solve this problem, we propose a novel framework, named Mol-BERT, tailored for molecular property prediction. The idea behind Mol-BERT is natural and intuitive. Our framework consists of three modules. The feature extractor first extracts atom-level and substructure features centered on each atom; this module can be replaced with a wide range of molecular representation methods. Then, the pretrained BERT module learns molecular substructure or fragment information from a large pretraining corpus (i.e., unlabeled SMILES sequences). The final module predicts the specific molecular property after fine-tuning the pretrained Mol-BERT via a multityped classifier. To illustrate the performance of the proposed method on various prediction tasks, Mol-BERT is fine-tuned and evaluated on 4 widely used molecular benchmark datasets. In comparison to state-of-the-art baselines (i.e., sequence- and graph-based methods), the experimental results prove the effectiveness of our proposed Mol-BERT.

This paper is organized as follows. Section 2 introduces the preprocessed corpus for Mol-BERT pretraining and the molecular benchmark datasets used in this work. Section 3 presents the molecular representation method and the pretraining and fine-tuning of the Mol-BERT model. Section 4 analyzes the prediction performance of our proposed method on several molecular datasets and compares it with state-of-the-art sequence-based and graph-based approaches. Finally, the conclusions of this work are summarized in Section 5.

2. Materials

The corpus of chemical compounds (i.e., unlabeled SMILES) was obtained from the publicly available ZINC and ChEMBL databases. As a free database for virtual screening, ZINC contains over 230 million purchasable compounds in multiple formats, including ready-to-dock 3D structures [43]. ChEMBL is a manually curated database of bioactive molecules with drug-like properties, which collects 1,961,462 distinct compounds [44]. Specifically, we selected compound SMILES from ZINC version 15 and ChEMBL version 27 and filtered them by following the same criteria as Mol2vec [19]. The two databases were first merged and duplicates were removed. Then, only compound SMILES that could be processed by the RDKit software [45] were kept, and they were filtered according to the following cutoffs and criteria: molecular weight between 12 and 600; heavy-atom count between 3 and 50; clogP between -5 and 7; and only H, B, C, N, O, F, P, S, Cl, and Br atoms allowed. Additionally, all counterions and solvents were removed, and canonical SMILES representations were generated with RDKit. This procedure finally yielded 4 million compounds. Detailed information on the pretraining corpus is provided in the Supplementary Materials (available here).
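A minimal RDKit sketch of this filtering step is shown below. The thresholds follow the criteria listed above, but the function and variable names (e.g., keep_compound, raw_smiles) are illustrative rather than taken from the authors' pipeline, and counterion/solvent stripping is omitted for brevity.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

ALLOWED_ATOMS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br"}

def keep_compound(smiles):
    """Return the RDKit canonical SMILES if the compound passes the corpus filters, else None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # not parseable by RDKit
        return None
    if not 12 <= Descriptors.MolWt(mol) <= 600:        # molecular weight cutoff
        return None
    if not 3 <= mol.GetNumHeavyAtoms() <= 50:          # heavy-atom count cutoff
        return None
    if not -5 <= Crippen.MolLogP(mol) <= 7:            # clogP cutoff
        return None
    if any(atom.GetSymbol() not in ALLOWED_ATOMS for atom in mol.GetAtoms()):
        return None
    return Chem.MolToSmiles(mol, canonical=True)       # canonical SMILES via RDKit

# Stand-in for the merged ZINC 15 + ChEMBL 27 SMILES list
raw_smiles = ["CC(N)C(=O)O", "CCO", "[Na+].[Cl-]"]
corpus = sorted({s for s in map(keep_compound, raw_smiles) if s})
```

In this toy example, the salt "[Na+].[Cl-]" is rejected by the heavy-atom cutoff, while the two small organic molecules are kept and canonicalized.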
In this paper, we selected 4 widely used benchmark datasets from MoleculeNet [13] to evaluate the performance of our proposed method. SMILES strings were used to encode the input chemical compounds in all benchmark datasets. The benchmark datasets are introduced as follows:

(i) BBBP. The BBBP dataset provides 2,053 compounds with measured permeability properties for predicting blood-brain barrier permeability

(ii) Tox21. The Tox21 dataset measures 8,014 compounds with their corresponding toxicity data against 12 targets. Toxicity is recorded as a binary label per target: 1 if the compound is toxic against the specific target and 0 otherwise

(iii) SIDER. The SIDER dataset contains a total of 1,427 compounds and their adverse drug reactions (ADR) against 27 system-organ classes. Each ADR is described as a binary label

(iv) ClinTox. The ClinTox dataset provides 2 classification tasks for 1,491 drug compounds with known chemical structures, covering clinical trial toxicity and FDA approval status

We followed the experimental setting of FP2VEC [46] and split each dataset into training, validation, and test sets with a ratio of 8/1/1. Table 1 shows the detailed description of the selected benchmark datasets. Note that binary and multilabel correspond to binary and multilabel classification tasks, respectively. Random splitting assigns the samples to the training, validation, and test subsets at random, whereas scaffold splitting partitions them on the basis of their 2D structural frameworks, as implemented in the RDKit software.

Table 1: The detailed description of selected benchmark datasets.

Dataset    Category     Compounds   Tasks   Task type    Split method
BBBP       Physiology   2,053       1       Binary       Scaffold
Tox21      Physiology   8,014       12      Multilabel   Scaffold
SIDER      Physiology   1,427       27      Multilabel   Scaffold
ClinTox    Physiology   1,491       2       Multilabel   Scaffold
3. Methods

In this section, we first give an overview of the proposed Mol-BERT and then separately introduce its three modules, which we refer to as the feature extractor, pretraining, and fine-tuning of Mol-BERT, respectively.

3.1. Overview. Figure 1 illustrates the overall process of Mol-BERT. As shown in Figure 1, Mol-BERT consists of three modules: the feature extractor, pretraining, and fine-tuning of Mol-BERT. The framework learns to predict molecular properties as follows. Given the input drug data (i.e., a canonical SMILES), the feature extractor module applies the molecular representation to transform it into a set of atom identifiers (detailed in Feature Extractor). The outputs are then fed into a BERT module to obtain a contextual embedding of each molecular substructure by pretraining BERT on the vast preprocessed corpus (detailed in Pretraining Mol-BERT). Finally, the fine-tuned Mol-BERT outputs a value indicating the probability of a certain molecular property in classification tasks (detailed in Fine-Tuning Mol-BERT).

3.2. Feature Extractor. The molecular substructure is an important cue for molecular interactions [21, 22]. Therefore, the key idea behind Mol-BERT is to obtain a better representation of molecular substructures by pretraining BERT on vast unlabeled SMILES sequences. Inspired by Mol2vec [19], which treats molecular substructures or fragments derived from the Morgan algorithm as "words" and compounds as "sentences," we adopt a similar method to decompose the input SMILES sequences into biological words and sentences.

To achieve this, given an input compound SMILES string, we first obtain its standardized, canonical SMILES representation S generated by RDKit. Then, the Morgan algorithm [11] is used to generate all atom identifiers with radius 0 and 1, denoted by A_i^0 and A_i^1, respectively, where the subscript i is the index of each atom. As illustrated in the left part of Figure 1, A_i^0 (green nodes) represents the current atom traversed in atom order, while A_i^1 (Kelly-green nodes) represents the set of neighboring atoms directly connected to the current atom, so A_i^1 can be viewed as a kind of substructure or fragment. The identifiers A_i are then hashed into a fixed-length vector. Take CC(N)C(=O)O as an example: it consists of six atoms, and we obtain its atom identifiers A_1^0 to A_6^0 and the corresponding substructures A_1^1 to A_6^1, which are hashed into fixed-length vectors (e.g., A_1^1 corresponds to 3537119591). Finally, all vectors of the Morgan substructures are summed to obtain the molecular representation. In this way, we generate 119 atom identifiers at radius 0 and 13,325 substructure identifiers at radius 1. The feature extractor module in Mol-BERT can be replaced with various molecular representation methods; for example, FP2VEC [46] can be used as the feature extractor to generate the 1024-bit Morgan (or circular) fingerprint with a predefined radius value.
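To make these substructure "words" concrete, the sketch below extracts per-atom Morgan identifiers at radius 0 and 1 with RDKit's bitInfo mechanism, in the spirit of Mol2vec. The helper name and the atom-then-radius ordering are our own choices, not taken from the Mol-BERT code.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def atom_identifier_sentence(smiles, max_radius=1):
    """Return the Morgan identifiers of every atom, ordered by atom index and then radius."""
    mol = Chem.MolFromSmiles(smiles)
    info = {}
    # Unfolded Morgan fingerprint; info maps identifier -> ((atom_idx, radius), ...)
    AllChem.GetMorganFingerprint(mol, max_radius, bitInfo=info)

    per_atom = {}
    for identifier, occurrences in info.items():
        for atom_idx, radius in occurrences:
            per_atom[(atom_idx, radius)] = identifier

    return [per_atom[(a, r)]
            for a in range(mol.GetNumAtoms())
            for r in range(max_radius + 1)
            if (a, r) in per_atom]

# Alanine example from the text: six heavy atoms, radius-0 and radius-1 identifiers
print(atom_identifier_sentence("CC(N)C(=O)O"))
```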
[Figure 1: The overall Mol-BERT framework. An input molecule is decomposed into atom identifiers (A1-A6 in the example), which are embedded (E1-En) and passed through the pretrained and fine-tuned Mol-BERT to predict the molecular property.]
3.3. Pretraining Mol-BERT. As a contextualized word representation model, BERT [33] adopts a masking technique to predict randomly masked words in a sequence, which leads to learning bidirectional representations. Therefore, Mol-BERT also uses a masked SMILES task to predict randomly masked substructures (i.e., atom identifiers) in a SMILES string. Different from the traditional way of pretraining language models in NLP, where BERT was trained on English Wikipedia and BooksCorpus, we pretrain Mol-BERT on our preprocessed corpus obtained from the ZINC version 15 and ChEMBL version 27 databases. Specifically, the input SMILES is transformed into a list of atom identifiers A_i by the feature extractor module, rather than into character-level tokens of the SMILES string as in [20], and these identifiers are embedded as the input of the BERT module for pretraining. We initialize Mol-BERT with weights from BERT [33] and, following the same procedure, randomly mask 15% of the tokens (i.e., atom identifiers) in a SMILES with the [MASK] token. The tokens are embedded into feature vectors. We use only token embeddings and positional embeddings, since only the Masked Language Model (MLM) task is adopted in this paper. The proposed Mol-BERT differs from BERT in two ways: (1) Mol-BERT adopts a single masked SMILES task (i.e., MLM) on large-scale unlabeled datasets, while BERT uses two kinds of self-supervised tasks on English Wikipedia and BooksCorpus, and (2) we exclude the segment embedding adopted in the BERT model, since Mol-BERT does not require training on consecutive sentences.
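A minimal PyTorch sketch of the 15% masking step is shown below. The vocabulary size and the [MASK] id are placeholders chosen for illustration (roughly the paper's dictionary of 13,325 identifiers plus special tokens), not values taken from the released implementation.

```python
import torch

MASK_ID = 1          # placeholder id for the [MASK] token
VOCAB_SIZE = 13329   # assumed: 13,325 atom identifiers + a few special tokens

def mask_tokens(input_ids, mask_prob=0.15):
    """Randomly replace mask_prob of the atom-identifier tokens with [MASK] for MLM pretraining."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~masked] = -100               # ignored by cross-entropy loss
    masked_ids = input_ids.clone()
    masked_ids[masked] = MASK_ID         # replace selected positions with [MASK]
    return masked_ids, labels

# Example: a batch of two "sentences" of atom identifiers already mapped to ids
batch = torch.randint(2, VOCAB_SIZE, (2, 12))
masked_batch, mlm_labels = mask_tokens(batch)
```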
3.4. Fine-Tuning Mol-BERT. After pretraining on the vast collection of unlabeled SMILES compounds, Mol-BERT can be applied to molecular property prediction on various downstream tasks with minimal modification of hyperparameters. We mostly follow the same architecture, optimization, and hyperparameter choices used in [8]. For classification tasks (e.g., BBBP and Tox21), we feed the final BERT vector into a linear classification layer to predict the molecular property, and a simple classifier outputs the binary value. The labeled samples are then used to fine-tune the model. Mol-BERT feeds the learned drug embeddings into a multityped MLP classifier to generate predictions. Output scores include continuous values, such as solubility, as well as binary outputs indicating whether a molecule is toxic or nontoxic. The multityped classifier detects whether the task is regression or classification and switches to the corresponding loss function and evaluation metrics. For regression, we use the mean square error (MSE) as the loss function and the root mean square error (RMSE) as the performance metric; for classification, we use binary cross entropy as the loss function and the area under the receiver operating characteristic curve (ROC-AUC) as the performance metric. In both cases, the loss is computed against the ground-truth labels in the training dataset.
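The sketch below shows one way such a task-switching prediction head could look on top of a pretrained encoder. The module and attribute names are illustrative assumptions, not taken from the Mol-BERT code, and the two-layer MLP stands in for the paper's "multityped MLP classifier".

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """Feeds a pooled Mol-BERT embedding into an MLP and picks the loss by task type."""
    def __init__(self, hidden_dim, num_tasks, task_type="classification"):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_tasks),
        )
        # BCE with logits for (multi-label) classification, MSE for regression
        self.loss_fn = nn.BCEWithLogitsLoss() if task_type == "classification" else nn.MSELoss()

    def forward(self, pooled_embedding, labels=None):
        logits = self.mlp(pooled_embedding)
        if labels is None:
            return logits
        return logits, self.loss_fn(logits, labels.float())

# Example with a 300-dimensional embedding (Table 2) and the 12 Tox21 tasks
head = PropertyHead(hidden_dim=300, num_tasks=12)
logits, loss = head(torch.randn(8, 300), labels=torch.randint(0, 2, (8, 12)))
```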
4. Results and Discussion

In this section, we first introduce the experimental settings and then demonstrate the performance of our proposed Mol-BERT in comparison to state-of-the-art methods for molecular property prediction on 4 widely used benchmark datasets.

4.1. Baseline Methods. We compare Mol-BERT with several state-of-the-art sequence-based and graph-based baselines, which can be categorized as follows:

(i) ECFP: extended-connectivity fingerprints (ECFP) [11] are a widely used type of circular or Morgan fingerprint for encoding the substructures in a molecule

(ii) GraphCov: graph convolutions were proposed by [26] to apply convolutional networks for learning molecular fingerprints; here, we term this model GraphCov
(iii) Weave: similar to GraphCov, the weave featurization [27] encodes meaningful features of atoms, bonds, and graph distances between matching pairs to form molecule-level representations

(iv) MPNN: a message passing method proposed to operate on undirected graphs [30]

(v) FP2VEC: based on the Morgan or circular fingerprint, it encodes a molecule as trainable vectors [46]

(vi) SMILES-BERT: a semisupervised BERT model that takes the SMILES representation as input [20]

We report the results of ECFP, GraphCov, Weave, and FP2VEC as published in FP2VEC [46], and we reimplemented MPNN and SMILES-BERT. MPNN [30] is a graph-based model that considers edge features during message passing, and SMILES-BERT [20] is a sequence-based model that relies entirely on transformer layers and attention mechanisms to encode compound SMILES. These reimplementations rely on the public code, and the model settings were kept the same as reported in the original papers.

4.2. Evaluation Metrics. We applied the area under the receiver operating characteristic curve (ROC-AUC) as the metric for classification tasks. Following [46], we train the prediction model on the training set and optimize it based on the ROC-AUC on the validation set; the prediction results are then measured with the optimized models on the test set. For all experiments in this paper, we repeated the same procedure on each task 5 times and report the mean and standard deviation of the ROC-AUC scores. Besides, we evaluated all models under the scaffold splitting method, as reported by [46].
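As a small illustration of this protocol, the snippet below computes the mean and standard deviation of ROC-AUC over five repeated runs with scikit-learn. It assumes hypothetical per-run prediction arrays and is not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_std_auc(runs):
    """runs: list of (y_true, y_score) pairs from 5 independently trained models."""
    scores = [roc_auc_score(y_true, y_score) for y_true, y_score in runs]
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical example with 5 repetitions of one binary task
rng = np.random.default_rng(0)
runs = [(rng.integers(0, 2, 100), rng.random(100)) for _ in range(5)]
print(mean_std_auc(runs))
```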
4.3. Implementation Details. To optimize all trainable parameters, we adopt the Adam optimizer for both pretraining and fine-tuning. A dynamic learning rate schedule is used to adjust the learning rate during pretraining and fine-tuning according to the downstream task. We implement Mol-BERT in PyTorch and use 3 NVIDIA GTX 1080Ti GPUs to pretrain it; all fine-tuning tasks are run on a single NVIDIA GTX 2080Ti GPU. Table 2 lists all the hyperparameters of the fine-tuning model.

Table 2: The fine-tuning hyperparameters.

Parameter                                   Value/range
Learning rate                               1e-5 ~ 1e-3
Batch size                                  8
Epochs                                      100
Optimizer                                   Adam
Embedding dimension                         300
Size of dictionary                          13,325
Number of attention heads                   6
Layers of fully connected neural network    6

Table 3: The metric scores of the test set on the BBBP, Tox21, SIDER, and ClinTox datasets.

4.4. Comparison Results. To examine the competitiveness of the proposed model, we compared Mol-BERT with state-of-the-art models for molecular property prediction on classification tasks. Table 3 reports the mean and standard deviation of the ROC-AUC scores on the BBBP, SIDER, Tox21, and ClinTox datasets. From this table, we can observe that the proposed Mol-BERT significantly outperforms the baselines on three datasets: Tox21, SIDER, and ClinTox. More specifically, Mol-BERT achieved a ROC-AUC at least 2.9% higher on Tox21, 2.2% higher on SIDER, and 4.4% higher on ClinTox than the baselines. For example, on the Tox21 dataset, Mol-BERT achieved a ROC-AUC score of 0.839, a 2.9% absolute gain over ECFP (the second-best method). This is because Mol-BERT leverages a molecular representation pretrained on large-scale unlabeled SMILES sequences, whereas ECFP relies heavily on feature engineering. Compared with graph-based methods that explore molecular graph features, the proposed Mol-BERT outperformed them on three datasets and achieved performance comparable to MPNN on the BBBP dataset. This is because the contextual information learned from large unlabeled datasets considerably benefits model performance. Moreover, in comparison to the sequence-based pretrained model SMILES-BERT, our proposed Mol-BERT achieved stable performance across all datasets. This is a very encouraging result; the reason could be that our molecular representation accounts for the structural features of molecular substructures, which benefits performance. Overall, this is a nontrivial achievement for molecular property prediction.

5. Conclusions

In this paper, we proposed an effective molecular representation method with a pretrained BERT model, named Mol-BERT, for molecular property prediction. Mol-BERT leverages a representation of molecular substructures pretrained on a large-scale unlabeled SMILES dataset, which is able to learn both the structural and contextual information of drugs.
We implemented the proposed method and conducted experimental comparisons on four widely used benchmarks. The experimental results show that Mol-BERT outperforms the classic and state-of-the-art graph-based models on molecular property prediction.

While our proposed method achieves good performance on classification tasks, some limitations remain to be overcome. First, our method achieves relatively poorer performance on regression tasks, mainly owing to the small number of samples in those datasets (e.g., FreeSolv). We would like to investigate metalearning strategies for data augmentation, which have achieved great success in natural language processing. Second, molecular property prediction is only the primary step in drug discovery; we will continue to improve our method and investigate subsequent prediction tasks (e.g., protein-protein interactions and drug-disease associations) in the future.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request (https://github.com/cxfjiang/MolBERT).

Conflicts of Interest

The authors declare no competing financial interest.

Supplementary Materials

The pretraining corpus is available at https://drive.google.com/drive/folders/1ST0WD1-hX9XtiPWwCceZbgZlBV0fKPbe. (Supplementary Materials)

References

[1] S. Ekins, A. C. Puhl, K. M. Zorn et al., "Exploiting machine learning for end-to-end drug discovery and development," Nature Materials, vol. 18, no. 5, pp. 435-441, 2019.
[2] X. Lin, Z. Quan, Z. J. Wang, H. Huang, and X. Zeng, "A novel molecular representation with BiGRU neural networks for learning atom," Briefings in Bioinformatics, vol. 21, no. 6, pp. 2099-2111, 2020.
[3] X. Lin, Z. Quan, Z. J. Wang, T. Ma, and X. Zeng, "KGNN: knowledge graph neural network for drug-drug interaction prediction," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 2739-2745, Yokohama, Japan, 2020.
[4] B. K. Shoichet, "Virtual screening of chemical libraries," Nature, vol. 432, no. 7019, pp. 862-865, 2004.
[5] S. Pushpakom, F. Iorio, P. A. Eyers et al., "Drug repurposing: progress, challenges and recommendations," Nature Reviews Drug Discovery, vol. 18, no. 1, pp. 41-58, 2019.
[6] Z. Quan, Y. Guo, X. Lin, Z. J. Wang, and X. Zeng, "GraphCPI: graph neural representation learning for compound-protein interaction," in 2019 IEEE International Conference on Bioinformatics and Biomedicine, pp. 717-722, San Diego, CA, USA, 2019.
[7] Y. Zhou, Y. Hou, J. Shen, Y. Huang, W. Martin, and F. Cheng, "Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2," Cell Discovery, vol. 6, no. 1, pp. 1-18, 2020.
[8] D. S. Cao, Q. S. Xu, Q. N. Hu, and Y. Z. Liang, "ChemoPy: freely available python package for computational biology and chemoinformatics," Bioinformatics, vol. 29, no. 8, pp. 1092-1094, 2013.
[9] A. Mauri, V. Consonni, M. Pavan, and R. Todeschini, "Dragon software: an easy approach to molecular descriptor calculations," Match, vol. 56, no. 2, pp. 237-248, 2006.
[10] H. Moriwaki, Y. S. Tian, N. Kawashita, and T. Takagi, "Mordred: a molecular descriptor calculator," Journal of Cheminformatics, vol. 10, no. 1, p. 4, 2018.
[11] D. Rogers and M. Hahn, "Extended-connectivity fingerprints," Journal of Chemical Information and Modeling, vol. 50, no. 5, pp. 742-754, 2010.
[12] R. C. Glen, A. Bender, C. H. Arnby, L. Carlsson, S. Boyer, and J. Smith, "Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME," IDrugs, vol. 9, no. 3, p. 199, 2006.
[13] Z. Wu, B. Ramsundar, E. N. Feinberg et al., "MoleculeNet: a benchmark for molecular machine learning," Chemical Science, vol. 9, no. 2, pp. 513-530, 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, Las Vegas, United States, 2016.
[15] C. Xia, C. Zhang, X. Yan, Y. Chang, and P. S. Yu, "Zero-shot user intent detection via capsule neural networks," 2018, https://arxiv.org/abs/1809.00385.
[16] J. Yin, C. Gan, K. Zhao, X. Lin, Z. Quan, and Z. J. Wang, "A novel model for imbalanced data classification," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 4, pp. 6680-6687, 2020.
[17] D. Weininger, A. Weininger, and J. L. Weininger, "SMILES. 2. Algorithm for generation of unique SMILES notation," Journal of Chemical Information and Computer Sciences, vol. 29, no. 2, pp. 97-101, 1989.
[18] Z. Xu, S. Wang, F. Zhu, and J. Huang, "Seq2seq fingerprint: an unsupervised deep molecular embedding for drug discovery," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 285-294, New York, NY, USA, 2017.
[19] S. Jaeger, S. Fulle, and S. Turk, "Mol2vec: unsupervised machine learning approach with chemical intuition," Journal of Chemical Information and Modeling, vol. 58, no. 1, pp. 27-35, 2018.
[20] S. Wang, Y. Guo, Y. Wang, H. Sun, and J. Huang, "SMILES-BERT: large scale unsupervised pre-training for molecular property prediction," in Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 429-436, New York, NY, USA, 2019.
[21] K. Huang, C. Xiao, T. Hoang, L. Glass, and J. Sun, "CASTER: predicting drug interactions with chemical substructure representation," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 1, pp. 702-709, 2020.
[22] R. B. Silverman and M. W. Holladay, The Organic Chemistry of Drug Design and Drug Action, Academic Press, 2014.
[23] K. Schütt, P. J. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K. R. Müller, "SchNet: a continuous-filter convolutional neural network for modeling quantum interactions," Advances in Neural Information Processing Systems, pp. 991-1001, 2017, https://arxiv.org/abs/1706.08566.
[24] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, "Quantum-chemical insights from deep tensor neural networks," Nature Communications, vol. 8, no. 1, pp. 1-8, 2017.
[25] Z. Xiong, D. Wang, X. Liu et al., "Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism," Journal of Medicinal Chemistry, vol. 63, no. 16, pp. 8749-8760, 2020.
[26] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre et al., "Convolutional networks on graphs for learning molecular fingerprints," Advances in Neural Information Processing Systems, pp. 2224-2232, 2015, https://arxiv.org/abs/1509.09292.
[27] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, "Molecular graph convolutions: moving beyond fingerprints," Journal of Computer-Aided Molecular Design, vol. 30, no. 8, pp. 595-608, 2016.
[28] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," 2017, https://arxiv.org/abs/1710.10903.
[29] S. Ryu, J. Lim, S. H. Hong, and W. Y. Kim, "Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network," 2018, https://arxiv.org/abs/1805.10988.
[30] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in International Conference on Machine Learning, PMLR, 2017.
[31] K. Yang, K. Swanson, W. Jin et al., "Are learned molecular representations ready for prime time?," [Ph.D. thesis], Massachusetts Institute of Technology, 2019.
[32] Y. Song, S. Zheng, Z. Niu, Z. H. Fu, Y. Lu, and Y. Yang, "Communicative representation learning on attributed molecular graphs," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 2831-2838, Yokohama, Japan, 2020.
[33] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, vol. 1, pp. 4171-4186, Minneapolis, United States, 2019.
[34] W. Hu, B. Liu, J. Gomes et al., "Strategies for pre-training graph neural networks," 2019, https://arxiv.org/abs/1905.12265.
[35] K. Li, Y. Zhong, X. Lin, and Z. Quan, "Predicting the disease risk of protein mutation sequences with pre-training model," Frontiers in Genetics, vol. 11, p. 1535, 2020.
[36] B. Song, Z. Li, X. Lin, J. Wang, T. Wang, and X. Fu, "Pretraining model for biological sequence data," Briefings in Functional Genomics, vol. 20, no. 3, pp. 181-195, 2021.
[37] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[38] S. Min, S. Park, S. Kim, H. S. Choi, and S. Yoon, "Pre-training of deep bidirectional protein sequence representations with structural information," 2019, https://arxiv.org/abs/1912.05625.
[39] R. Rao, N. Bhattacharya, N. Thomas et al., "Evaluating protein transfer learning with TAPE," in Advances in Neural Information Processing Systems, 2019.
[40] K. Huang, J. Altosaar, and R. Ranganath, "ClinicalBERT: modeling clinical notes and predicting hospital readmission," 2019, https://arxiv.org/abs/1904.05342.
[41] J. Lee, W. Yoon, S. Kim et al., "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234-1240, 2020.
[42] P. Schwaller, T. Laino, T. Gaudin et al., "Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction," ACS Central Science, vol. 5, no. 9, pp. 1572-1583, 2019.
[43] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman, "ZINC: a free tool to discover chemistry for biology," Journal of Chemical Information and Modeling, vol. 52, no. 7, pp. 1757-1768, 2012.
[44] D. Mendez, A. Gaulton, A. P. Bento et al., "ChEMBL: towards direct deposition of bioassay data," Nucleic Acids Research, vol. 47, no. D1, pp. D930-D940, 2019.
[45] J. Woosung and K. Dongsup, RDKit: Open-Source Cheminformatics, 2006, https://www.rdkit.org.
[46] W. Jeon and D. Kim, "FP2VEC: a new molecular featurizer for learning molecular properties," Bioinformatics, vol. 35, no. 23, pp. 4979-4985, 2019.