Abstract
Without knowledge of specific pockets, generating ligands based on the global structure of a protein target plays a crucial role in drug discovery as it helps reduce the search space for potential drug-like candidates in the pipeline. However, contemporary methods require optimizing tailored networks for each protein, which is arduous and costly. To address this issue, we introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with desirable properties, including high binding affinity and high synthesizability, for arbitrary target proteins, guided by a multimodal deep neural network built on geometric and sequence models, named Protein Multimodal Network (PMN), as the prior for the generative model. PMN unifies different representations of proteins (e.g. the primary structure, i.e. the sequence of amino acids, the 3D tertiary structure, and the residue-level graph) into a single representation. Our multimodal architecture learns from the entire protein structure and is able to capture its sequential, topological, and geometrical information by utilizing language modeling, graph neural networks, and geometric deep learning. We showcase the superiority of our approach by conducting extensive experiments and evaluations, including predicting protein-ligand binding affinity on the PDBBind v2020 dataset as well as assessing generative model quality, generating ligands for unseen targets, and computing docking scores. Empirical results demonstrate the promising and competitive performance of our proposed approach. Our software package is publicly available at https://github.com/HySonLab/Ligand_Generation.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Drug discovery is a complex and expensive process that involves multiple stages and often takes years of development, with costs running into billions of dollars [1]. The first stage is to design novel drug-like compounds with high binding affinities to target proteins. This process consists of two sub-tasks: searching for candidates and measuring drug-target affinities (DTA). Searching for potential candidates in a huge database of roughly $10^{33}$ chemically valid molecules is a daunting task, as current methods often rely on virtual screening, professional software, and expert evaluation [2, 3]. Moreover, DTA is a critical measurement for identifying potential candidates and discarding those that are unlikely to succeed in clinical trials. The most reliable technique for predicting DTA involves atomistic molecular dynamics simulations. However, these methods are computationally expensive and time-consuming, making them infeasible for large-scale sets of protein-ligand complexes. Our ultimate objective is to accelerate and automate these two sub-tasks in the first stage of the drug development process using computational methods and machine-learning techniques.
Deep generative models (DGMs) have been proposed as a promising approach to reducing the workload of wet-lab experiments in drug discovery by effectively designing probable drug-like candidates [2–11]. DGMs have demonstrated remarkable results in unconditionally generating molecules or optimizing molecular properties such as drug-likeness (QED) and synthetic accessibility (SA). However, they are prohibitively slow when enhancing binding affinity or other computationally expensive molecular properties. This is because they need to be trained in reinforcement learning frameworks where the generated molecular graph is modified based on a reward function determined by calling a property network that estimates the binding affinities. While effective and powerful, these approaches require specific property networks to be trained for each target protein, which is challenging due to the vast number of known and unknown proteins [3, 12]. Moreover, binding scores (i.e. labels for supervised learning) are not widely available and are time-consuming to approximate with software such as AutoDock Vina [13]. Additionally, several existing studies [14–16] incorporate prior knowledge of protein structures to guide conditional generative models toward bioactive molecules. However, these methods rely solely on the information of specific binding sites of target proteins and are limited when the binding sites are not determined. It is worth noting that determining binding sites, also known as pockets, on a target protein (receptor) is a challenging problem that involves many constraints [17].
Proteins are macromolecules that can be represented as sequences of amino acids (i.e. the primary structure), as 2D graphs at the residue level constructed from nearest neighbors in the folded structure (i.e. the tertiary structure), or as 3D point clouds at the atom level. Recent advanced methods for protein representation learning leverage language models, graph neural networks (GNNs), and convolutional neural networks (CNNs). In sequence-based methods [18–22], a protein sequence is regarded as a long sequence of tokens (i.e. k-mers [18]) that are fed to a transformer-based language model. In contrast, GNN- and CNN-based approaches [23–27] operate on relational and geometric structures of proteins, respectively. While sequence-based approaches can capture the relationships among distant residues in a long protein sequence, they are unable to exploit the geometric relations among them. On the other hand, although GNNs and CNNs can learn spatial information about protein structures, they are limited in their ability to capture long-range interactions in large protein structures due to their reliance on locality. Recent years have also seen the rise of pre-trained large language models for scientific discovery, which have achieved remarkable results in protein science [28, 29]. This motivates future work to apply language models to many downstream tasks in drug discovery, where receptor representations are essential for the search for novel ligands.
1.1. Contributions
This work aims at developing a data-driven approach that can help accelerate the drug discovery process. We develop a conditional DGM to generate drug-like ligands that can bind to a target protein when its binding sites are unknown. This approach facilitates the process of determining binding sites in drug discovery, which is computationally expensive. Furthermore, we build a multimodal protein network to leverage multiple data types of proteins. In summary, our contributions are two-fold as follows:
- We build a conditional VAE model, named TargetVAE, that can generate chemically valid, drug-like molecules with high binding scores to an arbitrarily given protein structure. Unlike other methods, ours can directly condition on the entire structure of any protein target and design multiple candidates that can bind to it, without requiring the training of a specific property network for each target. It is important to note that our generative model is conditioned on the whole protein structure rather than a specific pocket or binding site (i.e. a region of the protein surface where the ligand binds) as in the works of Luo et al [16], Guan et al [30], Peng et al [31], and Liu et al [32]. Our approach is more flexible because a protein complex can contain multiple binding sites and therefore potentially accommodate different ligands. TargetVAE allows us to efficiently generate ligand candidates with high binding affinity without prior knowledge of any binding site.
- We design a novel architecture, named Protein Multimodal Network (PMN), that unifies different modalities of proteins, i.e. the sequence of amino acids and the 3D tertiary structure, to improve the performance of predicting binding affinities. The proposed model shows competitive results on the PDBBind v2020 benchmark in comparison with current state-of-the-art methods. Our novel multimodal architecture enables us to efficiently produce a protein embedding that serves as the prior for our generative model (i.e. TargetVAE), accurately estimates protein-ligand binding affinity, and can potentially replace computationally expensive molecular dynamics simulations in the evaluation of ligand generation.
2. Related work
2.1. Protein-ligand binding prediction
Machine learning methods, especially GNNs, have emerged as effective techniques for binding affinity prediction [33]. Using a Kronecker regularized least squares approach (KronRLS), Nascimento et al [34] cast the problem as a link prediction task on bipartite networks and compute distinct kernels that indicate the similarities among drugs and targets to make predictions. Going beyond binary prediction, He et al [35] introduce a framework named SimBoost that can predict continuous binding affinity scores between drugs and targets. In the deep learning era, Öztürk et al [36] propose DeepDTA, a deep-learning-based method that uses CNNs to operate on sequence representations of drugs and protein targets. Moreover, Zhao et al [37] use generative adversarial networks to learn better feature representations for compounds and proteins. On the other hand, Nguyen et al [38] and Voitsitskyi et al [39] utilize GNNs to learn representations of molecular graphs and protein structures, which are superior to previous methods operating on sequences and handcrafted features. However, a common limitation of the above approaches is that none of them covers a wide range of protein representations. Indeed, our proposed method is the first multimodal model combining the sequence, graph, and spatial information of proteins altogether.
2.2. Molecule generation
Previous studies on molecule generation are mostly categorized into SMILES-based and graph-based approaches [40]. Gómez-Bombarelli et al [41–43] use recurrent neural networks to build generative models operating on SMILES strings. However, these SMILES-based approaches often generate chemically invalid molecules. Kusner et al [44] and Dai et al [45] circumvent this issue by augmenting the decoders with grammar and semantic constraints so that only valid molecules are generated, yet this added information does not fully capture chemical validity in the latent space. Apart from SMILES-based methods, several works [5–10, 46, 47] propose graph generative models to design novel drug-like molecules. For example, Jin et al [5, 6] can generate molecules with 100% validity, but their methods are relatively slow as chemical rules are verified during the generative process. Although graph-based methods perform well in unconditional molecule generation, they have difficulty generating molecules with optimized properties. For target-aware generation, reinforcement learning methods are used to systematically modify the generated molecular graphs [4–7]. In contrast to prior works based on the SMILES representation, we utilize the recently proposed SELFIES representation [48] to achieve high chemical validity in the generated ligands. Given the success of conditional VAEs in image generation [49], our work is the first attempt to introduce a learnable prior based on whole protein structures.
3. PMN
Proteins are complex structures that consist of long chains of residues/amino acids. Each amino acid is a molecule with a 3D structure, and a combination of hundreds to thousands of residues determines the unique 3D structure of a specific protein and its functions. It is worth noting that while two residues may be distant along the protein sequence, they can be close to each other in three-dimensional space. This key observation motivates the design of a novel framework that unifies different representations of proteins in an end-to-end learning manner.
In the field of graph learning, conventional GNNs based on the message-passing scheme [50], which propagates and aggregates information of each node to and from its local neighborhoods, are incapable of capturing the long-range interactions in a large-diameter or long-range graph (see footnote 4) [51, 52], while suffering from the over-smoothing [53] and over-squashing [54] problems. Furthermore, to obtain a global understanding of the whole input graph, the message-passing scheme needs a number of layers/iterations proportional to the diameter for distant nodes to 'communicate' with each other, which is computationally infeasible for macromolecules with thousands of atoms or residues like proteins. Meanwhile, graph Transformers that consider all pairwise node interactions via the self-attention mechanism can successfully capture long-range dependencies [52, 55, 56]. Since proteins can be seen as long-range graphs, we utilize sequential and graph Transformers to encode both the sequences and the 3D graphs of residues and combine them to create a unified representation for a large protein, making our model operate on multiple modalities of proteins. Our Transformer-based model can efficiently capture a protein's local and global information with a reasonably small number of layers.
3.1. Geometric learning on protein
As shown in figure 1, there are three major components in the 3D modeling part: a local encoder, a geometric vector perceptron (GVP) module [26], and a global Transformer encoder.
Figure 1. Overview of our PMN. The architecture consists of two components. Top: the three-dimensional structure of the input protein is processed by a local encoder built upon a message-passing neural network (MPNN) enhanced with geometric vector perceptron module (GVP) [26]. Subsequently, a Transformer processes the extracted invariant features to capture the global interactions between distant residues. Bottom: a Transformer-based language model is used to process the input protein sequence.
3.1.1. Local message-passing
We use a message-passing network (MPNN) in which the GVPs [26] replace dense layers to operate on invariant features:

$$m_{ij} = g_m\big(h_j \,\|\, e_{ij}\big), \qquad (1)$$

$$h_i \leftarrow h_i + \frac{1}{|\mathcal{N}(i)|}\sum_{j \in \mathcal{N}(i)} m_{ij}, \qquad (2)$$

where $\|$ denotes concatenation and $m_{ij}$, computed by a module $g_m$ of three GVP layers, is the message propagated from node j to node i. Here, $h_i$ and $e_{ij}$ indicate the embeddings of node i and edge (i, j), respectively; both are tuples of scalar and vector features as described in appendix A.1, and $\mathcal{N}(i)$ is the set of neighbors of node i. The local encoder outputs a tuple of scalar and vector features for each residue node, which are rotationally invariant and equivariant, respectively.
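To make this update concrete, below is a minimal PyTorch sketch of one residue-level message-passing step. The `gvp_message` argument stands in for the module of three GVP layers; its name, the mean aggregation, and the residual update are illustrative assumptions rather than the exact released implementation.

```python
import torch

def local_message_passing(h_s, h_v, edge_index, edge_s, edge_v, gvp_message):
    """One residue-level message-passing step (sketch).

    h_s: (N, d_s) scalar node features;  h_v: (N, d_v, 3) vector node features.
    edge_s / edge_v: scalar / vector edge features, one per directed edge j -> i.
    gvp_message: stacked GVP layers mapping (node_j || edge_ij) -> message,
                 assumed to return features with the same dimensions as the nodes.
    """
    src, dst = edge_index  # directed edges j -> i
    # Concatenate sender-node and edge features (scalar and vector channels separately).
    msg_s, msg_v = gvp_message(
        torch.cat([h_s[src], edge_s], dim=-1),
        torch.cat([h_v[src], edge_v], dim=-2),
    )
    # Mean-aggregate incoming messages per node, then apply a residual update.
    agg_s = torch.zeros_like(h_s).index_add_(0, dst, msg_s)
    agg_v = torch.zeros_like(h_v).index_add_(0, dst, msg_v)
    deg = torch.bincount(dst, minlength=h_s.size(0)).clamp(min=1).float()
    return h_s + agg_s / deg[:, None], h_v + agg_v / deg[:, None, None]
```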
3.1.2. Global interaction
We utilize a GVP module to update the tuple of each residue node, and we take the invariant scalar feature $s_i \in \mathbb{R}^{d}$ as the node embedding for the successor modules. The resulting tensor $S \in \mathbb{R}^{n \times d}$, in which row i is the d-dimensional scalar feature $s_i$ of node i, is passed to an L-layer Transformer encoder:

$$Q^{(\ell)} = X^{(\ell-1)} W_Q^{(\ell)}, \quad K^{(\ell)} = X^{(\ell-1)} W_K^{(\ell)}, \quad V^{(\ell)} = X^{(\ell-1)} W_V^{(\ell)}, \qquad (3)$$

$$A^{(\ell)} = \mathrm{softmax}\!\left(\frac{Q^{(\ell)} {K^{(\ell)}}^{\top}}{\sqrt{d}}\right), \qquad (4)$$

$$X^{(\ell)} = A^{(\ell)} V^{(\ell)}. \qquad (5)$$

Here, we initialize $X^{(0)} = S$; $W_Q^{(\ell)}$, $W_K^{(\ell)}$, and $W_V^{(\ell)}$ are learnable weight matrices corresponding to the query $Q^{(\ell)}$, key $K^{(\ell)}$, and value $V^{(\ell)}$ matrices, respectively, of the Transformer at each layer $\ell \in \{1, \ldots, L\}$; and $X^{(L)}$ denotes the final node embeddings produced by the network. Notably, this global encoder allows residue nodes to attend to all other nodes on a large protein graph, especially those that are distant from them (i.e. long-range modeling). Finally, we aggregate the node embeddings by a row-wise aggregator ζ (e.g. mean, max, sum, etc) to produce an embedding $p_g = \zeta\big(X^{(L)}\big)$ for the protein structure.
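A minimal PyTorch sketch of this global encoder, using the built-in `nn.TransformerEncoder` as a stand-in for equations (3)–(5) and a mean aggregator for ζ; the hyper-parameters and the pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalResidueEncoder(nn.Module):
    """Sketch: L layers of self-attention over residue scalar features, then mean pooling."""
    def __init__(self, d=128, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, S, padding_mask=None):
        # S: (batch, n_residues, d) invariant scalar features from the GVP module.
        X = self.encoder(S, src_key_padding_mask=padding_mask)  # X^(L): all-pair attention
        p_g = X.mean(dim=1)                                     # aggregator zeta = mean
        return p_g                                              # geometric protein embedding
```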
3.1.3. Language modeling on protein sequence
A protein can be represented as a sequence $(a_1, \ldots, a_n)$ in which $a_i$ is a one-hot vector indicating one of the 20 residue types. We utilize Transformer-based language models, where the layers are the same as in equations (3)–(5), to compute the textual representation of this protein sequence; the initial embedding of token i is calculated as $x_i^{(0)} = W_e a_i + p_i$, where $W_e$ is a learnable embedding matrix and $p_i$ is the positional encoding added at token i. In this work, for the language modeling component, we use the pre-trained Transformer protein language models proposed by Lin et al [29] to extract protein-level embeddings for each protein. Then, we define the resulting protein-level embedding $p_s$ as the global representation for the entire protein sequence.
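For the sequence branch, protein-level embeddings can be extracted from a pre-trained ESM-2 checkpoint roughly as follows. This sketch assumes the fair-esm package and the 650M-parameter, 1280-dimensional model; the toy sequence, the checkpoint, and the mean pooling are assumptions for illustration.

```python
import torch
import esm  # fair-esm package

# Load a pre-trained ESM-2 model; the 650M model has 33 layers and 1280-dim embeddings.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # toy example sequence
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
residue_emb = out["representations"][33]                    # (1, seq_len + 2, 1280)
# Protein-level embedding p_s: average residue embeddings, skipping the BOS/EOS tokens.
p_s = residue_emb[0, 1 : len(data[0][1]) + 1].mean(dim=0)   # (1280,)
```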
3.1.4. Multi-modal fusion
Finally, to produce the final representations of proteins, we concatenate their geometric and sequential features from the Transformer-based models and process them by a feed-forward network:

$$\mathbf{p} = \mathrm{FFN}\big(p_s \,\|\, p_g\big). \qquad (6)$$

Here, $\|$ denotes the concatenation between the two global vector features $p_s$ and $p_g$.
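A minimal sketch of this fusion step; the hidden sizes and the two-layer FFN are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of equation (6): concatenate p_s and p_g, then mix with a feed-forward network."""
    def __init__(self, d_seq=1280, d_geo=128, d_out=256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_seq + d_geo, d_out), nn.ReLU(),
                                 nn.Linear(d_out, d_out))

    def forward(self, p_s, p_g):
        # p_s: sequence embedding, p_g: geometric embedding -> unified protein embedding p.
        return self.ffn(torch.cat([p_s, p_g], dim=-1))
```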
4. Binding affinity prediction and target-aware ligand generation
4.1. Problem setup
Given a dataset D of protein-ligand pairs, our objective is to predict the binding affinity and generate novel drug-like ligands that have the potential to bind to a conditioning protein structure. We cast the former as a prediction task based on geometric and relational reasoning on protein and ligand structures, whereas the latter is regarded as protein-structure-conditioned ligand generation. Let $(l, p, s)$ be a protein-ligand pair, where l and p denote the representations of the ligand and protein, respectively, and s indicates the binding score between them. Additionally, figures 2 and 3 depict the overview of our approach in both tasks.
Figure 2. A framework for predicting binding affinities between proteins and ligands. The protein multi-modal network (PMN) computes a unified representation from the sequence and structure of the input protein. Besides, a message-passing neural network (MPNN) and feed-forward network (FFN) are used to process the 2D graph and Morgan Fingerprint of the input ligand, respectively. Finally, all the representations are combined and passed to a FFN that predicts the binding affinity score.
Figure 3. TargetVAE with an encoder, decoder, and a prior network. The PMN prior network computes the conditions from protein structures for constructing the latent space of the VAE framework, which learns to generate SELFIES representations of molecules.
4.2. Binding affinity prediction
Figure 2 illustrates our designed architecture for predicting the binding affinities between ligands and their target proteins. Regarding the ligand part, we represent small molecules as 2D graphs $G = (V, E)$, where the nodes $V$ are the atoms and the edges $E$ indicate their covalent bonds. Then, we use a graph attention network (GAT), an MPNN, to learn their representations. In particular, at layer t, the embedding vector $h_i^{(t)}$ of node i is updated as:

$$h_i^{(t)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j^{(t-1)}\Big), \qquad (7)$$

where the attention coefficients are computed as

$$\alpha_{ij} = \frac{\exp\big(\sigma\big(a^{\top}\big[W h_i^{(t-1)} \,\|\, W h_j^{(t-1)}\big]\big)\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(\sigma\big(a^{\top}\big[W h_i^{(t-1)} \,\|\, W h_k^{(t-1)}\big]\big)\big)}. \qquad (8)$$

Here, σ denotes a non-linear activation, $\mathcal{N}(i)$ is the set of neighbors of node i, $\|$ denotes concatenation, and a and W are a learnable weight vector and matrix, respectively. At the end of the network, we aggregate all the node embeddings on the molecular graph to produce its global embedding $z_l$. In addition to graph-based features, Morgan fingerprints can also be used as molecule features, encoded into an embedding $z_{fp}$ by a feed-forward network. Finally, we combine the features from all sources, including the sequence and geometric information of the protein and the graph-based features of the ligand, and pass them to a feed-forward network (FFN) to make predictions:

$$h = \mathbf{p} \,\|\, z_l \,\|\, z_{fp}, \qquad (9)$$

$$\hat{s} = \varphi(h). \qquad (10)$$

Here, h is the combination of all input sources, and φ is a feed-forward network that predicts the binding affinity score $\hat{s}$.
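The ligand branch and the prediction head could be sketched as follows, assuming PyTorch Geometric for the GAT layers and RDKit for the Morgan fingerprints; the layer sizes, pooling operator, and module names are illustrative rather than the exact released architecture.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles, n_bits=2048):
    """Radius-2 Morgan fingerprint of a molecule as a float tensor."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return torch.tensor(list(fp), dtype=torch.float)

class AffinityHead(nn.Module):
    """Sketch: GAT ligand encoder + fingerprint FFN + fused prediction MLP (equations (7)-(10))."""
    def __init__(self, atom_dim=32, hidden=128, protein_dim=256, fp_bits=2048):
        super().__init__()
        self.gat1 = GATConv(atom_dim, hidden)
        self.gat2 = GATConv(hidden, hidden)
        self.fp_ffn = nn.Sequential(nn.Linear(fp_bits, hidden), nn.ReLU())
        self.out = nn.Sequential(nn.Linear(protein_dim + 2 * hidden, 256),
                                 nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x, edge_index, batch, fp, protein_emb):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        z_l = global_mean_pool(h, batch)                     # ligand graph embedding
        z_fp = self.fp_ffn(fp)                               # fingerprint embedding
        h_all = torch.cat([protein_emb, z_l, z_fp], dim=-1)  # equation (9)
        return self.out(h_all).squeeze(-1)                   # predicted affinity, equation (10)
```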
4.3. Target-aware ligand generation
Although there exist many machine-learning approaches that generate drug-like molecules, it is challenging for graph-based or SMILES-based methods to generate chemically valid ligands with high probability. Meanwhile, SELFIES (self-referencing embedded strings) [48] is a string-based representation of molecules that is 100% robust with respect to molecular validity. A ligand l can be defined as a string of tokens $(l_1, l_2, \ldots, l_n)$ in which $l_i$ is a SELFIES token belonging to a predefined symbol set S derived from the training dataset. We generate a ligand $\hat{l} = (\hat{l}_1, \ldots, \hat{l}_n)$ by computing n independent probability vectors $q_1, \ldots, q_n$ over S, one for each position $i = 1, \ldots, n$. Each new token $\hat{l}_i$ is then selected from S according to its probability vector $q_i$.
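The SELFIES tokenization and the symbol set S can be obtained with the selfies package; the molecule below is only an example.

```python
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin, as a toy example
selfies_str = sf.encoder(smiles)                  # SMILES -> SELFIES
tokens = list(sf.split_selfies(selfies_str))      # token string l_1, ..., l_n

# Symbol set S derived from a (toy) training set of SELFIES strings.
alphabet = sf.get_alphabet_from_selfies([selfies_str])
symbol_to_idx = {s: i for i, s in enumerate(sorted(alphabet))}

# Any sequence of symbols from S decodes back to a chemically valid molecule.
print(tokens[:5], sf.decoder(selfies_str))
```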
Let φ, θ, and ψ denote the encoder, decoder, and prior network in a conditional VAE framework, respectively. In particular, φ is a recurrent neural network (RNN) that encodes a molecule into a latent vector z, and θ is also an RNN that autoregressively generates each token of the SELFIES string. The prior network ψ is our proposed protein multi-modal network, which encodes both the sequence and geometric information of a target protein, and we use two multi-layer perceptrons, denoted as $\mu_\psi$ and $\sigma_\psi$, to compute the mean and variance of a Gaussian distribution over the latent vector z, which is the inferred latent embedding of the ligand l. All the networks are jointly optimized based on equation (11). After training, given a target protein p, a ligand $\hat{l}$ is generated by sampling a latent vector $z \sim \mathcal{N}\big(\mu_\psi(p), \sigma_\psi(p)\big)$, which is fed to the decoder θ to produce a SELFIES representation.
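Inference then amounts to sampling from the learned prior and decoding. A schematic sketch follows; all module names are placeholders rather than the released API.

```python
import torch

@torch.no_grad()
def generate_ligands(protein, pmn, mu_psi, sigma_psi, decoder_rnn, n_samples=100):
    """Sketch of TargetVAE inference (illustrative names, not the released API).

    pmn:         protein multimodal network producing a protein embedding
    mu_psi:      MLP giving the prior mean;  sigma_psi: MLP giving the prior log-variance
    decoder_rnn: autoregressive decoder mapping a latent vector to a SELFIES string
    """
    cond = pmn(protein)                          # condition from the whole structure
    mu, log_var = mu_psi(cond), sigma_psi(cond)  # prior p_psi(z | protein)
    ligands = []
    for _ in range(n_samples):
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # sample a latent vector
        ligands.append(decoder_rnn(z))           # decode into a SELFIES string
    return ligands
```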
4.3.1. Conditional inference with pretrained unconditional VAE
In addition to validity, the diversity of the generated sets of ligands is also an important criterion in drug discovery. While a classical conditional VAE trained on protein-ligand pairs can generate novel and valid molecules, the diversity and uniqueness of these samples are relatively low due to the limited amount of available data. We address this issue by adapting the work proposed by Harvey et al [49] from the computer vision domain to diversify the latent variables. In this framework, the decoder θ of the generative model is independent of the condition p, i.e. it models $p_\theta(l \mid z)$, allowing θ to re-use the weights of the decoder of an unconditional VAE, as both have identical architectures. We train the model to optimize the objective:

$$\mathcal{L}(l, p; \theta, \phi, \psi) = \mathbb{E}_{q_\phi(z \mid l)}\big[\log p_\theta(l \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid l)\,\|\,p_\psi(z \mid p)\big). \qquad (11)$$
Different from equation (13), both $q_\phi$ and $p_\theta$ in equation (11) are not conditioned on the auxiliary covariate y. This allows the conditional VAE to re-use the weights of $q_\phi$ and $p_\theta$ from a pre-trained VAE that was trained unconditionally on a diverse set of molecules.
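Since both the approximate posterior and the learned prior are diagonal Gaussians, the KL term in equation (11) has a closed form. Below is a sketch of the resulting training loss; the tensor shapes and the mean reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conditional_elbo_loss(logits, target_tokens, mu_q, logvar_q, mu_p, logvar_p):
    """Sketch of equation (11): SELFIES-token reconstruction plus the analytic
    KL( N(mu_q, var_q) || N(mu_p, var_p) ) between two diagonal Gaussians.

    logits: (batch, seq_len, vocab);  target_tokens: (batch, seq_len) long
    mu_q, logvar_q: encoder outputs;  mu_p, logvar_p: prior-network outputs
    """
    recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                            target_tokens.reshape(-1), reduction="mean")
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1.0).sum(dim=-1).mean()
    return recon + kl  # minimized, i.e. the negative of the objective in equation (11)
```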
5. Results
5.1. Binding affinity prediction on PDBBind 2020
5.1.1. Experimental setup
The objective of the task is to predict the binding affinities that reflect how strongly small molecules (ligands) bind to their target proteins. We evaluate the performance of our approach on PDBBind v2020 [57] with time-split training and testing sets. In particular, the dataset contains 19 119 pre-processed complexes, 1152 of which were discovered in 2019 or later. We follow the splitting procedure described by Stärk et al [17], which randomly selects 125 unique proteins and collects all new complexes containing them to create the final test set, resulting in a total of 363 samples. Among the complexes discovered before 2019, those whose ligands appear in the test set are removed, yielding 17 347 complexes in total. This set is split into 16 379 complexes for training and 968 for validation such that the two subsets share no ligands.
5.1.2. Implementation details
In this task, we use three layers of GVP with a hidden dimension of 128 to extract the local geometric information of proteins, while also considering their interactions with the ligands, which are encoded by a three-layer GAT network with the same dimension. Regarding the language modeling component, we extract the residue-level embeddings from the pre-trained ESM-2 model proposed by Lin et al [29], with a hidden dimension of 1280. The features are fused and passed to a four-layer MLP with hidden dimensions of 1024, 512, and 256 to make the predictions. We use batch normalization and a dropout rate of 0.05 across layers. The models are trained for 50 epochs with a batch size of 32 and a learning rate of 0.0001.
5.1.3. Main results
According to table 1, our proposed approach outperforms other sequence-based methods by a large margin, demonstrating the necessity of modeling three-dimensional structures of proteins in machine learning. Compared with structure-based and complex-based methods, the model shows comparable performances across all metrics with lower standard deviations. Furthermore, the model performs on par with other more sophisticated state-of-the-art methods, namely PSICHIC [58] and TankBind. It is worth noting that PSICHIC also leverages the residue-level embeddings extracted from the pre-trained ESM; however, Koh et al [58] use these embeddings to construct 2D graphs of proteins. This does not preserve the SE(3)-symmetry (rotations and translations), which is an important property in learning three-dimensional structures.
Table 1. Performance comparison on the PDBBind v2020 dataset. An upward arrow (↑) denotes that higher scores are better, and a downward arrow (↓) denotes the reverse. Average results and standard deviations (in parentheses) from five independent runs are reported.

| Category | Method | RMSE (↓) | MAE (↓) | Pearson (↑) | Spearman (↑) | $r_m^2$ (↑) | CI (↑) |
|---|---|---|---|---|---|---|---|
| Complex | Pafnucy | 1.435 (0.018) | 1.144 (0.018) | 0.635 (0.008) | 0.587 (0.008) | 0.348 (0.016) | 0.707 (0.004) |
| | OnionNet | 1.403 (0.012) | 1.103 (0.014) | 0.648 (0.007) | 0.602 (0.013) | 0.381 (0.011) | 0.717 (0.005) |
| | IGN | 1.404 (0.025) | 1.116 (0.030) | 0.662 (0.013) | 0.638 (0.021) | 0.385 (0.02) | 0.730 (0.009) |
| | SIGN | 1.373 (0.037) | 1.086 (0.030) | 0.685 (0.031) | 0.656 (0.044) | 0.398 (0.048) | 0.736 (0.02) |
| Structure | SMINA | 1.466 (0.008) | 1.161 (0.007) | 0.665 (0.005) | 0.663 (0.019) | 0.391 (0.031) | 0.740 (0.008) |
| | GNINA | 1.740 (0.014) | 1.413 (0.015) | 0.495 (0.011) | 0.494 (0.011) | 0.209 (0.009) | 0.674 (0.004) |
| | dMaSIF | 1.450 (0.032) | 1.136 (0.031) | 0.629 (0.018) | 0.588 (0.041) | 0.347 (0.029) | 0.710 (0.017) |
| | TankBind | 1.345 (0.020) | 1.060 (0.031) | **0.718 (0.012)** | **0.689 (0.041)** | 0.404 (0.025) | 0.750 (0.006) |
| Sequence | GraphDTA | 1.564 (0.063) | 1.223 (0.066) | 0.612 (0.016) | 0.570 (0.050) | 0.306 (0.039) | 0.703 (0.019) |
| | TransCPI | 1.493 (0.050) | 1.201 (0.037) | 0.604 (0.024) | 0.551 (0.029) | 0.255 (0.027) | 0.677 (0.011) |
| | MolTrans | 1.599 (0.060) | 1.271 (0.051) | 0.539 (0.057) | 0.474 (0.052) | 0.242 (0.045) | 0.666 (0.02) |
| | DrugBAN | 1.480 (0.046) | 1.159 (0.045) | 0.657 (0.018) | 0.612 (0.027) | 0.319 (0.021) | 0.720 (0.011) |
| | DGraphDTA | 1.493 (0.050) | 1.201 (0.037) | 0.604 (0.024) | 0.551 (0.029) | 0.312 (0.038) | 0.693 (0.011) |
| | WGNN-DTA | 1.501 (0.050) | 1.196 (0.055) | 0.605 (0.025) | 0.562 (0.028) | 0.311 (0.03) | 0.697 (0.01) |
| | STAMP-DPI | 1.503 (0.082) | 1.176 (0.067) | 0.653 (0.028) | 0.601 (0.027) | 0.327 (0.039) | 0.719 (0.011) |
| | PSICHIC | **1.314 (0.049)** | **1.015 (0.031)** | 0.710 (0.027) | 0.686 (0.024) | 0.428 (0.047) | **0.751 (0.009)** |
| | Ours | 1.373 (0.035) | 1.084 (0.032) | 0.687 (0.010) | 0.646 (0.016) | **0.459 (0.022)** | 0.733 (0.006) |
Note: Bold highlights the best performance.
5.1.4. Ablation studies
To comprehensively evaluate the effectiveness of protein multi-modal learning, we conducted an ablation study to assess the necessity of sequence information in protein modeling. Specifically, we re-trained our model without the sequence embeddings from the pre-trained ESM, keeping the 3D modeling component unchanged. The numerical results in table 2 show that exploiting both sequence and geometric information leads to superior performance in protein-ligand binding affinity prediction. The model that combines both sources of information outperforms its 3D-only counterpart by a large margin in the Pearson, Spearman, $r_m^2$, and CI scores. Additionally, we observe that using only the ESM embeddings can also achieve results comparable to other state-of-the-art methods, indicating the importance of the protein primary structure, which is the most common protein data modality.
Table 2. Ablation study on the use of sequence embeddings and three-dimensional structures. The results are aggregated from five independent runs.
| Method | RMSE (↓) | MAE (↓) | Pearson (↑) | Spearman (↑) | $r_m^2$ (↑) | CI (↑) |
|---|---|---|---|---|---|---|
| Only 3D | 1.596 (0.028) | 1.300 (0.021) | 0.505 (0.029) | 0.453 (0.025) | 0.235 (0.031) | 0.657 (0.008) |
| Only ESM | 1.421 (0.029) | 1.123 (0.020) | 0.657 (0.009) | 0.607 (0.011) | 0.407 (0.022) | 0.718 (0.004) |
| ESM + 3D | **1.373 (0.035)** | **1.084 (0.032)** | **0.687 (0.010)** | **0.646 (0.016)** | **0.459 (0.022)** | **0.733 (0.006)** |
Note: Bold highlights the best performance.
5.2. Target-aware ligand generation
5.2.1. Experimental setup and implementation details
This task aims to generate small drug-like molecules that bind to given target proteins with unknown binding sites. We train a prior network that encodes the geometric and sequence information of a target by optimizing the objective in equation (11), using the training set of PDBBind v2020 described in section 5.1.1. For docking simulation, we adopt AutoDock Vina [13] to compute binding affinities; the resulting score, named the Vina score, characterizes the free-energy change of the binding process in kcal mol−1. We test our approach with nine target proteins, including G-protein coupled receptors and kinases from DUD-E [59] and the SARS-CoV-2 main protease [60]. Notably, these targets are unseen by the model during the training stage. For implementation, we use the architecture and parameters of a pre-trained RNN-based VAE proposed by Gao et al [61], in which the dimension of the latent vectors is 128. For each test target, we generate 100 molecules and estimate their drug-likeness (QED), synthetic accessibility (SA), and binding affinity (Vina score). The first two metrics are calculated with RDKit, and docked poses are generated by AutoDock Vina [13]. Additionally, we use Open Babel [62] to optimize the three-dimensional atom positions of the generated molecules.
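The QED and SA metrics can be computed with RDKit as sketched below; the SA scorer ships in RDKit's contrib directory. Docking with AutoDock Vina and geometry preparation with Open Babel are external command-line steps and are only indicated in the closing comment.

```python
import os
import sys
from rdkit import Chem, RDConfig
from rdkit.Chem import QED

# The synthetic-accessibility scorer ships with RDKit's contrib directory.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # noqa: E402

def score_molecule(smiles):
    """Drug-likeness (QED) and synthetic accessibility (SA) of a generated molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {"QED": QED.qed(mol), "SA": sascorer.calculateScore(mol)}

print(score_molecule("CC(=O)Oc1ccccc1C(=O)O"))  # toy example molecule
# Binding affinity (Vina score) additionally requires preparing the receptor and ligand
# files (e.g. with Open Babel) and running AutoDock Vina on them from the command line.
```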
5.2.2. Results
5.2.2.1. Qualitative results
Unlike pocket-based methods [14–16], our generative model is conditioned on the entire protein structure, meaning that knowledge of the binding sites is unavailable to the model. Using protein multi-modalities and modeling the long-range interactions among residues enables us to extract representations that capture both the geometric and sequence characteristics of a target protein. These informative conditions, as shown in figure 4, allow the generative model to produce multiple ligands that can bind to different sites with different poses on a given target protein. Furthermore, we choose some of the generated ligands and visualize how they bind to their targets in figure 6.
Figure 4. Multiple generated ligands with different poses bind to the 1iep target.
5.2.2.2. Quantitative results
Figure 5 illustrates the distribution of binding affinities of the generated molecules for their corresponding receptors. We use AutoDock Vina to compute the docking scores between the ligands and receptors, with lower scores indicating stronger binding. The histograms show that our method can generate ligands with high binding affinities (low docking scores): the average binding affinity for each target protein is below −6 kcal mol−1. Moreover, table 3 shows the average scores for binding affinity (BA), drug-likeness (QED), and synthetic accessibility (SA) of the top 1, 10, and 20 generated molecules, ranked according to their binding affinities to the corresponding targets. We observe that TargetVAE can generate molecules with high binding affinity while maintaining SA scores ranging from 4.0 to 8.0. However, we also note that our approach is limited in that it tends to generate compounds with low drug-likeness (QED). We hypothesize that this is because generating ligands that simultaneously satisfy all objectives is challenging, and our model focuses only on binding affinity, so there may be a trade-off between the metrics.
Figure 5. The binding affinity (in kcal mol−1) distribution of the generated molecules on different target proteins.
Figure 6. Three-dimensional poses of protein-ligand complexes generated by AutoDock Vina and visualized by ChimeraX. Each figure is associated with the name of the target protein, the synthetic accessibility, and the binding affinity in kcal mol−1 of the generated ligand.

Table 3. Quantitative results of the top generated molecules, ranked based on binding affinity (in kcal mol−1). The scores are averaged over the top-k ligands.
| Target | Top 1: BA (↓) | Top 1: SA (↓) | Top 1: QED (↑) | Top 10: BA (↓) | Top 10: SA (↓) | Top 10: QED (↑) | Top 20: BA (↓) | Top 20: SA (↓) | Top 20: QED (↑) |
|---|---|---|---|---|---|---|---|---|---|
| 1iep | −9.946 | 7.609 | 0.322 | −9.242 | 4.412 | 0.413 | −8.856 | 4.227 | 0.411 |
| 2rgp | −11.936 | 3.391 | 0.428 | −10.293 | 4.201 | 0.520 | −9.717 | 4.151 | 0.482 |
| 3eml | −25.939 | 7.268 | 0.584 | −11.590 | 4.346 | 0.493 | −10.294 | 4.446 | 0.476 |
| 3ny8 | −11.257 | 5.99 | 0.807 | −10.280 | 3.980 | 0.369 | −9.870 | 4.193 | 0.433 |
| 4rlu | −11.250 | 2.979 | 0.479 | −10.010 | 4.536 | 0.619 | −9.495 | 4.759 | 0.564 |
| 4unn | −10.752 | 4.567 | 0.161 | −9.860 | 4.192 | 0.415 | −9.423 | 4.270 | 0.418 |
| 5mo4 | −11.812 | 6.330 | 0.432 | −10.325 | 5.041 | 0.325 | −9.627 | 4.865 | 0.443 |
| 7l11 | −11.220 | 7.912 | 0.136 | −9.163 | 5.396 | 0.394 | −8.567 | 5.073 | 0.417 |
6. Conclusion
This paper proposes Protein Multimodal Network (PMN), a novel neural architecture that learns to combine multiple levels of information (i.e. multimodal) of proteins including protein sequence (i.e. from the primary structure) as well as residue-level graph and geometry (i.e. from the 3D tertiary structure) into a unified representation. Our experiments in predicting protein-ligand binding affinity on the PDBBind v2020 dataset have shown that this new multimodal representation is highly effective in capturing both local and global structural information of protein, with competitive performance against current state-of-the-art methods.
Furthermore, we build a conditional variational autoencoder named TargetVAE that can generate new ligands that bind to a target protein and have desirable properties such as high binding affinity and high synthesizability. It is important to note that, in our case, the binding sites are not known in advance; therefore, we utilize our pre-trained PMN to encode the whole protein structure as a prior/condition to guide the generation process. We evaluate our generative model and the quality of the generated ligands both quantitatively and qualitatively. The candidate ligands we generate have shown great potential in in-silico simulations.
We believe our two main contributions, PMN and TargetVAE, are highly applicable in practice. In general, PMN can help scientists estimate several protein functionalities and properties, as well as produce a sophisticated protein embedding that can be used for other downstream tasks. PMN in combination with TargetVAE allows us to generate new ligands for new proteins, such as mutant proteins. In future work, we plan to examine these candidates further in in vitro and in vivo experiments.
Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/HySonLab/Ligand_Generation. The data that we used for this study is publicly available at https://github.com/vtarasv/3d-prot-dta [39] and https://github.com/HannesStark/EquiBind [17]. We release our data processing pipeline and software along with the installation instructions at https://github.com/HySonLab/Ligand_Generation. All experimental results and visualizations are reproducible given our software release.
Funding
Not applicable.
Scientific contribution statement
In this work, we propose Protein Multimodal Network (PMN), a novel neural architecture that learns to combine multiple levels of information (i.e. multimodal) of proteins, including the protein sequence (i.e. from the primary structure) as well as the residue-level graph and geometry (i.e. from the 3D tertiary structure), into a unified representation. Given the protein structure encoded by PMN as a prior, we build a conditional variational autoencoder named TargetVAE that can generate new ligands that bind to a target protein and have desirable properties such as high binding affinity and high synthesizability. We believe our two main contributions, PMN and TargetVAE, are highly applicable in practice: (i) PMN can help scientists estimate several protein functionalities and properties, as well as produce a sophisticated protein embedding that can be used for other downstream tasks; and (ii) PMN in combination with TargetVAE allows us to generate new ligands for new proteins, such as mutant proteins.
Appendix: Background
A.1. Rotational invariant features
According to Jing et al [26], the geometric features of a residue node on the protein structure can be represented as a tuple (s, V) of scalar features $s \in \mathbb{R}^{h}$ and vector features $V \in \mathbb{R}^{\nu \times 3}$. Respectively, s and V are invariant and equivariant with respect to geometric transformations in Euclidean space. In addition, to make information propagate effectively from the vector channel to the scalar channel, Jing et al [26] propose the GVP as a replacement for conventional dense layers in GNNs, enabling them to operate on geometric vectors and structural information of large 3D protein graphs.
The module transforms an input tuple (s, V) of scalar and vector features into a new tuple (s′, V′). According to algorithm 1, GVP consists of two separate linear transformations $W_m$ and $W_h$ that act on the scalar and vector features, respectively, followed by nonlinearities σ and $\sigma^{+}$. Before being transformed, the scalar feature s is concatenated with the row-wise L2 norm of the vector feature V. This enables GVP to extract rotation-invariant information from the input vectors V. Moreover, an additional transformation $W_\mu$ is used to control the dimensionality of the output vector feature V′, making it independent of the number of norms extracted. Albeit simple, GVP is an effective module that guarantees the desired properties of invariance/equivariance and expressiveness. The scalar and vector outputs of GVP are invariant and equivariant, respectively, with respect to an arbitrary composition R of rotations and reflections in 3D Euclidean space. In other words, if $\mathrm{GVP}\big((s, V)\big) = (s', V')$, then $\mathrm{GVP}\big((s, RV)\big) = (s', RV')$.
Algorithm 1. Geometric vector perceptron. Input: scalar and vector features (s, V). Output: scalar and vector features (s′, V′).
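For reference, below is a minimal PyTorch sketch of the GVP computation described above, following Jing et al [26]; the choice of ReLU for σ and of a sigmoid gate on the vector norms for σ⁺ are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GVP(nn.Module):
    """Minimal sketch of the geometric vector perceptron (algorithm 1, Jing et al [26])."""
    def __init__(self, in_s, in_v, out_s, out_v, h_dim=None):
        super().__init__()
        h_dim = h_dim or max(in_v, out_v)
        self.W_h = nn.Linear(in_v, h_dim, bias=False)    # acts across vector channels
        self.W_mu = nn.Linear(h_dim, out_v, bias=False)  # controls the output vector dimension
        self.W_m = nn.Linear(h_dim + in_s, out_s)        # mixes vector norms with scalars

    def forward(self, s, V):
        # s: (..., in_s) scalar features;  V: (..., in_v, 3) vector features.
        V_h = self.W_h(V.transpose(-1, -2)).transpose(-1, -2)         # (..., h_dim, 3)
        V_mu = self.W_mu(V_h.transpose(-1, -2)).transpose(-1, -2)     # (..., out_v, 3)
        s_h = torch.norm(V_h, dim=-1)                                 # rotation-invariant norms
        s_out = torch.relu(self.W_m(torch.cat([s, s_h], dim=-1)))     # sigma (assumed ReLU)
        gate = torch.sigmoid(torch.norm(V_mu, dim=-1, keepdim=True))  # sigma+ gate on norms
        return s_out, gate * V_mu                                     # (s', V')
```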
A.2. Variational auto-encoders
A variational auto-encoder (VAE) is regarded as an auto-encoding variational Bayes model [63] that comprises two components: a generative model and an inference model (also known as a probabilistic encoder). The former uses a probabilistic decoder $p_\theta(x \mid z)$ and a prior $p(z)$ to define a joint distribution $p_\theta(x, z)$ between latent variables z and data x; in addition, Kingma and Welling [63] let $p(z)$ be an isotropic Gaussian. An ideal generative model should learn to maximize the log-likelihood $\log p_\theta(x)$. However, this is intractable as marginalization over the latent space is usually infeasible with realistic data. VAE alleviates this issue by using an encoder $q_\phi(z \mid x)$ to approximate the true posterior distribution over the latent space and maximizing the evidence lower bound (ELBO) over each training sample x:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big). \qquad (12)$$
In the conditional VAE, the generative component is augmented by auxiliary covariates y. Given a condition y, the generative model defines a conditional joint distribution of z and x as $p_\theta(x, z \mid y) = p_\theta(x \mid z, y)\,p(z \mid y)$. Similarly, the condition inputs are integrated into the encoder as $q_\phi(z \mid x, y)$. These two extensions establish the prominent conditional VAE model [64–67], which is trained to maximize the conditional ELBO:

$$\log p_\theta(x \mid y) \geq \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(x \mid z, y)\big] - \mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p(z \mid y)\big). \qquad (13)$$
Footnotes
4. The diameter of a graph is defined as the maximum length of the shortest paths among all pairs of nodes.