DeepMol: an automated machine and deep learning framework for computational chemistry

Abstract

The domain of computational chemistry has experienced a significant evolution due to the introduction of Machine Learning (ML) technologies. Despite its potential to revolutionize the field, researchers are often encumbered by obstacles, such as the complexity of selecting optimal algorithms, the automation of data pre-processing steps, the necessity for adaptive feature engineering, and the assurance of model performance consistency across different datasets. Addressing these issues head-on, DeepMol is an Automated ML (AutoML) tool that automates critical steps of the ML pipeline. DeepMol rapidly and automatically identifies the most effective data representation, pre-processing methods and model configurations for a specific molecular property/activity prediction problem. On 22 benchmark datasets, DeepMol obtained competitive pipelines compared with those requiring time-consuming feature engineering, model design and selection processes. As one of the first AutoML tools specifically developed for the computational chemistry domain, DeepMol stands out with its open-source code, in-depth tutorials, detailed documentation, and examples of real-world applications, all available at https://github.com/BioSystemsUM/DeepMol and https://deepmol.readthedocs.io/en/latest/. By introducing AutoML as a groundbreaking feature in computational chemistry, DeepMol establishes itself as a pioneering state-of-the-art tool in the field.

Scientific contribution

DeepMol aims to provide an integrated framework of AutoML for computational chemistry. DeepMol provides a more robust alternative to other tools with its integrated pipeline serialization, enabling seamless deployment using the fit, transform, and predict paradigms. It uniquely supports both conventional and deep learning models for regression, classification and multi-task, offering unmatched flexibility compared to other AutoML tools. DeepMol's predefined configurations and customizable objective functions make it accessible to users at all skill levels while enabling efficient and reproducible workflows. Benchmarking on diverse datasets demonstrated its ability to deliver optimized pipelines and superior performance across various molecular machine-learning tasks.

Introduction

In recent years, computational chemistry has undergone a remarkable transformation, driven by advances in machine learning (ML) and deep learning (DL) techniques and by the immense growth of available chemical data [1,2,3,4]. These advances have facilitated the exploration and understanding of the complex relationships between chemical structures and their properties [5]. As a result, using these computational methods within the chemical discovery pipeline has emerged as a powerful and promising way to expedite the discovery of new chemicals with improved properties [6].

Quantitative structure-activity/property relationship (QSAR/QSPR) models have always been a focal topic in computational chemistry [7,8,9,10]. Initially, simple statistical models were applied to small datasets of molecules characterized by a restricted array of descriptors [11], offering a straightforward way to correlate molecular structure with biological activity [12]. Nowadays, QSAR/QSPR modelling typically involves large sets of molecules characterized by extensive collections of molecular descriptors.

As the amount of available data and the complexity of QSAR/QSPR models continue to grow, the demand for advanced techniques has led to the emergence of DL as a viable alternative to traditional ML models [13]. With the current plethora of ML/DL models and diverse chemical descriptors available, researchers face the challenge of selecting the most suitable combination of data representation, processing methods, and ML/DL models for their data [12]. Choosing the optimal combination of features/descriptors and models requires exhaustive testing to comprehensively understand their performance on a given dataset [14, 15]. Recognizing this challenge, the importance of automated machine learning (AutoML) frameworks becomes unequivocal. In the fields of QSAR/QSPR modelling, where pre-processing steps, models, and their combination can significantly impact results, the need for a comprehensive and easily customizable AutoML framework is even more evident.

However, despite this need, only a few resources offer an easily customizable framework capable of providing a wide range of possibilities. ZairaChem [16] provides the first AutoML framework described in the literature that can automatically optimize all the pre-processing steps and model hyperparameters for specific QSAR problems. However, users cannot easily customize this system to suit their needs. For example, ZairaChem does not provide ways of creating custom objective functions, i.e. how to generate the final metric to give feedback to the AutoML system, e.g. whether in a cross- or hold-out validation or other scenarios. Another example is the impossibility of defining the type of models and descriptors to test. Furthermore, ZairaChem is restricted to binary classification tasks, which can be very limiting. Another AutoML tool is QSARTuna [17], which performs the AutoML search using Optuna [18]. However, it offers a limited range of feature extraction methods and ML/DL models. Additionally, like ZairaChem, QSARTuna shares the limitation of being unable to create custom objective functions.

Herein, we present DeepMol, a Python-based, open-source AutoML framework for the prediction of activities and properties of chemical molecules. Although fully automated, DeepMol is built modularly, allowing for the customization of every step of the ML pipeline, from data loading and processing to model prediction and explainability. Nonetheless, what sets DeepMol apart is its robust AutoML functionality, allowing the automatic optimization of different scenarios involving pre-processing methods, data engineering techniques, and ML/DL models and respective hyperparameters. Even for users with minimal coding experience, with just a few lines of code, DeepMol allows testing thousands of configurations to determine the most effective ones for their specific datasets. Notably, DeepMol provides documented ways of defining objective functions and customizing AutoML experiments. Moreover, it is suitable for several ML tasks such as binary, multi-class, multi-label/multi-task classification and regression. The framework leverages other well-known and established packages, like RDKit [19] for molecular operations, Scikit-Learn [20], Tensorflow [21] and DeepChem [22] for model building, and Optuna [18] for end-to-end ML pipeline optimization.

A rigorous experimental framework was established to ensure an impartial assessment of the AutoML engine’s capabilities, enabling evaluation of its performance across 22 benchmark datasets for predicting absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties derived from the Therapeutics Data Commons (TDC) repository [23].

Implementation

DeepMol was primarily developed as a comprehensive end-to-end AutoML framework for computational chemistry. Yet, its modular design permits the independent utilization of its components. The framework provides a wide range of techniques for each step of a general ML pipeline, from data pre-processing and feature extraction to model training, evaluation, and explainability. Furthermore, the package has been extended with additional functionalities of significant relevance, including unsupervised learning models, data-splitting strategies specifically designed for molecular datasets, and approaches to address data imbalance.

The DeepMol AutoML engine, as illustrated in Fig. 1, comprehensively explores a vast array of potential methodological combinations. In this context, state-of-the-art computational chemistry methods are employed automatically and sequentially in a pipeline, following a specific configuration, ensuring it is tailored to address specific chemical tasks. The configuration space encompasses several data standardization methods (Fig. 1a: three methods), feature extraction methods (Fig. 1b: four options for sets with 34 methods in total, all with their respective parameters), scaling and selection methods (Fig. 1c: 14 methods and respective parameters), various ML models and ensembles (Fig. 1d: 140 models), and their respective hyperparameters.

Upon starting the AutoML experiment, the engine first processes the training data following a predefined sequence of steps, known as the pipeline configuration, and then uses this data to train an ML/DL model. Post-training, a separate set of data, the validation set, is processed and assessed to evaluate the model’s performance. The outcomes of this evaluation are then fed back into the optimization framework, guiding it in choosing new parameters and methods for improvement. This cycle of training and evaluating is repeated a user-specified number of times, referred to as trials. After completing all the trials, the system analyzes the results to identify and select the most effective pipeline. The optimization framework is efficiently powered by the Optuna search engine, which provides nine different optimization algorithms compatible with DeepMol. Once the optimal pipeline has been identified, it can be applied to analyze new data, enabling predictions on untested data. Additionally, this pipeline facilitates the virtual screening of extensive databases, efficiently identifying potential molecules of interest.

This section provides an overview of the key components and capabilities of DeepMol, highlighting its significant contributions and advantages in chemical compound analysis.

Fig. 1

The DeepMol AutoML framework offers a comprehensive approach that includes an optimization framework that samples pipeline configurations from the configuration space. This pipeline begins with an optional a standardization step (configuration space: 3 different methods), followed by b the extraction of features (configuration space: four options for sets with 34 methods in total, all with their respective parameters). These features can then be optionally c scaled and selected (configuration space: 14 methods and respective parameters), preparing them for d the training phase of the model (configuration space: 140 models, architectures and hyperparameters). Post-training, the model’s performance is evaluated based on a predetermined metric. This evaluation feeds back into the Optuna optimization engine. Optuna uses these results to inform its selection of new parameters and methods for enhancing the pipeline’s efficiency using state-of-the-art optimization algorithms. This process is repeated n times (chosen by the user), also called trials, and, in the end, the best pipeline is selected. Each individual box represented in the configuration space is expanded and schematized in the boxes below. Finally, the result of the AutoML engine is an optimized pipeline that can be used to transform new data and perform new predictions

Data loading

The representation of molecular structures can be achieved using various formats. One such format is the Simplified Molecular Input Line Entry System (SMILES), which is human-readable and concise, representing molecules as single-line ASCII strings [24]. Another format is the connection table, which can be stored as a Structure Data File (SDF) and used to represent three-dimensional structures of molecules [25, 26]. These are the de facto standard starting points for representing molecules in ML tasks [24, 27], and DeepMol provides loaders for both formats. Once loaded, the information is converted into a structured data format that allows easy access to relevant details, such as the SMILES representation of all molecules, their identifiers, and known outputs associated with the input data (labels). Alternatively, if one prefers to load molecules from an already-loaded data frame, a SmilesDataset class can be created. This dataset accepts SMILES or RDKit Mol objects as input, along with their corresponding identifiers and labels.
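For illustration, loading a CSV of SMILES may look like the following sketch; the loader and dataset classes follow the DeepMol documentation, while the file name and column names are placeholders.

```python
from deepmol.loaders import CSVLoader
from deepmol.datasets import SmilesDataset

# Load SMILES, identifiers and labels from a CSV file
# ("data.csv" and its column names are hypothetical).
loader = CSVLoader(dataset_path="data.csv",
                   smiles_field="smiles",
                   id_field="mol_id",
                   labels_fields=["activity"])
dataset = loader.create_dataset()

# Alternatively, build a dataset directly from in-memory SMILES strings.
dataset = SmilesDataset(smiles=["CCO", "c1ccccc1"],
                        ids=["mol_1", "mol_2"],
                        y=[0, 1])
```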

Molecular standardization

The availability of vast amounts of chemical data found in molecular databases containing hundreds of millions of compounds can be a double-edged sword. While it pushes the field forward, it makes the human curation process infeasible, resulting in the frequent occurrence of incorrect and inconsistent molecular structures [28]. Even minor structural errors and inconsistencies within a dataset can result in significant losses in the predictive ability of QSAR/QSPR models [29]. Consequently, the development of strategies that can tackle this problem has received increased attention in recent years [30,31,32].

With DeepMol, it is possible to standardize molecules using three different options:

  • BasicStandardizer performs basic sanitization using RDKit to ensure that a molecular structure is represented consistently and validly according to a set of predefined rules. These include kekulization, valence checking, and the assignment of aromaticity, conjugation, and hybridization.

  • CustomStandardizer performs a set of steps defined by the user, using RDKit. Some steps include molecular sanitization (same as in the BasicStandardizer), removal of isotope and/or stereochemistry information, neutralizing charges, fragment removal, and kekulization.

  • ChEMBLStandardizer applies the set of rules used by the ChEMBL database [30]. It consists of three components: a Checker that tests the validity of the chemical structures; a Standardizer that formats compounds based on the U.S. Food and Drug Administration (FDA) and International Union of Pure and Applied Chemistry (IUPAC) guidelines; and a GetParent component that removes any salts and solvents from the compound.

While DeepMol provides these widely applicable standardization techniques, we recognize there are additional layers of standardization that can significantly enhance data consistency, depending on the dataset and modeling needs. For example, more specialized transformations such as canonicalising tautomers and specific functional groups can further reduce redundancies and improve model performance. DeepMol’s design is extensible, enabling users to implement these or other tailored standardization processes to meet specific requirements.
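A minimal sketch of the three standardization options follows; the class names are taken from the DeepMol documentation, while the keys of the custom-step dictionary are assumptions based on the steps listed above and should be checked against the current documentation.

```python
from deepmol.standardizer import BasicStandardizer, CustomStandardizer, ChEMBLStandardizer

# RDKit sanitization only.
BasicStandardizer().standardize(dataset)

# User-defined steps (key names are assumptions based on the documented options).
custom_steps = {"REMOVE_ISOTOPE": True,
                "NEUTRALISE_CHARGE": True,
                "REMOVE_STEREO": False,
                "KEEP_BIGGEST": True}
CustomStandardizer(params=custom_steps).standardize(dataset)

# ChEMBL structure curation pipeline (Checker, Standardizer, GetParent).
ChEMBLStandardizer().standardize(dataset)
```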

Feature extraction

In ML, feature extraction from molecules is a common task. Different types of molecular features are categorized from 0-dimensional (D) to 4D. 0D features provide overall information about the molecule (e.g. atom and bond counts), while 1D features describe substructures within the molecule (e.g. fingerprints) [33, 34]. 2D features capture molecular topology, and 3D features capture geometric information of the molecule’s three-dimensional structure [35]. 4D features are descriptors with an additional dimension representing interactions with receptors’ active sites or multiple conformational states.

DeepMol provides a set of 0D, 1D, and 2D descriptors within a single class. The descriptors contained in this class are enumerated and described in Supplementary Table S1. DeepMol offers a wide range of 1D features provided by RDKit, including circular, atom pair, layered and RDKit fingerprints. DeepMol also includes Molecular ACCess System (MACCS) keys implemented by RDKit, which encode the presence or absence of specific molecular fragments or substructures in a molecule. These are further enumerated and described in Supplementary Table S2.
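For instance, under the module layout described in the DeepMol documentation, computing descriptors and fingerprints for a loaded dataset may look like this sketch (exact class names and in-place behaviour should be checked against the documentation):

```python
from deepmol.compound_featurization import (TwoDimensionDescriptors,
                                            MorganFingerprint,
                                            MACCSkeysFingerprint)

# 0D-2D physico-chemical descriptors, all bundled in a single class.
TwoDimensionDescriptors().featurize(dataset)

# Circular (Morgan) fingerprints with configurable radius and bit-vector size.
MorganFingerprint(radius=2, size=2048).featurize(dataset)

# 166-bit MACCS structural keys.
MACCSkeysFingerprint().featurize(dataset)
```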

3D molecular conformations must be generated or loaded before generating 3D descriptors. For this matter, DeepMol provides conformer generation methods like Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) and methods like Merck Molecular Force Field (MMFF) and Universal Force Field (UFF) for optimizing the generated conformers. Once generated, DeepMol offers methods to extract features from these conformers. These methods include AutoCorr3D, Radial Distribution Function (RDF), the plane of best fit, MORSE, WHIM descriptors, Radius of Gyration, shape descriptors, and principal moments of inertia. More details on these methods can be found in Supplementary Table S3.

DeepMol also provides one-hot encoding schemes. Herein, DeepMol provides an implementation of an atom-level tokenizer and a k-mer tokenizer from Li et al. [36]. The former method treats each character in the SMILES string linked to an atom (e.g., [C@] or [N@+]) as an individual token, whereas the latter produces groups of k characters of the SMILES string. Both create a vocabulary dynamically given a dataset. If no tokenizer is passed as input, the one-hot encoding uses the atom-level tokenizer.

DeepMol extends DeepChem to provide inputs for graph neural networks (GNN), including molecular graphs (2D descriptors), representing the molecule by a list of neighbors and a set of initial feature vectors for each atom. The feature vectors represent the atom’s local chemical environment, including atom types, charge, and hybridization, among others [37]. Other implementations for using GNNs were also included, such as methods for implementing Duvenaud graph convolutions [38], Weave convolutions [37], and MolGAN [39]. Coulomb matrices and their eigenvalues [40] are also provided as methods of extracting 3D features, providing information on electrostatic interaction between atoms.

Finally, DeepMol also extends DeepChem to convert molecules into images and encodings to be passed to an embedding layer. All DeepChem-derived feature extractors are further detailed in Supplementary Table S4.

Data scaling

Many feature extraction methods encode molecules as bit vectors, while others use vectors of real numbers. In the latter case, a challenge arises when features have different numerical ranges, which may impact certain algorithms. Scaling is particularly important in models sensitive to feature scales, such as SVMs and neural networks, where larger ranges can cause certain features to dominate the learning process; tree-based models, in contrast, are generally less affected by feature magnitudes. To support these requirements, DeepMol offers a wide range of data scalers, as provided in the Scikit-Learn package (Supplementary Table S5).

Feature selection

Feature selection is a common task in ML problems. By choosing the most relevant features, the performance of an ML model can be significantly enhanced. Eliminating irrelevant or redundant features helps prevent overfitting and improve the generalization power of the model, while reducing the computational burden. DeepMol supports many types of feature selection as provided by Scikit-Learn, including removing features with low variability, supervised filters based on statistical tests, as well as wrappers such as Recursive Feature Elimination, or selecting features based on importance weights (Supplementary Table S6). In addition, DeepMol also provides feature selection based on the Boruta algorithm [41].
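Because both the scalers and the feature selectors wrap Scikit-Learn, the underlying operations on a featurized dataset (here assumed to expose a feature matrix `dataset.X` and labels `dataset.y`) can be sketched with plain Scikit-Learn objects:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, f_classif

# Scale real-valued descriptors to zero mean and unit variance.
X = StandardScaler().fit_transform(dataset.X)

# Remove near-constant features, then keep the top-scoring percentile
# according to a univariate statistical test.
X = VarianceThreshold(threshold=0.01).fit_transform(X)
X = SelectPercentile(f_classif, percentile=80).fit_transform(X, dataset.y)
```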

Machine and deep learning models

DeepMol offers compatibility with three popular ML/DL frameworks: Tensorflow, Scikit-Learn, and DeepChem. This allows seamless integration of any model built upon these frameworks into DeepMol. A wide range of pre-built models from these frameworks is readily available, supporting single and multi-task problems, binary and multi-class classification and regression, making it convenient for users to utilize these models for specific tasks.

Through Scikit-Learn, DeepMol offers an extensive selection of popular ML models, including, among others, linear and logistic regression, support vector machines, decision trees, random forests, and gradient boosting. In conjunction with DeepChem, it also provides several DL models specifically tailored for chemical data, including graph neural networks (GNNs), recurrent neural networks (RNNs), and transformer models. Moreover, the flexibility offered by Tensorflow enables integration of any DL architecture into DeepMol’s pipeline. By consolidating these capabilities, DeepMol serves as a comprehensive and versatile framework, facilitating implementation and comparison of various ML and DL models in one unified platform [42, 43].
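As a sketch of this interoperability, any Scikit-Learn estimator can be wrapped in DeepMol's model interface (class name as in the DeepMol documentation; `train_dataset` and `test_dataset` are assumed to be loaded and featurized DeepMol datasets):

```python
from sklearn.ensemble import RandomForestClassifier
from deepmol.models import SklearnModel

# Wrap a Scikit-Learn estimator so it follows DeepMol's fit/predict interface.
model = SklearnModel(model=RandomForestClassifier(n_estimators=500))
model.fit(train_dataset)
predictions = model.predict(test_dataset)
```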

Hyperparameter optimization

Although DeepMol’s primary focus is on AutoML, where advanced optimization techniques can be used to optimize each step and corresponding hyperparameters of a complete ML pipeline, it also provides users with the flexibility to perform standard hyperparameter tuning on individual models. Fine-tuning model hyperparameters is essential for achieving optimal results, reducing overfitting, and improving overall efficiency. For this purpose, DeepMol includes straightforward options like randomized and grid search, enabling users to directly control specific tuning tasks if they choose not to use the full AutoML pipeline.
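The underlying idea can be sketched with plain Scikit-Learn objects of the kind DeepMol's tuners wrap (`X_train` and `y_train` are assumed feature and label arrays):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Randomized search over a hyperparameter space with 5-fold cross-validation.
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions={"n_estimators": randint(100, 1000),
                         "max_depth": [None, 5, 10, 20]},
    n_iter=20, cv=5, scoring="roc_auc", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```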

Pipelines

In DeepMol, we have introduced a pipeline class that simplifies the creation of an ML pipeline. This class empowers users to construct an ML pipeline tailored to their specific needs. It offers the flexibility to incorporate a diverse range of steps, enabling the inclusion of any combination of the following methods:

  • Data standardization

  • Feature extraction or transformation (data scaling and selection)

  • Model training and hyperparameter tuning.

Pipelines in DeepMol work the same way as Scikit-Learn pipelines: the fit_transform method fits and transforms the training data, while the transform method solely transforms the validation and test sets. The pipeline class also allows users to save and load the pipeline. Moreover, it allows users to evaluate the pipeline and make predictions on new data using the evaluate and predict methods, respectively. The model declared within the pipeline can also be optimized.
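A sketch of a complete pipeline follows; the step classes and methods are those described above, while the tuple-based step layout and the `path` and save/load signatures are assumptions based on the DeepMol documentation:

```python
from sklearn.ensemble import RandomForestClassifier
from deepmol.pipeline import Pipeline
from deepmol.standardizer import ChEMBLStandardizer
from deepmol.compound_featurization import MorganFingerprint
from deepmol.models import SklearnModel

steps = [("standardizer", ChEMBLStandardizer()),
         ("featurizer", MorganFingerprint(radius=2, size=2048)),
         ("model", SklearnModel(model=RandomForestClassifier()))]

pipeline = Pipeline(steps=steps, path="my_pipeline/")
pipeline.fit_transform(train_dataset)       # fit and transform the training data
predictions = pipeline.predict(test_dataset)
pipeline.save()                             # serialize the whole pipeline
pipeline = Pipeline.load("my_pipeline/")    # reload for deployment
```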

Users can design and customize their ML pipeline, adapting it to their unique requirements and desired outcomes. This approach significantly reduces the complexity of constructing an ML pipeline, promoting efficient experimentation and development.

AutoML: pipeline optimization

Pipelines provide a convenient and efficient way to transform raw data into trained and validated models, but determining the appropriate sequence of steps often requires expertise. Moreover, if one wishes to explore multiple combinations of pre-processing techniques (data Transformers) and models (Predictors), manually constructing a Pipeline for each combination can be repetitive and time-consuming.

Automating the process of building and optimizing each step and respective parameters of a Pipeline can greatly assist researchers with limited ML expertise in building and deploying effective QSAR/QSPR systems, accelerating the development process. In DeepMol, we offer the PipelineOptimization module (AutoML), which enables the search for the best configurations from a large number of possibilities. This module leverages the powerful Optuna library [18] and its state-of-the-art optimization algorithms.

As with any optimization problem, pipeline optimization needs an objective function. By default, DeepMol provides the objective of maximizing or minimizing a metric (e.g. accuracy) for a given validation set. However, other custom objective functions can be added.

Furthermore, a function defining the configuration space to be optimized by Optuna can be created. This function receives a Trial object from Optuna and returns a set of steps (i.e. Transformers and Predictors) from the configuration space to define the trial pipeline. Alternatively, DeepMol provides a set of predefined configuration spaces (pre-sets) for convenience, used as shown in the sketch after this list. The available pre-sets include:

  • ’Sklearn’: optimizes between all available Standardizers, Featurizers (and combinations of those), Scalers, Feature Selectors in DeepMol, and all available models in Scikit-Learn.

  • ’Deepchem’: optimizes between all available Standardizers in DeepMol and all available models in DeepChem (depending on the model, different Featurizers, Scalers and Feature Selectors may also be optimized).

  • ’Keras’: optimizes between Standardizers, Featurizers (and combinations of those), Scalers, Feature Selectors in DeepMol, and some core Keras models (Fully Connected Neural Networks (FCNNs), 1D Convolutional Neural Networks (CNNs), RNNs, Bidirectional RNNs, and transformer models).

  • ’All’: optimizes between all the above.
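A usage sketch with the 'Sklearn' pre-set follows; the class and argument names are based on the DeepMol documentation, but exact signatures should be treated as assumptions:

```python
from sklearn.metrics import roc_auc_score
from deepmol.metrics import Metric
from deepmol.pipeline_optimization import PipelineOptimization

# Search the 'sklearn' configuration space for 100 trials,
# maximizing ROC-AUC on the validation set.
po = PipelineOptimization(direction="maximize", study_name="qsar_automl")
po.optimize(train_dataset=train_dataset,
            test_dataset=validation_dataset,
            objective_steps="sklearn",
            metric=Metric(roc_auc_score),
            n_trials=100)
best_pipeline = po.best_pipeline
```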

Finally, a Voting Pipeline composed of the best five pipelines is created after running the AutoML, where the voting process for classification occurs through soft and hard voting mechanisms. In soft voting, the predicted probabilities from each classifier are averaged to determine the final prediction. On the other hand, hard voting involves a simple majority vote, where the class predicted by the majority of the individual classifiers is selected. Additionally, for regression tasks, the final prediction is obtained by averaging the predictions from each regressor, providing a robust ensemble prediction from the diverse models in the voting pipeline.
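To make the distinction between the two mechanisms concrete, here is a toy case where they disagree for a five-classifier ensemble:

```python
import numpy as np

# Positive-class probabilities predicted by five classifiers for one molecule.
probs = np.array([0.90, 0.90, 0.40, 0.45, 0.48])

# Soft voting: average the probabilities, then threshold.
soft_vote = int(probs.mean() >= 0.5)                    # mean = 0.626 -> class 1

# Hard voting: threshold each classifier first, then take the majority.
hard_vote = int((probs >= 0.5).sum() > len(probs) / 2)  # 2 of 5 votes -> class 0
```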

Unsupervised learning

Unsupervised learning has become increasingly important in exploring the vast amounts of available and sometimes unlabeled chemical data. Through techniques like clustering and dimensionality reduction, unsupervised learning allows the identification of patterns and relationships in large datasets without prior knowledge.

DeepMol provides a simple interface to various techniques, such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), K-means clustering, and Uniform Manifold Approximation and Projection (UMAP). These techniques can help scientists to efficiently analyze and visualize complex molecular data, providing valuable insights to guide subsequent analyses.
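Since these methods wrap standard implementations, the underlying operation on a featurized dataset (with an assumed feature matrix `dataset.X`) can be sketched as:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Project molecules into two dimensions for visualization.
pca_coords = PCA(n_components=2).fit_transform(dataset.X)
tsne_coords = TSNE(n_components=2, perplexity=30).fit_transform(dataset.X)
```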

Data splitting

Splitting the data into training, validation and test data sets or in a cross-validation setting is a crucial step in ML pipelines. It is common for practitioners to require data separation based on molecular structures or similarities. Practitioners may opt for a homogeneous data split to maintain a balanced distribution of molecular structures or similarities across the sets. This approach aims to capture the diversity of the dataset, while providing reliable performance estimates for the ML model. Alternatively, practitioners may emphasize the separation of highly similar molecules from less similar ones (heterogeneous splits). By isolating similar molecules, practitioners can better assess the model’s ability to capture nuanced differences and generalize to unseen data.

DeepMol provides molecular splitters that split the inputted dataset based on similarity, scaffolds or Butina clusters [44], while maintaining the stratification of classes. Figure 2 illustrates how the similarity and scaffold splits are distributed in the overall chemical space and how the homogeneity parameters control the split.

Moreover, DeepMol provides random and stratified splitters for single and multitask classification problems. More information about each splitter is provided in Supplementary Table S7.
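A sketch of the splitter interface follows; the class names match the splitters described above, while the fraction and threshold argument names are assumptions based on the documentation:

```python
from deepmol.splitters import ScaffoldSplitter, SimilaritySplitter

# Scaffold-based split: compounds sharing a scaffold are kept together.
train, valid, test = ScaffoldSplitter().train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)

# Similarity-based split with a tunable homogeneity threshold.
train, test = SimilaritySplitter().train_test_split(
    dataset, frac_train=0.8, homogenous_threshold=0.7)
```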

Fig. 2

Depiction of dataset splits in the chemical space for the mutagenicity dataset from [45] provided by TDC. A similarity matrix was generated using the Tanimoto similarity metric over Morgan fingerprints, and t-SNE was applied to the matrix for dimensionality reduction and visualization. A represents the similarity splitter and examples of molecules belonging to small clusters. One can regulate the homogeneity of the splits by tweaking the homogeneity parameter, which assigns all compounds with a similarity lower than the parameter value to the same set. The higher the parameter is, the more heterogeneous the split will be. B represents the scaffold splitter and scaffolds belonging to the splits on the plot on the right. This splitter can separate the data by putting compounds with the same scaffolds in different data splits (homogeneous split) or not (heterogeneous split)

Imbalanced data

Imbalanced data is a common problem in many real-world applications that heavily affect the quality and reliability of ML and DL approaches [46]. Likewise, imbalanced data is an issue in QSAR studies, where the difference between the number of active and inactive molecules can be extreme [47]. This difference generally leads to biased and sub-optimal models, as traditional ML algorithms may not be able to learn the minority class effectively.

Data balancing techniques have shown potential in mitigating the effect of imbalanced data on the model performance [47]. DeepMol provides various methods for handling imbalanced data, including over-sampling, under-sampling, and combination methods, as in the imbalanced-learn package [48]. For over-sampling, it provides a random over-sampler and Synthetic Minority Over-sampling Technique (SMOTE) [49], while Random under-sampler and ClusterCentroids can be used for under-sampling. Finally, for simultaneous over and under-sampling, it provides SMOTE with Edited Nearest Neighbours (SMOTEENN) [50] and SMOTE using Tomek links (SMOTETomek) [51]. More information about each imbalanced learning technique is provided in Supplementary Table S8.
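As these methods come from imbalanced-learn, the underlying resampling of a featurized dataset (assumed feature matrix `dataset.X` and labels `dataset.y`) can be sketched directly with that package:

```python
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Over-sample the minority class with synthetic interpolated examples.
X_res, y_res = SMOTE(random_state=42).fit_resample(dataset.X, dataset.y)

# Combined over- and under-sampling (SMOTE followed by Edited Nearest Neighbours).
X_res, y_res = SMOTEENN(random_state=42).fit_resample(dataset.X, dataset.y)
```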

While techniques like SMOTE have shown efficacy in many fields, including in QSAR studies, we recognize its limitations, specifically for molecular data. SMOTE operates by generating synthetic samples through interpolation in the feature space, which can be problematic for molecular data where structural validity and biological relevance are crucial. For instance, stereoisomers, molecules that have the same connectivity but differ in spatial orientation, can pose significant risks when oversampling. Even though stereoisomers may share similar molecular fingerprints or descriptor-based features, their distinct three-dimensional arrangements often lead to different biological activities and interactions. Using SMOTE in such cases could introduce synthetic samples that fail to respect the nuanced differences between stereoisomers, inadvertently leading to models that overlook critical stereochemical distinctions. This could result in predictions that generalize poorly and even lead to unsafe recommendations in applications like drug discovery or toxicity prediction.

Given these considerations, including the potential issues with synthetic sample generation in molecular datasets, we opted not to integrate imbalanced learning techniques directly into the AutoML pipeline in DeepMol. Instead, we leave the decision to use such techniques to the user, allowing for more control and customization of their model-building process.

Feature importance and model interpretability

Model interpretability, an essential aspect in the field of ML, refers to the capacity to comprehend and elucidate the internal mechanisms of an ML model. This ability provides insights into the rationale behind the model’s predictions or decisions, fostering a sense of trust in the model’s outputs.

SHAP (SHapley Additive exPlanations) [52] stands out as a widely utilized technique for achieving model explainability. It leverages the Shapley value, a fundamental concept derived from cooperative game theory, to assign significance to each input feature within a model. By comparing the predictions of various feature subsets, both with and without a specific feature, the Shapley value quantifies the marginal contribution of that feature to a prediction.

Visualizing SHAP values can take various forms, including bar charts, scatter plots, and summary plots. These visual representations facilitate the interpretation of how individual features influence the model’s predictions, aiding in identifying the most crucial features for a given prediction scenario.

Along with these methods, DeepMol allows one to visualize the structures associated with bits in Morgan and RDKit fingerprints and MACCS keys. Therefore, after computing the SHAP values, one can directly cross information from the most relevant features to the molecule and draw biological and chemical conclusions, as depicted in Fig. 3.
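Conceptually, explaining a fitted model over a featurized dataset follows the standard SHAP workflow; in the sketch below, `model.model` (the wrapped Scikit-Learn estimator) and `dataset.feature_names` are assumed attributes of DeepMol objects:

```python
import shap

# Build an explainer for the trained estimator and compute SHAP values.
explainer = shap.Explainer(model.model, dataset.X)
shap_values = explainer(dataset.X)

# Summary plot: per-feature impact on the model output across all molecules.
shap.summary_plot(shap_values, dataset.X, feature_names=dataset.feature_names)
```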

Fig. 3

Depiction of SHAP values for a ridge classifier and the most relevant MACCS keys in the molecules. The dataset used was the bioavailability dataset from [53] provided by TDC. A provides the overall feature importance of all labels for all data points and depicts two of the most relevant MACCS keys for the prediction. The highlighted atoms and bonds are the ones that correspond to the MACCS key. B provides a depiction of the most important features of the drug cinacalcet, showcasing how it is also possible to assess feature importance for individual cases

Results

Results on the ADMET benchmarks

In this section, we discuss the performance of DeepMol AutoML on the 22 TDC ADMET benchmark group datasets. Thousands of pre-processing and feature extraction steps, models, and respective hyperparameters were tested and optimized using a fully automated process.

Each benchmark dataset within TDC includes its own training, validation, and test sets, alongside an evaluation metric. The 22 datasets cover a combination of binary classification and regression tasks. The dataset splits are computed through a scaffold split method, ensuring that the training and test sets feature distinct molecular structures. For regression tasks, the primary evaluation metric was the mean absolute error (MAE), with some exceptions where Spearman’s correlation (SC) was employed. Binary classification datasets were evaluated using the area under the receiver operating characteristic curve (AUROC) metric when the balance between positive and negative examples was even. Alternatively, when negative examples substantially outnumbered positive ones, the area under the precision-recall curve (AUPRC) was used.

Benchmarking involves assessing an ML model’s performance using specific ADMET datasets. This is done by first dividing these datasets into training, validation, and test sets using five distinct data seeds to ensure variability. The model is then trained using the training and validation sets, and its effectiveness is evaluated on the test set. Finally, the average and standard deviation of the model’s performance across all five seeds are calculated, providing a comprehensive measure of its reliability and accuracy.

Our experimental setup involved using the DeepMol AutoML engine to train models with the training sets and evaluate them on the validation sets. The optimization criterion for the Optuna search engine was to optimize the mean of the selected metric values across all validation sets. The Tree-structured Parzen Estimator (TPE) algorithm was utilized across 100 trials for hyperparameter and method selection.

Upon the conclusion of each experiment, the optimal pipeline was identified following the objective function criteria. Subsequently, the model underwent retraining by integrating both the validation and training datasets and its performance was assessed on the test set in alignment with the guidelines specified for submission. Voting Pipelines are also included in this analysis.

We compared our results with the TDC ADMET benchmark group leaderboard submissions. Overall, our AutoML framework demonstrated good performance, securing top-1 placement in one benchmark and achieving top-3 and top-5 rankings in four and eleven benchmarks, respectively. While our performance may not have been outstanding across all benchmarks, it is crucial to understand the remarkable nature of these results. This is particularly noteworthy, considering that the majority of benchmark submissions typically rely on handcrafted pre-processing steps and models selected for their performance. Our AutoML framework, operating without explicit instructions and requiring no prior domain knowledge from the users, efficiently navigated an extensive array of possibilities. That resulted in fully reproducible end-to-end pipelines that demonstrated competitive results. More details on each of the voting pipelines and ensembles are given in Supplementary Table S11.

Absorption benchmarks

The absorption benchmarks measure how a drug travels from the administration site to its action site. These benchmarks consist of six datasets, three of which involve classification, and three involve regression. The AutoML framework of DeepMol achieved the top rank on the leaderboard for the human intestinal absorption (HIA) dataset and the second position for the aqueous solubility (AqSol) dataset. Interestingly, the HIA dataset, comprising 578 molecules, is the smallest in the benchmark, while the AqSol dataset, consisting of 9982 molecules, is the largest. This shows the effectiveness of our framework in finding robust models across varying data sizes. Additionally, it is noteworthy that graph-based models outperformed others in three out of the six benchmarks. The results for the six benchmarks and respective pre-processing steps and models can be seen in Table 1.

Although DeepMol did not achieve equally competitive performance on the drug oral bioavailability (Bioav) and lipophilicity (Lipo) datasets, the AutoML results provided invaluable hints regarding the best-performing methods. Optuna provides a dashboard to explore the results across all the trials. We analysed the results and focused our pipeline optimization on the best-performing methods and ML models. This allowed us to identify histogram-based gradient-boosting classification trees as top-performing models for the Bioav dataset and stacking ensembles for the Lipo dataset, leading us to set up smaller experiments and reduce the configuration space of the method search. We improved the AUROC on the Bioav test set from 0.617 to 0.753 and secured first place in the leaderboard. More modestly, we secured the sixth spot in the Lipo dataset leaderboard.

Table 1 Performance of DeepMol in the six absorption-related properties in the TDC ADMET benchmark group

Distribution benchmarks

The distribution benchmarks measure how a drug moves to and from the various tissues of the body and the amount of drug concentration in those tissues. These benchmarks consist of three datasets, two involving regression and one classification. The results of this benchmark were less satisfactory, particularly in the case of the classification dataset that measures the drug’s ability to penetrate the blood-brain barrier, ranking in the lower half of the entire benchmark. Notably, this particular benchmark appears to be among the most challenging, with only ensemble-based strategies showing the best performance. The results for the three benchmarks and respective pre-processing steps and models can be seen in Table 2.

Table 2 Performance of DeepMol in the three distribution-related properties in the TDC ADMET benchmark group

Metabolism benchmarks

The metabolism benchmarks evaluate the breakdown of drugs by specialized enzymatic systems, determining the duration and intensity of their effects. These benchmarks consist of six classification datasets. We achieved top-2 performance in two datasets and top-5 in three. Notably, unlike the absorption benchmarks, which thrived on graph-based approaches, simpler methods such as molecular descriptors and fingerprints yielded superior results in the metabolism benchmarks. Similarly to the distribution benchmarks, ensemble learning techniques outperformed individual models. The results for the six benchmarks and respective pre-processing steps and models can be seen in Table 3.

Table 3 Performance of DeepMol in the six metabolism-related properties in the TDC ADMET benchmark group

Excretion benchmarks

The excretion benchmarks measure the removal of drugs from the body using various routes. These benchmarks consist of three regression datasets.

The exploration of predictive models across these datasets reveals that CNNs, alongside the ChEMBL standardizer, emerge as standout methodologies. For the Half-Life dataset, DeepMol custom standardization and scaled descriptors combined with a 1D CNN achieved an SC score of 0.485, ranking fifth. For the CL-Hepa dataset, a voting pipeline that standardizes molecules with the ChEMBL standardizer and comprises three Directed Message Passing Neural Networks (D-MPNNs) and two GCNs reached fourth place. For the CL-Micro dataset, optimal performance was achieved using a voting pipeline that combined five pipelines employing the ChEMBL standardizer with five distinct TextCNNs. This pipeline achieved a modest 11th place, lagging behind first place by a small SC margin of 0.05.

The results for the three benchmarks and respective pre-processing steps and models can be seen in Table 4.

Table 4 Performance of DeepMol in the three excretion-related properties in the TDC ADMET benchmark group

Toxicity benchmarks

The toxicity benchmarks measure how much damage a drug can cause to organisms. These benchmarks consist of four datasets, one involving regression and three for classification.

DeepMol achieved good results for the Ames and LD50 datasets, ranking fifth and fourth, respectively. For the former dataset, the best configuration was a voting pipeline with five GNNs that standardized molecules with the ChEMBL standardizer; for the latter, it was a pipeline using a custom standardizer, layered FPs, a feature selection method that selected features based on the 79th percentile of the scores of a univariate linear regression test (for more information refer to [20]), and a voting regressor composed of five different models. On the other hand, the results for the remaining datasets were far from impressive, with a GCN ending up in 9th place on the hERG dataset leaderboard and a voting pipeline securing a mid-table position on the DILI dataset leaderboard.

The results for the four benchmarks and respective pre-processing steps and models can be seen in Table 5.

Table 5 Performance of DeepMol in the four toxicity-related properties in the TDC ADMET benchmark group

Practical applications of DeepMol

Several publications have already showcased the versatility of DeepMol in various domains, such as drug discovery, food science and natural product discovery, across different tasks. Numerous experiments have exploited the diverse array of methods offered by DeepMol, highlighting its user-friendly nature and contributing to significant research advancements.

In their studies [42, 54], Baptista and colleagues used DeepMol to investigate which compound representation methods are most suitable for drug sensitivity prediction in cancer cell lines. They benchmarked twelve compound representations, including molecular fingerprints and DL-based representation learning methods, using two classification and three regression datasets from human cancer cell line drug screenings. The authors found that most compound representations performed similarly. Still, some end-to-end deep learning models performed on par with or even outperformed traditional fingerprint-based models, even when dealing with smaller datasets. Furthermore, the authors utilized DeepMol’s feature importance methods to enhance the interpretability of fingerprint-based deep learning models. They demonstrated that consistently highlighted features were known to be associated with drug response.

Capela et al. [43] conducted a study in which they trained and evaluated 66 different model configurations to predict the relationships between chemical structures and sweetness. Throughout the study pipeline, DeepMol was utilized for several tasks, such as molecular standardization, feature generation, feature selection, model construction, hyperparameter tuning, and model explainability. Furthermore, a subset of the trained models was employed to screen 60 million molecules from PubChem [55] in search of potential sweeteners. The authors successfully identified numerous derivatives of potent and artificial sweeteners, some of which were patented as sweetening agents and were not included in the original training data. This demonstrated the remarkable capability of DeepMol in helping design new sweeteners and repurposing existing compounds.

In a recent study [56], DeepMol AutoML was used to address the challenge of predicting precursors of specialized plant metabolites, which play critical roles in plant defence and have significant economic implications. Despite these compounds’ complexity, vast diversity, and the current gaps in understanding their biosynthesis, DeepMol’s methodology stood out for its efficacy. It helped identify regularized linear classifiers that surpass existing state-of-the-art approaches in performance while also offering chemical explanations for their predictions. This approach marks a significant advancement in expediting the discovery of biosynthetic pathways, highlighting the potential of DeepMol’s AutoML in finding high-performing models.

Discussion: comparison with other tools

Several chemoinformatics toolkits have been developed in recent years [10, 57,58,59,60], including DeepChem (https://deepchem.io/), OpenChem [59], AMPL [58], ZairaChem [16], QSARTuna [17], and MolPipeline [61]. These open-source projects, developed in Python, are used to construct pipelines for QSAR/QSPR modeling. A comparison between DeepMol and the other tools is showcased in Fig. 4.

Fig. 4

Heatmap showcasing the presence or absence of an integrated way of performing relevant tasks for constructing QSAR/QSPR systems. Green stands for presence, red for absence and yellow for incomplete integration

Regarding dataset loading, DeepMol, DeepChem, QSARTuna, and MolPipeline support CSV and SDF formats, while OpenChem, AMPL, and ZairaChem only support CSV files. Hence, a limitation of the latter three tools is their incapability to import pre-computed 3D structures.

Molecule standardization is available in DeepMol through various methods. OpenChem allows sanitizing molecules, padding sequences and canonicalizing SMILES. In contrast, AMPL allows the stripping of salt groups, the neutralization and kekulization of molecules, and the replacement of any rare isotopes with the most common ones for each element. DeepChem only allows sanitizing molecules when loading datasets. ZairaChem applies the MELLODDY-Tuner protocol, which includes disconnecting metal atoms from the rest of the molecule, reionizing the molecule, adjusting charges and protonation states, assigning stereochemistry to the molecule, ensuring consistent representation of chiral centres, and additional standardization procedures. QSARTuna does not incorporate chemical standardization methods (by the definition described in this work, see Methods); however, it does include other essential preprocessing functionalities such as identifying missing data and duplicates, as well as handling various representations of the same molecule. MolPipeline provides a wide and complete set of standardization methods, including tautomer canonicalization, salt removal, molecule sanitization, and largest fragment selection, among others.

All the tools except OpenChem can perform data splitting, and only DeepMol can perform stratified splits specific to molecular data and multitask classification. DeepMol stands out in this regard because it provides parameterizable stratified splits based on scaffolds and similarity, as well as stratified splitting using the Butina algorithm [44]. Additionally, all the tools are capable of generating numerical features from molecules.

Model construction in these toolkits primarily focuses on DL models, but all the tools provide shallow learning models except for OpenChem. Hyperparameter and architecture optimization are possible in all the tools except OpenChem. Moreover, while all the methods offer pipelines, only DeepMol, ZairaChem, and QSARTuna provide a comprehensive and efficient approach to automating pipeline optimization (AutoML). However, QSARTuna only provides the optimization of feature extraction methods and models, while both DeepMol and ZairaChem optimize other methods besides these two steps.

DeepMol and QSARTuna excel compared to other tools in providing feature importance analysis and effectively connecting features to molecular structures. Both ZairaChem and DeepMol also offer methods to address unbalanced datasets and perform integrated and automated feature selection, distinguishing them from the other assessed tools. ZairaChem applies feature selection through Autogluon, which offers only one feature selection method. QSARTuna applies variance threshold and co-correlated feature selection filtering only.

Even though DeepMol and ZairaChem appear to perform similarly for these tasks, it is important to note that ZairaChem is limited to binary classification problems. Consequently, the features provided by ZairaChem, as indicated in Fig. 4, are specifically designed for binary classification tasks.

DeepMol offers a more complete and robust alternative to MolPipeline by also employing an integrated, self-contained pipeline serialization approach that facilitates deployment under the fit, transform, and predict paradigms. In addition to encompassing all scikit-learn models, it supports a variety of DL models and allows both custom and fully automated pipeline optimization (AutoML).

For AutoML capabilities, both DeepMol and QSARTuna utilize Optuna for pipeline optimization. However, QSARTuna is more limited in scope, providing only eight feature extraction methods and approximately 20 models, with Directed Message Passing Neural Network (D-MPNN) and Feed Forward Neural Network (FFNN) as the only DL options. This narrower range can be restrictive for users seeking diverse modeling approaches. QSARTuna also lacks optimization functionalities for chemical standardization, feature scaling, and feature selection.

Another key difference lies in user configuration: QSARTuna requires users to create a custom configuration detailing the methods to be used before execution. In contrast, DeepMol simplifies the process by offering predefined configuration sets tailored to user requirements (see implementation in Supplementary Figure S1). For advanced users, DeepMol further enhances flexibility by supporting the implementation of new objective functions for generating metrics to inform the AutoML system in various validation scenarios, such as cross-validation or hold-out validation. However, QSARTuna includes features for probability calibration and uncertainty estimation, which are not yet available in DeepMol.

To compare the performance of DeepMol and QSARTuna, we evaluated them using three datasets: Pgp and CYP2D6 Substrate (both from TDC Commons) and DEL (from the QSARTuna publication). We created challenging train-test splits by using DeepMol’s similarity splitter with a homogeneity threshold of 0.7. For pipeline optimization, we set 20, 50, and 100 trials for the Pgp, CYP2D6 Substrate, and DEL datasets, respectively. We aimed to include all available methods by selecting the “all” pre-set for DeepMol and utilizing every method documented for QSARTuna. Table 6 presents the comparative performance of the tools on these datasets. The metrics were chosen based on class balance, with Pgp having balanced classes and CYP2D6 Substrate and DEL being imbalanced.

Table 6 Performance of DeepMol compared to QSARTuna

DeepMol outperforms QSARTuna across all three datasets in most predictive metrics, showcasing stronger overall performance, particularly in terms of F1 score, precision, and MCC. However, this superior performance comes with a significantly longer runtime, with DeepMol taking considerably more time than QSARTuna, especially as dataset size increases. For the Pgp dataset, DeepMol’s higher ROC-AUC and F1 score highlight its advantage for small-scale studies, despite the additional computational cost. In the medium-sized CYP2D6 dataset, DeepMol maintains better metrics, though QSARTuna achieves higher precision, suggesting it is more conservative with its positive predictions. On the large DEL dataset, while DeepMol excels in F1 score and precision, QSARTuna has better recall and balanced accuracy, indicating a trade-off between capturing true positives and precision. It is important to note that DeepMol requires significantly more time due to the use of the “all” pre-set, which directs Optuna to explore the entire configuration space, including computationally intensive and time-consuming methods like DL models. Selecting the ’sklearn’ pre-set would result in faster methods and reduced runtime.

Conclusion

In conclusion, DeepMol emerges as a powerful alternative to similar tools, offering a Python-based open-source framework for predicting activities and properties of chemical molecules. Its modular design allows researchers to customize every aspect of the ML pipeline, from data processing to model prediction and explainability, catering to users with varying levels of computational expertise. DeepMol’s AutoML modules further simplify the process by automatically optimizing pre-processing, data engineering techniques, and ML/DL models and their hyperparameters, streamlining the selection of the most suitable combinations for a given dataset. By providing a user-friendly and easily customizable platform, DeepMol empowers researchers to efficiently explore thousands of configurations, making it an invaluable resource for accelerating chemical discovery. With the support of well-established packages like RDKit, Scikit-Learn, Tensorflow, and DeepChem, DeepMol represents a promising avenue for optimizing and deploying pipelines, offering a vital contribution to the advancement of computational chemistry. The framework’s availability as an open-source resource encourages collaboration and innovation in the field, facilitating progress and empowering researchers in their quest for finding new chemicals with improved properties.

Future work should focus on developing advanced methods for feature selection that can effectively handle the challenges posed by correlated features to minimize overfitting and improve model generalizability. Additionally, incorporating tools that assist in the interpretation of feature importance in the presence of feature correlation would be invaluable for guiding users toward more robust and chemically meaningful insights.

Availability of data and materials

DeepMol is available at https://github.com/BioSystemsUM/DeepMol. Additionally, it is easily installed through PyPi with the command pip install deepmol[all], or as a docker image docker pull biosystemsum/deepmol:latest. Comprehensive documentation with examples for each step described in this paper is provided at https://deepmol.readthedocs.io/en/latest/. Runtimes and memory required for each method available are documented in Supplementary Material. All the models can be accessed in https://doi.org/10.5281/zenodo.11184008 and the code and data for the experiments in https://github.com/BioSystemsUM/deepmol_case_studies.

References

  1. Hessler G, Baringhaus KH (2018) Artificial intelligence in drug design. Molecules 23(10):2520. https://doi.org/10.3390/molecules23102520

  2. Shen J, Nicolaou CA (2019) Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discov Today: Technol 32–33:29–36. https://doi.org/10.1016/j.ddtec.2020.05.001

  3. Gasteiger J (2020) Chemistry in times of artificial intelligence. ChemPhysChem 21(20):2233–2242. https://doi.org/10.1002/cphc.202000518

  4. Walters WP, Barzilay R (2020) Applications of deep learning in molecule generation and molecular property prediction. Acc Chem Res 54(2):263–270. https://doi.org/10.1021/acs.accounts.0c00699

  5. Montavon G, Rupp M, Gobre V, Vazquez-Mayagoitia A, Hansen K, Tkatchenko A et al (2013) Machine learning of molecular electronic properties in chemical compound space. New J Phys 9(15):095003. https://doi.org/10.1088/1367-2630/15/9/095003

  6. Tkatchenko A (2020) Machine learning for chemical discovery. Nat Commun 8(11):4125. https://doi.org/10.1038/s41467-020-17844-8

  7. Karelson M, Lobanov VS, Katritzky AR (1996) Quantum-chemical descriptors in QSAR/QSPR studies. Chem Rev 1(96):1027–1044. https://doi.org/10.1021/cr950202r

  8. Berhanu WM, Pillai GG, Oliferenko AA, Katritzky AR (2012) Quantitative structure-activity/property relationships: the ubiquitous links between cause and effect. ChemPlusChem 77(7):507–517. https://doi.org/10.1002/cplu.201200038

  9. Costa PCS, Evangelista JS, Leal I, Miranda PCML (2020) Chemical graph theory for property modeling in QSAR and QSPR-charming QSAR and QSPR. Mathematics 9(1):60. https://doi.org/10.3390/math9010060

  10. Guidotti IL, Neis A, Martinez DP, Seixas FK, Machado K, Kremer FS (2023) Bambu and its applications in the discovery of active molecules against melanoma. J Mol Graph Model 124:108564. https://doi.org/10.1016/j.jmgm.2023.108564

  11. Sliwoski G, Kothiwale S, Meiler J, Lowe EW Jr (2014) Computational methods in drug discovery. Pharmacol Rev 66(1):334–395. https://doi.org/10.1124/pr.112.007336

  12. Wu Z, Zhu M, Kang Y, Leung ELH, Lei T, Shen C et al (2021) Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. Brief Bioinform 22(4). https://doi.org/10.1093/bib/bbaa321

  13. Xu Y (2022) Deep neural networks for QSAR, pp 233–260

  14. Li Z, Jiang M, Wang S, Zhang S (2022) Deep learning methods for molecular representation and property prediction. Drug Discov Today 27(12):103373. https://doi.org/10.1016/j.drudis.2022.103373

  15. Orosz Á, Héberger K, Rácz A (2022) Comparison of descriptor- and fingerprint sets in machine learning models for ADME-tox targets. Front Chem. https://doi.org/10.3389/fchem.2022.852893

  16. Turon G, Hlozek J, Woodland JG et al (2023) First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa. Nat Commun 14:5736. https://doi.org/10.1038/s41467-023-41512-2

  17. Mervin L, Voronov A, Kabeshov M, Engkvist O (2024) QSARtuna: an automated QSAR modeling platform for molecular property prediction in drug design. J Chem Inform Model 64(14):5365–5374. https://doi.org/10.1021/acs.jcim.4c00457

  18. Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: A Next-generation Hyperparameter Optimization Framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

  19. RDKit: open-source cheminformatics. http://www.rdkit.org

  20. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  21. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Available from: https://www.tensorflow.org/

  22. Ramsundar B, Eastman P, Walters P, Pande V, Leswing K, Wu Z (2019) Deep learning for the life sciences. O’Reilly Media, Sebastopol

  23. Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, et al. (2021) Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. In: Vanschoren J, Yeung S, editors. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. vol. 1. Curran. Available from: https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/4c56ff4ce4aaf9573aa5dff913df997a-Paper-round1.pdf

  24. Probst D, Reymond JL (2018) SmilesDrawer: parsing and drawing SMILES-encoded molecular structures using client-side javascript. J Chem Inform Model 58(1):1–7. https://doi.org/10.1021/acs.jcim.7b00425

  25. Gasteiger J, Sadowski J, Schuur J, Selzer P, Steinhauer L, Steinhauer V (1996) Chemical information in 3D space. J Chem Inform Comput Sci 36(5):1030–1037. https://doi.org/10.1021/ci960343+

  26. Polanski J, Gasteiger J (2017) Computer representation of chemical compounds. In: Leszczynski J, Kaczmarek-Kedziera A, Puzyn T, Papadopoulos MG, Reis H, Shukla MK (eds) Handbook of computational chemistry. Springer International Publishing, Cham

  27. Wigh DS, Goodman JM, Lapkin AA (2022) A review of molecular representation in the age of machine learning. WIREs Comput Mol Sci 12:e1603. https://doi.org/10.1002/wcms.1603

  28. Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ (2016) An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR QSAR Environ Res 27(11):911–937. https://doi.org/10.1080/1062936X.2016.1253611

  29. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inform Model 50(7):1189–1204. https://doi.org/10.1021/ci100176x

  30. Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson F et al (2020) An open source chemical structure curation pipeline using RDKit. J Cheminform. https://doi.org/10.1186/s13321-020-00456-1

  31. Hähnke VD, Kim S, Bolton EE (2018) PubChem chemical structure standardization. J Cheminform. https://doi.org/10.1186/s13321-018-0293-8

  32. Karapetyan K, Batchelor C, Sharpe D, Tkachenko V, Williams AJ (2015) The chemical validation and standardization platform (CVSP): large-scale automated validation of chemical structure datasets. J Cheminform. https://doi.org/10.1186/s13321-015-0072-8

  33. Consonni V, Todeschini R (2009) Molecular descriptors for chemoinformatics. John Wiley & Sons, Hoboken

  34. Todeschini R, Consonni V (2010) Molecular descriptors. Recent Adv QSAR Stud. https://doi.org/10.1007/978-1-4020-9783-6_3

  35. Moriwaki H, Tian YS, Kawashita N, Takagi T (2018) Mordred: a molecular descriptor calculator. J Cheminform 10:4. https://doi.org/10.1186/s13321-018-0258-y

  36. Li X, Fourches D (2021) SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J Chem Inform Model 61(4):1560–1569. https://doi.org/10.1021/acs.jcim.0c01127

  37. Kearnes S, McCloskey K, Berndl M, Pande V, Riley P (2016) Molecular graph convolutions: moving beyond fingerprints. J Comput-Aided Mol Des 30(8):595–608. https://doi.org/10.1007/s10822-016-9938-8

  38. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, et al. (2015) Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems. 28

  39. De Cao N, Kipf T (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint. https://doi.org/10.48550/arXiv.1805.11973

  40. Montavon G, Hansen K, Fazli S, Rupp M, Biegler F, Ziehe A, et al. (2012) Learning invariant representations of molecules for atomization energy prediction. Advances in neural information processing systems. 25

  41. Kursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. J Stat Softw 36(11):1–13. https://doi.org/10.18637/jss.v036.i11

  42. Baptista D, Correia J, Pereira B, Rocha M (2022) Evaluating molecular representations in machine learning models for drug response prediction and interpretability. J Integr Bioinform. https://doi.org/10.1515/jib-2022-0006

  43. Capela J, Correia J, Pereira V, Rocha M (2022) Development of Deep Learning approaches to predict relationships between chemical structures and sweetness. In: 2022 International Joint Conference on Neural Networks (IJCNN). IEEE. Available from: https://doi.org/10.1109/ijcnn55064.2022.9891992

  44. Butina D (1999) Unsupervised data base clustering based on Daylight’s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inform Comput Sci 39(4):747–750. https://doi.org/10.1021/ci9803381

  45. Xu C, Cheng F, Chen L, Du Z, Li W, Liu G et al (2012) In silico prediction of chemical Ames mutagenicity. J Chem Inform Model 52(11):2840–2847

  46. Johnson JM, Khoshgoftaar TM (2019) Survey on deep learning with class imbalance. J Big Data. https://doi.org/10.1186/s40537-019-0192-5

  47. Korkmaz S (2020) Deep learning-based imbalanced data classification for drug discovery. J Chem Inform Model 60(9):4180–4190. https://doi.org/10.1021/acs.jcim.9b01162

  48. Lemaître G, Nogueira F, Aridas CK (2017) Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res 18(17):1–5

  49. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357. https://doi.org/10.1613/jair.953

  50. Batista GEAPA, Bazzan ALC, Monard MC (2003) Balancing Training Data for Automated Annotation of Keywords: a Case Study. In: WOB

  51. Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newslett 6(1):20–29. https://doi.org/10.1145/1007730.1007735

  52. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. Advances in neural information processing systems. 30

  53. Ma CY, Yang SY, Zhang H, Xiang ML, Huang Q, Wei YQ (2008) Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA-CG-SVM method. J Pharm Biomed Anal 47(4–5):677–682

  54. Baptista D, Correia J, Pereira B, Rocha M (2021) A Comparison of Different Compound Representations for Drug Sensitivity Prediction. In: Practical Applications of Computational Biology and Bioinformatics, 15th International Conference (PACBB 2021). Springer International Publishing. p. 145–154. Available from: https://doi.org/10.1007/978-3-030-86258-9_15

  55. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S et al (2022) PubChem 2023 update. Nucleic Acids Res 51(D1):D1373–D1380. https://doi.org/10.1093/nar/gkac956

  56. Capela J, Cheixo J, de Ridder D, Rocha M, Dias O (2024) Automated Machine Learning to Predict the Precursors of Plant Specialized Metabolites. Manuscript submitted

  57. Tangadpalliwar SR, Vishwakarma S, Nimbalkar R, Garg P (2019) ChemSuite: a package for chemoinformatics calculations and machine learning. Chem Biol Drug Des 93(5):960–964. https://doi.org/10.1111/cbdd.13479

  58. Minnich AJ, McLoughlin K, Tse M, Deng J, Weber A, Murad N et al (2020) AMPL: a data-driven modeling pipeline for drug discovery. J Chem Inform Model 60(4):1955–1968. https://doi.org/10.1021/acs.jcim.9b01053

  59. Korshunova M, Ginsburg B, Tropsha A, Isayev O (2021) OpenChem: a deep learning toolkit for computational chemistry and drug design. J Chem Inform Model 61(1):7–13. https://doi.org/10.1021/acs.jcim.0c00971

  60. Brown BP, Vu O, Geanes AR, Kothiwale S, Butkiewicz M, Lowe EW et al (2022) Introduction to the biochemical library (BCL): an application-based open-source toolkit for integrated cheminformatics and machine learning in computer-aided drug discovery. Front Pharmacol. https://doi.org/10.3389/fphar.2022.833099

  61. Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P et al (2024) MolPipeline: a python package for processing molecules with RDKit in scikit-learn. J Chem Inform Model. https://doi.org/10.1021/acs.jcim.4c00863

Acknowledgements

Not applicable.

Funding

This study was supported by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UIDB/04469/2020 unit, and by LABBELS - Associate Laboratory in Biotechnology, Bioengineering and Microelectromechanical Systems, LA/P/0029/2020. Moreover, this research was supported by FCT through the DeepBio project (ref. NORTE-01-0247-FEDER-039831), funded by Lisboa 2020, Norte 2020 and Portugal 2020, and by the project SHIKIFACTORY100 - Modular cell factories for the production of 100 compounds from the shikimate pathway (reference 814408). We also thank FCT for the PhD fellowships of J. Capela (DFA/BD/08789/2021) and J. Correia (SFRH/BD/144314/2019).

Author information

Contributions

Correia J. and Capela J. developed the methodology. Correia J. and Capela J. performed the analysis. Correia J., Capela J. and Rocha M. wrote the manuscript. Correia J., Capela J. and Rocha M. conceptualized the study. Rocha M. supervised the study. All authors edited and approved the final manuscript. Correia J. and Capela J. contributed equally to this work.

Corresponding author

Correspondence to Miguel Rocha.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Correia, J., Capela, J. & Rocha, M. Deepmol: an automated machine and deep learning framework for computational chemistry. J Cheminform 16, 136 (2024). https://doi.org/10.1186/s13321-024-00937-7


Keywords