- Research
- Open access
MATEO: intermolecular α-amidoalkylation theoretical enantioselectivity optimization. Online tool for selection and design of chiral catalysts and products
Journal of Cheminformatics volume 16, Article number: 9 (2024)
Abstract
The enantioselective Brønsted acid-catalyzed α-amidoalkylation reaction is a useful procedure for the production of new drugs and natural products. In this context, Chiral Phosphoric Acid (CPA) catalysts are versatile catalysts for this type of reaction. The selection and design of new CPA catalysts for different enantioselective reactions is of dual interest because both new CPA catalysts (tools) and chiral drugs or materials (products) can be obtained. However, this process is difficult and time-consuming if approached from an experimental trial-and-error perspective. In this work, a Heuristic Perturbation-Theory and Machine Learning (HPTML) algorithm was used to seek a predictive model of CPA catalyst performance, in terms of enantioselectivity, in α-amidoalkylation reactions, with R² = 0.96 overall for the training and validation series. It involved a Monte Carlo sampling of > 100,000 pairs of query and reference reactions. In addition, the computational and experimental investigation of a new set of intermolecular α-amidoalkylation reactions using BINOL-derived N-triflylphosphoramides as CPA catalysts is reported as a case study. The model was implemented in a web server called MATEO: InterMolecular Amidoalkylation Theoretical Enantioselectivity Optimization, available online at: https://cptmltool.rnasa-imedir.com/CPTMLTools-Web/mateo. This new user-friendly online computational tool would enable sustainable optimization of reaction conditions that could lead to the design of new CPA catalysts along with new organic synthesis products.
Introduction
Chiral Phosphoric Acid (CPA) and related catalysts are widely recognized, versatile tools in catalysis and organic synthesis, useful for the synthesis of chiral drug products [1,2,3]. The selection and design of new CPA catalysts for different enantioselective reactions is of dual interest because both new CPA catalysts (tools) and chiral drugs or materials (products) can be obtained [4]. However, this process is difficult and time-consuming if approached from an experimental trial-and-error perspective. Quantum Computational Chemistry tools may help to unravel reaction mechanisms and to design new CPA catalysts [5, 6]. Unfortunately, these techniques are less useful when a fast scanning/optimization of new CPA catalysts is needed for large libraries of reactions with diverse substrates, nucleophiles, products, and conditions (temperature, time, catalyst load, etc.). Cheminformatics methods relying upon Artificial Intelligence/Machine Learning (AI/ML) algorithms could help to speed up the discovery of new molecules [7,8,9] and the design of new chiral catalysts and products without engaging in a long-term empirical or quantum investigation [10,11,12,13]. Therefore, there is a need to develop fast-track computational tools able to predict the enantiomeric excess, saving time and experimental resources. However, the application of AI/ML techniques to the study of enantioselective reactions is still uncommon due to the inherent complexity of the problem. In addition, most models are not implemented in public online web servers, or are otherwise unavailable to researchers or companies. In this context, the platform of Sigman et al. for CPA catalyst and organophosphorus ligand design is remarkable [14, 15]. In these works, the authors predict reactivity using structural information of the query reactants/products. However, useful experimental/operational conditions of already known reference reactions similar to the query reaction are not considered.
Recently, our group has addressed this problem by introducing the Perturbation-Theory and Machine Learning (PTML) approach, which employs as inputs both vectors of structural variables Dkqi and vectors of multiple experimental conditions cqj. These PTML algorithms have been applied in medicinal chemistry, vaccine design, nanotechnology, and catalysis [16,17,18,19,20,21]. In fact, we have previously reported a preliminary PTML model for the design of CPA catalysts for intermolecular α-amidoalkylation reactions [22]. However, that model was not implemented on a public online web server and is difficult for an experimentalist to use.
Consequently, in this work we focus on the development of a public web server for the selection and design of CPA catalysts for enantioselective intermolecular α-amidoalkylation reactions (Scheme 1). In these reactions, the protonation of an α-hydroxylactam by the CPA would give a chiral conjugate base/N-acyliminium ion pair, which would be trapped enantioselectively by a nucleophile, generating a new tertiary or quaternary stereocenter [23, 24]. The α-amidoalkylation reaction of aromatic systems using N-acyliminium ions as electrophiles is a Friedel–Crafts-type reaction that has found widespread application in organic synthesis for the production of new drugs and natural products [25, 26]. For example, we have applied the procedure to the enantioselective synthesis of Nuevamine-type alkaloids. Thus, indole and acyl moieties can be easily introduced at the position alpha to the nitrogen atom using a sterically demanding BINOL-derived CPA catalyst [27]. However, the enantioselectivity of these CPA-catalyzed reactions is sensitive to many factors, from the nature of the nucleophile and the catalyst to the experimental conditions (solvent, temperature, etc.). In this context, many efforts have been made to understand the role of non-covalent interactions in organocatalyzed reactions and to rationalize and predict their stereochemical outcome using Quantum Chemical methods [28,29,30]. However, the chemical space accessible by organic synthesis is very wide, and all compatible combinations of substrate, nucleophile, catalyst, and solvent would have to be scanned.
Therefore, the use of Cheminformatics models to explore the chemical space of these reactions is a very interesting option for reducing costs and time. We decided to develop a new user-friendly online computational tool able to carry out screenings of this CPA-catalyzed intermolecular α-amidoalkylation reaction space for a large number of chiral catalysts, substrates, nucleophiles, solvents, chiral products, and reaction conditions. First, we carried out a re-evaluation of all the available data in our record to obtain a better estimate of the chemical space of these reactions. Next, we developed a new PTML model using Heuristics and Monte Carlo sampling, without relying on costly computational calculations. This PTML model was able to predict the enantioselectivity with R² = 0.96 after a comparative study of 332 reactions, which can be paired in > 100,000 ways, as each reaction can be a query or a reference reaction.
Later, we developed the web server called MATEO (interMolecular Amidoalkylation Theoretical Enantioselectivity Optimization), which is available at the online platform CPTMLTool (https://cptmltool.rnasa-imedir.com/). Finally, we illustrate the practical use of the online tool with the experimental-theoretical study of a new set of CPA-catalyzed α-amidoalkylation reactions starting from bicyclic α-hydroxylactams 1 to construct the isoindoloisoquinoline framework 2 with a quaternary stereocenter. Electron-rich heteroaromatics (indole and pyrrole derivatives) 3 were used as nucleophiles and chiral BINOL-derived N-triflylphosphoramides 4 as catalysts (Scheme 2). This new tool may help experimentalists in organic, medicinal, and materials chemistry to explore the chemical space of CPA-catalyzed α-amidoalkylation reactions and to optimize the reaction conditions for practical purposes.
Materials and methods
Dataset and parameter studied
In this paper, we study the enantiomeric excess parameter eeR(%)obs in intermolecular α-amidoalkylation reactions. The value eeR(%)obs quantifies the enantiomeric excess referred to an (R)-catalyst. This parameter is defined as eeR(%)obs = Sign(Prod)·Sign(Cat)·ee(%)obs, where Sign(Prod) = 1 for an (R)-product and Sign(Prod) = − 1 for an (S)-product, with the R or S notation of the experimentally obtained products assigned according to the Cahn-Ingold-Prelog (CIP) rules [31]. The function Sign(Cat) = 1 for all reactions carried out with an (R)-catalyst, irrespective of the product obtained. For the reactions carried out with an (S)-catalyst, the sign was switched from + 1 to Sign(Cat) = − 1, which changes the sign contributed by Sign(Prod) to the opposite one. This operation transforms (S)-catalyst reactions into (R)-catalyst reactions with the same absolute value of enantiomeric excess but the opposite sign of eeR(%)obs. All reactions are expected to give the same result, but with inverted configuration, when the chirality of the catalyst is changed. Consequently, all reactions were studied as if they had been performed using an (R)-catalyst, keeping the (R)-catalyst when originally used or switching the signs for (S)-catalyst reactions. In practice, this procedure allows us to omit chiral molecular descriptors for substrates, products, catalysts, etc., because all the chirality information is included in the eeR(%) terms for the query or reference reactions (see next sections). The method works properly in this specific case because all the reactions give products with only one stereogenic center. Consequently, all the necessary chirality information is included on both sides of the equation without the need for chiral molecular descriptors.
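The sign convention can be illustrated with a minimal sketch. The function name `ee_r_obs` and the example values are hypothetical; the code simply applies the formula eeR(%)obs = Sign(Prod)·Sign(Cat)·ee(%)obs literally and is not the authors' implementation.

```python
# Illustrative sketch only; names and example values are assumptions.
SIGN = {"R": 1, "S": -1}  # +1 for an (R) descriptor, -1 for an (S) descriptor

def ee_r_obs(ee_obs, product_config, catalyst_config):
    """Return the ee referred to an (R)-catalyst: Sign(Prod) * Sign(Cat) * ee(%)obs."""
    return SIGN[product_config] * SIGN[catalyst_config] * ee_obs

# Hypothetical reactions: (observed ee %, product descriptor, catalyst descriptor)
reactions = [(90.0, "R", "R"), (90.0, "S", "R"), (90.0, "R", "S"), (90.0, "S", "S")]
normalised = [ee_r_obs(ee, prod, cat) for ee, prod, cat in reactions]
print(normalised)  # [90.0, -90.0, -90.0, 90.0]
```

With this convention, an (S)-catalyst reaction that gives the (R)-product in 90% ee is stored as − 90, that is, as the (R)-catalyst reaction expected to give the (S)-product.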
Reaction condition variables
Apart from defining the molecular descriptors, we also consider different reaction condition variables Vk(cqi) as input variables in order to quantify a kth property (k = 1, 2, 3) related to a general reaction condition (cq) and/or a specific reactant. In this chemical reaction dataset, the variables taken into account for the ith query reaction were: V1(cqi) = T(°C) = Temperature, V2(cqi) = t(h) = reaction time, and V3(cqi) = L(%) = catalyst loading. By analogy, the variables considered for each jth reference reaction were: V1(crj) = T(°C) = Temperature, V2(crj) = t(h) = reaction time, and V3(crj) = L(%) = catalyst loading.
Dataset studied, compounds and reactions notation
A dataset of 332 CPA-catalyzed enantioselective intermolecular α-amidoalkylation reactions has been compiled, which comprised 324 reactions obtained from literature (see Additional file 3) and 8 new reactions studied in this work for the first time (see Table 8). These reactions have been grouped into 34 families according to the different structural patterns of the substrates, nucleophiles, and catalysts. There are different types of substrates S (mostly cyclic and bicyclic α-hydroxylactams, but also 3-hydroxyindolines) that are reacted with different types of nucleophiles Nu (indoles, pyrroles, Hantzsch esters, enols and enamides) using CPAs (phosphoric acids or the corresponding N-triflylphosphoramides and sulfonamides) as catalysts Cat.
All compounds have been labeled with a 5-element code Xyznn, where X = S for Substrates, X = Nu for Nucleophiles, and X = F for Catalysts; y is the structural family (a, b, c, …); z is the structural sub-family, if any (a, b, c, …); and nn is the ID number of the compound in the dataset. When the structural sub-family is missing, the label z is omitted. Then, a code was created to classify each reaction in the dataset into different reaction types based on the structure of the molecules involved. Thus, the values of the family label y of the Substrate, Nucleophile, and Catalyst were concatenated in this order to obtain the ID code of each reaction type. For example, the reaction of the Substrate S03aa with the Nucleophile Nua04 and the Catalyst Fab04 belongs to the reaction type with the ID code aaa. Scheme 3 shows selected examples of different reaction types included in the dataset using different types of cyclic hydroxylactams as substrates (S03, S04, S06) and different nucleophiles, such as indoles (Nua) [32, 33], enamides (Nuf) [34], or Hantzsch esters as reducing agents (Nuc) [35], with CPA catalysts (F). The full experimental details of each of the 324 reference reactions (substrate, nucleophile, catalyst, catalyst loading, product, solvent, temperature, time, yield, % ee) are included in the Supporting Information (Additional file 3), which also includes the SMILES codes of the substrate, nucleophile, and catalyst in each case. To give a general view of the chemical space in the dataset, general schemes for all reactions included in the reference dataset are provided in the Supporting Information (Additional file 1: Schemes S1 to S9). The structures and codification of substrates (S), nucleophiles (Nu), and catalysts (cat.) are included in the Supporting Information (Additional file 1).
Molecular descriptors calculation
First, the web tool MMDcalc was used to calculate the molecular descriptors Dk(msqi)g and Dk(msri)g of the molecules msqi and msri involved in the query and reference reactions [36]. The MMDcalc tool is an online web server available for public use at the CPTMLTool platform (https://cptmltool.rnasa-imedir.com/). This tool implements the Markov Chain Invariants for Networks Simulation and Design (MARCH-INSIDE) algorithm online. The MARCH-INSIDE algorithm uses Markov Chains to calculate the average values of different atomic properties. These average values are calculated for predefined groups of atoms (g) inside the molecule and all their neighbors placed at topological distance (d). In the notation Dk(msqi)g/Dk(msri)g, D = Descriptor, k = type of descriptor, s = sub-type of molecule, q = molecules involved in the query reaction, r = molecules involved in the reference reaction, i = ID number of the molecule, and g = group of atoms inside the molecule. The general formula for the calculation is shown in Eq. 1 (see MARCH-INSIDE algorithm details in the literature) [37].
The kth types (k = 1, 2, 3, 4, and 5) of molecular descriptors are: D1 = Number of Valence Electrons (Zv), D2 = van der Waals Volume (Vvdw), D3 = Sanderson Electronegativity (χ), D4 = Polarizability (α), and D5 = Electron Affinity (EA). The sub-types (s) of query molecules msqi (s = 1, 2, 3, 4, and 5) are: m1qi = Substrateqi, m2qi = Nucleophileqi, m3qi = Catalystqi, m4qi = Solventqi, and m5qi = Productqi. The chemical functional groups or atom groups Gg (g = 1, 2, 3, 4, 5) are the following: G1 = Saturated Carbon atoms (Csat), G2 = Unsaturated Carbon atoms (Cuns), G3 = Heteroatoms (Het), G4 = Non-Halogen (X) Heteroatoms (HetNoX), and G5 = Total (Tot). The groups of atoms indicate which atoms in the molecules were used as the basis for calculating the different local (g < 5) and/or total (g = 5) molecular descriptors.
ML linear model
In this section, the Dk(msqi)g values were used to seek a linear ML model. It is worth mentioning that each entry line of the dataset denotes only one query reaction (Rqi). The enantiomeric excess eeR(%)calcqi of the query reaction (Rqi) was predicted using as inputs both the variables Vk(cqi), which depend on the experimental conditions, and the molecular descriptors Dk(msqi)g of the molecules involved in the reaction. With both sets of variables as inputs, we can seek a linear additive AI/ML model. The equality eeR(%)calcqi ≈ eeR(%)obsqi holds when the additive linear hypothesis is correct. The general additive form of the AI/ML model to be developed is the following.
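As a hedged illustration, an additive linear model of this kind could be written as below. The coefficient symbols e_0, a_k, and b_{k,s,g} and the summation limits are assumptions taken from the variable definitions given in the text (three condition variables, five descriptor types, five molecule sub-types, five atom groups); this is a sketch of the general shape, not a reproduction of the authors' equation.

```latex
% Sketch of a generic additive linear model (assumed form, not the published equation)
ee_R(\%)^{calc}_{qi} = e_0
  + \sum_{k=1}^{3} a_k \, V_k(c_{qi})
  + \sum_{k=1}^{5} \sum_{s=1}^{5} \sum_{g=1}^{5} b_{k,s,g} \, D_k(m_{sqi})_g
```

In practice only the terms retained by the variable selection step would appear, so most of the coefficients in the triple sum would be zero.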
PTML linear model
The PTML model is a well-known approach that can be used to predict the reactivity of a new case (reaction) by comparison with other known reactions. Our model provides eeR(%)calcqi as output. The value eeR(%)calcqi can be calculated for a query reaction (Rqi) because the observed enantiomeric excess eeR(%)rjobs = eeR(%)refj of a reaction (Rrj) used as reference is already known. For this reason, in the dataset used to train/validate the PTML model, each entry line represents a pair of reactions, specifically a query reaction compared to a reference reaction (Rqi vs. Rrj). The PTML linear model predicts eeR(%)calcqi starting from the experimental value eeR(%)refj of a reference reaction. It then adds the effects of the variations (perturbations) in structural, operational, or experimental conditions of the query with respect to the reference reaction. We use PT Operators (PTOs) to quantify these variations or perturbations. The PTOs are denoted ΔDk(msqi, msrj)g for structural variations and ΔVk(cqi, crj) for variations in the experimental reaction conditions. The formulas of the PTML models used in this section are shown in Eqs. 3 and 4:
In this work, the linear additive model uses as input a reference function eeR(%)robs and two sets of PTOs, represented by ΔV(cqi, crj) and ΔD(msqi, msrj)g. The reference function eeR(%)robs is equal to the observed value of enantiomeric excess ee(%) when the reference reaction used a catalyst with (R) configuration. We developed two types of PTO to seek the PTML linear model. The first type of PTO is defined as ΔVk(cqi, crj) = [Vk(cqi) – Vk(crj)]. It accounts for the perturbations/deviations in the values of the kth variables/conditions Vk(cqi) of the query reaction with respect to the values of the same variables Vk(crj) for the reference reaction. The second type of PTO is defined as ΔDk(msqi, msrj)g = [Dk(msqi) – Dk(msrj)]g. It accounts for the perturbations/deviations in the values of the molecular descriptors of the query molecules with respect to the reference molecules. As for the query reaction, the input variables Vk(crj) of the reference reaction quantify a kth property (k = 1, 2, 3) related to the general experimental conditions of the reaction (crj) and/or specific reactants: V1(crj) = T(°C) = Temperature, V2(crj) = t(h) = reaction time, and V3(crj) = L(%) = catalyst loading, for the reference reaction (Rrj). The input variables denoted as Dk(msrj)g are the molecular descriptors of type k for the molecules (msrj) of sub-type s involved in the reference reaction (Rrj). Analogously, the molecules taking part in the reference reaction are m1rj = Substraterj, m2rj = Nucleophilerj, m3rj = Catalystrj, and m4rj = Solventrj. In addition, we use the same kth types of molecular descriptors as for the query reaction: D1 = Number of Valence Electrons (Zv), D2 = van der Waals Volume (Vvdw), D3 = Sanderson Electronegativity (χ), D4 = Polarizability (α), and D5 = Electron Affinity (EA).
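The PTO definitions can be sketched in a few lines of code. Everything numerical here is hypothetical: the query and reference conditions, the coefficients, and the helper names `perturbation_operators` and `ptml_linear` are assumptions chosen for illustration, and only the three condition variables are shown (descriptor PTOs would be computed the same way).

```python
# Illustrative sketch only; coefficients and values are NOT those of the published model.

def perturbation_operators(query, reference):
    """PTO for each input variable: value in the query minus value in the reference."""
    return {name: query[name] - reference[name] for name in query}

def ptml_linear(ee_ref, ptos, coefficients, ref_coefficient=1.0, intercept=0.0):
    """Additive linear form: intercept + c_ref * ee_ref + sum(coef_k * PTO_k)."""
    perturbation = sum(coefficients[name] * delta for name, delta in ptos.items())
    return intercept + ref_coefficient * ee_ref + perturbation

# Hypothetical conditions: T in deg C, t in h, L in mol%
query = {"T": -20.0, "t": 24.0, "L": 10.0}
reference = {"T": 25.0, "t": 5.0, "L": 20.0}

ptos = perturbation_operators(query, reference)   # {'T': -45.0, 't': 19.0, 'L': -10.0}
coefficients = {"T": 0.1, "t": -0.5, "L": 0.2}    # hypothetical values
ee_calc = ptml_linear(54.0, ptos, coefficients)   # reference ee of 54% is also hypothetical
print(ptos, round(ee_calc, 2))
```

The design point is that the prediction starts from a known reference value and is corrected by the differences between the two reactions, so a query identical to its reference returns the reference value when the intercept is zero and the reference coefficient is one.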
Table 1 gives detailed information on all the PTOs used as input variables in the PTML models.
AI/ML vs. PTML linear model development
To seek the AI/ML and PTML linear models, we applied Multivariate Linear Regression (MLR) and Linear Neural Network (LNN) algorithms using the software STATISTICA [38]. In the PTML regression models, the observed (experimental) values of enantiomeric excess eeR(%)obsqi have to be fitted against multiple reference values eeR(%)refj. This kind of regression may generate artifacts in the standard distribution of the data [39]. The parameters ak,s, bk,s,g, and e0 are the coefficients of the model to be fitted by the AI/ML algorithms. The PTML linear regression models were fitted in the form presented in Eq. 5:
HPTML linear model
The PTML linear model can give different outputs for the same query reaction depending on the reference reactions selected. Therefore, in this section we introduce different Heuristics (H) to define the best reference reaction or set of reference reactions. Specifically, we used the following two heuristics. The first heuristic (H1) calculates the final predicted value as eeR(%)qrpred = eeR(%)qrmin. This value is obtained using as reference the reaction with the minimum (Min) value of the PTOs, in other words, the minimal deviation. Specifically, H1 uses as reference the reaction with the minimal difference/deviation (Δ) in the input variables ΔD(mqsi, mrsj) and ΔV(cqi, crj) over all (∀) pairs of reactions. The second heuristic (H2) calculates the value eeR(%)qrpred = eeR(%)qravg = Avg(eeR(%)qrcalc). H2 uses the values of the variables ΔD(mqsi, mrsj) (molecular structural variations) and ΔV(cqi, crj) (experimental condition variations) for all (∀) pairs of reactions. First, we calculated the 331 different eeR(%)qrcalc values, one for each reference reaction, excluding the query itself. Then, we obtained the final value as the average over all the references. These two heuristics are described in Eqs. 6 and 7.
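A minimal sketch of the two heuristics follows. It assumes that each candidate already carries a prediction eeR(%)qrcalc obtained with one reference reaction, and that the "deviation" used by H1 is the sum of the absolute PTO values; the source does not state how the deviation is aggregated, so that choice, the function names, and the numbers are all assumptions.

```python
# Illustrative sketch only; the deviation measure and all values are assumptions.

def h1_min_deviation(candidates):
    """H1: keep the prediction made with the reference closest to the query."""
    closest = min(candidates, key=lambda c: sum(abs(d) for d in c["ptos"]))
    return closest["ee_calc"]

def h2_average(candidates):
    """H2: average the predictions made with every reference reaction."""
    return sum(c["ee_calc"] for c in candidates) / len(candidates)

# One entry per reference reaction: PTO values and the resulting prediction
candidates = [
    {"ptos": [-45.0, 19.0, -10.0], "ee_calc": 38.0},
    {"ptos": [5.0, 1.0, 0.0], "ee_calc": 60.0},
    {"ptos": [20.0, -3.0, 5.0], "ee_calc": 52.0},
]

print(h1_min_deviation(candidates))  # 60.0, from the reference with total deviation 6.0
print(h2_average(candidates))        # 50.0, the mean of 38.0, 60.0 and 52.0
```

An unscaled sum mixes variables with different units, so a real implementation would probably standardize the PTOs before comparing references.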
Monte Carlo simulation
Most reactivity prediction models reported so far take into consideration only the structure of the reactants and omit the values of temperature, catalyst loading, reaction time, solvent polarity, etc. when predicting the enantiomeric excess of the reactions. In fact, many works focus only on yield under specific conditions of T, time, load, etc., and do not predict the enantiomeric excess. In addition, the experimentally measured values of enantiomeric excess, T, time, load, solvent polarity, etc. contain a certain degree of error, because most researchers do not measure them in triplicate or leave them uncontrolled, as when room temperature conditions are used. In this context, the Monte Carlo (MC) simulation starts from the original values of the non-structural variables T, t, and Load and, using a random generator, creates new values with small variations with respect to the original ones. MC experiments are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. These methods are among the most useful data sampling techniques in Cheminformatics [40,41,42].
In this work, we used an MC algorithm to predict the enantiomeric excess of the reactions taking into consideration all these factors, which are of major relevance for optimizing the reaction in the laboratory. To demonstrate the robustness of the model, we generated a new set of reactions with “perturbations” in the values of T (°C), t (h), Load (%), etc. and retrained the models. The values of T, t, and Load were changed randomly, but within the limits set by the minimum Vk(cqi)min and maximum Vk(cqi)max reported for this type of reaction. These synthetic data allow us to test the robustness of the PTML model, that is, its ability to continue giving good predictions despite changes/errors in the reported temperature, time, etc. In addition, the values of the minimum Vk(cqi)min, maximum Vk(cqi)max, and step Vk(cqi)step for all the operational conditions were calculated (Table 2). Afterwards, we used an MC model based on the following system of equations to create the new synthetic data.
First, Eqs. 8 and 9 were applied to generate new values Vk(cqi)new starting from the original minimum value Vk(cqi)min (Eq. 8). Then, with Eq. 9, we obtained the new synthetic value Vk(cqi)synth after introducing a boundary condition. This boundary condition is set taking into consideration the conditions of α-amidoalkylation reactions; it keeps the synthetic values Vk(cqi)synth within the range [Vk(cqi)min, Vk(cqi)max]. The synthetic values were created for the experimental condition variables V1(cqi) = T(°C), V2(cqi) = t(h), and V3(cqi) = L(%). That is, the new synthetic values are equal to Vk(cqi)synth = Vk(cqi)min + Rnd(0, Nmax)·Vk(cqi)step if and only if (iff) this value is lower than Vk(cqi)max; otherwise, they are equal to Vk(cqi)max. The function Rnd(0, Nmax) is a generator of pseudo-random natural numbers (n = 0, 1, 2, …, Nmax) based on the Mersenne Twister algorithm (MT19937). The same system of equations was used to generate new synthetic data for the input variables Vk(crj) of the reference reactions.
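The sampling rule described in this paragraph can be sketched as follows. Python's `random` module is used because its core generator is the Mersenne Twister; the function name, the seed, and the (min, max, step, Nmax) values are hypothetical placeholders and are not the values of Table 2.

```python
import random

# Illustrative sketch only; limits and n_max below are placeholders, not Table 2 values.

def synthetic_value(v_min, v_max, v_step, n_max, rng):
    """Draw v_min + n * v_step with n in {0, ..., n_max}, capped at v_max."""
    v_new = v_min + rng.randint(0, n_max) * v_step   # random multiple of the step
    return v_new if v_new < v_max else v_max         # boundary condition

rng = random.Random(2024)  # seeded Mersenne Twister generator for reproducibility

# Hypothetical limits per variable: (min, max, step, n_max)
limits = {"T": (-80, 64, 10, 15), "t": (1, 240, 1, 239), "L": (2, 30, 1, 28)}

synthetic_conditions = {
    name: synthetic_value(v_min, v_max, v_step, n_max, rng)
    for name, (v_min, v_max, v_step, n_max) in limits.items()
}
print(synthetic_conditions)
```

Capping at the maximum, rather than redrawing, means that the upper limit is sampled slightly more often whenever Nmax steps would overshoot it.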
As mentioned above, we have only made small random changes to the original values of the input variables t, T, and catalyst loading. Consequently, for the new synthetic data cases generated by MC, we assumed that the deviations (perturbations) of the new values of the input variables from the original ones are small enough to cause only undetectable/non-measurable changes in the output values of eeR(%). This assumption is based on practical empirical evidence, which seems to confirm that new reactions/repetitions carried out with small changes of a few degrees of temperature, minutes of reaction time, or catalyst loading will not alter the value of eeR(%) by a measurable amount. In fact, in Eq. 8 the new synthetic value is equal to the minimum value in the whole dataset plus the value of the step multiplied by a random number taking the values 0, 1, 2, …, Nmax.
Experimental methods
We describe here the typical procedure for the enantioselective intermolecular α-amidoalkylation reaction leading to the synthesis of (+)-2e (see Table 8, entry 8). For full experimental details and characterization data for compounds 2a-d, see the Supporting Information (file SI00.pdf).
(+)-(R)-2,3-dimethoxy-12b-(1H-pyrrol-2-yl)-5,12b-dihydroisoindolo[1,2-a]isoquinolin-8(6H)-one (2e). A solution of 12b-hydroxyisoindoloisoquinoline 1 (60 mg, 0.19 mmol), pyrrole 3e (0.014 mL, 0.19 mmol), and N-triflylphosphoramide 4a (28 mg, 0.038 mmol, 20 mol%) in dry THF (5 mL) was stirred for 5 h at room temperature. The solvent was evaporated under reduced pressure, and the crude reaction mixture was purified by flash column chromatography (alumina, Hexane/EtOAc 3:7) to afford isoindolo[1,2-a]isoquinoline 2e (68 mg, quant.); [α]D20 = + 40.3 (c = 0.28; CH2Cl2). The enantiomeric excess was determined by HPLC to be 54% [Chiralcel OD, 15% Hexane/2-propanol, 1 mL/min, tR (S) = 23.2 min (22.87%), tR (R) = 29.4 min (77.13%)]. m.p. (Hexane/EtOAc): 254–256 °C; IR (Film): 3188 (NH) cm−1, 1672 (CO) cm−1; 1H NMR (300 MHz, CDCl3): δ 2.70–2.76 (m, 1H), 3.06 (ddd, J = 17.3, 11.1, 6.5 Hz, 1H), 3.23 (ddd, J = 12.6, 11.1, 4.8 Hz, 1H), 3.85 (s, 3H), 3.87 (s, 3H), 4.26 (ddd, J = 12.6, 6.5, 2.2 Hz, 1H), 5.86–5.88 (m, 1H), 6.08 (dd, J = 5.8, 2.7 Hz, 1H), 6.62 (s, 1H), 6.74 (td, J = 2.7, 1.5 Hz, 1H), 7.23 (s, 1H), 7.44 (t, J = 7.5 Hz, 1H), 7.58 (t, J = 7.5 Hz, 1H), 7.70–7.72 (m, 2H), 8.70 (s, 1H) ppm; 13C{1H} NMR (75.5 MHz, CDCl3): δ 28.7, 35.2, 55.9, 56.2, 65.7, 108.1, 110.5, 110.8, 111.7, 119.0, 123.7, 123.9, 127.1, 127.9, 128.8, 131.5, 132.1, 147.1, 148.6, 148.9, 167.2 ppm; MS (CI) m/z (%): 361 (100) [MH]+, 360 (50) [M]+, 294 (37), 293 (33); HRMS (CI): calcd. for C22H21N2O3 [MH]+: 361.1552; found: 361.1556.
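The reported ee can be cross-checked from the HPLC area percentages quoted in the procedure, using the standard definition of enantiomeric excess as the difference between the two enantiomer fractions divided by their sum. The function name is an assumption made for this short check.

```python
# Illustrative check only; the function name is an assumption.

def enantiomeric_excess(area_major, area_minor):
    """ee (%) from the peak areas (or area percentages) of the two enantiomers."""
    return abs(area_major - area_minor) / (area_major + area_minor) * 100.0

# Area percentages quoted for 2e: (R) = 77.13 %, (S) = 22.87 %
ee_2e = enantiomeric_excess(77.13, 22.87)
print(round(ee_2e, 2))  # about 54.26, consistent with the reported 54% ee
```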
Results and discussion
CPA catalyzed α-amidoalkylation reactions chemical space
As stated above, the chemical space of α-amidoalkylation reactions is very wide. In this work, the dataset comprises 332 reactions, which involve 55 different substrates (cyclic and bicyclic hydroxylactams), 53 nucleophiles (enamides, indoles, etc.), 39 chiral catalysts (phosphoric acids, phosphoramides, etc.), and 17 different solvents, run under multiple experimental conditions (see Supporting Information, file SI00.pdf for structures and reaction schemes; see Additional file 3 for full details of each reference reaction, including reaction conditions, yield, enantiomeric excess, and SMILES codes for reactants and catalysts in each case). The number of possible combinations of substrates, catalysts, and reaction conditions to be explored is too large to be covered by trial-and-error experiments. To illustrate the number of possible combinations: if reactions are run independently by changing one reactant at a time, a total of Ncomb = N(Subsqi)·N(Nucqi)·N(Catqi)·N(Solvqi) = 55·53·39·17 = 1,932,645 unique combinations of molecule sub-types would have to be run. This could be a new source of interesting products [changes in N(Subsqi) or N(Nucqi)] or a way to improve the reaction efficiency [changes in N(Catqi) or N(Solvqi)]. This estimate considers only the combinations of different molecular entities. Unfortunately, the vast majority of these reactions remain unexplored because of the high cost in time and resources.
On the other hand, there are also important variations in the three main experimental condition variables Vk(cqi) [T(°C), t(h), and L(%)]. Table 2 shows different statistical parameters of these variables for the reported reactions. The integer values for maximum (Tmax, tmax, and Lmax), minimum (Tmin, tmin, and Lmin), and step (Tstep, tstep, and Lstep) are included. This is important because the expression Range[Vk(cqi)] = Vk(cqi)max – Vk(cqi)min gives the range of each variable that can be covered in actual laboratory practice. Consequently, when this range is divided by the minimum change we decided to make in practice, Step[Vk(cqi)], we obtain the number of experiments N(cqi) = Range[Vk(cqi)]/Step[Vk(cqi)] that can be run to explore this variable. When reactions are run independently by changing one experimental condition at a time, a total of Nexp experiments must be run. This is equal to Nexp = N(c1)·N(c2)·N(c3) = N(T)·N(t)·N(L) = [Range(T)/Step(T)]·[Range(t)/Step(t)]·[Range(L)/Step(L)] = [144/10]·[239/1]·[28/1] = 96,365 optimization experiments for each unique combination of molecule sub-types giving a specific Productqi of the reaction Rqi (Table 2). Multiplying both estimates gives the very large number of reactions accessible in this chemical space, N(Rqi)max = Ncomb·Nexp ≈ 10¹¹. The equations used to calculate the number of reactions in this chemical space are shown below (Eq. 10) [39]:
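The arithmetic behind these estimates can be reproduced with a short script. The counts, ranges, and steps are the ones quoted in the text; the variable names are assumptions.

```python
from math import prod

# Numbers quoted in the text; dictionary and variable names are illustrative.
molecule_counts = {"substrates": 55, "nucleophiles": 53, "catalysts": 39, "solvents": 17}
range_and_step = {"T": (144, 10), "t": (239, 1), "L": (28, 1)}  # (range, step)

n_comb = prod(molecule_counts.values())                                 # molecule combinations
n_exp = prod(span / step for span, step in range_and_step.values())     # condition settings
n_total = n_comb * n_exp                                                # size of the reaction space

print(n_comb)          # 1932645
print(round(n_exp))    # 96365
print(f"{n_total:.2e}")
```

The product is close to 1.9 × 10¹¹, which is the order of magnitude quoted for N(Rqi)max.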
The previous calculation gives an idea of the size of the chemical reaction space for enantioselective CPA-catalyzed intermolecular α-amidoalkylation reactions. It is unfeasible to study all possible combinations in the laboratory because of the time and cost in material and human resources. In daily practice, chemists can use expert criteria and experimental design techniques to reduce the number of combinations to be tested, to narrow the range of the different experimental condition variables, etc. This can help researchers to reduce significantly the number of reactions to be performed in practice. However, by relying only on these well-known expert criteria, researchers may never test some interesting products. Therefore, the main objective of this project was the development of a new user-friendly predictive regression model for these reactions. This predictive model may become a useful tool to reduce the time and cost of experimentation.
ML linear model for α-amidoalkylation reactions
In α-amidoalkylation reactions, there is no clear relationship between the chirality of the catalyst and the CIP notation of the product. In fact, in our literature dataset one can note the following Catalyst/Product chirality relationships (count and ratio): (R)/(R), 140 reactions (43.2%); (S)/(R), 102 reactions (31.5%); (R)/(S), 72 reactions (22.2%); and (S)/(S), 9 reactions (2.8%), out of 324 reactions. There is only one reaction in the entire dataset with an (S)-configuration catalyst and an enantiomeric excess equal to zero. Therefore, it is very important to have a computational model to predict the absolute stereochemistry and the enantiomeric excess of the reaction product. This type of model could be a useful tool for designing new catalysts and/or selecting the optimal reaction conditions a priori. In this work, we decided to tackle this problem using AI/ML techniques. We first trained a classic linear ML model using only the Original Data (OD) from the reactions. The equation of this model is shown in Eq. 11:
This ML model does not use reference reactions for comparison. The statistical parameters of the model are n = 332, regression coefficient R = 0.74, Fisher ratio F = 59.2, Standard Error of Estimates SEE = 37.1, and p-level p < 0.05. More detailed information about the coefficients and variables of the model, including symbols and names of variables, Standard Error (SE), Student's t values, and p-level, is given in Table 3. The model explains 55% of the variance (R2 = 0.55), which is an acceptable, although modest, percentage for organic synthesis reactions. In addition, the SEE = 37.1 can be considered relatively high [39]. On the other hand, an essential shortcoming of this classic linear ML model is that it does not provide any evidence about the most similar reactions reported in the scientific literature. Consequently, this may limit our ability to deduce possible mechanisms and/or compare our results with others already known. Therefore, this ML model needs to be used along with a separate search strategy for similar molecules to obtain clues about reactions similar to the specific reaction under study. One option is to couple this model with similarity search strategies based on Tanimoto similarity indices [43]. In fact, there are interesting works that report the coupling of Cheminformatics models with similarity-based search strategies [44,45,46]. A well-known example of an online search tool is the SciFinder platform [47, 48].
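As an illustration of the similarity search mentioned above, the Tanimoto index between two binary fingerprints is the number of shared features divided by the number of features present in either. The following is a minimal sketch, assuming each molecule is already encoded as a set of "on" fingerprint bits; the function names and the example fingerprints are hypothetical and are not taken from any of the cited tools.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def most_similar(query: set[int], library: dict[str, set[int]]) -> tuple[str, float]:
    """Return (name, similarity) of the library entry closest to the query."""
    name = max(library, key=lambda k: tanimoto(query, library[k]))
    return name, tanimoto(query, library[name])


# Hypothetical fingerprints: each integer is the index of a structural feature.
library = {
    "ref_A": {1, 2, 3, 4},
    "ref_B": {2, 3, 7},
    "ref_C": {8, 9},
}
query = {1, 2, 3, 5}
best_name, best_score = most_similar(query, library)
```

A search of this kind returns similar molecules but no predicted enantiomeric excess, which is why it would have to be run alongside the classic ML model rather than replace it.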
PTML model for α-amidoalkylation reactions
As mentioned in the previous section, we have reported a PTML model for α-amidoalkylation reactions, although it is difficult to use in practice and is not implemented on a publicly available online web server. Unfortunately, the software used to calculate the input variables of that model is not available as open-source code, which hinders its implementation on a public online server. Consequently, we decided to develop a new linear PTML model using our own library to calculate the molecular descriptors. PTML reactivity models can study reactions pair-wise [39]. The model infers the reactivity of a query reaction (q) by comparing it to a previously known reference reaction (r). Some PTML models use different Heuristics (H) to match q and r reactions. These models can be called HPTML models. Figure 1 illustrates the general workflow followed during this work to seek the new HPTML models. In step 1, the reference dataset and the reaction pairs q vs. r were created. In step 2, the SMILES codes of the molecules (mqsi, mrsj) involved in both q and r reactions (substrates, nucleophiles, catalysts, solvents, products) were entered in the MCDCalc server [49] to calculate their molecular descriptors Dk(mqsi)g and Dk(mrsj)g. In step 3, the PTOs for the pairs of reactions were calculated. In step 4, the Multivariate Linear Regression (MLR) algorithm implemented in the STATISTICA [38] software was used to seek the PTML model. In step 5, heuristics H1 and H2 were tested iteratively. In step 6, the best HPTML model was selected. Finally, in step 7, this model was implemented on a public web server (see the following sections). The best linear HPTML model found is shown in Eq. 12.
The HPTML model was trained with a total of ntrain = 78,732 randomly selected reaction pairs. The statistical parameters obtained for this model in the training series are a regression coefficient Rtrain = 0.84, a Standard Error of Estimates SEE = 51.67, and a Fisher's ratio F = 15,238.7 with a p-level < 0.05. This points to a significant relationship between the observed relative values ∆eeR(%)qrobs and the predicted values ∆eeR(%)qrcalc.
In addition, another subset of nval = 28,836 reaction pairs was used to validate the model. A regression coefficient Rval = 0.77 and SEE = 60.225 were found for this validation series. The output of the model is eeR(%)qrcalc. This variable represents the enantiomeric excess value calculated using a single reference reaction. The eeR(%)calc value quantifies the enantiomeric excess obtained using an (R)-catalyst. If eeR(%)calc > 0, the product is predicted to have (R) notation; if eeR(%)calc < 0, the product is predicted to have (S) notation; if eeR(%)calc = 0, a racemic mixture is predicted. The overall p-level of the model is p < 0.05. All the variables introduced in the model are statistically significant (Table 4). The first three input variables quantify the effect of non-structural factors on the enantioselectivity parameter, eeR(%)calc. The remaining input variables quantify the contribution of structural variations in the Substrate (Sub), Catalyst (Cat), Product (Prod), Nucleophile (Nuc), and Solvent (Solv).
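The sign convention described above can be expressed as a small helper. This is only a sketch of the stated rule; the function name and the optional tolerance parameter are hypothetical additions for illustration and are not part of the published model.

```python
def predicted_configuration(ee_r_calc: float, tol: float = 0.0) -> str:
    """Map a signed eeR(%) value, (R)-catalyst convention, to a predicted CIP label.

    tol is a hypothetical dead band around zero; with the default of 0.0 the
    rule is exactly: positive -> (R), negative -> (S), zero -> racemic.
    """
    if ee_r_calc > tol:
        return "R"
    if ee_r_calc < -tol:
        return "S"
    return "racemic"


# Illustrative values only
labels = [predicted_configuration(v) for v in (93.0, -45.5, 0.0)]
```

The magnitude of the value, abs(ee_r_calc), is then read as the predicted enantiomeric excess of the major enantiomer.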
PTML calculations with a single reference reaction
As explained above, this PTML reactivity model studies reactions pair-wise. To avoid distortions in the distribution of the variables, the PTML model uses the variable ∆eeR(%)qrobs as objective function (see Eq. 5) [39]. This objective function is the function to fit and is defined as ∆eeR(%)qrobs = eeR(%)qobs − eeR(%)robs. As a result, the output of the new model is ∆eeR(%)qrcalc = eeR(%)qcalc − eeR(%)rcalc. For inaccurate models, ∆eeR(%)qrcalc ≠ ∆eeR(%)qrobs (where ≠ means not approximately equal). Conversely, for an accurate, non-random predictor like this one, one can approximate ∆eeR(%)qrcalc ≈ ∆eeR(%)qrobs. This presupposes that eeR(%)qcalc ≈ eeR(%)qobs and eeR(%)rcalc ≈ eeR(%)robs. Therefore, for practical purposes, we use the model to predict the enantiomeric excess of new query reactions, eeR(%)qcalc, based on the observed enantiomeric excess of a reference reaction, eeR(%)robs. The approximation is only valid for accurate, non-random predictors and relies on the fact that the reference reaction is always a known reaction, so that eeR(%)rcalc can be replaced by eeR(%)robs. It is therefore necessary to rearrange the variables in Eq. 5 as shown in Eq. 13.
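The rearrangement described in this paragraph can be written out step by step. The block below is a sketch derived only from the definitions given in this section; it does not reproduce the model terms (coefficients and perturbation operators) that make up the calculated difference in the full model equation.

```latex
\begin{aligned}
\Delta ee_R(\%)_{qr}^{calc} &= ee_R(\%)_{q}^{calc} - ee_R(\%)_{r}^{calc}
  && \text{(output of the model)} \\
ee_R(\%)_{r}^{calc} &\approx ee_R(\%)_{r}^{obs}
  && \text{(the reference reaction is known)} \\
\Rightarrow\quad ee_R(\%)_{q}^{calc} &\approx ee_R(\%)_{r}^{obs} + \Delta ee_R(\%)_{qr}^{calc}
\end{aligned}
```

In words, the calculated enantiomeric excess of the query reaction is the observed value of the reference reaction plus the difference calculated by the model for that pair.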
As a result of this approach, the model calculates different values of eeR(%)calcqi for the same reaction depending on the experimental value eeR(%)refj of the reaction used as reference in the pair [39]. Figure 2 illustrates the observed values of ∆eeR(%)qrobs vs. the predicted (calculated) values of ∆eeR(%)qrcalc for 10,000 selected reaction pairs. We depict only 10,000 pairs because this is the maximum number of points allowed by the plotting software. A certain linear trend is observed (points with ∆eeR(%)qrcalc ≈ ∆eeR(%)qrobs). However, despite the adequate goodness of fit of the predictor, there are many points with higher dispersion (points with ∆eeR(%)qrcalc ≠ ∆eeR(%)qrobs).
In fact, PTML models may be included in a broader class of learning problems, such as delta ML, transfer ML, template selection ML, etc. [50,51,52,53]. In general, these models involve the use of a query item (the item to be predicted) compared to a reference item (template, pair, known case, item from a related domain, etc.). To calculate the output of a query item (quantum field, drug, protein, or reaction in this case), it is necessary to use an already known item or population of reference items as input. Query items can be in the same data domain as the reference item or in a different one. In this context, the low population (low number of available cases) of some of the data subsets (data domains) studied is also a common problem. In our case, to calculate the value of eeR(%)calcqi for a query reaction (q), the observed eeR(%)refj value of an already known reference reaction (r) must be used as input. Here, both the query and reference items come from the same data domain (both are the same type of reaction). The reference reaction can be selected from our reaction dataset (same data domain) [54]. Consequently, for a new query reaction, there are n = 332 reactions in the dataset that can be used as the reference reaction, which raises the question of which is the best candidate (or candidates) to be used as reference in each case (see next section). Thus, 332 different values of eeR(%)calcqi can be calculated for the same query reaction, depending on the reference reaction selected for the pair. In this step, heuristic rules can be used to approximate the final predicted value eeR(%)qpred from the eeR(%)calc values of the model, as we have demonstrated previously for a similar problem [39].
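The fact that one query reaction yields one calculated value per reference can be illustrated numerically. The sketch below assumes the model has already produced a calculated difference for each query/reference pair; the function name and all numbers are hypothetical.

```python
def calc_per_reference(delta_calc: list[float], ee_ref_obs: list[float]) -> list[float]:
    """One calculated ee value for the query per reference reaction j:
    ee_q_calc[j] = ee_ref_obs[j] + delta_calc[j]."""
    if len(delta_calc) != len(ee_ref_obs):
        raise ValueError("need one calculated difference per reference reaction")
    return [ref + delta for ref, delta in zip(ee_ref_obs, delta_calc)]


# Hypothetical example with three reference reactions instead of 332
ee_ref_obs = [90.0, -40.0, 10.0]   # observed ee of each reference, in %
delta_calc = [-5.0, 120.0, 70.0]   # model output for each query/reference pair
values = calc_per_reference(delta_calc, ee_ref_obs)
spread = max(values) - min(values)  # how much the answer depends on the reference
```

With an imperfect model the values differ from one reference to another (here by 5 percentage points), which is why a rule for choosing or combining references is needed.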
HPTML model for prediction with multiple reactions of reference
As mentioned above, it is necessary to define the best reaction or set of reactions to use as reference. Defining an appropriate reference reaction can also help to reduce the dispersion and increase the value of the regression coefficient, because each query reaction will then have a single predicted value. For this purpose, a Heuristic rule coupled to the PTML model can be used to select the best reference. Heuristic-based methods have been widely used in Cheminformatics to solve practical problems [55,56,57]. In our case, the combination of the PTML model with a Heuristic (H) rule gives rise to the HPTML (H + PTML) algorithm. Two Heuristics (H1 and H2) were tested by calculating the eeR(%)qrpred values for the 332 reactions in our dataset, using the PTML model trained with the OD set. These HPTML models based on Heuristics H1 and H2 were compared with a classic ML model. This classic ML model includes no PT terms and was built without using Heuristics (H0). Figure 3 shows a schematic illustration of the ML, PTML, and HPTML data re-arrangement, as well as the MC data enrichment procedures used here.
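To make the idea of a reference-selection rule concrete, the sketch below picks the reference reaction closest to the query in a feature space and returns the value calculated with that reference. This nearest-neighbour rule is a hypothetical stand-in chosen for illustration; it is not claimed to be Heuristic H1 or H2, whose definitions are not given in this section. The feature vectors and values are invented.

```python
import math


def nearest_reference(query_vec: list[float], ref_vecs: list[list[float]]) -> int:
    """Index of the reference whose feature vector is closest (Euclidean) to the query."""
    return min(range(len(ref_vecs)), key=lambda j: math.dist(query_vec, ref_vecs[j]))


def predict_with_nearest(query_vec, ref_vecs, ee_calc_per_ref):
    """Return (reference index, ee value calculated with that reference)."""
    j = nearest_reference(query_vec, ref_vecs)
    return j, ee_calc_per_ref[j]


# Hypothetical features: temperature (°C), time (h), catalyst loading (mol%)
query_vec = [25.0, 24.0, 20.0]
ref_vecs = [
    [25.0, 24.0, 10.0],
    [-20.0, 48.0, 5.0],
    [25.0, 20.0, 20.0],
]
ee_calc_per_ref = [85.0, 80.0, 78.0]  # one calculated value per reference
best_ref, ee_pred = predict_with_nearest(query_vec, ref_vecs, ee_calc_per_ref)
```

In a real application the features would need to be scaled to comparable ranges and would include the molecular descriptors, since an unscaled Euclidean distance is dominated by the variable with the largest numeric range.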
Table 5 shows the statistical parameters for these studies (see only entries with Data = OD). Detailed information can be found in Table S1 of the Supporting Information (Additional file 2). It should be noted that both HPTML models using Heuristics give good results with the OD set, with a regression coefficient in the range R2 = 0.64–0.81 and p < 0.05. Specifically, the HPTML OD H1 model has a higher regression coefficient (R2 = 0.81 vs. 0.55) and a lower SEE (SEE = 29.5 vs. 37.1) than the classic ML model. However, this SEE value is still relatively high. Interestingly, MC data enrichment improved both the R2 (0.96) and the SEE (13.5) values of the HPTML OD H1 model. In addition, the HPTML model automatically provides the most similar reference reaction from the reference dataset, including the literature reference of the article, which might give some clues about the possible mechanism of the query reaction. In contrast, the classic ML model does not give information about the plausible reaction mechanism or similar reactions in the literature. Overall, these results justify the use of the HPTML algorithm instead of the classic ML algorithm.
Interestingly, the pair-wise strategy rapidly increases the number of cases, going from datasets with n items (reactions) to n × n items (pairs of reactions). In this case, we go from nreacc = 332 reactions to npairs = 107,626 pairs of reactions, which could be an advantage of the PTML model, since increasing the number of items used to train the ML model can improve learning. However, those items that are underrepresented in the original data are still underrepresented in the new data in relative terms. In addition, there is a risk of including mismatched pairs, that is, of trying to predict an underrepresented query item (reaction) using as reference an overrepresented item (reaction family) that is not similar to the query. For example, reactions from the aaa family are the most represented, with nreacc = 120 cases (36.14% of cases) and npairs = 37,570 (34.91%), including many pairs with reactions from the same family. In contrast, reactions from the dab family are very poorly represented (low abundance), with only nreacc = 3 cases (0.9% of cases) appearing in npairs = 995 pairs of reactions. Almost all of these pairs are formed with reactions from other families, and the relative abundance remains low (0.9%).
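The relative abundances quoted above can be checked with a few lines of arithmetic. The counts are those given in the text; the variable and function names are ours.

```python
n_reactions = 332
n_pairs = 107_626  # number of reaction pairs quoted in the text

# family -> (number of reactions, number of pairs containing that family)
families = {"aaa": (120, 37_570), "dab": (3, 995)}


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


abundance = {
    name: {"reactions_pct": pct(n_fam, n_reactions), "pairs_pct": pct(p_fam, n_pairs)}
    for name, (n_fam, p_fam) in families.items()
}
```

Pairing multiplies the absolute number of cases for every family, but the share of each family among the pairs stays close to its share among the reactions, so a rare family remains rare.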
Table 6 shows the absolute and relative abundance of the different reaction families (subsets) in the original dataset and the number of pairs formed with them. It should be noted that the formation of pairs of mismatched reactions can lead to inaccurate predictions. For example, predicting a query reaction from the aab family may give an inaccurate result if we use a reaction from the haa family as reference, because aab reactions have an average enantiomeric excess <eeR(%)>qobs = 21.0 while haa reactions have <eeR(%)>qobs = −78.1. The two reaction families not only have markedly different average enantiomeric excesses, but also give products with opposite (R) or (S) CIP notation of absolute configuration [31]. The compound codes, SMILES codes, and chemical structures of the different families of substrates, nucleophiles, and catalysts are shown in the Supporting Information file SI00.pdf.
In this regard, synthetic data generation techniques can be used to mitigate the presence of sparsely populated data subsets. In any case, the total abundance of each enriched data subset should remain essentially constant to avoid creating data artifacts. MC sampling methods have been widely used in chemistry for similar purposes [58]. Here, we used a Mersenne-Twister MC algorithm (MT19937) [59] for data enrichment by creating new synthetic data. Synthetic cases of the input variables Vk(cqi) = T(°C)qi, t(h)qi, or L(%)qi of the query reactions were generated using this MC algorithm (see the system of equations in the Materials and Methods section). The same MC algorithm (system of equations) was used to generate new synthetic data for the input variables Vk(crj) of the reference reactions. In contrast, the molecular descriptors Dk(msqi) and Dk(msrj) were never modified in the MC data enrichment simulation, because one can reasonably expect that small changes in the input reaction condition variables [Vk(cqi) = T(°C), t(h), or L(%)] do not significantly change the output eeR(%), whereas the same cannot be guaranteed for changes in chemical structure. Thus, we obtained a slightly higher number of cases for the reactions with very low abundance. For example, we were able to add nmcpairs = 15, 20, or 40 new cases for the dab, aab, and eab families of reactions, respectively, while keeping their relative abundance low, in the range 0.9–2.47%. Table 5 shows that both models trained with the ODMC dataset (OD enriched by MC) give essentially the same value of R = 0.8–0.9 and p < 0.05 obtained with OD alone. However, the error decreased from SEE = 29.5% to SEE = 13.5% using Heuristic H1. Table 7 shows the correlation matrix for the outputs of all models, which illustrates the high correlation obtained among them, R = 0.80–0.99.
The results of eeR(%)qrobs observed vs. eeR(%)qrpred predicted with this HPTML model using the ODMC dataset and the H1 heuristic are depicted in Fig. 4, where each point corresponds to a reaction included in the dataset. Although an excellent correlation between the predicted and observed ee(%) values is generally obtained, some values are far from the correlation line. In selected cases, the corresponding reaction number from the database (see SI001.xls file) has been included. It is difficult to draw any conclusions from these cases, as the reactants used are structurally heterogeneous and the experimental conditions are diverse as well. In any case, the model already has a very high R2 = 0.96 value. We can conclude that using ODMC enriched data decreased the error of the model without decreasing the regression quality.
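The MC data enrichment described above can be sketched with Python's standard `random` module, whose generator is the Mersenne Twister (MT19937). The sketch only illustrates the principle stated in the text: the condition variables are perturbed slightly while the molecular descriptors and the output are left untouched. The perturbation widths, field names, and the uniform sampling scheme are assumptions; the actual system of equations is given in the Materials and Methods section and is not reproduced here.

```python
import random

# Hypothetical half-widths of the uniform perturbation for each condition variable
WIDTHS = {"T_C": 1.0, "t_h": 0.5, "L_pct": 0.5}


def enrich(reaction: dict, n_new: int, seed: int = 0) -> list[dict]:
    """Create n_new synthetic copies of a reaction with slightly perturbed conditions."""
    rng = random.Random(seed)  # Mersenne Twister generator, seeded for reproducibility
    synthetic = []
    for _ in range(n_new):
        new = dict(reaction)  # descriptors and ee value are carried over unchanged
        for key, width in WIDTHS.items():
            new[key] = reaction[key] + rng.uniform(-width, width)
        # reaction time and catalyst loading cannot be negative
        new["t_h"] = max(new["t_h"], 0.0)
        new["L_pct"] = max(new["L_pct"], 0.0)
        synthetic.append(new)
    return synthetic


# Hypothetical reaction from a sparsely populated family
base = {"T_C": 25.0, "t_h": 24.0, "L_pct": 20.0, "descriptors": (1.2, 3.4), "ee_R": 93.0}
new_cases = enrich(base, n_new=15)
```

Keeping the number of synthetic cases small for each family, as in the text, adds cases for rare families without changing the overall composition of the dataset appreciably.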
HPTML vs. Experimental study of new reactions
In this section, we report an additional test of the HPTML model comparing the computational predictions with the experimental study of new reactions. Thus, we performed both an experimental and a theoretical study of new intermolecular α-amidoalkylation reactions not previously reported in the literature. First, the α-amidoalkylation reactions carried out experimentally are described. Next, we report the use of the HPTML model to predict these reactions and compare the results with the experimental values.
Experimental study of α-amidoalkylation reactions
As stated above, the α-amidoalkylation reaction is a very attractive method for C–C bond formation in organic synthesis. In this context, we have previously reported [27] that the α-amidoalkylation reaction is an efficient procedure for the enantioselective synthesis of 12b-substituted isoindoloisoquinolines (Nuevamine-type alkaloids [60]) using BINOL-derived Brønsted acids as catalysts. It should be pointed out that these catalysts had been used in the intermolecular α-amidoalkylation of indoles with cyclic N-acyliminium ions formed in situ from cyclic hydroxylactams to form tertiary or quaternary stereogenic centers, but this was the first example of bicyclic N-acyliminium intermediates in intermolecular α-amidoalkylation reactions of indoles [30]. The best results were obtained using a sterically demanding CPA (20 mol% catalyst loading) in THF at room temperature for 24 h. However, in some cases, only moderate enantioselectivities (enantiomeric excess) and/or yields were obtained. Therefore, we decided to test BINOL-derived N-triflylphosphoramides as catalysts to enhance the enantioselectivity of these reactions, because they are known to be more acidic than the corresponding CPAs, so they can form tighter ion pairs, leading to improved reactivity [61, 62]. Thus, the N-triflylphosphoramides 4a-d were synthesized [63, 64] and tested as catalysts in the reaction of 12b-hydroxyisoindoloisoquinoline 1 with the indoles 3a-d (Scheme 4). Table 8 summarizes these new results compared with those previously obtained with phosphoric acid 5e, which had proven to be the most efficient catalyst for indole [30]. The best results were obtained with catalyst 4a, although good to excellent yields were achieved with all the phosphoramides. We were thus able to improve on our previous results obtained with the corresponding phosphoric acids, obtaining 2a with excellent yield and enantioselectivity (90%, 93% ee).
In addition, the intermolecular α-amidoalkylation reaction was extended to 5-substituted indoles 3b-d, obtaining excellent yields, even when a strong acceptor group (NO2) was introduced (Table 8, entry 5). However, the use of the substituted indoles led to lower enantiomeric excesses (Table 8, entries 5–7). The reaction could also be applied to other electron-rich heteroaromatics such as pyrrole 3e, obtaining 2e quantitatively, although with moderate ee (Table 8, entry 8). In this case, the reaction was cleaner and faster (reaction completed in 5 h) than when using phosphoric acid 5e as catalyst (Table 8, entries 13–15).
HPTML prediction of new α-amidoalkylation reactions
Next, using the developed HPTML ODMC H1 model, we predicted the values of eeR(%) for the new enantioselective intermolecular α-amidoalkylation reactions. We first calculated the molecular descriptors Dk(mqsi)g of all the molecules (Substrateqi, Nucleophileqi, Catalystqi, Solventqi, and Productqi) involved in the new query reactions (Rq) using the web server MCDCalc [38]. Then, the Heuristic H1 was used to find the best reference reaction for each new query reaction. Next, we substituted into the model equation the values of the molecular descriptors Dk(mqsi)g and Dk(mrsj)g of the molecules, as well as the values of the experimental condition variables Vk(cqi) and Vk(crj), from the query (Rq) and reference (Rr) reactions, respectively. Table 9 shows the predicted eeR(%) values for each reaction compared to the values predicted with the other Datasets (OD vs. ODMC) and Heuristics (H1 and H2).
The other HPTML models have notably larger residual values, confirming our decision to discard them as predictors for this type of reaction. In general, the best results are obtained with the HPTML ODMC H1 model. For 6 out of 8 reactions, the model almost perfectly predicts the observed values of eeR(%)qrobs, with residual values in the range eeR(%)qrres = − 1.1–1.9% (reactions 1, 2, 5–8) (Table 9). The experimental and predicted values for the preparation of 2a-e using catalyst 4a are represented in Scheme 5. For the other two reactions, the model correctly predicts the absolute stereochemistry of the final products, although with a relatively higher error. Together with the results of the training and validation series, these results validate the HPTML ODMC H1 model as a useful predictor for enantioselective intermolecular α-amidoalkylation reactions. Microsoft Excel was used to run all these calculations. However, this HPTML calculation procedure is slow because it is not automatic and needs more than one software application (MCDCalc, Excel) to run. Furthermore, in this form the model is not available for use by other groups and requires some degree of expertise in Cheminformatics, so we decided to implement it on a public web server.
MATEO web server
The HPTML model was implemented on a new public web server called MATEO: interMolecular Amidoalkylation Theoretical Enantioselectivity Optimization. The MATEO server is available for public use online (free of charge) at: https://cptmltool.rnasa-imedir.com/CPTMLTools-Web/mateo. The graphical interface of the web server is shown in Fig. 5. Users worldwide can upload their own sets of query reactions to predict the values of eeR(%)qrcalc under different experimental conditions (solvent, time, temperature, catalyst loading); see Table 10.
Figure 6 graphically illustrates (from bottom to top) the steps required to use this web server. Step 1 is to upload the chemical structures of all the molecules involved in the reaction. The server requires the structures to be uploaded in the Simplified Molecular Input Line Entry Specification (SMILES) code format [65]. SMILES has become a simple and memory-efficient way of managing molecular structures that is widely used in Cheminformatics today [66, 67]. These codes can be pasted directly into the web interface or uploaded as a text file. The server allows uploading large collections of reactions with different combinations of substrate, nucleophile, and catalyst. This could be useful for exploring large libraries of molecules (products, substrates, and nucleophiles) and/or for the design of new catalysts. The server also allows uploading of the solvent structure, making it easy to explore a large variety of solvents. In Step 2, three general types of calculations can be selected: (1) Similarity Search, (2) Structural Scan, or (3) Conditions Scan. Option (1) predicts the enantiomeric excess values and also provides a report of the most similar reactions from the references in our dataset. Option (2) allows uploading specific structures (substrate, nucleophile, catalyst, and/or solvent) and running a scan of these molecules under reaction conditions similar to those reported in the literature. Option (3) keeps the structural parameters constant (same molecules), while the software performs a scan of different combinations of input variables (temperature, time, catalyst loading). Table 10 shows the range (minimum, maximum) and step of the variables allowed by the server.
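The idea behind option (3), the Conditions Scan, is a grid of all combinations of temperature, time, and catalyst loading for a fixed set of molecules. The following sketch builds such a grid locally. The ranges and steps are hypothetical and are not the values of Table 10, and the code does not call or reproduce the MATEO server.

```python
from itertools import product


def frange(start: float, stop: float, step: float) -> list[float]:
    """Inclusive list of evenly spaced floats, built from an integer count to avoid drift."""
    n_steps = int(round((stop - start) / step))
    return [start + i * step for i in range(n_steps + 1)]


# Hypothetical (minimum, maximum, step) for each condition variable
temperatures = frange(-20.0, 40.0, 20.0)  # °C
times = frange(6.0, 24.0, 6.0)            # h
loadings = frange(5.0, 20.0, 5.0)         # catalyst loading, mol%

# One entry per combination of conditions; the molecules are kept constant
grid = [
    {"T_C": T, "t_h": t, "L_pct": L}
    for T, t, L in product(temperatures, times, loadings)
]
```

The size of the grid grows as the product of the number of levels of each variable (here 4 × 4 × 4 = 64 combinations), so the step sizes control how many predictions a scan requests.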
In this context, Goodman et al. have recently developed BINOPtimal, a rule-based web tool for the online selection of CPA catalysts in a related reaction, the addition of nucleophiles to imines, by analyzing the reagent structures [68]. The MATEO web server allows the user to make quantitative predictions of the enantiomeric excess parameter eeR(%) at different reaction temperatures, times, catalyst loadings, or solvent polarities, which are known factors that affect the enantioselectivity of α-amidoalkylation reactions. Therefore, the MATEO web server will be useful to guide not only catalyst selection but also the choice of experimental conditions.
Conclusions
In conclusion, we have shown that classic linear ML models are not very accurate in predicting the enantioselectivity of α-amidoalkylation reactions using physicochemical properties calculated with a Markov chain approach as input. Besides, these linear ML models do not allow the most similar reaction to be detected directly from the model. The PTML algorithm outperforms the classic linear ML model using the same dataset and molecular descriptors. Moreover, the HPTML algorithm, based on a PTML model plus a heuristic rule, allows direct detection of the most similar reference reactions. In addition, MC synthetic data re-sampling/enrichment procedures reduce the error of the model. The final HPTML model performs very well in computational experiments with validation series. The HPTML model also reproduces very well the experimental values of a new series of reactions studied experimentally for the first time in this work. Finally, the implementation of the HPTML model on the MATEO online server makes the algorithm available for public use worldwide with a user-friendly interface.
Availability of data and materials
The MATEO web server was implemented for public use by experimental organic chemists; see link: https://cptmltool.rnasa-imedir.com/CPTMLTools-Web/mateo. The code of the software was uploaded to a GitHub repository and is freely available to cheminformatics researchers under the MIT license. The links are the following. For the MATEO server code the link is: https://github.com/glezdiazh/MATEO. For the libraries used to calculate the molecular descriptors the link is: https://github.com/muntisa/RMarkovTI. All data files (SI00, SI01, and SI02) have been uploaded to a public data repository and are available for use free of charge under the Creative Commons universal license (CC0). The links are, SI00.pdf file: https://doi.org/10.6084/m9.figshare.21981740.v2, Additional file 2: https://doi.org/10.6084/m9.figshare.21971690.v2, and Additional file 3: https://doi.org/10.6084/m9.figshare.21971696.v2.
Abbreviations
- AI: Artificial intelligence
- ANN: Artificial neural networks
- CPA: Chiral phosphoric acid
- GLR: General linear regression
- ML: Machine learning
- HPTML: Heuristic perturbation-theory and machine learning
- LNN: Linear neural network
- MARCH-INSIDE: Markov chain invariants for networks simulation and design
- MATEO: InterMolecular amidoalkylation theoretical enantioselectivity optimization
- MC: Monte Carlo
- MLR: Multivariate linear regression
- THF: Tetrahydrofuran
- OD: Original data
- PT: Perturbation theory
- PTO: Perturbation theory operator
- SE: Standard error
- SEE: Standard error of estimates
- SMILES: Simplified molecular input line entry specification
- eeR(%)obs: Observed enantiomeric excess (experimental) using (R)-Catalyst
- eeR(%)ref: Enantiomeric excess of reference (experimental) using (R)-Catalyst
- eeR(%)calc: Enantiomeric excess using (R)-Catalyst calculated using one reference
- eeR(%)pred: Enantiomeric excess using (R)-Catalyst predicted by the model
- eeR(%)res: Residual enantiomeric excess using (R)-Catalyst
References
Parmar D, Sugiono E, Raja S, Rueping M (2014) Complete field guide to asymmetric BINOL-phosphate derived Brønsted acid and metal catalysis: history and classification by mode of activation; Brønsted acidity, hydrogen bonding, ion pairing, and metal phosphates. Chem Rev 114:9047–9153
Parmar D, Sugiono E, Raja S, Rueping M (2017) Addition and correction to complete field guide to asymmetric BINOL-phosphate derived Brønsted acid and metal catalysis: History and classification by mode of activation; Brønsted acidity, hydrogen bonding, ion pairing, and metal phosphates. Chem Rev 117:10608–10620
Akiyama T (2012) Asymmetric C-C bond formation using chiral phosphoric acid. In: Christmann M, Bräse S (eds) Asymmetric Synthesis II: More Methods and Applications. Wiley, Weinheim, pp 261–266
Wu X, Gong LZ (2014) Chiral phosphoric acid-catalyzed asymmetric multicomponent reactions. In: Zhu J, Wang Q, Wang MX (eds) Multicomponent reactions in organic synthesis. Wiley, Weinheim, pp 439–470
Zhu L, Mohamed H, Yuan H, Zhang J (2019) The control effects of different scaffolds in chiral phosphoric acids: a case study of enantioselective asymmetric arylation. Catal Sci Technol 9:6482–6491
ElKerdawy A, Güssregen S, Matter H, Hennemann M, Clark T (2014) Quantum-mechanics-based molecular interaction fields for 3D-QSAR. J Cheminform 6:1–2
Spjuth O (2018) Novel applications of machine learning in cheminformatics. J Cheminform 10:1–2
Drakakis G, Koutsoukas A, Brewerton SC, Evans DD, Bender A (2013) Using machine learning techniques for rationalising phenotypic readouts from a rat sleeping model. J Cheminform 5:1–1
Ye Z, Ouyang D (2021) Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms. J Cheminform 13:1–13
Ruscher M, Herzog A, Timoshenko J, Jeon HS, Frandsen W, Kuhl S, Roldan Cuenya B (2022) Tracking heterogeneous structural motifs and the redox behaviour of copper-zinc nanocatalysts for the electrocatalytic CO(2) reduction using operando time resolved spectroscopy and machine learning. Catal Sci Technol 12:3028–3043
Takahashi K, Ohyama J, Nishimura S, Fujima J, Takahashi L, Uno T, Taniike T (2023) Catalysts informatics: paradigm shift towards data-driven catalyst design. Chem Commun 59:2222–2238
Sarma BB, Maurer F, Doronkin DE, Grunwaldt JD (2023) Design of single-atom catalysts and tracking their fate using operando and advanced X-ray spectroscopic tools. Chem Rev 123:379–444
Freeze JG, Kelly HR, Batista VS (2019) Search for catalysts by inverse design: artificial intelligence, mountain climbers, and alchemists. Chem Rev 119:6595–6612
Tsai CC, Sandford C, Wu T, Chen B, Sigman MS, Toste FD (2020) Enantioselective intramolecular allylic substitution via synergistic palladium/chiral phosphoric acid catalysis: insight into stereoinduction through statistical modeling. Angew Chem Int Ed Engl 59:14647–14655
Gensch T, Dos Passos GG, Friederich P, Peters E, Gaudin T, Pollice R, Jorner K, Nigam A, Lindner-D’Addario M, Sigman MS, Aspuru-Guzik A (2022) A comprehensive discovery platform for organophosphorus ligands for catalysis. J Am Chem Soc 144:1205–1217
Dieguez-Santana K, Gonzalez-Diaz H (2021) Towards machine learning discovery of dual antibacterial drug-nanoparticle systems. Nanoscale 13:17854–17870
Barbolla I, Hernandez-Suarez L, Quevedo-Tumailli V, Nocedo-Mena D, Arrasate S, Dea-Ayuela MA, Gonzalez-Diaz H, Sotomayor N, Lete E (2021) Palladium-mediated synthesis and biological evaluation of C-10b substituted dihydropyrrolo[1,2-b]isoquinolines as antileishmanial agents. Eur J Med Chem 220:113458
Ortega-Tenezaca B, Gonzalez-Diaz H (2021) IFPTML mapping of nanoparticle antibacterial activity vs. pathogen metabolic networks. Nanoscale 13:1318–1330
Sampaio-Dias IE, Rodriguez-Borges JE, Yanez-Perez V, Arrasate S, Llorente J, Brea JM, Bediaga H, Vina D, Loza MI, Caamano O, Garcia-Mera X, Gonzalez-Diaz H (2021) Synthesis, pharmacological, and biological evaluation of 2-furoyl-based MIF-1 peptidomimetics and the development of a general-purpose model for allosteric modulators (ALLOPTML). ACS Chem Neurosci 12:203–215
Santana R, Zuluaga R, Ganan P, Arrasate S, Onieva E, Gonzalez-Diaz H (2020) Predicting coated-nanoparticle drug release systems with perturbation-theory machine learning (PTML) models. Nanoscale 12:13471–13483
Santana R, Zuluaga R, Ganan P, Arrasate S, Onieva Caracuel E, Gonzalez-Diaz H (2020) PTML model of ChEMBL compounds assays for vitamin derivatives. ACS Comb Sci 22:129–141
Aranzamendi E, Arrasate S, Sotomayor N, Gonzalez-Diaz H, Lete E (2016) Chiral Brønsted acid-catalyzed enantioselective α-amidoalkylation reactions: a joint experimental and predictive study. ChemistryOpen 5:540–549
Yazici A, Pyne SG (2009) Intermolecular addition reactions of N-acyliminium ions (Part II). Synthesis 2009:513–541
Rahman A, Lin X (2018) Development and application of chiral spirocyclic phosphoric acids in asymmetric catalysis. Org Biomol Chem 16:4753–4777
Han B, He X-H, Liu Y-Q, He G, Peng C, Li J-L (2021) Asymmetric organocatalysis: an enabling technology for medicinal chemistry. Chem Soc Rev 50:1522–1586
Merad J, Lalli C, Bernadat G, Maury J, Masson G (2018) Enantioselective Brønsted acid catalysis as a tool for the synthesis of natural products and pharmaceuticals. Chem-Eur J 24:3925–3943
Aranzamendi E, Sotomayor N, Lete E (2012) Brønsted acid catalyzed enantioselective α-amidoalkylation in the synthesis of isoindoloisoquinolines. J Org Chem 77:2986–2991
Wheeler SE, Seguin TJ, Guan Y, Doney AC (2016) Noncovalent interactions in organocatalysis and the prospect of computational catalyst design. Accounts Chem Res 49:1061–1069
Peng Q, Duarte F, Paton RS (2016) Computing organic stereoselectivity–from concepts to quantitative calculations and predictions. Chem Soc Rev 45:6093–6107
Maji R, Mallojjala SC, Wheeler SE (2018) Chiral phosphoric acid catalysis: from numbers to insights. Chem Soc Rev 47:1142–1158
Helmchen G (2016) The 50th anniversary of the Cahn–Ingold–Prelog specification of molecular chirality. Angew Chem Int Ed 55:6798–6799
Yu X, Lu A, Wang Y, Wu G, Song H, Zhou Z, Tang C (2011) Chiral phosphoric acid catalyzed asymmetric Friedel–Crafts alkylation of indole with 3-hydroxyisoindolin-1-one: enantioselective synthesis of 3-indolyl-substituted isoindolin-1-ones. Eur J Org Chem 2011:892–897
Yu X, Wang Y, Wu G, Song H, Zhou Z, Tang C (2011) Organocatalyzed enantioselective synthesis of quaternary carbon-containing isoindolin-1-ones. Eur J Org Chem 2011:3060–3066
Guo C, Song J, Huang JZ, Chen PH, Luo SW, Gong LZ (2012) Core-structure-oriented asymmetric organocatalytic substitution of 3-hydroxyoxindoles: application in the enantioselective total synthesis of (+)-folicanthine. Angew Chem Int Ed 51:1046–1050
Yin Q, Wang S-G, You S-L (2013) Asymmetric synthesis of tetrahydro-β-carbolines via chiral phosphoric acid catalyzed transfer hydrogenation reaction. Org Lett 15:2688–2691
Carracedo-Reboredo P, Corona R, Martinez-Nunes M, Fernandez-Lozano C, Tsiliki G, Sarimveis H, Aranzamendi E, Arrasate S, Sotomayor N, Lete E, Munteanu CR, Gonzalez-Diaz H (2020) MCDCalc: Markov chain molecular descriptors calculator for medicinal chemistry. Curr Top Med Chem 20:305–317
Gonzalez-Diaz H, Duardo-Sanchez A, Ubeira FM, Prado-Prado F, Perez-Montoto LG, Concu R, Podda G, Shen B (2010) Review of MARCH-INSIDE & complex networks prediction of drugs: ADMET, anti-parasite activity, metabolizing enzymes and cardiotoxicity proteome biomarkers. Curr Drug Metab 11:379–406
Hill T, Lewicki P (2006) Statistics: methods and applications: a comprehensive reference for science, industry, and data mining. StatSoft Inc., Tulsa
Simon-Vidal L, Garcia-Calvo O, Oteo U, Arrasate S, Lete E, Sotomayor N, Gonzalez-Diaz H (2018) Perturbation-theory and machine learning (PTML) model for high-throughput screening of Parham reactions: experimental and theoretical studies. J Chem Inf Model 58:1384–1396
Liu H, Deng J, Luo Z, Lin Y, Merz KM Jr, Zheng Z (2020) Receptor-ligand binding free energies from a consecutive histograms Monte Carlo sampling method. J Chem Theory Comput 16:6645–6655
Cabeza de Vaca I, Qian Y, Vilseck JZ, Tirado-Rives J, Jorgensen WL (2018) Enhanced Monte Carlo methods for modeling proteins including computation of absolute free energies of binding. J Chem Theory Comput 14:3279–3288
Cole DJ, Tirado-Rives J, Jorgensen WL (2014) Enhanced Monte Carlo sampling through replica exchange with solute tempering. J Chem Theory Comput 10:565–571
Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform 7:1–13
Škuta C, Cortés-Ciriano I, Dehaen W, Kříž P, van Westen GJ, Tetko IV, Bender A, Svozil D (2020) QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 12:1–16
Cortes-Ciriano I, Firth NC, Bender A, Watson O (2018) Discovering highly potent molecules from an initial set of inactives using iterative screening. J Chem Inf Model 58:2000–2014
Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M, Davies JW (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49:108–119
Wagner AB (2006) SciFinder Scholar 2006: an empirical analysis of research topic query processing. J Chem Inf Model 46:767–774
Ridley DD (2000) Strategies for chemical reaction searching in SciFinder. J Chem Inf Comp Sci 40:1077–1084
Carracedo-Reboredo P, Corona R, Martinez-Nunes M, Fernandez-Lozano C, Tsiliki G, Sarimveis H, Aranzamendi E, Arrasate S, Sotomayor N, Lete E, Munteanu CR, Gonzalez-Diaz H (2020) MCDCalc: Markov chain molecular descriptors calculator for medicinal chemistry. Curr Top Med Chem 20:305–317
Pesciullesi G, Schwaller P, Laino T, Reymond J-L (2020) Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates. Nat Commun 11:4874
Smith JS, Nebgen BT, Zubatyuk R, Lubbers N, Devereux C, Barros K, Tretiak S, Isayev O, Roitberg AE (2019) Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning. Nat Commun 10:2903
Grambow CA, Li Y-P, Green WH (2019) Accurate thermochemistry with small data sets: a bond additivity correction and transfer learning approach. J Phys Chem A 123:5826–5835
Sun G, Sautet P (2019) Toward fast and reliable potential energy surfaces for metallic Pt clusters by hierarchical delta neural networks. J Chem Theory Comput 15:5614–5627
Feuz KD, Cook DJ (2015) Transfer learning across feature-rich heterogeneous feature spaces via feature-space remapping (FSR). ACM T Intel Syst Tec 6:1–27
Grazioli G, Roy S, Butts CT (2019) Predicting reaction products and automating reactive trajectory characterization in molecular simulations with support vector machines. J Chem Inf Model 59:2753–2764
Charpentier A, Mignon D, Barbe S, Cortes J, Schiex T, Simonson T, Allouche D (2018) Variable neighborhood search with cost function networks to solve large computational protein design problems. J Chem Inf Model 59:127–136
Abramyan TM, An Y, Kireev D (2019) Off-pocket activity cliffs: a puzzling facet of molecular recognition. J Chem Inf Model 60:152–161
Endo K, Yuhara D, Yasuoka K (2022) Efficient Monte Carlo sampling for molecular systems using continuous normalizing flow. J Chem Theory Comput 18:1395–1405
Matsumoto M, Nishimura T (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM T Model Comput S 8:3–30
Moreau A, Couture A, Deniau E, Grandclaudon P (2005) Construction of the six- and five-membered aza-heterocyclic units of the isoindoloisoquinolone nucleus by Parham-type cyclization sequences – total synthesis of nuevamine. Eur J Org Chem 2005:3437–3443
Akiyama T (2007) Stronger Brønsted acids. Chem Rev 107:5744–5758
Akiyama T, Mori K (2015) Stronger Brønsted acids: recent progress. Chem Rev 115:9277–9306
Caballero-García G, Goodman JM (2021) N-Triflylphosphoramides: highly acidic catalysts for asymmetric transformations. Org Biomol Chem 19:9565–9618
Nakashima D, Yamamoto H (2006) Design of chiral N-triflyl phosphoramide as a strong chiral Brønsted acid and its application to asymmetric Diels–Alder reaction. J Am Chem Soc 128:9626–9627
Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36
Pogány P, Arad N, Genway S, Pickett SD (2018) De novo molecule design by translating from reduced graphs to SMILES. J Chem Inf Model 59:1136–1146
Toropov AA, Toropova AP, Benfenati E, Leszczynska D, Leszczynski J (2010) SMILES-based optimal descriptors: QSAR analysis of fullerene-based HIV-1 PR inhibitors by means of balance of correlations. J Comput Chem 31:381–392
Reid JP, Ermanis K, Goodman JM (2019) BINOPtimal: a web tool for optimal chiral phosphoric acid catalyst selection. Chem Commun 55:1778–1781
Acknowledgements
Technical and human support provided by General Research Services SGIker (UPV/EHU, MINECO, GV/EJ, ERDF and ESF) is also acknowledged.
Funding
The authors acknowledge financial support from Grants PID2019-104148GB-I00 and PID2022-137365NB-I00 funded by MCIN/AEI/10.13039/501100011033 and Grant IT1558-22 funded by Basque Government/Eusko Jaurlaritza, 2022–2025. CITIC is funded by the Xunta de Galicia through the collaboration agreement between the Department of Culture, Education, Vocational Training and Universities and the Galician universities to strengthen the research centers of the Galician University System (CIGUS).
Author information
Contributions
SA, CRM, CFL, NS, EL, and HGD conceived the presented idea. PCR and CRM implemented the idea computationally and performed the computations and analysis. EA performed the organic synthesis experiments. SH carried out the data analysis and software validation. SA, CRM, CFL, NS, EL, and HGD supervised the findings of this work. All authors discussed the results and contributed to writing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Bioethics approval is not applicable (no laboratory animals or personal data were used). All authors consent to participate in the paper.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
The following files are available free of charge. General experimental methods; Synthetic procedures and structural determination for 2a-d; Copies of HPLC chromatograms of racemic and enantioenriched 2a-d; Copies of 1H and 13C NMR spectra.
Additional file 2:
Dataset of reactions, molecular descriptors, SMILES codes, etc.
Additional file 3:
MATEO server reactions of reference.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Carracedo-Reboredo, P., Aranzamendi, E., He, S. et al. MATEO: intermolecular α-amidoalkylation theoretical enantioselectivity optimization. Online tool for selection and design of chiral catalysts and products. J Cheminform 16, 9 (2024). https://doi.org/10.1186/s13321-024-00802-7
DOI: https://doi.org/10.1186/s13321-024-00802-7