License: arXiv.org perpetual non-exclusive license
arXiv:2402.00086v1 [cs.LG] 31 Jan 2024

Retrosynthesis prediction enhanced by in-silico reaction data augmentation

Xu Zhang, Yiming Mo, Wenguan Wang [1], Yi Yang [1]

[1] College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, Zhejiang, China
[2] College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, Zhejiang, China

Abstract

Recent advances in machine learning (ML) have expedited retrosynthesis research by helping chemists design experiments more efficiently. However, all ML-based methods consume substantial amounts of paired training data (i.e., chemical reactions in the form of product-reactant(s) pairs), which are costly to obtain. Moreover, companies view reaction data as a valuable asset and restrict researchers' access to it. Because retrosynthesis models are data-driven, these issues prevent the creation of more powerful models. In response, we exploit easy-to-access unpaired data (i.e., one component of a product-reactant(s) pair) to generate in-silico paired data that facilitate model training. Specifically, we present RetroWISE, a self-boosting framework that employs a base model trained on real paired data to perform in-silico reaction generation and augmentation using unpaired data, ultimately leading to a superior model. On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models (e.g., +8.6% top-1 accuracy on the USPTO-50K test dataset). Moreover, it consistently improves the prediction accuracy of rare transformations. These results show that RetroWISE overcomes the training bottleneck by means of in-silico reactions, thereby paving the way toward more effective ML-based retrosynthesis models.

keywords:
Retrosynthesis, Machine Learning, In-silico Reaction Data Augmentation, Self-boosting Framework

1 Introduction

Retrosynthesis, the process of identifying precursors for a target molecule, is essential for material design and drug discovery (Blakemore et al, 2018). However, the huge search space of possible chemical transformations and the enormous time required even from experts make this task challenging. Thus, efficient computer-assisted synthesis (Corey and Wipke, 1969; Corey et al, 1985; Coley et al, 2017) has long been explored. Thanks to recent advances in artificial intelligence, machine learning (ML)-based methods (Segler et al, 2018; Mikulak-Klucznik et al, 2020; Schwaller et al, 2021; Toniato et al, 2021; Yu et al, 2023; Born and Manica, 2023) have emerged to help chemists design experiments and gain insights that might not be achievable through traditional methods alone, bringing retrosynthesis research to a new pivotal moment.

ML-based methods for single-step retrosynthesis can be roughly categorized into three groups. Template-based methods predict reactants using reaction templates that encode core reactive rules. LHASA (Corey et al, 1985), the first retrosynthesis program, utilizes manually encoded templates to predict retrosynthetic routes. To scale to exponentially growing knowledge (Segler et al, 2018), data-driven methods (Segler and Waller, 2017; Coley et al, 2017; Dai et al, 2019; Baylon et al, 2019; Chen and Jung, 2021) extract a large number of reaction templates from data and formulate retrosynthesis as a template retrieval/classification task. Semi-template methods (Shi et al, 2020; Yan et al, 2020; Somnath et al, 2021; Wang et al, 2021) decompose retrosynthesis into two stages: they typically (1) identify the reactive sites to convert the product into synthons and (2) complete the synthons into reactant(s), using the “reaction centers” in templates to supervise the training procedure (Sun et al, 2021). Template-free methods view single-step retrosynthesis prediction as a machine translation task, in which deep generative models directly translate the given product into reactant(s). These methods use either SMILES (Weininger, 1988) or molecular graphs as data representations, leading to sequence-based methods (Liu et al, 2017; Tetko et al, 2020; Lin et al, 2020; Kim et al, 2021; Wan et al, 2022; Zhong et al, 2022) and graph-based methods (Seo et al, 2021; Tu and Coley, 2022; Zhong et al, 2023), respectively.

Despite appealing results, existing ML-based methods have an insatiable appetite for paired training data (i.e., chemical reactions in the form of product-reactant(s) pairs), which are costly to obtain because chemistry experiments are typically designed not to build reaction databases but to meet specific research needs (Rodrigues, 2019). Moreover, collecting chemical reactions is time-consuming and requires domain expertise, which makes reaction data a valuable asset to companies. As a result, proprietary databases (e.g., Reaxys (Lawson et al, 2014), collected from the scientific literature and organic chemistry/life science patents) have limited accessibility and cannot be viewed or acquired directly. Public datasets such as USPTO (Lowe, 2012, 2017), extracted from US patents, in turn contain only a limited amount of paired data (roughly 3.7M reactions with duplicates). Because retrosynthesis models are data-driven, these issues remain key obstacles that impede progress toward more effective models. In response to similar issues, data augmentation with newly generated samples has recently been successful in various fields, such as medical research (Marouf et al, 2020; Gao et al, 2023), biological research (Castro et al, 2022; Baker et al, 2023), and robotic research (Yang et al, 2022), as it provides inexpensive augmentation without increasing the demand for costly data collection or raising privacy concerns. However, in-silico reaction generation and augmentation for single-step retrosynthesis prediction has yet to be explored.

Here, we present a framework called RetroWISE that uses a base model trained on real paired data to generate in-silico paired data from unpaired data (i.e., one component of the product-reactant(s) pair), which can be collected more easily from public databases or via web scraping, to develop a more effective ML model. Specifically, RetroWISE uses real paired data to train the base model, and then uses the base model to generate abundant in-silico reactions from easy-to-access unpaired data. Finally, RetroWISE augments the real paired data with the generated reactions to train a more effective retrosynthesis model. In this way, training proceeds in a self-boosting manner: in-silico reactions generated by the base model in turn push the model to evolve. We conduct experiments on three widely used benchmark datasets for single-step retrosynthesis prediction. The experimental results provide encouraging evidence that RetroWISE achieves the best overall performance against state-of-the-art models (e.g., an 8.6% improvement in top-1 accuracy on the USPTO-50K (Schneider et al, 2016) test dataset). Moreover, we show that RetroWISE consistently improves the prediction accuracy on rare transformations, which are typically of particular interest to chemists designing novel synthetic routes. In summary, RetroWISE provides a feasible and cost-effective way of generating and augmenting in-silico reactions, based on a self-boosting procedure, to advance ML-based retrosynthesis research.

Figure 1: Overview of the RetroWISE framework. a, Given the unpaired reactants $\hat{Y}^{\circ}$ as an example, the base forward synthesis model $g_{y\rightarrow x}$ trained on real paired data is used to generate in-silico products $\hat{X}^{\circ}$. Then, a filter process consisting of template matching and molecular similarity comparison selects high-quality in-silico reactions $\hat{R}^{\circ}$. b, These cheap in-silico reactions are used to augment costly real reactions as paired training data to train a more effective retrosynthesis model $\hat{f}_{x\rightarrow y}$. In this way, the whole framework is self-boosted.

2 Results

RetroWISE framework. Given real paired data, which consist of product-reactant(s) pairs, and unpaired data (products or reactants), the main idea behind RetroWISE is self-boosting: a base model generates in-silico reactions from unpaired data, which in turn augment the real paired data to facilitate model training. Specifically, RetroWISE uses real paired data $R = \{(x_n, y_n)\}_n$, where $x_n$ represents the product and $y_n$ denotes the corresponding reactant(s), to train a base forward synthesis model $g_{y\rightarrow x}$ and a base retrosynthesis model $f_{x\rightarrow y}$ as preparation. Then, as illustrated in Fig. 1a, RetroWISE generates in-silico reactions in one of two ways: (1) using the base forward synthesis model $g_{y\rightarrow x}$ to produce in-silico products $\hat{X}^{\circ} = \{\hat{x}^{\circ}_m\}_m$ from unpaired reactants $\hat{Y}^{\circ} = \{\hat{y}^{\circ}_m\}_m$; (2) using the base retrosynthesis model $f_{x\rightarrow y}$ to generate in-silico reactants $\hat{Y}^{\star} = \{\hat{y}^{\star}_l\}_l$ from unpaired products $\hat{X}^{\star} = \{\hat{x}^{\star}_l\}_l$. Each generated reaction is made up of the unpaired data (e.g., unpaired reactant(s)) and the corresponding in-silico data (e.g., in-silico product). Moreover, to enhance the quality of in-silico reactions, RetroWISE incorporates a chemically aware filter process, which consists of a template matching step and a molecular similarity comparison step: (1) preserving generated reactions that match any selected template; (2) for the reactions left unmatched by the previous step, reconstructing pseudo unpaired data (e.g., $\tilde{y}^{\circ}$) from the in-silico data (e.g., $\hat{x}^{\circ}$), comparing their molecular similarity to the original unpaired data (e.g., $\hat{y}^{\circ}$), and retaining the in-silico reactions whose molecular similarity is above a specific threshold. For brevity, the preserved in-silico reactions $\hat{R}^{\circ} = \{(\hat{x}^{\circ}_m, \hat{y}^{\circ}_m)\}_m$ and $\hat{R}^{\star} = \{(\hat{x}^{\star}_l, \hat{y}^{\star}_l)\}_l$ are jointly denoted as $\hat{R}$. Finally, as illustrated in Fig. 1b, RetroWISE uses the cheap in-silico paired data $\hat{R}$ to augment the costly real paired data $R$ to train a more powerful retrosynthesis model $\hat{f}_{x\rightarrow y}$.
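
The generation, filtering, and augmentation steps described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the models are passed in as plain callables, `matches_template` is a hypothetical predicate standing in for template matching, the round-trip reconstruction is assumed to use the base retrosynthesis model, `ngram_tanimoto` is a character n-gram score used as a stand-in for a fingerprint-based molecular similarity, and the threshold of 0.8 is an arbitrary placeholder. Only the direction starting from unpaired reactants is shown; the direction starting from unpaired products is symmetric.

```python
from typing import Callable, Iterable, List, Tuple

Reaction = Tuple[str, str]  # (product x, reactant(s) y) as SMILES-like strings


def ngram_tanimoto(a: str, b: str, n: int = 2) -> float:
    """Tanimoto score on character n-gram sets.

    Assumption: a crude stand-in for fingerprint-based molecular similarity.
    """
    grams_a = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    grams_b = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def filter_forward_generated(
    unpaired_reactants: Iterable[str],
    forward_model: Callable[[str], str],            # g_{y->x}: reactants -> product
    retro_model: Callable[[str], str],              # f_{x->y}: product -> reactants
    matches_template: Callable[[str, str], bool],   # hypothetical template check
    threshold: float = 0.8,                         # placeholder value
) -> List[Reaction]:
    """Generate in-silico products and keep only reactions that pass the filter."""
    kept: List[Reaction] = []
    for y_hat in unpaired_reactants:
        x_hat = forward_model(y_hat)                # in-silico product
        if matches_template(x_hat, y_hat):          # step 1: template matching
            kept.append((x_hat, y_hat))
            continue
        y_tilde = retro_model(x_hat)                # step 2: reconstruct pseudo reactants
        if ngram_tanimoto(y_tilde, y_hat) >= threshold:
            kept.append((x_hat, y_hat))
    return kept


def augment(real: List[Reaction], in_silico: List[Reaction]) -> List[Reaction]:
    """Training set for the final model: real pairs plus filtered in-silico pairs."""
    return list(real) + list(in_silico)


# Toy stand-ins so the sketch runs end to end (not chemically meaningful).
def toy_forward(y: str) -> str:
    return y.replace(".", "")


def toy_retro(x: str) -> str:
    mid = len(x) // 2
    return x[:mid] + "." + x[mid:]


def toy_template(x: str, y: str) -> bool:
    return x.startswith("C")


kept = filter_forward_generated(["CC.O", "N.N", "NNN.O"], toy_forward, toy_retro, toy_template)
train_set = augment([("CCOC", "CCO.C")], kept)
```

In this toy run the first reaction is kept by the template check, the second by the round-trip similarity check, and the third is discarded because its reconstruction differs too much from the original reactants.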

Improvements using in-silico reactions. Learning from sufficient paired training data is a key factor in the success of ML-based retrosynthesis methods. We therefore investigate how RetroWISE improves retrosynthesis prediction performance by augmenting paired training data with in-silico reactions. First, RetroWISE generates in-silico reactions from unpaired reactants in USPTO applications (Lowe, 2017). Specifically, the raw reactions are preprocessed as in Dai et al (2019) to obtain approximately 1M unique reactants. Then, RetroWISE utilizes the base forward synthesis model $g_{y\rightarrow x}$ to produce the corresponding in-silico products $\hat{X}^{\circ}$ from the unpaired reactants $\hat{Y}^{\circ}$, and forms them into in-silico reactions $\hat{R}^{\circ}$. RetroWISE trained with extra generated reactions from USPTO applications is referred to as RetroWISE-U. This generation and training procedure is particularly useful when plentiful reactants are available but their outcomes are not known in advance. Second, RetroWISE obtains in-silico reactions from unpaired products by randomly sampling 4M molecules from the PubChem database (Kim et al, 2019) or 20M from the ZINC database (Irwin et al, 2020). RetroWISE utilizes the base retrosynthesis model $f_{x\rightarrow y}$ to produce in-silico reactants $\hat{Y}^{\star}$ from the unpaired products $\hat{X}^{\star}$, and forms them into in-silico reactions $\hat{R}^{\star}$. To distinguish the variants, we denote RetroWISE trained with extra generated reactions from PubChem and ZINC as RetroWISE-P and RetroWISE-Z, respectively. This pipeline is also feasible because numerous molecules are publicly accessible in large databases.
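
Preparing the two kinds of unpaired inputs mentioned above (unique reactants, and a random sample of molecules from a large database) can be sketched as follows. This is only an illustration: the helper names are hypothetical, the inputs are assumed to be already-canonicalized SMILES strings so that string equality identifies duplicates, and the fixed seed is an assumption added for reproducibility rather than something stated in the text.

```python
import random
from typing import Iterable, List


def unique_in_order(smiles: Iterable[str]) -> List[str]:
    """Drop duplicate entries while keeping first-seen order.

    Assumption: strings are already canonicalized, so equal molecules
    have equal strings.
    """
    return list(dict.fromkeys(smiles))


def sample_unpaired_products(pool: List[str], k: int, seed: int = 0) -> List[str]:
    """Randomly sample up to k molecules without replacement from a pool."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(pool, min(k, len(pool)))


reactants = unique_in_order(["CC.O", "N.N", "CC.O"])
products = sample_unpaired_products(["CCO", "CCN", "CCC", "c1ccccc1"], k=2)
```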

Table 1: Performance (%) of various models trained with in-silico reactions from different unpaired data sources on the USPTO-50K test set and the USPTO-MIT test set.
| Model | Extra paired data | USPTO-50K Top-1 | USPTO-50K Top-1 $\langle MaxFrag \rangle$ | USPTO-MIT Top-1 |
| --- | --- | --- | --- | --- |
| Baseline | None | 56.3 | 61.0 | 60.3 |
| RetroWISE-P | 4M | 60.0 | 64.1 | 61.6 |
| RetroWISE-Z | 20M | 60.1 | 64.7 | 61.9 |
| RetroWISE-U | 320K | 63.8 | 68.5 | 64.6 |

As shown in Table 1, we evaluate our models (i.e., RetroWISE-U, RetroWISE-P, and RetroWISE-Z) on two benchmark datasets: USPTO-50K (Schneider et al, 2016) and USPTO-MIT (Jin et al, 2017). The baseline is trained only with real paired data, while RetroWISE is trained with the same real paired data plus in-silico reactions as auxiliary paired training data. The evaluation metrics are top-1 exact-match accuracy and top-1 MaxFrag accuracy (Tetko et al, 2020). On USPTO-50K, RetroWISE-U achieves the highest exact-match accuracy at 63.8% and the highest MaxFrag accuracy at 68.5%. RetroWISE-P and RetroWISE-Z also have clear advantages over the baseline, improving top-1 exact-match accuracy by 3.7 and 3.8 percentage points, respectively. Moreover, RetroWISE achieves substantial improvements on the larger USPTO-MIT dataset: RetroWISE-U, RetroWISE-P, and RetroWISE-Z exceed the baseline by 4.3, 1.3, and 1.6 percentage points, respectively. We attribute the superior performance of RetroWISE-U to two factors: (1) the base forward synthesis model $g_{y\rightarrow x}$ used to generate in-silico data is much more accurate than the base retrosynthesis model $f_{x\rightarrow y}$, producing higher-quality reactions; and (2) unpaired reactants in RetroWISE-U are more likely to yield chemically plausible reactions than unpaired products randomly sampled from PubChem and ZINC. These results show that incorporating in-silico reactions indeed facilitates the model learning process and that RetroWISE provides better results when the in-silico reactions are generated from unpaired reactants.
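
The two evaluation metrics named above can be illustrated with a short sketch. It is a simplified approximation, not the evaluation code behind the reported numbers: predictions are compared as plain strings after sorting the `.`-separated fragments (real evaluations typically compare canonical SMILES produced by a cheminformatics toolkit), and the "largest fragment" used for MaxFrag is chosen by string length as a proxy for molecular size. The function names are hypothetical.

```python
from typing import Sequence


def normalize(smiles: str) -> str:
    """Order-independent form of a multi-fragment SMILES string."""
    return ".".join(sorted(smiles.split(".")))


def largest_fragment(smiles: str) -> str:
    """Longest '.'-separated fragment (string length as a proxy for size)."""
    return max(smiles.split("."), key=len)


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],  # ranked candidate reactants per product
    targets: Sequence[str],                # ground-truth reactants per product
    k: int,
    max_frag: bool = False,
) -> float:
    """Percentage of targets matched by any of the top-k ranked predictions."""
    key = largest_fragment if max_frag else normalize
    hits = 0
    for ranked, target in zip(predictions, targets):
        if key(target) in {key(p) for p in ranked[:k]}:
            hits += 1
    return 100.0 * hits / len(targets)


preds = [["CCO.Cl", "CCO.Br"], ["CCN.O", "c1ccccc1.O"]]
truth = ["CCO.Br", "c1ccccc1.Cl"]
exact_top1 = top_k_accuracy(preds, truth, k=1)
maxfrag_top1 = top_k_accuracy(preds, truth, k=1, max_frag=True)
```

In this toy example the first top-1 prediction gets the main reactant right but the small reagent wrong, so it counts as a miss under exact match and as a hit under MaxFrag, which is why MaxFrag accuracy is never lower than exact-match accuracy for the same predictions.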

Table 2: Top-k single-step retrosynthesis accuracy (%) on the USPTO-50K test set.
| Category | Model | k = 1 | 3 | 5 | 10 | 20 | 50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Template-based | Retrosim (Coley et al, 2017) | 37.3 | 54.7 | 63.3 | 74.1 | 82.0 | 85.3 |
| | Neuralsym (Segler and Waller, 2017) | 44.4 | 65.3 | 72.4 | 78.9 | 82.2 | 83.1 |
| | GLN (Dai et al, 2019) | 52.5 | 69.0 | 75.6 | 83.7 | 89.0 | 92.4 |
| | LocalRetro (Chen and Jung, 2021) | 53.4 | 77.5 | 85.9 | 92.4 | - | 97.7 |
| Semi-template | G2Gs (Shi et al, 2020) | 48.9 | 67.6 | 72.5 | 75.5 | - | - |
| | GraphRetro (Somnath et al, 2021) | 53.7 | 68.3 | 72.2 | 75.5 | - | - |
| | RetroXpert* (Yan et al, 2020) | 50.4 | 61.1 | 62.3 | 63.4 | 63.9 | 64.0 |
| | RetroPrime (Wang et al, 2021) | 51.4 | 70.8 | 74.0 | 76.1 | - | - |
| Template-free | Liu’s Seq2seq (Liu et al, 2017) | 37.4 | 52.4 | 57.0 | 61.7 | 65.9 | 70.7 |
| | GTA (Seo et al, 2021) | 51.1 | 67.6 | 74.8 | 81.6 | - | - |
| | Dual-TF (Sun et al, 2021) | 53.3 | 69.7 | 73.0 | 75.0 | - | - |
| | MEGAN (Sacha et al, 2021) | 48.1 | 70.7 | 78.4 | 86.1 | 90.3 | 93.2 |
| | Tied transformer (Kim et al, 2021) | 47.1 | 67.2 | 73.5 | 78.5 | - | - |
| | AT (Tetko et al, 2020) | 53.5 | - | 81.0 | 85.7 | - | - |
| | Graph2Edits (Zhong et al, 2023) | 55.1 | 77.3 | 83.4 | 89.4 | - | 92.7 |
| | R-SMILES (Zhong et al, 2022) | 56.3 | 79.2 | 86.2 | 91.0 | 93.1 | 94.6 |
| | RetroWISE (This work) | 64.9 | 83.5 | 88.4 | 92.7 | 95.1 | 96.9 |
| Template-free $\langle MaxFrag \rangle$ | MEGAN | 54.2 | 75.7 | 83.1 | 89.2 | 92.7 | 95.1 |
| | Tied transformer | 51.8 | 72.5 | 78.2 | 82.4 | - | - |
| | AT | 58.5 | - | 85.4 | 90.0 | - | - |
| | Graph2Edits | 59.2 | 80.1 | 86.1 | 91.3 | - | 93.1 |
| | R-SMILES | 61.0 | 82.5 | 88.5 | 92.8 | 94.6 | 95.7 |
| | RetroWISE (This work) | 69.1 | 86.5 | 90.4 | 93.6 | 95.5 | 97.0 |

* RetroXpert results are updated by the official implementation.

Table 3: Top-k single-step retrosynthesis accuracy (%) on the USPTO-MIT test set.

| Category | Model | k = 1 | 3 | 5 | 10 | 20 | 50 |
|---|---|---|---|---|---|---|---|
| Template-based | Neuralsym (Segler and Waller, 2017) | 47.8 | 67.6 | 74.1 | 80.2 | - | - |
| | LocalRetro (Chen and Jung, 2021) | 54.1 | 73.7 | 79.4 | 84.4 | - | 90.4 |
| Template-free | Liu’s Seq2seq (Liu et al, 2017) | 46.9 | 61.6 | 66.3 | 70.8 | - | - |
| | AutoSynRoute (Lin et al, 2020) | 54.1 | 71.8 | 76.9 | 81.8 | - | - |
| | RetroTRAE (Ucak et al, 2022) | 58.3 | - | - | - | - | - |
| | R-SMILES (Zhong et al, 2022) | 60.3 | 78.2 | 83.2 | 87.3 | 89.7 | 91.6 |
| | RetroWISE (This work) | 64.6 | 82.3 | 86.7 | 90.3 | 92.4 | 94.0 |

Table 4: Top-k single-step retrosynthesis accuracy (%) on the USPTO-Full test set.

| Category | Model | k = 1 | 3 | 5 | 10 | 20 | 50 |
|---|---|---|---|---|---|---|---|
| Template-based | Retrosim (Coley et al, 2017) | 32.8 | - | - | 56.1 | - | - |
| | Neuralsym (Segler and Waller, 2017) | 35.8 | - | - | 60.8 | - | - |
| | GLN (Dai et al, 2019) | 39.3 | - | - | 63.7 | - | - |
| | LocalRetro (Chen and Jung, 2021) | 39.1 | 53.3 | 58.4 | 63.7 | 67.5 | 70.7 |
| Semi-template | RetroPrime (Wang et al, 2021) | 44.1 | 59.1 | 62.8 | 68.5 | - | - |
| Template-free | MEGAN (Sacha et al, 2021) | 33.6 | - | - | 63.9 | - | 74.1 |
| | GTA (Seo et al, 2021) | 46.6 | - | - | 70.4 | - | - |
| | AT (Tetko et al, 2020) | 46.2 | - | - | 73.3 | - | - |
| | R-SMILES (Zhong et al, 2022) | 48.9 | 66.6 | 72.0 | 76.4 | 80.4 | 83.1 |
| | RetroWISE (This work) | 52.3 | 68.7 | 73.5 | 77.9 | 80.9 | 83.6 |

Comparison with existing ML-based methods. Here, we compare RetroWISE with other ML-based methods on the most popular retrosynthesis benchmark datasets: USPTO-50K (Schneider et al, 2016), USPTO-MIT (Jin et al, 2017), and USPTO-Full (Dai et al, 2019). The top-k exact match accuracy and the top-k MaxFrag accuracy (Tetko et al, 2020) are adopted as the evaluation metrics. The performance of RetroWISE is summarized in Tables 2, 3, and 4, from which we derive three key observations:

  1. The proposed RetroWISE framework outperforms existing state-of-the-art methods (Chen and Jung, 2021; Wang et al, 2021; Zhong et al, 2022) on all three datasets, e.g., RetroWISE surpasses R-SMILES by 8.6%, 4.3%, and 3.4% top-1 exact accuracy on USPTO-50K, USPTO-MIT, and USPTO-Full, respectively. Our method consistently achieves higher accuracy at almost every k (the exception is top-50 on USPTO-50K, where LocalRetro is higher), which attests to its effectiveness in tackling the complex single-step retrosynthesis prediction task.

  2. RetroWISE is superior to the other methods especially in the low-resource setting with limited paired data. Notably, our method achieves substantial improvements over R-SMILES on USPTO-50K, with absolute increases of 8.6%, 4.3%, and 2.2% in top-1, top-3, and top-5 accuracy, respectively. These results further confirm the effectiveness of in-silico reaction augmentation when paired data are scarce.

  3. RetroWISE also delivers the best results in top-k MaxFrag accuracy across the three datasets. The MaxFrag accuracy proposed by Tetko et al (2020) reflects how accurately a model predicts the minimal part of the reactant(s) needed to design a retrosynthetic route, emphasizing that there are multiple possible ways to synthesize a compound (Dubrovskiy et al, 2018). The highest top-k MaxFrag accuracy (e.g., +8.1% top-1 MaxFrag accuracy on USPTO-50K) underscores both the prediction diversity and the prediction accuracy of RetroWISE.

Figure 2: Impact of data quantity. a, impact of in-silico data quantity and b, impact of real data quantity. Training with more in-silico data and with more real data both improve performance. The in-silico data ratio is the number of in-silico reactions divided by the number of real reactions, and vice versa for the real data ratio.

Impact of data quantity. The quantity of paired training data matters. We therefore evaluate how RetroWISE’s performance scales with the amount of data used. We first investigate how the number of in-silico reactions $\hat{R}$ affects prediction accuracy. A series of experiments is conducted on USPTO-50K (Schneider et al, 2016) in which the number of in-silico reactions $\hat{R}$ is gradually increased. Fig. 2a shows that more in-silico reactions lead to higher accuracy for each k-value. For instance, the top-1 accuracy increases from around 59.3% to 63.8% and the top-1 MaxFrag accuracy rises from 63.9% to 68.5% when more in-silico reactions are used. The prediction accuracy continues to grow as $\hat{R}$ increases, which suggests that increasing the number of in-silico reactions indeed benefits model training. We then examine the effect of the amount of real paired data on prediction performance. We fix the size of the in-silico set $\hat{R}$ to 320K and vary the size of the real paired data $R$ over {10K, 20K, 30K, 40K}. As shown in Fig. 2b, a larger set of real reactions also leads to higher accuracy, e.g., the top-1 accuracy rises from 60.3% to 63.8% as the size of $R$ increases from 10K to 40K. These results highlight the importance of data quantity for training a powerful retrosynthesis model.

Table 5: The filter process raises the quality of in-silico reactions for better performance.

| Method | k = 1 | 3 | 5 | 10 | 20 | 50 |
|---|---|---|---|---|---|---|
| Baseline | 56.3 | 79.2 | 86.2 | 91.0 | 93.1 | 94.6 |
| RetroWISE (w/o filtering) | 63.8 | 83.0 | 87.6 | 91.7 | 94.1 | 95.1 |
| RetroWISE (w/ filtering) | 64.9 | 83.5 | 88.4 | 92.7 | 95.1 | 96.9 |

Impact of data quality. Erroneous or low-quality in-silico reactions might cause errors to accumulate during model training. To address this issue, RetroWISE is equipped with a filter process that leverages template matching and molecular similarity comparison to enhance the quality of in-silico reactions. First, the filter employs RDKit (Landrum et al, 2013) to eliminate in-silico reactions that contain invalid reactant or product SMILES. Next, the template matching step builds a template library from the chemical templates extracted with RDChiral (Coley et al, 2019) that appear more than 5 times (14.5% of 301,257 templates in USPTO), and then preserves in-silico reactions that match any template in this library. This procedure ensures the chemical plausibility of in-silico reactions. Finally, for the reactions that did not match any template, the molecular similarity comparison step (1) reconstructs pseudo unpaired data from the in-silico data and (2) uses RDKit to calculate the molecular similarity between the pseudo unpaired data and the original unpaired data.
Specifically, as illustrated in Fig. 1(a), we feed the in-silico data (e.g., in-silico product $\hat{x}^{\circ}$) into the base model (e.g., $f_{x \to y}$) to generate the pseudo unpaired data (e.g., pseudo reactant(s) $\tilde{y}^{\circ}$), whose molecular similarity to the original unpaired data (e.g., unpaired reactant(s) $\hat{y}^{\circ}$) is then computed. Reactions with similarity above a specific threshold are also preserved. This procedure preserves the diversity of in-silico reactions while ensuring data validity.
As shown in Table 5, the filter process further improves the top-k prediction performance of RetroWISE (e.g., +1.1% top-1 accuracy and +1.8% top-50 accuracy over RetroWISE without filtering) while using fewer in-silico reactions (89% of the original in-silico paired data). The results support using the filter process to improve data quality, so that RetroWISE benefits from correct and chemically sound in-silico reactions.
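
The three filtering stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `is_valid`, `matches_template`, and `round_trip_similarity` are hypothetical stand-ins for the RDKit validity check, the RDChiral template match, and the round-trip similarity, and the toy lambdas exist only so the example runs without a chemistry toolkit. The default threshold of 0.55 is the value reported in the Methods.

```python
from typing import Callable, Iterable, List, Tuple

Reaction = Tuple[str, str]  # (product SMILES, reactant(s) SMILES)


def filter_in_silico(
    reactions: Iterable[Reaction],
    is_valid: Callable[[Reaction], bool],
    matches_template: Callable[[Reaction], bool],
    round_trip_similarity: Callable[[Reaction], float],
    threshold: float = 0.55,
) -> List[Reaction]:
    """Keep valid reactions that match a template or pass the similarity check."""
    kept = []
    for rxn in reactions:
        if not is_valid(rxn):
            continue  # stage 1: drop reactions with invalid SMILES
        if matches_template(rxn):
            kept.append(rxn)  # stage 2: a template match keeps the reaction
        elif round_trip_similarity(rxn) > threshold:
            kept.append(rxn)  # stage 3: keep mismatches that round-trip well
    return kept


# Toy data: placeholder strings instead of real SMILES (hypothetical values).
toy = [("P1", "R1"), ("P2", ""), ("P3", "R3"), ("P4", "R4")]
kept = filter_in_silico(
    toy,
    is_valid=lambda r: bool(r[0]) and bool(r[1]),
    matches_template=lambda r: r[0] == "P1",
    round_trip_similarity=lambda r: {"P3": 0.80, "P4": 0.30}.get(r[0], 0.0),
)
print(kept)  # [('P1', 'R1'), ('P3', 'R3')]
```

Passing the three checks in as functions keeps the control flow visible while leaving the chemistry-specific parts to whatever toolkit is available.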

Figure 3: Top-5 accuracy of different types of predictions. RetroWISE achieves excellent results over the baseline on almost every reaction type.

Performance on different reaction types. Reaction types are crucial to chemists, who use them to navigate large reaction databases and retrieve similar members of the same class in order to analyze and infer optimal reaction conditions. Chemists also use reaction types as an efficient way to communicate what a chemical reaction does and how it works in terms of atomic rearrangements. We therefore analyze performance per reaction type using the USPTO-50K dataset (Schneider et al, 2016), which assigns one of ten reaction classes to each reaction. These classes cover the most common reactions in organic synthesis, such as protections/deprotections, C-C bond formation, and heterocycle formation. Note that RetroWISE does not use reaction types for training, since they are often unavailable in real-world scenarios. Nevertheless, as shown in Fig. 3, RetroWISE outperforms the baseline on almost every reaction type by a large margin. Among the ten reaction types, RetroWISE most significantly improves heterocycle formation and C-C bond formation prediction (e.g., a 9.8% improvement on the heterocycle formation class), while protections are the most challenging to predict. We infer that the reasons are (1) heterocycle formation and C-C bond formation have more diverse possibilities for choosing reactants and reactions than other reaction types (Tetko et al, 2020); and (2) in-silico reactions of protections appear less frequently, resulting in a slight imbalance during model learning.

Figure 4: Performance on rare transformations (Rare-2, Rare-5, and Rare-10 subsets). a, Top-k exact match accuracy. b, Top-k MaxFrag match accuracy. RetroWISE achieves consistent improvements on three testing benchmarks of rare transformations.

Figure 5: Representative examples of Rare-2 predictions. The green part highlights the structure corresponding to the template. RetroWISE produces more accurate predictions than the baseline on rare transformations.

Performance on rare transformations. Retrosynthesis prediction also faces the challenge of handling rare transformations that involve uncommon reactants, products, or reaction mechanisms, which are underrepresented in the training data. To assess prediction performance on rare transformations, we create three test subsets from USPTO containing 204,988, 337,593, and 438,333 reactions, in which the template of each reaction appears fewer than 2, 5, and 10 times, respectively. These three subsets are denoted Rare-2, Rare-5, and Rare-10. We compare RetroWISE and the baseline, both trained on USPTO-50K, and report the accuracy on all three test subsets in Fig. 4. RetroWISE outperforms the baseline across all the subsets with large relative improvements. For instance, on the Rare-2 subset, RetroWISE achieves relative improvements over the baseline of 32.2% in top-1 accuracy and 23.0% in top-50 accuracy. Moreover, we illustrate some representative examples from the Rare-2 subset in Fig. 5, where RetroWISE ranks the correct predictions higher than the baseline does. These quantitative and qualitative results indicate that RetroWISE generalizes better to rare scenarios.
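
As a rough sketch of how such subsets and the reported relative improvement can be computed, the snippet below counts template frequencies and keeps reactions whose template occurs fewer than a given number of times. The reaction ids, template labels, and accuracy values are hypothetical; in practice the templates would come from a template extractor such as RDChiral.

```python
from collections import Counter
from typing import List, Tuple


def rare_subset(reaction_templates: List[Tuple[str, str]], max_count: int) -> List[str]:
    """Return ids of reactions whose template occurs fewer than max_count times."""
    counts = Counter(template for _, template in reaction_templates)
    return [rid for rid, template in reaction_templates if counts[template] < max_count]


def relative_improvement(new_acc: float, base_acc: float) -> float:
    """Relative gain of new_acc over base_acc, in percent."""
    return 100.0 * (new_acc - base_acc) / base_acc


# Hypothetical (reaction id, template label) pairs.
toy = [("r1", "T_a"), ("r2", "T_a"), ("r3", "T_b"),
       ("r4", "T_c"), ("r5", "T_c"), ("r6", "T_c")]

print(rare_subset(toy, 2))               # ['r3']: only T_b occurs fewer than 2 times
print(len(rare_subset(toy, 5)))          # 6: every template occurs fewer than 5 times
print(relative_improvement(12.0, 10.0))  # 20.0 (hypothetical accuracies)
```

Note that a relative improvement divides by the baseline accuracy, so it can be large even when the absolute gain is modest.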

Figure 6: Examples of RetroWISE predictions. Representative examples of a, exact match prediction, b, MaxFrag match prediction, and c, inaccurate match prediction are shown. The green part highlights the differences between the ground truth (G) and the prediction (P). The molecular similarity is calculated using the ECFP4.

Discussion of prediction results. Evaluating the predictions of RetroWISE properly requires more than a single metric. We therefore examine how close the predictions are to the ground truth using MaxFrag accuracy and molecular similarity (Hendrickson, 1991; Nikolova and Jaworska, 2003). Exact match accuracy indicates whether the predicted reactants match the ground truth exactly, while MaxFrag accuracy measures whether their main components are identical. In addition, molecular similarity estimates how close the prediction and the ground truth are in chemical structure. We show three top-1 predictions of RetroWISE. Fig. 6a shows an accurate prediction; Fig. 6b shows a MaxFrag-accurate prediction, where the predicted reactants share the same main fragment as the ground truth (i.e., the minimal part of the reactants needed to design a retrosynthetic route); and Fig. 6c shows an inaccurate prediction. We use the Tanimoto similarity ($T_c$) with ECFP4 (Rogers and Hahn, 2010) as the molecular fingerprint to quantify the similarity, which ranges from 0 (no overlap) to 1 (complete overlap). Two structures are usually considered similar if $T_c > 0.85$ (Maggiora et al, 2014), and we find that even inaccurate predictions from RetroWISE usually have high Tanimoto similarity ($T_c = 0.91$), indicating that the prediction may be a feasible outcome of another retrosynthesis route.
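
For reference, the Tanimoto coefficient between two binary fingerprints is the number of shared on-bits divided by the number of on-bits in either fingerprint. The sketch below assumes each fingerprint is already available as a set of on-bit indices (for example, exported from an ECFP4 computation); the bit indices shown are made up for illustration.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bit indices: 3 shared bits out of 5 distinct bits.
fp_truth = {1, 5, 9, 12}
fp_pred = {1, 5, 9, 20}
print(tanimoto(fp_truth, fp_pred))  # 0.6, below the 0.85 similarity cut-off
```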

3 Discussion

Retrosynthesis prediction is a challenging task even for experienced chemists due to the huge search space of all possible chemical transformations and the incomplete understanding of the reaction mechanism. Recent machine learning (ML)-based methods have emerged as an efficient tool for chemists in designing synthetic experiments, but their effectiveness heavily hinges on the availability of paired training data (i.e., chemical reactions each consisting of a product-reactant(s) pair), which is expensive to acquire. Furthermore, reaction data is considered a valuable resource by organizations and as a result, its accessibility to the public is severely restricted, creating a major hurdle for researchers. To address these issues, RetroWISE utilizes a base model trained on real paired data to generate in-silico reactions from easily accessible unpaired data (i.e., one component of product-reactant(s) pair), thereby facilitating further model training. In this way, the whole framework is self-boosted: pushing the retrosynthesis model to evolve with the in-silico reactions generated by the base model. Besides, ensuring the quality of in-silico reactions is also crucial, which is achieved through a filter process in RetroWISE.

RetroWISE is evaluated on three benchmark datasets and is compared with other state-of-the-art models for single-step retrosynthesis prediction. The experimental results indicate that RetroWISE alleviates the training bottleneck caused by the aforementioned issues, e.g., RetroWISE achieves 64.9% top-1 exact match accuracy on USPTO-50K and 52.3% top-1 accuracy on USPTO-Full, the largest dataset. In addition, RetroWISE gives superior predictions in almost all reaction classes, e.g., a 9.8% improvement on the heterocycle formation class. Moreover, our experiments show that RetroWISE learns more diverse reaction mechanisms, considerably improving the performance on rare transformations. For example, on the Rare-2 subset RetroWISE achieves a relative improvement of 32.2% over the baseline in top-1 accuracy, indicating that RetroWISE has the potential to assist chemists in designing novel routes. Finally, case studies of individual predictions show the various possibilities our method can offer for the creation of retrosynthetic routes.

Despite the promising performance of RetroWISE, two challenges remain for future research. (1) The improvement of RetroWISE relies in large part on the availability and quality of unpaired data, which affect the diversity and chemical plausibility of in-silico reactions. We thus expect that RetroWISE could be further enhanced with more sources and methods for collecting and preprocessing unpaired data. (2) As the number of in-silico reactions grows, refining the resulting reactions becomes more important. We therefore hypothesize that more efficient and effective filter processes will benefit RetroWISE. Given the encouraging experimental results, we envision RetroWISE as a framework for easing the training bottleneck of ML-based methods and stimulating further development of ML-based retrosynthesis research.

4 Methods

Data. Our models are evaluated on three public benchmark datasets from USPTO curated by Lowe (2012, 2017): USPTO-50K (Schneider et al, 2016), USPTO-MIT (Jin et al, 2017), and USPTO-Full (Dai et al, 2019).

  • USPTO-50K comprises approximately 50,000 reactions with precise atom mappings between reactants and products. Following Liu et al (2017); Dai et al (2019); Zhong et al (2022), we split the 50K reactions 80%/10%/10% into train/val/test sets. Since the reaction type is usually unknown, we follow Zhong et al (2022) and do not utilize this information for training.

  • USPTO-MIT (USPTO 480K) contains approximately 400,000 reactions for training, 30,000 for validation, and 40,000 for testing; it is much larger and noisier than the clean USPTO-50K dataset.

  • USPTO-Full is the largest dataset, encompassing roughly 1M chemical reactions; it was built by Dai et al (2019) to verify the scalability of retrosynthesis models. Following Dai et al (2019); Zhong et al (2022), reactions with multiple products are split into individual reactions so that each reaction has only one product, and the 1M reactions are divided into train/valid/test sets of 800K/100K/100K, respectively.

Data representations. We utilize two molecular representations in this work.

  • The Simplified Molecular-Input Line-Entry System (SMILES) (Weininger, 1988) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings (e.g., c1ccccc1 represents benzene). This representation is widely used as the input and output in most sequence-to-sequence (sequence-based) methods (Liu et al, 2017; Tetko et al, 2020; Zhong et al, 2022) for retrosynthesis prediction.

  • The molecular fingerprint is a bit-vector encoding the physicochemical or structural properties of a molecule, and is commonly used for synthesis design (Segler and Waller, 2017), similarity searching (Willett et al, 1998), and virtual screening (Cereto-Massagué et al, 2015; Muegge and Mukherjee, 2016). The most widely used fingerprints are the Extended-Connectivity Fingerprint (ECFP) (Rogers and Hahn, 2010) and MACCS keys (Durant et al, 2002). In this work, the molecular fingerprint is used to quantify the similarity between two molecules, indicating their closeness.

Problem formulation. The single-step retrosynthesis prediction task aims to predict precursors for a given molecule of interest. ML-based methods rely on a dataset of paired products and corresponding reactant(s), denoted as $R = \{(x_n, y_n)\}_n$, where $x_n$ represents the product and $y_n$ denotes the corresponding reactant(s). In this work, both reactants and products are represented by SMILES. Given a product sequence $x$, a sequence-based method learns a model $f_{x \to y}$ to obtain the corresponding reactant(s) sequence $y$.

Baseline. The baseline of our RetroSynthesis With In Silico rEactions (RetroWISE) framework uses only real paired data for training, adopting the vanilla transformer (Vaswani et al, 2017) as the network architecture and Root-aligned SMILES (R-SMILES) (Zhong et al, 2022) as the SMILES augmentation strategy. The transformer has an encoder-decoder architecture in which the encoder maps the input sequence to a latent space and the decoder decodes the output sequence from the latent space in an autoregressive manner. R-SMILES is a tightly aligned one-to-one mapping between the product and the reactant(s) sequence for more efficient retrosynthesis prediction. It adopts the same atom as the root (the starting atom) when transforming the product and the corresponding reactant(s) into SMILES sequences.

In-silico reaction generation. First, RetroWISE uses real paired data (e.g., reactions in USPTO-50K) to train a base forward synthesis model and a base retrosynthesis model. Then, RetroWISE collects unpaired data, the amount of which typically far exceeds the amount of paired data, from one of two sources: one containing unpaired reactants and one containing unpaired products. The reactants are derived from the USPTO 2001-2016 applications, which contain 1,939,254 raw reactions. Although these data are paired, we use only the reactant(s) component of each reaction in order to verify the effectiveness of our proposed framework. We preprocess the raw reactions by removing duplicates and reactions with incorrect atom mappings, and by splitting reactions with multiple products into separate ones. Reactions that appear in the validation or test set of the existing dataset are also excluded. Then, the base forward synthesis model is used to produce in-silico products from the unpaired reactants, yielding in-silico reactions. The unpaired products are obtained by randomly sampling 4M molecules from PubChem or 20M molecules from ZINC. The base retrosynthesis model is then used to generate in-silico reactions following a similar procedure. In both cases, the base model performs beam-search decoding on the unpaired data and selects the best hypothesis as the in-silico data. Moreover, a filter process is adopted to enhance the quality of the in-silico reactions, which contains a template matching step and a molecular similarity comparison step. For the first step, we use RDChiral (Coley et al, 2019) to extract 1,808,176 templates from USPTO and select those that appear more than 5 times (i.e., 43,710 unique templates) for template matching. For the second step, we set the molecular similarity threshold to 0.55.
Ultimately, RetroWISE retains roughly 89% of the in-silico reactions after the two filtering steps, which yields better performance.
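
One generation round from unpaired reactants can be sketched as below. This is an illustrative simplification rather than the authors' pipeline: `forward_model` is a hypothetical callable that returns the top-ranked product for a reactant string, and the held-out check is applied here to the generated pairs, whereas the text above applies it while preprocessing the raw reactions.

```python
from typing import Callable, Iterable, List, Set, Tuple

Reaction = Tuple[str, str]  # (product, reactant(s))


def generate_in_silico(
    unpaired_reactants: Iterable[str],
    forward_model: Callable[[str], str],
    held_out: Set[Reaction],
) -> List[Reaction]:
    """Pair each unpaired reactant string with the forward model's best product."""
    seen: Set[Reaction] = set()
    in_silico: List[Reaction] = []
    for reactants in unpaired_reactants:
        product = forward_model(reactants)  # best beam-search hypothesis
        rxn = (product, reactants)
        if rxn in held_out or rxn in seen:
            continue  # avoid validation/test leakage and duplicates
        seen.add(rxn)
        in_silico.append(rxn)
    return in_silico


# Toy stand-ins: a lookup table plays the role of the base forward model.
toy_forward = {"A.B": "AB", "C.D": "CD"}
real = [("XY", "X.Y")]
held_out = {("CD", "C.D")}

in_silico = generate_in_silico(["A.B", "C.D", "A.B"], toy_forward.__getitem__, held_out)
augmented = real + in_silico  # training set for the final retrosynthesis model
print(augmented)  # [('XY', 'X.Y'), ('AB', 'A.B')]
```

The same loop applies to unpaired products by swapping in the base retrosynthesis model and reversing which side of the pair is generated.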

Training details. As in Tetko et al (2020); Seo et al (2021); Zhong et al (2022), we apply SMILES augmentation during training for our RetroWISE framework: 20× SMILES augmentation for USPTO-50K (Schneider et al, 2016), 5× SMILES augmentation for USPTO-MIT (Jin et al, 2017), and 5× SMILES augmentation for USPTO-Full (Dai et al, 2019). We use the OpenNMT framework (Klein et al, 2017) and PyTorch (Paszke et al, 2019) to build the transformer model. Following Irwin et al (2022); Zhong et al (2022), we use the masking strategy to pretrain the model before training. During training, we employ the Adam optimizer (Kingma and Ba, 2017) with $\beta_1 = 0.9$ and $\beta_2 = 0.998$ for loss minimization and apply dropout (Srivastava et al, 2014) to the whole model at a rate of 0.1. The starting learning rate is set to 1.0 and noam (Vaswani et al, 2017) is used as the learning rate decay scheme.
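
The noam scheme warms the learning rate up linearly and then decays it with the inverse square root of the step. A minimal sketch follows; the text above gives only the starting learning rate (1.0), so `d_model=512` and `warmup_steps=4000` are the values from Vaswani et al (2017), used here purely as illustrative assumptions and not taken from this work.

```python
def noam_lr(step: int, base_lr: float = 1.0, d_model: int = 512,
            warmup_steps: int = 4000) -> float:
    """Noam schedule: linear warm-up, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # The two terms cross at step == warmup_steps, where the rate peaks.
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * d_model ** -0.5 * scale


# The rate rises during warm-up and falls afterwards (assumed hyperparameters).
for step in (1, 2000, 4000, 8000):
    print(step, noam_lr(step))
```

With a starting value of 1.0 the effective rate is therefore much smaller than 1.0 at every step, because it is multiplied by both the model-size factor and the step-dependent factor.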

Evaluation procedure. We use the top-k exact match accuracy as the evaluation metric to assess the performance of each model, with k ∈ {1, 3, 5, 10, 20, 50}. This metric is widely used in existing studies (Liu et al, 2017; Kim et al, 2021; Karpov et al, 2019; Sacha et al, 2021; Wang et al, 2021) and measures the fraction of test products for which one of the top-k predictions exactly matches the ground truth. We additionally adopt the top-k MaxFrag accuracy introduced by Tetko et al (2020) for retrosynthesis. Compared with the exact match accuracy, the MaxFrag accuracy focuses on the main compound transformation, which is the minimal information required to obtain a retrosynthesis route. As in Tetko et al (2020); Seo et al (2021); Zhong et al (2022), we apply the same SMILES augmentations at the evaluation stage as during training.
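
Both metrics can be sketched in a few lines. The sketch assumes all SMILES strings have already been canonicalized (for example with RDKit) so that string equality is meaningful, and it uses the longest SMILES string as a proxy for the largest fragment; the example molecules are chosen only for illustration.

```python
from typing import Callable, Sequence


def exact_match(pred: str, truth: str) -> bool:
    """Compare reactant sets, ignoring the order of '.'-separated molecules."""
    return sorted(pred.split(".")) == sorted(truth.split("."))


def max_frag_match(pred: str, truth: str) -> bool:
    """Compare only the largest fragment (longest SMILES string as a proxy)."""
    def largest(smiles: str) -> str:
        return max(smiles.split("."), key=lambda s: (len(s), s))
    return largest(pred) == largest(truth)


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
    match: Callable[[str, str], bool] = exact_match,
) -> float:
    """Percentage of targets whose ground truth is among the top-k predictions."""
    hits = sum(
        any(match(p, t) for p in ranked[:k])
        for ranked, t in zip(predictions, truths)
    )
    return 100.0 * hits / len(truths)


# Ranked predictions per target (best first) and the ground-truth reactants.
preds = [
    ["CCO.CC(=O)O", "CCO"],
    ["CCN", "CCBr.Oc1ccccc1"],
    ["O=C(O)c1ccccc1.CCO"],
]
truths = ["CC(=O)O.CCO", "Oc1ccccc1.CCBr", "O=C(O)c1ccccc1.CO"]

print(top_k_accuracy(preds, truths, k=2))                        # 2 of 3 targets hit
print(top_k_accuracy(preds, truths, k=2, match=max_frag_match))  # 100.0
```

The third target illustrates the difference between the metrics: the main fragment is predicted correctly but the small co-reactant is not, so it counts for MaxFrag accuracy only.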

5 Supplementary information

Details of baseline. In this work, we adopt the vanilla transformer (Vaswani et al, 2017) as the network architecture. A typical transformer model consists of two major parts, the encoder and the decoder. The encoder stacks several identical layers, each with three separate blocks: “Layer Norm”, “Multi-head Self Attention (MSA)”, and “Feedforward Network (FFN)”. Among them, the attention mechanism is the most critical part of the transformer, where three different vectors, Keys ($K$), Queries ($Q$), and Values ($V$), of dimension $d$ are employed for each input token. To compute self-attention, the dot products of the Queries with all the Keys are calculated and scaled by $1/\sqrt{d}$ to prevent them from growing very large. The resulting matrix is then converted into a probability matrix through the softmax function and multiplied by the Values to produce the attention output as follows:

$$\mathrm{Attention} = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V. \qquad (1)$$
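
Eq. (1) can be written out directly for a single head. The following is a dependency-free sketch for small inputs, not the OpenNMT implementation used in this work; it omits masking, multiple heads, and the learned projections.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Softmax over one row of scores."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Single-head softmax(Q K^T / sqrt(d)) V without masking."""
    d = len(K[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]
    return [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]


# Identical keys give uniform weights, so the output is the mean of the values.
out = attention(Q=[[1.0, 0.0]], K=[[1.0, 1.0], [1.0, 1.0]], V=[[2.0, 0.0], [0.0, 4.0]])
print(out)  # [[1.0, 2.0]]
```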

Besides, the baseline utilizes Root-aligned SMILES (R-SMILES) (Zhong et al, 2022) as our SMILES augmentation strategy. R-SMILES has a better augmentation effect because it adopts a tightly aligned one-to-one mapping between the product and the reactant to predict retrosynthesis more effectively. Specifically, it adopts the same atom as the root (i.e., the starting atom) of the SMILES strings for both the products and the reactants, which successfully resolves the one-to-many problem in random augmentation and enriches the SMILES representation compared to using canonical SMILES.

Figure S7: Predictions with very high molecular similarity ($T_c \geq 0.95$) that are inconsistent with the ground truth. We highlight differences between the ground truth and the prediction.
Table S6: Iterative training yields high-quality in-silico reactions and accurate prediction.

| Method | k = 1 | 3 | 5 | 10 | 20 | 50 |
|---|---|---|---|---|---|---|
| Baseline | 63.8 | 83.0 | 87.6 | 91.7 | 94.1 | 95.1 |
| RetroWISE + Iterative | 64.9 | 83.8 | 88.0 | 91.9 | 94.3 | 96.1 |

Iterative training. RetroWISE is a self-boosting framework and could benefit from iterative training. Specifically, a better base model will result in better in-silico reactions, leading to improved predictions for retrosynthesis. If we can build a better base model with the in-silico reactions, then we can continue repeating this process: utilizing the base model to generate in-silico reactions, and building an even better base model with these reactions to generate higher-quality reactions for training. In other words, the key idea is to build a better base model with previous in-silico reactions for iteratively augmenting real paired data. Table S6 suggests that adding one more iteration enhances the prediction performance of the retrosynthesis model (e.g., +1.1% top-1 accuracy). However, iterative training also has several drawbacks, such as significantly increasing the training and generation time with too many iterations, and introducing more biases during iterative training.

Discussion of highly scored inaccurate predictions. Two chemical structures are typically considered similar if the Tanimoto coefficient ($T_c$) is above 0.85 (Maggiora et al, 2014). In the main manuscript, we presented an inaccurate prediction with a high similarity ($T_c = 0.91$), which demonstrates the prediction diversity of RetroWISE. As illustrated in Fig. S7, we provide more examples from the USPTO-50K test set with even higher similarity ($T_c \geq 0.95$), highlighting some challenges faced by machine learning (ML)-based methods: (1) the tendency of ML-based models to generate unnecessary reagents such as NH$_4^{+}$, HCl, and OH$^{-}$ due to learning bias; and (2) the failure of models to accurately represent stereochemistry, such as using incorrect symbols (/ or \) to denote directional single bonds adjacent to a double bond, or generating the right molecule with incorrect chirality (e.g., C@H vs. C@@H).

Figure S8: Inference time of RetroWISE with different beam sizes. The inference time is measured with a, “Time per product” and b, “Total time”. “Time per product” measures the time required to generate the reactant(s) predictions for one given product using RetroWISE. “Total time” is the time to decode the whole test set into prediction results.

Computational and memory efficiency. The proposed RetroWISE framework prioritizes memory and computational efficiency during inference, which enables further applications such as multistep retrosynthesis planning. RetroWISE employs a transformer architecture with approximately 44.5M parameters as the sequence-based model for USPTO-50K and USPTO-MIT. Compared with previous transformer-based methods such as RetroPrime (Wang et al, 2021), which has 75.4M parameters, RetroWISE is more lightweight and easier to deploy. Fig. S8 illustrates the inference speed on different datasets, measured with a single GPU (GeForce RTX 4090). The time per product varies with the beam size. On USPTO-50K, it is between 8.03 ms and 115.07 ms, while on USPTO-MIT, which has longer sequences, it is between 10.35 ms and 123.98 ms. The total time also depends on the beam size and the dataset. For the USPTO-50K test set with 10,014 products, it varies from 13.41 min to 192.06 min. For the USPTO-MIT test set with 201,325 products, the range is from 34.73 min to 416.11 min. These experimental results highlight the computational and memory efficiency of RetroWISE.
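
As a rough consistency check on the USPTO-MIT figures, the total decoding time should be close to the per-product time multiplied by the number of products. The sketch below assumes products are decoded one after another with negligible fixed overhead, which is an assumption made here and not something the text states; `total_minutes` is a hypothetical helper.

```python
def total_minutes(n_products: int, ms_per_product: float) -> float:
    """Estimated wall-clock minutes to decode n_products sequentially."""
    # milliseconds -> seconds -> minutes
    return n_products * ms_per_product / 1000.0 / 60.0


# USPTO-MIT figures quoted in the text.
N_MIT = 201_325
fastest = total_minutes(N_MIT, 10.35)   # compare with the reported 34.73 min
slowest = total_minutes(N_MIT, 123.98)  # compare with the reported 416.11 min
```

Under this assumption the estimates land within a fraction of a minute of the reported USPTO-MIT range.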

References

  • Baker et al (2023) Baker EA, Schapiro D, Dumitrascu B, et al (2023) In silico tissue generation and power analysis for spatial omics. Nature Methods 20(3):424–431
  • Baylon et al (2019) Baylon JL, Cilfone NA, Gulcher JR, et al (2019) Enhancing retrosynthetic reaction prediction with deep learning using multiscale reaction classification. Journal of chemical information and modeling 59(2):673–688
  • Blakemore et al (2018) Blakemore DC, Castro L, Churcher I, et al (2018) Organic synthesis provides opportunities to transform drug discovery. Nature chemistry 10(4):383–394
  • Born and Manica (2023) Born J, Manica M (2023) Regression transformer enables concurrent sequence regression and generation for molecular language modelling. Nature Machine Intelligence 5(4):432–444
  • Castro et al (2022) Castro E, Godavarthi A, Rubinfien J, et al (2022) Transformer-based protein generation with regularized latent space optimization. Nature Machine Intelligence 4(10):840–851
  • Cereto-Massagué et al (2015) Cereto-Massagué A, Ojeda MJ, Valls C, et al (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63
  • Chen and Jung (2021) Chen S, Jung Y (2021) Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au 1(10):1612–1620
  • Coley et al (2017) Coley CW, Rogers L, Green WH, et al (2017) Computer-assisted retrosynthesis based on molecular similarity. ACS central science 3(12):1237–1245
  • Coley et al (2019) Coley CW, Green WH, Jensen KF (2019) Rdchiral: An rdkit wrapper for handling stereochemistry in retrosynthetic template extraction and application. Journal of chemical information and modeling 59(6):2529–2537
  • Corey and Wipke (1969) Corey EJ, Wipke WT (1969) Computer-assisted design of complex organic syntheses: Pathways for molecular synthesis can be devised with a computer and equipment for graphical communication. Science 166(3902):178–192
  • Corey et al (1985) Corey EJ, Long AK, Rubenstein SD (1985) Computer-assisted analysis in organic synthesis. Science 228(4698):408–418
  • Dai et al (2019) Dai H, Li C, Coley C, et al (2019) Retrosynthesis prediction with conditional graph logic network. In: Advances in Neural Information Processing Systems
  • Dubrovskiy et al (2018) Dubrovskiy AV, Kesharwani T, Markina NA, et al (2018) Comprehensive Organic Transformations, 4 Volume Set: A Guide to Functional Group Preparations, vol 1
  • Durant et al (2002) Durant JL, Leland BA, Henry DR, et al (2002) Reoptimization of mdl keys for use in drug discovery. Journal of chemical information and computer sciences 42(6):1273–1280
  • Gao et al (2023) Gao C, Killeen BD, Hu Y, et al (2023) Synthetic data accelerates the development of generalizable learning-based algorithms for x-ray image analysis. Nature Machine Intelligence 5(3):294–308
  • Hendrickson (1991) Hendrickson JB (1991) Concepts and applications of molecular similarity. Science 252(5009):1189–1190
  • Irwin et al (2020) Irwin JJ, Tang KG, Young J, et al (2020) Zinc20—a free ultralarge-scale chemical database for ligand discovery. Journal of chemical information and modeling 60(12):6065–6073
  • Irwin et al (2022) Irwin R, Dimitriadis S, He J, et al (2022) Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology 3(1):015022
  • Jin et al (2017) Jin W, Coley C, Barzilay R, et al (2017) Predicting organic reaction outcomes with weisfeiler-lehman network. In: Advances in neural information processing systems
  • Karpov et al (2019) Karpov P, Godin G, Tetko IV (2019) A transformer model for retrosynthesis. In: Artificial Neural Networks and Machine Learning–ICANN 2019: Workshop and Special Sessions: 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings, pp 817–830
  • Kim et al (2021) Kim E, Lee D, Kwon Y, et al (2021) Valid, plausible, and diverse retrosynthesis using tied two-way transformers with latent variables. Journal of Chemical Information and Modeling 61(1):123–133
  • Kim et al (2019) Kim S, Chen J, Cheng T, et al (2019) Pubchem 2019 update: improved access to chemical data. Nucleic acids research 47(D1):D1102–D1109
  • Kingma and Ba (2017) Kingma DP, Ba J (2017) Adam: A method for stochastic optimization. In: International conference on machine learning
  • Klein et al (2017) Klein G, Kim Y, Deng Y, et al (2017) Opennmt: Open-source toolkit for neural machine translation. In: Proceedings of ACL 2017, System Demonstrations, pp 67–72
  • Landrum et al (2013) Landrum G, et al (2013) Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum 8
  • Lawson et al (2014) Lawson AJ, Swienty-Busch J, Géoui T, et al (2014) The making of reaxys—towards unobstructed access to relevant chemistry information. In: The Future of the History of Chemical Information. p 127–148
  • Lin et al (2020) Lin K, Xu Y, Pei J, et al (2020) Automatic retrosynthetic route planning using template-free models. Chemical science 11(12):3355–3364
  • Liu et al (2017) Liu B, Ramsundar B, Kawthekar P, et al (2017) Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS central science 3(10):1103–1113
  • Lowe (2017) Lowe D (2017) Chemical reactions from US patents (1976-Sep2016). doi: 10.6084/m9.figshare.5104873.v1
  • Lowe (2012) Lowe DM (2012) Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge
  • Maggiora et al (2014) Maggiora G, Vogt M, Stumpfe D, et al (2014) Molecular similarity in medicinal chemistry: miniperspective. Journal of medicinal chemistry 57(8):3186–3204
  • Marouf et al (2020) Marouf M, Machart P, Bansal V, et al (2020) Realistic in silico generation and augmentation of single-cell rna-seq data using generative adversarial networks. Nature communications 11(1):166
  • Mikulak-Klucznik et al (2020) Mikulak-Klucznik B, Gołębiowska P, Bayly AA, et al (2020) Computational planning of the synthesis of complex natural products. Nature 588(7836):83–88
  • Muegge and Mukherjee (2016) Muegge I, Mukherjee P (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert opinion on drug discovery 11(2):137–148
  • Nikolova and Jaworska (2003) Nikolova N, Jaworska J (2003) Approaches to measure chemical similarity–a review. QSAR & Combinatorial Science 22(9-10):1006–1026
  • Paszke et al (2019) Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems
  • Rodrigues (2019) Rodrigues T (2019) The good, the bad, and the ugly in chemical and biological data for machine learning. Drug Discovery Today: Technologies 32:3–8
  • Rogers and Hahn (2010) Rogers D, Hahn M (2010) Extended-connectivity fingerprints. Journal of chemical information and modeling 50(5):742–754
  • Sacha et al (2021) Sacha M, Błaż M, Byrski P, et al (2021) Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Journal of Chemical Information and Modeling 61(7):3273–3284
  • Schneider et al (2016) Schneider N, Stiefl N, Landrum GA (2016) What’s what: The (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling 56(12):2336–2346
  • Schwaller et al (2021) Schwaller P, Probst D, Vaucher AC, et al (2021) Mapping the space of chemical reactions using attention-based neural networks. Nature machine intelligence 3(2):144–152
  • Segler and Waller (2017) Segler MH, Waller MP (2017) Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal 23(25):5966–5971
  • Segler et al (2018) Segler MH, Preuss M, Waller MP (2018) Planning chemical syntheses with deep neural networks and symbolic ai. Nature 555(7698):604–610
  • Seo et al (2021) Seo SW, Song YY, Yang JY, et al (2021) Gta: Graph truncated attention for retrosynthesis. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 531–539
  • Shi et al (2020) Shi C, Xu M, Guo H, et al (2020) A graph to graphs framework for retrosynthesis prediction. In: International conference on machine learning, pp 8818–8827
  • Somnath et al (2021) Somnath VR, Bunne C, Coley C, et al (2021) Learning graph models for retrosynthesis prediction. In: Advances in Neural Information Processing Systems, pp 9405–9415
  • Srivastava et al (2014) Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958
  • Sun et al (2021) Sun R, Dai H, Li L, et al (2021) Towards understanding retrosynthesis by energy-based models. In: Advances in Neural Information Processing Systems, pp 10186–10194
  • Tetko et al (2020) Tetko IV, Karpov P, Van Deursen R, et al (2020) State-of-the-art augmented nlp transformer models for direct and single-step retrosynthesis. Nature communications 11(1):5575
  • Toniato et al (2021) Toniato A, Schwaller P, Cardinale A, et al (2021) Unassisted noise reduction of chemical reaction datasets. Nature Machine Intelligence 3(6):485–494
  • Tu and Coley (2022) Tu Z, Coley CW (2022) Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. Journal of chemical information and modeling 62(15):3503–3513
  • Ucak et al (2022) Ucak UV, Ashyrmamatov I, Ko J, et al (2022) Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments. Nature communications 13(1):1186
  • Vaswani et al (2017) Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Advances in neural information processing systems
  • Wan et al (2022) Wan Y, Hsieh CY, Liao B, et al (2022) Retroformer: Pushing the limits of end-to-end retrosynthesis transformer. In: International Conference on Machine Learning, pp 22475–22490
  • Wang et al (2021) Wang X, Li Y, Qiu J, et al (2021) Retroprime: A diverse, plausible and transformer-based method for single-step retrosynthesis predictions. Chemical Engineering Journal 420:129845
  • Weininger (1988) Weininger D (1988) Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28(1):31–36
  • Willett et al (1998) Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. Journal of chemical information and computer sciences 38(6):983–996
  • Yan et al (2020) Yan C, Ding Q, Zhao P, et al (2020) Retroxpert: Decompose retrosynthesis prediction like a chemist. In: Advances in Neural Information Processing Systems, pp 11248–11258
  • Yang et al (2022) Yang H, Li J, Lim KZ, et al (2022) Automatic strain sensor design via active learning and data augmentation for soft machines. Nature Machine Intelligence 4(1):84–94
  • Yu et al (2023) Yu T, Boob AG, Volk MJ, et al (2023) Machine learning-enabled retrobiosynthesis of molecules. Nature Catalysis 6(2):137–151
  • Zhong et al (2023) Zhong W, Yang Z, Chen CYC (2023) Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nature Communications 14(1):3009
  • Zhong et al (2022) Zhong Z, Song J, Feng Z, et al (2022) Root-aligned smiles: a tight representation for chemical reaction prediction. Chemical Science 13(31):9023–9034