Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships

Robert P Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M Gifford

doi:10.1021/acs.jcim.6b00591

J Chem Inf Model. 2016 Dec 27;56(12):2353-2360. doi: 10.1021/acs.jcim.6b00591. Epub 2016 Dec 13.

Authors

Robert P Sheridan¹, Wei Min Wang², Andy Liaw³, Junshui Ma³, Eric M Gifford⁴

Affiliations

¹ Modeling and Informatics Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States.
² Data Science Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522.
³ Biometrics Research Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States.
⁴ Bioinformatics Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522.

PMID: 27958738

Abstract

In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
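
For readers unfamiliar with boosting, the general idea behind the method compared in the abstract can be sketched in a few lines. The block below is a toy illustration only: it implements plain gradient boosting with one-split regression trees (stumps) under squared error, not XGBoost itself, which adds regularization and other refinements. The data are synthetic, all function names are hypothetical, and the values of n_rounds and learning_rate are arbitrary placeholders, since the abstract does not list the paper's standard parameters.

```python
import random


def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) for the
    single split that best fits the residuals in the least-squares sense."""
    n, d = len(X), len(X[0])
    total = sum(residuals)
    mean = total / n
    # Fallback: no usable split, predict the mean residual everywhere.
    best = (-1.0, 0, float("inf"), mean, mean)
    for j in range(d):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between identical values
            right_sum = total - left_sum
            # Maximizing this term is equivalent to minimizing squared error.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
            if gain > best[0]:
                best = (gain, j, lo, left_sum / k, right_sum / (n - k))
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, left_value, right_value = stump
    return left_value if x[feature] <= threshold else right_value


def fit_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, X, y):
    return sum((predict(model, x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for a descriptor matrix and activity values."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(3)] for _ in range(n)]
    y = [2 * x[0] + (1.0 if x[1] > 0.5 else 0.0) + rng.gauss(0, 0.05)
         for x in X]
    return X, y


X, y = make_toy_data()
mean_only = fit_boosting(X, y, n_rounds=0)
boosted = fit_boosting(X, y, n_rounds=50)
print("training MSE, mean only:", mse(mean_only, X, y))
print("training MSE, 50 rounds:", mse(boosted, X, y))
```

Each round fits a small tree to the errors left by the previous rounds, which is why the trees must be built sequentially. This contrasts with random forest, where, as the abstract notes, each tree can be generated in parallel.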

MeSH terms

Algorithms
Databases, Pharmaceutical
Drug Discovery
Humans
Models, Biological
Quantitative Structure-Activity Relationship*
Software