XGBoost R Tutorial



Introduction
Xgboost is short for eXtreme Gradient Boosting package.
The purpose of this Vignette is to show you how to use Xgboost to build a model and make predictions.

It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:

- linear model
- tree learning algorithm.

It supports various objective functions, including regression, classification and ranking. The package is made to be extendible, so that users are also allowed to define their own objective functions easily.

It has been used to win several Kaggle competitions.

It has several features:

- Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is generally over 10 times faster than the classical gbm.
- Input Type: it takes several types of input data:
  - Dense Matrix: R's dense matrix, i.e. matrix
  - Sparse Matrix: R's sparse matrix, i.e. Matrix::dgCMatrix
  - Data File: local data files
  - xgb.DMatrix: its own class (recommended).
- Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized for sparse input.
- Customization: it supports customized objective functions and evaluation functions.

Installation

Github version

For a weekly updated version (highly recommended), install from Github:

install.packages("drat",repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost",repos="http://dmlc.ml/drat/",type="source")

Windows users will need to install Rtools first.


CRAN version

The version 0.4-2 is on CRAN, and you can install it by:

install.packages("xgboost")

Formerly available versions can be obtained from the CRAN archive.

Learning
For the purpose of this tutorial we will load the XGBoost package.

require(xgboost)

Dataset presentation

In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as those you will use in your everyday life :-) ).
Mushroom data is cited from the UCI Machine Learning Repository. @Bache+Lichman:2013.

Dataset loading

We will load the agaricus datasets embedded within the package and will link them to variables.

The datasets are already split in:

- train: will be used to build the model
- test: will be used to assess the quality of our model.

Why split the dataset in two parts?

In the first part we will build our model. In the second part we will want to test it and assess its quality. Without dividing the dataset we would test the model on the data which the algorithm has already seen.

data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test

In the real world, it would be up to you to make this division between train and test data. The way to do it is beyond the scope of this article; however, the caret package may help.
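
For illustration, a stratified split with caret might look like the following minimal sketch (the data frame df and outcome column y are hypothetical, not part of this tutorial's data):

library(caret)

# hypothetical data frame `df` with outcome column `y`
set.seed(42)
idx <- createDataPartition(df$y, p = 0.8, list = FALSE)  # 80/20 split, stratified on y
train_df <- df[idx, ]
test_df  <- df[-idx, ]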


Each variable is a list containing two things, label and data:

str(train)

## List of 2
##  $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
##   .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
##   .. ..@ Dim     : int [1:2] 6513 126
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" ...
##   .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..@ factors : list()
##  $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...

label is the outcome of our dataset, meaning it is the binary classification we will try to predict.

Let's discover the dimensionality of our datasets.

dim(train$data)

## [1] 6513  126

dim(test$data)

## [1] 1611  126

This dataset is very small so as not to make the R package too heavy, however XGBoost is built to manage huge datasets very efficiently.

As seen below, the data are stored in a dgCMatrix which is a sparse matrix, and the label vector is a numeric vector ({0,1}):

class(train$data)[1]

##[1]"dgCMatrix"

class(train$label)

##[1]"numeric"

Basic Training using XGBoost

This step is the most critical part of the process for the quality of our model.

Basic training

We are using the train data. As explained above, both data and label are stored in a list.

In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly made of 0, memory size is reduced. It is very common to have such a dataset.
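
As a small demonstration of this point (a sketch using only the Matrix package, which xgboost already depends on; the matrix here is synthetic):

library(Matrix)

set.seed(1)
m <- matrix(0, nrow = 1000, ncol = 100)  # dense: all 100,000 cells stored
m[sample(length(m), 500)] <- 1           # only 0.5% of cells are non-zero
sm <- Matrix(m, sparse = TRUE)           # a dgCMatrix storing only the non-zeros

print(object.size(m))   # roughly 800 KB
print(object.size(sm))  # only a few KB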

We will train a decision tree model using the following parameters:

- objective = "binary:logistic": we will train a binary classification model;
- max.depth = 2: the trees won't be deep, because our case is very simple;
- nthread = 2: the number of CPU threads we are going to use;
- nround = 2: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.

bstSparse<xgboost(data=train$data,label=train$label,max.depth=2,eta=1,nthread

## [0]	train-error:0.046522
## [1]	train-error:0.022263

The more complex the relationship between your features and your label is, the more passes you need.

Parameter variations

Dense matrix

Alternatively, you can put your dataset in a dense matrix, i.e. a basic R matrix.

bstDense<xgboost(data=as.matrix(train$data),label=train$label,max.depth=2,eta=

##[0]trainerror:0.046522
##[1]trainerror:0.022263

xgb.DMatrix

XGBoost offers a way to group them in an xgb.DMatrix. You can even add other metadata to it. This will be useful for the most advanced features we will discover later.

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0]	train-error:0.046522
## [1]	train-error:0.022263
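
As an aside on metadata (a minimal sketch; the uniform weights below are made up for illustration), fields such as per-row weights can be attached with setinfo and read back with getinfo:

# hypothetical per-row weights, attached as metadata
w <- rep(1, nrow(train$data))
setinfo(dtrain, "weight", w)
head(getinfo(dtrain, "weight"))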

Verbose option

XGBoost has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model's quality.


One of the simplest ways to see the training progress is to set the verbose option (see below for more advanced techniques).

# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 0)

# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 1)

## [0]	train-error:0.046522
## [1]	train-error:0.022263

# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)

## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes
## [0]	train-error:0.046522
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes
## [1]	train-error:0.022263

Basic prediction using XGBoost

Perform the prediction

The purpose of the model we have built is to classify new data. As explained before, we will use the test dataset for this step.

pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

## [1] 1611

# limit display of predictions to the first few
print(head(pred))

## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391

These numbers don't look like binary classification {0,1}. We need to perform a simple transformation before being able to use these results.


Transform the regression into a binary classification

The only thing that XGBoost does is a regression. XGBoost uses the label vector to build its regression model.

How can we use a regression model to perform a binary classification?

If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as 1. Therefore, we will set the rule that if this probability for a specific datum is > 0.5 then the observation is classified as 1 (and 0 otherwise).

prediction<as.numeric(pred>0.5)
print(head(prediction))

##[1]010001

Measuring model performance

To measure the model performance, we will compute a simple metric, the average error.

err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))

## [1] "test-error= 0.0217256362507759"

Note that the algorithm has not seen the test data during the model construction.

Steps explanation:

1. as.numeric(pred > 0.5) applies our rule that when the probability (<=> regression <=> prediction) is > 0.5 the observation is classified as 1, and 0 otherwise;
2. probabilityVectorPreviouslyComputed != test$label computes the vector of errors between true data and computed probabilities;
3. mean(vectorOfErrors) computes the average error itself.

The most important thing to remember is that to do a classification, you just do a regression to the label and then apply a threshold.

Multiclass classification works in a similar way; a hedged sketch follows below.
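
Here is that idea as a minimal sketch (the 3-class data below is synthetic, and num_class must match the number of classes; with objective = "multi:softmax", predict returns class indices directly, so no threshold is needed):

# synthetic 3-class example, for illustration only
set.seed(123)
n <- 100
x <- matrix(rnorm(n * 4), ncol = 4)
y <- sample(0:2, n, replace = TRUE)  # labels must be integers in 0..num_class-1

bst_mc <- xgboost(data = x, label = y, max.depth = 2, eta = 1, nround = 2, objective = "multi:softmax", num_class = 3)

head(predict(bst_mc, x))  # predicted class indices, e.g. 0, 2, 1, ...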

The error metric computed above is 0.02, which is pretty low: our yummly mushroom model works well!

Advanced features


Most of the features below have been implemented to help you improve your model by offering a better understanding of its content.

Dataset preparation

For the following advanced features, we need to put data in an xgb.DMatrix as explained above.

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)

Measure learning progress with xgb.train

Both xgboost (simple) and xgb.train (advanced) functions train models.

One of the special features of xgb.train is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a moment when having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you to avoid overfitting, and to optimize the learning time by stopping it as soon as possible.

One way to measure progress in the learning of a model is to provide to XGBoost a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

In some way it is similar to what we have done above with the average error. The main difference is that above it was done after building the model, and now it is during the construction that we measure errors.

For the purpose of this example, we use the watchlist parameter. It is a list of xgb.DMatrix objects, each of them tagged with a name.

watchlist <- list(train = dtrain, test = dtest)

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")

## [0]	train-error:0.046522	test-error:0.042831
## [1]	train-error:0.022263	test-error:0.021726

XGBoost has computed at each round the same average error metric as seen above (we set nround to 2, which is why we have two lines). Obviously, the train-error number is related to the training dataset (the one the algorithm learns from) and the test-error number to the test dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.


If with your own dataset you do not have such results, you should think about how you divided your dataset into training and test sets. Maybe there is something to fix. Again, the caret package may help.
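
As noted above, too many rounds can lead to overfitting. As a hedged sketch (assuming a release of xgboost where xgb.train accepts an early_stopping_rounds argument; older versions spelled it early.stop.round), training can be stopped automatically once the last watchlist metric stops improving:

bst_es <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 50, watchlist = watchlist, early_stopping_rounds = 5, objective = "binary:logistic")
# stops before round 50 if test-error has not improved for 5 consecutive rounds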

For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

bst<xgb.train(data=dtrain,max.depth=2,eta=1,nthread=2,nround=2,watchlist=watchlist

##[0]trainerror:0.046522trainlogloss:0.233376testerror:0.042831testlogloss:0.22668
##[1]trainerror:0.022263trainlogloss:0.136658testerror:0.021726testlogloss:0.13787

eval.metric allows us to monitor two new metrics for each round, logloss and error.

Linear boosting

Until now, all the learning we have performed was based on boosting trees. XGBoost implements a second algorithm, based on linear boosting. The only difference from the previous command is the booster = "gblinear" parameter (and removing the eta parameter).

bst<xgb.train(data=dtrain,booster="gblinear",max.depth=2,nthread=2,nround=2,watchlis

##[0]trainerror:0.024720trainlogloss:0.184616testerror:0.022967testlogloss:0.18423
##[1]trainerror:0.004146trainlogloss:0.069885testerror:0.003724testlogloss:0.06808

In this specific case, linear boosting gets slightly better performance metrics than the decision-tree-based algorithm.

In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.

Manipulating xgb.DMatrix

Save / Load

Like saving models, the xgb.DMatrix object (which groups both dataset and outcome) can also be saved using the xgb.DMatrix.save function.

xgb.DMatrix.save(dtrain, "dtrain.buffer")

## [1] TRUE

# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")

## [11:41:01] 6513x126 matrix with 143286 entries loaded from dtrain.buffer

bst <- xgb.train(data = dtrain2, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")

## [0]	train-error:0.046522	test-error:0.042831
## [1]	train-error:0.022263	test-error:0.021726

Information extraction

Information can be extracted from an xgb.DMatrix using the getinfo function. Hereafter we will extract the label data.

label <- getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))

## [1] "test-error= 0.0217256362507759"

View feature importance/influence from the learnt model

Feature importance is similar to the R gbm package's relative influence (rel.inf).

importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
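
If your input matrix had no column names, the importance table may show feature indices instead; in that case (a hedged note, not a required step here, since the agaricus matrix carries column names) you can supply them explicitly:

importance_matrix <- xgb.importance(feature_names = colnames(train$data), model = bst)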

View the trees from a model

You can dump the trees you learned using xgb.dump into a text file.

xgb.dump(bst, with.stats = TRUE)

##[1]"booster[0]"
##[2]"0:[f28<1.00136e05]yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
##[3]"1:[f55<1.00136e05]yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
##[4]"3:leaf=1.71218,cover=812"
##[5]"4:leaf=1.70044,cover=112.5"
##[6]"2:[f108<1.00136e05]yes=5,no=6,missing=5,gain=198.174,cover=703.75"
##[7]"5:leaf=1.94071,cover=690.5"
##[8]"6:leaf=1.85965,cover=13.25"
##[9]"booster[1]"


## [10] "0:[f59<1.00136e-05] yes=1,no=2,missing=1,gain=832.545,cover=788.852"
## [11] "1:[f28<1.00136e-05] yes=3,no=4,missing=3,gain=569.725,cover=768.39"
## [12] "3:leaf=0.784718,cover=458.937"
## [13] "4:leaf=-0.96853,cover=309.453"
## [14] "2:leaf=-6.23624,cover=20.4624"

You can plot the trees from your model using xgb.plot.tree:

xgb.plot.tree(model = bst)

If you provide a path to the fname parameter, you can save the trees to your hard drive; a hedged example follows.
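
For example (assuming this refers to the fname argument of xgb.dump, which writes the text dump to disk; the file name is arbitrary):

# write the tree dump to a text file instead of returning it
xgb.dump(bst, fname = "xgb.model.dump", with.stats = TRUE)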

Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

Hopefully for you, XGBoost implements such functions.

# save model to binary local file
xgb.save(bst, "xgboost.model")

## [1] TRUE

The xgb.save function should return TRUE if everything goes well, and crashes otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.

# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))

## [1] "sum(abs(pred2-pred))= 0"

