XGBoost in R Programming
XGBoost stands for "Extreme Gradient Boosting" and is a popular machine learning algorithm available in several programming languages, including R. It is fast and efficient, has been used by the winners of many machine learning competitions, and works only with numeric variables, so categorical features must be encoded first. It is widely used for both classification and regression tasks.
In this article, we will learn what XGBoost is and how to use the XGBoost algorithm in R, using a dataset from a Big Mart that records attributes of various products and stores. You will also see which features turn out to be important in the trained XGBoost model.
What is XGBoost?
It is part of the boosting family of ensemble techniques, in which samples are selected more intelligently: progressively more weight is given to observations that are hard to classify. There are interfaces to XGBoost in C++, R, Python, Julia, Java, and Scala. The core functions of XGBoost are implemented in C++, so it is easy to share models among the different interfaces. Based on statistics from the CRAN mirror, the package has been downloaded more than 81,000 times. Ensemble modeling consists of two techniques: Bagging and Boosting.
- Bagging: An approach in which random samples of the data are drawn, a learner is built on each sample, and the individual predictions are averaged (or voted on) to produce the final prediction.
- Boosting: An approach in which learners are built sequentially and samples are selected more intelligently, i.e. progressively more weight is given to observations that earlier learners classified poorly. A short illustrative sketch of the two approaches follows this list.
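To make the distinction concrete, here is a minimal, illustrative sketch on R's built-in mtcars data. It assumes the randomForest package is installed in addition to xgboost and is not part of the Big Mart workflow used later in this article.
R
# Bagging vs. boosting on the built-in mtcars data (illustrative only)
library(randomForest)  # bagging of decision trees (random forest)
library(xgboost)       # gradient boosting

# Bagging: 200 trees grown on bootstrap samples, predictions averaged
bag_model = randomForest(mpg ~ ., data = mtcars, ntree = 200)

# Boosting: trees added sequentially, each one correcting the previous ones
dtrain = xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
boost_model = xgb.train(params = list(objective = "reg:squarederror", eta = 0.1),
                        data = dtrain, nrounds = 50)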
How to use the XGBoost algorithm in R?
Parameters used in XGBoost
- eta: It shrinks the feature weights to make the boosting process more conservative. The range is from 0 to 1. It is also known as the learning rate or shrinkage factor. A low eta value makes the model more robust to overfitting.
- gamma: The larger the value of gamma, the more conservative the algorithm will be. Its range is from 0 to infinity.
- max_depth: The maximum depth of a tree can be specified using the max_depth parameter.
- subsample: It is the proportion of rows that the model randomly selects to grow each tree.
- colsample_bytree: It is the ratio of variables randomly chosen to build each tree in the model. A short sketch of how these parameters are passed to XGBoost follows this list.
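In practice these parameters are collected into a named list that is then passed to the training functions. The sketch below is illustrative only; the values match the ones used later in this article but are not tuned for any particular dataset.
R
# Illustrative parameter list for a regression task
param_list = list(
  objective        = "reg:squarederror",  # regression objective ("reg:linear" in older versions)
  eta              = 0.01,                # learning rate / shrinkage
  gamma            = 1,                   # minimum loss reduction required to make a split
  max_depth        = 6,                   # maximum depth of each tree
  subsample        = 0.8,                 # fraction of rows sampled for each tree
  colsample_bytree = 0.5                  # fraction of columns sampled for each tree
)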
The Dataset
A Big Mart dataset consists of 1559 products across 10 stores in different cities. Certain attributes of each product and store have been defined. It consists of 12 features:
- Item_Identifier: a unique product ID assigned to every distinct item.
- Item_Weight: the weight of the product.
- Item_Fat_Content: describes whether the product is low fat or not.
- Item_Visibility: the percentage of the total display area of all products in a store allocated to the particular product.
- Item_Type: the food category to which the item belongs.
- Item_MRP: Maximum Retail Price (list price) of the product.
- Outlet_Identifier: a unique store ID, an alphanumeric string of length 6.
- Outlet_Establishment_Year: the year in which the store was established.
- Outlet_Size: the size of the store in terms of ground area covered.
- Outlet_Location_Type: the size of the city in which the store is located.
- Outlet_Type: whether the outlet is just a grocery store or some sort of supermarket.
- Item_Outlet_Sales: sales of the product in the particular store.
R
# Loading package
library(data.table)  # provides fread() for fast file reading

# Loading data
train = fread("Train_UWu5bXk.csv")
test = fread("Test_u94Q5KV.csv")

# Structure of the training data
str(train)
Output:
Performing XGBoost on Dataset
We now apply the XGBoost algorithm to the dataset, which includes 12 features for 1559 products across 10 stores in different cities.
R
# Installing Packages
install.packages("data.table")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("caret")
install.packages("xgboost")
install.packages("e1071")
install.packages("cowplot")

# Loading packages
library(data.table) # for reading and manipulation of data
library(dplyr)      # for data manipulation and joining
library(ggplot2)    # for plotting
library(caret)      # for modeling
library(xgboost)    # for building XGBoost model
library(e1071)      # for skewness
library(cowplot)    # for combining multiple plots

# Combining datasets
# add Item_Outlet_Sales to test data so train and test can be stacked
test[, Item_Outlet_Sales := NA]
combi = rbind(train, test)

# Missing Value Treatment
# impute missing Item_Weight with the mean weight of the same item
missing_index = which(is.na(combi$Item_Weight))
for (i in missing_index) {
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] = mean(combi$Item_Weight[combi$Item_Identifier == item],
                              na.rm = T)
}

# Replacing 0 in Item_Visibility with the mean visibility of the same item
zero_index = which(combi$Item_Visibility == 0)
for (i in zero_index) {
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] = mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                  na.rm = T)
}

# Label Encoding
# To convert ordinal categorical variables into numeric codes
combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                           ifelse(Outlet_Size == "Medium", 1, 2))]
combi[, Outlet_Location_Type_num := ifelse(Outlet_Location_Type == "Tier 3", 0,
                                    ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]

# One Hot Encoding
# To convert the remaining categorical variables into numeric dummy variables
ohe_1 = dummyVars("~.",
                  data = combi[, -c("Item_Identifier",
                                    "Outlet_Establishment_Year",
                                    "Item_Type")],
                  fullRank = T)
ohe_df = data.table(predict(ohe_1,
                            combi[, -c("Item_Identifier",
                                       "Outlet_Establishment_Year",
                                       "Item_Type")]))
combi = cbind(combi[, "Item_Identifier"], ohe_df)

# Removing skewness
skewness(combi$Item_Visibility)
# log(x + 1) transformation to reduce right skew (the + 1 handles zero values)
combi[, Item_Visibility := log(Item_Visibility + 1)]

# Scaling and Centering data
# index of numeric features
num_vars = which(sapply(combi, is.numeric))
num_vars_names = names(num_vars)
combi_numeric = combi[, setdiff(num_vars_names, "Item_Outlet_Sales"), with = F]
prep_num = preProcess(combi_numeric, method = c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)

# replacing the raw numeric predictors with their scaled versions
combi[, setdiff(num_vars_names, "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)

# Splitting data back to train and test
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]

# Removing Item_Outlet_Sales from test (it was added only for the rbind)
test[, Item_Outlet_Sales := NULL]

# Model Building: XGBoost
param_list = list(
  objective = "reg:linear",   # use "reg:squarederror" on newer xgboost versions
  eta = 0.01,
  gamma = 1,
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.5
)

# Converting train and test into xgb.DMatrix format
Dtrain = xgb.DMatrix(data = as.matrix(train[, -c("Item_Identifier",
                                                 "Item_Outlet_Sales")]),
                     label = train$Item_Outlet_Sales)
Dtest = xgb.DMatrix(data = as.matrix(test[, -c("Item_Identifier")]))

# 5-fold cross-validation to find the optimal value of nrounds
set.seed(112) # Setting seed
xgbcv = xgb.cv(params = param_list,
               data = Dtrain,
               nrounds = 1000,
               nfold = 5,
               print_every_n = 10,
               early_stopping_rounds = 30,
               maximize = F)

# Training XGBoost model at nrounds = 428
xgb_model = xgb.train(data = Dtrain, params = param_list, nrounds = 428)
xgb_model

# Variable Importance
var_imp = xgb.importance(feature_names = setdiff(names(train),
                                                 c("Item_Identifier",
                                                   "Item_Outlet_Sales")),
                         model = xgb_model)

# Importance plot
xgb.plot.importance(var_imp)
Output:
- Training of the XGBoost model:
The XGBoost model is trained by computing the train-RMSE and test-RMSE scores at each round and finding the round at which the test-RMSE is lowest; see the inspection sketch after this list.
- Model xgb_model:
The trained XGBoost model uses 21 features with objective reg:linear, eta = 0.01, gamma = 1, max_depth = 6, subsample = 0.8, colsample_bytree = 0.5, and silent = 1.
- Variable Importance plot:
Item_MRP is the most important variable, followed by Item_Visibility and Outlet_Location_Type_num.
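To check these results programmatically rather than from the console output, the following sketch (assuming the xgbcv, var_imp, and xgb_model objects created above) reads the cross-validation log and the importance table.
R
# Round with the lowest mean test-RMSE across the 5 folds
xgbcv$best_iteration
head(xgbcv$evaluation_log)   # per-round train-rmse and test-rmse (mean and sd)

# Importance table: Gain, Cover, and Frequency for each feature
head(var_imp)
xgb.plot.importance(var_imp, top_n = 10)   # plot only the top 10 features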
These are the general steps to use XGBoost in R. Keep in mind that the specific details of the workflow will depend on your dataset and the problem you are trying to solve. XGBoost provides powerful tools for building predictive models in R.
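As a final step, the trained model can be used to score the held-out test set. The sketch below assumes the xgb_model, Dtest, and test objects created earlier in the workflow.
R
# Predicting Item_Outlet_Sales for the test set
pred = predict(xgb_model, Dtest)

# Pairing predictions with their product IDs
submission = data.table(Item_Identifier = test$Item_Identifier,
                        Item_Outlet_Sales = pred)
head(submission)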