Python Implementation of Random Forest Algorithm
Now we will implement the Random Forest algorithm using Python. For this, we will
use the same dataset, "user_data.csv", that we used in the previous classification
models. Using the same dataset lets us compare the Random Forest classifier with
other classification models such as the Decision Tree classifier, KNN, SVM, and
Logistic Regression.
# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('user_data.csv')

# Extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling: fit on the training set only, then reuse the same scaler on the test set
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
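To make the feature-scaling step concrete, here is a minimal NumPy-only sketch of what StandardScaler's fit_transform and transform compute. The toy numbers are hypothetical and are not taken from user_data.csv; the point is only that the mean and standard deviation are learned from the training set and then reused for the test set.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
x_train = np.array([[25.0, 20000.0],
                    [35.0, 50000.0],
                    [45.0, 80000.0]])
x_test = np.array([[30.0, 65000.0]])

# "fit": learn the per-column mean and standard deviation from the training set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the training statistics to both sets.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print(x_train_scaled.mean(axis=0))  # approximately 0 for each column
print(x_train_scaled.std(axis=0))   # approximately 1 for each column
print(x_test_scaled)
```

Reusing the training statistics on the test set keeps information about the test data out of the pre-processing step.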
In the pre-processing code above, we loaded the dataset, extracted the independent and
dependent variables, split the data into training and test sets, and scaled the features.
2. Fitting the Random Forest algorithm to the training set:
Now we will fit the Random Forest algorithm to the training set. To do so, we import
the RandomForestClassifier class from the sklearn.ensemble module, create the classifier,
and fit it to x_train and y_train.
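A minimal sketch of this fitting step follows. Because user_data.csv is not available here, it assumes synthetic two-feature data from make_classification as a stand-in; the variable name classifier and the random_state=0 setting are choices made for this illustration, while n_estimators=10 and criterion="entropy" are the values discussed in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic two-feature data stands in for user_data.csv.
x, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Same scaling pattern as in the pre-processing step.
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# 10 trees, entropy as the split criterion; random_state is fixed only for repeatability.
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy",
                                    random_state=0)
classifier.fit(x_train, y_train)

# Predict the test set and inspect the fitted forest.
y_pred = classifier.predict(x_test)
print(len(classifier.estimators_))  # number of fitted trees
print(y_pred[:10])
```

With the real dataset, x_train and y_train from the pre-processing step would replace the synthetic arrays.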
o n_estimators= The number of trees in the Random Forest. The default value was 10 in
older scikit-learn releases and is 100 from version 0.22 onward; here we use 10. We can
choose any number: more trees generally give more stable predictions but take longer
to train.
o criterion= The function used to measure the quality of a split. Here we have taken
"entropy" for the information gain.
Output:
As the confusion matrix for the test set shows, there are 4 + 4 = 8 incorrect predictions
and 64 + 28 = 92 correct predictions out of 100 test samples.
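The arithmetic can be checked with a short sketch. The matrix layout below is an assumption reconstructed from the counts in the text (correct predictions on the diagonal, incorrect ones off it); the text does not say which class each diagonal entry belongs to.

```python
# Confusion matrix implied by the counts in the text
# (rows: actual class, columns: predicted class).
# Assumption: 64 and 28 are the diagonal entries, 4 and 4 the off-diagonal ones.
cm = [[64, 4],
      [4, 28]]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
total = sum(sum(row) for row in cm)              # all test samples
incorrect = total - correct
accuracy = correct / total

print(correct, incorrect, accuracy)  # 92 8 0.92
```

With a fitted model, the same matrix would normally come from sklearn.metrics.confusion_matrix(y_test, y_pred).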
Output:
The above image visualizes the Random Forest classifier on the training set. It is very
similar to the result for the Decision Tree classifier. Each data point corresponds to
one user in user_data, and the purple and green regions are the prediction regions: the
purple region covers users predicted not to purchase the SUV, and the green region
covers users predicted to purchase it.
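A plot of this kind can be sketched as follows. This is one common way to draw prediction regions, not necessarily the exact code behind the figure: it assumes synthetic two-feature data in place of user_data.csv, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic two-feature data stands in for the scaled training set.
x_set, y_set = make_classification(n_samples=300, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=0)
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy",
                                    random_state=0).fit(x_set, y_set)

# Build a fine grid covering the feature space and predict a class for every grid point.
x1, x2 = np.meshgrid(
    np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.05),
    np.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.05))
grid_pred = classifier.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)

# Purple region: class 0, green region: class 1; points are coloured by their true class.
cmap = ListedColormap(["purple", "green"])
plt.contourf(x1, x2, grid_pred, alpha=0.75, cmap=cmap)
plt.scatter(x_set[:, 0], x_set[:, 1], c=y_set, cmap=cmap, edgecolors="k", s=15)
plt.title("Random Forest prediction regions (synthetic data)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.savefig("random_forest_regions.png")  # hypothetical output file name
```

A point whose colour differs from the region it sits in is a misclassified sample.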
So, in the Random Forest classifier, we used 10 trees, each of which predicted Yes or
No for the Purchased variable. The classifier took the majority of these predictions
as the final result.
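The majority-vote idea can be sketched in a few lines. The votes below are hypothetical, and majority_vote is a helper written for this illustration. Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from 10 trees for one user (1 = Yes, 0 = No).
tree_votes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(majority_vote(tree_votes))  # 1
```

With an even number of trees a tie is possible; this helper then returns whichever tied label appeared first in the list.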
Output:
The above image visualizes the result for the test set. Only a small number of
predictions are incorrect (8 out of 100), and there is no obvious sign of overfitting.
Changing the number of trees in the classifier will give different results.
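One way to explore the effect of the number of trees is to refit the classifier with several values of n_estimators and compare training and test accuracy. The sketch below assumes synthetic data in place of user_data.csv, and the tree counts tried are arbitrary; the scores it prints say nothing about the results on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic two-feature data stands in for user_data.csv.
x, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

results = {}
for n_trees in (1, 10, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, criterion="entropy",
                                 random_state=0)
    clf.fit(x_train, y_train)
    # Store (training accuracy, test accuracy) for this forest size.
    results[n_trees] = (clf.score(x_train, y_train), clf.score(x_test, y_test))
    print(n_trees, results[n_trees])
```

A large gap between training and test accuracy is a sign of overfitting, so comparing the two columns is more informative than looking at test accuracy alone.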