Random Forest Regression in Python
A random forest is an ensemble learning method that combines the predictions from multiple decision trees to produce a more accurate and stable prediction. It is a type of supervised learning algorithm that can be used for both classification and regression tasks.
For regression tasks we can use Random Forest Regression to predict numerical values. It predicts continuous values by averaging the results of multiple decision trees.
Working of Random Forest Regression
Random Forest Regression works by creating multiple decision trees, each trained on a random subset of the data. The process begins with bootstrap sampling, where random rows of data are selected with replacement to form a different training dataset for each tree. After this we do feature sampling, where only a random subset of features is considered at each split, ensuring diversity among the trees.
After the trees are trained, each tree makes a prediction, and the final prediction for a regression task is the average of all the individual tree predictions. This step is called aggregation.

Random Forest Regression Model Working
This approach is beneficial because individual decision trees have high variance and are prone to overfitting, especially with complex data. By averaging the predictions of many decision trees, Random Forest reduces this variance, leading to more accurate and stable predictions and better generalization.
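To make the bootstrap-and-average idea concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor on synthetic data. All names and numbers here (X_demo, n_trees, the noise level) are illustrative and not part of the tutorial's dataset:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(100, 1))                 # synthetic feature
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.2, 100)  # noisy target

n_trees = 10
trees = []
for _ in range(n_trees):
    idx = rng.randint(0, len(X_demo), len(X_demo))      # bootstrap sample (rows drawn with replacement)
    tree = DecisionTreeRegressor(max_features='sqrt')   # random feature subset per split
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Aggregation: the forest prediction is the mean of the tree predictions
X_new = np.array([[5.0]])
print(np.mean([t.predict(X_new)[0] for t in trees]))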
Implementing Random Forest Regression in Python
We will implement Random Forest Regression on a salaries dataset, which can be downloaded from here.
1. Import Libraries
Here we are importing all the necessary libraries required.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
warnings.filterwarnings('ignore')
- Pandas: loads the data as a DataFrame and provides many functions to perform analysis tasks in one go.
- NumPy: fast arrays that can perform large computations in a very short time (used later to build the prediction grid).
- Matplotlib/Seaborn: libraries used to draw visualizations.
- sklearn: provides a wide range of tools for preprocessing, modeling and evaluating machine learning models.
- RandomForestRegressor: the class used to train a random forest regression model.
- LabelEncoder: encodes categorical data into numerical values.
- KNNImputer: imputes missing values in a dataset using a k-nearest-neighbors approach.
- train_test_split: splits a dataset into training and testing sets.
- StandardScaler: standardizes features by removing the mean and scaling to unit variance.
- cross_val_score: performs k-fold cross-validation to evaluate the performance of a model.
2. Import Dataset
Now let's load the dataset into a pandas DataFrame for better data handling and to leverage its handy functions for performing complex tasks in one go.
df = pd.read_csv('Salaries.csv')
print(df)
Output:
Position Level Salary
0 Business Analyst 1 45000
1 Junior Consultant 2 50000
2 Senior Consultant 3 60000
3 Manager 4 80000
4 Country Manager 5 110000
5 Region Manager 6 150000
6 Partner 7 200000
7 Senior Partner 8 300000
8 C-level 9 500000
9 CEO 10 1000000
df.info()
Here the .info() method provides a quick overview of the structure, data types and memory usage of the dataset.
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Position  10 non-null     object
 1   Level     10 non-null     int64
 2   Salary    10 non-null     int64
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes
3. Data Preparation
Here the code extracts two subsets of data from the DataFrame and stores them in separate variables.
- Extracting features: the Level column is extracted and stored in a variable named X. The 1:2 slice keeps X two-dimensional, which is the shape scikit-learn estimators expect.
- Extracting the target variable: the Salary column is extracted and stored in a variable named y.
X = df.iloc[:,1:2].values
y = df.iloc[:,2].values
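A quick check confirms the shapes:
print(X.shape)  # (10, 1) -- 2-D feature matrix (the Level column)
print(y.shape)  # (10,)   -- 1-D target vector (the Salary column)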
4. Random Forest Regressor Model
The code encodes the categorical column numerically, combines it with the numerical features while dropping the Salary target (so the answer does not leak into the inputs), and trains a Random Forest Regression model on the prepared data.
# Check for and handle categorical variables
label_encoder = LabelEncoder()
x_categorical = df.select_dtypes(include=['object']).apply(label_encoder.fit_transform)

# Keep the numerical features but drop the target column to avoid leakage
x_numerical = df.select_dtypes(exclude=['object']).drop(columns=['Salary'])

# Combine numerical and encoded categorical features: columns are [Level, Position]
x = pd.concat([x_numerical, x_categorical], axis=1).values

regressor = RandomForestRegressor(n_estimators=10, random_state=0, oob_score=True)
regressor.fit(x, y)
- RandomForestRegressor: builds multiple decision trees and combines their predictions.
- n_estimators=10: defines the number of decision trees in the Random Forest (10 trees in this case).
- random_state=0: ensures the randomness in model training is controlled for reproducibility.
- oob_score=True: enables out-of-bag scoring, which evaluates the model's performance using data not seen by individual trees during training.
- LabelEncoder(): converts categorical variables (object type) into numerical values, making them suitable for machine learning models.
- select_dtypes(): selects columns based on data type; include=['object'] selects the categorical columns, and exclude=['object'] selects the numerical columns (from which the Salary target is then dropped).
- apply(label_encoder.fit_transform): applies the LabelEncoder transformation to each categorical column, converting string labels into numbers.
- concat(): combines the numerical and encoded categorical features horizontally into one dataset, which is then used as input for the model.
- fit(): trains the Random Forest model using the combined dataset (x) and target variable (y).
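As an illustrative check (the level 6.5 and the position label below are arbitrary example inputs, not part of the tutorial), the fitted encoder and model can score a new observation:
pos_code = label_encoder.transform(['Region Manager'])[0]  # the encoder was fitted on the Position column above
print(regressor.predict([[6.5, pos_code]]))               # predicted salary for the example input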
5. Make Predictions and Evaluation
The code evaluates the trained Random Forest Regression model:
- Retrieves the out-of-bag (OOB) score, which estimates the model's generalization performance.
- Makes predictions with the trained model and stores them in the predictions array.
- Evaluates the model's performance using the Mean Squared Error (MSE) and R-squared (R2) metrics.
from sklearn.metrics import mean_squared_error, r2_score
oob_score = regressor.oob_score_
print(f'Out-of-Bag Score: {oob_score}')
predictions = regressor.predict(x)
mse = mean_squared_error(y, predictions)
print(f'Mean Squared Error: {mse}')
r2 = r2_score(y, predictions)
print(f'R-squared: {r2}')
- mean_squared_error: calculates the average squared difference between true and predicted values (MSE).
- r2_score: measures how well the model fits the data (R-squared value).
- oob_score_: retrieves the out-of-bag score for model performance evaluation.
- predict(): makes predictions using the trained Random Forest model.
- print(): displays the model evaluation metrics: out-of-bag score, MSE and R-squared.
Output:
Out-of-Bag Score: 0.644879832593859
Mean Squared Error: 2647325000.0
R-squared: 0.9671801245316117
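Note that the MSE and R-squared above are computed on the same rows the model was trained on, so they are optimistic. The cross_val_score function imported earlier gives a less biased estimate; the sketch below uses 5 folds as an illustrative choice (with only 10 rows each fold is tiny):
# Illustrative: 5-fold cross-validated R-squared on the same features and target
cv_scores = cross_val_score(
    RandomForestRegressor(n_estimators=10, random_state=0),
    x, y, cv=5, scoring='r2'
)
print(f'Cross-validated R-squared: {cv_scores.mean():.3f}')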
6. Visualization
Now let's visualize the results obtained by using the Random Forest Regression model on our salaries dataset. Because the model from step 4 was trained on two features, the code below first fits a separate forest on Level alone for this one-dimensional plot.
- Fits a Level-only forest for plotting.
- Creates a dense grid of prediction points covering the range of the feature values.
- Plots the real data points as blue scatter points.
- Plots the predicted values for the prediction grid as a green line.
- Adds labels and a title to the plot for better understanding.
# The model from step 4 uses two features, so fit a Level-only forest for this 1-D plot
viz_regressor = RandomForestRegressor(n_estimators=10, random_state=0)
viz_regressor.fit(X, y)
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)       # dense grid of levels
plt.scatter(X, y, color='blue')                                 # plotting real points
plt.plot(X_grid, viz_regressor.predict(X_grid), color='green')  # plotting predicted curve
plt.title("Random Forest Regression Results")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
Output:
7. Visualizing a Single Decision Tree from the Random Forest Model
The code visualizes one of the decision trees from the trained Random Forest model, displaying the decision-making process of a single tree within the ensemble.
from sklearn.tree import plot_tree
# Pick one tree from the trained forest, e.g., the first tree (index 0)
tree_to_plot = regressor.estimators_[0]
# Plot the decision tree; the feature names match the columns of x from step 4
plt.figure(figsize=(20, 10))
plot_tree(tree_to_plot, feature_names=['Level', 'Position'],
          filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree from Random Forest")
plt.show()
Output:
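If a text dump is easier to inspect than the figure, scikit-learn's export_text renders the same tree as indented rules. A short sketch using the tree selected above:
from sklearn.tree import export_text
# Print the first tree's split rules as plain text
print(export_text(tree_to_plot, feature_names=['Level', 'Position']))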
Applications of Random Forest Regression
Random Forest Regression is applied to a wide range of real-world problems, including:
- Predicting continuous numerical values: Predicting house prices, stock prices or customer lifetime value.
- Identifying risk factors: Detecting risk factors for diseases, financial crises or other negative events.
- Handling high-dimensional data: Analyzing datasets with a large number of input features.
- Capturing complex relationships: Modeling complex relationships between input features and the target variable.
Advantages of Random Forest Regression
- Handles Non-Linearity: It can capture complex, non-linear relationships in the data that other models might miss.
- Reduces Overfitting: By combining multiple decision trees and averaging predictions it reduces the risk of overfitting compared to a single decision tree.
- Robust to Outliers: Random Forest is less sensitive to outliers as it aggregates the predictions from multiple trees.
- Works Well with Large Datasets: It can efficiently handle large datasets and high-dimensional data without a significant loss in performance.
- Handles Missing Data: many implementations can handle missing values (for example through imputation or surrogate splits) while maintaining good accuracy on incomplete data.
- No Need for Feature Scaling: Unlike many other algorithms Random Forest does not require normalization or scaling of the data.
Disadvantages of Random Forest Regression
- Complexity: It can be computationally expensive and slow to train especially with a large number of trees and high-dimensional data.
- Less Interpretability: Since it uses many trees it can be harder to interpret compared to simpler models like linear regression or decision trees.
- Memory Intensive: Storing multiple decision trees for large datasets requires significant memory resources.
- Overfitting on Noisy Data: While Random Forest reduces overfitting, it can still overfit if the data is highly noisy, especially with a large number of trees.
- Sensitive to Imbalanced Data: It may perform poorly if the dataset is highly imbalanced like one class is significantly more frequent than another.
- Difficulty in Real-Time Predictions: Due to its complexity it may not be suitable for real-time predictions especially with a large number of trees.
Random Forest Regression has become a powerful tool for continuous prediction tasks, with advantages over traditional decision trees. Its capability to handle high-dimensional data, capture complex relationships and reduce overfitting has made it a popular choice for a variety of applications. Python's scikit-learn library makes implementing Random Forest Regression models straightforward.
Frequently Asked Questions (FAQs)
What is Random Forest Regression in Python?
Random Forest Regression in Python is an ensemble learning method that uses multiple decision trees to make predictions. It is a powerful and versatile algorithm that is well suited for regression tasks.
What is the use of random forest regression?
Random Forest Regression can be used to predict a variety of target variables, including prices, sales, customer churn, and more. It is a robust algorithm that is not easily overfitted, making it a good choice for real-world applications.
What is the difference between random forest and regression?
Regression refers to the task of predicting a continuous target, and to simple models such as linear regression that fit a single equation to the data. Random Forest is an ensemble learning method that performs this task by averaging the predictions of many decision trees.
How do you tune the hyperparameters of Random Forest Regression?
There are several methods for tuning the hyperparameters of Random Forest Regression, such as:
- Grid search: systematically tries different combinations of hyperparameter values to find the best combination; a minimal sketch follows this list.
- Random search: randomly samples combinations of hyperparameter values to find a good combination.
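A minimal grid-search sketch, assuming the x and y prepared earlier; the parameter values below are arbitrary starting points, not recommendations:
from sklearn.model_selection import GridSearchCV

# Illustrative grid of candidate hyperparameter values
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 4],
}
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid, cv=3, scoring='neg_mean_squared_error')
grid.fit(x, y)
print(grid.best_params_)  # best combination found on this grid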
Why is random forest better than regression?
Compared with a single linear regression model, Random Forest can capture non-linear relationships, so it is often more accurate and less prone to overfitting on such data, making it more likely to generalize well. The trade-off is reduced interpretability; the sketch below illustrates the accuracy difference on non-linear data.
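A minimal, self-contained comparison on synthetic non-linear data (all data here is generated purely for illustration):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X_syn = rng.uniform(0, 10, size=(300, 1))                  # synthetic feature
y_syn = np.sin(X_syn).ravel() + rng.normal(0, 0.2, 300)    # non-linear target with noise

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print('Linear regression R-squared:', r2_score(y_te, lin.predict(X_te)))
print('Random forest R-squared:   ', r2_score(y_te, rf.predict(X_te)))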