Polynomial Regression ( From Scratch using Python )

ML | Multiple Linear Regression using Python

Last Updated : 27 Jan, 2025

Linear regression is a basic and commonly used for predictive analysis. It is a statistical approach for modeling relationship between a dependent variable and one independent variables. Multiple Linear Regression is a extended version of it only. It attempts to model relationship between two or more features to fit a linear equation to predict one dependent variable.

Steps for Multiple Linear Regression

Steps to perform multiple linear Regression are almost similar to that of simple linear Regression difference Lies in the evaluation. We can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.

The equation for multiple linear regression is:

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$

$y$ is the dependent variable
$X_1, X_2, \cdots X_n$ are the independent variables
$\beta_0$ is the intercept
$\beta_1,\beta_2, \cdots \beta_n$ are the slopes

The goal of the algorithm is to find the best fit line equation that can predict the values based on the independent variables. The regression model learns a function from the dataset (with known X and Y values) and uses it to predict Y values for unknown X.

Handling Categorical Data with Dummy Variables

In multiple regression model, we often encounter categorical data, such as gender (male/female), location (urban/rural), etc. Since regression models typically expect numerical inputs, categorical data must be transformed into a usable form.

This is where Dummy Variables come into play. Dummy variables are binary variables (0 or 1) that represent the presence or absence of each category. For example:

Male: 1 if male, 0 otherwise
Female: 1 if female, 0 otherwise

In the case of multiple categories (e.g., color: “Red”, “Blue”, “Green”), we create a dummy variable for each category, excluding one to avoid multicollinearity (explained next). This process is called one-hot encoding, which converts categorical variables into a format suitable for regression models.

Multicollinearity in Multiple Linear Regression

When building a multiple linear regression model, multicollinearity can arise. It occurs when two or more independent variables are highly correlated with each other. This can make it difficult to evaluate the individual contribution of each variable to the dependent variable.

Detecting Multicollinearity includes two techniques:

Correlation Matrix: Examining the correlation matrix among the independent variables is a common way to detect multicollinearity. High correlations (close to 1 or -1) indicate potential multicollinearity.
VIF (Variance Inflation Factor): VIF is a measure that quantifies how much the variance of an estimated regression coefficient increases if your predictors are correlated. A high VIF (typically above 10) suggests multicollinearity.

We will learn more about these techniques in depth in upcoming chapters

Assumptions of Multiple Regression Model

Just like simple linear regression we have use some assumptions in multiple linear regression also:

Linearity: The relationship between dependent and independent variables should be linear.
Homoscedasticity: The variance of errors should remain constant across all levels of independent variables.
Multivariate Normality: Residuals should follow a normal distribution.
No Multicollinearity: Independent variables should not be highly correlated.

Implementing Multiple Linear Regression Model in Python

We will use the California Housing dataset, which includes features like median income, average rooms, and the target variable, house prices.

1. Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

2. Load Dataset

# Load the California Housing dataset
california_housing = fetch_california_housing()

# Assign the data (features) and target (house prices)
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.Series(california_housing.target)

Fetches the California Housing dataset from sklearn.datasets.
The dataset contains features (such as median income, average rooms) stored in X, and the target (house prices) is stored in y.

3. Select Features for Visualization

X = X[['MedInc', 'AveRooms']]

Selects two features (MedInc for median income and AveRooms for average rooms) to simplify the visualization to two dimensions.

4. Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Initialize and Train Model

model = LinearRegression()

model.fit(X_train, y_train)

6. Make Predictions

y_pred = model.predict(X_test)

7. Visualizing Best Fit Line in 3D

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X_test['MedInc'], X_test['AveRooms'], y_test, color='blue', label='Actual Data')

x1_range = np.linspace(X_test['MedInc'].min(), X_test['MedInc'].max(), 100)
x2_range = np.linspace(X_test['AveRooms'].min(), X_test['AveRooms'].max(), 100)
x1, x2 = np.meshgrid(x1_range, x2_range)

z = model.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)

ax.plot_surface(x1, x2, z, color='red', alpha=0.5, rstride=100, cstride=100)

ax.set_xlabel('Median Income')
ax.set_ylabel('Average Rooms')
ax.set_zlabel('House Price')
ax.set_title('Multiple Linear Regression Best Fit Line (3D)')

plt.show()

Output:

download9

Visualizing Multiple Linear Regression

Blue points represent the actual house prices based on the features (MedInc and AveRooms) and the red surface represents the best fit plane predicted by the multiple linear regression model. This plot shows how two selected features influence the predicted house prices.

Polynomial Regression ( From Scratch using Python )

Y

YashRaj5

News

Improve

Article Tags :

Practice Tags :

Machine Learning

Similar Reads

Machine Learning Algorithms

Machine learning algorithms are essentially sets of instructions that allow computers to learn from data, make predictions, and improve their performance over time without being explicitly programmed. Machine learning algorithms are broadly categorized into three types: Supervised Learning: Algorith

Top 15 Machine Learning Algorithms Every Data Scientist Should Know in 2025

Machine Learning (ML) Algorithms are the backbone of everything from Netflix recommendations to fraud detection in financial institutions. These algorithms form the core of intelligent systems, empowering organizations to analyze patterns, predict outcomes, and automate decision-making processes. Wi

ML | Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm in machine learning, particularly when dealing with large datasets. It is a variant of the traditional gradient descent algorithm but offers several advantages in terms of efficiency and scalability, making it the go-to method for many d