How to Calculate R² with Scikit-Learn
The coefficient of determination, denoted as R², is an essential metric in regression analysis. It indicates the extent to which the independent variables account for the variation in the dependent variable.
In this article, we will walk you through calculating R² using Scikit-Learn, a powerful Python library for machine learning.
What is R²?
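R² quantifies the proportion of variance in the dependent variable that can be predicted from the independent variables. Its maximum value is 1, which indicates that the model explains all the variability. A value of 0 indicates that the model explains none of it, performing no better than always predicting the mean of the actual values. For a least-squares linear fit with an intercept, evaluated on its own training data, R² lies between 0 and 1. For arbitrary predictions or held-out data it can be negative, which means the predictions fit worse than the mean.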
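These reference values can be checked on a tiny example. The sketch below uses made-up arrays (they are illustrative and not part of this article's dataset) and shows that `r2_score` returns 1 for exact predictions, 0 when every prediction is the mean, and a negative value when the predictions fit worse than the mean.

```python
from sklearn.metrics import r2_score

# Illustrative target values (assumed for this sketch only)
y_true = [1.0, 2.0, 3.0]

perfect = r2_score(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2_score(y_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_ = r2_score(y_true, [3.0, 2.0, 1.0])  # worse than predicting the mean

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

The last case gives -3 because the squared residuals sum to 8 while the total sum of squares is 2, so R² = 1 - 8/2.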
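Mathematically, R² is expressed as:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Here:
$SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ is the sum of squares of residuals (the differences between actual and predicted values).
$SS_{tot} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares (the differences between actual values and the mean of actual values).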
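To make the definition concrete, here is a minimal sketch that computes R² as one minus the ratio of the residual sum of squares to the total sum of squares, using NumPy, and compares the result with `r2_score`. The arrays and the names `ss_res`, `ss_tot`, and `r2_manual` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values (assumed for this sketch only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Sum of squared residuals: actual minus predicted
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: actual minus mean of actual
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

Both printed values should agree up to floating-point rounding, since `r2_score` applies the same definition for a single target.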
Calculating R² with Scikit-Learn for Sample Data
Let's go through an example that calculates R² from sample data generated by a simple linear model.
Step 1: Import Necessary Libraries
import numpy as np
from sklearn.metrics import r2_score
Step 2: Generate Sample Data
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Predict with the true underlying line, without noise (just for the sake of demonstration)
y_pred = 4 + 3 * X
Step 3: Compute R² Using Scikit-Learn
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
Complete Code
import numpy as np
from sklearn.metrics import r2_score
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Predict with the true underlying line, without noise (just for the sake of demonstration)
y_pred = 4 + 3 * X
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
Output:
R² (Scikit-Learn Calculation): 0.7639751938835576
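The example above scores predictions taken from the known generating line rather than from a trained model. As a hedged variation, the sketch below fits a `LinearRegression` to the same data and scores it two ways: with `r2_score` and with the estimator's own `score` method, which returns R² for regressors. The variable names are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same sample data as in the example above
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

# Fit an ordinary least-squares line and predict on the training data
model = LinearRegression().fit(X, y)
y_fit = model.predict(X)

r2_fit = r2_score(y, y_fit)       # R² from the metric function
r2_via_score = model.score(X, y)  # R² from the estimator itself
r2_true_line = r2_score(y, (4 + 3 * X).ravel())  # R² of the generating line

print(r2_fit, r2_via_score, r2_true_line)
```

On the training data, the least-squares fit cannot score lower than the generating line, because it minimizes the residual sum of squares over all straight lines.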
Calculating R² for a Simple Polynomial Regression Problem Using Scikit-Learn
Polynomial regression is a type of regression analysis in which the relationship between the independent variable X and the dependent variable y is modeled as an n-th degree polynomial. We will compute the R² value for a polynomial regression model using Python.
Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
Step 2: Generate Sample Data
We'll create a simple nonlinear dataset:
# Generate random data
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)
Step 3: Prepare Polynomial Features
Transform the input data to include polynomial features up to the desired degree (e.g., degree 2):
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
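To see what the transformation produces, the following sketch applies `PolynomialFeatures` to a two-row input. The array `X_small` is an assumption made for illustration. With `degree=2` and `include_bias=False`, each row becomes the original value followed by its square.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Tiny illustrative input (assumed for this sketch only)
X_small = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_small_poly = poly.fit_transform(X_small)

print(X_small_poly)
# [[2. 4.]
#  [3. 9.]]
```

Leaving out the bias column is a reasonable choice here because `LinearRegression` fits its own intercept by default.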
Step 4: Fit the Polynomial Regression Model
Fit a linear regression model to the polynomial features:
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
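One way to sanity-check the fit before scoring it is to inspect the learned parameters. The sketch below repeats the data generation and fitting steps so that it runs on its own, then prints `intercept_` and `coef_`. Since the data were generated from 0.5 * X² + X + 2 plus noise, the estimates should land near 2 for the intercept, 1 for the linear term, and 0.5 for the squared term, though not exactly on them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Same sample data as in Step 2
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)

# Columns of X_poly are [X, X**2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)

print("intercept:", model.intercept_)  # estimate of the constant term
print("coefficients:", model.coef_)    # estimates for X and X**2, in that order
```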
Step 5: Calculate R² Using Scikit-Learn
Compute R² for the fitted model using Scikit-Learn's r2_score function:
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
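R² is most informative when compared across candidate models. As a sketch under the same data assumptions, the code below fits a degree-1 and a degree-2 model and compares their training R². The use of `make_pipeline` and the `scores` dictionary are choices made for this illustration, not steps from the walkthrough above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same sample data as in Step 2
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = (0.5 * X**2 + X + 2 + np.random.randn(100, 1)).ravel()

scores = {}
for degree in (1, 2):
    # Chain feature expansion and regression into one estimator
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    pipe.fit(X, y)
    scores[degree] = r2_score(y, pipe.predict(X))

print(scores)
```

On training data the degree-2 score can never be lower than the degree-1 score, because the straight-line model is a special case of the quadratic one. A higher training R² alone therefore does not show that the more flexible model generalizes better.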
Visualizing the Results
It's often helpful to visualize the polynomial regression curve along with the data points:
plt.scatter(X, y, color='blue', label='Actual')
# Sort the values for better plotting
sorted_indices = X.flatten().argsort()
plt.plot(X[sorted_indices], y_pred[sorted_indices], color='red', linewidth=2, label='Predicted')
plt.title('Actual vs Predicted (Polynomial Regression)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Complete Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Generate random data
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
plt.scatter(X, y, color='blue', label='Actual')
# Sort the values for better plotting
sorted_indices = X.flatten().argsort()
plt.plot(X[sorted_indices], y_pred[sorted_indices], color='red', linewidth=2, label='Predicted')
plt.title('Actual vs Predicted (Polynomial Regression)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Output:
R² (Scikit-Learn Calculation): 0.8525067519009746

Conclusion
Calculating R² in Python with Scikit-Learn's r2_score function is straightforward and provides valuable insight into your model's performance. Because r2_score only needs the actual and predicted values, it works with any source of predictions, whether a fixed formula as in the first example or a fitted model as in the polynomial example. This makes it a convenient way to check the goodness of fit of your predictions against actual data.