
Learning Model Building in Scikit-learn

Last Updated : 29 Jan, 2025

Building machine learning models from scratch can be complex and time-consuming. However, with the right tools and frameworks, this process becomes significantly easier. Scikit-learn is one such tool that makes machine learning model creation straightforward. It provides user-friendly tools for tasks like classification, regression, clustering and many more.

Scikit-learn is an open-source Python library that includes a wide range of machine learning models, pre-processing, cross-validation and visualization algorithms, all accessible through a simple, consistent interface. Its simplicity and versatility make it a popular choice for both beginners and advanced data scientists to build and implement machine learning models. In this article we will learn the essential features and techniques for building machine learning models using Scikit-learn.

Installation of Scikit-learn

At the time this article was written, the latest version of Scikit-learn was 1.1, which requires Python 3.8 or newer.

Scikit-learn requires the following dependencies:

  • NumPy
  • SciPy

Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

!pip install -U scikit-learn
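To confirm that the installation worked, a quick check is to import the library and print the installed version:

# verify the installation by printing the installed scikit-learn version
import sklearn
print(sklearn.__version__)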

Let us get started with the modeling process now.

Step 1: Load a Dataset

A dataset is nothing but a collection of data. A dataset generally has two main components: 

  • Features: Also known as predictors, inputs or attributes, these are simply the variables of our data. There can be more than one, so they are represented by a feature matrix ('X' is a common notation for the feature matrix). The list of all the feature names is termed feature names.
  • Response: Also known as the target, label or output, this is the output variable that depends on the feature variables. We generally have a single response column, represented by a response vector ('y' is a common notation for the response vector). All the possible values taken by the response vector are termed target names.

1. Loading an exemplar dataset: Scikit-learn comes with a few built-in example datasets, such as the iris and digits datasets for classification and the Boston house prices dataset for regression (note that the Boston dataset has been removed in recent scikit-learn versions; the diabetes dataset is a common regression alternative).

Given below is an example of how we can load an exemplar dataset: 

# load the iris dataset as an example 
from sklearn.datasets import load_iris 
iris = load_iris() 
  
# store the feature matrix (X) and response vector (y) 
X = iris.data 
y = iris.target 
  
# store the feature and target names 
feature_names = iris.feature_names 
target_names = iris.target_names 
  
# printing features and target names of our dataset 
print("Feature names:", feature_names) 
print("Target names:", target_names) 
  
# X and y are numpy arrays 
print("\nType of X is:", type(X)) 
  
# printing first 5 input rows 
print("\nFirst 5 rows of X:\n", X[:5])
  • load_iris(): loads the Iris dataset into the variable iris.
  • Features and Targets: X contains the input data (features like petal length, width, etc) and y contains the target values (species of the iris flower).
  • Names: feature_names and target_names provide the names of the features and the target labels respectively.
  • Inspecting Data: We print the feature names and target names, check the type of X and display the first 5 rows of the feature data to understand its structure.

Output: 

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Type of X is: <class 'numpy.ndarray'>

First 5 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

2. Loading an external dataset: Now consider the case when we want to load an external dataset. For this we can use the pandas library to easily load and manipulate datasets.

For this you can refer to our article on how to import a CSV file in pandas.
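As a minimal sketch of that approach (assuming a hypothetical CSV file named data.csv with a column named 'target'), loading an external dataset with pandas and separating the features from the response might look like this:

import pandas as pd

# read the CSV file into a DataFrame (data.csv and the 'target' column are hypothetical)
df = pd.read_csv("data.csv")

# separate the feature matrix from the response column
X = df.drop(columns=["target"])
y = df["target"]

print(X.shape, y.shape)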

Step 2: Splitting the Dataset

In machine learning, working with large datasets can be computationally expensive. For this reason we split the data into two parts: training data and testing data. This approach reduces computational cost and, more importantly, lets us evaluate the model's performance and accuracy on unseen data.

  • Split the dataset into two pieces: a training set and a testing set.
  • Train the model on the training set.
  • Test the model on the testing set and evaluate how well our model did.

1. Load the Iris Dataset

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

2. Import and Use train_test_split to Split the Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

In this step we import train_test_split from sklearn.model_selection. This function splits the dataset into two parts: a training set and a testing set.

  • X_train and y_train: These are the features and target values used for training the model.
  • X_test and y_test: These are the features and target values used for testing the model after it has been trained.
  • test_size=0.4: 40% of the data is allocated to the testing set while the remaining 60% is used for training.
  • random_state=1: This ensures that the split is consistent, meaning you’ll get the same random split every time you run the code.

3. Check the Shapes of the Split Data

When splitting data into training and testing sets, verifying the shapes ensures that both sets have the correct proportions of data, avoiding potential errors in model training or evaluation.

print("X_train Shape:",  X_train.shape)
print("X_test Shape:", X_test.shape)
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)
  • The number of rows in X_train should be 60% of the original dataset, and the number of rows in X_test should be 40%.
  • y_train should have the same number of rows as X_train, and y_test should have the same number of rows as X_test.

Output: 

X_train Shape: (90, 4)
X_test Shape: (60, 4)
Y_train Shape: (90,)
Y_test Shape: (60,)

Step 3: Handling Categorical Data

It’s important to handle categorical data correctly because machine learning algorithms typically require numerical input. If categorical data is not encoded, algorithms may misinterpret the categories, leading to incorrect results. This is why we need to encode categorical data before training. Scikit-learn provides several techniques for encoding categorical variables into numerical values.

1. Label Encoding: It converts each category into a unique integer. For example, in a column with the categories 'cat', 'dog' and 'bird', label encoding assigns the integers in alphabetical order, so 'bird' becomes 0, 'cat' becomes 1 and 'dog' becomes 2.

from sklearn.preprocessing import LabelEncoder

categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']

encoder = LabelEncoder()

encoded_feature = encoder.fit_transform(categorical_feature)

print("Encoded feature:", encoded_feature)
  • The LabelEncoder() is initialized to create an encoder object that will convert categorical values into numerical labels.
  • The fit_transform() method first fits the encoder to the categorical data and then transforms the categories into the corresponding numeric labels.

Output:

Encoded feature: [1 2 2 1 0]

This method is useful when the categorical values have an inherent order, like "Low", "Medium" and "High", but it can be problematic for unordered categories because the model may read the integers as a numeric ranking.
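For such ordered categories, one option is scikit-learn's OrdinalEncoder, which lets you specify the category order explicitly. The sketch below uses an illustrative "Low"/"Medium"/"High" feature:

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# an illustrative ordered categorical feature (2D array: one column)
sizes = np.array(['Low', 'High', 'Medium', 'Low']).reshape(-1, 1)

# pass the categories in their natural order so Low=0, Medium=1, High=2
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_sizes = ordinal_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())  # [0. 2. 1. 0.]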

2. OneHotEncoding: It creates a binary column for each category. For example, if you have a column with the values 'cat', 'dog' and 'bird', OneHotEncoding will create three new columns (one for each category), where each row has a 1 in the column corresponding to its category and 0s in the others.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']

categorical_feature = np.array(categorical_feature).reshape(-1, 1)

encoder = OneHotEncoder(sparse_output=False)

encoded_feature = encoder.fit_transform(categorical_feature)

print("OneHotEncoded feature:\n", encoded_feature)
  • The OneHotEncoder expects the input data to be a 2D array, i.e. each sample should be a row and each feature a column, which is why we reshape it.
  • The OneHotEncoder(sparse_output=False) creates an encoder object that will convert categorical variables into binary columns.
  • The fit_transform() method is used to fit the encoder to the categorical data and transform the data into a one-hot encoded format.

Output:

OneHotEncoded feature:
[[0. 1. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]]

This method is useful for categorical variables without any inherent order, ensuring that no numeric relationships are implied between the categories.
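If you need to know which output column corresponds to which category, recent scikit-learn versions let you ask the fitted encoder for its generated column names, continuing the example above:

# column names generated by the fitted one-hot encoder
print(encoder.get_feature_names_out())  # e.g. ['x0_bird' 'x0_cat' 'x0_dog']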

There are other encoding techniques as well, such as Mean Encoding, where each category is replaced by the mean of the target variable for that category.
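A minimal sketch of the mean-encoding idea, done here with pandas on an illustrative toy DataFrame (newer scikit-learn versions also provide a TargetEncoder in sklearn.preprocessing for this purpose):

import pandas as pd

# illustrative toy data: one categorical column and a numeric target
df = pd.DataFrame({
    'city': ['A', 'B', 'A', 'B', 'C'],
    'price': [100, 200, 150, 250, 300]
})

# replace each category with the mean target value observed for that category
city_means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(city_means)

print(df)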

Step 4: Training the Model

Now it's time to train a model using our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a unified, consistent interface for fitting, predicting, evaluating accuracy and so on. The example given below uses Logistic Regression.

Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only. 

1. Training Using Logistic Regression

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
  • We create a logistic regression classifier object using log_reg = LogisticRegression(max_iter=200).
  • The classifier is trained using the X_train data and the corresponding response vector y_train.
  • This is done by calling log_reg.fit(X_train, y_train), where the logistic regression model adjusts its weights.

2. Making Predictions

y_pred = log_reg.predict(X_test)
  • Now we need to test our classifier on the X_test data. The log_reg.predict method is used for this purpose; it returns the predicted response vector y_pred.

3. Testing Accuracy

from sklearn import metrics
print("Logistic Regression model accuracy:", metrics.accuracy_score(y_test, y_pred))
  • Now we are interested in the accuracy of our model, found by comparing y_test and y_pred. This is done using the accuracy_score method from the metrics module.
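Accuracy is only one way to evaluate a classifier. As an optional sketch using the same y_test and y_pred, scikit-learn's metrics module also provides a per-class report and a confusion matrix:

from sklearn.metrics import classification_report, confusion_matrix

# precision, recall and F1-score for each iris class
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))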

4. Making Predictions on New Data

sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = log_reg.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
  • Consider the case when you want your model to make predictions on new sample data. The sample input can simply be passed in the same way as any feature matrix.
  • Here we used it as sample = [[3, 5, 4, 2], [2, 3, 5, 4]]

Output: 

Logistic Regression model accuracy: 0.9666666666666667
Predictions: ['virginica', 'virginica']

Features of Scikit-learn

  • Pre-built functions: It offers ready-to-use functions for common tasks like data preprocessing, model training and prediction, eliminating the need to write algorithms from scratch.
  • Efficient model evaluation: It includes tools for model evaluation such as cross-validation and performance metrics, making it easy to assess and improve model accuracy (see the cross-validation sketch after this list).
  • Variety of algorithms: It provides a wide range of algorithms for classification, regression, clustering and more like support vector machines, random forests and k-means.
  • Integration with scientific libraries: Built on top of NumPy, SciPy and matplotlib, making it easy to integrate with other libraries for data analysis.
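As a small illustration of those evaluation tools, the sketch below runs 5-fold cross-validation on the iris data, reusing the log_reg model defined earlier:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the model is trained and scored on 5 different splits
scores = cross_val_score(log_reg, X, y, cv=5)

print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())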

Benefits of using Scikit-learn Libraries

  • Consistent and simple interface: Scikit-learn provides a uniform API across different models making it easy to switch between algorithms without having to learn a new syntax.
  • Extensive model tuning options: It offers a wide range of tuning parameters and grid search tools to fine-tune models for better performance (see the grid search sketch after this list).
  • Active community and support: The library has a large, engaged community ensuring regular updates, bug fixes and a wealth of user-contributed resources like forums, blogs and Q&A sites.
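As a brief illustration of those tuning tools, the sketch below runs a grid search over the logistic regression regularization strength C; the parameter grid is illustrative, not a recommendation:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# illustrative grid over the regularization strength C
param_grid = {'C': [0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)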

Scikit-learn stands as one of the most important libraries in the field of machine learning, providing a straightforward and powerful set of tools for building and deploying models. It is widely used by everyone from beginners to experienced data scientists for building machine learning models.


