Text Classification using Logistic Regression
Text classification is the process of automatically assigning labels or categories to pieces of text. It has many applications, such as sorting emails into spam or not-spam, determining whether a product review is positive or negative, or identifying the topic of a news article. In this article, we will see how logistic regression is used for text classification with Scikit-Learn.
How Does Logistic Regression Work for Text Classification?
Logistic Regression is a statistical method used for binary classification problems, and it can also be extended to handle multi-class classification. When applied to text classification, the goal is to predict the category or class of a given text document based on its features.
Steps for how Logistic Regression works for text classification:
1. Text Representation:
- Before applying logistic regression, text data must be converted into numerical features, a process known as text vectorization.
- Common techniques for text vectorization include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or more advanced methods like word embeddings (Word2Vec, GloVe) or deep learning-based embeddings (BERT, GPT).
2. Feature Extraction:
- Once the text is represented numerically, these representations can be used as features for the model.
- Features could be the counts of words in BoW, the weighted values in TF-IDF, or the numerical vectors in embeddings.
3. Logistic Regression Model:
- Logistic Regression models the relationship between the features and the probability of belonging to a particular class using the logistic function.
- The logistic function (also called the sigmoid function) maps any real-valued number into the range [0, 1], which is suitable for representing probabilities.
- The logistic regression model calculates a weighted sum of the input features and applies the logistic function to obtain the probability of belonging to the positive class.
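The weighted sum and sigmoid described above can be sketched in a few lines of NumPy. The weights, bias, and feature counts here are hypothetical values chosen for illustration, not learned parameters:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights, bias, and BoW feature counts
weights = np.array([1.2, -0.5, 0.8])
features = np.array([2, 1, 0])
bias = -1.0

z = np.dot(weights, features) + bias  # weighted sum of features
p = sigmoid(z)                        # probability of the positive class
print(p)                              # ≈ 0.711
```

During training, logistic regression adjusts the weights and bias so that these probabilities match the observed labels as closely as possible.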
Logistic Regression Text Classification with Scikit-Learn
We'll use the popular SMS Spam Collection Dataset, which consists of SMS (Short Message Service) messages labeled as either "ham" (legitimate) or "spam" (unwanted) based on their content. The implementation classifies each message into one of these two categories using a logistic regression model. The process is broken down into several key steps:
Step 1. Import Libraries
The first step involves importing necessary libraries.
- Pandas is used for data manipulation.
- CountVectorizer for converting text data into a numeric format.
- Various functions from sklearn.model_selection and sklearn.linear_model for creating and training the model.
- Functions from sklearn.metrics to evaluate the model's performance.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
Step 2. Load and Prepare the Data
- Load the dataset from a CSV file, and rename columns for clarity.
- The latin-1 encoding is specified to handle any non-ASCII characters that may be present in the file.
- Map labels from text to numeric values (0 for ham, 1 for spam), making it suitable for model training.
data = pd.read_csv("spam.csv", encoding='latin-1')
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
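To see what the renaming and label mapping produce, the same steps can be run on a small in-memory sample that mirrors the CSV's v1/v2 columns (used here so the snippet runs without spam.csv; the messages are invented examples):

```python
import pandas as pd

# Minimal in-memory sample mirroring the CSV's v1 (label) and v2 (text) columns
sample = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at 5", "WIN a FREE prize now!", "On my way"],
})
sample.rename(columns={"v1": "label", "v2": "text"}, inplace=True)
sample["label"] = sample["label"].map({"ham": 0, "spam": 1})

print(sample["label"].tolist())  # → [0, 1, 0]
```

After this step, the label column holds 0 for ham and 1 for spam, which is the numeric form the model expects.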