Text Classification using Logistic Regression
Text classification is the process of automatically assigning labels or categories to pieces of text. It has many applications, such as sorting emails into spam or not-spam, determining whether a product review is positive or negative, or identifying the topic of a news article. In this article, we will see how logistic regression is used for text classification with Scikit-Learn.
How Does Logistic Regression Work for Text Classification?
Logistic Regression is a statistical method used for binary classification problems, and it can also be extended to handle multi-class classification. When applied to text classification, the goal is to predict the category or class of a given text document based on its features.
Steps for how Logistic Regression works for text classification:
1. Text Representation:
- Before applying logistic regression, text data must be converted into numerical features, a process known as text vectorization.
- Common techniques for text vectorization include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or more advanced methods like word embeddings (Word2Vec, GloVe) or deep learning-based embeddings (BERT, GPT).
2. Feature Extraction:
- Once the text is represented numerically, these representations can be used as features for the model.
- Features could be the counts of words in BoW, the weighted values in TF-IDF, or the numerical vectors in embeddings.
3. Logistic Regression Model:
- Logistic Regression models the relationship between the features and the probability of belonging to a particular class using the logistic function.
- The logistic function (also called the sigmoid function) maps any real-valued number into the range [0, 1], which is suitable for representing probabilities.
- The logistic regression model calculates a weighted sum of the input features and applies the logistic function to obtain the probability of belonging to the positive class.
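The weighted sum and sigmoid described above can be sketched in a few lines of NumPy. The weights, bias, and feature counts here are hypothetical values chosen for illustration, not learned parameters:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights, bias, and BoW feature counts
weights = np.array([1.2, -0.5, 0.8])
features = np.array([2, 1, 0])
bias = -1.0

z = np.dot(weights, features) + bias  # weighted sum of features
p = sigmoid(z)                        # probability of the positive class
print(p)                              # ≈ 0.711
```

During training, logistic regression adjusts the weights and bias so that these probabilities match the observed labels as closely as possible.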
Logistic Regression Text Classification with Scikit-Learn
We'll use the popular SMS Spam Collection Dataset, which consists of SMS (Short Message Service) messages labeled as either "ham" (legitimate) or "spam" (unwanted) based on their content. The implementation classifies each message into one of these two categories using a logistic regression model. The process is broken down into several key steps:
Step 1. Import Libraries
The first step involves importing necessary libraries.
- Pandas is used for data manipulation.
- CountVectorizer for converting text data into a numeric format.
- Various functions from sklearn.model_selection and sklearn.linear_model for creating and training the model.
- Functions from sklearn.metrics to evaluate the model's performance.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
Step 2. Load and Prepare the Data
- Load the dataset from a CSV file, and rename columns for clarity.
- The latin-1 encoding is specified to handle any non-ASCII characters that may be present in the file.
- Map labels from text to numeric values (0 for ham, 1 for spam), making it suitable for model training.
data = pd.read_csv("spam.csv", encoding='latin-1')
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
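To see what the renaming and label mapping produce, the same steps can be run on a small in-memory sample that mirrors the CSV's v1/v2 columns (used here so the snippet runs without spam.csv; the messages are invented examples):

```python
import pandas as pd

# Minimal in-memory sample mirroring the CSV's v1 (label) and v2 (text) columns
sample = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at 5", "WIN a FREE prize now!", "On my way"],
})
sample.rename(columns={"v1": "label", "v2": "text"}, inplace=True)
sample["label"] = sample["label"].map({"ham": 0, "spam": 1})

print(sample["label"].tolist())  # → [0, 1, 0]
```

After this step, the label column holds 0 for ham and 1 for spam, which is the numeric form the model expects.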