Handling Categorical Data in Python
Categorical data takes its values from a fixed set of predefined categories into which data points fall. Improper handling of this data can lead to errors in analysis and reduced model performance. This article provides a detailed guide to handling categorical data in Python, from identifying inconsistencies to encoding for machine learning.
Why Do We Need to Handle Categorical Data?
Handling categorical data is crucial because:
- Algorithms Require Numerical Inputs: Most machine learning algorithms cannot directly process categorical data and need it to be converted into numerical formats.
- Inconsistent Categories: Categorical data often contains inconsistencies like typos, case sensitivity, or alternate spellings. These must be standardized to avoid treating them as separate categories.
- Remapping Categories: Some categories might need to be grouped for simplicity and relevance, for example by remapping rare categories into an "Other" group.
- Improves Model Performance: Proper encoding techniques such as one-hot encoding or label encoding present categories in a form that models can use, which leads to better predictions.
- Handles Real-World Complexity: Categorical data is used in many domains such as E-commerce, Finance, Healthcare, etc.
In short, handling categorical data correctly ensures that models can use it effectively, avoids errors, and gives better insights.
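As a rough illustration of the remapping point above, the following sketch groups infrequent categories into "Other". The sample data, the `city` column name and the `min_count` threshold are all assumptions made for this example and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical sample data; the column name and threshold are illustrative only.
cities = pd.Series(
    ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Agra"],
    name="city",
)

# Count how often each category occurs.
counts = cities.value_counts()

# Categories seen fewer than `min_count` times are treated as rare.
min_count = 2
rare = counts[counts < min_count].index

# Keep frequent categories and replace rare ones with "Other".
grouped = cities.where(~cities.isin(rare), "Other")

print(grouped.value_counts().to_dict())
```

The right threshold depends on the dataset; a value that is too high can merge categories that still carry useful signal.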
Python Implementation to Handle Categorical Data
Various techniques can be used in Python to handle categorical data. To demonstrate them, we will use a dataset that contains some incorrect, invalid or meaningless entries (bogus values), caused by human error while filling in a survey form or by some other reason.
Import the required Python libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
Now let’s load the dataset into the pandas dataframe.
main_data = pd.read_csv('demographics.csv')
main_data.head()
Output:

First five rows of the dataset
To understand membership constraints, consider the blood type feature. We need to verify whether this feature contains bogus values. First, we create a data frame with all the valid values of blood type.
# create a new dataframe with possible values for blood type
blood_type_categories = pd.DataFrame({
'blood_type': ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']
})
blood_type_categories
Output:

1. Finding Bogus Values
The bogus values can now be found using the set difference method.
# finding bogus categories
unique_blood_types_main = set(main_data['blood_type'])
bogus_blood_types = unique_blood_types_main.difference(
blood_type_categories['blood_type']
)
bogus_blood_types
Output:
{'C+', 'D-'}
Once the bogus values are found, the corresponding rows can be dropped from the dataset. In some scenarios, the values could be replaced with other values if reliable information is available. Here, since nothing is known about the true blood types, the rows will be dropped.
# flag the records with bogus blood types
bogus_records_index = main_data['blood_type'].isin(bogus_blood_types)
# drop the flagged records; copy() avoids a SettingWithCopyWarning
# when this data frame is modified later
without_bogus_records = main_data[~bogus_records_index].copy()
without_bogus_records['blood_type'].unique()
Output:
array(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'], dtype=object)
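The same membership check can also be expressed with a categorical dtype. The sketch below is one possible alternative, not the approach used above; the sample series is hypothetical and only stands in for the blood type column. It relies on the pandas behaviour that values outside the declared categories become NaN when cast.

```python
import pandas as pd

# Hypothetical sample standing in for the blood type column.
blood = pd.Series(["A+", "C+", "O-", "D-", "AB+"])

valid_types = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]

# Values outside `valid_types` become NaN when cast to this categorical dtype.
blood_cat = blood.astype(pd.CategoricalDtype(categories=valid_types))

# Rows flagged as invalid can be inspected before deciding to drop them.
invalid_rows = blood[blood_cat.isna()]
cleaned = blood_cat.dropna()

print(invalid_rows.tolist())
print(cleaned.tolist())
```

Keeping the invalid rows in a separate variable makes it easier to review them, or to correct them later if the true values become known.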
2. Inconsistent Categories Handling
Inconsistencies could arise in categorical data quite often. Consider the feature, marriage status. Let us take a look at all the unique values of marital status.
# exploring inconsistencies in marriage status category
main_data['marriage_status'].unique()
Output:
array(['married', 'MARRIED', ' married', 'unmarried ', 'divorced', 'unmarried', 'UNMARRIED', 'separated'], dtype=object)
It is quite evident that there are redundant categories due to leading and trailing spaces as well as capital letters. First, let us deal with capital letters.
# convert all values to lowercase
inconsistent_data = main_data.copy()
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status']\
.str.lower()
inconsistent_data['marriage_status'].unique()
Output:
array(['married', ' married', 'unmarried ', 'divorced', 'unmarried', 'separated'], dtype=object)
Next, we will deal with leading and trailing spaces.
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status']\
.str.strip()
inconsistent_data['marriage_status'].unique()
Output:
array(['married', 'unmarried', 'divorced', 'separated'], dtype=object)
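The two clean-up steps can be chained, and alternate spellings can be collapsed in the same pass. The following is a minimal sketch on made-up values; treating "single" as another spelling of "unmarried" is purely an assumption for illustration, and the `aliases` dictionary is hypothetical.

```python
import pandas as pd

# Hypothetical sample; "Single" as an alias for "unmarried" is an assumption.
status = pd.Series(["married", "MARRIED", " married", "unmarried ", "Single"])

# Hypothetical alias map: alternate spelling -> canonical category.
aliases = {"single": "unmarried"}

normalized = (
    status.str.strip()       # remove leading and trailing whitespace
          .str.lower()       # make the case consistent
          .replace(aliases)  # collapse alternate spellings
)

print(normalized.unique())
```

Stripping and lowercasing first keeps the alias map small, since it only needs one entry per spelling rather than one per case and spacing variant.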
3. Handle Remapping Categories
Numerical data like age or income can be mapped to different groups. This helps in getting more insights about the dataset. Let us explore the income feature.
# range of income in the dataset
print(f"Max income - {main_data['income'].max()}, "
      f"Min income - {main_data['income'].min()}")
Output:
Max income - 190000, Min income - 40000
Now, let us create the range and labels for the income feature. Pandas’ cut method is used to achieve this.
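# create the groups for income
# (named bins rather than range, so the built-in range() is not shadowed)
bins = [40000, 75000, 100000, 125000, 150000, np.inf]
labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']
remapping_data = main_data.copy()
# include_lowest=True keeps the minimum income (40000) in the first group
remapping_data['income_groups'] = pd.cut(remapping_data['income'],
                                         bins=bins,
                                         labels=labels,
                                         include_lowest=True)
remapping_data.head()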
Output:

First five rows of the dataset.
Now, it is easier to visualize the distribution.
remapping_data['income_groups'].value_counts().plot.bar()
Output:

Barplot for the count of each income category
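By default, `pd.cut` builds right-closed intervals, so a value equal to the first bin edge is left unassigned unless `include_lowest=True` is passed. The sketch below checks this on a few hand-picked incomes that sit on or near the edges; the sample values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hand-picked incomes that sit on or near the bin edges.
income = pd.Series([40000, 75000, 75001, 150000, 190000])

bins = [40000, 75000, 100000, 125000, 150000, np.inf]
labels = ["40k-75k", "75k-100k", "100k-125k", "125k-150k", "150k+"]

# Default behaviour: intervals are (left, right], so 40000 falls outside.
default_groups = pd.cut(income, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_groups = pd.cut(income, bins=bins, labels=labels, include_lowest=True)

print(default_groups.isna().tolist())
print(inclusive_groups.tolist())
```

Checking `isna()` on the binned column after calling `pd.cut` is a quick way to spot values that fell outside every interval.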
4. Cleaning Categorical Data in Python
To understand this problem, we create a new data frame with just one feature, phone numbers.
import random

phone_numbers = []
for i in range(100):
    # phone numbers could be of length 9 or 10
    number = random.randint(100000000, 9999999999)
    # +91 code is inserted in some cases
    if i % 2 == 0:
        phone_numbers.append('+91 ' + str(number))
    else:
        phone_numbers.append(str(number))
phone_numbers_data = pd.DataFrame({
    'phone_numbers': phone_numbers
})
phone_numbers_data.head()
Output:

Based on the use case, the country code could either be dropped or added where it is missing. Similarly, phone numbers with fewer than 10 digits should be discarded.
# remove the country code; regex=False treats '+91 ' as a literal string
phone_numbers_data['phone_numbers'] = phone_numbers_data['phone_numbers']\
    .str.replace('+91 ', '', regex=False)
# keep only the numbers that have at least 10 digits
num_digits = phone_numbers_data['phone_numbers'].str.len()
phone_numbers_data = phone_numbers_data[num_digits >= 10]
phone_numbers_data.head()
Output:

Finally, we can verify whether the data is clean or not.
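# no number should still contain the country code
assert not phone_numbers_data['phone_numbers'].str.contains('+91 ', regex=False).any()
# every remaining number should have exactly 10 digits
assert (phone_numbers_data['phone_numbers'].str.len() == 10).all()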
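A length check alone would also accept strings that contain letters or punctuation. As one possible stricter variant, the sketch below validates the cleaned values against a pattern with `str.fullmatch`; the sample values are hypothetical, and the 10-digit rule follows the text above.

```python
import pandas as pd

# Hypothetical raw values; the 10-digit rule and "+91 " prefix follow the text above.
raw = pd.Series(["+91 9876543210", "987654321", "9123456780", "+91 12345"])

# Strip the optional country code as a literal string, not a regex.
stripped = raw.str.replace("+91 ", "", regex=False)

# Keep only entries made of exactly ten digits.
is_valid = stripped.str.fullmatch(r"\d{10}")
clean = stripped[is_valid].reset_index(drop=True)

print(clean.tolist())
```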
5. Visualizing Categorical Data in Python Pandas
Various plots could be used to visualize categorical data to get more insights about the data. So, let us visualize the number of people belonging to each blood type. We will make use of the seaborn library to achieve this.
sns.countplot(x='blood_type',
data=without_bogus_records)
Output:

Countplot for blood_type category
Furthermore, we can see the relationship between income and the marital status of a person using a boxplot.
sns.boxplot(x='marriage_status',
y='income',
data=inconsistent_data)
Output:

Boxplot for marriage_status with income
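It can help to look at the numbers behind these plots as well. The sketch below computes, on a small made-up data frame, the per-category counts that a count plot draws as bars and the per-category medians that a box plot draws as the centre lines. The sample values are assumptions; only the column names match the dataset above.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset above.
df = pd.DataFrame({
    "marriage_status": ["married", "unmarried", "married", "divorced", "unmarried"],
    "income": [90000, 60000, 110000, 70000, 80000],
})

# Counts per category: the heights of the bars in a count plot.
status_counts = df["marriage_status"].value_counts()

# Median income per category: the centre line of each box in a box plot.
median_income = df.groupby("marriage_status")["income"].median()

print(status_counts.to_dict())
print(median_income.to_dict())
```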
Encoding Categorical Data in Python
Certain learning algorithms like regression and neural networks require their input to be numbers. Hence, categorical data must be converted to numbers to use these algorithms. Let us take a look at some encoding methods.
1. Label Encoding in Python
With label encoding, we can number the categories from 0 to num_categories - 1. Let us apply label encoding on the blood type feature.
le = LabelEncoder()
without_bogus_records['blood_type'] = le.fit_transform(
without_bogus_records['blood_type'])
without_bogus_records['blood_type'].unique()
Output:
array([0, 4, 1, 3, 2, 5, 7, 6])
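Encoded integers are hard to read on their own, so it is often useful to keep the mapping back to the original labels. The sketch below shows one way to do this with the fitted encoder's `classes_` attribute and `inverse_transform` method; the sample series is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of already-cleaned blood types.
blood = pd.Series(["A+", "B+", "A-", "O+", "B+"])

le = LabelEncoder()
codes = le.fit_transform(blood)

# classes_ holds the sorted categories; the position of each one is its code.
code_to_label = dict(enumerate(le.classes_))

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(codes)

print(codes.tolist())
print(code_to_label)
```

Because the codes follow the sorted order of the labels, they carry no meaning beyond identifying the category.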
2. One-hot Encoding in Python
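Label encoding assigns an integer to each category, which can suggest an order that does not exist (for example, that one marital status is "greater" than another). One-hot encoding avoids this limitation by creating a separate binary column for each category.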
inconsistent_data = pd.get_dummies(inconsistent_data,
columns=['marriage_status'])
inconsistent_data.head()
Output:

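The effect of one-hot encoding is easier to see on a few rows. The sketch below uses a small hypothetical data frame and two optional `get_dummies` parameters: `dtype=int` to get 0/1 columns, and `drop_first=True` to drop one redundant column. Whether to drop a column is a modelling choice and is not something the steps above require.

```python
import pandas as pd

# Hypothetical sample of cleaned marital statuses.
df = pd.DataFrame({"marriage_status": ["married", "unmarried", "divorced", "married"]})

# dtype=int gives 0/1 columns instead of booleans.
one_hot = pd.get_dummies(df, columns=["marriage_status"], dtype=int)

# drop_first=True removes one redundant column, which some linear models prefer.
one_hot_reduced = pd.get_dummies(
    df, columns=["marriage_status"], dtype=int, drop_first=True
)

print(list(one_hot.columns))
print(list(one_hot_reduced.columns))
```

In the full encoding, every row has exactly one column set to 1, so no category is ranked above another.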
3. Ordinal Encoding in Python
Categorical data can be ordinal, where the order of the categories matters. For such features, we want to preserve the order after encoding as well. We will perform ordinal encoding on the income groups, preserving the order 40k-75k < 75k-100k < 100k-125k < 125k-150k < 150k+.
custom_map = {'40k-75k': 1, '75k-100k': 2, '100k-125k': 3,
'125k-150k': 4, '150k+': 5}
remapping_data['income_groups'] = remapping_data['income_groups']\
.map(custom_map)
remapping_data.head()
Output:

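An alternative to a hand-written dictionary is an ordered categorical dtype, which stores the ranking together with the data. The following is a minimal sketch under that assumption; the sample series is hypothetical, and the `+ 1` is only there to match the 1-to-5 numbering of the custom map above.

```python
import pandas as pd

labels = ["40k-75k", "75k-100k", "100k-125k", "125k-150k", "150k+"]

# Hypothetical sample of income groups.
groups = pd.Series(["75k-100k", "150k+", "40k-75k", "100k-125k"])

# An ordered categorical dtype records the ranking of the labels explicitly.
ordered_dtype = pd.CategoricalDtype(categories=labels, ordered=True)
ordered_groups = groups.astype(ordered_dtype)

# cat.codes are 0-based positions in `labels`; adding 1 matches a 1-to-5 mapping.
encoded = ordered_groups.cat.codes + 1

print(encoded.tolist())
print((ordered_groups < "150k+").tolist())
```

With an ordered dtype, comparisons and sorting follow the declared order of the labels instead of their alphabetical order.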
Handling categorical data is an essential step in data preprocessing. Each step, from identifying inconsistencies to encoding and visualizing, helps ensure that models can use this data effectively. By standardizing categories, addressing bogus values and choosing the right encoding techniques, we can improve the quality of analysis and model performance.