Python Data Cleaning: A How-to Guide for Beginners
Have you ever wondered why your data analysis sometimes yields unexpected results or errors? Do you know how to ensure the data for your project is accurate and reliable? Are you curious about the tools and techniques data professionals use to clean messy datasets? Have you ever struggled with data containing misspelled words, mixed-case text, or missing values? Would you like to discover how Python, a versatile programming language, helps you automate the process of data cleaning?
Do these questions spark your curiosity or resonate with your data-related experiences? If so, you've come to the right place!
In this article, we dive deep into the world of data cleaning in Python. We explore what data cleaning is, why it is crucial, and how you can harness the power of Python. We also explain two of the most helpful Python data-cleaning modules, pandas and NumPy, to transform messy datasets into valuable insights.
Let's embark on a journey to discover the art and science of data cleaning. We will see why it is an indispensable skill for data analysts, scientists, and programmers alike.
Data is the lifeblood of modern decision-making and analysis. Whether you are a data analyst, data scientist, or any kind of programmer dealing with data, one thing is clear: the quality of your data greatly influences the quality of your insights and results.
This is where data cleaning in Python comes into play. It is such an important topic that there are two tracks on LearnPython.com dedicated to processing data: Data Processing in Python and Python for Data Science. The courses in these tracks walk you through how to process various data formats in Python, such as CSV, JSON, and Excel files.
There are many coding exercises, worth dozens of hours of practice! Even so, you only need a laptop and an internet connection to get started.
And yes, we may even use Python to manipulate data stored in Excel tables. There are a few useful Python packages for working with Excel.
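As a quick illustration (a minimal sketch, assuming the openpyxl package is installed and using a hypothetical file named players.xlsx), pandas can read an Excel sheet directly into a DataFrame:

import pandas as pd

# Read the first sheet of a hypothetical Excel file into a DataFrame
excel_df = pd.read_excel("players.xlsx", sheet_name=0)
print(excel_df.head())

From there, the same cleaning techniques shown below apply.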
What Is Python Data Cleaning?
Data cleaning, or data cleansing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. It ensures the data you work with is accurate, reliable, and ready for analysis, visualization, or further processing. It is a critical step in the data preparation pipeline.
Who Needs Data Cleaning in Python?
Anyone who analyzes, visualizes, or processes data benefits from data cleaning. This includes:
- Data analysts who rely on clean data to generate accurate reports and insights.
- Data scientists whose machine learning model performance depends on the quality of the training data.
- Other programmers who need clean data for various purposes, from creating web applications to making data-driven decisions.
Why Is Cleaning Data in Python Important?
Data cleaning is essential for several reasons:
- Data Accuracy: Clean data ensures the accuracy of your analysis. Repetitive or erroneous values may skew statistics and lead to incorrect conclusions.
- Data Consistency: Cleaning data helps maintain consistency in the dataset, making it easier to work with and preventing unexpected errors in the code.
- Data Quality: Removing bad spelling, fixing date formats, and addressing other common data issues enhances data quality (see the short sketch after this list).
- Efficiency: Cleaning data streamlines the data processing pipeline, saving time and resources.
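To make the data-quality point concrete, here is a minimal sketch with made-up column names and values (it is not part of the NFL example used later): it normalizes mixed-case text, fixes a known misspelling, and parses date strings into proper datetime values.

import pandas as pd

# Hypothetical messy data: inconsistent letter case, a misspelling, and text dates
messy = pd.DataFrame({
    'team': [' Steelers', 'STEELERS', 'stelers '],
    'game_date': ['09/10/2009', '09/13/2009', '09/20/2009']
})

# Normalize whitespace and letter case
messy['team'] = messy['team'].str.strip().str.lower()

# Fix a known misspelling
messy['team'] = messy['team'].replace({'stelers': 'steelers'})

# Parse the date strings into datetime values
messy['game_date'] = pd.to_datetime(messy['game_date'], format='%m/%d/%Y')

print(messy)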
Cleaning Data in Python
Python provides a robust environment for data cleaning, thanks to libraries like pandas and NumPy. While you can clean data manually using tools like Excel, Python allows for automation, making it ideal for larger datasets and repetitive tasks.
Let's look at an example. We will use the NFL dataset from Kaggle. It is an open-access dataset containing a lot of American football data. It may, for example, be used to predict the probability of winning and therefore be of interest to those involved in algorithmic betting.
First, we use Python pandas to load the data, display it as a table, and check the shape of the dataset.
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017.csv")
df.head()
| | Date | GameID | Drive | qtr | down | ... | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | NaN | ... | NaN | 2009 |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | ... | 0.036899 | 2009 |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | ... | NaN | 2009 |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | ... | -0.156239 | 2009 |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | ... | NaN | 2009 |
5 rows × 102 columns

# Display the shape of the dataset
df.shape
Output:
(407688, 102)
We have a huge dataset of 407,688 rows and 102 columns. For this article, we focus our Python data cleaning on the first 10 columns. Let's use df.iloc to select the relevant columns.
df = df.iloc[:, :10]
df.head()
Below, we see the first 10 columns of our dataset.
| | Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | NaN | 15:00 | 15 | 3600.0 | 0.0 | TEN |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT |
We also see that the column names follow inconsistent naming conventions, with a mix of lowercase and uppercase letters. This may create problems such as duplicated columns: for example, GameID and gameID may hold duplicated content. To avoid this pitfall, let's use pandas to lowercase all column names.
df.columns = df.columns.str.lower()
Here is our new list of column names:
Index(['date', 'gameid', 'drive', 'qtr', 'down', 'time', 'timeunder', 'timesecs', 'playtimediff', 'sideoffield'], dtype='object')
Next, let's see if there are any missing values. With Python pandas, dealing with null values is very straightforward!
# Missing values
missing_count = df.isnull().sum()
missing_count
Output:
| Column name | Number of missing values |
|---|---|
| date | 0 |
| gameid | 0 |
| drive | 0 |
| qtr | 0 |
| down | 61154 |
| time | 224 |
| timeunder | 0 |
| timesecs | 224 |
| playtimediff | 444 |
| sideoffield | 528 |
How about in total?
# Total of missing values
total_cells = np.prod(df.shape)
total_missing = df.isnull().sum().sum()

# Percentage of missing data
(total_missing / total_cells) * 100
A lot is going on in this code! Let’s explain it step by step.
The shape attribute returns a tuple containing the number of rows and columns in the dataset, and np.prod multiplies these two values to give the total number of cells.

Next, we call isnull().sum() to get the count of missing values for each column. Then, we call sum() again to get the total count of missing values.
Finally, we output the percentage of missing values for the entire dataset.
# Output
1.5348501795490668
Missing values are only a small percentage in these 10 columns. As a good data scientist, your role here is to understand why these values are missing. It may simply be that they do not exist.
There are many ways to deal with missing data. We will get to handling missing values with pandas shortly. But let’s start by trying to remove the columns with one or more missing values as follows:
# Remove all columns with at least one missing value
new_df = df.dropna(axis=1)
new_df.head()
Here, the axis parameter needs to be set to 1 to apply the function to columns. With axis=0, it would instead remove the rows with at least one missing value, which is not what we want. Note that the default is axis=0.
Let’s look at what happens.
| | date | gameid | drive | qtr | timeunder |
|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | 15 |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 15 |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 15 |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 14 |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 14 |
The downside is the loss of data that comes with it. How much data do we lose in the process?
# Data loss
print("Number of columns in the original dataset: %d \n" % df.shape[1])
print("Number of columns with NaN values removed: %d" % new_df.shape[1])

# Output
Number of columns in the original dataset: 10
Number of columns with NaN values removed: 5
As we see, we just lost half of our dataset.
Another way to address missing values is to leverage pandas to fill them in. For example, we may replace a NaN with a 0.
# Replace missing values with 0
df = df.fillna(0)
df.head()
| | date | gameid | drive | qtr | down | time | timeunder | timesecs | playtimediff | sideoffield |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-09-10 | 2009091000 | 1 | 1 | 0.0 | 15:00 | 15 | 3600.0 | 0.0 | TEN |
| 1 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT |
| 2 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT |
| 3 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT |
| 4 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT |
However, be thoughtful when dealing with missing data. For example, filling missing values with 0s for an age variable does not make much sense. Here, though not perfect, it may make sense to fill the missing values with 0s in the context of your analysis.
Missing values are also commonly replaced with median or mean values. If a variable is important, you may want to spend a little more time and use a machine-learning model to find the best value. In any case, think about what makes the most sense in each scenario.
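As a sketch of the model-based option (this assumes scikit-learn is available and uses made-up numbers, not the NFL data), a k-nearest-neighbors imputer estimates a missing value from the most similar rows:

import numpy as np
from sklearn.impute import KNNImputer

# Made-up numeric features; the second row is missing its second value
X = np.array([
    [25.0, 180.0],
    [27.0, np.nan],
    [24.0, 175.0],
    [30.0, 190.0],
])

# Estimate the missing value from the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)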
In the dataset above, many of the missing values are due to data not being recorded for one reason or another, so filling the NaN values with 0 may or may not be a good option. Let's do one last example to illustrate, using a small dataset with the last names, first names, and ages of some fictitious NFL players.
First, let's create our dataset. We create a Python dictionary in which the keys are the column names, and we use NumPy's np.nan to generate the NaN values.
data = {
    'lastname': ['Douglas', 'Smith', 'Doe', 'Peterson', 'Johnson', 'Jones', 'Anderson'],
    'firstname': ['John', 'Daniel', 'Michael', 'David', 'Robert', 'Anthony', 'Thomas'],
    'age': [23, 25, np.nan, 22, np.nan, np.nan, 28]
}
Next, we use pandas to create a DataFrame from this Python dictionary:
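A single call to the pd.DataFrame constructor does the job; we name the result df_players, the name used in the fillna call below:

# Build the DataFrame from the dictionary
df_players = pd.DataFrame(data)
df_players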
| | age | firstname | lastname |
|---|---|---|---|
| 0 | 23 | John | Douglas |
| 1 | 25 | Daniel | Smith |
| 2 | NaN | Michael | Doe |
| 3 | 22 | David | Peterson |
| 4 | NaN | Robert | Johnson |
| 5 | NaN | Anthony | Jones |
| 6 | 28 | Thomas | Anderson |
You see that the age is missing for some players. As mentioned earlier, we do not want to fill an age value with 0.
Let’s say it is more desirable to use something representing the middle value in this case. So, we choose to replace the missing values with the median value. It does not mean we may not use the mean (no pun intended!); it may very well be an option. Feel free to try both and see what makes better sense.
Let's replace the NaN values in the age column with the median age. We assign the result back to the age column so that the change is saved in the dataset.
df_players['age'] = df_players['age'].fillna(df_players['age'].median())
Here is the updated dataset:
| | age | firstname | lastname |
|---|---|---|---|
| 0 | 23 | John | Douglas |
| 1 | 25 | Daniel | Smith |
| 2 | 24 | Michael | Doe |
| 3 | 22 | David | Peterson |
| 4 | 24 | Robert | Johnson |
| 5 | 24 | Anthony | Jones |
| 6 | 28 | Thomas | Anderson |
All right! Now you know how to use the median to fill the missing values. Similarly, you can use the mean or the mode if either of these is desired.
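For reference, here is a minimal sketch of the same idea using the mean and the mode, starting again from the dataset with missing ages (mode() returns a Series, so we take its first entry):

# Fill missing ages with the mean instead of the median
df_players['age'] = df_players['age'].fillna(df_players['age'].mean())

# Or fill them with the mode (the most frequent value; with ties, the smallest is used here)
df_players['age'] = df_players['age'].fillna(df_players['age'].mode()[0])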
Of course, there is much more to data cleaning. It is time-consuming to address all the issues in a dataset. There may be inconsistent data formats, incorrect values, or mixed letter cases in strings. However, it is a necessary step to ensure we get as much insight as possible from our data.
Once done with data cleaning, you may move on to visualizing the data using powerful Python visualization libraries such as Matplotlib.
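For instance (a minimal sketch, assuming Matplotlib is installed), the cleaned age column can be plotted straight from pandas:

import matplotlib.pyplot as plt

# Plot the distribution of the cleaned age column
df_players['age'].plot(kind='hist', bins=5, title='Age distribution')
plt.xlabel('age')
plt.show()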
Practice Cleaning Data in Python!
Data cleaning in Python is a critical step in any data-related project. Ensuring data accuracy, consistency, and quality sets the foundation for meaningful analysis and insights. Powerful libraries like Python pandas and Python NumPy let you automate the process and handle even large datasets, making data cleaning efficient and effective.
Clean data leads to more reliable results. Take advantage of this essential step in your data workflow. And because practice helps you achieve your goals, get your hands dirty on some data and try data cleaning in Python.
Last but not least, if you have not yet done so, take a look at our courses on Python for data science and working with Excel files in Python.
Remember to visit LearnPython.com to keep learning about Python!