Python Data Cleaning
DataFrame
Access a column using dot notation (df.column_name) or bracket notation (df['column_name']).
# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt
# boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90)
plt.show()
# scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
In this exercise, you will practice melting a DataFrame using pd.melt(). There are
two parameters you should be aware of: id_vars and value_vars. The id_vars
represent the columns of the data you do not want to melt (i.e., keep them in their
current shape), while the value_vars represent the columns you do wish to melt into
rows. By default, if no value_vars are provided, all columns not set in the id_vars
will be melted. This could save a bit of typing, depending on the number of columns
that need to be melted.
You can rename the variable column by specifying an argument to the var_name
parameter, and the value column by specifying an argument to the value_name
parameter. You will now practice doing exactly this. pandas (as pd) and the DataFrame
airquality have been pre-loaded for you.
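A minimal sketch of the melt call, assuming 'Month' and 'Day' are the columns to keep fixed and the new columns are named 'measurement' and 'reading' (as in the pivot exercise further below):
# Melt airquality, keeping 'Month' and 'Day' fixed
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'],
                          var_name='measurement', value_name='reading')
# Print the head of the melted DataFrame
print(airquality_melt.head())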
Pivot data
Pivoting data is the opposite of melting it. While melting takes a set of columns
and turns it into a single column, pivoting will create a new column for each
unique value in a specified column.
.pivot_table() takes three key parameters:
index: the columns NOT to pivot (similar to the id_vars parameter of pd.melt())
columns: the name of the column you want to pivot
values: the values to be used when the column is pivoted
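As a sketch, here is how the melted airquality data from above could be pivoted back, using the same column names:
# Pivot airquality_melt so each 'measurement' value becomes a column again
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'],
                                               columns='measurement',
                                               values='reading')
print(airquality_pivot.head())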
There's a very simple method you can use to get back the original DataFrame from
the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method
in the video, but you're now going to practice using it in this exercise to get
back the original DataFrame from airquality_pivot, which has been pre-loaded.
Print the index of airquality_pivot by accessing its .index attribute. This has
been done for you.
Reset the index of airquality_pivot using its .reset_index() method.
Print the new index of airquality_pivot_reset.
Print the head of airquality_pivot_reset.
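Those steps as a sketch:
# Print the index of airquality_pivot
print(airquality_pivot.index)
# Reset the index of airquality_pivot
airquality_pivot_reset = airquality_pivot.reset_index()
# Print the new index and the head of the flattened DataFrame
print(airquality_pivot_reset.index)
print(airquality_pivot_reset.head())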
You'll see that by using .pivot_table() and the aggfunc parameter, you can not only
reshape your data, but also remove duplicates. Finally, you can then flatten the
columns of the pivoted DataFrame using .reset_index().
Pivot airquality_dup by using .pivot_table() with the rows indexed by 'Month' and
'Day', the columns indexed by 'measurement', and the values populated with
'reading'. Use np.mean for the aggregation function.
Print the head of airquality_pivot.
Flatten airquality_pivot by resetting its index.
Print the head of airquality_pivot and then the original airquality DataFrame to
compare their structure.
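A sketch of the de-duplicating pivot (assuming NumPy still needs to be imported):
# Import numpy for the aggregation function
import numpy as np
# Pivot airquality_dup, averaging duplicate readings with np.mean
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'],
                                              columns='measurement',
                                              values='reading',
                                              aggfunc=np.mean)
print(airquality_pivot.head())
# Flatten the pivoted DataFrame and compare with the original
airquality_pivot = airquality_pivot.reset_index()
print(airquality_pivot.head())
print(airquality.head())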
In this exercise, you're going to tidy the 'm014' column, which represents males
aged 0-14 years. In order to parse this value, you need to extract the first
letter into a new column for gender, and the rest into a column for age_group.
Here, since you can parse values by position, you can take advantage of pandas'
vectorized string slicing by using the str attribute of columns of type object.
Begin by printing the columns of tb in the IPython Shell using its .columns
attribute, and take note of the problematic column.
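A sketch of the slicing, assuming tb is first melted so the 'm014'-style column names end up in a 'variable' column (the id_vars 'country' and 'year' are assumptions here):
# Melt tb so the old column names land in a 'variable' column
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])
# First character -> gender, remaining characters -> age_group
tb_melt['gender'] = tb_melt.variable.str[0]
tb_melt['age_group'] = tb_melt.variable.str[1:]
print(tb_melt.head())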
Print the columns of ebola in the IPython Shell using ebola.columns. Notice that
the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the
underscore _ serves as a delimiter between the first part (cases or deaths), and
the second part (country).
This time, you cannot directly slice the variable by position as in the previous
exercise. You now need to use Python's built-in string method called .split(). By
default, this method will split a string into parts separated by a space. However,
in this case you want it to split by an underscore. You can do this on
'Cases_Guinea', for example, using 'Cases_Guinea'.split('_'), which returns the
list ['Cases', 'Guinea'].
The next challenge is to extract the first element of this list and assign it to a
type variable, and the second element of the list to a country variable. You can
accomplish this by accessing the str attribute of the column and using the .get()
method to retrieve the 0 or 1 index, depending on the part you want.
Melt ebola using 'Date' and 'Day' as the id_vars, 'type_country' as the var_name,
and 'counts' as the value_name.
Create a column called 'str_split' by splitting the 'type_country' column of
ebola_melt on '_'. Note that you will first have to access the str attribute of
type_country before you can use .split().
Create a column called 'type' by using the .get() method to retrieve index 0 of the
'str_split' column of ebola_melt.
Create a column called 'country' by using the .get() method to retrieve index 1 of
the 'str_split' column of ebola_melt.
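Those steps as a sketch:
# Melt ebola: 'Date' and 'Day' stay fixed, old column names go into 'type_country'
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'],
                     var_name='type_country', value_name='counts')
# Split 'type_country' on '_' into a list column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# Pull out the first and second elements of each list
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())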
Three DataFrames have been pre-loaded: uber1, which contains data for April 2014,
uber2, which contains data for May 2014, and uber3, which contains data for June
2014. Your job in this exercise is to concatenate these DataFrames together such
that the resulting DataFrame has the data for all three months.
Begin by exploring the structure of these three DataFrames in the IPython Shell
using methods such as .head().
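A minimal sketch of the row-wise concatenation (the result name row_concat is just a placeholder):
# Concatenate uber1, uber2, and uber3 row-wise
row_concat = pd.concat([uber1, uber2, uber3])
# Print the shape and head of the result
print(row_concat.shape)
print(row_concat.head())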
You'll return to the Ebola dataset you worked with briefly in the last chapter. It
has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the
status and country of a patient are contained in a single column. This column has
been parsed into a new DataFrame, status_country, where there are separate columns
for status and country.
Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job
is to concatenate them column-wise in order to obtain a final, clean DataFrame.
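A minimal sketch of the column-wise concatenation (the result name ebola_tidy is just a placeholder):
# Concatenate ebola_melt and status_country column-wise
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)
# Print the shape and head of the result
print(ebola_tidy.shape)
print(ebola_tidy.head())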
As Dan showed you in the video, the glob module has a function called glob that
takes a pattern and returns a list of the files in the working directory that match
that pattern.
For example, if you know the pattern is part_ followed by a single-digit number and
then .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv,
part_2.csv, part_3.csv, etc.).
Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'.
The ? wildcard represents any 1 character, and the * wildcard represents any number
of characters.
Import the glob module along with pandas (as its usual alias pd).
Write a pattern to match all .csv files.
Save all files that match the pattern using the glob() function within the glob
module: glob.glob().
Print the list of file names.
Read the second file in csv_files (i.e., index 1) into a DataFrame called csv2.
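Those steps as a sketch:
# Import the necessary modules
import glob
import pandas as pd
# Write the pattern and save all matching file names
pattern = '*.csv'
csv_files = glob.glob(pattern)
print(csv_files)
# Load the second file into a DataFrame
csv2 = pd.read_csv(csv_files[1])
print(csv2.head())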
You'll start with an empty list called frames. Your job is to use a for loop to iterate over csv_files, read each file into a DataFrame, and append it to frames, so that the list can then be concatenated into a single DataFrame.
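A sketch of that loop (the final name uber is just a placeholder):
# Build a list of DataFrames, one per file, then combine them
frames = []
for csv in csv_files:
    df = pd.read_csv(csv)
    frames.append(df)
# Concatenate all DataFrames in frames into one
uber = pd.concat(frames)
print(uber.shape)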
The tips dataset has been loaded into a DataFrame called tips. This data contains
information about how much a customer tipped, whether the customer was male or
female, a smoker or not, etc.
Look at the output of tips.info() in the IPython Shell. You'll note that two
columns that should be categorical - sex and smoker - are instead of type object,
which is pandas' way of storing arbitrary strings. Your job is to convert these two
columns to type category and note the reduced memory usage.
Convert the sex column of the tips DataFrame to type 'category' using the .astype() method.
Convert the smoker column of the tips DataFrame in the same way.
Print the memory usage of tips after converting the data types of the columns:
.info()
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')
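Completing the second conversion and checking the memory usage:
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')
# Print the memory usage after converting the data types
tips.info()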
You can use the pd.to_numeric() function to convert a column into a numeric data
type. If the function raises an error, you can be sure that there is a bad value
within the column. You can either use the techniques you learned in Chapter 1 to do
some exploratory data analysis and find the bad value, or you can choose to ignore
or coerce the value into a missing value, NaN.
A modified version of the tips dataset has been pre-loaded into a DataFrame called
tips. For instructional purposes, it has been pre-processed to introduce some 'bad'
data for you to clean. Use the .info() method to explore this. You'll note that the
total_bill and tip columns, which should be numeric, are instead of type object.
Your job is to fix this.
Instructions
Use pd.to_numeric() to convert the 'total_bill' column of tips to a numeric data
type. Coerce the errors to NaN by specifying the keyword argument errors='coerce'.
Convert the 'tip' column of 'tips' to a numeric data type exactly as you did for
the 'total_bill' column.
Print the info of tips to confirm that the data types of 'total_bill' and 'tip'
are numeric.
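A sketch of both conversions:
# Convert 'total_bill' and 'tip' to numeric, coercing bad values to NaN
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')
# Confirm the new data types
tips.info()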
Instructions
Import re.
Compile a pattern that matches a phone number of the format xxx-xxx-xxxx.
Use \d{x} to match x digits. Here you'll need to use it three times: twice to match
3 digits, and once to match 4 digits.
Place the regular expression inside re.compile().
Using the .match() method on prog, check whether the pattern matches the string
'123-456-7890'.
Using the same approach, now check whether the pattern matches the string
'1123-456-7890'.
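A sketch of the pattern and the two checks:
# Import the regular expression module
import re
# Compile the pattern: 3 digits - 3 digits - 4 digits
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
# See if the pattern matches each string
result = prog.match('123-456-7890')
print(bool(result))   # True
result2 = prog.match('1123-456-7890')
print(bool(result2))  # False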
Say you have the following string: 'the recipe calls for 6 strawberries and 2
bananas'.
It would be useful to extract the 6 and the 2 from this string to be saved for
later use when comparing strawberry to banana ratios.
When using a regular expression to extract multiple numbers (or multiple pattern
matches, to be exact), you can use the re.findall() function. Dan did not discuss
this in the video, but it is straightforward to use: You pass in a pattern and a
string to re.findall(), and it will return a list of the matches.
Instructions
Import re.
Write a pattern that will find all the numbers in the following string: 'the recipe
calls for 10 strawberries and 1 banana'. To do this:
Use the re.findall() function and pass it two arguments: the pattern, followed by
the string.
\d is the pattern required to find digits. This should be followed with a + so that
the previous element is matched one or more times. This ensures that 10 is viewed
as one number and not as 1 and 0.
Print the matches to confirm that your regular expression found the values 10 and
1.
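A minimal sketch:
# Import re and find all numbers: \d+ matches one or more digits
import re
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
print(matches)   # ['10', '1']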
Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For
each provided string, your job is to write the appropriate pattern to match it.
The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex'
column that contains the values 'Male' or 'Female'. Your job is to write a function
that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of
'sex' that are neither 'Female' nor 'Male'.
Recoding variables like this is a common data cleaning task. Functions provide a
mechanism for you to abstract away complex bits of code as well as reuse code. This
makes your code more readable and less error prone.
As Dan showed you in the videos, you can use the .apply() method to apply a
function across entire rows or columns of DataFrames. However, note that each
column of a DataFrame is a pandas Series. Functions can also be applied across
Series. Here, you will apply your function over the 'sex' column.
Instructions
Define a function named recode_gender() that has one parameter: gender.
If gender equals 'Male', return 1.
Else, if gender equals 'Female', return 0.
If gender does not equal 'Male' or 'Female', return np.nan. NumPy has been pre-
imported for you.
Apply your recode_gender() function over tips.sex using the .apply() method to
create a new column: 'recode'. Note that when passing in a function inside the
.apply() method, you don't need to specify the parentheses after the function name.
Hit 'Submit Answer' and take note of the new 'recode' column in the tips DataFrame!
# Define recode_gender()
def recode_gender(gender):
    # Return 1 if gender is 'Male', 0 if 'Female'
    if gender == 'Male':
        return 1
    elif gender == 'Female':
        return 0
    # Return np.nan otherwise
    else:
        return np.nan
# For simple recodes, you can also use the .replace() method. You can also
# convert the column into a categorical type.
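Applying the function to create the new column:
# Apply recode_gender over the 'sex' column to create 'recode'
tips['recode'] = tips.sex.apply(recode_gender)
print(tips.head())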
Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your
data more effectively: lambda functions. Instead of using the def syntax that you
used in the previous exercise, lambda functions let you make simple, one-line
functions.
For example, here's a function that squares a variable, which you could then use in an .apply() method:
def my_square(x):
    return x ** 2

df.apply(my_square)
The equivalent code using a lambda function is:
df.apply(lambda x: x ** 2)
The lambda function takes one parameter - the variable x. The function itself just
squares x and returns the result, which is whatever the one line of code evaluates
to. In this way, lambda functions can make your code concise and Pythonic.
The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to
clean its 'total_dollar' column by removing the dollar sign. You'll do this using
two different methods: With the .replace() method, and with regular expressions.
The regular expression module re has been pre-imported.
Instructions
Use the .replace() method inside a lambda function to remove the dollar sign from
the 'total_dollar' column of tips.
You need to specify two arguments to the .replace() method: The string to be
replaced ('$'), and the string to replace it by ('').
Apply the lambda function over the 'total_dollar' column of tips.
Use a regular expression to remove the dollar sign from the 'total_dollar' column
of tips.
The pattern has been provided for you: It is the first argument of the re.findall()
function.
Complete the rest of the lambda function and apply it over the 'total_dollar'
column of tips. Notice that because re.findall() returns a list, you have to slice
it in order to access the actual value.
Hit 'Submit Answer' to verify that you have removed the dollar sign from the
column.
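A sketch of both approaches (the new column names and the exact regex pattern are assumptions):
# Remove the dollar sign using .replace() inside a lambda
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))
# Remove the dollar sign using a regular expression; re.findall() returns a
# list, so slice out the first element
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
print(tips.head())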
A dataset consisting of the performance of songs on the Billboard charts has been
pre-loaded into a DataFrame called billboard. Check out its columns in the IPython
Shell. Your job in this exercise is to subset this DataFrame and then drop all
duplicate rows.
Instructions
Create a new DataFrame called tracks that contains the following columns from
billboard: 'year', 'artist', 'track', and 'time'.
Print the info of tracks. This has been done for you.
Drop duplicate rows from tracks using the .drop_duplicates() method. Save the
result to tracks_no_duplicates.
Print the info of tracks_no_duplicates. This has been done for you, so hit 'Submit
Answer' to see the results!
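Those steps as a sketch:
# Subset billboard to the identifying columns
tracks = billboard[['year', 'artist', 'track', 'time']]
tracks.info()
# Drop duplicate rows
tracks_no_duplicates = tracks.drop_duplicates()
tracks_no_duplicates.info()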
It's rare to have a (real-world) dataset without any missing values, and it's
important to deal with them because certain calculations cannot handle missing
values, while others will, by default, skip over them.
Also, understanding how much missing data you have, and thinking about where it
comes from is crucial to making unbiased interpretations of data.
Instructions
Calculate the mean of the Ozone column of airquality using the .mean() method on
airquality.Ozone.
Use the .fillna() method to replace all the missing values in the Ozone column of
airquality with the mean, oz_mean.
Hit 'Submit Answer' to see the result of filling in the missing values!
# Calculate the mean of the Ozone column
oz_mean = airquality.Ozone.mean()
# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)
In the video, you saw Dan use the .all() method together with the .notnull()
DataFrame method to check for missing values in a column. The .all() method returns
True if all values are True. When used on a DataFrame, it returns a Series of
Booleans - one for each column in the DataFrame. So if you are using it on a
DataFrame, like in this exercise, you need to chain another .all() method so that
you return only one True or False value. When using these within an assert
statement, nothing will be returned if the assert statement is true: This is how
you can confirm that the data you are checking are valid.
Instructions
Write an assert statement to confirm that there are no missing values in ebola.
Use the pd.notnull() function on ebola (or the .notnull() method of ebola) and
chain two .all() methods (that is, .all().all()). The first .all() method will
return a True or False for each column, while the second .all() method will return
a single True or False.
Write an assert statement to confirm that all values in ebola are greater than or
equal to 0.
Chain two .all() methods to the Boolean condition (ebola >= 0).
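Both assert statements as a sketch:
# Assert that there are no missing values in ebola
assert pd.notnull(ebola).all().all()
# Assert that all values in ebola are greater than or equal to 0
assert (ebola >= 0).all().all()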
Exploratory analysis
Whenever you obtain a new dataset, your first task should always be to do some
exploratory analysis to get a better understanding of the data and diagnose it for
any potential issues.
The Gapminder data for the 19th century has been loaded into a DataFrame called
g1800s. In the IPython Shell, use pandas methods such as .head(), .info(), and
.describe(), and DataFrame attributes like .columns and .shape to explore it.
Use the information that you acquire from your exploratory analysis to choose the
true statement from the options provided below.
The DataFrame g1800s has been pre-loaded. Your job in this exercise is to create a
scatter plot with life expectancy in '1800' on the x-axis and life expectancy in
'1899' on the y-axis.
Here, the goal is to visually check the data for insights as well as errors. When
looking at the plot, pay attention to whether the scatter plot takes the form of a
diagonal line, and which points fall below or above the diagonal line. This will
inform how life expectancy in 1899 changed (or did not change) compared to 1800 for
different countries. If points fall on a diagonal line, it means that life
expectancy remained the same!
Instructions
Import matplotlib.pyplot as plt.
Use the .plot() method on g1800s with kind='scatter' to create a scatter plot with
'1800' on the x-axis and '1899' on the y-axis.
Display the plot.
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
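Completing the plot described above:
# Create the scatter plot: life expectancy in 1800 vs 1899
g1800s.plot(kind='scatter', x='1800', y='1899')
# Display the plot
plt.show()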
Before continuing, however, it's important to make sure that the following
assumptions about the data are true: 'Life expectancy' is the first column of the
DataFrame, the remaining values are numeric and greater than or equal to 0 (or
missing), and each country appears only once.
Instructions
Define a function called check_null_or_valid() that takes in one argument:
row_data.
Inside the function, drop the missing values from row_data using .dropna() to create
no_na, then convert no_na to a numeric data type using pd.to_numeric().
Write an assert statement to make sure the first column (index 0) of the g1800s
DataFrame is 'Life expectancy'.
Write an assert statement to test that all the values are valid for the g1800s
DataFrame. Use the check_null_or_valid() function placed inside the .apply() method
for this. Note that because you're applying it over the entire DataFrame, and not
just one column, you'll have to chain the .all() method twice, and remember that
you don't have to use () for functions placed inside .apply().
Write an assert statement to make sure that each country occurs only once in the
data. Use the .value_counts() method on the 'Life expectancy' column for this.
Specifically, index 0 of .value_counts() will contain the most frequently occurring
value. If this is equal to 1 for the 'Life expectancy' column, then you can be
certain that no country appears more than once in the data.
def check_null_or_valid(row_data):
"""Function that takes a row of data,
drops all missing values,
and checks if all remaining values are greater than or equal to 0
"""
no_na = row_data.dropna()
numeric = pd.to_numeric(no_na)
ge0 = numeric >= 0
return ge0
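The accompanying assert statements might look like this (the function is applied over every column except the first, which holds the country names; that slice is an assumption):
# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'
# Check whether all remaining values are null or valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()
# Check that each country occurs only once
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1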
Your task in this exercise is to concatenate them into a single DataFrame called
gapminder. This is a row-wise concatenation, similar to how you concatenated the
monthly Uber datasets in Chapter 3.
Use pd.concat() to concatenate g1800s, g1900s, and g2000s into one DataFrame called
gapminder. Make sure you pass DataFrames to pd.concat() in the form of a list.
Print the shape and the head of the concatenated DataFrame.
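A minimal sketch:
# Concatenate the three DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])
# Print the shape and head of the result
print(gapminder.shape)
print(gapminder.head())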
Currently, the gapminder DataFrame has a separate column for each year. What you
want instead is a single column that contains the year, and a single column that
represents the average life expectancy for each year and country. By having year in
its own column, you can use it as a predictor variable in a later analysis.
You can convert the DataFrame into the desired tidy format by melting it.
Instructions
Reshape gapminder by melting it. Keep 'Life expectancy' fixed by specifying it as
an argument to the id_vars parameter.
Rename the three columns of the melted DataFrame to 'country', 'year', and
'life_expectancy' by passing them in as a list to gapminder_melt.columns.
Print the head of the melted DataFrame.
import pandas as pd
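The melt and renaming as a sketch:
# Melt gapminder, keeping 'Life expectancy' fixed
gapminder_melt = pd.melt(gapminder, id_vars='Life expectancy')
# Rename the three columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
print(gapminder_melt.head())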
The tidy DataFrame has been pre-loaded as gapminder. Explore it in the IPython
Shell using the .info() method. Notice that the column 'year' is of type object.
This is incorrect, so you'll need to use the pd.to_numeric() function to convert it
to a numeric data type.
Instructions
Convert the year column of gapminder using pd.to_numeric().
Assert that the country column is of type np.object. This has been done for you.
Assert that the year column is of type np.int64.
Assert that the life_expectancy column is of type np.float64.
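A sketch of the conversion and the dtype checks (using the built-in object in place of the removed np.object alias):
# Import numpy for the dtype checks
import numpy as np
# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)
# Test the data types of the columns
assert gapminder.country.dtypes == object
assert gapminder.year.dtypes == np.int64
assert gapminder.life_expectancy.dtypes == np.float64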
Since here you want to find the values that do not match, you have to invert the
Boolean Series, which can be done using ~. This inverted Series can then be used to
get the countries that have invalid names.
Instructions
Create a Series called countries consisting of the 'country' column of gapminder.
Drop all duplicates from countries using the .drop_duplicates() method.
Write a regular expression that tests your assumptions of what characters belong in
countries:
Anchor the pattern to match exactly what you want by placing a ^ in the beginning
and $ in the end.
Use A-Za-z to match the set of lower and upper case letters, \. to match periods,
and \s to match whitespace between words.
Use str.contains() to create a Boolean vector representing values that match the
pattern.
Invert the mask by placing a ~ before it.
Subset the countries series using the .loc[] accessor and mask_inverse. Then hit
'Submit Answer' to see the invalid country names!
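Putting those steps together as a sketch:
# The Series of unique country names
countries = gapminder['country']
countries = countries.drop_duplicates()
# Pattern: letters, periods, and whitespace only, anchored at both ends
pattern = r'^[A-Za-z\.\s]*$'
# Boolean vector of names that match the pattern
mask = countries.str.contains(pattern)
# Invert the mask and subset to get the invalid names
mask_inverse = ~mask
invalid_countries = countries.loc[mask_inverse]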
# Print invalid_countries
print(invalid_countries)
In general, it is not the best idea to drop missing values, because in doing so you
may end up throwing away useful information. In this data, the missing values refer
to years where no estimate for life expectancy is available for a given country.
You could fill in, or guess what these life expectancies could be by looking at the
average life expectancies for other countries in that year, for example. Whichever
strategy you go with, it is important to carefully consider all options and
understand how they will affect your data.
In this exercise, you'll practice dropping missing values. Your job is to drop all
the rows that have NaN in the life_expectancy column. Before doing so, it would be
valuable to use assert statements to confirm that year and country do not have any
missing values.
Begin by printing the shape of gapminder in the IPython Shell prior to dropping the
missing values. Complete the exercise to find out what its shape will be after
dropping the missing values!
Instructions
Assert that country and year do not contain any missing values. The first assert
statement has been written for you. Note the chaining of the .all() method to
pd.notnull() to confirm that all values in the column are not null.
Drop the rows in the data where any observation in life_expectancy is missing. As
you confirmed that country and year don't have missing values, you can use the
.dropna() method on the entire gapminder DataFrame, because any missing values
would have to be in the life_expectancy column. The .dropna() method has the
default keyword arguments axis=0 and how='any', which specify that rows with any
missing values should be dropped.
Print the shape of gapminder.
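Those steps as a sketch:
# Assert that country and year do not contain any missing values
assert pd.notnull(gapminder.country).all()
assert pd.notnull(gapminder.year).all()
# Drop rows where life_expectancy is missing
gapminder = gapminder.dropna()
# Print the shape of gapminder
print(gapminder.shape)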
Wrapping up
Now that you have a clean and tidy dataset, you can do a bit of visualization and
aggregation. In this exercise, you'll begin by creating a histogram of the
life_expectancy column. You should not get any values under 0 and you should see
something reasonable on the higher end of the life_expectancy age range.
Your next task is to investigate how average life expectancy changed over the
years. To do this, you need to subset the data by each year, get the
life_expectancy column from each subset, and take an average of the values. You can
achieve this using the .groupby() method. This .groupby() method is covered in
greater depth in Manipulating DataFrames with pandas.
Finally, you can save your tidy and summarized DataFrame to a file using the
.to_csv() method.
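A sketch of those final steps, assuming matplotlib.pyplot is imported as plt (the output file names are just placeholders):
# Histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')
plt.show()
# Average life expectancy per year
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()
print(gapminder_agg.head())
print(gapminder_agg.tail())
# Save the tidy and the aggregated data to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')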