Python Data Cleaning

The document discusses various methods for loading and exploring data in a DataFrame. It describes using .head() and .tail() to view the top and bottom rows, .shape and .columns to view dimensions, and .info() to get an overview of data types and missing values. Methods for visualizing data like .plot() for histograms and .boxplot() for comparing groups are also covered. The document then discusses reshaping data using melt and pivot operations and splitting columns on delimiters to tidy the data.


Loading and viewing your data

DataFrame
Access a column using dot notation (df.column) or bracket notation (df['column']).
.head() and .tail() methods view the first and last rows.
.shape and .columns attributes give the dimensions and the column names.
Attributes, unlike methods, don't need parentheses ().
.info() method provides an overview of the DataFrame: number of rows, number of
columns, number of non-missing values, and the data type of each column.
.describe() calculates summary statistics of your data; it can only be used on
numeric columns.
.value_counts() method returns the frequency counts for each unique value in a
column. Its optional dropna parameter is True by default, which means missing
values are not counted; pass dropna=False to include them.
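
As a quick reference, a minimal exploration sketch along these lines (the file name
data.csv is hypothetical; assume the DataFrame has already been loaded) might look
like:

import pandas as pd

# Hypothetical file name for illustration
df = pd.read_csv('data.csv')

# Inspect the first and last rows
print(df.head())
print(df.tail())

# Dimensions and column names (attributes, no parentheses)
print(df.shape)
print(df.columns)

# Data types and non-missing counts per column
print(df.info())

# Summary statistics for the numeric columns
print(df.describe())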

# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))

# Print the value counts for 'State'
print(df['State'].value_counts(dropna=False))

# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))

Visualizing single variables with histograms

.plot() creates a plot of each column.
For a histogram, use the parameter kind='hist'.
For a scatter plot, use kind='scatter'.
To plot on a log scale, use the keyword arguments logx=True or logy=True.

Visualizing multiple variables with boxplots

.boxplot() method: specify the column and by parameters.

# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt

# Describe the column
df['Existing Zoning Sqft'].describe()

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
plt.show()

# boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90)
plt.show()

# scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

# second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
Recognizing tidy data
For data to be tidy, it must have:
Each variable as a separate column.
Each row as a separate observation.
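
For a concrete (hypothetical) illustration, the wide table below stores the year
variable in two column headers; melting it produces one column per variable:

import pandas as pd

# Untidy: the year variable is spread across two column headers
wide = pd.DataFrame({'country': ['A', 'B'],
                     '2010': [10, 20],
                     '2011': [12, 22]})

# Tidy: one column per variable (country, year, cases), one row per observation
tidy = pd.melt(wide, id_vars='country', var_name='year', value_name='cases')
print(tidy)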

Reshaping data using melt


Melting: turning columns of your data into rows of data.
Keep in mind: Depending on how your data is represented, you will have to reshape
it differently (e.g., this could make it easier to plot values).

In this exercise, you will practice melting a DataFrame using pd.melt(). There are
two parameters you should be aware of: id_vars and value_vars. The id_vars
represent the columns of the data you do not want to melt (i.e., keep it in its
current shape), while the value_vars represent the columns you do wish to melt into
rows. By default, if no value_vars are provided, all columns not set in the id_vars
will be melted. This could save a bit of typing, depending on the number of columns
that need to be melted.
You can rename the variable column by specifying an argument to the var_name
parameter, and the value column by specifying an argument to the value_name
parameter. You will now practice doing exactly this. pandas (as pd) and the
DataFrame airquality have been pre-loaded for you.

# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])
print(airquality_melt.head())

# Melt airquality again, naming the variable and value columns
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'],
                          var_name='measurement', value_name='reading')
print(airquality_melt.head())

Pivot data
Pivoting data is the opposite of melting it. While melting takes a set of columns
and turns it into a single column, pivoting will create a new column for each
unique value in a specified column.

.pivot_table()
index parameter specifies the columns that are NOT pivoted (similar to the id_vars
parameter of pd.melt()).
columns is the name of the column you want to pivot, and values are the values
used to fill in the pivoted columns.

# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'],
                                               columns='measurement',
                                               values='reading')
print(airquality_pivot.head())

Resetting the index of a DataFrame


After pivoting airquality_melt in the previous exercise, you didn't quite get back
the original DataFrame.
What you got back instead was a pandas DataFrame with a hierarchical index (also
known as a MultiIndex).

Hierarchical indexes are covered in depth in Manipulating DataFrames with pandas.


In essence, they allow you to group columns or rows by another variable - in this
case, by 'Month' as well as 'Day'.

There's a very simple method you can use to get back the original DataFrame from
the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method
in the video, but you're now going to practice using it in this exercise to get
back the original DataFrame from airquality_pivot, which has been pre-loaded.

Print the index of airquality_pivot by accessing its .index attribute. This has
been done for you.
Reset the index of airquality_pivot using its .reset_index() method.
Print the new index of airquality_pivot_reset.
Print the head of airquality_pivot_reset.

# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())

Pivoting duplicate values


Deal with duplicate values by providing an aggregation function through the aggfunc
parameter.

You'll see that by using .pivot_table() and the aggfunc parameter, you can not only
reshape your data, but also remove duplicates. Finally, you can then flatten the
columns of the pivoted DataFrame using .reset_index().

NumPy and pandas have been imported as np and pd respectively.

NOTE: The default aggregation function used by .pivot_table() is np.mean().

Pivot airquality_dup by using .pivot_table() with the rows indexed by 'Month' and
'Day', the columns indexed by 'measurement', and the values populated with
'reading'. Use np.mean for the aggregation function.
Print the head of airquality_pivot.
Flatten airquality_pivot by resetting its index.
Print the head of airquality_pivot and then the original airquality DataFrame to
compare their structure.

# Pivot table the airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'],
                                              columns='measurement',
                                              values='reading',
                                              aggfunc=np.mean)

# Print the head of airquality_pivot before reset_index
print(airquality_pivot.head())

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())

Splitting a column with .str


The dataset you saw in the video, consisting of case counts of tuberculosis by
country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males
aged 0-14 years. In order to parse this value, you need to extract the first
letter into a new column for gender, and the rest into a column for age_group.
Here, since you can parse values by position, you can take advantage of pandas'
vectorized string slicing by using the str attribute of columns of type object.

Begin by printing the columns of tb in the IPython Shell using its .columns
attribute, and take note of the problematic column.

Melt tb keeping 'country' and 'year' fixed.


Create a 'gender' column by slicing the first letter of the variable column of
tb_melt.
Create an 'age_group' column by slicing the rest of the variable column of tb_melt.
Print the head of tb_melt. This has been done for you, so hit 'Submit Answer' to
see the results!

# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])

# Create the 'gender', 'age_group' columns
tb_melt['gender'] = tb_melt.variable.str[0]
tb_melt['age_group'] = tb_melt.variable.str[1:]
print(tb_melt.head())

Splitting a column with .split() and .get()


Another common way multiple variables are stored in columns is with a delimiter.
You'll learn how to deal with such cases in this exercise, using a dataset
consisting of Ebola cases and death counts by state and country. It has been pre-
loaded into a DataFrame as ebola.

Print the columns of ebola in the IPython Shell using ebola.columns. Notice that
the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the
underscore _ serves as a delimiter between the first part (cases or deaths), and
the second part (country).

This time, you cannot directly slice the variable by position as in the previous
exercise. You now need to use Python's built-in string method called .split(). By
default, this method will split a string into parts separated by a space. However,
in this case you want it to split by an underscore. You can do this on
'Cases_Guinea', for example, using 'Cases_Guinea'.split('_'), which returns the
list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a
type variable, and the second element of the list to a country variable. You can
accomplish this by accessing the str attribute of the column and using the .get()
method to retrieve the 0 or 1 index, depending on the part you want.
Melt ebola using 'Date' and 'Day' as the id_vars, 'type_country' as the var_name,
and 'counts' as the value_name.
Create a column called 'str_split' by splitting the 'type_country' column of
ebola_melt on '_'. Note that you will first have to access the str attribute of
type_country before you can use .split().
Create a column called 'type' by using the .get() method to retrieve index 0 of the
'str_split' column of ebola_melt.
Create a column called 'country' by using the .get() method to retrieve index 1 of
the 'str_split' column of ebola_melt.

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country',
                     value_name='counts')

# Create the 'str_split', 'type', 'country' columns
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)
print(ebola_melt.head())

Combining rows of data


The dataset you'll be working with here relates to NYC Uber data. The original
dataset has all the originating Uber pickup locations by time and latitude and
longitude. For didactic purposes, you'll be working with a very small portion of
the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014,
uber2, which contains data for May 2014, and uber3, which contains data for June
2014. Your job in this exercise is to concatenate these DataFrames together such
that the resulting DataFrame has the data for all three months.

Begin by exploring the structure of these three DataFrames in the IPython Shell
using methods such as .head().

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape and head of row_concat
print(row_concat.shape)
print(row_concat.head())

Combining columns of data


Think of column-wise concatenation of data as stitching data together from the
sides instead of the top and bottom. To perform this action, you use the same
pd.concat() function, but this time with the keyword argument axis=1. The default,
axis=0, is for a row-wise concatenation.

You'll return to the Ebola dataset you worked with briefly in the last chapter. It
has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the
status and country of a patient are contained in a single column. This column has
been parsed into a new DataFrame, status_country, where there are separate columns
for status and country.

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape and head of ebola_tidy
print(ebola_tidy.shape)
print(ebola_tidy.head())

Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job
is to concatenate them column-wise in order to obtain a final, clean DataFrame.

Finding files that match a pattern


You're now going to practice using the glob module to find all csv files in the
workspace. In the next exercise, you'll programmatically load them into DataFrames.

As Dan showed you in the video, the glob module has a function called glob that
takes a pattern and returns a list of the files in the working directory that match
that pattern.

For example, if you know the pattern is part_ followed by a single-digit number and
then .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv,
part_2.csv, part_3.csv, etc.)

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'.
The ? wildcard represents any 1 character, and the * wildcard represents any number
of characters.
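
As a quick illustration of the two wildcards (the file names here are hypothetical):

import glob

# '?' matches exactly one character: part_1.csv, part_2.csv, ... but not part_10.csv
single_part_files = glob.glob('part_?.csv')

# '*' matches any number of characters: every .csv file in the working directory
all_csv_files = glob.glob('*.csv')

print(single_part_files)
print(all_csv_files)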

Import the glob module along with pandas (as its usual alias pd).
Write a pattern to match all .csv files.
Save all files that match the pattern using the glob() within the glob module:
glob.glob().
Print the list of file names.
Read the second file in csv_files (i.e., index 1) into a DataFrame called csv2.

# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
print(csv2.head())

Iterating and concatenating all matches


Now that you have a list of filenames to load, you can load all the files into a
list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

iterate through each of the filenames


read each filename into a DataFrame, and then
append it to the frames list.
You can then concatenate this list of DataFrames using pd.concat(). Go for it!
Write a for loop to iterate through csv_files:
In each iteration of the loop, read csv into a DataFrame called df.
After creating df, append it to the list frames using the .append() method.
Concatenate frames into a single DataFrame called uber.
Hit 'Submit Answer' to see the head and shape of the concatenated DataFrame!

# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:
    df = pd.read_csv(csv)
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
print(uber.shape)
print(uber.head())

1-to-1 data merge


Many-to-1 data merge
Many-to-many data merge
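
The three merge types differ only in how many rows on each side share a key: in a
1-to-1 merge every key appears once in both tables, in a many-to-1 merge keys repeat
on one side (the matching row from the other side is duplicated for each repeat),
and in a many-to-many merge keys repeat on both sides, producing every pairwise
combination. A small sketch with made-up DataFrames (not the course's pre-loaded
site/visited data) illustrates the many-to-1 case:

import pandas as pd

# Hypothetical tables for illustration
sites = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'lat': [-49.85, -47.15]})
visits = pd.DataFrame({'site': ['DR-1', 'DR-1', 'DR-3'],
                       'dated': ['1927-02-08', '1927-02-10', '1939-01-07']})

# Many-to-1: 'DR-1' appears twice in visits, so its lat value is repeated twice
m2o_example = pd.merge(left=sites, right=visits, left_on='name', right_on='site')
print(m2o_example)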

# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
print(o2o)

# Merge the DataFrames: m2o
m2o = pd.merge(site, visited, left_on='name', right_on='site')
print(m2o)

# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))

Converting data types


In this exercise, you'll see how ensuring all categorical variables in a DataFrame
are of type category reduces memory usage.

The tips dataset has been loaded into a DataFrame called tips. This data contains
information about how much a customer tipped, whether the customer was male or
female, a smoker or not, etc.

Look at the output of tips.info() in the IPython Shell. You'll note that two
columns that should be categorical - sex and smoker - are instead of type object,
which is pandas' way of storing arbitrary strings. Your job is to convert these two
columns to type category and note the reduced memory usage.

Convert the sex column of the tips DataFrame to type 'category' using the .astype()
method.
Do the same for the smoker column.
Print the memory usage of tips after converting the data types, using .info().
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())
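
If you want to see the savings explicitly rather than eyeballing the .info() output,
a small check along these lines (a sketch, assuming the same tips DataFrame) compares
the memory used by one column stored as object versus as category:

# Per-column memory in bytes; deep=True counts the strings held in object columns
print(tips.memory_usage(deep=True))

# Compare a single column stored as object vs as category
print(tips.sex.astype('object').memory_usage(deep=True))
print(tips.sex.astype('category').memory_usage(deep=True))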

Working with numeric data


If you expect the data type of a column to be numeric (int or float), but instead
it is of type object, this typically means that there is a non numeric value in the
column, which also signifies bad data.

You can use the pd.to_numeric() function to convert a column into a numeric data
type. If the function raises an error, you can be sure that there is a bad value
within the column. You can either use the techniques you learned in Chapter 1 to do
some exploratory data analysis and find the bad value, or you can choose to ignore
or coerce the value into a missing value, NaN.
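
A small sketch of the error-handling modes of pd.to_numeric() (the Series here is
made up for illustration):

import pandas as pd

s = pd.Series(['10.5', '7', 'missing'])

# Default errors='raise': a bad value raises a ValueError
# pd.to_numeric(s)

# errors='coerce': bad values become NaN
print(pd.to_numeric(s, errors='coerce'))

# errors='ignore': the input is returned unchanged if any value cannot be parsed
print(pd.to_numeric(s, errors='ignore'))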

A modified version of the tips dataset has been pre-loaded into a DataFrame called
tips. For instructional purposes, it has been pre-processed to introduce some 'bad'
data for you to clean. Use the .info() method to explore this. You'll note that the
total_bill and tip columns, which should be numeric, are instead of type object.
Your job is to fix this.

Instructions
Use pd.to_numeric() to convert the 'total_bill' column of tips to a numeric data
type. Coerce the errors to NaN by specifying the keyword argument errors='coerce'.
Convert the 'tip' column of 'tips' to a numeric data type exactly as you did for
the 'total_bill' column.
Print the info of tips to confirm that the data types of 'total_bill' and 'tips'
are numeric.

# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())

String parsing with regular expressions


In the video, Dan introduced you to the basics of regular expressions, which are
powerful ways of defining patterns to match strings. This exercise will get you
started with writing them.

When working with data, it is sometimes necessary to write a regular expression to
look for properly entered values. Phone numbers are a common field that needs to be
checked for validity. Your job in this exercise is to define a regular expression to
match US phone numbers that fit the pattern xxx-xxx-xxxx.
The regular expression module in Python is re. When performing pattern matching on
data, since the pattern will be used for a match across multiple rows, it's better
to compile the pattern first using re.compile(), and then use the compiled pattern
to match values.

Instructions
Import re.
Compile a pattern that matches a phone number of the format xxx-xxx-xxxx.
Use \d{x} to match x digits. Here you'll need to use it three times: twice to match
3 digits, and once to match 4 digits.
Place the regular expression inside re.compile().
Using the .match() method on prog, check whether the pattern matches the string
'123-456-7890'.
Using the same approach, now check whether the pattern matches the string '1123-
456-7890'.

# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches a valid phone number
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches an invalid phone number
result2 = prog.match('1123-456-7890')
print(bool(result2))

Extracting numerical values from strings


Extracting numbers from strings is a common task, particularly when working with
unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2
bananas'.

It would be useful to extract the 6 and the 2 from this string to be saved for
later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern
matches, to be exact), you can use the re.findall() function. Dan did not discuss
this in the video, but it is straightforward to use: You pass in a pattern and a
string to re.findall(), and it will return a list of the matches.

Instructions
Import re.
Write a pattern that will find all the numbers in the following string: 'the recipe
calls for 10 strawberries and 1 banana'. To do this:
Use the re.findall() function and pass it two arguments: the pattern, followed by
the string.
\d is the pattern required to find digits. This should be followed with a + so that
the previous element is matched one or more times. This ensures that 10 is viewed
as one number and not as 1 and 0.
Print the matches to confirm that your regular expression found the values 10 and
1.

# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For
each provided string, your job is to write the appropriate pattern to match it.

Write patterns to match:


A telephone number of the format xxx-xxx-xxxx.
A string of the format: A dollar sign, an arbitrary number of digits, a decimal
point, 2 digits.
Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to
match the decimal point, and \d{x} to match x number of digits.
A capital letter, followed by an arbitrary number of alphanumeric characters.
Use [A-Z] to match any capital letter followed by \w* to match an arbitrary number
of alphanumeric characters.

# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

Custom functions to clean data


You'll now practice writing functions to clean data.

The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex'
column that contains the values 'Male' or 'Female'. Your job is to write a function
that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of
'sex' that are neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a
mechanism for you to abstract away complex bits of code as well as reuse code. This
makes your code more readable and less error prone.

As Dan showed you in the videos, you can use the .apply() method to apply a
function across entire rows or columns of DataFrames. However, note that each
column of a DataFrame is a pandas Series. Functions can also be applied across
Series. Here, you will apply your function over the 'sex' column.
Instructions
Define a function named recode_gender() that has one parameter: gender.
If gender equals 'Male', return 1.
Else, if gender equals 'Female', return 0.
If gender does not equal 'Male' or 'Female', return np.nan. NumPy has been pre-
imported for you.
Apply your recode_gender() function over tips.sex using the .apply() method to
create a new column: 'recode'. Note that when passing in a function inside the
.apply() method, you don't need to specify the parentheses after the function name.
Hit 'Submit Answer' and take note of the new 'recode' column in the tips DataFrame!

# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0

    # Return 1 if gender is 'Male'
    elif gender == 'Male':
        return 1

    # Return np.nan for anything else
    else:
        return np.nan

# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
print(tips.head())

# For simple recodes, you can also use the replace method. You can also convert the
column into a categorical type.
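
A minimal sketch of those two alternatives (assuming the same tips DataFrame and
that 'sex' is still a plain object/string column):

# Recode with .replace(); values not listed in the dict are left unchanged
tips['recode_replace'] = tips.sex.replace({'Female': 0, 'Male': 1})

# Or keep the strings but store them as a categorical type
tips['sex_cat'] = tips.sex.astype('category')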

Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your
data more effectively: lambda functions. Instead of using the def syntax that you
used in the previous exercise, lambda functions let you make simple, one-line
functions.

For example, here's a function that squares a variable used in an .apply() method:

def my_square(x):
    return x ** 2

df.apply(my_square)

The equivalent code using a lambda function is:

df.apply(lambda x: x ** 2)
The lambda function takes one parameter - the variable x. The function itself just
squares x and returns the result, which is whatever the one line of code evaluates
to. In this way, lambda functions can make your code concise and Pythonic.

The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to
clean its 'total_dollar' column by removing the dollar sign. You'll do this using
two different methods: With the .replace() method, and with regular expressions.
The regular expression module re has been pre-imported.

Instructions
Use the .replace() method inside a lambda function to remove the dollar sign from
the 'total_dollar' column of tips.
You need to specify two arguments to the .replace() method: The string to be
replaced ('$'), and the string to replace it by ('').
Apply the lambda function over the 'total_dollar' column of tips.
Use a regular expression to remove the dollar sign from the 'total_dollar' column
of tips.
The pattern has been provided for you: It is the first argument of the re.findall()
function.
Complete the rest of the lambda function and apply it over the 'total_dollar'
column of tips. Notice that because re.findall() returns a list, you have to slice
it in order to access the actual value.
Hit 'Submit Answer' to verify that you have removed the dollar sign from the
column.

# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(
    lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(
    lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())

Dropping duplicate data


Duplicate data causes a variety of problems. From a performance point of view,
duplicates use up unnecessary amounts of memory and cause unneeded calculations to
be performed when processing data. In addition, they can also bias any analysis
results.

A dataset consisting of the performance of songs on the Billboard charts has been
pre-loaded into a DataFrame called billboard. Check out its columns in the IPython
Shell. Your job in this exercise is to subset this DataFrame and then drop all
duplicate rows.

Instructions
Create a new DataFrame called tracks that contains the following columns from
billboard: 'year', 'artist', 'track', and 'time'.
Print the info of tracks. This has been done for you.
Drop duplicate rows from tracks using the .drop_duplicates() method. Save the
result to tracks_no_duplicates.
Print the info of tracks_no_duplicates. This has been done for you, so hit 'Submit
Answer' to see the results!

# Create the new DataFrame: tracks
tracks = billboard[['year', 'artist', 'track', 'time']]
print(tracks.info())

# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()
print(tracks_no_duplicates.info())

Filling missing data


Here, you'll return to the airquality dataset from Chapter 2. It has been pre-
loaded into the DataFrame airquality, and it has missing values for you to practice
filling in. Explore airquality in the IPython Shell to checkout which columns have
missing values.

It's rare to have a (real-world) dataset without any missing values, and it's
important to deal with them because certain calculations cannot handle missing
values while some calculations will, by default, skip over any missing values.

Also, understanding how much missing data you have, and thinking about where it
comes from is crucial to making unbiased interpretations of data.
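
One quick way to see how much missing data you have (a small sketch, assuming the
airquality DataFrame from this exercise) is to count the nulls per column:

# Number of missing values in each column
print(airquality.isnull().sum())

# Fraction of missing values in each column
print(airquality.isnull().mean())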

Instructions
Calculate the mean of the Ozone column of airquality using the .mean() method on
airquality.Ozone.
Use the .fillna() method to replace all the missing values in the Ozone column of
airquality with the mean, oz_mean.
Hit 'Submit Answer' to see the result of filling in the missing values!

# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())

Testing your data with asserts


Here, you'll practice writing assert statements using the Ebola dataset from
previous chapters to programmatically check for missing values and to confirm that
all values are positive. The dataset has been pre-loaded into a DataFrame called
ebola.

In the video, you saw Dan use the .all() method together with the .notnull()
DataFrame method to check for missing values in a column. The .all() method returns
True if all values are True. When used on a DataFrame, it returns a Series of
Booleans - one for each column in the DataFrame. So if you are using it on a
DataFrame, like in this exercise, you need to chain another .all() method so that
you return only one True or False value. When using these within an assert
statement, nothing will be returned if the assert statement is true: This is how
you can confirm that the data you are checking are valid.

Note: You can use pd.notnull(df) as an alternative to df.notnull().

Instructions
Write an assert statement to confirm that there are no missing values in ebola.
Use the pd.notnull() function on ebola (or the .notnull() method of ebola) and
chain two .all() methods (that is, .all().all()). The first .all() method will
return a True or False for each column, while the second .all() method will return
a single True or False.
Write an assert statement to confirm that all values in ebola are greater than or
equal to 0.
Chain two all() methods to the Boolean condition (ebola >= 0).

# Assert that there are no missing values
assert ebola.notnull().all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()

Exploratory analysis
Whenever you obtain a new dataset, your first task should always be to do some
exploratory analysis to get a better understanding of the data and diagnose it for
any potential issues.

The Gapminder data for the 19th century has been loaded into a DataFrame called
g1800s. In the IPython Shell, use pandas methods such as .head(), .info(), and
.describe(), and DataFrame attributes like .columns and .shape to explore it.

Use the information that you acquire from your exploratory analysis to choose the
true statement from the options provided below.

Visualizing your data


Since 1800, life expectancy around the globe has been steadily going up. You would
expect the Gapminder data to confirm this.

The DataFrame g1800s has been pre-loaded. Your job in this exercise is to create a
scatter plot with life expectancy in '1800' on the x-axis and life expectancy in
'1899' on the y-axis.

Here, the goal is to visually check the data for insights as well as errors. When
looking at the plot, pay attention to whether the scatter plot takes the form of a
diagonal line, and which points fall below or above the diagonal line. This will
inform how life expectancy in 1899 changed (or did not change) compared to 1800 for
different countries. If points fall on a diagonal line, it means that life
expectancy remained the same!

Instructions
Import matplotlib.pyplot as plt.
Use the .plot() method on g1800s with kind='scatter' to create a scatter plot with
'1800' on the x-axis and '1899' on the y-axis.
Display the plot.

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create the scatter plot
g1800s.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

Thinking about the question at hand


Since you are given life expectancy level data by country and year, you could ask
questions about how much the average life expectancy changes over each year.

Before continuing, however, it's important to make sure that the following
assumptions about the data are true:

'Life expectancy' is the first column (index 0) of the DataFrame.


The other columns contain either null or numeric values.
The numeric values are all greater than or equal to 0.
There is only one instance of each country.
You can write a function that you can apply over the entire DataFrame to verify
some of these assumptions. Note that spending the time to write such a script will
help you when working with other datasets as well.

Instructions
Define a function called check_null_or_valid() that takes in one argument:
row_data.
Inside the function, convert no_na to a numeric data type using pd.to_numeric().
Write an assert statement to make sure the first column (index 0) of the g1800s
DataFrame is 'Life expectancy'.
Write an assert statement to test that all the values are valid for the g1800s
DataFrame. Use the check_null_or_valid() function placed inside the .apply() method
for this. Note that because you're applying it over the entire DataFrame, and not
just one column, you'll have to chain the .all() method twice, and remember that
you don't have to use () for functions placed inside .apply().
Write an assert statement to make sure that each country occurs only once in the
data. Use the .value_counts() method on the 'Life expectancy' column for this.
Specifically, index 0 of .value_counts() will contain the most frequently occurring
value. If this is equal to 1 for the 'Life expectancy' column, then you can be
certain that no country appears more than once in the data.

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'

# Check whether the values in each row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts()[0] == 1

Assembling your data


Here, three DataFrames have been pre-loaded: g1800s, g1900s, and g2000s. These
contain the Gapminder life expectancy data for, respectively, the 19th century, the
20th century, and the 21st century.

Your task in this exercise is to concatenate them into a single DataFrame called
gapminder. This is a row-wise concatenation, similar to how you concatenated the
monthly Uber datasets in Chapter 3.

Use pd.concat() to concatenate g1800s, g1900s, and g2000s into one DataFrame called
gapminder. Make sure you pass DataFrames to pd.concat() in the form of a list.
Print the shape and the head of the concatenated DataFrame.

# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())

Reshaping your data


Now that you have all the data combined into a single DataFrame, the next step is
to reshape it into a tidy data format.

Currently, the gapminder DataFrame has a separate column for each year. What you
want instead is a single column that contains the year, and a single column that
represents the average life expectancy for each year and country. By having year in
its own column, you can use it as a predictor variable in a later analysis.

You can convert the DataFrame into the desired tidy format by melting it.

Instructions
Reshape gapminder by melting it. Keep 'Life expectancy' fixed by specifying it as
an argument to the id_vars parameter.
Rename the three columns of the melted DataFrame to 'country', 'year', and
'life_expectancy' by passing them in as a list to gapminder_melt.columns.
Print the head of the melted DataFrame.

import pandas as pd

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars=['Life expectancy'])

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())

Checking the data types


Now that your data are in the proper shape, you need to ensure that the columns are
of the proper data type. That is, you need to ensure that country is of type
object, year is of type int64, and life_expectancy is of type float64.

The tidy DataFrame has been pre-loaded as gapminder. Explore it in the IPython
Shell using the .info() method. Notice that the column 'year' is of type object.
This is incorrect, so you'll need to use the pd.to_numeric() function to convert it
to a numeric data type.

NumPy and pandas have been pre-imported as np and pd.

Instructions
Convert the year column of gapminder using pd.to_numeric().
Assert that the country column is of type np.object. This has been done for you.
Assert that the year column is of type np.int64.
Assert that the life_expectancy column is of type np.float64.

# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)

# Test if country is of type object
assert gapminder.country.dtypes == np.object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Looking at country spellings


Having tidied your DataFrame and checked the data types, your next task in the data
cleaning process is to look at the 'country' column to see if there are any special
or invalid characters you may need to deal with.

It is reasonable to assume that country names will contain:

The set of lower and upper case letters.
Whitespace between words.
Periods for any abbreviations.

To confirm that this is the case, you can leverage the power of regular expressions
again. For common operations like this, Pandas has a built-in string method -
str.contains() - which takes a regular expression pattern, and applies it to the
Series, returning True if there is a match, and False otherwise.

Since here you want to find the values that do not match, you have to invert the
boolean, which can be done using ~. This Boolean series can then be used to get the
Series of countries that have invalid names.
Instructions
Create a Series called countries consisting of the 'country' column of gapminder.
Drop all duplicates from countries using the .drop_duplicates() method.
Write a regular expression that tests your assumptions of what characters belong in
countries:
Anchor the pattern to match exactly what you want by placing a ^ in the beginning
and $ in the end.
Use A-Za-z to match the set of lower and upper case letters, \. to match periods,
and \s to match whitespace between words.
Use str.contains() to create a Boolean vector representing values that match the
pattern.
Invert the mask by placing a ~ before it.
Subset the countries series using the .loc[] accessor and mask_inverse. Then hit
'Submit Answer' to see the invalid country names!

# Create the series of countries: countries
countries = gapminder.country

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)

More data cleaning and processing


It's now time to deal with the missing data. There are several strategies for this:
You can drop them, fill them in using the mean of the column or row that the
missing value is in (also known as imputation), or, if you are dealing with time
series data, use a forward fill or backward fill, in which you replace missing
values in a column with the most recent known value in the column. See pandas
Foundations for more on forward fill and backward fill.
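
A minimal sketch of those fill strategies (the Series here is made up; forward and
backward fill are shown purely for illustration, since they suit time series data):

import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Impute with the mean of the column
print(s.fillna(s.mean()))

# Forward fill: propagate the last known value forward
print(s.ffill())

# Backward fill: propagate the next known value backward
print(s.bfill())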

In general, it is not the best idea to drop missing values, because in doing so you
may end up throwing away useful information. In this data, the missing values refer
to years where no estimate for life expectancy is available for a given country.
You could fill in, or guess what these life expectancies could be by looking at the
average life expectancies for other countries in that year, for example. Whichever
strategy you go with, it is important to carefully consider all options and
understand how they will affect your data.

In this exercise, you'll practice dropping missing values. Your job is to drop all
the rows that have NaN in the life_expectancy column. Before doing so, it would be
valuable to use assert statements to confirm that year and country do not have any
missing values.
Begin by printing the shape of gapminder in the IPython Shell prior to dropping the
missing values. Complete the exercise to find out what its shape will be after
dropping the missing values!

Instructions
Assert that country and year do not contain any missing values. The first assert
statement has been written for you. Note the chaining of the .all() method to
pd.notnull() to confirm that all values in the column are not null.
Drop the rows in the data where any observation in life_expectancy is missing. As
you confirmed that country and year don't have missing values, you can use the
.dropna() method on the entire gapminder DataFrame, because any missing values
would have to be in the life_expectancy column. The .dropna() method has the
default keyword arguments axis=0 and how='any', which specify that rows with any
missing values should be dropped.
Print the shape of gapminder.

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Drop the missing values
gapminder = gapminder.dropna()

# Print the shape of gapminder
print(gapminder.shape)

Wrapping up
Now that you have a clean and tidy dataset, you can do a bit of visualization and
aggregation. In this exercise, you'll begin by creating a histogram of the
life_expectancy column. You should not get any values under 0 and you should see
something reasonable on the higher end of the life_expectancy age range.

Your next task is to investigate how average life expectancy changed over the
years. To do this, you need to subset the data by each year, get the
life_expectancy column from each subset, and take an average of the values. You can
achieve this using the .groupby() method. This .groupby() method is covered in
greater depth in Manipulating DataFrames with pandas.

Finally, you can save your tidy and summarized DataFrame to a file using the
.to_csv() method.

Create a histogram of the life_expectancy column using the .plot() method.


Group gapminder by 'year' and aggregate 'life_expectancy' by the mean. To do this:
Use .groupby() on gapminder with 'year' as the argument. Then select
'life_expectancy' and chain .mean().
Create a line plot of average life expectancy per year by using the .plot() on
gapminder_agg.
Save gapminder and gapminder_agg to csv files using .to_csv().

# Add first subplot
plt.subplot(2, 1, 1)

# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()

# Print the head and tail of gapminder_agg
print(gapminder_agg.head())
print(gapminder_agg.tail())

# Add second subplot
plt.subplot(2, 1, 2)

# Create a line plot of life expectancy per year
gapminder_agg.plot()

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')
