Python Data Import

This document collects notes and exercises on importing data into Python using a
variety of tools and methods. It covers using the !ls command in IPython to view the
contents of the current directory, opening and reading text files line by line using
open() and readline(), importing flat files into NumPy arrays using loadtxt(),
customizing NumPy imports by specifying the delimiter, skiprows, and usecols
arguments, importing different datatypes by setting dtype or skiprows, importing
mixed datatypes using genfromtxt() and recfromcsv(), importing flat files as pandas
DataFrames using read_csv() and read_table(), and customizing pandas imports. Later
sections cover pickled files, Excel, SAS, Stata, HDF5, and MATLAB files, relational
databases with SQLAlchemy, importing files from the web, and working with APIs,
JSON, and Twitter data.


Exploring your working directory

In order to import data into Python, you should first have an idea of what files
are in your working directory.
IPython, which is running on DataCamp's servers, has a number of useful commands,
including its magic commands. For example, starting a line with ! gives you
complete system shell access, which means that the IPython command !ls will
display the contents of your current directory. Your task is to use !ls to check
out the contents of your current directory and answer the following question:
which of the following files is in your working directory?

!ls
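
Any shell command can be run from IPython this way, not just ls. A brief
illustrative sketch (assuming a Unix-like shell, as on the DataCamp servers):

# Print the path of the current working directory
!pwd

# List the directory contents with file details
!ls -l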

Importing entire text files

In this exercise, you'll open the text file moby_dick.txt, print its contents to
the shell and check whether the file is closed before and after closing it.

# Open a file: file


file = open('moby_dick.txt', 'r')
# Print it
print(file.read())

# Check whether file is closed


print(file.closed)

# Close file
file.close()

# Check whether file is closed


print(file.closed)

Importing text files line by line


For large files, you may not want to print all of their content to the shell: you
may wish to print only the first few lines. Enter the readline() method, which
allows you to do this. When a file called file is open, you can print out the first
line by executing file.readline(). If you execute the same command again, the
second line will print, and so on.
In the introductory video, Hugo also introduced the concept of a context manager.
He showed that you can bind a variable file by using a context manager construct:
with open('huck_finn.txt') as file:
While still within this construct, the variable file will be bound to
open('huck_finn.txt'); thus, to print the file to the shell, all the code you need
to execute is:
with open('huck_finn.txt') as file:
    print(file.readline())

# Read & print the first 3 lines


with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Using NumPy to import flat files


In this exercise, you're now going to load the MNIST digit recognition dataset
using the numpy function loadtxt() and see just how easy it can be:

The first argument will be the filename.


The second will be the delimiter which, in this case, is a comma.
You can find more information about the MNIST dataset here on the webpage of Yann
LeCun, who is currently Director of AI Research at Facebook and Founding Director
of the NYU Center for Data Science, among many other things.
Instructions
100 XP
Fill in the arguments of np.loadtxt() by passing file and a comma ',' for the
delimiter.
Fill in the argument of print() to print the type of the object digits. Use the
function type().
Execute the rest of the code to visualize one of the rows of the data.

# Import package
import numpy as np

# Assign filename to variable: file


file = 'digits.csv'

# Load file as array: digits


digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits


print(type(digits))

# Select and reshape a row


im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data (matplotlib.pyplot already loaded as plt)


plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

Customizing your NumPy import


What if there are rows, such as a header, that you don't want to import? What if
your file has a delimiter other than a comma? What if you only wish to import
particular columns?

There are a number of arguments that np.loadtxt() takes that you'll find useful:
delimiter changes the delimiter that loadtxt() is expecting (for example, ',' for
comma-delimited and '\t' for tab-delimited files); skiprows allows you to specify
how many rows (not indices) you wish to skip; usecols takes a list of the indices
of the columns you wish to keep.

The file that you'll be importing, digits_header.txt, has a header and is
tab-delimited.
Instructions
100 XP
Complete the arguments of np.loadtxt(): the file you're importing is tab-delimited,
you want to skip the first row and you only want to import the first and third
columns.
Complete the argument of the print() call in order to print the entire array that
you just imported.

# Import numpy
import numpy as np

# Assign the filename: file


file = 'digits_header.txt'

# Load the data: data


data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)

Importing different datatypes


The file seaslug.txt has a text header consisting of strings and is tab-delimited.
These data consist of the percentage of sea slug larvae that had metamorphosed in
a given time period. Read more here.

Due to the header, if you tried to import it as-is using np.loadtxt(), Python would
throw you a ValueError and tell you that it could not convert string to float.
There are two ways to deal with this: firstly, you can set the data type argument
dtype equal to str (for string).

Alternatively, you can skip the first row as we have seen before, using the
skiprows argument.

Instructions
100 XP
Complete the first call to np.loadtxt() by passing file as the first argument.
Execute print(data[0]) to print the first element of data.
Complete the second call to np.loadtxt(). The file you're importing is tab-
delimited, the datatype is float, and you want to skip the first row.
Print the 10th element of data_float by completing the print() command. Be guided
by the previous print() call.
Execute the rest of the code to visualize the data.

# Assign filename: file


file = 'seaslug.txt'

# Import file: data


data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data


print(data[0])

# Import data as floats and skip the first row: data_float


data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float


print(data_float[9])

# Plot a scatterplot of the data


plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

Working with mixed datatypes (1)


Much of the time you will need to import datasets which have different datatypes in
different columns; one column may contain strings and another floats, for example.
The function np.loadtxt() will throw an error at this. There is another function,
np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it
will figure out what types each column should be.

Import 'titanic.csv' using the function np.genfromtxt() as follows:

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)


Here, the first argument is the filename, the second specifies the delimiter ',' and
the third argument names tells us there is a header. Because the data are of
different types, data is an object called a structured array. Because numpy arrays
have to contain elements that are all the same type, the structured array solves
this by being a 1D array, where each element of the array is a row of the flat file
imported. You can test this by checking out the array's shape in the shell by
executing np.shape(data).
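
A minimal sketch of working with such a structured array (assuming, as in the
course's Titanic data, that a column named 'Survived' exists):

# Import numpy
import numpy as np

# Import the mixed-type file as a structured array
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

# The array is 1D: one element per row of the flat file
print(np.shape(data))

# Columns can be accessed by name
print(data['Survived'][:5])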

Working with mixed datatypes (2)


You have just used np.genfromtxt() to import data containing mixed datatypes. There
is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(),
except that its default dtype is None. In this exercise, you'll practice using this
to achieve the same result.

Instructions
100 XP
Import titanic.csv using the function np.recfromcsv() and assign it to the
variable, d. You'll only need to pass file to it because it has the defaults
delimiter=',' and names=True in addition to dtype=None!
Run the remaining code to print the first three entries of the resulting array d.

# Assign the filename: file


file = 'titanic.csv'

# Import file using np.recfromcsv: d


d = np.recfromcsv(file)
# Print out first three entries of d
print(d[:3])
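
Note that np.recfromcsv() may be deprecated or removed in newer NumPy releases; if
it is unavailable in your environment, a roughly equivalent sketch uses
np.genfromtxt() with the same defaults:

# Roughly equivalent import using np.genfromtxt()
d_alt = np.genfromtxt(file, delimiter=',', names=True, dtype=None)
print(d_alt[:3])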

Using pandas to import flat files as DataFrames (1)


In the last exercise, you were able to import flat files containing columns with
different datatypes as numpy arrays. However, the DataFrame object in pandas is a
more appropriate structure in which to store such data and, thankfully, we can
easily import files of mixed data types as DataFrames using the pandas functions
read_csv() and read_table().
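
As an aside, read_table() is essentially read_csv() with a tab as the default
separator; a minimal sketch, assuming a hypothetical tab-delimited file
'titanic.txt':

import pandas as pd

# These two calls are equivalent for a tab-delimited file
df_a = pd.read_table('titanic.txt')
df_b = pd.read_csv('titanic.txt', sep='\t')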

Import the pandas package using the alias pd.


Read titanic.csv into a DataFrame called df. The file name is already stored in the
file object.
In a print() call, view the head of the DataFrame.

# Import pandas as pd
import pandas as pd

# Assign the filename: file


file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())

Using pandas to import flat files as DataFrames (2)


In the last exercise, you were able to import flat files into a pandas DataFrame.
As a bonus, it is then straightforward to retrieve the corresponding numpy array
using the attribute values. You'll now have a chance to do this using the MNIST
dataset, which is available as digits.csv.

Import the first 5 rows of the file into a DataFrame using the function
pd.read_csv() and assign the result to data. You'll need to use the arguments nrows
and header (there is no header in this file).
Build a numpy array from the resulting DataFrame in data and assign to data_array.
Execute print(type(data_array)) to print the datatype of data_array.

# Assign the filename: file


file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array


data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))

Customizing your pandas import


The pandas package is also great at dealing with many of the issues you will
encounter when importing data as a data scientist, such as comments occurring in
flat files, empty lines and missing values. Note that missing values are also
commonly referred to as NA or NaN. To wrap up this chapter, you're now going to
import a slightly corrupted copy of the Titanic dataset titanic_corrupt.txt, which

contains comments after the character '#' and is tab-delimited.

Complete the sep (the pandas version of delim), comment and na_values arguments of
pd.read_csv(). comment takes characters that comments occur after in the file,
which in this case is '#'. na_values takes a list of strings to recognize as
NA/NaN, in this case the string 'Nothing'.
Execute the rest of the code to print the head of the resulting DataFrame and plot
the histogram of the 'Age' of passengers aboard the Titanic.

# Import matplotlib.pyplot as plt


import matplotlib.pyplot as plt

# Assign filename: file


file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram


pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Not so flat any more
In Chapter 1, you learned how to use the IPython command !ls to explore your
current working directory. You can also do this natively in Python using the
library os, which consists of miscellaneous operating system interfaces.

The following code:

imports the library os,
stores the name of the current directory in a string called wd and
outputs the contents of the directory in a list to the shell.

import os
wd = os.getcwd()
os.listdir(wd)
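
A small complementary sketch: once you know what the directory contains, you can
also check for a specific file before trying to import it (the filename below is
just an example):

import os

# Check whether an expected data file is present before importing it
if os.path.exists('moby_dick.txt'):
    print('File found, safe to import.')
else:
    print('File not found in', os.getcwd())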

Loading a pickled file


There are a number of datatypes that cannot be saved easily to flat files, such as
lists and dictionaries. If you want your files to be human readable, you may want
to save them as text files in a clever manner. JSONs, which you will see in a later
chapter, are appropriate for Python dictionaries.

However, if you merely want to be able to import them into Python, you can
serialize them. All this means is converting the object into a sequence of bytes,
or a bytestream.

In this exercise, you'll import the pickle package, open a previously pickled data
structure from a file and load it.

Import the pickle package.


Complete the second argument of open() so that it is read only for a binary file.
This argument will be a string of two letters, one signifying 'read only', the
other 'binary'.
Pass the correct argument to pickle.load(); it should use the variable that is
bound to the open file object.
Print the data, d.
Print the datatype of d; take your mind back to your previous use of the function
type().

# Import pickle package


import pickle

# Open pickle file and load data: d


with open('data.pkl', 'rb') as file:
    d = pickle.load(file)

# Print d
print(d)
# Print datatype of d
print(type(d))
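
For completeness, here is a minimal sketch of how such a pickle might be created in
the first place (the dictionary below is purely illustrative, not the course's
actual data):

import pickle

# Any picklable Python object, e.g. a dictionary
example = {'alpha': 1, 'beta': [2, 3]}

# Serialize it to a bytestream and write it to disk
with open('data_example.pkl', 'wb') as f:
    pickle.dump(example, f)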

Listing sheets in Excel files


Whether you like it or not, any working data scientist will need to deal with Excel
spreadsheets at some point in time. You won't always want to do so in Excel,
however!
Here, you'll learn how to use pandas to import Excel spreadsheets and how to list
the names of the sheets in any loaded .xlsx file.

Recall from the video that, given an Excel file imported into a variable
spreadsheet, you can retrieve a list of the sheet names using the attribute
spreadsheet.sheet_names.

Specifically, you'll be loading and checking out the spreadsheet 'battledeath.xlsx',
modified from the Peace Research Institute Oslo's (PRIO)
dataset. This data contains age-adjusted mortality rates due to war in various
countries over several years.

Assign the filename to the variable file.


Pass the correct argument to pd.ExcelFile() to load the file using pandas.
Print the sheet names of the Excel spreadsheet by passing the necessary argument to
the print() function.

# Import pandas
import pandas as pd

# Assign spreadsheet filename: file


file = 'battledeath.xlsx'
# Load spreadsheet: xls
xls = pd.ExcelFile(file)
# Print sheet names
print(xls.sheet_names)

Importing sheets from Excel files


In this exercise, you'll learn how to import any given sheet of your loaded .xlsx
file as a DataFrame. You'll be able to do so by specifying either the sheet's name
or its index.

Load the sheet '2004' into the DataFrame df1 using its name as a string.
Print the head of df1 to the shell.
Load the sheet 2002 into the DataFrame df2 using its index (0).
Print the head of df2 to the shell.

# Load a sheet into a DataFrame by name: df1


df1 = xls.parse('2004')
print(df1.head())

# Load a sheet into a DataFrame by index: df2


df2 = xls.parse(0)
print(df2.head())

Customizing your spreadsheet import


Here, you'll parse your spreadsheets and use additional arguments to skip rows,
rename columns and select only particular columns.

As before, you'll use the method parse(). This time, however, you'll add the
additional arguments skiprows, names and usecols. These skip rows, name the columns
and designate which columns to parse, respectively. All these arguments can be
assigned to lists containing the specific row numbers, strings and column numbers,
as appropriate.

Parse the first sheet by index. In doing so, skip the first row of data and name
the columns 'Country' and 'AAM due to War (2002)' using the argument names. The
values passed to skiprows and names all need to be of type list.
Parse the second sheet by index. In doing so, parse only the first column with the
usecols parameter, skip the first row and rename the column 'Country'. The argument
passed to usecols also needs to be of type list.

# Parse the first sheet and rename the columns: df1


df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
print(df1.head())

# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())

Importing SAS files

Here, you'll import the SAS file 'sales.sas7bdat' using the package sas7bdat, load
it into the pandas DataFrame df_sas and plot a histogram of one of its features.

# Import sas7bdat package


from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas


with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame


print(df_sas.head())

# Plot histogram of DataFrame features (pandas and pyplot already imported)


pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
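
If installing sas7bdat is not an option, pandas itself can read SAS files; a
minimal alternative sketch, assuming the same 'sales.sas7bdat' file:

import pandas as pd

# Read the SAS file directly with pandas
df_sas_alt = pd.read_sas('sales.sas7bdat', format='sas7bdat')
print(df_sas_alt.head())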

Using read_stata to import Stata files


The pandas package has been imported in the environment as pd and the file
disarea.dta is in your working directory. The data consist of disease extents for
several diseases in various countries (more information can be found here).

# Import pandas
import pandas as pd

# Load Stata file into a pandas DataFrame: df


df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())

# Plot histogram of one column of the DataFrame


pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()

Using h5py to import HDF5 files


The file 'LIGO_data.hdf5' is already in your working directory. In this exercise,
you'll import it using the h5py library. You'll also print out its datatype to
confirm you have imported it correctly. You'll then study the structure of the file
in order to see precisely what HDF groups it contains.
You can find the LIGO data plus loads of documentation and tutorials here. There is
also a great tutorial on Signal Processing with the data here.

Import the package h5py.


Assign the name of the file to the variable file.
Load the file as read only into the variable data.
Print the datatype of data.
Print the names of the groups in the HDF5 file 'LIGO_data.hdf5'.

# Import packages
import numpy as np
import h5py

# Assign filename: file


file = 'LIGO_data.hdf5'

# Load file: data


data = h5py.File(file, 'r')

# Print the datatype of the loaded file


print(type(data))

# Print the keys of the file


for key in data.keys():
    print(key)

Extracting data from your HDF5 file


In this exercise, you'll extract some of the LIGO experiment's actual data from the
HDF5 file and you'll visualize it.
To do so, you'll need to first explore the HDF5 group 'strain'.

Assign the HDF5 group data['strain'] to group.


In the for loop, print out the keys of the HDF5 group in group.
Assign to the variable strain the values of the time series data data['strain']
['Strain'] (the course uses the .value attribute; in h5py 3.x this is written [()]).
Set num_samples equal to 10000, the number of time points we wish to sample.
Execute the rest of the code to produce a plot of the time series data in
LIGO_data.hdf5.

# Get the HDF5 group: group


group = data['strain']

# Check out keys of group


for key in group.keys():
    print(key)

# Set variable equal to time series data: strain


strain = data['strain']['Strain'][()]  # [()] replaces the .value attribute removed in h5py 3.x

# Set number of time points to sample: num_samples


num_samples = 10000

# Set time vector


time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()

Loading .mat files


In this exercise, you'll figure out how to load a MATLAB file using
scipy.io.loadmat() and you'll discover what Python datatype it yields.

The file 'albeck_gene_expression.mat' is in your working directory. This file
contains gene expression data from the Albeck Lab at UC Davis. You can find the
data and some great documentation here.

Instructions
100 XP
Import the package scipy.io.
Load the file 'albeck_gene_expression.mat' into the variable mat; do so using the
function scipy.io.loadmat().
Use the function type() to print the datatype of mat to the IPython shell.

# Import package
import scipy.io

# Load MATLAB file: mat


mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype type of mat


print(type(mat))

The structure of .mat in Python


Here, you'll discover what is in the MATLAB dictionary that you loaded in the
previous exercise.

The file 'albeck_gene_expression.mat' is already loaded into the variable mat. The
following libraries have already been imported as follows:

import scipy.io
import matplotlib.pyplot as plt
import numpy as np
Once again, this file contains gene expression data from the Albeck Lab at UC Davis.
You can find the data and some great documentation here.

Instructions
100 XP
Use the method .keys() on the dictionary mat to print the keys. Most of these keys
(in fact the ones that do NOT begin and end with '__') are variables from the
corresponding MATLAB environment.
Print the type of the value corresponding to the key 'CYratioCyt' in mat. Recall
that mat['CYratioCyt'] accesses the value.
Print the shape of the value corresponding to the key 'CYratioCyt' using the numpy
function shape().
Execute the entire script to see some oscillatory gene expression data!

# Print the keys of the MATLAB dictionary


print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'


print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'


print(np.shape(mat['CYratioCyt']))

# Subset the array and plot it


data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()

Pop quiz: The relational model


Each row or record in a table represents an instance of an entity type.
Each column in a table represents an attribute or feature of an instance.
Every table contains a primary key column, which has a unique entry for each row.
There are relations between tables.
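
To make these ideas concrete, here is a small illustrative sketch using Python's
built-in sqlite3 module (the tables are purely hypothetical, not part of the
course's databases): each row is an instance, each column an attribute, the primary
key uniquely identifies each row, and a foreign key expresses the relation between
tables.

import sqlite3

# Throwaway in-memory database
con = sqlite3.connect(':memory:')
cur = con.cursor()

# Each table has a primary key; OrderTbl relates to CustomerTbl via CustomerId
cur.execute("CREATE TABLE CustomerTbl (CustomerId INTEGER PRIMARY KEY, Name TEXT)")
cur.execute("CREATE TABLE OrderTbl (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER, "
            "FOREIGN KEY (CustomerId) REFERENCES CustomerTbl (CustomerId))")

# One row = one instance of the entity type
cur.execute("INSERT INTO CustomerTbl VALUES (1, 'Ada')")
cur.execute("INSERT INTO OrderTbl VALUES (10, 1)")

print(cur.execute("SELECT * FROM OrderTbl").fetchall())
con.close()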

Creating a database engine


Here, you're going to fire up your very first SQL engine. You'll create an engine
to connect to the SQLite database 'Chinook.sqlite', which is in your working
directory. Remember that to create an engine to connect to 'Northwind.sqlite', Hugo
executed the command

engine = create_engine('sqlite:///Northwind.sqlite')
Here, 'sqlite:///Northwind.sqlite' is called the connection string to the SQLite
database Northwind.sqlite. A little bit of background on the Chinook database: the
Chinook database contains information about a semi-fictional digital media store in
which media data is real and customer, employee and sales data has been manually
created.

Why the name Chinook, you ask? According to their website,

The name of this sample database was based on the Northwind database. Chinooks are
winds in the interior West of North America, where the Canadian Prairies and Great
Plains meet various mountain ranges. Chinooks are most prevalent over southern
Alberta in Canada. Chinook is a good name choice for a database that intends to be
an alternative to Northwind.

Import the function create_engine from the module sqlalchemy.


Create an engine to connect to the SQLite database 'Chinook.sqlite' and assign it
to engine.

# Import necessary module


from sqlalchemy import create_engine

# Create engine: engine


engine = create_engine('sqlite:///Chinook.sqlite')
What are the tables in the database?
In this exercise, you'll once again create an engine to connect to
'Chinook.sqlite'. Before you can get any data out of the database, however, you'll
need to know what tables it contains!

To this end, you'll save the table names to a list using the method table_names()
on the engine and then you will print the list.

Import the function create_engine from the module sqlalchemy.


Create an engine to connect to the SQLite database 'Chinook.sqlite' and assign it
to engine.
Using the method table_names() on the engine engine, assign the table names of
'Chinook.sqlite' to the variable table_names.
Print the object table_names to the shell.

# Import necessary module


from sqlalchemy import create_engine

# Create engine: engine


engine = create_engine('sqlite:///Chinook.sqlite')

# Save the table names to a list: table_names


table_names = engine.table_names()
print(table_names)

The Hello World of SQL Queries!


Now, it's time for liftoff! In this exercise, you'll perform the Hello World of SQL
queries, SELECT, in order to retrieve all columns of the table Album in the Chinook
database. Recall that the query SELECT * selects all columns.

Instructions
100 XP
Open the engine connection as con using the method connect() on the engine.
Execute the query that selects ALL columns from the Album table. Store the results
in rs.
Store all of your query results in the DataFrame df by applying the fetchall()
method to the results rs.
Close the connection!

# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine


engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine connection: con


con = engine.connect()

# Perform query: rs
rs = con.execute('SELECT * FROM Album')

# Save results of the query to DataFrame: df


df = pd.DataFrame(rs.fetchall())
# Close connection
con.close()

# Print head of DataFrame df


print(df.head())

Customizing the Hello World of SQL Queries


In this exercise, you'll customize your query in order to:

Select specified columns from a table;
Select a specified number of rows;
Import column names from the database table.
Recall that Hugo performed a very similar query customization in the video:

engine = create_engine('sqlite:///Northwind.sqlite')

with engine.connect() as con:
    rs = con.execute("SELECT OrderID, OrderDate, ShipName FROM Orders")
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()
Packages have already been imported as follows:

from sqlalchemy import create_engine


import pandas as pd
The engine has also already been created:

engine = create_engine('sqlite:///Chinook.sqlite')
The engine connection is already open with the statement

with engine.connect() as con:


All the code you need to complete is within this context.

# Open engine in context manager


# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT LastName, Title FROM Employee')
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the DataFrame df


print(len(df))

# Print the head of the DataFrame df


print(df.head())

Filtering your database records using SQL's WHERE


You can now execute a basic SQL query to select records from any table in your
database and you can also perform simple query customizations to select particular
columns and numbers of rows.

There are a couple more standard SQL query chops that will aid you in your journey
to becoming an SQL ninja.

Let's say, for example that you wanted to get all records from the Customer table
of the Chinook database for which the Country is 'Canada'. You can do this very
easily in SQL using a SELECT statement followed by a WHERE clause as follows:

SELECT * FROM Customer WHERE Country = 'Canada'


In fact, you can filter any SELECT statement by any condition using a WHERE clause.
This is called filtering your records.

In this interactive exercise, you'll select all records of the Employee table for
which 'EmployeeId' is greater than or equal to 6.

# Create engine: engine


engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager


# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee WHERE EmployeeId >= 6')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df


print(df.head())

Ordering your SQL records with ORDER BY


You can also order your SQL query results. For example, if you wanted to get all
records from the Customer table of the Chinook database and order them in
increasing order by the column SupportRepId, you could do so with the following
query:

"SELECT * FROM Customer ORDER BY SupportRepId"


In fact, you can order any SELECT statement by any column.

In this interactive exercise, you'll select all records of the Employee table and
order them in increasing order by the column BirthDate.

# Create engine: engine


engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager


with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee ORDER BY BirthDate')
    df = pd.DataFrame(rs.fetchall())

    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame


print(df.head())

Pandas and The Hello World of SQL Queries!


Here, you'll take advantage of the power of pandas to write the results of your SQL
query to a DataFrame in one swift line of Python code!

You'll first import pandas and create the SQLite 'Chinook.sqlite' engine. Then
you'll query the database to select all records from the Album table.

# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine


engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df


df = pd.read_sql_query('SELECT * FROM Album', engine)

# Execute a filtered and ordered query and store records in DataFrame: df2


df2 = pd.read_sql_query('SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate', engine)

# Open engine in context manager and perform the INNER JOIN query,
# saving the results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Execute an INNER JOIN query with a WHERE filter and store records in DataFrame: df


df = pd.read_sql_query('SELECT * FROM PlaylistTrack INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000', engine)
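
As a usage note, the Album/Artist INNER JOIN performed above with the context
manager could equally be written as a single pd.read_sql_query() call (a sketch
using the same query string):

df_join = pd.read_sql_query('SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID', engine)
print(df_join.head())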

Importing flat files from the web: your turn!


You are about to import your first file from the web! The flat file you will import
will be 'winequality-red.csv' from the University of California, Irvine's Machine
Learning repository. The flat file contains tabular data of physicochemical
properties of red wine, such as pH, alcohol content and citric acid content, along
with wine quality rating.

The URL of the file is

'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
After you import it, you'll check your working directory to confirm that it is
there and then you'll load it into a pandas DataFrame.

Instructions
100 XP
Import the function urlretrieve from the subpackage urllib.request.
Assign the URL of the file to the variable url.
Use the function urlretrieve() to save the file locally as 'winequality-red.csv'.
Execute the remaining code to load 'winequality-red.csv' in a pandas DataFrame and
to print its head to the shell.

# Import package
from urllib.request import urlretrieve
import pandas as pd

# Assign url of file: url


url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally


urlretrieve(url, 'winequality-red.csv')
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

Opening and reading flat files from the web


You have just imported a file from the web, saved it locally and loaded it into a
DataFrame. If you just wanted to load a file from the web into a DataFrame without
first saving it locally, you can do that easily using pandas. In particular, you
can use the function pd.read_csv() with the URL as the first argument and the
separator sep as the second argument.

Assign the URL of the file to the variable url.


Read file into a DataFrame df using pd.read_csv(), recalling that the separator in
the file is ';'.
Print the head of the DataFrame df.
Execute the rest of the code to plot histogram of the first feature in the
DataFrame df.

# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url


url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df


df = pd.read_csv(url, sep=';')
print(df.head())

# Plot first column of df


pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()

Importing non-flat files from the web


Congrats! You've just loaded a flat file from the web into a DataFrame without
first saving it locally using the pandas function pd.read_csv(). This function is
super cool because it has close relatives that allow you to load all types of
files, not only flat ones. In this interactive exercise, you'll use pd.read_excel()
to import an Excel spreadsheet.

Your job is to use pd.read_excel() to read in all of its sheets, print the sheet
names and then print the head of the first sheet using its name, not its index.

Note that the output of pd.read_excel() is a Python dictionary with sheet names as
keys and corresponding DataFrames as corresponding values.

Assign the URL of the file to the variable url.


Read the file in url into a dictionary xl using pd.read_excel() recalling that, in
order to import all sheets you need to pass None to the argument sheet_name.
Print the names of the sheets in the Excel spreadsheet; these will be the keys of
the dictionary xl.
Print the head of the first sheet using the sheet name, not the index of the sheet!
The sheet name is '1700'.

# Import package
import pandas as pd

# Assign url of file: url


url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl


xl = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell


print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

Performing HTTP requests in Python using urllib


Now that you know the basics behind HTTP GET requests, it's time to perform some of
your own. In this interactive exercise, you will ping our very own DataCamp servers
to perform a GET request to extract information from our teach page,
"http://www.datacamp.com/teach/documentation".

In the next exercise, you'll extract the HTML itself. Right now, however, you are
going to package and send the request and then catch the response.

Instructions
100 XP
Import the functions urlopen and Request from the subpackage urllib.request.
Package the request to the url "http://www.datacamp.com/teach/documentation" using
the function Request() and assign it to request.
Send the request and catch the response in the variable response with the function
urlopen().
Run the rest of the code to see the datatype of response and to close the
connection!

# Import packages
from urllib.request import urlopen
from urllib.request import Request

# Specify the url


url = "http://www.datacamp.com/teach/documentation"

# This packages the request: request


request = Request(url)

# Sends the request and catches the response: response


response = urlopen(request)
# Print the datatype of response
print(type(response))

# Be polite and close the response!


response.close()

Printing HTTP request results in Python using urllib


You have just packaged and sent a GET request to
"http://www.datacamp.com/teach/documentation" and then caught the response. You saw
that such a response is a http.client.HTTPResponse object. The question remains:
what can you do with this response?

Well, as it came from an HTML page, you could read it to extract the HTML and, in
fact, such a http.client.HTTPResponse object has an associated read() method. In
this exercise, you'll build on your previous great work to extract the response and
print the HTML.

Instructions
100 XP
Send the request and catch the response in the variable response with the function
urlopen(), as in the previous exercise.
Extract the response using the read() method and store the result in the variable
html.
Print the string html.
Hit submit to perform all of the above and to close the response: be tidy!

# Import packages
from urllib.request import urlopen, Request

# Specify the url


url = "http://www.datacamp.com/teach/documentation"

# This packages the request


request = Request(url)

# Sends the request and catches the response: response


response = urlopen(request)

# Extract the response: html


html = response.read()

# Print the html


print(html)

# Be polite and close the response!


response.close()

Performing HTTP requests in Python using requests


Now that you've got your head and hands around making HTTP requests using the
urllib package, you're going to figure out how to do the same using the higher-
level requests library. You'll once again be pinging DataCamp servers for their
"http://www.datacamp.com/teach/documentation" page.

Note that unlike in the previous exercises using urllib, you don't have to close
the connection when using requests!

Import the package requests.


Assign the URL of interest to the variable url.
Package the request to the URL, send the request and catch the response with a
single function requests.get(), assigning the response to the variable r.
Use the text attribute of the object r to return the HTML of the webpage as a
string; store the result in a variable text.
Hit submit to print the HTML of the webpage.

# Import package
import requests

# Specify the url: url


url = "http://www.datacamp.com/teach/documentation"

# Packages the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text


text = r.text

# Print the html


print(text)

Parsing HTML with BeautifulSoup


In this interactive exercise, you'll learn how to use the BeautifulSoup package to
parse, prettify and extract information from HTML. You'll scrape the data from the
webpage of Guido van Rossum, Python's very own Benevolent Dictator for Life. In the
following exercises, you'll prettify the HTML and then extract the text and the
hyperlinks.

The URL of interest is url = 'https://www.python.org/~guido/'.

Import the function BeautifulSoup from the package bs4.


Assign the URL of interest to the variable url.
Package the request to the URL, send the request and catch the response with a
single function requests.get(), assigning the response to the variable r.
Use the text attribute of the object r to return the HTML of the webpage as a
string; store the result in a variable html_doc.
Create a BeautifulSoup object soup from the resulting HTML using the function
BeautifulSoup().
Use the method prettify() on soup and assign the result to pretty_soup.
Hit submit to print the prettified HTML to your shell!

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url


url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc


html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup


soup = BeautifulSoup(html_doc)

# Prettify the BeautifulSoup object: pretty_soup


pretty_soup = soup.prettify()

# Print the response


print(pretty_soup)

Turning a webpage into data using BeautifulSoup: getting the text


As promised, in the following exercises, you'll learn the basics of extracting
information from HTML soup. In this exercise, you'll figure out how to extract the
text from the BDFL's webpage, along with printing the webpage's title.

Instructions
100 XP
In the sample code, the HTML response object html_doc has already been created:
your first task is to Soupify it using the function BeautifulSoup() and to assign
the resulting soup to the variable soup.
Extract the title from the HTML soup soup using the attribute title and assign the
result to guido_title.
Print the title of Guido's webpage to the shell using the print() function.
Extract the text from the HTML soup soup using the method get_text() and assign to
guido_text.
Hit submit to print the text from Guido's webpage to the shell.

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url


url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc


html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup


soup = BeautifulSoup(html_doc)

# Get the title of Guido's webpage: guido_title


guido_title = soup.title

# Print the title of Guido's webpage to the shell


print(guido_title)

# Get Guido's text: guido_text


guido_text = soup.get_text()

# Print Guido's text to the shell


print(guido_text)
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
In this exercise, you'll figure out how to extract the URLs of the hyperlinks from
the BDFL's webpage. In the process, you'll become close friends with the soup
method find_all().

Instructions
100 XP
Use the method find_all() to find all hyperlinks in soup, remembering that
hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle
brackets; store the result in the variable a_tags.
The variable a_tags is a results set: your job now is to enumerate over it, using a
for loop and to print the actual URLs of the hyperlinks; to do this, for every
element link in a_tags, you want to print() link.get('href').

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text

# create a BeautifulSoup object from the HTML: soup


soup = BeautifulSoup(html_doc)
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags


a_tags = soup.find_all('a')

# Print the URLs to the shell


for link in a_tags:
    print(link.get('href'))

Loading and exploring a JSON


Load the JSON 'a_movie.json' into the variable json_data, which will be a
dictionary. You'll then explore the JSON contents by printing the key-value pairs
of json_data to the shell.

Load the JSON 'a_movie.json' into the variable json_data within the context
provided by the with statement. To do so, use the function json.load() within the
context manager.
Use a for loop to print all key-value pairs in the dictionary json_data. Recall
that you can access a value in a dictionary using the syntax: dictionary[key].

# Import the json package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data


for k in json_data.keys():
    print(k + ': ', json_data[k])

Pop quiz: Exploring your JSON


Load the JSON 'a_movie.json' into a variable, which will be a dictionary. Do so by
copying, pasting and executing the following code in the IPython Shell:

import json
with open("a_movie.json") as json_file:
json_data = json.load(json_file)
Print the values corresponding to the keys 'Title' and 'Year' and answer the
following question about the movie that the JSON describes:

Which of the following statements is true of the movie in question?


The title is 'Kung Fu Panda' and the year is 2010.
The title is 'Kung Fu Panda' and the year is 2008.
The title is 'The Social Network' and the year is 2010.
The title is 'The Social Network' and the year is 2008.

import json
with open("a_movie.json") as json_file:
json_data = json.load(json_file)

print(json_data.keys())
print(json_data.values())
print(json_data.keys(), json_data.values())

print(json_data['Title'])

API requests
Now it's your turn to pull some movie data down from the Open Movie Database (OMDB)
using their API. The movie you'll query the API about is The Social Network. Recall
that, in the video, to query the API about the movie Hackers, Hugo's query string
was 'http://www.omdbapi.com/?t=hackers' and had a single argument t=hackers.

Note: recently, OMDB has changed their API: you now also have to specify an API
key. This means you'll have to add another argument to the URL: apikey=72bc447a.

Import the requests package.


Assign to the variable url the URL of interest in order to query
'http://www.omdbapi.com' for the data corresponding to the movie The Social
Network. The query string should have two arguments: apikey=72bc447a and
t=the+social+network. You can combine them as follows:
apikey=72bc447a&t=the+social+network.
Print the text of the response object r by using its text attribute and passing the
result to the print() function.

# Import requests package


import requests

# Assign URL to variable: url


url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response


print(r.text)

JSON–from the web to Python


You've just queried your first API programmatically in Python and printed the text
of the response to the shell. However, as you know, your response is actually a
JSON, so you can do one step better and decode the JSON. You can then print the
key-value pairs of the resulting dictionary. That's what you're going to do now!

Pass the variable url to the requests.get() function in order to send the relevant
request and catch the response, assigning the resultant response message to the
variable r.
Apply the json() method to the response object r and store the resulting dictionary
in the variable json_data.
Hit Submit Answer to print the key-value pairs of the dictionary json_data to the
shell.

# Import package
import requests

# Assign URL to variable: url


url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data


json_data = r.json()

# Print each key-value pair in json_data


for k in json_data.keys():
    print(k + ': ', json_data[k])

Checking out the Wikipedia API


You're doing so well and having so much fun that we're going to throw one more API
at you: the Wikipedia API (documented here). You'll figure out how to find and
extract information from the Wikipedia page for Pizza. What gets a bit wild here is
that your query will return nested JSONs, that is, JSONs with JSONs, but Python can
handle that because it will translate them into dictionaries within dictionaries.

The URL that requests the relevant query from the Wikipedia API is

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza

Assign the relevant URL to the variable url.


Apply the json() method to the response object r and store the resulting dictionary
in the variable json_data.
The variable pizza_extract holds the HTML of an extract from Wikipedia's Pizza page
as a string; use the function print() to print this string to the shell.

# Import package
import requests
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data


json_data = r.json()

# Print the Wikipedia page extract


pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
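
Since the numeric page id ('24768' above) is specific to this query, a slightly
more general sketch iterates over whatever pages the response contains instead of
hardcoding the id:

# Iterate over the returned pages rather than hardcoding the page id
for page_id, page in json_data['query']['pages'].items():
    print(page_id, page.get('extract', '')[:100])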

API Authentication
The package tweepy is great at handling all the Twitter API OAuth Authentication
details for you. All you need to do is pass it your authentication credentials. In
this interactive exercise, we have created some mock authentication credentials (if
you wanted to replicate this at home, you would need to create a Twitter App as
Hugo detailed in the video). Your task is to pass these credentials to tweepy's
OAuth handler.

Import the package tweepy.


Pass the parameters consumer_key and consumer_secret to the function
tweepy.OAuthHandler().
Complete the passing of OAuth credentials to the OAuth handler auth by applying to
it the method set_access_token(), along with arguments access_token and
access_token_secret.

# Import package
import tweepy

# Store OAuth authentication credentials in relevant variables


access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"

# Pass OAuth details to tweepy's OAuth handler


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

Streaming tweets
Now that you have set up your authentication credentials, it is time to stream some
tweets! We have already defined the tweet stream listener class, MyStreamListener,
just as Hugo did in the introductory video. You can find the code for the tweet
stream listener class here.
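
For reference, here is a minimal sketch of what such a listener class might look
like (an illustrative assumption based on the description above, written against
the tweepy 3.x API used in this exercise): it writes each incoming tweet to
'tweets.txt' as one JSON document per line and stops after 100 tweets.

import json
import tweepy

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open('tweets.txt', 'w')

    def on_status(self, status):
        # Append the raw JSON of each tweet as a single line
        tweet = status._json
        self.file.write(json.dumps(tweet) + '\n')
        self.num_tweets += 1
        if self.num_tweets < 100:
            return True
        self.file.close()
        return False

    def on_error(self, status):
        print(status)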

Your task is to create the Stream object and to filter tweets according to
particular keywords.

Create your Stream object with authentication by passing tweepy.Stream() the
authentication handler auth and the Stream listener l.
To filter Twitter streams, pass to the track argument in stream.filter() a list
containing the desired keywords 'clinton', 'trump', 'sanders', and 'cruz'.

# Initialize Stream listener


l = MyStreamListener()

# Create your Stream object with authentication


stream = tweepy.Stream(auth, l)

# Filter Twitter Streams to capture data by the keywords:


stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])

Load and explore your Twitter data


Now that you've got your Twitter data sitting locally in a text file, it's time to
explore it! This is what you'll do in the next few interactive exercises. In this
exercise, you'll read the Twitter data into a list: tweets_data.

Assign the filename 'tweets.txt' to the variable tweets_data_path.


Initialize tweets_data as an empty list to store the tweets in.
Within the for loop initiated by for line in tweets_file:, load each tweet into a
variable, tweet, using json.loads(), then append tweet to tweets_data using the
append() method.
Hit submit and check out the keys of the first tweet dictionary printed to the
shell.

# Import package
import json

# String of path to file: tweets_data_path


tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data


tweets_data = []

# Open connection to file


tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data


for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file


tweets_file.close()

# Print the keys of the first tweet dict


print(tweets_data[0].keys())


Twitter data to DataFrame


Now you have the Twitter data in a list of dictionaries, tweets_data, where each
dictionary corresponds to a single tweet. Next, you're going to extract the text
and language of each tweet. The text in a tweet, t1, is stored as the value
t1['text']; similarly, the language is stored in t1['lang']. Your task is to build
a DataFrame in which each row is a tweet and the columns are 'text' and 'lang'.

Instructions
100 XP
Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so,
the first argument should be tweets_data, a list of dictionaries. The second
argument to pd.DataFrame() is a list of the keys you wish to have as columns.
Assign the result of the pd.DataFrame() call to df.
Print the head of the DataFrame.

# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages


df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame


print(df.head())
A little bit of Twitter text analysis
Now that you have your DataFrame of tweets set up, you're going to do a bit of text
analysis to count how many tweets contain the words 'clinton', 'trump', 'sanders'
and 'cruz'. In the pre-exercise code, we have defined the following function
word_in_text(), which will tell you whether the first argument (a word) occurs
within the second argument (a tweet).

import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False
You're going to iterate over the rows of the DataFrame and calculate how many
tweets contain each of our keywords! The counters for each candidate have been
initialized to 0.

Instructions
100 XP
Within the for loop for index, row in df.iterrows():, the code currently increases
the value of clinton by 1 each time a tweet (text row) mentioning 'Clinton' is
encountered; complete the code so that the same happens for trump, sanders and
cruz.

# Initialize list to store tweet counts


[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which


# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

Plotting your Twitter data


Now that you have the number of tweets that each candidate was mentioned in, you
can plot a bar chart of this data. You'll use the statistical data visualization
library seaborn, which you may not have seen before, but we'll guide you through.
You'll first import seaborn as sns. You'll then construct a barplot of the data
using sns.barplot, passing it two arguments:

a list of labels and


a list containing the variables you wish to plot (clinton, trump and so on.)

Import both matplotlib.pyplot and seaborn using the aliases plt and sns,
respectively.
Complete the arguments of sns.barplot: the first argument should be the labels to
appear on the x-axis; the second argument should be the list of the variables you
wish to plot, as produced in the previous exercise.
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style


sns.set(color_codes=True)

# Create a list of labels: cd


cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot the bar chart of mention counts
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()
