Python Data Import
In order to import data into Python, you should first have an idea of what files
are in your working directory.
IPython, which is running on DataCamp's servers, has a bunch of cool commands,
including its magic commands. For example, starting a line with ! gives you
complete system shell access. This means that the IPython magic command ! ls will
display the contents of your current directory. Your task is to use the IPython
magic command ! ls to check out the contents of your current directory and answer
the following question: which of the following files is in your working directory?
!ls
# Close file
file.close()
# Import package
import numpy as np
There are a number of arguments that np.loadtxt() takes that you'll find useful:
delimiter changes the delimiter that loadtxt() is expecting, for example, you can
use ',' and '\t' for comma-delimited and tab-delimited respectively; skiprows
allows you to specify how many rows (not indices) you wish to skip; usecols takes a
list of the indices of the columns you wish to keep.
The file you'll be importing has a header and is tab-delimited.
Instructions
100 XP
Complete the arguments of np.loadtxt(): the file you're importing is tab-delimited,
you want to skip the first row and you only want to import the first and third
columns.
Complete the argument of the print() call in order to print the entire array that
you just imported.
# Import numpy
import numpy as np
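Continuing from that import, a sketch of the completed calls (assuming file holds the filename provided in the exercise):

# Load the file: tab-delimited, skip the header row, keep the first and third columns
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# Print the entire array
print(data)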
Due to the header, if you tried to import it as-is using np.loadtxt(), Python would
throw you a ValueError and tell you that it could not convert string to float.
There are two ways to deal with this: firstly, you can set the data type argument
dtype equal to str (for string).
Alternatively, you can skip the first row as we have seen before, using the
skiprows argument.
Instructions
100 XP
Complete the first call to np.loadtxt() by passing file as the first argument.
Execute print(data[0]) to print the first element of data.
Complete the second call to np.loadtxt(). The file you're importing is tab-delimited, the datatype is float, and you want to skip the first row.
Print the 10th element of data_float by completing the print() command. Be guided
by the previous print() call.
Execute the rest of the code to visualize the data.
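A sketch of the two approaches described above (again assuming file holds the filename of the tab-delimited file):

# Import numpy
import numpy as np

# Approach 1: import everything as strings
data = np.loadtxt(file, delimiter='\t', dtype=str)
print(data[0])

# Approach 2: skip the header row and import as floats
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
print(data_float[9])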
Instructions
100 XP
Import titanic.csv using the function np.recfromcsv() and assign it to the
variable, d. You'll only need to pass file to it because it has the defaults
delimiter=',' and names=True in addition to dtype=None!
Run the remaining code to print the first three entries of the resulting array d.
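A sketch of the completed import (with file holding 'titanic.csv'):

# Import numpy
import numpy as np

# Import the file using np.recfromcsv(): d
d = np.recfromcsv(file)

# Print the first three entries of d
print(d[:3])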
# Import pandas as pd
import pandas as pd
Import the first 5 rows of the file into a DataFrame using the function
pd.read_csv() and assign the result to data. You'll need to use the arguments nrows
and header (there is no header in this file).
Build a numpy array from the resulting DataFrame in data and assign to data_array.
Execute print(type(data_array)) to print the datatype of data_array.
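A sketch of those steps (with file holding the filename):

# Import pandas
import pandas as pd

# Read the first 5 rows of the header-less file: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array
print(type(data_array))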
Complete the sep (the pandas version of delim), comment and na_values arguments of
pd.read_csv(). comment takes characters that comments occur after in the file,
which in this case is '#'. na_values takes a list of strings to recognize as
NA/NaN, in this case the string 'Nothing'.
Execute the rest of the code to print the head of the resulting DataFrame and plot
the histogram of the 'Age' of passengers aboard the Titanic.
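A sketch of the completed call (assuming the corrupted Titanic file is tab-delimited, so sep='\t'):

# Import the corrupted file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing'])

# Print the head of the DataFrame
print(data.head())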
import os
wd = os.getcwd()
os.listdir(wd)
However, if you merely want to be able to import them into Python, you can
serialize them. All this means is converting the object into a sequence of bytes,
or a bytestream.
In this exercise, you'll import the pickle package, open a previously pickled data
structure from a file and load it.
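A minimal sketch of that workflow ('pickled_fruit.pkl' is a stand-in filename; note the file must be opened in binary read mode):

# Import pickle package
import pickle

# Open a previously pickled file and load its contents into d
with open('pickled_fruit.pkl', 'rb') as file:
    d = pickle.load(file)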
# Print d
print(d)
# Print datatype of d
print(type(d))
Recall from the video that, given an Excel file imported into a variable
spreadsheet, you can retrieve a list of the sheet names using the attribute
spreadsheet.sheet_names.
# Import pandas
import pandas as pd
Load the sheet '2004' into the DataFrame df1 using its name as a string.
Print the head of df1 to the shell.
Load the sheet 2002 into the DataFrame df2 using its index (0).
Print the head of df2 to the shell.
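A sketch of those steps (pandas is imported above; 'spreadsheet.xlsx' is a stand-in for the actual file):

# Load the spreadsheet as an ExcelFile object: xls
xls = pd.ExcelFile('spreadsheet.xlsx')

# Print the sheet names
print(xls.sheet_names)

# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
print(df1.head())

# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
print(df2.head())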
As before, you'll use the method parse(). This time, however, you'll add the
additional arguments skiprows, names and usecols. These skip rows, name the columns
and designate which columns to parse, respectively. All these arguments can be
assigned to lists containing the specific row numbers, strings and column numbers,
as appropriate.
Parse the first sheet by index. In doing so, skip the first row of data and name
the columns 'Country' and 'AAM due to War (2002)' using the argument names. The
values passed to skiprows and names all need to be of type list.
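A sketch of that first call (xls is the ExcelFile object from before):

# Parse the first sheet by index; skip the first row and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])

# Print the head of the DataFrame df1
print(df1.head())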
Parse the second sheet by index. In doing so, parse only the first column with the
usecols parameter, skip the first row and rename the column 'Country'. The argument
passed to usecols also needs to be of type list.
# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
# Import pandas
import pandas as pd
# Import packages
import numpy as np
import h5py
import matplotlib.pyplot as plt
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
Instructions
100 XP
Import the package scipy.io.
Load the file 'albeck_gene_expression.mat' into the variable mat; do so using the
function scipy.io.loadmat().
Use the function type() to print the datatype of mat to the IPython shell.
# Import package
import scipy.io
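A sketch of the completed calls:

# Load the .mat file into a dictionary-like object: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype of mat
print(type(mat))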
The file 'albeck_gene_expression.mat' is already loaded into the variable mat. The
following libraries have already been imported as follows:
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
Once again, this file contains gene expression data from the Albeck Lab at UCDavis.
You can find the data and some great documentation here.
Instructions
100 XP
Use the method .keys() on the dictionary mat to print the keys. Most of these keys
(in fact the ones that do NOT begin and end with '__') are variables from the
corresponding MATLAB environment.
Print the type of the value corresponding to the key 'CYratioCyt' in mat. Recall
that mat['CYratioCyt'] accesses the value.
Print the shape of the value corresponding to the key 'CYratioCyt' using the numpy
function shape().
Execute the entire script to see some oscillatory gene expression data!
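With mat and the imports above already in place, a sketch of those steps:

# Print the keys of the MATLAB dictionary
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))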
engine = create_engine('sqlite:///Northwind.sqlite')
Here, 'sqlite:///Northwind.sqlite' is called the connection string to the SQLite
database Northwind.sqlite. A little bit of background on the Chinook database: the
Chinook database contains information about a semi-fictional digital media store in
which media data is real and customer, employee and sales data has been manually
created.
The name of this sample database was based on the Northwind database. Chinooks are
winds in the interior West of North America, where the Canadian Prairies and Great
Plains meet various mountain ranges. Chinooks are most prevalent over southern
Alberta in Canada. Chinook is a good name choice for a database that intends to be
an alternative to Northwind.
To this end, you'll save the table names to a list using the method table_names()
on the engine and then you will print the list.
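A sketch of those steps, using the table_names() method named above (in newer SQLAlchemy releases this method has been replaced by the inspection API):

# Import create_engine and create the engine
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Save the table names to a list: table_names
table_names = engine.table_names()

# Print the table names to the shell
print(table_names)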
Instructions
100 XP
Open the engine connection as con using the method connect() on the engine.
Execute the query that selects ALL columns from the Album table. Store the results
in rs.
Store all of your query results in the DataFrame df by applying the fetchall()
method to the results rs.
Close the connection!
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Perform query: rs
rs = con.execute('SELECT * FROM Album ')
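Put together, the full flow that the fragment above belongs to looks roughly like this (the Album table lives in the Chinook database):

# Create the engine and open a connection: con
engine = create_engine('sqlite:///Chinook.sqlite')
con = engine.connect()

# Execute the query: rs
rs = con.execute('SELECT * FROM Album')

# Fetch all results into a DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close the connection
con.close()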
engine = create_engine('sqlite:///Northwind.sqlite')
engine = create_engine('sqlite:///Chinook.sqlite')
The engine connection is already open with the statement con = engine.connect().
There are a couple more standard SQL query chops that will aid you in your journey
to becoming an SQL ninja.
Let's say, for example that you wanted to get all records from the Customer table
of the Chinook database for which the Country is 'Canada'. You can do this very
easily in SQL using a SELECT statement followed by a WHERE clause as follows:
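SELECT * FROM Customer WHERE Country = 'Canada'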
In this interactive exercise, you'll select all records of the Employee table for
which 'EmployeeId' is greater than or equal to 6.
In this interactive exercise, you'll select all records of the Employee table and
order them in increasing order by the column BirthDate.
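For reference, the corresponding SQL for those two exercises is:

SELECT * FROM Employee WHERE EmployeeId >= 6
SELECT * FROM Employee ORDER BY BirthDate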
You'll first import pandas and create the SQLite 'Chinook.sqlite' engine. Then
you'll query the database to select all records from the Album table.
# Import packages
from sqlalchemy import create_engine
import pandas as pd
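pandas can collapse the query-and-fetch steps into a single line with read_sql_query(); a sketch:

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Query the Album table directly into a DataFrame: df
df = pd.read_sql_query('SELECT * FROM Album', engine)

# Print the head of the DataFrame
print(df.head())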
'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
After you import it, you'll check your working directory to confirm that it is
there and then you'll load it into a pandas DataFrame.
Instructions
100 XP
Import the function urlretrieve from the subpackage urllib.request.
Assign the URL of the file to the variable url.
Use the function urlretrieve() to save the file locally as 'winequality-red.csv'.
Execute the remaining code to load 'winequality-red.csv' in a pandas DataFrame and
to print its head to the shell.
# Import package
from urllib.request import urlretrieve
import pandas as pd
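A sketch of the completed steps, using the URL given above (the ';' separator is an assumption about this particular dataset):

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save the file locally
urlretrieve(url, 'winequality-red.csv')

# Read the file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())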
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
Your job is to use pd.read_excel() to read in all of its sheets, print the sheet
names and then print the head of the first sheet using its name, not its index.
Note that the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.
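A sketch of that pattern ('some_spreadsheet.xlsx' is a stand-in for the actual file or URL; sheet_name=None tells read_excel() to load every sheet):

# Import pandas
import pandas as pd

# Read in all sheets at once as a dict of DataFrames: xl
xl = pd.read_excel('some_spreadsheet.xlsx', sheet_name=None)

# Print the sheet names (the dictionary keys)
print(xl.keys())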
# Import package
import pandas as pd
# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
In the next exercise, you'll extract the HTML itself. Right now, however, you are
going to package and send the request and then catch the response.
Instructions
100 XP
Import the functions urlopen and Request from the subpackage urllib.request.
Package the request to the url "http://www.datacamp.com/teach/documentation" using
the function Request() and assign it to request.
Send the request and catch the response in the variable response with the function
urlopen().
Run the rest of the code to see the datatype of response and to close the
connection!
# Import packages
from urllib.request import urlopen
from urllib.request import Request
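A sketch of the completed steps:

# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# Package the request: request
request = Request(url)

# Send the request and catch the response: response
response = urlopen(request)

# Print the datatype of response and close the connection
print(type(response))
response.close()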
Well, as it came from an HTML page, you could read it to extract the HTML and, in
fact, such an http.client.HTTPResponse object has an associated read() method. In
this exercise, you'll build on your previous great work to extract the response and
print the HTML.
Instructions
100 XP
Send the request and catch the response in the variable response with the function
urlopen(), as in the previous exercise.
Extract the response using the read() method and store the result in the variable
html.
Print the string html.
Hit submit to perform all of the above and to close the response: be tidy!
# Import packages
from urllib.request import urlopen, Request
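A sketch of the completed steps, building on the previous exercise:

# Specify the url, package the request and catch the response
url = "http://www.datacamp.com/teach/documentation"
request = Request(url)
response = urlopen(request)

# Extract the response and print the HTML
html = response.read()
print(html)

# Be tidy and close the response
response.close()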
Note that unlike in the previous exercises using urllib, you don't have to close
the connection when using requests!
# Import package
import requests
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Import packages
import requests
from bs4 import BeautifulSoup
# Package the request, send the request and catch the response: r
r = requests.get(url)
Instructions
100 XP
In the sample code, the HTML response object html_doc has already been created:
your first task is to Soupify it using the function BeautifulSoup() and to assign
the resulting soup to the variable soup.
Extract the title from the HTML soup soup using the attribute title and assign the
result to guido_title.
Print the title of Guido's webpage to the shell using the print() function.
Extract the text from the HTML soup soup using the method get_text() and assign to
guido_text.
Hit submit to print the text from Guido's webpage to the shell.
# Import packages
import requests
from bs4 import BeautifulSoup
# Package the request, send the request and catch the response: r
r = requests.get(url)
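A sketch of the remaining steps (here url is Guido's webpage, 'https://www.python.org/~guido/'):

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Get the title of Guido's webpage: guido_title
guido_title = soup.title
print(guido_title)

# Get the text of the webpage: guido_text
guido_text = soup.get_text()
print(guido_text)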
Instructions
100 XP
Use the method find_all() to find all hyperlinks in soup, remembering that
hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle
brackets; store the result in the variable a_tags.
The variable a_tags is a results set: your job now is to enumerate over it, using a
for loop and to print the actual URLs of the hyperlinks; to do this, for every
element link in a_tags, you want to print() link.get('href').
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text
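A sketch of the remaining steps:

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))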
Load the JSON 'a_movie.json' into the variable json_data within the context
provided by the with statement. To do so, use the function json.load() within the
context manager.
Use a for loop to print all key-value pairs in the dictionary json_data. Recall
that you can access a value in a dictionary using the syntax: dictionary[key].
import json
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)
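The key-value pairs can then be printed with a simple for loop:

# Print each key-value pair in json_data
for key in json_data.keys():
    print(key + ': ', json_data[key])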
Print the values corresponding to the keys 'Title' and 'Year' and answer the
following question about the movie that the JSON describes:
import json
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)
print(json_data.keys())
print(json_data.values())
print(json_data.keys(), json_data.values())
print(json_data['Title'])
API requests
Now it's your turn to pull some movie data down from the Open Movie Database (OMDB)
using their API. The movie you'll query the API about is The Social Network. Recall
that, in the video, to query the API about the movie Hackers, Hugo's query string
was 'http://www.omdbapi.com/?t=hackers' and had a single argument t=hackers.
Note: recently, OMDB has changed their API: you now also have to specify an API
key. This means you'll have to add another argument to the URL: apikey=72bc447a.
# Package the request, send the request and catch the response: r
r = requests.get(url)
Pass the variable url to the requests.get() function in order to send the relevant
request and catch the response, assigning the resultant response message to the
variable r.
Apply the json() method to the response object r and store the resulting dictionary
in the variable json_data.
Hit Submit Answer to print the key-value pairs of the dictionary json_data to the
shell.
# Import package
import requests
# Package the request, send the request and catch the response: r
r = requests.get(url)
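Here url is the OMDB query string described above, e.g. 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network' (the exact title formatting is an assumption). Decoding and printing the JSON then looks like this:

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])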
The URL that requests the relevant query from the Wikipedia API is
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza
# Import package
import requests
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'
# Package the request, send the request and catch the response: r
r = requests.get(url)
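Decoding works the same way; assuming the standard MediaWiki response layout (the numeric page id varies from page to page), the extract can be pulled out like this:

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# The extract sits under json_data['query']['pages'][<page id>]['extract']
page = next(iter(json_data['query']['pages'].values()))
print(page['extract'])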
API Authentication
The package tweepy is great at handling all the Twitter API OAuth Authentication
details for you. All you need to do is pass it your authentication credentials. In
this interactive exercise, we have created some mock authentication credentials (if
you wanted to replicate this at home, you would need to create a Twitter App as
Hugo detailed in the video). Your task is to pass these credentials to tweepy's
OAuth handler.
# Import package
import tweepy
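A sketch of what passing the credentials looks like (the strings here are placeholders for the mock credentials):

# Store OAuth authentication credentials (placeholders)
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

# Pass the credentials to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)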
Streaming tweets
Now that you have set up your authentication credentials, it is time to stream some
tweets! We have already defined the tweet stream listener class, MyStreamListener,
just as Hugo did in the introductory video. You can find the code for the tweet
stream listener class here.
# Import package
import json
Instructions
70 XP
Assign the filename 'tweets.txt' to the variable tweets_data_path.
Initialize tweets_data as an empty list to store the tweets in.
Within the for loop initiated by for line in tweets_file:, load each tweet into a
variable, tweet, using json.loads(), then append tweet to tweets_data using the
append() method.
Hit submit and check out the keys of the first tweet dictionary printed to the
shell.
# Import package
import json
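A sketch of the completed steps:

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets line by line and store them in the list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dictionary
print(tweets_data[0].keys())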
Instructions
100 XP
Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so,
the first argument should be tweets_data, a list of dictionaries. The second
argument to pd.DataFrame() is a list of the keys you wish to have as columns.
Assign the result of the pd.DataFrame() call to df.
Print the head of the DataFrame.
# Import package
import pandas as pd
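A sketch of the DataFrame construction (assuming the columns of interest are the standard Twitter keys 'text' and 'lang'):

# Build DataFrame of tweet texts and languages: df
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print the head of the DataFrame
print(df.head())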
import re

# Return True if word occurs in text, case-insensitively (signature reconstructed from context)
def word_in_text(word, text):
    match = re.search(word.lower(), text.lower())
    if match:
        return True
    return False
You're going to iterate over the rows of the DataFrame and calculate how many tweets contain each of our keywords! The counter for each candidate (clinton, trump, sanders and cruz) has been initialized to 0.
Instructions
100 XP
Within the for loop for index, row in df.iterrows():, the code currently increases
the value of clinton by 1 each time a tweet (text row) mentioning 'Clinton' is
encountered; complete the code so that the same happens for trump, sanders and
cruz.
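A sketch of the completed loop, using the word_in_text() helper defined above:

# Iterate through df, counting mentions of each candidate
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])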
Import both matplotlib.pyplot and seaborn using the aliases plt and sns,
respectively.
Complete the arguments of sns.barplot: the first argument should be the labels to
appear on the x-axis; the second argument should be the list of the variables you
wish to plot, as produced in the previous exercise.
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns
# Plot bar chart of the mention counts (cd is the list of candidate labels)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()