Lecture Notes: Python for Data Science
NumPy is a library used in scientific computing and data analysis. It stands for ‘Numerical
Python’. The most basic object in NumPy is ndarray or array, which is an n-dimensional
array.
Now, to calculate the element-wise product of the two lists, we will use a lambda function as follows:
list_1 = [3, 6, 7, 5]
list_2 = [4, 5, 1, 7]
product_list = list(map(lambda x, y: x*y, list_1, list_2))
print(product_list)
Output:
[12, 30, 7, 35]
# The numpy array way to do it: simply multiply the two arrays
import numpy as np
array_1 = np.array(list_1)
array_2 = np.array(list_2)
array_3 = array_1 * array_2
print(array_3)
Output:
[12 30  7 35]
# Square a list
list_1 = [3, 6, 7, 5]
list_squared = [i**2 for i in list_1]
# Square a numpy array
array_squared = np.array(list_1)**2
print(list_squared)
print(array_squared)
Output:
[9, 36, 49, 25]
[ 9 36 49 25]
From the illustration provided above, we can conclude that squaring a NumPy array is easier than squaring a list: the list requires a loop (or a list comprehension), whereas the NumPy array can be squared directly without any loop.
There are multiple ways to create a NumPy array. The most common way is to use np.array, which is shown in the code snippet given below.
array_from_list = np.array([2, 5, 6, 7])
array_from_tuple = np.array((4, 5, 8, 9))
print(array_from_list)
print(array_from_tuple)
Output:
[2 5 6 7]
[4 5 8 9]
The second most common way to create a NumPy array is to initialise an array. You can do this only when you know the size of the array beforehand.
The code snippet given below shows how to initialise a 5x3 array of ones.
# Notice that, by default, numpy creates data type = float64
# Can provide dtype explicitly using dtype
np.ones((5, 3), dtype = int)
The code snippet given below shows how to initialise an array of four zeros.
# Creating array of zeros
np.zeros(4, dtype = int)
Output:
array([0, 0, 0, 0])
The code snippet given below shows how to initialise a 3x4 matrix of random numbers.
# Array of random numbers
np.random.random([3, 4])
Output:
array([[0.21067268, 0.20062275, 0.9481293 , 0.37904343],
       [0.28643457, 0.26614814, 0.43219753, 0.63020881],
       [0.36568786, 0.37602622, 0.85852183, 0.29602912]])
There is one more initialisation function, np.arange(), which is equivalent to the range()
function.
The code to create an array of numbers from 10 to 100 (exclusive) with a step size of 5 is shown in the code snippet given below.
# From 10 to 100 with a step of 5
numbers = np.arange(10, 100, 5)
print(numbers)
Output:
[10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95]
Sometimes, you only know the length of the array and not the step size; in that case, you can use the np.linspace() function to create the array.
# np.linspace()
# Sometimes, you know the length of the array, not the step size
Output:
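A possible illustration (the start, stop and number of points below are assumptions):
# Create an array of 25 equally spaced numbers between 10 and 100
np.linspace(10, 100, 25)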
A few more NumPy functions that you can use to create special NumPy arrays are as
follows:
● np.full(): This function is used to create a constant array of any number ‘n’.
Output:
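A possible example of np.full() (the length and the constant value below are illustrative):
# Create an array of length 4 filled with the constant 7
np.full(4, 7)
Output:
array([7, 7, 7, 7])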
● np.tile(): This function is used to create a new array by repeating an existing array a particular number of times.
Output:
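A possible example of np.tile() (the array and the repetition count below are illustrative):
# Repeat the array [0, 1, 2] three times
np.tile([0, 1, 2], 3)
Output:
array([0, 1, 2, 0, 1, 2, 0, 1, 2])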
● np.eye(): This function is used to create an identity matrix of a specified size; a 3x3 identity matrix is shown in the output below.
Output:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
Output:
It is recommended to inspect NumPy arrays, especially while working with large arrays. Some attributes of NumPy arrays are as follows:
● shape: the shape (number of rows and columns) of the array
● dtype: the data type of the array's elements
● ndim: the number of dimensions (axes) of the array
● itemsize: the memory used (in bytes) by each element of the array
These attributes are inspected in the code snippet given below.
# Initialising a random 1000 x 300 array
rand_array = np.random.random((1000, 300))
# Inspecting shape, dtype, ndim and itemsize
print("Shape: {}".format(rand_array.shape))
print("dtype: {}".format(rand_array.dtype))
print("Dimensions: {}".format(rand_array.ndim))
print("Item size: {}".format(rand_array.itemsize))
Output:
With the help of the illustration provided above, we understood what these attributes represent: shape gives the number of rows and columns of the array, dtype gives the data type, ndim gives the number of dimensions, and itemsize gives the memory (in bytes) used by each element of the array.
-> For one-dimensional arrays, indexing, slicing, etc. are similar to Python lists, where
indexing starts at 0.
Some of the examples of slicing are discussed in the following code snippet.
array_1d = np.arange(10)
# Third element
print(array_1d[2])
# Specific elements
# Notice that array_1d[2, 5, 6] will throw an error; you need to provide
# the indices as a list
print(array_1d[[2, 5, 6]])
# Slicing: third element onwards, first three elements, third to seventh, and alternate elements
print(array_1d[2:])
print(array_1d[:3])
print(array_1d[2:7])
print(array_1d[0::2])
Output:
2
[2 5 6]
[2 3 4 5 6 7 8 9]
[0 1 2]
[2 3 4 5 6]
[0 2 4 6 8]
With this, you learnt how to index and slice a NumPy array: finding the element at a particular index and extracting the elements between given start and end indices.
-> Multidimensional arrays are indexed using as many indices as the number of dimensions or
axes. For instance, to index a 2-D array, you need two indices: array[x, y].
Some slicing of multidimensional arrays in Python is shown in the code snippet given below.
# Creating a 2-D array
array_2d = np.array([[2, 5, 7, 5], [4, 6, 8, 10], [10, 12, 15, 19]])
# Third row second column
print(array_2d[2, 1])
# Slicing the second row, and all columns
# Notice that the resultant is itself a 1-D array
print(array_2d[1, :])
# Slicing all rows and the third column
print(array_2d[:, 2])
# Slicing all rows and the first three columns
print(array_2d[:, :3])
Output:
12
[ 4 6 8 10]
[ 7 8 15]
[[ 2 5 7]
[ 4 6 8]
[10 12 15]]
Iterating over 2-D arrays is done with respect to the first axis, i.e., the rows (the second axis is the columns).
Output:
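A minimal sketch of iterating over the 2-D array array_2d defined above:
# Iterating over a 2-D array returns one row (a 1-D array) per iteration
for row in array_2d:
    print(row)
Output:
[2 5 7 5]
[ 4  6  8 10]
[10 12 15 19]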
You learnt that the key advantages of NumPy are convenience and speed of computation.
You will often work with extremely large data sets; thus it is important for you to understand how
much computation time (and memory) you can save using NumPy as compared with standard
Python lists.
Now, let's compare the computation times of arrays and lists for calculating the element-wise product of numbers, as shown in the code snippet given below.
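Note that, for this comparison to show a measurable difference, list_1 and list_2 are assumed here to be re-created as large lists, for example:
# Creating two large lists (illustrative sizes of one million elements each)
list_1 = [i for i in range(1000000)]
list_2 = [j**2 for j in range(1000000)]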
# list multiplication
import time
# store start time, time after computation, and take the difference
t0 = time.time()
product_list = list(map(lambda x, y: x*y, list_1, list_2))
t1 = time.time()
list_time = t1 - t0
print(t1-t0)
# numpy array
array_1 = np.array(list_1)
array_2 = np.array(list_2)
t0 = time.time()
array_3 = array_1*array_2
t1 = time.time()
numpy_time = t1 - t0
print(t1-t0)
print("The ratio of time taken is {}".format(list_time/numpy_time))
Output:
0.13900446891784668
0.005043983459472656
The ratio of time taken is 27.558470410285498
➔ Reshaping Arrays
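Reshaping is done using the reshape() method, as long as the total number of elements remains unchanged. For example, a 1-D array of 12 elements can be reshaped into a 3x4 array, which matches the output shown below.
# Reshape a 1-D array of 12 elements into 3 rows and 4 columns
np.arange(12).reshape(3, 4)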
Output:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Stacking is done using the np.hstack() and np.vstack() methods. For horizontal stacking,
the number of rows should be the same, whereas for vertical stacking, the number of
columns should be the same.
We will see an example of using the np.vstack() method in the code snippet given below:
# Creating two arrays
array_1 = np.arange(12).reshape(3, 4)
array_2 = np.arange(20).reshape(5, 4)
print(array_1)
print("\n")
print(array_2)
# vstack
# Note that np.vstack(a, b) throws an error - you need to pass the
# arrays as a list
np.vstack((array_1, array_2))
Output:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
Operations on Arrays
NumPy provides almost all the basic math functions, such as exp, sin, cos, log and sqrt. These functions are applied element-wise to an array.
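A small sketch of applying some of these functions element-wise (the input array below is illustrative):
a = np.arange(1, 5)
# Each function is applied to every element of a
print(np.sqrt(a))
print(np.exp(a))
print(np.log(a))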
➔ User-Defined Functions
You can also apply your own functions on arrays. For example, you can apply the x/(x+1)
function to each element of an array.
One way to do this is by looping through the array, which is the non-NumPy way. If you want to write vectorised code, you can vectorise the function that you want and then apply it to the array. NumPy provides the np.vectorize() method to vectorise functions. Let's take a look at both the ways.
Output:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
np.vectorize() also gives you the advantage of vectorising the function once and then applying it as many times as needed.
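A minimal sketch of both approaches for the x/(x+1) function (the input array b below is an assumption):
b = np.arange(1, 20)
# The non-numpy way: loop through the elements
print([x / (x + 1) for x in b])
# The numpy way: vectorise the function once, then apply it to the whole array
f = np.vectorize(lambda x: x / (x + 1))
print(f(b))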
NumPy provides the np.linalg package to apply common linear algebra operations, such as computing the inverse (np.linalg.inv()), the determinant (np.linalg.det()) and the eigenvalues and eigenvectors (np.linalg.eig()) of a matrix.
Now, we will apply what we have learnt about linear algebra operations, as shown in the code below.
# Creating arrays
a = np.arange(1, 10).reshape(3, 3)
Output:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
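The determinant of a can then be computed using np.linalg.det(); since the rows of a are linearly dependent, the result is zero up to floating-point error.
np.linalg.det(a)
Output: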
-9.51619735392994e-16
Pandas is a library built specifically for data analysis, and it is built on top of NumPy. You will be using Pandas extensively for performing data manipulation, visualisation, building machine learning models, etc.
The two main data structures in Pandas are Series and Dataframes. The default way to store
data is by using data frames; thus, manipulating data frames quickly is probably the most
important skill for data analysis.
Pandas Series: A series is similar to a 1-D NumPy array, and contains scalar values of the
same type (numeric, character, date/time, etc.). A data frame is simply a table, where each
column is a Pandas series.
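A small illustration of creating a series from a list, matching the output shown below:
import pandas as pd
# Creating a pandas series of integers
s = pd.Series([2, 4, 5, 6, 9])
print(s)
print(type(s))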
Output:
0 2
1 4
2 5
3 6
4 9
dtype: int64
<class 'pandas.core.series.Series'>
Output:
2 5
3 6
4 9
dtype: int64
You might have noticed that while creating a series, Pandas automatically indexes it from 0
to (n-1), where n is the number of rows. However, you can also explicitly set the index
yourself using the ‘index’ argument while creating the series using pd.Series().
# You can also give the index as a sequence or use functions to specify
# the index
# But always make sure that the number of elements in the index list is
# equal to the number of elements specified in the series
pd.Series(np.array(range(0,10))**2, index = range(0,10))
Output:
0 0
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
dtype: int32
Dataframe is the most widely used data structure in data analysis. It is a table with rows and
columns, where rows have an index and columns have meaningful names.
Various ways of creating data frames include using dictionaries, JSON objects and
CSV files, and reading from text files.
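For instance, a minimal sketch of creating a data frame from a dictionary (the column names and values below are illustrative):
import pandas as pd
# Each key becomes a column name and each list becomes that column's values
df = pd.DataFrame({'Name': ['Aman', 'Joy', 'Rashmi', 'Saif'],
                   'Age': [34, 31, 22, 33]})
print(df)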
You can sort data frames in two ways:
1) By indices
2) By values
Output:
Output:
Selecting rows in data frames is similar to indexing in NumPy arrays, which you looked at earlier. The syntax df[start_index:end_index] will subset rows according to the start and end indices.
Selecting the rows from indices 2 to 6 is shown in the code snippet below.
import numpy as np
import pandas as pd
market_df = pd.read_csv("../global_sales_data/market_fact.csv")
# Selecting the rows from indices 2 to 6
market_df[2:7]
Output:
➔ Selecting Columns
The two simple ways to select a single column from a dataframe are as follows:
● df['column_name']
● df.column_name
# Using df['column']
sales = market_df['Sales']
sales.head()
You can select multiple columns by passing the list of column names inside
the [ ]: df[['column_1', 'column_2', 'column_n']].
# Select Cust_id, Sales and Profit:
market_df[['Cust_id', 'Sales', 'Profit']].head()
Output:
market_df = pd.read_csv("./global_sales_data/market_fact.csv")
Output:
You may also have dataframes with the same rows but different columns (and no
common columns). In this case, you may want to concatenate them side by side.
We will concatenate the two data frames using the method shown in the snippet below.
df1 = pd.DataFrame({'Name': ['Aman', 'Joy', 'Rashmi', 'Saif'],
                    'Age': ['34', '31', '22', '33'],
                    'Gender': ['M', 'M', 'F', 'M']})

df2 = pd.DataFrame({'School': ['RK Public', 'JSP', 'Carmel Convent', 'St. Paul'],
                    'Graduation Marks': ['84', '89', '76', '91']})

# To join the two dataframes, use axis = 1 to indicate
# joining along the columns axis
# The join is possible because the corresponding rows have
# the same indices
pd.concat([df1, df2], axis = 1)
Output:
Grouping and aggregation are some of the most frequently used operations in data analysis,
especially while performing exploratory data analysis (EDA), where comparing summary
statistics across groups of data is common.
For example, in the retail sales data that you are working with, you may want to compare
the average sales of various regions, or the total profit of two customer segments.
1. Splitting the data into groups (For example, groups of customer segments and
product categories)
2. Applying a function to each group (For example, mean or total sales of each
customer segment)
3. Combining the results into a data structure showing the summary statistics
First, we will merge all the Dataframes, so that we have all the data in one master_df as shown
in the code snippet below:
Output:
Output:
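The grouping step itself can be sketched as follows (a minimal sketch, assuming the merged master_df contains a Customer_Segment column):
# Step 1: Group the master dataframe by customer segment
df_by_segment = master_df.groupby('Customer_Segment')
# Step 2: Apply the sum() function on the Profit column of each group
df_by_segment['Profit'].sum()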
# Converting to a df
pd.DataFrame(df_by_segment['Profit'].sum())
# Let's go through some more examples
# E.g.: Which product categories are the least profitable?
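# A possible sketch (assuming master_df contains Product_Category and Profit columns):
master_df.groupby('Product_Category')['Profit'].sum().sort_values()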
Lambda Functions
Output:
Output:
Pivot Tables
You can use Pandas pivot tables as an alternative to groupby(). They provide Excel-like
functionalities to create aggregate tables.
The general syntax is pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean', ...)
where,
● ‘data’ is the data frame,
● ‘values’ contain the column to aggregate within the dataset,
● ‘index’ is the row in the pivot table,
● ‘columns’ contain the columns that you want in the pivot table, and
● ‘aggfunc’ is the aggregate function.
We will now compute the mean of all the numeric columns across product categories, as shown in the code snippet below.
# Computes the mean of all numeric columns across categories
# Notice that the means of Order_IDs are meaningless
master_df.pivot_table(columns = 'Product_Category')
Output:
Output:
There are many libraries to connect MySQL and Python, such as PyMySQL and MySQLdb. All
of these libraries follow the procedure mentioned below to connect to MySQL:
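The typical procedure can be sketched as follows using PyMySQL (the host, user, password, database and table names below are placeholders):
import pymysql

# 1. Create a connection object between MySQL and Python
conn = pymysql.connect(host="localhost", user="root", password="your_password", database="your_db")
# 2. Create a cursor object to execute queries
cursor = conn.cursor()
# 3. Execute the query and fetch the results into Python
cursor.execute("SELECT * FROM your_table")
results = cursor.fetchall()
# 4. Close the connection
conn.close()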
Web scraping refers to the art of programmatically getting data from the internet. One of the best
features of Python is that it makes it easy to scrape websites.
In Python 3, the most popular library for web scraping is BeautifulSoup. To use BeautifulSoup,
you need the requests module, which connects to a given URL and fetches data from it (in HTML
format). A web page is HTML code, and the main use of BeautifulSoup is that it helps you parse
HTML easily.
Note: The discussion on HTML syntax is beyond the scope of this module. However, an
extremely basic HTML experience should be enough to understand web scraping.
- Use Case - Fetching Mobile App Reviews From Google Play Store
Suppose you want to understand why people install and uninstall mobile apps, and why they like
or dislike certain apps. An extremely rich source of app reviews data is Google Play Store, where
people write their feedback about an app.
We will scrape the reviews of the Facebook Messenger app, i.e., get them into Python, and then,
you can perform some interesting analyses on the same.
To use BeautifulSoup, you need to install it using pip install beautifulsoup4, and load the module
bs4 using import bs4. You also need to install the requests module using pip install requests.
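A minimal sketch of the general requests + BeautifulSoup workflow is given below; the URL is the Play Store page of Facebook Messenger, while the tag and class names used to locate reviews are placeholders, since the actual page structure may differ:
import requests
import bs4

url = "https://play.google.com/store/apps/details?id=com.facebook.orca"
# Connect to the URL and fetch the page's HTML
response = requests.get(url)
# Parse the HTML
soup = bs4.BeautifulSoup(response.text, "html.parser")
# Locate the review elements (placeholder tag/class)
for review in soup.find_all("div", class_="review-text"):
    print(review.get_text())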
The code for web scraping reviews from Google Play Store is shown in the code snippet below:
print(name_final + "," + date_final.replace(",", "") + "," + review_final.replace(",", "") + "\n")
f.write(name_final + "," + date_final.replace(",", "") + "," + review_final.replace(",", "") + "\n")
f.close()
Apart from being rich sources of data, the other reasons to use APIs are as follows:
● When the data is getting updated in real-time: If you use downloaded CSV files, then
you have to download data manually and update your analysis multiple times.
However, through APIs, you can automate the process of getting real-time data.
● Easy access to structured and verified data: Even though websites can be scraped
manually, APIs can directly provide good-quality data in a structured format.
● Access to restricted data: You cannot scrape all websites easily, as web scraping is often prohibited by a website's terms of use (for example, Facebook, financial data, etc.). In such cases, APIs are often the only way to access the data.
https://maps.googleapis.com/maps/api/geocode/json?address=UpGrad,+Nishuvi+building,+
Anne+Besant+Road,+Worli,+Mumbai&key=YOUR_API_KEY
Now, we will find out the latitude and longitude of upGrad's Mumbai office location; the code for this is shown below:
api_key = "AIzaSyBXrK8md7uaOcpRpaluEGZAtdXS4pcI5xo"
url = "https://maps.googleapis.com/maps/api/geocode/json?address={0}&key={1}".format(address, api_key)
r = requests.get(url)
# Convert the JSON response into a dictionary and extract the coordinates
r_dict = json.loads(r.text)
lat = r_dict['results'][0]['geometry']['location']['lat']
lng = r_dict['results'][0]['geometry']['location']['lng']
print((lat, lng))
# Input to the fn: Address in standard human-readable form
api_key = "AIzaSyBXrK8md7uaOcpRpaluEGZAtdXS4pcI5xo"

def address_to_latlong(address):
    # convert address to the form x+y+z
    split_address = address.split(" ")
    address = "+".join(split_address)
To summarise, the steps involved in the procedure for getting lat/long coordinates from an
address are as follows:
● Convert the address into a suitable format and connect to the Google Maps URL
using your key.
● Get a response from the API and convert it into a dictionary using json.loads(r.text).
● Extract the latitude and longitude from the resulting dictionary.
Reading PDF files is not as straightforward as reading text or delimited files, since PDFs often
contain images, tables, etc. PDFs are mainly designed to be human readable, and thus, you
need special libraries to read them in Python (or any other programming language).
There are some excellent libraries in Python. We will use PyPDF2 to read PDFs in Python,
since it is easy to use and works with most types of PDFs.
Note that Python will only be able to read text from PDFs, and not from images, tables, etc.
(which is possible using other specialised libraries).
For this illustration, we will read a PDF of the book Animal Farm written by George Orwell.
import PyPDF2
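A minimal sketch of reading the PDF (the file name is an assumption; newer versions of PyPDF2 use PdfReader and extract_text() instead of the calls shown here):
import PyPDF2

# Open the PDF in binary mode and create a reader object
pdf_object = open('animal_farm.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_object)
# Number of pages in the document
print(pdf_reader.numPages)
# Extract and print the text of the first page
page_1 = pdf_reader.getPage(0)
print(page_1.extractText())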
There are various reasons for missing data, such as human errors during data entry and non-availability of information at the user's end (for instance, the DOB of certain people). Most often, the exact reasons are simply unknown.
The four main methods to identify and treat missing data are as follows:
● isnull(): This method indicates the presence of missing values and returns a boolean.
● notnull(): This method is opposite of the isnull() method and returns a boolean.
● dropna(): This method drops the missing values from a data frame and returns the
remaining values.
● fillna(): This method fills (or imputes) the missing values with a specified value.
To start working with missing data, we will read the CSV file after importing the Pandas and NumPy libraries.
import numpy as np
import pandas as pd
df = pd.read_csv("melbourne.csv")
print(df.shape)
print(df.info())
Output:
The isnull() and notnull() methods are the most common ways of identifying missing values.
While handling missing data, you first need to identify the rows and columns containing missing
values, count the number of missing values, and then decide how you want to treat them.
You must treat missing values in each column separately, rather than implementing a single
solution (for example, replacing NaNs by the mean of a column) for all columns.
# isnull()
df.isnull()
Output:
Output:
Output:
Observe that the columns have around 22%, 19%, 26%, 57%, etc., of missing values. When
dealing with a column, you have two simple choices: either delete or retain the column. If
you retain the column, then you have to treat (i.e., delete or impute) the rows having
missing values.
If you delete the missing rows, then you lose data. And if you impute, then you introduce
bias.
Apart from the number of missing values, the decision to delete or retain a variable depends on various other factors, as the following example illustrates.
Suppose we want to build a (linear regression) model to predict the house prices in
Melbourne. Now, even though the variable Price has about 22% of missing values, you
cannot drop the variable, since that is what you want to predict.
Similarly, you would expect some other variables such as Bedroom2, Bathroom and Landsize
to be the important predictors of Price, and thus, you cannot remove those columns.
round(100*(df.isnull().sum()/len(df.index)), 2)
Output:
Now, we need to either delete or impute the missing values. First, let's see whether or not any
rows have a significant number of missing values. If so, we can drop those rows, and then decide
to delete or impute the rest.
After dropping three columns, we now have 18 columns to work with. To inspect rows with
missing values, let's take a look at the rows having more than five missing values.
Observe that we now have removed most of the rows where multiple columns (Bedroom2,
Bathroom, Landsize) were missing.
Now, we still have around 21% of missing values in the Price column and 9% in the Landsize
column. Since the Price column still contains a lot of missing data (and imputing 21% of values of
a variable you want to predict will introduce heavy bias), it is not sensible to impute those values.
Thus, let's remove the missing rows from the Price column as well. Notice that you can use
np.isnan(df['column']) to filter out the corresponding rows, and then use a ~ to discard the values
satisfying the condition.
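A minimal sketch of this filtering step:
# Retain only the rows where Price is not NaN
df = df[~np.isnan(df['Price'])]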
round(100*(df.isnull().sum()/len(df.index)), 2)
Output:
The decision (whether and how to impute) will depend upon the distribution of the variable.
For example, if the variable is such that all the observations lie in a short range (say between
800 sq. ft to 820 sq.ft), then you can make a call to impute the missing values by something
similar to the mean or median Land Size.
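For instance, if imputation is judged appropriate, a possible sketch (imputing with the median of the column) is:
# Impute the missing Landsize values with the column median
df['Landsize'] = df['Landsize'].fillna(df['Landsize'].median())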
In summary, you should repeatedly check the dataset for missing or incorrect data, and if any is present, treat it (by dropping or imputing values) before performing further operations.