Data Analysis Using Python (Python For Beginners) - CloudxLab
reachus@cloudxlab.com
About CloudxLab
Content Playground
Feedback
CloudxLab - Online Cloud Based Lab
Sandeep Giri
Founder at CloudxLab.com | AI Advisor at Algoworks | Speaker - AI, Machine Learning, Deep Learning, Big Data
Amazon, InMobi, D.E.Shaw
18+ Years of Exp. in Enterprise Software, Machine Learning & Churning Humongous Data

Praveen Pavithran
CTO/Co-Founder at Yatis | IOT, ML, Computer Vision, Edge ML
Cypress Semiconductors, Philips
Multiple patents, conference papers
IIT Bombay Dual Degree

Abhinav Singh
Co-Founder, CloudxLab.com | AI, Big Data | Visiting Faculty at SCMHRD
Byjus, HashCube
9+ Years of Exp. in EdTech, Game Development & Building Product
What is Python
- Python is an interpreted, high-level language
- Created by Guido van Rossum and first released in 1991
- It is easy to use and improves engineer productivity
- Libraries for multiple applications
- Django framework for web applications
- We will focus on libraries for Data Analysis
Numpy
What is NumPy
● Open Source
● Module of Python
● Provides fast mathematical functions
[Figure: diagram of the Python data stack: Python, NumPy, pandas, matplotlib, scikit-learn, TensorFlow]
● Array-oriented computing
● Efficiently implemented multi-dimensional arrays
● Designed for scientific computation
● Library of high-level mathematical functions
Numpy - Introduction
Creating NumPy arrays
np.array - Creating a NumPy array from a Python list or tuple
np.zeros - An array with all zeros
np.ones - An array with all ones
np.full - An array filled with a given value
np.arange - Creating a sequence of numbers
np.linspace - Creating an array of evenly spaced numbers
np.random.rand - Creating an array with random numbers
>>> np.random.rand(2,3)
array([[ 0.55365951, 0.60150511, 0.36113117],
[ 0.5388662 , 0.06929014, 0.07908068]])
np.empty - Creating an uninitialized array (its contents are arbitrary, not zeros)
>>> np.empty((2,3))
array([[ 0.21288689, 0.20662218, 0.78018623],
[ 0.35294004, 0.07347101, 0.54552084]])
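As a rough illustration of the creation functions on these slides, the sketch below builds a few small arrays. The shapes, fill values, and variable names are arbitrary choices for the example and are not taken from the slides.

```python
import numpy as np

from_list = np.array([1, 2, 3])   # from a Python list
zeros = np.zeros((2, 3))          # 2x3 array of 0.0
ones = np.ones((2, 3))            # 2x3 array of 1.0
sevens = np.full((2, 2), 7)       # 2x2 array filled with 7
odds = np.arange(1, 10, 2)        # start, stop (exclusive), step
grid = np.linspace(0, 1, 5)       # 5 evenly spaced points, endpoints included
noise = np.random.rand(2, 3)      # uniform samples in [0, 1)
scratch = np.empty((2, 3))        # uninitialized: values are whatever was in memory

print(odds)   # [1 3 5 7 9]
print(grid)   # [0.   0.25 0.5  0.75 1.  ]
```

Note that np.empty only allocates memory, so its contents should be overwritten before use.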
Important attributes of a NumPy object
ndarray.ndim
the number of axes (dimensions) of the array.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
ndarray.shape
the dimensions of the array. This is a tuple of integers
indicating the size of the array in each dimension.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
For the above array the value of ndarray.shape is (2,3)
ndarray.size
the total number of elements of the array. This is equal to
the product of the elements of shape.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
ndarray.dtype
Tells the datatype of the elements in the numpy array. All
the elements in a numpy array have the same type.
>>> c = np.arange(1, 5)
>>> c.dtype
dtype('int64')
ndarray.itemsize
The itemsize attribute returns the size (in bytes) of each
item:
>>> c = np.arange(1, 5)
>>> c.itemsize
8
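The attributes above can be read off the 2x3 example array used on these slides. This is a minimal sketch; the variable name a is just for illustration.

```python
import numpy as np

a = np.array([[1., 0., 0.],
              [0., 1., 2.]])

print(a.ndim)      # 2  (two axes)
print(a.shape)     # (2, 3)
print(a.size)      # 6  (2 * 3)
print(a.dtype)     # float64
print(a.itemsize)  # 8  (bytes per float64 element)
```

The int64 shown for np.arange on the slides is the usual default on 64-bit Linux and macOS; the default integer type can differ on other platforms or NumPy versions.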
Reshaping Arrays
>>> a = np.arange(6)
>>> print(a)
[0 1 2 3 4 5]
>>> b = a.reshape(2, 3)
>>> print(b)
[[0 1 2]
 [3 4 5]]
Indexing and Accessing NumPy arrays
Indexing one dimensional NumPy Arrays
[Figure: a one-dimensional array with index positions 0 to 6]
Difference with regular Python arrays
3. If you want a copy of the data, use the copy method,
for example another_slice = a[2:6].copy().
If we then modify another_slice, a remains unchanged.
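A minimal sketch of the point about copies: a plain slice of a NumPy array is a view onto the same data, while .copy() gives independent data. The array contents here are arbitrary example values.

```python
import numpy as np

a = np.arange(10)

view = a[2:6]                  # a slice is a view, not a copy
view[0] = 100                  # this also changes a[2]

another_slice = a[2:6].copy()  # independent copy of the same elements
another_slice[1] = -1          # a[3] is not affected

print(a)  # [  0   1 100   3   4   5   6   7   8   9]
```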
Indexing multi dimensional NumPy arrays
Multi-dimensional arrays can be accessed as
>>> b[1, 2] # row 1, col 2
>>> b[1, :] # row 1, all columns
>>> b[:, 1] # all rows, column 1
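To make the three access patterns concrete, here is a small sketch, assuming b is the 2x3 array produced by the reshape example earlier in the deck.

```python
import numpy as np

b = np.arange(6).reshape(2, 3)  # rows: [0 1 2] and [3 4 5]

print(b[1, 2])   # 5        (row 1, column 2)
print(b[1, :])   # [3 4 5]  (row 1, all columns)
print(b[:, 1])   # [1 4]    (all rows, column 1)
```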
Boolean Indexing
>>> a = np.arange(12).reshape(3, 4)
>>> rows_on = np.array([ True, False, True])
>>> a[rows_on, :] # Rows 0 and 2, all columns
array([[ 0, 1, 2, 3],
[ 8, 9, 10, 11]])
Linear Algebra with NumPy
Vectors
[Figure: a velocity vector with components labeled 50 m/s, 10 m/s, and 5,000 m/s]
Use of Vectors in Machine Learning
● Vectors have many purposes in Machine Learning, most
notably to represent observations and predictions.
● For example, say we built a Machine Learning system to
classify videos into 3 categories (good, spam, clickbait) based
on what we know about them.
● For each video, we would have a vector representing what
we know about it, such as:
[Figure: an example vector for a video and a class_probabilities vector, with entries labeled by category (Good and Clickbait are shown)]
Representing Vectors in Python
Vectorized Operations
Matrix multiplication
1. Using for loop
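>>> def multiply_loops(A, B):
...     C = np.zeros((A.shape[0], B.shape[1]))
...     for i in range(A.shape[0]):          # rows of A
...         for j in range(B.shape[1]):      # columns of B
...             for k in range(A.shape[1]):  # shared inner dimension
...                 C[i, j] += A[i, k] * B[k, j]
...     return C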
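For comparison, the sketch below places a loop-based matrix product next to NumPy's vectorized one and checks that they agree. The loop function here is a standard triple-loop product written for this example, and the matrices A and B are small arbitrary inputs; no timings are claimed.

```python
import numpy as np

def multiply_loops(A, B):
    """Triple-loop matrix product, for comparison only."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6).reshape(2, 3)    # shape (2, 3)
B = np.arange(12).reshape(3, 4)   # shape (3, 4)

C_loops = multiply_loops(A, B)
C_vec = A @ B                     # same as np.dot(A, B) for 2D arrays

print(C_vec.shape)                      # (2, 4)
print(np.array_equal(C_loops, C_vec))   # True
```

The vectorized form runs the loops in compiled code inside NumPy, which is why it is usually much faster for large matrices; it can be timed with the timeit module.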
[Figure: timeit results for matrix multiplication with loops versus vectorized matrix multiplication]
Basic Operations on NumPy arrays
● Addition
● Subtraction
● Element-wise product
● Matrix product
● Division
● Integer division
● Modulus
● Exponents
Conditional Operators on NumPy arrays
>>> m < 25
array([ True, True, False, False], dtype=bool)
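The arithmetic and comparison operators all work element by element on arrays of the same shape. The following is a sketch with made-up arrays a, b, and m (m is chosen so that m < 25 reproduces the result shown above); the values are assumptions for illustration only.

```python
import numpy as np

a = np.array([14, 23, 32, 41])   # hypothetical values
b = np.array([5, 4, 3, 2])

print(a + b)    # [19 27 35 43]
print(a - b)    # [ 9 19 29 39]
print(a * b)    # [70 92 96 82]   element-wise product
print(a / b)    # true division, returns floats
print(a // b)   # [ 2  5 10 20]   integer division
print(a % b)    # [4 3 2 1]       modulus
print(a ** b)   # [537824 279841  32768   1681]

# Matrix product uses @ (or np.dot), not *
P = np.array([[1, 2], [3, 4]]) @ np.array([[5, 6], [7, 8]])

m = np.array([20, -5, 30, 40])
print(m < 25)   # [ True  True False False]
```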
Broadcasting in NumPy arrays
What is Broadcasting?
[Figure: two 2x2 arrays, [[1, 2], [4, 5]] and [[0, 2], [3, 4]], add element-wise to [[1, 4], [7, 9]]; the same 2x2 array combined with only the values 5 and 7 is marked "???"]
First rule of Broadcasting
>>> h = np.arange(5).reshape(1, 1, 5)
>>> h
array([[[0, 1, 2, 3, 4]]])
Let's try to add a 1D array of shape (5,) to this 3D array of
shape (1,1,5), applying the first rule of broadcasting.
>>> h + [10, 20, 30, 40, 50] # same as: h + [[[10, 20, 30, 40, 50]]]
array([[[10, 21, 32, 43, 54]]])
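As commonly stated in NumPy's broadcasting rules, when two arrays have a different number of dimensions, the shape of the smaller one is padded with 1s on the left until the ranks match. The sketch below checks this for the example above; the name v is introduced only for illustration.

```python
import numpy as np

h = np.arange(5).reshape(1, 1, 5)     # shape (1, 1, 5)
v = np.array([10, 20, 30, 40, 50])    # shape (5,)

result = h + v                        # v is treated as shape (1, 1, 5)

print(result.shape)   # (1, 1, 5)
print(np.array_equal(result, h + v.reshape(1, 1, 5)))   # True
```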
Second rule of Broadcasting
>>> k = np.arange(6).reshape(2, 3)
>>> k
array([[0, 1, 2],
[3, 4, 5]])
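In NumPy's broadcasting rules, an array with size 1 along some dimension behaves as if it had the size of the other array along that dimension, with its values repeated. A minimal sketch using k, with the hypothetical helper arrays col and row:

```python
import numpy as np

k = np.arange(6).reshape(2, 3)      # shape (2, 3)

col = np.array([[100], [200]])      # shape (2, 1), stretched across columns
print(k + col)
# [[100 101 102]
#  [203 204 205]]

row = np.array([100, 200, 300])     # shape (3,) -> (1, 3), stretched across rows
print(k + row)
# [[100 201 302]
#  [103 204 305]]
```

Shapes that cannot be matched this way, such as (2, 3) with (2,), raise a ValueError.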
Mathematical and statistical functions on NumPy arrays
Finding Mean of NumPy array elements
The ndarray object has a method mean() which finds the mean
of all the elements in the array regardless of the shape of the
numpy array.
Other useful ndarray methods
>>> a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])
min = -2.5
max = 12.0
sum = 40.6
prod = -71610.0
std = 5.08483584352
var = 25.8555555556
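The values listed above can be reproduced by calling the corresponding ndarray methods on a. This is one possible way to print them, not necessarily the code used on the original slide.

```python
import numpy as np

a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])

# Each method reduces over all elements, regardless of the array's shape
for func in (a.min, a.max, a.sum, a.prod, a.std, a.var):
    print(func.__name__, "=", func())

print("mean =", a.mean())
```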
Summing across different axes
We can sum across different axes of a numpy array by
specifying the axis parameter of the sum function.
>>> c=np.arange(24).reshape(2,3,4)
>>> c
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
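A short sketch of what the axis parameter does for the array c defined above: summing over an axis removes that axis from the result's shape.

```python
import numpy as np

c = np.arange(24).reshape(2, 3, 4)

print(c.sum(axis=0))   # sum across the 2 blocks -> shape (3, 4)
# [[12 14 16 18]
#  [20 22 24 26]
#  [28 30 32 34]]

print(c.sum(axis=1))   # sum across rows within each block -> shape (2, 4)
# [[12 15 18 21]
#  [48 51 54 57]]

print(c.sum(axis=2))   # sum across columns -> shape (2, 3)
# [[ 6 22 38]
#  [54 70 86]]
```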
Transposing Matrices
The T attribute is equivalent to calling transpose() when the
rank is ≥2
>>> m1 = np.arange(6).reshape(2,3)
>>> m1
array([[0, 1, 2],
[3, 4, 5]])
>>> m1.T
array([[0, 3],
[1, 4],
[2, 5]])
Solving a system of linear scalar equations
The solve function solves a system of linear scalar equations,
such as:
2x + 6y = 6
5x + 3y = -9
>>> coeffs = np.array([[2, 6], [5, 3]])
>>> depvars = np.array([6, -9])
>>> solution = np.linalg.solve(coeffs, depvars)
>>> solution
array([-3., 2.])
Let’s check the solution.
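One way to check the solution is to multiply the coefficient matrix by it and compare with the right-hand side. The sketch below assumes the same coeffs and depvars as above and uses np.allclose to allow for floating-point rounding.

```python
import numpy as np

coeffs = np.array([[2, 6], [5, 3]])
depvars = np.array([6, -9])
solution = np.linalg.solve(coeffs, depvars)

# coeffs @ solution should reproduce depvars (up to rounding)
print(coeffs.dot(solution))
print(np.allclose(coeffs.dot(solution), depvars))   # True
```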
References
● NumPy
○ https://docs.scipy.org/doc/
Questions?
https://discuss.cloudxlab.com
Pandas
What is Pandas?
● One of the most widely used Python libraries in Data Science after
NumPy and Matplotlib
● The Pandas library provides
○ High-performance
○ Easy-to-use data structures and
○ Data analysis tools
Pandas - DataFrame
● In-memory 2D table
Pandas - Data Analysis
○ Plotting graphs
Pandas - Data Structures
● Series objects
● DataFrame objects
● Panel objects (removed from recent pandas releases)
○ Dictionary of DataFrames
Pandas - Series Objects
Creating a Series
>>> import pandas as pd
>>> s = pd.Series([2,-1,3,5])
Output -
0 2
1 -1
2 3
3 5
dtype: int64
Output -
0 4
1 1
2 9
3 25
dtype: int64
Output -
0 1002
1 1999
2 3003
3 4005
dtype: int64
Broadcasting
>>> s + 1000
Output -
0 1002
1 999
2 1003
3 1005
dtype: int64
Output -
0 False
1 True
2 False
3 False
dtype: bool
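Several of the outputs above appear without the expression that produced them. The sketch below shows expressions that are consistent with those outputs for s = pd.Series([2, -1, 3, 5]); these are assumptions, and the original slides may have used different but equivalent code.

```python
import pandas as pd

s = pd.Series([2, -1, 3, 5])

squared = s ** 2                           # element-wise power
shifted = s + [1000, 2000, 3000, 4000]     # element-wise addition with a list
plus_1000 = s + 1000                       # broadcasting a scalar
negative = s < 0                           # element-wise comparison

print(squared.tolist())     # [4, 1, 9, 25]
print(shifted.tolist())     # [1002, 1999, 3003, 4005]
print(plus_1000.tolist())   # [1002, 999, 1003, 1005]
print(negative.tolist())    # [False, True, False, False]
```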
Output -
0 68
1 83
2 112
3 68
dtype: int64
Output -
alice 68
bob 83
charles 112
darwin 68
dtype: int64
● By specifying integer location (deprecated for labeled Series in recent pandas versions; prefer s2.iloc[1])
>>> s2[1]
● By specifying label
>>> s2["bob"]
>>> s2.loc["bob"]
>>> s2.iloc[1]
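A small sketch of label-based versus position-based access, assuming s2 is the labeled Series shown earlier (alice, bob, charles, darwin). Using loc and iloc makes the intent explicit.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])

print(s2.loc["bob"])    # 83, by label
print(s2.iloc[1])       # 83, by integer position

# Label slices include the end label; position slices exclude the end position
print(s2.loc["bob":"charles"].tolist())   # [83, 112]
print(s2.iloc[1:3].tolist())              # [83, 112]
```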
Output -
alice 68
bob 83
colin 86
darwin 68
dtype: int64
Output -
colin 86
alice 68
dtype: int64
Automatic alignment
>>> print(s2+s3)
Output -
alice 136.0
bob 166.0
charles NaN
colin NaN
darwin 136.0
dtype: float64
* Note the NaN values: a label that is missing from either Series gives NaN in the result
Do not forget to set the right index labels, else you may get surprising
results
>>> s5 = pd.Series([1000,1000,1000,1000])
>>> print(s2 + s5)
Output-
alice NaN
bob NaN
charles NaN
darwin NaN
0 NaN
1 NaN
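The alignment behaviour can be sketched as follows. The Series s2 and s3 are rebuilt here from the outputs shown on earlier slides, and aligned_bonus is a hypothetical name for a Series that reuses the labels of s2.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s3 = pd.Series({"alice": 68, "bob": 83, "colin": 86, "darwin": 68})

total = s2 + s3                  # aligned on labels; unmatched labels give NaN
print(total["alice"])            # 136.0
print(total.isna().sum())        # 2  (charles and colin)

s5 = pd.Series([1000, 1000, 1000, 1000])   # default integer index 0..3
mismatched = s2 + s5             # no labels in common
print(len(mismatched))           # 8
print(mismatched.isna().all())   # True

# Giving the second Series the same labels produces the intended sum
aligned_bonus = pd.Series(1000, index=s2.index)
print((s2 + aligned_bonus).tolist())   # [1068, 1083, 1112, 1068]
```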
Output-
life 42
universe 42
everything 42
dtype: int64
Output-
bob 83
alice 68
Name: weights, dtype: int64
Plotting a series
Pandas - DataFrame Objects
Creating a DataFrame
>>> people_dict = {
...     "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
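Since the dictionary above is cut off, here is a self-contained sketch of building a DataFrame from a dictionary of Series. The weight and birthyear values follow the outputs shown in this deck; the children and hobby values are assumptions added only to make the example complete.

```python
import pandas as pd

people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"]),
    "children": pd.Series([0, 3], index=["charles", "bob"]),             # assumed values
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),   # assumed values
}

# Series are aligned on their index labels; missing entries become NaN
people = pd.DataFrame(people_dict)

print(people.index.tolist())            # ['alice', 'bob', 'charles']
print(people["birthyear"].tolist())     # [1985, 1984, 1992]
print(people.loc["charles", "weight"])  # 112
```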
>>> people["birthyear"]
Output -
alice 1985
bob 1984
charles 1992
Name: birthyear, dtype: int64
>>> d2 = pd.DataFrame(
...     people_dict,
...     columns=["birthyear", "weight", "height"],
...     index=["bob", "alice", "eugene"]
... )
>>> print(d2)
● Using loc
○ people.loc["charles"]
● Using iloc
○ people.iloc[2]
Output -
birthyear 1992
children 0
hobby NaN
weight 112
Name: charles, dtype: object
>>> people.iloc[1:3]
>>> people
>>> (people
...     .assign(body_mass_index=lambda df: df["weight"] / (df["height"] / 100) ** 2)
...     .assign(overweight=lambda df: df["body_mass_index"] > 25)
... )
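The chained assign calls can be tried on a small stand-alone DataFrame. In this sketch the weights follow the deck, while the height values (in centimetres) are assumptions made up for the example.

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "height": [172, 181, 185]},   # heights are assumed
    index=["alice", "bob", "charles"],
)

# assign returns a new DataFrame, so the second call can use the column
# created by the first; the original people DataFrame is left unchanged.
result = (
    people
    .assign(body_mass_index=lambda df: df["weight"] / (df["height"] / 100) ** 2)
    .assign(overweight=lambda df: df["body_mass_index"] > 25)
)

print(result["overweight"].tolist())     # [False, True, True]
print("overweight" in people.columns)    # False
```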
>>> people.sort_index(ascending=False)
>>> people.sort_index(inplace=True)
>>> people
Plotting a DataFrame
>>> import matplotlib.pyplot as plt
>>> people.plot(
...     kind="line",
...     x="body_mass_index",
...     y=["height", "weight"]
... )
>>> plt.show()
DataFrames - Saving
● Save to CSV
○ >>> my_df.to_csv("my_df.csv")
● Save to HTML
○ >>> my_df.to_html("my_df.html")
● Save to JSON
○ >>> my_df.to_json("my_df.json")
Note that the index is saved as the first column (with no name) in a CSV file
DataFrames - What was saved?
DataFrames - Loading
>>> my_df_loaded
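A minimal round-trip sketch for CSV, assuming a small made-up my_df and a temporary directory so that no files are left behind. Passing index_col=0 to pd.read_csv tells pandas to use the unnamed first column as the index again.

```python
import os
import tempfile

import pandas as pd

my_df = pd.DataFrame(
    {"weight": [68, 83], "birthyear": [1985, 1984]},   # example values
    index=["alice", "bob"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_df.csv")
    my_df.to_csv(path)

    # The header line starts with an empty field: the index has no name
    with open(path) as f:
        first_line = f.readline().strip()

    my_df_loaded = pd.read_csv(path, index_col=0)

print(first_line)   # ,weight,birthyear
print(my_df_loaded)
```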
DataFrames - Overview
>>> housing.tail(n=2)
● The info method prints out the summary of each column's contents
>>> housing.info()
References
● Pandas
○ http://pandas.pydata.org/pandas-docs/stable/
Matplotlib
Matplotlib - Overview
Matplotlib - pyplot Module
● matplotlib.pyplot
○ Collection of functions that make matplotlib work like MATLAB
○ Majority of plotting commands in pyplot have MATLAB analogs with
similar arguments
Matplotlib - pyplot Module - plot()
plot x versus y
>>> import matplotlib.pyplot as plt
>>> plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
>>> plt.ylabel('some numbers')
>>> plt.show()
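The same plot can be written as a small script. This sketch assumes no display is available and therefore selects the non-interactive "Agg" backend; the axis labels and title are illustrative additions.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless environment
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [v ** 2 for v in x]          # 1, 4, 9, 16

fig, ax = plt.subplots()
(line,) = ax.plot(x, y, "ro-")   # red circles joined by a line
ax.set_xlabel("x")
ax.set_ylabel("some numbers")
ax.set_title("x versus x squared")
```

In an interactive session, plt.show() displays the figure; in a script, fig.savefig with a file name of your choice writes it to disk.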
Matplotlib - pyplot Module - Histogram
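A minimal histogram sketch, using a short made-up data list and five bins. As above, the "Agg" backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless environment
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # example values

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(data, bins=5)
ax.set_xlabel("value")
ax.set_ylabel("frequency")

print(counts)   # [1. 2. 3. 2. 1.]
```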
References
● Matplotlib
○ https://matplotlib.org/tutorials/index.html