CS3361 - Data Science Lab Record


CS3361 DATA SCIENCE LABORATORY L T P C
0 0 4 2
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels
and Pandas packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands
for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap

List of Equipments: (30 Students per Batch)


Tools: Python, NumPy, SciPy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh
Note: Example data sets: UCI, Iris, Pima Indians Diabetes, etc.

TOTAL: 60 PERIODS

Ex No.1(a) Download and install the different packages like NumPy, SciPy, Jupyter, Statsmodels and Pandas
Date:

AIM:
To learn how to download and install the different packages of NumPy, SciPy, Jupyter,
Statsmodels and Pandas.

ALGORITHM:
1. Download Python and Jupyter.
2. Install Python and Jupyter.
3. Install the packages NumPy, SciPy, Statsmodels and Pandas.
4. Verify the proper execution of Python and Jupyter.

Python Installation
 Open the official Python web site (https://www.python.org/).
 Downloads ==> Windows ==> Select a recent release. (Requires Windows 10 or above.)
 Install "python-3.10.6-amd64.exe".

Jupyter Installation
 Open a command prompt and enter "python --version" to check whether Python was installed properly.
 If the installation is proper it returns the version of Python.
 Enter "pip --version" to check whether the Python package manager was installed properly.
 If the installation is proper it returns the version of the Python package manager.
 Enter the command "pip install jupyterlab".
 Enter the command "pip install notebook".
 If the command output suggests a pip upgrade, copy the suggested upgrade command, paste it and execute it.
 Create a folder and name it accordingly.
 Open a command prompt and navigate into that folder. Enter the command "jupyter notebook" and press Enter.
 A new Jupyter notebook will now open for our use.

pip Installation
Installation of NumPy
 pip install numpy
Installation of SciPy
 pip install scipy

Installation of Statsmodels
 pip install statsmodels
Installation of Pandas
 pip install pandas
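As a quick sanity check (a minimal sketch, not part of the original record), each package can be imported and its version printed:
import numpy, scipy, statsmodels, pandas
# a successful import confirms the package is installed;
# __version__ shows which release pip fetched
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("statsmodels:", statsmodels.__version__)
print("Pandas:", pandas.__version__)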

Sample Output

RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed properly and
the execution also verified.

Ex No.1(b) Explore the features of NumPy
Date:

AIM:

To learn the different features provided by NumPy package.

ALGORITHM:
1. Install the NumPy package
2. Study all the features of NumPy package.

NumPy
 NumPy is a Python library used for working with arrays.
 It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
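As a small illustrative sketch (added here, not from the record), the linear algebra and Fourier transform helpers live under np.linalg and np.fft:
import numpy as np
m = np.array([[1, 2], [3, 4]])
print(np.linalg.det(m)) # determinant of a 2x2 matrix: approximately -2.0
print(np.fft.fft([1, 0, 0, 0])) # discrete Fourier transform of a unit impulse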

Features
These are the important features of NumPy
1. Array 2. Random 3. Universal Functions

1. Arrays
1.1 Array Slicing
 Slicing in python means taking elements from one given index to another given
index.
 We pass slice instead of index like this: [start:end].
 We can also define the step, like this: [start:end:step].
 If we don't pass start, it is considered 0.
 If we don't pass end, it is considered the length of the array in that dimension.
 If we don't pass step, it is considered 1.

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

1.2 Array Shape & Reshaping


1.2.1 Array Shape
NumPy arrays have an attribute called shape that returns a tuple with each index
having the number of corresponding elements.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)

1.2.2 Array Reshaping
 Reshaping means changing the shape of an array.
 The shape of an array is the number of elements in each dimension.
 By reshaping we can add or remove dimensions or change number of elements
in each dimension.
 Convert the following 1-D array with 12 elements into a 3-D array.
The outermost dimension will have 2 arrays that contain 3 arrays, each with 2 elements:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

2. Random
Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of
[1, 2, 3] and vice-versa.
The NumPy Random module provides two methods for this: shuffle() and
permutation().
from numpy import random
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
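By contrast, the permutation() method returns a re-arranged copy and leaves the original array untouched; a minimal sketch:
from numpy import random
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(random.permutation(arr)) # a shuffled copy
print(arr) # the original order is preserved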

Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to
visualize random distributions.
import matplotlib.pyplot as plt
import seaborn as sns
# distplot() is deprecated in recent seaborn releases;
# histplot(..., kde=True) is the modern equivalent
sns.histplot([0, 1, 2, 3, 4, 5], kde=True)
plt.show()
Binomial Distribution
Binomial Distribution is a Discrete Distribution.
It describes the outcome of binary scenarios, e.g. the toss of a coin: it will be either heads or tails.
It has three parameters:
n - number of trials.
p - probability of occurrence of each trial (e.g. for toss of a coin 0.5 each).
size - The shape of the returned array.
Given 10 trials for coin toss generate 10 data points:
from numpy import random
x = random.binomial(n=10, p=0.5, size=10)
print(x)

Poisson Distribution
It estimates how many times an event can happen in a specified time. E.g. if someone eats twice a day, what is the probability that he will eat three times?
It has two parameters:
lam - rate or known number of occurrences, e.g. 2 for the above problem.
size - The shape of the returned array.
Generate a random 1x10 distribution for occurrence 2:
from numpy import random
x = random.poisson(lam=2, size=10)
print(x)

Logistic Distribution
Logistic Distribution is used to describe growth.
Used extensively in machine learning in logistic regression, neural networks etc.
It has three parameters:
loc - mean, where the peak is. Default 0.
scale - standard deviation, the flatness of distribution. Default 1.
size - The shape of the returned array.

Draw 2x3 samples from a logistic distribution with mean at 1 and stddev 2.0:
from numpy import random
x = random.logistic(loc=1, scale=2, size=(2, 3))
print(x)

Multinomial Distribution
Multinomial distribution is a generalization of binomial distribution.
It describes outcomes of multinomial scenarios, unlike binomial where scenarios must be only one of two, e.g. blood type of a population, dice roll outcome.
It has three parameters:
n - number of possible outcomes (e.g. 6 for dice roll).
pvals - list of probabilities of outcomes (e.g. [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] for dice roll).
size - The shape of the returned array.

Draw out a sample for dice roll:


from numpy import random
x = random.multinomial(n=6, pvals=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
print(x)

Exponential Distribution
Exponential distribution is used for describing time till next event e.g. failure/success
etc.
It has two parameters:
scale - inverse of rate ( see lam in poisson distribution ) defaults to 1.0.
size - The shape of the returned array.

Draw out a sample for exponential distribution with 2.0 scale with 2x3 size:
from numpy import random
x = random.exponential(scale=2, size=(2, 3))
print(x)

Chi Square Distribution


Chi Square distribution is used as a basis for hypothesis testing.
It has two parameters:
df - (degree of freedom).
size - The shape of the returned array.
Draw out a sample for chi squared distribution with degree of freedom 2 with size 2x3:
from numpy import random
x = random.chisquare(df=2, size=(2, 3))
print(x)

Pareto Distribution
A distribution following Pareto's law, i.e. an 80-20 distribution (20% of factors cause 80% of the outcome).
It has two parameters:
a - shape parameter.
size - The shape of the returned array.
Draw out a sample for pareto distribution with shape of 2 with size 2x3:
from numpy import random
x = random.pareto(a=2, size=(2, 3))
print(x)

3. Universal Functions

Create Your Own ufunc (Universal)


To create your own ufunc, you have to define a function, like you do with normal functions in Python, then you add it to your NumPy ufunc library with the frompyfunc() method.

The frompyfunc() method takes the following arguments:

function - the name of the function.


inputs - the number of input arguments (arrays).
outputs - the number of output arrays.
Create your own ufunc for addition:
import numpy as np
def myadd(x, y):
    return x + y
myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))

3.1 Simple Arithmetic

You could use arithmetic operators + - * / directly between NumPy arrays, but this section discusses an extension of the same where we have functions that can take any array-like objects, e.g. lists, tuples etc., and perform arithmetic conditionally.
Addition
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.add(arr1, arr2)
print(newarr)
Subtraction
Subtract the values in arr2 from the values in arr1:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.subtract(arr1, arr2)
print(newarr)
Multiplication
Multiply the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.multiply(arr1, arr2)
print(newarr)
Division
Divide the values in arr1 by the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 10, 8, 2, 33])
newarr = np.divide(arr1, arr2)
print(newarr)
Power
Raise the values in arr1 to the power of the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 6, 8, 2, 33])
newarr = np.power(arr1, arr2)
print(newarr)
Remainder
Return the remainders:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Absolute Values
Return the absolute values:
import numpy as np
arr = np.array([-1, -2, 1, 2, 3, -4])
newarr = np.absolute(arr)
print(newarr)

3.2 Rounding Decimals


There are primarily five ways of rounding off decimals in NumPy:
 truncation  fix  rounding  floor  ceil
3.2.1 Truncation
Remove the decimals, and return the float number closest to zero.
Use the trunc() and fix() functions.
Truncate elements of following array:
import numpy as np
arr = np.trunc([-3.1666, 3.6667])
print(arr)
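The fix() function mentioned above behaves the same way, always rounding toward zero; a short sketch:
import numpy as np
# fix() also truncates, rounding toward zero
arr = np.fix([-3.1666, 3.6667])
print(arr) # [-3. 3.]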

3.2.2 Rounding
The around() function increments the preceding digit or decimal by 1 if the next digit is >= 5; otherwise it does nothing.
Round off 3.1666 to 2 decimal places:
import numpy as np
arr = np.around(3.1666, 2)
print(arr)

3.2.3 Floor
The floor() function rounds decimals down to the nearest lower integer.
Floor the elements of following array:
import numpy as np
arr = np.floor([-3.1666, 3.6667])
print(arr)

3.2.4 Ceil
The ceil() function rounds decimals up to the nearest upper integer.
Ceil the elements of following array:
import numpy as np
arr = np.ceil([-3.1666, 3.6667])
print(arr)

3.3 Summations
Addition is done between two arguments whereas summation
happens over n elements
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
newarr = np.add(arr1, arr2)
print(newarr)
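To perform an actual summation over n elements, the sum() function is the counterpart; a short sketch using the same arrays:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
print(np.sum([arr1, arr2])) # sums every element: 12
print(np.sum([arr1, arr2], axis=1)) # sums each array separately: [6 6]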

3.4 Products
To find the product of the elements in an array, use the prod()
function.
Find the product of the elements of this array:
import numpy as np
arr = np.array([1, 2, 3, 4])
x = np.prod(arr)
print(x)

3.5 Differences
A discrete difference means subtracting two successive elements.
To find the discrete difference, use the diff() function.
Compute discrete difference of the following array:
import numpy as np
arr = np.array([10, 15, 25, 5])
newarr = np.diff(arr)
print(newarr)

3.6 LCM (Lowest Common Multiple)


The Lowest Common Multiple is the smallest number that is a common multiple of both numbers.
import numpy as np
num1 = 4
num2 = 6
x = np.lcm(num1, num2)
print(x)

3.7 Trigonometric Functions


NumPy provides the ufuncs sin(), cos() and tan() that take values
in radians and produce the corresponding sin, cos and tan values.
Find sine value of PI/2:
import numpy as np
x = np.sin(np.pi/2)
print(x)

Find sine values for all of the values in arr:
import numpy as np
arr = np.array([np.pi/2, np.pi/3, np.pi/4, np.pi/5])
x = np.sin(arr)
print(x)

3.8 Set Operations


A set in mathematics is a collection of unique elements.

3.8.1 Create Sets in NumPy


We can use NumPy's unique() method to find unique elements from any array, e.g. to create a set array; but remember that set arrays should only be 1-D arrays.
Convert following array with repeated elements to a set:
import numpy as np
arr = np.array([1, 1, 1, 2, 3, 4, 5, 5, 6, 7])
x = np.unique(arr)
print(x)
3.8.2 Finding Union
To find the unique values of two arrays, use the union1d()
method.
Find union of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.union1d(arr1, arr2)
print(newarr)

3.8.3 Finding Intersection


To find only the values that are present in both arrays, use the
intersect1d() method.
Find intersection of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.intersect1d(arr1, arr2, assume_unique=True)
print(newarr)

Sample Output:
RESULT

Thus the feature study of NumPy was completed successfully.

Ex.No 1(c) Explore the features of SciPy
Date:

AIM:
To learn the different features provided by SciPy package.

ALGORITHM:
1. Install the SciPy package
2. Study all the features of SciPy package.

SciPy
SciPy stands for Scientific Python. SciPy is a scientific computation library that uses NumPy underneath.

Features
These are the important features of SciPy:
1. Constants 2. Sparse Data 3. Graphs
4. Spatial Data 5. Matlab Arrays 6. Interpolation

1. Constants in SciPy
As SciPy is more focused on scientific implementations, it
provides many built-in scientific constants.
These constants can be helpful when you are working with Data
Science.
1.1 Unit Categories in SciPy
Metric
Return the specified unit in meter
ex: print(constants.milli)
Binary
Return the specified unit in bytes
ex: print(constants.kibi)
Mass
Return the specified unit in kg
ex: print(constants.stone)
Angle
Return the specified unit in radians
ex: print(constants.degree)
Time
Return the specified unit in seconds
ex: print(constants.year)

Length
Return the specified unit in meters
ex: print(constants.mile)

Pressure
Return the specified unit in pascals
ex: print(constants.bar)
Area
Return the specified unit in square meters
ex: print(constants.hectare)
Volume
Return the specified unit in cubic meters
ex: print(constants.litre)
Speed
Return the specified unit in meters per second
ex: print(constants.kmh)
Temperature
Return the specified unit in Kelvin
ex: print(constants.zero_Celsius)
Energy
Return the specified unit in joules
ex: print(constants.calorie)
Power
Return the specified unit in watts
ex: print(constants.hp)
Force
Return the specified unit in newton
ex: print(constants.pound_force)
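All of these constants live in the scipy.constants module; a minimal sketch printing a few of the categories listed above:
from scipy import constants
print(constants.pi) # 3.141592653589793
print(constants.milli) # Metric: 0.001
print(constants.kibi) # Binary: 1024
print(constants.degree) # Angle: one degree expressed in radians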

2. Sparse Data
Sparse data is data that has mostly unused elements (elements that
don't carry any information).
It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse Data: is a data set where most of the item values are zero.
Dense Array: is the opposite of a sparse array: most of the values
are not zero.

2.1 CSR(Compressed Sparse Row) Matrix


We can create a CSR matrix by passing an array into the function scipy.sparse.csr_matrix().
Create a CSR matrix from an array:
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])
print(csr_matrix(arr))

3. Graphs
Graphs are an essential data structure.
SciPy provides us with the module scipy.sparse.csgraph for
working with such data structures.

Adjacency Matrix
An adjacency matrix is an n x n matrix, where n is the number of elements in a graph. The values represent the connections between the elements.

3.1 Dijkstra
Use the dijkstra method to find the shortest path in a graph from
one element to another.
It takes the following arguments:
return_predecessors: boolean (True to return the whole path of traversal, otherwise False).
indices: index of the element; return all paths from that element only.
limit: max weight of path.
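The record gives no code for this step, so the following is a small sketch (the graph and its weights are made up) combining an adjacency matrix with scipy.sparse.csgraph.dijkstra():
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
# adjacency matrix of a 3-element graph (hypothetical weights)
arr = np.array([[0, 1, 2],
                [1, 0, 0],
                [2, 0, 0]])
graph = csr_matrix(arr)
# shortest distances from element 0 to every other element
dist, pred = dijkstra(graph, return_predecessors=True, indices=0)
print(dist) # [0. 1. 2.]
print(pred) # predecessor of each element on its shortest path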
4. Spatial Data
Spatial data refers to data that is represented in a geometric space.
E.g. points on a coordinate system.
We deal with spatial data problems on many tasks.
E.g. finding if a point is inside a boundary or not.

4.1 Triangulation
A triangulation of a polygon divides the polygon into multiple triangles with which we can compute the area of the polygon.
A triangulation with points means creating a surface composed of triangles in which all of the given points are on at least one vertex of a triangle in the surface.
Example:
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],

[3, 0],
[2, 2],
[4, 1]
])
simplices = Delaunay(points).simplices
plt.triplot(points[:, 0], points[:, 1], simplices)
plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()

4.2 Euclidean Distance


Find the euclidean distance between given points A and B.
Example
Find the euclidean distance between given points.
from scipy.spatial.distance import euclidean
p1 = (1, 0)
p2 = (10, 2)
res = euclidean(p1, p2)
print(res)

4.3 Cosine Distance


The cosine distance is derived from the cosine of the angle between the two points A and B.
Find the cosine distance between the given points:
from scipy.spatial.distance import cosine
p1 = (1, 0)
p2 = (10, 2)
res = cosine(p1, p2)
print(res)

Hamming Distance
It is the proportion of bits where the two sequences differ.
It's a way to measure distance for binary sequences.
Find the hamming distance between given points:
from scipy.spatial.distance import hamming
p1 = (True, False, True)
p2 = (False, True, True)
res = hamming(p1, p2)
print(res)

5. Matlab Arrays
We know that NumPy provides us with methods to persist the
data in readable formats for Python. But SciPy provides us with
interoperability with Matlab as well.

Working With Matlab Arrays
Exporting Data in Matlab Format
The savemat() function allows us to export data in Matlab format.
The method takes the following parameters:
filename - the file name for saving data.
mdict - a dictionary containing the data.
do_compression - a boolean value that specifies whether
to compress the result or not. Default False.

Example
Export the following array as variable name "vec" to a mat file:
from scipy import io
import numpy as np
arr = np.arange(10)
io.savemat('arr.mat', {"vec": arr})

Import Data from Matlab Format


The loadmat() function allows us to import data from a Matlab
file.
The function takes one required parameter:
filename - the file name of the saved data.
It returns a dictionary whose keys are the variable names, and the corresponding values are the variable values.
Import the array from the following mat file:
from scipy import io
import numpy as np
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9,])
# Export:
io.savemat('arr.mat', {"vec": arr})
# Import:
mydata = io.loadmat('arr.mat')
print(mydata)

Sample Output

RESULT

Thus the feature study of SciPy was completed successfully.

Ex.No.1(d) Explore the features of Pandas
Date:

AIM:
To learn the different features provided by Pandas package.

ALGORITHM:
1. Install the Pandas package
2. Study all the features of Pandas package.

Pandas
 Pandas is a Python library used for working with data sets.
 It has functions for analyzing, cleaning, exploring, and
manipulating data.
 Pandas allows us to analyze big data and make conclusions
based on statistical theories.

 Pandas can clean messy data sets, and make them readable
and relevant.

Features
These are the important features of Pandas.
1. Series 2. DataFrames 3. Read CSV
4. Read JSON 5. Viewing the Data 6. Data Cleaning
7. Plotting

1. Series
 A Pandas Series is like a column in a table.
 It is a one-dimensional array holding data of any type.
 Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

1.1 Create Labels


With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

1.2 Key/Value Objects as Series


You can also use a key/value object, like a dictionary, when
creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

2. DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a 2
dimensional array, or a table with rows and columns.
Example

Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

3. Read CSV
A simple way to store big data sets is to use CSV files (comma separated files). CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
Example
To print maximum rows in a CSV file
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)

4. Read JSON
 Big data sets are often stored, or extracted as JSON.
 JSON is plain text, but has the format of an object, and is well
known in the world of programming, including Pandas.
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())

5. Viewing the Data


One of the most used methods for getting a quick overview of the DataFrame is the head() method. The head() method returns the headers and a specified number of rows, starting from the top.
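A minimal sketch, assuming the same data.csv used in the other examples:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10)) # first 10 rows; head() with no argument returns 5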

5.1 Info About the Data


The DataFrames object has a method called info(), that gives you
more information about the data set.
Example
Print information about the data:
import pandas as pd

df = pd.read_csv('data.csv')
print(df.info())

6. Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
 Empty cells
 Data in wrong format
 Wrong data
 Duplicates

6.1 Empty Cells


6.1.1 Remove Rows
One way to deal with empty cells is to remove rows that contain
empty cells.
This is usually OK, since data sets can be very big, and removing
a few rows will not have a big impact on the result.
Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
The inplace parameter
dropna(inplace = True) removes all rows with NULL values from the original DataFrame instead of returning a new one:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())

6.1.2 Replace Empty Values


Another way of dealing with empty cells is to insert a new value
instead.

Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)

6.1.3 Replace Using Mean, Median, or Mode


A common way to replace empty cells, is to calculate the mean,
median or mode value of the column.

Pandas uses the mean() median() and mode() methods to calculate
the respective values for a specified column:
mean()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)
print(df.to_string())
median()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"].fillna(x, inplace = True)
mode()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"].fillna(x, inplace = True)

6.2 Data of Wrong Format


Cells with data of wrong format can make it difficult, or even
impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all
cells in the columns into the same format.
Example
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

6.2.1 Removing Rows


Remove rows with a NULL value in the "Date" column:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(subset=['Date'], inplace = True)
print(df.to_string())

6.3 Fixing Wrong Data


6.3.1 Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong
format", it can just be wrong, like if someone registered "199" instead of
"1.99".

Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.

6.3.2 Replacing Values


One way to fix wrong values is to replace them with something
else.
Example
Set "Duration" = 45 in row 7:
import pandas as pd
df = pd.read_csv('data.csv')
df.loc[7,'Duration'] = 45
print(df.to_string())

6.3.3 Removing Rows


Another way of handling wrong data is to remove the rows that contain wrong data.
Example
Delete rows where "Duration" is higher than 120:
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
print(df.to_string())

6.4 Removing Duplicates


6.4.1 Discovering Duplicates
Duplicate rows are rows that have been registered more than one
time.
duplicated() method
import pandas as pd
df = pd.read_csv('data.csv')
print(df.duplicated())

6.4.2 Removing Duplicates


To remove duplicates, use the drop_duplicates() method.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
7. Plotting
We can use Pyplot, a submodule of the Matplotlib library to
visualize the diagram on the screen.
Pandas uses the plot() method to create diagrams.

7.1 Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
Example
import sys
import matplotlib
matplotlib.use('Agg')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()

7.2 Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
Example
import sys
import matplotlib
matplotlib.use('Agg')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
plt.show()
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()

Sample Output

RESULT

Thus the feature study of Pandas was completed successfully.

Ex No.1(e) Explore the features of statsmodels
Date:

AIM:

To learn the different features provided by statsmodels package.

ALGORITHM:
1. Install the statsmodels package
2. Study all the features of statsmodels package.

Statsmodels
statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical models, as well
as for conducting statistical tests, and statistical data exploration.

Features
These are the important features of statsmodels
1. Linear regression models
2. Survival analysis

1. Linear regression models


Linear regression analysis is a statistical technique for predicting
the value of one variable(dependent variable) based on the value of
another(independent variable).
In simple linear regression, there’s one independent variable used
to predict a single dependent variable. In the case of multilinear
regression, there’s more than one independent variable.
The independent variable is the one you’re using to forecast the
value of the other variable. The statsmodels.regression.linear_model.OLS
method is used to perform linear regression. Linear equations are of the
form:
Y = mX + C (m = slope; C = constant)
Syntax:
statsmodels.regression.linear_model.OLS(endog, exog=None,
missing=’none’, hasconst=None, **kwargs)
Parameters:
 endog: array like object.
 exog: array like object.
 missing: str. 'none', 'drop', and 'raise' are the available alternatives. If the value is 'none', no nan checking is performed. Any observations with nans are dropped if 'drop' is selected. An error is raised if 'raise' is used. 'none' is the default.

29
 hasconst: None or bool. Indicates whether a user-supplied constant is included in the RHS. If True, k_constant is set to 1 and all result statistics are calculated as if a constant is present. If False, k_constant is set to 0 and no constant is checked.

Step 1: Import packages.


Importing the required packages is the first step of modeling. The pandas, NumPy, and statsmodels packages are imported.
import numpy as np
import pandas as pd
import statsmodels.api as sm
Step 2: Loading data
The CSV file is read using the pandas.read_csv() method. The head, or the first five rows of the dataset, is returned by using the head() method. Head Size and Brain Weight are the columns.
df = pd.read_csv('headbrain1.csv')
df.head()
Visualizing the data:
By using the matplotlib and seaborn packages, we visualize the
data. sns.regplot() function helps us create a regression plot.
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('headbrain1.csv')
# recent seaborn requires keyword arguments for x and y
sns.regplot(x='Head Size(cm^3)', y='Brain Weight(grams)', data=df)
plt.show()
Step 3: Setting a hypothesis.
Null hypothesis (H0): There is no relationship between head size
and brain weight.
Alternative hypothesis (Ha): There is a relationship between
head size and brain weight.
Step 4: Fitting the model
The statsmodels.formula.api.ols() method is used to get the ordinary least squares model, and the fit() method is used to fit the data to it.
The ols method takes in the data and performs linear regression. We provide the dependent and independent columns in this format:
dependent_column ~ independent_columns
The left side of the ~ operator contains the dependent variable (the predicted column), and the right side of the operator contains the independent variables.
df.columns = ['Head_size', 'Brain_weight']

import statsmodels.formula.api as smf
model = smf.ols(formula='Brain_weight ~ Head_size', data=df).fit()
Step 5: Summary of the model.
All the summary statistics of the linear regression model are returned by the model.summary() method. The p-value and many other values/statistics can be read from this summary.
print(model.summary())

2. Survival analysis
The statsmodels.api.SurvfuncRight class can be used to estimate survival functions using data that may be right-censored. SurvfuncRight implements several inference methods, including confidence intervals for survival quantiles, pointwise and simultaneous confidence intervals for survival functions, and plotting methods. The duration.survdiff function provides a test procedure for comparing survival distributions.
Here we create a SurvfuncRight object using the data from the Moore study, available from the R dataset repository, restricting the survival distribution to 'low' fcategory subjects only.

Example:
# Importing libraries
import statsmodels.api as sm
X = sm.datasets.get_rdataset("Moore", "carData").data
# Filtering data of low fcategory
X = X[X['fcategory'] == "low"]
# Creating SurvfuncRight model
model = sm.SurvfuncRight(X["conformity"], X["fscore"])
# Model Summary
model.summary()

Sample Output

Linear regression models

Survival analysis

RESULT
Thus the study of a few important features of statsmodels was completed successfully.
Ex.No.2 Working with Numpy arrays
Date:

AIM:
To work with different features provided by Numpy arrays.

ALGORITHM:
1. Install the numpy package
2. Work with all the features of numpy array.

Arrays
1. Creating Arrays
 0-D Arrays
Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)
print(arr)
 1-D Arrays
An array that has 0-D arrays as its elements is called a 1-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

 2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D
array.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
 3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Example:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

2. Access Array Elements


Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated
integers representing the dimension and the index of the element.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
Access 3-D Arrays
To access elements from 3-D arrays we can use comma separated
integers representing the dimensions and the index of the element.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11,
12]]])
print(arr[0, 1, 2])

3. Array Slicing
 Slicing in python means taking elements from one given index
to another given index.
 We pass slice instead of index like this: [start:end].
 We can also define the step, like this: [start:end:step].

 If we don't pass start, it is considered 0.
 If we don't pass end, it is considered the length of the array in that dimension.
 If we don't pass step, it is considered 1.

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

4. Data Types
NumPy has some extra data types, and refers to data types with one character, like i for integers, u for unsigned integers, etc.
Below is a list of all data types in NumPy and the characters used to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type (void)
Example:
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)

5. Copy & View


5.1 Copy:
Make a copy
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42
print(arr)
print(x)

5.2 View:
Make a view
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42

print(arr)
print(x)

6. Array Shape & Reshaping


6.1 Array Shape
NumPy arrays have an attribute called shape that returns a tuple
with each index having the number of corresponding elements.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)

6.2 Array Reshaping


 Reshaping means changing the shape of an array.
 The shape of an array is the number of elements in each
dimension.
 By reshaping we can add or remove dimensions or change
number of elements in each dimension.
 Convert the following 1-D array with 12 elements into a 3-D
array.
 The outermost dimension will have 2 arrays that contain 3 arrays, each with 2 elements:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

7. Array Iterating
 Iterating means going through elements one by one.
 As we deal with multi-dimensional arrays in numpy, we can do this using the basic for loop of Python.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11,
12]]])
for x in arr:
    print(x)

8. Joining Array
Joining means putting contents of two or more arrays in a single
array.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))

print(arr)

9. Splitting Array
Splitting is reverse operation of Joining.
Joining merges multiple arrays into one and Splitting breaks one
array into multiple.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)

10. Searching Arrays


You can search an array for a certain value, and return the indexes
that get a match.
To search an array, use the where() method.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)

11. Sorting Arrays
 Sorting means putting elements in an ordered sequence.
 An ordered sequence is any sequence that has an order corresponding to elements, like numeric or alphabetical, ascending or descending.
 The NumPy ndarray object has a function called sort(), that
will sort a specified array.
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))

12. Filtering Arrays


Getting some elements out of an existing array and creating a new
array out of them is called filtering. In NumPy, you filter an array using a
boolean index list.
If the value at an index is True that element is contained in the
filtered array, if the value at that index is False that element is excluded
from the filtered array.
import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)

Sample Output:

RESULT
Thus the study of the important features of numpy arrays was completed successfully.

Ex.No 3 Working with DataFrame
Date:

AIM:
To work with dataframe provided by pandas.

ALGORITHM:
1. Install the pandas package
2. Work with all the features of dataframe.

1. DataFrame
A Pandas DataFrame is a 2 dimensional data structure, like a 2
dimensional array, or a table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

2. Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df.loc[0])

3. Named Indexes
With the index argument, you can name your own indexes.

Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2",
"day3"])
print(df)

4. Load Files Into a DataFrame


If your data sets are stored in a file, Pandas can load them into a
DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)

Sample Output:

RESULT
Thus the work with the DataFrame features of pandas was completed successfully.

Ex.No.4 Reading data from the Iris data set and doing descriptive analytics on it
Date:

AIM:
To read data from files and exploring various commands for
doing descriptive analytics on the Iris data set.

ALGORITHM:
1. Download “Iris.csv” file from GitHub.com
2. Load the “Iris.csv” into google colab.
3. Perform descriptive analysis on the Iris file.

Importing Iris.csv
 Login to google colab by using gmail.
 Login to google drive and create a folder with required name.
 Move the Iris file from system to google drive.
 Click on the "Files" icon and click on "Mount Drive".
 Code will appear in a cell; execute that code.
 It requires authentication verification; complete the authentication.
 After successful verification it shows the message "Mounted at /content/drive".
 Find the Iris.csv file and copy the path for future reference.

About Iris Database


The Iris dataset is considered the "Hello World" of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of the different iris flowers and recorded them digitally.
We will use the Pandas library to load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.

Example:
import pandas as pd
# Reading the CSV file

df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Printing top 5 rows
df.head()
Getting Information about the Dataset
We will use the shape attribute to get the shape of the dataset.
df.shape -> returns the number of rows and columns
df.info() -> returns column data types.
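A short sketch of both commands on the dataframe loaded above:
print(df.shape) # e.g. (150, 5): 150 rows, 5 columns
df.info() # column names, non-null counts and data types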
Checking Missing Values
We will check if our data contains any missing values or not.
Missing values can occur when no information is provided for one or
more items or for a whole unit. We will use the isnull() method.

Example:
df.isnull().sum()

Checking Duplicates
Let’s see if our dataset contains any duplicates or not. Pandas
drop_duplicates() method helps in removing duplicates from the data
frame.

Example:
data = df.drop_duplicates(subset ="variety",)
data

Data Visualization
Visualizing the target column
Our target column will be the Species column because at the end
we will need the result according to the species only. Let’s see a
countplot for species.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
# the species column is named 'variety' in this CSV
sns.countplot(x='variety', data=df)
plt.show()

Relation between variables


We will see the relationship between the sepal length and sepal
width and also between petal length and petal width.
Example 1: Comparing Sepal Length and Sepal Width
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='sepal.length', y='sepal.width', hue='variety', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

Example 2: Comparing Petal Length and Petal Width


# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='petal.length', y='petal.width',
hue='variety', data=df, )
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

Handling Correlation
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded, and non-numeric columns are ignored.

Example:
# numeric_only skips the non-numeric species column in recent pandas
data.corr(method='pearson', numeric_only=True)

Sample Output

RESULT

The Iris.csv file was loaded into google colab and descriptive analytics was performed on the Iris data set successfully.

Ex.No 5(a) Perform Univariate analysis on the diabetes
data set
Date:

AIM:
Use the diabetes data set from UCI and Pima Indians Diabetes
data set for Univariate analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians
Diabetes data set.
2. Load the above data files into google colab.
3. Perform analysis like Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis.

Univariate analysis
 The term univariate analysis refers to the analysis of one variable.
 There are two common ways to perform univariate analysis on one variable:
1. Summary statistics – measures the center and spread of values:
 Central tendency — mean, median, mode
 Dispersion — variance, standard deviation, range, interquartile range (IQR)
 Skewness — symmetry of data about the mean value
 Kurtosis — peakedness of data at the mean value
2. Frequency table – describes how often different values occur.

File Importing:
# Reading the UCI file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Printing top 5 rows
df.head()
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Printing top 5 rows
df.head()

1. Central Tendency
We can use the following syntax to calculate various summary
statistics like Mean, Median and Mode.

1.1 Mean:
It is the average value of the given numeric values.
 Mean of UCI data
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mean of UCI data (numeric_only skips non-numeric columns in recent pandas)
df.mean(axis=0, numeric_only=True)
 Mean of Pima data
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mean of Pima data
df.mean(axis=0)

1.2 Median:
It is the middle-most value of the given values.
 Median of UCI data
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Median of UCI data
df.median(axis=0, numeric_only=True)
 Median of Pima data
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Median of Pima data
df.median(axis=0)

1.3 Mode:
It is the most frequently occurring value of the given numeric variables.
 Mode of UCI data
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mode of UCI data
df.mode(axis=0)
 Mode of Pima data
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mode of Pima data
df.mode(axis=0)

2. Dispersion
2.1 Variance
Variance measures how far the values of a data set are spread out from their mean.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# variance of the BMI column
df.loc[:,"BMI"].var()

2.2 Standard deviation


Standard deviation is a measure of how spread out the numbers are. A large standard deviation indicates that the data is spread out; a small standard deviation indicates that the data is clustered closely around the mean.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Standard deviation of the BMI column
df.loc[:,"BMI"].std()

2.3 Range
Range is the simplest of the measurements but is very limited in its use. We calculate the range by taking the largest value of the dataset and subtracting the smallest value from it; in other words, it is the difference of the maximum and minimum values of a dataset.
Example
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
print("Range is:", df.BloodPressure.max() - df.BloodPressure.min())

2.4 Interquartile range


The interquartile range, often denoted "IQR", is a way to measure the spread of the middle 50% of a dataset. It is calculated as the difference between the first quartile (the 25th percentile) and the third quartile (the 75th percentile) of a dataset.
Example
# Importing important libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8') # older matplotlib versions use the name 'seaborn'
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')

# Removing the outliers
def removeOutliers(data, col):
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1
    print("IQR value for column %s is: %s" % (col, IQR))
    global outlier_free_list
    global filtered_data
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    outlier_free_list = [x for x in data[col] if
                         (x > lower_range) & (x < upper_range)]
    filtered_data = data.loc[data[col].isin(outlier_free_list)]

for i in data.columns:
    if i == data.columns[0]:
        removeOutliers(data, i)
    else:
        removeOutliers(filtered_data, i)

# Assigning filtered data back to our original variable
data = filtered_data
print("Shape of data after outlier removal is: ", data.shape)

3. Skewness
 Skewness essentially measures the symmetry of the
distribution.
Example
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# find skewness of each column, skipping the NA values
df.skew(axis=0, skipna=True)

4. kurtosis
kurtosis determines the heaviness of the distribution tails.
Example
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
df['BloodPressure'].kurtosis()

5. Frequency
Frequency is a count of the number of occurrences a particular
value occurs or appears in our data. A frequency table displays a set of
values along with the frequency with which they appear. They allow us to
better understand which data values are common and which are
uncommon.

Example
# import packages
import pandas as pd
import numpy as np
# reading csv file
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# one-way frequency table for the Age column
freq_table = pd.crosstab(data['Age'], 'BMI')
# frequency table as a proportion of all records
freq_table = freq_table/len(data)
freq_table

Sample Output

RESULT
Thus the Univariate analysis on the Diabetes data of UCI and Pima was performed successfully.

Ex.No.5(b) Perform Bivariate analysis on the diabetes data
set
Date:

AIM:
To use the UCI and Pima Indians Diabetes data set for Bivariate
analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians
Diabetes data set.
2. Load the above data files into google colab.
3. Perform various methods of bivariate.

Bivariate analysis
The term bivariate analysis refers to the analysis of two variables.
The purpose of bivariate analysis is to understand the relationship
between two variables
There are three common ways to perform bivariate analysis:
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression

1. Scatterplots
A scatterplot is a type of data display that shows the relationship
between two numerical variables
Example
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Diabetes Outcome: keep diabetic patients only (Outcome == 1)
g1 = data.loc[data.Outcome==1, :]
# Pregnancies, Glucose and Diabetes relation
g1.plot.scatter('Pregnancies', 'Glucose');

2. Correlation Coefficients
The correlation coefficient is a statistical measure of the strength
of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship between the movement of the two variables.

Example
# Import those libraries
import pandas as pd
from scipy.stats import pearsonr
# Import your data into Python
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Convert dataframe into series
list1 = df['BloodPressure']
list2 = df['SkinThickness']
# Apply the pearsonr()
corr, _ = pearsonr(list1, list2)
print('Pearsons correlation: %.3f' % corr)

3. Simple Linear Regression


Simple linear regression is a statistical method that we can use to
find a relationship between two variables and make predictions. The two
variables used are typically denoted as y and x. The independent variable,
or the variable used to predict the dependent variable is denoted as x. The
dependent variable, or the outcome/output, is denoted as y.
A simple linear regression model will produce a line of best fit, or
the regression line. You may have heard about drawing the line of best fit
through a scatter plot of data.
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# simple linear regression uses a single predictor, so take one column
X = dataset.iloc[:, :1].values # first column (Pregnancies) as the independent variable
y = dataset.iloc[:, 1].values # second column (Glucose) as the dependent variable
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Sample Output

RESULT:

Thus the Bivariate analysis on the diabetes data set was executed
successfully.

Ex.No.5(c) Perform Multiple Regression Analysis on the
diabetes data set
Date:

AIM:
To use UCI and Pima Indians Diabetes data set for Multiple
Regression Analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians
Diabetes data set.
2. Load the above data files into google colab.
3. Perform multiple regression analysis on data sets.

Multiple Regression Analysis


Multiple regression is like linear regression, but with more than
one independent value, meaning that we try to predict a value based on
two or more variables.
Example
# Pima_diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
X = df[['Pregnancies', 'Glucose']]
y = df['BloodPressure']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the Blood Pressure based on Pregnancies and Glucose level:
predictedBP = regr.predict([[4, 120]])
print(predictedBP)

# UCI-Diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Time must be numeric for regression; assuming it is stored as "HH:MM"
# strings, convert it to minutes since midnight
df['Time'] = df['Time'].str.split(':').apply(lambda t: int(t[0]) * 60 + int(t[1]))
X = df[['Time', 'Code']]
y = df['Value']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the Value based on Time (13:23 -> 803 minutes) and Code:
predictedValue = regr.predict([[803, 46]])
print(predictedValue)

Sample Output

RESULT
Thus the Multiple Regression analysis on the Diabetes data of UCI and Pima was performed successfully.

Ex.No.6(a) Apply and explore Normal curve & Histogram plotting functions on UCI-Iris data sets
Date:

AIM:
To apply and explore Normal curves & Histograms plotting
functions on UCI-Iris data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve and Histograms for Iris data set.

Normal Curves
It is a probability function used in statistics that tells about how
the data values are distributed. It is the most important probability
distribution function used in statistics because of its advantages in real
case scenarios.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating mean and standard deviation of sepal length
mean = df["sepal.length"].mean()
sd = df.loc[:,"sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()

Histograms plotting functions

A histogram is basically used to represent data provided in the form of some groups. It is an accurate method for the graphical representation of numerical data distribution. It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis gives information about frequency.
Example
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('/content/drive/MyDrive/Data_Science/iris.csv')
data = df['sepal.length']
bins = np.arange(min(data), max(data) + 1, 1)
plt.hist(data, bins = bins, density = True)
plt.ylabel('density')
plt.xlabel('sepal.length')
plt.show()
Sample Output

RESULT
Thus the Normal curve and Histogram plotting functions were applied to the UCI data set successfully.
Ex.No 6(b) Density and contour plotting functions on UCI-
Iris data sets.
Date:

AIM:
To apply and explore Density & Contour plotting functions on
UCI-Iris data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.

Density Plotting
Density Plot is a type of data visualization tool. It is a variation of the histogram that uses 'kernel smoothing' while plotting the values. It is a continuous and smooth version of a histogram inferred from the data.
Density plots use Kernel Density Estimation (so they are also known as Kernel density estimation plots or KDE plots), which is a probability density function. The region of the plot with a higher peak is the region with maximum data points residing between those values.

Example - Density plot of several variables

# libraries & dataset
import seaborn as sns
import matplotlib.pyplot as plt
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
sns.set(style="darkgrid")
df = sns.load_dataset('iris')
# plotting both distributions on the same figure
# (shade= was renamed fill= in recent seaborn releases)
fig = sns.kdeplot(df['sepal_width'], fill=True, color="r")
fig = sns.kdeplot(df['sepal_length'], fill=True, color="b")
plt.show()

Contour plotting
Contour plots, also called level plots, are a tool for doing multivariate analysis and visualizing 3-D plots in 2-D space. If we consider X and Y as our variables, the response Z will be plotted as slices on the X-Y plane, due to which contours are sometimes referred to as Z-slices or iso-response values.
Contour plots are widely used to visualize density, altitudes or heights of mountains, as well as in the meteorological department.

Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
# this example expects grid-formatted data: the first row holds the x
# values, the first column the y values, and the rest the z values
px_orbital = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
x = px_orbital.iloc[0, 1:]
y = px_orbital.iloc[1:, 0]
px_values = px_orbital.iloc[1:, 1:]
mpl.rcParams['font.size'] = 14
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
# contour levels and colorbar ticks derived from the data range
pmin = px_values.values.min()
pmax = px_values.values.max()
levels = np.linspace(pmin, pmax, 20)
ticks = np.linspace(pmin, pmax, 6)
fig, ax = plt.subplots()
CS = ax.contourf(x, y, px_values, cmap="RdBu", levels=levels)
ax.set_aspect('equal')
ax.set_xlabel('x')
ax.set_ylabel('y')
fig.colorbar(CS, format="%.3f", ticks=ticks)
plt.show()
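Because the file-based example above assumes grid-formatted data rather
than the raw Iris table, the following self-contained sketch (synthetic
data, not from the data set) may make the contour idea clearer:

import numpy as np
import matplotlib.pyplot as plt
# evaluate z = sin(x)^10 + cos(10 + x*y) * cos(x) on a mesh grid
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) ** 10 + np.cos(10 + Y * X) * np.cos(X)
# filled contour plot with 20 levels and a colorbar
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar()
plt.show()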

Sample Output

RESULT
Thus the UCI data set was plotted using Density & Contour
plotting was executed successfully.

Ex.No 6(c) Correlation and scatter plotting functions on UCI data sets
Date:

AIM:
To apply and explore correlation & scatter plotting functions on UCI-Iris
data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into Google Colab.
3. Plot the correlation matrix and scatter plots for the Iris data set.

Correlation Matrix Plotting

Correlation gives an indication of how related the changes are
between two variables. If two variables change in the same direction they
are positively correlated. If they change in opposite directions together
(one goes up, one goes down), then they are negatively correlated.
You can calculate the correlation between each pair of attributes.
This is called a correlation matrix. You can then plot the correlation
matrix and get an idea of which variables have a high correlation with
each other.
This is useful to know, because some machine learning algorithms
like linear and logistic regression can have poor performance if there are
highly correlated input variables in your data.

Example
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url =
"https://raw.githubusercontent.com/jbrownlee/Datasets/ma
ster/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi',
'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
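The matshow example above uses the Pima Indians Diabetes data; the same
idea applies to the Iris data this exercise names. A short sketch
(assuming seaborn is installed, and using its built-in copy of the data
set so no file path is needed) plots the Iris correlation matrix as an
annotated heatmap:

import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
# correlate only the numeric measurement columns
corr = iris.drop(columns='species').corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()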
Scatter Plotting
A scatterplot shows the relationship between two variables as dots
in two dimensions, one axis for each attribute. You can create a
scatterplot for each pair of attributes in your data. Drawing all these
scatterplots together is called a scatterplot matrix.
Scatter plots are useful for spotting structured relationships
between variables, like whether you could summarize the relationship
between two variables with a line. Attributes with structured relationships
may also be correlated and good candidates for removal from your dataset.

Example
# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url =
"https://raw.githubusercontent.com/jbrownlee/Datasets/ma
ster/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi',
'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()
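As a hedged alternative (assuming seaborn is installed), seaborn's
pairplot draws the same kind of scatterplot matrix for the Iris data and
colors the points by species, which makes any class structure visible:

import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
# one scatter panel per pair of attributes, colored by species
sns.pairplot(iris, hue='species')
plt.show()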

Sample Output

e. Three dimensional plotting

Program :

import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
# create a set of 3-D axes
ax = plt.axes(projection='3d')
# data for a three-dimensional spiral line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# data for 3-D points scattered around the line
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Reds')
plt.show()
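A related sketch (not part of the program above) draws a 3-D surface,
another common use of the same mplot3d toolkit:

import numpy as np
import matplotlib.pyplot as plt
# evaluate z = sin(sqrt(x^2 + y^2)) on a mesh grid
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X ** 2 + Y ** 2))
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
ax.set_title('3-D surface')
plt.show()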

Output:

RESULT

Thus the correlation, scatter and three dimensional plotting functions
were applied to the UCI data sets and executed successfully.

Ex.No.7 Visualizing Geographic Data with Basemap
Date:

AIM:
To visualize Geographic Data with Basemap using Zomato
geographic data.

ALGORITHM:
1. Study the basics of Basemap.
2. Use Zomato data to plot city names and restaurants details.

Basemap Introduction
Basemap is a toolkit under the Python visualization library
Matplotlib. Its main function is to draw 2D maps, which are important for
visualizing spatial data. Basemap itself does not do any plotting; it
provides the ability to transform coordinates into one of 25 different map
projections, on top of which Matplotlib draws the actual map.
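Before the Zomato walkthrough, a minimal sketch (assuming Basemap is
installed, e.g. via "pip install basemap") shows the basic workflow:
create a Basemap object with a projection, then draw map features on it.

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Miller cylindrical projection covering most of the globe
m = Basemap(projection='mill', llcrnrlat=-60, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180)
m.drawcoastlines()
m.drawcountries()
plt.title("World map drawn with Basemap")
plt.show()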

Zomato data Visualization

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from glob import glob as gb

# list all the directories
dirs=os.listdir("C:/Users/IT LAB-I/Desktop/Data_Science/zomato_data/")
dirs

len(dirs)

#storing all the files from every directory


li=[]
for dir1 in dirs:
    files=os.listdir("C:/Users/IT LAB-I/Desktop/Data_Science/zomato_data/"+dir1)
    # read each file from the list of files and create a pandas data frame
    for file in files:
        df_file=pd.read_csv("C:/Users/IT LAB-I/Desktop/Data_Science/zomato_data/"+dir1+"/"+file,quotechar='"',delimiter="|")
        # append the data frame's values to the list
        li.append(df_file.values)
len(li)
len(li)
# numpy's vstack method stacks the sequence of data frames vertically into
# a single array
df_np=np.vstack(li)
# add the header columns
df_final=pd.DataFrame(df_np,columns=["NAME","PRICE","CUSINE_CATEGORY",
"CITY","REGION","URL","PAGE NO","CUSINE TYPE","TIMING","RATING_TYPE",
"RATING","VOTES"])
#displaying the dataframe
df_final
# the header column "PAGE NO" is not required; it was used while scraping
# the data from Zomato for validation, so remove the column
df_final.drop(columns=["PAGE NO"],inplace=True)

# display the dataframe again
df_final

# list the unique city names present in the data
df_final["CITY"].unique()

# import the json and requests libraries to use Google APIs to get the
# longitude and latitude values
import requests
import json
#creating a separate array with all city names as elements of array
city_name=df_final["CITY"].unique()
li1=[]
# Google Maps geocoding API URL
geo_s = 'https://maps.googleapis.com/maps/api/geocode/json'
# iterate through a for loop for each city name
for i in range(len(city_name)):
    # a personal Google Maps API key is required here; replace with your own
    param = {'address': city_name[i], 'key': 'YOUR_GOOGLE_MAPS_API_KEY'}
    response = requests.get(geo_s, params=param)
    response = response.text
    data = json.loads(response)
    # set up variables with the corresponding city latitude and longitude
    lat = data["results"][0]["geometry"]["location"]["lat"]
    lng = data["results"][0]["geometry"]["location"]["lng"]
    # create a new data frame with city, latitude and longitude as columns
    df2 = pd.DataFrame([[city_name[i], lat, lng]])
    li1.append(df2.values)

# numpy's vstack method stacks the data frames vertically into a single array
df_np=np.vstack(li1)

# create a second dataframe with city name, latitude and longitude
df_sec=pd.DataFrame(df_np,columns=["CITY","lat","lng"])

# display the second dataframe contents
df_sec

# merge this data frame into the existing df_final data frame using the
# pandas merge feature, creating a new data frame
df_final2=df_final.merge(df_sec,on="CITY",how="left")

#display the contents , it will have longitude and latitude now


df_final2

# create a pandas series holding the city names and corresponding counts of
# restaurants in ascending order
li2=df_final["CITY"].value_counts().sort_values(ascending=True)
li2
# create another data frame from the above series
df_map=li2.reset_index()
df_map.columns=["CITY","COUNT"]
# merge in the latitude and longitude so each city can be placed on the map
df_map_final=df_map.merge(df_sec,on="CITY",how="left")
# import the libraries for map plotting
from matplotlib import cm
from matplotlib.dates import date2num
from mpl_toolkits.basemap import Basemap
# for date and time processing
import datetime

# take a data frame of the top 20 cities with the most restaurant counts
df_plot_top=df_map_final.tail(20)

# display the data frame
df_plot_top

# let's plot this inside the map at the cities' exact co-ordinates
# which we received from the Google API
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution
="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=7
7)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
# using a lambda function to create different marker sizes as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
# plt.scatter takes longitude, latitude, marker size, shape and color as
# parameters; in this plot the marker color is always blue.
plt.scatter(x,y,s=p_s,marker="o",c='BLUE')
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS
PER ZOMATO",fontsize=30,color='RED')

# let's plot this inside the map at the cities' exact co-ordinates which we
# received from the Google API; here the marker color will differ as per
# the marker size
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution
="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=7
7)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
# using a lambda function to create different marker sizes as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
# plt.scatter takes longitude, latitude, marker size, shape and color as
# parameters; in this plot the marker color varies with the marker size.
plt.scatter(x,y,s=p_s,marker="o",c=p_s)

plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS
PER ZOMATO",fontsize=30,color='RED')

# let's plot the city names inside the map at the cities' exact
# co-ordinates which we received from the Google API; here the marker
# color will differ as per the marker size
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution
="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=7
7)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
# using a lambda function to create different marker sizes as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
# plt.scatter takes longitude, latitude, marker size, shape and color as
# parameters; in this plot the marker color varies with the marker size.
plt.scatter(x,y,s=p_s,marker="o",c=p_s)
for a,b,c,d in zip(x,y,city_name,pt):
    # plt.text takes x position, y position, text, font size and color as
    # arguments
    plt.text(a,b,c,fontsize=30,color="r")
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS
PER ZOMATO",fontsize=30,color='RED')

# let's plot the city names and restaurant counts inside the map at the
# cities' exact co-ordinates which we received from the Google API; here
# the marker color will differ as per the marker size
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution
="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=7
7)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])

pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
# using a lambda function to create different marker sizes as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
# plt.scatter takes longitude, latitude, marker size, shape and color as
# parameters; in this plot the marker color varies with the marker size.
plt.scatter(x,y,s=p_s,marker="o",c=p_s)
for a,b,c,d in zip(x,y,city_name,pt):
    # plt.text takes x position, y position, text (city name), font size
    # and color as arguments
    plt.text(a,b,c,fontsize=30,color="r")
    # plot the restaurant count the same way, but with the x and y positions
    # offset to keep the labels clean and easier to read
    plt.text(a+60000,b+30000,d,fontsize=30)
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS
PER ZOMATO",fontsize=30,color='RED')
Sample Output

RESULT

Thus the Zomato geographic data was visualized successfully
using Basemap.

