Fundamentals of Data Science Students
ENGINEERING COLLEGE
(Managed By Tamil Nadu Educational and Medical Trust)
Thoraipakkam, Chennai – 600097.
DEPARTMENT
OF
COMPUTER SCIENCE AND ENGINEERING
Register Number
BONAFIDE CERTIFICATE
This is to certify that this is a bonafide record of work done
by……………………………………………………… of B.E Computer Science
and Engineering in the DATA SCIENCE LABORATORY (CS3361) during the
Academic year 2023-2024.
VISION
Producing competent Computer Engineers with a strong background in the latest trends and
technology to achieve academic excellence and to become pioneers in software and hardware
products with an ethical approach to serve the society.
MISSION
To provide quality education in Computer Science and Engineering with the state of the art
facilities.
To provide the learning audience that helps the students to enhance problem solving skills and to
To serve the society by providing insight solutions to the real world problems by employing the
latest trends of computing technology with strict adherence to professional and ethical
responsibilities.
CS3361 - DATA SCIENCE LABORATORY SYLLABUS
COURSE OBJECTIVES:
1. To understand the python libraries for data science
2. To understand the basic Statistical and Probability measures for data science.
3. To learn descriptive analytics on the benchmark data sets.
4. To apply correlation and regression analytics on standard data sets.
5. To present and interpret data using visualization packages in Python.
LIST OF EXERCISES:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
COURSE OUTCOMES:
At the end of this course, the students will be able to:
AIM:
ALGORITHM:
Step 1: Start
Step 2: Download python 3.8 or higher and get-pip.py
Step 3: Install python with Add python.exe to PATH
Step 4: Drag and drop get-pip.py in terminal(cmd) and install
Step 5: Enter command in terminal(cmd)
a. python -m pip install --upgrade pip
b. python -m pip install numpy scipy jupyter statsmodels pandas
Step 6: Stop
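Once the packages are installed, a short check can confirm that each one is visible to the interpreter. The following is a minimal sketch; the helper name check_packages is hypothetical, and it assumes each import name matches the package name used in the pip command above.

```python
import importlib.util
import sys

# Packages named in the pip install command of this exercise
PACKAGES = ["numpy", "scipy", "jupyter", "statsmodels", "pandas"]

def check_packages(names):
    """Map each package name to True if the interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:12s} {'installed' if found else 'MISSING'}")
```

A package reported as MISSING usually means pip installed it into a different Python installation than the one running the script.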
SOURCE CODE:
OUTPUT:
RESULT:
Exp. no: 2
Date: WORKING WITH NUMPY ARRAYS
AIM
ALGORITHM
Step1: Start
Step2: Import numpy module
Step3: Print the basic characteristics and operations of array
Step4: Stop
PROGRAM
import numpy as np
# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])
# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
OUTPUT
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)
print("After slicing")
print(a[1:])
OUTPUT
[[1 2 3]
[3 4 5]
[4 5 6]]
After slicing
[[3 4 5]
[4 5 6]]
RESULT:
Exp. no: 3
Date: WORKING WITH PANDAS DATA FRAMES
AIM:
ALGORITHM:
Step 1:Start
Step 2:import pandas package with an alias name as pd
Step 3:Write the data in the form of a dictionary and store it in the variable 'data'
Step 4:assign variable 't' with pd.DataFrame(data)
Step 5:increment the index value of 't' by 1
Step 6:print the value 't'
Step 7:Stop.
SOURCE CODE:
import pandas as pd
data={"Name":["Ram","Subash","Rahul","Arun","Deepak"],"Age":[24,25,24,26,25],
"CGPA":[9.5,9.3,9.0,8.5,0.88]}
t=pd.DataFrame(data)
t.index+=1
print(t)
OUTPUT:
Name Age CGPA
1 Ram 24 9.50
2 Subash 25 9.30
3 Rahul 24 9.00
4 Arun 26 8.50
5 Deepak 25 0.88
RESULT:
Exp. no: 4 READING DATA FROM FILES AND EXPLORING
Date: VARIOUS COMMANDS FOR DOING DESCRIPTIVE
ANALYSIS ON IRIS DATASET
AIM:
ALGORITHM:
Step 1: Start
Step 2: Import pandas, numpy, matplotlib.pyplot and seaborn, and import load_iris from sklearn.datasets
Step 3: Call sns.set()
Step 4: Assign iris_data = pd.read_csv()
Step 5: print iris_data.head()
Step 6: print iris_data.describe()
Step 7: Set sns.countplot(x='species', data=iris_data) and call plt.show()
Step 8: Set sns.scatterplot(x='petal_length', y='petal_width',hue='species', data=iris_data)
Step 9: Set plt.legend(bbox_to_anchor=(1, 1), loc=1)
Step 10: Present via plt.show()
Step 11: Stop
SOURCE CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
sns.set()
# Raw string so that the backslashes in the Windows path are not read as escapes
iris_data = pd.read_csv(r"D:\Downloads\cse\IRIS.csv")
print(iris_data.head())
print("*********************Descriptive Analysis****************************")
print(iris_data.describe())
#SPECIES COUNT
sns.countplot(x='species', data=iris_data)
plt.show()
# COMPARING PETAL LENGTH AND PETAL WIDTH
sns.scatterplot(x='petal_length', y='petal_width',hue='species', data=iris_data)
plt.legend(bbox_to_anchor=(1, 1), loc=1)
plt.show()
OUTPUT:
#SPECIES COUNT
# COMPARING PETAL LENGTH AND PETAL WIDTH
RESULT:
Exp. no: 5a
UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE,
Date: VARIANCE, STANDARD DEVIATION, SKEWNESS AND KURTOSIS
AIM
ALGORITHM
Mean
Sum all the values in the dataset.
Divide the sum by the number of values in the dataset.
Median
Sort the dataset in ascending order.
If the number of observations is odd, the median is the middle value.
If the number of observations is even, the median is the average of the two middle values.
Mode
Count the occurrences of each unique value in the dataset.
The mode is the value(s) with the highest frequency.
Variance
Calculate the mean of the dataset.
Subtract the mean from each data point, square the result, and sum all the squared differences.
Divide the sum by the number of data points.
Standard Deviation
Calculate the variance.
Take the square root of the variance.
Skewness
Calculate the mean and standard deviation of the dataset.
For each data point, subtract the mean and divide by the standard deviation.
Calculate the mean of the cubed values of the results.
Skewness is this mean; equivalently, it is the mean of the cubed deviations from the mean divided by the cubed standard deviation.
Kurtosis
Calculate the mean and standard deviation of the dataset.
For each data point, subtract the mean and divide by the standard deviation.
Calculate the mean of the fourth power of these values.
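Kurtosis is this mean; equivalently, it is the mean of the fourth-power deviations from the mean divided by the fourth power of the standard deviation. Subtracting 3 gives the excess kurtosis, which scipy.stats.kurtosis reports by default.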
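The steps above can be sketched directly in plain Python. This is a minimal illustration using population formulas (division by n); the function name describe and the sample values are hypothetical.

```python
import math
import statistics

def describe(values):
    """Population mean, variance, standard deviation, skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Variance: mean of the squared deviations from the mean
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    # Standardize each point, then average the 3rd and 4th powers
    z = [(v - mean) / std for v in values]
    skewness = sum(t ** 3 for t in z) / n
    kurtosis = sum(t ** 4 for t in z) / n
    return {"mean": mean, "variance": variance, "std": std,
            "skewness": skewness, "kurtosis": kurtosis}

sample = [1, 2, 3, 4, 5]  # hypothetical data
stats = describe(sample)
print(stats)
print("median:", statistics.median(sample))
print("mode of [1, 2, 2, 3]:", statistics.mode([1, 2, 2, 3]))
```

The kurtosis returned here is the plain fourth standardized moment; subtract 3 to compare it with the excess kurtosis that scipy.stats.kurtosis returns by default.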
SOURCE CODE:
import statistics
from scipy.stats import skew, kurtosis

# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]

# using mean() to calculate the average of the list elements
print("The average of list values is : ", end="")
print(statistics.mean(li))

# median() returns the middle value of the sorted data
print("The median of list values is : ", statistics.median(li))

# mode() returns the most frequent value; it also works on non-numeric data
# tuple of strings
data5 = ("red", "blue", "black", "blue", "black", "black", "brown")
print("The mode of data5 is : ", statistics.mode(data5))

# population variance and standard deviation (sum of squares divided by n)
print("The variance of list values is : ", statistics.pvariance(li))
print("The standard deviation of list values is : ", statistics.pstdev(li))

# Creating a dataset
dataset = [88, 85, 82, 97, 67, 77, 74, 86,
           81, 95, 77, 88, 85, 76, 81]
# Calculate the skewness
print(skew(dataset, axis=0, bias=True))
# Calculate the excess kurtosis (fourth standardized moment minus 3)
print(kurtosis(dataset, axis=0, bias=True))
OUTPUT:
RESULT:
Exp. no: 5b
BIVARIATE ANALYSIS: LINEAR AND LOGISTIC
Date:
REGRESSION MODELING
AIM
ALGORITHM
estimate_coef Function:
Calculate the mean of x and y.
Initialize variables SS_xy and SS_xx to zero.
Iterate through each observation in x and y.
Update SS_xy by adding the product of the deviations (x - mean(x)) and (y - mean(y)).
Update SS_xx by adding the square of the deviation (x - mean(x)).
Calculate the slope (b_1) as SS_xy / SS_xx.
Calculate the intercept (b_0) using the formula b_0 = mean(y) - b_1 * mean(x).
Return the tuple (b_0, b_1).
plot_regression_line Function:
Scatter plot the actual data points using Matplotlib.
Calculate the predicted response vector y_pred using the
regression coefficients.
Plot the regression line using Matplotlib.
Display the plot with labels for the x and y axes.
main Function:
Define the observations/data (x and y arrays).
Call the estimate_coef function to obtain the regression
coefficients.
Print the estimated coefficients.
Call the plot_regression_line function to visualize the
regression line.
Main Program Execution:
If the script is run as the main program (if __name__ ==
"__main__": block):
Execute the main function.
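One way to check the estimate_coef steps is to run them on data with a known line and compare against np.polyfit. The sketch below assumes hypothetical data lying exactly on y = 2x + 1; it omits plotting so that it runs without a display.

```python
import numpy as np

def estimate_coef(x, y):
    """Least-squares intercept b_0 and slope b_1 for y = b_0 + b_1 * x."""
    m_x, m_y = np.mean(x), np.mean(y)
    # Sum of products of deviations, and sum of squared x deviations
    ss_xy = np.sum((x - m_x) * (y - m_y))
    ss_xx = np.sum((x - m_x) ** 2)
    b_1 = ss_xy / ss_xx
    b_0 = m_y - b_1 * m_x
    return b_0, b_1

# Hypothetical data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

b_0, b_1 = estimate_coef(x, y)
# np.polyfit with degree 1 returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
print("manual :", b_0, b_1)
print("polyfit:", intercept, slope)
```

Both approaches minimize the same sum of squared residuals, so they should agree up to floating-point rounding.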
SOURCE CODE:
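import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    # (algebraically equal to summing the products of deviations)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # actual points as a scatter plot, predicted response as a line
    plt.scatter(x, y)
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    # illustrative y values (assumed; the y data is not given in this record)
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()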
OUTPUT:
Estimated coefficients:
RESULT:
Exp. no: 5 c
Date: MULTIPLE REGRESSION ANALYSIS
AIM
ALGORITHM
SOURCE CODE:
import numpy as np
from sklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55,
x, y = np.array(x), np.array(y)
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
y_new = model.predict(x_new)
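The fit performed by LinearRegression is an ordinary least-squares solve, which can be sketched with NumPy alone. The data below is hypothetical and generated from y = 1 + 2*x1 + 3*x2, so the recovered coefficients are known in advance.

```python
import numpy as np

# Hypothetical observations with two predictors per row
X = np.array([[0, 1], [1, 0], [2, 3], [3, 1], [4, 5]], dtype=float)
# Response generated from y = 1 + 2*x1 + 3*x2 (assumed for illustration)
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

intercept, slopes = coef[0], coef[1:]
print("intercept:", intercept)
print("coefficients:", slopes)

# Predict the response for a new, hypothetical observation
x_new = np.array([5.0, 2.0])
print("prediction:", intercept + slopes @ x_new)
```

With noisy real data the coefficients will not be exact, and model.score(x, y) in the listing above then reports the coefficient of determination R².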
OUTPUT:
RESULT:
Exp. no: 5 d
COMPARE THE RESULTS OF ANALYSIS FOR THE TWO
Date:
DATA SETS.
AIM
ALGORITHM
Step 1: Start
Step 2: Import pandas
Step 3: Assign df1 = pd.read_csv("compardata1.csv")
Step 4: Assign df2=pd.read_csv("compardata2.csv")
Step 5: Assign c_RESULT = df1[df1.apply(tuple, 1).isin (df2.apply(tuple,1))]
print(c_RESULT)
Step 6: Assign c_RESULT1 = pd.merge(df1, df2)
Step 7: print c_RESULT1
Step 8: Stop
SOURCE CODE:
import pandas as pd
df1 = pd.read_csv("compardata1.csv")
df2=pd.read_csv("compardata2.csv")
c_RESULT = df1[df1.apply(tuple, 1).isin (df2.apply(tuple,1))]
print(c_RESULT)
c_RESULT1 = pd.merge(df1, df2)
print(c_RESULT1)
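With default arguments, pd.merge keeps only rows that agree on every shared column. To also see which rows occur in just one of the data sets, an outer merge with indicator=True can be used. The following is a minimal sketch with small hypothetical frames standing in for the two CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for compardata1.csv and compardata2.csv
df1 = pd.DataFrame({"id": [1, 2, 3], "score": [10, 20, 30]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [20, 31, 40]})

# Rows present in both frames (inner join on all shared columns)
common = pd.merge(df1, df2)

# Outer join; the _merge column records where each row came from
outer = pd.merge(df1, df2, how="outer", indicator=True)
only_df1 = outer[outer["_merge"] == "left_only"].drop(columns="_merge")
only_df2 = outer[outer["_merge"] == "right_only"].drop(columns="_merge")

print("Common rows:\n", common)
print("Only in df1:\n", only_df1)
print("Only in df2:\n", only_df2)
```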
RESULT:
Exp. no: 6a
Date: NORMAL CURVES
AIM:
ALGORITHM:
Step 1: Start
Step 2: Import the matplotlib.pyplot (plt), numpy (np) and math packages
Step 3: Assign x = np.arange(0, math.pi*2, 0.05)
Step 4:Assign y=np.sin(x)
Step 5:Using plt.plot(x,y) plot the graph
Step 6:Give labels to the x and y axis and a title to the plot
Step 7:Using the show() function show the plot
Step 8:Stop
SOURCE CODE:
from matplotlib import pyplot as plt
import numpy as np
import math
x = np.arange(0, math.pi*2, 0.05)
y = np.sin(x)
plt.plot(x,y)
plt.xlabel("angle")
plt.ylabel("sine")
plt.title('sine wave')
plt.show()
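For a normal curve, the y values can be produced in the same way as the sine values, using a normal probability density instead. The following is a minimal sketch with the standard library only; the mean 0 and standard deviation 1 are assumed parameters.

```python
from statistics import NormalDist

# Standard normal distribution (mean and standard deviation are assumptions)
dist = NormalDist(mu=0.0, sigma=1.0)

# Evaluate the density on a grid from -4.0 to 4.0 in steps of 0.05
xs = [i * 0.05 for i in range(-80, 81)]
ys = [dist.pdf(x) for x in xs]

# The bell curve peaks at the mean and is symmetric around it
peak_x = xs[ys.index(max(ys))]
print("peak at x =", peak_x, "with density", max(ys))
```

Passing xs and ys to plt.plot, as done for the sine wave, would draw the bell-shaped curve.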
OUTPUT:
RESULT:
Exp. no: 6b
Date: DENSITY AND CONTOUR PLOTS
AIM:
ALGORITHM:
• Use np.meshgrid to create a 2D grid (X, Y) from the 1D arrays feature_x and feature_y.
• Define a function Z that computes values based on the grid points (X, Y).
• Z = np.cos(X / 2) + np.sin(Y / 4).
• Use plt.contour to create contour lines based on the values of Z at different (X, Y) points
• Set the title of the plot using plt.title.
• Set labels for the x and y axes using plt.xlabel and plt.ylabel.
• Use plt.show to display the contour plot.
SOURCE CODE:
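import numpy as np
import matplotlib.pyplot as plt
feature_x = np.arange(0, 50, 2)  # sample range (assumed; not given in this record)
feature_y = np.arange(0, 50, 3)  # sample range (assumed; not given in this record)
X, Y = np.meshgrid(feature_x, feature_y)  # 2D grid from the 1D arrays
fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) + np.sin(Y / 4)
ax.contour(X, Y, Z)  # contour lines of Z over the grid
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()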
OUTPUT:
ALGORITHM:
• Import the required libraries: pandas for data manipulation, seaborn for data visualization, and
numpy for numerical operations.
• Import matplotlib.pyplot for additional customization of the plot.
• Generate or load the data that you want to visualize.
• If your data is not already in a pandas DataFrame, create one.
• Use seaborn.kdeplot to create a kernel density plot for the variable in the DataFrame.
• Customize the appearance by specifying optional parameters such as fill (named shade in older seaborn releases), color, etc.
• Use seaborn.set_style, seaborn.set_palette, and other styling functions to customize the
appearance of the plot.
• Use plt.xlabel, plt.ylabel, and plt.title to add labels and a title to the plot.
• Use plt.show to display the density plot.
SOURCE CODE:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
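A kernel density estimate of the kind that seaborn.kdeplot draws can be sketched with NumPy alone. This is an illustration, not seaborn's implementation: the function name gaussian_kde_1d is hypothetical, the samples are randomly generated, and the bandwidth is fixed by hand, whereas seaborn chooses one automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical data
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print("approximate area under the curve:", area)
```

Plotting grid against density with plt.plot would give a curve comparable to a one-dimensional density plot.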
RESULT:
Exp. no: 6c
Date: CORRELATION AND SCATTER PLOTS
AIM:
ALGORITHM:
Step 1:Start
Step 2:Import pandas(pd),matplotlib.pyplot(plt) and seaborn(sns) packages
Step 3:Read the dataset and store it in variable df and increment the value of index by 1
Step 4:Print the head value of the dataset using df.head() function
Step 5:Print the correlations of the dataset using df.corr(method='pearson') function
Step 6:Plot the scatter plot using sns.scatterplot(x=df.Pregnancies, y=df.Glucose, data=df)
Step 7:show the plots using show() function
Step 8:Stop
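The Pearson coefficient that df.corr(method='pearson') reports for each pair of columns can be computed from its definition. The following is a minimal sketch; the function name pearson_r and the two short value lists are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical values standing in for two columns of the data set
ages = [21, 30, 45, 50, 62]
glucose = [85, 99, 120, 118, 140]

r = pearson_r(ages, glucose)
print("Pearson r:", r)
# Cross-check against NumPy's built-in correlation matrix
print("np.corrcoef:", np.corrcoef(ages, glucose)[0, 1])
```

Values near +1 or -1 indicate a strong linear relationship, which shows up in the scatter plot as points lying close to a straight line.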
SOURCE CODE:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Importing Dataset
# Raw string so that the backslashes in the Windows path are not read as escapes
df=pd.read_csv(r"D:\Downloads\cse\diabetes.csv")
df.index+=1
print(df.head())
#Correlation
correlations = df.corr(method = 'pearson')
print("Correlations of attributes in the data:\n",correlations)
#SCATTER PLOT
sns.scatterplot(x=df.Pregnancies, y=df.Glucose, data=df)
plt.show()
OUTPUT:
RESULT
Exp. no: 6d HISTOGRAMS
Date:
AIM:
ALGORITHM:
Step 1: Start
Step 2: import pandas and matplotlib
Step 3: read the dataset and increment the index value of it by 1.
Step 4: Plot the histogram for the dataset using the hist() function and show it using the show() function.
Step 5: Stop.
SOURCE-CODE:
import pandas as pd
import matplotlib.pyplot as plt
#Importing Dataset
df=pd.read_csv(r"D:\Downloads\cse\diabetes.csv")
df.index+=1
#HISTOGRAM
df.hist()
plt.show()
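df.hist() draws one histogram per numeric column. The binning and counting behind a single histogram can be sketched with np.histogram; the values and the choice of three bins below are hypothetical.

```python
import numpy as np

# Hypothetical values standing in for one numeric column
values = [1, 2, 2, 3, 3, 3, 4]

# Split the range [min, max] into 3 equal-width bins and count the values
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which also includes the maximum value.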
OUTPUT:
RESULT:
Exp. no: 6e
Date: THREE-DIMENSIONAL PLOTTING
AIM:
ALGORITHM:
Step 1: Start
Step 2: import pandas ,matplotlib and mplot3d from mpl_toolkits
Step 3: read and store the dataset in the variable df
Step 4: increment the index of variable df by 1
Step 5: print df.head()
Step 6: assign variable x=df.Age , y=df.Pregnancies and z =df.DiabetesPedigreeFunction
Step 7: Plot the data using the figure(),axes(),get_cmap() functions of matplotlib and plot the
3D scatter plot using the scatter3D() function
Step 8: Set the label for x,y,z axes using set_(axes)label() function
Step 9: give the title of the plot using title() function
Step 10: show the plot using show() function
Step 11: Stop
SOURCE CODE:
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
#Importing Dataset
df=pd.read_csv(r"D:\Downloads\cse\diabetes.csv")
df.index+=1
print(df.head())
x=df.Age
y=df.Pregnancies
z=df.DiabetesPedigreeFunction
#THREE-DIMENSIONAL PLOTTING
fig = plt.figure(figsize = (10, 7))
ax = plt.axes(projection ="3d")
my_cmap = plt.get_cmap('hsv')
sctt = ax.scatter3D(x, y, z,alpha = 0.8,c = (x + y + z), cmap = my_cmap,marker ='*')
ax.set_xlabel('X-age')
ax.set_ylabel('Y-Pregnancies')
ax.set_zlabel('Z-DiabetesPedigreeFunction')
plt.title("3D scatter plot")
plt.show()
OUTPUT:
RESULT :
Exp. no: 7
Date: VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
AIM:
ALGORITHM:
Step 1: Start
Step 2: Import Basemap from mpl_toolkits.basemap and import matplotlib.pyplot
Step 3: Define sample longitude and latitude coordinates.
Step 4: Create a Basemap object m and use it to convert the longitude and latitude coordinates to map coordinates.
Step 5: Assign the title of the map using the “title()” method.
Step 6: Display the map using the “show()” method.
Step 7: Stop.
SOURCE CODE:
OUTPUT:
RESULT :
ADDITIONAL PROGRAMS
Write a NumPy program to create a null vector of size 10 and update sixth value to 11
PROGRAM
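A minimal sketch consistent with the output shown below. Note that the output places 11 at index 6, which is the seventh element when counting from zero, so the sketch follows the output rather than the wording of the task.

```python
import numpy as np

# Null vector: ten zeros stored as floats
x = np.zeros(10)
print(x)

print("Update sixth value to 11")
# Index 6 matches the position of 11 in the output shown for this exercise
x[6] = 11
print(x)
```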
OUTPUT
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Update sixth value to 11
[ 0. 0. 0. 0. 0. 0. 11. 0. 0. 0.]
OUTPUT
Original array
[1, 2, 3, 4]
Array converted to a float type:
[1. 2. 3. 4.]
Write a NumPy program to create a 3x3 matrix with values ranging from 2 to 10
PROGRAM
# Importing the NumPy library with an alias 'np'
import numpy as np
# Creating a NumPy array 'x' using arange() from 2 to 11 and reshaping it into a 3x3 matrix
x = np.arange(2, 11).reshape(3, 3)
# Printing the resulting 3x3 matrix
print(x)
OUTPUT
[[ 2 3 4]
[ 5 6 7]
[ 8 9 10]]
OUTPUT
PROGRAM
OUTPUT
PROGRAM
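import numpy as np
# Input list and tuple, with values taken from the output of this exercise
my_list = [1, 2, 3, 4, 5, 6, 7, 8]
my_tuple = ([8, 4, 6], [1, 2, 3])
# Printing a message indicating the conversion of the list to an array using np.asarray() function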
print("List to array: ")
# Converting the Python list to a NumPy array using np.asarray() and printing the resulting array
print(np.asarray(my_list))
# Printing a message indicating the conversion of the tuple to an array using np.asarray() function
print("Tuple to array: ")
# Converting the Python tuple to a NumPy array using np.asarray() and printing the resulting array
print(np.asarray(my_tuple))
OUTPUT
List to array:
[1 2 3 4 5 6 7 8]
Tuple to array:
[[8 4 6]
[1 2 3]]
Write a NumPy program to merge three given NumPy arrays of the same shape
PROGRAM
import numpy as np
arr1 = np.random.random(size=(25, 25, 1))
arr2 = np.random.random(size=(25, 25, 1))
arr3 = np.random.random(size=(25, 25, 1))
print("Original arrays:")
print(arr1)
print(arr2)
print(arr3)
result = np.concatenate((arr1, arr2, arr3), axis=-1)
print("\nAfter concatenate:")
print(result)
OUTPUT
Original arrays:
[[[0.23424822]
[0.51175253]]
[[0.57232915]
[0.22516223]]]
[[[0.01776688]
[0.40250687]]
[[0.10133723]
[0.67184758]]]
[[[0.22401405]
[0.28253877]]
[[0.23720417]
[0.09512562]]]
After concatenate:
[[[0.23424822 0.01776688 0.22401405]
[0.51175253 0.40250687 0.28253877]]
PROGRAM
# Importing the NumPy library with an alias 'np'
import numpy as np
# Creating a 3x3 array 'x' filled with ones
x = np.ones((3, 3))
print("Original array:")
print(x)
# Modifying the array 'x' to set 0s on the border and 1s inside the array using the np.pad function
print("0 on the border and 1 inside in the array")
x = np.pad(x, pad_width=1, mode='constant', constant_values=0)
# Printing the modified array 'x' with 0s on the border and 1s inside
print(x)
OUTPUT
Original array:
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[0. 0. 0. 0. 0.]
[0. 1. 1. 1. 0.]
[0. 1. 1. 1. 0.]
[0. 1. 1. 1. 0.]
[0. 0. 0. 0. 0.]]
PROGRAM
import numpy as np
x = [10, 20, 30]
# Appending values to the end of the array using np.append() and assigning the result back to 'x'
x = np.append(x, [[40, 50, 60], [70, 80, 90]])
OUTPUT
Original array:
[10, 20, 30]
After append values to the end of the array:
[10 20 30 40 50 60 70 80 90]
Write a NumPy program to find the real and imaginary parts of an array of complex numbers
PROGRAM
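A minimal sketch for this task, using the real and imag attributes of a NumPy array. The complex values are hypothetical.

```python
import numpy as np

# Hypothetical array of complex numbers
z = np.array([1 + 2j, 3 - 4j, 0.5j])

print("Original array:")
print(z)
print("Real parts:")
print(z.real)
print("Imaginary parts:")
print(z.imag)
```

np.real(z) and np.imag(z) return the same values as the attribute form used here.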
OUTPUT
PROGRAM
import numpy as np
np_array = np.array([[1,2,3], [4,5,6] , [7,8,9], [10, 11, 12]])
test_array = np.array([4,5,6])
print("Original Numpy array:")
print(np_array)
print("Searched array:")
print(test_array)
print("Index of the searched array in the original array:")
print(np.where((np_array == test_array).all(1))[0])
OUTPUT