D P Lab Manual
Aim:
1. To understand and apply the analytical concepts of Python.
2. To study the basic Python libraries used for machine learning and data science.
SOFTWARE REQUIREMENTS:
1. Ubuntu 14.04 / 14.10
2. Python 3.9
3. Anaconda Spyder / Jupyter Notebook
THEORY:
Python libraries for Machine Learning
Machine Learning, as the name suggests, is the science of programming computers so that they are able to learn from different kinds of data. A more general definition, given by Arthur Samuel, is: “Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.” Machine learning techniques are typically used to solve various kinds of real-life problems.
In the older days, people used to perform Machine Learning tasks by manually coding all the algorithms and mathematical and statistical formulas. This made the process time-consuming, tedious and inefficient. Today, the task has become much easier and more efficient thanks to various Python libraries, frameworks and modules. Python is now one of the most popular programming languages for machine learning, and it has replaced many languages in the industry; one of the reasons is its vast collection of libraries.
Some of the Python libraries used in Machine Learning are:
● Numpy
● Pandas
● Matplotlib
● Scipy
● Scikit-learn
● Theano
● TensorFlow
● Keras
Numpy
NumPy is a very popular Python library for processing large multi-dimensional arrays and matrices with the help of a large collection of high-level mathematical functions. It is very useful for fundamental scientific computations in Machine Learning, particularly for its linear algebra, Fourier transform and random number capabilities. High-end libraries like TensorFlow use NumPy internally for the manipulation of tensors.
# Python program using NumPy for some basic mathematical operations
import numpy as np

v, w = np.array([9, 10]), np.array([11, 12])
x = np.array([[1, 2], [3, 4]])
print(np.dot(v, w))   # inner product of two vectors
print(np.dot(x, v))   # matrix-vector product
Output:
219
[29 67]
Pandas
Pandas is a popular Python library for data analysis. It is not directly related to Machine Learning, but as we know, the dataset must be prepared before training. Here Pandas comes in handy, as it was developed specifically for data extraction and preparation. It provides high-level data structures and a wide variety of tools for data analysis, including many built-in methods for grouping, combining and filtering data.
# importing pandas as pd
import pandas as pd

# a small example dataset (hypothetical values)
data = {"country": ["Brazil", "Russia", "India"], "population": [200.4, 143.5, 1252.0]}
data_table = pd.DataFrame(data)
print(data_table)
Output:
Matplotlib
Matplotlib is a very popular Python library for data visualization. Like Pandas, it is not directly related to Machine Learning. It particularly comes in handy when a programmer wants to visualize the patterns in the data. It is a 2D plotting library used for creating 2D graphs and plots. A module named pyplot makes plotting easy for programmers, as it provides features to control line styles, font properties, axis formatting, etc. It provides various kinds of graphs and plots for data visualization, viz., histograms, error charts, bar charts, etc.
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], label='squares')  # a simple line plot
plt.legend()  # Add a legend
plt.show()
Output:
Theano
We all know that Machine Learning is basically mathematics and statistics.
Theano is a popular python library that is used to define, evaluate and optimize
mathematical expressions involving multi-dimensional arrays in an efficient
manner. This is achieved by optimizing the utilization of the CPU and GPU. It is extensively used for unit testing and self-verification to detect and diagnose
different types of errors. Theano is a very powerful library that has been used
in large-scale computationally intensive scientific projects for a long time but is
simple and approachable enough to be used by individuals for their own
projects.
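As a minimal sketch of how Theano is typically used (assuming Theano is installed; the scalar names here are purely illustrative), a symbolic expression is defined and then compiled into a callable function:
import theano
import theano.tensor as T

# define two symbolic scalars and an expression over them
x = T.dscalar('x')
y = T.dscalar('y')
z = x + y

# compile the expression into a callable function and evaluate it
f = theano.function([x, y], z)
print(f(2, 3))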
TensorFlow
TensorFlow is a very popular open-source library for high performance
numerical computation developed by the Google Brain team in Google. As the
name suggests, Tensorflow is a framework that involves defining and running
computations involving tensors. It can train and run deep neural networks that
can be used to develop several AI applications. TensorFlow is widely used in the
field of deep learning research and application.
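As a small sketch of what "defining and running computations involving tensors" looks like (assuming TensorFlow 2.x, where operations run eagerly; the values are illustrative):
import tensorflow as tf

# define two constant tensors
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])

# multiply the tensors and print the resulting tensor
print(tf.matmul(a, b))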
Keras
Keras is a very popular Machine Learning library for Python. It is a high-level
neural networks API capable of running on top of TensorFlow, CNTK, or Theano.
It can run seamlessly on both CPU and GPU. Keras makes it really easy for ML beginners to build and design a neural network. One of the best things about Keras is that it allows for easy and fast prototyping.
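A minimal sketch of this fast prototyping (assuming TensorFlow's bundled Keras; the layer sizes and input shape are illustrative, not taken from a specific dataset):
from tensorflow import keras
from tensorflow.keras import layers

# a small fully connected network for a 10-class problem
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dense(10, activation='softmax'),
])

# compile with an optimizer, a loss and a metric, then inspect the architecture
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()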
Conclusion:
In this practical, we studied different types of Python ML libraries.
Aim:
1. To understand and apply the Data Pre-processing concept.
2. To study detailed Data Pre-processing concept in Python.
SOFTWARE REQUIREMENTS:
1. Ubuntu 16+
2. Python 3.9+
3. Anaconda Spyder / Jupyter Notebook
THEORY:
Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in Pandas. Many datasets simply arrive with missing data, either because it exists and was not collected or because it never existed. For example, users being surveyed may choose not to share their income or their address; in this way, many values in a dataset end up missing.
In Pandas, missing data is represented by two values:
● None: None is a Python singleton object that is often used for missing data in Python code.
● NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Pandas provides the following functions to detect, remove and replace missing values in a DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
● interpolate()
In order to check null values in a Pandas DataFrame, we use the isnull() function. It returns a DataFrame of Boolean values which are True for NaN values.
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(dict)

# using isnull() to flag the missing values
print(df.isnull())
Output
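Beyond isnull(), the other functions listed above fill, drop or interpolate the missing values. A short sketch on the same score data (the values are illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# fillna() replaces every NaN with a chosen value
print(df.fillna(0))

# dropna() removes the rows that contain at least one NaN
print(df.dropna())

# interpolate() fills NaN values using (by default) linear interpolation
print(df.interpolate())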
Conclusion:
Thus, we have studied different methods to detect and replace missing data in a dataset.
Theory:
In this practical, the main focus will be on using a variety of classification algorithms for both binary and multi-class problems; less emphasis will be placed on the theory behind them.
We can use libraries in Python such as scikit-learn for machine learning models,
and Pandas to import data as data frames.
These can easily be installed with pip and then imported into Python:
import sklearn as sk
import pandas as pd
We will look at data regarding coronary heart disease (CHD) in South Africa.
The goal is to use different variables, such as tobacco usage, family history, LDL cholesterol levels, alcohol usage and obesity, to predict whether a person has CHD.
The code below reads the data into a Pandas data frame, and then separates
the data frame into a y vector of the response and an X matrix of explanatory
variables:
import pandas as pd
import os

os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')

# read the CHD data into a data frame (the CSV file name is assumed here)
heart = pd.read_csv('Heart.csv', sep=',', header=0)
heart.head()

y = heart.iloc[:,9]
X = heart.iloc[:,:9]
When running this code, just be sure to change the file system path passed to os.chdir to suit your setup.
Logistic Regression
We first fit a logistic regression model to the data with scikit-learn's LogisticRegression. We can then use the predict method to predict the classes of new observations, as well as the score method to get the mean prediction accuracy:
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import pandas as pd
import os

os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')

heart = pd.read_csv('Heart.csv', sep=',', header=0)
heart.head()

y = heart.iloc[:,9]
X = heart.iloc[:,:9]

# fit the logistic regression model, then predict classes and report mean accuracy
LR = LogisticRegression(random_state=0, solver='lbfgs', max_iter=1000).fit(X, y)
LR.predict(X.iloc[460:,:])
round(LR.score(X,y), 4)
array([1, 1])
Support Vector Machines
Support Vector Machines (SVMs) are a type of classification algorithm that is more flexible: they can perform linear classification, but can also use other, non-linear basis functions. The following example uses a linear classifier to fit a hyperplane that separates the data into two classes:
import sklearn as sk
from sklearn import svm
import pandas as pd
import os

os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')

heart = pd.read_csv('Heart.csv', sep=',', header=0)

y = heart.iloc[:,9]
X = heart.iloc[:,:9]

# fit a linear support vector classifier, then predict classes and report mean accuracy
SVM = svm.LinearSVC()
SVM.fit(X, y)
SVM.predict(X.iloc[460:,:])
round(SVM.score(X,y), 4)
array([0, 1])
Random Forests
Random Forests are an ensemble learning method that fits multiple Decision Trees on subsets of the data and averages the results. We can again fit them
using sklearn, and use them to predict outcomes, as well as get mean
prediction accuracy:
import sklearn as sk
from sklearn.ensemble import RandomForestClassifier

# fit a random forest (example hyperparameters), then predict and report mean accuracy
RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
RF.fit(X, y)
RF.predict(X.iloc[460:,:])
round(RF.score(X,y), 4)
0.7338
Neural Networks
Neural networks can also be fit in scikit-learn, using the MLPClassifier from sklearn.neural_network; we then predict outcomes and get the mean prediction accuracy in the same way:
from sklearn.neural_network import MLPClassifier
# fit a multi-layer perceptron classifier (example settings), then predict and score
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(150, 10), random_state=0)
NN.fit(X, y)
NN.predict(X.iloc[460:,:])
round(NN.score(X,y), 4)
0.6537
Multi-Class Classification
While binary classification alone is incredibly useful, there are times when we
would like to model and predict data that has more than two classes. Many of
the same algorithms can be used with slight modifications.
Additionally, it is common to split data into training and test sets. This means
we use a certain portion of the data to fit the model (the training set) and save
the remaining portion of it to evaluate the predictive accuracy of the fitted
model (the test set).
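As a quick illustration (a sketch using scikit-learn's train_test_split on the heart data from above; the vowel data used below instead comes pre-split into separate training and test files):
from sklearn.model_selection import train_test_split

# hold out 30% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)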
# read the pre-split vowel training and test sets (file names assumed)
vowel_train = pd.read_csv('vowel.train.csv')
vowel_test = pd.read_csv('vowel.test.csv')

vowel_train.head()
y_tr = vowel_train.iloc[:,0]
X_tr = vowel_train.iloc[:,1:]
y_test = vowel_test.iloc[:,0]
X_test = vowel_test.iloc[:,1:]
Output: the first rows of vowel_train, with the response column y and the features x.1 through x.10.
We will now fit models and test them as is normally done in statistics/machine
learning: by training them on the training set and evaluating them on the test
set.
import sklearn as sk
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# read the vowel training and test sets again (file names assumed)
vowel_train = pd.read_csv('vowel.train.csv')
vowel_test = pd.read_csv('vowel.test.csv')
y_tr = vowel_train.iloc[:,0]
X_tr = vowel_train.iloc[:,1:]
y_test = vowel_test.iloc[:,0]
X_test = vowel_test.iloc[:,1:]

# fit each classifier on the training set, then evaluate it on the test set
LR = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_tr, y_tr)
LR.predict(X_test)
round(LR.score(X_test,y_test), 4)
SVM = svm.SVC().fit(X_tr, y_tr)
SVM.predict(X_test)
round(SVM.score(X_test, y_test), 4)
RF = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
RF.predict(X_test)
round(RF.score(X_test, y_test), 4)
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(150, 10), random_state=0).fit(X_tr, y_tr)
NN.predict(X_test)
round(NN.score(X_test, y_test), 4)
Output
0.5455
Although the implementations of these models were rather naive (in practice
there are a variety of parameters that can and should be varied for each
model), we can still compare the predictive accuracy across the models. This
will tell us which one is the most accurate for this specific training and test
dataset:
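One simple way to make this comparison (a sketch reusing the models fitted in the block above) is to collect the test-set accuracies and print them side by side:
# gather the mean test-set accuracy of each fitted classifier
scores = {'Logistic Regression': round(LR.score(X_test, y_test), 4),
          'SVM': round(SVM.score(X_test, y_test), 4),
          'Random Forest': round(RF.score(X_test, y_test), 4),
          'Neural Network': round(NN.score(X_test, y_test), 4)}
for name, accuracy in scores.items():
    print(name, accuracy)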
This shows us that for the vowel data, an SVM using the default radial basis
function was the most accurate.
Conclusion
To summarize this practical, we began by exploring the simplest form of classification: binary. This helped us to model data where our response could take one of two states.
We also saw how to fit and evaluate models with training and test sets.
Furthermore, we could explore additional ways to refine model fitting among
various algorithms.
Theory:
Feature Scaling is a technique to standardize the independent features present
in the data in a fixed range. It is performed during the data pre-processing to
handle highly varying magnitudes or values or units. If feature scaling is not
done, a machine learning algorithm tends to treat features with larger values as more important and features with smaller values as less important, regardless of the units of the values.
Example: if an algorithm does not use feature scaling, it can consider a value of 3000 metres to be greater than 5 km, which is not actually true, and in that case the algorithm will give wrong predictions. So we use feature scaling to bring all values to the same magnitude and thus tackle this issue, as the sketch below illustrates.
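To make the metres-versus-kilometres example concrete, here is a small hypothetical sketch: once all distances are expressed in the same unit and rescaled to a common range, their magnitudes become directly comparable:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# distances expressed in metres: 3 km, 5 km, 1 km and 10 km
distances = np.array([[3000.0], [5000.0], [1000.0], [10000.0]])

# rescale the column to the [0, 1] range
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(distances)
print(scaled)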
Code: Python code explaining the working of Feature Scaling on the data
""" PART 1
Importing Libraries """
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Sklearn library
from sklearn import preprocessing
""" PART 2
Importing Data """
data_set =
pd.read_csv('C:\\Users\\dell\\Desktop\\Data_for_Feature_Scaling.csv')
data_set.head()
""" PART 4
Handling the missing values """
# Scaled feature
x_after_min_max_scaler = min_max_scaler.fit_transform(x)
Standardisation = preprocessing.StandardScaler()
# Scaled feature
x_after_Standardisation = Standardisation.fit_transform(x)
Output :
Country Age Salary Purchased
0 France 44 72000 0
1 Spain 27 48000 1
2 Germany 30 54000 0
3 Spain 38 61000 0
4 Germany 40 1000 1
After Standardisation :
[[ 0.09536935 0.66527061]
[-1.15176827 -0.43586695]
[-0.93168516 -0.16058256]
[-0.34479687 0.16058256]
[-0.1980748 -2.59226136]
[-0.56487998 0.02294037]
[ 2.58964459 -0.25234403]
[ 0.38881349 0.98643574]
[ 0.53553557 1.16995867]
[-0.41815791 0.43586695]]
Conclusion:
Thus, we have studied feature scaling on a dataset.
Assignment 5
Title: Implement normalization on dataset.
Theory:
Data normalization in Python
Normalization refers to rescaling real-valued numeric attributes into a 0 to 1 range.
Python Code
import numpy as np
from sklearn import preprocessing

# generate a random 1 x 4 array and scale its values up
a = np.random.random((1, 4))
a = a * 20
normalized = preprocessing.normalize(a)   # rescale the row to unit norm
print(normalized)
Output
Conclusion:
Thus, we have studied normalization on a dataset.
Assignment 6
Title: Perform proper data labelling operation on dataset.
Theory:
Data Labeling
Explore the uses and benefits of data labeling, including different approaches and
best practices.
Data labeling, or data annotation, is part of the preprocessing stage when developing a machine learning (ML) model. It requires the identification of raw data (i.e., images, text files, videos), and then the addition of one or more labels to that data to specify its context, so that a machine learning model can learn from it.
Data labeling underpins different machine learning and deep learning use cases,
including computer vision and natural language processing (NLP).
Computers use both labeled and unlabeled data to train ML models: labeled data is used in supervised learning, while unlabeled data is used in unsupervised learning. Computers can also use combined data for semi-supervised learning, which reduces the need for manually labeled data while still providing a large annotated dataset.
The general tradeoff of data labeling is that while it can decrease a business’s time
to scale, it tends to come at a cost. More accurate data generally improves model
predictions, so despite its high cost, the value that it provides is usually well worth
the investment. Since data annotation provides more context to datasets, it
enhances the performance of exploratory data analysis as well as machine learning
(ML) and artificial intelligence (AI) applications. For example, data labeling produces
more relevant search results across search engine platforms and better product
recommendations on e-commerce platforms. Let’s delve deeper into other key
benefits and challenges:
Benefits
Data labeling provides users, teams and companies with greater context, quality and usability. More specifically, accurately labeled data gives models better context and leads to more precise predictions and more usable data.
Challenges
Data labeling is not without its challenges. In particular, it can be costly and time-consuming, and manual annotation is prone to human error.
No matter the approach, following established best practices optimizes data labeling accuracy and efficiency.
Though data labeling can enhance accuracy, quality and usability in multiple contexts across industries, its more prominent use cases include computer vision and natural language processing.
IBM offers more resources to help transcend data labeling challenges and maximize
your overall data labeling experience.
No matter your project size or timeline, IBM Cloud and IBM Watson can enhance
your data training processes, expand your data classification efforts, and simplify
complex forecasting models.
Conclusion:
Thus, we have studied data labeling operation on dataset.
Assignment 7
Title: Implement principal component analysis algorithm.
Theory:
Implementing PCA in Python with scikit-learn
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569 data items with 30 input attributes. There are two output classes: benign and malignant. Because there are 30 input features, it is impossible to visualize this data directly, which is why we reduce it to a few principal components.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
data.keys()

print(data['target_names'])
print(data['feature_names'])
Output:
3. Standardize and Apply PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df1 = pd.DataFrame(data['data'], columns=data['feature_names'])

# standardise the 30 features before applying PCA
scaling = StandardScaler()
scaling.fit(df1)
Scaled_data = scaling.transform(df1)

# keep the first three principal components
principal = PCA(n_components=3)
principal.fit(Scaled_data)
x = principal.transform(Scaled_data)
print(x.shape)
Output:
(569,3)
4. Check Components
The principal.components_ attribute provides an array in which the number of rows gives the number of principal components, while the number of columns is equal to the number of features in the original data.
principal.components_
plt.figure(figsize=(10, 10))
plt.scatter(x[:, 0], x[:, 1], c=data['target'], cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()
For three principal components, we need to plot a 3d graph. x[:,0] signifies the first
principal component. Similarly, x[:,1] and x[:,2] represent the second and the third
principal component.
fig = plt.figure(figsize=(10, 10))
# add a 3D subplot and plot the three components
axis = fig.add_subplot(111, projection='3d')
axis.scatter(x[:, 0], x[:, 1], x[:, 2],
             c=data['target'], cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)
plt.show()
Output:
print(principal.explained_variance_ratio_)
Output:
array([0.44272026, 0.18971182, 0.09393163])
Conclusion:
Thus, we have implemented principal component analysis algorithm.
Theory:
Introduction
The performance of a machine learning model not only depends on the model and
the hyperparameters but also on how we process and feed different types of
variables to the model. Since most machine learning models only accept numerical
variables, preprocessing the categorical variables becomes a necessary step.
A typical data scientist spends 70-80% of their time cleaning and preparing the data, and converting categorical data is an unavoidable part of this work. It not only improves the model quality but also helps in better feature engineering. Now the question is, how do we proceed? Which categorical data encoding method should we use?
Categorical variables are usually represented as text values and take on a limited number of possible values. A few examples:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT,
Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters,
PhD.
4. The grades of a student: A+, A, B+, B, B- etc.
In the above examples, the variables only have a definite set of possible values. Further, we can see there are two kinds of categorical data: ordinal data, where the categories have an inherent order, and nominal data, where they do not.
Label Encoding
We use this categorical data encoding technique when the categorical feature is ordinal. In this case, retaining the order is important, so the encoding should reflect the sequence.
In Label encoding, each label is converted into an integer value. We will create a
variable that contains the categories representing the education qualification of a
person.
import category_encoders as ce
import pandas as pd

train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma',
                                    'Bachelors', 'Bachelors', 'Masters',
                                    'Phd', 'High school', 'High school']})

# Original data
train_df

# label (ordinal) encode the 'Degree' column; a mapping can be passed to fix the order
encoder = ce.OrdinalEncoder(cols=['Degree'])
encoder.fit_transform(train_df)
One Hot Encoding
We use this categorical data encoding technique when the features are nominal (they do not have any order). In one-hot encoding, for each level of a categorical feature, we create a new variable. Each category is mapped to a binary variable containing either 0 or 1.
These newly created binary features are known as Dummy variables. The number
of dummy variables depends on the levels present in the categorical variable. This
might sound complicated. Let us take an example to understand this better. Suppose
we have a dataset with a category animal, having different animals like Dog, Cat,
Sheep, Cow, Lion. Now we have to one-hot encode this data.
After encoding, in the second table, we have dummy variables each representing a
category in the feature Animal. Now for each category that is present, we have 1 in
the column of that category and 0 for the others. Let’s see how to implement a
one-hot encoding in python.
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad',
                              'Bangalore', 'Delhi']})

# Original Data
data

# one-hot encode the 'City' column
encoder = ce.OneHotEncoder(cols='City', use_cat_names=True)
encoder.fit_transform(data)
Now let’s move to another very interesting and widely used encoding technique i.e
Dummy encoding.
Dummy Encoding
To understand this better, consider coding the same data using both the one-hot encoding and dummy encoding techniques. One-hot encoding uses 3 variables to represent the 3 categories, whereas dummy encoding uses only 2 variables.
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})

# Original Data
data

# dummy encoding with pandas: k categories become k-1 binary columns
pd.get_dummies(data, columns=['City'], drop_first=True)
Here using drop_first argument, we are representing the first label Bangalore using
0.
One-hot encoding and dummy encoding are two powerful and effective encoding schemes. They are also very popular among data scientists, but they may not be as effective when:
1. A large number of levels are present in the data. If there are many categories in a feature variable, we need a similarly large number of dummy variables to encode the data. For example, a column with 30 different values will require 30 new variables for coding.
2. If we have multiple categorical features in the dataset, a similar situation occurs and we again end up with several binary features, each representing a categorical feature and its multiple categories, e.g. a dataset having 10 or more categorical columns.
In both the above cases, these two encoding schemes introduce sparsity in the
dataset i.e several columns having 0s and a few of them having 1s. In other words, it
creates multiple dummy features in the dataset without adding much information.
Also, they might lead to a Dummy variable trap. It is a phenomenon where features
are highly correlated. That means using the other variables, we can easily predict the
value of a variable.
Due to the massive increase in the size of the dataset, such coding slows down the learning of the model and deteriorates the overall performance, which ultimately makes the model computationally expensive.
Effect Encoding
Effect encoding (also known as deviation or sum encoding) is similar to dummy encoding, but it uses 1, 0 and -1 to represent the categories.
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)

# Original Data
data

encoder.fit_transform(data)
Hash Encoder
Hashing has several applications, like data retrieval, checking data corruption, and data encryption. We have multiple hash functions available, for example Message Digest (MD, MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.
Just like one-hot encoding, the hash encoder represents categorical features using new dimensions. Here, the user can fix the number of dimensions after transformation using the n_components argument. Here is what I mean: a feature with 5 categories can be represented using N new features, and similarly a feature with 100 categories can also be transformed using N new features. Doesn't this sound amazing?
By default, the Hashing encoder uses the md5 hashing algorithm but a user can
pass any algorithm of his choice. If you want to explore the md5 algorithm, I
suggest this paper.
import category_encoders as ce
import pandas as pd
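Continuing from the imports above, a short sketch of the hash encoder (the city values are hypothetical and n_components is set to 6 purely for illustration):
# hypothetical data: a categorical 'City' column
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# hash-encode 'City' into a fixed number (6) of new features
encoder = ce.HashingEncoder(cols='City', n_components=6)
encoder.fit_transform(data)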
Binary Encoding
Binary encoding works really well when there are a high number of categories. For
example the cities in a country where a company supplies its products.
# Original Data: a hypothetical 'City' column, as in the earlier examples
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})
data
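A minimal sketch of binary encoding with category_encoders, continuing from the imports and the data defined above:
# binary-encode the 'City' column: each category index is written out in base 2
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)
encoder.fit_transform(data)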
BaseN Encoding
Before diving into BaseN encoding, let's first try to understand what the Base is here.
In the numeral system, the Base or the radix is the number of digits or a combination
of digits and letters used to represent the numbers. The most common base we use
in our life is 10 or decimal system as here we use 10 unique digits i.e 0 to 9 to
represent all the numbers. Another widely used system is binary i.e. the base is 2. It
uses 0 and 1 i.e 2 digits to express all the numbers.
For Binary encoding, the Base is 2 which means it converts the numerical values of
a category into its respective Binary form. If you want to change the Base of
encoding scheme you may use Base N encoder. In the case when categories are
more and binary encoding is not able to handle the dimensionality then we can use a
larger base such as 4 or 8.
# BaseN-encode the 'City' data above with base 5 (the Quinary system)
encoder = ce.BaseNEncoder(cols=['City'], base=5)
encoder.fit_transform(data)
In the above example, I have used base 5 also known as the Quinary system. It is
similar to the example of Binary encoding. While Binary encoding represents the
same data by 4 new features the BaseN encoding uses only 3 new variables.
Target Encoding
In target encoding, we calculate the mean of the target variable for each category and replace the category variable with that mean value. In the case of categorical target variables, the posterior probability of the target replaces each category.
# Original Data (hypothetical): a categorical feature and a numeric target
data = pd.DataFrame({'class': ['A', 'B', 'B', 'A', 'B'], 'Marks': [50, 30, 70, 80, 45]})
encoder = ce.TargetEncoder(cols='class')
encoder.fit_transform(data['class'], data['Marks'])
We perform target encoding on the training data only and code the test data using the results obtained from the training dataset. Although it is a very efficient coding scheme, it has issues that can deteriorate model performance, most notably target leakage and overfitting on rare categories.
Conclusion:
Thus, we have studied different techniques for encoding categorical data.