ML Unit 2
QUESTION BANK
1. Data Collection: The first step in any machine learning project is to collect
relevant data. This can be done using a variety of methods, including web
scraping, surveys, and data APIs.
2. Data Preprocessing: Once data has been collected, it must be cleaned and
preprocessed. This involves removing duplicates, filling in missing values, and
transforming the data into a format suitable for machine learning algorithms.
5. Model Training: This involves training the machine learning model on the data to
learn patterns and relationships between the features and the target variable.
6. Model Evaluation: Once the model has been trained, it must be evaluated to
determine its accuracy and performance. This can be done using various metrics
such as accuracy, precision, recall, and F1 score (see the sketch after this list).
7. Model Deployment: After the model has been evaluated, it can be deployed in
production to make predictions on new data.
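To make steps 5 to 7 concrete, here is a minimal, hedged sketch using scikit-learn on a synthetic dataset; the dataset (make_classification) and the choice of LogisticRegression are illustrative assumptions, not part of the original notes.

# Minimal sketch of steps 5-7 (training, evaluation, deployment-style prediction)
# on a synthetic dataset; the data and model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic data standing in for a collected and preprocessed dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Model training
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 6: Model evaluation with accuracy, precision, recall and F1 score
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))

# Step 7: "deployment" here simply means predicting on new, unseen data
print(model.predict(X_test[:3]))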
Almost anything can be turned into DATA. Building a deep understanding of the different
data types is a crucial prerequisite for doing Exploratory Data Analysis (EDA) and
Feature Engineering for Machine Learning models. You also need to convert the data
types of some variables in order to make appropriate choices for visual encodings in
data visualization.
Numerical Data
Numerical data is any data where data points are exact numbers. Statisticians might also
call numerical data quantitative data. This data has meaning as a measurement, such as a
person's height or a count of items. Numerical data can be further divided into
continuous and discrete data: continuous data can assume any value within a range,
whereas discrete data has distinct values.
For example, the number of students taking a Python class would be a discrete data set.
You can only have discrete whole-number values like 10, 25, or 33. A class cannot have
12.75 students enrolled; a student either joins a class or doesn't. On the other hand,
continuous data are numbers that can fall anywhere within a range, like a student's
average score of 88.25, which falls somewhere between 0 and 100.
The takeaway here is that numerical data is not ordered in time. They are just numbers
collected without any particular ordering.
Categorical Data
Categorical data represents characteristics, such as a person's gender, marital status, or
hometown. Categorical data can take numerical values. For example, maybe we would
use 1 for the colour red and 2 for blue. But these numbers don't have a mathematical
meaning; that is, we can't add them together or take the average.
In the context of supervised classification, categorical data would be the class label,
for example whether a piece of broadcast video is a programme or a commercial.
There is also something called ordinal data, which in some sense is a mix of numerical
and categorical data. In ordinal data, the data still falls into categories, but those
categories are ordered or ranked in some particular way. An example would be class
difficulty, such as beginner, intermediate, and advanced. Those three types would be a
way to label the classes, and they have a natural order of increasing difficulty.
Another example would be taking quantitative data and splitting it into groups, so that
each group becomes an ordered category (for instance, binning ages into ranges such as
10-19, 20-29, and so on).
For plotting purposes, ordinal data is treated much in the same way as categorical data.
But groups are usually ordered from lowest to highest so that we can preserve this
ordering.
Time Series Data
Time series data is a sequence of numbers collected at regular intervals over some
period of time. It is very important, especially in particular fields like finance. Time series
data has a temporal value attached to it, so this would be something like a date or a
timestamp.
For example, we might measure the average number of home sales over many years. The
difference between time series data and plain numerical data is that rather than having a
bunch of numerical values with no time ordering, time series data has an implied
ordering: there is a first data point collected and a last data point collected.
Text
Text data is basically just words. A lot of the time, the first thing you do with text is
turn it into numbers using some interesting representation such as the bag-of-words
formulation.
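As a hedged illustration of the bag-of-words idea, the short sketch below uses scikit-learn's CountVectorizer on two made-up sentences; the example text is an assumption.

# Bag-of-words sketch: turn text into word-count vectors with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # vocabulary learned from the text
print(counts.toarray())                      # word counts per document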
However, machine learning skills alone are not sufficient for solving real-world
problems and designing a better product; you also need good exposure to data
structures.
The data structure used for machine learning is quite similar to other software
development fields where it is often used.
Understanding data structures also helps you to build ML models and algorithms in a
much more efficient way than other ML professionals.
In other words, the data structure is the collection of data type 'values' which are stored
and organized in such a way that it allows for efficient access and modification.
The data structure is the ordered sequence of data, and it tells the compiler how a
programmer is using the data such as Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.
1. Linear Data structure:
The linear data structure is a special type of data structure that helps to organize and
manage data in a specific order where the elements are attached adjacently.
Array:
An array is one of the most basic and common data structures used in Machine
Learning. It is also used in linear algebra to solve complex mathematical problems. You
will use arrays constantly in machine learning, whether it is storing feature vectors,
converting a DataFrame column into a list during preprocessing, or building the
multi-dimensional matrices used for word embeddings.
An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.
Let's take an example of a Python array used in machine learning. Although the Python
array is quite different from arrays in other programming languages, the Python list
is more popular as it offers flexibility in data types and length. If you are using Python
for ML algorithms, it is better to start your journey with arrays.
Method      Description
extend()    Adds the elements of a list to the end of the current list.
index()     Returns the index of the first element with the specified value.
pop()       Removes an element from a specified position using an index number.
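A quick sketch of these list methods in use; the example list and values are made up.

# Illustrating the list methods from the table above
scores = [10, 25, 33]
scores.extend([12, 40])      # extend(): add the elements of another list to the end
print(scores)                # [10, 25, 33, 12, 40]
print(scores.index(25))      # index(): position of the first element equal to 25 -> 1
scores.pop(0)                # pop(): remove the element at index 0
print(scores)                # [25, 33, 12, 40]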
Stacks:
Stacks are based on the concept of LIFO (Last in First out) or FILO (First In Last Out).
Although stacks are easy to learn and implement in ML models, having a good grasp of
them can help in many computer science areas such as parsing grammars.
Stacks enable the undo and redo buttons on your computer, as the most recent action
always sits on top of the stack.
However, we can only check the most recent one that has been added. Addition and
removal occur at the top of the stack.
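A minimal sketch of a stack built on a plain Python list, assuming we model an undo history; the action strings are made up for illustration.

# Stack sketch: append() pushes onto the top, pop() removes from the top (LIFO)
history = []                 # the stack
history.append("typed 'a'")  # push
history.append("typed 'b'")  # push
print(history[-1])           # peek at the most recent action -> "typed 'b'"
print(history.pop())         # undo: remove and return the top element
print(history)               # ["typed 'a'"]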
Linked List:
A linked list is a collection of separately allocated nodes. In other words, it is a
collection of data elements in which each element consists of a value and a pointer
that points to the next node in the list.
In a linked list, insertion and deletion are constant time operations and are very efficient,
but accessing a value is slow and often requires scanning.
So, a linked list is very useful in place of a dynamic array, where inserting in the middle
would otherwise require shifting of elements.
Although insertion of an element can be done at the head, middle or tail position, it is
relatively costly. However, linked lists are easy to splice together and split
apart. Also, the list can be converted to a fixed-length array for fast access.
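A bare-bones sketch of a singly linked list, assuming a minimal Node class defined here for illustration (not a standard library type).

# Singly linked list sketch: each node holds a value and a pointer to the next node
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

head = Node(3, Node(7, Node(9)))      # 3 -> 7 -> 9
head = Node(1, head)                  # O(1) insertion at the head: 1 -> 3 -> 7 -> 9

node = head
while node is not None:               # accessing values requires scanning the list
    print(node.value)
    node = node.next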
Queue:
A queue works on the FIFO (First In First Out) principle: elements are added at the rear
and removed from the front, so items are processed in the order in which they arrive.
Hence, the queue is significant in a program where multiple lists of codes need to be
processed.
The queue data structure can be used to record the split time of a car in F1 racing.
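A small sketch of a queue using Python's collections.deque, assuming made-up lap split times as the data.

# Queue sketch (FIFO): split times are appended at the rear and processed from the front
from collections import deque

splits = deque()
splits.append(78.2)          # enqueue split time of lap 1 (made-up values, seconds)
splits.append(77.9)          # enqueue lap 2
splits.append(78.5)          # enqueue lap 3

while splits:
    print(splits.popleft())  # dequeue from the front: 78.2, 77.9, 78.5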
2. Non-linear Data structure:
As the name suggests, in non-linear data structures, elements are not arranged in any
sequence.
All the elements are arranged and linked with each other in a hierarchical manner, where
one element can be linked with one or more elements.
1) Trees
Binary Tree:
The concept of a binary tree is very similar to a linked list; the only difference lies in
the nodes and their pointers. In a linked list, each node contains a data value with a
pointer that points to the next node in the list, whereas in a binary tree, each node has
two pointers to subsequent nodes instead of just one.
Binary search trees keep their nodes sorted, so insertion and deletion operations can
typically be done with O(log N) time complexity.
Similar to the linked list, a binary tree can also be converted to an array on the basis of
tree sorting.
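A minimal sketch of a binary search tree, assuming a simple TreeNode class and a recursive insert written here for illustration.

# Binary search tree sketch: each node keeps two pointers (left, right), and
# insertion walks down the tree, O(log N) on average when the tree stays balanced
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)
print(root.value, root.left.value, root.right.value)   # 8 3 10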
2) Graphs
A graph data structure is also very much useful in machine learning for link
prediction.
A graph consists of nodes connected by edges; a graph can be directed (edges are
ordered pairs of nodes) or undirected (edges are unordered pairs).
Hence, you must have good exposure to the graph data structure for machine learning
and deep learning.
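A tiny sketch of a graph stored as an adjacency list (a dict mapping each node to its neighbours), with a naive common-neighbour count as an illustrative link-prediction score; the nodes and edges are made up.

# Graph as an adjacency list; common neighbours as a simple link-prediction score
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

# Common-neighbour count between B and D
print(len(graph["B"] & graph["D"]))   # 1 (they share neighbour C)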
3) Maps
Maps are a popular data structure in the programming world, mostly useful for reducing
algorithm run time and for searching data quickly.
It stores data in the form of (key, value) pair, where the key must be unique; however,
the value can be duplicated. Each key corresponds to or maps a value; hence it is
named a Map.
In different programming languages, core libraries have built-in maps or, rather,
HashMaps with different names for each implementation.
o In Java: Maps
o In Python: Dictionaries
o C++: hash_map, unordered_map, etc.
Python dictionaries are very useful in machine learning and data science, as various
functions and algorithms return a dictionary as their output. Dictionaries are also widely
used for implementing sparse matrices, which are very common in Machine Learning.
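A hedged sketch of a sparse matrix stored as a Python dictionary keyed by (row, column); the entries and the helper get() are made up for illustration.

# Sparse matrix as a dict: only non-zero entries are stored, keyed by position
sparse = {
    (0, 2): 3.0,
    (1, 0): 1.5,
    (4, 3): 2.2,
}

def get(matrix, row, col):
    """Return the stored value, or 0.0 for positions that were never set."""
    return matrix.get((row, col), 0.0)

print(get(sparse, 0, 2))   # 3.0
print(get(sparse, 2, 2))   # 0.0 (implicitly zero)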
4) Heap
A heap is a hierarchically ordered data structure. The heap data structure is very
similar to a tree, but it imposes a vertical ordering instead of a horizontal one.
Ordering in a heap is applied along the hierarchy but not across it: in a max-heap the
value of the parent node is always greater than or equal to that of its child nodes,
whether on the left or the right side (a min-heap reverses this).
Here, insertion and deletion operations are performed on the basis of promotion: a newly
inserted element is first placed at the bottom of the heap, then compared with its parent
and promoted until it reaches the correct position. Most heap data structures can be
stored in an array along with the relationships between the elements.
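A short sketch using Python's built-in heapq module, which maintains a binary min-heap on a plain list (the mirror image of the max-heap described above); the values are made up.

# heapq keeps a min-heap inside a plain list; the smallest value is always at index 0
import heapq

values = [9, 4, 7, 1, 5]
heapq.heapify(values)        # reorder the list in-place into heap order
heapq.heappush(values, 2)    # new elements are "promoted" towards the root as needed
print(values[0])             # 1: the root/parent always holds the smallest value
print(heapq.heappop(values)) # removing the root re-establishes the heap ordering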
Matrix:
The matrix is one of the most important types of data structure; it is used in linear
algebra to work with 1-D, 2-D, 3-D as well as 4-D arrays for matrix arithmetic.
Working with it requires good exposure to Python libraries such as NumPy for
programming in deep learning.
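A brief NumPy sketch of the matrix arithmetic referred to above; the matrices are made up.

# NumPy arrays for matrix arithmetic
import numpy as np

A = np.array([[1, 2], [3, 4]])      # 2-D array (matrix)
B = np.array([[5, 6], [7, 8]])

print(A + B)                        # element-wise addition
print(A @ B)                        # matrix multiplication
print(A.T)                          # transpose
print(np.zeros((2, 3, 4)).shape)    # higher-dimensional (3-D) array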
Of course, the rule for principles is that there should not be many, but they should be
unbreakable; the principles below are critical to the successful execution of a data
quality project.
Clarity is required at the start that data quality is a business problem and must be
solved by the business.
The IT department cannot and should not be running a data quality project. At the very
start both business and IT need to understand:
The business is responsible for defining the quality of the data needed.
However, the business needs to work in concert with IT to achieve their aims.
A data quality implementation needs to bring together the business and IT professionals
to work together for the benefit of a common goal.
Data sits on IT systems and they are normally the only department with direct access.
Data quality remediation work normally divides into that work done manually, and that
done either via a bulk update or via a data quality application.
If a data quality improvement requires a bulk update of data, then IT are the only ones
well placed to perform the work.
If an application needs installation and configuration then IT are likely the best placed to
do the work.
If the data quality remediation is looking to change process this will also require the buy-
in of IT.
A Data Quality project should be a healthy partnership between the business and IT.
It is often a large endeavour which will draw on resources from all areas of the
organisation.
There is no point cleaning up the data for it to revert to a poor state a few months later.
This is frustratingly common, and why data quality is often seen as an insurmountable
problem.
The reason poor data quality keeps coming back is precisely because organisations,
and data quality projects, fail to think about the problem holistically.
The whole point about quality is that doing things quickly, cheaply and badly ends up
costing you money; any data quality project should keep that as its mantra.
Not understanding the above means you’ll be running another data quality project in a
couple of years, and will have wasted a lot of resources – both time and money - on the
way.
Principle 4: Treat data as an asset
Data should be treated as an asset, but what does this mean for a data quality project?
It means treat every bit of data as if it is a valuable, physical asset.
It has taken time and effort for the customer to tell you their address.
It has taken time and effort for the call centre agent or branch staff member to type it
into the information systems.
This data has then been lovingly preserved for years, religiously backed-up and used
many times for verification.
It has cost money, probably quite a lot of money. Do not discard unless you are certain
it will not be valuable now or in the future.
Take time to update data with care. Look to understand data and why it is in its present
state before deciding on a solution.
Even if you have an obvious error, do not rush to remediate as this data error may be
an example of a process or data failure that will affect many thousands of records, and
the other examples may not be as obvious.
Do not treat your existing data as simply trash that deserves obliteration and
replacement with something shiny and new.
However, poor data quality is a people and process problem with technological
elements, not a technological problem.
What’s more, in order to solve data quality, it is necessary to win hearts and minds of
the organisation. It is necessary to engage with people, not computers.
It is necessary to persuade the executive that data quality is causing them to
lose money on a day to day basis.
It is about training people to recognise poor quality when they see it.
A data quality project needs to understand both its people, and the people in the
organisation, what they are doing with information and how they are doing it. People are
not technology.
They are irrational and cantankerous, and are not always open to change. Their
involvement needs to be nurtured.
After the initial pain of cleaning historical data is complete, the organisation must
embed a good-quality mindset rather than a poor-or-irrelevant-quality one.
At this point the data quality project should disappear. It has now become part of the
organisation.
Whilst some degree of monitoring is necessary to gently steer the process onward, what
is not needed is a large data quality department.
The objective of the approach is to make data quality endemic in the organisation.
Quality management needs to be in place so that data quality issues can be identified
and addressed, but that is all. Data Quality should become just another operational
measurement, and only require a brief look at the dials to make sure they are in the
green.
The objective of a data quality project is neither to boil the ocean, nor to make data
quality perfect.
The objective is to do the minimum possible that allows the organisation to meet its
information needs.
The end state of the data should be described as “good enough”, not “perfect”.
Once this state of affairs is reached, a data quality project is complete and should stop
work and stop spending the organisation's money.
Fundamentally, the approach should be based around minimum necessary work. Work
needs to be undertaken in as effective and efficient a manner as possible, and only
ever done once.
2.1.4 Data Pre-Processing
Dimensionality reduction
Feature subset selection
Data Pre-Processing
Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model.
It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always the case that we come across
clean and formatted data.
And before doing any operation with data, it is mandatory to clean it and put it in a
formatted way; for this, we use the data preprocessing task.
Data preprocessing covers the tasks required for cleaning the data and making it suitable
for a machine learning model, which also increases the accuracy and efficiency of the
machine learning model.
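A minimal pandas sketch of the cleaning steps mentioned above, removing duplicates and filling missing values; the toy DataFrame and the mean-fill strategy are assumptions.

# Basic preprocessing: remove duplicate rows and fill missing values
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [12, 11, 11, np.nan, 14],
    "height": [1.10, 1.05, 1.05, 1.07, np.nan],
})

df = df.drop_duplicates()                    # remove duplicate rows
df = df.fillna(df.mean(numeric_only=True))   # fill missing values with column means
print(df)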
Dimensionality reduction
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.
A dataset may contain a huge number of input features, which makes the predictive
modeling task more complicated. Because it is very difficult to visualize or make
predictions for a training dataset with a high number of features, dimensionality
reduction techniques are required in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the
higher dimensions dataset into lesser dimensions dataset ensuring that it
provides similar information." These techniques are widely used in machine
learning for obtaining a better fit predictive model while solving the classification and
regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc.
It can also be used for data visualization, noise reduction, cluster analysis, etc.
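As a hedged example of one dimensionality reduction technique, the sketch below applies scikit-learn's PCA to a 10-feature synthetic dataset; PCA and the synthetic data are assumptions, since the text above does not name a specific method.

# Dimensionality reduction sketch: project 10 features down to 2 with PCA
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=10, random_state=0)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (200, 10) -> (200, 2)
print(pca.explained_variance_ratio_)    # information retained per component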
The Curse of Dimensionality
If the dimensionality of the input dataset increases, any machine learning algorithm and
model becomes more complex.
As the number of features increases, the number of samples needed to cover the feature
space grows rapidly, and the chance of overfitting also increases.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Feature subset selection
Feature Selection is the most critical pre-processing activity in any machine learning
process.
It intends to select a subset of attributes or features that makes the most meaningful
contribution to a machine learning activity.
In order to understand it, let us consider a small example: predicting the weight of
students based on past information about similar students, which is captured in a
'Student Weight' data set.
The data set has four features: Roll Number, Age, Height and Weight. Roll Number
has no effect on the weight of the students, so we eliminate this feature.
This subset of the data set (shown below) is expected to give better results than the
full set.
Age    Height    Weight
12     1.10      23.0
11     1.05      21.6
13     1.20      24.7
11     1.07      21.3
14     1.24      25.2
12     1.12      23.4
A data set may sometimes have hundreds or thousands of dimensions, which is not good
from the machine learning aspect because handling that many features can be a big
challenge for any ML algorithm.
Moreover, a high amount of computational resources and time will be required.
Also, a model built on an extremely high number of features may be very difficult to
understand.
For these reasons, it is necessary to take a subset of the features instead of the
full set.
a. Feature Relevance:
In the case of supervised learning, the input data set (which is the training data set)
has a class label attached.
A model is induced based on the training data set, so that the induced model can
assign class labels to new, unlabeled data.
Some of the input variables may contribute little or no information to the class
prediction; the remaining variables, which make a significant contribution to the
prediction task, are said to be strongly relevant variables.
In the case of unsupervised learning, there is no training data set or labelled data.
Grouping of similar data instances is done, and the similarity of data instances is
evaluated based on the values of different variables.
Certain variables do not contribute any useful information for deciding the similarity or
dissimilarity of data instances.
Hence, those variables make no significant contribution to the grouping process.
These variables are marked as irrelevant variables in the context of the unsupervised
machine learning task.
We can understand the concept by returning to the Student Weight example:
Roll Number doesn't contribute any significant information in predicting what the
weight of a student would be.
Similarly, in the context of grouping students with similar academic merit, the variable
Roll Number is quite irrelevant.
Any feature which is irrelevant in the context of a machine learning task is a candidate
for rejection when we are selecting a subset of features.
b. Feature Redundancy:
A feature may contribute information that is similar to the information contributed by
one or more other features. For example, in the Student data set, the features Age and
Height contribute similar information, because as age increases height generally
increases too. In other words, irrespective of whether the feature Height is present or
not, the learning model will give almost the same results.
In this kind of situation where one feature is similar to another feature, the feature is
said to be potentially redundant in the context of a machine learning problem.
All features having potential redundancy are candidates for rejection in the final
feature subset.
Only a few representative features out of a set of potentially redundant features are
considered for being a part of the final feature subset.
So in short, the main objective of feature selection is to remove all features which are
irrelevant and take a representative subset of the features which are potentially
redundant.
This leads to a meaningful feature subset in the context of a specific learning task.
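A small pandas sketch, under stated assumptions, of both ideas: dropping the irrelevant Roll Number feature and one of a pair of highly correlated (potentially redundant) features from the toy Student Weight data. The roll numbers are made up; the other values come from the table above.

# Feature subset selection sketch: drop an irrelevant feature, then one of a
# pair of highly correlated (potentially redundant) features
import pandas as pd

students = pd.DataFrame({
    "roll_no": [1, 2, 3, 4, 5, 6],                     # made-up roll numbers
    "age":     [12, 11, 13, 11, 14, 12],
    "height":  [1.10, 1.05, 1.20, 1.07, 1.24, 1.12],
    "weight":  [23.0, 21.6, 24.7, 21.3, 25.2, 23.4],   # target variable
})

features = students.drop(columns=["roll_no", "weight"])  # roll_no is irrelevant
corr = features.corr()
print(corr.loc["age", "height"])    # ~0.98: Age and Height are potentially redundant

# Keep one representative of the redundant pair as the final feature subset
final_features = features.drop(columns=["height"])
print(list(final_features.columns))  # ['age']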