Because Big Data, Data Science, and Data Analytics are emerging
technologies that are still evolving, the terms Data Science and Data
Analytics are often used interchangeably. The confusion arises primarily
because both Data Scientists and Data Analysts work with Big Data.
Even so, the difference between a Data Analyst and a Data Scientist is
stark, fuelling the Data Science vs. Data Analytics debate.
Artificial Intelligence
AI is important because it can give enterprises insights into their operations that
they may not have been aware of previously and because, in some cases, AI can
perform tasks better than humans. Particularly when it comes to repetitive, detail-
oriented tasks like analyzing large numbers of legal documents to ensure relevant
fields are filled in properly, AI tools often complete jobs quickly and with
relatively few errors.
This has helped fuel an explosion in efficiency and opened the door to entirely new
business opportunities for some larger enterprises. Prior to the current wave of AI,
it would have been hard to imagine using computer software to connect riders to
taxis, but today Uber has become one of the largest companies in the world by
doing just that. It utilizes sophisticated machine learning algorithms to predict
when people are likely to need rides in certain areas, which helps proactively get
drivers on the road before they're needed.
What are the advantages and disadvantages of artificial
intelligence?
Artificial neural networks and deep learning artificial intelligence technologies are
quickly evolving, primarily because AI processes large amounts of data much
faster and makes predictions more accurately than humanly possible.
While the huge volume of data being created on a daily basis would bury a human
researcher, AI applications that use machine learning can take that data and quickly
turn it into actionable information. As of this writing, the primary disadvantage of
using AI is that it is expensive to process the large amounts of data that AI
programming requires.
Advantages
Good at repetitive, detail-oriented tasks, completing them quickly and with relatively few errors;
Processes large amounts of data much faster than a human researcher; and
Quickly turns that data into actionable information.
Disadvantages
Expensive;
Requires deep technical expertise;
Limited supply of qualified workers to build AI tools;
Only knows what it's been shown; and
Lack of ability to generalize from one task to another.
1. Google Maps
The application has been trained to recognize and understand
traffic. As a result, it suggests the best routes to avoid traffic
congestion and bottlenecks. The AI-based algorithm also tells
users the precise distance to their destination and how long it will
take to arrive, calculated from current traffic conditions. Several
ride-hailing applications have emerged as a result of the use of
similar AI technology.
2. Face Detection and Recognition
Utilizing Face ID to unlock our phones and applying virtual filters to our faces
while taking pictures are two uses of AI that are now essential to our day-to-day
lives.
Face recognition is used in the former: the phone must recognize one particular
face, its owner's. Face detection is used in the latter: any human face in the frame
can be detected so that the filter can be applied to it.
3. Online Payments
Rushing to the bank for every transaction can be a time-consuming errand. The good
news is that banks now use Artificial Intelligence to support customers by
simplifying the payment process.
Artificial Intelligence lets you deposit cheques from the comfort of your own home,
since AI is capable of deciphering handwriting and making online cheque processing
practicable. Artificial Intelligence can also be used to detect fraud by observing
patterns in consumers' credit card spending.
4. Recommendation Systems
When we wish to listen to our favorite songs, watch a movie, or shop online, have
we ever noticed that the things recommended to us perfectly match our interests?
This is the beauty of artificial intelligence.
These intelligent recommendation systems analyze our online activity and
preferences to provide us with similar content. Continuous training allows us to
have a customized experience. The data is obtained from the front end, saved
as big data, and analysed using machine learning and deep learning. The system can
then predict your preferences and make suggestions to keep you engaged without you
having to search for anything else. Artificial intelligence can also be utilized to
improve the user experience of a search engine.
5. Digital Assistants
When our hands are full, we often enlist the help of digital assistants to complete
tasks on our behalf. We might ask the assistant to call our father while we are
driving with a cup of tea in one hand. For instance, Siri would look at our contacts,
recognize the word "father," and dial the number.
6. Social Media
Social networking is a prime example of applied artificial intelligence: platforms
can figure out what kind of content a user likes and recommend similar content.
Facial recognition is also used in social media profiles, assisting users in tagging
their friends via automatic suggestions. Smart filters can recognize spam and
undesirable messages and automatically filter them out. Users may also take
advantage of smart replies.
The social media sector could use artificial intelligence to detect mental health
issues such as suicidal thoughts by analyzing the information published and
consumed. This information can be shared with mental health professionals.
7. Healthcare
Infervision is using artificial intelligence and deep learning to save lives in China,
where there are not enough radiologists to keep up with the demand of reviewing
1.4 billion CT scans each year for early signs of lung cancer. Radiologists must
review many scans every day, which is not only tedious; human fatigue can also
lead to errors. Infervision trained algorithms to augment the work of radiologists,
allowing them to diagnose cancer more efficiently and accurately.
Machine learning
Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being
explicitly programmed to do so.
Big data refers to larger, more complex data sets, especially from new data sources. These data sets
are so voluminous that traditional data processing software just can't manage them.
Big data can be defined as data that contains greater variety, arriving in increasing volumes and with
more velocity. These three dimensions are known as the three Vs.
Volume - The amount of data matters. With big data, you'll have to process high volumes of low-density,
unstructured data.
Velocity - Velocity is the fast rate at which data is received and (perhaps) acted on.
Variety - Variety refers to the many types of data that are available. Traditional data types were structured
and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data
types. Unstructured and semistructured data types, such as text, audio, and video, require additional
preprocessing to derive meaning and support metadata.
Small Data vs. Big Data
1. Traditional (small) data uses a database based on a fixed schema that is static in nature; it can only
work with structured data that fits effortlessly into relational databases or tables. Big data is based on a
dynamic schema that can include structured as well as unstructured data; the data is stored in raw form
and the schema is applied only when accessing it.
2. Traditional data is based on a centralized database architecture, whereas big data uses a distributed
architecture.
Definition: Parallel computing refers to the process of breaking down larger problems into
smaller, independent, often similar parts that can be executed simultaneously by
multiple processors communicating via shared memory, the results of which are
combined upon completion as part of an overall algorithm.
Types of Parallelism:
1. Bit-level parallelism –
It is the form of parallel computing that is based on increasing the
processor's word size. It reduces the number of instructions that the
system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute
the sum of two 16-bit integers. It must first sum up the 8 lower-order
bits, then add the 8 higher-order bits, thus requiring two instructions to
perform the operation. A 16-bit processor can perform the operation
with just one instruction.
2. Instruction-level parallelism –
Without instruction-level parallelism, a processor can issue at most one
instruction per clock cycle. Instructions can, however, be re-ordered and
grouped so that they are executed concurrently without affecting the
result of the program. This is called instruction-level parallelism.
3. Task Parallelism –
Task parallelism decomposes a task into subtasks and then allocates
each subtask to a processor for execution. The processors execute the
subtasks concurrently.
4. Data-level parallelism (DLP) –
Instructions from a single stream operate concurrently on several data items. It is limited
by non-regular data manipulation patterns and by memory bandwidth.
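To make the idea concrete, here is a minimal Python sketch of task- and data-level parallelism using the standard multiprocessing module; the worker function and data are hypothetical examples, not taken from the text above.

```python
from multiprocessing import Pool

def square(x):
    # The same operation is applied independently to every data item
    # (data-level parallelism); each worker process handles a chunk.
    return x * x

if __name__ == "__main__":
    data = list(range(16))
    # Four worker processes execute the subtasks concurrently.
    with Pool(processes=4) as pool:
        results = pool.map(square, data)
    print(results)
```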
Machine Learning Algorithms
1. Linear Regression
The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
y = a0 + a1x + ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable,
a0 is the intercept of the line, a1 is the linear regression coefficient (the slope of the
line), and ε is the random error term.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression: a single independent variable is used to predict the value
of a numerical dependent variable.
o Multiple Linear Regression: more than one independent variable is used to predict
the value of a numerical dependent variable.
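As an illustration, here is a minimal sketch of fitting a simple linear regression with scikit-learn; the toy data points are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2*x + 1 plus a little noise.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

model = LinearRegression()
model.fit(X, y)

# a1 (slope) and a0 (intercept) from y = a0 + a1*x + error
print("a1 (slope):", model.coef_[0])
print("a0 (intercept):", model.intercept_)
print("prediction for x = 6:", model.predict([[6.0]])[0])
```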
2. Logistic Regression
Logistic Regression is very similar to Linear Regression except in how the two are
used: Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
It maps any real value to a value within the range 0 to 1 using the sigmoid (logistic)
function, f(x) = 1 / (1 + e^-x).
Because the value produced by logistic regression must lie between 0 and 1 and
cannot go beyond this limit, it forms a curve like the letter "S". The S-form curve is
called the sigmoid function or the logistic function.
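A minimal scikit-learn sketch, assuming a toy pass/fail dataset, shows how logistic regression outputs probabilities squashed into the 0-1 range.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. fail (0) / pass (1).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid, so each probability lies between 0 and 1.
print(clf.predict_proba([[3.5]]))  # [[P(fail), P(pass)]]
print(clf.predict([[3.5]]))        # predicted class label
```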
3. Decision Tree
o Decision Trees usually mimic human thinking while making a decision,
so they are easy to understand.
o The logic behind a decision tree can be easily understood because it shows
a tree-like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after
a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, to predict the class of a given dataset, the algorithm
starts from the root node of the tree. The algorithm compares the value of the
root attribute with the corresponding attribute of the record (the real dataset)
and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues this process until it reaches
a leaf node of the tree. The complete process can be better understood
using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best
attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where the nodes cannot be classified further; the final node is then called a
leaf node.
Example: Suppose a candidate has a job offer and wants to decide whether to
accept it or not. So, to solve this problem,
the decision tree starts with the root node (Salary attribute by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:
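To make the steps above concrete, here is a minimal scikit-learn sketch using a hypothetical numeric encoding of the job-offer example (salary level, distance from the office, cab facility); the feature names and values are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: [salary_ok, near_office, cab_facility]
X = [
    [0, 0, 0],  # low salary                      -> declined
    [1, 0, 0],  # good salary, far, no cab        -> declined
    [1, 1, 0],  # good salary, near the office    -> accepted
    [1, 0, 1],  # good salary, far, cab provided  -> accepted
]
y = ["Declined", "Declined", "Accepted", "Accepted"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned splits (root node, decision nodes, leaf nodes).
print(export_text(tree, feature_names=["salary_ok", "near_office", "cab"]))
print(tree.predict([[1, 1, 1]]))  # classify a new offer
```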
4. Support Vector Machine (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily
put new data points in the correct category in the future. This best
decision boundary is called a hyperplane.
Example: SVM can be understood with the example that we used for the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs. If we want a model that can accurately identify whether it is a cat or a
dog, such a model can be created using the SVM algorithm. We will first train
our model with many images of cats and dogs so that it can learn about their
different features, and then we test it with this strange creature. Because the
support vector machine creates a decision boundary between the two classes
(cat and dog) and chooses the extreme cases (the support vectors), it will look
at the extreme cases of cats and dogs. On the basis of the support vectors,
it will classify the new example as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect
the position of the hyperplane are termed support vectors. Since these vectors
support the hyperplane, they are called support vectors.
The SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest
points of the lines from both classes; these points are called support vectors. The
distance between the support vectors and the hyperplane is called the margin, and
the goal of SVM is to maximize this margin. The hyperplane with the maximum
margin is called the optimal hyperplane.
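A minimal scikit-learn sketch, assuming a tiny two-feature toy dataset, shows how an SVM is trained and how its support vectors can be inspected.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D points: class 0 clustered near the origin, class 1 further away.
X = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # a linear kernel gives a straight-line hyperplane
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)  # points that define the margin
print("prediction for [3, 3]:", clf.predict([[3, 3]]))
```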
5. Naive Bayes
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event
B.
P(B|A) is the Likelihood probability: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Problem: If the weather is sunny, should the player play or not?
To solve this, first consider the dataset of weather conditions and the corresponding "Play" decisions:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather No Yes
Overcast 0 5 5/14=0.36
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
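The same calculation can be reproduced with a short Python sketch; the counts are taken directly from the tables above.

```python
# Naive Bayes by hand for the weather example above.
n_total, n_yes, n_no = 14, 10, 4
sunny_yes, sunny_no, sunny_total = 3, 2, 5

p_yes = n_yes / n_total            # P(Yes)   = 0.71
p_no = n_no / n_total              # P(No)    = 0.29
p_sunny = sunny_total / n_total    # P(Sunny) = 0.35

p_sunny_given_yes = sunny_yes / n_yes   # P(Sunny|Yes) = 0.3
p_sunny_given_no = sunny_no / n_no      # P(Sunny|No)  = 0.5

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # about 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # about 0.41

print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")
```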
6. K-Nearest Neighbors (KNN)
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have
already studied in geometry. Between points (x1, y1) and (x2, y2) it can be calculated as:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distance we find the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
o There is no particular way to determine the best value for "K", so we need to
try some values to find the best out of them. The most preferred value for K is
5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K can smooth out noise, but the model may then find it
difficult to separate the categories.
Advantages of KNN:
o It is simple to implement.
Disadvantages of KNN:
o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because the distance between the new data point
and all the training samples must be calculated.
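To tie the steps above together, here is a minimal scikit-learn sketch with a hypothetical two-category dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy points: category A near the origin, category B further away.
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array(["A", "A", "A", "B", "B", "B"])

# k = 5 neighbors; Euclidean distance is the default metric.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

new_point = [[3, 3]]
print(knn.predict(new_point))      # majority class among the 5 nearest neighbors
print(knn.kneighbors(new_point))   # distances and indices of those neighbors
```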
7. K-Means Clustering
K-means clustering allows us to group the data into different clusters, and it is a
convenient way to discover the categories of groups in an unlabeled dataset on its
own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it can no longer improve the clusters.
The value of k should be predetermined in this algorithm.
The k-means algorithm mainly performs two tasks: it determines the best values for
the K center points (centroids) through an iterative process, and it assigns each data
point to its closest k-center. The data points that are near a particular k-center form
a cluster. Hence each cluster contains data points with some commonalities, and it
is kept away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than
those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4 again; otherwise the model is
ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these
two variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put
them into different clusters. It means here we will try to group these datasets
into two different clusters.
o We need to choose some random k points or centroid to form the cluster.
These points can be either points from the dataset or any other points. So,
here we are selecting the two points below as k points, which are not part
of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied to
calculate the distance between two points. So, we will draw a median line (the
perpendicular bisector) between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are
near the K1 or blue centroid, and the points to the right of the line are close to
the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest clusters, we will repeat the process by
choosing new centroids. To choose the new centroids, we will compute the
center of gravity of the points in each cluster and find the new centroids as below:
o Next, we will reassign each data point to the new centroids. For this, we will
repeat the same process of finding a median line. The median line will look like
the below image:
From the above image, we can see that one yellow point is on the left side of the
line and two blue points are to the right of the line, so these three points will be
assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is
finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of the points in each
cluster, so the new centroids will be as shown in the below image:
o As we have obtained the new centroids, we will again draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and
the two final clusters will be as shown in the below image:
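A minimal scikit-learn sketch of the whole procedure, assuming toy values for the two variables M1 and M2, is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy (M1, M2) points forming two loose groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])

# K = 2 clusters; the centroids are refined iteratively as described above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster labels:", labels)
print("final centroids:\n", kmeans.cluster_centers_)
```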
8. Random Forest
Random Forest is an ensemble method that builds a number of decision trees on
different subsets of the dataset and combines their predictions, by majority vote
for classification, to improve accuracy. The below diagram explains the working
of the Random Forest algorithm:
Why use Random Forest?
Below are some points that explain why we should use the Random Forest
algorithm:
o It reduces the risk of overfitting compared with a single decision tree, because
the final prediction is a majority vote over many trees.
o It predicts with high accuracy and runs efficiently even on large datasets.
o It can maintain accuracy even when a large proportion of the data is missing.
The working process can be explained in the steps and diagram below:
Step-1: Select K random data points from the training set.
Step-2: Build a decision tree associated with each selected subset of data points.
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N decision trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority of votes.
Example: Suppose there is a dataset that contains multiple fruit images. So,
this dataset is given to the Random forest classifier. The dataset is divided
into subsets and given to each decision tree. During the training phase, each
decision tree produces a prediction result, and when a new data point occurs,
then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:
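A minimal scikit-learn sketch, using a small built-in dataset as a stand-in for the fruit images, illustrates the majority-vote idea.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small built-in dataset used here instead of the fruit images.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 10 decision trees, each trained on a bootstrap subset of the data;
# the forest predicts by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("votes of the individual trees for the first test sample:",
      [int(t.predict(X_test[:1])[0]) for t in forest.estimators_])
```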
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of
loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Density-Based Clustering Methods: DBSCAN
Density-based clustering (DBSCAN) has the following characteristics:
o It is a scan method.
o It requires density parameters (Eps and MinPts) as a termination condition.
o It is used to manage noise in data clusters.
o It is used to identify clusters of arbitrary size and shape.
Core point condition: a point k is a core point if its Eps-neighborhood contains at
least MinPts points, i.e. |NEps(k)| >= MinPts. A point i is directly density reachable
from k if i belongs to NEps(k) and k satisfies the core point condition.
Density reachable: an object i is density reachable from an object j with respect to
Eps and MinPts in a given set of objects D if there is a chain of objects i1, ..., in
with i1 = j and in = i such that each i(m+1) is directly density reachable from i(m)
with respect to Eps and MinPts.
Density connected: a point i is density connected to a point j with respect to Eps
and MinPts if there is a point o such that both i and j are density reachable from o
with respect to Eps and MinPts.
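A minimal scikit-learn sketch, with assumed values for Eps and MinPts, shows how DBSCAN labels clusters and marks noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense toy clusters plus one isolated (noise) point.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [4.5, 0.0]])

# eps corresponds to Eps and min_samples to MinPts in the definitions above.
db = DBSCAN(eps=0.5, min_samples=3)
labels = db.fit_predict(X)

print(labels)  # cluster ids; -1 marks noise points such as the isolated one
```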
Association Rule Learning
Association rule learning finds interesting relationships (rules) among items in
large datasets, such as products that are frequently bought together. It can be
divided mainly into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
The strength of an association rule such as X -> Y is measured using three metrics:
o Support
o Confidence
o Lift
Support
Support is the frequency of X, or how frequently an item appears in the
dataset. It is defined as the fraction of the transactions T that contain the
itemset X. For transactions T, it can be written as:
Support(X) = Freq(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, i.e. how
often the items X and Y occur together in the dataset when the occurrence of
X is already given. It is the ratio of the number of transactions that contain X
and Y to the number of transactions that contain X:
Confidence(X -> Y) = Freq(X, Y) / Freq(X)
Lift
Lift measures the strength of a rule and can be defined by the formula below. It is
the ratio of the observed support to the support that would be expected if X and Y
were independent of each other:
Lift(X -> Y) = Support(X, Y) / (Support(X) * Support(Y))
It has three possible ranges of values:
o Lift = 1: X and Y are independent, and the rule carries no information.
o Lift > 1: X and Y are positively correlated; they occur together more often than expected.
o Lift < 1: X and Y are negatively correlated; one item tends to substitute for the other.
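A short Python sketch, using a hypothetical list of five shopping-basket transactions, computes all three metrics for the rule {bread} -> {butter}.

```python
# Hypothetical transactions for evaluating the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

freq_x = sum("bread" in t for t in transactions)                 # 4
freq_y = sum("butter" in t for t in transactions)                # 4
freq_xy = sum({"bread", "butter"} <= t for t in transactions)    # 3

support_xy = freq_xy / n                            # Support(X, Y)    = 0.60
confidence = freq_xy / freq_x                       # Confidence(X->Y) = 0.75
lift = support_xy / ((freq_x / n) * (freq_y / n))   # Lift(X->Y)       = 0.94

print(support_xy, confidence, lift)
```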
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is
designed to work on databases that contain transactions. It uses a
breadth-first search and a hash tree to compute the itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the
products that can be bought together. It can also be used in the healthcare
field to find drug reactions for patients.
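As an illustration, here is a minimal sketch of running Apriori with the third-party mlxtend library (an assumption, since the text does not name a specific implementation), reusing the hypothetical baskets from the earlier metric example.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical shopping baskets.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Breadth-first search for itemsets with support >= 0.4,
# then derive association rules filtered by confidence.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```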
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. This
algorithm uses a depth-first search technique to find frequent itemsets in a
transaction database, and it generally executes faster than the Apriori algorithm.