Machine Learning Unit 2


Subject Coordinator : Rashika Bangroo

Unit-2
Regression
 Regression is a supervised learning technique
for investigating the relationship between
independent variables or features and a dependent
variable or outcome.
 Outcomes can then be predicted once the relationship between the independent and dependent variables has been estimated.
 E.g.: House Price Prediction, Future Stock Prediction
Example
Suppose there is a marketing company A, which runs various advertisements every year and gets sales based on them.
Terminologies Related to the
Regression Analysis:
 Dependent Variable: The main factor in Regression
analysis which we want to predict or understand is called
the dependent variable. It is also called target variable
(Response Variable)
 Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also known as predictors.
 Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may hamper the result, so it should be avoided.
Types of Regression
Regression is broadly divided into:
 Linear Regression
    Simple Linear Regression
    Multiple Linear Regression
 Non-Linear Regression
    Logistic Regression
Types of Regression
 Simple Linear Regression : Linear regression shows
the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis),
hence called linear regression.
 If there is only one input variable (x), then such linear
regression is called simple linear regression.
 If there is more than one input variable, then such
linear regression is called multiple linear regression.
Example of Linear Regression

 Below is the mathematical equation for linear regression:
Y = aX + b
where Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
Linear Regression
 The formula for the linear regression equation is given by:
y = a + bx
 a and b are given by the following least-squares formulas:
b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
a = (Σy − b·Σx) / n
x and y are the two variables on the regression line.
b = slope of the line.
a = y-intercept of the line.
x = values of the first data set, y = values of the second data set, n = number of data pairs.
Q1: Find the linear regression equation for the following two sets of data (Y is the dependent variable):
X: 2, 4, 6, 8
Y: 3, 7, 5, 10

Ans: Regression equation: y = 1.5 + 0.95x
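As a quick check, the sketch below (Python with NumPy; not part of the original notes) applies the least-squares formulas above to the Q1 data and reproduces the answer.

```python
import numpy as np

# Q1 data: X is the independent variable, Y the dependent variable
x = np.array([2, 4, 6, 8], dtype=float)
y = np.array([3, 7, 5, 10], dtype=float)

n = len(x)
# Least-squares estimates of slope (b) and intercept (a) for y = a + b*x
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = (np.sum(y) - b * np.sum(x)) / n

print(f"y = {a:.2f} + {b:.2f}x")   # prints: y = 1.50 + 0.95x
```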


Assignment-1
Q1: ABC laboratory is conducting research on height and weight and wants to know whether there is a relationship such that as height increases, weight also increases. They have gathered a sample of 1000 people for each of the categories and computed the average height for each group. Below are the details they have gathered. You are required to carry out the regression calculation and conclude whether any such relationship exists.
(Assume height is the independent variable)

Q2: Explain the history of ML.


Multiple Regression
 Multiple regression is a statistical technique that can
be used to analyze the relationship between a single
dependent variable and several independent variables.
 Each predictor value is weighted, the weights denoting its relative contribution to the overall prediction.
 The multiple regression equation can be written as:
Y = a + b1X1 + b2X2 + … + bnXn
 Here Y is the dependent variable, X1, …, Xn are the n independent variables, a is the intercept, and b1, …, bn are the regression coefficients (weights).
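A minimal sketch of fitting such an equation by ordinary least squares (Python with NumPy; the data below are made-up for illustration only):

```python
import numpy as np

# Hypothetical data: two predictors (X1, X2) and one target (Y)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = np.array([6.0, 6.5, 12.0, 12.5, 16.0])

# Add a column of ones so the intercept a is estimated along with b1, b2
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: solve for [a, b1, b2]
coeffs, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
a, b1, b2 = coeffs
print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```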
Logistic Regression
 Logistic regression is a statistical analysis method to
predict a binary outcome, such as yes or no, based on
prior observations of a data set.
 A logistic regression model predicts a dependent data
variable by analyzing the relationship between one or
more existing independent variables.
 E.g : a logistic regression could be used to predict
whether a political candidate will win or lose an
election or whether a high school student will be
admitted or not to a particular college.
Logistic Regression
 Logistic regression uses the sigmoid function, also called the logistic function, which leads to a more complex (non-linear) cost function than the one used in linear regression.
 The sigmoid function is used to model the data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(−x))
f(x) = output between the 0 and 1 value
x = input to the function
e = base of the natural logarithm
Logistic Regression
 When we provide the input values (data) to the
function, it gives the S-curve as follows:

Sig(x)1 as x∞
Sig(x)0 as x(-∞)
Logistic Regression Assumptions
 Dependent variable must be categorical in nature
 Independent variables should not exhibit multicollinearity
 Multicollinearity occurs when independent variables
in a regression model are correlated.
 This correlation is a problem because independent
variables should be independent. If the degree of
correlation between variables is high enough, it can
cause problems when you fit the model and interpret
the results.
Difference between Linear &
Logistic Regression
Types of Logistic Regression
 Binary Logistic Regression is a type of logistic regression in
which the response variable can only belong to two categories. E.g
: Spam , Not spam
 Multinomial Logistic Regression is a type of logistic regression
in which the response variable can belong to one of three or more
categories and there is no natural ordering among the categories.
E.g : color : Red, Blue, Green
 Ordinal Logistic Regression : a type of logistic regression in
which the response variable can belong to one of three or more
categories and there is a natural ordering among the categories.
E.g : Movie Rating(1-5), Grading (Excellent,Fair,Poor)
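As an illustration of the binary case, here is a minimal sketch using scikit-learn (assumed to be installed) with hypothetical study-hours data; it is a sketch, not part of the original notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (feature) vs. pass/fail (binary response)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of passing for a student who studied 4.5 hours
print(model.predict_proba([[4.5]])[0][1])
print(model.predict([[4.5]]))
```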
Summary
Bayes Theorem
 Bayes theorem determines the conditional probability
of an event A given that event B has already occurred.
 Bayes theorem is also known as the Bayes Rule or Bayes
Law.
 It is a method to determine the probability of an event
based on the occurrences of prior events.
 It is used to calculate conditional probability.
Bayes Theorem
Bayes Theorem is stated as:
P(H|X) = P(X|H) · P(H) / P(X)
where P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X,
P(H) is the prior probability, or a priori probability, of H,
P(X|H) is the posterior probability of X conditioned on H, and
P(X) is the prior probability of X.
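A short numerical check of the rule (plain Python; the probability values are made-up illustrative numbers, not from the notes):

```python
# Hypothetical numbers: P(H) = 0.3, P(X|H) = 0.8, P(X) = 0.5
p_h, p_x_given_h, p_x = 0.3, 0.8, 0.5

# Bayes theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.48
```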
Naive Bayesian Classification
 The naive Bayesian classifier, or simple Bayesian classifier,
works as follows :
1) Let D be a training set of tuples and their associated class
labels. Each tuple is represented by an n-dimensional
attribute vector, X = (x1, x2,..., xn), depicting n
measurements made on the tuple from n attributes,
respectively, A1, A2,..., An.
2) Suppose that there are m classes, C1, C2,..., Cm. Given a
tuple, X, the classifier will predict that X belongs to the
class having the highest posterior probability,
conditioned on X. That is, the naive Bayesian classifier
predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Thus, we maximize P(Ci|X).
 The class Ci for which P(Ci |X) is maximized is called the
maximum posteriori hypothesis. By Bayes theorem :
P(Ci |X) = P(X|Ci) . P(Ci) / P(X)
 Class prior probabilities may be estimated by
 P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training tuples
of class Ci in D.

To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
Q: Predict the class label for Tuple X = (age = youth, income = medium,
student = yes, credit rating = fair) using Naïve Bayes Classifier.
 We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
 P(Ci), the prior probability of each class, can be
computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357
 To compute P(X|Ci), for i = 1, 2, we compute the
following conditional probabilities:
P(age = youth | buys computer = yes) = 2/9 = 0.222
P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 =
0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
 P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Using these probabilities, we obtain
 P(X|buys computer = yes) = P(age = youth | buys computer = yes)
× P(income = medium | buys computer = yes) × P(student = yes |
buys computer = yes) × P(credit rating = fair | buys computer =
yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
Similarly, P(X|buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400
= 0.019.
To find the class, Ci , that maximizes P(X|Ci)P(Ci), we compute
P(X|buys computer = yes)P(buys computer = yes) = 0.044 × 0.643 =
0.028
P(X|buys computer = no)P(buys computer = no) = 0.019 × 0.357 =
0.007
Therefore, the naive Bayesian classifier predicts buys computer =
yes for tuple X.
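The arithmetic above can be reproduced with a short sketch (plain Python; the conditional probabilities are taken directly from the slide):

```python
# Conditional probabilities for X = (age=youth, income=medium, student=yes, credit=fair)
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # ≈ 0.044
p_x_given_no  = 0.600 * 0.400 * 0.200 * 0.400   # ≈ 0.019

# Class priors
p_yes, p_no = 9 / 14, 5 / 14                     # ≈ 0.643, 0.357

# Compare P(X|Ci) * P(Ci) for each class
score_yes = p_x_given_yes * p_yes                # ≈ 0.028
score_no  = p_x_given_no * p_no                  # ≈ 0.007

print("buys_computer =", "yes" if score_yes > score_no else "no")
```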
Applications of Naïve Bayes Classifier
 News Categorization (into sports , healthcare , media ,
entertainment etc.)
 SPAM filtering (categorization of e-mails into spam or not
spam) .Email services (like Gmail) use this algorithm to
figure out whether an email is a spam or not.
 It can be used in real-time predictions because Naïve
Bayes Classifier is an eager learner.
 Recommendation Systems: Naive Bayes Classifier and Collaborative Filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Q2: For the Data Set given below, Predict the class label
for the data tuple
X=(Color=‘Green’,Legs=‘2’,Height=‘tall’, Smelly=‘no’)
Concept Learning
 The problem of inducing general functions from
specific training examples is central to learning.
 Concept learning can be formulated as a problem of
searching through a predefined space of potential
hypotheses for the hypothesis that best fits the
training examples.
 What is Concept Learning…?
 “A task of acquiring potential hypothesis (solution) that
best fits the given training examples.”
Concept Learning
 Consider the example task of learning the target concept
“days on which my friend Prabhas enjoys his favorite
water sport.”

 The task is to learn to predict the value of Enjoy_Sport for an arbitrary day, based on the values of its other attributes.
Concept Learning Notation
What hypothesis representation shall we provide to the
learner in this case?
 Consider a simple representation in which each hypothesis
consists of a conjunction of constraints on the instance
attributes.
 Let each hypothesis be a vector of six constraints, specifying the
values of the six attributes Sky, AirTemp, Humidity, Wind,
Water, and Forecast.
 For each attribute, the hypothesis will either :
 indicate by a “?” that any value is acceptable for this attribute,
 specify a single required value (e.g., Warm) for the attribute, or
 indicate by a “ø” that no value is acceptable.
 If some instance x satisfies all the constraints of hypothesis h,
then h classifies x as a positive example (h(x) = 1).
 To illustrate, the hypothesis that Prabhas enjoys his favorite
sport only on cold days with high humidity (independent
of the values of the other attributes) is represented by the
expression :
(?, Cold, High, ?, ?, ?)
Most General and Specific Hypothesis
 The most general hypothesis-that every day is a positive
example-is represented by
(?, ?, ?, ?, ?, ?)
 The most specific possible hypothesis-that no day is a
positive example-is represented by
(ø, ø, ø, ø, ø, ø)
General-to-Specific Ordering of Hypotheses
 Consider the two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
 Now consider the sets of instances that are classified positive
by hl and by h2. Because h2 imposes fewer constraints on the
instance, it classifies more instances as positive.
 In fact, any instance classified positive by h1 will also be
classified positive by h2. Therefore, we say that h2 is more
general than h1.
 For any instance x in X and hypothesis h in H, we say that x
satisfies h if and only if h(x) = 1.
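A tiny sketch (Python) of this "satisfies" relation and the more-general-than comparison for the two hypotheses above; the attribute order and the sample instance are assumptions for illustration:

```python
def satisfies(x, h):
    """h(x) = 1 iff every constraint in h is met by instance x ('?' accepts anything)."""
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

# Attribute order assumed: (Sky, AirTemp, Humidity, Wind, Water, Forecast)
h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')

x = ('Sunny', 'Warm', 'High', 'Light', 'Warm', 'Same')   # hypothetical instance

print(satisfies(x, h1))   # False: Wind is Light, not Strong
print(satisfies(x, h2))   # True : h2 imposes fewer constraints

# h2 is more general than h1: every position of h2 is '?' or matches h1
more_general = all(b == '?' or b == a for a, b in zip(h1, h2))
print(more_general)       # True
```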
Find S: Finding a maximally
Specific Hypothesis
 Find-S algorithm is a basic concept learning algorithm in
machine learning.
 This algorithm identifies the hypothesis that best matches
all of the positive cases.
 Note: the algorithm considers only the positive training examples.
 The algorithm starts with the most specific hypothesis and
generalizes this hypothesis each time it fails to classify an
observed positive training data.
 Hence, the Find-S algorithm moves from the most specific
hypothesis to the most general hypothesis.
Find S: Finding a maximally
Specific Hypothesis
1) Initialize h to the most specific hypothesis in H.
2) For each positive training instance x:
      For each attribute constraint ai in h:
         If the constraint ai is satisfied by x, then do nothing;
         Else replace ai in h by the next more general constraint that is satisfied by x.
3) Output hypothesis h.
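A minimal sketch of Find-S in Python. The training table is not reproduced in these notes, so the data below are assumed from the standard EnjoySport example:

```python
def find_s(examples):
    """Find-S: start from the most specific hypothesis and minimally
    generalize it for every positive example; negatives are ignored."""
    n = len(examples[0][0])
    h = ['0'] * n                       # 'ø' (most specific) for every attribute
    for x, label in examples:
        if label != 'Yes':
            continue                    # Find-S uses only positive examples
        for i, value in enumerate(x):
            if h[i] == '0':
                h[i] = value            # first positive example: copy its values
            elif h[i] != value:
                h[i] = '?'              # conflicting value: generalize to '?'
    return h

# Assumed EnjoySport data: (Sky, AirTemp, Humidity, Wind, Water, Forecast)
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]

print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```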
CANDIDATE-ELIMINATION
ALGORITHM
 The candidate elimination algorithm incrementally builds
the version space given a hypothesis space H and a set E of
examples. The examples are added one by one; each example
possibly shrinks the version space by removing the
hypotheses that are inconsistent with the example. The
candidate elimination algorithm does this by updating the
general and specific boundary for each new example.
 It is an extended form of Find-S algorithm.
 It considers both positive and negative examples.
 Positive examples are handled as in the Find-S algorithm (the specific boundary is generalized from the most specific hypothesis), while negative examples are used to specialize the general boundary.
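Below is a compact sketch of the algorithm in Python. The EnjoySport data and attribute domains are assumed from the standard example (the tables are not reproduced in these notes), and the boundary bookkeeping is simplified:

```python
def covers(h, x):
    """Hypothesis h classifies instance x as positive ('?' accepts any value)."""
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

def candidate_elimination(examples, domains):
    """Simplified candidate elimination for conjunctive hypotheses.
    S is kept as a single maximally specific hypothesis (enough for this
    hypothesis space); G is a set of maximally general hypotheses."""
    n = len(domains)
    S = ('0',) * n                       # 'ø' for every attribute
    G = {('?',) * n}
    for x, label in examples:
        if label == 'Yes':
            # drop general hypotheses that reject the positive example
            G = {g for g in G if covers(g, x)}
            # minimally generalize S so that it covers x (same move as Find-S)
            S = tuple(xi if si == '0' else (si if si == xi else '?')
                      for si, xi in zip(S, x))
        else:
            # minimally specialize every g in G that still covers the negative x
            # (in the full algorithm, S members covering x would also be removed)
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                    continue
                for i in range(n):
                    if g[i] != '?':
                        continue
                    for v in domains[i]:
                        # keep a specialization only if it rules out x and agrees
                        # with S at this attribute (so it stays at least as general as S)
                        if v != x[i] and S[i] == v:
                            new_G.add(g[:i] + (v,) + g[i + 1:])
            G = new_G
    return S, G

# Assumed EnjoySport data and attribute domains
domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
S, G = candidate_elimination(data, domains)
print("S:", S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print("G:", G)   # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}
```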
Learned Version space by candidate
elimination Algorithm
Example-2 Candidate Elimination
Algorithm
Bayesian Belief Networks
Expectation-Maximization (EM)
Algorithm
 Expectation-Maximization algorithm can be used for the
latent variables (variables that are not directly observable
and are actually inferred from the values of the other
observed variables) in order to predict their values with the
condition that the general form of probability distribution
governing those latent variables is known to us.
 This algorithm is actually at the base of many unsupervised
clustering algorithms in the field of machine learning.
 It is used to find the local maximum likelihood parameters of
a statistical model in the cases where latent variables are
involved and the data is missing or incomplete.
EM Algorithm
Algorithm:
 Given a set of incomplete data, consider a set of starting
parameters.
 Expectation step (E – step): Using the observed available
data of the dataset, estimate (guess) the values of the
missing data.
 Maximization step (M – step): Complete data generated
after the expectation (E) step is used in order to update the
parameters.
 Repeat the E-step and the M-step until convergence.
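As one concrete illustration (a sketch assuming scikit-learn and NumPy are available, with synthetic data), fitting a two-component Gaussian mixture runs exactly this alternation of E-steps and M-steps internally:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data drawn from two overlapping clusters
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200),
                       rng.normal(5.0, 1.0, 200)]).reshape(-1, 1)

# fit() alternates E-steps (soft cluster assignments) and M-steps
# (re-estimating means, variances, and mixing weights) until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_.ravel())     # roughly [0, 5] (component means)
print(gmm.weights_)           # roughly [0.5, 0.5] (mixing proportions)
```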
Use of EM Algorithm
Usage of EM algorithm –
 It can be used to fill the missing data in a sample.
 It can be used as the basis of unsupervised learning of
clusters.
 It can be used for the purpose of estimating the
parameters of Hidden Markov Model (HMM).
Support Vector Machine
 SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as
Regression problems.
 The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional
space into classes so that we can easily put the new data
point in the correct category in the future. This best
decision boundary is called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Types of SVM
 SVM can be of two types:
 Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified into two classes using a single straight line, such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
 Non-linear SVM: Non-Linear SVM is used for non-linearly separable data; if a dataset cannot be classified using a straight line, such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
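A minimal sketch (scikit-learn assumed, with made-up 2-D data) showing how the two variants are selected through the kernel argument:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D data: two clusters labelled 0 and 1
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel='linear').fit(X, y)     # Linear SVM
nonlinear_svm = SVC(kernel='rbf').fit(X, y)     # Non-linear SVM (RBF kernel)

print(linear_svm.predict([[2, 2], [7, 7]]))     # expected: [0 1]
print(linear_svm.support_vectors_)              # the extreme points (support vectors)
```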
Linear Vs Non-Linear SVM
What to do if data are not linearly
separable?
 For non-linear data, to separate the data points, we
need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear
data, we will add a third dimension z.
 It can be calculated as:
z = x² + y²
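A short sketch (NumPy assumed, hypothetical points) of adding the extra dimension z = x² + y²: points inside a circle get small z and points outside get large z, so a plane in (x, y, z) can separate them:

```python
import numpy as np

# Hypothetical points: class 0 lies inside a circle of radius 2, class 1 outside
points = np.array([[0.5, 0.5], [1.0, -1.0], [-1.0, 0.0],    # class 0
                   [3.0, 0.0], [0.0, -3.5], [2.5, 2.5]])    # class 1

# Add the third dimension z = x^2 + y^2
z = points[:, 0] ** 2 + points[:, 1] ** 2
lifted = np.column_stack([points, z])

print(lifted)
# In the lifted space a simple threshold on z (e.g. z < 4) separates the classes,
# which corresponds to a linear decision boundary (a plane) in 3-D.
```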
SVM Kernel
 An SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable one.
 It is mostly useful in non-linear separation problems.
 It does some extremely complex data transformations
then finds out the process to separate the data based
on the labels or outputs defined.
SVM Kernel Functions
1) Polynomial Kernel : It is a more generalized
representation of the linear kernel. It is not as preferred as
other kernel functions as it is less efficient and accurate.
 For degree-d polynomials, the polynomial kernel is defined as
K(x, y) = (xᵀy + c)^d
where x and y are vectors in the input space, i.e. vectors of features computed from training or test samples, and c ≥ 0 is a free parameter.
 When d = 1 it becomes the linear kernel.
SVM Kernel Functions
2) Gaussian Radial Basis Function (RBF): It is one of the most preferred and widely used kernel functions in SVM. It is usually chosen for non-linear data. It helps to make a proper separation when there is no prior knowledge of the data.
 Gaussian Radial Basis formula:
K(x, x') = exp(−γ ‖x − x'‖²)
 Gamma (γ) is a positive hyperparameter that you provide manually in the code; in practice it is often chosen between 0 and 1, and 0.1 is a commonly used value.
 When γ = 1/(2σ²), this kernel is called the Gaussian kernel (σ² is the variance).
 Gamma decides how much curvature we want in the decision boundary.
SVM Kernel Functions
3) Sigmoid Kernel: It is mostly preferred for neural networks. This kernel function is similar to a two-layer perceptron model of a neural network, and it works like an activation function for neurons.
 Sigmoid kernel function:
K(x, y) = tanh(α xᵀy + c)
 There are two adjustable parameters in this kernel: the slope α (alpha) and the constant c (intercept).
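For reference, the three kernels just described can be written directly as small NumPy functions (a sketch; the parameter defaults, such as gamma = 0.1, simply echo the values mentioned above):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=3):
    """Polynomial kernel K(x, y) = (x.y + c)^d; d = 1 gives the linear kernel."""
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.1):
    """Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    """Sigmoid kernel K(x, y) = tanh(alpha * x.y + c)."""
    return np.tanh(alpha * np.dot(x, y) + c)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```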
Advantages of SVM
 SVM works relatively well when there is a clear margin
of separation between classes.
 SVM is more effective in high dimensional spaces.
 SVM is relatively memory efficient
 SVM can be used to solve both classification and
regression problems
Disadvantages of SVM
 SVM algorithm is not suitable for large data sets.
 SVM does not perform very well when the data set has
more noise i.e. target classes are overlapping.
 Choosing an appropriate kernel function is difficult: selecting a kernel that handles the non-linear data well is not an easy task.
 Choice of kernel function also influences the performance.
 Non-linear kernels produce a dense kernel matrix, which causes long CPU time for computation.
