Module 4


Module 4: Chapter 14.

Simple Linear
Regression
Prepared by,
Ashritha K P
Asst. Professor
Sahyadri College of Engineering and Management
Mangaluru
Linear Model

 The linear model is a mathematical model that attempts to translate natural phenomena into comprehensible and recurring parameters.
 A prerequisite for the application of a linear model is that there is a linear interdependence between at least one independent variable and one dependent variable.
 Example: if body height increases, body weight will increase as well.
The model: Example

 Consider a DataSciencester user's number of friends and the amount of time the user spends on the site each day.
 Let's assume that having more friends causes people to spend more time on the site.
 Let's build a model describing this relationship.
 Since we find a pretty strong linear relationship, a natural place to start is a linear model.
 Let us hypothesize that there are constants α (alpha) and β (beta) such that:
y_i = β * x_i + α + ε_i
where y_i is the number of minutes user i spends on the site daily, x_i is the user's number of friends, and ε_i is a (hopefully small) error term.
 Assuming we've determined such an alpha and beta, we then make predictions simply with:
predict(α, β, x_i) = β * x_i + α

 Any choice of alpha and beta gives us a predicted output for each input x_i. Since we know the actual output y_i, we can compute the error for each pair:
error(α, β, x_i, y_i) = predict(α, β, x_i) − y_i
 The least squares solution is to choose the alpha and beta that make sum_of_sqerrors (the sum of these squared errors over the whole dataset) as small as possible.
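A minimal Python sketch of these pieces (illustrative code, not from the slides):

def predict(alpha, beta, x_i):
    """the prediction beta * x_i + alpha for input x_i"""
    return beta * x_i + alpha

def error(alpha, beta, x_i, y_i):
    """the error from predicting beta * x_i + alpha when the actual value is y_i"""
    return predict(alpha, beta, x_i) - y_i

def sum_of_sqerrors(alpha, beta, x, y):
    """the total squared error over all (x_i, y_i) pairs"""
    return sum(error(alpha, beta, x_i, y_i) ** 2
               for x_i, y_i in zip(x, y))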
 The choice of beta means that when the input value increases by
standard_deviation(x), the prediction then increases by correlation(x, y) *
standard_deviation(y).
 The choice of alpha simply says that when we see the average value of the
independent variable x, we predict the average value of the dependent
variable y.
 In the case where x and y are perfectly correlated, a one-standard-deviation
increase in x results in a one-standard-deviation-of-y increase in the
prediction.
 When they’re perfectly anticorrelated, the increase in x results in a decrease
in the prediction.
 When the correlation is 0, beta is 0, which means that changes in x don’t
affect the prediction at all.
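The least squares alpha and beta have a closed form. A minimal sketch using Python's statistics module (the correlation helper is written out so the snippet is self-contained; names are illustrative):

from statistics import mean, stdev

def correlation(x, y):
    """sample correlation between x and y"""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((x_i - x_bar) * (y_i - y_bar)
                     for x_i, y_i in zip(x, y)) / (len(x) - 1)
    return covariance / (stdev(x) * stdev(y))

def least_squares_fit(x, y):
    """the alpha and beta that minimize sum_of_sqerrors"""
    beta = correlation(x, y) * stdev(y) / stdev(x)
    alpha = mean(y) - beta * mean(x)
    return alpha, beta

Applied to the friends/minutes data, this fit is what produces the values reported next.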
 This gives values of alpha = 22.95 and beta = 0.903.
 So our model says that we expect a user with n friends to spend 22.95 + n * 0.903 minutes on the site each day.
 That is, we predict that a user with no friends on DataSciencester would still spend about 23 minutes a day on the site, and for each additional friend we expect a user to spend almost a minute more on the site each day.
R-Squared

 The coefficient of determination, or R², is a measure that provides information about the goodness of fit of a model.
 It is a statistical measure of how well the regression line approximates the actual data.
 R² = 1 − (sum of squared residuals / total sum of squares)
 The higher the number, the better our model fits the data. Here we calculate an R-squared of 0.329, which tells us that our model is only sort of okay at fitting the data, and that clearly there are other factors at play.
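As a sketch, reusing mean and sum_of_sqerrors from the earlier snippets:

def total_sum_of_squares(y):
    """total squared variation of the y_i around their mean"""
    return sum((y_i - mean(y)) ** 2 for y_i in y)

def r_squared(alpha, beta, x, y):
    """the fraction of variation in y captured by the model"""
    return 1.0 - (sum_of_sqerrors(alpha, beta, x, y)
                  / total_sum_of_squares(y))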
Chapter 15
Multiple Regression
Assume that we collected additional data about DataSciencester users:
 how many hours each of your users works each day, and whether they have a PhD. You'd like to use this additional data to improve your model.
 Accordingly, you hypothesize a linear model with more independent variables:
minutes = α + β₁ * friends + β₂ * work_hours + β₃ * phd + ε
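With several independent variables it is convenient to represent each input as a vector x_i = [1, friends_i, work_hours_i, phd_i], where the leading 1 pairs with α, so that the prediction becomes a dot product. A minimal sketch (illustrative names):

def predict(x_i, beta):
    """dot product of input row x_i = [1, friends, work_hours, phd]
    with coefficients beta = [alpha, beta_1, beta_2, beta_3]"""
    return sum(x_ij * beta_j for x_ij, beta_j in zip(x_i, beta))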
There are a couple of further assumptions that are required for this to make sense
 The first assumption is that the columns of x are linearly independent: there is no way to write any one of them as a weighted sum of some of the others (i.e., no exact linear relationship among the x attributes).
 Example: imagine we had an extra field num_acquaintances in our data that for every user was exactly equal to num_friends.
 The second important assumption is that the columns of x are all
uncorrelated with the errors ε.
 Independence of Errors: This assumption asserts that the errors, or
residuals, from the regression model are independent of each other. In
other words, the error term for one observation should not be related
to the error term for another observation.
 Multicollinearity: two or more predictor variables are highly correlated with each other.
 Multicollinearity can make it difficult to determine the true effect of each predictor variable on the dependent variable, because the effects of the correlated variables are confounded.
 Two cases in our example:
 People who work more hours spend less time on the site.
 People with more friends tend to work more hours.
Example
1. Negative Relationship between Work Hours and Time on Site: Individuals who work
more hours tend to spend less time on the site. This implies a negative relationship
between the "work hours" predictor variable and the dependent variable "time on
site." It suggests that as the number of work hours increases, the time spent on the
site decreases.
2. Positive Relationship between Friends and Work Hours: Individuals with more friends
tend to work more hours. This indicates a positive relationship between the "friends"
predictor variable and the "work hours" predictor variable. It suggests that as the
number of friends increases, the number of work hours also increases.
 To address multicollinearity in this scenario, one might consider different strategies:

1. Variable Transformation or Creation: Instead of including both "friends" and "work hours" as separate predictor variables, one could create a composite variable that captures both aspects (e.g., a variable representing social engagement). This can help reduce multicollinearity.

2. Variable Selection: Choose which variables to include in the model based on theoretical considerations or prior knowledge. If "friends" and "work hours" are highly correlated with each other, it may be necessary to select only one of them to avoid multicollinearity issues.

3. Regularization Techniques: Use regularization techniques such as ridge regression or lasso regression, which penalize the magnitude of regression coefficients, effectively reducing multicollinearity.
 Regression is a fundamentally different kind of task from those discussed before, such as classification and clustering. Regression aims to predict a number taken from a continuum, instead of answering "yes" or "no" or telling to which group an input most likely belongs.

 In regression analysis, a digression typically refers to a deviation or departure from the main trend or relationship being studied.
 While regression aims to model the relationship between one or more independent variables and a dependent variable, a digression occurs when there are factors or variables not accounted for by the model that influence the dependent variable.
The Bootstrap

 Imagine that we have a sample of n data points, generated by some (unknown to us) distribution.
 In this case we can use statistical analysis techniques to understand the data.
 If we could repeatedly get new samples, we could compute the medians of many samples and look at the distribution of those medians.
 What if it is not feasible to get new data samples regularly?
 How can we identify the underlying distribution?
 In that case we can bootstrap new datasets by choosing n data points with replacement from our data.
 Bootstrap resampling is a powerful statistical technique used when obtaining new
samples is not feasible.
1. Sampling with Replacement: In bootstrap resampling, you create new datasets by
randomly selecting data points from your original dataset with replacement. This
means that each data point has an equal chance of being selected for the new sample,
and it's possible for the same data point to be selected multiple times or not at all.
2. Creating Bootstrap Samples: By repeating the sampling process multiple times, you
generate a set of bootstrap samples. These bootstrap samples are effectively new
datasets that are similar to the original data but have some variability due to the
random sampling process.
3. Computing Statistics: Once you have your bootstrap samples, you can compute the
statistic of interest (e.g., mean, median, variance) for each sample. For example, if
you're interested in estimating the median, you would calculate the median for each
bootstrap sample.
4. Analyzing the Distribution: By examining the distribution of the statistic across the
bootstrap samples, you can gain insights into its variability and uncertainty. This
distribution approximates the sampling distribution of the statistic and can be used to
estimate confidence intervals or assess the stability of your estimates.
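A minimal Python sketch of this procedure (function names are illustrative):

import random
from statistics import median

def bootstrap_sample(data):
    """randomly sample len(data) elements with replacement"""
    return [random.choice(data) for _ in data]

def bootstrap_statistic(data, stats_fn, num_samples):
    """evaluate stats_fn on num_samples bootstrap samples of data"""
    return [stats_fn(bootstrap_sample(data)) for _ in range(num_samples)]

# e.g., the distribution of the median across 100 resamples:
# medians = bootstrap_statistic(data, median, 100)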
Standard Errors of Regression Coefficients
 We can use the bootstrap approach to estimate the standard errors of our regression coefficients.
 We repeatedly take a bootstrap_sample of our data and estimate beta based
on that sample.
 If the coefficient corresponding to one of the independent variables (say,
num_friends) doesn’t vary much across samples, then we can be confident
that our estimate is relatively tight.
 If the coefficient varies greatly across samples, then we can’t be at all
confident in our estimate.
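A sketch of that procedure, assuming a hypothetical estimate_beta(x, y) that fits the regression on a sample and returns its list of coefficients:

import random
from statistics import stdev

def bootstrap_standard_errors(x, y, estimate_beta, num_samples=100):
    """estimate_beta(x, y) is assumed to return a list of coefficients"""
    estimates = []
    for _ in range(num_samples):
        # resample (x_i, y_i) pairs together, with replacement
        indices = [random.randrange(len(x)) for _ in x]
        sample_x = [x[i] for i in indices]
        sample_y = [y[i] for i in indices]
        estimates.append(estimate_beta(sample_x, sample_y))
    # the standard error of coefficient j is the spread of its bootstrap estimates
    return [stdev(est[j] for est in estimates)
            for j in range(len(estimates[0]))]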
Regularization

 In practice, you'd often like to apply linear regression to datasets with large numbers of variables. Two problems arise here:
 First, the more variables you use, the more likely you are to overfit your model to the training set.
 Second, the more nonzero coefficients you have, the harder it is to make sense of them (or to compute with them).
 Regularization is an approach in which we add to the error term a penalty
that gets larger as beta gets larger. We then minimize the combined error and
penalty. The more importance we place on the penalty term, the more we
discourage large coefficients.

 There are two common types of regularization technique:

 Ridge Regression (L2 regularization): in ridge regression, we add a penalty proportional to the sum of the squares of the beta_i (except that typically we don't penalize beta_0, the constant term).
 Lasso Regression (L1 regularization): it adds a penalty term proportional to the sum of the absolute values of the model's coefficients.
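As a sketch, the two penalty terms might look like this (penalty_weight here controls how much we penalize, and beta[0] is the unpenalized constant term):

def ridge_penalty(beta, penalty_weight):
    """L2 penalty: proportional to the sum of squared coefficients, skipping beta[0]"""
    return penalty_weight * sum(beta_j ** 2 for beta_j in beta[1:])

def lasso_penalty(beta, penalty_weight):
    """L1 penalty: proportional to the sum of absolute coefficients, skipping beta[0]"""
    return penalty_weight * sum(abs(beta_j) for beta_j in beta[1:])

Minimizing sum_of_sqerrors plus such a penalty is what discourages large coefficients.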
Logistic Regression

 Seminar
 Refer to the textbook.
Natural Language Processing

 Seminar
 Refer to the textbook.
