DS Unit-IV

Linear regression

Linear regression is a data analysis technique that predicts the
value of unknown data by using another related and known data
value. It mathematically models the unknown or dependent
variable and the known or independent variable as a linear
equation.
Why is linear regression important?

Linear regression models are relatively simple and provide an
easy-to-interpret mathematical formula to generate predictions. Linear
regression is an established statistical technique and applies easily to
software and computing. Businesses use it to reliably and predictably
convert raw data into business intelligence and actionable insights.
Scientists in many fields, including biology and the behavioral,
environmental, and social sciences, use linear regression to conduct
preliminary data analysis and predict future trends. Many data science
methods, such as machine learning and artificial intelligence, use linear
regression to solve complex problems.
How does linear regression work?
A simple linear regression technique fits a straight line relating two data variables, x and y.
As the independent variable, x is plotted along the horizontal axis. Independent variables are also called
explanatory variables or predictor variables. The dependent variable, y, is plotted on the vertical axis.
You can also refer to y values as response variables or predicted variables.
Steps in linear regression
For this overview, consider the simplest form of the line equation between y and x: y = m*x + c,
where m and c are constants for all possible values of x and y. So, for example, suppose that the input
dataset for (x,y) was (1,5), (2,8), and (3,11). To identify the linear regression equation, you would take
the following steps:

1. Plot a straight line through the first point, (1,5).
2. Keep adjusting the slope and position of the straight line against the new values (2,8) and (3,11) until all the values fit.
3. Identify the linear regression equation as y = 3*x + 2, where the slope m is 3 and the intercept c is 2.
4. Extrapolate or predict that y is 14 when x is 4, since 3*4 + 2 = 14.
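As a minimal sketch, the same steps can be reproduced in R with the built-in lm() function (used again later in this unit); the variable names here are illustrative:

# Example dataset from the steps above
x <- c(1, 2, 3)
y <- c(5, 8, 11)

# Fit a simple linear regression; lm() recovers intercept 2 and slope 3
fit <- lm(y ~ x)
coef(fit)   # (Intercept) = 2, x = 3

# Extrapolate: predict y when x = 4 (expected: 3*4 + 2 = 14)
predict(fit, data.frame(x = 4))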
What is linear regression in machine learning?

In machine learning, computer programs called algorithms analyze large
datasets and work backward from that data to calculate the linear
regression equation. Data scientists first train the algorithm on known or
labeled datasets and then use the algorithm to predict unknown values.
Real-life data is more complicated than the previous example. That is
why linear regression analysis must mathematically modify or transform
the data values to meet the following four assumptions.
• Linear relationship
• Residual independence
• Normality
• Homoscedasticity
Linear relationship

A linear relationship must exist between the independent and dependent
variables. To determine this relationship, data scientists create a scatter
plot of the x and y values to see whether they fall
along a straight line. If not, you can apply nonlinear functions such as
square root or log to mathematically create a linear relationship
between the two variables.
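A short sketch of this check in R, assuming x and y are numeric vectors already in the workspace:

# Scatter plot of the raw values to look for a straight-line pattern
plot(x, y, main = "y vs. x")

# If the pattern curves, a log (or square-root) transform may linearize it
plot(x, log(y), main = "log(y) vs. x")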
Residual independence

Data scientists use residuals to measure prediction accuracy. A residual
is the difference between the observed data and the predicted value.
Residuals must not have an identifiable pattern between them. For
example, you don't want the residuals to grow larger with time. You can
use different mathematical tests, like the Durbin-Watson test, to
determine residual independence. You can use dummy variables to account for
systematic variation in the data, such as seasonal effects.
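As a minimal sketch, the Durbin-Watson test is available in R through the lmtest package (this assumes a fitted model object fit, such as the one from the earlier example, already exists):

# Durbin-Watson test for autocorrelation in the residuals
# (install the package once with install.packages("lmtest"))
library(lmtest)
dwtest(fit)   # a statistic near 2 suggests independent residuals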
Normality

Graphing techniques like Q-Q plots determine whether the residuals are
normally distributed. The residuals should fall along a diagonal line in
the center of the graph. If the residuals are not normally distributed, you can test
the data for random outliers or values that are not typical. Removing the
outliers or performing nonlinear transformations can fix the issue.
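A quick sketch of this check in R, again assuming a fitted model object fit:

# Q-Q plot of the residuals; points near the diagonal line suggest normality
qqnorm(resid(fit))
qqline(resid(fit))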
Homoscedasticity
Homoscedasticity assumes that residuals have a constant variance or standard
deviation from the mean for every value of x. If not, the results of the analysis
might not be accurate. If this assumption is not met, you might have to change
the dependent variable. Because variance occurs naturally in large datasets, it
makes sense to change the scale of the dependent variable. For example,
instead of using the population size to predict the number of fire stations in a
city, you might use population size to predict the number of fire stations per
capita.
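A common visual check for homoscedasticity in R, assuming a fitted model object fit:

# Residuals vs. fitted values; a constant vertical spread (no funnel shape)
# is consistent with homoscedasticity
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)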
What are the types of linear regression?

• Simple linear regression

• Multiple linear regression

• Logistic regression
Simple linear regression

Simple linear regression is defined by the linear function:

• Y = β0 + β1*X + ε
• β0 and β1 are two unknown constants: β0 is the intercept and β1 is the regression
slope, whereas ε (epsilon) is the error term.
You can use simple linear regression to model the relationship between
two variables, such as these:
• Rainfall and crop yield
• Age and height in children
• Temperature and expansion of the metal mercury in a thermometer
Multiple linear regression
In multiple linear regression analysis, the dataset contains one
dependent variable and multiple independent variables. The linear
regression line function changes to include more factors as follows:
• Y = β0 + β1X1 + β2X2 + … + βnXn + ε
• As the number of predictor variables increases, the number of β coefficients
increases correspondingly.
Multiple linear regression models multiple variables and their impact on
an outcome:
• Rainfall, temperature, and fertilizer use on crop yield
• Diet and exercise on heart disease
• Wage growth and inflation on home loan rates
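As an illustrative sketch in R, using a small hypothetical crop-yield data frame (the numbers and column names are invented for demonstration):

# Hypothetical data: crop yield with three predictors
crops <- data.frame(
  yield      = c(3.2, 4.1, 4.8, 5.5, 6.0),
  rainfall   = c(20, 35, 40, 55, 60),
  temp       = c(15, 18, 20, 22, 25),
  fertilizer = c(1.0, 1.5, 2.0, 2.5, 3.0)
)

# Multiple linear regression: one β coefficient per predictor
fit_multi <- lm(yield ~ rainfall + temp + fertilizer, data = crops)
summary(fit_multi)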
Logistic regression

Data scientists use logistic regression to measure the probability of an
event occurring. The prediction is a value between 0 and 1, where 0
indicates an event that is unlikely to happen, and 1 indicates a maximum
likelihood that it will happen. Logistic equations use logarithmic
functions to compute the regression line.
These are some examples:
• The probability of a win or loss in a sporting match
• The probability of passing or failing a test
• The probability of an image being a fruit or an animal
Advantages of Simple Linear Regression in R:

1. Easy to implement: R provides built-in functions, such as lm(), to perform Simple Linear
Regression quickly and efficiently.
2. Easy to interpret: Simple Linear Regression models are easy to interpret, as they model a
linear relationship between two variables.
3. Useful for prediction: Simple Linear Regression can be used to make predictions about the
dependent variable based on the independent variable.
4. Provides a measure of goodness of fit: Simple Linear Regression provides a measure of how
well the model fits the data, such as the R-squared value.
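For instance, a short sketch using R's built-in cars dataset shows both the lm() fit and the R-squared measure mentioned above:

# Simple linear regression on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

coef(fit)                # easy-to-interpret intercept and slope
summary(fit)$r.squared   # proportion of variance explained (goodness of fit)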
Disadvantages of Simple Linear Regression in R:

1. Assumes linear relationship: Simple Linear Regression assumes a linear relationship
between the variables, which may not be true in all cases.
2. Sensitive to outliers: Simple Linear Regression is sensitive to outliers, which can
significantly affect the model coefficients and predictions.
3. Assumes independence of observations: Simple Linear Regression assumes that the
observations are independent, which may not be true in some cases, such as time series data.
4. Cannot handle non-numeric data: Simple Linear Regression can only handle numeric data
and cannot be used for categorical or non-numeric data.
Overall, Simple Linear Regression is a useful tool for modeling the relationship between
two variables, but it has some limitations and assumptions that need to be carefully
considered.
Logistic regression

Logistic regression is a process of modeling the probability of a discrete outcome
given an input variable. The most common logistic regression models a binary
outcome: something that can take only two values, such as true/false or yes/no.

Logistic regression is another powerful supervised ML algorithm used for binary
classification problems (when the target is categorical). The best way to think about
logistic regression is that it is a linear regression but for classification problems.
Logistic regression essentially uses the logistic function defined below to model a
binary output variable:

f(x) = 1 / (1 + e^(-x))

In the logistic function equation, x is the input variable.

The primary difference between linear regression and logistic regression is that
logistic regression's range is bounded between 0 and 1. In addition, as opposed to
linear regression, logistic regression does not require a linear relationship between
the input and output variables. This is due to applying a nonlinear log transformation
to the odds ratio.
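A quick sketch of this function in R (the function name is my own, not from the text):

# The logistic (sigmoid) function maps any real input into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

logistic(c(-5, 0, 5))   # approximately 0.0067, 0.5000, 0.9933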


Types of logistic regression
• Binary logistic regression

• Multinomial logistic regression

• Ordinal logistic regression


Binary logistic regression:

In this approach, the response or dependent variable is dichotomous in nature, i.e., it has
only two possible outcomes (e.g., 0 or 1). Some popular examples of its use include
predicting if an e-mail is spam or not spam or if a tumor is malignant or not malignant.
Within logistic regression, this is the most commonly used approach, and more generally,
it is one of the most common classifiers for binary classification.
Multinomial logistic regression:

In this type of logistic regression model, the dependent variable has three or more possible
outcomes; however, these values have no specified order. For example, movie studios
want to predict what genre of film a moviegoer is likely to see to market films more
effectively. A multinomial logistic regression model can help the studio to determine the
strength of influence a person's age, gender, and dating status may have on the type of film
that they prefer. The studio can then orient an advertising campaign of a specific movie
toward a group of people likely to go see it.
Ordinal logistic regression:

This type of logistic regression model is leveraged when the response
variable has three or more possible outcomes, but in this case, these values
do have a defined order. Examples of ordinal responses include grading
scales from A to F or rating scales from 1 to 5.
Logistic regression is a method we can use to fit a regression model
when the response variable is binary.
Logistic regression uses a method known as maximum likelihood
estimation to find an equation of the following form:

log[p(X) / (1-p(X))] = β0 + β1X1 + β2X2 + … + βpXp

where:

● Xj: the jth predictor variable
● βj: the coefficient estimate for the jth predictor variable
Step 1: Load the Data
For this example, we’ll use the Default dataset from the ISLR package.
We can use the following code to load and view a summary of the dataset:
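A sketch of this step (assuming the ISLR package is installed):

# Load the ISLR package (install once with install.packages("ISLR"))
library(ISLR)

# View a summary of the Default dataset
summary(Default)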
● default: Indicates whether or not an individual defaulted.

● student: Indicates whether or not an individual is a student.

● balance: Average balance carried by an individual.

● income: Income of the individual.


Step 2: Create Training and Test Samples
Next, we’ll split the dataset into a training set to train the model on and a
testing set to test the model on.
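One common way to do this in R; the 70/30 split proportion and the seed are assumptions for reproducibility, not values given in the text:

# Reproducible random split into training (~70%) and test (~30%) sets
set.seed(1)
in_train <- sample(c(TRUE, FALSE), nrow(Default),
                   replace = TRUE, prob = c(0.7, 0.3))
train <- Default[in_train, ]
test  <- Default[!in_train, ]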
Step 3: Fit the Logistic Regression Model
Next, we'll use the glm() (generalized linear model) function and specify
family = "binomial" so that R fits a logistic regression model to the dataset:
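A sketch of this step:

# Fit the logistic regression model on the training data
model <- glm(default ~ student + balance + income,
             family = "binomial", data = train)

summary(model)   # coefficients are reported on the log-odds scale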
The coefficients in the output indicate the average change in log odds of
defaulting. For example, a one unit increase in balance is associated with an
average increase of 0.005988 in the log odds of defaulting.

The p-values in the output also give us an idea of how effective each predictor
variable is at predicting the probability of default:

● P-value of student status: 0.0843
● P-value of balance: <0.0001
● P-value of income: 0.4304
Step 4: Use the Model to Make Predictions
Once we’ve fit the logistic regression model, we can then use it to make
predictions about whether or not an individual will default based on their student
status, balance, and income:
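A sketch of this step, using the two hypothetical individuals described below:

# Hypothetical new individuals: same balance and income, different student status
new <- data.frame(balance = 1400, income = 2000, student = c("Yes", "No"))

# type = "response" returns predicted probabilities of default
predict(model, newdata = new, type = "response")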
An individual with a balance of $1,400, an income of $2,000, and a
student status of "Yes" has a predicted probability
of defaulting of 0.0273. Conversely, an individual with the same
balance and income but with a student status of "No" has a
probability of defaulting of 0.0439.
