DS Unit-IV

Linear regression

Linear regression is a data analysis technique that predicts the
value of unknown data by using another related and known data
value. It mathematically models the unknown or dependent
variable and the known or independent variable as a linear
equation.
Why is linear regression important?

Linear regression models are relatively simple and provide an
easy-to-interpret mathematical formula to generate predictions. Linear
regression is an established statistical technique and applies easily to
software and computing. Businesses use it to reliably and predictably
convert raw data into business intelligence and actionable insights.
Scientists in many fields, including biology and the behavioral,
environmental, and social sciences, use linear regression to conduct
preliminary data analysis and predict future trends. Many data science
methods, such as machine learning and artificial intelligence, use linear
regression to solve complex problems.
How does linear regression work?
A simple linear regression technique fits a straight line relating two data variables, x and y.
As the independent variable, x is plotted along the horizontal axis. Independent variables are also called
explanatory variables or predictor variables. The dependent variable, y, is plotted on the vertical axis.
You can also refer to y values as response variables or predicted variables.
Steps in linear regression
For this overview, consider the simplest form of the line equation between y and x: y = m*x + c,
where m and c are constants for all possible values of x and y. So, for example, suppose that the input
dataset for (x,y) was (1,5), (2,8), and (3,11). To identify the linear regression equation, you would take
the following steps:

1. Plot a straight line through the first point, (1,5).
2. Keep adjusting the slope and position of the straight line against the new values (2,8) and (3,11) until all the values fit.
3. Identify the linear regression equation as y = 3*x + 2, where the slope m is 3 and the intercept c is 2.
4. Extrapolate or predict that y is 14 when x is 4, since 3*4 + 2 = 14.
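As a minimal sketch, the same steps can be reproduced in R with the built-in lm() function (used again later in this unit); the variable names here are illustrative:

# Example dataset from the steps above
x <- c(1, 2, 3)
y <- c(5, 8, 11)

# Fit a simple linear regression; lm() recovers intercept 2 and slope 3
fit <- lm(y ~ x)
coef(fit)   # (Intercept) = 2, x = 3

# Extrapolate: predict y when x = 4 (expected: 3*4 + 2 = 14)
predict(fit, data.frame(x = 4))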
What is linear regression in machine learning?

In machine learning, computer programs called algorithms analyze large
datasets and work backward from that data to calculate the linear
regression equation. Data scientists first train the algorithm on known or
labeled datasets and then use the algorithm to predict unknown values.
Real-life data is more complicated than the previous example. That is
why linear regression analysis must mathematically modify or transform
the data values to meet the following four assumptions.
• Linear relationship
• Residual independence
• Normality
• Homoscedasticity
Linear relationship

A linear relationship must exist between the independent and dependent
variables. To determine this relationship, data scientists create a scatter
plot of the x and y values to see whether they fall
along a straight line. If not, you can apply nonlinear functions such as
square root or log to mathematically create a linear relationship
between the two variables.
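A short sketch of this check in R, assuming x and y are numeric vectors already in the workspace:

# Scatter plot of the raw values to look for a straight-line pattern
plot(x, y, main = "y vs. x")

# If the pattern curves, a log (or square-root) transform may linearize it
plot(x, log(y), main = "log(y) vs. x")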
Residual independence

Data scientists use residuals to measure prediction accuracy. A residual
is the difference between the observed data and the predicted value.
Residuals must not have an identifiable pattern between them. For
example, you don't want the residuals to grow larger with time. You can
use different mathematical tests, like the Durbin-Watson test, to
determine residual independence. You can use dummy variables to account for
systematic variation in the data, such as seasonal effects.
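As a minimal sketch, the Durbin-Watson test is available in R through the lmtest package (this assumes a fitted model object fit, such as the one from the earlier example, already exists):

# Durbin-Watson test for autocorrelation in the residuals
# (install the package once with install.packages("lmtest"))
library(lmtest)
dwtest(fit)   # a statistic near 2 suggests independent residuals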
Normality

Graphing techniques like Q-Q plots determine whether the residuals are
normally distributed. The residuals should fall along a diagonal line in
the center of the graph. If the residuals are not normally distributed, you can test
the data for random outliers or values that are not typical. Removing the
outliers or performing nonlinear transformations can fix the issue.
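A quick sketch of this check in R, again assuming a fitted model object fit:

# Q-Q plot of the residuals; points near the diagonal line suggest normality
qqnorm(resid(fit))
qqline(resid(fit))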
Homoscedasticity
Homoscedasticity assumes that residuals have a constant variance or standard
deviation from the mean for every value of x. If not, the results of the analysis
might not be accurate. If this assumption is not met, you might have to change
the dependent variable. Because variance occurs naturally in large datasets, it
makes sense to change the scale of the dependent variable. For example,
instead of using the population size to predict the number of fire stations in a
city, you might use population size to predict the number of fire stations per
capita.
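A common visual check for homoscedasticity in R, assuming a fitted model object fit:

# Residuals vs. fitted values; a constant vertical spread (no funnel shape)
# is consistent with homoscedasticity
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)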
What are the types of linear regression?

• Simple linear regression

• Multiple linear regression

• Logistic regression
Simple linear regression

Simple linear regression is defined by the linear function:

• Y = β0 + β1*X + ε
• β0 and β1 are two unknown constants: β0 is the intercept and β1 is the regression
slope, whereas ε (epsilon) is the error term.
You can use simple linear regression to model the relationship between
two variables, such as these:
• Rainfall and crop yield
• Age and height in children
• Temperature and expansion of the metal mercury in a thermometer
Multiple linear regression
In multiple linear regression analysis, the dataset contains one
dependent variable and multiple independent variables. The linear
regression line function changes to include more factors as follows:
• Y = β0 + β1X1 + β2X2 + … + βnXn + ε
• As the number of predictor variables increases, the number of β coefficients
increases correspondingly.
Multiple linear regression models multiple variables and their impact on
an outcome:
• Rainfall, temperature, and fertilizer use on crop yield
• Diet and exercise on heart disease
• Wage growth and inflation on home loan rates
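As an illustrative sketch in R, using a small hypothetical crop-yield data frame (the numbers and column names are invented for demonstration):

# Hypothetical data: crop yield with three predictors
crops <- data.frame(
  yield      = c(3.2, 4.1, 4.8, 5.5, 6.0),
  rainfall   = c(20, 35, 40, 55, 60),
  temp       = c(15, 18, 20, 22, 25),
  fertilizer = c(1.0, 1.5, 2.0, 2.5, 3.0)
)

# Multiple linear regression: one β coefficient per predictor
fit_multi <- lm(yield ~ rainfall + temp + fertilizer, data = crops)
summary(fit_multi)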
Logistic regression

Data scientists use logistic regression to measure the probability of an
event occurring. The prediction is a value between 0 and 1, where 0
indicates an event that is unlikely to happen, and 1 indicates a maximum
likelihood that it will happen. Logistic equations use logarithmic
functions to compute the regression line.
These are some examples:
• The probability of a win or loss in a sporting match
• The probability of passing or failing a test
• The probability of an image being a fruit or an animal
Advantages of Simple Linear Regression in R:

1. Easy to implement: R provides built-in functions, such as lm(), to perform Simple Linear
Regression quickly and efficiently.
2. Easy to interpret: Simple Linear Regression models are easy to interpret, as they model a
linear relationship between two variables.
3. Useful for prediction: Simple Linear Regression can be used to make predictions about the
dependent variable based on the independent variable.
4. Provides a measure of goodness of fit: Simple Linear Regression provides a measure of how
well the model fits the data, such as the R-squared value.
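For instance, a short sketch using R's built-in cars dataset shows both the lm() fit and the R-squared measure mentioned above:

# Simple linear regression on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

coef(fit)                # easy-to-interpret intercept and slope
summary(fit)$r.squared   # proportion of variance explained (goodness of fit)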
Disadvantages of Simple Linear Regression in R:

1. Assumes linear relationship: Simple Linear Regression assumes a linear relationship
between the variables, which may not be true in all cases.
2. Sensitive to outliers: Simple Linear Regression is sensitive to outliers, which can
significantly affect the model coefficients and predictions.
3. Assumes independence of observations: Simple Linear Regression assumes that the
observations are independent, which may not be true in some cases, such as time series data.
4. Cannot handle non-numeric data: Simple Linear Regression can only handle numeric data
and cannot be used for categorical or non-numeric data.
Overall, Simple Linear Regression is a useful tool for modeling the relationship between
two variables, but it has some limitations and assumptions that need to be carefully
considered.
Logistic regression

Logistic regression is a process of modeling the probability of a discrete outcome
given an input variable. The most common logistic regression models a binary
outcome: something that can take only two values, such as true/false or yes/no.

Logistic regression is another powerful supervised ML algorithm used for binary
classification problems (when the target is categorical). The best way to think about
logistic regression is that it is a linear regression but for classification problems.
Logistic regression essentially uses the logistic function defined below to model a
binary output variable:

f(x) = 1 / (1 + e^(-x))

In the logistic function equation, x is the input variable.

The primary difference between linear regression and logistic regression is that
logistic regression's range is bounded between 0 and 1. In addition, as opposed to
linear regression, logistic regression does not require a linear relationship between
the input and output variables. This is due to applying a nonlinear log transformation
to the odds ratio.
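A quick sketch of this function in R (the function name is my own, not from the text):

# The logistic (sigmoid) function maps any real input into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

logistic(c(-5, 0, 5))   # approximately 0.0067, 0.5000, 0.9933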


Types of logistic regression
• Binary logistic regression

• Multinomial logistic regression

• Ordinal logistic regression


Binary logistic regression:

In this approach, the response or dependent variable is dichotomous in nature, i.e., it has
only two possible outcomes (e.g., 0 or 1). Some popular examples of its use include
predicting if an e-mail is spam or not spam or if a tumor is malignant or not malignant.
Within logistic regression, this is the most commonly used approach, and more generally,
it is one of the most common classifiers for binary classification.
Multinomial logistic regression:

In this type of logistic regression model, the dependent variable has three or more possible
outcomes; however, these values have no specified order. For example, movie studios
want to predict what genre of film a moviegoer is likely to see to market films more
effectively. A multinomial logistic regression model can help the studio to determine the
strength of influence a person's age, gender, and dating status may have on the type of film
that they prefer. The studio can then orient an advertising campaign of a specific movie
toward a group of people likely to go see it.
Ordinal logistic regression:

This type of logistic regression model is leveraged when the response
variable has three or more possible outcomes, but in this case, these values
do have a defined order. Examples of ordinal responses include grading
scales from A to F or rating scales from 1 to 5.
Logistic regression is a method we can use to fit a regression model
when the response variable is binary.
Logistic regression uses a method known as maximum likelihood
estimation to find an equation of the following form:

log[p(X) / (1-p(X))] = β0 + β1X1 + β2X2 + … + βpXp

where:

● Xj: the jth predictor variable
● βj: the coefficient estimate for the jth predictor variable
Step 1: Load the Data
For this example, we’ll use the Default dataset from the ISLR package.
We can use the following code to load and view a summary of the dataset:
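A sketch of this step (assuming the ISLR package is installed):

# Load the ISLR package (install once with install.packages("ISLR"))
library(ISLR)

# View a summary of the Default dataset
summary(Default)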
● default: Indicates whether or not an individual defaulted.

● student: Indicates whether or not an individual is a student.

● balance: Average balance carried by an individual.

● income: Income of the individual.


Step 2: Create Training and Test Samples
Next, we’ll split the dataset into a training set to train the model on and a
testing set to test the model on.
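One common way to do this in R; the 70/30 split proportion and the seed are assumptions for reproducibility, not values given in the text:

# Reproducible random split into training (~70%) and test (~30%) sets
set.seed(1)
in_train <- sample(c(TRUE, FALSE), nrow(Default),
                   replace = TRUE, prob = c(0.7, 0.3))
train <- Default[in_train, ]
test  <- Default[!in_train, ]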
Step 3: Fit the Logistic Regression Model
Next, we'll use the glm() (generalized linear model) function and specify
family = "binomial" so that R fits a logistic regression model to the dataset:
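A sketch of this step:

# Fit the logistic regression model on the training data
model <- glm(default ~ student + balance + income,
             family = "binomial", data = train)

summary(model)   # coefficients are reported on the log-odds scale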
The coefficients in the output indicate the average change in log odds of
defaulting. For example, a one unit increase in balance is associated with an
average increase of 0.005988 in the log odds of defaulting.

The p-values in the output also give us an idea of how effective each predictor
variable is at predicting the probability of default:

● P-value of student status: 0.0843
● P-value of balance: <0.0001
● P-value of income: 0.4304
Step 4: Use the Model to Make Predictions
Once we’ve fit the logistic regression model, we can then use it to make
predictions about whether or not an individual will default based on their student
status, balance, and income:
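A sketch of this step, using the two hypothetical individuals described below:

# Hypothetical new individuals: same balance and income, different student status
new <- data.frame(balance = 1400, income = 2000, student = c("Yes", "No"))

# type = "response" returns predicted probabilities of default
predict(model, newdata = new, type = "response")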
An individual with a balance of $1,400, an income of $2,000, and a
student status of "Yes" has a predicted probability
of defaulting of 0.0273. Conversely, an individual with the same
balance and income but with a student status of "No" has a
probability of defaulting of 0.0439.
