Statistical Testing and Prediction Using Linear Regression

By: Dheeraj Chavan, Ajeenkya D Y Patil University, Pune 411002.
Under the guidance of: Siddharth Nanda, ADYPU, Pune 411002.

Abstract: -
We assume familiarity with the basics of R, including variables, matrices, data
frames, and functions, and we will use the ggplot2 package to make
visualizations of our data. Some familiarity with the mathematical concepts of
a hypothesis test, a confidence interval, and a p-value will also be useful for
the tests we are going to implement. We will not go in depth into the
mathematical formulas or justifications behind our project, which is a simple
linear regression analysis. Instead, we will show how to implement the tests
in R. These are not the only methods R offers; once you know how to
implement these, it is easy to explore others.

One essential statistical method is to test for a difference between two
samples, or groups. For example, one might test whether a group of patients
who were given a medical treatment had better outcomes than a control
group. In this project, we are going to analyze a question of fuel efficiency as
it relates to some concepts of automobile design and performance. We will
work with a dataset built into R, called mtcars, from the 1974 issue of Motor
Trend magazine.

Introduction: -

Simple linear regression is a method for modelling the relationship between
two variables by fitting a linear equation to the observed data. One variable is
considered to be an explanatory variable (e.g. your income), and the other is
considered a dependent variable (e.g. your expenses).

Linear regression is a very useful technique in data science. Many people are
familiar with this type of model from graphs in which straight lines are
overlaid on scatterplots. It is used to predict values or to evaluate whether
there is a linear relationship between variables.
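
In symbols, the fitted model takes the form DV = β0 + β1 · IV + ε, where β0 is
the intercept, β1 is the slope, and ε is the random error; the hypothesis
H0: β1 = 0 tested later in this paper corresponds to this slope being zero.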

Literature Survey: -
1) Multiple responses for each level of the predictor

Simple linear regression assumes linearity when testing whether there is any
relationship between a dependent and an independent variable. In so doing, it
relies on single response values at each level of the predictor being good
representatives of their respective populations. Having multiple independent
replicates of each population, from which a mean can be calculated, therefore
provides better data from which to investigate a relationship. Furthermore,
the presence of replicates of the populations at each level of the predictor
variable enables us to establish whether or not the observed responses differ
significantly from their predicted values along a linear regression line, and
thus to investigate whether the population relationship is linear or follows
some other curvilinear relationship. Analysis of such data is equivalent to
ANOVA with polynomial contrasts.

2) Regression diagnostics

As part of linear model fitting, a suite of diagnostic measures can be
calculated, each of which provides an indication of the appropriateness of the
model for the data and of each point's influence on (and outlyingness from)
the resulting model.

Leverage: -

Leverage is a measure of how much of an outlier each point is in x-space (on
the x-axis) and thus only applies to the predictor variable. Values greater than
2p/n (where p is the number of model parameters, p = 2 for simple linear
regression, and n is the number of observations) should be investigated as
potential outliers.
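
As a minimal sketch (assuming a model fitted to the mtcars data used later in
this paper), leverages can be extracted with R's built-in hatvalues() function
and compared against the 2p/n cut-off:

> fit <- lm(mpg ~ wt, data = mtcars)  # p = 2 parameters, n = 32 observations
> h <- hatvalues(fit)                 # leverage of each observation
> h[h > 2 * 2 / nrow(mtcars)]         # flag points exceeding 2p/n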

Residuals: -

As the residuals are the differences between the observed and predicted
values along a vertical plane, they provide a measure of how much of an
outlier each point is in y-space (on the y-axis). Outliers are identified by
relatively large residual values. Residuals can also be standardized and
studentized; the latter can be compared across different models and follow a
t distribution, enabling the probability of obtaining a given residual to be
determined. The patterns of residuals against predicted y values (a residual
plot) are also useful diagnostic tools for investigating the linearity and
homogeneity of variance assumptions.
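
A brief sketch of these diagnostics in R, continuing with the fit object from
the leverage example above:

> rstandard(fit)                 # standardized residuals
> rstudent(fit)                  # studentized residuals (follow a t distribution)
> plot(fitted(fit), resid(fit))  # residual plot for linearity and variance checks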

Cook's D: -

Cook's D statistic is a measure of the influence of each point on the fitted
model (the estimated slope) and incorporates both leverage and residuals.
Values ≥ 1 (or even approaching 1) correspond to highly influential
observations.
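
In R, Cook's D values are available via the built-in cooks.distance() function
(again assuming the fit object from above):

> d <- cooks.distance(fit)  # influence of each observation on the fitted model
> d[d >= 1]                 # highly influential observations, if any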

3) Power and sample size determination

Although interpreted differently, the tests H0: ρ = 0 and H0: β1 = 0
(population correlation and slope, respectively, equal zero) are statistically
equivalent. Therefore, power analyses to determine the sample size required
for null hypothesis rejection are identical for both correlation and regression,
and are based on r (the correlation coefficient), which, from regression
analyses, can be obtained from the coefficient of determination.
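
As a sketch of such a power analysis (assuming the pwr package, which is not
used elsewhere in this paper), one can solve for the sample size needed to
detect a given correlation:

> library(pwr)
> pwr.r.test(r = 0.5, sig.level = 0.05, power = 0.8)  # solves for n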

4) Smoothers and local regression

Smoothers fit simple models (such as linear regression) through successive
localized subsets of the data to describe the nature of relationships between a
response variable and one or more predictor variables for each point in a data
cloud. Importantly, these techniques do not require the data to conform to a
particular global model structure (e.g. linear, exponential, etc.). Essentially,
smoothers generate a line (or surface) through the data cloud by replacing each
observation with a new value that is predicted from the subset of observations
immediately surrounding the original observation. The subset of neighbouring
observations surrounding an observation is known as a band or window, and
the larger the bandwidth, the greater the degree of smoothing. Smoothers can
be used as graphical representations as well as to model (local regression) the
nature of relationships between response and predictor variables in a manner
analogous to linear regression. Different smoothers differ in the manner by
which the predicted values are created; minimal R sketches of each are given
after the list below.

• running medians (or the less robust running means) generate predicted
values that are the medians of the responses in the bands surrounding each
observation.

• loess and lowess (locally weighted scatterplot smoothing) fit least
squares regression lines to successive subsets of the observations, weighted
according to their distance from the target observation, and thus depict
changes in the trends throughout the data cloud.

• kernel smoothers compute new smoothed y-values as the weighted
averages of points within a defined window (bandwidth) or neighbourhood of
the original x-values. Hence the bandwidth depends on the scale of the x-axis.
Weightings are determined by the type of kernel smoother specified.
Nevertheless, the larger the window, the greater the degree of smoothing.

• splines join together a series of polynomial fits that have been generated
after the entire data cloud is split up into a number of smaller windows, the
widths of which determine the smoothness of the resulting piecewise
polynomial.
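
A minimal sketch of these smoothers using base R functions, applied here to
the mtcars data used later in this paper:

> x <- mtcars$wt; y <- mtcars$mpg
> runmed(y[order(x)], k = 5)                      # running medians
> lowess(x, y, f = 2/3)                           # lowess
> loess(mpg ~ wt, data = mtcars, span = 0.75)     # loess
> ksmooth(x, y, kernel = "normal", bandwidth = 1) # kernel smoother
> smooth.spline(x, y)                             # smoothing spline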

5) Correlation and regression in R

Simple correlation and regression in R are performed using the cor.test()
and lm() functions. The mblm() and rlm() functions offer a range of
non-parametric and robust regression alternatives. The lowess and loess
functions are similar in that they both fit linear models through localizations
of the data. They differ in that loess uses weighted quadratic least squares
while lowess uses weighted linear least squares. They also differ in how they
determine the data spanning (the neighbourhood of points to which the
regression model is fitted), and in that loess smoothers can fit surfaces and
thus accommodate multivariate data.
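
For example, a correlation test and a simple regression on the mtcars data can
be run as:

> cor.test(~ wt + mpg, data = mtcars)   # tests H0: rho = 0
> summary(lm(mpg ~ wt, data = mtcars))  # tests H0: beta1 = 0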

6) Ordinary least squares (OLS)

• When there is no uncertainty in the IV (levels set, not measured), or
uncertainty in IV << uncertainty in DV
• When testing H0: β1 = 0 (no linear relationship between DV and IV)
• When generating predictive models from which new values of the DV are
predicted from given values of the IV. Since we rarely have estimates of
uncertainty in our new predictor values (and thus must assume there is no
uncertainty), predictions likewise must be based on predictive models
developed with the assumption of no uncertainty. Note that if there is
uncertainty in the IV, standard errors and confidence intervals are
inappropriate.
• When the distribution is not bivariate normal
> summary(lm(DV~IV, data))
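
As a concrete sketch of the predictive use described above, fuel efficiency can
be estimated for a 4500 lb car (wt = 4.5 in mtcars units, since wt is recorded
in thousands of pounds):

> fit <- lm(mpg ~ wt, data = mtcars)
> predict(fit, newdata = data.frame(wt = 4.5), interval = "confidence")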
7) Major axis (MA)
• When a good estimate of the population parameters (slope) is required AND
• When the distribution is bivariate normal (IV levels not set) AND
• When the error variance (uncertainty) in the IV and DV is equal (typically
because the variables are in the same units or dimensionless)
> library(biology)
> summary(lm.II(DV~IV, data, method='MA'))

8) Ranged major axis (Ranged MA)

• When a good estimate of the population parameters (slope) is required AND
• When the distribution is bivariate normal (IV levels not set) AND
• When the error variances are proportional to the variable variances AND
• When there are no outliers
> library(biology)
> # For variables whose theoretical minimum is arbitrary
> summary(lm.II(DV~IV, data, method='rMA'))
> # OR for variables whose theoretical minimum must be zero,
> # such as ratios, scaled variables & abundances
> summary(lm.II(DV~IV, data, method='rMA', zero=TRUE))

9) Reduced major axis (RMA) or standard major axis (SMA)

• When a good estimate of the population parameters (slope) is required AND
• When the distribution is bivariate normal (IV levels not set) AND
• When the error variances are proportional to the variable variances AND
• When there is a significant correlation r between the IV and DV
> library(biology)
> summary(lm.II(DV~IV, data, method='RMA'))
10) Assumptions
To maximize the reliability of null hypothesis tests, the following
assumptions apply:
(i) linearity - simple linear regression models a linear (straight-line)
relationship, and thus it is important to establish whether or not some other
curved relationship represents the trends better. Scatterplots are used to
explore linearity.
(ii) normality - the populations from which the single responses were
collected per level of the predictor variable are assumed to be normally
distributed. Boxplots of the response variable (and of the predictor, if it was
measured rather than set) should be used to explore normality.
(iii) homogeneity of variance - the populations from which the single
responses were collected per level of the predictor variable are assumed to
be equally varied. With only a single representative of each population per
level of the predictor variable, this can only be explored by examining the
spread of responses around the fitted regression line. In particular, increasing
spread along the regression line would suggest a relationship between the
population mean and variance (which must be independent to ensure
unbiased parameter estimates). This can also be diagnosed with a residual
plot.
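
These assumptions can be explored graphically in R, for example (a sketch
using the mtcars data):

> plot(mpg ~ wt, data = mtcars)      # scatterplot to assess linearity
> boxplot(mtcars$mpg)                # boxplot to assess normality of the response
> plot(lm(mpg ~ wt, data = mtcars))  # built-in diagnostics, incl. residual plot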

Inputs and outputs: -

We will test whether the fuel efficiency of a car is related to its
weight. But what if we want to turn this relationship into a prediction: for
instance, what would be the fuel efficiency of a 4500 pound car?

We can do this by fitting a linear model, or linear regression, which is done in R
with the lm function. We will save the linear model to a variable we call fit.
Input= data("mtcars")
This loads the data into your environment. You can visualize it with the View
function:
View(mtcars)
head(mtcars)
Output= (the first rows of the mtcars data frame)

We can access one of these columns using the dollar sign, for example:
Input= mtcars$mpg
Output= (the mpg column as a numeric vector)

Input= mtcars$wt
Output= (the wt column as a numeric vector)
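
As planned above, we now save the linear model to the variable fit (a minimal
sketch; the printed summary itself is not reproduced here):

Input= fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)
Output= (the estimated intercept and slope for wt, with standard errors and
p-values)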

Finally, note that we can show a linear model on our plot
using a method built into ggplot2, geom_smooth. We tell it
that the method to use is "lm", a linear model, the same one
we have been learning.
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm")
Output= (scatterplot of wt against mpg with the fitted line and its confidence
band)

Now we can see the straight line on our ggplot. The grey area indicates the
uncertainty in the fit: it is a 95% confidence interval for where the true trend
line could be. It is worth noting that this is not a perfect linear model: we can
see that at the extremes of both variables the observed values have a
tendency to be higher than we would predict.

Conclusion: -
We inferred that simple linear regression analysis lets us model and predict a
response variable from an independent one, giving us a baseline that we can
refine each time we add new information.
