Assignment No. 02 Introduction To Educational Statistics (8614)

Question#1

Define hypothesis testing and explain the logic behind hypothesis testing.

Hypothesis or significance testing is a mathematical model for testing a claim, idea or hypothesis about a parameter of interest in a given population, using data measured in a sample. Calculations are performed on selected samples to gather more decisive information about the characteristics of the entire population, which enables a systematic way to test claims or ideas about the entire dataset.

Here is a simple example: A school principal reports that students in her school score an
average of 7 out of 10 in exams. To test this “hypothesis,” we record marks of say 30
students (sample) from the entire student population of the school (say 300) and calculate
the mean of that sample. We can then compare the (calculated) sample mean to the
(reported) population mean and attempt to confirm the hypothesis.

To take another example, suppose the claimed annual return of a particular mutual fund is 8%. Assume that the mutual fund has been in existence for 20 years. We take a random sample of the fund's annual returns for, say, five years (sample) and calculate its mean. We then compare the (calculated) sample mean to the (claimed) population mean to verify the hypothesis.

This stated description constitutes the “Null Hypothesis (H0)” and is assumed to be true –
the way a defendant in a jury trial is presumed innocent until proven guilty by the evidence
presented in court. Similarly, hypothesis testing starts by stating and assuming a “null
hypothesis,” and then the process determines whether the assumption is likely to be true
or false.

The important point to note is that we are testing the null hypothesis because there is an element of doubt about its validity. Whatever information goes against the stated null hypothesis is captured in the Alternative Hypothesis (H1). For the above examples, the alternative hypothesis will be:

∙ Students score an average that is not equal to 7.

∙ The annual return of the mutual fund is not equal to 8% per annum.

In other words, the alternative hypothesis is a direct contradiction of the null hypothesis.

As in a trial, the jury assumes the defendant's innocence (null hypothesis). The prosecutor has to prove otherwise (alternative hypothesis). Similarly, the researcher has to gather evidence for or against the null hypothesis. If the prosecutor fails to prove the alternative hypothesis, the jury has to let the defendant go (basing the decision on the null hypothesis). Similarly, if the researcher fails to prove the alternative hypothesis (or simply does nothing), then the null hypothesis is assumed to be true.
Set the Criteria

The decision-making criteria have to be based on certain parameters of datasets and this is
where the connection to normal distribution comes into the picture.

As per the standard statistics postulate about sampling distributions, “For any sample size n, the sampling distribution of X̅ is normal if the population X from which the sample is drawn is normally distributed.” Hence, the probabilities of all the other possible sample means that one could select are normally distributed.

For example, suppose we want to determine whether the average daily return of any stock listed on the XYZ stock market around New Year's Day is greater than 2%.

H0: Null Hypothesis: mean = 2%

H1: Alternative Hypothesis: mean > 2% (this is what we want to prove)

Take the sample (say of 50 stocks out of total 500) and compute the mean of the sample.

For a normal distribution, approximately 95% of the values lie within two standard deviations of the population mean. Hence, this normality and central limit assumption for the sample dataset allows us to establish 5% as a significance level. It makes sense as, under this assumption, there is less than a 5% probability (100 − 95) of getting outliers that are beyond two standard deviations from the population mean. Depending upon the nature of the dataset, other significance levels can be set at 1%, 5% or 10%. For financial calculations (including behavioral finance), 5% is the generally accepted limit. If we find any calculations that go beyond the usual two standard deviations, then we have a strong case of outliers to reject the null hypothesis.

Example 1

A monthly income investment scheme exists that promises variable monthly returns. An investor will invest in it only if assured of an average $180 monthly income. The investor has a sample of 300 months' returns with a mean of $190 and a standard deviation of $75. Should the investor put money into this scheme?

Let's set up the problem. The investor will invest in the scheme only if assured of the desired $180 average return.

H0: Null Hypothesis: mean = 180

H1: Alternative Hypothesis: mean > 180

Method 1: Critical Value Approach


Identify a critical value XL for the sample mean, which is large enough to reject the null
hypothesis – i.e. reject the null hypothesis if the sample mean >= critical value XL

The probability of making a Type I (alpha) error is

P(reject H0 given that H0 is true) = P(sample mean >= XL given that H0 is true) = alpha,

since the null hypothesis is rejected exactly when the sample mean exceeds the critical limit.

Graphically, the rejection region is the upper tail of the sampling distribution, beyond the critical value XL.


Taking alpha = 0.05 (i.e. a 5% significance level), Z0.05 = 1.645 (from the Z-table or normal distribution table)

=> XL = 180 + 1.645 * (75 / sqrt(300)) = 187.12

Since the sample mean (190) is greater than the critical value (187.12), the null hypothesis is
rejected, and the conclusion is that the average monthly return is indeed greater than $180,
so the investor can consider investing in this scheme.
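As an illustration only, here is a minimal Python sketch of the critical value approach, assuming the scipy library is available; the figures (180, 190, 75, 300) come from the example above.

from math import sqrt
from scipy.stats import norm

mu0 = 180          # hypothesised population mean (H0)
sample_mean = 190  # observed sample mean
sigma = 75         # standard deviation of returns
n = 300            # number of months in the sample
alpha = 0.05       # significance level

z_alpha = norm.ppf(1 - alpha)                 # about 1.645 for a one-tailed test
x_critical = mu0 + z_alpha * sigma / sqrt(n)  # about 187.12

print("critical value:", round(x_critical, 2))
print("reject H0" if sample_mean >= x_critical else "fail to reject H0")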

Method 2: Using Standardized Test Statistics

One can also use standardized value z.

Test statistic: Z = (sample mean – population mean) / (std-dev / sqrt(no. of samples)).

The test statistic for our sample is:

Z = (190 – 180) / (75 / sqrt(300)) = 2.309

Our rejection region at the 5% significance level is Z > Z0.05 = 1.645.

Since Z = 2.309 is greater than 1.645, the null hypothesis can be rejected, leading to the same conclusion as above.

Method 3: P-value Calculation

We aim to identify P(sample mean >= 190, when the true mean = 180)

= P(Z >= (190 – 180) / (75 / sqrt(300)))

= P(Z >= 2.309) ≈ 0.0105 = 1.05%

Reading this p-value against the following table, there is strong evidence that the average monthly return is higher than $180:

p-value              Inference

less than 1%         Confirmed evidence supporting the alternative hypothesis

between 1% and 5%    Strong evidence supporting the alternative hypothesis

between 5% and 10%   Weak evidence supporting the alternative hypothesis

greater than 10%     No evidence supporting the alternative hypothesis
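Methods 2 and 3 can be reproduced in the same way; the sketch below, again assuming scipy, computes the standardized test statistic and its one-sided p-value.

from math import sqrt
from scipy.stats import norm

mu0, sample_mean, sigma, n = 180, 190, 75, 300

z = (sample_mean - mu0) / (sigma / sqrt(n))  # about 2.309
p_value = norm.sf(z)                         # upper-tail probability, about 0.0105

print("z =", round(z, 3), "p-value =", round(p_value, 4))
# z > 1.645 and p < 0.05, so H0 is rejected at the 5% significance level.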

Example 2

A new stockbroker (XYZ) claims that his brokerage fees are lower than those of your current stockbroker (ABC). Data available from an independent research firm indicate that the mean and standard deviation of all ABC clients' brokerage bills are $18 and $6, respectively.

A sample of 100 clients of ABC is taken and brokerage charges are calculated with the new rates of broker XYZ. If the mean of the sample is $18.75 and the standard deviation is the same ($6), can any inference be made about the difference in the average brokerage bill between ABC and XYZ?
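As an illustration only (not the required worked answer), one possible way to frame Example 2 is as a two-sided z-test of H0: mean = 18 against H1: mean ≠ 18, sketched below with the figures given above.

from math import sqrt
from scipy.stats import norm

mu0, sample_mean, sigma, n = 18.0, 18.75, 6.0, 100

z = (sample_mean - mu0) / (sigma / sqrt(n))  # 0.75 / 0.6 = 1.25
p_value = 2 * norm.sf(abs(z))                # two-sided p-value, about 0.21

print("z =", round(z, 2), "p-value =", round(p_value, 3))
# |z| < 1.96 and p > 0.05, so there is insufficient evidence of a difference in
# the average brokerage bill between ABC and XYZ at the 5% level.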
Question#2

Explain the types of ANOVA. Describe possible situations in which each type should be used.

Analysis of Variance (ANOVA)

When comparing three or more groups on one dependent variable, an Analysis of Variance is the statistic to use. There are two basic types of ANOVA that can be used.

One-way ANOVA: A one-way ANOVA compares multiple groups on the same variable. For
example, perhaps the researcher decides to divide private schools into religious private and
secular private. Now, there are three groups to be compared: government schools, religious
private, and secular private. A one-way ANOVA is now necessary. To calculate the one-way
ANOVA, the data must be sorted according to the independent variable - again, school type. In VassarStats, click on ANOVA,
then One Way ANOVA. Then enter the number of samples (aka the number of groups; in
this example, 3). Then click Independent Samples. Enter the mathematics scores for each
student in the appropriate column. For example, enter government students' scores in
Sample A, religious private students' scores in Sample B, and secular private students'
scores in Sample C. Then click Calculate.

Scroll down the screen. The first statistic to look at is the p in the ANOVA summary table. If this p is greater than 0.050, then the null hypothesis is retained; the result is not significant. If the result is not significant, analysis and interpretation are finished because there is no significant difference between groups.

If this p is less than 0.050, then the result is significant. This only means, however, that there
is a significant difference between groups somewhere, not that there is a significant
difference between all groups. It is possible that government students scored significantly higher than religious private and secular private students, but that there are no significant
differences between religious private and secular private students. Down at the bottom of
the screen is the result of Tukey's HSD (Honestly Significant Difference) test. This test
identifies which differences are really significant. It is important to record the means and
standard deviations for all groups, the ANOVA summary table, and the results of Tukey's
HSD. Click the Reset button and move to the next research hypothesis.
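The same one-way ANOVA can be run outside VassarStats. Below is a minimal Python sketch using scipy; the three groups of mathematics scores are invented purely for illustration.

from scipy.stats import f_oneway

government = [62, 58, 71, 65, 60, 68]
religious_private = [70, 74, 69, 77, 72, 75]
secular_private = [66, 73, 70, 68, 75, 71]

f_stat, p_value = f_oneway(government, religious_private, secular_private)
print("F =", round(f_stat, 2), "p =", round(p_value, 4))

# If p < 0.05, a post-hoc test such as Tukey's HSD (for example via
# statsmodels.stats.multicomp.pairwise_tukeyhsd) shows which pairs of groups differ.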

Factorial ANOVA: The factorial ANOVA compares the effect of multiple independent
variables on one dependent variable. For example, a 2x3 factorial ANOVA could compare
the effects of gender and school type on academic performance. The first independent
variable, gender, has two levels (male and female) and the second independent variable,
school type, has three levels (government, religious private, and secular private), hence 2x3
(read "two by three"). Factorial ANOVAs can also be calculated on VassarStats (click on
ANOVA then on Two-way factorial ANOVA for independent samples). However, this
interpretation is a bit more complex so please see an expert statistician to help with
interpreting the results.
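For readers who prefer code to VassarStats, a 2x3 factorial ANOVA can be sketched with statsmodels as below; the small data frame is hypothetical and only illustrates the call pattern.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "score":  [62, 70, 66, 58, 74, 73, 71, 69, 70, 65, 77, 68],
    "gender": ["m", "m", "m", "f", "f", "f", "m", "m", "m", "f", "f", "f"],
    "school": ["gov", "rel", "sec"] * 4,
})

# score ~ gender * school gives the two main effects and their interaction.
model = ols("score ~ C(gender) * C(school)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))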

Analysis of Covariance (ANCOVA)

When using a pre-post test research design, the Analysis of Covariance allows a comparison of post-test scores with pre-test scores factored out. For example, if comparing a treatment and control group on achievement motivation with a pre-post test design, the ANCOVA will compare the treatment and control groups' post-test scores by statistically adjusting for the pre-test scores. For an ANCOVA, you must have pre- and post-test scores for every person in the sample, and these scores must be sorted by group (i.e., the treatment and control groups).

To calculate an ANCOVA with VassarStats, click on ANCOVA. Then VassarStats will ask for
the k. The k is the number of groups. If there is only one treatment and one control group,
then k=2. Click on the correct k for data import. There are two things to bear in mind when doing ANCOVA with VassarStats. It will ask for the concomitant variable and the dependent variable. The concomitant variable (CV) is the variable that should be controlled for. In the
case of a pre-post test design, the concomitant variable is the pre-test. The dependent
variable (DV) is the variable that you think has been affected by the independent variable.
In the case of a pre-post test design, the dependent variable is the post-test. To use
VassarStats, it is important that the CV and the DV are side-by-side for each of the two
groups. Then enter the CV and DV into the correct columns and click Calculate.

Scroll down the screen. Just as before, the first statistic to look at is the p in the ANCOVA
summary table. If this p is less than 0.050, then the null hypothesis is rejected and the result
is significant. There are two sets of means that are important to understand in an ANCOVA.
First, the Observed Means are the actual means for the dependent variable (post-test).
Then the Adjusted Means are the means that have been statistically manipulated based on
the pre-test scores. A simple way to imagine this is that the ANCOVA statistically forces the
pre-test scores to be equal between the two groups (meaning that the two groups are now
equal at the start of the study), and then re-calculates the post

test scores based on the adjusted pre-test scores. It is important to record the observed
means, adjusted means, and standard deviations for all groups and the ACNOVA summary
table. When

creating the tables in the next step, report both the Observed and Adjusted Means.
However, make any figures based with the Adjusted Means. Add a note to the figure so that
readers are clear that these are Adjusted Means. Click the Reset button and move to the
next research hypothesis.
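The same ANCOVA can be expressed as a linear model with the pre-test entered as a covariate. A minimal statsmodels sketch, with invented scores, is shown below.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "group": ["treatment"] * 5 + ["control"] * 5,
    "pre":   [40, 45, 38, 50, 47, 41, 44, 39, 49, 46],
    "post":  [58, 63, 55, 70, 66, 48, 52, 45, 57, 53],
})

# post ~ pre + group: the pre-test (concomitant variable) is controlled for, so the
# group effect is tested on the adjusted post-test means.
model = ols("post ~ pre + C(group)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))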

Correlation

Correlations should be calculated to examine the relationship between two variables within
the same group of participants. For example, the correlation would quantify the
relationship between academic achievement and achievement motivation. To calculate a
correlation, you must have scores for two variables for every participant in the sample. To
calculate a correlation in VassarStats, click on Correlation & Regression, then Basic Linear
Correlation and Regression, Data-Import Version. Enter the total scores for the two
variables and click Calculate.

Scroll down the screen. The first statistic to look at is the two-tailed p. The null hypothesis for a correlation states: "There is no significant relationship between mathematics and English achievement." If the p is greater than 0.050, then the null hypothesis is retained; there is no significant relationship between the variables. If the result is not significant, analysis and interpretation are finished because there is no significant relationship.

If this p is less than 0.050, then the null hypothesis is rejected and the correlation is
significant. If the correlation is significant, then the next step is to look at the correlation
itself, symbolized by r. For more information on how to interpret the correlation, click on
Method of Data Analysis. It is important to record the means and standard deviations for
the two variables, the t, df, two-tailed p, and r. Click the "Reset" button and move to the
next research hypothesis.
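The same correlation can be computed in Python with scipy, as in the sketch below; the paired score lists are hypothetical.

from scipy.stats import pearsonr

math_scores = [55, 62, 70, 48, 66, 75, 58, 80]
english_scores = [50, 60, 72, 45, 64, 70, 55, 78]

r, p_two_tailed = pearsonr(math_scores, english_scores)
print("r =", round(r, 2), "two-tailed p =", round(p_two_tailed, 4))
# If p < 0.05 the correlation is significant; the sign and size of r describe the
# direction and strength of the relationship.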
Question#3

What is the range of the correlation coefficient? Explain strong, moderate, and weak relationships.

Correlation has many uses and definitions. As Carol Alexander (2001) observes, correlation may only be meaningfully computed for stationary processes. Covariance stationarity for a time series, yt, is defined as:

∙ Constant, finite mean

∙ Constant, finite variance


∙ Covariance(yt, yt-s) depends only on the lag s

For financial data, this implies that correlation is only meaningful for variates such as rates
of return or normally transformed variates, z, such that:

z = (x - μ)/σ

Where x is non-stationary and μ is the mean of x and σ the standard deviation. For non-
stationary variates like prices, correlation is not usually meaningful.

A more coherent measure of relatedness is cointegration. Cointegration uses a two-step process:

∙ Long-term equilibrium relationships are established

∙ A dynamic correlation of returns is estimated

Cointegration will not be discussed further here; however, it is very important in developing dynamic hedges that seek to keep stationary tracking error within preset bounds. Hedging using correlation measures typically is not able to achieve such control.

However, instantaneous and terminal measures of correlation are used in various applications, such as developing stochastic interest rate generators.

Definitions of Correlation

Pearson’s correlation formula

Linear relationships between two variables X and Y can be quantified using the Pearson Product-Moment Correlation Coefficient, r:

r = Σ(X − X̅)(Y − Y̅) / sqrt( Σ(X − X̅)² Σ(Y − Y̅)² )

The value of this statistic is always between -1 and 1, and if X and Y are unrelated it will equal zero. (source: http://maigret.psy.ohio-state.edu/~trish/Teaching/Intro_Stats/Lecture_Notes/chapter5/node5.html)

Spearman's Correlation Method

A nonparametric (distribution-free) rank statistic proposed by Spearman in 1904 as a measure of the strength of the association between two variables (Lehmann and D'Abrera 1998). The Spearman rank correlation coefficient can be used to give an R-estimate, and is a measure of monotone association that is used when the distribution of the data makes Pearson's correlation coefficient undesirable or misleading.

The Spearman rank correlation coefficient is defined by

rs = 1 − 6 Σ d² / (n(n² − 1))     (1)

where d is the difference in statistical rank of corresponding variables, and is an approximation to the exact (Pearson) correlation coefficient given above, computed from the original data. Because it uses ranks, the Spearman rank correlation coefficient is much easier to compute.

Under the null hypothesis of no association, the variance of rs is 1/(n − 1); Student was the first to obtain the variance. Expressions for the kurtosis and higher-order moments are given in the source below.

(source: http://mathworld.wolfram.com/SpearmanRankCorrelationCoefficient.html)

The Simple Formula for rs, for Rankings without Ties

Here is a table of two rankings (X and Y) of the same eight wines, where we also take the difference between each pair of ranks (D = X − Y) and then the square of each difference, D²:

X    Y    D    D²
1    2    -1   1
2    1    1    1
3    5    -2   4
4    3    1    1
5    4    1    1
6    7    -1   1
7    8    -1   1
8    6    2    4

All that is required for the calculation of the Spearman coefficient are the values of N and ΣD², according to the formula

rs = 1 − 6ΣD² / (N(N² − 1))

with N = 8 and ΣD² = 14.
(source: http://faculty.vassar.edu/lowry/ch3b.html)
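For the wine-ranking table above, the simple formula and scipy give the same value of rs, as the sketch below shows.

from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 5, 3, 4, 7, 8, 6]

n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))  # sum of squared rank differences = 14
rs_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))  # 1 - 84/504, about 0.833

rs_scipy, _ = spearmanr(x, y)
print("sum D^2 =", sum_d2, "rs (formula) =", round(rs_formula, 3), "rs (scipy) =", round(rs_scipy, 3))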

There is no generally accepted method for computing the standard error for small samples.

Kendall's Tau Coefficient

Spearman’s r treats ranks as scores and computes the correlation between two
sets of ranks. Kendall’s tau is based on the number of inversions in rankings.

Although there is evidence that Kendall's Tau holds up better than Pearson's r to
extreme nonnormality in the data, that seems to be true only at quite extreme
levels.

Let inv := number of inversions, i.e. reversals of pair-wise rank orders between n
pairs. Equal rankings need an adjustment.

τ = 1 – 2* inv/(number of pairs of objects)

= 1 - 2 * inv/ (n*(n-1)/2) = 1 – 4* inv/(n*(n-1))

(source: http://www.psych.yorku.ca/dand/tsp/general/corrstats.pdf)
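Kendall's tau for the same two rankings can be computed from the inversion count or with scipy; with no tied ranks the two agree, as sketched below.

from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 5, 3, 4, 7, 8, 6]

n = len(x)
# An inversion is a pair of objects whose order differs between the two rankings.
inv = sum(1 for i, j in combinations(range(n), 2) if (x[i] - x[j]) * (y[i] - y[j]) < 0)
tau_formula = 1 - 4 * inv / (n * (n - 1))  # 1 - 20/56, about 0.643

tau_scipy, _ = kendalltau(x, y)
print("inversions =", inv, "tau (formula) =", round(tau_formula, 3), "tau (scipy) =", round(tau_scipy, 3))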

Relationship Between Correlation and Volatility

In Volatility and Correlation in Option Pricing (1999), in the context of two imperfectly correlated variables, Riccardo Rebonato states:

“Under these assumptions we can now run two simulations, one with a constant … identical volatility for both variables and with imperfect correlation, and the other with different instantaneous volatilities … but perfect correlation. One can then evaluate the correlation, calculated along the path, between the changes in the log of the two variables in the two cases. … As is apparent from the two figures, the same sample correlation can be obtained despite the fact that the two de-correlation-generating mechanisms are very different.”
Question#4
Explain the chi-square independence test. In what situations should it be applied?

Use the chi-square test of goodness-of-fit when you have one nominal variable with two or more values (such as red, pink and white flowers). You compare the observed counts of observations in each category with the expected counts, which you calculate using some kind of theoretical expectation (such as a 1:1 sex ratio or a 1:2:1 ratio in a genetic cross).

If the expected number of observations in any category is too small, the chi-square test may
give inaccurate results, and you should use an exact test instead. See the web page on small
sample sizes for discussion of what "small" means.

The chi-square test of goodness-of-fit is an alternative to the G–test of goodness-of-fit; each


of these tests has some advantages and some disadvantages, and the results of the two
tests are usually very similar. You should read the section on "Chi-square vs. G–test" near
the bottom of this page, pick either chi-square or G–test, then stick with that choice for the
rest of your life. Much of the information and examples on this page are the same as on the
G–test page, so once you've decided which test is better for you, you only need to read one.

Null hypothesis

The statistical null hypothesis is that the number of observations in each category is equal to that predicted by a biological theory, and the alternative hypothesis is that the observed numbers are different from the expected. The null hypothesis is usually an extrinsic hypothesis, where you knew the expected proportions before doing the experiment. Examples include a 1:1 sex ratio or a 1:2:1 ratio in a genetic cross. Another example would be looking at an area of shore that had 59% of the area covered in sand, 28% mud and 13% rocks; if you were investigating where seagulls like to stand, your null hypothesis would be that 59% of the seagulls were standing on sand, 28% on mud and 13% on rocks.

In some situations, you have an intrinsic hypothesis. This is a null hypothesis where you calculate the expected proportions after you do the experiment, using some of the information from the data. The best-known example of an intrinsic hypothesis is the Hardy-Weinberg proportions of population genetics: if the frequency of one allele in a population is p and the other allele is q, the null hypothesis is that the expected frequencies of the three genotypes are p², 2pq, and q². This is an intrinsic hypothesis, because you estimate p and q from the data after you collect the data; you can't predict p and q before the experiment.

How the test works

Unlike the exact test of goodness-of-fit, the chi-square test does not directly calculate the
probability of obtaining the observed results or something more extreme. Instead, like
almost all statistical tests, the chi-square test has an intermediate step; it uses the data to
calculate a test statistic that measures how far the observed data are from the null
expectation. You then use a mathematical relationship, in this case the chi-square
distribution, to estimate the probability of obtaining that value of the test statistic.

You calculate the test statistic by taking an observed number (O), subtracting the expected number (E), then squaring this difference. The larger the deviation from the null hypothesis, the larger the difference between observed and expected. Squaring the differences makes them all positive. You then divide each squared difference by the expected number, and you add up these standardized differences. The test statistic is approximately equal to the log-likelihood ratio used in the G–test. It is conventionally called a "chi-square" statistic, although this is somewhat confusing because it's just one of many test statistics that follows the theoretical chi-square distribution. The equation is:

chi² = Σ (O − E)² / E

As with most test statistics, the larger the difference between observed and expected, the larger the test statistic becomes. To give an example, let's say your null hypothesis is a 3:1 ratio of smooth wings to wrinkled wings in offspring from a bunch of Drosophila crosses. You observe 770 flies with smooth wings and 230 flies with wrinkled wings; the expected values are 750 smooth-winged and 250 wrinkled-winged flies. Entering these numbers into the equation, the chi-square value is 2.13. If you had observed 760 smooth-winged flies and 240 wrinkled-wing flies, which is closer to the null hypothesis, your chi-square value would have been smaller, at 0.53; if you'd observed 800 smooth-winged and 200 wrinkled-wing flies, which is further from the null hypothesis, your chi-square value would have been 13.33.

The distribution of the test statistic under the null hypothesis is approximately the same as the theoretical chi-square distribution. This means that once you know the chi-square value and the number of degrees of freedom, you can calculate the probability of getting that value of chi-square using the chi-square distribution. The number of degrees of freedom is the number of categories minus one, so for our example there is one degree of freedom. Using the CHIDIST function in a spreadsheet, you enter =CHIDIST(2.13, 1) and calculate that the probability of getting a chi-square value of 2.13 with one degree of freedom is P = 0.144.
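The Drosophila example can be checked with scipy's chi-square goodness-of-fit routine, as sketched below.

from scipy.stats import chisquare

observed = [770, 230]  # smooth-winged, wrinkled-winged flies
expected = [750, 250]  # counts implied by the 3:1 null hypothesis

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print("chi-square =", round(chi2, 2), "p =", round(p_value, 3))  # about 2.13 and 0.144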

The shape of the chi-square distribution depends on the number of degrees of freedom. For an extrinsic null hypothesis (the much more common situation, where you know the proportions predicted by the null hypothesis before collecting the data), the number of degrees of freedom is simply the number of values of the variable, minus one. Thus if you are testing a null hypothesis of a 1:1 sex ratio, there are two possible values (male and female), and therefore one degree of freedom. This is because once you know how many of the total are females (a number which is "free" to vary from 0 to the sample size), the number of males is determined. If there are three values of the variable (such as red, pink, and white), there are two degrees of freedom, and so on.
An intrinsic null hypothesis is one where you estimate one or more parameters from the data in order to get the numbers for your null hypothesis. As described above, one example is Hardy-Weinberg proportions. For an intrinsic null hypothesis, the number of degrees of freedom is calculated by taking the number of values of the variable, subtracting 1 for each parameter estimated from the data, then subtracting 1 more. Thus for Hardy-Weinberg proportions with two alleles and three genotypes, there are three values of the variable (the three genotypes); you subtract one for the parameter estimated from the data (the allele frequency, p); and then you subtract one more, yielding one degree of freedom. There are other statistical issues involved in testing fit to Hardy-Weinberg expectations, so if you need to do this, see Engels (2009) and the older references he cites.

Post-hoc test

If there are more than two categories and you want to find out which ones are significantly
different from their null expectation, you can use the same method of testing each category
vs. the sum of all other categories, with the Bonferroni correction, as I describe for the
exact test. You use chi-square tests for each category, of course.
Question#5

Correlation is a prerequisite of regression analysis. Explain.

This question looks at regression analysis and how you can use it to help you analyze and better understand data that you receive from surveys or observations: what is involved in regression analysis and what to look out for.

A Bunch of Data

Whenever we collect data or information, we want to make sense of what we've found. We
also may want to use the information to predict information about other related events.
This is all part of statistics.

For example, say we collected data about how happy people are after getting so many hours of sleep. We have quite a few data points.

We have graphed our data as a scatter plot because each point is a separate observation; none of the points is connected to the next because each represents a separate individual. How do we make
sense of the scattered pieces of information? How can we further analyze this graph so that
we can make predictions for other people based on the information we gathered?
Regression Analysis

This is where regression analysis comes into play. Regression analysis is a way of relating
variables to each other. What we call 'variables' are simply the bits of information we have
taken. By using regression analysis, we are able to find patterns in our data. It allows us to
make predictions based on our data.

In our sleep vs. happiness example, our variables are sleep and happiness. They are two
seemingly unrelated variables. But by using regression analysis, we can see if we can find a
way that they relate to each other. Once we find how they relate to each other, we can
start making predictions.

Finding the Best Equation

What we want to find is an equation that best fits the data that we have. A very simple
regression analysis model that we can use for our example is called the linear model, which
uses a simple linear equation to fit the data. Recall that linear equations are those
equations that give you a straight line when graphed. Looking at our data, we see that we
can draw a straight line through the middle of most of our data points.

In the graph, you can see that the line we have drawn has roughly half the points above it and half the points below it. We have calculated the equation of this line to be y = (10/7)x - 10/7. We can say that, based on our regression analysis, our data can be modeled by the linear equation y = (10/7)x - 10/7.

Now that we have a model for our data, we can use our model to make predictions about
other cases. For example, say someone sleeps for only 1 hour. We can use our formula and
plug in 1 for x to find that the amount of happiness that someone can expect to have with
only 1 hour of sleep is 0. We can plug any reasonable number in for x to find a prediction based on the data we collected. Of course, the better the model, the better the predictions will be. This is why, in regression analysis, there are many types of models to pick from. We won't go into those types here; just know that our linear model is one very basic model. There are more complex models to fit more complicated data patterns.
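As a small illustration, the fitted linear model y = (10/7)x - 10/7 can be turned into a prediction function:

def predict_happiness(hours_of_sleep: float) -> float:
    """Happiness predicted by the linear model y = (10/7)x - 10/7."""
    return (10 / 7) * hours_of_sleep - 10 / 7

for hours in (1, 4, 8):
    print(hours, "hours ->", round(predict_happiness(hours), 2))  # 1 hour predicts 0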

Linear regression can be a powerful tool for predicting and interpreting information. Learn
to use two common formulas for linear regression in this lesson.
Linear Regression Scenario

Jake has decided to start a hot dog business. He has hired his cousin, Noah, to help him with
hot dog sales. But there's a problem! Noah can only work 20 hours a week. Jake wants to
have Noah working at peak hot dog sales hours. How can he find this information? In this
lesson, you will learn how to solve problems using concepts based on linear regression. First,
let's check out some of our key terms that will be beneficial in this lesson.

Key Terms

Jake will have to collect data and use regression analysis to find the optimum hot dog sale
time. Regression analysis is the study of two variables in an attempt to find a relationship,
or correlation. For example, there have been many regression analyses on student study
hours and GPA. Studies have found a relationship between the number of hours a student
studies and their overall GPA.

In other words, the number of hours a student studies is the independent variable and the
GPA is the dependent variable. The student's GPA will depend on the number of hours a
student studies; therefore, there is a relationship between the two variables. We'll talk
more about this relationship, also known as correlation, in a minute, but let's define linear
regression next.

A regression line is a straight line that attempts to predict the relationship between two variables, also known as a trend line or line of best fit. You've probably seen this line
previously in another class. Linear regression is a prediction when a variable (y) is
dependent on a second variable (x) based on the regression equation of a given set of data.

To clarify, you can take a set of data, create a scatter plot, create a regression line, and then
use regression analysis to see if you have a correlation. Once you have your correlation, you
have linear regression. Okay, that probably sounded like Greek to you. Let's talk a little bit
about correlation before looking at some examples.

A correlation is the relationship between two sets of variables used to describe or predict
information. The stronger the relationship between the two sets of variables, the more
likely your prediction will be accurate. We will examine this concept of correlation more
closely in other lessons, such as Interpreting Linear Relationships Using Data and
Correlation Versus Causation. For now, let's focus on using the regression line to help solve
Jake's hot dog sales dilemma.

Using Linear Regression

First, let's look at the data for Jake's hot dog sales. Jake has been working for the past few
weeks from 1 pm to 7 pm each day. Each day, Jake has tracked the hour and the number of
hot dog sales for each hour. Take a look at this data set for Monday:

(1, 10) (2, 11) (3, 15) (4, 12) (5, 17) (6, 18) (7, 20)

To establish the relationship between the time of day and the number of hot dogs sold, Jake
will need to put the data into the formula y = ax + b. You've probably seen the formula for
slope intercept form in algebra: y = mx + b. This is the same formula, but in statistics, we've
replaced the m with a; a is still slope in this formula, so there aren't any big changes you
need to worry about.

To find the regression line for this data set, we first organize the hours (x) and sales (y) into a table, together with the xy products and x² values needed by the least squares method. We then use the least squares formulas to find the variables in y = ax + b. The slope a and intercept b are:

a = (NΣxy − ΣxΣy) / (NΣx² − (Σx)²)

b = (Σy − aΣx) / N
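As a cross-check of the least squares formulas, the sketch below fits Jake's Monday data with scipy; the slope a and intercept b correspond to y = ax + b.

from scipy.stats import linregress

hours = [1, 2, 3, 4, 5, 6, 7]
sales = [10, 11, 15, 12, 17, 18, 20]

result = linregress(hours, sales)
print("a (slope) =", round(result.slope, 2), "b (intercept) =", round(result.intercept, 2))
# Roughly y = 1.64x + 8.14: predicted sales rise by about 1.6 hot dogs per hour.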

Can you tell what's normal or independent and what's not? Sometimes, we need to figure
this out in the world of statistics. This lesson shows you how as it explains residuals and
regression assumptions in the context of linear regression analysis.

Regression Analysis Defined

Many important questions can be answered by statistical analysis of a data set. For example,
is going back to school for another degree a good way to increase your long-term earnings
potential? Schools will tell you that the answer is a resounding yes! Your professor might
say the same thing, by the way.

Anyway, the statistical process that we will discuss is known as regression analysis. In
particular, we will focus on how to analyze residuals to find violations of the regression
assumptions. Although we will only cover linear regression, it is important to note that
nonlinear regression also exists.

Regression analysis is a statistical process for investigating the relationship among variables.
For example, it could be used to examine the effect of sucrose concentration on the boiling
temperature of water. By creating a scatter plot of the boiling point vs. concentration, we
can draw a line of best fit through the data, as shown on the screen:

This line is fitted to result in the smallest sum of the squares of the residuals. Let me explain.
A residual is defined as the difference between an observed value and its corresponding
predicted value. If all the data points were to lie exactly on the line of best fit, then all of the
residuals would be equal to zero. On the other hand, there will be a non-zero residual for
any point that does not lie on the line, as shown by the black dashed lines in the figure on
screen now.

The red points are the observed values, while their corresponding dash-line connected black
points are the predicted values.

In certain cases, the analysis process we have just described may not be valid. Let's take a
closer look at what I mean.

Regression Assumptions and Residual Analysis

Linear regression analysis is based on four main assumptions, which include statistical
independence, linearity, homoscedasticity, and normality. Let's analyze each of these
assumptions in the context of how residuals can be used to either validate or refute them.

1. Statistical Independence

It means that there is no correlation between residuals within the data set.

In the right-side plot, you can see that there seems to be a sinusoidal pattern to the
residuals, so these data points are not statistically independent. In order for the data to be
statistically independent, the residuals need to be completely random in magnitude.

2. Linearity

It implies that the relationship between the dependent and independent variables is linear.

The plot on the left shows linear data with a positive slope, while the one on the right shows
what looks like an inverted parabola, which is not linear data.

3. Homoscedasticity

All values of the independent variable have the same variance around the regression line. In
this context, you can think of variance as a deviation from the line of best fit. The plots in
the figure on the screen demonstrate this concept.

In the plot on the right, the residuals increase in magnitude as the independent variable
increases. This violates the homoscedasticity assumption.

4. Normality

It means that the residuals are normally distributed around the line of best fit. Take a look at
the figure on the screen:

In the plot on the right, the data points are not normally distributed. We would expect most
of the observed values to be clustered around the line of best fit, with a few outliers.
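A minimal sketch of how residuals are obtained in practice (here with numpy, on the hot dog data used earlier) so they can then be inspected against the four assumptions:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([10, 11, 15, 12, 17, 18, 20], dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)  # line of best fit
predicted = slope * x + intercept
residuals = y - predicted                   # observed minus predicted; zero only on the line

print(np.round(residuals, 2))
# Plotting residuals against x (for example with matplotlib) should show no trend
# or funnel shape if independence, linearity and homoscedasticity hold.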

Lesson Summary

Let's now summarize today's lesson. We have covered linear regression analysis and how
residuals can be used to determine if its assumptions are valid.

Regression analysis is a statistical process for investigating the relationship among variables.
Also, recall that a residual is defined as the difference between an observed value and its
corresponding predicted value. Another important term is the line of best fit, which results
in the smallest sum of the squares of the residuals.

The four main assumptions are linearity, homoscedasticity, statistical independence, and
normality.

∙ Statistical independence means that there is no correlation between residuals within the data set.

∙ Linearity implies that the relationship between the dependent and independent
variables is linear.

∙ Homoscedasticity means that all values of the independent variable have the same
variance around the regression line.

∙ Normality implies that the residuals are normally distributed around the line of best fit.

You should now be comfortable eyeing a scatter plot with a line of best fit and determining
if any of the linear regression assumptions seem to be violated.

Key Terms

∙ Regression analysis: a statistical process for investigating the relationship among variables.

∙ Residual: the difference between an observed value and its corresponding predicted value.

∙ Line of best fit: the line that results in the smallest sum of the squares of the residuals.

∙ Statistical independence: there is no correlation between residuals within the data set.

∙ Linearity: the relationship between the dependent and independent variables is linear.

∙ Homoscedasticity: all values of the independent variable have the same variance around the regression line.

∙ Normality: the residuals are normally distributed around the line of best fit.

Learning Outcomes

After working through this lesson, find out whether you can:

∙ Define regression analysis and other terms

∙ Illustrate how a scatter plot with a line of best fit can determine if any linear regression
assumptions are violated
