Econometrics 2

UNIT- 2 PROBLEMS IN REGRESSION ANALYSIS I

STRUCTURE

2.0 Learning Objective

2.1 Introduction

2.2 Nature, test, consequences and remedial steps of problems of heteroscedasticity

2.3 OLS estimation in the presence of heteroscedasticity

2.4 Method of Generalized Least Squares

2.5 Nature, test, consequences and remedial steps of Multicollinearity

2.6 Estimation in the presence of perfect multicollinearity

2.7 Estimation in the presence of high but imperfect multicollinearity

2.8 Summary

2.9 Keywords

2.10 Learning Activity

2.11 Unit End Questions

2.12 References

2.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:

 Understand the nature, tests, consequences and remedial steps for the problem of heteroscedasticity

 Learn about common problems in regression analysis

 Understand the method of generalized least squares

 Comprehend the nature, tests, consequences and remedies of multicollinearity

 Understand estimation in the presence of high but imperfect multicollinearity
2.1 INTRODUCTION

In statistical modeling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the ‘outcome variable’) and one or more
independent variables (often called ‘predictors’, ‘covariates’, or ‘features’).
The terminology you will often encounter in connection with regression analysis includes:
Dependent variable or target variable: Variable to predict.
Independent variable or predictor variable: Variables to estimate the dependent variable.
Outlier: An observation that differs significantly from the other observations. Outliers should be treated with care, since they can distort the results.
Multicollinearity: Situation in which two or more independent variables are highly linearly related.

Homoscedasticity or homogeneity of variance: Situation in which the variance of the error term is the same across all values of the independent variables.
Regression analysis is primarily used for two distinct purposes. First, it is widely used for prediction
and forecasting, which overlaps with the field of machine learning. Second, it is also used to infer
causal relationships between independent and dependent variables.
We introduced the idea of a statistical relationship between two variables, such as a relationship
between sales volume and advertising expenditure, a relationship between crop yield and fertilizer
input, a relationship between product price and supply, and so on. Such variables' relationships
show the strength and direction of their association but do not address the following: Are there any
functional (or algebraic) relationships between the variables? If so, can the most likely value of one
variable be estimated using the value of the other variable?
Regression analysis is a statistical method for estimating the value of a variable based on the known
value of another variable. It expresses the relationship between two or more variables as an
equation. Dependent (or response) variable refers to the variable whose value is estimated using the
algebraic equation, and independent (regressor or predictor) variable refers to the variable whose
value is used to estimate this value. The term "linear regression equation" refers to the algebraic
formula used to express a dependent variable in terms of an independent variable.

Sir Francis Galton first used the term regression in 1877 while researching the relationship between the heights of fathers and sons. He discovered that, despite the saying "tall fathers have tall sons," if a father's height is some amount x above the average, his sons' average height is only about 2x/3 above the average. Galton referred to this pull of heights back toward the average as "regression to mediocrity." Because Galton's finding is not universally applicable, the term regression is used in business and economics to refer to other types of variables as well. In literary contexts, regression simply means "moving backward."

The following is a summary of the key distinctions between correlation and regression analysis:

1. Regression analysis is the process of creating an algebraic equation between two variables from sample data and using it to predict the value of one variable given the value of the other, while correlation analysis is the process of determining the strength (or degree) of the relationship between two variables. In correlation analysis, the absolute value of the correlation coefficient indicates the degree of the relationship between the two variables, while its sign indicates the nature of the relationship (direct or inverse).

2. The results of a correlation analysis show a connection between two variables x and y, but not a cause-and-effect connection. In contrast to correlation, regression analysis establishes a cause-and-effect relationship between x and y, i.e., a change in the value of the independent variable x results in a corresponding change (effect) in the value of the dependent variable y, provided that all other variables that have an impact on y remain constant.

3. While both variables are regarded as independent in correlation analysis, one variable is
considered the dependent variable and the other the independent variable in linear regression
analysis.

Note that the coefficient of determination, r2, indicates the proportion of variance in the dependent variable that can be explained or accounted for by variation in the independent variable. Because r2 is derived from a sample, its value is subject to sampling error. Even when the value of r2 is high, the assumption of a linear regression may be incorrect, because part of the relationship may actually follow a curve.

Several significant benefits of regression analysis include the following:


1. Regression analysis aids in the creation of a regression equation that can be used to estimate the
value of a dependent variable given the value of an independent variable.
2. Regression analysis provides the standard error of estimate, which measures the variability of the observed values of the dependent variable around the regression line. The smaller the variance and estimation error, the closer the pairs of values (x, y) lie to the regression line, the better the line fits the data, and the better the estimate of the value of variable y. The standard error of estimate equals zero when all the points lie on the line.
3. When the sample size is large (df > 29), interval estimation based on the standard error of estimate can be used to predict the value of the dependent variable. The magnitude of r2 is the same regardless of which of the two variables is treated as the dependent variable.

Different Kinds of Regression Models-

The creation of a regression model to explain the relationship between two or more variables in a
given population is the main goal of regression analysis. The mathematical equation known as a
regression model predicts the value of the dependent variable based on the values of one or more
independent variables that are known.
The type of data available and the nature of the problem being studied determine which regression model should be used. However, each type of association or relationship can be described by an equation linking a dependent variable to one or more independent variables.
Regression Analysis
Regression analysis is the oldest, and probably, most widely used multivariate technique in the social
sciences. Unlike the preceding methods, regression is an example of dependence analysis in which
the variables are not treated symmetrically. In regression analysis, the object is to obtain a
prediction of one variable, given the values of the others. To accommodate this change of viewpoint,
a different terminology and notation are used. The variable being predicted is usually denoted by y
and the predictor variables by x with subscripts added to distinguish one from another. In linear
multiple regression, we look for a linear combination of the predictors (often called regressor
variables). For example, in educational research, we might be interested in the extent to which
school performance could be predicted by home circumstances, age, or performance on a previous
occasion. In practice, regression models are estimated by least squares using appropriate software.
Important practical matters concern the selection of the best regressor variables, testing the significance of their coefficients, and setting confidence limits on the predictions.

In statistics, it’s hard to stare at a set of random numbers in a table and try to make any sense of it.
For example, global warming may be reducing average snowfall in your town and you are asked to
predict how much snow you think will fall this year. Looking at the following table you might guess
somewhere around 10-20 inches. That’s a good guess, but you could make a better guess, by using
regression.

Fig. -2.1 Regression Models

Essentially, regression is the “best guess” at using a set of data to make some kind of prediction. It’s
fitting a set of points to a graph. There’s a whole host of tools that can run regression for you,
including Excel, which I used here to help make sense of that snowfall data:
Fig.-2.2 Regression Models Amount

Just by looking at the regression line running down through the data, you can fine-tune your best guess a bit. You can see that the original guess (20 inches or so) was way off. For 2015, it looks like the line will be somewhere between 5 and 10 inches! That might be "good enough", but regression also gives you a useful equation, which for this chart is:
y = -2.2923x + 4624.4.
What that means is you can plug in an x value (the year) and get a pretty good estimate of snowfall
for any year. For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is pretty close to the actual figure of 30 inches
for that year.

Best of all, you can use the equation to make predictions. For example, how much snow will fall in
2017?

y = -2.2923(2017) + 4624.4 ≈ 0.8 inches.

Regression also gives you an R squared value, which for this graph is 0.702. This number tells you
how good your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a
perfect model. As you can probably see, 0.7 is a fairly decent model so you can be fairly confident in
your weather prediction.
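As a rough illustration of the workflow just described, the sketch below fits a straight line to a small set of made-up (year, snowfall) pairs, since the original table is not reproduced here. It assumes Python with the numpy and statsmodels packages rather than the Excel workflow used in the text.

```python
# Hedged sketch: fit a simple linear trend to hypothetical snowfall data.
import numpy as np
import statsmodels.api as sm

years = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007])
snow = np.array([32.0, 28.5, 30.0, 25.0, 26.5, 30.0, 22.0, 20.5])  # inches (made up)

X = sm.add_constant(years)          # adds the intercept column
fit = sm.OLS(snow, X).fit()

print(fit.params)                   # intercept and slope: y = a + b * year
print(fit.rsquared)                 # plays the role of the 0.702 quoted above
print(fit.predict([[1, 2017]]))     # point prediction for a future year
```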

2.2 NATURE, TEST, CONSEQUENCES AND REMEDIAL STEPS OF PROBLEMS OF HETEROSCEDASTICITY

In statistics, a sequence (or a vector) of random variables is homoscedastic if all its random variables
have the same finite variance. This is also known as homogeneity of variance. The complementary
notion is called heteroscedasticity. The spellings homoskedasticity and heteroskedasticity are also
frequently used.
Assuming a variable is homoscedastic when in reality it is heteroscedastic results in unbiased but
inefficient point estimates and in biased estimates of standard errors, and may result in
overestimating the goodness of fit as measured by the Pearson coefficient.

The existence of heteroscedasticity is a major concern in regression analysis and the analysis of
variance, as it invalidates statistical tests of significance that assume that the modelling errors all
have the same variance. While the ordinary least squares estimator is still unbiased in the presence
of heteroscedasticity, it is inefficient and generalized least squares should be used instead.

Because heteroscedasticity concerns expectations of the second moment of the errors, its presence
is referred to as misspecification of the second order.

The econometrician Robert Engle was awarded the 2003 Nobel Memorial Prize in Economics for his studies on regression analysis in the presence of heteroscedasticity, which led to his formulation of the autoregressive conditional heteroscedasticity (ARCH) modeling technique.

Heteroscedasticity implies that the variances (i.e. - the dispersion around the expected mean of
zero) of the residuals are not constant, but that they are different for different observations. This
causes a problem: if the variances are unequal, then the relative reliability of each observation (used
in the regression analysis) is unequal. The larger the variance, the lower should be the importance
(or weight) attached to that observation.

Nature of Heteroscedasticity

One of the key assumptions of the classical linear regression model is that the variance of each disturbance term ui, conditional on the chosen values of the explanatory variables, is some fixed number equal to σ². This is the assumption of homoscedasticity, or equal (homo) spread (scedasticity), i.e., equal variance. Symbolically,

E(ui²) = σ², i = 1, 2, ..., n

Diagrammatically, homoscedasticity in the two-variable regression model may be displayed as in the figures reproduced below.

Fig. 2.3 Homoscedastic disturbances

Fig. 2.4 Heteroscedastic disturbances

Figure 2.3 demonstrates that, regardless of the values taken by the variable X, the conditional variance of Yi (which is equal to that of ui), conditional upon the given Xi, remains the same. Figure 2.4, on the other hand, demonstrates that the conditional variance of Yi grows as X grows. The variances of Yi are not the same in this case; heteroscedasticity exists. Symbolically,

E(ui²) = σi²

The conditional variances of ui (= conditional variances of Yi) are no longer constant, as the subscript of σ² serves to remind us.
To make the distinction between homoscedasticity and heteroscedasticity concrete, assume that Y represents savings and X represents income in the two-variable model Yi = β1 + β2Xi + ui. Both figures show that, on average, savings rise as income rises.

In Figure 2.3 the variance of savings remains constant at all levels of income, whereas in Figure 2.4 it grows with income: higher-income families save more on average than lower-income families, but there is also more variation in their savings.
The variances of ui may differ for a number of reasons, some of which are listed here.
1. In accordance with error-learning models, people gradually reduce their behavioral mistakes as they gain experience, so σi² is expected to decline. Consider, for instance, the relationship between the number of hours of typing practice and the number of typing errors made in a test: as the number of practice hours rises, both the average number of typing errors and their variability decrease.
2. As incomes grow, people have more discretionary income and hence greater freedom in choosing how to spend it, so σi² is likely to rise with income. Because people have more choices about their saving behavior, a regression of savings on income is likely to find that σi² increases with income (as in Figure 2.4). Similarly, companies with larger profits are generally expected to show greater variability in their dividend payout policies than companies with smaller profits, and growth-oriented companies are likely to show more variability in their dividend payout ratios than established ones.
3. As data-collecting techniques improve, σi² is likely to decrease. Thus, banks that have sophisticated data-processing equipment are likely to commit fewer errors in the monthly or quarterly statements of their customers than banks without such facilities.

(Footnotes: 1. Stefan Valavanis, Econometrics, McGraw-Hill, New York, 1959, p. 48. 2. According to Valavanis, "Income accumulates, and individuals can scarcely distinguish dollars where before they could distinguish dimes," ibid., p. 48.)

Fig 2.5 Illustration of heteroscedasticity
4. The existence of outliers can also lead to heteroscedasticity. An outlier, or outlying observation, is an observation that is very different (either much larger or much smaller) from the other observations in the sample; more precisely, it is an observation from a population different from the one generating the remaining sample observations. The inclusion or exclusion of such an observation, especially if the sample size is small, can substantially alter the results of regression analysis.

As an illustration, think about a scattergram showing the percentage rate of change of stock prices (Y) and consumer prices (X) for 20 countries for the post-World War II period through 1969. Because the Y and X values for Chile are substantially higher than for the other countries, the observation on Y and X for Chile can be viewed as an outlier. It would be challenging to uphold the homoscedasticity assumption in circumstances like these.

Tests Of Heteroscedasticity

Heteroskedasticity affects both estimation and hypothesis testing. Numerous factors can make the data heteroskedastic, and each test for heteroskedasticity makes an assumption about the form that the heteroskedasticity takes. Several tests are available in the literature, including the following.

1. Bartlett test-

In statistics, Bartlett's test, named after Maurice Stevenson Bartlett, is used to test
homoscedasticity, that is, if multiple samples are from populations with equal variances. Some
statistical tests, such as the analysis of variance, assume that variances are equal across groups or
samples, which can be verified with Bartlett's test.

In a Bartlett test, we construct the null and alternative hypotheses. For this purpose, several test procedures have been devised; the procedure due to M. S. Bartlett is presented here. It is based on a statistic whose sampling distribution is approximately a chi-square distribution with (k − 1) degrees of freedom, where k is the number of random samples, which may vary in size and are each drawn from independent normal distributions. Bartlett's test is sensitive to departures from normality: if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality.
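A minimal sketch of Bartlett's and Levene's tests, using SciPy on three simulated samples; the data, the random seed and the use of Python/SciPy are illustrative assumptions, not part of the original text.

```python
# Bartlett's test vs Levene's test for equal variances across groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(0, 1.0, 50)
g2 = rng.normal(0, 1.0, 50)
g3 = rng.normal(0, 3.0, 50)              # deliberately larger spread

stat_b, p_b = stats.bartlett(g1, g2, g3)
stat_l, p_l = stats.levene(g1, g2, g3)   # less sensitive to non-normality

print(f"Bartlett: stat={stat_b:.2f}, p={p_b:.4g}")
print(f"Levene:   stat={stat_l:.2f}, p={p_l:.4g}")
```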

2. Breusch–Pagan test-

In statistics, the Breusch–Pagan test, developed in 1979 by Trevor Breusch and Adrian Pagan, is used
to test for heteroskedasticity in a linear regression model. It was independently suggested with
some extension by R. Dennis Cook and Sanford Weisberg in 1983 (Cook–Weisberg test). Derived
from the Lagrange multiplier test principle, it tests whether the variance of the errors from a
regression is dependent on the values of the independent variables. In that case, heteroskedasticity
is present.

Suppose that we estimate the regression model

y = β0 + β1x + u

and obtain from this fitted model a set of values for û, the residuals. Ordinary least squares constrains these so that their mean is 0, so, given the assumption that their variance does not depend on the independent variables, an estimate of this variance can be obtained from the average of the squared values of the residuals. If that assumption does not hold, a simple alternative model is that the variance is linearly related to the independent variables. Such a model can be examined by regressing the squared residuals on the independent variables, using an auxiliary regression equation of the form

û² = γ0 + γ1x + v

This is the basis of the Breusch–Pagan test. It is a chi-squared test: the test statistic nR² from the auxiliary regression is asymptotically distributed as χ² with k degrees of freedom, where k is the number of regressors in the auxiliary regression. If the test statistic has a p-value below an appropriate threshold (e.g., p < 0.05), then the null hypothesis of homoskedasticity is rejected and heteroskedasticity is assumed.
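The following is a hedged sketch of the Breusch–Pagan test using statsmodels on simulated data whose error spread grows with x; the data-generating process and the choice of Python/statsmodels are assumptions made for illustration.

```python
# Breusch-Pagan test on a simulated heteroscedastic regression.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)     # error sd proportional to x

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, res.model.exog)
print(lm_stat, lm_pvalue)   # a small p-value rejects homoscedasticity
```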

3. The Goldfeld–Quandt test-

In statistics, the Goldfeld–Quandt test checks for homoscedasticity in regression analyses. It does this by dividing a dataset into two parts or groups, and hence the test is sometimes called a two-group test. The Goldfeld–Quandt test is one of two tests proposed in a 1965 paper by Stephen Goldfeld and Richard Quandt. Both a parametric and a nonparametric test are described in the paper, but the term "Goldfeld–Quandt test" is usually associated only with the former.

Advantages and disadvantages

The parametric Goldfeld–Quandt test offers a simple and intuitive diagnostic for heteroskedastic errors in a univariate or multivariate regression model. However, some disadvantages arise under certain specifications or in comparison to other diagnostics, namely the Breusch–Pagan test, as the Goldfeld–Quandt test is somewhat of an ad hoc test. Primarily, the Goldfeld–Quandt test requires that the data be ordered along a known explanatory variable; the parametric test orders along this explanatory variable from lowest to highest. If the error structure depends on an unknown or unobserved variable, the Goldfeld–Quandt test provides little guidance. Also, the error variance must be a monotonic function of the specified explanatory variable. For example, when faced with a quadratic function mapping the explanatory variable to the error variance, the Goldfeld–Quandt test may improperly accept the null hypothesis of homoscedastic errors.
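A hedged sketch of the Goldfeld–Quandt test with statsmodels; note that the data are sorted by the explanatory variable suspected of driving the variance, as the test requires. The simulated data are an illustrative assumption.

```python
# Goldfeld-Quandt test: compare residual variances in the two data halves.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(1, 10, 200))          # ordered along the regressor
y = 1 + 2 * x + rng.normal(0, 0.4 * x)        # variance rises with x

X = sm.add_constant(x)
f_stat, p_value, ordering = het_goldfeldquandt(y, X)
print(f_stat, p_value)   # a small p-value suggests heteroscedasticity
```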

4. Glejser test-

In statistics, the Glejser test for heteroscedasticity, developed by Herbert Glejser, regresses the residuals on the explanatory variable that is thought to be related to the heteroscedastic variance. After it was found not to be asymptotically valid under asymmetric disturbances, similar improvements were independently suggested by Im, and by Machado and Santos Silva.
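Because the Glejser test is just an auxiliary regression, it is easy to run by hand: regress the absolute OLS residuals on the suspect regressor and test its coefficient. The sketch below does this on simulated data; the data and variable names are illustrative assumptions, not the original authors' example.

```python
# Glejser test by hand: |residuals| regressed on the suspect regressor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 1 + 2 * x + rng.normal(0, 0.4 * x)   # heteroscedastic by construction

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

aux = sm.OLS(np.abs(resid), X).fit()     # |u_hat| = a0 + a1 * x + v
print(aux.params, aux.pvalues)           # a significant a1 flags heteroscedasticity
```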

5. Spearman rank correlation coefficient test-


In statistics, Spearman's rank correlation coefficient, or Spearman's ρ (rho), often denoted r_s and named after Charles Spearman, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e., relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed for a correlation of −1) rank between the two variables.

Spearman's coefficient is appropriate for both continuous and discrete ordinal variables. Both Spearman's ρ and Kendall's τ can be formulated as special cases of a more general correlation coefficient.
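Applied to heteroscedasticity, the usual recipe is to rank-correlate the absolute OLS residuals with the regressor; a significant positive ρ suggests the error spread grows with x. A hedged sketch with SciPy and simulated data follows; the data are assumptions made for illustration.

```python
# Spearman rank-correlation check for heteroscedasticity.
import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 200)
y = 3 + 1.5 * x + rng.normal(0, 0.3 * x)

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
rho, p_value = spearmanr(np.abs(resid), x)
print(rho, p_value)   # significant positive rho -> spread grows with x
```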

6. The White test –

In statistics, the White test is a statistical test that establishes whether the variance of the errors in a regression model is constant, that is, whether the errors are homoskedastic. This test, and an estimator for heteroscedasticity-consistent standard errors, were proposed by Halbert White in 1980. These methods have become extremely widely used, making this paper one of the most cited articles in economics.
In cases where the White test statistic is statistically significant, heteroskedasticity may not
necessarily be the cause; instead, the problem could be a specification error. In other words, the
White test can be a test of heteroskedasticity or specification error or both. If no cross-product
terms are introduced in the White test procedure, then this is a test of pure heteroskedasticity. If
cross products are introduced in the model, then it is a test of both heteroskedasticity and
specification bias.
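A hedged sketch of the White test in statsmodels, which builds the auxiliary regression with squares and cross-products of the regressors automatically; the two-regressor simulated data are an illustrative assumption.

```python
# White's general test for heteroscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(5)
x1 = rng.uniform(0, 5, 300)
x2 = rng.uniform(0, 5, 300)
y = 1 + x1 + 2 * x2 + rng.normal(0, 0.5 + 0.5 * x1)   # variance depends on x1

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(res.resid, res.model.exog)
print(lm_stat, lm_pvalue)
```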

7. Ramsey test-

In statistics, the Ramsey Regression Equation Specification Error Test (RESET) is a general specification test for the linear regression model. More specifically, it tests whether non-linear combinations of the fitted values help explain the response variable. The intuition behind the test is that if non-linear combinations of the explanatory variables have any power in explaining the response variable, the model is misspecified, in the sense that the data-generating process might be better approximated by a polynomial or another non-linear functional form.

The test was developed by James B. Ramsey as part of his Ph.D. thesis at the University of
Wisconsin–Madison in 1968, and later published in the Journal of the Royal Statistical Society in
1969.
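The RESET idea can be sketched by hand: refit the model with powers of the fitted values added and F-test whether they improve the fit. The example below assumes Python/statsmodels and a deliberately misspecified quadratic relationship; recent statsmodels versions also provide a ready-made RESET routine, but the manual version makes the logic explicit.

```python
# Manual RESET-style check: do powers of the fitted values add explanatory power?
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 200)
y = 1 + 2 * x + 0.8 * x**2 + rng.normal(0, 1, 200)   # true model is quadratic

X = sm.add_constant(x)
restricted = sm.OLS(y, X).fit()                      # misspecified linear fit

yhat = restricted.fittedvalues
X_aug = np.column_stack([X, yhat**2, yhat**3])       # add fitted-value powers
unrestricted = sm.OLS(y, X_aug).fit()

f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f_stat, p_value)   # a small p-value signals misspecification
```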

8. Szroeter test-

As an alternative to the score test used by estat hettest, Stata's estat szroeter command uses a rank test for heteroskedasticity. The Szroeter test evaluates the alternative hypothesis that the variance rises monotonically in the variables under test. Monotonic means one-way: if the variance is rising monotonically, it can never decrease. As the linear regression line rises, the variance may rise as well, then stop rising and remain constant; it is still regarded as monotonic as long as it never decreases.

The estat szroeter command tests the independent variables of a linear regression. Depending on the options chosen, you can test a single variable, a group of named variables, or all independent variables. When you test numerous hypotheses at once, the risk of mistakenly rejecting the null hypothesis (a false positive, or type 1 error) grows with each additional test, so the p-values of your multiple tests need to be adjusted to reflect this.

Three adjustments are available in Stata. The first is the Bonferroni adjustment, which controls both the per-family type 1 error rate and the chance of a type 1 error of any kind; it is specified with the mtest(bonferroni) option. The second is the Holm–Bonferroni method, which is always at least as powerful as the Bonferroni correction, although, unlike the Bonferroni correction, it does not control the per-family type 1 error rate; it is specified with the mtest(holm) option. The third is the Šidák correction, which is more powerful than the other two adjustments only when the hypothesis tests are positively independent (i.e., independent or positively dependent); it is specified with the mtest(sidak) option.

If you do not specify a multiple-test adjustment, the p-values are left uncorrected by default. When running this test on several variables at once, it is strongly advised that you apply some form of correction.
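The Bonferroni, Holm and Šidák adjustments discussed above are not specific to Stata; the sketch below applies them to a handful of hypothetical per-regressor p-values with statsmodels' multipletests helper. The p-values themselves are made up for illustration.

```python
# Adjusting several test p-values for multiple comparisons.
from statsmodels.stats.multitest import multipletests

pvals = [0.012, 0.045, 0.20, 0.003]   # hypothetical per-regressor p-values

for method in ("bonferroni", "holm", "sidak"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, p_adj.round(4), reject)
```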

9. The nonparametric peak test-

A variety of hypothesis tests for continuous, dichotomous, and discrete outcomes are covered in standard treatments of hypothesis testing. While tests for dichotomous and discrete outcomes concentrate on comparing proportions, tests for continuous outcomes focus on comparing means. Such tests are known as parametric tests and are predicated on certain assumptions. For instance, all parametric tests
assume that the outcome is roughly normally distributed in the population when conducting tests of
hypothesis for means of continuous outcomes. This doesn't imply that the data in the sample that
was seen has a normal distribution; rather, it just means that the outcome in the entire population
that was not observed has a normal distribution. Investigators are comfortable using the normality assumption for many outcomes (i.e., most of the observations are in the center of the distribution while fewer are at either extreme). Additionally, it turns out that many statistical tests are robust, which means they retain their statistical properties even if some of the assumptions are not fully satisfied. Based on the Central Limit Theorem, tests are robust to violations of the normality assumption when the sample size is large. Alternative tests known as nonparametric tests are appropriate when the sample size is small, the distribution of the outcome is unknown, and it cannot be assumed that it is approximately normally distributed.
Nonparametric tests are sometimes called distribution-free tests because they are based on fewer
assumptions (e.g., they do not assume that the outcome is approximately normally distributed).
Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests
involve estimation of the key parameters of that distribution (e.g., the mean or difference in means)
from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less
powerful than their parametric counterparts (i.e., when the alternative is true, they may be less
likely to reject H0).
It can sometimes be difficult to assess whether a continuous outcome follows a normal distribution
and, thus, whether a parametric or nonparametric test is appropriate. There are several statistical
tests that can be used to assess whether data are likely from a normal distribution. The most popular
are the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test. Each test is
essentially a goodness of fit test and compares observed data to quantiles of the normal (or other
specified) distribution. The null hypothesis for each test is H0: Data follow a normal distribution
versus H1: Data do not follow a normal distribution. If the test is statistically significant (e.g., p<0.05),
then data do not follow a normal distribution, and a nonparametric test is warranted. It should be
noted that these tests for normality can be subject to low power. Specifically, the tests may fail to
reject H0: Data follow a normal distribution when in fact the data do not follow a normal
distribution. Low power is a major issue when the sample size is small - which unfortunately is often
when we wish to employ these tests. The most practical approach to assessing normality involves
investigating the distributional form of the outcome in the sample using a histogram and to augment
that with data from other studies, if available, that may indicate the likely distribution of the
outcome in the population.
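A hedged sketch of two of the normality checks named above, run with SciPy on a deliberately non-normal sample; the data are illustrative, and note that estimating the normal parameters from the sample makes the plain Kolmogorov-Smirnov test only approximate.

```python
# Shapiro-Wilk and Kolmogorov-Smirnov checks for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.exponential(scale=2.0, size=80)     # clearly non-normal data

w_stat, p_shapiro = stats.shapiro(sample)
ks_stat, p_ks = stats.kstest(sample, "norm",
                             args=(sample.mean(), sample.std(ddof=1)))

print(p_shapiro, p_ks)   # small p-values -> reject the normality hypothesis
```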
There are some situations when it is clear that the outcome does not follow a normal distribution.
These include situations:
when the outcome is an ordinal variable or a rank,
when there are definite outliers or
when the outcome has clear limits of detection.

CONSEQUENCES

1. Ordinary least squares estimators are still linear and unbiased.
2. Ordinary least squares estimators are not efficient.
3. The usual formulas give incorrect standard errors for least squares.
4. Confidence intervals and hypothesis tests based on the usual standard errors are wrong.
The existence of heteroscedasticity in the error term of an equation violates Classical Assumption V, and the estimation of the equation with OLS has at least three consequences:
1. OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
2. If we still use OLS in the presence of heteroscedasticity, our standard errors could be inappropriate, and hence any inferences we make could be misleading. Whether the standard errors calculated using the usual formulae are too big or too small depends upon the form of the heteroscedasticity.
3. In the presence of heteroscedasticity, the variances of the OLS estimators are not given by the usual OLS formulas. If we persist in using the usual formulas, the t and F tests based on them can be highly misleading, resulting in erroneous conclusions.

REMEDIAL STEPS OF PROBLEMS OF HETEROSCEDASTICITY

The unbiasedness and consistency properties of the OLS estimators remain intact even in the presence of heteroscedasticity, but the estimators are no longer efficient, not even asymptotically (i.e., in large samples). This lack of efficiency makes the usual hypothesis-testing procedures of dubious value. Hence, remedial measures are needed. The various approaches to correcting heteroscedasticity are discussed below.
1. When the true error variance is known: the method of generalized least squares (GLS)

If the variance of the error term is non-constant but known, the best linear unbiased estimator (BLUE) is the generalized least squares (GLS) estimator, also called the weighted least squares (WLS) estimator; in this setting the GLS estimator is a particular kind of WLS estimator. In GLS estimation, a weight is given to each observation of each variable that is inversely proportional to the standard deviation of the error term. This means that observations with a smaller error variance are given more weight in the GLS regression, and observations with a larger error variance are given less weight.
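A minimal sketch of this weighting idea when the error-variance pattern is known, assuming Var(ui) is proportional to xi² so that the weights are 1/xi². The data and the variance pattern are assumptions made for illustration, using Python/statsmodels.

```python
# Weighted least squares with weights proportional to 1/variance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 200)
y = 2 + 0.6 * x + rng.normal(0, 0.5 * x)       # error sd proportional to x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()   # weight = 1 / assumed variance

print(ols.bse)   # conventional OLS standard errors
print(wls.bse)   # WLS standard errors under the assumed variance pattern
```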
2. Remedial measures when the true error variance is unknown

The WLS method makes the implicit assumption that the true error variance is known. In reality, however, the true error variance is rarely known, so we need other methods to obtain a consistent estimate of the variance of the error term.

Feasible Generalized Least Squares (FGLS) Estimator
In this method, we use the sample data to obtain an estimate of the error variance and then apply the GLS estimator using that estimate. The resulting estimator is called the Feasible Generalized Least Squares (FGLS) estimator. It is important to note that if the transformed model obtained with the FGLS estimator is a reasonable approximation of the true heteroscedasticity, then the FGLS estimator is asymptotically more efficient than the OLS estimator. However, if it is not a reasonable approximation of the true heteroscedasticity, the FGLS estimator can produce worse estimates than the OLS estimator.
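A hedged two-step FGLS sketch: first estimate the variance function from the OLS residuals (here by regressing log(û²) on x, one common choice among several), then reuse the fitted variances as WLS weights. The data and the chosen variance model are illustrative assumptions.

```python
# Feasible GLS: estimate the error-variance function, then reweight.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(1, 10, 300)
y = 1 + 0.8 * x + rng.normal(0, 0.4 * x)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Step 1: model the (log) squared residuals as a function of x.
aux = sm.OLS(np.log(ols.resid**2), X).fit()
var_hat = np.exp(aux.fittedvalues)             # estimated error variances

# Step 2: weighted least squares with estimated 1/variance weights.
fgls = sm.WLS(y, X, weights=1.0 / var_hat).fit()
print(fgls.params, fgls.bse)
```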
White's Correction
White's heteroscedasticity-consistent variances and standard errors make it possible to draw valid statistical inferences about the true parameter values in large samples (i.e., asymptotically). White developed a method for obtaining consistent estimates of the variances and covariances of the OLS estimates; it is also called the heteroscedasticity-consistent covariance matrix estimator. Because this estimator converges to the true covariance matrix as the sample size increases indefinitely, White's heteroscedasticity-corrected standard errors are also known as "robust standard errors".
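A hedged sketch of White's heteroscedasticity-consistent standard errors in statsmodels: the coefficients are the same as OLS, only the covariance estimate changes. The simulated data and the HC1 variant are illustrative choices.

```python
# OLS with conventional vs White/robust (heteroscedasticity-consistent) SEs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
x = rng.uniform(1, 10, 200)
y = 2 + 0.6 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
usual = sm.OLS(y, X).fit()                     # conventional standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")      # robust standard errors

print(usual.bse)
print(robust.bse)
```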
2.3 OLS ESTIMATION IN THE PRESENCE OF HETEROSCEDASTICITY

When heteroscedasticity is present in the data, estimates based on ordinary least squares (OLS) are subject to the following consequences:
We cannot apply the usual formula for the variance of the coefficients to conduct tests of significance and construct confidence intervals.

If the error term ui is heteroscedastic, the OLS estimates do not have the minimum variance property in the class of unbiased estimators, i.e., they are inefficient in small samples. Furthermore, they are asymptotically inefficient.

The estimated coefficients remain statistically unbiased; that is, the unbiasedness property of OLS estimation is not violated by the presence of heteroscedasticity.
Forecasts based on a model with heteroscedasticity will be less efficient, as OLS estimation yields higher values of the variance of the estimated coefficients.

All this means that the standard errors will be misestimated and the t-statistics and F-statistics will be inaccurate. Heteroscedasticity can arise for a number of reasons, but a common one is variables taking substantially different values across observations. For instance, a regression involving GDP will suffer from heteroscedasticity if we include both large countries such as the USA and small countries such as Cuba; in this case it may be better to use GDP per person. Also note that heteroscedasticity tends to affect cross-sectional data more than time series.

2.4 METHOD OF GENERALIZED LEAST SQUARE

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in the regression model. In these cases, ordinary least squares and weighted least squares can be statistically inefficient, or even give misleading inferences. GLS was first described by Alexander Aitken in 1934.

GLS is one of the most popular methods for estimating the unknown coefficients of a linear regression model when the residuals are correlated with one another or have unequal variances. The ordinary least squares (OLS) method estimates the parameters of a linear regression model by minimizing the sum of the squares of the differences between the observed responses in the given dataset and those predicted by a linear function. The main advantage of OLS regression is that it is easy to use. However, OLS gives reliable results only if there are no missing values and no major outliers in the data set. Moreover, the OLS regression model does not take into account unequal variance, or 'heteroskedastic errors'. Heteroskedastic errors make the results less reliable and bias the estimated standard errors.
The generalized least squares approach is therefore useful in tackling the problems of outliers, heteroskedasticity and bias in the data. It is capable of producing estimators that are Best Linear Unbiased Estimates. Thus, the GLS estimator is unbiased, consistent, efficient and asymptotically normal.

Major assumptions for generalized least squares regression analysis

OLS assumes that the errors are independent and identically distributed; in particular:
 The error variances are homoscedastic
 Errors are uncorrelated
 Errors are normally distributed
 When these assumptions hold, the OLS and GLS estimators coincide; it is when they fail that GLS differs from OLS.
Thus, the difference between OLS and GLS lies in the assumptions made about the error term of the model. There are two complementary perspectives from which one can understand the GLS estimator:
 As a generalization of OLS
 As the result of transforming the model equation into a new model whose errors are uncorrelated and have equal (homoscedastic) variances

Application of generalized least squares

 GLS model is useful in regionalization of hydrologic data.


 GLS is also useful in reducing autocorrelation by choosing an appropriate weighting matrix.
 It is one of the best methods for estimating regression models with autocorrelated disturbances and testing for serial correlation (serial correlation and autocorrelation are the same thing).
 One can also use the maximum likelihood technique to estimate regression models with autocorrelated disturbances.
 The GLS procedure finds extensive use across various domains; for example, the GLS method is used to estimate the parameters of regional regression models of flood quantiles.

 GLS is widely popular in market response modelling, econometrics and time series analysis.
A number of software packages support generalized least squares, including R, MATLAB, SAS, SPSS, and Stata.
A special case of GLS called weighted least squares (WLS) occurs when all the off-diagonal entries
of Ω are 0. This situation arises when the variances of the observed values are unequal (i.e.
heteroscedasticity is present), but where no correlations exist among the observed variances. The
weight for unit i is proportional to the reciprocal of the variance of the response for unit i.

The generalized least squares (GLS) estimator of the coefficients of a linear regression is a
generalization of the ordinary least squares (OLS) estimator. It is used to deal with situations in
which the OLS estimator is not BLUE (best linear unbiased estimator) because one of the main
assumptions of the Gauss-Markov theorem, namely that of homoskedasticity and absence of serial
correlation, is violated. In such situations, provided that the other assumptions of the Gauss-
Markov theorem are satisfied, the GLS estimator is BLUE.
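A minimal sketch of the GLS estimator with a known (here diagonal) error covariance matrix Ω, assuming Python/statsmodels; with a diagonal Ω the GLS fit coincides with the WLS fit, which illustrates the special case described above. The simulated data are an illustrative assumption.

```python
# GLS with a known diagonal covariance matrix reduces to WLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.uniform(1, 10, 100)
sigma2 = 0.25 * x**2                             # assumed known variances
y = 1 + 0.5 * x + rng.normal(0, np.sqrt(sigma2))

X = sm.add_constant(x)
Omega = np.diag(sigma2)                          # all off-diagonal entries are zero

gls = sm.GLS(y, X, sigma=Omega).fit()
wls = sm.WLS(y, X, weights=1.0 / sigma2).fit()
print(gls.params)
print(wls.params)                                # identical coefficients here
```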

2.5 NATURE, TEST, CONSEQUENCES AND REMEDIAL STEPS OF MULTICOLLINEARITY

In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in


a multiple regression model can be linearly predicted from the others with a substantial degree of
accuracy. In this situation the coefficient estimates of the multiple regression may change
erratically in response to small changes in the model or the data. Multicollinearity does not reduce
the predictive power or reliability of the model as a whole, at least within the sample data set; it
only affects calculations regarding individual predictors. That is, a multivariate regression model
with collinear predictors can indicate how well the entire bundle of predictors predicts the
outcome variable, but it may not give valid results about any individual predictor, or about which
predictors are redundant with respect to others.

Note that in statements of the assumptions underlying regression analyses such as ordinary least
squares, the phrase "no multicollinearity" usually refers to the absence of perfect multicollinearity,
which is an exact (non-stochastic) linear relation among the predictors. In such case, the data
matrix X has less than full rank, and therefore the moment matrix XTX cannot be inverted.

In any case, multicollinearity is a characteristic of the data matrix, not the underlying statistical
model. Since it is generally more severe in small samples, Arthur Goldberger went so far as to call it
"micronumerosity."

Multicollinearity is a state of very high intercorrelations or inter-associations among the


independent variables. It is therefore a type of disturbance in the data, and if present in the data
the statistical inferences made about the data may not be reliable.
There are certain reasons why multicollinearity occurs:

 It is caused by an inaccurate use of dummy variables.


 It is caused by the inclusion of a variable which is computed from other variables in the data
set.
 Multicollinearity can also result from the repetition of the same kind of variable.
 It generally occurs when the independent variables are highly correlated with each other.

Multicollinearity can result in several problems. These problems are as follows:

 The partial regression coefficient due to multicollinearity may not be estimated precisely.
The standard errors are likely to be high.
 Multicollinearity results in a change in the signs as well as in the magnitudes of the partial
regression coefficients from one sample to another sample.
 Multicollinearity makes it tedious to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

 In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the t statistics tend to be very small. It becomes difficult to reject the null hypothesis of any study when multicollinearity is present in the data under study.

What Causes Multicollinearity?

The two types are:


 Data-based multicollinearity: caused by poorly designed experiments, data that is 100%
observational, or data collection methods that cannot be manipulated. In some cases,
variables may be highly correlated (usually due to collecting data from purely observational
studies) and there is no error on the researcher ‘s part. For this reason, you should conduct
experiments whenever possible, setting the level of the predictor variables in advance.
 Structural multicollinearity: caused by you, the researcher, creating new predictor variables.
 Causes for multicollinearity can also include:
 Insufficient data. In some cases, collecting more data can resolve the issue.
 Dummy variables may be incorrectly used. For example, the researcher may fail to exclude
one category, or add a dummy variable for every category (e.g., spring, summer, autumn,
winter).

 Including a variable in the regression that is actually a combination of two other variables. For example, including "total investment income" when total investment income = income from stocks and bonds + income from savings interest.

 Including two identical (or almost identical) variables. For example, weight in pounds and
weight in kilos, or investment income and savings/bond income.
What Problems Do Multicollinearity Cause?

 Multicollinearity causes the following two basic types of problems:


 The coefficient estimates can swing wildly based on which other independent variables are in
the model. The coefficients become very sensitive to small changes in the model.
 Multicollinearity reduces the precision of the estimate coefficients, which weakens the
statistical power of your regression model. You might not be able to trust the p-values to
identify independent variables that are statistically significant.

Imagine you fit a regression model and the coefficient values, and even the signs, change dramatically depending on the specific variables that you include in the model. It's a disconcerting feeling when slightly different models lead to very different conclusions. You don't feel like you know the actual effect of each variable. Now, throw in the fact that you can't necessarily trust the p-values to select the independent variables to include in the model. This problem makes it difficult both to specify the correct model and to justify the model if many of your p-values are not statistically significant.

As the severity of the multicollinearity increases so do these problematic effects. However, these
issues affect only those independent variables that are correlated. You can have a model with
severe multicollinearity and yet some variables in the model can be completely unaffected.

A small simulated example, sketched below, illustrates these problems in action.
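Here is a hedged simulation of those two effects, assuming Python/statsmodels: two nearly collinear regressors produce inflated standard errors, and the coefficient on x1 moves noticeably when x2 is added or dropped. The data-generating process is made up for illustration.

```python
# How near-collinearity inflates standard errors and destabilizes coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # almost a copy of x1
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

X_both = sm.add_constant(np.column_stack([x1, x2]))
X_one = sm.add_constant(x1)

both = sm.OLS(y, X_both).fit()
one = sm.OLS(y, X_one).fit()

print(both.params, both.bse)   # large standard errors on x1 and x2
print(one.params, one.bse)     # x1's coefficient shifts when x2 is dropped
```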

Do I Have to Fix Multicollinearity?

Multicollinearity makes it hard to interpret your coefficients, and it reduces the power of your model to identify independent variables that are statistically significant. These are definitely serious problems. However, the good news is that you don't always have to find a way to fix multicollinearity.

The need to reduce multicollinearity depends on its severity and your primary goal for your
regression model. Keep the following three points in mind:
 The severity of the problems increases with the degree of the multicollinearity. Therefore, if
you have only moderate multicollinearity, you may not need to resolve it.
 Multicollinearity affects only the specific independent variables that are correlated.
Therefore, if multicollinearity is not present for the independent variables that you are
particularly interested in, you may not need to resolve it. Suppose your model contains the
experimental variables of interest and some control variables. If high multicollinearity exists
for the control variables but not the experimental variables, then you can interpret the
experimental variables without problems.
 Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don't need to understand the role of each independent variable, you don't need to reduce severe multicollinearity.

The Consequences of Multicollinearity

 Imperfect multicollinearity does not violate Assumption 6. Therefore, the Gauss Markov
Theorem tells us that the OLS estimators are BLUE.
 So then why do we care about multicollinearity?
 The variances and the standard errors of the regression coefficient estimates will increase.
This means lower t-statistics.
 The overall fit of the regression equation will be largely unaffected by multicollinearity.
This also means that forecasting and prediction will be largely unaffected.
 Regression coefficients will be sensitive to specification. Regression coefficients can change substantially when variables are added or dropped.

2.6 ESTIMATION IN THE PRESENCE OF PERFECT MULTICOLLINEARITY

As noted above, multicollinearity is a situation in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" refers to the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the predictors. In that case the design matrix X has less than full rank, and therefore the moment matrix X'X cannot be inverted. Under these circumstances, for the general linear model y = Xβ + ε, the ordinary least squares estimator β̂_OLS = (X'X)^(-1) X'y does not exist. In any case, multicollinearity is a characteristic of the design matrix, not the underlying statistical model.

Multicollinearity refers to a situation in which more than two explanatory variables in a multiple
regression model are highly linearly related. We have perfect multicollinearity if, for example as in
the equation above, the correlation between two independent variables is equal to 1 or −1. In
practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of
multicollinearity arises when there is an approximate linear relationship among two or more
independent variables.

It was stated previously that in the case of perfect multicollinearity the regression coefficients remain indeterminate and their standard errors are infinite. This fact can be demonstrated readily in terms of the three-variable regression model. Using the deviation form, in which all the variables are expressed as deviations from their sample means, we can write the three-variable regression model as

yi = β2 x2i + β3 x3i + ui

The result of perfect multicollinearity is that you cannot obtain any structural inferences about the original model using sample data for estimation: in a model with perfect multicollinearity, the regression coefficients are indeterminate and their standard errors are infinite.
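A short numerical sketch of this point, assuming Python/numpy: when one regressor is an exact multiple of another, the moment matrix X'X is rank-deficient, so the usual OLS formula has no unique solution.

```python
# Perfect multicollinearity makes X'X singular.
import numpy as np

rng = np.random.default_rng(13)
x2 = rng.normal(size=50)
x3 = 2.0 * x2                          # exact linear dependence
X = np.column_stack([np.ones(50), x2, x3])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))      # 2, not 3: less than full rank
print(np.linalg.det(XtX))              # numerically zero determinant
# Inverting XtX either fails or returns numerically meaningless values,
# so the OLS coefficients are indeterminate, as stated above.
```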

2.7 ESTIMATION IN THE PRESENCE OF HIGH BUT IMPERFECT MULTICOLLINEARITY

With imperfect multicollinearity, an independent variable is a strong, but not perfect, linear function of one or more other independent variables. This also means that there are variables outside the relationship that affect the independent variable. In other words, two independent variables may be related to each other, yet other variables outside the model also affect one of them, so there is no perfect linear function between the two alone. The inclusion of a stochastic term in the relationship reflects the fact that other variables also affect the regressors.

 Imperfect multicollinearity varies in degree from sample to sample.

 The presence of the error term dilutes the relationship between the independent variables.

Fig. 2.6 Estimation in the presence of high but imperfect multicollinearity


The equations tell us that there may be a relationship between X1 and X2, but not that X1 is completely explained by X2; there is a possibility of unexplained variation as well, in the form of the stochastic error term.

2.8 SUMMARY

 Regression analysis is used in stats to find trends in data. For example, you might guess that there's a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
 Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years' time if you continue to put on weight at the same rate.
 It will also give you a slew of statistics (including a p-value and a correlation
coefficient) to tell you how accurate your model is. Most elementary stats courses
cover very basic techniques, like making scatter plots and performing linear
regression. However, you may come across more advanced techniques like multiple
regression.

2.9 KEYWORDS

 Unbiased Estimate- An estimator of a given parameter is said to be unbiased if its expected value is equal to the true value of the parameter.
 Heteroscedasticity - In statistics, heteroskedasticity (or heteroscedasticity) occurs when the standard deviations of a predicted variable are not constant across the range of values of the independent variable, or across time periods. When residual errors are inspected visually, heteroskedastic residuals typically fan out over time.

 Multivariate Statistics - A subfield of statistics that involves the simultaneous observation and analysis of several outcome variables.

 Inductive Generalization- Using observations about a sample to draw a conclusion about the population from which the sample was taken. To make such statistical generalizations about populations, you need representative sampling data.

 Mathematical Geology- The use of mathematical techniques to address issues in the


geosciences, including geology and geophysics, with a focus on geodynamics and
seismology, is known as geomathematics (also: mathematical geosciences,
mathematical geology, and mathematical geophysics).

2.10 LEARNING ACTIVITY

1. Explain GLS in detail


__________________________________________________________________________________
____________________________________________________________________
2. State the reasons & the problems of multicollinearity.
_________________________________________________________________________________
___________________________________________________________________
3. What are the remedial steps of the problem & the consequences of Heteroscedasticity?
_________________________________________________________________________________
___________________________________________________________________

2.11 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1.Calculate the regression coefficient and obtain the lines of regression for the following data

2. Calculate the regression coefficient and obtain the lines of regression for the following data
3. Find the means of X and Y variables and the coefficient of correlation between them from the
following two regression equations:

2Y–X–50 = 0

3Y–2X–10 = 0.

4. The two regression lines are 3X+2Y=26 and 6X+3Y=31. Find the correlation
coefficient.

5. Find the means of X and Y variables and the coefficient of correlation between them
from the following two regression equations:

4X–5Y+33 = 0

20X–9Y–107 = 0

Long Questions

1. For 5 pairs of observations the following results are obtained: ∑X=15, ∑Y=25, ∑X²=55, ∑Y²=135, ∑XY=83. Find the equations of the lines of regression and estimate the value of X on the first line when Y=12 and the value of Y on the second line when X=8.

2. In a laboratory experiment on correlation research study the equation of the two


regression lines were found to be 2X–Y+1=0 and 3X–2Y+7=0. Find the means of X
and Y. Also work out the values of the regression coefficient and correlation between
the two variables X and Y.

3. For the given lines of regression 3X–2Y=5 and X–4Y=7, find

(i) Regression coefficients

(ii) Coefficient of correlation

4. Obtain the two regression lines from the following data: N=20, ∑X=80, ∑Y=40, ∑X²=1680, ∑Y²=320 and ∑XY=48
5. Find the equation of the regression line of Y on X, if the observations (Xi, Yi) are the following: (1,4) (2,8) (3,2) (4,12) (5,10) (6,14) (7,16) (8,6) (9,18)

B. Multiple Choice Questions

1. The process of constructing a mathematical model or function that can be used to predict
or determine one variable by another variable is called

A. regression

B. correlation

C. residual

D. outlier plot

2. In the regression equation Y = 21 - 3X, the slope is

A. 21

B. -21

C. 3

D. -3

3. In the regression equation Y = 75.65 + 0.50X, the intercept is

A. 0.50

B. 75.65

C. 1.00

D. indeterminable

4. The difference between the actual Y value and the predicted Y value found using a
regression equation is called the
A. slope

B. residual

C. outlier

D. scatter plot

5. The total of the squared residuals is called the

A. coefficient of determination

B. sum of squares of error

C. standard error of the estimate

D. r-squared


Answers

1-A, 2-D, 3-B, 4-B ,5-B

2.12 REFERENCES

 Gujarati, D., Porter, D. C. and Guna Sekhar, C. (2012). Basic Econometrics (Fifth Edition). McGraw Hill Education.
 Anderson, D. R., Sweeney, D. J. and Williams, T. A. (2011). Statistics for Business and Economics, 12th Edition. Cengage Learning India Pvt. Ltd.
 Wooldridge, Jeffrey M. (2007). Introductory Econometrics: A Modern Approach, Third Edition. Thomson South-Western.
 Johnston, J. (1994). Econometric Methods, 3rd Edition. McGraw Hill, New York.
 Ramanathan, Ramu (2002). Introductory Econometrics with Applications. Harcourt Academic Press (IGM Library Call No. 330.0182 R14I).
 Koutsoyiannis, A. The Theory of Econometrics, 2nd Edition. ELBS.
