Topic 6 Mte3105
Topic 6
Linear Regression
6.1 Synopsis
This topic discusses the relationship between two variables by using scatter diagrams and regression analysis. In a scatter diagram, we can investigate the relationship between two variables by looking at how the pairs of values are distributed on a graph. By using a regression model, we can evaluate the magnitude of the change in one variable due to a certain change in another variable. For example, an economist can estimate the amount of change in food expenditure due to a certain change in the income of a household by using a regression model. A sociologist may want to estimate the increase in the crime rate due to a particular increase in the unemployment rate. Besides answering these questions, a regression model also helps to predict the value of one variable for a given value of another variable. For example, by using the regression line, we can predict the (approximate) crime rate of a city with a given unemployment rate.

6.2 Learning Outcomes
1. Understand the concept of dependent and independent variables.
2. Interpret scatter diagrams for bivariate data and draw the line of best fit.
3. Understand the concept of linear regression.
4. Calculate the equation of the linear regression line by using the method of least squares and use it to estimate values by interpolation and extrapolation.
MTE3105 Statistics
6.3 Conceptual Framework
[Diagram: Linear Regression; Scatter Diagram]
6.4
Let's say that an economist wishes to investigate the relationship between food expenditure and income. What factors or variables does a household have to consider when deciding how much money it should spend on food every week or month? Certainly, the income of the household is one factor. However, there are many other variables that also affect food expenditure. For instance, the size of the household, the preferences and tastes of household members and any special dietary needs of the household are some of the variables that influence a household's decision on food expenditure. These variables are called independent or explanatory variables because they all vary independently and they explain the variation in food expenditure among different households. In other words, these variables explain why different households spend different amounts of money on food. Meanwhile, food expenditure is called the dependent variable because its value depends on the independent variables.
6.5 Scatter Diagram
A scatter diagram is a plot of paired observations. Pairs of values (x, y) are plotted against the horizontal and vertical axes of a graph with appropriate scales. It is a useful way of examining the relationship between two variables: it gives an early impression of the relationship and thus enables the researcher to draw a preliminary conclusion about it. The following diagrams show the various relationships between two variables. If all points seem to lie near a line, the correlation is called linear correlation (diagrams (a), (b), (c) and (d)). If y tends to increase as x increases, the correlation is called positive correlation, as in diagrams (a) and (b). If y tends to decrease as x increases, the correlation is called negative correlation, as in diagrams (c) and (d). If no relationship is shown between the two variables, there is no correlation between them, as in diagram (e).
[Figure: scatter diagrams (a) to (e). (a), (b): positive linear correlation; (c), (d): negative linear correlation; (e): no relationship]
6.6 Linear Regression
When we study the effect of two or more independent variables on a dependent variable using regression analysis, it is called multiple regression. However, if we choose only one (usually the most important) independent variable and study the effect of that single variable on a dependent variable, it is called simple regression. Thus, a simple regression includes only two variables: one independent and one dependent. Note that whether it is a simple or a multiple regression analysis, it always includes one and only one dependent variable. It is the number of independent variables that differs between simple and multiple regression.
6.6.1 Simple Linear Regression
A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent. The dependent variable is the one being explained, and the independent variable is the one used to explain the variation in the dependent variable. Thus, a simple linear regression model is a model that gives a straight-line relationship between two variables.
The correlation coefficient can be used to measure the strength of the linear relationship between two variables, but it cannot be used to estimate or forecast the values of the variables. To overcome this weakness, the most suitable line is drawn on a scatter diagram; this line is called the regression line.
In your previous work, you may have attempted to draw a line of best fit on a scatter diagram by eye. A line of best fit drawn by eye is placed so that there are roughly as many points above the line as below it, so that it divides the points on the scatter plot evenly. Ideally, it is the line with the least total deviation from the actual data points: if you add up the distances between the points and the line, the total should be as small as possible. The line should also pass through the point $(\bar{x}, \bar{y})$, whose coordinates are the means of the two sets of data.
[Figure: scatter diagram with axes x and y showing a line of best fit through the point $(\bar{x}, \bar{y})$]
However, drawing a line by eye is rather haphazard. There is a mathematical way of fitting a regression line, known as the method of least squares, as illustrated below.
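For reference, the criterion that gives the method its name can be written as follows. This is the standard textbook formulation rather than something stated explicitly in these notes, and the symbol SSE (sum of squared errors) is introduced here only for illustration.

```latex
\[
\text{SSE} = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\]
```

The least squares line is the choice of a and b that makes SSE as small as possible. The vertical deviations are squared so that points above and below the line cannot cancel each other out.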
6.6.2 Using formulae to find the equation of the least squares regression line
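The least squares values of a and b are computed using the following formulas:

$$b = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x},$$

where

$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \quad \text{and} \quad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}.$$

SS stands for sum of squares, and the least squares regression line $\hat{y} = a + bx$ is called the regression of y on x.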
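The formulas for a and b can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function name `least_squares_line` and the small data set (points lying exactly on y = 1 + 2x) are assumptions chosen so that the result is easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x fitted by least squares.

    Hypothetical helper: uses SSxy = sum(xy) - sum(x)*sum(y)/n and
    SSxx = sum(x^2) - (sum(x))^2/n, as in the formulas above.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two points are needed")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_xx - sum_x ** 2 / n
    if ss_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    b = ss_xy / ss_xx                # slope
    a = sum_y / n - b * sum_x / n    # intercept: y_bar - b * x_bar
    return a, b


# Illustrative data lying exactly on y = 1 + 2x
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Because the sample points lie exactly on a line, the fitted intercept and slope reproduce that line; with real data the line passes through the point of means instead of through every point.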
EXAMPLE 1. Find the least squares regression line for the data on the incomes and food expenditures of the seven households given in Table 1 below. Use income as the independent variable and food expenditure as the dependent variable.

Table 1: Incomes and Food Expenditures of Seven Households

Income (hundreds of RM)    Food Expenditure (hundreds of RM)
35                         9
49                         15
21                         7
39                         11
15                         5
28                         8
25                         9

Solution:
We will now find the values of a and b for the regression model $\hat{y} = a + bx$. Table 2 shows the calculations required for the computation of a and b. We denote the independent variable (income) by x and the dependent variable (food expenditure) by y.

Table 2

Income, x    Food Expenditure, y    xy      x^2
35           9                      315     1225
49           15                     735     2401
21           7                      147     441
39           11                     429     1521
15           5                      75      225
28           8                      224     784
25           9                      225     625
Sum: 212     Sum: 64                2150    7222
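The columns and totals of Table 2 can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic; the variable names (`incomes`, `food`, `rows`, `totals`) are illustrative and not part of the original notes.

```python
# Data from Table 1, in hundreds of RM
incomes = [35, 49, 21, 39, 15, 28, 25]   # x
food = [9, 15, 7, 11, 5, 8, 9]           # y

# One row per household: (x, y, xy, x^2)
rows = [(x, y, x * y, x * x) for x, y in zip(incomes, food)]

# Column totals: (sum x, sum y, sum xy, sum x^2)
totals = tuple(sum(column) for column in zip(*rows))

print(f"{'x':>6}{'y':>6}{'xy':>8}{'x^2':>8}")
for x, y, xy, xx in rows:
    print(f"{x:>6}{y:>6}{xy:>8}{xx:>8}")
print("Totals:", totals)  # Totals: (212, 64, 2150, 7222)
```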
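The following steps are performed to compute a and b.
STEP 1. Compute $\sum x$, $\sum y$, $\bar{x}$ and $\bar{y}$.

$$\sum x = 212, \qquad \sum y = 64$$

$$\bar{x} = \frac{\sum x}{n} = \frac{212}{7} = 30.2857, \qquad \bar{y} = \frac{\sum y}{n} = \frac{64}{7} = 9.1429$$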
STEP 2. Compute $\sum xy$ and $\sum x^2$.
To calculate $\sum xy$, we multiply the corresponding values of x and y. Then, we sum all the products. The products of x and y are recorded in the third column of Table 2. To compute $\sum x^2$, we square each of the x values and then add them. The squared values of x are listed in the fourth column of Table 2. From these calculations,

$$\sum xy = 2150 \quad \text{and} \quad \sum x^2 = 7222.$$

STEP 3. Compute $SS_{xy}$ and $SS_{xx}$.

$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2150 - \frac{(212)(64)}{7} = 211.7143$$

$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 7222 - \frac{(212)^2}{7} = 801.4286$$
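As a cross-check on Step 3, the sketch below computes both quantities in two ways: with the computational formulas used above, and as sums of deviations from the means, which is where the name "sum of squares" comes from. The deviation forms are standard identities rather than something given in these notes, and the variable names are illustrative.

```python
import math

incomes = [35, 49, 21, 39, 15, 28, 25]   # x, hundreds of RM
food = [9, 15, 7, 11, 5, 8, 9]           # y, hundreds of RM
n = len(incomes)

mean_x = sum(incomes) / n
mean_y = sum(food) / n

# Computational forms, as used in Step 3
ss_xy = sum(x * y for x, y in zip(incomes, food)) - sum(incomes) * sum(food) / n
ss_xx = sum(x * x for x in incomes) - sum(incomes) ** 2 / n

# Deviation forms: sums of products and squares of deviations from the means
ss_xy_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, food))
ss_xx_dev = sum((x - mean_x) ** 2 for x in incomes)

print(round(ss_xy, 4), round(ss_xx, 4))  # 211.7143 801.4286
print(math.isclose(ss_xy, ss_xy_dev), math.isclose(ss_xx, ss_xx_dev))  # True True
```

The computational forms are convenient for hand calculation because they only need the column totals from Table 2.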
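STEP 4. Compute a and b.

$$b = \frac{SS_{xy}}{SS_{xx}} = \frac{211.7143}{801.4286} = 0.2642$$

$$a = \bar{y} - b\bar{x} = 9.1429 - (0.2642)(30.2857) = 1.1414$$

Thus, our estimated regression model is

$$\hat{y} = 1.1414 + 0.2642x.$$

This regression line is called the least squares regression line. It gives the regression of food expenditure on income.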
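The sketch below repeats Step 4 in Python and shows how sensitive the intercept is to rounding. It assumes, as the hand calculation appears to do, that b and the two means are rounded to four decimal places before a is computed. The names `a_full` and `a_rounded` are illustrative.

```python
# Totals from Table 2 (n = 7 households)
n = 7
sum_x, sum_y, sum_xy, sum_xx = 212, 64, 2150, 7222

ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_xx - sum_x ** 2 / n
mean_x = sum_x / n
mean_y = sum_y / n

b = ss_xy / ss_xx

# Intercept with unrounded intermediate values
a_full = mean_y - b * mean_x

# Intercept when b and the means are first rounded to 4 decimal places,
# which mirrors a calculation done by hand
a_rounded = round(mean_y, 4) - round(b, 4) * round(mean_x, 4)

print(f"b = {b:.4f}")                          # b = 0.2642
print(f"a (full precision) = {a_full:.4f}")    # a (full precision) = 1.1422
print(f"a (rounded inputs) = {a_rounded:.4f}") # a (rounded inputs) = 1.1414
```

The two intercepts differ only in the third decimal place, so predictions from either version agree to within a few sen per RM 100. Carrying extra decimal places until the final step avoids the discrepancy.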
The regression line of y on x gives the average value of y for a given value of x, so in certain circumstances it can be used to predict or estimate missing values. This is known as interpolation. Interpolation is generally safe because the prediction is made within the range of values of the predictor in the sample used to generate the model. For example, from our estimated regression model, we can find the predicted value of y for any specific value of x. Suppose we randomly select a household whose monthly income is RM 3500, so that x = 35 (recall that x denotes income in hundreds of RM). The predicted value of food expenditure for this household is

$$\hat{y} = 1.1414 + 0.2642(35) = 10.3884 \text{ (hundreds of RM)}.$$

In other words, based on our regression line, we predict that a household with a monthly income of RM 3500 is expected to spend RM 1038.84 per month on food.
However, we must take great care when estimating values outside the range of the data. Making a prediction outside the range of values of the predictor in the sample used to generate the model is called extrapolation. The farther the prediction is from the range of values used to fit the model, the riskier and less reliable it becomes, because there is no way to check whether the relationship between the dependent and independent variables continues to be linear.
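The distinction between interpolation and extrapolation can be made explicit in code. The sketch below uses the fitted line from Example 1 and flags any prediction whose x value lies outside the sampled incomes (15 to 49, in hundreds of RM). The function name `predict_food_expenditure` and the second test income of RM 8000 are assumptions made for illustration.

```python
# Fitted line from Example 1: y_hat = A + B * x (both in hundreds of RM)
A, B = 1.1414, 0.2642

# Smallest and largest incomes in the sample used to fit the line
X_MIN, X_MAX = 15, 49


def predict_food_expenditure(income_hundreds):
    """Return (predicted y, True if the prediction is an interpolation)."""
    y_hat = A + B * income_hundreds
    is_interpolation = X_MIN <= income_hundreds <= X_MAX
    return y_hat, is_interpolation


for income in (35, 80):
    y_hat, inside = predict_food_expenditure(income)
    kind = "interpolation" if inside else "extrapolation (treat with caution)"
    print(f"x = {income}: predicted y = {y_hat:.4f} hundred RM, {kind}")
```

For x = 35 the prediction matches the worked value of 10.3884. For x = 80 the code still returns a number, but nothing in the sample supports the assumption that the relationship stays linear that far beyond the observed incomes.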
Exercise
1. A patient is given a drip feed containing a particular chemical and its concentration in his blood is measured, in suitable units, at one-hour intervals for the next five hours. The doctor believes the figures to be subject to random errors, arising both from the sampling procedure and the subsequent chemical analysis, but that a linear model is appropriate.

Time, x (hours)    0      1      2      3      4      5
Concentration      2.4    4.3    5.2    6.8    9.1    11.8

a) Illustrate the data on a scatter diagram.
b) Determine the equation of the regression line by the eye method.
c) Find the equation of the regression line by the least squares method and compare the result with that in b).
d) Estimate the concentration of the chemical in the patient's blood (i) 3 hours (ii) 10 hours after the treatment started. Comment on the likely accuracy of your predictions.

2. The English speaking ability scores, X, and the sales, Y, in a particular month for 10 salesmen are shown in the table below.

X    60    54    52    48    42    36    34    28    26    18
Y    74    76    66    76    70    68    62    54    64    54

a) Draw a scatter diagram for the data.
b) By using the least squares method, find the equation of the regression line of sales Y on English speaking ability score X.
c) What are the levels of sales for three particular salesmen A, B and C with English speaking ability scores of 40, 50 and 70 respectively?

Answer: b) Y = 46.52 + 0.4994X c) 66.50, 71.49, 81.48