Regression Analysis With Scilab

By Gilberto E. Urroz, Ph.D., P.E.

Distributed by infoClearinghouse.com

A "zip" file containing all of the programs in this document (and other
SCILAB documents at InfoClearinghouse.com) can be downloaded at the
following site:

http://www.engineering.usu.edu/cee/faculty/gurro/Software_Calculators/Scilab_Docs/ScilabBookFunctions.zip
REGRESSION ANALYSIS

Analysis of residuals
Scaling residuals
Influential observations
A function for residual analysis
Applications of function residuals
Exercises
Regression Analysis

The idea behind regression analysis is to verify that a function

y = f(x)

fits a given data set

{(x1,y1), (x2,y2), ..., (xn,yn)}

after obtaining the parameters that identify the function f(x). The value x represents one or more
independent variables. The function f(x) can be, for example, a linear function, i.e.,

y = b0 + b1x1 + ... + bpxp,

or some non-linear function. The procedure consists in postulating a form of the function to
be fitted, y = f(x), which will depend, in general, on a number of parameters, say {b0, b1, ...,
bk}. Then we choose a criterion to determine the values of those parameters. The most
commonly used is the least-squares criterion, by which the sum of the squares of the errors (SSE)
involved in the data fitting is minimized. The error involved in fitting point i in the data set is
given by

ei = yi - ŷi,

thus, the quantity to be minimized is

SSE = Σ(i=1..n) ei² = Σ(i=1..n) (yi - ŷi)².

Minimization of SSE is accomplished by taking the derivatives of SSE with respect to each of the
parameters, b0, b1, ..., bk, and setting these results to zero, i.e., ∂(SSE)/∂b0 = 0, ∂(SSE)/∂b1 = 0,
..., ∂(SSE)/∂bk = 0. The resulting set of equations is then solved for the values b0, b1, ..., bk.
After finding the parameters by minimization of the sum of squared errors (SSE), we can test
hypotheses about those parameters under certain confidence levels to complete the regression
analysis. In the following section we present the regression analysis of a simple linear
regression.
Simple Linear Regression

The equation

y = mx + b,

representing a straight line in the x-y plane, is used to represent the relationship between the
values x and y from the data set. The fitted value of y corresponding to the point xi is

ŷi = m·xi + b,

and the corresponding error in the fitting is

ei = yi - ŷi = yi - (m·xi + b) = yi - m·xi - b.

To determine the values of m and b that minimize the sum of squared errors, we use the
conditions

∂(SSE)/∂m = 0,   ∂(SSE)/∂b = 0.

These conditions produce the so-called normal equations:

Σ(i=1..n) yi = b·n + m·Σ(i=1..n) xi

Σ(i=1..n) xi·yi = b·Σ(i=1..n) xi + m·Σ(i=1..n) xi²

This is a system of linear equations with m and b as the unknowns. In matrix form, these
equations are written as
| n              Σ(i=1..n) xi  |  | b |     | Σ(i=1..n) yi    |
| Σ(i=1..n) xi   Σ(i=1..n) xi² |  | m |  =  | Σ(i=1..n) xi·yi |
For example, consider the data set given in the following table:

  x       y
  1.2     6.05
  2.5    11.6
  4.3    15.8
  8.3    21.8
 11.6    36.8
The following SCILAB commands will calculate the values of m and b that minimize SSE. A plot of
the original data and the straight-line fitting is also produced. The column vector p stores the
values of m and b. Thus, for this case m = 2.6827095 and b = 3.4404811.

-->x=[1.2,2.5,4.3,8.3,11.6];y=[6.05,11.6,15.8,21.8,36.8];
-->Sx=sum(x);Sx2=sum(x^2);Sy=sum(y);Sxy=sum(x.*y);n=length(x);
-->A=[Sx,n;Sx2,Sx];B=[Sy;Sxy];p=A\B
 p  =
!   2.6827095 !
!   3.4404811 !

-->deff('[y]=yh(x)','y=p(1).*x+p(2)')
-->xf=[0:0.1:12];yf=yh(xf);      //fitted line over the range of the data
-->rect=[0 0 12 40];             //plot frame based on min. & max. of x and y
-->plot2d(xf,yf,1,'011',' ',rect)
-->plot2d(x,y,-1,'011',' ',rect)
-->xtitle('Simple linear regression','x','y')

The value of the sum of squared errors for this fitting is:

-->yhat=yh(x);err=y-yhat;SSE=sum(err^2)
 SSE  =  23.443412
The following function, SSEPlot, evaluates SSE over ranges of values of the slope m and the
intercept b, and plots the resulting surface:

function [] = SSEPlot(mrange,brange,x,y)
n=length(mrange); m=length(brange);
SSE = zeros(n,m);
deff('[y]=f(x)','y=slope*x+intercept')
for i = 1:n
  for j = 1:m
    slope = mrange(i);intercept=brange(j);
    yhat = f(x);err=y-yhat;SSE(i,j)=sum(err^2);
  end;
end;
xset('window',1);plot3d(mrange,brange,SSE,45,45,'m@b@SSE');
xtitle('Sum of square errors')
xset('window',2);contour(mrange,brange,SSE,10);
xtitle('Sum of square errors','m','b');

The function produces a three-dimensional plot of SSE(m,b) as well as a contour plot of the
function. To produce the plots we use the following SCILAB commands:

-->mr = [2.6:0.01:2.8];br=[3.3:0.01:3.5];
-->getf('SSEPlot')
-->SSEPlot(mr,br,x,y)
Covariance and correlation

The sample covariance of x and y is defined as

sxy = (1/(n-1))·Σ(i=1..n) (xi - x̄)(yi - ȳ),

and the sample correlation coefficient as

rxy = sxy/(sx·sy),

where the sample variances of x and y are, respectively,

sx² = (1/(n-1))·Σ(i=1..n) (xi - x̄)²,   sy² = (1/(n-1))·Σ(i=1..n) (yi - ȳ)².

The correlation coefficient is a measure of how well the fitting equation, i.e., ŷ = mx + b, fits
the given data. The values of rxy are constrained to the interval (-1,1). The closer the value of
rxy is to +1 or -1, the better the linear fitting for the given data.
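These quantities are easy to evaluate directly; a minimal sketch, assuming the vectors x and y
entered in the earlier example are still in the workspace:

-->n = length(x); xb = mean(x); yb = mean(y);
-->sxy = sum((x-xb).*(y-yb))/(n-1);     //sample covariance of x and y
-->sx = sqrt(sum((x-xb).^2)/(n-1));     //sample standard deviation of x
-->sy = sqrt(sum((y-yb).^2)/(n-1));     //sample standard deviation of y
-->rxy = sxy/(sx*sy)                    //correlation coefficient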
It is customary to define the sums of squares and cross-products

Sxx = Σ(i=1..n) (xi - x̄)² = (n-1)·sx² = Σ(i=1..n) xi² - (1/n)·(Σ(i=1..n) xi)²

Syy = Σ(i=1..n) (yi - ȳ)² = (n-1)·sy² = Σ(i=1..n) yi² - (1/n)·(Σ(i=1..n) yi)²

Sxy = Σ(i=1..n) (xi - x̄)(yi - ȳ) = (n-1)·sxy = Σ(i=1..n) xi·yi - (1/n)·(Σ(i=1..n) xi)·(Σ(i=1..n) yi)
From which it follows that the standard deviations of x and y, and the covariance of x and y, are
given, respectively, by

sx = sqrt(Sxx/(n-1)),   sy = sqrt(Syy/(n-1)),   sxy = Sxy/(n-1),

while the correlation coefficient becomes

rxy = Sxy/sqrt(Sxx·Syy).
In terms of x̄, ȳ, Sxx, Syy, and Sxy, the solution to the normal equations is:

m = Sxy/Sxx = sxy/sx²,   b = ȳ - m·x̄.
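As a quick check of these formulas, the following commands (a sketch, again assuming x and y from
the example above) reproduce the values of m and b obtained from the normal equations:

-->n = length(x);
-->Sxx = sum(x.^2) - sum(x)^2/n;        //sum of squares about the mean of x
-->Sxy = sum(x.*y) - sum(x)*sum(y)/n;   //sum of cross-products
-->m = Sxy/Sxx                          //slope
-->b = mean(y) - m*mean(x)              //intercept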
The linear relationship postulated for the data can be thought of as the model

Yi = M·xi + B + εi,

i = 1,2,...,n, where the Yi are independent, normally distributed random variables with mean
(B + M·xi) and common variance σ², and the εi are independent, normally distributed random
variables with mean zero and the same common variance σ².

Let yi = actual data value, and ŷi = m·xi + b = least-squares prediction of the data. Then, the
prediction error is:

ei = yi - ŷi = yi - (m·xi + b).
The prediction error being an estimate of the regression error ε, an estimate of σ² is the
so-called standard error of the estimate, se², given by

se² = (1/(n-2))·Σ(i=1..n) [yi - (m·xi + b)]² = SSE/(n-2) = (Syy - (Sxy)²/Sxx)/(n-2)
    = ((n-1)/(n-2))·sy²·(1 - rxy²).
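The standard error of the estimate can be evaluated from either form of this expression; a
sketch, assuming x, y, n, m, b, and rxy are available from the previous sketches:

-->SSE = sum((y-(m*x+b)).^2);           //sum of squared errors
-->se = sqrt(SSE/(n-2))                 //standard error of the estimate
-->sy = sqrt((sum(y.^2)-sum(y)^2/n)/(n-1));
-->sqrt((n-1)*sy^2*(1-rxy^2)/(n-2))     //same value, from sy and rxy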
The following function, linreg, calculates the elements of the simple linear regression described
above and produces a plot of the original data and the fitted line:

function [rxy,sxy,slope,intercept]=linreg(x,y)
n=length(x);m=length(y);
if m<>n then
  error('linreg - Vectors x and y are not of the same length.');
  abort;
end;
Sxx       = sum(x^2)-sum(x)^2/n;
Syy       = sum(y^2)-sum(y)^2/n;
Sxy       = sum(x.*y)-sum(x)*sum(y)/n;
sx        = sqrt(Sxx/(n-1));
sy        = sqrt(Syy/(n-1));
sxy       = Sxy/(n-1);
rxy       = Sxy/sqrt(Sxx*Syy);
xbar      = mean(x);
ybar      = mean(y);
slope     = Sxy/Sxx;
intercept = ybar - slope*xbar;
se        = sqrt((n-1)*sy^2*(1-rxy^2)/(n-2));
xmin   = min(x);
xmax   = max(x);
xrange = xmax-xmin;
xmin   = xmin - xrange/10;
xmax   = xmax + xrange/10;
xx     = [xmin:(xmax-xmin)/100:xmax];
deff('[y]=yhat(x)','y=slope*x+intercept');
yy     = yhat(xx);
ymin   = min(y);
ymax   = max(y);
yrange = ymax - ymin;
ymin   = ymin - yrange/10;
ymax   = ymax + yrange/10;
rect   = [xmin ymin xmax ymax];
plot2d(xx,yy,1,'011',' ',rect);
xset('mark',-9,1);
plot2d( x, y,-9,'011',' ',rect);
xtitle('Linear regression','x','y');
For example, consider the following data set:

  x       y
  4.5   113
  5.6   114
  7.2   109
 11.2    96.5
 15      91.9
 20      82.5

First we enter the data into vectors x and y, and then call function linreg.
-->x=[4.5,5.6,7.2,11.2,15,20];y=[113,114,109,96.5,91.9,82.5];
-->[rxy,sxy,slope,intercept] = linreg(x,y)
xbar = 10.583333
ybar = 101.15
sx = 6.0307269
sy = 12.823221
intercept = 123.38335
slope = - 2.1007891
sxy = - 76.405
rxy = - .9879955
The correlation coefficient rxy = -0.9879955 corresponds to a decreasing linear function. The
fact that the value of the correlation coefficient is close to -1 suggests a good linear fitting.
Confidence intervals and hypothesis testing

We can produce confidence intervals for the parameters M and B for a confidence level α. We can
also perform hypothesis tests on specific values of the parameters.

A confidence interval for the slope M is given by

m - t(n-2,α/2)·se/sqrt(Sxx) < M < m + t(n-2,α/2)·se/sqrt(Sxx),

and a confidence interval for the intercept B by

b - t(n-2,α/2)·se·[(1/n) + x̄²/Sxx]^(1/2) < B < b + t(n-2,α/2)·se·[(1/n) + x̄²/Sxx]^(1/2).

A confidence interval for the mean value of Y corresponding to x = x0 is

[m·x0 + b - t(n-2,α/2)·se·[(1/n) + (x0-x̄)²/Sxx]^(1/2) ; m·x0 + b + t(n-2,α/2)·se·[(1/n) + (x0-x̄)²/Sxx]^(1/2)],

while a confidence interval for a predicted value of Y at x = x0 is

[m·x0 + b - t(n-2,α/2)·se·[1 + (1/n) + (x0-x̄)²/Sxx]^(1/2) ; m·x0 + b + t(n-2,α/2)·se·[1 + (1/n) + (x0-x̄)²/Sxx]^(1/2)].
Hypothesis testing on the slope. The null hypothesis, H0: M = M0, is tested against the
alternative hypothesis, H1: M ≠ M0. The test statistic is

t0 = (m - M0)/(se/sqrt(Sxx)),

where t follows the Student's t distribution with ν = n-2 degrees of freedom, and n
represents the number of points in the sample. The test is carried out as that of a mean
value hypothesis test, i.e., given the level of significance α, determine the critical
value tα/2, then reject H0 if t0 > tα/2 or if t0 < -tα/2.

If you test for the value M0 = 0, and it turns out that you do not reject the null hypothesis,
H0: M = 0, then the validity of a linear regression is in doubt. In other words, the sample
data does not support the assertion that M ≠ 0. Therefore, this is a test of the significance
of the regression model.
Hypothesis testing on the intercept. Similarly, the null hypothesis, H0: B = B0, is tested against
the alternative hypothesis, H1: B ≠ B0. The test statistic is

t0 = (b - B0)/(se·[(1/n) + x̄²/Sxx]^(1/2)),

where t follows the Student's t distribution with ν = n-2 degrees of freedom, and n
represents the number of points in the sample. The test is carried out as that of a mean
value hypothesis test, i.e., given the level of significance α, determine the critical
value tα/2, then reject H0 if t0 > tα/2 or if t0 < -tα/2.
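Both tests are easy to carry out interactively with the cumulative-t function cdft used elsewhere
in this chapter; a sketch, assuming x, m, b, se, and Sxx are available from the previous sketches,
testing M0 = 0 and B0 = 0 at a 5% significance level:

-->alpha = 0.05; n = length(x);
-->ta2 = cdft('T',n-2,1-alpha/2,alpha/2);       //critical value t_alpha/2
-->t0m = (m-0)/(se/sqrt(Sxx));                  //test statistic for H0: M = 0
-->t0b = (b-0)/(se*sqrt(1/n+mean(x)^2/Sxx));    //test statistic for H0: B = 0
-->[abs(t0m)>ta2, abs(t0b)>ta2]                 //%T (true) means reject H0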
Function linreg is modified below so that it also returns the standard error of the estimate and
the remaining statistics of the fitting:

function [se,rxy,sxy,slope,intercept,sy,sx,ybar,xbar]=linreg(x,y)
n=length(x);m=length(y);
if m<>n then
  error('linreg - Vectors x and y are not of the same length.');
  abort;
end;
Sxx       = sum(x^2)-sum(x)^2/n;
Syy       = sum(y^2)-sum(y)^2/n;
Sxy       = sum(x.*y)-sum(x)*sum(y)/n;
sx        = sqrt(Sxx/(n-1));
sy        = sqrt(Syy/(n-1));
sxy       = Sxy/(n-1);
rxy       = Sxy/sqrt(Sxx*Syy);
xbar      = mean(x);
ybar      = mean(y);
slope     = Sxy/Sxx;
intercept = ybar - slope*xbar;
se        = sqrt((n-1)*sy^2*(1-rxy^2)/(n-2));
xmin   = min(x);
xmax   = max(x);
xrange = xmax-xmin;
xmin   = xmin - xrange/10;
xmax   = xmax + xrange/10;
xx     = [xmin:(xmax-xmin)/100:xmax];
deff('[y]=yhat(x)','y=slope*x+intercept');
yy     = yhat(xx);
ymin   = min(y);
ymax   = max(y);
yrange = ymax - ymin;
ymin   = ymin - yrange/10;
ymax   = ymax + yrange/10;
rect   = [xmin ymin xmax ymax];
plot2d(xx,yy,1,'011',' ',rect);
xset('mark',-9,1);
plot2d( x, y,-9,'011',' ',rect);
xtitle('Linear regression','x','y');
To illustrate these confidence intervals and hypothesis tests we use function linregtable, which
prints a complete table for the simple linear regression, with a significance level of 0.05 and
the data in the following table:

  x      y
  2.0    5.5
  2.5    7.2
  3.0    9.4
  3.5   10.0
  4.0   12.2
The following SCILAB commands are used to load the data and perform the regression analysis:

-->getf('linregtable')
-->x=[2.0,2.5,3.0,3.5,4.0];y=[5.5,7.2,9.4,10.0,12.2];
-->linregtable(x,y,0.05)

Regression line:                       y = 3.24*x + -.86
Significance level                   = .05
Value of t_alpha/2                   = 3.18245
Confidence interval for slope        = [2.37976;4.10024]
Confidence interval for intercept    = [-3.51144;1.79144]
Covariance of x and y                = 2.025
Correlation coefficient              = .98972
Standard error of estimate           = .42740
Standard error of slope              = .27031
Standard error of intercept          = .83315
Mean values of x and y               = 3  8.86
Standard deviations of x and y       = .79057  2.58805
Error sum of squares                 = .548
---------------------------------------------------------------------------------------
  x      y       ^y     error       C.I. mean              C.I. predicted
---------------------------------------------------------------------------------------
  2      5.5     5.62   -.12      4.56642    6.67358     3.89952    7.34048
  2.5    7.2     7.24   -.04      6.49501    7.98499     5.68918    8.79082
  3      9.4     8.86    .54      8.25172    9.46828     7.37002   10.35
  3.5   10      10.48   -.48      9.73501   11.225       8.92918   12.0308
  4     12.2    12.1     .1      11.0464    13.1536     10.3795    13.8205
---------------------------------------------------------------------------------------
Function linregtable also produces a plot of the original data and the fitted line.
The graph shows a good linear fitting of the data, confirmed by a correlation coefficient
(0.98972) very close to 1.0. The hypothesis tests indicate that the null hypothesis H0: B = 0
cannot be rejected, i.e., a zero intercept may be substituted for the intercept of -0.86 at a
95% confidence level. On the other hand, the null hypothesis H0: M = 0 is rejected, indicating a
proper linear relationship.
The following function, multiplot, produces a matrix of plots showing every variable of a data
set against every other one:

function [] = multiplot(X,y)
//Produces a matrix of plots:
//
//   ---x1---   x1-vs-x2   x1-vs-x3  ...  x1-vs-y
//   x2-vs-x1   ---x2---   x2-vs-x3  ...  x2-vs-y
//      .          .          .              .
//   y-vs-x1    y-vs-x2    y-vs-x3   ...  ---y---
//
[m n] = size(X);
nr = n+1; nc = nr;
XX = [X y'];
xset('window',1);
xset('default');
xbasc();
xset('mark',-1,1);
for i = 1:nr
  for j = 1:nc
    mtlb_subplot(nr,nc,(i-1)*nr+j);
    if i <> j then
      rect= [min(XX(:,j)) min(XX(:,i)) max(XX(:,j)) max(XX(:,i))];
      plot2d(XX(:,j),XX(:,i),-1,'011',' ',rect);
    else
      //diagonal cells: label the variable (this closing branch is a
      //reconstruction; the original lines were lost in extraction)
      plot2d(0,0,0,'010',' ',[0 0 1 1]);
      if i <= n then
        xstring(0.4,0.5,'x'+string(i));
      else
        xstring(0.45,0.5,'y');
      end;
    end;
  end;
end;
To sub-divide the plot window into subplots, function multiplot uses function mtlb_subplot, a
function that emulates Matlab's function subplot (which explains the prefix mtlb_). Details of
this and other functions with the mtlb_ prefix are presented in more detail in Chapter 20.
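For reference, mtlb_subplot(nr,nc,k) selects cell k, counted row by row, of an nr-by-nc array of
plots; a minimal stand-alone sketch:

-->t = [0:0.1:6.28];
-->xbasc();
-->mtlb_subplot(2,2,1); plot2d(t,sin(t));    //top-left cell of a 2x2 array
-->mtlb_subplot(2,2,4); plot2d(t,cos(t));    //bottom-right cell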
To illustrate the use of function multiplot we will use the following data set:

____________________
  x1     x2      y
____________________
  2.3   21.5   147.47
  3.2   23.2   165.42
  4.5   24.5   170.60
  5.1   26.2   184.84
  6.2   27.1   198.05
  7.5   28.3   209.96
____________________
The SCILAB commands to produce the plot array are shown next.

-->x1 = [2.3 3.2 4.5 5.1 6.2 7.5];
-->x2 = [21.5 23.2 24.5 26.2 27.1 28.3];
-->y  = [147.47 165.42 170.60 184.84 198.05 209.96];
-->X=[x1' x2'];
-->getf('multiplot')
-->multiplot(X,y)
A result like this array of plots is useful in determining some preliminary trends among the
variables. For example, the plots above show strong dependency between x1 and x2, besides
the expected dependency of y on x1 or y on x2. In that sense, variables x1 and x2 are not
independent of each other. When we refer to them as the independent variables, the meaning
is that of variables that explain y, which is, in turn, referred to as the dependent variable.
Multiple Linear Regression

Consider a data set consisting of m observations of n independent variables x1, x2, x3, ..., xn,
and of the corresponding values of a dependent variable y:

  x1       x2       x3      ...   xn       y
  x11      x21      x31     ...   xn1      y1
  x12      x22      x32     ...   xn2      y2
  x13      x23      x33     ...   xn3      y3
  .        .        .             .        .
  x1,m-1   x2,m-1   x3,m-1  ...   xn,m-1   ym-1
  x1,m     x2,m     x3,m    ...   xn,m     ym
The multiple linear fitting has the form ŷ = b0 + b1·x1 + b2·x2 + ... + bn·xn. For observation i,
with the row vector xi = [1, x1,i, x2,i, ..., xn,i] and the column vector of coefficients
b = [b0; b1; ...; bn], the fitted value is

ŷi = xi·b.

If we put together the matrix X whose rows are the vectors xi, i.e.,

      | 1   x11    x21    x31    ...   xn1  |
      | 1   x12    x22    x32    ...   xn2  |
  X = | 1   x13    x23    x33    ...   xn3  |
      | .   .      .      .            .    |
      | 1   x1,m   x2,m   x3,m   ...   xn,m |

and the vectors y = [y1; y2; ...; ym] and ŷ = [ŷ1; ŷ2; ...; ŷm], the sum of squared errors is

SSE = (y - ŷ)ᵀ(y - ŷ) = (y - X·b)ᵀ(y - X·b).

To minimize SSE we write ∂(SSE)/∂b = 0. It can be shown that this results in the expression

XᵀX·b = Xᵀy,

from which it follows that

b = (XᵀX)⁻¹Xᵀy.
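In SCILAB this last expression is a one-liner. A sketch of two equivalent ways of computing b,
assuming the matrix X and the column vector y are loaded (the left-division form avoids forming
the inverse explicitly and is numerically preferable):

-->b = inv(X'*X)*X'*y;      //normal equations, as in the text
-->b = X\y;                 //least-squares solution via left division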
An example for calculating the vector of coefficients b for a multiple linear regression was
presented in Chapter 5. The example is repeated here to facilitate understanding of the
procedure.
  x1     x2     x3     y
  1.20   3.10   2.00   5.70
  2.50   3.10   2.50   8.20
  3.50   4.50   2.50   5.00
  4.00   4.50   3.00   8.20
  6.00   5.00   3.50   9.50

The data is entered as follows:
-->x1 = [1.2,2.5,3.5,4.0,6.0]
 x1  =
!   1.2   2.5   3.5   4.   6. !

-->x2 = [3.1,3.1,4.5,4.5,5.0]
 x2  =
!   3.1   3.1   4.5   4.5   5. !

-->x3 = [2.0,2.5,2.5,3.0,3.5]
 x3  =
!   2.   2.5   2.5   3.   3.5 !

-->y = [5.7,8.2,5.0,8.2,9.5]
 y  =
!   5.7   8.2   5.   8.2   9.5 !

The matrix X is formed with a first column of ones followed by the columns x1', x2', and x3':

-->X = [ones(5,1) x1' x2' x3']
 X  =
!   1.   1.2   3.1   2.  !
!   1.   2.5   3.1   2.5 !
!   1.   3.5   4.5   2.5 !
!   1.   4.    4.5   3.  !
!   1.   6.    5.    3.5 !
The vector of coefficients for the multiple linear equation is calculated as:

-->b = inv(X'*X)*X'*y'
 b  =
! - 2.1649851 !
! -  .7144632 !
! - 1.7850398 !
!   7.0941849 !
Given a row vector xx = [1 x1 x2 x3] containing a leading 1 followed by values of the three
independent variables, a fitted value is obtained as the product xx*b, e.g.,

-->xx*b
 ans  =  2.739836
The fitted values of y corresponding to the values of x1, x2, and x3 from the table are obtained
from ŷ = X·b:

-->X*b
 ans  =
!   5.6324056 !
!   8.2506958 !
!   5.0371769 !
!   8.2270378 !
!   9.4526839 !
Compare these fitted values with the original data as shown in the table below:

  x1     x2     x3     y      y-fitted
  1.20   3.10   2.00   5.70   5.63
  2.50   3.10   2.50   8.20   8.25
  3.50   4.50   2.50   5.00   5.04
  4.00   4.50   3.00   8.20   8.23
  6.00   5.00   3.50   9.50   9.45
This procedure will be coded into a user-defined SCILAB function in an upcoming section,
incorporating some of the calculations for the regression analysis as shown next.

An array of plots showing the dependency of the different variables involved in the multiple
linear fitting was produced by using function multiplot:

-->multiplot(X,y);
The standard error of the estimate for a multiple linear fitting is calculated as

se = sqrt(MSE) = sqrt(SSE/(m-n-1)),

where m is the number of data points available and n is the number of independent variables in
the fitting (so that n+1 coefficients are estimated).
The matrix C = se²·(XᵀX)⁻¹ is a symmetric matrix known as the covariance matrix. The diagonal
elements cii are the variances associated with each of the coefficients bi, i.e.,

Var(bi) = cii,

while the elements off the diagonal, cij, i ≠ j, are the covariances of bi and bj, i.e.,

Cov(bi,bj) = cij, i ≠ j.

The square roots of the variances Var(bi) are referred to as the standard errors of the
estimates of the coefficients, i.e.,

se(bi) = sqrt(cii) = [Var(bi)]^(1/2).
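A sketch of these calculations for the example above (m = 5 observations, n = 3 independent
variables; X and b as computed earlier, y entered as a row vector):

-->m = 5; n = 3;
-->e = y' - X*b;                 //residuals
-->SSE = e'*e;
-->se = sqrt(SSE/(m-n-1));       //standard error of the estimate
-->C = se^2*inv(X'*X);           //covariance matrix of the coefficients
-->seb = sqrt(diag(C))           //standard errors se(b(i))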
Using a level of confidence α, we can write confidence intervals for each of the
coefficients βi in the linear model for Y as

bi - t(m-n,α/2)·se(bi) < βi < bi + t(m-n,α/2)·se(bi),

for i = 0,1,2,...,n, where bi is the i-th coefficient in the linear fitting, t(m-n,α/2) is the
value of the Student's t variable for ν = m-n degrees of freedom corresponding to a cumulative
probability of 1-α/2, and se(bi) is the standard error of the estimate for bi.
A confidence interval for the mean value of Y corresponding to a row vector x0 = [1, x1, ..., xn] is

[x0·b - t(m-n,α/2)·(x0·C·x0ᵀ)^(1/2) ; x0·b + t(m-n,α/2)·(x0·C·x0ᵀ)^(1/2)],

while a confidence interval for the corresponding predicted value of Y is

[x0·b - t(m-n,α/2)·se·[1 + x0·(C/se²)·x0ᵀ]^(1/2) ; x0·b + t(m-n,α/2)·se·[1 + x0·(C/se²)·x0ᵀ]^(1/2)].
The null hypothesis, H0: βi = βi0, is tested against the alternative hypothesis, H1: βi ≠ βi0.
The test statistic is

t0 = (bi - βi0)/se(bi),

where t follows the Student's t distribution with ν = m-n degrees of freedom. The test is
carried out as that of a mean value hypothesis test, i.e., given the level of significance α,
determine the critical value of t, tα/2, then reject H0 if t0 > tα/2 or if t0 < -tα/2. Of
interest, many times, is the test that a particular coefficient bi be zero, i.e., H0: βi = 0.
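Continuing the previous sketch (b, seb, m, and n assumed available), the tests H0: βi = 0 can be
run for all coefficients at once:

-->alpha = 0.1;
-->ta2 = cdft('T',m-n,1-alpha/2,alpha/2);   //critical value t_alpha/2
-->t0 = b ./ seb;                           //test statistics for H0: beta_i = 0
-->abs(t0) > ta2                            //%T (true) means reject H0: beta_i = 0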
The significance of the regression as a whole is tested using the parameter

F0 = (SSR/n)/(SSE/(m-n-1)) = MSR/MSE,

where MSR = SSR/n is known as the regression mean square, MSE = SSE/(m-n-1) = se² is the
mean square error, SSE is the sum of squared errors (defined earlier), and SSR is the
regression sum of squares defined as

SSR = Σ(i=1..m) (ŷi - ȳ)²,

while the total sum of squares is

SST = Σ(i=1..m) (yi - ȳ)².

The term SSR accounts for the variability in yi due to the regression line, while the term
SSE accounts for the residual variability not incorporated in SSR. The term SSR has n
degrees of freedom, i.e., the same number as there are independent variables in the multiple
linear fitting. The term SSE has m-n-1 degrees of freedom, while SST has m-1 degrees of freedom.
Analysis of variance for the test of significance of regression is typically reported in a table
that includes the sums of squares SSR, SSE, and SST, their degrees of freedom, the mean squares
MSR and MSE, and the value of F0 (see the output of function multiplelinear below). The ratio

R² = SSR/SST = 1 - SSE/SST

is known as the coefficient of multiple determination, while the positive square root of this
value is referred to as the multiple correlation coefficient, R. This multiple correlation
coefficient, for the case of a simple linear regression, is the same as the correlation
coefficient rxy. Values of R² are restricted to the range [0,1].

Unlike the simple correlation coefficient rxy, the coefficient of multiple determination, R², is
not a good indicator of linearity. A better indicator is the adjusted coefficient of multiple
determination,

R²adj = 1 - (SSE/(m-n-1))/(SST/(m-1)).
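A sketch of these quantities, continuing the same example (X, b, y, m, n, and alpha as before):

-->yh = X*b; ybar = mean(y);
-->SSE = sum((y'-yh).^2);                    //error sum of squares
-->SSR = sum((yh-ybar).^2);                  //regression sum of squares
-->SST = sum((y'-ybar).^2);                  //total sum of squares (= SSR + SSE)
-->F0 = (SSR/n)/(SSE/(m-n-1));               //statistic for significance of regression
-->Fa = cdff('F',n,m-n-1,1-alpha,alpha);     //critical value F_alpha
-->R2 = 1-SSE/SST; R2adj = 1-(SSE/(m-n-1))/(SST/(m-1));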
All of these calculations are collected in the function multiplelinear, listed next (the opening
lines of the listing are a reconstruction; they follow the same steps used later in function
residuals):

function [b,C,se] = multiplelinear(XA,y,alpha)
//(opening reconstructed)
[m n] = size(XA);
X = [ones(m,1) XA];                 //augmented matrix X
b = inv(X'*X)*X'*y;                 //vector of coefficients
yh = X*b;                           //fitted values of y
e = y - yh;                         //errors or residuals
SSE = e'*e;                         //sum of squared errors
MSE = SSE/(m-n-1);                  //mean square error
se = sqrt(MSE);                     //standard error of estimate
C = MSE*inv(X'*X);                  //covariance matrix
[nC mC] = size(C);
seb = [];                           //standard errors of the coefficients
for i = 1:nC
  seb = [seb; sqrt(C(i,i))];
end;
ta2 = cdft('T',m-n,1-alpha/2,alpha/2);   //t_alpha/2
sY = []; sYp = [];                  //Terms involved in C.I. for Y, Ypred
for i=1:m
  sY  = [sY;  sqrt(X(i,:)*C*X(i,:)')];
  sYp = [sYp; se*sqrt(1+X(i,:)*(C/se)*X(i,:)')];
end;
CIYL = yh-sY;                       //Lower limit for C.I. for mean Y
CIYU = yh+sY;                       //Upper limit for C.I. for mean Y
CIYpL = yh-sYp;                     //Lower limit for C.I. for predicted Y
CIYpU = yh+sYp;                     //Upper limit for C.I. for predicted Y
CIbL = b-ta2*seb;                   //Lower limit for C.I. for coefficients
CIbU = b+ta2*seb;                   //Upper limit for C.I. for coefficients
t0b = b./seb;                       //t parameter for testing H0:b(i)=0
decision = [];                      //Hypothesis testing for H0:b(i)=0
for i = 1:n+1
  if t0b(i)>ta2 | t0b(i)<-ta2 then
    decision = [decision; ' reject       '];
  else
    decision = [decision; ' do not reject'];
  end;
end;
ybar = mean(y);                     //Mean value of y
SST = sum((y-ybar)^2);              //Total sum of squares
SSR = sum((yh-ybar)^2);             //Regression sum of squares
MSR = SSR/n;                        //Regression mean square
MSE = SSE/(m-n-1);                  //Error mean square
F0  = MSR/MSE;                      //F parameter for significance of regression
Fa  = cdff('F',n,m-n-1,1-alpha,alpha);   //F_alpha
R2 = 1-SSE/SST; R = sqrt(R2);
R2a = 1-(SSE/(m-n-1))/(SST/(m-1));
//Printing of results
printf(' ');
printf('Multiple linear regression');
printf('==========================');
printf(' ');
printf('Table of coefficients');
printf('------------------------------------------------------------------------');
printf('   i        b(i)     se(b(i))      Lower      Upper         t0  H0:b(i)=0');
printf('------------------------------------------------------------------------');
for i = 1:n+1
  printf('%4.0f %10g %10g %10g %10g %10g '+decision(i),...
         i-1,b(i),seb(i),CIbL(i),CIbU(i),t0b(i));
end;
printf('------------------------------------------------------------------------');
printf('   t_alpha/2 = %g',ta2);
printf('------------------------------------------------------------------------');
printf(' ');printf(' ');
printf('Table of fitted values and errors');
printf('--------------------------------------------------------------------------------');
printf('   i       y(i)      yh(i)       e(i)         C.I. for Y         C.I. for Ypred');
printf('--------------------------------------------------------------------------------');
for i = 1:m
  printf('%4.0f %10.6g %10.6g %10.6g %10.6g %10.6g %10.6g %10.6g',...
         i,y(i),yh(i),e(i),CIYL(i),CIYU(i),CIYpL(i),CIYpU(i));
end;
printf('--------------------------------------------------------------------------------');
printf(' ');printf(' ');
printf('Analysis of variance');
printf('--------------------------------------------------------');
printf('Source of     Sum of      Degrees of    Mean');
printf('variation     squares     freedom       square        F0');
printf('--------------------------------------------------------');
printf('Regression  %10.6g %10.0f %10.6g %10.6g',SSR,n,MSR,F0);
printf('Residual    %10.6g %10.0f %10.6g',SSE,m-n-1,MSE);
printf('Total       %10.6g %10.0f',SST,m-1);
printf('--------------------------------------------------------');
printf('With F0 = %g and F_alpha = %g,',F0,Fa);
if F0>Fa then
  printf('reject the null hypothesis H0:beta1=beta2=...=betan=0.');
else
  printf('do not reject the null hypothesis H0:beta1=beta2=...=betan=0.');
end;
printf('--------------------------------------------------------');
disp(' ');
printf('Additional information');
printf('---------------------------------------------------------');
printf('Standard error of estimate (se)                 = %g',se);
printf('Coefficient of multiple determination (R^2)     = %g',R2);
printf('Multiple correlation coefficient (R)            = %g',R);
printf('Adjusted coefficient of multiple determination  = %g',R2a);
printf('---------------------------------------------------------');
printf(' ');
printf('Covariance matrix:');
disp(C);
printf(' ');
printf('---------------------------------------------------------');
//Plots of residuals - several options
for j = 1:n
  xset('window',j);xset('mark',-9,2);xbasc(j);
  plot2d(XA(:,j),e,-9)
  xtitle('Residual plot - error vs. x'+string(j),'x'+string(j),'error');
end;
xset('window',n+1);xset('mark',-9,2);
plot2d(y,e,-9);
xtitle('Residual plot - error vs. y','y','error');
xset('window',n+2);xset('mark',-9,2);
plot2d(yh,e,-9);
xtitle('Residual plot - error vs. yh','yh','error');
Then, after redefining X without the column of ones (function multiplelinear augments the matrix
internally) and y as a column vector, we call function multiplelinear to obtain information on
the multiple linear regression:

-->X = [x1' x2' x3']; y = [5.7;8.2;5.0;8.2;9.5];
-->[b,C,se]=multiplelinear(X,y,0.1);
Multiple linear regression
==========================

Table of coefficients
------------------------------------------------------------------------
   i        b(i)     se(b(i))      Lower      Upper         t0  H0:b(i)=0
------------------------------------------------------------------------
   0    -2.16499     1.14458    -5.50713    1.17716   -1.89152  do not reject
   1     -.71446      .21459    -1.34105  -.0878760    -3.3295  reject
   2    -1.78504      .18141    -2.31477   -1.25531   -9.83958  reject
   3     7.09418      .49595     5.64603    8.54234    14.3043  reject
------------------------------------------------------------------------
   t_alpha/2 = 2.91999
------------------------------------------------------------------------
Additional information
---------------------------------------------------------
Standard error of estimate (se)                 = .10720
Coefficient of multiple determination (R^2)     = .99920
Multiple correlation coefficient (R)            = .99960
Adjusted coefficient of multiple determination  = .99679
---------------------------------------------------------

Covariance matrix:

!   1.3100522    .2388806  - .1694459  - .5351636 !
!    .2388806    .0460470    .0313405    .1002469 !
!  - .1694459    .0313405    .0329111    .0534431 !
!  - .5351636    .1002469    .0534431    .2459639 !
The results show that, for a significance level α = 0.1, the hypothesis H0: β0 = 0 may not be
rejected, meaning that you could eliminate that term from the multiple linear fitting. On the
other hand, the test for significance of regression indicates that we cannot reject the
hypothesis that a linear regression does exist. Plots of the residuals against the variables
x1, x2, x3, y, and ŷ are produced by the function as well.
The following function, mlpredict, uses the values of b, C, and se produced by function
multiplelinear to predict the value of y corresponding to a vector x of values of the
independent variables, together with confidence intervals for the mean and predicted values
of Y:

function [y] = mlpredict(x,b,C,se,alpha)
//(header and first check reconstructed)
n = length(x); [nb mb] = size(b); [nC mC] = size(C);
if nb<>n+1 then
  error('mlpredict - Dimensions of vector b incompatible with vector x.');
  abort;
elseif nC<>n+1 then
  error('mlpredict - Dimensions of covariance matrix incompatible with vector b.');
  abort;
end;
xx = [1 x];                            //augment vector x
y = xx*b;                              //calculate y
CIYL = y - sqrt(xx*C*xx');             //Lower limit C.I. for mean Y
CIYU = y + sqrt(xx*C*xx');             //Upper limit C.I. for mean Y
CIYpL = y - se*sqrt(1+xx*(C/se)*xx');  //Lower limit C.I. for predicted Y
CIYpU = y + se*sqrt(1+xx*(C/se)*xx');  //Upper limit C.I. for predicted Y
//Print results
printf(' ');
disp(' ',x,'For x = ');
printf('Multiple linear regression prediction is y = %g',y);
printf('Confidence interval for mean value Y       = [%g,%g]',CIYL,CIYU);
printf('Confidence interval for predicted value Y  = [%g,%g]',CIYpL,CIYpU);
An application of function mlpredict, using the values of b, C, and se obtained from function
multiplelinear as shown above, is presented next:

-->x = [2 3.5 2.8];
-->y=mlpredict(x,b,C,se,0.1);

For x =
!   2.   3.5   2.8 !
Function multiplelinear can also be used for the case of a single independent variable. Consider
the data in the following table:

  x       y
  1.02    90.02
  1.05    89.14
  1.25    91.48
  1.34    93.81
  1.49    96.77
  1.44    94.49
  0.94    87.62
  1.30    91.78
  1.59    99.43
  1.40    94.69

The data is loaded as column vectors X and Y:

-->X = [1.02;1.05;1.25;1.34;1.49;1.44;0.94;1.30;1.59;1.40];
-->Y = [90.02;89.14;91.48;93.81;96.77;94.49;87.62;91.78;99.43;94.69];
-->[b,C,se] = multiplelinear(X,Y,0.05);

Multiple linear regression
==========================

Table of coefficients
------------------------------------------------------------------------
   i       b(i)    se(b(i))      Lower      Upper         t0  H0:b(i)=0
------------------------------------------------------------------------
   0    72.2564     2.03849     67.645    76.8678     35.446  reject
   1    16.1206      1.5701    12.5688    19.6724    10.2672  reject
------------------------------------------------------------------------
   t_alpha/2 = 2.26216
------------------------------------------------------------------------
Analysis of variance
--------------------------------------------------------
Source of     Sum of      Degrees of    Mean
variation     squares     freedom       square        F0
--------------------------------------------------------
Regression     109.448         1        109.448    105.416
Residual       8.30597         8        1.03825
Total          117.754         9
--------------------------------------------------------
With F0 = 105.416 and F_alpha = 5.31766,
reject the null hypothesis H0:beta1=beta2=...=betan=0.
--------------------------------------------------------

Additional information
---------------------------------------------------------
Standard error of estimate (se)                 = 1.01894
Coefficient of multiple determination (R^2)     = .92946
Multiple correlation coefficient (R)            = .96409
Adjusted coefficient of multiple determination  = .92065
---------------------------------------------------------

Covariance matrix:

!   4.1554489  - 3.1603934 !
! - 3.1603934    2.4652054 !
---------------------------------------------------------

For comparison, function linreg produces the same slope (16.120572) and intercept (72.256427),
as well as the values:

-->se
 se  =  1.0189435
-->rxy
 rxy  =  .9640868
The plot of the data fitting is also produced by function linreg.
Analysis of residuals

Residuals are simply the errors between the original data values y and the fitted values ŷ,
i.e., the values

e = y - ŷ.

Plots of the residuals against each of the independent variables, x1, x2, ..., xn, or against the
original data values y or the fitted values ŷ, can be used to identify trends in the residuals. If
the residuals are randomly distributed about zero, the plots will show no specific pattern for
the residuals. Thus, residual analysis can be used to check the assumption of normal
distribution of errors about a zero mean. If the assumption of normal distribution of residuals
about zero does not hold, the plots of residuals may show specific trends.
For example, consider the data from the following table:

__________________________________________________________
  x1     x2      p            q            r
__________________________________________________________
  1.1    22.1    1524.7407    3585.7418    1558.505
  2.1    23.2    1600.8101    3863.4938    1588.175
  3.4    24.5    1630.6414    4205.4344    1630.79
  4.2    20.4    1416.1757    3216.1131    1371.75
  5.5    25.2    1681.7725    4409.4029    1658.565
  6.1    23.1    1554.5774    3876.8094    1541.455
  7.4    19.2    1317.4763    2975.0054    1315.83
  8.1    18.2    1324.6139    2764.6509    1296.275
  9.9    20.5    1446.5163    3289.158     1481.265
 11.0    19.1    1328.9309    2983.4153    1458.17
__________________________________________________________
The table shows two independent variables, x1 and x2, and three dependent variables, i.e., p =
p(x1,x2), q = q(x1,x2), and r = r(x1,x2). We will try to fit a multiple linear function to the three
functions, e.g., p = b0 + b1·x1 + b2·x2, using function multiplelinear, with the specific purpose
of checking the distribution of the errors. Thus, we will not include in this section the output
for the function calls; only the plots of residuals (or errors) against the fitted data will be
presented. The SCILAB commands required to produce the plots are shown next.
-->x1 = [1.1 2.1 3.4 4.2 5.5 6.1 7.4 8.1 9.9 11.0];
-->x2 = [22.1 23.2 24.5 20.4 25.2 23.1 19.2 18.2 20.5 19.1];
-->p = [1524.7407 1600.8101 1630.6414 1416.1757 1681.7725 ...
-->     1554.5774 1317.4763 1324.6139 1446.5163 1328.9309];
-->q = [3585.7418 3863.4938 4205.4344 3216.1131 4409.4029 ...
-->     3876.8094 2975.0054 2764.6509 3289.158 2983.4153];
-->r = [1558.505 1588.175 1630.79 1371.75 1658.565 1541.455 ...
-->     1315.83 1296.275 1481.265 1458.17];
-->X = [x1' x2'];
-->[b,C,se] = multiplelinear(X,p',0.01);
-->[b,C,se] = multiplelinear(X,q',0.01);
-->[b,C,se] = multiplelinear(X,r',0.01);
The following table summarizes the values of R², R, and R²adj, and the result of the F test for
significance of regression, for the three fittings:

                R²        R         R²adj     Significance
p=p(x1,x2)     0.97658   0.98822   0.96989   reject H0
q=q(x1,x2)     0.99926   0.99963   0.99905   reject H0
r=r(x1,x2)     0.8732    0.93445   0.83697   reject H0
The three fittings show good values of the coefficients R², R, and R²adj, and the F test for the
significance of the regression indicates rejecting the null hypothesis H0: β1 = β2 = ... = βn = 0
in all cases. In other words, a multiple linear fitting is acceptable for all cases. The residual
plots, however, indicate that only the fitting of p=p(x1,x2) shows a random
distribution about zero. The fitting of q=q(x1,x2) shows a clear non-linear trend, while the
fitting of r=r(x1,x2) shows a funnel shape.
Scaling residuals
The errors ei may be standardized by using the values
zei = ei/se,
where se is the standard error of the estimate from the multiple linear fitting. If the errors
are normally distributed then approximately 95% of them will fall within the interval (-2,+2).
Any residual located far outside of this interval may indicate an outlier point. Outlier points
are points that do not follow the general trend of the data. They may indicate an error in the
data collection or simply an extremely large or low value in the data. If sufficient justification
exists, outlier points may be discarded and the regression repeated with the remaining data
points.
Another way to scale residuals is to use the so-called studentized residuals, defined as

ri = ei/(se·sqrt(1 - hii)),

i = 1,2,...,m, where hii are the diagonal elements of the hat matrix, H, defined as

H = X(XᵀX)⁻¹Xᵀ.

This matrix relates the observed values y to the fitted values ŷ, i.e., ŷ = H·y; thus, the
name hat matrix. Using the values hii we can write the studentized standard error of the i-th
residual as

se(ei) = se·(1 - hii)^(1/2).
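A sketch of these quantities in SCILAB, assuming X is the augmented matrix of the fitting, e the
column vector of residuals, and se the standard error of the estimate:

-->H = X*inv(X'*X)*X';           //hat matrix
-->h = diag(H);                  //leverages h(i,i)
-->ze = e/se;                    //standardized residuals
-->r = e ./ (se*sqrt(1-h))       //studentized residuals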
Influential observations

Sometimes in a simple or multiple linear fitting there will be one or more points whose effects
on the regression are unusually influential. Typically, these so-called influential observations
correspond to outlier points that tend to drag the fitting in one or the other direction. To
determine whether or not a point is influential we compute the Cook's distance, defined as

di = ei²·hii/((n+1)·se²·(1-hii)²) = zei²·hii/((n+1)·(1-hii)²) = ri²·hii/((n+1)·(1-hii)),

i = 1,2,...,m. An observation whose Cook's distance is larger than one is typically considered
influential.
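Continuing the sketch above (r, h, and the number of independent variables n assumed available),
the Cook's distances follow in one line:

-->d = r.^2 .* h ./ ((n+1)*(1-h))    //values d(i) > 1 flag influential observations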
A function for residual analysis

The following function, residuals, computes the scaled residuals and the Cook's distances for a
multiple linear fitting and prints them in a table (the function header and the printing section
of the listing are reconstructions, the latter following the format of the output shown below):

function [] = residuals(XA,y)
//(header and printing section reconstructed)
[m n] = size(XA);
X = [ones(m,1) XA];              //augmented matrix X
H = X*inv(X'*X)*X';              //hat matrix
yh = H*y;                        //fitted values
e=y-yh;                          //residuals
SSE=e'*e;
MSE = SSE/(m-n-1);
se = sqrt(MSE);                  //standard error of estimate
[nh mh] = size(H);
h=[];
for i=1:nh
  h=[h;H(i,i)];                  //diagonal elements of the hat matrix
end;
see = se*sqrt(1-h);              //studentized standard errors se(e(i))
ze  = e/se;                      //standardized residuals
r   = e./see;                    //studentized residuals
d   = r.*r.*h./(1-h)/(n+1);      //Cook's distances
//Printing of results
printf(' ');
printf('Residual Analysis for Multiple Linear Regression');
printf('-------------------------------------------------------------------------------');
printf('   i      y(i)      yh(i)      e(i)    se(e(i))     ze(i)      r(i)      d(i)');
printf('-------------------------------------------------------------------------------');
for i = 1:m
  printf('%4.0f %10.6g %10.6g %10.6g %10.6g %10.6g %10.6g %10.6g',...
         i,y(i),yh(i),e(i),see(i),ze(i),r(i),d(i));
end;
printf('-------------------------------------------------------------------------------');
printf('Standard error of estimate (se)  = %g',se);
printf('-------------------------------------------------------------------------------');
Applications of function residuals

Notice that most of the standardized residuals are within the interval (-2,2), and all of the
values di are less than one. This residual analysis, thus, does not reveal any outliers. Similar
results are obtained from the table corresponding to the fitting q = q(x1,x2), whose last row
reads:

Residual analysis for q = q(x1,x2)

-->residuals(X,q');

  10    2983.42    2987.34   -3.92933    6.74432    -.22832    -.58261   .0675956
The table corresponding to the fitting r = r(x1,x2), however, shows two residuals, e1 and e10,
whose Cook's distances are larger than one. These two points, even if their standardized
residuals are in the interval (-2,2), may be considered outliers. The residual analysis
eliminating these two suspected outliers is shown next.

-->XX = X(2:9,:)
 XX  =
!   2.1   23.2 !
!   3.4   24.5 !
!   4.2   20.4 !
!   5.5   25.2 !
!   6.1   23.1 !
!   7.4   19.2 !
!   8.1   18.2 !
!   9.9   20.5 !

-->rr = r(2:9)
 rr  =
!   1588.175   1630.79   1371.75   1658.565   1541.455   1315.83   1296.275   1481.265 !

-->residuals(XX,rr');
Even after eliminating the two influential observations we find that the remaining e1 and e8 are
influential in the reduced data set. We can check the analysis of residuals eliminating these
two influential observations as shown next:

-->XXX = XX(2:7,:)
 XXX  =
!   3.4   24.5 !
!   4.2   20.4 !
!   5.5   25.2 !
!   6.1   23.1 !
!   7.4   19.2 !
!   8.1   18.2 !

-->rrr = rr(2:7)
 rrr  =
!   1630.79   1371.75   1658.565   1541.455   1315.83   1296.275 !

-->residuals(XXX,rrr');
Residual Analysis for Multiple Linear Regression
-------------------------------------------------------------------------------
   i      y(i)       yh(i)       e(i)     se(e(i))     ze(i)       r(i)      d(i)
-------------------------------------------------------------------------------
   1    1630.79     1601.8     28.9923    33.2701     .20573     .87142    .40176
   2    1371.75    1557.26   -185.509     65.1625   -1.31635   -2.84688    1.9071
   3    1658.57    1484.88    173.68      96.7163    1.23241    1.79577    .33395
   4    1541.46    1451.48     89.9739    96.4308     .63844     .93304   .0909297
   5    1315.83    1379.11    -63.2765    63.918     -.44900    -.98996    .23759
   6    1296.27    1340.14    -43.8605    35.9466    -.31123   -1.22016    .72952
-------------------------------------------------------------------------------
Standard error of estimate (se)  = 140.927
-------------------------------------------------------------------------------
Even after eliminating the two points e1 and e8 from the reduced data set, another influential
point is identified, e2. We may continue eliminating influential points, at the risk of running
out of data, or try a different data fitting.
Multiple linear fitting with function datafit

SCILAB function datafit can also be used to obtain the coefficients of a multiple linear fitting
by minimizing the sum of squared errors. Consider the following data set with four independent
variables:

  x1    x2    x3    x4     y
  25    24    91   100    240
  31    21    90    95    236
  45    24    88   110    290
  60    25    87    88    274
  65    25    91    94    301
  72    26    94    99    316
  80    25    87    97    300
  84    25    86    96    296
  75    24    88   110    267
  60    25    91   105    276
  50    25    90   100    288
  38    23    89    98    261
First, we load the data and prepare the matrix XX for the application of function datafit.
-->XY=...
-->[25 24 91 100 240
-->31 21 90 95 236
-->45 24 88 110 290
-->60 25 87 88 274
-->65 25 91 94 301
-->72 26 94 99 316
-->80 25 87 97 300
-->84 25 86 96 296
-->75 24 88 110 267
-->60 25 91 105 276
-->50 25 90 100 288
-->38 23 89 98 261];

-->XX=XY';
Next, we define the error function to be minimized, provide an initial guess b0 for the
coefficients, and call function datafit:

-->deff('[e]=G(b,z)',...
--> 'e=b(1)+b(2)*z(1)+b(3)*z(2)+b(4)*z(3)+b(5)*z(4)-z(5)')

-->b0=ones(5,1);

-->[b,er]=datafit(G,XX,b0)
 er  =  1699.0093
 b   =
! - 102.71289 !
!    .6053697 !
!   8.9236567 !
!   1.4374508 !
!    .0136086 !

The same data can be analyzed in detail with function multiplelinear; the output reproduced
below corresponds to a significance level of 0.1:

-->[b,C,se] = multiplelinear(XY(:,1:4),XY(:,5),0.1);
Analysis of variance
--------------------------------------------------------
Source of     Sum of      Degrees of    Mean
variation     squares     freedom       square        F0
--------------------------------------------------------
Regression     4957.24         4        1239.31    5.10602
Residual       1699.01         7        242.716
Total          6656.25        11
--------------------------------------------------------
With F0 = 5.10602 and F_alpha = 2.96053,
reject the null hypothesis H0:beta1=beta2=...=betan=0.
--------------------------------------------------------

Additional information
---------------------------------------------------------
Standard error of estimate (se)                 = 15.5793
Coefficient of multiple determination (R^2)     = .74475
Multiple correlation coefficient (R)            = .86299
Adjusted coefficient of multiple determination  = .59889
---------------------------------------------------------

Covariance matrix:

!   43205.302  - 10.472107  - 142.32333  - 389.41606  - 43.653591 !
! - 10.472107    .1360850   - 1.4244286    .4289197   - .0095819  !
! - 142.32333  - 1.4244286    28.095536  - 5.3633513    .1923068  !
! - 389.41606    .4289197   - 5.3633513    5.7198487  - .1563728  !
! - 43.653591  - .0095819     .1923068   - .1563728     .5384939  !
---------------------------------------------------------
Notice that the error, er, returned by datafit is the same as the sum of squared errors, SSE,
minimized in function multiplelinear. The detailed regression analysis provided by multiplelinear
indicates that the null hypothesis in the test for significance of regression is rejected, i.e.,
a linear relationship does exist; however, the coefficient of multiple determination and its
adjusted value are relatively small, suggesting that the linear model may not be the best fitting
for this data set.

Note: Function datafit can be used to fit linear and non-linear functions. Details of the
application of function datafit were presented in Chapter 8.
Polynomial fitting

A polynomial fitting, e.g., the cubic equation y = b0 + b1·z + b2·z² + b3·z³, can be treated as a
multiple linear fitting with the independent variables x1 = z, x2 = z², x3 = z³. Consider the
data in the following table:

  z        y
  2.10      13.41
  3.20      46.48
  4.50      95.39
  6.80     380.88
 13.50    2451.55
 18.40    5120.46
 21.00    8619.14
The SCILAB instructions for producing the fitting are shown next. First, we load the data:
-->z=[2.10,3.20,4.50,6.80,13.50,18.40,21.00];
-->y=[13.41,46.48,95.39,380.88,2451.55,5120.46,8619.14];
Next, we prepare the vectors for the multiple linear regression and call the appropriate
function:
-->x1 = z; x2 = z^2; x3 = z^3; X = [x1' x2' x3']; yy = y';
-->[b,C,se]=multiplelinear(X,yy,0.01)
Multiple linear regression
==========================

Table of coefficients
------------------------------------------------------------------------
   i       b(i)    se(b(i))      Lower      Upper         t0  H0:b(i)=0
------------------------------------------------------------------------
   0   -467.699     664.835   -3528.66    2593.26    -.70348  do not reject
   1    223.301     262.938   -987.289    1433.89     .84925  do not reject
   2   -23.3898     26.1662   -143.861    97.0817    -.89390  do not reject
   3    1.56949      .74616   -1.86592    5.00491    2.10341  do not reject
------------------------------------------------------------------------
   t_alpha/2 = 4.60409
------------------------------------------------------------------------
Table of fitted values and errors
--------------------------------------------------------------------------------
   i      y(i)      yh(i)       e(i)        C.I. for Y            C.I. for Ypred
--------------------------------------------------------------------------------
   1     13.41   -87.3819    100.792   -351.859    177.095   -4793.08    4618.32
   2     46.48     58.78     -12.3     -115.377    232.937   -3048.97    3166.53
   3     95.39    206.529   -111.139     25.4059   387.653   -3024.28    3437.34
   4    380.88    462.697    -81.8173   221.167    704.228   -3836.65    4762.04
   5   2451.55   2145.6      305.95    1889.3     2401.9     -2415.28    6706.48
   6   5120.46   5499.33    -378.865   5272.7     5725.95     1463.9     9534.75
   7   8619.14   8441.76     177.38    8143.74    8739.78     3141.72   13741.8
--------------------------------------------------------------------------------
Analysis of variance
--------------------------------------------------------
Source of     Sum of        Degrees of    Mean
variation     squares       freedom       square       F0
--------------------------------------------------------
Regression    6.64055e+07        3      2.21352e+07  222.864
Residual      297964             3      99321.4
Total         6.67034e+07        6
--------------------------------------------------------
With F0 = 222.864 and F_alpha = 29.4567,
reject the null hypothesis H0:beta1=beta2=...=betan=0.
--------------------------------------------------------
Additional information
---------------------------------------------------------
Standard error of estimate (se)                 = 315.153
Coefficient of multiple determination (R^2)     = .99553
Multiple correlation coefficient (R)            = .99776
Adjusted coefficient of multiple determination  = .99107
---------------------------------------------------------

Covariance matrix:

!   442005.47  - 166781.09    15518.351  - 412.44179 !
! - 166781.09    69136.156  - 6728.9534    183.73635 !
!   15518.351  - 6728.9534    684.66847  - 19.297326 !
! - 412.44179    183.73635  - 19.297326     .5567618 !
---------------------------------------------------------
With the vector of coefficients found above, we define the cubic function:
-->deff('[y]=yf(z)','y=b(1)+b(2)*z+b(3)*z^2+b(4)*z^3')
Then, we produce the fitted data and plot it together with the original data:
-->zz=[0:0.1:25];yz=yf(zz);
-->rect = [0 -500 25 15000]; //Based on min. & max. values of z and y
-->xset('mark',-1,2)
-->plot2d(zz,yz,1,'011',' ',rect)
-->plot2d(z,y,-1,'011',' ',rect)
-->xtitle('Fitting of a cubic function','z','y')
Alternatively, we can use function datafit to obtain the coefficients of the fitting. The error
function G for the cubic is redefined so that each data column z of XX carries a value z(1) and
the corresponding y = z(2) (this deff is a reconstruction; the original definition was lost):

-->XX = [z;y]
 XX  =
!    2.1      3.2      4.5      6.8      13.5      18.4      21.     !
!   13.41    46.48    95.39   380.88   2451.55   5120.46   8619.14   !

-->deff('[e]=G(b,z)','e=b(1)+b(2)*z(1)+b(3)*z(1)^2+b(4)*z(1)^3-z(2)')

-->b0=ones(4,1)
 b0  =
!   1. !
!   1. !
!   1. !
!   1. !

-->[b,er]=datafit(G,XX,b0)
 er  =  297964.09
 b   =
! - 467.6971  !
!   223.30004 !
! - 23.389786 !
!   1.5694905 !
Exercises

[1]. To analyze the dependency of the mean annual flood, Q(cfs), on the drainage area, A(mi²),
for a given region, data from six experimental watersheds is collected. The data is summarized
in the table below:

A(mi²)   16.58   3.23   16.8   42.91   8.35   6.04
Q(cfs)   455     105    465    1000    290    157

(a) Use function linreg to perform a simple linear regression analysis on these data. The
purpose is to obtain a relationship of the form Q = mA + b. (b) Determine the covariance of A
and Q, (c) the correlation coefficient, and (d) the standard error of the estimate.
[2] For the data of problem [1] use function linregtable to perform the linear fitting. (a) What
is the decision regarding the hypothesis that the slope of the linear fitting may be zero at a
level of confidence of 0.05? (b) What is the decision regarding the hypothesis that the
intercept of the linear fitting may be zero at a level of confidence of 0.05?

[3] For the data of problem [1] use function multiplelinear to produce the data fitting. (a)
What is the decision regarding the hypothesis that the slope of the linear fitting may be zero at
a level of confidence of 0.05? (b) What is the decision regarding the hypothesis that the
intercept of the linear fitting may be zero at a level of confidence of 0.05? (c) What is the
decision regarding the hypothesis that the linear fitting may not apply at all?
[4]. The data shown in the table below represents the monthly precipitation, P(in), in a
particular month, and the corresponding runoff, R(in), out of a specific hydrological basin for
the period 1960-1969.

Year    1960   1961   1962   1963   1964   1965   1966   1967   1968   1969
P (in)  1.95   10.8   3.22   4.51   6.71   1.18   4.82   6.38   5.97   4.64
R (in)  0.46   2.85   0.99   1.4    1.98   0.45   1.31   2.22   1.36   1.21

(a) Use function linreg to perform a simple linear regression analysis on these data. The
purpose is to obtain a relationship of the form R = mP + b. (b) Determine the covariance of P
and R, (c) the correlation coefficient, and (d) the standard error of the estimate.
[5] For the data of problem [4] use function linregtable to perform the linear fitting. (a) What
is the decision regarding the hypothesis that the slope of the linear fitting may be zero at a
level of confidence of 0.05? (b) What is the decision regarding the hypothesis that the
intercept of the linear fitting may be zero at a level of confidence of 0.05?

[6] For the data of problem [4] use function multiplelinear to produce the data fitting. (a)
What is the decision regarding the hypothesis that the slope of the linear fitting may be zero at
a level of confidence of 0.05? (b) What is the decision regarding the hypothesis that the
intercept of the linear fitting may be zero at a level of confidence of 0.05? (c) What is the
decision regarding the hypothesis that the linear fitting may not apply at all?
[7]. The following table shows data indicating the monthly precipitation during the month of
February, Pf, and during the month of March, Pm, as well as the runoff during the month of
March, Rm, for a specific watershed during the period 1935-1958.
Year   Pm     Pf     Rm
1935 9.74 4.11 6.15
1936 6.01 3.33 4.93
1937 1.30 5.08 1.42
1938 4.80 2.41 3.60
1939 4.15 9.64 3.54
1940 5.94 4.04 2.26
1941 2.99 0.73 0.81
1942 5.11 3.41 2.68
1943 7.06 3.89 4.68
1944 6.38 8.68 5.18
1945 1.92 6.83 2.91
1946 2.82 5.21 2.84
1947 2.51 1.78 2.02
1948 5.07 8.39 3.27
1949 4.63 3.25 3.05
1950 4.24 5.62 2.59
1951 6.38 8.56 4.66
1952 7.01 1.96 5.40
1953 4.15 5.57 2.60
1954 4.91 2.48 2.52
1955 8.18 5.72 6.09
1956 5.85 10.19 4.58
1957 2.14 5.66 2.02
1958 3.06 3.04 2.59
a. Use function multiplot to show the dependence of the many variables.
b. Use function multiplelinear to check the multiple linear fitting Rm = b0 + b1Pm +
b2Pf.
c. For a level of confidence of 0.05, what are the decisions regarding the hypotheses
that each of the coefficients may be zero?
d. What is the decision regarding the hypothesis that the linear fitting may not apply
at all for the same level of confidence?
e. What value of runoff for the month of March is predicted if the precipitation in
the month of March is 6.2 in, and that of the month of February is 3.2 in?
f. What are the confidence intervals for the mean value and the prediction for the
data of question (e) at a confidence level 0.05?
[8]. In the analysis of runoff produced by precipitation into a watershed, often we are
required to estimate a parameter known as the time of concentration (tc) which determines
the time to the peak of the hydrograph produced by the watershed. It is assumed that the
time of concentration is a function of a characteristic watershed length (L), of a characteristic
watershed slope (S), and of a parameter known as the runoff curve number (CN). Runoff curve
numbers are numbers used by the U.S. Soil Conservation Service in the estimation of runoff
from watersheds. Runoff curve numbers are typically functions of the location of the
watershed and of its soil and vegetation covers. The following table shows values of the time
of concentration, tc(hr), the watershed length, L(ft), the watershed slope, S(%), and the runoff
curve number (CN) for 5 experimental watersheds.
tc(hr)    0.2    0.2    0.2    0.3    0.3
L (ft)    800   1200   2100   2000   1500
S (%)       2      3      4      6      1
CN         75     84     88     70     85
(a) Use function multiplot to show the interdependence of the various variables in the table.
(b) Assuming that a multiple-linear fitting can be used to explain the dependence of tc on L, S,
and CN, use function multiplelinear to determine the coefficients of the fitting. (c) For a level
of confidence of 0.01, what are the decisions regarding the hypotheses that each of the
coefficients may be zero? (d) What is the decision regarding the hypothesis that the linear
fitting may not apply at all for the same level of confidence? (e) What value of the time of
concentration is predicted for L = 1750 ft, S = 5%, and CN = 80. (f) What are the confidence
intervals for the mean value and the prediction for the data of question (e) at a confidence
level 0.05?
[9]. The data in the table below shows the peak discharge, qp(cfs), the rainfall intensity,
i(in/hr), and the drainage area, A(acres), for rainfall events in six different watersheds.

qp(cfs)     23    45    44    64    68    62
i(in/hr)   3.2   4.6   5.1   3.8   6.1   7.4
A(acres)    12    21    18    32    24    16
(a) Use function multiplot to show the interdependence of the various variables in the table.
(b) Assuming that a multiple-linear fitting can be used to explain the dependence of qp on i
and A, use function multiplelinear to determine the coefficients of the fitting. (c) For a level
of confidence of 0.1, what are the decisions regarding the hypotheses that each of the
coefficients may be zero? (d) What is the decision regarding the hypothesis that the linear
fitting may not apply at all for the same level of confidence? (e) What value of the peak
discharge is predicted for i = 5.6 in/hr and A = 25 acres? (f) What are the confidence
intervals for the mean value and the prediction for the data of question (e) at a confidence
level 0.10?
[10]. Measurements performed across a pipeline diameter produce the following table of
velocities, v(fps), as function of the radial distance, r(in), measured from the pipe centerline.
r(in)   V(fps)     r(in)   V(fps)     r(in)   V(fps)     r(in)   V(fps)
 2.7    57.73       1.3    64.40       0.9    65.50       2.5    58.90
 2.6    58.30       1.2    64.80       1.1    64.80       2.6    58.40
 2.5    59.25       1.1    64.80       1.3    64.10       2.7    57.50
 2.4    59.70       1.0    65.20       1.4    63.70       2.8    57.00
 2.3    59.80       0.9    65.50       1.5    63.40       2.9    56.60
 2.2    60.60       0.8    65.50       1.6    63.00       3.0    55.90
 2.1    61.20       0.7    65.90       1.7    62.70       3.1    54.80
 2.0    61.50       0.5    66.30       1.8    62.50       3.2    54.20
 1.9    62.20       0.3    66.40       1.9    62.10       3.3    53.20
 1.8    62.70       0.1    66.50       2.0    61.25       3.4    52.35
 1.7    62.90       0.1    66.50       2.1    61.20       3.5    50.80
 1.6    63.05       0.3    66.30       2.2    60.55       3.6    49.50
 1.5    63.65       0.5    66.00       2.3    60.00       3.7    47.70
 1.4    64.10       0.7    65.70       2.4    59.40       3.8    44.45
(a) Plot the velocity profile indicated in the table. Notice that the values of r start at 2.7 in
and decrease to 0.1 in, just to increase again from 0.1 in to 3.8 in. These values of r
represent velocity measurements at both sides of the centerline along a diameter of the
pipe. To produce an accurate plot of the velocity distribution, take the values of r from 2.7 in
down to 0.1 in as negative, and the remaining values as positive. (b) Using SCILAB function
datafit, fit a logarithmic function of the form v = b0 + b1·ln(y) to the data, where y = R-r, and
R is the pipe radius measured as R = 3.958 in.
[11]. The tables below show measurements of stagnation pressure on an air jet in a test setup.
The values of x represent distances from the jet outlet, while the values of r represent
distances from the jet centerline measured radially outwards. The stagnation pressure values
are used to calculate air velocities at the locations indicated by using v = (2gh)^(1/2), where g
is the acceleration of gravity. To make the units consistent, we recommend that you transform
the data to feet by using 1 in = (1/12) ft and 1 cm = 0.0328 ft, and calculate the velocities
using g = 32.2 ft/s².
Along centerline               Across jet at x = 12 in
 x(in)    h(cm)                 r(in)    h(cm)
 -0.25    20.58                  8.50     0.00
  0.00    20.38                  5.00     0.00
  0.50    20.16                  4.50     0.15
  1.00    20.16                  4.20     0.20
  3.00    20.16                  3.70     0.47
  5.00    19.6                   3.50     0.62
  6.00    18.25                  3.09     0.94
  6.50    17.11                  2.83     1.19
  8.00    13.52                  2.55     1.62
 10.00     9.28                  2.00     2.62
 15.00     4.14                  1.50     3.91
 20.00     2.23                  1.00     5.12
                                 0.50     6.07
                                 0.00     6.51
                                -0.50     6.08
                                -1.00     5.06
                                -1.50     3.97
                                -2.00     2.73
                                -2.50     1.78
                                -3.00     1.11
                                -3.50     0.70
                                -4.20     0.25
                                -4.50     0.17
                                -4.80     0.12
                                -5.00     0.07
                                -5.30     0.02
                                -5.50     0.00
(a) Convert the data columns to feet and calculate the velocities corresponding to the different
values of h. (b) Plot the centerline velocities against the distance x and fit an equation of the
form

v(x) = b0/(b1 + b2·x)

to the data resulting from the first table. (c) Plot the velocity v at the section x = 12 in
against the radial distance |r|, and fit an equation of the form

v(r) = b0·exp(-b1·r²)

to the data resulting from the second table.
[12]. For relatively low pressures, Henry's law relates the vapor pressure of a gas, P, to the
molar fraction of the gas in a mixture, x. The law is stated as P = kx, where k is known as
Henry's constant. To determine Henry's constant in practice we use data of P against x and
fit a straight line, i.e., P = b0 + b1x. If the value of b0 can be taken as zero, then b1 = k.
Given the pressure-molar fraction data shown in the next table, use function linreg to
determine Henry's constant for the solubility of the following elements or compounds in water
at the temperatures indicated:

Sulfur dioxide, 23 C        Carbon monoxide, 19 C       Hydrogen, 23 C
P(mmHg)   x(10^-3)          P(mmHg)   x(10^-3)          P(mmHg)   x(10^-3)
    5      0.3263              900      2.417             1100      1.861
   10      0.5709             1000      2.685             2000      3.382
   50      2.329              2000      5.354             3000      5.067
  100      4.213              3000      8.000             4000      6.729
  200      7.448              4000     10.630             6000      9.841
  300     10.2                5000     13.230             8000     12.560
                              6000     15.800
                              7000     18.280
                              8000     20.670
[13]. In the analysis of liquid mixtures it is often necessary to determine parameters known as
activity coefficients. For the mixture of two liquids the van Laar equations can be used to
determine the activity coefficients, γ1 and γ2, in terms of the molecular fractions x1 and x2:

ln γ1 = A·[1 + (A·x1)/(B·x2)]^(-2),   ln γ2 = B·[1 + (B·x2)/(A·x1)]^(-2).

Given the following values of the activity coefficients, use SCILAB function datafit to determine
the coefficients A and B of the van Laar equations:

  γ1     γ2
  4.90   1.05
  2.90   1.20
  1.95   1.30
  1.52   1.50
  1.30   1.70
  1.20   2.00
  1.10   2.25
  1.04   2.60
  1.01   2.95
[14]. An alternative relationship between the activity coefficients, γ1 and γ2, in terms of the
molecular fractions x1 and x2 is given by the Margules equations:

ln γ1 = (2B-A)·x2² + 2(A-B)·x2³

ln γ2 = (2A-B)·x1² + 2(B-A)·x1³.

Using the data of problem [13] and SCILAB function datafit, determine the coefficients A and B
of the Margules equations.
[15]. The following table shows measurements of the infiltration rate, f, as function of time, t,
for a specific storm in a watershed.

t(hr)   f(cm/hr)      t(hr)   f(cm/hr)
  1       3.9          14       1.43
  2       3.4          16       1.36
  3       3.1          18       1.31
  4       2.7          20       1.28
  5       2.5          22       1.25
  6       2.3          24       1.23
  8       2            26       1.22
 10       1.8          28       1.2
 12       1.54         30       1.2

(a) Use SCILAB's function datafit to obtain the parameters f0, fc, and k of Horton's equation,
f = fc + (f0 - fc)·exp(-k·t).
(b) Plot the original data and the fitted data in the same set of axes.
[16]. The following data represents different properties of granite samples taken at the
locations indicated by the coordinates x(mi) and y(mi) on a specific site. The properties listed
in the table are as follows: x1 = percentage of quartz in the sample, x2 = color index (a
percentage), x3 = percentage of total feldspar, w = specific gravity (water = 1.0).
x1
x2
x3
21.3
38.9
26.1
29.3
24.5
30.9
27.9
22.8
20.1
16.4
15.0
0.6
18.4
19.5
34.4
26.9
28.7
28.5
38.4
28.1
37.4
5.5
2.7
11.1
6.0
6.6
3.3
1.9
1.2
5.6
21.3
18.9
35.9
16.6
14.2
4.6
8.6
5.5
3.9
3.0
12.9
3.5
73.0
57.4
62.6
63.6
69.1
65.1
69.1
76.0
74.1
61.7
65.6
62.5
64.9
65.4
60.7
63.6
65.8
67.8
57.6
59.0
57.6
2.63
2.64
2.64
2.63
2.64
2.61
2.63
2.63
2.65
2.69
2.67
2.83
2.70
2.68
2.62
2.63
2.61
2.62
2.61
2.63
2.63
0.920
1.150
1.160
1.300
1.400
1.590
1.750
1.820
1.830
1.855
2.010
2.040
2.050
2.210
2.270
2.530
2.620
3.025
3.060
3.070
3.120
6.090
3.625
6.750
3.010
7.405
8.630
4.220
2.420
8.840
10.920
14.225
10.605
8.320
8.060
2.730
3.500
7.445
5.060
5.420
12.550
12.130
Download at InfoClearinghouse.com
52
0.9
8.8
16.2
2.2
29.1
24.9
39.6
17.1
0.0
19.9
1.2
13.2
13.7
26.1
19.9
4.9
15.5
0.0
4.5
0.0
4.0
23.4
29.5
22.9
34.9
5.5
28.4
5.1
6.9
3.6
11.3
47.8
11.6
34.8
18.8
21.2
2.3
4.1
18.8
12.2
39.7
30.5
63.8
24.1
12.4
9.8
74.4
55.4
77.6
69.3
65.7
67.8
56.6
70.9
52.2
67.2
64.0
67.4
64.0
71.2
76.0
74.30
69.70
60.20
63.90
35.20
71.80
63.10
60.40
2.78
2.76
2.63
2.74
2.64
2.70
2.63
2.71
2.84
2.68
2.84
2.74
2.74
2.61
2.63
2.77
2.72
2.83
2.77
2.92
2.77
2.79
2.69
3.400
3.520
3.610
4.220
4.250
4.940
5.040
5.060
5.090
5.240
5.320
5.320
5.330
5.350
5.610
5.850
6.460
6.590
7.260
7.420
7.910
8.470
8.740
15.400
9.910
11.520
16.400
11.430
5.910
1.840
11.760
16.430
11.330
8.780
13.730
12.450
1.430
4.150
13.840
11.660
14.640
12.810
16.610
14.650
13.330
15.770
(a) Use function multiplot to show the interdependence of the various variables in the table.
(b) Assuming that a multiple-linear fitting can be used to explain the dependence of w on x1,
x2, and x3, use function multiplelinear to determine the coefficients of the fitting. (c) For a
level of confidence of 0.1, what are the decisions regarding the hypotheses that each of the
coefficients may be zero? (d) What is the decision regarding the hypothesis that the linear
fitting may not apply at all for the same level of confidence? (e) What value of the specific
gravity is predicted for x1 = 17, x2 = 25, and x3 = 55? (f) What are the confidence
intervals for the mean value and the prediction for the data of question (e) at a confidence
level 0.10?