Regression in Geoda: Briggs Henan University 2010


Regression in geoDA

Example regression analyses for Illiteracy Rate (ILLITERACY)


ChinaData.shp (n=35)
1. Simple regression with URBAN_POP_
ChinaData_29 (n=29)
2. Simple regression with URBAN_POP
3. Multiple regression with URBAN_POP
and RMB_PC_UR_
4. Spatial lag and error multiple regression
5. Multiple regression with log of Illiteracy



Running Regression in geoDA: I
1 File>Open Shape File
ChinaData
2 Tools>Weights>Open or Create
Need weights to test for spatial autocorrelation.
Generally, always use a weights file.
Methods>Regress
Place as below

You can begin with Methods>Regress if
--very large number of observations (over 1,000)
--no spatial weights
--data only in a .dbf file

If you have a large number of observations, do not

Need this for Moran's I for residuals
Running Regression in geoDA: II
Select one dependent variable
One or more independent variables
Select type of regression: Classic or Lag or Error
Click RUN, then Click SAVE

Warning-bug! Use Suggested name. The names are reversed here!
Click OK to save these.

Saves values for Predicted Y and Residuals in the table
--use Table>>Promotion to see them in table.
--you can map them or draw graphs
--use Table >> Save to Shapefile if you want to keep them permanently
Running Regression in geoDA: III
The results are saved in this text file.
It is saved in the same folder as the shapefile.
You can rename it and change location.

Click OK to see the results.
(You can also open the file later with a program such as Notepad)
--scroll to end of file since results are added to end if file already exists

Warning: if you want the residuals (see previous slide) you must click Save before clicking OK

Click Reset to run a different regression
Summary: Running Regression in geoDA
File>Open Shape File
ChinaData
Tools>Weights>Open or Create
(need weights to test for spatial autocorrelation in residuals)
Methods>Regress
Place as below
Select variables as below.
Select type of regression: Classic Lag Error
Click RUN, then Click SAVE
Warning-bug! Use Suggested name. The names are reversed here!
Click OK to save these.
Use Table>Promotion to see them in table.
Click OK in Regression window to see results
--scroll to end of file since results are added to end if file exists already
Regression for Provinces: n = 35
• Next slide shows results from running a simple regression with
ChinaData.shp
Y = Illiteracy rate (ILLITERACY)
X = % of population urban (URBAN_POP_)
• All provinces included
• Note problems with
– Extreme value for Xizang/Tibet
– Zeros (0) for missing data on X variable
(Taiwan, Macau, Hong Kong, P’eng-hu)
• Solution: Reduced data set to 29 using ArcGIS
– (do not know how to do this in geoDA!)
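
The data reduction described above could also be done outside ArcGIS. The following is only a minimal Python sketch, assuming the attribute table has been exported to a list of records. The names, values and the cutoff for an "extreme" value are invented for illustration and are not the actual ChinaData values.

```python
# Hypothetical records (values invented for illustration only)
records = [
    {"NAME": "A", "ILLITERACY": 5.0, "URBAN_POP_": 0.45},
    {"NAME": "B", "ILLITERACY": 9.0, "URBAN_POP_": 0.30},
    {"NAME": "C", "ILLITERACY": 0.0, "URBAN_POP_": 0.0},    # zero stands in for missing data
    {"NAME": "D", "ILLITERACY": 45.0, "URBAN_POP_": 0.20},  # extreme value on Y
]

def clean(rows, y_field, x_field, y_max):
    """Drop rows where X is zero (missing) or Y exceeds a chosen cutoff."""
    return [r for r in rows if r[x_field] != 0 and r[y_field] <= y_max]

# y_max is an assumed cutoff, chosen here only to exclude record D
kept = clean(records, "ILLITERACY", "URBAN_POP_", y_max=40.0)
print([r["NAME"] for r in kept])
```

The filter treats a zero on X as missing data, which is only appropriate when a true zero is impossible for that variable.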



Results for simple regression
Display table: Table > Promotion
Plot using: Explore > ScatterPlot

Note: mean of residuals is always zero

Total Variation: Illiteracy v. Urban Pop%
Predicted by Regression: OLS_Predict v. Urban Pop%
Residual Variation: OLS_Resid v. Urban Pop%

Extreme value identified by linking: Xizang/Tibet
Partitioning the Variance on Y
Total Variation = Predicted by Regression + Residual Variation
(plots: Illiteracy v. Urban Pop%, OLS_Predict v. Urban Pop%, OLS_Resid v. Urban Pop%)

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2$$

SS Total = SS Regression + SS Residual
SS Total: or Total Sum of Squares
SS Regression: or Explained Sum of Squares
SS Residual: or Error Sum of Squares
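
The partition above can be checked numerically. The following is a small Python sketch with invented data (not the ChinaData values). It fits a simple regression by least squares and confirms that SS Total equals SS Regression plus SS Residual, and that the residuals average to zero.

```python
def simple_ols(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Invented data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = simple_ols(x, y)
y_hat = [a + b * xi for xi in x]
y_bar = sum(y) / len(y)

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
r_squared = ss_regression / ss_total          # share of variance explained
mean_residual = sum(yi - yh for yi, yh in zip(y, y_hat)) / len(y)
print(ss_total, ss_regression + ss_residual, r_squared, mean_residual)
```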
Simple Regression Results from GeoDA: general
n = 35
Statistics for dependent variable

Results for overall regression
--explains only 4.6% of variance in Y
--Not statistically significant

Sigma-square = Variance of the estimate = 1368.89/33 = 41.4816
SE of regression = standard error of the estimate = √41.4816 = 6.44062

Results for each regression coefficient
Y = 11.3146 - 6.578X
--the probability for the F test and for the t test on X are identical in simple regression
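
The arithmetic on this slide can be reproduced with a few lines of Python. This is only a check of the reported numbers, using the residual sum of squares from this slide together with the R2 and t statistic listed for Simple-35 in the Regression Results Summary table. The variable names are my own.

```python
import math

n, k = 35, 1              # observations and independent variables
ss_residual = 1368.89     # residual sum of squares, from the slide
r2 = 0.046                # R2 reported for Simple-35
t_slope = -1.263          # t statistic reported for Urban Pop in Simple-35

df = n - k - 1                              # degrees of freedom
sigma_square = ss_residual / df             # variance of the estimate
se_regression = math.sqrt(sigma_square)     # standard error of the estimate
adj_r2 = 1 - (1 - r2) * (n - 1) / df        # adjusted R2
f_from_t = t_slope ** 2                     # in simple regression, F equals t squared
print(df, sigma_square, se_regression, adj_r2, f_from_t)
```

Because F is the square of t when there is only one independent variable, the two tests give the same probability, which is why the F-prob and the Urban Pop prob are both 0.215 for Simple-35.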



Simple Regression Results from GeoDA: spatial
n = 35

Moran's I for regression residuals
--not statistically significant (p=.09)

Space > Univariate Moran for variable: OLS_Resid
Same results!
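
To make the statistic concrete, here is a minimal Python sketch of Moran's I using the standard formula. The weights matrix and residual values are invented (four areas in a row), and the sketch computes only the statistic, not its significance.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weights matrix (list of lists)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    s0 = sum(sum(row) for row in weights)               # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi ** 2 for zi in z)

# Four areas in a row; neighbors share a border (binary contiguity, invented layout)
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1.0, 1.0, -1.0, -1.0], w)    # similar residuals sit next to each other
alternating = morans_i([1.0, -1.0, 1.0, -1.0], w)  # neighbors always differ
print(clustered, alternating)
```

Positive values indicate that neighboring residuals tend to be similar, and negative values that they tend to differ.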



Results with omitted observations: much better!

Now explains 33.41%
Statistically significant
But probably non-linear
Spatial autocorrelation not a problem

Data for China Provinces 29:
excludes Xizang/Tibet, Macao, Hong Kong, Hainan, Taiwan, P'eng-hu
Multiple Regression Results n = 29
Illiteracy with % Pop Urban and Urban Income
Overall Results

Results for each variable
--% Pop Urban: significant
--Urban Income: Not significant

Spatial Results
--Not significant



Residual Analysis:
Illiteracy v. Urban Pop % and UrbanIncomePerCapita

Moran’s I = .0226
p = 0.5520
Not statistically significant
No Spatial autocorrelation
in residuals
Spatial Error Model Results
illustrative only: not needed

Spatial error not significant



Spatial Lag Model Results
illustrative only: not needed

Spatial lag not significant



Regression Results Summary

               Overall                               Urban Pop                   Urban Income                *Spatial Term
               R2     Adj R2  Akaike  F      F-prob  coeff    Test Stat  prob    coeff    Test Stat  prob    coeff    Test Stat  prob
Simple-35      0.046  0.017   231.65  1.60   0.215   -6.58    -1.263     0.215                               0.1636   1.678      0.0934
Simple-29      0.334  0.309   155.42  13.55  0.001   -16.15   -3.681     0.001                               0.0272   0.578      0.5631
Multiple       0.384  0.337   155.16  8.11   0.002   -26.80   -3.151     0.000   0.00041  1.452      0.159   -0.0226  0.383      0.7015
Spatial Error  0.385          155.13                 -27.02   -3.411     0.001   0.00041  1.572      0.116   -0.0389  -0.162     0.8716
Spatial Lag    0.387          157.05                 -26.00   -3.128     0.006   0.00040  1.486      0.137   0.0720   0.340      0.7339

*Spatial Term
OLS: for Moran's I
Lag: for W_Illiteracy
Error: for Lambda

For Multiple Regression 29
Robust LM (lag) 1.312 0.2520
Robust LM (error) 1.220 0.2693



Note on: Variables Saved for Spatial Models
Again, labels are reversed. Use suggested variable names.
ERR_ indicates use of Spatial Error model.
LAG_ indicates use of Spatial Lag model.
OLS_ indicates use of classic model.

For the spatial lag model, there is a distinction between the residual and the
prediction error. The latter is the difference between the observed value and
the predicted value that uses only exogenous variables, rather than treating
the spatial lag Wy as observed. (Documentation for 905i, page 53)

Prediction error (xxx_PRDERR): calculated without including spatial term.


Residual error (xxx_RESIDU): calculated including spatial term
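
The distinction can be illustrated with a small NumPy sketch. It assumes the usual spatial lag model y = rho*W*y + X*beta + e and takes the predicted value "using only exogenous variables" to be the reduced form (I - rho*W)^-1 X*beta. That reading, the weights, the coefficients and the disturbances are all assumptions for illustration, not output from geoDA.

```python
import numpy as np

# Row-standardized weights for four areas in a row (invented layout)
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
X = np.column_stack([np.ones(4), [0.2, 0.4, 0.6, 0.8]])  # intercept and one X variable
beta = np.array([10.0, -5.0])            # assumed coefficients
rho = 0.3                                # assumed spatial lag coefficient
e = np.array([0.5, -0.2, 0.1, -0.4])     # assumed disturbances

A_inv = np.linalg.inv(np.eye(4) - rho * W)
y = A_inv @ (X @ beta + e)               # data generated by y = rho*W*y + X*beta + e

# Residual: treats the spatial lag W*y as observed (spatial term included)
residual = y - (rho * (W @ y) + X @ beta)
# Prediction error: predicted value uses only the exogenous X variables
prediction_error = y - A_inv @ (X @ beta)
print(residual)
print(prediction_error)
```

In this sketch the residual recovers e exactly, while the prediction error also carries the disturbances of neighboring areas, so the two saved columns will generally differ.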



Improving the model
Table >> Add Column, then Table >> Field Calculator

Relationship is Non-linear
Use log of Illiteracy
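
Why a log transformation helps can be seen in a short Python sketch. The data are invented so that Y declines exponentially with X. They are not the ChinaData values, and the natural log is an assumed choice of base.

```python
import math

def r_squared(x, y):
    """R2 of the simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Invented data in which Y declines exponentially with X
x = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y = [30.0 * math.exp(-4.0 * xi) for xi in x]
log_y = [math.log(yi) for yi in y]   # the new column: log of the dependent variable

r2_raw = r_squared(x, y)
r2_log = r_squared(x, log_y)
print(r2_raw, r2_log)
```

The log of zero is undefined, so any zero values on the dependent variable would need to be dealt with before the new column is calculated.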



The same plots using Excel
Relationship is Non-linear
(plots: Illiteracy v. Urban pop %, and Log of Illiteracy v. Urban pop %)



Y = Log of Illiteracy
R2 increases from 38% to 83%!

Urban Income now significant and Urban Population is not!



Log of Illiteracy: makes relationship linear

               Overall                               Urban Pop                   Urban Income                *Spatial Term
               R2     Adj R2  Akaike  F      F-prob  coeff    Test Stat  prob    coeff    Test Stat  prob    coeff    Test Stat  prob
Simple-35      0.046  0.017   231.65  1.60   0.215   -6.58    -1.263     0.215                               0.1636   1.678      0.0934
Simple-29      0.334  0.309   155.42  13.55  0.001   -16.15   -3.681     0.001                               0.0272   0.578      0.5631
Multiple       0.384  0.337   155.16  8.11   0.002   -26.80   -3.151     0.000   0.00041  1.452      0.159   -0.0226  0.383      0.7015
Multiple Log Y 0.837  0.824   560.07  66.69  0.000   -3962.73 -1.800     0.083   -6446.67 -2.975     0.006   -0.1192  -0.548     0.5839

*Spatial Term
OLS: for Moran's I

Urban Income now significant, and % urban not significant.
--these two variables are highly intercorrelated
--see next slide
Inter-Correlation between Urban Population and Urban Income
(scatter plot: Urban Population versus Urban Income)

R2 for Urban Pop versus Urban Income: 0.84
R is .92
N=29
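
The link between R and R2, and what it implies for a model that contains both variables, can be sketched in a few lines of Python. The variance inflation factor (VIF) is not mentioned on the slide. It is added here as one common way to express the effect of intercorrelation, using the standard formula for a model with two independent variables.

```python
r = 0.92                 # correlation between Urban Population and Urban Income, from the slide
r2 = r ** 2              # share of variance the two variables have in common
vif = 1 / (1 - r2)       # variance inflation factor when both are in the same model
print(r2, vif)
```

Squaring the rounded R of .92 gives a value slightly above the reported 0.84, a difference that is consistent with rounding. A VIF well above 1 means the coefficient estimates for the two variables are much less precise than they would be if the variables were uncorrelated.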
Creating a better model
Table >> Add Column then use Table >> Field Calculator

• Transforming dependent and/or independent variables can often improve the predictive capability of regression models
• geoDA has several capabilities to support this.



Other software options for multiple regression
• Multiple regression of the type discussed here is not available in ArcGIS
– Only geographically weighted regression available
(there is a multiple regression for raster data, but it is only in ArcInfo Workstation and is difficult to use)
• Use geoDA to create spatial lag variables, then use standard statistical packages such as SAS, SPSS or STATA
• Use R
– Free open source software, but difficult to use
– http://cran.r-project.org/web/views/Spatial.html
• CrimeStat III has some support for spatial regression
http://www.icpsr.umich.edu/NACJD/crimestat.html
• For a good list of spatial software sources, go to:
http://en.wikipedia.org/wiki/List_of_spatial_analysis_software
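
For readers unsure what a spatial lag variable is, the following Python sketch computes one by hand. It assumes row-standardized contiguity weights, so that the lag is the average of the neighbors' values. The layout and the values are invented.

```python
def row_standardize(weights):
    """Scale each row of a weights matrix so it sums to 1 (rows of all zeros are left as is)."""
    out = []
    for row in weights:
        total = sum(row)
        out.append([w / total if total else 0.0 for w in row])
    return out

def spatial_lag(values, weights):
    """Spatial lag: for each area, the weighted average of its neighbors' values."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

# Binary contiguity for four areas in a row (invented layout) and invented Y values
w = row_standardize([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
illiteracy = [4.0, 6.0, 10.0, 12.0]
w_illiteracy = spatial_lag(illiteracy, w)
print(w_illiteracy)
```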



What have we learned today?
• How to use geoDA to run
– classic regression models
– Spatial Lag models
– Spatial Error Models
• Importance of examining data for “problems”
– Can have a very large effect on results
– Missing data and zeros
– Extreme values can dominate results
• Using transformations to create a better model





Geographically Weighted Regression



Geographically Weighted Regression
• The idea of Local Indicators can also be applied to regression
• It's called geographically weighted regression
• It calculates a separate regression for each polygon and its neighbors,
– then maps the parameters from the model, such as the regression coefficient (b) and/or its significance value
• Mathematically, this is done by applying the spatial weights matrix (Wij) to the standard formulae for regression
See Fotheringham, Brunsdon and Charlton, Geographically Weighted Regression, Wiley, 2002
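
The idea can be sketched with NumPy as one weighted least-squares fit per location. This is an illustrative sketch only, not the ArcGIS implementation. The Gaussian distance kernel, the bandwidth value and the data are all assumptions.

```python
import numpy as np

def gwr_coefficients(coords, x, y, bandwidth):
    """Fit one weighted regression y = a + b*x per location; return an (n, 2) array of [a, b]."""
    coords = np.asarray(coords, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])
    y = np.asarray(y, dtype=float)
    results = []
    for i in range(len(coords)):
        d = np.linalg.norm(coords - coords[i], axis=1)   # distance from location i to every location
        w = np.exp(-0.5 * (d / bandwidth) ** 2)          # Gaussian kernel (an assumed choice)
        XtW = X.T * w                                    # same as X.T @ diag(w)
        results.append(np.linalg.solve(XtW @ X, XtW @ y))
    return np.array(results)

# Invented data: the slope of y on x is 1 in the west and 3 in the east
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0, 3.0, 6.0, 9.0])
local = gwr_coefficients(coords, x, y, bandwidth=2.0)
print(local[:, 1])   # local slope (b) for each location
```

Each local fit here rests on only three nearby observations, which illustrates why the local estimates can be unreliable.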



Problems with Geographically Weighted Regression
• Each regression is based on few observations
– the estimates of the regression parameters (b) are unreliable
• Need to use more observations than just those with shared border, but
– how far out do we go?
– How far out is the “local effect”?
• Need strong theory to explain why the regression parameters are different at different places
• Serious questions about validity of statistical inference tests since observations not independent
GWR in ARCGIS
• Requires ArcInfo, Spatial Analyst or Geostat. Analyst license
• Shapefile is created:
– Open its table to see results
– for each polygon there are standard regression results
– Condition variable: indicates when the results are unstable due to local multicollinearity
• Results not good if condition > 30, Null, or -1.79e+308
– Use source_ID to join with FID of original data to identify observations
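
What a condition number measures can be shown with a small NumPy sketch. How ArcGIS computes its Condition variable is not stated here, so the sketch uses a common definition (ratio of largest to smallest singular value after scaling each column to unit length), which is an assumption. The data are invented.

```python
import numpy as np

def condition_number(X):
    """Condition number of a design matrix after scaling each column to unit length."""
    X = np.asarray(X, dtype=float)
    scaled = X / np.linalg.norm(X, axis=0)
    singular_values = np.linalg.svd(scaled, compute_uv=False)
    return singular_values.max() / singular_values.min()

ones = np.ones(6)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2_independent = np.array([2.0, -1.0, 3.0, -2.0, 1.0, -3.0])
x2_collinear = x1 + np.array([0.01, -0.01, 0.01, -0.01, 0.01, -0.01])  # nearly a copy of x1

stable = condition_number(np.column_stack([ones, x1, x2_independent]))
unstable = condition_number(np.column_stack([ones, x1, x2_collinear]))
print(stable > 30, unstable > 30)
```

When one independent variable is nearly a copy of another, the condition number rises far above 30, which is the situation the Condition variable is meant to flag.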



Usage Tips from ArcGIS Help
• Use projected data
• Observations included in each regression depend on kernel type,
bandwidth method and bandwidth distance parameters set by user
– Max of 1,000 observations in any one local regression
• Multicollinearity can be a problem
– if variables cluster spatially
– if use binary/nominal/categorical variables
– Never use dummy variables (1/0) to index spatial regions
• (Multicollinearity: intercorrelation between independent variables)
• Not appropriate for small data sets: need several hundred observations
• Shapefiles cannot store “null” values: treated as zero. Be sure there is
no missing data



Running GWR in ArcGIS



Execution Dialog for GWR in ArcGIS

Results presumably for global regression?
--R2 value does not agree with results from geoDA?
Mapping Results from GWR in ArcGIS

(Default) standardized residuals
--the bigger the absolute value the poorer the prediction?

Regression coefficient for % Urban Pop
--larger impact of urban pop in south east China.



Join with the original shapefile using FID
and Source_Id in order to identify provinces
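
The logic of the join can be sketched in plain Python. The province names, the LocalR2 field name and its values are hypothetical placeholders. Only the matching of Source_ID to FID comes from the slide.

```python
# Invented rows standing in for the original shapefile table (keyed by FID)
original = {0: "Province A", 1: "Province B", 2: "Province C"}

# Invented rows standing in for the GWR output table (LocalR2 is a hypothetical field)
gwr_output = [
    {"Source_ID": 2, "LocalR2": 0.41},
    {"Source_ID": 0, "LocalR2": 0.37},
    {"Source_ID": 1, "LocalR2": 0.52},
]

# Attach the province name to each output row where Source_ID equals FID
joined = [dict(row, NAME=original[row["Source_ID"]]) for row in gwr_output]
for row in joined:
    print(row["NAME"], row["LocalR2"])
```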



GWR output: R2 and Y values
Output table (part)
(Columns reordered. Highlighted columns obtained from join with original data.)

Observed: values on the dependent variable Y

Predicted values and residuals are based upon each local regression and are not the same as those for a global regression.



GWR output: regression coefficients and standard errors
Columns shown: Regression coefficients (b), Standard error of the estimate, Standard error of the coefficients

No statistical significance results provided
--statistical significance tests in GWR have been severely criticized.
