Geographically Weighted Regression Workbook
GWR WORKSHOP
GEOGRAPHICALLY
WEIGHTED
REGRESSION
WORKBOOK
Martin Charlton
A Stewart Fotheringham
Chris Brunsdon
The first two authors acknowledge generous funding from Science Foundation Ireland which helped
create the National Centre for Geocomputation
© The contents of this book are
the copyright of the authors and
may not be reproduced or used
without their permission. This
extends to both the software and
the data files distributed with the
workbook.
Contents
5 Lab 2: GWR and House Price Determinants
5.1 Introduction
5.2 The Data
5.3 The Exercise
5.4 Modelling House Price Variation
5.5 Exploration, then Modelling
5.6 Model 1 – Price~Floorspace
5.7 Model 2 – Price~Floorspace+Type
5.8 Model 3 – Price~Floorspace+Age
5.9 Model 4 – Price~Floorspace+Type+Age
5.10 Points or Surfaces
8 Finally
1
Introduction
1.1 Introduction
This series of linked labs is designed to introduce you to applying
Geographically Weighted Regression to problems with your own spatial data. The
labs assume a little familiarity with ArcGIS and none with the GWR software.
Before explaining each lab in detail (see subsequent chapters of this workbook),
we first describe some basic operational issues with the software package used
in each lab – GWR 3.0. Some of this material is covered in the lectures.
1.2 The Basic Operation of GWR 3.0 and its Linkage to GIS
The following diagram summarises the basic operation of GWR 3.0 and
how its outputs are linked to a GIS.
The user supplies a data file plus ideas on what form of model to calibrate into
the user-friendly GWR Model Editor which is completed in a series of ‘Windows-
style’ menus and tick boxes. Unseen to the user, this creates a control file for a
large FORTRAN program which produces two types of output. A Listing File is
written to the screen and an Output File is saved in the user’s workspace. This
latter file contains location-specific parameter estimates and other diagnostics
which can be read into a GIS (along with other spatially referenced data) for
mapping.
There are three subfolders – Georgia, Housing and Tokyo –
within the SampleData folder. Each subfolder has a number of files containing
data on variables used in the GWR software and data used for mapping the
results. The various files to be used are listed in each of the lab descriptions in
subsequent sections.
________________________________________________________________________
________________________________________________________________________
3. Locate My Computer/Explorer – we will need this in order to change some
filenames. There’s usually an icon at the top left of the desktop or it might
be on the Start menu. Write down the correct path below:
________________________________________________________________________
4. The software and SampleData folder you will need will usually be in the
C:\GWR3 folder, but may be somewhere else. Note here where this folder is:
________________________________________________________________________
5. The Work folder, which you will need, is usually c:\GWR3\Work but it may
be somewhere else – if it is, note its location here.
________________________________________________________________________
6. Note how to access Excel in case you want to manipulate some data files
_______________________________________________________________________
_____________________________________________________________________
2
A Primer on Running GWR3
What you will learn:
1 How to set up, run and interpret a GWR model
2 How to specify a GWR model and understand the workflow
3 How your data file should be organised
4 The content of the parameter estimate (output) file
5 Using the Model Editor to specify the model and associated options
6 Interpreting the Listing File
2.1 Introduction
This section shows how to set up and run a GWR model using the Visual Basic
GWR Model Editor. Much of this has already been covered in the lecture material
so feel free to skip it if you want. There are several different varieties of
regression model that can be run – here we will assume that you wish to run a
geographically weighted regression with a Gaussian error term. This is the
geographically weighted equivalent of an ordinary least squares regression, such
as you might find in SPSS, and is probably the most frequently encountered
application of GWR.
The main GWR program window is shown on the right; it has four items in the
menu bar, ‘File’, ‘Analysis’, ‘Tools’ and ‘Help’. The program assumes that the
user will wish to proceed with one of five initial options, and provides a
‘Wizard’ for guidance through the processes.
1. Select a task
2. Select a data file
3. Decide where to estimate the parameters
4. Specify the name of the parameter estimate file
5. Use the Model Editor to:
5.1 Title the run
5.2 Specify the dependent variable
5.3 Specify the independent variable(s)
5.4 Specify the data point location variables
5.5 Specify the weighting scheme
5.6 Specify the calibration method
5.7 Specify the type of parameter estimate file
5.8 Save the model control file
5.9 Run the model
6. Examine the diagnostics
Following this you import the parameter estimate file into a mapping package so
that you can examine any spatial variation in parameter estimates.
2.3 Data Organisation
The data file for GWR is an ASCII file which will normally have the filetype of .dat
or .csv. The assumptions in the software are as follows:
1. The first line of the data file is a comma separated list of the names of the
variables in the remainder of the file
2. The variable names should not contain any spaces
3. The variable names should be no more than 8 characters in length
4. The variable names should be formed from upper and lower case
alphabetic characters and the numbers 0 … 9 inclusive
5. The only other character which is allowed is the underscore (_)
6. The remaining lines in the file contain the data
7. There are as many lines as there are observations (“data points”)
8. Each line contains the same number of attributes as there are variables
9. Attributes are separated by commas
10. All attributes are numeric
11. At least one of the attributes will be a dependent variable
12. There are two variables which specify the location of each data point
As an example, here are the first few lines of the data file for the Georgia
educational attainment data to be used later in the labs:
ID,Latitude,Longitud,TotPop90,PctRural,PctBach,PctEld,PctFB,PctPov,PctBlack
13001,31.753389,-82.285580,15744,75.6,8.2,11.43,0.635,19.9,20.76
13003,31.294857,-82.874736,6213,100.0,6.4,11.77,1.577,26.0,26.86
13005,31.556775,-82.451152,9566,61.7,6.6,11.11,0.272,24.1,15.42
13007,31.330837,-84.454013,3615,100.0,9.4,13.17,0.111,24.8,51.67
13009,33.071932,-83.250851,39530,42.7,13.3,8.64,1.432,17.5,42.39
13011,34.352696,-83.500539,10308,100.0,6.4,11.37,0.340,15.1,3.49
13013,33.993471,-83.711811,29721,64.6,9.2,10.63,0.922,14.7,11.44
13015,34.238402,-84.839182,55911,75.2,9.0,9.66,0.816,10.7,9.21
13017,31.759395,-83.219755,16245,47.0,7.6,12.81,0.332,22.0,31.33
If you have been using ArcMap to integrate your data for an analysis, you can
export a .dbf file as a .txt file, which can be renamed in Explorer. When
ArcGIS does this it places quotes around the variable names. These are not,
however, stripped off by the FORTRAN program, so the files will need further
editing. You can also create .csv files in Excel (save your data in comma-
separated values form), Notepad, or any other application capable of writing
ASCII files.
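If you prefer to script the clean-up, the quote stripping and the variable-name checks take only a few lines. A sketch in Python (the function name and file paths are ours, not part of GWR3):

```python
import csv
import re

def clean_gwr_header(src, dst):
    """Copy a comma-separated data file, stripping any quotes that
    ArcGIS placed around the header names and enforcing the GWR
    rules: at most 8 characters, letters, digits and underscore."""
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        header = [name.strip().strip('"') for name in next(reader)]
        for name in header:
            if len(name) > 8 or not re.fullmatch(r"[A-Za-z0-9_]+", name):
                raise ValueError(f"illegal variable name: {name!r}")
        writer.writerow(header)
        writer.writerows(reader)
```

Run once on the exported .txt file to produce a .csv that the FORTRAN program will accept.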
diagnostic statistics. For this reason we have decided to make these outputs
available as a file which can then be post-processed.
3. MapInfo Interchange Format. A .mif/.mid pair of files is created. These
can be imported into MapInfo. The files are ASCII files and can be hand-
edited to remove any anomalies. This is somewhat experimental at the
moment.
Before a new model can be created, a data file must be selected from the data
folder (see section 2.3 for details of the data file structure). The model editor
will extract the names of the variables from the first line of the data file that is
selected. We will base this description of the use of the Model Editor around the
data concerning educational attainment in the counties of the state of Georgia,
USA. These data have been described briefly in the previous presentation and
further information is given in section 4 (Lab 1).
standard Windows type ‘File Open’ form. There are only two data files in the data
folder, one for the example we are using and one which is supplied with the
software. Click on the relevant data file name to highlight it and click on ‘Open’
to proceed.
As well as a check on the names of the
variables, GWR also prints the names of
the files which you selected thus far. If
you have made a mistake, you have the
option of correcting this before you
continue. (Note: the various folder names
we use here may be different from the
ones you will use!). As we have decided
to fit the model at the data points, the
calibration location filename is blank.
Once the variables have been selected, which essentially defines the model, the
Kernel Type is chosen for the GWR. The choices are either ‘Fixed’ (Gaussian) or
‘Adaptive’ (bi-square). The kernel bandwidth is determined by either cross-
validation (CV) or corrected AIC (AICc) minimisation. Alternatively, an a priori value for
the bandwidth can be entered by clicking on the Bandwidth option and entering
the bandwidth in the window. If you are using a Fixed kernel, the bandwidth
needs to be specified in terms of the distance units used in your model. If you
are using an Adaptive kernel, the bandwidth is specified as the number of data
points in the local sample used to estimate the parameters. If you specify too
small a bandwidth, you may get unpredictable results, or the program may be
unable to estimate the model. With a very large data set, bandwidth selection
can be made using a sample of data points in order to save time. This is
achieved by clicking on Sample (%) and entering the desired percentage of the
data used for the bandwidth selection procedure. The default is that the
procedure will use All data.
If your coordinates are in some projected coordinate system (UTM, for example)
then the Coordinate Type should be Cartesian. If your measurements are in
Lat/Lon, then select Spherical. If you have Lat/Lon coordinates, but your study
area is in a relatively low latitude, then you can use Cartesian as the type. With
Spherical coordinates, the distance computations in the geographical weighting
use Great Circle distances.
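These weighting choices can be sketched directly in code. The functions below are illustrative only (the names are ours): a fixed Gaussian kernel, an adaptive bi-square kernel whose bandwidth is the distance to the Nth nearest data point, and a standard great-circle distance for spherical coordinates:

```python
import math

def gaussian_weight(d, bandwidth):
    """Fixed kernel: weight decays smoothly with distance; the
    bandwidth is in the same distance units as the coordinates."""
    return math.exp(-0.5 * (d / bandwidth) ** 2)

def bisquare_weights(dists, n_neighbours):
    """Adaptive kernel: the bandwidth is the distance to the n-th
    nearest data point, so each local sample holds roughly
    n_neighbours points; weights are zero beyond the bandwidth."""
    b = sorted(dists)[n_neighbours - 1]
    return [(1 - (d / b) ** 2) ** 2 if d < b else 0.0 for d in dists]

def great_circle(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Spherical (Great Circle) distance used with Lat/Lon coordinates."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

Note the asymmetry the text describes: a fixed bandwidth is a distance in coordinate units, while an adaptive bandwidth is a count of points, so the effective distance bandwidth grows where the data are sparse.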
The Model Options include specifying the type of output required and the type
of significance test to be employed on the local parameter estimates. Apart
from the default output listings (described later), the user has the option of
outputting List Bandwidth Selection, List Predictions and List Pointwise
Diagnostics. Examples of these are shown below. The significance testing
options are: Monte Carlo, Leung, or None (see above). Finally, the format of the
output file needs to be specified: this should be compatible with the previous
selection of an output filetype (see above). (Note: although the Leung test
appears in the model editor, it is not really supported as it is very cumbersome
and we do not recommend its use.)
poverty line and the percentage black. We would also like to see if there are any
geographical variations in the relationships between educational attainment and
these variables.
The sample point location variables are Longitud (x) and Latitude (y). There is
no aspatial weight variable. We have chosen an adaptive kernel and the
bandwidth will be chosen by AICc minimisation using all the data. A Monte Carlo
significance testing procedure has also been selected for the local parameter
estimates. Printing of a range of diagnostics has been requested and the output
will be written to an ArcInfo export file. Some of the output will, by default, also
be written to the screen in a listing file.
1 You may need to make a small alteration in your Windows setup so that the DOS box closes on program
termination.
With small data sets and simple models, the program runs very quickly. For
instance, calibrating a bivariate GWR model using the 159 counties of Georgia
on a Pentium III PC took less time than it has taken to type this sentence.
However, the time requirements increase rapidly as both model complexity and
the number of data points increase. One solution to very slow run times is to
use the option in the Model Editor which allows the user to supply a percentage
of the data points on which to base the bandwidth selection procedure.
Following a description of the model that has been calibrated, the first section of
the output from GWR3 contains the parameter estimates and their standard
errors from a global model fitted to the data. This is shown below.
**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 1816.210715
Effective number of parameters.. 7.000000
Sigma........................... 3.456697
Akaike Information Criterion.... 855.443391
Coefficient of Determination.... 0.645830
Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept 14.779297592328 1.705507562188 8.665630340576
TotPop90 0.000023567534 0.000004746089 4.965675354004
PctRural -0.043878182061 0.013715372112 -3.199197292328
PctEld -0.061925096691 0.121460075458 -0.509839117527
PctFB 1.255536084016 0.309690422174 4.054164886475
PctPov -0.155421764065 0.070388091758 -2.208069086075
PctBlack 0.021917908085 0.025251694359 0.867977738380
There are two parts to the output from the global model. In the first panel,
some useful diagnostic information is printed which includes the residual sum of
squares (eᵀe), the number of parameters in the global model, the standard error
of the estimate (σ), the Akaike Information Criterion (corrected version) and the
coefficient of determination. In the second panel the matrix contains one line of
information for each variable in the model. The columns are the parameter
name, the parameter estimate, its standard error, and the t-statistic for the
estimate.
From this point, the output listing contains the results of the GWR. The first
section is an optional calibration report which lists the calculated value of the
criterion statistic at various bandwidths, as shown below. The utility of printing
this section is to observe the speed of convergence and also to plot the results
to see the shape of the convergence function. If the calibration report is not
requested, the program will print only the optimal value of the bandwidth.
Bandwidth AICc
56.043532255000 952.763365832809
84.500000000000 894.827422579517
112.956467745000 872.102336481384
130.543532046749 862.364688964195
141.412935569545 859.863227740004
148.130596397659 857.532739228028
152.282339122725 856.699997311380
154.848257244551 855.820209809022
** Convergence after 8 function calls
** Convergence: Local Sample Size= 155
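A bandwidth search of this kind can be driven by a simple interval-shrinking routine. A golden-section sketch on a toy criterion function (in the real calibration, f would refit the model at each trial bandwidth and return the CV score or AICc):

```python
def golden_section(f, lo, hi, tol=1e-6):
    """Minimise a single-peaked function f on [lo, hi], as a
    bandwidth-selection routine would minimise CV or AICc.
    Deliberately simple: f is re-evaluated each pass rather than cached."""
    invphi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Toy criterion with its minimum at bandwidth 155:
best = golden_section(lambda h: (h - 155.0) ** 2, 56.0, 159.0)
```

Plotting the trial bandwidths against the criterion, as the listing above allows, shows the shape of this convergence function.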
The next section of the output presents diagnostics for the GWR estimation.
There are two panels in this section. The first panel provides some general
information on the model: it includes (a) a count of the number of data points or
observations (b) the number of predictor variables (this is the number of
columns in the design matrix) (c) the bandwidth for the type of kernel specified
(here it is the number of nearest neighbours to be included in the bisquare
kernel) and (d) the number of regression points. The second panel contains
similar information to the corresponding panel for the global model. This
includes (a) the residual sum of squares (b) the effective number of parameters,
(c) the standard error of the estimate, (d) the Akaike Information Criterion
(corrected) and (e) the coefficient of determination. The latter is constructed
from a comparison of the predicted and the observed values at each
regression point. The coefficient has increased from
0.646 to 0.706, although an increase is to be expected given the difference in
degrees of freedom. However, the reduction in the AIC from the global model
suggests that the local model is better even accounting for differences in
degrees of freedom.
**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 159
Number of independent variables... 7
(Intercept is variable 1)
Number of nearest neighbours...... 155
Number of locations to fit model.. 159
Diagnostic information...
Residual sum of squares......... 1506.219121
Effective number of parameters.. 12.814342
Sigma........................... 3.209901
Akaike Information Criterion.... 839.193981
Coefficient of Determination.... 0.706280
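Both AIC values in the listings can be reproduced from the printed quantities. A sketch using the corrected AIC formula for GWR, in which tr(S), the effective number of parameters, plays the role of the parameter count and sigma is the maximum-likelihood estimate:

```python
import math

def gwr_aicc(rss, n, trace_s):
    """AICc = 2n*ln(sigma) + n*ln(2*pi) + n*(n + tr(S))/(n - 2 - tr(S)),
    with sigma = sqrt(rss / n), the ML estimate of the error s.d."""
    sigma = math.sqrt(rss / n)
    return (2.0 * n * math.log(sigma) + n * math.log(2.0 * math.pi)
            + n * (n + trace_s) / (n - 2.0 - trace_s))

# Figures from the two diagnostic panels above:
aic_global = gwr_aicc(1816.210715, 159, 7.0)        # about 855.44
aic_gwr = gwr_aicc(1506.219121, 159, 12.814342)     # about 839.19
```

For the global model tr(S) is just the ordinary parameter count, which is why the same formula serves both panels.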
Casewise diagnostics can be also requested (as shown below for the first 10
observations in the Georgia data set). These include:
1. the observation sequence number
2. the observed data
3. the predicted data
4. the residual
5. the standardised residual
6. the local pseudo r-square
7. the influence and
8. Cook’s D.
**********************************************************
* CASEWISE DIAGNOSTICS *
**********************************************************
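Items 4, 5, 7 and 8 in the list are linked: the standardised residual scales the raw residual by its estimated standard deviation, and Cook's D combines the standardised residual with the influence. A sketch in the usual regression-diagnostic forms (for GWR the parameter count is replaced by the effective number of parameters):

```python
import math

def casewise(y_obs, y_pred, hat, sigma, n_params):
    """Residual, standardised residual, influence (the hat value) and
    Cook's D for one observation; sigma is the standard error of the
    estimate from the diagnostic panel."""
    resid = y_obs - y_pred
    std_res = resid / (sigma * math.sqrt(1.0 - hat))
    cooks_d = (std_res ** 2 / n_params) * (hat / (1.0 - hat))
    return resid, std_res, hat, cooks_d
```

Large Cook's D values flag observations that pull strongly on the local fit and are worth mapping alongside the parameter estimates.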
Another optional set of information that can be printed to the screen concerns
the predicted values (as shown below for the first 10 observations in the Georgia
data set). If this option is selected, the following data are printed to the screen:
This set of output is not available when the regression points are different from
the sample points.
Next in the output listing is a panel of results of an ANOVA in which the global
model is compared with the GWR model. The ANOVA tests the null hypothesis
that the GWR model represents no improvement over a global model. The results
are shown below where it can be seen that the F test suggests that the GWR
model is a significant improvement on the global model for the Georgia data.
**********************************************************
* ANOVA *
**********************************************************
Source SS DF MS F
OLS Residuals 1816.2 7.00
GWR Improvement 310.0 5.81 53.3150
GWR Residuals 1506.2 146.19 10.3035 5.1745
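The table can be reassembled from figures printed earlier in the listing; only the residual sums of squares, the parameter counts and n are needed:

```python
# Figures from the global and GWR diagnostic panels above:
rss_ols, params_ols = 1816.210715, 7.0
rss_gwr, eff_params, n = 1506.219121, 12.814342, 159

ss_improve = rss_ols - rss_gwr          # GWR improvement sum of squares
df_improve = eff_params - params_ols    # extra (effective) parameters
df_resid = n - eff_params               # GWR residual degrees of freedom

ms_improve = ss_improve / df_improve    # about 53.3
ms_resid = rss_gwr / df_resid           # about 10.30
f_stat = ms_improve / ms_resid          # about 5.17, as in the table
```

Note that the degrees of freedom are fractional because they are based on the effective number of parameters rather than a whole-number count.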
The main output from GWR is a set of local parameter estimates for each
relationship. Because of the volume of output these local parameter estimates
and their local standard errors generate, they are not printed in the listing file
but are automatically saved to the output file. However, as a convenient
indication of the extent of the variability in the local parameter estimates, a
5-number summary of the local parameter estimates is printed. For the Georgia
data, this is shown below. The 5-number summary of a distribution presents
the median, upper and lower quartiles, and the minimum and maximum values
of the data. This is helpful to get a ‘feel’ for the degree of spatial non-
stationarity in a relationship by comparing the range of the local parameter
estimates with a confidence interval around the global estimate of the equivalent
parameter.
Recall that 50% of the local parameter values will be between the upper and
lower quartiles and that approximately 68% of values in a normal distribution
will be within ±1 standard deviation of the mean. This gives us a reasonable,
although very informal, means of comparison. We can compare the range of
values of the local estimates between the lower and upper quartiles with the
range of values within ±1 standard deviation of the respective global estimate
(which is simply 2 x S.E. of each global estimate). Given that 68% of the values
would be expected to lie within this latter interval, compared to 50% in the
inter-quartile range, if the inter-quartile range of the local estimates is
greater than 2 standard errors of the global estimate, this suggests the
relationship might be non-stationary.
**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
Intrcept 12.620986 13.754251 15.823232 16.312238 16.489399
TotPop90 0.000014 0.000018 0.000022 0.000025 0.000028
PctRural -0.060218 -0.051780 -0.039342 -0.031651 -0.025801
PctEld -0.255508 -0.203092 -0.164197 -0.129393 -0.058400
PctFB 0.504876 0.825190 1.432738 2.003490 2.417666
PctPov -0.204510 -0.164793 -0.110038 -0.056264 -0.004242
PctBlack -0.036187 -0.013582 0.006294 0.031046 0.076566
As an example, consider the parameter estimates for the two variables PctEld
(percentage elderly) and PctFB (percentage foreign born) in the Georgia study.
For PctEld the interquartile range of the local estimates is much less than 2 x
S.E. of the global estimate suggesting a stationary relationship.
For PctFB the interquartile range of the local estimates is much greater than 2 x
S.E. of the global estimate suggesting a non-stationary relationship.
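These two comparisons can be checked with the numbers printed in the 5-number summary and the global regression table:

```python
def looks_nonstationary(lwr_q, upr_q, global_se):
    """Informal test: is the interquartile range of the local
    estimates wider than a 2-standard-error band around the
    global estimate?"""
    return (upr_q - lwr_q) > 2.0 * global_se

# Quartiles from the 5-number summary; S.E.s from the global table:
pct_eld = looks_nonstationary(-0.203092, -0.129393, 0.121460075458)
pct_fb = looks_nonstationary(0.825190, 2.003490, 0.309690422174)
# pct_eld is False (stationary); pct_fb is True (non-stationary)
```

For PctEld the interquartile range is about 0.07 against a band of about 0.24; for PctFB it is about 1.18 against about 0.62, hence the opposite conclusions.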
Finally, we can examine the significance of the spatial variability in the local
parameter estimates more formally by conducting a Monte Carlo test. The
results of a Monte Carlo test on the local estimates indicate that there is
significant spatial variation in the local parameter estimates for the variables
PctFB and PctBlack. The spatial variation in the remaining variables is not
significant and in each case there is a reasonably high probability that the
variation occurred by chance. This is useful because, when mapping the local
estimates, we can concentrate on the two variables, PctFB and PctBlack, for
which the local estimates exhibit significant spatial non-stationarity. It is
interesting to note that these results reinforce the conclusions
reached above with the informal examination of local parameter variation for the
variables PctEld and PctFB.
*************************************************
* *
* Test for spatial variability of parameters *
* *
*************************************************
Parameter P-value
---------- ------------------
Intercept 0.22000
TotPop90 0.09000
PctRural 0.17000
PctEld 0.68000
PctFB 0.00000
PctPov 0.50000
PctBlack 0.00000
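The idea behind the Monte Carlo test is to shuffle the locations among the observations, recompute the local estimates, and ask how often the shuffled data produce as much variability in a parameter as the observed arrangement does. A self-contained sketch on a simple one-predictor local regression (illustrative only, not the GWR3 implementation):

```python
import math
import random

def local_slopes(xs, ys, coords, n_neighbours):
    """Adaptive bi-square GWR slopes for a one-predictor model,
    one local estimate per data point (a deliberately small
    stand-in for the full GWR calibration)."""
    slopes = []
    for cx, cy in coords:
        d = [math.hypot(px - cx, py - cy) for px, py in coords]
        b = sorted(d)[n_neighbours - 1] or 1e-12
        w = [(1 - (di / b) ** 2) ** 2 if di < b else 0.0 for di in d]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
        my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
        slopes.append(sxy / sxx)
    return slopes

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def monte_carlo_p(xs, ys, coords, n_neighbours, n_perm=99, seed=1):
    """Shuffle locations among observations; the p-value is the share
    of arrangements (observed one included) whose local-slope variance
    is at least as large as the observed variance."""
    rng = random.Random(seed)
    observed = variance(local_slopes(xs, ys, coords, n_neighbours))
    count = 1
    for _ in range(n_perm):
        perm = coords[:]
        rng.shuffle(perm)
        if variance(local_slopes(xs, ys, perm, n_neighbours)) >= observed:
            count += 1
    return count / (n_perm + 1)
```

A small p-value means the observed spatial patterning of the local estimates would rarely arise if the attributes were randomly assigned to locations.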
3
Lab 0: Visualising the Output from GWR with ArcMap
What you will learn
1 Importing an interchange file into ArcGIS
2 Options for visualising a point coverage
3 Adding a shapefile into an ArcGIS project
4 How to carry out a spatial join
5 Symbolising a polygon shapefile
6 [Optional: Assigning map projection information]
The main output from GWR is a set of localised parameter estimates and
associated diagnostics. Unlike the single global values traditionally obtained in
modelling, these local values lend themselves to being mapped. Indeed, in large
data sets, mapping, or some other form of visualisation, is the only way to make
sense of the large volume of output that will be generated. We now describe
ways of visualising the output from GWR. Although we concentrate only on
displays of the local parameter estimates, in many instances it might be
instructive to plot other local statistics such as the influence and Cook’s D
statistics. Similarly, it might be useful to plot the local r-square statistic or the
local standard deviation. No matter which local statistic is mapped, however,
there is a choice of map types that can be employed. We now describe some of
these briefly after first discussing mapping the results in a commonly used, PC-
based, Geographic Information System (GIS), ArcMap.
First, start ArcToolBox and in the Conversion Tools kit, select Import to
Coverage. Then select ArcView Import from Interchange file. This brings up
another dialog box which must
be completed. The Input file is
the ArcInfo Export File (also
known as an Interchange file).
Normally the file and path you
specify are those which have already been specified in the GWR program. The
output dataset (or coverage) is probably best located in the same folder.
However, in this lab the input file should be Georgia.e00 which you will find in
the SampleData\Georgia folder; the output file should be located in your Work
folder and named Georgia. When you have specified these, you click on [OK].
Wait a few moments until the software has finished the conversion – an
hourglass will appear while conversion is taking place. Close the ArcToolBox
application.
Navigate to your Work folder and click on the name of the coverage you have
just created. In the case of this
example, the coverage name is Georgia.
You will get an error message which
you can ignore for the time being.
The points which you can see represent the locations of the regression points –
in this case, they are the centroids of the counties of Georgia. To visualise the
spatial variation in the Intercept term (this is called PARM_1 in the coverage):
1. Right click on the georgia point entry in the Table of Contents
2. Select Properties from the list (it’s at the bottom)
3. Click on the Symbology tab
4. Select Quantities/Graduated Symbols from the Show: box
5. Select PARM_1 from the Fields/Value dropdown list
You can use the Identify tool (6th down in the left hand column of tools in the
Toolbar) to click on one of the circles to bring up the values of all the attributes
that you see.
the data has. Here’s a typical entry on the right. Unfortunately, the AREAKEY
item is not present in the Georgia point coverage. We need to use a “spatial join”
to match the attributes of the points with the attributes of the polygons.
Click on [OK]
The join then takes place, and the shapefile (files of type .shp are called
shapefiles) is added to the table of contents.
2 Click on the Symbology tab
3 Select Quantities/Graduated Colors
4 Select PARM_1 from Field/Value List
5 Click on OK
Checklist
It might be worthwhile looking back over the decisions we have had to make in
carrying this out.
1. Check whether any of the data layers you need has projection
information. If so, make a note of it and see the notes below on
assigning a projection.
2. Convert the parameter estimate interchange file to a coverage using
ArcToolBox
3. In ArcMap add the parameter estimate coverage layer
4. Add in any other layers you might need
5. Carry out any spatial joins you need to do
6. Visualize the parameter estimate variation, either as point data with
graduated symbols, or as polygon data with graduated colors.
When mapping the results from your own data sets, depending on the source of
your boundary files, you may find ArcMap automatically assumes the data to be
projected. Checking its properties, for example, you may find something like…
Coordinate System:
GCS_Assumed_Geographic_1
Datum: D_North_American_1927
Prime Meridian: 0
This may cause a problem in that the output from the GWR program will not
have such a projection assigned and the two data sets will therefore not be
compatible. If such a problem occurs, you will need to assign the same
projection to your GWR output coverage as ArcMap has assigned to your
boundary data. This can be done as follows:
You will need to close the ArcMap application and assign a projection to the
parameter estimate coverage. Suppose the projection you wish to assign is that
of geographic NAD 1927. The projection conversion is carried out in
ArcToolBox. You may need to do this every time you Import an interchange file.
You will have noticed that the Wizard also allows you to copy the information
about the projection from another coverage, so if you create a ‘master’ coverage
and assign the projection information, then you can use this as the source for a
copy.
4
Lab 1: GWR with Educational Attainment Data in Georgia
4.1 Introduction
The context of this particular modelling exercise has been outlined earlier. You
now have a chance to use GWR and some associated software to explore the
data and to model the relationships yourself.
We’ll use the Georgia data on educational attainment by county for our first
foray into GWR. The idea is to predict the level of education attainment from
some social attributes of the counties in the State of Georgia and then to map
the variation in the local parameter estimates and some diagnostics.
1. Prepare the data – may involve Excel, SPSS, SAS, or a GIS program.
2. Model relationships in GWR: examine printed diagnostics
3. Save the parameter estimates in a suitable format
4. Import the parameter estimates into a GIS program
5. Display the parameter variation – further analysis
6. Display the diagnostic variation – further analysis
It should be stressed that these are not the only routes. However these
workshops are based loosely around this approach.
4. Change the file type to .csv then in the SampleData\Georgia folder find
gdata_utm.csv, select it and click on Open
5. In the Analysis Point Selection form click on Yes
6. Change the file type to ArcInfo Export File then navigate to your Work
folder and enter GeorgiaOut.e00 in the Output File form.
7. After the Data Preview and file confirmation, the model editor will appear.
4.5 Saving and Running the Model
The control file which is created by the Model Editor has to be saved to the Work
folder before you can run it.
1. Click on the Save Model option at the bottom of the Model Editor
2. Enter Georgia.gwr as the name of the control file in your Work folder
and click on Save
3. Click on Run Model
4. In the Run the Model form enter Georgia.txt as the Model Listing
File name (again in your Work folder)
5. Click on the Run button
A DOS window will appear – it should take less than a minute for the program to
run with this dataset. The window title bar will tell you when the program has
finished and the command Exit appears in this window. The DOS window will
then disappear. WAIT until this has happened and then:
1. Click on Yes on the Run Completed form to view the Listing file
2. Click on End on the Run the Model form
Some initial values are reported: (you will need to scroll down to see these)
5. Next check the bandwidth selection. The current values of the bandwidth
and the associated AIC are printed on the output (you can cut these out
and put them in Excel if you wish to create a plot of the minimisation
function). These functions can be quite messy at times.
The program has converged at 155 nearest neighbours. As there are only 159
counties, the GWR results may be fairly close to the global results – but
remember that in the GWR model the data are weighted by geographic location.
6. Have a look at the global model parameters and diagnostic statistics. The
AIC for the global model is 855.44 and the global coefficient of
determination is 0.65. It looks as if the global model is a reasonable one,
although 35% of the variation in our dependent variable is from sources
other than the ones in our model.
8. Check the GWR parameter estimates and diagnostics. The AIC for this
model is 840.07. This is less than that for the global model and this
suggests that the GWR model is “better” at modelling the data. The
coefficient of determination is a little higher at 0.72.
9. You have requested the complete listing of the pointwise diagnostics and
predictions. We’ll skip these and move down to the ANOVA. The
computed value of 5.01 is in excess of the critical value for F with 7.0 and
146.2 degrees of freedom so we reject the null hypothesis that the GWR
represents no improvement over the global model. This conclusion is in
line with the AIC results above.
12 If you want to close GWR at this point you can, but you need not do
so.
The above will create an Arc/INFO point coverage (called gparms in this case).
Start the ArcMap program and click on the Add Data icon.
Navigate to the folder in which you saved your parameter coverage, highlight
the name of the file and click on Add.
The name of the theme (gparms point in this case) appears in the Table of
Contents. Right-click its name and select Open Attribute Table from the list of
options. The theme table contains the values of:
Footnote 2: We will eventually support shapefiles but this will be later rather than sooner.
Footnote 3: We'll put the location on an overhead.
the residual RESID
the standardised residual STDRES
the trace of the hat matrix HAT
Cook’s D COOKSD
There are 7 sets of data for the PARM, SVAL and TVAL items numbered thus
1. Intercept
2. Totpop90
3. PctRural
4. PctEld
5. PctFB
6. PctPov
7. PctBlack
PARM_1 contains the values of the Intercept term and SVAL_1 contains the
values of the corresponding standard errors.
To show the variation in the intercept term with proportional symbols located at
the centroids of the regression points:
32
2. Highlight g_utm.shp in the SampleData\Georgia folder and click on Add:
3. Right click on G_utm to bring up Properties/Symbology, click in the
middle of the Symbol (which will be colored as the polygons on the map)
4. In the Symbol selector change Options/Fill Color to No Color
5. Click on the various OKs to return you to your map
1. Right click on g_utm in the Table of Contents and select Joins and
Relates/Joins…
2. In the first box under the question “What do you want to do to this
layer?” Select “Join data from another layer based on spatial
location”
3. Choice 1: the layer you wish to join will be gparms point
4. Choice 2: you are joining Points to Polygon. Check the second
option here “Each polygon will be given all the attributes
of the points…”
5. Choice 3: name the output layer gparmsj.shp in your Work folder
6. Click OK
7. gparmsj is then added as a layer to the Table of Contents. Use
Properties/Symbology to assign suitable shading to display the
parameter estimates.
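The spatial join in the steps above gives each polygon the attributes of the points falling inside it. A minimal sketch of the underlying point-in-polygon test (ray casting), using hypothetical coordinates and attributes rather than the actual gparms data:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (a list of (x, y) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross the edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join(polygons, points):
    """Give each named polygon the attribute dicts of the points it contains."""
    return {name: [attrs for (px, py, attrs) in points
                   if point_in_polygon(px, py, poly)]
            for name, poly in polygons.items()}

# Hypothetical demo: one square 'county' and two regression points
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
points = [(0.5, 0.5, {"PARM_1": 4492.8}), (2.0, 2.0, {"PARM_1": -6189.7})]
print(spatial_join({"county": square}, points))
```

ArcMap's "Each polygon will be given all the attributes of the points" option performs exactly this containment test, at scale and with real geometry handling.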
4.8 Finishing
We’ve now completed our lightning tour of GWR. The next labs explore in some
greater detail the operation of GWR and its relationship with GIS as well as
reinforcing what has already been learned.
5
Lab 2: GWR and House Price Determinants
5.1 Introduction
This workshop introduces you to geographically weighted model selection by
exploring price variation in the UK housing market using a set of explanatory
variables.
Variable  Description
Easting   x-coordinate of the property
Northing  y-coordinate of the property
Purprice  Purchase price in £ sterling
BldIntWr  Built between 1914 and 1939
BldPostW  Built between 1939 and 1959
Bld60s    Built between 1960 and 1970
Bld70s    Built between 1970 and 1979
Bld80s    Built between 1980 and 1990
TypDetch  Detached building
TypSemiD  Semi-detached building
TypFlat   Flat/apartment
FlrArea   Floor area in m2
There are 5 binary variables indicating the age of the property. If a property is
recorded with 0 in all 5 fields, it is by default one that was built pre-1914.
Footnote 4: We are grateful to the Nationwide Anglia Building Society for making their data available for academic use.
TypDetch is a binary variable indicating a detached property (i.e. a stand-alone
property with no shared walls with neighboring properties)
TypSemiD is a binary variable indicating a semi-detached property (i.e. a
property that shares one common wall with an adjoining property – a ‘duplex’ in
US terminology)
TypFlat is a binary variable indicating a flat (apartment in US terminology). Flats
include both purpose built flats as well as those converted from older single
occupancy buildings.
If a property has 0 values recorded for all three of the above property types, it is
by default a terraced property (‘row house’ in US parlance) – i.e. a property
joined to its neighbours on both sides.
However the type of property may also influence its price – in a small island such
as Britain the ubiquity of semi-detached and terraced housing means that
homeowners may be prepared to pay extra for the isolation from the neighbours
that a detached property offers. The relationship between price and floor area
may well be different therefore across different types of property. We can
examine this through the use of dummy variables to represent property type.
We can also explore the relationship between house price and age. People may
well be interested in paying more for a new property – there are fewer
immediate maintenance and decoration costs. However older properties may
also be seen to be more desirable by some who might for example rather live in
a medieval cottage than a more recent building. The coefficients on the age
dummies represent the added value of owning a property built during the age
range to which the dummy refers.
People often refer to some areas as being “more expensive” than others. Some
suburbs are seen as more desirable and properties in the more sought after
suburbs may have higher prices than ones with the same attributes in less
desirable areas. To explore this with a global model all we have are the
residuals. With GWR, rather than confine the geography of the phenomenon to
the error term, we can model it directly. Also rather than impose some pre-
determined geography on the model (sometimes dummies for regions are
included) we let the model tell us where the desirable areas are.
1. Start ArcMap
2. Click the Add Data icon, and add regions.shp, HousData.txt and
Places.txt from the SampleData/Housing folder
3. For a brief overview of Britain’s regions, right click on the layer name
(regions) and select Label features. You might wish to use the Zoom
tool from the Toolbar (it’s a magnifying glass with a + sign on it) to zoom
a little to exclude Scotland since we will not be concerned with the
housing market in Scotland.
4. The region names probably won’t mean too much, so we’ll add the names
of some well known towns and cities. Right click on Places.txt in the
Table of Contents and select Display X-Y Data…; the X Field and Y Field
choices should have been chosen as X and Y. Click OK.
5. Remove the region names (right click on regions in the Table of
Contents and uncheck Label Features)
6. Right click on Places.txt Events in the Table of Contents, and select
Label Features; right click on this layer again and select Zoom to
Layer. Spend a few seconds examining the names. You can use the
Identify tool from the ToolBar to check the grid references – they are in
metres (which city is further north, Carlisle or Newcastle?).
7. Uncheck the Places.txt Events layer in the Table of Contents
8. Right click on HousData.txt in the Table of Contents and select Display
X-Y Data. The X Field should be Easting and the Y Field should be
Northing, so change these if ArcMap has selected different fields. Click
on OK.
9. You can examine the variation in housing cost by right clicking on
HousData.txt Events in the Table of Contents, selecting Properties
and then clicking on the Symbology tab.
10. Click on Quantities/Graduated Symbols
11. The Value Field should be PurPrice
12. Click on Classify, and change the classification method from Natural
Breaks (Jenks) to Quantile. Click on OK, and OK.
Housing in London and the South East is notably more expensive than elsewhere
in Britain; indeed, there is a marked divide between northern and southern
England in terms of average price. A regional dummy variable would be a fairly
crude method of dealing with this spatial variation.
1. Start GWR3
2. From GWR Wizard select Create a New Model, then [Go]
3. Select Gaussian as the Model Type then [Go]
4. Navigate to SampleData/Housing and select HousData.dat as the data file
(with a type of *.dat) and Yes when asked whether you wish to fit the
model at the data points.
5. Navigate to your Work folder and choose housing1.e00 as the parameter
output filename; click on Save
6. Check the Data Preview and File Confirmation choices to make sure that
you have the correct input and output data.
7. Choose the following options in the model editor:
(1) Title Price ~ Area model
(2) dependent variable: PurPrice
(3) independent variable: FlrArea
(4) X-variable: Easting
(5) Y-variable: Northing
(6) Kernel type: Adaptive
(7) All three listing options from Model Options
(8) Bandwidth selection: AICc
(9) Output format: Arc/INFO Export
(10) Save your model as housing1.gwr (in Work)
(11) Click Run Model
(12) Save the listing file as housing1.txt
(13) Click Run
[Figure: plot of the AICc against bandwidth during the minimisation]
**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 454113189550.444030
Effective number of parameters.. 2.000000
Sigma........................... 29637.173765
Akaike Information Criterion.... 12164.963478
Coefficient of Determination.... 0.416307
The AIC is 12,164 and the model accounts for about 42% of the variation in the
Price variable. Nationally the price per square metre of floorspace is just over
£650. The parameter estimate for the floorspace variable is significantly
different from zero but the intercept term is not. It would seem that non-
existent houses are free for the taking!
**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 2
(Intercept is variable 1)
Number of nearest neighbours...... 29
Number of locations to fit model.. 519
Diagnostic information...
Residual sum of squares......... 164532408987.248870
Effective number of parameters.. 92.663317
Sigma........................... 19644.879856
Akaike Information Criterion.... 11861.124348
Coefficient of Determination.... 0.788519
Notice that the local AIC is much lower and the local coefficient of determination
is much higher. We can be more confident that we have a ‘better’ model with
GWR. Note that because the effective number of parameters is different for the
global and local models, we can't compare the r2 values directly.
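The AICc figures in the listings can be reproduced from the residual sum of squares and the effective number of parameters, using the corrected AIC formula from Fotheringham, Brunsdon and Charlton (2002): AICc = 2n ln(sigma) + n ln(2 pi) + n(n + tr(S))/(n - 2 - tr(S)), where sigma is the maximum-likelihood estimate sqrt(RSS/n). A sketch:

```python
import math

def gwr_aicc(rss, n, trace_s):
    """Corrected AIC for a model with residual sum of squares rss,
    n observations and effective number of parameters trace_s = tr(S)."""
    sigma_ml = math.sqrt(rss / n)   # maximum-likelihood sigma estimate
    return (2 * n * math.log(sigma_ml)
            + n * math.log(2 * math.pi)
            + n * (n + trace_s) / (n - 2 - trace_s))

# Figures from the Model 1 listings above
print(gwr_aicc(454113189550.444030, 519, 2.0))        # ~12164.96 (global listing)
print(gwr_aicc(164532408987.248870, 519, 92.663317))  # ~11861.12 (GWR listing)
```

The formula shows why GWR is penalised for complexity: the last term grows rapidly as tr(S) approaches n, so a lower RSS only wins if the effective number of parameters stays modest.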
**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
Intrcept -64327.508064 -6189.706002 4492.787561 15304.148108 99446.460090
FlrArea -19.135883 447.434134 583.102579 730.316064 1540.075262
For 50% of the properties in the analysis the floorspace parameter varies from
£447.43 to £730.31. The average size of property is 101m2. This implies a
price variation from £45,190 to £73,761 depending on where the property was
located.
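The price implications follow directly from the quartiles; a quick check using the figures printed in the 5-number summary (the 2 x SE figure of 68.4 for the global estimate is the one the text quotes):

```python
# FlrArea local-estimate quartiles and the average property size, as quoted
lwr_q, upr_q = 447.434134, 730.316064
avg_size = 101.0   # square metres

print(lwr_q * avg_size)   # ~45190.85: price implied at the lower quartile
print(upr_q * avg_size)   # ~73761.92: price implied at the upper quartile

iqr = upr_q - lwr_q       # ~282.9, far larger than 2 x SE = 68.4
print(iqr)
```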
Notice that for the FlrArea parameter, the inter-quartile range of the local
estimates is 282.9 whereas 2 x SE of the global estimate is only 68.4, so there
appears to be a great deal of spatial variation in the local estimates. We can
explore this, and other features of the results, by mapping the output data as
follows:
1. Use ArcToolBox to convert the housing1.e00 file in your Work folder into
an Arc/INFO coverage. Call the resulting coverage housing1 and leave it
in the Work folder.
2. Add the coverage as a layer in the Table of Contents.
3. Change the fill on the regions layer to ‘No Fill’ and uncheck the other
layers except housing1.
4. Examine the standardised residuals (STDRES) – there appears to be very
little geographical pattern within them. The model has captured much of
the geographical variation through its weighting scheme although there
are a handful of outliers all west of London.
5. Now plot the PARM_2 coefficients. These are for the floorspace variable.
Under Classify change the Classification method to Standard Deviation.
6. You can choose a Color Ramp that looks pleasing to the eye – green to
red, or red to blue are good choices. If you right click on one of the
symbols in the Symbology dialog, you can Flip the color ramp round if
necessary (try it and see).
Some patterns immediately stand out. Some of the strongest relationships are
in the areas to the west of London as far as the Bristol Channel and,
interestingly, in Northern England. Some of this, of course, may be an artefact
of the particular sample chosen.
If you look back to the above table describing the data for this lab you will
observe that we have included dummy variables in the dataset for detached
properties, semi-detached properties and flats/apartments. Most of the other
types of property in the dataset are terraced (row houses). One of the more
famous terraces in Britain is the Royal Crescent in Bath, an elegant Georgian
terrace of very desirable property. At the other end of the market are the large
areas of low-cost terraced housing built by factory owners adjacent to their
factories in the mid to late 19th century. Such housing is usually cramped and
poorly constructed. We might therefore expect to see some spatial variation in
the GWR parameter for this type of property. This would appear in the intercept
term.
[Photos: Royal Crescent, Bath (Mary Ann Sullivan); 19thC workers' housing, Newcastle (BBC)]
1. Close all the windows in the GWR main window and select Tools/Analysis
Wizard/Open an existing model in the GWR Model Editor.
2. Navigate to your Work folder and select housing1.gwr and click OK
3. Add the variables TypDetch, TypSemiD and TypFlat into the list of
independent variables
4. Select Save the Model and overwrite the existing control file. The
existing interchange file will be reused.
5. Select Run Model, name the Listing File housing2.txt and proceed as
before
**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 391372117830.287660
Effective number of parameters.. 5.000000
Sigma........................... 27593.918782
Akaike Information Criterion.... 12093.912039
Coefficient of Determination.... 0.496951
The AIC is lower than the AIC for the global ‘floorspace-only’ model so there
appears to have been some benefit in adding the extra variables. All the
parameter estimates would appear to be significant. Whilst the coefficient on
floorspace has declined, there are adjustments for property type to be made. We
add 14,739 to the 569.64*floorarea to obtain the regression line for terraced
housing, add 14,739+15,135 to obtain the regression line for detached
housing, add 14,739-10,496 for semi-detached housing, and 14,739-14,828
for flats/apartments. Thus compared with terraced housing, there is a premium
on average of just over £15,000 on a detached house.
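The intercept adjustments described above can be collected into a small function (coefficients as quoted in the text for the global model, rounded as printed):

```python
# Global Model 2 coefficients as quoted in the text
INTERCEPT = 14739.0
B_FLRAREA = 569.64
TYPE_ADJUST = {
    "terraced": 0.0,           # baseline: all three dummies are zero
    "detached": 15135.0,       # TypDetch coefficient
    "semidetached": -10496.0,  # TypSemiD coefficient
    "flat": -14828.0,          # TypFlat coefficient
}

def predicted_price(flr_area, prop_type="terraced"):
    """Global-model predicted price for a given property type and floor area."""
    return INTERCEPT + TYPE_ADJUST[prop_type] + B_FLRAREA * flr_area

# The detached premium over a terraced property of the same size:
print(predicted_price(101, "detached") - predicted_price(101, "terraced"))  # 15135.0
```

Because the dummies shift only the intercept, the four regression lines are parallel; allowing the slope on floorspace to differ by type would require interaction terms instead.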
**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 5
(Intercept is variable 1)
Number of nearest neighbours...... 97
Number of locations to fit model.. 519
Diagnostic information...
Residual sum of squares......... 158230094116.817840
Effective number of parameters.. 71.094406
Sigma........................... 18795.388194
Akaike Information Criterion.... 11779.561886
Coefficient of Determination.... 0.796620
With a lower AIC we appear to be justified in using the GWR model. Perhaps
with the image of the terraced properties above in mind we might map the
intercept parameter.
(a) Intercept Parm_1
(b) TypDetch Parm_2
(c) TypSemiD Parm_3
(d) TypFlat Parm_4
(e) FlrArea Parm_5
7. You can alter the classification in the Legend Editor from the default
(“natural breaks”) to some other, such as quantile. Choosing 4 quantile
classes then gives us class breaks based around the 5-number summary
with the breaks being at lowest value:lower quartile:median:upper
quartile:largest value. These values are reported in the legend.
3a Remove the property type variables so that you only have floorspace
included
3b Add the property age variables – the excluded category is pre-1914.
5 Name your listing file output housing3.txt
Having run the model, examine the global and local diagnostics.
**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 441439313504.563230
Effective number of parameters.. 7.000000
Sigma........................... 29363.006644
Akaike Information Criterion.... 12160.508453
Coefficient of Determination.... 0.432598
The AIC is 12,160. It might be reasonable to suggest that adding this group of
variables to the global model results in almost no improvement to the model.
Indeed most of the parameters are not significant. For these variables these are
interpreted relative to the excluded category of pre-1914; apart from BldIntWr
and recently built properties there is no premium on age. It also suggests that
there is a non-linear price response to age: putting the age as a continuous
variable into the model might not be altogether wise.
**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 7
(Intercept is variable 1)
Number of nearest neighbours...... 111
Number of locations to fit model.. 519
Diagnostic information...
Residual sum of squares......... 201523445943.890960
Effective number of parameters.. 88.941047
Sigma........................... 21647.053653
Akaike Information Criterion.... 11955.358137
Coefficient of Determination.... 0.740973
The AIC is larger than the AICs for either the floorspace only or floorspace+type
models. As such, incorporating all the age groups does not yield a better
model.
Following steps 1-5 from Model 2, add in the rest of the variables and re-run
the model, calling your listing file housing4.txt.
**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 387658088547.870180
Effective number of parameters.. 10.000000
Sigma........................... 27597.232591
Akaike Information Criterion.... 12099.319981
Coefficient of Determination.... 0.501725
Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept 10841.440705901732 4658.873780335065 2.327051877975
BldIntWr 7377.915831973120 3888.083657451756 1.897571325302
BldPostW 4448.988850693599 4615.216412426111 0.963982701302
Bld60s 1948.867733338220 4366.949551122227 0.446276664734
Bld70s 2503.678352971071 4602.211643175193 0.544016361237
Bld80s 6239.912888906404 3944.833890925199 1.581793546677
TypDetch 12702.104677614816 4962.522032414549 2.559606790543
TypSemiD -12716.365572053077 4310.968223855657 -2.949770212173
TypFlat -15038.310025890560 3911.276249199043 -3.844860076904
FlrArea 585.128599565044 37.757060098477 15.497197151184
The AIC is lower than that for the global floorspace-only model, but is higher
than the floorspace+type model's. This suggests that we gain little by adding
the extra variables, and that the AIC is penalising us for an unnecessarily
complex model.
**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 10
(Intercept is variable 1)
Number of nearest neighbours...... 100
Number of locations to fit model.. 519
Diagnostic information...
Residual sum of squares......... 129983243217.320330
Effective number of parameters.. 131.099269
Sigma........................... 18305.575423
Akaike Information Criterion.... 11865.000583
Coefficient of Determination.... 0.832927
The AIC is higher than both the floorspace only and the floorspace+type model’s
AIC. This suggests that there is little to be gained by adding the age variables
into the model either with or without the property type variables.
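Pulling together the AIC values from the four models' listings (rounded to two decimals) makes the comparison explicit; the lowest AIC identifies the preferred specification:

```python
# AIC values from the global and GWR listings for the four housing models
aic = {
    ("floorspace",          "global"): 12164.96,
    ("floorspace",          "GWR"):    11861.12,
    ("floorspace+type",     "global"): 12093.91,
    ("floorspace+type",     "GWR"):    11779.56,
    ("floorspace+age",      "global"): 12160.51,
    ("floorspace+age",      "GWR"):    11955.36,
    ("floorspace+type+age", "global"): 12099.32,
    ("floorspace+type+age", "GWR"):    11865.00,
}

best = min(aic, key=aic.get)
print(best, aic[best])   # ('floorspace+type', 'GWR') 11779.56
```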
4. From 3D Analyst, select Options
5. In General change the Working Directory to C:\Temp; set the Analysis
Mask to coast.shp
6. In Extent, set Analysis Extent to “Same as layer coast”
7. In Cell Size, choose “As specified below” and enter 5000 in the Cell
Size box; click OK
8. From 3D Analyst, select Interpolate to Raster/Inverse Distance
Weighted
9. Input points should be housing1 point
10. Z value field should be PARM_2
11. Search type radius should be Variable
12. Number of points should be 12
13. Output raster should be <temporary>
14. Click on OK – the surface appears on the map with suitable shading to
show variation in the intensity of the relationship
15. Check the Places.txt Events layer to bring up the place names
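The Inverse Distance Weighted interpolation in the steps above can be sketched as follows; this is an illustrative implementation (the variable search radius taken as the n nearest points), not the 3D Analyst algorithm itself:

```python
import numpy as np

def idw(xy_known, z_known, xy_query, n_neighbors=12, power=2.0):
    """Inverse-distance-weighted interpolation from the n nearest data points."""
    xy_known = np.asarray(xy_known, dtype=float)
    z_known = np.asarray(z_known, dtype=float)
    out = np.empty(len(xy_query))
    for i, q in enumerate(np.asarray(xy_query, dtype=float)):
        d = np.hypot(*(xy_known - q).T)     # distances to every data point
        idx = np.argsort(d)[:n_neighbors]   # the n nearest neighbours
        d_near = d[idx]
        if d_near[0] == 0.0:                # query sits exactly on a data point
            out[i] = z_known[idx[0]]
            continue
        w = 1.0 / d_near ** power           # inverse-distance weights
        out[i] = np.sum(w * z_known[idx]) / np.sum(w)
    return out

# Hypothetical example: two 'parameter estimates' on a line
surface = idw([(0.0, 0.0), (2.0, 0.0)], [0.0, 10.0],
              [(0.5, 0.0)], n_neighbors=2)
print(surface)   # pulled strongly towards the nearer, zero-valued point
```

In ArcGIS the queries would be the centres of the 5 km raster cells, and the result is written out as a surface rather than an array.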
The surface does reveal the patterns in a slightly different way. West of London,
there is a corridor of highly priced floorspace with peaks in Reading and Bristol –
this follows the route of the M4 motorway and the high speed rail link to Bristol
and Cardiff. Reading is both a commuter dormitory in its own right as well as
home to a number of companies in the computing sector. There is a small peak
between London and Brighton, again reflecting the importance of the rail link
which connects Brighton, Crawley, Gatwick Airport and London. The apparent
rise in floorspace prices in northern England is probably a reflection of basing
the analysis on a 0.7% sample.
6
Lab 3: GW Poisson Regression with Tokyo Mortality Data
6.1 Introduction
Poisson Regression is used when the response variable refers to counts of some
phenomenon and the covariates are either continuous or binary measurements.
Typical examples of count data might include numbers of people with a
particular disease, numbers of crimes in an area or numbers of derelict houses
in a neighbourhood. In GWR such data are modelled using a Poisson regression
model. Each observation has an integer-valued response variable, a number of
explanatory variables, and two locational variables (X and Y coordinates). If you
wish to model counts which relate to some underlying areal population, for
example, the number of children aged 0-14 with leukaemia, you should use
what is referred to in the Poisson regression literature as an ‘offset variable’. In
this case, the offset variable would be the number of 0-14 children in each
spatial unit. There will be more detail about this below. You should not attempt
to model data which are continuous, such as the educational attainment data we
have used in a previous workshop, in a Poisson regression framework – use
Gaussian regression for this. The GW Poisson regression model is set up using
the GWR Model Editor as in previous examples.
The GW Poisson model can be written, without and with an offset, as

    y_i ~ Poisson[ exp( β0(u_i, v_i) + Σk βk(u_i, v_i) x_ik ) ]
    y_i ~ Poisson[ O_i exp( β0(u_i, v_i) + Σk βk(u_i, v_i) x_ik ) ]

where (u_i, v_i) denotes the location of observation i. The O_i is known as
the offset – in the second equation log O_i + β0 forms the constant term.
In Poisson GWR, as with Gaussian GWR, the resulting parameter estimates βi are
specific to each location i. Unlike Gaussian GWR, however, the model is fitted
using a technique known as iteratively reweighted least squares. This has the
following implications. First, the fitting technique is iterative and so a typical
GW Poisson regression will take approximately 5 times as long to run as an
equivalent Gaussian regression. Second, to compute the standard errors, the
observed counts are required, so parameter estimates may only be obtained at
the data points.
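A minimal sketch of iteratively reweighted least squares for a global Poisson regression with an offset, fitted to synthetic data; the GW version repeats this at each regression point with the geographic kernel weights folded into the IRLS weights:

```python
import numpy as np

def poisson_irls(X, y, log_offset=None, n_iter=25, tol=1e-8):
    """Fit log(mu) = log_offset + X @ beta by iteratively reweighted least squares."""
    n, p = X.shape
    off = np.zeros(n) if log_offset is None else log_offset
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = off + X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu - off   # working response with offset removed
        W = mu                          # IRLS weights for the Poisson family
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Synthetic check: recover known coefficients (hypothetical data, not Tokyo's)
rng = np.random.default_rng(42)
x = rng.uniform(0, 2, size=2000)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x))
print(poisson_irls(X, y.astype(float)))   # approximately [0.5, 0.3]
```

Each pass refits a weighted least-squares problem, which is why a GW Poisson run takes several times longer than a Gaussian one: the whole iteration is repeated at every regression point.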
The output file contains the parameter estimates from the Poisson model and
their standard errors as well as the exponentials of the parameter estimates and
the standard errors of these exponentiated values. Positive parameter values
when exponentiated are greater than unity, a parameter of zero when
exponentiated yields a value of unity, and negative parameter values when
exponentiated are less than unity. In all cases, the exponentiated values are
positive.
Mapping
Before beginning the analysis, you should examine the spatial distribution of the
variable.
1. Start ArcMap
2. From the SampleData/Tokyo folder add the layer TMABSU.shp – these are
the boundaries of the polygons which form municipalities in the Tokyo
Metropolitan Area.
Footnote 5: Premature mortality counts are defined here as the number of deaths occurring to people aged 25-65 in each area. We are grateful to Dr Tomoki Nakaya of the Department of Geography, Ritsumeikan University, Kita-Ku, Kyoto for this dataset.
3. Check the Attribute table – there is not much attribute data here – one of
the attributes is an ID field named ID_. Each ID begins with the string
SUGIURA
4. Add the table SOC90DAT.dbf and examine its contents – the area IDs are
in a field called ID
5. Right click on the TMABSU entry in the Table of Contents and select Joins
and Relates/Join.
6. Change the activity to Join attributes from a table
7. The field name in box 1 should be ID_
8. The table name in box 2 should be SOC90DAT.dbf
9. The field name in box 3 should be ID
10. Click on OK.
11. In Properties/Symbology select Categories/Unique Values, then
select SOC90DAT.PREFNAME; click on Add All Values then OK, and you’ll
then get a map showing the Prefectures in Tokyo.
12. Examine the distributions of the following variables SOC90DAT.POP65,
SOC90DAT.OWNH, SOC90DAT.UNEMP and SOC90DAT.OCC_TEC (which
are population over 65, homeowners, unemployed, and professional
occupations, respectively). The homeowners variable shows quite clearly
the transition between the urban centre (where very few people own their
own home) and the rural periphery (where home ownership is much
higher).
13. Add the SMR90.dbf table, and join it to TMABSU (remember to examine
the SMR90 table first to identify the ID field).
14. Examine the distribution of the SMR90.SMR variable – this will be the
distribution that we will model as a function of the explanatory variables
above.
1. Start GWR3
2. In the Wizard check Create a new model and click Go
3. … select Poisson for Model Type and click Go
4. In Open Data File, change the filter to Comma Separated Variable
(*.csv), and select the file Mortality.csv from the SampleData/Tokyo
folder
5. In Analysis Point Selection, click on Yes
6. The Output File should be ArcInfo Export File (*.e00); navigate to
your Work folder, and name the file tokyo.e00
7. Check that the correct variables are present in the Data Preview, and
check that the correct files are being used in Confirm – in particular,
check that the filenames for the Observed Data File and Parameter
Diagnostic File are different. The former should be Mortality.csv and
the latter should be tokyo.e00.
8. In the Model Editor enter a title, select Mort2564 as the Dependent
Variable and Professl, Elderly, OwnHome and Unemply as the predictor
variables; select X and Y as the Location Variables; select Exp_2564 as the
weight variable6; select an Adaptive kernel, with Cartesian coordinates,
and Crossvalidation as the calibration method. Check all three Model
Options (but don’t check the Monte Carlo test as it is not available yet for
this model type). Change the output format to ArcInfo Exp, and save the
model on your Work folder as tokyo.gwr.
9. Run the model and name the listing file tokyo. Click on Run and wait a
minute or two for the model to run.
***************************************************************
* *
* GEOGRAPHICALLY WEIGHTED POISSON REGRESSION *
* *
***************************************************************
Number of data cases read: 262
Sample data file read...
*Number of observations, nobs= 262
*Number of predictors, nvar= 4
Observation Easting extent: 131840.781
Observation Northing extent: 120125.898
Data has been read for 262 municipalities. The study region lies within an area
some 132km from West to East, and some 121km from South to North. There
will be 4 predictor variables in the model.
Footnote 6: In Poisson GWR, if you specify a weight variable, it will be used as an offset.
The next panel is from the calibration – we have chosen to use minimising the
crossvalidation score as the criterion. Again, you can graph this function if you
wish to see how the software has found the minimum.
*Finding bandwidth...
... using all regression points
This can take some time...
*Calibration will be based on 262 cases
*Adaptive kernel sample size limits: 65 262
*Crossvalidation begins...
Bandwidth CV Score
125.876348015000 77093.504704727718
163.500000000000 78852.510762618243
102.623652260339 76090.041761883127
88.252695924830 75486.337205487056
79.370956440679 75159.569289465624
73.881739549149 75717.804317078335
82.763479033300 75179.304206706714
77.274262166597 75473.565361751971
80.666784759217 75218.352828294883
** Convergence after 9 function calls
** Convergence: Local Sample Size= 79
[Figure: crossvalidation score plotted against bandwidth]
The optimal bandwidth is about 79 objects in the local sample. This is about
25% of the number of observations, so we will obtain a moderate degree of
smoothing with this. If you have Excel available, you can examine the shape of
the crossvalidation function. The number of function calls refers to the
number of times the different models have been fitted to obtain the optimal
bandwidth. Note the slight wobble near the minimum – this sometimes happens
with adaptive kernels – what the curve does suggest is that between 79 and 82
objects is a reasonable local sample size.
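The bandwidth search is a one-dimensional minimisation of the crossvalidation score. A golden-section sketch over a made-up CV-like curve (the actual GWR 3.0 search routine may differ in detail):

```python
import math

def golden_section(f, lo, hi, tol=1.0):
    """Minimise a unimodal f on [lo, hi]; returns (x_min, function calls)."""
    invphi = (math.sqrt(5) - 1) / 2      # 1/golden ratio, ~0.618
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = f(c), f(d)
    calls = 2
    while (b - a) > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
        calls += 1
    return (a + b) / 2, calls

# A hypothetical CV curve with its minimum at a local sample size of 79,
# searched over the adaptive-kernel limits 65..262 quoted in the listing
x_min, n_calls = golden_section(lambda h: (h - 79.0) ** 2 + 75000.0, 65, 262)
print(round(x_min), n_calls)
```

Each "function call" here corresponds to one full model fit at a candidate bandwidth, which is why the search dominates the run time.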
The next panel shows the results from the Global Model.
Parameter   Estimate   Std Err         T   Exp(Est)   Exp SE
Intercept      0.007     0.065     0.115      1.007    0.066
Professl      -2.288     0.162   -14.123      0.101    0.016
Elderly        2.199     0.198    11.093      9.019    1.788
OwnHome       -0.260     0.047    -5.519      0.771    0.036
Unemply        0.064     0.011     5.822      1.066    0.012
The AIC is about 399.3 (the deviance plus twice the number of parameters in
the model). Areas of lower premature mortality would appear to be those with
higher levels of professional employees and home owners, whereas an ageing
population and economic problems would appear to drive up mortality.
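The exponentiated estimates and their standard errors in the panel above can be reproduced with the delta-method approximation SE(exp(b)) = exp(b) x SE(b):

```python
import math

# Estimate and standard error from the global table above
params = {
    "Intercept": (0.007, 0.065),
    "Professl":  (-2.288, 0.162),
    "Elderly":   (2.199, 0.198),
    "OwnHome":   (-0.260, 0.047),
    "Unemply":   (0.064, 0.011),
}

for name, (est, se) in params.items():
    exp_est = math.exp(est)    # rate ratio
    exp_se = exp_est * se      # delta-method standard error
    print(f"{name:9s} {exp_est:7.3f} {exp_se:7.3f}")
```

The values match the table's exponentiated columns to rounding; for example, exp(2.199) is about 9.02 for Elderly, meaning a unit increase in that variable multiplies the expected mortality count roughly nine-fold.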
The next panel shows the output from the Local Model.
First of all, note that there has been a substantial decrease in the AIC – this
suggests that the local model provides a better fit to the data than the global
model despite the increase in the effective number of parameters (32.6 in the
local model compared to 5 in the global model).
It is worth checking also the 5 number summaries for the parameter estimates.
The Elderly parameter estimates are consistently positive. At least 75% of the
values of the Professional and Home owner parameters are negative, and 75% of
the Unemployed parameter estimates are positive. In general, this is
encouraging.
**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
Intrcept -1.179779 -0.033398 0.105704 0.254133 0.591215
Professl -4.348735 -2.712938 -2.450336 -1.633525 2.501697
Elderly 0.915348 1.529775 2.022919 2.544694 5.226847
OwnHome -0.678879 -0.397862 -0.289317 -0.193223 0.242000
Unemply -0.064240 0.015048 0.033627 0.080714 0.196946
Again, we can informally check on the spatial variation of the local estimates by
comparing the inter-quartile range with 2 x S.E. of the respective global
estimates. In doing this, we get…
Further diagnostic analysis should involve mapping the parameter estimates and
other diagnostic data.
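Using the quartiles from the 5-number summary and the global standard errors quoted earlier, the informal check looks like this; every inter-quartile range comfortably exceeds twice the global standard error, pointing to genuine spatial variation in the local estimates:

```python
# (lower quartile, upper quartile) of the local estimates, plus the global SE
checks = {
    "Intercept": (-0.033398, 0.254133, 0.065),
    "Professl":  (-2.712938, -1.633525, 0.162),
    "Elderly":   (1.529775, 2.544694, 0.198),
    "OwnHome":   (-0.397862, -0.193223, 0.047),
    "Unemply":   (0.015048, 0.080714, 0.011),
}

for name, (lwr, upr, se) in checks.items():
    iqr = upr - lwr
    verdict = "varies spatially" if iqr > 2 * se else "stable"
    print(f"{name:9s} IQR = {iqr:6.3f}  2 x SE = {2 * se:6.3f}  {verdict}")
```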
Footnote 7: UTM stands for Universal Transverse Mercator. There appears to be a local shift in the Y coordinates southwards.
4. Click on TMABSU, and then right-click and select Joins and
Relates/Join.
5. Select Join attributes from a table
6. Box 1: select TMABSU.ID_
7. Box 2: select GeogIndex.csv
8. Box 3: select ID and click on OK – this has linked the GeogIndex table and
the TMABSU table
9. Select Joins and Relates/Join again
10. Select Join attributes from a table
11. Box 1: select GWR_ID
12. Box 2: select Tokyo point
13. Box 3: select TOKYO-ID and click on OK – this links the GWR results as an
attribute of the TMABSU shapefile
Use ArcMap to examine the spatial variation in the parameters. With a smaller
bandwidth, some interesting local spatial variations are revealed. The negative
influence of the Professional parameter is most marked in Chiba-ken prefecture
and affluent suburbs of Tokyo-to. The influence of the Elderly is most noticeable
in southern Chiba-ken and the north western part of Saitama-ken. Owner
occupation is most influential in Kanagawa-ken and the central part of Tokyo-to,
and Unemployment is also most influential in Kanagawa-ken and parts of
Saitama-ken. Examine each parameter estimate in turn. The residuals appear
to be fairly patchy.
point.PARM_1 Intercept
point.PARM_2 Professional
point.PARM_3 Elderly
point.PARM_4 Own Home
point.PARM_5 Unemployed
If you are intending to map the results from another analysis, you can remove
the join to the Tokyo point coverage, and then join another coverage in its place.
6.6 Tasks
Which of the set of independent variables produces the greatest reduction in the
corrected AIC? Experiment with each singly, and in pairs.
For a given model (perhaps take a simple one with only Unemployed as the
explanatory variable) what is the effect of choosing different calibration
methods?
Experiment with different bandwidths. What is the effect on the AICc and the
effective number of parameters of reducing the bandwidth? Try 17000 as the
bandwidth for a Fixed kernel, and 40 as the Local Sample Size for an Adaptive
kernel.
7
Lab 4: Logistic GWR with Landslides in Clearwater, Idaho
7.1 Introduction
Logistic Regression (also known as Binary Logit Regression) is used when your
response variable is binary and the covariates are either continuous or binary
measurements. Typical examples of binary data might include yes/no,
alive/dead or above/below. In GWR such data are modelled using a logistic
regression model. You will find the term binary or dichotomous used as names
for the 0/1 data – note, however, that these data are not binomial, and the
model is not binomial. In GWR each observation has a 1/0 valued response
variable, a number of explanatory variables, and some locational variables. The
problem is set up using the GWR Model Editor as before.
Note that

\[ 1 - \hat{y}_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki})} \]

so that

\[ \frac{\hat{y}_i}{1 - \hat{y}_i} = \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}) \]
and

\[ \log \frac{\hat{y}_i}{1 - \hat{y}_i} = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} \]
The term on the left hand side of the equation is known as the logit
transformation and this produces a linear function in terms of the right hand
side of the equation.
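The logit transformation and its inverse (the logistic function) can be sketched as follows; this is a minimal illustration, not part of the GWR software:

```python
import math

def logit(p):
    """Logit transformation: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit (logistic function): maps the linear predictor back to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# The two functions are inverses of one another
p = 0.8
eta = logit(p)
print(round(inv_logit(eta), 6))  # 0.8
```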
In logistic GWR, as with Gaussian GWR, the local parameter estimates βᵢ are
specific to each location i. The model calibration, however, is more complicated
than in OLS regression and the model is fitted using a technique known as
iteratively reweighted least squares. This has two implications for our analysis.
First, the fitting technique is iterative, and a typical problem will take
approximately 5 times as long to run as a Gaussian GWR. Second, to compute
the standard errors, the observed 0/1 values are required, so parameter
estimates may only be obtained at the data points – that is, we no longer have
the option of producing local parameter estimates at points other than the data
points.
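A minimal sketch of iteratively reweighted least squares for an ordinary (global) logistic regression is given below; in logistic GWR these IRLS weights are further multiplied by the geographical kernel weights at each regression point. This is an illustration of the technique, not the GWR 3.0 FORTRAN implementation:

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit a global logistic regression by iteratively reweighted least
    squares. X should include a column of ones for the intercept.
    A minimal sketch of the algorithm only."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = 1 / (1 + np.exp(-eta))     # fitted probabilities
        w = p * (1 - p)                # IRLS weights
        z = eta + (y - p) / w          # working response
        # Weighted least-squares step: solve (X'WX) beta = X'Wz
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

Because each step is itself a weighted regression, adding the geographical weights is a natural extension, but it also explains why logistic GWR runs noticeably slower than the Gaussian case.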
The output file from logistic GWR contains the parameter estimates and their
standard errors from (7.4). Positive parameter values when exponentiated are
greater than unity, a parameter of zero yields an exponentiated value of unity,
and negative parameter values when exponentiated are less than unity. In all
cases, the exponentiated values are positive. Exponentiating the parameter
values is an add-on task (see section 7.7).
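The effect of exponentiation can be checked directly with a short illustration:

```python
import math

# Exponentiating logistic-regression parameters: negative values map below
# unity, zero maps to exactly unity, positive values map above unity, and
# the result is always positive.
for beta in (-0.5, 0.0, 0.5):
    print(f"beta = {beta:+.1f}  ->  exp(beta) = {math.exp(beta):.4f}")
```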
7.2 The Data
The data for this workshop are in the SampleData/Clearwater folder in the file
landslides.csv. They describe the landslides in a small part of the Forest, about
33 x 29km in size, extracted from the Clearwater National Forest GIS Library. A
digital elevation model was downloaded from the Shuttle Radar Topography
Mission data archive at the Jet Propulsion Laboratory and re-projected into UTM
Zone 11 with a 25m grid interval – this gives us an estimate of the elevation of
every 25m grid square in the forest. The DEM was then used to create slope and
aspect grids: the slope is given in percent, and the aspect in degrees. Two more
grids were created with the sine and cosine of the aspect. Digitised streams data
was downloaded from the CWNF website. A basemap extracted from TerraServer
is included to help in interpreting the results.
For the 138 landslide sites in the study area we extracted the following data:
(a) elevation in metres
(b) slope (%)
(c) sin of the aspect
(d) cosine of the aspect
(e) absolute deviation of the aspect from due south
(f) distance in metres to the nearest watercourse
We also need observations from sites where there were no landslides. 101
random locations were sampled in the area for the same six variables. We’ll refer
to the first sample as the landslide sites and the second sample as the control
sites.
The two samples were merged. A 7th variable was added with the value of 1 for
the landslide sites and 0 for the control sites. This will be used as the y variable
in the logistic regression. The topographic variables will be used as predictors
and the output will include the probability of a landslide occurring given the
characteristics of the site, whether landslide or not. A satisfactory outcome will
be high probabilities at the landslide sites and low probabilities at the control
sites.
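The construction of the response variable might be sketched as follows; the column names and values here are illustrative, not taken from landslides.csv:

```python
import pandas as pd

# Hypothetical sketch: merge the landslide sample with the control sample
# and add a 0/1 indicator to act as the y variable in the logistic regression.
landslide_sites = pd.DataFrame(
    {"Elev": [1200.0, 950.0], "Slope": [35.0, 42.0]})
control_sites = pd.DataFrame(
    {"Elev": [900.0], "Slope": [12.0]})

landslide_sites["Landslid"] = 1   # landslide sites coded 1
control_sites["Landslid"] = 0     # control sites coded 0

merged = pd.concat([landslide_sites, control_sites], ignore_index=True)
print(merged["Landslid"].tolist())  # [1, 1, 0]
```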
We have taken a small area of the Forest to keep the GWR run times short. The
models take about 2 minutes to fit on a 1.70GHz laptop. On the same system
fitting the dataset of 875 sample sites and the same number of controls takes
an hour.
7.3 Examining the data
Before beginning the analysis, you should map the spatial distribution of the
landslides on the basemap.
1. Start ArcMap
2. From the SampleData/Clearwater folder add basemap.jpg and
landslides.csv.
3. Right-click on landslides.csv in the Table of Contents, and select
Display XY Data from the options that are presented. A form
will be presented – there’s not much to do except note that
ArcMap has selected X as the X variable and Y as Y variable. Click
on OK at the bottom of the screen
4. Right click on landslides.csv Events in the Table of Contents and
select Properties/Symbology.
5. Select Categories/Unique Values in the Show: list
6. Select Landslid as the Value Field
7. Click on the Add Values button and add the values 0 and 1.
8. Uncheck the symbol for ‘All Other Values’
9. Change the fill colours to red for 1 and blue for 0 and click OK.
Remember that landslide sites are coded 1 and control sites are
coded 0.
10. You can also examine the spatial variation in some of the explanatory
variables. As these are point data, you should use Graduated
Colors as the symbolism.
7.4 Running the model
1. Start GWR3
2. In the Wizard check Create a new model and click Go
3. … select Logistic for Model Type and click Go
4. In Open Data File, change the filter to Comma Separated Variable
(*.csv), and select the file landslides.csv from the
SampleData/Clearwater folder
5. In Analysis Point Selection, click on Yes
6. The Output File should be an ArcInfo Export File (*.e00); navigate to
your Work folder, and name the file logmodela.e00
7. Check that the correct variables are present in the Data Preview, and
check that the correct files are being used in Confirm – in particular,
check that the filenames for the Observed Data File and Parameter
Diagnostic File are different. The former should be landslides.csv
and the latter should be logmodela.e00.
8. In the Model Editor enter a title, select Landslid as the Dependent
Variable and Elev and Slope as the predictor variables; select X and Y
as the Location Variables; select a Variable kernel, with Cartesian
coordinates, and AICc as the Bandwidth selection method. Check all
three Model Options (but don't check the Monte Carlo test as it is not
available yet for this model type). Change the output format to
ArcInfo Exp, and save the model on your Work folder as
logmodela.gwr.
9. Run the model and name the listing file logmodela.txt. Click on Run
and wait a couple of minutes for the model to run.
***************************************************************
* *
* GEOGRAPHICALLY WEIGHTED LOGISTIC REGRESSION *
* *
***************************************************************
Number of data cases read: 239
Sample data file read...
*Number of observations, nobs= 239
*Number of predictors, nvar= 4
P(Landslid=1 | X) = 0.577406
Observation Easting extent: 33543.0625
Observation Northing extent: 28802.6289
Data has been read for 239 locations. 57.7% of the locations were landslide
sites. The study area is rectangular, 33.5km from east to west and 28.8km from
north to south.
*Finding bandwidth...
... using all regression points
This can take some time...
*Calibration will be based on 239 cases
*Adaptive kernel sample size limits: 59 239
*AICc minimisation begins...
Bandwidth AICc
114.623059100000 261.564370933634
149.000000000000 263.977389197345
93.376941151579 259.338341719932
80.246118103905 259.206909758644
72.130823143768 259.220048698357
85.261646191441 259.090515014747
88.361413027338 259.178113791002
83.345884939802 259.175696235611
** Convergence after 8 function calls
** Convergence: Local Sample Size= 85
The next panel shows the results from the Global Model.
The AIC is about 278.2, and the model has 3 parameters. Elevation has a
negative effect on landslide probability. Slope has a positive influence on
landslide probability. Certainly one would expect steeper hillsides to be
more prone to landslides than flatter ones, and the negative elevation effect
suggests landslides have a greater chance of occurring at lower elevations
than at higher ones. However, remember
that these coefficients represent an average over the study area, and other
factors may be influencing hazard variation. (See the note in section 7.7)
The next panel shows the output from the Local Model.
Bayesian Information Criterion.. 318.678626
First of all, note that there has been a substantial decrease in the AICc – this
suggests that the local model provides a better fit to the data than the global
model. The deviance is lower, but the price of using GWR is a greater number of
effective parameters (35.88 now instead of 5.0 in the global model).
It is worth checking also the 5 number summaries for the parameter estimates
(these are the values before exponentiation). At least 75% of the values of
the Slope parameter are positive and a similar proportion of the Elevation
parameters are negative.
**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
-------- ------------- ------------- ------------- ------------- -------------
Intrcept -5.856532 -0.240704 1.719492 3.787028 16.416200
Elev -0.013785 -0.004208 -0.002330 -0.000893 0.000381
Slope -0.027931 0.035047 0.078778 0.133318 0.243211
Again, we can informally check on the spatial variation of the local estimates by
comparing the inter-quartile range with 2xS.E. of the respective global
estimates. In doing this, we get…
Further diagnostic analysis should involve mapping the parameter estimates and
other diagnostic data.
2. Add the new logmodela coverage into the Table of Contents
The new layer is added to the Table of Contents. Use ArcMap to examine the
spatial variation in the parameters. As you might expect given the relatively
large bandwidth, there are some broad regional variations present in the
parameter estimates. Remember that we have taken a small sample of landslides
in the Forest, so it might be unwise to draw general conclusions about
landslides in Clearwater. The influence of elevation on landslide hazard is most
negative towards the eastern and western parts of the study area – higher
elevations imply lower hazard. There is a north-south pattern to the variation in
the slope parameter, with slopes in the south part of the study area having the
greatest influence on hazard.
There is a small cluster of high negative residuals about 3km west of Sheep
Mountain Work Center in the south west of the map. These are landslides which
the model has failed to predict (use the Properties/Symbology/Classify option to
change the class breaks to 3, and set the lower two breaks at -0.5 and 0.5). You
can use a red/blue color ramp but you may need to flip the symbol order (right
click in the symbol/key area).
7.7 Further Tasks
We have included both the sine and cosine of the site aspect as well as the
distance to the nearest watercourse in the data file. Create two further models,
including (a) the distance to the nearest watercourse and (b) the aspect. Is there
evidence for the hypothesis that either of these variables contributes to
improvements in either the local or global model?
You can experiment with taking the anti-logs of the coefficients. You will need
to add another field to the attribute table (use a type of double, a precision of
12, and a scale of 7). The Exp() function will compute the anti-logs for you.
8
Finally
We hope that these labs have given you some of the flavour of GWR in action
and the relationship it has with GIS. Further information is available from the
GWR website:
http://ncg.nuim.ie/GWR
Not only is there a brief description of the methodology but there is information
on how to obtain your own copy of the software. The GWR manual, including
information on installing the software, is available from this website.
Appendix
8.1 Introduction
This appendix introduces you to geographically weighted descriptive statistics.
We shall explore a simple set of these which are incorporated within the GWR
software. Some of this exploration will involve the GWR program, and some will
involve the GIS program.
The familiar mean of a batch of n values is:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
This is simply the sum of the values making up a batch of numbers divided by
the size of the batch. More generally, we can consider a weighted mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
where the wᵢ are the weights. Here we multiply each value by its weight, and
divide by the sum of the weights. In the case that each observation has a weight
of unity, then this formula and the one above are equivalent.
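A quick numerical check of this equivalence, using made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Unit weights reproduce the ordinary mean
w = np.ones_like(x)
print(np.sum(w * x) / np.sum(w))    # 5.0, the same as x.mean()

# Non-unit weights pull the mean towards the heavily weighted values
w2 = np.array([4.0, 1.0, 1.0, 1.0])
print(np.sum(w2 * x) / np.sum(w2))  # about 3.71, pulled towards 2.0
```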
In many cases the weights are integers, but they may also be non-integer
numbers. In this case, we can use weights generated from the same
geographical weighting scheme that we have used for geographically weighted
regression. Rather than being a whole-map statistic, a geographically weighted
mean is available at a particular location, say, u. Thus the formula for a
geographically weighted mean at location u is:
\[ \bar{x}(u) = \frac{\sum_{i=1}^{n} w_i(u)\, x_i}{\sum_{i=1}^{n} w_i(u)} \]
Here wᵢ(u) is the geographical weight of the ith observation relative to the location u.
The weights may be generated using a fixed radius or an adaptive kernel.
The geographically weighted variance at location u is defined analogously:

\[ \sigma^2(u) = \frac{\sum_{i=1}^{n} w_i(u)\,(x_i - \bar{x}(u))^2}{\sum_{i=1}^{n} w_i(u)} \]
and the locally weighted standard deviation is the square root of this. Notice
that the mean here is the geographical mean around point u and NOT the global
mean of the data.
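The three statistics can be sketched together; this uses a Gaussian kernel for the geographical weights as one possible choice, and is an illustration rather than the GWR 3.0 code:

```python
import numpy as np

def gw_stats(x, coords, u, bandwidth):
    """Geographically weighted mean, variance and standard deviation at
    location u, using a Gaussian kernel with a fixed bandwidth.
    A sketch of the statistics defined above."""
    d = np.hypot(coords[:, 0] - u[0], coords[:, 1] - u[1])
    w = np.exp(-0.5 * (d / bandwidth) ** 2)   # geographical weights w_i(u)
    mean_u = np.sum(w * x) / np.sum(w)
    var_u = np.sum(w * (x - mean_u) ** 2) / np.sum(w)
    return mean_u, var_u, np.sqrt(var_u)

# Illustrative data: the local mean at u is dominated by the nearby points
x = np.array([10.0, 20.0, 30.0, 40.0])
coords = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
m, v, s = gw_stats(x, coords, u=(0.5, 0.0), bandwidth=1.0)
print(round(m, 2))   # close to 15.0, the mean of the two nearby points
```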
The GWR software currently supplied (GWR3.0) allows the user to compute
geographically weighted means, variances, and standard deviations for a set of
input data, and for either a fixed or an adaptive kernel. However, as there is no
concept here of an optimal bandwidth, the bandwidth must be supplied by the
user. If a fixed kernel is used, the bandwidth must be in the same units as the
coordinates on the input data; if an adaptive kernel is used, the bandwidth is the
number of objects to include in the local sample. If the bandwidth is very small,
the degree of smoothing from the weighting scheme will be very small: the local
means will approach the original data values and the variances will be very
small. A zero bandwidth may cause premature and possibly inelegant
termination of the program. The larger the bandwidth, the greater will be the
degree of smoothing in the resulting geographically weighted statistic. With a
fixed kernel the bandwidth can be as large as you wish, although anything
greater than the study area width will result in an almost identical set of means.
With the adaptive kernel, the bandwidth should not be greater than the number
of observations in your dataset.
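The effect of the bandwidth on smoothing can be illustrated with a one-dimensional transect of made-up values and a Gaussian kernel:

```python
import numpy as np

x = np.array([1.0, 5.0, 2.0, 8.0, 4.0])   # made-up attribute values
s = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # 1-D "coordinates"

def gw_mean_at(u, bandwidth):
    w = np.exp(-0.5 * ((s - u) / bandwidth) ** 2)
    return np.sum(w * x) / np.sum(w)

# Small bandwidth: the local mean approaches the data value at u.
# Large bandwidth: the local mean approaches the global mean (4.0 here).
for bw in (0.1, 1.0, 100.0):
    print(bw, round(gw_mean_at(2.0, bw), 3))
```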
1. Start GWR
2. Select “Create new descriptive statistics” and click Go
3. Select the data file gdata_utm.csv from SampleData\Georgia and click Open
4. Click Yes to undertake analysis at the data point locations
5. For the output file, navigate to your Work folder, change the file type to
ArcInfo Export File, and enter gstats.e00 as the file name, and click Save
6. Check that the variable names and file locations are correct in the data
preview and file confirmation windows
7. In the Model Editor, enter a title…
8. Enter PctBach, PctEld, PctFB and PctPov into the Variable(s) list
9. The location variables should be X and Y
10. The kernel shape should be adaptive with Cartesian coordinates
11. Enter 30 as the bandwidth
12. Select ArcInfo Exp as the output file type
13. Save the control file as gstats.gwr in your Work folder
14. Run the model with gstats.txt as the listing file
***************************************************************
* *
* GEOGRAPHICALLY WEIGHTED SUMMARY STATISTICS *
* *
***************************************************************
Number of data cases read: 159
** Data file read...
Number of observations, nobs= 159
Number of predictors, nvar= 4
** Adaptive kernel: local sample size is 30
** Results written to .e00 file
To examine these data, we need first to convert the Export file to a point
coverage, and then assign the values from the attribute table to the county
boundary polygons.
The map suggests that there is a broad regional pattern with the greatest
proportions being in those counties around the University of Georgia.
Educational attainment is much lower in the rural south.
Plot the variance (VAR_1). Again there is a pattern here – the high local weighted
means are also associated with high locally weighted variances. This suggests
that there is a distinct local variation among the counties in the local samples. It
may be that 30 is too great a bandwidth, and that while broad regional patterns
in the means are being picked up, the confidence intervals on them may be
rather wide. A similar tale is told if you plot the local standard deviation
(STD_1) – try this for yourself.
Another diagnostic might be the local coefficient of variation – this would be the
ratio of the local standard deviation to the local mean. This is not computed by
GWR, and you will have to compute this yourself. However, you can use the GIS
to help you.
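Outside the GIS, the same ratio is trivial to compute; the values below are illustrative, not taken from the Georgia output:

```python
import numpy as np

# Local coefficient of variation: ratio of the locally weighted standard
# deviation (STD_1) to the locally weighted mean (MEAN_1).
mean_1 = np.array([12.4, 8.1, 15.0])   # illustrative MEAN_1 values
std_1 = np.array([3.1, 4.0, 2.5])      # illustrative STD_1 values

cv = std_1 / mean_1
print(np.round(cv, 3))
```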
5. Select Calculate Values and then click yes to the warning which follows.
6. Complete the Field Calculator as shown on the right. The calculation
should be [STD_1]/[MEAN_1] – select these from the Fields list with a
single mouseclick. Click on OK.
8.5 Tasks
3. In the first GWR exercise, we used a fixed kernel and allowed the calibration
routine to find the optimal bandwidth. This was 94km. Rerun the local
statistics with a fixed kernel, with a bandwidth of 94000 (the map units
are in metres!). Convert the output file to a coverage called gstats94, and
the joined shapefile should be called gstats94. Again, examine the effect
of changing the bandwidth. (The kernel smoothing literature suggests
that the bandwidth changes have a much greater impact on the
smoothing than kernel shape changes).
By this time, you should be adept in using both GWR and ArcGIS in tandem. You
should also have an appreciation of local statistics and the ways in which you
can use the GIS software to assist you with your tasks.