1. POISSON REGRESSION
We model counts of quite rare events.
Data
Y has a Poisson distribution. We expect the mean of Y to be of similar magnitude to the variance
of Y. Possible values of Y are 0, 1, 2, 3, ...
Regressors can be interval or categorical random variables.
Main steps
1) Check if Y has a Poisson distribution.
2) Check if the normed deviance is close to 1.
3) Check if the maximum likelihood statistic is statistically significant. If its p-value > 0,05, the model is
unacceptable.
4) Check if all regressors are significant (Wald test, p < 0,05). If not, drop them from the
model. We do not pay attention to the p-value of the model constant (intercept).
1. Data
We will model the number of household members other than the respondent by agea and cldcrsv. We will
investigate respondents for whom imptrad <= 2 and eduyrs <= 10.
Use Data -> Select Cases -> If condition is satisfied -> If and write imptrad <= 2 & eduyrs <= 10.
Then Continue -> OK.
The dependent variable (we call it numbhh) can be created with the help of Transform ->
Compute Variable.
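For readers working outside SPSS, the same preparation step can be done in Python; this is a minimal pandas sketch, in which the file name ess.csv and the source column hhmmb (total household size, respondent included) are assumptions, not taken from the instructions above.

    import pandas as pd

    # Assumed input: an ESS-style extract "ess.csv" with columns
    # imptrad, eduyrs and hhmmb (household size, respondent included).
    df = pd.read_csv("ess.csv")

    # Select Cases: imptrad <= 2 & eduyrs <= 10
    sub = df[(df["imptrad"] <= 2) & (df["eduyrs"] <= 10)].copy()

    # Compute Variable: numbhh = household members other than the respondent
    sub["numbhh"] = sub["hhmmb"] - 1

    print(sub["numbhh"].describe())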
2. Preliminary analysis
First we check whether numbhh is similar to a Poisson variable: Analyze -> Descriptive Statistics ->
Frequencies.
Statistics: numbhh
N   Valid     281
    Missing   0
Mean          1.0036
Variance      1.482
The mean and the variance are of similar magnitude, as the Poisson model requires.
It is also possible to check whether a random variable has a Poisson distribution with the help of the
Kolmogorov-Smirnov test:
Analyze -> Nonparametric Tests -> Legacy Dialogs -> 1-Sample K-S.
We see that we can assume that numbhh has a Poisson distribution (p = 0,169), but it is not
normal (p = 0,000).
One-Sample Kolmogorov-Smirnov Test: numbhh

Test distribution: Normal
N                                      281
Normal Parameters(a,b)  Mean           1.0036
                        Std. Deviation 1.21743
Most Extreme            Absolute       .302
Differences             Positive       .302
                        Negative       -.205
Kolmogorov-Smirnov Z                   5.060
Asymp. Sig. (2-tailed)                 .000

Test distribution: Poisson
N                                      281
Poisson Parameter(a,b)  Mean           1.0036
Most Extreme            Absolute       .066
Differences             Positive       .066
                        Negative       -.026
Kolmogorov-Smirnov Z                   1.111
Asymp. Sig. (2-tailed)                 .169
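Outside SPSS the Poisson check can be approximated, for instance, by a chi-square goodness-of-fit test against Poisson expected counts. The sketch below uses simulated data in place of the real numbhh column; it illustrates the idea only and is not SPSS's K-S procedure.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    numbhh = rng.poisson(1.0, size=281)    # stand-in for the real numbhh data

    lam = numbhh.mean()
    print("mean:", lam, "variance:", numbhh.var(ddof=1))  # should be comparable

    # Observed vs Poisson(lam) expected counts for the values 0, 1, 2, ...
    k = np.arange(numbhh.max() + 1)
    observed = np.bincount(numbhh).astype(float)
    expected = stats.poisson.pmf(k, lam) * numbhh.size
    expected[-1] += numbhh.size - expected.sum()   # fold the upper tail in

    # One parameter (lam) was estimated from the data, hence ddof=1;
    # in practice cells with small expected counts should be pooled first.
    chi2, p = stats.chisquare(observed, expected, ddof=1)
    print("chi-square GOF p-value:", p)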
3. SPSS options
Choose Analyze -> Generalized Linear Models -> Generalized Linear Models and in Type of Model select
Poisson loglinear. Click on Predictors and move both regressors agea and cldcrsv into Covariates.
(We do not have categorical variables, which would be moved into Factors.)
In Statistics check in addition Include exponential parameter estimates. Then -> OK.
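As a cross-check outside SPSS, the same model can be fitted with Python's statsmodels. A minimal sketch, assuming the filtered data frame sub from the preparation step above:

    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    fit = smf.glm("numbhh ~ agea + cldcrsv", data=sub,
                  family=sm.families.Poisson()).fit()

    print(fit.summary())                                    # B, Wald tests
    print("normed deviance:", fit.deviance / fit.df_resid)  # compare with 1
    print("Exp(B):", np.exp(fit.params))                    # as in SPSS output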
4. Results
In the Goodness of Fit table we can find the normed deviance (Value/df for Deviance). We see that it is
close to 1 (0,919). Thus, the Poisson regression model fits our data. It remains to decide which
regressors are statistically significant.
Goodness of Fit(b)
                                        Value      df    Value/df
Deviance                                230.635    251   .919
Scaled Deviance                         230.635    251
Pearson Chi-Square                      188.314    251   .750
Scaled Pearson Chi-Square               188.314    251
Log Likelihood(a)                       -301.040
Akaike's Information Criterion (AIC)    608.080
Finite Sample Corrected AIC (AICC)      608.176
Bayesian Information Criterion (BIC)    618.692
Consistent AIC (CAIC)                   621.692
In the Omnibus Test table we find the p-value of the maximum likelihood (likelihood ratio) statistic. Since
p < 0,05, we conclude that at least one regressor is statistically significant.
Omnibus Test(a)
Likelihood Ratio Chi-Square   df   Sig.
112.919                       2    .000
In the table Tests of Model Effects we see the Wald test p-values for all regressors. We do not
check the p-value for the intercept. Both p < 0,05. Therefore, we conclude that both regressors (agea and
cldcrsv) are statistically significant and should remain in the model.
Tests of Model Effects
              Type III
Source        Wald Chi-Square   df   Sig.
(Intercept)   41.188            1    .000
agea          105.703           1    .000
cldcrsv       14.395            1    .000
In the table Parameter Estimates the information about Wald p-values is repeated. Moreover, the
table contains estimates of the model's coefficients (column B).
Parameter Estimates
                                 95% Wald                                         95% Wald CI
                     Std.        Confidence Interval   Wald                       for Exp(B)
Parameter    B       Error       Lower      Upper      Chi-Square  df  Sig.   Exp(B)  Lower   Upper
(Intercept)  1.535   .2392       1.066      2.004      41.188      1   .000   4.642   2.905   7.419
agea         -.035   .0034       -.042      -.028      105.703     1   .000   .966    .959    .972
cldcrsv      .099    .0261       .048       .150       14.395      1   .000   1.104   1.049   1.162
(Scale)      1(a)
Since the coefficient of agea is negative, as the respondent's age increases, the number of other
household members decreases. The mathematical expression of the model is

μ = exp{1,535 − 0,035·agea + 0,099·cldcrsv}.

Here μ is the mean value of the number of other household members.
Forecasting means that we insert given values of agea and cldcrsv into the above formula.
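For example, a forecast can be computed directly from the estimated coefficients; the agea and cldcrsv values below are illustrative only.

    import numpy as np

    def mean_numbhh(agea, cldcrsv):
        # coefficients from the Parameter Estimates table
        return np.exp(1.535 - 0.035 * agea + 0.099 * cldcrsv)

    # e.g. a 40-year-old respondent with cldcrsv = 2 (made-up values)
    print(mean_numbhh(40, 2))   # forecasted mean number of other household members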
5. Categorical regressor
Categorical regressors are included via Generalized Linear Models -> Predictors -> Factors.
Do not forget to add ctzcntr in the Model window. Then the table Parameter Estimates takes the form:
                                  95% Wald Confidence Interval
Parameter     B       Std. Error  Lower     Upper
(Intercept)   1.239   .3115       .629      1.850
agea          -.036   .0034       -.043     -.029
[ctzcntr=1]   .352    .2319       -.103     .806
[ctzcntr=2]   0(a)    .           .         .
cldcrsv       .104    .0263       .053      .156
We get additional information about both ctzcntr categories. The model can then be written as

ln μ = 1,239 − 0,036·agea + 0,104·cldcrsv + { 0,352, if ctzcntr = 1,
                                              0,     if ctzcntr = 2.
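The bracketed term only shifts the intercept, so for fixed agea and cldcrsv the two citizenship groups differ by the constant factor exp(0,352) ≈ 1,42. A small sketch with illustrative regressor values:

    import numpy as np

    def mean_numbhh(agea, cldcrsv, ctzcntr):
        shift = 0.352 if ctzcntr == 1 else 0.0   # [ctzcntr=2] is the reference
        return np.exp(1.239 - 0.036 * agea + 0.104 * cldcrsv + shift)

    # the ratio does not depend on agea or cldcrsv (illustrative values)
    print(mean_numbhh(40, 2, 1) / mean_numbhh(40, 2, 2))   # exp(0.352), about 1.42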
Estimates of the coefficients β₀, β₁, β₂, β₃ are calculated from the data.

2. NEGATIVE BINOMIAL (NB) REGRESSION
NB regression is an alternative to the Poisson regression. The main difference is that the variance of Y is larger than the mean of Y.
Data
Y has a negative binomial distribution. We expect the mean of Y to be smaller than the variance of Y.
Possible values of Y are 0, 1, 2, 3, ...
Regressors can be interval or categorical random variables.
Main steps
1) Check if the variance of Y is greater than the mean of Y. Otherwise, the NB regression is
not applicable.
2) Check if the normed deviance is close to 1.
3) Check if the maximum likelihood statistic is statistically significant. If its p-value > 0,05, the model is
unacceptable.
4) Check if all regressors are significant (Wald test, p < 0,05). If not, drop them from the
model. We do not pay attention to the p-value of the model constant (intercept).
1. Data
eduyrs - years of full-time education completed,
brwmny - borrow money for living (1 very difficult, ..., 5 very easy),
emplno - number of employees the respondent has/had.
The variable emplno has only one observation greater than 26. Therefore, with Recode we create a new
dichotomous variable emplnof2 (0 if no employees, 1 if at least one employee).
2. SPSS options for the negative binomial regression
Analyze -> Generalized Linear Models -> Generalized Linear Models.
Click on Type of Model. Do not choose Negative binomial with log link (it fixes the NB parameter at 1).
Instead check Custom: Distribution -> Negative binomial, Link function -> Log, Parameter ->
Estimate value.
In Predictors put both variables eduyrs and brwmny into Covariates. Put the categorical variable
emplnof2 into Factors.
3. Results
At the beginning of the output we see descriptive statistics. Observe that the standard deviation of emplno
(and hence its variance) is greater than its mean.
Categorical Variable Information
                          N     Percent
Factor emplnof2   .00     33    50.0%
                  1.00    33    50.0%
                  Total   66    100.0%

Continuous Variable Information
                                                     N    Maximum  Mean    Std. Deviation
Dependent   emplno Number of employees
Variable    respondent has/had                       66   763      14.73   93.831
Covariate   eduyrs Years of full-time
            education completed                      66   23       11.71   3.732
Covariate   brwmny Borrow money to make
            ends meet, difficult or easy             66            3.68    1.069
In the Goodness of Fit table we see that the normed deviance is 0,901, that is, the overall fit of the
model to the data is quite good.
Goodness of Fit(b)
                  Value     df   Value/df
Deviance          54.989    61   .901
Scaled Deviance   54.989    61
The Omnibus Test table contains the maximum likelihood statistic and its p-value. Since p < 0,05,
we conclude that at least one regressor is statistically significant.
Tests of Model Effects contains Wald tests for each regressor. All regressors are statistically
significant (we do not check the p-value for the intercept).
Omnibus Test(a)
Likelihood Ratio Chi-Square   df   Sig.
23.777                        3    .000

Tests of Model Effects
              Type III
Source        Wald Chi-Square   df   Sig.
(Intercept)   .151              1    .698
emplnof2      6.298             1    .012
eduyrs        4.959             1    .026
brwmny        7.399             1    .007
Parameter Estimates
                                       95% Wald                                          95% Wald CI
                          Std.         Confidence Interval   Wald                        for Exp(B)
Parameter          B      Error        Lower      Upper      Chi-Square  df  Sig.   Exp(B)  Lower   Upper
(Intercept)        1.590  2.1316       -2.588     5.768      .556        1   .456   4.904   .075    319.831
[emplnof2=.00]     -1.629 .6493        -2.902     -.357      6.298       1   .012   .196    .055    .700
[emplnof2=1.00]    0(a)   .            .          .          .           .   .      1       .       .
eduyrs             .286   .1286        .034       .539       4.959       1   .026   1.332   1.035   1.714
brwmny             -.753  .2768        -1.295     -.210      7.399       1   .007   .471    .274    .810
(Scale)            1(b)
(Negative
binomial)          5.327  1.2084       3.415      8.310
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
Estimated model:

ln μ = 1,590 + 0,286·eduyrs − 0,753·brwmny + { 0,      if emplnof2 = 1,
                                               −1,629, if emplnof2 = 0.
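For a cross-check outside SPSS, here is a minimal statsmodels sketch; the data frame df and the response column y are assumptions (the text does not give the dependent count variable a name), and statsmodels' dispersion parameter alpha corresponds to SPSS's negative binomial parameter.

    import statsmodels.formula.api as smf

    # Assumed: data frame df with the count response in column "y" and the
    # regressors emplnof2, eduyrs, brwmny.
    print(df["y"].mean(), df["y"].var())   # NB needs variance > mean

    nb = smf.negativebinomial("y ~ C(emplnof2) + eduyrs + brwmny", data=df).fit()
    print(nb.summary())   # coefficients, Wald tests and the dispersion alpha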
3. PROBIT REGRESSION
Model
We model a two-valued (dichotomous) variable. Probit regression can be used whenever logistic regression
applies, and vice versa. The model has the expression

Φ⁻¹(P(Y = 0)) = β₀ + β₁X₁ + β₂X₂ + β₃X₃.

Here Φ⁻¹ is the inverse of the standard normal distribution function Φ, also known as the probit function.
If β₁ > 0, then as X₁ grows, P(Y = 0) also grows.
If β₁ < 0, then as X₁ grows, P(Y = 1) grows.
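A two-line illustration of the probit function: scipy names Φ norm.cdf and its inverse Φ⁻¹ norm.ppf.

    from scipy.stats import norm

    p = 0.31                 # some probability P(Y = 0)
    z = norm.ppf(p)          # probit(p) = inverse of the standard normal CDF
    print(z, norm.cdf(z))    # applying the CDF recovers p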
Data
a) Variable Y is dichotomous. The data for Y contain at least 20% of zeros and at least 20%
of ones.
b) If the model contains many categorical variables, for each combination of categories the data
should contain at least 5 observations.
c) There is no strong correlation between regressors.
Model fit
Variables used in the example:
K2 - university,
K36_1 - I use professional skills obtained during studies (1 never, ..., 5 very frequently).
2. SPSS options
Analyze -> Generalized Linear Models -> Generalized Linear Models. In Type of Model choose Binary probit.
Open Predictors and move K37_1 and K33_2 into Covariates (with some reservation we treat
these variables as interval ones). The regressor K35_1 takes only 4 values and is therefore treated as a
categorical variable. Move K35_1 into Factors.
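A hedged statsmodels sketch of the same probit model (the data frame df is an assumption). Note two differences from the SPSS output below: statsmodels models P(Y = 1) rather than P(Y = 0), so the coefficient signs come out reversed, and C() takes the first category as the reference, whereas SPSS GENLIN takes the last.

    import statsmodels.formula.api as smf

    # Assumed: data frame df with columns Y (0/1), K35_1, K37_1, K33_2
    probit = smf.probit("Y ~ C(K35_1) + K37_1 + K33_2", data=df).fit()
    print(probit.summary())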
3. Results
The model is constructed for P(Y = 0). In Categorical Variable Information we check that there is
a sufficient number of respondents for each value of the categorical variables (Y included).
Categorical Variable Information
                                                  N     Percent
Dependent Variable  Y        .00                  100   31.1%
                             1.00                 222   68.9%
                             Total                322   100.0%
Factor  K35_1 (correspondence
        of bachelor studies)  1 Definitely yes    153   47.5%
                              2 Rather yes        95    29.5%
                              3 Rather no         36    11.2%
                              4 Definitely no     38    11.8%
                              Total               322   100.0%
In the Omnibus Test table we check that the p-value of the maximum likelihood test is sufficiently
small: p = 0,000 < 0,05.
Omnibus Test(a)
Likelihood Ratio Chi-Square   df   Sig.
163.847                       5    .000
a. Dependent Variable: Y; Model: (Intercept), K35_1, K37_1, K33_2
The Parameter Estimates table contains parameter estimates and Wald tests for the significance
of each regressor. We do not check the significance of the intercept. The categorical variable K35_1 was
replaced by 4 dummy variables, one of which is not statistically significant. However, because of one such
insignificant result it is not rational to drop K35_1 from the model.
Parameter Estimates
                                  95% Wald Confidence Interval   Hypothesis Test
Parameter     B       Std. Error  Lower      Upper      Wald Chi-Square  df  Sig.
(Intercept)   4.853   .7092       3.463      6.243      46.832           1   .000
[K35_1=1]     -1.577  .3272       -2.218     -.936      23.229           1   .000
[K35_1=2]     -1.018  .3226       -1.650     -.385      9.953            1   .002
[K35_1=3]     -.261   .3722       -.991      .468       .493             1   .482
[K35_1=4]     0(a)    .           .          .          .                .   .
K37_1         -.273   .1141       -.496      -.049      5.720            1   .017
K33_2         -.780   .1151       -1.005     -.554      45.859           1   .000
(Scale)       1(b)
Dependent Variable: Y
Model: (Intercept), K35_1, K37_1, K33_2
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
We obtained four models, which differ by the constant only. They can be written in the following
way:

P(Y = 0) = P(rarely applies knowledge in his work) = Φ(z),

z = 4,85 − 0,273·K37_1 − 0,78·K33_2 + { −1,57, if K35_1 = 1,
                                        −1,02, if K35_1 = 2,
                                        −0,26, if K35_1 = 3,
                                        0,     if K35_1 = 4.
The signs of the coefficients agree with the general logic of the model. The coefficient of K37_1 is
negative: the larger the value of K37_1 (the happier the respondent is with his work), the less probable it is
that knowledge is rarely used. The other signs of the coefficients can be explained similarly.
We treated probit regression as a particular case of the generalized linear model. Therefore, one can
check the magnitude of the normed deviance in the table Goodness of Fit. We see that it is close to
unity (1,156), which demonstrates a good fit of the model. Note that for probit regression the small
p-value of the maximum likelihood test (it can be found in the Omnibus Test table) is more important. If all
model characteristics except the deviance show good model fit, we assume that the model is
acceptable.
Goodness of Fit(b)
                                        Value     df   Value/df
Deviance                                49.722    43   1.156
Scaled Deviance                         49.722    43
Pearson Chi-Square                      48.218    43   1.121
Scaled Pearson Chi-Square               48.218    43
Log Likelihood(a)                       -47.932
Akaike's Information Criterion (AIC)    107.865
Finite Sample Corrected AIC (AICC)      108.131
Bayesian Information Criterion (BIC)    130.512
Consistent AIC (CAIC)                   136.512
Descriptive Statistics
                     N     Minimum   Maximum   Mean     Std. Deviation
Cook's Distance      322   .000      .039      .00324   .006749
Valid N (listwise)   322
The maximal Cook's distance value is 0,039 < 1. Therefore, there are no outliers in our data.
To obtain the classification table we choose Analyze -> Descriptive Statistics -> Crosstabs.
Move Y into Row(s) and PredictedValue into Column(s). Choose Cells and check Row. Then
Continue and OK.
Y * PredictedValue Predicted Category Value Crosstabulation
                          Predicted Category Value
                          .00       1.00      Total
Y  .00    Count           66        34        100
          % within Y      66.0%     34.0%     100.0%
   1.00   Count           17        205       222
          % within Y      7.7%      92.3%     100.0%
Total     Count           83        239       322
          % within Y      25.8%     74.2%     100.0%
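The same classification table can be produced outside SPSS with pandas (assuming a data frame df with columns Y and PredictedValue):

    import pandas as pd

    print(pd.crosstab(df["Y"], df["PredictedValue"], margins=True))
    print(pd.crosstab(df["Y"], df["PredictedValue"],
                      normalize="index"))   # row shares, as in % within Y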
Of the 100 respondents who rarely use professional skills obtained during studies, 66 are
correctly classified (66 %). Of the 222 respondents who frequently use professional skills obtained
during studies, 205 are correctly classified (92,3 %). Recalling the table Categorical Variable
Information and its percentages (31,1 % and 68,9 %, respectively), we see that the probit model ensures
much better forecasting than a random guess. Final conclusion: the probit regression model fits the data
sufficiently well.
4. Forecasting
One value can be forecasted in the following way. Let us assume that the previous model is applied to a
respondent for whom K33_2 = 4, K35_1 = 1, K37_1 = 4. We add an additional row to the data, writing 4
in the column K33_2, 1 in the column K35_1, 4 in the column K37_1, and 1 in the column
filter_$. The remaining columns are left empty.
We repeat the probit analysis but check Predicted value of mean of response in the Save window.
A new column MeanPredicted appears in the data, containing the probabilities P(Y = 0). We got
the probability 0,175 for our respondent. Therefore, it is unlikely that this respondent rarely applies skills
from studies in his professional work.
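The forecast is easy to verify by hand: insert the respondent's values into the estimated probit model and apply the standard normal distribution function.

    from scipy.stats import norm

    # K35_1 = 1, K37_1 = 4, K33_2 = 4; coefficients from Parameter Estimates
    z = 4.853 - 1.577 - 0.273 * 4 - 0.780 * 4
    print(norm.cdf(z))   # P(Y = 0), approximately 0.175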