Section 6 - Projection Pursuit Regression
DSCI 425 – Supervised (Statistical) Learning, Brant Deppa, Winona State University, Spring 2020
$$Y = f(X) = \mu_y + \sum_{m=1}^{M_o} \beta_m \phi_m(a_m^T x) + \epsilon$$

where $\|a_m\| = 1$, i.e. $\sqrt{a_{m1}^2 + a_{m2}^2 + \cdots + a_{mp}^2} = 1$, and $\mu_y = E(Y)$.

We then choose $\mu_y$, $\beta_m$, $\phi_m$, and $a_m$ to minimize

$$\sum_{i=1}^{n} \left( y_i - \mu_y - \sum_{m=1}^{M_o} \beta_m \phi_m(a_m^T x_i) \right)^2$$
ACE models fit into this framework under the following restrictions:
$$\theta(y) = y, \quad M_o = p, \quad \beta_m = 1 \text{ for all } m$$

The estimated $\phi_m(a_m^T x) = \phi_m(x_m)$, the usual functions of the predictors found by ACE/AVAS.
An OLS multiple regression model with standardized predictors fits into this framework with the restrictions:
$$\mu_Y = 0, \quad M_o = 2, \quad \beta_1 = \beta_2 = \frac{1}{4}, \quad a_1^T = (1, 1), \quad a_2^T = (1, -1)$$
Then we have,
$$\phi_1(a_1^T x) = (x_1 + x_2)^2 = x_1^2 + 2x_1x_2 + x_2^2$$
$$\phi_2(a_2^T x) = -(x_1 - x_2)^2 = -x_1^2 + 2x_1x_2 - x_2^2$$
So that,
$$\sum_{m=1}^{2} \beta_m \phi_m(a_m^T x) = \frac{1}{4}(4x_1x_2) = x_1x_2$$
Neither ACE nor AVAS can model this type of behavior, and MARS would only find interactions that are products of checkmark functions.
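As a quick numeric check of the identity above (an added illustration with arbitrary values):

> u1 = 0.7; u2 = -0.3
> 0.25*((u1 + u2)^2 - (u1 - u2)^2)   # (1/4)(4*u1*u2)
[1] -0.21
> u1*u2
[1] -0.21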
1) Pick a starting trial direction $a_1$ and compute $z_{1i} = a_1^T x_i$. Then, with $y_i^{(1)} = y_i - \bar{y}$, smooth a scatterplot of $(y_i^{(1)}, a_1^T x_i)$ to obtain $\hat{\phi}_1 = \hat{\phi}_{1,a_1}$. Then $a_1$ is varied to minimize

$$\sum_{i=1}^{n} \left( y_i^{(1)} - \hat{\phi}_{1,a_1}(z_{1i}) \right)^2$$

where for each new value of $a_1$ a new $\hat{\phi}_{1,a_1}$ is obtained. The final results of this step are the optimized direction $\hat{a}_1$ and its ridge function $\hat{\phi}_1$.

2) Form the residuals $r_i = y_i^{(1)} - \hat{\phi}_1(\hat{a}_1^T x_i)$ from the current fit and apply the procedure in step (1) to them to obtain the next direction and ridge function.
3) Repeat (2) until $M$ terms have been formed, giving final fitted values

$$\hat{y}_i = \bar{y} + \sum_{m=1}^{M} \hat{\beta}_m \hat{\phi}_m(\hat{a}_m^T x_i), \quad i = 1, \ldots, n$$
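To make step (1) concrete, here is a minimal sketch of the single-term search for the case of two predictors, where the unit direction can be parameterized by a single angle (this is an added illustration, not the actual ppr() implementation):

one.term.pp = function(X, y) {
  yc = y - mean(y)                        # centered response y_i - ybar
  rss = function(theta) {
    a = c(cos(theta), sin(theta))         # unit-length trial direction a_1
    z = as.vector(X %*% a)                # projections z_1i = a_1'x_i
    sm = supsmu(z, yc)                    # smooth of response vs. projection -> phi-hat_1
    phi = approx(sm$x, sm$y, xout = z)$y  # evaluate the smooth at each z_1i
    sum((yc - phi)^2)                     # the criterion minimized over a_1
  }
  opt = optimize(rss, interval = c(0, pi))  # directions repeat with period pi
  list(a = c(cos(opt$minimum), sin(opt$minimum)), rss = opt$objective)
}

Applied to the data of Example 1 below, one.term.pp(cbind(x1,x2),y) should recover a direction close to ±(1,1)/√2 or ±(1,−1)/√2.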
Example 1: The two-variable interaction example from class is demonstrated below. The data are randomly generated so that $Y = f(X_1, X_2) + \epsilon = X_1 X_2 + \epsilon$.
> set.seed(13)
> x1 <- runif(400,-1,1)
> x2 <- runif(400,-1,1)
> eps <- rnorm(400,0,.2)
> y <- x1*x2 + eps
> x <- cbind(x1,x2)
> plot(x1,y,main="Y vs. X1")
> plot(x2,y,main="Y vs. X2")
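The fitted model and the ridge-function plots discussed below were presumably obtained with calls along these lines (assumed, as they are not shown above):

> pp1 <- ppr(x,y,nterms=2,max.terms=4)   # fit a two-term PPR model
> plot(pp1)                              # plot the estimated ridge functions phi-hat_m
> pp1$alpha                              # estimated projection directions a-hat_m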
Here we see that projection pursuit correctly produces the theoretical results shown in class, namely $\phi_1(x) = x^2$, $\phi_2(x) = -x^2$, $a_1 = (1, 1)$, and $a_2 = (1, -1)$.
It appears that 4 terms would be a good candidate for a “final” model. Therefore we rerun the regression with nterms=4.
> bass.pp2 <- ppr(x,y,nterms=4,max.term=8)
> PPplot(bass.pp2,bar=T)
$\hat{\phi}_j(\hat{a}_j^T x)$ vs. $\hat{a}_j^T x$ for $j = 1, 2, 3, 4$
To visualize the linear combination terms that are formed we can look at barplots of the
variable loadings (bar = T).
These don’t aid in interpretation of the results much, but they do give some idea of what
variables are most important. For example, log(Alkalinity) is prominently loaded in the
first three terms.
Fine-tuning the projection pursuit model involves choosing how many terms to create, which is denoted by $M$ in the fitted model formulation shown below, and choosing how smooth or wiggly the nonparametric estimates of $\hat{\phi}_m(a_m^T x)$ are.

$$\hat{y}_i = \bar{y} + \sum_{m=1}^{M} \hat{\beta}_m \hat{\phi}_m(\hat{a}_m^T x_i), \quad i = 1, \ldots, n$$
Most of the fine-tuning has to do with the smoothers that are used to estimate $\hat{\phi}_m(a_m^T x)$, $m = 1, \ldots, M$. This involves choosing the method used to do the actual smoothing, and controlling how wiggly the smooth from the chosen method can be.
sm.method: the method used for smoothing the ridge functions. The default is to use
Friedman's super smoother 'supsmu'. The alternatives are to use the smoothing spline
code underlying smooth.spline, either with a specified equivalent degrees of freedom or
effective number of parameters for each of the ridge functions, or to allow the
smoothness to be chosen by GCV.
bass: super smoother bass tone control used with automatic span selection (see
'supsmu'); the range of values is 0 to 10, with larger values resulting in increased
smoothing.
span: super smoother span control (see 'supsmu'). The default, '0', results in automatic
span selection by local cross-validation. 'span' can also take a value in '(0, 1]'.
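For example, the following hypothetical calls illustrate these controls (the argument values are arbitrary):

> fit.bass = ppr(x,y,nterms=2,bass=5)                    # heavier supsmu smoothing via the bass control
> fit.df   = ppr(x,y,nterms=2,sm.method="spline",df=4)   # smoothing splines with ~4 equivalent df per term
> fit.gcv  = ppr(x,y,nterms=2,sm.method="gcvspline")     # spline smoothness chosen by GCV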
Aside: Recall that for OLS regression fitted values are obtained via the hat matrix. For the model

$$Y = f(X) = \beta_0 + \beta_1 U_1 + \cdots + \beta_{k-1} U_{k-1}$$
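the fitted values are $\hat{y} = Hy$, where

$$H = U(U^T U)^{-1} U^T$$

is the hat matrix, and $\mathrm{tr}(H) = k$, the number of parameters in the fit. A linear smoother likewise produces fitted values $\hat{y} = Sy$ for some smoother matrix $S$, and $\mathrm{tr}(S)$ defines its equivalent degrees of freedom; this is presumably the connection this aside draws to the df argument described above, which specifies the equivalent degrees of freedom for each ridge-function smooth.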
The smooths certainly look noisy, and thus we are almost surely overfitting our data. This will lead to a model with poor predictive ability. We can try using different smoothers or increasing the degree of smoothing done by the super smoother, which is the default.
Notice that height has a minimum value of zero. While normally we might not worry about this, if we plan to employ transformation methods such as the Box-Cox procedure, the zeroes in height pose a problem. The caret package has a function called
preProcess which is a very general tool for performing various pre-processing tasks on
a set of numeric variables. These pre-processing tasks include the Box-Cox
transformation for transforming numeric variables to approximate normality,
scaling/standardization (i.e. converting numeric variables to z-scores), and performing
dimension reduction techniques such as principal component analysis (PCA). We will
be discussing PCA later in the course. For these data we will demonstrate the use of the
Box-Cox transformation.
In order to use the Box-Cox procedure for these data we need to add a constant to height to deal with the fact that it contains zeroes.
> Abalone$height = Abalone$height+.001
> Abalone.PP = preProcess(Abalone,method="BoxCox")
> Abalone.PP$bc
$rings
Box-Cox Transformation
Largest/Smallest: 29
Sample Skewness: 1.11
$length
Box-Cox Transformation
Largest/Smallest: 10.9
Sample Skewness: -0.64
$diam
Box-Cox Transformation
Largest/Smallest: 11.8
Sample Skewness: -0.609
$height
Box-Cox Transformation
Largest/Smallest: 251
Sample Skewness: -0.264
$whole.weight
Box-Cox Transformation
Largest/Smallest: 1410
Sample Skewness: 0.528
$shucked.weight
Box-Cox Transformation
Largest/Smallest: 1490
Sample Skewness: 0.714
$visc.weight
Box-Cox Transformation
Largest/Smallest: 1520
Sample Skewness: 0.589
$shell.weight
Box-Cox Transformation
Largest/Smallest: 670
Sample Skewness: 0.62
While I would not advocate blindly applying these transformations in this case, as the number of predictors is not that large, we will apply them and proceed with developing a PPR model for the transformed response (Note: $\lambda = 0.20$, or $1/5$, for $Y$).
We can now inspect the relationships amongst these variables in the transformed scale
and their univariate distributions.
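Note that Abalone.PP as created above is a preProcess object rather than a data frame; a step along the following lines (assumed here, as it is not shown above) is needed to produce the transformed data frame used with pairs.plus and ppr below:

> Abalone.PP = predict(Abalone.PP,Abalone)   # apply the estimated Box-Cox transformations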
> pairs.plus(Abalone.PP)
> Abalone.ppr2 = ppr(Rings~.,data=Abalone.PP,nterms=7,span=0.05)
> PPplot(Abalone.ppr2)
> Abalone.ppr2 = ppr(Rings~.,data=Abalone.PP,nterms=7,bass=1)
> PPplot(Abalone.ppr2,bar=T)
The fit looks good for the most part, but we should perform cross-validation to further fine-tune this model for prediction purposes and to compare it to the other models we have considered: MLR (possibly CERES or ACE/AVAS assisted) and MARS.
Rather than code a k-fold, split-sample, or Monte Carlo cross-validation for a PPR model from scratch, we can use functions in the library bootstrap to do some of the heavy lifting for us. The function crossval in this library performs k-fold cross-validation for any modeling method where fitting and obtaining predicted values for future cases can be done easily, which is the case for most methods. The crossval function has the following form, taken from its R help file:
R Documentation
crossval {bootstrap}
K-fold Cross-Validation
Description
See Efron and Tibshirani (1993) for details on this function.
Usage
crossval(x, y, theta.fit, theta.predict, ..., ngroup=n)
Arguments
x a matrix containing the predictor (regressor) values. Each row corresponds to
an observation.
y a vector containing the response values
theta.fit function to be cross-validated. Takes x and y as an argument. See example
below.
theta.predict function producing predicted values for theta.fit. Arguments are a
matrix x of predictors and fit object produced by theta.fit. See example below.
... any additional arguments to be passed to theta.fit
ngroup optional argument specifying the number of groups formed. Default is ngroup = sample size, corresponding to leave-one-out cross-validation.
The required arguments are a matrix of the predictors/terms to use (x), a response vector (y), a function we have to write called theta.fit which specifies how to fit the model to be cross-validated, a function theta.predict, which we again have to write, that specifies how to obtain predictions for observations not used to fit the model, and the number of folds to use in the k-fold cross-validation (ngroup).
As a single call to this function will only perform one replication of a k-fold cross-validation, we will first write a function to perform Monte Carlo k-fold cross-validation a specified number of times, saving the results from each replication.
CVK = function(x,y,theta.fit,theta.predict,ngroup=10,reps=100) {
  require(bootstrap)
  cv = rep(0,reps)
  for (i in 1:reps) {
    # one replication of k-fold CV with a fresh random grouping
    results = crossval(x,y,theta.fit,theta.predict,ngroup=ngroup)
    # mean squared error of prediction (MSEP) for this replication
    cv[i] = sum((y - results$cv.fit)^2)/length(y)
  }
  cv
}
Create the function that will fit our PPR model with desired specifications
> theta.fitppr = function(x,y){ppr(x,y,nterms=7,bass=1)}
Create function that will predict the response value given a set of predictor values.
> theta.predictppr = function(fit,x){predict(fit,x)}
We can now run the CVK function above for our chosen PPR model.
> results = CVK(Ab.X,Ab.y,theta.fitppr,theta.predictppr,ngroup=10,reps=25)
> results
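The summary measures below were presumably computed from the vector of per-replication MSEP values along these lines (the defining lines are not shown, so the exact code is an assumption):

> MSEP.ppr = mean(results)    # average MSEP over the 25 Monte Carlo replications
> RMSEP.ppr = sqrt(MSEP.ppr)  # root mean squared error of prediction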
> MSEP.ppr
[1] 0.08869406
> RMSEP.ppr
[1] 0.2978155
These prediction quality measurements are for the response in the transformed scale using the Box-Cox family, i.e. $T(Y) = (\sqrt[5]{\text{rings}} - 1)/0.20$. To measure performance in the original scale we need to modify the CVK function to convert the predictions and the actual response values to the original scale within the function.
> CVK.ab = edit(CVK)
> CVK.ab = function(x,y,theta.fit,theta.predict,ngroup=10,reps=100) {
require(bootstrap)
cv = rep(0,reps)
for (i in 1:reps) {
results = crossval(x,y,theta.fit,theta.predict,ngroup=ngroup)
ystar = (0.2*y + 1)^5                # back-transform the actual responses to the original scale
ypred = (0.2*results$cv.fit+1)^5     # back-transform the cross-validated predictions
cv[i] = sum((ystar-ypred)^2)/length(ystar)
}
cv
}
We can easily extend the CVK function above to compute MAEP and MAPEP as well.
CVK2 = function(x,y,theta.fit,theta.predict,ngroup=10,reps=100) {
require(bootstrap)
MSEP = rep(0,reps)
MAEP = rep(0,reps)
MAPEP = rep(0,reps)
n = length(y)
for (i in 1:reps) {
results = crossval(x,y,theta.fit,theta.predict,ngroup=ngroup)
MSEP[i] = sum((y - results$cv.fit)^2)/n
MAEP[i] = sum(abs(y-results$cv.fit))/n
MAPEP[i] = sum(abs(y[y!=0]-results$cv.fit[y!=0])/y[y!=0])/length(y[y!=0])
}
RMSEP = sqrt(mean(MSEP))
MAE = mean(MAEP)
MAPE = mean(MAPEP)
cat("RMSEP\n")
cat("===============\n")
cat(RMSEP,"\n\n")
cat("MAE\n")
cat("===============\n")
cat(MAE,"\n\n")
cat("MAPE\n")
cat("===============\n")
cat(MAPE,"\n\n")
temp = data.frame(MSEP=MSEP,MAEP=MAEP,MAPEP=MAPEP)
return(temp)
}
For the Abalone data with the Box-Cox ($\lambda = 0.2$) transformation we would need to alter the code to undo the transformation for the actual response values and those returned from the crossval function.
CVK2.ab = function(x,y,theta.fit,theta.predict,ngroup=10,reps=100) {
require(bootstrap)
MSEP = rep(0,reps)
MAEP = rep(0,reps)
MAPEP = rep(0,reps)
n = length(y)
for (i in 1:reps) {
results = crossval(x,y,theta.fit,theta.predict,ngroup=ngroup)
ystar = (0.2*y+1)^5
ypred = (0.2*results$cv.fit+1)^5
MSEP[i] = sum((ystar - ypred)^2)/n
MAEP[i] = sum(abs(ystar-ypred))/n
MAPEP[i] = sum(abs(ystar[ystar!=0]-ypred[ystar!=0])/ystar[ystar!=0])/length(ystar[ystar!=0])
}
RMSEP = sqrt(mean(MSEP))
MAE = mean(MAEP)
MAPE = mean(MAPEP)
cat("RMSEP\n")
cat("===============\n")
cat(RMSEP,"\n\n")
cat("MAE\n")
cat("===============\n")
cat(MAE,"\n\n")
cat("MAPE\n")
cat("===============\n")
cat(MAPE,"\n\n")
temp = data.frame(MSEP=MSEP,MAEP=MAEP,MAPEP=MAPEP)
return(temp)
}
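The output below was presumably produced by a call along these lines (the call itself is not shown; Ab.X and Ab.y are the predictor matrix and transformed response used with CVK above):

> cvk2.results = CVK2.ab(Ab.X,Ab.y,theta.fitppr,theta.predictppr,ngroup=10,reps=25)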
RMSEP
===============
2.098229
MAE
===============
1.47536
MAPE
===============
0.1449155
We will now compare the PPR model to the “best” (or at least a reasonable) model using MARS.
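The theta.fitmars and theta.predictmars functions are not shown; a sketch consistent with the discussion that follows, using earth() from the earth package with an internal 5-fold cross-validation to select the model, might look like this (the degree setting is an assumption):

> library(earth)
> theta.fitmars = function(x,y){earth(x,y,degree=2,pmethod="cv",nfold=5)}  # internal 5-fold CV model selection
> theta.predictmars = function(fit,x){predict(fit,x)}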
RMSEP
===============
2.138605
MAE
===============
1.509889
MAPE
===============
0.1486187
Notice that in the theta.fitmars function we have specified that, when developing a model on the $(k-1)$ folds used in obtaining the fit, an internal 5-fold cross-validation is performed to select the model. Thus the MARS model chosen for each fit utilizes the same criterion (5-fold CV in this case) that the CVK2.ab function does.
The PPR model slightly outperforms the MARS model for these data in the transformed scales chosen. However, the MARS model is generally much more interpretable than the PPR model.