Sparse Model Matrices: Martin Maechler R Core Development Team July 2007, 2008
Martin Maechler
R Core Development Team
[email protected]
Introduction
Model matrices in the very widely used (generalized) linear models of statistics (typically fit via lm() or
glm() in R) are often sparse in practice, namely whenever categorical predictors (factors in R) are used.
We show for a few classes of such linear models how to construct sparse model matrices using sparse
matrix (S4) objects from the Matrix package, and typically without using dense matrices in intermediate
steps.
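To make the sparsity claim concrete, here is a minimal sketch. The factor f and its size are made up for illustration and do not come from the examples below. It counts the fraction of nonzero entries in a dense dummy-variable matrix and builds the same matrix in sparse form with sparse.model.matrix() from the Matrix package.

```r
library(Matrix)

set.seed(1)
# Hypothetical factor with 26 levels and 1000 observations (illustration only)
f <- factor(sample(letters, 1000, replace = TRUE), levels = letters)

X <- model.matrix(~ 0 + f)            # dense 1000 x 26 indicator matrix
frac_nonzero <- mean(X != 0)          # one 1 per row, so the fraction is 1/26

sX <- sparse.model.matrix(~ 0 + f)    # same entries, stored sparsely
```

With k levels and no intercept, only a fraction 1/k of the entries is nonzero, so the saving grows with the number of levels.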
1 One factor: y ∼ f1
Let’s start with an artificial small example:
> (ff <- factor(strsplit("statistics_is_a_task", "")[[1]], levels=c("_",letters)))
[1] s t a t i s t i c s _ i s _ a _ t a s k
Levels: _ a b c d e f g h i j k l m n o p q r s t u v w x y z
[1] s t a t i s t i c s _ i s _ a _ t a s k
Levels: _ a c i k s t
i . . 1 . .
s . . . 1 .
t . . . . 1
_ 1 . . . .
a . 1 . . .
ck . . 1 . .
i . . . 1 .
s . . . . 1
t -1 -1 -1 -1 -1
_ -1 -1 -1 -1 -1
a 1 -1 -1 -1 -1
ck . 2 -1 -1 -1
i . . 3 -1 -1
s . . . 4 -1
t . . . . 5
where contrasts() is (conceptually) just one major ingredient in the well-known model.matrix() function
to build the linear model matrix X of so-called “dummy variables”. Since 2007, the Matrix package has
been providing coercion from a factor object to a sparseMatrix one to produce the transpose of the model
matrix corresponding to a model with that factor as predictor (and no intercept):
> as(f1, "sparseMatrix")
_ . . . . . . . . . . 1 . . 1 . 1 . . . .
a . . 1 . . . . . . . . . . . 1 . . 1 . .
ck . . . . . . . . 1 . . . . . . . . . . 1
i . . . . 1 . . 1 . . . 1 . . . . . . . .
s 1 . . . . 1 . . . 1 . . 1 . . . . . 1 .
t . 1 . 1 . . 1 . . . . . . . . . 1 . . .
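which is almost the same as indexing the sparsified contrasts() matrix from above by the observations and
transposing the result (with column names added for nicer printing); only the row of the baseline level is missing,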
> printSpMatrix( t( Matrix(contrasts(f1))[as.character(f1) ,] ),
+ col.names=TRUE)
s t a t i s t i ck s _ i s _ a _ t a s ck
a . . 1 . . . . . . . . . . . 1 . . 1 . .
ck . . . . . . . . 1 . . . . . . . . . . 1
i . . . . 1 . . 1 . . . 1 . . . . . . . .
s 1 . . . . 1 . . . 1 . . 1 . . . . . 1 .
t . 1 . 1 . . 1 . . . . . . . . . 1 . . .
and that is the same as the “sparsification” of model.matrix(), apart from the column names (here
transposed),
> t( Matrix(model.matrix(~ 0+ f1))) # model with*OUT* intercept
f1_ . . . . . . . . . . 1 . . 1 . 1 . . . .
f1a . . 1 . . . . . . . . . . . 1 . . 1 . .
f1ck . . . . . . . . 1 . . . . . . . . . . 1
f1i . . . . 1 . . 1 . . . 1 . . . . . . . .
f1s 1 . . . . 1 . . . 1 . . 1 . . . . . 1 .
f1t . 1 . 1 . . 1 . . . . . . . . . 1 . . .
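The equivalence above can be checked programmatically. The sketch below rests on an assumption: f1 is reconstructed as the factor ff with unused levels dropped and the levels "c" and "k" merged into "ck", which is what the printed row names suggest. It uses fac2sparse() from Matrix, which returns the transposed indicator matrix of a factor.

```r
library(Matrix)

# Assumption: f1 is ff with unused levels dropped and "c", "k" merged to "ck"
ff <- factor(strsplit("statistics_is_a_task", "")[[1]], levels = c("_", letters))
f1 <- droplevels(ff)
levels(f1)[match(c("c", "k"), levels(f1))] <- "ck"

tX <- fac2sparse(f1)              # sparse, levels x observations (transposed)
X  <- model.matrix(~ 0 + f1)      # dense, observations x levels

# The two agree entry by entry once one of them is transposed
same <- all(unname(as.matrix(t(tX))) == unname(X))
```

Only the dimnames differ: the sparse version uses the bare level names, while model.matrix() prefixes them with the variable name.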
casein . . . . . . . . . . . . . . . . . . . . 1 1 1 1
horsebean 1 1 1 1 . . . . . . . . . . . . . . . . . . . .
linseed . . . . 1 1 1 1 . . . . . . . . . . . . . . . .
meatmeal . . . . . . . . . . . . . . . . 1 1 1 1 . . . .
soybean . . . . . . . . 1 1 1 1 . . . . . . . . . . . .
sunflower . . . . . . . . . . . . 1 1 1 1 . . . . . . . .
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
tension
wool L M H
A 9 9 9
B 9 9 9
This example depicts how a model matrix would be built for the model breaks ~ wool + tension.
Since this is a main effects model (no interactions), the desired model matrix is simply the concatenation of
the model matrices of the main effects. There are two here, but the principle applies to general main effects
of factors.
The sparsest matrix is obtained by not using an intercept (which would give a column of all ones) and
instead coding one factor fully (so that it “swallows” the intercept), with all other factors at "treatment" contrasts,
i.e., here, the transposed model matrix, tmm, is
> ## rbind() works directly on sparse matrices; rBind() is deprecated in newer Matrix versions
> tmm <- with(warpbreaks,
+ rbind(as(tension, "sparseMatrix"),
+ as(wool, "sparseMatrix")[-1,,drop=FALSE]))
> print( image(tmm) ) # print(.) the lattice object
[Figure: image(tmm) plot of the transposed model matrix; axes "Row" (1 to 4) and "Column" (ticks 10 to 50); Dimensions: 4 x 54]
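As a cross-check of the construction above, the following sketch builds the same transposed matrix with fac2sparse() and rbind(), and compares it with the result of the formula interface sparse.model.matrix(). The names smm and same are chosen here for illustration, and the formula ~ 0 + tension + wool is our reading of "no intercept, tension fully coded, wool at treatment contrasts".

```r
library(Matrix)

# Stack the indicator rows of tension (all levels) and wool (baseline dropped)
tmm <- with(warpbreaks,
            rbind(fac2sparse(tension),
                  fac2sparse(wool)[-1, , drop = FALSE]))

# Formula interface: no intercept, so the first factor is coded fully
smm <- sparse.model.matrix(~ 0 + tension + wool, data = warpbreaks)

# Same entries, up to transposition and dimnames
same <- all(unname(as.matrix(t(tmm))) == unname(as.matrix(smm)))
```

Stacking rows by hand avoids the formula machinery altogether, which can matter when the factors are large; for everyday use the formula interface is usually more convenient.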
The matrices are even sparser when the factors have more than just two or three levels, e.g., for the morley
data set,
> data(morley) # a standard R data set
> morley$Expt <- factor(morley$Expt)
> morley$Run <- factor(morley$Run)
> str(morley)
[Figure: image plot of a sparse matrix; axes "Row" (ticks 5 to 20) and "Column" (ticks 20 to 80); Dimensions: 24 x 100]
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
[1] 24 13
(Intercept) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
block2 . . . . 1 1 1 1 . . . . . . . . . . . . . . . .
block3 . . . . . . . . 1 1 1 1 . . . . . . . . . . . .
block4 . . . . . . . . . . . . 1 1 1 1 . . . . . . . .
block5 . . . . . . . . . . . . . . . . 1 1 1 1 . . . .
block6 . . . . . . . . . . . . . . . . . . . . 1 1 1 1
N1 . 1 . 1 1 1 . . . 1 1 . 1 1 . . 1 . 1 . 1 1 . .
P1 1 1 . . . 1 . 1 1 1 . . . 1 . 1 1 . . 1 . 1 1 .
K1 1 . . 1 . 1 1 . . 1 . 1 . 1 1 . . . 1 1 1 . 1 .
N1:P1 . 1 . . . 1 . . . 1 . . . 1 . . 1 . . . . 1 . .
N1:K1 . . . 1 . 1 . . . 1 . . . 1 . . . . 1 . 1 . . .
P1:K1 1 . . . . 1 . . . 1 . . . 1 . . . . . 1 . . 1 .
N1:P1:K1 . . . . . 1 . . . 1 . . . 1 . . . . . . . . . .
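A matrix with these row names can be reproduced as sketched below. Note the assumptions: the formula ~ block + N * P * K and the use of the standard npk data set are inferred from the printed row names and the 24 x 13 dimension, not stated explicitly in the text.

```r
library(Matrix)

# Assumption: formula and data set are inferred from the printed row names
frm <- ~ block + N * P * K
mm  <- model.matrix(frm, data = npk)          # dense, 24 x 13
smm <- sparse.model.matrix(frm, data = npk)   # sparse, same entries

# Fraction of entries that are actually stored as nonzero
density <- nnzero(smm) / prod(dim(smm))

t(smm)                                        # transposed for compact printing
```

The interaction columns are the sparsest ones, since a product of indicators is nonzero only where all of its factors are.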
Another example was reported by a user on R-help (July 15, 2008,
https://stat.ethz.ch/pipermail/r-help/2008-July/167772.html) about an “aov error with large data set”.
I’m looking to analyze a large data set: a within-Ss 2*2*1500 design with 20 Ss. However, aov() gives me
an error.
And gave the following code example (slightly edited):
> id <- factor(1:20)
> a <- factor(1:2)
> b <- factor(1:2)
> d <- factor(1:1500)
> aDat <- expand.grid(id=id, a=a, b=b, d=d)
> aDat$y <- rnorm(length(aDat[, 1])) # generate some random DV data
> dim(aDat) # 120'000 x 5 (120'000 = 2*2*1500 * 20 = 6000 * 20)
[1] 120000 5
[1] 12000 4
1 the following is not run in R on purpose, rather just displayed here
> dim(mm <- model.matrix( ~ a*b*d, data=tmp2))
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
40.1 bytes
shows that even for the small d here, the memory reduction would be more than an order of magnitude,
and working with the sparse instead of the dense model matrix is considerably faster as well,
> x <- 1:600
> system.time(y <- smm %*% x) ## sparse is much faster
> identical(as.matrix(y), y.) ## TRUE
[1] TRUE
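The comparison can be tried on a reduced design, as in the following sketch. The data frame dat and the choice of 30 levels for d are assumptions made here to keep the dense matrix small; they are not the sizes used above. It builds both versions of the model matrix, multiplies each by the same vector, and compares memory use.

```r
library(Matrix)

# Assumption: a reduced design (d has 30 levels) so the dense matrix stays small
dat <- expand.grid(id = factor(1:20), a = factor(1:2),
                   b = factor(1:2), d = factor(1:30))

mm  <- model.matrix(~ a * b * d, data = dat)         # dense, 2400 x 120
smm <- sparse.model.matrix(~ a * b * d, data = dat)  # sparse, same entries

x <- seq_len(ncol(mm))
y_dense  <- mm  %*% x                                # base matrix result
y_sparse <- smm %*% x                                # Matrix-class result

# Ratio of memory footprints, dense over sparse
size_ratio <- as.numeric(object.size(mm)) / as.numeric(object.size(smm))
```

Timings on a matrix this small are dominated by overhead, so system.time() becomes informative only for larger d; the memory ratio already shows the trend.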
> toLatex(sessionInfo())
• BLAS: /sfs/u/maechler/R/D/r-pre-rel/64-linux-inst/lib/libRblas.so
• LAPACK: /sfs/u/maechler/R/D/r-pre-rel/64-linux-inst/lib/libRlapack.so
• Base packages: base, datasets, grDevices, graphics, methods, stats, utils
• Other packages: Matrix 1.2-14
• Loaded via a namespace (and not attached): compiler 3.5.0, grid 3.5.0, lattice 0.20-35, tools 3.5.0