SinhaDu16 PDF
Stanford University
{amans,jduchi}@stanford.edu
Abstract
1 Introduction
An essential element of supervised learning systems is the representation of input data. Kernel
methods [27] provide one approach to this problem: they implicitly transform the data to a new
feature space, allowing non-linear data representations. This representation comes with a cost, as
kernelized learning algorithms require time that grows at least quadratically in the data set size,
and predictions with a kernelized procedure require the entire training set. This motivated Rahimi
and Recht [24, 25] to develop randomized methods that efficiently approximate kernel evaluations
with explicit feature transformations; this approach gives substantial computational benefits for large
training sets and allows the use of simple linear models in the randomly constructed feature space.
Whether we use standard kernel methods or randomized approaches, using the right kernel for a
problem can make the difference between learning a useful or useless model. Standard kernel methods
as well as the aforementioned randomized-feature techniques assume the input of a user-defined
kernel, a weakness if we do not a priori know a good data representation. To address this weakness,
one often wishes to learn a good kernel, which requires substantial computation. We combine kernel
learning with randomization, exploiting the computational advantages offered by randomized features
to learn the kernel in a supervised manner. Specifically, we use a simple pre-processing stage for
selecting our random features rather than jointly optimizing over the kernel and model parameters.
Our workflow is straightforward: we create randomized features, solve a simple optimization problem
to select a subset, then train a model with the optimized features. The procedure results in lower-
dimensional models than the original random-feature approach for the same performance. We give
empirical evidence supporting these claims and provide theoretical guarantees that our procedure is
consistent with respect to the limits of infinite training data and infinite-dimensional random features.
To discuss related work, we first describe the supervised learning problem underlying our approach.
We have a cost $c : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, where $c(\cdot, y)$ is convex for $y \in \mathcal{Y}$, and a reproducing kernel Hilbert
space (RKHS) of functions $\mathcal{F}$ with kernel $K$. Given a sample $\{(x^i, y^i)\}_{i=1}^n$, the usual $\ell_2$-regularized
learning problem is to solve the following (shown in primal and dual forms respectively):
\[
\mathop{\mathrm{minimize}}_{f \in \mathcal{F}} \; \sum_{i=1}^n c(f(x^i), y^i) + \frac{\lambda}{2} \|f\|_2^2,
\quad \text{or} \quad
\mathop{\mathrm{maximize}}_{\alpha \in \mathbb{R}^n} \; -\sum_{i=1}^n c^*(-\alpha_i, y^i) - \frac{1}{2\lambda} \alpha^T G \alpha, \tag{1}
\]
where $\|\cdot\|_2$ denotes the Hilbert space norm, $c^*(u, y) = \sup_z \{uz - c(z, y)\}$ is the convex conjugate
of $c$ (for fixed $y$) and $G = [K(x^i, x^j)]_{i,j=1}^n$ denotes the Gram matrix.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Several researchers have studied kernel learning. As noted by Gönen and Alpaydın [14], most
formulations fall into one of a few categories. In the supervised setting, one assumes a base class
or classes of kernels and either uses heuristic rules to combine kernels [2, 23], optimizes structured
(e.g. linear, nonnegative, convex) compositions of the kernels with respect to an alignment metric
[9, 16, 20, 28], or jointly optimizes kernel compositions with empirical risk [17, 20, 29]. The latter
approaches require an eigendecomposition of the Gram matrix or costly optimization problems
(e.g. quadratic or semidefinite programs) [10, 14], but these models have a variety of generalization
guarantees [1, 8, 10, 18, 19]. Bayesian variants of compositional kernel search also exist [12, 13]. In
un- and semi-supervised settings, the goal is to learn an embedding of the input distribution followed
by a simple classifier in the embedded space (e.g. [15]); the hope is that the input distribution carries
the structure relevant to the task. Despite the current popularity of these techniques, especially deep
neural architectures, they are costly, and it is difficult to provide guarantees on their performance.
Our approach optimizes kernel compositions with respect to an alignment metric, but rather than work
with Gram matrices in the original data representation, we work with randomized feature maps that
approximate RKHS embeddings. We learn a kernel that is structurally different from a user-supplied
base kernel, and our method is an efficiently (near linear-time) solvable convex program.
2 Proposed approach
At a high level, we take a feature mapping, find a distribution that aligns this mapping with the labels
y, and draw random features from the learned distribution; we then use these features in a standard
supervised learning approach.
For simplicity, we focus on binary classification: we have $n$ datapoints $(x^i, y^i) \in \mathbb{R}^d \times \{-1, 1\}$.
Letting $\phi : \mathbb{R}^d \times \mathcal{W} \to [-1, 1]$ and $Q$ be a probability measure on a space $\mathcal{W}$, define the kernel
\[
K_Q(x, x') := \int \phi(x, w) \phi(x', w) \, dQ(w). \tag{2}
\]
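As a minimal sketch of definition (2), the integral can be approximated by a weighted sum over sampled parameters. The code below assumes NumPy and the cosine feature $\phi(x, (w, v)) = \cos(w^T x + v)$ with $w \sim N(0, I)$ and $v \sim \mathrm{Uniform}[0, 2\pi]$; this particular feature map and the names `phi` and `kernel_q` are illustrative assumptions, not taken from the text above. Under these assumptions and uniform weights, the average should approach half of a Gaussian kernel value.

```python
import numpy as np

def phi(X, W, v):
    """Cosine feature phi(x, (w, v)) = cos(w^T x + v); values lie in [-1, 1]."""
    return np.cos(X @ W.T + v)  # shape (n, N_w)

def kernel_q(x, x_prime, W, v, q=None):
    """Approximate K_Q(x, x') by sum_m q_m phi(x, w^m) phi(x', w^m)."""
    n_w = W.shape[0]
    if q is None:
        q = np.full(n_w, 1.0 / n_w)  # uniform weights: Q is the sampling distribution
    fx = phi(x[None, :], W, v)[0]
    fxp = phi(x_prime[None, :], W, v)[0]
    return float(np.sum(q * fx * fxp))

rng = np.random.default_rng(0)
d, n_w = 3, 200_000
W = rng.standard_normal((n_w, d))            # assumed base distribution N(0, I)
v = rng.uniform(0.0, 2.0 * np.pi, size=n_w)  # assumed random offsets
x = np.array([0.3, -0.2, 0.5])
x_prime = np.array([0.1, 0.4, -0.3])

approx = kernel_q(x, x_prime, W, v)
# For this feature map, E[phi(x) phi(x')] = 0.5 * exp(-||x - x'||^2 / 2).
exact = 0.5 * np.exp(-0.5 * np.sum((x - x_prime) ** 2))
```

Passing a non-uniform weight vector `q` to `kernel_q` gives the reweighted kernel that the alignment problem optimizes over.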
We want to find the best kernel $K_Q$ over all distributions $Q$ in some (large, nonparametric) set $\mathcal{P}$ of
possible distributions on random features; we consider a kernel alignment problem of the form
\[
\mathop{\mathrm{maximize}}_{Q \in \mathcal{P}} \; \sum_{i,j} K_Q(x^i, x^j) y^i y^j. \tag{3}
\]
Using randomized features, matching the input and output distances in problem (4) translates to
finding a (weighted) set of points among $w^1, w^2, \ldots, w^{N_w}$ that best describe the underlying dataset,
or, more directly, finding weights $q$ so that the kernel matrix matches the correlation matrix $yy^T$.
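Given a solution $\hat{q}$ to problem (4), we can solve the primal form of problem (1) in two ways. First, we
can apply the Rahimi and Recht [24] approach by drawing $D$ samples $W^1, \ldots, W^D \overset{iid}{\sim} \hat{q}$, defining
features $\phi^i = [\phi(x^i, w^1) \cdots \phi(x^i, w^D)]^T$, and solving the risk minimization problem
\[
\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \left\{ \sum_{i=1}^n c\left( \frac{1}{\sqrt{D}} \theta^T \phi^i, y^i \right) + r(\theta) \right\} \tag{5}
\]
for some regularization $r$. Alternatively, we may set $\phi^i = [\phi(x^i, w^1) \cdots \phi(x^i, w^{N_w})]^T$, where
$w^1, \ldots, w^{N_w}$ are the original random samples from $P_0$ used to solve (4), and directly solve
\[
\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \left\{ \sum_{i=1}^n c\left( \theta^T \mathrm{diag}(\hat{q})^{1/2} \phi^i, y^i \right) + r(\theta) \right\}. \tag{6}
\]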
Notably, if $\hat{q}$ is sparse, the problem (6) need only store the random features corresponding to non-zero
entries of $\hat{q}$. Contrast our two-phase procedure to that of Rahimi and Recht [25], which samples
$W^1, \ldots, W^D \overset{iid}{\sim} P_0$ and solves the minimization problem
\[
\mathop{\mathrm{minimize}}_{\alpha \in \mathbb{R}^{N_w}} \; \sum_{i=1}^n c\left( \sum_{m=1}^D \alpha_m \phi(x^i, w^m), y^i \right) \quad \text{subject to} \quad \|\alpha\|_\infty \le C / N_w, \tag{7}
\]
where $C$ is a numerical constant. At first glance, it appears that we may suffer both in terms of
computational efficiency and in classification or learning performance compared to the one-step
procedure (7). However, as we show in the sequel, the alignment problem (4) can be solved very
efficiently and often yields sparse vectors $\hat{q}$, thus substantially decreasing the dimensionality of
problem (6). Additionally, we give experimental evidence in Section 4 that the two-phase procedure
yields generalization performance similar to standard kernel and randomized feature methods.
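To illustrate why a sparse weight vector shrinks problem (6), here is a minimal sketch assuming NumPy, a squared loss, and a ridge penalty $r(\theta) = \mathrm{reg} \cdot \|\theta\|_2^2$. The loss, the penalty, the synthetic data, and the helper names `scaled_features` and `fit_ridge` are all assumptions made for illustration. It checks that dropping the zero-weight features leaves the fitted predictions unchanged.

```python
import numpy as np

def scaled_features(Phi, q_hat):
    """Rows of diag(q_hat)^{1/2} Phi restricted to the support of q_hat."""
    support = np.flatnonzero(q_hat > 0)
    return np.sqrt(q_hat[support])[:, None] * Phi[support], support

def fit_ridge(Z, y, reg=1e-2):
    """Solve min_theta sum_i (theta^T z^i - y^i)^2 + reg * ||theta||^2, Z is D x n."""
    D = Z.shape[0]
    return np.linalg.solve(Z @ Z.T + reg * np.eye(D), Z @ y)

rng = np.random.default_rng(0)
n, n_w = 50, 20
Phi = rng.uniform(-1.0, 1.0, size=(n_w, n))  # synthetic stand-in for phi(x^i, w^m)
y = rng.choice([-1.0, 1.0], size=n)
q_hat = np.zeros(n_w)
q_hat[[2, 7, 11]] = [0.5, 0.3, 0.2]           # hypothetical sparse solution of (4)

# Reduced problem: only the three features with non-zero weight are kept.
Z_sparse, support = scaled_features(Phi, q_hat)
theta_sparse = fit_ridge(Z_sparse, y)
pred_sparse = theta_sparse @ Z_sparse

# Full problem: all N_w features, most of them scaled to zero.
Z_full = np.sqrt(q_hat)[:, None] * Phi
theta_full = fit_ridge(Z_full, y)
pred_full = theta_full @ Z_full
```

In this sketch the reduced model has 3 parameters instead of 20, which is the dimensionality saving described above.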
The optimization problem (4) has structure that enables efficient (near linear-time) solutions. Define
the matrix $\Phi = [\phi^1 \cdots \phi^n] \in \mathbb{R}^{N_w \times n}$, where $\phi^i = [\phi(x^i, w^1) \cdots \phi(x^i, w^{N_w})]^T \in \mathbb{R}^{N_w}$ is the
randomized feature representation for $x^i$ and $w^m \overset{iid}{\sim} P_0$. We can rewrite the optimization objective as
\[
\sum_{i,j} y^i y^j \sum_{m=1}^{N_w} q_m \phi(x^i, w^m) \phi(x^j, w^m)
= \sum_{m=1}^{N_w} q_m \left( \sum_{i=1}^n y^i \phi(x^i, w^m) \right)^2
= q^T ((\Phi y) \odot (\Phi y)),
\]
where $\odot$ denotes the Hadamard product. Constructing the linear objective requires the evaluation of
$\Phi y$. Assuming that the computation of $\phi$ is $O(d)$, construction of $\Phi$ is $O(n N_w d)$ on a single processor.
However, this construction is trivially parallelizable. Furthermore, computation can be sped up even
further for certain distributions $P_0$. For example, the Fastfood technique can approximate $\Phi$ in
$O(n N_w \log(d))$ time for the Gaussian kernel [21].
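The identity above can be checked numerically. The following sketch assumes NumPy and uses synthetic values in place of real features; the array names are illustrative. It compares the quadratic form over the full kernel matrix with the linear form built from the feature matrix and the label vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_w = 40, 15
Phi = rng.uniform(-1.0, 1.0, size=(n_w, n))  # Phi[m, i] stands in for phi(x^i, w^m)
y = rng.choice([-1.0, 1.0], size=n)
q = rng.dirichlet(np.ones(n_w))               # an arbitrary point on the simplex

# Direct evaluation: build the n x n kernel matrix, then sum y^i y^j K(x^i, x^j).
K_q = Phi.T @ (q[:, None] * Phi)
naive = float(y @ K_q @ y)

# Linear form: v = (Phi y) ⊙ (Phi y), so the objective is q^T v.
v = (Phi @ y) ** 2
fast = float(q @ v)
```

Once `v` is computed, evaluating the objective for a new `q` costs only $O(N_w)$, which is what makes the later bisection cheap.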
The problem (4) is also efficiently solvable via bisection over a scalar dual variable. Using $\lambda \ge 0$ for
the constraint $D_f(Q \| P_0) \le \rho$, a partial Lagrangian is
\[
L(q, \lambda) = q^T ((\Phi y) \odot (\Phi y)) - \lambda \left( D_f(q \| \mathbf{1}/N_w) - \rho \right).
\]
The corresponding dual function is $g(\lambda) = \sup_{q \in \Delta} L(q, \lambda)$, where $\Delta := \{q \in \mathbb{R}^{N_w}_+ : q^T \mathbf{1} = 1\}$
is the probability simplex. Minimizing $g(\lambda)$ yields the solution to problem (4); this is a convex
optimization problem in one dimension so we can use bisection. The computationally expensive step
in each iteration is maximizing $L(q, \lambda)$ with respect to $q$ for a given $\lambda$. For $f(t) = t^k - 1$, we define
$v := (\Phi y) \odot (\Phi y)$ and solve
\[
\mathop{\mathrm{maximize}}_{q \in \Delta} \; q^T v - \lambda \frac{1}{N_w} \sum_{m=1}^{N_w} (N_w q_m)^k. \tag{8}
\]
This has a solution of the form $q_m = \left[ v_m / (\lambda N_w^{k-1}) + \tau \right]_+^{\frac{1}{k-1}}$, where $\tau$ is chosen so that $\sum_m q_m = 1$.
We can find such a $\tau$ by a variant of median-based search in $O(N_w)$ time [11]. Thus, for any $k \ge 2$,
an $\epsilon$-suboptimal solution to problem (4) can be found in $O(N_w \log(1/\epsilon))$ time (see Algorithm 1).
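A short sketch of where this solution form comes from, via the stationarity conditions of problem (8). The multipliers $\mu$ (for the constraint $\sum_m q_m = 1$) and $\beta_m \ge 0$ (for $q_m \ge 0$) are notation introduced here for illustration. Setting the derivative with respect to $q_m$ to zero gives
\[
v_m - \lambda k N_w^{k-1} q_m^{k-1} - \mu + \beta_m = 0,
\qquad \beta_m q_m = 0,
\]
so that, whenever $q_m > 0$,
\[
q_m = \left[ \frac{v_m - \mu}{\lambda k N_w^{k-1}} \right]_+^{\frac{1}{k-1}}.
\]
This is a thresholded power of an affine function of $v_m$, with the shift determined by the normalization $\sum_m q_m = 1$. It has the same shape as the form stated above, where the shift is written as $\tau$; in this sketch the constant factor $k$ appears explicitly next to $\lambda$ rather than being absorbed into the scaling.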
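Algorithm 1 Kernel optimization with $f(t) = t^k - 1$ as divergence
INPUT: distribution $P_0$ on $\mathcal{W}$, sample $\{(x^i, y^i)\}_{i=1}^n$, $N_w \in \mathbb{N}$, feature function $\phi$, $\epsilon > 0$
OUTPUT: $q \in \mathbb{R}^{N_w}$ that is an $\epsilon$-suboptimal solution to (4).
SETUP: Draw $N_w$ samples $w^m \overset{iid}{\sim} P_0$, build feature matrix $\Phi$, compute $v := (\Phi y) \odot (\Phi y)$.
Set $\lambda_u \leftarrow \infty$, $\lambda_l \leftarrow 0$, $\lambda_s \leftarrow 1$
while $\lambda_u = \infty$
    $q \leftarrow \mathop{\mathrm{argmax}}_{q \in \Delta} L(q, \lambda_s)$ // (solution to problem (8))
    if $D_f(q \| \mathbf{1}/N_w) < \rho$ then $\lambda_u \leftarrow \lambda_s$ else $\lambda_s \leftarrow 2\lambda_s$
while $\lambda_u - \lambda_l > \epsilon \lambda_s$
    $\lambda \leftarrow (\lambda_u + \lambda_l)/2$
    $q \leftarrow \mathop{\mathrm{argmax}}_{q \in \Delta} L(q, \lambda)$ // (solution to problem (8))
    if $D_f(q \| \mathbf{1}/N_w) < \rho$ then $\lambda_u \leftarrow \lambda$ else $\lambda_l \leftarrow \lambda$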
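The following is a minimal sketch of Algorithm 1 for the special case $k = 2$, assuming NumPy. For $k = 2$ the divergence is $D_f(q \| \mathbf{1}/N_w) = N_w \|q\|_2^2 - 1$, and completing the square shows that problem (8) is a Euclidean projection of $v / (2 \lambda N_w)$ onto the simplex. The sketch uses a sort-based projection, which costs $O(N_w \log N_w)$, in place of the linear-time median-based search mentioned in the text. All function names and the example vector `v` are hypothetical.

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto the probability simplex (sort-based)."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, z.size + 1)
    k = idx[u - css / idx > 0][-1]   # number of strictly positive coordinates
    tau = css[k - 1] / k
    return np.maximum(z - tau, 0.0)

def chi2_divergence(q):
    """D_f(q || 1/N_w) for f(t) = t^2 - 1, i.e. N_w * ||q||^2 - 1."""
    return q.size * float(q @ q) - 1.0

def inner_max(v, lam):
    """argmax over the simplex of q^T v - lam * N_w * ||q||^2 (problem (8), k = 2)."""
    return project_simplex(v / (2.0 * lam * v.size))

def optimize_kernel_weights(v, rho, eps=1e-6):
    """Doubling search for an upper bracket, then bisection on lambda."""
    lam_u, lam_l, lam_s = np.inf, 0.0, 1.0
    while np.isinf(lam_u):
        q = inner_max(v, lam_s)
        if chi2_divergence(q) < rho:
            lam_u = lam_s
        else:
            lam_s *= 2.0
    while lam_u - lam_l > eps * lam_s:
        lam = 0.5 * (lam_u + lam_l)
        q = inner_max(v, lam)
        if chi2_divergence(q) < rho:
            lam_u = lam
        else:
            lam_l = lam
    return inner_max(v, lam_u)       # lam_u always corresponds to a feasible q

v = np.array([4.0, 1.0, 0.25, 9.0, 0.0, 2.25])  # hypothetical alignment scores
rho = 0.5
q_opt = optimize_kernel_weights(v, rho)
```

Returning the weights at the upper end of the bracket is a design choice of this sketch: it guarantees that the returned vector satisfies the divergence constraint, at the cost of being very slightly conservative.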
Consistency. First, we provide guarantees that the solution to problem (4) approaches a population
optimum as the data and random sampling increase ($n \to \infty$ and $N_w \to \infty$, respectively). We
consider the following (slightly more general) setting: let $S : \mathcal{X} \times \mathcal{X} \to [-1, 1]$ be a bounded
function, where we intuitively think of $S(x, x')$ as a similarity metric between labels for $x$ and $x'$,
and denote $S_{ij} := S(x^i, x^j)$ (in the binary case with $y \in \{-1, 1\}$, we have $S_{ij} = y^i y^j$). We then
define the alignment functions
\[
T(P) := E[S(X, X') K_P(X, X')], \qquad
\hat{T}(P) := \frac{1}{n(n-1)} \sum_{i \ne j} S_{ij} K_P(x^i, x^j),
\]
where the expectation is taken over $S$ and the independent variables $X, X'$. Lemmas 1 and 2 provide
consistency guarantees with respect to the data sample ($x^i$ and $S_{ij}$) and the random feature sample
($w^m$); together they give us the overall consistency result of Theorem 1. We provide proofs in the
supplement (Sections A.1, A.2, and A.3 respectively).
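Lemma 1 (Consistency with respect to data). Let $f(t) = t^k - 1$ for $k \ge 2$. Let $P_0$ be any distribution
on the space $\mathcal{W}$, and let $\mathcal{P} = \{Q : D_f(Q \| P_0) \le \rho\}$. Then
\[
P\left( \sup_{Q \in \mathcal{P}} \left| \hat{T}(Q) - T(Q) \right| \ge t \right) \le \sqrt{2} \exp\left( -\frac{n t^2}{16(1 + \rho)} \right).
\]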
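Lemma 1 shows that the empirical quantity $\hat{T}$ is close to the true $T$. Now we show that, independent
of the size of the training data, we can consistently estimate the optimal $Q \in \mathcal{P}$ via sampling (i.e.
$Q \in \mathcal{P}_{N_w}$).
Lemma 2 (Consistency with respect to sampling features). Let the conditions of Lemma 1 hold.
Then, with $C_\rho = \frac{2(\rho + 1)}{\sqrt{1 + \rho} - 1}$ and $D = \sqrt{8(1 + \rho)}$, we have
\[
\left| \sup_{Q \in \mathcal{P}_{N_w}} \hat{T}(Q) - \sup_{Q \in \mathcal{P}} \hat{T}(Q) \right|
\le 4 C_\rho \sqrt{\frac{\log(2 N_w)}{N_w}} + D \sqrt{\frac{\log \frac{2}{\delta}}{N_w}}
\]
with probability at least $1 - \delta$ over the draw of the samples $W^m \overset{iid}{\sim} P_0$.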
Finally, we combine the consistency guarantees for data and sampling to reach our main result, which
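shows that the alignment provided by the estimated distribution $\hat{Q}$ is nearly optimal.
Theorem 1. Let $\hat{Q}_{N_w}$ maximize $\hat{T}(Q)$ over $Q \in \mathcal{P}_{N_w}$. Then, with probability at least $1 - 3\delta$ over the
sampling of both $(x, y)$ and $W$, we have
\[
\left| T(\hat{Q}_{N_w}) - \sup_{Q \in \mathcal{P}} T(Q) \right|
\le 4 C_\rho \sqrt{\frac{\log(2 N_w)}{N_w}} + D \sqrt{\frac{\log \frac{2}{\delta}}{N_w}} + 2D \sqrt{\frac{2 \log \frac{2}{\delta}}{n}}.
\]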
Generalization performance. The consistency results above show that our optimization procedure
nearly maximizes alignment $T(P)$, but they say little about generalization performance for our model
trained using the optimized kernel. We now show that the class of estimators employed by our method
has strong performance guarantees. By construction, our estimator (6) uses the function class
\[
\mathcal{F}_{N_w} := \left\{ h(x) = \sum_{m=1}^{N_w} \alpha_m \sqrt{q_m} \, \phi(x, w^m) \;\Big|\; q \in \mathcal{P}_{N_w}, \; \|\alpha\|_2 \le B \right\},
\]
and we provide bounds on its generalization via empirical Rademacher complexity. To that end,
define $R_n(\mathcal{F}_{N_w}) := \frac{1}{n} E[\sup_{f \in \mathcal{F}_{N_w}} \sum_{i=1}^n \sigma_i f(x^i)]$, where the expectation is taken over the i.i.d.
Rademacher variables $\sigma_i \in \{-1, 1\}$. We have the following lemma, whose proof is in Section A.4.
Lemma 3. Under the conditions of the preceding paragraph, $R_n(\mathcal{F}_{N_w}) \le B \sqrt{\frac{2(1 + \rho)}{n}}$.
The bound is independent of the number of terms Nw , though in practice we let B grow with Nw .
4 Empirical evaluations
We now turn to empirical evaluations, comparing our approach's predictive performance with that of
Rahimi and Recht's randomized features [24] as well as a joint optimization over kernel compositions
and empirical risk. In each of our experiments, we investigate the effect of increasing dimensionality
of the randomized feature space $D$. For our approach, we use the $\chi^2$-divergence ($k = 2$ or $f(t) =
t^2 - 1$). Letting $\hat{q}$ denote the solution to problem (4), we use two variants of our approach: when
$D < \mathrm{nnz}(\hat{q})$ we use estimator (5), and we use estimator (6) otherwise. For the original randomized
feature approach, we relax the constraint in problem (7) with an $\ell_2$ penalty. Finally, for the joint
optimization in which we learn the kernel and classifier together, we consider the kernel-learning
objective, i.e. finding the best Gram matrix $G$ in problem (1) for the soft-margin SVM [14]:
\[
\begin{aligned}
\mathop{\mathrm{minimize}}_{q \in \mathcal{P}_{N_w}} \; & \sup_{\alpha} \; \alpha^T \mathbf{1} - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^i y^j \sum_{m=1}^{N_w} q_m \phi(x^i, w^m) \phi(x^j, w^m) \\
\text{subject to} \; & 0 \preceq \alpha \preceq C \mathbf{1}, \quad \alpha^T y = 0.
\end{aligned} \tag{9}
\]
We use a standard primal-dual algorithm [4] to solve the min-max problem (9). While this is an
expensive optimization, it is a convex problem and is solvable in polynomial time.
In Section 4.1, we visualize a particular problem that illustrates the effectiveness of our approach
when the user-defined kernel is poor. Section 4.2 shows how learning the kernel can be used to quickly
find a sparse set of features in high dimensional data, and Section 4.3 compares our performance with
unoptimized random features and the joint procedure (9) on benchmark datasets. The supplement
contains more experimental results in Section C.
[Figure 1: two panels. (a) Training data & optimized features for d = 2. (b) Error vs. d, with curves GK-train, GK-test, OK-train, OK-test.]
Figure 1. Experiments with synthetic data. (a) Positive and negative training examples are blue and red,
and optimized randomized features ($w^m$) are yellow. All offset parameters $v^m$ were optimized to be
near 0 or $\pi$ (not shown). (b) Misclassification error of logistic regression model vs. dimensionality of
data. GK denotes random features with a Gaussian kernel, and our optimized kernel is denoted OK.
Figure 1 shows the results of the experiments for $d \in \{2, \ldots, 15\}$. Figure 1(a) illustrates the output
of the optimization when $d = 2$. The selected kernel features $w^m$ lie near $(1, 1)$ and $(-1, -1)$; the
offsets $v^m$ are near 0 and $\pi$, giving the feature $\phi(\cdot, w, v)$ a parity flip. Thus, the kernel computes
similarity between datapoints via neighborhoods of $(1, 1)$ and $(-1, -1)$ close to the classification
boundary. In higher dimensions, this generalizes to neighborhoods of pairs of opposing points along
the surface of the $d$-sphere; these features provide a coarse approximation to vector magnitude.
Performance degradation with $d$ occurs because the neighborhoods grow exponentially larger and
less dense (due to fixed $N_w$ and $n$). Nevertheless, as shown in Figure 1(b), this degradation occurs
much more slowly than that of the Gaussian kernel, which suffers a similar curse of dimensionality
due to its dependence on Euclidean distance. Although somewhat contrived, this example shows that
even in situations with poor base kernels our approach learns a more suitable representation.
In addition to the computational advantages rendered by the sparsity of q after performing the
optimization (4), we can use this sparsity to gain insights about important features in high-dimensional
datasets; this can act as an efficient filtering mechanism before further investigation. We present
one example of this task, studying an aptamer selection problem [6]. In this task, we are given
$n = 2900$ nucleotide sequences (aptamers) $x^i \in \mathcal{A}^{81}$, where $\mathcal{A} = \{\mathrm{A, C, G, T}\}$ and labels $y^i$ indicate
(thresholded) binding affinity of the aptamer to a molecular target. We create one-hot encoded forms
of $k$-grams of the sequence, where $1 \le k \le 5$, resulting in $d = \sum_{k=1}^5 |\mathcal{A}|^k (82 - k) = 105{,}476$
[Figure 3: six panels. Top row: (a) Error vs. D, adult; (b) Error vs. D, reuters; (c) Error vs. D, buzz. Bottom row: (d) Speedup vs. D, adult; (e) Speedup vs. D, reuters; (f) Speedup vs. D, buzz.]
Figure 3. Performance analysis on benchmark datasets. The top row shows training and test misclassification
rates. Our method is denoted as OK and is shown in red. The blue methods are random features
with Gaussian, linear, or arc-cosine kernels (GK, LK, or ACK respectively). Our error and running
time become fixed above $D = \mathrm{nnz}(\hat{q})$, after which we employ estimator (6). The bottom row shows the
speedup factor of using our method over regular random features (speedup = $x$ indicates our method
takes $1/x$ of the time required to use regular random features). Our method is faster at moderate to large
$D$ and shows better performance than the random feature approach at small to moderate $D$.
Table 1: Best test error and corresponding time for our method and for unoptimized random features
Dataset   n, n_test        d       Model      Our error (%), time (s)   Random error (%), time (s)
adult     32561, 16281     123     Logistic   15.54, 3.6                15.44, 43.1
reuters   23149, 781265    47236   Ridge      9.27, 0.8                 9.36, 295.9
buzz      105530, 35177    77      Ridge      4.92, 2.0                 4.58, 11.9
features. We consider the linear kernel, i.e. $\phi(x, w) = x_w$, where $w \sim \mathrm{Uni}(\{1, \ldots, d\})$. Figure 2(a)
compares the misclassification error of our method with that of random $k$-gram features, while Figure
2(b) indicates the weights $q_i$ given to features by our method. In under 0.2 seconds, we whittle down
the original feature space to 379 important features. By restricting random selection to just these
features, we outperform the approach of selecting features uniformly at random when $D \ll d$. More
importantly, however, we can derive insights from this selection. For example, the circled features in
Figure 2(b) correspond to $k$-gram prefixes for the 5-grams GGTTG and GTTGG at indices 60 through
64; G-complexes are known to be relevant for binding affinities in aptamers [6], so this is reasonable.
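The feature dimension quoted for this experiment can be reproduced with a short count. This sketch assumes that each $k$-gram is one-hot encoded separately at every start position of the length-81 sequence, which gives $81 - k + 1 = 82 - k$ positions per $k$; the variable names are illustrative.

```python
alphabet_size = 4   # |A| for A = {A, C, G, T}
seq_len = 81        # length of each aptamer sequence

# For each k, there are |A|^k possible k-grams at each of (seq_len - k + 1) positions.
per_k = {k: alphabet_size ** k * (seq_len - k + 1) for k in range(1, 6)}
d = sum(per_k.values())
```

Most of the dimensions come from the 5-grams, which contribute $1024 \times 77 = 78{,}848$ of the total.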
We now show the benefits of our approach on large-scale datasets, since we exploit the efficiency
of random features with the performance of kernel-learning techniques. We perform experiments
on three distinct types of datasets, tracking training/test error rates as well as total (training + test)
time. For the adult² dataset we employ the Gaussian kernel with a logistic regression model, and
for the reuters³ dataset we employ a linear kernel with a ridge regression model. For the buzz⁴
dataset we employ ridge regression with an arc-cosine kernel of order 2, i.e. $P_0 = N(0, I)$ and
$\phi(x, w) = H(w^T x)(w^T x)^2$, where $H(\cdot)$ is the Heaviside step function [7].
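A minimal sketch of the order-2 arc-cosine feature map described above, assuming NumPy. The function name `arc_cosine_features` and the small example arrays are hypothetical; in practice the rows of `W` would be drawn from $N(0, I)$.

```python
import numpy as np

def arc_cosine_features(X, W):
    """phi(x, w) = H(w^T x) (w^T x)^2 for every row x of X and every row w of W."""
    proj = X @ W.T                             # proj[i, m] = (w^m)^T x^i
    return np.where(proj > 0, proj ** 2, 0.0)  # Heaviside step zeroes out negatives

X = np.array([[1.0, 0.0],
              [0.0, -2.0]])
W = np.array([[1.0, 1.0],
              [-1.0, 0.0]])
features = arc_cosine_features(X, W)

# Random parameters from the base distribution P_0 = N(0, I).
rng = np.random.default_rng(0)
W_random = rng.standard_normal((100, 2))
random_features = arc_cosine_features(X, W_random)
```

The value of the step function at exactly zero does not matter here, because the squared projection is zero at that point anyway.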
² https://archive.ics.uci.edu/ml/datasets/Adult
³ http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm. We consider predicting whether a document has a CCAT label.
⁴ http://ama.liglab.fr/data/buzz/classification/. We use the Twitter dataset.
Table 2: Comparisons with joint optimization on subsampled data
Dataset   Our training / test error (%), time (s)   Joint training / test error (%), time (s)
adult     16.22 / 16.36, 1.8                        14.88 / 16.31, 198.1
reuters   7.64 / 9.66, 0.6                          6.30 / 8.96, 173.3
buzz      8.44 / 8.32, 0.4                          7.38 / 7.08, 137.5
Comparison with unoptimized random features. Results comparing our method with unoptimized
random features are shown in Figure 3 for many values of $D$, and Table 1 tabulates the best
test error and corresponding time for the methods. Our method outperforms the original random
feature approach in terms of generalization error for small and moderate values of $D$; at very large $D$
the random feature approach either matches or surpasses our performance. The trends in speedup
are opposite: our method requires extra optimizations that dominate training time at extremely small
$D$; at very large $D$ we use estimator (6), so our method requires less overall time. The nonmonotonic
behavior for reuters (Figure 3(e)) occurs due to the following: at $D \lesssim \mathrm{nnz}(\hat{q})$, sampling indices
from the optimized distribution takes a non-negligible fraction of total time, and solving the linear
system requires more time when rows of $\Phi$ are not unique (due to sampling).
Performance improvements also depend on the kernel choice for a dataset. Namely, our method
provides the most improvement, in terms of training time for a given amount of generalization error,
over random features generated for the linear kernel on the reuters dataset; we are able to surpass
the best results of the random feature approach 2 orders of magnitude faster. This makes sense when
considering the ability of our method to sample from a small subset of important features. On the
other hand, random features for the arc-cosine kernel are able to achieve excellent results on the
buzz dataset even without optimization, so our approach only offers modest improvement at small to
moderate D. For the Gaussian kernel employed on the adult dataset, our method is able to achieve
the same generalization performance as random features in roughly 1/12 the training time.
Thus, we see that our optimization approach generally achieves competitive results with random
features at lower computational costs, and it offers the most improvements when either the base
kernel is not well-suited to the data or requires a large number of random features (large D) for good
performance. In other words, our method reduces the sensitivity of model performance to the user's
selection of base kernels.
Comparison with joint optimization. Despite the fact that we do not choose empirical risk as our
objective in optimizing kernel compositions, our optimized kernel enjoys competitive generalization
performance compared to the joint optimization procedure (9). Because the joint optimization is
very costly, we consider subsampled training datasets of 5000 training examples. Results are shown
in Table 2, where it is evident that the efficiency of our method outweighs the marginal gain in
classification performance for joint optimization.
5 Conclusion
We have developed a method to learn a kernel in a supervised manner using random features. Although
we consider a kernel alignment problem similar to other approaches in the literature, we exploit
computational advantages offered by random features to develop a much more efficient and scalable
optimization procedure. Our concentration bounds guarantee the results of our optimization procedure
closely match the limits of infinite data ($n \to \infty$) and sampling ($N_w \to \infty$), and our method produces
models that enjoy good generalization performance guarantees. Empirical evaluations indicate that
our optimized kernels indeed learn structure from data, and we attain competitive results on
benchmark datasets at a fraction of the training time for other methods. Generalizing the theoretical
results for concentration and risk to other $f$-divergences is the subject of further research. More
broadly, our approach opens exciting questions regarding the usefulness of simple optimizations on
random features in speeding up other traditionally expensive learning problems.
Acknowledgements. This research was supported by a Fannie & John Hertz Foundation Fellowship
and a Stanford Graduate Fellowship.
References
[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results.
The Journal of Machine Learning Research, 3:463–482, 2003.
[2] A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein–protein interactions. Bioinformatics,
21(suppl 1):i38–i46, 2005.
[3] A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of
optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[4] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of
Independence. Oxford University Press, 2013.
[6] M. Cho, S. S. Oh, J. Nie, R. Stewart, M. Eisenstein, J. Chambers, J. D. Marth, F. Walker, J. A. Thomson,
and H. T. Soh. Quantitative selection and parallel characterization of aptamers. Proceedings of the National
Academy of Sciences, 110(46), 2013.
[7] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing
Systems, pages 342–350, 2009.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of
the 27th International Conference on Machine Learning (ICML-10), pages 247–254, 2010.
[9] C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment.
The Journal of Machine Learning Research, 13(1):795–828, 2012.
[10] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target alignment. In Innovations in
Machine Learning, pages 205–256. Springer, 2006.
[11] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning
in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[12] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in
nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.
[13] M. Girolami and S. Rogers. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd
International Conference on Machine Learning, pages 241–248. ACM, 2005.
[14] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning
Research, 12:2211–2268, 2011.
[15] G. E. Hinton and R. R. Salakhutdinov. Using deep belief nets to learn covariance kernels for Gaussian
processes. In Advances in Neural Information Processing Systems, pages 1249–1256, 2008.
[16] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Optimizing kernel alignment over combinations of kernel.
2002.
[17] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. The Journal of
Machine Learning Research, 12:953–997, 2011.
[18] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of
combined classifiers. Annals of Statistics, pages 1–50, 2002.
[19] V. Koltchinskii, D. Panchenko, et al. Complexities of convex combinations and bounding the generalization
error in classification. The Annals of Statistics, 33(4):1455–1496, 2005.
[20] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with
semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.
[21] Q. Le, T. Sarlós, and A. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In
Proceedings of the 30th International Conference on Machine Learning, pages 244–252, 2013.
[22] D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.
[23] S. Qiu and T. Lane. A framework for multiple kernel support vector regression and its applications to
siRNA efficacy prediction. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 6(2):
190–199, 2009.
[24] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural
Information Processing Systems 20, 2007.
[25] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization
in learning. In Advances in Neural Information Processing Systems 21, 2008.
[26] P. Samson. Concentration of measure inequalities for Markov chains and Φ-mixing processes. Annals of
Probability, 28(1):416–461, 2000.
[27] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press,
2004.
[28] Y. Ying, K. Huang, and C. Campbell. Enhanced protein fold recognition through a novel data integration
approach. BMC Bioinformatics, 10(1):1, 2009.
[29] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th International
Conference on Machine Learning, pages 1191–1198. ACM, 2007.
A Proofs of major results
Before proving our results, we provide a few technical lemmas to which we refer in the sequel, and
we also give a few definitions. The first is the standard definition of sub-Gaussian random variables.
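Definition 1. A random variable $X$ is $\sigma^2$-sub-Gaussian if
\[
E[\exp(\lambda (X - E[X]))] \le \exp\left( \frac{\lambda^2 \sigma^2}{2} \right)
\]
for all $\lambda \in \mathbb{R}$.
Throughout our proofs, for a given $k \in [1, \infty]$, we use $k_* = \frac{k}{k-1}$, so that $1/k + 1/k_* = 1$, to denote
the conjugate to $k$.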
The technical lemmas that we shall need follow. The first is an essentially standard duality result.
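Lemma 4 (Ben-Tal et al. [3]). Let $f$ be any closed convex function with domain $\mathrm{dom}\, f \subset [0, \infty)$,
and let $f^*(s) = \sup_{t \ge 0} \{ts - f(t)\}$ be its conjugate. Then for any distribution $P$ and any function
$g : \mathcal{W} \to \mathbb{R}$ we have
\[
\sup_{Q : D_f(Q \| P) \le \rho} \int g(w) \, dQ(w)
= \inf_{\lambda \ge 0, \eta} \left\{ \lambda \int f^*\left( \frac{g(w) - \eta}{\lambda} \right) dP(w) + \rho \lambda + \eta \right\}.
\]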
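See Section B.1 for a proof of this lemma. Note that as an immediate consequence of this result, we
have an expectation upper bound on empirical versions of $\sup_{Q : D_f(Q \| P) \le \rho} \int g(w) \, dQ(w)$. Indeed,
let $Z_1, \ldots, Z_{N_w}$ be drawn i.i.d. from a base distribution $P_0$. To simplify algebra, we work with a
scaled version of the $f$-divergence: $f(t) = \frac{1}{k}(t^k - 1)$, so the population and empirical constraint sets
we consider are defined by
\[
\mathcal{P} = \left\{ Q : D_f(Q \| P_0) \le \frac{\rho}{k} \right\}
\quad \text{and} \quad
\mathcal{P}_{N_w} := \left\{ q : D_f(q \| \mathbf{1}/N_w) \le \frac{\rho}{k} \right\}.
\]
Then by Lemma 4, we obtain
\[
\begin{aligned}
E\left[ \sup_{Q \in \mathcal{P}_{N_w}} E_Q[Z] \right]
&= E_{P_0}\left[ \inf_{\lambda \ge 0, \eta} \frac{1}{N_w} \sum_{i=1}^{N_w} \lambda f^*\left( \frac{Z_i - \eta}{\lambda} \right) + \eta + \frac{\rho}{k} \lambda \right] \\
&\le \inf_{\lambda \ge 0, \eta} E_{P_0}\left[ \frac{1}{N_w} \sum_{i=1}^{N_w} \lambda f^*\left( \frac{Z_i - \eta}{\lambda} \right) + \eta + \frac{\rho}{k} \lambda \right] \\
&= \inf_{\lambda \ge 0, \eta} \left\{ E_{P_0}\left[ \lambda f^*\left( \frac{Z - \eta}{\lambda} \right) \right] + \frac{\rho}{k} \lambda + \eta \right\} \\
&= \sup_{Q \in \mathcal{P}} E_Q[Z].
\end{aligned} \tag{11}
\]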
The second lemma provides a lower bound on the expectation of certain robust quantities, and we
provide a proof of the lemma in Section B.2.
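Lemma 5. Let $Z = (Z_1, \ldots, Z_{N_w})$ be a random vector of independent random variables $Z_i \overset{iid}{\sim} P_0$,
where $|Z_i| \le M$ with probability 1. Let $k \in [2, \infty]$ and define
$C_{\rho,k} = \frac{2(1 + \rho)^{1/k_*}}{(1 + \rho)^{1/k_*} - 1} \le C_\rho = \frac{2(\rho + 1)}{\sqrt{1 + \rho} - 1}$.
Let $f(t) = \frac{1}{k}(t^k - 1)$. Then
\[
E\left[ \sup_{Q \in \mathcal{P}_{N_w}} E_Q[Z] \right] \ge \sup_{Q \in \mathcal{P}} E_Q[Z] - 4 C_\rho M \sqrt{\frac{\log(2 N_w)}{N_w}}
\]
and
\[
E\left[ \sup_{Q \in \mathcal{P}_{N_w}} E_Q[Z] \right] \le \sup_{Q \in \mathcal{P}} E_Q[Z].
\]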
A.1 Proof of Lemma 1
The result follows from a dual formulation of the expression on the left hand side as well as standard
concentration results for sub-Gaussian random variables. Define
\[
\hat{e}_n(w) := \frac{1}{n(n-1)} \sum_{i \ne j} \left( S_{ij} \phi(x^i, w) \phi(x^j, w) - E[S(X, X') \phi(X, w) \phi(X', w)] \right) \tag{12}
\]
to be the error in the kernel estimate at the kernel parameter $w$. We give our argument by duality,
noting that the lemma is equivalent to proving
\[
P\left( \sup_{Q \in \mathcal{P}} \left| \int \hat{e}_n(w) \, dQ(w) \right| \ge t \right) \le \sqrt{2} \exp\left( -\frac{n t^2}{16(\rho + 1)} \right).
\]
Before continuing, we note the following useful result, whose proof we provide in Section B.3.
Lemma 6. For each fixed $w$, the random variable $\hat{e}_n(w)$ is mean-zero and $\frac{4}{n}$-sub-Gaussian.
To simplify the algebra, we work with a scaled version of the $f$-divergence: $f(t) = \frac{1}{k}(t^k - 1)$, so the
equivalent constraint sets are $\mathcal{P} := \{Q : D_f(Q \| P_0) \le \frac{\rho}{k}\}$ and $\mathcal{P}_{N_w} := \{q : D_f(q \| \mathbf{1}/N_w) \le \frac{\rho}{k}\}$.
In this rescaled form, the convex conjugate of $f(t)$ is $f^*(s) = \frac{1}{k_*} [s]_+^{k_*} + \frac{1}{k}$, where we recall the
definition that $\frac{1}{k} + \frac{1}{k_*} = 1$.
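As a brief check of this conjugate, here is a sketch of the calculation directly from the definition $f^*(s) = \sup_{t \ge 0} \{ts - f(t)\}$ with $f(t) = \frac{1}{k}(t^k - 1)$ and $k > 1$:
\[
f^*(s) = \sup_{t \ge 0} \left\{ ts - \frac{t^k}{k} \right\} + \frac{1}{k}.
\]
If $s \le 0$ the supremum is attained at $t = 0$ and equals $0$. If $s > 0$, setting the derivative $s - t^{k-1}$ to zero gives $t = s^{1/(k-1)}$, and then
\[
ts - \frac{t^k}{k} = s^{\frac{k}{k-1}} \left( 1 - \frac{1}{k} \right) = \frac{1}{k_*} s^{k_*},
\]
using $k_* = \frac{k}{k-1}$ and $1 - \frac{1}{k} = \frac{1}{k_*}$. Combining the two cases gives $f^*(s) = \frac{1}{k_*} [s]_+^{k_*} + \frac{1}{k}$.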
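Using Lemma 4, we obtain
\[
\begin{aligned}
\sup_{Q \in \mathcal{P}} \left| \int \hat{e}_n(w) \, dQ(w) \right|
&\le \sup_{Q \in \mathcal{P}} \int |\hat{e}_n(w)| \, dQ(w) \\
&\le \inf_{\lambda \ge 0} \left\{ \frac{\lambda^{1 - k_*}}{k_*} E_{P_0}\left[ |\hat{e}_n(W)|^{k_*} \right] + \frac{\rho + 1}{k} \lambda \right\} \\
&= (\rho + 1)^{1/k} \, E_{P_0}\left[ |\hat{e}_n(W)|^{k_*} \right]^{1/k_*} \\
&\le \sqrt{\rho + 1} \, E_{P_0}\left[ \hat{e}_n(W)^2 \right]^{1/2},
\end{aligned}
\]
where the second inequality follows by using $\eta = 0$ in Lemma 4 and the last inequality follows from
the fact that $k \ge 2$ and $k_* \le 2$. The expectation $E_{P_0}$ is with respect to the variable $W$ for a fixed $\hat{e}_n$.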
We now see that to prove the theorem, it suffices to show that
\[
P\left( \int \hat{e}_n(w)^2 \, dP_0(w) \ge \frac{t^2}{\rho + 1} \right) \le \sqrt{2} \exp\left( -\frac{n t^2}{16(\rho + 1)} \right).
\]
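By Lemma 6, $\hat{e}_n$ is $4/n$-sub-Gaussian, whence $\mathbb{E}[\exp(\lambda \hat{e}_n(w)^2)] \le \exp\big(-\frac{1}{2} \log(1 - \frac{8\lambda}{n})\big)$ for $\lambda \le n/8$ (recall inequality (10) above). Integrating over $w$, we find that for any distribution $P_0$ we have by the Chernoff bound technique that for $\lambda \le n/8$,
$$\begin{aligned}
\mathbb{P}\Big(\int \hat{e}_n(w)^2\,dP_0(w) \ge \frac{t^2}{\rho + 1}\Big) &\le \mathbb{E}\Big[\exp\Big(\lambda \int \hat{e}_n(w)^2\,dP_0(w)\Big)\Big] \exp\Big(-\frac{\lambda t^2}{\rho + 1}\Big) \\
&\le \mathbb{E}\Big[\int \exp\big(\lambda \hat{e}_n(w)^2\big)\,dP_0(w)\Big] \exp\Big(-\frac{\lambda t^2}{\rho + 1}\Big) \\
&\le \exp\Big(-\frac{1}{2} \log\Big(1 - \frac{8\lambda}{n}\Big)\Big) \exp\Big(-\frac{\lambda t^2}{\rho + 1}\Big).
\end{aligned}$$
Note that $-\log(1 - t) \le t \log 4$ for $t \le \frac{1}{2}$, and take $\lambda = n/16$ to get the result.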
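The final step can be checked numerically. The sketch below assumes the reading that the logarithmic inequality is $-\log(1-t) \le t \log 4$ on $[0, 1/2]$ and that the choice is $\lambda = n/16$; the function names are illustrative and not from the paper. Under that reading the moment bound $\exp(-\frac{1}{2}\log(1 - 8\lambda/n))$ evaluates to $\sqrt{2}$.

```python
import math

def log_bound_holds(t):
    """Check -log(1 - t) <= t * log(4) at one point t in [0, 1/2] (small float tolerance)."""
    return -math.log1p(-t) <= t * math.log(4.0) + 1e-12

def chernoff_prefactor(n, lam):
    """Evaluate exp(-0.5 * log(1 - 8 * lam / n)); requires lam < n / 8."""
    return math.exp(-0.5 * math.log1p(-8.0 * lam / n))

# Check the inequality on a grid of [0, 1/2] (assumed range of validity).
grid_ok = all(log_bound_holds(i / 2000.0) for i in range(0, 1001))

# With lam = n / 16 the argument of the logarithm is 1/2, whatever n is.
prefactor = chernoff_prefactor(n=100, lam=100 / 16.0)
```

The value of `prefactor` does not depend on `n`, which is why a single constant appears in front of the exponential tail.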
A.2 Proof of Lemma 2
where we treat the Sij and xi as fixed and work conditionally; that is, only W is random. We consider
the convergence of
$\sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[F(W)]$ to $\sup_{Q \in \mathcal{P}} \mathbb{E}_Q[F(W)]$.
In the sequel, we suppress dependence on W for notational convenience, and for a sample
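$W_1, \ldots, W_{N_w}$ of random vectors $W_k$, we let
$$F_k = \frac{1}{n(n-1)} \sum_{i \ne j} S_{ij}\,\phi(x_i, W_k)\,\phi(x_j, W_k)$$
$$\|q\|_2 \le N_w^{\frac{k-2}{2k}} \|q\|_k \le N_w^{\frac{k-2}{2k}} (\rho + 1)^{\frac{1}{k}} N_w^{\frac{1}{k} - 1} = (\rho + 1)^{\frac{1}{k}} N_w^{-\frac{1}{2}}. \tag{13}$$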
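That is, the function $(F_1, \ldots, F_{N_w}) \mapsto \sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[F]$ is an $L_{N_w} = \sqrt{\rho + 1}/\sqrt{N_w}$-Lipschitz and convex function of bounded random variables. Using Samson's sub-Gaussian concentration inequality [26] for Lipschitz convex functions of bounded random variables, we have with probability at least $1 - \delta$ that
$$\Big| \sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[F] - \mathbb{E}\Big[\sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[F]\Big] \Big| \le 2\sqrt{2}\, \|F\|_\infty \sqrt{\frac{(1 + \rho) \log\frac{2}{\delta}}{N_w}}. \tag{14}$$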
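The norm chain in (13) can be illustrated numerically. The sketch below assumes the constraint set is $\{q : N_w^{k-1} \sum_i q_i^k \le 1 + \rho\}$, which is what $D_f(q \| \mathbf{1}/N_w) \le \rho/k$ reduces to for $f(t) = \frac{1}{k}(t^k - 1)$. For a sampled probability vector it computes the smallest feasible radius `rho_q` and compares the three quantities in the chain. The helper name is hypothetical.

```python
import numpy as np

def norm_chain(q, k):
    """Return (||q||_2, N^{(k-2)/(2k)} ||q||_k, (1 + rho_q)^{1/k} N^{-1/2}, rho_q).

    rho_q is the smallest rho with N^{k-1} * sum(q^k) <= 1 + rho (assumed constraint form).
    """
    n = q.size
    norm2 = float(np.linalg.norm(q, 2))
    normk = float(np.sum(q ** k) ** (1.0 / k))
    rho_q = float(n ** (k - 1.0) * np.sum(q ** k) - 1.0)
    middle = n ** ((k - 2.0) / (2.0 * k)) * normk
    right = (1.0 + rho_q) ** (1.0 / k) * n ** (-0.5)
    return norm2, middle, right, rho_q

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(500))          # a random point of the simplex
norm2, middle, right, rho_q = norm_chain(q, k=3.0)
lipschitz_constant = np.sqrt((1.0 + rho_q) / q.size)
```

Since $1 + \rho \ge 1$ and $k \ge 2$, the last term is at most $\sqrt{\rho+1}/\sqrt{N_w}$, which matches the Lipschitz constant quoted in the text.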
By the containment (14), we need consider only the convergence of the expectation
$$\mathbb{E}\Big[\sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[F]\Big] \quad \text{to} \quad \sup_{Q \in \mathcal{P}} \mathbb{E}_Q[F].$$
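But of course, this convergence is described precisely by Lemma 5. Thus, combining Lemma 5 with containment (14) gives
$$\Big| \sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[F] - \sup_{Q \in \mathcal{P}} \mathbb{E}_Q[F] \Big| \le 4 C_\rho \|F\|_\infty \sqrt{\frac{\log(2N_w)}{N_w}} + 2\sqrt{2}\, \|F\|_\infty \sqrt{\frac{(1 + \rho) \log\frac{2}{\delta}}{N_w}}.$$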
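We can write
$$\begin{aligned}
\Big| T(\hat{Q}_w) - \sup_{Q \in \mathcal{P}} T(Q) \Big| &\le \Big| \sup_{Q \in \mathcal{P}} T(Q) - \sup_{Q \in \mathcal{P}} \hat{T}(Q) \Big| + \Big| \sup_{Q \in \mathcal{P}} \hat{T}(Q) - \hat{T}(\hat{Q}_w) \Big| + \Big| \hat{T}(\hat{Q}_w) - T(\hat{Q}_w) \Big| \\
&\le \sup_{Q \in \mathcal{P}} \big| T(Q) - \hat{T}(Q) \big| + \Big| \sup_{Q \in \mathcal{P}} \hat{T}(Q) - \hat{T}(\hat{Q}_w) \Big| + \sup_{Q \in \mathcal{P}_{N_w}} \big| \hat{T}(Q) - T(Q) \big|.
\end{aligned}$$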
Now apply Lemma 1 to the first and third terms, apply Lemma 2 to the second term, and use a union
bound to get the result.
A.4 Proof of Lemma 3
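We define the dual representation of the feature matrix: let $\tilde{\Phi} = \Phi^T = [\tilde{\phi}^1 \cdots \tilde{\phi}^{N_w}]$, with columns given by $\tilde{\phi}^m := [\phi(x_1, w^m) \cdots \phi(x_n, w^m)]^T \in \mathbb{R}^n$. Mimicking the proof of Proposition 1 of [8], we have
$$R_n(\mathcal{F}_{N_w}) = \frac{B}{n}\, \mathbb{E}\Bigg[ \sup_{q \in \mathcal{P}_{N_w}} \sqrt{\sigma^T \Big( \sum_{k=1}^{N_w} q_k\, \tilde{\phi}^k (\tilde{\phi}^k)^T \Big) \sigma} \Bigg], \tag{15}$$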
$= 3n^2 - 2n \le 3n^2.$
Then
$$R_n(\mathcal{F}_{N_w}) \le \frac{B}{n} \big[ 3(1 + \rho) n^2 \big]^{\frac{1}{4}} \le B \sqrt{\frac{2(1 + \rho)}{n}}$$
as desired.
B Technical lemmas
B.1 Proof of Lemma 4
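Let $L \ge 0$ satisfy $L(w) = dQ(w)/dP(w)$, so that $L$ is the likelihood ratio between $Q$ and $P$. Then we have
$$\begin{aligned}
\sup_{Q : D_f(Q \| P) \le \rho} \int g(w)\,dQ(w) &= \sup_{\int f(L)\,dP \le \rho,\ \mathbb{E}_P[L] = 1} \int g(w) L(w)\,dP(w) \\
&= \sup_{L \ge 0}\ \inf_{\lambda \ge 0,\, \eta} \Big\{ \int g(w) L(w)\,dP(w) - \lambda \Big( \int f(L(w))\,dP(w) - \rho \Big) - \eta \Big( \int L(w)\,dP(w) - 1 \Big) \Big\} \\
&= \inf_{\lambda \ge 0,\, \eta}\ \sup_{L \ge 0} \Big\{ \int g(w) L(w)\,dP(w) - \lambda \Big( \int f(L(w))\,dP(w) - \rho \Big) - \eta \Big( \int L(w)\,dP(w) - 1 \Big) \Big\},
\end{aligned}$$
where we have used that strong duality obtains because the problem is strictly feasible in its non-linear constraints (take $L \equiv 1$), so that the extended Slater condition holds [22, Theorem 8.6.1 and Problem 8.7]. Noting that $L$ is simply a positive (but otherwise arbitrary) function, we obtain
$$\begin{aligned}
\sup_{Q : D_f(Q \| P) \le \rho} \int g(w)\,dQ(w) &= \inf_{\lambda \ge 0,\, \eta} \int \sup_{\ell \ge 0} \big\{ (g(w) - \eta)\ell - \lambda f(\ell) \big\}\,dP(w) + \lambda\rho + \eta \\
&= \inf_{\lambda \ge 0,\, \eta} \int \lambda f^*\Big( \frac{g(w) - \eta}{\lambda} \Big)\,dP(w) + \rho\lambda + \eta.
\end{aligned}$$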
Here we have used that $f^*(s) = \sup_{t \ge 0} \{st - f(t)\}$ is the conjugate of $f$ and that $\lambda \ge 0$, so that we may divide and multiply by $\lambda$ in the supremum calculation.
We remark that the upper bound in the lemma is immediate from the argument for inequality (11).
Thus we focus only on the lower bound claimed in the lemma.
Before beginning the proof proper, we state a useful lemma lower bounding expectations of various
moments of random variables. (See Section B.4 for a proof.)
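Lemma 7. Let $Z \ge 0$, $Z \not\equiv 0$ be a random variable with finite $2p$-th moment for $1 \le p \le \infty$. Then we have the following inequalities:
$$\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] \ge \|Z\|_p - \begin{cases} \frac{p-1}{p} \sqrt{\frac{2}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\, \|Z\|_2, & \text{if } p \le 2 \\[4pt] 2 \min\Big\{ \frac{p-1}{p} \sqrt{\frac{1}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\, \|Z\|_p,\ \frac{1}{n} \Big( \frac{p-1}{p} \Big)^2 \frac{\mathrm{Var}(Z^p)}{\|Z\|_p^{2p-1}} \Big\}, & \text{if } p \ge 2. \end{cases} \tag{16a}$$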
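We also rescale $\rho$ to $\rho/k$ for algebraic convenience. For the function $f(t) = \frac{1}{k}(t^k - 1)$, we have $f^*(s) = \frac{1}{k_*}[s]_+^{k_*} + \frac{1}{k}$, so that the duality result in Lemma 4 shows that (after taking an infimum over $\lambda \ge 0$)
$$\sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[Z] = \inf_{\eta} \Bigg\{ (1 + \rho)^{1/k} \Big( \frac{1}{N_w} \sum_{i=1}^{N_w} [Z_i - \eta]_+^{k_*} \Big)^{\frac{1}{k_*}} + \eta \Bigg\}.$$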
Because $|Z_i| \le M$ for all $i$, we claim that any $\eta$ minimizing the preceding expression must satisfy
$$\eta \in \Bigg[ -\frac{1 + (1 + \rho)^{\frac{1}{k_*}}}{(1 + \rho)^{\frac{1}{k_*}} - 1},\ 1 \Bigg] \cdot M. \tag{17}$$
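Indeed, it is clear that $\eta \le M$, because otherwise we would have $S_{N_w}(\eta) > M \ge \inf_\eta S_{N_w}(\eta)$. The lower bound on $\eta$ is somewhat less trivial. Let $\eta = -cM$ for some $c > 1$. Taking derivatives of the objective $S_{N_w}(\eta)$ with respect to $\eta$, we have
$$\begin{aligned}
S'_{N_w}(\eta) &= 1 - (1 + \rho)^{1/k}\, \frac{\frac{1}{N_w} \sum_{i=1}^{N_w} [Z_i - \eta]_+^{k_* - 1}}{\Big( \frac{1}{N_w} \sum_{i=1}^{N_w} [Z_i - \eta]_+^{k_*} \Big)^{1 - \frac{1}{k_*}}} \le 1 - (1 + \rho)^{1/k}\, \frac{((c - 1)M)^{k_* - 1}}{((c + 1)M)^{k_* - 1}} \\
&= 1 - (1 + \rho)^{1/k} \Big( \frac{c - 1}{c + 1} \Big)^{k_* - 1}.
\end{aligned}$$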
Defining the constant $c_{\rho,k} := \frac{(1 + \rho)^{1/k_*} + 1}{(1 + \rho)^{1/k_*} - 1}$, we see that for any $c > c_{\rho,k}$, the preceding display is negative, so we must have $\eta \ge -c_{\rho,k} M$ (since the derivative is 0 at optimality). For the remainder of the proof, we thus define the interval
$$U := [-M c_{\rho,k},\ M], \qquad c_{\rho,k} = \frac{(1 + \rho)^{\frac{1}{k_*}} + 1}{(1 + \rho)^{\frac{1}{k_*}} - 1},$$
and we assume w.l.o.g. that $\eta \in U$.
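The dual formula and the localization of its minimizer can be checked numerically in the case $k = k_* = 2$. The sketch below is an illustration under stated assumptions, not the authors' code: it takes $S_{N_w}(\eta) = \sqrt{1+\rho}\,\big(\frac{1}{N_w}\sum_i [Z_i - \eta]_+^2\big)^{1/2} + \eta$, minimizes it over $U$ by ternary search (valid because $S_{N_w}$ is convex), and builds the primal candidate $q_i \propto [Z_i - \eta]_+$. The function names and the choice of ternary search are assumptions made for illustration.

```python
import numpy as np

def dual_objective(eta, z, rho):
    """S(eta) = sqrt(1 + rho) * sqrt(mean([z - eta]_+^2)) + eta, the k = k_* = 2 case."""
    excess = np.maximum(z - eta, 0.0)
    return float(np.sqrt(1.0 + rho) * np.sqrt(np.mean(excess ** 2)) + eta)

def minimize_dual(z, rho, iters=200):
    """Ternary search for the minimizer of the convex function S over U = [-c * M, M]."""
    m = float(np.max(np.abs(z)))
    c = (np.sqrt(1.0 + rho) + 1.0) / (np.sqrt(1.0 + rho) - 1.0)
    u_lo, u_hi = -c * m, m
    lo, hi = u_lo, u_hi
    for _ in range(iters):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if dual_objective(left, z, rho) <= dual_objective(right, z, rho):
            hi = right   # a minimizer lies in [lo, right]
        else:
            lo = left    # a minimizer lies in [left, hi]
    eta = 0.5 * (lo + hi)
    return eta, dual_objective(eta, z, rho), (u_lo, u_hi)

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=200)     # bounded draws, so M <= 1
rho = 1.0
eta_star, dual_value, (u_lo, u_hi) = minimize_dual(z, rho)

# Primal candidate suggested by the stationarity condition S'(eta) = 0.
excess = np.maximum(z - eta_star, 0.0)
q = excess / excess.sum()
primal_value = float(q @ z)
chi_square = float(z.size * np.sum(q ** 2) - 1.0)   # equals mean((N q_i)^2 - 1)
```

At the minimizer the candidate `q` should sit on the boundary of the divergence ball (so `chi_square` is close to `rho`) and `primal_value` should agree with `dual_value`, which is the content of the duality statement in this special case.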
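Again applying the duality result of Lemma 4, we have that
$$\begin{aligned}
\mathbb{E}\Big[\sup_{Q \in \mathcal{P}_{N_w}} \mathbb{E}_Q[Z]\Big] &= \mathbb{E}\Big[\inf_{\eta \in U} S_{N_w}(\eta)\Big] = \mathbb{E}\Big[\inf_{\eta \in U} \big\{ S_{N_w}(\eta) - \mathbb{E}[S_{N_w}(\eta)] + \mathbb{E}[S_{N_w}(\eta)] \big\}\Big] \\
&\ge \inf_{\eta \in U} \mathbb{E}[S_{N_w}(\eta)] - \mathbb{E}\Big[\sup_{\eta \in U} \big| S_{N_w}(\eta) - \mathbb{E}[S_{N_w}(\eta)] \big|\Big].
\end{aligned} \tag{18}$$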
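To bound the first term in expression (18), note that $[Z - \eta]_+ \in [0, 1 + c_{\rho,k}]M$ and $(1 + \rho)^{1/k}(1 + c_{\rho,k}) = C_{\rho,k}$. Thus, by Lemma 7 we obtain that
$$\mathbb{E}[S_{N_w}(\eta)] \ge (1 + \rho)^{1/k}\, \mathbb{E}\big[ [Z - \eta]_+^{k_*} \big]^{1/k_*} + \eta - C_{\rho,k} M\, \frac{k_* - 1}{k_*} \sqrt{\frac{2}{N_w}}.$$
Using that $\frac{k_* - 1}{k_*} = \frac{1}{k}$, taking the infimum over $\eta$ on the right hand side and using duality yields
$$\inf_{\eta} \mathbb{E}[S_{N_w}(\eta)] \ge \sup_{Q \in \mathcal{P}} \mathbb{E}_Q[Z] - C_{\rho,k}\, \frac{M}{k} \sqrt{\frac{2}{N_w}}.$$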
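for any fixed $\eta \in \mathbb{R}$ and any $\epsilon \ge 0$. Now, let $N(U, \epsilon) = \{\eta_1, \ldots, \eta_{N(U,\epsilon)}\}$ be an $\epsilon$ cover of the set $U$, which we may take to have size at most $N(U, \epsilon) \le M(1 + c_{\rho,k})\frac{1}{\epsilon}$. Then we have
$$\sup_{\eta \in U} \big| S_{N_w}(\eta) - \mathbb{E}[S_{N_w}(\eta)] \big| \le \max_{i \in N(U,\epsilon)} \big| S_{N_w}(\eta_i) - \mathbb{E}[S_{N_w}(\eta_i)] \big| + \epsilon (1 + \rho)^{1/k}.$$
Using the fact that $\mathbb{E}[\max_{i \le n} |X_i|] \le \sqrt{2\sigma^2 \log(2n)}$ for $X_i$ all $\sigma^2$-sub-Gaussian, we have
$$\mathbb{E}\Big[\max_{i \in N(U,\epsilon)} \big| S_{N_w}(\eta_i) - \mathbb{E}[S_{N_w}(\eta_i)] \big|\Big] \le C_{\rho,k} \sqrt{2\, \frac{M^2}{N_w} \log 2N(U, \epsilon)}.$$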
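Taking $\epsilon = M(1 + c_{\rho,k})/N_w$ gives that
$$\mathbb{E}\Big[\sup_{\eta \in U} \big| S_{N_w}(\eta) - \mathbb{E}[S_{N_w}(\eta)] \big|\Big] \le \sqrt{2}\, M C_{\rho,k} \sqrt{\frac{1}{N_w} \log(2N_w)} + \frac{C_{\rho,k} M}{N_w}.$$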
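The maximal inequality for sub-Gaussian variables used in this step, $\mathbb{E}[\max_{i \le n} |X_i|] \le \sqrt{2\sigma^2 \log(2n)}$, can be illustrated by simulation. The sketch below uses independent Gaussian draws as one convenient example of $\sigma^2$-sub-Gaussian variables; the sample sizes, seed, and function name are arbitrary choices made for illustration.

```python
import numpy as np

def max_abs_gaussian_mean(n, sigma, trials, seed=0):
    """Monte Carlo estimate of E[max_{i <= n} |X_i|] for independent N(0, sigma^2) draws."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(scale=sigma, size=(trials, n))
    return float(np.mean(np.max(np.abs(samples), axis=1)))

n, sigma = 50, 0.3
estimate = max_abs_gaussian_mean(n, sigma, trials=20000)
bound = float(np.sqrt(2.0 * sigma ** 2 * np.log(2 * n)))
```

In the proof the role of `n` is played by the cover size $N(U, \epsilon)$, so the bound grows only logarithmically as the cover is refined.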
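Now we take $\lambda = \|Z\|_p$, and we apply the Cauchy-Schwarz inequality to obtain
$$\begin{aligned}
\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] &\ge \|Z\|_p - \frac{p-1}{p}\, \mathbb{E}\Bigg[ \Big( \frac{\frac{1}{n} \sum_{i=1}^n Z_i^p}{\|Z\|_p^p} - 1 \Big)^2 \Bigg]^{\frac{1}{2}} \mathbb{E}\Bigg[ \Big( \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} - \|Z\|_p \Big)^2 \Bigg]^{\frac{1}{2}} \\
&= \|Z\|_p - \frac{p-1}{p} \sqrt{\frac{1}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\; \mathbb{E}\Bigg[ \Big( \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} - \mathbb{E}[Z^p]^{\frac{1}{p}} \Big)^2 \Bigg]^{\frac{1}{2}} \\
&\ge \|Z\|_p - \frac{p-1}{p} \sqrt{\frac{1}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\; \mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{2}{p}} + \mathbb{E}[Z^p]^{\frac{2}{p}} \Bigg]^{\frac{1}{2}}.
\end{aligned} \tag{19}$$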
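by Jensen, or equivalently, the fact that the norm is non-decreasing in $p$. For $p \ge 2$, we have by the triangle inequality applied to expression (19), followed by an application of Jensen's inequality (using that $\mathbb{E}[Y^{2/p}] \le \mathbb{E}[Y]^{2/p}$ for $p \ge 2$),
$$\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] \ge \|Z\|_p - 2\, \frac{p-1}{p} \sqrt{\frac{1}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\, \|Z\|_p,$$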
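Then, we have three primary relationships $r: Y \ge A - BC^{\frac{1}{2}}$, $s_0: C \le 2A^2 - 2AY$, and $t_0: Y \ge A - 2AB$. Recursion works as follows: for $i \ge 0$, we plug $t_i$ into $s_0$ to yield a tighter inequality $s_{i+1}$ for $C$, which in turn plugs in to $r$ to yield a tighter inequality $t_{i+1}$ for $Y$. In this way, we have the relations $s_i: C \le 4A^2 B^{a_{i-1}}$ for $i \ge 1$, and $t_i: Y \ge A - 2AB^{a_i}$ for $i \ge 0$, where $a_i = 2 - 2^{-i}$. Taking $i \to \infty$, we have $Y \ge A - 2AB^2$, or
$$\begin{aligned}
\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] &\ge \|Z\|_p - 2\|Z\|_p \Big( \frac{p-1}{p} \Big)^2 \frac{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}{n} \\
&= \|Z\|_p - \frac{2}{n} \Big( \frac{p-1}{p} \Big)^2 \frac{\mathrm{Var}(Z^p)}{\|Z\|_p^{2p-1}}.
\end{aligned}$$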
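The exponent sequence in the recursion can be traced explicitly. Under the reading that plugging $t_i$ into $s_0$ gives $C \le 2A^2 - 2A(A - 2AB^{a_i}) = 4A^2 B^{a_i}$, and that $r$ then gives $Y \ge A - 2AB^{1 + a_i/2}$, the exponents obey $a_{i+1} = 1 + a_i/2$ with $a_0 = 1$. The short sketch below iterates this map and compares it with the closed form $2 - 2^{-i}$; the function name is illustrative.

```python
def exponent_sequence(steps):
    """Iterate a_{i+1} = 1 + a_i / 2 from a_0 = 1 and return [a_0, ..., a_steps]."""
    a = [1.0]
    for _ in range(steps):
        a.append(1.0 + a[-1] / 2.0)
    return a

exponents = exponent_sequence(40)
closed_form = [2.0 - 2.0 ** (-i) for i in range(41)]
```

The iterates increase monotonically to the fixed point $a = 2$ of the map $a \mapsto 1 + a/2$, which is where the limiting exponent in $Y \ge A - 2AB^2$ comes from.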
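Thus, we have
$$\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] \ge \|Z\|_p - \begin{cases} \frac{p-1}{p} \sqrt{\frac{2}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\, \|Z\|_2, & \text{if } p \le 2 \\[4pt] 2 \min\Big\{ \frac{p-1}{p} \sqrt{\frac{1}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\, \|Z\|_p,\ \frac{1}{n} \Big( \frac{p-1}{p} \Big)^2 \frac{\mathrm{Var}(Z^p)}{\|Z\|_p^{2p-1}} \Big\}, & \text{if } p \ge 2. \end{cases}$$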
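As a sanity check of the $p \ge 2$ case, the sketch below compares a Monte Carlo estimate of $\mathbb{E}[(\frac{1}{n}\sum_i Z_i^p)^{1/p}]$ with the lower bound $\|Z\|_p - 2\min\{\cdot,\cdot\}$ for $Z \sim \mathrm{Uniform}[0,1]$, whose moments are available in closed form. It assumes the bound reads as the two-term minimum with the factors $\frac{p-1}{p}\sqrt{1/n}\sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\,\|Z\|_p$ and $\frac{1}{n}(\frac{p-1}{p})^2 \mathrm{Var}(Z^p)/\|Z\|_p^{2p-1}$; the distribution, sample sizes, and function names are illustrative choices.

```python
import numpy as np

def uniform_lower_bound(p, n):
    """Assumed p >= 2 lower bound for Z ~ Uniform[0, 1], using exact moments of Z^p."""
    mean_zp = 1.0 / (p + 1.0)                        # E[Z^p]
    var_zp = 1.0 / (2.0 * p + 1.0) - mean_zp ** 2    # Var(Z^p)
    norm_p = mean_zp ** (1.0 / p)                    # ||Z||_p
    ratio = (p - 1.0) / p
    first = ratio * np.sqrt(var_zp / mean_zp ** 2 / n) * norm_p
    second = ratio ** 2 * var_zp / (n * norm_p ** (2.0 * p - 1.0))
    return float(norm_p - 2.0 * min(first, second)), float(norm_p)

def empirical_norm_mean(p, n, trials, seed=0):
    """Monte Carlo estimate of E[(mean_i Z_i^p)^(1/p)] over independent samples of size n."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(size=(trials, n))
    return float(np.mean(np.mean(z ** p, axis=1) ** (1.0 / p)))

p, n = 3.0, 50
lower, norm_p = uniform_lower_bound(p, n)
estimate = empirical_norm_mean(p, n, trials=20000)
```

By Jensen's inequality the estimate should also fall below $\|Z\|_p$, so the simulated value is sandwiched between `lower` and `norm_p`.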
In the case that we have the uniform bound $\|Z\|_\infty \le C$, we can get tighter guarantees. To that end, we state a simple lemma.
Lemma 8. For any random variable $X \ge 0$ and $a \in [1, 2]$, we have
$$\mathbb{E}[X^{ak}] \le \mathbb{E}[X^k]^{2-a}\, \mathbb{E}[X^{2k}]^{a-1}.$$
Proof For $c \in [0, 1]$, $1/p + 1/q = 1$ and $A \ge 0$, we have by Hölder's inequality,
$$\mathbb{E}[A] = \mathbb{E}[A^c A^{1-c}] \le \mathbb{E}[A^{pc}]^{1/p}\, \mathbb{E}[A^{q(1-c)}]^{1/q}.$$
Now take $A := X^{ak}$, $1/p = 2 - a$, $1/q = a - 1$, and $c = \frac{2}{a} - 1$.
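Lemma 8 can be checked on an empirical distribution, for which the expectations are plain sample averages and the inequality must hold exactly (up to rounding). The sketch below evaluates the gap between the two sides over a grid of $a \in [1, 2]$; the sampled distribution, the value of $k$, and the function name are arbitrary illustrative choices.

```python
import numpy as np

def moment_gap(x, a, k):
    """Return rhs - lhs of E[X^{ak}] <= E[X^k]^{2-a} * E[X^{2k}]^{a-1} for the empirical law of x."""
    lhs = np.mean(x ** (a * k))
    rhs = np.mean(x ** k) ** (2.0 - a) * np.mean(x ** (2.0 * k)) ** (a - 1.0)
    return float(rhs - lhs)

rng = np.random.default_rng(1)
x = rng.exponential(size=1000)           # nonnegative samples
gaps = [moment_gap(x, a, k=1.5) for a in np.linspace(1.0, 2.0, 11)]
```

The gap vanishes at the endpoints $a = 1$ and $a = 2$, where the two sides coincide, and is nonnegative in between.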
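First, note that $\mathbb{E}[Z^{2p}] \le C^p\, \mathbb{E}[Z^p]$. For $1 \le p \le 2$, we can take $a = 2/p$ in Lemma 8, so that we have
$$\mathbb{E}[Z^2] \le \mathbb{E}[Z^p]^{2 - \frac{2}{p}}\, \mathbb{E}[Z^{2p}]^{\frac{2}{p} - 1} \le \|Z\|_p^p\, C^{2-p}.$$
Now, we can plug these into the expression above (using $\mathrm{Var}\, Z^p \le \mathbb{E}[Z^{2p}] \le C^p \|Z\|_p^p$):
$$\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] \ge \|Z\|_p - \begin{cases} C\, \frac{p-1}{p} \sqrt{\frac{2}{n}}, & \text{if } p \le 2 \\[4pt] 2 \min\Big\{ \frac{p-1}{p} \sqrt{\frac{1}{n}} \sqrt{\mathrm{Var}(Z^p/\mathbb{E}[Z^p])}\, \|Z\|_p,\ \frac{1}{n} \Big( \frac{p-1}{p} \Big)^2 \frac{\mathrm{Var}(Z^p)}{\|Z\|_p^{2p-1}} \Big\}, & \text{if } p \ge 2. \end{cases}$$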
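In fact, we can give a somewhat sharper result by noting that $\mathbb{E}[(\frac{1}{n} \sum_{i=1}^n Z_i^p)^{1/p}] \ge 0$, and similarly, $\|Z\|_p \ge 0$. For shorthand, let $D = (\frac{p-1}{p})^2 C^p$. Then using that $\mathrm{Var}(Z^p/\mathbb{E}[Z^p]) = \mathrm{Var}(Z^p)/\|Z\|_p^{2p} \le \mathbb{E}[Z^{2p}]/\|Z\|_p^{2p} \le C^p/\|Z\|_p^p$, the preceding inequality, in the case that $p \ge 2$, implies
$$\begin{aligned}
\mathbb{E}\Bigg[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i^p \Big)^{\frac{1}{p}} \Bigg] &\ge \|Z\|_p - 2 \min\Big\{ \sqrt{D/n}\, \|Z\|_p^{1 - p/2},\ (D/n)\, \|Z\|_p^{1-p},\ \|Z\|_p/2 \Big\} \\
&\ge \|Z\|_p - 2 \min\Big\{ \sqrt{D/n}\, \|Z\|_p^{1 - p/2},\ (D/n)\, \|Z\|_p^{1-p},\ \|Z\|_p \Big\}.
\end{aligned}$$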
C More experiments
We present further details of the experiments shown in Section 4 as well as experiments on more
datasets and kernel-learning methods. Specifically, we also show experiments with the ads$^5$, farm$^6$,
mnist$^7$, and weight$^8$ datasets. When training/test splits do not already exist, we split the dataset
into 75% training and 25% test sets.
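A 75%/25% split of this kind can be sketched as follows. The text does not say how rows are assigned to each side, so the uniformly random shuffle, the fixed seed, and the function name below are all assumptions made for illustration.

```python
import numpy as np

def train_test_split_75_25(features, labels, seed=0):
    """Shuffle the rows once and return a 75% training / 25% test partition."""
    n = features.shape[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)            # assumed: uniformly random assignment
    n_train = int(round(0.75 * n))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (features[train_idx], labels[train_idx]), (features[test_idx], labels[test_idx])

# Toy data: 20 rows with 2 features each and binary labels.
x = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20) % 2
(train_x, train_y), (test_x, test_y) = train_test_split_75_25(x, y)
```

Fixing the seed keeps the partition reproducible across the methods being compared, so that every method is evaluated on the same held-out rows.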
Table 3 shows parameters used in our method for each dataset. The last column indicates the size of
the subset of the training data used to solve problem (4). We use subsets to increase the efficiency of
our approach. Furthermore, we show $\rho/N_w$ simply because it is easier to work with this quantity
rather than $\rho$: the value is chosen to balance fit with efficiency via cross validation. Very large $\rho$
yields extremely sparse $\hat{q}$ and poor fit, whereas very small $\rho$ yields dense $\hat{q}$ and long training times.
We note that all values of $\rho$ are less than 1000.
Finally, for ridge regression models, we choose the $\ell_2$
penalty term such that we may absorb the $\sqrt{\hat{q}_i}$ factors into $\theta$.
Table 4 compares the accuracy of our approach (OK) with other methods: random features with 2
values for D, and two standard multiple-kernel-learning algorithms from [14]. Table 5 shows the
(training + test) times of the same methods. Algorithm ABMKSVM(ratio) is a heuristic alignment-
based kernel derived in problem (2) in [14] followed by an SVM. Algorithm MKSVM jointly
optimizes kernel composition with empirical risk via problem (9) in [14]. For both of these methods,
we consider optimizing the combination of a linear, second-order polynomial, and Gaussian kernel.
The two multiple-kernel-learning approaches require an extremely large amount of memory to build
Gram matrices, so we train on subsets of data when necessary to avoid latencies introduced by
swapping data from memory. For ABMKSVM(ratio) we train on n = 17500 for adult and weight,
and n = 10000 for reuters. Similarly, we break up the test data for reuters into ntest = 1000
chunks, which accounts for the large amount of time taken for this dataset (training time was roughly
400s). For MKSVM, we use a subset of size n = 7500 for all applicable datasets, and we use the
same testing scheme as ABMKSVM(ratio) for reuters (training time for MKSVM was roughly
1000s).
The performance of our method on all datasets is consistent: we improve the performance for random
features at a given computational cost, and we are generally competitive with much costlier standard
multiple-kernel-learning techniques. The mnist and weight datasets are slightly peculiar: both
ABMKSVM(ratio) and MKSVM require many support vectors, indicating that the chosen kernels are
poor for the task; this hypothesis is corroborated by the slightly worse performance of both our
method and random features (the arc-cosine kernel is similar to polynomial and Gaussian kernels).
A large number of support vectors roughly translates to large nnz($\hat{q}$), which can be achieved by
increasing $N_w$ or decreasing $\rho$. We can also achieve better performance by increasing the subset
of training data used in problem (4). Doing the latter two options yields comparable results for our
method (Table 6). For the mnist models, we switch to ridge regression to enhance efficiency of the
larger problem. The upshot of this analysis is that our method is most effective in regimes where
standard multiple-kernel-learning techniques are intractable, that is, datasets with both large n and d.
$^5$ http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements. We use all but the first 3 features, which are sometimes missing in the data.
$^6$ https://archive.ics.uci.edu/ml/datasets/Farm+Ads
$^7$ http://yann.lecun.com/exdb/mnist/. We do pairwise classifications of digits 1 vs. 7, 4 vs. 9, and 5 vs. 6.
$^8$ http://archive.ics.uci.edu/ml/datasets/Weight+Lifting+Exercises+monitored+with+Inertial+Measurement+Units. We neglect the first 4 features, and furthermore we only use remaining features that are not missing in any datapoint. We consider classifying the datapoint as class A or not.
Table 3: Dataset parameters