Feature Selection For Nonlinear Kernel Support Vector Machines
(1) Pick a random E = diag(1 or 0) with cardinality of E inversely proportional to σ. Pick a fixed integer k, typically very large, for the number of sweeps through E, and a stopping tolerance tol, typically 1e−6.

(2) Solve the linear program (4) for a fixed E and denote its solution by (u, γ, y, s).

(3) For ℓ = 1, . . . , kn and j = 1 + (ℓ − 1) mod n:

(a) Replace Ejj by 1 if it is 0 and by 0 if it is 1.

(b) Compute f(E) = νe′(e − D(K(AE, EA′)u − eγ))+ + σe′Ee before and after changing Ejj.

(c) Keep the new Ejj only if f(E) decreases by more than tol. Else undo the change in Ejj. Go to (a) if j < n.

(d) Go to (4) if the total decrease in f(E) is less than or equal to tol in the last n steps.

(4) Solve the linear program (4) for a fixed E and denote its solution by (u, γ, y, s). Stop if the objective function decrease of (4) is less than tol.

(5) Go to (3).

Remark 2.2. We note that f(E) in the RFSVM algorithm is equivalent to νe′y + σe′Ee when y takes on its optimal value generated by the first and the next-to-the-last sets of constraints of (4). Note that f(E) still depends on E even for the case when σ = 0.

We establish now convergence of the RFSVM algorithm for tol = 0; however, computationally we use tol = 1e−6.

Proposition 2.3. RFSVM Convergence. For tol = 0, the nonnegative nonincreasing values of the sequence of objective function values {νe′y^r + e′s^r + σe′E^r e}, r = 1, . . . , ∞, where the superscript r denotes the iteration number of step (4) of Algorithm 2.1, converge to (νe′ȳ + e′s̄ + σe′Ēe), where (ū, γ̄, ȳ, s̄, Ē) is any accumulation point of the sequence of iterates {u^r, γ^r, y^r, s^r, E^r} generated by Algorithm 2.1. The point (ū, γ̄, ȳ, s̄, Ē) has the following local minimum property:

(νe′ȳ + e′s̄ + σe′Ēe) = min_{u,γ,y,s} νe′y + e′s + σe′Ēe
    s.t. D(K(AĒ, ĒA′)u − eγ) + y ≥ e
         −s ≤ u ≤ s
         y ≥ 0,    (5)

and for p = 1, . . . , n:

f(Ē) ≤ f(E), for Epp = 1 − Ēpp, Ejj = Ējj, j ≠ p.    (6)

Proof. That the sequence {νe′y^r + e′s^r + σe′E^r e}, r = 1, . . . , ∞, converges follows from the fact that it is nonincreasing and bounded below by zero. That (5) is satisfied follows from the fact that each point of the sequence {u^r, γ^r, y^r, s^r, E^r} satisfies (5) with (ū, γ̄, ȳ, s̄, Ē) replaced by {u^r, γ^r, y^r, s^r, E^r}, on account of step (4) of Algorithm 2.1. That (6) is satisfied follows from the fact that each point of the sequence {E^r} satisfies (6) with Ē replaced by {E^r}, on account of step (3) of Algorithm 2.1. Hence every accumulation point (ū, γ̄, ȳ, s̄, Ē) of {u^r, γ^r, y^r, s^r, E^r} satisfies (5) and (6).

It is important to note that by repeating steps (3) and (4) of Algorithm 2.1, a feature dropped in one sweep through the integer variables may be added back in another cycle, and conversely. Thus, our algorithm is not merely a naïve greedy approach, because the choices of one iteration may be reversed in later iterations, and we have observed this phenomenon in our experiments. However, cycling is avoided by choosing tol > 0, which ensures that the sequence of objective values generated by Algorithm 2.1 is strictly decreasing. It is also important to note that when changing the integer variables in step (3), only the objective function needs to be recomputed, which is much faster than solving the linear program in step (4). In fact, as we shall discuss in Section 3, we have found that the cycle through the integer variables in step (3) tends to be repeated more often than the linear program of step (4). We turn now to computational testing of our approach.

3. Computational Results

We illustrate the effectiveness of our Reduced Feature SVM (RFSVM) on two datasets from the UCI Machine Learning Repository [13] and on synthetic data generated using Michael Thompson's NDCC generator [17]. The UCI datasets are used to compare the feature selection and classification accuracy of RFSVM to the following two algorithms: recursive feature elimination (RFE), a wrapper method designed for SVMs [8], and Relief, a filter method [14]. A feature-reducing linear kernel 1-norm SVM (SVM1) [1] and a nonlinear kernel 1-norm SVM (NKSVM1) [11] with no feature selection are used as baselines. The synthetic NDCC data is used to illustrate the effectiveness of RFSVM on problems with large numbers of features, including a problem with 1000 features, 900 of which are irrelevant.

3.1 UCI Datasets

We use the UCI datasets to compare RFSVM to two other algorithms. RFE and Relief are used to illustrate how
RFSVM maintains classification accuracy for different degrees of feature selection. SVM1 and NKSVM1 are used to establish baselines for feature selection and classification accuracy. For the sake of efficiency, we use the experimental methodology described below to compare the algorithms. We first briefly describe RFE and Relief.

3.1.1. RFE. Recursive Feature Elimination (RFE) is a wrapper method designed for SVMs [8]. First an SVM (u, γ) is learned using all features, then features are ranked based on how much the margin u′K(A, A′)u changes when each feature is removed separately. Features which have a small effect on the margin are considered less relevant. A given percentage of the least relevant features are removed, and the entire procedure is repeated with the remaining features. In our experiments, we remove one feature at a time until the reported number of features is reached. Note that our RFSVM procedure uses the objective value f(E) to determine whether to include or remove each feature, and removes or keeps features if the objective function decreases by more than a threshold, without first ranking the features. Furthermore, once a feature is removed by RFE it is never again considered for inclusion in the final classifier, while any feature removed during a sweep through the integer variables E in our Algorithm 2.1 may be included by a later sweep.

3.1.2. Relief. Relief is a filter method for selecting features [14]. Features are ranked by computing weights as follows. For a randomly chosen training example, find the nearest example with the same class (the nearest hit), and the nearest example in the other class (the nearest miss). Then update the weight of each feature by subtracting the absolute value of the difference in feature values between the example and the nearest hit, and adding the absolute value of the difference between the example and the nearest miss. This procedure is then repeated several times, with a different random example each time. Features with high weight are considered more relevant. Relief may be used with any binary classification algorithm, but in the following we use it exclusively with a 1-norm Gaussian kernel SVM.

3.1.3. Methodology. To save time, we tuned each algorithm by using 1/11 of each dataset as a tuning set, and performed ten-fold cross validation on the remaining 10/11. The tuning set was used to choose the parameters ν and µ on the first fold, and the chosen parameters were then used for the remaining nine folds. In order to avoid bias due to the choice of the tuning set, we repeated the above procedure five times using a different, randomly selected, tuning set each time. This procedure allows us to efficiently investigate the behavior of the feature selection algorithms RFSVM, RFE, and Relief on the datasets. Since the algorithms exhibit similar behavior on the datasets, we believe that our results support the conclusion that RFSVM is effective for learning nonlinear classifiers with reduced input space features.

For all the algorithms, we chose ν and the Gaussian kernel parameter µ from the set {2^i | i ∈ {−7, . . . , 7}}. For each dataset, we evaluated the accuracy and number of features selected at σ ∈ {0, 1, 2, 4, 8, 16, 32, 64}. The diagonal of E was randomly initialized so that max{n/σ, 1} features were present in the first linear program, where n is the number of input space features for each dataset. As σ increases, the penalty on the number of features begins to dominate the objective. We only show values of σ for which we obtained reliable results. For RFE, we removed 1 feature per iteration. For Relief, we used 1000 iterations to determine the feature weights.

3.1.4. Results and discussion. Figure 1 gives curves showing the accuracy of RFSVM versus the number of input space features used on the Ionosphere and Sonar datasets. Each point on the curve is obtained by averaging five ten-fold cross validation experiments for a fixed σ. The square points denote the accuracy of NKSVM1, an ordinary nonlinear classifier which uses all the input space features. The points marked by triangles represent the accuracy and feature reduction of SVM1, a linear classifier which is known to reduce features [1]. Results for RFE are denoted by '+', and results for Relief are denoted by '', while results of our RFSVM algorithm are denoted by circles. Note that RFSVM is potentially able to obtain a higher accuracy than the linear classifier using approximately the same number of features on the Ionosphere and Sonar datasets. Note also that even for σ = 0, RFSVM was able to reduce features based only on decrease in the objective term e′y. RFSVM is comparable in both classification accuracy and feature selection to RFE and Relief.

To illustrate the efficiency of our approach, we report the CPU time taken on the Ionosphere dataset. On this dataset, the RFSVM algorithm required an average of 6 cycles through the integer variables on the diagonal of the matrix E, and the solution of 3 linear programs. The averages are taken over the classifiers learned for each fold once the parameters were selected. Using the MATLAB profiler, we found that the CPU time taken for one complete experiment on the Ionosphere dataset was 60.8 minutes. The experiment required 1960 runs of the RFSVM algorithm. Of this time, approximately 75% was used in evaluating the objective function, and 15% was used in solving linear programs. Our experience with the RFSVM algorithm is that the bottleneck is often the objective function evaluations rather than the linear programs, which suggests that significant speedups could be obtained by using more restrictive settings of the number of sweeps k and the tolerance tol for decreasing f(E). These measurements were taken using
MATLAB 7.2 [12] under CentOS Linux 4.3 running on an Intel 3.0 GHz Pentium 4. The linear programs were solved

[Figure 1 (plot residue): panel title "Sonar Dataset: 208 points in R^60"; x-axis "Number of Input Space Features Used"; y-axis "Average Ten-Fold Cross Validation Accuracy"; RFSVM points labeled by σ = 64, 32, 16, 8, 4, 2, 1, 0; legend: RFSVM, NKSVM1, Relief, RFE, SVM1.]

Figure 2. RFSVM1 and NKSVM1 on NDCC data with 20 true features and 80, 180, and 480 irrelevant random features. Each point is the average of the test set accuracy over two independently generated datasets.
[Figure 2 (plot residue): title "NDCC Dataset: 200 Training Points with 20 True Features and Varying Number of Random Features"; x-axis "Total Number of Features Given to RFSVM and NKSVM1 Including the 20 True Features" (100, 200, 500).]
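The division of work measured above — cheap re-evaluations of f(E) in the sweeps of step (3) versus the more expensive linear program solves of step (4) — can be illustrated with a minimal sketch. This is not the paper's MATLAB implementation: the pure-Python Gaussian kernel, the toy data, and the omission of the linear program solve (a real implementation would re-solve (4) for (u, γ, y, s) between sweeps, e.g. with CLP or CPLEX) are all illustrative assumptions.

```python
import math

def gaussian_kernel(x, z, mu):
    # Gaussian kernel K(x, z) = exp(-mu * ||x - z||^2)
    return math.exp(-mu * sum((a - b) ** 2 for a, b in zip(x, z)))

def objective(A, d, u, gamma, E, nu, sigma, mu):
    # f(E) = nu * e'(e - D(K(AE, EA')u - e*gamma))_+ + sigma * e'Ee,
    # with E a 0/1 list standing in for the diagonal matrix E.
    AE = [[xj * ej for xj, ej in zip(x, E)] for x in A]  # rows of AE
    hinge = 0.0
    for i in range(len(A)):
        score = sum(uk * gaussian_kernel(AE[i], AE[k], mu)
                    for k, uk in enumerate(u)) - gamma
        hinge += max(0.0, 1.0 - d[i] * score)            # plus function (.)_+
    return nu * hinge + sigma * sum(E)

def sweep(A, d, u, gamma, E, nu, sigma, mu, tol=1e-6):
    # One sweep of step (3): flip each E_jj, keep the flip only if f(E)
    # drops by more than tol. Only f(E) is recomputed here; re-solving
    # linear program (4) is the separate step (4), omitted in this sketch.
    f = objective(A, d, u, gamma, E, nu, sigma, mu)
    for j in range(len(E)):
        E[j] = 1 - E[j]
        f_new = objective(A, d, u, gamma, E, nu, sigma, mu)
        if f_new < f - tol:
            f = f_new          # keep: strict decrease, so no cycling
        else:
            E[j] = 1 - E[j]    # undo the flip
    return E, f
```

Because each flip only re-evaluates f(E), a sweep costs n objective evaluations, which is consistent with the profile reported above (most CPU time in objective evaluations, few linear program solves).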
important topic of future research.

When using Condor, we used the freely available CLP linear programming solver [4] to solve the linear programs and the MATLAB compiler version 4.5 to produce a stand-alone executable which ran Algorithm 2.1 for given values of ν and µ. On the same machine described above, the average time to run this executable for the parameters chosen by the tuning procedure was 115 seconds. Further speedups may be possible for some kernels, including the Gaussian kernel, by using approximations such as [16].

4. Conclusion and Outlook

We have presented a new approach to feature selection for nonlinear SVM classifiers with a completely arbitrary kernel. Our approach is formulated as an easily implementable mixed-integer program and solved efficiently by alternating between a linear program, which computes the continuous parameter values of the classifier, and successive sweeps through the objective function, which update the integer variables representing the presence or absence of each feature. This procedure converges to a local minimum that minimizes both the usual SVM objective and the number of input space features used. Our results on two publicly available datasets and synthetic NDCC data show that our approach efficiently learns accurate nonlinear classifiers with reduced numbers of features. Extension of RFSVM to regression problems, further evaluation of RFSVM on datasets with very large numbers of features, use of different strategies to update the integer variables, and procedures for automatically choosing a value of σ for a desired percentage of feature reduction are important avenues of future research.

Acknowledgments

We thank Hector Corrada Bravo, Greg Quinn, and Nicholas LeRoy for their assistance with Condor. The research described in this Data Mining Institute Report 06-03, July 2006 and revised June 2007, was supported by National Science Foundation Grants CCR-0138308 and IIS-0511905.

References

[1] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Proceedings 15th International Conference on Machine Learning, pages 82-90, San Francisco, California, 1998. Morgan Kaufmann. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps.
[2] P. S. Bradley, O. L. Mangasarian, and D. R. Musicant. Optimization methods in massive datasets. In J. Abello, P. M. Pardalos, and M. G. C. Resende, editors, Handbook of Massive Datasets, pages 439-472, Dordrecht, Netherlands, 2002. Kluwer Academic Publishers. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-01.ps.
[3] S. Dasgupta. Learning mixtures of Gaussians. In IEEE Symposium on Foundations of Computer Science (FOCS) 1999, pages 634-644, 1999.
[4] J. Forrest, D. de la Nuez, and R. Lougee-Heimer. CLP User Guide, 2004. http://www.coin-or.org/Clp/userguide/index.html.
[5] H. Fröhlich and A. Zell. Feature subset selection for support vector machines by incremental regularized risk minimization. In International Joint Conference on Neural Networks (IJCNN), volume 3, pages 2041-2046, 2004.
[6] G. Fung and O. L. Mangasarian. A feature selection Newton method for support vector machine classification. Computational Optimization and Applications, 28(2):185-202, July 2004. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-03.ps.
[7] C. Gold, A. Holub, and P. Sollich. Bayesian approach to feature selection and parameter tuning for support vector machine classifiers. Neural Networks, 18(5-6):693-701, 2005.
[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.
[9] ILOG, Incline Village, Nevada. ILOG CPLEX 9.0 User's Manual, 2003. http://www.ilog.com/products/cplex/.
[10] M. Litzkow and M. Livny. Experience with the Condor distributed batch system. In Proceedings IEEE Workshop on Experimental Distributed Systems, pages 97-101, Huntsville, AL, October 1990. IEEE Computer Society Press.
[11] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 135-146, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.
[12] MATLAB. User's Guide. The MathWorks, Inc., Natick, MA 01760, 1994-2006. http://www.mathworks.com.
[13] P. M. Murphy and D. W. Aha. UCI machine learning repository, 1992. www.ics.uci.edu/~mlearn/MLRepository.html.
[14] M. Robnik-Šikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23-69, 2003.
[15] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[16] Y. Shen, A. Y. Ng, and M. Seeger. Fast Gaussian process regression using kd-trees. In NIPS 18, 2006. http://ai.stanford.edu/~ang/papers/nips05-fastgaussianprocess.pdf.
[17] M. E. Thompson. NDCC: Normally distributed clustered datasets on cubes, 2006. www.cs.wisc.edu/dmi/svm/ndcc/.
[18] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[19] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In NIPS, pages 668-674, 2000.
[20] H. H. Zhang. Variable selection for support vector machines via smoothing spline ANOVA. Statistica Sinica, 16(2):659-674, 2006.