Feature Selection For Nonlinear Kernel Support Vector Machines

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

Seventh IEEE International Conference on Data Mining - Workshops

Feature Selection for Nonlinear Kernel Support Vector Machines

Olvi L. Mangasarian and Edward W. Wild


Computer Sciences Department
University of Wisconsin
Madison, WI 53706
{olvi,wildt}@cs.wisc.edu

Abstract usual nonlinear kernel K(A, A ), where A is the m× n data


matrix, by K(AE, EA ) where E is an n × n diagonal ma-
An easily implementable mixed-integer algorithm is pro- trix of ones and zeros. The proposed algorithm alternates
posed that generates a nonlinear kernel support vector ma- between computing the continuous variables (u, γ) of the
chine (SVM) classifier with reduced input space features. A nonlinear kernel classifier K(x E, EA )u−γ = 0, by using
single parameter controls the reduction. On one publicly linear programming, and the integer diagonal matrix E of
available dataset, the algorithm obtains 92.4% accuracy ones and zeros by successive minimization sweeps through
with 34.7% of the features compared to 94.1% accuracy its components. The algorithm generates a decreasing se-
with all features. On a synthetic dataset with 1000 features, quence of objective function values that converge to a local
900 of which are irrelevant, our approach improves the ac- solution that minimizes the usual data fit and the number of
curacy of a full-feature classifier by over 30%. The pro- kernel functions used while also minimizing the number of
posed algorithm introduces a diagonal matrix E with ones features used. A possibly related result, justifying the use
for features present in the classifier and zeros for removed of reduced features, is that of random projection on a sub-
features. By alternating between optimizing the continu- space of features for Gaussian mixtures which states that
ous variables of an ordinary nonlinear SVM and the integer data from a mixture of k Gaussians can be projected into
variables on the diagonal of E, a decreasing sequence of O(log k) dimensions while retaining approximate separa-
objective function values is obtained. This sequence con- bility of the clusters [3].
verges to a local solution minimizing the usual data fit and There has been considerable recent interest in feature se-
solution complexity while also minimizing the number of lection for SVMs. Weston et al. propose reducing features
features used. based on minimizing generalization bounds via a gradient
approach [19]. In [5], Frölich and Zell introduce an incre-
mental approach based on ranking features by their effect
1. Introduction on the margin. An approach based on a Bayesian interpre-
tation of SVMs is presented by Gold et al. [7], and an ap-
Feature selection is a fairly straightforward procedure for proach based on smoothing spline ANOVA kernels is pro-
linear support vector machine (SVM) classifiers. For exam- posed by Zhang [20]. In [8], Guyon et al. use a wrapper
ple, a 1-norm support vector machine linear classifier ob- method designed for SVMs. Another possibility is to use
tained by either linear programming or concave minimiza- a filter method such as Relief [14] in conjunction with an
tion will easily reduce features [1]. However, when simi- SVM. None of these approaches utilize the straightforward
lar techniques are applied to nonlinear SVM classifiers, the and easily implementable mixed-integer programming for-
resulting reduction is not in the number of input space fea- mulation proposed here.
tures but in the number of kernel functions needed to gen- We briefly outline the contents of the paper now. In
erate the nonlinear classifier [6]. This may be interpreted Section 2 we derive the theory behind our reduced fea-
as a reduction in the dimensionality of the higher dimen- ture support vector machine classifier and state our algo-
sional transformed space, but does not result in any reduc- rithm for generating our RFSVM and its convergence. Sec-
tion of input space features. It is precisely this reduction tion 3 gives computational results on two publicly available
that we are after in this paper, namely, a reduced number datasets that show the effectiveness and utility of RFSVM.
of input space features that we need to input into a nonlin- In particular, we will show that RFSVM is often able to re-
ear SVM classifier. We shall achieve this by replacing the duce features by similar percentages to those of established

0-7695-3019-2/07 $25.00 © 2007 IEEE 231


DOI 10.1109/ICDMW.2007.30
techniques for linear classifiers, while maintaining the ac- [11, 2, 15]:
curacy associated with nonlinear classifiers. We will further
demonstrate that the feature selection and classification ac- min νe y + e s
u,γ,y,s
curacy of RFSVM is comparable to two other feature selec- s.t. D(K(A, A )u − eγ) + y ≥ e,
tion methods which can be applied to nonlinear classifica- (2)
−s ≤ u ≤ s,
tion. Finally, we will show that RFSVM can handle prob- y ≥ 0.
lems with large numbers of features. Section 4 concludes
the paper. Here ν is a positive parameter that weights the misclassifi-
A word about our notation and background material fol- cation error e y = y1 versus the model-simplifying reg-
lows. All vectors will be column vectors unless transposed ularization term e s = u1. In fact, it turns out that min-
to a row vector by a prime superscript  . The scalar (in- imizing e s = u1 leads to a minimal number of kernel
ner) product of two vectors x and y in the n-dimensional functions used in the classifier (1) by zeroing components
real space Rn will be denoted by x y and the p-norm of of the variable u [6]. However, our primary concern here
 n
1 is to use as few components of the input space vector x as
x, ( |xi |p ) p , will be denoted by xp . For a matrix
possible in the nonlinear classifier (1). We proceed to do
i=1
A ∈ Rm×n , Ai denotes the ith row of A, while A·j de- that now as follows.
notes j-th column of A and Aij denotes the element of A We introduce a diagonal matrix E ∈ Rn×n with ones or
in row i and column j. A column vector of ones of ar- zeros on its diagonal. The zeros correspond to suppressed
bitrary dimension will be denoted by e. The base of the input space features and the ones correspond to features uti-
natural logarithm will be denoted by ε. For A ∈ Rm×n lized by the nonlinear classifier (1) which we modify as fol-
and B ∈ Rn×l , the kernel K(A, B) is an arbitrary function lows:
that maps Rm×n × Rn×l into Rm×l . In particular, if x and K(x E, EA )u − γ = 0. (3)
y are column vectors in Rn then, K(x , y) is a real num-
ber, K(x , A ) is a row vector in Rm and K(A, A ) is an In turn, the linear program (2) becomes the following
m × m matrix. We shall employ the widely used Gaussian mixed-integer nonlinear program:
2
kernel [18, 11, 15] ε−µAi −Aj 2 , i, j = 1, . . . , m for all our
numerical tests. The notation E = diag(1 or 0) denotes a
min νe y + e s + σe Ee
diagonal matrix with ones or zeros on the diagonal while x+ u,γ,y,s,E
denotes the nonnegative vector generated from x by setting s.t. D(K(AE, EA )u − eγ) + y ≥ e,
all its negative components to zero. Cardinality of a vector −s ≤ u ≤ s, (4)
or matrix denotes the number of its nonzero components. y ≥ 0,
The abbreviation “s.t.” stands for “subject to.” E = diag(1 or 0),

where σ is a positive parameter that that weights the fea-


2. Reduced Feature Support Vector Machine n
(RFSVM) Formulation and Algorithm ture suppression term e Ee = Eii . Mixed-integer pro-
i=1
grams are basically NP-hard. However, we can easily ob-
We consider a given set of m points in the n-dimensional
tain a local solution by fixing E and solving the resulting
input feature space Rn represented by the matrix A ∈
linear program (4) for (u, γ, y, s), then fixing (u, γ, y, s)
Rm×n . Each point represented by Ai , i = 1, . . . , m, be-
and sweeping through the components of E altering them
longs to class +1 or class -1 depending on whether Dii is 1
successively only if such alteration decreases the objec-
or -1, where D ∈ Rm×m is a given diagonal matrix of plus
tive function. Repeating this process leads to the follow-
or minus ones. We shall attempt to discriminate between the
ing algorithm which, in addition to suppressing input space
classes +1 and -1 by a nonlinear classifier induced by a com-
features, suppresses components of the variable u because
pletely arbitrary kernel K(A, A ) and parameters u ∈ Rm
of the 1-norm term in the objective function and hence
and γ ∈ R, as follows. The classifier (1) is determined by
utilizes a minimal number of kernel function components
the nonlinear surface:
K(AE, (EA )·j ), j = 1, . . . , m.
K(x , A )u − γ = 0, (1) We state our algorithm now. More implementation de-
tails are given in Section 3.
which classifies points as belonging to class +1 if
K(x , A )u − γ > 0 and as class -1 if K(x , A )u − γ < 0. Algorithm 2.1. Reduced Feature SVM (RFSVM) Algo-
The parameters u ∈ Rm and γ ∈ R in the classifier rithm
(1) are determined by solving the following linear program

232
(1) Pick a random E = diag(1 or 0) with cardinality of Proof. That the sequence {νe y r + e sr + σe E r e}r=∞ r=1
E inversely proportional to σ. Pick a fixed integer k, converges follows from the fact that it is nonincreas-
typically very large, for the number of sweeps through ing and bounded below by zero. That (5) is satisfied
E, and a stopping tolerance tol, typically 1e − 6. follows from the fact that each point of the sequence
{ur , γ r , y r , sr , E r } satisfies (5) with (ū, γ̄, ȳ, s̄, Ē) re-
(2) Solve the linear program (4) for a fixed E and denote placed by {ur , γ r , y r , sr , E r } on account of step (4) of
its solution by (u, γ, y, s). Algorithm 2.1. That (6) is satisfied follows from the fact
(3) For  = 1, . . . , kn and j = 1 + ( − 1)mod n: that each point of the sequence {E r } satisfies (6) with
(Ē) replaced by {E r } on account of step (3) of Algo-
(a) Replace Ejj by 1 if it is 0 and by 0 if it is 1. rithm 2.1. Hence every accumulation point (ū, γ̄, ȳ, s̄, Ē)
(b) Compute: of {ur , γ r , y r , sr , E r } satisfies (5) and (6).

f (E) = νe (e−D(K(AE, EA )u−eγ))+ +σe Ee, It is important to note that by repeating steps (3) and (4)
of Algorithm 2.1, a feature dropped in one sweep through
before and after changing Ejj . the integer variables may be added back in another cy-
(c) Keep the new Ejj only if f (E) decreases by more cle, and conversely. Thus, our algorithm is not merely a
than tol. Else undo the change in Ejj . Go to (a) naı̈ve greedy approach because the choices of one iteration
if j < n. may be reversed in later iterations, and we have observed
(d) Go to (4) if the total decrease in f (E) is less than this phenomenon in our experiments. However, cycling is
or equal to tol in the last n steps. avoided by choosing tol > 0, which ensures that the se-
quence of objective values generated by Algorithm 2.1 is
(4) Solve the linear program (4) for a fixed E and denote strictly decreasing. It is also important to note that when
its solution by (u, γ, y, s). Stop if objective function changing the integer variables in step (3), only the objective
decrease of (4) is less than tol. function needs to be recomputed, which is much faster than
solving the linear program in step (4). In fact, as we shall
(5) Go to (3). discuss in Section 3, we have found that the cycle through
Remark 2.2. We note that f (E) in the RFSVM algorithm the integer variables in step (3) tends to be repeated more
is equivalent to νe y + σe Ee when y takes on its optimal often than the linear program of step (4). We turn now to
value generated by the first and the next-to-the-last sets of computational testing of our approach.
constraints of (4). Note that f (E) still depends on E even
for the case when σ = 0. 3. Computational Results
We establish now convergence of the RFSVM algorithm
for tol = 0, however computationally we use tol = 1e − 6. We illustrate the effectiveness of our Reduced Feature
SVM (RFSVM) on two datasets from the UCI Machine
Proposition 2.3. RFSVM Convergence Learning Repository [13] and on synthetic data generated
For tol = 0, the nonnegative nonincreasing values of using Michael Thompson’s NDCC generator [17]. The
the sequence of objective function values {νe y r + e sr + UCI datasets are used to compare the feature selection and
σe E r e}r=∞
r=1 , where the superscript r denotes iteration classification accuracy of RFSVM to the following two
number of step (4) of Algorithm 2.1, converge to (νe ȳ + algorithms: recursive feature elimination (RFE), a wrap-
e s̄+σe Ēe) where (ū, γ̄, ȳ, s̄, Ē) is any accumulation point per method designed for SVMs [8], and Relief, a filter
of the sequence of iterates {ur , γ r , y r , sr , E r } generated by method [14]. A feature-reducing linear kernel 1-norm
Algorithm 2.1. The point (ū, γ̄, ȳ, s̄, Ē) has the following SVM (SVM1) [1], and a nonlinear kernel 1-norm SVM
local minimum property: (NKSVM1) [11] with no feature selection are used as base-
lines. The synthetic NDCC data is used to illustrate the ef-
(νe ȳ + e s̄ + σe Ēe) = min νe y + e s + σe Ēe fectiveness of RFSVM on problems with large numbers of
u,γ,y,s
 features, including a problem with 1000 features, 900 of
s.t. D(K(AĒ, ĒA )u − eγ) + y ≥ e
which are irrelevant.
−s ≤ u ≤ s
y ≥ 0,
(5) 3.1 UCI Datasets
and for p = 1, . . . , n:
We use the UCI datasets to compare RFSVM to two
f (Ē) ≤ f (E), for Epp = 1−Ēpp , Ejj = Ējj , j = p. (6) other algorithms. RFE and Relief are used to illustrate how

233
RFSVM maintains classification accuracy for different de- similar behavior on the datasets, we believe that our results
grees of feature selection. SVM1 and NKSVM1 are used support the conclusion that RFSVM is effective for learning
to establish baselines for feature selection and classifica- nonlinear classifiers with reduced input space features.
tion accuracy. For the sake of efficiency, we use the experi- For all the algorithms, we chose ν and the Gaussian ker-
mental methodology described below to compare the algo- nel parameter µ from the set {2i |i ∈ {−7, . . . , 7}}. For
rithms. We first briefly describe RFE and Relief. each dataset, we evaluated the accuracy and number of fea-
tures selected at σ ∈ {0, 1, 2, 4, 8, 16, 32, 64}. The diagonal
3.1.1. RFE. Recursive Feature Elimination (RFE) is a of E was randomly initialized so that max{ nσ , 1} features
wrapper method designed for SVMs [8]. First an SVM were present in the first linear program, where n is the num-
(u, γ) is learned using all features, then features are ranked ber of input space features for each dataset. As σ increases,
based on how much the margin u K(A, A )u changes when the penalty on the number of features begins to dominate
each feature is removed separately. Features which have a the objective. We only show values of σ for which we ob-
small effect on the margin are considered less relevant. A tained reliable results. For RFE, we removed 1 feature per
given percentage of the least relevant features are removed, iteration. For Relief, we used 1000 iterations to determine
and the entire procedure is repeated with the remaining fea- the feature weights.
tures. In our experiments, we remove one feature at a time
until the reported number of features is reached. Note that 3.1.4. Results and discussion. Figure 1 gives curves show-
our RFSVM procedure uses the objective value f (E) to de- ing the accuracy of RFSVM versus the number of input
termine whether to include or remove each feature, and re- space features used on the Ionosphere and Sonar datasets.
moves or keeps features if the objective function decreases Each point on the curve is obtained by averaging five ten-
by more than a threshold, without first ranking the features. fold cross validation experiments for a fixed σ. The square
Furthermore, once a feature is removed by RFE it is never points denote the accuracy of NKSVM1, an ordinary non-
again considered for inclusion in the final classifier, while linear classifier which uses all the input space features. The
any feature removed during a sweep through the integer points marked by triangles represent the accuracy and fea-
variables E in our Algorithm 2.1 may be included by a later ture reduction of SVM1, a linear classifier which is known
sweep. to reduce features [1]. Results for RFE are denoted by ’+’,
and results for Relief are denoted by ’’ while results of
3.1.2. Relief. Relief is a filter method for selecting features our RFSVM algorithm are denoted by circles. Note that
[14]. Features are ranked by computing weights as follows. RFSVM is potentially able to obtain a higher accuracy than
For a randomly chosen training example, find the nearest the linear classifier using approximately the same number
example with the same class (the nearest hit), and the near- of features on the Ionosphere and Sonar datasets. Note also
est example in the other class (the nearest miss). Then up- that even for σ = 0, RFSVM was able to reduce features
date the weight of each feature by subtracting the absolute based only on decrease in the objective term e y. RFSVM
value of the difference in feature values between the exam- is comparable in both classification accuracy and feature se-
ple and the nearest hit, and adding the absolute value of the lection to RFE and Relief.
difference between the example and the nearest miss. This To illustrate the efficiency of our approach, we report
procedure is then repeated several times, with a different the CPU time taken on the Ionosphere dataset. On this
random example each time. Features with high weight are dataset, the RFSVM algorithm required an average of 6 cy-
considered more relevant. Relief may be used with any bi- cles through the integer variables on the diagonal of the ma-
nary classification algorithm, but in the following we use it trix E, and the solution of 3 linear programs. The aver-
exclusively with a 1-norm Gaussian kernel SVM. ages are taken over the classifiers learned for each fold once
the parameters were selected. Using the MATLAB profiler,
3.1.3. Methodology. To save time, we tuned each algorithm we found that the CPU time taken for one complete exper-
1
by using 11 of each dataset as a tuning set, and performed iment on the Ionosphere dataset was 60.8 minutes. The ex-
ten-fold cross validation on the remaining 1011 . The tuning
periment required 1960 runs of the RFSVM algorithm. Of
set was used to choose the parameters ν and µ on the first this time, approximately 75% was used in evaluating the
fold, and the chosen parameters were then used for the re- objective function, and 15% was used in solving linear pro-
maining nine folds. In order to avoid bias due to the choice grams. Our experience with the RFSVM algorithm is that
of the tuning set, we repeated the above procedure five times the bottleneck is often the objective function evaluations
using a different, randomly selected, tuning set each time. rather than the linear programs, which suggests that signif-
This procedure allows us to efficiently investigate the be- icant speedups could be obtained by using more restrictive
havior of the feature selection algorithms RFSVM, RFE, settings of the number of sweeps k and the tolerance tol for
and Relief on the datasets. Since the algorithms exhibit decreasing f (E). These measurements were taken using

234
MATLAB 7.2 [12] under CentOS Linux 4.3 running on an NDCC Dataset: 200 Training Points with 20 True Features
Intel 3.0 GHz Pentium 4. The linear programs were solved and Varying Number of Random Features

Average Accuracy on 1000 Test Points


using CPLEX 9.0 [9] and the Gaussian kernels were com- 0.83
puted using a compiled function written in C. 0.79 RFSVM
0.78
34 0.74
Ionosphere Dataset: 351 points in R
σ NKSVM1
0.67
16 8 4 2 1 0
Average Ten-Fold Cross Validation Accuracy

0.96
RFSVM
0.94 NKSVM1
Relief
0.92
RFE
0.9
0.51
0.88
SVM1
0.86
100 200 500
0.84
Total Number of Features Given to RFSVM and NKSVM1
0.82 Including the 20 True Features
0.8
0.78
Figure 2. RFSVM1 and NKSVM1 on NDCC
0.76
0 5 10 15 20 25 30 35 40 data with 20 true features and 80, 180, and
Number of Input Space Features Used 480 irrelevant random features. Each point is
Sonar Dataset: 208 points in R
60 the average of the test set accuracy over two
64 16
σ independently generated datasets.
32 8 4 2 1 0
Average Ten-Fold Cross Validation Accuracy

0.85

0.8 Relief NKSVM1


RFSVM adding random normal features to an NDCC dataset and
0.75 SVM1
then normalizing all features to have mean 0 and standard
0.7 RFE deviation 1. The order of the features is shuffled. Each
0.65 dataset has 200 training points, 200 tuning points, and 1000
testing points. Accuracy of RFSVM and NKSVM1 on
0.6
the dataset is measured by choosing ν and µ from the set
0.55 {2i |i ∈ {−7, . . . , 7}} using the tuning set, and then eval-
0 10 20 30 40 50 60 70 uating the chosen classifier on the 1000 testing points. To
Number of Input Space Features Used
save time, we arbitrarily set σ in RFSVM to 1 before per-
forming any experiments.
Figure 1. Ten-fold cross validation accu- Figure 2 shows a comparison of RFSVM and NKSVM1
racy versus number of features used on the on NDCC data with 20 true features as the number of irrele-
Ionosphere and Sonar datasets. Results for vant features increases. Note that the accuracy of NKSVM1
each algorithm are averages over five ten- decreases more than the accuracy of RFSVM as more irrel-
fold cross validation experiments, each us- evant features are added. When 480 irrelevant features are
1
ing a different 11 of the data for tuning only, added, the accuracy of RFSVM is 74%, 45% higher than
and the remaining 10 11 for ten-fold cross vali- NKSVM1.
dation. Circles mark the average number of We also investigated the performance of RFSVM on
features used and classification accuracy of NDCC data with 1000 features, 900 of which were irrele-
RFSVM for each value of σ. ’+’, ’’, ’’, and vant. To improve the running time of RFSVM on problems
’’ represent the same values for RFE, Re- with such large numbers of features, we implemented op-
lief, NKSVM1, and SVM1, respectively. timizations which took advantage of the form of the Gaus-
sian kernel. We also used the Condor distributed comput-
ing system [10], which allowed us to evaluate Algorithm
3.2 NDCC Data 2.1 for several tuning parameters simultaneously. Over 10
datasets, the average classification accuracy of RFSVM was
The NDCC dataset generator creates datasets by placing 70%, while the average classification accuracy of NKSVM1
normal distributions at the vertices of concentric 1-norm was 53%. Thus, the feature selection provided by RFSVM
cubes [17]. The resulting datasets are not linearly sepa- leads to a 32% improvement over a classifier with no fea-
rable, thus making them attractive testbeds for nonlinear ture selection. We expect that even better accuracy could
classifiers. We create datasets to test feature selection by be obtained by tuning σ, and heuristics to choose σ are an

235
important topic of future research. J. Abello, P. M. Pardalos, and M. G. C. Resende, edi-
When using Condor, we used the freely available CLP tors, Handbook of Massive Datasets, pages 439–472, Dor-
linear programming solver [4] to solve the linear programs drecht, Netherlands, 2002. Kluwer Academic Publishers.
and the MATLAB compiler version 4.5 to produce a stand- ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-01.ps.
[3] S. Dasgupta. Learning mixtures of Gaussians. In IEEE Sym-
alone executable which ran Algorithm 2.1 for given values
posium on Foundations of Computere Science (FOCS) 1999,
of ν and µ. On the same machine described above, the av-
pages 634–644, 1999.
erage time to run this executable for the parameters chosen [4] J. Forrest, D. de la Nuez, and R. Lougee-
by the tuning procedure was 115 seconds. Further speedups Heimer. CLP User Guide, 2004. http://www.coin-
may be possible for some kernels, including the Gaussian or.org/Clp/userguide/index.html.
kernel, by using approximations such as [16]. [5] H. Fröhlich and A. Zell. Feature subset selection for support
vector machines by incremental regularized risk minimiza-
tion. In International Joint Conference on Neural Networks
4. Conclusion and Outlook (IJCNN), volume 3, pages 2041–2046, 2004.
[6] G. Fung and O. L. Mangasarian. A feature selection Newton
We have presented a new approach to feature selection method for support vector machine classification. Computa-
for nonlinear SVM classifiers for a completely arbitrary ker- tional Optimization and Applications, 28(2):185–202, July
nel. Our approach is formulated as an easily implementable 2004. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-03.ps.
[7] C. Gold, A. Holub, and P. Sollich. Bayesian approach to fea-
mixed-integer program and solved efficiently by alternating
ture selection and parameter tuning for support vector ma-
between a linear program to compute the continuous param- chine classifiers. Neural Networks, 18(5-6):693–701, 2005.
eter values of the classifier and successive sweeps through [8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selec-
the objective function to update the integer variables repre- tion for cancer classification using support vector machines.
senting the presence or absence of each feature. This pro- Machine Learning, 46(1-3):389–422, 2002.
cedure converges to a local minimum that minimizes both [9] ILOG, Incline Village, Nevada. ILOG CPLEX 9.0 User’s
the usual SVM objective and the number of input space fea- Manual, 2003. http://www.ilog.com/products/cplex/.
tures used. Our results on two publicly available datasets [10] M. Litzkow and M. Livny. Experience with the condor
and synthetic NDCC data show that our approach efficiently distributed batch system. In Proceedings IEEE Workshop
on Experimental Distributed Systems, pages 97–101, Hun-
learns accurate nonlinear classifiers with reduced numbers
stville, AL, October 1990. IEEE Compter Society Press.
of features. Extension of RFSVM to regression problems, [11] O. L. Mangasarian. Generalized support vector ma-
further evaluation of RFSVM on datasets with very large chines. In A. Smola, P. Bartlett, B. Schölkopf, and
numbers of features, use of different strategies to update the D. Schuurmans, editors, Advances in Large Margin Clas-
integer variables, and procedures for automatically choos- sifiers, pages 135–146, Cambridge, MA, 2000. MIT Press.
ing a value of σ for a desired percentage of feature reduction ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.
are important avenues of future research. [12] MATLAB. User’s Guide. The MathWorks, Inc., Natick,
MA 01760, 1994-2006. http://www.mathworks.com.
[13] P. M. Murphy and D. W. Aha. UCI machine learning repos-
Acknowledgments itory, 1992. www.ics.uci.edu/∼mlearn/MLRepository.html.
[14] M. Robnik-Šikonja and I. Kononenko. Theoretical and em-
We thank Hector Corrada Bravo, Greg Quinn, and pirical analysis of ReliefF and RReliefF. Machine Learning,
53(1-2):23–69, 2003.
Nicholas LeRoy for their assistance with Condor. The re-
[15] B. Schölkopf and A. Smola. Learning with Kernels. MIT
search described in this Data Mining Institute Report 06-03, Press, Cambridge, MA, 2002.
July 2006 and revised June 2007, was supported by Na- [16] Y. Shen, A. Y. Ng, and M. Seeger. Fast gaus-
tional Science Foundation Grants CCR-0138308 and IIS- sian process regression using kd-trees. In NIPS
0511905. 18, 2006. http://ai.stanford.edu/∼ang/papers/nips05-
fastgaussianprocess.pdf.
[17] M. E. Thompson. NDCC: Normally distributed clustered
References datasets on cubes, 2006. www.cs.wisc.edu/dmi/svm/ndcc/.
[18] V. N. Vapnik. The Nature of Statistical Learning Theory.
[1] P. S. Bradley and O. L. Mangasarian. Feature selec- Springer, New York, 1995.
tion via concave minimization and support vector ma- [19] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio,
chines. In J. Shavlik, editor, Proceedings 15th Inter- and V. Vapnik. Feature selection for SVMs. In NIPS, pages
national Conference on Machine Learning, pages 82– 668–674, 2000.
90, San Francisco, California, 1998. Morgan Kaufmann. [20] H. H. Zhang. Variable selection for support vector machines
ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps. via smoothing spline ANOVA. Statistica Sinica, 16(2):659–
[2] P. S. Bradley, O. L. Mangasarian, and D. R. Musi- 674, 2006.
cant. Optimization methods in massive datasets. In

236

You might also like