DBSMOTE: Density-Based Synthetic Minority Over-sampling Technique
DOI 10.1007/s10489-011-0287-y
Abstract  A dataset exhibits the class imbalance problem when a target class has a very small number of instances relative to other classes. A trivial classifier typically fails to detect a minority class due to its extremely low incidence rate. In this paper, a new over-sampling technique called DBSMOTE is proposed. Our technique relies on a density-based notion of clusters and is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE generates synthetic instances along a shortest path from each positive instance to a pseudo-centroid of a minority-class cluster. Consequently, these synthetic instances are dense near this centroid and sparse far from it. Our experimental results show that DBSMOTE improves precision, F-value, and AUC more effectively than SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE for imbalanced datasets.

Keywords  Classification · Class imbalance · Over-sampling · Density-based

C. Bunkhumpornpat (✉) · K. Sinapiromsaran · C. Lursinsap
Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
e-mail: [email protected]

K. Sinapiromsaran
e-mail: [email protected]

C. Lursinsap
e-mail: [email protected]

1 Introduction

Classification [21] is a data mining process that generates a model called a classifier, which describes and distinguishes classes of instances. A derived classifier requires the analysis of a training set, defined as a group of identified instances.

A dataset is considered to be imbalanced if a target class has a very small number of instances compared to other classes. The class imbalance problem [9, 18, 19] has attracted the attention of researchers from various fields. Many applications only consider the two-class case [23, 24, 31]. In this context, the smaller class is called the minority class (the positive class), while the larger class is called the majority class (the negative class). If a dataset has more than two classes, a target class is chosen to be the minority class, and the remaining classes are merged into a single majority class.

Analysts encounter the class imbalance problem in many real-world applications, such as a medical decision support system for colon polyp screening [10], retailing bank customer attrition analysis [17], network intrusion detection of rare attack categories [22], automotive engineering diagnosis [26], and vehicle diagnostics [27]. In these domains, a standard classifier needs to accurately detect a minority class. This minority class may have an extremely low rate of incidence, and, accordingly, standard classifiers generally prove inadequate.

One widely used strategy for handling the class imbalance problem involves re-sampling techniques [18, 19], which aim to balance the class distributions of a dataset before feeding the output into a classification algorithm. There are two main re-sampling techniques: over-sampling techniques, which amplify positive instances, and under-sampling techniques, which suppress negative instances. However, over-sampling techniques, which create more specific decision regions, are negatively impacted by the over-fitting problem [30], while under-sampling techniques typically suppress important parts of a dataset. Other strategies
SMOTE. These techniques upsize a minority class to improve classifier performance at detecting this class.

Chawla, Bowyer, Hall, and Kegelmeyer (2002) designed a state-of-the-art over-sampling technique called SMOTE, Synthetic Minority Over-sampling TEchnique [7], which operates in the feature space rather than in the data space. SMOTE generates synthetic instances along the line segment that joins each instance to its selected nearest neighbor. Figure 3(a) illustrates an over-sampled minority class, where a white dot represents a positive instance and a black dot represents a synthetic instance. However, SMOTE is negatively impacted by the over-generalization problem because SMOTE blindly generalizes throughout a minority class without considering the majority class, especially in an over-lapping region. Figure 4 illustrates such an over-lapping region, where a cross represents a negative instance. This case is a hybrid between a minority class and a majority class, and it is consequently difficult for a classifier to accurately detect an instance. A superior classifier should treat each region differently.

Han, Wang, and Mao (2005) designed Borderline-SMOTE [16], which operates only on borderline instances in the over-lapping region illustrated in Fig. 3(b), namely where synthetic instances are most dense. However, the rate of detection of majority classes is disappointing because the classifier mis-detects instances as being positive in this context. A superior classifier should ideally improve the minority class detection rate while not negatively impacting the ability to detect majority classes.

3 DBSMOTE

In this paper, we propose a new over-sampling technique called DBSMOTE, Density-Based Synthetic Minority Over-sampling TEchnique, which combines DBSCAN and SMOTE. Our goal is to better address the class imbalance problem.

3.1 Motivation

A real-world dataset that features proximate data clusters is typically defined by a normal distribution, which is dense at the core and sparse towards the borders. Frequently, a classifier correctly detects an unidentified instance within a core because it recognizes this core as a class. Unfortunately, a minority class core is too small to be recognized by a classifier, so this core needs to be over-sampled.

The design of DBSMOTE was inspired by various consistent and paradoxical concepts in Borderline-SMOTE. DBSMOTE broadly follows the approach of Borderline-SMOTE, which operates in an over-lapping region, but DBSMOTE precisely over-samples this region to maintain the majority class detection rate. However, DBSMOTE also takes a different approach from Borderline-SMOTE, which fails to operate in safe regions. Instead, DBSMOTE over-samples this region to improve minority class detection rates.

3.2 Data structure

In this paper, we define a new data structure called the directly density-reachable graph, which is constructed as an
Fig. 7 Over-sampling framework

Fig. 8 Shortest path in a directly density-reachable graph

shown in this figure. Over-sampling along the shortest path avoids the over-lapping problem [28] because this shortest path is the skeleton path [1].

3.4 Framework
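The generation step at the heart of this framework can be sketched in a few lines, assuming the shortest path from a positive instance to the pseudo-centroid has already been extracted from the directly density-reachable graph (e.g., by Dijkstra's algorithm). The function names and the toy path below are ours, not the paper's:

```python
import random

def interpolate(a, b, gap):
    """Synthetic instance on the segment joining a to b: a + gap * (b - a)."""
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

def generate_along_path(path, rng=None):
    """Pick a random edge of the shortest path from a positive instance
    (path[0]) to the pseudo-centroid (path[-1]) and interpolate a
    SMOTE-style synthetic instance on it."""
    rng = rng or random.Random()
    i = rng.randrange(len(path) - 1)      # random edge (v_i, v_{i+1})
    return interpolate(path[i], path[i + 1], rng.random())

# Toy path: positive instance -> intermediate core instance -> pseudo-centroid
path = [[0.0, 0.0], [1.0, 1.0], [2.0, 1.5]]
synthetic = generate_along_path(path)
```

Because every shortest path terminates at the pseudo-centroid, edges near the centroid are shared by many paths, which is consistent with the abstract's observation that synthetic instances end up dense near the centroid and sparse far from it.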
Table 2 Short descriptions of the experimental UCI datasets

Name          Instance  Attribute  Positive  Negative  % Minority
Pima          768       9          268       500       34.90
Haberman      306       4          81        225       26.47
Glass         214       11         51        163       23.83
Segmentation  2310      20         330       1980      14.29
Satimage      6435      37         626       5809      9.73
Ecoli         336       8          25        311       7.44
Yeast         1484      9          20        1464      1.35

Table 3 Average values of all over-sampling percentages from Fig. 9

Technique  Precision  Recall  F-value  AUC
ORG        0.649      0.560   0.601    0.781
SMOTE      0.807      0.925   0.862    0.841
BORD       0.792      0.942   0.861    0.831
SAFE       0.823      0.958   0.885    0.881
DBSMOTE    0.856      0.900   0.877    0.888
In the domain studied by Lewis and Catlett [25], none of the misclassified positive instances impact accuracy, which approaches 100% if all instances are detected as being negative. Accordingly, accuracy is not a suitable metric with which to evaluate the class imbalance problem, which aims to achieve superior detection rates in the case of minority classes.

The F-value [5] is large when both recall and precision, which are of relevance to minority classes, are also large. β was set to 1, which means that recall is as important as precision in our experiments.

ROC, the Receiver Operating Characteristic, summarizes the detection rates over a range of tradeoffs between TP rate and FP rate. The ROC is a two-dimensional graph in which the x-axis represents the FP rate and the y-axis represents the TP rate. In addition, AUC [4], the Area Under ROC, is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

Table 4 Average values of all over-sampling percentages from Fig. 10

Technique  Precision  Recall  F-value  AUC
ORG        0.453      0.296   0.358    0.609
SMOTE      0.712      0.770   0.739    0.698
BORD       0.727      0.855   0.785    0.730
SAFE       0.741      0.830   0.782    0.716
DBSMOTE    0.800      0.825   0.812    0.772

Table 5 Average values of all over-sampling percentages from Fig. 11

Technique  Precision  Recall  F-value  AUC
ORG        0.891      0.804   0.845    0.892
SMOTE      0.952      0.954   0.953    0.943
BORD       0.941      0.954   0.948    0.941
SAFE       0.953      0.977   0.964    0.956
DBSMOTE    0.966      0.957   0.962    0.958
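The F-value (with β = 1) and the rank interpretation of AUC used throughout these tables can be computed directly; this is a generic sketch of the standard definitions, not code from the paper:

```python
def precision_recall_fvalue(tp, fp, fn, beta=1.0):
    """Precision, recall, and the F-value; beta = 1 weights recall and
    precision equally, as in the experiments."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fvalue = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return precision, recall, fvalue

def auc_rank(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance is
    ranked above a randomly chosen negative one (ties count one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, `auc_rank([0.9, 0.3], [0.5, 0.1])` counts three of the four positive-negative pairs as correctly ordered, giving 0.75.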
5.2 Dataset

We used the seven imbalanced datasets from the UCI Repository of Machine Learning Databases [3]: the Pima Indians Diabetes, Haberman's Survival, Glass Identification, Image Segmentation, Landsat Satellite, Ecoli, and Yeast databases. The columns in Table 2 list the dataset name, the number of instances, the number of attributes, the number of positive instances, the number of negative instances, and the minority class percentage, respectively. For each dataset, the target class was chosen as the minority class, and the remaining classes were merged to create a large majority class.

5.3 Validation

In our experiments, we applied 10-fold cross-validation to evaluate precision, recall, F-value, and AUC using an original dataset (ORG), SMOTE, Borderline-SMOTE (BORD), Safe-Level-SMOTE (SAFE), and DBSMOTE. We employed the decision tree C4.5 [29], Naïve Bayes [21], support vector machine (SVM) [21], Ripper [11], and k-Nearest Neighbor Classifier (k-NN) approaches [13]. k was set to 5 by default in SMOTE.

We discuss only the experimental results shown in Figs. 9 through 15, in which the x-axis represents the minority class over-sampling percentages and the y-axis represents performance metrics. Averages over all over-sampling percentages are shown in Tables 3 through 9.

Pima has the largest minority class incidence rate of all our experimental datasets. Figure 9 illustrates the experimental results of applying k-NN. DBSMOTE mostly achieves better precision and AUC, while Safe-Level-SMOTE mostly achieves better recall and F-value.

The minority class incidence rate is approximately 25% in the Haberman dataset. Figure 10 illustrates the experimental results of applying C4.5. DBSMOTE mostly achieves superior performance except in the case of recall, for which Borderline-SMOTE is better.

For the Glass dataset, Non-Window Glass was chosen as the minority class and Window Glass was chosen as the majority class. Figure 11 illustrates the experimental results of applying C4.5. DBSMOTE mostly achieves better precision
Fig. 9 Experimental results of applying k-NN on Pima: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
Fig. 10 Experimental results of applying C4.5 on Haberman: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
Fig. 11 Experimental results of applying C4.5 on Glass: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
and AUC, while Safe-Level-SMOTE achieves better recall and F-value.

In the Segmentation dataset, Window was chosen as the minority class. Figure 12 illustrates the experimental results of applying Naïve Bayes. DBSMOTE mostly achieves superior performance.

In the Satimage dataset, Damp Grey Soil was chosen as the minority class. Figure 13 illustrates the experimental results of applying Ripper.
Fig. 12 Experimental results of applying Naïve Bayes on Segmentation: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
Fig. 13 Experimental results of applying Ripper on Satimage: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
Table 6 Average values of all over-sampling percentages from Fig. 12

Technique  Precision  Recall  F-value  AUC
ORG        0.329      0.903   0.482    0.831
SMOTE      0.660      0.924   0.767    0.843
BORD       0.655      0.908   0.757    0.850
SAFE       0.657      0.925   0.765    0.844
DBSMOTE    0.677      0.957   0.789    0.904

Table 7 Average values of all over-sampling percentages from Fig. 13

Technique  Precision  Recall  F-value  AUC
ORG        0.609      0.526   0.564    0.750
SMOTE      0.824      0.849   0.836    0.904
BORD       0.805      0.861   0.832    0.901
SAFE       0.834      0.857   0.845    0.906
DBSMOTE    0.891      0.848   0.869    0.915
Fig. 14 Experimental results of applying SVM on Ecoli: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
Fig. 15 Experimental results of applying SVM on Yeast: (a) Precision, (b) Recall, (c) F-value, and (d) AUC
Table 8 Average values of all over-sampling percentages from Fig. 14

Technique  Precision  Recall  F-value  AUC
ORG        0.938      0.600   0.732    0.798
SMOTE      0.957      0.940   0.948    0.963
BORD       0.920      0.972   0.945    0.972
SAFE       0.962      0.974   0.968    0.981
DBSMOTE    0.953      0.965   0.959    0.974

Table 9 Average values of all over-sampling percentages from Fig. 15

Technique  Precision  Recall  F-value  AUC
ORG        0.733      0.550   0.629    0.774
SMOTE      0.906      0.550   0.684    0.774
BORD       0.885      0.428   0.574    0.712
SAFE       0.920      0.659   0.768    0.828
DBSMOTE    0.920      0.663   0.774    0.830
instances. However, for the class imbalance problem, a safe region of a minority class does not contain enough instances, and thus a classifier often misclassifies this region.

An over-lapping region is located around a cluster border, which contains a blend of positive and negative instances. This region is detected with great difficulty because a classifier cannot efficiently distinguish between the two classes, and thus over-sampling in this region might be harmful.

To summarize, for the class imbalance problem, an efficient over-sampling technique should concentrate on a safe region and treat with caution any over-lapping regions.

• How should synthetic instances generated by the SMOTE family be distributed?

SMOTE operates throughout a dataset, and thus synthetic instances are spread throughout every region.

Borderline-SMOTE operates only on a dataset border, and thus synthetic instances are dense in an over-lapping region.

Safe-Level-SMOTE operates throughout a dataset and positions synthetic instances in an over-lapping region close to a safe region. Accordingly, these instances are sparse in an over-lapping region.

DBSMOTE produces more synthetic instances around a dataset core than a dataset border and does not operate within a noise region. Thus these instances are dense in a safe region and are sparse in an over-lapping region.

These distinct distributions cause the effectiveness of these over-sampling techniques to be different.

• How can DBSMOTE achieve a significant F-value when Borderline-SMOTE cannot?

For DBSMOTE, a classifier correctly detects a positive instance in a safe region because it has sufficient numbers of detectable synthetic instances. Furthermore, a classifier successfully detects a positive instance in an over-lapping region because most of its synthetic instances tend to be located closer to a minority class than to a majority class. This approach decreases FP and FN, which in turn increases precision and recall. Thus, the F-value is satisfactory.

For Borderline-SMOTE, a classifier is more likely to mis-detect a negative instance located in an over-lapping region as a positive instance because a minority class in this region is dense. This approach not only increases FP, which decreases precision, but also decreases FN, which increases recall. Thus a higher recall is not enough to improve F-value because a lower precision will also affect the F-value.
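This precision/recall trade-off can be checked with hypothetical confusion-matrix counts (illustrative numbers, not results from the experiments): trading many false positives for a few false negatives raises recall but can still lower the F-value.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-value (beta = 1) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Borderline-SMOTE-like outcome: few FN but many FP in the over-lapping region
p1, r1, f1 = prf(tp=90, fp=60, fn=10)   # precision 0.60, recall 0.90, F 0.72
# DBSMOTE-like outcome: both FP and FN kept low
p2, r2, f2 = prf(tp=80, fp=15, fn=20)   # precision ~0.84, recall 0.80, F ~0.82
```

Despite the lower recall, the second outcome has the higher F-value, which mirrors the argument above.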
F-value
C4.5  SMOTE             0.0168  −0.000046  4  4.548855  0.010427
C4.5  Borderline-SMOTE  0.0094  −0.000660  4  2.729515  0.052469
C4.5  Safe-Level-SMOTE  0.0108  −0.000210  4  3.717513  0.020519
AUC
C4.5  SMOTE             0.0230   0.000216  4  2.513999  0.065776
C4.5  Borderline-SMOTE  0.0556   0.000158  4  6.217824  0.003406
C4.5  Safe-Level-SMOTE  0.0292   0.000154  4  5.733210  0.004584

F-value
C4.5  SMOTE             0.0732  −0.003878  4  6.491872  0.002903
C4.5  Borderline-SMOTE  0.0270  −0.003400  4  2.570846  0.061923
C4.5  Safe-Level-SMOTE  0.0302  −0.004538  4  2.165222  0.096324
AUC
C4.5  SMOTE             0.0740   0.001345  4  5.256297  0.006270
C4.5  Borderline-SMOTE  0.0418   0.000212  4  6.277380  0.003288
C4.5  Safe-Level-SMOTE  0.0560   0.000613  4  4.732864  0.009085

F-value
C4.5  SMOTE             0.0090   0.000200  4  1.936492  0.124880
C4.5  Borderline-SMOTE  0.0140  −0.000028  4  4.128375  0.014513
C4.5  Safe-Level-SMOTE  −0.0026  0.000306  4  −0.430590  0.688953
AUC
C4.5  SMOTE             0.0146  −0.000024  4  4.673346  0.009495
C4.5  Borderline-SMOTE  0.0164  −0.000005  4  5.776654  0.004460
C4.5  Safe-Level-SMOTE  0.0018   0.000085  4  0.387837  0.717891

F-value
C4.5  SMOTE             0.0044  −0.000191  4  1.632993  0.177808
C4.5  Borderline-SMOTE  0.0046   0.000009  4  5.276562  0.006185
C4.5  Safe-Level-SMOTE  0.0002  −0.000130  4  0.077615  0.941862
AUC
C4.5  SMOTE             0.0010  −0.000069  4  0.482243  0.654834
C4.5  Borderline-SMOTE  0.0024   0.000003  4  1.530184  0.200715
C4.5  Safe-Level-SMOTE  −0.0018  0.000011  4  −1.230450  0.285939

F-value
C4.5  SMOTE             0.0054   0.000303  4  3.857143  0.018191
C4.5  Borderline-SMOTE  0.0126  −0.001260  4  2.772078  0.050224
C4.5  Safe-Level-SMOTE  −0.0014  −0.000034  4  −0.465120  0.666039
AUC
C4.5  SMOTE             −0.0016  0.000318  4  −0.593820  0.584588
C4.5  Borderline-SMOTE  −0.0034  −0.000334  4  −1.077330  0.341970
C4.5  Safe-Level-SMOTE  −0.0122  0.000069  4  −3.909130  0.017407

F-value
C4.5  SMOTE             0.0018   0.000604  4  0.243066  0.819910
C4.5  Borderline-SMOTE  −0.0132  0.001313  4  −1.466305  0.216450
C4.5  Safe-Level-SMOTE  0.0018  −0.002122  4  0.183483  0.863345
AUC
C4.5  SMOTE             −0.0018  0.000016  4  −0.582772  0.591320
C4.5  Borderline-SMOTE  0.0056  −0.000003  4  3.055050  0.037841
C4.5  Safe-Level-SMOTE  0.0172  −0.001113  4  1.326456  0.255353

F-value
C4.5  SMOTE             0.2266  −0.025624  4  4.321595  0.012432
C4.5  Borderline-SMOTE  0.1000  −0.003899  4  6.558258  0.002796
C4.5  Safe-Level-SMOTE  0.0150  −0.007720  4  0.682877  0.532184
AUC
C4.5  SMOTE             0.0170  −0.004680  4  0.563792  0.603004
C4.5  Borderline-SMOTE  −0.0634  0.002637  4  −3.858690  0.018168
C4.5  Safe-Level-SMOTE  −0.0462  −0.001380  4  −7.083420  0.002097
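These rows appear to report paired t-tests with df = 4, i.e. five paired observations (one per over-sampling percentage). That computation can be sketched as follows, with hypothetical F-values rather than the paper's data:

```python
import math

def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for matched samples,
    e.g. DBSMOTE vs. SMOTE F-values at the same over-sampling percentages."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Hypothetical F-values at five over-sampling percentages (not from the paper)
dbsmote = [0.86, 0.87, 0.88, 0.88, 0.89]
smote = [0.85, 0.86, 0.86, 0.87, 0.87]
t_stat, df = paired_t(dbsmote, smote)   # df = 4, matching the tables
```

The resulting t statistic would then be compared against the Student t distribution with four degrees of freedom to obtain the p-values shown in the rightmost column.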