
Appl Intell (2012) 36:664–684

DOI 10.1007/s10489-011-0287-y

DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique

Chumphol Bunkhumpornpat · Krung Sinapiromsaran · Chidchanok Lursinsap

Published online: 14 April 2011


© Springer Science+Business Media, LLC 2011

Abstract A dataset exhibits the class imbalance problem when a target class has a very small number of instances relative to other classes. A trivial classifier typically fails to detect a minority class due to its extremely low incidence rate. In this paper, a new over-sampling technique called DBSMOTE is proposed. Our technique relies on a density-based notion of clusters and is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE generates synthetic instances along a shortest path from each positive instance to a pseudo-centroid of a minority-class cluster. Consequently, these synthetic instances are dense near this centroid and are sparse far from this centroid. Our experimental results show that DBSMOTE improves precision, F-value, and AUC more effectively than SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE for imbalanced datasets.

Keywords Classification · Class imbalance · Over-sampling · Density-based

C. Bunkhumpornpat (✉) · K. Sinapiromsaran · C. Lursinsap
Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
e-mail: [email protected]

K. Sinapiromsaran
e-mail: [email protected]

C. Lursinsap
e-mail: [email protected]

1 Introduction

Classification [21] is a data mining process that generates a model called a classifier, which describes and distinguishes classes of instances. A derived classifier requires the analysis of a training set, defined as a group of identified instances.

A dataset is considered to be imbalanced if a target class has a very small number of instances compared to other classes. The class imbalance problem [9, 18, 19] has attracted the attention of researchers from various fields. Many applications only consider the two-class case [23, 24, 31]. In this context, the smaller class is called the minority class (the positive class), while the larger class is called the majority class (the negative class). If a dataset has more than two classes, a target class is chosen to be the minority class, and the remaining classes are merged into a single majority class.

Analysts encounter the class imbalance problem in many real-world applications, such as medical decision support systems for colon polyp screening [10], retailing bank customer attrition analysis [17], network intrusion detection of rare attack categories [22], automotive engineering diagnosis [26], and vehicle diagnostics [27]. In these domains, a standard classifier needs to accurately detect a minority class. This minority class may have an extremely low rate of incidence, and, accordingly, standard classifiers generally prove inadequate.

One widely used strategy for handling the class imbalance problem involves re-sampling techniques [18, 19], which aim to balance the class distributions of a dataset before feeding the output into a classification algorithm. There are two main re-sampling techniques: over-sampling techniques, which amplify positive instances, and under-sampling techniques, which suppress negative instances. However, over-sampling techniques, which create more specific decision regions, are negatively impacted by the overfitting problem [30], while under-sampling techniques typically suppress important parts of a dataset. Other strategies deal with this problem differently, for example a boosting-based algorithm [8] and cost-sensitive learning [14].

Batista, Prati, and Monard (2004) used C4.5 in their experiments [2]. They showed that AUC values with various over-sampling techniques are generally superior to those achieved with under-sampling techniques. They concluded that combinations of over-sampling and data cleaning lead to satisfactory AUC values when the minority class is small.

This paper is organized as follows. Section 2 briefly overviews related work. Section 3 describes DBSMOTE. Section 4 analyzes DBSMOTE. Section 5 gives an overview of performance metrics and summarizes the experimental results. Section 6 discusses the trade-offs between the various over-sampling techniques. Section 7 summarizes and concludes this work. The Appendix shows all experimental results and their paired t-tests.

Fig. 1 (a) Eps-neighborhood, (b) Directly density-reachable

Fig. 2 (a) Density-reachable, (b) Density-connected

2 Related work

2.1 DBSCAN

Ester, Kriegel, Sander, and Xu (1996) designed a density-based clustering algorithm called DBSCAN, Density-Based Spatial Clustering of Applications with Noise [15], which discovers clusters of arbitrary shape. Key definitions in the context of DBSCAN are as follows:

Definition 1 (Eps-neighborhood) Let D be a dataset. The Eps-neighborhood of an instance p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}.

Definition 2 (Directly density-reachable) An instance p is directly density-reachable from an instance q w.r.t. Eps and MinPts if
(1) p ∈ NEps(q) and
(2) |NEps(q)| ≥ MinPts (Core instance condition).

Definition 3 (Density-reachable) An instance p is density-reachable from an instance q w.r.t. Eps and MinPts if there is a chain of instances p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi.

Definition 4 (Density-connected) An instance p is density-connected to an instance q w.r.t. Eps and MinPts if there is an instance r such that both p and q are density-reachable from r w.r.t. Eps and MinPts.

By Definition 1, dist returns the distance between instances p and q. By Definition 2, a core instance satisfies the core instance condition but a border instance does not; in addition, an instance is directly density-reachable only from a core instance. By Definition 3, density-reachable is the transitive extension of directly density-reachable. By Definition 4, if there is a core instance from which two instances in a cluster are density-reachable, these two instances are density-connected to each other.

In Figs. 1 and 2, a point represents an instance, a circle represents the space obtained by a center point, and an arrow represents directly density-reachable instances. In Fig. 1(a), p is a member of the Eps-neighborhood of q. In Fig. 1(b), p is directly density-reachable from q. In Fig. 2(a), p is density-reachable from q. In Fig. 2(b), p and q are density-connected to each other by r.

Definition 5 (Cluster) A cluster C w.r.t. Eps and MinPts is a non-empty subset of a dataset D satisfying the following conditions:
(1) ∀p, q: if p ∈ C and q is density-reachable from p w.r.t. Eps and MinPts, then q ∈ C (Maximality).
(2) ∀p, q ∈ C: p is density-connected to q w.r.t. Eps and MinPts (Connectivity).

Definition 6 (Noise) Let C1, ..., Ck be the k clusters of a dataset D w.r.t. Epsi and MinPtsi where i ∈ {1, ..., k}. The set of instances not belonging to any cluster Ci is considered noise: Noise = {p ∈ D | ∀i: p ∉ Ci}.

According to Definitions 5 and 6, a density-based cluster is a set of instances that are density-connected and that are maximal with respect to density-reachability. This approach assumes that the rest of the instances are noise. In addition, analysts can determine Eps and MinPts by visualizing and considering a sorted k-dist graph [15].
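To make Definitions 1 and 2 concrete, a minimal Python sketch of the Eps-neighborhood and the core instance condition is given below. The function names and the use of Euclidean distance are our illustrative assumptions, not part of the original DBSCAN implementation.

import numpy as np

def eps_neighborhood(X, p, eps):
    """N_Eps(p) of Definition 1: indices q with dist(X[p], X[q]) <= Eps."""
    dists = np.linalg.norm(X - X[p], axis=1)      # Euclidean distance assumed
    return np.where(dists <= eps)[0]

def is_core_instance(X, p, eps, min_pts):
    """Core instance condition of Definition 2: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(X, p, eps)) >= min_pts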

2.2 The SMOTE family

The SMOTE family consists of several over-sampling techniques: SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE. These techniques upsize a minority class to improve classifier performance at detecting this class.

Chawla, Bowyer, Hall, and Kegelmeyer (2002) designed a state-of-the-art over-sampling technique called SMOTE, Synthetic Minority Over-sampling TEchnique [7], which operates in the feature space rather than in the data space. SMOTE generates synthetic instances along a line segment that joins each instance to its selected nearest neighbor. Figure 3(a) illustrates an over-sampled minority class, where a white dot represents a positive instance and a black dot represents a synthetic instance. However, SMOTE is negatively impacted by the over-generalization problem because SMOTE blindly generalizes throughout a minority class without considering the majority class, especially in an over-lapping region. Figure 4 illustrates such an over-lapping region, where a cross represents a negative instance. This case is a hybrid between a minority class and a majority class, and it is consequently difficult for a classifier to accurately detect an instance. A superior classifier should treat each region differently.

Han, Wang, and Mao (2005) designed Borderline-SMOTE [16], which operates only on borderline instances in the over-lapping region illustrated in Fig. 3(b), namely where synthetic instances are most dense. However, the rate of detection of majority classes is disappointing because the classifier mis-detects instances as being positive in this context. A superior classifier should ideally improve the minority class detection rate while not negatively impacting the ability to detect majority classes.

Bunkhumpornpat, Sinapiromsaran, and Lursinsap (2009) designed Safe-Level-SMOTE [6], which generates synthetic instances along the same line segment as SMOTE but locates them closer to a minority class than a majority class. Accordingly, synthetic instances are sparse in an over-lapping region, as illustrated in Fig. 3(c). However, the minority class core is not concentrated, and as a result it may be misclassified. A superior classifier should ideally recognize this core, which is significant.

Fig. 3 Over-sampled minority class as processed by (a) SMOTE, (b) Borderline-SMOTE, and (c) Safe-Level-SMOTE

Fig. 4 The over-generalization problem, which impacts SMOTE performance

3 DBSMOTE

In this paper, we propose a new over-sampling technique called DBSMOTE, Density-Based Synthetic Minority Over-sampling TEchnique, which combines DBSCAN and SMOTE. Our goal is to better address the class imbalance problem.

3.1 Motivation

A real-world dataset that features proximate data clusters is typically defined by a normal distribution, which is dense at the core and sparse towards the borders. Frequently, a classifier correctly detects an unidentified instance within a core because it recognizes this core as a class. Unfortunately, a minority class core is too small to be recognized by a classifier, so this core needs to be over-sampled.

The design of DBSMOTE was inspired by concepts both consistent with and contrary to Borderline-SMOTE. DBSMOTE broadly follows the approach of Borderline-SMOTE, which operates in an over-lapping region, but DBSMOTE over-samples this region carefully so as to maintain the majority class detection rate. However, DBSMOTE also departs from Borderline-SMOTE, which fails to operate in safe regions; DBSMOTE instead over-samples the safe region as well, to improve the minority class detection rate.

3.2 Data structure

In this paper, we define a new data structure called the directly density-reachable graph, which is constructed as an underlying weighted directed graph [20], as in Definition 7. Note that R is the set of real numbers.

Definition 7 (Directly density-reachable graph) A directly density-reachable graph of a cluster C discovered by DBSCAN, denoted by G(C) = (V, E), is a graph in which V is a set of nodes representing the instances in C and E is a set of edges, defined as E = {(v1, v2) ∈ V × V | an instance v1 is directly density-reachable from an instance v2 w.r.t. Eps and MinPts, or vice versa}. Let w: E → R be a weight function where w(v1, v2) is equal to the distance between nodes v1, v2 ∈ V.

A directly density-reachable graph is illustrated in Fig. 5. A black dot represents a core instance, and a white dot represents a border instance. For each edge, at least one node is guaranteed to be a core instance because two border instances cannot be directly density-reachable according to Definition 2.

Fig. 5 Directly density-reachable graph

3.3 Algorithm

Figure 6 illustrates the DBSMOTE algorithm, with variables and functions described as follows.

For input and output, p is a positive instance in a cluster of instances Ci, and s is a synthetic instance in a set of synthetic instances Ci′.

construct_directly_density-reachable_graph constructs a directly density-reachable graph G from Ci with respect to ε and k.

determine_pseudo-centroid determines a pseudo-centroid c, the nearest instance to the mean of Ci.

Dijkstra runs Dijkstra's algorithm [12] to build a predecessor list π, where the given source node in G is c.

retrieve_shortest_path retrieves a shortest path § from p to c by traversing π.

select_random_edge randomly selects an edge e in §.

get_connected_nodes returns the pair of nodes {v1, v2} connected by e.

generate_random_number generates a random number gap from 0 to 1.

In the for loop, atti is an attribute index, numattrs is the number of attributes, and dif is the result of subtracting v1 from v2 for the same attribute.

The algorithm over-samples along a shortest path from each positive instance to a pseudo-centroid. As a result, the presence of synthetic instances, which are dense near a pseudo-centroid and sparse far from this centroid, will cause the classifier to focus on cluster cores.

The shortest path from a considered instance p to a pseudo-centroid c in a directly density-reachable graph is drawn as a solid line in Fig. 8. Note that not all edges are shown in this figure. Over-sampling along the shortest path avoids the over-lapping problem [28] because this shortest path is the skeleton path [1].
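A minimal sketch of how the graph of Definition 7 could be built is shown below, assuming NumPy arrays, Euclidean distance, and an adjacency-dictionary representation; the helper name build_ddr_graph is ours, not the authors' code.

import numpy as np

def build_ddr_graph(C, eps, min_pts):
    """Directly density-reachable graph of a cluster C (Definition 7):
    nodes are instance indices, and an edge (v1, v2) with weight
    w(v1, v2) = dist(v1, v2) exists when the two instances lie within Eps
    of each other and at least one of them is a core instance."""
    n = len(C)
    dist = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    core = (dist <= eps).sum(axis=1) >= min_pts   # neighborhood counts include the instance itself
    graph = {v: {} for v in range(n)}
    for v1 in range(n):
        for v2 in range(v1 + 1, n):
            if dist[v1, v2] <= eps and (core[v1] or core[v2]):
                graph[v1][v2] = graph[v2][v1] = dist[v1, v2]
    return graph, core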

Fig. 6 DBSMOTE algorithm


Algorithm: DBSMOTE
Input: a cluster i of positive instances Ci, Eps ε, and MinPts k
Output: a set of synthetic instances Ci′
1.  Ci′ = ∅
2.  G = construct_directly_density-reachable_graph(Ci, ε, k)
3.  c = determine_pseudo-centroid(Ci)
4.  π = Dijkstra(G, c)
5.  ∀p ∈ Ci {
6.    § = retrieve_shortest_path(π, p, c)
7.    e = select_random_edge(§)
8.    (v1, v2) = get_connected_nodes(e)
9.    for (atti = 1 to numattrs) {
10.     dif = v2[atti] − v1[atti]
11.     gap = generate_random_number(0, 1)
12.     s[atti] = v1[atti] + gap · dif
13.   }
14.   Ci′ = Ci′ ∪ {s}
15. }
16. return Ci′
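A condensed Python sketch of the pseudocode in Fig. 6 follows, reusing the build_ddr_graph helper sketched above and SciPy's Dijkstra routine; the library choices and variable names are our assumptions, and the authors' implementation may differ.

import numpy as np
from scipy.sparse.csgraph import dijkstra

def dbsmote_cluster(C, eps, min_pts, rng=np.random.default_rng()):
    """Generate one synthetic instance per positive instance in cluster C."""
    graph, _ = build_ddr_graph(C, eps, min_pts)       # Definition 7 (sketch above)
    n = len(C)
    W = np.zeros((n, n))                              # dense weight matrix; 0 means "no edge"
    for v1, nbrs in graph.items():
        for v2, w in nbrs.items():
            W[v1, v2] = w
    c = int(np.argmin(np.linalg.norm(C - C.mean(axis=0), axis=1)))   # pseudo-centroid
    _, pred = dijkstra(W, directed=False, indices=c, return_predecessors=True)
    synthetic = []
    for p in range(n):
        path = [p]                                    # walk the predecessor list back to c
        while path[-1] != c and pred[path[-1]] >= 0:
            path.append(pred[path[-1]])
        if len(path) < 2:
            continue                                  # p is the centroid itself or unreachable
        i = rng.integers(len(path) - 1)               # random edge (v1, v2) on the shortest path
        v1, v2 = C[path[i]], C[path[i + 1]]
        gap = rng.random()
        synthetic.append(v1 + gap * (v2 - v1))        # interpolate along the selected edge
    return np.array(synthetic)

Because the loop generates one synthetic instance per positive instance in the cluster, running it over all clusters yields the n − t synthetic instances counted in Sect. 3.4.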

Fig. 7 Over-sampling framework

Fig. 8 Shortest path in a directly density-reachable graph

3.4 Framework

A flow diagram for an over-sampling framework integrated with DBSMOTE is illustrated in Fig. 7. DBSCAN first produces m disjoint clusters C1, C2, ..., Cm and detects a set of noise instances N, which is removed in the next step from the minority class D+. DBSMOTE subsequently generates m sets of synthetic instances: C1′, C2′, ..., Cm′. Eventually, these m sets are merged with the original dataset D to create an over-sampled dataset D′. After this framework terminates, the result is n − t synthetic instances, where n and t are the numbers of instances in D+ and N, respectively.
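The flow of Fig. 7 could be wired together roughly as follows, assuming scikit-learn's DBSCAN and the dbsmote_cluster sketch above; parameter handling is simplified and the function name is ours.

import numpy as np
from sklearn.cluster import DBSCAN

def oversample_with_dbsmote(X, y, eps, min_pts, positive_label=1):
    """Sketch of the framework in Fig. 7: cluster the minority class with
    DBSCAN, run DBSMOTE per cluster, and merge the synthetic instances
    back into the original dataset."""
    D_pos = X[y == positive_label]                               # minority class D+
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(D_pos)
    synthetic = [dbsmote_cluster(D_pos[labels == cid], eps, min_pts)
                 for cid in set(labels) - {-1}]                  # label -1 marks noise N
    synthetic = [S for S in synthetic if len(S)]
    if not synthetic:
        return X, y
    S = np.vstack(synthetic)                                     # the n - t synthetic instances
    X_over = np.vstack([X, S])                                   # over-sampled dataset D'
    y_over = np.concatenate([y, np.full(len(S), positive_label)])
    return X_over, y_over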
4 Analysis

A directly density-reachable graph satisfies the following lemmata on the similarities of instances. In this graph, the weight of each edge is computed from the distance along the edge. The degree of a node is equal to the number of edges incident to this node. We define two terms: core node, which represents a core instance, and border node, which represents a border instance.

Lemma 1 The maximum weight of an edge in a directly density-reachable graph cannot exceed Eps.

Lemma 1 is true because Definitions 1 and 2 restrict the distance between two directly density-reachable instances such that it cannot exceed Eps. DBSMOTE over-samples along an edge. The weight parameter reflects the similarity between a synthetic instance and a core instance in a given directly density-reachable graph. In the worst case, the distance between such a pair cannot exceed the similarity-guaranteed distance Eps.

Lemma 2 The minimum degree of a core node in a directly density-reachable graph cannot be less than MinPts.

Lemma 3 The maximum degree of a border node in a directly density-reachable graph must be less than MinPts.

From the core instance condition in Definition 2, Lemma 2 is true because a core instance has at least MinPts instances in its Eps-neighborhood, and Lemma 3 is true because a border instance does not meet this condition.

Most visited nodes in a shortest path are core nodes because of their incident edges. Thus, synthetic instances are closer to core instances than border instances. DBSMOTE not only distributes synthetic instances close to a cluster's pseudo-centroid but also locates them close to core instances, because these regions contain significant information.

According to Theorem 1, the running time of DBSMOTE is no worse than that of the others in the SMOTE family. SMOTE is O(n²) because, for one positive instance, determining the k nearest neighbors and generating synthetic instances have running times O(n) and O(1), respectively. Borderline-SMOTE is O(n²) because this technique applies SMOTE with only a set of borderline instances as input. Safe-Level-SMOTE is O(n²) because the cost of computing the safe level is O(n).

To analyze the time complexity of our over-sampling framework: DBSCAN is O(n log n) [15], and DBSMOTE is O(ni²), which is equivalent to O(n²), for a cluster i of ni positive instances. Thus, this framework is O(n²) because a given dataset will have a finite number of clusters.

Theorem 1 (Time complexity) DBSMOTE is O(n²), where n is the number of positive instances in a cluster.

Proof Let T(n) and Ti(n) be the time complexities of DBSMOTE and of the ith instruction line, respectively. T1(n), T10(n), T12(n), T14(n), T16(n) are O(1).

construct_directly_density-reachable_graph is O(n²), T2(n), because detecting core instances is O(n²) and creating edges between these instances and their Eps-neighborhoods is O(n²).

determine_pseudo-centroid is O(n), T3(n), because computing a cluster mean is O(n) and determining the nearest instance to this mean is O(n).

Dijkstra is O(n²), T4(n), because of the time complexity of Dijkstra's algorithm [12].

retrieve_shortest_path is O(n), T6(n), because, in the worst case, a shortest path visits all nodes in a directly density-reachable graph.

select_random_edge is O(1), T7(n). get_connected_nodes is O(1), T8(n). generate_random_number is O(1), T11(n).

The for loops at lines 5 and 9 take n and numattrs steps, respectively.

T(n) is solved as follows:

T(n) = T1(n) + T2(n) + T3(n) + T4(n) + T16(n) + n · (T6(n) + T7(n) + T8(n) + T14(n) + numattrs · (T10(n) + T11(n) + T12(n)))
     = O(1) + O(n²) + O(n) + O(n²) + O(1) + n · (O(n) + O(1) + O(1) + O(1) + numattrs · (O(1) + O(1) + O(1)))
     = O(n²) + n · (O(n) + numattrs)

numattrs is a constant, so we conclude that T(n) = O(n²). □

According to Theorem 2, the algorithm's correctness is validated because the algorithm cannot terminate unless all of the shortest paths have been completely searched per line 6 of the algorithm. If there is even a single instance that does not have a shortest path from a pseudo-centroid, the output will be incorrect. This theorem confirms that the algorithm is reliable.

Theorem 2 (Correctness) A directly density-reachable graph of a minority class cluster has a shortest path between each instance and a pseudo-centroid.

Proof Begin by assuming the contrary, namely that there exists an instance z for which the shortest path from a minority-class-cluster pseudo-centroid c in a directly density-reachable graph does not exist. Relying on condition 2 in Definition 5, z and c are in the same cluster; thus, z is density-connected to c. By Definition 4, there must be an instance r such that z and c are density-reachable from r. By Definition 3, there is a chain of instances p1, ..., pn, p1 = r, pn = z such that pi+1 is directly density-reachable from pi, where n is the number of instances in this chain and i is an integer in the range from 1 to n − 1. By Definition 7, pi+1 is directly density-reachable from pi; thus, an edge connection between pi+1 and pi exists. This sequence is a path between r and z. Because z and c are density-reachable from r, a path between z and c also exists, contradicting our assumption that there is no path between z and c. Thus, there exists a shortest path between each instance and a pseudo-centroid. □

5 Experiments

5.1 Performance measures

Classifier performance metrics are typically evaluated by a confusion matrix, as shown in Table 1. The rows are actual classes, and the columns are detected classes. TP (True Positive) is the number of correctly classified positive instances. FN (False Negative) is the number of incorrectly classified positive instances. FP (False Positive) is the number of incorrectly classified negative instances. TN (True Negative) is the number of correctly classified negative instances. The six performance measures (accuracy, precision, recall, F-value, TP rate, and FP rate) are defined by formulae (1) through (6).

Table 1 Confusion matrix

                    Detected positive    Detected negative
Actual positive     TP                   FN
Actual negative     FP                   TN

Accuracy = (TP + TN)/(TP + FN + FP + TN),                                (1)
Recall = TP/(TP + FN),                                                   (2)
Precision = TP/(TP + FP),                                                (3)
F-value = ((1 + β)² · Recall · Precision)/(β² · Recall + Precision),     (4)
TP Rate = Sensitivity = TP/(TP + FN),                                    (5)
FP Rate = 1 − Specificity = FP/(TN + FP).                                (6)
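The quantities in (1) through (6) follow directly from the four confusion-matrix counts; a small Python sketch (with β = 1, as used in the experiments) is:

def imbalance_measures(tp, fn, fp, tn):
    """Formulae (1)-(6) from the confusion matrix counts (beta = 1)."""
    accuracy  = (tp + tn) / (tp + fn + fp + tn)                  # (1)
    recall    = tp / (tp + fn)                                   # (2), also the TP rate (5)
    precision = tp / (tp + fp)                                   # (3)
    f_value   = 2 * recall * precision / (recall + precision)    # (4) with beta = 1
    fp_rate   = fp / (tn + fp)                                   # (6), i.e. 1 - specificity
    return accuracy, precision, recall, f_value, fp_rate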

Table 2 Short descriptions of the experimental UCI datasets

Name           Instances   Attributes   Positive   Negative   % Minority
Pima           768         9            268        500        34.90
Haberman       306         4            81         225        26.47
Glass          214         11           51         163        23.83
Segmentation   2310        20           330        1980       14.29
Satimage       6435        37           626        5809       9.73
Ecoli          336         8            25         311        7.44
Yeast          1484        9            20         1464       1.35

Table 3 Average values of all over-sampling percentages from Fig. 9

Technique   Precision   Recall   F-value   AUC
ORG         0.649       0.560    0.601     0.781
SMOTE       0.807       0.925    0.862     0.841
BORD        0.792       0.942    0.861     0.831
SAFE        0.823       0.958    0.885     0.881
DBSMOTE     0.856       0.900    0.877     0.888
In the domain studied by Lewis and Catlett [25], none of the misclassified positive instances impact accuracy, which approaches 100% if all instances are detected as being negative. Accordingly, accuracy is not a suitable metric with which to evaluate the class imbalance problem, which aims to achieve superior detection rates in the case of minority classes.

The F-value [5] is large when both recall and precision, which are of relevance to minority classes, are also large. β was set to 1, which means that recall is as important as precision in our experiments.

ROC, the Receiver Operating Characteristic, summarizes the detection rates over a range of tradeoffs between TP rate and FP rate. The ROC is a two-dimensional graph in which the x-axis represents the FP rate and the y-axis represents the TP rate. In addition, AUC [4], the Area Under the ROC, is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

5.2 Dataset

We used seven imbalanced datasets from the UCI Repository of Machine Learning Databases [3]: the Pima Indians Diabetes, Haberman's Survival, Glass Identification, Image Segmentation, Landsat Satellite, Ecoli, and Yeast databases. The columns in Table 2 list the dataset name, the number of instances, the number of attributes, the number of positive instances, the number of negative instances, and the minority class percentage, respectively. For each dataset, the target class was chosen as the minority class, and the remaining classes were merged to create a large majority class.

5.3 Validation

In our experiments, we applied 10-fold cross-validation to evaluate precision, recall, F-value, and AUC using an original dataset (ORG), SMOTE, Borderline-SMOTE (BORD), Safe-Level-SMOTE (SAFE), and DBSMOTE. We employed the decision tree C4.5 [29], Naïve Bayes [21], support vector machine (SVM) [21], Ripper [11], and k-Nearest Neighbor Classifier (k-NN) [13] approaches. k was set to 5 by default in SMOTE.

We discuss only the experimental results shown in Figs. 9 through 15, in which the x-axis represents the minority class over-sampling percentages and the y-axis represents performance metrics. Averages of all over-sampling percentages are shown in Tables 3 through 9.

Pima has the largest minority class incidence rate of all our experimental datasets. Figure 9 illustrates the experimental results of applying k-NN. DBSMOTE mostly achieves better precision and AUC, while Safe-Level-SMOTE mostly achieves better recall and F-value.

The minority class incidence rate is approximately 25% in the Haberman dataset. Figure 10 illustrates the experimental results of applying C4.5. DBSMOTE mostly achieves superior performance except in the case of recall, for which Borderline-SMOTE is better.

For the Glass dataset, Non-Window Glass was chosen as the minority class and Window Glass was chosen as the majority class. Figure 11 illustrates the experimental results of applying C4.5. DBSMOTE mostly achieves better precision and AUC, while Safe-Level-SMOTE achieves better recall and F-value.

Table 4 Average values of all over-sampling percentages from Fig. 10

Technique   Precision   Recall   F-value   AUC
ORG         0.453       0.296    0.358     0.609
SMOTE       0.712       0.770    0.739     0.698
BORD        0.727       0.855    0.785     0.730
SAFE        0.741       0.830    0.782     0.716
DBSMOTE     0.800       0.825    0.812     0.772

Table 5 Average values of all over-sampling percentages from Fig. 11

Technique   Precision   Recall   F-value   AUC
ORG         0.891       0.804    0.845     0.892
SMOTE       0.952       0.954    0.953     0.943
BORD        0.941       0.954    0.948     0.941
SAFE        0.953       0.977    0.964     0.956
DBSMOTE     0.966       0.957    0.962     0.958

Fig. 9 Experimental results of applying k-NN on Pima: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 10 Experimental results of applying C4.5 on Haberman: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 11 Experimental results of applying C4.5 on Glass: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

In the Segmentation dataset, Window was chosen as the minority class. Figure 12 illustrates the experimental results of applying Naïve Bayes. DBSMOTE mostly achieves superior performance.

In the Satimage dataset, Damp Grey Soil was chosen as the minority class. Figure 13 illustrates the experimental results of applying Ripper. DBSMOTE mostly achieves superior performance except in the case of recall, for which Borderline-SMOTE is better.

Fig. 12 Experimental results of applying Naïve Bayes on Segmentation: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 13 Experimental results of applying Ripper on Satimage: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Table 6 Average values of all over-sampling percentages from Fig. 12

Technique   Precision   Recall   F-value   AUC
ORG         0.329       0.903    0.482     0.831
SMOTE       0.660       0.924    0.767     0.843
BORD        0.655       0.908    0.757     0.850
SAFE        0.657       0.925    0.765     0.844
DBSMOTE     0.677       0.957    0.789     0.904

Table 7 Average values of all over-sampling percentages from Fig. 13

Technique   Precision   Recall   F-value   AUC
ORG         0.609       0.526    0.564     0.750
SMOTE       0.824       0.849    0.836     0.904
BORD        0.805       0.861    0.832     0.901
SAFE        0.834       0.857    0.845     0.906
DBSMOTE     0.891       0.848    0.869     0.915

In the Ecoli dataset, Outer Membrane was chosen as the minority class. Figure 14 illustrates the experimental results of applying SVM. Safe-Level-SMOTE mostly achieves superior performance.

In the Yeast dataset, Peroxisomal was chosen as the minority class. It was also the most severely imbalanced class among our experimental datasets. Figure 15 illustrates the experimental results of applying SVM. DBSMOTE mostly achieves superior performance.

Paired t-tests were applied to the F-value and AUC metrics, as illustrated in Tables 10 and 11. For each paired t-test, the null and alternative hypotheses were as follows:

H0: μ1 − μ2 = 0,
H1: μ1 − μ2 ≠ 0,

where μ1 is a DBSMOTE mean and μ2 is a SMOTE, BORD, or SAFE mean. We calculated three paired t-tests: DBSMOTE to SMOTE, BORD, and SAFE.

Fig. 14 Experimental results of applying SVM on Ecoli: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 15 Experimental results of applying SVM on Yeast: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Table 8 Average values of all over-sampling percentages from Fig. 14

Technique   Precision   Recall   F-value   AUC
ORG         0.938       0.600    0.732     0.798
SMOTE       0.957       0.940    0.948     0.963
BORD        0.920       0.972    0.945     0.972
SAFE        0.962       0.974    0.968     0.981
DBSMOTE     0.953       0.965    0.959     0.974

Table 9 Average values of all over-sampling percentages from Fig. 15

Technique   Precision   Recall   F-value   AUC
ORG         0.733       0.550    0.629     0.774
SMOTE       0.906       0.550    0.684     0.774
BORD        0.885       0.428    0.574     0.712
SAFE        0.920       0.659    0.768     0.828
DBSMOTE     0.920       0.663    0.774     0.830

In these tables, μ is a mean difference, σ² is a variance difference, df is a degree of freedom, t is a t-statistic value, and p is a two-tailed probability value. If p < α, H0 is rejected, which means that there is a difference in means across the paired observations. The significance level α was 0.05.

6 Discussion

There are several interesting questions of relevance to DBSMOTE:

• Which minority class region should be emphasized by an over-sampling technique?

Typically, a dataset is divided into three regions: noise, safe, and over-lapping.

A noise region is located outside a cluster. A classifier often detects noise instances as negative because negative instances encompass these noise instances, and thus an over-sampling technique should not operate in this region.

A safe region is located inside a cluster. A classifier easily recognizes this region because it has sufficient numbers of instances.

Table 10 Paired t-tests on F-value using the indicator variable DBSMOTE

Dataset (classifier)    Method                 Variable tested
                                               μ          σ²          df    t           p

Pima SMOTE 0.0154 −0.000068 4 11.952720 0.000281


(k-NN) Borderline-SMOTE 0.0166 −0.001026 4 3.434564 0.026425
Safe-Level-SMOTE −0.0078 0.000571 4 −2.750848 0.051330
Haberman SMOTE 0.0732 −0.003878 4 6.491872 0.002903
(C4.5) Borderline-SMOTE 0.0270 −0.003401 4 2.570846 0.061923
Safe-Level-SMOTE 0.0302 −0.004538 4 2.165222 0.096324
Glass SMOTE 0.0090 0.000201 4 1.936492 0.124880
(C4.5) Borderline-SMOTE 0.0140 −0.000029 4 4.128375 0.014513
Safe-Level-SMOTE −0.0026 0.000306 4 −0.430590 0.688953
Segmentation SMOTE 0.0224 0.000449 4 14.281720 0.000140
(Naïve Bayes) Borderline-SMOTE 0.0316 0.000578 4 16.296456 0.000083
Safe-Level-SMOTE 0.0242 0.000020 4 20.166667 0.000035
Satimage SMOTE 0.0328 −0.000556 4 8.114239 0.001254
(Ripper) Borderline-SMOTE 0.0372 −0.001753 4 5.771779 0.004473
Safe-Level-SMOTE 0.0236 −0.000460 4 8.162231 0.001226
Ecoli SMOTE 0.0112 −0.000152 4 2.023362 0.113064
(SVM) Borderline-SMOTE 0.0142 −0.000207 4 2.826465 0.047515
Safe-Level-SMOTE −0.0090 0.000004 4 −2.804300 0.048598
Yeast SMOTE 0.0894 0.001136 4 6.384737 0.003088
(SVM) Borderline-SMOTE 0.1996 −0.000634 4 6.416182 0.003032
Safe-Level-SMOTE 0.0058 0.000887 4 0.423189 0.693919

However, for the class imbalance problem, a safe region of a minority class does not contain enough instances, and thus a classifier often misclassifies this region.

An over-lapping region is located around a cluster border and contains a blend of positive and negative instances. This region is detected with great difficulty because a classifier cannot efficiently distinguish between the two classes, and thus over-sampling in this region might be harmful.

To summarize, for the class imbalance problem, an efficient over-sampling technique should concentrate on a safe region and treat with caution any over-lapping regions.

• How should synthetic instances generated by the SMOTE family be distributed?

SMOTE operates throughout a dataset, and thus synthetic instances are spread throughout every region.

Borderline-SMOTE operates only on a dataset border, and thus synthetic instances are dense in an over-lapping region.

Safe-Level-SMOTE operates throughout a dataset and positions synthetic instances in an over-lapping region close to a safe region. Accordingly, these instances are sparse in an over-lapping region.

DBSMOTE produces more synthetic instances around a dataset core than a dataset border and does not operate within a noise region. Thus these instances are dense in a safe region and sparse in an over-lapping region.

These distinct distributions cause the effectiveness of these over-sampling techniques to differ.

• How can DBSMOTE achieve a significant F-value when Borderline-SMOTE cannot?

For DBSMOTE, a classifier correctly detects a positive instance in a safe region because it has sufficient numbers of detectable synthetic instances. Furthermore, a classifier successfully detects a positive instance in an over-lapping region because most of its synthetic instances tend to be located closer to a minority class than to a majority class. This approach decreases FP and FN, which in turn increases precision and recall. Thus, the F-value is satisfactory.

For Borderline-SMOTE, a classifier is more likely to mis-detect a negative instance located in an over-lapping region as a positive instance because a minority class in this region is dense. This approach not only increases FP, which decreases precision, but also decreases FN, which increases recall. Thus a higher recall is not enough to improve the F-value, because a lower precision will also affect the F-value.

Table 11 Paired t-tests on AUC using the indicator variable DBSMOTE

Dataset (classifier)    Method                 Variable tested
                                               μ          σ²          df    t           p

Pima SMOTE 0.0466 0.000386 4 6.285829 0.003272


(k-NN) Borderline-SMOTE 0.0570 0.000118 4 35.349900 0.000004
Safe-Level-SMOTE 0.0064 0.000333 4 1.130312 0.321529
Haberman SMOTE 0.0740 0.001345 4 5.256297 0.006270
(C4.5) Borderline-SMOTE 0.0418 0.000212 4 6.277380 0.003288
Safe-Level-SMOTE 0.0560 0.000613 4 4.732864 0.009085
Glass SMOTE 0.0146 −0.000024 4 4.673346 0.009495
(C4.5) Borderline-SMOTE 0.0164 −0.000005 4 5.776654 0.004460
Safe-Level-SMOTE 0.0018 0.000085 4 0.387837 0.717891
Segmentation SMOTE 0.0610 0.000409 4 7.154226 0.002020
(Naïve Bayes) Borderline-SMOTE 0.0540 0.000398 4 7.235460 0.001936
Safe-Level-SMOTE 0.0608 0.000404 4 7.609518 0.001601
Satimage SMOTE 0.0114 0.000278 4 2.604396 0.059771
(Ripper) Borderline-SMOTE 0.0144 0.000110 4 8.231932 0.001187
Safe-Level-SMOTE 0.0090 −0.000075 4 6.708204 0.002570
Ecoli SMOTE 0.0116 −0.000113 4 2.583525 0.061100
(SVM) Borderline-SMOTE 0.0020 −0.000127 4 0.559017 0.605967
Safe-Level-SMOTE −0.0064 0.000049 4 −2.254300 0.087229
Yeast SMOTE 0.0560 0.000345 4 6.746498 0.002516
(SVM) Borderline-SMOTE 0.1176 −0.000495 4 7.153465 0.002021
Safe-Level-SMOTE 0.0016 0.000308 4 0.184188 0.862827
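For reference, a paired t-test like those reported in Tables 10 and 11 compares matched per-over-sampling-percentage scores of DBSMOTE against one baseline; the sketch below uses SciPy, which is our choice of library and not necessarily the authors' tooling.

from scipy import stats

def paired_t_test(dbsmote_scores, baseline_scores, alpha=0.05):
    """Two-sided paired t-test of H0: the mean difference is zero, applied
    to matched scores (e.g. F-values at the same over-sampling percentages)."""
    t, p = stats.ttest_rel(dbsmote_scores, baseline_scores)
    return t, p, p < alpha      # the last value flags a significant difference at the 5% level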

7 Conclusion

Many researchers have addressed the class imbalance problem. Over-sampling is a widely used technique because it improves a classifier's ability to detect a low-incidence minority class. Unfortunately, traditional data mining techniques are insufficient.

DBSMOTE executes DBSCAN to discover arbitrarily shaped clusters and then over-samples inside these clusters, especially near a pseudo-centroid. Thus, synthetic instances tend not to appear in a majority class. Moreover, these instances are dense near this centroid and sparse far from the centroid.

According to our experimental results, DBSMOTE results in the best precision, F-value, and AUC of any SMOTE family approach when applying the various classifiers. This good performance is due to the synthetic instances of DBSMOTE being generated in the most appropriate places. This fact causes the classifier to concentrate on an important minority class core. The statistical analysis supports our conclusions.

Analysis of DBSMOTE reveals that our technique takes O(n²), which is no worse than the others in the SMOTE family, where n is the minority class size. In addition, the correctness of DBSMOTE is validated.

Although we provide evidence that DBSMOTE successfully improves detection rates on a minority class, there is considerable room for future work in this line of research. Different clustering algorithms that replace DBSCAN in the framework may improve classifier performance. Pruning a directly density-reachable graph may be of interest. Automatic determination of Eps and MinPts should also be addressed.

Acknowledgements This research is supported by grant funds from the program Strategic Scholarships for Frontier Research Network for the Ph.D. Program Thai Doctoral degree from the Commission on Higher Education, Thailand.

Appendix: Extended experimental results

Figures 16 through 22 show bar graphs that visualize the averages of F-value and AUC, represented on the y-axis, together with over-sampling percentages on the x-axis. Tables 12 through 18 illustrate the paired t-test results. A * symbol indicates a significant difference (p < 0.05) on a paired t-test where the indicator variable is DBSMOTE. The symbols in these figures and tables are defined in Sect. 5.

Fig. 16 Experimental results on Pima: (a) F-value, (b) AUC

Table 12 Paired t-tests on Pima using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0168 −0.000046 4 4.548855 0.010427
Borderline-SMOTE 0.0094 −0.000660 4 2.729515 0.052469
Safe-Level-SMOTE 0.0108 −0.000210 4 3.717513 0.020519

Naïve Bayes SMOTE 0.0408 −0.002277 4 4.236689 0.013299


Borderline-SMOTE 0.0506 −0.003951 4 3.485430 0.025227
Safe-Level-SMOTE 0.0380 −0.000930 4 6.545889 0.002816

SVM SMOTE 0.0380 −0.002107 4 4.459799 0.011162


Borderline-SMOTE 0.0500 −0.001882 4 5.270463 0.006210
Safe-Level-SMOTE 0.0250 −0.001350 4 4.398846 0.011702

Ripper SMOTE 0.0190 −0.001085 4 3.343123 0.028752


Borderline-SMOTE 0.0106 −0.001565 4 1.420804 0.228413
Safe-Level-SMOTE 0.0112 −0.000926 4 1.939703 0.124420

k-NN SMOTE 0.0154 −0.000068 4 11.952720 0.000281


Borderline-SMOTE 0.0166 −0.001026 4 3.434564 0.026425
Safe-Level-SMOTE −0.0078 0.000571 4 −2.750848 0.051330

AUC
C4.5 SMOTE 0.0230 0.000216 4 2.513999 0.065776
Borderline-SMOTE 0.0556 0.000158 4 6.217824 0.003406
Safe-Level-SMOTE 0.0292 0.000154 4 5.733210 0.004584

Naïve Bayes SMOTE 0.0750 0.000418 4 8.863490 0.000895


Borderline-SMOTE 0.1286 0.000379 4 10.659090 0.000439
Safe-Level-SMOTE 0.0664 0.000413 4 8.163386 0.001226

SVM SMOTE 0.1024 −0.001479 4 5.663946 0.004791


Borderline-SMOTE 0.1706 −0.006359 4 4.636349 0.009761
Safe-Level-SMOTE 0.0582 −0.000284 4 5.951182 0.004000

Ripper SMOTE 0.0696 −0.000090 4 8.609010 0.001001


Borderline-SMOTE 0.0616 0.000142 4 10.879250 0.000405
Safe-Level-SMOTE 0.0606 −0.000256 4 7.845687 0.001426

k-NN SMOTE 0.0466 0.000386 4 6.285829 0.003272


Borderline-SMOTE 0.0570 0.000118 4 35.349900 0.000004
Safe-Level-SMOTE 0.0064 0.000333 4 1.130312 0.321529

Fig. 17 Experimental results on Haberman: (a) F-value, (b) AUC

Table 13 Paired t-tests on Haberman using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0732 −0.003878 4 6.491872 0.002903
Borderline-SMOTE 0.0270 −0.003400 4 2.570846 0.061923
Safe-Level-SMOTE 0.0302 −0.004538 4 2.165222 0.096324

Naïve Bayes SMOTE 0.1296 −0.007579 4 3.837142 0.018505


Borderline-SMOTE 0.1418 −0.025184 4 3.055268 0.037833
Safe-Level-SMOTE 0.1314 −0.010871 4 3.444254 0.026192

SVM SMOTE 0.0402 0.014313 4 1.185774 0.301337


Borderline-SMOTE 0.0986 −0.014592 4 1.403872 0.233031
Safe-Level-SMOTE 0.0486 0.010698 4 1.335719 0.252576

Ripper SMOTE 0.0646 −0.002337 4 8.413056 0.001093


Borderline-SMOTE 0.0484 −0.005066 4 3.324439 0.029257
Safe-Level-SMOTE 0.0400 −0.004840 4 2.819980 0.047829

k-NN SMOTE 0.0400 −0.002592 4 5.557700 0.005131


Borderline-SMOTE 0.0352 −0.006449 4 2.310794 0.081961
Safe-Level-SMOTE −0.0146 0.001271 4 −3.921670 0.017223

AUC
C4.5 SMOTE 0.0740 0.001345 4 5.256297 0.006270
Borderline-SMOTE 0.0418 0.000212 4 6.277380 0.003288
Safe-Level-SMOTE 0.0560 0.000613 4 4.732864 0.009085

Naïve Bayes SMOTE 0.1380 0.001065 4 10.064680 0.000548


Borderline-SMOTE 0.1498 0.001006 4 10.726300 0.000428
Safe-Level-SMOTE 0.1298 0.001067 4 10.62152 0.000445

SVM SMOTE 0.1148 −0.000636 4 2.791098 0.049257


Borderline-SMOTE 0.1326 0.001977 4 3.845078 0.018380
Safe-Level-SMOTE 0.1160 −0.000530 4 2.891070 0.044515

Ripper SMOTE 0.0884 0.001208 4 6.025188 0.003823


Borderline-SMOTE 0.0658 −0.000710 4 5.689777 0.004712
Safe-Level-SMOTE 0.0484 0.000839 4 4.979909 0.007598

k-NN SMOTE 0.0772 0.001935 4 5.213836 0.006455


Borderline-SMOTE 0.0714 0.000395 4 13.267800 0.000187
Safe-Level-SMOTE 0.0248 0.001857 4 1.787274 0.148421

Fig. 18 Experimental results on Glass: (a) F-value, (b) AUC

Table 14 Paired t-tests on Glass using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0090 0.000200 4 1.936492 0.124880
Borderline-SMOTE 0.0140 −0.000028 4 4.128375 0.014513
Safe-Level-SMOTE −0.0026 0.000306 4 −0.430590 0.688953

Naïve Bayes SMOTE 0.0498 −0.000035 4 10.831210 0.000412


Borderline-SMOTE 0.0826 −0.000513 4 13.250390 0.000187
Safe-Level-SMOTE 0.0262 −0.000014 4 12.43397 0.000241

SVM SMOTE 0.0186 −0.000496 4 3.230024 0.031975


Borderline-SMOTE 0.0024 0.000076 4 −1.596460 0.185622
Safe-Level-SMOTE 0.0088 −0.000530 4 2.022054 0.113233

Ripper SMOTE 0.0054 −0.000451 4 1.400827 0.233872


Borderline-SMOTE 0.0002 −0.000218 4 0.097129 0.927296
Safe-Level-SMOTE 0.0074 0.000117 4 1.296849 0.264432

k-NN SMOTE 0.0018 −0.000269 4 0.406164 0.705412


Borderline-SMOTE −0.0028 0.000237 4 −0.736840 0.502100
Safe-Level-SMOTE −0.0090 0.000165 4 −1.982940 0.118403

AUC
C4.5 SMOTE 0.0146 −0.000024 4 4.673346 0.009495
Borderline-SMOTE 0.0164 −0.000005 4 5.776654 0.004460
Safe-Level-SMOTE 0.0018 0.000085 4 0.387837 0.717891

Naïve Bayes SMOTE 0.0152 0.000006 4 5.898744 0.004132


Borderline-SMOTE 0.0228 0.000026 4 11.072660 0.000378
Safe-Level-SMOTE 0.0020 0.000017 4 1.450953 0.220415

SVM SMOTE 0.0190 −0.000184 4 4.023474 0.015819


Borderline-SMOTE −0.0020 0.000057 4 −1.174440 0.305365
Safe-Level-SMOTE 0.0114 −0.000120 4 5.338539 0.005931

Ripper SMOTE 0.0092 −0.000257 4 2.102891 0.103316


Borderline-SMOTE 0.0006 −0.000245 4 −0.124950 0.906594
Safe-Level-SMOTE 0.0226 −0.000028 4 9.335974 0.000733

k-NN SMOTE 0.0002 0.000011 4 0.179605 0.866194


Borderline-SMOTE 0.0078 0.000006 4 3.147824 0.034586
Safe-Level-SMOTE −0.0008 −0.000002 4 −0.267560 0.802268

Fig. 19 Experimental results on Segmentation: (a) F-value, (b) AUC

Table 15 Paired t-tests on Segmentation using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0044 −0.000191 4 1.632993 0.177808
Borderline-SMOTE 0.0046 0.000009 4 5.276562 0.006185
Safe-Level-SMOTE 0.0002 −0.000130 4 0.077615 0.941862

Naïve Bayes SMOTE 0.0224 0.000449 4 14.281718 0.000140


Borderline-SMOTE 0.0316 0.000578 4 16.296456 0.000083
Safe-Level-SMOTE 0.0242 0.000020 4 20.166667 0.000035

SVM SMOTE 0.0268 −0.004571 4 1.714500 0.161587


Borderline-SMOTE 0.0324 −0.005935 4 1.624555 0.179582
Safe-Level-SMOTE 0.0246 −0.004340 4 1.643070 0.175713

Ripper SMOTE 0.0030 −0.000137 4 1.978141 0.119055


Borderline-SMOTE 0.0008 0.000064 4 0.644658 0.554258
Safe-Level-SMOTE −0.0018 0.000041 4 −1.326980 0.255196

k-NN SMOTE 0.0004 0.000081 4 0.293294 0.783884


Borderline-SMOTE −0.0012 0.000082 4 −0.861550 0.437520
Safe-Level-SMOTE 0.0000 0.000043 4 0.000000 1.000000

AUC
C4.5 SMOTE 0.0010 −0.000069 4 0.482243 0.654834
Borderline-SMOTE 0.0024 0.000003 4 1.530184 0.200715
Safe-Level-SMOTE −0.0018 0.000011 4 −1.230450 0.285939

Naïve Bayes SMOTE 0.0610 0.000409 4 7.154226 0.002020


Borderline-SMOTE 0.0540 0.000398 4 7.235460 0.001936
Safe-Level-SMOTE 0.0608 0.000404 4 7.609518 0.001601

SVM SMOTE 0.0180 −0.001759 4 1.714675 0.161554


Borderline-SMOTE 0.0178 −0.002218 4 1.287154 0.267473
Safe-Level-SMOTE 0.0166 −0.001630 4 1.682356 0.167791

Ripper SMOTE 0.0042 −0.000057 4 3.628247 0.022194


Borderline-SMOTE 0.0004 0.000103 4 0.157378 0.882572
Safe-Level-SMOTE 0.0000 0.000015 4 0.000000 1.000000

k-NN SMOTE −0.0008 0.000003 4 −1.632990 0.177808


Borderline-SMOTE −0.0008 0.000006 4 −0.749270 0.495354
Safe-Level-SMOTE −0.0014 0.000005 4 −1.870830 0.134702

Fig. 20 Experimental results on Satimage: (a) F-value, (b) AUC

Table 16 Paired t-tests on Satimage using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0054 0.000303 4 3.857143 0.018191
Borderline-SMOTE 0.0126 −0.001260 4 2.772078 0.050224
Safe-Level-SMOTE −0.0014 −0.000034 4 −0.465120 0.666039

Naïve Bayes SMOTE 0.0868 0.000805 4 26.006210 0.000013


Borderline-SMOTE 0.1218 0.000742 4 31.832940 0.000006
Safe-Level-SMOTE 0.0812 0.000903 4 21.442800 0.000028

SVM SMOTE 0.3258 −0.012232 4 2.387673 0.075358


Borderline-SMOTE 0.3260 −0.014095 4 2.367196 0.077056
Safe-Level-SMOTE 0.1100 −0.097390 4 0.781618 0.478116

Ripper SMOTE 0.0328 −0.000556 4 8.114239 0.001254


Borderline-SMOTE 0.0372 −0.001753 4 5.771779 0.004473
Safe-Level-SMOTE 0.0236 −0.000460 4 8.162231 0.001226

k-NN SMOTE −0.0018 0.000414 4 −0.619590 0.569079


Borderline-SMOTE 0.0150 0.000672 4 3.506434 0.024752
Safe-Level-SMOTE 0.0024 0.000589 4 0.656611 0.547287

AUC
C4.5 SMOTE −0.0016 0.000318 4 −0.593820 0.584588
Borderline-SMOTE −0.0034 −0.000334 4 −1.077330 0.341970
Safe-Level-SMOTE −0.0122 0.000069 4 −3.909130 0.017407

Naïve Bayes SMOTE 0.0770 0.000031 4 11.568810 0.000319


Borderline-SMOTE 0.0384 0.000070 4 11.494740 0.000327
Safe-Level-SMOTE 0.0354 0.000068 4 12.029410 0.000274

SVM SMOTE 0.1574 0.006228 4 3.160120 0.034180


Borderline-SMOTE 0.1552 0.005330 4 3.074298 0.037137
Safe-Level-SMOTE 0.0288 −0.025820 4 0.423166 0.693934

Ripper SMOTE 0.0114 0.000277 4 2.604396 0.059771


Borderline-SMOTE 0.0144 0.000110 4 8.231932 0.001187
Safe-Level-SMOTE 0.0090 −0.000075 4 6.708204 0.002570

k-NN SMOTE −0.0050 0.000062 4 −1.162480 0.309671


Borderline-SMOTE −0.0084 0.000064 4 −2.342390 0.079171
Safe-Level-SMOTE −0.0078 0.000062 4 −2.220430 0.090569

Fig. 21 Experimental results on Ecoli: (a) F-value, (b) AUC

Table 17 Paired t-tests on Ecoli using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0018 0.000604 4 0.243066 0.819910
Borderline-SMOTE −0.0132 0.001313 4 −1.466305 0.216450
Safe-Level-SMOTE 0.0018 −0.002122 4 0.183483 0.863345

Naïve Bayes SMOTE 0.0436 −0.001155 4 6.355658 0.003141


Borderline-SMOTE 0.0036 0.000334 4 0.366356 0.732654
Safe-Level-SMOTE 0.0134 0.000400 4 2.790456 0.049289

SVM SMOTE 0.0112 −0.000152 4 2.023362 0.113064


Borderline-SMOTE 0.0142 −0.000207 4 2.826465 0.047515
Safe-Level-SMOTE −0.0090 0.000004 4 −2.804300 0.048598

Ripper SMOTE −0.0188 0.001567 4 −1.840486 0.139520


Borderline-SMOTE −0.0286 0.001411 4 −3.373824 0.027945
Safe-Level-SMOTE −0.0292 0.000436 4 −3.576971 0.023231

k-NN SMOTE 0.0064 −0.000008 4 1.554057 0.195138


Borderline-SMOTE −0.0044 0.000104 4 −1.731157 0.158468
Safe-Level-SMOTE −0.0072 −0.000135 4 −3.115740 0.035673

AUC
C4.5 SMOTE −0.0018 0.000016 4 −0.582772 0.591320
Borderline-SMOTE 0.0056 −0.000003 4 3.055050 0.037841
Safe-Level-SMOTE 0.0172 −0.001113 4 1.326456 0.255353

Naïve Bayes SMOTE −0.0018 0.000003 4 −1.087420 0.337989


Borderline-SMOTE −0.0068 0.000007 4 −6.106580 0.003640
Safe-Level-SMOTE −0.0082 0.000010 4 −7.083720 0.002097

SVM SMOTE 0.0116 −0.000113 4 2.583525 0.061100


Borderline-SMOTE 0.0020 −0.000127 4 0.559017 0.605967
Safe-Level-SMOTE −0.0064 0.000049 4 −2.254300 0.087229

Ripper SMOTE −0.0208 0.000852 4 −1.800610 0.146134


Borderline-SMOTE −0.0142 0.000648 4 −1.854350 0.137296
Safe-Level-SMOTE −0.0196 0.000091 4 −3.152280 0.034438

k-NN SMOTE 0.0016 0.000009 4 1.485563 0.211579


Borderline-SMOTE 0.0060 −0.000030 4 4.140393 0.014372
Safe-Level-SMOTE 0.0032 −0.000451 4 1.777778 0.150072

Fig. 22 Experimental results on Yeast: (a) F-value, (b) AUC

Table 18 Paired t-tests on Yeast using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.2266 −0.025624 4 4.321595 0.012432
Borderline-SMOTE 0.1000 −0.003899 4 6.558258 0.002796
Safe-Level-SMOTE 0.0150 −0.007720 4 0.682877 0.532184

Naïve Bayes SMOTE 0.1462 0.001518 4 10.404710 0.000482


Borderline-SMOTE 0.2666 0.000285 4 7.721343 0.001515
Safe-Level-SMOTE 0.0644 0.001240 4 4.308102 0.012565

SVM SMOTE 0.0894 0.001136 4 6.384737 0.003088


Borderline-SMOTE 0.1996 −0.000634 4 6.416182 0.003032
Safe-Level-SMOTE 0.0058 0.000887 4 0.423189 0.693919

Ripper SMOTE 0.1524 −0.001344 4 15.437310 0.000103


Borderline-SMOTE 0.0760 −0.000402 4 6.536199 0.002831
Safe-Level-SMOTE −0.0586 0.000445 4 −8.524140 0.001039

k-NN SMOTE 0.1126 −0.000282 4 7.489047 0.001700


Borderline-SMOTE 0.0752 −0.000603 4 4.867951 0.008232
Safe-Level-SMOTE −0.0404 −0.001660 4 −2.906850 0.043816

AUC
C4.5 SMOTE 0.0170 −0.004680 4 0.563792 0.603004
Borderline-SMOTE −0.0634 0.002637 4 −3.858690 0.018168
Safe-Level-SMOTE −0.0462 −0.001380 4 −7.083420 0.002097

Naïve Bayes SMOTE 0.0380 −0.000062 4 14.904830 0.000118


Borderline-SMOTE −0.0040 −0.000155 4 −1.557000 0.194462
Safe-Level-SMOTE −0.0096 −0.000400 4 −1.809070 0.144704

SVM SMOTE 0.0560 0.000344 4 6.746498 0.002516


Borderline-SMOTE 0.1176 −0.000495 4 7.153465 0.002021
Safe-Level-SMOTE 0.0016 0.000308 4 0.184188 0.862827

Ripper SMOTE 0.0586 −0.000375 4 9.328656 0.000735


Borderline-SMOTE −0.0146 −0.000424 4 −2.964200 0.041382
Safe-Level-SMOTE −0.0600 0.001026 4 −9.393360 0.000716

k-NN SMOTE 0.0084 −0.000283 4 0.892104 0.422754


Borderline-SMOTE −0.0062 −0.000594 4 −0.781500 0.478178
Safe-Level-SMOTE −0.0172 0.000804 4 −1.356220 0.246532

References

1. Bai X, Yang X, Yu D, Latecki LJ (2008) Skeleton-based shape classification using path similarity. Int J Pattern Recognit Artif Intell 22(4):733–746
2. Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29
3. Blake CL, Merz CJ (2009) UCI Repository of machine learning databases. http://archive.ics.uci.edu/ml/. Department of Information and Computer Sciences, University of California, Irvine, California, USA
4. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(6):1145–1159
5. Buckland M, Gey F (1994) The relationship between recall and precision. J Am Soc Inf Sci 45(1):12–19
6. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho T-B (eds) 13th Pacific-Asia conference on knowledge discovery and data mining, Bangkok, Thailand. Lecture notes in artificial intelligence, vol 5476. Springer, Heidelberg, pp 475–482
7. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:341–378
8. Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: The 7th European conference on principles and practice of knowledge discovery in databases, Cavtat-Dubrovnik, Croatia, pp 107–119
9. Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6
10. Chiang I-J, Shieh M-J, Hsu JY, Wong J-M (2005) Building a medical decision support system for colon polyp screening by using fuzzy classification trees. Appl Intell 22(1):61–75. Special issue: Foundations and Advances in Data Mining
11. Cohen WW (1995) Fast effective rule induction. In: 12th international conference on machine learning, Lake Tahoe, California, USA, pp 115–123
12. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press, Cambridge
13. Cover T, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
14. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: The 5th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, California, USA, pp 155–164
15. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: The 2nd international conference on knowledge discovery and data mining, Portland, Oregon, USA, pp 226–231
16. Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) The 2005 international conference on intelligent computing, Hefei, China. Lecture notes in computer science, vol 3644. Springer, Heidelberg, pp 878–887
17. Hu X (2005) A data mining approach for retailing bank customer attrition analysis. Appl Intell 22(1):47–60. Special issue: Foundations and Advances in Data Mining
18. Japkowicz N (2000) The class imbalance problem: significance and strategies. In: 2000 international conference on artificial intelligence, Las Vegas, Nevada, USA, pp 111–117
19. Japkowicz N (2003) Class imbalance: are we focusing on the right issue? In: 20th international conference on machine learning, Washington, District of Columbia, USA, pp 17–23
20. Jungnickel D (2003) Graphs, networks and algorithms. Springer, Heidelberg
21. Kamber M, Han J (2000) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann, San Mateo
22. Khor K-C, Ting C-Y, Phon-Amnuaisuk S (2010) A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection. Appl Intell. doi:10.1007/s10489-010-0263-y
23. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: 14th international conference on machine learning, Nashville, Tennessee, USA, pp 179–186
24. Kubat M, Holte R, Matwin S (1997) Learning when negative examples abound. In: 9th European conference on machine learning, Prague, Czech Republic, pp 146–153
25. Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: 11th international conference on machine learning, New Brunswick, New Jersey, USA, pp 148–156
26. Lu Y, Chen TQ, Hamilton B (1998) A fuzzy diagnostic model and its application in automotive engineering diagnosis. Appl Intell 9(3):231–243
27. Murphey YL, Chen ZH, Feldkamp LA (2008) An incremental neural learning framework and its application to vehicle diagnostics. Appl Intell 28(1):29–49
28. Prati RC, Batista GEAPA, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Monroy R, Arroyo G, Sucar LE, Sossa H (eds) 3rd Mexican international conference on artificial intelligence, Mexico City, Mexico. Lecture notes in artificial intelligence, vol 2972, pp 312–321
29. Quinlan JR (1992) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
30. Tetko IV, Livingstone DJ, Luik AI (1995) Neural network studies. 1. Comparison of overfitting and overtraining. J Chem Inf Comput Sci 35(5):826–833
31. Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6(11):769–772

Chumphol Bunkhumpornpat received his B.Eng. in Computer Engineering from Rangsit University and his M.S. in Computer Science from Chiang Mai University. He was a lecturer in the Department of Computer Science, Chiang Mai University. He is currently a Ph.D. candidate in Computer Science at Chulalongkorn University. His research focus is Data Mining, especially the Class Imbalance Problem and Clustering.

Krung Sinapiromsaran received his B.S. in Mathematics from Chulalongkorn University, and his M.S. and Ph.D. in Computer Science from the University of Wisconsin-Madison. He is currently an Assistant Professor in the Department of Mathematics, Chulalongkorn University. His ongoing research works are related to the theory and application of Mathematical Programming, Knowledge Discovery and Discrete Algorithms.

Chidchanok Lursinsap received his B.Eng. in Computer Engineering from Chulalongkorn University, and his M.S. and Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He was a Lecturer at the Department of Computer Engineering, Chulalongkorn University from 1978 to 1979, and a visiting Assistant Professor from 1986 to 1987 with the Department of Computer Science and Center for Advanced Studies, University of Illinois at Urbana-Champaign. Presently, he is a Professor at the Department of Mathematics, Chulalongkorn University. His research interests include Design Automation, Silicon Compilation, Neural Networks, Computer Architecture, and Artificial Intelligence.