
Appl Intell (2012) 36:664–684

DOI 10.1007/s10489-011-0287-y

DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique

Chumphol Bunkhumpornpat · Krung Sinapiromsaran · Chidchanok Lursinsap

Published online: 14 April 2011


© Springer Science+Business Media, LLC 2011

Abstract A dataset exhibits the class imbalance problem when a target class has a very small number of instances relative to other classes. A trivial classifier typically fails to detect a minority class due to its extremely low incidence rate. In this paper, a new over-sampling technique called DBSMOTE is proposed. Our technique relies on a density-based notion of clusters and is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE generates synthetic instances along a shortest path from each positive instance to a pseudo-centroid of a minority-class cluster. Consequently, these synthetic instances are dense near this centroid and are sparse far from this centroid. Our experimental results show that DBSMOTE improves precision, F-value, and AUC more effectively than SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE for imbalanced datasets.

Keywords Classification · Class imbalance · Over-sampling · Density-based

C. Bunkhumpornpat (✉) · K. Sinapiromsaran · C. Lursinsap
Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
e-mail: [email protected]

K. Sinapiromsaran
e-mail: [email protected]

C. Lursinsap
e-mail: [email protected]

1 Introduction

Classification [21] is a data mining process that generates a model called a classifier, which describes and distinguishes classes of instances. A derived classifier requires the analysis of a training set, defined as a group of identified instances.

A dataset is considered to be imbalanced if a target class has a very small number of instances compared to other classes. The class imbalance problem [9, 18, 19] has attracted the attention of researchers from various fields. Many applications only consider the two-class case [23, 24, 31]. In this context, the smaller class is called the minority class (the positive class), while the larger class is called the majority class (the negative class). If a dataset has more than two classes, a target class is chosen to be the minority class, and the remaining classes are merged into a single majority class.

Analysts encounter the class imbalance problem in many real-world applications, such as medical decision support systems for colon polyp screening [10], retailing bank customer attrition analysis [17], network intrusion detection of rare attack categories [22], automotive engineering diagnosis [26], and vehicle diagnostics [27]. In these domains, a standard classifier needs to accurately detect a minority class. This minority class may have an extremely low rate of incidence, and, accordingly, standard classifiers generally prove inadequate.

One widely used strategy for handling the class imbalance problem involves re-sampling techniques [18, 19], which aim to balance the class distributions of a dataset before feeding the output into a classification algorithm. There are two main re-sampling techniques: over-sampling techniques, which amplify positive instances, and under-sampling techniques, which suppress negative instances. However, over-sampling techniques, which create more specific decision regions, are negatively impacted by the overfitting problem [30], while under-sampling techniques typically suppress important parts of a dataset. Other strategies deal with this problem differently, for example a boosting-based algorithm [8] and cost-sensitive learning [14].

Batista, Prati, and Monard (2004) used C4.5 in their experiments [2]. They showed that AUC values with various over-sampling techniques are generally superior to those achieved with under-sampling techniques. They concluded that combinations of over-sampling and data cleaning lead to satisfactory AUC values when the minority class is small.

This paper is organized as follows. Section 2 briefly overviews related work. Section 3 describes DBSMOTE. Section 4 analyzes DBSMOTE. Section 5 gives an overview of performance metrics and summarizes the experimental results. Section 6 discusses the trade-offs between the various over-sampling techniques. Section 7 summarizes and concludes this work. The Appendix shows all experimental results and their paired t-tests.

Fig. 1 (a) Eps-neighborhood, (b) Directly density-reachable

Fig. 2 (a) Density-reachable, (b) Density-connected

2 Related work

2.1 DBSCAN

Ester, Kriegel, Sander, and Xu (1996) designed a density-based clustering algorithm called DBSCAN, Density-Based Spatial Clustering of Applications with Noise [15], which discovers clusters of arbitrary shape. Key definitions in the context of DBSCAN are as follows:

Definition 1 (Eps-neighborhood) Let D be a dataset. The Eps-neighborhood of an instance p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}.

Definition 2 (Directly density-reachable) An instance p is directly density-reachable from an instance q w.r.t. Eps and MinPts if
(1) p ∈ NEps(q) and
(2) |NEps(q)| ≥ MinPts (Core instance condition).

Definition 3 (Density-reachable) An instance p is density-reachable from an instance q w.r.t. Eps and MinPts if there is a chain of instances p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi.

Definition 4 (Density-connected) An instance p is density-connected to an instance q w.r.t. Eps and MinPts if there is an instance r such that both p and q are density-reachable from r w.r.t. Eps and MinPts.

By Definition 1, dist returns the distance between instances p and q. By Definition 2, a core instance satisfies the core instance condition but a border instance does not; in addition, an instance is directly density-reachable only from a core instance. By Definition 3, density-reachable is the transitive extension of directly density-reachable. By Definition 4, if there is a core instance from which two instances in a cluster are density-reachable, these two instances are density-connected to each other.

In Figs. 1 and 2, a point represents an instance, a circle represents the space obtained by a center point, and an arrow represents directly density-reachable instances. In Fig. 1(a), p is a member of the Eps-neighborhood of q. In Fig. 1(b), p is directly density-reachable from q. In Fig. 2(a), p is density-reachable from q. In Fig. 2(b), p and q are density-connected to each other by r.

Definition 5 (Cluster) A cluster C w.r.t. Eps and MinPts is a non-empty subset of a dataset D satisfying the following conditions:
(1) ∀p, q: if p ∈ C and q is density-reachable from p w.r.t. Eps and MinPts, then q ∈ C (Maximality).
(2) ∀p, q ∈ C: p is density-connected to q w.r.t. Eps and MinPts (Connectivity).

Definition 6 (Noise) Let C1, ..., Ck be the k clusters of a dataset D w.r.t. Epsi and MinPtsi where i ∈ {1, ..., k}. The set of instances not belonging to any cluster Ci is considered noise: Noise = {p ∈ D | ∀i: p ∉ Ci}.

According to Definitions 5 and 6, a density-based cluster is a set of instances that are density-connected and that are maximal with respect to density-reachability. This approach assumes that the rest of the instances are noise. In addition, analysts can determine Eps and MinPts by visualizing and considering a sorted k-dist graph [15].
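To make Definitions 1 and 2 concrete, a minimal Python sketch of the Eps-neighborhood and the core instance condition is given below. The function names and the use of Euclidean distance are our illustrative assumptions, not part of the original DBSCAN implementation.

import numpy as np

def eps_neighborhood(X, p, eps):
    """N_Eps(p) of Definition 1: indices q with dist(X[p], X[q]) <= Eps."""
    dists = np.linalg.norm(X - X[p], axis=1)      # Euclidean distance assumed
    return np.where(dists <= eps)[0]

def is_core_instance(X, p, eps, min_pts):
    """Core instance condition of Definition 2: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(X, p, eps)) >= min_pts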

2.2 The SMOTE family

The SMOTE family consists of several over-sampling techniques: SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE. These techniques upsize a minority class to improve classifier performance at detecting this class.

Chawla, Bowyer, Hall, and Kegelmeyer (2002) designed a state-of-the-art over-sampling technique called SMOTE, Synthetic Minority Over-sampling TEchnique [7], which operates in the feature space rather than in the data space. SMOTE generates synthetic instances along a line segment that joins each instance to its selected nearest neighbor. Figure 3(a) illustrates an over-sampled minority class, where a white dot represents a positive instance and a black dot represents a synthetic instance. However, SMOTE is negatively impacted by the over-generalization problem because SMOTE blindly generalizes throughout a minority class without considering the majority class, especially in an over-lapping region. Figure 4 illustrates such an over-lapping region, where a cross represents a negative instance. This case is a hybrid between a minority class and a majority class, and it is consequently difficult for a classifier to accurately detect an instance. A superior classifier should treat each region differently.

Han, Wang, and Mao (2005) designed Borderline-SMOTE [16], which operates only on borderline instances in the over-lapping region illustrated in Fig. 3(b), namely where synthetic instances are most dense. However, the rate of detection of majority classes is disappointing because the classifier mis-detects instances as being positive in this context. A superior classifier should ideally improve the minority class detection rate while not negatively impacting the ability to detect majority classes.

Bunkhumpornpat, Sinapiromsaran, and Lursinsap (2009) designed Safe-Level-SMOTE [6], which generates synthetic instances along the same line segment as SMOTE but locates them closer to a minority class than a majority class. Accordingly, synthetic instances are sparse in an over-lapping region, as illustrated in Fig. 3(c). However, the minority class core is not concentrated, and as a result it may be misclassified. A superior classifier should ideally recognize this core, which is significant.

Fig. 3 Over-sampled minority class as processed by (a) SMOTE, (b) Borderline-SMOTE, and (c) Safe-Level-SMOTE

Fig. 4 The over-generalization problem, which impacts SMOTE performance

3 DBSMOTE

In this paper, we propose a new over-sampling technique called DBSMOTE, Density-Based Synthetic Minority Over-sampling TEchnique, which combines DBSCAN and SMOTE. Our goal is to better address the class imbalance problem.

3.1 Motivation

A real-world dataset that features proximate data clusters is typically defined by a normal distribution, which is dense at the core and sparse towards the borders. Frequently, a classifier correctly detects an unidentified instance within a core because it recognizes this core as a class. Unfortunately, a minority class core is too small to be recognized by a classifier, so this core needs to be over-sampled.

The design of DBSMOTE was inspired by concepts both consistent with and contrary to Borderline-SMOTE. DBSMOTE broadly follows the approach of Borderline-SMOTE, which operates in an over-lapping region, but DBSMOTE over-samples this region carefully so as to maintain the majority class detection rate. However, DBSMOTE also departs from Borderline-SMOTE, which fails to operate in safe regions; DBSMOTE instead over-samples the safe region as well, to improve the minority class detection rate.

3.2 Data structure

In this paper, we define a new data structure called the directly density-reachable graph, which is constructed as an underlying weighted directed graph [20], as in Definition 7. Note that R is the set of real numbers.

Definition 7 (Directly density-reachable graph) A directly density-reachable graph of a cluster C discovered by DBSCAN, denoted by G(C) = (V, E), is a graph in which V is a set of nodes representing the instances in C and E is a set of edges, defined as E = {(v1, v2) ∈ V × V | an instance v1 is directly density-reachable from an instance v2 w.r.t. Eps and MinPts, or vice versa}. Let w: E → R be a weight function where w(v1, v2) is equal to the distance between nodes v1, v2 ∈ V.

A directly density-reachable graph is illustrated in Fig. 5. A black dot represents a core instance, and a white dot represents a border instance. For each edge, at least one node is guaranteed to be a core instance because two border instances cannot be directly density-reachable according to Definition 2.

Fig. 5 Directly density-reachable graph

3.3 Algorithm

Figure 6 illustrates the DBSMOTE algorithm, with variables and functions described as follows.

For input and output, p is a positive instance in a cluster of instances Ci, and s is a synthetic instance in a set of synthetic instances Ci′.

construct_directly_density-reachable_graph constructs a directly density-reachable graph G from Ci with respect to ε and k.

determine_pseudo-centroid determines a pseudo-centroid c, the nearest instance to the mean of Ci.

Dijkstra runs Dijkstra's algorithm [12] to build a predecessor list π, where the given source node in G is c.

retrieve_shortest_path retrieves a shortest path § from p to c by traversing π.

select_random_edge randomly selects an edge e in §.

get_connected_nodes returns the pair of nodes {v1, v2} connected by e.

generate_random_number generates a random number gap from 0 to 1.

In the for loop, atti is an attribute index, numattrs is the number of attributes, and dif is the result of subtracting v1 from v2 for the same attribute.

The algorithm over-samples along a shortest path from each positive instance to a pseudo-centroid. As a result, the presence of synthetic instances, which are dense near a pseudo-centroid and sparse far from this centroid, will cause the classifier to focus on cluster cores.

The shortest path from a considered instance p to a pseudo-centroid c in a directly density-reachable graph is drawn as a solid line in Fig. 8. Note that not all edges are shown in this figure. Over-sampling along the shortest path avoids the over-lapping problem [28] because this shortest path is the skeleton path [1].
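A minimal sketch of how the graph of Definition 7 could be built is shown below, assuming NumPy arrays, Euclidean distance, and an adjacency-dictionary representation; the helper name build_ddr_graph is ours, not the authors' code.

import numpy as np

def build_ddr_graph(C, eps, min_pts):
    """Directly density-reachable graph of a cluster C (Definition 7):
    nodes are instance indices, and an edge (v1, v2) with weight
    w(v1, v2) = dist(v1, v2) exists when the two instances lie within Eps
    of each other and at least one of them is a core instance."""
    n = len(C)
    dist = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    core = (dist <= eps).sum(axis=1) >= min_pts   # neighborhood counts include the instance itself
    graph = {v: {} for v in range(n)}
    for v1 in range(n):
        for v2 in range(v1 + 1, n):
            if dist[v1, v2] <= eps and (core[v1] or core[v2]):
                graph[v1][v2] = graph[v2][v1] = dist[v1, v2]
    return graph, core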

Fig. 6 DBSMOTE algorithm


Algorithm: DBSMOTE
Input: a cluster i of positive instances Ci, Eps ε, and MinPts k
Output: a set of synthetic instances Ci′
1.  Ci′ = ∅
2.  G = construct_directly_density-reachable_graph(Ci, ε, k)
3.  c = determine_pseudo-centroid(Ci)
4.  π = Dijkstra(G, c)
5.  ∀p ∈ Ci {
6.    § = retrieve_shortest_path(π, p, c)
7.    e = select_random_edge(§)
8.    (v1, v2) = get_connected_nodes(e)
9.    for (atti = 1 to numattrs) {
10.     dif = v2[atti] − v1[atti]
11.     gap = generate_random_number(0, 1)
12.     s[atti] = v1[atti] + gap · dif
13.   }
14.   Ci′ = Ci′ ∪ {s}
15. }
16. return Ci′
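A condensed Python sketch of the pseudocode in Fig. 6 follows, reusing the build_ddr_graph helper sketched above and SciPy's Dijkstra routine; the library choices and variable names are our assumptions, and the authors' implementation may differ.

import numpy as np
from scipy.sparse.csgraph import dijkstra

def dbsmote_cluster(C, eps, min_pts, rng=np.random.default_rng()):
    """Generate one synthetic instance per positive instance in cluster C."""
    graph, _ = build_ddr_graph(C, eps, min_pts)       # Definition 7 (sketch above)
    n = len(C)
    W = np.zeros((n, n))                              # dense weight matrix; 0 means "no edge"
    for v1, nbrs in graph.items():
        for v2, w in nbrs.items():
            W[v1, v2] = w
    c = int(np.argmin(np.linalg.norm(C - C.mean(axis=0), axis=1)))   # pseudo-centroid
    _, pred = dijkstra(W, directed=False, indices=c, return_predecessors=True)
    synthetic = []
    for p in range(n):
        path = [p]                                    # walk the predecessor list back to c
        while path[-1] != c and pred[path[-1]] >= 0:
            path.append(pred[path[-1]])
        if len(path) < 2:
            continue                                  # p is the centroid itself or unreachable
        i = rng.integers(len(path) - 1)               # random edge (v1, v2) on the shortest path
        v1, v2 = C[path[i]], C[path[i + 1]]
        gap = rng.random()
        synthetic.append(v1 + gap * (v2 - v1))        # interpolate along the selected edge
    return np.array(synthetic)

Because the loop generates one synthetic instance per positive instance in the cluster, running it over all clusters yields the n − t synthetic instances counted in Sect. 3.4.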

Fig. 7 Over-sampling framework

Fig. 8 Shortest path in a directly density-reachable graph

3.4 Framework

A flow diagram for an over-sampling framework integrated with DBSMOTE is illustrated in Fig. 7. DBSCAN first produces m disjoint clusters C1, C2, ..., Cm and detects a set of noise instances N, which is removed in the next step from the minority class D+. DBSMOTE subsequently generates m sets of synthetic instances: C1′, C2′, ..., Cm′. Eventually, these m sets are merged with the original dataset D to create an over-sampled dataset D′. After this framework terminates, the result is n − t synthetic instances, where n and t are the numbers of instances in D+ and N, respectively.
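The flow of Fig. 7 could be wired together roughly as follows, assuming scikit-learn's DBSCAN and the dbsmote_cluster sketch above; parameter handling is simplified and the function name is ours.

import numpy as np
from sklearn.cluster import DBSCAN

def oversample_with_dbsmote(X, y, eps, min_pts, positive_label=1):
    """Sketch of the framework in Fig. 7: cluster the minority class with
    DBSCAN, run DBSMOTE per cluster, and merge the synthetic instances
    back into the original dataset."""
    D_pos = X[y == positive_label]                               # minority class D+
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(D_pos)
    synthetic = [dbsmote_cluster(D_pos[labels == cid], eps, min_pts)
                 for cid in set(labels) - {-1}]                  # label -1 marks noise N
    synthetic = [S for S in synthetic if len(S)]
    if not synthetic:
        return X, y
    S = np.vstack(synthetic)                                     # the n - t synthetic instances
    X_over = np.vstack([X, S])                                   # over-sampled dataset D'
    y_over = np.concatenate([y, np.full(len(S), positive_label)])
    return X_over, y_over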
4 Analysis

A directly density-reachable graph satisfies the following lemmata on the similarities of instances. In this graph, the weight of each edge is computed from the distance along the edge. The degree of a node is equal to the number of edges incident to this node. We define two terms: core node, which represents a core instance, and border node, which represents a border instance.

Lemma 1 The maximum weight of an edge in a directly density-reachable graph cannot exceed Eps.

Lemma 1 is true because Definitions 1 and 2 restrict the distance between two directly density-reachable instances such that it cannot exceed Eps. DBSMOTE over-samples along an edge. The weight parameter reflects the similarity between a synthetic instance and a core instance in a given directly density-reachable graph. In the worst case, the distance between such a pair cannot exceed the similarity-guaranteed distance Eps.

Lemma 2 The minimum degree of a core node in a directly density-reachable graph cannot be less than MinPts.

Lemma 3 The maximum degree of a border node in a directly density-reachable graph must be less than MinPts.

From the core instance condition in Definition 2, Lemma 2 is true because a core instance has at least MinPts instances in its Eps-neighborhood, and Lemma 3 is true because a border instance does not meet this condition.

Most visited nodes in a shortest path are core nodes because of their incident edges. Thus, synthetic instances are closer to core instances than border instances. DBSMOTE not only distributes synthetic instances close to a cluster's pseudo-centroid but also locates them close to core instances, because these regions contain significant information.

According to Theorem 1, the running time of DBSMOTE is no worse than that of the others in the SMOTE family. SMOTE is O(n²) because, for one positive instance, determining the k nearest neighbors and generating synthetic instances have running times O(n) and O(1), respectively. Borderline-SMOTE is O(n²) because this technique applies SMOTE with only a set of borderline instances as input. Safe-Level-SMOTE is O(n²) because the cost of computing the safe level is O(n).

To analyze the time complexity of our over-sampling framework: DBSCAN is O(n log n) [15], and DBSMOTE is O(ni²), which is equivalent to O(n²), for a cluster i of ni positive instances. Thus, this framework is O(n²) because a given dataset will have a finite number of clusters.

Theorem 1 (Time complexity) DBSMOTE is O(n²), where n is the number of positive instances in a cluster.

Proof Let T(n) and Ti(n) be the time complexities of DBSMOTE and of the ith instruction line, respectively. T1(n), T10(n), T12(n), T14(n), T16(n) are O(1).

construct_directly_density-reachable_graph is O(n²), T2(n), because detecting core instances is O(n²) and creating edges between these instances and their Eps-neighborhoods is O(n²).

determine_pseudo-centroid is O(n), T3(n), because computing a cluster mean is O(n) and determining the nearest instance to this mean is O(n).

Dijkstra is O(n²), T4(n), because of the time complexity of Dijkstra's algorithm [12].

retrieve_shortest_path is O(n), T6(n), because, in the worst case, a shortest path visits all nodes in a directly density-reachable graph.

select_random_edge is O(1), T7(n). get_connected_nodes is O(1), T8(n). generate_random_number is O(1), T11(n).

The for loops at lines 5 and 9 take n and numattrs steps, respectively.

T(n) is solved as follows:

T(n) = T1(n) + T2(n) + T3(n) + T4(n) + T16(n) + n · (T6(n) + T7(n) + T8(n) + T14(n) + numattrs · (T10(n) + T11(n) + T12(n)))
     = O(1) + O(n²) + O(n) + O(n²) + O(1) + n · (O(n) + O(1) + O(1) + O(1) + numattrs · (O(1) + O(1) + O(1)))
     = O(n²) + n · (O(n) + numattrs)

numattrs is a constant, so we conclude that T(n) = O(n²). □

According to Theorem 2, the algorithm's correctness is validated because the algorithm cannot terminate unless all of the shortest paths have been completely searched per line 6 of the algorithm. If there is even a single instance that does not have a shortest path from a pseudo-centroid, the output will be incorrect. This theorem confirms that the algorithm is reliable.

Theorem 2 (Correctness) A directly density-reachable graph of a minority class cluster has a shortest path between each instance and a pseudo-centroid.

Proof Begin by assuming the contrary, namely that there exists an instance z for which the shortest path from a minority-class-cluster pseudo-centroid c in a directly density-reachable graph does not exist. Relying on condition 2 in Definition 5, z and c are in the same cluster; thus, z is density-connected to c. By Definition 4, there must be an instance r such that z and c are density-reachable from r. By Definition 3, there is a chain of instances p1, ..., pn, p1 = r, pn = z such that pi+1 is directly density-reachable from pi, where n is the number of instances in this chain and i is an integer in the range from 1 to n − 1. By Definition 7, pi+1 is directly density-reachable from pi; thus, an edge connection between pi+1 and pi exists. This sequence is a path between r and z. Because z and c are density-reachable from r, a path between z and c also exists, contradicting our assumption that there is no path between z and c. Thus, there exists a shortest path between each instance and a pseudo-centroid. □

5 Experiments

5.1 Performance measures

Classifier performance metrics are typically evaluated by a confusion matrix, as shown in Table 1. The rows are actual classes, and the columns are detected classes. TP (True Positive) is the number of correctly classified positive instances. FN (False Negative) is the number of incorrectly classified positive instances. FP (False Positive) is the number of incorrectly classified negative instances. TN (True Negative) is the number of correctly classified negative instances. The six performance measures (accuracy, precision, recall, F-value, TP rate, and FP rate) are defined by formulae (1) through (6).

Table 1 Confusion matrix

                    Detected positive    Detected negative
Actual positive     TP                   FN
Actual negative     FP                   TN

Accuracy = (TP + TN)/(TP + FN + FP + TN),                                (1)
Recall = TP/(TP + FN),                                                   (2)
Precision = TP/(TP + FP),                                                (3)
F-value = ((1 + β)² · Recall · Precision)/(β² · Recall + Precision),     (4)
TP Rate = Sensitivity = TP/(TP + FN),                                    (5)
FP Rate = 1 − Specificity = FP/(TN + FP).                                (6)
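The quantities in (1) through (6) follow directly from the four confusion-matrix counts; a small Python sketch (with β = 1, as used in the experiments) is:

def imbalance_measures(tp, fn, fp, tn):
    """Formulae (1)-(6) from the confusion matrix counts (beta = 1)."""
    accuracy  = (tp + tn) / (tp + fn + fp + tn)                  # (1)
    recall    = tp / (tp + fn)                                   # (2), also the TP rate (5)
    precision = tp / (tp + fp)                                   # (3)
    f_value   = 2 * recall * precision / (recall + precision)    # (4) with beta = 1
    fp_rate   = fp / (tn + fp)                                   # (6), i.e. 1 - specificity
    return accuracy, precision, recall, f_value, fp_rate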

Table 2 Short descriptions of the experimental UCI datasets

Name           Instances   Attributes   Positive   Negative   % Minority
Pima           768         9            268        500        34.90
Haberman       306         4            81         225        26.47
Glass          214         11           51         163        23.83
Segmentation   2310        20           330        1980       14.29
Satimage       6435        37           626        5809       9.73
Ecoli          336         8            25         311        7.44
Yeast          1484        9            20         1464       1.35

Table 3 Average values of all over-sampling percentages from Fig. 9

Technique   Precision   Recall   F-value   AUC
ORG         0.649       0.560    0.601     0.781
SMOTE       0.807       0.925    0.862     0.841
BORD        0.792       0.942    0.861     0.831
SAFE        0.823       0.958    0.885     0.881
DBSMOTE     0.856       0.900    0.877     0.888
In the domain studied by Lewis and Catlett [25], none of the misclassified positive instances impact accuracy, which approaches 100% if all instances are detected as being negative. Accordingly, accuracy is not a suitable metric with which to evaluate the class imbalance problem, which aims to achieve superior detection rates in the case of minority classes.

The F-value [5] is large when both recall and precision, which are of relevance to minority classes, are also large. β was set to 1, which means that recall is as important as precision in our experiments.

ROC, the Receiver Operating Characteristic, summarizes the detection rates over a range of tradeoffs between TP rate and FP rate. The ROC is a two-dimensional graph in which the x-axis represents the FP rate and the y-axis represents the TP rate. In addition, AUC [4], the Area Under the ROC, is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

5.2 Dataset

We used seven imbalanced datasets from the UCI Repository of Machine Learning Databases [3]: the Pima Indians Diabetes, Haberman's Survival, Glass Identification, Image Segmentation, Landsat Satellite, Ecoli, and Yeast databases. The columns in Table 2 list the dataset name, the number of instances, the number of attributes, the number of positive instances, the number of negative instances, and the minority class percentage, respectively. For each dataset, the target class was chosen as the minority class, and the remaining classes were merged to create a large majority class.

5.3 Validation

In our experiments, we applied 10-fold cross-validation to evaluate precision, recall, F-value, and AUC using an original dataset (ORG), SMOTE, Borderline-SMOTE (BORD), Safe-Level-SMOTE (SAFE), and DBSMOTE. We employed the decision tree C4.5 [29], Naïve Bayes [21], support vector machine (SVM) [21], Ripper [11], and k-Nearest Neighbor Classifier (k-NN) [13] approaches. k was set to 5 by default in SMOTE.

We discuss only the experimental results shown in Figs. 9 through 15, in which the x-axis represents the minority class over-sampling percentages and the y-axis represents performance metrics. Averages of all over-sampling percentages are shown in Tables 3 through 9.

Pima has the largest minority class incidence rate of all our experimental datasets. Figure 9 illustrates the experimental results of applying k-NN. DBSMOTE mostly achieves better precision and AUC, while Safe-Level-SMOTE mostly achieves better recall and F-value.

The minority class incidence rate is approximately 25% in the Haberman dataset. Figure 10 illustrates the experimental results of applying C4.5. DBSMOTE mostly achieves superior performance except in the case of recall, for which Borderline-SMOTE is better.

For the Glass dataset, Non-Window Glass was chosen as the minority class and Window Glass was chosen as the majority class. Figure 11 illustrates the experimental results of applying C4.5. DBSMOTE mostly achieves better precision and AUC, while Safe-Level-SMOTE achieves better recall and F-value.

Table 4 Average values of all over-sampling percentages from Fig. 10

Technique   Precision   Recall   F-value   AUC
ORG         0.453       0.296    0.358     0.609
SMOTE       0.712       0.770    0.739     0.698
BORD        0.727       0.855    0.785     0.730
SAFE        0.741       0.830    0.782     0.716
DBSMOTE     0.800       0.825    0.812     0.772

Table 5 Average values of all over-sampling percentages from Fig. 11

Technique   Precision   Recall   F-value   AUC
ORG         0.891       0.804    0.845     0.892
SMOTE       0.952       0.954    0.953     0.943
BORD        0.941       0.954    0.948     0.941
SAFE        0.953       0.977    0.964     0.956
DBSMOTE     0.966       0.957    0.962     0.958

Fig. 9 Experimental results of applying k-NN on Pima: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 10 Experimental results of applying C4.5 on Haberman: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 11 Experimental results of applying C4.5 on Glass: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

In the Segmentation dataset, Window was chosen as the minority class. Figure 12 illustrates the experimental results of applying Naïve Bayes. DBSMOTE mostly achieves superior performance.

In the Satimage dataset, Damp Grey Soil was chosen as the minority class. Figure 13 illustrates the experimental results of applying Ripper. DBSMOTE mostly achieves superior performance except in the case of recall, for which Borderline-SMOTE is better.

Fig. 12 Experimental results of applying Naïve Bayes on Segmentation: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 13 Experimental results of applying Ripper on Satimage: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Table 6 Average values of all over-sampling percentages from Fig. 12

Technique   Precision   Recall   F-value   AUC
ORG         0.329       0.903    0.482     0.831
SMOTE       0.660       0.924    0.767     0.843
BORD        0.655       0.908    0.757     0.850
SAFE        0.657       0.925    0.765     0.844
DBSMOTE     0.677       0.957    0.789     0.904

Table 7 Average values of all over-sampling percentages from Fig. 13

Technique   Precision   Recall   F-value   AUC
ORG         0.609       0.526    0.564     0.750
SMOTE       0.824       0.849    0.836     0.904
BORD        0.805       0.861    0.832     0.901
SAFE        0.834       0.857    0.845     0.906
DBSMOTE     0.891       0.848    0.869     0.915

In the Ecoli dataset, Outer Membrane was chosen as the minority class. Figure 14 illustrates the experimental results of applying SVM. Safe-Level-SMOTE mostly achieves superior performance.

In the Yeast dataset, Peroxisomal was chosen as the minority class. It was also the most severely imbalanced class among our experimental datasets. Figure 15 illustrates the experimental results of applying SVM. DBSMOTE mostly achieves superior performance.

Paired t-tests were applied to the F-value and AUC metrics, as illustrated in Tables 10 and 11. For each paired t-test, the null and alternative hypotheses were as follows:

H0: μ1 − μ2 = 0,
H1: μ1 − μ2 ≠ 0,

where μ1 is a DBSMOTE mean and μ2 is a SMOTE, BORD, or SAFE mean. We calculated three paired t-tests: DBSMOTE to SMOTE, BORD, and SAFE.

Fig. 14 Experimental results of applying SVM on Ecoli: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Fig. 15 Experimental results of applying SVM on Yeast: (a) Precision, (b) Recall, (c) F-value, and (d) AUC

Table 8 Average values of all over-sampling percentages from Fig. 14

Technique   Precision   Recall   F-value   AUC
ORG         0.938       0.600    0.732     0.798
SMOTE       0.957       0.940    0.948     0.963
BORD        0.920       0.972    0.945     0.972
SAFE        0.962       0.974    0.968     0.981
DBSMOTE     0.953       0.965    0.959     0.974

Table 9 Average values of all over-sampling percentages from Fig. 15

Technique   Precision   Recall   F-value   AUC
ORG         0.733       0.550    0.629     0.774
SMOTE       0.906       0.550    0.684     0.774
BORD        0.885       0.428    0.574     0.712
SAFE        0.920       0.659    0.768     0.828
DBSMOTE     0.920       0.663    0.774     0.830

In these tables, μ is a mean difference, σ² is a variance difference, df is a degree of freedom, t is a t-statistic value, and p is a two-tailed probability value. If p < α, H0 is rejected, which means that there is a difference in means across the paired observations. The significance level α was 0.05.

6 Discussion

There are several interesting questions of relevance to DBSMOTE:

• Which minority class region should be emphasized by an over-sampling technique?

Typically, a dataset is divided into three regions: noise, safe, and over-lapping.

A noise region is located outside a cluster. A classifier often detects noise instances as negative because negative instances encompass these noise instances, and thus an over-sampling technique should not operate in this region.

A safe region is located inside a cluster. A classifier easily recognizes this region because it has sufficient numbers of instances.

Table 10 Paired t-tests on F-value using the indicator variable DBSMOTE

Dataset (classifier)    Method                 Variable tested
                                               μ          σ²          df    t           p

Pima SMOTE 0.0154 −0.000068 4 11.952720 0.000281


(k-NN) Borderline-SMOTE 0.0166 −0.001026 4 3.434564 0.026425
Safe-Level-SMOTE −0.0078 0.000571 4 −2.750848 0.051330
Haberman SMOTE 0.0732 −0.003878 4 6.491872 0.002903
(C4.5) Borderline-SMOTE 0.0270 −0.003401 4 2.570846 0.061923
Safe-Level-SMOTE 0.0302 −0.004538 4 2.165222 0.096324
Glass SMOTE 0.0090 0.000201 4 1.936492 0.124880
(C4.5) Borderline-SMOTE 0.0140 −0.000029 4 4.128375 0.014513
Safe-Level-SMOTE −0.0026 0.000306 4 −0.430590 0.688953
Segmentation SMOTE 0.0224 0.000449 4 14.281720 0.000140
(Naïve Bayes) Borderline-SMOTE 0.0316 0.000578 4 16.296456 0.000083
Safe-Level-SMOTE 0.0242 0.000020 4 20.166667 0.000035
Satimage SMOTE 0.0328 −0.000556 4 8.114239 0.001254
(Ripper) Borderline-SMOTE 0.0372 −0.001753 4 5.771779 0.004473
Safe-Level-SMOTE 0.0236 −0.000460 4 8.162231 0.001226
Ecoli SMOTE 0.0112 −0.000152 4 2.023362 0.113064
(SVM) Borderline-SMOTE 0.0142 −0.000207 4 2.826465 0.047515
Safe-Level-SMOTE −0.0090 0.000004 4 −2.804300 0.048598
Yeast SMOTE 0.0894 0.001136 4 6.384737 0.003088
(SVM) Borderline-SMOTE 0.1996 −0.000634 4 6.416182 0.003032
Safe-Level-SMOTE 0.0058 0.000887 4 0.423189 0.693919

However, for the class imbalance problem, a safe region of a minority class does not contain enough instances, and thus a classifier often misclassifies this region.

An over-lapping region is located around a cluster border and contains a blend of positive and negative instances. This region is detected with great difficulty because a classifier cannot efficiently distinguish between the two classes, and thus over-sampling in this region might be harmful.

To summarize, for the class imbalance problem, an efficient over-sampling technique should concentrate on a safe region and treat with caution any over-lapping regions.

• How should synthetic instances generated by the SMOTE family be distributed?

SMOTE operates throughout a dataset, and thus synthetic instances are spread throughout every region.

Borderline-SMOTE operates only on a dataset border, and thus synthetic instances are dense in an over-lapping region.

Safe-Level-SMOTE operates throughout a dataset and positions synthetic instances in an over-lapping region close to a safe region. Accordingly, these instances are sparse in an over-lapping region.

DBSMOTE produces more synthetic instances around a dataset core than a dataset border and does not operate within a noise region. Thus these instances are dense in a safe region and sparse in an over-lapping region.

These distinct distributions cause the effectiveness of these over-sampling techniques to differ.

• How can DBSMOTE achieve a significant F-value when Borderline-SMOTE cannot?

For DBSMOTE, a classifier correctly detects a positive instance in a safe region because it has sufficient numbers of detectable synthetic instances. Furthermore, a classifier successfully detects a positive instance in an over-lapping region because most of its synthetic instances tend to be located closer to a minority class than to a majority class. This approach decreases FP and FN, which in turn increases precision and recall. Thus, the F-value is satisfactory.

For Borderline-SMOTE, a classifier is more likely to mis-detect a negative instance located in an over-lapping region as a positive instance because a minority class in this region is dense. This approach not only increases FP, which decreases precision, but also decreases FN, which increases recall. Thus a higher recall is not enough to improve the F-value, because a lower precision will also affect the F-value.

Table 11 Paired t-tests on AUC using the indicator variable DBSMOTE

Dataset (classifier)    Method                 Variable tested
                                               μ          σ²          df    t           p

Pima SMOTE 0.0466 0.000386 4 6.285829 0.003272


(k-NN) Borderline-SMOTE 0.0570 0.000118 4 35.349900 0.000004
Safe-Level-SMOTE 0.0064 0.000333 4 1.130312 0.321529
Haberman SMOTE 0.0740 0.001345 4 5.256297 0.006270
(C4.5) Borderline-SMOTE 0.0418 0.000212 4 6.277380 0.003288
Safe-Level-SMOTE 0.0560 0.000613 4 4.732864 0.009085
Glass SMOTE 0.0146 −0.000024 4 4.673346 0.009495
(C4.5) Borderline-SMOTE 0.0164 −0.000005 4 5.776654 0.004460
Safe-Level-SMOTE 0.0018 0.000085 4 0.387837 0.717891
Segmentation SMOTE 0.0610 0.000409 4 7.154226 0.002020
(Naïve Bayes) Borderline-SMOTE 0.0540 0.000398 4 7.235460 0.001936
Safe-Level-SMOTE 0.0608 0.000404 4 7.609518 0.001601
Satimage SMOTE 0.0114 0.000278 4 2.604396 0.059771
(Ripper) Borderline-SMOTE 0.0144 0.000110 4 8.231932 0.001187
Safe-Level-SMOTE 0.0090 −0.000075 4 6.708204 0.002570
Ecoli SMOTE 0.0116 −0.000113 4 2.583525 0.061100
(SVM) Borderline-SMOTE 0.0020 −0.000127 4 0.559017 0.605967
Safe-Level-SMOTE −0.0064 0.000049 4 −2.254300 0.087229
Yeast SMOTE 0.0560 0.000345 4 6.746498 0.002516
(SVM) Borderline-SMOTE 0.1176 −0.000495 4 7.153465 0.002021
Safe-Level-SMOTE 0.0016 0.000308 4 0.184188 0.862827
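For reference, a paired t-test like those reported in Tables 10 and 11 compares matched per-over-sampling-percentage scores of DBSMOTE against one baseline; the sketch below uses SciPy, which is our choice of library and not necessarily the authors' tooling.

from scipy import stats

def paired_t_test(dbsmote_scores, baseline_scores, alpha=0.05):
    """Two-sided paired t-test of H0: the mean difference is zero, applied
    to matched scores (e.g. F-values at the same over-sampling percentages)."""
    t, p = stats.ttest_rel(dbsmote_scores, baseline_scores)
    return t, p, p < alpha      # the last value flags a significant difference at the 5% level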

7 Conclusion

Many researchers have addressed the class imbalance problem. Over-sampling is a widely used technique because it improves a classifier's ability to detect a low-incidence minority class. Unfortunately, traditional data mining techniques are insufficient.

DBSMOTE executes DBSCAN to discover arbitrarily shaped clusters and then over-samples inside these clusters, especially near a pseudo-centroid. Thus, synthetic instances tend not to appear in a majority class. Moreover, these instances are dense near this centroid and sparse far from the centroid.

According to our experimental results, DBSMOTE results in the best precision, F-value, and AUC of any SMOTE family approach when applying the various classifiers. This good performance is due to the synthetic instances of DBSMOTE being generated in the most appropriate places. This fact causes the classifier to concentrate on an important minority class core. The statistical analysis supports our conclusions.

Analysis of DBSMOTE reveals that our technique takes O(n²), which is no worse than the others in the SMOTE family, where n is the minority class size. In addition, the correctness of DBSMOTE is validated.

Although we provide evidence that DBSMOTE successfully improves detection rates on a minority class, there is considerable room for future work in this line of research. Different clustering algorithms that replace DBSCAN in the framework may improve classifier performance. Pruning a directly density-reachable graph may be of interest. Automatic determination of Eps and MinPts should also be addressed.

Acknowledgements This research is supported by grant funds from the program Strategic Scholarships for Frontier Research Network for the Ph.D. Program Thai Doctoral degree from the Commission on Higher Education, Thailand.

Appendix: Extended experimental results

Figures 16 through 22 show bar graphs that visualize the averages of F-value and AUC, represented on the y-axis, together with over-sampling percentages on the x-axis. Tables 12 through 18 illustrate the paired t-test results. A * symbol indicates a significant difference (p < 0.05) on a paired t-test where the indicator variable is DBSMOTE. The symbols in these figures and tables are defined in Sect. 5.

Fig. 16 Experimental results on Pima: (a) F-value, (b) AUC

Table 12 Paired t-tests on Pima using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0168 −0.000046 4 4.548855 0.010427
Borderline-SMOTE 0.0094 −0.000660 4 2.729515 0.052469
Safe-Level-SMOTE 0.0108 −0.000210 4 3.717513 0.020519

Naïve Bayes SMOTE 0.0408 −0.002277 4 4.236689 0.013299


Borderline-SMOTE 0.0506 −0.003951 4 3.485430 0.025227
Safe-Level-SMOTE 0.0380 −0.000930 4 6.545889 0.002816

SVM SMOTE 0.0380 −0.002107 4 4.459799 0.011162


Borderline-SMOTE 0.0500 −0.001882 4 5.270463 0.006210
Safe-Level-SMOTE 0.0250 −0.001350 4 4.398846 0.011702

Ripper SMOTE 0.0190 −0.001085 4 3.343123 0.028752


Borderline-SMOTE 0.0106 −0.001565 4 1.420804 0.228413
Safe-Level-SMOTE 0.0112 −0.000926 4 1.939703 0.124420

k-NN SMOTE 0.0154 −0.000068 4 11.952720 0.000281


Borderline-SMOTE 0.0166 −0.001026 4 3.434564 0.026425
Safe-Level-SMOTE −0.0078 0.000571 4 −2.750848 0.051330

AUC
C4.5 SMOTE 0.0230 0.000216 4 2.513999 0.065776
Borderline-SMOTE 0.0556 0.000158 4 6.217824 0.003406
Safe-Level-SMOTE 0.0292 0.000154 4 5.733210 0.004584

Naïve Bayes SMOTE 0.0750 0.000418 4 8.863490 0.000895


Borderline-SMOTE 0.1286 0.000379 4 10.659090 0.000439
Safe-Level-SMOTE 0.0664 0.000413 4 8.163386 0.001226

SVM SMOTE 0.1024 −0.001479 4 5.663946 0.004791


Borderline-SMOTE 0.1706 −0.006359 4 4.636349 0.009761
Safe-Level-SMOTE 0.0582 −0.000284 4 5.951182 0.004000

Ripper SMOTE 0.0696 −0.000090 4 8.609010 0.001001


Borderline-SMOTE 0.0616 0.000142 4 10.879250 0.000405
Safe-Level-SMOTE 0.0606 −0.000256 4 7.845687 0.001426

k-NN SMOTE 0.0466 0.000386 4 6.285829 0.003272


Borderline-SMOTE 0.0570 0.000118 4 35.349900 0.000004
Safe-Level-SMOTE 0.0064 0.000333 4 1.130312 0.321529

Fig. 17 Experimental results on Haberman: (a) F-value, (b) AUC

Table 13 Paired t-tests on Haberman using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0732 −0.003878 4 6.491872 0.002903
Borderline-SMOTE 0.0270 −0.003400 4 2.570846 0.061923
Safe-Level-SMOTE 0.0302 −0.004538 4 2.165222 0.096324

Naïve Bayes SMOTE 0.1296 −0.007579 4 3.837142 0.018505


Borderline-SMOTE 0.1418 −0.025184 4 3.055268 0.037833
Safe-Level-SMOTE 0.1314 −0.010871 4 3.444254 0.026192

SVM SMOTE 0.0402 0.014313 4 1.185774 0.301337


Borderline-SMOTE 0.0986 −0.014592 4 1.403872 0.233031
Safe-Level-SMOTE 0.0486 0.010698 4 1.335719 0.252576

Ripper SMOTE 0.0646 −0.002337 4 8.413056 0.001093


Borderline-SMOTE 0.0484 −0.005066 4 3.324439 0.029257
Safe-Level-SMOTE 0.0400 −0.004840 4 2.819980 0.047829

k-NN SMOTE 0.0400 −0.002592 4 5.557700 0.005131


Borderline-SMOTE 0.0352 −0.006449 4 2.310794 0.081961
Safe-Level-SMOTE −0.0146 0.001271 4 −3.921670 0.017223

AUC
C4.5 SMOTE 0.0740 0.001345 4 5.256297 0.006270
Borderline-SMOTE 0.0418 0.000212 4 6.277380 0.003288
Safe-Level-SMOTE 0.0560 0.000613 4 4.732864 0.009085

Naïve Bayes SMOTE 0.1380 0.001065 4 10.064680 0.000548


Borderline-SMOTE 0.1498 0.001006 4 10.726300 0.000428
Safe-Level-SMOTE 0.1298 0.001067 4 10.62152 0.000445

SVM SMOTE 0.1148 −0.000636 4 2.791098 0.049257


Borderline-SMOTE 0.1326 0.001977 4 3.845078 0.018380
Safe-Level-SMOTE 0.1160 −0.000530 4 2.891070 0.044515

Ripper SMOTE 0.0884 0.001208 4 6.025188 0.003823


Borderline-SMOTE 0.0658 −0.000710 4 5.689777 0.004712
Safe-Level-SMOTE 0.0484 0.000839 4 4.979909 0.007598

k-NN SMOTE 0.0772 0.001935 4 5.213836 0.006455


Borderline-SMOTE 0.0714 0.000395 4 13.267800 0.000187
Safe-Level-SMOTE 0.0248 0.001857 4 1.787274 0.148421

Fig. 18 Experimental results on Glass: (a) F-value, (b) AUC

Table 14 Paired t-tests on Glass using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0090 0.000200 4 1.936492 0.124880
Borderline-SMOTE 0.0140 −0.000028 4 4.128375 0.014513
Safe-Level-SMOTE −0.0026 0.000306 4 −0.430590 0.688953

Naïve Bayes SMOTE 0.0498 −0.000035 4 10.831210 0.000412


Borderline-SMOTE 0.0826 −0.000513 4 13.250390 0.000187
Safe-Level-SMOTE 0.0262 −0.000014 4 12.43397 0.000241

SVM SMOTE 0.0186 −0.000496 4 3.230024 0.031975


Borderline-SMOTE 0.0024 0.000076 4 −1.596460 0.185622
Safe-Level-SMOTE 0.0088 −0.000530 4 2.022054 0.113233

Ripper SMOTE 0.0054 −0.000451 4 1.400827 0.233872


Borderline-SMOTE 0.0002 −0.000218 4 0.097129 0.927296
Safe-Level-SMOTE 0.0074 0.000117 4 1.296849 0.264432

k-NN SMOTE 0.0018 −0.000269 4 0.406164 0.705412


Borderline-SMOTE −0.0028 0.000237 4 −0.736840 0.502100
Safe-Level-SMOTE −0.0090 0.000165 4 −1.982940 0.118403

AUC
C4.5 SMOTE 0.0146 −0.000024 4 4.673346 0.009495
Borderline-SMOTE 0.0164 −0.000005 4 5.776654 0.004460
Safe-Level-SMOTE 0.0018 0.000085 4 0.387837 0.717891

Naïve Bayes SMOTE 0.0152 0.000006 4 5.898744 0.004132


Borderline-SMOTE 0.0228 0.000026 4 11.072660 0.000378
Safe-Level-SMOTE 0.0020 0.000017 4 1.450953 0.220415

SVM SMOTE 0.0190 −0.000184 4 4.023474 0.015819


Borderline-SMOTE −0.0020 0.000057 4 −1.174440 0.305365
Safe-Level-SMOTE 0.0114 −0.000120 4 5.338539 0.005931

Ripper SMOTE 0.0092 −0.000257 4 2.102891 0.103316


Borderline-SMOTE 0.0006 −0.000245 4 −0.124950 0.906594
Safe-Level-SMOTE 0.0226 −0.000028 4 9.335974 0.000733

k-NN SMOTE 0.0002 0.000011 4 0.179605 0.866194


Borderline-SMOTE 0.0078 0.000006 4 3.147824 0.034586
Safe-Level-SMOTE −0.0008 −0.000002 4 −0.267560 0.802268

Fig. 19 Experimental results on Segmentation: (a) F-value, (b) AUC

Table 15 Paired t-tests on Segmentation using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0044 −0.000191 4 1.632993 0.177808
Borderline-SMOTE 0.0046 0.000009 4 5.276562 0.006185
Safe-Level-SMOTE 0.0002 −0.000130 4 0.077615 0.941862

Naïve Bayes SMOTE 0.0224 0.000449 4 14.281718 0.000140


Borderline-SMOTE 0.0316 0.000578 4 16.296456 0.000083
Safe-Level-SMOTE 0.0242 0.000020 4 20.166667 0.000035

SVM SMOTE 0.0268 −0.004571 4 1.714500 0.161587


Borderline-SMOTE 0.0324 −0.005935 4 1.624555 0.179582
Safe-Level-SMOTE 0.0246 −0.004340 4 1.643070 0.175713

Ripper SMOTE 0.0030 −0.000137 4 1.978141 0.119055


Borderline-SMOTE 0.0008 0.000064 4 0.644658 0.554258
Safe-Level-SMOTE −0.0018 0.000041 4 −1.326980 0.255196

k-NN SMOTE 0.0004 0.000081 4 0.293294 0.783884


Borderline-SMOTE −0.0012 0.000082 4 −0.861550 0.437520
Safe-Level-SMOTE 0.0000 0.000043 4 0.000000 1.000000

AUC
C4.5 SMOTE 0.0010 −0.000069 4 0.482243 0.654834
Borderline-SMOTE 0.0024 0.000003 4 1.530184 0.200715
Safe-Level-SMOTE −0.0018 0.000011 4 −1.230450 0.285939

Naïve Bayes SMOTE 0.0610 0.000409 4 7.154226 0.002020


Borderline-SMOTE 0.0540 0.000398 4 7.235460 0.001936
Safe-Level-SMOTE 0.0608 0.000404 4 7.609518 0.001601

SVM SMOTE 0.0180 −0.001759 4 1.714675 0.161554


Borderline-SMOTE 0.0178 −0.002218 4 1.287154 0.267473
Safe-Level-SMOTE 0.0166 −0.001630 4 1.682356 0.167791

Ripper SMOTE 0.0042 −0.000057 4 3.628247 0.022194


Borderline-SMOTE 0.0004 0.000103 4 0.157378 0.882572
Safe-Level-SMOTE 0.0000 0.000015 4 0.000000 1.000000

k-NN SMOTE −0.0008 0.000003 4 −1.632990 0.177808


Borderline-SMOTE −0.0008 0.000006 4 −0.749270 0.495354
Safe-Level-SMOTE −0.0014 0.000005 4 −1.870830 0.134702

Fig. 20 Experimental results on Satimage: (a) F-value, (b) AUC

Table 16 Paired t-tests on Satimage using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0054 0.000303 4 3.857143 0.018191
Borderline-SMOTE 0.0126 −0.001260 4 2.772078 0.050224
Safe-Level-SMOTE −0.0014 −0.000034 4 −0.465120 0.666039

Naïve Bayes SMOTE 0.0868 0.000805 4 26.006210 0.000013


Borderline-SMOTE 0.1218 0.000742 4 31.832940 0.000006
Safe-Level-SMOTE 0.0812 0.000903 4 21.442800 0.000028

SVM SMOTE 0.3258 −0.012232 4 2.387673 0.075358


Borderline-SMOTE 0.3260 −0.014095 4 2.367196 0.077056
Safe-Level-SMOTE 0.1100 −0.097390 4 0.781618 0.478116

Ripper SMOTE 0.0328 −0.000556 4 8.114239 0.001254


Borderline-SMOTE 0.0372 −0.001753 4 5.771779 0.004473
Safe-Level-SMOTE 0.0236 −0.000460 4 8.162231 0.001226

k-NN SMOTE −0.0018 0.000414 4 −0.619590 0.569079


Borderline-SMOTE 0.0150 0.000672 4 3.506434 0.024752
Safe-Level-SMOTE 0.0024 0.000589 4 0.656611 0.547287

AUC
C4.5 SMOTE −0.0016 0.000318 4 −0.593820 0.584588
Borderline-SMOTE −0.0034 −0.000334 4 −1.077330 0.341970
Safe-Level-SMOTE −0.0122 0.000069 4 −3.909130 0.017407

Naïve Bayes SMOTE 0.0770 0.000031 4 11.568810 0.000319


Borderline-SMOTE 0.0384 0.000070 4 11.494740 0.000327
Safe-Level-SMOTE 0.0354 0.000068 4 12.029410 0.000274

SVM SMOTE 0.1574 0.006228 4 3.160120 0.034180


Borderline-SMOTE 0.1552 0.005330 4 3.074298 0.037137
Safe-Level-SMOTE 0.0288 −0.025820 4 0.423166 0.693934

Ripper SMOTE 0.0114 0.000277 4 2.604396 0.059771


Borderline-SMOTE 0.0144 0.000110 4 8.231932 0.001187
Safe-Level-SMOTE 0.0090 −0.000075 4 6.708204 0.002570

k-NN SMOTE −0.0050 0.000062 4 −1.162480 0.309671


Borderline-SMOTE −0.0084 0.000064 4 −2.342390 0.079171
Safe-Level-SMOTE −0.0078 0.000062 4 −2.220430 0.090569

Fig. 21 Experimental results on Ecoli: (a) F-value, (b) AUC

Table 17 Paired t-tests on Ecoli using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.0018 0.000604 4 0.243066 0.819910
Borderline-SMOTE −0.0132 0.001313 4 −1.466305 0.216450
Safe-Level-SMOTE 0.0018 −0.002122 4 0.183483 0.863345

Naïve Bayes SMOTE 0.0436 −0.001155 4 6.355658 0.003141


Borderline-SMOTE 0.0036 0.000334 4 0.366356 0.732654
Safe-Level-SMOTE 0.0134 0.000400 4 2.790456 0.049289

SVM SMOTE 0.0112 −0.000152 4 2.023362 0.113064


Borderline-SMOTE 0.0142 −0.000207 4 2.826465 0.047515
Safe-Level-SMOTE −0.0090 0.000004 4 −2.804300 0.048598

Ripper SMOTE −0.0188 0.001567 4 −1.840486 0.139520


Borderline-SMOTE −0.0286 0.001411 4 −3.373824 0.027945
Safe-Level-SMOTE −0.0292 0.000436 4 −3.576971 0.023231

k-NN SMOTE 0.0064 −0.000008 4 1.554057 0.195138


Borderline-SMOTE −0.0044 0.000104 4 −1.731157 0.158468
Safe-Level-SMOTE −0.0072 −0.000135 4 −3.115740 0.035673

AUC
C4.5 SMOTE −0.0018 0.000016 4 −0.582772 0.591320
Borderline-SMOTE 0.0056 −0.000003 4 3.055050 0.037841
Safe-Level-SMOTE 0.0172 −0.001113 4 1.326456 0.255353

Naïve Bayes SMOTE −0.0018 0.000003 4 −1.087420 0.337989


Borderline-SMOTE −0.0068 0.000007 4 −6.106580 0.003640
Safe-Level-SMOTE −0.0082 0.000010 4 −7.083720 0.002097

SVM SMOTE 0.0116 −0.000113 4 2.583525 0.061100


Borderline-SMOTE 0.0020 −0.000127 4 0.559017 0.605967
Safe-Level-SMOTE −0.0064 0.000049 4 −2.254300 0.087229

Ripper SMOTE −0.0208 0.000852 4 −1.800610 0.146134


Borderline-SMOTE −0.0142 0.000648 4 −1.854350 0.137296
Safe-Level-SMOTE −0.0196 0.000091 4 −3.152280 0.034438

k-NN SMOTE 0.0016 0.000009 4 1.485563 0.211579


Borderline-SMOTE 0.0060 −0.000030 4 4.140393 0.014372
Safe-Level-SMOTE 0.0032 −0.000451 4 1.777778 0.150072

Fig. 22 Experimental results on Yeast: (a) F-value, (b) AUC

Table 18 Paired t-tests on Yeast using the indicator variable DBSMOTE

Classifier    Technique                Variable tested
                                       μ          σ²          df    t           p

F-value
C4.5 SMOTE 0.2266 −0.025624 4 4.321595 0.012432
Borderline-SMOTE 0.1000 −0.003899 4 6.558258 0.002796
Safe-Level-SMOTE 0.0150 −0.007720 4 0.682877 0.532184

Naïve Bayes SMOTE 0.1462 0.001518 4 10.404710 0.000482


Borderline-SMOTE 0.2666 0.000285 4 7.721343 0.001515
Safe-Level-SMOTE 0.0644 0.001240 4 4.308102 0.012565

SVM SMOTE 0.0894 0.001136 4 6.384737 0.003088


Borderline-SMOTE 0.1996 −0.000634 4 6.416182 0.003032
Safe-Level-SMOTE 0.0058 0.000887 4 0.423189 0.693919

Ripper SMOTE 0.1524 −0.001344 4 15.437310 0.000103


Borderline-SMOTE 0.0760 −0.000402 4 6.536199 0.002831
Safe-Level-SMOTE −0.0586 0.000445 4 −8.524140 0.001039

k-NN SMOTE 0.1126 −0.000282 4 7.489047 0.001700


Borderline-SMOTE 0.0752 −0.000603 4 4.867951 0.008232
Safe-Level-SMOTE −0.0404 −0.001660 4 −2.906850 0.043816

AUC
C4.5 SMOTE 0.0170 −0.004680 4 0.563792 0.603004
Borderline-SMOTE −0.0634 0.002637 4 −3.858690 0.018168
Safe-Level-SMOTE −0.0462 −0.001380 4 −7.083420 0.002097

Naïve Bayes SMOTE 0.0380 −0.000062 4 14.904830 0.000118


Borderline-SMOTE −0.0040 −0.000155 4 −1.557000 0.194462
Safe-Level-SMOTE −0.0096 −0.000400 4 −1.809070 0.144704

SVM SMOTE 0.0560 0.000344 4 6.746498 0.002516


Borderline-SMOTE 0.1176 −0.000495 4 7.153465 0.002021
Safe-Level-SMOTE 0.0016 0.000308 4 0.184188 0.862827

Ripper SMOTE 0.0586 −0.000375 4 9.328656 0.000735


Borderline-SMOTE −0.0146 −0.000424 4 −2.964200 0.041382
Safe-Level-SMOTE −0.0600 0.001026 4 −9.393360 0.000716

k-NN SMOTE 0.0084 −0.000283 4 0.892104 0.422754


Borderline-SMOTE −0.0062 −0.000594 4 −0.781500 0.478178
Safe-Level-SMOTE −0.0172 0.000804 4 −1.356220 0.246532

References

1. Bai X, Yang X, Yu D, Latecki LJ (2008) Skeleton-based shape classification using path similarity. Int J Pattern Recognit Artif Intell 22(4):733–746
2. Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29
3. Blake CL, Merz CJ (2009) UCI Repository of machine learning databases. http://archive.ics.uci.edu/ml/. Department of Information and Computer Sciences, University of California, Irvine, California, USA
4. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(6):1145–1159
5. Buckland M, Gey F (1994) The relationship between recall and precision. J Am Soc Inf Sci 45(1):12–19
6. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho T-B (eds) 13th Pacific-Asia conference on knowledge discovery and data mining, Bangkok, Thailand. Lecture notes in artificial intelligence, vol 5476. Springer, Heidelberg, pp 475–482
7. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:341–378
8. Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: The 7th European conference on principles and practice of knowledge discovery in databases, Cavtat-Dubrovnik, Croatia, pp 107–119
9. Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6
10. Chiang I-J, Shieh M-J, Hsu JY, Wong J-M (2005) Building a medical decision support system for colon polyp screening by using fuzzy classification trees. Appl Intell 22(1):61–75. Special issue: Foundations and Advances in Data Mining
11. Cohen WW (1995) Fast effective rule induction. In: 12th international conference on machine learning, Lake Tahoe, California, USA, pp 115–123
12. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press, Cambridge
13. Cover T, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
14. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: The 5th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, California, USA, pp 155–164
15. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: The 2nd international conference on knowledge discovery and data mining, Portland, Oregon, USA, pp 226–231
16. Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) The 2005 international conference on intelligent computing, Hefei, China. Lecture notes in computer science, vol 3644. Springer, Heidelberg, pp 878–887
17. Hu X (2005) A data mining approach for retailing bank customer attrition analysis. Appl Intell 22(1):47–60. Special issue: Foundations and Advances in Data Mining
18. Japkowicz N (2000) The class imbalance problem: significance and strategies. In: 2000 international conference on artificial intelligence, Las Vegas, Nevada, USA, pp 111–117
19. Japkowicz N (2003) Class imbalance: are we focusing on the right issue? In: 20th international conference on machine learning, Washington, District of Columbia, USA, pp 17–23
20. Jungnickel D (2003) Graphs, networks and algorithms. Springer, Heidelberg
21. Kamber M, Han J (2000) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann, San Mateo
22. Khor K-C, Ting C-Y, Phon-Amnuaisuk S (2010) A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection. Appl Intell. doi:10.1007/s10489-010-0263-y
23. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: 14th international conference on machine learning, Nashville, Tennessee, USA, pp 179–186
24. Kubat M, Holte R, Matwin S (1997) Learning when negative examples abound. In: 9th European conference on machine learning, Prague, Czech Republic, pp 146–153
25. Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: 11th international conference on machine learning, New Brunswick, New Jersey, USA, pp 148–156
26. Lu Y, Chen TQ, Hamilton B (1998) A fuzzy diagnostic model and its application in automotive engineering diagnosis. Appl Intell 9(3):231–243
27. Murphey YL, Chen ZH, Feldkamp LA (2008) An incremental neural learning framework and its application to vehicle diagnostics. Appl Intell 28(1):29–49
28. Prati RC, Batista GEAPA, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Monroy R, Arroyo G, Sucar LE, Sossa H (eds) 3rd Mexican international conference on artificial intelligence, Mexico City, Mexico. Lecture notes in artificial intelligence, vol 2972, pp 312–321
29. Quinlan JR (1992) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
30. Tetko IV, Livingstone DJ, Luik AI (1995) Neural network studies. 1. Comparison of overfitting and overtraining. J Chem Inf Comput Sci 35(5):826–833
31. Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6(11):769–772

Chumphol Bunkhumpornpat received his B.Eng. in Computer Engineering from Rangsit University and his M.S. in Computer Science from Chiang Mai University. He was a lecturer in the Department of Computer Science, Chiang Mai University. He is currently a Ph.D. candidate in Computer Science at Chulalongkorn University. His research focus is Data Mining, especially the Class Imbalance Problem and Clustering.

Krung Sinapiromsaran received his B.S. in Mathematics from Chulalongkorn University, and his M.S. and Ph.D. in Computer Science from the University of Wisconsin-Madison. He is currently an Assistant Professor in the Department of Mathematics, Chulalongkorn University. His ongoing research works are related to the theory and application of Mathematical Programming, Knowledge Discovery and Discrete Algorithms.

Chidchanok Lursinsap received his B.Eng. in Computer Engineering from Chulalongkorn University, and his M.S. and Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He was a Lecturer at the Department of Computer Engineering, Chulalongkorn University from 1978 to 1979, and a visiting Assistant Professor from 1986 to 1987 with the Department of Computer Science and Center for Advanced Studies, University of Illinois at Urbana-Champaign. Presently, he is a Professor at the Department of Mathematics, Chulalongkorn University. His research interests include Design Automation, Silicon Compilation, Neural Networks, Computer Architecture, and Artificial Intelligence.