Multi-granularity relabeled under-sampling algorithm for imbalanced data
Abstract
Imbalanced classification is one of the important and challenging problems in data mining and machine learning. The performance of traditional classifiers is severely affected by data problems such as class imbalance, class overlap, and noise. Class imbalance inevitably arises when one class in the data set contains more instances than the other classes. Many researchers have therefore worked on solving the class imbalance problem and improving the overall classification performance of the classifier. When it was first proposed, the Tomek-Link algorithm was used only to clean data. In recent years, there have been reports of combining the Tomek-Link algorithm with sampling techniques. The Tomek-Link sampling algorithm can effectively reduce class overlap in the data, remove majority instances that are difficult to distinguish, and improve classification accuracy. However, the Tomek-Link under-sampling algorithm considers only boundary instances that are each other's nearest neighbors globally and ignores potential local overlapping instances. When the number of minority instances is small, the under-sampling effect is unsatisfactory, and the improvement in the performance of the classification model is not obvious. Therefore, building on Tomek-Link, a multi-granularity relabeled under-sampling algorithm (MGRU) is proposed. The algorithm fully considers the local information of the data set in the local granularity subspace and detects potential local overlapping instances in the data set. The overlapped majority instances are then eliminated according to the global relabeled index value, which effectively expands the detection range of Tomek-Links.
The simulation results show that, when the optimal global relabeled index value is selected for under-sampling, the classification accuracy and generalization performance of the proposed under-sampling algorithm are significantly better than those of the other baseline algorithms.
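For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of plain Tomek-Link under-sampling, not of MGRU itself. It treats two instances of opposite classes as a Tomek link when each is the other's nearest neighbor over the whole data set, and removes the majority member of each link. The function name `tomek_link_undersample`, the `majority_label` parameter, and the use of Euclidean distance are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def tomek_link_undersample(X, y, majority_label):
    """Remove majority instances that take part in a Tomek link.

    Hypothetical helper for illustration only. Assumes Euclidean
    distance and a single designated majority class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances. The diagonal is set to
    # infinity so that no instance is its own nearest neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)

    # Index of the global nearest neighbor of every instance.
    nn = dist.argmin(axis=1)
    idx = np.arange(len(X))

    # A Tomek link is a pair of mutual nearest neighbors whose
    # labels differ.
    mutual = nn[nn] == idx
    opposite = y[nn] != y
    in_link = mutual & opposite

    # Under-sampling drops only the majority member of each link.
    drop = in_link & (y == majority_label)
    keep = ~drop
    return X[keep], y[keep], idx[drop]


# Toy one-dimensional data: the majority instance at 5.0 and the
# minority instance at 5.4 are mutual nearest neighbors.
X_demo = np.array([[0.0], [1.1], [2.0], [5.0], [5.4], [9.0]])
y_demo = np.array([0, 0, 0, 0, 1, 1])
X_res, y_res, removed = tomek_link_undersample(X_demo, y_demo, majority_label=0)
```

In this toy data only the pair at 5.0 and 5.4 forms a link, so a single majority instance is removed. This illustrates the limitation the abstract points out: the mutual-nearest-neighbor test is global, so overlapping majority instances that are not part of such a pair are left in place.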
Elsevier