Searching for just a few words should be enough to get started. If you need to make more complex queries, use the tips below to guide you.
Article type: Research Article
Authors: Wu, Xu | Yang, Youlong* | Ren, Lingyu
Affiliations: School of Mathematics and Statistics, Xidian University, Xi’an, Shaanxi 710126, China
Correspondence: [*] Corresponding author: Youlong Yang, School of Mathematics and Statistics, Xidian University, Xi’an, Shaanxi, China. E-mail: [email protected].
Abstract: Class imbalance is often a problem in various real-world datasets, where one class contains a small number of data and the other contains a large number of data. It is notably difficult to develop an effective model using traditional data mining and machine learning algorithms without using data preprocessing techniques to balance the dataset. Oversampling is often used as a pretreatment method for imbalanced datasets. Specifically, synthetic oversampling techniques focus on balancing the number of training instances between the majority class and the minority class by generating extra artificial minority class instances. However, the current oversampling techniques simply consider the imbalance of quantity and pay no attention to whether the distribution is balanced or not. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS) which considers the imbalance degree of dataset from distribution by entropy difference and overcomes the limitation of SMOTE for nonlinear problems by oversampling in the feature space of support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating its retention capability. Our algorithm can effectively distinguish those datasets with the same imbalance ratio but different distribution. The experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, and then demonstrates that the proposed approach is competitive with the state-of-art algorithms on multiple benchmark imbalanced datasets.
Keywords: Imbalanced dataset, oversampling, entropy difference, kernel space, SVM
DOI: 10.3233/IDA-194761
Journal: Intelligent Data Analysis, vol. 24, no. 6, pp. 1239-1255, 2020
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
如果您在出版方面需要帮助或有任何建, 件至: [email protected]