Categorical Feature Encoding Techniques for Improved Classifier Performance when Dealing with Imbalanced Data of Fraudulent Transactions
DOI:
https://doi.org/10.15837/ijccc.2023.3.5433
Keywords:
imbalanced data, classifier, feature encoding, high-cardinality, fraud detection
Abstract
Fraudulent transaction data tend to contain several high-cardinality categorical features. Preprocessing becomes complicated when the categories in such features have no natural order or meaningful mapping to numerical values. Although many encoding techniques exist, their impact on massive, highly imbalanced data sets has not been thoroughly evaluated.
Our study used two transaction datasets in which fraudulent transactions account for less than 1% of the records. Six encoding methods were employed, each belonging to either the target-agnostic or the target-based group. The experiments involved several machine-learning techniques, including ensemble learning as well as linear and non-linear learning approaches.
Our study emphasizes the importance of carefully selecting an encoding approach that suits both the imbalanced dataset and the machine learning algorithm. Target-based encoding techniques can enhance model performance significantly. Among the encoding methods assessed, the James-Stein and Weight of Evidence (WOE) encoders were the most effective, whereas the CatBoost encoder may not be optimal for imbalanced datasets. Moreover, the curse of dimensionality must be kept in mind when employing encoding techniques such as hashing and One-Hot encoding.
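To make the idea of a target-based encoder concrete, the following is a minimal sketch of Weight of Evidence encoding, assuming the common definition WOE(c) = ln(P(c | fraud) / P(c | non-fraud)). It is not necessarily the implementation used in the study. The function name `woe_encode`, the `smoothing` parameter, and the toy merchant data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def woe_encode(categories, targets, smoothing=0.5):
    """Return a dict mapping each category to a smoothed WOE value.

    `smoothing` is an assumed additive constant that keeps the logarithm
    finite for categories observed in only one of the two classes.
    """
    pos = Counter(c for c, y in zip(categories, targets) if y == 1)
    neg = Counter(c for c, y in zip(categories, targets) if y == 0)
    levels = set(categories)
    # Smoothed class totals, so each class distribution still sums to 1.
    total_pos = sum(pos.values()) + smoothing * len(levels)
    total_neg = sum(neg.values()) + smoothing * len(levels)
    return {
        c: math.log(
            ((pos[c] + smoothing) / total_pos)
            / ((neg[c] + smoothing) / total_neg)
        )
        for c in levels
    }


# Hypothetical toy data: a merchant identifier and a fraud label per row.
merchants = ["m1", "m1", "m1", "m2", "m2", "m3", "m3", "m3", "m3", "m3"]
is_fraud = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

woe = woe_encode(merchants, is_fraud)
# Each row becomes a single number, regardless of how many categories exist.
encoded = [woe[m] for m in merchants]
```

The encoded feature stays one column wide however many distinct merchants appear, whereas One-Hot encoding would add one column per category. In practice the mapping would be fitted on training data only, so that the labels of held-out transactions do not leak into the encoding.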
Copyright (c) 2023 Gintautas Dzemyda, Dalia Breskuvienė
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free under the terms of the Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.