Efficient Feature Selection for Static Analysis Vulnerability Prediction
Abstract
1. Introduction
2. Related Works
3. Methodology
3.1. Dataset
3.2. Feature Selection
3.3. ML Based Evaluation
4. Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Elements | Label | Cardinality | Total
---|---|---|---
All | Vulnerable | 3386 | 7534
All | Neutral | 4148 |
CWE-399 | Vulnerable | 684 | 1498
CWE-399 | Neutral | 814 |
CWE-119 | Vulnerable | 2702 | 6036
CWE-119 | Neutral | 3334 |
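The class balance reported above can be sanity-checked with a few lines of Python. This is a hypothetical sketch: the `(cwe, label)` pairs below are a toy stand-in built from the table's counts, not the dataset's actual storage format.

```python
# Toy stand-in for the parsed dataset: one (cwe_id, label) pair per sample,
# replicated to match the per-group counts from the table above.
from collections import Counter

samples = (
    [("CWE-399", "Vulnerable")] * 684 + [("CWE-399", "Neutral")] * 814
    + [("CWE-119", "Vulnerable")] * 2702 + [("CWE-119", "Neutral")] * 3334
)

counts = Counter(samples)
total_vulnerable = sum(n for (cwe, lbl), n in counts.items() if lbl == "Vulnerable")
total_neutral = sum(n for (cwe, lbl), n in counts.items() if lbl == "Neutral")

# The per-CWE counts sum to the "All" row: 3386 vulnerable, 4148 neutral.
print(total_vulnerable, total_neutral)  # → 3386 4148
```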
Metric | Minor_Violations | CCCC_Lines_of_Code | Comment_Lines | Duplicated_Blocks | CCCC_LOC/COM | CCCC_MVG/COM | Effort_to_Reach_Maintainability_Rating_A
---|---|---|---|---|---|---|---
CWE-399 | | | | | | |
Pearson | ≪0.001 | ≪0.001 | ≪0.001 | 0.4415 | 0.0002 | 0.0231 | 0.0513
Spearman | ≪0.001 | 0.0726 | 0.3540 | 0.0033 | 0.0723 | ≪0.001 | 0.2308
Kendall | ≪0.001 | 0.0726 | 0.3539 | 0.00336 | 0.0723 | ≪0.001 | 0.2307
CWE-119 | | | | | | |
Pearson | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | 0.00439 | 0.0088 | ≪0.001
Spearman | 0.0694 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001
Kendall | 0.0694 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001
All | | | | | | |
Pearson | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | ≪0.001 | 0.0009 | 0.0024
Spearman | ≪0.001 | ≪0.001 | 0.0011 | ≪0.001 | 0.0027 | 0.0581 | ≪0.001
Kendall | ≪0.001 | ≪0.001 | 0.0011 | ≪0.001 | 0.0027 | 0.0581 | ≪0.001
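P-values of the kind tabulated above can be obtained by correlating each metric column against the binary vulnerability label with the three tests from `scipy.stats`. The data below is synthetic and purely illustrative; it only shows the mechanics, not the study's actual values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=200)             # 0 = neutral, 1 = vulnerable
metric = label * 3 + rng.normal(0, 1, size=200)  # a metric loosely tied to the label

# Each test returns (statistic, p-value) for the metric/label pair.
r_p, p_pearson = stats.pearsonr(metric, label)
r_s, p_spearman = stats.spearmanr(metric, label)
tau, p_kendall = stats.kendalltau(metric, label)

# A metric is considered significantly associated with the label when its
# p-value falls below the chosen significance threshold.
alpha = 0.05
print(p_pearson < alpha, p_spearman < alpha, p_kendall < alpha)
```

Since the synthetic metric is strongly tied to the label, all three tests reject the null hypothesis of no association here; real software metrics produce the mixed pattern seen in the table.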
Rank | Spearman Correlation | Information Gain | Gain Ratio | Gini Decrease | Chi-squared |
---|---|---|---|---|---
1 | code_smells | violations | blocker_violations | violations | critical_violations |
2 | open_issues | open_issues | minor_violations | open_issues | violations |
3 | violations | code_smells | violations | code_smells | open_issues |
4 | major_violations | major_violations | open_issues | major_violations | code_smells |
5 | sqale_index | minor_violations | code_smells | minor_violations | major_violations |
6 | comment_lines_density | sqale_index | major_violations | sqale_index | duplicated_lines_density |
7 | duplicated_lines_density | comment_lines_density | critical_violations | comment_lines_density | sqale_index |
8 | critical_violations | sqale_debt_ratio | sqale_index | sqale_debt_ratio | comment_lines_density |
9 | info_violations | duplicated_lines_density | comment_lines_density | duplicated_lines_density | duplicated_lines |
10 | statements | critical_violations | info_violations | critical_violations | uncovered_lines |
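Rankings like those above can be reproduced in outline with scikit-learn scorers as stand-ins: mutual information for information gain, the chi-squared statistic, and random-forest impurity importances for Gini decrease. The feature names are taken from the table, but the data and the specific scorers here are illustrative assumptions, not the study's exact tooling.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(1)
features = ["violations", "open_issues", "code_smells", "sqale_index"]

# Synthetic feature matrix; only the first two columns influence the label.
X = rng.random((300, len(features)))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.random(300) > 1.0).astype(int)

# One score vector per criterion.
mi = mutual_info_classif(X, y, random_state=1)        # information gain proxy
chi, _ = chi2(X, y)                                   # chi-squared statistic
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
gini = rf.feature_importances_                        # mean Gini decrease

# Sort feature names by descending score to get a ranking per criterion.
rank_by_mi = [features[i] for i in np.argsort(mi)[::-1]]
print(rank_by_mi)
```

Each criterion yields its own ordering, which is why the five columns of the table agree on the dominant features but differ in the exact ranks.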
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Filus, K.; Boryszko, P.; Domańska, J.; Siavvas, M.; Gelenbe, E. Efficient Feature Selection for Static Analysis Vulnerability Prediction. Sensors 2021, 21, 1133. https://doi.org/10.3390/s21041133