Published June 12, 2023 | Version 1
Dataset | Open

Dataset for: A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification

  • 1. The University of Manchester, UK
  • 2. Technology Innovation Institute, UAE

Description

We present a novel solution that combines Large Language Model (LLM) capabilities with formal verification strategies to falsify and automatically repair software vulnerabilities. First, we employ Bounded Model Checking (BMC) to locate the software vulnerability and derive a counterexample. Because they rest on mathematical proofs, counterexamples provide evidence that the system behaves incorrectly or contains a vulnerability, thereby preventing false positive alerts. The detected counterexample, together with the source code, is provided to the LLM engine. Our approach establishes a specialized prompt language for code debugging and generation, so that the LLM can understand the vulnerability's root cause and repair the code. Finally, we use BMC to verify the corrected version of the code generated by the LLM. As a proof of concept, we create ESBMC-AI based on the Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained Transformer model, specifically gpt-3.5-turbo, to detect and fix errors in C programs. We generated a dataset comprising 1,000 C code samples, each consisting of 20 to 50 lines of C code. Experimental results show that our proposed method achieved a success rate of up to 80% in repairing vulnerable code, encompassing buffer overflow, arithmetic overflow, and pointer dereference failures. To our knowledge, ESBMC-AI is the first proposal to integrate a Large Language Model (LLM) with software model checking. We advocate that this automated approach has the potential to be incorporated into the continuous integration and deployment (CI/CD) process of the software development lifecycle.
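To illustrate the class of defects targeted, the short C program below is a hypothetical sketch (it is not a sample taken from the uploaded dataset) of an off-by-one buffer overflow: BMC reports the out-of-bounds write together with a concrete counterexample, and the repair the LLM is asked to produce amounts to tightening the loop bound, as noted in the comments.

    #include <stdio.h>

    #define N 10

    int main(void) {
        int a[N];
        /* Off-by-one defect: valid indices are 0..N-1, but the loop also
           writes a[N], so bounded model checking reports an array-bounds
           violation together with a concrete counterexample (e.g. i == 10). */
        for (int i = 0; i <= N; i++)
            a[i] = i * i;
        /* The intended repair changes the loop condition to i < N; the
           patched program is then re-verified with BMC. */
        printf("%d\n", a[N - 1]);
        return 0;
    }

In the workflow described above, this counterexample and the original source are what gpt-3.5-turbo receives in the repair prompt, and the patched program is handed back to ESBMC for re-verification.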

 

The uploaded dataset contains 1,000 C code samples, each 20 to 50 lines long, generated with gpt-3.5-turbo. The upload also includes a statically compiled version of ESBMC with all its dependencies, a classifier script, and the output file.

Files

  • ESBMC-LLM.zip, 110.4 MB (md5:cff61e2a80d972394d4a3b4d8bfc4693)
  • 3.7 kB (md5:ad1b4dc329274c78f6dafbdd1b732717)
  • 9.2 kB (md5:ad887478717a5704baca8e726d65c39e)
  • 541 Bytes (md5:1e74bc6d09782f584f7a3bfdca7f495d)

Total size: 110.4 MB

Additional details

Related works

Is supplement to
Preprint: 10.48550/arXiv.2305.14752 (DOI)