FindICI: Using Machine Learning to Detect Linguistic Inconsistencies between Code and Natural Language Descriptions in Infrastructure as Code

This repository contains the code for the paper FindICI: Using Machine Learning to Detect Linguistic Inconsistencies between Code and Natural Language Descriptions in Infrastructure as Code. The contents are explained below.

The folder 1 Find and merge repositories covers data collection: detecting Ansible repositories, extracting bug-related commits, and selecting repositories based on the research criteria.
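As an illustration of the bug-related commit extraction step, the following is a minimal sketch that filters commit messages by keyword. The keyword list and the function names `is_bug_related` and `filter_bug_commits` are assumptions made for this example; the actual selection criteria are those implemented in the folder.

```python
import re

# Hypothetical keyword list; the repository's actual criteria may differ.
BUG_KEYWORDS = ("fix", "bug", "error", "fail", "issue", "defect")
BUG_PATTERN = re.compile(r"\b(" + "|".join(BUG_KEYWORDS) + r")\w*", re.IGNORECASE)


def is_bug_related(message: str) -> bool:
    """Return True if the commit message contains a bug-related keyword."""
    return BUG_PATTERN.search(message) is not None


def filter_bug_commits(commits):
    """Keep only (sha, message) pairs whose message looks bug-related."""
    return [(sha, msg) for sha, msg in commits if is_bug_related(msg)]
```

A keyword match on word prefixes is cheap and easy to audit, but it may produce false positives, so a manual check of a sample is advisable.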

The folder 2 Find Ansible tasks implements a heuristic mechanism to identify and extract Ansible tasks from a repository.
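One way such a heuristic could look is sketched below. It assumes the YAML files have already been parsed into Python lists and dictionaries, and it treats any named mapping that is neither a play nor a container of other tasks as a task. The key sets and function names are assumptions for this example, not the repository's actual rules.

```python
# Keys whose values hold nested task lists (assumed, not exhaustive).
CONTAINER_KEYS = ("tasks", "pre_tasks", "post_tasks", "handlers",
                  "block", "rescue", "always")
# Keys that only appear at play level (assumed, not exhaustive).
PLAY_ONLY_KEYS = {"hosts", "roles"}


def looks_like_task(node) -> bool:
    """Heuristic: a named mapping that is neither a play nor a container."""
    if not isinstance(node, dict) or "name" not in node:
        return False
    is_container = any(key in node for key in CONTAINER_KEYS)
    is_play = bool(PLAY_ONLY_KEYS & node.keys())
    return not is_container and not is_play


def extract_tasks(node):
    """Recursively collect task-like mappings from parsed YAML data."""
    found = []
    if isinstance(node, list):
        for item in node:
            found.extend(extract_tasks(item))
    elif isinstance(node, dict):
        if looks_like_task(node):
            found.append(node)
        # Descend only into keys known to hold task lists.
        for key in CONTAINER_KEYS:
            if key in node:
                found.extend(extract_tasks(node[key]))
    return found
```

Restricting the recursion to known container keys avoids mistaking module arguments that happen to contain a `name` field for tasks.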

The folder 3 Map tasks to ansible documentation introduces a mechanism that identifies the module used in an Ansible task by scraping information from the Ansible documentation website. The mechanism identifies the parameters used with each module in the same fashion.
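The mapping step can be sketched as a lookup against a table of documented modules. In the example below, `KNOWN_MODULES` is a small hand-written stand-in for the information scraped from the documentation, and `identify_module` is a hypothetical helper; neither is taken from the repository.

```python
# Hypothetical excerpt of module names and their documented parameters,
# standing in for data scraped from the Ansible documentation.
KNOWN_MODULES = {
    "apt": {"name", "state", "update_cache"},
    "service": {"name", "state", "enabled"},
    "copy": {"src", "dest", "mode"},
}


def identify_module(task):
    """Return (module, documented_params, unknown_params), or None if no module matches."""
    for key, args in task.items():
        if key in KNOWN_MODULES:
            used = set(args) if isinstance(args, dict) else set()
            documented = KNOWN_MODULES[key]
            return key, sorted(used & documented), sorted(used - documented)
    return None
```

Separating documented from unknown parameters makes it easy to spot tasks that the documentation lookup only partially covers.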

The folder 4 Build ast and tokenize implements an AST generator that creates a token sequence from Ansible task bodies.
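A simplified stand-in for this step is shown below. It assumes a task body is already available as nested Python dictionaries and lists, treats that structure as a tree, and emits tokens in depth-first order. The function name `tokenize` and the splitting rule for scalar values are assumptions for this example; the actual AST generator in the folder may work differently.

```python
import re


def tokenize(node):
    """Depth-first walk emitting keys and scalar values as lowercase tokens."""
    tokens = []
    if isinstance(node, dict):
        for key, value in node.items():
            tokens.append(str(key).lower())
            tokens.extend(tokenize(value))
    elif isinstance(node, (list, tuple)):
        for item in node:
            tokens.extend(tokenize(item))
    elif node is not None:
        # Split scalars on non-alphanumeric characters, e.g. "/etc/nginx.conf".
        parts = re.split(r"[^A-Za-z0-9]+", str(node).lower())
        tokens.extend(part for part in parts if part)
    return tokens
```

For example, a task body that installs a package with `apt` would yield the module name first, followed by each parameter name and its value.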

The folder 5 Create inconsistent observations contains the implementation of the transformations applied to Ansible tasks to generate the inconsistent observations.
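To make the idea concrete, the sketch below shows one plausible transformation: the task description is kept, while the module is swapped for a different one, so that the name and the body no longer agree. This particular transformation, the labels 0 and 1, and the function names are assumptions for illustration; the transformations actually applied are those implemented in the folder.

```python
import random


def make_inconsistent(task, module, replacement_pool, rng):
    """Return a copy of the task whose module is swapped for a different one."""
    candidates = [m for m in replacement_pool if m != module]
    if module not in task or not candidates:
        raise ValueError("task does not use the module or no replacement is available")
    new_module = rng.choice(candidates)
    # Rebuild the mapping so that the original key order is preserved.
    return {(new_module if key == module else key): value
            for key, value in task.items()}


def build_dataset(tasks_with_modules, replacement_pool, seed=0):
    """Pair each original task (label 0) with one inconsistent variant (label 1)."""
    rng = random.Random(seed)
    dataset = []
    for task, module in tasks_with_modules:
        dataset.append((task, 0))
        dataset.append((make_inconsistent(task, module, replacement_pool, rng), 1))
    return dataset
```

Using a seeded random generator keeps the generated observations reproducible across runs.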

The folder 6 Detect linguistic inconsistency contains the implementation and results for every classifier we used, namely:

Random Forest
Support Vector Machines
eXtreme Gradient Boosting (XGBoost)
Multi-Layer Perceptron
Convolutional Neural Networks
Long Short-Term Memory

In addition, for each classifier we compared different word embedding techniques, namely:

Word2Vec
Doc2Vec
fastText

For more details on linguistic inconsistency detection, see the folder 6 Detect linguistic inconsistency.

The folder 7 Evaluation using unseen real-world dataset contains the evaluation of our best-performing models on an external real-world dataset. We used the dataset from the prior work of Dalla Palma et al. (2021) [1].

[1] Dalla Palma, S., Di Nucci, D., Palomba, F., & Tamburri, D. A. (2021). Within-project defect prediction of infrastructure-as-code using product and process metrics. IEEE Transactions on Software Engineering, 1-1.

