UNIVERSITY OF MACEDONIA
MASTER OF SCIENCE
APPLIED INFORMATICS
IDENTIFYING SKILLS USING MACHINE LEARNING TO ANALYZE SOURCE
CODE IN SOFTWARE REPOSITORIES
Dissertation of
Dafni Georgiou
mai21008
Thessaloniki, June 2024

IDENTIFYING SKILLS BY ANALYZING SOURCE CODE REPOSITORIES USING MACHINE LEARNING
Dafni Georgiou
BSc in Business Administration, University of Macedonia, 2015
MSc in Business Analytics, Aston University, 2017
Dissertation
submitted in partial fulfillment of the requirements for the
MASTER OF SCIENCE IN APPLIED INFORMATICS
Supervising Professor
Alexandros Chatzigeorgiou
Approved by the three-member examination committee on dd/mm/yyyy
Alexandros Chatzigeorgiou
Apostolos Ampatzoglou
Eftychios Protopapadakis
...................................
...................................
...................................
Georgiou Dafni
...................................

Abstract (in Greek)
Identifying the skills of software developers is particularly important for today's businesses, which rely on software development to remain competitive and efficient. Given the rapid growth of this field and the increasing demand for experienced developers, this study investigates the use of software repositories (GitHub repositories) as a primary data source for identifying and analyzing developers' skills. This involves examining factors such as the programming languages used, the types of libraries employed, and the specific operations within the code.
Furthermore, this research aims to establish a framework for skill identification through the analysis of various Python libraries across different operations. This framework includes methodologies for extracting and processing data from software or source code repositories, data cleansing procedures, and techniques for generating embeddings. It also incorporates methods for identifying and categorizing skills using machine learning algorithms. By developing this framework, the research aims to provide a standardized approach to skill identification that can be applied universally across various applications and software projects.
The main motivation for this research is to offer a more accurate and reliable method for identifying the skills of software developers. By leveraging the extensive data provided by source code repositories, the study aims to improve the efficiency and effectiveness of software development processes, ultimately contributing to the success of these projects.
Following a detailed review of the literature on the use of software repositories and the methodologies for skill identification, the third chapter describes in detail the extraction and processing of data from these repositories, the methodologies tested, and a brief analysis of the machine learning models. It also provides a detailed analysis of the word2vec methodology for generating embeddings and discusses data cleansing issues related to software repositories.
The main finding is that various machine learning methods, combined with embedding techniques such as word2vec, can effectively estimate programming skills, library types, and the usability of operations using data from GitHub repositories, provided that the source code meets certain prerequisites that ensure the diversity of its features for training and the quality of the tokens. The study also presents various limitations and challenges encountered during the project and offers suggestions for further research on annotated code datasets for skill identification and the training of ensemble methods.
Keywords: word2vec; Skills identification; Classification; Feature extraction; Tokenization; GitHub

Abstract
Identifying the skills of software developers is crucial for organizations that rely on
software development to remain competitive and efficient. With the rapidly evolving
software engineering field and the growing demand for skilled developers, this study
investigates the use of software repositories as a valuable data source for identifying and
analyzing developers' skills. This involves examining factors such as the programming
languages used, types of libraries employed, and specific operations within the source code.
Additionally, this research aims to establish a framework for skill identification by
analyzing various Python libraries across different operations. This framework includes
methodologies for extracting and processing data from software repositories or source code,
necessary data cleansing procedures, and techniques for generating embeddings.
Furthermore, it incorporates methods for identifying and categorizing skills based on the
analyzed data. By developing this framework, the research aims to provide a standardized
approach to skill identification that can be universally applied across various organizations
and software projects.
The motivation for this research is to offer a more accurate and reliable method for
identifying the skills of software developers. By leveraging the extensive data available from
software repositories, the study aims to enhance the efficiency and effectiveness of software
development processes, ultimately contributing to the success of software development
projects.
Following a thorough literature review on the use of software repositories and
methodologies for skill identification, Chapter 3 details the extraction and processing of
data from these repositories, the methodologies tested, and a brief analysis of machine
learning models. It also explains the word2vec technique for generating embeddings and
addresses data cleansing issues related to software repositories. The thesis concludes with
Chapter 4, presenting the main findings, and Chapter 5 discussing the limitations, challenges,
and suggestions for further research.
The main finding is that various machine learning methods, combined with
embedding techniques like word2vec, can effectively estimate programming skills, library
types, and operation usability using data from GitHub repositories, provided the source code
meets certain prerequisites to ensure feature diversity and token quality for training. The
study also presents various limitations and challenges encountered during the project and
offers suggestions for further research on annotated datasets of source code snippets for skill
recognition and the training of ensemble methods.
Keywords: word2vec; Skills identification; Classification; Feature extraction; Tokenization;
GitHub

Acknowledgements
I am very thankful to my supervisor, Alexandros Chatzigeorgiou, for his guidance and
support throughout this project and my MSc studies. I am also grateful to my family and
friends for their support and encouragement during this time. A special thanks to my brother,
Konstantinos, whose support made this project possible.

Contents
Abstract ................................................................................................................................................... 7
Acknowledgements ................................................................................................................................. 9
Contents ................................................................................................................................................ 10
List of Figures ....................................................................................................................................... 12
List of Tables ......................................................................................................................................... 12
1 Introduction ................................................................................................................................... 13
1.1 Research motives ........................................................................................................................ 14
1.2 Aims and Objectives ................................................................................................................... 15
1.3 Contribution ................................................................................................................................ 15
1.4 Thesis Structure .......................................................................................................................... 17
2 Literature Review ............................................................................................................................... 18
2.1 Source code as natural Language and NLP applications ............................................................ 18
2.2 Mining Software Repositories (MSR) ........................................................................................ 21
2.3 Classification of code snippets .................................................................................................... 24
2.3.1 Classification of code snippets for programming language recognition .............................. 24
2.3.2 Classification of code snippets for skills recognition .......................................................... 27
3 Methodology ................................................................................................................................. 30
3.1 Programming Language recognition from source code snippets ................................................ 31
3.1.1 Data Description .................................................................................................................. 31
3.1.2 Data Preprocessing & Dimensionality Reduction ................................................................ 37
3.1.3 Embeddings and Averaging ................................................................................................. 42
3.1.4 Machine Learning Classifiers .............................................................................................. 49
3.2 Identify library type from python source code snippets.............................................................. 59
3.2.1 Data collection & Pre-processing ........................................................................................ 60
3.2.2 Dimensionality reduction with t-sne .................................................................................... 63
3.2.3 Individual classifiers per class ............................................................................................. 67
3.2.4 Ensemble .............................................................................................................................. 68
3.3 Identify arrays in python source code snippets ........................................................................... 71
3.3.1 Training dataset composition ............................................................................ 72
3.3.2 Feature selection with weighted average .......................................................... 77
4 Results summary and conclusions ................................................................................................ 79
4.1 Results Discussion ..................................................................................................
4.1.1 Experiments for programming language recognition .......................................................... 79
4.1.2 Experiments for python library type recognition ................................................................. 83

4.1.3 Experiments for python arrays recognition .......................................................................... 86
4.2 Limitation .............................................................................................................. 87
4.3 Recommendations for future research .................................................................. 89
5 References ..................................................................................................................................... 91

List of Figures
Figure 1. Tech organizations skills shortage worldwide 2015-2023 ....................................... 13
Figure 2. Initial research Methodology Outline ....................................................................... 31
Figure 3. Example of repository structure ............................................................................... 32
Figure 4. Example of an input code file ................................................................................... 33
Figure 5. Example of the Requests python library repository ................................................. 34
Figure 6. Examples of input code file from Requests library .................................................. 35
Figure 7. Share of code files in 2nd Training dataset ................................................ 36
Figure 8. Share of code files in 1st Training dataset .................................................. 36
Figure 9. Graphical representation of the data preparation process .......................... 41
Figure 10. CBOW and Skip-gram architectures ...................................................................... 46
Figure 11. Adjusted Methodology for library identification ................................................... 59
Figure 12. Machine learning library code snippets example ................................................... 62
Figure 13. Machine Learning library code snippets example after tokenization ..................... 62
Figure 14. t-sne clusters for data visualization libraries .......................................................... 64
Figure 15. t-sne clusters for web development libraries .......................................................... 65
Figure 16. t-sne clusters for machine learning libraries ........................................................... 66
Figure 17. t-sne clusters for NLP libraries ............................................................................... 66
Figure 18. Ensemble method process ...................................................................................... 69
Figure 19. Aggregated probabilities results from classifiers ................................................... 70
Figure 20. Voting results of ensemble method ........................................................................ 71
Figure 21. Adjusted Methodology for arrays identification .................................................... 72
Figure 22. TF-IDF feature matrix for np ................................................................................. 78
Figure 23. TF-IDF feature matrix for arrays ........................................................................... 78
List of Tables
Table 1: Naturalness of software papers .................................................................................. 21
Table 2: Mining software repositories papers .......................................................................... 23
Table 3. Classification of source code programming language papers ................................... 26
Table 4. Classification of source code for skills identification papers .................................... 29
Table 5. Distribution of code snippets ..................................................................................... 36
Table 6. Embedding models details ......................................................................................... 44
Table 7. Embedding Matrix for each code file ........................................................................ 48
Table 8. Python libraries for training ......................................................................... 60
Table 9. Python Libraries classifiers training accuracy ........................................................... 68
Table 10. Results summary ...................................................................................................... 79
Table 11: Java files for experiments ........................................................................................ 80
Table 12: python files for experiments .................................................................................... 81
Table 13: Experiments Results of identifying the library's language ...................................... 82
Table 14: Ensemble method results ......................................................................................... 83

1 Introduction
In recent years, particularly in the aftermath of the COVID-19 pandemic, a
pronounced skills shortage has become increasingly apparent on a global scale, with a notable
impact felt across various industries, particularly the technology sector (Causa et al., 2022).
Over the past six years, a significant majority of surveyed international organizations have
encountered persistent deficiencies in skills, significantly hampering their advancement. The
decrease in skill shortages witnessed among organizations in 2020 was largely attributable to
the disruptive effects of the coronavirus (COVID-19) pandemic, which severely constrained
companies' hiring capabilities. However, despite these mitigating factors, by the year 2023, a
substantial 54% of organizations continued to struggle with a shortage of technical expertise [1].
The term "labor shortage" denotes an insufficient workforce, while "skill shortage"
indicates a lack of specific skills required for success. It's crucial to precisely define the skills
in question. Despite common assumptions, addressing labor shortages in the software
industry also involves non-technical skills. In the modern era, "soft skills" like self-direction,
problem-solving, and communication are increasingly vital, particularly in software
(Hyrynsalmi, Rantanen, and Hyrynsalmi, 2021). Research by the World
Economic Forum emphasizes a growing need for both technical and social skills, alongside
cognitive abilities, in information and communication technology fields. Job families such as
database and network professionals, electrotechnology engineers, software developers, and
analysts face recruitment challenges presently and in the future (World Economic Forum, 2016).

[1] https://www.statista.com/statistics/1269776/worldwide-organizations-talent-shortage-skills-tech/

Figure 1. Tech organizations skills shortage worldwide 2015-2023
The evolving demands of modern software development necessitate a deep and
diverse proficiency across a wide range of technologies, tools, and practices, posing
considerable challenges for organizations striving to maintain competitive development
teams amidst rapid technological advancements and the continuous introduction of new
programming languages, frameworks, and libraries. Managing and assessing the evolving
skills of development teams is essential for project managers, who must allocate tasks
effectively and ensure ongoing capability to support legacy and emerging technologies within
their codebases.
1.1 Research motives
Identifying the skills of software developers is essential for organizations relying on
software development to maintain competitiveness and efficiency. The software engineering
field is evolving rapidly, with an increasing demand for skilled developers. This study aims to
explore the utilization of software repositories as a valuable data source for identifying and
analyzing software developers' skills. This includes examining factors such as the
programming languages employed, types of libraries utilized, and specific operations within
the source code.
Currently, skill identification is largely based on self-reported information, which can
be biased and unreliable. In contrast, software repositories provide a rich source of data that
can be used to analyse the activities of software developers, such as code contributions, bug
fixes, and code reviews (Schmidt et al., 2024). By analysing these activities, we can identify
the skills and expertise of software developers objectively and accurately. In the current study,
we focus on the source code of repositories rather than contributions or fixes.
Furthermore, this study aims to establish a framework for skill identification by
analysing diverse Python libraries across various operations. This framework will encompass
methodologies for extracting and processing data from software repositories or source code,
necessary data cleansing procedures, and techniques for generating embeddings. Moreover, it
will incorporate methods for identifying and categorizing skills based on the analysed data.
By developing such a framework, this research seeks to offer a standardized approach to skill
identification that can be universally applied across different organizations and software
projects.

While significant research has been conducted in language identification from source
code repositories using embeddings, the identification of the specific Python library types
used and the operations applied within the code remains an underexplored area. Current
studies have primarily focused on general language identification tasks, often overlooking the
finer distinctions required to categorize code based on the libraries utilized and the specific
operations performed. This gap in research is particularly critical as software development
increasingly relies on a wide array of Python libraries, each serving distinct functions and
requiring specific expertise.
Overall, the motivation behind this research is to provide a more accurate and reliable
method for identifying the skills of software developers. By leveraging the rich source of data
provided by software repositories, we can improve the efficiency and effectiveness of
software development processes, and ultimately contribute to the success of software
development projects.
1.2 Aims and Objectives
The aim of this project is twofold: Firstly, to represent source code as vectors and
develop a machine learning classification model capable of identifying its primary
characteristics and architecture. Secondly, to extract information on libraries, frameworks,
and algorithms knowledge for further analysis.
In addition to these main aims, the project seeks to achieve the following objectives:
- Obtain and compile a dataset of source code that includes pertinent features essential for the research, and implement suitable pre-processing techniques to extract meaningful tokens.
- Utilize word2vec to generate valuable embeddings for predictions using source code tokens, leveraging its capabilities effectively.
- Extract meaningful features from the gathered data, and employ feature importance analysis and grid search methodologies to identify the optimal machine learning model for classification.
- Train the machine learning model to accurately identify software development methodologies, libraries, and functions based on the extracted features.
1.3 Contribution
To address the aforementioned challenges, this study employs standard Machine
Learning techniques adapted for identifying various programming skills, complemented by
the use of word2vec for embedding source code. Additionally, we introduce novel
methodologies and frameworks for processing source code snippets aimed at detecting
programming skills. We also conduct comprehensive analyses across different programming
libraries and operations to validate our approaches. Specifically, our focus is on Python
language libraries, an area relatively underexplored in the literature on skills identification
despite its recent growth and recognition.
Existing research often emphasizes quantitative metrics such as the count of bug
fixes, commits, lines of code, or mentions in README files for language and library
knowledge. In contrast, our study leverages GitHub repositories of prominent Python
libraries to construct a robust dataset for tokenization, encompassing a broad spectrum of
programming skills and operations. The novelty of our work lies in pioneering multiclass
Python skills detection and developing a framework for source code tokenization, feature
extraction, and vectorization.
The contributions of this study to the scientific community are structured as follows:
[C1] Introduction of an ensemble approach for Python skill identification: We
implemented an ensemble framework of classifiers combined with dimensionality reduction
techniques and a voting process for multiclass Python library type detection. Our approach
indicated sufficient accuracy with the limited sample data available, although improvements
are needed. These classifiers serve as a preliminary framework capable not only of
characterizing library types but also identifying specific operations within the source code.
The novelty of this approach is significant due to the limited existing work on multiclass
skills detection, as detailed in Chapter 3.
[C2] Evaluation of machine learning models for identifying Python data structures: Our
study aims to explore the classification capabilities of specific data structures such as arrays,
which are widely used in various programming and software development applications. We
intend to illuminate the features that contribute to identifying specific operations within a
corpus of source code, a topic that has been underexplored in previous research. Even though
our preliminary results may not be encouraging, the knowledge gained at this stage is
significant for the evolution of skills identification frameworks in data structures, as
discussed further in Chapter 3.
[C3] Development of a comprehensive framework for Python skill identification with
broad applicability: As we construct innovative ensemble classifiers, we simultaneously
explore uncharted territory by rationalizing the predictions made by these classifiers. Our
approach involves experimenting with and comparing the performance of various classifiers
for programming language predictions across different libraries. Throughout this process, we
employ diverse data processing techniques such as comment elimination, stop words
extraction, feature selection, dimensionality reduction, and weighted tokenization. These
techniques provide powerful insights into advancing source code embeddings and tokens.
However, the prevalence of overfitting in machine learning applications suggests the need for
new paths and further research.
1.4 Thesis Structure
The structure of this thesis follows a systematic approach. Chapter 2 begins with a
comprehensive literature review, covering previous research on mining repositories to
identify features within source code and discussing the application of natural language
processing (NLP) methods to enhance the naturalness of software. It also explores current
advancements in code snippet classification and its practical applications. Chapter 3 is
structured into three main parts: the first part outlines the methodology for programming
language identification. The second part of Chapter 3 details the experimental setup for
recognizing a Python library type from code snippets. The third part discusses the application
to NumPy array identification in a source code corpus. Chapter 4 provides a detailed
presentation and discussion of the prediction results, and Chapter 5 highlights the strengths and
limitations of the research findings and discusses avenues for future research in this domain.

2 Literature Review
2.1 Source code as natural Language and NLP applications
Globally, there exist hundreds of languages, each characterized by varying levels of
difficulty. Nevertheless, regardless of complexity, all languages adhere to specific syntax
rules, orthography, and expressions that are fundamental for effective communication.
Similarly, software engineering code, crafted by proficient individuals fluent in a
programming language, mirrors the art of writing in natural language. Software engineers use
programming languages to create code following specific rules and guidelines, akin to skilled
writers, resulting in code that resembles natural language writing.
The authors of "On the Naturalness of Software" (Hindle et al., 2016) tackle the
challenge of creating tools to help engineers work with large collections of source code. Their
goal is to make programming easier and ensure that programs are correct. They use a method
called n-gram models to analyse the structure of code. The study is set in the context of tasks
like code completion, summarization, and error detection. By mining patterns in code, such
as API usage and common errors, the researchers aim to guide developers in making changes
to their programs. The results show that using these models leads to 33%-67% more accurate
suggestions and reduces the number of keystrokes needed. The study makes unique
contributions by supporting the idea that programming languages are simpler and more
repetitive than previously thought. It demonstrates the effectiveness of statistical language
models in capturing patterns in code, which can be used to improve programming tools like
code suggestion features in IDEs. Overall, it highlights the potential of statistical language
models to enhance software engineering practices.
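To make the n-gram idea concrete, the short sketch below (a toy illustration, not the setup of Hindle et al., who used smoothed n-gram models over large corpora) builds a bigram model from whitespace-separated code tokens and uses it to suggest the most likely next token; the miniature corpus and tokenization are assumptions for demonstration.

    # Toy bigram language model over code tokens (illustration only).
    from collections import Counter, defaultdict

    corpus = [
        "for ( int i = 0 ; i < n ; i ++ )",
        "for ( int j = 0 ; j < m ; j ++ )",
        "if ( x == null ) return ;",
    ]

    bigrams = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()          # simplistic tokenization by whitespace
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1

    def suggest(prev_token):
        """Suggest the most frequent token observed after prev_token."""
        counts = bigrams.get(prev_token)
        return counts.most_common(1)[0][0] if counts else None

    print(suggest("("))   # -> 'int', the most frequent successor of '(' in this toy corpus

A code-completion tool built on such a model would rank the candidate next tokens by their conditional frequency, which is the intuition behind the reported reduction in keystrokes.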
The aforementioned research has inspired subsequent work by other scientists. For
instance, Alon et al. (2019) introduced code2vec, an approach designed to learn
code embeddings for programming tasks. Similarly, Allamanis and Sutton (2014) explored
the mining of idioms from source code.
Alon et al. (2019) presented a novel framework for predicting program properties
using neural networks. Their core innovation is a neural network that learns code
embeddings, which are continuous distributed vector representations for code. These
embeddings enable effective modeling of the relationship between code snippets and their
labels. The architecture leverages the structured nature of source code, aggregating multiple
syntactic paths into a single vector. This capability is crucial for applying deep learning to
programming languages, akin to the impact of word embeddings in natural language
processing (NLP).
In earlier applications, Allamanis and Sutton (2014) introduced the first method for
automatically mining code idioms from existing code corpora. Their approach does not
simply search for frequent syntactic patterns but instead identifies patterns that enhance the
explanatory power of a probabilistic model of the source code. The method employs a
nonparametric Bayesian tree substitution grammar, which has been effective in natural
language processing but had not previously been applied to source code.
The concept of the naturalness of software was revisited by Rahman et al. (2019),
who explored additional research questions on the repetitiveness and predictability of source
code:
- Replication: They replicated the work of Hindle et al. to ensure dataset diversity and to test the "naturalness" hypothesis across C#, C, JavaScript, Python, Ruby, and Scala.
- SyntaxTokens Removal: They investigated code predictability after the removal of SyntaxTokens such as separators, keywords, and operators, similar to the removal of punctuation and stopwords in NLP.
- API Usages: They analyzed the predictability of Java API token usage, noting that API code tends to exhibit more uniformity across projects compared to general program code.
- Graph-based Object Usage Models (Groum): They examined the repetitiveness of Groums extracted from Java programs, contrasting them with n-grams to assess non-sequential code processing.
Their findings confirm that code is indeed repetitive and predictable, though not as
extensively as Hindle et al. (2016) previously suggested. The repetitive nature of
programming language syntax, with SyntaxTokens making up 59% of Java tokens,
contributes to this perception. The study also indicates that it is essential for researchers to
ensure that corpora are properly tuned and cleaned for prediction tasks to avoid distractions
from more significant recommendations. In terms of API usage, there is sufficient repetition
for accurate recommendations, and future research should integrate static analysis with
statistical models for improved predictions. Additionally, various code representations,
particularly graph representations, exhibit different degrees of repetition and abstraction,
enhancing the effectiveness of recommender tools by reducing noise and facilitating more
complex, non-sequential recommendations.

Buratti et al. (2020) investigated the software naturalness hypothesis using
transformer-based language models to analyze syntax and semantics in C language source
code. They introduced a novel sequence labeling task to assess the language model's
understanding of abstract syntax trees (AST) and evaluated its ability to identify
vulnerabilities. Their study highlighted the challenges of data sparsity and the importance of
appropriate tokenization and pre-training objectives, demonstrating that character-based
tokenization with whole word masking (WWM) effectively addresses these issues. They
showed that their language model could learn AST features from raw source code and
perform better than graph-based methods, offering valuable insights for enhancing
productivity during software development. Their contributions include applying LMs to
source code analysis, addressing OOV and rare words issues, and outperforming compiler-dependent methods.
Several researchers have investigated the naturalness of code across diverse
applications:
- Gholamian and Ward (2021) explored the localness of Software Logs with ANALOG, an anomaly detection tool leveraging NLP features, which outperforms prior methods. Their work integrates deep learning language models to benchmark against existing DL-based anomaly detection methods.
- Sridhara et al. (2015) developed next-word prediction models for commit comments and source code comments, achieving accuracies of 70% to 93% and 56% to 78%, respectively. While predicting bug reports posed challenges, the study identified potential for tailored next-word prediction tools targeting specific user contexts within bug reports.
- Ray et al. (2016) investigated buggy code using entropy scores from statistical language models, showing that unnatural code correlates more with bug-fix commits. They found that repaired code tends to become more natural. The application of entropy scores to defect prediction tasks, adjusted for syntactic variations, demonstrated comparable cost-effectiveness to traditional static bug-finders like PMD and FindBugs, with deterministic ordering improving efficiency.
- Hellendoorn et al. (2018) explored proofs, noting the aid of automated sub-theorem proving in Coq and features like code completion in programming languages. Analyzing proofs in Coq's Gallina language and HOL Light kernel-level traces, they found proofs exhibited high predictability, surpassing natural languages. This discovery opens avenues for naturalness-driven tools to streamline proof-writing processes.
Table 1: Naturalness of software papers
Paper | Method used | Dataset
Allamanis and Sutton (2014) | Nonparametric Bayesian | Stack Overflow posts
Sridhara et al. (2015) | N-gram language model | Stack Overflow posts
Ray et al. (2016) | Language models | 10 OSS Java projects from GitHub and Apache Software Foundation
Hindle et al. (2016) | Language models | Java projects and Ubuntu applications
Hellendoorn et al. (2018) | Language models | Coq, HOL
Alon, Zilberstein, Levy, and Yahav (2019) | AST | Java methods
Rahman, Palani, and Rigby (2019) | Language models | 134 open source projects on GitHub, Gutenberg corpus, 200,000 posts from StackOverflow
Buratti et al. (2020) | AST | 100 open source repositories
Gholamian and Ward (2021) | Language models | 8 log files from a wide range of computing systems and 2 English corpora
2.2 Mining Software Repositories (MSR)
Mining Software Repositories (MSR) is a research field focused on analyzing vast
datasets from software projects. These repositories contain extensive data such as source
code, bug reports, and developer communications. Researchers employ techniques from data
mining and machine learning to uncover patterns and insights crucial for understanding
software development processes.
MSR addresses fundamental questions about software quality, developer productivity,
and the impact of code changes over time. By applying statistical analyses and advanced
algorithms, MSR can predict software issues and improve development practices. This
research plays a pivotal role in enhancing software reliability and efficiency.
Our study within this chapter emphasizes extracting skills from repositories. This
involves identifying and analysing patterns in developers' contributions to understand the
expertise and competencies prevalent in software development teams. By mining repositories
for skills extraction, we aim to contribute to the broader goal of improving workforce
planning and enhancing team dynamics in software engineering.

GHTorrent (Gousios, 2013), is a widely used MSR (Mining Software Repositories)
application that offers a scalable infrastructure for gathering and analyzing data from GitHub.
Its primary objective is to comprehensively capture and provide access to various aspects of
GitHub activities, encompassing repositories, users, commits, pull requests, issues, and more.
Researchers and developers utilize GHTorrent's maintained dataset for diverse purposes,
including mining software repositories, exploring developer behavior, studying project
evolution, and conducting empirical studies in software engineering and social coding
practices. This dataset serves as a crucial resource for uncovering trends, patterns, and
dynamics within the GitHub ecosystem.
In a subsequent study the following year, Kalliamvakou et al. (2014) collaborated
with other researchers to investigate the promises and perils of mining GitHub for software
engineering research. The authors also conducted a survey and interviews with GitHub users,
aiming to explore how GitHub supports collaboration and its impact on development
processes. The study, involving 240 responses from active GitHub users, explored critical
challenges when analyzing GitHub repositories, including issues like distinguishing between
base repositories and forks, uneven commit distribution across projects, and the prevalence of
inactive repositories. It also addresses nuances in detecting merged pull requests and the
diverse utilization of GitHub beyond software development.
Other applications of repository mining address bug handling in combination with version
control. BuCo Reporter was developed to easily combine Version Control and Bug
Tracking system data, featuring VCS, BTS, and Reporting modules, and is available
for download with usage details. It provides comprehensive reports on metrics such as
commit distribution, average lines per commit, and bug correction time, offering an
extensible framework for seamless information retrieval. The VCS module of BuCo connects
to source code repositories to extract and manage project data, offering features such as Delta
extraction, local repository mapping, and source code retrieval. The BTS module connects to
bug tracking repositories to extract bug-related data, create queries, manage local storage, and
ensure interoperability among different Bug Tracking systems. (Ligu, Chaikalis, &
Chatzigeorgiou, 2013)
Cosentino et al. (2016), summarizing the findings from 93 papers discussing mining
GitHub repositories, describe the empirical methods applied, the datasets used and the
limitations reported. Based on the review, 75.3% of the works rely on direct observation of
GitHub metadata, 14% use surveys and interviews, and 10.7% combine methods. Only 5.4%
applied longitudinal studies. Also, 60.2% of the works use non-probability sampling, 31.2%
use probability sampling, and 3.2% use stratified random sampling, with 8.6% not using
sampling techniques at all. In total, 50.5% of works reported dataset size in terms of projects,
26.9% in terms of users, and 22.6% provided both dimensions. Data collection methods
include curated datasets (with GHTorrent being most popular at 34.4%), GitHub API
(39.8%), GitHub Search API (12.8%), and mixed methods (5.4%). Only 31.2% of works
provide dataset links for replication, with 68.8% not providing replication links despite
explaining data collection processes. Reported limitations include issues with empirical
methods (64.3%), generalization of results (42.9%), data collection (39.3%), and
dataset/third-party services (6.5%). (Cosentino, Luis and Cabot, 2016)
Table 2: Mining software repositories papers
Paper | Method used | Dataset
Gousios (2013) | REST API | GitHub
Ligu, Chaikalis, & Chatzigeorgiou (2013) | SVNKit for Subversion, JIRA SOAP for bug tracking, Apache POI for Excel export, Apache XML-RPC for Bugzilla communication, and JFreeChart for reports | Commons IO project
Kalliamvakou et al. (2014) | Quantitative analysis | 240 GitHub users & 434 GitHub repositories
Cosentino, Luis and Cabot (2016) | Literature review | Papers from digital libraries

2.3 Classification of code snippets
The classification of code snippets represents a pivotal area of research at the
intersection of software engineering and machine learning, focusing on the automation of
understanding and organizing code. With the exponential growth of code available in
repositories, the demand for effective tools to categorize and comprehend these snippets has
become increasingly significant. This capability not only facilitates automated documentation
and rapid development of source code but also enhances developers' productivity and reduces
development time.
Over the past decade, Machine Learning (ML) and Natural Language Processing
(NLP) have been employed to analyze source code (Ugurel, Krovetz, & Giles, 2002). The
classification of programming languages has been extensively studied (Van Dam & Zaytsev,
2016) using various techniques, including Neural Networks (Gilda, 2017), Multinomial
Naïve Bayes (Alreshedy et al., 2018), Convolutional Neural Networks (CNN) (Ohashi &
Watanobe, 2019), and Neural Text Classification (LeClair, Eberhart, & McMillan, 2018).
This section of the literature review aims to further analyze current research in the
field of code classification, with a specific focus on empirical studies of code snippets. The
objective is to provide a detailed discussion of the progress made thus far and to underscore
the critical importance of ongoing research in this domain.
2.3.1 Classification of code snippets for programming language recognition
The automated identification of programming languages from code snippets is pivotal
in software engineering, impacting tasks such as syntax highlighting, code organization, and
repository management. Machine learning algorithms have been instrumental in this domain,
enabling accurate classification based on programming language usage.
Early research by Ugurel et al. (2002) pioneered automated categorization techniques
using natural language data extracted from code comments and documentation. Their
approach involved feature extraction from lexical components and employed Support Vector
Machines (SVMs) to classify programming languages, demonstrating initial feasibility
despite challenges related to data variability and feature selection. Their study demonstrates
that programming languages and application categories can be accurately classified, yet this
depends on factors such as data variability, application diversity, feature selection techniques,
information retrieval methods, and programming language characteristics. While their results
are promising, improvements are possible, including optimizing feature vector selection,
transitioning from binary to term frequency representations, and enriching classification with
more syntactic features and language-specific attributes.
Building upon this foundation, Van Dam and Zaytsev (2016) expanded the scope of
Software Language Identification (SLI) by evaluating a wide array of classifiers across a
diverse dataset encompassing multiple programming languages. Their findings underscored
SLI's critical role in enhancing Integrated Development Environment (IDE) functionalities
and supporting reverse engineering tasks, emphasizing the practical significance of accurate
language classification methods. Their study involved testing 348 classifiers, derived from
lightweight natural language techniques, on a dataset spanning 20 different languages. Each
language subset encompassed between 109 to 150 test projects and 146 to 272 training
projects. Initial findings revealed varying accuracy levels with smaller training sets, yet
certain methods achieved high accuracy rates of 97.5% or higher with larger training sets.
These results highlight the promising potential of natural language methods in SLI,
particularly in the domain of software language reverse engineering.
The work of Gilda (2017) employed real-time classification of code snippets by
programming language, utilizing token-based feature extraction after preprocessing to
eliminate extraneous characteristics. Tokenization was achieved through regex expressions,
and Torch was employed to generate word embeddings. Convolutional Neural Networks
(CNNs) were pivotal for prediction, integrating convolutional layers with nonlinear activation
functions, filters, max-pooling, and Rectified Linear Units (ReLUs) to process input data and
enhance training efficiency. These convolutional layers produced high-level features used to
classify code snippets into programming language categories using a softmax output layer,
with dropout regularization applied to mitigate overfitting.
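As a simplified illustration of regex-based code tokenization (the exact patterns and embedding setup used by Gilda are not reproduced here; the pattern below is an assumption for demonstration), a snippet can be split into identifier, number, and operator tokens before embeddings are generated:

    # Illustrative regex tokenizer for code snippets; the pattern is an assumption,
    # not the one used in the cited work.
    import re

    TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[^\s\w]")

    def tokenize(snippet):
        """Split a code snippet into identifier, number, and punctuation/operator tokens."""
        return TOKEN_PATTERN.findall(snippet)

    print(tokenize("for (int i = 0; i < 10; i++) sum += i;"))
    # ['for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '10', ';', 'i', '+', '+', ')',
    #  'sum', '+', '=', 'i', ';']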
In contrast, Alreshedy et al. (2018) introduced the Source Code Classification (SCC)
tool, leveraging Multinomial Naive Bayes (MNB) for the accurate classification of Stack
Overflow code snippets. SCC demonstrated superior performance over the Programming
Languages Identification (PLI) tool by effectively distinguishing between closely related
programming languages and their respective versions. MNB was chosen for its simplicity,
computational speed, and scalability. Stack Overflow snippets were preprocessed using the
Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which identified the ten
most frequent words in each snippet. Model performance was optimized through GridSearchCV
to fine-tune the alpha parameter for MNB, with rigorous parameter selection based
on 10-fold cross-validation techniques.
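A minimal sketch of this kind of pipeline is shown below, assuming scikit-learn is available; the toy snippets and the alpha grid are illustrative assumptions and do not reproduce the SCC tool's actual configuration.

    # Sketch of an MNB + TF-IDF snippet classifier tuned with GridSearchCV.
    # The example snippets and parameter grid are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    snippets = [
        "def add(a, b):\n    return a + b",                 # Python
        "public int add(int a, int b) { return a + b; }",   # Java
        "x <- c(1, 2, 3); mean(x)",                          # R
    ] * 10                     # repeated so that 10-fold cross-validation is possible
    labels = ["python", "java", "r"] * 10

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(token_pattern=r"\S+")),   # keep code punctuation inside tokens
        ("mnb", MultinomialNB()),
    ])

    # Tune the MNB smoothing parameter alpha with 10-fold cross-validation.
    grid = GridSearchCV(pipeline, {"mnb__alpha": [0.01, 0.1, 0.5, 1.0]}, cv=10)
    grid.fit(snippets, labels)
    print(grid.best_params_, round(grid.best_score_, 3))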

Baquero et al. (2017) and Dietrich et al. (2019) explored the complexities of language
recognition through analysis of textual and source code data from platforms like Stack
Overflow. Baquero et al. utilized word embeddings and SVMs to classify programming
languages based on semantic similarities, while Dietrich et al. compared human-generated
metadata with automated classification methods, advocating for hybrid approaches to
enhance accuracy across diverse code repositories.
Innovative approaches continue to evolve the landscape, as demonstrated by Hong et
al. (2019) who proposed using ResNet in image classification to directly detect programming
languages from code snippets. Their study validated ResNet's effectiveness with over 90%
precision in identifying languages from snippet-type data sourced from SOTorrent and
function-type data parsed from GitHub repositories, showcasing robust performance across
varied datasets and highlighting the potential for image-based methodologies in language
identification tasks.
In summary, these studies collectively advance the field of programming language
identification through innovative methodologies and rigorous evaluations, paving the way for
more efficient and accurate solutions in software engineering practices. The integration of
machine learning techniques with deep insights into language characteristics and dataset
variability continues to drive advancements towards enhanced automation and precision in
language identification tasks across diverse software development environments.
Table 3. Classification of source code programming language papers
Paper | Method used | Dataset | Processing
Ugurel, Krovetz, and Giles (2002) | SVM | Ibiblio Linux Archive, Sourceforge, Planet Source Code, Freecode, and web pages with code snippets | Feature extraction, vectorizing
Van Dam and Zaytsev (2016) | Naïve Bayes, n-grams with Good-Turing discounting, n-grams with Kneser-Ney discounting, n-grams with Witten-Bell discounting, Skip-gram language model | GitHub | Varying parameters resulted in 348 different methods
Gilda (2017) | Convolutional Neural Networks | GitHub | Tokenization, word embeddings
Baquero et al. (2017) | SVM | Stack Overflow posts | HDF5, Feature extraction, Word2Vec
Alreshedy et al. (2018) | Multinomial Naive Bayes | Stack Overflow posts | TF-IDF, GridSearchCV
Dietrich et al. (2019) | t-SNE algorithm | SOTorrent data set | PL-tags
Hong et al. (2019) | ResNet CNN | SOTorrent data set | 12-point Courier font, GuessLang
2.3.2 Classification of code snippets for skills recognition
While programming language recognition from code snippets has been extensively
explored, the recognition of skills or algorithms through source code has not received as
much attention. Recently, however, this area has begun to attract more research interest,
underscoring its growing importance in the field. In the previous section, we detailed the
interest researchers have shown in mining repositories. In this section, we aim to conduct a
more in-depth analysis of the available studies that use code snippets for similar or related
tasks.
Numerous studies explore tools and applications aimed at identifying software
developers' skills and aligning them with job requirements. Tools such as CVExplorer
leverage README files, GitHub language usage, and check-in details (like commit
messages and file changes) to aggregate technical skills and recommend candidates (Greene
and Fischer, 2016). CPDScorer introduces a method using a set covering algorithm to extract
specific programming terms, assigning them to answers and projects for quality evaluation
across diverse programming skills (Huang et al., 2016). Hauff and Gousios (2015) proposed a
pipeline to automatically match GitHub profiles with job advertisements, utilizing README
files and a large-scale ontology to map job requirements and developer skills, highlighting
significant overlaps in covered concepts.
Expertise identification typically focuses on aggregating edit counts, whereas da Silva
et al. (2015) propose a method considering both granularity and time, distinguishing local
from global expertise in artifacts. Oliveira et al. (2019) evaluated a strategy using metrics like
Commits, Imports, and Lines of Code to identify library experts, achieving 88% precision in
identifying experts across 16,000 software systems and 9 libraries. Notably, these approaches
do not directly analyze source code but rather focus on edit frequency, code length, library
imports, or README contents. In contrast, our study aims to identify coding skills by
processing source code akin to natural language.

Recent advancements in identifying algorithms within source code as indicative of
software developers' skills have leveraged diverse methodologies and technologies. For
instance, Bui et al. (2018) utilized data from GitHub to categorize six algorithms (mergesort,
bubblesort, quicksort, linkedlist, breadth-first search, knapsack) in both C++ and Java
languages. They applied bilateral neural networks (BiNNs) adapted from natural language
processing to classify code snippets, achieving over 80% accuracy in cross-language program
classification by integrating BiNNs with tree-based convolutional neural networks
(TBCNNs) to encode abstract syntax trees (ASTs).
Another notable advancement is the development of code2vec by Alon et al. (2019),
which introduced a neural network architecture capable of learning semantic labels for code
snippets akin to word embeddings in natural language processing. This model effectively
aggregates syntactic paths into vectors, facilitating tasks like predicting method names from
code bodies, crucial for code comprehension and maintenance.
Despite these innovations, Kang, et al. (2019) examined code2vec's generalizability
across various software engineering tasks such as code comment generation, authorship
identification, and clone detection, finding mixed results compared to GloVe embeddings.
Ohashi and Watanobe (2019) applied convolutional neural networks (CNNs) to classify
source code based on solvable problem types, utilizing tokenized preprocessing to extract
structural features while disregarding code block sequence, achieving high accuracy.
Further advancing this approach, Watanobe et al. (2023) developed CNN models to
identify algorithms in program codes through innovative preprocessing that includes filtering
user-defined tokens and converting structural features into one-hot binary matrices. These
models can support tasks such as code review, bug detection, and code refactoring in
software engineering applications, underscoring their potential as integral components in
machine learning models for analyzing program code structures.
In conclusion, while current research demonstrates promising advancements in using
embeddings for analyzing source code and identifying developer skills, there remains ample
opportunity for further exploration and refinement. Future research should aim to enhance the
robustness and generalizability of embeddings across different programming languages,
paradigms, and software domains. Additionally, efforts should focus on developing
embeddings that capture deeper semantic meanings and context-specific information within
code, thereby enabling more accurate assessments of developer proficiency and
specialization. By addressing these challenges, researchers can pave the way for more
sophisticated tools and methodologies that effectively leverage embeddings to support
diverse software engineering tasks, ultimately advancing the field of skills identification in
programming.
Table 4. Classification of source code for skills identification papers
Paper | Method used | Dataset | Processing
Hauff and Gousios (2015) | Vector space model | Job advert linked data & GitHub | TF-IDF
Silva et al. (2015) | Fine-grained analysis, analysis based on timeframes, GPU processing | Apache Derby | 
Huang et al. (2016) | M5P for CM, Linear Regression for CM | StackOverflow & GitHub | Ability scoring, Feature extraction
Greene and Fischer (2016) | ConceptCloud | GitHub | 
Bui et al. (2018) | Bilateral neural networks | GitHub | AST2vec
Oliveira, J., Viggiato, M., and Figueiredo, E. (2019) | Qualitative and quantitative data analysis | GitHub | 
Alon, Zilberstein, Levy, and Yahav (2019) | Neural network | Java methods | AST
Kang, Bissyandé, and Lo (2019) | LSTM neural network | GitHub | GloVe, code2vec, TF-IDF
Ohashi and Watanobe (2019) | Convolutional neural networks | Aizu Online Judge | Feature extraction, Padding, Tokenize
Watanobe et al. (2023) | Convolutional neural networks | Aizu Online Judge | Feature extraction, Tokenize

3 Methodology
This chapter details the research methodology, providing a comprehensive overview
of the specific steps and experiments conducted. To ensure an organized research process and
facilitate independent monitoring of each activity, we structured the workflow into distinct
stages. Our research focuses on three tasks:
- Replicating the programming language recognition classifier widely discussed in previous studies to identify challenges and explore new approaches.
- Utilizing the knowledge gained from the first task to develop a more specialized classifier that recognizes the type of Python library from code snippets, which was not covered in the literature review.
- Narrowing the focus further to identify NumPy arrays within the source code corpus.
Throughout these tasks, a consistent framework for data pre-processing, tokenization,
and embeddings generation will be applied. This framework will be refined appropriately
based on the training results and the specific challenges of each task. Figure 2 illustrates the
entire methodological cycle for the first task, which will be generally followed in the
subsequent tasks. The stages are: 1. Data collection, 2. Preparation & Dimensionality
reduction, 3. Embeddings & Averaging, 4. Classifier Experimentation, and 5. Results. The
proposed approach is designed to be flexible and incremental, breaking down the study into
discrete components while integrating them into a cohesive framework for robust analysis.

Figure 2. Initial research Methodology Outline
3.1 Programming Language recognition from source code snippets
3.1.1 Data Description
The primary dataset for this research consists of code snippets written in Python and
Java, sourced from GitHub repositories. Our objective was to gather well-structured data
representing a wide range of code snippets and programming methods. As a result, to create a
robust dataset, we initially focused on collecting code snippets from The Algorithms GitHub
repository (https://github.com/TheAlgorithms), a well-known repository that hosts a diverse collection of algorithm
implementations in multiple programming languages, including Java and Python. This
repository offered a broad spectrum of code snippets, ranging from simple operations to
complex algorithms, providing a comprehensive foundation for training the initial model.
In greater detail, the repository is a collaborative initiative focused on implementing
algorithms across a diverse array of programming languages, encompassing more than 30
different languages. Serving as a centralized hub, it facilitates global developer collaboration
for the ongoing enhancement and upkeep of algorithmic implementations. The repository
hosts an extensive collection of algorithms, spanning sorting, searching, graph algorithms,
dynamic programming, and beyond. Repository structures vary depending on language type
and contributions from developers. For instance, the Python repository is organized into
subject folders, with subfolders dedicated to specific algorithm types.
Figure 3. Example of repository structure
Each algorithm in the repository is well documented, with explanations of how it works and its implementation details. This documentation covers the theory behind the algorithm,
highlighting both its strengths and limitations. Additionally, it includes clear explanations of
the logical reasoning behind the algorithm's functions and the sequence of steps involved.
Moreover, numerical examples are frequently provided to illustrate how the algorithm
behaves or to clarify its processing steps. These examples help users understand how the
algorithm works in real-world scenarios.

Figure 4. Example of an input code file
As the training process progressed, we observed signs of overfitting in the algorithm.
This overfitting indicated that the model was learning the training data too well, reducing its
ability to generalize to new or unseen data. To address this issue, we needed to enrich our
dataset by incorporating additional input files. To expand the diversity of the dataset, we
sourced more code snippets from well-known repositories, ensuring representation from both
Java and Python.
For Java, we included code from repositories like Google Guava (https://github.com/google/guava), Apache Spark (https://github.com/apache/spark), and Maven (https://github.com/apache/maven). For Python, we added snippets from frameworks and libraries such as Django (https://github.com/django/django), Flask (https://github.com/pallets/flask), and the Requests library (https://github.com/psf/requests). These repositories were selected based on two key factors:
the volume of code they contain and the quality of the code. For example, the Requests
Python library repository hosts the codebase that simplifies the process of making HTTP
requests and interacting with web APIs in Python. The repository contains the source code
for the Requests library, along with documentation, issue tracking, and contributions from the
developers’ community. By enriching the dataset with code from these reputable sources, we
aim to reduce overfitting and improve the robustness of the model.
Figure 5. Example of the Requests python library repository
Figure 6. Examples of input code file from Requests library
All data was stored locally and manually retrieved for use in the model during data
preparation. Table 5 provides additional details about the volume of data for each coding language, marking the separation between the first training dataset (containing code snippets
from the Algorithms repository) and the second training dataset (which is augmented by
additional code snippets from several other repositories).
Table 5. Distribution of code snippets

Language | 1st Training dataset | 2nd Training dataset
python | 1.353 | 1.753
java | 1.139 | 1.902
Total | 2.492 | 3.655
Ensuring a balanced training dataset is crucial in machine learning to facilitate
optimal model performance and generalization across diverse datasets. A balanced dataset,
where each class (in this case, Python and Java code) is represented equally, mitigates the
risk of bias towards the majority class and fosters fair learning of patterns from all classes.
Research suggests that imbalanced datasets can lead to suboptimal model performance,
particularly in classification tasks, due to the model's inclination towards the dominant class
(He and Garcia, 2009). By contrast, a balanced dataset enhances the model's ability to discern
subtle differences between classes and improves its predictive accuracy across all classes
(Chawla et al., 2002). In the context of programming language identification, a balanced
dataset ensures that the model learns the distinctive syntactic and semantic features of both
Python and Java languages without bias towards either. This approach aligns with best
practices in machine learning and contributes to the robustness and reliability of the model's
predictions (Krawczyk, 2016).
Obtaining the raw input code data files involved a systematic process of accessing
GitHub repositories through clone requests and saving the retrieved files locally in a
designated folder named 'trial_files'. The use of clone requests allowed for the replication of
entire repositories, ensuring that all necessary code files and associated metadata were
captured for subsequent analysis and model training. By pulling the data locally, we gained
direct access to the raw source code files, enabling efficient preprocessing and transformation steps required for training the language identification model.
Figure 7. Share of code files in 2nd Training dataset
Figure 8. Share of code files in 1st Training dataset
The data preparation processes described in the next section were applied to each of these datasets, with the aim of extracting meaningful information from each code file while simultaneously removing files that could be considered noise or redundant data.
3.1.2 Data Pre-processing & Dimensionality Reduction
The data preparation stage involved a two-part process designed to clean and structure
code files, ultimately creating a dataset suitable for the embeddings that will be inputted to
the model. The goal was to prepare code snippets in a way that minimized noise and
maximized relevant information for the following stages of the study.
The first step focused on cleaning and annotating the code files. This process began
by eliminating comments, as they do not contribute to the actual code logic and can introduce
unnecessary clutter. The comment removal process was conducted separately for each type of
data file (Java or Python) and consisted of separate functions that accepted a code file as
input and produced the same file with all comments removed. The relevant code snippet for
Python files can be seen below:
import re

def remove_comments_python(source_code):
    # Keep string literals (group 1) intact and drop '#' comments (group 2)
    pattern = r'(\".*?\"|\'.*?\')|(#.*?$)'
    return re.sub(pattern, lambda m: m.group(1) or "", source_code, flags=re.MULTILINE)
where we have defined a specialized function to exclusively handle Python files. The function leverages the re (regular expressions) library to formulate a pattern that can detect Python code comments. The regular expression matches all textual information preceded by the "#" character, which marks a comment in the Python programming language, while leaving string literals untouched so that "#" characters inside strings are not mistaken for comments. The function then replaces every occurrence of a detected comment in the source code file with an empty string, ensuring that the source code file maintains its structure and is stripped of comments in an efficient manner.
A similar function was used for removing the Java comments, as presented below:
def remove_comments_java(code):
    in_multiline_comment = False
    lines = code.split("\n")
    result = []
    for line in lines:
        stripped = line.strip()
        if in_multiline_comment:
            # Skip lines until the closing "*/" of the block comment is found
            if stripped.endswith("*/"):
                in_multiline_comment = False
            continue
        if stripped.startswith("/*"):
            in_multiline_comment = True
            continue
        if stripped.startswith("//"):
            continue
        result.append(line)
    return "\n".join(result)
where by parsing the input Java code line by line, it checks for the presence of both single-
line and multiline comments. Initially, it initializes a boolean variable to track whether it is
currently within a multiline comment. Subsequently, the function iterates through each line of
the code, stripping leading and trailing whitespace. If the function encounters a line within a
multiline comment, it continues until it identifies the end of the comment block, indicated by
the presence of the "*/" characters. Single-line comments, denoted by "//", are also detected
and excluded from the result. The function ensures that the integrity of the source code is
preserved by appending non-comment lines to the result.
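A brief, hypothetical usage example (the input string is purely illustrative) shows how whole-line comments are dropped while code lines are kept:

java_src = "\n".join([
    "// single-line comment",
    "int x = 1;",
    "/*",
    " block comment",
    "*/",
    "return x;",
])
print(remove_comments_java(java_src))
# Expected output:
# int x = 1;
# return x;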
Additionally, specific elements within the code were labeled for clarity and
consistency. All string literals were replaced with the term "STRING," numerical values with
"NUMBER," and individual characters with "CHARACTER." These annotations ensured
that tokenization—the process of separating the preprocessed code files into individual
tokens—was consistent and meaningful. For all the above processes generic regular
expressions (regex) were utilized.
The subsequent step involves tokenization, a process facilitated by a regular
expression pattern (token_pattern). This pattern enables the function to detect consecutive
sequences of alphanumeric characters delineated by word boundaries. These identified
sequences constitute tokens within the source code.
def tokenize_code(source_code):
    # Match consecutive alphanumeric/underscore sequences delimited by word boundaries
    token_pattern = re.compile(r'\b\w+\b')
    tokens = re.findall(token_pattern, source_code)
    return tokens
All of the above processes are summarized and applied in the code snippet below:
import os

def read_files_with_labels(path):
    files = []
    for root, dirs, filenames in os.walk(path):
        for filename in filenames:
            if filename.endswith('.java') or filename.endswith('.py'):
                filepath = os.path.join(root, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    try:
                        file_contents = ' '.join(f.readlines())
                        # Remove comments according to the file's language
                        if filename.endswith('.java'):
                            file_contents = remove_comments_java(file_contents)
                        if filename.endswith('.py'):
                            file_contents = remove_comments_python(file_contents)
                        # Replace numeric and string literals with generic tokens
                        file_contents = re.sub(r"\d+", "NUMBER", file_contents)
                        file_contents = re.sub(r"\'.*?\'", "STRING", file_contents)
                        file_contents = re.sub(r"\".*?\"", "STRING", file_contents)
                        file_contents = re.sub(r"\'.*\'", "CHARACTER", file_contents)
                        file_contents = re.sub(r"\".*\"", "CHARACTER", file_contents)
                        tokens = tokenize_code(file_contents)
                        non_blank_tokens = remove_blank_tokens(tokens)
                        # Label Java files with 0 and Python files with 1
                        if filename.endswith('.java'):
                            files.append((filepath, 0, non_blank_tokens))
                        elif filename.endswith('.py'):
                            files.append((filepath, 1, non_blank_tokens))
                    except UnicodeError:
                        continue
    return files
Operating on the directory containing code files in Java and Python formats from GitHub, the function systematically processes each file, filtering based on file extension to identify Java and Python code files and add labels of 0 for Java and 1 for Python. Upon encountering
eligible files, the function reads their contents, it removes comments from Java and Python
files, replaces numerical and string literals with generic tokens to abstract code specifics, and
tokenizes the code for subsequent analysis. Additionally, it handles Unicode encoding errors
to ensure robustness. The function's output is a structured collection of tuples, each
containing the file path, a language label indicating Java or Python, and the tokenized code.
This organized dataset lays the foundation for subsequent analysis and model training,
facilitating meaningful insights into the characteristics and behaviours of Java and Python
code.
The second part of the process involved token normalization and frequency analysis.
Normalization involved converting all tokens to lowercase to avoid case-related
discrepancies, ensuring a uniform representation of code elements. Following normalization,
a frequency analysis was conducted to determine the occurrence of each unique token in the
dataset. This analysis provided insights into the most common elements, allowing us to
identify which tokens were most useful for feature selection. During the feature selection
process, it was observed that certain elements, such as "STRINGS", "NUMBERS", and
common stop words, appeared with high frequency in the code files. This high frequency
posed a challenge because it could skew the results of the predictive models, potentially
leading to incorrect interpretations or reduced accuracy. In addition to high-frequency tokens, low-frequency tokens should also be removed, as they contribute little to the models' predictive ability and introduce redundant noise.
import ast
import pandas as pd

# Load the tokenized files produced in the previous step
data = pd.read_csv("out_rf.csv", header=None)
data = data.dropna()

# Extract labels and code tokens (the third column holds the token arrays as text)
labels = data.iloc[:, 1]
code_tokens_list = []
for index, row in data.iterrows():
    converted_list = ast.literal_eval(row[2])
    code_tokens_list.append(converted_list)

# Remove stop words from each list of tokens
# (stop_words is assumed to have been defined earlier in the pipeline)
code_tokens_list = [
    [word for word in inner_list if word not in stop_words]
    for inner_list in code_tokens_list
]

# Convert tokens to lowercase and collect them for the frequency analysis
token_dict = []
for code_token_list in code_tokens_list:
    for i in range(0, len(code_token_list)):
        code_token_list[i] = code_token_list[i].lower()
        token_dict.append(code_token_list[i])

To resolve this problem, these high-frequency and low-frequency tokens were
removed from the dataset. This decision was made because these elements generally lack
predictive power—they do not typically play a significant role in differentiating between
various programming languages or coding patterns (Van Der Maaten, Postma, & Van den
Herik, 2009). By excluding them, we improved the robustness and accuracy of our prediction
results. In the end, this refined feature selection approach helped create a more reliable and
efficient machine learning model. In addition, during this stage, only tokens with a frequency
between 0.002 and 0.7 were retained for model training and predictions. This frequency-
based filtering ensured that only tokens with a reasonable level of occurrence were used,
eliminating excessively rare or overly common tokens that could introduce noise or bias into
the model.
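As an illustration of this filtering step, the sketch below computes relative document frequencies over the token lists and keeps only tokens whose frequency falls inside the stated bounds; the variable names are illustrative, and reading the 0.002-0.7 thresholds as relative document frequency is an assumption.

from collections import Counter

def filter_by_frequency(code_tokens_list, low=0.002, high=0.7):
    # Count, for each token, the share of files in which it appears
    n_files = len(code_tokens_list)
    doc_freq = Counter()
    for tokens in code_tokens_list:
        doc_freq.update(set(tokens))  # count each token once per file
    keep = {t for t, c in doc_freq.items() if low <= c / n_files <= high}
    # Drop excessively rare or overly common tokens from every file
    return [[t for t in tokens if t in keep] for tokens in code_tokens_list]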
All cleaned tokens for each file are stored in CSV files alongside their corresponding
language labels and file paths. These files are utilized in subsequent processing stages,
specifically for generating embeddings and performing averaging operations. Within the
saved CSV files, the tokenized code files are structured as follows: the first column denotes
the file path, the second column contains the language label of the respective files, and the
third column comprises an array encompassing all code tokens specific to the file, with tokens separated by commas.
Figure 9. Graphical representation of the data preparation process
3.1.3 Embeddings and Averaging
In this stage of data preparation we will proceed with the word embeddings
generation to represent each token for all the code files.
Vector space models have played a role in distributional semantics since the 1990s,
with models like Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan, 2003) and Latent
Semantic Analysis (LSA) among the early methods for estimating continuous representations
of words. Text vectorization or word embedding is the process of converting words or
documents from a corpus into numerical vectors, which consist of numbers or real numbers.
This conversion is essential for machine learning tasks and sentiment analysis in natural
language processing, as machine learning algorithms typically require numerical input. In 2003, researchers laid the foundation for distributional semantic learning, with Bengio et al.
presenting the NNLM model, which learns word representations by predicting the next
word/token based on the previous n-1 words/tokens.
Word embeddings are a way to represent words as vectors, encoding their semantic
meaning. These vectors are situated in a high-dimensional space, with embeddings for words
that are semantically or contextually related positioned near each other, and embeddings for
unrelated words positioned far apart. Some algorithms also create more intricate geometric
relationships among embeddings. A well-known example of this phenomenon is the vector
analogy: king - man + woman = queen. (Mikolov et al., 2013)
The broader adoption of word embeddings, however, is largely credited to Mikolov et
al. (2013), who developed Word2Vec, a toolkit that made training and using pre-trained
embeddings straightforward and efficient. This toolkit gained widespread popularity in the
machine learning community. A year later, Pennington et al. (2014) introduced GloVe, a set
of competitive pre-trained word embeddings, marking the point when word embeddings
became mainstream in natural language processing. Joulin et al. in 2016, introduced FastText
as an extension of CBOW, aiming to improve training efficiency while maintaining
performance levels compared to other algorithms. These foundational works have
significantly influenced the development of current NLP technologies.
Below is a brief overview of some of the most prominent word representation models.

Word2vec: uses a neural network to analyze extensive text datasets and derive these
vector-based representations, employing two hidden layers within a shallow neural
network to generate vectors for each word. Both the Continuous Bag of Words (CBOW)
and Skip-gram models aim to capture semantic and syntactic information within word
vectors. Training Word2vec with large corpora enhances word representation quality,
rendering it useful for various NLP tasks (Mikolov et al., 2013).
Global Vectors (GloVe): GloVe extends Word2vec to efficiently learn word vectors by
predicting words based on their surrounding context. The word and context embeddings
are initially random, with the goal of minimizing the distance (usually the dot product)
between target and context. This is achieved using stochastic gradient descent, where
each sample is a sliding window in the corpus. When a target word and a context word
are found together, their vectors are adjusted to be closer, based on the learning rate
(Pennington et al., 2014).
FastText: By breaking words into n-grams and feeding them into a neural network,
FastText captures character relationships and word semantics more effectively. This
approach yields better word representations, especially for rare words. Facebook released
pre-trained word embeddings for 294 languages, trained on Wikipedia using FastText
with 300 dimensions, incorporating the Word2Vec skip-gram architecture with default parameters (Joulin et al., 2016).
Embeddings from Language Models (ELMo): ELMo utilizes a bi-directional language
model (biLM), which processes text in both forward and backward directions, capturing a
richer understanding of word context compared to traditional models. This contextualized
representation allows ELMo to encode the nuanced meaning of words based on their
surrounding context, enabling more accurate representations of word relationships and
semantics (Peters et al., 2018).
Bidirectional Encoder Representations from Transformers (BERT): BERT is trained
on two unsupervised tasks: (1) a masked language model (MLM), where 15% of the
tokens are randomly masked (i.e., replaced with the “[MASK]” token), and the model is
trained to predict the masked tokens, (2) a next sentence prediction (NSP) task, where a
pair of sentences are provided to the model and trained to identify when the second one
follows the first. This dual-task training strategy aims to enhance the model's ability to
understand long-term and pragmatic relationships between sentences. BERT is trained on
datasets such as the Books Corpus and English Wikipedia text passages (Devlin et al.,
2019).
Table 6. Embedding models details

Embedding model | Year | Architecture | Dimension | Training dataset
Word2vec | 2013 | NNLM | 100 - 1000 | Google news
GloVe | 2014 | NNLM | 50 - 300 | Crawl corpus
FastText | 2016 | NNLM | 300 | Wikipedia
ELMo | 2018 | Bidirectional LSTM | 1024 | Wikipedia, Monolingual news crawl data from WMT 2008-2012
BERT | 2019 | Multi-layer bidirectional Transformer encoder | 768 | Books Corpus, English Wikipedia
In the current research, the word2vec model is employed to generate word embeddings
for the code file inputs to enhance robustness and effectiveness. Therefore, a comprehensive
examination of its architecture and intricacies is provided below.
Word2vec operates on two primary architectures: the Continuous Bag-of-Words (CBOW)
model and the Continuous Skip-gram model, both proposed by Mikolov et al. (2013). These
two architectures serve as the fundamental building blocks for generating word embeddings
and offer different approaches to learning word representations based on context.
The Continuous Bag-of-Words (CBOW) model predicts a target word given its
surrounding context words. This architecture proposed is akin to the feedforward Neural
Network Language Model (NNLM), but with some key differences: the non-linear hidden
layer is omitted, and the projection layer is shared across all words, not just the projection
matrix. This setup projects all words into the same space, essentially averaging their vectors.
This approach is termed a bag-of-words model because the word order does not affect the
projection process. Additionally, words from both the past and future are used in the context,
allowing the model to consider a broader context.
The computational complexity of training this model is represented as:
Q = N × D + D × log2(V)
where N is the number of words, D represents the dimensionality of the embeddings, and V denotes the vocabulary size.
The Continuous Skip-gram model reverses the prediction process of CBOW. Instead
of predicting a target word from context, Skip-gram aims to predict context words given a
target word. This model typically employs a larger context window, allowing it to capture
broader semantic relationships. Skip-gram can uncover more complex associations between
words, making it suitable for tasks that require a deeper understanding of context.
Specifically, the model takes each word as input to a log-linear classifier with a continuous
projection layer, then predicts words that are within a certain range before or after the given
word.
This architecture operates by sampling from a defined range around the target word.
The increased range tends to improve the quality of the resulting word vectors, as it provides
a broader context. However, this broader context also comes with increased computational
complexity. Since words further from the target are usually less related to it, we apply a
weighting system to decrease their impact. This is achieved by reducing the frequency of
sampling from these more distant words during training, focusing more on words that are
closer to the target. This approach balances the need for contextual information with the
computational costs, allowing for a more efficient and scalable training process.
The training complexity of this architecture is proportional to:
Q = C × (D + D × log2(V)),
where C is the maximum distance of the words.

Figure 10. CBOW and Skip-gram architectures
The CBOW model is often chosen for its speed and efficiency, making it ideal for
large-scale datasets or situations where rapid processing is needed. On the other hand, the
Skip-gram model is preferred for its ability to capture richer and more intricate relationships
among words, proving useful for tasks that require a detailed semantic understanding.
In this study, we generated word embeddings using the Word2Vec algorithm to create
dense vector representations of words in the dataset. We configured the model with specific
parameters to ensure that the embeddings captured a broad context and were suitable for
various downstream applications.
The Word2Vec model was trained using a list of tokenized code snippets. The key
parameters for this training process were as follows:
Vector Size: We set the dimensionality of the word embeddings to 300, indicating
that each word would be represented by a 300-dimensional vector. This size strikes a
balance between capturing semantic relationships and maintaining computational
efficiency.
Window Size: The context window, set to 5, determines how many surrounding
words are considered for each target word during training. This value helps ensure
that the embeddings capture enough contextual information without becoming
computationally prohibitive.

Minimum Count: The minimum frequency for words to be included in the training
was set to 1. This choice allows the inclusion of even rare words, enriching the
diversity of the embeddings.
Skip-gram Architecture: The sg=1 parameter specifies the use of the Skip-gram
architecture, which aims to predict context words based on a given target word. This
architecture is typically used to capture more complex relationships among words.
Parallel Processing: We set the workers=4 parameter to leverage parallel processing,
allowing the model to be trained across four CPU threads simultaneously, thus
speeding up the training process.
from gensim.models import Word2Vec

word2vec_model = Word2Vec(sentences=new_code_tokens_list, vector_size=300,
                          window=5, min_count=1, sg=1, workers=4)
By configuring the Word2Vec model with these parameters, we aimed to generate high-
quality word embeddings that would serve as the foundation for subsequent stages of
analysis. Each word is transformed into a 300-dimensional vector, allowing for efficient
representation of text in a continuous space.
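For a quick sanity check of the trained model, individual token vectors and their nearest neighbours can be inspected; the token used below is illustrative and assumes it survived pre-processing:

vector = word2vec_model.wv["def"]                      # 300-dimensional vector for one token
print(vector.shape)                                    # (300,)
print(word2vec_model.wv.most_similar("def", topn=5))   # tokens with the closest vectors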
The next step in preparing the data for the prediction model involves averaging the
embeddings for each code snippet. As outlined in the embeddings section and in the preceding preprocessing section, each code file consists of a list of tokens, with each token
having its own vector for representation. The size of each vector is defined by the user that
conducts the analysis and, in the case of this thesis, all vectors have been assigned a size of
300. Hence, each code file consists of 𝑛 vectors of size 300, corresponding to the number of
tokens that remain after the preprocessing procedure.
The process of averaging creates a single representative vector for each code file,
allowing us to use this compact representation as input for the prediction model. By
averaging these vectors, we can condense the information from an entire code file into a
single vector. This transformation is crucial because many prediction models, like Random
Forest or other traditional machine learning algorithms, require a fixed-size input. The
process of averaging is conducted in the following manner: for a code file that is expressed as
a list of tokens cf = [token_1, token_2, ..., token_n], each token is also expressed as a vector array V = [v_1, v_2, ..., v_300] which contains the 300 numbers that comprise the token vector. Hence, with cf being transformed to cf = [V_1, V_2, ..., V_n], an Embedding Matrix (EM) is
constructed, containing the vectors for each token (V_1 to V_n). Each row of the EM represents a token vector, while the columns can be used for producing element-wise sums of vectors. Each cell of the EM contains a numerical value that represents one element (E) in a token vector V. The structure of the EM can be seen below.
Table 7. Embedding Matrix for each code file

Token Vectors (V) | Column_1 | Column_2 | ... | Column_300
V_1 | E_1,1 | E_1,2 | ... | E_1,300
V_2 | E_2,1 | E_2,2 | ... | E_2,300
V_3 | E_3,1 | E_3,2 | ... | E_3,300
... | ... | ... | ... | ...
V_n | E_n,1 | E_n,2 | ... | E_n,300
The process of averaging computes the element-wise sum for all token vectors. More
specifically, for a Column_i, the code that we have implemented computes the Column Sum (CS) based on the following formula:

CS_i = Σ_{j=1}^{n} E_{j,i}, for i = 1, ..., 300

where n is the number of token vectors (V) contained in each file and i indexes the vector dimensions, which in the case of this study number 300. Hence, the process computes 300 instances of CS and a final vector F is produced, which can be expressed as F = [CS_1, CS_2, ..., CS_300]. Finally, the F vector undergoes a final division and is transformed to F_final = [CS_1/length, CS_2/length, ..., CS_300/length], where length is the number of tokens in cf. The final averaged embeddings are then used for the subsequent
experimentations with machine learning classifiers. Averaging embeddings helps reduce the
complexity of the dataset while retaining meaningful information about the code's semantic
structure. By transforming each code file into a single vector, we create a uniform
representation that can be used for training and testing the prediction model. This step is
particularly useful for working with models that need consistent input formats and allows for
efficient processing and analysis of code data in the context of machine learning.

final_vectors = []
for code_token_list in new_code_tokens_list:
    vectors = []
    for token in code_token_list:
        embedding_vector = word2vec_model.wv[token]
        vectors.append(embedding_vector)
    # Element-wise sum of all token vectors, then division by the number of tokens
    column_sums = [sum(column) for column in zip(*vectors)]
    for i in range(0, len(column_sums)):
        column_sums[i] = column_sums[i] / len(code_token_list)
    final_vectors.append(column_sums)

# Keep only files whose averaged vector has the expected 300 dimensions
final_final_vectors = []
for i in range(0, len(final_vectors)):
    if len(final_vectors[i]) == 300:
        final_final_vectors.append(final_vectors[i])
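For reference, the same averaging can be written compactly with NumPy; this is an equivalent sketch rather than the code used in the study, and the function name is illustrative.

import numpy as np

def average_embedding(token_list, model, size=300):
    # Mean of all token vectors of a file; falls back to a zero vector for empty files
    vectors = [model.wv[t] for t in token_list if t in model.wv]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)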
3.1.4 Machine Learning Classifiers
In this subsection the procedure followed to experiment with different machine
learning classifiers is demonstrated. Overall, a thorough investigation of related algorithms
was performed in order to trace the optimal ones used for source code language
identification. Related literature has already reported the use of several well-known
algorithms such as CNN (Mandelbaum & Shalev, 2016), Random Forest (RF) (Vora et al.,
2017) or Long Short-Term Memory (LSTM) (Li & Gong, 2021). In addition, this subsection
also illustrates the separate experimentations that were considered for each selected algorithm
and the different setups selected. It should be noted that for this research, the classifiers that
were considered and used to make predictions and classify code files into Python or Java
were Neural Networks, Random Forest and Support Vector Machines.
3.1.4.1 Background on Classifiers
Neural Networks
This type of classifier uses interconnected layers of artificial neurons to learn complex
patterns in the data. The architecture of a neural network typically comprises multiple layers,
each consisting of interconnected nodes or neurons. Each neuron receives input from the
neurons in the previous layer, performs a computation on this input, and then passes the result
to the neurons in the next layer. This process continues through the network until the final
layer produces the output.
The specific neural network used in this research consists of three layers: an
embedding layer, an LSTM (Long Short-Term Memory) layer, and a dense layer.

The Embedding layer is responsible for transforming categorical or discrete data into
continuous vector representations, often referred to as embeddings. These embeddings
capture semantic relationships between categories or words and are essential for effectively
processing categorical data in neural networks. In our case, the embeddings are derived from the code tokens of the Python and Java code snippet files.
Long Short Term Memory networks (LSTMs), introduced by Hochreiter &
Schmidhuber (1997) and subsequently refined and popularized, are a specialized type of
recurrent neural network (RNN) designed to overcome the challenge of learning long-term
dependencies. It is particularly well-suited for tasks involving time series data or sequences
of data points, such as natural language processing tasks like text generation or sentiment
analysis. LSTM layers are capable of retaining information over extended periods, making
them suitable for processing sequences with long-range dependencies. While traditional
RNNs consist of a simple repeating module typically containing a single layer, LSTMs
feature a more complex structure comprising four interacting layers within each repeating
module. The core concept of LSTMs revolves around the cell state, which serves as a
conveyor belt facilitating the flow of information along the network. Crucially, LSTMs
employ structures called gates to regulate the flow of information into and out of the cell
state, enabling selective retention and addition of information. These gates, comprised of
sigmoid neural net layers and pointwise multiplication operations, control the flow of
information by determining how much information should be retained or discarded.
(Sundermeyer, Schlüter, and Ney, 2012)
The dense layer is a standard component in neural network architectures. Neurons in
this layer are connected to every neuron in the preceding layer, and each connection is
associated with a weight parameter. Each neuron in the dense layer receives input from every
neuron in its preceding layer and performs matrix-vector multiplication. This operation
entails ensuring compatibility between the dimensions of the matrices involved, where the
row vector of the output from the preceding layer matches the column vector of the dense
layer. The matrix-vector multiplication is governed by a general formula, where the
parameters of the preceding layers are updated through backpropagation, a widely used
algorithm for training feedforward neural networks. As a result, the output from the dense
layer is an N-dimensional vector, effectively reducing the dimensionality of the input vectors.
When implementing a dense layer, it's essential to ensure that the number of neurons in the
dense layer matches the dimensions of the output from the preceding layer. This process can
be facilitated using frameworks like Keras, where various parameters of the dense layer are
defined to control its behavior effectively. The dense layer performs a linear transformation
on the input data followed by a non-linear activation function, allowing the network to learn
complex, nonlinear relationships within the data. (Javid et al., 2021)
Random Forest (RF)
A robust ensemble learning method that builds multiple decision trees during training
and merges their outputs to improve prediction accuracy and reduce overfitting. Random
Forest is valued for its stability, flexibility, and ability to handle high-dimensional data,
making it a suitable choice for classifying code files.
The Random Forest algorithm cultivates an ensemble of decision trees, amalgamating
their predictions to achieve enhanced accuracy. This approach involves constructing multiple
decision trees, each trained on a distinct subset of the training set, and subsequently
aggregating their predictions to determine the final outcome taking the most popular result.
By employing this ensemble method, Random Forest introduces diversity among the decision
trees, enhancing the model's robustness and mitigating the risk of overfitting. Conversely, in
regression, the forest aggregates the outputs of all trees to derive the average prediction.
Crucially, the efficacy of Random Forest hinges upon the limited or absent correlation among
individual models, ensuring that errors inherent in specific trees are offset by the collective
accuracy of the ensemble, thereby aligning the overall outcome toward the desired direction.
(Parmar, Katariya, and Patel, 2019)
In ensemble learning, such as in Random Forests, decision trees are commonly trained
using the "bagging" technique, which falls under Bootstrap Aggregation. This ensemble
method amalgamates predictions from multiple algorithms to enhance accuracy. Random
Forests, a type of ensemble method, leverage Bootstrap Aggregation by randomly sampling
rows and features from the dataset to create sample datasets for each model. Aggregating
these sample datasets entails summarizing observations and combining them, effectively
reducing the variance of high-variance algorithms like decision trees, thereby mitigating
overfitting. Random Forest finds applications across various industries, including banking,
stock trading, medicine, and e-commerce, for tasks such as predicting customer behavior,
detecting fraud, recommending products, and analyzing medical data. Notably, its advantages
include its ability to address overfitting, efficiency in handling large datasets, and relative
ease of implementation compared to more complex models like neural networks. (Breiman,
L., 2001)
Support Vector Machines (SVM)
The Support Vector Machine (SVM), a supervised learning classifier, is instrumental
in data sorting by determining an optimal hyperplane that maximizes class separation within
an N-dimensional space. This method aims to establish the widest margin between different
groups, thereby enhancing classification accuracy. Operating within high-dimensional feature
spaces, SVMs employ linear functions and learning algorithms grounded in optimization and
statistical learning theories (Wang, 2005).
While traditionally conceived for binary classification, SVMs have adapted to
computationally intensive multiclass problems by combining multiple binary classifiers. In
mathematical terms, SVMs employ kernel methods to transform data features via kernel
functions, facilitating the simplification of data boundaries for non-linear problems. This
process entails mapping complex datasets into higher dimensions to aid in data point
separation. While this technique, known as the kernel trick, introduces computational
complexities, it efficiently transforms data into higher dimensions. (Moguerza & Muñoz,
2006)
SVM's versatility extends to text and image classification tasks, where it excels in
tasks like category assignment, spam detection, sentiment analysis, and image recognition,
particularly in aspect-based recognition and color-based classification. Furthermore, SVM
plays a crucial role in handwritten digit recognition, contributing to postal automation
services. Notably, SVM demonstrates efficacy in high-dimensional spaces and is frequently
employed in text classification tasks, serving as a potent tool for distinguishing between
Python and Java code files.
CodeBert
Feng et al. (2020) introduced CodeBERT, a pre-trained model designed to understand
the semantic relationships between natural language and code, enabling it to generate
general-purpose representations that are useful for various tasks, such as natural language-
based code search and code documentation generation. CodeBERT is inspired by BERT
(Devlin et al., 2018) and RoBERTa (Liu et al., 2019), leveraging the multi-layer bidirectional
Transformer architecture proposed by Vaswani et al. (2017).

For input representation during pre-training, CodeBERT concatenates two segments with special separator tokens: [CLS], followed by the natural language text (w_1, w_2, ..., w_n), a [SEP] token, and the code snippet (c_1, c_2, ..., c_m), ending with [EOS]. Here, [CLS] serves as
a special token marking the beginning of the segments, with its final hidden representation
serving as the aggregated sequence representation for classification or ranking tasks. Natural
language text is tokenized into WordPiece units, while code segments are treated as
sequences of tokens.
The output of CodeBERT comprises two main components: the contextual vector
representations for each token, encompassing both natural language and code, and the
representation of [CLS], which functions as the summarized sequence representation for
downstream tasks.
CodeBERT finds diverse applications in software development, including code
summarization, translation between programming languages, code completion, and
facilitating code-to-code and code-to-text transformations. For instance, it aids developers in
understanding and documenting code by generating human-readable summaries and
translating code snippets into natural language. Additionally, it enables code search based on
natural language queries and facilitates translation of code domain text into various
languages.
3.1.4.2 Training Results
These classifiers were chosen for their unique strengths and were used to predict and
classify code files as either Python or Java. The variety of classifier types underscores their
effectiveness in terms of accuracy and predictive capabilities. However, as we progressed
with testing on various datasets, we noticed a decline in accuracy. This prompted us to
experiment with a broader range of classifiers to understand the underlying causes of this
accuracy issue. The findings from these additional tests will be presented later in this work.
Neural Networks
The neural network model used in this research was built with Keras, employing a
sequential architecture, which allows layers to be stacked linearly. The model is designed to
perform a binary classification task, and it consists of three primary components: an
embedding layer, a Long Short-Term Memory (LSTM) layer, and a dense output layer.

Before analysing the parameters of the neural network further, it is important to mention some details about the data pre-processing at this stage. The primary focus was on basic cleaning of the code snippet corpus, which involved removing comments, tokenizing the code, and eliminating blank tokens from the dataset. Also, padding was used to ensure that the embeddings matrix has the same size for all code files.
Word2Vec was used to generate word embeddings, which were then organized into an
embedding matrix for input into the neural network.
The first layer is the embedding layer, where the matrix with the already saved
embeddings was used. This layer's configuration involves several important parameters. The
size of the vocabulary, the size of each embedding vector was defined. Also, the input_length
parameter sets the maximum length for input sequences. It also uses a mask_zero feature to
manage variable-length sequences with zero-padding.
Next is the LSTM layer, a type of Recurrent Neural Network (RNN) designed to
handle sequential data. The LSTM layer in this model contains 300 units, uses a dropout rate
of 0.3 to prevent overfitting, and is configured to be stateless, meaning it doesn't retain state
information across batches. The activation function for the LSTM layer is ReLU (Rectified
Linear Unit), providing non-linearity to the model.
The final layer is the dense output layer, designed for binary classification. This layer
consists of one unit with a sigmoid activation function, which outputs a probability value
between 0 and 1. This dense layer serves as the model's classification output.
To train the neural network model, we followed a structured approach to ensure that
the training data was randomized and the validation process was robust. First, we randomized
the order of the dataset to minimize any biases due to the sequence of data points. This step
involved generating a list of indices corresponding to the original dataset and then shuffling
these indices. By reordering the dataset with these shuffled indices, we ensured that the
model would be trained on a random distribution of the data, reducing the likelihood of
overfitting to any specific pattern.
Next, we transformed the output labels into a numerical format suitable for training a
machine learning model. This transformation step was crucial because most machine learning
algorithms require numerical inputs for both training data and target outputs. We used a
simple encoding method to convert categorical labels into numbers, creating a new dataset of
encoded labels.
To prepare the data for training and validation, we split the shuffled dataset into two
parts: one for training and one for testing. We allocated approximately 80% of the dataset for
training and 20% for testing. In addition to the training and testing split, we created a
validation set from the training data. This validation set was used to fine-tune the model's
hyperparameters and ensure optimal performance. The validation set was created by taking a
portion of the training data, allowing us to adjust the model as needed without affecting the
final testing set.
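A sketch of this randomization and splitting procedure, with illustrative data standing in for the averaged embeddings and encoded labels, could look as follows:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 300)             # stands in for the averaged embeddings
y = np.random.randint(0, 2, size=100)    # stands in for the encoded labels (0 = Java, 1 = Python)

# Shuffle the sample order to remove any ordering bias
indices = np.arange(len(X))
np.random.shuffle(indices)
X, y = X[indices], y[indices]

# Roughly 80% training / 20% testing, then a validation split carved from the training part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1)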
Despite the well-structured model described above, it failed to yield valuable results,
with accuracy consistently near 0.49. Although we attempted various corrective measures,
such as adjusting the number of epochs and expanding the dataset with additional code
snippets, these efforts did not lead to a significant improvement in performance. Furthermore,
the model exhibited high instability and sensitivity to variations in the input data's structure
and quality, resulting in frequent run-time issues and necessitating manual interventions.
Due to these persistent issues, we ultimately decided to abandon this model as a
viable approach for our prediction methodology. The model's inherent instability and the lack
of valuable insights it provided made it unsuitable for continued use, prompting us to explore
alternative techniques for the initial stages of our research.
Random Forest and SVM
As part of the next steps in our research, we chose Random Forest and Support Vector
Machines (SVM) due to their robustness and flexibility, aiming to determine whether they
could yield more accurate results. However, before going further into the specifics of the
experiments and the fine-tuning techniques applied after using these classifiers, it's crucial to
point out that both classifiers displayed similar trends. Namely, they exhibited overfitting on
the training dataset and a significant drop in prediction accuracy when tested with different
datasets. This section outlines the process followed with the Random Forest classifier. As the
results of the SVM classifier were similar, this section will primarily focus on the steps taken
with Random Forest and the fine-tuning strategies employed to address overfitting and
improve prediction capabilities.

In our study, we expanded the evaluation to other classifiers, specifically Random
Forest and Support Vector Machines (SVM), both of which yielded similar results. Initially,
we employed the default Random Forest algorithm without additional parameterization. The
default settings include 100 trees (n_estimators), Gini impurity as the split criterion
(criterion), no limit on tree depth (max_depth), and a minimum of 2 samples required to split
a node (min_samples_split). The classifier uses bootstrapped samples (bootstrap=True), and
the maximum features considered for a split is the square root of the total features
(max_features='auto'). Out-of-bag score (oob_score) is disabled, and a single core is used for
computation (n_jobs=None). The random_state parameter is unset, allowing results to vary
with each run. This default configuration provides a robust starting point for building a
Random Forest model, allowing for flexible customization to suit specific data and use cases.
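A sketch of this baseline run, reusing the training and testing arrays from the split described earlier (the names are illustrative), is shown below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()    # default settings: 100 trees, Gini criterion, unlimited depth
rf.fit(X_train, y_train)         # averaged embeddings and labels from the 80/20 split
print(accuracy_score(y_test, rf.predict(X_test)))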
The initial training phase used the "1st Training dataset," composed solely of
algorithms written in both Java and Python. For data pre-processing, this stage involved
removing comments, tokenizing code snippets, eliminating blank tokens, generating
embeddings using Word2Vec, and averaging the code vectors. The training and testing
datasets were created with an 80%/20% ratio. Despite this thorough approach to data
preparation, the results showed an unusually high accuracy, nearing 100%.
Such extreme accuracy raised concerns about the dataset's structure and diversity. It
suggested that the dataset might be too uniform or lacking in variability, leading to an overly
optimistic model performance. This outcome implied that the model might not be
generalizing effectively and could struggle with more diverse or complex datasets.
Consequently, further evaluation and dataset adjustments would be needed to ensure the
model's robustness and reliability in real-world scenarios.
Further adjustments were made by expanding the dataset with additional code
snippets from other Java projects and Python libraries, as described in Section 3.2. This
enrichment aimed to increase dataset diversity and reduce overfitting. After these changes,
the model's accuracy improved, reaching approximately 0.89, with overfitting reduced to a
more acceptable level. Although this accuracy was satisfactory during training, when the
model was tested on unseen data to predict other code files, the accuracy dropped
significantly to 0.42. This decline indicated that the model struggled to generalize beyond the
training data, suggesting that additional modifications to the model's parameters and structure
were necessary to achieve more reliable predictions.

We used grid search to evaluate different parameter configurations for the model, aiming
to identify the optimal settings for improved accuracy. This process involved systematically
testing various combinations of parameters to determine the best configuration for our
Random Forest classifier. The final parameters chosen as a result of this exercise are as follows (see the sketch after the list):
Split Criterion: The classifier uses 'gini' as the split criterion, employing Gini impurity to
assess the quality of splits within the trees.
Maximum Tree Depth: The maximum depth of the trees is capped at 10, which helps
prevent overfitting by limiting the growth of the decision trees, ensuring they do not
become excessively complex.
Maximum Features for Splits: To promote diversity in the forest and reduce overfitting,
the maximum number of features considered for each split is set to the logarithm base 2
of the total number of features ('log2').
Minimum Samples for Leaf Nodes and Splits: The minimum number of samples
required for a leaf node is one, allowing smaller branches to form. The minimum number
of samples needed to split a node is set at two, providing stability and preventing overly
complex splits.
Number of Trees: The classifier consists of 100 trees, creating a robust ensemble that
balances diversity and accuracy.
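The exact parameter grids explored are not listed in the text; a sketch of such a grid search, with illustrative ranges and the training arrays from the earlier split, is given below.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5],
    "n_estimators": [100, 200],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)   # expected to match the configuration described above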
After applying these parameters to the "2nd Training dataset" the accuracy during training
again reached 1, suggesting that the model was overfitting to the training data. However,
when we used this model to predict code files from an external dataset, the accuracy
plummeted to 0.46, indicating that the model's generalization capacity was still limited.
As discussed in the data pre-processing section, when we analysed the token
frequencies for dimensionality reduction, we observed that a substantial portion of the dataset
was composed of stop words, strings, and numbers. These components tended to add noise
and detracted from the patterns necessary for accurate predictions. In response, we chose to
remove all instances of stop words, strings, and numbers from the dataset, thereby
minimizing unnecessary complexity and enhancing data quality. After completing this data
cleansing, we applied dimensionality reduction techniques to retain only the most significant
tokens for prediction.

An interesting observation from our research was that, although the accuracy on the
training dataset remained consistently at 1, the model's performance on other datasets showed
considerable variation, with accuracy ranging between 0.46 and 0.79. Despite this variability
in accuracy, we noted an increase in precision. However, when the model was tested on
predictions across three different datasets, the accuracy exhibited significant fluctuation,
indicating a strong likelihood of overfitting to the training dataset and heightened sensitivity
to variations in the input datasets used for prediction. This inconsistency underscores the need
for further investigation into the model's robustness and suggests that additional measures
may be required to mitigate overfitting and improve generalization.
CodeBERT
Given that the results from the classifiers used in our study were not encouraging and
demonstrated unreliable predictive performance, we decided to focus our efforts on
enhancing CodeBERT, a pre-trained model for both programming languages (PL) and natural
languages (NL), as presented by Feng et al. (2020).
CodeBERT is designed to learn general-purpose representations that can support a
range of NL-PL tasks, such as natural language code search, code documentation generation,
and similar applications. It is built upon the multi-layer Transformer architecture, which is a
common structure for large pre-trained models. CodeBERT's effectiveness is demonstrated in
its performance on various tasks, achieving state-of-the-art results in code search and code-
to-text generation, bridging the gap between natural language and programming languages.
We incorporated an alternative version of CodeBERT, known as CodeBERTa-
language-id, into our study. This model, pre-trained on the task of programming language
identification (PLI), classifies code snippets into the programming language they belong to.
By integrating a sequence classification head atop the CodeBERTa-small-v1 architecture, the
model exhibits remarkable evaluation accuracy and F1 scores, exceeding 0.999. Accessible
through a TextClassificationPipeline, users can input code snippets and obtain predictions
regarding their programming language.
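A minimal sketch of querying this model through the Hugging Face transformers library is given below; the model identifier and the example snippet are assumptions based on the publicly available CodeBERTa-language-id checkpoint.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model_id = "huggingface/CodeBERTa-language-id"
pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained(model_id),
    tokenizer=AutoTokenizer.from_pretrained(model_id),
)
print(pipeline("def hello():\n    return 'world'"))
# e.g. [{'label': 'python', 'score': 0.99...}]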
In our experimentation with the "2nd Training dataset," we undertook several
preprocessing steps, including the removal of comments, stop words, strings, and numerical
values from the code files. The average accuracy achieved was 0.97.

3.2 Identify library type from Python source code snippets
Progressing with our research, and after reviewing the results of our language
identification algorithm, we will filter only the files recognized as Python language files.
Using a sample method, we will then identify the type of library each source code file
belongs to, categorizing them into data visualization, machine learning, NLP, or web
development.
Below is a quick overview of the architecture of this solution, along with some
adjustments to improve accuracy. In this case, four separate classifiers were trained to
identify each specific library, and four separate t-SNE reducers were applied after vector
creation. By combining the predicted probabilities from each classifier, the ensemble method
accounts for the confidence levels of each classifier in its predictions, leading to a more
balanced and robust final prediction.
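An illustrative sketch of this combination step, assuming four fitted binary classifiers (one per category) that expose predict_proba, is shown below:

import numpy as np

categories = ["data_visualization", "machine_learning", "nlp", "web_development"]

def predict_category(vector, classifiers):
    # classifiers maps each category name to a fitted binary classifier whose
    # positive class means "belongs to this library type"
    scores = [classifiers[c].predict_proba([vector])[0][1] for c in categories]
    return categories[int(np.argmax(scores))]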
Figure 11. Adjusted Methodology for library identification

3.2.1 Data collection & Pre-processing
For training the four separate classifiers, we utilized popular and structured Python
libraries sourced from GitHub across the specified categories. Saabith, Vinothraj, and Fareez
(2020) emphasize factors influencing Python library popularity: ease of use, expressiveness,
interpretation, cross-platform compatibility, and open-source nature. They also highlight
common applications such as NLP, GUI, Web Application, Data Science, and Machine
Learning. Based on this evidence, we selected the following categories for our analysis.
An additional aspect to consider was the selection of files for the second class in all
our classifiers. We chose mathematics and algorithms libraries based on the hypothesis that
they are versatile and applicable across various domains and scenarios.
The libraries used for training in each category are listed below:
Table 8. Python libraries for training

Library type                               Library title   GitHub link
Data Visualization                         Matplotlib      https://github.com/matplotlib/matplotlib
                                           Seaborn         https://github.com/seaborn/seaborn
                                           Plotly          https://github.com/plotly/plotly.py
                                           Bokeh           https://github.com/bokeh/bokeh
                                           Altair          https://github.com/altair-viz/altair
                                           Dash            https://github.com/plotly/dash
Machine Learning                           Keras           https://github.com/keras-team/keras
                                           Pytorch         https://github.com/pytorch/pytorch
                                           Scikit-learn    https://github.com/scikit-learn/scikit-learn
                                           Tensorflow      https://github.com/tensorflow/tensorflow
                                           xgboost         https://github.com/dmlc/xgboost
NLP                                        Transformers    https://github.com/huggingface/transformers
                                           TextBlob        https://github.com/sloria/TextBlob
                                           Textacy         https://github.com/chartbeat-labs/textacy
                                           Polyglot        https://github.com/aboSamoor/polyglot
                                           Gensim          https://github.com/RaRe-Technologies/gensim
                                           Flair           https://github.com/flairNLP/flair
                                           FastText        https://github.com/facebookresearch/fastText
                                           BERT            https://github.com/google-research/bert
                                           NLTK            https://github.com/nltk/nltk
Web Development                            Django          https://github.com/django/django
                                           Flask           https://github.com/pallets/flask
                                           Pyramid         https://github.com/Pylons/pyramid
                                           Web2py          https://github.com/web2py/web2py
                                           CherryPy        https://github.com/cherrypy/cherrypy
                                           Bottle          https://github.com/bottlepy/bottle
                                           Tornado         https://github.com/tornadoweb/tornado
                                           Sanic           https://github.com/sanic-org/sanic
Mathematics, data structures & algorithms  NumPy           https://github.com/numpy/numpy
                                           SciPy           https://github.com/scipy/scipy
                                           SymPy           https://github.com/sympy/sympy
                                           Pandas          https://github.com/pandas-dev/pandas
                                           Theano          https://github.com/Theano/Theano
                                           CVXPY           https://github.com/cvxgrp/cvxpy
                                           PyMC            https://github.com/pymc-devs/pymc
                                           SageMath        https://github.com/sagemath/sage
                                           PuLP            https://github.com/coin-or/pulp
                                           CherryPy        https://github.com/cherrypy/cherrypy
All files were downloaded from GitHub and stored locally in separate folders for
training purposes. Each of the four classification types had between 3000-4000 files, while
6000 random files were selected from mathematics and data structures libraries. To ensure
balanced training and avoid biases, a random sample equal to the number of files in each
classification type was chosen from the random files pool. This approach ensures that each
classifier receives diverse training data, optimizing for accuracy across all categories.
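As a simple illustration of this balancing step, the sketch below draws, for one category, a random sample from the pool of mathematics and algorithm files equal in size to that category; the folder names data_visualization and random_math are hypothetical.
import os
import random

# Hypothetical local folders holding the downloaded GitHub files.
category_files = [os.path.join("data_visualization", f)
                  for f in os.listdir("data_visualization") if f.endswith(".py")]
random_pool = [os.path.join("random_math", f)
               for f in os.listdir("random_math") if f.endswith(".py")]

# Draw as many random files as there are category files to keep the classes balanced.
random.seed(42)
negative_sample = random.sample(random_pool, k=len(category_files))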
The data pre-processing applied is similar to the language recognition algorithm
described in Section 3.1.2, involving the removal of comments and stopwords, tokenization,
and the removal of empty tokens.
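A simplified sketch of this pre-processing is given below, assuming Python-style '#' comments and an NLTK English stop-word list; the exact rules of Section 3.1.2 may differ.
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words("english"))

def preprocess(source_code):
    # Remove single-line comments (simplified: Python '#' comments only).
    code = re.sub(r"#.*", "", source_code)
    # Split on non-word characters to obtain rough tokens.
    tokens = re.split(r"\W+", code)
    # Drop stop words and empty tokens.
    return [t for t in tokens if t and t.lower() not in STOP_WORDS]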
For example, an input file for the Machine Learning classification training set looks
like the following before pre-processing:

Figure 12. Machine learning library code snippets example
After the pre-processing, the tokenized source code will be saved in a dataframe with the
below structure:
Figure 13. Machine Learning library code snippets example after tokenization

3.2.2 Dimensionality reduction with t-SNE
Following data cleansing and pre-processing, embeddings were generated using the
Word2Vec model with parameters specified in Section 3.1.3. This model transforms each
word into a 300-dimensional vector, enabling efficient text representation. The next step
involved averaging the embeddings for each code snippet, where each code file, consisting of
a list of tokens, was represented by 300-dimensional vectors for subsequent analysis.
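A minimal sketch of this averaging step, assuming a trained gensim Word2Vec model with 300-dimensional vectors, is shown below; the model file name is hypothetical.
import numpy as np
from gensim.models import Word2Vec

# Hypothetical path to the Word2Vec model trained with the parameters of Section 3.1.3.
w2v_model = Word2Vec.load("word2vec_300d.model")

def average_embedding(tokens, w2v_model):
    # Keep only tokens that exist in the Word2Vec vocabulary.
    vectors = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    if not vectors:
        return np.zeros(w2v_model.vector_size)
    # Each code file is represented by the mean of its token vectors.
    return np.mean(vectors, axis=0)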
To further reduce complexity and visualize the high-dimensional data, we applied t-
SNE (t-Distributed Stochastic Neighbor Embedding), which reduced the 300-dimensional
embeddings into two features. This dimensionality reduction technique helped in preserving
the local structure of the data while making it easier to analyze and interpret. t-SNE is widely
recognized for its ability to preserve the local structure of high-dimensional data while
enabling effective visualization in lower dimensions (Van der Maaten and Hinton, 2008).
This method has been successfully used in various domains, including natural language
processing (NLP) and bioinformatics, to reveal complex data patterns and clusters (Maaten,
L.J.P. van der, Postma, E.O., & Herik, H.J. van den, 2009). Furthermore, studies have shown
that combining Word2Vec embeddings with t-SNE can enhance the interpretability of text
data by capturing semantic similarities between words (Liu, Q., & Zhang, H., 2017).
Hence, a t-SNE reducer was created for each source code file category and saved with
each classifier using the Scikit-learn library, with n_components=2 and random_state=42.
The n_components=2 parameter reduces the data to two dimensions, essential for effective
visualization and interpretation of patterns within the embeddings. Setting random_state=42
ensures reproducibility, a critical aspect for validating and maintaining consistency in our
results. t-SNE is particularly effective in preserving local structures within high-dimensional
data, thus facilitating the identification of meaningful clusters and relationships.
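As a sketch, this reduction can be reproduced with scikit-learn as follows, where X is assumed to be the matrix of averaged 300-dimensional embeddings for one library category; note that scikit-learn's TSNE only offers fit_transform, so the reducer is fitted on the embeddings of that category.
from sklearn.manifold import TSNE

# X: array of shape (n_files, 300) holding the averaged Word2Vec embeddings.
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)  # shape (n_files, 2), used for plotting and classifier training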
The findings from this reduction demonstrate clear differentiation between clusters
corresponding to specific Python library source code files and random (mathematics) source
code files. In a t-SNE visualization, clusters are identified as groups of closely located points
within the two-dimensional plot, each representing data points with similar characteristics or
patterns from the original high-dimensional space. Proximity within a cluster indicates higher
similarity among points, contrasting with points in other clusters. Well-separated clusters
suggest distinct groups or categories within the data, whereas overlapping clusters reveal
shared attributes or similarities between different groups. The density and distribution of

points within each cluster provide valuable insights into the structure and relationships within
the data, offering a clearer understanding that may not be discernible in higher-dimensional
representations.
All four categories show noticeable separation between clusters of the four Python
library categories and random source code files from mathematics libraries. However, certain
libraries exhibit clearer distinctions than others, suggesting more promising outcomes for the
prediction model.
The data visualization source code files do not display a clear separation from the
random source code cluster, as there are instances of overlap between them. Additionally,
within the data visualization cluster itself, there are regions of close proximity scattered
throughout, suggesting similarities either within the libraries or across different libraries. This
proximity indicates that certain attributes or coding patterns may be shared among files
within the same library or across different libraries, contributing to the observed clustering
patterns.
Figure 14. t-SNE clusters for data visualization libraries
The cluster representing web development libraries shows a distinct separation from the cluster containing random source code files, with minimal overlap observed between them. Within the web development cluster itself, points are more scattered, indicating lower similarity between the selected libraries. This dispersion suggests that each web development library exhibits unique characteristics and coding patterns, contributing to the overall differentiation observed in the t-SNE visualization.
Figure 15. t-SNE clusters for web development libraries
In the machine learning libraries cluster, the close proximity of points suggests
significant similarities in the source code among most of the libraries used. Meanwhile, the
random source code files cluster is sufficiently separated with minor overlapping observed.
This indicates distinct coding patterns between machine learning libraries and random source
code files, emphasizing the effectiveness of the t-SNE visualization in highlighting these differences.
Figure 16. t-SNE clusters for machine learning libraries
The NLP (Natural Language Processing) cluster exhibits a scattered distribution
similar to web development, with areas of high proximity suggesting shared attributes or
coding patterns within this cluster. However, the separation from the other clusters is distinct,
with minimal overlapping observed.
Figure 17. t-SNE clusters for NLP libraries

The reduction of dimensionality to two dimensions appears to yield valuable results,
effectively decreasing complexity and potentially improving accuracy for the classifiers
trained on each library. Notably, data visualization libraries exhibit higher overlap with
random source code files. It is important to acknowledge that since random source files are
randomly sampled to match the number of files in other library categories, the clusters
generated by t-SNE vary between different reductions. This variation leads to differing
proximities between clusters across different categories, thereby influencing the training
process of classifiers as well.
3.2.3 Individual classifiers per class
The classifiers selected for predicting each Python library are Random Forest models,
using the same parameters described in section 3.1.4.2. These classifiers are configured with
parameters including criterion, max_depth, max_features, min_samples_leaf,
min_samples_split, and n_estimators. These parameters control the splitting criterion, tree
depth, number of features considered for splits, minimum samples per leaf, minimum
samples for splits, and the number of trees in the forest, respectively. We allocated
approximately 80% of the dataset for training and 20% for testing.
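The sketch below illustrates this setup, assuming X_2d holds the two t-SNE features and y the binary labels (library category versus random files); the parameter grid mirrors the hyperparameters listed above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.2, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5],
    "n_estimators": [100, 200],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))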
For the data visualization library, the optimal parameters were found to be criterion='gini', max_depth=None, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, and n_estimators=200. This configuration achieved a best cross-validation score of 0.9568. The performance metrics indicated a precision of 0.70 for class 0 and 0.86 for class 1, with recall values of 0.89 and 0.63, respectively, resulting in an overall accuracy of 0.76.
The machine learning library classifier, configured with criterion='gini',
max_depth=20, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, and
n_estimators=100, achieved a best cross-validation score of 0.9597. It demonstrated a
precision of 0.75 for class 0 and 0.86 for class 1, and recall values of 0.88 and 0.71,
respectively, with an overall accuracy of 0.80.
For the NLP library, the best parameters were criterion='entropy', max_depth=20,
max_features='log2', min_samples_leaf=1, min_samples_split=5, and n_estimators=200,
achieving a best cross-validation score of 0.9656. The classifier showed lower precision and
recall, with values of 0.17 and 0.23 for precision, and 0.15 and 0.27 for recall, resulting in an
overall accuracy of 0.21. This indicates challenges in accurately classifying NLP library files.

The web development library classifier, optimized with criterion='entropy',
max_depth=None, max_features='log2', min_samples_leaf=1, min_samples_split=5, and
n_estimators=100, achieved a best cross-validation score of 0.9647. It showed strong
performance with a precision of 0.93 for class 0 and 0.82 for class 1, and recall values of 0.81
and 0.93, respectively, resulting in an overall accuracy of 0.87.
The training results for each classifier are presented below, with the data visualization
classifier indicating the lowest accuracy, while the other three classifiers show similar, higher
accuracy levels. The lower accuracy of the data visualization classifier aligns with the t-SNE
representation discussed in the previous section, where slight overlapping was identified
between the two clusters. The accuracy for the data visualization classifier might differ with a
different set of random source code files for the second class; however, we will proceed to
the ensemble stage and assess how this might affect the results.
Table 9. Python libraries classifiers training accuracy
Classifier              Accuracy  Precision  Recall  F1 Score
Data visualization RF   68%       70%        68%     68%
Web Development RF      83%       84%        83%     83%
NLP RF                  87%       87%        87%     87%
Machine Learning RF     83%       84%        83%     83%
3.2.4 Ensemble
The ensemble method was employed to address overfitting issues encountered in
previous classifiers and to enhance the accuracy and adaptability of our solution, especially
when incorporating additional Python libraries into the prediction model (Dietterich, 2000).
The ensemble method effectively combines the predictions from multiple classifiers trained
on distinct Python library categories, improving classification accuracy. By integrating
probabilistic assessments and normalizing contributions, the ensemble model provided a
robust framework for accurately categorizing source code snippets into their respective

domains.
Figure 18. Ensemble method process
In more detail, after training individual Random Forest classifiers for each Python
library category (data visualization, web development, machine learning, and NLP) an
ensemble method was applied to further refine the classification process. Each classifier was
loaded and utilized to predict the probabilities of code snippets belonging to its respective
category. For instance, the data visualization classifier predicted probabilities for both data
visualization-related snippets and random source code snippets. Similarly, the other
classifiers made predictions based on their specific categories, contributing to the overall
probabilistic assessment.

Figure 19. Aggregated probabilities results from classifiers
The ensemble method integrated these individual predictions by initializing an array
to aggregate the combined probabilities across all categories. This aggregation process
involved normalizing the probabilities to ensure each category's influence was proportionate,
thereby balancing the contributions from each classifier. After combining and normalizing
the probabilities, the final classification decision for each snippet was determined by
identifying the category with the highest probability in the aggregated array.
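A minimal sketch of this aggregation is shown below; it assumes each fitted classifier exposes predict_proba, that column 1 corresponds to its own library category, and that the variable names are illustrative rather than taken from our implementation.
import numpy as np

def ensemble_predict(classifiers, features, categories):
    # classifiers: dict mapping category name -> fitted binary classifier
    # features:    dict mapping category name -> that classifier's t-SNE features
    #              for the same set of code snippets
    n_snippets = next(iter(features.values())).shape[0]
    combined = np.zeros((n_snippets, len(categories)))
    for j, cat in enumerate(categories):
        proba = classifiers[cat].predict_proba(features[cat])
        combined[:, j] = proba[:, 1]  # probability of belonging to this category
    # Normalize so that every snippet's category probabilities sum to one.
    combined /= combined.sum(axis=1, keepdims=True) + 1e-12
    # The final label is the category with the highest aggregated probability.
    return [categories[i] for i in combined.argmax(axis=1)]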

This is depicted in Figure 20, which illustrates the distribution of labels for the source code files according to the highest probability classifier.
Figure 20. Voting results of ensemble method
The results of the ensemble method and prediction accuracies will be analyzed further
in Section 4.
3.3 Identify arrays in Python source code snippets
After achieving a low but sufficient accuracy in the task of recognizing Python library
types from source code, we furthered our research to tackle a more specific task: recognizing
the implementation of NumPy arrays in Python source code. This presented a significant
challenge, as the model needed to distinguish not only the presence of arrays in code snippets
compared to random files but also to identify this specific task within various types of source
code from a broader source code corpus. The complexity arose from the need to accurately
detect the usage of NumPy arrays within a source code corpus.
This part of the research involved the following challenges:
1. How will the initial datasets of array source code for training be formed to train the
classifier?
2. What will be the random files for the second class of the classifier?
3. What changes should be applied to the pre-processing process to help the model
identify arrays within the source code corpus?

In the following sections, we will discuss these challenges and evaluate the accuracy of the
training dataset. Figure 21 below outlines the adjusted methodology that will be discussed.
Figure 21. Adjusted Methodology for arrays identification
3.3.1 Training dataset composition
To construct the training datasets for recognizing NumPy array implementations in
Python source code, we employed a script generation approach using the Jinja2 templating
engine. This method allowed us to create a variety of code snippets with NumPy array
operations. The templates defined diverse structures and operations to simulate realistic
Python scripts, enhancing the robustness of our training data.
We utilized eight distinct templates to generate the code snippets, each representing
different coding structures such as functions, classes, and control flow statements. For
example, one template created a simple function that initializes and manipulates a NumPy
array, while another defined a class with multiple methods performing various operations on
a NumPy array.
The templates used are the following:

1. Simple Function Template: This template creates a basic function that initializes a
NumPy array and performs a series of operations on it.
from jinja2 import Template

template1 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    {{ operation1 }}
    {{ operation2 }}
    {{ operation3 }}
    return array
""")
2. Class with Methods Template: This template generates a class with an array as an
attribute and includes methods that perform operations on this array.
template2 = Template("""
import numpy as np

class {{ class_name }}:
    def __init__(self):
        self.array = np.array([{{ array_content }}])
    def {{ method1 }}(self):
        {{ operation1 }}
    def {{ method2 }}(self):
        {{ operation2 }}
    def {{ method3 }}(self):
        {{ operation3 }}
    def get_array(self):
        return self.array
""")
3. Conditional and Loop Template: This template includes a function with conditional
statements and loops that manipulate a NumPy array.
template3 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    if {{ condition }}:
        {{ operation1 }}
    else:
        {{ operation2 }}
    for _ in range({{ loop_count }}):
        {{ operation3 }}
    return array
""")
4. Nested Function Template: This template generates a function with nested helper
functions that manipulate a NumPy array.
template4 = Template("""
import numpy as np

def {{ inner_func_name1 }}(array):
    {{ inner_operation1 }}
    return array

def {{ inner_func_name2 }}(array):
    {{ inner_operation2 }}
    return array

def {{ outer_func_name }}():
    array = np.array([{{ array_content }}])
    array = {{ inner_func_name1 }}(array)
    array = {{ inner_func_name2 }}(array)
    return array
""")
5. Array Concatenation Template: This template creates a function that initializes two
NumPy arrays and concatenates them after performing operations.
template5 = Template("""
import numpy as np

def {{ func_name }}():
    array1 = np.array([{{ array_content1 }}])
    array2 = np.array([{{ array_content2 }}])
    {{ operation1 }}
    {{ operation2 }}
    result = np.concatenate((array1, array2))
    return result
""")
6. Class with Static and Instance Methods Template: This template defines a class with
both static and instance methods that perform operations on NumPy arrays.
template6 = Template("""
import numpy as np

class {{ class_name }}:
    def __init__(self, array):
        self.array = array
    @staticmethod
    def {{ static_method_name }}():
        array = np.array([{{ array_content }}])
        {{ static_operation }}
        return array
    def {{ instance_method_name }}(self):
        {{ instance_operation }}
        return self.array
    def process(self):
        self.array = self.{{ instance_method_name }}()
        return self.array
""")
7. Exception Handling Template: This template includes a function that performs
operations on a NumPy array within a try-except block.
template7 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    try:
        {{ try_operation }}
    except Exception as e:
        print(f"Error: {{ exception_message }}")
        {{ except_operation }}
    return array
""")
8. Looping Structure Template: This template creates a function with a while loop that
repeatedly performs an operation on a NumPy array.
template8 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    count = 0
    while count < {{ loop_count }}:
        {{ loop_operation }}
        count += 1
    return array
""")
To ensure diversity in the generated scripts, we incorporated random elements such as
function names, class names, method names, array contents, and operations. We used the
Faker library to generate random names and the numpy library to produce random array
contents. A selection of NumPy operations, such as array multiplication, sorting, and
trigonometric functions, was randomly applied to the arrays within the templates.
import numpy as np
import random
from faker import Faker

fake = Faker()

def generate_array_content():
    return ', '.join(map(str, np.random.randint(0, 100, size=random.randint(5, 15))))

def generate_operation(array_name='array'):
    operations = [
        f'{array_name} = {array_name} * 2',
        f'{array_name} = np.sqrt({array_name})',
        f'{array_name} = np.log({array_name} + 1)',
        f'{array_name} = np.sin({array_name})',
        f'{array_name} = np.sort({array_name})',
        # further operations omitted in the original excerpt
    ]
    return random.choice(operations)
Using the templates and the random content generation functions, we produced
multiple synthetic Python scripts. Each script was saved to a file, forming part of our training
dataset. This approach ensured that the dataset was sufficiently varied and realistic, providing
a robust basis for training our classifier.
templates = [template1, template2, template3, template4, template5,
             template6, template7, template8]

for i in range(5):  # Adjust number as needed
    template = random.choice(templates)
    script = template.render(
        func_name=generate_func_name(),
        class_name=generate_class_name(),
        method1=generate_method_name(),
        method2=generate_method_name(),
        method3=generate_method_name(),
        inner_func_name1=generate_func_name(),
        inner_func_name2=generate_func_name(),
        outer_func_name=generate_func_name(),  # used by template4
        static_method_name=generate_static_method_name(),
        instance_method_name=generate_method_name(),
        array_content=generate_array_content(),
        array_content1=generate_array_content(),
        array_content2=generate_array_content(),
        operation1=generate_operation(),
        operation2=generate_operation(),
        operation3=generate_operation(),
        inner_operation1=generate_operation('array'),
        inner_operation2=generate_operation('array'),
        static_operation=generate_operation('array'),
        instance_operation=generate_operation('self.array'),
        try_operation=generate_operation('array'),
        except_operation=generate_operation('array'),
        loop_operation=generate_operation('array'),
        condition=generate_condition(),
        loop_count=generate_loop_count(),
        exception_message=generate_exception_message()
    )
    with open(f'script_{i}.py', 'w') as f:
        f.write(script)
This templating approach allowed us to efficiently generate a large, diverse set of
training data, crucial for training a robust classifier capable of recognizing NumPy array
implementations. However, it also presented challenges, such as ensuring the generated
scripts were syntactically correct and representative of real-world code.

For the random code snippets used for the other class of the classifier, we utilized the
same dataset that was employed in section 3.6, which includes topics such as mathematics,
data structures, and algorithms. However, to ensure these snippets did not contain arrays, we
performed a keyword search within these files for terms like "array" or "np.array." This
allowed us to exclude any source code files that already included arrays, ensuring that the
random code snippets were free from NumPy array implementations.
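The exclusion step can be sketched as a simple keyword filter over the candidate random files, as below; the folder name random_math is hypothetical.
import os

def is_array_free(path):
    # Exclude any file that mentions arrays so the random class stays array-free.
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return "np.array" not in text and "array" not in text.lower()

random_candidates = [os.path.join("random_math", f) for f in os.listdir("random_math")]
array_free_files = [p for p in random_candidates if is_array_free(p)]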
In total, the dataset used for training the classifier consisted of 5000 files containing
array source code and 5000 random code files. The data pre-processing followed was similar to that of the previous experiments. We allocated approximately 80% of this dataset for
training purposes and reserved the remaining 20% for testing and evaluating the classifier's
performance.
3.3.2 Feature selection with weighted average
In the initial experiments of this approach, we followed the previously established
methodology by generating embeddings using Word2Vec. These embeddings were averaged
and then used to train both Random Forest and Support Vector Machine (SVM) classifiers.
However, the results from these experiments were not satisfactory, with the accuracy on the
training set being only 14%. This low accuracy indicated significant room for improvement
in our approach to recognizing NumPy array implementations in Python source code.
The low accuracy suggested that the embeddings generated by Word2Vec might not
have been sufficiently capturing the features of NumPy array operations within the code
snippets. Additionally, the complexity of distinguishing specific array operations from a
diverse set of source code files likely contributed to the poor performance. These findings
prompted us to re-evaluate our preprocessing steps, feature extraction methods, and the
overall model architecture to better address the challenges posed by this specific task.
In this process, we transformed a collection of code snippets into a weighted feature
matrix using the TF-IDF (Term Frequency-Inverse Document Frequency) technique.
Initially, each snippet was tokenized and normalized to ensure consistency. We then used the
TF-IDF vectorizer to convert the corpus into a term-document matrix, which quantifies the
importance of each token across all documents. To ensure all relevant tokens were included,
we identified and added any missing tokens as new columns with zero weights. Finally, we
calculated the average TF-IDF weight for each token across all documents, providing a

comprehensive representation of token importance within the corpus. This approach
facilitated the creation of a robust feature set for subsequent machine learning tasks.
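A condensed sketch of this weighting step with scikit-learn's TfidfVectorizer is given below; corpus is assumed to be the list of pre-processed code snippets, each re-joined into a single whitespace-separated string.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus: list of code snippets, each already tokenized and re-joined with spaces.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(corpus)  # term-document matrix

# Average TF-IDF weight of every token across all documents.
avg_weights = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
token_weights = dict(zip(vectorizer.get_feature_names_out(), avg_weights))
print(sorted(token_weights.items(), key=lambda kv: kv[1], reverse=True)[:10])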
The results of the weighting process suggest that features related to NumPy arrays are
assigned significant weights, with high values given to terms like "array" and "np." This
weighting emphasizes the importance of these terms in identifying NumPy array operations
within the source code corpus. Initially, each code snippet was tokenized and transformed
into a TF-IDF feature matrix, representing the relevance of each term across all documents.
Figure 22. TF-IDF feature matrix for np
Figure 23. TF-IDF feature matrix for arrays

4 Results summary and conclusions
This project had two main purposes. Firstly, to represent source code as vectors and
develop a machine learning classification model capable of identifying its primary
characteristics and architecture. Secondly, to extract information on libraries, frameworks,
and algorithm knowledge for further analysis. The main objective was to provide guidance
through the steps needed in the process of selecting a suitable approach that could produce
reliable skills predictions from source code snippets. For this purpose, data were extracted
from GitHub repositories of various well-known and professionally structured libraries. In
terms of machine learning techniques, neural networks, random forests, and SVM were
considered, with Word2Vec utilized for the embeddings generation. This chapter summarizes
the conclusions derived from the application of the models and their results, discusses the
limitations faced during the project’s execution, and concludes with some recommendations
for future research.
4.1 Experiments for programming language recognition
Table 10 below provides a concise overview of the outcomes derived from the classifier experiments for programming language recognition between Python and Java. The first experiments were conducted with random code files derived from GitHub repositories, such as httpx and jhipster. Notably, the accuracy rate from the training indicates potential overfitting
issues, alongside a lack of predictive success across other datasets. It is interesting to observe
that despite extensive data processing and dimensionality reduction applied to the tokenized
code files, certain prediction datasets exhibited a remarkably high accuracy, approaching
0.79. This anomaly could be attributed to the similarity between these datasets and the initial
training dataset of the overfitting classifier, thereby enabling it to predict with exceptional
accuracy in these particular instances.
Table 10. Results summary
Classifier                                           Accuracy  Precision  Recall  F1 Score  Prediction Accuracy
Neural Networks                                      0.46      0.33       0.33    0.33      -
RF with Embeddings Averaging (1st Training dataset)  0.99      0.99       0.99    0.99      -
RF with Embeddings Averaging (2nd Training dataset)  0.86      0.89       0.86    0.86      0.46
RF Grid search                                       1.00      1.00       1.00    1.00      0.46
RF Stop words removal                                1.00      1.00       1.00    1.00      0.46-0.79 depending on the dataset
RF Dimensionality reduction                          1.00      1.00       1.00    1.00      0.46-0.79 depending on the dataset, indicating high precision in some cases
SVM Grid Search                                      1.00      1.00       1.00    1.00      0.46
SVM Dimensionality reduction                         0.98      0.98       0.98    0.98      0.46-0.79 depending on the dataset
CodeBERTa                                            0.97      -          -       -         -
As analysed above, during the training process of the programming language recognition classifier it was evident that overfitting in our model necessitated identifying the proper inputs to enhance prediction accuracy and provide valuable results. Consequently, we
decided to explore the outcomes of predicting the programming languages of various well-
known Python and Java libraries and projects, and to analyse the resulting data.
Table 11 categorizes selected Java libraries based on their functionality. Web
development libraries have the highest number of analysed files, while the deeplearning4j
library stands out with the most tokens and unique tokens, emphasizing its comprehensive
content and structured files for machine learning experimentation.
Table 11: Java files for experiments

Data Visualization
  XChart (323 files; 77,976 tokens; 3,239 unique tokens): XChart is a simple, lightweight library for plotting data with just two lines of code.
  JFreeChart (993 files; 461,302 tokens; 9,296 unique tokens): JFreeChart is a free, open-source Java library for creating professional-quality charts with extensive features and support for various output formats.
Machine Learning
  deeplearning4j (3,243 files; 3,194,708 tokens; 40,391 unique tokens): Eclipse Deeplearning4j enables deep learning on the JVM, allowing Java model training and Python ecosystem interoperability.
  smile (1,027 files; 413,264 tokens; 7,657 unique tokens): Smile is a versatile Java library for machine learning and statistical analysis, providing a broad array of algorithms for tasks like clustering, classification, and regression.
Web development
  Spring Boot (3,004 files; 782,104 tokens; 33,946 unique tokens): Provides a simplified and rapid way to create production-ready applications. It reduces the need for extensive configuration.
  wicket (3,003 files; 581,964 tokens; 18,356 unique tokens): The 10th major release of Apache Wicket, built on Java 17, modernizes web development by providing a robust framework for creating contemporary web applications with Java.
NLP
  CoreNLP (2,038 files; 1,573,066 tokens; 39,883 unique tokens): CoreNLP is a comprehensive Java tool for natural language processing, providing various linguistic annotations for text.
  OpenNLP (1,009 files; 252,095 tokens; 7,518 unique tokens): OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.
Table 12 summarizes selected Python libraries chosen for experimentation, with
TensorFlow standing out in terms of the highest number of files, tokens, and unique tokens,
highlighting its extensive content and capabilities in machine learning.
Table 12: Python files for experiments

Data Visualization
  Seaborn (138 files; 17,784 tokens; 7,636 unique tokens): Seaborn is a Python library based on matplotlib for creating attractive statistical graphics.
  Plotly (1,116 files; 549,164 tokens; 12,023 unique tokens): Plotly is a versatile, open-source Python library for creating interactive and visually appealing plots and dashboards.
Machine Learning
  xgboost (174 files; 109,391 tokens; 6,019 unique tokens): XGBoost is a highly efficient, flexible, and portable gradient boosting library for fast and accurate machine learning, supporting major distributed environments like Hadoop and MPI.
  tensorflow (2,743 files; 3,467,115 tokens; 86,681 unique tokens): TensorFlow is an open-source machine learning framework developed by Google, widely used for building and training machine learning models.
Web development
  pyramid (137 files; 164,013 tokens; 7,503 unique tokens): Pyramid simplifies web application development with a minimal "hello world" setup that scales effortlessly as your project expands, offering advanced features for building complex software efficiently.
  tornado (99 files; 132,794 tokens; 75 unique tokens): Tornado is a scalable Python web framework and asynchronous networking library, ideal for long-lived connections using non-blocking I/O.
NLP
  nltk (294 files; 371,158 tokens; 17,186 unique tokens): NLTK is a leading Python platform for human language data, offering interfaces to over 50 corpora, text processing libraries, NLP wrappers, and an active discussion forum.
  TextBlob (31 files; 2,158 tokens; 1,889 unique tokens): TextBlob is a Python library offering a simple API for common NLP tasks like part-of-speech tagging, noun phrase extraction, sentiment analysis, and classification.
Experiments for language identification were conducted using various combinations
of library categories. Initially, models were tested in pairs within both Python and Java
libraries, followed by testing across all four combined libraries. Results indicate that Random
Forest achieved slightly better accuracy compared to SVM. While overall accuracy was not
sufficient, the language recognition algorithm showed improved results for data visualization
and machine learning libraries. However, improvements are needed for NLP and web
development libraries as presented in Table 13. The high recall results indicate that our
classifiers effectively recognize Python language in source code snippets, particularly in NLP
library categories, more so than Java code snippets.
Table 13: Experiments results of identifying the library's language
Category            Library                   Classifier  Accuracy  Precision  Recall  F1 score
Data Visualization  Seaborn & xchart          RF          86%       68%        99%     80%
                                              SVM         30%       30%        100%    46%
                    JfreeChart & Plotly       RF          55%       54%        95%     69%
                                              SVM         53%       53%        100%    69%
                    All                       RF          57%       53%        100%    69%
                                              SVM         48%       48%        100%    65%
Machine Learning    deeplearning4j & xgboost  RF          6%        4%         82%     8%
                                              SVM         5%        5%         100%    10%
                    smile & tensorflow        RF          68%       72%        92%     81%
                                              SVM         73%       73%        100%    84%
                    All                       RF          41%       40%        91%     56%
                                              SVM         41%       40%        100%    58%
Web Development     Spring boot & pyramid     RF          6%        3%         68%     6%
                                              SVM         4%        4%         100%    10%
                    smile & tensorflow        RF          16%       3%         86%     6%
                                              SVM         3%        3%         100%    6%
                    All                       RF          12%       4%         97%     7%
                                              SVM         4%        4%         100%    7%
NLP                 CoreNLP & nltk            RF          27%       15%        100%    26%
                                              SVM         13%       13%        100%    22%
                    openNLP & TextBlob        RF          5%        3%         100%    6%
                                              SVM         3%        3%         100%    6%
                    All                       RF          11%       10%        100%    18%
                                              SVM         10%       10%        100%    18%
All                 All                       RF          31%       25%        92%     39%
                                              SVM         24%       24%        100%    39%
As the results indicate, the algorithm needs further training with files related to web
development and NLP libraries to improve accuracy levels. Additionally, the structure or
content of these libraries might be affecting the model's accuracy.
4.2 Experiments for python library type recognition
The accuracy of the separated classifiers for each Python library type, as discussed in
section 3.2.3, shows that the NLP Random Forest classifier indicated the highest
performance, while data visualization resulted in low accuracy. This low accuracy can be
explained by the t-SNE clusters discussed in section 3.2.2, where the data visualization
source code files do not display a clear separation from the random source code cluster,
leading to instances of overlap between them. Additionally, the proximity of points within the
data visualization cluster is not consistent across all points. This inconsistency suggests a
wide range of operations and characteristics among the different data visualization libraries
used for training, which are not shared and do not facilitate uniform training of the classifier.
Table 14: Ensemble method results
Experiment No  Combination         Accuracy  Precision  Recall   F1-score
1              Random              70.8%     16.0%      100.0%   28.0%
               ML                  0.0%      0.0%       0.0%     0.0%
               Data visualization  3.2%      100.0%     4.0%     8.0%
               Ensemble            13.0%
2              Random              54.2%     24.0%      54.0%    33.0%
               ML                  10.3%     100.0%     10.0%    19.0%
               Web Development     66.3%     64.0%      66.0%    65.0%
               Ensemble            51.0%
3              Random              54.2%     7.0%       54.0%    12.0%
               ML                  2.6%      1.0%       3.0%     2.0%
               NLP                 34.7%     85.0%      35.0%    49.0%
               Ensemble            33.0%
4              Random              25.0%     25.0%      54.0%    34.0%
               Data visualization  6.5%      88.0%      8.0%     14.0%
               Web Development     92.9%     56.0%      89.0%    69.0%
               Ensemble            48.0%
5              Random              25.0%     3.0%       25.0%    5.0%
               Data visualization  0.0%      0.0%       0.0%     0.0%
               NLP                 38.2%     58.0%      38.0%    46.0%
               Ensemble            30.0%
6              Random              4.2%      3.0%       4.2%     5.0%
               Web Development     83.7%     36.0%      79.0%    50.0%
               NLP                 45.9%     87.0%      35.0%    50.0%
               Ensemble            51.0%
7              Random              66.7%     11.0%      54.0%    18.0%
               ML                  0.0%      0.0%       0.0%     0.0%
               Data visualization  3.2%      100.0%     2.0%     4.0%
               Web Development     80.6%     58.0%      68.0%    63.0%
               Ensemble            39.0%
8              Random              37.5%     11.0%      2.0%     18.0%
               ML                  2.6%      9.0%       11.0%    10.0%
               Data visualization  13.9%     39.0%      14.0%    21.0%
               NLP                 32.4%     70.0%      31.0%    43.0%
               Ensemble            32.0%
9              Random              0.0%      0.0%       0.0%     0.0%
               ML                  2.6%      8.0%       23.0%    12.0%
               Web Development     5.1%      9.0%       11.0%    10.0%
               NLP                 20.2%     85.0%      22.0%    35.0%
               Ensemble            15.0%
10             Random              45.8%     39.0%      14.0%    21.0%
               Data visualization  47.3%     42.0%      21.0%    28.0%
               Web Development     5.1%      1.0%       8.0%     3.0%
               NLP                 16.7%     12.0%      26.0%    17.0%
               Ensemble            21.0%
12             Random              4.2%      1.0%       4.0%     1.0%
               ML                  12.8%     5.0%       13.0%    7.0%
               Data visualization  1.1%      1.0%       1.0%     1.0%
               Web Development     39.8%     25.0%      40.0%    30.0%
               NLP                 26.5%     67.0%      27.0%    38.0%
               Ensemble            26.0%

We used an ensemble method to run multiple combinations of libraries to evaluate the
results and accuracy of our model. Although the overall performance was not very high, some
valuable outcomes can be discussed. Table 14 above presents the accuracies for all
combinations of the ensemble method.
For the ensemble test, the following libraries were used:
Machine Learning: H2O.ai, TPOT
NLP: SpaCy, StanfordNLP
Web Development: Responder, Falcon
Data Visualization: VisPy, HoloViews
Random Files: mpmath, bitarray
Although the NLP classifier demonstrated the highest accuracy during training, the
web development classifier exhibited superior performance in the prediction phase. This
improved performance was particularly notable when the web development classifier was
combined with the data visualization classifier, potentially due to the operational similarities
between these libraries. A similar behaviour was observed with the combination of the web
development and NLP classifiers, which achieved the highest ensemble accuracy.
However, an interesting behaviour was observed with the combination of the web
development classifier with the ML and NLP classifiers, which resulted in almost the lowest
accuracy (0.15). This outcome may be due to the high similarities between ML and NLP
operations and possible similarities with the random files as well, although this was not
implied by the t-SNE clusters analysis.
The data visualization classifier indicated the highest accuracy when combined with
the web development and NLP classifiers (0.47), even though these classifiers do not
individually offer high accuracy in combination. The lowest accuracy was observed when the
ML and data visualization classifiers were combined, where primarily the random files were
predicted correctly.
The performance of the web development classifier can be explained by the well-
scattered clusters observed in the t-SNE visualization. These clusters indicate disparity within

the training dataset, allowing the classifier to learn unique characteristics of each library. This
diversity in training contributes to the classifier's efficiency in making predictions.
4.3 Experiments for python arrays recognition
During the training of random forest classifiers on NumPy arrays, we observed an
accuracy of 1, indicating potential overfitting, despite the inclusion of weighted token
features for arrays. However, when applying the classifier to predict on files containing
mathematical libraries—comprising both arrays and unrelated random files—the accuracy
dropped significantly to 0.04. This outcome highlights the considerable challenge in
accurately identifying arrays within extensive source code datasets.

5 Conclusions, limitations and future research
5.1 Conclusions
A major finding of this research is that machine learning methods combined with
embedding techniques like Word2Vec can effectively estimate programming language skills,
library types, and operation usability using data from GitHub repositories. This is achievable
provided that the source code fulfills certain prerequisite conditions to ensure the diversity of
features for training and the quality of tokens.
It was also observed that neural network techniques were not appropriate in this case,
as they provided low accuracy and showed limited improvement with parameterization. On
the other hand, Random Forest and SVM, although prone to overfitting during the training
process of some applications, yielded valuable results for further research, especially after
grid search for optimal parameterization. An interesting finding was that data visualization
libraries indicated minimal scatter of features and limited training abilities for classifiers.
Conversely, web development libraries proved more suitable for this type of application,
offering higher accuracy compared to other libraries. This is a significant finding for the
software development community.
When examining the four individual Python libraries under discussion, the results
vary depending on the task at hand. The estimated accuracy in recognizing programming
languages between Java and Python was higher for source code snippets related to data
visualization operations and significantly lower for web development operation source code
files. This discrepancy could be attributed to the differing feature selection methodologies
and dimensionality reduction techniques employed between the two prediction models, with
weights assigned to different features of the source code.
It is also worth mentioning that, for the individual Python library type classifiers, all
categories indicated accuracy in the training set higher than 0.80. This suggests that with a
more concise and well-structured training dataset, the accuracy could be improved
substantially. On the other hand, the effort to recognize array operations in source code—a
very specific and widely used application in the programming and software development
community—yielded insufficient results. This underscores the need for annotated datasets
focused on source code skills and operations to enhance machine learning predictions and
applications in this domain.

In summary, it was not possible to clearly identify the most reliable approach for our
predictions. Significant discrepancies in accuracy results indicate the need for further
experimentation to build a robust framework for this type of application. Fortunately, these
challenges, created by the lack of high-quality annotated datasets of source code, did not
hinder the achievement of the principal objectives of the project, which were to describe in
detail the tokenization and embeddings process of source code for skill identification
predictions. The objectives were successfully met since we incorporated raw data from well-
known GitHub repositories, described the current knowledge based on previous applications,
applied the necessary data preprocessing to generate a valuable training input dataset, and
interpreted the findings in a way that can offer guidance on the potentials, prerequisites, and
limitations of these approaches.
5.2 Limitations
Several challenges arose during this project. One major hurdle was the lack of
annotated datasets for identifying skills, making it hard to find enough files with the right
content and structure to train our classifiers effectively. This led to a relatively small sample
size, which might have affected the accuracy of our measurements. Moreover, although
GitHub provided around 4000 code snippets for detecting library types, gathering, analyzing,
and preparing these snippets took a lot of time and resources. To expand our dataset, we
created a Python program to generate template files containing arrays with different
operations. However, this approach risked making our input files too similar, potentially
causing our model to be too specialized and less adaptable.
Another limitation was the limited availability and quality of data, which could affect how well our application works in different situations. While our results gave us useful insights, they also
showed that our model struggled when we tried to predict new source code files, making it
hard to identify programming languages accurately. This highlights the need for more
training of our classifiers, possibly using more advanced techniques to extract features.
Additionally, a significant challenge was the similarity between features in libraries
used for data visualization, machine learning (ML), and natural language processing (NLP).
This similarity made it harder for our classifiers to make accurate predictions. To address
this, we need to use a wider range of data files and Python libraries during training to
improve the accuracy of our ensemble method classifiers.

Furthermore, another limitation was the lack of extensive research in skills
identification. There wasn't much existing literature or established methods that we could
build on or compare our approach to, although this also made our project unique. Developing
and testing new methods required a lot of experimentation and time, which increased the
demand for resources.
Expanding the sample size and refining the specification of features within the source
code would significantly enhance the academic rigor of this project. Nonetheless, the study
conducted a thorough analysis and comparisons among different models. Despite the
aforementioned limitations, the findings offer valuable insights and can serve as a roadmap
for researchers aiming to develop procedures for identifying programming languages, library
types, and specific operations within large software development repositories.
5.3 Recommendations for future research
In light of the insights gained from this study and considering the technical challenges
encountered during its implementation, the following recommendations for further research
are proposed. To deepen the understanding of skill identification in programming languages,
it is imperative to expand the sample size and enhance the breakdown of variables related to
source code features and operations. A broader dataset encompassing diverse programming
languages and a more detailed categorization of operations within source code snippets would
facilitate a more comprehensive analysis and validation of our classifiers' predictions.
Future studies could leverage updated repositories and libraries from platforms like
GitHub, ensuring a richer diversity of source code examples for training and testing
classifiers. Moreover, exploring advanced techniques such as deep learning for source code
embeddings and applying ensemble methods could potentially enhance the accuracy and
robustness of the classifiers in identifying specific programming skills and operations.
Furthermore, the integration of additional features such as semantic parsing or
syntactic analysis could refine the classifiers' ability to discern subtle differences in
programming tasks and functionalities. This approach would contribute to overcoming the
challenges of feature similarity across different libraries and languages, as highlighted in our
study.
Finally, the ongoing evolution of programming practices and the emergence of new
libraries necessitate continuous updates and adaptations in the methodologies used for skill

identification in source code. By integrating these advancements into future research
endeavors, we can further advance the field of automated skill detection in programming
languages, thereby supporting developers and researchers in navigating the complexities of
modern software development.

6 References
1. Allamanis, M. and Sutton, C., 2014, November. Mining idioms from source code. In
Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of
Software Engineering (pp. 472-483).
2. Alon, U., Zilberstein, M., Levy, O. and Yahav, E., 2019. code2vec: Learning distributed
representations of code. Proceedings of the ACM on Programming Languages, 3(POPL),
pp.1-29.
3. Alreshedy, K., Dharmaretnam, D., German, D.M., Srinivasan, V. and Gulliver, T.A.,
2018. SCC: automatic classification of code snippets. arXiv preprint arXiv:1809.07945.
4. Baquero, J.F., Camargo, J.E., Restrepo-Calle, F., Aponte, J.H. and González, F.A., 2017.
Predicting the programming language: Extracting knowledge from stack overflow posts.
In Advances in Computing: 12th Colombian Conference, CCC 2017, Cali, Colombia,
September 19-22, 2017, Proceedings 12 (pp. 199-210). Springer International Publishing.
5. Bengio, Y., Ducharme, R., & Vincent, P. (2003). A neural probabilistic language model.
Journal of Machine Learning Research, 3, 1137-1155.
6. Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197-227.
7. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of
Machine Learning Research, 3(Jan), 993-1022.
8. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with
subword information. Transactions of the Association for Computational Linguistics, 5,
135-146.
9. Buratti, L., Pujar, S., Bornea, M., McCarley, S., Zheng, Y., Rossiello, G., Morari, A.,
Laredo, J., Thost, V., Zhuang, Y. and Domeniconi, G., 2020. Exploring software
naturalness through neural language models. arXiv preprint arXiv:2006.12641.
10. Causa, O., Abendschein, M., Luu, N., Soldani, E. and Soriolo, C., 2022. The post-
COVID-19 rise in labour shortages.
11. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE:
Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research,
16, 321-357.
12. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. P.
(2011). Natural language processing (almost) from scratch. CoRR abs/1103.0398.

13. Cosentino, V., Luis, J. and Cabot, J., 2016, May. Findings from GitHub: methods,
datasets and limitations. In Proceedings of the 13th International Conference on Mining
Software Repositories (pp. 137-141).
14. da Silva, J.R., Clua, E., Murta, L. and Sarma, A., 2015, March. Niche vs. breadth:
Calculating expertise over time through a fine-grained analysis. In 2015 IEEE 22nd
international conference on software analysis, evolution, and reengineering (SANER) (pp.
409-418). IEEE.
15. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. arXiv preprint
arXiv:1810.04805.
16. Dietrich, J., Luczak-Roesch, M. and Dalefield, E., 2019, May. Man vs Machine–A Study
into Language Identification of Stack Overflow Code Snippets. In 2019 IEEE/ACM 16th
International Conference on Mining Software Repositories (MSR) (pp. 205-209). IEEE.
17. Dietterich, T.G., 2000, June. Ensemble methods in machine learning. In International
workshop on multiple classifier systems (pp. 1-15). Berlin, Heidelberg: Springer Berlin
Heidelberg.
18. Dumais, S. T. (2004). Latent semantic analysis. Annual Review of Information Science
and Technology (ARIST), 38, 189-230.
19. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T.,
Jiang, D., & Zhou, M. (2020). CodeBERT: A pre-trained model for programming and
natural languages. arXiv preprint arXiv:2002.08155.
20. Gilda, S., 2017, July. Source code classification using Neural Networks. In 2017 14th
international joint conference on computer science and software engineering (JCSSE)
(pp. 1-6). IEEE.
21. Gholamian, S. and Ward, P.A., 2021, May. On the naturalness and localness of software
logs. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories
(MSR) (pp. 155-166). IEEE.
22. Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s
negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
23. Gousios, G., 2013, May. The GHTorent dataset and tool suite. In 2013 10th Working
Conference on Mining Software Repositories (MSR) (pp. 233-236). IEEE.
24. Greene, G.J. and Fischer, B., 2016, August. Cvexplorer: Identifying candidate developers
by mining and exploring their open source contributions. In Proceedings of the 31st
IEEE/ACM International Conference on Automated Software Engineering (pp. 804-809).

25. Hauff, C. and Gousios, G., 2015, May. Matching GitHub developer profiles to job
advertisements. In 2015 IEEE/ACM 12th Working Conference on Mining Software
Repositories (pp. 362-366). IEEE.
26. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering, 21(9), 1263-1284.
27. Hellendoorn, V.J., Devanbu, P.T. and Alipour, M.A., 2018, October. On the naturalness
of proofs. In Proceedings of the 2018 26th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering
(pp. 724-728).
28. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural
Computation, 9(8), 1735–1780.
29. Hong, J., Mizuno, O. and Kondo, M., 2019, December. An empirical study of source code
detection using image classification. In 2019 10th International Workshop on Empirical
Software Engineering in Practice (IWESEP) (pp. 1-15). IEEE.
30. Huang, W., Mo, W., Shen, B., Yang, Y. and Li, N., 2016. CPDScorer: Modeling and
Evaluating Developer Programming Ability across Software Communities. In SEKE (pp.
87-92).
31. Hyrynsalmi, S.M., Rantanen, M.M. and Hyrynsalmi, S., 2021, June. The war for talent in
software business-how are finnish software companies perceiving and coping with the
labor shortage?. In 2021 IEEE International Conference on Engineering, Technology and
Innovation (ICE/ITMC) (pp. 1-10). IEEE.
32. Javid, A. M., Das, S., Skoglund, M., & Chatterjee, S. (2021, June). A ReLU dense layer
to improve the performance of neural networks. In ICASSP 2021-2021 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp.
2810-2814). IEEE.
33. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016).
Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
34. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M. and Damian, D.,
2014, May. The promises and perils of mining github. In Proceedings of the 11th working
conference on mining software repositories (pp. 92-101).
35. Kang, H.J., Bissyandé, T.F. and Lo, D., 2019, November. Assessing the generalizability
of code2vec token embeddings. In 2019 34th IEEE/ACM International Conference on
Automated Software Engineering (ASE) (pp. 1-12). IEEE.

36. Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future
directions. Progress in Artificial Intelligence, 5(4), 221-232.
37. LeClair, A., Eberhart, Z. and McMillan, C., 2018, September. Adapting neural text
classification for improved software categorization. In 2018 IEEE international
conference on software maintenance and evolution (ICSME) (pp. 461-472). IEEE.
38. Li, S., & Gong, B. (2021). Word embedding and text classification based on deep
learning methods. In MATEC Web of Conferences (Vol. 336, p. 06022). EDP Sciences.
39. Ligu, E., Chaikalis, T. and Chatzigeorgiou, A., 2013. BuCo Reporter: Mining Software
and Bug Repositories.
40. Liu, Q., & Zhang, H. (2017). Combining Word2Vec and t-SNE to enhance the
interpretability of text data. Proceedings of the International Conference on Data Mining
and Applications, 125-130.
41. Maaten, L.J.P. van der, Postma, E.O., & Herik, H.J. van den. (2009). Dimensionality
reduction: A comparative review. Journal of Machine Learning Research, 10, 66-71.
42. Mandelbaum, A., & Shalev, A. (2016). Word embeddings and their use in sentence
classification tasks. arXiv preprint arXiv:1610.08229.
43. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word
Representations in Vector Space. arXiv preprint arXiv:1301.3781.
44. Moguerza, J. M., & Muñoz, A. (2006). Support vector machines with applications.
45. Ohashi, H. and Watanobe, Y., 2019, October. Convolutional neural network for
classification of source codes. In 2019 IEEE 13th international symposium on embedded
multicore/many-core systems-on-chip (MCSoC) (pp. 194-200). IEEE.
46. Oliveira, J., Souza, M., Flauzino, M., Durelli, R. and Figueiredo, E., 2022, September.
Can source code analysis indicate programming skills? a survey with developers. In
International Conference on the Quality of Information and Communications Technology
(pp. 156-171). Cham: Springer International Publishing.
47. Oliveira, J., Viggiato, M. and Figueiredo, E., 2019, October. How well do you know this
library? Mining experts from source code analysis. In Proceedings of the XVIII Brazilian
Symposium on Software Quality (pp. 49-58).
48. Parmar, A., Katariya, R., & Patel, V. (2019). A review on random forest: An ensemble
classifier. In International conference on intelligent data communication technologies and
internet of things (ICICI) 2018 (pp. 758-763). Springer International Publishing.
49. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word
Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 1532-1543.
50. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer,
L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
51. Rahman, M., Palani, D. and Rigby, P.C., 2019. Natural Software Revisited. Department
of Computer Science and Software Engineering, Concordia University, Montréal,
Québec, Canada.
52. Ray, B., Hellendoorn, V., Godhane, S., Tu, Z., Bacchelli, A. and Devanbu, P., 2016, May.
On the" naturalness" of buggy code. In Proceedings of the 38th International Conference
on Software Engineering (pp. 428-439).
53. Ugurel, S., Krovetz, R. and Giles, C.L., 2002, July. What's the code? Automatic
classification of source code archives. In Proceedings of the eighth ACM SIGKDD
international conference on Knowledge discovery and data mining (pp. 632-638).
54. Saabith, A.S., Vinothraj, T. and Fareez, M., 2020. Popular python libraries and their
application domains. International Journal of Advance Engineering and Research
Development, 7(11).
55. Schmidt, J.A., Bourdage, J.S., Lukacik, E.R. and Dunlop, P.D., 2024. The Role of Time,
Skill Emphasis, and Verifiability in Job Applicants’ Self-Reported Skill and Experience.
Journal of Business and Psychology, 39(1), pp.67-82.
56. Sridhara, G., Sinha, V.S. and Mani, S., 2015, February. Naturalness of natural language
artifacts in software. In Proceedings of the 8th India Software Engineering Conference
(pp. 156-165).
57. Sundermeyer, M., Schlüter, R., & Ney, H. (2012, September). LSTM neural networks for
language modeling. In Interspeech (Vol. 2012, pp. 194-197).
58. Van Dam, J.K. and Zaytsev, V., 2016, March. Software language identification with
natural language classifiers. In 2016 IEEE 23rd international conference on software
analysis, evolution, and reengineering (SANER) (Vol. 1, pp. 624-628). IEEE.
59. Van Der Maaten, L., Postma, E., & Van den Herik, J. (2009). Dimensionality reduction: A
comparative review. Journal of Machine Learning Research, 10, 66-71.
60. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9(Nov), 2579-2605.
61. Vora, P., Khara, M., & Kelkar, K. (2017). Classification of tweets based on emotions
using word embedding and random forest classifiers. International Journal of Computer
Applications, 178(3), 1-7.
62. Wang, L. (Ed.). (2005). Support vector machines: Theory and applications (Vol. 177).
Springer Science & Business Media.
63. Watanobe, Y., Rahman, M.M., Amin, M.F.I. and Kabir, R., 2023. Identifying algorithm
in program code based on structural features using CNN classification model. Applied
Intelligence, 53(10), pp.12210-12236.
64. World Economic Forum, The Future of Jobs: Employment, Skills and Workforce
Strategy for Fourth Industrial Revolution, ser. Global Challenge Insight Report. Geneva,
Switzerland: World Economic Forum, Jan. 2016.