UNIVERSITY OF MACEDONIA
MASTER OF SCIENCE
APPLIED INFORMATICS
IDENTIFYING SKILLS USING MACHINE LEARNING TO ANALYZE SOURCE
CODE IN SOFTWARE REPOSITORIES
Dissertation of
Dafni Georgiou
mai21008
Thessaloniki, June 2024

IDENTIFYING SKILLS BY ANALYZING SOURCE CODE REPOSITORIES USING MACHINE LEARNING
Dafni Georgiou
BSc in Business Administration, University of Macedonia, 2015
MSc in Business Analytics, Aston University, 2017
Dissertation
submitted in partial fulfillment of the requirements for the
MASTER OF SCIENCE IN APPLIED INFORMATICS
Supervising Professor
Alexandros Chatzigeorgiou
Approved by the three-member examination committee on dd/mm/yyyy
Alexandros Chatzigeorgiou
Apostolos Ampatzoglou
Eftychios Protopapadakis
...................................
...................................
...................................
Georgiou Dafni
...................................

Abstract (in Greek)
Identifying the skills of software developers is particularly important for today's businesses, which rely on software development to remain competitive and efficient. Given the rapid growth of this field and the increasing demand for experienced developers, this study investigates the use of software repositories (GitHub repositories) as a primary data source for identifying and analyzing developers' skills. This involves examining factors such as the programming languages used, the types of libraries employed, and the specific operations within the code.
Furthermore, this research aims to establish a framework for skill identification through the analysis of various Python libraries across different operations. This framework includes methodologies for extracting and processing data from software or source code repositories, data cleansing procedures, and techniques for generating embeddings. It also incorporates methods for identifying and categorizing skills using machine learning algorithms. By developing this framework, the research aims to provide a standardized approach to skill identification that can be applied universally across various applications and software projects.
The main motivation for this research is to offer a more accurate and reliable method for identifying the skills of software developers. By leveraging the extensive data provided by source code repositories, the study aims to improve the efficiency and effectiveness of software development processes, ultimately contributing to the success of these projects.
Following a detailed review of the literature on the use of software repositories and the methodologies for skill identification, the third chapter describes in detail the extraction and processing of data from these repositories, the methodologies tested, and a brief analysis of the machine learning models. It also provides a detailed analysis of the word2vec methodology for generating embeddings and discusses data cleansing issues related to software repositories.
The main finding is that various machine learning methods, combined with embedding techniques such as word2vec, can effectively estimate programming skills, library types, and the usability of operations using data from GitHub repositories, provided that the source code meets certain prerequisites that ensure the diversity of its features for training and the quality of the tokens. The study also presents various limitations and challenges encountered during the project and offers suggestions for further research on annotated code datasets for skill identification and the training of ensemble methods.
Keywords: word2vec; Skills identification; Classification; Feature extraction; Tokenization; GitHub

Abstract
Identifying the skills of software developers is crucial for organizations that rely on
software development to remain competitive and efficient. With the rapidly evolving
software engineering field and the growing demand for skilled developers, this study
investigates the use of software repositories as a valuable data source for identifying and
analyzing developers' skills. This involves examining factors such as the programming
languages used, types of libraries employed, and specific operations within the source code.
Additionally, this research aims to establish a framework for skill identification by
analyzing various Python libraries across different operations. This framework includes
methodologies for extracting and processing data from software repositories or source code,
necessary data cleansing procedures, and techniques for generating embeddings.
Furthermore, it incorporates methods for identifying and categorizing skills based on the
analyzed data. By developing this framework, the research aims to provide a standardized
approach to skill identification that can be universally applied across various organizations
and software projects.
The motivation for this research is to offer a more accurate and reliable method for
identifying the skills of software developers. By leveraging the extensive data available from
software repositories, the study aims to enhance the efficiency and effectiveness of software
development processes, ultimately contributing to the success of software development
projects.
Following a thorough literature review on the use of software repositories and
methodologies for skill identification, Chapter 3 details the extraction and processing of
data from these repositories, the methodologies tested, and a brief analysis of machine
learning models. It also explains the word2vec technique for generating embeddings and
addresses data cleansing issues related to software repositories. The thesis concludes with
Chapter 4, presenting the main findings, and Chapter 5 discussing the limitations, challenges,
and suggestions for further research.
The main finding is that various machine learning methods, combined with
embedding techniques like word2vec, can effectively estimate programming skills, library
types, and operation usability using data from GitHub repositories, provided the source code
meets certain prerequisites to ensure feature diversity and token quality for training. The
study also presents various limitations and challenges encountered during the project and
offers suggestions for further research on annotated datasets of source code snippets for skill
recognition and the training of ensemble methods.
Keywords: word2vec; Skills identification; Classification; Feature extraction; Tokenization;
GitHub

Acknowledgements
I am very thankful to my supervisor, Alexandros Chatzigeorgiou, for his guidance and
support throughout this project and my MSc studies. I am also grateful to my family and
friends for their support and encouragement during this time. A special thanks to my brother,
Konstantinos, whose support made this project possible.

Contents
Abstract ................................................................................................................................................... 7
Acknowledgements ................................................................................................................................. 9
Contents ................................................................................................................................................ 10
List of Figures ....................................................................................................................................... 12
List of Tables ......................................................................................................................................... 12
1 Introduction ................................................................................................................................... 13
1.1 Research motives ........................................................................................................................ 14
1.2 Aims and Objectives ................................................................................................................... 15
1.3 Contribution ................................................................................................................................ 15
1.4 Thesis Structure .......................................................................................................................... 17
2 Literature Review ............................................................................................................................... 18
2.1 Source code as natural Language and NLP applications ............................................................ 18
2.2 Mining Software Repositories (MSR) ........................................................................................ 21
2.3 Classification of code snippets .................................................................................................... 24
2.3.1 Classification of code snippets for programming language recognition .............................. 24
2.3.2 Classification of code snippets for skills recognition .......................................................... 27
3 Methodology ................................................................................................................................. 30
3.1 Programming Language recognition from source code snippets ................................................ 31
3.1.1 Data Description .................................................................................................................. 31
3.1.2 Data Preprocessing & Dimensionality Reduction ................................................................ 37
3.1.3 Embeddings and Averaging ................................................................................................. 42
3.1.4 Machine Learning Classifiers .............................................................................................. 49
3.2 Identify library type from python source code snippets.............................................................. 59
3.2.1 Data collection & Pre-processing ........................................................................................ 60
3.2.2 Dimensionality reduction with t-sne .................................................................................... 63
3.2.3 Individual classifiers per class ............................................................................................. 67
3.2.4 Ensemble .............................................................................................................................. 68
3.3 Identify arrays in python source code snippets ........................................................................... 71
3.3.1 Training dataset composition ............................................................................ 72
3.3.2 Feature selection with weighted average .......................................................... 77
4 Results summary and conclusions ................................................................................................ 79
4.1 Results Discussion ..................................................................................................
4.1.1 Experiments for programming language recognition .......................................................... 79
4.1.2 Experiments for python library type recognition ................................................................. 83

4.1.3 Experiments for python arrays recognition .......................................................................... 86
4.2 Limitation .............................................................................................................. 87
4.3 Recommendations for future research .................................................................. 89
5 References ..................................................................................................................................... 91

List of Figures
Figure 1. Tech organizations skills shortage worldwide 2015-2023 ....................................... 13
Figure 2. Initial research Methodology Outline ....................................................................... 31
Figure 3. Example of repository structure ............................................................................... 32
Figure 4. Example of an input code file ................................................................................... 33
Figure 5. Example of the Requests python library repository ................................................. 34
Figure 6. Examples of input code file from Requests library .................................................. 35
Figure 7. Share of code files in 2nd Training dataset ................................................ 36
Figure 8. Share of code files in 1st Training dataset .................................................. 36
Figure 9. Graphical representation of the data preparation process .......................... 41
Figure 10. CBOW and Skip-gram architectures ...................................................................... 46
Figure 11. Adjusted Methodology for library identification ................................................... 59
Figure 12. Machine learning library code snippets example ................................................... 62
Figure 13. Machine Learning library code snippets example after tokenization ..................... 62
Figure 14. t-sne clusters for data visualization libraries .......................................................... 64
Figure 15. t-sne clusters for web development libraries .......................................................... 65
Figure 16. t-sne clusters for machine learning libraries ........................................................... 66
Figure 17. t-sne clusters for NLP libraries ............................................................................... 66
Figure 18. Ensemble method process ...................................................................................... 69
Figure 19. Aggregated probabilities results from classifiers ................................................... 70
Figure 20. Voting results of ensemble method ........................................................................ 71
Figure 21. Adjusted Methodology for arrays identification .................................................... 72
Figure 22. TF-IDF feature matrix for np ................................................................................. 78
Figure 23. TF-IDF feature matrix for arrays ........................................................................... 78
List of Tables
Table 1: Naturalness of software papers .................................................................................. 21
Table 2: Mining software repositories papers .......................................................................... 23
Table 3. Classification of source code programming language papers ................................... 26
Table 4. Classification of source code for skills identification papers .................................... 29
Table 5. Distribution of code snippets ..................................................................................... 36
Table 6. Embedding models details ......................................................................................... 44
Table 7. Embedding Matrix for each code file ........................................................................ 48
Table 8. Python libraries for training ......................................................................... 60
Table 9. Python Libraries classifiers training accuracy ........................................................... 68
Table 10. Results summary ...................................................................................................... 79
Table 11: Java files for experiments ........................................................................................ 80
Table 12: python files for experiments .................................................................................... 81
Table 13: Experiments Results of identifying the library's language ...................................... 82
Table 14: Ensemble method results ......................................................................................... 83

1 Introduction
In recent years, particularly in the aftermath of the COVID-19 pandemic, a
pronounced skills shortage has become increasingly apparent on a global scale, with a notable
impact felt across various industries, particularly the technology sector (Causa et al., 2022).
Over the past six years, a significant majority of surveyed international organizations have
encountered persistent deficiencies in skills, significantly hampering their advancement. The
decrease in skill shortages witnessed among organizations in 2020 was largely attributable to
the disruptive effects of the coronavirus (COVID-19) pandemic, which severely constrained
companies' hiring capabilities. However, despite these mitigating factors, by the year 2023, a
substantial 54% of organizations continued to struggle with a shortage of technical expertise [1].
The term "labor shortage" denotes an insufficient workforce, while "skill shortage"
indicates a lack of specific skills required for success. It's crucial to precisely define the skills
in question. Despite common assumptions, addressing labor shortages in the software
industry also involves non-technical skills. In the modern era, "soft skills" like self-direction,
problem-solving, and communication are increasingly vital, particularly in software
(Hyrynsalmi, Rantanen, and Hyrynsalmi, 2021). Research by the World
Economic Forum emphasizes a growing need for both technical and social skills, alongside
cognitive abilities, in information and communication technology fields. Job families such as
database and network professionals, electrotechnology engineers, software developers, and
analysts face recruitment challenges presently and in the future (World Economic Forum, 2016).

[1] https://www.statista.com/statistics/1269776/worldwide-organizations-talent-shortage-skills-tech/

Figure 1. Tech organizations skills shortage worldwide 2015-2023
The evolving demands of modern software development necessitate a deep and
diverse proficiency across a wide range of technologies, tools, and practices, posing
considerable challenges for organizations striving to maintain competitive development
teams amidst rapid technological advancements and the continuous introduction of new
programming languages, frameworks, and libraries. Managing and assessing the evolving
skills of development teams is essential for project managers, who must allocate tasks
effectively and ensure ongoing capability to support legacy and emerging technologies within
their codebases.
1.1 Research motives
Identifying the skills of software developers is essential for organizations relying on
software development to maintain competitiveness and efficiency. The software engineering
field is evolving rapidly, with an increasing demand for skilled developers. This study aims to
explore the utilization of software repositories as a valuable data source for identifying and
analyzing software developers' skills. This includes examining factors such as the
programming languages employed, types of libraries utilized, and specific operations within
the source code.
Currently, skill identification is largely based on self-reported information, which can
be biased and unreliable. In contrast, software repositories provide a rich source of data that
can be used to analyse the activities of software developers, such as code contributions, bug
fixes, and code reviews (Schmidt et al., 2024). By analysing these activities, we can identify
the skills and expertise of software developers objectively and accurately. In the current study,
we focus on the source code of repositories rather than contributions or fixes.
Furthermore, this study aims to establish a framework for skill identification by
analysing diverse Python libraries across various operations. This framework will encompass
methodologies for extracting and processing data from software repositories or source code,
necessary data cleansing procedures, and techniques for generating embeddings. Moreover, it
will incorporate methods for identifying and categorizing skills based on the analysed data.
By developing such a framework, this research seeks to offer a standardized approach to skill
identification that can be universally applied across different organizations and software
projects.

While significant research has been conducted in language identification from source
code repositories using embeddings, the identification of the specific Python library types
used and the operations applied within the code remains an underexplored area. Current
studies have primarily focused on general language identification tasks, often overlooking the
finer distinctions required to categorize code based on the libraries utilized and the specific
operations performed. This gap in research is particularly critical as software development
increasingly relies on a wide array of Python libraries, each serving distinct functions and
requiring specific expertise.
Overall, the motivation behind this research is to provide a more accurate and reliable
method for identifying the skills of software developers. By leveraging the rich source of data
provided by software repositories, we can improve the efficiency and effectiveness of
software development processes, and ultimately contribute to the success of software
development projects.
1.2 Aims and Objectives
The aim of this project is twofold: Firstly, to represent source code as vectors and
develop a machine learning classification model capable of identifying its primary
characteristics and architecture. Secondly, to extract information on libraries, frameworks,
and algorithms knowledge for further analysis.
In addition to these main aims, the project seeks to achieve the following objectives:
- Obtain and compile a dataset of source code that includes pertinent features essential for the research, and implement suitable pre-processing techniques to extract meaningful tokens.
- Utilize word2vec to generate valuable embeddings for predictions using source code tokens, leveraging its capabilities effectively.
- Extract meaningful features from the gathered data, and employ feature importance analysis and grid search methodologies to identify the optimal machine learning model for classification.
- Train the machine learning model to accurately identify software development methodologies, libraries, and functions based on the extracted features.
1.3 Contribution
To address the aforementioned challenges, this study employs standard Machine
Learning techniques adapted for identifying various programming skills, complemented by
the use of word2vec for embedding source code. Additionally, we introduce novel
methodologies and frameworks for processing source code snippets aimed at detecting
programming skills. We also conduct comprehensive analyses across different programming
libraries and operations to validate our approaches. Specifically, our focus is on Python
language libraries, an area relatively underexplored in the literature on skills identification
despite its recent growth and recognition.
Existing research often emphasizes quantitative metrics such as the count of bug
fixes, commits, lines of code, or mentions in README files for language and library
knowledge. In contrast, our study leverages GitHub repositories of prominent Python
libraries to construct a robust dataset for tokenization, encompassing a broad spectrum of
programming skills and operations. The novelty of our work lies in pioneering multiclass
Python skills detection and developing a framework for source code tokenization, feature
extraction, and vectorization.
The contributions of this study to the scientific community are structured as follows:
[C1] Introduction of an ensemble approach for Python skill identification: We
implemented an ensemble framework of classifiers combined with dimensionality reduction
techniques and a voting process for multiclass Python library type detection. Our approach
indicated sufficient accuracy with the limited sample data available, although improvements
are needed. These classifiers serve as a preliminary framework capable not only of
characterizing library types but also identifying specific operations within the source code.
The novelty of this approach is significant due to the limited existing work on multiclass
skills detection, as detailed in Chapter 3.
[C2] Evaluation of machine learning models for identifying Python data structures: Our
study aims to explore the classification capabilities of specific data structures such as arrays,
which are widely used in various programming and software development applications. We
intend to illuminate the features that contribute to identifying specific operations within a
corpus of source code, a topic that has been underexplored in previous research. Even though
our preliminary results may not be encouraging, the knowledge gained at this stage is
significant for the evolution of skills identification frameworks in data structures, as
discussed further in Chapter 3.
[C3] Development of a comprehensive framework for Python skill identification with
broad applicability: As we construct innovative ensemble classifiers, we simultaneously
explore uncharted territory by rationalizing the predictions made by these classifiers. Our
approach involves experimenting with and comparing the performance of various classifiers
for programming language predictions across different libraries. Throughout this process, we
employ diverse data processing techniques such as comment elimination, stop words
extraction, feature selection, dimensionality reduction, and weighted tokenization. These
techniques provide powerful insights into advancing source code embeddings and tokens.
However, the prevalence of overfitting in machine learning applications suggests the need for
new paths and further research.
1.4 Thesis Structure
The structure of this thesis follows a systematic approach. Chapter 2 begins with a
comprehensive literature review, covering previous research on mining repositories to
identify features within source code and discussing the application of natural language
processing (NLP) methods to enhance the naturalness of software. It also explores current
advancements in code snippet classification and its practical applications. Chapter 3 is
structured into three main parts: the first part outlines the methodology for programming
language identification. The second part of Chapter 3 details the experimental setup for
recognizing a Python library type from code snippets. The third part discusses the application
to NumPy array identification in a source code corpus. Chapter 4 provides a detailed
presentation and discussion of the prediction results, and Chapter 5 highlights the strengths and
limitations of the research findings and discusses avenues for future research in this domain.

2 Literature Review
2.1 Source code as natural Language and NLP applications
Globally, there exist hundreds of languages, each characterized by varying levels of
difficulty. Nevertheless, regardless of complexity, all languages adhere to specific syntax
rules, orthography, and expressions that are fundamental for effective communication.
Similarly, software engineering code, crafted by proficient individuals fluent in a
programming language, mirrors the art of writing in natural language. Software engineers use
programming languages to create code following specific rules and guidelines, akin to skilled
writers, resulting in code that resembles natural language writing.
The authors of "On the Naturalness of Software" (Hindle et al., 2016) tackle the
challenge of creating tools to help engineers work with large collections of source code. Their
goal is to make programming easier and ensure that programs are correct. They use a method
called n-gram models to analyse the structure of code. The study is set in the context of tasks
like code completion, summarization, and error detection. By mining patterns in code, such
as API usage and common errors, the researchers aim to guide developers in making changes
to their programs. The results show that using these models leads to 33%-67% more accurate
suggestions and reduces the number of keystrokes needed. The study makes unique
contributions by supporting the idea that programming languages are simpler and more
repetitive than previously thought. It demonstrates the effectiveness of statistical language
models in capturing patterns in code, which can be used to improve programming tools like
code suggestion features in IDEs. Overall, it highlights the potential of statistical language
models to enhance software engineering practices.
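To make the n-gram idea concrete, the short sketch below (a toy illustration, not the setup of Hindle et al., who used smoothed n-gram models over large corpora) builds a bigram model from whitespace-separated code tokens and uses it to suggest the most likely next token; the miniature corpus and tokenization are assumptions for demonstration.

    # Toy bigram language model over code tokens (illustration only).
    from collections import Counter, defaultdict

    corpus = [
        "for ( int i = 0 ; i < n ; i ++ )",
        "for ( int j = 0 ; j < m ; j ++ )",
        "if ( x == null ) return ;",
    ]

    bigrams = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()          # simplistic tokenization by whitespace
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1

    def suggest(prev_token):
        """Suggest the most frequent token observed after prev_token."""
        counts = bigrams.get(prev_token)
        return counts.most_common(1)[0][0] if counts else None

    print(suggest("("))   # -> 'int', the most frequent successor of '(' in this toy corpus

A code-completion tool built on such a model would rank the candidate next tokens by their conditional frequency, which is the intuition behind the reported reduction in keystrokes.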
The aforementioned research has inspired subsequent work by other scientists. For
instance, Alon et al. (2019) introduced code2vec, an approach designed to learn
code embeddings for programming tasks. Similarly, Allamanis and Sutton (2014) explored
the mining of idioms from source code.
Alon et al. (2019) presented a novel framework for predicting program properties
using neural networks. Their core innovation is a neural network that learns code
embeddings, which are continuous distributed vector representations for code. These
embeddings enable effective modeling of the relationship between code snippets and their
labels. The architecture leverages the structured nature of source code, aggregating multiple
syntactic paths into a single vector. This capability is crucial for applying deep learning to
programming languages, akin to the impact of word embeddings in natural language
processing (NLP).
In earlier applications, Allamanis and Sutton (2014) introduced the first method for
automatically mining code idioms from existing code corpora. Their approach does not
simply search for frequent syntactic patterns but instead identifies patterns that enhance the
explanatory power of a probabilistic model of the source code. The method employs a
nonparametric Bayesian tree substitution grammar, which has been effective in natural
language processing but had not previously been applied to source code.
The concept of the naturalness of software was revisited by Rahman et al. (2019),
who explored additional research questions on the repetitiveness and predictability of source
code:
- Replication: They replicated the work of Hindle et al. to ensure dataset diversity and to test the "naturalness" hypothesis across C#, C, JavaScript, Python, Ruby, and Scala.
- SyntaxTokens Removal: They investigated code predictability after the removal of SyntaxTokens such as separators, keywords, and operators, similar to the removal of punctuation and stopwords in NLP.
- API Usages: They analyzed the predictability of Java API token usage, noting that API code tends to exhibit more uniformity across projects compared to general program code.
- Graph-based Object Usage Models (Groum): They examined the repetitiveness of Groums extracted from Java programs, contrasting them with n-grams to assess non-sequential code processing.
Their findings confirm that code is indeed repetitive and predictable, though not as
extensively as Hindle et al. (2016) previously suggested. The repetitive nature of
programming language syntax, with SyntaxTokens making up 59% of Java tokens,
contributes to this perception. The study also indicates that it is essential for researchers to
ensure that corpora are properly tuned and cleaned for prediction tasks to avoid distractions
from more significant recommendations. In terms of API usage, there is sufficient repetition
for accurate recommendations, and future research should integrate static analysis with
statistical models for improved predictions. Additionally, various code representations,
particularly graph representations, exhibit different degrees of repetition and abstraction,
enhancing the effectiveness of recommender tools by reducing noise and facilitating more
complex, non-sequential recommendations.

Buratti et al. (2020) investigated the software naturalness hypothesis using
transformer-based language models to analyze syntax and semantics in C language source
code. They introduced a novel sequence labeling task to assess the language model's
understanding of abstract syntax trees (AST) and evaluated its ability to identify
vulnerabilities. Their study highlighted the challenges of data sparsity and the importance of
appropriate tokenization and pre-training objectives, demonstrating that character-based
tokenization with whole word masking (WWM) effectively addresses these issues. They
showed that their language model could learn AST features from raw source code and
perform better than graph-based methods, offering valuable insights for enhancing
productivity during software development. Their contributions include applying LMs to
source code analysis, addressing OOV and rare words issues, and outperforming compiler-dependent methods.
Several researchers have investigated the naturalness of code across diverse
applications:
- Gholamian and Ward (2021) explored the localness of Software Logs with ANALOG, an anomaly detection tool leveraging NLP features, which outperforms prior methods. Their work integrates deep learning language models to benchmark against existing DL-based anomaly detection methods.
- Sridhara et al. (2015) developed next-word prediction models for commit comments and source code comments, achieving accuracies of 70% to 93% and 56% to 78%, respectively. While predicting bug reports posed challenges, the study identified potential for tailored next-word prediction tools targeting specific user contexts within bug reports.
- Ray et al. (2016) investigated buggy code using entropy scores from statistical language models, showing that unnatural code correlates more with bug-fix commits. They found that repaired code tends to become more natural. The application of entropy scores to defect prediction tasks, adjusted for syntactic variations, demonstrated comparable cost-effectiveness to traditional static bug-finders like PMD and FindBugs, with deterministic ordering improving efficiency.
- Hellendoorn et al. (2018) explored proofs, noting the aid of automated sub-theorem proving in Coq and features like code completion in programming languages. Analyzing proofs in Coq's Gallina language and HOL Light kernel-level traces, they found proofs exhibited high predictability, surpassing natural languages. This discovery opens avenues for naturalness-driven tools to streamline proof-writing processes.
Table 1: Naturalness of software papers
Paper | Method used | Dataset
Allamanis and Sutton (2014) | Nonparametric Bayesian | Stack Overflow posts
Sridhara et al. (2015) | N-gram language model | Stack Overflow posts
Ray et al. (2016) | Language models | 10 OSS Java projects from GitHub and Apache Software Foundation
Hindle et al. (2016) | Language models | Java projects and Ubuntu applications
Hellendoorn et al. (2018) | Language models | Coq, HOL
Alon, Zilberstein, Levy, and Yahav (2019) | AST | Java methods
Rahman, Palani, and Rigby (2019) | Language models | 134 open source projects on GitHub, Gutenberg corpus, 200,000 posts from StackOverflow
Buratti et al. (2020) | AST | 100 open source repositories
Gholamian and Ward (2021) | Language models | 8 log files from a wide range of computing systems and 2 English corpora
2.2 Mining Software Repositories (MSR)
Mining Software Repositories (MSR) is a research field focused on analyzing vast
datasets from software projects. These repositories contain extensive data such as source
code, bug reports, and developer communications. Researchers employ techniques from data
mining and machine learning to uncover patterns and insights crucial for understanding
software development processes.
MSR addresses fundamental questions about software quality, developer productivity,
and the impact of code changes over time. By applying statistical analyses and advanced
algorithms, MSR can predict software issues and improve development practices. This
research plays a pivotal role in enhancing software reliability and efficiency.
Our study within this chapter emphasizes extracting skills from repositories. This
involves identifying and analysing patterns in developers' contributions to understand the
expertise and competencies prevalent in software development teams. By mining repositories
for skills extraction, we aim to contribute to the broader goal of improving workforce
planning and enhancing team dynamics in software engineering.

GHTorrent (Gousios, 2013), is a widely used MSR (Mining Software Repositories)
application that offers a scalable infrastructure for gathering and analyzing data from GitHub.
Its primary objective is to comprehensively capture and provide access to various aspects of
GitHub activities, encompassing repositories, users, commits, pull requests, issues, and more.
Researchers and developers utilize GHTorrent's maintained dataset for diverse purposes,
including mining software repositories, exploring developer behavior, studying project
evolution, and conducting empirical studies in software engineering and social coding
practices. This dataset serves as a crucial resource for uncovering trends, patterns, and
dynamics within the GitHub ecosystem.
In a subsequent study the following year, Kalliamvakou et al. (2014) collaborated
with other researchers to investigate the promises and perils of mining GitHub for software
engineering research. The authors also conducted a survey and interviews with GitHub users,
aiming to explore how GitHub supports collaboration and its impact on development
processes. The study, involving 240 responses from active GitHub users, explored critical
challenges when analyzing GitHub repositories, including issues like distinguishing between
base repositories and forks, uneven commit distribution across projects, and the prevalence of
inactive repositories. It also addresses nuances in detecting merged pull requests and the
diverse utilization of GitHub beyond software development.
Other applications of repository mining address bug handling in combination with version
control. BuCo Reporter was developed to easily combine Version Control and Bug
Tracking system data, featuring VCS, BTS, and Reporting modules, and is available
for download with usage details. It provides comprehensive reports on metrics such as
commit distribution, average lines per commit, and bug correction time, offering an
extensible framework for seamless information retrieval. The VCS module of BuCo connects
to source code repositories to extract and manage project data, offering features such as Delta
extraction, local repository mapping, and source code retrieval. The BTS module connects to
bug tracking repositories to extract bug-related data, create queries, manage local storage, and
ensure interoperability among different Bug Tracking systems. (Ligu, Chaikalis, &
Chatzigeorgiou, 2013)
Cosentino et al. (2016), summarizing the findings from 93 papers discussing mining
GitHub repositories, describe the empirical methods applied, the datasets used and the
limitations reported. Based on the review, 75.3% of the works rely on direct observation of
GitHub metadata, 14% use surveys and interviews, and 10.7% combine methods. Only 5.4%
applied longitudinal studies. Also, 60.2% of the works use non-probability sampling, 31.2%
use probability sampling, and 3.2% use stratified random sampling, with 8.6% not using
sampling techniques at all. In total, 50.5% of works reported dataset size in terms of projects,
26.9% in terms of users, and 22.6% provided both dimensions. Data collection methods
include curated datasets (with GHTorrent being most popular at 34.4%), GitHub API
(39.8%), GitHub Search API (12.8%), and mixed methods (5.4%). Only 31.2% of works
provide dataset links for replication, with 68.8% not providing replication links despite
explaining data collection processes. Reported limitations include issues with empirical
methods (64.3%), generalization of results (42.9%), data collection (39.3%), and
dataset/third-party services (6.5%). (Cosentino, Luis and Cabot, 2016)
Table 2: Mining software repositories papers
Paper | Method used | Dataset
Gousios (2013) | REST API | GitHub
Ligu, Chaikalis, & Chatzigeorgiou (2013) | SVNKit for Subversion, JIRA SOAP for bug tracking, Apache POI for Excel export, Apache XML-RPC for Bugzilla communication, and JFreeChart for reports | Commons IO project
Kalliamvakou et al. (2014) | Quantitative analysis | 240 GitHub users & 434 GitHub repositories
Cosentino, Luis and Cabot (2016) | Literature review | Papers from digital libraries

2.3 Classification of code snippets
The classification of code snippets represents a pivotal area of research at the
intersection of software engineering and machine learning, focusing on the automation of
understanding and organizing code. With the exponential growth of code available in
repositories, the demand for effective tools to categorize and comprehend these snippets has
become increasingly significant. This capability not only facilitates automated documentation
and rapid development of source code but also enhances developers' productivity and reduces
development time.
Over the past decade, Machine Learning (ML) and Natural Language Processing
(NLP) have been employed to analyze source code (Ugurel, Krovetz, & Giles, 2002). The
classification of programming languages has been extensively studied (Van Dam & Zaytsev,
2016) using various techniques, including Neural Networks (Gilda, 2017), Multinomial
Naïve Bayes (Alreshedy et al., 2018), Convolutional Neural Networks (CNN) (Ohashi &
Watanobe, 2019), and Neural Text Classification (LeClair, Eberhart, & McMillan, 2018).
This section of the literature review aims to further analyze current research in the
field of code classification, with a specific focus on empirical studies of code snippets. The
objective is to provide a detailed discussion of the progress made thus far and to underscore
the critical importance of ongoing research in this domain.
2.3.1 Classification of code snippets for programming language recognition
The automated identification of programming languages from code snippets is pivotal
in software engineering, impacting tasks such as syntax highlighting, code organization, and
repository management. Machine learning algorithms have been instrumental in this domain,
enabling accurate classification based on programming language usage.
Early research by Ugurel et al. (2002) pioneered automated categorization techniques
using natural language data extracted from code comments and documentation. Their
approach involved feature extraction from lexical components and employed Support Vector
Machines (SVMs) to classify programming languages, demonstrating initial feasibility
despite challenges related to data variability and feature selection. Their study demonstrates
that programming languages and application categories can be accurately classified, yet this
depends on factors such as data variability, application diversity, feature selection techniques,
information retrieval methods, and programming language characteristics. While their results
are promising, improvements are possible, including optimizing feature vector selection,
transitioning from binary to term frequency representations, and enriching classification with
more syntactic features and language-specific attributes.
Building upon this foundation, Van Dam and Zaytsev (2016) expanded the scope of
Software Language Identification (SLI) by evaluating a wide array of classifiers across a
diverse dataset encompassing multiple programming languages. Their findings underscored
SLI's critical role in enhancing Integrated Development Environment (IDE) functionalities
and supporting reverse engineering tasks, emphasizing the practical significance of accurate
language classification methods. Their study involved testing 348 classifiers, derived from
lightweight natural language techniques, on a dataset spanning 20 different languages. Each
language subset encompassed between 109 to 150 test projects and 146 to 272 training
projects. Initial findings revealed varying accuracy levels with smaller training sets, yet
certain methods achieved high accuracy rates of 97.5% or higher with larger training sets.
These results highlight the promising potential of natural language methods in SLI,
particularly in the domain of software language reverse engineering.
The work of Gilda (2017) employed real-time classification of code snippets by
programming language, utilizing token-based feature extraction after preprocessing to
eliminate extraneous characteristics. Tokenization was achieved through regex expressions,
and Torch was employed to generate word embeddings. Convolutional Neural Networks
(CNNs) were pivotal for prediction, integrating convolutional layers with nonlinear activation
functions, filters, max-pooling, and Rectified Linear Units (ReLUs) to process input data and
enhance training efficiency. These convolutional layers produced high-level features used to
classify code snippets into programming language categories using a softmax output layer,
with dropout regularization applied to mitigate overfitting.
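As a simplified illustration of regex-based code tokenization (the exact patterns and embedding setup used by Gilda are not reproduced here; the pattern below is an assumption for demonstration), a snippet can be split into identifier, number, and operator tokens before embeddings are generated:

    # Illustrative regex tokenizer for code snippets; the pattern is an assumption,
    # not the one used in the cited work.
    import re

    TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[^\s\w]")

    def tokenize(snippet):
        """Split a code snippet into identifier, number, and punctuation/operator tokens."""
        return TOKEN_PATTERN.findall(snippet)

    print(tokenize("for (int i = 0; i < 10; i++) sum += i;"))
    # ['for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '10', ';', 'i', '+', '+', ')',
    #  'sum', '+', '=', 'i', ';']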
In contrast, Alreshedy et al. (2018) introduced the Source Code Classification (SCC)
tool, leveraging Multinomial Naive Bayes (MNB) for the accurate classification of Stack
Overflow code snippets. SCC demonstrated superior performance over the Programming
Languages Identification (PLI) tool by effectively distinguishing between closely related
programming languages and their respective versions. MNB was chosen for its simplicity,
computational speed, and scalability. Stack Overflow snippets were preprocessed using the
Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which identified the ten
most frequent words in each snippet. Model performance was optimized through GridSearchCV
to fine-tune the alpha parameter for MNB, with rigorous parameter selection based
on 10-fold cross-validation techniques.
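A minimal sketch of this kind of pipeline is shown below, assuming scikit-learn is available; the toy snippets and the alpha grid are illustrative assumptions and do not reproduce the SCC tool's actual configuration.

    # Sketch of an MNB + TF-IDF snippet classifier tuned with GridSearchCV.
    # The example snippets and parameter grid are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    snippets = [
        "def add(a, b):\n    return a + b",                 # Python
        "public int add(int a, int b) { return a + b; }",   # Java
        "x <- c(1, 2, 3); mean(x)",                          # R
    ] * 10                     # repeated so that 10-fold cross-validation is possible
    labels = ["python", "java", "r"] * 10

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(token_pattern=r"\S+")),   # keep code punctuation inside tokens
        ("mnb", MultinomialNB()),
    ])

    # Tune the MNB smoothing parameter alpha with 10-fold cross-validation.
    grid = GridSearchCV(pipeline, {"mnb__alpha": [0.01, 0.1, 0.5, 1.0]}, cv=10)
    grid.fit(snippets, labels)
    print(grid.best_params_, round(grid.best_score_, 3))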

Baquero et al. (2017) and Dietrich et al. (2019) explored the complexities of language
recognition through analysis of textual and source code data from platforms like Stack
Overflow. Baquero et al. utilized word embeddings and SVMs to classify programming
languages based on semantic similarities, while Dietrich et al. compared human-generated
metadata with automated classification methods, advocating for hybrid approaches to
enhance accuracy across diverse code repositories.
Innovative approaches continue to evolve the landscape, as demonstrated by Hong et
al. (2019) who proposed using ResNet in image classification to directly detect programming
languages from code snippets. Their study validated ResNet's effectiveness with over 90%
precision in identifying languages from snippet-type data sourced from SOTorrent and
function-type data parsed from GitHub repositories, showcasing robust performance across
varied datasets and highlighting the potential for image-based methodologies in language
identification tasks.
In summary, these studies collectively advance the field of programming language
identification through innovative methodologies and rigorous evaluations, paving the way for
more efficient and accurate solutions in software engineering practices. The integration of
machine learning techniques with deep insights into language characteristics and dataset
variability continues to drive advancements towards enhanced automation and precision in
language identification tasks across diverse software development environments.
Table 3. Classification of source code programming language papers
Paper | Method used | Dataset | Processing
Ugurel, Krovetz, and Giles (2002) | SVM | Ibiblio Linux Archive, Sourceforge, Planet Source Code, Freecode, and web pages with code snippets | Feature extraction, vectorizing
Van Dam and Zaytsev (2016) | Naïve Bayes, n-grams with Good-Turing discounting, n-grams with Kneser-Ney discounting, n-grams with Witten-Bell discounting, Skip-gram language model | GitHub | Varying parameters resulted in 348 different methods
Gilda (2017) | Convolutional Neural Networks | GitHub | Tokenization, word embeddings
Baquero et al. (2017) | SVM | Stack Overflow posts | HDF5, Feature extraction, Word2Vec
Alreshedy et al. (2018) | Multinomial Naive Bayes | Stack Overflow posts | TF-IDF, GridSearchCV
Dietrich et al. (2019) | t-SNE algorithm | SOTorrent data set | PL-tags
Hong et al. (2019) | ResNet CNN | SOTorrent data set | 12-point Courier font, GuessLang
2.3.2 Classification of code snippets for skills recognition
While programming language recognition from code snippets has been extensively
explored, the recognition of skills or algorithms through source code has not received as
much attention. Recently, however, this area has begun to attract more research interest,
underscoring its growing importance in the field. In the previous section, we detailed the
interest researchers have shown in mining repositories. In this section, we aim to conduct a
more in-depth analysis of the available studies that use code snippets for similar or related
tasks.
Numerous studies explore tools and applications aimed at identifying software
developers' skills and aligning them with job requirements. Tools such as CVExplorer
leverage README files, GitHub language usage, and check-in details (like commit
messages and file changes) to aggregate technical skills and recommend candidates (Greene
and Fischer, 2016). CPDScorer introduces a method using a set covering algorithm to extract
specific programming terms, assigning them to answers and projects for quality evaluation
across diverse programming skills (Huang et al., 2016). Hauff and Gousios (2015) proposed a
pipeline to automatically match GitHub profiles with job advertisements, utilizing README
files and a large-scale ontology to map job requirements and developer skills, highlighting
significant overlaps in covered concepts.
Expertise identification typically focuses on aggregating edit counts, whereas da Silva
et al. (2015) propose a method considering both granularity and time, distinguishing local
from global expertise in artifacts. Oliveira et al. (2019) evaluated a strategy using metrics like
Commits, Imports, and Lines of Code to identify library experts, achieving 88% precision in
identifying experts across 16,000 software systems and 9 libraries. Notably, these approaches
do not directly analyze source code but rather focus on edit frequency, code length, library
imports, or README contents. In contrast, our study aims to identify coding skills by
processing source code akin to natural language.

Recent advancements in identifying algorithms within source code as indicative of
software developers' skills have leveraged diverse methodologies and technologies. For
instance, Bui et al. (2018) utilized data from GitHub to categorize six algorithms (mergesort,
bubblesort, quicksort, linkedlist, breadth-first search, knapsack) in both C++ and Java
languages. They applied bilateral neural networks (BiNNs) adapted from natural language
processing to classify code snippets, achieving over 80% accuracy in cross-language program
classification by integrating BiNNs with tree-based convolutional neural networks
(TBCNNs) to encode abstract syntax trees (ASTs).
Another notable advancement is the development of code2vec by Alon et al. (2019),
which introduced a neural network architecture capable of learning semantic labels for code
snippets akin to word embeddings in natural language processing. This model effectively
aggregates syntactic paths into vectors, facilitating tasks like predicting method names from
code bodies, crucial for code comprehension and maintenance.
Despite these innovations, Kang, et al. (2019) examined code2vec's generalizability
across various software engineering tasks such as code comment generation, authorship
identification, and clone detection, finding mixed results compared to GloVe embeddings.
Ohashi and Watanobe (2019) applied convolutional neural networks (CNNs) to classify
source code based on solvable problem types, utilizing tokenized preprocessing to extract
structural features while disregarding code block sequence, achieving high accuracy.
Further advancing this approach, Watanobe et al. (2023) developed CNN models to
identify algorithms in program codes through innovative preprocessing that includes filtering
user-defined tokens and converting structural features into one-hot binary matrices. These
models can support tasks such as code review, bug detection, and code refactoring in
software engineering applications, underscoring their potential as integral components in
machine learning models for analyzing program code structures.
In conclusion, while current research demonstrates promising advancements in using
embeddings for analyzing source code and identifying developer skills, there remains ample
opportunity for further exploration and refinement. Future research should aim to enhance the
robustness and generalizability of embeddings across different programming languages,
paradigms, and software domains. Additionally, efforts should focus on developing
embeddings that capture deeper semantic meanings and context-specific information within
code, thereby enabling more accurate assessments of developer proficiency and
specialization. By addressing these challenges, researchers can pave the way for more
sophisticated tools and methodologies that effectively leverage embeddings to support
diverse software engineering tasks, ultimately advancing the field of skills identification in
programming.
Table 4. Classification of source code for skills identification papers
Paper | Method used | Dataset | Processing
Hauff and Gousios (2015) | Vector space model | Job advert linked data & GitHub | TF-IDF
Silva et al. (2015) | Fine-grained analysis, analysis based on timeframes, GPU processing | Apache Derby | 
Huang et al. (2016) | M5P for CM, Linear Regression for CM | StackOverflow & GitHub | Ability scoring, Feature extraction
Greene and Fischer (2016) | ConceptCloud | GitHub | 
Bui et al. (2018) | Bilateral neural networks | GitHub | AST2vec
Oliveira, J., Viggiato, M., and Figueiredo, E. (2019) | Qualitative and quantitative data analysis | GitHub | 
Alon, Zilberstein, Levy, and Yahav (2019) | Neural network | Java methods | AST
Kang, Bissyandé, and Lo (2019) | LSTM neural network | GitHub | GloVe, code2vec, TF-IDF
Ohashi and Watanobe (2019) | Convolutional neural networks | Aizu Online Judge | Feature extraction, Padding, Tokenize
Watanobe et al. (2023) | Convolutional neural networks | Aizu Online Judge | Feature extraction, Tokenize

3 Methodology
This chapter details the research methodology, providing a comprehensive overview
of the specific steps and experiments conducted. To ensure an organized research process and
facilitate independent monitoring of each activity, we structured the workflow into distinct
stages. Our research focuses on three tasks:
- Replicating the programming language recognition classifier widely discussed in previous studies to identify challenges and explore new approaches.
- Utilizing the knowledge gained from the first task to develop a more specialized classifier that recognizes the type of Python library from code snippets, which was not covered in the literature review.
- Narrowing the focus further to identify NumPy arrays within the source code corpus.
Throughout these tasks, a consistent framework for data pre-processing, tokenization,
and embeddings generation will be applied. This framework will be refined appropriately
based on the training results and the specific challenges of each task. Figure 2 illustrates the
entire methodological cycle for the first task, which will be generally followed in the
subsequent tasks. The stages are: 1. Data collection, 2. Preparation & Dimensionality
reduction, 3. Embeddings & Averaging, 4. Classifier Experimentation, and 5. Results. The
proposed approach is designed to be flexible and incremental, breaking down the study into
discrete components while integrating them into a cohesive framework for robust analysis.

Figure 2. Initial research Methodology Outline
3.1 Programming Language recognition from source code snippets
3.1.1 Data Description
The primary dataset for this research consists of code snippets written in Python and
Java, sourced from GitHub repositories. Our objective was to gather well-structured data
representing a wide range of code snippets and programming methods. As a result, to create a
robust dataset, we initially focused on collecting code snippets from The Algorithms GitHub
repository (https://github.com/TheAlgorithms), a well-known repository that hosts a diverse collection of algorithm
implementations in multiple programming languages, including Java and Python. This
repository offered a broad spectrum of code snippets, ranging from simple operations to
complex algorithms, providing a comprehensive foundation for training the initial model.
In greater detail, the repository is a collaborative initiative focused on implementing
algorithms across a diverse array of programming languages, encompassing more than 30
different languages. Serving as a centralized hub, it facilitates global developer collaboration
for the ongoing enhancement and upkeep of algorithmic implementations. The repository
hosts an extensive collection of algorithms, spanning sorting, searching, graph algorithms,
dynamic programming, and beyond. Repository structures vary depending on language type
and contributions from developers. For instance, the Python repository is organized into
subject folders, with subfolders dedicated to specific algorithm types.
Figure 3. Example of repository structure
Each algorithm in the repository is well documented, with explanations of how it works and its implementation details. This documentation covers the theory behind the algorithm,
highlighting both its strengths and limitations. Additionally, it includes clear explanations of
the logical reasoning behind the algorithm's functions and the sequence of steps involved.
Moreover, numerical examples are frequently provided to illustrate how the algorithm
behaves or to clarify its processing steps. These examples help users understand how the
algorithm works in real-world scenarios.

Figure 4. Example of an input code file
As the training process progressed, we observed signs of overfitting in the algorithm.
This overfitting indicated that the model was learning the training data too well, reducing its
ability to generalize to new or unseen data. To address this issue, we needed to enrich our
dataset by incorporating additional input files. To expand the diversity of the dataset, we
sourced more code snippets from well-known repositories, ensuring representation from both
Java and Python.
For Java, we included code from repositories like Google Guava (https://github.com/google/guava), Apache Spark (https://github.com/apache/spark), and Maven (https://github.com/apache/maven). For Python, we added snippets from frameworks and libraries such as Django (https://github.com/django/django), Flask (https://github.com/pallets/flask), and the Requests library (https://github.com/psf/requests). These repositories were selected based on two key factors:
the volume of code they contain and the quality of the code. For example, the Requests
Python library repository hosts the codebase that simplifies the process of making HTTP
requests and interacting with web APIs in Python. The repository contains the source code
for the Requests library, along with documentation, issue tracking, and contributions from the
developers’ community. By enriching the dataset with code from these reputable sources, we
aim to reduce overfitting and improve the robustness of the model.
Figure 5. Example of the Requests python library repository
Figure 6. Examples of input code file from Requests library
All data was stored locally and manually retrieved for use in the model during data
preparation. Table 5 provides additional details about the volume of data for each coding language, marking the separation between the first training dataset (containing code snippets
from the Algorithms repository) and the second training dataset (which is augmented by
additional code snippets from several other repositories).
Table 5. Distribution of code snippets

Language | 1st Training dataset | 2nd Training dataset
python | 1.353 | 1.753
java | 1.139 | 1.902
Total | 2.492 | 3.655
Ensuring a balanced training dataset is crucial in machine learning to facilitate
optimal model performance and generalization across diverse datasets. A balanced dataset,
where each class (in this case, Python and Java code) is represented equally, mitigates the
risk of bias towards the majority class and fosters fair learning of patterns from all classes.
Research suggests that imbalanced datasets can lead to suboptimal model performance,
particularly in classification tasks, due to the model's inclination towards the dominant class
(He and Garcia, 2009). By contrast, a balanced dataset enhances the model's ability to discern
subtle differences between classes and improves its predictive accuracy across all classes
(Chawla et al., 2002). In the context of programming language identification, a balanced
dataset ensures that the model learns the distinctive syntactic and semantic features of both
Python and Java languages without bias towards either. This approach aligns with best
practices in machine learning and contributes to the robustness and reliability of the model's
predictions (Krawczyk, 2016).
Obtaining the raw input code data files involved a systematic process of accessing
GitHub repositories through clone requests and saving the retrieved files locally in a
designated folder named 'trial_files'. The use of clone requests allowed for the replication of
entire repositories, ensuring that all necessary code files and associated metadata were
captured for subsequent analysis and model training. By pulling the data locally, we gained
direct access to the raw source code files, enabling efficient preprocessing and transformation steps required for training the language identification model.
Figure 7. Share of code files in 2nd Training dataset
Figure 8. Share of code files in 1st Training dataset
The data preparation processes described in the next section were applied to each of these datasets, with the aim of extracting meaningful information from each code file while simultaneously removing files that could be considered noise or redundant data.
3.1.2 Data Pre-processing & Dimensionality Reduction
The data preparation stage involved a two-part process designed to clean and structure
code files, ultimately creating a dataset suitable for the embeddings that will be inputted to
the model. The goal was to prepare code snippets in a way that minimized noise and
maximized relevant information for the following stages of the study.
The first step focused on cleaning and annotating the code files. This process began
by eliminating comments, as they do not contribute to the actual code logic and can introduce
unnecessary clutter. The comment removal process was conducted separately for each type of
data file (Java or Python) and consisted of separate functions that accepted a code file as
input and produced the same file with all comments removed. The relevant code snippet for
Python files can be seen below:
import re

def remove_comments_python(source_code):
    # Keep string literals (group 1) intact and drop '#' comments (group 2)
    pattern = r'(\".*?\"|\'.*?\')|(#.*?$)'
    return re.sub(pattern, lambda m: m.group(1) or "", source_code, flags=re.MULTILINE)
where we have defined a specialized function to exclusively handle Python files. The function leverages the re (regular expressions) library to formulate a pattern that can detect Python code comments. The regular expression matches all textual information preceded by the "#" character, which marks a comment in the Python programming language, while leaving string literals untouched so that "#" characters inside strings are not mistaken for comments. The function then replaces every occurrence of a detected comment in the source code file with an empty string, ensuring that the source code file maintains its structure and is stripped of comments in an efficient manner.
A similar function was used for removing the Java comments, as presented below:
def remove_comments_java(code):
    in_multiline_comment = False
    lines = code.split("\n")
    result = []
    for line in lines:
        stripped = line.strip()
        if in_multiline_comment:
            # Skip lines until the closing "*/" of the block comment is found
            if stripped.endswith("*/"):
                in_multiline_comment = False
            continue
        if stripped.startswith("/*"):
            in_multiline_comment = True
            continue
        if stripped.startswith("//"):
            continue
        result.append(line)
    return "\n".join(result)
where by parsing the input Java code line by line, it checks for the presence of both single-
line and multiline comments. Initially, it initializes a boolean variable to track whether it is
currently within a multiline comment. Subsequently, the function iterates through each line of
the code, stripping leading and trailing whitespace. If the function encounters a line within a
multiline comment, it continues until it identifies the end of the comment block, indicated by
the presence of the "*/" characters. Single-line comments, denoted by "//", are also detected
and excluded from the result. The function ensures that the integrity of the source code is
preserved by appending non-comment lines to the result.
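A brief, hypothetical usage example (the input string is purely illustrative) shows how whole-line comments are dropped while code lines are kept:

java_src = "\n".join([
    "// single-line comment",
    "int x = 1;",
    "/*",
    " block comment",
    "*/",
    "return x;",
])
print(remove_comments_java(java_src))
# Expected output:
# int x = 1;
# return x;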
Additionally, specific elements within the code were labeled for clarity and
consistency. All string literals were replaced with the term "STRING," numerical values with
"NUMBER," and individual characters with "CHARACTER." These annotations ensured
that tokenization—the process of separating the preprocessed code files into individual
tokens—was consistent and meaningful. For all the above processes generic regular
expressions (regex) were utilized.
The subsequent step involves tokenization, a process facilitated by a regular
expression pattern (token_pattern). This pattern enables the function to detect consecutive
sequences of alphanumeric characters delineated by word boundaries. These identified
sequences constitute tokens within the source code.
def tokenize_code(source_code):
    # Match consecutive alphanumeric/underscore sequences delimited by word boundaries
    token_pattern = re.compile(r'\b\w+\b')
    tokens = re.findall(token_pattern, source_code)
    return tokens
All of the above processes are summarized and applied in the code snippet below:
import os

def read_files_with_labels(path):
    files = []
    for root, dirs, filenames in os.walk(path):
        for filename in filenames:
            if filename.endswith('.java') or filename.endswith('.py'):
                filepath = os.path.join(root, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    try:
                        file_contents = ' '.join(f.readlines())
                        # Remove comments according to the file's language
                        if filename.endswith('.java'):
                            file_contents = remove_comments_java(file_contents)
                        if filename.endswith('.py'):
                            file_contents = remove_comments_python(file_contents)
                        # Replace numeric and string literals with generic tokens
                        file_contents = re.sub(r"\d+", "NUMBER", file_contents)
                        file_contents = re.sub(r"\'.*?\'", "STRING", file_contents)
                        file_contents = re.sub(r"\".*?\"", "STRING", file_contents)
                        file_contents = re.sub(r"\'.*\'", "CHARACTER", file_contents)
                        file_contents = re.sub(r"\".*\"", "CHARACTER", file_contents)
                        tokens = tokenize_code(file_contents)
                        non_blank_tokens = remove_blank_tokens(tokens)
                        # Label Java files with 0 and Python files with 1
                        if filename.endswith('.java'):
                            files.append((filepath, 0, non_blank_tokens))
                        elif filename.endswith('.py'):
                            files.append((filepath, 1, non_blank_tokens))
                    except UnicodeError:
                        continue
    return files
Operating on the directory containing code files in Java and Python formats from GitHub, the function systematically processes each file, filtering based on file extension to identify Java and Python code files and add labels of 0 for Java and 1 for Python. Upon encountering
eligible files, the function reads their contents, it removes comments from Java and Python
files, replaces numerical and string literals with generic tokens to abstract code specifics, and
tokenizes the code for subsequent analysis. Additionally, it handles Unicode encoding errors
to ensure robustness. The function's output is a structured collection of tuples, each
containing the file path, a language label indicating Java or Python, and the tokenized code.
This organized dataset lays the foundation for subsequent analysis and model training,
facilitating meaningful insights into the characteristics and behaviours of Java and Python
code.
The second part of the process involved token normalization and frequency analysis.
Normalization involved converting all tokens to lowercase to avoid case-related
discrepancies, ensuring a uniform representation of code elements. Following normalization,
a frequency analysis was conducted to determine the occurrence of each unique token in the
dataset. This analysis provided insights into the most common elements, allowing us to
identify which tokens were most useful for feature selection. During the feature selection
process, it was observed that certain elements, such as "STRINGS", "NUMBERS", and
common stop words, appeared with high frequency in the code files. This high frequency
posed a challenge because it could skew the results of the predictive models, potentially
leading to incorrect interpretations or reduced accuracy. In addition to high-frequency tokens, low-frequency tokens should also be removed, as they contribute little to the models' predictive ability and introduce redundant noise.
import ast
import pandas as pd

# Load the tokenized files produced in the previous step
data = pd.read_csv("out_rf.csv", header=None)
data = data.dropna()

# Extract labels and code tokens (the third column holds the token arrays as text)
labels = data.iloc[:, 1]
code_tokens_list = []
for index, row in data.iterrows():
    converted_list = ast.literal_eval(row[2])
    code_tokens_list.append(converted_list)

# Remove stop words from each list of tokens
# (stop_words is assumed to have been defined earlier in the pipeline)
code_tokens_list = [
    [word for word in inner_list if word not in stop_words]
    for inner_list in code_tokens_list
]

# Convert tokens to lowercase and collect them for the frequency analysis
token_dict = []
for code_token_list in code_tokens_list:
    for i in range(0, len(code_token_list)):
        code_token_list[i] = code_token_list[i].lower()
        token_dict.append(code_token_list[i])

To resolve this problem, these high-frequency and low-frequency tokens were
removed from the dataset. This decision was made because these elements generally lack
predictive power—they do not typically play a significant role in differentiating between
various programming languages or coding patterns (Van Der Maaten, Postma, & Van den
Herik, 2009). By excluding them, we improved the robustness and accuracy of our prediction
results. In the end, this refined feature selection approach helped create a more reliable and
efficient machine learning model. In addition, during this stage, only tokens with a frequency
between 0.002 and 0.7 were retained for model training and predictions. This frequency-
based filtering ensured that only tokens with a reasonable level of occurrence were used,
eliminating excessively rare or overly common tokens that could introduce noise or bias into
the model.
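As an illustration of this filtering step, the sketch below computes relative document frequencies over the token lists and keeps only tokens whose frequency falls inside the stated bounds; the variable names are illustrative, and reading the 0.002-0.7 thresholds as relative document frequency is an assumption.

from collections import Counter

def filter_by_frequency(code_tokens_list, low=0.002, high=0.7):
    # Count, for each token, the share of files in which it appears
    n_files = len(code_tokens_list)
    doc_freq = Counter()
    for tokens in code_tokens_list:
        doc_freq.update(set(tokens))  # count each token once per file
    keep = {t for t, c in doc_freq.items() if low <= c / n_files <= high}
    # Drop excessively rare or overly common tokens from every file
    return [[t for t in tokens if t in keep] for tokens in code_tokens_list]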
All cleaned tokens for each file are stored in CSV files alongside their corresponding
language labels and file paths. These files are utilized in subsequent processing stages,
specifically for generating embeddings and performing averaging operations. Within the
saved CSV files, the tokenized code files are structured as follows: the first column denotes
the file path, the second column contains the language label of the respective files, and the
third column comprises an array encompassing all code tokens specific to the file, with tokens separated by commas.
Figure 9. Graphical representation of the data preparation process
3.1.3 Embeddings and Averaging
In this stage of data preparation we will proceed with the word embeddings
generation to represent each token for all the code files.
Vector space models have played a role in distributional semantics since the 1990s,
with models like Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan, 2003) and Latent
Semantic Analysis (LSA) among the early methods for estimating continuous representations
of words. Text vectorization or word embedding is the process of converting words or
documents from a corpus into numerical vectors, which consist of numbers or real numbers.
This conversion is essential for machine learning tasks and sentiment analysis in natural
language processing, as machine learning algorithms typically require numerical input. In 2003, researchers laid the foundation for distributional semantic learning, with Bengio et al.
presenting the NNLM model, which learns word representations by predicting the next
word/token based on the previous n-1 words/tokens.
Word embeddings are a way to represent words as vectors, encoding their semantic
meaning. These vectors are situated in a high-dimensional space, with embeddings for words
that are semantically or contextually related positioned near each other, and embeddings for
unrelated words positioned far apart. Some algorithms also create more intricate geometric
relationships among embeddings. A well-known example of this phenomenon is the vector
analogy: king - man + woman = queen. (Mikolov et al., 2013)
The broader adoption of word embeddings, however, is largely credited to Mikolov et
al. (2013), who developed Word2Vec, a toolkit that made training and using pre-trained
embeddings straightforward and efficient. This toolkit gained widespread popularity in the
machine learning community. A year later, Pennington et al. (2014) introduced GloVe, a set
of competitive pre-trained word embeddings, marking the point when word embeddings
became mainstream in natural language processing. Joulin et al. in 2016, introduced FastText
as an extension of CBOW, aiming to improve training efficiency while maintaining
performance levels compared to other algorithms. These foundational works have
significantly influenced the development of current NLP technologies.
Below is a brief overview of some of the most prominent word representation models.

Word2vec: uses a neural network to analyze extensive text datasets and derive these
vector-based representations, employing two hidden layers within a shallow neural
network to generate vectors for each word. Both the Continuous Bag of Words (CBOW)
and Skip-gram models aim to capture semantic and syntactic information within word
vectors. Training Word2vec with large corpora enhances word representation quality,
rendering it useful for various NLP tasks (Mikolov et al., 2013).
Global Vectors (GloVe): GloVe extends Word2vec to efficiently learn word vectors by
predicting words based on their surrounding context. The word and context embeddings
are initially random, with the goal of minimizing the distance (usually the dot product)
between target and context. This is achieved using stochastic gradient descent, where
each sample is a sliding window in the corpus. When a target word and a context word
are found together, their vectors are adjusted to be closer, based on the learning rate
(Pennington et al., 2014).
FastText: By breaking words into n-grams and feeding them into a neural network,
FastText captures character relationships and word semantics more effectively. This
approach yields better word representations, especially for rare words. Facebook released
pre-trained word embeddings for 294 languages, trained on Wikipedia using FastText
with 300 dimensions, incorporating the Word2Vec skip-gram architecture with default parameters (Joulin et al., 2016).
Embeddings from Language Models (ELMo): ELMo utilizes a bi-directional language
model (biLM), which processes text in both forward and backward directions, capturing a
richer understanding of word context compared to traditional models. This contextualized
representation allows ELMo to encode the nuanced meaning of words based on their
surrounding context, enabling more accurate representations of word relationships and
semantics (Peters et al., 2018).
Bidirectional Encoder Representations from Transformers (BERT): BERT is trained
on two unsupervised tasks: (1) a masked language model (MLM), where 15% of the
tokens are randomly masked (i.e., replaced with the “[MASK]” token), and the model is
trained to predict the masked tokens, (2) a next sentence prediction (NSP) task, where a
pair of sentences are provided to the model and trained to identify when the second one
follows the first. This dual-task training strategy aims to enhance the model's ability to
understand long-term and pragmatic relationships between sentences. BERT is trained on
datasets such as the Books Corpus and English Wikipedia text passages (Devlin et al.,
2019).
Table 6. Embedding models details

Embedding model | Year | Architecture | Dimension | Training dataset
Word2vec | 2013 | NNLM | 100 - 1000 | Google news
GloVe | 2014 | NNLM | 50 - 300 | Crawl corpus
FastText | 2016 | NNLM | 300 | Wikipedia
ELMo | 2018 | Bidirectional LSTM | 1024 | Wikipedia, Monolingual news crawl data from WMT 2008-2012
BERT | 2019 | Multi-layer bidirectional Transformer encoder | 768 | Books Corpus, English Wikipedia
In the current research, the word2vec model is employed to generate word embeddings
for the code file inputs to enhance robustness and effectiveness. Therefore, a comprehensive
examination of its architecture and intricacies is provided below.
Word2vec operates on two primary architectures: the Continuous Bag-of-Words (CBOW)
model and the Continuous Skip-gram model, both proposed by Mikolov et al. (2013). These
two architectures serve as the fundamental building blocks for generating word embeddings
and offer different approaches to learning word representations based on context.
The Continuous Bag-of-Words (CBOW) model predicts a target word given its
surrounding context words. This architecture proposed is akin to the feedforward Neural
Network Language Model (NNLM), but with some key differences: the non-linear hidden
layer is omitted, and the projection layer is shared across all words, not just the projection
matrix. This setup projects all words into the same space, essentially averaging their vectors.
This approach is termed a bag-of-words model because the word order does not affect the
projection process. Additionally, words from both the past and future are used in the context,
allowing the model to consider a broader context.
The computational complexity of training this model is represented as:
Q = N × D + D × log2(V)
where N is the number of words, D represents the dimensionality of the embeddings, and V denotes the vocabulary size.
The Continuous Skip-gram model reverses the prediction process of CBOW. Instead
of predicting a target word from context, Skip-gram aims to predict context words given a
target word. This model typically employs a larger context window, allowing it to capture
broader semantic relationships. Skip-gram can uncover more complex associations between
words, making it suitable for tasks that require a deeper understanding of context.
Specifically, the model takes each word as input to a log-linear classifier with a continuous
projection layer, then predicts words that are within a certain range before or after the given
word.
This architecture operates by sampling from a defined range around the target word.
The increased range tends to improve the quality of the resulting word vectors, as it provides
a broader context. However, this broader context also comes with increased computational
complexity. Since words further from the target are usually less related to it, we apply a
weighting system to decrease their impact. This is achieved by reducing the frequency of
sampling from these more distant words during training, focusing more on words that are
closer to the target. This approach balances the need for contextual information with the
computational costs, allowing for a more efficient and scalable training process.
The training complexity of this architecture is proportional to:
Q = C × (D + D × log2(V)),
where C is the maximum distance of the words.

Figure 10. CBOW and Skip-gram architectures
The CBOW model is often chosen for its speed and efficiency, making it ideal for
large-scale datasets or situations where rapid processing is needed. On the other hand, the
Skip-gram model is preferred for its ability to capture richer and more intricate relationships
among words, proving useful for tasks that require a detailed semantic understanding.
In this study, we generated word embeddings using the Word2Vec algorithm to create
dense vector representations of words in the dataset. We configured the model with specific
parameters to ensure that the embeddings captured a broad context and were suitable for
various downstream applications.
The Word2Vec model was trained using a list of tokenized code snippets. The key
parameters for this training process were as follows:
Vector Size: We set the dimensionality of the word embeddings to 300, indicating
that each word would be represented by a 300-dimensional vector. This size strikes a
balance between capturing semantic relationships and maintaining computational
efficiency.
Window Size: The context window, set to 5, determines how many surrounding
words are considered for each target word during training. This value helps ensure
that the embeddings capture enough contextual information without becoming
computationally prohibitive.

Minimum Count: The minimum frequency for words to be included in the training
was set to 1. This choice allows the inclusion of even rare words, enriching the
diversity of the embeddings.
Skip-gram Architecture: The sg=1 parameter specifies the use of the Skip-gram
architecture, which aims to predict context words based on a given target word. This
architecture is typically used to capture more complex relationships among words.
Parallel Processing: We set the workers=4 parameter to leverage parallel processing,
allowing the model to be trained across four CPU threads simultaneously, thus
speeding up the training process.
from gensim.models import Word2Vec

word2vec_model = Word2Vec(sentences=new_code_tokens_list, vector_size=300,
                          window=5, min_count=1, sg=1, workers=4)
By configuring the Word2Vec model with these parameters, we aimed to generate high-
quality word embeddings that would serve as the foundation for subsequent stages of
analysis. Each word is transformed into a 300-dimensional vector, allowing for efficient
representation of text in a continuous space.
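For a quick sanity check of the trained model, individual token vectors and their nearest neighbours can be inspected; the token used below is illustrative and assumes it survived pre-processing:

vector = word2vec_model.wv["def"]                      # 300-dimensional vector for one token
print(vector.shape)                                    # (300,)
print(word2vec_model.wv.most_similar("def", topn=5))   # tokens with the closest vectors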
The next step in preparing the data for the prediction model involves averaging the
embeddings for each code snippet. As outlined in the embeddings section and in the preceding preprocessing section, each code file consists of a list of tokens, with each token
having its own vector for representation. The size of each vector is defined by the user that
conducts the analysis and, in the case of this thesis, all vectors have been assigned a size of
300. Hence, each code file consists of 𝑛 vectors of size 300, corresponding to the number of
tokens that remain after the preprocessing procedure.
The process of averaging creates a single representative vector for each code file,
allowing us to use this compact representation as input for the prediction model. By
averaging these vectors, we can condense the information from an entire code file into a
single vector. This transformation is crucial because many prediction models, like Random
Forest or other traditional machine learning algorithms, require a fixed-size input. The
process of averaging is conducted in the following manner: for a code file that is expressed as
a list of tokens cf = [token_1, token_2, ..., token_n], each token is also expressed as a vector array V = [v_1, v_2, ..., v_300] which contains the 300 numbers that comprise the token vector. Hence, with cf being transformed to cf = [V_1, V_2, ..., V_n], an Embedding Matrix (EM) is
constructed, containing the vectors for each token (V_1 to V_n). Each row of the EM represents a token vector, while the columns can be used for producing element-wise sums of vectors. Each cell of the EM contains a numerical value that represents one element (E) in a token vector V. The structure of the EM can be seen below.
Table 7. Embedding Matrix for each code file

Token Vectors (V) | Column_1 | Column_2 | ... | Column_300
V_1 | E_1,1 | E_1,2 | ... | E_1,300
V_2 | E_2,1 | E_2,2 | ... | E_2,300
V_3 | E_3,1 | E_3,2 | ... | E_3,300
... | ... | ... | ... | ...
V_n | E_n,1 | E_n,2 | ... | E_n,300
The process of averaging computes the element-wise sum for all token vectors. More
specifically, for a Column_i, the code that we have implemented computes the Column Sum (CS) based on the following formula:

CS_i = Σ_{j=1}^{n} E_{j,i}, for i = 1, ..., 300

where n is the number of token vectors (V) contained in each file and i indexes the vector dimensions, which in the case of this study number 300. Hence, the process computes 300 instances of CS and a final vector F is produced, which can be expressed as F = [CS_1, CS_2, ..., CS_300]. Finally, the F vector undergoes a final division and is transformed to F_final = [CS_1/length, CS_2/length, ..., CS_300/length], where length is the number of tokens in cf. The final averaged embeddings are then used for the subsequent
experimentations with machine learning classifiers. Averaging embeddings helps reduce the
complexity of the dataset while retaining meaningful information about the code's semantic
structure. By transforming each code file into a single vector, we create a uniform
representation that can be used for training and testing the prediction model. This step is
particularly useful for working with models that need consistent input formats and allows for
efficient processing and analysis of code data in the context of machine learning.

final_vectors = []
for code_token_list in new_code_tokens_list:
    vectors = []
    for token in code_token_list:
        embedding_vector = word2vec_model.wv[token]
        vectors.append(embedding_vector)
    # Element-wise sum of all token vectors, then division by the number of tokens
    column_sums = [sum(column) for column in zip(*vectors)]
    for i in range(0, len(column_sums)):
        column_sums[i] = column_sums[i] / len(code_token_list)
    final_vectors.append(column_sums)

# Keep only files whose averaged vector has the expected 300 dimensions
final_final_vectors = []
for i in range(0, len(final_vectors)):
    if len(final_vectors[i]) == 300:
        final_final_vectors.append(final_vectors[i])
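For reference, the same averaging can be written compactly with NumPy; this is an equivalent sketch rather than the code used in the study, and the function name is illustrative.

import numpy as np

def average_embedding(token_list, model, size=300):
    # Mean of all token vectors of a file; falls back to a zero vector for empty files
    vectors = [model.wv[t] for t in token_list if t in model.wv]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)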
3.1.4 Machine Learning Classifiers
In this subsection the procedure followed to experiment with different machine
learning classifiers is demonstrated. Overall, a thorough investigation of related algorithms
was performed in order to trace the optimal ones used for source code language
identification. Related literature has already reported the use of several well-known
algorithms such as CNN (Mandelbaum & Shalev, 2016), Random Forest (RF) (Vora et al.,
2017) or Long Short-Term Memory (LSTM) (Li & Gong, 2021). In addition, this subsection
also illustrates the separate experimentations that were considered for each selected algorithm
and the different setups selected. It should be noted that for this research, the classifiers that
were considered and used to make predictions and classify code files into Python or Java
were Neural Networks, Random Forest and Support Vector Machines.
3.1.4.1 Background on Classifiers
Neural Networks
This type of classifier uses interconnected layers of artificial neurons to learn complex
patterns in the data. The architecture of a neural network typically comprises multiple layers,
each consisting of interconnected nodes or neurons. Each neuron receives input from the
neurons in the previous layer, performs a computation on this input, and then passes the result
to the neurons in the next layer. This process continues through the network until the final
layer produces the output.
The specific neural network used in this research consists of three layers: an
embedding layer, an LSTM (Long Short-Term Memory) layer, and a dense layer.

The Embedding layer is responsible for transforming categorical or discrete data into
continuous vector representations, often referred to as embeddings. These embeddings
capture semantic relationships between categories or words and are essential for effectively
processing categorical data in neural networks. In our case, the embeddings are derived from the code tokens of the Python and Java code snippet files.
Long Short Term Memory networks (LSTMs), introduced by Hochreiter &
Schmidhuber (1997) and subsequently refined and popularized, are a specialized type of
recurrent neural network (RNN) designed to overcome the challenge of learning long-term
dependencies. It is particularly well-suited for tasks involving time series data or sequences
of data points, such as natural language processing tasks like text generation or sentiment
analysis. LSTM layers are capable of retaining information over extended periods, making
them suitable for processing sequences with long-range dependencies. While traditional
RNNs consist of a simple repeating module typically containing a single layer, LSTMs
feature a more complex structure comprising four interacting layers within each repeating
module. The core concept of LSTMs revolves around the cell state, which serves as a
conveyor belt facilitating the flow of information along the network. Crucially, LSTMs
employ structures called gates to regulate the flow of information into and out of the cell
state, enabling selective retention and addition of information. These gates, comprised of
sigmoid neural net layers and pointwise multiplication operations, control the flow of
information by determining how much information should be retained or discarded.
(Sundermeyer, Schlüter, and Ney, 2012)
The dense layer is a standard component in neural network architectures. Neurons in
this layer are connected to every neuron in the preceding layer, and each connection is
associated with a weight parameter. Each neuron in the dense layer receives input from every
neuron in its preceding layer and performs matrix-vector multiplication. This operation
entails ensuring compatibility between the dimensions of the matrices involved, where the
row vector of the output from the preceding layer matches the column vector of the dense
layer. The matrix-vector multiplication is governed by a general formula, where the
parameters of the preceding layers are updated through backpropagation, a widely used
algorithm for training feedforward neural networks. As a result, the output from the dense
layer is an N-dimensional vector, effectively reducing the dimensionality of the input vectors.
When implementing a dense layer, it's essential to ensure that the number of neurons in the
dense layer matches the dimensions of the output from the preceding layer. This process can
be facilitated using frameworks like Keras, where various parameters of the dense layer are
defined to control its behavior effectively. The dense layer performs a linear transformation
on the input data followed by a non-linear activation function, allowing the network to learn
complex, nonlinear relationships within the data. (Javid et al., 2021)
Random Forest (RF)
A robust ensemble learning method that builds multiple decision trees during training
and merges their outputs to improve prediction accuracy and reduce overfitting. Random
Forest is valued for its stability, flexibility, and ability to handle high-dimensional data,
making it a suitable choice for classifying code files.
The Random Forest algorithm cultivates an ensemble of decision trees, amalgamating
their predictions to achieve enhanced accuracy. This approach involves constructing multiple
decision trees, each trained on a distinct subset of the training set, and subsequently
aggregating their predictions to determine the final outcome taking the most popular result.
By employing this ensemble method, Random Forest introduces diversity among the decision
trees, enhancing the model's robustness and mitigating the risk of overfitting. Conversely, in
regression, the forest aggregates the outputs of all trees to derive the average prediction.
Crucially, the efficacy of Random Forest hinges upon the limited or absent correlation among
individual models, ensuring that errors inherent in specific trees are offset by the collective
accuracy of the ensemble, thereby aligning the overall outcome toward the desired direction.
(Parmar, Katariya, and Patel, 2019)
In ensemble learning, such as in Random Forests, decision trees are commonly trained
using the "bagging" technique, which falls under Bootstrap Aggregation. This ensemble
method amalgamates predictions from multiple algorithms to enhance accuracy. Random
Forests, a type of ensemble method, leverage Bootstrap Aggregation by randomly sampling
rows and features from the dataset to create sample datasets for each model. Aggregating
these sample datasets entails summarizing observations and combining them, effectively
reducing the variance of high-variance algorithms like decision trees, thereby mitigating
overfitting. Random Forest finds applications across various industries, including banking,
stock trading, medicine, and e-commerce, for tasks such as predicting customer behavior,
detecting fraud, recommending products, and analyzing medical data. Notably, its advantages
include its ability to address overfitting, efficiency in handling large datasets, and relative
ease of implementation compared to more complex models like neural networks. (Breiman,
L., 2001)
Support Vector Machines (SVM)
The Support Vector Machine (SVM), a supervised learning classifier, is instrumental
in data sorting by determining an optimal hyperplane that maximizes class separation within
an N-dimensional space. This method aims to establish the widest margin between different
groups, thereby enhancing classification accuracy. Operating within high-dimensional feature
spaces, SVMs employ linear functions and learning algorithms grounded in optimization and
statistical learning theories (Wang, 2005).
While traditionally conceived for binary classification, SVMs have adapted to
computationally intensive multiclass problems by combining multiple binary classifiers. In
mathematical terms, SVMs employ kernel methods to transform data features via kernel
functions, facilitating the simplification of data boundaries for non-linear problems. This
process entails mapping complex datasets into higher dimensions to aid in data point
separation. While this technique, known as the kernel trick, introduces computational
complexities, it efficiently transforms data into higher dimensions. (Moguerza & Muñoz,
2006)
SVM's versatility extends to text and image classification tasks, where it excels in
tasks like category assignment, spam detection, sentiment analysis, and image recognition,
particularly in aspect-based recognition and color-based classification. Furthermore, SVM
plays a crucial role in handwritten digit recognition, contributing to postal automation
services. Notably, SVM demonstrates efficacy in high-dimensional spaces and is frequently
employed in text classification tasks, serving as a potent tool for distinguishing between
Python and Java code files.
CodeBert
Feng et al. (2020) introduced CodeBERT, a pre-trained model designed to understand
the semantic relationships between natural language and code, enabling it to generate
general-purpose representations that are useful for various tasks, such as natural language-
based code search and code documentation generation. CodeBERT is inspired by BERT
(Devlin et al., 2018) and RoBERTa (Liu et al., 2019), leveraging the multi-layer bidirectional
Transformer architecture proposed by Vaswani et al. (2017).

For input representation during pre-training, CodeBERT concatenates two segments with special separator tokens: [CLS], followed by the natural language text (w_1, w_2, ..., w_n), a [SEP] token, and the code snippet (c_1, c_2, ..., c_m), ending with [EOS]. Here, [CLS] serves as
a special token marking the beginning of the segments, with its final hidden representation
serving as the aggregated sequence representation for classification or ranking tasks. Natural
language text is tokenized into WordPiece units, while code segments are treated as
sequences of tokens.
The output of CodeBERT comprises two main components: the contextual vector
representations for each token, encompassing both natural language and code, and the
representation of [CLS], which functions as the summarized sequence representation for
downstream tasks.
CodeBERT finds diverse applications in software development, including code
summarization, translation between programming languages, code completion, and
facilitating code-to-code and code-to-text transformations. For instance, it aids developers in
understanding and documenting code by generating human-readable summaries and
translating code snippets into natural language. Additionally, it enables code search based on
natural language queries and facilitates translation of code domain text into various
languages.
3.1.4.2 Training Results
These classifiers were chosen for their unique strengths and were used to predict and
classify code files as either Python or Java. The variety of classifier types underscores their
effectiveness in terms of accuracy and predictive capabilities. However, as we progressed
with testing on various datasets, we noticed a decline in accuracy. This prompted us to
experiment with a broader range of classifiers to understand the underlying causes of this
accuracy issue. The findings from these additional tests will be presented later in this work.
Neural Networks
The neural network model used in this research was built with Keras, employing a
sequential architecture, which allows layers to be stacked linearly. The model is designed to
perform a binary classification task, and it consists of three primary components: an
embedding layer, a Long Short-Term Memory (LSTM) layer, and a dense output layer.

Before analysing the parameters of the neural network further, it is important to mention some details about the data pre-processing at this stage. The primary focus was on basic cleaning of the code snippet corpus, which involved removing comments, tokenizing the code, and eliminating blank tokens from the dataset. Also, padding was used to ensure that the embeddings matrix has the same size for all code files.
Word2Vec was used to generate word embeddings, which were then organized into an
embedding matrix for input into the neural network.
The first layer is the embedding layer, where the matrix with the already saved
embeddings was used. This layer's configuration involves several important parameters. The
size of the vocabulary, the size of each embedding vector was defined. Also, the input_length
parameter sets the maximum length for input sequences. It also uses a mask_zero feature to
manage variable-length sequences with zero-padding.
Next is the LSTM layer, a type of Recurrent Neural Network (RNN) designed to
handle sequential data. The LSTM layer in this model contains 300 units, uses a dropout rate
of 0.3 to prevent overfitting, and is configured to be stateless, meaning it doesn't retain state
information across batches. The activation function for the LSTM layer is ReLU (Rectified
Linear Unit), providing non-linearity to the model.
The final layer is the dense output layer, designed for binary classification. This layer
consists of one unit with a sigmoid activation function, which outputs a probability value
between 0 and 1. This dense layer serves as the model's classification output.
To train the neural network model, we followed a structured approach to ensure that
the training data was randomized and the validation process was robust. First, we randomized
the order of the dataset to minimize any biases due to the sequence of data points. This step
involved generating a list of indices corresponding to the original dataset and then shuffling
these indices. By reordering the dataset with these shuffled indices, we ensured that the
model would be trained on a random distribution of the data, reducing the likelihood of
overfitting to any specific pattern.
Next, we transformed the output labels into a numerical format suitable for training a
machine learning model. This transformation step was crucial because most machine learning
algorithms require numerical inputs for both training data and target outputs. We used a
simple encoding method to convert categorical labels into numbers, creating a new dataset of
encoded labels.
To prepare the data for training and validation, we split the shuffled dataset into two
parts: one for training and one for testing. We allocated approximately 80% of the dataset for
training and 20% for testing. In addition to the training and testing split, we created a
validation set from the training data. This validation set was used to fine-tune the model's
hyperparameters and ensure optimal performance. The validation set was created by taking a
portion of the training data, allowing us to adjust the model as needed without affecting the
final testing set.
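A sketch of this randomization and splitting procedure, with illustrative data standing in for the averaged embeddings and encoded labels, could look as follows:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 300)             # stands in for the averaged embeddings
y = np.random.randint(0, 2, size=100)    # stands in for the encoded labels (0 = Java, 1 = Python)

# Shuffle the sample order to remove any ordering bias
indices = np.arange(len(X))
np.random.shuffle(indices)
X, y = X[indices], y[indices]

# Roughly 80% training / 20% testing, then a validation split carved from the training part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1)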
Despite the well-structured model described above, it failed to yield valuable results,
with accuracy consistently near 0.49. Although we attempted various corrective measures,
such as adjusting the number of epochs and expanding the dataset with additional code
snippets, these efforts did not lead to a significant improvement in performance. Furthermore,
the model exhibited high instability and sensitivity to variations in the input data's structure
and quality, resulting in frequent run-time issues and necessitating manual interventions.
Due to these persistent issues, we ultimately decided to abandon this model as a
viable approach for our prediction methodology. The model's inherent instability and the lack
of valuable insights it provided made it unsuitable for continued use, prompting us to explore
alternative techniques for the initial stages of our research.
Random Forest and SVM
As part of the next steps in our research, we chose Random Forest and Support Vector
Machines (SVM) due to their robustness and flexibility, aiming to determine whether they
could yield more accurate results. However, before going further into the specifics of the
experiments and the fine-tuning techniques applied after using these classifiers, it's crucial to
point out that both classifiers displayed similar trends. Namely, they exhibited overfitting on
the training dataset and a significant drop in prediction accuracy when tested with different
datasets. This section outlines the process followed with the Random Forest classifier. As the
results of the SVM classifier were similar, this section will primarily focus on the steps taken
with Random Forest and the fine-tuning strategies employed to address overfitting and
improve prediction capabilities.

In our study, we expanded the evaluation to other classifiers, specifically Random
Forest and Support Vector Machines (SVM), both of which yielded similar results. Initially,
we employed the default Random Forest algorithm without additional parameterization. The
default settings include 100 trees (n_estimators), Gini impurity as the split criterion
(criterion), no limit on tree depth (max_depth), and a minimum of 2 samples required to split
a node (min_samples_split). The classifier uses bootstrapped samples (bootstrap=True), and
the maximum features considered for a split is the square root of the total features
(max_features='auto'). Out-of-bag score (oob_score) is disabled, and a single core is used for
computation (n_jobs=None). The random_state parameter is unset, allowing results to vary
with each run. This default configuration provides a robust starting point for building a
Random Forest model, allowing for flexible customization to suit specific data and use cases.
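A sketch of this baseline run, reusing the training and testing arrays from the split described earlier (the names are illustrative), is shown below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()    # default settings: 100 trees, Gini criterion, unlimited depth
rf.fit(X_train, y_train)         # averaged embeddings and labels from the 80/20 split
print(accuracy_score(y_test, rf.predict(X_test)))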
The initial training phase used the "1st Training dataset," composed solely of
algorithms written in both Java and Python. For data pre-processing, this stage involved
removing comments, tokenizing code snippets, eliminating blank tokens, generating
embeddings using Word2Vec, and averaging the code vectors. The training and testing
datasets were created with an 80%/20% ratio. Despite this thorough approach to data
preparation, the results showed an unusually high accuracy, nearing 100%.
Such extreme accuracy raised concerns about the dataset's structure and diversity. It
suggested that the dataset might be too uniform or lacking in variability, leading to an overly
optimistic model performance. This outcome implied that the model might not be
generalizing effectively and could struggle with more diverse or complex datasets.
Consequently, further evaluation and dataset adjustments would be needed to ensure the
model's robustness and reliability in real-world scenarios.
Further adjustments were made by expanding the dataset with additional code
snippets from other Java projects and Python libraries, as described in Section 3.2. This
enrichment aimed to increase dataset diversity and reduce overfitting. After these changes,
the model's accuracy improved, reaching approximately 0.89, with overfitting reduced to a
more acceptable level. Although this accuracy was satisfactory during training, when the
model was tested on unseen data to predict other code files, the accuracy dropped
significantly to 0.42. This decline indicated that the model struggled to generalize beyond the
training data, suggesting that additional modifications to the model's parameters and structure
were necessary to achieve more reliable predictions.

We used grid search to evaluate different parameter configurations for the model, aiming
to identify the optimal settings for improved accuracy. This process involved systematically
testing various combinations of parameters to determine the best configuration for our
Random Forest classifier. The final parameters chosen as a result of this exercise are as follows (see the sketch after the list):
Split Criterion: The classifier uses 'gini' as the split criterion, employing Gini impurity to
assess the quality of splits within the trees.
Maximum Tree Depth: The maximum depth of the trees is capped at 10, which helps
prevent overfitting by limiting the growth of the decision trees, ensuring they do not
become excessively complex.
Maximum Features for Splits: To promote diversity in the forest and reduce overfitting,
the maximum number of features considered for each split is set to the logarithm base 2
of the total number of features ('log2').
Minimum Samples for Leaf Nodes and Splits: The minimum number of samples
required for a leaf node is one, allowing smaller branches to form. The minimum number
of samples needed to split a node is set at two, providing stability and preventing overly
complex splits.
Number of Trees: The classifier consists of 100 trees, creating a robust ensemble that
balances diversity and accuracy.
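The exact parameter grids explored are not listed in the text; a sketch of such a grid search, with illustrative ranges and the training arrays from the earlier split, is given below.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5],
    "n_estimators": [100, 200],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)   # expected to match the configuration described above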
After applying these parameters to the "2nd Training dataset" the accuracy during training
again reached 1, suggesting that the model was overfitting to the training data. However,
when we used this model to predict code files from an external dataset, the accuracy
plummeted to 0.46, indicating that the model's generalization capacity was still limited.
As discussed in the data pre-processing section, when we analysed the token
frequencies for dimensionality reduction, we observed that a substantial portion of the dataset
was composed of stop words, strings, and numbers. These components tended to add noise
and detracted from the patterns necessary for accurate predictions. In response, we chose to
remove all instances of stop words, strings, and numbers from the dataset, thereby
minimizing unnecessary complexity and enhancing data quality. After completing this data
cleansing, we applied dimensionality reduction techniques to retain only the most significant
tokens for prediction.

An interesting observation from our research was that, although the accuracy on the
training dataset remained consistently at 1, the model's performance on other datasets showed
considerable variation, with accuracy ranging between 0.46 and 0.79. Despite this variability
in accuracy, we noted an increase in precision. However, when the model was tested on
predictions across three different datasets, the accuracy exhibited significant fluctuation,
indicating a strong likelihood of overfitting to the training dataset and heightened sensitivity
to variations in the input datasets used for prediction. This inconsistency underscores the need
for further investigation into the model's robustness and suggests that additional measures
may be required to mitigate overfitting and improve generalization.
CodeBERT
Given that the results from the classifiers used in our study were not encouraging and
demonstrated unreliable predictive performance, we decided to focus our efforts on
enhancing CodeBERT, a pre-trained model for both programming languages (PL) and natural
languages (NL), as presented by Feng et al. (2020).
CodeBERT is designed to learn general-purpose representations that can support a
range of NL-PL tasks, such as natural language code search, code documentation generation,
and similar applications. It is built upon the multi-layer Transformer architecture, which is a
common structure for large pre-trained models. CodeBERT's effectiveness is demonstrated in
its performance on various tasks, achieving state-of-the-art results in code search and code-
to-text generation, bridging the gap between natural language and programming languages.
We incorporated an alternative version of CodeBERT, known as CodeBERTa-
language-id, into our study. This model, pre-trained on the task of programming language
identification (PLI), classifies code snippets into the programming language they belong to.
By integrating a sequence classification head atop the CodeBERTa-small-v1 architecture, the
model exhibits remarkable evaluation accuracy and F1 scores, exceeding 0.999. Accessible
through a TextClassificationPipeline, users can input code snippets and obtain predictions
regarding their programming language.
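A minimal sketch of querying this model through the Hugging Face transformers library is given below; the model identifier and the example snippet are assumptions based on the publicly available CodeBERTa-language-id checkpoint.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model_id = "huggingface/CodeBERTa-language-id"
pipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained(model_id),
    tokenizer=AutoTokenizer.from_pretrained(model_id),
)
print(pipeline("def hello():\n    return 'world'"))
# e.g. [{'label': 'python', 'score': 0.99...}]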
In our experimentation with the "2nd Training dataset," we undertook several
preprocessing steps, including the removal of comments, stop words, strings, and numerical
values from the code files. The average accuracy achieved was 0.97.

3.2 Identify library type from Python source code snippets
Progressing with our research, and after reviewing the results of our language
identification algorithm, we will filter only the files recognized as Python language files.
Using a sample method, we will then identify the type of library each source code file
belongs to, categorizing them into data visualization, machine learning, NLP, or web
development.
Below is a quick overview of the architecture of this solution, along with some
adjustments to improve accuracy. In this case, four separate classifiers were trained to
identify each specific library, and four separate t-SNE reducers were applied after vector
creation. By combining the predicted probabilities from each classifier, the ensemble method
accounts for the confidence levels of each classifier in its predictions, leading to a more
balanced and robust final prediction.
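An illustrative sketch of this combination step, assuming four fitted binary classifiers (one per category) that expose predict_proba, is shown below:

import numpy as np

categories = ["data_visualization", "machine_learning", "nlp", "web_development"]

def predict_category(vector, classifiers):
    # classifiers maps each category name to a fitted binary classifier whose
    # positive class means "belongs to this library type"
    scores = [classifiers[c].predict_proba([vector])[0][1] for c in categories]
    return categories[int(np.argmax(scores))]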
Figure 11. Adjusted Methodology for library identification

3.2.1 Data collection & Pre-processing
For training the four separate classifiers, we utilized popular and structured Python
libraries sourced from GitHub across the specified categories. Saabith, Vinothraj, and Fareez
(2020) emphasize factors influencing Python library popularity: ease of use, expressiveness,
interpretation, cross-platform compatibility, and open-source nature. They also highlight
common applications such as NLP, GUI, Web Application, Data Science, and Machine
Learning. Based on this evidence, we selected the following categories for our analysis.
An additional aspect to consider was the selection of files for the second class in all
our classifiers. We chose mathematics and algorithms libraries based on the hypothesis that
they are versatile and applicable across various domains and scenarios.
The libraries used for training in each category are listed below:
Table 8. Python libraries for training

Library type                               Library title   GitHub link
Data Visualization                         Matplotlib      https://github.com/matplotlib/matplotlib
                                           Seaborn         https://github.com/seaborn/seaborn
                                           Plotly          https://github.com/plotly/plotly.py
                                           Bokeh           https://github.com/bokeh/bokeh
                                           Altair          https://github.com/altair-viz/altair
                                           Dash            https://github.com/plotly/dash
Machine Learning                           Keras           https://github.com/keras-team/keras
                                           Pytorch         https://github.com/pytorch/pytorch
                                           Scikit-learn    https://github.com/scikit-learn/scikit-learn
                                           Tensorflow      https://github.com/tensorflow/tensorflow
                                           xgboost         https://github.com/dmlc/xgboost
NLP                                        Transformers    https://github.com/huggingface/transformers
                                           TextBlob        https://github.com/sloria/TextBlob
                                           Textacy         https://github.com/chartbeat-labs/textacy
                                           Polyglot        https://github.com/aboSamoor/polyglot
                                           Gensim          https://github.com/RaRe-Technologies/gensim
                                           Flair           https://github.com/flairNLP/flair
                                           FastText        https://github.com/facebookresearch/fastText
                                           BERT            https://github.com/google-research/bert
                                           NLTK            https://github.com/nltk/nltk
Web Development                            Django          https://github.com/django/django
                                           Flask           https://github.com/pallets/flask
                                           Pyramid         https://github.com/Pylons/pyramid
                                           Web2py          https://github.com/web2py/web2py
                                           CherryPy        https://github.com/cherrypy/cherrypy
                                           Bottle          https://github.com/bottlepy/bottle
                                           Tornado         https://github.com/tornadoweb/tornado
                                           Sanic           https://github.com/sanic-org/sanic
Mathematics, data structures & algorithms  NumPy           https://github.com/numpy/numpy
                                           SciPy           https://github.com/scipy/scipy
                                           SymPy           https://github.com/sympy/sympy
                                           Pandas          https://github.com/pandas-dev/pandas
                                           Theano          https://github.com/Theano/Theano
                                           CVXPY           https://github.com/cvxgrp/cvxpy
                                           PyMC            https://github.com/pymc-devs/pymc
                                           SageMath        https://github.com/sagemath/sage
                                           PuLP            https://github.com/coin-or/pulp
                                           CherryPy        https://github.com/cherrypy/cherrypy
All files were downloaded from GitHub and stored locally in separate folders for
training purposes. Each of the four classification types had between 3000-4000 files, while
6000 random files were selected from mathematics and data structures libraries. To ensure
balanced training and avoid biases, a random sample equal to the number of files in each
classification type was chosen from the random files pool. This approach ensures that each
classifier receives diverse training data, optimizing for accuracy across all categories.
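As a simple illustration of this balancing step, the sketch below draws, for one category, a random sample from the pool of mathematics and algorithm files equal in size to that category; the folder names data_visualization and random_math are hypothetical.
import os
import random

# Hypothetical local folders holding the downloaded GitHub files.
category_files = [os.path.join("data_visualization", f)
                  for f in os.listdir("data_visualization") if f.endswith(".py")]
random_pool = [os.path.join("random_math", f)
               for f in os.listdir("random_math") if f.endswith(".py")]

# Draw as many random files as there are category files to keep the classes balanced.
random.seed(42)
negative_sample = random.sample(random_pool, k=len(category_files))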
The data pre-processing applied is similar to the language recognition algorithm
described in Section 3.1.2, involving the removal of comments and stopwords, tokenization,
and the removal of empty tokens.
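A simplified sketch of this pre-processing is given below, assuming Python-style '#' comments and an NLTK English stop-word list; the exact rules of Section 3.1.2 may differ.
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words("english"))

def preprocess(source_code):
    # Remove single-line comments (simplified: Python '#' comments only).
    code = re.sub(r"#.*", "", source_code)
    # Split on non-word characters to obtain rough tokens.
    tokens = re.split(r"\W+", code)
    # Drop stop words and empty tokens.
    return [t for t in tokens if t and t.lower() not in STOP_WORDS]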
For example, an input file for the Machine Learning classification training set looks
like the following before pre-processing:

Figure 12. Machine learning library code snippets example
After the pre-processing, the tokenized source code will be saved in a dataframe with the
below structure:
Figure 13. Machine Learning library code snippets example after tokenization

3.2.2 Dimensionality reduction with t-SNE
Following data cleansing and pre-processing, embeddings were generated using the
Word2Vec model with parameters specified in Section 3.1.3. This model transforms each
word into a 300-dimensional vector, enabling efficient text representation. The next step
involved averaging the embeddings for each code snippet, where each code file, consisting of
a list of tokens, was represented by 300-dimensional vectors for subsequent analysis.
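A minimal sketch of this averaging step, assuming a trained gensim Word2Vec model with 300-dimensional vectors, is shown below; the model file name is hypothetical.
import numpy as np
from gensim.models import Word2Vec

# Hypothetical path to the Word2Vec model trained with the parameters of Section 3.1.3.
w2v_model = Word2Vec.load("word2vec_300d.model")

def average_embedding(tokens, w2v_model):
    # Keep only tokens that exist in the Word2Vec vocabulary.
    vectors = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    if not vectors:
        return np.zeros(w2v_model.vector_size)
    # Each code file is represented by the mean of its token vectors.
    return np.mean(vectors, axis=0)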
To further reduce complexity and visualize the high-dimensional data, we applied t-
SNE (t-Distributed Stochastic Neighbor Embedding), which reduced the 300-dimensional
embeddings into two features. This dimensionality reduction technique helped in preserving
the local structure of the data while making it easier to analyze and interpret. t-SNE is widely
recognized for its ability to preserve the local structure of high-dimensional data while
enabling effective visualization in lower dimensions (Van der Maaten and Hinton, 2008).
This method has been successfully used in various domains, including natural language
processing (NLP) and bioinformatics, to reveal complex data patterns and clusters (Maaten,
L.J.P. van der, Postma, E.O., & Herik, H.J. van den, 2009). Furthermore, studies have shown
that combining Word2Vec embeddings with t-SNE can enhance the interpretability of text
data by capturing semantic similarities between words (Liu, Q., & Zhang, H., 2017).
Hence, a t-SNE reducer was created for each source code file category and saved with
each classifier using the Scikit-learn library, with n_components=2 and random_state=42.
The n_components=2 parameter reduces the data to two dimensions, essential for effective
visualization and interpretation of patterns within the embeddings. Setting random_state=42
ensures reproducibility, a critical aspect for validating and maintaining consistency in our
results. t-SNE is particularly effective in preserving local structures within high-dimensional
data, thus facilitating the identification of meaningful clusters and relationships.
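As a sketch, this reduction can be reproduced with scikit-learn as follows, where X is assumed to be the matrix of averaged 300-dimensional embeddings for one library category; note that scikit-learn's TSNE only offers fit_transform, so the reducer is fitted on the embeddings of that category.
from sklearn.manifold import TSNE

# X: array of shape (n_files, 300) holding the averaged Word2Vec embeddings.
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)  # shape (n_files, 2), used for plotting and classifier training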
The findings from this reduction demonstrate clear differentiation between clusters
corresponding to specific Python library source code files and random (mathematics) source
code files. In a t-SNE visualization, clusters are identified as groups of closely located points
within the two-dimensional plot, each representing data points with similar characteristics or
patterns from the original high-dimensional space. Proximity within a cluster indicates higher
similarity among points, contrasting with points in other clusters. Well-separated clusters
suggest distinct groups or categories within the data, whereas overlapping clusters reveal
shared attributes or similarities between different groups. The density and distribution of

points within each cluster provide valuable insights into the structure and relationships within
the data, offering a clearer understanding that may not be discernible in higher-dimensional
representations.
All four categories show noticeable separation between clusters of the four Python
library categories and random source code files from mathematics libraries. However, certain
libraries exhibit clearer distinctions than others, suggesting more promising outcomes for the
prediction model.
The data visualization source code files do not display a clear separation from the
random source code cluster, as there are instances of overlap between them. Additionally,
within the data visualization cluster itself, there are regions of close proximity scattered
throughout, suggesting similarities either within the libraries or across different libraries. This
proximity indicates that certain attributes or coding patterns may be shared among files
within the same library or across different libraries, contributing to the observed clustering
patterns.
Figure 14. t-SNE clusters for data visualization libraries
The cluster representing web development libraries shows a distinct separation from the cluster containing random source code files, with minimal overlap observed between them. Within the web development cluster itself, points are more scattered, indicating lower similarity between the selected libraries. This dispersion suggests that each web development library exhibits unique characteristics and coding patterns, contributing to the overall differentiation observed in the t-SNE visualization.
Figure 15. t-SNE clusters for web development libraries
In the machine learning libraries cluster, the close proximity of points suggests
significant similarities in the source code among most of the libraries used. Meanwhile, the
random source code files cluster is sufficiently separated with minor overlapping observed.
This indicates distinct coding patterns between machine learning libraries and random source
code files, emphasizing the effectiveness of the t-SNE visualization in highlighting these differences.
Figure 16. t-SNE clusters for machine learning libraries
The NLP (Natural Language Processing) cluster exhibits a scattered distribution
similar to web development, with areas of high proximity suggesting shared attributes or
coding patterns within this cluster. However, the separation from the other clusters is distinct,
with minimal overlapping observed.
Figure 17. t-SNE clusters for NLP libraries

The reduction of dimensionality to two dimensions appears to yield valuable results,
effectively decreasing complexity and potentially improving accuracy for the classifiers
trained on each library. Notably, data visualization libraries exhibit higher overlap with
random source code files. It is important to acknowledge that since random source files are
randomly sampled to match the number of files in other library categories, the clusters
generated by t-SNE vary between different reductions. This variation leads to differing
proximities between clusters across different categories, thereby influencing the training
process of classifiers as well.
3.2.3 Individual classifiers per class
The classifiers selected for predicting each Python library are Random Forest models,
using the same parameters described in section 3.1.4.2. These classifiers are configured with
parameters including criterion, max_depth, max_features, min_samples_leaf,
min_samples_split, and n_estimators. These parameters control the splitting criterion, tree
depth, number of features considered for splits, minimum samples per leaf, minimum
samples for splits, and the number of trees in the forest, respectively. We allocated
approximately 80% of the dataset for training and 20% for testing.
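The sketch below illustrates this setup, assuming X_2d holds the two t-SNE features and y the binary labels (library category versus random files); the parameter grid mirrors the hyperparameters listed above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.2, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5],
    "n_estimators": [100, 200],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))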
For the data visualization library, the optimal parameters were found to be criterion='gini', max_depth=None, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, and n_estimators=200. This configuration achieved a best cross-validation score of 0.9568. The performance metrics indicated a precision of 0.70 for class 0 and 0.86 for class 1, with recall values of 0.89 and 0.63, respectively, resulting in an overall accuracy of 0.76.
The machine learning library classifier, configured with criterion='gini',
max_depth=20, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, and
n_estimators=100, achieved a best cross-validation score of 0.9597. It demonstrated a
precision of 0.75 for class 0 and 0.86 for class 1, and recall values of 0.88 and 0.71,
respectively, with an overall accuracy of 0.80.
For the NLP library, the best parameters were criterion='entropy', max_depth=20,
max_features='log2', min_samples_leaf=1, min_samples_split=5, and n_estimators=200,
achieving a best cross-validation score of 0.9656. The classifier showed lower precision and
recall, with values of 0.17 and 0.23 for precision, and 0.15 and 0.27 for recall, resulting in an
overall accuracy of 0.21. This indicates challenges in accurately classifying NLP library files.

The web development library classifier, optimized with criterion='entropy',
max_depth=None, max_features='log2', min_samples_leaf=1, min_samples_split=5, and
n_estimators=100, achieved a best cross-validation score of 0.9647. It showed strong
performance with a precision of 0.93 for class 0 and 0.82 for class 1, and recall values of 0.81
and 0.93, respectively, resulting in an overall accuracy of 0.87.
The training results for each classifier are presented below, with the data visualization
classifier indicating the lowest accuracy, while the other three classifiers show similar, higher
accuracy levels. The lower accuracy of the data visualization classifier aligns with the t-SNE
representation discussed in the previous section, where slight overlapping was identified
between the two clusters. The accuracy for the data visualization classifier might differ with a
different set of random source code files for the second class; however, we will proceed to
the ensemble stage and assess how this might affect the results.
Table 9. Python libraries classifiers training accuracy
Classifier              Accuracy  Precision  Recall  F1 Score
Data visualization RF   68%       70%        68%     68%
Web Development RF      83%       84%        83%     83%
NLP RF                  87%       87%        87%     87%
Machine Learning RF     83%       84%        83%     83%
3.2.4 Ensemble
The ensemble method was employed to address overfitting issues encountered in
previous classifiers and to enhance the accuracy and adaptability of our solution, especially
when incorporating additional Python libraries into the prediction model (Dietterich, 2000).
The ensemble method effectively combines the predictions from multiple classifiers trained
on distinct Python library categories, improving classification accuracy. By integrating
probabilistic assessments and normalizing contributions, the ensemble model provided a
robust framework for accurately categorizing source code snippets into their respective

domains.
Figure 18. Ensemble method process
In more detail, after training individual Random Forest classifiers for each Python
library category (data visualization, web development, machine learning, and NLP) an
ensemble method was applied to further refine the classification process. Each classifier was
loaded and utilized to predict the probabilities of code snippets belonging to its respective
category. For instance, the data visualization classifier predicted probabilities for both data
visualization-related snippets and random source code snippets. Similarly, the other
classifiers made predictions based on their specific categories, contributing to the overall
probabilistic assessment.

Figure 19. Aggregated probabilities results from classifiers
The ensemble method integrated these individual predictions by initializing an array
to aggregate the combined probabilities across all categories. This aggregation process
involved normalizing the probabilities to ensure each category's influence was proportionate,
thereby balancing the contributions from each classifier. After combining and normalizing
the probabilities, the final classification decision for each snippet was determined by
identifying the category with the highest probability in the aggregated array.
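A minimal sketch of this aggregation is shown below; it assumes each fitted classifier exposes predict_proba, that column 1 corresponds to its own library category, and that the variable names are illustrative rather than taken from our implementation.
import numpy as np

def ensemble_predict(classifiers, features, categories):
    # classifiers: dict mapping category name -> fitted binary classifier
    # features:    dict mapping category name -> that classifier's t-SNE features
    #              for the same set of code snippets
    n_snippets = next(iter(features.values())).shape[0]
    combined = np.zeros((n_snippets, len(categories)))
    for j, cat in enumerate(categories):
        proba = classifiers[cat].predict_proba(features[cat])
        combined[:, j] = proba[:, 1]  # probability of belonging to this category
    # Normalize so that every snippet's category probabilities sum to one.
    combined /= combined.sum(axis=1, keepdims=True) + 1e-12
    # The final label is the category with the highest aggregated probability.
    return [categories[i] for i in combined.argmax(axis=1)]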

This is depicted in Figure 20, which illustrates the distribution of labels for the source code files according to the highest probability classifier.
Figure 20. Voting results of ensemble method
The results of the ensemble method and prediction accuracies will be analyzed further
in Section 4.
3.3 Identify arrays in Python source code snippets
After achieving a low but sufficient accuracy in the task of recognizing Python library
types from source code, we furthered our research to tackle a more specific task: recognizing
the implementation of NumPy arrays in Python source code. This presented a significant
challenge, as the model needed to distinguish not only the presence of arrays in code snippets
compared to random files but also to identify this specific task within various types of source
code from a broader source code corpus. The complexity arose from the need to accurately
detect the usage of NumPy arrays within a source code corpus.
This part of the research involved the following challenges:
1. How will the initial datasets of array source code for training be formed to train the
classifier?
2. What will be the random files for the second class of the classifier?
3. What changes should be applied to the pre-processing process to help the model
identify arrays within the source code corpus?

In the following sections, we will discuss these challenges and evaluate the accuracy of the
training dataset. Figure 21 below outlines the adjusted methodology that will be discussed.
Figure 21. Adjusted Methodology for arrays identification
3.3.1 Training dataset composition
To construct the training datasets for recognizing NumPy array implementations in
Python source code, we employed a script generation approach using the Jinja2 templating
engine. This method allowed us to create a variety of code snippets with NumPy array
operations. The templates defined diverse structures and operations to simulate realistic
Python scripts, enhancing the robustness of our training data.
We utilized eight distinct templates to generate the code snippets, each representing
different coding structures such as functions, classes, and control flow statements. For
example, one template created a simple function that initializes and manipulates a NumPy
array, while another defined a class with multiple methods performing various operations on
a NumPy array.
The templates used are the following:

1. Simple Function Template: This template creates a basic function that initializes a
NumPy array and performs a series of operations on it.
from jinja2 import Template

template1 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    {{ operation1 }}
    {{ operation2 }}
    {{ operation3 }}
    return array
""")
2. Class with Methods Template: This template generates a class with an array as an
attribute and includes methods that perform operations on this array.
template2 = Template("""
import numpy as np

class {{ class_name }}:
    def __init__(self):
        self.array = np.array([{{ array_content }}])
    def {{ method1 }}(self):
        {{ operation1 }}
    def {{ method2 }}(self):
        {{ operation2 }}
    def {{ method3 }}(self):
        {{ operation3 }}
    def get_array(self):
        return self.array
""")
3. Conditional and Loop Template: This template includes a function with conditional
statements and loops that manipulate a NumPy array.
template3 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    if {{ condition }}:
        {{ operation1 }}
    else:
        {{ operation2 }}
    for _ in range({{ loop_count }}):
        {{ operation3 }}
    return array
""")
4. Nested Function Template: This template generates a function with nested helper
functions that manipulate a NumPy array.
template4 = Template("""
import numpy as np

def {{ inner_func_name1 }}(array):
    {{ inner_operation1 }}
    return array

def {{ inner_func_name2 }}(array):
    {{ inner_operation2 }}
    return array

def {{ outer_func_name }}():
    array = np.array([{{ array_content }}])
    array = {{ inner_func_name1 }}(array)
    array = {{ inner_func_name2 }}(array)
    return array
""")
5. Array Concatenation Template: This template creates a function that initializes two
NumPy arrays and concatenates them after performing operations.
template5 = Template("""
import numpy as np

def {{ func_name }}():
    array1 = np.array([{{ array_content1 }}])
    array2 = np.array([{{ array_content2 }}])
    {{ operation1 }}
    {{ operation2 }}
    result = np.concatenate((array1, array2))
    return result
""")
6. Class with Static and Instance Methods Template: This template defines a class with
both static and instance methods that perform operations on NumPy arrays.
template6 = Template("""
import numpy as np

class {{ class_name }}:
    def __init__(self, array):
        self.array = array
    @staticmethod
    def {{ static_method_name }}():
        array = np.array([{{ array_content }}])
        {{ static_operation }}
        return array
    def {{ instance_method_name }}(self):
        {{ instance_operation }}
        return self.array
    def process(self):
        self.array = self.{{ instance_method_name }}()
        return self.array
""")
7. Exception Handling Template: This template includes a function that performs
operations on a NumPy array within a try-except block.
template7 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    try:
        {{ try_operation }}
    except Exception as e:
        print(f"Error: {{ exception_message }}")
        {{ except_operation }}
    return array
""")
8. Looping Structure Template: This template creates a function with a while loop that
repeatedly performs an operation on a NumPy array.
template8 = Template("""
import numpy as np

def {{ func_name }}():
    array = np.array([{{ array_content }}])
    count = 0
    while count < {{ loop_count }}:
        {{ loop_operation }}
        count += 1
    return array
""")
To ensure diversity in the generated scripts, we incorporated random elements such as
function names, class names, method names, array contents, and operations. We used the
Faker library to generate random names and the numpy library to produce random array
contents. A selection of NumPy operations, such as array multiplication, sorting, and
trigonometric functions, was randomly applied to the arrays within the templates.
import numpy as np
import random
from faker import Faker

fake = Faker()

def generate_array_content():
    return ', '.join(map(str, np.random.randint(0, 100, size=random.randint(5, 15))))

def generate_operation(array_name='array'):
    operations = [
        f'{array_name} = {array_name} * 2',
        f'{array_name} = np.sqrt({array_name})',
        f'{array_name} = np.log({array_name} + 1)',
        f'{array_name} = np.sin({array_name})',
        f'{array_name} = np.sort({array_name})',
        # further operations omitted in the original excerpt
    ]
    return random.choice(operations)
Using the templates and the random content generation functions, we produced
multiple synthetic Python scripts. Each script was saved to a file, forming part of our training
dataset. This approach ensured that the dataset was sufficiently varied and realistic, providing
a robust basis for training our classifier.
templates = [template1, template2, template3, template4, template5,
             template6, template7, template8]

for i in range(5):  # Adjust number as needed
    template = random.choice(templates)
    script = template.render(
        func_name=generate_func_name(),
        class_name=generate_class_name(),
        method1=generate_method_name(),
        method2=generate_method_name(),
        method3=generate_method_name(),
        inner_func_name1=generate_func_name(),
        inner_func_name2=generate_func_name(),
        outer_func_name=generate_func_name(),  # used by template4
        static_method_name=generate_static_method_name(),
        instance_method_name=generate_method_name(),
        array_content=generate_array_content(),
        array_content1=generate_array_content(),
        array_content2=generate_array_content(),
        operation1=generate_operation(),
        operation2=generate_operation(),
        operation3=generate_operation(),
        inner_operation1=generate_operation('array'),
        inner_operation2=generate_operation('array'),
        static_operation=generate_operation('array'),
        instance_operation=generate_operation('self.array'),
        try_operation=generate_operation('array'),
        except_operation=generate_operation('array'),
        loop_operation=generate_operation('array'),
        condition=generate_condition(),
        loop_count=generate_loop_count(),
        exception_message=generate_exception_message()
    )
    with open(f'script_{i}.py', 'w') as f:
        f.write(script)
This templating approach allowed us to efficiently generate a large, diverse set of
training data, crucial for training a robust classifier capable of recognizing NumPy array
implementations. However, it also presented challenges, such as ensuring the generated
scripts were syntactically correct and representative of real-world code.

For the random code snippets used for the other class of the classifier, we utilized the
same dataset that was employed in section 3.6, which includes topics such as mathematics,
data structures, and algorithms. However, to ensure these snippets did not contain arrays, we
performed a keyword search within these files for terms like "array" or "np.array." This
allowed us to exclude any source code files that already included arrays, ensuring that the
random code snippets were free from NumPy array implementations.
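The exclusion step can be sketched as a simple keyword filter over the candidate random files, as below; the folder name random_math is hypothetical.
import os

def is_array_free(path):
    # Exclude any file that mentions arrays so the random class stays array-free.
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return "np.array" not in text and "array" not in text.lower()

random_candidates = [os.path.join("random_math", f) for f in os.listdir("random_math")]
array_free_files = [p for p in random_candidates if is_array_free(p)]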
In total, the dataset used for training the classifier consisted of 5000 files containing
array source code and 5000 random code files. The data pre-processing followed was similar to that of the previous experiments. We allocated approximately 80% of this dataset for
training purposes and reserved the remaining 20% for testing and evaluating the classifier's
performance.
3.3.2 Feature selection with weighted average
In the initial experiments of this approach, we followed the previously established
methodology by generating embeddings using Word2Vec. These embeddings were averaged
and then used to train both Random Forest and Support Vector Machine (SVM) classifiers.
However, the results from these experiments were not satisfactory, with the accuracy on the
training set being only 14%. This low accuracy indicated significant room for improvement
in our approach to recognizing NumPy array implementations in Python source code.
The low accuracy suggested that the embeddings generated by Word2Vec might not
have been sufficiently capturing the features of NumPy array operations within the code
snippets. Additionally, the complexity of distinguishing specific array operations from a
diverse set of source code files likely contributed to the poor performance. These findings
prompted us to re-evaluate our preprocessing steps, feature extraction methods, and the
overall model architecture to better address the challenges posed by this specific task.
In this process, we transformed a collection of code snippets into a weighted feature
matrix using the TF-IDF (Term Frequency-Inverse Document Frequency) technique.
Initially, each snippet was tokenized and normalized to ensure consistency. We then used the
TF-IDF vectorizer to convert the corpus into a term-document matrix, which quantifies the
importance of each token across all documents. To ensure all relevant tokens were included,
we identified and added any missing tokens as new columns with zero weights. Finally, we
calculated the average TF-IDF weight for each token across all documents, providing a

comprehensive representation of token importance within the corpus. This approach
facilitated the creation of a robust feature set for subsequent machine learning tasks.
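A condensed sketch of this weighting step with scikit-learn's TfidfVectorizer is given below; corpus is assumed to be the list of pre-processed code snippets, each re-joined into a single whitespace-separated string.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus: list of code snippets, each already tokenized and re-joined with spaces.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(corpus)  # term-document matrix

# Average TF-IDF weight of every token across all documents.
avg_weights = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
token_weights = dict(zip(vectorizer.get_feature_names_out(), avg_weights))
print(sorted(token_weights.items(), key=lambda kv: kv[1], reverse=True)[:10])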
The results of the weighting process suggest that features related to NumPy arrays are
assigned significant weights, with high values given to terms like "array" and "np." This
weighting emphasizes the importance of these terms in identifying NumPy array operations
within the source code corpus. Initially, each code snippet was tokenized and transformed
into a TF-IDF feature matrix, representing the relevance of each term across all documents.
Figure 22. TF-IDF feature matrix for np
Figure 23. TF-IDF feature matrix for arrays

4 Results summary and conclusions
This project had two main purposes. Firstly, to represent source code as vectors and
develop a machine learning classification model capable of identifying its primary
characteristics and architecture. Secondly, to extract information on libraries, frameworks,
and algorithm knowledge for further analysis. The main objective was to provide guidance
through the steps needed in the process of selecting a suitable approach that could produce
reliable skills predictions from source code snippets. For this purpose, data were extracted
from GitHub repositories of various well-known and professionally structured libraries. In
terms of machine learning techniques, neural networks, random forests, and SVM were
considered, with Word2Vec utilized for the embeddings generation. This chapter summarizes
the conclusions derived from the application of the models and their results, discusses the
limitations faced during the project’s execution, and concludes with some recommendations
for future research.
4.1 Experiments for programming language recognition
Table 10 below provides a concise overview of the outcomes derived from the classifier experiments for programming language recognition between Python and Java. The first experiments were conducted with random code files derived from GitHub repositories, such as httpx and jhipster. Notably, the accuracy rate from the training indicates potential overfitting
issues, alongside a lack of predictive success across other datasets. It is interesting to observe
that despite extensive data processing and dimensionality reduction applied to the tokenized
code files, certain prediction datasets exhibited a remarkably high accuracy, approaching
0.79. This anomaly could be attributed to the similarity between these datasets and the initial
training dataset of the overfitting classifier, thereby enabling it to predict with exceptional
accuracy in these particular instances.
Table 10. Results summary
Classifier                                           Accuracy  Precision  Recall  F1 Score  Prediction Accuracy
Neural Networks                                      0.46      0.33       0.33    0.33      -
RF with Embeddings Averaging (1st Training dataset)  0.99      0.99       0.99    0.99      -
RF with Embeddings Averaging (2nd Training dataset)  0.86      0.89       0.86    0.86      0.46
RF Grid search                                       1.00      1.00       1.00    1.00      0.46
RF Stop words removal                                1.00      1.00       1.00    1.00      0.46-0.79 depending on the dataset
RF Dimensionality reduction                          1.00      1.00       1.00    1.00      0.46-0.79 depending on the dataset, indicating high precision in some cases
SVM Grid Search                                      1.00      1.00       1.00    1.00      0.46
SVM Dimensionality reduction                         0.98      0.98       0.98    0.98      0.46-0.79 depending on the dataset
CodeBERTa                                            0.97      -          -       -         -
As analysed above, during the training process of the programming language recognition classifier it was evident that overfitting in our model necessitated identifying the proper inputs to enhance prediction accuracy and provide valuable results. Consequently, we
decided to explore the outcomes of predicting the programming languages of various well-
known Python and Java libraries and projects, and to analyse the resulting data.
Table 11 categorizes selected Java libraries based on their functionality. Web
development libraries have the highest number of analysed files, while the deeplearning4j
library stands out with the most tokens and unique tokens, emphasizing its comprehensive
content and structured files for machine learning experimentation.
Table 11: Java files for experiments

Data Visualization
  XChart (323 files; 77,976 tokens; 3,239 unique tokens): XChart is a simple, lightweight library for plotting data with just two lines of code.
  JFreeChart (993 files; 461,302 tokens; 9,296 unique tokens): JFreeChart is a free, open-source Java library for creating professional-quality charts with extensive features and support for various output formats.
Machine Learning
  deeplearning4j (3,243 files; 3,194,708 tokens; 40,391 unique tokens): Eclipse Deeplearning4j enables deep learning on the JVM, allowing Java model training and Python ecosystem interoperability.
  smile (1,027 files; 413,264 tokens; 7,657 unique tokens): Smile is a versatile Java library for machine learning and statistical analysis, providing a broad array of algorithms for tasks like clustering, classification, and regression.
Web development
  Spring Boot (3,004 files; 782,104 tokens; 33,946 unique tokens): Provides a simplified and rapid way to create production-ready applications. It reduces the need for extensive configuration.
  wicket (3,003 files; 581,964 tokens; 18,356 unique tokens): The 10th major release of Apache Wicket, built on Java 17, modernizes web development by providing a robust framework for creating contemporary web applications with Java.
NLP
  CoreNLP (2,038 files; 1,573,066 tokens; 39,883 unique tokens): CoreNLP is a comprehensive Java tool for natural language processing, providing various linguistic annotations for text.
  OpenNLP (1,009 files; 252,095 tokens; 7,518 unique tokens): OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.
Table 12 summarizes selected Python libraries chosen for experimentation, with
TensorFlow standing out in terms of the highest number of files, tokens, and unique tokens,
highlighting its extensive content and capabilities in machine learning.
Table 12: Python files for experiments

Data Visualization
  Seaborn (138 files; 17,784 tokens; 7,636 unique tokens): Seaborn is a Python library based on matplotlib for creating attractive statistical graphics.
  Plotly (1,116 files; 549,164 tokens; 12,023 unique tokens): Plotly is a versatile, open-source Python library for creating interactive and visually appealing plots and dashboards.
Machine Learning
  xgboost (174 files; 109,391 tokens; 6,019 unique tokens): XGBoost is a highly efficient, flexible, and portable gradient boosting library for fast and accurate machine learning, supporting major distributed environments like Hadoop and MPI.
  tensorflow (2,743 files; 3,467,115 tokens; 86,681 unique tokens): TensorFlow is an open-source machine learning framework developed by Google, widely used for building and training machine learning models.
Web development
  pyramid (137 files; 164,013 tokens; 7,503 unique tokens): Pyramid simplifies web application development with a minimal "hello world" setup that scales effortlessly as your project expands, offering advanced features for building complex software efficiently.
  tornado (99 files; 132,794 tokens; 75 unique tokens): Tornado is a scalable Python web framework and asynchronous networking library, ideal for long-lived connections using non-blocking I/O.
NLP
  nltk (294 files; 371,158 tokens; 17,186 unique tokens): NLTK is a leading Python platform for human language data, offering interfaces to over 50 corpora, text processing libraries, NLP wrappers, and an active discussion forum.
  TextBlob (31 files; 2,158 tokens; 1,889 unique tokens): TextBlob is a Python library offering a simple API for common NLP tasks like part-of-speech tagging, noun phrase extraction, sentiment analysis, and classification.
Experiments for language identification were conducted using various combinations
of library categories. Initially, models were tested in pairs within both Python and Java
libraries, followed by testing across all four combined libraries. Results indicate that Random
Forest achieved slightly better accuracy compared to SVM. While overall accuracy was not
sufficient, the language recognition algorithm showed improved results for data visualization
and machine learning libraries. However, improvements are needed for NLP and web
development libraries as presented in Table 13. The high recall results indicate that our
classifiers effectively recognize Python language in source code snippets, particularly in NLP
library categories, more so than Java code snippets.
Table 13: Experiments results of identifying the library's language
Category            Library                   Classifier  Accuracy  Precision  Recall  F1 score
Data Visualization  Seaborn & xchart          RF          86%       68%        99%     80%
                                              SVM         30%       30%        100%    46%
                    JfreeChart & Plotly       RF          55%       54%        95%     69%
                                              SVM         53%       53%        100%    69%
                    All                       RF          57%       53%        100%    69%
                                              SVM         48%       48%        100%    65%
Machine Learning    deeplearning4j & xgboost  RF          6%        4%         82%     8%
                                              SVM         5%        5%         100%    10%
                    smile & tensorflow        RF          68%       72%        92%     81%
                                              SVM         73%       73%        100%    84%
                    All                       RF          41%       40%        91%     56%
                                              SVM         41%       40%        100%    58%
Web Development     Spring boot & pyramid     RF          6%        3%         68%     6%
                                              SVM         4%        4%         100%    10%
                    smile & tensorflow        RF          16%       3%         86%     6%
                                              SVM         3%        3%         100%    6%
                    All                       RF          12%       4%         97%     7%
                                              SVM         4%        4%         100%    7%
NLP                 CoreNLP & nltk            RF          27%       15%        100%    26%
                                              SVM         13%       13%        100%    22%
                    openNLP & TextBlob        RF          5%        3%         100%    6%
                                              SVM         3%        3%         100%    6%
                    All                       RF          11%       10%        100%    18%
                                              SVM         10%       10%        100%    18%
All                 All                       RF          31%       25%        92%     39%
                                              SVM         24%       24%        100%    39%
As the results indicate, the algorithm needs further training with files related to web
development and NLP libraries to improve accuracy levels. Additionally, the structure or
content of these libraries might be affecting the model's accuracy.
4.2 Experiments for python library type recognition
The accuracy of the separated classifiers for each Python library type, as discussed in
section 3.2.3, shows that the NLP Random Forest classifier indicated the highest
performance, while data visualization resulted in low accuracy. This low accuracy can be
explained by the t-SNE clusters discussed in section 3.2.2, where the data visualization
source code files do not display a clear separation from the random source code cluster,
leading to instances of overlap between them. Additionally, the proximity of points within the
data visualization cluster is not consistent across all points. This inconsistency suggests a
wide range of operations and characteristics among the different data visualization libraries
used for training, which are not shared and do not facilitate uniform training of the classifier.
Table 14: Ensemble method results
Experiment No  Combination         Accuracy  Precision  Recall   F1-score
1              Random              70.8%     16.0%      100.0%   28.0%
               ML                  0.0%      0.0%       0.0%     0.0%
               Data visualization  3.2%      100.0%     4.0%     8.0%
               Ensemble            13.0%
2              Random              54.2%     24.0%      54.0%    33.0%
               ML                  10.3%     100.0%     10.0%    19.0%
               Web Development     66.3%     64.0%      66.0%    65.0%
               Ensemble            51.0%
3              Random              54.2%     7.0%       54.0%    12.0%
               ML                  2.6%      1.0%       3.0%     2.0%
               NLP                 34.7%     85.0%      35.0%    49.0%
               Ensemble            33.0%
4              Random              25.0%     25.0%      54.0%    34.0%
               Data visualization  6.5%      88.0%      8.0%     14.0%
               Web Development     92.9%     56.0%      89.0%    69.0%
               Ensemble            48.0%
5              Random              25.0%     3.0%       25.0%    5.0%
               Data visualization  0.0%      0.0%       0.0%     0.0%
               NLP                 38.2%     58.0%      38.0%    46.0%
               Ensemble            30.0%
6              Random              4.2%      3.0%       4.2%     5.0%
               Web Development     83.7%     36.0%      79.0%    50.0%
               NLP                 45.9%     87.0%      35.0%    50.0%
               Ensemble            51.0%
7              Random              66.7%     11.0%      54.0%    18.0%
               ML                  0.0%      0.0%       0.0%     0.0%
               Data visualization  3.2%      100.0%     2.0%     4.0%
               Web Development     80.6%     58.0%      68.0%    63.0%
               Ensemble            39.0%
8              Random              37.5%     11.0%      2.0%     18.0%
               ML                  2.6%      9.0%       11.0%    10.0%
               Data visualization  13.9%     39.0%      14.0%    21.0%
               NLP                 32.4%     70.0%      31.0%    43.0%
               Ensemble            32.0%
9              Random              0.0%      0.0%       0.0%     0.0%
               ML                  2.6%      8.0%       23.0%    12.0%
               Web Development     5.1%      9.0%       11.0%    10.0%
               NLP                 20.2%     85.0%      22.0%    35.0%
               Ensemble            15.0%
10             Random              45.8%     39.0%      14.0%    21.0%
               Data visualization  47.3%     42.0%      21.0%    28.0%
               Web Development     5.1%      1.0%       8.0%     3.0%
               NLP                 16.7%     12.0%      26.0%    17.0%
               Ensemble            21.0%
12             Random              4.2%      1.0%       4.0%     1.0%
               ML                  12.8%     5.0%       13.0%    7.0%
               Data visualization  1.1%      1.0%       1.0%     1.0%
               Web Development     39.8%     25.0%      40.0%    30.0%
               NLP                 26.5%     67.0%      27.0%    38.0%
               Ensemble            26.0%

We used an ensemble method to run multiple combinations of libraries to evaluate the
results and accuracy of our model. Although the overall performance was not very high, some
valuable outcomes can be discussed. Table 14 above presents the accuracies for all
combinations of the ensemble method.
For the ensemble test, the following libraries were used:
Machine Learning: H2O.ai, TPOT
NLP: SpaCy, StanfordNLP
Web Development: Responder, Falcon
Data Visualization: VisPy, HoloViews
Random Files: mpmath, bitarray
Although the NLP classifier demonstrated the highest accuracy during training, the
web development classifier exhibited superior performance in the prediction phase. This
improved performance was particularly notable when the web development classifier was
combined with the data visualization classifier, potentially due to the operational similarities
between these libraries. A similar behaviour was observed with the combination of the web
development and NLP classifiers, which achieved the highest ensemble accuracy.
However, an interesting behaviour was observed with the combination of the web
development classifier with the ML and NLP classifiers, which resulted in almost the lowest
accuracy (0.15). This outcome may be due to the high similarities between ML and NLP
operations and possible similarities with the random files as well, although this was not
implied by the t-SNE clusters analysis.
The data visualization classifier indicated the highest accuracy when combined with
the web development and NLP classifiers (0.47), even though these classifiers do not
individually offer high accuracy in combination. The lowest accuracy was observed when the
ML and data visualization classifiers were combined, where primarily the random files were
predicted correctly.
The performance of the web development classifier can be explained by the well-
scattered clusters observed in the t-SNE visualization. These clusters indicate disparity within

the training dataset, allowing the classifier to learn unique characteristics of each library. This
diversity in training contributes to the classifier's efficiency in making predictions.
4.3 Experiments for python arrays recognition
During the training of random forest classifiers on NumPy arrays, we observed an
accuracy of 1, indicating potential overfitting, despite the inclusion of weighted token
features for arrays. However, when applying the classifier to predict on files containing
mathematical libraries—comprising both arrays and unrelated random files—the accuracy
dropped significantly to 0.04. This outcome highlights the considerable challenge in
accurately identifying arrays within extensive source code datasets.

5 Conclusions, limitations and future research
5.1 Conclusions
A major finding of this research is that machine learning methods combined with
embedding techniques like Word2Vec can effectively estimate programming language skills,
library types, and operation usability using data from GitHub repositories. This is achievable
provided that the source code fulfills certain prerequisite conditions to ensure the diversity of
features for training and the quality of tokens.
It was also observed that neural network techniques were not appropriate in this case,
as they provided low accuracy and showed limited improvement with parameterization. On
the other hand, Random Forest and SVM, although prone to overfitting during the training
process of some applications, yielded valuable results for further research, especially after
grid search for optimal parameterization. An interesting finding was that data visualization
libraries indicated minimal scatter of features and limited training abilities for classifiers.
Conversely, web development libraries proved more suitable for this type of application,
offering higher accuracy compared to other libraries. This is a significant finding for the
software development community.
When examining the four individual Python libraries under discussion, the results
vary depending on the task at hand. The estimated accuracy in recognizing programming
languages between Java and Python was higher for source code snippets related to data
visualization operations and significantly lower for web development operation source code
files. This discrepancy could be attributed to the differing feature selection methodologies
and dimensionality reduction techniques employed between the two prediction models, with
weights assigned to different features of the source code.
It is also worth mentioning that, for the individual Python library type classifiers, all
categories indicated accuracy in the training set higher than 0.80. This suggests that with a
more concise and well-structured training dataset, the accuracy could be improved
substantially. On the other hand, the effort to recognize array operations in source code—a
very specific and widely used application in the programming and software development
community—yielded insufficient results. This underscores the need for annotated datasets
focused on source code skills and operations to enhance machine learning predictions and
applications in this domain.

In summary, it was not possible to clearly identify the most reliable approach for our
predictions. Significant discrepancies in accuracy results indicate the need for further
experimentation to build a robust framework for this type of application. Fortunately, these
challenges, created by the lack of high-quality annotated datasets of source code, did not
hinder the achievement of the principal objectives of the project, which were to describe in
detail the tokenization and embeddings process of source code for skill identification
predictions. The objectives were successfully met since we incorporated raw data from well-
known GitHub repositories, described the current knowledge based on previous applications,
applied the necessary data preprocessing to generate a valuable training input dataset, and
interpreted the findings in a way that can offer guidance on the potentials, prerequisites, and
limitations of these approaches.
5.2 Limitations
Several challenges arose during this project. One major hurdle was the lack of
annotated datasets for identifying skills, making it hard to find enough files with the right
content and structure to train our classifiers effectively. This led to a relatively small sample
size, which might have affected the accuracy of our measurements. Moreover, although
GitHub provided around 4000 code snippets for detecting library types, gathering, analyzing,
and preparing these snippets took a lot of time and resources. To expand our dataset, we
created a Python program to generate template files containing arrays with different
operations. However, this approach risked making our input files too similar, potentially
causing our model to be too specialized and less adaptable.
Another limitation was the limited availability and quality of data, which could affect how well our application works in different situations. While our results gave us useful insights, they also
showed that our model struggled when we tried to predict new source code files, making it
hard to identify programming languages accurately. This highlights the need for more
training of our classifiers, possibly using more advanced techniques to extract features.
Additionally, a significant challenge was the similarity between features in libraries
used for data visualization, machine learning (ML), and natural language processing (NLP).
This similarity made it harder for our classifiers to make accurate predictions. To address
this, we need to use a wider range of data files and Python libraries during training to
improve the accuracy of our ensemble method classifiers.

Furthermore, another limitation was the lack of extensive research in skills
identification. There wasn't much existing literature or established methods that we could
build on or compare our approach to, although this also made our project unique. Developing
and testing new methods required a lot of experimentation and time, which increased the
demand for resources.
Expanding the sample size and refining the specification of features within the source
code would significantly enhance the academic rigor of this project. Nonetheless, the study
conducted a thorough analysis and comparisons among different models. Despite the
aforementioned limitations, the findings offer valuable insights and can serve as a roadmap
for researchers aiming to develop procedures for identifying programming languages, library
types, and specific operations within large software development repositories.
5.3 Recommendations for future research
In light of the insights gained from this study and considering the technical challenges
encountered during its implementation, the following recommendations for further research
are proposed. To deepen the understanding of skill identification in programming languages,
it is imperative to expand the sample size and enhance the breakdown of variables related to
source code features and operations. A broader dataset encompassing diverse programming
languages and a more detailed categorization of operations within source code snippets would
facilitate a more comprehensive analysis and validation of our classifiers' predictions.
Future studies could leverage updated repositories and libraries from platforms like
GitHub, ensuring a richer diversity of source code examples for training and testing
classifiers. Moreover, exploring advanced techniques such as deep learning for source code
embeddings and applying ensemble methods could potentially enhance the accuracy and
robustness of the classifiers in identifying specific programming skills and operations.
Furthermore, the integration of additional features such as semantic parsing or
syntactic analysis could refine the classifiers' ability to discern subtle differences in
programming tasks and functionalities. This approach would contribute to overcoming the
challenges of feature similarity across different libraries and languages, as highlighted in our
study.
Finally, the ongoing evolution of programming practices and the emergence of new
libraries necessitate continuous updates and adaptations in the methodologies used for skill

identification in source code. By integrating these advancements into future research
endeavors, we can further advance the field of automated skill detection in programming
languages, thereby supporting developers and researchers in navigating the complexities of
modern software development.

6 References
1. Allamanis, M. and Sutton, C., 2014, November. Mining idioms from source code. In
Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of
Software Engineering (pp. 472-483).
2. Alon, U., Zilberstein, M., Levy, O. and Yahav, E., 2019. code2vec: Learning distributed
representations of code. Proceedings of the ACM on Programming Languages, 3(POPL),
pp.1-29.
3. Alreshedy, K., Dharmaretnam, D., German, D.M., Srinivasan, V. and Gulliver, T.A.,
2018. SCC: automatic classification of code snippets. arXiv preprint arXiv:1809.07945.
4. Baquero, J.F., Camargo, J.E., Restrepo-Calle, F., Aponte, J.H. and González, F.A., 2017.
Predicting the programming language: Extracting knowledge from stack overflow posts.
In Advances in Computing: 12th Colombian Conference, CCC 2017, Cali, Colombia,
September 19-22, 2017, Proceedings 12 (pp. 199-210). Springer International Publishing.
5. Bengio, Y., Ducharme, R., & Vincent, P. (2003). A neural probabilistic language model.
Journal of Machine Learning Research, 3, 1137-1155.
6. Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197-227.
7. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of
Machine Learning Research, 3(Jan), 993-1022.
8. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with
subword information. Transactions of the Association for Computational Linguistics, 5,
135-146.
9. Buratti, L., Pujar, S., Bornea, M., McCarley, S., Zheng, Y., Rossiello, G., Morari, A.,
Laredo, J., Thost, V., Zhuang, Y. and Domeniconi, G., 2020. Exploring software
naturalness through neural language models. arXiv preprint arXiv:2006.12641.
10. Causa, O., Abendschein, M., Luu, N., Soldani, E. and Soriolo, C., 2022. The post-
COVID-19 rise in labour shortages.
11. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE:
Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research,
16, 321-357.
12. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. P.
(2011). Natural language processing (almost) from scratch. CoRR abs/1103.0398.

13. Cosentino, V., Luis, J. and Cabot, J., 2016, May. Findings from GitHub: methods,
datasets and limitations. In Proceedings of the 13th International Conference on Mining
Software Repositories (pp. 137-141).
14. da Silva, J.R., Clua, E., Murta, L. and Sarma, A., 2015, March. Niche vs. breadth:
Calculating expertise over time through a fine-grained analysis. In 2015 IEEE 22nd
international conference on software analysis, evolution, and reengineering (SANER) (pp.
409-418). IEEE.
15. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. arXiv preprint
arXiv:1810.04805.
16. Dietrich, J., Luczak-Roesch, M. and Dalefield, E., 2019, May. Man vs Machine–A Study
into Language Identification of Stack Overflow Code Snippets. In 2019 IEEE/ACM 16th
International Conference on Mining Software Repositories (MSR) (pp. 205-209). IEEE.
17. Dietterich, T.G., 2000, June. Ensemble methods in machine learning. In International
workshop on multiple classifier systems (pp. 1-15). Berlin, Heidelberg: Springer Berlin
Heidelberg.
18. Dumais, S. T. (2004). Latent semantic analysis. Annual Review of Information Science
and Technology (ARIST), 38, 189-230.
19. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T.,
Jiang, D., & Zhou, M. (2020). CodeBERT: A pre-trained model for programming and
natural languages. arXiv preprint arXiv:2002.08155.
20. Gilda, S., 2017, July. Source code classification using Neural Networks. In 2017 14th
international joint conference on computer science and software engineering (JCSSE)
(pp. 1-6). IEEE.
21. Gholamian, S. and Ward, P.A., 2021, May. On the naturalness and localness of software
logs. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories
(MSR) (pp. 155-166). IEEE.
22. Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s
negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
23. Gousios, G., 2013, May. The GHTorent dataset and tool suite. In 2013 10th Working
Conference on Mining Software Repositories (MSR) (pp. 233-236). IEEE.
24. Greene, G.J. and Fischer, B., 2016, August. Cvexplorer: Identifying candidate developers
by mining and exploring their open source contributions. In Proceedings of the 31st
IEEE/ACM International Conference on Automated Software Engineering (pp. 804-809).

25. Hauff, C. and Gousios, G., 2015, May. Matching GitHub developer profiles to job
advertisements. In 2015 IEEE/ACM 12th Working Conference on Mining Software
Repositories (pp. 362-366). IEEE.
26. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering, 21(9), 1263-1284.
27. Hellendoorn, V.J., Devanbu, P.T. and Alipour, M.A., 2018, October. On the naturalness
of proofs. In Proceedings of the 2018 26th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering
(pp. 724-728).
28. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural
Computation, 9(8), 1735–1780.
29. Hong, J., Mizuno, O. and Kondo, M., 2019, December. An empirical study of source code
detection using image classification. In 2019 10th International Workshop on Empirical
Software Engineering in Practice (IWESEP) (pp. 1-15). IEEE.
30. Huang, W., Mo, W., Shen, B., Yang, Y. and Li, N., 2016. CPDScorer: Modeling and
Evaluating Developer Programming Ability across Software Communities. In SEKE (pp.
87-92).
31. Hyrynsalmi, S.M., Rantanen, M.M. and Hyrynsalmi, S., 2021, June. The war for talent in
software business-how are finnish software companies perceiving and coping with the
labor shortage?. In 2021 IEEE International Conference on Engineering, Technology and
Innovation (ICE/ITMC) (pp. 1-10). IEEE.
32. Javid, A. M., Das, S., Skoglund, M., & Chatterjee, S. (2021, June). A ReLU dense layer
to improve the performance of neural networks. In ICASSP 2021-2021 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp.
2810-2814). IEEE.
33. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016).
Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
34. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M. and Damian, D.,
2014, May. The promises and perils of mining github. In Proceedings of the 11th working
conference on mining software repositories (pp. 92-101).
35. Kang, H.J., Bissyandé, T.F. and Lo, D., 2019, November. Assessing the generalizability
of code2vec token embeddings. In 2019 34th IEEE/ACM International Conference on
Automated Software Engineering (ASE) (pp. 1-12). IEEE.

36. Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future
directions. Progress in Artificial Intelligence, 5(4), 221-232.
37. LeClair, A., Eberhart, Z. and McMillan, C., 2018, September. Adapting neural text
classification for improved software categorization. In 2018 IEEE international
conference on software maintenance and evolution (ICSME) (pp. 461-472). IEEE.
38. Li, S., & Gong, B. (2021). Word embedding and text classification based on deep
learning methods. In MATEC Web of Conferences (Vol. 336, p. 06022). EDP Sciences.
39. Ligu, E., Chaikalis, T. and Chatzigeorgiou, A., 2013. BuCo Reporter: Mining Software
and Bug Repositories.
40. Liu, Q., & Zhang, H. (2017). Combining Word2Vec and t-SNE to enhance the
interpretability of text data. Proceedings of the International Conference on Data Mining
and Applications, 125-130.
41. Maaten, L.J.P. van der, Postma, E.O., & Herik, H.J. van den. (2009). Dimensionality
reduction: A comparative review. Journal of Machine Learning Research, 10, 66-71.
42. Mandelbaum, A., & Shalev, A. (2016). Word embeddings and their use in sentence
classification tasks. arXiv preprint arXiv:1610.08229.
43. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word
Representations in Vector Space. arXiv preprint arXiv:1301.3781.
44. Moguerza, J. M., & Muñoz, A. (2006). Support vector machines with applications.
45. Ohashi, H. and Watanobe, Y., 2019, October. Convolutional neural network for
classification of source codes. In 2019 IEEE 13th international symposium on embedded
multicore/many-core systems-on-chip (MCSoC) (pp. 194-200). IEEE.
46. Oliveira, J., Souza, M., Flauzino, M., Durelli, R. and Figueiredo, E., 2022, September.
Can source code analysis indicate programming skills? a survey with developers. In
International Conference on the Quality of Information and Communications Technology
(pp. 156-171). Cham: Springer International Publishing.
47. Oliveira, J., Viggiato, M. and Figueiredo, E., 2019, October. How well do you know this
library? Mining experts from source code analysis. In Proceedings of the XVIII Brazilian
Symposium on Software Quality (pp. 49-58).
48. Parmar, A., Katariya, R., & Patel, V. (2019). A review on random forest: An ensemble
classifier. In International conference on intelligent data communication technologies and
internet of things (ICICI) 2018 (pp. 758-763). Springer International Publishing.
49. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word
Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 1532-1543.
50. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer,
L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
51. Rahman, M., Palani, D. and Rigby, P.C., 2019. Natural Software Revisited. Department
of Computer Science and Software Engineering, Concordia University, Montréal,
Québec, Canada.
52. Ray, B., Hellendoorn, V., Godhane, S., Tu, Z., Bacchelli, A. and Devanbu, P., 2016, May.
On the" naturalness" of buggy code. In Proceedings of the 38th International Conference
on Software Engineering (pp. 428-439).
53. Ugurel, S., Krovetz, R. and Giles, C.L., 2002, July. What's the code? Automatic
classification of source code archives. In Proceedings of the eighth ACM SIGKDD
international conference on Knowledge discovery and data mining (pp. 632-638).
54. Saabith, A.S., Vinothraj, T. and Fareez, M., 2020. Popular python libraries and their
application domains. International Journal of Advance Engineering and Research
Development, 7(11).
55. Schmidt, J.A., Bourdage, J.S., Lukacik, E.R. and Dunlop, P.D., 2024. The Role of Time,
Skill Emphasis, and Verifiability in Job Applicants’ Self-Reported Skill and Experience.
Journal of Business and Psychology, 39(1), pp.67-82.
56. Sridhara, G., Sinha, V.S. and Mani, S., 2015, February. Naturalness of natural language
artifacts in software. In Proceedings of the 8th India Software Engineering Conference
(pp. 156-165).
57. Sundermeyer, M., Schlüter, R., & Ney, H. (2012, September). LSTM neural networks for
language modeling. In Interspeech (Vol. 2012, pp. 194-197).
58. Van Dam, J.K. and Zaytsev, V., 2016, March. Software language identification with
natural language classifiers. In 2016 IEEE 23rd international conference on software
analysis, evolution, and reengineering (SANER) (Vol. 1, pp. 624-628). IEEE.
59. Van Der Maaten, L., Postma, E., & Van den Herik, J. (2009). Dimensionality reduction: A
comparative review. Journal of Machine Learning Research, 10, 66-71.
60. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9(Nov), 2579-2605.
61. Vora, P., Khara, M., & Kelkar, K. (2017). Classification of tweets based on emotions
using word embedding and random forest classifiers. International Journal of Computer
Applications, 178(3), 1-7.
62. Wang, L. (Ed.). (2005). Support vector machines: Theory and applications (Vol. 177).
Springer Science & Business Media.
63. Watanobe, Y., Rahman, M.M., Amin, M.F.I. and Kabir, R., 2023. Identifying algorithm
in program code based on structural features using CNN classification model. Applied
Intelligence, 53(10), pp.12210-12236.
64. World Economic Forum, The Future of Jobs: Employment, Skills and Workforce
Strategy for Fourth Industrial Revolution, ser. Global Challenge Insight Report. Geneva,
Switzerland: World Economic Forum, Jan. 2016.