
Journal of Network and Computer Applications 218 (2023) 103704

Contents lists available at ScienceDirect

Journal of Network and Computer Applications


journal homepage: www.elsevier.com/locate/jnca

API-MalDetect: Automated malware detection framework for Windows based on API calls and deep learning techniques
Pascal Maniriho ∗, Abdun Naser Mahmood, Mohammad Jabed Morshed Chowdhury
Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC, Australia

ARTICLE INFO

MSC:
00-01
99-00

Keywords:
Malware analysis
Malware detection
Dynamic analysis
Convolutional neural network
API calls
Machine learning
Deep learning

ABSTRACT

This paper presents API-MalDetect, a new deep learning-based automated framework for detecting malware attacks in Windows systems. The framework uses an NLP-based encoder for API calls and a hybrid automatic feature extractor based on convolutional neural networks (CNNs) and bidirectional gated recurrent units (BiGRU) to extract features from raw and long sequences of API calls. The proposed framework is designed to detect unseen malware attacks and prevent performance degradation over time or across different rates of exposure to malware by reducing temporal bias and spatial bias during training and testing. Experimental results show that API-MalDetect outperforms existing state-of-the-art malware detection techniques in terms of accuracy, precision, recall, F1-score, and AUC-ROC on different benchmark datasets of API call sequences. These results demonstrate that the ability to automatically identify unique and highly relevant patterns from raw and long sequences of API calls is effective in distinguishing malware attacks from benign activities in Windows systems using the proposed API-MalDetect framework. API-MalDetect is also able to show cybersecurity experts which API calls were most important in malware identification. Furthermore, we make our dataset available to the research community.

1. Introduction

As Internet-based applications continue to shape various businesses around the globe, malware threats have become a severe problem for computing devices such as desktop computers, smartphones, local servers, and remote servers. According to statistics, it is expected that in this year (2023) the total number of devices connected to IP networks will be around 29.3 billion (Cisco, 2020), resulting in a massive interconnection of various networked devices globally. As the number of connected devices continues to rise exponentially, this has also become a motivating factor for cyber-attackers to develop new advanced malware programs that disrupt, steal sensitive data, damage, and exploit various vulnerabilities. The widespread use of different malware variants makes the existing security systems less effective, whereby millions of devices are infected by various forms of malware such as worms, ransomware, backdoors, computer viruses, and Trojans (Jovanovic, 2022; Maniriho et al., 2022). Accordingly, there has been a significant increase in new malware targeting Windows devices over the last decade. For instance, the number of reported malware increased by 23% (9.5 million) (Drapkin, 2022) from 2020 to 2021. About 107.27 million new malware samples were created to compromise Windows devices in 2021, an increase of 16.53 million samples over 2020 (with an average of 328,073 malware samples produced daily) (Drapkin, 2022). According to the AtlasVPN report (Ruth, 2023), more than 95% of all malware attacks were against Windows desktop devices in 2022.

As a solution to address malware attacks, the application of signature-based malware detection systems such as anti-virus programs that rely on a database of signatures extracted from previously identified malware samples has become popular (Anon, 2023e). In static malware analysis, signatures are malware's unique identities which are extracted from malware without executing the suspicious program (Zhang et al., 2019; Naik et al., 2021). Some of the static-based malware detection techniques were implemented using static signatures such as printable strings, opcode sequences, and static API calls (Singh and Singh, 2020; Sun et al., 2019; Huda et al., 2016). As signature-based malware detection systems rely on previously seen signatures to detect malware threats, they have become ineffective due to the huge number of new malware variants coming out every day (Alazab et al., 2010). Moreover, static-based techniques are unable to detect obfuscated malware (malware with evasion behaviors) (Zelinka and Amer, 2019; Anon, 2019). Such obfuscated malware include Agent Tesla, BitPaymer, Zeus Panda, and Ursnif, to name a few (Anon, 2022a).

∗ Corresponding author.
E-mail address: [email protected] (P. Maniriho).

https://doi.org/10.1016/j.jnca.2023.103704
Received 15 April 2023; Received in revised form 20 June 2023; Accepted 16 July 2023
Available online 22 July 2023
1084-8045/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

In contrast to static analysis-based techniques, dynamic analysis is developed to counter obfuscation techniques. Dynamic or behavior-based malware detection techniques are implemented based on the dynamic malware analysis approach, which allows monitoring the suspicious program's behaviors and vulnerability exploits by executing it in a virtual environment (Udayakumar et al., 2017). Dynamic analysis can reveal malware's behavioral characteristics such as running processes, registry key changes, web browsing history (such as DNS queries), malicious IP addresses, loaded DLLs, API calls, and changes in the file system (Han et al., 2019; Maniriho et al., 2022). Vemparala et al.'s work (Vemparala et al., 2019) demonstrated that the dynamic-based malware detection technique outperforms the static-based technique in many cases, after comparing their performance using extracted dynamic API calls and static sequences of opcode features. The dynamic analysis approach can produce relevant feature representations that reveal what a malware intends to perform in the victim's system (Suaboot et al., 2020; Maniriho et al., 2022). As benign software programs evolve in many cases (e.g., when new vulnerabilities are discovered or new services are created), it is important to mention that malicious software programs (malware) also evolve at the same pace, i.e., the behaviors of malware change over time. Therefore, new features such as API call functions that can be relevant to developing more accurate and robust malware detection techniques could emerge from these new malware behaviors, which creates the need to perform regular analysis of emerging malware in order to capture new features. Nevertheless, extracting dynamic/behavioral features from malware executable files is a critical task, as malware can damage organizational resources such as corporate networks and confidential information when it escapes the analysis environment (Suaboot et al., 2020; Maniriho et al., 2022). This makes the extraction of up-to-date feature representations of malware a challenging task (Mimura and Ito, 2022; Suaboot et al., 2020). In the case of API calls, it also remains a challenge to obtain relevant API call features, as the sequences of API calls made by malware executable files are relatively long, which makes their processing difficult (Suaboot et al., 2020).

Given the potential of dynamic malware analysis, this work focuses on analyzing dynamic-based API call sequences extracted from Windows executable files to identify malware attacks. Existing techniques for dynamic-based malware detection have used machine learning (ML) algorithms, which learn from given data and make predictions on new data. Over the last decade, the use of ML-based algorithms has become more prevalent in the field of cybersecurity, including malware detection (Apruzzese et al., 2018; Gibert et al., 2020). However, conventional ML techniques rely on a manual feature extraction and selection process, which requires human expert domain knowledge to derive relevant or high-level patterns (features) to represent a set of malware and benign files (Gibert et al., 2022; Le et al., 2018). This process, known as manual feature engineering, is time-consuming and error-prone, considering the current plethora of malware production. On the other hand, deep learning (DL) algorithms have also emerged for malware detection (Bostami and Ahmed, 2020; Li et al., 2022a). Different from conventional ML techniques, DL algorithms can perform automatic feature extraction (Tirumala et al., 2020). Nonetheless, the majority of existing ML and DL-based techniques operate as black boxes (Moraffah et al., 2020; Mehrabi et al., 2019; Maniriho et al., 2022). These models receive input 𝑋 which is processed through a series of complex operations to produce 𝑌 as the predicted outcome. Unfortunately, such operations cannot be interpreted by humans, as they fail to provide human-friendly insights and explanations (for example, which features contributed to the final predicted outcome) (Moraffah et al., 2020; Mehrabi et al., 2019).

Practically, it is ideal to have an ML or DL-based malware detection technique that can detect the presence of malicious files with high detection accuracy. However, the prediction of such a technique should not be blindly trusted; instead, it is important to have confidence about the features or attributes that contributed to the prediction. By using explainable modules, researchers and security analysts can derive more insights from the detection techniques/models and understand the logic behind the final model's predictions (Ribeiro et al., 2016). Therefore, to address the inefficiency observed in the existing malware detection techniques, we propose API-MalDetect, a new DL-based automated framework for detecting malware attacks in Windows. The motivation for using deep learning is to automatically identify unique and highly relevant patterns from raw and long sequences of API calls which distinguish malware attacks from benign activities. API-MalDetect uses an encoder based on natural language processing (NLP) techniques to construct numerical representations and embedding vectors of API call sequences based on their semantic relationships. It also uses an automatic hybrid feature extractor based on a convolutional neural network (CNN) and bidirectional gated recurrent unit (BiGRU). Although CNN has the potential to extract high-level text-based features, it extracts local features (API calls in our case) which lack contextual semantic information between them due to the limitation of the sliding filter (sliding window). To address this problem, we combine CNN and BiGRU to improve the feature extraction process. Specifically, BiGRU receives local features of API calls extracted by CNN and processes them to capture more contextual semantic information between them. Hence, combining CNN and BiGRU techniques allows us to effectively capture relevant features that can be used in detecting malicious executable files. Features generated by the CNN-BiGRU feature extractor are fed to a fully connected neural network (FCNN) module for malware classification.

We have also integrated LIME into our framework to make it explainable. LIME is a framework for interpreting ML and DL black-box models and was proposed by Ribeiro et al. (2016). It allows API-MalDetect to produce explainable predictions which reveal feature importance, i.e., LIME identifies the API call features that contributed to the final prediction for a particular benign or malware executable file; to the best of our knowledge, none of the previous techniques has attempted to use LIME on API call features. Explainable results produced by LIME can help cybersecurity analysts or security practitioners to better understand API-MalDetect's predictions and to make decisions based on explainable insights. It is worth noting that we have also taken into consideration the practical constraints suggested in Pendlebury et al. (2019) to address the issue of temporal bias and spatial bias encountered in previous works such as Suaboot et al. (2020), Singh and Singh (2020) and Amer et al. (2021). Temporal bias occurs when the time-based split of samples is not considered during training, while spatial bias refers to an unrealistic distribution of malware samples over benign samples in the test set.

Contributions

This paper makes several significant contributions to the field of malware detection in Windows systems.

1. We introduce a new benchmark dataset for evaluating malware detection techniques. The dataset is created by using the dynamic analysis approach to extract sequences of API calls from both benign and malware executable files. We make this dataset publicly available for the research community to use in their experiments.
2. We propose a new deep learning-based automated framework called API-MalDetect for detecting malware attacks in Windows. The framework uses an NLP-based encoder for API calls and a hybrid automatic feature extractor based on deep learning techniques such as CNNs and BiGRUs to extract features from raw and long sequences of API calls. This approach allows us to automatically identify unique and highly relevant patterns from API call sequences that distinguish malware attacks from benign activities.


3. We introduce practical experimental factors for training and testing malware detection techniques to avoid temporal bias and spatial bias in the experiments. These factors include using different time periods for training and testing, and using different proportions of benign and malware samples. We demonstrate that API-MalDetect is effective in detecting unseen malware attacks even when trained on data from a different time period or tested on a different machine with different hardware configurations.
4. We evaluate the performance of API-MalDetect on benchmark datasets of API call sequences. Experimental results show that API-MalDetect outperforms existing state-of-the-art malware detection techniques Karbab and Debbabi (2019), Li et al. (2022a), Qin et al. (2020), Xiaofeng et al. (2019) and Avci et al. (2023) in terms of accuracy, precision, recall, F1-score, and AUC-ROC. These results demonstrate that the ability to automatically identify unique and highly relevant patterns from raw and long API call sequences effectively distinguishes malware attacks from benign activities in Windows systems using the proposed API-MalDetect framework.
5. Finally, by using LIME, API-MalDetect is able to produce local interpretability and explainability for its predictions. This allows security practitioners to understand how the framework makes its predictions and which API call features are most important in distinguishing malware attacks from benign activities.

Structure: The remaining part of this paper is structured as follows. Section 2 presents the background and Section 3 discusses the related works. Section 4 presents the proposed framework while Section 5 discusses the experimental results. Section 6 presents limitations and future work. The conclusion of this work is provided in Section 7.

2. Background

This section presents basic background on the Windows application programming interface (Win API) and the monitoring of executables' API calls. Moreover, it discusses the use of deep learning algorithms for malware detection.

2.1. Windows API

The Windows application programming interface (API), also known as Win32 API, is a collection of all API functions that allow Windows-based applications/programs to interact with the Microsoft Windows OS (kernel) and hardware (Stenne, 2021; Silberschatz Abraham, 2018; Uppal et al., 2014). Apart from some console programs, all Windows-based applications must employ Windows APIs to request the operating system to perform certain tasks such as opening and closing a file, displaying a message on the screen, creating and writing content to files, and making changes in the registry. This implies that both system resources and hardware cannot be directly accessed by a program; instead, programs need to accomplish their tasks via the Win32 API. All available API functions are defined in the dynamic link libraries (DLLs), i.e., in .dll files included in C:\Windows\System32\*. For example, commonly used libraries include Kernel32.dll, User32.dll, Advapi32.dll, Gdi32.dll, Hal.dll, and Bootvid.dll (Microsoft, 2021).

2.2. API calls monitoring

Generally, any Windows-based program performs its task by calling some API functions. This functionality makes the Win32 API one of the important and core components of the Windows OS as well as an entry point for malware programs targeting the Windows platform, since the API also allows malware programs to execute their malicious activities. Thus, monitoring and analyzing Windows API call sequences reveals the behavioral characteristics that can be used to represent benign and malware programs (Ammar Ahmed E. Elhadi, 2013; Ki et al., 2015). API call analysis reveals a considerable representation of how a given malware program behaves. Therefore, monitoring the program's API call sequences is by far one of the effective ways to observe whether a particular executable program file has malicious or normal behaviors (Suaboot et al., 2020; Amer and Zelinka, 2020).

2.3. Deep learning algorithms

Deep learning algorithms are a subset of machine learning techniques that use artificial neural network architectures to learn and discover interesting patterns/features from data. DL network architectures can handle big datasets with high dimensions and can perform automatic extraction of high-level features (Sharma et al., 2021; Yuan and Wu, 2021). DL algorithms are designed to learn from both labeled and unlabeled datasets and can produce highly accurate results with low false-positive rates (Najafabadi et al., 2015). The multi-layered structure (a network with many hidden layers) adopted by DL algorithms gives them the ability to learn relevant data representations through which low-level features are captured by lower layers and high-level abstract features are extracted by higher layers (Rafique et al., 2020; Pinhero et al., 2021). The next subsections introduce CNNs and recurrent neural networks, two popular categories of DL algorithms that we use to design our framework.

2.3.1. Convolutional neural network

The convolutional neural network is a category of DL techniques that gained popularity over the last decades. Inspired by how the animal visual cortex is organized (Hubel and Wiesel, 1968; Fukushima, 1979), CNN was mainly designed for processing data represented in grid patterns such as images and has been successfully used to solve computer vision problems (Khan et al., 2018). Recently, CNN has also been applied to detect malware attacks based on binary images of executable files (Chaganti et al., 2022; Tekerek and Yapici, 2022). Similar to other DL algorithms, it can automatically learn and extract high-level feature representations from data. Two-dimensional (2-D) CNN and one-dimensional (1-D) CNN are the two main versions of the CNN algorithm, with 2-D CNN being mainly applied to images (Khan et al., 2018; Tekerek and Yapici, 2022), while 1-D CNN was designed for processing one-dimensional data such as time series and sequential data (Kim, 2014; Fesseha et al., 2021). One-dimensional CNN is less computationally expensive compared to 2-D CNN and in many cases it does not require graphics processing units (GPUs), as it can be implemented on a standard computer with a CPU (making it much faster than 2-D CNN) (Kiranyaz et al., 2021). One-dimensional CNN architectures have also been successful in modeling various tasks and solving natural language processing (NLP) problems. For example, they were successfully applied to perform text classification (Kim, 2014) and sentiment classification (Liu et al., 2020).

2.3.2. Recurrent neural networks

A recurrent neural network (RNN) is a type of DL algorithm suitable for modeling sequential data using a memory function that allows it to discover relevant patterns from data (Medsker and Jain, 1999). Despite their performance, traditional/classic RNNs suffer from vanishing and exploding gradients and are unable to process long sequences (Lynn et al., 2019). To address this problem, Hochreiter and Schmidhuber (1997) proposed long short-term memory (LSTM), an improved RNN algorithm that performs well on long sequences. The gated recurrent unit (GRU) was later introduced by Cho et al. (2014) based on LSTM. A GRU network architecture uses a reset gate and an update gate to decide which information is passed to the output. Furthermore, it is important to note that GRU uses a simpler network architecture and has shown better performance than regular LSTMs (Chung et al., 2014). Bidirectional GRU (BiGRU) is a variant of GRU that models information in two directions (left-to-right and right-to-left) (Vukotić et al., 2016), and studies have shown better performance with reversed sequences, making it also ideal when modeling sequences of data (Vukotić et al., 2016).


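The reset/update gating of a GRU and the two-direction reading used by a BiGRU (Section 2.3.2) can likewise be sketched in NumPy. Again this is a toy illustration with random weights and bias terms omitted, not the framework's actual layer; `GRUCell` and `bigru` are hypothetical helper names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: an update gate z and a reset gate r decide how much
    of the previous hidden state is kept at each step (Cho et al., 2014)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        def w(*shape):
            return rng.normal(0.0, 0.1, shape)
        self.Wz, self.Uz = w(hidden_size, input_size), w(hidden_size, hidden_size)
        self.Wr, self.Ur = w(hidden_size, input_size), w(hidden_size, hidden_size)
        self.Wh, self.Uh = w(hidden_size, input_size), w(hidden_size, hidden_size)
        self.hidden_size = hidden_size

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)             # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)             # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))  # candidate state
        return (1.0 - z) * h + z * h_cand                  # blend old and new

def bigru(seq, fwd, bwd):
    """Run one GRU left-to-right and one right-to-left over `seq`,
    then concatenate the per-step hidden states, as a BiGRU layer does."""
    h, outs_f = np.zeros(fwd.hidden_size), []
    for x in seq:
        h = fwd.step(x, h)
        outs_f.append(h)
    h, outs_b = np.zeros(bwd.hidden_size), []
    for x in seq[::-1]:
        h = bwd.step(x, h)
        outs_b.append(h)
    outs_b.reverse()
    return np.stack([np.concatenate([f, b]) for f, b in zip(outs_f, outs_b)])

# Toy input: 5 embedded "API calls", 8 dimensions each.
seq = np.random.default_rng(1).normal(size=(5, 8))
out = bigru(seq, GRUCell(8, 16, seed=2), GRUCell(8, 16, seed=3))
print(out.shape)  # (5, 32): 16 forward + 16 backward features per time step
```

Because each step's output is a convex blend of the previous state and a tanh candidate, every hidden value stays in (-1, 1); concatenating the forward and backward passes is what gives BiGRU context from both sides of each API call.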
3. Related works

There have been significant efforts in recent works on malware detection through dynamic-based malware analysis. The work in Ki et al. (2015) adopted a DNA sequence alignment approach to design a dynamic analysis-based method for malware detection. Their experimental outcome revealed that some malicious files possess common behaviors/API call functions despite belonging to different categories. In addition, their study also indicated that new malware can be detected by identifying and matching the presence of certain API calls, as many malware programs perform malicious activities using almost similar API calls. However, the limitation of DNA sequence approaches is that they are prone to consuming many resources and require high execution time, making them computationally expensive given the high volume of emerging malware datasets. Singh and Singh (2020) extracted API calls from benign and malware executable files using dynamic analysis in the Cuckoo sandbox. They computed Shannon entropy over API call features to determine their randomness and then processed them using count vectorization to obtain features that were used to train a Random Forest-based malware classifier. While their proposed approach shows an improvement in accuracy, it is worth mentioning that the count vectorization model applied while processing API calls does not preserve the semantic relationship/similarity between features (Saket, 2021; Mandelbaum and Shalev, 2016).

Pirscoveanu et al. (2015) used Windows API calls to implement a malware classification system that achieved a detection accuracy of 98%. Malicious features were extracted from about 80,000 malware files covering four malware categories (Trojans, rootkits, adware, and potentially unwanted programs) downloaded from VirusTotal and VirusShare. Looking at the detection outcome, their approach only performs well when detecting Trojans. Morato et al. (2018) presented a method for detecting ransomware attacks based on network traffic data. In the work presented by Suaboot et al. (2020), a subset of API call features was extracted from API call sequences using a sub-curve Hidden Markov Model (HMM) feature extractor. Only six malware families (Keyloggers, Zeus, Rammit, Lokibot, Ransomware, and Hivecoin) with data exfiltration behaviors were used for the experimental evaluation. Different malware detection techniques based on ML algorithms such as Random Forest (RF), J48, and SVM were built using the extracted features. Nevertheless, their method was limited to evaluating executable program files that exfiltrate confidential information from compromised systems. In addition, with 42 benign and 756 malware executable programs used in their experimental analysis, there is a significant class imbalance in their dataset, which could lead to the model failing to predict and identify samples from minority classes despite its good performance. A fuzzy similarity algorithm was employed by Lajevardi et al. (2022) to develop dynamic-based malware detection techniques based on API calls.

The longest common substring and longest common subsequence methods for malware detection were suggested in Mira et al. (2016). Both methods were trained on API call sequences captured from 1500 benign and 4256 malware files during dynamic analysis. A malware detection approach based on sequence alignment and API calls captured by the Cuckoo sandbox was designed in the work proposed by Cho et al. (2016). The analysis was carried out using 150 malware samples belonging to ten malware variants, while malware detection was achieved by computing similarity between files based on extracted sequences of API calls. Their experimental results show that similar behaviors of malware families can be found by identifying a common list of invoked API call sequences generated during the execution of the executable program. Nonetheless, this method is not suitable for high-speed malware detection, as it fully relies on a pairwise sequence alignment approach, which introduces overheads. Several graph-based techniques for malware detection were proposed in the previous studies Wüchner et al. (2019), Jha et al. (2013), Blokhin et al. (2013), Hellal et al. (2020), Ding et al. (2018) and Pei et al. (2020), to mention a few. Although graph-based techniques can achieve good performance, the complexity of graph matching is one of their major issues, i.e., as the graph's size increases, so does the matching complexity, and the detection accuracy of a given detection model decreases (Singh and Singh, 2020). However, it is worth noting that it is harder for a cyber attacker to modify the behaviors of a malware detection model based on graph methods (Hellal et al., 2020).

In the work presented in Tran and Sato (2017), NLP techniques were applied to process sequences of API calls which were then fed to the detection technique. API calls were processed using the n-gram method and weights were assigned to each API call feature using the term frequency-inverse document frequency (TF-IDF) model. The work in Karbab and Debbabi (2019) also relied on TF-IDF to transform input features for machine learning algorithms such as CART, ETrees, KNN, RF, SVM, and XGBoost. Another work in Catak et al. (2020) used LSTM and the TF-IDF model to build behavior-based malware detection using API call sequences extracted with the Cuckoo sandbox in a dynamic analysis environment. TF-IDF and ant colony optimization (a swarm algorithm) were used to implement a graph-based malware detection method based on dynamic features of API calls extracted from executable files (Amer et al., 2022). Unfortunately, like the count vectorization approach, the TF-IDF approach does not reveal or preserve the semantic relationship/similarity that exists between words. In the case of malware detection, this would be the similarity between API calls or other text-based features such as file names, dropped messages, network operations such as contacted hostnames, web browsing history, and error messages generated while executing the executable program file.

Liu and Wang (2019) used bidirectional LSTM (BiLSTM) to build an API call-based approach that classifies malware attacks. All sequences of API calls were extracted from 21,378 executable files and were processed using the word2vec model. In Li et al. (2022b) a graph convolutional network (GCN) model for malware classification was built using sequences of API calls. Features were extracted using principal component analysis (PCA) and a Markov chain. The work in Maniath et al. (2017) proposed an LSTM-based model that identifies ransomware attacks based on behavioral API calls from Windows EXE files generated through dynamic analysis in the Cuckoo sandbox. Chen et al.'s work (Chen et al., 2022) proposed different malware detection techniques based on CNN, LSTM, and bidirectional LSTM models. These models were trained on raw sequences of API calls and parameters that were traced during the execution of malware and benign files. An ensemble of ML algorithms for malware classification based on API call sequences was implemented in Sukul et al. (2022). Convolutional neural networks and BiLSTM were used to develop a malware classification framework based on sequences of API calls extracted from executable files (Li et al., 2022a). The work proposed in Abbasi et al. (2022) employed a dataset of API invocations, registry keys, file/directory operations, dropped files, and embedded string features extracted using the Cuckoo sandbox to implement a particle swarm-based approach that classifies ransomware attacks. Jing et al. (2022) proposed Ensila, an ensemble of RNN, LSTM, and GRU for malware detection, which was trained and evaluated on dynamic features of API calls.

4. Proposed methodology

Details on the proposed framework for detecting malware attacks in Windows systems are presented in this section.

4.1. System overview

The proposed framework is based on the dynamic analysis approach where API call sequences are extracted from benign and malware executable files while running in an isolated virtual environment. The extracted raw sequences of API calls are then processed and encoded before being fed to a hybrid automatic feature extractor based on CNN


Fig. 1. The proposed API-MalDetect framework for behavior-based malware detection in Windows systems.

and BiGRU deep learning architectures. The final features generated by CNN-BiGRU are then passed to a fully connected layer with neural networks, which performs the classification of each sequence of API calls as malicious or benign (normal). The architecture of the proposed framework is depicted in Fig. 1, and below we present more details on each component.

4.2. Generating API calls dataset

It is often challenging to find an up-to-date dataset of API calls of benign and malware executable files. For this reason, we have generated a new dataset of API calls which is used for the experimental evaluations. Different sources of malware executable files such as Malheur (Rieck et al., 2011), Kafan Forum (Pan et al., 2016), Danny Quist (Sethi et al., 2017), Vxheaven (Huda et al., 2016), MEDUSA (Nair et al., 2010), and Malicia (Nappa et al., 2015) were used in previous studies. However, these repositories are not regularly updated to include new malware samples. Hence, we have collected malware executable samples from VirusTotal (Anon, 2021d), the most up-to-date and the world's largest malware sample repository. Given the millions of malware samples available in the repository, it is worth mentioning that processing all of them is beyond the scope of this study due to hardware limitations.

Thus, only a subset of malware samples made available in the second quarter of 2021 was collected. We were given access to a Google Drive folder of malicious EXE files shared by VirusTotal. Benign samples were collected from the CNET site (Anon, 2021b). The VirusTotal online engine was used to scan each benign EXE file to ensure that all benign files are clean. In total, 2800 EXE files were collected to be analyzed in an isolated analysis environment through dynamic analysis. Nevertheless, we experienced issues while executing some files, resulting in a dataset of 1285 benign and 1285 malware files that were successfully executed and analyzed to generate our benchmark dataset of API calls. Some benign files were excluded as they were classified as malicious by the VirusTotal online engine (Anon, 2022b), while some malware files were excluded because they did not run due to compatibility issues.

Accordingly, our isolated dynamic analysis environment consists of one Ubuntu host machine and Windows virtual machines (VMs). The Cuckoo sandbox (Anon, 2021a) and its dependencies (such as the analysis and reporting modules) were installed on the Ubuntu host machine, while the Cuckoo agent was deployed in each VM (2 VMs were configured for the dynamic analysis). The Cuckoo agent monitors each file's behaviors during execution and sends the results to the host to be analyzed by the Cuckoo processing and reporting modules, which create a JSON report containing all API calls captured during execution. As some advanced malware can escape the analysis environment during execution (which could cause serious damage to the production environment), a virtual network was configured to enable communication between the host and the virtual machines.

The analysis reports generated during our dynamic analysis reveal that some sophisticated malware can use different API calls that potentially lead to malicious activities. For instance, Table 1 presents some of the API calls used by ae03e1079ae2a3b3ac45e1e360eaa973.virus, which is a ransomware variant. This ransomware ended up locking one of our Windows VMs and demanded a ransom to be paid in Bitcoin in order to restore access to the infected VM. Moreover, this variant also encrypted files and made them inaccessible. We have also observed that recent malware variants possess multiple behaviors and can perform multiple malicious activities after compromising the target, making their detection difficult. After the analysis, all JSON reports were processed to extract API calls, which resulted in a new benchmark dataset of API call sequences that is publicly available for use by the research community focusing on malware detection.

Our benchmark dataset has 2570 records representing benign executable files and malware files such as ransomware, worms, viruses, spyware, backdoors, adware, keyloggers, and Trojans which appeared in the second quarter of 2021. Each executable file in the dataset has been labeled as benign or malware. The distribution of samples in the dataset is presented in Fig. 2. The dataset is balanced, with the same number of malware and benign files, and can be accessed from GitHub (Maniriho, 2022). Furthermore, it has been processed to remove all inconsistencies/noise, making it ready to be used for evaluating the performance of deep learning models. Additionally, a hash value of each file has been included to avoid duplication of files while extending the dataset in the future, making it easier to include behavioral characteristics of newly discovered malware variants in the dataset or to combine the dataset with any of the existing datasets of API calls extracted from Windows PE files through dynamic analysis.

4.2.1. API call sequence encoding

After generating the dataset, the next step is to perform the encoding of API call sequences. Given an input sequence of API calls, it is first tokenized to generate API call tokens, which are further


Table 1
Example of potentially malicious API calls observed in ae03e1079ae2a3b3ac45e1e360eaa973.virus while running in a Windows VM during our
dynamic analysis.
Malicious API call Description of API call
WriteConsoleW Malware uses this API call to establish a command line console.
NtProtectVirtualMemory This API call was used by the malware to allocate read-write memory usually to unpack itself.
CreateProcessInternalW The process created a hidden window to hide the running process from the task manager. This API
call allows the malicious program to spawn a new process from itself, which is usually observed in
packers that use this method to inject the actual malicious code into memory and execute it using
CreateProcessInternalW.
Process32FirstW The malware used this API to search running processes potentially to identify processes for
sandbox evasion, code injection, or memory dumping/live image capturing.
FindWindowA The malware used this API to check for the presence of known Windows forensics tools and
debuggers that might be running in the background.

encoded by assigning a unique integer number to each token of API call, where similar API calls have similar encodings. After encoding, the output is a sequential numerical representation of the original input sequence (originally represented in text format). All sequences of API calls are tokenized and encoded using an NLP Tokenizer from the Keras framework (Anon, 2023d). Additionally, sequences of API calls were padded using pre-padding (zero padding was applied where necessary) to obtain sequences of the same length. This is very important for our CNN feature extractor, as it requires all input sequences to be numeric and of the same length. The encoded sequences are fed to an embedding layer which builds dense vector representations of each API call, where all created vectors are combined to generate an embedding matrix of API call features. Fig. 3 illustrates the process of API call encoding, where each API call sequence extracted from each benign and malware EXE file is treated as a sentence. Details on creating API call embeddings are presented in Section 4.2.2.

Fig. 2. Distribution of benign and malware samples in our benchmark dataset of API call sequences.

4.2.2. Creating API call embedding

In natural language processing, embedding models allow the creation of numerical dense vectors of words while preserving their semantic information (the contextual relationships between them). Embedding models have been used in combination with DL techniques to perform NLP-related tasks like text and sentiment classification (Pittaras et al., 2020; Kim, 2014). As we are dealing with API calls represented in text format, our API call embedding approach is linked to these studies. However, in this work we do not use existing pre-trained NLP-based word embedding models (Mikolov et al., 2013; Pennington et al., 2014) because similarities in API call sequences differ greatly from those in ordinary English words/text. Thereby, we use direct embedding with the Keras embedding layer (Anon, 2023a) to automatically learn and generate dense embedding vectors of API calls. Direct embedding allows the knowledge to be incorporated inside the detection model (the proposed framework), i.e., the whole knowledge of API-MalDetect is incorporated in one component, which is different from previous techniques. Additionally, as our proposed framework relies on direct embedding (where embeddings are constructed while training the detection model), it can resist some types of adversarial learning attacks, such as those performed by modifying pre-trained embedding models (Wang et al., 2022) with the aim of fooling DL-based models. This makes the proposed framework more secure against such attacks. The Keras embedding layer requires all API call inputs to be integer encoded, which is why each API call sequence has to be encoded as discussed in Section 4.2.1.

The Keras embedding layer uses deep neural networks to create dense vector representations of API calls from the API call corpus. It maps all encoded API calls in the input sequences to dense vectors in a high-dimensional space where similar API calls are located closer together, allowing neural networks to learn and capture contextual relationships between API calls. Therefore, the Keras embedding layer first takes an encoded matrix of API calls as input (a matrix of integer indices), where each row of the matrix is a sequence of API calls representing a benign or malware executable file. Each integer index is mapped to a dense vector (of fixed length) which is learned through neural network training. Finally, the output is a matrix of API call dense vectors which is used as input to the CNN feature extractor (the subsequent layer to the Keras embedding layer in the proposed framework). Fig. 4 summarizes the steps for generating numerical representations of API call sequences through encoding and embedding.

It is worth noting that the Keras embedding layer requires a few parameters to be specified before creating embeddings. The embedding layer is first initialized with random weights, and thereafter, other parameters are defined. The first parameter is input_dim, which specifies the vocabulary size of API calls (for instance, if the API call data has integer encoded values between 0 and 1000, then the vocabulary size would be 1001 API calls). The second parameter is output_dim, which denotes the size of the vector space in which tokens of API calls are embedded (i.e., it defines the size of each API call's output vector from the embedding layer). It can be any positive integer, such as 5, 10, or 20. In this work, different values of output_dim were tested to find a suitable dimension of the output vector; this value was chosen empirically and set to 10 in our experiments. Finally, the third parameter is input_length, which is the length of each input API call sequence. For instance, if a sequence of API calls extracted from a malware file has 60 API calls, its input length would be 60. As there are sequences with different lengths in the dataset, all sequences are processed to have the same input_length value during encoding. After this step, the embedding layer constructs and concatenates all embedding vectors of API calls to form an embedding matrix which is used as input to the next layer.
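To make the encoding and embedding steps concrete, the following is a minimal pure-Python sketch (not the authors' code) that mimics what the Keras Tokenizer, pre-padding, and an embedding-matrix lookup produce. The API call names, the tiny input_length of 5, and the randomly initialized embedding table are illustrative; in the actual framework the embedding weights are learned during training.

```python
import random
from collections import Counter

def build_vocab(sequences):
    # Assign a unique integer to each distinct API call, ordered by frequency
    # as the Keras Tokenizer does; index 0 is reserved for padding.
    counts = Counter(call for seq in sequences for call in seq)
    return {call: i + 1 for i, (call, _) in enumerate(counts.most_common())}

def encode_and_pad(sequences, vocab, input_length):
    # Integer-encode each sequence, then pre-pad with zeros to a fixed length.
    rows = []
    for seq in sequences:
        ids = [vocab[c] for c in seq][:input_length]
        rows.append([0] * (input_length - len(ids)) + ids)
    return rows

def embed(encoded_rows, input_dim, output_dim):
    # Randomly initialized embedding matrix (input_dim x output_dim); each
    # integer index selects one dense row vector, as a Keras Embedding layer does.
    table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
             for _ in range(input_dim)]
    return [[table[idx] for idx in row] for row in encoded_rows]

# Two toy API call sequences, one per analyzed executable.
seqs = [["NtProtectVirtualMemory", "CreateProcessInternalW", "Process32FirstW"],
        ["FindWindowA", "NtProtectVirtualMemory"]]
vocab = build_vocab(seqs)
X = encode_and_pad(seqs, vocab, input_length=5)   # X[1] == [0, 0, 0, 4, 1]
E = embed(X, input_dim=len(vocab) + 1, output_dim=10)
```

The output E is the embedding matrix handed to the CNN feature extractor: one row of 5 tokens per executable, each token expanded into a 10-dimensional dense vector (matching the output_dim of 10 used in this work).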


Fig. 3. An example of API call encoding using the Keras API Tokenizer.

Fig. 4. Steps for generating numerical representations of API call sequences through encoding and embedding.

Fig. 5. The architecture of one-dimensional convolution neural networks (1-D CNN) designed for API calls feature learning and extraction.

4.3. Hybrid automatic feature extraction

We have designed a hybrid automatic feature extractor that exploits the potential of the CNN and BiGRU deep learning algorithms to effectively learn relevant features of API calls, which are used to train a fully connected neural network (FCNN)-based malware classifier/detector.

4.3.1. 1-D CNN feature extractor

The Keras embedding layer described in Section 4.2.2 serves as the key processing layer for our CNN automatic feature extractor, which is designed based on the one-dimensional CNN (1-D CNN) technique for text classification proposed by Kim (2014). That is, we use 1-D CNN to extract local features of API calls from the embedding matrix (which is used as input data). As illustrated in Fig. 5, the architecture of the designed 1-D CNN feature extractor is made up of two main components, namely, the convolutional layer and the pooling layer. Given a sequence S of API calls represented as dense vectors from the embedding matrix (created during Keras embedding), let X_i ∈ R^d denote the d-dimensional vector representing the ith API call in S, where d is the dimension of the embedding vector. A sequence consisting of n API calls from a single JSON report can then be constructed by concatenating individual API calls using the expression in (1), where the symbol ⊕ denotes the concatenation operator and n is the length of the sequence.

X_{1:n} = x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4 ⊕ … ⊕ x_n   (1)

We have padded the sequences to generate an API call matrix of k × n dimensions, having k tokens of API calls with embedding vectors of length n. Padding allows sequences to have a fixed number of k tokens (i.e., the same fixed length is kept for all sequences), which is very important as a CNN cannot work with input vectors of different lengths. Thus, we have set k to a fixed length. In order to identify and select highly relevant/discriminative features from the raw-level features of the API call embedding vectors, the CNN feature extractor performs a set of transformations on the sequential input vector X_{1:n} through convolution operations, non-linear activation, and pooling operations in different layers. These layers interact as follows.

The convolutional layer relies on defined filters to perform convolutional operations on the input vectors of API calls. This allows the convolutional layer to extract unique features from the API call vectors. As the convolutional filters extract features from different locations/positions in the embedding matrix, the extracted features have a lower dimension compared to the original features of API calls, hence mapping high-dimensional features to lower-dimensional ones while keeping the highly relevant features (i.e., reducing the feature dimension). The positions considered by the filters while convolving over the input are independent for each API call, and semantic associations between API calls that are far apart in the sequences are captured at higher layers. Therefore, we have applied a filter W ∈ R^{m×n} to generate a high-level feature representation, with the filter moving/shifting horizontally over the embedding matrix (m rows at a time) based on a stride t to construct a feature map c_i, which is computed using the expression in (2). It is important to mention that the multiplication operator (∗) in Eq. (2) denotes the convolutional operation (achieved by performing element-wise multiplication) over the API call vectors from X_i to X_{i+m−1} covered by the defined filter W based on t. To make the operation faster, we have kept the stride at a value of 1; however, various strides can be adopted as well. Moreover, in Eq. (2), the bias value is denoted by b_i.

C_i = f(W ∗ X_{i:i+m−1} + b_i)   (2)
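The convolution of Eq. (2), together with the ReLU activation and the max pooling applied to each feature map, can be traced on a toy example. The sketch below (pure Python, not the authors' implementation) slides one m × d filter with stride 1 over a small embedding matrix; all numbers are made up for illustration.

```python
def relu(x):
    # ReLU activation: f(x) = max(0, x)
    return max(0.0, x)

def conv1d_feature_map(X, W, b=0.0):
    # X: k x d embedding matrix, W: m x d filter, stride 1.
    # Each c_i = f(W * X[i:i+m-1] + b), with * the element-wise product (Eq. (2)).
    k, m = len(X), len(W)
    c = []
    for i in range(k - m + 1):
        s = sum(W[r][j] * X[i + r][j] for r in range(m) for j in range(len(X[0])))
        c.append(relu(s + b))
    return c

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # k=4 tokens, d=2
W = [[1.0, -1.0], [1.0, 1.0]]                          # one m=2 filter
c = conv1d_feature_map(X, W)  # feature map of length k - m + 1 = 3
c_hat = max(c)                # max pooling over the feature map
```

With 128 such filters (as in the framework's configuration), the pooling layer outputs a 128-dimensional reduced feature vector, one maximum per filter.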


A CNN supports several activation functions such as the hyperbolic tangent, Sigmoid, and rectified linear unit (ReLU). In this work, we have used ReLU, which is represented by f in (2). Once applied to each input x, the ReLU activation function introduces non-linearity by capping/turning all negative values to zero. This operation is achieved using the expression in (3); the activation speeds up the training of the CNN model, and the nonlinearity allows the proposed framework (API-MalDetect) to handle complex input data representations, which is impossible when using simple linear ML models (Chng, 2023).

f(x) = max(0, x)   (3)

After convolving the filters over the entire input embedding matrix, the output is a feature map corresponding to each convolutional operation, obtained using the expression in (4). Note that the convolutional operations were regularized by Dropout with a dropout rate of 0.2.

C(f) = [C_1, C_2, C_3, …, C_{n−m+1}]   (4)

The convolutional layer passes its output to the pooling layer, which performs further operations to generate a new feature representation by aggregating all values received from the feature maps. This operation is carried out using well-known statistical techniques such as computing the mean or average, finding the maximum value, or applying the L-norm. One of the advantages of the pooling layer is that it has the potential to prevent the model's overfitting, reduce the dimensionality of features, and produce sequences of API call features with the same fixed lengths. In this work, we have used max pooling (Kim, 2014; Collobert et al., 2011), which performs the pooling operation over each generated feature map and then selects the maximum value associated with a particular filter's output feature map. For instance, given a feature map c_i, the max-pooling operation is performed by the expression in (5), and the same operation is applied to each feature map.

ĉ_i = max(c_i)   (5)

The goal is to capture high-level features of API calls (the ones with the maximum/highest value for every feature map). Note that the selected value from each feature map corresponds to a particular API call feature captured by the filter while convolving over the input embedding matrix. All values from the pooling operations are aggregated together to produce reduced feature vectors, which are passed to the BiGRU feature extractor (also referred to as the BiGRU layer) for further processing.

4.3.2. BiGRU feature extractor

The features of API calls generated by the 1-D CNN feature extractor carry low-level semantic information compared to the original ones. Fortunately, gated recurrent units can directly receive and process the intermediate feature maps generated by the CNN. Thus, we have used a BiGRU network architecture which works as the subsequent layer to the 1-D CNN feature extractor. The BiGRU feature extractor processes sequences in both directions (i.e., it uses both forward and backward recurrence), allowing it to capture high dependencies across the feature maps (local features) of API calls. Therefore, the BiGRU layer depicted in the proposed malware detection framework in Fig. 1 receives as input the output feature vector generated after processing each feature map ĉ_i of API calls. It then processes those sequences of API calls using gated units, which allow it to control the information flow and prevent the learning model from vanishing gradients. The mathematical computation of a GRU is performed as shown in Eqs. (6), (7), (8), and (9), where z_t, r_t, h̃_t, and h_t denote the update gate, reset gate, hidden state of the current hidden node, and output, respectively. The symbol σ denotes the Sigmoid activation, x_t is the current input, tanh is the hyperbolic tangent activation function, w and u are weight matrices to be learned, ⊙ is the Hadamard (element-wise) product, and b_z, b_r, and b_h denote the biases.

z_t = σ(w_zx x_t + u_zh h_{t−1} + b_z)   (6)

r_t = σ(w_rx x_t + u_rh h_{t−1} + b_r)   (7)

h̃_t = tanh(w_hx x_t + r_t ⊙ u_hh h_{t−1} + b_h)   (8)

h_t = (1 − z_t) ⊙ h̃_t + z_t ⊙ h_{t−1}   (9)

The output produced by the BiGRU layer contains sequences of hidden states that encode the contextual information of the API call features, which are passed to the next layer for malware classification.
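The GRU gate equations can be traced with a scalar toy example. The sketch below (pure Python, not the authors' code) applies the standard GRU update of Eqs. (6)–(9) over a short sequence, once forwards and once backwards, keeping both final hidden states as a bidirectional layer would; all weight values are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, w, u, b):
    # One GRU step over scalars; w, u, b hold the parameters of the
    # update gate (z), reset gate (r), and candidate state (h).
    z = sigmoid(w["z"] * x_t + u["z"] * h_prev + b["z"])              # update gate
    r = sigmoid(w["r"] * x_t + u["r"] * h_prev + b["r"])              # reset gate
    h_tilde = math.tanh(w["h"] * x_t + r * u["h"] * h_prev + b["h"])  # candidate
    return (1.0 - z) * h_tilde + z * h_prev                           # new hidden state

# Hypothetical (untrained) parameters and a toy feature sequence from pooling.
w = {"z": 0.5, "r": 0.5, "h": 1.0}
u = {"z": 0.1, "r": 0.1, "h": 0.2}
b = {"z": 0.0, "r": 0.0, "h": 0.0}
seq = [1.0, -1.0, 0.5]

h_fwd = h_bwd = 0.0
for x in seq:                 # forward recurrence
    h_fwd = gru_step(x, h_fwd, w, u, b)
for x in reversed(seq):       # backward recurrence
    h_bwd = gru_step(x, h_bwd, w, u, b)
bigru_out = (h_fwd, h_bwd)    # concatenated hidden states fed to the classifier
```

Because the new state is a convex combination of the candidate state and the previous state, the hidden value stays bounded, which is what mitigates the vanishing-gradient behavior mentioned above.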
4.3.3. Fully connected layer

The fully connected layer consists of neural networks with hidden layers, a ReLU activation function, and a network regularizer. It also has an output layer with a sigmoid activation function. The hidden layer neurons/units receive the input feature vectors h_{t_i} from the BiGRU layer, process them, and then compute their activations l_i using the expression in (10), with W being the matrix of weights between the connections of the input neurons and hidden layer neurons, while b_i represents the bias. We have used the Dropout regularization technique to prevent the network from overfitting on the training data. A dropout rate of 0.5 was used after each hidden layer, which means that at each training iteration, 50% of the connection weights are randomly selected and set to zero. Dropout works by randomly dropping out/disabling neurons and their associated connections to the next layer, which prevents the network's neurons from relying too heavily on particular neurons and forces each neuron to learn and to generalize better on the training data (Srivastava et al., 2014).

l_i = ReLU(Σ_i W_i ∗ h_{t_i} + b_i)   (10)

In addition, we have used binary cross-entropy (Jadon, 2020) to compute the classification error/loss, whereas the learning weights are optimized by the Adaptive Moment Estimation (Adam) optimizer (Kingma and Ba, 2014; Yaqub et al., 2020). Cross-entropy is a measure of the difference between two probability distributions for a given set of events or random variables, and it has been widely used in neural networks for classification tasks. On the other hand, Adam works by searching for the best weight W and bias b parameters which contribute to minimizing the gradient computed by the error function (binary cross-entropy in this case). Note that the network learning weights W are updated through backpropagation operations, while for the Sigmoid function we have used binary logistic regression (see Eq. (11)).

Sigmoid(x) = 1 / (1 + e^{−x})   (11)

Y = Sigmoid(W_i × l_i + b)   (12)

After computing the activation, the classification outcome (the predicted output) is computed by the expression in (12). For more details, a summary of the parameter settings for the proposed framework is presented in Table 2, where the best parameters, such as the number of filters, kernel size, etc., were determined using the grid search approach.
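A scalar sketch of this classification head follows (pure Python, with hypothetical weights rather than the trained model): a ReLU hidden activation as in Eq. (10), a sigmoid output as in Eqs. (11)–(12), and the binary cross-entropy loss that Adam minimizes during training.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # Eq. (11): 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(y_true, y_pred):
    # Classification loss minimized by the Adam optimizer during training.
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# Toy BiGRU output features and hypothetical learned parameters.
h = [0.3, -0.8, 0.5]
w_hidden = [0.4, -0.2, 0.7]
b_hidden = 0.1

l = relu(sum(w * x for w, x in zip(w_hidden, h)) + b_hidden)  # hidden activation, Eq. (10)
y = sigmoid(2.0 * l - 1.0)        # predicted probability, Eq. (12) with W = 2.0, b = -1.0
loss = binary_cross_entropy(1.0, y)  # label 1.0 = malware
```

The predicted probability y is thresholded (commonly at 0.5) to label the API call sequence as malware or benign.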
5. Experiments and results

Details of the experimental evaluations performed while evaluating the proposed framework are presented in this section. The results are based on a binary classification problem of benign and malware executable files in Windows systems.


Table 2
A summary of different parameter settings for the proposed framework.

Embedding layer:
  Input sequence length: various (20, 30, 40, 60, 80, and 100)
  Embedding dimension: 10
  Sequence padding: pre-padding with zero padding
CNN:
  Number of convolutional layers: 1
  Number of filters: 128
  Filter/kernel size: 4
  Activation function: ReLU
  Regularizer: Dropout with a dropout rate of 0.2
  Number of pooling layers: 1
  Stride: 1
  Pooling method: max pooling
BiGRU:
  Number of gated units: 3
  Activation function: Tanh
  Recurrent activation function: Sigmoid
Fully connected layer:
  Number of hidden layers: 2
  Number of neurons (hidden layer 1): 25
  Number of neurons (hidden layer 2): 25
  Regularizer (hidden layers 1 and 2): Dropout with a dropout rate of 0.5
Model compilation:
  Optimizer: Adam with a learning rate of 0.001
  Loss function: Binary Cross Entropy
Other parameters:
  Number of epochs: 16
  Batch size: 32
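Table 2 notes that several of these values were selected by grid search. A grid search simply enumerates every combination of candidate values, trains a model for each, and keeps the best-scoring configuration; a minimal sketch of the enumeration step follows (the candidate values here are illustrative, not the authors' full search space).

```python
from itertools import product

# Candidate values for a few of the hyperparameters listed in Table 2
# (illustrative only).
grid = {
    "num_filters": [64, 128],
    "kernel_size": [3, 4, 5],
    "dropout": [0.2, 0.5],
}

# Cartesian product of all candidate values: each dict is one configuration
# to train and evaluate.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
# 2 * 3 * 2 = 12 configurations in total
```

Each configuration would then be trained and scored on a validation split, with the best one (e.g., 128 filters and kernel size 4, as reported in Table 2) retained for the final model.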

5.1. Experimental setup and tools

The proposed framework was implemented and tested on a computer running Windows 10 Enterprise edition (64-bit) with an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, 16.0 GB of RAM, an NVIDIA Quadro P620, and a 500 GB hard disk drive. The framework was implemented in Python version 3.9.1 using the TensorFlow 2.3.0 and Keras 2.7.0 frameworks. Other libraries such as Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn, LIME, and the Natural Language Toolkit (NLTK) were also used. All these libraries are freely available for public use and can be accessed from PyPI (Pypi, 2021), the Python package management website.

5.2. Eliminating spatial and temporal bias in the experiments

Considering the best practices for building malware detection techniques (systems), we have followed the guidelines from Pendlebury et al.'s work (Pendlebury et al., 2019) to address the issue of temporal and spatial bias in the experiments. The practical constraints/guidelines in Pendlebury et al. (2019) allow us to adhere to a realistic scenario as closely as possible in our experimental evaluations. Specifically, the following constraints were taken into consideration.

- Temporal training consistency: We have imposed a temporal split between the training dataset and testing dataset, where malware samples released in 2015 are used for training while testing is performed on malware samples generated in 2018 and 2021. This allows us to evaluate the performance of the proposed framework on unseen malware samples, in order to ensure its potential for detecting newly released/unknown malicious executable files.
- Temporal benign consistency: As benign executable files can in many cases remain stable over time, we did not collect them based on time. We collected them from various sources and then split them into two portions, with one portion used for training and the other for testing.
- Spatial malware and benign consistency in the test set: Estimating the exact percentage of malware executable samples in the wild is impossible because many of them have a short lifetime and they also get updated often. On the other hand, it is also clear that the number of daily malware executable samples encountered by individuals or organizations is significantly less than that of benign ones (Mimura, 2023; Pendlebury et al., 2019). Despite being produced in large volumes, malware executable files are rarely run compared to benign executables (Mimura, 2023). As suggested in Moskovitch et al.'s work (Moskovitch et al., 2009), malware represents approximately 10% of the traffic on the Internet (which here represents the ratio of malware in the test set). Accordingly, in this work, we assume that in a realistic scenario (real-life conditions), there would be 90% benign and 10% malware detected by a malware detection end system. The same ratio has been suggested in the research studies presented in Pendlebury et al. (2019), Nissim et al. (2014), and Tien et al. (2020).

5.3. Training and testing datasets

Following the best practices presented in Section 5.2, we have used the API calls dataset in Ki et al. (2015) for training (the dataset can be accessed from Anon (2023c) and Anon (2023b)). Testing was performed using our dataset (Maniriho, 2022) (a description of our dataset can be found in Section 4.2). Because the dataset in Ki et al. (2015) is highly imbalanced, with malware files greatly outnumbering benign files (23,080 malware and 300 benign), we performed downsampling in order to obtain a good distribution of samples across the dataset. Moreover, we have also taken benign API call samples from Anon (2021c), which were added to the training dataset. Specifically, our training dataset has a total of 8351 samples (4787 benign and 3564 malware), while the testing dataset has three test sets. The first test set (testset1) has 1343 samples, with 90% and 10% representing benign and malware samples, respectively. The second test set (testset2) has 1511 samples, where 80% represent benign and 20% represent malware; it is used to examine whether the proposed framework can still perform well when the proportion of malware files exceeds 10% of the test set. It is worth mentioning that both testset1 and testset2 are from our dataset. The last test set (testset3) is taken from Ceschin et al. (2018). It has 1343 samples, with 90% and 10% representing benign and malware, respectively. Overall, the proposed framework is trained and tested on a total of 12,548 executable samples.

5.4. Evaluation metrics

Different metrics such as precision (P), recall (R), F1-Score (F1), false positive rate (FPR), false negative rate (FNR), and accuracy (Acc)


were measured to evaluate the performance of the proposed framework. The computations of these metrics are presented in Eqs. (13), (14), (15), and (16), with TP, TN, FP, and FN indicating true positives, true negatives, false positives, and false negatives, respectively.

Precision (P) = TP / (TP + FP)   (13)

Recall (R) = TP / (TP + FN)   (14)

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (15)

F1 = 2 × (Precision × Recall) / (Precision + Recall)   (16)

We have also computed other metrics such as the area under the ROC curve (AUC), the macro average, and the weighted average for precision, recall, and F1-score, respectively. The macro average is calculated using the unweighted/arithmetic mean, where all classes in the dataset are treated equally irrespective of their support (the total number of times each class appears in the dataset). The weighted average considers each class's support and is computed as the mean of all per-class scores (e.g., precision or recall) weighted by support. Eqs. (17), (18), and (19) show how to compute the macro average precision (Macro P), macro average recall (Macro R), and macro average F1-score (Macro F1), where i represents the ith class and N is the total number of classes in the dataset. Moreover, Eqs. (20), (21), and (22) show the weighted average precision (Weighted P), weighted average recall (Weighted R), and weighted average F1-score (Weighted F1), where W_i denotes the weight of class i. Additionally, Eq. (23) shows the AUC calculation.

Macro P = (1/N) Σ_{i=1}^{N} P_i   (17)

Macro R = (1/N) Σ_{i=1}^{N} R_i   (18)

Macro F1 = (2 × Macro P × Macro R) / (Macro P + Macro R)   (19)

Weighted P = Σ_{i=1}^{N} W_i P_i   (20)

Weighted R = Σ_{i=1}^{N} W_i R_i   (21)

Weighted F1 = (2 × Weighted P × Weighted R) / (Weighted P + Weighted R)   (22)

AUC = ∫_0^1 (TP / (TP + FN)) d(FP / (TN + FP))   (23)

Table 3
Comparisons against other detection techniques based on API call sequences extracted from Windows executable files.

Work | Algorithm used | Accuracy (%)
Maldy (Karbab and Debbabi, 2019) | KNN | 97.60
Qin and Wang (Qin et al., 2020) | TextCNN | 95.90
Amer et al. (Amer et al., 2022) | Particle swarm-based | 95.40
Mathew and Kumara (Mathew and Ajay Kumara, 2020) | LSTM | 92.00
Liu and Wang (Liu and Wang, 2019) | BiGRU | 93.72
Li et al. (Li et al., 2022a) | CNN-BiLSTM | 97.31
Yesir and Soğukpinar (Yesir and Soğukpinar, 2021) | BERT | 96.76
Xiaofeng et al. (Xiaofeng et al., 2019) | RF and Bi-residual LSTM | 96.70
Catak et al. (Catak et al., 2020) | Two-layer LSTM | 95.00
Xue et al. (Xue et al., 2022) | MLP | 91.57
Sai et al. (Sai et al., 2019) | DT | 90.00
Nawaz et al. (Nawaz et al., 2022) | J48 | 97.50
Maniath (Maniath et al., 2017) | LSTM | 96.67
Pektaş and Acarman (Pektaş and Acarman, 2017) | AROW | 93.00
Nunes et al. (Nunes et al., 2019) | RF | 96.00
Han et al. (Han et al., 2019) | XGBoost | 93.18
Avci et al. (Avci et al., 2023) | BiLSTM | 93.16
This work | CNN-BiGRU | 99.07

Table 4
Accuracy, false positive rate, and false negative rate achieved by API-MalDetect on testset1.

Sequence length | Accuracy (%) | False positive rate (%) | False negative rate (%)
20 | 96.87 | 3.05 | 0.07
40 | 97.32 | 2.46 | 0.22
60 | 97.77 | 2.08 | 0.15
80 | 98.06 | 1.86 | 0.07
100 | 98.81 | 1.04 | 0.15

5.5. Classification results

The experimental evaluations were conducted to examine how effectively the proposed framework can detect unseen malware attacks based on sequences of API calls representing benign or malware executable files. Thus, various evaluations were carried out, and the results are presented in this subsection. We first present the performance of API-MalDetect against other API call-based malware detection techniques (or frameworks) in Section 5.5.1, and then in Section 5.5.2 we present the performance of API-MalDetect under various experimental conditions. Finally, Section 5.5.4 provides insights into the explainable results generated by API-MalDetect based on LIME.

5.5.1. Benchmark comparisons against other techniques

We have examined the performance of the API-MalDetect framework against other malware detection techniques implemented based on sequences of API calls. The comparative results are presented in Table 3. First, we compared API-MalDetect against Maldy (Karbab and Debbabi, 2019), an existing framework based on NLP and machine learning techniques. As shown in Table 3, our framework outperformed the Maldy framework (Karbab and Debbabi, 2019) with an improvement of 1.47% (99.07% vs. 97.60%) in detection accuracy. API-MalDetect also performed well against a TextCNN-based malware detection approach presented by Qin et al. (2020), revealing the potential of combining TextCNN (also known as 1-D CNN) with BiGRU in our framework. Moreover, the proposed framework also shows better performance than other malware detection techniques presented by Amer et al. (2022), Mathew and Ajay Kumara (2020), Liu and Wang (2019), Li et al. (2022a), Yesir and Soğukpinar (2021), Xiaofeng et al. (2019), and Avci et al. (2023), to mention a few.

5.5.2. Performance of API-MalDetect under various experimental conditions

As previously mentioned, we have tested the performance of API-MalDetect using testset1, testset2, and testset3. We have examined the effect of the API call sequence length (n) on the performance of our framework. Moreover, the results obtained on testset2 also allow us to examine if the performance of API-MalDetect can be affected by


increasing malware in the test set, i.e., when there are 20% of malware and 80% of benign samples in the test set. The results in Tables 4–6 are based on testset1, while the results in Tables 7, 8, and 9 are generated using testset2. Additionally, Tables 10–12 present the performance of API-MalDetect on testset3, with API calls from one of the existing datasets mentioned in Section 5.3. Throughout our evaluations, five lengths of API call sequences were considered (20, 40, 60, 80, 100); however, the proposed framework can handle other lengths of API call sequences beyond the mentioned ones. For simplicity, we have kept the embedding size (embedding dimension) at 10, and the evaluations show good performance.

Accordingly, Table 4 shows the detection/classification report achieved by API-MalDetect on unseen sequences of API calls. Looking at the results, we can see that API-MalDetect successfully classified malicious and benign API call sequences, with a detection accuracy of 96.56% based on a sequence length of 20 API calls. Accuracies of 97.32%, 97.77%, and 98.06% were also achieved based on sequence lengths of 40, 60, and 80, respectively. The highest accuracy (98.81%) was obtained using a length of 100. Interestingly, the testing accuracy varies in accordance with the value of n, revealing the effect of API call sequence length on performance, i.e., as n increases, the accuracy also increases. This is shown by the accuracy improvement of 2.25% (98.81% − 96.56%) achieved by increasing the value of n from 20 to 100. Table 4 also shows the false positive rate (FPR) and false negative rate (FNR) obtained on the same test set, which demonstrate that only a few malware samples were misclassified by our framework, resulting in a low FNR of 0.15% obtained while testing the framework on API call sequences with a length of 100. Nevertheless, a high FPR of 3.05% was observed with a sequence length of 20. However, there is a reduction in the false positive rate as the value of n increases. In general, the framework is able to identify malicious API call sequences with a low FNR and FPR.

Fig. 6. Effect of sequence length on (a) Accuracy and (b) F1-score (with data taken from Tables 4 and 5) and (c) ROC Curve obtained while testing the proposed framework using testset1.

Table 5
Precision, recall, and F1-score obtained when testing API-MalDetect on testset1.
Sequence length (n)    Predicted class    Precision    Recall    F1-Score
20                     Benign             0.9991       0.9661    0.9823
                       Malware            0.7644       0.9925    0.8636
40                     Benign             0.9975       0.9727    0.9849
                       Malware            0.7988       0.9776    0.8792
60                     Benign             0.9983       0.9768    0.9875
                       Malware            0.8250       0.9851    0.8980
80                     Benign             0.9992       0.9793    0.9891
                       Malware            0.8418       0.9925    0.9110
100                    Benign             0.9983       0.9884    0.9933
                       Malware            0.9041       0.9851    0.9429

Table 5 presents the precision, recall, and F1-score generated while testing API-MalDetect, which also demonstrate better performance when detecting both malware and benign samples on unseen data. For instance, a precision of 0.9991 and 0.7644 was obtained using an API call sequence length of 20 for benign and malware detection, respectively. Similarly, using the same length (n=20), API-MalDetect achieves a recall of 0.9925 and an F1-Score of 0.8636 for malware detection. In general, the classification reports for precision, recall, and F1-score obtained using different values of n reveal the better performance of our framework when distinguishing malicious API calls from benign ones. Our framework has a good measure of separability between both classes (it performs well in identifying malware attacks) and can deal with long sequences. The higher the value of precision (with 1 being the highest), the better the performance of a given detection model. As shown in Table 5, it is also important to highlight that in many cases the precision, recall, and F1-score increase with the value of n. Table 6 provides the macro and weighted averages for precision, recall, and F1-score achieved by the proposed framework; they were computed from the classification report presented in Table 5 and also show the better performance of API-MalDetect, with a high macro F1 of 0.9681 and weighted F1 of 0.9883.

Fig. 6(a) and (b) illustrate how accuracy and F1-score change based on different values of n, while Fig. 6(c) depicts the area under the ROC Curve (AUC or AUC-ROC) values obtained while classifying malware and benign files. Accordingly, all AUC values are close to one, with the highest value being 0.9988, revealing how API-MalDetect can effectively distinguish malware from benign files’ activities. Tables 7, 8, and 9 present the performance of API-MalDetect obtained on testset2 (with 80% benign and 20% malware). In particular, the classification report in Table 7 shows an improvement in accuracy when detecting previously unseen API call sequences of benign and malware. For instance, the accuracy was improved from 98.81% (obtained using testset1) to 99.07%. More importantly, the results also demonstrate that there is a reduction in both FPR and FNR, which gives assurance that the proposed framework is able to identify malicious activities based on the executable’s API call sequences even when the number of malware samples exceeds the realistic ratio (which is 10% in our case). The classification reports in Tables 8 and 9 also show performance improvements in macro F1, weighted F1, and other metrics such as recall and precision.

Fig. 7(a) illustrates how the detection accuracy was improved based on the ratio of malware and benign in testset1 (where 10% represent malware) and testset2 (where 20% represent malware) with n = 100. Fig. 7(b) and (c) also show how the false positive rate and false negative rate vary based on the ratio of malware in the test set (10% and 20%). The results show a trade-off between FPR and FNR, i.e., when FPR increases, FNR decreases, and vice-versa. For example, the FPR decreased from 1.04% to 0.66% (refer to Fig. 7(b)) while the FNR increased from 0.15% to 0.26% using API call sequences with n = 100 from


Table 6
Macro and the weighted average for precision, recall, and F1-score achieved by API-MalDetect on testset1.
Sequence length (n)    Macro P    Macro R    Macro F1    Weighted P    Weighted R    Weighted F1
20 0.8818 0.9793 0.9230 0.9757 0.9687 0.9705
40 0.8981 0.9752 0.9321 0.9776 0.9732 0.9744
60 0.9117 0.9810 0.9427 0.9810 0.9777 0.9785
80 0.9205 0.9859 0.9500 0.9835 0.9806 0.9813
100 0.9512 0.9867 0.9681 0.9889 0.9881 0.9883
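The macro and weighted averages in Table 6 follow Eqs. (17)–(22), and the AUC in Eq. (23) can be approximated numerically. The sketch below reproduces the n = 100 row of Table 6 from the per-class values in Table 5; the class weights W_i (0.90 benign / 0.10 malware) and the ROC operating points are illustrative assumptions based on the stated testset1 ratio, not values taken from the paper's raw data.

```python
# Macro/weighted averaging (Eqs. (17)-(22)) and a trapezoidal
# approximation of the AUC integral in Eq. (23).

def macro_avg(values):
    # Eqs. (17)/(18): unweighted mean over the N classes.
    return sum(values) / len(values)

def weighted_avg(values, weights):
    # Eqs. (20)/(21): class metrics weighted by the class proportions W_i.
    return sum(w * v for v, w in zip(values, weights))

def f1_from(p, r):
    # Eqs. (19)/(22): harmonic mean of the averaged precision and recall.
    return 2 * p * r / (p + r)

def auc_trapezoid(roc_points):
    # Eq. (23), integrated numerically over sorted (FPR, TPR) points.
    pts = sorted(roc_points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

precision = [0.9983, 0.9041]   # benign, malware (Table 5, n = 100)
recall    = [0.9884, 0.9851]
weights   = [0.90, 0.10]       # assumed benign:malware proportions

macro_p    = macro_avg(precision)              # 0.9512, as in Table 6
weighted_p = weighted_avg(precision, weights)  # ~0.9889, as in Table 6
macro_f1   = f1_from(macro_avg(precision), macro_avg(recall))

roc = [(0.0, 0.0), (0.0104, 0.9985), (1.0, 1.0)]  # toy ROC points
print(round(macro_p, 4), round(weighted_p, 4), round(auc_trapezoid(roc), 4))
```

Note that averaging the per-class F1 scores directly ((0.9933 + 0.9429)/2 = 0.9681) reproduces the macro F1 reported in Table 6; applying Eq. (19) to the macro-averaged precision and recall gives a slightly different value, so the two conventions should not be mixed when comparing results.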

Table 7
Accuracy, false positive rate, false negative rate, and AUC achieved by API-MalDetect using testset2.
Sequence length (n) Accuracy (%) False positive rate (%) False negative rate (%) AUC
20 96.56 3.38 0.07 0.9987
40 97.62 2.18 0.20 0.9992
60 98.08 1.72 0.20 0.9993
80 98.68 1.13 0.20 0.9992
100 99.07 0.66 0.26 0.9992

Fig. 7. Variation of (a) accuracy (b) False positive rate (c) False negative rate based on the ratio of malware in the test set.
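The experiments above vary the API call sequence length n (20, 40, 60, 80, 100), which implies that every raw trace must first be brought to a fixed length before it reaches the encoder. A minimal sketch of that preparation step is shown below; the padding index 0 and the toy token ids are assumptions for illustration, not the paper's actual vocabulary.

```python
# Fixed-length preparation of encoded API-call sequences for a given n.
PAD = 0  # assumed padding token id

def to_fixed_length(seq, n):
    """Truncate a sequence of API-call token ids to n, or right-pad it with PAD."""
    return seq[:n] + [PAD] * max(0, n - len(seq))

trace = [7, 3, 19, 4, 2]          # toy encoded API-call trace
print(to_fixed_length(trace, 3))  # [7, 3, 19]
print(to_fixed_length(trace, 8))  # [7, 3, 19, 4, 2, 0, 0, 0]
```

In a Keras-style pipeline the same effect is typically obtained with a sequence-padding utility applied batch-wise before the embedding layer.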

testset1 and testset2, respectively. Note that the data in Fig. 7 are taken from Tables 4 and 7.

As mentioned in Section 5.3, the performance of the proposed framework was also tested on one of the existing datasets of API calls (testset3). Accordingly, Table 10 shows the accuracy, false positive rate, false negative rate, and AUC achieved by API-MalDetect on unseen samples from testset3. The results also show that the performance increases as the length of the API call sequence increases. For example, the accuracy increased from 96.52% (with n = 20) to 98.29% (where n = 100). In addition, API-MalDetect can detect unseen malware with a low FPR (1.12%) and FNR (0.59%). Tables 11 and 12 present the obtained macro and weighted averages for precision, recall, and F1-score, which also show a good performance of the proposed framework. More importantly, the overall performance achieved in the various experimental evaluations proves that the proposed framework can potentially detect known and unseen malware attacks based on sequences of API calls. This gives our framework the ability to deal with malware attacks in Windows systems.

Table 8
Precision, recall, and F1-Score achieved while examining the performance of API-MalDetect on testset2.
Sequence length (n)    Predicted class    Precision    Recall    F1-Score
20                     Benign             0.9991       0.9578    0.9780
                       Malware            0.8551       0.9967    0.9205
40                     Benign             0.9975       0.9727    0.9849
                       Malware            0.9006       0.9901    0.9432
60                     Benign             0.9975       0.9785    0.9879
                       Malware            0.9200       0.9901    0.9537
80                     Benign             0.9975       0.9859    0.9917
                       Malware            0.9462       0.9901    0.9676
100                    Benign             0.9967       0.9917    0.9942
                       Malware            0.9675       0.9868    0.9770

5.5.3. Time complexity analysis of API-MalDetect

In this work, we have also analyzed the detection time complexity based on the input size of the API call sequence (denoted by n) which is fed to the proposed API-MalDetect framework to detect unseen malicious files. That is, the detection time taken by API-MalDetect is measured with respect to the value of n, and our analysis is based on the work in Cormen et al. (2022). We refer to the input size (also called the input sequence length) as the total number of API calls in each sequence. Accordingly, we have measured the execution time, and the results are visualized in Fig. 8(a), (b), and (c). Looking at the training time, there is an increase in the training time as n increases. For instance, an execution time of 35.721 s was taken while training API-MalDetect on sequences of API calls with n = 20, while 220.780 s were taken with n = 100. The training and testing time gap is mainly due to the sequence length.

As shown in Fig. 8, the total number of operations performed by API-MalDetect increases if n is increased. In other words, the detection time taken by the proposed framework increases with the growth of n. The best detection case will always occur if the framework detects and classifies malware attacks based on the first twenty API calls (where n = 20) invoked by a particular malware file while running on the victim’s system/device. On the other hand, the detection time taken by the framework to discover malicious files (malware) also increases as the length of API call sequences becomes longer. Hence, the worst-case detection time of the proposed framework will likely occur when processing longer sequences.

The average detection time (i.e., the average performance behavior in terms of detection time) achieved by API-MalDetect when detecting unseen malicious files from our dataset is 0.298 s. It is computed by combining the detection times of all input sizes (n = 20, 40, 60, 80, and 100 in this case). In addition, there is an average detection time of 0.107 s taken when testing API-MalDetect on another dataset of API call sequences (testset3). Therefore, it is important to note that despite a slight increase in detection time when dealing with longer


Fig. 8. (a) Training time taken when training API-MalDetect (b) and (c) Testing time taken when testing the framework on various test sets.
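The timing methodology behind Fig. 8 can be sketched as repeatedly running the detector over increasing sequence lengths n and recording wall-clock time. The `detect` stub below is a hypothetical stand-in for the trained model whose work grows linearly with n (one pass over the API-call sequence); it is not API-MalDetect itself.

```python
import time

def detect(seq):
    # Stand-in detector: a single linear pass over the n encoded calls.
    score = 0.0
    for token in seq:
        score += (token % 7) * 1e-4
    return score > 0.5

def time_detection(n, repeats=200):
    """Wall-clock time for `repeats` detections on a length-n sequence."""
    seq = list(range(n))
    start = time.perf_counter()
    for _ in range(repeats):
        detect(seq)
    return time.perf_counter() - start

for n in (20, 40, 60, 80, 100):
    print(n, f"{time_detection(n):.6f}s")
```

Averaging the measured times over all tested n values yields the kind of mean detection time reported in the text (0.298 s on our dataset).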

Fig. 9. Explaining the predicted outcome: API-MalDetect predicts that a sequence of API calls is malicious, and LIME highlights the API calls in the sequence that contributed to the final prediction. This can help security analysts make decisions and assess whether to trust API-MalDetect’s predictions.

Fig. 10. An example of explanation of the classification outcome generated by LIME when classifying a malware file with API-MalDetect.

Fig. 11. Explanation of the classification (predicted) outcome generated by LIME when classifying a benign file with the API-MalDetect framework.
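The perturbation idea underlying the explanations in Figs. 9–11 can be illustrated with a toy leave-one-out attribution: drop each API call from the sequence and measure how the malicious probability changes. The scoring function and per-call weights below are hypothetical stand-ins for the trained API-MalDetect model, not values produced by LIME on our data.

```python
# Toy leave-one-out perturbation attribution over an API-call sequence.
SUSPICIOUS = {"NtDelayExecution": 0.30, "CreateToolhelp32Snapshot": 0.25,
              "NtOpenProcess": 0.20}  # assumed per-call suspicion weights

def malicious_prob(seq):
    # Stand-in classifier: bounded sum of per-call suspicion scores.
    return min(1.0, sum(SUSPICIOUS.get(api, 0.0) for api in seq))

def explain(seq):
    """Attribute the prediction to each call via leave-one-out deltas."""
    base = malicious_prob(seq)
    # Drop every occurrence of the call and record the probability change.
    return {api: round(base - malicious_prob([a for a in seq if a != api]), 2)
            for api in seq}

seq = ["NtDelayExecution", "SearchPathW", "NtOpenProcess"]
print(explain(seq))
# {'NtDelayExecution': 0.3, 'SearchPathW': 0.0, 'NtOpenProcess': 0.2}
```

LIME proper fits a local linear surrogate over many random perturbations rather than a single leave-one-out pass, but the interpretation of the resulting weights — how much each API call pushes the prediction toward or away from "malicious" — is the same.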


Table 9
Macro and the weighted average for precision, recall, and F1-score obtained by API-MalDetect on testset2.
Sequence length (n)    Macro P    Macro R    Macro F1    Weighted P    Weighted R    Weighted F1
20 0.9271 0.9773 0.9493 0.9704 0.9656 0.9665
40 0.9490 0.9814 0.9641 0.9781 0.9762 0.9766
60 0.9587 0.9843 0.9708 0.9820 0.9808 0.9811
80 0.9718 0.9880 0.9797 0.9872 0.9868 0.9869
100 0.9821 0.9892 0.9856 0.9909 0.9907 0.9908

Table 10
Accuracy, false positive rate, false negative rate, and AUC achieved by API-MalDetect using testset3.
Sequence length (n) Accuracy (%) False positive rate (%) False negative rate (%) AUC
20 96.52 2.61 0.87 0.9594
40 97.03 1.96 1.01 0.9355
60 97.10 2.10 0.80 0.9586
80 97.17 2.17 0.65 0.9620
100 98.29 1.12 0.59 0.9752
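The accuracy, FPR, and FNR columns of Tables 4, 7, and 10 derive from raw confusion-matrix counts. A minimal sketch using the standard per-class definitions is shown below; the counts are illustrative assumptions (roughly the 10% realistic malware ratio), not the paper's actual confusion matrices.

```python
# Accuracy, false positive rate, and false negative rate from
# confusion-matrix counts (malware = positive class).
def rates(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),   # benign samples wrongly flagged as malware
        "fnr": fn / (fn + tp),   # malware samples missed by the detector
    }

# Toy test set: 1000 benign, 100 malware samples.
r = rates(tp=99, tn=989, fp=11, fn=1)
print({k: round(v, 4) for k, v in r.items()})
```

Under these definitions, the FPR/FNR trade-off discussed for Fig. 7 follows directly: lowering the decision threshold converts misses (fn) into false alarms (fp), and vice versa.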

sequences, API-MalDetect does not take a huge amount of time to detect malware attacks, which is a necessity for any robust and efficient anti-malware detection system. API-MalDetect can still identify malicious files within a reasonably short time, making it ideal for high-speed malware detection on Windows desktop devices.

Table 11
Precision, recall, and F1-Score obtained while examining the performance of API-MalDetect on testset3.
Sequence length (n)    Predicted class    Precision    Recall    F1-Score
20                     Benign             0.9901       0.9710    0.9805
                       Malware            0.7778       0.9130    0.8400
40                     Benign             0.9886       0.9783    0.9834
                       Malware            0.8212       0.8986    0.8581
60                     Benign             0.9910       0.9767    0.9838
                       Malware            0.8141       0.9203    0.8639
80                     Benign             0.9926       0.9758    0.9842
                       Malware            0.8113       0.9348    0.8687
100                    Benign             0.9933       0.9876    0.9905
                       Malware            0.8944       0.9407    0.9170

5.5.4. Understanding API-MalDetect prediction with LIME

It is often complicated to understand the prediction/classification outcomes of deep learning models, given the many parameters they use when making predictions. Therefore, in contrast to previous malware detection techniques based on API calls, we have integrated LIME (Ribeiro et al., 2016) into our proposed behavior-based malware detection framework, which helps to understand the predictions. Local interpretable model-agnostic explanations (LIME) is an automated framework/library with the potential to explain or interpret the predictions of deep learning models. More importantly, in the case of text-based classification, LIME interprets the predicted results and then reveals the importance of the most highly influential words (tokens) which contributed to the predicted results. This works well for our proposed framework, as we are dealing with sequences of API calls that represent malware and benign executable files. The LIME framework was chosen because it is open source, widely cited, and highly rated on GitHub.

Fig. 9 shows how LIME works to provide an explanation/interpretation of a given prediction on an API call sequence. LIME explains the framework’s predictions at the data-sample level, allowing security analysts/end-users to interpret the framework’s predictions and make decisions based on them. LIME works by perturbing the input of the data samples to understand how the prediction changes, i.e., LIME considers a deep learning model as a black box and discovers the relationships between the input and output which are represented by the model (Ribeiro et al., 2016; Hulstaert, 2022). The output produced by LIME is a list of explanations showing the contributions of each feature to the final classification/prediction of a given data sample. This produces local interpretability and allows security practitioners to discover which API call feature changes (in our case) will have the most impact on the predicted output. LIME computes the weight probabilities of each API call in the sequence and highlights the individual API calls that led to the final prediction of a particular sequence of API calls representing a malware or benign EXE file.

For instance, in Fig. 10, Process32NextW, NtDelayExecution, Process32FirstW, CreateToolhelp32Snapshot, and NtOpenProcess are portrayed as the API calls contributing most to the final classification of the sequence as ‘‘malicious’’, while the SearchPathW, LdrGetDllHandle, SetFileAttributesW, and GetUserNameW API calls weigh against the final prediction. Another example showing LIME output is presented in Fig. 11, where the API call features that led to the correct classification of a benign file are assigned weight probabilities which are summed up to give a total weight of 0.99. API calls such as IsDebuggerPresent, FindResourceW, and GetSystemWindowsDir are among the most influential API calls that contribute to the classification of the sequence/file into its respective class (benign in this case).

The screenshots of LIME explanations presented in Figs. 10 and 11 are generated in HTML, as it produces clearer visualizations than other visualization tools such as Matplotlib. In addition, the weights are interpreted by applying them to the prediction probability. For instance, if the API calls IsDebuggerPresent and FindResourceW are removed from the sequences, we expect API-MalDetect to classify the benign sequence with a probability of 0.35 (0.99 − 0.38 − 0.26 = 0.35). Ideally, the interpretation/explanation produced by LIME is a local approximation of the API-MalDetect framework’s behaviors, allowing it to reveal what happened inside the black box. It is crucial to note that in Fig. 11, the tokens under ‘‘Text with highlighted words’’ represent the original sequence of API calls, while the number before each API call (e.g., 0 for RegCreateKeyExW) corresponds to its index in the sequence. The highlighted tokens show those API calls which contributed to the classification of the sequence. Thus, having this information, a security analyst can decide whether the model’s prediction should be trusted or not.

6. Limitations and future work

In our future work, we intend to extend our dataset of API calls to include more features from benign files and newly released malware variants (the current dataset size was limited due to hardware limitations). We also plan to evaluate the proposed framework on API call features extracted from Android applications, where features will be extracted from APK files through dynamic analysis. Although we can currently explain the results using LIME, in some cases LIME can be unstable, as it depends on the random sampling of new features/perturbed features (Molnar, 2020). LIME also ignores correlations between features, as data points are sampled from a Gaussian distribution (Molnar, 2020). Therefore, we plan to explore and compare explanation insights


Table 12
Macro and the weighted average for precision, recall, and F1-score obtained by API-MalDetect on testset3.
Sequence length (n)    Macro P    Macro R    Macro F1    Weighted P    Weighted R    Weighted F1
20 0.8840 0.9420 0.9102 0.9689 0.9652 0.9664
40 0.9049 0.9384 0.9208 0.9719 0.9703 0.9709
60 0.9026 0.9485 0.9239 0.9733 0.9710 0.9718
80 0.9020 0.9553 0.9264 0.9745 0.9717 0.9726
100 0.9439 0.9642 0.9537 0.9834 0.9829 0.9831

produced by other frameworks such as Anchor (Ribeiro et al., 2018) and ELI5 (MIT, 2022), which also interpret deep learning models. The proposed framework will also be assessed against adversarial samples (Peng et al., 2021), which can fool malware detection models based on API calls. While this work was focused on dynamic malware analysis, it is also important to note that extracting dynamic features requires more resources and can be much more costly compared to static feature extraction. Another potential drawback of dynamic malware analysis is that some sophisticated malware can detect the virtual analysis environment and, in some cases, present behaviors which are different from their real behaviors when running in a real/production environment.

7. Conclusion

The paper proposed a deep learning-based framework called API-MalDetect for detecting malware attacks in Windows systems. The framework used an NLP-based encoder for API calls and a hybrid automatic feature extractor based on deep learning techniques such as CNNs and BiGRUs to extract features from raw and long sequences of API calls. The paper also introduced practical experimental factors for training and testing malware detection techniques to avoid temporal bias and spatial bias in the experiments. Additionally, it integrated LIME into the framework to provide local interpretability and explainability for predictions. The proposed framework achieved high detection accuracy, with an F1-score of 0.99 on the training set and 0.98 on the unseen data, demonstrating its effectiveness in detecting both existing and new malware attacks with high performance. We also evaluated our approach against several state-of-the-art methods, achieving better or comparable results in terms of accuracy, precision, recall, and F1-score. Overall, the proposed framework showed promising results in detecting malware attacks in Windows systems using deep learning-based techniques.

CRediT authorship contribution statement

Pascal Maniriho: Conceptualization, Methodology, Implementation, Investigation, Validation, Writing – original draft, Writing – review & editing. Abdun Naser Mahmood: Conceptualization, Validation, Writing – review & editing, Supervision, Funding acquisition. Mohammad Jabed Morshed Chowdhury: Conceptualization, Validation, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

Abbasi, M.S., Al-Sahaf, H., Mansoori, M., Welch, I., 2022. Behavior-based ransomware classification: A particle swarm optimization wrapper-based approach for feature selection. Appl. Soft Comput. 121, 108744.
Alazab, M., Venkataraman, S., Watters, P., 2010. Towards understanding malware behaviour by the extraction of API calls. In: 2010 Second Cybercrime and Trustworthy Computing Workshop. pp. 52–59. http://dx.doi.org/10.1109/CTC.2010.8.
Amer, E., Samir, A., Mostafa, H., Mohamed, A., Amin, M., 2022. Malware detection approach based on the swarm-based behavioural analysis over API calling sequence. In: 2022 2nd International Mobile, Intelligent, and Ubiquitous Computing Conference. MIUCC, IEEE, pp. 27–32.
Amer, E., Zelinka, I., 2020. A dynamic windows malware detection and prediction method based on contextual understanding of API call sequence. Comput. Secur. 92, http://dx.doi.org/10.1016/j.cose.2020.101760.
Amer, E., Zelinka, I., El-Sappagh, S., 2021. A multi-perspective malware detection approach through behavioral fusion of api call sequence. Comput. Secur. 110, 102449.
Ammar Ahmed E. Elhadi, B.I.A.B., 2013. Improving the detection of malware behaviour using simplified data dependent API call graph. Int. J. Secur. Appl. 7, 29–42.
Anon, 2019. MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Comput. Secur. 83, 208–233. http://dx.doi.org/10.1016/j.cose.2019.02.007.
Anon, 2021a. Cuckoo sandbox - automated malware analysis. https://cuckoosandbox.org/. (Accessed 14 May 2021).
Anon, 2021b. Free software downloads and reviews for Windows, Android, Mac, and iOS – CNET download. https://download.cnet.com/. (Accessed 15 December 2021).
Anon, 2021c. GitHub: Dataset containing malware and goodware collected in cyberspace over years. https://github.com/fabriciojoc/brazilian-malware-dataset. (Accessed 16 December 2021).
Anon, 2021d. VirusTotal - Home. https://www.virustotal.com/gui/home/upload. (Accessed 29 September 2021).
Anon, 2022a. Obfuscated files or information, technique T1027 - enterprise | MITRE ATT&CK®. https://attack.mitre.org/techniques/T1027/. (Accessed 08 July 2022).
Anon, 2022b. VirusTotal - Home. https://www.virustotal.com/gui/home/upload. (Accessed 04 August 2022).
Anon, 2023a. Embedding layer. https://keras.io/api/layers/core_layers/embedding/. (Accessed 23 February 2023).
Anon, 2023b. GitHub - leocsato/detector_mw: Optimizer for malware detection. Api calls sequence of benign files are provided. https://github.com/leocsato/detector_mw. (Accessed 15 January 2023).
Anon, 2023c. HCRL - [HIDE]APIMDS-dataset. https://ocslab.hksecurity.net/apimds-dataset. (Accessed 20 June 2023).
Anon, 2023d. Tokenizer base class. https://keras.io/api/keras_nlp/tokenizers/tokenizer/. (Accessed 23 February 2023).
Anon, 2023e. What is a signature and how can I detect it? https://home.sophos.com/en-us/security-news/2020/what-is-a-signature. (Accessed 11 February 2023).
Apruzzese, G., Colajanni, M., Ferretti, L., Guido, A., Marchetti, M., 2018. On the effectiveness of machine and deep learning for cyber security. In: 2018 10th International Conference on Cyber Conflict. CyCon, pp. 371–390. http://dx.doi.org/10.23919/CYCON.2018.8405026.
Avci, C., Tekinerdogan, B., Catal, C., 2023. Analyzing the performance of long short-term memory architectures for malware detection models. Concurr. Comput.: Pract. Exper. e7581.
Blokhin, K., Saxe, J., Mentis, D., 2013. Malware similarity identification using call graph based system call subsequence features. In: 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops. pp. 6–10. http://dx.doi.org/10.1109/ICDCSW.2013.55.
Bostami, B., Ahmed, M., 2020. Deep learning meets malware detection: An investigation. In: Fadlullah, Z.M., Khan Pathan, A.-S. (Eds.), Combating Security Challenges in the Age of Big Data: Powered By State-of-the-Art Artificial Intelligence Techniques. pp. 137–155. http://dx.doi.org/10.1007/978-3-030-35642-2_7.
Catak, F.O., Yazı, A.F., Elezaj, O., Ahmed, J., 2020. Deep learning based sequential model for malware analysis using Windows exe API Calls. PeerJ Comput. Sci. 6, e285. http://dx.doi.org/10.7717/peerj-cs.285.
Ceschin, F., Pinage, F., Castilho, M., Menotti, D., Oliveira, L.S., Gregio, A., 2018. The need for speed: An analysis of brazilian malware classifiers. IEEE Secur. Priv. 16 (6), 31–41.


Chaganti, R., Ravi, V., Pham, T.D., 2022. Image-based malware representation approach with EfficientNet convolutional neural networks for effective malware classification. J. Inf. Secur. Appl. 69, 103306.
Chen, X., Hao, Z., Li, L., Cui, L., Zhu, Y., Ding, Z., Liu, Y., 2022. CruParamer: Learning on parameter-augmented API sequences for malware detection. IEEE Trans. Inf. Forensics Secur. 17, 788–803.
Chng, M.Z., 2023. Using activation functions in neural networks - MachineLearningMastery.com. https://machinelearningmastery.com/using-activation-functions-in-neural-networks. (Accessed 07 June 2023).
Cho, I.K., Kim, T.G., Shim, Y.J., Ryu, M., Im, E.G., 2016. Malware analysis and classification using sequence alignments. Intell. Autom. Soft Comput. 22 (3), 371–377. http://dx.doi.org/10.1080/10798587.2015.1118916.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Chung, J., Gulcehre, C., Cho, K., Bengio, Y., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Cisco, 2020. Cisco Annual Internet Report (2018–2023) White Paper.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P., 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2022. Introduction to Algorithms. MIT Press.
Ding, Y., Xia, X., Chen, S., Li, Y., 2018. A malware detection method based on family behavior graph. Comput. Secur. 73, 73–86. http://dx.doi.org/10.1016/j.cose.2017.10.007.
Drapkin, A., 2022. Over 100 million pieces of malware were made for Windows users in 2021. https://tech.co/news/windows-users-malware. (Accessed 14 June 2022).
Fesseha, A., Xiong, S., Emiru, E.D., Diallo, M., Dahou, A., 2021. Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya. Information 12 (2), 52.
Fukushima, K., 1979. Neural network model for a mechanism of pattern recognition unaffected by shift in position-neocognitron. IEICE Tech. Rep. A 62 (10), 658–665.
Gibert, D., Mateu, C., Planes, J., 2020. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J. Netw. Comput. Appl. 153, http://dx.doi.org/10.1016/j.jnca.2019.102526.
Gibert, D., Planes, J., Mateu, C., Le, Q., 2022. Fusing feature engineering and deep learning: A case study for malware classification. Expert Syst. Appl. 207, 117957.
Han, W., Xue, J., Wang, Y., Liu, Z., Kong, Z., 2019. MalInsight: A systematic profiling based malware detection framework. J. Netw. Comput. Appl. 125, 236–250. http://dx.doi.org/10.1016/j.jnca.2018.10.022.
Hellal, A., Mallouli, F., Hidri, A., Aljamaeen, R.K., 2020. A survey on graph-based methods for malware detection. In: 2020 4th International Conference on Advanced Systems and Emergent Technologies. pp. 130–134. http://dx.doi.org/10.1109/IC_ASET49463.2020.9318301.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hubel, D.H., Wiesel, T.N., 1968. Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195 (1), 215–243.
Huda, S., Abawajy, J., Alazab, M., Abdollalihian, M., Islam, R., Yearwood, J., 2016. Hybrids of support vector machine wrapper and filter based framework for malware detection. Future Gener. Comput. Syst. 55, 376–390. http://dx.doi.org/10.1016/j.future.2014.06.001.
Hulstaert, L., 2022. Understanding model predictions with LIME | by Lars Hulstaert.
Kiranyaz, S., Avci, O., Abdeljaber, O., Ince, T., Gabbouj, M., Inman, D.J., 2021. 1D convolutional neural networks and applications: A survey. Mech. Syst. Signal Process. 151, 107398. http://dx.doi.org/10.1016/j.ymssp.2020.107398.
Lajevardi, A.M., Parsa, S., Amiri, M.J., 2022. Markhor: malware detection using fuzzy similarity of system call dependency sequences. J. Comput. Virol. Hacking Techn. 18 (2), 81–90.
Le, Q., Boydell, O., Mac Namee, B., Scanlon, M., 2018. Deep learning at the shallow end: Malware classification for non-domain experts. Digit. Investig. 26, S118–S126.
Li, C., Lv, Q., Li, N., Wang, Y., Sun, D., Qiao, Y., 2022a. A novel deep framework for dynamic malware detection based on api sequence intrinsic features. Comput. Secur. 116, 102686.
Li, S., Zhou, Q., Zhou, R., Lv, Q., 2022b. Intelligent malware detection based on graph convolutional network. J. Supercomput. 78 (3), 4182–4198.
Liu, Y., Wang, Y., 2019. A robust malware detection system using deep learning on API calls. In: 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference. ITNEC, IEEE, pp. 1456–1460.
Liu, F., Zheng, L., Zheng, J., 2020. HieNN-DWE: A hierarchical neural network with dynamic word embeddings for document level sentiment classification. Neurocomputing 403, 21–32. http://dx.doi.org/10.1016/j.neucom.2020.04.084.
Lynn, H.M., Pan, S.B., Kim, P., 2019. A deep bidirectional GRU network model for biometric electrocardiogram classification based on recurrent neural networks. IEEE Access 7, 145395–145405.
Mandelbaum, A., Shalev, A., 2016. Word embeddings and their use in sentence classification tasks. arXiv:1610.08229.
Maniath, S., Ashok, A., Poornachandran, P., Sujadevi, V., Sankar A.U., P., Jan, S., 2017. Deep learning LSTM based ransomware detection. In: 2017 Recent Developments in Control, Automation & Power Engineering. RDCAPE, pp. 442–446.
Maniriho, P., 2022. MalbehavD-V1: A new Dataset of API calls extracted from Windows PE files of benign and malware. https://github.com/mpasco/MalbehavD-V1. (Accessed 07 April 2022).
Maniriho, P., Mahmood, A.N., Chowdhury, M.J.M., 2022. A study on malicious software behaviour analysis and detection techniques: Taxonomy, current trends and challenges. Future Gener. Comput. Syst. 130, 1–18. http://dx.doi.org/10.1016/j.future.2021.11.030.
Mathew, J., Ajay Kumara, M., 2020. API call based malware detection approach using recurrent neural network—LSTM. In: Intelligent Systems Design and Applications: 18th International Conference on Intelligent Systems Design and Applications, Vol. 1. ISDA 2018, Held in Vellore, India, December 6-8, 2018, Springer, pp. 87–99.
Medsker, L., Jain, L.C., 1999. Recurrent Neural Networks: Design and Applications. CRC Press.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A., 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635.
Microsoft, 2021. Programming reference for the Win32 API - Win32 apps | Microsoft Docs. https://docs.microsoft.com/en-us/windows/win32/api/. (Accessed 01 January 2021).
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781.
Mimura, M., 2023. Impact of benign sample size on binary classification accuracy. Expert Syst. Appl. 211, 118630.
Mimura, M., Ito, R., 2022. Applying NLP techniques to malware detection in a practical environment. Int. J. Inf. Secur. 21 (2), 279–291.
Mira, F., Brown, A., Huang, W., 2016. Novel malware detection methods by using LCS and LCSS. In: 2016 22nd International Conference on Automation and Computing.
| Towards Data Science. https://towardsdatascience.com/understanding-model- ICAC, pp. 554–559. http://dx.doi.org/10.1109/IConAC.2016.7604978.
predictions-with-lime-a582fdff3a3b. (Accessed 03 July 2022). MIT, 2022. GitHub - TeamHG-Memex/eli5: A library for debugging/inspecting machine
Jadon, S., 2020. A survey of loss functions for semantic segmentation. In: 2020 IEEE learning classifiers and explaining their predictions. https://github.com/TeamHG-
Conference on Computational Intelligence in Bioinformatics and Computational Memex/eli5. (Accessed 16 July 2022).
Biology. CIBCB, IEEE, pp. 1–7. Molnar, C., 2020. Interpretable Machine Learning. Lulu. com.
Jha, S., Fredrikson, M., Christodoresu, M., Sailer, R., Yan, X., 2013. Synthesizing Moraffah, R., Karami, M., Guo, R., Raglin, A., Liu, H., 2020. Causal interpretability for
near-optimal malware specifications from suspicious behaviors. In: 2013 8th machine learning-problems, methods and evaluation. ACM SIGKDD Explor. Newsl.
International Conference on Malicious and Unwanted Software: ‘‘The Americas’’. 22 (1), http://dx.doi.org/10.1145/3400051.3400058.
MALWARE, pp. 41–50. http://dx.doi.org/10.1109/MALWARE.2013.6703684. Morato, D., Berrueta, E., Magaña, E., Izal, M., 2018. Ransomware early detection by
Jing, C., Wu, Y., Cui, C., 2022. Ensemble dynamic behavior detection method for the analysis of file sharing traffic. J. Netw. Comput. Appl. 124, 14–32.
adversarial malware. Future Gener. Comput. Syst. 130, 193–206. Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., Elovici, Y., 2009.
Jovanovic, B., 2022. A not-so-common cold: Malware statistics in 2022. https:// Unknown malcode detection and the imbalance problem. J. Comput. Virol. 5,
dataprot.net/statistics/malware-statistics/. (Accessed 14 June 2022). 295–308.
Karbab, E.B., Debbabi, M., 2019. Maldy: Portable, data-driven malware detection Naik, N., Jenkins, P., Savage, N., Yang, L., Boongoen, T., Iam-On, N., 2021. Fuzzy-
using natural language processing and machine learning techniques on behavioral import hashing: A static analysis technique for malware detection. Forensic Sci.
analysis reports. Digit. Investig. 28, S77–S87. http://dx.doi.org/10.1016/j.diin. Int. Digit. Investig. 37, 301139. http://dx.doi.org/10.1016/j.fsidi.2021.301139.
2019.01.017. Nair, V.P., Jain, H., Golecha, Y.K., Gaur, M.S., Laxmi, V., 2010. Medusa: Metamorphic
Khan, S., Rahmani, H., Shah, S.A.A., Bennamoun, M., 2018. A guide to convolutional malware dynamic analysis usingsignature from api. In: Proceedings of the 3rd
neural networks for computer vision. Synth. Lect. Comput. Vis. 8 (1), 1–207. International Conference on Security of Information and Networks. pp. 263–269.
Ki, Y., Kim, E., Kim, H.K., 2015. A novel approach to detect malware based on API call http://dx.doi.org/10.1145/1854099.1854152.
sequence analysis. Int. J. Distrib. Sens. Netw. 11 (6), http://dx.doi.org/10.1155/ Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R.,
2015/659101. Muharemagic, E., 2015. Deep learning applications and challenges in big data
Kim, Y., 2014. Convolutional neural networks for sentence classification. arXiv:1408. analytics. J. Big Data 2 (1), 1–21. http://dx.doi.org/10.1186/s40537-014-0007-7.
5882. Nappa, A., Rafique, M.Z., Caballero, J., 2015. The MALICIA dataset: identification
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint and analysis of drive-by download operations. Int. J. Inf. Secur. 14 (1), 15–33.
arXiv:1412.6980. http://dx.doi.org/10.1007/s10207-014-0248-7.


Nawaz, M.S., Fournier-Viger, P., Nawaz, M.Z., Chen, G., Wu, Y., 2022. MalSPM: Metamorphic malware behavior analysis and classification using sequential pattern mining. Comput. Secur. 118, 102741.
Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y., 2014. Novel active learning methods for enhanced PC malware detection in windows OS. Expert Syst. Appl. 41 (13), 5843–5857.
Nunes, M., Burnap, P., Rana, O., Reinecke, P., Lloyd, K., 2019. Getting to the root of the problem: A detailed comparison of kernel and user level data for dynamic malware analysis. J. Inf. Secur. Appl. 48, 102365.
Pan, Z.-P., Feng, C., Tang, C.-J., 2016. Malware classification based on the behavior analysis and back propagation neural network. In: 3rd Annual International Conference on Information Technology and Applications, Vol. 7. ITA 2016, pp. 1–5. http://dx.doi.org/10.1051/itmconf/20160702001.
Pei, X., Yu, L., Tian, S., 2020. AMalNet: A deep learning framework based on graph convolutional networks for malware detection. Comput. Secur. 93, 101792. http://dx.doi.org/10.1016/j.cose.2020.101792.
Pektaş, A., Acarman, T., 2017. Classification of malware families based on runtime behaviors. J. Inf. Secur. Appl. 37, 91–100. http://dx.doi.org/10.1016/j.jisa.2017.10.005.
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L., 2019. TESSERACT: Eliminating experimental bias in malware classification across space and time. In: 28th USENIX Security Symposium. USENIX Security 19, pp. 729–746.
Peng, X., Xian, H., Lu, Q., Lu, X., 2021. Semantics aware adversarial malware examples generation for black-box attacks. Appl. Soft Comput. 109, 107506.
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. EMNLP, pp. 1532–1543.
Pinhero, A., M L, A., P, V., Visaggio, C., N, A., S, A., S, A., 2021. Malware detection employed by visualization and deep neural network. Comput. Secur. 105, 102247. http://dx.doi.org/10.1016/j.cose.2021.102247.
Pirscoveanu, R.S., Hansen, S.S., Larsen, T.M.T., Stevanovic, M., Pedersen, J.M., Czech, A., 2015. Analysis of malware behavior: Type classification using machine learning. In: 2015 International Conference on Cyber Situational Awareness, Data Analytics and Assessment. CyberSA, http://dx.doi.org/10.1109/CyberSA.2015.7166115.
Pittaras, N., Giannakopoulos, G., Papadakis, G., Karkaletsis, V., 2020. Text classification with semantically enriched word embeddings. Nat. Lang. Eng. 1–35. http://dx.doi.org/10.1017/S1351324920000170.
PyPI, 2021. PyPI · The Python Package Index. https://pypi.org/. (Accessed 22 August 2021).
Qin, B., Wang, Y., Ma, C., 2020. API call based ransomware dynamic detection approach using textCNN. In: 2020 International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering. ICBAIE, IEEE, pp. 162–166.
Rafique, M.F., Ali, M., Qureshi, A.S., Khan, A., Mirza, A.M., 2020. Malware classification using deep learning based feature extraction and wrapper based feature selection technique. arXiv:1910.10958.
Ribeiro, M.T., Singh, S., Guestrin, C., 2016. "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144.
Ribeiro, M.T., Singh, S., Guestrin, C., 2018. Anchors: High-precision model-agnostic explanations. In: AAAI Conference on Artificial Intelligence. AAAI.
Rieck, K., Trinius, P., Willems, C., Holz, T., 2011. Automatic analysis of malware behavior using machine learning. J. Comput. Secur. 19 (4), 639–668. http://dx.doi.org/10.3233/JCS-2010-0410.
Ruth, C., 2023. Over 95% of all new malware threats discovered in 2022 are aimed at Windows - Atlas VPN. https://atlasvpn.com/blog/over-95-of-all-new-malware-threats-discovered-in-2022-are-aimed-at-windows. (Accessed 11 February 2023).
Sai, K.N., Thanudas, B., Sreelal, S., Chakraborty, A., Manoj, B., 2019. MACA-I: A malware detection technique using memory management API call mining. In: TENCON 2019-2019 IEEE Region 10 Conference. TENCON, IEEE, pp. 527–532.
Saket, S., 2021. Count vectorizer vs TFIDF vectorizer | Natural language processing. LinkedIn. https://www.linkedin.com/pulse/count-vectorizers-vs-tfidf-natural-language-processing-sheel-saket/. (Accessed 03 October 2021).
Sethi, K., Tripathy, B.K., Chaudhary, S.K., Bera, P., 2017. A novel malware analysis for malware detection and classification using machine learning algorithms. In: SIN '17: Proceedings of the 10th International Conference on Security of Information and Networks. pp. 107–116. http://dx.doi.org/10.1145/3136825.3136883.
Sharma, N., Sharma, R., Jindal, N., 2021. Machine learning and deep learning applications: a vision. Glob. Transitions Proc. 2 (1), 24–28. http://dx.doi.org/10.1016/j.gltp.2021.01.004, 1st International Conference on Advances in Information, Computing and Trends in Data Engineering (AICDE - 2020).
Silberschatz, A., Galvin, P.B., Gagne, G., 2018. Operating System Concepts, tenth ed. Wiley, p. 1259.
Singh, J., Singh, J., 2020. Detection of malicious software by analyzing the behavioral artifacts using machine learning algorithms. Inf. Softw. Technol. 121, http://dx.doi.org/10.1016/j.infsof.2020.106273.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), 1929–1958.
Stenne, M.A., 2021. Quick introduction to Windows API. https://users.physics.ox.ac.uk/~Steane/cpp_help/winapi_intro.htm. (Accessed 14 May 2021).
Suaboot, J., Tari, Z., Mahmood, A., Zomaya, A.Y., Li, W., 2020. Sub-curve HMM: A malware detection approach based on partial analysis of API call sequences. Comput. Secur. 92, 1–15. http://dx.doi.org/10.1016/j.cose.2020.101773.
Sukul, M., Lakshmanan, S.A., Gowtham, R., 2022. Automated dynamic detection of ransomware using augmented bootstrapping. In: 2022 6th International Conference on Trends in Electronics and Informatics. ICOEI, pp. 787–794.
Sun, Z., Rao, Z., Chen, J., Xu, R., He, D., Yang, H., Liu, J., 2019. An opcode sequences analysis method for unknown malware detection. In: Proceedings of the 2019 2nd International Conference on Geoinformatics and Data Analysis. pp. 15–19. http://dx.doi.org/10.1145/3318236.3318255.
Tekerek, A., Yapici, M.M., 2022. A novel malware classification and augmentation model based on convolutional neural network. Comput. Secur. 112, 102515.
Tien, C.-W., Chen, S.-W., Ban, T., Kuo, S.-Y., 2020. Machine learning framework to analyze IoT malware using ELF and opcode features. Digit. Threat. Res. Pract. 1 (1), 1–19.
Tirumala, S.S., Valluri, M.R., Nanadigam, D., 2020. Evaluation of feature and signature based training approaches for malware classification using autoencoders. In: 2020 International Conference on COMmunication Systems and NETworkS. COMSNETS, pp. 1–5. http://dx.doi.org/10.1109/COMSNETS48256.2020.9027373.
Tran, T.K., Sato, H., 2017. NLP-based approaches for malware classification from API sequences. In: 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems. IES, pp. 101–105. http://dx.doi.org/10.1109/IESYS.2017.8233569.
Udayakumar, N., Anandaselvi, S., Subbulakshmi, T., 2017. Dynamic malware analysis using machine learning algorithm. In: 2017 International Conference on Intelligent Sustainable Systems. ICISS, IEEE, pp. 795–800. http://dx.doi.org/10.1109/ISS1.2017.8389286.
Uppal, D., Sinha, R., Mehra, V., Jain, V., 2014. Exploring behavioral aspects of API calls for malware identification and categorization. In: 2014 International Conference on Computational Intelligence and Communication Networks. pp. 824–828. http://dx.doi.org/10.1109/CICN.2014.176.
Vemparala, S., Troia, F.D., Visaggio, C.A., Austin, T.H., Stamp, M., 2019. Malware detection using dynamic birthmarks. arXiv:1901.07312.
Vukotić, V., Raymond, C., Gravier, G., 2016. A step beyond local observations with a dialog aware bidirectional GRU network for spoken language understanding. In: Interspeech.
Wang, J., Bao, R., Zhang, Z., Zhao, H., 2022. Rethinking textual adversarial defense for pre-trained language models. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2526–2540.
Wüchner, T., Cisłak, A., Ochoa, M., Pretschner, A., 2019. Leveraging compression-based graph mining for behavior-based malware detection. IEEE Trans. Dependable Secure Comput. 16 (1), 99–112. http://dx.doi.org/10.1109/TDSC.2017.2675881.
Xiaofeng, L., Fangshuo, J., Xiao, Z., Shengwei, Y., Jing, S., Lio, P., 2019. ASSCA: API sequence and statistics features combined architecture for malware detection. Comput. Netw. 157, 99–111. http://dx.doi.org/10.1016/j.comnet.2019.04.007.
Xue, J., Wang, Z., Feng, R., 2022. Malicious network software detection based on API call. In: 2022 8th Annual International Conference on Network and Information Systems for Computers. ICNISC, IEEE, pp. 105–110.
Yaqub, M., Feng, J., Zia, M.S., Arshid, K., Jia, K., Rehman, Z.U., Mehmood, A., 2020. State-of-the-art CNN optimizer for brain tumor segmentation in magnetic resonance images. Brain Sci. 10 (7), 427.
Yesir, S., Soğukpinar, İ., 2021. Malware detection and classification using fastText and BERT. In: 2021 9th International Symposium on Digital Forensics and Security. ISDFS, IEEE, pp. 1–6.
Yuan, S., Wu, X., 2021. Deep learning for insider threat detection: Review, challenges and opportunities. Comput. Secur. 104, 102221. http://dx.doi.org/10.1016/j.cose.2021.102221.
Zelinka, I., Amer, E., 2019. An ensemble-based malware detection model using minimum feature set. MENDEL 25 (2), 1–10. http://dx.doi.org/10.13164/mendel.2019.2.001.
Zhang, S.-H., Kuo, C.-C., Yang, C.-S., 2019. Static PE malware type classification using machine learning techniques. In: 2019 International Conference on Intelligent Computing and Its Emerging Applications. ICEA, IEEE, pp. 81–86. http://dx.doi.org/10.1109/ICEA.2019.8858297.

Pascal Maniriho received his B.Tech with Honors in Information and Communication Technology from Umutara Polytechnic, Rwanda, and a Master's degree in Computer Science from Institut Teknologi Sepuluh Nopember (ITS), Indonesia, in 2013 and 2018, respectively. He has been working in academia in Information Technology since 2019. He is currently pursuing his Ph.D. degree in cybersecurity at La Trobe University, Australia. His research interests include malware detection, data theft prevention, information security, machine learning and deep learning.


Abdun Naser Mahmood received the B.Sc. degree in applied physics and electronics, and the M.Sc. (research) degree in computer science from the University of Dhaka, Bangladesh, in 1997 and 1999, respectively, and the Ph.D. degree from the University of Melbourne, Australia, in 2008. He is currently an Associate Professor with the Department of Computer Science, School of Engineering and Mathematical Sciences, La Trobe University. His research interests include data mining techniques for scalable network traffic analysis, anomaly detection, and industrial SCADA security. He is a senior member of the IEEE.

Dr. Mohammad Jabed Morshed Chowdhury is currently working as an Associate Lecturer at La Trobe University, Melbourne, Australia. He earned his Ph.D. at Swinburne University of Technology, Melbourne, Australia, and holds a double Master's degree in Information Security and Mobile Computing from the Norwegian University of Science and Technology, Norway, and the University of Tartu, Estonia, under the European Union's Erasmus Mundus Scholarship Program. He has published his research in top venues including TrustComm, HICSS, and REFSQ. His current research focuses on security, privacy, and trust, with published work on blockchain and cybersecurity.

