1. Introduction
Ransomware (RW) is malware that prevents access to a computer system or its data until a ransom is paid. It spreads primarily via phishing emails and exploited system vulnerabilities, and it has a serious negative impact on individuals and companies that rely on computer systems daily [1,2,3].
In general, ransomware can be divided into two main types. The first type is called locker ransomware. It aims to deny access to a computer system but does not encrypt files. This type of RW blocks users from the system interface and locks them out of their work environments and applications [4]. The second type is called crypto ransomware. It encrypts valuable data in the system, such as documents and media files, and it renders them inaccessible without a decryption key. This is the dominant form of RW because of its devastating effect on data integrity [4].
The effects of ransomware attacks extend far beyond the ransom payment. Operational interruptions can cause major productivity losses for organizations, particularly in vital industries such as healthcare [1,2,5]. In addition, victims may experience intense psychological effects, including feelings of anxiety and violation [2]. Ransomware is a profitable business for attackers because the costs of downtime, data loss, and system recovery frequently outweigh the ransom payment itself [6]. The rate of ransomware attacks has increased significantly; in 2017, an attack occurred somewhere in the world every 40 s, and by 2019, this frequency had escalated to every 19 s [7]. Financial losses due to ransomware attacks were $8 billion in 2018 and over $20 billion by 2021 [8]. Ransom demands range from a few hundred dollars for personal computers to up to a million dollars for enterprises [9], with victims facing potential losses of hundreds of millions of dollars if they do not pay. The first reported death following a ransomware attack occurred at a German hospital in October 2020 [10].
Given the sophisticated and evolving nature of ransomware, understanding its mechanics and impacts is crucial. This includes recognizing how it can infiltrate systems, the variety of its types, and the extensive consequences of attacks. Therefore, effective detection and mitigation strategies are essential when malicious activity starts. This paper contributes to these efforts by employing deep learning techniques to detect and analyze ransomware based on system behavior and response patterns within the first few seconds of its activity.
Deep learning (DL) is an excellent tool for spotting subtle and complicated patterns in data, which is important for detecting zero-day ransomware attacks [11,12]. Once trained, deep learning models can process enormous amounts of data at rates faster than human analysts, making them well suited for real-time threat identification. These models can also identify new and evolving threats over time. However, large, well-labeled datasets are necessary for effective deep learning applications, and their preparation can be costly and time-consuming [13]. Additionally, there is a risk that models will overfit, which would hinder their ability to generalize to fresh, unseen data. Finally, training deep learning models demands substantial computational resources, which can be an obstacle for some organizations [14].
Ransomware often performs operations repeatedly, for example, scanning files and encrypting multiple directories. This behavior implies that RW exhibits consistent and detectable behavioral patterns. These patterns subtly evolve with each RW variant, presenting an ideal use case for deep learning models, especially those designed for sequence analysis. Moreover, the relative ease of modifying existing ransomware toolkits allows attackers to rapidly develop new variants [15]. Deep learning’s capability to learn from incremental data adjustments makes it highly effective at identifying slight deviations from known behaviors, offering a robust defense against an ever-evolving ransomware landscape.
In this paper, we present a new dataset and a method for early ransomware detection. Our contribution is three-fold. First, we have created a comprehensive dataset featuring a wide array of initial API call sequences from commonly used benign and verified crypto-ransomware processes. This dataset is unique not only in its verification process, ensuring that all included ransomware samples are 100% validated as crypto-ransomware, but also in the depth of data recorded for each API call. It includes detailed information such as the result of each call, its duration, and the parameters involved. The public release of this dataset will make it a useful tool for researchers, enabling them to make even more progress in ransomware detection and stronger protection system development. Second, we have conducted a detailed comparative analysis of various neural network configurations and dataset features. This analysis aims to determine the most effective neural network model and feature set for ransomware detection. Third, we detect ransomware processes using only their initial API call sequences, obtaining an efficient method for early ransomware detection.
We examine the following research questions (RQs):
- RQ1:
What API call features are essential for early ransomware detection?
- RQ2:
Do neural models outperform traditional machine learning (ML) models for this task?
- RQ3:
What representation of textual API call features yields better results?
- RQ4:
What number of consecutive API calls from every process is sufficient for state-of-the-art results?
- RQ5:
Are test times for neural models competitive and suitable for online ransomware detection?
Due to the scarcity of available datasets and code, we decided to share both in order to facilitate further research in the field. Both data and code will be publicly available when this paper is published.
2. Background
Traditionally, ransomware detection methods have relied on several key strategies. Signature-based detection is the most common method used in traditional antivirus software. It matches known malware signatures (unique strings of data or characteristics of known malware) against files. While effective against known threats, this method struggles to detect new, unknown ransomware variants [16,17]. Heuristic analysis uses algorithms to examine software or file behavior for suspicious characteristics. This method can potentially identify new ransomware based on behaviors similar to known malware, but its effectiveness depends on the sophistication of the heuristic rules [18]. Behavioral analysis monitors how programs behave and highlights odd behaviors, such as rapid file encryption, that could indicate ransomware. Although these tools need a baseline of typical behavior and can produce false positives, they may identify zero-day ransomware attacks (new and undiscovered threats) [18]. Sandboxing runs files or programs in a virtual environment (sandbox) to observe their behavior without risking the actual system. If malicious activities like unauthorized encryption are detected, the ransomware can be identified before it harms the real environment. However, some advanced ransomware can detect and evade sandboxes [19]. Honeyfiles (decoy files or folders) are placed within a system; monitoring these decoys for unauthorized encryption or modification can signal a ransomware attack. While useful as an early warning system, this approach does not prevent ransomware from infecting genuine files [20].
Although each of these approaches has advantages, each also faces particular difficulties when it comes to ransomware detection. One major obstacle is finding a balance between the necessity for quick, precise detection and the reduction of false positives and negatives. For this purpose, machine learning technologies, especially deep learning (DL), are now used because they provide strong defenses against ransomware and other sophisticated cyber threats. DL is used in malware classification [21], phishing detection [22], anomaly identification [23], and malware detection. By examining the order of operations in a system, which may include odd file-encryption activities, DL models have demonstrated high efficacy in detecting ransomware activities [24,25]. DL can spot subtle and complicated patterns in data, which is important for detecting zero-day ransomware attacks; it can also process enormous amounts of data, making it well suited for real-time threat identification. However, as noted above, large and well-labeled datasets are necessary for effective DL models, and their preparation can be costly and time-consuming [13]. Additionally, there is a risk that models will overfit and fail to generalize to fresh, unseen data. Training DL models demands substantial computational resources, which can be an obstacle for some organizations [14].
Next, we survey some of the most prominent works on ML-based ransomware detection. The study [26] utilizes a dataset consisting of both ransomware and benign software samples collected from 2014 to early 2021. These samples underwent dynamic analysis to document API call sequences, capturing detailed behavioral footprints. The LightGBM model was used to classify the samples and demonstrated exceptional efficacy, achieving an accuracy of 0.987 in classifying software types.
The work [27] presents a sophisticated approach to malware detection by classifying API call sequences using long short-term memory (LSTM) networks, which were not limited to ransomware. The dataset in this paper was sourced from Alibaba Cloud’s cybersecurity efforts, and it contains a comprehensive collection of malware samples, including ransomware. The dataset spans various malware types, and it includes dynamic API call sequences from malware, capturing only the names of the API calls while omitting additional details such as call results or timestamps. API call sequences are mapped from strings into vectors using an API2Vec vectorization method based on Word2Vec [28]. The LSTM-based model of [27] achieved an F1-score of 0.9402 on the test set, and it was shown to be notably superior to traditional machine learning models.
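To make the vectorization idea concrete, the following is a minimal sketch of mapping API call name sequences to averaged Word2Vec vectors with gensim; it is not the API2Vec implementation of [27], and the example sequences, dimensionality, and training settings are illustrative assumptions.

```python
# Minimal sketch of Word2Vec-style embedding of API call name sequences
# (illustrative; not the exact API2Vec implementation of [27]).
import numpy as np
from gensim.models import Word2Vec

# Each process is a sequence of API call names (hypothetical examples).
sequences = [
    ["CreateFile", "ReadFile", "WriteFile", "CloseFile"],
    ["RegOpenKey", "RegQueryValue", "RegCloseKey"],
]

# Train a small skip-gram model on the call sequences.
model = Word2Vec(sentences=sequences, vector_size=64, window=5,
                 min_count=1, sg=1, epochs=50)

def embed_sequence(seq, model):
    """Represent a process as the average of its API call name vectors."""
    vecs = [model.wv[name] for name in seq if name in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

process_vector = embed_sequence(sequences[0], model)
print(process_vector.shape)  # (64,)
```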
The paper [29] introduces an innovative approach to malware detection using deep graph convolutional neural networks (DGCNNs) [30]. It focuses on the capabilities of DGCNNs to process and analyze API call sequences. The dataset used in this work comprises 42,797 malware API call sequences and 1079 goodware API call sequences; only the API call names were recorded. DGCNNs demonstrated comparable accuracy and predictive capabilities to LSTMs, achieving slightly higher F1 scores on the balanced dataset but performing less well on the imbalanced dataset.
The work [31] concentrates on the behavioral analysis of both malicious and benign software through API call monitoring. Instead of analyzing the sequence of API calls, this study employs advanced machine learning techniques to assess the overall frequency and type counts of these API calls. The authors developed two distinct datasets that include a wide variety of ransomware families. The datasets contain only API call names. Several ML algorithms were tested, including k-nearest neighbors (kNN) [32], random forest (RF) [33], support vector machine (SVM) [34], and logistic regression (LR) [35] algorithms. Both LR and SVM exhibited exemplary performance, achieving perfect precision scores of 1.000 and the highest recall rates of 0.987, which correspond to an F1-score of 0.994.
The authors of [29] learn directly from API call sequences and their associated behavioral graphs, and they use longer call sequences (100 API calls) to achieve high classification accuracy and F1 scores on a custom dataset of over 40,000 API call sequences.
The goal of this paper is to improve detection capabilities by combining deep learning with the fundamentals of behavioral analysis. This will reduce the likelihood of false positives and improve the identification of zero-day threats.
3. PARSEC Dataset
3.1. Motivation
One of the primary reasons for opting to collect our data, rather than using pre-built datasets, was the lack of available datasets that include detailed outcomes of API calls. Most publicly available datasets (described in Section 2) typically provide only the names of the API calls made during the execution of malware and benign applications. We made an effort to find a dataset that would fit our research. The reviewed datasets were MalBehavD-V1 [36], the Malware API Call Dataset [37], the Alibaba Cloud Malware Detection Based on Behaviors Dataset [38], and the datasets introduced in papers [29,39]. None of these datasets were suitable for our purposes because they only provided the names of API calls. In our study, we wanted to explore the effect of additional information, such as the result of the API call, its duration, and the parameters it received, to see whether these additional details could improve performance metrics in ransomware detection.
Indeed, for a more nuanced analysis and potentially more effective detection models, it is crucial to consider not only the API calls themselves but also their results, the duration of each call, and their parameters. This is why we present a new dataset named PARSEC, which stands for API calls for ransomware detection.
3.2. Data Collection
We chose to use Windows 7 for malware analysis because, despite being an older operating system, it remains a target for malware attacks due to its widespread use in slower-to-upgrade environments [40]. Therefore, malware analysis on Windows 7 provides insights into threats still exploiting older systems. Additionally, many malware variants designed to target Windows platforms maintain compatibility with Windows 7 due to its architectural similarities with newer versions. Our method and results are also applicable to the Windows 10 OS and its server counterparts; for our purposes, Windows 7 and Windows 10 are API-compatible, so a sample identified as ransomware on Windows 7 would also be classified as ransomware on Windows 10.
We used Process Monitor [41] (PM) on a Windows 7 Service Pack 1 (SP1) environment within VirtualBox v6.1 [42] to record API calls from both benign and malicious processes. Process Monitor (v3.70) is a sophisticated tool developed by Sysinternals (now part of Microsoft) that can capture detailed API call information [43].
We collected the data for malicious and benign processes separately. For each API call of a process, we recorded the call’s name, result, parameters, and execution time. Then, we filtered the API calls and their parameters from every process to construct our datasets; this procedure is shown in detail below.
Figure 1 shows the pipeline of our data collection.
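As an illustration of this filtering step, the sketch below loads a Process Monitor CSV export and keeps the recorded fields per process. The file name and the column names (Process Name, Operation, Result, Duration, Detail) are assumptions; they correspond to columns that must be enabled in Process Monitor before exporting.

```python
# Sketch: load a Process Monitor CSV export and group API calls per process.
# Column names assume the Operation, Result, Duration, and Detail columns
# were enabled before exporting (configurable in Process Monitor).
import pandas as pd

events = pd.read_csv("procmon_export.csv")  # hypothetical export file

# Keep only the fields used in the dataset.
fields = ["Process Name", "Operation", "Result", "Duration", "Detail"]
events = events[fields]

# Group chronological API calls by process and keep the first N calls of each.
N = 500  # e.g., PARSEC-500
per_process = {
    name: grp.head(N).reset_index(drop=True)
    for name, grp in events.groupby("Process Name", sort=False)
}
print({name: len(df) for name, df in per_process.items()})
```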
Note that our set of ransomware and benign processes is extensive, but it does not contain all possible processes. However, choosing a few representatives from each application group is an acceptable practice in benchmarking, and we consider the selected set of benign applications to be adequately representative. While ransomware can theoretically differ significantly, in practice, it generally follows the same patterns as other malware. There are several notable examples, such as Locky, WannaCry, and Ryuk, from which all others are derived [44,45].
3.3. Benign Processes
We selected a diverse suite of 62 benign processes to capture a broad spectrum of normal computer activities. This selection strategy was aimed at ensuring that our dataset accurately reflects the varied operational behaviors a system might exhibit under different scenarios, including active user interactions and passive background tasks. These processes belong to five main types described below.
Common applications, such as 7zip (v22.01), axCrypt (v2.1.16.36), and CobianSoft (v2.3.10), are renowned for their encryption and backup capabilities. These choices are important for studying legitimate encryption activities, as opposed to the malicious encryptions conducted through ransomware.
Utility and multimedia tools, such as curl (for downloading tasks) and ffmpeg (v.3, for multimedia processing), are crucial for representing standard, non-malicious API call patterns that occur during routine operations.
Office applications like Excel (Office Professional Plus 2010) and Word (Office Professional Plus 2010) reflect common document-handling activities, such as normal document access and modification patterns.
Benchmarking applications such as Passmark (v9) and PCMark7 (v1.4.0) simulate a wide array of system activities, from user engagement to system performance tests. These applications provide a backdrop of benign system-stress scenarios.
Idle-state processes that typically run during the computer’s idle state represent the system’s behavior when it is not actively engaged in user-directed tasks. This category is essential for offering insights into the system’s baseline activities.
3.4. Ransomware Processes
We started from a dataset comprising 38,152 ransomware samples obtained from VirusShare.com [46]. To validate this site’s classification, we employed an automated pipeline to verify the authenticity of these samples as ransomware. The objective was to identify at least 62 ransomware programs within this dataset to match the number of benign processes described in Appendix A.1. The identification pipeline is a multi-stage process designed to differentiate actual ransomware from potential threats. It includes two VirtualBox virtual machines (VMs) and a host machine, each playing a critical role in screening, analyzing behavior, and confirming ransomware candidates. The full ransomware API call collection pipeline is shown in Figure 2.
The first virtual machine (denoted as VM1) starts the process by querying the VirusTotal API for each entry in the “VirusShare_CryptoRansom_20160715” collection, which consists of 38,152 potential samples. Its objective is to filter and prioritize samples based on the frequency of detections via various antivirus engines. Prioritized samples are forwarded to the second virtual machine (denoted as VM2) for a detailed behavioral analysis.
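A hedged sketch of this screening step is shown below. It assumes the VirusTotal v3 file-report endpoint and a valid API key; the detection-count threshold is an illustrative parameter, not the exact criterion used by VM1.

```python
# Sketch of the VM1 screening step: query VirusTotal for each sample hash
# and prioritize samples flagged as malicious by many engines.
# Assumes the VirusTotal API v3 file endpoint and a valid API key.
import requests

API_KEY = "YOUR_VT_API_KEY"  # placeholder
VT_URL = "https://www.virustotal.com/api/v3/files/{}"

def detection_count(sha256: str) -> int:
    """Return the number of engines that flag the sample as malicious."""
    resp = requests.get(VT_URL.format(sha256),
                        headers={"x-apikey": API_KEY}, timeout=30)
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return stats.get("malicious", 0)

def prioritize(hashes, threshold=20):
    """Keep samples detected by at least `threshold` engines (illustrative)."""
    scored = [(h, detection_count(h)) for h in hashes]
    scored = [item for item in scored if item[1] >= threshold]
    return [h for h, _ in sorted(scored, key=lambda x: x[1], reverse=True)]
```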
VM2 receives prioritized samples (one by one) from VM1 and executes each in a secure, controlled setting. It focuses on detecting encryption attempts targeting a “honey spot,” which refers to a deliberately crafted and strategically placed element within a system or network designed to attract ransomware or malicious activities [47]. All API calls made during execution are recorded. If a sample is confirmed as ransomware (i.e., it encrypts the “honey spot”), VM2 compresses the API call data into an Excel file, packages it with WinRAR, and sends it back to VM1.
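The honey-spot check in VM2 can be approximated as follows; the decoy paths and the hash-comparison criterion are assumptions made for illustration, not the exact detection logic of the pipeline.

```python
# Sketch of the VM2 "honey spot" check: a sample is treated as ransomware
# if it modifies (e.g., encrypts), renames, or deletes any decoy file.
# Decoy paths and the hash-comparison criterion are illustrative assumptions.
import hashlib
from pathlib import Path

DECOYS = [Path(r"C:\Users\Public\Documents\report.docx"),
          Path(r"C:\Users\Public\Pictures\family.jpg")]

def snapshot(paths):
    """SHA-256 hash of every decoy file before the sample runs."""
    return {p: hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}

def honey_spot_triggered(before, paths):
    """True if any decoy was modified, renamed, or deleted after execution."""
    for p in paths:
        if not p.exists():
            return True  # decoy deleted or renamed (common ransomware behavior)
        if hashlib.sha256(p.read_bytes()).hexdigest() != before[p]:
            return True  # decoy content changed, e.g., encrypted
    return False
```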
The host machine maintains a consistent testing environment by resetting VM2 after each analysis. It gathers the compressed Excel files containing API call data from confirmed ransomware samples and compiles them into a single list of verified programs. After running for two weeks, this process resulted in a dataset of 62 validated ransomware programs out of the initial 38,152 candidates.
3.5. Dataset Features
From the collected API calls of the PARSEC dataset, we generated several datasets that differ in the number of API calls taken from each process. We selected the N initial API calls of every process to enable our models to detect malicious processes upon their startup; here, N is a parameter. The aim of our approach is the early detection of ransomware processes. If a process executed fewer API calls than required for the dataset, we performed data augmentation using oversampling: we replicated randomly chosen sequences of API calls until the required length was reached, which ensures a consistent sequence length across the dataset (a sketch of this step is shown below).
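The following is a minimal sketch of this oversampling step, assuming each process trace is a list of API call records; replicating randomly chosen calls one at a time is one possible reading of the augmentation described above, and the trace and seed are illustrative.

```python
# Sketch of the oversampling step: pad a short process trace to N API calls
# by replicating randomly chosen calls from the same process.
import random

def oversample(calls, n, seed=0):
    """Return exactly n API call records, replicating random calls if needed."""
    rng = random.Random(seed)
    if len(calls) >= n:
        return calls[:n]
    padded = list(calls)
    while len(padded) < n:
        padded.append(rng.choice(calls))  # replicate a random existing call
    return padded

trace = ["CreateFile", "ReadFile", "CloseFile"]  # hypothetical short trace
print(len(oversample(trace, 500)))  # 500
```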
We selected numbers of API calls between 500 and 5000 to evaluate the potential for early ransomware detection based on limited API calls. This also helped us understand the implications of dataset size on the efficiency of our models. Note that the dataset size primarily affects the duration of training. Larger volumes of data extend the training time but may result in models that are better at generalizing across different ransomware behaviors. Conversely, smaller datasets reduce the training time but might limit the model’s comprehensiveness in learning varied ransomware patterns. This balance is crucial for developing practical, deployable models that can be updated and retrained as new ransomware threats emerge. The naming convention for dataset variations is PARSEC-N, where N is the number of initial API calls included for each process. Therefore, we have PARSEC-N datasets for values of N ranging from 500 to 5000.
The API features we recorded include process operation, process result, process duration, and process detail features (a full list of these features appears in Appendix A.2). We denote these feature lists as Ops, Res, Dur, and Det, meaning operation, result, duration, and detail features. In the basic setup, we started with the operation features only and then extended the list by adding the result features, followed by the API execution times and detail features. By starting with basic features and incrementally adding complexity, we isolated the impact of each feature type on the models’ performance. We denote as FLIST the list of features used in the dataset; it accepts the values Ops (process operation features), OpsRes (process operation and result features), OpsResDur (process operation, result, and duration features), and OpsResDurDet (process operation, result, duration, and detail features).
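The four FLIST configurations can be expressed as column subsets over a per-call table, as sketched below; the column names are illustrative and assume one row per API call.

```python
# Sketch: the four FLIST configurations as column subsets of a per-call
# pandas DataFrame (column names are illustrative).
FLIST = {
    "Ops":          ["Operation"],
    "OpsRes":       ["Operation", "Result"],
    "OpsResDur":    ["Operation", "Result", "Duration"],
    "OpsResDurDet": ["Operation", "Result", "Duration", "Detail"],
}

def select_features(df, flist_name):
    """Keep only the columns belonging to the chosen FLIST configuration."""
    return df[FLIST[flist_name]]
```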
3.6. Data Representation
API call names, results, and execution times were directly extracted from the raw data without modification. Process detail features are long strings representing the parameters passed to each API call in a semi-structured format. Each parameter is delimited with a semicolon (“;”), with key–value pairs within these parameters separated by a colon (“:”). The value of each key varies, ranging from numbers to single words or even phrases. To accurately interpret and utilize this information, we implemented a detailed extraction process (a short sketch of these steps follows the list below):
First, we separated and extracted each parameter and its corresponding key–value pairs.
Then, we filtered out identifiable information: parameters that could serve as identifiers or indicate specific timestamps were removed to maintain the integrity of the dataset and ensure privacy compliance. The full list of these parameters can be found in Appendix A.2.
We filled in missing data with sequences of zeros. Due to the heterogeneous nature of API calls, each call may be associated with a different number of parameters; therefore, API calls with missing parameters were systematically padded with zeros.
After feature extraction, we normalized the numerical features (such as execution times) using min-max normalization.
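As referenced above, here is a short sketch of the parsing, padding, and normalization steps. It assumes semicolon-delimited "key: value" detail strings; the identifier keys to drop and the feature ordering are illustrative assumptions.

```python
# Sketch of the Detail-string parsing, zero-padding, and normalization steps.
# Assumes semicolon-delimited "key: value" pairs; the identifier keys to drop
# and the feature order are illustrative.
import numpy as np

DROP_KEYS = {"ProcessId", "ThreadId", "CreationTime"}  # illustrative identifiers

def parse_detail(detail: str) -> dict:
    """Split 'key: value; key: value' into a dict, dropping identifier keys."""
    pairs = {}
    for part in detail.split(";"):
        if ":" not in part:
            continue
        key, value = part.split(":", 1)
        key, value = key.strip(), value.strip()
        if key not in DROP_KEYS:
            pairs[key] = value
    return pairs

def to_vector(pairs: dict, keys: list, length: int) -> np.ndarray:
    """Fixed-length numeric vector; missing parameters are zero-padded."""
    vec = np.zeros(length)
    for i, key in enumerate(keys[:length]):
        try:
            vec[i] = float(pairs.get(key, 0))
        except ValueError:
            vec[i] = 0.0  # non-numeric values are handled by the text encoders
    return vec

def min_max(column: np.ndarray) -> np.ndarray:
    """Min-max normalization of a numeric feature such as execution time."""
    lo, hi = column.min(), column.max()
    return np.zeros_like(column) if hi == lo else (column - lo) / (hi - lo)
```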
We used 1-hot encoding, FastText [48], and Bidirectional Encoder Representations from Transformers (BERT) sentence embeddings [49] (BERT SE) to represent text features. For the FastText representation, we split all string attributes into separate words according to camel case patterns, punctuation, tabs, and spaces, as in “END OF FILE”; the text was kept in its original case. Then, we extracted the k-dimensional word vector of every word and computed the average vector. We used k-dimensional fastText word vectors pre-trained on English web crawl and Wikipedia data. For the BERT SE representation, the words were split based on camel case and spaces, and then all strings representing words were transformed into lowercase. Then, we applied the pre-trained model bert-base-uncased and extracted vectors of length 768 for every text.
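A minimal sketch of the fastText representation is shown below. It assumes the pre-trained English vectors in cc.en.300.bin, and the regular expression used for camel-case splitting is an illustrative choice rather than the exact tokenizer used in our pipeline.

```python
# Sketch of the fastText representation of a textual API call attribute:
# camel-case splitting followed by averaging of pre-trained word vectors.
# Assumes pre-trained English vectors in "cc.en.300.bin"; the splitting
# regex is an illustrative choice.
import re
import numpy as np
import fasttext

ft = fasttext.load_model("cc.en.300.bin")

def split_words(text: str) -> list:
    """Split on camel case, punctuation, tabs, and spaces; keep original case."""
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)  # camelCase -> camel Case
    return [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]

def embed_text(text: str) -> np.ndarray:
    """Average fastText vector of the words in the attribute string."""
    words = split_words(text)
    if not words:
        return np.zeros(ft.get_dimension())
    return np.mean([ft.get_word_vector(w) for w in words], axis=0)

print(embed_text("QueryBasicInformationFile").shape)  # (300,)
```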
Next, we divided the data into fixed-size windows of size W; we explored four window sizes. To maintain consistency across the dataset and ensure integrity in the windowed structure, we applied zero-padding where necessary. This is particularly important for the final segments of data sequences, which may not be fully populated due to variability in API call frequencies. The full data representation pipeline is depicted in Figure 3.
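The windowing step can be sketched as follows, assuming each API call has already been encoded as a fixed-length feature vector; the window size and feature dimension used in the example are illustrative.

```python
# Sketch of the windowing step: group consecutive encoded API calls into
# fixed-size windows of size W, zero-padding the final incomplete window.
import numpy as np

def make_windows(encoded_calls: np.ndarray, w: int) -> np.ndarray:
    """encoded_calls: (n_calls, n_features) -> (n_windows, w, n_features)."""
    n_calls, n_features = encoded_calls.shape
    n_windows = int(np.ceil(n_calls / w))
    padded = np.zeros((n_windows * w, n_features))
    padded[:n_calls] = encoded_calls
    return padded.reshape(n_windows, w, n_features)

calls = np.random.rand(500, 32)     # 500 encoded API calls, 32 features each
windows = make_windows(calls, w=7)  # W = 7, as used in the experiments
print(windows.shape)                # (72, 7, 32)
```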
3.7. Data Analysis
We performed a visual and numeric analysis of our datasets to assess the quality and behavior of benign and ransomware processes. We focused on two datasets, PARSEC-500 and PARSEC-5000, which represent the smallest and largest numbers of initial API calls taken from each process.
Table 1 contains the number of API calls performed by benign and ransomware processes for the PARSEC-500 and PARSEC-5000 datasets. We omitted from this table the calls that were never performed by ransomware processes (the full list of these calls is provided in Appendix A.3). Surprisingly, the same calls appear whether the first 500 API calls are taken (PARSEC-500) or the first 5000 (PARSEC-5000). It is also evident that, in total, ransomware processes perform many more CloseFile, CreateFile, and IRP_MJ_CLOSE operations than benign processes do. They, however, perform fewer ReadFile operations than benign processes, regardless of the number of system calls recorded.
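Per-operation counts of this kind can be reproduced with a simple aggregation, sketched below under the assumption that the per-call table has illustrative "label" and "Operation" columns with a "ransomware" label value.

```python
# Sketch: compare per-operation API call counts between benign and
# ransomware processes (assumes illustrative 'label' and 'Operation' columns).
import pandas as pd

def operation_counts(calls: pd.DataFrame) -> pd.DataFrame:
    """Rows: operations; columns: total counts per class."""
    table = pd.crosstab(calls["Operation"], calls["label"])
    # Drop operations never performed by ransomware processes (as in Table 1).
    if "ransomware" in table.columns:
        table = table[table["ransomware"] > 0]
        table = table.sort_values("ransomware", ascending=False)
    return table
```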
Next, we performed a visual analysis to reveal distinguishing malware characteristics. For each process, we generated a square image where each pixel represents an API call, color-coded according to the operation performed. The images were plotted with legends associating each color with its respective API call operation. The visual analysis revealed a stark contrast between benign and ransomware processes. Benign processes exhibited a diverse array of patterns, reflecting the wide-ranging legitimate functionalities and interactions within the system. Each benign process presents a unique color distribution, illustrating the variability and complexity of non-malicious software operations. An example is shown in Figure 4. Visualizations of other benign processes appear in Appendix A.4.
In contrast, ransomware processes displayed a more homogenous appearance, with similar color distributions among them. This uniformity suggests a narrower set of operations being executed, which could be indicative of the focused, malicious intent of these processes. Remarkably, the ransomware processes can be grouped into a few distinct types based on the visualization of their operational sequences, suggesting the existence of common strategies employed across different malware samples.
The first type of malware (Figure 5) prominently features operations like QueryBasicInformationFile, ReadFile, and CreateFile in repetitive patterns. The second type of malware (Figure 6) exhibits a more randomized and chaotic distribution of API calls across the images. Finally, the third type of malware (Figure 7) displays a distinct two-part division, possibly indicating a shift from initial setup or reconnaissance to intense malicious activity, such as data manipulation or encryption.
Overall, we visually observed patterns unique to malicious activity, which implies that sequence analysis is useful for malware detection.
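To make the visualization procedure concrete, the following sketch maps operations to integer codes, reshapes them into a square image, and renders the result; the colormap, padding value, and layout are illustrative choices rather than the exact plotting code used for Figures 4–7.

```python
# Sketch of the per-process visualization: each pixel is one API call,
# color-coded by its operation (colormap and layout are illustrative).
import math
import numpy as np
import matplotlib.pyplot as plt

def plot_process(operations, title):
    """Render a process trace as a square image, one pixel per API call."""
    labels = sorted(set(operations))
    code = {op: i for i, op in enumerate(labels)}
    side = math.ceil(math.sqrt(len(operations)))
    img = np.full(side * side, -1)               # -1 marks padding pixels
    img[:len(operations)] = [code[op] for op in operations]
    plt.imshow(img.reshape(side, side), cmap="tab20", interpolation="nearest")
    plt.title(title)
    plt.colorbar(ticks=range(len(labels)), label="operation code")
    plt.show()

# Example with a hypothetical trace:
plot_process(["CreateFile", "ReadFile", "WriteFile", "CloseFile"] * 64,
             "example process")
```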
6. Conclusions
In this paper, we have explored the efficacy of deep learning techniques in the early detection of ransomware through the analysis of API call sequences. We designed and created a comprehensive dataset of initial API call sequences of popular benign processes and verified ransomware processes. We also performed a comprehensive analysis of different baseline and neural-network models applied to the task of ransomware detection on this dataset.
Our investigation has provided substantial evidence that neural network models, especially CNN and LSTM, can be effectively applied to differentiate between benign and malicious system behaviors. We demonstrated that these models outperform traditional ML classifiers (baselines) and a competing method of [29], providing a positive answer to RQ2. Our findings indicate that the inclusion of the result feature for each API call significantly improved the models’ performance, providing a positive answer to RQ1. We also found that 1-hot encoding of text features yielded the best results, answering RQ3. Moreover, we learned that increasing the number W of consecutive API calls used in the analysis improved the classification accuracy and F1-measure, and that setting W = 7 was sufficient to achieve state-of-the-art results, answering RQ4.
Across the various configurations, the combination of operation and result features yielded the best results; a window size of 7 provided optimal performance, and 1-hot encoding (OH) generally outperformed the other encoding methods in terms of accuracy. Finally, we learned that the test times of neural models are suitable for online ransomware detection, which resolves RQ5.
We hope the PARSEC dataset will become a valuable resource for the cybersecurity community and encourage further research in the area of ransomware detection. Our findings contribute to the development of more robust and efficient ransomware detection systems, advancing the field of cybersecurity.
7. Limitations and Future Research Directions
The findings of this paper open several directions for future research, namely (1) the expansion of the dataset to capture a broader spectrum of real user activities and (2) the exploration of real-time detection systems integrated into network infrastructures. The PARSEC dataset, while robust, primarily includes API call sequences from simulated benign and ransomware processes. There is a compelling need to develop a dataset that will include activities from diverse computing environments such as office tasks, multimedia processing, software development, and gaming. Current ransomware detection models largely operate by analyzing static datasets. However, integrating these models into live network systems could facilitate the detection of ransomware as it attempts to execute. This approach would enable a more dynamic and proactive response to ransomware threats.
The limitations of our approach are the challenges associated with using API call features and neural models for ransomware detection. Collecting and labeling a comprehensive dataset of API call sequences from benign and ransomware processes is complex, time-consuming, and resource-intensive. Maintaining dataset quality and relevance as ransomware evolves requires substantial effort and depends on the chosen processes. Neural models, particularly deep learning ones, risk overfitting specific patterns in the training data. This can result in recognizing only known ransomware sequences, rather than general malicious behavior, necessitating extensive and resource-heavy testing to ensure good generalization. We also observed that the selection of processes for the training set had an effect on the performance of the model when shorter API call sequences were used as training data. This means that future applications should be mindful of this phenomenon.