Mitigating the Impact of Malware Evolution on API Sequence-based Windows Malware Detectors

Xingyuan Wei \orcidlink0009-0001-6595-4222 & Ce Li, Qiujian Lv, Ning Li, Degang Sun, Yan Wang^∗ * Corresponding author& These authors contributed equally to this work and should be considered co-first authors.

Abstract

In dynamic Windows malware detection, deep learning models are extensively deployed to analyze API sequences. Methods based on API sequences play a crucial role in malware prevention. However, due to the continuous updates of APIs and the changes in API sequence calls leading to the constant evolution of malware variants, the detection capability of API sequence-based malware detection models significantly diminishes over time. We observe that the API sequences of malware samples before and after evolution usually have similar malicious semantics. Specifically, compared to the original samples, evolved malware samples often use the API sequences of the pre-evolution samples to achieve similar malicious behaviors. For instance, they access similar sensitive system resources and extend new malicious functions based on the original functionalities. In this paper, we propose a frame(MME), a framework that can enhance existing API sequence-based malware detectors and mitigate the adverse effects of malware evolution. To help detection models capture the similar semantics of these post-evolution API sequences, our framework represents API sequences using API knowledge graphs and system resource encodings and applies contrastive learning to enhance the model’s encoder. Results indicate that, compared to Regular Text-CNN, our framework can significantly reduce the false positive rate by 13.10% and improve the F1-Score by 8.47% on five years of data, achieving the best experimental results. Additionally, evaluations show that our framework can save on the human costs required for model maintenance. We only need 1% of the budget per month to reduce the false positive rate by 11.16% and improve the F1-Score by 6.44%.

Index Terms:

API sequence-based malware detection, Malware evolution, API knowledge graph, Contrastive learning, Deep learning.

I Introduction

Malware often executes its malicious activities through a specific sequence of system API calls. Using deep neural networks (DNNs) to analyze and identify these API sequences is proven to be effective in dynamic malware detection [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The research conducted in recent years has focused on achieving high accuracy and minimizing false alarm rates. However, malware detectors are deployed in dynamic environments, where malware variants keep evolving, causing the false negative rate to increase significantly over time [11, 12]. This problem is defined as model aging or concept drift [13]. According to the Kaspersky report in 2019 [14], the false negative rate of a malware detector increased sharply from almost zero to over 20% in just three months. Therefore, mitigating the adverse effects of malware evolution is critical in real malware detection environments.

There appear to be two broad approaches to tackle the malware evolution. The first is to retrain and update detection models with newly labeled samples using online learning [15] or active learning [13], or reject evolved samples until they can be expertly analyzed [16, 17]. However, labeling samples and retraining the model still requires a lot of expert knowledge and computing resources, which incurs a huge cost. The second is to extend the “shelf life” of the model through robust model design and feature space optimization. Malware features are represented to be more robust against temporal bias and reduce the impact of malware evolution [18, 19, 20, 21] However, in existing studies, the raw features extracted mainly focus on the statistical information of static analysis (such as byte histogram, API occurrence, etc), which is not applicable to dynamic detection based on API sequence analysis.

We evaluate and find that the API sequences of malware samples before and after evolution usually have similar malicious semantics. A motivating example about malware Zbot [22, 23] is shown in Figure 1. We reverse two samples (called V1 and V2) and extract the malicious behavior of hiding itself in the registry as a startup entry. This behavior is implemented by calling RegOpenKeyEx, RegSetValueEx, and RegCloseKey in turn, and operating the corresponding system resources (i.e., registry keys and file paths). After evolution, three phenomena can be observed:

1.

V2 replaces RegOpenKeyEx in V1 with RegOpenKeyTransacted, which means V2 uses transactions to perform the same malicious behavior as V1 for stability. Although the API names of RegOpenKeyEx in V2 and RegOpenKeyTransacted in V1 are different, they represent the same behavior. Intuitively, during evolution, the samples often keep the similar behaviors with different implementations using semantically equivalent APIs.
2.

Both V2 and V1 access the similar registry keys (i.e., CurrentVersion\Run) and file directories (i.e., <System>\lowsec). This indicates that the system resources (such as files, registry keys, URLs, etc) accessed during evolutuion are highly similar.
3.

V2 still uses some APIs that used in V1 (i.e., RegSetValueEx and RegCloseKey). Actually, during evolution, malware samples often involve massive code reuse and generate similar API sequence fragments.

Based on these observations, we proposed to mitigate the impact of malware evolution on API sequence-based windows malware detectors. Our insight is to capture the semantic similarities (including equivalent APIs, similar system resources, and similar API fragments) between the API sequences before and after evolution, and reduce the feature gaps caused by evolution, thus slowing down model aging.

In this paper, we design a framework to Mitigate the impact of Malware Evolution (called MME) to enhance the API sequence-based malware detectors from two perspectives. Specifically, to capture the similarity in API sequences and system resource calls, we have developed a novel API embedding method. The API embedding includes both API name embedding and API parameter embedding. For API name embedding, we analyze the Windows API documents [24] and construct a API knowledge graph, which can capture the similarities between APIs and represent API names as semantic feature vectors. For API argument embedding, the system resources operated by each API are extracted from API arguments and represented as fixed feature vectors. These feature vectors are concatenated and inputed to the detection model.

Second, we enhance the model’s attention to the similar API sequence fragments by designing a contrastive learning strategy. In contrastive learning, the encoder of detection model can measure the similarity of two API sequences by calculating the distance between their two embeddings. Our contrastive learning strategy is to make malware closer to samples of the same family in the feature space and farther away from benign samples. Thus, when a malware sample experiences gradual evolution, it can be expected that new samples will be similar to past samples (as they all have similar API sequence fragments) and hence the contrastive encoder may automatically adapt to evolution.

To evaluate our approach, MME is used to enhance two classic API sequence detection models, namely long short term memory networks (LSTM) and text convolutional neural networks (Text-CNN), as too many work use their variants or combinations as detection models [3, 4, 10, 7, 5, 6]. We collect about 76K Windows PE samples spanning from 2017 to 2021. We train the regular models and enhanced models using data in 2017 and evaluate the performance of them from 2018 to 2021. Our evaluation shows that MME can significantly mitigate the model aging of the malware detectors. It reduces the average false negative rate from 22.4% to 10.1% for LSTM, and from 22.7% to 9.6% for TextCNN. Additionly, MME can significantly reduce the amount of human analyst effort required for model periodical retraining maintenance. The number of samples needed to be labeled can be reduced by 24.19%-94.42%. Finally, model ablation analysis and feature stability analysis explore why MME can help the model mitigate the impact of malware evolution.

To summarize, we make the following contributions in this paper:

•

We first observe that the API sequences of malware samples before and after evolution usually have similar malicious semantics including equivalent APIs, similar system resources, and similar API fragments. This provides an opportunity to reduce the feature gaps caused by evolution, and slowing down model aging (§I).
•

We design a framework called MME to enhance the API sequence-based malware detectors (§II). MME contains a new API embedding method to capture the similarities between APIs (§III and §IV), and a contrastive learning strategy to enhance the encoder of the detection model (§V).
•

We apply MME to two widely used Windows malware detection models. The results show that MME can significantly reduce the high false negtive rete caused by malware evolution, thereby slowing down model aging. MME also can significantly save the human labeling efforts when retraining models (§VII).

Refer to caption — Figure 1: An example to show the similar semantics of API sequences before and after evolution.

II Design Overview

2 Figure 2 shows the overview architecture of MME. Generally, a DNN malware detection model consists of three parts: API sequence embedding, encoder, and classifier. First, API sequence embedding represents each raw API sequence as feature vectors (i.e., embedded API sequence) and input them to the encoder. Then, the encoder learns the features and maps each embedded API sequence to the feature space. Finally, the classifier learns the samples in the feature space and outputs the prediction results (i.e., malware or goodware).

Our framework MME focuses on enhancing the API sequence embedding and encoder modules. For API sequence embedding enhancement, we first construct an API knowledge graph which can find semantically equivalent APIs and using graph embedding to represent API names (§III). Then, to capture the system resources operated by each API, we use feature hash embedding to represent the arguments of each API (§IV). For encoder enhancement, we design a contrastive learning strategy to help model learn the similarities of samples in the same malware family, while learning the dissimilarity between malware and goodware (§V). Finally, the enhancement can be achieved simply by adding MME’s API sequence embedding and contrastive learning strategy to the original model, without altering the original model structure.

III API Name Embedding

In this section, we analyze the Windows API documents [24] and construct a API knowledge graph, which can capture the similarities between APIs and represent API names as semantic feature vectors.

We first explain how the konwledge graph can capture the semantic similarity between APIs. The components of API documentation, using RegOpenKeyEx as an example, are shown in Figure 3. Some API-related entities can be extracted from this document, such as action Open (mentioned in the first sentence of the description), prototype RegOpenKey (remove the suffix of the API name), header winreg.h (mentioned in the title), and formal parameters (mentioned in the syntax). Figure 4 shows a small part of the knowledge graph, which captures the relations between the equivalent APIs of RegOpenKeyEx and RegOpenKeyTransacted. Intuitively, these two APIs use the same action, extend from the same prototype, and import from the same header. Besides, they have very similar input/output parameters. That is, these APIs are similar enough in terms of their neighborhoods in the graph. If two APIs are connected to more identical entities, their semantics will become more similar. Therefore, the knowledge graph can capture the similarity between equivalent APIs and then help detectors to detect evolved malware.

In the next subsections, we will introduce the API knowledge graph construction (§III-A) and use graph embedding to represent API names as semantic feature vectors (§III-B).

III-A API Knowledge Graph Construction

III-A1 API Documents Collection

To construct the knowledge graph, the Windows API documents are collected. As shown in Figure 3, each document consists of four parts: title, description, syntax and other information. Among these four parts, title and syntax are structured texts, which contain the basic information of the API (i.e., API name, source header file, class to which it belongs, and function declaration). The description and other information are unstructured texts that contain specific descriptions of API functions and the relationship between the current API and other APIs. We downloads the API documents for Windows 10 from the official website [24] and analyze them to construct an API knowledge graph.

III-A2 Knowledge Graph Construction

The API knowledge graph $G=\langle E,R\rangle$ is defined as a directed graph, where $E$ is the set of all nodes (called entities), and $R$ is the set of all edges (called relations) between two nodes. API knowledge graph is heterogeneous, which means that entities and relations have different types.

There are six types of entities and eight types of relations extracted from API documents to construct the API knowledge graph. Table I lists the specific entities of the graph. For entity extraction, we first consider four basic concepts in Windows API documentation: API, header, class, and parameter. These four entities can be extracted directly from the API documentation. Specifically, API, header, and class can be extracted from the title. The input and output parameters can be extracted from the syntax. Using function RegOpenKeyEx in Figure 3 as an example, the entity API is RegOpenKeyEx and header is winreg.h, which can be extracted from the title. Several parameters (including input parameters hKey, lpSubKey, etc., and an output parameter phkRFesult) are extracted from the syntax. Then, the other two types of entities, namely action and prototype, can be extracted after analyzing the content of the API documentation. Specifically, for each API document, the first sentense of the description is a summary, where the verbs are extracted as the action of the API. The action can reflect the semantic similarity between APIs. For example, the actions of RegOpenKeyEx and RegOpenKeyTransacted are both Open, and the actions of GetFileSize and GetFileType are both Retrieve. Another type of entity that reflects semantic similarity is prototype. We found that many similar APIs are extended from the same prototype by adding various suffixes in order to adapt to different system environments, but their functions have not changed. For example, the APIs of RegOpenKeyA/W, RegOpenKeyExA/W, RegOpenKeyTransactedA/W are extended from the prototype of RegOpenKey. Thus, for each API name, we remove some specific suffixes (including A, W, Ex, Transacted, Advanced, and 0-9) and get its prototype.

TABLE I: Entities

Entity Type	Examples	Related Source	Count
API	RegOpenKeyExA, CreateFileA	title	40,472
header	fileapi.h, winbase.h	title	795
class	IUnknown	title	4,242
parameter	hKey, pSubKey	syntax	27,438
action	Open, Create, Write	description	756
prototype	RegOpenKey, CreateFile	title	3,163

For relations, we extracted a total of eight relations (as shown in Table II). Among them, six types of relations can be directly established after entities extraction:

•

function_of: connects an API to its belonging header or class.
•

inheritance: connects a class entity with its inherited class entity. It can be extract from the “Inheritance” section in the class definition document. The sentence template is “the class inherits from the class”.
•

input: connects an API to its input parameter.
•

output: connects an API to its output parameter.
•

use_action: connects an API to its action.
•

extend_from: connects an API to its prototype.

Furthermore, the remaining two types of relations, namely bundled_with and replaced_by, are used to describe the relationships between APIs. The bundled_with refers to the relation that two APIs must be used at the same time, such as a program must call DestroyWindow once for every time it called CreateWindow. The replaced_by means that the two APIs are functionally equivalent and can be used instead, such as RegOpenKeyEx and RegOpenKeyTransacted. These two types of relations can be derived from the unstructured text within API documentation. However, manually extracting these relations one by one from unstructured text is impractical due to the large number of API documents involved. We have observed that there are common patterns when describing the relations between API entities. These patterns can be summarized with templates and utilize them for relation extraction. The template-based relation extraction involves three steps. Firstly, for all API documents, we employ NLP tools to tokenize each unstructured text into sentences and normalize the sentences. Secondly, we select sentences that contain more than one API entity to form a corpus. Thirdly, we employ a semi-automated strategy to analyze the sentences in the corpus and iteratively formulate templates for relation matching. Table III provides several example templates in regular expression format for relations of bundled_with and replaced_by. The detailed process is as follows:

i) Sentence tokenization and normalization. For each API document, we use spaCy [25] (a Python NLP toolkit) for text processing. We first split the unstructured text into sentences. For each sentence, we check if it is a sentence lacking a subject. If it is, we supplement the subject of the sentence with the corresponding API entity it describes. Then, we employ the coreference resolution [26] to convert pronouns in the sentence into their corresponding entities.

ii) Sentence selection. We employ named entity recognition to extract entities from each sentence, and select sentences that contain more than one API entity to form a corpus. After this step, the scale of the data we need to analyze has been reduced from about 40K API documents to 10K sentences in the cropus.

iii) Template iteratively generation. For each sentence in the corpus, we manually check whether there is bundled_with or replaced_by relation between two API entities. If the answer is no, we remove that sentence from the corpus. Otherwise, we manually formulate a template for the relation and use it for regular expression matching with the all sentences in the corpus. For the sentences that match this template, we extract the corresponding relation from the sentence and remove the sentence from the corpus. Finally, we repeat this process until there is no sentence in the corpus.

In total, 76,886 entities and 215,467 relations are extracted to build the API knowledge graph. If two APIs are connected to more identical entities, their semantics will become more similar. Next, we will use graph embedding to represent API names as semantic feature vectors.

TABLE II: Relations

Relation Type

Entity Connection

Examples

Related Source

Count

function_of

API

\xrightarrow{}

header,

API

\xrightarrow{}

class

RegOpenKeyExA

\xrightarrow{}

winreg.h

title

64,217

inheritance

class

\xrightarrow{}

class

Istream

\xrightarrow{}

Isequentialstream

unstructured text

3,501

input

API

\xrightarrow{}

parameter

RegOpenKeyExA

\xrightarrow{}

hKey

syntax

76,967

output

API

\xrightarrow{}

parameter

RegOpenKeyExA

\xrightarrow{}

phkResult

syntax

22,834

use_action

API

\xrightarrow{}

action

RegOpenKeyExA

\xrightarrow{}

Open

description

38,683

extend_from

API

\xrightarrow{}

prototype

RegOpenKeyExA

\xrightarrow{}

RegOpenKey

title

6,060

bundled_with

API

\xrightarrow{}

API

CreateWindow

\xrightarrow{}

DestroyWindow

unstructured text

421

replaced_by

API

\xrightarrow{}

API

RegOpenKeyEx

\xrightarrow{}

RegOpenKeyTransacted

unstructured text

2,784

TABLE III: Templates to extract ralations of bundled_with and replaced_by

Relation

Example Templates

# of Templates

bundled_with

call API once for every time it called API

for every successful call to API, there should be a … call to API

API must be called at the same depth at which API was called

call API before calling API

replaced_by

To perform …, call API

API is superseded by the API

not necessary to call API when API is called

III-B Graph Embedding

Graph embedding [27, 28, 29] can represent each API in the knowledge graph as a feature vector. Moreover, The semantically similar APIs are represented closer in the feature space. To achieve this, we employed an existing algorithm called TransE [27] and integrated it into our graph embedding problem. Specifically, suppose there is a relation $R$ that connects the entity $E_{a}$ to the entity $E_{b}$ , and they are represented by three vectors: $V_{R}$ , $V_{a}$ , and $V_{b}$ . The core idea of the TransE algorithm is to iteratively adjust these three vectors so that the $V_{a}+V_{R}$ is as close as possible to $V_{b}$ . As a result, APIs with similar semantics will have similar vector representations because they will be related to the same other entities. As a result, the entities with similar semantics will have similar vector representations in the vector space. For example, the two API entities RegOpenKeyEx (denoted as $V_{a1}$ ) and RegOpenKeyTransacted (denoted as $V_{a2}$ ) have the same prototype RegOpenKey (denoted as $V_{b}$ ). Thus, there are extend_from relations (denoted as $V_{R}$ ) connect $V_{a1}$ and $V_{a2}$ to the $V_{b}$ . TransE adjusts these vectors so that the $V_{a1}+V_{R}$ and the $V_{a2}+V_{R}$ are as close as possible to $V_{b}$ . Therefore, the $V_{a1}$ and $V_{a2}$ are represented more similar.

After graph embedding, each API entity in the API knowledge graph is represented as a fixed-length semantic vector. In other words, for the input of the raw API sequence, the API name of each API can be mapped to the corresponding semantic vector using the knowledge graph. Furthermore, when malware undergoes API replacement during its evolution, even though the API names before and after evolution may differ, if API functions are similar, then their semantic vectors will be very close.

IV API Argument Embedding

Based on our observations, malware tends to access similar system resources (such as files, registry keys, etc.) before and after evolution. These accessed resources can be extracted from the hooked API sequences during the software execution. Each API call in the sequence consists of two parts: the API name and the arguments. In this section, we extract the system resources accessed during software execution from the API arguments and represent them as semantic feature vectors. This allows detection models to capture the semantic similarity of samples before and after evolution.

IV-A Extract System Resources from Arguments

Figure 5 shows an example hooked API whose name is NtCreateFile. For the first argument, its type is integer, and the value is 2. For the second argument, its type is string and the value is “C: $\backslash\backslash$ User $\backslash\backslash$ Administrator $\backslash\backslash$ AppData $\backslash\backslash$ ...” which is a accessed file path.

To extract system resources, we consider 5 types of string arguments: file paths, dynamic link library file names (DLLs), registry keys, URLs, and IP addresses. These types of resources are accessed frequently and can be extracted directly from the API sequence. For each API in the API sequence, we use regular expression matching to identify its argument values and extract arguments belonging to the 5 types of resources. Specifically, we use “C: $\backslash\backslash$ ” to identify a file path. The DLLs are arguments ending with “.dll”. The registry keys often start with “HKEY_”. URLs often start with “http”. IPs are those arguments with four numbers (range from 0 to 255) separated by dots. These extracted string arguments are then embedded as feature vectors.

IV-B Feature Hash Embedding

Intuitively, the strings sharing a large number of substrings have very similar meanings. Thus, for each extracted argument, we first parse the whole string into several substrings to capture the hierarchical information. For example, for a path like “C: $\backslash\backslash$ f_a $\backslash\backslash$ f_b”, three substrings are generated by splitting based on “ $\backslash\backslash$ ”, namely “C:”, “C: $\backslash\backslash$ f_a”, “C: $\backslash\backslash$ f_a $\backslash\backslash$ f_b”. The DLLs and registry keys can also be parsed like the file paths. The Urls and IP address can be parsed by splitting based on “.”. For example, for a url “https://sample.sec.org/”, we only generate substrings from its hostname, and the following substrings will be generated “org”, “sec.org”, and “sample.sec.org”.

We use feature hashing [30] to represent each extracted argument as a fixed-length feature vector. Let $S$ denotes an substring set of the extracted string argument, and $s_{j}\in S$ denotes a substring. Let $N$ denotes the number of bins. The value of the $i$ -th bin is calculated by

\phi_{i}^{h,\xi}(S)=\sum_{j:h\left(s_{j}\right)=i}\xi\left(s_{j}\right),

(1)

where $h$ is a hash function that maps the $s_{j}$ to a natural number $n_{1}\in\{1,2,...,N\}$ as the bin index. $\xi$ is another hash function that maps the $s_{j}$ to $n_{2}\in\{\pm 1\}$ . After feature hashing, the extracted argument $S$ is represented as a feature vector $[\phi_{1}^{h,\xi}(S),\phi_{2}^{h,\xi}(S),...,\phi_{N}^{h,\xi}(S)]\in\mathbb{R}% ^{N}$ . For example, for the url “https://sample.sec.org/”, if $N$ = 8 and $S$ = {“org”, “sec.org”, “sample.sec.org”}, then $s_{1}$ = “org”, $s_{2}$ = “sec.org”, $s_{3}$ = “sample.sec.org”. After hash mapping, $h\left(s_{1}\right)$ = 1 (i.e., bin index 1), $\xi\left(s_{1}\right)$ = 1, $h\left(s_{2}\right)$ = 2 (i.e., bin index 2), $\xi\left(s_{2}\right)$ = -1, $h\left(s_{3}\right)$ = 4 (i.e., bin index 4), $\xi\left(s_{3}\right)$ = 1. Thus, $\phi_{1}^{h,\xi}(S)$ = 1, $\phi_{2}^{h,\xi}(S)$ = -1, $\phi_{4}^{h,\xi}(S)$ = 1. The feature vector of the extracted argument is $[1,-1,0,1,0,0,0,0]$ .

The arguments with a large number of shared substrings will have the similar set $S$ and will be represented very similar. In this way, if the malware accesses similar system resources before and after evolution, then their feature vectors will be very close.

At this point, the API sequence embedding enhancement is complete. When a raw API sequence is input to the model, for each API in the sequence, its API name is mapped to the API knowledge graph and represented as an API name semantic vector. Each argument of the API is checked to identify if it is an accessed resource. If so, it is hashed and represented as an API argument semantic vector. These two vectors are concatenated as the API’s feature vector. Finally, the embedded API sequence (i.e., the API feature vector sequence) is input to the encoder of the detection model.

V Contrastive Encoder

Based on our observations, during evolution, malware samples often involve massive code reuse and generate similar API sequence fragments. In this section, we enhance the encoder’s attention to the similar API sequence fragments by designing a contrastive learning strategy. Through contrastive learning, the encoder can measure the similarity of two embedded API sequences by calculating the distance between them, and make malware closer to samples with similar API fragments and farther away from benign samples in the feature space. Thus, when a malware sample experiences gradual evolution, it can be expected that the representation of new samples will be similar to past samples and the contrastive encoder can automatically adapt to evolution.

As shown in Figure 6, given the input samples with feature vectors, the contrastive learning encoder aims to map them into a latent feature space. Before contrastive learning, the evolved malware produce many differences in the feature space, leading the detection model to misclassify it as benign. Then, the contrastive learning optimizes the encoder and generates a latent space. In the latent space, pairs of samples in the same class have a smaller distance, and pairs of samples from different classes have a larger distance. As such, the encoder will pay more attention to the similarities among samples from the same malware family. Any evolved sample that retains similar API fragments to the past samples will be represented as closer, thereby reducing misclassifications of the evolved samples.

V-A Contrastive Learning Strategy

We design a contrastive learning strategy to enhances the encoder’s ability to capture fine-grained similarities and differences among API sequences and improve performance in detecting evolved malicious samples.

Let $x$ be an embedded API sequence. The ground truth binary label is $y\in\{0,1\}$ , where $y=0$ indicates a benign sample, and $y=1$ indicates a malicious sample. Let $y\prime$ be the ground truth multi-class family label. When $y\prime=0$ , the label is benign, but otherwise, it is a malware family label. For the detection model $f$ , after API sequence embedding, the embedded sample $x$ is first input to an encoder $en$ (e.g., LSTM, Text-CNN, etc.), which outputs the representation of the input sample in the latent feature space $z=en(x)$ . Then, a classifier $g$ takes the encoder output and predicts the binary label $f(x)=g(z)=g(en(x))$ .

Let $f(x)=g(en(x))$ be the output of the softmax layer for class $y=1$ (i.e., malware) and the benign softmax output is $1-f(x)$ . If $f(x)\geq 0.5$ , the predicted binary label $\hat{y}$ is $\hat{y}=1$ , and otherwise, $\hat{y}=0$ .

In general, the training loss of a regular model is defined as computing a classification loss between $f(x)$ and $y$ . However, in this paper, we define the training loss is the sum of a contrastive loss and a classification loss, and the detection models are trained end-to-end with this loss. Specifically,

\mathcal{L}=\mathcal{L}_{con}+\lambda\mathcal{L}_{cla}

(2)

where $\mathcal{L}_{cla}$ is the classification loss and $\mathcal{L}_{con}$ is the contrastive loss for enhancing the encoder (defined below). As a common heuristic approach, we use a hyperparameter $\lambda$ to balance the two terms $\mathcal{L}_{con}$ and $\lambda\mathcal{L}_{cla}$ , so that they have a similar mean, thus the overall loss is not overwhelmed by just one term. The classification loss $\mathcal{L}_{cla}$ uses the binary cross entropy loss:

\begin{gathered}\mathcal{L}_{cla}=\sum_{i}\mathcal{L}_{cla}\left(x_{i},y_{i}% \right)\\ \mathcal{L}_{cla}\left(x_{i},y_{i}\right)=-y_{i}\log f\left(x_{i}\right)-\left% (1-y_{i}\right)\log\left(1-f\left(x_{i}\right)\right)\end{gathered}

(3)

where $i$ ranges over indices of samples in the batch.

The contrastive loss $\mathcal{L}_{con}$ computes a similarity over positive and negative pairs of samples in a batch. It tends to maximize the similarity between positive pairs and minimize the similarity between negative pairs. We design a contrastive learning strategy that encourages the encoder $en$ to satisfy the following two properties:

•

Positive pairs: If $x_{1}$ , $x_{2}$ are two benign samples, or two malicious samples in the same malware family, then they are positive pairs, and their representations should be similar: i.e., $\left\|en\left(x_{1}\right)-en\left(x_{2}\right)\right\|_{2}$ should be as small as possible.
•

Negative pairs: If one of $x_{1}$ , $x_{2}$ is malicious and the other is benign, then they are negative pairs, and their representations should be dissimilar: i.e., $\left\|en\left(x_{1}\right)-en\left(x_{2}\right)\right\|_{2}$ should be as large as possible.

Specifically, for a batch of size $2N$ , the first $N$ samples in the batch are sampled randomly, denoted as $\{x_{k},y_{k},y\prime_{k}\}_{k=1...N}$ . Then, we randomly select $N$ more samples which have the same label distribution as the first $N$ samples, i.e., $\{x_{k+N},y_{k+N},y\prime_{k+N}\}_{k=1...N}$ are chosen so that $y_{k}=y_{k+N}$ and $y\prime_{k}=y\prime_{k+N}$ . To capture the positive and negative samples paired with $x_{i}$ , the following sets are defined in the batch:

•

The positive sample set of $x_{i}$ . Both samples are benign or both samples are malicious and in the same malware family:

$Pos\left(x_{i}\right)\equiv\left\{x_{j}\mid y_{j}=y_{i},y_{i}=1\Longrightarrow y% _{j}^{\prime}=y_{i}^{\prime},j\neq i\right\}$
•

The negative sample set of $x_{i}$ . One sample is benign and the other is malicious:

$Neg\left(x_{i}\right)\equiv\left\{x_{j}\mid y_{j}\neq y_{i},j\neq i\right\}$

Intuitively, $Pos\left(x_{i}\right)$ contains samples that are considered similar to $x_{i}$ , and $Neg\left(x_{i}\right)$ contains dissimilar samples to $x_{i}$ .

Let $d_{ij}$ denote the euclidean distance between two arbitrary samples $x_{i}$ and $x_{j}$ in the feature space: $d_{ij}=\left\|en\left(x_{1}\right)-en\left(x_{2}\right)\right\|_{2}$ . Let $m$ denote a fixed margin (a hyperparameter). The contrastive loss is defined as:

\mathcal{L}_{con}=\sum_{x_{i}\in Batch}\mathcal{L}_{con}(x_{i})

(4)

	$\displaystyle\mathcal{L}_{con}(x_{i})$	$\displaystyle=\frac{1}{\left\|Pos\left(x_{i}\right)\right\|}\sum_{x_{j}\in Pos% \left(x_{i}\right)}d_{ij}$		(5)
		$\displaystyle+\frac{1}{\left\|Neg\left(x_{i}\right)\right\|}\sum_{x_{j}\in Neg% \left(x_{i}\right)}\max\left(0,m-d_{ij}\right)$		(5)

The contrastive loss has two terms. The first term asks positive pairs from $Pos\left(x_{i}\right)$ to be close together. These pairs are (benign, benign) or (malicious, malicious) pairs with the same malware family. In this way, the encoder will pay more attention to the similarities among samples in the same class. The evolved samples that retain API fragments similar to past samples will be represented as closer to past samples in the latent space. The second term aims to separate benign and malicious samples from each other, hopefully at least $m$ apart from each other. Thus, the encoder will focus on capturing the differences between benign and malicious samples, and prevent the classifier from misclassifying evolved malware as benign ones.

At this point, the encoder enhancement is complete. A contrastive encoder is constructed using our contrastive learning strategy, without altering the structure of the original model. Finally, The enhanced models are trained end-to-end with the loss $\mathcal{L}$ .

VI Experimental Setup

In this section, we describe the datasets and baseline malware detectors used in our experiments.

VI-A Dataset

In this paper, we focus on malware of the Windows portable executable (PE) file which is the most popular malware file format. Our dataset, spanning over five years, contains 76,473 Windows PE files, i.e., 39,349 malicious and 37,124 benign as shown in Table IV. Specifically, The malicious software is obtained from the VirusShare website [31] and using a daily downloading script. The benign software is obtained from popular free software sources, including PortableApps [32], Softonic [33], SourceForge [34], and CNET [35].

To get reliable labels for these samples, we rely on VirusTotal[36] to determine whether a sample is benign or malicious. VirusTotal uses more than 60 anti-virus (AV) engines to vote whether the submitted sample is malicious or benign. In this paper, samples are labeled as malware when at least 10 AV engines report them as malicious, while samples are labeled as benign when no AV reports them as malicious. Note that according to a recent study [37] on measuring the labeling effectiveness of malware samples, this strategy is reasonable and stable. We consider samples up to Dec 2021 because following a previous work [38], the malware labels become stable after about one year, thus choosing Dec 2021 as the finishing time ensures good ground-truth confidence in objects labeled as malware.

Also, we leverage VirusTotal to get the exact appearing time for each sample and make sure that temporal consistency [13] is satisfied at the month level during the testing. Specifically, temporal consistency ensures that training samples should be strictly temporally precedent to testing ones, and all testing samples must come from the same period during each testing to eliminate time bias.

TABLE IV: Dataset

Year	2017	2018	2019	2020	2021	Total
Goodware	5,788	6,748	9,976	5,961	8,651	37,124
Malware	3,517	6,130	7,557	9,556	12,589	39,349
Total	9,305	12,878	17,533	15,517	21,240	76,473

VI-B API Sequence Extraction

After data collection, the Cuckoo Sandbox [39] is used to run the PE files and gather execution logs. Cuckoo sandbox has been widely used in prior works [7, 8, 9, 10]. It executes each PE file inside virtual machines and uses API hooks to monitor the Windows APIs to form a raw API sequence. In our system, dozens of virtual machines are maintained on the Cuckoo server which is installed with Ubuntu 16.04 LTS. All the virtual machines are installed with a 64-bit Windows 10 system and several necessary drivers to ensure the successful execution of the PE samples in the dataset. The snapshot feature of the virtual machine is leveraged to roll it back after execution to ensure the uniformity of the software running environment. Besides, Cuckoo simulates some user actions (such as clicking a button, typing some texts, etc.) to trigger malicious behavior of malware. In this paper, we set the maximum running time of each sample to 5 minutes. That is to say, the sandbox process completes when the uploaded sample ends itself or runs to 5 minutes. After a PE file is uploaded, Cuckoo server begins to call a free client to execute the file and record the API calls automatically. When the process completes, Cuckoo server will generate a sandbox report about this uploaded file and the raw API sequence can be extracted from this report.

VI-C Evaluated Malware Detectors

We employ two representative DNNs, i.e., LSTM and Text-CNN, to build the malware detection models. These two models learn the sequence features from API sequences and have been proved to be effective in malware detection. In fact, many exsiting studies have already using these two models or their variants or combinations as the encoder [3, 4, 7, 5, 6]. The details of two DNN models are illustrated as follows:

VI-C1 LSTM

LSTM [40] is a recurrent neural network architecture. It is able to capture the long-term context information through several gates designed to control the information transmission status. In this paper, we use the architecture of a single layer LSTM in [7] as our baseline model. Specifically, we establish two LSTM models for comparison, namely the regular LSTM model and the LSTM model enhanced by MME. The regular model includes an embedding layer [41] to receives API name sequences as input, a single layer LSTM encoder, and an MLP classifier. The enhanced LSTM includes our API sequence embedding, a contrastive LSTM encoder, and a classifier with the same configuration as the regular model.

VI-C2 Text-CNN

Text-CNN [42] is a variant of CNN used for text classification tasks. The regular model here also use an embedding layer [41] and receives API name sequences as input. The filter size in CNN, or the n-gram size, denotes the number of successive API calls where the features are extracted. In the regular encoder, we set the filter sizes to 3, 4, and 5, respectively for three different Text-CNN layers. The enhanced Text-CNN includes our API sequence embedding, a contrastive Text-CNN encoder, and a classifier with the same configuration as the regular model.

VII Evaluation

In this section, we evaluate the effectiveness of MME in enhancing API sequence-based detection models.

VII-A Model Sustainability Analysis

In this section, we measure the performance of existing malware detection models with and without the help of MME to understand the ability of MME in mitigating model degradation.

VII-A1 Experimental Settings

To evaluate the models’ sustainability, we test the mlaware detectors yearly. For each detectors, we train a model on the samples of 2017, and sequentially test its performance on each year from 2018 to 2021. To ensure the effectiveness of the models, we employ a 5-fold cross-validation during the model training process and ensure that the all the models achieve an average F1 score of over 97% on the validation set. During the model test, we calculate the false positive rate (FPR), false negitive rate (FNR), and F1 score to evaluate how MME can help prolong the life-time of regular models.

We also consider a state-of-the-art work called APIGraph [43], which is most relevant to our MME model, for comparison. APIGraph also leverages API knowledge graph learning and API clustering to enhance the regular malware detectors with capturing the semantically-equivalent APIs among evolved malware, thus slowing down the model aging. In fact, APIGraph primarily enhances the API name embedding stage of the model, whereas in comparison, our framework MME enhances both API name and argument embedding, as well as the encoder module.

VII-A2 Results

Tabel V shows the performance of the each baseline model in every test year. The phenomenon of model aging is observed quite prominently, especially in terms of the FNR. Over a four-year testing period, the regular LSTM model exhibited an average FNR as high as 22.39%, resulting in a decrease in the F1 score to 83.98%. Similarly, the regular Text-CNN model showed an average FNR of 22.70%, with an accompanying drop in the F1 score to 84.67%. This indicates a severe issue of elevated false negatives caused by the evolution of malicious software, as the model tends to classify unknown malware as benign.

Our enhancement method MME demonstrates significant improvement. Compared to the regular models, the MME(LSTM) exhibits a 12.24% reduction in FNR and a 7.24% increase in F1 score, with the average F1 value remaining above 90%. The MME(Text-CNN), on the other hand, experiences a 13.10% decrease in FNR and an 8.47% increase in F1 score, and maintains an average F1 above 93%. Moreover, in comparison to the state-of-the-art model APIGraph, our model lags behind by only 0.05% in LSTM’s FPR, while outperforming APIGraph in other metrics. These results indicate that our enhancement method possesses a strong ability to alleviate model aging.

TABLE V: Comparisons of the regular and enhanced models (%)

Testing Years	Regular LSTM			APIGraph(LSTM)			MME(LSTM)
Testing Years	FPR	FNR	F1	FPR	FNR	F1	FPR	FNR	F1
2018	6.52	21.19	84.75	6.91	15.81	87.80	5.96	8.04	92.65
2019	6.59	17.59	86.25	7.18	15.79	86.96	7.27	7.37	91.61
2020	10.38	24.22	83.16	9.55	21.15	85.33	10.12	11.06	91.10
2021	8.96	26.57	81.78	8.15	23.24	84.19	8.62	14.12	89.55
average	8.11	22.39	83.98	7.94	18.99	86.07	7.99	10.15	91.23
improve	–	–	–	$\downarrow$ 0.17	$\downarrow$ 3.40	$\uparrow$ 2.09	$\downarrow$ 0.12	$\downarrow$ 12.24	$\uparrow$ 7.24
Testing Years	Regular Text-CNN			APIGraph(Text-CNN)			MME(Text-CNN)
Testing Years	FPR	FNR	F1	FPR	FNR	F1	FPR	FNR	F1
2018	5.19	19.45	86.50	5.62	12.90	90.13	2.39	6.72	95.23
2019	5.47	21.27	84.70	5.73	12.11	89.93	4.38	6.84	93.65
2020	5.87	24.97	83.98	7.18	17.01	88.54	5.39	10.86	92.62
2021	6.38	25.12	83.54	6.81	19.68	86.83	4.17	13.99	91.08
average	5.73	22.70	84.67	6.33	15.43	88.86	4.08	9.60	93.14
improve	–	–	–	$\uparrow$ 0.61	$\downarrow$ 7.27	$\uparrow$ 4.18	$\downarrow$ 1.65	$\downarrow$ 13.10	$\uparrow$ 8.47

VII-B Model Maintainability Analysis

The purpose of this experiment is to evaluate how many human efforts MME can save while maintaining a high performance malware detection models.

Specifically, the comparison includes two aspects. On the one hand, we compare the amount of human efforts needed for active learning in maintaining both the regular and the enhanced models. On the other hand, we compare the model performance improvment given a fixed level of human effort.

VII-B1 Comparison of human efforts needed to achieve a fixed performance

First, we train a detection model on the samples of 2017, and test it month by month from Jan 2018 to Dec 2021. Then, when the F1 score of the model falls below a threshold $T$ , we retrain the model so that it can reach the $T$ . We calculate how many human efforts (i.e. the number of samples to label) are needed in the retraining step. To retrain an aged model, we adopt the active learning [13] method, which is an optimization to normal retraining methods. Specifically, the uncertain sampling [13] algorithm is used to actively select the most uncertain predictions. In detail, first we select the most 1% uncertain samples to retrain the model, and then gradually increase the percentage by 1% until the F1 score reaches $T$ . Through this way, we can figure out the minimum efforts to maintain a high-performance model.

Table VI shows the number of samples to label from 2018 to 2021 with $T=0.95$ for both the regular and the enhanced models. It is clear that the models enahanced by MME can significantly save human efforts while reaching the threshold of $T$ . For the LSTM model, the enhanced model can save 24.19% to 70.27% of human efforts during maintenance, with an average savings of 44.17% over 4 years. Moreover, for the Text-CNN model, the enhanced model can save 33.53% to 94.42% of human efforts during maintenance, with an average savings of 52.12% over 4 years. These results indicate that MME can significantly reduce human efforts when maintaining various malware detectors.

TABLE VI: the number of labeled samples for active learning with fixed retrain thresholds (

F1=95\%

)

Testing Years	LSTM # labeled samples			Text-CNN # labeled samples
Testing Years	Regular	MME	improve	Regular	MME	improve
2018	1,729	514	70.27%	735	41	94.42%
2019	1,645	1,247	24.19%	1,265	662	47.68%
2020	5,462	3,113	43.01%	2,977	1,272	57.27%
2021	3,402	1,959	42.41%	2,195	1,459	33.53%
Total	12,238	6,833	44.17%	7,172	3,434	52.12%

VII-B2 Comparison of model performance improvment given a fixed level of human efforts

The second comparison setting is to fix the amount of human efforts and test the model performance of the regular and enhanced models. Similarly, we train a detector with samples from 2012, and test the detector month by month from Jan 2018 to Dec 2021. We also use the uncertain sampling [13] in this experiment. We adopt two fixed human effort strategies: the first one is sample budgeting, where 20, 50, and 100 samples are labeled and used for retraining in each month; the second one is ratio budgeting, where 1%, 5%, and 10% of the samples from each month are labeled and used for retraining. Finally, we calculate the FPR, FNR, and F1 score of the model in each month, and calculate their respective averages as the final comparison metrics.

As shown in Table VII and Table VIII, it can be observed that under the same level of human efforts, the enhanced models achieve better performance. Although there are instances where the FPR results may slightly increase compared to the regular models, this increase is less than 1%. Significant improvements are seen in FNR and F1 scores, particularly in the reduction of FNR. Especially when fixing the analysts labeling effort at a low standard (such as 20 or 1% samples per month), the models enhanced with MME show even more significant performance improvements compared to the regular models, where the FNR can be reduced by more than 10%. This implies that the enhanced models, with just a slight amount of human efforts, can significantly mitigate the impact of malware evolution. The result also indicates that under a fixed level of human efforts, the models enhanced with MME achieve better performance, particularly in reducing FNR and improving F1 score.

For both LSTM and Text-CNN, using the models enhanced with MME, only 20 samples or 1% of samples need to be labeled each month to keep the FNR below 10% and achieve an F1 score above 90%. In contrast, the regular models in our experiment require to label around 100 samples or 5% of the samples per month to achieve the same effect. This experimental result indicates that MME can reduce the analysts labeling effort by 5 $\times$ .

TABLE VII: Active learning with a fixed monthly sample labeling budget

Monthly Sample Budget	Base Model	Method	Average Performance
Monthly Sample Budget	Base Model	Method	FPR(%)	FNR(%)	F1(%)
20	LSTM	Regular	6.96	21.24	85.08
		MME	7.73	9.19	91.86
		MME	$\uparrow$ 0.77	$\downarrow$ 12.05	$\uparrow$ 6.77
	Text-CNN	Regular	5.50	20.87	85.91
		MME	3.68	8.76	93.77
		MME	$\downarrow$ 1.82	$\downarrow$ 12.10	$\uparrow$ 7.87
50	LSTM	Regular	5.42	15.68	89.05
		MME	6.40	6.79	93.71
		MME	$\uparrow$ 0.98	$\downarrow$ 8.88	$\uparrow$ 4.66
	Text-CNN	Regular	2.54	15.99	90.10
		MME	3.28	7.93	94.40
		MME	$\uparrow$ 0.74	$\downarrow$ 8.06	$\uparrow$ 4.30
100	LSTM	Regular	5.68	10.88	91.65
		MME	4.80	5.83	94.90
		MME	$\downarrow$ 0.88	$\downarrow$ 5.05	$\uparrow$ 3.25
	Text-CNN	Regular	4.00	8.67	93.56
		MME	3.28	6.73	95.04
		MME	$\downarrow$ 0.72	$\downarrow$ 1.95	$\uparrow$ 1.48

VII-C Model Ablation Analysis

In this experiment, we want to measure the individual effects of the two parts of the MME framework (i.e., embedding enhancement and encoder enhancement) on enhancing the regular model.

VII-C1 Experimental Settings

In the MME framework we proposed, there are two enhanced components: embedding enhancement and encoder enhancement. The embedding enhancement consists of API name embedding in §III and API argument embedding in §IV. The encoder enhancement refers to the contrastive encoder in §V. To evaluate the impact of each component on the regular model’s enhancement, we construct two MME variants: one with only embedding enhancement and another with only encoder enhancement. We train the 4 models (one regular model, two MME variants, and one proposed MME enhanced model) on the samples of 2017, and test their performance on each year from 2018 to 2021. Based on the previous experiments, it is evident that the main indicators of decreased model performance are the increase in false negative rates and the decrease in F1 scores. Therefore, we use these two metrics to assess the influence of each part of the MME framework on the model’s enhancement. Figure 7 shows results of the ablation experiments, where the baseline models consist of LSTM and Text-CNN with the same experimental settings as §VI-C. The embedding enhanced LSTM/Text-CNN refers to the variant with only embedding enhancement, while the encoder enhanced LSTM/Text-CNN refers to the variant with only encoder enhancement.

TABLE VIII: Active learning with a fixed monthly ratio labeling budget

Monthly Ratio Budget	Base Model	Method	Average Performance
Monthly Ratio Budget	Base Model	Method	FPR(%)	FNR(%)	F1(%)
1%	LSTM	Regular	6.48	19.51	86.32
		MME	6.66	8.35	92.76
		MME	$\uparrow$ 0.18	$\downarrow$ 11.16	$\uparrow$ 6.44
	Text-CNN	Regular	5.33	19.10	87.05
		MME	3.78	8.55	93.84
		MME	$\downarrow$ 1.55	$\downarrow$ 10.55	$\uparrow$ 6.79
5%	LSTM	Regular	6.41	9.92	91.86
		MME	4.28	6.06	95.00
		MME	$\downarrow$ 2.13	$\downarrow$ 3.86	$\uparrow$ 3.14
	Text-CNN	Regular	3.97	8.46	93.69
		MME	3.64	6.03	95.25
		MME	$\downarrow$ 0.34	$\downarrow$ 2.43	$\uparrow$ 1.56
10%	LSTM	Regular	5.28	8.01	93.38
		MME	3.90	5.45	95.49
		MME	$\downarrow$ 1.39	$\downarrow$ 2.56	$\uparrow$ 2.11
	Text-CNN	Regular	3.31	4.98	95.63
		MME	2.35	4.33	96.70
		MME	$\downarrow$ 0.96	$\downarrow$ 0.65	$\uparrow$ 1.07

VII-C2 Results

Intuitively, both the embedding and encoder enhancement demonstrate significant improvements to the model, indicating that optimizing the API sequence embedding and refining the training process of the encoder through contrastive learning can effectively mitigate the impact of malware evoluation. Further observations reveal some differences in the effects of the embedding and encoder enhancements. Over the four years of testing from 2018 to 2021, for the LSTM model, the embedding enhanced LSTM shows an average decrease in FNR of 5.7% and an average increase in F1 score of 3.9% relative to the regular LSTM, while the encoder enhanced LSTM exhibited an decrease in FNR of 3.5% and an increase in F1 score of 2.1%. For the Text-CNN model, the embedding enhanced Text-CNN displays an average decrease in FNR of 8.4% compared to the regular Text-CNN, with an average increase in F1 score of 5%. Meanwhile, the encoder enhanced Text-CNN shows an average decrease in FNR of 6.4% and an average increase in F1 score of 3.3%. From these results, it appears that embedding enhancement has a slightly better effect than encoder enhancement. This suggests that a well-designed feature representation is crucial for mitigating model aging. Finally, the MME framework proposed in this paper combines both embedding and encoder enhancements and achieves the best mitigation effects. Over the four-year testing period, the MME enhanced LSTM shows an average decrease in FNR of 12.2% and an average increase in F1 score of 7.2% compared to the regular LSTM. Moreover, the MME-enhanced Text-CNN achieves an average decrease in FNR of 13.1% compared to the regular Text-CNN, with an average increase in F1 score of 8.5%.

VII-D Malware Feature Stability Analysis

We observed that the malware evolution can disturb the stability of the original feature space, leading to a decline in model performance. In this experiment, we want to measure the stability of the feature space concerning the evolution of malware from the same family to show that the MME-enhanced model can capture the semantic similarity between the original and evolved of malware.

VII-D1 Experimental Settings

Here is our evaluation methodology, which involves four steps. First, we select the top 10 malware families with the most number of samples from the dataset in §VI-A. As a result, we have 17,288 malware samples in this experiment and every family has more than 1k samples. Second, for each malware family, we sort all the family samples by their appearing time and then divide them into 10 groups so that each group contains 10% samples of the family. The samples in one group is strictly ahead of samples from the next group in terms of their appearing time. Third, for each malware sample, we input its raw API sequence into the regular/MME-enhanced model and take the output of the regular/constrastive encoder as its feature representation. Lastly, we calculate a feature stability score of every two adjacent groups using Jensen–Shannon divergence [44]. The Jensen-Shannon divergence is a method used to measure the similarity between two feature distributions: $JS\left(P_{1}\|P_{2}\right)=\frac{1}{2}KL\left(P_{1}\|\frac{P_{1}+P_{2}}{2}% \right)+\frac{1}{2}KL\left(P_{2}\|\frac{P_{1}+P_{2}}{2}\right)$ . It calculates the average Kullback-Leibler divergence between the two distributions (i.e., $KL(P_{1}\|P_{1})=\sum_{i}P_{1}(i)\log\left(\frac{P_{1}(i)}{P_{2}(i)}\right)$ ) and derives a final measurement value by utilizing the symmetry of the logarithmic function. In this experiment, $P_{1}$ and $P_{2}$ refer to the set of softmax-normalized features obtained in the third step for two adjacent groups. The score of $JS\left(P_{1}\|P_{2}\right)$ ranges from 0 to 1, where the value closer to 0 indicates that the malware feature distributions between two groups are more similar, implying better feature space stability.

VII-D2 Result

Figure 8 and 9 show the distribution of feature stability scores (i.e., $JS$ scores) for each malware family with the regular and MME-enhanced models. We can observe that for each malware family, the JS scores of all MME-enhanced models are closer to 0, significantly lower than the results of the regular model. This indicates that the feature stability of the MME-enhanced models demonstrates better performance. During the evolution of malware, the MME enhanced model can reduce the feature space disturbances. This experiment explains why MME can help the model mitigate the impact of malware evolution, as malware tends to retain semantic similarities during its evolution, and MME can capture these similarities and maintain the feature space stability.

VIII Related Work

VIII-A API Sequence-based Malware Detection

Dynamic malware detection executes the software in a secured virtual environment and monitors its run-time behavior. A running software calls many system APIs, which characterize software behaviors including network access, file creation and modification, etc. These API calls form an API call sequence which has become a widely used data source for malware detection and classification [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 45, 46, 47].

Inspired by deep learning-based squence analysis, many researchers apply some DL models like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to learn features of the API call sequences. Kolosnjaji et al. [3] use the API sequence as input. Their approach stacks a CNN that uses a 3-sized filter to represent 3 consecutive APIs (like the 3-gram approach). After the CNN, the LSTM is applied to handle the time-series sequence. Agrawal et al. [4] input the API names and the n-gram of the string arguments into several stacked LSTMs. Zhang et al. [10] build a feature engineering about the API names and arguments and then design a deep learning model including gate-CNNs and Bi-LSTM as the malware detector. Catak et al. [7] input API sequences into LSTMs to detect and classify malware. Li et al. [5] combine the Text-CNN with Bi-LSTM to analyze the API sequences and detect malware. Similarly, Chen et al. [6] use the Text-CNN and Bi-LSTM as the baseline models for API sequence-based malware detection.

Obviously, too many works have already using Text-CNN and LSTM models or their variants or combinations as the encoder of the detection models. However, how to perform feature representation so that the model can accurately understand the semantic information in API sequences and thus understand the software behavior remains a challenging issue. Our framework MME enhance the API sequence embedding and can better represent each API as a semantic feature vector, as shown in §III and §IV.

VIII-B Model Aging Caused by Malware Evolution

Deep learning techniques were originally designed for stationary environments in which the training and test sets are assumed to be generated from the same statistical distribution. However, this assumption is not valid in the malware domain. Malware samples, including various families, evolve over time due to changes resulting from adding capabilities, fixing bugs, porting to new platforms, etc. Thus, malware detectors are deployed in dynamic environments, where malware variants keep evolving, causing the performance to deteriorate significantly over time. This is known as the problem of model aging or concept drift [13].

There are mainly two methods to address model aging caused by the evolution of malware. The first is to retrain and update detection models with newly labeled samples, or reject drift samples until they can be expertly analyzed. For example, in Android malware detection, DroidEvolver [15] utilizes online learning and pseudo-labels to self-update the detection model. However, the accumulation of pseudo-label errors may lead to model self-poisoning which have catastrophic effects on performance. Some studies focus on detecting drift samples that deviate from existing classes from a large number of test samples and update models using periodical retraining [16, 17]. However, labeling samples and retraining the model still requires a lot of expert knowledge and computing resources. More importantly, it is also difficult to determine when the model should be retrained. Delayed retraining can leave the outdated model vulnerable to evolved malware.

The second method is to deliberately consider the issue of model aging during the process of model design and feature space optimization. Researchers represent features to be more robust against temporal bias and reduce the impact of malware evolution. APIGraph [20] is proposed to enhance state-of-the-art malware detectors with the similarity information among evolved malware in terms of equivalent or similar API usages. It constructs an API knowledge graph based on the API documentation, and use graph embedding and the K-means algorithm to cluster APIs with similar semantics. Similarity, SDAC [21] calculates APIs’ contributions to malware detection and assign APIs to feature vectors. Then, SDAC clusters all APIs based on their semantic distances to create a feature set in the training phase, and extends the feature set to include all new APIs in the detecting phase. However, the detectors above mainly focus on API occurrence or API frequency, which is difficult to be applied to dynamic malware detection based on API sequence analysis. The MME proposed in this paper further extends API sequence embedding and constructs a contrastive encoder to address the model aging issue in API sequence-based malware detection.

IX Discussion And Limitation

IX-A Other types of Detectors

The API sequence is a popular type of feature widely adopted by dynamic malware detectors [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], mainly because API sequences are essential in understanding malware behaviors. In our experiments, we validate the effectiveness of MME by enhancing the LSTM and Text-CNN models. There indeed are many other DNN models such as Bi-LSTM, Gate-CNN, or even more variant models commonly used for learning API sequences [4, 5, 6, 10, 47, 48]. Although these models are not individually validated in this paper, they also require the API sequence embedding and encoder training process. We believe that the MME approach can also be applied to these models and achieve similar enhancement effects.

IX-B Overly Advanced Malware

In reality, there are a few instances where the attack methods of certain malware are so advanced that their behavior bears very little similarity to previous malware. For such newly emerged malware, the performance of MME may weaken. In such cases, drift sample detection methods [16, 17] can be used to identify such samples, as they deviate significantly from the original data distribution. Then, such overly advanced malware can be collected and labeled for updating models. We believe that utilizing the MME enhanced models in conjunction with periodic drift sample detection can better address the continuous evolution of malicious software.

IX-C Malware Sandbox Evasion

To obtain the raw API sequences, sandboxes are widely used to execute malware inside virtual machines and monitor APIs with API hooking techniques. Sandbox evasion refers to techniques employed by malware to avoid detection or analysis within a sandbox environment [49]. For example, when malware detects that it is running within a sandbox environment, it can refrain from executing any malicious operations, or even disguise itself as a legitimate application to exhibit benign behavior. The key to combating such “environment-aware” malware is to optimize the sandbox environment to closely resemble a real system environment. In the sandbox used for experiments, we simulate some user actions (such as clicking a button, typing some texts, etc.) to trigger malware real behaviors. We further utilize the statistical model proposed by Miramirkhani et al. [50] to optimize and fine-tune the sandbox, making it even closer to a real system environment. We believe these operations can help the sandbox capture the real API sequences of malware. In future works, solutions [51, 52, 53, 54] focusing on detecting sandbox evasion can be used to further optimize this issue.

X Conclusion

This paper proposes a model enhancement method MME to mitigate the impact of malware evolution on API sequence-based windows malware detectors. We observe that the API sequences of malware samples before and after evolution usually have similar malicious semantics including equivalent API usage, similar system resources, and similar API fragments. This provides an opportunity to reduce the feature gaps caused by evolution, and slowing down model aging. Firstly, by establishing an API knowledge graph and capture semantic similarities between APIs, the influence of equivalent API substitution is reduced. Secondly, by adopting hierarchical system resource encoding based on feature hashing, the model’s attention to the similarity of system resource access before and after the evolution of malware samples is enhanced. Finally, by designing a contrastive learning strategy, the model’s attention to the similar API fragments retained before and after malware evolution is strengthened. Experimental results show that MME can greatly extend the life-time of the API sequence-based malware detectors and can significantly save the human labeling efforts required for model maintenance. MME method can be applied to most API sequence-based deep learning malware detection models and help them achieve better sustainable usage.

References

[1] K. Chang, N. Zhao, and L. Kou, “A survey on malware detection based on API calls,” in 9th International Conference on Dependable Systems and Their Applications, DSA 2022, Wulumuqi, China, August 4-5, 2022. IEEE, 2022, pp. 464–471.
[2] D. Ucci, L. Aniello, and R. Baldoni, “Survey of machine learning techniques for malware analysis,” Computer & Security, vol. 81, pp. 123–147, 2019.
[3] B. Kolosnjaji, A. Zarras, G. D. Webster, and C. Eckert, “Deep learning for classification of malware system call sequences,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2016, pp. 137–149.
[4] R. Agrawal, J. W. Stokes, M. Marinescu, and K. Selvaraj, “Neural sequential malware detection with parameters,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. IEEE, 2018, pp. 2656–2660.
[5] C. Li, Q. Lv, N. Li, Y. Wang, D. Sun, and Y. Qiao, “A novel deep framework for dynamic malware detection based on API sequence intrinsic features,” Comput. Secur., vol. 116, p. 102686, 2022.
[6] X. Chen, Z. Hao, L. Li, L. Cui, Y. Zhu, Z. Ding, and Y. Liu, “Cruparamer: Learning on parameter-augmented API sequences for malware detection,” IEEE Trans. Inf. Forensics Secur., vol. 17, pp. 788–803, 2022.
[7] F. Ö. Çatak, A. F. Yazi, O. Elezaj, and J. Ahmed, “Deep learning based sequential model for malware analysis using windows exe API calls,” PeerJ Computer Science, vol. 6, p. e285, 2020.
[8] D. Rabadi and S. G. Teo, “Advanced windows methods on malware detection and classification,” in Annual Computer Security Applications Conference, ACSAC. ACM, 2020, pp. 54–68.
[9] E. Amer and I. Zelinka, “A dynamic windows malware detection and prediction method based on contextual understanding of API call sequence,” Computers & Security, vol. 92, p. 101760, 2020.
[10] Z. Zhang, P. Qi, and W. Wang, “Dynamic malware analysis with feature engineering and feature learning,” in Proceedings of AAAI Conference on Artificial Intelligence. AAAI Press, 2020, pp. 1210–1217.
[11] F. Pendlebury, “Machine learning for security in hostile environments,” Ph.D. dissertation, Royal Holloway, University of London, 2021.
[12] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “BODMAS: an open dataset for learning based temporal analysis of PE malware,” in IEEE Security and Privacy Workshops, SP Workshops 2021, San Francisco, CA, USA, May 27, 2021. IEEE, 2021, pp. 78–84.
[13] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Cavallaro, “TESSERACT: eliminating experimental bias in malware classification across space and time,” in 28th USENIX Security Symposium, USENIX Security 2019, Santa Clara, CA, USA, August 14-16, 2019. USENIX Association, 2019, pp. 729–746.
[14] Kaspersky, “Machine learning methods for malware detection,” https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf, 2019.
[15] K. Xu, Y. Li, R. H. Deng, K. Chen, and J. Xu, “Droidevolver: Self-evolving android malware detection system,” in IEEE European Symposium on Security and Privacy, EuroS&P 2019, Stockholm, Sweden, June 17-19, 2019. IEEE, 2019, pp. 47–62.
[16] L. Yang, W. Guo, Q. Hao, A. Ciptadi, A. Ahmadzadeh, X. Xing, and G. Wang, “CADE: detecting and explaining concept drift samples for security applications,” in 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021. USENIX Association, 2021, pp. 2327–2344.
[17] R. Jordaney, K. Sharad, S. K. Dash, Z. Wang, D. Papini, I. Nouretdinov, and L. Cavallaro, “Transcend: Detecting concept drift in malware classification models,” in 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, August 16-18, 2017. USENIX Association, 2017, pp. 625–642.
[18] A. T. Nguyen, E. Raff, C. Nicholas, and J. Holt, “Leveraging uncertainty for improved static malware detection under extreme false positive constraints,” CoRR, vol. abs/2108.04081, 2021.
[19] M. Dib, S. Torabi, E. Bou-Harb, N. Bouguila, and C. Assi, “Evoliot: A self-supervised contrastive learning framework for detecting and characterizing evolving iot malware variants,” in ASIA CCS ’22: ACM Asia Conference on Computer and Communications Security, Nagasaki, Japan, 30 May 2022 - 3 June 2022, Y. Suga, K. Sakurai, X. Ding, and K. Sako, Eds. ACM, 2022, pp. 452–466.
[20] X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, and M. Yang, “Enhancing state-of-the-art classifiers with API semantics to detect evolved android malware,” in CCS ’20: 2020 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, November 9-13, 2020. ACM, 2020, pp. 757–770.
[21] J. Xu, Y. Li, R. H. Deng, and K. Xu, “SDAC: A slow-aging solution for android malware detection using semantic distance based API clustering,” IEEE Trans. Dependable Secur. Comput., vol. 19, no. 2, pp. 1149–1163, 2022.
[22] N. Falliere and E. Chien, “Zeus: King of the bots,” Symantec Security Response (http://bit.ly/3VyFV1), 2009.
[23] J. Wyke, “What is zeus?” Sophos, May, 2011.
[24] Microsoft, “Win32 api reference documentation,” https://learn.microsoft.com/en-us/windows/win32/api/, 2021.
[25] spaCy, “spacy - industrial-strength natural language processing,” https://spacy.io/, 2021.
[26] NeuralCoref, “Neuralcoref 4.0: Coreference resolution in spacy with neural networks.” https://github.com/huggingface/neuralcoref/, 2021.
[27] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, NIPS 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 2787–2795.
[28] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph embedding by translating on hyperplanes,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27 -31, 2014, Québec City, Québec, Canada. AAAI Press, 2014, pp. 1112–1119.
[29] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. AAAI Press, 2015, pp. 2181–2187.
[30] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in annual international conference on machine learning. ACM, 2009, pp. 1113–1120.
[31] VXShare, “Virusshare database,” https://virusshare.com/, 2021.
[32] Portableapps, “Portableapps.com,” https://portableapps.com/, 2021.
[33] Softonic, “Softonic,” https://en.softonic.com/, 2021.
[34] Sourceforge, “Sourceforge,” https://sourceforge.net/, 2021.
[35] CNET, “Apps for widnows,” https://download.cnet.com/windows/, 2021.
[36] VirusTotal, “Virustotal,” https://www.virustotal.com/, 2021.
[37] S. Zhu, J. Shi, L. Yang, B. Qin, Z. Zhang, L. Song, and G. Wang, “Measuring and modeling the label dynamics of online anti-malware engines,” in 29th USENIX Security Symposium, USENIX Security 2020, August 12-14, 2020. USENIX Association, 2020, pp. 2361–2378.
[38] B. Miller, A. Kantchelian, M. C. Tschantz, S. Afroz, R. Bachwani, R. Faizullabhoy, L. Huang, V. Shankar, T. Wu, G. Yiu, A. D. Joseph, and J. D. Tygar, “Reviewer integration and performance measurement for malware detection,” in Detection of Intrusions and Malware, and Vulnerability Assessment - 13th International Conference, DIMVA 2016, San Sebastián, Spain, July 7-8, 2016, Proceedings, ser. Lecture Notes in Computer Science, vol. 9721. Springer, 2016, pp. 122–141.
[39] Cuckoo, “Cuckoo sandbox automated malware analysis,” https://cuckoosandbox.org, 2021.
[40] R. C. Staudemeyer and E. R. Morris, “Understanding lstm–a tutorial into long short-term memory recurrent neural networks,” arXiv preprint arXiv:1909.09586, 2019.
[41] P. Contributors, PyTorch-Embedding, 2021.
[42] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 2014, pp. 1746–1751.
[43] X. Zhang, M. Zhang, Y. Zhang, M. Zhong, X. Zhang, Y. Cao, and M. Yang, “Slowing down the aging of learning-based malware detectors with API knowledge,” IEEE Trans. Dependable Secur. Comput., vol. 20, no. 2, pp. 902–916, 2023.
[44] M. Menéndez, J. Pardo, L. Pardo, and M. Pardo, “The jensen-shannon divergence,” Journal of the Franklin Institute, vol. 334, no. 2, pp. 307–318, 1997.
[45] T. K. Tran and H. Sato, “Nlp-based approaches for malware classification from api sequences,” in 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES). IEEE, 2017, pp. 101–105.
[46] C. W. Kim, “Ntmaldetect: A machine learning approach to malware detection using native API system calls,” CoRR, vol. abs/1802.05412, 2018.
[47] C. Li, Z. Cheng, H. Zhu, L. Wang, Q. Lv, Y. Wang, N. Li, and D. Sun, “Dmalnet: Dynamic malware analysis based on api feature engineering and graph learning,” Computers & Security, vol. 122, p. 102872, 2022.
[48] Z. Xu, X. Fang, and G. Yang, “Malbert: A novel pre-training method for malware detection,” Computers & Security, vol. 111, p. 102458, 2021.
[49] M. Lindorfer, C. Kolbitsch, and P. M. Comparetti, “Detecting environment-sensitive malware,” in Recent Advances in Intrusion Detection - 14th International Symposium, RAID 2011, Menlo Park, CA, USA, September 20-21, 2011. Proceedings, ser. Lecture Notes in Computer Science, R. Sommer, D. Balzarotti, and G. Maier, Eds., vol. 6961. Springer, 2011, pp. 338–357.
[50] N. Miramirkhani, M. P. Appini, N. Nikiforakis, and M. Polychronakis, “Spotless sandboxes: Evading malware analysis systems using wear-and-tear artifacts,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 1009–1024.
[51] N. Galloro, M. Polino, M. Carminati, A. Continella, and S. Zanero, “A systematical and longitudinal study of evasive behaviors in windows malware,” Computers & Security, vol. 113, p. 102550, 2022.
[52] S. Liu, P. Feng, S. Wang, K. Sun, and J. Cao, “Enhancing malware analysis sandboxes with emulated user behavior,” Computers & Security, vol. 115, p. 102613, 2022.
[53] D. C. D’Elia, E. Coppa, F. Palmaro, and L. Cavallaro, “On the dissection of evasive malware,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2750–2765, 2020.
[54] E. Avllazagaj, Z. Zhu, L. Bilge, D. Balzarotti, and T. Dumitraș, “When malware changed its mind: An empirical study of variable program behaviors in the real world,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 3487–3504.