1 s2.0 S0020025523004565 Main

Information Sciences 636 (2023) 118907
Contents lists available at ScienceDirect
Information Sciences
journal homepage: www.elsevier.com/locate/ins
A novel extended multimodal AI framework towards vulnerability

detection in smart contracts
Wanqing Jie a,b , Qi Chen a,c , Jiaqi Wang a , Arthur Sandor Voundi Koe a,c,∗ , Jin Li a,b,c,∗ ,
Pengfei Huang a , Yaqi Wu a , Yin Wang a
a
Institute of Artificial Intelligence and Blockchain, Guangzhou University, 510006, Guangzhou, China
b
Guangdong Provincial Key Laboratory of Blockchain Security, 510006, Guangzhou, China
c
Pazhou Lab, 510330, Guangzhou, China
A R T I C L E I N F O A B S T R A C T
Keywords: Current automatic data-driven vulnerability detection in smart contracts selects and processes
Smart contract features of interest under black box settings without empirical justification. In this paper, we
Vulnerability detection propose a smart contract testing methodology that bestows developers with flexible, practical and
Multimodal
customizable strategies to detect vulnerabilities. Our work enforces strong whitebox knowledge
AI approach
White box
to a series of supervised multimodal tasks under static analysis. Each task encapsulates a
vulnerability detection branch test and pipelines feature selection, dimension unification, feature
fusion, model training and decision-making. We exploit multiple features made up of code
and graph embeddings at the single modality level (intramodal settings) and across individual
modalities (intermodal settings). We assign each task to either intramodal or intermodal settings,
and show how to train state-of-the-art self-attentive bi-LSTM, textCNN, and random forest (RF)
models to extract a joint multimodal feature representation per task. We evaluate our framework
over 101,082 functions extracted from the SmartEmbed dataset, and rank each multimodal
vulnerability mining strategy in terms of detection performance. Extensive experiments show
that our work outperforms existing schemes, and the highest performance reaches 99.71%.
1. Introduction
The increasing popularity and adoption of blockchain technology have resulted in an abundance of blockchain solutions. Accord-
ing to [1], the investment rate in global blockchain deployments will reach 19 billion USD by 2024. What stands out in blockchain
technology is the use of smart contracts [2], which allow untrustworthy parties to securely adhere to a set of promises over their
assets. In the literature, the term smart contract refers to an immutable self-contained block of rules that are written in a contract
oriented language such as Solidity [3]. The first proof of concept for smart contracts was provided over the Ethereum blockchain
[4], and an estimated one million smart contracts, controlling several billion dollars in digital currency, have been deployed on
Ethereum. There is some evidence such consistent wealth attracts attackers and raises concerns over the safety of smart contracts
[5]. There are two notable directions towards securing smart contracts: contract codification, which addresses the need to write an
optimized and correct contract with fewer bugs [6], and contract vulnerability detection, which identifies weaknesses in the contract
* Corresponding authors at: Institute of Artificial Intelligence and Blockchain, Guangzhou University, 510006, Guangzhou, China.
E-mail addresses: [email protected] (A.S. Voundi Koe), [email protected] (J. Li).
https://doi.org/10.1016/j.ins.2023.03.132
Received 23 October 2022; Received in revised form 12 February 2023; Accepted 20 March 2023
Available online 23 March 2023
0020-0255/© 2023 Elsevier Inc. All rights reserved.
W. Jie, Q. Chen, J. Wang et al. Information Sciences 636 (2023) 118907
code such as the reentrancy vulnerability in decentralized autonomous organization (DAO) contract [7]. In this paper, we investigate
the detection of vulnerabilities in solidity written smart contracts for the Ethereum blockchain.
1.1. Vulnerability detection landscape
Historically, mining vulnerabilities has been referred to as “searching for a needle in haystack” [8]. This definition encompasses
three major directions to unveil vulnerabilities in smart contracts: static analysis, which is generally rule-based or data-driven, and
covers static features extracted from the smart contract without offering high accuracy; dynamic analysis, which lacks complete
guarantee of the full coverage of the contract code, and dynamic symbolic execution, which needs to check all the execution paths
and suffers from path explosion. Although extensive research has been carried out on vulnerability detection in smart contracts, a key
problem remains the lack of quality assessment standards [9]. This shortage has led developers to struggle in ensuring the reliability
of their smart contracts. Such a methodology avenue should ensure strong white box knowledge and discourage active learning.
Research to date focuses mainly on using static analysis to uncover vulnerabilities in smart contracts. For example, based on the
taxonomy of smart contracts proposed by Atzei et al. [10], Argañaraz et al. [11] leveraged static analysis of the source code and
formulated expert-based verification rules. Another example is Slither [12], a static analyzer for solidity written smart contracts that
relies on a set of manually coded vulnerability detectors.
One criticism of much of the literature on static analysis is that it suffers from a strong dependence on expert rules, leading to
a high false positive rate, as well as from a laborious effort to address new vulnerabilities [13]. Despite the aforesaid, Gao et al.,
in their impressive investigation [14], advocate to use static analysis than other more reliable but costly methods such as dynamic
analysis [15][16], and dynamic symbolic execution [7][17][18].
Various methods of machine learning (ML) and deep learning (DL) have been proposed to support software vulnerability mining
automation. However, traditional machine learning techniques are still based on expert-based fixed predictions, which are sensitive
to bias and insufficient generalization [19]. In addition, current AI-based smart contract vulnerability detection tools have not
yet explained in details the relationship between modality selection and detection performance. Such a lack of clarity leads to
considerable bottlenecks in improving the performance of existing methods.
To uphold automatic data-driven vulnerability detection, deep networks were successfully applied to supervised feature learning
for single and multiple modalities. The topic of multimodal learning applied to vulnerability mining in smart contracts can best be
treated when considering three multimodal data sources. First, the contract source code also known as source code layer (SC) made
of features acquired from processing the contract source code. Second, the built-based data or built-based layer (BB) comprising
features extracted from the contract compilation. Third, the contract’s Ethereum virtual machine (EVM) bytecode also known as
EVM bytecode layer (EVMB) which encompasses features obtained from processing the contract EVM bytecode. Under multimodal
learning, each multimodal data source, also known as modality, has one or multiple sub-modalities expressed as features.
We hold on to the three modalities mentioned previously and classify the relationship among features under two main categories.
First, the intramodal settings describing the analysis of features belonging to an individual modality, namely SC, BB, and EVMB.
Second, the intermodal settings relating to the combination of features across individual modalities. We further distinguish two
subgroups in the intermodal settings: two-by-two intermodal settings (SC+BB, SC+EVMB, BB+EVMB) and three-by-three intermodal
settings (SC+BB+EVMB).
1.2. Technical challenges
Addressing smart contract vulnerability mining in multimodal AI settings presents four major technical challenges. First, com-
prehend the nature of the raw features to be extracted, as well as their significance in terms of vulnerability detection performance.
Second, select the most appropriate AI models to perform vulnerability detection inference. Third, determining the best feature fusion
technique for a better detection outcome. Fourth, throughout the vulnerability detection pipeline, maintain a high level of whitebox
knowledge.
1.3. Motivations
Most studies in the field of smart contract vulnerability mining have only focused on intramodal and two-by-two intermodal
settings. Moreover, they operate under black box settings and embrace fixed rules regarding feature selection, feature fusion, and AI
models to leverage. Hence, it is not possible to investigate the significant relationships between feature selection approaches, feature
fusion techniques, AI models’ choice and the effectiveness of vulnerability detection in smart contracts. Such situation denotes the
lack of a common and clear methodology to guide developers in vulnerability uncovering under intramodal settings, two-by-two and
three-by-three intermodal settings.
The aim of this paper is to design an information-rich framework to improve research on the vulnerability of smart contracts in
multimodal AI.
1.4. Our approach
With regard to the smart contract vulnerability detection, this paper describes the design and implementation of a novel and
transparent methodology towards vulnerability mining; our framework supports contracts published with or without their source
code.
2
We apply static analysis to the function granularity level and characterize all extracted features with code and graph embeddings.
We generate code embeddings from a word2vec model and a bidirectional encoder representation from transformers (BERT) model,
while a graph convolutional network (GCN) outputs graph embeddings.
We define eighty-four flexible, practical and customizable strategies to achieve strong whitebox knowledge and guide developers
and researchers towards practical and effective smart contract vulnerability uncovering under multimodal AI settings.
We model each strategy as a supervised vulnerability detection task that pipelines feature selection, single feature dimension
unification technique, single feature fusion approach, model training, and a single unit for decision-making. We exploit features
from intramodal settings: SC, BB and EVMB separately, from two-by-two intermodal settings: SC+BB, SC+EVMB and BB+EVMB
separately, and from three-by-three intermodal settings: SC+BB+EVMB. Fig. 1 depicts the three multimodal data sources together
with their associated features. Owing to multimodal learning, each task relies on max-pooling (MP), spatial pyramid pooling (SPP),
and dense layers (Dense) for feature dimension uniformization. Each task further implements either horizontal or vertical feature
concatenation for feature fusion. Each task includes AI training and AI inference for state-of-the art text convolutional neural network
(textCNN), bidirectional long short term memory (bi-LSTM) with self-attention, and random forest (RF) machine learning model.
What can be clearly seen is that the set of all tasks in our framework supports the smart contract vulnerability branch cover-
age under multimodal learning. We compare our work with the existing literature and assess the increase in performance under
intramodal and intermodal settings, respectively.
1.5. Our contributions
This paper explores an innovative multimodal learning approach to detect the vulnerabilities in smart contracts. Our new frame-
work provides strong whitebox knowledge for modality selection and achieves higher performance. The main contributions of this
work are summarized as follows.
1. Features mixing. In order to provide white-box knowledge, our framework conveys multiple features in intramodal and in-
termodal settings. We characterize such features as code and graph embeddings, to leverage the power of natural language
processing (NLP) algorithms.
2. Vulnerability detection strategies. To achieve the best detection performance, we develop a series of supervised tasks for
automatic vulnerability mining in Ethereum smart contracts under multimodal learning [20]. Each task depicts a vulnerability
detection branch test, and pipelines feature selection, feature dimension unification, feature fusion, model training, and model
testing. Besides, we assign each task to serve as a vulnerability detection strategy in intramodal, two-by-two intermodal, or
three-by-three intermodal settings.
3. Experimental evaluation of strategies. We evaluate every task by leveraging textCNN, bi-LSTM, and RF for training and
decision-making; MP, SPP and dense layers for dimension unification; horizontal feature concatenation (𝑐𝑜𝑛𝑐𝑎𝑡) and vertical
feature concatenation (𝑠𝑡𝑎𝑐𝑘) for feature fusion. We perform extensive empirical analysis of our framework over the SmartEm-
bed dataset [14], while leaving out of scope the technical approach of the authors’ work [14]. Experiment results reveal that
under intramodal settings, artifacts from (BB) perform best, while under two-by-two intermodal settings, (SC+BB) has sig-
nificant advantage. The best detection strategy is achieved by the shared representation learning across the three modalities
(SC+BB+EVMB). Based on evidence, two-by-two intermodal settings outperform intramodal settings.
2. Background
2.1. Smart contracts on blockchain
Smart contracts are programs stored on the blockchain that are automatically executed when predetermined terms and conditions
are achieved. They are written in contract-oriented languages and provide blockchain technology with Turing-complete capabilities.
Smart contracts offer critical technical support for blockchain solutions on the Internet of Things (IoT) [21] [22].
Smart contracts have the ability to hold and manipulate assets. Once deployed, they become immutable. Prior to deploying smart
contracts, it is critical to investigate security flaws. To avoid infinite execution, which depletes resources, researchers at the Ethereum
foundation proposed a cost per computation unit known as gas. Recently, developers have expressed a heightened awareness of the
need to optimize contract code to meet functional requirements while lowering computation costs.
When the contract is invoked using its address, all nodes in the network ensure that the invoking transaction is valid using the
consensus protocol, then the entire network executes the instructions in the contract code and update the machine state. This results
in a fair and trustless environment for stakeholders. It is possible to improve the computational cost and fairness of the blockchain
network in the validation process by enhancing the working rationale of the consensus protocol [23].
2.2. AI-based vulnerability detection methods for smart contracts
Uncovering vulnerabilities in smart contracts follows two main streams: manual detection, which requires a mechanical inspection
of the contract against expert-based pattern specifications; automated detection, which aims to investigate malicious patterns using
AI.
3
Traditional vulnerability detection methods rely heavily on expert-based rules. They are unable to keep up with the ever-
increasing attack surface, resulting in very low accuracy and a high number of false positive cases.
Machine learning and deep learning were proposed to automatically extract detection rules by leveraging features extraction. In
the literature, the process of feature extraction starts with the smart contract source code or bytecode given as input.
An interesting feature of AI-based methods is the ability to conduct appropriate training of the AI model in use, with the possibility
to tweak parameters to improve the detection experience. Despite the fact that a plethora of works have been produced to investigate
AI-based vulnerability mining in smart contracts, they have failed to eliminate expert rules; they lack explanation towards the choice
of features and the choice of AI models that yields higher detection performances.
2.3. Feature fusion
It is necessary here to clarify exactly what is meant by feature fusion and how our work exploits such a concept. The term feature
fusion refers to combining features of different layers, different modalities or different branches [24]. The concept of feature fusion
embodies a multitude of techniques grouped under four categories.
• Feature vectors addition, performing element-wise addition. For example, let 𝐴 and 𝐵 be two vectors of the same size. The
fusion of 𝐴 and 𝐵 produces a single vector 𝐶 where 𝐴 + 𝐵 = 𝐶.
• Feature vectors concatenation. This paper distinguishes two main concatenation types. In horizontal concatenation known as
concat, let 𝐼 be a row vector of dimension 1×𝑛 , and 𝐽 be a row vector of dimension 1×𝑘 . The horizontal concatenation of
vectors 𝐼 and 𝐽 is equivalent to 1×(𝑛+𝑘) . In vertical concatenation denoted as stack, let 𝑈 be a row vector of dimension 𝑑×𝑛 ,
and 𝑉 be a row vector of dimension 𝑒×𝑛 , such that 𝑈 and 𝑉 have the same amount of column-wise elements 𝑛 with same or
different values. We define the stacking of vectors 𝑈 and 𝑉 as the matrix 𝑇 such that 𝑇 = (𝑑+𝑒)×𝑛 .
• Feature fusion technique is gated feature vectors fusion leveraged by [25] that proposes a gated fusion unit which concatenates
feature vectors as input, then combines them with an average pooling layer, a dense layer and a sigmoid layer for caricature
recognition.
• Feature fusion approach is based on attention mechanism, which measures the contributions of each individual feature towards
the segmentation accuracy and can remove redundant features [26].
In this work, we weight horizontal and vertical concatenations of features over other feature fusion techniques.
3. Methodology
We follow the traditional patterns found in the literature on static analysis for vulnerability detection in smart contracts. Rather
than focusing on one or two multimodal data sources, as in previous works, we embrace all three in smart contract vulnerability
assessment. We extract features at the single-layer level (intramodal settings), as well as at the multi-layer level that comprises two-
way interactions among single layers (two-by-two intermodal settings) and a three-way interaction that combines features from all
single layers (three-by-three intermodal settings).
The resulting hierarchy of single-layer and multi-layer features is passed through carefully selected state-of-the-art machine
learning and deep learning models. A binary decision is formed to indicate whether the contract is vulnerable or not. Our feature
hierarchy forms the basis for several detection strategies described in this paper. We explain the best vulnerability detection strategies
given the smart contract format as inputs.
In the lines below, we highlight two essential parts of our framework. The first part deals with how to extract features and
fuse them to get a joint multimodal representation under intramodal and intermodal settings. The other part covers how to input
intramodal and intermodal features to AI models to design vulnerability detection strategies that provide full white-box knowledge.
3.1. Hierarchical feature extraction and fusion
In this subsection, we show how to acquire features from the SC layer, the BB layer, and the EVMB layer. We further detail how
to combine those features for smart contract vulnerability detection.
3.1.1. Feature extraction under intramodal settings

We evaluate key aspects to acquire features at every separate layer of the intramodal settings. Moreover, Fig. 1 illustrates the
intramodal settings and a more detailed account of intramodal layers is given below.
a) Source code layer (SC): This layer manages features acquired from processing the contract source code. In this work, we
choose the function as granularity level, and parse the contract source code to extract the set of all functions. We rely on existing
tools [12] to define the ground truth binary labels for the different functions. We leverage improvements from natural language
processing (NLP) and convert function definitions into embedding vectors. Specifically, we apply two types of embedding vector
models: the Word2vec model [27], and the BERT model [28]. The word2vec embeddings (SC-W2V) and the BERT embeddings
(SC-Bert) form the two features of interest at the SC layer.
4
Fig. 1. Overview of three modalities and associated features. (a) Source code layer (SC) manages features acquired from processing the contract source code. (b)
Built-based layer (BB) manages features extracted during the contract source code compilation. (c) The EVM Bytecode Layer (EVMB) is responsible for features
acquired from processing the contract bytecode expression.
b) Built-based layer (BB): This layer manages features extracted during the contract source code compilation. We exploit the
static analyzer tool Slither [12], and extract the call flow graph (CFG) and the static single assignment expression (SSA) of every
function. We apply word2vec and BERT embeddings to SSA encodings, and generate graph embeddings over CFGs thanks to an
untrained graph convolutional network (GCN). As a result, word2vec embeddings (SSA-W2V), BERT embeddings (SSA-Bert), and
graph embeddings (BB-CFG) are the three main features at the BB layer. We set the binary label for every function at the BB layer
through a simple matching with the corresponding function from the SC layer.
c) EVM Bytecode Layer (EVMB): The EVMB layer is responsible for features acquired from processing the contract bytecode
expression. We disassemble every contract EVM bytecode using the Eth2vec [29] tool, and design a CFG generator to output CFGs
from each disassembled contract. We perform label matching from SC functions and set the corresponding label to each CFG.
Regarding the features of interest at the EVMB layer, we apply an untrained GCN over CFGs to generate graph embeddings (EVMB-
CFG), as well as word2vec embeddings (EVMB-ASM) to every function in disassembled contracts.
5
Fig. 2. Intermodal settings for smart contract vulnerability detection. The two-by-two intermodal settings include the combinations of features from (SC+BB),
(BB+EVMB) and (SC+EVMB), and the three-by-three intermodal settings deal with features from (SC+BB+EVMB).
3.1.2. Feature extraction under intermodal settings

There are two approaches to define features of interest for vulnerability detection under intermodal settings as shown in Fig. 2:
The two-by-two intermodal settings manage the combinations of features from (SC+BB), (BB+EVMB) and (SC+EVMB) and the
three-by-three intermodal settings deal with features from (SC+BB+EVMB).
a) SC + BB Combination: This two-by-two intermodal combination of features from SC and BB layers investigates the associated
performance in vulnerability detection. It brings together five features: SC-W2V, SC-Bert, SSA-W2V, SSA-Bert and BB-CFG.
b) SC + EVMB Combination: Such two-by-two intermodal settings combine features from SC and EVMB layers. It assesses the
vulnerability detection performance linked to such combination, and brings together four features of interest: SC-W2V, SC-Bert,
EVMB-ASM and EVMB-CFG.
c) BB + EVMB Combination: The current combination evaluates the detection performance in mixing features from BB and
EVMB layers. It combines five main features: SSA-W2V, SSA-Bert, BB-CFG, EVMB-ASM and EVMB-CFG.
d) SC + BB + EVMB Combination: This three-by-three intermodal settings simultaneously leverages features from SC, BB and
EVMB layers. It investigates the efficacy of such intermodal vulnerability detection approach, and combines seven features of interest:
SC-W2V, SC-Bert, SSA-W2V, SSA-Bert, BB-CFG, EVMB-ASM and EVMB-CFG.
3.2. Vulnerability detection strategies
This subsection defines the multiple supervised tasks that associate the hierarchy of fused features together with state-of-the-art AI
models under intramodal and intermodal settings. Fig. 3 provides an overview of such tasks. In our framework, each task represents a
vulnerability detection branch test and pipelines input feature selection, feature dimension unification, feature fusion, model training
and decision-making. We aim to assess and classify the performance of every vulnerability detection strategy.
Appendix A details all the specific tasks leveraged to build our framework. The following is a brief description of what happens
during feature dimension unification, feature fusion, as well as during model training and testing.
a) Feature dimension unification: Recent research has revealed that at least one dimension should be equal when combining
different features of interest. To achieve such objective, dimension unification relies on three techniques. First, we apply a max-
pooling layer (MP) to input features. Second, we implement a fully connected layer also known as dense layer (Dense) over feature
candidates for fusion. Third, we experiment with spatial pyramid pooling layer (SPP) [30,31] to uniformize feature dimension.
Recent research has revealed pooling layers choose meaningful information but cause detailed information loss [32]. Dense layers
learn local and global feature information between layers [33], but learning too many features may slow down the training and lead
to overfitting.
6
Fig. 3. Multimodal feature fusion architecture for smart contract vulnerability detection. According to the multimodal feature fusion settings (intramodal and
intermodal), to select the processed features and input them into the multimodal feature fusion network. The network includes four stages: (a) uniform vector
dimensions, (b) feature fusion, (c) model training and (d) decision making.
From the three methods mentioned above, MP, Dense and SPP layers are all likely to have a positive impact in the dimension
unification stage: We evaluate the impact of each those dimension unification methods over vulnerability detection effectiveness.
b) Proper feature fusion: The proper feature fusion follows the dimension unification stage. It relies on horizontal concatenation
(𝑐𝑜𝑛𝑐𝑎𝑡) and vertical concatenation (𝑠𝑡𝑎𝑐𝑘) for feature fusion under intramodal and intermodal settings.
c) model training: After the feature fusion stage, we adopt a bi-LSTM model with self attention at fusion model training stage.
The bi-LSTM model with self attention is the fixed strategy selection model in this stage.
In addition, we replaced the bi-LSTM model with Vanilla RNN [34] and gated recurrent unit (GRU) [35] for performance com-
parison (Table 2), respectively, in the intramodal fusion of the SC layer at the fusion model training stage. The performance of the
bi-LSTM model for smart contract vulnerability detection is significantly better.
d) Decision-making: The last stage is decision-making. To train and evaluate the vulnerability detection, we proceed as follows.
First, we adopt a bi-LSTM model with self attention combined with a random forest (RF) model. Second, we combine a self-attentive
bi-LSTM with textCNN model.
We train RF model and textCNN model in the decision-making stage as the strategy selection model in our multimodal feature
fusion network.
Vulnerability detection strategies: Through the above four stages of model selection respectively, we investigate twelve strate-
gies of interest at each intramodal or intermodal setting after feature selection, and we denote | as the pipeline symbol.
• strategy 1: SPP | concat | (bi-LSTM+self-attention) | textCNN.

• strategy 2: SPP | concat | (bi-LSTM+self-attention) | RF.
• strategy 3: SPP | stack | (bi-LSTM+self-attention) | RF.
• strategy 4: SPP | stack | (bi-LSTM+self-attention) | textCNN.
• strategy 5: MP | concat | (bi-LSTM+self-attention) | textCNN.
• strategy 6: MP | concat | (bi-LSTM+self-attention) | RF.
• strategy 7: MP | stack | (bi-LSTM + self-attention) | RF.
• strategy 8: MP | stack | (bi-LSTM + self-attention) | textCNN.
• strategy 9: Dense | stack | (bi-LSTM + self-attention) | RF.
• strategy 10: Dense | stack | (bi-LSTM+self-attention) | textCNN.
• strategy 11: Dense | concat | (bi-LSTM+self-attention) | textCNN.
• strategy 12: Dense | concat | (bi-LSTM+self-attention) | RF.
The above 12 vulnerability detection strategies are carried out separately under 7 multimodal feature fusion settings, and the
optimal performance model under each setting is used as the final framework.
The performance of each strategy and outperforming strategies in our smart contract vulnerability detection framework will be
described in detail in the next experiment section.
4. Experiments
4.1. Experimental settings
To undertake empirical analysis, we develop a top-down parser with leftmost derivation, using the ANTLR tool version 4 [36]
over the solidity grammar [37]. Our parser dissects the smart contract source code into functions and performs data wrangling
operations such as word splitting and removal of unimportant terms. We code a CFG generator for EVMB that leverages contracts
disassembled with a modified version of the Eth2Vec tools [29].
7
Table 1
Summary of class imbalance resolution approaches.
Strategy Embeddings Wclass 0 Wclass 1 Accuracy F1 Precision Recall AUC-ROC
None SC-Bert 1 1 0.9611 0.8386 0.9174 0.7722 0.8831

None SC-W2V 1 1 0.9605 0.8370 0.9239 0.7651 0.8740
INS SC-Bert 0.1659425 1.7340575 0.9584 0.8249 0.9172 0.7495 0.8679

INS SC-W2V 0.1659425 1.7340575 0.9559 0.8093 0.9307 0.7159 0.8539
ENS SC-Bert 0.85015092 1.14984908 0.9612 0.8374 0.9267 0.7639 0.8774

ENS SC-W2V 0.85015092 1.14984908 0.9601 0.8338 0.9310 0.7550 0.8727
ISNS SC-Bert 0.56282351 1.43717649 0.9588 0.8274 0.9164 0.7541 0.8718

ISNS SC-W2V 0.56282351 1.43717649 0.9583 0.8260 0.9263 0.7453 0.8681
SMOTE SC-W2V 1.285251 1 0.9430 0.9452 0.9168 0.9754 0.9427

SMOTE SC-Bert 1.285251 1 0.9416 0.9438 0.9160 0.9734 0.9413
We implement AI models using TensorFlow, Keras, Gensim, Stellargraph and Pytorch libraries. All experiments are carried out on
a physical machine with the following characteristics: Intel (R) Xeon (R) Gold 6240R CPU running at 2.40 GHz, 32 GB of RAM and
a hard disk drive of 6.5 TB. We run experiments with Python 3.8.2 on Ubuntu 20.04.
4.2. Dataset construction
We exploit contracts from the SmartEmbed dataset [14]. Such dataset is made of 5000 verified smart contracts published with
source code on the Ethereum Mainnet.
Using our parser, we process the contract source code and extract the set of function definitions. We obtain a complete dataset
of 101,082 functions. We label our functions using existing tools [12] and the result is as follows: 87,641 functions are classified as
non-vulnerable under the class 0, and 13,441 functions are classified as vulnerable under the class 1. Such statistics translate into
class imbalance between class 0 and class 1.
We solve the class imbalance issue at the SC layer and propagate the effects over the other layers. To achieve our objective, we
evaluate several class imbalance resolution approaches summarized in Table 1 in which W represents the weight associated to a
class. We list the class imbalance resolution approaches below.
1. We upsample and downsample examples from both classes with no particular technique in mind. This refers to strategy None in
Table 1.
2. We follow the inverse number of samples (INS) approach.
3. We implement the effective sample number weighting (ENS) technique.
4. We adopt the inverse square root of the number of samples (ISNS) method.
5. We experiment with the synthetic minority oversampling technique (SMOTE)
We leverage a random forest AI model as our baseline construction to solve the imbalance issue. For every approach, the testing
set takes 20% of the dataset. We achieve the best results under SMOTE. We undersample class 0 with a factor of 28.5251% and we
upsample class 1 to 25 000 samples. We propagate the SMOTE technique to address class imbalance issues at the BB and the EVMB
layers, respectively. Once the final dataset is ready, We select 95% of the functions as the training set and the other 5% as the test
set. We feed the feature vectors into the multimodal feature fusion network for training under intramodal and intermodal settings to
produce vulnerability detection results (Fig. 3).
4.3. Performance analysis
Following, we propose the questions committed to provide white-box knowledge, and give corresponding answers based on
analysis of the experimental results. Among them, RQ1 and RQ2 are for intramodal settings, and RQ3 to RQ6 are for intermodal
settings. Besides, We embolden each column-wise tuple that holds the highest performance result in tables below. Such a move aims
to ease the interpretability of data.
4.3.1. Intramodal settings (RQ1 to RQ2)

a) RQ1: What strategy yields highest detection results in SC, BB and EVMB separately? Is our intramodal framework
outperform state-of-the-art methods?
Regarding the SC layer, we leverage several state-of-the-art AI models: LSTM, Vanilla RNN [34], RF, textCNN and gated recurrent
unit (GRU) [35] for training and testing over the balanced dataset attached to the layer. We compare such state-of-art models with
vulnerability detection tasks from task 1 to task 12 and Table 2 portrays the comparison results. We observe that for selecting (SC-
W2V+SC-Bert) features, strategy 7 (MP | stack | (bi-LSTM+self-attention) | RF) outperforms existing state-of-the-art models and
aligns as the best strategy for vulnerability detection at the SC layer.
8
Table 2
SC layer performance comparison.
Methods Acc (%) Recall (%) Precision (%) F1 (%) Methods Acc (%) Recall (%) Precision (%) F1 (%)
MP|stack|bi-LSTM|RF 98.04 97.89 98.20 98.04 MP|stack|bi-LSTM|textCNN 97.62 97.30 97.96 97.62
MP|concat|bi-LSTM|RF 97.40 96.76 98.08 97.40 MP|concat|bi-LSTM|textCNN 97.36 96.91 97.84 97.36
MP|stack|RNN [34]|RF 95.88 94.31 97.80 95.88 MP|stack|RNN [34]|textCNN 95.44 94.09 97.13 95.44
MP|stack|GRU [35]|RF 95.82 94.44 97.52 95.82 MP|stack|GRU [35]|textCNN 95.10 94.36 96.10 95.10
Dense|stack|bi-LSTM|RF 97.62 96.74 98.56 97.62 Dense|stack|bi-LSTM|textCNN 97.66 96.74 98.64 97.66
Dense|concat|bi-LSTM|RF 97.48 97.52 97.44 97.48 Dense|concat|bi-LSTM|textCNN 97.43 97.06 97.84 97.44
SPP|stack|bi-LSTM|RF 96.94 96.51 97.40 96.94 SPP|stack|bi-LSTM|textCNN 96.67 96.82 96.48 96.67
SPP|concat|bi-LSTM|RF 97.08 96.52 97.67 97.08 SPP|concat|bi-LSTM|textCNN 95.92 96.67 95.12 95.92
Table 3
BB layer performance comparison.
Methods BB-CFG1 BB-CFG2

Acc (%) Recall (%) Precision (%) F1 (%) Acc (%) Recall (%) Precision (%) F1 (%)
MP|stack|bi-LSTM|RF 97.80 96.72 98.96 97.80 98.27 97.56 99.01 98.27

MP|stack|bi-LSTM|textCNN 97.33 96.20 98.56 97.33 98.08 97.40 98.80 98.08
MP|concat|bi-LSTM|RF 97.44 96.47 98.48 97.44 98.24 97.33 99.20 98.24
MP|concat|bi-LSTM|textCNN 97.60 96.56 98.72 97.60 98.08 97.77 98.40 98.08
Dense|stack|bi-LSTM|RF 97.59 96.48 98.77 97.59 98.13 97.35 98.96 98.13

Dense|stack|bi-LSTM|textCNN 97.24 95.95 98.64 97.24 98.13 97.33 98.99 98.13
Dense|concat|bi-LSTM|RF 97.76 97.01 98.56 97.76 97.88 96.72 99.12 97.88
Dense|concat|bi-LSTM|textCNN 97.08 96.30 97.92 97.08 97.36 96.69 98.08 97.36
SPP|stack|bi-LSTM|RF 97.05 96.20 97.97 97.05 98.16 97.60 98.75 98.16

SPP|stack|bi-LSTM|textCNN 97.44 96.57 98.37 97.44 97.90 97.00 98.89 97.91
SPP|concat|bi-LSTM|RF 97.24 96.60 97.92 97.24 98.16 97.78 98.56 98.16
SPP|concat|bi-LSTM|textCNN 97.32 96.4 98.24 97.32 98.12 96.95 99.36 98.12
cross-attention [38] 96.54 97.05 96.07 96.56 97.18 97.80 96.61 97.20
Table 4
EVMB layer performance comparison.
Methods EVMB-DGCNN [39] EVMB-CFG1 EVMB-CFG2

Acc Recall Precision F1 Acc Recall Precision F1 Acc Recall Precision F1
(%) (%) (%) (%) (%) (%) (%) (%) (%) (%) (%) (%)
MP|stack|bi-LSTM|RF 93.48 91.84 95.44 93.48 93.92 93.03 94.96 93.92 95.16 93.93 96.56 95.16
MP|stack|bi-LSTM|textCNN 93.16 91.53 95.12 93.16 93.38 92.61 94.28 93.38 94.60 93.93 95.36 94.60
MP|concat|bi-LSTM|RF 93.84 92.41 95.52 93.84 93.68 92.72 94.80 93.68 95.08 93.85 96.48 95.08
MP|concat|bi-LSTM|textCNN 94.10 92.73 95.76 94.20 93.88 92.88 95.04 93.88 95.24 94.84 95.68 95.24
Dense|stack|bi-LSTM|RF 94.10 92.42 96.08 94.10 94.44 93.57 95.44 94.44 95.64 94.96 96.40 95.64
Dense|stack|bi-LSTM|textCNN 93.32 92.54 94.24 93.32 94.40 92.96 96.08 94.40 95.34 93.92 96.96 95.34
Dense|concat|bi-LSTM|RF 94.08 92.29 96.20 94.08 93.96 91.98 96.32 93.96 95.20 94.35 96.16 95.20
Dense|concat|bi-LSTM|textCNN 93.72 93.07 94.48 93.72 93.87 92.68 95.28 93.88 94.72 92.74 97.04 94.72
SPP|stack|bi-LSTM|RF 91.24 90.69 91.92 91.24 92.56 91.92 93.32 92.56 93.76 93.13 94.48 93.76
SPP|stack|bi-LSTM|textCNN 90.50 91.62 89.16 90.50 91.32 90.13 92.80 91.32 93.56 93.11 94.08 93.56
SPP|concat|bi-LSTM|RF 91.24 90.24 92.48 91.24 91.48 90.99 92.08 91.48 94.36 93.42 95.44 94.36
SPP|concat|bi-LSTM|textCNN 91.56 89.20 94.56 91.57 91.72 90.65 93.04 91.72 93.36 94.53 92.56 93.60
At the BB layer, we hook on the AMEVulDetector model [38], which promotes feature fusion through a cross-attention layer, to
compare our framework strategies. Further, we implement two types of untrained GCNs: GCN1, which generates embeddings BB-
CFG1 under the Stellargraph library, and GCN2, which is implemented with the Pytorch library and outputs embeddings BB-CFG2.
The key difference between GCN1 and GCN2 stems from the observation that embeddings from GCN1 reveal a very large sparse
matrix, while embeddings from GCN2 reveal a lesser sparse matrix than GCN1. We hypothesize such sparse matrix results from con-
sequent loss of information during GCN1 generation. From the experimental results shown in Table 3, it is observed that BB-CFG2
enhances the performance results of vulnerability detection. We conclude that selecting (SSA-W2V+SSA-Bert+BB-CFG) features,
along with strategy 7 (MP | stack | (bi-LSTM+self-attention) | RF) yields the highest performance at the BB layer against AME-
VulDetector model [38] and the range of all the remaining strategies at the BB layer.
As regards the EVMB layer, we cling to the deep graph convolutional neural network (DGCNN) model [39] delivering em-
beddings EVMB-DGCNN, which is compared with the GCN model. We reintroduce the above-mentioned GCN1, which produces
9
Fig. 4. ROC curves for intramodal settings.
embeddings EVMB-CFG1, and GCN2, which outputs embeddings EVMB-CFG2. We analyze the results in Table 4 and make the
following conclusion based on EVMB-CFG2 that offers higher detection effectivity than EVMB-CFG1: strategy 9 (Dense | stack |
(bi-LSTM+self-attention) | RF) on EVMB-CFG2 performs better than DGCNN [39] and the range from strategy 1 to strategy 12 at
the EVMB layer.
We depict in Fig. 4 the receiver operator characteristic (ROC) curves to support our findings regarding optimal strategies in SC,
BB and EVMB layers.
b) RQ2: What modality leads to better detection performance among SC, BB and EVMB?
Based on empirical evidence from Table 2, Table 3, and Table 4, we conclude the following under intramodal settings. Among
SC, BB and EVMB, BB leads smart contact vulnerability assessment, followed by SC performance-wise, while EVMB offers the least
performance upshot.
4.3.2. Intermodal settings (RQ3 to RQ6)

a) RQ3: What strategy upholds highest detection results in (SC + BB), (SC + EVMB), (BB + EVMB), and (SC + BB +
EVMB) combinations separately?
Regarding SC+BB combination, we exploit GCN1 and GCN2 from the BB layer as detailed in Answer to RQ1. We weight the
combination SC+BB-CFG2, which outputs higher results. We conclude based on Table 5 and Table 6 that for (SC-W2V+SC-Bert+SSA-
W2V+SSA-Bert+BB-CFG) feature selecting, strategy 9 (Dense | stack | (bi-LSTM+self-attention) | RF) realizes better vulnerability
detection within the range from 12 strategies.
As for SC+EMVB combination, we call upon GCN1 and GCN2 from the EVMB layer as illustrated in Answer to RQ1. We spot-
light the SC+EVMB-CFG2 combination, which delivers increased outcome. We conclude based on Table 5 and Table 6 that for
(SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) feature selection, strategy 10 (Dense | stack | (bi-LSTM+self-attention) | textCNN)
achieves better vulnerability detection within the range from all strategies.
10
Table 5
Two-by-two intermodal fusion performance comparison (GCN1-CFG1).
Methods SC+BB-CFG1 SC+EVMB-CFG1 BB+EVMB-CFG1

(%) (%) (%) (%) (%) (%) (%) (%) (%) (%) (%) (%)
Table 6
Two-by-two intermodal fusion performance comparison (GCN2-CFG2).
Methods SC+BB-CFG2 SC+EVMB-CFG2 BB+EVMB-CFG2

(%) (%) (%) (%) (%) (%) (%) (%) (%) (%) (%) (%)
Table 7
Three-by-three intermodal fusion performance.
Methods SC+BB+EVMB-CFG1 SC+BB+EVMB-CFG2

Acc (%) Recall (%) Precision (%) F1 (%) Acc (%) Recall (%) Precision (%) F1 (%)
MP|stack|bi-LSTM|RF 99.37 99.15 99.60 99.37 99.49 99.08 99.90 99.49

MP|stack|bi-LSTM|textCNN 99.37 98.96 99.78 99.37 99.57 99.23 99.92 99.57
MP|concat|bi-LSTM|RF 99.40 99.20 99.60 99.40 99.40 99.12 99.68 99.40
MP|concat|bi-LSTM|textCNN 99.32 99.20 99.44 99.32 99.52 99.20 99.84 99.52
Dense|stack|bi-LSTM|RF 99.59 99.26 99.92 99.59 99.70 99.39 99.98 99.68

Dense|stack|bi-LSTM|textCNN 99.56 99.28 99.84 99.56 99.71 99.36 99.92 99.64
Dense|concat|bi-LSTM|RF 99.16 99.04 99.28 99.16 99.28 98.89 99.68 99.28
Dense|concat|bi-LSTM|textCNN 99.12 98.58 99.68 99.12 99.28 99.20 99.36 99.28
SPP|stack|bi-LSTM|RF 99.05 98.64 99.49 99.05 99.50 99.36 99.64 99.50

SPP|stack|bi-LSTM|textCNN 99.44 99.22 99.68 99.44 99.28 99.04 99.52 99.28
SPP|concat|bi-LSTM|RF 99.20 98.66 99.76 99.20 99.44 99.28 99.60 99.44
SPP|concat|bi-LSTM|textCNN 99.24 99.04 99.44 99.24 99.28 99.20 99.36 99.28
In terms of BB+EVMB combination. We emphasize on BB+EVMB-CFG2 combination that renders superior outcome, and con-
clude the following. For selecting (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) features, strategy 8 MP | stack | (bi-
LSTM+self-attention) | textCNN) yields higher performance within the range from all strategies, based on Table 5 and Table 6.
In the case of SC+BB+EVMB three-by-three intermodal combination, we leverage GCN1 and GCN2. We combine (SC+BB) with
EVMB-CFG2 instead of EVMB-CFG1, to enhance results. Based on Table 7, we conclude selecting (SC-W2V+SC-Bert+SSA-W2V+SSA-
Bert+BB-CFG+EVMB-CFG+EVMB-ASM) features, strategy 10 (Dense | stack | (bi-LSTM+self-attention) | textCNN) reaches the
highest performance within the range from strategy 1 to strategy 12.
11
Fig. 5. Two and Three layer intermodal fusion ROC curve.
b) RQ4: Which two-by-two intermodal combination yields the best detection performance towards vulnerabilities in
smart contracts?
Based on empirical evidence from Table 5 and Table 6, we conclude the following. (SC+BB) leads to higher performances,
followed by (BB+EVMB), while (SC+EVMB) yields the least detection performance in two-by-two intramodal settings.
c) RQ5: What approach appeals more between two-by-two and three-by-three intermodal settings?
We exploit the results from Table 5, Table 6 and Table 7. We observe that the three-by-three intermodal settings achieve better
vulnerability detection performance than the two-by-two settings.
d) RQ6: What settings deliver better results between intramodal and intermodal settings?
From the lines above, we learned that three-by-three settings deliver better results than two-by-two intermodal settings. We infer
from Table 2, Table 3, Table 4, Table 5 and Table 6 that two-by-two intermodal settings outperform the intramodal configuration.
We conclude that intermodal settings outperform intramodal settings in vulnerability detection performance.
Fig. 5 illustrates the receiver operator characteristic (ROC) curves to weight our conclusions towards optimal strategies for
vulnerability detection in two-by-two and three-by-three intermodal settings.
4.3.3. Outperforming strategies in our framework

Following the above performance analysis, we summarize the outperforming strategies of our smart contract vulnerability detec-
tion framework in Table 8. Besides, Table 8 provides the outperforming strategy towards each intramodal and intermodal feature
fusion settings, together with the corresponding detection performance.
With extensive experiments, we found that our framework presents the following advantages: our framework provides strong
white-box knowledge for intramodal, two-by-two and three-by-three intermodal feature selection, and achieves higher vulnerability
detection performance compared to existing methods.
12
Table 8
Outperforming Strategies In Our Framework.
Settings Methods Acc (%) Recall (%) Precision (%) F1 (%)
SC MP|stack|bi-LSTM|RF 98.04 97.89 98.20 98.04

BB MP|stack|bi-LSTM|RF 98.27 97.56 99.01 98.2
EVMB Dense|stack|bi-LSTM|RF 95.64 94.96 96.40 95.6
SC+BB Dense|stack|bi-LSTM|RF 99.39 98.89 99.90 99.39
SC+EVMB Dense|stack|bi-LSTM|textCNN 98.81 98.29 99.34 98.81
BB+EVMB MP|stack|bi-LSTM|textCNN 99.23 98.97 99.50 99.2
SC+BB+EVMB Dense|stack|bi-LSTM|textCNN 99.71 99.36 99.92 99.64
5. Discussion
5.1. Significance of our methodology
In smart contract vulnerability detection, feature extraction and feature fusion bear a certain complexity and developers lack a
clear methodology to achieve such objectives. Also, the majority of the existing works leverages intramodal and intermodal informa-
tion without disclosing the fundamentals of their methodology. In most cases, researchers choose and process features according to
fixed rules under black-box settings. Our paper investigates white-box methodology towards smart contract vulnerability detection.
5.2. Flexibility in feature extraction
We notice that many published contracts lack their source code. Penetration testers in most cases have to deal solely with contract
bytecode, which increases the complexity of the analysis. We found that to concentrate on the EVMB layer leads to poor detection
performance, and some other type of information from SC or BB layers are necessary to improve the experience. Our framework can
process contracts published with our without their source code.
5.3. Limitations of our framework
Our architecture is limited by implementation rather than the choice of design. First, we believe that adopting the majority of AI
techniques is impractical, and we investigate few AI models to design our methodology. Second, our work relies on the word2vec
NLP technique and lacks support for out-of-vocabulary words. We propose to replace word2vec with the fastText NLP model [40]
to solve such limitation. Third, our vulnerability detection framework is modeled as a binary classification problem over supervised
learning. In this work, we are more interested into whether a contract contains vulnerability. A further work that investigates specific
type of vulnerabilities over multi-class classification is desired.
6. Related works
In this section, we revisit the literature on vulnerability detection in smart contracts. It is worth mentioning that we could not find
an AI framework based on multimodal feature fusion in the field of smart contract vulnerability detection that offers the following:
the support of strong white-box knowledge towards smart contract modality selection, and high-performance vulnerability detection
capabilities with a three-layer modality fusion.
6.1. Rule-based approach for vulnerability detection under static analysis
To formalize the research on vulnerabilities in smart contracts, the work of Atzei et al. [10] proposed a taxonomy of smart
contract vulnerabilities. Following such an initiative, Argañaraz et al. [11] opted for the use of static analysis over contract source
codes to extract both functional and security vulnerabilities. In their work, the authors formulate some expert-based rules towards
the programming language of interest. Gao et al. [14] designed the SmartEmbed vulnerability analysis tool to detect clones, bugs,
and to validate the vulnerability exemption of a contract. Moreover, Gao et al. [14] outlined a major disadvantage of relying on
expert-based rules. It is cumbersome to keep up with the attack surface and the sophistication of attacks. Furthermore, Gao et al.
[14] vowed for the use of static analysis over more reliable but expensive techniques such as dynamic symbolic execution [7][17][18]
and dynamic analysis [15][16]. The authors in [12] designed Slither, a fantastic and incremental static analysis based vulnerability
detection framework for solidity written smart contracts. Slither can accommodate new vulnerability detectors aiming at uncovering
novel vulnerabilities in the wild.
Existing rule-based vulnerability detection methods are complex and expensive. In order to improve detection efficiency and
generalization capabilities, data-driven approaches for static analysis-based vulnerability detection have become a priority.
6.2. Data-driven approach for static analysis-based vulnerability detection
To keep pace with the always evolving smart contracts attack surface, the authors in [41] advocate the usage of data-driven
techniques, such as machine learning, for discovering vulnerabilities in smart contracts. The interesting work of Eth2vec [29] applies
13
unsupervised machine learning techniques to built-based features and EVM bytecode extracted features. It aims to automatically
cluster buggy contracts as well as cloned contracts. Although Eth2vec is an innovative work, it only supports contracts from the
training dataset and does not employ supervised learning.
The authors in [42] further advocate the use of supervised classification, which leads to higher detection results than unsupervised
classification in natural language processing. Further to support AI-based vulnerability detection, Teng et al. [43] introduce a static
time-slicing source-code based protocol based on long short term memory (LSTM) model, to fetch contracts which behave differently
from the initial one reported on the Ethereum dapps website. The work of [43] requires to manually extract the features from
data in order to characterize the dataset. The compelling work of Qian et al. [18] leverages a bidirectional-LSTM with an attention
mechanism over various embedding vector dimensions. It aims to uncover the embedding vector dimension that yields the best
vulnerability detection results. Moreover, compared to state-of-the art AI models such as Vanilla-recurrent neural network (RNN),
LSTM, and bi-LSTM without self-attention, the work of Qian et al. [18] leads to higher performances.
The work of Liu et al. [38] designs a smart contract vulnerability detection framework that fuses features from SC and BB layers
in black box settings. Such a framework exploits an attentive multi-encoder network comprising self-attention and cross-attention
layers to detect vulnerabilities and to provide feature importance through weight interpretability. Although, being such a promising
piece of research, the work of Liu et al. [38] exhibits two drawbacks. First it still builds upon expert-based rules and inherits the
weaknesses associated to rule-based systems. Second, it only supports smart contracts published with source code. It therefore lacks
support for the EVM bytecode processing, which is an important limitation.
To the best of our knowledge, this is the first time a framework has been able to provide strong white box knowledge for
smart contract modality selection, and improve the vulnerability detection performance based on three-layer modality fusion with
multimodal learning. Our framework supports bytecode based vulnerability detection for contracts published without their source
code.
7. Conclusion
In this paper, we design a novel framework that detects vulnerabilities in Ethereum-based solidity written smart contracts.
To that end, we leverage three modalities of interest. First, features extracted from the contract source code (SC layer). Second,
features acquired during the contract compilation (BB layer). Third, features obtained from the contract bytecode processing (EVMB
layer). Different from existing schemes that leverage intramodal or two-by-two intermodal settings under a black-box approach,
our work offers the following innovations. We dismiss expert patterns and hand-crafted feature fusions to promote AI automation.
Our work characterizes a robust methodology of multimodal learning based on several supervised detection tasks. A number of
advanced AI models are selected to evaluate these tasks on real-world datasets. We provide developers and researchers with clear and
accurate whitebox knowledge, which helps to detect vulnerabilities in intramodal and intermodal settings with high performance.
Our framework allows one or two modalities to be absent while functioning optimally. Our framework supports the analysis of
contracts published without source code. Regarding the outcomes, the empirical findings show that under intramodal settings,
BB provides the best performances followed by SC, with at last, EVMB. Under two-by-two intermodal settings, SC+BB performs
better, followed by BB+EVMB and finally SC+EVMB. We conclude that three-by-three intermodal settings outperform two-by-two
intermodal settings, and two-by-two intermodal settings outdo intramodal settings. Although our scheme limits the number AI
models used, it is impractical to consider all AI models for vulnerability training and inference. Our framework advances the field
of research towards smart contract vulnerability detection. As future works, we aim to solve the out-of-vocabulary issue in our
framework, investigate multi-class classification, and dive into feature importance.
CRediT authorship contribution statement
Wanqing Jie: Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review & editing. Qi Chen:
Supervision, Writing – review & editing. Jiaqi Wang: Methodology, Supervision, Visualization. Arthur Sandor Voundi Koe: In-
vestigation, Writing – original draft, Writing – review & editing. Jin Li: Funding acquisition, Project administration, Supervision.
Pengfei Huang: Formal analysis, Investigation. Yaqi Wu: Software, Visualization. Yin Wang: Investigation, Software.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Data availability
The data that has been used is confidential.
Acknowledgements
This work was funded by grants from the National Key Project of China (No. 2020YFB1005700).
14
Appendix A. Vulnerability detection strategies
The lines below describe the multiple supervised tasks of interest that combine the hierarchy of fused features together with
state-of-the-art AI models under intramodal and intermodal settings.
A.1. Strategies under intramodal settings
a) SC layer
At the SC layer, we select SC-W2V and SC-Bert as input features. We proceed with dimension unification with SPP, MP and
Dense separately. Then we fuse features on the one hand with 𝑐𝑜𝑛𝑐𝑎𝑡 and on the other hand with 𝑠𝑡𝑎𝑐𝑘. To train and evaluate the
vulnerability detection, we proceed as follows. First, we adopt a bi-LSTM model with self attention combined with a random forest
(RF) model. Second, we combine a self-attentive bi-LSTM with textCNN model.
We investigate twelve tasks of interest at the SC layer, and we denote | as the pipeline symbol.
• task 1: (SC-W2V+SC-Bert) | SPP | concat | (bi-LSTM+self-attention) | textCNN.

• task 2: (SC-W2V+SC-Bert) | SPP | concat | (bi-LSTM+self-attention) | RF.
• task 3: (SC-W2V+SC-Bert) | SPP | stack | (bi-LSTM+self-attention) | RF.
• task 4: (SC-W2V+SC-Bert) | SPP | stack | (bi-LSTM+self-attention) | textCNN.
• task 5: (SC-W2V+SC-Bert) | MP | concat | (bi-LSTM+self-attention) | textCNN.
• task 6: (SC-W2V+SC-Bert) | MP | concat | (bi-LSTM+self-attention) | RF.
• task 7: (SC-W2V+SC-Bert) | MP | stack | (bi-LSTM+self-attention) | RF.
• task 8: (SC-W2V+SC-Bert) | MP | stack | (bi-LSTM+self-attention) | textCNN.
• task 9: (SC-W2V+SC-Bert) | Dense | stack | (bi-LSTM+self-attention) | RF.
• task 10: (SC-W2V+SC-Bert) | Dense | stack | (bi-LSTM+self-attention) | textCNN.
• task 11: (SC-W2V+SC-Bert) | Dense | concat | (bi-LSTM+self-attention) | textCNN.
• task 12: (SC-W2V+SC-Bert) | Dense | concat | (bi-LSTM+self-attention) | RF.
b) BB layer
Regarding the BB layer, we select SSA-W2V, SSA-Bert and BB-CFG as input features. We implement dimension unification using
SPP, MP and Dense separately. Then we fuse features on the one hand with 𝑐𝑜𝑛𝑐𝑎𝑡 and on the other hand with 𝑠𝑡𝑎𝑐𝑘. For model
training and inference, first, we adopt a bi-LSTM model with self attention combined with a random forest (RF) model. Second, we
combine a self-attentive bi-LSTM with textCNN model. We inspect twelve tasks of interest at the BB layer, and we signify | as the
pipeline symbol.
• task 13: (SSA-W2V+SSA-Bert+BB-CFG) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 14: (SSA-W2V+SSA-Bert+BB-CFG) | MP | concat | (bi-LSTM+self-attention) | RF.
• task 15: (SSA-W2V+SSA-Bert+BB-CFG) | MP | stack | (bi-LSTM+self-attention) | textCNN.
• task 16: (SSA-W2V+SSA-Bert+BB-CFG) | MP | stack | (bi-LSTM+self-attention) | RF.
• task 17: (SSA-W2V+SSA-Bert+BB-CFG) | Dense | concat | (bi-LSTM+self-attention) | textCNN.
• task 18: (SSA-W2V+SSA-Bert+BB-CFG) | Dense | concat | (bi-LSTM+self-attention) | RF.
• task 19: (SSA-W2V+SSA-Bert+BB-CFG) | Dense | stack | (bi-LSTM+self-attention) | textCNN.
• task 20: (SSA-W2V+SSA-Bert+BB-CFG) | Dense | stack | (bi-LSTM+self-attention) | RF.
• task 21: (SSA-W2V+SSA-Bert+BB-CFG) | SPP | concat | (bi-LSTM+self-attention) | RF.
• task 22: (SSA-W2V+SSA-Bert+BB-CFG) | SPP | concat | (bi-LSTM+self-attention) | textCNN.
• task 23: (SSA-W2V+SSA-Bert+BB-CFG) | SPP | Dense | (bi-LSTM+self-attention) | RF.
• task 24: (SSA-W2V+SSA-Bert+BB-CFG) | SPP | Dense | (bi-LSTM+self-attention) | textCNN.
c) EVMB layer
Concerning the EVMB layer, we adopt EVMB-CFG and EVMB-ASM as input features. We leverage SPP, MP and Dense separately
to unify dimension. We fuse features on the one hand with 𝑐𝑜𝑛𝑐𝑎𝑡 and on the other hand with 𝑠𝑡𝑎𝑐𝑘. For model training and testing,
first, we implement a bi-LSTM model with self attention combined with a random forest (RF) model. Second, we incorporate a
self-attentive bi-LSTM with textCNN model.
We evaluate twelve tasks of interest at the EVMB layer. The pipeline symbol | represents the process flow from feature selection
to decision-making.
• task 25: (EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 26: (EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | RF.
• task 27: (EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) | textCNN.
• task 28: (EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) | RF.
• task 29: (EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention) | textCNN.
• task 30: (EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention) | RF.
15
• task 31: (EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | textCNN.

• task 32: (EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | RF.
• task 33: (EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) | textCNN.
• task 34: (EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) | RF.
• task 35: (EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) | textCNN.
• task 36: (EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) | RF.
A.2. Strategies under intermodal settings
a) SC+BB Combination
Regarding SC+BB combination, we set SC-W2V, SC-Bert, SSA-W2V, SSA-Bert and BB-CFG as input features. We unify feature
dimension with MP, SPP and Dense separately. We fuse features on the one hand with 𝑐𝑜𝑛𝑐𝑎𝑡 and on the other hand with 𝑠𝑡𝑎𝑐𝑘. To
train and evaluate AI models for vulnerability detection, we proceed as follows. First, we deploy a bi-LSTM model with self attention
combined with a random forest (RF) model. Second, we link a self-attentive bi-LSTM with textCNN model.
We examine twelve tasks of interest under SC+BB combination.
• task 37: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 38: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | MP | concat | (bi-LSTM+self-attention) | RF.
• task 39: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | MP | stack | (bi-LSTM+self-attention) | textCNN.
• task 40: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | MP | stack | (bi-LSTM+self-attention) | RF.
• task 41: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | Dense | concat | (bi-LSTM+self-attention) | textCNN.
• task 42: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | Dense | concat | (bi-LSTM+self-attention) | RF.
• task 43: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | Dense | stack | (bi-LSTM+self-attention) | textCNN.
• task 44: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | Dense | stack | (bi-LSTM+self-attention) | RF.
• task 45: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | SPP | concat | (bi-LSTM+self-attention) | textCNN.
• task 46: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | SPP | concat | (bi-LSTM+self-attention) | RF.
• task 47: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | SPP | stack | (bi-LSTM+self-attention) | textCNN.
• task 48: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | SPP | stack | (bi-LSTM+self-attention) | RF.
b) SC+EVMB Combination
Regarding SC+EVMB combination, we fix SC-W2V, SC-Bert, EVMB-CFG, and EVMB-ASM as input features. We apply dimension
unification using MP, SPP and Dense separately. We fuse features with concat stack, respectively. For AI models training and
inference, first, we choose a bi-LSTM model with self attention combined with a random forest (RF) model. Second, we adopt a
self-attentive bi-LSTM with textCNN model.
We evaluate twelve tasks of interest under SC + EVMB combination.
• task 49: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 50: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | RF.
• task 51: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) | textCNN.
• task 52: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) | RF.
• task 53: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention) | textCNN.
• task 54: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention) | RF.
• task 55: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | textCNN.
• task 56: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | RF.
• task 57: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) | textCNN.
• task 58: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) | RF.
• task 59: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) | textCNN.
• task 60: (SC-W2V+SC-Bert+EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) | RF.
c) BB+EVMB Combination
As of BB+EVMB combination, we define SSA-W2V, SSA-Bert, BB-CFG, EVMB-CFG, and EVMB-ASM as input features. We enforce
dimension unification with SPP, MP and Dense separately. We fuse features with 𝑐𝑜𝑛𝑐𝑎𝑡 and, 𝑠𝑡𝑎𝑐𝑘 respectively. To support vulnera-
bility detection under multimodal learning, first, we integrate a bi-LSTM model with self attention combined with a random forest
(RF) model. Second, we adopt a self-attentive bi-LSTM with textCNN model.
We explore twelve tasks of interest under the BB + EVMB combination.
• task 61: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 62: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | RF.
• task 63: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) | textCNN.
• task 64: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) | RF.
• task 65: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention) | textCNN.
16
• task 66: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention) | RF.

• task 67: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | textCNN.
• task 68: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | RF.
• task 69: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) | textCNN.
• task 70: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) | RF.
• task 71: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) | textCNN.
• task 72: (SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) | RF.
d) SC+BB+EVMB Combination
Concerning SC+BB+EVMB full combination of layers, we set SC-W2V, SC-Bert, SSA-W2V, SSA-Bert, BB-CFG, EVMB-CFG, and
EVMB-ASM as input features. We implement dimension unification with SPP, MP and Dense separately, and fuse features with
concat and stack respectively. For vulnerability detection under multimodal learning, first, we deploy a bi-LSTM model with self
attention combined with a random forest (RF) model. Second, we endorse a self-attentive bi-LSTM with textCNN model.
We explore twelve tasks of interest under SC+BB+EVMB combination.
• task 73: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) |

textCNN.
• task 74: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) |
RF.
• task 75: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) |
textCNN.
• task 76: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | MP | stack | (bi-LSTM+self-attention) |
RF.
• task 77: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-attention)
| textCNN.
• task 78: (SC-W2V + SC-Bert + SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | concat | (bi-LSTM+self-
attention) | RF.
• task 79: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention)
| textCNN.
• task 80: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention)
| RF.
• task 81: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention) |
textCNN.
• task 82: (SC-W2V + SC-Bert + SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | concat | (bi-LSTM+self-attention)
| RF.
• task 83: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) |
textCNN.
• task 84: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG+EVMB-CFG+EVMB-ASM) | SPP | stack | (bi-LSTM+self-attention) |
RF.
Appendix B. Technical design of key literature study
In this subsection, we pinpoint three main related works that pertain to our construction. We highlight their working rationales
through algorithms that are written in pseudocode. We explore their know-how in detecting smart contracts vulnerabilities.
B.1. Slither framework
Feist et al. [12] developed Slither as an open source tool with an incremental design in mind. Through the Slither Python API,
Slither supports the integration of new vulnerability detectors to improve its detection capabilities. Slither works by converting
the Solidity smart contract code provided as input into an intermediate representation that qualifies as SlithIR to expose smart
contract security issues and fine tune user code. Over the SlithIR representation, the processes of vulnerability detection, user code
enhancement, and source code description take place. SlithIR is produced by combining control flow graph (CFG) and static single
assignment (SSA) formatting. Despite being a static analysis framework for smart contracts, Slither underpins the taint tracking
analysis technique.
The Slither framework is depicted in the high level Algorithm 1.
B.2. Eth2Vec framework
Eth2Vec is a machine-learning-based tool for detecting security flaws in smart contracts. It runs static analysis on the contract’s
Ethereum virtual machine bytecode, assembly code, and abstract syntax tree. Eth2vec embodies two major components: the PV-DM
model [27] and the Ethereum virtual machine extractor.
17
Algorithm 1 Slither Framework Algorithm.

Require: Inputs: Solidity source code (SC)
1. JSON_Output (Json), Abstract syntax tree (AST) ← solidity_compiler(SC);
2. contract Inheritance (CIN), Control flow graph (CFG), Solidity expressions (SE) ← information_recovery(Json, AST);
3. SlithIR, SSA code ← code_conversion(CIN, CFG, SE);
4. code dependency (CD), Read/Write variables (RW), Protected functions (PF) ← code_analysis(CIN, CFG, SE);
(The user has complete control over what happens next)
5. if (𝑣𝑢𝑙𝑛𝑒𝑟𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑑𝑒𝑡𝑒𝑐𝑡𝑖𝑜𝑛) then
6. test_vulnerabilities(CD, RW, PF, 𝑣𝑢𝑙𝑛𝑒𝑟𝑎𝑏𝑖𝑙𝑖𝑡𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑜𝑟𝑠[ ])
7. end if
8. if (𝑐𝑜𝑑𝑒 𝑜𝑝𝑡𝑖𝑚𝑖𝑧𝑎𝑡𝑖𝑜𝑛) then
9. optimize_code(CD, RW, PF)
10. end if
11. if (𝑐𝑜𝑑𝑒 𝑒𝑥𝑝𝑙𝑎𝑛𝑎𝑡𝑖𝑜𝑛) then
12. print_code(CD, RW, PF)
13. end if
The Ethereum virtual machine extractor harvests syntactic information from the Ethereum virtual machine bytecode. It is built
with the Kam1n0 server, a Java-based assembly analysis platform. Despite the claim of unsupervised learning, the PV-DM model
[27] relies on semi-supervised learning over the extracted syntactic information from the bytecode and the contract source code
compilation. The PV-DM model [27] exploits the knowledge acquired to expose vulnerabilities in target smart contracts.
Eth2vec compiles the contract source code to produce the abstract syntax tree, assembly code, and Ethereum virtual machine
bytecode. Given the large amount of metadata available, extracting the desired features from the Ethereum virtual machine bytecode
is relatively simple in such circumstances.
Eth2vec has been tested against over 5000 contract files from Etherscan in order to evaluate its performance in clone detection
and vulnerability detection. The findings revealed that Eth2vec outperforms support vector machine models and is resistant to code
rewrites.
The Algorithm 2 below provides sufficient details about the Eth2vec framework.
Algorithm 2 Eth2vec Framework Algorithm.

Require: Inputs: Contract Solidity source code (SC), training_dataset
1. contract assembly (CA), Abstract syntax tree (AST), EVM bytecode ← solidity_compiler(SC);
2. contract features (CF) ← evm_extractor(CA, AST, EVM bytecode);
3. vulnerability list (VL) ← pv-dm_processing(CF, training_dataset);
Algorithm 3 AMEVulDetector Framework Algorithm.

Require: Inputs: Contract Solidity source code (SC), training_dataset, testing_dataset, vulnerability expert patterns (VP)
1. contract functions (CF) ← extract_contract_functions(SC);
2. function expert patterns (EP) ← local_pattern_extraction(CF, VP);
3. core nodes (CN), normal nodes (NN), fallback node (FN) ← function_node_extraction(EP);
4. control-flow edges (CE), data-flow edges (DE), fallback edges (FE) ← generate_edges(CN, NN, FN, EP);
5. code semantic graph (CG) ← generate_graph(CN, NN, FN, CE, DE, FE);
6. normalized graph (NG) ← graph_normalization(CG);
7. feature vectors (FV) ← extract_feature_vectors(EP, multilayer perceptrons (MLPs));
8. deep feature vectors (DV) ← extract_graph_feature_vectors(NG, temporal-message-propagation graph neural network (TMP));
9. predicted label (PL) ← feature_fusion(FV, DV, attentive multi-encoder network (AMN)));
B.3. AMEVulDetector framework
Liu et al. [38] devised AMEVulDetector, an explainable smart contract vulnerability detection tool that combines global graph
feature and local expert patterns over deep learning.
AMEVulDetector comprises three key components: a local expert pattern extraction tool that extracts vulnerability-specific expert
patterns from function code, a graph construction and normalization module that converts the source code into a global semantic
graph, and an attentive multi-encoder network that combines expert patterns and graph features to detect vulnerabilities and output
explainable weights.
AMEVulDetector revolves around three types of vulnerabilities: the reentrancy vulnerability, the block timestamp dependence
vulnerability, and the infinite loop vulnerability.
AMEVulDetector was tested against the Ethereum smart contract dataset (ESC), which contains 307 396 functions from 40 932
Ethereum smart contracts, as well as the VNT Chain smart contract dataset (VSC), which contains 13 761 functions from 4 170 VNT
Chain smart contracts. The findings indicated that AMEVulDetector outperforms state-of-the-art methods.
The AMEVulDetector framework is summed up by the Algorithm 3.
18
References
[1] S. Liu, Global spending on blockchain solutions 2024 | statista, https://www.statista.com/statistics/800426/worldwide-blockchain-solutions-spending, 2020.
[2] N. Szabo, Smart Contracts: Building Blocks for Digital Markets, 2018.
[3] Solidity programming language, https://github.com/ethereum/solidity, 2018.
[4] V. Buterin, Ethereum whitepaper, https://github.com/ethereum/wiki/wiki/White-Paper, 2013.
[5] M. Bartoletti, L. Pompianu, An empirical analysis of smart contracts: platforms, applications, and design patterns, in: International Conference on Financial
Cryptography and Data Security, Springer, 2017, pp. 494–509, https://link.springer.com/chapter/10.1007/978-3-319-70278-0_31.
[6] K. Delmolino, M. Arnett, A. Kosba, A. Miller, E. Shi, Step by step towards creating a safe smart contract: lessons and insights from a cryptocurrency lab, in:
International Conference on Financial Cryptography and Data Security, Springer, 2016, pp. 79–94, https://eprint.iacr.org/2015/460.pdf.
[7] L. Luu, D.-H. Chu, H. Olickel, P. Saxena, A. Hobor, Making smart contracts smarter, in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and
Communications Security, 2016, pp. 254–269, https://dl.acm.org/doi/pdf/10.1145/2976749.2978309.
[8] T. Zimmermann, N. Nagappan, L. Williams, Searching for a needle in a haystack: predicting security vulnerabilities for windows vista, in: 2010 Third International
Conference on Software Testing, Verification and Validation, IEEE, 2010, pp. 421–428, https://ieeexplore.ieee.org/abstract/document/5477059/.
[9] L. Zhang, J. Wang, W. Wang, Z. Jin, Y. Su, H. Chen, Smart contract vulnerability detection combined with multi-objective detection, Comput. Netw. 217 (2022)
109289, https://doi.org/10.1016/j.comnet.2022.109289, https://www.sciencedirect.com/science/article/pii/S1389128622003437.
[10] N. Atzei, M. Bartoletti, T. Cimoli, A survey of attacks on Ethereum smart contracts (sok), in: International Conference on Principles of Security and Trust,
Springer, 2017, pp. 164–186, https://link.springer.com/chapter/10.1007/978-3-662-54455-6_8.
[11] M. Argañaraz, M. Berón, M.J. Pereira, P. Henriques, Detection of vulnerabilities in smart contracts specifications in Ethereum platforms, in: 9th Symposium on
Languages, Applications and Technologies (SLATE 2020), vol. 83, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2020, pp. 1–16, https://bibliotecadigital.
ipb.pt/bitstream/10198/22794/1/OASIcs-SLATE-2020-2.pdf.
[12] J. Feist, G. Grieco, A. Groce, Slither: a static analysis framework for smart contracts, in: 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in
Software Engineering for Blockchain (WETSEB), IEEE, 2019, pp. 8–15, https://arxiv.org/pdf/1908.09878.pdf.
[13] S. Kalra, S. Goel, M. Dhawan, S. Sharma, Zeus: analyzing safety of smart contracts, in: Ndss, 2018, pp. 1–12, http://pages.cpsc.ucalgary.ca/~joel.reardon/
blockchain/readings/ndss2018_09-1_Kalra_paper.pdf.
[14] Z. Gao, L. Jiang, X. Xia, D. Lo, J. Grundy, Checking smart contracts with structural code embedding, IEEE Trans. Softw. Eng. 47 (12) (2021) 2874–2891, https://
doi.org/10.1109/TSE.2020.2971482.
[15] Mythril: security analysis tool for evm bytecode, https://github.com/ConsenSys/mythril, 2018.
[16] J. Krupp, C. Rossow, teEther: gnawing at Ethereum to automatically exploit smart contracts, in: 27th USENIX Security Symposium (USENIX Security 18), USENIX
Association, 2018, pp. 1317–1333, https://www.usenix.org/conference/usenixsecurity18/presentation/krupp.
[17] I. Nikolić, A. Kolluri, I. Sergey, P. Saxena, A. Hobor, Finding the greedy, prodigal, and suicidal contracts at scale, in: Proceedings of the 34th Annual Computer
Security Applications Conference, 2018, pp. 653–663, https://dl.acm.org/doi/pdf/10.1145/3274694.3274743.
[18] P. Qian, Z. Liu, Q. He, R. Zimmermann, X. Wang, Towards automated reentrancy detection for smart contracts based on sequential models, IEEE Access 8 (2020)
19685–19695, https://ieeexplore.ieee.org/abstract/document/8970384/.
[19] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, B. Murphy, Cross-project defect prediction: a large scale experiment on data vs. domain vs. process, in:
Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software
Engineering, 2009, pp. 91–100, https://dl.acm.org/doi/pdf/10.1145/1595696.1595713.
[20] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng, Multimodal deep learning, in: ICML, 2011, https://icml.cc/2011/papers/399_icmlpaper.pdf.
[21] W.-Y. Chiu, W. Meng, C.D. Jensen, My data, my control: a secure data sharing and access scheme over blockchain, J. Inf. Secur. Appl. 63 (2021) 103020, https://
doi.org/10.1016/j.jisa.2021.103020, https://www.sciencedirect.com/science/article/pii/S2214212621001885.
[22] A.S. Voundi Koe, S. Ai, P. Huang, A. Yan, J. Tang, Q. Chen, K. Mo, W. Jie, S. Zhang, Sender anonymity: applying ring signature in gateway-based blockchain for iot
is not enough, Inf. Sci. 606 (2022) 60–71, https://doi.org/10.1016/j.ins.2022.05.054, https://www.sciencedirect.com/science/article/pii/S0020025522004868.
[23] Z. Sun, W.-Y. Chiu, W. Meng, Mosaic - a blockchain consensus algorithm based on random number generation, in: 2022 IEEE International Conference on
Blockchain (Blockchain), 2022, pp. 105–114.
[24] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, K. Barnard, Attentional feature fusion, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, 2021, pp. 3560–3569, https://openaccess.thecvf.com/content/WACV2021/papers/Dai_Attentional_Feature_Fusion_WACV_2021_paper.pdf.
[25] L. Dai, F. Gao, R. Li, J. Yu, X. Shen, H. Xiong, W. Wu, Gated fusion of discriminant features for caricature recognition, in: International Conference on Intelligent
Science and Big Data Engineering, Springer, 2019, pp. 563–573, https://link.springer.com/chapter/10.1007/978-3-030-36189-1_47.
[26] H. Zhou, Z. Fang, Y. Gao, B. Huang, C. Zhong, R. Shang, Feature fusion network based on attention mechanism for 3d semantic segmentation of point clouds,
Pattern Recognit. Lett. 133 (2020) 327–333, https://www.sciencedirect.com/science/article/pii/S0167865520300994.
[27] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, PMLR, 2014, pp. 1188–1196,
http://proceedings.mlr.press/v32/le14.pdf.
[28] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers),
Association for Computational Linguistics, 2019, pp. 4171–4186, https://aclanthology.org/N19-1423.
[29] N. Ashizawa, N. Yanai, J.P. Cruz, S. Okamura, Eth2vec: learning contract-wide code representations for vulnerability detection on Ethereum smart contracts, in:
Proceedings of the 3rd ACM International Symposium on Blockchain and Secure Critical Infrastructure, 2021, pp. 47–59, https://dl.acm.org/doi/pdf/10.1145/
3457337.3457841.
[30] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015)
1904–1916, https://link.springer.com/content/pdf/10.1007/978-3-319-10578-9_23.pdf.
[31] X. Ouyang, K. Gu, P. Zhou, Spatial pyramid pooling mechanism in 3d convolutional network for sentence-level classification, IEEE/ACM Trans. Audio Speech
Lang. Process. 26 (2018) 2167–2179, https://ieeexplore.ieee.org/abstract/document/8413124/.
[32] N. Dong, Q. Feng, M. Zhai, J. Chang, X. Mai, A novel feature fusion based deep learning framework for white blood cell classification, J. Ambient Intell. Humaniz.
Comput. (2022) 1–13, https://doi.org/10.1007/s12652-021-03642-7.
[33] Z. Zhang, Z. Tang, Y. Wang, Z. Zhang, C. Zhan, Z. Zha, M. Wang, Dense residual network: enhancing global dense feature flow for character recognition, Neural
Netw. 139 (2021) 77–85, https://www.sciencedirect.com/science/article/pii/S0893608021000472.
[34] C. Olah, S. Carter, Attention and augmented recurrent neural networks, Distill 1 (2016) e1, https://distill.pub/2016/augmented-rnns/?spm=a2c4e.11153940.
blogcont640631.83.666325f4P1sc03.
[35] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint, arXiv:1412.3555, 2014,
https://arxiv.org/abs/1412.3555.
[36] T. Parr, The definitive ANTLR 4 reference, in: The Definitive ANTLR 4 Reference, 2013, pp. 1–326, https://www.torrossa.com/en/resources/an/5241753.
[37] F. Bond, Solidity grammar for ANTLR4, https://github.com/solidityj/solidity-antlr4, 2019.
[38] Z. Liu, P. Qian, X. Wang, L. Zhu, Q. He, S. Ji, Smart contract vulnerability detection: from pure neural network to interpretable graph feature and expert pattern
fusion, in: IJCAI, 2021, pp. 2751–2759, https://www.ijcai.org/proceedings/2021/0379.pdf.
19
[39] M. Zhang, Z. Cui, M. Neumann, Y. Chen, An end-to-end deep learning architecture for graph classification, in: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 32, 2018, https://ojs.aaai.org/index.php/AAAI/article/view/11782.
[40] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: EACL 2017, 2017, p. 427, https://aclanthology.org/E17-2.pdf#
page=459.
[41] J.A. Harer, L.Y. Kim, R.L. Russell, O. Ozdemir, L.R. Kosta, A. Rangamani, L.H. Hamilton, G.I. Centeno, J.R. Key, P.M. Ellingwood, et al., Automated software
vulnerability detection with machine learning, arXiv preprint, arXiv:1803.04497, 2018, https://arxiv.org/pdf/1803.04497.pdf.
[42] F. Hill, K. Cho, A. Korhonen, Learning distributed representations of sentences from unlabelled data, in: Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1367–1377, https://aclanthology.org/N16-1162.
pdf.
[43] T. Hu, X. Liu, T. Chen, X. Zhang, X. Huang, W. Niu, J. Lu, K. Zhou, Y. Liu, Transaction-based classification and detection approach for Ethereum smart contract,
Inf. Process. Manag. 58 (2021) 102462, https://www.sciencedirect.com/science/article/pii/S0306457320309547.
20

1 s2.0 S0020025523004565 Main

Uploaded by

Copyright:

Available Formats

1 s2.0 S0020025523004565 Main

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S0020025523004565 Main

Uploaded by

Copyright:

Available Formats

Information Sciences 636 (2023) 118907

Contents lists available at ScienceDirect

A novel extended multimodal AI framework towards vulnerability

1.1. Vulnerability detection landscape

1.2. Technical challenges

1.4. Our approach

1.5. Our contributions

2.1. Smart contracts on blockchain

2.2. AI-based vulnerability detection methods for smart contracts

2.3. Feature fusion

3.1. Hierarchical feature extraction and fusion

3.1.1. Feature extraction under intramodal settings

3.1.2. Feature extraction under intermodal settings

3.2. Vulnerability detection strategies

• strategy 1: SPP | concat | (bi-LSTM+self-attention) | textCNN.

4.1. Experimental settings

Strategy Embeddings Wclass 0 Wclass 1 Accuracy F1 Precision Recall AUC-ROC

None SC-Bert 1 1 0.9611 0.8386 0.9174 0.7722 0.8831

INS SC-Bert 0.1659425 1.7340575 0.9584 0.8249 0.9172 0.7495 0.8679

ENS SC-Bert 0.85015092 1.14984908 0.9612 0.8374 0.9267 0.7639 0.8774

ISNS SC-Bert 0.56282351 1.43717649 0.9588 0.8274 0.9164 0.7541 0.8718

SMOTE SC-W2V 1.285251 1 0.9430 0.9452 0.9168 0.9754 0.9427

4.2. Dataset construction

4.3. Performance analysis

4.3.1. Intramodal settings (RQ1 to RQ2)

Methods BB-CFG1 BB-CFG2

MP|stack|bi-LSTM|RF 97.80 96.72 98.96 97.80 98.27 97.56 99.01 98.27

Dense|stack|bi-LSTM|RF 97.59 96.48 98.77 97.59 98.13 97.35 98.96 98.13

SPP|stack|bi-LSTM|RF 97.05 96.20 97.97 97.05 98.16 97.60 98.75 98.16

Methods EVMB-DGCNN [39] EVMB-CFG1 EVMB-CFG2

Fig. 4. ROC curves for intramodal settings.

4.3.2. Intermodal settings (RQ3 to RQ6)

Methods SC+BB-CFG1 SC+EVMB-CFG1 BB+EVMB-CFG1

Methods SC+BB-CFG2 SC+EVMB-CFG2 BB+EVMB-CFG2

Methods SC+BB+EVMB-CFG1 SC+BB+EVMB-CFG2

MP|stack|bi-LSTM|RF 99.37 99.15 99.60 99.37 99.49 99.08 99.90 99.49

Dense|stack|bi-LSTM|RF 99.59 99.26 99.92 99.59 99.70 99.39 99.98 99.68

SPP|stack|bi-LSTM|RF 99.05 98.64 99.49 99.05 99.50 99.36 99.64 99.50

Fig. 5. Two and Three layer intermodal fusion ROC curve.

4.3.3. Outperforming strategies in our framework

Settings Methods Acc (%) Recall (%) Precision (%) F1 (%)

SC MP|stack|bi-LSTM|RF 98.04 97.89 98.20 98.04

5.1. Signiﬁcance of our methodology

5.2. Flexibility in feature extraction

5.3. Limitations of our framework

6.1. Rule-based approach for vulnerability detection under static analysis

6.2. Data-driven approach for static analysis-based vulnerability detection

CRediT authorship contribution statement

Declaration of competing interest

The data that has been used is conﬁdential.

Appendix A. Vulnerability detection strategies

A.1. Strategies under intramodal settings

• task 1: (SC-W2V+SC-Bert) | SPP | concat | (bi-LSTM+self-attention) | textCNN.

• task 13: (SSA-W2V+SSA-Bert+BB-CFG) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 25: (EVMB-CFG+EVMB-ASM) | MP | concat | (bi-LSTM+self-attention) | textCNN.

• task 31: (EVMB-CFG+EVMB-ASM) | Dense | stack | (bi-LSTM+self-attention) | textCNN.

A.2. Strategies under intermodal settings

• task 37: (SC-W2V+SC-Bert+SSA-W2V+SSA-Bert+BB-CFG) | MP | concat | (bi-LSTM+self-attention) | textCNN.