
1 Introduction

Constrained Horn clauses (CHC) provide a logic-based format for automated verification. Its advantage over other solutions is that it separates modeling from solving, which makes it suitable for various application domains and reusable across different verification tasks. CHC-based solutions focus, on the one hand, on the development of a front-end that translates the source code into the language of logic constraints and, on the other hand, on the implementation of a back-end that solves the logical queries constructed from the encoding. Various CHC-based tools have been applied to solve verification problems in different domains (e.g., SeaHorn [10], Korn [8] and TriCera [9] for C, JayHorn for Java [17], RustHorn for Rust [13], SolCMC [2] and SmartACE [18] for Solidity).

Recently, a CHC-based approach called Horntinuum was proposed for exhaustive test case generation for an imperative language without recursion [19]. Given a CHC encoding of a program, this approach systematically explores various control-flow paths represented by CHC unrollings and relies on an off-the-shelf SMT solver to produce exact input values to the program. The approach is exhaustive in the sense that it terminates only when each branch of the original program either has a test case or has been proven to be unreachable (using an automatically generated invariant).

In this paper, we demonstrate the evolution of the ideas described in [19] for programs in a contract-oriented language, namely Solidity. Our approach relies on a CHC representation of a smart contract generated by SolCMC [2], the model checker of a recent Solidity compiler. The logic encoding of a contract is different from the encoding of an imperative program, due to the presence of functions that can be invoked in any order. This complicates the trace enumeration process which is necessary for the exhaustive test case generation. In our approach, we explore contract behaviors gradually, from shorter ones to longer ones, and we keep track of functions and branches that have already been covered.

We present SolTG, a new fully automated tool for Solidity test generation. SolTG receives one or more Solidity source files, extracts the compiler metadata and the CHC representation from SolCMC, computes multiple CHC unrollings, and communicates with the Z3 solver [14] to obtain values of function parameters from which it computes a test case. Each test case is compiled into a human-readable test file following the format of Foundry, a well-known framework to build, test, fuzz, debug and deploy Solidity smart contracts. Thus it becomes immediately useful for contract engineers, who receive test coverage reports based on the contract execution; such reports cannot be obtained by other tools, e.g., [18].

SolTG generates tests for real-world contracts fully automatically. We evaluated the tool on benchmarks both from the SolCMC repository [2] and from industrial contracts. All the tests were executed by Foundry on smart contracts as if they were running on the actual blockchain. Our experimentation demonstrates that SolTG provides a high level of test coverage. Specifically, it reached 71% branch coverage, 81.2% line coverage, and 90.9% function coverage on average. Furthermore, for 35% of benchmarks, SolTG achieved 100% line coverage within 5 s of running time. Overall, the evaluation demonstrates the practicality of SolTG since it provides contract-specific feedback in a small amount of time.

2 Tool Overview

SolTG supports most Solidity features: it can process smart contracts with constructors, multiple fields and functions, polymorphism, inheritance from other contracts or interfaces, etc. It can also handle Solidity-specific constructs such as inherent transaction data, the current state of the blockchain, contract state variables, and the full set of standard Solidity datatypes.

Fig. 1. Architecture of SolTG.

Fig. 2. Role of SolTG in the testing workflow.

Figure 1 gives an overview of the architecture of SolTG. The input is a source file with a Solidity smart contract, and the output is a set of test files for Foundry, a framework to build, test, fuzz, debug, and deploy Solidity smart contracts. The tool relies on external modules: the Solidity compiler to obtain the compilation metadata, the Solidity compiler’s model checker to get the CHC encoding, and the SMT solver Z3 to extract concrete values for test cases.

Figure 2 gives an overview of a higher-level testing process monitored by engineers, where SolTG plays the central role. The generated test files are human-readable; they are kept and reused for testing until the contract under test is updated. The test reports in the HTML format are generated by Foundry, which compiles the contracts being tested and executes the test cases. The internal components of SolTG are defined below.

Preprocessor is responsible for parsing and analyzing the input data. It receives compilation metadata and the CHC encoding of the Solidity contract. The preprocessor then parses the compiler’s metadata, determining the complete list of the contract’s public and external (testable) functions, constructors and their required inputs. Preprocessor encodes the full set of constructors, functions, and input variables. It then provides them as input together with the CHC encoding to Symbolic Behavior Enumerator to discover possible execution scenarios.

Symbolic Behavior Enumerator (see details in Sect. 3) synthesizes sequences of function calls over tuples of concrete values for function parameters. Overall SolTG exhaustively enumerates symbolic representations of different paths through the contract’s functions and relies on an SMT solver (in our case, Z3) to extract values of those parameters, until either all branches are covered with tests or a timeout is reached.

Test Synthesizer receives a set of concrete function arguments for the tests from the SMT models and corresponding sequences of function calls which are provided by Symbolic Behavior Enumerator. It processes the inputs and generates a set of tests stored as a test file for Foundry.

Foundry compiles and executes the synthesized test files, performs coverage analysis, and generates a report. Coverage analysis uses classical test coverage metrics such as line/branch/function coverage which are commonly used by contract engineers. The distinguishing feature of our framework is that the reported results reflect the exact blockchain response to the tested function calls.

Limitations. Currently, SolTG does not support inline assembly code and uses approximate reasoning about dynamically sized byte arrays and strings. SolTG also does not model gas consumption during function execution, and thus it may generate test cases for some blocks of source code that are unreachable in practice.

3 Test Case Generation from CHC Encoding

Although SolTG targets smart contracts in Solidity, our test case generation approach can be lifted to other contract-oriented languages given an encoder for them to constrained Horn clauses.

3.1 CHC Preliminaries

A constrained Horn clause (CHC) over a set of uninterpreted predicate symbols \(R = \{r_1, \ldots , r_n\}\), is a universally quantified first-order formula that matches the following regular expression:

$$\begin{aligned} \forall \vec {x},\vec {y} . \big ((r_1 \mid \ldots \mid r_n)(\vec {x})\wedge \big )^* \varphi (\vec {x}, \vec {y}) \implies \big ((r_1 \mid \ldots \mid r_n)(\vec {y})\mid false \big ) \end{aligned}$$

where \(\vec {x},\vec {y}\) are vectors of variables, \(*\) is the Kleene star, the left side of the implication (called body) may have multiple occurrences of any symbol from R, but \(\varphi \) does not have occurrences of any symbol from R. For readability reasons, we omit writing \(\forall \vec {x},\vec {y}\). In the following formulas, we introduce indexes (e.g., in \(\vec {x_1}\) or \(\vec {x_2}\)) whenever we need to introduce fresh variables.

CHCs are used as an intermediate representation for procedural and object-oriented programs and allow the verification tools to symbolically enumerate program behaviors. To formally introduce the process, we define the notion of CHC unrolling as follows. Given a CHC system S over uninterpreted predicates R and a rule \(r_1(\vec {x_1}) \wedge \varphi \implies r_0(\vec {x_0})\) where \(r_0,r_1 \in R\), an unfolding of \(r_1\) is another CHC rule \(\psi (\vec {x_1}, \vec {x_2}) \wedge \varphi \implies r_0(\vec {x_0})\) such that:

  • for some \(\vec {x_3}\) and \(\vec {x_4}\), \(\psi (\vec {x_3}, \vec {x_4}) \implies r_1(\vec {x_3})\) is a CHC in S, and

  • \(\vec {x_2}\) is a vector of fresh variables that does not overlap with \(\vec {x_1}\) and \(\vec {x_0}\).

For a given CHC, an unrolling is an output of any number of consecutive unfoldings of that rule that does not have uninterpreted predicates in the body. There could be multiple possible unrollings for a single CHC instance, and we illustrate it in the next subsection.
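The definitions of unfolding and unrolling can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: variables and their renaming are abstracted away, interpreted constraints are kept as opaque labels, and the names `Rule`, `unfold`, `unrollings`, and the `budget` bound are hypothetical and not taken from SolTG.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Rule:
    """A CHC with variables abstracted away (an assumption of this sketch)."""
    body: tuple[str, ...]        # uninterpreted predicate symbols in the body
    constraint: tuple[str, ...]  # interpreted conjuncts, kept as opaque labels
    head: str


def unfold(rule: Rule, system: list[Rule]) -> Iterator[Rule]:
    """Replace the first body predicate by the body of each rule defining it."""
    target, rest = rule.body[0], rule.body[1:]
    for definition in system:
        if definition.head == target:
            yield Rule(definition.body + rest,
                       definition.constraint + rule.constraint,
                       rule.head)


def unrollings(rule: Rule, system: list[Rule], budget: int) -> Iterator[Rule]:
    """Yield predicate-free rules reachable with at most `budget` unfoldings."""
    if not rule.body:
        yield rule
    elif budget > 0:
        for unfolded in unfold(rule, system):
            yield from unrollings(unfolded, system, budget - 1)


# A toy system with two alternative definitions of predicate p.
SYSTEM = [
    Rule((), ("phi_1",), "q1"),
    Rule((), ("phi_2",), "q2"),
    Rule(("q1",), ("psi_1",), "p"),
    Rule(("q2",), ("psi_2",), "p"),
]
QUERY = Rule(("p",), (), "r0")

RESULT = list(unrollings(QUERY, SYSTEM, budget=4))
for r in RESULT:
    print(" & ".join(r.constraint), "=>", r.head)
```

The toy system yields two unrollings, one per definition of `p`, which is the sense in which a single CHC can have multiple unrollings.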

3.2 Solidity Smart Contracts to CHCs

We begin with a brief intuitive overview of the CHC encoding employed by SolTG, which is inspired by SolCMC. For demonstration reasons, we simplify the encoding principles, but we highlight the main ingredients. Each test-case generation target may operate over the application binary interface and cryptographic functions, transaction data, the blockchain state, balances, and storage for every contract. We denote their logic representation as \(\vec {s}\) and assume there is a fully interpreted formula \( init (\vec {s})\) that describes how \(\vec {s}\) is instantiated before an instance of the contract under test has been created. We assume the contract under test has a vector of fields \(\texttt {v}_{1}\), \(\ldots , \texttt {v}_n\) that is symbolically encoded into some \(\vec {v}\) that does not overlap with \(\vec {s}\). Throughout the lifecycle of an instance of the contract, it could undergo a (possibly unbounded) number of changes made by calling the contract’s functions. These are modeled with the help of an auxiliary uninterpreted predicate \( fc \). Intuitively, it corresponds to a contract transition system (we will use this notation later) that has two logic rules, for the initiation and consecution respectively:

$$\begin{aligned} init (\vec {s}) \implies fc (\vec {s}, \vec {v}) \end{aligned}$$
(1)
$$\begin{aligned} fc (\vec {s},\vec {v}) \wedge summ (\vec {s},\vec {s'},\vec {v},\vec {v'}) \implies fc (\vec {s'},\vec {v'}) \end{aligned}$$
(2)

For the second rule, we assume a set of defined functions F. An uninterpreted predicate \( summ \) stands for a summary of an arbitrary function over two vectors of logic variables \(\vec {v}\) and \(\vec {v'}\) that represent symbolic values of contract’s fields before and after each function call, respectively.

Each \(f\in F\) has a CFG \((V_f,E_f)\), where \(V_f\) is a set of basic blocks and \(E_f\) is a set of control-flow edges, each one connecting two basic blocks. The dedicated basic block \(\texttt {en}_f\) is the only one with no incoming edge in \(E_f\), and the dedicated basic block \(\texttt {ret}_f\) is the only one with no outgoing edge in \(E_f\). We also assume there exists an encoding function \(\tau : E_f \rightarrow Prop \) that translates each \((\texttt {b}_1,\texttt {b}_2)\in E_f\) into an SSA form and further to a logic formula over \(\vec {s}, \vec {s'},\vec {v}, \vec {v'},\vec {in}, \vec {loc}\), where \(\vec {in}\) is a vector of input variables to f and \(\vec {loc}\) is a vector of fresh local variables (and auxiliary SSA variables created during the encoding). This formula encodes symbolically the basic block \(\texttt {b}_1\) and the condition under which the control transitions to basic block \(\texttt {b}_2\) . The CHC encoding has then a set of rules for f:

For each control-flow edge \((\texttt {b}, \texttt {ret}_f)\), the CHC encoding uses an uninterpreted predicate b:

$$\begin{aligned} \tau (b, ret_f) \implies b(\vec {s}, \vec {s'}, \vec {v}, \vec {v'}, \vec {in}, \vec {loc}) \end{aligned}$$
(3)

Then for each control-flow edge \((\texttt {b}_1, \texttt {b}_2)\) where \(\texttt {b}_2 \ne \texttt {ret}_f\), the CHC encoding uses uninterpreted predicates \(b_1\) and \(b_2\):

$$\begin{aligned} b_2 (\vec {s}, \vec {s'}, \vec {v}, \vec {v'}, \vec {in}, \vec {loc}) \wedge \tau (b_1, b_2) &\implies b_1 (\vec {s}, \vec {s'}, \vec {v}, \vec {v'}, \vec {in}, \vec {loc}) \end{aligned}$$
(4)

Lastly, to connect the entry to function f with its summary, it uses uninterpreted predicates \( en _f\) and \( summ \) (the latter is in turn used to connect f with the contract transition system via rule (2)):

$$\begin{aligned} en _f(\vec {s}, \vec {s'}, \vec {v}, \vec {v'}, \vec {in}, \vec {loc})\ \implies summ (\vec {s}, \vec {s'}, \vec {v},\vec {v'}) \end{aligned}$$
(5)

We illustrate this encoding using the following example contract with one field a, a constructor, and two other functions.

figure c

The CHC encoding is as follows. Field a (resp. inputs in, x, and y) is represented in CHCs by logic variable a (resp. in, x, and y). For simplicity, we let \(\vec {s}\) be empty and \( init \) be \( true \). The entry points to the constructor and two functions are represented using predicates \( en _{ con }\), \( en _{ reset }\), and \( en _{ f }\), respectively. The CFG of the constructor (similarly, function reset) is trivial, so it needs one CHC of form (3) and one of form (5). Since function f contains a conditional statement, its CFG has basic blocks \(\texttt {b}_1\) and \(\texttt {b}_2\), and its CHC encoding makes use of predicates \(b_1\) and \(b_2\) in the CHCs of form (3) and (4). For readability, formula \(\tau (\texttt {en}_f, \texttt {b}_1)\) (resp. \(\tau (\texttt {en}_f, \texttt {b}_2)\)) is colored purple (resp. blue) and placed in the box. Lastly, note that \(\tau (\texttt {b}_1, \texttt {ret}_f)\) has an occurrence of predicate \( en _{reset }\), which corresponds to calling function reset.

figure d

The boxed formulas are the building blocks to be used in the test-code generation as described in the next subsection. Purple (resp. blue) color emphasizes that the formula is used to symbolically encode the then-branch (resp. the else-branch) of f’s behavior, and black color in the box marks the code of the constructor.

3.3 Algorithmic Enumeration of Contract Behaviors

SolTG generates tests by symbolic enumeration of the possible behaviors of the contract under test. Each test case begins with creating an instance of the contract c (by calling its constructor). Suppose then we wish to observe the behavior of a certain function f, which will be expressed as c.f(...) in the test. If f uses an input, the test case should specify one of the possible concrete values. A test could have a sequence of multiple functions under certain inputs, each of which contributes to updating the fields. For the example above, when we wish to test function f’s behavior, we could (manually) create a test c = new A(0); c.f(0,0). This would, however, not be enough to test the then-branch of the conditional in f’s body. For this reason, another test, e.g., c = new A(0); c.f(1,0), would be needed. In general, finding these sequences of function calls and concrete inputs is challenging. Our tool generates sequences of function calls by enumeration over permutations of available functions, and it generates concrete inputs from the models of satisfiable logic formulas that correspond to distinct unrollings of CHCs. Specifically, our enumeration has three nested loops (denoted respectively A, B, and C).

A: Enumerating the lengths of tests. At the higher level, the tool sequentially considers unfoldings of the contract transition system (1)-(2), which allows it to eventually consider test cases of various lengths. That is, an unrolling of length one would yield only test cases with constructors, an unrolling of length two would yield test cases with a constructor and one function call, etc. More formally, at this level, the tool performs an unfolding of \( fc \) in CHC \( fc (v) \wedge summ (v,v') \implies fc (v')\) but keeps \( summ \) uninterpreted. For a length n, the tool unfolds \( fc \) exactly \(n+1\) times, for which it uses the consecution rule n times and then the initiation rule once. As a result, we get a new CHC (called \(C_n\)) with n distinct occurrences of \( summ \) in the body.
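As a worked instance, and only under our reading of rules (1) and (2), a CHC \(C_2\) with two distinct occurrences of \( summ \) could take the following shape. The indexed copies \(\vec {s_0}, \vec {s_1}, \vec {s_2}\) and \(\vec {v_0}, \vec {v_1}, \vec {v_2}\) are fresh variables introduced here purely for illustration:

$$ init (\vec {s_0}) \wedge summ (\vec {s_0},\vec {s_1},\vec {v_0},\vec {v_1}) \wedge summ (\vec {s_1},\vec {s_2},\vec {v_1},\vec {v_2}) \implies fc (\vec {s_2},\vec {v_2}) $$

Each occurrence of \( summ \) shares its post-state variables with the pre-state variables of the next one, so that consecutive function calls in a test operate on the same evolving contract state.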

B: Enumerating the functions. At the middle level, i.e., for an unrolling of a fixed length n, the tool considers a sequence of n functions (possibly, with repetitions), where the first function is necessarily a constructor. Specifically, given the output from the outer loop, i.e., a CHC \(C_n\) with n distinct occurrences of \( summ \) in the body, the tool considers a set of n-tuples \(F^{n}\) and computes \(|F|^{n}\) tuples of uninterpreted predicates \(T = \{ en_f \mid f \in F\}^n\). It finally computes \(|F|^{n}\) distinct CHCs, each one by the pairwise replacement of the n-tuple of \( summ \) predicates in \(C_n\) by an n-tuple from T (introducing fresh variables for in and loc whenever needed).
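The middle loop can be pictured with a small Python sketch. It is not SolTG's implementation: the helper name `entry_predicate_tuples` is hypothetical, and treating the constructor separately from the remaining functions (so that it always occupies the first position) is an assumption made here for brevity.

```python
from itertools import product


def entry_predicate_tuples(constructor: str, functions: list, n: int) -> list:
    """All length-n tuples of entry predicates with the constructor first."""
    if n < 1:
        return []
    tails = product(functions, repeat=n - 1)  # repetitions are allowed
    return [tuple("en_" + name for name in (constructor, *tail))
            for tail in tails]


# Predicate names follow the running example: en_con, en_reset, en_f.
SEQUENCES = entry_predicate_tuples("con", ["reset", "f"], 3)
for seq in SEQUENCES:
    print(seq)
```

Each printed tuple would replace, position by position, the occurrences of \( summ \) in \(C_n\).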

C: Enumerating behaviors for each function. At the inner level, for each of the n functions in the sequence, the tool considers all paths through its CFG. That is, it receives one of the CHCs constructed at the middle level, and it computes all possible unrollings by recursively eliminating all uninterpreted symbols. Algorithmically, an unrolling of each function is similar to the approach of [19], and the tool repeats it n times, once for each high-level function call (i.e., each call that it synthesizes in the test file). This loop gives a set of logic formulas, each of which is sent to an SMT solver. Lastly, to enumerate inputs for each function call, we rely on an SMT solver. Given a formula, the solver searches for a satisfying assignment from which the desired input values are extracted.

Pruning test case enumeration. Because of the exhaustiveness of enumeration, the complexity of steps A, B, and C grows exponentially with the number of contract functions and the points of control-flow divergence. We attempt to mitigate it using optimizations to prune the search space of the test cases, similarly to [19]. First, the initial function to be called in a test case should always be the constructor of the contract under test. Second, assume there is a subset \(F'\subseteq F\) of functions that already have full coverage (thanks to the already synthesized test cases). Then, given that we need to synthesize a new test case as a sequence of n function calls, in step B, instead of enumerating \(|F|^n\) possibilities, we can enumerate only \(|F|^{n-1} * |F\setminus F'|\) possibilities (i.e., any function \(f\in F'\) could only be called in either the 1-st, 2-nd, \(\ldots \), or \( n-1\)-th position).
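The second optimization can be sketched as follows, as a rough illustration rather than the tool's actual code. The helper `pruned_sequences` is hypothetical; it implements only the reading given in the parenthesis above, namely that already covered functions may appear anywhere except in the final position.

```python
from itertools import product


def pruned_sequences(functions: list, covered: set, n: int) -> list:
    """Length-n sequences whose final call is a not-yet-covered function."""
    uncovered = [f for f in functions if f not in covered]
    return [prefix + (last,)
            for prefix in product(functions, repeat=n - 1)
            for last in uncovered]


# Running example: suppose reset is already fully covered, f is not.
FUNCTIONS = ["reset", "f"]
ALL = list(product(FUNCTIONS, repeat=3))
PRUNED = pruned_sequences(FUNCTIONS, {"reset"}, 3)
for seq in PRUNED:
    print(seq)
```

Every remaining sequence ends in a call that can still contribute new coverage, while the earlier calls serve only to bring the contract into a suitable state.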

Synthesizing tests. Each test case is synthesized from two components: a sequence of functions created at the end of step B, and a sequence of tuples of concrete values extracted from an SMT model at the end of step C. In general, the tool has to create an unrolling of the following CHC, for some \(f_1,f_2,\ldots \in F\):

$$ en _{f_1}(\vec {s},\vec {s'},\vec {v},\vec {v'}, \vec {in},\vec {loc})\wedge en _{f_2}(\vec {s'},\vec {s''},\vec {v'},\vec {v''}, \vec {in'},\vec {loc'})\wedge \ldots \implies fc (\vec {s^{(n)}},\vec {v^{(n)}})$$

Note that the final unrolling still has occurrences of variables \(\vec {in}\) and \(\vec {in'}\) but no predicates \( en _{f_1}\), \( en _{f_2},\ldots \). Thus, if the formula is satisfiable, the SMT solver returns a tuple of values \(\texttt {in}_1\), \(\texttt {in}_2,\dots \) for \(\vec {in}\), a tuple of values \(\texttt {in'}_1\), \(\texttt {in'}_2,\dots \) for \(\vec {in'}\), etc. The final test case is then constructed as follows:

$$ \texttt {f}_1(\texttt {in}_1, \texttt {in}_2,\ldots );\; \texttt {f}_2(\texttt {in'}_1, \texttt {in'}_2,\ldots ); \ldots $$
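The construction above amounts to grouping model values by the call they belong to and printing the calls in order. The Python sketch below illustrates this under stated assumptions: the helpers `calls_from_model` and `render_test_case`, the `PARAMS` table, and the plain-dictionary model are all hypothetical stand-ins, and each function is assumed to be called at most once so that no primed copies of input variables are needed.

```python
def calls_from_model(sequence: list, params: dict, model: dict) -> list:
    """Pair each called function with the tuple of its input values."""
    return [(name, tuple(model[p] for p in params[name])) for name in sequence]


def render_test_case(contract: str, calls: list) -> str:
    """Render calls as statements; the first call must be the constructor."""
    (_, ctor_args), *rest = calls
    parts = [f"c = new {contract}({','.join(map(str, ctor_args))});"]
    parts += [f"c.{name}({','.join(map(str, args))});" for name, args in rest]
    return " ".join(parts)


# Input variables of the running example: in for the constructor, x and y for f.
PARAMS = {"constructor": ["in"], "f": ["x", "y"], "reset": []}
MODEL = {"in": 0, "x": 0, "y": 0}  # a model in which all inputs are zero

CALLS = calls_from_model(["constructor", "f"], PARAMS, MODEL)
TEST = render_test_case("A", CALLS)
print(TEST)
```

Keeping the function sequence (from step B) separate from the model values (from step C) means the same sequence can be rendered again whenever the solver returns a different model.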

We illustrate the whole process in our example. The formula encoding calls to the constructor and function f under the then-branch has the following unrolling:

figure e

It is satisfiable with a model \(in\mapsto 0, x \mapsto 1, y\mapsto 0\ldots \). Parsing this model and determining that in represents an input to the constructor and that x and y represent inputs to f gives us the test case c = new A(0); c.f(1,0).

A formula encoding the same function calls but another branch is as follows:

figure f

It is satisfiable with a model \(in\mapsto 0, x \mapsto 0, y\mapsto 0\ldots \) which gives us test case c = new A(0); c.f(0,0).

These test cases now can be used for the generation of a Solidity test file:

figure g

The test file consists of multiple functions whose naming convention follows the Foundry standards. Function setUp is used by Foundry to prepare the testing environment. Functions test_A_0 and test_A_1 incorporate the test cases generated by SolTG above. The test file is human-readable and reusable: it can be used to generate test coverage right away and/or stored for regression testing later.

4 Evaluation

We evaluated SolTG on 59 benchmarks from the SolCMC repository [2] and industrial smart contracts which exhibit different Solidity-specific features. All experiments were conducted on a machine with a 2.3 GHz 8-core Intel Core i9 processor and 16 GB RAM running macOS 13.4. Test coverage data was collected from the Foundry test reports.

For the given set of benchmarks, SolTG provided a high level of test coverage. In particular, on average, it achieved 71% branch coverage, 81.2% line coverage, and 90.9% function coverage for a 60-second timeout. A correlation of the coverage and the chosen timeout for each benchmark is shown in Fig. 3. The cactus plot has a range of timeouts from 0.1 s to 60 s on the x-axis and the corresponding coverage on the y-axis. Each curve represents multiple behaviors of SolTG on a single benchmark w.r.t. different timeouts. The rapid growth of many curves demonstrates that 1) the preprocessing and initial enumeration steps are very fast, and 2) SolTG often needs just a few seconds to produce many test cases. In fact, SolTG generates the majority of tests within the first 10 s of the execution. For 21 benchmarks, SolTG generated tests that reach 100% line coverage within the first 5 s. Further, as time increases, SolTG finds additional test cases, but producing them expectedly becomes costlier.

Fig. 3. SolTG performance for the benchmark set with different timeouts.

We investigated the cases when SolTG did not report the full test coverage within the 60-second timeout. One reason, as expected, was the complexity of the control structures of the contracts, which resulted in less efficient test case generation. The other, more notable reason was discrepancies between the CHC encoding and the actual semantics of the contracts.

We evaluated SolTG on multiple industrial Ethereum contracts (e.g., Weth.sol, ERC20.sol, VestingWallet.sol), and it exhibited performance similar to that on the SolCMC benchmarks. For example, for Weth.sol, SolTG generated tests that produce 100% line and branch coverage within a 120 s timeout. Overall, the results demonstrate the practical value of SolTG in the contract development process.

5 Related Work

The principles of the CHC encoding of smart contracts that SolTG relies on originate from [12]. Previously, they enabled the development of the SolCMC [2] model checker built into the Solidity compiler. SolCMC, however, is not designed to generate test cases for each branch of each function of a given contract, which requires an exhaustive enumeration of path conditions. On the other hand, SolTG does not rely on external CHC solvers and has its own approach towards the exhaustiveness of the enumeration.

Automated reasoning about Solidity is also enabled by LLVM-based frameworks SmartACE [18] and SKLEE [11]. SmartACE uses an existing infrastructure for model checking, symbolic execution, and fuzzing [5, 10], and it does not translate tests from LLVM back to Solidity. SKLEE targets the detection of certain types of bugs rather than maximizing the function, branch, or line coverage. Both tools reason about an LLVM binary which is semantically different from the original contract in Solidity. By contrast, SolTG guarantees that behavior demonstrated by the generated tests corresponds to the actual blockchain behavior.

Existing tools for test case generation for Solidity [7, 15] follow genetic algorithms and traditional fuzzing. These methods may struggle with corner cases because they extensively modify some initial random test cases, producing a significant amount of superfluous tests, e.g., ones that cover the same branch. By contrast, SolTG is driven by exploitation of the program’s structure, and specifically, it attempts to cover blocks of code that have not been tested yet.

There exist multiple SMT-based test generation tools like KALI [3], CAVI-TEST [16] for Java, FuseBMC [1], Symbiotic [6], KLEE [4] for C and other languages. Many of them produce test cases after some communication with an SMT solver, but none of them is based on CHCs. The closest tool to SolTG is the only CHC-based test case generator, Horntinuum [19], which is, however, not tailored for smart contracts and assumes that the CHC representation is linear (i.e., all function calls have to be inlined). Horntinuum alternates invariant generation and test case generation, and it accelerates the enumeration process by exploiting the invariants discovered so far. In the future, we would like to adopt this strategy in SolTG as well.

6 Conclusion

We have presented SolTG, a fully automated Solidity test case generator based on CHC encoding capable of generating tests for industry-grade smart contracts. SolTG analyzes the system of nonlinear CHCs provided by SolCMC and synthesizes test cases as a result of exhaustive enumeration of contract behavior and SMT models, thus avoiding fuzzing. The compiled tests are supported by the widely used Foundry framework. Our evaluation demonstrates that the tool is effective in generating test cases on a range of Solidity contracts and can be fully integrated into the development process. In the future, SolTG could benefit from a tighter connection with invariant generation techniques to accelerate its enumeration process.