collected than needed for the query. On the other hand, uniform head-based sampling may miss anomalous traces, which are crucial to debugging, and tail-based sampling filters for specific types of traces, potentially missing traces relevant to subsequent queries. In either case, the data that is collected after sampling may not be what is required to answer the developer's query accurately.

Here, we take a different approach: tightly coupling trace data collection and querying. Unlike existing tracing systems, our output is not a database of traces to be queried. Rather, the database is itself created by the developer's queries and captures precisely the properties of traces that are of interest to the developer. We present a query system, Snicket, that takes a developer's queries as input, and produces a database populated by answers to those queries—no more and no less. Snicket's input query language is database-style, high-level, and graph-centered. It allows the developer to work under the illusion that they can process every single trace in a centralized location in a streaming fashion to extract useful insights. The developer's query specifies what traces the developer is interested in (e.g., those with a particular error code), how to process these traces to extract useful information (e.g., end-to-end request latency), and how to aggregate multiple traces to produce useful summary statistics (e.g., mean end-to-end request latency across multiple traces). To get answers to the query, Snicket compiles the query to a distributed collection of microservice extensions, one per microservice. These extensions run as a bump-in-the-wire before and after the application logic within the microservice.

Our extension-based approach is enabled by two recent developments in microservices: (1) the emergence of service proxies and (2) support for programmability in these proxies through WebAssembly (WASM) [25] bytecode extensions. First, diverse microservices share common functions (e.g., authenticating users, granting those users different privileges, and load balancing across microservice replicas). Over time, such common functions have been factored out of the microservice's application logic and moved into a common infrastructure software layer known as the service proxy (e.g., Envoy and Linkerd [2, 3]). These service proxies effectively operate as application-layer switches and serve as the data plane of the inter-microservice network. Second, bytecode extensions are a new feature of service proxies [1] that allows them to be extended using WASM programs. These programs can be developed in a language that supports compilation to WASM, such as C++ or Rust. Thus, these extensions augment service proxies with programmability, similar to programmable switches and network-interface cards. Snicket implements distributed tracing by compiling developer queries into bytecode extensions running within service proxies.

There are two key challenges in trying to compile queries on traces into service proxy extensions. First, at any time while a trace is being created, no extension has a full view of all microservices or of the trace itself. Yet some computation must be done at that time in order to integrate data collection and querying. Allowing the developer the illusion of having a full view of both, while individual microservices do not, is challenging. §3.2 discusses how the Snicket compiler handles this.

Second, it is important to ensure service proxy extensions do not add untenable overhead (CPU usage, latency, etc.). The extensions run as a bump-in-the-wire, meaning any latency they incur will have a direct effect on the performance of the application. Because many microservice applications run in the cloud, extra CPU usage implies more money spent. §5 discusses potential solutions to reduce this overhead.

In preliminary evaluations, we test Snicket on an open-source microservice benchmark called Online Boutique [7]. We find that Snicket adds modest latency (~17ms) and CPU overhead (a 9% increase). We also evaluate Snicket's expressiveness and how quickly Snicket's queries can be updated. Snicket is currently available at https://github.com/dyn-tracing/snicket_compiler.

2 BACKGROUND AND RELATED WORK

Distributed Tracing. Distributed tracing is the practice of tracking a user's request from its entry to its exit. Each RPC, from a parent to a child in the trace, is captured as a span. The span may also contain metadata about that RPC, such as the latency of the RPC. Baggage is the data that is propagated within RPCs across multiple microservices in order to collect information about the trace. All spans issued as a result of the same user request can be assembled into a trace: a directed tree where edges represent caller-callee relationships.
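To make these terms concrete, the following is a minimal Rust sketch of spans assembled into a trace tree. The field names and layout are illustrative only; they are not Snicket's (or any particular tracing system's) actual data model.

use std::collections::HashMap;

// One RPC from a parent (caller) service to a child (callee) service.
struct Span {
    trace_id: String,                  // shared by every span of one user request
    service: String,                   // the callee that served this RPC
    parent: Option<String>,            // the caller; None for the entry-point RPC
    latency_ms: u64,                   // example of per-RPC metadata
    metadata: HashMap<String, String>, // other per-RPC metadata
}

// A trace: all spans of one user request, forming a directed tree whose
// edges represent caller-callee relationships.
struct Trace {
    trace_id: String,
    spans: Vec<Span>,
}

impl Trace {
    // Callees invoked directly by `service` within this trace.
    fn callees(&self, service: &str) -> Vec<&Span> {
        self.spans
            .iter()
            .filter(|s| s.parent.as_deref() == Some(service))
            .collect()
    }
}

fn main() {
    let trace = Trace {
        trace_id: "abc123".into(),
        spans: vec![
            Span { trace_id: "abc123".into(), service: "frontend".into(), parent: None,
                   latency_ms: 42, metadata: HashMap::new() },
            Span { trace_id: "abc123".into(), service: "cart".into(),
                   parent: Some("frontend".into()), latency_ms: 17, metadata: HashMap::new() },
        ],
    };
    println!("frontend calls {} service(s)", trace.callees("frontend").len());
}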
An Ideal System. To contextualize Snicket's tradeoffs, we consider an idealized tracing system that records every span, and sends them to a centralized service to be stored forever. That system would incur a small but significant overhead on the application and would have perfect visibility into any point in the past, but would have prohibitively large storage costs. All current tracing systems reduce these storage costs in some way, and in doing so increase the overhead on the application and/or reduce visibility into data.

Trace Database Systems. Most tracing systems address the tradeoff by sampling: they store only a fraction of the traces in a database for later querying. Dapper, Jaeger, and Canopy employ uniform head-based sampling, which has the advantage of simplicity, but may miss important unusual traces, thus restricting visibility [4, 11, 23]. Canopy also employs a form of developer-defined tail-based sampling: based on developer input, Canopy decides which traces to keep for later querying [11]. Lightstep uses dynamic sampling, which
Figure 2: A table demonstrating each language construct in Snicket, an example corresponding to that construct in
OpenCypher syntax, and a description of the construct.
function implementation keeps a running tally of the total sum and the number of instances seen. When a new value is given to the aggregation function because another trace has been completed, the aggregation function updates its two internal values and divides the sum by the count to get the value to be placed in storage. Aggregation functions are developer-defined.
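As a concrete illustration of such a developer-defined aggregation, the Rust sketch below keeps exactly the running sum and count just described. The trait and type names (Aggregation, LatencyAvg) are ours, chosen for this sketch, and are not part of Snicket's API.

// A developer-defined aggregation: consumes one value per completed trace
// and continuously exposes a single summary value to be placed in storage.
trait Aggregation {
    fn observe(&mut self, value: f64);
    fn current(&self) -> f64;
}

// Running average, kept as (sum, count) as described in the text.
#[derive(Default)]
struct LatencyAvg {
    sum: f64,
    count: u64,
}

impl Aggregation for LatencyAvg {
    fn observe(&mut self, value: f64) {
        // Another trace completed: update both internal values.
        self.sum += value;
        self.count += 1;
    }
    fn current(&self) -> f64 {
        // Divide the sum by the count to get the value placed in storage.
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
}

fn main() {
    let mut avg = LatencyAvg::default();
    for latency_ms in [12.0, 20.0, 16.0] {
        avg.observe(latency_ms);
    }
    println!("running average latency: {} ms", avg.current()); // 16 ms
}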
Example Scenario. As an example of Snicket in practice, consider the scenario where a company is switching from local to cloud-based machines. In the midst of the transition, queries are being load balanced across multiple replicas, some local, and some on the cloud. A developer notices that many slow requests go through these replicas. A reasonable hypothesis is that there may be some difference between how the local and cloud replicas are set up. However, that is only one possible explanation of many. To test that explanation, the developer formulates the query

MATCH (a)->(b)
WHERE a.vertex.workload.SERVICE_NAME == "frontend"
  AND a.downstream.address == local_address
RETURN avg(latency(trace))

In storage, the developer will be able to access a continually updated average of the latency of traces that went through the local replicas. If this number is normal, there is likely a problem on the cloud replicas. If not, then the developer can change the query's IP address to the cloud IP, and figure out if the local replicas are the problem.
3.2 Output: WASM Filters

Snicket's compiler generates WASM extensions from input queries. WASM runs in a safe, memory-sandboxed environment that is isolated from the service proxy [25]. The WASM environment has a well-defined interface with the Envoy service proxies; it can put data into proxy storage, inspect incoming and outgoing messages, and learn about the status and placement of the service proxy. The WASM environment also has access to Envoy-supplied attributes like the trace ID.

There is one service proxy per microservice. For simplicity in compiler implementation, the compiler currently generates the same extension for each proxy, except for the storage container, which is managed by Snicket to keep the results of a query. When the developer wants to know the results of a query, they query the storage container.
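The sketch below restates the capabilities listed above as a schematic Rust interface. It is not the actual Envoy or proxy-wasm API; the trait and method names are placeholders meant only to show the shape of what an extension can rely on.

// Schematic view of what the WASM environment exposes to an extension.
trait ProxyHost {
    fn attribute(&self, key: &str) -> Option<String>;     // e.g., the trace ID
    fn put_storage(&mut self, key: &str, value: Vec<u8>);  // proxy storage
    fn proxy_placement(&self) -> String;                   // which microservice this proxy fronts
}

// A bump-in-the-wire extension sees each message before and after the
// application logic of its microservice.
trait Extension {
    fn on_request(&mut self, host: &mut dyn ProxyHost, body: &[u8]);
    fn on_response(&mut self, host: &mut dyn ProxyHost, body: &[u8]);
}

fn main() {} // traits only; a concrete extension would implement both hooks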
3.3 The Snicket Compiler

We now describe how the compiler implements each language construct: the attribute filter, structural filter, developer-defined attributes, return, and aggregation. The compiler creates as output two extensions: the main extension, which runs on all application microservices, and an aggregation extension, which runs only on the storage container. Both extensions run in the proxies, so no instrumentation is needed in the application as long as it uses any service proxy with an extension mechanism. Our implementation uses the Envoy service proxy and WASM bytecode extensions.

Attribute Collection. First, as requests pass through various microservices, relevant Envoy attributes are added to the baggage: the data that is propagated alongside an RPC [18] as it hops across microservices to complete the user request. This happens within the main extension. The attributes are accessible through the Envoy interface, and will later determine whether the trace graph matches the query's filters.
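A minimal sketch of this attribute collection step is shown below, assuming a simplified baggage representation. The attribute names and the BaggageEntry type are illustrative, not Snicket's actual identifiers or wire format.

use std::collections::HashMap;

// Simplified baggage entry for one service the request has passed through.
struct BaggageEntry {
    service: String,
    attributes: HashMap<String, String>,
}

// Called by the main extension as the request passes through a proxy:
// copy the Envoy attributes referenced by the query into the baggage.
fn collect_attributes(
    baggage: &mut Vec<BaggageEntry>,
    service: &str,
    envoy_attrs: &HashMap<String, String>,
    wanted: &[&str], // attribute names referenced by the query's filters
) {
    let mut attributes = HashMap::new();
    for key in wanted {
        if let Some(value) = envoy_attrs.get(*key) {
            attributes.insert((*key).to_string(), value.clone());
        }
    }
    baggage.push(BaggageEntry { service: service.to_string(), attributes });
}

fn main() {
    let mut baggage = Vec::new();
    let mut envoy_attrs = HashMap::new();
    envoy_attrs.insert("workload.SERVICE_NAME".to_string(), "frontend".to_string());
    envoy_attrs.insert("request.duration".to_string(), "12".to_string());
    collect_attributes(&mut baggage, "frontend", &envoy_attrs, &["workload.SERVICE_NAME"]);
    println!("baggage entries: {}", baggage.len());
}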
Attribute and Structural Filtering. As RPC calls are made in response to incoming user requests, the WASM extension creates a tree of RPCs in an online fashion, starting at the leaves. This tree is part of the baggage that is propagated between microservices. Once a response for an RPC is processed, the responding microservice is added to the tree as a vertex—effectively, a post-order traversal of the trace. As the tree is created from leaves to root, an isomorphism algorithm [22] is run in a distributed manner at each microservice, finding matches to the structural and attribute patterns specified in the query, using only the information collected thus far. The algorithm executes both structural and attribute filters at the same time.
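The following is a simplified, single-process Rust sketch of this response-path logic: when a service's response has been processed, the service is appended to the trace tree carried in baggage (post-order), and a filter check runs against the partial tree. In Snicket the pattern comes from the compiled MATCH/WHERE clauses and the check is a distributed isomorphism algorithm [22]; here the pattern is reduced to a single caller-callee edge plus one attribute equality, purely for illustration.

use std::collections::HashMap;

// One vertex of the trace tree carried in baggage.
struct Vertex {
    service: String,
    parent: Option<String>, // caller service; None at the root
    attributes: HashMap<String, String>,
}

// Called when this service's response has been processed: the service
// becomes part of the tree only after all of its callees (post-order).
fn add_vertex(tree: &mut Vec<Vertex>, service: &str, parent: Option<&str>,
              attributes: HashMap<String, String>) {
    tree.push(Vertex {
        service: service.to_string(),
        parent: parent.map(|p| p.to_string()),
        attributes,
    });
}

// Toy stand-in for the distributed isomorphism check: does the partial tree
// contain an edge a->b where the caller a carries the required attribute?
fn matches(tree: &[Vertex], attr_key: &str, attr_val: &str) -> bool {
    tree.iter().any(|child| {
        child.parent.as_deref().map_or(false, |caller| {
            tree.iter().any(|a| {
                a.service == caller
                    && a.attributes.get(attr_key).map(String::as_str) == Some(attr_val)
            })
        })
    })
}

fn main() {
    let mut tree = Vec::new();
    add_vertex(&mut tree, "currency", Some("frontend"), HashMap::new());
    let mut frontend_attrs = HashMap::new();
    frontend_attrs.insert("downstream.address".to_string(), "10.0.0.1".to_string());
    add_vertex(&mut tree, "frontend", None, frontend_attrs);
    println!("pattern matched: {}", matches(&tree, "downstream.address", "10.0.0.1"));
}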
Developer-Defined Attributes with function(input). Developers can also create new recursively defined attributes. The functions defining these attributes execute once for each vertex: when a response is received from a callee vertex and the caller vertex becomes part of the trace, then the caller vertex will define the attribute for itself. Then the attribute is added to baggage in the same way as inherent attributes. To create a new attribute, the developer refers to an attribute using parenthesis notation (e.g., height(trace)) within the query, and provides the function defining the attribute as part of the query ingested by the compiler.
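As an illustration of such a recursively defined attribute, the sketch below computes height(trace) for a vertex from the heights already recorded for its callees at the moment the vertex joins the trace. The representation (a per-service attribute map) is illustrative and not Snicket's internal data structure.

use std::collections::HashMap;

// Per-vertex attribute maps carried in baggage, keyed by service name.
type Attributes = HashMap<String, HashMap<String, i64>>;

// Invoked once per vertex, when the caller joins the trace: its height is
// one more than the maximum height among its already-processed callees.
fn define_height(attrs: &mut Attributes, caller: &str, callees: &[&str]) {
    let child_max = callees
        .iter()
        .filter_map(|c| attrs.get(*c).and_then(|m| m.get("height")).copied())
        .max()
        .unwrap_or(0); // a leaf has no callees
    attrs.entry(caller.to_string())
        .or_default()
        .insert("height".to_string(), child_max + 1);
}

fn main() {
    let mut attrs = Attributes::new();
    define_height(&mut attrs, "currency", &[]);            // leaf: height 1
    define_height(&mut attrs, "checkout", &["currency"]);  // height 2
    define_height(&mut attrs, "frontend", &["checkout"]);  // height 3
    println!("{:?}", attrs.get("frontend"));
}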
Return. In the main extension, once it has been determined that the attribute and structural filters are satisfied, the extension sends the result to the storage container. The result is determined by the return statement and sent to the storage container as a (Trace ID, value) pair, in order to distinguish which result came from which trace.
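As a minimal illustration of the shape of what is sent, the sketch below encodes such a pair with a simple textual encoding; the ReturnRecord type and the tab-separated format are assumptions for this sketch, not Snicket's actual wire format.

// What the main extension emits once a trace satisfies the query's filters.
struct ReturnRecord {
    trace_id: String,
    value: String, // the RETURN expression, e.g. a latency, rendered as text
}

// Encode the (Trace ID, value) pair for shipment to the storage container.
fn encode(record: &ReturnRecord) -> String {
    format!("{}\t{}", record.trace_id, record.value)
}

fn main() {
    let rec = ReturnRecord { trace_id: "abc123".into(), value: "248".into() };
    println!("{}", encode(&rec));
}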
Aggregate. The developer can also define their own functions in the aggregation construct. A developer can create an aggregation function that takes in anything that can be put as a RETURN value, and continuously outputs one entry to be put into the storage container. Thus the developer can do some summary computation on return values, and store these summaries, instead of storing per-trace values.

In the example scenario in §3.1, the input to the aggregation function was latency(trace) and the aggregation function is avg. The main extension sends (Trace ID, Latency) pairs to the storage container. But rather than immediately sending these pairs to storage, the aggregation extension on the storage container intercepts these pairs and continuously recomputes a running average of latencies across all pairs seen so far. This running average is maintained within the storage container and updated with every new pair.

4 EVALUATING SNICKET

4.1 Language Expressiveness

In Figure 3, we show various examples of Snicket queries. These queries could include correctness checks, investigation of anomalous data emitted by the application, or debugging of erroneous user requests. Two defining features of Snicket's language that are difficult to express with prior systems are: (1) the ability to match on specific graph structures and (2) the ability to create new developer-defined attributes without restarting applications.

config map impossible to overwrite [24]. Hence, every time the contents of the config map (our Snicket extension) are overwritten, the entire service proxy must be restarted. Although this does not affect application functionality, it is slow. Recently, a new way to refresh WASM extensions has been developed [10]. It is still an experimental feature, and has limited uses. Using the new method, we improved the extension refresh time from 30–103 seconds to 0.6–0.9 seconds. We hope that as the feature moves beyond experimental, Snicket can become more interactive.

4.3 Cost and Performance

We measure the cost and performance of Snicket by two metrics: (1) the extra compute cost that needs to be paid to run Snicket alongside an application, and (2) the latency added by Snicket to user requests.

Application. We use Online Boutique, a microservices application of 10 microservices [7]. The only change we made to the application itself is increasing the requested CPU for each pod to 601 millicores (1 millicore is one thousandth of a core); this allows for better performance of the base application.

Cluster Configuration. The cluster is initially given 7 nodes of machine type "e2-highmem-4" (4 vCPUs, 32 GB memory) on the Google Cloud Platform. Horizontal autoscaling is enabled with a threshold of 40 percent CPU utilization [19], and all pods are allowed at most 10 replicas, with the exception of the frontend, which is allowed 30 replicas. None of the experiments reached the autoscaler limits. We enabled the default Kubernetes cluster autoscaler [12].

Load Generator. We use a load generator, Locust [9], to create load on the application. Locust spawns five users every second until it reaches 500 users. The load generator was located in a VM in the same cloud zone as the application. Each user sends a request, waits a random period between 1 and 3 seconds, then sends another request, and so on. During this time, both the horizontal and cluster autoscalers are allowed to stabilize to handle the load. Then, we record latency measurements for 400 seconds and record the extra resources used to run the application. We measure three cases: (1) the application alone, (2) the application with a no-op WASM extension running in each service proxy (this is meant to capture how much overhead is added by deploying WASM extensions at all), and (3) the application with extensions generated by Snicket. We use the query:
[Figure: Latencies (y-axis: Percent).]

able to query both current and historic traces with the illusion that all information is always available.

To do so, we are exploring improvements to Snicket that al-