
Snicket: Query-Driven Distributed Tracing

Jessica Berg, Fabian Ruffy, Khanh Nguyen, Nicholas Yang, Taegyun Kim, Anirudh Sivaraman (New York University);
Ravi Netravali (Princeton University); Srinivas Narayana (Rutgers University)
ABSTRACT

Increasing application complexity has caused applications to be refactored into smaller components known as microservices that communicate with each other using RPCs. Distributed tracing has emerged as an important debugging tool for such microservice-based applications. Distributed tracing follows the journey of a user request from its starting point at the application's front-end, through RPC calls made by the front-end to different microservices recursively, all the way until a response is constructed and sent back to the user. To reduce storage costs, distributed tracing systems sample traces before collecting them for subsequent querying, affecting the accuracy of queries on the collected traces.

We propose an alternative system, Snicket, that tightly integrates querying and collection of traces. Snicket takes as input a database-style streaming query that expresses the analysis the developer wants to perform on the trace data. This query is compiled into a distributed collection of microservice extensions that run as "bumps-in-the-wire," intercepting RPC requests and responses as they flow into and out of microservices. This collection of extensions implements the query, performing early filtering and computation on the traces to reduce the amount of stored data in a query-specific manner. We show that Snicket is expressive in the queries it can support and can update queries fast enough for interactive use.

ACM Reference Format:
Jessica Berg, Fabian Ruffy, Khanh Nguyen, Nicholas Yang, Taegyun Kim, Anirudh Sivaraman, Ravi Netravali, and Srinivas Narayana. 2021. Snicket: Query-Driven Distributed Tracing. In The Twentieth ACM Workshop on Hot Topics in Networks (HotNets '21), November 10–12, 2021, Virtual Event, United Kingdom. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3484266.3487393

1 INTRODUCTION

Growing application complexity has led organizations to decompose large web services into a collection of smaller components, known as microservices, that communicate with each other over an RPC interface [6]. When a user issues a request to a web service (e.g., for the landing page of a social network), the request is first received by a front-end microservice. The front-end microservice then issues RPCs to internal microservices, which in turn might call other microservices recursively to construct a response for the user.

Debugging such microservice-based applications is difficult because microservices are distributed across multiple compute nodes. An important debugging tool for such applications is distributed tracing [5, 23]. Distributed tracing tracks the flow of an incoming user request through the collection of traversed microservices and represents the request as a trace: a tree that captures parent-child relationships between all RPCs originated by a particular user request, along with some metadata of each RPC (e.g., RPC latency). Distributed tracing systems [4, 8, 11, 17, 23] capture and persist traces in a database to permit subsequent querying by developers.

Storing a trace for every user request is prohibitively expensive. Hence, all existing tracing systems sample traces in some way. Two sampling strategies are commonly deployed today. Systems based on head-based sampling [4, 17, 23] sample user requests for tracing and storage before the requests spawn subsequent RPCs at the front-end. Tail-based sampling [13] approaches sample traces for storage after the request finishes execution and a full trace is available. Tail-based sampling enables decisions informed by trace contents, but is more complex than head-based sampling.

Unfortunately, the existing trace sampling approaches collect either more or less data than is actually needed to answer developer queries about traces. On the one hand, even if a developer is only interested in certain trace properties like end-to-end request latency, data is still persisted at the granularity of whole traces; this means more information is often collected than needed for the query.
On the other hand, uniform head-based sampling may miss anomalous traces, which are crucial to debugging, and tail-based sampling filters for specific types of traces, potentially missing traces relevant to subsequent queries. In either case, the data that is collected after sampling may not be what is required to answer the developer's query accurately.

Here, we take a different approach: tightly coupling trace data collection and querying. Unlike existing tracing systems, our output is not a database of traces to be queried. Rather, the database is itself created by the developer's queries and captures precisely the properties of traces that are of interest to the developer. We present a query system, Snicket, that takes a developer's queries as input and produces a database populated by answers to those queries—no more and no less. Snicket's input query language is database-style, high-level, and graph-centered. It allows the developer to work under the illusion that they can process every single trace in a centralized location in a streaming fashion to extract useful insights. The developer's query specifies what traces the developer is interested in (e.g., those with a particular error code), how to process these traces to extract useful information (e.g., end-to-end request latency), and how to aggregate multiple traces to produce useful summary statistics (e.g., mean end-to-end request latency across multiple traces). To get answers to the query, Snicket compiles the query to a distributed collection of microservice extensions, one per microservice. These extensions run as a bump-in-the-wire before and after the application logic within the microservice.

Our extension-based approach is enabled by two recent developments in microservices: (1) the emergence of service proxies and (2) support for programmability in these proxies through WebAssembly (WASM) [25] bytecode extensions. First, diverse microservices share common functions (e.g., authenticating users, granting those users different privileges, and load balancing across microservice replicas). Over time, such common functions have been factored out of the microservice's application logic and moved into a common infrastructure software layer known as the service proxy (e.g., Envoy and Linkerd [2, 3]). These service proxies effectively operate as application-layer switches and serve as the data plane of the inter-microservice network. Second, bytecode extensions are a new feature of service proxies [1] that allows them to be extended using WASM programs. These programs can be developed in a language that supports compilation to WASM, such as C++ or Rust. Thus, these extensions augment service proxies with programmability, similar to programmable switches and network-interface cards. Snicket implements distributed tracing by compiling developer queries into bytecode extensions running within service proxies.

There are two key challenges in trying to compile queries on traces into service proxy extensions. First, at any time while a trace is being created, no extension has a full view of all microservices or of the trace itself. Yet some computation must be done at that time in order to integrate data collection and querying. Allowing the developer the illusion of having a full view of both, while individual microservices do not, is challenging. §3.2 discusses how the Snicket compiler handles this.

Second, it is important to ensure service proxy extensions do not add untenable overhead (CPU usage, latency, etc.). The extensions run as a bump-in-the-wire, meaning any latency they incur will have a direct effect on the performance of the application. Because many microservice applications run in the cloud, extra CPU usage implies more money spent. §5 discusses potential solutions to reduce this overhead.

In preliminary evaluations of Snicket, we test Snicket on an open-source microservice benchmark called Online Boutique [7]. We find that Snicket adds modest latency (~17 ms) and CPU overhead (a 9% increase). We also evaluate Snicket's expressiveness and how quickly Snicket's queries can be updated. Snicket is currently available at https://github.com/dyn-tracing/snicket_compiler.

2 BACKGROUND AND RELATED WORK

Distributed Tracing. Distributed tracing is the practice of tracking a user's request from its entry to its exit. Each RPC, from a parent to a child in the trace, is captured as a span. The span may also contain metadata about that RPC, such as the latency of the RPC. Baggage is the data that is propagated within RPCs across multiple microservices in order to collect information about the trace. All spans issued as a result of the same user request can be assembled into a trace: a directed tree where edges represent caller-callee relationships.
An Ideal System. To contextualize Snicket's tradeoffs, we consider an idealized tracing system that records every span and sends them to a centralized service to be stored forever. That system would incur a small but significant overhead on the application and would have perfect visibility into any point in the past, but would have prohibitively large storage costs. All current tracing systems reduce these storage costs in some way, and in doing so increase the overhead on the application and/or reduce visibility into data.

Trace Database Systems. Most tracing systems address the tradeoff by sampling: they store only a fraction of the traces in a database for later querying. Dapper, Jaeger, and Canopy employ uniform head-based sampling, which has the advantage of simplicity, but may miss important unusual traces, thus restricting visibility [4, 11, 23]. Canopy also employs a form of developer-defined tail-based sampling: based on developer input, Canopy decides which traces to keep for later querying [11].
Lightstep uses dynamic sampling, which is similar to tail-based sampling, but instead of considering traces one at a time, it considers all traces from the last hour to decide what to store [26]. Lightstep's dynamic tracing prioritizes unusual traces within the last hour of traces. In all these systems, the trace data is first sampled in a system-specific way and then persisted to long-term storage where it may be queried. Because querying happens on the database after sampling, query results may not be representative of the full stream of traces observed by the microservices. In contrast, Snicket's query results are accurate because Snicket logically operates on all traces seen by the application. However, only the results of Snicket's query are stored.

Query-Based Systems. Pivot Tracing takes a similar approach to Snicket by tightly tying collection and querying together [15]. Pivot Tracing takes a query as input and compiles this query down to dynamic tracepoints throughout a distributed system that find the query answer. Similar to Snicket, it stores only the query results, not the traces from which they are derived. Unlike Snicket, Pivot Tracing does not support graph-based queries. It also requires more intrusive changes to the application for deployment, which Snicket sidesteps by operating at the proxy extension layer.

Recently, there has been some interest in offline graph analytics of traces [8, 14] because traces are graph structures that are well suited to graph-processing systems. In these systems, a trace is treated as a graph and not as a flat collection of spans, so events from different branches of the same trace are more easily correlated and processed. Snicket captures similar graph-based information, while tightly binding together collection and querying in an online streaming manner.

3 DESIGN

3.1 Input: Query Language

In Snicket, a trace is modeled as a directed tree rooted at the front-end. Each vertex is a unique visit to a microservice, and each edge is an RPC. Snicket's query language syntax is based on OpenCypher [16] (Figure 2). It allows a microservice developer to specify both the graph structure and attributes of traces they are interested in. The input to a query is the stream of traces created by user requests, and the output is the answer to the query, either expressed as a single value per trace or as the result of an aggregation function over the per-trace values. The Snicket compiler compiles queries into a collection of WASM extensions running in the Envoy proxy (Figure 1). Snicket's language constructs are described below.

Figure 1: An example microservice application, in which a front end calls ads, product, and ratings services. Snicket's generated code runs in the Envoy proxy as extensions.

Structural Patterns Using MATCH. MATCH corresponds to a structural filter, which specifies the structure of the graph to match on. For example, one could ask for a trace containing a subtree with a parent and 5 children, as shown in Figure 3.

Attribute Specification Using WHERE. WHERE corresponds to an attribute filter; it specifies any vertex-level attributes the vertices referenced in the structural filter might have (e.g., vertex "a" must be named "shoppingcart-service"). It can also specify trace-level attributes (e.g., the latency of the entire trace or the trace ID). Inherent vertex-level attributes come directly from Envoy's interface to WASM. This allows the language to grow as Envoy expands what it makes available to WASM. These attributes are either inherent to the vertex's microservice (e.g., the microservice's name) or the RPC response sent back by the vertex to its parent (e.g., the size of the response or a header within the response). Thus, Snicket developers also have access to whatever application-specific information is sent through RPC response headers.

Developer-Defined Attributes with function(input). This construct allows new attributes to be created from inherent vertex- and trace-level attributes. Inherent attributes that are built into the Envoy service proxy are automatically available and accessible through dot notation (e.g., vertex.name). A developer can define a mapping function on these inherent attributes to create new attributes, like the height of the trace graph. These attributes are recursively defined: the root's value will be considered the attribute for the trace as a whole. For example, if a developer wanted the height of a tree, they could recursively define it to be the maximum height of a vertex's children, plus one. Then, the root vertex's height would also be the trace's height.
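
As a concrete illustration, such a recursively defined attribute could look like the sketch below, where each vertex computes its value from the values its children have already computed; the function signature is a simplified assumption for exposition, not the exact form the Snicket compiler ingests.

// Sketch of a recursively defined "height" attribute, assuming each vertex
// sees the attribute values its children have already computed. The
// signature is illustrative, not Snicket's actual extension interface.

/// A leaf vertex is given height 0 here; otherwise the height is the
/// maximum child height plus one. The value computed at the root vertex
/// becomes the height of the trace as a whole.
fn height(children_heights: &[u64]) -> u64 {
    children_heights
        .iter()
        .max()
        .map(|max_child| max_child + 1)
        .unwrap_or(0)
}

fn main() {
    assert_eq!(height(&[]), 0);     // a leaf has no children
    assert_eq!(height(&[0, 1]), 2); // root above children of heights 0 and 1
}
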
Query Answers with RETURN. RETURN can either return one value per trace (e.g., RETURN latency(trace)) or an aggregation (e.g., RETURN avg(latency(trace))) over attributes across traces. In the first case, the output of the query is a single value per trace (e.g., a vertex's name for every trace that matches the structural and attribute filters). In the second case, the output of the query is the result of the aggregation function implemented on the returned element (e.g., the average latency of multiple traces). The aggregation functions may maintain arbitrary intermediate data to arrive at their final value. For example, for an average, the aggregation function implementation keeps a running tally of the total sum and the number of instances seen. When a new value is given to the aggregation function because another trace has been completed, the aggregation function updates its two internal values and divides them to get the value to be placed in storage. Aggregation functions are developer-defined.
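
A minimal sketch of such a developer-defined aggregation, a running average that keeps only a sum and a count as intermediate state, is shown below; the trait name and signature are assumptions for exposition, not Snicket's actual aggregation interface.

// Sketch of a developer-defined aggregation (a running average) that
// maintains two pieces of intermediate state: a sum and a count. The trait
// and its signature are illustrative assumptions, not Snicket's code.

trait Aggregation {
    /// Fold in one per-trace return value and emit the value that should
    /// currently be placed in storage.
    fn update(&mut self, value: f64) -> f64;
}

#[derive(Default)]
struct RunningAverage {
    sum: f64,
    count: u64,
}

impl Aggregation for RunningAverage {
    fn update(&mut self, value: f64) -> f64 {
        self.sum += value;
        self.count += 1;
        self.sum / self.count as f64
    }
}

fn main() {
    let mut avg = RunningAverage::default();
    // Latencies (in milliseconds) from three completed traces.
    for latency_ms in [50.0, 70.0, 60.0] {
        println!("stored average: {}", avg.update(latency_ms));
    }
}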

Operator | Example Expression in OpenCypher | Description
Structural Filter | MATCH (a) -> (b) | Defines an arbitrary graph structure to match on.
Attribute Filter | WHERE a.response.total_size = 500 AND height(trace) = 5 | Filters based on attributes of vertices and traces.
Developer-Defined Attributes | latency(b) | Creates a new, developer-defined attribute for vertices and traces.
Return | RETURN latency(b) | Defines what information will be returned to the developer.
Aggregate | RETURN average(latency(b)) | Specifies an aggregation function that should be applied to data across traces before returning to the developer.

Figure 2: A table demonstrating each language construct in Snicket, an example corresponding to that construct in OpenCypher syntax, and a description of the construct.

Example Scenario. As an example of Snicket in practice, consider the scenario where a company is switching from local to cloud-based machines. In the midst of the transition, queries are being load balanced across multiple replicas, some local and some on the cloud. A developer notices that many slow requests go through these replicas. A reasonable hypothesis is that there may be some difference between how the local and cloud replicas are set up. However, that is only one possible explanation of many. To test that explanation, the developer formulates the query:

MATCH (a)->(b)
WHERE a.vertex.workload.SERVICE_NAME == "frontend"
  AND a.downstream.address == local_address
RETURN avg(latency(trace))

In storage, the developer will be able to access a continually updated average of the latency of traces that went through the local replicas. If this number is normal, there is likely a problem on the cloud replicas. If not, then the developer can change the query's IP address to the cloud IP and figure out if the local replicas are the problem.

3.2 Output: WASM Filters

Snicket's compiler generates WASM extensions from input queries. WASM runs in a safe, memory-sandboxed environment that is isolated from the service proxy [25]. The WASM environment has a well-defined interface with the Envoy service proxies; it can put data into proxy storage, inspect incoming and outgoing messages, and learn about the status and placement of the service proxy. The WASM environment also has access to Envoy-supplied attributes like the trace ID. There is one service proxy per microservice. For simplicity in compiler implementation, the compiler currently generates the same extension for each proxy, except for the storage, which is a container managed by Snicket to keep the results of a query. When the developer wants to know the results of a query, they query the storage container.

3.3 The Snicket Compiler

We now describe how the compiler implements each language construct: the attribute filter, structural filter, developer-defined attributes, return, and aggregation. The compiler creates as output two extensions: the main extension, which runs on all application microservices, and an aggregation extension, which runs only on the storage container. Both extensions run in the proxies, so no instrumentation is needed in the application as long as it uses any service proxy with an extension mechanism. Our implementation uses the Envoy service proxy and WASM bytecode extensions.

Attribute Collection. First, as requests pass through various microservices, relevant Envoy attributes are added to the baggage: the data that is propagated alongside an RPC [18] as it hops across microservices to complete the user request. This happens within the main extension. The attributes are accessible through the Envoy interface and will later determine whether the trace graph matches the query's filters.

Attribute and Structural Filtering. As RPC calls are made in response to incoming user requests, the WASM extension creates a tree of RPCs in an online fashion, starting at the leaves. This tree is part of the baggage that is propagated between microservices. Once a response for an RPC is processed, the responding microservice is added to the tree as a vertex—effectively, a post-order traversal of the trace. As the tree is created from leaves to root, an isomorphism algorithm [22] is run in a distributed manner at each microservice, finding matches to the structural and attribute patterns specified in the query, using only the information collected thus far. The algorithm executes both structural and attribute filters at the same time.
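
The sketch below illustrates this bottom-up construction in isolation: when a service processes a response, it attaches a vertex for itself on top of the subtrees reported by its callees and then runs the matching step on the partial trace built so far. The types, the in-memory representation of the propagated tree, and the matching stub are assumptions for exposition, not Snicket's generated code.

// Sketch of the per-RPC tree construction described above: as a response
// leaves a microservice, the service's vertex is placed on top of the
// subtrees its callees propagated (a post-order build), and the structural
// and attribute filters are evaluated on the partial trace known so far.
// Types, field names, and the matching stub are illustrative assumptions.

use std::collections::HashMap;

struct PartialTrace {
    service: String,                     // vertex for this microservice
    attributes: HashMap<String, String>, // e.g., workload name, response size
    children: Vec<PartialTrace>,         // subtrees received from callees
}

/// Called when a microservice finishes processing a response.
fn on_response(
    service: &str,
    attributes: HashMap<String, String>,
    callee_subtrees: Vec<PartialTrace>,
) -> PartialTrace {
    let partial = PartialTrace {
        service: service.to_string(),
        attributes,
        children: callee_subtrees,
    };
    // Placeholder for the distributed isomorphism and attribute checks; the
    // real extension only sends a result to storage when the query matches.
    if matches_query(&partial) {
        println!("query matched at {}", partial.service);
    }
    partial
}

/// Stand-in for the compiled structural and attribute filters.
fn matches_query(partial: &PartialTrace) -> bool {
    !partial.children.is_empty()
}

fn main() {
    let leaf = on_response("ratings", HashMap::new(), vec![]);
    let root = on_response("frontend", HashMap::new(), vec![leaf]);
    println!("trace rooted at: {}", root.service);
}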

Developer-Defined Attributes with function(input). Developers can also create new recursively defined attributes. The functions defining these attributes execute once for each vertex: when a response is received from a callee vertex and the caller vertex becomes part of the trace, the caller vertex defines the attribute for itself. The attribute is then added to baggage in the same way as inherent attributes. To create a new attribute, the developer refers to an attribute using parenthesis notation (e.g., height(trace)) within the query and provides the function defining the attribute as part of the query ingested by the compiler.

Return. In the main extension, once it has been determined that the attribute and structural filters are satisfied, the extension sends the result to the storage container. The result is determined by the return statement and sent to the storage container as a (Trace ID, value) pair, in order to distinguish which result came from which trace.

Aggregate. The developer can also define their own functions in the aggregation construct. A developer can create an aggregation function that takes in anything that can be put as a RETURN value and continuously outputs one entry to be put into the storage container. Thus the developer can do some summary computation on return values and store these summaries, instead of storing per-trace values.

In the example scenario in Section 3.1, the input to the aggregation function was latency(trace) and the aggregation function is avg. The main extension sends (Trace ID, latency) pairs to the storage container. But rather than these pairs being written to storage immediately, the aggregation extension on the storage container intercepts them and continuously recomputes a running average of latencies across all pairs seen so far. This running average is maintained within the storage container and updated with every new pair.
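
To complement the earlier aggregation sketch, the sketch below shows, again with illustrative names that are assumptions rather than Snicket's actual interfaces, how the aggregation extension on the storage container could drive a developer-defined aggregation: it intercepts each (Trace ID, value) pair and overwrites a single continuously updated entry instead of storing one value per trace.

// Sketch of the aggregation extension on the storage container: it
// intercepts (trace ID, value) pairs sent by the main extensions and keeps
// one continuously updated entry in storage rather than one entry per
// trace. Names and the in-memory "storage" map are illustrative.

use std::collections::HashMap;

struct AggregationExtension<F: FnMut(f64) -> f64> {
    aggregate: F,                  // developer-defined aggregation function
    storage: HashMap<String, f64>, // stand-in for the storage container
}

impl<F: FnMut(f64) -> f64> AggregationExtension<F> {
    /// Called for every (trace ID, value) pair arriving at the storage
    /// container; overwrites the single aggregated entry.
    fn intercept(&mut self, _trace_id: &str, value: f64) {
        let current = (self.aggregate)(value);
        self.storage.insert("query_result".to_string(), current);
    }
}

fn main() {
    // A running average (sum and count) as the developer-defined function.
    let (mut sum, mut count) = (0.0_f64, 0_u64);
    let mut ext = AggregationExtension {
        aggregate: move |value: f64| {
            sum += value;
            count += 1;
            sum / count as f64
        },
        storage: HashMap::new(),
    };
    ext.intercept("trace-1", 50.0);
    ext.intercept("trace-2", 70.0);
    println!("{:?}", ext.storage); // {"query_result": 60.0}
}
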
4 EVALUATING SNICKET

4.1 Language Expressiveness

In Figure 3, we show various examples of Snicket queries. These queries could include correctness checks, investigation of anomalous data emitted by the application, or debugging of erroneous user requests. Two defining features of Snicket's language that are difficult to express with prior systems are: (1) the ability to match on specific graph structures and (2) the ability to create new developer-defined attributes without restarting applications.

4.2 Interactivity

How long does it take until Snicket's extensions take effect? The accepted method up until 2020 for replacing a WASM extension was through uploading the extension to a Kubernetes config map [21]. However, Kubernetes also makes this config map impossible to overwrite [24]. Hence, every time the contents of the config map (our Snicket extension) are overwritten, the entire service proxy must be restarted. Although this does not affect application functionality, it is slow. Recently, a new way to refresh WASM extensions has been developed [10]. It is still an experimental feature and has limited uses. Using the new method, we improved the extension refresh time from 30–103 seconds to 0.6–0.9 seconds. We hope that as the feature moves beyond experimental, Snicket can become more interactive.

4.3 Cost and Performance

We measure the cost and performance of Snicket by two metrics: (1) the extra compute costs that need to be paid to run Snicket alongside an application, and (2) the latency added by Snicket to user requests.

Application. We use Online Boutique, an application composed of 10 microservices [7]. The only change we made to the application itself is increasing the requested CPU for each pod to 601 millicores (1 millicore is one thousandth of a core); this allows for better performance of the base application.

Cluster Configuration. The cluster is initially given 7 nodes of machine type "e2-highmem-4" (4 vCPUs, 32 GB memory) on the Google Cloud Platform. Horizontal autoscaling is enabled with a threshold of 40 percent CPU utilization [19], and all pods are allowed at most 10 replicas, with the exception of the frontend, which is allowed 30 replicas. None of the experiments reached the autoscaler limits. We enabled the default Kubernetes cluster autoscaler [12].

Load Generator. We use a load generator, Locust [9], to create load on the application. Locust spawns five users every second until it reaches 500 users. The load generator was located in a VM in the same cloud zone as the application. Each user sends a request, waits a random period between 1 and 3 seconds, then sends another request, and so on. During this time, both the horizontal and cluster autoscalers are allowed to stabilize to handle the load. Then, we record latency measurements for 400 seconds and record the extra resources used to run the application. We measure three cases: (1) the application alone, (2) the application with a no-op WASM extension running in each service proxy, which is meant to capture how much overhead is added by deploying WASM extensions at all, and (3) the application with extensions generated by Snicket. We use the query:

MATCH (a)-[]->(b)-[]->(c)
WHERE c.node.metadata.WORKLOAD_NAME='ratings-v1'
RETURN a.node.metadata.WORKLOAD_NAME

Question: Which services are making calls to the cache?
Query:
  MATCH (a) -[]-> (b)
  WHERE b.node.metadata.WORKLOAD_NAME == cache
  RETURN a.node.metadata.WORKLOAD_NAME
Response in storage: A list of all services that called the cache.

Question: How long is latency in a 5-child-wide graph?
Query:
  MATCH (a) -[]-> (b), (a) -[]-> (c), (a) -[]-> (d), (a) -[]-> (e), (a) -[]-> (f)
  RETURN latency(trace)
Response in storage: A list of latencies of traces with >= 5 children.

Question: How many services do requests going through a, b, and c normally go through?
Query:
  MATCH (a) -[]-> (b) -[]-> (c)
  WHERE a.node.metadata.WORKLOAD_NAME == frontend
    AND b.node.metadata.WORKLOAD_NAME == productservice
    AND c.node.metadata.WORKLOAD_NAME == currencyservice
  RETURN median(height(a))
Response in storage: The median of the height of the subtree rooted at frontend that goes through the product and currency services.

Question: Is the header "foo" present in the following trace graph?
Query:
  MATCH (a) -> (b), (a) -> (c), (b) -> (d), (d) -> (e)
  RETURN foo(a)
Response in storage: A list of each trace ID mapped to its foo header.

Figure 3: Examples of Snicket queries.

Results. Without Snicket running, the load forces the autoscaler to create 11 nodes, and the latency has a median of 56 ms (95th percentile: 99 ms). With a no-op extension running, the autoscaler creates 11 nodes, and the latency has a median of 59 ms (95th percentile: 110 ms). In other words, the effect of simply including a WASM extension at all is relatively low. However, an extension generated from an example query forces the autoscaler to create 12 nodes and increases median latency to 73 ms (95th percentile: 130 ms).

Figure 4: A CDF of latencies (percent of requests vs. latency in ms) comparing no WASM extension at all, the no-op WASM extension, and the WASM extension generated by Snicket.

Comparison to other systems. To contextualize this result, it is useful to look at other systems. Dapper, without sampling, has 16 percent latency overhead [23]. Snicket, which pushes computation to the proxies, has about 30% latency overhead (56 ms to 73 ms). While this is a significant increase over Dapper, we believe it is not insurmountably large (§5).

5 FUTURE WORK

Snicket only persists the results of queries on traces in long-term storage—as opposed to the traces themselves. Hence, the data persisted by Snicket in long-term storage is tied closely to the query issued to Snicket. This limits Snicket's historical visibility: if a developer wants to reanalyze historic data with a new query, the developer is limited by the queries used at the time the data was collected. Ideally, the developer should be able to query both current and historic traces with the illusion that all information is always available.

To do so, we are exploring improvements to Snicket that allow us to compactly persist all the information from a trace by exploiting correlations between and across traces to achieve lossless compression of the traces. We also plan to organize this data in a manner that facilitates future queries on historic and current trace data. For this, we will develop indexing mechanisms to easily access information belonging to a particular trace, microservice, or time window. These improvements will augment the online querying abilities of Snicket with the ability to also access historic traces in queries.

We are also looking into optimizations to reduce the 30% latency overhead of Snicket. Because we have measured a version of the application with the empty WASM extension, we know that the overhead to get into and out of the WASM runtime is relatively low (~3 ms) (Figure 4). Hence, the bulk of the latency overhead is imposed by Snicket's autogenerated extensions, and a promising path for optimization is through optimizing the code Snicket generates. We are currently investigating moving more of the extension computation out of the critical path of requests so that minimal overhead is added by the WASM extensions. We are also considering optimizations such as superoptimization [20] of WASM bytecode to produce bytecode that adds low overhead to RPC processing.

6 CONCLUSION

We have presented Snicket, a system for query-guided distributed tracing that compiles developer queries on traces into a distributed collection of extensions. In preliminary experiments, Snicket incurs about 30% latency overhead relative to an application without tracing. We are continuing to develop Snicket to reduce its current sources of overhead and improve its historic visibility into traces.

Acknowledgements. We thank the reviewers for their insightful comments. This research was partially supported by NSF grants CNS-2152313, CNS-1901510, and CNS-2008048.

REFERENCES

[1] Craig Box, Mandar Jog, John Plevyak, Louis Ryan, Piotr Sikora, Yuval Kohavi, and Scott Weiss. 2020. Redefining Extensibility in Proxies - Introducing WebAssembly to Envoy and Istio. https://istio.io/latest/blog/2020/wasm-announce/. Accessed: 2021-06-25.
[2] Cloud Native Computing Foundation. 2021. Envoy: an open source edge and service proxy, designed for cloud-native applications. https://www.envoyproxy.io/. Accessed: 2021-06-25.
[3] Cloud Native Computing Foundation. 2021. Linkerd: The world's lightest, fastest service mesh. https://linkerd.io/. Accessed: 2021-06-25.
[4] Cloud Native Computing Foundation. 2021. Open Source, End-to-End Distributed Tracing. https://www.jaegertracing.io/. Accessed: 2021-06-25.
[5] Rodrigo Fonseca, George Porter, Randy H. Katz, and Scott Shenker. 2007. X-Trace: A Pervasive Network Tracing Framework. In USENIX NSDI.
[6] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In ACM ASPLOS.
[7] Google, Inc. 2021. Online Boutique: a Cloud-Native Microservices Demo Application. https://github.com/GoogleCloudPlatform/microservices-demo/. Accessed: 2021-06-25.
[8] Xiaofeng Guo, Xin Peng, Hanzhang Wang, Wanxue Li, Huai Jiang, Dan Ding, Tao Xie, and Liangfei Su. 2020. Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis. In ACM ESEC/FSE.
[9] Jonatan Heyman, Joakim Hamrén, Carl Byström, and Hugo Heyman. 2021. Locust: An Open Source Load Testing Tool. https://locust.io/. Accessed: 2021-06-25.
[10] Istio. 2021. Distributing WebAssembly Modules (Experimental). https://istio.io/latest/docs/ops/configuration/extensibility/wasm-module-distribution/. Accessed: 2021-06-25.
[11] Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, and Yee Jiun Song. 2017. Canopy: An End-to-End Performance Tracing and Analysis System. In ACM SOSP.
[12] Kubernetes. 2021. kubernetes/autoscaler: Autoscaling components for Kubernetes. https://github.com/kubernetes/autoscaler. Accessed: 2021-10-11.
[13] Pedro Las-Casas, Giorgi Papakerashvili, Vaastav Anand, and Jonathan Mace. 2019. Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering. In ACM SOCC.
[14] Pavol Loffay. 2020. Data analytics with Jaeger aka Traces Tell Us More! https://medium.com/jaegertracing/data-analytics-with-jaeger-aka-traces-tell-us-more-973669e6f848. Accessed: 2021-06-25.
[15] Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. 2015. Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In ACM SOSP.
[16] Neo4j, Inc. 2019. Cypher Query Language Reference, Version 9.
[17] OpenZipkin. 2021. Zipkin. https://zipkin.io/. Accessed: 2021-06-25.
[18] Austin Parker, Daniel Spoonhower, Jonathan Mace, Ben Sigelman, and Rebecca Isaacs. 2020. Distributed Tracing in Practice: Instrumenting, Analyzing, and Debugging. O'Reilly Media.
[19] Google Cloud Platform. 2021. Horizontal Pod Autoscaling | Kubernetes Engine Documentation. https://cloud.google.com/kubernetes-engine/docs/concepts/horizontalpodautoscaler.
[20] Eric Schkufza, Rahul Sharma, and Alex Aiken. 2013. Stochastic Superoptimization. In ACM ASPLOS.
[21] Sebastian Toader. 2020. How to Write WASM Filters for Envoy and Deploy It With Istio. https://banzaicloud.com/blog/envoy-wasm-filter/#create-a-config-map-to-hold-the-wasm-binary. Accessed: 2021-06-25.
[22] Ron Shamir and Dekel Tsur. 1999. Faster Subtree Isomorphism. Journal of Algorithms (1999).
[23] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc.
[24] Joel Smith. 2018. Ensure That the Runtime Mounts RO Volumes Read-Only. https://github.com/kubernetes/kubernetes/pull/58720. Accessed: 2021-06-25.
[25] W3C Working Group. 2021. WebAssembly: a Binary Instruction Format for a Stack-Based Virtual Machine. https://webassembly.org/. Accessed: 2021-06-25.
[26] Robin Whitmore. 2021. How Lightstep Works. https://docs.lightstep.com/docs/how-lightstep-works. Accessed: 2021-06-25.
