
IntroPerf: Transparent Context-Sensitive Multi-Layer Performance Inference using System Stack Traces


Chung Hwan Kim∗†, Junghwan Rhee‡, Hui Zhang‡, Nipun Arora‡, Guofei Jiang‡, Xiangyu Zhang†, Dongyan Xu†

†Purdue University and CERIAS, ‡NEC Laboratories America

†{chungkim,xyzhang,dxu}@cs.purdue.edu, ‡{rhee,huizhang,nipun,gfj}@nec-labs.com

ABSTRACT

Performance bugs are frequently observed in commodity software. While profilers or source code-based tools can be used at the development stage, where a program is diagnosed in a well-defined environment, many performance bugs survive that stage and affect production runs. OS kernel-level tracers are commonly used in post-development diagnosis due to their independence from programs and libraries; however, they lack detailed program-specific metrics to reason about performance problems, such as function latencies and program contexts. In this paper, we propose a novel performance inference system, called IntroPerf, that generates fine-grained performance information – like that from application profiling tools – transparently by leveraging OS tracers that are widely available in most commodity operating systems. With system stack traces as input, IntroPerf enables transparent context-sensitive performance inference, and diagnoses application performance in a multi-layered scope ranging from user functions to the kernel. Evaluated with various performance bugs in multiple open source software projects, IntroPerf automatically ranks potential internal and external root causes of performance bugs with high accuracy, without any prior knowledge about, or instrumentation of, the subject software. Our results show IntroPerf's effectiveness as a lightweight performance introspection tool for post-development diagnosis.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Measurement techniques

Keywords

Performance inference; stack trace analysis; context-sensitive performance analysis

∗ Work done during an internship at NEC Laboratories America, Princeton.

1. INTRODUCTION

Performance diagnosis and optimization are essential to the software development life cycle for quality assurance. Existing performance tools such as profilers [3, 6, 7, 19] and compiler-driven systems [17, 22, 34, 33] are extensively used in application development and testing stages to identify inefficient code and diagnose performance problems at fine granularity. Despite these efforts, performance bugs may still escape the development stage and incur costs and frustration to software users [21].

In a post-development setting, software users investigating performance issues are usually not the developers, who have source code and can debug line by line. Therefore, it is desirable for those users to have a diagnosis tool that works transparently at the binary level, looks into all components in the vertical software layers with a system-wide scope, and pinpoints the component(s) responsible for a performance issue. Furthermore, detailed and context-rich diagnosis reports are always helpful to such users so that they can provide meaningful feedback to the developers, hence speeding up the resolution of performance issues.

Commodity software is commonly built on top of many other software components. For instance, the Apache HTTP server in Ubuntu has recursive dependencies on over two hundred packages for execution and over 8,000 packages to build. Unfortunately, the maintenance of such diverse, inter-dependent components is usually not well coordinated. Various components of different versions are distributed via multiple vendors, and they are integrated and updated by individual users. Such complexity in the maintenance of software components makes the localization of performance anomalies challenging due to the increased chances of unexpected behavior.

OS tracers [1, 2, 15, 27] are commonly used as "Swiss Army knives" in modern operating systems to diagnose application performance problems. These tools enable deep performance inspection across multiple system layers and allow users to spot the "root cause" software component. Given diverse applications and their complex dependencies on libraries and lower layer components, compatibility with all layers is an important requirement. OS tracers are very effective in this aspect because they are free from those dependencies by operating in the OS kernel – below any application software or library.

Figure 1: Main idea of IntroPerf: context-sensitive performance diagnosis using inferred latency from system stack traces.

On the other hand, the tracers' location in the OS kernel makes them lose track of details of program internals such as function latencies and program contexts, which are very useful to localize root cause functions. These tools collect information on coarse-grained kernel-level events; they do not precisely capture the calls and returns of application functions, the tracing of which is often performed by application profilers or dynamic analysis engines.

Recently, OS tracers provide stack traces generated on OS kernel events [1, 15, 27] to improve the visibility into programs. They cover all system layers from applications to the kernel, as shown on the left of Figure 1; therefore, we call these traces system stack traces. While they provide improved software performance views, their usage is mostly within event-based performance analysis. For example, Windows Performance Analyzer provides a summary of system performance computed with the weights of program functions' appearances in the system stack traces. An important performance feature missing for performance analysis is the measurement of application-level function latencies. Since the system stack events are generated not on the boundaries of function calls but on OS kernel events (e.g., system calls), the timestamps of the events do not accurately reflect how long each function executes.

We propose a novel performance inference system, called IntroPerf, that offers fine-grained performance diagnosis data like those from application profiling tools. IntroPerf works transparently by leveraging the OS tracers widely available in most operating systems. Figure 1 illustrates the main idea. With system stack traces as input, IntroPerf transparently infers context-sensitive performance data of the traced software by measuring the continuity of calling context(1) – the continuous period of a function in a stack with the same calling context. The usage of stack traces commonly available from production OS tracers [1] allows IntroPerf to avoid the requirement of source code or modification to application programs while it analyzes detailed performance data of the traced application across all layers. Furthermore, it provides context-driven performance analysis for automating the diagnosis process.

(1) Calling context is a sequence of active function invocations as observed in a stack. We interchangeably use calling context, call path, and stack trace event in this paper because we obtain calling contexts from the stack traces.

Contributions: The contributions of this paper are summarized as follows.

• Transparent inference of function latency in multiple layers based on stack traces: We propose a novel performance inference technique to estimate the latency of program function instances with system stack traces [1] based on the continuity of calling context. This technique essentially converts system stack traces from OS tracers to latencies of function instances to enable fine-grained localization of performance anomalies.

• Automated localization of internal and external performance bottlenecks via context-sensitive performance analysis across multiple system layers: IntroPerf localizes performance bottlenecks in a context-sensitive manner by organizing and analyzing the estimated function latencies in a calling context tree. IntroPerf's ranking mechanism on performance-annotated call paths automatically highlights potential performance bottlenecks, regardless of whether they are internal or external to the subject programs.

Section 2 presents related work. The design of IntroPerf is presented in Section 3. The implementation of IntroPerf is described in Section 4. Section 5 presents evaluation results. Discussion and future work are presented in Section 6. Section 7 concludes this paper.

2. RELATED WORK

Root cause localization in distributed systems: There is a large body of research work on performance analysis and root cause localization in distributed systems [9, 11, 16, 18, 24, 29, 31, 28]. They typically focus on end-to-end tracing of service transactions and performance analysis on transaction paths across distributed nodes. These nodes are connected through interactions such as network connections, remote procedure calls, or interprocess procedure calls. While there is a similarity in the goal of localizing performance anomalies, the performance symptoms in our study do not necessarily involve interactions; instead, they require a finer-grained analysis because the localization target could be any, possibly small, piece of code. For instance, if a performance anomaly is caused by an unexpectedly extended loop count or a costly sequence of function calls, the related approaches will not be able to localize it because all program execution is seen as a single node without differentiating internal code structures. Therefore, the related approaches are not directly applicable to our performance debugging problem due to their coarse granularity.

Profilers: Profilers [3, 6, 19] are tools for debugging application performance symptoms. Many tools, such as gprof, require source code to embed profiling code into programs. However, in a post-development stage these tools are often not applicable because the subject software may be in binary-only form. Also, user-space profilers may not be able to detect the slowdown of lower layer functions, such as system calls, due to their limited tracing scope.

Oprofile [6] provides whole-system profiling via support from recent Linux kernels. Oprofile samples the program counter during execution using a system timer or hardware performance counters. Oprofile's output is based on a call graph, which is not context-sensitive. In addition, the relatively infrequent appearance of lower layer code in the execution may lead to a less accurate capture of program behaviors. In contrast, system stack traces recorded by the OS kernel reliably capture all code layers in a context-sensitive way.

Dynamic binary translators: Dynamic binary translators [23, 26] are commonly used in the research community and in some production profiling tools (e.g., Intel VTune is based on Pin) for binary program analysis. These tools can transparently insert profiling code without requiring source code. However, their significantly high performance overhead makes them suitable mainly for the development stage.

Performance bug analysis: Performance analysis has been an active area in the debugging and optimization of application performance. Jin et al. [21] analyzed the characteristics of performance bugs found in bug repositories, and reported new bugs by analyzing similar code patterns across software. This method requires source code, which is often not available in a post-development stage. StackMine [20] and DeltaInfer [32] are closely related to IntroPerf in their use of run-time information to detect performance bugs. StackMine enables performance debugging by clustering similar call stacks in a large number of reports. StackMine relies on PerfTrack for detecting performance bugs. PerfTrack works by inserting assert-like statements into Microsoft's software products to flag possible performance degradation of functions of interest. This is a very efficient and effective approach for generating warnings when the monitored functions experience a slowdown. However, it requires manual insertion of performance checks at certain locations of the source code, which may become a bottleneck. DeltaInfer analyzes context-sensitive performance bugs [32]. It focuses on workload-dependent performance bottlenecks (WDPB), which are usually caused by loops that incur extended delay depending on the workload. IntroPerf differs from DeltaInfer in several aspects. First, it is less intrusive, since it avoids the source code modification required by DeltaInfer, and thus is suitable for a post-development stage. Second, in code coverage, DeltaInfer mostly focuses on characteristics of the main binary, similar to profilers. IntroPerf, however, is able to cover all layers in the localization of root causes; this is due to its input, system-wide stack traces, which include information regarding the main binary, libraries, plug-ins, and the kernel.

Diagnosis with OS tracers: Production OS tracers are commonly used in modern OSes. DTrace [15] is the de facto analysis tool in Solaris and Mac OS X. Linux has a variety of tracing solutions, such as LTTng [5], Ftrace [2], Dprobe [25], and SystemTap [27]. Event Tracing for Windows (ETW) is the tracing mechanism in Windows supported by Microsoft [1]. These tools are widely used for the diagnosis of system problems. For instance, ETW is used not only by the performance utilities of Windows (e.g., Performance Monitor) but also as an underlying mechanism of other diagnosis tools (e.g., Google Chrome Developer Mode). Stack walking [8] is an advanced feature to collect stack traces on specified OS events; it has been included in ETW since Windows Vista. Other OS tracers, such as DTrace and SystemTap, have similar features. The prototype of IntroPerf is built on top of ETW stack walking, but its mechanism is generic and applicable to other platforms.

Calling context analysis: Calling context has been used in program optimization and debugging [10, 35]. IntroPerf uses the dynamic calling contexts from stack traces to differentiate function latencies in different contexts and enable context-sensitive performance analysis. Several approaches [12, 13, 30] have been proposed to efficiently encode calling context for various debugging purposes. If combined with OS tracers, these approaches would benefit IntroPerf by simplifying the context indexing process.

3. DESIGN OF INTROPERF

In this section, we present the design rationale of IntroPerf for transparent performance inference across multiple software layers. Effective diagnosis of performance bugs in a post-development stage poses several requirements:

• RQ1: Collection of traces using a widely deployed common tracing framework.

• RQ2: Application performance analysis at the fine-grained function level with calling context information.

• RQ3: Reasonable coverage of program execution captured by system stack traces for performance debugging.

In general, profilers and source code level debugging tools provide high precision and accuracy in analysis at the cost of stronger usage prerequisites, such as the availability of source code. In a post-development stage, such requirements may not be easily satisfied. Instead, IntroPerf uses system stack traces from OS tracers, which are widely deployed for software performance analysis. These tracers operate at the kernel layer without strong dependencies on programs or libraries; therefore, IntroPerf can be easily applied as an add-on feature on top of existing tracers without extra effort to instrument or recompile a program.

Metrics of runtime program behavior, such as function latency and dynamic calling context, have been used to analyze program performance issues [3, 7, 19]. This information is typically collected using instrumentation that captures function calls and returns, in profilers and in source code-based approaches. In IntroPerf we aim to obtain fine-grained performance monitoring information without program instrumentation. Instead, such information is inferred from stack traces generated on OS kernel events, which occur at a coarser granularity than program function calls, leading to lower overhead.

The last requirement in the design of IntroPerf is reasonable coverage of program execution by system stack traces. Our observation is that, even though there are inherent inference errors in the performance analysis results due to the coarse granularity of OS-level input events, the accuracy is reasonable for our purpose – performance debugging. This is because performance bottleneck functions with stretched execution time have a higher likelihood of appearing in call stack samples than other functions with short execution time. The inferred results of IntroPerf for such functions hence give high accuracy for analyzing the root causes of performance problems. This observation is intuitive and should apply to any sampling-based approach. Our evaluation results (Section 5) support this claim.

3.1 Architecture

The architecture of IntroPerf is shown in Figure 2. The input is system stack traces, which are stack traces collected by OS tracers. As briefly described in Section 1, these events do not directly indicate function latency – a useful metric for bottleneck analysis – because the timestamps of the events are those of OS events.
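Concretely, the input can be pictured as a time-ordered stream of stack snapshots. The following small Python sketch is ours, not part of the IntroPerf implementation; the field and function names in the example are illustrative.

    from typing import List, NamedTuple

    class StackEvent(NamedTuple):
        """One system stack trace event: a full calling context captured
        when an OS kernel event (e.g., a system call) fires."""
        timestamp: float      # time of the triggering kernel event
        stack: List[str]      # calling context, ordered from "main" to the top frame

    # Example: a single event whose stack spans user code, a library, and the kernel.
    event = StackEvent(12.875, ["main", "request_handler", "apr_stat", "NtQueryAttributesFile"])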

To address this challenge, IntroPerf provides transparent inference of application performance, which is the second block from the left in Figure 2. The key idea of our inference mechanism is the continuity of a function context in the stack: measuring how long a function's context continues without a change in the stack. Essentially, IntroPerf converts system stack traces into a set of function latencies (Section 3.3) along with their calling context information. A calling context shows a specific sequence of function calls from the "main" function; therefore, it is useful for detecting correlations between a performance anomaly and how a function is executed. This context-sensitive performance analysis requires recording and retrieving calling contexts frequently and efficiently; to do that, IntroPerf has a component for dynamic calling context indexing (Section 3.2).

Figure 2: Architecture of IntroPerf.

The third component in Figure 2, context-sensitive performance analysis, determines which functions are performance bottlenecks and in which calling contexts they exist. Our mechanism infers function latency from all layers of the stack in the traces. This leads to potential overlaps of function latencies; the true contributing latency is extracted using the hierarchy of function calls (Section 3.4.1). Our algorithm then ranks the performance of each dynamic calling context and lists the top latency calling contexts and the top latency functions within each context (Section 3.4.2), which form the list of performance bug candidates generated as the output of IntroPerf.

Next, we present the underlying functions that support context-sensitive performance analysis, and then our main idea for the inference of function latency.

3.2 Calling Context Tree Construction and Dynamic Calling Context Indexing

Calling context has been widely used in program optimization and debugging [10, 12, 35, 30]. At runtime there are numerous instances of function calls with diverse contexts, each forming a distinct function call sequence starting from the "main" function. To reason about the relevance between performance symptoms and calling contexts, it is necessary to associate each function call with its indexed dynamic calling context.

Several approaches have been proposed to represent calling context in a unique and concise way [13, 30]. Such approaches require maintaining the context at runtime and therefore demand mechanisms to instrument the code and compute the context on the fly. As an offline analysis, we adopt a simple method to represent a dynamic calling context concisely. We use a variant of the calling context tree (CCT) data structure [10]. By assigning a unique number to the pointer reaching the end of each path, we index each path with a unique integer ID. Here we define several notations to represent this information.

Let a calling context tree be a tree denoted by ⟨F, V, E⟩: F = {f1, ..., fm} is a set of functions at multiple layers in a program's execution. V = {v1, ..., vn} is a set of nodes representing functions in different contexts; note that there can be multiple nodes for one function, as the function may occur in distinct contexts. E = {e1, ..., eo} ⊆ V × V is a set of function call edges; an edge e is represented as ⟨v1, v2⟩ where v1, v2 ∈ V. A calling context (i.e., a call path) pk = ⟨v1, ..., vk⟩ in the CCT is a sequence of function nodes from the root node v1 to a leaf node vk. We use the ending function node to uniquely identify a call path.

The input to IntroPerf is a system stack trace T = ⟨(t1, s1), ..., (tu, su)⟩, a sequence of OS events with stack traces. Each trace event is a pair of (1) a dynamic calling context si = ⟨f1, ..., fk⟩ and (2) the timestamp ti of the OS kernel event (e.g., a system call) which triggered the generation of si. PathToID in Algorithm 1 presents how to generate the nodes and edges of a dynamic calling context tree for si and index it with a unique ID. Each node v has several properties: v.f ∈ F indicates the function that v represents; v.children is the set of edges connecting v and its child nodes; v.id is the node ID number; v.pid is the path ID in case v is a leaf node.

Algorithm 1 Dynamic Calling Context Indexing
si = ⟨f1, ..., fk⟩: a call stack  ▷ order: bottom to top
1: function PathToID(si, CCTw)
2:   Node *v = &root of CCTw
3:   for f in si do
4:     v = getNode(v, f)
5:   if v.pid == −1 then
6:     v.pid = path_counter++
7:     map_path[v.pid] = v
8:   return v.pid
9: function getNode(Node *v, function f)
10:   if ∃v′ | f == v′.f and v′ ∈ v.children then
11:     return v′
12:   else
13:     v′ = new Node(); v′.parent = v; v′.f = f
14:     v.children = v.children ∪ {v′}
15:     v′.id = node_counter++
16:     map_node[v′.id] = v′
17:     return v′
18: function IDToPath(i, CCTw)
19:   Node *v = map_path[i]
20:   s = ∅
21:   while *v != root of CCTw do
22:     s = {v.f} ∪ s
23:     v = v.parent
24:   return s

Once a path is created in the CCT, any path having the same context will share the same ID by traversing the same path of the tree. The pair of the unique path ID and the pointer to the leaf is stored in a hash map, map_path, for quick retrieval of a path, as shown in the IDToPath function. Our context-sensitive performance analysis uses this indexing scheme and the extended CCT as the underlying mechanism and data structure to store and efficiently retrieve intermediate analysis results.

3.3 Inference of Function Latencies

In this section, we describe how we infer the latency of function instances using system stack traces, which include function information across the application, intermediate libraries, and the kernel (e.g., system calls).

Figure 3: Inference of function latencies across multiple layers using stack trace events.

Figure 3 presents the high-level idea of the inference algorithm (Algorithm 2). The sub-figure at the top left shows an example of the latencies between the calls and returns of five function instances: A, B, C, D, and D′ (the prime on the second D indicates its distinct context). Capturing these function call boundaries precisely would require the code instrumentation of profilers. Instead, the input to our algorithm is a set of system stack traces, illustrated as four dotted rectangles at times t1, t2, t3, and t4.

The algorithm infers the latency of function calls based on the continuity of a function context. In Figure 3, we can see that function A continues from time t1 to t4 without any disruption in its calling context; thus, the algorithm considers this whole period to be its latency. On the other hand, function D experiences a change in its context. During t1 and t2, it was called in the context A->B->D. However, at t3 its context changes to A->C->D, leading to a discontinuity of its context even though function D stays at the same stack depth. Note that this discontinuity propagates towards the top of the stack: if there were any further calls from D in this case, their contexts would be disrupted all together. The algorithm scans the stack trace events in temporal order and tracks the continuity of each function in the stack frame along with its context.

Our approach infers function latencies in two modes, conservative estimation and aggressive estimation, as illustrated in Figure 3. The conservative mode estimates the end of a function with the last event of its context, while the aggressive mode estimates it with the start event of a distinct context. In our analysis, we mainly use the conservative mode. However, we also observe that these two modes are complementary when context change is frequent.

Algorithm 2 Function Call Latency Inference
T: a system stack trace, si: a call stack at time ti
1: function inferFunctionInstances(Trace T, CCTw)
2:   for (tk, sk) in T do
3:     newStackEvent(tk, sk, CCTw)
4:   for d in (0, ..., |Register|) do
5:     closeRegister(tlast, d)
6: function newStackEvent(tk, sk, CCTw)
7:   initialize ThisStack, LastStack, Register
8:   AllNew = 0, d = 0
9:   pid = PathToID(sk, CCTw)
10:   for f in sk do
11:     isNew = (f != LastStack[d])
12:     if AllNew == 1 then
13:       isNew = 1  ▷ Context change propagates.
14:     if isNew == 1 then
15:       AllNew = 1  ▷ Context change triggered.
16:     ThisStack[d++] = (f, d, isNew)
17:   if |LastStack| > |ThisStack| then
18:     i = |LastStack| − 1
19:     while i >= |ThisStack| do
20:       closeRegister(tk, i)
21:       remove LastStack[i−−]
22:   reverse ThisStack
23:   for (f, d, isNew) in ThisStack do
24:     if isNew == 1 then
25:       closeRegister(tk, d)
26:       nid = getNodeIDfromCCT(pid, d, |sk|, CCTw)
27:       Register[d] = [tk, tk, f, nid, pid]
28:     else
29:       Register[d][1] = tk  ▷ The end timestamp
30:     LastStack[d] = f
31:   tlast = tk
32: function closeRegister(tk, d)
33:   (ts, te, f, nid, pid) = Register[d]
34:   ta = tk − ts; tc = te − ts
35:   newFunctionInstance(f, pid, nid, ta, tc)
36:   remove Register[d]

Table 1: Inference of the function instances of the example shown in Figure 3. I: isNew, A: AllNew, L: LastStack, T: ThisStack[0], R: Register[0-2].

          |          t1          |          t2          |          t3          |          t4
    Depth | I A L T R            | I A L T R            | I A L T R            | I A L T R
    0     | 1 1 - A (t1,t1,A)    | 0 0 A A (t1,t2,A)    | 0 0 A A (t1,t3,A)    | 0 0 A A (t1,t4,A)
    1     | 1 1 - B (t1,t1,B)    | 0 0 B B (t1,t2,B)    | 1 1 B C (t3,t3,C)    | 0 0 C C (t3,t4,C)
    2     | 1 1 - D (t1,t1,D)    | 0 0 D D (t1,t2,D)    | 1 1 D D (t3,t3,D)    | - - D - -
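The essence of Algorithm 2 can be condensed once the per-depth isNew/AllNew propagation is recast as finding the longest unchanged stack prefix. The Python sketch below is ours (the pid/nid bookkeeping of the CCT is omitted for brevity); it emits one instance per inferred function execution with both latency estimates.

    def infer_function_instances(trace):
        """trace: time-ordered (t, stack) pairs; stacks are bottom-to-top."""
        instances = []   # (function, depth, t_aggressive, t_conservative)
        register = []    # per depth: [t_first_seen, t_last_same_context, function]
        last_stack, t_last = [], None

        def close(d, t_now):
            t_start, t_end, f = register[d]
            # conservative: last event of the same context (t_end);
            # aggressive: first event of the distinct context (t_now)
            instances.append((f, d, t_now - t_start, t_end - t_start))

        for t, stack in trace:
            # depth up to which the calling context is unchanged; a change at
            # one depth invalidates everything above it (AllNew propagation)
            same = 0
            while (same < min(len(stack), len(last_stack))
                   and stack[same] == last_stack[same]):
                same += 1
            for d in reversed(range(same, len(register))):
                close(d, t)                        # discontinued contexts end here
            del register[same:]
            for d in range(same):
                register[d][1] = t                 # context continues: extend end time
            for d in range(same, len(stack)):
                register.append([t, t, stack[d]])  # newly appearing frames
            last_stack, t_last = list(stack), t

        for d in reversed(range(len(register))):
            close(d, t_last)                       # the trace end closes the rest
        return instances

Running this on the four snapshots of Figure 3 reproduces Table 1: A yields (t1, t4); B and the first D instance yield (t1, t2) in the conservative mode; C and the second D instance yield (t3, t4) and (t3, t3), respectively.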

Now we present how Algorithm 2 processes the illustrated example in detail, step by step. The states of the variables at each step of the algorithm are shown in Table 1. Function inferFunctionInstances calls newStackEvent for each stack event. This function tracks the continuity of a function context by keeping each frame of the stack in a register for as long as the context continues. isNew is a boolean value showing whether a function newly appears or not (Line 11). Table 1 shows that isNew (shown as I) is set to 1 for all stack levels at time t1 due to their initial appearance. Then, at time t2, the values become 0 due to the identical stack status. The durations of the context continuity of the currently active functions in the stack are tracked using the Register array (shown as R in the table). On a new function, its registration is performed in Line 27. As the function's context continues over time, its duration in Register is updated accordingly (Line 29). For instance, three function instances are registered at time t1, as shown in the R entries under t1 in Table 1. At time t2, the durations of all three functions are updated from t1 to t2 (the R entries under t2).

If a discontinuity of a function context is detected, the contexts of the functions at the higher levels of the stack must all be discontinued because they belong to a different context. This is enforced through isNew's status (Lines 12-15). Function D appears twice (at times t2 and t3) at the same stack depth, but the two appearances have different contexts and hence are considered distinct instances.

Function returns are inferred at the moments when the context changes or at the end of the trace. When that happens, the registered functions are closed by calling closeRegister (Lines 5, 20, 25). Inside this function the new function instance is created by newFunctionInstance.

The inferred function latencies are associated with their contexts by recording the corresponding function node in the calling context tree. When a function is registered, the corresponding CCT node ID (nid) is obtained using the current context (pid) and the depth (d) of the given function in the context (Lines 26-27). Later, when this function is removed from Register and stored as an inferred function instance (Lines 32-36), the CCT node ID (nid) is recorded along with the latency.

The inferred latency of a program, L, is a set of function latency instances l ∈ L, where l = ⟨f, pid, nid, ta, tc⟩: ta and tc are the aggressive and conservative estimations of the function latency, respectively; nid is a function node ID in the CCT (for the node v ∈ V, v.id = nid); pid is the ID of the call path that this node belongs to; and f ∈ F is the function ID that this node represents.

We manage the extracted function latency instances in two ways. First, we keep the inferred latencies in a list to enable the examination of individual cases on a timeline by the developers. Second, we aggregate the instances in the CCT so as to have an aggregated view of the function latencies accumulated in each context. This is performed using the nid field, which directly points to the corresponding node in the CCT through the data structure map_node of Algorithm 1. A function node v is extended with three additional fields: v.C is the number of function instances accumulated in the node, and v.µa and v.µc are the average function latencies in the aggressive and conservative modes, respectively. Let L′ be the subset of L whose estimated function instances belong to the function node vj with node ID j, i.e., L′ = {l | l ∈ L, l.nid = j}. vj's average function latency is computed as follows:

    vj.µ = (1 / vj.C) · Σ_{l ∈ L′} l.t

The operations on function latency are applied to both estimation modes, and we use one expression in the following description for brevity.

We call this extended CCT with the accumulated function latencies a performance-annotated calling context tree (PA-CCT). We will use it in the next stage to localize performance bugs.
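Folding the inferred instances into the PA-CCT is then a single pass. The sketch below is ours and assumes the Node class of the earlier indexing sketch has gained the three fields described above (C, mu_a, mu_c, initialized to zero); the running-mean update is equivalent to the average vj.µ just defined.

    def annotate_pa_cct(cct, instances):
        """instances: iterable of (f, pid, nid, t_aggressive, t_conservative)."""
        for f, pid, nid, ta, tc in instances:
            v = cct.map_node[nid]          # the function node for this exact context
            v.C += 1
            v.mu_a += (ta - v.mu_a) / v.C  # running mean of aggressive latencies
            v.mu_c += (tc - v.mu_c) / v.C  # running mean of conservative latencies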

3.4 Context-Sensitive Analysis of Inferred Performance

IntroPerf provides a holistic and contextual view of the performance of all layers of a program by associating the inferred function latencies with the calling contexts available in the stack traces. We use this information to localize the root causes of performance bugs at the function level along with their specific calling contexts, which is valuable for understanding the bugs and generating patches. In this section we present how to automatically localize a potential performance bottleneck in the PA-CCT constructed in the previous stage.

To determine the likely root causes of performance symptoms in terms of latency and context, we process the inferred function latencies in several steps. First, we normalize the function latencies at multiple layers to remove overlaps across call stack layers and extract the true contributing latencies in a top-down manner. Second, we present a calling context-based ranking mechanism which automatically localizes potential root cause contexts and the bottleneck functions within those contexts.

Algorithm 3 Top-Down Latency Normalization and Differential Context Ranking
1: function LatencyNormalization(Node *v)
2:   µchildren = 0
3:   for vi ∈ v.children do
4:     LatencyNormalization(vi)
5:     µchildren += vi.µ
6:   v.µown = v.µ − µchildren
7: function getPathSet(CCTw)
8:   return getPath(the root of CCTw, [])
9: function getPath(Node *v, p)
10:   P = ∅
11:   if v.children = ∅ then
12:     P = P ∪ {p · v}
13:   else
14:     for vi ∈ v.children do
15:       P = P ∪ getPath(vi, p · v)
16:   return P
17: function DiffRankPaths(N, CCTbase, CCTbuggy)
18:   Pbase = getPathSet(CCTbase)
19:   Pbuggy = getPathSet(CCTbuggy)
20:   for p′ ∈ Pbuggy do
21:     if ∃p ∈ Pbase | p′ ≡ p then
22:       c = Σ of the µ differences of equivalent nodes in p′ and p
23:     else
24:       c = Σ of the µ of each node in p′
25:     ∆P.append(c, p′)
26:   for (c, p′) ∈ ∆P do
27:     sort all function nodes in p′ with regard to v.µ
28:     annotate the rank of each node v in p′ in v.rank
29:   sort all paths in ∆P with regard to c
30:   return top N paths of ∆P

3.4.1 Top-Down Latency Normalization

IntroPerf estimates the latency of all function instances (i.e., the duration between their invocation and return) in the call stack. While this estimation strictly follows the definition of function latency, raw latencies can be misleading for identifying inefficient functions because there are overlaps in execution time across multiple stack layers, as illustrated in Figure 4. With raw latencies, function A would be determined to be the slowest function because its duration is the longest. Without properly offsetting the latencies of child functions, top level functions (e.g., "main") that stay at the bottom of the call stack would always be considered expensive. Note that this challenge does not occur in profilers, whose approaches are bottom-up: they sample the program counter and always account for latency at the low, shaded levels in Figure 4.

Figure 4: Top-down latency normalization. Non-overlapped latency is shown in shade.

To remedy this problem, we use the association between callers and callees in the CCT. As shown in Figure 4, the latencies of callees always overlap with the latency of their caller. This relationship can be expressed as the following formula. Let V′ = {vi | ⟨vj, vi⟩ ∈ vj.children}, and let the non-overlapped latency of vj be vj.µown:

    vj.µ = vj.µown + Σ_{vi ∈ V′} vi.µ

Based on this observation, we address the above issue by recursively subtracting the latencies of callee functions from their callers in the PA-CCT. Given the root node of a PA-CCT, function LatencyNormalization (Lines 1-6 in Algorithm 3) recursively traverses the entire tree and subtracts the sum of the child node latencies from the parent's, leaving the latency that truly contributes to the execution of that function.

3.4.2 Performance-Annotated Calling Context Ranking

In this section, we present how IntroPerf automatically localizes the likely cause of performance problems. We use the runtime cost of executed code, estimated by the latency inferred from system stack traces, to localize the root cause. While this approach is able to localize hot spot symptoms, there is an additional challenge in finding out which code region should be fixed, because program semantics need to be considered. For instance, the root cause of a high latency could be a bug triggered inside a function. It is also possible, however, that the root cause lies in other related functions, such as a caller of the hot spot function that introduces the symptom through its invocation parameters. Our validation cases in Section 5.1 indeed show that in some applications the finally patched functions are away from the hot spots by a couple of frames in the call stack.

Our approach therefore ranks hot calling contexts, which expose closely related functions in the call stack, such as callers and callees, in addition to the hot function, in the particular context where a high latency is triggered. The invocation relationships in calling contexts allow developers to inspect neighboring functions in the call stack that may have an impact on the hot spot function, and to find the most suitable code region to patch, especially when complex code semantics are involved.

To locate the hot spot calling contexts, we generate a set of call paths by traversing a CCT and rank them. Let Ps be the set of paths observed in an execution s. Function getPathSet in Algorithm 3 recursively traverses the CCT and generates Ps. As the algorithm moves from a node to its child node, the path is updated by concatenating the function node (p · v). When the algorithm reaches a leaf node, which has no children, it stores the path from the root to the current node in the path set (Ps).

One challenge in ranking hot spots to investigate performance bugs is that some code is inherently CPU intensive and is ranked high regardless of workload changes (e.g., crypto or compression functions). While such code regions need careful inspection for optimization, their behavior is not unusual from the developers' perspective because it reflects the characteristics of the program. Performance bugs reported in the community typically describe unexpected symptoms that are triggered by a particular input or workload (e.g., a large input file). In that sense, such inherently costly code is less important for our purpose of determining the root cause of a performance bug symptom.

To address this problem and improve the quality of the ranking output, we employ a systematic differential method in the ranking of calling contexts. The method uses two sets of CCTs produced from workload samples under two different inputs: a base input with which the program was tested and confirmed to work as expected, and a buggy input which a user discovered and reported to exhibit a performance bug symptom (e.g., a moderately sized input and a large input). By subtracting the inferred execution summary of the base input from that of the buggy input, we rank the hot spots sensitive to the buggy input higher and reduce the significance of the commonly observed hot spots. To apply this technique, we first need to define the equivalence of paths across multiple workloads.

Let there be two paths which respectively belong to P1 and P2: pk = ⟨v1, ..., vk⟩, pk ∈ P1 and p′k = ⟨v′1, ..., v′k⟩, p′k ∈ P2. We define the two paths to be equivalent (pk ≡ p′k) if the functions represented by the two paths are identical and in the same order. The differential cost c of a path is calculated as follows, where P1 and P2 correspond to a base input and a buggy input, respectively. When comparing the two sets of paths, a new context may appear due to the buggy input; in such a case, we consider the latency under the base workload to be zero and use only the context of the buggy input:

    c = Σ_{x=1}^{k} (v′x.µ − vx.µ)   if ∃pk ∈ P1, ∃p′k ∈ P2 with pk ≡ p′k
    c = Σ_{x=1}^{k} v′x.µ            otherwise

The paths are ranked using this cost, and the top N ranked paths are listed for evaluation (function DiffRankPaths in Algorithm 3). The number of dynamic calling contexts can reach tens of thousands, even though only a partial workload is sampled, depending on the complexity of the workload. The context ranking significantly reduces analysis effort by limiting the scope to a small set of hot calling contexts. Furthermore, for each hot calling context we provide the ranking of function nodes, illustrating the hot functions inside the path.
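Putting the two steps together, a compact Python rendering of Algorithm 3 could look as follows. This is a sketch over the PA-CCT classes introduced earlier; v.mu stands for either per-mode average latency, and the helper names paths_of and key_of are ours.

    def normalize(v):
        """Top-down latency normalization: leave in v.mu_own only the
        latency v itself contributes, excluding its callees' overlap."""
        child_sum = 0.0
        for c in v.children.values():
            normalize(c)
            child_sum += c.mu
        v.mu_own = v.mu - child_sum

    def paths_of(root):
        """All root-to-leaf paths of a CCT, each as a list of nodes."""
        out, work = [], [(root, [])]
        while work:
            v, prefix = work.pop()
            path = prefix + [v]
            if not v.children:
                out.append(path)
            for c in v.children.values():
                work.append((c, path))
        return out

    def key_of(path):
        # path equivalence: the same functions in the same order
        return tuple(v.f for v in path)

    def diff_rank_paths(n, cct_base, cct_buggy):
        base = {key_of(p): p for p in paths_of(cct_base.root)}
        ranked = []
        for p in paths_of(cct_buggy.root):
            q = base.get(key_of(p))
            if q is not None:    # an equivalent path exists under the base input
                cost = sum(v.mu - w.mu for v, w in zip(p, q))
            else:                # new context: base latency counts as zero
                cost = sum(v.mu for v in p)
            ranked.append((cost, p))
        ranked.sort(key=lambda cp: cp[0], reverse=True)
        return ranked[:n]

    # usage: normalize both trees first, then rank
    # normalize(cct_base.root); normalize(cct_buggy.root)
    # top = diff_rank_paths(10, cct_base, cct_buggy)

The intra-path function ranking of Algorithm 3 then amounts to sorting each returned path's nodes by their (normalized) latency.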

Later, in Section 5 (Figure 7), we present a few cases of real world performance bugs with the top-N% hot calling contexts and hot functions in ranked order as heat-maps (i.e., color-mapped tables with rows for distinct contexts, columns for the functions within a path, and colors for function latencies). They provide a concise and intuitive view of context-sensitive hot spots and assist developers by automatically narrowing their focus from a massive number of dynamic calling contexts down to a few highly ranked code contexts.

4. IMPLEMENTATION

IntroPerf is built on top of a production tracer, Event Tracing for Windows (ETW) [1] – a tracing facility available on Windows since Windows 2000 and used by management tools such as Resource Monitor. We use ETW to generate system stack traces, which are the stack dumps generated when kernel events happen, and which thus include a range of function information from user functions to kernel functions. Stack traces are generated on a selection of kernel events specified as input to the framework in its stack walking mode. We use two configurations: (1) system call events, and (2) system call + context switch events. In addition, we include several miscellaneous events in the tracing (e.g., process, thread, and image loading events) without stack walking. These events are included, as a common practice, to disclose the execution status during tracing, such as the loaded program images and the memory addresses of the kernel, which are necessary to understand the trace.

The front-end of IntroPerf parses the ETL (Event Trace Log, the output format of ETW) files and extracts both kernel and user space function information, which is sliced per process and thread. The back-end consists of several stages of programs performing the construction of the calling context tree, the inference of function latencies, and the ranking of the performance-annotated CCT. The entire framework, including both the front-end and the back-end, has a total of 42K lines of Windows code written in Visual C++. All experiments are done on a machine with an Intel Core i5 3.40 GHz CPU, 8GB RAM, and Windows Server 2008 R2.

In the presentation of our evaluation, we use the debugging information in the program database (PDB) format for convenience of illustration and validation of the data. However, it is not a requirement for our mechanism, because our framework can instead represent instructions (e.g., function entries) by their offsets within the binary. As an example usage scenario, IntroPerf can generate detailed bug localization reports on problematic performance symptoms based only on the binary packages at users' sites. If the user determines that a root cause belongs to a software component, he/she can report it to the vendor. The developers on the vendor side should have the debugging information that is stripped from the release; they can interpret the details of IntroPerf's report with function names using the symbolic information.

The Visual Studio compiler produces PDB files for both executable files (e.g., *.exe) and libraries (e.g., *.dll), regardless of whether the debug or release mode is used. Most Microsoft software products (e.g., Internet Explorer, the Windows kernel, and the low level Windows subsystems) provide such information, which can be retrieved automatically from the central symbol server with a configuration. Therefore, interpreting the software layers of Windows, including the kernel and the GUI system, is straightforward even though Windows is a proprietary OS. Most open source projects make their symbol files available in addition to the binary packages to assist debugging; if not, this information can easily be generated by compiling the code.

Currently ETW does not support the stacks of virtual machines (e.g., Java) or interpreters due to its limited parsing capability for non-native stacks. Supplemental stack interpretation layers, such as that of jstack [4], would resolve this problem.
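For reference, the kind of collection described above can be driven with the stock ETW front-end, xperf, from the Windows Performance Toolkit. The snippet below is our illustration of a typical invocation – the exact flag set and file name are assumptions, not the paper's scripts – wrapped in Python to keep the examples in one language.

    import subprocess

    # Start kernel tracing with stack walking on system calls and context
    # switches (configuration (2) above); process/thread/image-load events
    # are included without stack walking, as described in the text.
    subprocess.run(["xperf", "-on", "PROC_THREAD+LOADER+SYSCALL+CSWITCH",
                    "-stackwalk", "SyscallEnter+CSwitch"], check=True)

    # ... run the workload that reproduces the performance symptom ...

    # Stop tracing and merge the kernel buffers into an ETL file for the
    # IntroPerf front-end to parse.
    subprocess.run(["xperf", "-d", "trace.etl"], check=True)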

5. EVALUATION

In this section, we evaluate several aspects of IntroPerf experimentally. The key questions in the evaluation are:

• How effective is IntroPerf at diagnosing performance bugs?

• What is the coverage of program execution captured by system stack traces?

• What is the runtime overhead of IntroPerf?

5.1 Localizing Root Causes of Performance Bugs

IntroPerf enables transparent performance introspection of multiple software layers, including the relative performance cost of individual functions in each specific context. This information is valuable for developers to understand "where" and "how" (in terms of function call sequences) performance bugs occur, and eventually to determine the most suitable code regions to fix. In this section, we evaluate the effectiveness of IntroPerf in localizing the root causes of performance bugs in a set of open source projects which are actively developed and used.

Evaluation setup: We selected open source projects with various characteristics – server programs (the Apache HTTP web server and the MySQL database server), a desktop application (7zip, a file compressor/decompressor), and a low-level system utility (ProcessHacker, an advanced version of Task Manager) – to highlight the generic applicability of IntroPerf. The life spans of these projects range from five to eighteen years as of 2014, with continuous development and improvement of the code due to their popularity and user bases.

Input to experiments: As the input cases to be analyzed, we use performance bug symptoms reported by users. We checked the forums of the projects where users post their complaints about performance issues, and collected cases which include instructions to trigger the issues. In addition to the workload described to trigger each symptom, we created another workload with a similar input on a smaller scale as a base input, to offset costly code which is not closely relevant to the symptom.

IntroPerf analyzes bug symptoms without any prior knowledge; thus it is capable of identifying the root causes of new bugs. Such a process, however, typically requires non-trivial turn-around time for evaluation by the developers and patch generation. While our long-term evaluation plan includes finding un-patched bugs, in this paper, for validation purposes, we use cases whose patches are available, to compare against IntroPerf's output derived with zero knowledge and thereby evaluate its efficacy.

Performance-annotated calling context tree (PA-CCT): A calling context tree provides a very informative view for developers to understand the root cause by distinguishing the overhead along distinct call paths. IntroPerf creates this view across all layers by overlaying the estimated function latencies from system stack traces. Figure 5 illustrates this holistic view of the dynamic calling contexts of the Apache 45464 case, with hundreds of distinct call paths. Due to its complexity, we show a zoomed-in view in Figure 6, around a costly function call path of Figure 5. Each node represents a function and each edge a function call. We use two workloads in a combined view showing the edges from two PA-CCTs: the workloads with a large input and a small input are illustrated as thick red arrows and thin blue arrows, respectively. The thickness of the arrows represents the inferred latency, showing a clear contrast between the performance of the two workloads. Note that a PA-CCT shows a more informative view than a program CCT due to the distinct calling contexts in all software component layers.

Figure 5: Dynamic calling contexts of the Apache 45464 case showing hundreds of distinct call paths.

Figure 6: A zoomed-in view of the PA-CCTs of the Apache 45464 case.

The details of the PA-CCTs in the evaluated cases are presented in Table 2. Columns 3, 4, 5, and 6 show the runtime program characteristics captured in the stack traces: the number of loaded program binaries and libraries (|I|), the number of distinct dynamic calling contexts (|P|), the total number of functions present in the stack (|F|), and the average length of the paths (l).

Costly calling contexts and functions: While human inspection of a PA-CCT is useful for analysis on a small scale, for a PA-CCT of non-trivial size, as in Figure 5, the amount of detail would be overwhelming for a manual approach. Hence IntroPerf provides automated ranking of costly (i.e., hot) calling contexts. Algorithm 3 ranks call paths using their runtime cost, which is the sum of the functions' aggregated inferred latencies.

In this evaluation, we use two metrics: hot calling contexts (i.e., paths) and hot functions within the paths. To validate how effective IntroPerf is, we compare IntroPerf's results with the patch (considered as the ground truth) and present how closely IntroPerf's metrics match the ground truth using two distance metrics. The path distance, pmin, represents the similarity between the bottleneck path and the root cause path in the ground truth, using the top (minimum) rank of the paths that include a patched function. The function distance, fmin, shows the effectiveness of IntroPerf at the finer function level, using the minimum number of edges between the most costly function node and a patched function node within a path.

The comparison result is presented in columns 7-10 of Table 2. As in-depth information, the top ranked costly calling contexts are further illustrated as heat-maps in Figure 7, where each row represents the cost of one path, and the top row is the call path (context) with the highest estimated latency. The latency of an individual function is represented in colors (log10 scale), showing the bottlenecks in red, and the patched functions (ground truth) are marked with circled "P"s. To ease the identification of a bottleneck within a path, the function with the highest latency is marked 0 and the adjacent functions are marked with their distance from it.

In all cases, the data shown in Table 2 and illustrated in Figure 7 confirm that the bottleneck contexts ranked by IntroPerf effectively cover the patched functions, although the numbers vary with the characteristics of the program and the performance symptoms. For instance, a patch may be placed in a hot function if the bottleneck is in that function; on the other hand, a patch may be placed in its caller if the caller's parameters are anomalous. Note that in both cases IntroPerf offers an opportunity to correlate the bottleneck and the patch using calling context performance ranking. It enables developers to easily check the adjacent functions around the bottleneck function in its context and to reason about the suitable program points for patches.

5.2 External Root Causes of Performance Bugs

So far we have presented cases where the root causes of the performance bug symptoms are present inside the program. In a post-development stage, software components are installed, integrated, and updated in an uncoordinated manner, and if any of the multiple layers has a problem, it will impact the overall performance of the software. Note that this problem is beyond the typical scope of traditional debugging or testing techniques, because the analysis requires the capability to inspect multiple software layers that are separately built.

Since IntroPerf uses system stack traces, which include information on multiple system layers, it is not limited to the analysis of the software's internal code but is able to address performance problems external to the program binary as well. In this section, we evaluate potential cases in which a dependent but external software component, such as a plug-in or an external library, causes performance issues. This shows the unique advantage of IntroPerf: introspecting all vertical layers and identifying the source of the performance impact automatically, without requiring source code.

This type of performance bug is in general not well studied, because the root causes may depend on the integration and deployment of external components, including the system environment. Such performance anomalies tend to be quite common in the field. However, because of the currently limited documentation of real cases, and due to time constraints, we choose one of the components on which the target software depends and manually inject a time lag into it.

Table 2: Evaluation of IntroPerf on the root cause contexts of real world and injected performance bugs.
Program Bug Program Characteristics IntroPerf Evaluation Internal/ Ground
Name ID |I| |P | |F | l pmin fmin Root Cause Binary Root Cause Function External Truth
Apache 45464 29 319 712 40.96 1 36 libapr-1.dll, Internal Library apr_stat Internal Patch
MySQL 15811 36 1051 1275 31.22 1 0 mysql.exe, Main Binary strlen Internal Patch
MySQL 49491 13 144 368 33.71 3 5 mysqld.exe, Main Binary Item_func_sha::val_str Internal Patch
ProcessHacker 3744 23 2704 1172 49.34 1 0 ProcessHacker.exe, Main Binary PhSearchMemoryString Internal Patch
7zip S1 28 1160 1182 72.21 11 16 7zFM.exe, Main Binary CPanel::RefreshListCtrl Internal Patch
7zip S2 33 1793 1496 59.61 3 16 7zFM.exe, Main Binary CPanel::RefreshListCtrl Internal Patch
7zip S3 22 656 819 58.78 1 15 7zFM.exe, Main Binary CPanel::Post_Refresh_StatusBar Internal Patch
7zip S4 30 1002 1274 55.95 2 16 7zFM.exe, Main Binary CPanel::RefreshListCtrl Internal Patch
ProcessHacker 5424 25 1488 978 40.60 1 54 ToolStatus.dll, Plug-in MainWndSubclassProc External Patch
ProcessHacker - 26 1241 906 41.56 1 0 ToolStatus.dll, Plug-in NcAreaWndSubclassProc External Inject
Internet Explorer - 92 18716 6168 71.64 1 0 MotleyFool.dll, Toolbar Plug-in CMFToolbar::GetQuote External Inject
Miranda - 42 1032 1245 52.40 1 0 Yahoo.dll, Plug-in upload_file External Inject
Apache - 14 77 302 26.12 3 0 mod_log_config.so, Plug-in ap_default_log_writer External Inject
Apache - 18 96 331 25.89 2 0 mod_deflate.so, Plug-in deflate_out_filter External Inject
VirtualBox - 39 1288 1031 39.36 1 0 QtCore4.dll, External Library QEventDispatcherWin32::processEvents External Inject

it. Injection of the code enables the validation by knowing periments on Pin. Therefore, such events are simulated by
the ground truth of the fault. In the long term, we expect taking stack snapshots for every time quantum in the offline
to study real cases in future work. trace analysis.
Evaluation setup: For evaluation, we selected the We evaluate three configurations regarding the kernel events
cases where the source of the bottleneck is separate from to generate system stack traces: (1) system calls, (2) system
the main software binary. Such external modules are inte- calls and low rate context switches, and (3) system calls and
grated on the deployment site or optionally activated such high rate context switches. The context switch quantum
as independent libraries which the main software is relying in Windows systems vary from 20 ms to 120 ms depending
on or plug-ins implemented as dynamic libraries. In addi- on scheduling policies and the configuration of the system
tion to one real case of ProcessHacker, we have six delay- [14]. We use 120 ms as the low rate of context switch in the
injection cases: ToolStatus is a plug-in for ProcessHacker. second configuration as a conservative measure. We also
MotleyFool is a toolbar for Internet Explorer. Miranda is evaluate the higher context switch rate (20 ms interval) in
a multi-protocol instant messenger and we selected the file the third configuration. These three configurations present
transfer module of the Yahoo messenger plug-in. Virtual- the views of stack traces in three sampling rates. Note that
Box relies on several libraries for graphics and multimedia. the dynamic translation framework is slower than the native
We injected a bottleneck in the Qt library. Also we injected execution having an effect similar to executing on a slow ma-
bottlenecks in the mod_log_config and mod_deflate mod- chine. Thus we mainly focus on the relative comparison of
ules of Apache HTTP server which log and compress user the three configurations.
requests respectively. We evaluate the coverage of stack traces in two criteria.
Ranked costly contexts: The bottom 7 rows of Table 2 show the results of IntroPerf's capability of localizing the external root causes of performance bottlenecks. In all cases, IntroPerf successfully identified the root cause of the injected bottleneck with high accuracy.

5.3 Coverage of System Call Stack Traces
The system stack trace is a collection of snapshots of the program stack sampled on OS kernel events. Therefore, the sampled calling contexts and function instances are limited by the frequency of the kernel events (e.g., system calls and context switches). However, we found that this coarse granularity suffices for the diagnosis of performance anomalies, based on a critical observation: the quality of this sampled view improves as the latency of a function increases. For example, with a 120 ms sampling quantum, a function instance that runs for 600 ms appears in roughly five consecutive snapshots, whereas a 5 ms instance is likely missed entirely. This property leads to a larger number of appearances of function contexts and instances in the stack traces, and accordingly to higher inference accuracy. In this section, we experimentally confirm this hypothesis.
Evaluation method: The necessary information for this experiment is the fine-grained function call instances and the kernel events which trigger the generation of system stack traces. We used a dynamic translator, Pin [23], to capture function calls and returns. System stack traces are generated by taking snapshots of the stack on kernel events. System calls are captured by Pin because they are triggered by the program. However, context switches are driven by the kernel; hence, they are hard to capture in user-mode experiments on Pin. Therefore, such events are simulated by taking stack snapshots for every time quantum in the offline trace analysis.
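As a concrete illustration of this simulation step, the sketch below (ours, not the paper's tool; the trace record format is an assumption) replays a complete call/return trace and emits a stack snapshot whenever a scheduling quantum expires:

```python
def simulate_snapshots(trace, quantum_ms):
    """Replay (timestamp_ms, event, function) records sorted by time and
    take a stack snapshot every time a scheduling quantum expires."""
    snapshots = []
    stack = []
    next_sample = quantum_ms
    for ts, event, func in trace:
        while ts >= next_sample:          # one or more quanta expired
            snapshots.append((next_sample, tuple(stack)))
            next_sample += quantum_ms
        if event == "call":
            stack.append(func)
        elif event == "return" and stack:
            stack.pop()
    return snapshots

# A 600 ms function sampled with a 120 ms quantum appears in five snapshots.
trace = [(0, "call", "main"), (10, "call", "slow_fn"), (610, "return", "slow_fn")]
for ts, stack in simulate_snapshots(trace, 120):
    print(ts, "->", "/".join(stack))
```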
We evaluate three configurations regarding the kernel events that generate system stack traces: (1) system calls; (2) system calls and low-rate context switches; and (3) system calls and high-rate context switches. The context switch quantum in Windows systems varies from 20 ms to 120 ms depending on the scheduling policies and the configuration of the system [14]. We use 120 ms as the low context switch rate in the second configuration as a conservative measure. We also evaluate a higher context switch rate (a 20 ms interval) in the third configuration. These three configurations present views of the stack traces at three sampling rates. Note that the dynamic translation framework is slower than native execution, which has an effect similar to executing on a slow machine. Thus we mainly focus on the relative comparison of the three configurations.

We evaluate the coverage of stack traces by two criteria. First, the dynamics of calling contexts are analyzed to evaluate the diversity of call paths shown in the stack traces. The second criterion concerns the function call instances that are captured by the stack traces.

Coverage of dynamic calling context: Figures 8 (a)-(c) illustrate the coverage of dynamic calling contexts under the three configurations described above. The X-axis shows the rank of dynamic calling contexts as a percentage, sorted by latency in ascending order. The Y-axis shows the coverage, in percentage, of all dynamic calling contexts whose ranks are at or above the rank on the X-axis. For example, if x is 0%, the corresponding y value shows the coverage of all dynamic calling contexts; if x is 99%, y shows the coverage of the calling contexts with the top 1% highest latencies.

Each figure shows multiple lines due to MySQL's use of multiple processes/threads. In the experiments using Pin on Windows, we observed that some threads execute only OS functions during their entire lifetime. We used the processes and threads that include application code (e.g., "main") in the execution, since OS-only behavior can be platform-dependent (e.g., program runtime setup) and may not represent the application's characteristics.

In the overall scope (x = 0), the stack traces cover under 20% of the dynamic calling contexts. However, the results improve for high-latency contexts, which are the focus of IntroPerf for performance diagnosis.
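The coverage curves of Figures 8 (a)-(c) can be derived as in the following sketch (ours, for illustration; `contexts` is an assumed mapping from each dynamic calling context to its latency and a flag recording whether any stack snapshot captured it):

```python
def coverage_curve(contexts):
    """Return (x, y) points: among contexts at or above latency percentile x,
    y percent appear in at least one stack snapshot."""
    # Sort by latency in ascending order, as on the X-axis of Figure 8.
    ranked = sorted(contexts.values(), key=lambda c: c[0])
    n = len(ranked)
    # covered_suffix[i]: captured contexts among the n - i slowest ones.
    covered_suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        covered_suffix[i] = covered_suffix[i + 1] + int(ranked[i][1])
    return [(100.0 * i / n, 100.0 * covered_suffix[i] / (n - i))
            for i in range(n)]
```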

Figure 7: Top ranked contexts of performance bug symptoms. Each row represents a calling context. Columns
represent the functions within contexts. The root causes in the ground truth are highlighted with circles.

The right sides of Figures 8 (a)-(c) show that the coverage increases for top-ranked contexts. For the top 1% slowest functions, shown at the right end of the X-axis (x = 99), the coverage increases to 34.7%-100% depending on the processes/threads.

Coverage of function call instances: Figures 8 (d)-(f) show the coverage of individual function call instances in the stack traces. Across all the executions, most threads covered 1.2%-8.05% of the instances. However, for the top 1% highest-latency functions the coverage increases to 29.1%-100%.

Table 3 summarizes the coverage analysis of dynamic calling contexts and function call instances for three programs: the Apache HTTP server, the MySQL server, and 7zip. Apache and MySQL are examples of web and database server programs with I/O-oriented workloads. 7zip is a data compression/decompression program, an example of a desktop application with an intensive CPU workload. The table shows the different characteristics of these programs; the second column shows the percentage of system calls relative to regular function calls.

Although the three programs vary in behavior due to their characteristics, the general observation holds in the comparison between the coverage of the entire scope (columns "Cov. of Context: All" and "Cov. of Instances: All") and the coverage of the top 1% slowest contexts and latencies (columns "Cov. of Context: Top 1%" and "Cov. of Instances: Top 1%"): the coverage of contexts and function instances is significantly higher for high-latency functions. In other words, for slow functions experiencing performance bug symptoms, IntroPerf generates higher quality views than for shorter (time-wise) functions. This is the core reason for IntroPerf's effectiveness in performance bug diagnosis despite the relatively coarse granularity of stack traces.

5.4 Performance
IntroPerf can work with system stack traces from any production event tracer [1, 15, 27]; such tracers are already deployed and used in production systems. IntroPerf is an offline trace analysis engine, so the tracing efficiency is entirely determined by the design and implementation of the OS kernel tracers. Although studying the performance of the tracing sessions is not the focus of this work, and the results cannot be generalized due to the differences and evolution of tracing mechanisms, we present the tracing performance we measured for ETW [1] to provide some insight into the potential overhead of current production tracers.

We present the overhead of the tracing sessions for three benchmarks: the Apache benchmarking tool (ab), the MySQL benchmark suite (sql-bench --small-test), and the 7zip benchmark (7z.exe b). Ab measures the time an Apache HTTP server takes to handle 10k requests. Sql-bench provides a set of sub-benchmarks that measure the wall-clock times of database operations such as SELECT, INSERT, and UPDATE. The Apache and MySQL benchmarks are performed on a local network. 7zip has a built-in benchmark, invoked with the b option, which measures compression and decompression performance on internal data.

Figure 8: Coverage of dynamic calling contexts (a)-(c) and function instances (d)-(f) of the MySQL database.

Table 3: Coverage of dynamic calling contexts and function instances, in percent, under multiple sampling schemes of system stack traces. Ranges reflect the data of multiple threads. Sys: system calls; LCTX: system calls and low-rate context switch events; HCTX: system calls and high-rate context switch events.

Program | Syscall Rate | Cov. of Context: All (Sys / LCTX / HCTX) | Cov. of Context: Top 1% (Sys / LCTX / HCTX) | Cov. of Instances: All (Sys / LCTX / HCTX) | Cov. of Instances: Top 1% (Sys / LCTX / HCTX)
Apache | 0.33-2.79 | 18.7-19.3 / 19.3-20 / 22.3-24.5 | 58.6-84.6 / 61.8-92.3 / 75.3-100 | 1.5-20.1 / 1.5-20.3 / 1.8-21.6 | 27-86.7 / 30.1-93.3 / 42.6-100
MySQL | 0.21-1.48 | 5.3-18.2 / 6.4-18.2 / 10.8-19.2 | 34.7-100 / 44.2-100 / 64.7-100 | 1.2-7.48 / 1.3-7.48 / 1.7-8.05 | 29.1-100 / 33.3-100 / 52.3-100
7zip | 0.11-5.03 | 13.6-47.5 / 14.5-49.4 / 17.5-47.5 | 38.5-100 / 41-100 / 50.1-100 | 0.6-30.2 / 0.6-30.2 / 0.8-31.2 | 16.6-100 / 18.6-100 / 26.5-100

Our configuration of ETW flushes the trace records to disk when the 512 MB kernel buffer is full. The goal of this experiment is to measure the runtime impact of generating the stack traces. To minimize the impact of flushing to disk, we stored the trace on a RAM disk. ETW provides another option, on-the-fly trace processing, instead of flushing events to disk; if a custom tracer can be built, it can further lower the overhead.

Figure 9: Overhead of ETW stack trace collection.
Figure 9 shows the overhead of tracing OS kernel events along with system stack traces. When system calls and miscellaneous events, which are required to comprehend the events described in Section 4, are recorded, the overhead is 1.38%-8.2%. If context switch events are additionally collected, the overhead increases to 2.4%-9.11%. Note that IntroPerf does not need to be active for the entire execution of a program, because OS tracers can dynamically start and stop a tracing session without interfering with applications. A reasonable scenario would be to sample the abnormal period when the program begins to exhibit it. Moreover, as the efficiency of OS tracers improves, IntroPerf will be able to perform faster offline analyses.
ple layers of software in a fine-grained and context-sensitive
manner. We tested it on a set of widely used open source
6. DISCUSSION AND FUTURE WORK
Our evaluation in Section 5 is based on a simple and conservative scheme (i.e., the conservative estimation shown in Figure 3) to estimate the latencies of functions. A more relaxed latency estimation scheme (e.g., aggressive estimation or variants thereof) may have either increased or decreased the accuracy, depending on which scheme more closely matches the actual timing of function calls and returns. Improving the estimation accuracy with advanced schemes would be an interesting direction for future work.
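To make the trade-off concrete, the sketch below (our simplification under assumed inputs, not the exact algorithms of Figure 3) contrasts a conservative estimate, which counts only the span actually observed on the stack, with an aggressive variant that extends the instance toward the neighboring snapshots that miss it:

```python
def estimate_latency(hits, prev_miss, next_miss, aggressive=False):
    """Estimate a function instance's latency from stack snapshots.

    hits: sorted timestamps (ms) of snapshots containing the instance.
    prev_miss/next_miss: nearest snapshot timestamps without it.
    """
    if not hits:
        return 0.0
    if aggressive:
        # Assume the call happened just after the last miss before the
        # first hit, and the return just before the next miss after it.
        return next_miss - prev_miss
    # Conservative: only the span actually observed on the stack.
    return hits[-1] - hits[0]

hits = [120, 240, 360, 480, 600]
print(estimate_latency(hits, 0, 720))                   # 480 (conservative)
print(estimate_latency(hits, 0, 720, aggressive=True))  # 720 (aggressive)
```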
Applications that rely heavily on system functions may cause a high frequency of kernel events. Since the overhead of the OS tracer is likely proportional to the frequency of the recorded events, it would be important to lower the number of recorded events for efficiency, as long as the coverage is not significantly affected. Sampling could be leveraged to mitigate the problem; existing sampling schemes, such as those employed by traditional profilers, could be similarly effective.

7. CONCLUSION
We present IntroPerf, a performance inference technique that transparently introspects the latency of multiple layers of software in a fine-grained and context-sensitive manner. We tested it on a set of widely used open source software, and both internal and external root causes of real performance bugs and delay-injected cases were automatically localized and confirmed. The results show the effectiveness and practicality of IntroPerf as a lightweight application performance introspection tool in the post-development stage.

Acknowledgments
We thank Brendan Saltaformaggio and the anonymous reviewers who provided valuable feedback for the improvement of this paper.

8. REFERENCES
[1] Event Tracing for Windows (ETW). http://msdn.microsoft.com/en-us/library/windows/desktop/aa363668(v=vs.85).aspx.
[2] Ftrace: Function Tracer. https://www.kernel.org/doc/Documentation/trace/ftrace.txt.
[3] gperftools: Fast, multi-threaded malloc() and nifty performance analysis tools. https://code.google.com/p/gperftools/.
[4] jstack - Stack Trace. http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html.
[5] LTTng: Linux Tracing Toolkit - next generation. http://lttng.org.
[6] OProfile: A system-wide profiler for Linux systems. http://oprofile.sourceforge.net/.
[7] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/.
[8] Stack Walking (Windows Driver). http://msdn.microsoft.com/en-us/library/windows/desktop/ff191014(v=vs.85).aspx/.
[9] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In SOSP'03.
[10] G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In PLDI'97.
[11] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In OSDI'04.
[12] M. D. Bond, G. Z. Baker, and S. Z. Guyer. Breadcrumbs: Efficient context sensitivity for dynamic bug detection analyses. In PLDI'10.
[13] M. D. Bond and K. S. McKinley. Probabilistic calling context. In OOPSLA'07.
[14] M. Buchanan and A. A. Chien. Coordinated thread scheduling for workstation clusters under Windows NT. In Proceedings of the USENIX Windows NT Workshop, 1997.
[15] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic instrumentation of production systems. In USENIX'04.
[16] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic internet services. In DSN'02.
[17] T. Chilimbi, B. Liblit, K. Mehra, A. Nori, and K. Vaswani. Holmes: Effective statistical debugging via efficient path profiling. In ICSE'09.
[18] U. Erlingsson, M. Peinado, S. Peter, and M. Budiu. Fay: Extensible distributed tracing from kernels to clusters. In SOSP'11.
[19] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: A call graph execution profiler. SIGPLAN Notices, 39(4):49-57, Apr. 2004.
[20] S. Han, Y. Dang, S. Ge, D. Zhang, and T. Xie. Performance debugging in the large via mining millions of stack traces. In ICSE'12.
[21] G. Jin, L. Song, X. Shi, J. Scherpelz, and S. Lu. Understanding and detecting real-world performance bugs. In PLDI'12.
[22] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Scalable statistical bug isolation. In PLDI'05.
[23] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI'05.
[24] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: Design, implementation and experience. Parallel Computing, 2004.
[25] R. J. Moore. A universal dynamic trace for Linux and other operating systems. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, 2001.
[26] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI'07.
[27] V. Prasad, W. Cohen, F. C. Eigler, M. Hunt, J. Keniston, and B. Chen. Locating system problems using dynamic instrumentation. In Proceedings of the 2005 Ottawa Linux Symposium (OLS), 2005.
[28] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah, and A. Vahdat. Pip: Detecting the unexpected in distributed systems. In NSDI'06.
[29] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google, Inc., 2010.
[30] W. Sumner, Y. Zheng, D. Weeratunge, and X. Zhang. Precise calling context encoding. IEEE Transactions on Software Engineering, 38(5), 2012.
[31] B. C. Tak, C. Tang, C. Zhang, S. Govindan, B. Urgaonkar, and R. N. Chang. vPath: Precise discovery of request processing paths from black-box observations of thread and network activities. In USENIX'09.
[32] X. Xiao, S. Han, D. Zhang, and T. Xie. Context-sensitive delta inference for identifying workload-dependent performance bottlenecks. In ISSTA'13.
[33] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: Error diagnosis by connecting clues from run-time logs. In ASPLOS'10.
[34] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. In ASPLOS'11.
[35] X. Zhuang, M. J. Serrano, H. W. Cain, and J.-D. Choi. Accurate, efficient, and adaptive calling context profiling. In PLDI'06.

