Monitoring, Analyzing, and Controlling Internet-Scale Systems With ACME
University of California, Berkeley Computer Science Technical Report UCB//CSD-03-1276. October 6, 2003.
ture that can perform all of these tasks and that meets the aforementioned challenges.

ACME is built from two principal parts. ENTRIE, the ENgine for TRiggering Internet Events, is a user-configured trigger engine that invokes "actuators" in response to conditions over metrics collected from "sensors." This sensor data may come directly from nodes or from ISING, the Internet Sensor In-Network aGgregator, which is the second part of our infrastructure. ISING is a very simple distributed query processor for continuous queries over sensor data streams; it broadcasts queries to sensors using a tree-based overlay network and then collects and aggregates the resulting data streams as they travel back up through the network. ISING trades off expressiveness for ease of implementation by using its own query language rather than SQL. ISING is built on top of QTree, a spanning tree overlay network with a configurable topology that is used by ISING for query distribution and result aggregation.

ACME meets the challenges we have mentioned as follows. To achieve scalability, ISING broadcasts queries and collects results using a peer-to-peer overlay network, and it aggregates results as they travel through the network. In Section 4.2 we show that this aggregation is quite beneficial. For flexibility, ENTRIE allows users to specify trigger conditions and their corresponding actions using an XML configuration file. Also, we have implemented standards-compliant "sensors," as well as "actuators," to demonstrate the ease with which new application-level sources of monitoring data and sinks for control actions can be added to the system. Section 3.4 shows sample configuration files for benchmarking and system management, and Section 4.3 shows the application-level sensors and actuators in use during a benchmark of two structured peer-to-peer overlay networks (footnote 1). Finally, for robustness, ISING uses timeouts to deliver a node's aggregated result up the tree if the node does not hear from all of its children in a timely fashion.

The remainder of this paper is organized as follows. In Section 2 we provide some background on the sensor and actuator metaphor, and their implementation in the system we describe. Section 3 describes the design and implementation of ACME, including the ISING data collection infrastructure and the ENTRIE trigger engine. In Section 4 we evaluate ISING and ACME as a whole and demonstrate ACME being used in a benchmarking scenario. In Section 5 we discuss related work, Section 6 describes future work including plans for deploying ISING on PlanetLab, and in Section 7 we conclude.

Although we leave a discussion of related work to the end of the paper, we wish to mention at this juncture that ACME bears a strong resemblance to three existing systems. Sophia [29] is a distributed expression evaluator for Prolog statements over sensors and actuators; it is in some ways a more general purpose version of ACME. PIER [11] is a distributed SQL query engine for stored data and Internet sensors; it is thus a more general purpose version of ISING. Finally, TinyDB [15] is a distributed SQL query engine for wireless sensors; it bears an even stronger resemblance to ISING in that it performs in-network aggregation when responding to queries. We feel that ACME complements these projects by exploring a distinct design point that offers its own unique lessons.

Footnote 1: Although the applications we target for monitoring and controlling in this paper are drawn from the domain of structured peer-to-peer overlay networks, our infrastructure can be easily adapted to work with other types of distributed applications.

2. Sensors and actuators

Because ACME's primary capabilities are the ability to aggregate streaming sensor data in real time and to control system operation via "actuators," we briefly provide some background on these metaphors, existing implementations of them, and the new sensors and actuators that we wrote.

Although the sensor/actuator metaphor for observing and controlling distributed systems is more than a decade old [16], the sensor side of this equation has recently received increased attention due to its incorporation as a fundamental building block of the PlanetLab testbed [19]. A PlanetLab sensor is an abstract source of information derived from a local node [23]. Sensor data is accessed via a sensor server, implemented as an HTTP server, that provides access to one or more of the sensors on the node. The sensor server for a particular sensor runs on the same port number across all physical nodes in the system. A sensor can be queried for a value by issuing an HTTP request whose format is described in [23]. The query URL contains the name of the sensor and optional arguments. A sensor returns one or more tuples of untyped data in comma-separated value format. An example of a sensor that is currently available and that our system uses as a data source is slicestat, which provides, for each slice (which can be thought of for the purposes of this discussion as a user), various pieces of resource usage information such as the amount of physical memory in use by the slice, the number of tasks executing on behalf of the slice, and the rate of sending and receiving network data over the past 1, 5, and 15 minute intervals.
The PlanetLab sensors that have been developed and deployed to date allow monitoring of operating system and network statistics, such as those described in the previous paragraph. In order to allow controlled and uniform data collection from applications and their log files, we implemented several of our own sensors to provide data about application components. The applications we targeted for evaluating ACME were two structured peer-to-peer overlay networks (Chord [25] and Tapestry [33]). For Tapestry we embedded a small HTTP server inside each Tapestry instance; this HTTP server serves as a sensor server for the sensors exported by Tapestry. The sensors we implemented for Tapestry return the number of various types of
messages that have passed through a node (e.g., locate object, publish object); the Tapestry instance's routing table; and the latency, bandwidth, and loss statistics for a requested peer (or all peers) as collected by Tapestry's Patchwork background route maintenance component. For both Tapestry and Chord we implemented a log file reader that collects instrumentation and debugging data that is written to disk as the application runs.

In addition to implementing application-level sensors, we have extended the PlanetLab sensor metaphor to include "actuators," an idea that was also recently proposed in [29]. Actuators allow one to control entities, much as sensors allow one to monitor entities. A program interacts with our actuators in exactly the same way that a program interacts with a sensor: the program sends an HTTP query to a sensor server running on the local host, requesting a URL that specifies the name of the actuator and any additional arguments. The actuator returns an acknowledgement that the action was taken or an error message indicating why it was not taken.
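As a hedged illustration of this symmetry, the sketch below invokes a hypothetical "droprate" actuator through the same HTTP/CSV interface used for sensors; the actuator name, argument encoding, and port are invented for this example and are not part of the PlanetLab sensor specification.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ActuatorInvokeExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical actuator: ask the local routing-layer instance to
        // drop 10% of its packets. Name, argument, and port are invented.
        URL invoke = new URL("http://localhost:33081/droprate?fraction=0.1");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(invoke.openStream()))) {
            // The response is an acknowledgement or an error message,
            // returned in the same CSV style as a sensor reading.
            System.out.println(in.readLine());
        }
    }
}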
Our primary interest in developing actuators is to allow fault injection for robustness benchmarks and tests. Our actuators allow the user to inject perturbations into the environment by starting application processes (and having them join an existing application service such as a distributed hash table), killing nodes, rebooting nodes, and modifying the emulated network topology, through a simple shell wrapper. (The last two features are available only when running on a platform that supports them, e.g., Emulab.) We have also embedded actuators into applications themselves, much as we did with sensors. This allows a program to inject a fault into another program using the same interface as it uses to inject faults into the environment. Among the fault injection actuators we have implemented are ones that cause a decentralized routing layer node to drop a fraction of its packets, and ones that cause a decentralized routing layer workload generator (an instance of which is running in each process of the decentralized routing layer) to change its workload model as the routing layer continues to run. We implemented the first actuator within Tapestry, and the second actuator within both Tapestry and Chord.

In the remainder of this paper, when we refer to a PlanetLab sensor (or actuator), we are referring to a data source (or command sink) that is addressable through the PlanetLab sensor interface. Also, we note that sensors and actuators raise a host of security and protection issues that we do not address in our current implementation.

3. ACME design and implementation

In this section we describe ACME and its two principal components: the ENTRIE trigger engine and the ISING sensor aggregator (which is in turn built on top of QTree).

3.1. High-level ACME architecture

Figure 1 depicts ACME's high-level architecture. The root of the tree is depicted by the boxes drawn above the horizontal line. One representative non-root node is depicted by the boxes below the horizontal line. Each dot is a physical node running the components (boxes) below the horizontal line. Thus the physical nodes form a tree, and all nodes are functionally symmetric, except the root of the tree, which additionally runs ENTRIE and stores experiment specifications.

Zeroing in on a single node, a single Java virtual machine (JVM) runs: the SEDA event-driven framework and asynchronous I/O libraries (not shown) [30]; QTree, a configurable overlay network that forms a spanning tree over the nodes in the system; and ISING, a simple distributed query processor specialized for distributing queries to, and aggregating results from, sensors and actuators.
Figure 1: Overview of the ACME architecture. A user interacts with ENTRIE running on the root node (drawn
above the horizontal line). ENTRIE queries the root ISING instance, which broadcasts the query and aggregates
responses using the QTree overlay. The ISING instances running on each node communicate with local sensors and
actuators running on those nodes. The sensors and actuators on the root node are omitted from the figure for clarity.
Sensors and actuators run in a separate process from the JVM running SEDA, QTree, and ISING. The root ISING instance itself exports a PlanetLab sensor interface, and it can therefore be used directly as a service. Indeed this is how it is used by ENTRIE, a trigger system that executes user-specified actions when user-specified conditions are met. The conditions are generally ISING queries or timers, and the actions are generally actuator invocations, but other types of conditions and actions can be specified. ENTRIE runs in a separate process from SEDA/QTree/ISING. All communication among components running in separate processes uses TCP with persistent connections.

3.2. QTree

QTree is a configurable spanning tree overlay network that is used by ISING for query distribution and result aggregation. The spanning tree is formed as suggested for aggregation in [3] -- the paths from each node to a designated root form the tree, and aggregation takes place at non-leaf nodes. QTree currently implements three tree topologies: in one (DTREE), the path from a node to the root is a direct TCP connection, and in the two others the path is the overlay routing path the node would use when routing to the root in Tapestry (TTREE) and Chord (CTREE), respectively.

TTREE is formed by following the natural Tapestry routing path from all non-root nodes to the root. As described in [6], this policy ensures that children of nodes near the leaves are close to their parents in terms of network latency, while children of nodes near the root are farther from their parents. This policy is beneficial in an aggregation network because most edges of the graph are near the leaves, and edges near the leaves are latency-optimized, so most data is sent over low-latency links. The smaller number of links near the root, carrying (as we will see) aggregated data, are higher latency. Thus TTREE is beneficial, assuming wide-area network bandwidth is expensive in performance and financial cost.

TTREE is self-organizing, automatically incorporating new nodes as they join the network and remaining fully connected even in the face of failures. Note that QTree does not use Tapestry to route messages; QTree uses Tapestry's initial topology to form the tree and subsequent topology updates (as Tapestry detects nodes departing and joining the network) to re-form the tree, but QTree sends network messages among nodes directly over persistent TCP connections (that are shared with Tapestry).

CTREE is formed by following the natural Chord routing path from all non-root nodes to the root. Due to time constraints we have not yet evaluated ACME using CTREE.

QTree exports a simple interface to applications:

• NewTree(): form a new tree rooted at the calling node and return a handle for the tree
• QTreeDown(tree, message): send message to all descendants of the calling node in tree
• QTreeUp(tree, message): send message to the parent of the calling node in tree
• CountChildren(tree): return the number of children of the calling node in tree
• WhatsMyLevel(tree): return the level of the calling node in tree

The simplicity of this interface is beneficial in three ways. First, it is easy to build query distribution and result aggregation on top of it and to extend those applications to handle new datatypes and aggregation functions. A node at the root of a tree issues a query to all other nodes by calling QTreeDown. When a node receives a QTreeUp, it optionally aggregates the attached message with the messages attached to other QTreeUps it has received for that tree, and then delivers the aggregate to its parent.

Second, QTree relieves applications from the burden of re-forming the tree topology in the face of node flux; QTree takes care of that, ensuring that a TreeId continues to refer to the tree rooted at a given node even in the face of node flux. We note that we have currently implemented reliability only in the TTREE configuration of QTree.

Third, QTree enables experimentation with different spanning tree topologies; a new spanning tree implementation that maintains the QTree interface can immediately be substituted for an existing QTree implementation.
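For concreteness, the following is a minimal Java sketch of what the interface just described might look like to an application; the type names and signatures are our own rendering for illustration, not QTree's actual source.

// Hypothetical rendering of the QTree application interface; names and
// signatures are illustrative only.
public interface QTree {
    final class TreeId {}                      // opaque handle for a tree
    TreeId newTree();                          // form a tree rooted at the caller
    void qtreeDown(TreeId tree, byte[] msg);   // send msg to all descendants
    void qtreeUp(TreeId tree, byte[] msg);     // send msg to the caller's parent
    int countChildren(TreeId tree);            // number of children in tree
    int whatsMyLevel(TreeId tree);             // caller's level (depth) in tree
}

// An aggregating node would collect qtreeUp messages from its children for an
// epoch, combine them, and forward a single qtreeUp to its own parent.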
3.3. ISING

ISING, the Internet Sensor In-Network aGgregator, is a simple query processor designed for continuous queries over streaming data received from PlanetLab-style sensors. We have built ISING on top of QTree as follows. A "root" ISING instance calls NewTree() to form a QTree. That ISING instance then activates its own sensor interface to receive queries from users. A user query is turned into a QTreeDown message that is sent down the tree, and aggregated results are sent back up the tree using QTreeUp messages. For this discussion we assume only one ISING tree exists in the system at any given time.

A user's query to ISING is a standard sensor query consisting of the following components.

• sensor server port: the port number of the sensor server, assumed to be running on that port on every node in the query tree.
• sensor name: the name of the sensor whose value the user wants returned from the specified sensor server.
• host: "ALL" if the query should be sent (using QTreeDown) to the indicated sensor server port on every node in the system, or the hostname of a single machine if the query should only be sent to one machine. In the latter case the query will be sent from
the ISING root directly over TCP and the response returned directly over TCP.
• aggregation operation: one of the aggregation operations MIN, MAX, AVG, MEDIAN, SUM, and COUNT; or the special VALUE operator, which simply concatenates all values returned by the queried sensors.
• epochDuration: ISING supports two types of queries. (1) Continuous queries are issued once to the ISING root, and a new result tuple is delivered to the user at a fixed interval specified by the epochDuration. Note that the ISING root does not re-broadcast the query every epochDuration milliseconds; instead the query is registered locally with the ISING instance at each node. Then the local sensor is re-queried, and the result pushed up the aggregation tree, at that frequency. (2) Snapshot queries compute a one-time aggregate across the system. Such queries are particularly useful in problem diagnosis, when the user is exploring a decision tree of possible problem causes. A user specifies an epochDuration of zero to indicate that a query should be evaluated only once.
• value selection criteria (optional): Some sensors may return multiple lines of output (rows), each with multiple fields (columns). The user may request that only rows in which a specified column number matches a specified regular expression are returned. The user may further request that only a specified column number from returned rows be returned. (footnote 1)
• value predicate (optional): For some queries a user may be interested only in sensor values provided by nodes that meet some other criteria. We therefore allow predicates to be applied to sensor requests. The semantics of these requests are as follows: the value(s) matching the <sensor server host:port, sensor name, value selection criteria> restrictions that would (in the absence of a value predicate) otherwise be considered "valid" are considered "invalid" if the specified value predicate is not met. A value predicate consists of one or more clauses, joined by AND or OR, that specify comparisons between any <sensor server host:port, sensor name, value selection criteria> tuple for any sensor on the machine where the original query is being processed, and a constant or another such tuple. The comparison operations supported are =, !=, >, <, >=, and <=. (footnote 2) "Invalid" values are ignored during aggregation.

While complicated queries are not easy to write by hand because they must be encoded in the URL that is sent to the ISING root, we expect that a program (such as ENTRIE), not a human, will be generating the URLs.

In response to a query with the above components, ISING returns one or more lines in comma-separated-value format, each line of the form <sensor server host:port>, <timestamp when data item was generated (footnote 3)>, <data>.
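To make these request and response formats concrete, here is a hypothetical example; the parameter names, port numbers, and URL encoding are invented for illustration, since the exact encoding of a query into a URL is not specified here.

A continuous query asking every node's "load" sensor (on port 33080) for a system-wide average once per minute might be encoded as:

    http://ising-root:34080/ising?port=33080&sensor=load&host=ALL&op=AVG&epochDuration=60000

and each epoch ISING might return a line such as:

    planetlab1.example.org:33080, 1065456000000, 0.42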
As an aside, the ISING request semantics that we have described allow the user to specify aggregation over sensor data returned by a sensor running on the same port on every physical node in the system. This is not a problem for sensors that collect per-physical-node data, since there is no reason that there would be more than one sensor server for that sensor on a particular machine. But we also want to collect application-level data, and there may be more than one instance of an application running on a single physical node. For example, when evaluating the scalability of a decentralized routing layer, it is common to run more than one instance of the application on each physical machine to emulate a network with a larger number of nodes than there are physical nodes available for the experiment. In this case the ISING user will want to aggregate across all application processes running on all physical nodes. To address this need, we have implemented a sensor, one instance of which runs on each physical node, that aggregates data from all instances of the sensor of interest on that node (one per application process of interest, each running a sensor server on a different port). This lower-level aggregator can be thought of as a recursive instance of ISING itself; it aggregates data from multiple sensors of interest with identical schemas on a physical node, and exports a single sensor interface to ISING, which queries it and aggregates across physical nodes.

Footnote 1: For example, consider a query to a sensor that implements the Unix finger function. An intrusion detection application may be interested in querying this sensor at all nodes in the system to find out from where a particular user is logged on to those machines that she is using. This requires matching the username row(s) to the username and then extracting the "from" field(s) that indicates from where she is connecting.

Footnote 2: For example, consider a query of the form "tell me which nodes are sending a lot of network traffic but have low CPU load" -- these nodes might be launching a network attack. This query would be phrased such that the "sensor" would be a "tell me your hostname" sensor, while the value predicate would AND two predicates: the "amount of traffic sent over the last few minutes" sensor greater than some value, and the "CPU load" sensor less than some value.

Footnote 3: This timestamp is taken from the local clock on the machine running the sensor whose data is returned, so it is only comparable to other time values in the system if the clocks in the system are synchronized.

Although we have described ISING as a query processor for sensors, it can be used identically to broadcast actuator invocations and aggregate their results (success/failure messages). This is possible because our actuators export a standard sensor interface. Actuators that we wish to activate simultaneously on every node in the system are
ideally invoked through ISING. Examples that we have used include setting a simulated loss rate on each overlay link to some value to test the system's robustness to lost messages, and increasing the rate of workload generation uniformly across all nodes in the system. Of course, because actuators export a standard sensor interface, they can also be invoked directly rather than through ISING.

3.3.1. ISING failure handling

ISING should handle three types of failures: failstop failure of a sensor server or sensor, performance failure of an ISING instance (which may be due to performance failure of a sensor server or sensor), and failstop failure of an ISING instance (which may be due to failstop failure of a node).

The easiest of these failures to handle is failstop failure of a sensor server or sensor. This may happen, for example, because the sensor simply is not intended to run on some subset of machines, or because it has crashed and is now refusing HTTP connections. If an ISING instance detects this type of failure, it considers the value "invalid" for the purposes of aggregation.

An ISING instance may experience a performance failure if something has slowed down one of its children, the network between one of its children and itself, or a sensor server or sensor. To allow partial aggregates to be returned in the face of such failures, each ISING instance sets a timeout on receiving values from its children for each epoch. If the values for an epoch are not received from all children within the timeout interval, the aggregate of whatever data has been received thus far is passed up the tree as the value from that ISING instance for that epoch. When the values from the slow children for that epoch are received, they are discarded. In order to give timeouts a chance to propagate up the tree, each node's timeout should be proportional to the number of hops to that node's farthest descendant. As a heuristic, we set each node's timeout to

    timeout = d * timeout_single, where timeout_single = compute_max + latency_max

and d is the difference between the maximum anticipated depth of the tree and the depth of the node whose timeout value we are computing, compute_max is the maximum amount of time we expect any node to need to compute the aggregate of its children's results after the last value is received, and latency_max is the maximum anticipated one-way network latency between any two adjacent overlay nodes. This is a less precise version of the policy used in [8] for loss recovery.
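As a small worked sketch of this heuristic (with invented constants; the values actually used by ISING are not given here), a node's timeout might be computed as follows.

// Sketch of the timeout heuristic described above; constants are invented.
public final class TimeoutHeuristic {
    static final long COMPUTE_MAX_MS = 50;   // assumed per-node aggregation time
    static final long LATENCY_MAX_MS = 200;  // assumed one-way overlay-hop latency

    // nodeDepth 0 = root; maxDepth = maximum anticipated depth of the tree
    static long timeoutMs(int maxDepth, int nodeDepth) {
        long single = COMPUTE_MAX_MS + LATENCY_MAX_MS;
        int d = maxDepth - nodeDepth;   // hops to the farthest anticipated descendant
        return d * single;              // deeper nodes time out sooner than the root
    }
}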
Of course, whether a delayed value is useful to an ISING user depends on the epochDuration of the query -- a once-an-hour query whose response is delayed by two minutes would presumably be acceptable, while a once-a-minute query whose response is delayed by two minutes would be useless since another value will have been delivered in the interim.

QTree can hide the third type of failure, failstop failure of an ISING instance, from all other ISING instances: as nodes join and depart the system (voluntarily or due to failures), QTree re-forms the tree to incorporate the surviving nodes, assuming the root has not died. Our current ISING implementation uses QTree in a "static" mode, however, so that child death is treated identically to a performance failure. Thus until the child restarts, all messages from it and all of its descendants are lost, though results from the portion of the tree excluding the dead node and its descendants continue to be aggregated and returned to the ISING root. In addition to not re-forming the tree when a node dies, "static mode" prevents new nodes from joining the system. We are currently in the process of modifying ISING to exploit QTree's ability to dynamically re-form trees as nodes join and depart the system; until then, the same effect can be accomplished only by killing and restarting all ISING and QTree instances.

3.4. ENTRIE

Although ISING is a useful standalone service in its own right, its power is magnified when it is used as a data source for ENTRIE. ENTRIE is a configurable trigger system designed to issue queries to sensors (through ISING or directly), to continuously evaluate the results, and based on that evaluation to possibly invoke one or more actions. We built ENTRIE with two primary uses in mind: for running controlled experiments such as benchmarks and tests, and for performing system management.

Unlike ISING, which is designed as a single long-running service that can be accessed by anyone as part of the core system infrastructure, ENTRIE is designed to be instantiated separately by each person who uses it.

At ENTRIE's core is an XML configuration file specifying triggers, which are actions and the condition(s) that cause them to execute. This configuration file could be automatically generated from a more user-friendly syntax, but currently it is written by hand. ENTRIE's configuration language is intended to shield the user from such details as interacting directly with sensors and actuators, or having to write procedural specifications to evaluate triggers.

The core ENTRIE abstractions are conditions and actions. One or more conditions are associated with each action; when all of the conditions are met, the action is executed. The same conditions can be bound to different actions, to allow an "or" over conditions.

ENTRIE currently supports three types of conditions: timer conditions, completion conditions, and sensor conditions.
Timer conditions specify that an action can be executed only after a certain time, and/or that an action cannot be executed after a certain time. Completion conditions specify that an action cannot be executed until other action(s) specified in the configuration file have completed. Sensor conditions specify the <hostname:port number> of a sensor server (which may be ISING), the name of the sensor of interest, the period with which the sensor should be queried (when querying through ISING, this is turned into the epochDuration), and one or more data conditions. A data condition specifies a condition over the value, or history of values, returned by the sensor. An example data condition is "the average load returned by the sensor over the last ten query periods is greater than 10." (If the sensor is ISING, then the loads that ENTRIE is averaging here are themselves the ten previous instantaneous average loads computed by ISING across all nodes in the system.)

ENTRIE's actions invoke actuators. They can be named using the same syntax used for naming conditions, or they can use special syntax we have designed for starting, killing, or rapidly starting and killing ("churning") application processes to enable benchmarking. Actions may specify one of several repeat settings, to indicate that an action should be executed on the conditions' first transition from false to true, on every transition from false to true, periodically during the first true interval, or periodically during every true interval. The first type might be used to trigger a timer-based action in a benchmark, while the fourth might be used to page a system administrator periodically until a problem has been fixed.

Although ENTRIE represents a single point of failure for ACME, multiple ISING roots can be named as part of a sensor condition. The first ISING instance will be used as the default, and if the connection to it fails or times out, the next instance will be tried, and so on. Of course, if ACME or the node on which it is running fails, ACME will itself fail. When ACME is restarted it will re-read its configuration file and proceed as before, but having lost all of its history data. Standard "hot standby" replication of history data would allow ENTRIE to tolerate failures, but this functionality has not been implemented.

To illustrate ENTRIE's capabilities, we provide two concrete configuration examples here. Due to space constraints we are limited to fairly simple examples.

3.4.1. Example: benchmarking

Our first example would be used in benchmarking. First we start 150 nodes (instances of the application process). Fifteen minutes later we start a process of "churn" in which we start and kill new nodes repeatedly, such that the period between startups is drawn at random from an exponential distribution with mean 10 seconds, and the lifetime of each node is drawn at random from an exponential distribution with mean 30 seconds. Thirty minutes after churn has started, we end.

<action ID="1" name="startNode" timerName="T">
  <params numToStart="150"/>
  <conditions>
    <condition type="timer" value="0"/>
  </conditions>
</action>

<action ID="2" name="startNode" timerName="T">
  <params numToStart="1" distribution="exponential"
          randLifetime="true" meanLifetime="30000"/>
  <repeat distribution="exponential" randPeriod="true"
          meanPeriod="10000"/>
  <conditions>
    <condition type="timer" value="900000"/>
    <condition type="endDelay" value="1800000"/>
  </conditions>
</action>

3.4.2. Example: self-repair

Our second example, which might be used in system management, combines problem detection with a limited form of self-repair. We specify the following policy: every minute, if the load on the most highly loaded physical node is more than five times the maximum of the minute-by-minute average loads across the system during the past ten minutes, reboot that node. The user writing the file would replace text enclosed in [brackets] with a constant.

<action ID="1" name="EXECUTE" timerName="T">
  <params commandType="actuator"
          name="reboot"
          hosts="[ISING_host]:[ISING_port]"
          node="VARIABLE_host:[reboot_actuator_server_port]"/>
  <conditions>
    <condition type="sensor" ID="systemAVG"
               name="load"
               hosts="[ISING_host]:[ISING_port]"
               node="ALL:[load_sensor_server_port]"
               period="60000" sensorAgg="AVG"
               histSize="10" histAgg="MAX" isSecondary="true"/>
    <condition type="sensor"
               name="load"
               hosts="[ISING_host]:[ISING_port]"
               node="ALL:[load_sensor_server_port]"
               period="60000" sensorAgg="MAX"
               histSize="1" operator=">"
               secondaryID="systemAVG" scalingFactor="5"/>
  </conditions>
</action>
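To spell out how we intend the two sensor conditions above to combine, the following Java sketch mimics the evaluation; it is our own illustration of the semantics, not ENTRIE's actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustration of the self-repair trigger's semantics, not ENTRIE code.
public final class SelfRepairTriggerSketch {
    private final Deque<Double> avgLoadHistory = new ArrayDeque<>(); // last 10 AVG loads

    // Called once per 60-second period with the system-wide AVG and MAX
    // load values returned by ISING for that period.
    boolean shouldReboot(double systemAvgLoad, double systemMaxLoad) {
        avgLoadHistory.addLast(systemAvgLoad);
        if (avgLoadHistory.size() > 10) {
            avgLoadHistory.removeFirst();       // histSize="10"
        }
        double histMax = avgLoadHistory.stream()
                .mapToDouble(Double::doubleValue).max().orElse(0.0); // histAgg="MAX"
        return systemMaxLoad > 5 * histMax;     // operator=">" with scalingFactor="5"
    }
}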
4. System evaluation

In this section we evaluate ACME's performance, scalability, and robustness to overlay network message loss.
seen-so-far each time a child value was received. In this experiment we used values generated internally by ISING at each node as the sensor value, rather than having ISING at each node contact an external sensor, in order to isolate ISING's performance from that of specific sensors. The same node was the root of the tree for each network size.

The most important conclusion to be drawn from Figure 2 is that aggregation does help measurably. For a 512-node network, a true aggregate (TTREE-MIN) reduces latency by 76% compared to an aggregate that cannot reduce data as it flows through the overlay network (TTREE-MEDIAN), and it reduces latency by 54% compared to having all nodes send their values directly to the root using TCP (DTREE-MIN).

In analyzing these results we first discuss the relative latencies of the four configurations ({MIN, MEDIAN} x {TTREE, DTREE}), and then the relative slopes of each. The absolute latencies we found are T-MEDIAN > D-MEDIAN = D-MIN > T-MIN. Our explanation of these latencies is as follows.

• TTREE-MEDIAN has higher latency than either DTREE operation because it ships the same amount of traffic over the bottleneck link (all values collected are sent over the root's network connection) and also ships additional data over other links (the links among non-root nodes), and incurs a delay proportional to the number of overlay hops between the farthest leaf and the root as parents at each level wait for their slowest child to complete.
• The two DTREE operations have approximately the same latencies because they ship identical amounts of data over exactly the same links. DTREE-MEDIAN is very slightly slower than DTREE-MIN because the median cannot be computed until all child values are received, while MIN can be computed incrementally as child values are received.
• TTREE-MIN has lower latency than the DTREE operations because it ships the same total amount of traffic (each node sends one value) but the traffic is spread across many network links. A lesser effect that contributes to TTREE-MIN's performance is that the computation of the aggregate is overlapped among all nodes in the same level of the tree. The one drawback of TTREE, namely that it incurs a delay proportional to the number of hops between the farthest leaf and the root, apparently does not hurt performance as much as the load balancing of network traffic and computation helps it.

Figure 2: ISING response time as a function of aggregation network size, topology, and operation. (Plot omitted; it shows end-to-end ISING response time versus number of nodes, from 50 to 550, for the TTREE MEDIAN, TTREE MIN, DTREE MEDIAN, and DTREE MIN configurations.)

With respect to slope, the time to compute TTREE-MIN depends mainly on the depth of the tree, which we indeed found to be approximately constant across our tree sizes. When a new node is added, each existing node sends up the tree the same amount of data as it used to. The only extra work is that the new node's parent does one more unit of computation, and one new network message is shipped over the network link into the parent of the new node. The time to compute a TTREE-MEDIAN increases with a slope related to the depth of the tree, because each new node increases by one unit the amount of network traffic sent along every overlay link on the path from the new node to the root. Finally, the slope of the DTREE lines is controlled by the fact that adding a new node increases by one message the amount of traffic sent over the heavily congested network link into the root.

Figure 3 plots the total number of bytes sent in response to a query as a function of aggregation operation and network size, for both the TTREE and DTREE topologies. Quite predictably, the DTREEs and TTREE-MIN all send exactly the same amount of data -- every node sends one value, and the slope of the line is the number of bytes in a message. (Our messages are larger than they would be in a production system, as we include some debugging information; obviously the benefit from aggregation would be greater if messages were larger, and smaller if messages were smaller.) The TTREE-MEDIAN line can be understood as follows. Every node sends a number of message units equal to one more than the total number of its descendants. Therefore a new node causes m extra message units to be transferred, where m is the number of nodes on the path from the new node to the root. The average depth of a node expresses the average number of such intermediate nodes. Thus we expect the slope of the line to be approximately equal to the average node depth times the slope of a DTREE line. Indeed the slope of a DTREE line is about 100 bytes/node and the slope of the TTREE-MEDIAN line is about 600 bytes per node, for a ratio of
about 6, which is about the average node depth we found for our trees.

Finally, in order to investigate ISING's robustness to message loss, we instrumented ISING so that a fraction of QTreeUp messages would be dropped. In particular, each time a node is about to send a QTreeUp message to its parent, there is a p% chance that it will drop the message instead of sending it. Nodes decide to drop messages independently, based on a random number generator that is seeded differently on each node.

In an aggregation network there are two loss metrics of primary interest: the number of queries whose response incurs at least one loss, and the number of nodes partitioned from the tree by that loss (or losses). To assess these metrics, we recorded the values returned from a series of 100 COUNT queries issued by ISING; COUNT simply returns the number of nodes responding to a query. Table 1 shows, for a 512-node network, the total percentage of non-512 counts returned, representing the number of queries that experienced at least one message loss, and the average difference from 512 for non-512 counts, representing the number of nodes partitioned from the tree when there was loss. A full analysis of these results is not possible due to space constraints, but we make the following argument for their reasonableness. Assuming failures are independent, the expected fraction of queries that will return non-512 counts for loss probability p is 1 - (1-p)^512, since in order for a 512-count to be returned, every link must not fail. (In a real network failures are not independent, but we leave exploration of a more realistic fault model to future work.) This expectation closely matches our findings in the second column. For the third column, in general, the higher in the tree a node is, the more of the tree that is lost when it, or its link to its parent, dies (at the two extremes, the root takes out everything, while a leaf only disconnects itself). But there aren't many nodes at the higher levels of the tree (only one root, but more than half the nodes are leaves). These two effects cancel each other out, and the expected number of nodes lost to a given failure is roughly proportional to the depth, which is roughly constant across all the cases. Also, as the loss probabilities increase, the probability of multiple losses in responding to a single query increases, which explains the increase in the average number of nodes lost as loss probability increases.
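As a quick check of this expectation (a sketch we added; the paper reports only the measured values), the expression 1 - (1-p)^512 can be evaluated for the loss probabilities in Table 1:

// Expected fraction of 512-node COUNT queries that lose at least one
// response, assuming independent per-message loss with probability p.
public final class ExpectedLossyFraction {
    public static void main(String[] args) {
        double[] lossProbabilities = {0.0001, 0.0005, 0.0010, 0.0015}; // 0.01% .. 0.15%
        for (double p : lossProbabilities) {
            double expected = 1.0 - Math.pow(1.0 - p, 512);
            System.out.printf("p = %.2f%% -> expected lossy fraction ~ %.0f%%%n",
                    p * 100, expected * 100);
        }
        // Prints roughly 5%, 23%, 40%, and 54%, to be compared with the
        // measured 4%, 17%, 42%, and 48% in Table 1.
    }
}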
Loss probability | % of lossy responses | Average # of nodes lost
0.01%            | 4%                   | 5.25
0.05%            | 17%                  | 5.71
0.10%            | 42%                  | 6.22
0.15%            | 48%                  | 7.82

Table 1: Percent of query responses that lose at least one node's response, and average number of nodes lost for lossy responses, as a function of loss probability, for network size 512.

Figure 3: Total bytes sent in computing an aggregate, as a function of aggregation network size, topology, and operation. The three lower curves coincide. (Plot omitted; it shows total data transferred in KBytes versus number of nodes for the TTREE MEDIAN, TTREE MIN, DTREE MEDIAN, and DTREE MIN configurations.)

4.3. Evaluating ENTRIE and ACME

Although ISING is an important part of ACME, we are also interested in the end-to-end performance of ACME: the time from a condition-satisfying value being produced by a sensor, until the action corresponding to the condition is invoked on the appropriate node(s). Assuming the action is to invoke an actuator on all nodes, this time is the sum of
(1) the time for a value to be received from a sensor,
(2) the time for the aggregate value to reach the root,
(3) the time for the root to pass the value to ENTRIE,
(4) the time for ENTRIE to evaluate the trigger,
(5) the time for ENTRIE to pass the actuator query to ISING,
(6) the time for ISING to pass the actuator invocation down the tree, and
(7) the time to invoke the actuator.

The sum (6) + (2) is precisely the end-to-end number we measured in Section 4.2 (though in that case the "down" happened before the "up"). Due to time constraints, we were unable to evaluate ENTRIE's performance scalability, i.e., the relationship between trigger time and such factors as total number of triggers, number of conditions associated with each trigger, and number of actions potentially triggered by the same condition. Indeed, the current version of ENTRIE was not designed with either performance or scalability in mind, but rather as a way to prototype our ideas about controlling distributed experiments and performing distributed system management using actuators. We did find that for a few triggers (actions), each with a few conditions, ENTRIE's trigger time never exceeded 100ms. In other words, (4)
NFTAPE, ACME is designed to scale to Internet-scale systems and uses a sensor/actuator interface to communicate with monitoring and fault injection components.

6. Future work and deployment

We are interested in enhancing ACME's performance, robustness, and functionality in a number of ways, while maintaining the application-specific focus that sets ACME apart from general-purpose distributed query processors and distributed programming environments.

First, we intend to evaluate additional QTree overlay topologies. In contrast to wireless sensor networks, where nodes can only route directly to other nodes within radio distance, Internet aggregation networks can form any overlay topology, because any node can route to any other node over IP. Thus we see a wide opportunity to investigate the performance and robustness of a host of hierarchical aggregation networks, including ones that are derived from structured peer-to-peer overlay networks, ones that are based on unstructured networks, and ones that are derived from social structures (e.g., based on administrative domains), particularly in the face of real-world failure modes and queries that might be scoped based on geographic distance, administrative domain hierarchies, or network distance. Less structured data dissemination protocols, such as gossip, are also of interest [10].

Within ISING, we would like to investigate the potential performance improvement from caching values at the ISING root and non-root instances, as well as the sharing of query subexpressions (i.e., an individual or aggregate sensor value). Also, for some applications, sampling a fraction of the sensors on each epoch may improve performance without significantly sacrificing data quality. We also intend to add additional aggregation functions such as COUNT DISTINCT and HISTOGRAM, and to investigate allowing user-defined aggregation functions specified within ISING queries as URL pointers to custom aggregation code. Finally, we would like to implement a mechanism for explicitly notifying the issuer of a query when she is receiving a partial aggregate due to timeouts, as opposed to a complete aggregate for which all nodes responded in a timely fashion.

From a more practical standpoint, we intend to integrate QTree and ISING into a simulation framework that will allow us to evaluate performance beyond the 512 virtual nodes to which we were limited for this paper by virtue of evaluating only a real implementation. Also, we would like to use ACME to monitor and control additional applications beyond Tapestry and Chord. Finally, we intend to add support for "streaming sensors," i.e., sensors that return a new tuple of data periodically over a persistent connection to an ISING instance. This raises interesting issues related to matching the user's epochDuration to the rate at which new data is supplied by the sensor.

Finally, we would like to expand ENTRIE's functionality in four directions. First, we would like to add a layer of syntactic sugar on top of the current XML configuration file, particularly in the hopes of developing a general language capable of expressing the full range of fault injection actions and other control actions that benchmarkers, testers, and service operators might need. Second, we would like to add new sensors and actuators to increase the range of conditions and actions that can be utilized. Much longer term, we would like to provide ENTRIE as a service; users should be able to dynamically add and remove triggers stored on, and executed by, an "ENTRIE server." Such a service brings up a host of protection and security issues which must be considered. A final long-term direction for ENTRIE is to exploit statistical anomaly detection techniques over monitoring data to automatically instantiate, or to suggest to an operator, conditions that should trigger actions such as recovery from failures, quarantine of security problems, or operator notification for manual intervention. For this and other operations that might require large amounts of historical monitoring data, storing metrics on disk in raw or aggregate form, at the ISING root and/or non-root ISING instances, may be necessary.

We intend to deploy ISING as a continuously-running service on PlanetLab soon.

7. Conclusion

In this paper we have described ACME, a flexible infrastructure for Internet-scale monitoring, analysis, and control in support of activities such as benchmarking, testing, and self-management. Users create triggers using XML; one possible source of data for these triggers' conditions is ISING, a simple distributed query processor that broadcasts queries to, and aggregates data streams derived from, PlanetLab-style sensors. ISING can also be used as a sink for the triggers' actions, which is particularly useful when a trigger must invoke an actuator on all nodes in the system. ISING is in turn built on top of QTree, which imposes a uniform query/response interface on top of various overlay network configurations.

In evaluating ISING's performance and scalability, we found that for one 512-node system running atop an emulated Internet topology, ISING's use of in-network aggregation over a spanning tree topology derived from the Tapestry structured peer-to-peer overlay network reduced end-to-end query-response latency by more than 50% compared to using direct network connections or the same overlay network without aggregation. We also found that an untuned implementation of ACME can invoke an actuator on one or all nodes in response to a discrete or aggregate event in less than four seconds. Finally, we demonstrated ACME's ability to monitor and benchmark peer-to-peer overlay applications. To accomplish this we have written sensors for measuring application-level behavior
and actuators for generating perturbations such as starting and killing processes and nodes, varying the applied workload, varying emulated network behavior, and injecting application-specific faults.

ACME is just a first step in investigating the issues related to building an infrastructure for comprehensively understanding, testing, and managing Internet-scale applications. We look forward to future work in this area by ourselves and others.

References

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. PODS, 2002.
[2] M. Bowman. Handling resource limitations and the role of PlanetLab support. http://sourceforge.net/mailarchive/forum.php?thread_id=3120326&forum_id=10443
[3] K. Calvert, J. Griffioen, and S. Wen. Lightweight network support for scalable end-to-end services. SIGCOMM, 2002.
[4] M. Castro, P. Druschel, A-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: high-bandwidth multicast in cooperative environments. SOSP, 2003.
[5] M. Castro, P. Druschel, A-M. Kermarrec, and A. Rowstron. SCRIBE: a large-scale and decentralised application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications, 20(8), 2002.
[6] B. N. Chun, J. Lee, and H. Weatherspoon. Netbait: a distributed worm detection service. http://berkeley.intel-research.net/bnc/papers/netbait.pdf, 2003.
[7] D. D. Clark, C. Partridge, J. C. Ramming, and J. T. Wroclawski. A knowledge plane for the Internet. SIGCOMM, 2003.
[8] S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang. A reliable multicast framework for light-weight sessions and application-level framing. IEEE Transactions on Networking, 5(6), 1997.
[9] Ganglia toolkit. http://ganglia.sourceforge.net/
[10] I. Gupta, A-M. Kermarrec, and A. J. Ganesh. Efficient epidemic-style protocols for reliable and scalable multicast. SRDS, 2002.
[11] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. VLDB, 2003.
[12] J. Jannotti, D. K. Gifford, K. L. Johnson, F. Kaashoek, and J. W. O'Toole. Overcast: reliable multicasting with an overlay network. OSDI, 2002.
[13] D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: high bandwidth data dissemination using an overlay mesh. SOSP, 2003.
[14] B. Krishnamachari, D. Estrin, and S. Wicker. The impact of data aggregation in wireless sensor networks. DEBS, 2002.
[15] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TAG: a tiny aggregation service for ad-hoc sensor networks. OSDI, 2002.
[16] K. Marzullo and M. D. Wood. Tools for constructing distributed reactive systems. Cornell University Technical Report 91-1193, 1991.
[17] K. Nagaraja, X. Li, B. Zhang, R. Bianchini, R. P. Martin, and T. D. Nguyen. Using fault injection and modeling to evaluate the performability of cluster-based services. USITS, 2003.
[18] S. Nath, A. Deshpande, Y. Ke, P. B. Gibbons, B. Karp, and S. Seshan. IrisNet: an architecture for compute-intensive wide-area sensor network services. Intel Research Technical Report IRP-TR-02-10, 2002.
[19] L. Peterson, T. Anderson, D. Culler, and T. Roscoe. A blueprint for introducing disruptive technology into the Internet. HotNets-I, 2002.
[20] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-level multicast using content-addressable networks. NGC, 2001.
[21] S. Rhea, T. Roscoe, and J. Kubiatowicz. Structured peer-to-peer overlays need application-driven benchmarks. IPTPS, 2003.
[22] T. Roscoe, R. Mortier, P. Jardetzky, and S. Hand. InfoSpect: using a logic language for system health monitoring in distributed systems. SIGOPS European Workshop, 2002.
[23] T. Roscoe, L. Peterson, S. Karlin, and M. Wawrzoniak. A simple common sensor interface for PlanetLab. PlanetLab Design Note PDN-03-010, 2003.
[24] N. Spring, D. Wetherall, and T. Anderson. Scriptroute: a public internet measurement facility. USITS, 2003.
[25] I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup service for internet applications. SIGCOMM, 2001.
[26] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R. K. Iyer. NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors. 4th IEEE Intl. Computer Performance and Dependability Symposium, 2000.
[27] D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J. Minden. A survey of active network research. IEEE Communications, 1997.
[28] R. van Renesse and K. Birman. Scalable management and data mining using Astrolabe. IPTPS, 2002.
[29] M. Wawrzoniak, L. Peterson, and T. Roscoe. Sophia: an information plane for networked systems. To appear in HotNets-II, 2003.
[30] M. Welsh, D. Culler, and E. Brewer. SEDA: an architecture for well-conditioned, scalable internet services. SOSP, 2001.
[31] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. OSDI, 2002.
[32] E. Zegura, K. Calvert, and S. Bhattacharjee. How to model an internetwork. INFOCOM, 1996.
[33] B. Y. Zhao, L. Huang, S. C. Rhea, J. Stribling, A. D. Joseph, and J. Kubiatowicz. Tapestry: a resilient global-scale overlay for service deployment. To appear in IEEE Journal on Selected Areas in Communications, 2003.
[34] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz. Bayeux: an architecture for scalable and fault-tolerant wide-area data dissemination. NOSSDAV, 2001.