Abstract—In recent years, cloud storage systems have seen an enormous rise in usage. However, despite their popularity and importance as
underlying infrastructure for more complex cloud services, today’s cloud storage systems do not account for compliance with regulatory,
organizational, or contractual data handling requirements by design. Since legislation increasingly responds to rising data protection
and privacy concerns, complying with data handling requirements becomes a crucial property for cloud storage systems. We present
PRADA, a practical approach to account for compliance with data handling requirements in key-value based cloud storage systems. To
achieve this goal, PRADA introduces a transparent data handling layer, which empowers clients to request specific data handling
requirements and enables operators of cloud storage systems to comply with them. We implement PRADA on top of the distributed
database Cassandra and show in our evaluation that complying with data handling requirements in cloud storage systems is practical
in real-world cloud deployments as used for microblogging, data sharing in the Internet of Things, and distributed email storage.
Index Terms—cloud computing, data handling, compliance, distributed databases, privacy, public policy issues
1 INTRODUCTION
Duration requirements impose restrictions on the storage duration of data. The Sarbanes-Oxley Act (SOX) [42], e.g., requires accounting firms to retain records relevant to audits and reviews for seven years. In contrast, the Payment Card Industry Data Security Standard (PCI DSS) [38] limits the storage duration of cardholder data to the time necessary for business, legal, or regulatory purposes, after which it has to be deleted. A similar approach, coined “the right to be forgotten”, is actively being discussed and turned into legislation in the EU and Argentina [37], [43].

Traits requirements further define how data should be stored. For example, the US Health Insurance Portability and Accountability Act (HIPAA) [36] requires health data to be securely deleted before disposing of or reusing a storage medium. Likewise, for the banking and financial services industry, the Gramm-Leach-Bliley Act (GLBA) [3] requires the proper encryption of customer data. Additionally, to protect against theft or seizure, clients may choose to store their data only on volatile [44] or fully encrypted [4] storage.

Operator perspective. The support of DHRs presents clear business incentives to cloud storage operators, as it opens new markets and eases compliance with regulation.

Business incentives are given by the unique selling point that DHRs present to the untapped market of clients that are currently unable to outsource their data to cloud storage systems due to unfulfillable DHRs [9]. Indeed, cloud providers have already adapted to some carefully selected requirements. To be able to sell its services to the US government, e.g., Google created the segregated “Google Apps for Government” and had it certified at the FISMA-Moderate level, which enables use by US federal agencies [41]. Furthermore, cloud providers open data centers around the world to address location requirements of clients [7]. From a different perspective, regional clouds, e.g., the envisioned “Europe-only” cloud [45], aim at increasing governance and control over data. Additionally, offering clients more control over their data reduces risks for loss of reputation and credibility [46].

Compliance with legislation is important for operators independent of specific business goals and incentives. As an example, the business associate agreement of HIPAA [36] requires the operator to comply with the same requirements as its clients when transmitting electronic health records [1]. Furthermore, the EU’s General Data Protection Regulation [37] requires data controllers from outside the EU that process data originating from the EU to follow DHRs.

Future requirements. DHRs are likely to change and evolve, just as legislation and technology change and evolve over time. Location requirements, for example, only developed once cloud storage systems began to span multiple geographic regions. As anticipating all possible future changes in DHRs is impossible, it is crucial that support for DHRs in cloud storage systems can easily adapt to new requirements.

Formalizing data handling requirements. To also support future requirements and storage architectures, we base our approach on a formalized understanding of DHRs that also covers yet unforeseen DHRs. To this end, we distinguish between different types of DHRs and consider the different properties which storage nodes (can) support for a given type of DHR. This makes it possible to compute the set of eligible nodes for a specified type of DHR, i.e., those nodes that offer the properties requested by the client.

A simple example for a type of DHR is storage location. In this example, the properties consist of all possible storage locations, and nodes whose storage location is equal to the one requested by the clients are considered eligible. In a more complicated example, we consider the security level of full-disk encryption as DHR type. Here, the properties range from 0 bits (no encryption) to different bits of security (e.g., 192 bits or 256 bits), with more bits of security offering a higher security level [47]. In this case, all storage nodes that provide at least the security level requested by the client are considered eligible to store the data.

By allowing clients to combine different types of DHRs and to specify a set of required properties (e.g., different storage locations) for each type, we provide them with powerful means to express DHRs. We detail how clients can combine different types in Section 4 and how we integrate DHRs into Cassandra’s query language in Section 8.
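To make this formalization concrete, the following minimal sketch (our illustration, not code from PRADA itself; all names are placeholders) computes the set of eligible nodes for a combination of DHR types: a node qualifies if, for every requested type, it offers at least one of the requested properties.

import java.util.*;

public class EligibilitySketch {
    /**
     * nodeCapabilities: per node, the properties it supports for each DHR type,
     *                   e.g., "location" -> {"DE"}, "encryption" -> {"AES-256"}.
     * requirements:     per requested DHR type, the set of acceptable properties.
     */
    public static Set<String> eligibleNodes(Map<String, Map<String, Set<String>>> nodeCapabilities,
                                            Map<String, Set<String>> requirements) {
        Set<String> eligible = new HashSet<>();
        for (Map.Entry<String, Map<String, Set<String>>> node : nodeCapabilities.entrySet()) {
            boolean satisfiesAllTypes = true;
            for (Map.Entry<String, Set<String>> requirement : requirements.entrySet()) {
                Set<String> offered = node.getValue()
                        .getOrDefault(requirement.getKey(), Collections.emptySet());
                // at least one of the requested properties must be offered for this type
                if (Collections.disjoint(offered, requirement.getValue())) {
                    satisfiesAllTypes = false;
                    break;
                }
            }
            if (satisfiesAllTypes) {
                eligible.add(node.getKey());
            }
        }
        return eligible;
    }
}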
2.3 Goals
Our analysis of real-world demands for DHRs based on legislation, business interests, and future trends emphasizes the importance of supporting DHRs in distributed cloud storage. We now derive a set of goals that any approach addressing this challenge should fulfill:

Comprehensiveness: To address a wide range of DHRs, the approach should work with any DHRs that can be expressed as properties of storage nodes and support the combination of different DHRs. In particular, it should support the requirements in Section 2.2 and be able to adapt to new DHRs.

Minimal performance effort: Cloud storage systems are highly optimized and trimmed for performance. Thus, the impact of DHR support on the performance of a cloud storage system should be minimized.

Cluster balance: In existing cloud storage systems, the storage load of nodes can easily be balanced to increase performance. Despite having to respect DHRs (and thus limiting the set of possible storage nodes), the storage load of individual storage nodes should be kept balanced.

Coexistence: Not all data will be accompanied by DHRs. Hence, data without DHRs should not be impaired by supporting DHRs, i.e., it should be stored in the same way as in a traditional cloud storage system.

3 SYSTEM OVERVIEW
The problem that has prevented support for DHRs so far stems from the common pattern used to address data in key-value based cloud storage systems: Data is addressed, and hence also partitioned (i.e., distributed to the nodes in the cluster), using a designated key. Yet, the responsible node (according to the key) for storing a data item will often not fulfill the client’s DHRs. Thus, the challenge addressed in this paper is how to realize compliance with DHRs and still allow for key-based data access.

To tackle this challenge, the core idea of PRADA is to add an indirection layer on top of a cloud storage system. We illustrate how we integrate this layer into existing cloud storage systems in Figure 2. If a responsible node cannot comply with stated DHRs, we store the data at a different node, called the target node. To enable the lookup of data, the responsible node then keeps a reference to this target node.
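The following simplified sketch illustrates this core idea on the create path; the class and helper names are ours and do not correspond to PRADA's actual code.

import java.util.HashMap;
import java.util.Map;

/** Simplified view of the indirection layer at a responsible node. */
public class IndirectionCreateSketch {
    private final Map<String, byte[]> localData = new HashMap<>();   // data stored as usual
    private final Map<String, String> relayStore = new HashMap<>();  // key -> target node reference

    /** Called on the node responsible for 'key' according to key-based partitioning. */
    public void create(String key, byte[] value, boolean compliesWithDhrs, String targetNode) {
        if (compliesWithDhrs) {
            // the responsible node itself fulfills the DHRs: store as in an unmodified system
            localData.put(key, value);
        } else {
            // otherwise, store the data on a DHR-compliant target node
            sendToTargetNode(targetNode, key, value);
            // and keep a reference so that key-based lookups can still locate the data
            relayStore.put(key, targetNode);
        }
    }

    private void sendToTargetNode(String node, String key, byte[] value) {
        // network transfer omitted in this sketch
    }
}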
nodes. More advanced systems support additional mechanisms, e.g., load balancing over geographical regions [28]. Since our focus in this paper lies on proving the general feasibility of supporting data compliance in cloud storage, we focus on the properties of key-value based storage.

Re-balancing a cluster by moving data between nodes can be handled by PRADA similarly to moving data in case of node failures (Section 7). In the following, we thus focus on the challenge of load balancing in PRADA at insert time. Here, we focus on the equal distribution of data with DHRs to target nodes, as load balancing for indirection information is achieved with the standard mechanisms of key-value based cloud storage systems, e.g., by hashing identifier keys.

In contrast to key-value based cloud storage systems, load balancing in PRADA is more challenging: When processing a create request, the eligible target nodes are not necessarily equal, as they might be able to comply with different DHRs. Hence, some eligible nodes might offer rarely supported but often requested requirements. As foreseeing future demands is notoriously difficult [49], we suggest making the load-balancing decision based on the current load of the nodes. This requires all nodes to be aware of the load of the other nodes in the cluster. Cloud storage systems typically already exchange this information or can be extended to do so, e.g., using efficient gossiping protocols [50]. We utilize this load information in PRADA as follows. To select the target nodes from the set of eligible nodes, PRADA first checks if any of the responsible nodes are also eligible to become a target node and selects those as target nodes first. This allows us to increase the performance of CRUD requests, as we avoid the indirection layer in this case. For the remaining target nodes, PRADA selects those with the lowest load. To have access to more timely load information, each node in PRADA keeps track of all create requests it is involved with. Whenever a node itself stores new data or sends data for storage to other nodes, it increments temporary load information for the respective node. This temporary load information is used to bridge the time between two updates of the load information. As we will show in Section 9.2, this approach enables PRADA to adapt to different usage scenarios and quickly achieve a (nearly) equally balanced storage cluster.
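A condensed sketch of this selection strategy (our simplification; names and data structures are illustrative only): eligible responsible nodes are preferred, the remaining slots go to the eligible nodes with the lowest current load, and a local estimator bridges the time between two gossip-based load updates.

import java.util.*;

public class TargetSelectionSketch {
    private final Map<String, Long> gossipedLoad = new HashMap<>();  // last load reported via gossiping
    private final Map<String, Long> localEstimate = new HashMap<>(); // bytes sent/stored since that report

    private long currentLoad(String node) {
        return gossipedLoad.getOrDefault(node, 0L) + localEstimate.getOrDefault(node, 0L);
    }

    /** Select the target nodes for one create request with DHRs. */
    public List<String> selectTargets(Set<String> eligibleNodes, List<String> responsibleNodes,
                                      int replicas, long dataSize) {
        List<String> targets = new ArrayList<>();
        // 1) prefer responsible nodes that are also eligible to avoid the indirection layer
        for (String node : responsibleNodes) {
            if (eligibleNodes.contains(node) && targets.size() < replicas) {
                targets.add(node);
            }
        }
        // 2) fill the remaining slots with the least loaded eligible nodes
        eligibleNodes.stream()
                .filter(node -> !targets.contains(node))
                .sorted(Comparator.comparingLong(this::currentLoad))
                .limit(replicas - targets.size())
                .forEach(targets::add);
        // 3) update the temporary load information until the next gossip round arrives
        for (String node : targets) {
            localEstimate.merge(node, dataSize, Long::sum);
        }
        return targets;
    }
}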
7 FAILURE RECOVERY
When introducing support for DHRs to cloud storage systems, we must ensure not to break their failure recovery mechanisms. With PRADA, we specifically need to take care of dangling references, i.e., a reference pointing to a node that does not store the corresponding data, and unreferenced data, i.e., data stored on a target node without an existing corresponding reference. These inconsistencies could stem from failures during the (modified) CRUD operations as well as from actions that are triggered by DHRs, e.g., deletions forced by DHRs require propagation of meta information to the corresponding responsible nodes.

Create. Create requests require transmitting the data to the target node and informing the responsible node to store the reference. Failures during these operations can be recognized by the coordinator by missing acknowledgments. Resolving these errors requires a rollback and/or reissuing actions, e.g., selecting a new target node and updating the reference. Still, the coordinator itself can also fail during the process, which may lead to unreachable data. As such failures happen only rarely, we suggest refraining from including corresponding consistency checks directly into create operations [51]. Instead, the client detects failures of the coordinator due to absent acknowledgments. In this case, the client informs all eligible nodes to remove the unreferenced data and reissues the create operation through another coordinator.

Read. In contrast to the other operations, a read request does not change any state in the cloud storage system. Therefore, read requests are simply reissued in case of a failure (identified by a missing acknowledgment) and no further error handling is required.

Update. Although update operations are more complex than create operations, failure handling can happen analogously. As the responsible node updates its reference only upon reception of the acknowledgment from the new target node, the storage state is guaranteed to remain consistent. Hence, the coordinator can reissue the process using the same or a new target node and perform corresponding cleanups if errors occur. In contrast, if the coordinator fails, information on the potentially new target node is lost. Similar to create operations, the client resolves this error by informing all eligible nodes about the failure. Subsequently, the responsible nodes trigger a cleanup to ensure a consistent storage state.

Delete. When deleting data, a responsible node may delete a reference but fail in informing the target node to carry out the delete. Coordinator and client easily detect this error through the absence of the corresponding acknowledgment. Again, the coordinator or client then issues a broadcast message to delete the corresponding data item from the target node. This approach is more reasonable than directly incorporating consistency checks for all delete operations, as such failures occur only rarely [51].

Propagating target node actions. CRUD operations are triggered by clients. However, data deletion or relocation, which may result in dangling references or unreferenced data, can also be triggered by the storage cluster or by DHRs that, e.g., specify a maximum lifetime for data. To keep the state of the cloud storage system consistent, storage nodes perform data deletion and relocation through a coordinator as well, i.e., they select one of the other nodes to perform update and delete operations on their behalf. Thus, the correct execution of deletion and relocation tasks can be monitored and repair operations can be triggered. In case either the initiating storage node or the coordinator fails while processing a query, the same mitigations as for CRUD operations (triggered by clients) apply. To protect against rare cases in which both the initiating storage node and the coordinator fail while processing an operation, storage system operators can optionally employ commit logs, e.g., based on Cassandra’s atomic batch log [52].
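The following client-side sketch summarizes this behavior for create operations; the interfaces and method names are assumptions for illustration, not PRADA's API. The client waits for the coordinator's acknowledgment and, on a timeout, asks all eligible nodes to drop possibly unreferenced data before reissuing the create through another coordinator.

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateRecoverySketch {
    interface Coordinator { Future<Boolean> create(String key, byte[] value); }
    interface StorageNode { void removeUnreferencedData(String key); }

    /** Returns true once some coordinator acknowledges the create. */
    public static boolean createWithRecovery(String key, byte[] value, List<Coordinator> coordinators,
                                             List<StorageNode> eligibleNodes, long timeoutMillis) {
        for (Coordinator coordinator : coordinators) {
            try {
                if (coordinator.create(key, value).get(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return true; // acknowledged: data and reference are in place
                }
            } catch (TimeoutException | ExecutionException | InterruptedException e) {
                // missing acknowledgment: the coordinator may have failed mid-operation
            }
            // clean up potentially unreferenced data before reissuing via the next coordinator
            for (StorageNode node : eligibleNodes) {
                node.removeUnreferencedData(key);
            }
        }
        return false;
    }
}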
8 IMPLEMENTATION
For the practical evaluation of our approach, we fully implemented PRADA on top of Cassandra [26] (our implementation is available under the Apache License [20]). Cassandra is a distributed database that is actively employed as a key-value based cloud storage system by more than 1500 companies with deployments of up to 75 000 nodes [53] and offers high scalability even over multiple data centers [54], which makes it especially suitable for our scenario. Cassandra also implements advanced features that go beyond simple key-value storage, such as column-orientation and queries over ranges of keys, which allows us to showcase the flexibility and adaptability of our design. Data in Cassandra is divided into multiple logical databases, called key spaces. A key space consists of tables, which are called column families and contain rows and columns. Each node knows about all other nodes and their ranges of the hash table. Cassandra uses the gossiping protocol Scuttlebutt [50] to efficiently distribute this knowledge as well as to detect node failures and exchange node state, e.g., load information. Our implementation is based on Cassandra 2.0.5, but our design conceptually also works with newer versions.

Information stores. PRADA relies on three information stores: the global capability store as well as relay and target stores (cf. Section 3). We implement these as individual key spaces in Cassandra, as detailed in the following. First, we realize the global capability store as a key space that is globally replicated among all nodes (i.e., each node stores a full copy of the capability store to improve the performance of create operations) and initialized at the same time as the cluster. On this key space, we create a column family for each DHR type (as introduced in Section 2.2). When a node joins the cluster, it inserts all DHR properties it supports for each DHR type (as locally configured by the operator of the cloud storage system) into the corresponding column family. This information is then automatically replicated to all other nodes in the cluster by the replication strategy of the corresponding key space. For each regular key space of the database, we additionally create a corresponding relay store and target store as key spaces. Here, the relay store inherits the replication factor and replication strategy from the corresponding regular key space to achieve replication for PRADA as detailed in Section 5, i.e., the relay store will be replicated in exactly the same way as the regular key space. Hence, for each column family in the corresponding key space, we create a column family in the relay key space that acts as the relay store. We follow a similar approach for realizing the target store, i.e., for each key space we create a corresponding key space to store the actual data. For each column family in the original key space, we create an exact copy in the target key space to act as the target store. However, to ensure that DHRs are adhered to, we implement a DHR-agnostic replication mechanism for the target store and use the relay store to address data.

While the global capability store is created when the cluster is initiated, relay and target stores have to be created whenever a new key space or column family is created, respectively. To this end, we hook into Cassandra’s CreateKeyspaceStatement class for detecting requests for creating key spaces and column families and subsequently initialize the corresponding relay and target stores.
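As an illustration of this bootstrapping step, a joining node could publish its locally configured properties roughly as follows; the CapabilityStore interface and method names below are placeholders, not Cassandra's actual API.

import java.util.Map;
import java.util.Set;

/** Sketch: a joining node registers its DHR capabilities in the global capability store. */
public class CapabilityRegistrationSketch {
    interface CapabilityStore {
        // one column family per DHR type, rows keyed by node identifier
        void insert(String dhrType, String nodeId, Set<String> properties);
    }

    public static void registerNode(String nodeId, Map<String, Set<String>> configuredCapabilities,
                                    CapabilityStore capabilityStore) {
        // e.g., "location" -> {"DE"}, "encryption" -> {"AES-256", "AES-192"}
        for (Map.Entry<String, Set<String>> entry : configuredCapabilities.entrySet()) {
            capabilityStore.insert(entry.getKey(), nodeId, entry.getValue());
        }
        // the globally replicated key space then propagates these rows to all other nodes
    }
}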
Creating data and load balancing. To allow clients to specify their DHRs when inserting or updating data, we support the specification of arbitrary DHRs in textual form for INSERT requests (cf. Section 2.1). To this end, we add an optional postfix WITH REQUIREMENTS to INSERT statements by extending the grammar from which parser and lexer for CQL3 [55], the SQL-like query language of Cassandra, are generated using ANTLR [56]. Using the WITH REQUIREMENTS statement, arbitrary DHRs can be specified, separated by the keyword AND, e.g., INSERT ... WITH REQUIREMENTS location = { 'DE', 'FR', 'UK' } AND encryption = { 'AES-256' }. In this example, any node located in Germany, France, or the United Kingdom that supports AES-256 encryption is eligible to store the inserted data. This approach enables users to specify any DHRs covered by our formalized model of DHRs (cf. Section 2.2).
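For illustration, a client could assemble such a statement as a plain CQL string before handing it to the extended query processor; the table layout below is a made-up example, and the exact position of the postfix relative to other optional INSERT clauses is an assumption of this sketch.

public class DhrInsertStatementExample {
    /** Builds an INSERT statement with DHRs as accepted by the extended CQL3 grammar. */
    public static String buildInsert(String keyspace, String table, String id, String payload) {
        return "INSERT INTO " + keyspace + "." + table + " (id, payload) "
             + "VALUES ('" + id + "', '" + payload + "') "
             + "WITH REQUIREMENTS location = { 'DE', 'FR', 'UK' } "
             + "AND encryption = { 'AES-256' }";
    }

    public static void main(String[] args) {
        // the resulting string is parsed by the ANTLR-generated parser with the optional postfix
        System.out.println(buildInsert("demo", "messages", "42", "hello"));
    }
}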
To detect and process DHRs in create requests (cf. Section 4), we extend Cassandra’s QueryProcessor, specifically its getStatement method for processing INSERT requests. When processing requests with DHRs (specified using the WITH REQUIREMENTS statement), we base our selection of eligible nodes on the global capability store. Nodes are eligible to store data with a given set of DHRs if they provide at least one of the specified properties for each requested type (e.g., one out of multiple permitted locations). We prioritize nodes that Cassandra would pick without DHRs, as this speeds up reads for the corresponding key later on, and otherwise choose nodes according to our load balancer (cf. Section 6). Our load balancing implementation relies on Cassandra’s gossiping mechanism [26], which maintains a map of all nodes together with their corresponding loads. We access this information using Cassandra’s getLoadInfo method and extend the load information with local estimators for load changes. Whenever a node sends a create request or stores data itself, we update the corresponding local estimator with the size of the inserted data. To this end, we hook into the methods that are called when data is modified locally or forwarded to other nodes, i.e., the corresponding methods in Cassandra’s ModificationStatement, RowMutationVerbHandler, and StorageProxy classes as well as our methods for processing requests with DHRs.

Reading data. To allow reading redirected data as described in Section 4, we modify Cassandra’s ReadVerbHandler class for processing read requests at the responsible node. This handler is called whenever a node receives a read request from the coordinator and allows us to check whether the current node holds a reference to another target node for the requested key by locally checking the corresponding column family within the relay store. If no reference exists, the node continues with a standard read operation. Otherwise, the node forwards a modified read request to one deterministically selected target node (cf. Section 5) using Cassandra’s sendOneWay method, in which it requests the data from the respective target on behalf of the coordinator. Subsequently, the target node sends the data directly to the coordinator node (whose identifier is included in the request). To correctly resolve references to data for which the coordinator of a query is also the responsible node, we additionally add corresponding checks to the LocalReadRunnable subclass of StorageProxy.
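The following condensed sketch mirrors that read path; the RelayStore-style map and the forwardToTargetNode/readLocally helpers are placeholders rather than Cassandra's actual interfaces.

import java.util.HashMap;
import java.util.Map;

/** Simplified read handling at the responsible node: resolve a relay reference if one exists. */
public class ReadPathSketch {
    private final Map<String, String> relayStore = new HashMap<>(); // key -> target node
    private final Map<String, byte[]> localData = new HashMap<>();

    public byte[] handleRead(String key, String coordinatorId) {
        String targetNode = relayStore.get(key);
        if (targetNode == null) {
            return localData.get(key);                 // no reference: standard read operation
        }
        // reference found: ask the target node to answer directly to the coordinator
        forwardToTargetNode(targetNode, key, coordinatorId);
        return null;                                    // the reply is sent by the target node
    }

    private void forwardToTargetNode(String node, String key, String coordinatorId) {
        // network call omitted in this sketch
    }
}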
9 EVALUATION
We perform benchmarks to quantify query completion times, storage space, and consumed traffic. Furthermore, we study PRADA’s load behavior through simulation and show PRADA’s applicability in three real-world use cases.
and 585 B for a payload of 400 B. Each additional replica increases the required storage space by roughly 90%. PRADA adds only an additional relative storage overhead of roughly 38% on top of an overhead of more than 136% already added by Cassandra. When considering larger payload sizes, the storage overhead of PRADA becomes negligible, e.g., when

Fig. 8. Traffic vs. replication. Data without DHRs is not affected by PRADA. Replicas increase the traffic overhead introduced by DHRs.

storage locations. We simulate a cluster of 1000 nodes that are geographically distributed according to the IP address ranges of Amazon Web Services [62] (North America: 64%, Europe: 17%, Asia-Pacific: 16%, South America: 2%, China:

Fig. 10. Load balance vs. node distribution. PRADA's load balance shows optimal behavior, but depends on node distribution.
resulting from the small number of operations (only 150 mailboxes) and huge differences in mailbox sizes, ranging from 35 to 28 465 messages. While we cannot derive a definitive statement (at the 99% confidence level) from these results, the mean QCTs for fetching the overview of a mailbox seem to suggest a notable yet acceptable overhead for using PRADA. When considering the fetching of individual messages, we observe an overhead of 70% for PRADA’s indirection step, increasing QCTs from 97 to 164 ms. Hence, we can provide compliance with DHRs for email storage with a reasonable increase of 67 ms for fetching individual emails and a likely increase in the time required for generating an overview of all emails in the mailbox in the order of 28%.

IoT platform. The Internet of Things (IoT) leads to a massive growth of collected data, which is often stored in the cloud [70], [71]. The literature proposes to attach per-data-item DHRs to IoT data to preserve privacy [31], [71], [72]. To study the applicability of PRADA in this setting, we collected frequency and size of authentic IoT data from the IoT data sharing platform dweet.io [73]. Our data set contains 1.84 million IoT messages of size 72 B to 9.73 KB from 2889 devices. To protect the privacy of people monitored by these devices, we replaced all payload information with random data. For each device, we uniformly at random assign one of the storage locations as DHR for the collected data.

In Figure 11 (right), we depict the mean QCTs per operation of Cassandra and PRADA for retrieving the overview of all IoT data for each of the 2889 devices as well as for accessing 10 000 randomly selected single IoT messages. The varying amount of sensor data that different IoT devices offer leads to a slightly varying QCT for fetching IoT device data overviews, similar to mailbox fetching (see above). The overhead for adhering to DHRs with PRADA in the IoT use case totals 41% for fetching a device’s IoT data overview and 57% for a single IoT message, corresponding to the 0.5 RTT added by the indirection layer. We consider these overheads still appropriate given the inherently private nature of most IoT data and the accompanying privacy risks, which can be mitigated with DHRs.

10 RELATED WORK
We categorize our discussion of related work by the different types of DHRs they address. In addition, we discuss approaches for providing assurance that DHRs are respected.

Distributing storage of data. To enforce storage location requirements, a class of related work proposes to split data between different storage systems. Wüchner et al. [12] and CloudFilter [18] add proxies between clients and operators to transparently distribute data to different cloud storage providers according to DHRs, while NubiSave [19] allows combining resources of different storage providers to fulfill individual redundancy or security requirements of clients. These approaches can treat individual storage systems only as black boxes. Consequently, they do not support fine-grained DHRs within the database system itself and are limited to a small subset of DHRs.

Sticky policies. Similar to our idea of specifying DHRs, the concept of sticky policies proposes to attach usage and obligation policies to data when it is outsourced to third parties [31]. In contrast to our work, sticky policies mainly concern the purpose of data usage, which is primarily realized using access control. One interesting aspect of sticky policies is their ability to make them “stick” to the corresponding data using cryptographic measures, which could also be applied to PRADA. In the context of cloud computing, sticky policies have been proposed to express requirements on the security and geographical location of storage nodes [32]. However, so far it has been unclear how this could be realized efficiently in a large and distributed storage system. With PRADA, we present a mechanism to achieve this goal.

Policy enforcement. To enforce privacy policies when accessing data in the cloud, Betgé-Brezetz et al. [13] monitor access of virtual machines to shared file systems and only allow access if a virtual machine is policy compliant. In contrast, Itani et al. [14] propose to leverage cryptographic coprocessors to realize trusted and isolated execution environments and enforce the encryption of data. Espling et al. [15] aim at allowing service owners to influence the placement of their virtual machines in the cloud to realize specific geographical deployments or provide redundancy through avoiding co-location of critical components. These approaches are orthogonal to our work, as they primarily focus on enforcing policies when processing data, while PRADA addresses the challenge of supporting DHRs when storing data in cloud storage systems.

Location-based storage. Focusing exclusively on location requirements, Peterson et al. [16] introduce the concept of data sovereignty with the goal to provide a guarantee that a provider stores data at claimed physical locations, e.g., based on measurements of network delay. Similarly, LoSt [17] enables verification of storage locations based on a challenge-response protocol. In contrast, PRADA focuses on the more fundamental challenge of realizing the functionality for supporting arbitrary DHRs.

Controlling placement of data. Primarily focusing on distributed hash tables, SkipNet [74] enables control over data placement by organizing data mainly based on string names. Similarly, Zhou et al. [75] utilize location-based node identifiers to encode physical topology and hence provide control over data placement at a coarse grain. In contrast to PRADA, these approaches need to modify the identifier of data based on the DHRs, i.e., knowledge about the specific DHRs of data is required to locate it. Targeting distributed object-based storage systems, CRUSH [76] relies on hierarchies and data distribution policies to control the placement of data in a cluster. These data distribution policies are bound to a predefined hierarchy and hence cannot offer the same flexibility as PRADA. Similarly, Tenant-Defined Storage [77] enables clients to store their data according to DHRs. However, and in contrast to PRADA, all data of one client needs to have the same DHRs. Finally, SwiftAnalytics [78] proposes to control the placement of data to speed up big data analytics. Here, data can only be put directly on specified nodes without the abstraction provided by PRADA’s approach of supporting DHRs.

Hippocratic databases. Hippocratic databases store data together with a purpose specification [79]. This allows them to enforce the purposeful use of data using access control and to realize data retention after a certain period. Using Hippocratic databases, it is possible to create an auditing framework to check if a database is complying with its data
disclosure policies [33]. However, this concept only considers a single database and not a distributed setting where storage nodes have different data handling capabilities.

Assurance. To provide assurance that storage operators adhere to DHRs, de Oliveira et al. [80] propose an architecture to automate the monitoring of compliance with DHRs when transferring data. Bacon et al. [34] and Pasquier et al. [5] show that this can also be achieved using information flow control. Similarly, Massonet et al. [41] propose a monitoring and audit logging architecture in which the infrastructure provider and service provider collaborate to ensure data location compliance. These approaches are orthogonal to our work and could be used to verify that operators of cloud storage systems run PRADA in an honest and error-free way.

11 DISCUSSION AND CONCLUSION
Accounting for compliance with data handling requirements (DHRs), i.e., offering control over where and how data is stored in the cloud, becomes increasingly important due to legislative, organizational, or customer demands. Despite these incentives, practical solutions to address this need in existing cloud storage systems are scarce. In this paper, we proposed PRADA, which allows clients to specify a comprehensive set of fine-grained DHRs and enables cloud storage operators to enforce them. Our results show that we can indeed achieve support for DHRs in cloud storage systems. Of course, the additional protection and flexibility offered by DHRs comes at a price: We observe a moderate increase in query completion times, while achieving constant storage overhead and upholding a near-optimal storage load balance even in challenging scenarios.

Importantly, however, data without DHRs is not impaired by PRADA. When a responsible node receives a request for data without DHRs, it can locally check that no DHRs apply to this data: For create requests, the INSERT statement either contains DHRs or not, which can be checked efficiently and locally. In contrast, for read, update, and delete requests, PRADA performs a simple and local check whether a reference to a target node for this data exists. The overhead for this step is comparable to executing an if statement and hence negligible. Only if a reference exists, which implies that the data was inserted with DHRs, does PRADA induce overhead. Our extensive evaluation confirms that, for data without DHRs, PRADA shows the same query completion times, storage overhead, and bandwidth consumption as an unmodified Cassandra system in all considered settings (indistinguishable results for Cassandra and PRADA* in Figures 5 to 8). Consequently, clients can choose (even at the granularity of individual data items) whether DHRs are worth a modest performance decrease.

PRADA’s design is built upon a transparent indirection layer, which effectively handles compliance with DHRs. This design decision limits our solution in three ways. First, the overall achievable load balance depends on how well the nodes’ capabilities to fulfill certain DHRs match the actual DHRs requested by the clients. However, for a given scenario, PRADA is able to achieve a nearly optimal load balance, as shown in Figure 10. Second, indirection introduces an overhead of 0.5 round-trip times for reads, updates, and deletes. Further reducing this overhead is only possible by encoding some DHRs in the key used for accessing data [23], but this requires everyone accessing the data to be in possession of the DHRs, which is unlikely. A fundamental improvement could be achieved by replicating all relay information to all nodes of the cluster, but this is viable only for small cloud storage systems and does not offer scalability. We argue that indirection can likely not be avoided, but still pose this as an open research question. Third, the question arises how clients can be assured that an operator indeed enforces their DHRs and no errors in the specification of DHRs have occurred. This has been widely studied [16], [33], [41], [80] and the proposed approaches such as audit logging, information flow control, and provable data possession can also be applied to PRADA.

While we limit our approach for providing data compliance in cloud storage to key-value based storage systems, the key-value paradigm is also general enough to provide a practical starting point for storage systems that are based on different paradigms. Additionally, the design of PRADA is flexible enough to extend (with some more work) to other storage systems. For example, Google’s globally distributed database Spanner (a multi-version database rather than a key-value store) allows applications to influence data locality (to increase performance) by carefully choosing keys [28]. PRADA could be applied to Spanner by modifying Spanner’s approach of directory-bucketed key-value mappings. Likewise, PRADA could realize data compliance for distributed main memory databases, e.g., VoltDB, where tables of data are partitioned horizontally into shards [29]. Here, the decision on how to distribute shards over the nodes in the cluster could be taken with DHRs in mind. Similar adaptations could be performed for commercial products, such as Clustrix [30], that separate data into slices.

To conclude, PRADA resolves a situation, i.e., missing support for DHRs, that is disadvantageous to both clients and operators of cloud storage systems. By offering the enforcement of arbitrary DHRs when storing data in cloud storage systems, PRADA enables the use of cloud storage systems for a wide range of clients who previously had to refrain from outsourcing storage, e.g., due to compliance with applicable data protection legislation. At the same time, we empower cloud storage operators with a practical and efficient solution to handle differences in regulations and offer their services to new clients.

ACKNOWLEDGMENTS
The authors would like to thank Annika Seufert for support with the simulations. This work has received funding from the European Union’s Horizon 2020 research and innovation program 2014-2018 under grant agreement No. 644866 (SSICLOPS) and from the Excellence Initiative of the German federal and state governments. This article reflects only the authors’ views and the funding agencies are not responsible for any use that may be made of the information it contains.

REFERENCES
[1] R. Gellman, “Privacy in the Clouds: Risks to Privacy and Confidentiality from Cloud Computing,” World Privacy Forum, 2009.
[2] S. Pearson and A. Benameur, “Privacy, Security and Trust Issues Arising from Cloud Computing,” in IEEE CloudCom, 2010.
[3] United States Congress, “Gramm-Leach-Bliley Act (GLBA),” Pub.L. 106-102, 113 Stat. 1338, 1999.
[4] D. Song et al., “Cloud Data Protection for the Masses,” Computer, vol. 45, no. 1, 2012.
[5] T. F. J. M. Pasquier et al., “Information Flow Audit for PaaS Clouds,” in IEEE IC2E, 2016.
[6] V. Abramova and J. Bernardino, “NoSQL Databases: MongoDB vs Cassandra,” in C3S2E, 2013.
[7] R. Buyya, R. Ranjan, and R. N. Calheiros, “InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services,” in ICA3PP, 2010.
[8] D. Bernstein et al., “Blueprint for the Intercloud - Protocols and Formats for Cloud Computing Interoperability,” in ICIW, 2009.
[9] Intel IT Center, “Peer Research: What’s Holding Back the Cloud?” Tech. Rep., 2012.
[10] D. Catteddu and G. Hogben, “Cloud Computing – Benefits, Risks and Recommendations for Information Security,” European Network and Information Security Agency (ENISA), 2009.
[11] M. Henze, R. Hummen, and K. Wehrle, “The Cloud Needs Cross-Layer Data Handling Annotations,” in IEEE S&P Workshops, 2013.
[12] T. Wüchner, S. Müller, and R. Fischer, “Compliance-Preserving Cloud Storage Federation Based on Data-Driven Usage Control,” in IEEE CloudCom, 2013.
[13] S. Betgé-Brezetz et al., “End-to-End Privacy Policy Enforcement in Cloud Infrastructure,” in IEEE CloudNet, 2013.
[14] W. Itani, A. Kayssi, and A. Chehab, “Privacy as a Service: Privacy-Aware Data Storage and Processing in Cloud Computing Architectures,” in IEEE DASC, 2009.
[15] D. Espling et al., “Modeling and Placement of Cloud Services with Internal Structure,” IEEE Transactions on Cloud Computing, vol. 4, no. 4, 2014.
[16] Z. N. J. Peterson, M. Gondree, and R. Beverly, “A Position Paper on Data Sovereignty: The Importance of Geolocating Data in the Cloud,” in USENIX HotCloud, 2011.
[17] G. J. Watson et al., “LoSt: Location Based Storage,” in ACM CCSW, 2012.
[18] I. Papagiannis and P. Pietzuch, “CloudFilter: Practical Control of Sensitive Data Propagation to the Cloud,” in ACM CCSW, 2012.
[19] J. Spillner, J. Müller, and A. Schill, “Creating optimal cloud storage systems,” Future Generation Computer Systems, vol. 29, no. 4, 2013.
[20] RWTH Aachen University, “PRADA Source Code Repository,” https://github.com/COMSYS/prada.
[21] M. Henze et al., “Practical Data Compliance for Cloud Storage,” in IEEE IC2E, 2017.
[22] P. Samarati and S. De Capitani di Vimercati, “Data Protection in Outsourcing Scenarios: Issues and Directions,” in ACM ASIACCS, 2010.
[23] M. Henze et al., “Towards Data Handling Requirements-aware Cloud Computing,” in IEEE CloudCom, 2013.
[24] A. Greenberg et al., “The Cost of a Cloud: Research Problems in Data Center Networks,” SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, 2008.
[25] G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” in ACM SOSP, 2007.
[26] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, 2010.
[27] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems, 3rd ed. Springer, 2011.
[28] J. C. Corbett et al., “Spanner: Google’s Globally-distributed Database,” in USENIX OSDI, 2012.
[29] M. Stonebraker and A. Weisberg, “The VoltDB Main Memory DBMS,” IEEE Data Eng. Bull., vol. 36, no. 2, 2013.
[30] Clustrix, Inc., “Scale-Out NewSQL Database in the Cloud,” http://www.clustrix.com/.
[31] S. Pearson and M. C. Mont, “Sticky Policies: An Approach for Managing Privacy across Multiple Parties,” Computer, vol. 44, no. 9, 2011.
[32] S. Pearson, Y. Shen, and M. Mowbray, “A Privacy Manager for Cloud Computing,” in CloudCom, 2009.
[33] R. Agrawal et al., “Auditing Compliance with a Hippocratic Database,” in VLDB, 2004.
[34] J. Bacon et al., “Information Flow Control for Secure Cloud Computing,” IEEE Transactions on Network and Service Management, vol. 11, no. 1, 2014.
[35] U. Rührmair et al., “Virtual Proofs of Reality and their Physical Implementation,” in IEEE S&P, 2015.
[36] United States Congress, “Health Insurance Portability and Accountability Act of 1996 (HIPAA),” Pub.L. 104-191, 110 Stat. 1936, 1996.
[37] “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation),” L119, 4/5/2016, 2016.
[38] PCI Security Standards Council, “Payment Card Industry (PCI) Data Security Standard – Requirements and Security Assessment Procedures, Version 3.1,” 2015.
[39] R. Buyya et al., “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,” Future Generation Computer Systems, vol. 25, no. 6, 2009.
[40] T. Ristenpart et al., “Hey, You, Get off of My Cloud: Exploring Information Leakage in Third-party Compute Clouds,” in ACM CCS, 2009.
[41] P. Massonet et al., “A Monitoring and Audit Logging Architecture for Data Location Compliance in Federated Cloud Infrastructures,” in IEEE IPDPS Workshops, 2011.
[42] United States Congress, “Sarbanes-Oxley Act (SOX),” Pub.L. 107-204, 116 Stat. 745, 2002.
[43] A. Mantelero, “The EU Proposal for a General Data Protection Regulation and the roots of the ‘right to be forgotten’,” Computer Law & Security Review, vol. 29, no. 3, 2013.
[44] H. A. Jäger et al., “Sealed Cloud – A Novel Approach to Safeguard against Insider Attacks,” in Trusted Cloud Computing. Springer, 2014.
[45] J. Singh et al., “Regional clouds: technical considerations,” University of Cambridge, Computer Laboratory, Tech. Rep. UCAM-CL-TR-863, 2014.
[46] S. Pearson, “Taking Account of Privacy when Designing Cloud Computing Services,” in Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing. IEEE, 2009.
[47] E. Barker, “Recommendation for Key Management – Part 1: General (Revision 4),” NIST Special Publication 800-57, National Institute of Standards and Technology, 2015.
[48] A. Corradi, L. Leonardi, and F. Zambonelli, “Diffusive Load-Balancing Policies for Dynamic Applications,” IEEE Concurrency, vol. 7, no. 1, 1999.
[49] L. Rainie and J. Anderson, “The Future of Privacy,” Pew Research Center, http://www.pewinternet.org/2014/12/18/future-of-privacy/, 2014.
[50] R. van Renesse et al., “Efficient Reconciliation and Flow Control for Anti-entropy Protocols,” in LADIS, 2008.
[51] J. K. Nidzwetzki and R. H. Güting, “Distributed SECONDO: A Highly Available and Scalable System for Spatial Data Processing,” in SSTD, 2015.
[52] DataStax, Inc., “Apache Cassandra 2.0 Documentation,” http://docs.datastax.com/en/cassandra/2.0/pdf/cassandra20.pdf, 2016, last updated: 21 January 2016.
[53] The Apache Software Foundation, “Apache Cassandra,” https://cassandra.apache.org/.
[54] T. Rabl et al., “Solving Big Data Challenges for Enterprise Application Performance Management,” Proc. VLDB Endow., vol. 5, no. 12, 2012.
[55] The Apache Software Foundation, “Cassandra Query Language (CQL) v3.3.1,” https://cassandra.apache.org/doc/cql3/CQL.html, 2015.
[56] T. J. Parr and R. W. Quong, “ANTLR: A predicated-LL(k) parser generator,” Software: Practice and Experience, vol. 25, no. 7, 1995.
[57] S. Hemminger, “Network Emulation with NetEm,” in linux.conf.au, 2005.
[58] S. Sanghrajka, N. Mahajan, and R. Sion, “Cloud Performance Benchmark Series: Network Performance – Amazon EC2,” Cloud Commons Online, 2011.
[59] J. Walker, “HotBits: Genuine Random Numbers,” http://www.fourmilab.ch/hotbits.
[60] IBM Corporation, “IBM ILOG CPLEX Optimization Studio,” http://www.ibm.com/software/products/en/ibmilogcpleoptistud/.
[61] Dropbox Inc., “400 million strong,” https://blogs.dropbox.com/dropbox/2015/06/400-million-users/, 2015.
[62] Amazon Web Services, Inc., “Amazon Web Services General Reference Version 1.0,” http://docs.aws.amazon.com/general/latest/gr/aws-general.pdf.
[63] Microsoft Corporation, “Microsoft Azure Cloud Computing Platform & Services,” https://azure.microsoft.com/.
[64] “Twissandra,” http://twissandra.com/.
[65] J. Yang and J. Leskovec, “Patterns of Temporal Variation in Online Media,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011.
[66] K. Giannakouris and M. Smihily, “Cloud computing – statistics on the use by enterprises,” Eurostat Statistics Explained, 2014.
[67] The Apache Software Foundation, “Apache James Project,” http://james.apache.org/.
[68] “ElasticInbox – Scalable Email Store for the Cloud,” http://www.elasticinbox.com/.
[69] B. Klimt and Y. Yang, “Introducing the Enron Corpus,” in First Conference on Email and Anti-Spam (CEAS), 2004.
[70] M. Henze et al., “A Comprehensive Approach to Privacy in the Cloud-based Internet of Things,” FGCS, 2016.
[71] M. Henze et al., “CPPL: Compact Privacy Policy Language,” in ACM WPES, 2016.
[72] T. Pasquier et al., “Data-centric access control for cloud computing,” in ACM SACMAT, 2016.
[73] Bug Labs, Inc., “dweet.io – Share your thing like it ain’t no thang.” https://dweet.io/.
[74] N. J. A. Harvey et al., “SkipNet: A Scalable Overlay Network with Practical Locality Properties,” in USENIX USITS, 2003.
[75] S. Zhou, G. R. Ganger, and P. A. Steenkiste, “Location-based Node IDs: Enabling Explicit Locality in DHTs,” Carnegie Mellon University, Tech. Rep., 2003.
[76] S. A. Weil et al., “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data,” in ACM/IEEE SC, 2006.
[77] P.-J. Maenhaut et al., “A Dynamic Tenant-Defined Storage System for Efficient Resource Management in Cloud Applications,” Journal of Network and Computer Applications, 2017.
[78] L. Rupprecht et al., “SwiftAnalytics: Optimizing Object Storage for Big Data Analytics,” in IEEE IC2E, 2017.
[79] R. Agrawal et al., “Hippocratic Databases,” in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 2002.
[80] A. De Oliveira et al., “Monitoring Personal Data Transfers in the Cloud,” in IEEE CloudCom, 2013.

Jens Hiller received the B.Sc. and M.Sc. degrees in Computer Science from RWTH Aachen University. He is a researcher at the Chair of Communication and Distributed Systems (COMSYS) at RWTH Aachen University, Germany. His research focuses on efficient secure communication, including improvements for today’s predominant security protocols as well as mechanisms for secure communication in the Internet of Things.

Erik Mühmer received the B.Sc. and M.Sc. degrees in Computer Science from RWTH Aachen University. He was a research assistant at the Chair of Communication and Distributed Systems (COMSYS) at RWTH Aachen University. Since 2017, he has been a researcher and Ph.D. student at the Chair of Operations Research at RWTH Aachen University. His research interest lies in operations research with a focus on scheduling and robustness.

Jan Henrik Ziegeldorf received the Diploma (equiv. M.Sc.) and PhD degrees in Computer Science from RWTH Aachen University. He is a post-doctoral researcher at the Chair of Communication and Distributed Systems (COMSYS) at RWTH Aachen University, Germany. His research focuses on secure computations and their application in practical privacy-preserving systems, e.g., for digital currencies and machine learning.