Abstract—In recent years, cloud storage systems have seen an enormous rise in usage. However, despite their popularity and importance as
underlying infrastructure for more complex cloud services, today’s cloud storage systems do not account for compliance with regulatory,
organizational, or contractual data handling requirements by design. Since legislation increasingly responds to rising data protection
and privacy concerns, complying with data handling requirements becomes a crucial property for cloud storage systems. We present
PRADA, a practical approach to account for compliance with data handling requirements in key-value based cloud storage systems. To
achieve this goal, PRADA introduces a transparent data handling layer, which empowers clients to request specific data handling
requirements and enables operators of cloud storage systems to comply with them. We implement PRADA on top of the distributed
database Cassandra and show in our evaluation that complying with data handling requirements in cloud storage systems is practical
in real-world cloud deployments as used for microblogging, data sharing in the Internet of Things, and distributed email storage.
Index Terms—cloud computing, data handling, compliance, distributed databases, privacy, public policy issues
1 INTRODUCTION
Duration requirements impose restrictions on the storage duration of data. The Sarbanes-Oxley Act (SOX) [42], e.g., requires accounting firms to retain records relevant to audits and reviews for seven years. In contrast, the Payment Card Industry Data Security Standard (PCI DSS) [38] limits the storage duration of cardholder data to the time necessary for business, legal, or regulatory purposes, after which it has to be deleted. A similar approach, coined “the right to be forgotten”, is actively being discussed and turned into legislation in the EU and Argentina [37], [43].

Traits requirements further define how data should be stored. For example, the US Health Insurance Portability and Accountability Act (HIPAA) [36] requires health data to be securely deleted before disposing of or reusing a storage medium. Likewise, for the banking and financial services industry, the Gramm-Leach-Bliley Act (GLBA) [3] requires the proper encryption of customer data. Additionally, to protect against theft or seizure, clients may choose to store their data only on volatile [44] or fully encrypted [4] storage.

Operator perspective. The support of DHRs presents clear business incentives to cloud storage operators, as it opens new markets and eases compliance with regulation.

Business incentives are given by the unique selling point that DHRs present to the untapped market of clients that are currently unable to outsource their data to cloud storage systems due to unfulfillable DHRs [9]. Indeed, cloud providers have already adapted to some carefully selected requirements. To be able to sell its services to the US government, e.g., Google created the segregated “Google Apps for Government” and had it certified at the FISMA-Moderate level, which enables use by US federal agencies [41]. Furthermore, cloud providers open data centers around the world to address location requirements of clients [7]. From a different perspective, regional clouds, e.g., the envisioned “Europe-only” cloud [45], aim at increasing governance and control over data. Additionally, offering clients more control over their data reduces risks for loss of reputation and credibility [46].

Compliance with legislation is important for operators independent of specific business goals and incentives. As an example, the business associate agreement of HIPAA [36] requires the operator to comply with the same requirements as its clients when transmitting electronic health records [1]. Furthermore, the EU’s General Data Protection Regulation [37] requires data controllers from outside the EU that process data originating from the EU to follow DHRs.

Future requirements. DHRs are likely to change and evolve, just as legislation and technology change and evolve over time. Location requirements, for example, only developed once cloud storage systems began to span multiple geographic regions. As anticipating all possible future changes in DHRs is impossible, it is crucial that support for DHRs in cloud storage systems can easily adapt to new requirements.

Formalizing data handling requirements. To also support future requirements and storage architectures, we base our approach on a formalized understanding of DHRs that also covers yet unforeseen DHRs. To this end, we distinguish between different types of DHRs and consider the different properties which storage nodes (can) support for a given type of DHR. This makes it possible to compute the set of eligible nodes for a specified type of DHR, i.e., those nodes that offer the properties requested by the client.

A simple example for a type of DHR is storage location. In this example, the properties consist of all possible storage locations, and nodes whose storage location is equal to the one requested by the clients are considered eligible. In a more complicated example, we consider the security level of full-disk encryption as DHR type. Here, the properties range from 0 bits (no encryption) to different bits of security (e.g., 192 bits or 256 bits), with more bits of security offering a higher security level [47]. In this case, all storage nodes that provide at least the security level requested by the client are considered eligible to store the data.

By allowing clients to combine different types of DHRs and to specify a set of required properties (e.g., different storage locations) for each type, we provide them with powerful means to express DHRs. We detail how clients can combine different types in Section 4 and how we integrate DHRs into Cassandra’s query language in Section 8.
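To make this formalization concrete, the following minimal sketch (our illustration, not code from PRADA itself; all names are placeholders) computes the set of eligible nodes for a combination of DHR types: a node qualifies if, for every requested type, it offers at least one of the requested properties.

import java.util.*;

public class EligibilitySketch {
    /**
     * nodeCapabilities: per node, the properties it supports for each DHR type,
     *                   e.g., "location" -> {"DE"}, "encryption" -> {"AES-256"}.
     * requirements:     per requested DHR type, the set of acceptable properties.
     */
    public static Set<String> eligibleNodes(Map<String, Map<String, Set<String>>> nodeCapabilities,
                                            Map<String, Set<String>> requirements) {
        Set<String> eligible = new HashSet<>();
        for (Map.Entry<String, Map<String, Set<String>>> node : nodeCapabilities.entrySet()) {
            boolean satisfiesAllTypes = true;
            for (Map.Entry<String, Set<String>> requirement : requirements.entrySet()) {
                Set<String> offered = node.getValue()
                        .getOrDefault(requirement.getKey(), Collections.emptySet());
                // at least one of the requested properties must be offered for this type
                if (Collections.disjoint(offered, requirement.getValue())) {
                    satisfiesAllTypes = false;
                    break;
                }
            }
            if (satisfiesAllTypes) {
                eligible.add(node.getKey());
            }
        }
        return eligible;
    }
}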
2.3 Goals
Our analysis of real-world demands for DHRs based on legislation, business interests, and future trends emphasizes the importance of supporting DHRs in distributed cloud storage. We now derive a set of goals that any approach addressing this challenge should fulfill:

Comprehensiveness: To address a wide range of DHRs, the approach should work with any DHRs that can be expressed as properties of storage nodes and support the combination of different DHRs. In particular, it should support the requirements in Section 2.2 and be able to adapt to new DHRs.

Minimal performance effort: Cloud storage systems are highly optimized and trimmed for performance. Thus, the impact of DHR support on the performance of a cloud storage system should be minimized.

Cluster balance: In existing cloud storage systems, the storage load of nodes can easily be balanced to increase performance. Despite having to respect DHRs (and thus limiting the set of possible storage nodes), the storage load of individual storage nodes should be kept balanced.

Coexistence: Not all data will be accompanied by DHRs. Hence, data without DHRs should not be impaired by supporting DHRs, i.e., it should be stored in the same way as in a traditional cloud storage system.

3 SYSTEM OVERVIEW
The problem that has prevented support for DHRs so far stems from the common pattern used to address data in key-value based cloud storage systems: Data is addressed, and hence also partitioned (i.e., distributed to the nodes in the cluster), using a designated key. Yet, the responsible node (according to the key) for storing a data item will often not fulfill the client’s DHRs. Thus, the challenge addressed in this paper is how to realize compliance with DHRs and still allow for key-based data access.

To tackle this challenge, the core idea of PRADA is to add an indirection layer on top of a cloud storage system. We illustrate how we integrate this layer into existing cloud storage systems in Figure 2. If a responsible node cannot comply with stated DHRs, we store the data at a different node, called the target node. To enable the lookup of data, the responsible node then keeps a reference to this target node.
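The following simplified sketch illustrates this core idea on the create path; the class and helper names are ours and do not correspond to PRADA's actual code.

import java.util.HashMap;
import java.util.Map;

/** Simplified view of the indirection layer at a responsible node. */
public class IndirectionCreateSketch {
    private final Map<String, byte[]> localData = new HashMap<>();   // data stored as usual
    private final Map<String, String> relayStore = new HashMap<>();  // key -> target node reference

    /** Called on the node responsible for 'key' according to key-based partitioning. */
    public void create(String key, byte[] value, boolean compliesWithDhrs, String targetNode) {
        if (compliesWithDhrs) {
            // the responsible node itself fulfills the DHRs: store as in an unmodified system
            localData.put(key, value);
        } else {
            // otherwise, store the data on a DHR-compliant target node
            sendToTargetNode(targetNode, key, value);
            // and keep a reference so that key-based lookups can still locate the data
            relayStore.put(key, targetNode);
        }
    }

    private void sendToTargetNode(String node, String key, byte[] value) {
        // network transfer omitted in this sketch
    }
}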
nodes. More advanced systems support additional mechanisms, e.g., load balancing over geographical regions [28]. Since our focus in this paper lies on proving the general feasibility of supporting data compliance in cloud storage, we focus on the properties of key-value based storage.

Re-balancing a cluster by moving data between nodes can be handled by PRADA similarly to moving data in case of node failures (Section 7). In the following, we thus focus on the challenge of load balancing in PRADA at insert time. Here, we focus on the equal distribution of data with DHRs to target nodes, as load balancing for indirection information is achieved with the standard mechanisms of key-value based cloud storage systems, e.g., by hashing identifier keys.

In contrast to key-value based cloud storage systems, load balancing in PRADA is more challenging: When processing a create request, the eligible target nodes are not necessarily equal, as they might be able to comply with different DHRs. Hence, some eligible nodes might offer rarely supported but often requested requirements. As foreseeing future demands is notoriously difficult [49], we suggest making the load-balancing decision based on the current load of the nodes. This requires all nodes to be aware of the load of the other nodes in the cluster. Cloud storage systems typically already exchange this information or can be extended to do so, e.g., using efficient gossiping protocols [50]. We utilize this load information in PRADA as follows. To select the target nodes from the set of eligible nodes, PRADA first checks if any of the responsible nodes are also eligible to become a target node and selects those as target nodes first. This allows us to increase the performance of CRUD requests, as we avoid the indirection layer in this case. For the remaining target nodes, PRADA selects those with the lowest load. To have access to more timely load information, each node in PRADA keeps track of all create requests it is involved with. Whenever a node itself stores new data or sends data for storage to other nodes, it increments temporary load information for the respective node. This temporary load information is used to bridge the time between two updates of the load information. As we will show in Section 9.2, this approach enables PRADA to adapt to different usage scenarios and quickly achieve a (nearly) equally balanced storage cluster.
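A condensed sketch of this selection strategy (our simplification; names and data structures are illustrative only): eligible responsible nodes are preferred, the remaining slots go to the eligible nodes with the lowest current load, and a local estimator bridges the time between two gossip-based load updates.

import java.util.*;

public class TargetSelectionSketch {
    private final Map<String, Long> gossipedLoad = new HashMap<>();  // last load reported via gossiping
    private final Map<String, Long> localEstimate = new HashMap<>(); // bytes sent/stored since that report

    private long currentLoad(String node) {
        return gossipedLoad.getOrDefault(node, 0L) + localEstimate.getOrDefault(node, 0L);
    }

    /** Select the target nodes for one create request with DHRs. */
    public List<String> selectTargets(Set<String> eligibleNodes, List<String> responsibleNodes,
                                      int replicas, long dataSize) {
        List<String> targets = new ArrayList<>();
        // 1) prefer responsible nodes that are also eligible to avoid the indirection layer
        for (String node : responsibleNodes) {
            if (eligibleNodes.contains(node) && targets.size() < replicas) {
                targets.add(node);
            }
        }
        // 2) fill the remaining slots with the least loaded eligible nodes
        eligibleNodes.stream()
                .filter(node -> !targets.contains(node))
                .sorted(Comparator.comparingLong(this::currentLoad))
                .limit(replicas - targets.size())
                .forEach(targets::add);
        // 3) update the temporary load information until the next gossip round arrives
        for (String node : targets) {
            localEstimate.merge(node, dataSize, Long::sum);
        }
        return targets;
    }
}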
7 FAILURE RECOVERY
When introducing support for DHRs to cloud storage systems, we must ensure not to break their failure recovery mechanisms. With PRADA, we specifically need to take care of dangling references, i.e., a reference pointing to a node that does not store the corresponding data, and unreferenced data, i.e., data stored on a target node without an existing corresponding reference. These inconsistencies could stem from failures during the (modified) CRUD operations as well as from actions that are triggered by DHRs, e.g., deletions forced by DHRs require propagation of meta information to the corresponding responsible nodes.

Create. Create requests require transmitting the data to the target node and informing the responsible node to store the reference. Failures during these operations can be recognized by the coordinator by missing acknowledgments. Resolving these errors requires a rollback and/or reissuing actions, e.g., selecting a new target node and updating the reference. Still, the coordinator itself can also fail during the process, which may lead to unreachable data. As such failures happen only rarely, we suggest refraining from including corresponding consistency checks directly into create operations [51]. Instead, the client detects failures of the coordinator due to absent acknowledgments. In this case, the client informs all eligible nodes to remove the unreferenced data and reissues the create operation through another coordinator.

Read. In contrast to the other operations, a read request does not change any state in the cloud storage system. Therefore, read requests are simply reissued in case of a failure (identified by a missing acknowledgment) and no further error handling is required.

Update. Although update operations are more complex than create operations, failure handling can happen analogously. As the responsible node updates its reference only upon reception of the acknowledgment from the new target node, the storage state is guaranteed to remain consistent. Hence, the coordinator can reissue the process using the same or a new target node and perform corresponding cleanups if errors occur. In contrast, if the coordinator fails, information on the potentially new target node is lost. Similar to create operations, the client resolves this error by informing all eligible nodes about the failure. Subsequently, the responsible nodes trigger a cleanup to ensure a consistent storage state.

Delete. When deleting data, a responsible node may delete a reference but fail in informing the target node to carry out the delete. Coordinator and client easily detect this error through the absence of the corresponding acknowledgment. Again, the coordinator or client then issues a broadcast message to delete the corresponding data item from the target node. This approach is more reasonable than directly incorporating consistency checks for all delete operations, as such failures occur only rarely [51].

Propagating target node actions. CRUD operations are triggered by clients. However, data deletion or relocation, which may result in dangling references or unreferenced data, can also be triggered by the storage cluster or by DHRs that, e.g., specify a maximum lifetime for data. To keep the state of the cloud storage system consistent, storage nodes perform data deletion and relocation through a coordinator as well, i.e., they select one of the other nodes to perform update and delete operations on their behalf. Thus, the correct execution of deletion and relocation tasks can be monitored and repair operations can be triggered. In case either the initiating storage node or the coordinator fails while processing a query, the same mitigations as for CRUD operations (triggered by clients) apply. To protect against rare cases in which both the initiating storage node and the coordinator fail while processing an operation, storage system operators can optionally employ commit logs, e.g., based on Cassandra’s atomic batch log [52].
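The following client-side sketch summarizes this behavior for create operations; the interfaces and method names are assumptions for illustration, not PRADA's API. The client waits for the coordinator's acknowledgment and, on a timeout, asks all eligible nodes to drop possibly unreferenced data before reissuing the create through another coordinator.

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateRecoverySketch {
    interface Coordinator { Future<Boolean> create(String key, byte[] value); }
    interface StorageNode { void removeUnreferencedData(String key); }

    /** Returns true once some coordinator acknowledges the create. */
    public static boolean createWithRecovery(String key, byte[] value, List<Coordinator> coordinators,
                                             List<StorageNode> eligibleNodes, long timeoutMillis) {
        for (Coordinator coordinator : coordinators) {
            try {
                if (coordinator.create(key, value).get(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return true; // acknowledged: data and reference are in place
                }
            } catch (TimeoutException | ExecutionException | InterruptedException e) {
                // missing acknowledgment: the coordinator may have failed mid-operation
            }
            // clean up potentially unreferenced data before reissuing via the next coordinator
            for (StorageNode node : eligibleNodes) {
                node.removeUnreferencedData(key);
            }
        }
        return false;
    }
}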
8 IMPLEMENTATION
For the practical evaluation of our approach, we fully implemented PRADA on top of Cassandra [26] (our implementation is available under the Apache License [20]). Cassandra is a distributed database that is actively employed as a key-value based cloud storage system by more than 1500 companies with deployments of up to 75 000 nodes [53] and offers high scalability even over multiple data centers [54], which makes it especially suitable for our scenario. Cassandra also implements advanced features that go beyond simple key-value storage, such as column-orientation and queries over ranges of keys, which allows us to showcase the flexibility and adaptability of our design. Data in Cassandra is divided into multiple logical databases, called key spaces. A key space consists of tables, which are called column families and contain rows and columns. Each node knows about all other nodes and their ranges of the hash table. Cassandra uses the gossiping protocol Scuttlebutt [50] to efficiently distribute this knowledge as well as to detect node failures and exchange node state, e.g., load information. Our implementation is based on Cassandra 2.0.5, but our design conceptually also works with newer versions.

Information stores. PRADA relies on three information stores: the global capability store as well as relay and target stores (cf. Section 3). We implement these as individual key spaces in Cassandra, as detailed in the following. First, we realize the global capability store as a key space that is globally replicated among all nodes (i.e., each node stores a full copy of the capability store to improve the performance of create operations) and initialized at the same time as the cluster. On this key space, we create a column family for each DHR type (as introduced in Section 2.2). When a node joins the cluster, it inserts all DHR properties it supports for each DHR type (as locally configured by the operator of the cloud storage system) into the corresponding column family. This information is then automatically replicated to all other nodes in the cluster by the replication strategy of the corresponding key space. For each regular key space of the database, we additionally create a corresponding relay store and target store as key spaces. Here, the relay store inherits the replication factor and replication strategy from the corresponding regular key space to achieve replication for PRADA as detailed in Section 5, i.e., the relay store will be replicated in exactly the same way as the regular key space. Hence, for each column family in the corresponding key space, we create a column family in the relay key space that acts as the relay store. We follow a similar approach for realizing the target store, i.e., for each key space we create a corresponding key space to store the actual data. For each column family in the original key space, we create an exact copy in the target key space to act as the target store. However, to ensure that DHRs are adhered to, we implement a DHR-agnostic replication mechanism for the target store and use the relay store to address data.

While the global capability store is created when the cluster is initiated, relay and target stores have to be created whenever a new key space or column family is created, respectively. To this end, we hook into Cassandra’s CreateKeyspaceStatement class for detecting requests for creating key spaces and column families and subsequently initialize the corresponding relay and target stores.
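As an illustration of this bootstrapping step, a joining node could publish its locally configured properties roughly as follows; the CapabilityStore interface and method names below are placeholders, not Cassandra's actual API.

import java.util.Map;
import java.util.Set;

/** Sketch: a joining node registers its DHR capabilities in the global capability store. */
public class CapabilityRegistrationSketch {
    interface CapabilityStore {
        // one column family per DHR type, rows keyed by node identifier
        void insert(String dhrType, String nodeId, Set<String> properties);
    }

    public static void registerNode(String nodeId, Map<String, Set<String>> configuredCapabilities,
                                    CapabilityStore capabilityStore) {
        // e.g., "location" -> {"DE"}, "encryption" -> {"AES-256", "AES-192"}
        for (Map.Entry<String, Set<String>> entry : configuredCapabilities.entrySet()) {
            capabilityStore.insert(entry.getKey(), nodeId, entry.getValue());
        }
        // the globally replicated key space then propagates these rows to all other nodes
    }
}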
Creating data and load balancing. To allow clients to specify their DHRs when inserting or updating data, we support the specification of arbitrary DHRs in textual form for INSERT requests (cf. Section 2.1). To this end, we add an optional postfix WITH REQUIREMENTS to INSERT statements by extending the grammar from which parser and lexer for CQL3 [55], the SQL-like query language of Cassandra, are generated using ANTLR [56]. Using the WITH REQUIREMENTS statement, arbitrary DHRs can be specified, separated by the keyword AND, e.g., INSERT ... WITH REQUIREMENTS location = { 'DE', 'FR', 'UK' } AND encryption = { 'AES-256' }. In this example, any node located in Germany, France, or the United Kingdom that supports AES-256 encryption is eligible to store the inserted data. This approach enables users to specify any DHRs covered by our formalized model of DHRs (cf. Section 2.2).
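For illustration, a client could assemble such a statement as a plain CQL string before handing it to the extended query processor; the table layout below is a made-up example, and the exact position of the postfix relative to other optional INSERT clauses is an assumption of this sketch.

public class DhrInsertStatementExample {
    /** Builds an INSERT statement with DHRs as accepted by the extended CQL3 grammar. */
    public static String buildInsert(String keyspace, String table, String id, String payload) {
        return "INSERT INTO " + keyspace + "." + table + " (id, payload) "
             + "VALUES ('" + id + "', '" + payload + "') "
             + "WITH REQUIREMENTS location = { 'DE', 'FR', 'UK' } "
             + "AND encryption = { 'AES-256' }";
    }

    public static void main(String[] args) {
        // the resulting string is parsed by the ANTLR-generated parser with the optional postfix
        System.out.println(buildInsert("demo", "messages", "42", "hello"));
    }
}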
To detect and process DHRs in create requests (cf. Section 4), we extend Cassandra’s QueryProcessor, specifically its getStatement method for processing INSERT requests. When processing requests with DHRs (specified using the WITH REQUIREMENTS statement), we base our selection of eligible nodes on the global capability store. Nodes are eligible to store data with a given set of DHRs if they provide at least one of the specified properties for each requested type (e.g., one out of multiple permitted locations). We prioritize nodes that Cassandra would pick without DHRs, as this speeds up reads for the corresponding key later on, and otherwise choose nodes according to our load balancer (cf. Section 6). Our load balancing implementation relies on Cassandra’s gossiping mechanism [26], which maintains a map of all nodes together with their corresponding loads. We access this information using Cassandra’s getLoadInfo method and extend the load information with local estimators for load changes. Whenever a node sends a create request or stores data itself, we update the corresponding local estimator with the size of the inserted data. To this end, we hook into the methods that are called when data is modified locally or forwarded to other nodes, i.e., the corresponding methods in Cassandra’s ModificationStatement, RowMutationVerbHandler, and StorageProxy classes as well as our methods for processing requests with DHRs.

Reading data. To allow reading redirected data as described in Section 4, we modify Cassandra’s ReadVerbHandler class for processing read requests at the responsible node. This handler is called whenever a node receives a read request from the coordinator and allows us to check whether the current node holds a reference to another target node for the requested key by locally checking the corresponding column family within the relay store. If no reference exists, the node continues with a standard read operation. Otherwise, the node forwards a modified read request to one deterministically selected target node (cf. Section 5) using Cassandra’s sendOneWay method, in which it requests the data from the respective target on behalf of the coordinator. Subsequently, the target node sends the data directly to the coordinator node (whose identifier is included in the request). To correctly resolve references to data for which the coordinator of a query is also the responsible node, we additionally add corresponding checks to the LocalReadRunnable subclass of StorageProxy.
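The following condensed sketch mirrors that read path; the RelayStore-style map and the forwardToTargetNode/readLocally helpers are placeholders rather than Cassandra's actual interfaces.

import java.util.HashMap;
import java.util.Map;

/** Simplified read handling at the responsible node: resolve a relay reference if one exists. */
public class ReadPathSketch {
    private final Map<String, String> relayStore = new HashMap<>(); // key -> target node
    private final Map<String, byte[]> localData = new HashMap<>();

    public byte[] handleRead(String key, String coordinatorId) {
        String targetNode = relayStore.get(key);
        if (targetNode == null) {
            return localData.get(key);                 // no reference: standard read operation
        }
        // reference found: ask the target node to answer directly to the coordinator
        forwardToTargetNode(targetNode, key, coordinatorId);
        return null;                                    // the reply is sent by the target node
    }

    private void forwardToTargetNode(String node, String key, String coordinatorId) {
        // network call omitted in this sketch
    }
}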
9 EVALUATION
We perform benchmarks to quantify query completion times, storage space, and consumed traffic. Furthermore, we study PRADA’s load behavior through simulation and show PRADA’s applicability in three real-world use cases.
and 585 B for a payload of 400 B. Each additional replica increases the required storage space by roughly 90%. PRADA adds only an additional relative storage overhead of roughly 38% on top of an overhead of more than 136% already added by Cassandra. When considering larger payload sizes, the storage overhead of PRADA becomes negligible, e.g., when

Fig. 8. Traffic vs. replication. Data without DHRs is not affected by PRADA. Replicas increase the traffic overhead introduced by DHRs.

storage locations. We simulate a cluster of 1000 nodes that are geographically distributed according to the IP address ranges of Amazon Web Services [62] (North America: 64%, Europe: 17%, Asia-Pacific: 16%, South America: 2%, China:

Fig. 10. Load balance vs. node distribution. PRADA's load balance shows optimal behavior, but depends on node distribution.
resulting from the small number of operations (only 150 mailboxes) and huge differences in mailbox sizes, ranging from 35 to 28 465 messages. While we cannot derive a definitive statement (at the 99% confidence level) from these results, the mean QCTs for fetching the overview of a mailbox seem to suggest a notable yet acceptable overhead for using PRADA. When considering the fetching of individual messages, we observe an overhead of 70% for PRADA’s indirection step, increasing QCTs from 97 to 164 ms. Hence, we can provide compliance with DHRs for email storage with a reasonable increase of 67 ms for fetching individual emails and a likely increase in the time required for generating an overview of all emails in the mailbox in the order of 28%.

IoT platform. The Internet of Things (IoT) leads to a massive growth of collected data, which is often stored in the cloud [70], [71]. The literature proposes to attach per-data-item DHRs to IoT data to preserve privacy [31], [71], [72]. To study the applicability of PRADA in this setting, we collected frequency and size of authentic IoT data from the IoT data sharing platform dweet.io [73]. Our data set contains 1.84 million IoT messages of size 72 B to 9.73 KB from 2889 devices. To protect the privacy of people monitored by these devices, we replaced all payload information with random data. For each device, we uniformly at random assign one of the storage locations as DHR for the collected data.

In Figure 11 (right), we depict the mean QCTs per operation of Cassandra and PRADA for retrieving the overview of all IoT data for each of the 2889 devices as well as for accessing 10 000 randomly selected single IoT messages. The varying amount of sensor data that different IoT devices offer leads to a slightly varying QCT for fetching IoT device data overviews, similar to mailbox fetching (see above). The overhead for adhering to DHRs with PRADA in the IoT use case totals 41% for fetching a device’s IoT data overview and 57% for a single IoT message, corresponding to the 0.5 RTT added by the indirection layer. We consider these overheads still appropriate given the inherently private nature of most IoT data and the accompanying privacy risks, which can be mitigated with DHRs.

10 RELATED WORK
We categorize our discussion of related work by the different types of DHRs they address. In addition, we discuss approaches for providing assurance that DHRs are respected.

Distributing storage of data. To enforce storage location requirements, a class of related work proposes to split data between different storage systems. Wüchner et al. [12] and CloudFilter [18] add proxies between clients and operators to transparently distribute data to different cloud storage providers according to DHRs, while NubiSave [19] allows combining resources of different storage providers to fulfill individual redundancy or security requirements of clients. These approaches can treat individual storage systems only as black boxes. Consequently, they do not support fine-grained DHRs within the database system itself and are limited to a small subset of DHRs.

Sticky policies. Similar to our idea of specifying DHRs, the concept of sticky policies proposes to attach usage and obligation policies to data when it is outsourced to third parties [31]. In contrast to our work, sticky policies mainly concern the purpose of data usage, which is primarily realized using access control. One interesting aspect of sticky policies is their ability to make them “stick” to the corresponding data using cryptographic measures, which could also be applied to PRADA. In the context of cloud computing, sticky policies have been proposed to express requirements on the security and geographical location of storage nodes [32]. However, so far it has been unclear how this could be realized efficiently in a large and distributed storage system. With PRADA, we present a mechanism to achieve this goal.

Policy enforcement. To enforce privacy policies when accessing data in the cloud, Betgé-Brezetz et al. [13] monitor access of virtual machines to shared file systems and only allow access if a virtual machine is policy compliant. In contrast, Itani et al. [14] propose to leverage cryptographic coprocessors to realize trusted and isolated execution environments and enforce the encryption of data. Espling et al. [15] aim at allowing service owners to influence the placement of their virtual machines in the cloud to realize specific geographical deployments or provide redundancy through avoiding co-location of critical components. These approaches are orthogonal to our work, as they primarily focus on enforcing policies when processing data, while PRADA addresses the challenge of supporting DHRs when storing data in cloud storage systems.

Location-based storage. Focusing exclusively on location requirements, Peterson et al. [16] introduce the concept of data sovereignty with the goal to provide a guarantee that a provider stores data at claimed physical locations, e.g., based on measurements of network delay. Similarly, LoSt [17] enables verification of storage locations based on a challenge-response protocol. In contrast, PRADA focuses on the more fundamental challenge of realizing the functionality for supporting arbitrary DHRs.

Controlling placement of data. Primarily focusing on distributed hash tables, SkipNet [74] enables control over data placement by organizing data mainly based on string names. Similarly, Zhou et al. [75] utilize location-based node identifiers to encode physical topology and hence provide control over data placement at a coarse grain. In contrast to PRADA, these approaches need to modify the identifier of data based on the DHRs, i.e., knowledge about the specific DHRs of data is required to locate it. Targeting distributed object-based storage systems, CRUSH [76] relies on hierarchies and data distribution policies to control the placement of data in a cluster. These data distribution policies are bound to a predefined hierarchy and hence cannot offer the same flexibility as PRADA. Similarly, Tenant-Defined Storage [77] enables clients to store their data according to DHRs. However, and in contrast to PRADA, all data of one client needs to have the same DHRs. Finally, SwiftAnalytics [78] proposes to control the placement of data to speed up big data analytics. Here, data can only be put directly on specified nodes without the abstraction provided by PRADA’s approach of supporting DHRs.

Hippocratic databases. Hippocratic databases store data together with a purpose specification [79]. This allows them to enforce the purposeful use of data using access control and to realize data retention after a certain period. Using Hippocratic databases, it is possible to create an auditing framework to check if a database is complying with its data
disclosure policies [33]. However, this concept only considers a single database and not a distributed setting where storage nodes have different data handling capabilities.

Assurance. To provide assurance that storage operators adhere to DHRs, de Oliveira et al. [80] propose an architecture to automate the monitoring of compliance with DHRs when transferring data. Bacon et al. [34] and Pasquier et al. [5] show that this can also be achieved using information flow control. Similarly, Massonet et al. [41] propose a monitoring and audit logging architecture in which the infrastructure provider and service provider collaborate to ensure data location compliance. These approaches are orthogonal to our work and could be used to verify that operators of cloud storage systems run PRADA in an honest and error-free way.

11 DISCUSSION AND CONCLUSION
Accounting for compliance with data handling requirements (DHRs), i.e., offering control over where and how data is stored in the cloud, becomes increasingly important due to legislative, organizational, or customer demands. Despite these incentives, practical solutions to address this need in existing cloud storage systems are scarce. In this paper, we proposed PRADA, which allows clients to specify a comprehensive set of fine-grained DHRs and enables cloud storage operators to enforce them. Our results show that we can indeed achieve support for DHRs in cloud storage systems. Of course, the additional protection and flexibility offered by DHRs comes at a price: We observe a moderate increase in query completion times, while achieving constant storage overhead and upholding a near-optimal storage load balance even in challenging scenarios.

Importantly, however, data without DHRs is not impaired by PRADA. When a responsible node receives a request for data without DHRs, it can locally check that no DHRs apply to this data: For create requests, the INSERT statement either contains DHRs or not, which can be checked efficiently and locally. In contrast, for read, update, and delete requests, PRADA performs a simple and local check whether a reference to a target node for this data exists. The overhead for this step is comparable to executing an if statement and hence negligible. Only if a reference exists, which implies that the data was inserted with DHRs, does PRADA induce overhead. Our extensive evaluation confirms that, for data without DHRs, PRADA shows the same query completion times, storage overhead, and bandwidth consumption as an unmodified Cassandra system in all considered settings (indistinguishable results for Cassandra and PRADA* in Figures 5 to 8). Consequently, clients can choose (even at the granularity of individual data items) whether DHRs are worth a modest performance decrease.

PRADA’s design is built upon a transparent indirection layer, which effectively handles compliance with DHRs. This design decision limits our solution in three ways. First, the overall achievable load balance depends on how well the nodes’ capabilities to fulfill certain DHRs match the actual DHRs requested by the clients. However, for a given scenario, PRADA is able to achieve a nearly optimal load balance, as shown in Figure 10. Second, indirection introduces an overhead of 0.5 round-trip times for reads, updates, and deletes. Further reducing this overhead is only possible by encoding some DHRs in the key used for accessing data [23], but this requires everyone accessing the data to be in possession of the DHRs, which is unlikely. A fundamental improvement could be achieved by replicating all relay information to all nodes of the cluster, but this is viable only for small cloud storage systems and does not offer scalability. We argue that indirection can likely not be avoided, but still pose this as an open research question. Third, the question arises how clients can be assured that an operator indeed enforces their DHRs and no errors in the specification of DHRs have occurred. This has been widely studied [16], [33], [41], [80] and the proposed approaches such as audit logging, information flow control, and provable data possession can also be applied to PRADA.

While we limit our approach for providing data compliance in cloud storage to key-value based storage systems, the key-value paradigm is also general enough to provide a practical starting point for storage systems that are based on different paradigms. Additionally, the design of PRADA is flexible enough to extend (with some more work) to other storage systems. For example, Google’s globally distributed database Spanner (a multi-version database rather than a key-value store) allows applications to influence data locality (to increase performance) by carefully choosing keys [28]. PRADA could be applied to Spanner by modifying Spanner’s approach of directory-bucketed key-value mappings. Likewise, PRADA could realize data compliance for distributed main memory databases, e.g., VoltDB, where tables of data are partitioned horizontally into shards [29]. Here, the decision on how to distribute shards over the nodes in the cluster could be taken with DHRs in mind. Similar adaptations could be performed for commercial products, such as Clustrix [30], that separate data into slices.

To conclude, PRADA resolves a situation, i.e., missing support for DHRs, that is disadvantageous to both clients and operators of cloud storage systems. By offering the enforcement of arbitrary DHRs when storing data in cloud storage systems, PRADA enables the use of cloud storage systems for a wide range of clients who previously had to refrain from outsourcing storage, e.g., due to compliance with applicable data protection legislation. At the same time, we empower cloud storage operators with a practical and efficient solution to handle differences in regulations and offer their services to new clients.

ACKNOWLEDGMENTS
The authors would like to thank Annika Seufert for support with the simulations. This work has received funding from the European Union’s Horizon 2020 research and innovation program 2014-2018 under grant agreement No. 644866 (SSICLOPS) and from the Excellence Initiative of the German federal and state governments. This article reflects only the authors’ views and the funding agencies are not responsible for any use that may be made of the information it contains.

REFERENCES
[1] R. Gellman, “Privacy in the Clouds: Risks to Privacy and Confidentiality from Cloud Computing,” World Privacy Forum, 2009.
[2] S. Pearson and A. Benameur, “Privacy, Security and Trust Issues Arising from Cloud Computing,” in IEEE CloudCom, 2010.
[3] United States Congress, “Gramm-Leach-Bliley Act (GLBA),” Pub.L. 106-102, 113 Stat. 1338, 1999.
[4] D. Song et al., “Cloud Data Protection for the Masses,” Computer, vol. 45, no. 1, 2012.
[5] T. F. J. M. Pasquier et al., “Information Flow Audit for PaaS Clouds,” in IEEE IC2E, 2016.
[6] V. Abramova and J. Bernardino, “NoSQL Databases: MongoDB vs Cassandra,” in C3S2E, 2013.
[7] R. Buyya, R. Ranjan, and R. N. Calheiros, “InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services,” in ICA3PP, 2010.
[8] D. Bernstein et al., “Blueprint for the Intercloud - Protocols and Formats for Cloud Computing Interoperability,” in ICIW, 2009.
[9] Intel IT Center, “Peer Research: What’s Holding Back the Cloud?” Tech. Rep., 2012.
[10] D. Catteddu and G. Hogben, “Cloud Computing – Benefits, Risks and Recommendations for Information Security,” European Network and Information Security Agency (ENISA), 2009.
[11] M. Henze, R. Hummen, and K. Wehrle, “The Cloud Needs Cross-Layer Data Handling Annotations,” in IEEE S&P Workshops, 2013.
[12] T. Wüchner, S. Müller, and R. Fischer, “Compliance-Preserving Cloud Storage Federation Based on Data-Driven Usage Control,” in IEEE CloudCom, 2013.
[13] S. Betgé-Brezetz et al., “End-to-End Privacy Policy Enforcement in Cloud Infrastructure,” in IEEE CloudNet, 2013.
[14] W. Itani, A. Kayssi, and A. Chehab, “Privacy as a Service: Privacy-Aware Data Storage and Processing in Cloud Computing Architectures,” in IEEE DASC, 2009.
[15] D. Espling et al., “Modeling and Placement of Cloud Services with Internal Structure,” IEEE Transactions on Cloud Computing, vol. 4, no. 4, 2014.
[16] Z. N. J. Peterson, M. Gondree, and R. Beverly, “A Position Paper on Data Sovereignty: The Importance of Geolocating Data in the Cloud,” in USENIX HotCloud, 2011.
[17] G. J. Watson et al., “LoSt: Location Based Storage,” in ACM CCSW, 2012.
[18] I. Papagiannis and P. Pietzuch, “CloudFilter: Practical Control of Sensitive Data Propagation to the Cloud,” in ACM CCSW, 2012.
[19] J. Spillner, J. Müller, and A. Schill, “Creating optimal cloud storage systems,” Future Generation Computer Systems, vol. 29, no. 4, 2013.
[20] RWTH Aachen University, “PRADA Source Code Repository,” https://github.com/COMSYS/prada.
[21] M. Henze et al., “Practical Data Compliance for Cloud Storage,” in IEEE IC2E, 2017.
[22] P. Samarati and S. De Capitani di Vimercati, “Data Protection in Outsourcing Scenarios: Issues and Directions,” in ACM ASIACCS, 2010.
[23] M. Henze et al., “Towards Data Handling Requirements-aware Cloud Computing,” in IEEE CloudCom, 2013.
[24] A. Greenberg et al., “The Cost of a Cloud: Research Problems in Data Center Networks,” SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, 2008.
[25] G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” in ACM SOSP, 2007.
[26] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage System,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, 2010.
[27] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems, 3rd ed. Springer, 2011.
[28] J. C. Corbett et al., “Spanner: Google’s Globally-distributed Database,” in USENIX OSDI, 2012.
[29] M. Stonebraker and A. Weisberg, “The VoltDB Main Memory DBMS,” IEEE Data Eng. Bull., vol. 36, no. 2, 2013.
[30] Clustrix, Inc., “Scale-Out NewSQL Database in the Cloud,” http://www.clustrix.com/.
[31] S. Pearson and M. C. Mont, “Sticky Policies: An Approach for Managing Privacy across Multiple Parties,” Computer, vol. 44, no. 9, 2011.
[32] S. Pearson, Y. Shen, and M. Mowbray, “A Privacy Manager for Cloud Computing,” in CloudCom, 2009.
[33] R. Agrawal et al., “Auditing Compliance with a Hippocratic Database,” in VLDB, 2004.
[34] J. Bacon et al., “Information Flow Control for Secure Cloud Computing,” IEEE Transactions on Network and Service Management, vol. 11, no. 1, 2014.
[35] U. Rührmair et al., “Virtual Proofs of Reality and their Physical Implementation,” in IEEE S&P, 2015.
[36] United States Congress, “Health Insurance Portability and Accountability Act of 1996 (HIPAA),” Pub.L. 104-191, 110 Stat. 1936, 1996.
[37] “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation),” L119, 4/5/2016, 2016.
[38] PCI Security Standards Council, “Payment Card Industry (PCI) Data Security Standard – Requirements and Security Assessment Procedures, Version 3.1,” 2015.
[39] R. Buyya et al., “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,” Future Generation Computer Systems, vol. 25, no. 6, 2009.
[40] T. Ristenpart et al., “Hey, You, Get off of My Cloud: Exploring Information Leakage in Third-party Compute Clouds,” in ACM CCS, 2009.
[41] P. Massonet et al., “A Monitoring and Audit Logging Architecture for Data Location Compliance in Federated Cloud Infrastructures,” in IEEE IPDPS Workshops, 2011.
[42] United States Congress, “Sarbanes-Oxley Act (SOX),” Pub.L. 107-204, 116 Stat. 745, 2002.
[43] A. Mantelero, “The EU Proposal for a General Data Protection Regulation and the roots of the ‘right to be forgotten’,” Computer Law & Security Review, vol. 29, no. 3, 2013.
[44] H. A. Jäger et al., “Sealed Cloud – A Novel Approach to Safeguard against Insider Attacks,” in Trusted Cloud Computing. Springer, 2014.
[45] J. Singh et al., “Regional clouds: technical considerations,” University of Cambridge, Computer Laboratory, Tech. Rep. UCAM-CL-TR-863, 2014.
[46] S. Pearson, “Taking Account of Privacy when Designing Cloud Computing Services,” in Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing. IEEE, 2009.
[47] E. Barker, “Recommendation for Key Management – Part 1: General (Revision 4),” NIST Special Publication 800-57, National Institute of Standards and Technology, 2015.
[48] A. Corradi, L. Leonardi, and F. Zambonelli, “Diffusive Load-Balancing Policies for Dynamic Applications,” IEEE Concurrency, vol. 7, no. 1, 1999.
[49] L. Rainie and J. Anderson, “The Future of Privacy,” Pew Research Center, http://www.pewinternet.org/2014/12/18/future-of-privacy/, 2014.
[50] R. van Renesse et al., “Efficient Reconciliation and Flow Control for Anti-entropy Protocols,” in LADIS, 2008.
[51] J. K. Nidzwetzki and R. H. Güting, “Distributed SECONDO: A Highly Available and Scalable System for Spatial Data Processing,” in SSTD, 2015.
[52] DataStax, Inc., “Apache Cassandra 2.0 Documentation,” http://docs.datastax.com/en/cassandra/2.0/pdf/cassandra20.pdf, 2016, last updated: 21 January 2016.
[53] The Apache Software Foundation, “Apache Cassandra,” https://cassandra.apache.org/.
[54] T. Rabl et al., “Solving Big Data Challenges for Enterprise Application Performance Management,” Proc. VLDB Endow., vol. 5, no. 12, 2012.
[55] The Apache Software Foundation, “Cassandra Query Language (CQL) v3.3.1,” https://cassandra.apache.org/doc/cql3/CQL.html, 2015.
[56] T. J. Parr and R. W. Quong, “ANTLR: A predicated-LL(k) parser generator,” Software: Practice and Experience, vol. 25, no. 7, 1995.
[57] S. Hemminger, “Network Emulation with NetEm,” in linux.conf.au, 2005.
[58] S. Sanghrajka, N. Mahajan, and R. Sion, “Cloud Performance Benchmark Series: Network Performance – Amazon EC2,” Cloud Commons Online, 2011.
[59] J. Walker, “HotBits: Genuine Random Numbers,” http://www.fourmilab.ch/hotbits.
[60] IBM Corporation, “IBM ILOG CPLEX Optimization Studio,” http://www.ibm.com/software/products/en/ibmilogcpleoptistud/.
[61] Dropbox Inc., “400 million strong,” https://blogs.dropbox.com/dropbox/2015/06/400-million-users/, 2015.
[62] Amazon Web Services, Inc., “Amazon Web Services General Reference Version 1.0,” http://docs.aws.amazon.com/general/latest/gr/aws-general.pdf.
[63] Microsoft Corporation, “Microsoft Azure Cloud Computing Platform & Services,” https://azure.microsoft.com/.
[64] “Twissandra,” http://twissandra.com/.
[65] J. Yang and J. Leskovec, “Patterns of Temporal Variation in Online Media,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011.
[66] K. Giannakouris and M. Smihily, “Cloud computing – statistics on the use by enterprises,” Eurostat Statistics Explained, 2014.
[67] The Apache Software Foundation, “Apache James Project,” http://james.apache.org/.
[68] “ElasticInbox – Scalable Email Store for the Cloud,” http://www.elasticinbox.com/.
[69] B. Klimt and Y. Yang, “Introducing the Enron Corpus,” in First Conference on Email and Anti-Spam (CEAS), 2004.
[70] M. Henze et al., “A Comprehensive Approach to Privacy in the Cloud-based Internet of Things,” FGCS, 2016.
[71] M. Henze et al., “CPPL: Compact Privacy Policy Language,” in ACM WPES, 2016.
[72] T. Pasquier et al., “Data-centric access control for cloud computing,” in ACM SACMAT, 2016.
[73] Bug Labs, Inc., “dweet.io – Share your thing like it ain’t no thang.” https://dweet.io/.
[74] N. J. A. Harvey et al., “SkipNet: A Scalable Overlay Network with Practical Locality Properties,” in USENIX USITS, 2003.
[75] S. Zhou, G. R. Ganger, and P. A. Steenkiste, “Location-based Node IDs: Enabling Explicit Locality in DHTs,” Carnegie Mellon University, Tech. Rep., 2003.
[76] S. A. Weil et al., “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data,” in ACM/IEEE SC, 2006.
[77] P.-J. Maenhaut et al., “A Dynamic Tenant-Defined Storage System for Efficient Resource Management in Cloud Applications,” Journal of Network and Computer Applications, 2017.
[78] L. Rupprecht et al., “SwiftAnalytics: Optimizing Object Storage for Big Data Analytics,” in IEEE IC2E, 2017.
[79] R. Agrawal et al., “Hippocratic Databases,” in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB). VLDB Endowment, 2002.
[80] A. De Oliveira et al., “Monitoring Personal Data Transfers in the Cloud,” in IEEE CloudCom, 2013.

Jens Hiller received the B.Sc. and M.Sc. degrees in Computer Science from RWTH Aachen University. He is a researcher at the Chair of Communication and Distributed Systems (COMSYS) at RWTH Aachen University, Germany. His research focuses on efficient secure communication, including improvements for today’s predominant security protocols as well as mechanisms for secure communication in the Internet of Things.

Erik Mühmer received the B.Sc. and M.Sc. degrees in Computer Science from RWTH Aachen University. He was a research assistant at the Chair of Communication and Distributed Systems (COMSYS) at RWTH Aachen University. Since 2017, he has been a researcher and Ph.D. student at the Chair of Operations Research at RWTH Aachen University. His research interest lies in operations research with a focus on scheduling and robustness.

Jan Henrik Ziegeldorf received the Diploma (equiv. M.Sc.) and PhD degrees in Computer Science from RWTH Aachen University. He is a post-doctoral researcher at the Chair of Communication and Distributed Systems (COMSYS) at RWTH Aachen University, Germany. His research focuses on secure computations and their application in practical privacy-preserving systems, e.g., for digital currencies and machine learning.