Trends in Data Replication Strategies: A Survey
Data Grids allow many organizations and individuals to share their data across geographically distant sites. Nowadays, a huge amount of data is produced in all scientific fields and, to enhance
collaboration and data sharing, it is necessary to make this data available to as many nodes
of the grid as possible. Data replication is the technique used to provide this availability.
Moreover, it improves access time and reduces the bandwidth used. Recently, data replication
has received considerable attention and several new algorithms have been developed. This
article provides an overview of the state-of-the-art techniques of data replication. We identify
the advantages and disadvantages of these strategies and discuss their performance.
Souravlas S. and Sifaleras A., "Trends in data replication strategies: A survey", International Journal of Parallel, Emergent and Distributed Systems, Taylor & Francis Publications, Vol. 34, No. 2, pp. 222-239, 2019. The final publication is available at Taylor & Francis Publications via http://dx.doi.org/10.1080/17445760.2017.1401073

1. Introduction
In basic replication, data replicas provide read-only access to the table data that originates from a primary (master) site. Applications can query local data replicas to avoid network access, regardless of network availability. The main difference in advanced replication is that the applications are allowed to update the table replicas [13].
Another important application of data replication is disaster recovery, which
refers to programs designed to get a business back up and running after an un-
expected interruption of services. A disaster recovery plan defines the response to a disruptive event in order to restore crucial business functions. The recovery process includes the restoration or relocation of resources such as servers and storage systems, the reestablishment of the system's functionality, data recovery and synchronization, and
finally, restoration of business functions [14].
Many more applications of data replication can be described: biologists need to share and access molecular structure data, earth scientists handle and need remote access to huge data sets, while data replication across different servers has been used as a means of increasing data availability; the Google and Ceph file systems are such examples. Recently, software architectures that facilitate the exchange of data replicas have been developed; an example is found in [15].
Data replication can be considered as a system optimization technique that tries
to increase the hit ratio, that is, the number of requests completed at each round
divided by the total number of requests, reduce the average job execution time
(AJET), guarantee high data availability and optimize the use of the network
bandwidth.
There are many issues involved in the development of an effective data replication strategy. (1) Topology: Each replication strategy has to consider the grid topology for which it has been designed [16]. Examples include binary tree structures, arbitrary structures where each node can be connected to a number of other nodes without following any rule, torus structures, peer-to-peer topologies, etc. Interested readers can find a nice survey focusing on the issue of grid topologies in [17]. (2) Available space: The data replication strategies need to consider the available storage space before creating a replica. In several algorithms found in the literature, replacement strategies based on a discipline (FIFO, LRU, etc.) are used when the available storage is limited. (3) Total replication cost: The total replication cost generally involves the time required to execute the designed replication strategy and the memory this strategy requires. A good scheme should be designed in such a way that the replication benefits outweigh the time and memory losses. (4) Decisions to be taken: In all data replication strategies, two important decisions need to be taken: (i) the choice of the proper files for replication, and (ii) the replica placement policy, i.e., the choice of the target nodes to store the replicas and the time when replication will take place.
In this survey, we will classify the data replication schemes based on the last issue, decision making, and discuss the metrics used to choose the replicas
and the placement policies. Our motivation is to present the state-of-the-art data
replication strategies from the metric point of view and compare them, so that the
researchers can include all the appropriate metrics in their future work. To make the
survey more complete, we include a separate paragraph devoted to the comparison
between the presented schemes, where the other issues are also discussed. The
metrics proposed in the literature consider two important principles: temporal locality and spatial locality. The temporal locality principle states that the files with high recent demand will probably be requested again soon, with even higher rates. The spatial locality principle states that a file related to recently accessed files will probably be requested soon. Also, there is a third principle, called geographical locality [18], which states that the files accessed by some users of a grid are likely to be requested by their neighbors as well. This principle formed the basis for a
new metric, the file scope, introduced in [19].
The rest of the paper is organized as follows: In Section 2, we review and compare
the most important techniques for selecting the proper files for replication. The
techniques are divided into time-based (using the principle of temporal locality) and
space-based (using the principle of spatial locality), according to the criteria they
use (temporal or spatial). The file scope will be discussed in a separate paragraph,
since it is based on geographical criteria. In Section 3, the most important techniques for selecting the target nodes for the replicas are reviewed and compared. Then,
Section 4 presents the basic metrics used to evaluate the performance of data
replication strategies. Finally, Section 5 presents aspects of future work on data
replication.
2. Strategies for selecting files for replication

The strategies proposed in the literature for selecting the proper files for data replication fall into two major groups, depending on the principle they use to support this selection: the time-based strategies and the space-based strategies. In this section, we present the most representative algorithms of both families. At the end, we refer to the geography-based metric of file scope, which is based on the geographical position of the nodes.
2.1 Time-based strategies
Generally, the time-based strategies try to record the behavior of the grid nodes towards each file, in order to decide which files need to be replicated. We can further divide the time-based strategies into two subclasses: the static strategies, which assume that the users' behavior remains unchanged, at least over a considerable amount of time, and the dynamic strategies, which assume that the user behavior changes at regular intervals. The static strategies study the user behavior over a single, undivided period of time. On the contrary, the dynamic strategies study the changes in user behavior by dividing the total time into smaller periods, referred to as time slots, and studying the users' behavior separately within each slot. In this subsection, we study the most representative static and dynamic time-based strategies.
2.1.1 Static strategies
Ranganathan and Foster [20], in an early approach to static time-based replication strategies, implemented six different replication strategies, namely, No Replication, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replication, and Fast Spread. The first strategy simply implies that there is no replication or caching, and serves as a baseline for comparison. In the Best Client strategy, each node maintains a history for each file that it stores, recording information such as the number of requests for this file as well as the requesting nodes. Each node compares the number of requests for each file to a predefined threshold. If the threshold is exceeded, the node that has issued the maximum number of requests receives a replica. Thus, all files for which the threshold is exceeded have one replica stored at their best client, and the metric used to select a file for replication is the number of requests compared against the predefined threshold.
However, the importance of the replicas is not taken into account: what if the existing replicas are more important than the new ones?
The authors of EFS aimed mainly at addressing the issue of replica replacement (this will be further discussed in Section 4), but they also added factors other than the number of requests to identify the replicas, such as the replica sizes and the frequency specific time interval. These factors were later used in other approaches as well.
The EFS strategy replaces a group of replicas based on the value of two metrics:
the group value (GV) and the replica value (RV). Specifically, a group is replaced
only if its value is smaller than the value of the requested replica. The GV and RV
values are computed as follows:
GV = \frac{\sum_{i=1}^{n} NOR_i}{\sum_{i=1}^{n} S_i} + \frac{\sum_{i=1}^{n} NORFSTI_i}{FSTI} + \frac{1}{CT - \frac{1}{n}\sum_{i=1}^{n} LRT_i}    (2)

RV = \frac{NOR}{S} + \frac{NORFSTI}{FSTI} + \frac{1}{CT - LRT}    (3)
where n is the number of replicas in the group, NOR_i is the number of requests for replica i in the group, S_i is the size of replica i in the group, FSTI is the frequency specific time interval, NORFSTI_i is the number of requests for replica i in the group within the FSTI, CT is the current time, LRT_i is the last request time of replica i in the group, NOR is the number of requests for a specific replica, S is the size of this replica, NORFSTI is the number of requests for this replica within the FSTI, and LRT is the last request time of this replica. The smallest RV and GV values among all replicas indicate the least important replicas and groups, while the largest values indicate the most important ones.
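To make the computation concrete, the following Python sketch evaluates Equations (2) and (3) for a candidate group and a requested replica; the data structures and function names are illustrative rather than part of the EFS specification.

from dataclasses import dataclass
from typing import List

@dataclass
class Replica:
    size: float          # S: replica size
    nor: int             # NOR: total number of requests
    nor_fsti: int        # NORFSTI: requests within the frequency specific time interval
    last_request: float  # LRT: last request time

def replica_value(r: Replica, fsti: float, ct: float) -> float:
    """RV = NOR/S + NORFSTI/FSTI + 1/(CT - LRT), Equation (3)."""
    return r.nor / r.size + r.nor_fsti / fsti + 1.0 / (ct - r.last_request)

def group_value(group: List[Replica], fsti: float, ct: float) -> float:
    """GV aggregates the group's requests and sizes and uses the average last
    request time of its members, Equation (2)."""
    n = len(group)
    avg_lrt = sum(r.last_request for r in group) / n
    return (sum(r.nor for r in group) / sum(r.size for r in group)
            + sum(r.nor_fsti for r in group) / fsti
            + 1.0 / (ct - avg_lrt))

def should_replace(group: List[Replica], requested: Replica,
                   fsti: float, ct: float) -> bool:
    """A group is replaced only if its value is smaller than that of the requested replica."""
    return group_value(group, fsti, ct) < replica_value(requested, fsti, ct)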
In [22], the authors introduced the FairShare Replication (FSR) strategy, which balances the load and storage usage of the grid servers. It takes the number of requests and the load on the nodes into consideration before deciding whether to store the files. Specifically, they use a hierarchical topology, where the children of the same
same node are considered as siblings. Then, the strategy is implemented as follows:
When a client requests a file, the request is forwarded to its parent. If the data is
found there, it is transferred back to the client, otherwise the request is transferred
to the sibling node. If the data is not found in any of the siblings, the request is
transferred one level up and the process continues in a similar manner. The decision
regarding the files to be replicated is based on the data access frequency. The
system maintains a global workload table G. During specified time intervals, G is
processed to get a cumulative table (fileId, ClientId, Frequency), where frequency
is the number of accesses. To decide which files to replicate, the algorithm uses the
average access frequency defined as follows:
F_{avg} = \frac{\sum freq}{n}    (4)
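The lookup path of FSR described above can be sketched as follows; the tree representation and function names are illustrative and not taken from [22].

class Node:
    """A node of the hierarchical topology; children of the same parent are siblings."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.files = set()
        if parent is not None:
            parent.children.append(self)

def locate(client, file_id):
    """Forward the request to the parent; if the file is not there, try the
    siblings at that level, then move one level up and repeat."""
    node = client
    while node.parent is not None:
        parent = node.parent
        if file_id in parent.files:
            return parent
        for sibling in parent.children:
            if sibling is not node and file_id in sibling.files:
                return sibling
        node = parent
    return None   # not found anywhere up to the root

def average_access_frequency(frequencies):
    """F_avg = sum(freq) / n, Equation (4), used by FSR to decide which files to replicate."""
    return sum(frequencies) / len(frequencies)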
in a time period T_n and then, they compute the average popularity of the whole distributed system within the period T_n. Based on these computations, the strategy decides whether an object needs to be replicated. If the popularity of an object is larger than the average popularity, the object is replicated and the number of its replicas is computed as:
The authors also presented two extensions of their basic strategy, the Closest Access Greatest Weight (CAGW) strategy: (1) CAGW NP (New Placement), which handles the cases where the system does not include only read-only files or the read/write ratio is high, since the basic CAGW strategy only pays attention to the read cost, and (2) CAGW PD (Proactive Deletion), where the algorithm checks and removes bad replicas in order to balance the read and write overheads. The first two strategies delete objects only when there is not enough space on the servers; as long as there is space available, the replication continues, resulting in costs that outweigh the benefits of replication. In effect, CAGW PD imposes a threshold on the number of replicas by using the read/write ratio as a metric that determines whether deletion is needed. A similar approach regarding the replica placement policy is also followed in [24].
The authors of [25] presented a category-based data replication strategy for Data Grids that considers the fact that the files existing on a node belong to different categories. Each of these categories is given a value that shows its importance for the node. Thus, when a node runs low on storage, it starts storing only the files that belong to the category with the highest value, so the decision regarding the replicas is strictly based on the importance of each file.
2.1.2 Discussion on static time based strategies
As the name suggests, the static time-based replication strategies decide about the replicas in a static way, at the beginning of a time period. Once the decision is made, there are no changes in the strategy, despite possible changes in the users' behavior. The access patterns used by the static algorithms discussed follow the principle of temporal locality. In some of them, the simulations performed indicate an improved performance over random access patterns. This paragraph discusses the performance issues of the strategies described, along with their strengths and weaknesses.
Ranganathan et al. [20] proposed six different replication techniques for three
different access patterns: (1) random, (2) temporal locality, and (3) combination
of temporal and geographical locality. The authors used simulation experiments
to measure the Total Response Time (TRT) and the Bandwidth Consumption
(BC). The experimental results showed that the random patterns are the worst-case scenario, in which there is only a small improvement in response time and bandwidth saving when the plain caching strategy is used. When the Cascading and Fast Spread
(FS) strategies are used, the improvement in total response time and bandwidth
consumption is more significant, since Cascading and Fast Spread use the storage
space available at the intermediate tiers, while the other strategies only use the
space available at the last tier.
For the temporal locality patterns, the performance improvement was much more considerable than for random patterns. This is expected, since the data replication strategy assumes that the replicas created will be repeatedly requested. Finally, there is further improvement when some geographical locality is combined with temporal locality in the access patterns, because it is assumed that the files accessed by some users are likely to be requested by their neighbors as well.
2.1.3 Dynamic strategies
All records in the same cluster will be aggregated and summarised by the cluster header. Based on the information of all access records, the most popular file is computed. To capture the importance of a file over a certain period, the computation uses different weights for different time intervals. More precisely, if n periods have passed, the earliest is assigned a weight of 2^{1-n}, the next to earliest is assigned a weight of 2^{2-n}, and the most recent is assigned a weight of 2^{n-n} = 1. Thus, the weight factor is halved between successive periods. This is the main strength of this method of computation, as recently popular files are considered better candidates for replication. The drawback is the extensive use of memory for storing the access log files. When the number of files becomes large, a lot of storage is wasted on the historical data access records. Table 2 shows an example:
Table 2. Log file example
Based on the above description, the metric used to find the most popular file was the Access Frequency (AF). More precisely:

AF(f) = \sum_{t=1}^{N_T} af_t \times 2^{-(N_T - t)}, \quad \forall f \in F,    (5)

where N_T is the number of time intervals passed, F is the set of requested files, and af_t is the number of times a file f has been accessed during time interval t.
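As an illustration of Equation (5), the following sketch computes the weighted access frequency from a file's per-interval access counts (function and variable names are illustrative):

def access_frequency(per_interval_counts):
    """Weighted access frequency, Equation (5): the most recent interval gets
    weight 1, the previous one 1/2, then 1/4, and so on.

    per_interval_counts: access counts af_t for one file, ordered from the
    earliest interval (t = 1) to the most recent (t = N_T)."""
    n_t = len(per_interval_counts)
    return sum(af_t * 2 ** (-(n_t - t))
               for t, af_t in enumerate(per_interval_counts, start=1))

# Example: 4 accesses two intervals ago, 2 in the previous one, 8 in the latest.
# AF = 4 * 0.25 + 2 * 0.5 + 8 * 1 = 10.0
print(access_frequency([4, 2, 8]))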
To the best of our knowledge, the first strategy to divide the total time into rounds in order to study the changes in user behavior was the Popular File Replicate First (PFRF) strategy, proposed by Lee et al. [26]. In PFRF, the popularity of each file is calculated at the end of each round, and only a percentage of the most popular files is replicated. The measure used to decide about the replicas is the popularity weight PW_{c,r}(f_i) for a file f_i in cluster c during round r, computed from the access count A_{c,r}(f_i) and two constants a, b: if A_{c,r}(f_i) > 0 there have been requests for f_i, so the popularity weight increases by A_{c,r}(f_i) · a; otherwise, it decreases by the constant b.
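The update rule just described can be sketched as follows; since the exact formula of [26] is not reproduced above, the function below only mirrors the increase/decrease behaviour, and the constants are illustrative.

def update_popularity_weight(previous_weight, accesses, a=0.5, b=0.1):
    """PFRF-style popularity-weight update (a sketch, not the exact formula):
    if the file was requested during the round, the weight grows in
    proportion to the access count; otherwise it decays by a constant."""
    if accesses > 0:
        return previous_weight + accesses * a
    return previous_weight - b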
In [27], the authors proposed an improved version of PFRF, the Improved Popular File Replicate First (IPFRF). As in PFRF, the authors divide the total time into time slots, but their computation of the file popularity includes additional factors, such as the replica sizes and the frequency specific time interval. Specifically, the popularity of a file i in cluster c during a round n (a round is a time period) is computed from NOR_{i,c,n}, the number of requests for file i in cluster c during round n, TNOR_n, the total number of requests at round n, and a constant a. The file size plays a key role in the computation of popularity, and usually small files have a higher probability of being selected for replication. The authors justified this decision by the fact that the storage space is always limited. However, if there was no request for a file during a round, its size plays no role at all.
In [19], the authors introduced the notion of file potential into the computation of file popularity. The potential of each file was computed using a binary tree mechanism. The high-potential files are expected to receive a high number of requests in the near future; thus, they are promoted for replication. In this sense, they become available sooner than they would be based only on their access numbers. The user behavior is modeled by a single parameter B, which lies in the interval [0, 1]. If B approaches 0, the users' behavior has changed completely and the computations for file potential and popularity are performed from scratch. The total popularity of a file f in node n is computed from NR_{f,n}, the number of requests for file f at the end of a round in node n, P_{f,n}, the potential of f during the same round in the same node, and S_{f,n}, its scope (discussed later in this section). If P_{f,n} > 0, the number of requests is multiplied by a power of 2; if P_{f,n} < 0, the number of requests is divided by a power of 2; and if P_{f,n} = 0, there is no change.
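The scaling rule described above can be sketched as follows; Equation (8) also involves the file scope, which is omitted here, and the exponent k is an illustrative choice rather than the paper's.

def scale_requests_by_potential(num_requests, potential, k=1):
    """Requests are multiplied or divided by a power of two, depending on the
    sign of the file potential; zero potential leaves them unchanged."""
    if potential > 0:
        return num_requests * 2 ** k
    if potential < 0:
        return num_requests / 2 ** k
    return num_requests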
2.1.4 Discussion on dynamic time-based strategies
The main advantage of the dynamic data replication strategies is that they take into account the changes in users' behavior. Again, the access patterns used by the algorithms discussed follow the principle of temporal locality. This paragraph discusses the performance issues of the strategies described, along with their strengths and weaknesses.
In [24], the authors proposed a dynamic replication algorithm, Latest Access Largest Weight (LALW), that uses access weights to keep track of the access history of all files. The data access history is used to replicate data at frequent time intervals. The network topology selected is a multi-tier grid and the data replication decisions are taken centrally. The simulation parameters evaluated are the Job Execution Time (JET), the storage usage, and the Effective Network Usage (ENU), which is the ratio of the number of remote file accesses plus the total number of file replications to the number of times that a file is read, either from a remote site or locally. A lower value indicates a more efficient bandwidth use.
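Written out, this definition of the ENU corresponds to

ENU = \frac{N_{remote} + N_{rep}}{N_{reads}},

where N_{remote} is the number of remote file accesses, N_{rep} is the total number of file replications, and N_{reads} is the total number of times files are read, either remotely or locally.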
The simulation results have shown that the total job execution time improves by 15% compared to the least frequently used (LFU) scheme. In addition, the ENU of LALW is lower by 12% compared to that of LFU. The reason is that the LFU algorithm always replicates, so its ENU increases.
Based on the priorities, the PHFS (Predictive Hierarchical Fast Spread) strategy proceeds to determine how the replication will be configured. The PHFS method uses spatial data patterns and is therefore more suitable for applications in which the users work with the same projects or files. This means that the data requests are not random, but somehow related. A multi-tier data grid architecture was used and the replication decisions were centrally made by the server. The authors used a theoretical example to compare their strategy to Fast Spread. The example was presented on the basis of spatial locality, which suggests that users who work in the same context have similar, predictable requests. From the example, they drew the conclusion that, when the requested files present higher spatial locality, the proposed strategy offers an improvement in the average access latency. However, Fast Spread has mainly been used with temporal access patterns, so the reported improvement is to be expected.
In [29], the authors proposed a method called Partitioned Context Modeling (PCM). PCM introduced the idea of modeling contexts as sequences of file system events. All previously seen contexts are stored in a tree in which each node stores a file system event. Through its path from the root, each node represents a sequence of file system events, i.e., a context, that has been previously seen. Also, the number of times each sequence occurs is kept. To determine future events, the model is updated
by using a pointer array, 0 to m, that indicates the nodes representing the current contexts (C_0 to C_m). Any time a new event E occurs, the children of each old C_k are examined, searching for a child that represents E. If such a child exists, then this sequence (the new C_{k+1}) has occurred before and is represented by this child, so the new C_{k+1} is set to point to this child and its count is increased by one. If no such child is found, this sequence occurs for the first time, so a child denoting E is created and the (k+1)-th element of the array is assigned to point to
its node. Once each context is updated, a new state of the file system model is formed. From this new state, the authors select the events to be prefetched using the formula Count_{child}/(Count_{parent} - 1), which computes the probability of occurrence of that child's event. This probability is compared to a parametric threshold to determine whether the data corresponding to this event should be pre-cached
or not. Similar approaches are followed in [30, 31].
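A minimal sketch of the context update and prefetch test just described is given below; the trie node structure and function names are illustrative, and the maximum context order is an assumed parameter.

class TrieNode:
    """One node of the context trie: a file system event and an occurrence count."""
    def __init__(self):
        self.count = 0
        self.children = {}            # event -> TrieNode

def update_contexts(root, contexts, event, max_order=3):
    """Advance the current contexts C_0..C_m by one event: for each old C_k,
    find or create the child representing the event and increment its count;
    that child becomes the new C_{k+1}."""
    root.count += 1                   # the empty context C_0 matches every event
    new_contexts = [root]
    for node in contexts[:max_order]:
        child = node.children.setdefault(event, TrieNode())
        child.count += 1
        new_contexts.append(child)
    return new_contexts

def prefetch_candidates(parent, threshold):
    """Events whose probability Count_child / (Count_parent - 1) exceeds the threshold."""
    if parent.count <= 1:
        return []
    return [event for event, child in parent.children.items()
            if child.count / (parent.count - 1) > threshold]

# Usage: feed an event stream and query the children of the current contexts.
root = TrieNode()
contexts = [root]
for e in ["open(a)", "read(a)", "open(b)", "read(a)"]:
    contexts = update_contexts(root, contexts, e)
print(prefetch_candidates(root, threshold=0.5))   # -> ['read(a)']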
For evaluation, the authors used a trie, a tree-based structure. They performed a series of tests for spatial access patterns indicating that predictive prefetching can significantly reduce I/O latencies and the total runtime, at least for the benchmarks used. However, there were two drawbacks: (1) these tests represented a system with many limitations compared to actual computer workloads, and (2) the tests repeatedly used exactly the same data patterns.
Some files may receive a high number of requests during the first time slots but lose their popularity after a short period of time. If many replicas of such files are created, the system may become inefficient. On the other hand, there may be files with lower potential during the first slots that may be of great interest to a large number of users after some period of time. For example, a news story may be read by only a small number of users during the first hours and then become viral (thus, its potential may increase). By file scope, we denote the extent to which diverse users are interested, or may potentially be interested, in a file. The scope S_{f,n} of a file f in node n is estimated as follows:
S_{f,n} = \sum_{i=1}^{NR} \left( \frac{\delta_{n,j}}{\max(\delta_{n,j})} - \Omega_{n,j} \right)    (10)
where NR is the total number of requests, i indexes the requests for a file f by their arrival time during a slot, j is the node that made request i for file f, δ_{n,j} is the distance between n and j expressed in number of nodes, and max(δ_{n,j}) is the maximum distance observed over all the requests for f submitted to n by different nodes j. The term δ_{n,j}/max(δ_{n,j}) - Ω_{n,j} is used to assign larger scope values to f when the requests for it are made from faraway nodes.
Returning to Equation (8), note that if a file has a high scope, its potential will play a major role in the replication policy. Otherwise, its potential may fall off and the algorithm will assume that the file was only locally requested by a small number of users and, in fact, has little or no potential.
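A sketch of the scope computation of Equation (10) is shown below, assuming the distances δ and the Ω terms (whose exact definition is not given above) are supplied as dictionaries keyed by the requesting node:

def file_scope(requesting_nodes, delta, omega):
    """S_{f,n} per Equation (10): one term per request, larger when the
    requests come from faraway nodes.

    requesting_nodes: the node j that issued each request i (one entry per request)
    delta[j]: distance, in nodes, between the current node n and node j
    omega[j]: the Omega_{n,j} term of Equation (10)"""
    if not requesting_nodes:
        return 0.0
    max_delta = max(delta[j] for j in requesting_nodes)
    return sum(delta[j] / max_delta - omega.get(j, 0.0) for j in requesting_nodes)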
3. Replica placement strategies

The main goals of the replica placement strategies are the following.
1. Take a replication decision: Decide whether replication should take place, when
a file is not available in a node’s storage.
2. Replica selection: Select the best of the replicas, if the strategy decides that
replication should take place.
3. File replacement: Replace some files, in case there is not enough space to store
the replica.
Several approaches have been introduced regarding the placement of the selected replicas. The replica placement policies enhance the data replication strategies, since if the selection of storage for the replicas is not strictly defined, or if the selection is random, then: (1) the data files may not be evenly distributed over the grid, for example one or two nodes may be overloaded with replicas, and (2) several large files may unnecessarily be copied to more nodes than required, thus reducing the system performance. Some of the strategies described in the
previous section are equipped with such a placement policy. However, there are
some more papers, which focus mainly on the replica placement policy [33–36].
These strategies cannot be categorized based on our taxonomy, but they have been
influential in the design of good replica placement policies. This section describes
the most important replica placement policies found in the literature.
Ranganathan and Foster developed the Fast Spread strategy in [20]. In this
strategy, a replica of the requested file is stored in each node along its path to the
requesting node. When one of these nodes lacks storage, some replicas have to be
removed. The strategy creates a list of replicas available in the current node. Then,
these replicas are deleted one by one, based on the least recently used principle or
as follows: first, the file is searched locally within a node. If it is found, it is used; otherwise the request goes up the tree and the first replica found on the way to the root is used. If there is no replica, the hub provides it. The goal of this replica placement strategy is to place the replicas in such a way that more requests can be satisfied. The strategy also addresses the following issues: (1) the accurate estimation of how often a leaf is used, (2) the proper placement of the replicas, and (3) load balancing.
Two models were created: the constrained model, which places a range limit on each request, i.e., each request must be served within a given number of hops towards the root, and the unconstrained model, which has no such limitation. Both models aim to solve the following problems:
1. MinMaxLoad: Based on estimations regarding the data usage and given the number of replicas k, find a set of tree nodes with cardinality k that minimises the maximum workload.
2. FindR: Based on estimations regarding the data usage and given the amount of data D that a replica or the hub can serve, find a set of tree nodes with minimum cardinality such that the maximum workload does not exceed D.
In [23], the authors use the popularity value to make replication decisions. Firstly, they find the average popularity of each object in a time period T_n and then they compute the average popularity of the whole distributed system within the period T_n. Based on these computations, the strategy decides whether an object needs to be replicated or not. If the popularity of an object is larger than the average popularity, the object is replicated and the number of its replicas is computed.
Afterwards, a decision is made regarding the nodes where the replicas should be placed. The decision constitutes a stepwise procedure, described as follows:
1. In the n-th interval, if an object k needs replication, a list with all servers requesting k is created.
2. The list is sorted in decreasing order based on the object popularity.
3. The replicas are stored on the top servers of this list.
4. If a server has no storage available, an LFU policy determines the replicas that will be removed to free up storage.
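The four steps above can be sketched as follows; the data structures (popularity scores, free space, per-file access counts and sizes) are illustrative stand-ins for the information that the strategy of [23] maintains.

def place_replicas(requesting_servers, popularity, num_replicas,
                   free_space, stored_files, replica_size):
    """Rank the servers that requested the object by its popularity, store the
    replicas on the top ones, and evict least-frequently-used files when a
    chosen server lacks space.

    stored_files[server] maps file -> (access_count, size)."""
    ranked = sorted(requesting_servers, key=lambda s: popularity[s], reverse=True)
    targets = ranked[:num_replicas]
    for server in targets:
        files = stored_files[server]
        while free_space[server] < replica_size and files:
            victim = min(files, key=lambda f: files[f][0])   # least frequently used
            free_space[server] += files.pop(victim)[1]       # reclaim its size
        free_space[server] -= replica_size
    return targets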
As discussed in Section 2, the authors also presented two extensions of their basic CAGW strategy: CAGW NP (New Placement), for systems that do not contain only read-only files or that have a high read/write ratio, and CAGW PD (Proactive Deletion), which checks and removes bad replicas to balance the read and write overheads and, unlike the first two strategies, imposes a threshold on the number of replicas by using the read/write ratio to determine whether deletion is needed. A similar approach regarding the replica placement policy is also followed in [24].
Another replacement strategy was introduced in [35]. The authors used a popularity-based form of the least recently used discipline, with one constraint added to ensure that there will be no replacement of replicas created in the current time interval. Assuming a multi-tier data grid model, the replication is accomplished in two phases. (1) Firstly, the bottom-up aggregation phase aggregates the access history records for each file to upper tiers, until the root is reached. During the computation, the access counts are summed for those records whose nodes are siblings and which refer to the same files, and the result is stored in the parent node. (2) Secondly, using the aggregated information, the replicas are placed from the top to the bottom of the tree. The idea is to traverse the tree top-down as long as the aggregated access count is greater than or equal to a pre-defined threshold that determines popular files. A replica is placed at a node when the threshold value prevents further traversal through the children of the current node.
An important issue is the determination of the initial threshold value and its
adjustment during the execution of the algorithm. The initial threshold value is
based on the average aggregated access counts of servers located in the lower tier
of the grid. Afterwards, the adjustment is made according to the arrival rate of
requests. These changes do not apply immediately but only on a time-interval basis,
where a time interval is a fraction of the time required to sample access histories
of clients. The threshold value is increased or decreased by the difference between
the current and previous average aggregated access counts of servers located in the
lower tier of the grid.
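The two phases and the threshold test described above can be sketched as follows, assuming a simple tree in which the leaves already hold the sampled client access counts (the structure and names are illustrative):

class TierNode:
    def __init__(self, children=()):
        self.children = list(children)
        self.access_counts = {}     # file -> access count (leaves: sampled client histories)
        self.replicas = set()

def aggregate(node):
    """Bottom-up phase: sum the children's per-file access counts into the parent."""
    for child in node.children:
        aggregate(child)
        for f, c in child.access_counts.items():
            node.access_counts[f] = node.access_counts.get(f, 0) + c

def place(node, file_id, threshold):
    """Top-down phase: keep descending through children whose aggregated count
    meets the threshold; place the replica where the traversal stops."""
    hot = [c for c in node.children
           if c.access_counts.get(file_id, 0) >= threshold]
    if not hot:
        node.replicas.add(file_id)
        return
    for child in hot:
        place(child, file_id, threshold)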
A quite simple approach for replica placement is followed in PFRF [26]: After
computing the popularity, the authors use the following policy:
1. File selection: the set of popular files is sorted according to the average popularity
values in decreasing order and the number of files is calculated.
2. File replication: the strategy checks whether each file is already stored in a cluster. If so, it takes no action; otherwise it checks for available storage. If storage exists, the file is replicated from the nearest node; otherwise, the strategy deletes a number of files that are less popular than the current file.
The IPFRF [27] constitutes an extension of PFRF, which computes the suitability of each node as a target for a specific replica based on the relationship:

S_{n,i} = A \times \frac{NOR_{n,i}}{HNOR_i} + B \times \frac{FSS_n}{TSS} + C \times \left(1 - \frac{SOD_n}{HSOD}\right)    (11)

where S_{n,i} is the suitability of cluster node n for file i, NOR_{n,i} is the number of requests from cluster node n for file i, HNOR_i is the highest number of requests for file i among the cluster nodes, FSS_n is the free storage of n, TSS is the total storage, SOD_n is the sum of distances between n and the other nodes in the cluster, and HSOD is the highest SOD within the cluster. This approach is very interesting, since it provides three constants A, B, and C, with A + B + C = 1, each of which can be used as a weight for three different factors.
with the highest number of requests, we can increase the value of A and decrease
the values of B and C. If we want to balance the load between different nodes in
the cluster, we can assign a higher value to B and decrease the values of A and C.
Finally, if we want to select the node with smallest distance to the nodes in the
cluster, we can increase C and decrease A and B accordingly.
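Equation (11) translates directly into the following sketch; the weight values shown are illustrative.

def node_suitability(nor_ni, hnor_i, fss_n, tss, sod_n, hsod,
                     a=0.4, b=0.3, c=0.3):
    """Suitability S_{n,i} of cluster node n as a target for file i, Equation (11).
    The weights a, b, c must sum to 1."""
    return (a * nor_ni / hnor_i          # demand for the file on this node
            + b * fss_n / tss            # free storage share
            + c * (1.0 - sod_n / hsod))  # closeness to the other cluster nodes

# Favouring nodes with many requests for the file: raise a, lower b and c.
print(node_suitability(nor_ni=40, hnor_i=50, fss_n=200, tss=1000,
                       sod_n=12, hsod=30, a=0.8, b=0.1, c=0.1))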
4. Performance evaluation metrics

In order to evaluate the effectiveness of their strategies, the authors of the works presented have considered a variety of different metrics. In this section, we briefly describe the most important of them:
Mean Job Execution Time: The mean job execution time is the total time required
to execute all the jobs divided by the number of completed jobs and it is one of
the most important metrics to evaluate performance. The mean job execution time
can be considered as a function of the average job turnaround time (AJTT), which
is the time elapsed from the time a job requests the files needed until the time it
receives these files. Thus:
MJET = \frac{AJTT}{\text{Number of jobs completed}}    (12)
This metric has been used in [26, 27, 33, 35, 37] among others.
Bandwidth Consumption: The file replication takes time and consumes a good
portion of the network bandwidth. The policies described in Section 3 aim at
reducing the number of replicas, whenever this is possible. A good metric that can
be used is the Effective Bandwidth Consumption (EBC), which can be described
by the following equation:
When the EBC is small, more files are accessed locally, resulting in lower bandwidth consumption. Most of the strategies presented in this survey consider the bandwidth consumption in the simulations performed.
Hit Ratio or Number of Completed Requests: The hit ratio is the number of
requests completed at each round divided by the total number of requests. Wei et
al. [38] explicitly study the hit/miss ratio in their work, while the majority of the
papers presented in this survey prefer to consider the number of completed requests.
Number of replicas and percentage of used storage: Some of the strategies described [9, 24, 34, 35, 37] also study the number of replicas created and the percentage of storage used per node or per cluster, since their main goal is to keep the number of replicas and the storage used to a minimum.
The file size: The file size cannot, strictly speaking, be considered a metric for evaluating the performance of a data replication strategy. However, simulations are conducted to study the effect of the file size [19, 39] on other metrics, such as the AJET or the hit ratio. There are two issues related to the file size when it is used to evaluate data replication schemes: (a) the nodes must have enough disk space or follow some replacement policy (for example, least recently used) to be able to store the replicas, and (b) larger replicas cause higher communication latency.
5. Conclusions and future trends

In this survey, we classified the data replication strategies for data grids. We presented the most representative strategies for choosing the proper files for replication, separated into three categories: time-based, space-based and geography-based. The time-based strategies were further divided into static and dynamic.
The strategies have been developed for different access patterns (tempo-
ral/spatial) and for a variety of different architectures. The main parameters com-
puted to evaluate these strategies were also discussed. Generally, we can conclude
that there is no standard architecture or access pattern used in the papers. In most
cases, the multi-tier architecture is used, but the random graph is also a common
approach.
There is a variety of parameters that are computed to evaluate the performance
of data replication schemes. Some of the most important are the total response
time, the bandwidth consumption, the data cost, and the effective network use.
There is no single strategy that considers all of these parameters, so fully accurate comparisons between them are not an easy task. Also, the majority of the works use simulation for evaluation purposes.
As the volume of data operated on by modern applications keeps growing, new challenges are going to be posed in the field of data replication. Future trends may include the design of algorithms that optimize memory usage. As mentioned in Section 2, strategies that keep large access log files, like [24], consume a lot of memory. An idea to resolve this issue would be an algorithm that estimates the access values for every time interval and uses them as input to compute the popularity of each file. In this way, the system may be relieved from keeping excessively large log files, especially now that data replication algorithms have to deal with even larger grids and huge numbers of files. Another future trend may be the design of new network architectures and hierarchies that aim to minimise the time spent moving data between nodes in a distributed system. Finally, it is important to implement data replication strategies in the cloud.
References