1. Introduction
A large number of users, clients, and customers access data and information, which raises concerns regarding the confidentiality of the data [1,2]. Access to statistical summaries is mandatory in several public and private entities [3,4], which threatens data security and privacy. Statistical agencies worldwide therefore aim to provide useful statistical summaries without violating confidentiality requirements, and the confidentiality and utility of the released data are assessed using various methods and strategies [4].
“Micro-aggregation” is a perturbative method that partitions the micro-data file into groups of either a fixed size k or a variable size between k and 2k − 1, where k is a predefined threshold set by the data protector [4]. Once the group-size constraint is satisfied, the Micro-Aggregation Technique (MAT) discloses the mean values of each group as a replacement for the original micro-records.
The Micro-Aggregation Problem (MAP), which aims to obtain the optimal partition of the micro-data file, belongs to the class of NP-hard problems. It is defined as follows. A micro-data set consists of n multi-variate individuals, the x_i's, each of which is a data vector with p continuous variables. Micro-aggregation partitions the n data vectors into m groups in order to reach the optimal k-partition, such that each group G_i of size n_i contains either exactly k data vectors (fixed-size case) or between k and 2k − 1 data vectors (data-oriented case). The best k-partition is the one that minimizes between-group similarity and maximizes within-group similarity. The within-group similarity is measured as the Sum of Squares Error (SSE), calculated from the Euclidean distances of each individual data vector to the mean of the group it belongs to [5]. It is given by:

SSE = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^\top (x_{ij} - \bar{x}_i),

where x_{ij} is the j-th data vector of group G_i and \bar{x}_i is the mean of that group.
Analogously, the between-group similarity is computed as the Sum of Squares Among the groups (SSA), reflecting the squared deviations of the group means from the total mean of the data [5]. It is given as:

SSA = \sum_{i=1}^{m} n_i (\bar{x}_i - \bar{x})^\top (\bar{x}_i - \bar{x}),

where \bar{x} is the mean of the entire data set.
The Total Sum of Squares (SST) is given by SST = SSA + SSE. Information Loss (IL) is a metric expressed as the ratio of SSE to SST. The value of IL falls in the range between 0 and 1, as given by [5]:

IL = SSE / SST.
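To make these quantities concrete, the following minimal Python sketch (an illustration only, not part of the proposed algorithm; the function name and NumPy-based layout are our own) computes SSE, SSA, SST, and IL for a given partition of a numeric micro-data set.

import numpy as np

def information_loss(data, groups):
    """Compute SSE, SSA, SST and IL = SSE / SST for a k-partition.

    data   : (n, p) array of n records with p continuous variables.
    groups : list of index arrays, one per group of the partition.
    """
    total_mean = data.mean(axis=0)
    sse, ssa = 0.0, 0.0
    for idx in groups:
        group = data[idx]
        g_mean = group.mean(axis=0)
        # Within-group squared deviations (SSE contribution).
        sse += ((group - g_mean) ** 2).sum()
        # Between-group squared deviations weighted by group size (SSA contribution).
        ssa += len(idx) * ((g_mean - total_mean) ** 2).sum()
    sst = sse + ssa
    return sse, ssa, sst, sse / sst

# Example: 6 records partitioned into 2 groups of size k = 3.
data = np.array([[1.0, 2.0], [1.1, 2.2], [0.9, 1.8],
                 [5.0, 7.0], [5.2, 6.8], [4.8, 7.1]])
print(information_loss(data, [np.arange(0, 3), np.arange(3, 6)]))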
The primary contribution of this work is to apply the divide-and-conquer concept to the state-of-the-art Enhanced Genetic Multi-variate Micro-Aggregation Technique (EGMMAT) [6] in order to reduce the computational time and to improve the values of the IL and the Disclosure Risk (DR) (disclosure risk is the probability that an intruder can obtain some information about the original micro-data from the published data) by imposing a variable group size. Integrating the divide-and-conquer concept with a genetic algorithm in the MAT provides a favorable strategy for preserving sensitive data in the micro-data file and for reconciling the conflict between the IL and the DR.
This research article presents an introduction in Section 1, followed in Section 2 by a brief survey of the reported MATs and, in particular, of the EGMMAT strategy. Section 3 illustrates the informal and algorithmic descriptions of the newly proposed RGMAT. Section 4 shows the results of experiments performed on real benchmark data sets. Finally, the conclusions are drawn in Section 5.
2. Micro-Aggregation
Micro-aggregation is applied to protect statistical databases and the individual records they contain [2]. The technique groups the micro-records of the original file into groups of at least k records and then disseminates the average values instead of the original micro-record values. To preserve the privacy of the original data before publishing, each record must be placed in a group whose size is equal to k or more [4]. MATs are classified based on [7]:
The degree of the micro-data file, i.e., the number of attributes used in the micro-aggregation process, which determines whether the aggregation method is uni-variate or multi-variate. The uni-variate approach covers using a principal component, choosing a particular variable, or using the sum of z-scores [8], whereas the multi-variate approach covers unprojected multi-variate data or multi-variate data projected on a single axis [9].
The cardinality of micro-records per group [5,10,11,12], which determines the group size and whether it is fixed or variable. The fixed group size (k) is known as classical micro-aggregation, while the variable group size (between k and 2k − 1 inclusive) is known as data-oriented micro-aggregation.
The type of solution. The optimal uni-variate MAT solves the MAP as a shortest-path problem on a graph with polynomial complexity [13]. However, there is no optimal MAT for the multi-variate MAP, which is known to be NP-hard [5,7]. Thus, researchers have shown great interest in heuristic MATs that provide approximate solutions close to the optimum, employing genetic algorithms [14,15,16], hierarchical clustering [10,17], automata theory [12], neural networks [2,18], graph theory [19,20], or fuzzy logic [21].
This paper focuses on the Maximum Distance to Average Vector (MDAV) technique [10] and the Enhanced Genetic Multi-variate Micro-Aggregation Technique (EGMMAT) [6] as examples of the state of the art. A summary of each is therefore given below.
The MDAV is one of the simplest and most attractive techniques, and it is designed to impose a fixed group-size constraint. MDAV begins by computing the mean of the whole data set and then searching for the record farthest from that mean, called r. After that, it obtains the record farthest from r, called s. The technique then finds the k − 1 records closest to r and the k − 1 records closest to s in order to form two groups. The records in these two groups are deleted from the data set. The above steps are repeated until fewer than k records remain in the data set; the remaining records are assigned to the group closest to them [10]. Finally, the mean of each group is published [22].
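As an illustration of the grouping loop described above, the following Python sketch (our own simplified rendering, not a reference implementation; the function and variable names are assumptions, and the handling of the final remainder follows the description in the text) builds fixed-size groups in the MDAV style. It assumes the data set has at least k records.

import numpy as np

def mdav_groups(data, k):
    """Simplified MDAV-style grouping: returns a list of index arrays."""
    remaining = list(range(len(data)))
    groups = []
    while len(remaining) >= k:
        pts = data[remaining]
        centroid = pts.mean(axis=0)
        # r: record farthest from the centroid; s: record farthest from r.
        r = remaining[np.argmax(np.linalg.norm(pts - centroid, axis=1))]
        s = remaining[np.argmax(np.linalg.norm(pts - data[r], axis=1))]
        for seed in (r, s):
            if len(remaining) < k:
                break
            cand = np.array(remaining)
            dist = np.linalg.norm(data[cand] - data[seed], axis=1)
            # The seed's k nearest records (including itself when still present).
            group = cand[np.argsort(dist)[:k]]
            groups.append(group)
            remaining = [i for i in remaining if i not in set(group.tolist())]
    # Assign any leftover records (< k of them) to the group with the nearest mean.
    for i in remaining:
        means = [data[g].mean(axis=0) for g in groups]
        j = int(np.argmin([np.linalg.norm(data[i] - m) for m in means]))
        groups[j] = np.append(groups[j], i)
    return groups

Publishing then amounts to replacing every record by the mean of its group, exactly as in the information_loss sketch above.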
The EGMMAT is a MAT that employs a genetic algorithm to solve the MAP [6]. Firstly, the micro-data file is divided into a number of domes based on proximity-distance similarity. All domes have equal size, which is pre-defined by the data protector. The genetic operations, namely crossover and mutation, are invoked independently in every dome and repeated until the convergence condition is satisfied, i.e., until the fitness value, defined as the SSE, becomes stable. Secondly, all domes are merged into a single dome to refine the final result by re-invoking the genetic operations on the whole micro-data file. Further details can be found in [6]. As mentioned earlier, the micro-data records/genes are divided into L sub-domes, each of size N/L. The authors of [6] reported that choosing the number of domes within a specific range leads to the optimal value of the IL. The best value of L may lie within a large range; therefore, guessing the optimal size of these sub-domes is not an easy task. Another disadvantage of this MAT is that it requires substantial computational time to generate and disclose the micro-aggregated file. It is also worth noting that the EGMMAT belongs to the fixed group-size MAT type, where all groups/chromosomes share the same size, equal to k. This leads to an increase in the value of the Disclosure Risk (DR).
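For contrast with the recursive scheme introduced in Section 3, the following minimal sketch (our own illustration; the helper names are assumptions, and the per-dome genetic search is only stubbed) shows the two-phase dome structure of an EGMMAT-style run: L equal-size domes are processed independently and then merged for a final refinement pass.

import numpy as np

def naive_pass(data, indices, k):
    """Stand-in for the per-dome genetic search of [6] (crossover and mutation
    repeated until the SSE-based fitness stabilizes); here it simply chunks the
    indices into fixed-size groups of k."""
    usable = len(indices) - len(indices) % k
    return [indices[i:i + k] for i in range(0, usable, k)]

def egmmat_outline(data, k, L, genetic_pass=naive_pass):
    """Two-phase dome structure of an EGMMAT-style run."""
    order = np.arange(len(data))            # proximity-based ordering omitted here
    partition = []
    for dome in np.array_split(order, L):   # phase 1: L equal-size, independent domes
        partition.extend(genetic_pass(data, dome, k))
    merged = np.concatenate([np.asarray(g) for g in partition])
    return genetic_pass(data, merged, k)    # phase 2: refine on the merged file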
3. Recursive Genetic Micro-Aggregation Technique (RGMAT)
We developed a recursive and plausible mechanism, referred to as the Recursive Genetic Micro-Aggregation Technique (RGMAT), which minimizes the computational time required for the entire data set and provides a favorable value of the Scoring Index (SI) (the scoring index is a metric that trades off the achieved level of protection (privacy) against the correctness of the results that users can obtain (utility)), while imposing a variable group-size constraint on the aggregated micro-individuals. Our methodology is as follows: rather than splitting the entire micro-data file into a number of domes, as the EGMMAT does, we propose that the entire data set, held in a single original dome, be recursively sub-divided into two smaller sub-domes. The genetic operation “crossover” is performed on each sub-dome independently until the convergence condition is satisfied. The recursion terminates when the generated sub-dome size lies between k and 2k − 1 inclusive. Lastly, the genetic operation “mutation” is performed over all generated sub-domes to maximize the objective function. We stress that the smaller sub-domes are not obtained by invoking the EGMMAT on the original dome: the recursive sub-division cannot be arbitrary and must be based on a meaningful criterion, namely the underlying clustering philosophy applied through the genetic concept. Moreover, every sub-dome is micro-aggregated independently. Finally, the micro-aggregated records are combined to obtain the published file.
The algorithm that implements the RGMAT can be formalized as follows. Let the input micro-data set be given by InSet, with size N, and let the output micro-aggregated records be OutSet. The process starts by normalizing the micro-data file (InSet) to give equal weight to all variables [5,6,7]. The similarity between records/genes is then estimated by building a similarity distance matrix based on the Euclidean distance [5,12].
Instead of micro-aggregating the individual records with a genetic algorithm by dividing the vast dome into L domes of size N/L each, as in the EGMMAT, we apply the divide-and-conquer concept to partition the original dome (InSet) into two mutually exclusive sub-domes (i.e., LS and RS), satisfying the variable group-size constraint.
The InSet is ready to be micro-aggregated if its size N lies between k and 2k − 1 inclusive. Otherwise, if N is equal to or greater than 2k, the InSet is recursively split into two mutually exclusive sub-domes. It is worth mentioning that, before the recursive calls, the RGMAT seeks the optimal sizes of the two sub-domes LS and RS, represented by left.dome.size and right.dome.size, respectively. We converge to these sizes by initializing left.dome.size to half the original dome size (N/2). We then check whether left.dome.size is divisible by two and, simultaneously, divisible by k. If this condition is satisfied, the optimal size of LS has been determined; otherwise, we keep decreasing left.dome.size by one until the condition is satisfied. After converging to the optimal left.dome.size, the right.dome.size is directly assigned the number of remaining genes/records in the original dome (right.dome.size = N − left.dome.size). Secondly, and more importantly, this procedure utilizes the underlying clustering philosophy through the genetic operations. This is done by computing the mean of the original dome InSet and then searching for the record/gene furthest from the computed centroid; this record seeds the LS sub-dome of size left.dome.size, which consists of the furthest record and its left.dome.size − 1 nearest genes/records. After removing these genes/records from the original dome InSet, we assign the remaining records/genes to the RS sub-dome.
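A minimal Python sketch of this splitting step is shown below (our own illustration; it assumes Euclidean distances on a NumPy array and a dome large enough for a valid left size to exist, and it omits the subsequent genetic refinement).

import numpy as np

def choose_left_size(n, k):
    """Largest size <= n // 2 that is divisible by both 2 and k
    (the description above assumes the dome is large enough for such a size)."""
    size = n // 2
    while size % 2 != 0 or size % k != 0:
        size -= 1
    return size

def split_dome(data, indices, k):
    """Split a dome (index array) into LS and RS around the record
    furthest from the dome centroid, as described above."""
    n = len(indices)
    left_size = choose_left_size(n, k)
    pts = data[indices]
    centroid = pts.mean(axis=0)
    far = indices[np.argmax(np.linalg.norm(pts - centroid, axis=1))]
    # LS: the furthest record plus its left_size - 1 nearest neighbours.
    dist = np.linalg.norm(data[indices] - data[far], axis=1)
    ls = indices[np.argsort(dist)[:left_size]]
    rs = np.setdiff1d(indices, ls)
    return ls, rs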
The objective function is to maximize the homogeneity of the records/genes in the generated sub-domes, i.e., to simultaneously maximize the within-group similarity and minimize the between-group similarity of the records/genes in the sub-domes LS and RS. This is done by applying the genetic learning process over a number of epochs. Each epoch starts by computing the fitness value (i.e., the sum of squares error) of the LS and RS sub-domes. Then, the crossover process begins by choosing a set of records/genes according to the crossover ratio (CorRatio) predefined by the data protector. A random pair of records/genes is selected from the chosen set, and the predefined CorRatio percentage is swapped between them, which affects the fitness values of both sub-domes LS and RS. If the fitness value decreases, the swap is kept; otherwise, it is cancelled. In either case, the pair (the original genes) is deleted from the chosen set. The whole crossover process is repeated while the chosen set is not empty.
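One plausible reading of this crossover step is sketched below (our interpretation only: each sampled pair consists of one record from LS and one from RS, and the swap exchanges their sub-dome membership, kept only when the combined SSE decreases; the sampling scheme driven by cor_ratio is likewise an assumption).

import numpy as np
import random

def sse(data, indices):
    """Sum of squares error of one group/sub-dome."""
    pts = data[list(indices)]
    return float(((pts - pts.mean(axis=0)) ** 2).sum())

def crossover_pass(data, ls, rs, cor_ratio, seed=0):
    """One crossover pass between sub-domes LS and RS."""
    rng = random.Random(seed)
    ls, rs = list(ls), list(rs)
    n_pairs = max(1, int(cor_ratio * min(len(ls), len(rs))))
    chosen = [(rng.randrange(len(ls)), rng.randrange(len(rs))) for _ in range(n_pairs)]
    for i, j in chosen:
        before = sse(data, ls) + sse(data, rs)
        ls[i], rs[j] = rs[j], ls[i]                    # tentative membership swap
        if sse(data, ls) + sse(data, rs) >= before:    # no improvement: undo
            ls[i], rs[j] = rs[j], ls[i]
    return np.array(ls), np.array(rs)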
After performing the crossover operation, the total value of the fitness function is computed. If it has changed compared with the previously computed value, a new generation takes place. Otherwise, a recursive call is invoked for each of the sub-domes LS and RS. The purpose of these recursive calls is to converge to the desired underlying clusters in LS and RS.
The mutation operation does not start immediately after the crossover step at each level of recursion. The mutation step takes place only at the leaf level, by creating a chosen mutation set of records/genes, spanning from the rightmost sub-dome to the leftmost sub-dome, based on the mutation ratio (MuRatio) defined by the data protector. A random record/gene is then selected from the chosen mutation set and migrated to another sub-dome without violating the variable group-size constraint. The total fitness value of all sub-domes at the leaf level is affected by this migration; if it decreases, the migration is kept, otherwise it is cancelled. In either case, the migrated record/gene is deleted from the chosen mutation set. The whole mutation process is repeated while the chosen set is not empty. Finally, the aggregated file is created and disclosed. The above description is formalized in Algorithms 1 and 2.
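A sketch of this migration-based mutation is given below (again our own reading: a candidate record may move from its leaf sub-dome to another one, the move being kept only when the total SSE over the affected leaves decreases and both sub-domes stay within the k to 2k − 1 size bounds).

import numpy as np
import random

def sse(data, indices):
    pts = data[list(indices)]
    return float(((pts - pts.mean(axis=0)) ** 2).sum())

def mutation_pass(data, leaves, k, mu_ratio, seed=0):
    """Migrate records between leaf sub-domes when it lowers the total SSE.

    leaves : list of lists of record indices (one list per leaf sub-dome).
    """
    rng = random.Random(seed)
    if len(leaves) < 2:
        return leaves
    n_moves = max(1, int(mu_ratio * sum(len(l) for l in leaves)))
    for _ in range(n_moves):
        src, dst = rng.sample(range(len(leaves)), 2)
        # Respect the variable group-size constraint k <= |dome| <= 2k - 1.
        if len(leaves[src]) <= k or len(leaves[dst]) >= 2 * k - 1:
            continue
        gene = rng.choice(leaves[src])
        before = sse(data, leaves[src]) + sse(data, leaves[dst])
        leaves[src].remove(gene)
        leaves[dst].append(gene)
        if sse(data, leaves[src]) + sse(data, leaves[dst]) >= before:
            leaves[dst].remove(gene)    # no improvement: undo the migration
            leaves[src].append(gene)
    return leaves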
Algorithm 1: The RGMAT Scheme for the MAP
Input: InSet: set of micro-data records. N: number of records. d: number of dimensions of the data set.
Output: IL: the value of the Information Loss.
Note: k: security threshold (a constant value). CorRatio: constant value of the crossover ratio. MuRatio: constant value of the mutation ratio.
Method:
1: Normalize the data set.
2: Build the similarity distance matrix, D.
3: OutSet = REC-Split(InSet, N).
4: Calculate the IL value from the OutSet.
5: Return the value of the IL.
6: End Algorithm The RGMAT Scheme
Time is not a crucial factor in genetic algorithms; as a consequence, the most important criterion is the accuracy of the results. It is well known that a genetic algorithm runs iteratively with a polynomial complexity that depends on the number of generations, the size of the data set, and the inner genetic operations. A set of solutions is randomly generated, forming the initial population. The cost of each reachable solution includes the cost of crossover, mutation, and selection, and the best k solutions are kept. The genetic algorithm then continues as previously explained until either the maximum number of generations is reached or it converges to the sub-optimal fitness value measured by the SSE. After the last iteration, the best partition found is returned.
The main advantage of the new RGMAT is that it micro-aggregates the micro-data set in substantially less time without sacrificing either the IL or the DR values. Another advantage of the proposed technique is that the strategy is applicable on multi-processor machines, particularly shared-memory systems (where there is no need to plan the communication of data between different processors). Additionally, the memory caches are used efficiently, because each subset is small enough to be stored in the cache, so the partitioning can be achieved without accessing the slower main memory. Integrating the divide-and-conquer approach with the genetic algorithm thus reduces the required time.
As mentioned earlier, imposing a recursive strategy not only leads to an evident saving of time, but also preserves the minimization of the IL and DR values. This is achieved through the base (terminating) step, where the SSE is minimized for each atomic partition. The beauty of this MAT is that it aggregates the genes into chromosomes of different sizes that satisfy the variable group-size constraint.
Algorithm 2: REC-Split(InSet, N)
Input: InSet: set of micro-records. N: number of micro-records.
Output: OutSet: the micro-aggregated records.
Note: k: security threshold (a constant value). CorRatio: constant value of the crossover ratio. MuRatio: constant value of the mutation ratio.
Method:
1: if (N < k) then
2:   Print “Error”.
3:   Exit().
4: else if ((N ≥ k) and (N ≤ 2k − 1)) then
5:   Aggregate the group/chromosome.
6:   Return the aggregated group.
7: else
8:   left.dome.size = N/2.
9:   while ((left.dome.size is not divisible by k) or (left.dome.size is not divisible by 2)) do
10:    Decrease left.dome.size by one.
11:  end while
12:  right.dome.size = N − left.dome.size.
13:  Select the record/gene furthest from the centroid of InSet.
14:  Add it to the left sub-dome, LS.
15:  Put its (left.dome.size − 1) nearest records/genes in LS.
16:  Put the remaining records/genes in the right sub-dome, RS.
17:  repeat
18:    for (each sub-dome LS and RS) do
19:      Fitness Evaluation().
20:      Crossover().
21:    end for
22:  until (the convergence criterion is satisfied)
23:  REC-Split(LS, left.dome.size).
24:  REC-Split(RS, right.dome.size).
25:  Mutation().
26: end if
27: End Algorithm REC-Split
4. Experimental Results
The RGMAT was thoroughly tested, and the results are encouraging. It was tested on the Tarragona data set (834 records and 13 variables) and the Census data set (1080 records and 13 variables) [5,10].
The strength of the newly developed RGMAT is most evident when it is used with a variable group-size constraint. The RGMAT recursively divides the whole data set into two groups/chromosomes based on the distance proximity between the individual micro-records/genes; the recursion terminates when the number of genes per chromosome lies between k and 2k − 1. The main objective is to minimize the value of the fitness function by varying the number of chromosomes from one generation to another in each recursive step. It is worth mentioning that the chosen number of epochs is enough to maintain diversity between generations. Additionally, sequentially invoking the crossover process in each recursive step and performing the mutation process once before aggregation contributes positively to the variation between generations. The values of the CorRatio and the MuRatio were set to fixed constant values.
Table 1 presents the results of using the newly proposed RGMAT on multivariate data sets. The experiments were performed with various values of k to investigate the effect of increasing the number of genes per chromosome. Increasing the number of genes/micro-records per chromosome/group tends to increase the computational time and the value of the IL. Although the IL value obtained by the RGMAT is comparable to (either equal to or less than) the value obtained by the EGMMAT, the required computational time was always less than that of the EGMMAT. Splitting the single dome into two sub-domes repeatedly until the variable group-size constraint is satisfied reduced the required computational time by up to 70%. It is essential to highlight that the RGMAT does not involve optimizing the dome size at all, as is the case in the EGMMAT. Within the context of this work, the computational time is the time needed to obtain the micro-aggregated file with a specified dome size; it does not cover the total computational time over all candidate dome sizes (needed to find the best value of L in the EGMMAT).
The experiments were also performed to test the ability of the proposed algorithm to balance the two conflicting criteria of information loss and disclosure risk [23,24], which were evaluated by calculating the Scoring Index (SI) value for the RGMAT, MDAV, and EGMMAT on both the Census and Tarragona data sets, as shown in Table 2.
General Information Loss (GIL): The value of the GIL approximates how much the data were generically damaged by the MAT [8,10,13,25,26]. We now assess the impact of the MAT on the data utility of the original file. Our goal is to evaluate the difference between the masked aggregated file and the original one, which is generally measured by demonstrating how the statistics have been structurally modified and how large the modification is [5]. The statistical characteristics of the original file must be preserved. This is usually measured by calculating the mean variation of the data, the mean variation of the means, the mean variation of the variances, the mean variation of the covariances, and the mean absolute error of the correlations; the overall GIL is defined as a combination of these five components [5,7,9]. These calculations undoubtedly provide a better understanding of the performance of the MAT [5]. More information on measuring the GIL can be found in [25,27,28,29,30]. In general, the GIL value is directly proportional to the number of genes per chromosome (records per group), represented by the value of k. For the Census data set, the best GIL value for all three MATs was obtained at k equal to 3, while for the Tarragona data set the best GIL value for all three MATs was obtained at k equal to 4; the corresponding GIL values of the RGMAT, MDAV, and EGMMAT are reported in Table 2. These results clearly indicate that the RGMAT preserved the data utility more efficiently than the MDAV and the EGMMAT. Therefore, the RGMAT outperformed the state-of-the-art methods in terms of the GIL.
General Disclosure Risk (GDR): The effect of disseminating the micro-aggregated file on confidentiality must also be studied comprehensively, because the DR depends on the data and on the intruder’s prior knowledge about the data. Therefore, we have to quantify the risk that extra information can link a masked record in the masked file with the corresponding original record in the original file. This also evaluates the risk of accurately estimating the original records’ values from the published masked records [12,18,28,31]. The GDR is evaluated as the average value of two different recommended strategies, the Record Linkage Disclosure technique (RLD) and the Confidential Interval Disclosure (CID), as follows: GDR = (RLD + CID)/2.
Record Linkage Disclosure Technique (RLD): The Euclidean distance is calculated between every micro-record in the generated micro-aggregated masked file and all micro-records in the original file. Then the “first nearest” and the “second nearest” micro-records for each micro-record in the masked file are marked. If a marked micro-record in the original file has the same index as the record in the masked file, a “match” is counted [12,23,27,32,33,34]. The number of matches over the number of micro-records in the original file defines the RLD. This technique estimates the number of masked micro-records whose identity can be re-identified by the invader [12]. Applying this technique requires the assumption that “an intruder has an external file containing a subset of the key variables that are common with the published file”. The intruder tries to pair-match a subset of the common shared variables in the external file with the published file to infer more information about the original micro-records. Therefore, the RLD is calculated as the average over all possible \binom{S}{C} combinations, where S represents the number of key variables in the micro-data file and C represents the number of selected variables known to the intruder in the external file. Namely, seven key variables are used based on the literature: Var1, Var2, Var3, Var5, Var10, Var11, and Var12 [18,28,31]. The results shown in Table 2 illustrate that the risk of disclosing confidential information, as estimated by the RLD, decreases as the number of genes per chromosome (the value of k) increases. Herein, the technique achieving the minimum estimated RLD risk differs between the Tarragona and Census data sets, as reported in Table 2.
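The record-linkage check itself is straightforward to sketch in Python (our own illustration; it links on a chosen subset of key variables and counts a match when either of the two nearest original records carries the true index, following the description above).

import numpy as np

def rld_match_rate(original, masked, key_vars):
    """Fraction of masked records re-identified via nearest-neighbour linkage.

    original, masked : (n, p) arrays with aligned record indices.
    key_vars         : column indices assumed to be known to the intruder.
    """
    orig = original[:, key_vars]
    mask = masked[:, key_vars]
    matches = 0
    for i, rec in enumerate(mask):
        dist = np.linalg.norm(orig - rec, axis=1)
        nearest_two = np.argsort(dist)[:2]          # first and second nearest
        if i in nearest_two:                        # same index => re-identified
            matches += 1
    return matches / len(mask)

In the paper’s setting, this rate would then be averaged over all combinations of C out of the S key variables.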
Confidential Interval Disclosure Technique (CID): This technique does not attempt to determine the exact original value; it is interested only in finding an approximate value [32]. The CID ranks each attribute independently and defines an interval for each ranked attribute based on the neighborhood of the value that the attribute takes for a specific micro-record, say r. The rank width of this interval should not exceed a percentage P of the size of the original micro-file, and the value of that attribute in the micro-record r corresponds to the center of the interval. In other words, a specific variable is assumed to be sorted independently, and r is the value taken by that variable in a certain micro-record; the lower and upper bounds of the interval are then the values whose ranks lie P percent of the records below and above the rank of r, respectively. A match occurs when the values of all variables in a micro-record fall into the corresponding computed intervals. Further details can be found in [5,23,32,35].
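A hedged Python sketch of this interval check is given below (our reading of the standard rank-interval formulation; placing the bounds at plus or minus P percent of the ranks around the masked value is an assumption consistent with the description above).

import numpy as np

def cid_match_rate(original, masked, p=0.01):
    """Fraction of records whose original values fall, for every variable,
    inside a rank interval of half-width p * n centred on the masked value."""
    n, d = original.shape
    half = max(1, int(p * n))
    matches = 0
    for i in range(n):
        ok = True
        for j in range(d):
            col = np.sort(original[:, j])
            # Rank of the masked value within the sorted original column.
            r = int(np.searchsorted(col, masked[i, j]))
            lo = col[max(0, r - half)]
            hi = col[min(n - 1, r + half)]
            if not (lo <= original[i, j] <= hi):
                ok = False
                break
        if ok:
            matches += 1
    return matches / n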
The invader estimates each interval size by using the percentage P; a large interval indicates a large value of the confidence. The average confidence is calculated using a fixed range of percentages (between 1 and 10%) of the micro-records. Clearly, if P has a large value, then a larger confidence value is obtained and only a small amount of precise information is disclosed [5]. The percentage value of the CID was measured as the average of the values obtained at the various settings of P within this range on the Census and Tarragona data sets, as shown in Table 2.
Finally, the GDR values were calculated for all of the MATs and presented in Table 2.
Evaluating the scoring index for the proposed MAT is essential in order to compare its performance with the state-of-the-art MDAV and EGMMAT. It is well known that every MAT disturbs the original data set on two fronts: privacy and utility. To the best of our knowledge, it is inappropriate to focus on one of them and ignore the other. Additionally, a direct comparison between privacy and utility is not reasonable for several technical and philosophical reasons. The most important reason is that privacy is an individual concept, while utility is an aggregate concept: the masked data set will not be disseminated unless the privacy of every individual is protected, whereas utility gains add up as multiple pieces of knowledge are learned. Secondly, when a masked data set is published, only the individuals whose data are included face a potential privacy loss, while the others enjoy a potential utility gain. Therefore, the Scoring Index (SI) is a measure that focuses on the two conflicting criteria, the General Information Loss (GIL) and the General Disclosure Risk (GDR); a decrease in one of them results in an increase in the other. Estimating the SI is a recommended practice, since each criterion measures a totally different perspective of the MAT. For this reason, there is a serious need for a rational index that linearly combines the GIL and the GDR as follows: SI = X · GIL + (1 − X) · GDR, where X is set equal to 0.5 to give both criteria an equal weight [5]. A lower score value implies a better performance [36,37]. From Table 2, the RGMAT technique has a performance comparable to the state-of-the-art MDAV and EGMMAT techniques in terms of the GIL and the GDR at different k values.
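Putting the pieces together, the following snippet (an illustration only; the equal weight X = 0.5 follows the equal-weighting statement above, the input values in the example are arbitrary, and a 0 to 1 scale for the inputs is assumed) shows how a score of this form would be assembled from the component measures.

def scoring_index(gil, rld, cid, x=0.5):
    """Combine utility loss and disclosure risk into a single score.

    gil      : general information loss (0-1 scale assumed here).
    rld, cid : record-linkage and interval-disclosure risk estimates.
    x        : weight on information loss; 0.5 gives both criteria equal weight.
    """
    gdr = (rld + cid) / 2.0        # average of the two risk strategies
    return x * gil + (1.0 - x) * gdr

# Example with arbitrary values: lower scores indicate a better trade-off.
print(scoring_index(gil=0.12, rld=0.05, cid=0.20))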
A motivating task is to study how the RGMAT, MDAV, and EGMMAT compare when the conflicting criteria come into play at the same time. This can be achieved by plotting the GIL versus the GDR for the Census data set, as shown in Figure 1, for all schemes. A set of paired values of the GIL and the GDR for each technique, at values of k ranging from 3 to 5, was plotted, so that the user can observe the effect of the k value on a masking method. From the curves, we observe that the RGMAT balances these conflicting criteria in a manner comparable to the MDAV and EGMMAT methods. This small difference has a significant impact on the trade-off between the two conflicting criteria, the GIL and the GDR. Finding the optimal combination of these two measures is a difficult and challenging task [38], which confirms the difficulty of improving the SI measure.