Introduction To Differential Privacy
Samuel Cheng
January 25, 2015
1 Introduction
As we discussed in previous lectures, we can categorize the attributes of a record
into key attributes, sensitive attributes, and quasi-identifiers. Take a
medical database as an example. Key attributes include attributes such as the
patient name and social security number. A sensitive attribute can be the disease
the patient has, and quasi-identifiers can be the zip code, salary range, and
so on. Removing the key attributes alone cannot ensure the privacy of a patient
in the database, because a malicious user can take advantage of side
information (prior knowledge) to uniquely identify the owner of a record and/or
infer sensitive information from the quasi-identifiers.
To ensure the privacy of record owners, we cannot always answer a query
faithfully. All privacy-ensuring techniques are essentially equivalent
to introducing distortion into the query answers. Roughly speaking, we can
categorize privacy-ensuring techniques into two classes: pre-query randomization
and post-query randomization. For pre-query randomization, we essentially
construct a “fake” database to serve as a surrogate in place of the original
one. This can be achieved by adding fake records, deleting true records,
distorting records, suppressing some attributes, and generalizing some attributes
of records. Many privacy methodologies such as k-anonymization and l-diversity
are proposed along this line of thought. For example, if for every quasi-identifier
x the conditional entropy of a sensitive attribute W given x, H(W |x), is larger
than log2 (l), we say that the modified data set satisfies l-diversity. The
advantage of this class of methods is that users can use the data
set directly and non-interactively. However, since users work with the
surrogate data set directly in their studies, the surrogate data set should have
all “statistics” close to the original one. The latter, of course, is very hard
to guarantee.
If we restrict the queries that can be made by the users, rather than distorting
the data set itself, we may distort the query results afterward. This
post-query randomization approach is suitable for data sets that only allow
interactive access. An example of this approach is the Laplacian mechanism for
differential privacy. Even though differential privacy can be made possible in
a non-interactive setting as well, we will restrict ourselves to the interactive
setting in this lecture.
2 Differential Privacy
The foundation of differential privacy is based on the premise that the query
result of a “private” data set should not change drastically with an addition
or deletion of a single record. Otherwise, there can be a chance for malicious
users to infer the identity or sensitive information of a private record. Note that
differential privacy is a rather strong condition since it makes no assumption
about the methods a malicious user could use to extract that information. Differential
privacy does not try to limit what records can or cannot exist in a data set.
Actually, the definition is blind to (independent of) the actual data inside the
data set. To ensure that the statistics of query results do not change drastically
when adding or deleting a record, the query outcome is typically distorted by
the addition of noise. As a result, the query outcome K(·), even for a
fixed data set D, should be considered a random variable. Therefore, if K(·)
is private, for any two data sets D1 and D2 where one data set is equal to the
other with one additional record, we want

Pr(K(D1) ∈ S) ≈ Pr(K(D2) ∈ S)

for all subsets S of the range of K(·). The above formulation can be tightened with the
following definition.
Definition 2.1 (ε-Differential Privacy). The query K(·) is ε-differentially private
if and only if for any subset S of the range of K(·),

Pr(K(D1) ∈ S) ≤ exp(ε) Pr(K(D2) ∈ S),        (1)

where D1 and D2 are any two data sets that differ by at most one record,
with one data set being a subset of the other.
Since (1) has to hold for all “adjacent” data sets D1 and D2 (including
when D1 and D2 are swapped), (1) apparently can only be satisfied when exp(ε) ≥ 1,
i.e., ε ≥ 0.
3 Laplacian Mechanism
To continue our discussion on the Laplacian mechanism, we need to define the
sensitivity of a function applied to a data set.
Definition 3.1 (Sensitivity). The sensitivity of a function f on a data set, ∆f,
is defined by

∆f = max_{D1,D2} |f(D1) − f(D2)|,        (2)

where D1 and D2 are any two data sets that differ by at most one record,
with one data set being a subset of the other.
As one can see from the definition, the sensitivity measures, in the extreme, how
much a single record can change the value of a query. Intuitively, a
malicious user may have a higher chance of inferring private information from a more
sensitive query than from a less sensitive one. Therefore, more protection may be
needed for a more sensitive query.
Instead of returning the true computation f(D), the Laplacian mechanism
returns the query result plus Laplacian noise with standard deviation √2 ∆f/ε,
denoted as Lap(∆f/ε). In other words, it returns K(D) = f(D) + Lap(∆f/ε).
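As a concrete illustration, the following Python sketch (not part of the original notes) adds Laplacian noise with scale ∆f/ε to a query answer; the counting query and the toy records below are hypothetical.

import numpy as np

def laplace_mechanism(true_answer, delta_f, epsilon, rng=None):
    # Return K(D) = f(D) + Lap(delta_f / epsilon); the Laplace scale is
    # b = delta_f / epsilon, so the standard deviation is sqrt(2) * b.
    rng = np.random.default_rng() if rng is None else rng
    return true_answer + rng.laplace(0.0, delta_f / epsilon)

# Hypothetical usage: a counting query (sensitivity 1) over toy ages.
D = [25, 31, 47, 52, 39]
f = lambda records: sum(1 for a in records if a > 40)
noisy_count = laplace_mechanism(f(D), delta_f=1.0, epsilon=0.5)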
It is not difficult to show that (1) will be satisfied. Note that (1) is true
for all subsets S if it is true for every singleton set {r}. Therefore we can restrict
ourselves to the latter case. For any r, note that
Pr(K(D1) ∈ {r}) / Pr(K(D2) ∈ {r}) = exp(−ε|f(D1) − r|/∆f) / exp(−ε|f(D2) − r|/∆f)        (3)
  = exp((ε/∆f)(|f(D2) − r| − |f(D1) − r|))        (4)
  ≤(a) exp((ε/∆f)|f(D2) − f(D1)|)        (5)
  ≤(b) exp(ε),        (6)

where (a) is due to the triangle inequality and (b) follows from the definition of sensitivity.

Now consider the case when two queries f1(·) and f2(·) are answered on the same data set.
For the combined mechanism to be ε-differentially private, we want

Pr((K1(D1), K2(D1)) ∈ S) ≤ exp(ε) Pr((K1(D2), K2(D2)) ∈ S)        (7)

for all subsets S of the joint range, where K1(·) and K2(·) are the modified query output
functions for f1(·) and f2(·), respectively. Again, D1 and D2 are any two data sets that differ
by at most one record, with one data set being a subset of the other.
As in the single query case, we can construct K1(·) and K2(·) by simply
adding Laplacian noise to the true results. However, we anticipate that a larger
noise is required, as mentioned earlier. Actually, it turns out that Laplacian noise
Lap((∆f1 + ∆f2)/ε) is sufficient. In other words, we may choose K1(D) = f1(D) +
Lap((∆f1 + ∆f2)/ε) and K2(D) = f2(D) + Lap((∆f1 + ∆f2)/ε).
Similar to the single query case, to show that (7) is true for all S, it is sufficient
to show that it is true for any singleton set {(r1, r2)}. Then,
Pr((K1(D1), K2(D1)) ∈ {(r1, r2)}) / Pr((K1(D2), K2(D2)) ∈ {(r1, r2)})        (8)
  = exp(−ε[|f1(D1) − r1| + |f2(D1) − r2|]/(∆f1 + ∆f2)) / exp(−ε[|f1(D2) − r1| + |f2(D2) − r2|]/(∆f1 + ∆f2))        (9)
  ≤(a) exp(ε[|f2(D2) − f2(D1)| + |f1(D2) − f1(D1)|]/(∆f1 + ∆f2)) ≤(b) exp(ε),        (10)

where (a) is due to the triangle inequality and (b) is from the definition of sensitivity.
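The two-query construction can be sketched in Python in the same way; f1, f2 and their sensitivities below are placeholders, and both answers share the noise scale (∆f1 + ∆f2)/ε from the argument above.

import numpy as np

def two_query_mechanism(D, f1, delta_f1, f2, delta_f2, epsilon, rng=None):
    # Both queries are perturbed with Lap((delta_f1 + delta_f2) / epsilon),
    # so the pair of answers satisfies (7) as shown above.
    rng = np.random.default_rng() if rng is None else rng
    scale = (delta_f1 + delta_f2) / epsilon
    return f1(D) + rng.laplace(0.0, scale), f2(D) + rng.laplace(0.0, scale)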
4 Exponential Mechanism
During data disclosure, it may happen that we would like to disclose one type
of data more than another. One example is when the data is used for classification.
For example, suppose we want to conduct a study to increase the classification
accuracy for different types of cancers. In this case, we would like to extract data
that are relevant to cancers rather than to arbitrary diseases. We can
characterize such a preference with a utility function u(D, x), where the larger
u(D, x) is, the more preferable the output x is for the given data set D.
If the utility function is independent of the current data set D (i.e., u(D, x) =
u(x)), it introduces no privacy problem. However, in general, if the utility
actually depends on the current data set (for example, we group data dynamically
through generalization in order to improve classification accuracy), privacy
can leak from the distribution of the drawn outcomes. The exponential mechanism
is a way to minimize privacy leakage in this scenario.
Definition 4.1 (Exponential Mechanism). Given a utility function u(D, x),
we output x with probability proportional to exp(ε u(D, x)/∆u), where ∆u is the
sensitivity of u(·, ·) given by

∆u = max_x max_{D1,D2} |u(D1, x) − u(D2, x)|,

where D1 and D2 differ by at most one record and one is a subset of the other.
Because Ky (D) is supposed to be deterministic.
The proof below is essentially the same as that given by McSherry [1]. Even
though we do not consider measure-theoretic issues here, I think there may be some error in
their reasoning.
Denote by K(D) the mechanism of drawing output y. Recall that the probability
of y for data set D can be written as

Pr(K(D) = y) = exp(ε u(D, y)/∆u) / Σ_{y′} exp(ε u(D, y′)/∆u).        (13)
Note that if we shift from D to an adjacent data set D′, the change of the numerator
is bounded by a factor of exp(ε) and the change of the denominator is also bounded by a
factor of exp(ε). So overall the change in probability is bounded by exp(2ε). Thus the
mechanism is 2ε-differentially private.
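A rough Python sketch of the exponential mechanism as defined above, sampling x with probability proportional to exp(ε u(D, x)/∆u); the candidate set and utility function are hypothetical, and by the argument above this normalization yields 2ε-differential privacy.

import numpy as np

def exponential_mechanism(D, candidates, utility, delta_u, epsilon, rng=None):
    # Sample a candidate with probability proportional to
    # exp(epsilon * u(D, x) / delta_u), as in Definition 4.1.
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([epsilon * utility(D, x) / delta_u for x in candidates])
    weights = np.exp(scores - scores.max())   # subtract the max for numerical stability
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]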
Here I am quite confused; I cannot be sure whether this is correct. A strong
doubt I have is that we can sample K(·) an infinite number of times, unlike
the Laplacian mechanism described earlier, where we can only sample once.
where D and D′ are “adjacent” data sets and (a) is due to the fact that K(·) is
ε-differentially private. Therefore, F(K(·)) is ε-differentially private.
Actually, the lemma should be apparent from the beginning: if it did not hold,
the definition of privacy would be meaningless, since the “private”
information could disclose unauthorized information after processing.
On the other hand, if multiple mechanisms are explored sequentially, we
expect that the privacy of the overall mechanism (as a sequential combination
of the multiple accesses) should degrade. We have the following very simple
lemma.

Lemma 5.2 (Sequential Composition). If two mechanisms K1 and K2 are
ε1- and ε2-differentially private, respectively, then the overall mechanism is
(ε1 + ε2)-differentially private, provided that the two mechanisms are independent.
Proof. For any adjacent data sets D1 and D2 and any outcome pair (r1, r2), by independence,

Pr(K1(D1) = r1, K2(D1) = r2) = Pr(K1(D1) = r1) Pr(K2(D1) = r2)
  ≤ exp(ε1) Pr(K1(D2) = r1) exp(ε2) Pr(K2(D2) = r2)
  = exp(ε1 + ε2) Pr(K1(D2) = r1, K2(D2) = r2).

In contrast, when the mechanisms act on disjoint portions of the data set, an added or
deleted record affects only one of them, so the overall mechanism remains
ε-differentially private if each mechanism is ε-differentially private (parallel
composition).
Example 5.1 (Disclosing gender population sizes of a group). Consider a data set
containing only the gender attribute, and suppose we disclose the
population size of each gender. To keep the disclosure differentially private,
one can add Laplacian noise to each population size before disclosing it. Let K1(D)
and K2(D) denote the mechanisms outputting the populations of males and
females, respectively. If each mechanism is ε-differentially private, the combined
mechanism is also ε-differentially private. Note that we have parallel composition
here since the gender populations are disjoint.
Example 5.2 (Disclosing disease population sizes of a group). Consider a similar
setting as in the previous example, but now we disclose the population sizes of
a number of diseases, e.g., AIDS, SARS, lung cancer, etc. Again, we add Laplacian
noise to each population size before disclosing it such that each disclosing
mechanism is ε-differentially private. Say we consider n different diseases
and disclose the population size of each of them. The overall
mechanism in this case will be nε-differentially private instead of ε-differentially
private. Note that we have sequential composition here rather than parallel
composition since the disease populations are generally not disjoint (e.g., one
can have both AIDS and lung cancer).
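The two examples can be sketched as follows in Python; the records are hypothetical and counting queries are assumed to have sensitivity 1.

import numpy as np

rng = np.random.default_rng()
epsilon = 0.5

# Example 5.1: gender counts partition the records (disjoint classes), so
# releasing both noisy counts is epsilon-differentially private (parallel composition).
genders = ["M", "F", "F", "M", "F"]
noisy_gender = {g: genders.count(g) + rng.laplace(0, 1 / epsilon) for g in ("M", "F")}

# Example 5.2: disease sets can overlap (one patient may have several diseases),
# so the n releases compose sequentially to a total budget of n * epsilon.
diseases = [{"AIDS", "lung cancer"}, {"SARS"}, {"flu", "lung cancer"}]
names = ["AIDS", "SARS", "lung cancer", "flu"]
noisy_disease = {d: sum(d in s for s in diseases) + rng.laplace(0, 1 / epsilon)
                 for d in names}
total_budget = len(names) * epsilon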
Once again, the definition itself is essentially “data blind”—to the extent that
the definition is indifferent to the actual query outcome. This is because
the inequality in the definition has to be satisfied not just for all adjacent
data set pairs, but also for all subsets of the range of the query function. For
example, for the counting query that is most commonly used to explain
differential privacy, the degrees of protection when the counting
result is 1 and when it is 10,000 are exactly the same. Intuitively, if the query
counts the number of AIDS patients living in a particular zip code, we expect
that higher protection of patient privacy should be needed when the number
of patients is significantly smaller. Finally, because of its data-blind nature,
differential privacy completely ignores the possibility of quantifying the utility
of a data record. It essentially adds noise evenly to all records even though some
records could be more useful than the rest. Moreover, the parameter ε, without
any apparent realistic interpretation, is usually predetermined rather arbitrarily
and controls a hard privacy constraint. A systematic way to trade off privacy
and utility (in what way one should adjust ε if higher utility is desirable) is not
given.
Recall that the attributes of a record can be categorized as follows:
• Y : Key attributes;
• X: Quasi-identifier;
• W : Sensitive attributes.
To ensure the privacy of the record owners, the key attributes are first removed from
a disclosed data set. However, that is not sufficient because a malicious user may
be able to infer the identity and/or sensitive attributes of a record owner with
some prior information. All non-interactive privacy techniques essentially boil
down to modifying the quasi-identifier X into a distorted one X̂, which is disclosed
instead.
Note that generally we can have more than one quasi-identifier. However,
without loss of generality, we can always use one variable X to summarize the
corresponding set of attributes. Similarly, we can have more than one sensitive
attribute, but in this case it may not be helpful to group them into one
variable. Fortunately, we do not lose much generality in our analysis by
considering privacy protection of only one sensitive attribute; when we have more
than one sensitive attribute, we may apply the privacy constraint to each of them
separately.
A useful concept is the equivalence class of a quasi-identifier, where the class
includes all records with the corresponding quasi-identifier. The main intuition
is that, to a malicious user with prior information regarding a quasi-identifier, it
could be possible to uniquely identify a record from the quasi-identifier
alone. Therefore, to this user, the quasi-identifier is almost the same as an identifier.
Thus, sufficient diversity has to be ensured in each equivalence class to prevent
privacy leakage. Denote the equivalence class of a modified quasi-identifier x′
as Y(x′). Note that typically we can have several x corresponding to the same x′.
In other words, an equivalence class can contain different original quasi-identifiers x,
each of which is mapped to x′. On the other hand, it is also possible
to have one x map to several different x′ in order to confuse a malicious user.
However, this is much less explored systematically.
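A small Python sketch of forming equivalence classes by generalizing a quasi-identifier; the zip-code truncation rule and the toy records are purely hypothetical.

from collections import defaultdict

records = [
    {"zip": "73019", "disease": "flu"},
    {"zip": "73012", "disease": "HIV"},
    {"zip": "73072", "disease": "flu"},
]

def generalize(zip_code):
    # Map a quasi-identifier x to a coarser x' (here, keep only the 3-digit prefix).
    return zip_code[:3] + "**"

# classes[x'] plays the role of the equivalence class Y(x').
classes = defaultdict(list)
for r in records:
    classes[generalize(r["zip"])].append(r)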
7.1.1 k-anonymity
The objective of k-anonymity is to prevent identity leakage from a disclosed
data set. The idea is to ensure that within each equivalence
class there are at least k different records; in other words, |Y(x′)| ≥ k for
all x′. Considering this stochastically from the perspective of information theory, we
may assume that each record identified by key attribute y occurs equally likely
inside any equivalence class x′, i.e., Pr(Y = y|x′) = 1/|Y(x′)|. Then the
conditional entropy of Y given x′ is H(Y |x′) = log2 |Y(x′)|. Therefore, we can
rephrase the privacy condition compactly as1

H(Y |x′) ≥ log2 k        (25)

for all x′ in X′.
Since H(Y |X′) = Σ_{x′∈X′} Pr(X′ = x′) H(Y |x′), we also have

H(Y |X′) ≥ log2 k        (26)

if k-anonymity is satisfied. However, note that (26) is only a necessary but not
a sufficient condition.

1 There was a mistake in [?] where the condition is incorrectly phrased as H(X|x′) ≥ log2 k.
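Continuing the sketch above, checking k-anonymity (and its entropy form) amounts to examining the size of each equivalence class; this is only an illustrative check, not an anonymization procedure.

import math

def is_k_anonymous(classes, k):
    # |Y(x')| >= k for every equivalence class x'.
    return all(len(members) >= k for members in classes.values())

def entropy_form(classes, k):
    # Equivalent check H(Y | x') = log2 |Y(x')| >= log2 k under the uniform assumption.
    return all(math.log2(len(members)) >= math.log2(k) for members in classes.values())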
7.1.2 l-diversity
While the k-anonymity condition helps avoid identity leakage, it does not prevent
the leakage of sensitive attributes: even when there are multiple
records inside an equivalence class, if the sensitive attributes of all these records
are the same, a malicious user can still infer the sensitive attribute of a
record owner in the class. Therefore, the sensitive attribute inside each
equivalence class has to be diversified as well. To achieve this, l-diversity requires
that |W(x′)| ≥ l, where W(x′) is the set of sensitive attribute values appearing in the
equivalence class x′; more precisely, W(x′) ≜ {w(y) | y ∈ Y(x′)}. However, when
W(x′) is overly skewed, a malicious user may still predict with high confidence
the sensitive attribute of a record owner. For example, consider an equivalence
class with health condition as the sensitive attribute in which 95 out of 100 records
are HIV positive, while the remaining records comprise one flu, two heart disease, and
two lung cancer cases. Even though this satisfies 4-diversity as defined
above, a malicious user who identifies a record owner inside the equivalence
class can predict with high confidence that the record owner is HIV positive.
As a result, it is more reasonable to bound the empirical conditional entropy
H(W |x′) instead of |W(x′)|. Therefore, a refined definition requires

H(W |x′) ≥ log2 l

for all x′. Note that when w is uniformly distributed in the equivalence class,
we have H(W |x′) = log2 |W(x′)|, and thus we recover the same condition as in
the original definition.
As in the case of k-anonymity, since H(W |X′) = Σ_{x′∈X′} Pr(X′ = x′) H(W |x′),
we also have

H(W |X′) ≥ log2 l

if the refined l-diversity condition is satisfied, again as a necessary but not
sufficient condition.
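A sketch of the entropy-based l-diversity check H(W |x′) ≥ log2 l on the equivalence classes from the earlier sketch; the name of the sensitive attribute key is an assumption.

import math
from collections import Counter

def conditional_entropy(members, sensitive_key="disease"):
    # Empirical H(W | x') from the distribution of the sensitive attribute in the class.
    counts = Counter(r[sensitive_key] for r in members)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_l_diverse(classes, l):
    return all(conditional_entropy(m) >= math.log2(l) for m in classes.values())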
7.1.3 t-closeness
The inherent problem of l-diversity is that it could introduce an “anomaly” into an
equivalence class and lead to a harmful interpretation of a record owner by a
malicious user. For example, consider a data set with a sensitive attribute taking
only two possible values: lung cancer and flu. Flu is much more common,
and perhaps only 0.01% of the records over the entire data set correspond to
lung cancer. Now, consider an equivalence class with nothing but one flu
record. While adding a cancer record to the equivalence class would “improve”
the data set to 2-diversity, a malicious user who knows a record owner inside the
equivalence class may wrongly interpret this as the owner having a 50% chance of having
cancer instead of the normal 0.01%. Therefore, in the best interest of a record
owner, it can be more important to avoid an “anomaly” in an equivalence class
rather than forcing it to satisfy diversity blindly. This leads to the definition of
t-closeness, which requires that for any quasi-identifier x′

D(p(w|x′) || p(w)) ≤ t,
where p(w|x′) and p(w) are the distributions of w inside an equivalence class x′
and over the entire data set, respectively, and D(·||·) is the Kullback-Leibler
(KL) divergence, or the relative entropy, between two distributions. The
KL divergence essentially measures the distance between distributions, and thus
t-closeness is just a condition to ensure that the distribution of w inside each
equivalence class is almost the same as the overall one. Moreover, note that since
I(W ; X′) = Σ_{x′} Pr(X′ = x′) D(p(w|x′) || p(w)), we have

I(W ; X′) ≤ t        (30)
as a necessary but not sufficient condition for t-closeness, where I(W ; X′) is
the mutual information between W and X′ and can be rewritten as H(W ) −
H(W |X′). Therefore, we can rewrite (30) as

H(W ) − H(W |X′) ≤ t,

or equivalently, H(W |X′) ≥ H(W ) − t.
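A sketch of the t-closeness check using the KL divergence between empirical distributions; the attribute key and the use of base-2 logarithms (matching the entropies above) are assumptions.

import math
from collections import Counter

def distribution(records, key="disease"):
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q):
    # D(p || q); assumes every w with p(w) > 0 also has q(w) > 0, which holds
    # when q is estimated from the full data set containing the class.
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def is_t_close(classes, all_records, t):
    q = distribution(all_records)
    return all(kl_divergence(distribution(m), q) <= t for m in classes.values())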
References
[1] F. McSherry and K. Talwar, “Mechanism design via differential privacy,”
in Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE
Symposium on. IEEE, 2007, pp. 94–103.