Preserving Statistical Validity in Adaptive Data Analysis
Abstract
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries,
from the use of sophisticated validation techniques, to deep statistical methods for controlling
the false discovery rate in multiple hypothesis testing. However, there is a fundamental discon-
nect between the theoretical results and the practice of data analysis: the theory of statistical
inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be ap-
plied, selected non-adaptively before the data are gathered, whereas in practice data is shared
and reused with hypotheses and new analyses being generated on the basis of data exploration
and the outcomes of previous analyses.
In this work we initiate a principled study of how to guarantee the validity of statistical
inference in adaptive data analysis. As an instance of this problem, we propose and investigate
the question of estimating the expectations of m adaptively chosen functions on an unknown
distribution given n random samples.
We show that, surprisingly, there is a way to estimate an exponential in n number of ex-
pectations accurately even if the functions are chosen adaptively. This gives an exponential
improvement over standard empirical estimators that are limited to a linear number of esti-
mates. Our result follows from a general technique that counter-intuitively involves actively
perturbing and coordinating the estimates, using techniques developed for privacy preservation.
We give additional applications of this technique to our question.
∗ A preliminary version of this work appears in the proceedings of the ACM Symposium on Theory of Computing (STOC), 2015.
Author affiliations: Microsoft Research; IBM Almaden Research Center (part of this work was done while visiting the Simons Institute, UC Berkeley); IBM Almaden Research Center; University of Toronto; Samsung Research America; Department of Computer and Information Science, University of Pennsylvania.
1 Introduction
Throughout the scientific community there is a growing recognition that claims of statistical signif-
icance in published research are frequently invalid [Ioa05b, Ioa05a, PSA11, BE12]. The past few
decades have seen a great deal of effort to understand and propose mitigations for this problem.
These efforts range from the use of sophisticated validation techniques and deep statistical methods
for controlling the false discovery rate in multiple hypothesis testing to proposals for preregistration
(that is, defining the entire data-collection and data-analysis protocol ahead of time). The statis-
tical inference theory surrounding this body of work assumes a fixed procedure to be performed,
selected before the data are gathered. In contrast, the practice of data analysis in scientific research
is by its nature an adaptive process, in which new hypotheses are generated and new analyses are
performed on the basis of data exploration and observed outcomes on the same data. This dis-
connect is only exacerbated in an era of increased amounts of open access data, in which multiple,
mutually dependent, studies are based on the same datasets.
It is now well understood that adapting the analysis to data (e.g., choosing what variables to
follow, which comparisons to make, which tests to report, and which statistical methods to use) is
an implicit multiple comparisons problem that is not captured in the reported significance levels
of standard statistical procedures. This problem, in some contexts referred to as “p-hacking” or
“researcher degrees of freedom”, is one of the primary explanations of why research findings are
frequently false [Ioa05b, SNS11, GL14].
The “textbook” advice for avoiding problems of this type is to collect fresh samples from the
same data distribution whenever one ends up with a procedure that depends on the existing data.
Getting fresh data is usually costly and often impractical so this requires partitioning the available
dataset randomly into two or more disjoint sets of data (such as a training and testing set) prior to
the analysis. Following this approach conservatively with m adaptively chosen procedures would
significantly (on average by a factor of m) reduce the amount of data available for each procedure.
This would be prohibitive in many applications, and as a result, in practice even data allocated for
the sole purpose of testing is frequently reused (for example to tune parameters). Such abuse of
the holdout set is well known to result in significant overfitting to the holdout or cross-validation
set [Reu03, RF08].
Clear evidence that such reuse leads to overfitting can be seen in the data analysis competitions
organized by Kaggle Inc. In these competitions, the participants are given training data and
can submit (multiple) predictive models in the course of competition. Each submitted model is
evaluated on a (fixed) test set that is available only to the organizers. The score of each solution
is provided back to each participant, who can then submit a new model. In addition the scores are
published on a public leaderboard. At the conclusion of the competition the best entries of each
participant are evaluated on an additional, hitherto unused, test set. The scores from these final
evaluations are published. The comparison of the scores on the adaptively reused test set and one-
time use test set frequently reveals significant overfitting to the reused test set (e.g. [Win, Kaga]),
a well-recognized issue frequently discussed on Kaggle’s blog and user forums [Kagb, Kagc].
Despite the basic role that adaptivity plays in data analysis we are not aware of previous
general efforts to address its effects on the statistical validity of the results (see Section 1.4 for
an overview of existing approaches to the problem). We show that, surprisingly, the challenges
of adaptivity can be addressed using insights from differential privacy, a definition of privacy
tailored to privacy-preserving data analysis. Roughly speaking, differential privacy ensures that
the probability of observing any outcome from an analysis is “essentially unchanged” by modifying
any single dataset element (the probability distribution is over the randomness introduced by the
algorithm). Differentially private algorithms permit a data analyst to learn about the dataset as a
whole (and, by extension, the distribution from which the data were drawn), while simultaneously
protecting the privacy of the individual data elements. Strong composition properties show this
holds even when the analysis proceeds in a sequence of adaptively chosen, individually differentially
private, steps.
distribution. This theorem allows us to draw on a rich body of results in differential privacy and
to obtain corresponding results for our problem of guaranteeing validity in adaptive data analysis.
Before we state this general theorem, we describe a number of important corollaries for the question
we formulated above.
Our primary application is that, remarkably, it is possible to answer nearly exponentially many
adaptively chosen statistical queries (in the size of the data set n). Equivalently, this reduces the
sample complexity of answering m queries from linear in the number of queries to polylogarithmic,
nearly matching the dependence that is necessary for non-adaptively chosen queries.
Theorem 1 (Informal). There exists an algorithm that given a dataset of size at least n ≥
min(n0 , n1 ), can answer any m adaptively chosen statistical queries so that with high probability,
each answer is correct up to tolerance τ , where:
n0 = O( (log m)^{3/2} √(log |X|) / τ^{7/2} )    and    n1 = O( log m · log |X| / τ^4 ).
The two bounds above are incomparable. Note that the first bound is larger than the sample
complexity needed to answer non-adaptively chosen queries by only a factor of O(√(log m · log |X|)/τ^{3/2}),
whereas the second one is larger by a factor of O(log(|X|)/τ^2). Here log |X | should be viewed as
roughly the dimension of the domain. For example, if the underlying domain is X = {0, 1}d , the
set of all possible vectors of d-boolean attributes, then log |X | = d.
The above mechanism is not computationally efficient (it has running time linear in the size of
the data universe |X |, which is exponential in the dimension of the data). A natural question raised
by our result is whether there is an efficient algorithm for the task. This question was addressed
in [HU14, SU14], which show that under standard cryptographic assumptions any algorithm that can
answer more than ≈ n^2 adaptively chosen statistical queries must have running time exponential
in log |X |.
We show that it is possible to match this quadratic lower bound using a simple and practical
algorithm that perturbs the answer to each query with independent noise.
Theorem 2 (Informal). There exists a computationally efficient algorithm for answering m adap-
tively chosen statistical queries, such that with high probability, the answers are correct up to toler-
ance τ , given a data set of size at least n ≥ n0 for:
n0 = O( √m · (log m)^{3/2} / τ^{5/2} ).
Finally, we show a computationally efficient method which can answer exponentially many
queries so long as they were generated using o(n) rounds of adaptivity, even if we do not know
where the rounds of adaptivity lie. Another practical advantage of this algorithm is that it only
pays the price for a round if adaptivity actually causes overfitting. In other words, the algorithm
does not pay for the adaptivity itself but only for the actual harm to statistical validity that
adaptivity causes. This means that in many situations it would be possible to use this algorithm
successfully with a much smaller “effective” r (provided that a good bound on it is known).
Theorem 3 (Informal). There exists a computationally efficient algorithm for answering m adap-
tively chosen statistical queries, generated in r rounds of adaptivity, such that with high probability,
the answers are correct up to some tolerance τ , given a data set of size at least n ≥ n0 for:
n0 = O( r log m / τ^2 ).
denote the empirical average of ψ. We denote a random dataset chosen from P n by S. For any
fixed function ψ, the empirical average ES [ψ] is strongly concentrated around its expectation P[ψ].
However, this statement is no longer true if ψ is allowed to depend on S (which is what happens if
we choose functions adaptively, using previous estimates on S). In contrast, for a hypothesis output
by a differentially private algorithm A on S (denoted by φ = A(S)), we show that ES [φ] is close to P[φ] with
high probability.
High probability bounds are necessary to ensure that valid answers can be given to an exponen-
tially large number of queries. To prove these bounds, we show that differential privacy roughly
preserves the moments of ES [φ] even when conditioned on φ = ψ for any fixed ψ. Now using strong
concentration of the k-th moment of ES [ψ] around P[ψ]k , we can obtain that ES [φ] is concentrated
around P[φ]. Such an argument works only for (ε, 0)-differential privacy due to conditioning on
the event φ = ψ, which might have arbitrarily low probability. We use a more delicate conditioning
to obtain the extension to (ε, δ)-differential privacy. We note that (ε, δ)-differential privacy is
necessary to obtain the stronger bounds that we use for Theorems 1 and 2.

(Footnote 2: A weaker connection that gives closeness in expectation over the dataset and the algorithm's randomness was known to some experts and is considered folklore. We give a more detailed comparison in Sec. 1.4 and Sec. 2.1.)
We give an alternative, simpler proof for (ε, 0)-differential privacy that, in addition, extends
this connection beyond expectations of functions. We consider a differentially private algorithm A
that maps a database S ∼ P n to elements from some arbitrary range Z. Our proof shows that if
we have a collection of events R(y) defined over databases, one for each element y ∈ Z, and each
event is individually unlikely in the sense that for all y, the probability that S ∈ R(y) is small, then
the probability remains small that S ∈ R(Y ), where Y = A(S). Note that this statement involves
a re-ordering of quantifiers. The hypothesis of the theorem says that the probability of event R(y)
is small for each y, where the randomness is taken over the choice of database S ∼ P n , which is
independent of y. The conclusion says that the probability of R(Y ) remains small, even though
Y is chosen as a function of S, and so is no longer independent. The upshot of this result is that
adaptive analyses, if performed via a differentially private algorithm, can be thought of (almost) as
if they were non-adaptive, with the data being drawn after all of the decisions in the analysis are
fixed.
To prove this result we note that it would suffice to establish that for every y ∈ Z, P[S ∈
R(y) | Y = y] is not much larger than P[S ∈ R(y)]. By Bayes’ rule, for every dataset S,
P[S = S | Y = y] / P[S = S] = P[Y = y | S = S] / P[Y = y].
Therefore, to bound the ratio of P[S ∈ R(y) | Y = y] to P[S ∈ R(y)] it is sufficient to bound
the ratio of P[Y = y | S = S] to P[Y = y] for most S ∈ R(y). Differential privacy implies
that P[Y = y | S = S] does not change much when a single element of S is changed. From here, using McDiarmid's
concentration inequality, we obtain that P[Y = y | S = S] is strongly concentrated around its
mean, which is exactly P[Y = y].
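In symbols, the reduction behind this argument (ignoring the more delicate conditioning used in the actual proof) is

P[S ∈ R(y) | Y = y] = Σ_{S∈R(y)} P[S = S | Y = y] = Σ_{S∈R(y)} P[S = S] · P[Y = y | S = S] / P[Y = y],

so a uniform bound of c on the ratio P[Y = y | S = S] / P[Y = y] over (most of) R(y) yields P[S ∈ R(y) | Y = y] ≤ (roughly) c · P[S ∈ R(y)].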
There are procedures for controlling false discovery in a sequential setting in which tests arrive
one-by-one [FS08, ANR11, AR14]. However, the analysis of such tests crucially depends on the tests
maintaining their statistical properties despite conditioning on previous outcomes. These procedures are therefore
unsuitable for the problem we consider here, in which we place no restrictions on the analyst.
The classical approach in theoretical machine learning to ensure that empirical estimates gen-
eralize to the underlying distribution is based on the various notions of complexity of the set of
functions output by the algorithm, most notably the VC dimension (see [KV94] or [SSBD14] for a
textbook introduction). If one has a sample of data large enough to guarantee generalization for
all functions in some class of bounded complexity, then it does not matter whether the data ana-
lyst chooses functions in this class adaptively or non-adaptively. Our goal, in contrast, is to prove
generalization bounds without making any assumptions about the class from which the analyst can
choose query functions. In this case the adaptive setting is very different from the non-adaptive
setting.
An important line of work [BE02, MNPR06, PRMN04, SSSSS10] establishes connections be-
tween the stability of a learning algorithm and its ability to generalize. Stability is a measure of
how much the output of a learning algorithm is perturbed by changes to its input. It is known that
certain stability notions are necessary and sufficient for generalization. Unfortunately, the stability
notions considered in these prior works are not robust to post-processing, and so the stability of
a query answering procedure would not guarantee the stability of the query generating procedure
used by an arbitrary adaptive analyst. They also do not compose in the sense that running mul-
tiple stable algorithms sequentially and adaptively may result in a procedure that is not stable.
Differential privacy is stronger than these previously studied notions of stability, and in particular
enjoys strong post-processing and composition guarantees. This provides a calculus for building up
complex algorithms that satisfy stability guarantees sufficient to give generalization. Past work has
considered the generalization properties of one-shot learning procedures. Our work can in part be
interpreted as showing that differential privacy implies generalization in the adaptive setting, and
beyond the framework of learning.
Differential privacy emerged from a line of work [DN03, DN04, BDMN05], culminating in the
definition given in [DMNS06]. It is a stability property of algorithms that was developed in the
context of data privacy. There is a very large body of work designing differentially private algorithms
for various data analysis tasks, some of which we leverage in our applications. Most crucially, it is
known how to accurately answer exponentially many adaptively chosen queries on a fixed dataset
while preserving differential privacy [RR10, HR10], which is what yields the main application in
our paper, when combined with our main theorem. See [Dwo11] for a short survey and [DR14] for
a textbook introduction to differential privacy.
For differentially private algorithms that output a hypothesis it has been known as folklore that
differential privacy implies stability of the hypothesis to replacing (or removing) an element of the
input dataset. Such stability is long known to imply generalization in expectation (e.g. [SSSSS10]).
See Section 2.1 for more details. Our technique can be seen as a substantial strengthening of this
observation: from expectation to high probability bounds (which is crucial for answering many
queries), from pure to approximate differential privacy (which is crucial for our improved efficient
algorithms), and beyond the expected error of a hypothesis.
Further Developments: Our work has attracted substantial interest to the problem of statistical
validity in adaptive data analysis and its relationship to differential privacy. Hardt and Ullman
[HU14] and Steinke and Ullman [SU14] have proven complementary computational lower bounds
for the problem formulated in this work. They show that, under standard cryptographic assump-
tions, the exponential running time of the algorithm instantiating our main result is unavoidable.
They also show that the square-root dependence on the number of queries in the sample complexity
of our efficient algorithm is nearly optimal among all computationally efficient mechanisms for
answering arbitrary statistical queries.
In [DFH+ 15a] we discuss approaches to the problem of adaptive data analysis more generally.
We demonstrate how differential privacy and description-length-based analyses can be used in
this context. In particular, we show that the bounds on n1 obtained in Theorem 1 can also be
obtained by analyzing the transcript of the median mechanism for query answering [RR10] (even
without adding noise). Further, we define a notion of approximate max-information between the
dataset and the output of the analysis that ensures generalization with high probability, composes
adaptively and unifies (pure) differential privacy and description-length-based analyses. We also
demonstrate an application of these techniques to the problem of reusing the holdout (or testing)
dataset. An overview of this work and [DFH+ 15a] intended for a broad scientific audience appears
in [DFH+ 15b].
Blum and Hardt [BH15] give an algorithm for reusing the holdout dataset specialized to the
problem of maintaining an accurate leaderboard for a machine learning competition (such as those
organized by Kaggle Inc. and discussed earlier). Their generalization analysis is based on the
description length of the algorithm’s transcript.
Our results for approximate (δ > 0) differential privacy apply only to statistical queries (see
Thm. 10). Bassily, Nissim, Smith, Steinke, Stemmer and Ullman [BNS+ 15] give a novel, elegant
analysis of the δ > 0 case that gives an exponential improvement in the dependence on δ and gen-
eralizes it to arbitrary low-sensitivity queries. This leads to stronger bounds on sample complexity
that remove an O(√(log m)/τ) factor from the bounds on n0 we give in Theorems 1 and 2. It also
implies a similar improvement and generalization to low-sensitivity queries in the reusable holdout
application [DFH+ 15a].
Another implication of our work is that composition and post-processing properties (which are
crucial in the adaptive setting) can be ensured by measuring the effect of data analysis on the
probability space of the analysis outcomes. Several additional techniques of this type have been
recently analyzed. Bassily et al. [BNS+ 15] show that generalization in expectation (as discussed
in Cor. 7) can also be obtained from two additional notions of stability: KL-stability and TV-
stability that bound the KL-divergence and total variation distance between output distributions
on adjacent datasets, respectively. Russo and Zou [RZ15] show that generalization in expectation
can be derived by bounding the mutual information between the dataset and the output of analysis.
They give applications of their approach to analysis of adaptive feature selection procedures. We
note that these techniques do not imply high-probability generalization bounds that we obtain here
and in [DFH+ 15a].
2 Preliminaries
Let P be a distribution over a discrete universe X of possible data points. For a function ψ : X →
[0, 1] let P[ψ] = E_{x∼P}[ψ(x)]. Given a dataset S = (x1 , . . . , xn ), a natural estimator of P[ψ] is the
empirical average (1/n) Σ_{i=1}^n ψ(x_i). We let ES denote the empirical distribution that assigns weight
1/n to each of the data points in S and thus ES [ψ] is equal to the empirical average of ψ on S.
Definition 4. A statistical query is defined by a function ψ : X → [0, 1] and tolerance τ . For
a distribution P over X , a valid response to such a query is any value v such that |v − P[ψ]| ≤ τ .
The standard Hoeffding bound implies that for a fixed query function (chosen independently
of the data) the probability over the choice of the dataset that ES [ψ] has error greater than τ is
at most 2 · exp(−2τ^2 n). This implies that an exponential in n number of statistical queries can be
evaluated within τ as long as the queries do not depend on the data.
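As a minimal illustration of this baseline (not part of the formal development; the query and distribution below are arbitrary choices for the example), the following Python sketch evaluates a fixed statistical query by its empirical average and compares the error to the Hoeffding tail bound:

import numpy as np

def empirical_mean(query, sample):
    # Empirical average E_S[psi] of a [0,1]-valued query over the sample.
    return float(np.mean(query(sample)))

rng = np.random.default_rng(0)
n, tau = 2000, 0.05
sample = rng.random(n)                      # n i.i.d. draws from P = Uniform[0, 1]

psi = lambda x: (x > 0.5).astype(float)     # a fixed (non-adaptive) statistical query
true_value = 0.5                            # P[psi] under the uniform distribution
est = empirical_mean(psi, sample)
# Hoeffding: P[|E_S[psi] - P[psi]| > tau] <= 2 * exp(-2 * tau^2 * n), about 9e-5 here
print(abs(est - true_value), 2 * np.exp(-2 * tau**2 * n))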
We now formally define differential privacy. We say that datasets S, S ′ are adjacent if they differ
in a single element.
Definition 5. [DMNS06, DKM+ 06] A randomized algorithm A with domain X n is (ε, δ)-differentially
private if for all O ⊆ Range(A) and for all pairs of adjacent datasets S, S′ ∈ X^n :

P[A(S) ∈ O] ≤ e^ε · P[A(S′) ∈ O] + δ,

where the probability space is over the coin flips of the algorithm A. The case when δ = 0 is
sometimes referred to as pure differential privacy, and in this case we may say simply that A is
ε-differentially private.
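As a toy illustration of Definition 5 (randomized response on a single bit, which is not an algorithm used in this paper), the following sketch estimates the output distributions of a (ln 3, 0)-differentially private mechanism on two adjacent one-element datasets and checks the multiplicative bound numerically:

import numpy as np

def randomized_response(bit, rng):
    # Report the true bit with probability 3/4 and its flip with probability 1/4.
    # This mechanism is (ln 3, 0)-differentially private.
    return bit if rng.random() < 0.75 else 1 - bit

rng = np.random.default_rng(1)
trials = 200_000
p0 = np.mean([randomized_response(0, rng) for _ in range(trials)])  # P[output = 1 | bit 0] ~ 1/4
p1 = np.mean([randomized_response(1, rng) for _ in range(trials)])  # P[output = 1 | bit 1] ~ 3/4
print(p1 / p0, np.exp(np.log(3)))   # the empirical ratio is ~3, matching e^eps = 3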
Lemma 6. Let A be an (ε, δ)-differentially private algorithm ranging over functions from X to
[0, 1]. For any pair of adjacent datasets S and S′ and x ∈ X :

E[A(S)(x)] ≤ e^ε · E[A(S′)(x)] + δ

and, in particular,

E[A(S)(x)] − E[A(S′)(x)] ≤ e^ε − 1 + δ.    (1)
Corollary 7. Let A be an (ǫ, δ)-differentially private algorithm ranging over functions from X
to [0, 1], let P be a distribution over X and let S be an independent random variable distributed
according to P n . Then
| E[ES [A(S)]] − E[P[A(S)]] | ≤ e^ε − 1 + δ.
This corollary was observed in the context of functions expressing the loss of the hypothesis
output by a (private) learning algorithm, that is, φ(x) = L(h(x), x), where x is a sample (possibly
including a label), h is a hypothesis function and L is a non-negative loss function. When applied
to such a function, Corollary 7 implies that the expected true loss of a hypothesis output by an
(ε, δ)-differentially private algorithm is at most eε − 1 + δ larger than the expected empirical loss of
the output hypothesis, where the expectation is taken over the random dataset and the randomness
of the algorithm. A special case of this corollary is stated in a recent work of Bassily et al. [BST14].
More recently, Wang et al. [WLF15] have similarly used the stability of differentially private learning
algorithms to show a general equivalence of differentially private learning and differentially private
empirical loss minimization.
A standard way to obtain a high-probability bound from a bound on expectation in Corollary
7 is to use Markov’s inequality. Using this approach, a bound that holds with probability 1 − β
will require a polynomial dependence of the sample size on 1/β. While this might lead to a useful
bound when the expected empirical loss is small it is less useful in the common scenario when
the empirical loss is relatively large. In contrast, our results in Sections 3 and 4 directly imply
generalization bounds with logarithmic dependence of the sample size on 1/β. For example, in
Theorem 9 we show that for any ε, β > 0 and n ≥ O(ln(1/β)/ε2 ), the output of an ε-differentially
private algorithm A satisfies P [|P[A(S)] − ES [A(S)]| > 2ε] ≤ β.
Lemma 8. Assume that A is an (ǫ, 0)-differentially private algorithm ranging over functions from
X to [0, 1]. Let S, T be independent random variables distributed according to P n . For any function
ψ : X → [0, 1] in the support of A(S),
E[ES [φ]^k | φ = ψ] ≤ e^{kε} · E[ET [ψ]^k].    (2)
Proof. We use I to denote a k-tuple of indices (i1 , . . . , ik ) ∈ [n]k and use I to denote a k-tuple
chosen randomly and uniformly from [n]^k. For a data set T = (y1 , . . . , yn ) we denote by
Π^I_T(ψ) = Π_{j∈[k]} ψ(y_{i_j}). We first observe that for any ψ,
For two datasets S, T ∈ X n , let SI←T denote the data set in which for every j ∈ [k], element ij
in S is replaced with the corresponding element from T . We fix I. Note that the random variable
SI←T is distributed according to P n and therefore
Now for any fixed t, S and T consider the event ΠIT (A(S)) ≥ t and A(S) = ψ (defined on the
range of A). Data sets S and SI←T differ in at most k elements. Therefore, by the ε-differential
privacy of A and Lemma 24, the distribution A(S) and the distribution A(SI←T ) satisfy:
Taking the expectation over I and using eq. (3) we obtain that
E[ES [φ]^k | φ = ψ] ≤ e^{kε} E[ET [ψ]^k], which completes the proof.
We now turn our moment inequality into a theorem showing that ES [φ] is concentrated around
the true expectation P[φ].
Theorem 9. Let A be an ε-differentially private algorithm that given a dataset S outputs a function
from X to [0, 1]. For any distribution P over X and random variable S distributed according to
P n we let φ = A(S). Then for any β > 0, τ > 0 and n ≥ 12 ln(4/β)/τ 2 , setting ε ≤ τ /2 ensures
P [|P[φ] − ES [φ]| > τ ] ≤ β, where the probability is over the randomness of A and S.
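As a concrete instantiation of the parameters: with τ = 0.1 and β = 0.05 the theorem applies once n ≥ 12 ln(4/β)/τ^2 = 12 ln(80)/0.01 ≈ 5259, with the privacy parameter set to ε = τ/2 = 0.05.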
Proof. Consider an execution of A with ε = τ/2 on a data set S of size n ≥ 12 ln(4/β)/τ^2. By
Lemma 29 we obtain that the RHS of our bound in Lemma 8 is at most e^{εk} Mk[B(n, P[ψ])]. We use
Lemma 31 with ε = τ/2 and k = 4 ln(4/β)/τ (noting that the assumption n ≥ 12 ln(4/β)/τ^2
ensures the necessary bound on n) to obtain that
Now for fixed values of t, S and T we consider the event ΠIT (A(S)) ≥ t and P[A(S)] ∈ Bℓ defined
on the range of A. Datasets S and SI←T differ in at most k elements. Therefore, by the (ε, δ)-
differential privacy of A and Lemma 24, the distribution over the output of A on input S and the
distribution over the output of A on input SI←T satisfy:
Taking the probability over S and T and substituting this into eq. (6) we get

E[Π^I_S(φ) | P[φ] ∈ Bℓ] ≤ e^{kε} ∫_0^1 ( P[Π^I_T(φ) ≥ t and P[φ] ∈ Bℓ] / P[P[φ] ∈ Bℓ] ) dt + e^{(k−1)ε} δ / P[P[φ] ∈ Bℓ]
  = e^{kε} E[Π^I_T(φ) | P[φ] ∈ Bℓ] + e^{(k−1)ε} δ / P[P[φ] ∈ Bℓ]
P[ ES [φ] ≥ τℓ + τ | P[φ] ∈ Bℓ ] ≤ β/2 + δ e^{(k−1)ε} · 4L / (β(τℓ + τ)^k),    (8)
Using condition δ = exp(−2 · ln(4/β)/τ ) and inequality ln(x) ≤ x/e (for x > 0) we obtain
Substituting this into eq. (8) we get
Apply the same argument to 1 − φ and use a union bound. We obtain the claim after rescaling τ
and β by a factor 2.
Further, by the definition of differential privacy, for two databases S, S′ that differ in a single element,

P[Y = y | S = S] ≤ e^ε · P[Y = y | S = S′].

Now consider the function g(S) = ln( P[Y = y | S = S] / P[Y = y] ). By the properties above we have that
E[g(S)] ≤ ln(P[Y = y]) − ln(P[Y = y]) = 0 and |g(S) − g(S′)| ≤ ε. This, by McDiarmid's inequality
(Lemma 28), implies that for any t > 0,

P[g(S) ≥ εt] ≤ e^{−2t^2/n}.    (9)
For an integer i ≥ 1 let

Bi = { S | ε √(n ln(2^i/β)/2) ≤ g(S) ≤ ε √(n ln(2^{i+1}/β)/2) }

and let B0 = { S | g(S) ≤ ε √(n ln(2/β)/2) }.

By inequality (9) we have that for i ≥ 1, P[g(S) ≥ ε √(n ln(2^i/β)/2)] ≤ β/2^i. Therefore, for all i ≥ 0,

P[S ∈ Bi ∩ R(y)] ≤ β/2^i,

where the case of i = 0 follows from the assumptions of the lemma.
By Bayes' rule, for every S ∈ Bi,

P[S = S | Y = y] / P[S = S] = P[Y = y | S = S] / P[Y = y] = exp(g(S)) ≤ exp( ε √(n ln(2^{i+1}/β)/2) ).

Therefore,

P[S ∈ Bi ∩ R(y) | Y = y] = Σ_{S ∈ Bi ∩ R(y)} P[S = S | Y = y]
  ≤ exp( ε √(n ln(2^{i+1}/β)/2) ) · Σ_{S ∈ Bi ∩ R(y)} P[S = S]
  = exp( ε √(n ln(2^{i+1}/β)/2) ) · P[S ∈ Bi ∩ R(y)]
  ≤ exp( ε √(n ln(2^{i+1}/β)/2) − ln(2^i/β) ).    (10)

The condition ε ≤ √(ln(1/β)/(2n)) implies that

ε √(n ln(2^{i+1}/β)/2) − ln(2^i/β) ≤ √( ln(1/β) ln(2^{i+1}/β) / 4 ) − ln(2^i/β)
  ≤ ln(2^{(i+1)/2}/β)/2 − ln(2^i/β) = −ln( 2^{(3i−1)/4} / √β )
Our theorem gives a result for statistical queries that achieves the same bound as our earlier
result in Theorem 9 up to constant factors in the parameters.
Corollary 12. Let A be an ε-differentially private algorithm that outputs a function from X to
[0, 1]. For a distribution P over X , let S be a random variable distributed according to P^n and
let φ = A(S). Then for any τ > 0, setting ε ≤ √(τ^2 − ln(2)/(2n)) ensures P[|P[φ] − ES [φ]| > τ] ≤ 3√2 · e^{−τ^2 n}.
Proof. By the Chernoff bound, for any fixed query function ψ : X → [0, 1],

P[|P[ψ] − ES [ψ]| ≥ τ] ≤ 2e^{−2τ^2 n}.

Now, by Theorem 11 applied to R(ψ) = {S ∈ X^n | |P[ψ] − ES [ψ]| > τ}, β = 2e^{−2τ^2 n} and any ε ≤ √(τ^2 − ln(2)/(2n)),

P[|P[φ] − ES [φ]| > τ] ≤ 3√2 · e^{−τ^2 n}.
5 Applications
To obtain algorithms for answering adaptive statistical queries we first note that if for a query
function ψ and a dataset S, |P[ψ] − ES [ψ]| ≤ τ /2 then we can use an algorithm that outputs a
value v that is τ /2-close to ES [ψ] to obtain a value that is τ -close to P[ψ]. Differentially private
algorithms that for a given dataset S and an adaptively chosen sequence of queries φ1 , . . . , φm
produce a value close to ES [φi ] for each query φi : X → [0, 1] have been the subject of intense
investigation in the differential privacy literature (see [DR14] for an overview). Such queries are
usually referred to as (fractional) counting queries or linear queries in this context. This allows
us to obtain statistical query answering algorithms by using various known differentially private
algorithms for answering counting queries.
The results in Sections 3 and 4 imply that |P[ψ] − ES [ψ]| ≤ τ holds with high probability when-
ever ψ is generated by a differentially private algorithm M. This might appear to be inconsistent
with our application since there the queries are generated by an arbitrary (possibly adversarial)
adaptive analyst and we can only guarantee that the query answering algorithm is differentially
private. The connection comes from the following basic fact about differentially private algorithms:
Fact 13 (Postprocessing Preserves Privacy (see e.g. [DR14])). Let M : X n → O be an (ǫ, δ)
differentially private algorithm with range O, and let F : O → O′ be an arbitrary randomized
algorithm. Then F ◦ M : X n → O′ is (ǫ, δ)-differentially private.
Hence, an arbitrary adaptive analyst A is guaranteed to generate queries in a manner that is
differentially private in S so long as the only access that she has to S is through a differentially
private query answering algorithm M. We also note that the bounds we state here give the
probability of correctness for each individual answer to a query, meaning that the error probability
β is for each query φi and not for all queries at the same time. The bounds we state in Section 1.2
hold with high probability for all m queries and to obtain them from the bounds in this section,
we apply the union bound by setting β = β ′ /m for some small β ′ .
We now highlight a few applications of differentially private algorithms for answering counting
queries to our problem.
5.1 Laplacian Noise Addition
The Laplacian Mechanism on input of a dataset S answers m adaptively chosen queries φ1 , . . . , φm
by responding with ES [φi ] + Lap(0, σ) when given query φi . Here, Lap(0, σ) denotes a Laplacian
random variable of mean 0 and scale σ. For suitably chosen σ the algorithm has the following
guarantee.
Theorem 14 (Laplace). Let τ, β, ε > 0 and define

nL(τ, β, ε, m) = m log(1/β) / (ετ),

n^δ_L(τ, β, ε, δ, m) = √(m log(1/δ)) · log(1/β) / (ετ).
There is a computationally efficient algorithm called Laplace which on input of a data set S of size
n accepts any sequence of m adaptively chosen functions φ1 , . . . , φm ∈ X [0,1] and returns estimates
a1 , . . . , am such that for every i ∈ [m] we have P [|ES [φi ] − ai | > τ ] ≤ β. To achieve this guarantee
under (ǫ, 0)-differential privacy, it requires n ≥ CnL (τ, β, ǫ, m), and to achieve this guarantee under
(ǫ, δ)-differential privacy, it requires n ≥ CnδL (τ, β, ǫ, δ, m) for sufficiently large constant C.
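The following Python sketch conveys the structure of such a mechanism; the noise scale below is a simple per-query budget split (basic composition) and the interface is our own, so it does not reproduce the exact calibration behind Theorem 14:

import numpy as np

class LaplaceMechanism:
    # Answer adaptively chosen [0,1]-valued statistical queries on a fixed sample
    # by adding Laplace noise to each empirical average (a sketch only).
    def __init__(self, sample, epsilon, num_queries, seed=0):
        self.sample = np.asarray(sample)
        self.n = len(self.sample)
        # Each of the m queries gets an (epsilon/m)-differentially private answer;
        # the empirical average has sensitivity 1/n, so the noise scale is m/(n*epsilon).
        self.scale = num_queries / (self.n * epsilon)
        self.rng = np.random.default_rng(seed)

    def answer(self, query):
        empirical = float(np.mean(query(self.sample)))
        return empirical + self.rng.laplace(0.0, self.scale)

# Usage: the analyst may choose the next query after seeing previous answers.
rng = np.random.default_rng(1)
mech = LaplaceMechanism(rng.random(10_000), epsilon=1.0, num_queries=3)
a1 = mech.answer(lambda x: (x > 0.5).astype(float))
a2 = mech.answer(lambda x: (x > a1).astype(float))   # adaptively chosen follow-up query
print(a1, a2)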
Applying our main generalization bound for (ǫ, 0)-differential privacy directly gives the following
corollary.
Corollary 15. Let τ, β > 0 and define
nL(τ, β, m) = m log(1/β) / τ^2.
There is a computationally efficient algorithm which on input of a data set S of size n sampled
from P n accepts any sequence of m adaptively chosen functions φ1 , . . . , φm ∈ X [0,1] and returns
estimates a1 , . . . , am such that for every i ∈ [m] we have P [|P[φi ] − ai | > τ ] ≤ β provided that
n ≥ CnL (τ, β, m) for sufficiently large constant C.
Proof. We apply Theorem 9 with ǫ = τ /2 and plug this choice of ǫ into the definition of nL in
Theorem 14. We note that the stated lower bound on n implies the lower bound required by
Theorem 9.
The next corollary, which follows from the (ε, δ) bound, gives a quadratic improvement in m compared with
Corollary 15 at the expense of a slightly worse dependence on τ and 1/β.
Corollary 16. Let τ, β > 0 and define
n^δ_L(τ, β, m) = √m · log^{1.5}(1/β) / τ^{2.5}.
There is a computationally efficient algorithm which on input of a data set S of size n sampled
from P n accepts any sequence of m adaptively chosen functions φ1 , . . . , φm ∈ X [0,1] and returns
estimates a1 , . . . , am such that for every i ∈ [m] we have P [|P[φi ] − ai | > τ ] ≤ β provided that
n ≥ CnδL (τ, β, m) for sufficiently large constant C.
Proof. We apply Theorem 10 with ǫ = τ /2 and δ = exp(−4 ln(8/β)/τ ). Plugging these parameters
into the definition of nδL in Theorem 14 gives the stated lower bound on n. We note that the stated
lower bound on n implies the lower bound required by Theorem 10.
5.2 Multiplicative Weights Technique
The private multiplicative weights algorithm [HR10] achieves an exponential improvement in m
compared with the Laplacian mechanism. The main drawback is a running time that scales linearly
with the domain size in the worst case and is therefore not computationally efficient in general.
Theorem 17 (Private Multiplicative Weights). Let τ, β, ε > 0 and define

nMW(τ, β, ε) = log(|X|) log(1/β) / (ετ^3),

n^δ_MW(τ, β, ε, δ) = √(log(|X|) log(1/δ)) · log(1/β) / (ετ^2).
There is an algorithm called PMW which on input of a data set S of size n accepts any sequence of m
adaptively chosen functions φ1 , . . . , φm ∈ X [0,1] and with probability at least 1 − (n log |X |)β returns
estimates a1 , . . . , am such that for every i ∈ [m] we have P [|ES [φi ] − ai | > τ ] ≤ β. To achieve this
guarantee under (ǫ, 0) differential privacy, it requires that n ≥ CnM W (τ, β, ǫ) and to achieve it
under (ǫ, δ)-differential privacy it requires n ≥ CnδM W (τ, β, ǫ, δ) for sufficiently large constant C.
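The following Python sketch conveys the multiplicative weights update at the heart of PMW; the noise scale, threshold and learning rate are illustrative placeholders rather than the calibrated values behind Theorem 17, and the full algorithm detects below-threshold rounds with the sparse vector technique:

import numpy as np

class PMWSketch:
    # Schematic private multiplicative weights update (after [HR10]); a sketch only.
    def __init__(self, counts, eps_per_answer=1.0, threshold=0.1, eta=0.05, seed=0):
        counts = np.asarray(counts, dtype=float)
        self.hist = counts / counts.sum()                     # private empirical distribution over X
        self.synth = np.full(len(counts), 1.0 / len(counts))  # public synthetic distribution over X
        self.noise = 1.0 / (eps_per_answer * counts.sum())    # Laplace scale for one noisy answer
        self.threshold, self.eta = threshold, eta
        self.rng = np.random.default_rng(seed)

    def answer(self, q):
        # q: vector of query values in [0, 1], one entry per universe element.
        true_ans = float(self.hist @ q)
        synth_ans = float(self.synth @ q)
        noisy = true_ans + self.rng.laplace(0.0, self.noise)
        if abs(noisy - synth_ans) <= self.threshold:
            return synth_ans          # below threshold: the synthetic answer is already accurate
        # Multiplicative weights update: push the synthetic distribution toward
        # agreement with the noisy answer on this query, then renormalize.
        sign = 1.0 if noisy > synth_ans else -1.0
        self.synth = self.synth * np.exp(self.eta * sign * q)
        self.synth /= self.synth.sum()
        return noisy

# Usage on a toy universe X = {0, ..., 7}; the query is psi(x) = 1{x < 2}.
pmw = PMWSketch(counts=[5, 1, 1, 1, 0, 0, 1, 1])
print(pmw.answer(np.array([1., 1., 0., 0., 0., 0., 0., 0.])))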
Corollary 18. Let τ, β > 0 and define

nMW(τ, β) = log(|X|) log(1/β) / τ^4.
There is an algorithm which on input of a data set S of size n sampled from P n accepts any sequence
of m adaptively chosen functions φ1 , . . . , φm ∈ X [0,1] and with probability at least 1 − (n log |X |)β
returns estimates a1 , . . . , am such that for every i ∈ [m] we have P [|P[φi ] − ai | > τ ] ≤ β provided
that n ≥ CnM W (τ, β) for sufficiently large constant C.
Proof. We apply Theorem 9 with ǫ = τ /2 and plug this choice of ǫ into the definition of nM W
in Theorem 17. We note that the stated lower bound on n implies the lower bound required by
Theorem 9.
Under (ǫ, δ) differential privacy we get the following corollary that improves the dependence on
τ and log |X | in Corollary 18 at the expense of a slightly worse dependence on β.
Corollary 19. Let τ, β > 0 and define

n^δ_MW(τ, β) = √(log(|X|)) · log(1/β)^{3/2} / τ^{3.5}.
There is an algorithm which on input of a data set S of size n sampled from P n accepts any sequence
of m adaptively chosen functions φ1 , . . . , φm ∈ X [0,1] and with probability at least 1 − (n log |X |)β
returns estimates a1 , . . . , am such that for every i ∈ [m] we have P [|P[φi ] − ai | > τ ] ≤ β provided
that n ≥ CnδM W (τ, β) for sufficiently large constant C.
Proof. We apply Theorem 10 with ǫ = τ /2 and δ = exp(−4 ln(8/β)/τ ). Plugging these parameters
into the definition of nδM W in Theorem 17 gives the stated lower bound on n. We note that the
stated lower bound on n implies the lower bound required by Theorem 10.
5.3 Sparse Vector Technique
In this section we give a computationally efficient technique for answering exponentially many
queries φ1 , . . . , φm in the size of the data set n so long as they are chosen using only o(n) rounds
of adaptivity. We say that a sequence of queries φ1 , . . . , φm ∈ X [0,1] , answered with numeric values
a1 , . . . , am is generated with r rounds of adaptivity if there are r indices i1 , . . . , ir such that the
procedure that generates the queries as a function of the answers can be described by r +1 (possibly
randomized) algorithms f0 , f1 , . . . , fr satisfying:
nSV(τ, β, ε) = 9r ln(4/β) / (τε),

n^δ_SV(τ, β, ε, δ) = (√512 + 1) · √(r ln(2/δ)) · ln(4/β) / (τε).
There is an algorithm called SPARSE parameterized by a real valued threshold T , which on input of
a data set S of size n accepts any sequence of m adaptively chosen queries together with guesses
at their values gi ∈ R: (φ1 , g1 ), . . . , (φm , gm ) and returns answers a1 , . . . , am ∈ {⊥} ∪ R. It has
the property that for all i ∈ [m], with probability 1 − β: if ai = ⊥ then |ES [φi ] − gi | ≤ T + τ
and if ai ∈ R, |ES [φi ] − ai | ≤ τ . To achieve this guarantee under (ǫ, 0)-differential privacy it
requires n ≥ nSV (τ, β, ǫ) and to achieve this guarantee under (ǫ, δ)-differential privacy, it requires
n ≥ nδSV (τ, β, ǫ, δ). In either case, the algorithm also requires that |{i : |ES [φi ] − gi | ≥ T − τ }| ≤ r.
(If this last condition does not hold, the algorithm may halt early and stop accepting queries)
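The following Python sketch conveys the structure of such a routine; the class, parameter names and noise scales are ours and are illustrative rather than the calibrated ones behind the guarantee above:

import numpy as np

class SparseSketch:
    # Schematic "above threshold" routine in the spirit of SPARSE: it returns None
    # (standing for the answer ⊥) while the guess is close to E_S[phi] on its sample,
    # and spends privacy budget on a noisy answer only when the guess is far off.
    def __init__(self, sample, threshold, epsilon, max_failures, seed=0):
        self.sample = np.asarray(sample)
        self.n = len(self.sample)
        self.threshold = threshold
        self.eps = epsilon
        self.max_failures = max_failures
        self.failures = 0
        self.rng = np.random.default_rng(seed)
        # The noisy threshold is refreshed after every above-threshold event.
        self.noisy_threshold = threshold + self.rng.laplace(0.0, 2.0 / (self.eps * self.n))

    def answer(self, query, guess):
        if self.failures >= self.max_failures:
            raise RuntimeError("budget of above-threshold events exhausted")
        empirical = float(np.mean(query(self.sample)))   # E_S[phi] on this routine's sample
        gap = abs(empirical - guess) + self.rng.laplace(0.0, 4.0 / (self.eps * self.n))
        if gap <= self.noisy_threshold:
            return None                                  # stands for ⊥: the guess is good enough
        self.failures += 1
        self.noisy_threshold = self.threshold + self.rng.laplace(0.0, 2.0 / (self.eps * self.n))
        return empirical + self.rng.laplace(0.0, 2.0 / (self.eps * self.n))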
We observe that the naïve method of answering queries using their empirical average allows us
to answer each query up to accuracy τ with probability 1 − β given a data set of size n0 ≥ ln(2/β)/τ^2
so long as the queries are non-adaptively chosen. Thus, with high probability, problems only arise
between rounds of adaptivity. If we knew when these rounds of adaptivity occurred, we could
refresh our sample between each round, and obtain total sample complexity linear in the number
of rounds of adaptivity. The method we present (using (ǫ, 0)-differential privacy) lets us get a
comparable bound without knowing where the rounds of adaptivity appear. Using (ǫ, δ) privacy
would allow us to obtain constant factor improvements if the number of queries was large enough,
but does not get an asymptotically better dependence on the number of rounds r (it would allow
us to reuse the round testing set quadratically many times, but we would still potentially need to
refresh the training set after each round of adaptivity, in the worst case).
The idea is the following: we obtain r different estimation samples S1 , . . . , Sr each of size
sufficient to answer non-adaptively chosen queries to error τ /8 with probability 1 − β/3, and a
separate round detection sample Sh of size nSV (τ /8, β/3, ǫ) for ǫ = τ /16, which we access only
through a copy of SPARSE we initialize with threshold T = τ /4. As queries φi start arriving, we
compute their answers a^t_i = ES1 [φi ] using the naïve method on estimation sample S1 , which we
use as our guess of the correct value on Sh when we feed φi to SPARSE. If the answer SPARSE
returns is a^h_i = ⊥, then we know that with probability 1 − β/3, a^t_i is accurate up to tolerance
T + τ /8 = 3τ /8 with respect to Sh , and hence statistically valid up to tolerance τ /2 by Theorem 9
with probability at least 1 − 2β/3. Otherwise, we discard our estimation set S1 and continue with
estimation set S2 . We know that with probability 1 − β/3, a^h_i is accurate with respect to Sh up to
tolerance τ /8, and hence statistically valid up to tolerance τ /4 by Theorem 9 with probability at
least 1 − 2β/3. We continue in this way, discarding and incrementing our estimation set whenever
our guess gi is incorrect. This succeeds in answering every query so long as our guesses are not
incorrect more than r times in total. Finally, we know that except with probability at most mβ/3,
by the accuracy guarantee of our estimation set for non-adaptively chosen queries, the only queries
i for which our guesses gi will deviate from ESh [φi ] by more than T − τ /8 = τ /8 are those queries
that lie between rounds of adaptivity. There are at most r of these by assumption, so the algorithm
runs to completion with probability at least 1 − mβ/3. The algorithm is given in figure 1.
Algorithm EffectiveRounds

Input: A database S of size |S| ≥ 1156 r ln(12/β) / τ^2.

Initialization: Randomly split S into r + 1 sets: r sets S1 , . . . , Sr with size |Si | ≥ 4 ln(12/β)/τ^2, and one set Sh with size |Sh | = 1152 · r · ln(12/β)/τ^2. Instantiate SPARSE with input Sh and parameters T = τ/4, τ′ = τ/8, β′ = β/3, and ε = τ/16. Let c ← 1.

Query stage: For each query φi do:

1. Compute a^t_i = E_{Sc}[φi]. Let gi = a^t_i and feed (φi , gi) to SPARSE and receive answer a^h_i.
2. If a^h_i = ⊥, output ai = a^t_i.
3. Otherwise, output ai = a^h_i and let c ← c + 1.
4. If c > r, HALT.
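The following Python sketch renders EffectiveRounds in terms of the SparseSketch class from the sketch in the previous subsection; the splitting sizes and parameters are simplified relative to Figure 1:

import numpy as np

def effective_rounds(sample, queries, r, tau, seed=0):
    # Answer queries from the current estimation set, switching to a fresh estimation
    # set only when SPARSE reports that the current guess overfits the holdout.
    rng = np.random.default_rng(seed)
    data = rng.permutation(np.asarray(sample))
    est_size = len(data) // (2 * r)                     # r estimation sets from the first half
    est_sets = [data[i * est_size:(i + 1) * est_size] for i in range(r)]
    holdout = data[len(data) // 2:]                     # round-detection sample S_h
    sparse = SparseSketch(holdout, threshold=tau / 4, epsilon=tau / 16,
                          max_failures=r, seed=seed)
    answers, c = [], 0
    for phi in queries:
        guess = float(np.mean(phi(est_sets[c])))        # a_i^t = E_{S_c}[phi_i]
        noisy = sparse.answer(phi, guess)
        if noisy is None:                               # SPARSE returned ⊥: keep the guess
            answers.append(guess)
        else:                                           # overfitting detected: move to S_{c+1}
            answers.append(noisy)
            c += 1
            if c >= r:
                break                                   # estimation sets exhausted; halt
    return answers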
Corollary 21. Let τ, β > 0 and define

nSV(τ, β) = r ln(1/β) / τ^2.
There is an algorithm which on input of a data set S of size n sampled from P n accepts any sequence
of m adaptively chosen queries φ1 , . . . , φm generated with at most r rounds of adaptivity. With
probability at least 1 − mβ the algorithm runs to completion and returns estimates a1 , . . . , am for
each query. These estimates have the property that for all i ∈ [m] we have P [|P[φi ] − ai | > τ ] ≤ β
provided that n ≥ CnSV (τ, β) for sufficiently large constant C.
Remark 22. Note that the accuracy guarantee of SPARSE depends only on the number of incorrect
guesses that are actually made. Hence, EffectiveRounds does not halt until the actual number of
instances of over-fitting to the estimation samples Si is larger than r. This could be equal to the
number of rounds of adaptivity in the worst case (for example, if the analyst is running the Dinur-
Nissim reconstruction attack within each round [DN03]), but in practice might achieve a much better
bound (if the analyst is not fully adversarial).
Acknowledgements We would like to thank Sanjeev Arora, Nina Balcan, Avrim Blum, Dean
Foster, Michael Kearns, Jon Kleinberg, Sasha Rakhlin, and Jon Ullman for enlightening discussions
and helpful comments. We also thank the Simons Institute for Theoretical Computer Science at
Berkeley where part of this research was done.
References
[ANR11] Ehud Aharoni, Hani Neuvirth, and Saharon Rosset. The quality preserving database: A
computational framework for encouraging collaboration, enhancing power and control-
ling false discovery. IEEE/ACM Trans. Comput. Biology Bioinform., 8(5):1431–1437,
2011.
[AR14] Ehud Aharoni and Saharon Rosset. Generalized a-investing: definitions, optimality
results and application to public databases. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 76(4):771–794, 2014.
[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy:
the SuLQ framework. In PODS, pages 128–138, 2005.
[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526,
2002.
[BE12] C. Glenn Begley and Lee Ellis. Drug development: Raise standards for preclinical
cancer research. Nature, 483:531–533, 2012.
[BH95] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate – a practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 57:289–300, 1995.
[BH15] Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learn-
ing competitions. CoRR, abs/1502.04585, 2015.
[BNS+ 15] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer,
and Jonathan Ullman. Algorithmic stability for adaptive data analysis. CoRR,
abs/1511.02513, 2015.
[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimiza-
tion, revisited. CoRR, abs/1405.7085, 2014.
[CKL+ 06] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for
machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2006.
[DFH+ 15a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Aaron Roth. Generalization in adaptive data analysis and holdout reuse. CoRR,
abs/1506, 2015.
[DFH+ 15b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis.
Science, 349(6248):636–638, 2015.
[DKM+ 06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni
Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT,
pages 486–503, 2006.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise
to sensitivity in private data analysis. In Theory of Cryptography, pages 265–284.
Springer, 2006.
[DN03] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS,
pages 202–210. ACM, 2003.
[DN04] Cynthia Dwork and Kobbi Nissim. Privacy-preserving datamining on vertically parti-
tioned databases. In CRYPTO, pages 528–544, 2004.
[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy.
Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[Dwo11] Cynthia Dwork. A firm foundation for private data analysis. CACM, 54(1):86–95,
2011.
[FGR+ 13] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao.
Statistical algorithms and a lower bound for planted clique. In STOC, pages 655–664.
ACM, 2013.
[Fre83] David A. Freedman. A note on screening regression equations. The American Statis-
tician, 37(2):152–155, 1983.
[FS08] D. Foster and R. Stine. Alpha-investing: A procedure for sequential control of ex-
pected false discoveries. J. Royal Statistical Soc.: Series B (Statistical Methodology),
70(2):429–444, 2008.
[GL14] Andrew Gelman and Eric Loken. The statistical crisis in science. American Scientist,
102(6):460, 2014.
[HR10] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-
preserving data analysis. In 51st IEEE FOCS 2010, pages 61–70, 2010.
[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statis-
tical Learning: Data Mining, Inference, and Prediction. Springer series in statistics.
Springer, 2009.
[HU14] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data
analysis is hard. In FOCS, pages 454–463, 2014.
[Ioa05a] John P. A. Ioannidis. Contradicted and initially stronger effects in highly cited clinical
research. Journal of the American Medical Association, 294(2):218–228, 2005.
[Ioa05b] John P. A. Ioannidis. Why Most Published Research Findings Are False. PLoS
Medicine, 2(8):e124, August 2005.
[Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of
the ACM (JACM), 45(6):983–1006, 1998.
[MNPR06] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory:
stability is sufficient for generalization and necessary and sufficient for consistency of
empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–
193, 2006.
[PRMN04] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions
for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
[PSA11] Florian Prinz, Thomas Schlange, and Khusru Asadullah. Believe it or not: how much
can we rely on published data on potential drug targets? Nature Reviews Drug Dis-
covery, 10(9):712–712, 2011.
[Reu03] Juha Reunanen. Overfitting in making comparisons between variable selection methods.
Journal of Machine Learning Research, 3:1371–1382, 2003.
[RF08] R. Bharat Rao and Glenn Fung. On the dangers of cross-validation. An experimental
evaluation. In International Conference on Data Mining, pages 588–596. SIAM, 2008.
[RR10] Aaron Roth and Tim Roughgarden. Interactive privacy via the median mechanism. In
42nd ACM STOC, pages 765–774. ACM, 2010.
[RZ15] Daniel Russo and James Zou. Controlling bias in adaptive data analysis using infor-
mation theory. CoRR, abs/1511.05219, 2015.
[SNS11] Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn. False-positive psychology:
Undisclosed flexibility in data collection and analysis allows presenting anything as
significant. Psychological Science, 22(11):1359–1366, 2011.
[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From
Theory to Algorithms. Cambridge University Press, 2014.
[SSSSS10] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learn-
ability, stability and uniform convergence. The Journal of Machine Learning Research,
11:2635–2670, 2010.
[SU14] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hard-
ness of preventing false discovery. arXiv preprint arXiv:1410.1228, 2014.
[TT15] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference.
Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
[WLF15] Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. Learning with differential pri-
vacy: Stability, learnability and the sufficiency and necessity of ERM principle. CoRR,
abs/1502.06309, 2015.
The analyst attempts to solve the problem using the following simple but adaptive strategy:
1. For i = 1, . . . , d, determine s_i = sign( Σ_{x∈D} x_i ).
2. Let ũ = (1/√d) · (s_1 , . . . , s_d ).
Intuitively, this natural approach first determines for each attribute whether it is positively or
negatively correlated. It then aggregates this information across all d attributes into a single linear
model.
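To see the effect numerically, the following simulation sketch follows this strategy on data with no linear structure. Here we take the empirical objective to be f̃_D(u) = (1/n) Σ_{x∈D} ⟨u, x⟩, as in the proof of the lemma below, and the sample size and dimension are arbitrary choices for the example:

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10_000
D = rng.standard_normal((n, d))        # n samples from N(0,1)^d: no linear structure at all

s = np.sign(D.sum(axis=0))             # step 1: per-attribute signs of the empirical sums
u_tilde = s / np.sqrt(d)               # step 2: aggregate the signs into a single unit vector

empirical_obj = float(np.mean(D @ u_tilde))          # f~_D(u~) evaluated on the same data
print(empirical_obj, np.sqrt(2 * d / (np.pi * n)))   # both are ~8, although f(u~) = 0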
The next lemma shows that this adaptive strategy has a terrible generalization performance
(if d is large). Specifically, we show that even if there is no linear structure whatsoever in the
underlying distribution (namely it is normally distributed), the analyst’s strategy falsely discovers
a linear model with large objective value.
Lemma 23. Suppose D = N(0, 1)^d. Then every unit vector u ∈ R^d satisfies f(u) = 0. However,
E_D[f̃_D(ũ)] = √(2/π) · √(d/n).
Proof. The first claim follows because ⟨u, x⟩ for x ∼ N(0, 1)^d is distributed like a Gaussian random
variable N(0, 1). Let us now analyze the objective value of ũ.

f̃_D(ũ) = Σ_{i=1}^d (s_i/√d) · (1/n) Σ_{x∈D} x_i = (1/√d) Σ_{i=1}^d | (1/n) Σ_{x∈D} x_i |.

Hence,

E_D[f̃_D(ũ)] = (1/√d) Σ_{i=1}^d E_D[ | (1/n) Σ_{x∈D} x_i | ].

Now, (1/n) Σ_{x∈D} x_i is distributed like a Gaussian random variable g ∼ N(0, 1/n), since each x_i is
a standard Gaussian. It follows that

E_D[f̃_D(ũ)] = √( 2d / (πn) ).
Note that all the operations performed by the analyst are based on empirical averages of real-
valued functions. To determine the bias, the function is just φi (x) = xi and to determine the final
correlation it is ψ(x) = ⟨u, x⟩. These functions are not bounded to the range [0, 1] as required by
the formal definition of our model. However, it is easy to see that this is a minor issue. Note that
both x_i and ⟨u, x⟩ are distributed according to N(0, 1) whenever x ∼ N(0, 1)^d . This implies that
for every query function φ we used, P[|φ(x)| ≥ B] ≤ 1/poly(n, d) for some B = O(log(dn)). We
can therefore truncate and rescale each query as φ′ (x) = PB (φ(x))/(2B) + 1/2, where PB is the
truncation of the values outside [−B, B]. This ensures that the range of φ′ (x) is [0, 1]. It is easy to
verify that using these [0, 1]-valued queries does not affect the analysis in any significant way (aside
from scaling by a logarithmic factor) and we obtain overfitting in the same way as before (for large
enough d).
Differential privacy also degrades gracefully under composition. It is easy to see that the
independent use of an (ε1 , 0)-differentially private algorithm and an (ε2 , 0)-differentially private
algorithm, when taken together, is (ε1 + ε2 , 0)-differentially private. More generally, we have
Theorem 25. Let A_i : X^n → R_i be an (ε_i, δ_i)-differentially private algorithm for i ∈ [k]. Then if
A_[k] : X^n → Π_{i=1}^k R_i is defined to be A_[k](S) = (A_1(S), . . . , A_k(S)), then A_[k] is (Σ_{i=1}^k ε_i, Σ_{i=1}^k δ_i)-differentially private.
Theorem 26. For all ε, δ, δ′ ≥ 0, the composition of k arbitrary (ε, δ)-differentially private mechanisms
is (ε′, kδ + δ′)-differentially private, where ε′ = √(2k ln(1/δ′)) · ε + kε(e^ε − 1).
Theorems 25 and 26 are very general. For example, they apply to queries posed to overlapping,
but not identical, data sets. Nonetheless, data utility will eventually be consumed: the Fundamental
Law of Information Recovery states that overly accurate answers to too many questions will destroy
privacy in a spectacular way (see [DN03] et sequelae). The goal of algorithmic research on differential
privacy is to stretch a given privacy “budget” of, say, ε0 , to provide as much utility as possible, for
example, to provide useful answers to a great many counting queries. The bounds afforded by the
composition theorems are the first, not the last, word on utility.
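As a quick numerical comparison of the two composition regimes (using the bound stated in Theorem 26), the following sketch computes the resulting privacy parameters for many small steps; the numbers are purely illustrative:

import numpy as np

def basic_composition(eps, k):
    return k * eps

def advanced_composition(eps, k, delta_prime):
    # Theorem 26: eps' = sqrt(2 k ln(1/delta')) * eps + k * eps * (e^eps - 1)
    return np.sqrt(2 * k * np.log(1 / delta_prime)) * eps + k * eps * (np.exp(eps) - 1)

eps, k, delta_prime = 0.01, 10_000, 1e-6
print(basic_composition(eps, k))                   # 100.0
print(advanced_composition(eps, k, delta_prime))   # ~6.3: far smaller for many small steps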
Lemma 27 (Chernoff’s bound). Let Y1 , Y2 , . . . , Yn be i.i.d. Bernoulli random variables with expec-
tation p > 0. Then for every γ > 0,
P[ Σ_{i∈[n]} Y_i ≥ (1 + γ)np ] ≤ exp( −np((1 + γ) ln(1 + γ) − γ) ).
P[ f(X_1, . . . , X_n) − μ ≥ α ] ≤ exp( −2α^2 / (n · c^2) ).
C.2 Moment Bounds
Lemma 29. Let Y_1, Y_2, . . . , Y_n be i.i.d. Bernoulli random variables with expectation p. We denote
by Mk[B(n, p)] = E[ ((1/n) Σ_{i∈[n]} Y_i)^k ]. Let X_1, X_2, . . . , X_n be i.i.d. random variables with values in
[0, 1] and expectation p. Then for every k > 0,

E[ ((1/n) Σ_{i∈[n]} X_i)^k ] ≤ Mk[B(n, p)].
Proof. We use I to denote a k-tuple of indices (i_1, . . . , i_k) ∈ [n]^k (not necessarily distinct). For
such an I we denote by {ℓ_1, . . . , ℓ_{k′}} the set of distinct indices in I and let k_1, . . . , k_{k′} denote their
multiplicities. Note that Σ_{j∈[k′]} k_j = k. We first observe that

E[ ((1/n) Σ_{i∈[n]} X_i)^k ] = E_{I∼[n]^k}[ E[ Π_{j∈[k]} X_{i_j} ] ] = E_{I∼[n]^k}[ E[ Π_{j∈[k′]} X_{ℓ_j}^{k_j} ] ] = E_{I∼[n]^k}[ Π_{j∈[k′]} E[ X_{ℓ_j}^{k_j} ] ],    (11)

where the last equality follows from independence of the X_i's. For every j, the range of X_{ℓ_j} is [0, 1]
and thus

E[ X_{ℓ_j}^{k_j} ] ≤ E[ X_{ℓ_j} ] = p.

Moreover, the value p is achieved when X_{ℓ_j} is Bernoulli with expectation p. That is,

E[ X_{ℓ_j}^{k_j} ] ≤ E[ Y_{ℓ_j}^{k_j} ],
Proof. Let U denote (1/n) Σ_{i∈[n]} X_i, where the X_i's are i.i.d. Bernoulli random variables with expectation
p > 0 (the claim is obviously true if p = 0). Then

E[U^k] ≤ p^k + ∫_{p^k}^{1} P[U^k ≥ t] dt.    (12)
We substitute t = (1 + γ)^k p^k and observe that Lemma 27 gives:
P[U^k ≥ t] = P[U^k ≥ ((1 + γ)p)^k] = P[U ≥ (1 + γ)p] ≤ exp( −np((1 + γ) ln(1 + γ) − γ) ).
Using this substitution in eq. (12), together with dt/dγ = k(1 + γ)^{k−1} · p^k, we obtain

E[U^k] ≤ p^k + ∫_0^{1/p−1} exp( −np((1 + γ) ln(1 + γ) − γ) ) · k(1 + γ)^{k−1} p^k dγ
  = p^k + p^k k ∫_0^{1/p−1} (1/(1 + γ)) · exp( k ln(1 + γ) − np((1 + γ) ln(1 + γ) − γ) ) dγ
  ≤ p^k + p^k k · max_{γ∈[0,1/p−1]} { exp( k ln(1 + γ) − np((1 + γ) ln(1 + γ) − γ) ) } · ∫_0^{1/p−1} (1/(1 + γ)) dγ
  = p^k + p^k k ln(1/p) · max_{γ∈[0,1/p−1]} { exp( k ln(1 + γ) − np((1 + γ) ln(1 + γ) − γ) ) }.    (13)
We now find the maximum of g(γ) := k ln(1 + γ) − np((1 + γ) ln(1 + γ) − γ). Differentiating
the expression we get k/(1 + γ) − np ln(1 + γ), and therefore the function attains its maximum at the
(single) point γ_0 which satisfies (1 + γ_0) ln(1 + γ_0) = k/(np). This implies that ln(1 + γ_0) ≤ ln(k/(np)).
Now we observe that (1 + γ) ln(1 + γ) − γ is always non-negative and therefore g(γ_0) ≤ k ln(k/(np)).
Substituting this into eq. (13) we conclude that

E[U^k] ≤ p^k + p^k k ln(1/p) · exp( k ln(k/(np)) ) = p^k + k ln(1/p) · (k/n)^k.

Finally, we observe that if p ≥ 1/n then clearly ln(1/p) ≤ ln n and the claim holds. For any p < 1/n
we use monotonicity of Mk[B(n, p)] in p and upper bound the probability by the bound for p = 1/n,
which equals

(1/n)^k + (k ln n) · (k/n)^k ≤ (k ln n + 1) · (k/n)^k.
Lemma 31. Let n > k > 0, ε > 0, p > 0, δ ≥ 0 and let V be a non-negative random variable that
satisfies E[V^k] ≤ e^{εk} Mk[B(n, p)] + δ. Then for any τ ∈ [0, 1/3], β ∈ (0, 2/3], if
• ε ≤ τ/2,
• k ≥ max{4 ln(2/β)/τ, 2 log log n}, and
• n ≥ 3k/τ,
then P[V ≥ p + τ] ≤ β + δ/(p + τ)^k.
Using Lemma 30 we obtain that

P[V ≥ p + τ] ≤ ( p^k + (k ln n + 1) · (k/n)^k ) / ( e^{−εk} p^k (1 + τ/p)^k ) + δ/(p + τ)^k
  = ( 1 + (k ln n + 1) · (k/(pn))^k ) / ( (e^{−ε}(1 + τ/p))^k ) + δ/(p + τ)^k.    (14)

Together with the condition k ≥ max{4 ln(2/β)/τ, 2 log log n}, we have

2^k ≥ (2/β) · (k ln n + 1),    (15)

since k/2 ≥ log log n holds by assumption and for k ≥ 12 ln(2/β), k/6 ≥ log(2/β) and k/3 ≥
log(k + 1) (whenever β < 2/3). Therefore we get

(e^{−ε}(1 + τ/p))^k ≥ (e^{−ε} τ/p)^k ≥ 2^k · (k/(pn))^k ≥ (2/β) · (k ln n + 1) · (k/(pn))^k.    (16)

Combining eq. (15) and (16) we obtain that

( 1 + (k ln n + 1) · (k/(pn))^k ) / ( (e^{−ε}(1 + τ/p))^k ) ≤ β/2 + β/2 = β.