Unlabeled data: Now it helps, now it doesn’t

Aarti Singh, Robert D. Nowak∗
Department of Electrical and Computer Engineering
University of Wisconsin - Madison
Madison, WI 53706
{singh@cae,nowak@engr}.wisc.edu

Xiaojin Zhu†
Department of Computer Sciences
University of Wisconsin - Madison
Madison, WI 53706
[email protected]

Abstract
Empirical evidence shows that in favorable situations semi-supervised learning
(SSL) algorithms can capitalize on the abundance of unlabeled training data to
improve the performance of a learning task, in the sense that fewer labeled train-
ing data are needed to achieve a target error bound. However, in other situations
unlabeled data do not seem to help. Recent attempts at theoretically characterizing
SSL gains provide only partial and sometimes apparently conflicting explanations
of whether, and to what extent, unlabeled data can help. In this paper,
we attempt to bridge the gap between the practice and theory of semi-supervised
learning. We develop a finite sample analysis that characterizes the value of un-
labeled data and quantifies the performance improvement of SSL compared to
supervised learning. We show that there are large classes of problems for which
SSL can significantly outperform supervised learning, in finite sample regimes
and sometimes also in terms of error convergence rates.

1 Introduction
Labeled data can be expensive, time-consuming and difficult to obtain in many applications. Semi-
supervised learning (SSL) aims to capitalize on the abundance of unlabeled data to improve learning
performance. Empirical evidence suggests that in certain favorable situations unlabeled data can
help, while in other situations it does not. As a result, there have been several recent attempts
[1, 2, 3, 4, 5, 6] at developing a theoretical understanding of semi-supervised learning. It is well-
accepted that unlabeled data can help only if there exists a link between the marginal data distribution
and the target function to be learnt. Two common types of links considered are the cluster assump-
tion [7, 3, 4] which states that the target function is locally smooth over subsets of the feature space
delineated by some property of the marginal density (but may not be globally smooth), and the man-
ifold assumption [4, 6] which assumes that the target function lies on a low-dimensional manifold.
Knowledge of these sets, which can be gleaned from unlabeled data, simplifies the learning task.
However, recent attempts at characterizing the amount of improvement possible under these links
provide only partial and sometimes apparently conflicting (for example, [4] vs. [6]) explanations
of whether or not, and to what extent, semi-supervised learning helps. In this paper, we bridge the
gap between these seemingly conflicting views and develop a minimax framework based on finite
sample bounds to identify situations in which unlabeled data help to improve learning. Our results
quantify both the amount of improvement possible using SSL and the relative value of
unlabeled data.
We focus on learning under a cluster assumption that is formalized in the next section, and establish that there exist nonparametric classes of distributions, denoted $\mathcal{P}_{XY}$, for which the decision sets (over which the target function is smooth) are discernable from unlabeled data. Moreover, we show that there exist clairvoyant supervised learners that, given perfect knowledge of the decision sets denoted by $\mathcal{D}$, can significantly outperform any generic supervised learner $f_n$ in these classes.

∗Supported in part by the NSF grants CCF-0353079, CCF-0350213, and CNS-0519824.

†Supported in part by the Wisconsin Alumni Research Foundation.


Figure 1: (a) Two separated high density sets with different labels that (b) cannot be discerned if the
sample size is too small, but (c) can be estimated if sample density is high enough.

That is, if $R$ denotes a risk of interest, $n$ denotes the labeled data sample size, $\hat{f}_{D,n}$ denotes the clairvoyant supervised learner, and $E$ denotes expectation with respect to training data, then
$$\sup_{\mathcal{P}_{XY}} E[R(\hat{f}_{D,n})] < \inf_{f_n} \sup_{\mathcal{P}_{XY}} E[R(f_n)].$$
Based on this, we establish that there also exist semi-supervised learners, denoted $\hat{f}_{m,n}$, that use $m$ unlabeled examples in addition to the $n$ labeled examples in order to estimate the decision sets, and which perform as well as $\hat{f}_{D,n}$, provided that $m$ grows appropriately relative to $n$. Specifically, if the error bound for $\hat{f}_{D,n}$ decays polynomially (exponentially) in $n$, then the number of unlabeled data $m$ needs to grow polynomially (exponentially) with the number of labeled data $n$. We provide general results for a broad range of learning problems using finite sample error bounds. Then we examine a concrete instantiation of these general results in the regression setting by deriving minimax lower bounds on the performance of any supervised learner and comparing them to upper bounds on the errors of $\hat{f}_{D,n}$ and $\hat{f}_{m,n}$.
In their seminal papers, Castelli and Cover [8, 9] suggested that in the classification setting the
marginal distribution can be viewed as a mixture of class conditional distributions. If this mixture is
identifiable, then the classification problem may reduce to a simple hypothesis testing problem for
which the error converges exponentially fast in the number of labeled examples. The ideas in this
paper are similar, except that we do not require identifiability of the mixture component densities,
and show that it suffices to only approximately learn the decision sets over which the label is smooth.
More recent attempts at theoretically characterizing SSL have been relatively pessimistic. Rigollet
[3] establishes that for a fixed collection of distributions satisfying a cluster assumption, unlabeled
data do not provide an improvement in convergence rate. A similar argument was made by Lafferty
and Wasserman [4], based on the work of Bickel and Li [10], for the manifold case. However, in
a recent paper, Niyogi [6] gives a constructive example of a class of distributions supported on a
manifold whose complexity increases with the number of labeled examples, and he shows that the
error of any supervised learner is bounded from below by a constant, whereas there exists a semi-
supervised learner that can provide an error bound of $O(n^{-1/2})$, assuming infinite unlabeled data.
In this paper, we bridge the gap between these seemingly conflicting views. Our arguments can
be understood by the simple example shown in Fig. 1, where the distribution is supported on two
component sets separated by a margin γ and the target function is smooth over each component.
Given a finite sample of data, these decision sets may or may not be discernable depending on the
sampling density (see Fig. 1(b), (c)). If γ is fixed (this is similar to fixing the class of cluster-based
distributions in [3] or the manifold in [4, 10]), then given enough labeled data a supervised learner
can achieve optimal performance (since, eventually, it operates in regime (c) of Fig. 1). Thus, in this
example, there is no improvement due to unlabeled data in terms of the rate of error convergence for
a fixed collection of distributions. However, since the true separation between the component sets
is unknown, given a finite sample of data, there always exists a distribution for which these sets are
indiscernible (e.g., γ → 0). This perspective is similar in spirit to the argument in [6]. We claim
that meaningful characterizations of SSL performance and quantifications of the value of unlabeled
data require finite sample error bounds, and that rates of convergence and asymptotic analysis may
not capture the distinctions between SSL and supervised learning. Simply stated, if the component
density sets are discernable from a finite sample size m of unlabeled data but not from a finite sample
size n < m of labeled data, then SSL can provide better performance than supervised learning. We
also show that there are certain plausible situations in which SSL yields rates of convergence that
cannot be achieved by any supervised learner.

[Figure 2: two panels for $d = 2$, plotting the boundary functions $g_k^{(1)}(x_1)$ and $g_k^{(2)}(x_1)$ over $(x_1, x_2)$, for the cases $\gamma$ positive (left) and $\gamma$ negative (right).]

Figure 2: Margin γ measures the minimum width of a decision set or separation between the support
sets of the component marginal mixture densities. The margin is positive if the component support
sets are disjoint, and negative otherwise.

2 Characterization of model distributions under the cluster assumption


Based on the cluster assumption [7, 3, 4], we define the following collection of joint distributions $\mathcal{P}_{XY}(\gamma) = \mathcal{P}_X \times \mathcal{P}_{Y|X}$ indexed by a margin parameter $\gamma$. Let $X, Y$ be bounded random variables with marginal distribution $P_X \in \mathcal{P}_X$ and conditional label distribution $P_{Y|X} \in \mathcal{P}_{Y|X}$, supported on the domain $\mathcal{X} = [0,1]^d$.
The marginal density $p(x) = \sum_{k=1}^{K} a_k p_k(x)$ is a mixture of a finite, but unknown, number of component densities $\{p_k\}_{k=1}^{K}$, where $K < \infty$. The unknown mixing proportions satisfy $a_k \geq a > 0$ and $\sum_{k=1}^{K} a_k = 1$. In addition, we place the following assumptions on the mixture component densities:
1. $p_k$ is supported on a unique compact, connected set $C_k \subseteq \mathcal{X}$ with Lipschitz boundaries. Specifically, we assume the following form for the component support sets (see Fig. 2 for a $d = 2$ illustration):
$$C_k = \{x \equiv (x_1, \ldots, x_d) \in \mathcal{X} : g_k^{(1)}(x_1, \ldots, x_{d-1}) \leq x_d \leq g_k^{(2)}(x_1, \ldots, x_{d-1})\},$$
where $g_k^{(1)}(\cdot), g_k^{(2)}(\cdot)$ are $(d-1)$-dimensional Lipschitz functions with Lipschitz constant $L$.¹
2. $p_k$ is bounded from above and below: $0 < b \leq p_k \leq B$.
3. $p_k$ is Hölder-$\alpha$ smooth on $C_k$ with Hölder constant $K_1$ [12, 13].
Let the conditional label density on $C_k$ be denoted by $p_k(Y|X = x)$. Thus, a labeled training point $(X, Y)$ is obtained as follows. With probability $a_k$, $X$ is drawn from $p_k$ and $Y$ is drawn from $p_k(Y|X = x)$. In the supervised setting, we assume access to $n$ labeled data $\mathcal{L} = \{X_i, Y_i\}_{i=1}^{n}$ drawn i.i.d. according to $P_{XY} \in \mathcal{P}_{XY}(\gamma)$, and in the semi-supervised setting, we assume access to $m$ additional unlabeled data $\mathcal{U} = \{X_i\}_{i=1}^{m}$ drawn i.i.d. according to $P_X \in \mathcal{P}_X$.

Let $\mathcal{D}$ denote the collection of all non-empty sets obtained as intersections of $\{C_k\}_{k=1}^{K}$ or their complements $\{C_k^c\}_{k=1}^{K}$, excluding the set $\cap_{k=1}^{K} C_k^c$, which does not lie in the support of the marginal density. Observe that $|\mathcal{D}| \leq 2^K$, and in practical situations the cardinality of $\mathcal{D}$ is much smaller as only a few of the sets are non-empty. The cluster assumption is that the target function will be smooth on each set $D \in \mathcal{D}$, hence the sets in $\mathcal{D}$ are called decision sets. At this point, we do not consider a specific target function.
The collection $\mathcal{P}_{XY}$ is indexed by a margin parameter $\gamma$, which denotes the minimum width of a decision set or separation between the component support sets $C_k$. The margin $\gamma$ is assigned a positive sign if there is no overlap between components, otherwise it is assigned a negative sign, as illustrated in Figure 2. Formally, for $j, k \in \{1, \ldots, K\}$, let
$$d_{jk} := \min_{p,q \in \{1,2\}} \|g_j^{(p)} - g_k^{(q)}\|_\infty \;\;(j \neq k), \qquad d_{kk} := \|g_k^{(1)} - g_k^{(2)}\|_\infty.$$
Then the margin is defined as
$$\gamma = \sigma \cdot \min_{j,k \in \{1,\ldots,K\}} d_{jk}, \quad \text{where} \quad \sigma = \begin{cases} 1 & \text{if } C_j \cap C_k = \emptyset \;\; \forall j \neq k \\ -1 & \text{otherwise.} \end{cases}$$
¹This form is a slight generalization of the boundary fragment class of sets, which is used as a common tool for the analysis of learning problems [11]. Boundary fragment sets capture the salient characteristics of more general decision sets since, locally, the boundaries of general sets are like fragments in a certain orientation.
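
As a concrete illustration of the margin definition above, the sketch below computes $\gamma$ from boundary functions evaluated on a shared grid; the grid-based approximation of the sup-norm and the helper name `margin` are assumptions made here for illustration, not part of the paper.

```python
import numpy as np
from itertools import combinations

def margin(g1, g2, supports_disjoint):
    """Compute gamma = sigma * min_{j,k} d_{jk} from boundary functions.

    g1[k] and g2[k] are arrays of g_k^(1) and g_k^(2) evaluated on a shared
    grid of (x_1, ..., x_{d-1}) values; sup-norms are approximated by maxima
    over that grid.  sigma is +1 if all component supports are disjoint.
    """
    K = len(g1)
    d_vals = []
    for k in range(K):
        d_vals.append(np.max(np.abs(g2[k] - g1[k])))      # d_kk: width of component k
    for j, k in combinations(range(K), 2):
        d_vals.append(min(np.max(np.abs(gj - gk))          # d_jk: closest pair of boundaries
                          for gj in (g1[j], g2[j])
                          for gk in (g1[k], g2[k])))
    sigma = 1.0 if supports_disjoint else -1.0
    return sigma * min(d_vals)
```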

3 Learning Decision Sets
Ideally, we would like to break a given learning task into separate subproblems on each $D \in \mathcal{D}$, since the target function is smooth on each decision set. Note that the marginal density $p$ is also smooth within each decision set, but exhibits jumps at the boundaries since the component densities are bounded away from zero. Hence, the collection $\mathcal{D}$ can be learnt from unlabeled data as follows:
1) Marginal density estimation — The procedure is based on the sup-norm kernel density estimator proposed in [14]. Consider a uniform square grid over the domain $\mathcal{X} = [0,1]^d$ with spacing $2h_m$, where $h_m = \kappa_0 ((\log m)^2/m)^{1/d}$ and $\kappa_0 > 0$ is a constant. For any point $x \in \mathcal{X}$, let $[x]$ denote the closest point on the grid. Let $G$ denote the kernel and $H_m = h_m I$; then the estimator of $p(x)$ is
$$\hat{p}(x) = \frac{1}{m h_m^d} \sum_{i=1}^{m} G(H_m^{-1}(X_i - [x])).$$
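
A minimal Python sketch of this estimator, assuming a boxcar kernel (any kernel satisfying the conditions of Theorem 1 in Section 7.1 would do) and treating $\kappa_0$ as a user-chosen constant:

```python
import numpy as np

def grid_kde(X_unlabeled, kappa0=1.0):
    """Sup-norm-style kernel density estimator on a uniform grid (Section 3, step 1).

    Returns p_hat(x) and the bandwidth h_m.  The boxcar kernel below is an
    illustrative choice; it is supported on [-1,1]^d, bounded, integrates to 1,
    and has vanishing first moments.
    """
    X_unlabeled = np.asarray(X_unlabeled, dtype=float)
    m, d = X_unlabeled.shape
    h_m = kappa0 * ((np.log(m) ** 2) / m) ** (1.0 / d)  # h_m = kappa_0 ((log m)^2 / m)^(1/d)

    def G(u):
        # boxcar kernel on [-1,1]^d with integral 1
        return np.all(np.abs(u) <= 1.0, axis=-1) / (2.0 ** d)

    def p_hat(x):
        x = np.asarray(x, dtype=float)
        x_grid = 2 * h_m * np.round(x / (2 * h_m))      # [x]: nearest grid point, spacing 2 h_m
        weights = G((X_unlabeled - x_grid) / h_m)       # G(H_m^{-1}(X_i - [x])) with H_m = h_m I
        return weights.sum() / (m * h_m ** d)

    return p_hat, h_m
```

In practice `p_hat` would then be evaluated at the unlabeled, labeled, and test points as required by the decision set estimation step below.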
2) Decision set estimation — Two points $x_1, x_2 \in \mathcal{X}$ are said to be connected, denoted by $x_1 \leftrightarrow x_2$, if there exists a sequence of points $x_1 = z_1, z_2, \ldots, z_{l-1}, z_l = x_2$ such that $z_2, \ldots, z_{l-1} \in \mathcal{U}$, $\|z_j - z_{j+1}\| \leq 2\sqrt{d}\, h_m$, and for all points that satisfy $\|z_i - z_j\| \leq h_m \log m$, $|\hat{p}(z_i) - \hat{p}(z_j)| \leq \delta_m := (\log m)^{-1/3}$. That is, there exists a sequence of $2\sqrt{d}\, h_m$-dense unlabeled data points between $x_1$ and $x_2$ such that the marginal density varies smoothly along the sequence. All points that are pairwise connected specify an empirical decision set. This decision set estimation procedure is similar in spirit to the semi-supervised learning algorithm proposed in [15]. In practice, these sequences only need to be evaluated for the test and labeled training points.
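
Read as a graph construction, the connectivity rule can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the density condition is enforced only on directly linked pairs rather than along whole sequences, and the helper reuses `grid_kde` from the previous sketch.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def empirical_decision_sets(X_unlabeled, p_hat, h_m):
    """Label each unlabeled point with the index of its empirical decision set.

    Two points are linked if they are within 2*sqrt(d)*h_m of each other and
    their estimated densities differ by at most delta_m = (log m)^(-1/3);
    empirical decision sets are then the connected components of this graph.
    """
    X_unlabeled = np.asarray(X_unlabeled, dtype=float)
    m, d = X_unlabeled.shape
    delta_m = np.log(m) ** (-1.0 / 3)
    dens = np.array([p_hat(x) for x in X_unlabeled])

    dist = cdist(X_unlabeled, X_unlabeled)
    close = dist <= 2.0 * np.sqrt(d) * h_m                     # 2*sqrt(d)*h_m-dense steps
    smooth = np.abs(dens[:, None] - dens[None, :]) <= delta_m  # density varies slowly
    adjacency = close & smooth

    n_sets, labels = connected_components(csr_matrix(adjacency), directed=False)
    return labels
```

Labeled training points and test points can be assigned to an empirical decision set by appending them to the point set before building the graph, matching the remark that the sequences only need to be evaluated for those points.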
The following lemma shows that if the margin is large relative to the average spacing $m^{-1/d}$ between
unlabeled data points, then with high probability, two points are connected if and only if they lie in
the same decision set D ∈ D, provided the points are not too close to the decision boundaries. The
proof sketch of the lemma and all other results are deferred to Section 7.
Lemma 1. Let $\partial D$ denote the boundary of $D$ and define the set of boundary points as
$$B = \{x : \inf_{z \in \cup_{D \in \mathcal{D}} \partial D} \|x - z\| \leq 2\sqrt{d}\, h_m\}.$$
If $|\gamma| > C_o (m/(\log m)^2)^{-1/d}$, where $C_o = 6\sqrt{d}\,\kappa_0$, then for all $p \in \mathcal{P}_X$, all pairs of points $x_1, x_2 \in \mathrm{supp}(p) \setminus B$ and all $D \in \mathcal{D}$, with probability $> 1 - 1/m$,
$$x_1, x_2 \in D \;\;\text{if and only if}\;\; x_1 \leftrightarrow x_2,$$
for large enough $m \geq m_0$, where $m_0$ depends only on the fixed parameters of the class $\mathcal{P}_{XY}(\gamma)$.

4 SSL Performance and the Value of Unlabeled Data


We now state our main result that characterizes the performance of SSL relative to supervised learning and follows as a corollary to the lemma stated above. Let $R$ denote a risk of interest and $\mathcal{E}(\hat{f}) = R(\hat{f}) - R^*$, where $R^*$ is the infimum risk over all possible learners.
Corollary 1. Assume that the excess risk $\mathcal{E}$ is bounded. Suppose there exists a clairvoyant supervised learner $\hat{f}_{D,n}$, with perfect knowledge of the decision sets $\mathcal{D}$, for which the following finite sample upper bound holds:
$$\sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(\hat{f}_{D,n})] \leq \epsilon_2(n).$$
Then there exists a semi-supervised learner $\hat{f}_{m,n}$ such that if $|\gamma| > C_o (m/(\log m)^2)^{-1/d}$,
$$\sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(\hat{f}_{m,n})] \leq \epsilon_2(n) + O\!\left(\frac{1}{m} + n\left(\frac{m}{(\log m)^2}\right)^{-1/d}\right).$$

This result captures the essence of the relative characterization of semi-supervised and supervised learning for the margin based model distributions. It suggests that if the sets $\mathcal{D}$ are discernable using unlabeled data (the margin is large enough compared to the average spacing between unlabeled data points), then there exists a semi-supervised learner that can perform as well as a supervised learner with clairvoyant knowledge of the decision sets, provided $m \gg n$ so that $(n/\epsilon_2(n))^d = O(m/(\log m)^2)$, implying that the additional term in the performance bound for $\hat{f}_{m,n}$ is negligible compared to $\epsilon_2(n)$. This indicates that if $\epsilon_2(n)$ decays polynomially (exponentially) in $n$, then $m$ needs to grow polynomially (exponentially) in $n$.
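
For a rough numerical sense of this requirement, the sketch below uses the illustrative polynomial rate $\epsilon_2(n) = n^{-2\alpha/(2\alpha+d)}$ (the clairvoyant regression rate of Section 5) and searches for the smallest $m$ with $m/(\log m)^2 \geq (n/\epsilon_2(n))^d$, with all constants set to 1 for simplicity.

```python
import numpy as np

def unlabeled_needed(n, d=2, alpha=1.0):
    """Smallest m (constants set to 1) with m/(log m)^2 >= (n/eps2(n))^d,
    where eps2(n) = n^(-2*alpha/(2*alpha+d)) is used purely for illustration."""
    eps2 = n ** (-2.0 * alpha / (2.0 * alpha + d))
    target = (n / eps2) ** d            # = n^(d*(4*alpha+d)/(2*alpha+d))
    m = max(int(target), 3)
    while m / np.log(m) ** 2 < target:  # crude geometric search, adequate for illustration
        m = int(1.1 * m) + 1
    return m

for n in (50, 100, 200):
    print(n, unlabeled_needed(n))       # m grows polynomially in n, as noted above
```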
Further, suppose that the following finite sample lower bound holds for any supervised learner:
$$\inf_{f_n} \sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(f_n)] \geq \epsilon_1(n).$$
If $\epsilon_2(n) < \epsilon_1(n)$, then there exists a clairvoyant supervised learner with perfect knowledge of the decision sets that outperforms any supervised learner that does not have this knowledge. Hence, Corollary 1 implies that SSL can provide better performance than any supervised learner provided (i) $m \gg n$ so that $(n/\epsilon_2(n))^d = O(m/(\log m)^2)$, and (ii) knowledge of the decision sets simplifies the supervised learning task, so that $\epsilon_2(n) < \epsilon_1(n)$. In the next section, we provide a concrete application of this result in the regression setting. As a simple example in the binary classification setting, if $p(x)$ is supported on two disjoint sets and if $P(Y = 1|X = x)$ is strictly greater than $1/2$ on one set and strictly less than $1/2$ on the other, then perfect knowledge of the decision sets reduces the problem to a hypothesis testing problem for which $\epsilon_2(n) = O(e^{-\zeta n})$, for some constant $\zeta > 0$. However, if $\gamma$ is small relative to the average spacing $n^{-1/d}$ between labeled data points, then $\epsilon_1(n) = c n^{-1/d}$, where $c > 0$ is a constant. This lower bound follows from the minimax lower bound proofs for regression in the next section. Thus, an exponential improvement is possible using semi-supervised learning provided $m$ grows exponentially in $n$.

5 Density-adaptive Regression
Let $Y$ denote a continuous and bounded random variable. Under squared error loss, the target function is $f(x) = E[Y|X = x]$, and $\mathcal{E}(\hat{f}) = E[(\hat{f}(X) - f(X))^2]$. Recall that $p_k(Y|X = x)$ is the conditional density on the $k$-th component and let $E_k$ denote expectation with respect to the corresponding conditional distribution. The regression function on each component is $f_k(x) = E_k[Y|X = x]$ and we assume that for $k = 1, \ldots, K$:
1. $f_k$ is uniformly bounded, $|f_k| \leq M$.
2. $f_k$ is Hölder-$\alpha$ smooth on $C_k$ with Hölder constant $K_2$.
This implies that the overall regression function $f(x)$ is piecewise Hölder-$\alpha$ smooth; i.e., it is Hölder-$\alpha$ smooth on each $D \in \mathcal{D}$, except possibly at the component boundaries.² Since a Hölder-$\alpha$ smooth function can be locally well-approximated by a Taylor polynomial, we propose the following semi-supervised learner that performs local polynomial fits within each empirical decision set, that is, using training data that are connected as per the definition in Section 3. While a spatially uniform estimator suffices when the decision sets are discernable, we use the following spatially adaptive estimator proposed in Section 4.1 of [12]. This ensures that when the decision sets are indiscernible using unlabeled data, the semi-supervised learner still achieves an error bound that is, up to logarithmic factors, no worse than the minimax lower bound for supervised learners.
$$\hat{f}_{m,n,x}(\cdot) = \arg\min_{f' \in \Gamma} \sum_{i=1}^{n} (Y_i - f'(X_i))^2 \mathbf{1}_{x \leftrightarrow X_i} + \mathrm{pen}(f') \quad \text{and} \quad \hat{f}_{m,n}(x) \equiv \hat{f}_{m,n,x}(x).$$
Here $\mathbf{1}_{x \leftrightarrow X_i}$ is the indicator of $x \leftrightarrow X_i$ and $\Gamma$ denotes a collection of piecewise polynomials of degree $[\alpha]$ (the maximal integer $< \alpha$) defined over recursive dyadic partitions of the domain $\mathcal{X} = [0,1]^d$ with cells of sidelength between $2^{-\lceil \log(n/\log n)/(2\alpha+d) \rceil}$ and $2^{-\lceil \log(n/\log n)/d \rceil}$. The penalty term $\mathrm{pen}(f')$ is proportional to $\log\!\left(\sum_{i=1}^{n} \mathbf{1}_{x \leftrightarrow X_i}\right) \#f'$, where $\#f'$ denotes the number of cells in the recursive dyadic partition on which $f'$ is defined. It is shown in [12] that this estimator yields a finite sample error bound of $n^{-2\alpha/(2\alpha+d)}$ for Hölder-$\alpha$ smooth functions, and $\max\{n^{-2\alpha/(2\alpha+d)}, n^{-1/d}\}$ for piecewise Hölder-$\alpha$ functions, ignoring logarithmic factors.
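
A heavily simplified sketch of the density-adaptive idea: predict at a point $x$ with an ordinary least-squares polynomial fit over the labeled points connected to $x$. The recursive dyadic partitioning and the complexity penalty of [12] are omitted, so this only illustrates the "fit within the empirical decision set" principle, not the full spatially adaptive estimator.

```python
import numpy as np

def ssl_predict(x, X_lab, Y_lab, is_connected, degree=1):
    """Polynomial least-squares fit over labeled points connected to x.

    is_connected(x, X_i) should implement the x <-> X_i relation of Section 3;
    degree plays the role of [alpha] (degree=1 suits 1 < alpha <= 2).
    Coordinate-wise powers only (no cross terms), for brevity.
    """
    X_lab = np.asarray(X_lab, dtype=float)
    Y_lab = np.asarray(Y_lab, dtype=float)
    x = np.asarray(x, dtype=float)

    mask = np.array([is_connected(x, Xi) for Xi in X_lab])
    if not mask.any():                                   # no connected labeled data: fall back
        return float(np.mean(Y_lab))
    Xc, Yc = X_lab[mask], Y_lab[mask]

    feats = np.hstack([np.ones((len(Xc), 1))] +
                      [Xc ** p for p in range(1, degree + 1)])
    coef, *_ = np.linalg.lstsq(feats, Yc, rcond=None)    # least-squares polynomial fit

    x_feats = np.concatenate([[1.0]] + [x ** p for p in range(1, degree + 1)])
    return float(x_feats @ coef)
```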
Using these results from [12] and Corollary 1, we now state finite sample upper bounds on the semi-supervised learner (SSL) described above. Also, we derive finite sample minimax lower bounds on the performance of any supervised learner (SL). Our main results are summarized in the following table, for model distributions characterized by various values of the margin parameter $\gamma$. A sketch of the derivations of the results is provided in Section 7.3. Here we assume that the dimension $d \geq 2\alpha/(2\alpha - 1)$. If $d < 2\alpha/(2\alpha - 1)$, then the supervised learning error due to not resolving the decision sets (which behaves like $n^{-1/d}$) is smaller than the error incurred in estimating the target function itself (which behaves like $n^{-2\alpha/(2\alpha+d)}$). Thus, when $d < 2\alpha/(2\alpha - 1)$, the supervised regression error is dominated by the error in smooth regions and there appears to be no benefit to using a semi-supervised learner. In the table, we suppress constants and log factors in the bounds, and also assume that $m \gg n^{2d}$ so that $(n/\epsilon_2(n))^d = O(m/(\log m)^2)$. The constants $c_o$ and $C_o$ only depend on the fixed parameters of the class $\mathcal{P}_{XY}(\gamma)$ and do not depend on $\gamma$.
²If the component marginal densities and regression functions have different smoothnesses, say $\alpha$ and $\beta$, the same analysis holds except that $f(x)$ is Hölder-$\min(\alpha, \beta)$ smooth on each $D \in \mathcal{D}$.
Margin range $\gamma$  |  SSL upper bound $\epsilon_2(n)$  |  SL lower bound $\epsilon_1(n)$  |  SSL helps
$\gamma \geq \gamma_0$  |  $n^{-2\alpha/(2\alpha+d)}$  |  $n^{-2\alpha/(2\alpha+d)}$  |  No
$\gamma_0 > \gamma \geq c_o n^{-1/d}$  |  $n^{-2\alpha/(2\alpha+d)}$  |  $n^{-2\alpha/(2\alpha+d)}$  |  No
$c_o n^{-1/d} > \gamma \geq C_o (m/(\log m)^2)^{-1/d}$  |  $n^{-2\alpha/(2\alpha+d)}$  |  $n^{-1/d}$  |  Yes
$C_o (m/(\log m)^2)^{-1/d} > \gamma \geq -C_o (m/(\log m)^2)^{-1/d}$  |  $n^{-1/d}$  |  $n^{-1/d}$  |  No
$-C_o (m/(\log m)^2)^{-1/d} > \gamma \geq -\gamma_0$  |  $n^{-2\alpha/(2\alpha+d)}$  |  $n^{-1/d}$  |  Yes
$-\gamma_0 > \gamma$  |  $n^{-2\alpha/(2\alpha+d)}$  |  $n^{-1/d}$  |  Yes

If $\gamma$ is large relative to the average spacing between labeled data points, $n^{-1/d}$, then a supervised learner can discern the decision sets accurately and SSL provides no gain. However, if $\gamma > 0$ is small relative to $n^{-1/d}$, but large with respect to the spacing between unlabeled data points, $m^{-1/d}$, then the proposed semi-supervised learner provides improved error bounds compared to any supervised learner. If $|\gamma|$ is smaller than $m^{-1/d}$, the decision sets are not discernable with unlabeled data and SSL provides no gain. However, notice that the performance of the semi-supervised learner is no worse than the minimax lower bound for supervised learners. In the $\gamma < 0$ case, if $-\gamma$ is larger than $m^{-1/d}$, then the semi-supervised learner can discern the decision sets and achieves smaller error bounds, whereas these sets cannot be as accurately discerned by any supervised learner. For the overlap case ($\gamma < 0$), supervised learners are always limited by the error incurred due to averaging across decision sets ($n^{-1/d}$). In particular, for the collection of distributions with $\gamma < -\gamma_0$, a faster rate of error convergence is attained by SSL compared to SL, provided $m \gg n^{2d}$.
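
The regimes in the table and the discussion above can be encoded as a simple lookup. In the sketch below, the constants $c_o$ and $C_o$ are user-supplied (defaults of 1 are illustrative only), and the returned rates suppress constants and log factors, exactly as in the table.

```python
import numpy as np

def regime(gamma, n, m, d, alpha, c_o=1.0, C_o=1.0):
    """Return (SSL upper bound, SL lower bound, SSL helps) as rates in n,
    following the table in Section 5; assumes d >= 2*alpha/(2*alpha - 1)."""
    smooth = n ** (-2.0 * alpha / (2.0 * alpha + d))      # rate with resolved decision sets
    boundary = n ** (-1.0 / d)                            # cost of unresolved decision boundaries
    res_lab = c_o * n ** (-1.0 / d)                       # resolution achievable from labeled data
    res_unlab = C_o * (m / np.log(m) ** 2) ** (-1.0 / d)  # resolution from unlabeled data

    if gamma >= res_lab:                  # sets discernable from labeled data alone
        return smooth, smooth, False
    if gamma >= res_unlab:                # discernable only with unlabeled data
        return smooth, boundary, True
    if gamma >= -res_unlab:               # not discernable even with unlabeled data
        return boundary, boundary, False
    return smooth, boundary, True         # overlap resolvable only from unlabeled data
```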

6 Conclusions
In this paper, we develop a framework for evaluating the performance gains possible with semi-
supervised learning under a cluster assumption using finite sample error bounds. The theoretical
characterization we present explains why in certain situations unlabeled data can help to improve
learning, while in other situations they may not. We demonstrate that there exist general situations
under which semi-supervised learning can be significantly superior to supervised learning in terms
of achieving smaller finite sample error bounds than any supervised learner, and sometimes in terms
of a better rate of error convergence. Moreover, our results also provide a quantification of the rela-
tive value of unlabeled to labeled data. While we focus on the cluster assumption in this paper, we
conjecture that similar techniques can be applied to quantify the performance of semi-supervised
learning under the manifold assumption as well. In particular, we believe that the use of minimax
lower bounding techniques is essential because many of the interesting distinctions between super-
vised and semi-supervised learning occur only in finite sample regimes, and rates of convergence
and asymptotic analyses may not capture the complete picture.

7 Proofs
We sketch the main ideas behind the proofs here; please refer to [13] for details. Since the component densities are bounded from below and above, define $p_{\min} := b \min_k a_k \leq p(x) \leq B =: p_{\max}$.
7.1 Proof of Lemma 1
First, we state two relatively straightforward results about the proposed kernel density estimator.
Theorem 1 (Sup-norm density estimation of non-boundary points). Consider the kernel density estimator $\hat{p}(x)$ proposed in Section 3. If the kernel $G$ satisfies $\mathrm{supp}(G) = [-1,1]^d$, $0 < G \leq G_{\max} < \infty$, $\int_{[-1,1]^d} G(u)\,du = 1$ and $\int_{[-1,1]^d} u^j G(u)\,du = 0$ for $1 \leq j \leq [\alpha]$, then for all $p \in \mathcal{P}_X$, with probability at least $1 - 1/m$,
$$\sup_{x \in \mathrm{supp}(p) \setminus B} |p(x) - \hat{p}(x)| = O\!\left(h_m^{\min(1,\alpha)} + \sqrt{\frac{\log m}{m h_m^d}}\right) =: \epsilon_m.$$
Notice that $\epsilon_m$ decreases with increasing $m$. A detailed proof can be found in [13].
Corollary 2 (Empirical density of unlabeled data). Under the conditions of Theorem 1, for all $p \in \mathcal{P}_X$ and large enough $m$, with probability $> 1 - 1/m$, for all $x \in \mathrm{supp}(p) \setminus B$, $\exists\, X_i \in \mathcal{U}$ s.t. $\|X_i - x\| \leq \sqrt{d}\, h_m$.
Proof. From Theorem 1, for all $x \in \mathrm{supp}(p) \setminus B$, $\hat{p}(x) \geq p(x) - \epsilon_m \geq p_{\min} - \epsilon_m > 0$ for $m$ sufficiently large. This implies $\sum_{i=1}^{m} G(H_m^{-1}(X_i - x)) > 0$, and hence $\exists\, X_i \in \mathcal{U}$ within $\sqrt{d}\, h_m$ of $x$.

Using the density estimation results, we now show that if $|\gamma| > 6\sqrt{d}\, h_m$, then for all $p \in \mathcal{P}_X$, all pairs of points $x_1, x_2 \in \mathrm{supp}(p) \setminus B$ and all $D \in \mathcal{D}$, for large enough $m$, with probability $> 1 - 1/m$, we have $x_1, x_2 \in D$ if and only if $x_1 \leftrightarrow x_2$. We establish this in two steps:
1. $x_1 \in D, x_2 \not\in D \Rightarrow x_1 \not\leftrightarrow x_2$: Since $x_1$ and $x_2$ belong to different decision sets, all sequences connecting $x_1$ and $x_2$ through unlabeled data points pass through a region where either (i) the density is zero, and since the region is at least $|\gamma| > 6\sqrt{d}\, h_m$ wide, there cannot exist a sequence as defined in Section 3 such that $\|z_j - z_{j+1}\| \leq 2\sqrt{d}\, h_m$; or (ii) the density is positive. In the latter case, the marginal density $p(x)$ jumps by at least $p_{\min}$ one or more times along all sequences connecting $x_1$ and $x_2$. Suppose the first jump occurs where decision set $D$ ends and another decision set $D' \neq D$ begins (in the sequence). Then since $D'$ is at least $|\gamma| > 6\sqrt{d}\, h_m$ wide, by Corollary 2, for all sequences connecting $x_1$ and $x_2$ through unlabeled data points, there exist points $z_i, z_j$ in the sequence that lie in $D \setminus B$ and $D' \setminus B$, respectively, with $\|z_i - z_j\| \leq h_m \log m$. Since the density on each decision set is Hölder-$\alpha$ smooth, we have $|p(z_i) - p(z_j)| \geq p_{\min} - O((h_m \log m)^{\min(1,\alpha)})$. Since $z_i, z_j \not\in B$, using Theorem 1, $|\hat{p}(z_i) - \hat{p}(z_j)| \geq |p(z_i) - p(z_j)| - 2\epsilon_m > \delta_m$ for large enough $m$. Thus, $x_1 \not\leftrightarrow x_2$.

2. $x_1, x_2 \in D \Rightarrow x_1 \leftrightarrow x_2$: Since $D$ has width at least $|\gamma| > 6\sqrt{d}\, h_m$, there exists a region of width $> 2\sqrt{d}\, h_m$ contained in $D \setminus B$, and Corollary 2 implies that with probability $> 1 - 1/m$, there exist sequence(s) contained in $D \setminus B$ connecting $x_1$ and $x_2$ through $2\sqrt{d}\, h_m$-dense unlabeled data points. Since the sequence is contained in $D$ and the density on $D$ is Hölder-$\alpha$ smooth, we have for all points $z_i, z_j$ in the sequence that satisfy $\|z_i - z_j\| \leq h_m \log m$, $|p(z_i) - p(z_j)| \leq O((h_m \log m)^{\min(1,\alpha)})$. Since $z_i, z_j \not\in B$, using Theorem 1, $|\hat{p}(z_i) - \hat{p}(z_j)| \leq |p(z_i) - p(z_j)| + 2\epsilon_m \leq \delta_m$ for large enough $m$. Thus, $x_1 \leftrightarrow x_2$. $\square$

7.2 Proof of Corollary 1


Let $\Omega_1$ denote the event under which Lemma 1 holds. Then $P(\Omega_1^c) \leq 1/m$. Let $\Omega_2$ denote the event that the test point $X$ and training data $X_1, \ldots, X_n \in \mathcal{L}$ don't lie in $B$. Then $P(\Omega_2^c) \leq (n+1)P(B) \leq (n+1) p_{\max} \mathrm{vol}(B) = O(n h_m)$. The last step follows from the definition of the set $B$ and since the boundaries of the support sets are Lipschitz, $K$ is finite, and hence $\mathrm{vol}(B) = O(h_m)$.
Now observe that $\hat{f}_{D,n}$ essentially uses the clairvoyant knowledge of the decision sets $\mathcal{D}$ to discern which labeled points $X_1, \ldots, X_n$ are in the same decision set as $X$. Conditioning on $\Omega_1, \Omega_2$, Lemma 1 implies that $X, X_i \in D$ iff $X \leftrightarrow X_i$. Thus, we can define a semi-supervised learner $\hat{f}_{m,n}$ to be the same as $\hat{f}_{D,n}$ except that instead of using clairvoyant knowledge of whether $X, X_i \in D$, $\hat{f}_{m,n}$ is based on whether $X \leftrightarrow X_i$. It follows that $\sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(\hat{f}_{m,n})|\Omega_1, \Omega_2] = \sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(\hat{f}_{D,n})]$, and since the excess risk is bounded,
$$\sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(\hat{f}_{m,n})] \leq \sup_{\mathcal{P}_{XY}(\gamma)} E[\mathcal{E}(\hat{f}_{m,n})|\Omega_1, \Omega_2] + O(1/m + n h_m). \;\;\square$$

7.3 Density-adaptive Regression results


1) Semi-Supervised Learning Upper Bound: The clairvoyant counterpart of $\hat{f}_{m,n}(x)$ is given as $\hat{f}_{D,n}(x) \equiv \hat{f}_{D,n,x}(x)$, where $\hat{f}_{D,n,x}(\cdot) = \arg\min_{f' \in \Gamma} \sum_{i=1}^{n} (Y_i - f'(X_i))^2 \mathbf{1}_{x, X_i \in D} + \mathrm{pen}(f')$, and is a standard supervised learner that performs a piecewise polynomial fit on each decision set, where the regression function is Hölder-$\alpha$ smooth. Let $n_D = \sum_{i=1}^{n} \mathbf{1}_{X_i \in D}$. It follows [12] that
$$E[(f(X) - \hat{f}_{D,n}(X))^2 \mathbf{1}_{X \in D} \mid n_D] \leq C (n_D/\log n_D)^{-2\alpha/(d+2\alpha)}.$$

Since $E[(f(X) - \hat{f}_{D,n}(X))^2] = \sum_{D \in \mathcal{D}} E[(f(X) - \hat{f}_{D,n}(X))^2 \mathbf{1}_{X \in D}] P(D)$, taking expectation over $n_D \sim \mathrm{Binomial}(n, P(D))$ and summing over all decision sets, recalling that $|\mathcal{D}|$ is a finite constant, the overall error of $\hat{f}_{D,n}$ scales as $n^{-2\alpha/(2\alpha+d)}$, ignoring logarithmic factors. If $|\gamma| > C_o (m/(\log m)^2)^{-1/d}$, using Corollary 1, the same performance bound holds for $\hat{f}_{m,n}$ provided $m \gg n^{2d}$. See [13] for further details. If $|\gamma| < C_o (m/(\log m)^2)^{-1/d}$, the decision sets are not discernable using unlabeled data. Since the regression function is piecewise Hölder-$\alpha$ smooth on each empirical decision set, using Theorem 9 in [12] and similar analysis, an upper bound of $\max\{n^{-2\alpha/(2\alpha+d)}, n^{-1/d}\}$ follows, which scales as $n^{-1/d}$ when $d \geq 2\alpha/(2\alpha - 1)$.
2) Supervised Learning Lower Bound: The formal minimax proof requires construction of a finite subset of distributions in $\mathcal{P}_{XY}(\gamma)$ that are the hardest cases to distinguish based on a finite number of labeled data $n$, and relies on a Hellinger version of Assouad's Lemma (Theorem 2.10 (iii) in [16]). Complete details are given in [13]. Here we present the simple intuition behind the minimax lower bound of $n^{-1/d}$ when $\gamma < c_o n^{-1/d}$. In this case the decision boundaries can only be localized to an accuracy of $n^{-1/d}$, the average spacing between labeled data points. Since the boundaries are Lipschitz, the expected volume that is incorrectly assigned to any decision set is $> c_1 n^{-1/d}$, where $c_1 > 0$ is a constant. Thus, if the expected excess risk at a point that is incorrectly assigned to a decision set can be greater than a constant $c_2 > 0$, then the overall expected excess risk is $> c_1 c_2 n^{-1/d}$. This is the case for both regression and binary classification. If $\gamma > c_o n^{-1/d}$, the decision sets can be accurately discerned from the labeled data alone. In this case, it follows that the minimax lower bound is equal to the minimax lower bound for Hölder-$\alpha$ smooth regression functions, which is $c n^{-2\alpha/(d+2\alpha)}$, where $c > 0$ is a constant [17].
References
[1] Balcan, M.F., Blum, A.: A PAC-style model for learning from labeled and unlabeled data. In: 18th
Annual Conference on Learning Theory, COLT. (2005)
[2] Kääriäinen, M.: Generalization error bounds using unlabeled data. In: 18th Annual Conference on
Learning Theory, COLT. (2005)
[3] Rigollet, P.: Generalization error bounds in semi-supervised classification under the cluster assumption.
Journal of Machine Learning Research 8 (2007) 1369–1392
[4] Lafferty, J., Wasserman, L.: Statistical analysis of semi-supervised regression. In: Advances in Neural
Information Processing Systems 21, NIPS. (2007) 801–808
[5] Ben-David, S., Lu, T., Pal, D.: Does unlabeled data provably help? worst-case analysis of the sample
complexity of semi-supervised learning. In: 21st Annual Conference on Learning Theory, COLT. (2008)
[6] Niyogi, P.: Manifold regularization and semi-supervised learning: Some theoretical analyses. Technical Report TR-2008-01, Computer Science Department, University of Chicago. URL http://people.cs.uchicago.edu/∼niyogi/papersps/ssminimax2.pdf (2008)
[7] Seeger, M.: Learning with labeled and unlabeled data. Technical report, Institute for ANC, Edinburgh,
UK. URL http://www.dai.ed.ac.uk/∼seeger/papers.html (2000)
[8] Castelli, V., Cover, T.M.: On the exponential value of labeled samples. Pattern Recognition Letters 16(1)
(1995) 105–111
[9] Castelli, V., Cover, T.M.: The relative value of labeled and unlabeled samples in pattern recognition.
IEEE Transactions on Information Theory 42(6) (1996) 2102–2117
[10] Bickel, P.J., Li, B.: Local polynomial regression on unknown manifolds. In: IMS Lecture Notes–Monograph Series, Complex Datasets and Inverse Problems: Tomography, Networks and Beyond. Volume 54. (2007) 177–186
[11] Korostelev, A.P., Tsybakov, A.B.: Minimax Theory of Image Reconstruction. Springer, NY (1993)
[12] Castro, R., Willett, R., Nowak, R.: Faster rates in regression via active learning.
Technical Report ECE-05-03, ECE Department, University of Wisconsin - Madison. URL
http://www.ece.wisc.edu/∼nowak/ECE-05-03.pdf (2005)
[13] Singh, A., Nowak, R., Zhu, X.: Finite sample analysis of semi-supervised learning.
Technical Report ECE-08-03, ECE Department, University of Wisconsin - Madison. URL
http://www.ece.wisc.edu/∼nowak/SSL TR.pdf (2008)
[14] Korostelev, A., Nussbaum, M.: The asymptotic minimax constant for sup-norm loss in nonparametric
density estimation. Bernoulli 5(6) (1999) 1099–1118
[15] Chapelle, O., Zien, A.: Semi-supervised classification by low density separation. In: Tenth International
Workshop on Artificial Intelligence and Statistics. (2005) 57–64
[16] Tsybakov, A.B.: Introduction à l'estimation non-paramétrique. Springer, Berlin Heidelberg (2004)
[17] Stone, C.J.: Optimal rates of convergence for nonparametric estimators. The Annals of Statistics 8(6)
(1980) 1348–1360
