Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly

Li, Le; Guedj, Benjamin

doi:10.3390/e23111534

Open AccessArticle

Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly

by

Le Li

¹

and

Benjamin Guedj

^2,*

¹

Department of Statistics, Central China Normal University, Wuhan 430079, China

²

Inria, Lille-Nord Europe Research Centre and Inria London, France and Centre for Artificial Intelligence, Department of Computer Science, University College London, London WC1V 6LJ, UK

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(11), 1534; https://doi.org/10.3390/e23111534

Submission received: 22 August 2021 / Revised: 23 October 2021 / Accepted: 1 November 2021 / Published: 18 November 2021

(This article belongs to the Special Issue Approximate Bayesian Inference)

Download

Browse Figures

Versions Notes

Abstract

:

When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. A principal curve acts as a nonlinear generalization of PCA, and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation (called slpc, for sequential learning principal curves) that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret computation and performance on synthetic and real-life data.

Keywords:

sequential learning; principal curves; data streams; regret bounds; greedy algorithm; sleeping experts

1. Introduction

Numerous methods have been proposed in the statistics and machine learning literature to sum up information and represent data by condensed and simpler-to-understand quantities. Among those methods, principal component analysis (PCA) aims at identifying the maximal variance axes of data. This serves as a way to represent data in a more compact fashion and hopefully reveal as well as possible their variability. PCA was introduced by [1,2] and further developed by [3]. This is one of the most widely used procedures in multivariate exploratory analysis targeting dimension reduction or feature extraction. Nonetheless, PCA is a linear procedure and the need for more sophisticated nonlinear techniques has led to the notion of principal curve. Principal curves may be seen as a nonlinear generalization of the first principal component. The goal is to obtain a curve which passes “in the middle” of data, as illustrated by Figure 1. This notion of skeletonization of data clouds has been at the heart of numerous applications in many different domains, such as physics [4,5], character and speech recognition [6,7], mapping and geology [5,8,9], to name but a few.

1.1. Earlier Works on Principal Curves

The original definition of principal curve dates back to [10]. A principal curve is a smooth (

C^{\infty}

) parameterized curve

f (s) = (f_{1} (s), \dots, f_{d} (s))

in

R^{d}

which does not intersect itself, has finite length inside any bounded subset of

R^{d}

and is self-consistent. This last requirement means that

f (s) = E [X | s_{f} (X) = s]

, where

X \in R^{d}

is a random vector and the so-called projection index

s_{f} (x)

is the largest real number s minimizing the squared Euclidean distance between

f (s)

and x, defined by

s_{f} (x) = sup \{s : {∥x - f (s)∥}_{2}^{2} = inf_{τ} {∥x - f (τ)∥}_{2}^{2}\} .

Self-consistency means that each point of

f

is the average (under the distribution of X) of all data points projected on

f

, as illustrated by Figure 2.

However, an unfortunate consequence of this definition is that the existence is not guaranteed in general for a particular distribution, let alone for an online sequence for which no probabilistic assumption is made. In order to handle complex data structures, Ref. [11] proposed principal curves (PCOP) of principal oriented points (POPs) which are defined as the fixed points of an expectation function of points projected to a hyperplane minimising the total variance. To obtain POPs, a cluster analysis is performed on the hyperplane and only data in the local cluster are considered. Ref. [12] introduced the local principal curve (LPC), whose concept is similar to that of [11], but accelerates the computation of POPs by calculating local centers of mass instead of performing cluster analysis, and local principal component instead of principal direction. Later, Ref. [13] also considered LPC in data compression and regression to reduce the dimension of predictors space to low-dimension manifold. Ref. [14] extended the idea of localization to independent component analysis (ICA) by proposing a local-to-global non-linear ICA framework for visual and auditory signal. Ref. [15] considered principal curves from a different perspective: as the ridge of a smooth probability density function (PDF) generating dataset, where the ridges are collections of all points; the local gradient of a PDF is an eigenvector of the local Hessian, and the eigenvalues corresponding to the remaining eigenvectors are negative. To estimate principal curves based on this definition, the subspace constrained mean shift (SCMS) algorithm was proposed. All the local methods above require strong assumptions on the PDF, such as twice continuous differentiability, which may prove challenging to be satisfied in the settings of online sequential data. Ref. [16] proposed a new concept of principal curves which ensures its existence for a large class of distributions. Principal curves

f^{⋆}

are defined as the curves minimizing the expected squared distance over a class

F_{L}

of curves whose length is smaller than

L > 0

; namely,

f^{⋆} \in \underset{f \in F_{L}}{arg inf} Δ (f),

where

Δ (f) = E [Δ (f, X)] = E [inf_{s} {∥f (s) - X∥}_{2}^{2}] .

If

{E ∥ X ∥}_{2}^{2} < \infty

,

f^{⋆}

always exists but may not be unique. In practical situations where only i.i.d. copies

X_{1}, \dots, X_{n}

of X are observed, the method of [16] considers classes

F_{k, L}

of all polygonal lines with k segments and length not exceeding L, and chooses an estimator

{\hat{f}}_{k, n}

of

f^{⋆}

as the one within

F_{k, L}

, which minimizes the empirical counterpart

Δ_{n} (f) = \frac{1}{n} \sum_{i = 1}^{n} Δ (f, X_{i})

of

Δ (f)

. It is proved in [17] that if X is almost surely bounded and

k \propto n^{1 / 3}

, then

Δ ({\hat{f}}_{k, n}) - Δ (f^{⋆}) = O (n^{- 1 / 3}) .

As the task of finding a polygonal line with k segments and length of at most L that minimizes

Δ_{n} (f)

is computationally costly, Ref. [17] proposed a polygonal line algorithm. This iterative algorithm proceeds by fitting a polygonal line with k segments and considerably speeds up the exploration part by resorting to gradient descent. The two steps (projection and optimization) are similar to what is done by the k-means algorithm. However, the polygonal line algorithm is not supported by theoretical bounds and leads to variable performance depending on the distribution of the observations.

As the number of segments, k, plays a crucial role (a too small a k value leads to a poor summary of data, whereas a too-large k yields overfitting; see Figure 3), Ref. [18] aimed to fill the gap by selecting an optimal k from both theoretical and practical perspectives.

Their approach relies strongly on the theory of model selection by penalization introduced by [19] and further developed by [20]. By considering countable classes

{F_{k, ℓ}}_{k, ℓ}

of polygonal lines with k segments and total length

ℓ \leq L

, and whose vertices are on a lattice, the optimal

(\hat{k}, \hat{ℓ})

is obtained as the minimizer of the criterion

crit (k, ℓ) = Δ_{n} ({\hat{f}}_{k, ℓ}) + pen (k, ℓ),

where

pen (k, ℓ) = c_{0} \sqrt{\frac{k}{n}} + c_{1} \frac{ℓ}{n} + c_{2} \frac{1}{\sqrt{n}} + δ^{2} \sqrt{\frac{w_{k, ℓ}}{2 n}}

is a penalty function where

δ

stands for the diameter of observations and

w_{k, ℓ}

denotes the weight attached to class

F_{k, ℓ}

; and it has constants

c_{0}, c_{1}, c_{2}

depending on

δ

, maximum length L and a certain number of dimensions of observations. Ref. [18] then proved that

E [Δ ({\hat{f}}_{\hat{k}, \hat{ℓ}}) - Δ (f^{⋆})] \leq inf_{k, ℓ} \{E [Δ ({\hat{f}}_{k, ℓ}) - Δ (f^{⋆})] + pen (k, ℓ)\} + \frac{δ^{2} Σ}{2^{3 / 2}} \sqrt{\frac{π}{n}},

(1)

where

Σ

is a numerical constant. The expected loss of the final polygonal line

{\hat{f}}_{\hat{k}, \hat{ℓ}}

is close to the minimal loss achievable over

F_{k, ℓ}

up to a remainder term decaying as

1 / \sqrt{n}

.

1.2. Motivation

The big data paradigm—where collecting, storing and analyzing massive amounts of large and complex data becomes the new standard—commands one to revisit some of the classical statistical and machine learning techniques. The tremendous improvements of data acquisition infrastructures generates new continuous streams of data, rather than batch datasets. This has drawn great interest to sequential learning. Extending the notion of principal curves to the sequential settings opens up immediate practical application possibilities. As an example, path planning for passengers’ locations can help taxi companies to better optimize their fleet. Online algorithms that could yield instantaneous path summarization would be adapted to the sequential nature of geolocalized data. Existing theoretical works and practical implementations of principal curves are designed for the batch setting [7,16,17,18,21] and their adaptation to the sequential setting is not a smooth process. As an example, consider the algorithm in [18]. It is assumed that vertices of principal curves are located on a lattice, and its computational complexity is of order

O (n N^{p})

where n is the number of observations, N the number of points on the lattice and p the maximum number of vertices. When p is large, running this algorithm at each epoch yields a monumental computational cost. In general, if data are not identically distributed or even adversary, algorithms that originally worked well in the batch setting may not be ideal when cast onto the online setting (see [22], Chapter 4). To the best of our knowledge, little effort has been put so far into extending principal curves algorithms to the sequential context.

Ref. [23] provided an incremental version of the SCMS algorithm [15] which is based on a definition of a principal curve as the ridge of a smooth probability density function generating observations. They applied the SCMS algorithm to the input points that are associated with the output points which are close to the new incoming sample and leave the remaining outputs unchanged. Hence, this algorithm can be used to deal with sequential data. As presented in the next section, our algorithm for sequentially updating principal curve vertices that are close to new data is similar in spirit to that of incremental SCMS. However, a difference is that our algorithm outputs polygonal lines. In addition, the computation complexity of our method is

O (n^{2})

, and incremental SCMS has

O (n^{3})

complexity, where n is the number of observations. Ref. [24] considered sequential principal curves analysis in a fairly different setting in which the goal was to derive in an adaptive fashion a set of nonlinear sensors by using a set of preliminary principal curves. Unfolding sequentially principal curves and a sequential path for Jacobian integration were considered. The “sequential” in this setting represented the generalization of principal curves to principal surfaces or even a principal manifold of higher dimensions. This way of sequentially exploiting principal curves was firstly proposed by [11] and later extended by [14,25,26] to give curvilinear representations using sequence of local-to-global curves. In addition, Refs. [15,27,28] presented, respectively, principal polynomial and non-parametric regressions to capture the nonlinear nature of data. However, these methods are not originally designed for treating sequential data. The present paper aims at filling this gap: our goal was to propose an online perspective to principal curves by automatically and sequentially learning the best principal curve summarizing a data stream. Sequential learning takes advantage of the latest collected (set of) observations and therefore suffers a much smaller computational cost.

Sequential learning operates as follows: a blackbox reveals at each time t some deterministic value

x_{t}, t = 1, 2, \dots

, and a forecaster attempts to predict sequentially the next value based on past observations (and possibly other available information). The performance of the forecaster is no longer evaluated by its generalization error (as in the batch setting) but rather by a regret bound which quantifies the cumulative loss of a forecaster in the first T rounds with respect to some reference minimal loss. In sequential learning, the velocity of algorithms may be favored over statistical precision. An immediate use of aforecited techniques [17,18,21] at each time round t (treating data collected until t as a batch dataset) would result in a monumental algorithmic cost. Rather, we propose a novel algorithm which adapts to the sequential nature of data, i.e., which takes advantage of previous computations.

The contributions of the present paper are twofold. We first propose a sequential principal curve algorithm, for which we derived regret bounds. We then present an implementation, illustrated on a toy dataset and a real-life dataset (seismic data). The sketch of our algorithm’s procedure is as follows. At each time round t, the number of segments of

k_{t}

is chosen automatically and the number of segments

k_{t + 1}

in the next round is obtained by only using information about

k_{t}

and a small number of past observations. The core of our procedure relies on computing a quantity which is linked to the mode of the so-called Gibbs quasi-posterior and is inspired by quasi-Bayesian learning. The use of quasi-Bayesian estimators is especially advocated by the PAC-Bayesian theory, which originated in the machine learning community in the late 1990s, in the seminal works of [29] and McAllester [30,31]. The PAC-Bayesian theory has been successfully adapted to sequential learning problems; see, for example, Ref. [32] for online clustering. We refer to [33,34] for a recent overview of the field.

The paper is organized as follows. Section 2 presents our notation and our online principal curve algorithm, for which we provide regret bounds with sublinear remainder terms in Section 3. A practical implementation was proposed in Section 4, and we illustrate its performance on synthetic and real-life datasets in Section 5. Proofs of all original results claimed in the paper are collected in Section 6.

2. Notation

A parameterized curve in

R^{d}

is a continuous function

f : I ⟶ R^{d}

where

I = [a, b]

is a closed interval of the real line. The length of

f

is given by

L (f) = lim_{M \to \infty} \{sup_{a = s_{0} < s_{1} < \dots < s_{M} = b} \sum_{i = 1}^{M} {∥ f (s_{i}) - f (s_{i - 1}) ∥}_{2}\} .

Let

x_{1}, x_{2}, \dots, x_{T} \in B (0, \sqrt{d} R) \subset R^{d}

be a sequence of data, where

B (c, R)

stands for the

ℓ_{2}

-ball centered in

c \in R^{d}

with radius

R > 0

. Let

Q_{δ}

be a grid over

B (0, \sqrt{d} R)

, i.e.,

Q_{δ} = B (0, \sqrt{d} R) \cap Γ_{δ}

where

Γ_{δ}

is a lattice in

R^{d}

with spacing

δ > 0

. Let

L > 0

and define for each

k \in ⟦ 1, p ⟧

the collection

F_{k, L}

of polygonal lines

f

with k segments whose vertices are in

Q_{δ}

and such that

L (f) \leq L

. Denote by

F_{p} = \cup_{k = 1}^{p} F_{k, L}

all polygonal lines with a number of segments

\leq p

, whose vertices are in

Q_{δ}

and whose length is at most L. Finally, let

K (f)

denote the number of segments of

f \in F_{p}

. This strategy is illustrated by Figure 4.

Our goal is to learn a time-dependent polygonal line which passes through the “middle” of data and gives a summary of all available observations

x_{1}, \dots, x_{t - 1}

(denoted by

{(x_{s})}_{1 : (t - 1)}

hereafter) before time t. Our output at time t is a polygonal line

{\hat{f}}_{t} \in F_{p}

depending on past information

{(x_{s})}_{1 : (t - 1)}

and past predictions

{({\hat{f}}_{s})}_{1 : (t - 1)}

. When

x_{t}

is revealed, the instantaneous loss at time t is computed as

Δ ({\hat{f}}_{t}, x_{t}) = inf_{s \in I} {∥ {\hat{f}}_{t} (s) - x_{t} ∥}_{2}^{2} .

(2)

In what follows, we investigate regret bounds for the cumulative loss based on (2). Given a measurable space

Θ

(embedded with its Borel

σ

-algebra), we let

P (Θ)

denote the set of probability distributions on

Θ

, and for some reference measure

π

, we let

P_{π} (Θ)

be the set of probability distributions absolutely continuous with respect to

π

.

For any

k \in ⟦ 1, p ⟧

, let

π_{k}

denote a probability distribution on

F_{k, L}

. We define the prior

π

on

F_{p} = \cup_{k = 1}^{p} F_{k, L}

as

π (f) = \sum_{k \in ⟦ 1, p ⟧} w_{k} π_{k} (f) 𝟙_{\{f \in F_{k, L}\}}, f \in F_{p},

where

w_{1}, \dots, w_{p} \geq 0

and

\sum_{k \in ⟦ 1, p ⟧} w_{k} = 1

.

We adopt a quasi-Bayesian-flavored procedure: consider the Gibbs quasi-posterior (note that this is not a proper posterior in all generality, hence the term “quasi”):

{\hat{ρ}}_{t} (\cdot) \propto exp (- λ S_{t} (\cdot)) π (\cdot),

where

S_{t} (f) = S_{t - 1} (f) + Δ (f, x_{t}) + \frac{λ}{2} {(Δ (f, x_{t}) - Δ ({\hat{f}}_{t}, x_{t}))}^{2},

as advocated by [32,35] who then considered realizations from this quasi-posterior. In the present paper, we will rather focus on a quantity linked to the mode of this quasi-posterior. Indeed, the mode of the quasi-posterior

{\hat{ρ}}_{t + 1}

is

arg min_{f \in F_{p}} \{\underset{(i)}{\underset{⏟}{\sum_{s = 1}^{t} Δ (f, x_{s})}} + \underset{(i i)}{\underset{⏟}{\frac{λ}{2} \sum_{s = 1}^{t} {(Δ (f, x_{t}) - Δ ({\hat{f}}_{t}, x_{t}))}^{2}}} + \underset{(i i i)}{\underset{⏟}{\frac{ln π (f)}{λ}}}\},

where (i) is a cumulative loss term, (ii) is a term controlling the variance of the prediction

f

to past predictions

{\hat{f}}_{s}, s \leq t

, and (iii) can be regarded as a penalty function on the complexity of

f

if

π

is well chosen. This mode hence has a similar flavor to follow the best expert or follow the perturbed leader in the setting of prediction with experts (see [22,36], Chapters 3 and 4) if we consider each

f \in F_{p}

as an expert which always delivers constant advice. These remarks yield Algorithm 1.

Algorithm 1 Sequentially learning principal curves.

1:: Input parameters: $p > 0, η > 0, π (z) = e^{- z} 1_{{z > 0}}$ and penalty function $h : F_{p} \to R^{+}$
2:: Initialization: For each $f \in F_{p}$ , draw $z_{f} \sim π$ and $Δ_{f, 0} = \frac{1}{η} (h (f) - z_{f})$
3:: For $t = 1, \dots, T$
4:: Get the data $x_{t}$
5:: Obtain

${\hat{f}}_{t} = \underset{f \in F_{p}}{arg inf} \{\sum_{s = 0}^{t - 1} Δ_{f, s}\},$

where $Δ_{f, s} = Δ (f, x_{s})$ , $s \geq 1$ .
6:: End for

3. Regret Bounds for Sequential Learning of Principal Curves

We now present our main theoretical results.

Theorem 1.

For any sequence

{(x_{t})}_{1 : T} \in B (0, \sqrt{d} R)

,

R \geq 0

and any penalty function

h : F_{p} \to R^{+}

, let

π (z) = e^{- z} 1_{{z > 0}}

. Let

0 < η \leq \frac{1}{d {(2 R + δ)}^{2}}

; then the procedure described in Algorithm 1 satisfies

\sum_{t = 1}^{T} E_{π} [Δ ({\hat{f}}_{t}, x_{t})] \leq (1 + c_{0} (e - 1) η) S_{T, h, η} + \frac{1}{η} (1 + ln \sum_{f \in F_{p}} e^{- h (f)}),

where

c_{0} = d {(2 R + δ)}^{2}

and

S_{T, h, η} = inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{h (f)}{η}\}\} .

The expectation of the cumulative loss of polygonal lines

{\hat{f}}_{1}, \dots, {\hat{f}}_{T}

is upper-bounded by the smallest penalized cumulative loss over all

k \in {1, \dots, p}

up to a multiplicative term

(1 + c_{0} (e - 1) η)

, which can be made arbitrarily close to 1 by choosing a small enough

η

. However, this will lead to both a large

h (f) / η

in

S_{T, h, η}

and a large

\frac{1}{η} (1 + ln \sum_{f \in F_{p}} e^{- h (f)})

. In addition, another important issue is the choice of the penalty function h. For each

f \in F_{p}

,

h (f)

should be large enough to ensure a small

\sum_{f \in F_{p}} e^{- h (f)}

, but not too large to avoid overpenalization and a larger value for

S_{T, h, η}

. We therefore set

h (f) \geq ln (p e) + ln | {f \in F_{p}, K (f) = k} |

(3)

for each

f

with k segments (where

| M |

denotes the cardinality of a set M) since it leads to

\sum_{f \in F_{p}} e^{- h (f)} = \sum_{k \in ⟦ 1, p ⟧} \sum_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} e^{- h (f)} \leq \sum_{k \in ⟦ 1, p ⟧} \frac{1}{p e} \leq \frac{1}{e} .

The penalty function

h (f) = c_{1} K (f) + c_{2} L + c_{3}

satisfies (3), where

c_{1}, c_{2}, c_{3}

are constants depending on R, d,

δ

, p (this is proven in Lemma 3, in Section 6). We therefore obtain the following corollary.

Corollary 1.

Under the assumptions of Theorem 1, let

η = min \{\frac{1}{d {(2 R + δ)}^{2}}, \sqrt{\frac{c_{1} p + c_{2} L + c_{3}}{c_{0} (e - 1) {inf}_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t})}}\} .

Then

\begin{matrix} \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] \leq inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \sqrt{c_{0} (e - 1) r_{T, k, L}}\}\} \\ + \sqrt{c_{0} (e - 1) r_{T, p, L}} + c_{0} (e - 1) (c_{1} p + c_{2} L + c_{3}), \end{matrix}

where

r_{T, k, L} = {inf}_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t}) (c_{1} k + c_{2} L + c_{3})

.

Proof.

Note that

\sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] \leq S_{T, h, η} + η c_{0} (e - 1) inf_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t}) + c_{0} (e - 1) (c_{0} p + c_{2} L + c_{3}),

and we conclude by setting

η = \sqrt{\frac{c_{1} p + c_{2} L + c_{3}}{c_{0} (e - 1) {inf}_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t})}} .

□

Sadly, Corollary 1 is not of much practical use since the optimal value for

η

depends on

{inf}_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t})

which is obviously unknown, even more so at time

t = 0

. We therefore provide an adaptive refinement of Algorithm 1 in the following Algorithm 2.

Algorithm 2 Sequentially and adaptively learning principal curves.

1:: Input parameters: $p > 0$ , $L > 0$ , $π$ , h and $η_{0} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{c_{0} \sqrt{e - 1}}$
2:: Initialization: For each $f \in F_{p}$ , draw $z_{f} \sim π$ , $Δ_{f, 0} = \frac{1}{η_{0}} (h (f) - z_{f})$ and ${\hat{f}}_{0} = \underset{f \in F_{p}}{arg inf} Δ_{f, 0}$
3:: For $t = 1, \dots, T$
4:: Compute $η_{t} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{c_{0} \sqrt{(e - 1) t}}$
5:: Get data $x_{t}$ and compute $Δ_{f, t} = Δ (f, x_{t}) + (\frac{1}{η_{t}} - \frac{1}{η_{t - 1}}) (h (f) - z_{f})$
6:: Obtain

${\hat{f}}_{t} = \underset{f \in F_{p}}{arg inf} \{\sum_{s = 0}^{t - 1} Δ_{f, s}\} .$

(4)
7:: End for

Theorem 2.

For any sequence

{(x_{t})}_{1 : T} \in B (0, \sqrt{d} R), R \geq 0

, let

h (f) = c_{1} K (f) + c_{2} L + c_{3}

where

c_{1}

,

c_{2}

,

c_{3}

are constants depending on

R, d, δ, ln p

. Let

π (z) = e^{- z} 1_{{z > 0}}

and

η_{0} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{c_{0} \sqrt{e - 1}}, η_{t} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{c_{0} \sqrt{(e - 1) t}},

where

t \geq 1

and

c_{0} = d {(2 R + δ)}^{2}

. Then the procedure described in Algorithm 2 satisfies

\begin{matrix} \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] \leq inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + c_{0} \sqrt{(e - 1) T (c_{1} k + c_{2} L + c_{3})}\}\} \\ + 2 c_{0} \sqrt{(e - 1) T (c_{1} p + c_{2} L + c_{3})} . \end{matrix}

The message of this regret bound is that the expected cumulative loss of polygonal lines

{\hat{f}}_{1}, \dots, {\hat{f}}_{T}

is upper-bounded by the minimal cumulative loss over all

k \in {1, \dots, p}

, up to an additive term which is sublinear in T. The actual magnitude of this remainder term is

\sqrt{k T}

. When L is fixed, the number k of segments is a measure of complexity of the retained polygonal line. This bound therefore yields the same magnitude as (1), which is the most refined bound in the literature so far ([18] where the optimal values for k and L were obtained in a model selection fashion).

4. Implementation

The argument of the infimum in Algorithm 2 is taken over

F_{p} = \cup_{k = 1}^{p} F_{k, L}

which has a cardinality of order

{|Q_{δ}|}^{p}

, making any greedy search largely time-consuming. We instead turn to the following strategy: Given a polygonal line

{\hat{f}}_{t} \in F_{k_{t}, L}

with

k_{t}

segments, we consider, with a certain proportion, the availability of

{\hat{f}}_{t + 1}

within a neighborhood

U ({\hat{f}}_{t})

(see the formal definition below) of

{\hat{f}}_{t}

. This consideration is well suited for the principal curves setting, since if observation

x_{t}

is close to

{\hat{f}}_{t}

, one can expect that the polygonal line which well fits observations

x_{s}, s = 1, \dots, t

lies in a neighborhood of

{\hat{f}}_{t}

. In addition, if each polygonal line

f

is regarded as an action, we no longer assume that all actions are available at all times, and allow the set of available actions to vary at each time. This is a model known as “sleeping experts (or actions)” in prior work [37,38]. In this setting, defining the regret with respect to the best action in the whole set of actions in hindsight remains difficult, since that action might sometimes be unavailable. Hence, it is natural to define the regret with respect to the best ranking of all actions in the hindsight according to their losses or rewards, and at each round one chooses among the available actions by selecting the one which ranks the highest. Ref. [38] introduced this notion of regret and studied both the full-information (best action) and partial-information (multi-armed bandit) settings with stochastic and adversarial rewards and adversarial action availability. They pointed out that the EXP4 algorithm [37] attains the optimal regret in the adversarial rewards case but has a runtime exponential in the number of all actions. Ref. [39] considered full and partial information with stochastic action availability and proposed an algorithm that runs in polynomial time. In what follows, we materialize our implementation by resorting to “sleeping experts”, i.e., a special set of available actions that adapts to the setting of principal curves.

Let

σ

denote an ordering of

| F_{p} |

actions, and

A_{t}

a subset of the available actions at round t. We let

σ (A_{t})

denote the highest ranked action in

A_{t}

. In addition, for any action

f \in F_{p}

we define the reward

r_{f, t}

of

f

at round

t, t \geq 0

by

r_{f, t} = c_{0} - Δ (f, x_{t}) .

It is clear that

r_{f, t} \in (0, c_{0})

. The convention from losses to gains is done in order to facilitate the subsequent performance analysis. The reward of an ordering

σ

is the cumulative reward of the selected action at each time:

\sum_{t = 1}^{T} r_{σ (A_{t}), t},

and the reward of the best ordering is

{max}_{σ} \sum_{t = 0}^{T} r_{σ (A_{t}), t}

(respectively,

E [{max}_{σ} \sum_{t = 1}^{T} r_{σ (A_{t}), t}]

when

A_{t}

is stochastic).

Our procedure starts with a partition step which aims at identifying the “relevant” neighborhood of an observation

x \in R^{d}

with respect to a given polygonal line, and then proceeds with the definition of the neighborhood of an action

f

. We then provide the full implementation and prove a regret bound.

Partition. For any polygonal line

f

with k segments, we denote by

\overset{⇀}{V} = (v_{1}, \dots, v_{k + 1})

its vertices and by

s_{i}, i = 1, \dots, k

the line segments connecting

v_{i}

and

v_{i + 1}

. In the sequel, we use

f (\overset{⇀}{V})

to represent the polygonal line formed by connecting consecutive vertices in

\overset{⇀}{V}

if no confusion arises. Let

V_{i}, i = 1, \dots, k + 1

and

S_{i}, i = 1, \dots, k

be the Voronoi partitions of

R^{d}

with respect to

f

, i.e., regions consisting of all points closer to vertex

v_{i}

or segment

s_{i}

. Figure 5 shows an example of Voronoi partition with respect to

f

with three segments.

Neighborhood. For any

x \in R^{d}

, we define the neighborhood

N (x)

with respect to

f

as the union of all Voronoi partitions whose closure intersects with two vertices connecting the projection

f (s_{f} (x))

of x to

f

. For example, for the point x in Figure 5, its neighborhood

N (x)

is the union of

S_{2}, V_{3}, S_{3}

and

V_{4}

. In addition, let

N_{t} (x) = \{x_{s} \in N (x), s = 1, \dots, t .\}

be the set of observations

x_{1 : t}

belonging to

N (x)

and

{\bar{N}}_{t} (x)

be its average. Let

D (M) = {sup}_{x, y \in M} | | x - {y | |}_{2}

denote the diameter of set

M \subset R^{d}

. We finally define the local grid

Q_{δ, t} (x)

of

x \in R^{d}

at time t as

Q_{δ, t} (x) = B ({\bar{N}}_{t} (x), D (N_{t} (x)) \cap Q_{δ} .

We can finally proceed to the definition of the neighborhood

U ({\hat{f}}_{t})

of

{\hat{f}}_{t}

. Assume

{\hat{f}}_{t}

has

k_{t} + 1

vertices

\overset{⇀}{V} = (\underset{(i)}{\underset{⏟}{v_{1 : i_{t} - 1}}}, \underset{(i i)}{\underset{⏟}{v_{i_{t} : j_{t} - 1}}}, \underset{(i i i)}{\underset{⏟}{v_{j_{t} : k_{t} + 1}}})

, where vertices of

(i i)

belong to

Q_{δ, t} (x_{t})

while those of

(i)

and

(i i i)

do not. The neighborhood

U ({\hat{f}}_{t})

consists of

f

sharing vertices

(i)

and

(i i i)

with

{\hat{f}}_{t}

, but can be equipped with different vertices

(i i)

in

Q_{δ, t} (x_{t})

; i.e.,

U ({\hat{f}}_{t}) = \{f (\overset{⇀}{V}), \overset{⇀}{V} = (v_{1 : i_{t} - 1}, v_{1 : m}, v_{j_{t} : k_{t} + 1})\},

where

v_{1 : m} \in Q_{δ, t} (x_{t})

and m is given by

m = \{\begin{matrix} j_{t} - i_{t} - 1 & reduce segments by 1 unit, \\ j_{t} - i_{t} & same number of segments, \\ j_{t} - i_{t} + 1 & increase segments by 1 unit . \end{matrix}

In Algorithm 3, we initiate the principal curve

{\hat{f}}_{1}

as the first component line segment whose vertices are the two farthest projections of data

x_{1 : t_{0}}

(

t_{0}

can be set to 20 in practice) on the first component line. The reward of

f

at round t in this setting is therefore

r_{f, t} = c_{0} - Δ (f, x_{t_{0} + t})

. Algorithm 3 has an exploration phase (when

I_{t} = 1

) and an exploitation phase (

I_{t} = 0

). In the exploration phase, it is allowed to observe rewards of all actions and to choose an optimal perturbed action from the set

F_{p}

of all actions. In the exploitation phase, only rewards of a part of actions can be accessed and rewards of others are estimated by a constant, and we update our action from the neighborhood

U ({\hat{f}}_{t - 1})

of the previous action

{\hat{f}}_{t - 1}

. This local update (or search) greatly reduces computation complexity since

| U ({\hat{f}}_{t - 1}) | ≪ |F_{p}|

when p is large. In addition, this local search will be enough to account for the case when

x_{t}

locates in

U ({\hat{f}}_{t - 1})

. The parameter

β

needs to be carefully calibrated since it should not be too large to ensure that the condition

c o n d (t)

is non-empty; otherwise, all rewards are estimated by the same constant and thus lead to the same descending ordering of tuples for both

(\sum_{s = 1}^{t - 1} {\hat{r}}_{f, s}, f \in F_{p})

and

(\sum_{s = 1}^{t} {\hat{r}}_{f, s}, f \in F_{p})

. Therefore, we may face the risk of having

{\hat{f}}_{t + 1}

in the neighborhood of

{\hat{f}}_{t}

even if we are in the exploration phase at time

t + 1

. Conversely, very small

β

could result in large bias for the estimation

\frac{r_{f, t}}{P ({\hat{f}}_{t} = f | H_{t})}

of

r_{f, t}

. Note that the exploitation phase is close yet different to the label efficient prediction ([40], Remark 1.1) since we allow an action at time t to be different from the previous one. Ref. [41] proposed the geometric resampling method to estimate the conditional probability

P ({\hat{f}}_{t} = f | H_{t})

since this quantity often does not have an explicit form. However, due to the simple exponential distribution of

z_{f}

chosen in our case, an explicit form of

P ({\hat{f}}_{t} = f | H_{t})

is straightforward.

Algorithm 3 A locally greedy algorithm for sequentially learning principal curves.

1:: Input parameters: $p > 0$ , $R > 0$ , $L > 0$ , $ϵ > 0$ , $α > 0$ , $1 > β > 0$ and any penalty function h
2:: Initialization: Given ${(x_{t})}_{1 : t_{0}}$ , obtain ${\hat{f}}_{1}$ as the first principal component
3:: For $t = 2, \dots, T$
4:: Draw $I_{t} \sim B e r n o u l l i (ϵ)$ and $z_{f} \sim π$ .
5:: Let

${\hat{σ}}_{t} = sort (f, \sum_{s = 1}^{t - 1} {\hat{r}}_{f, s} - \frac{1}{η_{t - 1}} h (f) + \frac{1}{η_{t - 1}} z_{f}),$

i.e., sorting all $f \in F_{p}$ in descending order according to their perturbed cumulative reward till $t - 1$ .
6:: If $I_{t} = 1$ , set $A_{t} = F_{p}$ and ${\hat{f}}_{t} = {\hat{σ}}^{t} (A_{t})$ and observe $r_{{\hat{f}}_{t}, t}$
7:: ${\hat{r}}_{f, t} = r_{f, t} for f \in F_{p} .$
8:: If $I_{t} = 0$ , set $A_{t} = U ({\hat{f}}_{t - 1})$ , ${\hat{f}}_{t} = {\hat{σ}}^{t} (A_{t})$ and observe $r_{{\hat{f}}_{t}, t}$
9:: ${\hat{r}}_{f, t} = \{\begin{matrix} \frac{r_{f, t}}{P ({\hat{f}}_{t} = f | H_{t})} & i f f \in U ({\hat{f}}_{t - 1}) \cap c o n d (t) and {\hat{f}}_{t} = f, α & otherwise, \end{matrix}$

where $H_{t}$ denotes all the randomness before time t and $cond (t) = \{f \in F_{p} : P ({\hat{f}}_{t} = f | H_{t}) > β\}$ . In particular, when $t = 1$ , we set ${\hat{r}}_{f, 1} = r_{f, 1}$ for all $f \in F_{p}$ , $U ({\hat{f}}_{0}) = \emptyset$ and ${\hat{r}}_{{\hat{σ}}^{1} (U ({\hat{f}}_{0})), 1} \equiv 0$ .
10:: End for

Theorem 3.

Assume that

p > 6

,

T \geq 2 | F_{p} |^{2}

and let

β = {|F_{p}|}^{- \frac{1}{2}} T^{- \frac{1}{4}}

,

α = \frac{c_{0}}{β}

,

{\hat{c}}_{0} = \frac{2 c_{0}}{β}

,

ϵ = 1 - {|F_{p}|}^{\frac{1}{2} - \frac{3}{p}} T^{- \frac{1}{4}}

and

η_{1} = η_{2} = \dots = η_{T} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{\sqrt{T (e - 1)} {\hat{c}}_{0}} .

Then the procedure described in Algorithm 3 satisfies the regret bound

\sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] \leq inf_{f \in F_{p}} E [\sum_{t = 1}^{T} Δ (f, x_{t})] + O (T^{\frac{3}{4}}) .

The proof of Theorem 3 is presented in Section 6. The regret is upper bounded by a term of order

({|F_{p}|}^{\frac{1}{2}} T^{\frac{3}{4}})

, sublinear in T. The term

(1 - ϵ) c_{0} T = c_{0} {|F_{p}|}^{\frac{1}{2}} T^{\frac{3}{4}}

is the price to pay for the local search (with a proportion

1 - ϵ

) of polygonal line

{\hat{f}}_{t}

in the neighborhood of the previous

{\hat{f}}_{t - 1}

. If

ϵ = 1

, we would have that

{\hat{c}}_{0} = c_{0}

, and the last two terms in the first inequality of Theorem 3 would vanish; hence, the upper bound reduces to Theorem 2. In addition, our algorithm achieves an order that is smaller (from the perspective of both the number

|F_{p}|

of all actions and the total rounds T) than [39] since at each time, the availability of actions for our algorithm can be either the whole action set or a neighborhood of the previous action while [39] consider at each time only partial and independent stochastic available set of actions generated from a predefined distribution.

5. Numerical Experiments

We illustrate the performance of Algorithm 3 on synthetic and real-life data. Our implementation (hereafter denoted by slpc—Sequential Learning of Principal Curves) is conducted with the R language and thus our most natural competitors are the R package princurve, which is the algorithm from [10], and incremental, which is the algorithm from SCMS [23]. We let

p = 50

,

R = {max}_{t = 1, \dots, T} {| | x | |}_{2} / \sqrt{d}

,

L = 0.1 p \sqrt{d} R

. The spacing

δ

of the lattice is adjusted with respect to data scale.

Synthetic data We generate a dataset

\{x_{t} \in R^{2}, t = 1, \dots, 500\}

uniformly along the curve

y = 0.05 \times {(x - 5)}^{3}

,

x \in [0, 10]

. Table 1 shows the regret (first row) for

the ground truth (sum of squared distances of all points to the true curve),
princurve and incremental SCMS (sum of squared distances between observation $x_{t + 1}$ and fitted princurve on observations $x_{1 : t}$ ),
slpc (regret being equal to $\sum_{t = 0}^{T - 1} E [Δ ({\hat{f}}_{t + 1}, x_{t + 1})]$ in both cases).

The mean computation time with different values for the time horizons T are also reported.

Table 1 demonstrates the advantages of our method slpc, as it achieved the optimal tradeoff between performance (in terms of regret) and runtime. Although princurve outperformed the other two algorithms in terms of computation time, it yielded the largest regret, since it outputs a curve which does not pass in “the middle of data” but rather bends towards the curvature of the data cloud, as shown in Figure 6 where the predicted principal curves

{\hat{f}}_{t + 1}

for princurve, incremental SCMS and slpc are presented. incremental SCMS and slpc both yielded satisfactory results, although the mean computation time of splc was significantly smaller than that of incremental SCMS (the reason being that eigenvectors of the Hessian of PDF need to be computed in incremental SCMS). Figure 7 showed, respectively, the estimation of the regret of slpc and its per-round value (i.e., the cumulative loss divided by the number of rounds) both with respect to the round t. The jumps in the per-round curve occurred at the beginning, due to the initialization from a first principal component and to the collection of new data. When data accumulates, the vanishing pattern of the per-round curve illustrates that the regret is sublinear in t, which matches our aforementioned theoretical results.

In addition, to better illustrate the way slpc works between two epochs, Figure 8 focuses on the impact of collecting a new data point on the principal curve. We see that only a local vertex is impacted, whereas the rest of the principal curve remains unaltered. This cutdown in algorithmic complexity is one the key assets of slpc.

Synthetic data in high dimension. We also apply our algorithm on a dataset

{x_{t} \in R^{6},

t = 1, 2, \dots, 200}

in higher dimension. It is generated uniformly along a parametric curve whose coordinates are

(\begin{matrix} 0.5 t cos (t) \\ 0.5 t sin (t) \\ 0.5 t \\ - t \\ \sqrt{t} \\ 2 ln (t + 1) \end{matrix})

where t takes 100 equidistant values in

[0, 2 π]

. To the best of our knowledge, [10,16,18] only tested their algorithm on 2-dimensional data. This example aims at illustrating that our algorithm also works on higher dimensional data. Table 2 shows the regret for the ground truth, princurve and slpc.

In addition, Figure 9 shows the behaviour of slpc (green) on each dimension.

Seismic data. Seismic data spanning long periods of time are essential for a thorough understanding of earthquakes. The “Centennial Earthquake Catalog” [42] aims at providing a realistic picture of the seismicity distribution on Earth. It consists in a global catalog of locations and magnitudes of instrumentally recorded earthquakes from 1900 to 2008. We focus on a particularly representative seismic active zone (a lithospheric border close to Australia) whose longitude is between E

130^{\circ}

to E

180^{\circ}

and latitude between S

70^{\circ}

to N

30^{\circ}

, with

T = 218

seismic recordings. As shown in Figure 10, slpc recovers nicely the tectonic plate boundary, but both princurve and incremental SCMS with well-calibrated bandwidth fail to do so.

Lastly, since no ground truth is available, we used the

R^{2}

coefficient to assess the performance (residuals are replaced by the squared distance between data points and their projections onto the principal curve). The average over 10 trials was 0.990.

Back to Seismic Data.Figure 11 was taken from the USGS website (https://earthquake.usgs.gov/data/centennial/) and gives the global locations of earthquakes for the period 1900–1999. The seismic data (latitude, longitude, magnitude of earthquakes, etc.) used in the present paper may be downloaded from this website.

Daily Commute Data. The identification of segments of personal daily commuting trajectories can help taxi or bus companies to optimize their fleets and increase frequencies on segments with high commuting activity. Sequential principal curves appear to be an ideal tool to address this learning problem: we tested our algorithm on trajectory data from the University of Illinois at Chicago (https://www.cs.uic.edu/~boxu/mp2p/gps_data.html). The data were obtained from the GPS reading systems carried by two of the laboratory members during their daily commute for 6 months in the Cook county and the Dupage county of Illinois. Figure 12 presents the learning curves yielded by princurve and slpc on geolocalization data for the first person, on May 30. A particularly remarkable asset of slpc is that abrupt curvature in the data sequence was perfectly captured, whereas princurve does not enjoy the same flexibility. Again, we used the

R^{2}

coefficient to assess the performance (where residuals are replaced by the squared distances between data points and their projections onto the principal curve). The average over 10 trials was 0.998.

6. Proofs

This section contains the proof of Theorem 2 (note that Theorem 1 is a straightforward consequence, with

η_{t} = η

,

t = 0, \dots, T

) and the proof of Theorem 3 (which involves intermediary lemmas). Let us first define for each

t = 0, \dots, T

the following forecaster sequence

{({\hat{f}}_{t}^{⋆})}_{t}

\begin{matrix} {\hat{f}}_{0}^{⋆} = \underset{f \in F_{p}}{arg inf} \{Δ_{f, 0}\} = \underset{f \in F_{p}}{arg inf} \{\frac{1}{η_{0}} h (f) - \frac{1}{η_{0}} z_{f}\}, \\ {\hat{f}}_{t}^{⋆} = \underset{f \in F_{p}}{arg inf} \{\sum_{s = 0}^{t} Δ_{f, s}\} = \underset{f \in F_{p}}{arg inf} \{\sum_{s = 1}^{t} Δ (f, x_{s}) + \frac{1}{η_{t - 1}} h (f) - \frac{1}{η_{t - 1}} z_{f}\}, t \geq 1 . \end{matrix}

Note that

{\hat{f}}_{t}^{⋆}

is an “illegal” forecaster since it peeks into the future. In addition, denote by

f^{⋆} = \underset{f \in F_{p}}{arg inf} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\}

the polygonal line in

F_{p}

which minimizes the cumulative loss in the first T rounds plus a penalty term.

f^{⋆}

is deterministic, and

{\hat{f}}_{t}^{⋆}

is a random quantity (since it depends on

z_{f}

,

f \in F_{p}

drawn from

π

). If several

f

attain the infimum, we chose

f_{T}^{⋆}

as the one having the smallest complexity. We now enunciate the first (out of three) intermediary technical result.

Lemma 1.

For any sequence

x_{1}, \dots, x_{T}

in

B (0, \sqrt{d} R)

,

\sum_{t = 0}^{T} Δ_{{\hat{f}}_{t}^{⋆}, t} \leq \sum_{t = 0}^{T} Δ_{{\hat{f}}_{T}^{⋆}, t}, π - almost surely .

(5)

Proof.

Proof by induction on T. Clearly (5) holds for

T = 0

. Assume that (5) holds for

T - 1

:

\sum_{t = 0}^{T - 1} Δ_{{\hat{f}}_{t}^{⋆}, t} \leq \sum_{t = 0}^{T - 1} Δ_{{\hat{f}}_{T - 1}^{⋆}, t} .

Adding

Δ_{{\hat{f}}_{T}^{⋆}, T}

to both sides of the above inequality concludes the proof. □

By (5) and the definition of

{\hat{f}}_{T}^{⋆}

, for

k \geq 1

, we have

π

-almost surely that

\begin{matrix} \sum_{t = 1}^{T} Δ ({\hat{f}}_{t}^{⋆}, x_{t}) & \leq \sum_{t = 1}^{T} Δ ({\hat{f}}_{T}^{⋆}, x_{t}) + \frac{1}{η_{T}} h ({\hat{f}}_{T}^{⋆}) - \frac{1}{η_{T}} Z_{{\hat{f}}_{T}^{⋆}} + \sum_{t = 0}^{T} (\frac{1}{η_{t - 1}} - \frac{1}{η_{t}}) (h ({\hat{f}}_{t}^{⋆}) - Z_{{\hat{f}}_{t}^{⋆}}) \\ \leq \sum_{t = 1}^{T} Δ (f^{⋆}, x_{t}) + \frac{1}{η_{T}} h (f^{⋆}) - \frac{1}{η_{T}} Z_{f^{⋆}} + \sum_{t = 0}^{T} (\frac{1}{η_{t - 1}} - \frac{1}{η_{t}}) (h ({\hat{f}}_{t}^{⋆}) - Z_{{\hat{f}}_{t}^{⋆}}) \\ = inf_{f \in F_{p}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\} - \frac{1}{η_{T}} Z_{f^{⋆}} + \sum_{t = 0}^{T} (\frac{1}{η_{t - 1}} - \frac{1}{η_{t}}) (h ({\hat{f}}_{t}^{⋆}) - Z_{{\hat{f}}_{t}^{⋆}}), \end{matrix}

where

1 / η_{- 1} = 0

by convention. The second and third inequality is due to respectively the definition of

{\hat{f}}_{T}^{⋆}

and

f_{T}^{⋆}

. Hence

\begin{matrix} E [\sum_{t = 1}^{T} Δ ({\hat{f}}_{t}^{⋆}, x_{t})] & \leq inf_{f \in F_{p}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\} - \frac{1}{η_{T}} E [Z_{f_{T}^{⋆}}] \\ + \sum_{t = 0}^{T} E [(\frac{1}{η_{t}} - \frac{1}{η_{t - 1}}) (- h ({\hat{f}}_{t}^{⋆}) + Z_{{\hat{f}}_{t}^{⋆}})] \\ \leq inf_{f \in F_{p}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\} + \sum_{t = 1}^{T} (\frac{1}{η_{t}} - \frac{1}{η_{t - 1}}) E [sup_{f \in F_{p}} (- h (f) + Z_{f})] \\ = inf_{f \in F_{p}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\} + \frac{1}{η_{T}} E [sup_{f \in F_{p}} (- h (f) + Z_{f})], \end{matrix}

where the second inequality is due to

E [Z_{f_{T}^{⋆}}] = 0

and

(\frac{1}{η_{t}} - \frac{1}{η_{t - 1}}) > 0

for

t = 0, 1, \dots, T

since

η_{t}

is decreasing in t in Theorem 2. In addition, for

y \geq 0

, one has

P (- h (f) + Z_{f} > y) = e^{- h (f) - y} .

Hence, for any

y \geq 0

P (sup_{f \in F_{p}} (- h (f) + Z_{f}) > y) \leq \sum_{f \in F_{p}} P (Z_{f} \geq h (f) + y) = \sum_{f \in F_{p}} e^{- h (f)} e^{- y} = u e^{- y},

where

u = \sum_{f \in F_{p}} e^{- h (f)}

. Therefore, we have

\begin{matrix} E [sup_{f \in F_{p}} (- h (f) + Z_{f}) - ln u] & \leq E [max (0, sup_{f \in F_{p}} (- h (f) + Z_{f} - ln u))] \\ \leq \int_{0}^{\infty} P (max (0, sup_{f \in F_{p}} (- h (f) + Z_{f} - ln u)) > y) d y \\ \leq \int_{0}^{\infty} P (sup_{f \in F_{p}} (- h (f) + Z_{f}) > y + ln u) d y \\ \leq \int_{0}^{\infty} u e^{- (y + ln u)} d y = 1 . \end{matrix}

We thus obtain

E [\sum_{t = 1}^{T} Δ ({\hat{f}}_{t}^{⋆}, x_{t})] \leq inf_{f \in F_{p}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\} + \frac{1}{η_{T}} (1 + ln \sum_{f \in F_{p}} e^{- h (f)}) .

(6)

Next, we control the regret of Algorithm 2.

Lemma 2.

Assume that

z_{f}

is sampled from the symmetric exponential distribution in

R

, i.e.,

π (z) = e^{- z} 1_{{z > 0}}

. Assume that

{sup}_{t = 1, \dots, T} η_{t - 1} \leq \frac{1}{d {(2 R + δ)}^{2}}

, and define

c_{0} = d {(2 R + δ)}^{2}

. Then for any sequence

(x_{t}) \in B (0, \sqrt{d} R)

,

t = 1, \dots, T

,

\sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] \leq \sum_{t = 1}^{T} (1 + η_{t - 1} c_{0} (e - 1)) E [Δ ({\hat{f}}_{t}^{⋆}, x_{t})] .

(7)

Proof.

Let us denote by

F_{t} (Z_{f}) = Δ ({\hat{f}}_{t}, x_{t}) = Δ (\underset{f \in F}{arg inf} (\sum_{s = 1}^{t - 1} Δ (f, x_{s}) + \frac{1}{η_{t - 1}} h (f) - \frac{1}{η_{t - 1}} Z_{f}), x_{t})

the instantaneous loss suffered by the polygonal line

{\hat{f}}_{t}

when

x_{t}

is obtained. We have

\begin{matrix} E [Δ ({\hat{f}}_{t}^{⋆}, x_{t})] & = \int F_{t} (z - η_{t - 1} Δ (f, x_{t})) π (z) d z \\ = \int F_{t} (z) π (z + η_{t - 1} Δ (f, x_{t})) d z \\ = \int F_{t} (z) e^{- (z + η_{t - 1} Δ (f, x_{t}))} d z \\ \geq e^{- η_{t - 1} d {(2 R + δ)}^{2}} \int F_{t} (z) e^{- z} d z \\ = e^{- η_{t - 1} d {(2 R + δ)}^{2}} E [Δ ({\hat{f}}_{t}, x_{t})], \end{matrix}

where the inequality is due to the fact that

Δ (f, x) \leq d {(2 R + δ)}^{2}

holds uniformly for any

f \in F_{p}

and

x \in B (0, \sqrt{d} R)

. Finally, summing on t on both sides and using the elementary inequality

e^{x} \leq 1 + (e - 1) x

if

x \in (0, 1)

concludes the proof. □

Lemma 3.

For

k \in ⟦ 1, p ⟧

, we control the cardinality of set

\{f \in F_{p}, K (f) = k\}

as

\begin{matrix} ln |\{f \in F_{p}, K (f) = k\}| & \leq (ln (8 p e V_{d}) + 3 d^{\frac{3}{2}} - d) k + (\frac{ln 2}{δ \sqrt{d}} + \frac{d}{δ}) L + d ln (\frac{\sqrt{d} (2 R + δ)}{δ}) \\ \overset{Δ}{=} c_{1} k + c_{2} L + c_{3}, \end{matrix}

where

V_{d}

denotes the volume of the unit ball in

R^{d}

.

Proof.

First, let

N_{k, δ}

denote the set of polygonal lines with k segments and whose vertices are in

Q_{δ}

. Notice that

N_{k, δ}

is different from

{f \in F_{p}, K (f) = k}

and that

|{f \in F_{p}, K (f) = k}| \leq (\binom{p}{k}) |N_{k, δ}| .

Hence

\begin{matrix} ln |{f \in F_{p}, K (f) = k}| & \leq ln (\binom{p}{k}) + ln |N_{k, δ}| \\ \leq k ln \frac{p e}{k} + k (ln 8 V_{d} + 3 d^{\frac{3}{2}} - d) + (\frac{ln 2}{\sqrt{d} δ} + \frac{d}{δ}) L + d ln (\frac{\sqrt{d} (2 R + δ)}{δ}) \\ \leq k ln (p e) + k (ln 8 V_{d} + 3 d^{\frac{3}{2}} - d) + (\frac{ln 2}{\sqrt{d} δ} + \frac{d}{δ}) L + d ln (\frac{\sqrt{d} (2 R + δ)}{δ}), \end{matrix}

where the second inequality is a consequence to the elementary inequality

(\binom{p}{k}) \leq {(\frac{p e}{k})}^{k}

combined with Lemma 2 in [16]. □

We now have all the ingredients to prove Theorem 1 and Theorem 2.

First, combining (6) and (7) yields that

\begin{matrix} \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] & \leq inf_{f \in F_{p}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{1}{η_{T}} h (f)\} + \frac{1}{η_{T}} (\frac{1}{2} + ln \sum_{f \in F_{p}} e^{- h (f)}) \\ + c_{0} (e - 1) \sum_{t = 1}^{T} η_{t - 1} E [Δ ({\hat{f}}_{t}^{⋆}, x_{t})] \\ \leq inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{h (f)}{η_{T}}\}\} + \frac{1}{η_{T}} (\frac{1}{2} + ln \sum_{f \in F_{p}} e^{- h (f)}) \\ + c_{0} (e - 1) \sum_{t = 1}^{T} η_{t - 1} E [Δ ({\hat{f}}_{t}^{⋆}, x_{t})] . \end{matrix}

Assume that

η_{t} = η

,

t = 0, \dots, T

and

h (f) = c_{1} K (f) + c_{2} L + c_{3}

for

f \in F_{p}

, then

(\frac{1}{2} + \sum_{f \in F_{p}} e^{- h (f)}) \leq 0

and moreover

\begin{matrix} \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] & \leq S_{T, h, η} + \frac{1}{η} (\frac{1}{2} + ln \sum_{f \in F_{p}} e^{- h (f)}) + c_{0} (e - 1) η \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}^{⋆}, x_{t})] \\ \leq S_{T, h, η} + c_{0} (e - 1) η S_{T, h, η} \\ \leq S_{T, h, η} + η c_{0} (e - 1) inf_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t}) + c_{0} (e - 1) (c_{1} p + c_{2} L + c_{3}), \end{matrix}

where

S_{T, h, η} = inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{h (f)}{η}\}\}

and the second inequality is obtained with Lemma 1. By setting

η = \sqrt{\frac{c_{1} p + c_{2} L + c_{3}}{c_{0} (e - 1) {inf}_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t})}}

we obtain

\begin{matrix} \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] \leq inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \sqrt{c_{0} (e - 1) r_{T, k, L}}\}\} \\ + \sqrt{c_{0} (e - 1) L_{T, p, L}} + c_{0} (e - 1) c_{1} p + c_{2} L + c_{3}, \end{matrix}

where

r_{T, k, L} = {inf}_{f \in F_{p}} \sum_{t = 1}^{T} Δ (f, x_{t}) (c_{1} k + c_{2} L + c_{3})

. This proves Theorem 1.

Finally, assume that

η_{0} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{c_{0} \sqrt{(e - 1)}} and η_{t} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{c_{0} \sqrt{(e - 1) t}}, t = 1, \dots, T .

Since

E [Δ ({\hat{f}}_{t}^{⋆}, x_{t})] \leq c_{0}

for any

t = 1, \dots, T

, we have

\begin{matrix} \sum_{t = 1}^{T} E [Δ ({\hat{f}}_{t}, x_{t})] & \leq inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + \frac{h (f)}{η_{T}}\}\} + \frac{1}{η_{T}} (1 + ln \sum_{f \in F_{p}} e^{- h (f)}) \\ + c_{0}^{2} (e - 1) \sum_{t = 1}^{T} η_{t - 1} \\ \leq inf_{k \in ⟦ 1, p ⟧} \{inf_{\begin{matrix} f \in F_{p} \\ K (f) = k \end{matrix}} \{\sum_{t = 1}^{T} Δ (f, x_{t}) + c_{0} \sqrt{(e - 1) T (c_{0} k + c_{2} L + c_{3})}\}\} \\ + 2 c_{0} \sqrt{(e - 1) T (c_{0} p + c_{2} L + c_{3})}, \end{matrix}

which concludes the proof of Theorem 2.

Lemma 4.

Using Algorithm 3, if

0 < ϵ \leq 1

,

0 < β < 1

,

α \geq \frac{(1 - β) c_{0}}{β}

and

|U ({\hat{f}}_{t - 1})| \geq 2

for all

t \geq 2

, where

|U ({\hat{f}}_{t - 1})|

is the cardinality of

U ({\hat{f}}_{t - 1})

, then we have

\sum_{t = 1}^{T} E [r_{{\hat{f}}_{t}, t}] \geq \sum_{t = 1}^{T} E [{\hat{r}}_{{\hat{σ}}^{t} (A_{t}), t}] - 2 (1 - ϵ) α β \sum_{t = 1}^{T} |U ({\hat{f}}_{t - 1})| .

Proof.

First notice that

A_{t} = U ({\hat{f}}_{t - 1})

if

I_{t} = 0

, and that for

t \geq 2

\begin{matrix} E [r_{{\hat{f}}_{t}, t} | H_{t}, I_{t} = 0] = & E [r_{{\hat{σ}}^{t} (A_{t}), t} | H_{t}, I_{t} = 0] \\ = & \sum_{f \in A_{t} \cap c o n d (t)} r_{f, t} P ({\hat{σ}}^{t} (A_{t}) = f | H_{t}) + \sum_{f \in A_{t} \cap {c o n d (t)}^{c}} r_{f, t} P ({\hat{σ}}^{t} (A_{t}) = f | H_{t}) \\ \geq & \sum_{f \in A_{t} \cap c o n d (t)} r_{f, t} + \sum_{f \in A_{t} \cap {c o n d (t)}^{c}} α P ({\hat{σ}}^{t} (A_{t}) = f | H_{t}) \\ - (1 - β) \sum_{f \in A_{t} \cap c o n d (t)} r_{f, t} - \sum_{f \in A_{t} \cap {c o n d (t)}^{c}} (α - r_{f, t}) P ({\hat{σ}}^{t} (A_{t}) = f | H_{t}) \\ = & E [{\hat{r}}_{{\hat{σ}}^{t} (A_{t}), t} | H_{t}, I_{t} = 0] - (1 - β) \sum_{f \in A_{t} \cap c o n d (t)} r_{f, t} \\ - \sum_{f \in A_{t} \cap {c o n d (t)}^{c}} (α - r_{f, t}) P ({\hat{σ}}^{t} (A_{t}) = f | H_{t}) \\ \geq & E [{\hat{r}}_{{\hat{σ}}^{t} (A_{t}), t} | H_{t}, I_{t} = 0] - (1 - β) c_{0} |A_{t}| - α β |A_{t}| \\ \geq & E [{\hat{r}}_{{\hat{σ}}^{t} (A_{t}), t} | H_{t}, I_{t} = 0] - 2 α β |A_{t}|, \end{matrix}

where

c o n d {(t)}^{c}

denotes the complement of set

c o n d (t)

. The first inequality above is due to the assumption that for all

f \in A_{t} \cap c o n d (t)

, we have

P ({\hat{σ}}^{t} (A_{t}) = f | H_{t}) \geq β

. For

t = 1

, the above inequality is trivial since

{\hat{r}}_{{\hat{σ}}^{1} (U ({\hat{f}}_{0})), 1} \equiv 0

by its definition. Hence, for

t \geq 1

, one has

\begin{matrix} E [r_{{\hat{f}}_{t}, t} | H_{t}] & = ϵ E [r_{{\hat{σ}}^{t} (F_{p}), t} | H_{t}, I_{t} = 1] + (1 - ϵ) E [r_{{\hat{σ}}^{t} (A_{t}), t} | H_{t}, I_{t} = 0] \\ \geq E [{\hat{r}}_{{\hat{f}}_{t}, t} | H_{t}] - 2 α β |A_{t}| . \end{matrix}

(8)

Summing on both sides of inequality (8) over t terminates the proof of Lemma 4. □

Lemma 5.

Let

{\hat{c}}_{0} = \frac{c_{0}}{β} + α

. If

0 < η_{1} = η_{2} = \dots = η_{T} = η < \frac{1}{{\hat{c}}_{0}}

, then we have

\begin{matrix} E [max_{\hat{σ}} \{\sum_{t = 1}^{T} {\hat{r}}_{\hat{σ} (A_{t}), t} - \frac{1}{η} h (\hat{σ} (A_{t}))\}] - \sum_{t = 1}^{T} E [{\hat{r}}_{{\hat{σ}}^{t} (A_{t}), t}] \leq \\ {\hat{c}}_{0}^{2} (e - 1) η T + {\hat{c}}_{0} (e - 1) (c_{1} p + c_{2} L + c_{3}) . \end{matrix}

Proof.

By the definition of

{\hat{r}}_{f, t}

in Algorithm 3, for any

f \in F_{p}

and

t \geq 1

, we have

{\hat{r}}_{f, t} \leq max \{\frac{r_{f, t}}{P ({\hat{f}}_{t} = f | H_{t})}, α, r_{f, t}\} \leq max \{\frac{c_{0}}{β}, α\} \leq {\hat{c}}_{0},

where in the second inequality we use that

r_{f, t} \leq c_{0}

for all

f

and t, and that

P ({\hat{f}}_{t} = f | H_{t}) \geq β

when

f \in U ({\hat{f}}_{t - 1}) \cap c o n d (t)

. The rest of the proof is similar to those of Lemmas 1 and 2. In fact, if we define by

\hat{Δ} (f, x_{t}) = {\hat{c}}_{0} - {\hat{r}}_{f, t}

, then one can easily observe the following relation when

I_{t} = 1

(similar relation in the case that

I_{t}

= 0)

\begin{matrix} {\hat{f}}_{t} = {\hat{σ}}^{t} (F_{p}) & = \underset{f \in F_{p}}{arg max} \{\sum_{s = 1}^{t - 1} {\hat{r}}_{f, s} + \frac{1}{η} (z_{f} - h (f))\} \\ = \underset{f \in F_{p}}{arg min} \{\sum_{s = 1}^{t - 1} \hat{Δ} (f, x_{s}) + \frac{1}{η} (h (f) - z_{f})\} . \end{matrix}

Then applying Lemmas 1 and 2 on this newly defined sequence

\hat{Δ} ({\hat{f}}_{t}, x_{t}), t = 1, \dots T

leads to the result of Lemma 5. □

The proof of the upcoming Lemma 6 requires the following submartingale inequality: let

Y_{0}, \dots Y_{T}

be a sequence of random variable adapted to random events

H_{0}, \dots, H_{T}

such that for

1 \leq t \leq T

, the following three conditions hold:

E [Y_{t} | H_{t}] \leq 0, Var (Y_{t} | H_{t}) \leq a^{2}, Y_{t} - E [Y_{t} | H_{t}] \leq b .

Then for any

λ > 0

,

P (\sum_{t = 1}^{T} Y_{t} > Y_{0} + λ) \leq exp (- \frac{λ^{2}}{2 T (a^{2} + b^{2})}) .

The proof can be found in Chung and Lu [43] (Theorem 7.3).

Lemma 6.

Assume that

0 < β < \frac{1}{|F_{p}|}, α \geq \frac{c_{0}}{β}

and

η > 0

, then we have

\begin{matrix} E [max_{σ} \{\sum_{t = 1}^{T} r_{σ (A_{t}), t} - \frac{1}{η} h (σ (A_{t}))\}] - E [max_{\hat{σ}} \{\sum_{t = 1}^{T} {\hat{r}}_{\hat{σ} (A_{t}), t} - \frac{1}{η} h (\hat{σ} (A_{t}))\}] \\ \leq (1 - |F_{p}| β) \sqrt{2 T [\frac{c_{0}^{2}}{β} + α^{2} (1 - β) + {(c_{0} + 2 α)}^{2}] ln (\frac{1}{β})} + |F_{p}| β c_{0} T . \end{matrix}

Proof.

First, we have almost surely that

\begin{matrix} max_{σ} \{\sum_{t = 1}^{T} r_{σ (A_{t}), t} - \frac{1}{η} h (σ (A_{t}))\} - max_{\hat{σ}} \{\sum_{t = 1}^{T} {\hat{r}}_{\hat{σ} (A_{t}), t} - \frac{1}{η} h (\hat{σ} (A_{t}))\} & \leq max_{f \in F_{p}} \sum_{t = 1}^{T} (r_{f, t} - {\hat{r}}_{f, t}) . \end{matrix}

Denote by

Y_{f, t} = r_{f, t} - {\hat{r}}_{f, t}

. Since

E [{\hat{r}}_{f, t} | H_{t}] = \{\begin{matrix} r_{f, t} + (1 - ϵ) α (1 - P ({\hat{f}}_{t} = f | H_{t})) & i f f \in U ({\hat{f}}_{t - 1}) \cap c o n d (t), \\ ϵ r_{f, t} + (1 - ϵ) α & otherwise, \end{matrix}

and

α > c_{0} \geq r_{f, t}

uniformly for any

f

and t, we have uniformly that

E [Y_{t} | H_{t}] \leq 0

, satisfying the first condition.

For the second condition, if

f \in U ({\hat{f}}_{t - 1}) \cap c o n d (t)

, then

\begin{matrix} Var (Y_{t} | H_{t}) = & E [{\hat{r}}_{f, t}^{2} | H_{t}] - {(E [{\hat{r}}_{f, t} | H_{t}])}^{2} \\ \leq & ϵ r_{f, t}^{2} + (1 - ϵ) [\frac{r_{f, t}^{2}}{P ({\hat{f}}_{t} = f | H_{t})} + α (1 - P ({\hat{f}}_{t} = f | H_{t}))] \\ - {[r_{f, t} + (1 - ϵ) α (1 - P ({\hat{f}}_{t} = f | H_{t}))]}^{2} \\ \leq & \frac{r_{f, t}^{2}}{β} + α^{2} (1 - β) \leq \frac{c_{0}^{2}}{β} + α^{2} (1 - β) . \end{matrix}

Similarly, for

f \notin U ({\hat{f}}_{t - 1}) \cap c o n d (t)

, one can have

Var (Y_{t} | H_{t}) \leq α^{2}

. Moreover, for the third condition, since

E [Y_{f, t} | H_{t}] \geq - 2 α,

then

Y_{f, t} - E [Y_{f, t} | H_{t}] \leq r_{f, t} + 2 α \leq c_{0} + 2 α .

Setting

λ = \sqrt{2 T [\frac{c_{0}^{2}}{β} + α^{2} (1 - β) + {(c_{0} + 2 α)}^{2}] ln (\frac{1}{β})}

leads to

P (\sum_{t = 1}^{T} Y_{f, t} \geq λ) \leq β .

Hence the following inequality holds with probability

1 - | F_{p} | β

max_{f \in F_{p}} \sum_{t = 1}^{T} (r_{f, t} - {\hat{r}}_{f, t}) \leq \sqrt{2 T [\frac{c_{0}^{2}}{β} + α^{2} (1 - β) + {(c_{0} + 2 α)}^{2}] ln (\frac{1}{β})} .

Finally, noticing that

{max}_{f \in F_{p}} \sum_{t = 1}^{T} (r_{f, t} - {\hat{r}}_{f, t}) \leq c_{0} T

almost surely, we terminate the proof of Lemma 6. □

Proof of Theorem 3.

Assume that

p > 6

,

T \geq 2 | F_{p} |^{2}

and let

\begin{matrix} β = {|F_{p}|}^{- \frac{1}{2}} T^{- \frac{1}{4}}, α = \frac{c_{0}}{β}, {\hat{c}}_{0} = \frac{2 c_{0}}{β}, \\ η_{1} = η_{2} = \dots = η_{T} = \frac{\sqrt{c_{1} p + c_{2} L + c_{3}}}{\sqrt{T (e - 1)} {\hat{c}}_{0}}, ϵ = 1 - {|F_{p}|}^{\frac{1}{2} - \frac{3}{p}} T^{- \frac{1}{4}} . \end{matrix}

With those values, the assumptions of Lemmas 4, 5 and 6 are satisfied. Combining their results lead to the following

\begin{matrix} \sum_{t = 1}^{T} E [r_{{\hat{f}}_{t}, t}] \geq & E [max_{σ} \{\sum_{t = 1}^{T} r_{σ (A_{t}), t} - \frac{1}{η} h (σ (A_{t}))\}] - 2 α β (1 - ϵ) \sum_{t = 1}^{T} |U ({\hat{f}}_{t - 1})| \\ - {\hat{c}}_{0}^{2} (e - 1) η T - {\hat{c}}_{0} (e - 1) (c_{1} p + c_{2} L + c_{3}) \\ - (1 - |F_{p}| β) \sqrt{2 T [\frac{c_{0}^{2}}{β} + α^{2} (1 - β) + {(c_{0} + 2 α)}^{2}] ln (\frac{1}{β})} - |F_{p}| β c_{0} T \\ \geq & E [max_{σ} \{\sum_{t = 1}^{T} r_{σ (A_{t}), t} - \frac{1}{η} h (σ (A_{t}))\}] - (1 - ϵ) {|F_{p}|}^{\frac{3}{p}} c_{0} T \\ - {\hat{c}}_{0}^{2} (e - 1) η T - {\hat{c}}_{0} (e - 1) (c_{1} p + c_{2} L + c_{3}) \\ - (1 - |F_{p}| β) \sqrt{2 T [\frac{c_{0}^{2}}{β} + α^{2} (1 - β) + {(c_{0} + 2 α)}^{2}] ln (\frac{1}{β})} - |F_{p}| β c_{0} T \\ \geq & E [max_{σ} \{\sum_{t = 1}^{T} r_{σ (A_{t}), t} - \frac{1}{η} h (σ (A_{t}))\}] - O ({|F_{p}|}^{\frac{1}{2}} T^{\frac{3}{4}}), \end{matrix}

where the second inequality is due to the fact that the cardinality

|U ({\hat{f}}_{t - 1})|

is upper bounded by

{|F_{p}|}^{\frac{3}{p}}

for

t \geq 1

. In addition, using the definition of

r_{f, t}

that

r_{f, t} = c_{0} - Δ (f, x_{t})

terminates the proof of Theorem 3. □

Author Contributions

Conceptualization, L.L. and B.G.; Formal analysis, L.L. and B.G.; Methodology, B.G.; Project administration, B.G.; Software, L.L.; Supervision, B.G.; Writing—original draft, L.L. and B.G.; Writing—review and editing, L.L. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

LL is funded and supported by the Fundamental Research Funds for the Central Universities (Grand No. 30106210158) and National Natural Science Foundation of China (Grant No. 61877023), the Fundamental Research Funds for the Central Universities (CCNU19TD009). BG is supported in part by the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. BG acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23- 0015-02.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Pearson, K. On lines and planes of closest fit to systems of point in space. Philos. Mag. 1901, 2, 559–572. [Google Scholar] [CrossRef] [Green Version]
Spearman, C. “General Intelligence”, Objectively Determined and Measured. Am. J. Psychol. 1904, 15, 201–292. [Google Scholar] [CrossRef]
Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441. [Google Scholar] [CrossRef]
Friedsam, H.; Oren, W.A. The application of the principal curve analysis technique to smooth beamlines. In Proceedings of the 1st International Workshop on Accelerator Alignment, Stanford, CA, USA, 31 July–2 August 1989. [Google Scholar]
Brunsdon, C. Path estimation from GPS tracks. In Proceedings of the 9th International Conference on GeoComputation, Maynoorth, Ireland, 3–5 September 2007. [Google Scholar]
Reinhard, K.; Niranjan, M. Parametric Subspace Modeling Of Speech Transitions. Speech Commun. 1999, 27, 19–42. [Google Scholar] [CrossRef]
Kégl, B.; Krzyżak, A. Piecewise linear skeletonization using principal curves. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 59–74. [Google Scholar] [CrossRef]
Banfield, J.D.; Raftery, A.E. Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. J. Am. Stat. Assoc. 1992, 87, 7–16. [Google Scholar] [CrossRef]
Stanford, D.C.; Raftery, A.E. Finding curvilinear features in spatial point patterns: Principal curve clustering with noise. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 601–609. [Google Scholar] [CrossRef] [Green Version]
Hastie, T.; Stuetzle, W. Principal curves. J. Am. Stat. Assoc. 1989, 84, 502–516. [Google Scholar] [CrossRef]
Delicado, P. Another Look at Principal Curves and Surfaces. J. Multivar. Anal. 2001, 77, 84–116. [Google Scholar] [CrossRef] [Green Version]
Einbeck, J.; Tutz, G.; Evers, L. Local principal curves. Stat. Comput. 2005, 15, 301–313. [Google Scholar] [CrossRef] [Green Version]
Einbeck, J.; Tutz, G.; Evers, L. Data Compression and Regression through Local Principal Curves and Surfaces. Int. J. Neural Syst. 2010, 20, 177–192. [Google Scholar] [CrossRef] [Green Version]
Malo, J.; Gutiérrez, J. V1 non-linear properties emerge from local-to-global non-linear ICA. Netw. Comput. Neural Syst. 2006, 17, 85–102. [Google Scholar] [CrossRef]
Ozertem, U.; Erdogmus, D. Locally Defined Principal Curves and Surfaces. J. Mach. Learn. Res. 2011, 12, 1249–1286. [Google Scholar]
Kégl, B. Principal Curves: Learning, Design, and Applications. Ph.D. Thesis, Concordia University, Montreal, QC, Canada, 1999. [Google Scholar]
Kégl, B.; Krzyżak, A.; Linder, T.; Zeger, K. Learning and design of principal curves. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 281–297. [Google Scholar] [CrossRef] [Green Version]
Biau, G.; Fischer, A. Parameter selection for principal curves. IEEE Trans. Inf. Theory 2012, 58, 1924–1939. [Google Scholar] [CrossRef] [Green Version]
Barron, A.; Birgé, L.; Massart, P. Risk bounds for model selection via penalization. Probab. Theory Relat. Fields 1999, 113, 301–413. [Google Scholar] [CrossRef]
Birgé, L.; Massart, P. Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields 2007, 183, 33–73. [Google Scholar] [CrossRef]
Sandilya, S.; Kulkarni, S.R. Principal curves with bounded turn. IEEE Trans. Inf. Theory 2002, 48, 2789–2793. [Google Scholar] [CrossRef]
Cesa-Bianchi, N.; Lugosi, G. Prediction, Learning and Games; Cambridge University Press: New York, NY, USA, 2006. [Google Scholar]
Rudzicz, F.; Ghassabeh, Y.A. Incremental algorithm for finding principal curves. IET Signal Process. 2015, 9, 521–528. [Google Scholar]
Laparra, V.; Malo, J. Sequential Principal Curves Analysis. arXiv 2016, arXiv:1606.00856. [Google Scholar]
Laparra, V.; Jiménez, S.; Camps-Valls, G.; Malo, J. Nonlinearities and Adaptation of Color Vision from Sequential Principal Curves Analysis. Neural Comput. 2012, 24, 2751–2788. [Google Scholar] [CrossRef] [Green Version]
Laparra, V.; Malo, J. Visual Aftereffects and Sensory Nonlinearities from a single Statistical Framework. Front. Hum. Neurosci. 2015, 9. [Google Scholar] [CrossRef] [Green Version]
Laparra, V.; Jiménez, S.; Tuia, D.; Camps-Valls, G.; Malo, J. Principal Polynomial Analysis. Int. J. Neural Syst. 2014, 24, 1440007. [Google Scholar] [CrossRef] [Green Version]
Laparra, V.; Malo, J.; Camps-Valls, G. Dimensionality Reduction via Regression in Hyperspectral Imagery. IEEE J. Sel. Top. Signal Process. 2015, 9, 1026–1036. [Google Scholar] [CrossRef] [Green Version]
Shawe-Taylor, J.; Williamson, R.C. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on Computational Learning Theory, Nashville, TN, USA, 6–9 July 1997; pp. 2–9. [Google Scholar] [CrossRef]
McAllester, D.A. Some PAC-Bayesian Theorems. Mach. Learn. 1999, 37, 355–363. [Google Scholar] [CrossRef] [Green Version]
McAllester, D.A. PAC-Bayesian Model Averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; pp. 164–170. [Google Scholar]
Li, L.; Guedj, B.; Loustau, S. A quasi-Bayesian perspective to online clustering. Electron. J. Stat. 2018, 12, 3071–3113. [Google Scholar] [CrossRef]
Guedj, B. A Primer on PAC-Bayesian Learning. In Proceedings of the Second Congress of the French Mathematical Society, Long Beach, CA, USA, 10 June 2019; pp. 391–414. [Google Scholar]
Alquier, P. User-friendly introduction to PAC-Bayes bounds. arXiv 2021, arXiv:2110.11216. [Google Scholar]
Audibert, J.Y. Fast Learning Rates in Statistical Inference through Aggregation. Ann. Stat. 2009, 37, 1591–1646. [Google Scholar] [CrossRef]
Hutter, M.; Poland, J. Adaptive Online Prediction by Following the Perturbed Leader. J. Mach. Learn. Res. 2005, 6, 639–660. [Google Scholar]
Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The Nonstochastic multiarmed Bandit problem. SIAM J. Comput. 2003, 32, 48–77. [Google Scholar] [CrossRef]
Kleinberg, R.D.; Niculescu-Mizil, A.; Sharma, Y. Regret Bounds for Sleeping Experts and Bandits. In COLT; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
Kanade, V.; McMahan, B.; Bryan, B. Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards. Artif. Intell. Stat. 2009, 3, 1137–1155. [Google Scholar]
Cesa-Bianchi, N.; Lugosi, G.; Stoltz, G. Minimizing regret with label-efficient prediction. IEEE Trans. Inf. Theory 2005, 51, 2152–2162. [Google Scholar] [CrossRef] [Green Version]
Neu, G.; Bartók, G. An Efficient Algorithm for Learning with Semi-Bandit Feedback; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8139, pp. 234–248. [Google Scholar]
Engdahl, E.R.; Villaseñor, A. 41 Global seismicity: 1900–1999. Int. Geophys. 2002, 81, 665–690. [Google Scholar]
Chung, F.; Lu, L. Concentration Inequalities and Martingale Inequalities: A Survey. Internet Math. 2006, 3, 79–127. [Google Scholar] [CrossRef] [Green Version]

Figure 1. A principal curve.

Figure 2. A principal curve and projections of data onto it.

Figure 3. Principal curves with different numbers (k) of segments. (a) A too small k. (b) Right k. (c) A too large k.

Figure 4. An example of a lattice

Γ_{δ}

in

R^{2}

with

δ = 1

(spacing between blue points) and

B (0, 10)

(black circle). The red polygonal line is composed of vertices in

Q_{δ} = B (0, 10) \cap Γ_{δ}

.

Figure 4. An example of a lattice

Γ_{δ}

in

R^{2}

with

δ = 1

(spacing between blue points) and

B (0, 10)

(black circle). The red polygonal line is composed of vertices in

Q_{δ} = B (0, 10) \cap Γ_{δ}

.

Figure 5. An example of a Voronoi partition.

Figure 6. Synthetic data. Black dots represent data

x_{1 : t}

. The red point is the new observation

x_{t + 1}

. princurve (solid red) and slpc (solid green). (a)

t = 150

, princurve. (b)

t = 450

, princurve. (c)

t = 150

, incremental SCMS. (d)

t = 450

, incremental SCMS. (e)

t = 150

, slpc. (f)

t = 450

, slpc.

Figure 6. Synthetic data. Black dots represent data

x_{1 : t}

. The red point is the new observation

x_{t + 1}

. princurve (solid red) and slpc (solid green). (a)

t = 150

, princurve. (b)

t = 450

, princurve. (c)

t = 150

, incremental SCMS. (d)

t = 450

, incremental SCMS. (e)

t = 150

, slpc. (f)

t = 450

, slpc.

Figure 7. Mean estimation of regret and per-round regret of slpc with respect to time round t, for the horizon

T = 500

. (a) Mean estimation of the regret of slpc over 20 trials (black line) and a bisection line (green) with respect to time round t. (b) Per-round of estimated regret of slpc with respect to t.

Figure 7. Mean estimation of regret and per-round regret of slpc with respect to time round t, for the horizon

T = 500

. (a) Mean estimation of the regret of slpc over 20 trials (black line) and a bisection line (green) with respect to time round t. (b) Per-round of estimated regret of slpc with respect to t.

Figure 8. Synthetic data. Zooming in: how a new data point impacts the principal curve only locally. (a) At time

t = 97

. (b) And at time

t = 98

.

Figure 8. Synthetic data. Zooming in: how a new data point impacts the principal curve only locally. (a) At time

t = 97

. (b) And at time

t = 98

.

Figure 9. slpc (green line) on synthetic high dimensional data from different perspectives. Black dots represent recordings

x_{1 : 99}

; the red dot is the new recording

x_{200}

. (a) slpc,

t = 199

, 1st and 2nd coordinates. (b) slpc,

t = 199

, 3th and 5th coordinates. (c) slpc,

t = 199

, 4th and 6th coordinates.

Figure 9. slpc (green line) on synthetic high dimensional data from different perspectives. Black dots represent recordings

x_{1 : 99}

; the red dot is the new recording

x_{200}

. (a) slpc,

t = 199

, 1st and 2nd coordinates. (b) slpc,

t = 199

, 3th and 5th coordinates. (c) slpc,

t = 199

, 4th and 6th coordinates.

Figure 10. Seismic data. Black dots represent seismic recordings

x_{1 : t}

; the red dot is the new recording

x_{t + 1}

. (a) princurve,

t = 100

. (b) princurve,

t = 125

. (c) incremental SCMS,

t = 100

. (d) incremental SCMS,

t = 125

. (e) slpc,

t = 100

. (f) slpc,

t = 125

.

Figure 10. Seismic data. Black dots represent seismic recordings

x_{1 : t}

; the red dot is the new recording

x_{t + 1}

. (a) princurve,

t = 100

. (b) princurve,

t = 125

. (c) incremental SCMS,

t = 100

. (d) incremental SCMS,

t = 125

. (e) slpc,

t = 100

. (f) slpc,

t = 125

.

Figure 11. Seismic data from https://earthquake.usgs.gov/data/centennial/.

Figure 12. Daily commute data. Black dots represent collected locations

x_{1 : t}

. The red point is the new observation

x_{t + 1}

. princurve (solid red) and slpc (solid green). (a)

t = 10

, princurve. (b)

t = 127

, princurve. (c)

t = 10

, slpc. (d)

t = 127

, slpc.

Figure 12. Daily commute data. Black dots represent collected locations

x_{1 : t}

. The red point is the new observation

x_{t + 1}

. princurve (solid red) and slpc (solid green). (a)

t = 10

, princurve. (b)

t = 127

, princurve. (c)

t = 10

, slpc. (d)

t = 127

, slpc.

Table 1. The first line is the regret (cumulative loss) on synthetic data (average over 10 trials, with standard deviation in brackets). Second and third lines are the average computation time for two values of the time horizon T. princurve and incremental SCMS are deterministic, hence the zero standard deviation for regret.

Ground Truth	`Princurve`	`Incremental SCMS`	`slpc`
2.48 (0)	26.02 (0)	19.09 (0)	20.83 (3.23)
T = 500	0.029 s (0.0001 s)	18.79 s (0.007 s)	1.44 s (0.030 s)
T = 5000	0.35 s (0.006 s)	>60 s (NA)	4.13 s (0.807 s)

Table 2. Regret (cumulative loss) on synthetic high dimensional data in (average over 10 trials, with standard deviation in brackets). princurve and incremental SCMS are deterministic, hence the zero standard deviation.

Ground Truth	`Princurve`	`Incremental SCMS`	`slpc`
3.290 (0)	14.204 (0)	5.38 (0)	6.797 (0.409)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, L.; Guedj, B. Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly. Entropy 2021, 23, 1534. https://doi.org/10.3390/e23111534

AMA Style

Li L, Guedj B. Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly. Entropy. 2021; 23(11):1534. https://doi.org/10.3390/e23111534

Chicago/Turabian Style

Li, Le, and Benjamin Guedj. 2021. "Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly" Entropy 23, no. 11: 1534. https://doi.org/10.3390/e23111534

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly

Abstract

1. Introduction

1.1. Earlier Works on Principal Curves

1.2. Motivation

2. Notation

3. Regret Bounds for Sequential Learning of Principal Curves

4. Implementation

5. Numerical Experiments

6. Proofs

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI