Numerous methods have been proposed in the statistics and machine learning literature to sum up information and represent data by condensed and simpler to understand quantities. Among those methods, Principal Component Analysis (PCA) aims at identifying the maximal variance axes of data. This serves as a way to represent data in a more compact fashion and hopefully reveal as well as possible their variability. PCA has been introduced bypeason1901lines and spearman1904general and further developed by hotelling1933analysis
. This is one of the most widely used procedures in multivariate exploratory analysis targeting dimension reduction or features extraction. Nonetheless, PCA is a linear procedure and the need for more sophisticated nonlinear techniques has led to the notion of principal curve. Principal curves may be seen as a nonlinear generalization of the first principal component. The goal is to obtain a curve which passes "in the middle" of data, as illustrated byFigure 1. This notion has been at the heart of numerous applications in many different domains, such as physics (FO1989; BRU2007), character and speech recognition (RN1999; KK2002), mapping and geology (banfield1992ice; stanford2000finding; BRU2007), to name but a few.
1.1 Earlier works on principal curves
The original definition of principal curve dates back to HS1989. A principal curve is a smooth () parameterized curve in which does not intersect itself, has finite length inside any bounded subset of and is self-consistent. This last requirement means that , where
is a random vector and the so-called projection indexis the largest real number minimizing the squared Euclidean distance between and , defined by
Self-consistency means that each point of is the average (under the distribution of ) of all data points projected on , as illustrated by Figure 2.
However, an unfortunate consequence of this definition is that the existence is not guaranteed in general for a particular distribution, let alone for an online sequence for which no probabilistic assumption is made. KEG1999 proposed a new concept of principal curves which ensures the existence for a large class of distributions. Principal curves are defined as the curves minimizing the expected squared distance over a class of curves whose length is smaller than , namely,
If , always exists but may not be unique. In practical situation where only i.i.d copies of are observed, KEG1999 considers classes of all polygonal lines with segments and length not exceeding
, and chooses an estimatorof as the one within which minimizes the empirical counterpart
of . It is proved in KKLZ2000 that if is almost surely bounded and is proportional to , then
As the task of finding a polygonal line with segments and length at most that minimizes is computationally costly, KKLZ2000 proposes the Polygonal Line algorithm. This iterative algorithm proceeds by fitting a polygonal line with segments and considerably speeds up the exploration part by resorting to gradient descent. The two steps (projection and optimization) are similar to what is done by the -means algorithm. However, the Polygonal Line algorithm is not supported by theoretical bounds and leads to variable performance depending on the distribution of the observations.
As the number of segments plays a crucial role (a too small leads to a rough summary of data while a too large yields overfitting, see Figure 3), BA2012 aim to fill the gap by selecting an optimal from both theoretical and practical perspectives.
Their approach relies strongly on the theory of model selection by penalization introduced by BBM1999 and further developed by BM2007. By considering countable classes of polygonal lines with segments, total length and whose vertices are on a lattice, the optimal is obtained as the minimizer of the criterion
is a penalty function where stands for the diameter of observations, denotes the weight attached to class and with constants depending on , the maximum length and the dimension of observations. BA2012 then prove that
where is a numerical constant. The expected loss of the final polygonal line is close to the minimal loss achievable over up to a remainder term decaying as .
The big data paradigm—where collecting, storing and analyzing massive amounts of large and complex data becomes the new standard—commands to revisit some of the classical statistical and machine learning techniques. The tremendous improvements of data acquisition infrastructures generates new continuous streams of data, rather than batch datasets. This has drawn a large interest to sequential learning. Existing theoretical works and practical implementations of principal curves are designed for the batch setting (KEG1999; KKLZ2000; KK2002; SK2002; BA2012). To the best of our knowledge, very little effort has been put so far into extending principal curves algorithms to the sequential context (to the exception of laparra2016sequential, in a fairly different setting and with no theoretical results). The present paper aims at filling this gap: our goal is to propose an online perspective to principal curves by automatically and sequentially learning the best principal curve summarizing a data stream. This represents a clear improvement over the batch setting as when data streams are collected, running a batch algorithm at each collect time is likely to be resources-demanding. Sequential learning takes advantage of the latest collected (set of) observations and therefore suffers a much smaller computational cost.
Sequential learning operates as follows: a blackbox reveals at each time some deterministic value , and a forecaster attempts to predict sequentially the next value based on past observations (and possibly other available information). The performance of the forecaster is no longer evaluated by its generalization error (as in the batch setting) but rather by a regret bound which quantifies the cumulative loss of a forecaster in the first rounds with respect to some reference minimal loss. In sequential learning, the velocity of algorithms may be favored over statistical precision. An immediate use of aforecited techniques (KKLZ2000; SK2002; BA2012) at each time round (treating data collected until as a batch dataset) would result in a monumental algorithmic cost. Rather, we propose a novel algorithm which adapts to the sequential nature of data, i.e., which takes advantage of previous computations. We refer the reader to the monograph CBL2006 for a thorough introduction to sequential learning.
The contributions of the present paper are twofold. We first propose a sequential principal curves algorithm, for which we derive regret bounds. We then move towards an implementation, illustrated on a toy dataset and a real-life dataset (seismic data). The sketch of our algorithm procedure is as follows. At each time round , the number of segments of is chosen automatically and the number of segments in the next round is obtained by only using information about and a small amount of past observations. The core of our procedure relies on computing a quantity which is linked to the mode of the so-called Gibbs quasi-posterior and is inspired by quasi-Bayesian learning. The use of quasi-Bayesian estimators is especially advocated by the PAC-Bayesian theory which originates in the machine learning community in the late 1990s, in the seminal works of STW1997 and McA1998; McA1999. In a recent preprint, LGL2017 discuss the use of PAC-Bayesian tools for sequential learning.
The rest of the paper proceeds as follows. Section 2 presents our notation and our online principal curve algorithm, for which we provide regret bounds with sublinear remainder terms in Section 3. A practical implementation is proposed in Section 4 and we illustrate its performance on a toy dataset and seismic data in Section 5. Finally, we collect in Section 6 proofs to original results claimed in the paper.
A parameterized curve in is a continuous function where is a closed interval of the real line. The length of is given by
Let be a sequence of data, where stands for the -ball centered in with radius . Let be a grid over , i.e., where is a lattice in with spacing . Let and define for each the collection of polygonal lines with segments whose vertices are in and such that . Denote by all polygonal lines with a number of segments , whose vertices are in and whose length is at most . Finally, let denote the number of segments of . This strategy is illustrated by Figure 4.
Our goal is to learn a time-dependent polygonal line which passes through the "middle" of data and gives a summary of all available observations (denoted by hereafter) before time . Our output at time is a polygonal line depending on past information and past predictions . When is revealed, the instantaneous loss at time is computed as
In what follows, we investigate regret bounds for the cumulative loss based on (2). Given a measurable space (embedded with its Borel -algebra), we let
denote the set of probability distributions on, and for some reference measure , we let be the set of probability distributions absolutely continuous with respect to .
For any , let denote a probability distribution on . We define the prior on as
where and .
A quasi-Bayesian procedure would now consider the Gibbs quasi-posterior (note that this is not a proper posterior in all generality, hence the term "quasi")
as advocated by Aud2009 and LGL2017 who then consider realisations from this quasi-posterior. In the present paper, we will rather consider a quantity linked to the mode of this quasi-posterior. Indeed, the mode of the quasi-posterior is
where (i) is a cumulative loss term, (ii) is a term controlling the variance of the prediction to past predictions , and (iii) can be regarded as a penalty function on the complexity of if is well chosen. This mode hence has a similar flavor to follow the best expert or follow the perturbed leader in the setting of prediction with experts (see HP2005 or CBL2006, Chapters 3 and 4) if we consider each as an expert which always delivers constant advice. These remarks yield Algorithm 1.
3 Regret bounds for sequential learning of principal curves
We now present our main theoretical results.
For any sequence , and any penalty function , let . Let , then the procedure described in Algorithm 1 satisfies
The expectation of the cumulative loss of polygonal lines is upper-bounded by the smallest penalised cumulative loss over all up to a multiplicative term which can be made arbitrarily close to 1 by choosing a small enough . However, this will lead to both a large in and a large . In addition, another important issue is the choice of the penalty function . For each , should be large enough to ensure a small while not too large to avoid overpenalization and a larger value for . We therefore set
for each with segments (where denotes the cardinality of a set ) since it leads to
Under the assumptions of Theorem 1, let
and we conclude by setting
However Corollary 1 is not usable in practice since the optimal value for depends on which is obviously unknown, even more so at time . We therefore provide an adaptive refinement of Algorithm 1 in the following Algorithm 2.
For any sequence , let where , , are constants depending on . Let and
where . Then the procedure described in Algorithm 2 satisfies
The message of this regret bound is that the expected cumulative loss of polygonal lines is upper-bounded by the minimal cumulative loss over all , up to an additive term which is sublinear in . The actual magnitude of this remainder term is . When is fixed, the number of segments is a measure of complexity of the retained polygonal line. This bound therefore yields the same magnitude than (1) which is the most refined bound in the literature so far (BA2012, where the optimal values for and are obtained in a model selection fashion).
The argument of the infimum in Algorithm 2 is taken over which has a cardinality of order , making any greedy search largely time-consuming. We instead turn to the following strategy: given a polygonal line with segments, we consider, with a certain proportion, the availability of within a neighbourhood (see the formal definition below) of . This consideration is suited for principal curve setting since if observation is close to , one can expect that the polygonal line which well fits observations lies in a neighbourhood of . In addition, if each polygonal line is regarded as an action, we no longer assume that all actions are available at all times, and allow the set of available actions to vary at each time. This is a model known as "sleeping experts (or actions)" in prior work (ACFS2003; KNS2008). In this setting, defining the regret with respect to the best action in the whole set of actions in hindsight remains to be difficult since that action might sometimes be unavailable. Hence it is natural to define the regret with respect to the best ranking of all actions in the hindsight according to their losses or rewards, and at each round one chooses among the available actions by selecting the one which ranks the highest in this ordering. KNS2008 introduced this notion of regret and studied both the full-information (best action) and partial-information (multi-armed bandit) settings with stochastic and adversarial rewards and adversarial action availability. They pointed out that the EXP4 algorithm (ACFS2003) attains the optimal regret in adversarial rewards case but runs in time exponential in the number of all actions. KMB2009 considered full and partial information with stochastic action availability and proposed an algorithm that runs in polygonal time. In what follows, we will realise our implementation by resorting to ”sleeping experts” i.e., a special available set of actions that adapts to the setting of principal curves.
Let denote an ordering of actions, and an available subset of the actions at round . We let denote the highest ranked action in . In addition, for any action we define the reward of at round by
It is clear that . The convention from losses to gains is done in order to facilitate the subsequent performance analysis. The reward of an ordering is the cumulative reward of the selected action at each time
and the reward of the best ordering is ( when is stochastic).
Our procedure starts with a partition step which aims at identifying the "relevant" neighbourhood of an observation with respect to a given polygonal line, and then proceeds with the definition of neighbourhood of an action , and finally we give details of implementation and show its regret bound.
For any polygonal line with segments, we denote by its vertices and by the line segments connecting and . In the sequel, we use to represent the polygonal line formed by connecting consecutive vertices in if no confusion is possible. In addition, for , let be the squared distance between and , and let be the squared distance between and line segment , i.e.,
where is the projection index of to line segment . In this step, can be partitioned into at most disjoint sets and defined as
where , is the set of all whose closest vertex of is , and , is the set of all whose closest segment of is . In the sequel, we use (resp., and ) to abbreviate the sequence of partitions (resp., and ).
For any , let us denote by the set of vertices of connecting to the projection and denote by the neighbourhood with respect to , i.e.,
Note that those definitions of and hold whenever . If , and are identical. Next, let
be the set of observations belonging to . We finally let stand for the average of all observations in and denotes the diameter of set .
We initiate the principal curve as the first component line segment whose vertices are the two farthest projections of data on the first component line. Given , with segments, we obtain firstly the Voronoi-like partition and , then identify the adjacent vertices set and set for observations in .
For each , let us first define the local vertices set around as
where is a -ball in , centered in with radius . Then for , a candidate (equivalently a candidate ) with vertices is formed as
where and are distinct vertices chosen from the local vertices set around . In other words, a candidate is built by keeping all the vertices of that do not belong to and by updating vertices drawn from . We define the neighbourhood of by
The cardinality is upper bounded by since all elements in differ with up to at most three vertices.
Finally, we give our implementation with action availability that may vary at each time in Algorithm 3.
The Algorithm 3 has an exploration phase (when ) and an exploitation phase (). In the exploration phase, it is allowed to observe rewards of all actions and to choose an optimal perturbed action from the set of all actions. In the exploitation phase, only rewards of a part of actions can be accessed to and rewards of others are estimated by a constant, and we update our action from the neighbourhood of the previous action . This local update (or search) can hence reduce computation complexity. In addition, this local search will be enough to account for the case when locates in . The parameter needs to be carefully calibrated since on the one hand, should not be too big to make sure that the condition is non-empty, otherwise, all rewards are estimated by the same constant and thus leading to the same descending ordering of tuples for both and . Therefore, we may face the risk of having in the neighbourhood of even if we are in the exploration phase at time ; on the other hand, very small could result in large bias for estimation of . In addition, the exploitation phase is similar but different to the label efficient prediction (NGG2005, Remark 1.1) since we allow an action at time to be different from the previous one. Moreover, NB2013 have proposed the Geometric Resampling method to estimate the conditional probability
since this quantity often doesn’t have an explicit form. However, due to the simple exponential distribution ofchosen in our case, an explicit form of is straightforward.
Assume that , and let
Then the procedure described in Algorithm 3 satisfies the regret bound