Online Supervised Subspace Tracking

09/01/2015
by   Yao Xie, et al.
Georgia Institute of Technology

We present a framework for supervised subspace tracking, where there are two time series x_t and y_t, one being the high-dimensional predictors and the other the response variables, and the subspace tracking needs to take both sequences into account. It extends classic online subspace tracking work, which can be viewed as tracking x_t only. Our online sufficient dimensionality reduction (OSDR) is a meta-algorithm that can be applied to various cases including linear regression, logistic regression, multiple linear regression, multinomial logistic regression, support vector machines, the random dot product model, and the multi-scale union-of-subspace model. OSDR reduces data dimensionality on-the-fly with low computational complexity, and it can also handle missing data and dynamic data. OSDR uses an alternating minimization scheme and updates the subspace via gradient descent on the Grassmannian manifold. The subspace update can be performed efficiently by exploiting the fact that, in many settings, the Grassmannian gradient with respect to the subspace is rank-one (or low-rank in certain cases). The optimization problem for OSDR is non-convex and hard to analyze in general; we provide convergence analysis of OSDR in a simple linear regression setting. The good performance of OSDR compared with conventional unsupervised subspace tracking is demonstrated via numerical examples on simulated and real data.


I Introduction

Subspace tracking plays an important role in various signal and data processing problems, including blind source separation [1], dictionary learning [2, 3], online principal component analysis (PCA) [4, 5], imputing missing data [4], denoising [6], and dimensionality reduction [7]. To motivate online supervised subspace tracking, we consider online dimensionality reduction. Applications such as the Kinect system generate high-resolution 3D frames of dimension 1280 by 960 at a rate of 12 frames per second. At such a high rate, it is desirable to perform dimensionality reduction on-the-fly rather than storing the complete data. In the unsupervised setting, dimensionality reduction is achieved by PCA, which projects the data onto the dominant eigenspace of the data covariance matrix. However, in many signal and data processing problems, side information is available in the form of labels or tasks. For instance, the data generated by the Kinect system contain gesture information (e.g., sitting, standing, etc.) [8, 9]. A supervised dimensionality reduction may take advantage of this side information in the choice of the subspaces for dimensionality reduction. Supervised dimensionality reduction is more involved, as it has two objectives: choosing the subspace that represents the predictor vector and choosing parameters for the model that relates the predictor and response variables.

Existing online subspace tracking research has largely focused on unsupervised learning, including the GROUSE algorithm (based on online gradient descent on the Grassmannian manifold) [4, 10, 11], the PETRELS algorithm [12], and the MOUSSE algorithm [13]. Local convergence of GROUSE has been shown in [11] in terms of the expected principal angle between the true and the estimated subspaces. A preliminary exploration of supervised subspace tracking is reported in [14], which performs dimensionality reduction on the predictor vector without considering the response variable in the formulation. What can go wrong if we perform subspace tracking using only the predictor and ignoring the response variable (e.g., the approach in [14])? Fig. 1 and Fig. 2 demonstrate classification instances where unsupervised dimensionality reduction using a subspace may fail completely while supervised dimensionality reduction gets it right.

Fig. 1: Example illustrating the difference between unsupervised and supervised subspace tracking. (a): a case where dimensionality reduction along the largest eigenvector is optimal, and both supervised and unsupervised dimensionality reduction pick the dominant eigenvector; (b): a case where dimensionality reduction along the second largest eigenvector is optimal. Unsupervised dimensionality reduction erroneously picks the largest eigenvector since it only maximizes the variance of the predictor variable, but supervised dimensionality reduction, by considering the response variable, correctly picks the second largest eigenvector.

Fig. 2: Simulated data corresponding to the case in Fig. 1(b). (a): three-dimensional data cloud with two classes; (b): scatter plot of the data projected by unsupervised subspace tracking into a two-dimensional space; the two classes overlap completely in the projected space; (c): scatter plot of the data projected by supervised subspace tracking into a two-dimensional space; the two classes are well-separated. More details for this example can be found in Section V.

In this paper, we present a general framework for supervised subspace tracking, which we refer to as online supervised dimensionality reduction (OSDR); it is a meta-algorithm that may be applied to various models. OSDR simultaneously learns the subspace and a predictive model through alternating minimization, and the formulation of OSDR takes into consideration both the high-dimensional predictor sequence x_t and the response sequence y_t. We explore different forms of OSDR under various settings, with the loss function being induced by linear regression, logistic regression, multiple linear regression, multinomial logistic regression, support vector machine, random dot-product model, and union-of-subspaces model, respectively. A common structure is that the Grassmannian gradient of the cost function with respect to the subspace U is typically rank-one or low-rank (e.g., with rank related to the number of classes in the multi-class classification problem, or to the number of samples used for a mini-batch update). This structure enables us to develop a simple and efficient update for U along the geodesic. Due to the orthogonality requirement and bilinearity, the optimization problem involved in OSDR is non-convex. We provide convergence analysis for OSDR in a simple linear regression setting. Good performance of OSDR is demonstrated on simulated and real data.

A notion in statistics related to our problem is sufficient dimensionality reduction, which combines the idea of dimensionality reduction with the concept of sufficiency. Given a response variable y and a p-dimensional predictor vector x, a dimensionality-reduction statistic R(x) is sufficient if the distribution of y conditioned on R(x) is the same as that of y conditioned on x. In other words, in the case of sufficiency, no information about the regression is lost by reducing the dimension of x. Classic sufficient dimensionality reduction methods include sliced inverse regression (SIR) [15], which uses the inverse regression curve to perform a weighted principal component analysis; more recent work [16] uses likelihood-based sufficient dimensionality reduction to estimate the central subspace. From this perspective, OSDR can be viewed as aiming at sufficiency for online subspace-based dimensionality reduction. In the offline setting, a notable work is sufficient dimension reduction on manifolds [17], which considers the problem of discovering a manifold that best preserves information relevant to a non-linear regression, using a convex optimization formulation.

The rest of the paper is organized as follows. Section II sets up the formalization and the meta-algorithm form for OSDR. Section III presents specific OSDR algorithms under various settings. Section IV includes theoretical analysis. Section V contains numerical examples based on simulated and real data.

The notation in this paper is standard: R_+ denotes the set of positive real numbers; [x]_i denotes the i-th element of a vector x; I{E} is the indicator function of an event E; ‖x‖_1 and ‖x‖_2 denote the ℓ1 and ℓ2 norms of a vector x, respectively; X^T denotes the transpose of a vector or matrix X, and ‖X‖_F denotes the Frobenius norm of a matrix X; I_p is the identity matrix of dimension p-by-p. The sign function sgn(x) is equal to 1 if x ≥ 0 and is equal to 0 otherwise.

II OSDR: the meta-algorithm

Assume two time series x_t ∈ R^p and y_t, such that y_t can be predicted from the high-dimensional vector x_t. Here y_t can be either binary, real, or vector-valued. To reduce the data dimensionality, we project x_t by a subspace U_t ∈ R^{p×k}, with k ≤ p, to obtain a projected vector (or feature) β_t = U_t^T x_t. Here k is the number of principal components we use to explain the response series. Ideally, we would like to choose a U_t that maximizes the prediction power of β_t for y_t, and U_t can be time-varying to be data-adaptive. With this goal, we formulate online supervised dimensionality reduction (OSDR), which simultaneously learns the subspace U_t and a function that relates β_t and y_t, by minimizing a cost function that measures the quality of the projection in terms of predictive power. The optimization problem of minimizing the loss function is inherently non-convex, due to the orthogonality requirement U_t^T U_t = I_k, as well as bilinear terms arising from the coupling of U_t and the model parameters. Hence, we develop the algorithm based on an alternating minimization and stochastic gradient descent scheme.

We consider two related formulations: the β-formulation, which assumes that the response function depends only on the projected components β = U^T x, and hence is a compact model that fits the goal of dimensionality reduction; however, the β-formulation cannot deal with missing data. We therefore also introduce the x-formulation, which handles missing data and can also be used for denoising. The x-formulation estimates the projection of the data using the subspace from the previous step, and then uses the reconstructed data to update the subspace; however, it requires us to store high-dimensional model parameters. The loss functions for the β- and x-formulations are different, but the Grassmannian gradient with respect to U is often rank-one (or low-rank). Such simple structure enables us to derive an efficient algorithm to update U along the geodesic, as summarized in Algorithm 1. In the following derivations, we omit the time indices for notational simplicity.

II-A β-formulation

The β-formulation is motivated by sliced inverse regression, which assumes that the response variable depends only on the projected components. The loss function for the β-formulation takes the form ℓ(U^T x, y; θ), which measures the predictive quality of the projection U^T x for the response y, with some model parameter θ. To compute the gradient of ℓ with respect to U on the Grassmannian manifold, we follow the program developed in [18]. First, compute the partial derivative of ℓ with respect to U. Let the partial derivative of ℓ with respect to its first argument be denoted by ∇_1 ℓ. By the chain rule, the partial derivative with respect to U is given by

∂ℓ/∂U = x (∇_1 ℓ)^T.

Using equation (2.70) in [18], we can calculate the gradient on the Grassmannian from this partial derivative:

grad ℓ = (I − U U^T) ∂ℓ/∂U = (I − U U^T) x (∇_1 ℓ)^T.

In many of the problems that we describe in Section III, ∇_1 ℓ is a single term or a sum of a few terms. Hence, the gradient has the desired low-rank structure.

II-B x-formulation

The x-formulation assumes that the loss function is defined in the ambient p-dimensional space: ℓ(x̂, y; w), where x̂ = U β̂ and w is a high-dimensional model parameter. This setting is particularly useful for denoising and for imputing missing data, where we assume the signal lies near a low-dimensional subspace, estimate the low-dimensional component β̂, and use x̂ = U β̂ to predict y. Following a derivation similar to the one above, the gradient on the Grassmannian can be written as

grad ℓ = (I − U U^T) (∇_1 ℓ) β̂^T,

where ∇_1 ℓ denotes the partial derivative of ℓ with respect to its first argument x̂. Again, ∇_1 ℓ is often only one term or a sum of a few terms, and hence the gradient has the desired low-rank structure.

To estimate β̂ in the denoising setting, we may use

β̂ = U^T x.     (1)

When there is missing data, we are not able to observe the complete data vector x, but only a subset of entries indexed by Ω. Using an approach similar to that in [4], β̂ is estimated as

β̂ = (U^T P_Ω U)^{-1} U^T P_Ω x,     (2)

where P_Ω is a p-by-p diagonal matrix that has 1 in the i-th diagonal entry if i ∈ Ω and 0 otherwise. This is equivalent to a least-squares fit of β using only the rows of U and the entries of x indexed by Ω.
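To make the two estimates concrete, the following Python sketch implements (1) and (2) under the notation above; it assumes U is orthonormal and that omega is an index array of the observed coordinates (the function names are ours, not the paper's).

```python
import numpy as np

def estimate_beta_full(U, x):
    """Denoising setting, cf. (1): with U orthonormal, the least-squares
    projection coefficients are simply U^T x."""
    return U.T @ x

def estimate_beta_missing(U, x_obs, omega):
    """Missing-data setting, cf. (2): least-squares fit using only the
    observed rows of U (the approach of [4]); omega indexes the observed
    entries and x_obs contains the corresponding values of x."""
    U_omega = U[omega, :]
    beta, *_ = np.linalg.lstsq(U_omega, x_obs, rcond=None)
    return beta
```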

0:  a sequence of predictors x_t and responses y_t, an initial model parameter and an initial subspace U_0.
1:  for t = 1, 2, … do
2:     {β-formulation} compute β_t = U_{t-1}^T x_t
3:     {x-formulation} compute x̂_t = U_{t-1} β̂_t, with β̂_t estimated from (1) or (2)
4:     fix the model parameter, update U_t along the Grassmannian gradient in the geodesic direction
5:     fix U_t, update the model parameter
6:  end for
Algorithm 1 OSDR: meta-algorithm
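The following Python sketch shows the alternating structure of the meta-algorithm for the β-formulation. It assumes the caller supplies a routine returning the rank-one descent direction of the loss on the Grassmannian as a triplet (σ, r̂, p̂), with r̂ a unit vector orthogonal to the columns of U, and a routine returning the gradient with respect to the model parameter; the names osdr, grass_descent, and param_grad are ours, not the paper's.

```python
import numpy as np

def osdr(stream, U0, theta0, grass_descent, param_grad, eta_u, eta_theta):
    """Schematic OSDR loop (beta-formulation).

    grass_descent(U, theta, x, y, beta) -> (sigma, r_hat, p_hat), the
        rank-one descent direction sigma * outer(r_hat, p_hat) on the
        Grassmannian, with r_hat a unit vector orthogonal to span(U).
    param_grad(theta, beta, y) -> gradient of the loss w.r.t. theta.
    """
    U, theta = U0.copy(), theta0.copy()
    for x, y in stream:
        beta = U.T @ x                        # k-dimensional feature
        sigma, r_hat, p_hat = grass_descent(U, theta, x, y, beta)
        # Geodesic step for a rank-one direction (cf. [18]).
        Up = U @ p_hat
        U = U + (np.cos(sigma * eta_u) - 1.0) * np.outer(Up, p_hat) \
              + np.sin(sigma * eta_u) * np.outer(r_hat, p_hat)
        # With U fixed, take a stochastic gradient step on the parameter.
        theta = theta - eta_theta * param_grad(theta, U.T @ x, y)
    return U, theta
```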

III OSDR for specific models

In this section, we illustrate various forms of loss functions and show that the Grassmannian gradient with respect to U typically takes the form α w v^T, for some scalar α and vectors w ∈ R^p and v ∈ R^k.

III-A Linear regression

For linear regression, y ∈ R and the loss function is the squared prediction error. In the β-formulation, the model parameter is θ ∈ R^k, the projected feature is β = U^T x, and the loss function is

ℓ = (y − θ^T U^T x)^2.

Define the predicted response

ŷ = θ^T U^T x.     (3)

The Grassmannian gradient of the loss function with respect to U is given by

grad ℓ = −2 (y − ŷ)(I − U U^T) x θ^T.

Using the rank-one structure of the gradient above, we can perform geodesic gradient descent for U in a simple form. Write

grad ℓ = σ w v^T,   σ = 2 |y − ŷ| ‖(I − U U^T) x‖ ‖θ‖,

where w = −sgn(y − ŷ)(I − U U^T) x / ‖(I − U U^T) x‖ is a unit vector orthogonal to the columns of U, and v = θ/‖θ‖ is a unit vector in R^k. Subsequently, using the formula in [18], the update of U is given by

U ← U + ((cos(σ η) − 1) U v − sin(σ η) w) v^T,     (4)

where η is a step-size. Similarly, for a fixed U, we may find the gradient with respect to the regression coefficient vector θ and update it via

θ ← θ + 2 η_θ (y − ŷ) U^T x,     (5)

where η_θ is the step-size for the parameter update.
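As an illustration of updates (3)-(5), a minimal Python sketch of one β-formulation step for linear regression follows; it parameterizes the geodesic step directly by the rank-one descent triplet, and constants and step-size conventions may differ from the authors' implementation.

```python
import numpy as np

def osdr_linreg_step(U, theta, x, y, eta_u, eta_theta):
    """One OSDR step for linear regression (beta-formulation)."""
    beta = U.T @ x
    err = y - theta @ beta                  # prediction error, cf. (3)
    r = x - U @ beta                        # component of x off the subspace
    rn, tn = np.linalg.norm(r), np.linalg.norm(theta)
    if rn > 1e-12 and tn > 1e-12:
        sigma = 2.0 * abs(err) * rn * tn    # magnitude of the rank-one gradient
        r_hat = np.sign(err) * r / rn       # left direction of the descent step
        p_hat = theta / tn                  # right direction of the descent step
        # Geodesic step along the rank-one descent direction, cf. (4).
        U = U + (np.cos(sigma * eta_u) - 1.0) * np.outer(U @ p_hat, p_hat) \
              + np.sin(sigma * eta_u) * np.outer(r_hat, p_hat)
    # Gradient step on the regression coefficients with U fixed, cf. (5).
    theta = theta + eta_theta * err * (U.T @ x)
    return U, theta
```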

In the x-formulation, the model parameter is w ∈ R^p. Essentially, by replacing x with its estimate x̂ = U β̂, we have the loss function

ℓ = (y − w^T U β̂)^2,

where β̂ is estimated using the subspace from the previous step via (1) or (2). Letting ŷ = w^T U β̂, we can show that

grad ℓ = −2 (y − ŷ)(I − U U^T) w β̂^T,

and hence the subspace can be updated similarly; the model parameter is updated via

w ← w + 2 η_θ (y − ŷ) x̂.     (6)
Remark 1 (Difference from unsupervised tracking).

Due to the alternative formulation, the subspace update in OSDR differs from that in the unsupervised version [4, 11] in that the amount of update depends on the prediction error y − ŷ. This is intuitive, since the amount by which we update the subspace should be driven by the prediction accuracy for the response variable.

Remark 2 (Mini-batch update).

Instead of updating U with every single new sample, we may also perform an update with a batch of samples. The Grassmannian gradient for this mini-batch update scheme is the sum of the per-sample gradients. In this case the gradient is no longer rank-one. We may use a rank-one approximation of this gradient, or use the exact low-rank update described in (10), which requires computing an SVD of this gradient.

Remark 3 (Computational complexity).

The computational complexity of OSDR is quite low: it is O(pk) per iteration. The most expensive step is to compute U U^T x or U U^T w, in the β- and x-formulations, respectively. This term should be computed as, for instance, U (U^T x), to keep the complexity at O(pk) (otherwise the complexity is quadratic in p).
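The parenthesization point in Remark 3 can be checked numerically; the snippet below is only an illustration (dimensions chosen arbitrarily).

```python
import numpy as np

p, k = 2000, 10
U = np.linalg.qr(np.random.randn(p, k))[0]   # orthonormal basis
x = np.random.randn(p)

proj_slow = (U @ U.T) @ x    # forms a p-by-p matrix: cost quadratic in p
proj_fast = U @ (U.T @ x)    # two thin products: O(pk) work
assert np.allclose(proj_slow, proj_fast)
```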

III-B Logistic regression

For logistic regression, y ∈ {0, 1}. Define the sigmoid function

σ(z) = 1 / (1 + e^{−z}).

The loss function for logistic regression corresponds to the negative log-likelihood function, assuming y is a Bernoulli random variable. For the β-formulation, the loss function is given by

ℓ = −y log σ(θ^T U^T x) − (1 − y) log(1 − σ(θ^T U^T x)),

with model parameter θ ∈ R^k and β = U^T x. For the x-formulation with a parameter estimate w ∈ R^p, we have

ℓ = −y log σ(w^T U β̂) − (1 − y) log(1 − σ(w^T U β̂)).

It can be shown that, in the logistic regression setting, the Grassmannian gradients with respect to U for these two cases are given by

grad ℓ = −(y − ŷ)(I − U U^T) x θ^T,

where the predicted response is

ŷ = σ(θ^T U^T x),     (7)

or

grad ℓ = −(y − ŷ)(I − U U^T) w β̂^T,

where the predicted response is

ŷ = σ(w^T U β̂).     (8)

Note that the gradients for linear regression and logistic regression take a similar form, with the only difference being how the response is predicted: in linear regression it is defined linearly as in (3), and in logistic regression it is defined through the sigmoid function as in (8). Hence, OSDR for linear and logistic regression takes similar forms and differs only in the response function used, as shown in Algorithm 2.
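The shared structure can be expressed compactly: the prediction pipeline is identical for the two models and only the link function changes. A small sketch (the helper names are ours):

```python
import numpy as np

def predict(U, theta, x, link):
    """Predicted response used in Algorithm 2: project, then apply the link."""
    return link(theta @ (U.T @ x))

identity = lambda z: z                          # linear regression, cf. (3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))    # logistic regression, cf. (8)
```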

0:  x_t and y_t (in the missing-data case, x_t is observed only on an index set Ω_t), U_0. Initial values for θ_0 (or w_0). Step-sizes η and η_θ. The response function g can be the linear or the sigmoid function.
1:  for t = 1, 2, … do
2:     {β-formulation} β_t = U_{t-1}^T x_t, ŷ_t = g(θ_{t-1}^T β_t), e_t = y_t − ŷ_t
3:     {x-formulation} β̂_t from (1) or (2), x̂_t = U_{t-1} β̂_t, ŷ_t = g(w_{t-1}^T x̂_t), e_t = y_t − ŷ_t
4:     {update subspace} update U_t along the geodesic as in (4), using the rank-one Grassmannian gradient
5:     {update regression coefficients and residuals} update θ_t (or w_t) by a gradient step as in (5) (or (6))
6:  end for
Algorithm 2 OSDR for linear and logistic regressions

III-C Multiple linear regression

We may also extend OSDR to multiple linear regression, where y ∈ R^m for some integer m > 1. The loss functions for the β- and x-formulations are given by

ℓ = ‖y − Θ^T U^T x‖_2^2   and   ℓ = ‖y − W^T U β̂‖_2^2,

with Θ ∈ R^{k×m} and W ∈ R^{p×m}. Here we assume the intercept term is zero, which can be achieved by subtracting the means from the predictor vector and the response variable, respectively. It can be shown that

grad ℓ = −2 (I − U U^T) x (y − Θ^T U^T x)^T Θ^T

and

grad ℓ = −2 (I − U U^T) W (y − W^T U β̂) β̂^T.

It can also be shown that the partial derivative of ℓ with respect to Θ for a fixed U is given by −2 U^T x (y − Θ^T U^T x)^T, and the partial derivative of ℓ with respect to W for a fixed U is given by −2 U β̂ (y − W^T U β̂)^T. OSDR for multiple linear regression is given in Algorithm 3.

0:  x_t and y_t (in the missing-data case, x_t is observed only on an index set Ω_t), U_0. Initial values for Θ_0 (or W_0). Step-sizes η and η_θ.
1:  for t = 1, 2, … do
2:     {β-formulation} β_t = U_{t-1}^T x_t, ŷ_t = Θ_{t-1}^T β_t, e_t = y_t − ŷ_t
3:     {x-formulation} β̂_t from (1) or (2), x̂_t = U_{t-1} β̂_t, ŷ_t = W_{t-1}^T x̂_t, e_t = y_t − ŷ_t
4:     {update subspace} update U_t along the geodesic using the rank-one Grassmannian gradient
5:     {update regression coefficients and residuals} Θ_t = Θ_{t-1} + 2 η_θ β_t e_t^T {β-formulation}; W_t = W_{t-1} + 2 η_θ x̂_t e_t^T {x-formulation}
6:  end for
Algorithm 3 OSDR for multiple linear regression

III-D Multinomial logistic regression

Multinomial logistic regression assumes that y ∈ {1, …, m} for some integer m is a categorical random variable; it is useful for classification. In the following, we focus on the β-formulation; the x-formulation can be derived similarly. The loss function is the negative log-likelihood of the multinomial model. In this case, the Grassmannian gradient is no longer rank-one but of higher rank; it is a sum of per-class terms,

(9)

Note that the gradient consists of a sum of terms, one per class, and hence is usually of rank greater than one. We no longer have a simple expression for the update of U along the geodesic, and the precise update requires performing a (reduced) singular value decomposition of the descent direction,

−grad ℓ = W Σ V^T,

where Σ is a diagonal matrix with the singular values on its diagonal. Using Theorem 2.3 in [18], we update U as

U ← U V cos(Σ η) V^T + W sin(Σ η) V^T,     (10)

where η is the step-size. Alternatively, we may use a rank-one approximation of the Grassmannian gradient to derive, again, a simple update, which is given in Algorithm 4.
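A Python sketch of the two options mentioned above is given below: an exact geodesic step via the reduced SVD of the descent direction, as in (10), and a cheaper rank-one approximation that keeps only the dominant singular triplet (as used in Algorithm 4). Both assume the Grassmannian gradient has columns orthogonal to U, which holds by construction; the function names are ours.

```python
import numpy as np

def grassmann_step_svd(U, grad, eta):
    """Exact geodesic step, cf. (10): SVD of the descent direction -grad."""
    W, s, Vt = np.linalg.svd(-grad, full_matrices=False)   # W: p x k, Vt: k x k
    V = Vt.T
    return U @ V @ np.diag(np.cos(s * eta)) @ Vt + W @ np.diag(np.sin(s * eta)) @ Vt

def grassmann_step_rank1(U, grad, eta):
    """Rank-one approximation: keep only the dominant singular triplet."""
    W, s, Vt = np.linalg.svd(-grad, full_matrices=False)
    w, sigma, v = W[:, 0], s[0], Vt[0, :]
    return U + (np.cos(sigma * eta) - 1.0) * np.outer(U @ v, v) \
             + np.sin(sigma * eta) * np.outer(w, v)
```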

0:  x_t and y_t (in the missing-data case, x_t is observed only on an index set Ω_t), U_0. Initial values for the regression coefficients. Step-sizes η and η_θ.
1:  for t = 1, 2, … do
2:     {β-formulation} β_t = U_{t-1}^T x_t
3:     {predict response} compute the predicted class probabilities from β_t and the current regression coefficients
4:     compute the Grassmannian gradient using (9)
5:     find the dominant singular value σ_1 and the corresponding left singular vector w_1 and right singular vector v_1 of the gradient
6:     {update subspace} update U_t along the geodesic using the rank-one triplet (σ_1, w_1, v_1)
7:     {update regression coefficients}
8:  end for
Algorithm 4 OSDR for multinomial logistic regression

III-E Support vector machine (SVM)

The loss function for SVM is the hinge loss. For the β- and x-formulations, the loss functions are

ℓ = max{0, 1 − y θ^T U^T x},   where y ∈ {−1, +1},

and

ℓ = max{0, 1 − y w^T U β̂},   where y ∈ {−1, +1}.

Note that the hinge loss is not differentiable. We may use its sub-gradient to perform gradient descent, or find a smooth surrogate function to approximate the hinge loss. The Grassmannian sub-gradients for the two loss functions are

grad ℓ = −I{y θ^T U^T x < 1} y (I − U U^T) x θ^T

and

grad ℓ = −I{y w^T U β̂ < 1} y (I − U U^T) w β̂^T.

These sub-gradients are again rank-one and, hence, we may update U along the geodesic efficiently.
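A sketch of one hinge-loss OSDR step (β-formulation, y ∈ {−1, +1}) follows; the sub-gradient is active only when the margin is violated, and the geodesic step reuses the rank-one form from Section III-A. The thresholds and step-size conventions here are our own choices, not the authors' exact update.

```python
import numpy as np

def osdr_svm_step(U, theta, x, y, eta_u, eta_theta):
    """One OSDR step with the hinge loss (beta-formulation, y in {-1, +1})."""
    beta = U.T @ x
    if y * (theta @ beta) < 1.0:              # margin violated: sub-gradient is nonzero
        r = x - U @ beta                      # component of x off the subspace
        rn, tn = np.linalg.norm(r), np.linalg.norm(theta)
        if rn > 1e-12 and tn > 1e-12:
            sigma = rn * tn                   # |y| = 1
            r_hat = y * r / rn                # left direction of the descent step
            p_hat = theta / tn
            U = U + (np.cos(sigma * eta_u) - 1.0) * np.outer(U @ p_hat, p_hat) \
                  + np.sin(sigma * eta_u) * np.outer(r_hat, p_hat)
        theta = theta + eta_theta * y * (U.T @ x)   # sub-gradient step on theta
    return U, theta
```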

III-F Random dot product graph model

The random dot product graph model is useful for relational data, which arise frequently in social network studies [19, 20, 21]. The model assumes that each node is associated with a feature vector, and an edge between two nodes is formed with probability proportional to the inner product between their features. Suppose at each time t we observe a pair of nodes in the network with predictor vectors x_i and x_j, as well as their relation indicator variable y (i.e., whether an edge is formed or not). We assume a logistic regression model based on feature vectors β_i = U^T x_i and β_j = U^T x_j that are projections of x_i and x_j. Here our goal is to choose a proper subspace U that fits the data well.

Note that given x_i and x_j, the inner product can be estimated as β_i^T β_j = x_i^T U U^T x_j, which involves a quadratic term in U (rather than a linear one as in the other cases). To obtain the rank-one structure, we use a two-step strategy: first fix β_j and update U, and then fix β_i and update U. For fixed β_j, the log-likelihood is that of a logistic regression with feature β_i and coefficient vector β_j; similar to logistic regression, its Grassmannian gradient is rank-one, and we may update the subspace accordingly. In the second step, the roles of β_i and β_j are swapped and the subspace is updated similarly. Finally, we fix U (and hence β_i and β_j) and update the logistic regression parameters. A description of the complete algorithm is omitted here, as it is similar to the case of logistic regression.
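The two-step strategy can be sketched in Python as follows, under the simplifying assumption that the edge probability is the sigmoid of the inner product of the projected features and that there are no additional logistic parameters; this is our own rendering, not the authors' exact update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rdpg_osdr_step(U, x_i, x_j, y, eta):
    """Two-step subspace update for the random dot product model: fix one
    node's projection, update U as in logistic regression, then swap roles."""
    for a, b in ((x_i, x_j), (x_j, x_i)):
        beta_a, beta_b = U.T @ a, U.T @ b      # beta_b is held fixed in this step
        err = y - sigmoid(beta_a @ beta_b)     # prediction error for the pair
        r = a - U @ beta_a                     # component of a off the subspace
        rn, bn = np.linalg.norm(r), np.linalg.norm(beta_b)
        if rn > 1e-12 and bn > 1e-12:
            sigma = abs(err) * rn * bn
            r_hat = np.sign(err) * r / rn
            p_hat = beta_b / bn
            U = U + (np.cos(sigma * eta) - 1.0) * np.outer(U @ p_hat, p_hat) \
                  + np.sin(sigma * eta) * np.outer(r_hat, p_hat)
    return U
```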

III-G Union-of-subspaces model

The union-of-subspaces model [22] and the multi-scale union-of-subspaces model [23, 24, 13] have been used to approximate the manifold structure of the state. As illustrated in Fig. 3, the multi-scale union-of-subspaces model is a tree of subsets defined on low-dimensional affine spaces, with each subset lying on a low-dimensional hyperplane and parameterized by a subspace basis U_{j,k}, an offset c_{j,k} from the origin, and a diagonal matrix Λ_{j,k}. Here j denotes the scale or level of the subset in the tree (up to the tree depth at time t), and k denotes the index of the subset within that level. The diagonal matrix Λ_{j,k} contains the eigenvalues of the covariance matrix of the data projected onto each hyperplane; this parameter specifies the shape of the ellipsoid by capturing the spread of the data within the hyperplane.

We may couple the subspace tree with a regression model by attaching a set of regression coefficients to each subset. Given x_t, we find the subset in the union that has the smallest affinity, estimate β_t by projecting onto the associated subspace, and use the associated (linear or logistic) regression coefficients to predict y_t. The affinity can be a distance to the subset, similar to what has been used in discriminant analysis or in [13], or simply the distance to the subspace; we then use the local regression coefficients associated with that subset for prediction. OSDR can be derived for these models by combining a step that finds the subset with the smallest affinity with a subsequent subspace and parameter update similar to the linear or logistic regression OSDR (a sketch of the selection step is given below).
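The selection step can be sketched as follows; each subset is represented here by a basis, an offset, and its regression coefficients, and the affinity is taken to be the simple distance to the affine subspace (the dictionary keys and the affinity choice are ours).

```python
import numpy as np

def closest_subset(x, subsets):
    """Return the subset of the union whose affine subspace is closest to x.
    Each subset is a dict with keys 'U' (orthonormal basis), 'c' (offset),
    and 'coef' (its local regression coefficients)."""
    best, best_dist = None, np.inf
    for s in subsets:
        d = x - s['c']
        resid = d - s['U'] @ (s['U'].T @ d)   # component of d off the affine subspace
        dist = np.linalg.norm(resid)
        if dist < best_dist:
            best, best_dist = s, dist
    return best
```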

We may also use this model together with the random dot product graph model for social networks, where two users may belong to two different subsets in the tree and their interaction is determined by the regression model associated with their common parent in the tree. This may capture the notion of community: each node represents one community and there is a logistic regression model for each community. The probability that two users interact is determined by the “smallest” community that they are both in. In this case, OSDR will be two-stage: classification based on affinity function followed by a two-step subspace update similar to the OSDR for the random dot product model. Section V-C presents one such example for illustration.

Fig. 3: Multi-scale union-of-subspaces model [24]. Data are bi-partitioned iteratively to form nested subsets, and a low-dimensional affine space (with an offset from the origin) is fitted to the data in each subset. The bi-partitioning creates a binary tree with a multi-scale representation of the data: nodes of the tree contain the parameters for that subset, and leaves represent the union of subspaces used at time t. The lower panel illustrates the structure of these subspaces. The black dots represent historical data used to estimate the subspaces. This model can be used in conjunction with, for example, a random dot product graph to model social networks: at each time t, two persons with features from two subsets interact with probability proportional to the inner product of their projected features.

IV Theoretical analysis

The general OSDR problem is hard to analyze due to its non-convexity and the bilinearity in U and the model parameters. In this section, we study the optimization problem for linear regression with the β-formulation to obtain some theoretical insights. In the linear regression case, the loss function is the squared error. When there is no missing data, the projection coefficient is given by β = U^T x. In a simplified setting, assume the response variable is generated according to a linear model with a true parameter; the loss function is then a quadratic form in the projected data.

IV-A Fixed point with respect to U

First, we show that for a fixed model parameter, the optimization problem with respect to the subspace will converge to an orthonormal matrix even without the orthogonality constraint. We make the simplifying assumption that the true response is linear with parameter equal to the assumed parameter θ. Then one step of the alternating minimization can be written as

(11)

This problem is non-convex due to the constraints as well as the quadratic term in the objective function. Without loss of generality, we may assume θ has unit norm, and we construct an orthogonal matrix with θ as its first column. We then consider the following optimization problem related to (11), which will help us establish properties later.

(12)
Theorem 1.

Given a fixed orthogonal matrix, the stationary points of the optimization problem (12) without the orthogonality constraint,

(13)

are orthogonal matrices whose columns are the dominant eigenvectors of the covariance matrix of the data x. Here we assume the covariance matrix has distinct dominant eigenvalues.

We need the following lemma for the proof.

Lemma 1 ([25]).

Let A be a positive semi-definite matrix, and let f be the quadratic function of a matrix variable W considered in [25] (whose gradient is given there). Then W is a stationary point of f if and only if W = V Q, where the columns of V are any set of distinct eigenvectors of A and Q is an arbitrary orthonormal matrix. Moreover, all stationary points of f are saddle points except when V contains the dominant eigenvectors of A, in which case f attains its global minimum.

Proof of Theorem 1.

Let Σ denote the covariance matrix of x, Σ = E[x x^T]. We may write the objective function of (13) as a quadratic function of the subspace variable involving Σ, and compute its partial derivative accordingly. If we choose the remaining columns of the constructed orthogonal matrix properly, the problem reduces to the form in Lemma 1, and thus the only stationary points of the optimization problem are given by distinct dominant eigenvectors of the matrix Σ (assuming Σ is full-rank).

IV-B Convergence

In the same setting, if we fix the model parameter, we may establish the following local convergence property with respect to the Grassmannian gradient of the loss. Consider the case where x_t lies exactly on the true subspace. As a metric, we use the principal angles between the estimated subspace and the true subspace, together with some auxiliary quantities defined for the analysis. When there is no missing data these quantities simplify, and typically the required condition can be assumed to hold. Hence, we can choose a set of step-sizes properly such that

(14)

Define such that