## I Introduction

Kernel-based expansions can boost the generalization capability of learning tasks by powerfully modeling nonlinear functions, when linear functions fall short in practice. When provided with sufficient training data, kernel methods can approximate arbitrary nonlinear functions with desired accuracy. Although “data deluge” sets the stage by providing the “data-hungry” kernel methods with huge datasets, limited memory and computational constraints prevent such tools from fully exploiting their learning capabilities. In particular, given training vectors , kernel regression or classification machines take operations to form the kernel matrix , memory to store it, and

computational complexity to find the sought predictor or classifier.

In this context, several efforts have been made in different fields of stochastic optimization, functional analysis, and numerical linear algebra to speed up kernel machines for “big data” applications [21, 9, 41, 12, 52, 33, 26].
A common approach to scaling up kernel methods is to approximate the kernel matrix by a low-rank factorization; that is, , where with is the reduced rank, through which storage and computational requirements go down to and , respectively. Kernel (K)PCA [38] provides a viable factorization for a such low-rank approximation, at the cost of order computations. Alternatively, a low-rank factorization can be effected by randomly selecting training vectors to approximate the kernel matrix [23].
Along these lines, Nystrom approximation [52], and its advanced renditions [12, 54, 22, 44, 48] are popular among this class of randomized factorizations. They trade off accuracy in approximating with , for reducing KPCA complexity from to . Their merits are well-documented for nonlinear regression and classification tasks performed offline [8, 55, 1]. Rather than factorizing , one can start from high-dimensional (lifted) feature vectors whose inner product induces the kernel [33, 53, 25, 26, 42]. Approximating through an vector , the nonlinear kernel can be approximated by a linear one as . Exploiting the fast linear learning machines [13, 41], the kernel-based task then reduces to learning a linear function over features , which can be achieved in operations. Such a computationally attractive attribute is common to both kernel matrix factorization and lifted feature approximation. Note however, that online Nystrom-type schemes are *not* available, while feature approximation algorithms are randomized, and thus they are *not* data driven.

Different from kernel matrix and feature approximations performed in batch form, online kernel-based learning algorithms are of paramount importance. Instead of loading the entire datasets in memory, online methods iteratively pass over the set from an external memory [21, 41, 45, 5, 18, 40, 20]. This is also critical when the entire dataset is not available beforehand, but is acquired one datum at a time. For large data streams however, as the number of data increases with time, the support vectors (SVs) through which the function is estimated, namely the set in the approximation , also increases in size. Thus, the function evaluation delay as well as the required memory for storing the SV set eventually become unaffordable. Efforts have been devoted to reducing the number of SVs while trying to maintain performance on unseen data (a.k.a. generalization capability) [11]. In more recent attempts, by restricting the maximum number of SVs to a predefined budget , the growth of algorithmic complexity is confined to an affordable limit, that is maintained throughout the online classification [50, 49, 10] or regression [47] task.

The present work builds a generative model according to which the high (possibly infinite)-dimensional features are approximated by their projection onto a low-rank subspace, thus providing a linear kernel function approximation (Section II). In contrast to [33, 12, 54, 25], where due to the nature of randomization the number of required features for providing an accurate kernel function approximation is often large, systematically learning the ambient nonlinear subspace yields an accurate approximation through a smaller number of extracted features.

Offline and online solvers for subspace learning are developed, and their convergence is analyzed in Sections III and IV respectively. In order to keep the complexity and memory requirements affordable, budgeted versions of the proposed algorithms are devised in Section V, in which the number of stored data vectors is confined to a predefined budget . Budget maintenance is performed through a greedy approach, whose effectiveness is corroborated through simulated tests. This is the first work to address *dynamic* nonlinear (kernel-based) feature extraction under limited memory resources.

Analytical results in Section VI provide performance bounds on how well the kernel matrix as well as kernel-based classification and regression can be approximated by leveraging the novel budgeted online subspace-learning and feature-extraction approach. Finally, Section VII presents experiments on synthetic and real datasets, demonstrating the efficiency of the proposed methods in terms of accuracy and run time.

## Ii Preliminaries and Problem Statement

Consider real data vectors of size . As large values of and hinder storage and processing of such datasets, extracting informative features from the data (a.k.a. dimensionality reduction) results in huge savings on memory and computational requirements. This fundamentally builds on the premise that the informative part of the data is of low dimension , and thus the data are well represented by the generative model

(1) |

where the tall matrix has rank ; vector is the projection of onto the column space of ; and denotes zero-mean additive noise.

Pursuit of the subspace and the low-dimensional features is possible using a blind least-squares (LS) criterion regularized by a rank-promoting term using e.g., the nuclear norm of , where [35]. Albeit convex, nuclear-norm regularization is not attractive for sequential learning.

To facilitate reducing the computational complexity, it is henceforth assumed that an upper bound on the rank of matrix is given . ^{1}^{1}1In practice, the rank is controlled by tuning regularization parameter, as it can be made small enough for sufficiently large . Thus, building on the work of [29] by selecting , and to
arrive at a scalable subspace tracker, here we surrogate the nuclear norm with the summation of the Frobenious-norms of and , which yields (cf. Prop. 1 in [29] for proof on equivalence)

(2) |

where controls the tradeoff between LS fit and rank regularization [28]

. Principal component analysis (PCA) - the “workhorse” of dimensionality reduction- solves (

2) when the rank regularization is replaced with orthonormality constraints on . Undoubtedly, the accuracy of any linear dimensionality reduction method is dictated by how well the model (1) fits a given dataset, which is related to how well the corresponding data covariance matrix can be approximated by a low-rank matrix [17, p. 534].In practice however, low-rank linear models often fail to accurately capture the datasets. A means to deal with nonlinearities in pattern recognition tasks, is to first map vectors to a higher -dimensional space using a function (possibly with ), and subsequently seek a linear function over the lifted data . This map induces a so-termed kernel function . Selecting the kernel to have a closed-form expression circumvents the need to explicitly know - what is referred to as the “kernel trick.” Similarly, the norm corresponding to the reproducing kernel Hilbert space (RKHS) is defined as . Upon defining the matrix , the kernel matrix related to the covariance of the lifted data is formed with entry as , where . Its computation and storage incurs complexity and respectively, which is often not affordable when and/or .

Fortunately, for large data sets in practice has approximately low rank. This fact is exploited in e.g., [53, 12] and [52] to approximate via a low-rank factorization, hence reducing the evaluation and memory requirements of offline kernel-based learning tasks from down to . Here, we further build on this observation to deduce that the low-rank property of implies that can also be approximated by a low-rank matrix, thus motivating our pursuit of online low-rank factorization of . To this end, instead of projecting s onto the columns of as in (2), we will project s on , whose columns span what we refer to as “virtual” column subspace since can be infinite. Specifically, we consider [cf. (2)]

(3) |

where the -norm has been substituted by the -norm in the -dimensional Hilbert space. Similarly, let the Hilbert–Schmidt operator be defined as with denoting the -th column of . Note that for Euclidean spaces, the Hilbert-Schmidt norm reduces to the Frobenious norm.

Observe also that similar to the linear model in (2), upon removing the regularization terms and adding the orthonormality constraints on the columns of , (3) reduces to that of KPCA (without centering) in primal domain [38, p. 429]. The present formulation in (3) however, enables us to develop sequential learning algorithms, which will later be enhanced with a tracking capability for dynamic datasets.

For a fixed , the criterion in (3) is minimized by

(4) |

where the factor can be viewed as “morphing’ the columns of to offer a flexible basis for the lifted data. Substituting (4) back into (3) and exploiting the kernel trick, we arrive at

(5) | ||||

where the vector in (5) is the -th column of , and since has size , the minimization in (5) does not depend on .

Our goal is to develop and analyze batch as well as online solvers for (5). By pre-specifying an affordable complexity for the online solver, we aim at a low-complexity algorithm where subspace learning and feature extraction can be performed on-the-fly for streaming applications. Furthermore, we will introduce a novel approach to extracting features on which the kernel-based learning tasks of complexity can be well approximated by linear counterparts of complexity , hence realizing great savings in memory and computation while maintaining performance. A remark is now in order.

Remark 1. The subspace in (4) can be thought as a dictionary whose atoms are morphed via factor . Sparse representation over kernel-based dictionaries have been considered [37, 16, 46, 30]. Different from these approaches however, the novelty here is on developing algorithms that can process streaming datasets, possibly with dynamic underlying generative models. Thus, our goal is to efficiently learn and track a dictionary that adequately captures streaming data vectors, and can afford a low-rank approximation of the underlying high-dimensional map.

## Iii Offline kernel based Feature Extraction

Given a dataset and leveraging the bi-convexity of the minimization in (5), we introduce in this section a batch solver, where two blocks of variables ( and ) are updated alternately. The following two updates are carried out iteratively until convergence.

Update 1. With given from iteration , the projection vectors in iteration are updated as

The minimizer of (6a) yields the features as regularized projection coefficients of the lifted data vectors onto the virtual subspace , and is given in closed form by

(7) |

Update 2. With fixed and after dropping irrelevant terms, the subspace factor is obtained as [cf. (5)]

(8) |

Since is positive definite in practice, (III) involves a strictly convex minimization. Equating the gradient to zero, yields the wanted subspace factor in closed form

(9) |

Algorithm 1 provides the pseudocode for the update rules (III) and (9) of the batch solver, and the following proposition gives a guarantee on the convergence of the proposed solver to a local stationary point.

Proposition 1. *For positive definite kernels and , the sequence generated by Algorithm 1 converges to a stationary point of the minimization in (5).*

*Proof*: Since the minimizations in (6a) and (III) are strictly convex with unique solutions, the result follows readily from [3, p. 272].

Since matrix inversions in (III) and (9) cost , and and have size and , respectively, the per iteration cost is . Although the number of iterations needed in practice for Algorithm 1 to converge is effectively small, this per iteration complexity can be unaffordable for large datasets. In addition, datasets are not always available offline, or due to their massive volume, can not be uploaded into memory at once. To cope with these issues, an online solver for (5) is developed next, where the updates are carried out by iteratively passing over the dataset one datum at a time.

## Iv Online kernel based feature extraction

This section deals with low-cost, on-the-fly updates of the ‘virtual’ subspace , or equivalently its factor as well as the features

that are desirable to keep up with streaming data. For such online updates, stochastic gradient descent (SGD) has well-documented merits, especially for

*parametric*settings. However, upon processing data vectors, has size , which obviously grows with . Hence, as the size of increases with the number of data, the task of interest is a

*nonparametric*one. Unfortunately, performance of SGD on nonparametric learning such as the one at hand is an uncharted territory. Nevertheless, SGD can still be performed on the initial formulation (3), where solving for the virtual constitutes a parametric task, not dependent on .

Starting with an update for , an update for will be derived first, as an alternative to those in [41, 9], and [49]. Next, an SGD iteration for will be developed in subsection IV-B, while in subsection IV-C a connection between the two update rules will be drawn, suggesting how SGD can be broadened to learning nonparametric models as well.

### Iv-a SGD on “parametric” subspace tracking

Suppose that is acquired at time , posing the overall joint subspace tracking and feature extraction problem as [cf. (3)]

(10) |

Using an alternating minimization approach, we update features and the subspace per data vector as follows.

Update 1. Fixing the subspace estimate at its recent value from time , the projection vector of the new data vector is found as [cf. (6a)]

(11a) | ||||

which through the kernel trick readily yields | ||||

(11b) |

Although (11) can be done for all the previous features as well, it is skipped in practice to prevent exploding complexity. In the proposed algorithm, feature extraction is performed only for the most recent data vector .

Update 2. Having obtained , the subspace update is given by solving

(12) |

where Solving (12) as time evolves, becomes increasingly complex, and eventually unaffordable. If data

satisfy the law of large numbers, then (

12) approximates, where expectation is with respect to the unknown probability distribution of the data. To reduce complexity of the minimization, one typically resorts to stochastic approximation solvers, where by dropping the expectation (or the sample averaging operator), the ‘virtual’ subspace update is

(13) |

with denoting a preselected stepsize, and the gradient of the -th summand in (12) given by

(14) |

Because has size regardless of , iteration (13) is termed “parametric” Using (4) to rewrite , and substituting into (13), yields

(15) |

which suggests the following update rule for factor

(16) |

Even though (16) is not the only iteration satisfying (IV-A), it offers an efficient update of the factor . The update steps for the proposed parametric tracker are summarized as Algorithm 2. Note that the multiplication and inversion in (9) are avoided. However, per data vector processed, the kernel matrix is expanded by one row and one column, while the subspace factor grows accordingly by one row.

### Iv-B SGD for “nonparametric” subspace tracking

In this subsection, the feature extraction rule in (11) is retained, while the update rule (16) is replaced by directly acquiring the SGD direction along the gradient of the instantaneous objective term with respect to . Since, in contrast to the fixed-size matrix , the number of parameters in grows with , we refer to the solver developed in this subsection as a *nonparametric* subspace tracker. Furthermore, the connection between the two solvers is drawn in subsection IV-C, and convergence of the proposed algorithm is analyzed in subsection IV-D.

At time instance , subproblem (12) can be expanded using the kernel trick as

(17) |

where

(18) |

with given by (6b). Stochastic approximation solvers of (17) suggest the update

(19a) | ||||

where denotes the user-selected step size, and denotes the gradient of the -th summand in (17) with respect to that is given by | ||||

(19b) | ||||

Substituting (19) into (19a) yields the desired update of which together with (11) constitute our nonparametroc solver, tabulated under Algorithm 3.

### Iv-C Parametric vis-a-vis nonparametric SGD updates

Considering that holds for all , it is apparent from (19) and (IV-A) that . The latter implies that the update rule in (19a) amounts to performing SGD on with a matrix stepsize ; that is,

(20) |

It is important to check whether this constitutes a valid descent direction, which is guaranteed since

(21) |

where

For positive-definite e.g., Gaussian kernel matrices, we have , which guarantees that is a descent direction [3, p. 35]. Leveraging this link, Algorithm 3 will be shown next to enjoy the same convergence guarantee as that of Algorithm 2.

Remark 2. Although the SGD solver in Algorithm 3 can be viewed as a special case of Algorithm 2, developing the parametric SGD solver in Algorithm 2 will allow us to analyze convergence of the two algorithms in the ensuing subsections.

### Iv-D Convergence analysis

The cost in (10) can be written as

(22) |

with , and as in (12). Thus, the minimization in (10) is equivalent to . To ensure convergence of the proposed algorithms, the following assumptions are adopted.

(A1) * independent identically distributed; and*

(A2) *The sequence is bounded.*

Data independence across time is standard when studying the performance of online algorithms [28], while boundedness of the iterates , corroborated by simulations, is a technical condition that simplifies the analysis, and in the present setting is provided due to the Frobenious-norm regularization. In fact, rewriting subspace update in Alg. 2 yields

which consists of: i) contraction of the most recent subspace iterate; and, ii) an additive term. Thus, with proper selection of the diminishing step size , A2 is likely to hold. The following proposition provides convergence guarantee for the proposed algorithm.

Proposition 2.* Under (A1)-(A2), if with and , then the subspace iterates in (13) satisfy almost surely; that is, , thus the sequence falls into the stationary point of (10).*

*Proof*: Proof is inspired by [27], and a sketch of the required modifications can be found in the Appendix.

So far, we have asserted convergence of the SGD-based algorithm for the “virtual” provided by Algorithm 2. A related convergence result for Algorithm 3 is guaranteed by the following argument.

Proposition 3.* Under (A1)-(A2) and for positive definite radial kernels, if with and , then the subspace iterates in (19a) satisfy almost surely; that is, , and the subspace iterates will converge to the stationary point of (10).*

*Proof:*
The proof follows the steps in Proposition 2, with an extra step in the construction of the appropriate surrogate cost in Step 1.
In particular, using that the optimal subspace is of the form , the objective can be further majorized over the subset of virtual subspaces , by

for which we have

The Cauchy-Schwarz inequality implies that

and by choosing , we will have . Selecting now as the new surrogate whose minimizer coincides with the update rule in (19a), the rest of the proof follows that of Prop. 2.

## V Reduced-complexity OK-FE on a budget

Per data vector processed, the iterative solvers of the previous section have one column of and one row of added, which implies growing memory and complexity requirements as grows. The present section combines two means of coping with this formidable challenge: one based on censoring uninformative data, and the second based on *budget maintenance*.
By modifying Algorithms 2 and 3 accordingly, memory and complexity requirements are rendered affordable.

### V-a Censoring uninformative data

In the LS cost that Algorithms 2 and 3 rely on, small values of the fitting error can be tolerated in practice without noticeable performance degradation. This suggests modifying the LS cost so that small fitting errors (say up to ) induce no penalty, e.g., by invoking the insensitive cost that is popular in support vector regression (SVR) settings [17].

Consider henceforth positive-definite kernels for which low-rank factors offer an approximation to the full-rank kernel matrix, and lead to a generally nonzero LS-fit . These considerations suggest replacing the LS cost with

(23) | |||

By proper choice of , the cost implies that if , the virtual is captured well enough by the virtual current subspace , and the solver will not attempt to decrease its LS error, which suggests skipping the augmentation of , provided by the new lifted datum [4].

In short, if the upper branch of (23) is in effect, is deemed uninformative, and it is censored for the subspace update step; whereas having the lower branch deems informative, and augments the basis set of the virtual subspace. The latter case gives rise to what we term online support vectors (OSV), which must be stored, while ‘censored’ data are discarded from subsequent subspace updates.

In order to keep track of the OSVs, let denote the set of indices corresponding to the SVs revealed up to time . Accordingly, rewrite , and the modified LS cost as , depending on which of the following two cases emerges.

C1. *If , the OSV set will not grow, and we will have ; or,*

C2. *If , the OSV set will grow, and we will have .*

The subspace matrix per iteration will thus take the form , where , with , and . Upon replacing in Algorithm 3 with , Algorithm 4 gives the pseudocode for our reduced-complexity online kernel-based feature extraction (OK-FE), which also includes a budget maintenance module that will be presented in the ensuing Section V-B.

Modifying the LS-fit in (23) and discarding the censored data, certainly reduce the rate at which the memory and complexity requirements increase. In practice, thresholding is enforced after the budget is exceeded, when one needs to discard data. Regarding the selection of the threshold value, the later may be initialized at zero and be gradually increased until the desired censoring rate is reached ( final threshold value will depend on the average fitting error and desired censoring rate) ; see also [4] for related issues. Albeit at a slower rate,

Comments

There are no comments yet.