DeepAI

# Stabilized Sparse Online Learning for Sparse Data

Stochastic gradient descent (SGD) is commonly used for optimization in large-scale machine learning problems. Langford et al. (2009) introduce a sparse online learning method to induce sparsity via truncated gradient. With high-dimensional sparse data, however, the method suffers from slow convergence and high variance due to the heterogeneity in feature sparsity. To mitigate this issue, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector where the imposed shrinkage is adaptive to the amount of information available in each feature. The variability in the resulted sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate, which leads to a balanced trade-off between exploration and exploitation in learning a sparse weight vector. Numerical experiments show that our algorithm compares favorably with the original algorithm in terms of prediction accuracy, achieved sparsity and stability.

• 2 publications
• 14 publications
06/12/2021

03/22/2016

Stochastic gradient descent is the method of choice for large-scale mach...
12/31/2020

Stochastic gradient descent (SGD) has taken the stage as the primary wor...
06/28/2015

### Stochastic Gradient Made Stable: A Manifold Propagation Approach for Large-Scale Optimization

Stochastic gradient descent (SGD) holds as a classical method to build l...
02/16/2017

### Unbiased Online Recurrent Optimization

The novel Unbiased Online Recurrent Optimization (UORO) algorithm allows...
01/18/2021

### Screening for Sparse Online Learning

Sparsity promoting regularizers are widely used to impose low-complexity...
08/24/2020

### Noise-induced degeneration in online learning

In order to elucidate the plateau phenomena caused by vanishing gradient...

None

## 1 Introduction

Modern datasets pose many challenges for existing learning algorithms due to their unprecedented large scales in both sample sizes and input dimensions. It demands both efficient processing of massive data and effective extraction of crucial information from an enormous pool of heterogeneous features. In response to these challenges, a promising approach is to exploit online learning methodologies that performs incremental learning over the training samples in a sequential manner. In an online learning algorithm, one sample instance is processed at a time to obtain a simple update, and the process is repeated via multiple passes over the entire training set. In comparison with batch learning algorithms in which all sample points are scrutinized at every single step, online learning algorithms have been shown to be more efficient and scalable for data of large size that cannot fit into the limited memory of a single computer. As a result, online learning algorithms have been widely adopted for solving large-scale machine learning tasks (Bottou, 1998).

In this paper, we focus on first-order subgradient-based online learning algorithms, which have been studied extensively in the literature for dense data.111Dense data is defined as a dataset in which the number of nonzero entries in all columns of its design matrix are in the order of while the ones of sparse data are in the order of or less. Among these algorithms, popular methods include the Stochastic Gradient Descent (SGD) algorithm (Zhang, 2004; Bottou, 2010), the mirror descent algorithm (Beck and Teboulle, 2003) and the dual averaging algorithm (Nesterov, 2009). Since these methods only require the computation of a (sub)gradient for each incoming sample, they can be scaled efficiently to high-dimensional inputs by taking advantage of the finiteness of the training sample. In particular, the stochastic gradient descent algorithm is the most commonly used algorithm in the literature of subgradient-based online learning. It enjoys an exceptionally low computational complexity while attaining steady convergence under mild conditions (Bottou, 1998)

, even for cases where the loss function is not everywhere differentiable.

Despite of their computational efficiency, online learning algorithms without further constraint on the parameter space suffers the “curse of dimensionality” to the same extent as their non-online counterparts. Embedded in a dense high-dimensional parameter space, not only does the resulted model lack interpretability, its variance is also inflated. As a solution,

sparse online learning was introduced to induce sparsity in the parameter space under the online learning framework (Langford et al., 2009)

. It aims at learning a linear classifier with a sparse weight vector, which has been an active topic in this area. For most efforts in the literature, sparsity is introduced by applying

regularization on a loss function as in the classific LASSO method (Tibshirani, 1996; Shalev-Shwartz and Tewari, 2011). For example, Duchi and Singer (2009) extend the framework of Forward-Backward splitting (Lions and Mercier, 1979) by alternating between an unconstrained truncation step on the sample gradient and an optimization step on the loss function with a penalty on the distance from the truncated weight vector.Langford et al. (2009) and Carpenter (2008) both explore the idea of imposing a soft-threshold on the weight vector updated by the stochastic gradient descent algorithm:

 wj=sign(wj)max(|wj|−λ,0),j=1,…,p.

This class of methods is known as the truncated gradient algorithm. For every standard SGD updates, the weight vector is shrunk by a fixed amount to induce sparsity. In the work of Duchi et al. (2010), the same strategy has also been combined with a variant of the mirror descent algorithm (Beck and Teboulle, 2003). Wang et al. (2015) further extends the truncated gradient framework to adjust for cost-effectiveness. This simple yet efficient method of truncated gradients particularly motivates the algorithm proposed in this paper. Strategies different from the truncation-based algorithm have also been proposed. For example, Xiao (2009) proposed the Regularized Dual-Averaging (RDA) algorithm which builds upon the primal-dual subgradient method by Nesterov (2009). The RDA algorithm learns a sparse weight vector by solving an optimization problem using the running average over all preceding gradients, instead of a single gradient at each iteration.

Closely related to sparse online learning is another area of active research, online feature selection. Instead of enforcing just a shrinkage on the weight vectors via

regularization, online feature selection algorithms explicitly invoke feature selection by imposing a hard

constraint on the weight vector, (e.g., Wang et al., 2014; Wu et al., 2014). In other words, online feature selection algorithms focus on generating a resulted weight vector that has a high sparsity level by directly shrinking a large proportion of the weights directly to zero (also referred to as a hard thresholding). In practice, regularization is computationally expensive to solve due to its non-differentiability. The set of selected features also suffers from high variability as the decisions of hard-thresholding are based on single random samples in an online learning setting. Therefore, important features can be discarded simply owing to random perturbations.

Most recent subgradient-based online learning algorithms do not consider potential structures or heterogeneity in the input features. As pointed out by Duchi et al. (2011), current methods largely follow a predetermined procedural scheme that is oblivious to the characteristics of data being used at each iteration. In large-scale applications, a common and important structure is heterogeneity in sparsity levels of the input features, i.e., the variability in the number of nonzero entries among features. For instance, consider the bag-of-word features in text mining applications.222Here, by sparse features, we refer to features for which most samples assume a constant value (e.g., 0) and a few samples take on other values. Without loss of generality, we assume the majority constant is 0 throughout this paper. For a learning task, the importance of a feature is not necessarily associated with the frequencies of its values. In genetics, for example, rare variants ( in the population) have been found to be associated with disease risks (Morris and Zeggini, 2010). Both dense and sparse features may contain important information for the learning task. However, in the presence of heterogeneity in sparsity levels, using a simple regularization in an online setting will predispose rare features to be truncated more than necessary. The resulted sparse weight vectors usually exhibit high variance in terms of both weight values and the membership in the set of features with nonzero weights. As a result, the convergence of the standard truncation-based framework may also be hampered by this high variability. When the amount of information is scarce due to sparsity at each iteration, the convergence of the weight vector would understandably take a large number of iterations to approach the optimum. In two recent papers, Oiwa et al. (2011) and Oiwa et al. (2012) tackle this problem via penalty weighted by the accumulated norm of subgradients for extending several basic frameworks in sparse online learning. Their results suggest that, by acknowledging the sparsity structure in the features, both prediction accuracy and sparsity are improved over the original algorithms while maintaining the same convergence rate. However, their resulted weight vectors are unstable as the imposed subgradient-based regularization are excessively noisy due to the randomness of incoming samples in online learning. The membership in the set of selected features with nonzero weights is also very sensitive to the orderings of the training samples.

In this paper, we propose a stabilized truncated stochastic gradient descent algorithm for high-dimensional sparse data. The learning framework is motivated by that of the Truncated Gradient algorithm proposed by Langford et al. (2009). To deal with the aforementioned issues with sparse online learning methods applied to high-dimensional sparse data, we introduce three innovative components to reduce variability in the learned weight vector and stabilize the selected features. First, when applying the soft-thresholding, instead of a uniform truncation on all features, we perform only informative truncations, based on actual information from individual features during the preceding computation window of updates. By doing so, we reduce the heterogeneous truncation bias associated with feature sparsity. The key idea here is to ensure that each truncation for each feature is based on sufficient information, and the amount of shrinkage is adjusted for the information available on each feature. Second, beyond the soft-thresholding corresponding to the ordinary regularization, the resulted weight vector is stabilized by staged purges of irrelevant features permanently from the active set of features. Here, irrelevant features are defined as features whose weights have been repeatedly truncated. Motivated by stability selection introduced in Meinshausen and Bühlmann (2010), these permanent purges prevent irrelevant features from oscillating between the active and non-active set of features, The “purging” process also resembles hard-thresholding in online feature selection and results in a stabler sparse solution than other sparse online learning algorithms. Results on the theoretical regret bound (See Section 4

) show that this stabilization step helps improve over the original truncated gradient algorithm, especially when the target weight vector is notably sparse. To attune the proposed learning algorithm to the sparsity of the remaining active features, the third component of our algorithm is adjusting the amount of shrinkage progressively instead of fixing it at a predetermined value across all stages of the learning process. A novel hyperparameter,

rejection rate, is introduced to balance between exploration of different sparse combinations of features at the beginning and the exploitation

of the selected features to construct accurate estimate at a later stage. Our method gradually anneal the rejection rate to acquire the necessary amount of shrinkage on the fly for achieving the desired balance.

The rest of paper is organized as follows. Section 2 reviews the Truncated Gradient algorithm based on Stochastic Gradient Descent (SGD) framework for sparse learning proposed in Langford et al. (2009). In Section 3, we introduce, in details, the three novel components of our proposed algorithm. Theoretical analysis of the expected online regret bound is given in Section 4, along with the computational complexity. Section 5 gives practical remarks for efficient implementation. In Section 6, we evaluate the performance of the proposed algorithm on several real-world high-dimensional datasets with varying sparsity levels. We illustrate that the proposed method leads to improved stability and prediction performance for both sparse and dense data, with the most improvement observed in data with the highest average sparsity level. Section 7 concludes with further discussion on the proposed algorithm.

## 2 Truncated Stochastic Gradient Descent for Sparse Learning

Assume that we have a set of training data , where the feature vector and the scalar output . In the following, we use to represent the vector of the sample of length and for the feature vector of all samples of length . In this paper, we are interested in the case that both and are large and the feature vectors ’s, , are sparse. We consider a loss function that measures the cost of predicting when the truth is . The prediction is given by function from a family parametrized by a weight vector . Denote . The learning goal is to obtain an optimal weight vector that minimize the loss function over the training data, with sparsity in the weight vector induced by a regularization term . We can then formulate the learning task as a regularized minimization problem:

 ^w =argminw∈Rp n∑i=1L(w,zi)+Ψ(w). (1)

The above optimization problem is often solved using some version of gradient descent. When both and are large, the computation becomes very demanding. To address this computational complexity, the Stochastic Gradient Descent (SGD) algorithm was proposed as a stochastic approximation of the full gradient algorithm Bottou (1998). Instead of computing the gradient over the entire training set as under the batch setting, the stochastic gradient descent algorithm uses approximate gradients based on subsets of the training data. This is particularly attractive to large scale problems as it leads to a substantial reduction in computing complexity and potentially distributed implementation.

For applications with large data sets or streaming data feeds, SGD has also been used as a subgradient-based online learning method. Online learning and stochastic optimization are closely related and interchangeable most of the time (Cesa-Bianchi et al., 2004). For simplicity, in the following, we focus our discussion and algorithmic description under the online learning framework with regret bound models. Nonetheless, our results can be readily generalized to stochastic optimization as well.

In online learning, the algorithm receives a training sample at a time from a continuous feed. Without sparsity regularization, at time , the weight vector is updated in an online fashion with a single training sample drawn randomly,

 wt =wt−1−ηL′(wt−1,zt) (2)

where is the learning rate and is a subgradient of the loss function with respect to . The set of subgradients of f at the point is called the subdifferential of at , and is denoted . A function is called subdifferentiable if it is subdifferentiable at all dom . When is differentiable at , . At the same time, a sequence of decisions is generated at , that encounters a loss respectively.

The goal of online learning algorithm with sparsity regularization is to achieve low regret with respect to a fixed optimal weight vector . Here, is the parameter space for sparse weights vectors (see Assumption 3 on page 15 for more details.) The regret is defined as:

 RT(w∗)≜T∑t=1(L(wt,zt)+Ψ(wt))−T∑t=1(L(w∗,zt)+Ψ(w∗)). (3)

In this paper, we focus on the regularization where and is the regularizing parameter. When adopted in an online learning framework, standard SGD algorithm does not work well in addressing (1) with penalty. Firstly, a simple online update requires the projection of the weight vector onto a -ball at each step, which is computationally expensive with a large number of features. Secondly, with noisy approximate subgradient computed using a single sample, the weights can easily deviate from zero due to the random fluctuations in ’s. Such a scheme is therefore inefficient to maintain a sufficiently sparse weight vector.

To address this issue, Langford et al. (2009) induced sparsity in by subjecting the stochastic gradient descent algorithm to soft-thresholding. For every iterations at step , each of which is as defined in (2), the weight vector is shrunk by a soft-threshold operator with a gravity parameter with for . For a vector ,

 ^wt=T(wt,g), (4)

where with the operator defined by

 T(wj,gj) ≜{max(wj−gj,0),if% wj>0;min(wj+gj,0),if wj≤0. (5)

As one can see, the sequence of SGD updates can be treated as a unit computational block, which will be referred to as a burst hereafter. Here the word burst indicates that it is a sequence of repetitive actions, e.g., the standard SGD updates as defined in (2), without interruption. Each burst is followed by a soft-thresholding truncation defined in (4), which puts a shrinkage on the learned weight vector.

A burst can be viewed as a base feature selection realized on a set of random samples with regularization as in the classical LASSO (Tibshirani, 1996). Within a burst, let be the set of random samples on which the weight vector is stochastically learned. We define the set of features with nonzero weights in as its active (feature) set:

 ^Sg(^w;XK)={j:|^wj|>0}, (6)

with a corresponding gravity . The steps within a truncated burst are summarized in Algorithm 1.

In the truncated gradient algorithm of Langford et al. (2009), the gravity parameter is a constant across all dimensions as , where is a base gravity for each update in a burst and . In general, with greater parameter and smaller burst size , more sparsity is attained. When , the update in (4) becomes identical to the standard stochastic gradient descent update in (2). Langford et al. (2009) showed that this updating process can be regarded as an online counterpart of regularization in the sense that it approximately solves (1) in the limit as and .

## 3 Stabilized Truncated SGD for Sparse Learning

Truncated SGD Langford et al. (2009) works well for dense data. When it comes to high-dimensional sparse inputs, however, it suffers from a number of issues. Shalev-Shwartz and Tewari (2011) observe that the truncated gradient algorithm is incapable of maintaining sparsity of the weight vector as it iterates. Recall that, under the online learning setting, the weight vector

is updated with a noisy approximation of the true expected gradient using one sample at a time, from a random ordering of the data. With sparse inputs, it is highly probable that an important feature does not have a nonzero entry for many consequent samples, and is meaningfully updated for only a few times out of the

updates in a burst. As a result, it would be truncated after a few iterations and brought back to nonzero after another few updates. At the same time, sparsity in inputs will also give rise to sporadic large nonzero updates for irrelevant features, which cannot be fully resolved by the soft-threshold operator. The derived weight vector ’s are of high variance, inadequate sparsity and poor generalizability. As an example, the number of nonzero variables in the weight vector during the last 1000 stochastic updates from the truncated gradient algorithm implemented on a high-dimensional sparse dataset (Dexter text mining data set; see Section 6 for details.) are shown in Figure 1. It can be seen that the numbers of nonzero features in the weight vectors learned by the truncated SGD algorithm () remain large and highly unstable throughout these 1000 iterations, oscillating within 10% of the total number of features. As a comparison, also in Figure 1, we plot the results from our proposed stabilized truncated SGD applied to the same data. During these last 1000 updates, the proposed algorithm is using a less frequent truncation schedule due to our annealed reject rate. It attains both high sparsity in the weight vector and high stability with high-dimensional sparse data.

In this section, we introduce the stabilized truncated Stochastic Gradient Descent (SGD) algorithm. It attains a truly sparse weight vector that is stable and gives generalizable performance. Our proposed method attunes to the sparsity of each feature and adopts informative truncation. The algorithm keeps track of whether individual features have had enough information to be confidently subject to soft-thresholding. Based on the truncation results, we systematically reduce the active feature set by permanently discarding features that are truncated to zero with high probability via stability selection. We further improve the efficiency of our algorithm by adapting gravity to the sparsity of the current active feature set as the algorithm proceeds.

### 3.1 Informative Truncation

For the truncated SGD algorithm, Langford et al. (2009) suggest a general guideline for determining gravity in the batch mode by scaling a base gravity by , the number of updates, for a single truncation after a burst. A direct online adaptation of a regularization would shrinks the weight vector at every iteration. The above batch mode operation is to delay the shrinkage for iterations so that the truncation is executed based on information collected from random samples instead of from a single instance. This guideline implicitly assumes that the SGD updates in a burst are equally informative, which is in general true for dense features. For sparse features, however, under the online learning setting, not every update is informative about every feature due to the scarcity of nonzero entries. The original uniform formula, , for gravity would then create an undesirable differential treatment for features with different levels of sparsity. With a relatively small , it is very likely that a substantial proportion of features would have no non-zero values on a size- subsample used in a particular burst. The weights for these features remain unchanged after updates. Consequently, the set of sparse features run the risk of being truncated to zero based on very few informative updates. The truncation decision is therefore mostly determined by a feature’s sparsity level, rather than its relevance to the class boundary.

To make the learning be informed of the heterogeneity in sparsity level among features, we introduce the informative truncation step, extended from the idea of base gradient used in Algorithm 1. Instead of applying a universal gravity proportional to to all features, the amount of shrinkage is set proportional to the number of times that a feature is actually updated with nonzero values in the size- subsample, i.e., the number of informative updates. Specifically, within each burst, the algorithm keeps a vector of counters, , of the numbers of informative updates for the features , . Let be the base gravity parameter that serves as the unit amount of shrinkage for each informative update on each feature. At the end of each burst, we shrink feature by . In other words, here we set . The computational steps for a burst with informative truncation in summarized in Algorithm 2.

The proposed informative truncation scheme ensures that the decision of truncation is made based on sufficient and equivalent amount of evidence for evaluating each feature. A theoretical justification of how informative truncation helps improving truncation bias can be found in Lemma 2 (Section 4). This feature-specific gravity attunes to the sparsity structure incurred at each burst without ad-hoc adjustment. It also avoids data pre-processing for locating sparse entries, which can be computationally expensive and compromises the advantage of online computation. In comparison to the truncated gradient algorithm in Langford et al. (2009) that quickly shrinks many features to zero indiscriminately, informative truncation keeps sparse features until enough evaluation is conducted. In doing so, sparse yet important features will be retained. The proposed approach also reduce the variability in the resulted sparse weight vector during the training process. Duchi et al. (2011) uses a similar strategy that allows the learning algorithm to adaptively adjust its learning rates for different features based on cumulative update history. They use the norm of accumulated gradients to regulate the learning rate. By adapting the gravity with the counter within each burst, our proposed strategy here can be viewed as applying the norm to the accumulated gradients that is refreshed every steps.

### 3.2 Stability Selection

Despite of its scalability, subgradient-based online learning algorithms commonly suffer from instability. It has been shown both theoretically and empirically that stochastic gradient descent algorithms are sensitive to random perturbations in training data as well as specifications of learning rate (Toulis et al., 2015; Hardt et al., 2015). This instability is particularly pronounced in sparse online learning with sparse data, as discussed in Section 1. Under an online learning setting, using random ordering of the training sample as inputs, the algorithm would produce distinct weight vectors and unstable memberships of the final active feature set. Moreover, there has been a lot of discussion, in the literature, on the link between the instability of an learning algorithm and its deteriorated generalizability (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2010).

To tackle this instability issue, in the proposed algorithm, we exploit the method of stability selection to improve its robustness to random perturbation in the training data. Stability selection (Meinshausen and Bühlmann, 2010) does not launch a new feature selection method. Rather, its aim is to enhance and improve a sparse learning method via subsampling. The key idea of stability selection is similar to the generic bootstrap (Meinshausen and Bühlmann, 2010). It feeds the base feature selection procedure with multiple random subsamples to derive an empirical selection probability. Based on aggregated results from subsamples, a subset of features is selected with low variability across different subsamples. With proven consistency in variable selection, stability selection helps remove noisy irrelevant features and thus reduce the variability in learning a sparse weight vector.

Incorporating stability selection into our proposed framework, each truncated burst with gravity parameter is treated as an individual sparse learning engine. It takes random samples and carries out a feature selection to obtain a sparse weight vector. In the following, we define first the notion of selection probability for the stability selection step in our proposed algorithm.

###### Definition 1 (selection probability).

Let be a random subsample of of size , drawn without placement. Parametrized by the gravity parameter , the probability of the feature being in the active set of a truncated burst that returns is

 Πgj=P∗(j∈^Sg(^w;XK))=ED[1(|^wj|>0)],

where the probability is with respect to the random subsampling of . Let .

For simplicity, we drop the superscript of in later discussions. For the rest of the paper, the selection probability always refers to that corresponds to weight vector with gravity parameter .

Under unknown data distribution, the selection probabilities cannot be computed explicitly. Instead, they are estimated empirically. Since each truncation burst performs a screening on all features, the frequency of each feature being selected by a sequence of bursts can be used to derive an estimator of the selection probability. We denote a sequence of truncated bursts as a stage. A preliminary empirical estimate of the selection probability is given by

 ^Πj=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩∑τ:~kj,τ>01(|^wj,τ|>0)nk∑τ=11(~kj,τ>0),for j s.t.nK∑τ=1~kj,τ>01,otherwise, (7)

where are the counters of informative updates for burst , .

Different from the conventional stability selection setting, ’s are obtained sequentially and thus are dependent with each other. When is small, different subsamples produce selection probability estimates using (7) exhibit high variability, even when initialized with the same weight vector at . On the other hand, a large value of requires a prohibitively large number of iterations for convergence. To resolve the issues of estimating selection probability using a single sequence of SGD updates, we introduce a multi-thread framework of updating paths. Multiple threads of sequential SGD updates are executed in a distributed fashion, which readily utilizes modern multi-core computer architecture. With processors, we initialize the algorithm on each path of SGD updates with a random permutation of the training data, , denoted as . Then independently, stages of bursts run in parallel along paths, which return with , , . The joint estimate of selection probability with gravity is obtained as

 ^Πj=⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩M∑m=1∑τ:~k(m)j,τ>01(|^w(m)j,τ|>0)M∑m=1nk∑τ=11(~k(m)j,τ>0),for j s.t.M∑m=1nK∑τ=1~k(m)j,τ>01,otherwise. (8)

When more processors are available, a smaller is required for the algorithm to obtain a stable estimate of selection probability. The dependence among ’s is also attenuated when random subsets of samples are used for the estimation. This strategy falls under parallelized stochastic gradient descent methods, which is discussed in detail by Zinkevich et al. (2010).

Under the framework of stability selection, each stage on every path uses a random subsample. The estimated selection probability quantifies the chance that a feature is found to have high relevance to class differences given a random subsample. At the end of each stage, stable features are identified as those that belong to a large fraction of active sets incurred during this stage of bursts.

###### Definition 2 (Stable Features).

For a purging threshold , the set of stable features with gravity parameter is defined as

 ^Ωg={j:Πgj≥π0}. (9)

For simplicity, we write the stable set as when there is no ambiguity.

Stability selection retains features that have high selection probabilities and discard those with low selection probabilities. At the end of a stage of paths, we purge the features that are not in the set of stable features by permanently setting their corresponding weights to zero, and remove them from subsequent updates. We define the stabilized weight vector as

 ~w=^w⋅1^Ω. (10)

As discussed above, due to the nature of online learning with sparse data, there are two undesirable learning setbacks in a single truncated burst. The first occurs when an important feature has its weight stuck at zero due to inadequate information in the subsample used, while the second case is when a noise feature’s weight gets sporadic large updates by chance. Using informative bursts, we can avert the first type of setbacks and using selection probability based on multiple bursts, we can spot noisy features more easily. In the presence of a large number of noisy features, the learned weights for important features suffer from high variance. Via stability selection, we systematically remove noisy features permanently from the feature pool. Furthermore, the choice of a proper regularization parameter is crucial yet known to be difficult for sparse learning, especially due to the unknown noise level. Applying stability selection renders the algorithm less sensitive to choice of the base gravity parameter in learning a sparse weight vector via truncated gradient. As we will show using results from our numerical experiments, this purging by stability selection leads to a notable reduction in the estimation variance of the weight vector. Here, is a tuning parameter in practice. We have found that the learning results in the numerical experiments are not sensitive to different values of within a reasonable range. Under mild assumptions discussed in Section 4, we derive a lower bound of the expected improvement in convergence by employing stability selection in the learning process in Lemma 1.

### 3.3 Adaptive Gravity with Annealed Rejection Rate

The truncated SGD (Langford et al., 2009) adopts a universal and fixed base gravity parameter at all truncations. As pointed out in Langford et al. (2009) , a large value of the base gravity achieves more sparsity but the accuracy is compromised, while a small value of leads to less sparse weight vector yet attaining better performance. In other words, different extents of shrinkage serve different purposes of a learning algorithm. The needs for shrinkage also changes as the weight vector and the stable set evolves. Intuitively, the truncation is expected to be greedy at the beginning so that the number of nonzero feature can be quickly reduced for better computational efficiency and learning performance. As the algorithm proceeds, fewer features remain in the stable set. We should then be careful not to shrink important features with a truncation that is too harsh.

A large base gravity is effective in inducing sparsity at the beginning of the algorithm when the weight vector is dense. As the algorithm proceeds, the same value of gravity is likely to impose too much shrinkage when the learned weight vector becomes very sparse, exposing some truly important features at the risk of being purged. On the other hand, a small fixed gravity is over-conservative so that the algorithm will not shrink irrelevant features effectively, leading to slow convergence and a dense weight vector overridden by noise. Tuning a reasonable fixed base gravity parameter for a particular data set does not only creates additional computational burden, but also inadequate in addressing different learning needs during different stages of the algorithm.

As the role of gravity in a learning algorithm is to induce sparse estimates, in this paper, we propose an adaptive gravity scheme that delivers the right amount of shrinkage at each stage of the algorithm towards a desirable level of sparsity for the learned weight vector. We propose to control sparsity by a target rejection rate , that is, the proportion of updates that are expected to be truncated. Guided by this target rejection rate, we derive the necessary shrinkage amount and the corresponding gravity. As we discussed in Section 3.1, a base gravity is used in our learning algorithms to create gravity values for individual features that are attuned to their data sparsity levels. Therefore our adaptive gravity scheme is carried out by adjusting . At the beginning of a particular stage, we examine the truncations carried out during the previous stage. The base gravity is then adjusted to project the target rejection rate during the current stage. Specifically, at stage , we look at the pooled set of non-truncated weight vectors and informative truncation counters from all the bursts conducted in the previous stages on multiple threads. The adaptive base gravity for a target rejection rate is then obtained as

 g0,s(βs)≜sup{g0≥0:^ps(g0)≤βs}. (11)

Here is the empirical probability, i.e.,

 ^ps(g0)≜M∑m=1nk∑τ=1∑{j:j∈^Ωs,~k(m)j,τ>0}1(∣∣ ∣∣Δw(m)j,τ~k(m)j,τ∣∣ ∣∣>g0)M∑m=1nk∑τ=1∑j∈^Ωs1(~k(m)j,τ>0),

where is the amount of updates on feature during the burst.

We initialize the algorithm with a high rejection rate so that a large proportion of the weight vector can be reduced to zero at the end of each burst during the early stage of the algorithm. It allows the algorithm to explore as many sparse combination of features as possible at the early stage of the learning process. Along with the stability selection, the set of stable features can be quickly reduced to a manageable size by removing the majority of noises. When the weight vector becomes sparse, we decrease the rejection rate proportionally. With a lower rejection rate, and consequently a lower gravity, the algorithm can better exploit the subsequent standard SGD updates for a more accurate estimate of the true weight vector. As the rejection rate decreases to 0, the algorithm converges to the standard stochastic gradient descent algorithm on a small subset of stable features.

To achieve the balance between exploration and exploitation, we construct an annealing function for the rejection rate that decreases monotonically as the level of sparsity decreases. Let be the maximum rejection rate at initialization and let be the annealing rate. The annealing function for the rejection rate at stage is given by

 βs+1 =ϕ(ds;β0,γ) =⎧⎨⎩β0[exp(−γds)−dse−γ]γ≥0β0log(1−γ(1−ds))log(1−γ),γ<0, (12)

where is the level of weight vector sparsity at the end of stage . The greater the value is, the faster the rejection rate is annealed to zero as the number of stable features decreases.

A positive, zero and negative value of corresponds to exponential decay, linear decay and logarithmic decay of the rejection rate, respectively. Figure 2 presents examples of the rejection rate anneal function with , and respectively.

By using adaptive gravity (11) with annealed rejection rate (12), the amount of shrinkage is adjusted to the current level of sparsity of the weight vector quantified by the size of the stable set or the norm of the purged . Instead of tuning a fixed gravity parameter as in Langford et al. (2009), for our proposed algorithm, we tune the annealing rate and the maximum rejection rate . Here balances the trade-off between exploration and exploitation and determines the initial intensity of truncation. It enables the tuning process to be tailored to the data at hand as well as being comparable across different datasets. In Section 6, the tuning results instantiate that a negative annealing rate is preferred for highly sparse data, such as the RCV1 dataset, since the a high rejection rate needs to be maintained longer allowing sufficient information of sparse features to be evaluated by the learning process. On the other hand, a positive annealing rate is chosen for relatively dense data, such as the Arcene dataset, where the high frequency of nonzero values permit fast reduction of the active set. The complete algorithm of the stabilized truncated stochastic gradient descent algorithm is summarized in Algorithm 4.

## 4 Properties of the Stabilized Truncated Stochastic Gradient Descent Algorithm

The learning goal of sparse online learning is to achieve a low regret as defined in (3). In this section, we analyze the online regret bound of the proposed stabilized truncated SGD algorithm in Algorithm 4 with convex loss. For simplicity, the effect of adaptive gravity with annealed rejection rate is not considered here. To achieve viable result, we make the following assumptions.

###### Assumption 1.

The absolute values of the weight vector are bounded above, that is, for some , .

###### Assumption 2.

The loss function is convex in , and there exist non-negative constants and such that, for all and , .

For linear prediction problems, the class of loss function that satisfies Assumption 2 includes common loss functions used in machine learning problems, such as the loss, the hinge loss and the logistic loss, with the condition that for some constant .

###### Assumption 3.

Assume that the set for identifying has the following properties:

1. is the parameter space for weight vectors that is subject to a sparsity constraint

 ||w||0=d∗.

For the optimal weight vector , we denote and . We further assume that is sufficiently small and the gravity parameter associated with is reasonably large so that, for , the average number of active selected features from each truncated bursts, , given a gravity , is greater than or equal to .

2. Assume features has various sparsity distribution that , where , for , and if , for .

Assumption 3 posits that ’s parameter space of interest is substantially sparse, which is the main focus of sparse learning and of this paper. The theoretical analysis in the following concerns for an fixed optimal weight vector , if exists, under such a constraint. Nevertheless, this condition does not confine the applicable scenarios of the proposed method to a fixed subclass of problems. It suggests a balance between the model sparsity and the value of the gravity parameter that is implicitly embedded within the parameter tuning process.

###### Lemma 1.

Let be a non-stabilized dense weight vector, i.e., an output weight vector from Algorithm 3. Let be the stabilized weight vector derived from , which is purged by the stability selection (10) with a set of stable features . Let be the average number of nonzero entries in ’s of the previous truncated bursts, i.e.,the number of selected features from truncated bursts, with gravity . Then, if Assumption 1 holds, there exists an with such that the bound on the expected difference between the distance from the non-stabilized weight vector to under Assumption 3 and the distance from the stabilized weight vector to is given by

 E(||^w−w∗||2−||~w−w∗||2) ≥ ε2(|^Sε|−|^Ω|)+2π0C2[(1−qg2π0p−p)qg−d∗] (13)

When the purging threshold is sufficiently high such that ,

 E(||^w−w∗||2−||~w−w∗||2)≥0.
###### Proof.

: See Appendix A. ∎

Lemma 1 quantifies the gain of using stabilization when is highly sparse, where stability selection efficiently shrinks high variable estimates to zero. When the purging threshold is sufficiently high such that , the lower bound achieved by (13) is guaranteed to be positive. Furthermore, this result also indicates that the expected difference between distances from the non-stabilized and stabilized weight vector to the sparse depends on the differences between the sizes of the temporary nonzero set of features before purging, , and the size of the stable features after purging. In expectation, the stabilized weight vector is closer to the target sparse weight vector as the operation of purging efficiently reduces the size of stable features. This suggests a much faster convergence with stabilization. Lemma 1 also provides an insight on the benefit from using adaptive gravity with annealed rejection rate. At the beginning of the algorithm, the gap between the size of and the size of the set of stable features is large when aiming for extensive exploration of different sparse combination of features. Hence, the improvement brought by stabilization is more substantial during the early state of learning period. As the algorithm proceeds and the set of stable features becomes smaller and stabler, it dwindles the leeway that allows the aforementioned two sets to be different. Consequently, the proposed algorithm is gradually tuned toward the standard stochastic gradient descent algorithm to facilitate better convergence at the later period of the learning process.

###### Lemma 2.

Let be the weight vector at initialization. After the first burst, let be the truncated weight vector using universal gravity as in Algorithm 1 and let be the truncated weight vector with informative truncation as in Algorithm 2. Then, under Assumption 3,

 E(||¯w1−w∗||2−||^w1−w∗||2)≥2g0K∑t=1∥ζtw∗∥1≥0 (14)

where for and .

###### Proof.

: See Appendix B. ∎

In Lemma 2, we compare the distances towards from 1) the weight vector with uniform gravity and 2) the weight vector with informative truncation that depends on the number of zero entries occurred in a burst. Such a gap suggests the effectiveness of informative truncation on sparse data in which feature sparsity is highly heterogeneous. In the scenarios where very few nonzero entries appear in a burst, the informative truncation imposes gravity that is proportional to the information presented in a burst. It is a fairer treatment than uniform truncation and leads to a large improvement in expectation. When features are all considerably dense in a burst, the informative truncation is equivalent to the uniform truncation.

In short, Lemma 1 demonstrates the improvement in expected squared error due to stabilization on the weight vector. Lemma 2, on the other hand, quantifies the improvements in reduce truncation bias when implementing informative truncation on sparse features with heterogeneous sparsity levels.

Given Lemma 1 and Lemma 2, we have the expected regret bound of the proposed Algorithm 4 in Theorem 1.

###### Theorem 1.

Consider the updating rules for the weight vector in Algorithm 3. On an arbitrary path, with and , let be the resulted weight vector and be the gravity values applied to the weight vectors generated by Algorithm 4, along with the base gravity parameters . Set the purging threshold to be sufficiently large such that . If Assumption 1, 2, and 3 hold, then there exists a sequence of at each stability selection with the set of stable features such that the expectation of the regret defined in (3) is bounded above by

 E(T∑t=1[L(w––t,zt)+Kg–t||w––t||1]−T∑t=1[L(w∗,zt)+Kg–t||w∗||1]) ≤ ηA2−ηA(E[T∑t=1L(w∗,zt)+Kg–0,t(||w∗||1−||w––t||1)])+12−ηA(ηTB+1η||w∗||2) −12η−η2AT∑t=1ε2t1(tKnK∈Z)(|^Sεt,t|−|^Ωt|) (15)

where and is the weight vector at time before stabilization.

###### Proof.

: See Appendix C. ∎

In the result of Theorem 1, the first two parts of the right-hand-side of the expected regret bound (15) is similar to the bound obtained in Langford et al. (2009). It implies the trade-off between attained sparsity in the resulted weight vector and the regret performance. When the applied gravity is small under the joint effect of the base gravity and the size of each burst , the sparsity is less but the expected regret bound is lower. On the other hand, when the applied gravity is large, the resulted weight vector is more sparse but at the risk of higher regret. Based on Lemma 1, the proposed algorithm is guaranteed to achieve lower regret bound in expectation when the target sparse weight vector is highly sparse. As quantified in the third term of the right-hand-side of (15), the improvement comes from the reduction of the active set at each purging. By its virtue, noisy features are removed from the set of stable features and thus are absent in later SGD updates and truncations.

Theorem 1 is stated with a constant learning rate . It is possible to obtain a lower regret bound in expectation with adaptive learning rate decaying with , such as , which is commonly used in the literature of online learning and stochastic optimization. However, the discussion of using an varying learning rate is not a main focus of this paper and adds extra complexity of the analysis. Without knowing in advance, this may lead to a no-regret bound as suggested in Langford et al. (2009). Instead, in Corollary 1, we show that the convergence rate of the proposed algorithm is with .

###### Corollary 1.

Assume that all conditions of Theorem 1 are satisfied. Let the learning rate be . The upper bound of the expected regret is

where .

###### Proof.

By plugging in to the result from Theorem 1, we get

 E(T∑t=1[L(w––t,zt)+g–t||w––t||1]−T∑t=1[L(w∗,zt)+g–t||w∗||1]) ≤ A2√T−A(E[T∑t=1L(w∗,zt)+g–t(||w∗||1−||w––t||1]) +T2√T−A(ηTB+1η||w∗||2)−T2√T−AT∑t=1ε2t1(tKnK∈Z)(|^Sε,t|−|^Ωt|).

The result is then straightforward. ∎

Assume that the input features have nonzero entries on average. With linear prediction model , the computational complexity at each iteration is . Leveraging the sparse structure, the informative truncation only requires an additional space for recording the counters. The purging process of stability selection consumes , , space for storing the generated intermediate weight vectors and computational complexity. Both storage and computational cost decrease when the set of stable features diminishes as the algorithm proceeds. Since the parameters , , and is normally set to be small values, the complexity mostly depends on . In summary, the proposed algorithm scales with the number of nonzero entries instead of the total dimensions, making it appealing to high-dimensional applications.

## 5 Practical Remarks

When implementing Algorithm 4 in practice, the performance can be further improved in terms of both accuracy and computational efficiency by employing a couple of practical techniques. It includes applying informative purging and attenuating the truncation frequency to achieve more accurate sparse learning and steadier convergence.

The first improvement can be implemented by better addressing the issue of scarcity of incoming samples. For computing selection probabilities, instead of using only information from the current stage, we can inherit information from previous stages for features that are too scarce to accumulate enough updates during one stage. Specifically, we introduce an accumulated counter at stage as the total number of times that a feature is updated within a burst during this stage:

 κj,s=M∑m=1nK∑τ=1~k(m)j,τ,j=1,…,p,

which is essentially the denominator of the selection probability in (8). Similarly, we define an accumulated truncation indicator at stage as the total number of times that a feature is truncated to zero given valid update(s):

 bj,s=M∑m=1∑τ:~k(m)j,τ>01(|^w(m)j,τ|>0),j=1,…,p.

A feature is then evaluated in the stability selection only if there are enough updates from the present stage and from any unused information carried over from previous stages. Given a threshold , let and . The selection probability is modified as

 ^Πj,s=⎧⎨⎩~bj,s~κs,for j s.t.~κs>δK1,otherwise, for j=1,…,p. (16)

This strategy extends the key idea in Section 3.1 that, with sparse data, each decision need to be based on sufficient evidence. Using the “carried-over” information allows the algorithm to utilize information available in a sequence of SGD updates while attuned to the needs of features with different levels of sparsity. In practice, this modification facilitates faster convergence especially for ultra-sparse data.

The second practical strategy is that the size of each burst, , can be adaptively adjusted in a similar fashion as the rejection rate in (12). At the end of each stage, the burst size is updated as

 Ks=⌈K0log(1αds−1)⌉,

where is the initial burst size and, as in (12), . The tuning parameter adjusts the annealing rate of the truncation frequency. Although the result in Theorem 1 is based on a fixed , it can be easily shown that the same upper bound can also be attained with an increasing . By increasing in the later stage of the algorithm, when the majority of irrelevant features have been removed from the stable set, the chance of erroneous truncation is reduced. Such scheme further steers the algorithm from the mode of exploring potential sparse combination of features in the early stage toward the fine tuning of the weight vector by exploiting information from more samples in a sequence. It also facilitates faster convergence as the size of the stable set approaches to a sufficiently small number, as the algorithm converges to the standard stochastic gradient descent approximately.

## 6 Results

In this section, we present experimental results evaluating the performance of the proposed stabilized truncated SGD algorithm in high-dimensional classification problems with sparsity regularization. In this paper, we focus on linear prediction model for binary classification where and with the observed class label . We consider two commonly used convex loss functions in machine learning tasks that both satisfy Assumption 1:

• Hinge loss:

• Logistic loss:

Using five datasets from different domains, the performance of our algorithm and other algorithms for comparison are evaluated on classification performance and feature selection stability and sparsity. We first define measure of feature stability in Section 6.1.

### 6.1 Feature Selection Stability

The goal of sparse learning is to select a subset of truly informative features with stabilized estimation variance as well as increased classification accuracy and model interpretability. Subgradient-based online learning methods depend heavily by the random ordering of samples on which they are fed to the algorithm. Such dependence leads to much deteriorated performance when it comes to high-dimensional sparse inputs. For a particular feature, the positions of its nonzero occurrences in a random ordering of samples greatly affect its learning outcome, in terms of learnt weight and membership in the set of selected features. Therefore, in addition to attaining a low generalization error, a desirable sparse online learning method should also produce an informative feature subset that is stable and robust to random permutations of input data. To evaluate feature selection stability of subgradient-based sparse learning methods, we define in the following a numerical measure of similarity between selected feature subsets resulted from different random permutations of data. Given an output weight vector from a subgradient-based algorithm with input data , similarly as in (6), we denote the selected feature subset as

 S(w––;D)={j:|w––j|>0,w––=Ψ(D}.

Given two random permutations of the training data , and , the similarity between the two sets of selected feature subsets and is measured by the Cohen’s kappa coefficient (Cohen, 1960),

 κ(S1,S2)=qo−qe1−qe,

where is the relative observed agreement between and