Online learning has a wide range of applications in recommendation, advertising, and many other areas. The most commonly used algorithm for online learning is online gradient descent (OGD), which linearizes the loss and the regularizer in each step. OGD is simple and easy to implement. However, because of the linearization, OGD may become numerically unstable if the learning rate is not chosen properly, and it cannot effectively exploit the structure of the regularizer. To overcome the numerical stability issue of OGD, algorithms that optimize the loss exactly (without linearization) have been proposed in various settings, such as the well-known passive aggressive (PA) framework crammer2006online ; dredze2008confidence ; wang2012exact ; shi2014online , implicit online learning kulis2010implicit and implicit SGD toulis2014statistical . To exploit the structure of the regularizer, algorithms that optimize the regularizer exactly have been proposed, such as composite mirror descent (COMID) duchi2010composite and regularized dual averaging (RDA) xiao2010dual . In the online setting, we call the exact minimization of the loss or the regularizer an implicit update, because it is equivalent to OGD with an implicit learning rate, whereas vanilla OGD is called an explicit update.
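The contrast between explicit and implicit updates can be made concrete on the squared loss. The following is a minimal sketch, not code from the paper: it assumes the loss 0.5 * (w·x - y)^2, a Euclidean proximity term, and no regularizer, in which case the implicit update has a closed form that equals an explicit step with a data-dependent learning rate. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def explicit_step(w, x, y, eta):
    """OGD step on the squared loss 0.5 * (w @ x - y) ** 2."""
    return w - eta * (w @ x - y) * x

def implicit_step(w, x, y, eta):
    """Exact minimizer over u of 0.5 * (u @ x - y) ** 2 + ||u - w||^2 / (2 * eta).

    Solving the optimality condition shows that it equals an explicit step
    with the effective learning rate eta / (1 + eta * ||x||^2).
    """
    eta_eff = eta / (1.0 + eta * (x @ x))
    return w - eta_eff * (w @ x - y) * x

# Toy data (assumed): ||x||^2 = 25, so OGD is stable only for eta < 2 / 25.
x = np.array([3.0, 4.0])
y = 1.0
eta = 1.0  # deliberately too large for the explicit update

w_exp = np.zeros(2)
w_imp = np.zeros(2)
for _ in range(20):
    w_exp = explicit_step(w_exp, x, y, eta)
    w_imp = implicit_step(w_imp, x, y, eta)

res_exp = abs(w_exp @ x - y)  # residual of the explicit iterate
res_imp = abs(w_imp @ x - y)  # residual of the implicit iterate
print(res_exp, res_imp)
```

With this oversized learning rate the explicit residual grows geometrically, while the implicit residual shrinks at every step, which illustrates the stability argument made above.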
Given the benefits of implicit updates, it is of interest to study the algorithm that performs implicit updates on both the loss and the regularizer, which we call the fully implicit online learning (FIOL) algorithm in this paper. There is a large body of research on algorithms that perform an implicit update on either the loss function or the regularizer. However, owing to the difficulties of analysis and computation, to the best of our knowledge only mcmahan2010unified gives an analysis of the FIOL algorithm. On the one hand, that analysis is somewhat complicated and does not provide an explicit rate for the regret gained by the implicit update. On the other hand, mcmahan2010unified does not provide efficient computational methods for the iteration of FIOL.
In this paper we aim to study the FIOL algorithm systematically. From the theoretical aspect, compared with McMahan mcmahan2010unified , we first give a much simpler analysis showing that the FIOL algorithm admits a better regret than its linearized counterpart if each iteration of FIOL can be solved exactly. The extra gain of the implicit update in regret is bounded by the difference between the loss function and its lower bound, and can be further bounded by an explicit rate if a certain condition is satisfied.
From the computational aspect, the main challenge in FIOL is how to solve the implicit update subproblem of each iteration efficiently, since it may require minimizing a function with a nonseparable component and a nonsmooth component. In this paper, we show that for the widely used $\ell_1$-norm regularizer and a large class of loss functions, the implicit update subproblem can be solved very efficiently. First, we propose a simple deterministic sorting-based algorithm with log-linear cost; second, we propose a randomized partition-based algorithm with linear expected time. These complexity results show that an iteration of FIOL can be solved as efficiently as a step of OGD. We use experiments to validate the proposed approaches.
Before continuing, we introduce the notation and the problem setting. Bold italic letters denote vectors and lower-case italic letters denote scalars. The Hadamard (element-wise) product of two vectors is denoted by $\odot$. A sequence of vectors is indexed by subscripts, and the entries of a vector are indexed by non-bold subscripts. For a function $f$, we use $\partial f(\boldsymbol{w})$ to denote its subgradient set at $\boldsymbol{w}$.
Assume the dataset is $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$, where for all $i$, $\boldsymbol{x}_i$ is the feature vector and $y_i$ is the predictive value. In this paper we mainly consider the regularized empirical risk minimization (ERM) problem,
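where the loss is a convex function. Examples of the above formulation include many well-known classification and regression problems. For binary classification, the predictive value satisfies $y_i \in \{-1, +1\}$; the linear support vector machine (SVM) is obtained by choosing the hinge loss together with the squared $\ell_2$-norm regularizer. For regression, $y_i \in \mathbb{R}$; the lasso is obtained by choosing the squared loss together with the $\ell_1$-norm regularizer.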
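As an illustration of the regularized ERM template, the sketch below evaluates the objective for the two examples just mentioned: the linear SVM (hinge loss with a squared $\ell_2$-norm regularizer) and the lasso (squared loss with an $\ell_1$-norm regularizer). The averaging over samples, the factors of 0.5, and the weight `lam` are assumptions made for this example and may differ from the normalization used in Eq. (1).

```python
import numpy as np

def erm_objective(w, X, y, loss, reg):
    """Average loss of the linear scores X @ w plus a regularizer on w."""
    scores = X @ w
    return np.mean(loss(scores, y)) + reg(w)

lam = 0.1  # regularization weight (assumed)

# Loss functions of the linear score s = w . x
hinge = lambda s, y: np.maximum(0.0, 1.0 - y * s)   # binary classification, y in {-1, +1}
squared = lambda s, y: 0.5 * (s - y) ** 2           # regression, y real-valued

# Regularizers
l2_squared = lambda w: 0.5 * lam * (w @ w)
l1 = lambda w: lam * np.abs(w).sum()

X = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([0.5, 0.5])

svm_value = erm_objective(w, X, np.array([1.0, -1.0]), hinge, l2_squared)
lasso_value = erm_objective(w, X, np.array([1.0, 0.0]), squared, l1)
print(svm_value, lasso_value)
```

Only the pair (loss, regularizer) changes between the two problems; the composite structure of a nonlinear loss applied to a linear score stays the same.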
In online learning, Eq. (1) is optimized by receiving one sample per round. On the $t$-th round, the predictive value $y_t$ is revealed only after we make the prediction $\boldsymbol{w}_t^\top \boldsymbol{x}_t$ (for regression) or $\operatorname{sign}(\boldsymbol{w}_t^\top \boldsymbol{x}_t)$ (for binary classification). The basic task in online learning is to find an algorithm that minimizes the following regularized regret
at a sublinear rate. Besides the regret bound, numerical stability and the properties of the solution are also of major concern. In this paper, the iteration of FIOL is as follows
where $\eta_t$ is the learning rate and $\psi$ is an auxiliary function.
In this section, we present the theoretical analysis of FIOL. When the loss is linearized at the current iterate, Eq. (3) becomes an iteration of COMID. In this case, the regret proof in duchi2010composite is a clear extension of the convergence proof of proximal gradient descent. However, if we minimize the loss exactly (the implicit update), the analysis has proved difficult. In the case where there is no regularizer and the loss is not linearized, crammer2006online gives relative loss bounds when the loss is the hinge loss or the squared hinge loss. However, to the best of our knowledge, these relative loss bounds cannot be converted into a sublinear regret bound. Later, kulis2010implicit gives a regret bound when the loss is the squared loss. Neither of these two papers shows any advantage of the implicit update in the regret bound, and their proofs apply only to particular loss functions. When the regularizer is present and neither the loss nor the regularizer is linearized, McMahan mcmahan2010unified gives the first regret bound for general convex functions and shows the one-step improvement of the implicit update. In the regret proof of mcmahan2010unified , a specific function is first constructed, and the one-step improvement is then measured by the difference of the function values at two different points. However, that proof only shows that the one-step improvement is nonnegative and does not bound it by an explicit rate.
Compared with mcmahan2010unified , in this section we quantify the one-step improvement of the implicit update in an intuitive way, and thus our proof is much simpler. Inspired by the fact that the implicit update avoids the approximation error of linearization, we define the one-step improvement as the difference between the value of the loss and its linear approximation,
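Because the loss is assumed to be convex, the one-step improvement is nonnegative. Moreover, the implicit update is nontrivial, and the one-step improvement is strictly positive, only if the loss is not linear between the two consecutive iterates. Given this definition, we have Lemma 1.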
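To make the one-step improvement tangible, the following sketch evaluates it numerically. It rests on an assumption about the lost definition in Eq. (4): the linear approximation is taken at the next iterate and evaluated at the current one, so the improvement is the gap `loss(w_t) - [loss(w_next) + grad(w_next) . (w_t - w_next)]`. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def one_step_improvement(loss, grad, w_t, w_next):
    """Gap between loss(w_t) and the linearization of loss at w_next, evaluated at w_t.

    For a convex loss this gap is nonnegative; it vanishes when the loss is
    linear between the two points.
    """
    linear_approx = loss(w_next) + grad(w_next) @ (w_t - w_next)
    return loss(w_t) - linear_approx

x, y = np.array([1.0, -2.0, 0.5]), 0.3

# Squared loss: curved along x, so the gap is 0.5 * (x . (w_t - w_next)) ** 2.
sq_loss = lambda w: 0.5 * (w @ x - y) ** 2
sq_grad = lambda w: (w @ x - y) * x

# A linear loss: the linearization is exact, so the gap is zero.
lin_loss = lambda w: -y * (w @ x)
lin_grad = lambda w: -y * x

w_t = np.array([0.2, 0.1, -0.4])
w_next = np.array([0.1, 0.3, -0.1])

delta_sq = one_step_improvement(sq_loss, sq_grad, w_t, w_next)
delta_lin = one_step_improvement(lin_loss, lin_grad, w_t, w_next)
print(delta_sq, delta_lin)
```

The squared loss yields a strictly positive gap, whereas the linear loss yields zero, matching the statement that the implicit update is nontrivial only when the loss is not linear between the two iterates.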
Let the sequence be defined by the update in Eq. (3). For any , we have
In Eq. (3), if the loss is replaced by its linearized approximation and the learning rate is set to the best adaptive learning rate, then the iteration reduces to COMID with its standard regret bound. Therefore, compared with COMID, FIOL obtains a regret gain given by the accumulated one-step improvements. To bound this gain further, we assume that the loss is strongly convex with respect to its scalar argument (in this case the loss is still not strongly convex with respect to the parameter vector),
For example, the widely used squared loss satisfies this strong convexity property. It should be noted that even if the loss is strongly convex with respect to its scalar argument, it is not strongly convex with respect to the parameter vector. Based on the strong convexity assumption, we have Theorem 2.
In addition to the assumptions of Lemma 1, suppose that . If is strongly convex with strong convexity constant , then it follows that
Existing algorithms that linearize the loss function ignore the strong convexity of the loss. In contrast, because of exact minimization, FIOL implicitly captures higher-order information about the loss. Therefore, in Theorem 2, the regret bound of FIOL enjoys a logarithmic decrease if the loss is strongly convex. To the best of our knowledge, this is the first analysis to give an explicit bound on the benefit of the implicit update in regret, although the logarithmic decrease seems small.
In machine learning, the widely used regularizers are the $\ell_1$-norm, the squared $\ell_2$-norm, and variants or combinations of them. When the regularizer is the squared $\ell_2$-norm, the solution of Eq. (3) often has a closed form or can be computed efficiently by Newton's method on a transformed one-dimensional problem. However, if the $\ell_1$-norm is a component of the regularizer, the problem in Eq. (3) becomes nonsmooth and in general has no closed-form solution. In this case, applying general-purpose optimization methods to Eq. (3) is inefficient. Therefore, in this section we mainly discuss efficient computational methods for the case where the regularizer is the $\ell_1$-norm. Because we only care about the subproblem in Eq. (3), we omit the subscript “t” in this section. To simplify the notation further, we consider only the binary classification problem and reformulate Eq. (3) as
where . We use to denote the current iterate and to denote the previous iterate. In Eq. (8), the loss is allowed to be any of the loss functions widely used in machine learning, such as the hinge loss, squared hinge loss, squared loss, logistic loss, and exponential loss. Formally, the assumption used in this section is as follows.
. Meanwhile, satisfies the following assumptions.
is convex and closed, and can be evaluated in time;
is a non-decreasing function;
Denote the function and the constants ; then the equation can be solved exactly or to high accuracy with cost.
By Assumption A, is a non-increasing function. If is strictly monotonic, then is a single-valued function; otherwise, is a multivalued function, that is, the value of is a set at some points. Meanwhile, if the range of is $\mathbb{R}$, the domain of is $\mathbb{R}$; otherwise, the domain of is a proper subset of $\mathbb{R}$. Assumption A is satisfied by the loss functions widely used in machine learning. Table 1 shows the formulation of and for the squared loss, logistic regression loss, exponential loss, hinge loss, and absolute loss. For the logistic regression loss and the exponential loss, the solution of the equation does not have a closed form. However, it can be computed efficiently and to high accuracy by Newton's method. (If a suitable initial value is given, a small number of Newton iterations is often enough.)
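The following sketch shows how a scalar equation of this kind can be solved by Newton's method. Since the exact equation of Assumption A is not recoverable here, the example uses a stand-in that is assumed for illustration: the scalar equation that arises from the implicit update for the logistic loss without a regularizer, tau = eta * sigmoid(-(m + tau * q)), where m is the current margin and q is the squared norm of the feature vector. A bisection safeguard is added so that the iteration stays inside a bracket; all names are hypothetical.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def solve_logistic_scalar(m, q, eta, tol=1e-12, max_iter=100):
    """Root of f(tau) = tau - eta * sigmoid(-(m + tau * q)) for q >= 0, eta > 0.

    f is strictly increasing with f(0) < 0 < f(eta), so the root is unique
    and lies in (0, eta). Newton steps that leave the bracket are replaced
    by a bisection step.
    """
    lo, hi = 0.0, eta
    tau = 0.5 * eta
    for _ in range(max_iter):
        s = sigmoid(-(m + tau * q))
        f = tau - eta * s
        if abs(f) <= tol:
            break
        if f > 0:
            hi = tau
        else:
            lo = tau
        df = 1.0 + eta * q * s * (1.0 - s)  # derivative of f, always >= 1
        tau_new = tau - f / df
        if not (lo < tau_new < hi):
            tau_new = 0.5 * (lo + hi)
        tau = tau_new
    return tau

tau = solve_logistic_scalar(m=0.5, q=4.0, eta=1.0)
residual = tau - 1.0 * sigmoid(-(0.5 + tau * 4.0))
print(tau, residual)
```

Because the derivative is bounded below by one, Newton's method converges quickly once the iterate is close to the root, which is consistent with the remark that only a small number of iterations is usually needed.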
For simplicity, in the following discussion we assume that is a single-valued function and that the domain of is $\mathbb{R}$, as is the case for the squared loss. Then, in Subsection 3.3, we discuss the case where is a multivalued function or its domain is a proper subset of $\mathbb{R}$, which covers the other loss functions in Table 1.
Suppose that the function in Assumption A is single-valued. Denote
Then we can find the unique solution of the equation such that for all , we have .
Suppose satisfy that for all , denote , . Denote . Then we can rewrite as follows
By Eq. (10), is a non-decreasing function. By Eq. (11), can be reduced to the sum of the max operators of plus a linear function . If we knew the relationship between the solution and beforehand, then would be a linear segment and we would obtain a much simpler problem, which can be solved at the cost stated in Assumption A. Therefore, the remaining task is to determine the relationship between and . In this section, we provide two kinds of algorithms: one is based on sorting; the other is based on partition.
3.1 The sorting-based algorithm
First we sort such that , where is a permutation of . In addition, set and . Then for , if , then by Eq. (11), we have
which means that if we restrict to , then is a linear segment. For , if we compute the coefficients of the linear segments in order from to , then we can compute all the coefficients in time. Meanwhile, by Assumption A, it takes cost to check whether there exists a satisfying . Because Eq. (8) is strongly convex, its solution is unique. Correspondingly, the solution exists and is unique. Therefore, if we compute the linear coefficients of from to while checking whether there exists a satisfying , we always find the solution in at most iterations. However, applying the above procedure naively requires us to compute the solution of exactly, which has cost by Assumption A. To find a more convenient stopping condition, observe that by Assumption A and Eq. (10), is a non-increasing function and is a non-decreasing function. Therefore is a non-decreasing function. Then, for a given , there exists a with if and only if
Equivalently we have Lemma 4.
Let be the optimal solution. Let and be defined in Lemma 3. Let be the permutation of such that and set . Denote . Then we can find
such that is the solution of the equation
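The sorting idea can be sketched on a concrete instance. The code below is an illustration under stated assumptions rather than a transcription of Alg. 1: it uses the squared loss in regression form, an $\ell_1$-norm regularizer with weight `lam`, and a Euclidean proximity term with learning rate `eta`. Under these assumptions the solution has the form `soft_threshold(v + tau * x, eta * lam)` for a scalar `tau`, and `tau` is the root of a strictly increasing piecewise-linear function `phi` whose breakpoints are sorted once. For brevity the segment containing the root is located by binary search with direct evaluations of `phi`, instead of the running partial sums described above; the overall cost is still log-linear.

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def fiol_step_squared_l1(v, x, y, eta, lam):
    """argmin_w 0.5*(w @ x - y)**2 + lam*||w||_1 + ||w - v||^2 / (2*eta)."""
    thr = eta * lam

    def phi(tau):
        # Strictly increasing and piecewise linear in tau (slope >= 1).
        w = soft_threshold(v + tau * x, thr)
        return tau + eta * (w @ x - y)

    # Breakpoints: values of tau where some |v_i + tau * x_i| equals thr.
    nz = x != 0
    bps = np.sort(np.concatenate([(thr - v[nz]) / x[nz], (-thr - v[nz]) / x[nz]]))
    if bps.size == 0:
        pts = np.array([0.0, 1.0])
    else:
        # phi is affine beyond the extreme breakpoints, so one extra point
        # on each side is enough to extrapolate there.
        pts = np.concatenate(([bps[0] - 1.0], bps, [bps[-1] + 1.0]))

    # Binary search for the largest index i with phi(pts[i]) <= 0.
    lo, hi, i = 0, len(pts) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if phi(pts[mid]) <= 0:
            i, lo = mid, mid + 1
        else:
            hi = mid - 1
    i = min(max(i, 0), len(pts) - 2)

    # phi is affine on [pts[i], pts[i+1]]: solve the linear equation exactly.
    t0, t1 = pts[i], pts[i + 1]
    f0, f1 = phi(t0), phi(t1)
    tau = t0 - f0 * (t1 - t0) / (f1 - f0)
    return soft_threshold(v + tau * x, thr)

w_small = fiol_step_squared_l1(np.array([0.0]), np.array([1.0]), 1.0, 1.0, 0.1)
print(w_small)
```

The returned vector can be checked against the subgradient optimality condition of the subproblem: coordinates that are nonzero satisfy a gradient equation with the sign term, and coordinates that are exactly zero satisfy a bound of size `lam`.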
3.2 The partition-based algorithm
According to Lemma 4, the problem of finding the optimal solution is equivalent to finding the element of a particular rank in . Finding the element of a given rank in a sequence is a well-known problem in computer science. With the partition-based selection algorithm cormen2009introduction , such an element can be found in expected linear time. The partition-based selection algorithm consists of three parts: the method for choosing a pivot, the method for dividing the search sequence into a left sequence and a right sequence, and the criterion for reducing the search sequence to either the left sequence or the right sequence. In the standard partition-based algorithm, the criterion is whether the left sequence contains more elements than the target rank. Although the target rank is unknown beforehand, by rewriting Eq. (12), is found if is the minimal index in that satisfies
Then, because is a non-decreasing function, for , if , then is smaller than ; otherwise, is greater than or equal to . Therefore, the criterion for reducing the sequence in the standard partition-based algorithm can be replaced by a test of whether is less than .
The remaining issue is the complexity of computing , which in turn reduces to computing the linear coefficients and of that correspond to the pivot. By the expression
and can be expressed as partial sums. By tracking the partial sums and of from the previous iteration, the cost of computing the coefficients in the current iteration is no higher than that of the partition step. Therefore, the extra work does not affect the total complexity of the generalized selection algorithm, and the expected running time remains linear. The final algorithm is given in Alg. 2.
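The modified selection criterion can be sketched generically. The code below is not Alg. 2; it is a minimal illustration of a partition-based search in which the usual rank comparison is replaced by a monotone test on the pivot. The predicate `pred` is a hypothetical stand-in for the test derived from Eq. (12) and is treated as a black box here, whereas the algorithm described above evaluates it from tracked partial sums.

```python
import random

def smallest_satisfying(values, pred, rng=random):
    """Smallest element t of `values` with pred(t) True.

    Assumes pred is monotone in t (False for small t, True for large t).
    Each round picks a random pivot, tests it, and keeps only the side that
    can still contain the answer, as in randomized selection.
    Returns None if no element satisfies pred.
    """
    candidates = list(values)
    best = None
    while candidates:
        pivot = rng.choice(candidates)
        if pred(pivot):
            best = pivot  # the answer is the pivot or something smaller
            candidates = [t for t in candidates if t < pivot]
        else:
            candidates = [t for t in candidates if t > pivot]
    return best

data = [7.5, -1.0, 3.2, 9.9, 0.4, 3.2, 5.1]
found = smallest_satisfying(data, lambda t: t >= 3.0, rng=random.Random(0))
print(found)
```

As in randomized selection, the candidate list shrinks by a constant factor in expectation at each round, so the total partition work is linear in expectation when the predicate is cheap to evaluate.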
First, to simplify the notation, we have considered only binary classification problems. However, from the computational point of view, in the ERM setting both regression and classification problems use a composite function with a nonlinear outer function and a linear inner function; therefore Alg. 1 and Alg. 2 extend trivially to the regression setting.
Meanwhile, in the previous subsections, for simplicity we assumed that is a single-valued function and that its domain is $\mathbb{R}$. These two assumptions are equivalent to assuming that is strictly increasing and that the range of is $\mathbb{R}$. However, as shown in Table 1, most widely used losses in machine learning do not satisfy both conditions. Fortunately, addressing this problem only requires slight modifications to Alg. 1 and Alg. 2.
If the range of is not $\mathbb{R}$, then the domain of is not $\mathbb{R}$, which in turn means that the optimal solution is restricted to the domain of . In this case, before the operations in Alg. 1 and Alg. 2, we simply keep only the elements of that lie in the domain of for the subsequent sorting or partition operations. In fact, because fewer elements take part in the sorting or partition operations, the limited range of reduces the running time.
If is not strictly monotonic, then can be a multivalued function. In the machine learning context, as shown in Table 1, this means that the value of is a set at the boundary point(s). In this case, and only need to satisfy an inequality rather than an equality at the boundary point(s). Before the subsequent operations, we can check the boundary points first. This check has cost and does not affect the total complexity. Moreover, the partial sums computed at the boundary point provide an initialization for the subsequent operations in Alg. 1 and Alg. 2.
In this section, to evaluate speed, stability, and the sparsity of the solution, we compare the following methods: stochastic subgradient descent (SGD), online composite mirror descent (COMID), implicit SGD (I-SGD), and the fully implicit online learning update in Eq. (3) of this paper. Alg. 1 and Alg. 2 are used to solve Eq. (3). In this experiment we solve the lasso problem
in the online setting, where is the sample vector and is the prediction value. To evaluate performance on data of varying quality, following tran2015stochastic , we use synthetic data and control the correlation coefficient between features. In the -th iteration, a sample vector is generated, where with and is a constant. Then the correlation coefficient between and is . The prediction value of the -th iteration is defined as , where , so that the elements of the true parameter vector have alternating signs and decrease exponentially, the noise , and is chosen to control the signal-to-noise ratio. For all algorithms, the learning rate (LR) is tuned over . We implement the algorithms in a common framework and use them to solve Eq. (17) in an online fashion.
In these experiments, we set and run all the algorithms for a fixed amount of time under the settings and . The results are given in Table 2.
In Table 2, the column LR denotes the learning rate that yields the largest reduction of the objective function; the column Value denotes the value of the objective function , where is the number of iterations; and the column Sparsity denotes the number of zero elements of the solution in the last iteration.
Table 2 shows that the correlation between the features has a large impact on the explicit update algorithms SGD and COMID, which linearize the loss function. In contrast, the algorithms that perform an implicit update on the loss function, namely I-SGD, Alg. 1, and Alg. 2, are robust to the correlation coefficient. Because the implicit update can be viewed as an explicit update with a data-adaptive learning rate kulis2010implicit , it is more robust to the scale of the data and has better numerical stability.
Meanwhile, both SGD and I-SGD linearize the regularization term and thus cannot induce sparsity of the solution effectively. In contrast, COMID, Alg. 1, and Alg. 2 perform an implicit update on the regularization term. From the computational perspective, this implicit update corresponds to applying the soft thresholding operator, which shrinks small elements to zero. Therefore, these algorithms have a sparsity-inducing effect. However, we observe that in the setting where COMID becomes unstable, it cannot induce sparsity effectively.
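The difference between linearizing the $\ell_1$ term and updating it implicitly can be seen in a few lines. The sketch below compares one subgradient step with one gradient step followed by soft thresholding, which is the closed form of the proximal update for the $\ell_1$-norm. The vectors, the learning rate `eta`, and the regularization weight `lam` are toy values assumed for the example.

```python
import numpy as np

def soft_threshold(z, thr):
    """Shrink every entry toward zero by thr; entries within thr become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def sgd_l1_step(w, grad, eta, lam):
    """Explicit update: the l1 term enters through its subgradient sign(w)."""
    return w - eta * (grad + lam * np.sign(w))

def comid_l1_step(w, grad, eta, lam):
    """Implicit update on the l1 term: gradient step, then soft thresholding."""
    return soft_threshold(w - eta * grad, eta * lam)

w = np.array([0.50, 0.02, -0.03, -0.80])
grad = np.array([0.10, 0.05, -0.04, 0.20])  # gradient of the loss (assumed)
eta, lam = 0.1, 1.0

w_sgd = sgd_l1_step(w, grad, eta, lam)
w_comid = comid_l1_step(w, grad, eta, lam)
print(w_sgd, w_comid)
```

In this example the subgradient step pushes the two small entries past zero to the other sign, while soft thresholding sets them exactly to zero; the large entries end up at the same values under both updates.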
Finally, under the same runtime, Alg. 1 and Alg. 2 can achieve a larger reduction of the objective function than SGD and COMID, although the iterative solvers employed by Alg. 1 and Alg. 2 are slower than the closed-form updates of SGD and COMID. This is because the implicit update on the loss function allows us to use a larger learning rate.
Because Alg. 1, Alg. 2, and I-SGD can use the same learning rate and the closed-form update of I-SGD is faster, I-SGD achieves a larger reduction of the objective function under the same runtime. However, it should be noted that, first, to the best of our knowledge, Alg. 1 and Alg. 2 are the first attempts to solve the fully implicit online learning problem in Eq. (3) efficiently; and second, compared with I-SGD, Alg. 1 and Alg. 2 induce sparsity effectively.
- (1) Thomas H Cormen. Introduction to algorithms. MIT Press, 2009.
- (2) Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585, 2006.
- (3) Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In ICML, pages 264–271. ACM, 2008.
- (4) John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.
- (5) Brian Kulis and Peter L Bartlett. Implicit online learning. In ICML, pages 575–582, 2010.
- (6) H Brendan McMahan. A unified view of regularized dual averaging and mirror descent with implicit updates. arXiv preprint arXiv:1009.3240, 2010.
- (7) Tianlin Shi and Jun Zhu. Online Bayesian passive-aggressive learning. In ICML, pages 378–386, 2014.
- (8) Panagiotis Toulis, Edoardo Airoldi, and Jason Rennie. Statistical analysis of stochastic gradient methods for generalized linear models. In ICML, pages 667–675, 2014.
- (9) Dustin Tran, Panos Toulis, and Edoardo M Airoldi. Stochastic gradient descent methods for estimation with large data sets. arXiv preprint arXiv:1509.06459, 2015.
- (10) Jialei Wang, Peilin Zhao, and Steven CH Hoi. Exact soft confidence-weighted learning. In ICML, pages 107–114, 2012.
- (11) Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
Appendix A Theory
Proof of Lemma 1.
In the proof, we use to denote the subgradient with respect to the vector and use to denote the subgradient with respect to the scalar .
For Eq. (3), we have the optimality condition
Then it follows that for any ,
where ① follows from the convexity of , ② from the optimality condition in Eq. (18), and ③ from the triangle inequality. Meanwhile
where ① follows from the definition of in Eq. (4).