The development of optimization software for learning from large datasets is heavily influenced by the memory hierarchies of computer storage. In the presence of memory constraints, most high-order optimization methods become infeasible, whereas techniques such as coordinate descent or stochastic gradient descent may exploit the specific structure of learning functionals to scale well with the dataset size. Considerable effort has been devoted to making kernel methods feasible on large scale problems [Bottou et al., 2007]. One of the most important features of modern machine learning methodologies is the ability to leverage sparsity in order to obtain scalability. Typically, learning methods that impose sparsity are based on the minimization of non-differentiable objective functionals. This is the case for support vector machines and for methods based on sparsity-inducing regularization.
In this chapter, we analyze optimization algorithms for a general class of regularization functionals, using subdifferential calculus and resolvents of monotone operators [Rockafellar, 1970, Hiriart-Urruty, 2004] to manage non-differentiability. In particular, we study learning methods that can be interpreted as the minimization of a convex empirical risk term plus a squared norm regularization in a reproducing kernel Hilbert space [Aronszajn, 1950] with non-null reproducing kernel, namely
where is a finite-valued bounded below convex function. Regularization problems of the form (1) admit a unique optimal solution which, in view of the representer theorem [Schölkopf et al., 2001], can be represented as a finite linear combination of kernel sections:
We characterize optimal coefficients of the linear combination via a family of non-linear equations. Then, we introduce two general classes of optimization algorithms for large scale regularization methods that can be regarded as instances of non-linear Jacobi and Gauss-Seidel algorithms, and analyze their convergence properties. Finally, we state a theorem that shows how to reformulate convex regularization problems, so as to trade off positive semidefiniteness of the kernel matrix for differentiability of the empirical risk.
2 Solution characterization
As a consequence of the representer theorem, an optimal solution of problem (1) can be obtained by solving finite-dimensional optimization problems of the form
where is a non-null symmetric positive semi-definite matrix called kernel matrix. The entries of the kernel matrix are given by
where is a positive semidefinite kernel function. It is easy to verify that the resulting kernel matrix is symmetric and positive semi-definite. Let denote the columns of the kernel matrix. Particularly interesting is the case in which the empirical risk function is additively separable.
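The construction above can be sketched numerically. As an illustration only (the text leaves the kernel unspecified), the following uses a Gaussian kernel to build a kernel matrix and checks its symmetry and positive semi-definiteness; the function names are hypothetical:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Build the kernel matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])  # fill both halves: symmetry
    return K

X = np.random.randn(5, 3)
K = gram_matrix(X, gaussian_kernel)
assert np.allclose(K, K.T)                        # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)    # positive semi-definite (up to rounding)
```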
Definition 1 (Additively separable functional).
A functional is called additively separable if
Parametric models with (ridge) regularization correspond to the case in which the inputs are finite-dimensional numeric vectors and the kernel matrix is chosen as the product of the data matrix with its transpose, where the data matrix is the matrix whose rows are the input data. Letting
the following class of problems is obtained:
Observe that one can optimize over the whole weight space, since the optimal weight vector will automatically be of the form (4). Parametric models with regularization can thus be seen as specific instances of kernel methods in which the kernel is the linear kernel:
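This equivalence can be checked numerically. The sketch below (illustrative names; it assumes the linear kernel matrix is the Gram matrix of the data) verifies that a kernel expansion with the linear kernel coincides with the linear model whose weight vector is the data matrix transposed times the coefficients:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(6, 3)          # 6 examples, 3 features
K = X @ X.T                        # linear kernel matrix

# A function in the span of kernel sections, f(x) = sum_i c_i <x_i, x>,
# coincides with the linear model f(x) = <w, x> for w = X^T c.
c = np.random.randn(6)
w = X.T @ c
x_new = np.random.randn(3)
f_kernel = np.sum(c * (X @ x_new))  # kernel expansion evaluated at x_new
f_linear = w @ x_new                # linear model evaluated at x_new
assert abs(f_kernel - f_linear) < 1e-10
```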
In the following, two key mathematical objects will be used to characterize optimal solutions of problems (2) and (5). The first is the subdifferential of the empirical risk. The second is the resolvent of the inverse subdifferential, defined as
See the appendix for more details about these objects. The following result characterizes optimal solutions of problem (2) via a non-linear equation involving the resolvent. The characterization also holds for non-differentiable loss functions, and is obtained without introducing constrained optimization problems. The proof of Theorem 1 is given in the appendix.
The usefulness of condition (7) depends on the possibility of computing closed-form expressions for the resolvent, which may not be feasible for general convex functionals. Remarkably, for many learning methods one can typically exploit the specific structure of the empirical risk to work out closed-form expressions. For instance, when the empirical risk is additively separable as in (3), the sub-differential decouples with respect to the different components. In such a case, the computation of the resolvent reduces to the inversion of a function of a single variable, which can often be obtained in closed form. Indeed, in many supervised learning problems additive separability holds, with each component given by a loss function term weighted by a regularization parameter. Table 1 reports the expression of the resolvent in correspondence with commonly used loss functions. When the empirical risk is additively separable, the characterization (7) can be generalized as follows.
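For a single-variable illustration of this inversion, consider a quadratic loss term. The sketch below is a stand-in for the paper's resolvent of the inverse sub-differential (whose exact parameterization is given in (6)): it computes the closely related proximal map in closed form and checks it against brute-force one-dimensional minimization:

```python
import numpy as np

def prox_quadratic(v, y, tau):
    """Closed-form proximal map of f(x) = (x - y)^2 / 2:
    argmin_x 0.5*(x - v)^2 + tau * f(x) = (v + tau*y) / (1 + tau)."""
    return (v + tau * y) / (1.0 + tau)

# Numerical check against brute-force minimization on a fine grid.
v, y, tau = 0.3, 1.5, 0.7
grid = np.linspace(-5.0, 5.0, 200_001)
obj = 0.5 * (grid - v) ** 2 + tau * 0.5 * (grid - y) ** 2
x_star = grid[np.argmin(obj)]
assert abs(x_star - prox_quadratic(v, y, tau)) < 1e-4
```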
In this paper, we analyze two iterative approaches to compute optimal solutions of problem (2), based on the solution characterizations of Theorem 1 and Corollary 1. For both methods, we show that cluster points of the iteration sequence are optimal solutions, and we have
where the functional is that of problem (2). Section 3 describes a first approach, which consists of simply iterating equation (7) according to the fixed-point method. The method can also be regarded as a non-linear Jacobi algorithm for solving equation (7). It is shown that the relaxation parameter can always be chosen so as to make the iterations approximate an optimal solution to arbitrary precision. In section 4, we describe a second approach, which involves separately iterating the single components using the characterization of equation (8). For a suitable choice of the parameters, the method boils down to coordinate descent, and optimality of cluster points holds whenever indices are picked according to an “essentially cyclic” rule. Equivalently, the method can be regarded as a non-linear Gauss-Seidel algorithm for solving (8).
3 Fixed-point algorithms
This procedure is the well-known fixed-point iteration (also known as Picard or Richardson iteration) method. Provided that the relaxation parameter is properly chosen, the procedure can be used to solve problem (2) to any given accuracy. Before analyzing the convergence properties of method (10), let us study the computational complexity of a single iteration. To this end, one can decompose the iteration into three intermediate steps:
The decomposition emphasizes the separation between the role of the kernel (affecting only step 1) and the role of the empirical risk function (affecting only step 3).
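As a hedged illustration of the three-step structure, the following sketch specializes the iteration to regularized least squares, where the pointwise resolvent step becomes affine and the procedure reduces to a Richardson iteration for the system (K + λI)c = y; the symbols K, y, lam, gamma are illustrative, and the RLS specialization is an assumption, not the paper's general update:

```python
import numpy as np

np.random.seed(1)
n, lam = 20, 0.5
A = np.random.randn(n, n)
K = A @ A.T / n                    # a positive semidefinite kernel matrix
y = np.random.randn(n)

gamma = 1.0 / (np.linalg.norm(K, 2) + lam)   # safe step size
c = np.zeros(n)
for _ in range(5000):
    z = K @ c                      # step 1: predictions on the training inputs
    r = y - z - lam * c            # step 2: simple vector subtraction
    c = c + gamma * r              # step 3: pointwise update (affine resolvent for RLS)

c_direct = np.linalg.solve(K + lam * np.eye(n), y)
assert np.allclose(c, c_direct, atol=1e-6)
```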
Step one is the only one that involves the kernel matrix. Generally, it is also the most computationally and memory demanding step. Since the product represents predictions on the training inputs (or a quantity related to them), being able to perform fast predictions also has a crucial impact on the training time. This is remarkable, since good prediction speed is a desirable goal on its own. Notice that an efficient implementation of the prediction step is beneficial for any learning method of the form (2), independently of the empirical risk. Ideally, the computational cost of such a matrix-vector multiplication scales quadratically with the number of examples. However, the kernel matrix might not fit into memory, so that the time needed to compute the product might also include special computations or additional I/O operations. Observe that, if many components of the coefficient vector are null, only a subset of the rows of the kernel matrix is necessary in order to compute the product. Hence, methods that impose sparsity in the coefficient vector may produce a significant speed-up in the prediction step. As an additional remark, observe that the matrix-vector product is an operation that can be easily parallelized.
In the linear case (5), the computation of the product can be divided into two parts:
In order to compute the product, it is not even necessary to form the kernel matrix, which may yield a significant memory saving. The two intermediate products each require a number of operations proportional to the number of examples times the number of features. When the number of features is much lower than the number of examples, there is a significant improvement with respect to the direct product with the kernel matrix. Speed-up and memory saving are even more dramatic when the data matrix is sparse. In such a case, computing the product in two steps might be more convenient even when the number of features is not small.
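A minimal sketch of the two-step product (illustrative sizes; the factorization assumed is the linear-kernel identity K = X Xᵀ):

```python
import numpy as np

np.random.seed(2)
n, d = 10_000, 50                  # many examples, few features
X = np.random.randn(n, d)
c = np.random.randn(n)

# Computing K @ c via K = X X^T, without forming the n x n kernel
# matrix (10^8 entries, roughly 800 MB in double precision):
w = X.T @ c                        # first intermediate product
Kc = X @ w                         # second intermediate product

# Spot-check against the explicit product on a few rows of K.
K_rows = X[:5] @ X.T               # only 5 rows of the kernel matrix
assert np.allclose(K_rows @ c, Kc[:5])
```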
Step two is a simple subtraction between vectors, whose computational cost scales linearly with the number of examples. In section 5, it is shown that the result can be interpreted as the vector of predictions on the training inputs associated with another learning problem, consisting in stabilizing a regularized functional whose empirical risk is always differentiable, and whose kernel is not necessarily positive semidefinite.
Step three is the only one that depends on the empirical risk function. Hence, different algorithms can be implemented by simply choosing different resolvents. Table 1 reports the loss function and the corresponding resolvent for some common supervised learning methods. Some examples are given below. Consider problem (2) with the “hinge” loss function, associated with the popular Support Vector Machine (SVM). For SVM, step three reads
where denotes the element-wise product, and is applied element-wise. As a second example, consider classic regularized least squares (RLS). In this case, step three reduces to
Generally, the complexity of step three scales linearly with the number of examples for any of the classical loss functions.
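The exact resolvent expressions depend on the paper's parameterization (Table 1); as a closely related, well-known closed form, here is the proximal map of the scalar hinge loss h(x) = max(0, 1 − x), shown purely for illustration of the piecewise structure such updates typically have:

```python
import numpy as np

def prox_hinge(v, tau):
    """Proximal map of the hinge loss h(x) = max(0, 1 - x):
    argmin_x 0.5*(x - v)^2 + tau*h(x), computed element-wise."""
    v = np.asarray(v, dtype=float)
    return np.where(v > 1.0, v,                      # flat region: no change
           np.where(v < 1.0 - tau, v + tau,          # linear branch: shift by tau
                    1.0))                            # clipped at the kink

assert float(prox_hinge(2.0, 0.5)) == 2.0
assert float(prox_hinge(0.0, 0.5)) == 0.5
assert float(prox_hinge(0.9, 0.5)) == 1.0
```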
A stronger convergence result holds when the kernel matrix is strictly positive definite or the empirical risk is differentiable with Lipschitz continuous gradient. Under these conditions, it turns out that the whole sequence converges at least linearly to a unique fixed point.
The kernel matrix is positive definite.
Function is everywhere differentiable and is Lipschitz continuous,
Then, there exists a unique solution of equation (7), and converges to with the following rate
In practice, condition (11) can be equivalently satisfied by fixing the relaxation parameter and scaling the kernel matrix to have spectral norm between 0 and 2. In problems that involve a regularization parameter, this last choice will only affect its scale. A possible practical rule is to choose the relaxation parameter inversely proportional to the spectral norm of the kernel matrix, which is equivalent to scaling the kernel matrix to have spectral norm equal to one. However, in order to compute the scaling factor in this way, one generally needs all the entries of the kernel matrix. A cheaper alternative that uses only the diagonal entries of the kernel matrix is to choose the relaxation parameter inversely proportional to the trace, which is equivalent to fixing it to one and normalizing the kernel matrix to have trace one. To see that this last rule satisfies condition (11), observe that the trace of a positive semidefinite matrix is an upper bound for its spectral norm. In the linear case (5), one can directly compute the scaling factor on the basis of the data matrix: the spectral norm of the kernel matrix equals the squared spectral norm of the data matrix, and its trace equals the squared Frobenius norm of the data matrix.
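These scaling rules can be sketched as follows; the code checks the fact used above (the trace of a positive semidefinite matrix bounds its spectral norm) and the linear-case identities relating the kernel matrix to the data matrix (names are illustrative):

```python
import numpy as np

np.random.seed(3)
A = np.random.randn(30, 30)
K = A @ A.T                       # a positive semidefinite kernel matrix

# Rule 1: scale K to spectral norm one (needs all of K, or its top eigenpair).
spectral = np.linalg.norm(K, 2)

# Rule 2: scale K to trace one (needs only the diagonal entries).
trace = np.trace(K)

# The trace upper-bounds the spectral norm, so rule 2 is more
# conservative but always safe.
assert spectral <= trace + 1e-9
assert np.linalg.norm(K / trace, 2) <= 1.0 + 1e-9

# Linear case K = X X^T: both quantities come straight from X.
X = np.random.randn(40, 7)
K_lin = X @ X.T
assert abs(np.linalg.norm(K_lin, 2) - np.linalg.norm(X, 2) ** 2) < 1e-8
assert abs(np.trace(K_lin) - np.linalg.norm(X, 'fro') ** 2) < 1e-8
```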
4 Coordinate-wise iterative algorithms
In this section, we describe a second optimization approach that can be seen as a way to iteratively enforce optimality condition (8). Throughout the section, it is assumed that is additively separable as in (3). In view of Corollary 1, the optimality condition can be rewritten for a single component as in (8). Consider the following general update algorithm:
A serial implementation of algorithm (10) can be obtained by choosing suitable parameters and cyclically computing the new components according to equation (12). Observe that this approach requires keeping in memory coefficients from two subsequent iterations at the same time. In the next sub-section, we analyze a different choice of parameters that leads to a class of coordinate descent algorithms, based on the principle of using newly computed information as soon as it is available.
4.1 Coordinate descent methods
A coordinate descent algorithm updates a single variable at each iteration by solving a sub-problem of dimension one. In recent years, optimization via coordinate descent has become a popular approach in machine learning and statistics, since its implementation is straightforward and it enjoys favorable computational properties [Friedman et al., 2007, Tseng and Yun, 2008, Wu and Lange, 2008, Chang et al., 2008, Hsieh et al., 2008, Yun and Toh, 2009, Huang et al., 2010, Friedman et al., 2010]. Although the method may require many iterations to converge, the specific structure of supervised learning objective functionals allows the sub-problems to be solved with high efficiency. This makes the approach competitive especially for large-scale problems, in which memory limitations hinder the use of second-order optimization algorithms. As a matter of fact, state-of-the-art solvers for large scale supervised learning, such as glmnet [Friedman et al., 2010] for generalized linear models, or LIBLINEAR [Fan et al., 2008] for SVMs, are based on coordinate descent techniques.
The update in algorithm (12) also depends on components which have already been updated. Hence, one needs to keep in memory coefficients from two subsequent iterations. In this sub-section, we describe a method that allows one to take advantage of the computed information as soon as it is available, by overwriting the coefficients with the new values. Assume that the diagonal elements of the kernel matrix are strictly positive. Notice that this last assumption can be made without loss of generality. Indeed, if a diagonal element vanishes for some index then, since the off-diagonal entries of a positive semidefinite matrix are bounded by its diagonal entries, the whole corresponding column (row) of the kernel matrix is zero, and can be removed without affecting optimization results for the other coefficients. With a suitable choice of the parameters in equation (8), the corresponding coefficient in the inner sum cancels out, and we obtain
The optimal coefficient is thus expressed as a function of the others. Similar characterizations have also been derived in [Dinuzzo and De Nicolao, 2009] for several loss functions. Equation (13) is the starting point to obtain a variety of coordinate descent algorithms, involving the iterative choice of a coordinate index followed by the optimization of the corresponding coefficient as a function of the others. A simple test on the residual of equation (13) can be used as a stopping condition. The approach can also be regarded as a non-linear Gauss-Seidel method [Ortega and Rheinboldt, 2000] for solving the equations (8). It is assumed that the coefficient vector is initialized to some initial value, and the auxiliary coefficients are initialized to the residuals of equation (13) evaluated at that initial value. Remarkably, in order to implement the method for different loss functions, we simply need to modify the expression of the resolvent functions. Each update only involves a single row (column) of the kernel matrix. In the following, we will assume that indices are recursively picked according to a rule that satisfies the following condition, see [Tseng, 2001, Luenberger and Ye, 2008].
Essentially Cyclic Rule.
There exists a constant integer such that every index is chosen at least once between the -th iteration and the -th, for all .
Iterations of coordinate descent algorithms that use an essentially cyclic rule can be grouped into macro-iterations, each containing a bounded number of updates of the form (13), within which all the indices are picked at least once. Below, we report some simple rules that satisfy the essentially cyclic condition and do not require maintaining any additional information (such as the gradient):
Cyclic rule: In each macro-iteration, each index is picked exactly once in the natural order. Hence, each macro-iteration consists of exactly one update per index.
Aitken double sweep rule: Alternates macro-iterations in which indices are chosen in the natural order with macro-iterations in which they are chosen in the reverse order.
Randomized cyclic rule: The same as the cyclic rule, except that indices are randomly permuted at each macro-iteration.
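The three index rules can be sketched as simple generators (0-based indices; illustrative code):

```python
import random
from itertools import islice

def cyclic(n):
    """Cyclic rule: indices 0, 1, ..., n-1 repeated forever."""
    while True:
        yield from range(n)

def aitken_double_sweep(n):
    """Alternate forward and backward sweeps over the indices."""
    while True:
        yield from range(n)
        yield from reversed(range(n))

def randomized_cyclic(n, seed=0):
    """Cyclic rule with indices re-permuted at each macro-iteration."""
    rng = random.Random(seed)
    order = list(range(n))
    while True:
        rng.shuffle(order)
        yield from order

# Every rule visits each index at least once within a bounded window,
# so all three satisfy the essentially cyclic condition.
for rule in (cyclic, aitken_double_sweep, randomized_cyclic):
    first_macro = list(islice(rule(5), 5))
    assert sorted(first_macro) == list(range(5))
```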
In the linear case (5), can be computed as follows
By exploiting the fact that only one component of vector changes from an iteration to the next, the first equation can be further developed:
where denotes the index chosen in the previous iteration, and denotes the variation of coefficient in the previous iteration. By introducing these new quantities, the coordinate descent algorithm can be rewritten as in Algorithm 2, where we have set .
The computational cost of a single iteration depends mainly on the updates of the cached quantities, and scales linearly with the number of features. When the loss function has linear pieces, it is often the case that the coefficient does not change after the update, so that the variation is zero. When this happens, the next update can be skipped, obtaining a significant speed-up. Further, if the input vectors are sparse, the average computational cost of the second line may be much lower than linear in the number of features. A technique of this kind has been proposed in [Hsieh et al., 2008] and implemented in the package LIBLINEAR [Fan et al., 2008] to improve the speed of coordinate descent iterations for linear SVM training. Here, one can see that the same technique can be applied to any convex loss function, provided that an expression for the corresponding resolvent is available.
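The caching technique can be sketched as follows: a vector w = Xᵀc is maintained, and each coordinate update applies a rank-one correction to it, skipped entirely when the coefficient does not change (names and the particular new value are illustrative):

```python
import numpy as np

np.random.seed(4)
n, d = 200, 10
X = np.random.randn(n, d)
c = np.random.randn(n)
w = X.T @ c                        # cached vector w = X^T c

# One coordinate update: only the changed coefficient touches w.
i = 7
c_new_i = 0.3                      # new value produced by the resolvent step
delta = c_new_i - c[i]
if delta != 0.0:                   # skip the O(d) work when nothing changed
    w += delta * X[i]              # rank-one correction (cheaper if X[i] is sparse)
c[i] = c_new_i

assert np.allclose(w, X.T @ c)     # the cache stays consistent
```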
The main convergence result for coordinate descent is stated below. It should be observed that the classical theory of convergence for coordinate descent is typically formulated for differentiable objective functionals. When the objective functional is not differentiable, there exist counterexamples showing that the method may get stuck at a non-stationary point [Auslender, 1976]. In the non-differentiable case, optimality of cluster points of coordinate descent iterations has been proven in [Tseng, 2001] (see also references therein), under the additional assumption that the non-differentiable part is additively separable. Unfortunately, the result of [Tseng, 2001] cannot be directly applied to problem (2), since the (possibly) non-differentiable part is not separable with respect to the optimization variables, even when (3) holds. Notice also that, when the kernel matrix is not strictly positive definite, the level sets of the objective functional are unbounded (see Lemma 1 in the appendix). Despite these facts, it still holds that cluster points of coordinate descent iterations are optimal, as stated by the next theorem.
Suppose that the following conditions hold:
Function is additively separable as in (3),
The diagonal entries of the kernel matrix satisfy ,
5 A reformulation theorem
If a coefficient vector satisfies (7), then it is also a stationary point of the following functional:
where denotes the Moreau-Yosida regularization of , and .
Theorem 5 gives an insight into the role of the relaxation parameter, as well as providing an interesting link with machine learning with indefinite kernels. By the properties of the Moreau-Yosida regularization, the regularized empirical risk is differentiable with Lipschitz continuous gradient, and it follows that the functional of Theorem 5 also has this property. Notice that lower values of the relaxation parameter are associated with smoother functions, while the gradient remains non-expansive. A lower value of the relaxation parameter also implies a “less positive semidefinite” kernel, since the eigenvalues of the modified kernel matrix are obtained by shifting the eigenvalues of the original kernel matrix downward. Indeed, the kernel becomes non-positive as soon as the shift exceeds the smallest eigenvalue. Hence, the relaxation parameter regulates a trade-off between smoothness of the empirical risk and positivity of the kernel.
When is additively separable as in (3), it follows that is also additively separable:
where each component is the Moreau-Yosida regularization of the corresponding loss term. The components can often be computed in closed form, so that an “equivalent differentiable loss function” can be derived for non-differentiable problems. For instance, when the loss is given by the hinge loss, we obtain
Observe that this last function is differentiable with Lipschitz continuous derivative. By Theorem 5, it follows that the SVM solution can be equivalently computed by searching the stationary points of a new regularization functional obtained by replacing the hinge loss with its equivalent differentiable loss function, and modifying the kernel matrix by subtracting the identity.
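Assuming the standard Moreau envelope of the hinge loss (the text's parameterization may differ by constants involving the regularization and relaxation parameters), the equivalent differentiable loss can be written in closed form and sketched as:

```python
import numpy as np

def smoothed_hinge(v, tau):
    """Moreau envelope of the hinge loss h(x) = max(0, 1 - x) with
    parameter tau: differentiable, with Lipschitz continuous derivative."""
    v = np.asarray(v, dtype=float)
    return np.where(v >= 1.0, 0.0,                            # flat, like the hinge
           np.where(v >= 1.0 - tau, (1.0 - v) ** 2 / (2.0 * tau),  # quadratic near the kink
                    (1.0 - v) - tau / 2.0))                   # linear, shifted down by tau/2

tau = 0.5
assert float(smoothed_hinge(2.0, tau)) == 0.0
assert abs(float(smoothed_hinge(0.75, tau)) - 0.0625) < 1e-12
assert abs(float(smoothed_hinge(0.0, tau)) - 0.75) < 1e-12
```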
In this paper, fixed-point and coordinate descent algorithms for regularized kernel methods with convex empirical risk and squared RKHS norm regularization have been analyzed. The two approaches can be regarded as instances of non-linear Jacobi and Gauss-Seidel algorithms to solve a suitable non-linear equation that characterizes optimal solutions. While the fixed-point algorithm has the advantage of being parallelizable, the coordinate descent algorithm is able to immediately exploit the information computed during the update of a single coefficient. Both classes of algorithms have the potential to scale well with the dataset size. Finally, it has been shown that minimizers of convex regularization functionals are also stationary points of a family of differentiable regularization functionals involving the Moreau-Yosida regularization of the empirical risk.
In this section, we review some concepts and theorems from analysis and linear algebra, which are used in the proofs. Throughout, we work in a Euclidean space endowed with the standard inner product and the induced norm.
A set-valued map (or multifunction) is a rule that associates to each point a subset of the space. Notice that any ordinary map can be seen as a specific instance of a multifunction whose value at every point is a singleton. The multifunction is called monotone whenever
If there exists such that
then the map is single-valued, and is called a Lipschitz continuous function with the corresponding modulus. A Lipschitz continuous function is called non-expansive if its modulus is at most one, contractive if its modulus is strictly less than one, and firmly non-expansive if
In particular, firmly non-expansive maps are single-valued, monotone, and non-expansive. For any monotone multifunction , its resolvent is defined for any as , where stands for the identity operator. Resolvents of monotone operators are known to be firmly non-expansive.
Finite-valued convex functions
A function is called finite-valued convex if, for any pair of points and any convex combination of them, it satisfies
The subdifferential of a finite-valued convex function is a multifunction defined as
It can be shown that the following properties hold:
is a non-empty convex compact set for any .
is (Gâteaux) differentiable at if and only if is a singleton (whose unique element is the gradient).
is a monotone multifunction.
The point is a (global) minimizer of if and only if .
For any finite-valued convex function , its Moreau-Yosida regularization (or Moreau envelope, or quadratic min-convolution) is defined as
For any fixed , the minimum in the definition of is attained at , where denotes the so-called proximal mapping. It can be shown that the following remarkable properties hold:
is convex differentiable, and the gradient is Lipschitz continuous with modulus .
and have the same set of minimizers for all .
The gradient is called the Moreau-Yosida regularization of the sub-differential, and satisfies
where the resolvent of the inverse sub-differential is defined as
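As a concrete instance of these definitions, the Moreau-Yosida regularization of the absolute value is the Huber function:

```latex
f(u) = |u|, \qquad
f_{\mu}(x) \;=\; \min_{u}\left\{ \frac{1}{2\mu}\,(u - x)^2 + |u| \right\}
\;=\;
\begin{cases}
\dfrac{x^2}{2\mu}, & |x| \le \mu,\\[1ex]
|x| - \dfrac{\mu}{2}, & |x| > \mu,
\end{cases}
```

which is convex and differentiable with gradient of Lipschitz modulus $1/\mu$, and shares its minimizer $x = 0$ with $|x|$, in agreement with the properties listed above.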
Theorem 6 (Contraction mapping theorem).
Let and suppose that, given , the sequence is generated as
If is contractive with modulus , then there exists a unique fixed-point such that , and the sequence converges to at linear rate:
Theorem 7 (Zangwill’s convergence theorem).
Let denote a multifunction, and suppose that, given , the sequence is generated as
Let a given subset be designated as the solution set. If the following conditions hold:
The graph is a closed set,
There exists a descent function such that
For all , ,
For all , ,
The sequence is bounded,
then all the cluster points of the sequence belong to the solution set.
The following Lemma will prove useful in the subsequent proofs.
The functional of problem (2) is such that , for any vector in the nullspace of the kernel matrix.
Let denote any vector in the nullspace of the kernel matrix. Then, we have
Proof of Theorem 1.
Problem (2) is a convex optimization problem, where the functional is continuous and bounded below. First of all, we show that an optimal solution exists. Observe that the minimization can be restricted to the range of the kernel matrix. Indeed, any vector can be uniquely decomposed into a component in the nullspace of the kernel matrix and a component in the range. By Lemma 1, the value of the functional does not depend on the nullspace component. Since the functional is coercive on the range of the kernel matrix, it follows that optimal solutions exist.
A necessary and sufficient condition for to be optimal is
Consider the decomposition , where belongs to the nullspace of the kernel matrix and belongs to the range. Observe that
so that, for any optimal , there exists an optimal such that
By introducing the inverse sub-differential, equation (14) can be rewritten as
Multiplying both sides and subtracting, we obtain
Finally, introducing the resolvent as in (6), we have
Since is single-valued, equation (7) follows. ∎
Proof of Corollary 1.
We start from the sufficient condition for optimality (14). If (3) holds, then the subdifferential decouples with respect to the different components, so that there exist optimal coefficients such that
Multiplying both sides and subtracting, we have
The thesis follows by introducing the resolvents and solving for . ∎
Proof of Theorem 2.
We show that the sequence generated by algorithm (10) converges to an optimal solution of Problem (2). By Theorem 1, there exist optimal solutions satisfying (7). We now observe that any other vector such that is also optimal. Indeed, we have , where belongs to the nullspace of the kernel matrix. By Lemma 1, it follows that . To prove (9), it suffices to show that , where can be uniquely decomposed as
We need to prove that . Since is nonexpansive, we have
Since it is orthogonal to the nullspace of the kernel matrix, we can further estimate as follows
and denote the eigenvalues of the kernel matrix. Since the kernel matrix is positive semidefinite and condition (11) holds, we have
Since the kernel matrix is not null and has a finite number of eigenvalues, there is at least one eigenvalue with strictly positive distance from zero. It follows that . Since
we have, necessarily, that . Finally, observe that remains bounded
so that there is a subsequence converging to an optimal solution. In fact, by (9) it follows that any cluster point of the sequence is an optimal solution. ∎
Proof of Theorem 3.
Algorithm (10) can be rewritten as
where the map is defined as
Under both conditions (1) and (2) of the theorem, we show that is contractive. Uniqueness of the fixed-point, and convergence with linear rate will then follow from the contraction mapping theorem (Theorem 6). Let
where denote the eigenvalues of the kernel matrix. Since the kernel matrix is positive semidefinite, and condition (11) holds, we have
so that . We now show that the following inequality holds:
and denotes the Lipschitz modulus of when is differentiable with Lipschitz continuous gradient, and otherwise. Since is nonexpansive, it is easy to see that (15) holds when . Suppose now that is differentiable and is Lipschitz continuous with modulus . It follows that the inverse gradient satisfies
Since is monotone, we have
From this last inequality, we obtain (15). Finally, we have
where we have set . Consider the case in which is strictly positive definite. Then, it holds that
so that , and is contractive. Finally, when is differentiable and is Lipschitz continuous, we have and, again, it follows that is contractive. By the contraction mapping theorem (Theorem 6), there exists a unique satisfying (7), and the sequence of Picard iterations converges to at a linear rate. ∎
Proof of Theorem 4.
We shall apply Theorem 7 to the coordinate descent macro-iterations, where the solution set is given by
Let denote the algorithmic map obtained after each macro-iteration of the coordinate descent algorithm. By the essentially cyclic rule, we have
where is the set of strings of length at most on the alphabet such that all the characters are picked at least once. Observing that the set has finite cardinality, it follows that the graph is the union of a finite number of graphs of point-to-point maps:
Now notice that each map is of the form
All the resolvents are Lipschitz continuous, so that the corresponding maps are also Lipschitz continuous. It follows that the composition of a finite number of such maps is continuous, and its graph is a closed set. Since the union of a finite number of closed sets is also closed, we conclude that the graph is closed.
Each map yields the solution of an exact line search over the -th coordinate direction for minimizing functional of Problem (2). Hence, the function
is minimized at , that is
By definition of subdifferential, we have
In particular, in view of (16), we have
Now, observe that
Since , the following inequalities hold:
We now show that the objective functional is a descent function for the map associated with the solution set. Indeed, if a point satisfies (8), then the application of the map does not change the position, so that
Finally, we need to prove that the sequence of macro-iterations remains bounded. In fact, it turns out that the whole sequence of iterations of the coordinate descent algorithm is bounded. From the first inequality in (17), the sequence is non-increasing and bounded below, and thus it must converge to a number
Again from (17), we obtain that the sequence of step sizes is square summable: