Performance Limits of Stochastic Sub-Gradient Learning, Part I: Single Agent Case

11/24/2015 · Bicheng Ying, et al.

In this work and the supporting Part II, we examine the performance of stochastic sub-gradient learning strategies under weaker conditions than usually considered in the literature. The new conditions are shown to be automatically satisfied by several important cases of interest including SVM, LASSO, and Total-Variation denoising formulations. In comparison, these problems do not satisfy the traditional assumptions used in prior analyses and, therefore, conclusions derived from these earlier treatments are not directly applicable to these problems. The results in this article establish that stochastic sub-gradient strategies can attain linear convergence rates, as opposed to sub-linear rates, to the steady-state regime. A realizable exponential-weighting procedure is employed to smooth the intermediate iterates and guarantee useful performance bounds in terms of convergence rate and excess-risk performance. Part I of this work focuses on single-agent scenarios, which are common in stand-alone learning applications, while Part II extends the analysis to networked learners. The theoretical conclusions are illustrated by several examples and simulations, including comparisons with the FISTA procedure.

I Introduction

The minimization of non-differentiable convex cost functions is a critical step in the solution of many design problems [3, 4, 5], including the design of sparsity-aware (LASSO) solutions [6, 7], support-vector machine (SVM) learners [8, 9, 10, 11, 12], or total-variation-based image denoising solutions [13, 14]. Several powerful techniques have been proposed in the literature to deal with the non-differentiability aspect of the problem formulation, including methods that employ sub-gradient iterations [3, 4, 5], cutting-plane techniques [15], or proximal iterations [16, 17]. This work focuses on the class of sub-gradient methods for the reasons explained in the sequel. The sub-gradient technique is closely related to the traditional gradient-descent method [3, 4], where the actual gradient is replaced by a sub-gradient at points of non-differentiability. It is one of the simplest methods in current practice but is known to suffer from slow convergence. For instance, it is shown in [5] that, for convex cost functions, the optimal convergence rate that can be delivered by sub-gradient methods in deterministic optimization problems cannot be faster than $O(1/\sqrt{i})$ under worst-case conditions, where $i$ is the iteration index. Under some adjustments to the update equation through the use of weight averaging, it was shown in [18] that this rate can be improved to $O(1/i)$.

I-A The Significance of Sub-Gradient Algorithms

Still, there are at least three strong reasons that motivate a closer examination of the limits of performance of sub-gradient learning algorithms. First, the explosive interest in large-scale and big data scenarios favors the use of simple and computationally efficient algorithmic structures, of which the sub-gradient technique is a formidable example. Second, it is becoming increasingly evident that more sophisticated optimization iterations do not necessarily ensure improved performance when dealing with complex models and data structures [19, 20, 4, 21]. This is because the assumed models, or the adopted cost functions, do not always reflect faithfully the underlying problem structure. In addition, the presence of noise in the data generally implies that a solution perceived to be optimal is actually sub-optimal due to perturbations in the data and models. Third, a clear distinction needs to be made between optimizing deterministic costs [3, 4, 5], where the cost function is known completely beforehand, and optimizing stochastic costs, where the cost function is actually unavailable due to its dependence on the unknown probability distribution of the data. Stochastic problem formulations are very common in applications arising in machine learning, adaptation, and estimation. We will show that sub-gradient algorithms have surprisingly favorable behavior in the stochastic setting.

Motivated by these remarks, we therefore examine in some detail the performance of stochastic sub-gradient algorithms for the minimization of non-differentiable convex costs. Our analysis will reveal some interesting properties when these algorithms are used in the context of continuous adaptation and learning (i.e., when actual sub-gradients cannot be evaluated but need to be approximated continually in an online manner). The study is carried out for both cases of single stand-alone agents in this part and for multi-agent networks in Part II [2]. We start with single-agent learning and establish some revealing conclusions about how fast and how well the agent is able to learn. Extension of the results to the multi-agent case will require additional effort due to the coupling that exists among neighboring agents. Interestingly, the same broad conclusions will continue to hold in this case with proper adjustments.

I-B Contributions and Relation to Prior Literature

In order to examine the performance of stochastic sub-gradient implementations, it is necessary to introduce some assumptions on the gradient noise process (which is the difference between a true sub-gradient and its approximation). Here we will diverge in a noticeable way from assumptions commonly used in the literature for two reasons (see Sec. III for further explanations). First, we shall introduce weaker assumptions than usually adopted in prior works and, secondly and more importantly, we shall show that our assumptions are automatically satisfied for important cases of interest (such as SVM, LASSO, Total Variation). In contrast, these applications do not satisfy the traditional assumptions used in the prior literature and, therefore, conclusions derived based on these earlier works are not directly applicable to these problems. For example, it is common in the literature to assume that the cost function has a bounded gradient [22, 23, 4, 24, 18]; this condition is not even satisfied by quadratic costs whose gradient vectors are affine in their parameter and therefore grow unbounded. This condition is also in direct conflict with strongly-convex costs [24]. By weakening the assumptions, the analysis in this work becomes more challenging. At the same time, the conclusions that we arrive at will be stronger and more revealing, and they will apply to a broader class of algorithms and scenarios.

A second aspect of our study is that we will focus on the use of constant step-sizes in order to enable continuous adaptation and learning. Since the step-size is assumed to remain constant, the effect of gradient noise is always present and does not die out, as would occur if we were using instead a diminishing step-size of the form $\mu(i) = \tau/i$ for some $\tau > 0$, as is common in many other studies [25, 9, 23]. Such diminishing step-sizes annihilate the gradient noise term asymptotically, albeit at the cost of turning off adaptation in the long run. When this happens, the learning algorithm loses its ability to track drifts in the solution. In contrast, a constant step-size keeps adaptation alive and endows the learning algorithm with an inherent tracking mechanism: if the minimizer that we are seeking drifts with time due, for example, to changes in the statistical properties of the data, then the algorithm will be able to track the new location since it is continually adapting [26]. This useful tracking feature, however, comes at the expense of a persistent gradient noise term that never dies out. The challenge in analyzing the performance of learning algorithms in the constant adaptation regime is to show that their feedback mechanism induces a stable behavior that reduces the variance of the gradient noise to a small level and ensures convergence of the iterates to within a small $O(\mu)$-neighborhood of the desired optimal solution. Moreover, and importantly, it turns out that constant step-size adaptation is not only useful under non-stationary conditions when drifts in the data occur; it is also useful under stationary conditions when the minimizer does not vary with time. This is because, as we will see, convergence towards the steady-state regime will now be guaranteed to occur at an exponential (i.e., linear rather than sub-linear) rate, $O(\lambda^i)$ for some $\lambda \in (0,1)$, which is much faster than the $O(1/i)$ rate that would be observed under diminishing step-size implementations for strongly-convex costs.

A third aspect of our contribution relates to the fact that sub-gradient methods are not descent methods. For this reason, it is customary to employ pocket variables (i.e., the best iterate) [3, 5, 27, 28] or arithmetic averages [9] to smooth the output. However, as the analysis will reveal, the pocket method is not practical in the stochastic setting (its implementation requires knowledge of unavailable information), and the use of arithmetic averages [29] does not match the convergence rate derived later in Sec. IV-C. We shall therefore propose an alternative weighted averaging scheme with an exponentially-decaying window applied to the weight iterates, as sketched below. Similar, but different, weighting schemes applied to the data directly have been used in other contexts in the design of adaptive [26] and reinforcement learning [30] schemes. We shall show that the proposed averaging technique does not degrade convergence performance and is able to match the results derived later in Sec. IV-C.
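To make the idea concrete, the following Python sketch shows one simple realizable form of exponentially-weighted iterate averaging wrapped around a constant step-size sub-gradient update. The recursive smoothing form and the parameter `r` are illustrative assumptions for this sketch; the precise weighting window analyzed later in the paper may differ in its details.

```python
import numpy as np

def averaged_subgradient(approx_subgradient, data_stream, w0, mu, r=0.05):
    """Constant step-size stochastic sub-gradient recursion with an
    exponentially weighted running average of the iterates (sketch).

    approx_subgradient(w, sample) is a user-supplied placeholder that
    returns the instantaneous sub-gradient estimate at w.
    """
    w = w0.copy()
    w_bar = w0.copy()                    # smoothed output iterate
    for sample in data_stream:
        w = w - mu * approx_subgradient(w, sample)
        w_bar = (1 - r) * w_bar + r * w  # exponential window on iterates
    return w_bar
```

Unlike a plain arithmetic average, the exponential window keeps discounting old iterates, so the smoothed output remains responsive under continuous adaptation.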

Notation: We use lowercase letters to denote vectors, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. We also use $(\cdot)^\mathsf{T}$ to denote transposition, $(\cdot)^{-1}$ for matrix inversion, $\mathrm{Tr}(\cdot)$ for the trace of a matrix, $\lambda(\cdot)$ for the eigenvalues of a matrix, $\|\cdot\|$ for the 2-norm of a matrix or the Euclidean norm of a vector, and $\rho(\cdot)$ for the spectral radius of a matrix. Besides, we write $A \ge 0$ to denote that a matrix $A$ is positive semi-definite, and $x \succ 0$ to denote that all entries of a vector $x$ are positive.

II Problem Formulation

II-A Problem Formulation

We consider the problem of minimizing a risk function, $J(w)$, which is assumed to be expressed as the expected value of some loss function, $Q(w;\boldsymbol{x})$, namely,

$\min_{w}\ J(w)$  (1)

where we assume $J(w)$ is strongly convex with its unique minimizer denoted by $w^\star$, and where

$J(w) \triangleq \mathbb{E}\,Q(w;\boldsymbol{x})$  (2)

Here, the letter $\boldsymbol{x}$ represents the random data and the expectation operation is performed over the distribution of this data. Many problems in adaptation and learning involve risk functions of this form, including, for example, mean-square-error designs and support vector machine (SVM) solutions; see, e.g., [26, 12, 11]. For generality, we allow the risk function $J(w)$ to be non-differentiable. This situation is common in machine learning formulations, e.g., in SVM costs and in regularized sparsity-inducing formulations; examples to this effect are provided in the sequel.

In this work, we examine in some detail the performance of stochastic sub-gradient algorithms for the minimization of (1) when these algorithms operate in the context of continuous adaptation and learning, i.e., when actual sub-gradients cannot be evaluated and need to be approximated continually in an online manner. This situation arises when the probability distribution of the data is not known beforehand, as is common in practice: in many applications, we only have access to data realizations and not to their actual distribution.

II-B Stochastic Sub-Gradient Algorithm

To describe the sub-gradient algorithm, we first recall that a sub-gradient of a convex function $J(\cdot)$ at any arbitrary point $w_0$ is defined as any vector $g$ that satisfies:

$J(w) \ \ge\ J(w_0) + g^\mathsf{T}(w - w_0), \quad \forall\, w$  (3)

We shall often write $g(w_0)$, instead of simply $g$, in order to emphasize that it is a sub-gradient vector at location $w_0$. We note that sub-gradients are generally non-unique. Accordingly, a related concept is that of the sub-differential of $J(\cdot)$ at $w_0$, denoted by $\partial J(w_0)$. The sub-differential is defined as the set of all possible sub-gradient vectors at $w_0$:

$\partial J(w_0) \ \triangleq\ \left\{\, g \ \mid\ J(w) \ge J(w_0) + g^\mathsf{T}(w - w_0),\ \forall\, w \,\right\}$  (4)

In general, the sub-differential is a set; it collapses to a single point if, and only if, the cost function is differentiable at $w_0$ [5], in which case the sub-gradient vector coincides with the actual gradient vector at location $w_0$. For example, for the scalar function $J(w) = |w|$, the sub-differential at the origin is the interval $\partial J(0) = [-1, 1]$.

Referring back to problem (1), the traditional sub-gradient method for minimizing the risk function takes the form:

$w_i \ =\ w_{i-1} - \mu\, g(w_{i-1}), \quad i \ge 0$  (5)

where $g(w_{i-1})$ refers to one particular choice of a sub-gradient vector for $J(w)$ at location $w_{i-1}$, and $\mu > 0$ is a small step-size parameter. Since sub-gradients are non-unique, it is assumed in construction (5) that once a functional form for $g(w)$ is selected, that choice remains invariant throughout the adaptation process; it is not permitted to construct $g(w)$ one way at one iteration and another way at a later iteration (we will illustrate this point in the examples further ahead; see, e.g., (10)).

Now, in the context of adaptation and learning, we usually do not know the exact form of $g(w)$ because the distribution of the data is not known, which prevents computation of $J(w)$ and its sub-gradient vectors. As such, true sub-gradient vectors for $J(w)$ cannot be determined and they will need to be replaced by stochastic approximations evaluated from streaming data; examples to this effect are provided in the sequel in the context of support-vector machines and LASSO sparse designs. Accordingly, we replace the deterministic iteration (5) by the following stochastic iteration [3, 5, 27, 28]:

$\boldsymbol{w}_i \ =\ \boldsymbol{w}_{i-1} - \mu\, \widehat{g}(\boldsymbol{w}_{i-1})$  (6)

where the successive iterates, $\boldsymbol{w}_i$, are now random variables (denoted in boldface) and $\widehat{g}(\boldsymbol{w}_{i-1})$ represents an approximate sub-gradient vector at location $\boldsymbol{w}_{i-1}$, estimated from data available at time $i$. The difference between the true sub-gradient vector and its approximation is referred to as gradient noise and is denoted by

$\boldsymbol{s}_i(\boldsymbol{w}_{i-1}) \ \triangleq\ \widehat{g}(\boldsymbol{w}_{i-1}) - g(\boldsymbol{w}_{i-1})$  (7)
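To make recursion (6) concrete, the following minimal Python sketch runs the constant step-size stochastic sub-gradient loop for a generic problem. The function `approx_subgradient` and the data stream are placeholders to be supplied by the application (hypothetical names, not notation from the paper):

```python
import numpy as np

def stochastic_subgradient(approx_subgradient, data_stream, w0, mu):
    """Run the constant step-size stochastic sub-gradient recursion (6).

    approx_subgradient(w, sample) should return the instantaneous
    sub-gradient estimate g_hat(w) computed from one data sample.
    """
    w = np.asarray(w0, dtype=float).copy()
    for sample in data_stream:
        w = w - mu * approx_subgradient(w, sample)  # recursion (6)
    return w
```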

II-C Examples: SVM and LASSO

To illustrate the construction, we list two examples dealing with support vector machines (SVM) [8] and the LASSO problem [7]; the latter is also known as the sparse LMS problem or basis pursuit [31, 32, 6]. We will be using these two problems throughout the manuscript to illustrate our findings.

Example 1 (SVM problem).

The two-class SVM formulation deals with the problem of determining a separating hyperplane, $w \in \mathbb{R}^M$, in order to classify feature vectors, denoted by $\boldsymbol{h} \in \mathbb{R}^M$, into one of two classes: $\boldsymbol{\gamma} = +1$ or $\boldsymbol{\gamma} = -1$. The regularized SVM risk function is of the form:

$J(w) \ =\ \frac{\rho}{2}\|w\|^2 + \mathbb{E}\left\{\max\left(0,\ 1 - \boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w\right)\right\}$  (8)

where $\rho > 0$ is a regularization parameter. We are generally given a collection of independent training data, $\{\boldsymbol{\gamma}(i), \boldsymbol{h}_i\}$, consisting of feature vectors and their class designations and assumed to arise from jointly wide-sense stationary processes. Using this data, the loss function at time $i$ is given by

$Q(w; \boldsymbol{\gamma}(i), \boldsymbol{h}_i) \ =\ \frac{\rho}{2}\|w\|^2 + \max\left(0,\ 1 - \boldsymbol{\gamma}(i)\boldsymbol{h}_i^\mathsf{T} w\right)$  (9)

where the second term on the right-hand side, also known as the hinge function, is non-differentiable at all points $w$ satisfying $\boldsymbol{\gamma}(i)\boldsymbol{h}_i^\mathsf{T} w = 1$. There are generally many choices for the sub-gradient vector at these locations. One particular choice is:

$g(w) \ =\ \rho w - \mathbb{E}\left\{\boldsymbol{\gamma}\boldsymbol{h}\,\mathbb{I}\left[\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w \le 1\right]\right\}$  (10)

where the indicator function is defined as follows:

$\mathbb{I}[a] \ \triangleq\ \begin{cases} 1, & \text{if statement } a \text{ is true}\\ 0, & \text{otherwise} \end{cases}$  (11)

The choice (10) requires the computation of an expectation, which is infeasible since the distribution of the data is not known beforehand. One approximation for this particular sub-gradient choice at iteration $i$ is the construction

$\widehat{g}(w) \ =\ \rho w - \boldsymbol{\gamma}(i)\boldsymbol{h}_i\,\mathbb{I}\left[\boldsymbol{\gamma}(i)\boldsymbol{h}_i^\mathsf{T} w \le 1\right]$  (12)

where the expectation operator is dropped. We refer to (12) as an instantaneous approximation for (10) since it employs the instantaneous realizations $\{\boldsymbol{\gamma}(i), \boldsymbol{h}_i\}$ to approximate the mean operation in (10). There can be other choices for the true sub-gradient vector and for its approximation. However, it is assumed that once a particular choice is made for the form of $g(w)$, as in (10), then that choice, and its approximation (12), remain invariant during the operation of the algorithm. Using (10) and (12), the gradient noise process associated with this implementation of the SVM formulation is then given by

$\boldsymbol{s}_i(w) \ =\ \mathbb{E}\left\{\boldsymbol{\gamma}\boldsymbol{h}\,\mathbb{I}\left[\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w \le 1\right]\right\} - \boldsymbol{\gamma}(i)\boldsymbol{h}_i\,\mathbb{I}\left[\boldsymbol{\gamma}(i)\boldsymbol{h}_i^\mathsf{T} w \le 1\right]$  (13)
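As an illustration, here is a minimal sketch of the instantaneous SVM sub-gradient (12) plugged into recursion (6). The synthetic data generation below is an assumption added for the example, not part of the paper:

```python
import numpy as np

def svm_subgradient(w, gamma, h, rho):
    """Instantaneous sub-gradient (12) of the regularized SVM loss (9)."""
    indicator = 1.0 if gamma * (h @ w) <= 1 else 0.0
    return rho * w - gamma * h * indicator

# Toy usage on synthetic linearly separable data (illustrative only).
rng = np.random.default_rng(0)
M, mu, rho = 5, 0.01, 0.1
w_true = rng.standard_normal(M)          # hypothetical ground truth
w = np.zeros(M)
for i in range(2000):
    h = rng.standard_normal(M)
    gamma = 1.0 if h @ w_true > 0 else -1.0   # class label in {+1, -1}
    w = w - mu * svm_subgradient(w, gamma, h, rho)  # recursion (6)
```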

Example 2 (LASSO problem). The least-mean-squares LASSO formulation deals with the problem of estimating a sparse weight vector $w \in \mathbb{R}^M$ by minimizing a risk function of the form [33, 34]:¹

$J(w) \ =\ \frac{1}{2}\,\mathbb{E}\left(\boldsymbol{d}(i) - \boldsymbol{u}_i^\mathsf{T} w\right)^2 + \delta\,\|w\|_1$  (14)

¹ Traditionally, LASSO refers to minimizing a deterministic cost function, such as $\sum_{i=1}^{N}\left(d(i) - u_i^\mathsf{T} w\right)^2 + \delta\|w\|_1$. However, we are interested in stochastic formulations, which motivates (14).

where $\delta > 0$ is a regularization parameter and $\|w\|_1$ denotes the $\ell_1$-norm of $w$. In this problem formulation, the variable $\boldsymbol{d}(i)$ now plays the role of a desired signal, while $\boldsymbol{u}_i$ plays the role of a regression vector. It is assumed that the data are zero-mean wide-sense stationary with second-order moments denoted by

$R_u \ \triangleq\ \mathbb{E}\,\boldsymbol{u}_i\boldsymbol{u}_i^\mathsf{T} > 0, \qquad r_{du} \ \triangleq\ \mathbb{E}\,\boldsymbol{d}(i)\boldsymbol{u}_i$  (15)

It is generally assumed that $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$ satisfy a linear regression model of the form:

$\boldsymbol{d}(i) \ =\ \boldsymbol{u}_i^\mathsf{T} w^o + \boldsymbol{v}(i)$  (16)

where $w^o$ is the desired unknown sparse vector, and $\boldsymbol{v}(i)$ refers to an additive zero-mean noise component with finite variance $\sigma_v^2$ and independent of $\boldsymbol{u}_i$. If we multiply both sides of (16) by $\boldsymbol{u}_i$ from the left and compute expectations, we find that $w^o$ satisfies the normal equations:

$r_{du} \ =\ R_u w^o$  (17)

We are again given a collection of independent training data, $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$, consisting of regression vectors and their noisy measured signals. Using this data, the loss function at time $i$ is given by

$Q(w; \boldsymbol{d}(i), \boldsymbol{u}_i) \ =\ \frac{1}{2}\left(\boldsymbol{d}(i) - \boldsymbol{u}_i^\mathsf{T} w\right)^2 + \delta\,\|w\|_1$  (18)

where the second term on the right-hand side is again non-differentiable. One particular choice for the sub-gradient vector is:

$g(w) \ =\ R_u w - r_{du} + \delta\,\mathrm{sgn}(w)$  (19)

where the notation $\mathrm{sgn}(x)$, for a scalar $x$, refers to the sign function:

$\mathrm{sgn}(x) \ \triangleq\ \begin{cases} x/|x|, & x \ne 0\\ 0, & x = 0 \end{cases}$  (20)

When applied to a vector $w$, as is the case in (19), the sgn function is a vector consisting of the signs of the individual entries of $w$. Similar to the previous example, it is infeasible to find the exact sub-gradient (19) since the moments $\{R_u, r_{du}\}$ are unknown. Instead, we use the following instantaneous approximation for (19):

$\widehat{g}(w) \ =\ -\boldsymbol{u}_i\left(\boldsymbol{d}(i) - \boldsymbol{u}_i^\mathsf{T} w\right) + \delta\,\mathrm{sgn}(w)$  (21)

It then follows that the gradient noise process in the LASSO formulation is given by

$\boldsymbol{s}_i(w) \ =\ \left(\boldsymbol{u}_i\boldsymbol{u}_i^\mathsf{T} - R_u\right)w + \left(r_{du} - \boldsymbol{u}_i\boldsymbol{d}(i)\right)$  (22)
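The corresponding instantaneous LASSO update is equally compact. The sketch below, with assumed synthetic data following model (16), pairs approximation (21) with recursion (6):

```python
import numpy as np

def lasso_subgradient(w, d, u, delta):
    """Instantaneous sub-gradient (21) of the LASSO loss (18).
    np.sign(0) = 0, matching the sgn(0) = 0 choice in (20)."""
    return -u * (d - u @ w) + delta * np.sign(w)

# Toy usage: sparse w_o observed through noisy linear regressions (16).
rng = np.random.default_rng(1)
M, mu, delta = 10, 0.01, 0.05
w_o = np.zeros(M); w_o[:3] = [1.0, -0.5, 0.25]    # sparse target
w = np.zeros(M)
for i in range(5000):
    u = rng.standard_normal(M)
    d = u @ w_o + 0.1 * rng.standard_normal()     # model (16)
    w = w - mu * lasso_subgradient(w, d, u, delta)  # recursion (6)
```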

III Modeling Conditions

In order to examine the performance of the stochastic sub-gradient implementation (6), it is necessary to introduce some assumptions on the gradient noise process. We diverge here from assumptions that are commonly used in the literature for two main reasons. First, we introduce weaker assumptions than usually adopted in prior works and, secondly and more importantly, we show that our assumptions are automatically satisfied by important cases of interest (such as SVM and LASSO). In contrast, these applications do not satisfy the traditional assumptions used in the literature and, therefore, conclusions derived based on these earlier works are not directly applicable to SVM and LASSO problems. We clarify these remarks in the sequel.

First, we emphasize, as explained above, that the particular construction adopted for the sub-gradient function $g(w)$ at any location $w$, as well as its instantaneous approximation $\widehat{g}(w)$, is assumed to remain invariant during the operation of the algorithm.

Assumption 1 (Conditions on gradient noise)

The first and second-order conditional moments of the gradient noise process satisfy the following conditions:

$\mathbb{E}\left[\boldsymbol{s}_i(\boldsymbol{w}_{i-1}) \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ =\ 0$  (23)
$\mathbb{E}\left[\|\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ \le\ \beta^2\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 + \sigma^2$  (24)

for some constants $\beta^2 \ge 0$ and $\sigma^2 \ge 0$, where $\widetilde{\boldsymbol{w}}_{i-1} \triangleq w^\star - \boldsymbol{w}_{i-1}$, and where the notation $\boldsymbol{\mathcal{F}}_{i-1}$ denotes the filtration (collection) corresponding to all past iterates:

$\boldsymbol{\mathcal{F}}_{i-1} \ \triangleq\ \text{filtration}\left\{\boldsymbol{w}_{-1}, \boldsymbol{w}_0, \boldsymbol{w}_1, \ldots, \boldsymbol{w}_{i-1}\right\}$  (25)

Conditions (23) and (24) essentially require that the construction of the approximate sub-gradient vector should not introduce bias and that its error variance should decrease as the quality of the iterate improves. Both of these conditions are sensible and, moreover, they will be shown to be satisfied by, for example, SVM and LASSO constructions.
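Both requirements can be checked numerically for the LASSO gradient noise (22), for which all quantities are computable in closed form. The Gaussian data model below is an assumption of this illustration, not a construction from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
M, sigma_v = 8, 0.1
R_u = np.eye(M)                      # white Gaussian regressors
w_o = rng.standard_normal(M)
r_du = R_u @ w_o                     # normal equations (17)

w = rng.standard_normal(M)           # a fixed (past) iterate
samples = []
for _ in range(50000):
    u = rng.standard_normal(M)
    d = u @ w_o + sigma_v * rng.standard_normal()
    s = (np.outer(u, u) - R_u) @ w + (r_du - u * d)   # noise (22)
    samples.append(s)
samples = np.array(samples)
print("conditional mean norm (should be ~0):",
      np.linalg.norm(samples.mean(axis=0)))            # checks (23)
print("conditional variance E||s||^2:",
      (samples ** 2).sum(axis=1).mean())               # bounded as in (24)
```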

Assumption 2 (Strongly-convex risk function)

The risk function $J(w)$ is assumed to be $\eta$-strongly convex (or, simply, strongly convex), i.e., there exists an $\eta > 0$ such that

$J\left(\theta w_1 + (1-\theta)w_2\right) \ \le\ \theta J(w_1) + (1-\theta)J(w_2) - \frac{\eta}{2}\,\theta(1-\theta)\,\|w_1 - w_2\|^2$  (26)

for any $w_1$, $w_2$, and $\theta \in [0, 1]$. The above condition is equivalent to requiring [4]:

$J(w_2) \ \ge\ J(w_1) + \left(g(w_1)\right)^\mathsf{T}(w_2 - w_1) + \frac{\eta}{2}\,\|w_2 - w_1\|^2$  (27)

for any $w_1, w_2$ and any sub-gradient $g(w_1) \in \partial J(w_1)$. Under this condition, the minimizer $w^\star$ exists and is unique.

Assumption 2 is relatively rare in works on non-differentiable function optimization because it is customary in these earlier works to focus on studying piece-wise linear risks; these are useful non-smooth functions but they do not satisfy the strong-convexity condition. In our case, strong-convexity is not a restriction because in the context of adaptation and learning, it is common for the risk functions to include a regularization term, which generally helps ensure strong-convexity.

Assumption 3 (Sub-gradient is Affine-Lipschitz)

It is assumed that the sub-gradient choice $g(w)$ used in (5) is affine-Lipschitz, meaning that there exist constants $c \ge 0$ and $d \ge 0$ such that the following property holds:

$\|g(w_1) - g(w_2)\| \ \le\ c\,\|w_1 - w_2\| + d$  (28)

for any $w_1, w_2$.

It is customary in the literature to use in place of Assumption 3 a more restrictive condition that requires the sub-gradient to be bounded [3, 22, 24], i.e., to require instead of (28) that

$\|g(w)\| \ \le\ G, \quad \forall\, w$  (29)

for some constant $G \ge 0$, which is also equivalent to assuming the risk function is Lipschitz:

$|J(w_1) - J(w_2)| \ \le\ G\,\|w_1 - w_2\|$  (30)

Such a requirement does not even hold for quadratic risk functions, whose gradient vectors are affine in $w$ and, therefore, grow unbounded! Even more, requirement (29) is always in conflict with the strong-convexity assumption. For example, if we set $w_1 = w$ and $w_2 = w^\star$ in (27), we would obtain:

$J(w^\star) \ \ge\ J(w) + \left(g(w)\right)^\mathsf{T}(w^\star - w) + \frac{\eta}{2}\,\|w^\star - w\|^2$  (31)

Likewise, if we instead set $w_1 = w^\star$ and $w_2 = w$ in (27), we would obtain:

$J(w) \ \ge\ J(w^\star) + \left(g(w^\star)\right)^\mathsf{T}(w - w^\star) + \frac{\eta}{2}\,\|w - w^\star\|^2$  (32)

Adding relations (31)–(32) we arrive at the strong monotonicity property:

$\left(g(w) - g(w^\star)\right)^\mathsf{T}(w - w^\star) \ \ge\ \eta\,\|w - w^\star\|^2$  (33)

which implies, in view of the Cauchy-Schwarz inequality, that

$\|g(w) - g(w^\star)\| \ \ge\ \eta\,\|w - w^\star\|$  (34)

In other words, the strong-convexity condition (27) implies that the sub-gradient satisfies (34); since the right-hand side of (34) grows unbounded with $\|w - w^\star\|$, this conclusion is in clear conflict with the bounded requirement in (29).

One common way to circumvent the difficulty with the bounded requirement (29), and to ensure that it holds, is to restrict the domain of $w$ to some bounded convex set, say, $w \in \mathcal{W}$, in order to bound its sub-gradient vectors, and then employ a projection-based sub-gradient method (i.e., one in which each iteration of the form (5) is followed by projecting onto $\mathcal{W}$). However, this approach has at least three difficulties. First, the unconstrained problem is transformed into a more demanding constrained problem involving an extra projection step. Second, the projection step may not be straightforward to perform unless the set $\mathcal{W}$ is simple enough. Third, the bound that results on the sub-gradient vectors by limiting $w$ to $\mathcal{W}$ can be very loose, since it depends on the diameter of the convex set $\mathcal{W}$.

For these reasons, we do not rely on the restrictive condition (29) and introduce instead the more relaxed affine-Lipschitz condition (28). This condition is weaker than (29). Indeed, it can be verified that (29) implies (28) but not the other way around. To see this, assume (29) holds. Then, using the triangle inequality of norms, we have

$\|g(w_1) - g(w_2)\| \ \le\ \|g(w_1)\| + \|g(w_2)\| \ \le\ 2G$  (35)

which is a special case of (28) with $c = 0$ and $d = 2G$. We now verify that important problems of interest satisfy Assumption 3 but not the traditional condition (29).

Example 3 (SVM problem). We revisit the SVM formulation from Example 1. The risk function (8) is strongly convex due to the presence of the quadratic regularization term, $\frac{\rho}{2}\|w\|^2$, and since the hinge function is convex. The zero-mean property of the gradient noise process (13) is obvious in this case. With respect to the variance condition, we note, using $\|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and Jensen's inequality, that

$\mathbb{E}\left[\|\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ \le\ 2\left\|\mathbb{E}\left\{\boldsymbol{\gamma}\boldsymbol{h}\,\mathbb{I}[\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T}\boldsymbol{w}_{i-1} \le 1]\right\}\right\|^2 + 2\,\mathbb{E}\,\|\boldsymbol{h}\|^2 \ \le\ 4\,\mathbb{E}\,\|\boldsymbol{h}\|^2$  (36)

so that Assumption 1 is satisfied with $\beta^2 = 0$ and $\sigma^2 = 4\,\mathbb{E}\,\|\boldsymbol{h}\|^2$. Let us now verify Assumption 3. For that purpose, we first note that the sub-differential of the SVM risk is given by:

$\partial J(w) \ =\ \rho w + \partial\,\mathbb{E}\left\{\max\left(0,\ 1 - \boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w\right)\right\} \ \overset{(a)}{=}\ \rho w + \mathbb{E}\left\{\partial\max\left(0,\ 1 - \boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w\right)\right\} \ \overset{(b)}{=}\ \rho w - \mathbb{E}\left\{\boldsymbol{\gamma}\boldsymbol{h}\,\mathcal{I}\left[\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w\right]\right\}$  (37)

where in step (a) we use the fact that the SVM loss function is continuous and convex and, therefore, we can exchange the order of the sub-differential operation with the expectation operation [35, Prop. 2.10]. In step (b), the set-valued operator $\mathcal{I}[\cdot]$ is defined by

$\mathcal{I}[x] \ \triangleq\ \begin{cases} \{1\}, & x < 1\\ [0, 1], & x = 1\\ \{0\}, & x > 1 \end{cases}$  (38)

Different choices for the value of $\mathcal{I}[x]$ at the location $x = 1$ lead to different sub-gradients. We can therefore express any arbitrary sub-gradient in the form

$g(w) \ =\ \rho w - \mathbb{E}\left\{\boldsymbol{\gamma}\boldsymbol{h}\,t\!\left(\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w\right)\right\}$  (39)

where the notation $t(x) \in \mathcal{I}[x]$ means that we pick a particular value within the range $\mathcal{I}[x]$ to define the sub-gradient (39). It now follows that

$\|g(w_1) - g(w_2)\| \ \le\ \rho\,\|w_1 - w_2\| + \mathbb{E}\left\{\|\boldsymbol{h}\|\cdot\left|t(\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w_1) - t(\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w_2)\right|\right\}$  (40)

Note further that:

$\mathbb{E}\left\{\|\boldsymbol{h}\|\cdot\left|t(\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w_1) - t(\boldsymbol{\gamma}\boldsymbol{h}^\mathsf{T} w_2)\right|\right\} \ \le\ \mathbb{E}\,\|\boldsymbol{h}\|$  (41)

where the last inequality is because $t(\cdot)$ is non-negative and uniformly bounded by one. Substituting (41) into (40) gives

$\|g(w_1) - g(w_2)\| \ \le\ \rho\,\|w_1 - w_2\| + \mathbb{E}\,\|\boldsymbol{h}\|$  (42)

which is of the same form as (28) with parameters $c = \rho$ and $d = \mathbb{E}\,\|\boldsymbol{h}\|$.

Example 4 (LASSO problem). We revisit the LASSO formulation from Example 2. Under the condition $R_u > 0$, the risk function (14) is again strongly convex because the quadratic term, $\frac{1}{2}\mathbb{E}\left(\boldsymbol{d}(i) - \boldsymbol{u}_i^\mathsf{T}w\right)^2$, is strongly convex and the regularization term, $\delta\|w\|_1$, is convex. With regards to the gradient noise process (22), it was already shown in Eq. (3.22) of [36] that, conditioned on past iterates, it has zero mean and its conditional variance satisfies:

$\mathbb{E}\left[\|\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ \le\ \beta^2\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 + \sigma^2$  (43)

where the constants $\beta^2$ and $\sigma^2$ are determined by the fourth-order moments of the regression data and by the noise variance $\sigma_v^2$. It follows that Assumption 1 is satisfied with these values of $\beta^2$ and $\sigma^2$. Let us now verify Assumption 3. For that purpose, we first note that the sub-differential set of the LASSO risk is given by:

$\partial J(w) \ =\ R_u w - r_{du} + \delta\,\mathcal{S}[w]$  (44)

where, for each entry $x_k$ of a vector $x$, the set-valued operator $\mathcal{S}[\cdot]$ is defined as:

$\mathcal{S}[x_k] \ \triangleq\ \begin{cases} \{x_k/|x_k|\}, & x_k \ne 0\\ [-1, 1], & x_k = 0 \end{cases}$  (45)

Different choices for the value of $\mathcal{S}[x_k]$ at the locations $x_k = 0$ lead to different sub-gradients. We can therefore express any arbitrary sub-gradient in the form

$g(w) \ =\ R_u w - r_{du} + \delta\,s(w)$  (46)

where the notation $s(w) \in \mathcal{S}[w]$ means that we pick particular values within the ranges $\mathcal{S}[w_k]$ to define the sub-gradient (46). It now follows that:

$\|g(w_1) - g(w_2)\| \ \le\ \|R_u\|\,\|w_1 - w_2\| + \delta\,\|s(w_1) - s(w_2)\|$  (47)

Observing that the difference between any entries of $s(w_1)$ and $s(w_2)$ cannot be larger than 2 in magnitude, we get

$\|g(w_1) - g(w_2)\| \ \le\ \|R_u\|\,\|w_1 - w_2\| + 2\delta\,\|\mathbb{1}_M\| \ =\ \|R_u\|\,\|w_1 - w_2\| + 2\delta\sqrt{M}$  (48)

where $\mathbb{1}_M$ is the $M \times 1$ column vector with all its entries equal to one and $M$ is the dimension of $w$. We again arrive at a relation of the same form as (28) with parameters $c = \|R_u\|$ and $d = 2\delta\sqrt{M}$.
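A quick numerical sanity check of the affine-Lipschitz property (28) for this LASSO sub-gradient is sketched below. The constants $c = \|R_u\|$ and $d = 2\delta\sqrt{M}$ follow (48), while the covariance model and random test points are assumptions of the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
M, delta = 10, 0.05
A = rng.standard_normal((M, M))
R_u = A @ A.T + np.eye(M)            # a positive-definite covariance
r_du = rng.standard_normal(M)

def g(w):
    """Sub-gradient (46) with the sgn(0) = 0 choice from (20)."""
    return R_u @ w - r_du + delta * np.sign(w)

c = np.linalg.norm(R_u, 2)           # spectral norm of R_u
d = 2 * delta * np.sqrt(M)
for _ in range(1000):
    w1, w2 = rng.standard_normal(M), rng.standard_normal(M)
    lhs = np.linalg.norm(g(w1) - g(w2))
    assert lhs <= c * np.linalg.norm(w1 - w2) + d + 1e-12  # property (28)
```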

IV Performance Analysis

We now carry out a detailed mean-square-error analysis of the stability and performance of the stochastic sub-gradient recursion (6) in the presence of gradient noise and for constant step-size adaptation. In particular, we will be able to show that linear (exponential) convergence to the steady-state regime can be attained at the rate $O(\lambda^i)$ for some $\lambda \in (0, 1)$.

IV-A Continuous Adaptation

Since the step-size is assumed to remain constant, the effect of gradient noise is continually present and does not die out, as would occur if we were using instead a diminishing step-size, say, of the form $\mu(i) = \tau/i$ for some $\tau > 0$. Such diminishing step-sizes annihilate the gradient noise term asymptotically, albeit at the expense of turning off adaptation in the long run. In that case, the learning algorithm loses its tracking ability. In contrast, a constant step-size keeps adaptation alive and endows the learning algorithm with a tracking mechanism and, as the analysis will show, enables convergence towards the steady-state regime at an exponential rate, $O(\lambda^i)$, for some $\lambda \in (0, 1)$.

IV-B A Useful Bound

In preparation for the analysis, we first conclude from (28) that the following useful condition also holds, involving squared norms as opposed to the actual norms:

$\|g(w_1) - g(w_2)\|^2 \ \le\ e^2\,\|w_1 - w_2\|^2 + f^2$  (49)

where $e^2 \triangleq 2c^2$ and $f^2 \triangleq 2d^2$. This is because

$\|g(w_1) - g(w_2)\|^2 \ \le\ \left(c\,\|w_1 - w_2\| + d\right)^2 \ \le\ 2c^2\,\|w_1 - w_2\|^2 + 2d^2$  (50)

where the last step uses $(a + b)^2 \le 2a^2 + 2b^2$.

IV-C Stability and Convergence

We are now ready to establish the following important conclusion regarding the stability and performance of the stochastic sub-gradient algorithm (6); the conclusion indicates that the algorithm is stable and converges exponentially fast for sufficiently small step-sizes. But first, we explain our notation and the definition of a "best" iterate, denoted by $\boldsymbol{w}_i^{\text{best}}$ [5]. This variable is useful in the context of sub-gradient implementations because it is known that negative sub-gradient directions do not necessarily correspond to true descent directions (as is the case with actual gradient vectors for differentiable functions).

At every iteration $i$, the risk value that corresponds to the iterate $\boldsymbol{w}_i$ is $J(\boldsymbol{w}_i)$. This value is obviously a random variable due to the randomness in the data used to run the algorithm. We denote the mean risk value by $\mathbb{E}\,J(\boldsymbol{w}_i)$. The next theorem examines how fast and how close this mean value approaches the optimal value, $J(w^\star)$. To do so, the statement in the theorem relies on the best pocket iterate, denoted by $\boldsymbol{w}_i^{\text{best}}$, which is defined as follows. At any iteration $i$, the value that is saved in this pocket variable is the past iterate, $\boldsymbol{w}_j$, that has generated the smallest mean risk value up to that point in time, i.e.,

$\boldsymbol{w}_i^{\text{best}} \ \triangleq\ \boldsymbol{w}_{j^o}, \quad \text{where } j^o \ =\ \underset{0 \le j \le i}{\arg\min}\ \mathbb{E}\,J(\boldsymbol{w}_j)$  (51)

The statement below then proves that $\mathbb{E}\,J(\boldsymbol{w}_i^{\text{best}})$ approaches a small neighborhood of size $O(\mu)$ around $J(w^\star)$ exponentially fast:

$\mathbb{E}\,J(\boldsymbol{w}_i^{\text{best}}) - J(w^\star) \ \le\ O(\mu) + O(\lambda^{i+1})$  (52)

where the big-O notation $O(\mu)$ means "on the order of $\mu$".

Theorem 1 (Single agent performance)

Consider using the stochastic sub-gradient algorithm (6) to seek the unique minimizer, $w^\star$, of the optimization problem (1), where the risk function, $J(w)$, is assumed to satisfy Assumptions 1–3. If the step-size parameter satisfies (i.e., if it is small enough):

$\mu \ <\ \frac{\eta}{\beta^2 + e^2}$  (53)

then it holds that

$\mathbb{E}\,J(\boldsymbol{w}_i^{\text{best}}) - J(w^\star) \ \le\ \frac{\lambda^{i+1}(1-\lambda)}{1 - \lambda^{i+1}}\cdot\frac{\mathbb{E}\,\|\widetilde{\boldsymbol{w}}_{-1}\|^2}{2\mu} + \frac{\mu\,(\sigma^2 + f^2)}{2}$  (54)

That is, the convergence of $\mathbb{E}\,J(\boldsymbol{w}_i^{\text{best}})$ towards the $O(\mu)$-neighborhood around $J(w^\star)$ occurs at the linear rate, $O(\lambda^i)$, dictated by the parameter:

$\lambda \ \triangleq\ 1 - \mu\eta + \mu^2(\beta^2 + e^2)$  (55)

Condition (53) ensures $0 < \lambda < 1$. In the limit:

$\limsup_{i \to \infty}\ \mathbb{E}\,J(\boldsymbol{w}_i^{\text{best}}) - J(w^\star) \ \le\ \frac{\mu\,(\sigma^2 + f^2)}{2} \ =\ O(\mu)$  (56)

That is, for large $i$, $\mathbb{E}\,J(\boldsymbol{w}_i^{\text{best}})$ is approximately $O(\mu)$-suboptimal.
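Before turning to the proof, the following short simulation sketch illustrates the two regimes promised by the theorem on the LASSO example: an initial exponential decay of the error, followed by a plateau whose level shrinks with $\mu$. The averaging over runs and the plotted quantity (a mean-square-error proxy for the excess risk, measured against $w^o$ rather than $w^\star$) are assumptions of this illustration, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
M, delta, sigma_v = 10, 0.05, 0.1
w_o = np.zeros(M); w_o[:3] = [1.0, -0.5, 0.25]

def run(mu, iters=3000, runs=50):
    """Average squared error of recursion (6) on the LASSO model (16)."""
    err = np.zeros(iters)
    for _ in range(runs):
        w = np.zeros(M)
        for i in range(iters):
            u = rng.standard_normal(M)
            d = u @ w_o + sigma_v * rng.standard_normal()
            w -= mu * (-u * (d - u @ w) + delta * np.sign(w))  # (6) + (21)
            err[i] += np.sum((w - w_o) ** 2) / runs
    return err

# Smaller mu: lower plateau (the O(mu) level) but slower exponential decay.
for mu in (0.02, 0.005):
    e = run(mu)
    print(f"mu={mu}: error at iter 300 = {e[299]:.4f}, final = {e[-1]:.4f}")
```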

Proof: We introduce the error vector, $\widetilde{\boldsymbol{w}}_i \triangleq w^\star - \boldsymbol{w}_i$, and use it to deduce from (6)–(7) the following error recursion:

$\widetilde{\boldsymbol{w}}_i \ =\ \widetilde{\boldsymbol{w}}_{i-1} + \mu\,g(\boldsymbol{w}_{i-1}) + \mu\,\boldsymbol{s}_i(\boldsymbol{w}_{i-1})$  (57)

Squaring both sides and computing the conditional expectation, we obtain:

$\mathbb{E}\left[\|\widetilde{\boldsymbol{w}}_i\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ \overset{(a)}{=}\ \|\widetilde{\boldsymbol{w}}_{i-1} + \mu\,g(\boldsymbol{w}_{i-1})\|^2 + \mu^2\,\mathbb{E}\left[\|\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right]$  (58)
$\le\ \|\widetilde{\boldsymbol{w}}_{i-1}\|^2 + 2\mu\left(g(\boldsymbol{w}_{i-1})\right)^\mathsf{T}\widetilde{\boldsymbol{w}}_{i-1} + \mu^2\,\|g(\boldsymbol{w}_{i-1})\|^2 + \mu^2\left(\beta^2\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 + \sigma^2\right)$  (59)

In step (a), we eliminated the cross term because, conditioned on $\boldsymbol{\mathcal{F}}_{i-1}$, the gradient noise process has zero mean; in (59) we expanded the square and applied the variance bound (24). Now, from the strong-convexity condition (27), with $w_1 = \boldsymbol{w}_{i-1}$ and $w_2 = w^\star$, it holds that

$\left(g(\boldsymbol{w}_{i-1})\right)^\mathsf{T}\widetilde{\boldsymbol{w}}_{i-1} \ \le\ -\left(J(\boldsymbol{w}_{i-1}) - J(w^\star)\right) - \frac{\eta}{2}\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2$  (60)

Substituting into (59) gives

$\mathbb{E}\left[\|\widetilde{\boldsymbol{w}}_i\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ \le\ \left(1 - \mu\eta + \mu^2\beta^2\right)\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 - 2\mu\left(J(\boldsymbol{w}_{i-1}) - J(w^\star)\right) + \mu^2\,\|g(\boldsymbol{w}_{i-1})\|^2 + \mu^2\sigma^2$  (61)

Referring to (49), if we set $w_1 = \boldsymbol{w}_{i-1}$, $w_2 = w^\star$, and use the fact that there exists one particular sub-gradient satisfying $g(w^\star) = 0$, we obtain:

$\|g(\boldsymbol{w}_{i-1})\|^2 \ \le\ e^2\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 + f^2$  (62)

Substituting into (61), we get

$\mathbb{E}\left[\|\widetilde{\boldsymbol{w}}_i\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \ \le\ \left(1 - \mu\eta + \mu^2(\beta^2 + e^2)\right)\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 - 2\mu\left(J(\boldsymbol{w}_{i-1}) - J(w^\star)\right) + \mu^2\left(\sigma^2 + f^2\right)$  (63)

Taking expectations again to eliminate the conditioning on $\boldsymbol{\mathcal{F}}_{i-1}$, we arrive at:

$\mathbb{E}\,\|\widetilde{\boldsymbol{w}}_i\|^2 \ \le\ \left(1 - \mu\eta + \mu^2(\beta^2 + e^2)\right)\mathbb{E}\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 - 2\mu\left(\mathbb{E}\,J(\boldsymbol{w}_{i-1}) - J(w^\star)\right) + \mu^2\left(\sigma^2 + f^2\right)$  (64)

To proceed, we simplify the notation and introduce the scalars

$u(i) \ \triangleq\ \mathbb{E}\,\|\widetilde{\boldsymbol{w}}_i\|^2$  (65)
$b(i) \ \triangleq\ \mathbb{E}\,J(\boldsymbol{w}_i) - J(w^\star)$  (66)
$\lambda \ \triangleq\ 1 - \mu\eta + \mu^2(\beta^2 + e^2)$  (67)
$h \ \triangleq\ \sigma^2 + f^2$  (68)

Note that since $w^\star$ is the unique global minimizer of $J(w)$, it holds that $J(\boldsymbol{w}_i) \ge J(w^\star)$, so that $b(i) \ge 0$ for all $i$. The variable $b(i)$ represents the average excess risk. Now, we can rewrite (64) more compactly as

$u(i) \ \le\ \lambda\,u(i-1) - 2\mu\,b(i-1) + \mu^2 h$  (69)

Iterating (69) over the interval $0 \le j \le i$, starting from the initial condition $u(-1)$, gives:

$u(i) \ \le\ \lambda^{i+1}\,u(-1) - 2\mu\sum_{j=0}^{i}\lambda^{i-j}\,b(j-1) + \mu^2 h\,\frac{1 - \lambda^{i+1}}{1 - \lambda}$  (70)

Let us verify that $0 < \lambda < 1$. First, observe from expression (67) that $\lambda$ is a quadratic function of $\mu$. This function attains its minimum at location $\mu^o = \frac{\eta}{2(\beta^2 + e^2)}$. For any $\mu$, the value of $\lambda$ is no smaller than the minimum value of the function at $\mu^o$, i.e., it holds that

$\lambda \ \ge\ 1 - \frac{\eta^2}{4(\beta^2 + e^2)}$  (71)

Now, comparing relations (34) and (28), we find that the sub-gradient vector satisfies:

$\eta\,\|w - w^\star\| \ \le\ \|g(w) - g(w^\star)\| \ \le\ c\,\|w - w^\star\| + d$  (72)

which implies that $\eta \le c$ since the above inequality must hold for all $w$. It then follows from (50) that $e^2 = 2c^2 \ge 2\eta^2$ and from (71) that

$\lambda \ \ge\ 1 - \frac{\eta^2}{4(\beta^2 + e^2)} \ \ge\ 1 - \frac{\eta^2}{8\eta^2} \ =\ \frac{7}{8} \ >\ 0$  (73)

In other words, the parameter $\lambda$ is positive. Furthermore, some straightforward algebra using (67) shows that condition (53) implies $\lambda < 1$. We have therefore established that $0 < \lambda < 1$, as desired.

Returning to (69), we note that because the (negative) sub-gradient direction is not necessarily a descent direction, we cannot ensure that $b(i) \le b(i-1)$. However, we can still arrive at a useful conclusion by introducing a pocket variable, denoted by $b^{\text{best}}(i)$. This variable saves the smallest excess-risk value, $b(j)$, up to time $i$, i.e.,

$b^{\text{best}}(i) \ \triangleq\ \min_{0 \le j \le i}\ b(j)$  (74)

Let $\boldsymbol{w}_i^{\text{best}}$ denote the corresponding iterate, as already defined by (51).