Perceptron like Algorithms for Online Learning to Rank

08/04/2015 ∙ by Sougata Chaudhuri, et al. ∙ University of Michigan

Perceptron is a classic online algorithm for learning a classification function. In this paper, we provide a novel extension of the perceptron algorithm to the learning to rank problem in information retrieval. We consider popular listwise performance measures such as Normalized Discounted Cumulative Gain (NDCG) and Average Precision (AP). A modern perspective on perceptron for classification is that it is simply an instance of online gradient descent (OGD), during mistake rounds, using the hinge loss function. Motivated by this interpretation, we propose a novel family of listwise, large margin ranking surrogates. Members of this family can be thought of as analogs of the hinge loss. Exploiting a certain self-bounding property of the proposed family, we provide a guarantee on the cumulative NDCG (or AP) induced loss incurred by our perceptron-like algorithm. We show that, if there exists a perfect oracle ranker which can correctly rank each instance in an online sequence of ranking data, with some margin, the cumulative loss of perceptron algorithm on that sequence is bounded by a constant, irrespective of the length of the sequence. This result is reminiscent of Novikoff's convergence theorem for the classification perceptron. Moreover, we prove a lower bound on the cumulative loss achievable by any deterministic algorithm, under the assumption of existence of perfect oracle ranker. The lower bound shows that our perceptron bound is not tight, and we propose another, purely online, algorithm which achieves the lower bound. We provide empirical results on simulated and large commercial datasets to corroborate our theoretical results.

1 Introduction

Learning to rank (Liu, 2011) is a supervised learning problem where the output space consists of rankings of a set of objects. In the learning to rank problem that frequently arises in information retrieval, the objective is to rank documents associated with a query, in the order of the relevance of the documents for the given query. The accuracy of a ranked list, given actual relevance scores of the documents, is measured by various ranking performance measures, such as Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) and Average Precision (AP) (Baeza-Yates and Ribeiro-Neto, 1999). Since optimization of ranking measures during the training phase is computationally intractable, ranking methods are often based on minimizing surrogate losses that are easy to optimize.

The historical importance of the perceptron algorithm in the classification literature is immense (Rosenblatt, 1958; Freund and Schapire, 1999). Classically, the perceptron algorithm was not linked to surrogate minimization, but the modern perspective is to interpret it as online gradient descent (OGD), during mistake rounds, on the hinge loss function (Shalev-Shwartz, 2011). The hinge loss has special properties that allow one to establish bounds on the cumulative zero-one loss (viz., the total number of mistakes) in classification, without making any statistical assumptions on the data generating mechanism. Novikoff’s celebrated result about the perceptron (Novikoff, 1962) says that, if there is a perfect linear classification function which can correctly classify, with some margin, every instance in an online sequence, then the total number of mistakes made by the perceptron on that sequence is bounded. Moreover, unlike the standard OGD algorithm, the performance of the perceptron is independent of the learning rate parameter, which is a significant advantage because the learning rate does not need to be tuned.

Our work provides a novel extension of the perceptron algorithm to the learning to rank setting with a focus on two listwise ranking measures, NDCG and AP. Listwise measures are so named because the quality of a ranking function is judged on the entire list of documents associated with a query, usually with an emphasis on avoiding errors near the top of the ranked list. Specifically, we make the following contributions in this work.

  • We develop a family of listwise large margin ranking surrogates. The family consists of Lipschitz functions and is parameterized by a set of weight vectors that makes the surrogates adaptable to losses induced by the performance measures NDCG and AP. The family of surrogates is an extension of the hinge surrogate in classification that upper bounds the 0-1 loss. The family of surrogates has a special self-bounding property: the norm of the gradient of a surrogate can be bounded by the surrogate loss itself.

  • We exploit the self-bounding property of the surrogates to develop an online perceptron-like algorithm for learning to rank (Algorithm 1). We provide bounds on the cumulative NDCG and AP induced losses (Theorem 6). We prove that, if there is a perfect linear ranking function which can rank correctly, with some margin, every instance in an online sequence, our perceptron-like algorithm perfectly ranks all but a finite number of instances (Corollary 7). This implies that the cumulative loss induced by NDCG or AP is bounded by a constant, and our result can be seen as an extension of the classification perceptron mistake bound (Theorem 1). The performance of our perceptron algorithm, however, depends on a learning rate parameter, which is a disadvantage relative to the classification perceptron. Moreover, the bound depends linearly on the number of documents per query. In practice, during evaluation, NDCG is often cut off at a point which is much smaller than the number of documents per query. In that scenario, we prove that the cumulative NDCG loss of our perceptron is upper bounded by a constant which depends only on the cut-off point (Theorem 8).

  • We prove a lower bound, on the cumulative loss induced by NDCG or AP, that can be achieved by any deterministic online algorithm (Theorem 9) under a separability assumption. The lower bound is independent of the number of documents per query. We propose a second perceptron-like algorithm (Algorithm 1) which achieves the lower bound (Theorem 10), with performance independent of the learning rate parameter. However, the surrogate on which this perceptron-type algorithm operates is not listwise in nature and does not adapt to different performance measures. Thus, its empirical performance on real data is significantly worse than that of the first perceptron algorithm (Algorithm 1).

  • We provide empirical results on simulated as well as large scale benchmark datasets and compare the performance of our perceptron algorithm with the online version of the widely used ListNet learning to rank algorithm (Cao et al., 2007).

The rest of the paper is organized as follows. Section 2 provides formal definitions and notations related to the problem setting. Section 3 provides a review of the perceptron for classification, including the algorithm and its theoretical analysis. Section 4 introduces the family of listwise large margin ranking surrogates, and contrasts our surrogates with a number of existing large margin ranking surrogates in the literature. Section 5 introduces the perceptron algorithm for learning to rank, and discusses various aspects of the algorithm and the associated theoretical guarantee. Section 6 establishes a lower bound on NDCG/AP induced cumulative loss and introduces the second perceptron-like algorithm. Section 7 compares our work with existing perceptron algorithms for ranking. Section 8 provides empirical results on simulated and large scale benchmark datasets.

2 Problem Definition

In learning to rank, we formally denote the input space as $\mathcal{X} \subseteq \mathbb{R}^{m \times d}$. Each input consists of $m$ rows of document-query features represented as $d$ dimensional vectors. Each input corresponds to a single query and, therefore, the $m$ rows have features extracted from the same query but $m$ different documents. In practice $m$ changes from one input instance to another but we treat $m$ as a constant for ease of presentation. For $X \in \mathcal{X}$, $X = (x_1, \ldots, x_m)^\top$, where $x_i \in \mathbb{R}^d$ is the feature extracted from a query and the $i$th document associated with that query. The supervision space is $\mathcal{Y} \subseteq \{0, 1, \ldots, n\}^m$, representing relevance score vectors. If $n = 1$, the relevance vector is binary graded. For $n > 1$, the relevance vector is multi-graded. Thus, for $R \in \mathcal{Y}$, $R = (R_1, \ldots, R_m)^\top$, where $R_i$ denotes the relevance of the $i$th document to a given query. Hence, $R$ represents a vector and $R_i$, a scalar, denotes the $i$th component of the vector. Also, the relevance vector generated at time $t$ is denoted $R_t$, with $i$th component denoted $R_{t,i}$.

The objective is to learn a ranking function which ranks the documents associated with a query in such a way that more relevant documents are placed ahead of less relevant ones. The prevalent technique is to learn a scoring function and obtain a ranking by sorting the score vector in descending order. For $X \in \mathcal{X}$, a linear scoring function is $f_w(X) = Xw = s^w \in \mathbb{R}^m$, where $w \in \mathbb{R}^d$. The quality of the learnt ranking function is evaluated on a test query using various performance measures. We use two of the most popular performance measures in our paper, viz. NDCG and AP.

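NDCG, cut off at $k \le m$ for a query with $m$ documents, with relevance vector $R$ and score vector $s$ induced by a ranking function, is defined as follows:

$$\mathrm{NDCG}_k(s, R) = \frac{1}{Z_k(R)} \sum_{i=1}^{m} G(R_i)\, D(\pi_s^{-1}(i))\, \mathbb{1}(\pi_s^{-1}(i) \le k) \tag{1}$$

The shorthand representation of $\mathrm{NDCG}_k(s, R)$ is $\mathrm{NDCG}_k$. Here, $G(r) = 2^r - 1$, $D(i) = 1/\log_2(i+1)$, $Z_k(R) = \max_{\pi \in S_m} \sum_{i=1}^{m} G(R_i)\, D(\pi^{-1}(i))\, \mathbb{1}(\pi^{-1}(i) \le k)$. Further, $S_m$ represents the set of permutations over $m$ objects. $\pi_s = \mathrm{argsort}(s)$ is the permutation induced by sorting score vector $s$ in descending order (we use $\pi_s$ and $\mathrm{argsort}(s)$ interchangeably). A permutation $\pi$ gives a mapping from ranks to documents and $\pi^{-1}$ gives a mapping from documents to ranks. Thus, $\pi(i) = j$ means document $j$ is placed at position $i$ while $\pi^{-1}(i) = j$ means document $i$ is placed at position $j$. For $k = m$, we denote $\mathrm{NDCG}_m(s, R)$ as $\mathrm{NDCG}(s, R)$. The popular performance measure, Average Precision (AP), is defined only for binary relevance vectors, i.e., each component can only take values in $\{0, 1\}$:

$$\mathrm{AP}(s, R) = \frac{1}{r} \sum_{j:\, R_{\pi_s(j)} = 1} \frac{\sum_{i \le j} \mathbb{1}[R_{\pi_s(i)} = 1]}{j} \tag{2}$$

where $r = \|R\|_1$ is the total number of relevant documents.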

All ranking performance measures are actually gains. When we say “NDCG induced loss”, we mean a loss function that simply subtracts NDCG from its maximum possible value, which is 1 (same for AP).
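
The two measures described above can be computed directly from a score vector and a relevance vector. The following is a minimal sketch, not the authors' code: it assumes the common conventions of gain $2^r - 1$ and discount $1/\log_2(\text{rank} + 1)$, breaks score ties by document index, and returns 1 when a query has no relevant documents. The function names are hypothetical.

```python
import math

def argsort_desc(values):
    """Document indices sorted by descending value (ties broken by index)."""
    return sorted(range(len(values)), key=lambda i: (-values[i], i))

def dcg_at_k(ranking, relevance, k):
    """Discounted cumulative gain of a ranking, truncated at position k."""
    # enumerate() is 0-based, so rank = pos + 1 and the discount is 1/log2(rank + 1).
    return sum((2 ** relevance[doc] - 1) / math.log2(pos + 2)
               for pos, doc in enumerate(ranking[:k]))

def ndcg_at_k(scores, relevance, k=None):
    """NDCG cut off at k (defaults to the full list)."""
    k = len(scores) if k is None else k
    ideal = dcg_at_k(argsort_desc(relevance), relevance, k)
    if ideal == 0:
        return 1.0  # assumed convention: every ranking is perfect if nothing is relevant
    return dcg_at_k(argsort_desc(scores), relevance, k) / ideal

def average_precision(scores, relevance):
    """AP for a binary relevance vector."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(argsort_desc(scores), start=1):
        if relevance[doc] == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 1.0

# The induced losses are 1 - NDCG and 1 - AP.
print(1 - ndcg_at_k([3.0, 2.0, 1.0], [1, 0, 1]))
print(1 - average_precision([3.0, 2.0, 1.0], [1, 0, 1]))
```
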

3 Perceptron for Classification

We will first briefly review the perceptron algorithm for classification, highlighting the modern viewpoint that it executes online gradient descent (OGD) (Zinkevich, 2003) on hinge loss during mistake rounds and achieves a bound on total number of mistakes. This will allow us to directly compare and contrast our extension of perceptron to the learning to rank setting. For more details, we refer the reader to the survey written by Shalev-Shwartz (2011, Section 3.3).

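In classification, an instance is of the form $x \in \mathbb{R}^d$ and the corresponding supervision (label) is $y \in \{-1, 1\}$. A linear classifier is a scoring function $g_w(x) = w^\top x$, parameterized by $w \in \mathbb{R}^d$, producing score $s \in \mathbb{R}$. Classification of $x$ is obtained by using the “sign” predictor on $s$, i.e., $\mathrm{sign}(s) \in \{-1, 1\}$. The 0-1 loss is of the form: $\ell_{0\text{-}1}(w, (x, y)) = \mathbb{1}[y\,(w^\top x) \le 0]$. The hinge loss is defined as: $\ell_{\mathrm{hinge}}(w, (x, y)) = [1 - y\,(w^\top x)]_+$, where $[a]_+ = \max\{0, a\}$.

The perceptron algorithm operates on the loss $f_t$, defined on a sequence of data $\{(x_t, y_t)\}_{t \ge 1}$, produced by an adaptive adversary as follows:

$$f_t(w) = \begin{cases} [1 - y_t\,(w^\top x_t)]_+ & \text{if } \ell_{0\text{-}1}(w_t, (x_t, y_t)) = 1 \\ 0 & \text{if } \ell_{0\text{-}1}(w_t, (x_t, y_t)) = 0 \end{cases} \tag{3}$$

where $w_t$ is the learner’s move in round $t$. It is important to understand the concept of the loss $f_t$ and the adaptive adversary here. An adaptive adversary is allowed to choose $f_t$ at round $t$ based on the moves of the perceptron algorithm (Algorithm 1) up to that round. Once the learner fixes its choice $w_t$ at the end of step $t-1$, the adversary decides which function to play. It is either $[1 - y_t\,(w^\top x_t)]_+$ or 0, depending on whether $\ell_{0\text{-}1}(w_t, (x_t, y_t))$ is 1 or 0 respectively. Notice that $f_t$ is convex in both cases.

The perceptron updates a classifier $g_{w_t}$ (effectively updates $w_t$), in an online fashion. The update occurs by application of OGD on the sequence of functions $f_t$ in the following way: the perceptron initializes $w_1 = 0$ and uses the update rule $w_{t+1} = w_t - \eta z_t$, where $z_t \in \partial f_t(w_t)$ ($z_t$ is a subgradient) and $\eta$ is the learning rate (the importance of $\eta$ will be discussed at the end of the section). If $\ell_{0\text{-}1}(w_t, (x_t, y_t)) = 1$, then $f_t(w) = [1 - y_t\,(w^\top x_t)]_+$ with $1 - y_t\,(w_t^\top x_t) > 0$; hence $z_t = -y_t x_t$. Otherwise, $z_t = 0$. Thus,

$$w_{t+1} = \begin{cases} w_t + \eta\, y_t x_t & \text{if } \ell_{0\text{-}1}(w_t, (x_t, y_t)) = 1 \\ w_t & \text{otherwise} \end{cases} \tag{4}$$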

The perceptron algorithm for classification is described below:

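Learning rate $\eta > 0$, $w_1 = 0 \in \mathbb{R}^d$.
For $t = 1$ to $T$
Receive $x_t$.
Predict $p_t = \mathrm{sign}(w_t^\top x_t)$.
Receive $y_t$
If $\ell_{0\text{-}1}(w_t, (x_t, y_t)) = 1$
  $w_{t+1} = w_t + \eta\, y_t x_t$
else
  $w_{t+1} = w_t$
End For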
Algorithm 1 Perceptron Algorithm for Classification
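
The classification perceptron can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and a score of exactly zero is counted as a mistake (an assumption made so that the all-zero initial vector gets updated).

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron(examples, eta=1.0):
    """Online perceptron over (x, y) pairs with y in {-1, +1}.

    Returns the final weight vector and the number of mistake rounds.
    """
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    for x, y in examples:
        if y * dot(w, x) <= 0:  # mistake round (zero score counted as a mistake)
            # OGD step on the hinge loss: the subgradient is -y * x.
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            mistakes += 1
        # On correct rounds the played function is 0, so w is left unchanged.
    return w, mistakes

data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.2], 1), ([0.1, 1.0], -1)]
print(perceptron(data))
```
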
Theorem 1.

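Suppose that the perceptron algorithm for classification runs on an online sequence of data $\{(x_1, y_1), \ldots, (x_T, y_T)\}$ and let $R_x = \max_t \|x_t\|_2$. Let $f_t(\cdot)$ be defined as in Eq. 3. For all $u \in \mathbb{R}^d$, and with the learning rate $\eta$ chosen to optimize the bound, the perceptron mistake bound is:

$$\sum_{t=1}^{T} \ell_{0\text{-}1}(w_t, (x_t, y_t)) \le \sum_{t=1}^{T} f_t(u) + R_x \|u\|_2 \sqrt{\sum_{t=1}^{T} f_t(u)} + R_x^2 \|u\|_2^2 \tag{5}$$

In the special case where there exists $u$ s.t. $f_t(u) = 0$, $\forall t$, we have

$$\sum_{t=1}^{T} \ell_{0\text{-}1}(w_t, (x_t, y_t)) \le R_x^2 \|u\|_2^2 \tag{6}$$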

As can be clearly seen from Eq. 5, the cumulative loss bound (i.e., total number of mistakes over $T$ rounds) is upper bounded in terms of the cumulative sum of the functions $f_t(\cdot)$. In the special case where there exists a perfect linear classifier with margin, Eq. 6 shows that the total number of mistakes is bounded, regardless of the number of instances.

One drawback of the bound in Eq. 6 is that the concept of margin is not explicit, i.e., it is hidden in the norm of the parameter of the perfect classifier ($\|u\|_2$). Let us assume that there is a linear classifier parameterized by a unit norm vector $u_\star$, such that all instances $x_t$ are not only correctly classified, but correctly classified with a margin $\gamma > 0$, defined as:

$$y_t\,(u_\star^\top x_t) \ge \gamma, \quad \forall t \tag{7}$$

It is easy to see that the scaled vector $u = u_\star / \gamma$, whose norm is $1/\gamma$, will satisfy $f_t(u) = 0$ for all $t$. Therefore, we have the following corollary.

Corollary 2.

If the margin condition (7) holds, then the total number of mistakes is upper bounded by $R_x^2 / \gamma^2$, a bound independent of the number of instances in the online sequence.

Importance of learning rate parameter $\eta$: The prediction at round $t$ is $\mathrm{sign}(w_t^\top x_t)$. Let $M_t$ indicate the rounds, up to time point $t$, where the perceptron made a mistake. Starting from $w_1 = 0$ and unraveling the update rule, we get $w_{t+1} = \eta \sum_{i \in M_t} y_i x_i$. It can be easily seen that $\mathrm{sign}(w_{t+1}^\top x_{t+1})$ is invariant to the value of $\eta$, for $\eta > 0$. Hence, the actual performance of the perceptron algorithm (in terms of total number of mistakes) is independent of the learning rate and thus, $\eta$ can be fixed from the beginning of the algorithm. The reason for including $\eta$ in the algorithm is that in the subsequent analysis (Theorem 1), the perceptron loss bound uses standard regret analysis of OGD, where the optimal regret bound is established by optimizing over the learning rate $\eta$. So, though the performance is actually independent of $\eta$, optimization over $\eta$ is necessary to establish the optimal theoretical upper bound on the loss.
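
The invariance to the learning rate can be checked numerically. The sketch below is illustrative only: the data are made up, the helper name is hypothetical, and the learning rates are powers of two so that the scaling is exact in floating point. It records the mistake rounds of the classification perceptron for several learning rates and confirms they coincide.

```python
def run_perceptron(examples, eta):
    """Return the (1-based) rounds on which the perceptron makes a mistake."""
    w = [0.0] * len(examples[0][0])
    mistake_rounds = []
    for t, (x, y) in enumerate(examples, start=1):
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:  # zero score counted as a mistake (assumption)
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            mistake_rounds.append(t)
    return mistake_rounds

# Made-up toy sequence of (features, label) pairs.
examples = [([1.0, 2.0], 1), ([2.0, -1.0], -1), ([0.5, 1.5], 1),
            ([3.0, 0.5], -1), ([-1.0, 1.0], 1)]

for eta in (0.25, 1.0, 8.0):
    # w is always eta times the same sum of y_i * x_i, so signs never change.
    print(eta, run_perceptron(examples, eta))
```
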

4 A Novel Family of Listwise Surrogates

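We define the novel SLAM family of loss functions: these are Surrogate, Large margin, Listwise and Lipschitz losses, Adaptable to multiple performance measures, and can handle Multiple graded relevance. For score vector $s \in \mathbb{R}^m$, and relevance vector $R \in \mathcal{Y}$, the family of convex loss functions is defined as:

$$\phi^v_{SLAM}(s, R) = \min_{\delta \in \mathbb{R}^m} \sum_{i=1}^{m} v_i \delta_i \quad \text{s.t.} \quad \delta_i \ge 0,\ \forall i, \qquad s_i + \delta_i \ge \Delta + s_j \ \text{ if } R_i > R_j,\ \forall i, j \tag{8}$$

The constant $\Delta$ denotes margin and $v = (v_1, \ldots, v_m)$ is an element-wise non-negative weight vector. Different vectors $v$, to be defined later, yield different members of the SLAM family. Though $\Delta$ can be varied for empirical purposes, we fix $\Delta = 1$ for our analysis. The intuition behind the loss setting is that scores associated with more relevant documents should be higher, with a margin, than scores associated with less relevant documents. The weights decide how much weight to put on the errors.

The following reformulation of $\phi^v_{SLAM}(s, R)$ will be useful in later derivations.

$$\phi^v_{SLAM}(s, R) = \sum_{i=1}^{m} v_i \max\Big(0,\ \max_{j = 1, \ldots, m} \big\{ \mathbb{1}(R_i > R_j)\,(1 + s_j - s_i) \big\}\Big) \tag{9}$$
Lemma 3.

For any relevance vector $R$, the function $\phi^v_{SLAM}(\cdot, R)$ is convex.

Proof.

The claim is obvious from the representation given in Eq. 9: each term is a maximum of affine functions of $s$, and the weights $v_i$ are non-negative. ∎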
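
To make the hinge-like form concrete, here is a minimal sketch of a SLAM-style loss. It assumes the loss is a weighted sum, over documents, of the largest margin violation against any strictly less relevant document, which is the behaviour described in the text. The function name is hypothetical, and the weights in the usage lines are placeholders, not the AP or NDCG weight vectors of Section 4.1.

```python
def slam_loss(scores, relevance, weights, margin=1.0):
    """Weighted large-margin listwise loss.

    For each document i, find the largest violation margin + s_j - s_i over
    documents j that are strictly less relevant than i, clip it at zero, and
    weight it by weights[i].
    """
    total = 0.0
    for s_i, r_i, v_i in zip(scores, relevance, weights):
        violations = [margin + s_j - s_i
                      for s_j, r_j in zip(scores, relevance) if r_i > r_j]
        # default=0.0 handles documents with no strictly less relevant peer.
        total += v_i * max(0.0, max(violations, default=0.0))
    return total

placeholder_weights = [1.0, 0.0]  # illustrative only
print(slam_loss([2.0, 0.0], [1, 0], placeholder_weights))  # ordered with margin
print(slam_loss([0.5, 0.0], [1, 0], placeholder_weights))  # ordered, inside margin
print(slam_loss([0.0, 2.0], [1, 0], placeholder_weights))  # misordered
```

Like the hinge loss, the value is zero only when every more relevant document beats every less relevant one by at least the margin, and it grows linearly with the size of the violation.
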

4.1 Weight Vectors Parameterizing the SLAM Family

As we stated after Eq. 8, different weight vectors lead to different members of the SLAM family. The weight vectors play a crucial role in the subsequent theoretical analysis. We will provide two weight vectors, $v^{AP}$ and $v^{NDCG}$, that result in upper bounds for AP and NDCG induced losses respectively. Later, we will discuss the necessity of choosing such weight vectors.

Since the losses in the SLAM family are calculated with the knowledge of the relevance vector $R$, for ease of subsequent derivations, we can assume, without loss of generality, that documents are sorted according to their relevance levels. Thus, we assume that $R_1 \ge R_2 \ge \ldots \ge R_m$, where $R_i$ is the relevance of document $i$. Note that both $v^{AP}$ and $v^{NDCG}$ depend on the relevance vector $R$ but we hide that dependence in the notation to reduce clutter.

Weight vector for AP loss: Let $R \in \{0, 1\}^m$ be a binary relevance vector. Let $r$ be the number of relevant documents (thus, $R_1 = \ldots = R_r = 1$ and $R_{r+1} = \ldots = R_m = 0$). We define vector $v^{AP} \in \mathbb{R}^m$ as

(10)

Weight vector for NDCG loss: For a given relevance vector $R$, we define vector $v^{NDCG} \in \mathbb{R}^m$ as

(11)

Note: Both weights ensure that (since ). Using the weight vectors, we have the following upper bounds.

Theorem 4.

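Let $v^{AP}$ and $v^{NDCG}$ be the weight vectors as defined in Eq. (10) and Eq. (11) respectively. Let $\mathrm{AP}(s, R)$ and $\mathrm{NDCG}(s, R)$ be the AP value and NDCG value determined by relevance vector $R$ and score vector $s$. Then, the following inequalities hold, $\forall s$, $\forall R$,

$$\phi^{v^{AP}}_{SLAM}(s, R) \ge 1 - \mathrm{AP}(s, R), \qquad \phi^{v^{NDCG}}_{SLAM}(s, R) \ge 1 - \mathrm{NDCG}(s, R) \tag{12}$$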

The proof of the theorem is in Appendix A.

4.2 Properties of SLAM Family and Upper Bounds

We discuss some of the properties of the SLAM family and related upper bounds.

Listwise Nature of SLAM Family: The critical property for a surrogate to be considered listwise is that the loss must be calculated over the entire list of documents as a whole, with errors at the top penalized more than errors at the bottom. Since a perfect ranking places the most relevant documents at the top, errors corresponding to the most relevant documents should be penalized more in SLAM in order for it to be considered a listwise family. Both $v^{AP}$ and $v^{NDCG}$ have the property that the more relevant documents get more weight.

Upper Bounds on NDCG and AP: By Theorem 4, the weight vectors make losses in the SLAM family upper bounds on NDCG and AP induced losses. The SLAM loss family is analogous to the hinge loss in classification. Similar to the hinge loss, the surrogate losses of the SLAM family are 0 when the predicted scores respect the relevance labels (with some margin). The upper bound property will be crucial in deriving guarantees for a perceptron-like algorithm in learning to rank. Like the hinge loss, the upper bounds can possibly be loose in some cases, but, as we show next, the upper bounding weights make the SLAM family Lipschitz continuous with a small Lipschitz constant. This naturally restricts SLAM losses from growing too quickly. Empirically, we will show that the perceptron developed based on the SLAM family produces competitive performance on large scale industrial datasets. Along with the theory, the empirical performance supports the fact that the upper bounds are quite meaningful.
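Lipschitz Continuity of SLAM: Lipschitz continuity of an arbitrary loss $\ell(s, R)$, w.r.t. $s$ in $\ell_2$ norm, means that there is a constant $L_2$ such that $|\ell(s_1, R) - \ell(s_2, R)| \le L_2 \|s_1 - s_2\|_2$, for all $s_1, s_2 \in \mathbb{R}^m$. By duality, it follows that $L_2 \ge \sup_{s} \|\nabla_s \ell(s, R)\|_2$. We calculate $L_2$ as follows:
Let $b_{ij} = \mathbb{1}(R_i > R_j)\,(1 + s_j - s_i)$. The sub-gradient of $\phi^v_{SLAM}(s, R)$, w.r.t. $s$, from Eq. (9), is: $\nabla_s \phi^v_{SLAM}(s, R) = \sum_{i=1}^{m} v_i a^i$, where

$$a^i = \begin{cases} 0 \in \mathbb{R}^m & \text{if } \max_{j} b_{ij} \le 0 \\ e_k - e_i \in \mathbb{R}^m & \text{if } \max_{j} b_{ij} > 0,\ k = \arg\max_{j} b_{ij} \end{cases} \tag{13}$$

and $e_i$ is a standard basis vector along coordinate $i$.

Since $\|a^i\|_1 \le 2$, it is easy to see that $\|\nabla_s \phi^v_{SLAM}(s, R)\|_1 \le 2 \|v\|_1$. Since the $\ell_1$ norm dominates the $\ell_2$ norm, $\phi^v_{SLAM}(s, R)$ is Lipschitz continuous in $\ell_2$ norm whenever we can bound $\|v\|_1$. It is easy to check that $\|v^{AP}\|_1 \le 1$ and $\|v^{NDCG}\|_1 \le 1$. Hence, $v^{AP}$ and $v^{NDCG}$ induce Lipschitz continuous surrogates, with Lipschitz constant at most 2.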
Comparison with Surrogates Derived from Structured Prediction Framework: We briefly highlight the difference between SLAM and listwise surrogates obtained from the structured prediction framework (Chapelle et al., 2007; Yue et al., 2007; Chakrabarti et al., 2008). Structured prediction for ranking models assume that the supervision space is the space of full rankings of a document list. Usually a large number of full rankings are compatible with a relevance vector, in which case the relevance vector is arbitrarily mapped to a full ranking. In fact, here is a quote from one of the relevant papers (Chapelle et al., 2007), “It is often the case that this is not unique and we simply take of one of them at random” ( refers to a correct full ranking pertaining to query ). Thus, all but one correct full ranking will yield a loss. In contrast, in SLAM, documents with same relevance level are essentially exchangeable (see Eq. (9)). Thus, our assumption that documents are sorted according to relevance during design of weight vectors is without arbitrariness, and there will be no change in the amount of loss when documents within same relevance class are compared.

5 Perceptron-like Algorithms

We present a perceptron-like algorithm for learning a ranking function in an online setting, using the SLAM family. Since our proposed perceptron-like algorithm works for both NDCG and AP induced losses, for derivation purposes, we denote a performance measure induced loss as RankingMeasureLoss (RML). Thus, RML can be the NDCG induced loss or the AP induced loss.

Informal Definition: The algorithm works as follows. At time $t$, the learner maintains a linear ranking function, parameterized by $w_t$. The learner receives $X_t$, which is the document list retrieved for query $q_t$, and ranks it. Then the ground truth relevance vector $R_t$ is received and the ranking function is updated according to the perceptron rule.

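Let $b^t_{ij} = \mathbb{1}(R_{t,i} > R_{t,j})\,(1 + s^w_{t,j} - s^w_{t,i})$. For subsequent ease of derivations, we write the SLAM loss from Eq. (9) as: $\phi^v_{SLAM}(s^w_t, R_t) = \sum_{i=1}^{m} v_i c^t_i$, where

$$c^t_i = \max\Big(0,\ \max_{j = 1, \ldots, m} b^t_{ij}\Big) \tag{14}$$

and $s^w_t = X_t w$.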

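Like the classification perceptron, our perceptron-like algorithm operates on the loss $f_t$, defined on a sequence of data $\{(X_t, R_t)\}_{t \ge 1}$, produced by an adaptive adversary (i.e., an adversary who can see the learner’s move before making its move) as follows:

$$f_t(w) = \begin{cases} \phi^{v_t}_{SLAM}(s^w_t, R_t) & \text{if } \mathrm{RML}(s^{w_t}_t, R_t) \ne 0 \\ 0 & \text{if } \mathrm{RML}(s^{w_t}_t, R_t) = 0 \end{cases} \tag{15}$$

Here, $s^w_t = X_t w$ and $v_t = v^{AP}_t$ or $v^{NDCG}_t$ depending on whether RML is the AP or NDCG induced loss. Since the weight vector $v$ depends on the relevance vector $R$ (Eq. (10), (11)), the subscript $t$ in $v_t$ denotes the dependence on $R_t$. Moreover, $w_t$ is the parameter produced by our perceptron (Algorithm 1) at the end of step $t-1$, with the adaptive adversary being influenced by the move of the perceptron (recall Eq. 3 and the discussion thereafter).

It is clear from Theorem 4 and Eq. (15) that $f_t(w_t) \ge \mathrm{RML}(s^{w_t}_t, R_t)$. It should also be noted that $f_t$ is convex in either of the two cases. Thus, we can run the online gradient descent (OGD) algorithm (Zinkevich, 2003) to learn the sequence of parameters $w_t$, starting with $w_1 = 0$. The OGD update rule, $w_{t+1} = w_t - \eta z_t$, for some $z_t \in \partial f_t(w_t)$ and step size $\eta$, requires a subgradient $z_t$ that, in our case, is computed as follows. When $\mathrm{RML}(s^{w_t}_t, R_t) = 0$, we have $z_t = 0$. When $\mathrm{RML}(s^{w_t}_t, R_t) \ne 0$, we have

$$z_t = X_t^\top \Big( \sum_{i=1}^{m} v_{t,i}\, a^{t,i} \Big), \qquad a^{t,i} = \begin{cases} 0 \in \mathbb{R}^m & \text{if } c^t_i = 0 \\ e_k - e_i \in \mathbb{R}^m & \text{if } c^t_i \ne 0,\ k = \arg\max_{j} b^t_{ij} \end{cases} \tag{16}$$

where $e_k$ is the standard basis vector along coordinate $k$ and $c^t_i$ is as defined in Eq. (14) (with $w = w_t$).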

We now obtain a perceptron-like algorithm for the learning to rank problem.

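Learning rate $\eta > 0$, $w_1 = 0 \in \mathbb{R}^d$.
For $t = 1$ to $T$
Receive $X_t$ (document list for query $q_t$).
Set $s^{w_t}_t = X_t w_t$, predicted ranking output = $\mathrm{argsort}(s^{w_t}_t)$.
Receive $R_t$
If $\mathrm{RML}(s^{w_t}_t, R_t) \ne 0$ // Note: $f_t(w) = \phi^{v_t}_{SLAM}(s^w_t, R_t)$
  $w_{t+1} = w_t - \eta z_t$ // $z_t$ is defined in Eq. (16)
else
  $w_{t+1} = w_t$
End For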
Algorithm 1 Perceptron Algorithm for Learning to Rank
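
The ranking perceptron described above can be sketched as follows. This is a hedged illustration, not the authors' implementation. Several pieces are assumptions: `uniform_weights` is a hypothetical placeholder standing in for the AP and NDCG weight vectors of Section 4.1, and a round is treated as a mistake whenever some more relevant document fails to score strictly above a less relevant one, which is used here as a proxy for a non-zero NDCG or AP induced loss.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def uniform_weights(relevance):
    """Hypothetical placeholder: equal weight on documents above the lowest grade."""
    low = min(relevance)
    n = sum(1 for r in relevance if r > low)
    return [1.0 / n if r > low else 0.0 for r in relevance] if n else [0.0] * len(relevance)

def slam_subgradient(X, scores, relevance, weights, margin=1.0):
    """Subgradient w.r.t. w of sum_i v_i * max(0, max_{j: R_i > R_j}(margin + s_j - s_i))."""
    g = [0.0] * len(X[0])
    for i, v_i in enumerate(weights):
        candidates = [(margin + scores[j] - scores[i], j)
                      for j in range(len(X)) if relevance[i] > relevance[j]]
        if not candidates:
            continue
        violation, k = max(candidates)
        if violation > 0:  # active hinge term contributes v_i * (x_k - x_i)
            g = [gd + v_i * (xk - xi) for gd, xk, xi in zip(g, X[k], X[i])]
    return g

def is_misranked(scores, relevance):
    """Proxy for a non-zero ranking loss: some pair is out of order or tied."""
    n = len(scores)
    return any(relevance[i] > relevance[j] and scores[i] <= scores[j]
               for i in range(n) for j in range(n))

def ranking_perceptron(stream, dim, eta=1.0, weight_fn=uniform_weights):
    """Update w by an OGD step on the surrogate only on rounds with an imperfect ranking."""
    w = [0.0] * dim
    mistakes = 0
    for X, relevance in stream:
        scores = [dot(x, w) for x in X]
        if is_misranked(scores, relevance):
            z = slam_subgradient(X, scores, relevance, weight_fn(relevance))
            w = [wd - eta * zd for wd, zd in zip(w, z)]
            mistakes += 1
    return w, mistakes

query_a = ([[2.0, 0.0], [0.0, 1.0]], [1, 0])
query_b = ([[0.0, 2.0], [1.0, 0.0]], [0, 1])
print(ranking_perceptron([query_a, query_b, query_a], dim=2))
```

Unlike the classification case, the subgradient depends on which hinge terms are active under the current scores, so the sequence of updates is not simply a rescaling when the learning rate changes.
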

5.1 Bound on Cumulative Loss

We provide a theoretical bound on the cumulative loss (as measured by RML) of perceptron for the learning to rank problem. The technique uses regret analysis of online convex optimization algorithms. We state the standard OGD bound used to get our main theorem (Zinkevich, 2003). An important thing to remember is that OGD guarantee holds for convex functions played by an adaptive adversary, which is important for an OGD based analysis of the perceptron algorithm.

Proposition (OGD regret).

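Let $f_1, \ldots, f_T$ be a sequence of convex functions. The update rule of the function parameter is $w_{t+1} = w_t - \eta z_t$, where $z_t \in \partial f_t(w_t)$ and $w_1 = 0$. Then for any $u \in \mathbb{R}^d$, the following regret bound holds after $T$ rounds,

$$\sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(u) \le \frac{\|u\|_2^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|z_t\|_2^2 \tag{17}$$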

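We first control the norm of the subgradient $z_t$, defined in Eq. (16). To do this, we will need to use the $p \to q$ norm of a matrix.

Definition ($p \to q$ norm).

Let $A \in \mathbb{R}^{m \times n}$ be a matrix. The $p \to q$ norm of $A$ is: $\|A\|_{p \to q} = \max_{v \ne 0} \frac{\|A v\|_q}{\|v\|_p}$.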

Lemma 5.

Let be the bound on the maximum norm of the feature vectors representing the documents. Let with , and be bound on number of documents per query. Then we have the following norm bound,

(18)
Proof.

For a mistake round , we have from Eq. (16) .
1st bound for :

The first inequality uses the norm and last inequality holds because and .

2nd bound for (The self-bounding property of SLAM is being used here, to bound the norm of gradient by loss itself):
We note that in a mistake round, . Thus, there is at least 1 pair of documents whose ranks are inconsistent with their relevance levels. Mathematically,

Now, (Eq. (14) ). For , we have .

Since , document has strictly greater than minimum possible relevance, i.e., . By our calculations of weight vector for both NDCG and AP, we have .

Thus, by definition, (since and and with ).

Then, , . Thus, we have:

It follows that .

Combining 1st and 2nd bound for , we get , for mistake rounds.

Since, for non-mistake rounds, we have and , we get the final inequality.

Taking , we have the following theorem, which uses the norm bound on :

Theorem 6.

Suppose Algorithm 1 receives a sequence of instances and let be the bound on the maximum norm of the feature vectors representing the documents. Then the following inequality holds, after optimizing over learning rate , :

(19)

In the special case where there exists s.t. , , we have

(20)
Proof.

The proof follows by plugging in expression for (Lemma 5) in OGD equation (Prop. OGD Regret), optimizing over , using the algebraic trick: and then using the inequality . ∎

Note: The perceptron bound, in Eq. 19, is a loss bound, i.e., the left hand side is the cumulative NDCG/AP induced loss while the right side is a function of the cumulative surrogate loss. We discuss the significance of this bound in detail later.

Like perceptron for binary classification, the constant in Eq. 19 needs to be expressed in terms of a “margin”. A natural definition of margin in case of ranking data is as follows: let us assume that there is a linear scoring function parameterized by a unit norm vector , such that all documents for all queries are ranked not only correctly, but correctly with a margin :

(21)
Corollary 7.

If the margin condition (21) holds, then total loss, for both NDCG and AP induced loss, is upper bounded by , a bound independent of the number of instances in the online sequence.

Proof.

Fix a and the example . Set . For this , we have

which means that

This immediately implies that , . Therefore, and hence . Since this holds for all , we have .

5.1.1 Perceptron Bound-General Discussion

We remind once again that is either or 1, depending on measure of interest.

Importance of learning rate parameter : Like the classification perceptron, Algorithm 1 also has the learning rate parameter embedded, and the optimal upper bound on loss is obtained by optimizing over . However, unlike the classification perceptron, the performance is not independent of . The prediction at each round is the ranking obtained from the sorted order of scores, i.e., . Let indicate the rounds, up to time point , where the algorithm did not produce a perfect ranking. Starting from , unraveling , we get . Now, had been independent of , then , which is the sorted order of the score vector, would have been independent of the scaling factor . However, each is dependent on implicitly (Eq. 16), which themselves are dependent on (recall for the classification perceptron, , i.e., independent of during mistake round ). To clarify, we consider, during a mistake round, two score vectors and , where . Had the subgradient , during a mistake round, been indeed independent of (and hence score ), then would have been the same for both and . However, this is not the case. To see this, note that (Eq. 14), for some , can be 0 for but non-zero for , depending on the value of , which affects the gradient.

Dependence of perceptron bound on number of documents per query: The perceptron bound in Eq. 19 is meaningful only if is a finite quantity.

For AP, it can be seen from the definition of in Eq. 10 that . Thus, for AP induced loss, the constant in the perceptron bound is: .

For NDCG, depends on maximum relevance level. Assuming maximum relevance level is finite (in practice, maximum relevance level is usually below ), . Thus, for NDCG induced loss, the constant in the perceptron bound is: .

Significance of perceptron bound: The main perceptron bound is given in Eq. 19, with the special case being captured in Corollary 7. At first glance, the bound might seem non-informative because the left side is the cumulative NDCG/AP induced loss bound, while the right side is a function of the cumulative surrogate loss.

The first thing to note is that the perceptron bound is derived from the regret bound in Eq. 17, which is the well-known regret bound of the OGD algorithm applied to an arbitrary convex, Lipschitz surrogate. So, even ignoring the bound in Eq. 19, the perceptron algorithm is a valid online algorithm, applied to the sequence of convex functions , to learn ranking function , with a meaningful regret bound. Second, as we had mentioned in the introduction, our perceptron bound is the extension of the perceptron bound in classification to the cumulative NDCG/AP induced losses in the learning to rank setting. This can be observed by noticing the similarity between Eq. 19 and Eq. 5. In both cases, the cumulative target loss on the left is bounded by a function of the cumulative surrogate loss on the right, where the surrogate is the hinge (and hinge-like SLAM) loss.

The interesting aspects of the perceptron loss bound become apparent on close investigation of the cumulative surrogate loss term and comparison with the regret bound. It is well known that when OGD is run on any convex, Lipschitz surrogate, the guarantee on the regret scales at the rate . So, if we only ran OGD on an arbitrary convex, Lipschitz surrogate, then, even with the assumption of existence of a perfect ranker, the upper bound on the cumulative loss would have scaled as . However, in the perceptron loss bound, if , then the upper bound on the cumulative loss would scale as , which can be much better than for . In the best case of , the total cumulative loss would be bounded, irrespective of the number of instances.

Comparison and contrast with perceptron for classification: The perceptron for learning to rank is an extension of the perceptron for classification, both in terms of the algorithm and the loss bound. To obtain the perceptron loss bounds in the learning to rank setting, we had to address multiple non-trivial issues, which do not arise in the classification setting. Unlike in classification, the NDCG/AP losses are not $\{0, 1\}$-valued. The analysis is trivial for the classification perceptron since, on a mistake round, the absolute value of the gradient of the hinge loss is 1, which is the same as the 0-1 loss itself. In our setting, Lemma 5 is crucial, where we exploit the structure of the SLAM surrogate to bound the square of the gradient norm by the surrogate loss.

5.1.2 Perceptron Bound Dependent On NDCG Cut-Off Point

The bound on the cumulative loss in Eq. (19) is dependent on $m$, the maximum number of documents per query. It is often the case in learning to rank that though a list has $m$ documents, the focus is on the top $k$ documents ($k \ll m$) in the order sorted by score. The measure used for the top-$k$ documents is $\mathrm{NDCG}_k$ (Eq. 1) (there does not exist an equivalent definition for AP).

We consider a modified set of weights s.t. holds , for every . We provide the definition of later in the proof of Theorem 8.

Overloading notation with , let with 0, 0 and .

Theorem 8.

Suppose the perceptron algorithm receives a sequence of instances . Let be the cut-off point of NDCG. Also, for any , let be as defined in Eq. (15), but with . Then, the following inequality holds, after optimizing over learning rate ,

(22)

In the special case where there exists s.t. , , we have