(Stochastic) gradient descent and the multiplicative update method are probably the most popular algorithms in machine learning. We introduce and study a new regularization which provides a unification of the additive and multiplicative updates. This regularization is derived from a hyperbolic analogue of the entropy function, which we call hypentropy. It is motivated by a natural extension of the multiplicative update to negative numbers. The hypentropy has a natural spectral counterpart which we use to derive a family of matrix-based updates that bridge gradient methods and the multiplicative method for matrices. While the latter is only applicable to positive semi-definite matrices, the spectral hypentropy method can naturally be used with general rectangular matrices. We analyze the new family of updates by deriving tight regret bounds. We study empirically the applicability of the new update for settings such as multiclass learning, in which the parameters constitute a general rectangular matrix.


## 1 Introduction

Algorithms for online learning can morally be divided into two camps. On one side is the additive gradient update. Additive gradient-based stochastic methods are the most commonly used approach for learning the parameters of shallow and deep models alike. On the other side stands the multiplicative update method. It is somewhat less glamorous, but nonetheless a fundamental primitive in game theory and machine learning, and was rediscovered repeatedly in a variety of algorithmic settings (Arora et al., 2012). Both additive and multiplicative updates can be seen as special cases of a more general technique of learning with regularization. General frameworks for regularization were developed in online learning, dubbed Follow-The-Regularized-Leader, and in convex optimization, as the Mirror Descent algorithm; see more below.

Notable attempts were made to unify different regularization techniques, in particular the multiplicative and additive update methods Kivinen and Warmuth (1997). For example, AdaGrad Duchi et al. (2011) stemmed from a theoretical study of learning the best regularization in hindsight. As the name implies, the $p$-norm update Grove et al. (2001); Gentile (2003) uses the squared $p$-norm of the parameters as a regularization. By varying the order of the norm, it interpolates between regret bounds that are characteristic of additive and multiplicative updates.

We study a new, arguably more natural, family of regularizations which “interpolates” between additive and multiplicative forms. We analyze its performance both experimentally and theoretically, obtaining tight regret bounds in the online learning paradigm. The motivation for this interpolation stems from the extension of the multiplicative update to negative weights. Instead of using the so-called “EG trick”, a term coined by Warmuth, which simulates arbitrary weights through duplication into positive and negative components, we use a direct approach. To do so we introduce the hyperbolic regularization with a single temperature-like hyperparameter $\beta$. Varying the hyperparameter yields regret bounds that translate between those akin to additive and multiplicative update rules.

As a natural next step, we investigate the spectral analogue of the hypentropy function. We show that the spectral hypentropy is strongly convex with respect to the Euclidean or trace norms, again as a function of the single interpolation parameter. The spectral hypentropy yields updates that can be viewed as an interpolation between the gradient descent rule and the matrix multiplicative update Tsuda et al. (2005); Arora and Kale (2007).

The standard matrix multiplicative update rule applies only to positive semi-definite matrices. Standard extensions to square and more general matrices increase the dimensionality Hazan et al. (2012), and the resulting regret bounds degrade accordingly. In contrast, the spectral hypentropy regularization is defined for arbitrary rectangular matrices. Moreover, the hypentropy-based update results in better regret bounds, matching the best known bounds in Kakade et al. (2012).

#### Related work

For background on the multiplicative update method and its use in machine learning and algorithmic design, see Arora et al. (2012). The matrix version of the multiplicative update method was proposed in Tsuda et al. (2005) and later in Arora and Kale (2007). The study of the interplay between additive and multiplicative updates was initiated in the influential paper of Kivinen and Warmuth (1997). Generalizations of multiplicative updates to negative weights were studied in the context of the Winnow algorithm and mistake bounds in Warmuth (2007); Grove et al. (2001). The latter paper also introduced the $p$-norm algorithm, which was further developed in Gentile (2003). The generalization of the $p$-norm regularization to matrices was studied in Kakade et al. (2012).

#### Organization of paper

HU and SHU are mirror descent algorithms using the hypentropy and spectral hypentropy regularization functions, defined in Sec. 3 and Sec. 4 respectively. These sections explore the geometric properties of the new regularization functions and provide regret analyses. Experimental results which underscore the applicability of HU and SHU are described in Sec. 5. A thorough description of mirror descent is given for completeness in App. A. The view of HU as an adaptive variant of EG is explored in App. B.

## 2 Problem Setting

#### Notation.

Vectors are denoted by bold-face letters, e.g. $\mathbf{w}$. The zero vector and the all-ones vector are denoted by $\mathbf{0}$ and $\mathbf{1}$ respectively. We denote the ball of radius $r$ with respect to the $p$-norm in $\mathbb{R}^d$ as $B_p(r)$, and write $B_p$ for the unit ball $B_p(1)$. For simplicity of the presentation, in the sequel we assume that weights are confined to the unit ball. Our results generalize straightforwardly to arbitrary radii.

Matrices are denoted by capitalized bold-face letters, e.g. $\mathbf{X}$. We denote the space of real matrices of size $m \times n$ as $\mathbb{R}^{m\times n}$ and symmetric matrices of size $d \times d$ as $\mathbb{S}^d$. For a matrix $\mathbf{X} \in \mathbb{R}^{m\times n}$, we denote the vector of singular values as $\sigma(\mathbf{X})$, where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min\{m,n\}} \ge 0$. Analogously, for $\mathbf{X} \in \mathbb{S}^d$, we denote the vector of its eigenvalues as $\lambda(\mathbf{X})$. We use $\|\mathbf{X}\|_p$ to represent the Schatten $p$-norm of a matrix, namely, the $p$-norm of the vector of singular values. We refer to the Schatten norm for $p=1$ as the trace-norm; in this notation the spectral norm of $\mathbf{X}$ is $\|\mathbf{X}\|_\infty$. We denote the ball of radius $\tau$ with respect to the trace-norm as $B_{\mathrm{tr}}(\tau)$. We also use the intersection of a ball with the positive orthant, denoted by a superscript $+$. We use $[k]$ to denote $\{1,\dots,k\}$, and $\|\cdot\|_*$ to denote the dual norm of $\|\cdot\|$.

#### Online Convex Optimization.

In online convex optimization Cesa-Bianchi and Lugosi (2006); Hazan (2016); Shalev-Shwartz (2012), a learner iteratively chooses a vector from a convex set $\mathcal{K}$. We denote the total number of rounds as $T$. In each round, the learner commits to a choice $\mathbf{w}_t \in \mathcal{K}$. After committing to this choice, a convex loss function $\ell_t : \mathcal{K} \to \mathbb{R}$ is revealed and the learner incurs a loss $\ell_t(\mathbf{w}_t)$. The most common performance objective of an online learning algorithm is regret, defined to be the total loss incurred by the algorithm with respect to the loss of the best fixed single prediction found in hindsight. Formally, the regret of a learning algorithm $\mathcal{A}$ is defined as,

$$\mathrm{regret}_T(\mathcal{A}) \;\stackrel{\mathrm{def}}{=}\; \sup_{\ell_1,\dots,\ell_T}\left\{\sum_{t=1}^{T}\ell_t(\mathbf{w}_t) \;-\; \min_{\mathbf{w}^*\in\mathcal{K}}\sum_{t=1}^{T}\ell_t(\mathbf{w}^*)\right\}\,.$$

## 3 Hyperbolic Divergence

We begin by defining the $\beta$-hyperbolic entropy, denoted $\phi_\beta$.

**Definition (Hyperbolic-Entropy).** For all $\beta > 0$, let $\phi_\beta : \mathbb{R}^d \to \mathbb{R}$ be defined as,

$$\phi_\beta(\mathbf{x}) \;=\; \sum_{i=1}^{d}\left(x_i \operatorname{arcsinh}\!\left(\frac{x_i}{\beta}\right) - \sqrt{x_i^2+\beta^2}\right)\,.$$

Alternatively, we can view $\phi_\beta$ as a sum of scalar functions, $\phi_\beta(\mathbf{x}) = \sum_{i=1}^d \phi_\beta(x_i)$, each of which satisfies,

$$\phi_\beta''(x) \;=\; \frac{1}{\sqrt{x^2+\beta^2}}\,. \tag{1}$$
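As a quick numerical check of (1), here is a minimal NumPy sketch (the function names `hypentropy` and `second_derivative` are ours, not from the paper):

```python
import numpy as np

def hypentropy(x, beta):
    """phi_beta(x) = sum_i [ x_i * arcsinh(x_i / beta) - sqrt(x_i^2 + beta^2) ]."""
    x = np.asarray(x, dtype=float)
    return np.sum(x * np.arcsinh(x / beta) - np.sqrt(x ** 2 + beta ** 2))

def second_derivative(x0, beta, h=1e-4):
    """Central finite-difference estimate of the scalar phi_beta'' at x0."""
    f = lambda x: hypentropy([x], beta)
    return (f(x0 + h) - 2.0 * f(x0) + f(x0 - h)) / h ** 2
```

For instance, `second_derivative(0.3, 0.1)` agrees with $1/\sqrt{0.3^2+0.1^2}$ to several digits, matching (1).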

For brevity and clarity, we use the shorthand $\phi$ for $\phi_\beta$ when $\beta$ is clear from context. Given the function, we derive its associated Bregman divergence, the relative hypentropy, as,

$$D_{\phi_\beta}(\mathbf{x}\,\|\,\mathbf{y}) \;=\; \phi_\beta(\mathbf{x})-\phi_\beta(\mathbf{y})-\left\langle\nabla\phi_\beta(\mathbf{y}),\,\mathbf{x}-\mathbf{y}\right\rangle \;=\; \sum_{i=1}^{d}\left[x_i\!\left(\operatorname{arcsinh}\frac{x_i}{\beta}-\operatorname{arcsinh}\frac{y_i}{\beta}\right)-\sqrt{x_i^2+\beta^2}+\sqrt{y_i^2+\beta^2}\right]\,.$$
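The divergence is direct to implement, since $\nabla\phi_\beta(\mathbf{y}) = \operatorname{arcsinh}(\mathbf{y}/\beta)$ coordinate-wise (a sketch; the function name is ours):

```python
import numpy as np

def hyp_divergence(x, y, beta):
    """Bregman divergence of the hypentropy potential."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(x * (np.arcsinh(x / beta) - np.arcsinh(y / beta))
                  - np.sqrt(x ** 2 + beta ** 2) + np.sqrt(y ** 2 + beta ** 2))
```

As expected of a Bregman divergence of a convex potential, it is nonnegative and vanishes when the two arguments coincide.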

As we vary $\beta$, the relative hypentropy interpolates between the squared Euclidean distance and the relative entropy. The potentials for these divergences are sums of element-wise scalar functions, so for simplicity we view them as scalar functions. The interpolation properties of hypentropy can be seen in Figure 1. As $\beta$ approaches $0$, we see that $\phi_\beta''(x)$ approaches $1/|x|$. When working only over the positive orthant, as is the case with entropic regularization, the hypentropy second derivative converges to the second derivative of the negative entropy, $1/x$. On the other hand, as $\beta$ grows much larger than $|x|$, we see $\phi_\beta''(x) \approx 1/\beta$. Therefore, for larger $\beta$, $\phi_\beta''$ is essentially a constant. In this regime hypentropy behaves like a scaled squared Euclidean distance.

From the perspective of mirror descent (see Sec. A), it makes sense to look at the mirror map, the gradient of the potential, which defines the dual space where additive gradient updates take place. Weights are mapped into the dual space via the mirror map $\nabla\phi_\beta(\mathbf{x}) = \operatorname{arcsinh}(\mathbf{x}/\beta)$ and mapped back into the primal space via $(\nabla\phi_\beta)^{-1}(\mathbf{x}) = \beta\sinh(\mathbf{x})$, both applied coordinate-wise. Gradient Descent (GD) can be framed as mirror descent using the squared Euclidean norm potential, while Exponentiated Gradient (EG) amounts to mirror descent using the entropy potential. The Hypentropy Update (HU) uses the mirror map $\operatorname{arcsinh}(x/\beta)$. As can be seen from Figure 1, for sufficiently large weights the mirror map behaves like $\log(2x/\beta)$, namely, like the EG mirror map. In contrast, $\operatorname{arcsinh}(x/\beta)$ is linear for small weights, and thus behaves like GD. Larger values of $\beta$ correspond to a slower transition from the linear regime to the logarithmic regime of the mirror map.
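The two regimes of the mirror map are easy to verify numerically (an illustration; the particular values of $\beta$ and $x$ below are arbitrary):

```python
import numpy as np

beta = 1e-3

# Small-weight regime: arcsinh(x / beta) is nearly linear in x (GD-like).
x_small = 1e-6
linear_err = abs(np.arcsinh(x_small / beta) - x_small / beta)

# Large-weight regime: arcsinh(x / beta) is nearly log(2 x / beta) (EG-like).
x_large = 1.0
log_err = abs(np.arcsinh(x_large / beta) - np.log(2.0 * x_large / beta))
```

Both approximation errors are below $10^{-6}$ here.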

The regret analyses of GD and EG depend on geometric properties of the divergence related to the $2$-norm and $1$-norm respectively. Given this connection, it is useful to analyze the properties of $\phi_\beta$ with respect to both the $2$-norm and the $1$-norm. Recall that a function $f$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ on $\mathcal{K}$ if,

$$\forall\,\mathbf{x},\mathbf{y}\in\mathcal{K}:\quad f(\mathbf{x})-f(\mathbf{y})-\nabla f(\mathbf{y})^\top(\mathbf{x}-\mathbf{y}) \;\ge\; \frac{\alpha}{2}\|\mathbf{x}-\mathbf{y}\|^2\,.$$

For convenience, we use the following second-order characterization of strong convexity from Thm. 3 in Yu (2013). Let $\mathcal{K}$ be a convex subset of some finite-dimensional vector space $\mathcal{X}$. A twice differentiable function $\phi$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ iff

$$\inf_{\mathbf{x}\in\mathcal{K},\;\mathbf{y}\in\mathcal{X}:\|\mathbf{y}\|=1} \mathbf{y}^\top\nabla^2\phi(\mathbf{x})\,\mathbf{y} \;\ge\; \alpha\,.$$

We next prove elementary properties of $\phi_\beta$.

**Lemma.** The function $\phi_\beta$ is $\frac{1}{1+\beta}$-strongly convex over $B_2$ w.r.t. the $2$-norm.

To prove this, note that from (1) the Hessian is the diagonal matrix,

$$\nabla^2\phi_\beta(\mathbf{x}) \;=\; \operatorname{diag}\!\left[\frac{1}{\sqrt{x_1^2+\beta^2}},\dots,\frac{1}{\sqrt{x_d^2+\beta^2}}\right]\,.$$

Strong convexity follows from the diagonal structure of the Hessian, whose smallest eigenvalue over $B_2$ is bounded as

$$\frac{1}{\sqrt{x_i^2+\beta^2}} \;\ge\; \frac{1}{\sqrt{1+\beta^2}} \;\ge\; \frac{1}{1+\beta}\,.$$

**Lemma.** The function $\phi_\beta$ is $\frac{1}{1+\beta d}$-strongly convex over $B_1$ w.r.t. the $1$-norm.

We work with the characterization provided in Lemma 3,

$$\begin{aligned}
\inf_{\mathbf{x}\in B_1,\;\|\mathbf{y}\|_1=1}\mathbf{y}^\top\nabla^2\phi(\mathbf{x})\,\mathbf{y}
&= \inf_{\mathbf{x}\in B_1,\;\|\mathbf{y}\|_1=1}\sum_{i=1}^{d}\frac{y_i^2}{\sqrt{\beta^2+x_i^2}} && \text{[Equation (1)]}\\
&= \inf_{\mathbf{x}\in B_1,\;\|\mathbf{y}\|_1=1}\frac{1}{\sum_{i=1}^{d}\sqrt{\beta^2+x_i^2}}\left(\sum_{i=1}^{d}\frac{y_i^2}{\sqrt{\beta^2+x_i^2}}\right)\left(\sum_{i=1}^{d}\sqrt{\beta^2+x_i^2}\right)\\
&\ge \inf_{\mathbf{x}\in B_1,\;\|\mathbf{y}\|_1=1}\frac{1}{\sum_{i=1}^{d}\sqrt{\beta^2+x_i^2}}\left(\sum_{i=1}^{d}|y_i|\right)^{2} && \text{[Cauchy--Schwarz]}\\
&= \inf_{\mathbf{x}\in B_1}\frac{\|\mathbf{y}\|_1^2}{\sum_{i=1}^{d}\sqrt{\beta^2+x_i^2}} \;\ge\; \frac{1}{\sum_{i=1}^{d}(\beta+|x_i|)} \;\ge\; \frac{1}{1+\beta d}\,.
\end{aligned}$$

We next introduce a generalized notion of diameter and use it to prove properties of $\phi_\beta$. The diameter of a convex set $\mathcal{K}$ with respect to $\phi$ is,

$$\operatorname{diam}_\phi(\mathcal{K}) \;=\; \max_{\mathbf{x}\in\mathcal{K}} D_{\phi}(\mathbf{x}\,\|\,\mathbf{0})\,.$$

Whenever implied by the context we omit the potential from the diameter. Before we consider two specific diameters below, we bound the divergence in general (using $\nabla\phi_\beta(\mathbf{0})=\mathbf{0}$) as follows,

$$D_{\phi_\beta}(\mathbf{x}\,\|\,\mathbf{0}) \;=\; \phi_\beta(\mathbf{x})-\phi_\beta(\mathbf{0}) \;=\; \sum_{i=1}^{d}\left(x_i\operatorname{arcsinh}\frac{x_i}{\beta}-\sqrt{x_i^2+\beta^2}\right)+\beta d \;\le\; \sum_{i=1}^{d}x_i\operatorname{arcsinh}\frac{x_i}{\beta} \;=\; \sum_{i=1}^{d}|x_i|\log\!\left(\frac{1}{\beta}\left(\sqrt{x_i^2+\beta^2}+|x_i|\right)\right)\,.$$

Thus, without loss of generality, we can assume that $\mathbf{x}$ lies in the positive orthant. We next bound the diameter of $B_2$ as follows,

$$\operatorname{diam}(B_2) \;\le\; \sum_{i=1}^{d}x_i\log\!\left(\frac{1}{\beta}\left(\sqrt{x_i^2+\beta^2}+x_i\right)\right) \;\le\; \sum_{i=1}^{d}x_i\log\!\left(1+\frac{2x_i}{\beta}\right) \;\le\; \sum_{i=1}^{d}\frac{2x_i^2}{\beta} \;=\; \frac{2\|\mathbf{x}\|_2^2}{\beta} \;\le\; \frac{2}{\beta}\,. \tag{2}$$

For $x\in[0,1]$ and $\beta\le 1$ it holds that $\sqrt{x^2+\beta^2}+x \le 3$. Hence, for $\mathbf{x}\in B_1$, we have

$$\operatorname{diam}(B_1) \;\le\; \sum_{i=1}^{d}x_i\log\!\left(\frac{3}{\beta}\right) \;\le\; \log\!\left(\frac{3}{\beta}\right)\,. \tag{3}$$

### 3.1 HU algorithm

We next describe an OCO algorithm over a convex domain $\mathcal{K}$.

Hypentropy Update (HU)
Input: learning rate $\eta>0$, parameter $\beta>0$, convex domain $\mathcal{K}$
Initialize the weight vector $\mathbf{w}_1 = \mathbf{0}$
For $t = 1,\dots,T$:
&nbsp;&nbsp;(a) Predict $\mathbf{w}_t$
&nbsp;&nbsp;(b) Incur loss $\ell_t(\mathbf{w}_t)$
&nbsp;&nbsp;(c) Calculate $\mathbf{g}_t = \nabla\ell_t(\mathbf{w}_t)$
&nbsp;&nbsp;Update: $\tilde{\mathbf{w}}_{t+1} = \beta\sinh\!\left(\operatorname{arcsinh}(\mathbf{w}_t/\beta) - \eta\,\mathbf{g}_t\right)$
&nbsp;&nbsp;Project onto $\mathcal{K}$: $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}\in\mathcal{K}} D_{\phi_\beta}(\mathbf{w}\,\|\,\tilde{\mathbf{w}}_{t+1})$
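A minimal NumPy sketch of the unprojected HU step (the Bregman projection onto $\mathcal{K}$ is omitted for brevity):

```python
import numpy as np

def hu_step(w, grad, eta, beta):
    """One unprojected HU step: map the weights to the dual space via
    arcsinh(w / beta), take an additive gradient step there, and map
    back to the primal space via beta * sinh."""
    return beta * np.sinh(np.arcsinh(w / beta) - eta * grad)

# Toy run: minimize f(w) = 0.5 * ||w - a||^2, whose gradient is w - a.
a = np.array([0.3, -0.2])
w = np.zeros_like(a)
for _ in range(500):
    w = hu_step(w, w - a, eta=0.1, beta=1.0)
# After the loop, w is close to the minimizer a.
```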

HU is an instance of OMD with the divergence $D_{\phi_\beta}$. We provide a simple regret analysis that follows directly from the geometric properties derived above. The following theorem allows us to bound the regret of an OMD algorithm in terms of the diameter and strong convexity.

**Theorem.** Assume that $\phi$ is $\mu$-strongly convex with respect to a norm $\|\cdot\|$ whose dual norm is $\|\cdot\|_*$. Assume that the diameter of $\mathcal{K}$ is bounded, $\operatorname{diam}_\phi(\mathcal{K}) \le D$. Last, assume that $\|\nabla\ell_t(\mathbf{w})\|_* \le G$ for all $t$. Then OMD with a suitably chosen learning rate $\eta$ satisfies

$$\mathrm{regret}_T \;\le\; 2\sqrt{2\mu^{-1}DTG^2}\,.$$

This follows from the more general analysis in App. A. We next provide regret bounds for HU over $B_2$ and $B_1$.

**Theorem (Additive Regret).** Let $\beta \ge 1$ and assume that for all $t$, $\|\nabla\ell_t(\mathbf{w})\|_2 \le G_2$. Setting

$$\eta \;=\; \frac{1}{G_2}\sqrt{\frac{1}{\beta(\beta+1)T}}\,,\qquad\text{yields}\qquad \mathrm{regret}_T(\mathrm{HU}) \;\le\; 4G_2\sqrt{T}\,.$$

Applying Lemma 3 and the diameter bounds from (2) to Theorem 3.1 yields,

$$\mathrm{regret}_T \;\le\; 2\sqrt{2\left(1+\frac{1}{\beta}\right)TG_2^2} \;\le\; 4G_2\sqrt{T}\,.$$

The final inequality follows from the condition that $\beta \ge 1$.

**Theorem (Multiplicative Regret).** Let $\beta \le 1$ and assume that for all $t$, $\|\nabla\ell_t(\mathbf{w})\|_\infty \le G_\infty$. Setting

$$\eta \;=\; \frac{1}{G_\infty}\sqrt{\frac{\log(3/\beta)}{2T(1+\beta d)}}\,,\qquad\text{yields}\qquad \mathrm{regret}_T(\mathrm{HU}) \;\le\; 3G_\infty\sqrt{T(1+\beta d)\log(3/\beta)}\,.$$

Applying Lemma 3 and the diameter bounds from (3) to Theorem 3.1 yields,

$$\mathrm{regret}_T \;\le\; 2G_\infty\sqrt{2T(1+\beta d)\log(3/\beta)} \;\le\; 3G_\infty\sqrt{T(1+\beta d)\log(3/\beta)}\,.$$

## 4 Spectral Hyperbolic Divergence

In this section, the focus is on using $\phi_\beta$ as a spectral regularization function. We show that the matrix version of $\phi_\beta$ is strongly convex with respect to the trace norm. Our proof of strong convexity takes a roundabout route for the matrix potential: it works by showing that the conjugate potential function is smooth with respect to the spectral norm (the dual of the trace norm). The duality of smoothness and strong convexity is then used to deduce strong convexity.

### 4.1 Matrix Functions

We are concerned with potential functions that act on the singular values of a matrix. For an even scalar function $f$, consider the trace function,

$$F(\mathbf{X}) \;=\; (f\circ\sigma)(\mathbf{X}) \;=\; \sum_{i=1}^{\min\{m,n\}} f(\sigma_i) \;=\; \operatorname{Tr}\!\left(f\!\left(\sqrt{\mathbf{X}^\top\mathbf{X}}\right)\right)\,, \tag{4}$$

where we overload notation and write $f$ both for the scalar function and for its matrix lifting. For $\mathbf{X}\in\mathbb{S}^d$ we use

$$F(\mathbf{X}) \;=\; (f\circ\lambda)(\mathbf{X}) \;=\; \operatorname{Tr}(f(\mathbf{X}))\,. \tag{5}$$

Here $f(\mathbf{X})$ represents the standard lifting of a scalar function to a square symmetric matrix, where $f$ acts on the vector of eigenvalues, namely,

$$\mathbf{X} = \mathbf{U}\operatorname{diag}[\lambda(\mathbf{X})]\,\mathbf{U}^\top \;\Rightarrow\; f(\mathbf{X}) = \mathbf{U}\operatorname{diag}[f(\lambda(\mathbf{X}))]\,\mathbf{U}^\top\,.$$
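The lifting takes one eigendecomposition to implement (a sketch; `lift` is our name for the operation):

```python
import numpy as np

def lift(f, X):
    """Apply the scalar function f to a symmetric matrix X through its eigenvalues."""
    lam, U = np.linalg.eigh(X)  # eigh assumes X is symmetric
    return U @ np.diag(f(lam)) @ U.T
```

For $f(x)=x^2$, the lifting agrees with the ordinary matrix square.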

We also use the gradient of a trace function in our analysis. The following result, from Thm. 14 in Kakade et al. (2009), shows how to compute the gradient using a singular value decomposition. Let $f$ and $F$ be defined as above; then

$$\nabla F(\mathbf{X}) \;=\; f'(\mathbf{X})\,.$$

We also make use of Fenchel conjugate functions. Consider a convex function $f$ defined on a finite-dimensional vector space $\mathcal{X}$ endowed with an inner product $\langle\cdot,\cdot\rangle$. The conjugate of a convex function $f$ is

$$f^*(\mathbf{z}) \;=\; \sup_{\mathbf{x}\in\mathcal{X}}\;\langle\mathbf{x},\mathbf{z}\rangle - f(\mathbf{x})\,.$$

In this section, we use the space of matrices (either $\mathbb{R}^{m\times n}$ or $\mathbb{S}^d$) with the inner product $\langle\mathbf{X},\mathbf{Y}\rangle = \operatorname{Tr}(\mathbf{X}^\top\mathbf{Y})$. Thus, the dual space of $\mathcal{X}$ is $\mathcal{X}$ itself and $f^*$ is defined over $\mathcal{X}$.

We need to relate the conjugate of a trace function to that of a scalar function. This is achieved by the following result, restated from Thm. 12 of Kakade et al. (2009). The theorem implies that the conjugate of a singular-value function is the singular-value function lifted from the conjugate of the scalar function. Let $F$ be defined as in (4); then $F^* = f^*\circ\sigma$.

### 4.2 Duality of Strong Convexity and Smoothness

Recall that a function $f$ is $L$-smooth with respect to a norm $\|\cdot\|$ on $\mathcal{K}$ if,

$$\forall\,\mathbf{x},\mathbf{y}\in\mathcal{K}:\quad f(\mathbf{x})-f(\mathbf{y})-\nabla f(\mathbf{y})^\top(\mathbf{x}-\mathbf{y}) \;\le\; \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^2\,.$$

For convenience, we use the following second-order characterization of smoothness, which is an analogue of Lemma 3. A twice differentiable function $\phi$ is locally $L$-smooth with respect to $\|\cdot\|$ at $\mathbf{x}$ iff

$$\sup_{\mathbf{y}\in\mathcal{X}:\|\mathbf{y}\|=1}\mathbf{y}^\top\nabla^2\phi(\mathbf{x})\,\mathbf{y} \;\le\; L\,.$$

Strong convexity and smoothness are dual notions in the sense that $f$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ iff its Fenchel conjugate $f^*$ is $\frac{1}{\alpha}$-smooth with respect to the dual norm $\|\cdot\|_*$.

For the matrix variant of $\phi_\beta$ we find it easier to show smoothness of the conjugate rather than strong convexity directly. Unfortunately, as we see in the sequel, the conjugate function is not smooth everywhere. Therefore, we need a local variant of the duality of strong convexity and smoothness. In the context of mirror descent with mirror map $\nabla\phi$, we show that $\phi$ is strongly convex over $\mathcal{K}$ if $\phi^*$ is locally smooth at all points within the image of the mirror map. Formally, we have the following lemma, in which we use the standard notation for the image of a vector function, $\nabla\phi(\mathcal{K}) = \{\nabla\phi(\mathbf{x}) : \mathbf{x}\in\mathcal{K}\}$.

**Lemma (Local duality of smoothness and strong convexity).** Let $\mathcal{K}$ be an open convex set and $\|\cdot\|$ a norm with dual norm $\|\cdot\|_*$. Let $\phi$ be a twice differentiable, closed, and convex function. Suppose the Fenchel conjugate $\phi^*$ is locally $L$-smooth with respect to $\|\cdot\|_*$ at all points in $\nabla\phi(\mathcal{K})$. Then, $\phi$ is $\frac{1}{L}$-strongly convex with respect to $\|\cdot\|$ over $\mathcal{K}$.

It suffices to show that for any $\mathbf{x}\in\mathcal{K}$, $\phi$ is locally $\frac{1}{L}$-strongly convex with respect to $\|\cdot\|$ at $\mathbf{x}$ whenever $\phi^*$ is locally $L$-smooth at $\mathbf{x}^* = \nabla\phi(\mathbf{x})$ with respect to $\|\cdot\|_*$.

From local smoothness at $\mathbf{x}^*$, the quadratic form $f(\mathbf{y}) = \frac{1}{2}\mathbf{y}^\top\nabla^2\phi^*(\mathbf{x}^*)\,\mathbf{y}$ satisfies, for any $\mathbf{y}$,

$$f(\mathbf{y}) \;=\; \frac{1}{2}\mathbf{y}^\top\nabla^2\phi^*(\mathbf{x}^*)\,\mathbf{y} \;\le\; \frac{L}{2}\|\mathbf{y}\|_*^2\,.$$

Taking the dual, which is order reversing, we have for any $\mathbf{z}$,

$$f^*(\mathbf{z}) \;=\; \frac{1}{2}\mathbf{z}^\top\left[\nabla^2\phi^*(\mathbf{x}^*)\right]^{-1}\mathbf{z} \;\ge\; \frac{1}{2L}\|\mathbf{z}\|^2\,. \tag{6}$$

Since $\mathbf{x}^* = \nabla\phi(\mathbf{x})$, from the inverse function theorem we have that

$$\nabla^2\phi^*(\mathbf{x}^*) \;=\; \left[\nabla^2\phi(\mathbf{x})\right]^{-1}\,.$$

Using the above equality in (6), we have for any $\mathbf{z}$,

$$\frac{1}{2}\mathbf{z}^\top\nabla^2\phi(\mathbf{x})\,\mathbf{z} \;\ge\; \frac{1}{2L}\|\mathbf{z}\|^2\,.$$

### 4.3 Strong Convexity of Spectral Hypentropy

We now analyze the strong convexity of the spectral hypentropy. The spectral function $\Phi_\beta$ is defined for $\mathbf{X}\in\mathbb{R}^{m\times n}$ by (4) and for $\mathbf{X}\in\mathbb{S}^d$ by (5), replacing $f$ with $\phi_\beta$. The main theorem of this subsection is as follows.

**Theorem.** The trace function $\Phi_\beta$ is $\frac{1}{2(\tau+\beta\min\{m,n\})}$-strongly convex with respect to the trace norm over $B_{\mathrm{tr}}(\tau)$.

We denote the symmetric matrices in the trace-norm ball of radius $\tau$ by

$$\mathfrak{B}^{s}_\tau \;=\; \left\{\mathbf{X}\in\mathbb{S}^d : \|\mathbf{X}\|_1 \le \tau\right\}\,.$$

We prove the above theorem by first proving the lemma below for matrices in $\mathfrak{B}^{s}_\tau$. We then extend it to arbitrary matrices using a symmetrization argument, a technique similar to Juditsky and Nemirovski (2008); Warmuth (2007); Hazan et al. (2012).

**Definition.** Let $\mathcal{K}$ be a subset of matrices and $\mathcal{X}$ be a vector space containing $\mathcal{K}$ such that every matrix in $\mathcal{X}$ has rank at most $r$ and $\nabla\Phi_\beta(\mathcal{K})\subseteq\mathcal{X}$.

This abstraction will be useful in translating strong convexity of arbitrary matrices to the symmetric case. The bound on the rank is essential to give a modulus of strong convexity that depends only on $r$ rather than the ambient dimension. The final property is necessary for the low-rank structure to be preserved after a primal-dual mapping.

The trace function is -strongly convex w.r.t the trace norm over .

To prove the symmetric variant, we show that $\Phi^*_\beta$ is smooth with respect to the spectral norm, which is the dual norm of the trace norm. The result then follows directly from Lemma 4.2.

From Theorem 4.1, $\Phi_\beta$ has Fenchel conjugate

$$\Phi^*_\beta(\mathbf{X}) \;=\; \operatorname{Tr}\!\left(\phi^*_\beta(\mathbf{X})\right)\,.$$

Since the derivative of the conjugate of a function is the (functional) inverse of the derivative of the function, we have

$$\frac{d\phi^*_\beta}{dx} \;=\; \left(\frac{d\phi_\beta}{dx}\right)^{-1} \;=\; \beta\sinh(x)\,.$$

The indefinite integral of the above yields that, up to a constant, $\phi^*_\beta(x) = \beta\cosh(x)$. Clearly, $\phi^*_\beta$ is not smooth everywhere. Nonetheless, we do have smoothness over the image of the mirror map. Before proving this property, we introduce a clever technical lemma of Juditsky and Nemirovski (2008) that allows us to reduce spectral smoothness for matrices to smoothness of functions in the vector case. Let $f$ be a scalar function and $c>0$ a constant such that for all $a\neq b$,

$$\frac{f'(a)-f'(b)}{a-b} \;\le\; \frac{c\left(f''(a)+f''(b)\right)}{2}\,. \tag{7}$$

Let $F$ be the trace function defined by $F(\mathbf{X}) = \operatorname{Tr}(f(\mathbf{X}))$. Then the second directional derivative of $F$ is bounded for any direction $\mathbf{H}$ as follows,

$$\partial^2_{\mathbf{H}}F(\mathbf{X}) \;\le\; c\operatorname{Tr}\!\left(\mathbf{H}\,f''(\mathbf{X})\,\mathbf{H}\right)\,.$$

We are now prepared to analyze the smoothness of $\Phi^*_\beta$.

**Lemma (Local Smoothness).** The trace function $\Phi^*_\beta$ is locally $2(\tau+\beta r)$-smooth with respect to the spectral norm for all matrices in $\nabla\Phi_\beta(\mathcal{K})$.

To prove local smoothness, we use the second-order conditions from Lemma 4.2. This requires us to upper bound the second directional derivatives for all directions corresponding to matrices of unit spectral norm. We consider a matrix in the image of the mirror map,

$$\mathbf{X} \;=\; \nabla\Phi_\beta(\mathbf{Y}) \;=\; \phi'_\beta(\mathbf{Y}) \;=\; \operatorname{arcsinh}\!\left(\frac{\mathbf{Y}}{\beta}\right)\,.$$

Note that $(\phi^*_\beta)'' = \beta\cosh$ is positive and convex. Therefore, by the mean value theorem, for any $a<b$ there exists $c\in[a,b]$ for which,

$$\frac{(\phi^*_\beta)'(b)-(\phi^*_\beta)'(a)}{b-a} \;=\; (\phi^*_\beta)''(c) \;\le\; \max\left\{(\phi^*_\beta)''(a),\,(\phi^*_\beta)''(b)\right\} \;\le\; (\phi^*_\beta)''(a)+(\phi^*_\beta)''(b)\,,$$

so (7) holds with constant $2$.

We note that by Definition 4.3, $\nabla\Phi_\beta(\mathcal{K})\subseteq\mathcal{X}$, so we can restrict ourselves to the vector space $\mathcal{X}$. Therefore, applying Lemma 4.3, we have

$$\begin{aligned}
\sup_{\mathbf{H}\in\mathcal{X}:\|\mathbf{H}\|_\infty\le1}\partial^2_{\mathbf{H}}\Phi^*_\beta(\mathbf{X})
&\le \sup_{\mathbf{H}\in\mathcal{X}:\|\mathbf{H}\|_\infty\le1} 2\operatorname{Tr}\!\left(\mathbf{H}\,(\phi^*_\beta)''(\mathbf{X})\,\mathbf{H}\right)\\
&= \sup_{\mathbf{H}\in\mathcal{X}:\|\mathbf{H}\|_\infty\le1} 2\operatorname{Tr}\!\left(\mathbf{H}^2\,(\phi^*_\beta)''(\mathbf{X})\right) && \text{[cyclicity of trace]}\\
&\le \sup_{\mathbf{H}\in\mathcal{X}:\|\mathbf{H}\|_\infty\le1} 2\left\langle\sigma^2(\mathbf{H}),\,\sigma\!\left((\phi^*_\beta)''(\mathbf{X})\right)\right\rangle && \text{[von Neumann's trace inequality]}
\end{aligned}$$

where von Neumann’s trace inequality refers to $\operatorname{Tr}(\mathbf{A}\mathbf{B}) \le \langle\sigma(\mathbf{A}),\sigma(\mathbf{B})\rangle$. Now, since $\mathbf{H}\in\mathcal{X}$, we know $\operatorname{rank}(\mathbf{H})\le r$, and so $\mathbf{H}$ can have at most $r$ nonzero singular values, yielding

$$\sup_{\mathbf{H}\in\mathcal{X}:\|\mathbf{H}\|_\infty\le1}\partial^2_{\mathbf{H}}\Phi^*_\beta(\mathbf{X}) \;\le\; \sup_{\mathbf{H}\in\mathcal{X}:\|\mathbf{H}\|_\infty\le1} 2\|\mathbf{H}^2\|_\infty\sum_{i=1}^{r}\sigma_i\!\left((\phi^*_\beta)''(\mathbf{X})\right) \;=\; 2\sum_{i=1}^{r}(\phi^*_\beta)''\!\left(\phi'_\beta(\sigma_i(\mathbf{Y}))\right)\,.$$

Now, note that

$$(\phi^*_\beta)''\!\left(\phi'_\beta(x)\right) \;=\; \beta\cosh\!\left(\operatorname{arcsinh}\!\left(\frac{x}{\beta}\right)\right) \;=\; \sqrt{\beta^2+x^2} \;\le\; \beta+|x|\,.$$

It then follows that

$$\sum_{i=1}^{r}(\phi^*_\beta)''\!\left(\phi'_\beta(\sigma_i(\mathbf{Y}))\right) \;\le\; \beta r + \|\mathbf{Y}\|_1 \;\le\; \tau + \beta r\,.$$

Therefore, the second directional derivative is bounded by $2(\tau+\beta r)$ as desired.
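The two scalar facts used in this proof, $\phi^*_\beta(x)=\beta\cosh(x)$ (here the additive constant happens to be zero) and $(\phi^*_\beta)''(\phi'_\beta(x))=\sqrt{\beta^2+x^2}$, can be checked numerically (a brute-force sketch; the grid bounds and sample points are arbitrary):

```python
import numpy as np

beta, z = 0.3, 1.2

# Fenchel conjugate by brute force: sup_x { x * z - phi_beta(x) }.
xs = np.linspace(-2.0, 2.0, 200001)
phi = xs * np.arcsinh(xs / beta) - np.sqrt(xs ** 2 + beta ** 2)
conj = np.max(xs * z - phi)  # should be close to beta * cosh(z)

# Second-derivative identity along the mirror map.
x = 0.7
lhs = beta * np.cosh(np.arcsinh(x / beta))
rhs = np.sqrt(beta ** 2 + x ** 2)
```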

**Proof of Theorem 4.3.** We introduce the symmetrization operator $S:\mathbb{R}^{m\times n}\to\mathbb{S}^{m+n}$, a linear function that takes a matrix to a symmetric matrix,

$$S(\mathbf{X}) \;=\; \begin{bmatrix} 0 & \mathbf{X}\\ \mathbf{X}^\top & 0\end{bmatrix}\,.$$

The eigenvalues of $S(\mathbf{X})$ are exactly one copy of the singular values and one copy of the negated singular values of $\mathbf{X}$ (padded with zeros). Therefore, since $\phi_\beta$ is even, we have

$$\Phi_\beta(\mathbf{X}) \;=\; \sum_{i=1}^{\min\{m,n\}}\phi_\beta(\sigma_i(\mathbf{X})) \;=\; \frac{1}{2}\Phi_\beta(S(\mathbf{X}))\,.$$

Technically, for the above to hold true, we should shift $\phi_\beta$ so that $\phi_\beta(0)=0$. Since a constant shift affects neither the diameter nor the convexity properties, this is not an issue.
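Both spectral facts, the $\pm\sigma$ eigenvalue structure of $S(\mathbf{X})$ and the identity $\Phi_\beta(\mathbf{X})=\frac{1}{2}\Phi_\beta(S(\mathbf{X}))$ for the shifted potential, can be verified numerically (a sketch; function names are ours):

```python
import numpy as np

def symmetrize(X):
    """S(X) = [[0, X], [X^T, 0]]."""
    m, n = X.shape
    return np.block([[np.zeros((m, m)), X], [X.T, np.zeros((n, n))]])

def phi_shifted(v, beta):
    """Scalar hypentropy summed over v, shifted so that phi_beta(0) = 0."""
    v = np.asarray(v, dtype=float)
    return np.sum(v * np.arcsinh(v / beta) - np.sqrt(v ** 2 + beta ** 2) + beta)

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2))
sig = np.linalg.svd(X, compute_uv=False)
eig = np.linalg.eigvalsh(symmetrize(X))  # +/- singular values, plus |m - n| zeros
```

Because of the shift, the extra zero eigenvalues of $S(\mathbf{X})$ contribute nothing to the potential.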

Let $\mu$ be the modulus of strong convexity in the symmetrized space. We bound $2\Phi_\beta(\mathbf{X})$ from below as follows,

$$\begin{aligned}
2\Phi_\beta(\mathbf{X}) &= \Phi_\beta(S(\mathbf{X}))\\
&\ge \Phi_\beta(S(\mathbf{Y})) + \left\langle\nabla\Phi_\beta(S(\mathbf{Y})),\,S(\mathbf{X})-S(\mathbf{Y})\right\rangle + \frac{\mu}{2}\left\|S(\mathbf{X})-S(\mathbf{Y})\right\|_1^2\\
&= \Phi_\beta(S(\mathbf{Y})) + \left\langle\nabla\Phi_\beta(S(\mathbf{Y})),\,S(\mathbf{X}-\mathbf{Y})\right\rangle + \frac{\mu}{2}\left\|S(\mathbf{X}-\mathbf{Y})\right\|_1^2\\
&= 2\Phi_\beta(\mathbf{Y}) + 2\left\langle\nabla\Phi_\beta(\mathbf{Y}),\,\mathbf{X}-\mathbf{Y}\right\rangle + 2\mu\left\|\mathbf{X}-\mathbf{Y}\right\|_1^2\,.
\end{aligned}$$

Therefore, the modulus of strong convexity over asymmetric matrices is $2\mu$.

Note that $S(B_{\mathrm{tr}}(\tau))$ satisfies the properties listed in Definition 4.3 with $r = 2\min\{m,n\}$. In particular, $\mathcal{X} = S(\mathbb{R}^{m\times n})$ is a vector space containing symmetric matrices of rank at most $2\min\{m,n\}$. Furthermore, for any $\mathbf{X}\in S(B_{\mathrm{tr}}(\tau))$ we have $\mathbf{X}=S(\mathbf{Y})$ for some $\mathbf{Y}\in B_{\mathrm{tr}}(\tau)$ and thus,

$$\nabla\Phi_\beta(\mathbf{X}) \;=\; \nabla\Phi_\beta(S(\mathbf{Y})) \;=\; S\!\left(\nabla\Phi_\beta(\mathbf{Y})\right)\in\mathcal{X}\,.$$

Since $\|S(\mathbf{Y})\|_1 \le 2\tau$, it follows from the lemma above that $\mu \ge \frac{1}{2(2\tau+2\beta\min\{m,n\})}$, and so the strong convexity modulus over asymmetric matrices is at least $\frac{1}{2(\tau+\beta\min\{m,n\})}$, as desired.

### 4.4 SHU algorithm

We next describe the Spectral Hypentropy Update (SHU), an OCO algorithm over a convex domain of matrices $\mathcal{K}$.

Spectral Hypentropy Update (SHU)
Input: learning rate $\eta>0$, parameter $\beta>0$, convex domain of matrices $\mathcal{K}$
Initialize the weight matrix $\mathbf{W}_1 = \mathbf{0}$
For $t = 1,\dots,T$:
&nbsp;&nbsp;(a) Predict $\mathbf{W}_t$
&nbsp;&nbsp;(b) Incur loss $\ell_t(\mathbf{W}_t)$
&nbsp;&nbsp;(c) Calculate $\mathbf{G}_t = \nabla\ell_t(\mathbf{W}_t)$
&nbsp;&nbsp;Update: $\tilde{\mathbf{W}}_{t+1} = \beta\sinh\!\left(\operatorname{arcsinh}(\mathbf{W}_t/\beta) - \eta\,\mathbf{G}_t\right)$, with both maps applied to singular values
&nbsp;&nbsp;Project onto $\mathcal{K}$: $\mathbf{W}_{t+1} = \arg\min_{\mathbf{W}\in\mathcal{K}} D_{\Phi_\beta}(\mathbf{W}\,\|\,\tilde{\mathbf{W}}_{t+1})$

The pseudocode of SHU is provided in Algorithm 4.4. The update step of SHU requires a spectral decomposition. We define $f(\mathbf{X}) = \mathbf{U}f(\Sigma)\mathbf{V}^\top$, where $\mathbf{X} = \mathbf{U}\Sigma\mathbf{V}^\top$ is the singular value decomposition of $\mathbf{X}$. We use this definition twice, once with $f = \operatorname{arcsinh}(\cdot/\beta)$, and, after subtracting the gradient, with $f = \beta\sinh(\cdot)$. We prove the following regret bound for SHU.

**Theorem.** Let $\mathcal{K} = B_{\mathrm{tr}}(\tau)$, let $G_\infty$ be a spectral-norm bound on the gradients, and let $\gamma = \beta/\tau \le 1$. Setting

$$\eta \;=\; \frac{1}{2G_\infty}\sqrt{\frac{\log(3/\gamma)}{T\left(1+\gamma\min\{m,n\}\right)}}\,,\qquad\text{yields}\qquad \mathrm{regret}_T(\mathrm{SHU}) \;\le\; 4\tau G_\infty\sqrt{T\left(1+\gamma\min\{m,n\}\right)\log(3/\gamma)}\,.$$

As for HU, we use the general OMD analysis. It suffices to find an upper bound on the diameter and a strong convexity bound.

Applying (3) to the vector of singular values, we have

$$\operatorname{diam}_{\Phi_\beta}\!\left(B_{\mathrm{tr}}(\tau)\right) \;\le\; \tau\log\!\left(\frac{3\tau}{\beta}\right) \;=\; \tau\log\!\left(\frac{3}{\gamma}\right)\,,$$

where $\gamma = \beta/\tau$. Furthermore, from Theorem 4.3 we have that $\Phi_\beta$ is $\frac{1}{2(\tau+\beta\min\{m,n\})}$-strongly convex with respect to the trace-norm over $B_{\mathrm{tr}}(\tau)$. Letting $\mu^{-1}D = 2\tau^2\left(1+\gamma\min\{m,n\}\right)\log(3/\gamma)$, the result follows from Theorem 3.1.
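A minimal sketch of the unprojected SHU step, with both maps applied through an SVD (the projection onto $B_{\mathrm{tr}}(\tau)$ is omitted):

```python
import numpy as np

def spectral_map(f, X):
    """Apply the scalar function f to the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(f(s)) @ Vt

def shu_step(W, grad, eta, beta):
    """One unprojected SHU step: arcsinh(./beta) on the singular values,
    an additive gradient step, then beta * sinh(.) on the singular values."""
    dual = spectral_map(lambda s: np.arcsinh(s / beta), W)
    return spectral_map(lambda s: beta * np.sinh(s), dual - eta * grad)

# Toy run: minimize 0.5 * ||W - A||_F^2, whose gradient is W - A.
rng = np.random.default_rng(3)
A = rng.uniform(-0.5, 0.5, size=(4, 3))
W = np.zeros_like(A)
for _ in range(2000):
    W = shu_step(W, W - A, eta=0.05, beta=1.0)
# After the loop, W is close to the minimizer A.
```

With a zero gradient the two spectral maps invert each other, so `shu_step` leaves its input unchanged.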

## 5 Experimental Results

Next, we experiment with HU in the context of empirical risk minimization (ERM). In the experiments, $\mathbf{g}_t$ stands for a stochastic estimate of the gradient of the empirical loss. Thus, we can convert the regret analysis to convergence in expectation Cesa-Bianchi et al. (2004).

#### Effective Learning Rate

For small values of $x$, $\operatorname{arcsinh}(x/\beta)\approx x/\beta$ and $\beta\sinh(x)\approx\beta x$. As a result, near $\mathbf{0}$ the update in HU is morally the additive update, $\mathbf{w}_{t+1}\approx\mathbf{w}_t-\eta\beta\,\mathbf{g}_t$. The product $\eta\beta$ can be viewed as the de facto learning rate of the gradient descent portion of the interpolation. As such, we define the effective learning rate to be $\eta\beta$. In the sequel, fixing the effective learning rate while changing $\beta$ is a fruitful lens for comparing HU with SGD.
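The approximation behind the effective learning rate is easy to see numerically: when both the weights and $\eta\,\mathbf{g}_t$ are small, the exact HU step and the additive step with rate $\eta\beta$ nearly coincide (the values below are arbitrary):

```python
import numpy as np

beta, eta = 1e-2, 0.05
w = np.array([1e-5, -2e-5])   # weights much smaller than beta
g = np.array([0.3, -0.1])     # (stochastic) gradient

hu_update = beta * np.sinh(np.arcsinh(w / beta) - eta * g)  # exact HU step
additive = w - eta * beta * g                               # GD step with rate eta * beta
```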

### 5.1 Logistic Regression

In this experiment we use the HU algorithm to optimize a logit model. A weight vector is drawn uniformly at random from the unit ball. The features are distributed according to a power law. The label associated with an example is set according to the sign of the inner product with the weight vector, and is flipped with a small probability.

The algorithms are trained with log-loss using mini-batches. Stochastic gradient descent and the $p$-norm algorithm Gentile (2003) are used for comparison. As can be seen in Fig. 2, the $p$-norm algorithm performs significantly worse than HU for a large set of values of $\beta$, while SGD performs comparably. As expected, for large values of $\beta$, SGD and HU are indistinguishable.

In the next experiment we use the same logit model with a larger ambient dimension. We generate sparse weights whose nonzero entries are chosen uniformly at random. We run the algorithms for 20,000 iterations. Rather than fixing $\eta$, we fix the effective learning rate $\eta\beta$. This way, as $\beta\to\infty$, HU behaves like SGD, while for small $\beta$ the update is roughly multiplicative. As discussed in Appendix B, this choice is similar to running EG with a $1$-norm bound. In Fig. 3, we show the interpolation between SGD and EG: the larger $\beta$ is, the closer the progress of HU resembles that of SGD, and intermediate values of $\beta$ make progress in between SGD and EG.

### 5.2 Multiclass Logistic Regression

In this experiment we use SHU to optimize a multiclass logistic model. We generated 200,000 examples. We set the number of classes to $k$. Labels are generated using a low-rank matrix $\mathbf{W}$. An example $\mathbf{x}$ was labeled according to the prediction rule,

$$y \;=\; \operatorname*{argmax}_{i\in[k]}\;\left(\mathbf{W}\mathbf{x}\right)_i\,.$$

With a small probability the label was flipped to a different one chosen at random. The matrix $\mathbf{W}$ and the features of each example are determined in a joint process designed to make the problem poorly conditioned for optimization. Features of each example are first drawn from a standard normal distribution. Weights of $\mathbf{W}$ are sampled from a standard normal distribution for the first block of features and are set to $0$ for the remaining features. After the labels are computed, features are perturbed by Gaussian noise. The examples and weights are then scaled and rotated: each coordinate of the data is scaled by a coordinate-dependent factor, and then a random rotation is applied. The inverse of these transformations is applied to the weights. Therefore, from the original sample and weights