# Logistic Regression Regret: What's the Catch?

We address the problem of the achievable regret rates with online logistic regression. We derive lower bounds with logarithmic regret under L_1, L_2, and L_∞ constraints on the parameter values. The bounds are dominated by (d/2) log T, where T is the horizon and d is the dimensionality of the parameter space. We show that they are achievable for d = o(T^{1/3}) in all these cases with Bayesian methods, which achieve them up to a (d/2) log d term. Different behaviors emerge for larger dimensionality. Specifically, on the negative side, if d = Ω(√T), any algorithm is guaranteed regret of Ω(d log T) (greater than Ω(√T)) under L_∞ constraints on the parameters (and the example features). On the positive side, under L_1 constraints on the parameters, there exist algorithms that can achieve regret that is sub-linear in d for the asymptotically larger values of d. For L_2 constraints, it is shown that for large enough d, the regret remains linear in d but is no longer logarithmic in T. Adapting the redundancy-capacity theorem from information theory, we demonstrate a principled methodology based on grids of parameters to derive lower bounds. Grids are also utilized to derive some upper bounds. Our results strengthen the upper bounds of Kakade and Ng (2005) and Foster et al. (2018) for this problem, introduce novel lower bounds, and adapt a methodology that can be used to obtain such bounds for other related problems. They also give a novel characterization of the asymptotic behavior when the dimension of the parameter space is allowed to grow with T. They additionally establish connections to the information theory literature, demonstrating that the actual regret for logistic regression depends on the richness of the parameter class, where even within this problem, richer classes lead to greater regret.

## 1 Introduction

Logistic regression plays a significant role in many learning applications, where a set of parameters representing the effects of different features on the outcome (label) is learned from a training data set with known labels. The learned parameters are then used to predict the true labels of as yet unseen data examples. Examples include predicting the probability that a person carries some disease based on features that are, e.g., hereditary or environmental, or predicting the click-through rate of ads shown in online advertising. Many applications need to operate in the online learning (or online convex optimization) setting. In this setting, an algorithm consumes the data in rounds. At round $t$, predictions can be based on all examples seen up to round $t-1$, including their true labels (but not on data beyond round $t-1$), to predict the label of the example at round $t$.

The performance of an online algorithm is measured by its regret, which is defined as the extra loss it incurs beyond that of an algorithm that plays, at all rounds, some comparator value $\theta^* \in \Theta$, where $\Theta$ is a predefined space of possible values. The values of the parameters $\theta^*$ can be those that minimize the cumulative loss over all rounds up to the horizon $T$. While regret is defined for the online setting, it is directly connected to the convergence rate, which measures an expected loss on an unseen example at round $T+1$, based on training on the first $T$ examples.

Paper Outline: In Section 2, we outline our contributions. We present a summary of related work in Section 3. Section 4 formulates the problem. In Section 5, we frame and extend results from the literature, setting them up to prove our results. Section 6 describes regret lower bounds for any algorithm. Section 7 shows upper bounds that can be achieved with Bayesian mixture algorithms and that apply to logistic regression when the feature vector is observed prior to predicting a label.

## 2 Summary of Contributions and Methods

We consider several settings with a -dimensional parameter space with some limit on some norm of the parameters. Specifically, , for . Define as the count of units constituting . We focus on the case in which the norm of the example (or, feature value vector) at is bounded in , i.e., , (or ). (This setup generalizes the practical setup with binary features). However, with proper adjustments (which decrease the bounds), the results transform also to the more restrictive .

Our contributions include:

• Comprehensive characterization of the regret for the logistic regression problem, including the asymptotic behavior in the dimensionality , showing regret bounds logarithmic in and linear in for lower regions of .

• Novel bounds that lead to this characterization, especially, lower bounds showing limitations on regret in the different settings.

• Specific negative results that demonstrate that in cases such as constraints, for , we are guaranteed regret rates of at least .

• Specific positive results that demonstrate that for upper regions of , there exist algorithms with regret rates (for constraints with ), as well as regret rates that are linear in , and no longer logarithmic in (for constraints with ).

• Adaptation of a principled methodology from the information theory literature, that allows derivation of lower bounds for this and related problems.

The sub-linear regret in $d$ for $L_1$ is very interesting for logistic regression because the dot product, used for prediction, is a linear combination of the parameters, making $L_1$ constraints very realistic, especially in sparse real-world problems that have binary feature vectors (see, e.g., [mcmahan13]).

Our results characterize the behavior of the regret for the various regions of . For smaller , we show lower bounds of , , and for the cases where and is , respectively, and upper bounds of , , and for the respective norm constraints. (An additional term in the denominator of the logarithm in all bounds applies to the setting in which .) The difference between the constraints on different norms illustrates that regret is a function of the richness of . The richer is (e.g., constraints are richer than , which are richer than ) the greater is the regret. As the dimension is allowed to grow, the bounds change when the denominator of the logarithmic term above equals the numerator. They lead to different regions of the different bounds, with different behavior in each region. Table 1 shows the different lower and upper bounds for different and different norm constraints on the space , and summarizes the results in Theorems 10-7. For simplicity of the table, we omitted the lower limit on for each row, but it should be understood as where the upper limit of the previous row is (for the first row in each block, the previous ). The lower bounds that are should be understood as for some small which can be as small as ). This is again omitted for simplicity. For the setting in which , the additional term in the denominators of the logarithm leads to earlier transitions between regions of , for all cases (as well as adding a upper regret region for ). Table 2 compares results in this paper to previously reported results (described in more detail in Section 3). We omit middle ranges of that are in Table 1. Some results in this paper are extended from [kakade05] and adapted to the setup . A footnote marks these with a proper explanation. For multi labels, we use to denote the -dimensional projection of the parameter space for label . 
Results for and (that were not directly derived) are generalized from results that were derived for and are described in the “previous results” column.

To prove lower bounds, we adapt techniques based on the redundancy-capacity theorem (see, e.g., [davisson73, merhav95, shamir06]) from the information theory literature. Specifically, we set a grid of points in the parameter space that are distinguishable by the observed label sequence for some example sequence. The logarithm of the cardinality of the grid is a lower bound on the regret. The concept of distinguishability was used somewhat differently by [hazan14] to prove regret lower bounds. Upper bounds for and , but not for , can be derived by manipulating the Bayesian mixture approach in [kakade05] (adjusted to our setup). Using a normal prior with large variance can attain the proper rates, with the respective constants. However, for , we combine this approach with the method of grids, applying a discrete uniform prior on some . Applying the method in [kakade05], initially dominates an upper bound, with additional contribution from the effective quantization of the parameters by the mixture only on a discrete subset of the space. This method can also be used for and , and achieves a similar bound for , but a weaker one for .

## 3 Related Work

Prior results in both the machine learning literature (see, e.g., [azoury01, cesa02, littlestone89]) and the information theory literature (see, e.g., [krichevsky81, merhav95, rissanen84]) illustrate that the regret (or redundancy in information theory) of the online setting, normalized by the number of rounds, meets batch results of convergence rate at least to first order. Hence, studying regret in the online setting also has implications for the generalization ability of an algorithm. The setup of a logistic regression problem, whether online (studying regret) or batch (studying convergence rate), is very similar to the setups of the universal compression problems studied in the information theory literature. In these problems, the redundancy of algorithms that predict multi label outcomes in a setup that is equivalent to single dimensional logistic regression with binary features was studied. It was shown (see, e.g., the seminal work in [rissanen84], subsequent work in [ds04, orlitsky04, shamir06, spa12], and references therein) that for these problems, regret of is achievable to first order, where is the number of labels. However, the concepts presented by [rissanen84] should apply also to more general dimensional problems, where is the number of parameters that affect the label outcome. Specifically, in [rissanen84], central limit arguments, which are also satisfied in the logistic regression setting, were used to prove redundancy bounds, when . The subsequent results in [ds04, orlitsky04, shamir06, shamir06a], however, extended the redundancy results to , even when (for some small fixed ) but were more specific to the equivalent of single dimensional logistic regression with multi labels. The machine learning literature considered general online convex optimization, and derived minimax-optimal algorithms for both the linear and strongly convex settings (see, e.g., [abernethy08]), with logarithmic regret in the strongly convex setting.
For weakly convex settings (which generally includes logistic regression), regret rates of have been shown to be achievable (see, e.g., [zinkevich03], and references therein), where is the radius of the ball defining the allowed values of the parameter , played at round , and the space of values of a possible comparator .

To the best of our knowledge, [kakade05] presented the first result suggesting that regret of is achievable for logistic regression, and in fact for other generalized linear models. Instead of using gradient methods (typically used for this problem), in which the training algorithm updates the learned parameters by taking a step against the gradient of the loss, the method borrowed Bayesian Model Averaging (or Bayesian mixture) from the Bayesian literature to show a regret upper bound (but not a lower bound) that achieves this rate. However, the algorithm pays an additional penalty that depends on the prior selection as well as on the squared norm of a comparator (which can be the loss minimizing parameter in hindsight). If is larger than the term, this penalty term could dominate the bound (depending on the selected prior). The proof of the bound utilized variational techniques, and also, in part, resembled some of the central limit arguments used in [rissanen84] to show upper bounds on redundancy. The use of Bayesian methods is also justified in the information theory literature (see, e.g., [davisson73, krichevsky81, rissanen84]). Specifically, [merhav95] showed that a mixture code is as good as the best code in terms of regret (and thus can be better but not worse than any other type of code).

[mcmahan12] demonstrated that with binary feature values, using the Follow-The-Regularized-Leader (FTRL) methodology (see, e.g., [hazan12, mcmahan11, rakhlin05, shalev07] and references therein) with a Beta regularizer, regret can be achieved for the single dimensional problem. In the special case of a Beta regularizer with , their FTRL algorithm coincides with the well-known (add-$1/2$) [krichevsky81] (KT) estimator that, in fact, achieves the lower bound on the regret for this problem of . It is interesting to note, however, that the KT method is derived using a Bayesian mixture with the Dirichlet-$1/2$ prior. Thus, for the single dimensional case, both the FTRL methodology and the Bayesian mixture one result in the same estimator. Unfortunately, this result does not generalize to larger dimensions.
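The single dimensional case can be illustrated with a minimal sketch of the KT (add-1/2) estimator on binary sequences. The function names and test sequences below are illustrative assumptions, not taken from the paper; the check uses the classical guarantee that the KT regret against the best fixed Bernoulli parameter in hindsight is at most (1/2) log T + log 2 in nats.

```python
import math

def kt_sequence_log_prob(bits):
    """Log-probability that the KT (add-1/2) estimator assigns to a binary sequence."""
    n0 = n1 = 0
    log_p = 0.0
    for b in bits:
        # Sequential KT prediction of a 1: (count of ones + 1/2) / (rounds so far + 1).
        p_one = (n1 + 0.5) / (n0 + n1 + 1.0)
        log_p += math.log(p_one if b == 1 else 1.0 - p_one)
        if b == 1:
            n1 += 1
        else:
            n0 += 1
    return log_p

def best_fixed_log_prob(bits):
    """Log-probability under the best fixed Bernoulli parameter in hindsight."""
    n = len(bits)
    n1 = sum(bits)
    n0 = n - n1
    log_p = 0.0
    if n1:
        log_p += n1 * math.log(n1 / n)
    if n0:
        log_p += n0 * math.log(n0 / n)
    return log_p

def kt_regret(bits):
    """Regret (in nats) of KT relative to the best fixed parameter."""
    return best_fixed_log_prob(bits) - kt_sequence_log_prob(bits)

T = 1000
all_ones = [1] * T
alternating = [t % 2 for t in range(T)]
regrets = {"all_ones": kt_regret(all_ones), "alternating": kt_regret(alternating)}
```

Both toy sequences stay within the (1/2) log T + log 2 envelope, which is the logarithmic behavior the paragraph refers to for one dimension.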

While the lower bound can be achieved for the single dimensional case for binary features with an FTRL gradient method, [mcmahan12] posed the problem of what happens in larger dimensions. The results in [kakade05] hint in the direction of Bayesian methods, but still fall short of achieving regret due to the additional penalty on the prior. (Although, as we demonstrate, these results with a proper, perhaps unexpected, choice of prior could lead to the desired rates and constants in some cases, but, to the best of our knowledge, such a result was not reported in the literature.) A series of papers [bach10, bach13, bach14] studied the convergence rate of gradient methods for logistic regression, and concluded that while logistic loss is not globally strongly convex, it can, depending on the actual data, locally exhibit strong convexity (referred to as the self-concordance property). Then, gradient methods can achieve convergence rate of , where is the smallest eigenvalue of the Hessian at the global optimum. This implies that gradient methods can, in many cases, achieve logarithmic regret, but there do exist situations where gradient methods fail to achieve regret (when is small).

[hazan14] studied the problem in which the feature values are unknown when playing at round , and are only revealed later, together with the label, so that Bayesian methods cannot be applied directly. Bayesian methods do condition the predicted label probability on the observed feature values, and if such are not available, they would require also mixing on the feature values. It was shown that in this setting, which is more difficult for the algorithm, regret of is achieved for the single dimensional problem where only is possible, and is only possible for any larger dimensions even for .

[foster18] separated the problem posed in [mcmahan12] into the case considered by [hazan14], where the algorithm plays with no knowledge of the feature values , which is referred to as a proper setting, and the mixable setting where the feature values are revealed to the algorithm prior to generating a prediction, referred to as the improper setting. Using Bayesian model averaging with a uniform prior with an approach that resembles that in [kakade05], an upper bound of was shown for the multi label -dimensional (with distinct features) logistic regression problem, where is the number of distinct labels, under constraints on . A lower bound of was shown for the binary labels / binary features setting under the constraints that . The upper bound matches the logarithmic order of the bound expected from the information theory problems, but not the constant, and the lower bound is lower in order.

The results summarized above suggest that there are, in fact, two different sets of online logistic problems considered. In the first, the features are revealed prior to playing or to generating a prediction, and in the second, is played before the feature values are revealed. The first problem allows the use of Bayesian methods, while the second will require such methods to also mix over the unseen . For the first problem, logarithmic regret is possible for low dimensionalities, whereas for the second extreme case, it is not in many settings, even in the single dimensional problem. In this paper, we give a comprehensive characterization of the regret behavior for the first problem, including the asymptotic regime, where is allowed to grow with . The lower bounds we derive apply to any case, including the second problem, but the upper bounds are specific to the first one.

## 4 Problem Formulation, Notation and Definitions

We consider online convex optimization over a series of rounds as in [mcmahan14] (see also [boyd04, rockafellar97, shalev12]). Each round , a -dimensional example feature vector and a label are observed. For the binary labels, we use . We assume, without loss of generality, that , as features can be normalized. We denote a subsequence up to time by . For the example/label pair, we also use . Capital letters denote random variables. A learning algorithm is a function that, given a sequence , an example , and an arbitrary label , returns at round a probability for the label

$$A(S^{t-1}, x_t, y) \triangleq P\left[Y_t = y \mid X_t = x_t, S^{t-1}\right]. \tag{1}$$

To produce a prediction, an algorithm may play a weight vector , or perform a Bayesian mixture over . For a given model , the probability of a label for example is given by . The loss at for model is , where it will sometimes be convenient to use the dot product . Similarly, the loss of at is . The total loss for model on sequence is . Similarly, .

The regret of for a given example/label pair sequence relative to a comparator model , where constrains the norm of , is defined as

$$\mathrm{Regret}(A, S^T, \theta^*) \triangleq L(A, S^T) - L(\theta^*, S^T). \tag{2}$$

We limit the comparator such that , and consider the different cases where . It is reasonable to assume that for some . Bayesian mixture algorithms could have support in where . The regret of relative to the best comparator is given by

$$\mathrm{Regret}(A, S^T) \triangleq \sup_{\theta^* \in \Theta} \mathrm{Regret}(A, S^T, \theta^*). \tag{3}$$
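Definitions (2) and (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's code: labels are taken in {-1, +1}, the model uses the standard logistic link p(y | x, θ) = 1 / (1 + exp(-y θ·x)), and the function names and toy data are hypothetical.

```python
import numpy as np

def logistic_prob(theta, x, y):
    """Probability of label y in {-1, +1} under the (assumed) logistic link."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(theta, x)))

def comparator_loss(theta, xs, ys):
    """Cumulative log loss L(theta, S^T) of a fixed model theta."""
    return -sum(np.log(logistic_prob(theta, x, y)) for x, y in zip(xs, ys))

def algorithm_loss(probs):
    """Cumulative log loss L(A, S^T), given the probabilities A gave the realized labels."""
    return -float(np.sum(np.log(probs)))

def regret(probs, theta_star, xs, ys):
    """Regret of the algorithm relative to a fixed comparator theta_star, as in (2)."""
    return algorithm_loss(probs) - comparator_loss(theta_star, xs, ys)

# Toy data: features bounded in [-1, 1], arbitrary binary labels.
rng = np.random.default_rng(0)
T, d = 50, 3
xs = rng.uniform(-1.0, 1.0, size=(T, d))
ys = rng.choice([-1, 1], size=T)

# An algorithm that always predicts 1/2 matches the comparator theta = 0 exactly.
uniform_probs = np.full(T, 0.5)
zero_regret = regret(uniform_probs, np.zeros(d), xs, ys)
```

The supremum in (3) would be taken over all comparators in the constrained set; the sketch evaluates a single comparator only.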

A mixture algorithm that may rely on the values of in its predictions of , predicts

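$$p(y^t \mid x^t) \triangleq \int_{\theta \in \Theta_m} p(y^t \mid x^t, \theta) \cdot p_0(\theta)\, d\theta = \int_{\theta \in \Theta_m} \prod_{\tau=1}^{t} p(y_\tau \mid x_\tau, \theta) \cdot p_0(\theta)\, d\theta \tag{4}$$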

where is some initial prior on the distribution of the parameter vector , and is the support of the mixture, which may be different from . The probability (4) assigned to can also be expressed as a set of equations that sequentially update a posterior distribution over at round from the prior at , which is the posterior at , i.e.,

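$$p(\theta \mid S^t) = \frac{\prod_{\tau=1}^{t} p(y_\tau \mid x_\tau, \theta) \cdot p_0(\theta)}{\int_{\theta} \prod_{\tau=1}^{t} p(y_\tau \mid x_\tau, \theta) \cdot p_0(\theta)\, d\theta} \triangleq \frac{p(\theta, y^t \mid x^t)}{p(y^t \mid x^t)}. \tag{5}$$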

The prediction of is then given by

$$p(y_t \mid x_t, S^{t-1}) = \int_{\theta} p(y_t \mid x_t, \theta) \cdot p(\theta \mid S^{t-1})\, d\theta. \tag{6}$$

As seen in (6), the prediction is also conditioned on the feature values (example) vector . The prior distribution is shown to be continuous in (4)-(6). However, can be set to be a discrete set, and then (4) can be rewritten as

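$$p(y^t \mid x^t) \triangleq \sum_{\theta \in \Theta_m} p(y^t \mid x^t, \theta) \cdot p_0(\theta) = \sum_{\theta \in \Theta_m} \prod_{\tau=1}^{t} p(y_\tau \mid x_\tau, \theta) \cdot p_0(\theta). \tag{7}$$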
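A minimal Python sketch of the discrete mixture follows. It updates the posterior sequentially as in (5), predicts with the discrete analogue of (6), and checks that the product of the per-round predictions equals the sequence probability in (7). The logistic link with labels in {-1, +1}, the 3×3 grid, and the function names are assumptions made for illustration.

```python
import numpy as np

def logistic_prob(theta, x, y):
    """Probability of label y in {-1, +1} under the (assumed) logistic link."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(theta, x)))

def mixture_predictions(grid, prior, xs, ys):
    """Per-round probabilities of the realized labels and the final posterior."""
    posterior = np.array(prior, dtype=float)
    probs = []
    for x, y in zip(xs, ys):
        lik = np.array([logistic_prob(th, x, y) for th in grid])
        p_t = float(np.dot(lik, posterior))   # discrete form of (6)
        probs.append(p_t)
        posterior = lik * posterior / p_t     # posterior update, as in (5)
    return np.array(probs), posterior

rng = np.random.default_rng(1)
T, d = 30, 2
xs = rng.uniform(-1.0, 1.0, size=(T, d))
ys = rng.choice([-1, 1], size=T)

# Hypothetical grid of parameter vectors with a uniform discrete prior.
grid = np.array([[a, b] for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)])
prior = np.full(len(grid), 1.0 / len(grid))

probs, posterior = mixture_predictions(grid, prior, xs, ys)

# Sequence probability computed directly from (7).
seq_lik = np.array([
    np.prod([logistic_prob(th, x, y) for x, y in zip(xs, ys)]) for th in grid
])
p_sequence = float(np.dot(seq_lik, prior))
```

The agreement between `np.prod(probs)` and `p_sequence` is the chain rule that makes the sequential and batch forms of the mixture equivalent.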

## 5 Useful Methods

### 5.1 Lower Bounds on Regret - The Redundancy Capacity Theorem

A lower bound on regret is meaningful only when stated in terms of existence of a sequence for every possible algorithm, for which the regret is at least the lower bound. [davisson73] formulated such a notion for universal compression redundancy as the redundancy-capacity theorem which showed that the redundancy (or regret) can be lower bounded by the mutual information between the parameter and the observed data sequence, induced by the prior on of a mixture model. A specific interesting case is when the prior is uniform on a discrete subset of the parameter space, and the elements in are distinguishable by the observation , i.e., observing is sufficient to determine which generated with error probability as . This case leads to a weaker lower bound than the bounds described in [davisson73] and subsequent works, but is sufficient for showing redundancy bounds in many cases (see, e.g., [merhav95, shamir06]), and also for regret bounds for our problem. We frame this result to regret, and prove it by mirroring the part of the derivation in [davisson73] that is sufficient for the result we need, but described in terms that apply to the regret problem. We next state the theorem, which is proved in Appendix A.

Theorem 5.1 (Distinguishable Grid Regret, adapted from [davisson73]): Let . Let be a set of distinct values of . Draw with a uniform prior, and generate from the distribution determined by . Let be some estimator of from the observed . Then, if , the regret of any algorithm for the worst sequence is lower bounded by

$$\sup_{S^T} \mathrm{Regret}(A, S^T) \ge (1 + o(1)) \log M. \tag{8}$$

Similarly, for a fixed , if we draw instead of and the conditions above hold, (8) also holds.

### 5.2 Variational Approach for Upper Bounds

Upper bounds can be obtained by showing Bayesian methods that can achieve low regret and bounding their regret. For simplicity, one can select priors with a diagonal covariance. [kakade05] selected a normal prior, whereas [foster18] used a uniform one. We follow [kakade05] and manipulate their approach to obtain tighter bounds for and , and then use the method of grids with a uniform discrete prior combined with their method to derive an bound. We first describe their approach. Define a distribution on where , and , where if and is , otherwise, i.e., diagonal covariance matrix, where is an upper bound on elements of the diagonal. (Note that can be a subset of or , but does not have to, and in fact, is not for the normal prior). Let be the KL-divergence between and . Then the following theorem holds.

Theorem 5.2 [kakade05]: The regret of a Bayesian algorithm with prior for sequence and comparator is upper bounded by

$$\mathrm{Regret}(A^*, S^T, \theta^*) \le D(Q \,\|\, p_0) + \frac{dT}{8}\,\eta_q^2 \tag{9}$$

The proof of Theorem 5.2 is in [kakade05], but needs to be modified a bit because they restricted , while we assume (). We rely on their proof, except where it needs to be modified. The proof is in Appendix B.
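To see how the two terms of (9) trade off, the sketch below evaluates the bound for one specific, assumed choice: an isotropic Gaussian Q centered at the comparator with per-coordinate variance `eta2`, and a zero-mean isotropic Gaussian prior with variance `sigma2`. Under these assumptions the KL term has a closed form, and minimizing over `eta2` gives 1/eta2 = 1/sigma2 + T/4 and a bound of (d/2) log(1 + T·sigma2/4) + ‖θ*‖²/(2·sigma2). The variable names and numbers are hypothetical.

```python
import numpy as np

def kl_gaussians(theta_star, eta2, sigma2):
    """KL( N(theta_star, eta2*I) || N(0, sigma2*I) ) for isotropic Gaussians."""
    d = len(theta_star)
    return (0.5 * d * (np.log(sigma2 / eta2) + eta2 / sigma2 - 1.0)
            + np.dot(theta_star, theta_star) / (2.0 * sigma2))

def bound(theta_star, eta2, sigma2, T):
    """Right-hand side of (9) for the assumed Gaussian Q and Gaussian prior."""
    d = len(theta_star)
    return kl_gaussians(theta_star, eta2, sigma2) + d * T * eta2 / 8.0

theta_star = np.array([0.5, -1.0, 0.25])   # hypothetical comparator
T, sigma2 = 10_000, 4.0                    # hypothetical horizon and prior variance
d = len(theta_star)

# Closed-form minimizer over eta2 and the resulting bound.
eta2_opt = 1.0 / (1.0 / sigma2 + T / 4.0)
closed_form = (0.5 * d * np.log(1.0 + T * sigma2 / 4.0)
               + np.dot(theta_star, theta_star) / (2.0 * sigma2))

# Numeric check by grid search over eta2.
eta2_grid = np.logspace(-8, 1, 2001)
numeric_min = min(bound(theta_star, e, sigma2, T) for e in eta2_grid)
```

The first term of the closed form is the (d/2) log T behavior, and the second is the prior-dependent penalty on the squared norm of the comparator discussed in Section 3.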

## 6 Regret Lower Bounds

We now use Theorem 5.1 to derive lower bounds for the binary label case. A lower bound for the multi label case is given in Appendix E. We first define

 (10)

as the effective count of units in (where if is very large, we will only consider a clipped portion of for to ensure distinguishability. This will guarantee that ). The lower bounds are stated in the following theorem. Fix an arbitrary , let . Then, for every algorithm there exists a sequence , for which the regret is lower bounded by

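$$
\mathrm{Regret}(A, S^T) \ge
\begin{cases}
(1-o(1))\,\dfrac{d}{2}\log\dfrac{T}{d}; & \text{for } d = O(1),\\[2mm]
(1-o(1))\,\dfrac{d}{2}\log\dfrac{4\gamma T}{d}; & \text{for } \|\theta^*\|_\infty \le B \text{ and } d < \dfrac{4}{e}\gamma T^{1-\varepsilon},\\[2mm]
(1-o(1))\,\dfrac{2}{e}\gamma T^{1-\varepsilon}; & \text{for } \|\theta^*\|_\infty \le B \text{ and } d \ge \dfrac{4}{e}\gamma T^{1-\varepsilon},\\[2mm]
(1-o(1))\,\dfrac{d}{2}\log\dfrac{2\pi e \gamma T}{d^2}; & \text{for } \|\theta^*\|_2 \le B \text{ and } d < \sqrt{\dfrac{2\pi}{e}\gamma T^{1-\varepsilon}},\\[2mm]
(1-o(1))\,\sqrt{\dfrac{2\pi}{e}\gamma T^{1-\varepsilon}}; & \text{for } \|\theta^*\|_2 \le B \text{ and } d \ge \sqrt{\dfrac{2\pi}{e}\gamma T^{1-\varepsilon}},\\[2mm]
(1-o(1))\,\dfrac{d}{2}\log\dfrac{4e^2\gamma T}{d^3}; & \text{for } \|\theta^*\|_1 \le B \text{ and } d < \left(\dfrac{4\gamma T^{1-\varepsilon}}{e}\right)^{1/3},\\[2mm]
(1-o(1))\,\dfrac{3}{2}\left(\dfrac{4\gamma T^{1-\varepsilon}}{e}\right)^{1/3}; & \text{for } \|\theta^*\|_1 \le B \text{ and } d \ge \left(\dfrac{4\gamma T^{1-\varepsilon}}{e}\right)^{1/3}.
\end{cases} \tag{11}
$$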

Theorem 10 shows that for small each feature/dimension contributes to the worst case regret. Generally, for each parameter costs . For there is a reduction inside the logarithm by a factor of , and an additional similar reduction is observed between and . These relations are expected, because the relations between the bounds reflect the logarithm of the ratio between the respective volumes of the parameter spaces, dictated by the constraints. The greater the volume, the harder the algorithm has to work to match the best comparator, and the larger the regret penalty it pays. This is similar to observations in the information theory literature, which tie the redundancy to the richness of the class. The dependence on is through . Each interval of consists of roughly distinguishable parameters. Hence, the ratio between and dictates how many parameter regions are in an interval of diameter . Thus this ratio, represented by , dominates the effect of , which is normally, in practice, negligible relative to the effects of and . (In practice, we would normally limit to some reasonable range, which is usually . However, theoretically, can be larger, in which case it does influence the bound.) If is too large, the effective in (10) guarantees that , and if , it guarantees this with respect to the largest value used for the bound.

An interesting behavior is observed for all cases , and . When we reach , and , respectively, a threshold phenomenon happens, and the bound becomes constant for every greater . It is not clear how much of this is a result of the bounding techniques and how much is real. However, as we see in the upper bounds for in the following section, there exist algorithms that achieve regret for constraints. We also observe a decrease in rate in the upper bounds for at . Together, these results imply, that there are, in fact, cases in which the regret does not grow linearly with for large enough . The bounds demonstrate that there are situations in which we are guaranteed regret rates of . In fact, the regret could even be linear with up to .

To prove the first region of Theorem 10, we partition into separate equal length segments, where in each segment only one component of is and the rest are . This transforms the problem to a standard universal compression problem in different segments, in each a single parameter is to be estimated. In each segment, we now have a grid of points, which are spaced (in the space of label probability they induce) at a distance from one another. The total grid is the power set of the individual grids over the segments. Large deviation typical sets analysis (see [cover06]) with the union bound over the segments is used to show that each of these points is distinguishable from the others. Finally, applying Theorem 5.1 with a fixed gives the lower bound. For diminishing large deviation exponent to dominate over the union bound, we need to use . For larger values of , we use a grid that varies only in the first components of , and apply the resulting bound.
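The counting behind the first region can be sketched as follows. The per-segment grid size is not recoverable from the text above, so the sketch assumes roughly sqrt(T/d) distinguishable points per segment of length T/d, which is the size for which the logarithm of the product grid equals the (d/2) log(T/d) rate of the first case of (11). The function name and numbers are hypothetical.

```python
import math

def first_region_log_grid_size(T, d):
    """log M for a product grid over d segments, each of length T/d."""
    segment_len = T / d
    # Assumption: about sqrt(segment length) distinguishable points per segment.
    points_per_segment = math.sqrt(segment_len)
    # The product grid has points_per_segment ** d elements.
    return d * math.log(points_per_segment)

T, d = 1_000_000, 10
log_M = first_region_log_grid_size(T, d)
rate = 0.5 * d * math.log(T / d)
```

By Theorem 5.1, the regret is lower bounded by roughly log M when the grid points are distinguishable, which is how the grid cardinality turns into a regret rate.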

For the remaining regions, we fix the first component of the parameter at the maximal point . Then, would scale it by factors of , where is an integer, taking all values from to . This will induce priors, that make distinct distinguishable regions of a second nonzero component of that occurs in the same examples as . We partition each of now segments, where in each of these segments a different component is for , while the remaining ones are , to subsegments corresponding to the different values of . We show that the points on the grid, now constructed as a power set of grids of points, are, again, distinguishable with the fixed . Using Theorem 5.1, the logarithm of the cardinality of the power grid lower bounds the regret. However, for and , only the components of which satisfy the constraints are included in the grid. This reduces the bounds, and leads to a threshold point, in which the lower bounds become useless for the value of , since the remaining space no longer contains parameters for which all components of are nonzero. We can thus use the value of , which is lower, but achieves the largest bound. This leads to regions 3, 5, and 7 in the bound. The proof of Theorem 10 is presented in Appendix C.
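The threshold points in (11) can be checked numerically. Each logarithmic region has the shape (d/2) log(c / d^k), with k = 1, 2, 3 for the L_∞, L_2, and L_1 constraints, and this expression is maximized over d at d* = (c / e^k)^{1/k}, where it equals (k/2) d*. The sketch below assumes that T inside the logarithm is read as T^{1-ε} (represented by the hypothetical value `T_eff`) and uses a hypothetical `gamma`.

```python
import numpy as np

def lower_bound_expr(d, c, k):
    """Generic shape (d/2) * log(c / d**k) of the logarithmic regions."""
    return 0.5 * d * np.log(c / d**k)

def maximizer(c, k):
    """Closed-form maximizer of (d/2) * log(c / d**k) over real d > 0."""
    return (c / np.e**k) ** (1.0 / k)

T_eff = 1.0e6   # hypothetical stand-in for T^(1 - eps)
gamma = 1.0     # hypothetical value of the constant gamma

cases = {
    "Linf": (4.0 * gamma * T_eff, 1),
    "L2": (2.0 * np.pi * np.e * gamma * T_eff, 2),
    "L1": (4.0 * np.e**2 * gamma * T_eff, 3),
}

results = {}
for name, (c, k) in cases.items():
    d_star = maximizer(c, k)
    results[name] = (d_star, lower_bound_expr(d_star, c, k))
```

Beyond d*, the logarithmic expression decreases, so using the bound obtained at the smaller dimension d* gives the constant regions 3, 5, and 7.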

## 7 Regret Upper Bounds

Theorem 5.2 allows us to prove the following two theorems: There exist Bayesian algorithms that for every sequence and comparator , achieve regret

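$$
\mathrm{Regret}(A^*, S^T, \theta^*) \le
\begin{cases}
(1+o(1)) \cdot \dfrac{d}{2}\log\left(\dfrac{B^2 T e}{4} + e\right); & \text{for } \|\theta^*\|_\infty \le B,\\[2mm]
(1+o(1)) \cdot \dfrac{d}{2}\log\left(\dfrac{B^2 T e}{4d} + e\right); & \text{for } \|\theta^*\|_2 \le B,\\[2mm]
(1+o(1)) \cdot \dfrac{d}{2}\log\left(\dfrac{B^2 T e^3}{4d^2}\right); & \text{for } \|\theta^*\|_1 \le B \text{ and } d = o(B\sqrt{T}),\\[2mm]
(1+o(1)) \cdot \left(\dfrac{d}{2}\log(4e) + \dfrac{B\sqrt{T}}{2}\right); & \text{for } \|\theta^*\|_1 \le B \text{ and } d = \Theta(B\sqrt{T}),\\[2mm]
(1+o(1)) \cdot \left(\dfrac{d}{2} + \sqrt{2dB\sqrt{T}}\right); & \text{for } \|\theta^*\|_1 \le B \text{ and } d = \Omega(B\sqrt{T})
\end{cases} \tag{12}
$$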

Let . Then, there exists a Bayesian algorithm that for every sequence and comparator with , achieves regret

$$\mathrm{Regret}(A^*, S^T, \theta^*) = O\left(T^{1/5} d^{3/5} B^{2/5}\right) = o(d). \tag{13}$$

Theorem 7 shows that regret logarithmic with and linear with is achievable in all three cases. The bounds asymptotically differ only by a factor of inside the logarithm. The wider allowable range of gives an upper bound where in the logarithmic term is not normalized by . The smaller comparator region, where norms are restricted by , reduces the logarithmic cost from to . A similar reduction is achieved from to . Both and have interesting threshold behavior, which matches the behavior with the lower bounds. For , as long as , we observe regret linear in and logarithmic in . For larger dimensions, though, we observe only linear behavior in , without the logarithmic terms. Furthermore, if we tighten the bounds further, Theorem 7 shows that even sub-linear behavior in is possible in this case. For , the transition from to occurs at .

The bounds in Theorem 7 are derived for our setting of . [kakade05] and others considered the setting where . In this setting, all the bounds in (12) will have an additional factor of in the denominator of the logarithmic term. This means that the transition from to rate (and for to ) will now occur at , and for , , and , respectively. (Such a transition will now also happen for .)

Unlike Theorem 10, is present in the upper bounds instead of . For distinguishability on an individual feature, each region of in a single dimension consists of only distinguishable points, and not . However, when dimensions are mixed through the dot product, it is hard to disentangle the dimensions. This leads to the difference between the lower and the upper bounds.

The proof of Theorems 7 and 7 is based on Theorem 5.2. For and , we use a Gaussian prior with a Gaussian . We derive the bounds on the terms of (9), find the value of the parameter that gives the smallest bound and apply it. We then find the variance of the prior that gives the tightest bound, and apply it. For , we construct a grid whose points are assigned a uniform probability. We construct as a Bernoulli distribution in each dimension, giving nonzero probability only to the surrounding neighbors of the th component of , but ensuring that is the expectation of . Then, in the same manner, we upper bound the terms of (9), and optimize the free parameter for the tightest bound. Finally, since a threshold occurs, where the bound becomes useless, we use a union bound on lower dimensions of the parameter space. We find the dimension that gives the maximal element of the sum over all dimensions, and use it to upper bound all dimensions. Applying this method more tightly gives some tedious algebra, but yields a bound of for the upper region of the constraint problem. The proof of both theorems is presented in Appendix D.

## 8 Conclusions

We studied logistic regression regret, and derived lower and upper bounds for settings constrained by the norm of a comparator. We presented a comprehensive characterization of the regret for the different settings, including the asymptotic behavior in the dimensionality. Adapting a methodology from the universal compression literature, we derived lower bounds on the regret, showing initial logarithmic in , linear in regret, with rates whose growth slows with larger dimensions of the feature space. Matching upper bounds confirm the general behavior of the lower bounds. Specifically, we demonstrated that under constraints, for large enough , regret becomes sub-linear in , and for constraints, it drops from linear in and logarithmic in to just linear in . On the negative side, under constraints, regrets of are guaranteed for .

## Appendix A Proof of Theorem 5.1

The following lemma is needed to prove Theorem 5.1. Let be a parameter in the support of a Bayesian algorithm that predicts as described in (4)-(6) or in (7). Then,

$$\mathrm{Regret}(A^*, S^T, \theta^*) = \log p_{A^*}(\theta^* \mid S^T) - \log p_0(\theta^*). \tag{14}$$

By definition

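$$
\begin{aligned}
\mathrm{Regret}(A^*, S^T, \theta^*) &= \log p(y^T \mid x^T, \theta^*) - \log \int_{\theta \in \Theta_m} p(y^T \mid x^T, \theta)\, p_0(\theta)\, d\theta \\
&= \log \frac{p(y^T \mid x^T, \theta^*)\, p_0(\theta^*)}{\int_{\theta \in \Theta_m} p(y^T \mid x^T, \theta)\, p_0(\theta)\, d\theta} - \log p_0(\theta^*) \\
&= \log p_{A^*}(\theta^* \mid S^T) - \log p_0(\theta^*)
\end{aligned} \tag{15}
$$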

where the equalities are obtained by multiplication and division by , which by the conditions of the lemma is greater than , and by identifying the posterior derived from .
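The identity of the lemma can be checked numerically for a discrete mixture. The sketch below is illustrative and relies on assumptions not stated in the text: labels in {-1, +1}, the standard logistic link, a four-point grid, and a uniform prior. It computes the regret of the mixture relative to one grid point directly from the losses, and compares it with the log posterior minus the log prior of that point.

```python
import numpy as np

def logistic_prob(theta, x, y):
    """Probability of label y in {-1, +1} under the (assumed) logistic link."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(theta, x)))

rng = np.random.default_rng(2)
T, d = 40, 2
xs = rng.uniform(-1.0, 1.0, size=(T, d))
ys = rng.choice([-1, 1], size=T)

# Hypothetical support of the mixture with a uniform prior.
grid = np.array([[a, b] for a in (-1.0, 1.0) for b in (-1.0, 1.0)])
prior = np.full(len(grid), 1.0 / len(grid))

# Log-likelihood of the whole label sequence under each grid point.
log_lik = np.array([
    sum(np.log(logistic_prob(th, x, y)) for x, y in zip(xs, ys)) for th in grid
])

# Log of the mixture probability of the sequence, as in (7).
log_mix = np.log(np.sum(np.exp(log_lik) * prior))

# Log posterior of each grid point after observing the sequence, as in (5).
log_posterior = log_lik + np.log(prior) - log_mix

k = 0                                              # index of the comparator
regret_k = (-log_mix) - (-log_lik[k])              # L(A*, S^T) - L(theta*, S^T)
lemma_rhs = log_posterior[k] - np.log(prior[k])    # right-hand side of (14)
```

Since the posterior of any point is at most 1, the identity also shows that the regret of a uniform discrete mixture against a point in its support never exceeds the logarithm of the grid size.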

Proof of Theorem 5.1: Let be a Bayesian algorithm as defined in (7) on a discrete . Let now be uniform with , and (where can be a function of ). Now,

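$$
\begin{aligned}
\sup_{S^T} \mathrm{Regret}(A, S^T) &= \sup_{S^T} \sup_{\theta^* \in \Theta} \mathrm{Regret}(A, S^T, \theta^*) \overset{(a)}{\ge} \sup_{\theta^* \in \Theta_m} \sup_{S^T} \left[ L(A, S^T) - L(\theta^*, S^T) \right] \\
&\overset{(b)}{\ge} \\
&\overset{(c)}{\ge} E \log p(Y^T \mid X^T, \Phi) - E_{S^T} E_{\Theta_m \mid S^T} \log p_{A^*}(Y^T \mid X^T, S^T) \\
&= E\left[\mathrm{Regret}(A^*, S^T, \Phi)\right] \overset{(d)}{=} E \log p_{A^*}(\Phi \mid S^T) - E \log p_0(\Phi)
\end{aligned} \tag{16}
$$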

Step (a) follows from exchanging the order of the suprema, and shrinking $\Theta$ to $\Theta_m$. Substituting loss definitions, and lower bounding the supremum on by an expectation over conditioned on leads to (b). Step (c) follows from lower bounding the supremum on by expectation of w.r.t. . This yields expectation w.r.t. and for the left term. Performing this expectation on the right term implies expectation on with a distribution that is the one assigned to by . This negative logarithm is minimized with predictions given from , leading to the right term in step (c), which, similarly to the left term, performs the expectation on both and . The resulting expression is the expectation of the regret of on . Applying Lemma A gives (d). By the uniform construction of , the right term is . The left term can be bounded using (see, e.g., Fano’s inequality, [cover06])

$$-E \log p_{A^*}(\Phi \mid S^T) \le h_2(P_e) + P_e \log(M-1) \le 1 + P_e \cdot \log M, \tag{17}$$

where is the binary entropy function. This is proved by expectation over , and then, breaking the events of the value of into the event and , and then hierarchically separating the latter into the different possible values of , upper bounding the conditional entropy on by that of a uniform distribution. Combining both terms of (16), give and , concludes the proof of the first statement of Theorem 5.1. The second statement follows the exact same derivation conditioned on a fixed , after lower bounding the supremum over by that for this fixed .
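The last step can be sketched numerically. The code below checks the chain of inequalities in (17) for a few error probabilities, and evaluates the resulting lower bound under the assumption that the uniform-prior term in (16) contributes log M, so that the regret is at least log M - (1 + P_e log M). Natural logarithms are used, and the function names and the values of M and P_e are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy h_2(p) in nats, with h_2(0) = h_2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def fano_term(p_e, M):
    """Middle expression of (17): h_2(P_e) + P_e * log(M - 1)."""
    return binary_entropy(p_e) + p_e * math.log(M - 1)

def grid_lower_bound(p_e, M):
    """log M - (1 + P_e * log M), assuming the uniform-prior term equals log M."""
    return math.log(M) - (1.0 + p_e * math.log(M))

M = 10**6
checks = [
    (p, fano_term(p, M), 1.0 + p * math.log(M))
    for p in (0.0, 0.001, 0.01, 0.1, 0.5)
]

# Fraction of log M retained by the lower bound when the error probability is small.
ratio = grid_lower_bound(0.001, M) / math.log(M)
```

As the error probability vanishes and M grows, the retained fraction approaches 1, which is the (1 + o(1)) log M statement of (8).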

## Appendix B Proof of Theorem 5.2

Proof of Theorem 5.2: Let . (Similarly, if is discrete, the integral is replaced by a sum). Then, for a Bayesian algorithm with prior distribution on a logistic regression model,

 Regret(A∗,ST,θ∗)=L(A∗,ST)