Adaptive Minimax Regret against Smooth Logarithmic Losses over High-Dimensional ℓ_1-Balls via Envelope Complexity

10/09/2018 · Kohei Miyaguchi et al., The University of Tokyo

We develop a new theoretical framework, the envelope complexity, to analyze the minimax regret with logarithmic loss functions, and we derive a Bayesian predictor that achieves the adaptive minimax regret over high-dimensional ℓ_1-balls up to the major term. The prior, newly derived to achieve the minimax regret, is called the spike-and-tails (ST) prior after its shape. The resulting regret bound is simple: up to logarithmic factors, it is completely determined by the smoothness of the loss function and the radius of the balls, and it generalizes a number of existing regret/risk bounds. In a preliminary experiment, we confirm that the ST prior outperforms the conventional minimax-regret prior outside the conventional large-sample asymptotic regime.


1 Introduction

As a notion of complexity of predictive models (sets of predictors), minimax regret has been considered in the literature of online learning (Cesa-Bianchi and Lugosi, 2006) and the minimum description length (MDL) principle (Rissanen, 1978; Grünwald, 2007). The minimax regret of a model is given by

$$\mathrm{REG}(\mathcal{M}) \;:=\; \inf_{g}\,\sup_{x\in\mathcal{X}}\,\Bigl\{\, L(g,x) \;-\; \inf_{f\in\mathcal{M}} L(f,x) \,\Bigr\}, \tag{1}$$

where L(f, x) denotes the loss of the prediction on data x made by f, M denotes the set of feasible predictions, and X is the space of data. Here, the data may consist of a sequence of data points, x = x^n = (x_1, …, x_n), and the loss may be additive, L(f, x^n) = Σ_{t=1}^n ℓ(f, x_t), but we keep these structures implicit for generality. The minimax regret is a general complexity measure in the sense that it is defined without any assumption on the generation process of x. For instance, one can bound statistical risks with the regret regardless of the distribution of the data (Littlestone, 1989; Cesa-Bianchi et al., 2004; Cesa-Bianchi and Gentile, 2008). Therefore, bounding the minimax regret and constructing the corresponding predictor is important for making good and robust predictions.
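As a concrete reading of definition (1), the following minimal sketch (our illustration, not from the paper) evaluates the worst-case regret of one fixed predictor g against a small finite Bernoulli model under logarithmic loss; minimizing this quantity over all predictors g would give the minimax regret. The model, predictor, and data space are illustrative choices.

```python
import math

X = [0, 1]                                  # data space
M = [[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]]    # model: three Bernoulli predictors
g = [0.45, 0.55]                            # a fixed predictor to evaluate

def loss(p, x):
    """Logarithmic loss of predictor p on outcome x."""
    return -math.log(p[x])

def regret(g, x):
    """Excess loss of g over the best predictor in M for this particular x."""
    return loss(g, x) - min(loss(f, x) for f in M)

# Worst-case regret of g; the infimum of this over all g is the minimax regret (1).
print(max(regret(g, x) for x in X))
```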

We consider the case in which M is parametrized by a real-valued vector θ ∈ ℝ^d, M = {f_θ : θ ∈ ℝ^d}, and a penalty λ: ℝ^d → [0, ∞] is imposed, where λ(θ) acts as a radius function such as a norm of θ. Thus, we may consider the luckiness minimax regret (Grünwald, 2007),

$$\mathrm{REG}^{\lambda}(\mathcal{M}) \;:=\; \inf_{g}\,\sup_{x\in\mathcal{X}}\,\Bigl\{\, L(g,x) \;-\; \inf_{\theta\in\mathbb{R}^d}\,\bigl[\, L(\theta,x) + \lambda(\theta) \,\bigr] \,\Bigr\}, \tag{2}$$

instead of the original minimax regret. Here we abuse the notation L(θ, x) = L(f_θ, x). There are at least three reasons for adopting this formulation. First, as we do not assume an underlying distribution of x, it may be more plausible to pose a soft restriction as in (2) than the hard restriction in (1). Second, it is straightforward to show that the luckiness minimax regret bounds the minimax regret from above; thus, it often suffices to bound the former in order to bound the latter. Finally, the luckiness minimax regret includes the original minimax regret as a special case, via the indicator penalty with λ(θ) = 0 if θ is feasible and λ(θ) = ∞ otherwise. Therefore, we may avoid possible computational difficulties of the minimax regret by choosing the penalty carefully.

That being said, a closed-form expression of the exact (luckiness) minimax regret is intractable except in a few special cases (e.g., Shtar'kov (1987); Koolen et al. (2014)).

However, if we focus on information-theoretic settings, i.e., the model M is a set of probability distributions, everything becomes explicit. Now, let the predictors be sub-probability distributions and adopt the logarithmic loss function L(f, x) = −ln f(x) with respect to an appropriate base measure ν, such as a counting or Lebesgue measure. Note that a number of important practical problems, such as logistic regression and data compression, can be handled within this framework. With the logarithmic loss, the closed form of the luckiness minimax regret is given by Shtar'kov (1987) and Grünwald (2007) as

$$\mathrm{REG}^{\lambda}(\mathcal{M}) \;=\; \ln \int \exp\Bigl(-\min_{\theta}\,\bigl[\, L(\theta,x) + \lambda(\theta) \,\bigr]\Bigr)\, d\nu(x), \tag{3}$$

where min_θ denotes the minimum operator applied to the penalized loss, min_θ [L(θ, x) + λ(θ)] = inf_{θ∈ℝ^d} {L(θ, x) + λ(θ)}. We refer to the left-hand-side value as the Shtarkov complexity. Moreover, when all the distributions in M are i.i.d. regular distributions of n-sequences x^n = (x_1, …, x_n), then, under some regularity conditions, the celebrated asymptotic formula (Rissanen, 1996; Grünwald, 2007) is given by

$$\mathrm{REG}(\mathcal{M}) \;=\; \frac{d}{2}\ln\frac{n}{2\pi} \;+\; \ln \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta \;+\; o(1), \tag{4}$$

where I(θ) is the Fisher information matrix and the o(1) term vanishes as n → ∞. More importantly, although the exact minimax-regret predictor achieving (3) is still intractable, the asymptotic formula implies that the minimax regret is asymptotically achieved by the Bayesian predictor associated with the tilted Jeffreys prior, π(θ) ∝ √(det I(θ)) e^{−λ(θ)}.
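As a concrete check of (3) and (4), the following sketch (our illustration, not the paper's code) computes the exact Shtarkov complexity of the unpenalized Bernoulli model over n binary outcomes and compares it with Rissanen's asymptotic formula; for the Bernoulli model, d = 1, I(θ) = 1/(θ(1−θ)), and ∫₀¹ √I(θ) dθ = π.

```python
import math

def shtarkov_regret_bernoulli(n):
    """Exact minimax regret ln S for the Bernoulli model over n binary outcomes.

    S = sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k), the Shtarkov (NML) normalizer.
    """
    s = 0.0
    for k in range(n + 1):
        # log of (k/n)^k ((n-k)/n)^(n-k), with the convention 0^0 = 1
        log_term = (k * math.log(k / n) if k else 0.0) \
                 + ((n - k) * math.log((n - k) / n) if n - k else 0.0)
        s += math.comb(n, k) * math.exp(log_term)
    return math.log(s)

def rissanen_approx_bernoulli(n):
    """Asymptotic formula (4) with d = 1 and integral of sqrt(I) equal to pi."""
    return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

for n in [10, 100, 1000]:
    print(n, shtarkov_regret_bernoulli(n), rissanen_approx_bernoulli(n))
```

Already at n = 1000 the two values agree to within about 0.01 nats, illustrating how sharp (4) is in the classical fixed-d, large-n regime.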

Here, our research questions are as follows. First, (Q1) how can we evaluate the minimax regret in modern high-dimensional contexts? In particular, the asymptotic formula (4) does not withstand high-dimensional learning problems where the dimensionality d increases with n. The exact evaluation of the Shtarkov complexity (3), on the other hand, is often intractable because of the minimum operator inside the integral. Second, (Q2) how can we achieve the minimax regret with computationally feasible predictors? It is important to provide a counterpart of the tilted Jeffreys prior in order to make actual predictions.

Regarding the above questions, our contribution is summarized as follows:

  • We introduce the envelope complexity, a non-asymptotic approximation of the Shtarkov complexity that allows systematic computation of upper bounds on the complexity, together with predictors achieving these bounds. In particular, we show that the regret of the resulting predictor is characterized by the smoothness of the loss.

  • We demonstrate its usefulness by giving a Bayesian predictor that adaptively achieves the minimax regret within a factor of two over any high-dimensional smooth model under ℓ_1-constraints.

The rest of the paper is organized as follows. In Section 2, we introduce the notion of the Bayesian minimax regret as an approximation of the minimax regret within a 'feasible' set of predictors. We then develop a complexity measure called the envelope complexity in Section 3 as a mathematical abstraction of the Bayesian minimax regret, and present a collection of techniques for bounding the envelope complexity and relating it to the Shtarkov complexity. In Section 4, we utilize the envelope complexity to construct a near-minimax Bayesian predictor under ℓ_1-penalization, namely the spike-and-tails (ST) prior, and show that it achieves the minimax rate under high-dimensional asymptotics. In Section 5, we present numerical experiments that visualize our theoretical results. The discussion of these results in comparison with existing studies is given in Section 6. Finally, we conclude the paper in Section 7.

2 Bayesian Minimax Regret

The minimax regret with logarithmic loss is given by the Shtarkov complexity (3). The computation of the Shtarkov complexity is often intractable if we consider practical models such as deep neural networks. This is because the landscapes of the loss functions L(·, x) are as complex as the models themselves, and hence neither their minima nor the complexity, which is an integral of a function of x, is tractable. Moreover, computing the optimal predictor is often intractable even if the complexity itself is given; for instance, the exact minimax-regret prediction for Bernoulli models over long outcome sequences is already computationally demanding. Of course, there exist special cases for which closed forms are known; however, so far they are limited to exponential families.

One cause of this issue is that we seek the best predictor among all possible predictors, i.e., all probability distributions. This is so general that it may not be possible to compute either the complexity or the optimal predictor. To avoid this difficulty, we narrow the set of feasible predictors to Bayesian predictors. Let Q be a positive measure over the parameter space, which we refer to as a pre-prior, and let p_Q be the Bayesian predictor associated with the prior Q. Then we have

$$p_Q(x) \;=\; \mathbb{E}_Q\bigl[\, f_\theta(x) \,\bigr] \;=\; \int f_\theta(x)\, dQ(\theta), \tag{5}$$

where E_Q denotes integration with respect to Q. Now, we consider the Bayesian (luckiness) minimax regret, given by restricting the infimum in (2) to Bayesian predictors,

$$\mathrm{REG}^{\lambda}_{\mathrm{Bayes}}(\mathcal{M}) \;:=\; \inf_{Q}\,\sup_{x\in\mathcal{X}}\,\Bigl\{\, L(p_Q, x) \;-\; \inf_{\theta}\,\bigl[\, L(\theta,x) + \lambda(\theta) \,\bigr] \,\Bigr\}.$$

One advantage of considering the Bayesian minimax regret is that, given a measure Q, one can compute p_Q analytically or numerically by utilizing techniques developed in the literature on Bayesian inference. In particular, a number of sophisticated variants of Markov chain Monte Carlo (MCMC) methods, such as stochastic gradient Langevin dynamics (Welling and Teh, 2011), have been developed for sampling from complex posteriors.
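As an illustration of how p_Q can be computed numerically, the following sketch (our illustration; the Gaussian model and the Laplace pre-prior are assumptions of ours) approximates the Bayesian mixture p_Q(x) = E_Q[f_θ(x)] in (5) by plain Monte Carlo. In practice, an MCMC sampler such as SGLD would replace the direct sampling step when Q or the posterior is complex.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, x):
    """Density of the N(theta, 1) location model at x."""
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def p_Q(x, n_samples=100_000):
    """Monte Carlo estimate of p_Q(x) = E_Q[f_theta(x)] under a Laplace pre-prior Q."""
    theta = rng.laplace(scale=1.0, size=n_samples)  # samples theta ~ Q
    return f(theta, x).mean()                        # empirical mean of f_theta(x)

print(p_Q(0.3))
```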

Note that there does exist a case in which the Bayesian minimax regret strictly differs from the minimax regret; see Barron et al. (2014) for an example. This implies that narrowing the range of predictors to Bayesian ones may worsen the achievable worst-case regret. However, as we will show shortly, the gap between these two minimax regrets can be controlled depending on the model.

3 Envelope Complexity

We have introduced the Bayesian minimax regret. In this section, we present a set representation of the Bayesian minimax regret, namely the envelope complexity. We then show that the Shtarkov complexity is bounded by the envelope complexity and that the envelope complexity can be bounded easily even if the model is complex.

3.1 Set Representation of Bayesian Minimax Regret

The envelope complexity is a simple mathematical abstraction of the Bayesian minimax regret and gives a fundamental basis for the systematic computation of upper bounds on the (Bayesian) minimax regret. Let L be a set of continuous functions over the parameter space, not necessarily logarithmic losses. Define the Bayesian envelope of L at level R ≥ 0 as the set of pre-priors whose Bayesian loss exceeds the minimum loss by at most R uniformly over L,

$$\mathrm{Env}_R(\mathcal{L}) \;:=\; \Bigl\{\, Q \;:\; -\ln \mathbb{E}_Q\bigl[e^{-\ell(\theta)}\bigr] \;-\; \min_{\theta} \ell(\theta) \;\le\; R \ \text{ for all } \ell \in \mathcal{L} \,\Bigr\},$$

and define the envelope complexity as the smallest achievable level,

$$\mathrm{env}(\mathcal{L}) \;:=\; \inf\bigl\{ R \ge 0 : \mathrm{Env}_R(\mathcal{L}) \neq \emptyset \bigr\} \;=\; \inf_{Q}\,\sup_{\ell\in\mathcal{L}}\, \Bigl[\, -\ln \mathbb{E}_Q\bigl[e^{-\ell(\theta)}\bigr] \;-\; \min_{\theta} \ell(\theta) \,\Bigr].$$

Then, the envelope complexity characterizes the Bayesian minimax regret.

Theorem 1 (Set representation)

Let R ≥ 0. Then every measure Q in the envelope Env_R(L) satisfies

$$-\ln \mathbb{E}_Q\bigl[e^{-\ell(\theta)}\bigr] \;\le\; \min_{\theta} \ell(\theta) + R \qquad \text{for all } \ell \in \mathcal{L}.$$

Moreover, if L is the set of penalized logarithmic loss functions, L = {L(·, x) + λ : x ∈ X}, we have

$$\mathrm{env}(\mathcal{L}) \;=\; \mathrm{REG}^{\lambda}_{\mathrm{Bayes}}(\mathcal{M}).$$

Proof  Let Q ∈ Env_R(L). The first inequality is immediate, since by the definition of the envelope −ln E_Q[e^{−ℓ(θ)}] − min_θ ℓ(θ) ≤ R for all ℓ ∈ L. For the second claim, associate with each pre-prior Q the tilted measure Q_λ(dθ) := e^{−λ(θ)} Q(dθ), so that E_Q[e^{−(L(θ,x)+λ(θ))}] = E_{Q_λ}[f_θ(x)] = p_{Q_λ}(x) for all x. Since the map Q ↦ Q_λ is a bijection over the admissible measures, taking the supremum over x and the infimum over Q yields the second equality. This completes the proof.  ∎

We have seen that the envelope complexity is equivalent to the Bayesian minimax regret. Below, we present the upper bounds on the Shtarkov complexity on which we base the rest of the paper.

Theorem 2 (Bounds on Shtarkov complexity)

Let L = {L(·, x) : x ∈ X}, where L is logarithmic. Then, for every R ≥ 0 and every Q ∈ Env_R(L + λ), we have

$$\mathrm{REG}^{\lambda}(\mathcal{M}) \;\le\; \mathrm{env}(\mathcal{L} + \lambda) \;\le\; R.$$

Proof  The first inequality follows from the fact that the Bayesian minimax regret (which equals env(L + λ) by Theorem 1) is no less than the minimax regret, as the range of the infimum is shrunk from all predictors to the Bayes class. The second inequality follows from the definition of the envelope complexity. This completes the proof.  ∎

3.2 Useful Lemmas for Evaluating Envelope Complexity

Next, we show several lemmas that highlight the computational advantages of the envelope complexity. We first show that the envelope complexity is easily evaluated through the surrogate relation. We say that a function ℓ̄ is a surrogate of another function ℓ if and only if ℓ ≤ ℓ̄ pointwise with min_θ ℓ(θ) = min_θ ℓ̄(θ), which we denote by ℓ ⪯ ℓ̄. Moreover, if there is a one-to-one correspondence between L and L̄ such that each ℓ ∈ L is dominated by its counterpart ℓ̄ ∈ L̄, then we write L ⪯ L̄.

Lemma 3 (Monotonicity)

Let L ⪯ L̄. Then we have Env_R(L̄) ⊆ Env_R(L) for every R ≥ 0, and therefore

$$\mathrm{env}(\mathcal{L}) \;\le\; \mathrm{env}(\bar{\mathcal{L}}).$$

Proof  Note that if ℓ ⪯ ℓ̄, then for every pre-prior Q we have −ln E_Q[e^{−ℓ(θ)}] ≤ −ln E_Q[e^{−ℓ̄(θ)}] while min_θ ℓ(θ) = min_θ ℓ̄(θ), which means that the regret of Q against ℓ never exceeds that against ℓ̄. Also, as this holds for every matched pair, membership in Env_R(L̄) implies membership in Env_R(L), i.e., Env_R(L̄) ⊆ Env_R(L). Therefore, taking the infimum over attainable levels, we have env(L) ≤ env(L̄).  ∎
This is especially useful when the loss functions are complex but simple surrogates exist. Consider a model for which the landscapes of the associated loss functions are not fully understood and the evaluation of the losses is expensive. It is then impossible to check directly whether a given measure is in the envelope Env_R(L + λ), and therefore Theorem 2 cannot be used. However, even in such cases, one can possibly find a surrogate class L̄ of L. If the surrogate is simple enough to check membership in its envelope, one can bound the envelope complexity by utilizing Lemma 3 together with Theorem 2.

In what follows, we consider a specific instance of the surrogate relation based on smoothness. A function ℓ is γ-upper smooth if and only if, for every θ₀, there exists g ∈ ℝ^d such that, for all θ,

$$\ell(\theta) \;\le\; \ell(\theta_0) \;+\; \langle g,\, \theta - \theta_0 \rangle \;+\; \frac{\gamma}{2}\,\|\theta - \theta_0\|_2^2. \tag{6}$$

Note that upper smoothness is weaker than (Lipschitz) smoothness. Thus, if ℓ is γ-upper smooth and has at least one minimizer θ̂, we can construct a simple quadratic surrogate of ℓ, namely θ ↦ ℓ(θ̂) + (γ/2)‖θ − θ̂‖².
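The quadratic surrogate can be checked numerically. The following sketch (an illustration under our own choice of loss, not from the paper) verifies that a two-point logistic loss, which is γ-upper smooth with γ = 1/2 since each logistic term is 1/4-smooth, is dominated by its quadratic surrogate built at the minimizer.

```python
import numpy as np

def ell(theta):
    """Logistic loss on one positive and one negative unit-feature example."""
    return np.log1p(np.exp(-theta)) + np.log1p(np.exp(theta))

gamma, theta_hat = 0.5, 0.0   # smoothness constant and minimizer of ell
theta = np.linspace(-10, 10, 2001)

# Quadratic surrogate at the minimizer: ell(theta_hat) + (gamma/2) (theta - theta_hat)^2
surrogate = ell(theta_hat) + 0.5 * gamma * (theta - theta_hat) ** 2

print(np.all(ell(theta) <= surrogate + 1e-12))  # True: the surrogate dominates ell
```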

Motivated by the smoothness assumption, we next present more specific bounds for quadratic functions. Let Q_d be the set of all d-dimensional quadratic functions with curvature one (up to additive constants, which do not affect regrets), Q_d := { θ ↦ ‖θ − μ‖²/2 + c : μ ∈ ℝ^d, c ∈ ℝ }. Moreover, for all sets of loss functions L and penalty functions λ, we write L + λ := { ℓ + λ : ℓ ∈ L }. Then the envelope complexity of L is evaluated with that of Q_d.

Lemma 4 (Bounds of smoothness)

Suppose that every ℓ ∈ L is γ-upper smooth with at least one minimizer. Let s: θ ↦ γ^{−1/2} θ be the scaling function that normalizes the curvature to one. Then the rescaled quadratic surrogates of L dominate it in the sense of Lemma 3, and moreover,

$$\mathrm{env}(\mathcal{L} + \lambda) \;\le\; \mathrm{env}(\mathcal{Q}_d + \lambda \circ s).$$

Proof  Note that each ℓ ∈ L admits the quadratic surrogate q_ℓ(θ) := ℓ(θ̂_ℓ) + (γ/2)‖θ − θ̂_ℓ‖², where θ̂_ℓ is a minimizer of ℓ, since L is a set of γ-upper smooth functions. Observe that, after the change of variables u = s^{−1}(θ) = γ^{1/2}θ, each q_ℓ becomes a quadratic of curvature one and the penalty becomes λ ∘ s, where the surrogates range over Q_d + λ ∘ s. Thus, by Lemma 3 and the parametrization invariance of the envelope complexity, we have env(L + λ) ≤ env(Q_d + λ ∘ s), which yields the inequality.  ∎

This lemma shows that, as long as we consider envelope complexities of upper-smooth functions L, it suffices for bounding them from above to evaluate the envelope complexity of penalized quadratic functions.

Further, according to the lemma below, we can restrict ourselves to one-dimensional parametric models without loss of generality if the penalty function λ is separable. Here, λ is said to be separable if and only if it can be written in the form λ(θ) = Σ_{j=1}^d λ_j(θ_j).

Lemma 5 (Separability)

Suppose that λ is separable. Then the envelope complexity of Q_d + λ is bounded by a separable quantity, i.e.,

$$\mathrm{env}(\mathcal{Q}_d + \lambda) \;\le\; \sum_{j=1}^{d} \mathrm{env}(\mathcal{Q}_1 + \lambda_j),$$

where Q_1 is the set of normalized one-dimensional quadratic functions with curvature one, Q_1 := { θ ↦ (θ − μ)²/2 + (1/2) ln 2π : μ ∈ ℝ }.

Proof  Note that every q ∈ Q_d is separable up to additive constants, i.e., q(θ) = Σ_j q_j(θ_j) with q_j ∈ Q_1, and λ(θ) = Σ_j λ_j(θ_j). Let Q = ⊗_{j=1}^d Q_j be a product of one-dimensional pre-priors, each taken from the envelope of Q_1 + λ_j. Then E_Q[e^{−q(θ)−λ(θ)}] factorizes over the coordinates, so the regret of Q against q + λ is the sum of the coordinate-wise regrets, uniformly over q ∈ Q_d. Taking infima over the Q_j yields the claim.  ∎

Summary

We have defined the Bayesian envelope and the envelope complexity. The envelope complexity is equal to the Bayesian minimax regret when L is the set of penalized logarithmic loss functions. Any measure in the Bayesian envelope can be utilized for bounding the Shtarkov complexity through the envelope complexity. Most importantly, the envelope complexity satisfies several useful properties such as monotonicity, parametrization invariance, and separability. In particular, the monotonicity differentiates the envelope complexity from the Shtarkov complexity.

4 The Spike-and-Tails Prior for High-Dimensional Prediction

We leverage the envelope complexity to give a Bayesian predictor, induced by what we call the spike-and-tails prior, that closely achieves the luckiness minimax regret under ℓ_1-penalties. Moreover, the predictor is shown to be approximately minimax, without luckiness, over ℓ_1-balls.

4.1 Envelope Complexity for ℓ_1-Penalties

Let λ be the weighted ℓ_1-norm given by

$$\lambda(\theta) \;=\; \sum_{j=1}^{d} w_j\,|\theta_j|, \tag{7}$$

where w_j > 0 for all j. Let π_ST be the spike-and-tails (ST) prior over ℝ^d given by

(8)
(9)

where δ_0 denotes the delta (Kronecker-type) measure at the origin. We call it the spike-and-tails prior because it consists of a delta measure (spike) and two exponential distributions (tails), as shown in Figure 1.
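The displayed definition in (8)-(9) was lost in extraction. As a hedged schematic of its shape only (the exact spike mass a, tail coefficient b, and tail onset t are specified in the original paper and are left symbolic here), a one-dimensional ST prior with penalty weight w can be written as

```latex
\pi_{\mathrm{ST}}(d\theta) \;\propto\; a\,\delta_0(d\theta) \;+\; b\,e^{-w\,(|\theta|-t)}\,\mathbf{1}\{|\theta|\ge t\}\,d\theta ,
```

with the d-dimensional ST prior formed as the product of such one-dimensional priors across coordinates, consistent with the separability of the ℓ_1-penalty (Lemma 5).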

Then, envelope complexities for quadratic loss functions can be bounded as follows.


Figure 1: Density of the spike-and-tails (ST) prior
Lemma 6 (Sharp bound on envelope complexity)

Take λ as given by (7). Then the ST prior lies in the Bayesian envelope of Q_d + λ, and

$$\mathrm{env}(\mathcal{Q}_d + \lambda) \;\le\; \bigl(c + o(1)\bigr) \sum_{j=1}^{d} e^{-w_j^2/2}$$

for some constant c, where o(1) → 0 as min_j w_j → ∞.

Proof  Consider the logarithmic loss functions of the d-dimensional standard normal location model, ℓ_μ(θ) = ‖θ − μ‖²/2 + (d/2) ln 2π, and note that they are quadratic with curvature one. The lower bound then follows from Lemma 8 in Section A applied with this family.

Note that λ is separable, so by Lemmas 4 and 5 we may restrict ourselves to the one-dimensional case. Let a and b be positive real numbers. Let Q be a measure over the real line consisting of a delta measure at the origin with mass a and the Lebesgue measure with density b restricted to the tail region T; that is, Q(A) = a δ_0(A) + b Leb(A ∩ T) for measurable sets A. Then we have

(10)

We want to minimize (10) with respect to a and b. Let r(μ) denote the regret of Q against the quadratic loss centered at μ. It suffices for a and b to guarantee r(μ) ≤ R for all μ. Here, we only need to consider μ ≥ 0, since the problem is symmetric with respect to the origin and the spike trivially handles all μ up to the tail onset. Analyzing r beyond the tail onset yields a sufficient condition on (a, b), which is satisfied by an appropriate choice; finally, evaluating (10) at this choice yields the ST pre-prior (whose λ-tilt has the exponential tails of Figure 1), and the upper bound is shown. The in-envelope statement is a result of a straightforward calculation.  ∎
According to Lemma 6, the ST prior bounds the envelope complexity at a rate that is quadratic in the exponent, e^{−w²/2}, as the weights grow. This exponent is optimally sharp, since the lower bound has the same exponent.

This gives an upper bound on the envelope complexity for general smooth loss functions. Let π̃_ST denote the scale-corrected ST (pre-)prior, obtained from π_ST by rescaling the parameters according to the smoothness γ as in Lemma 4, so that the weights are effectively w_j/√γ.
The following is a direct corollary of Lemma 4, 5, 6 and 3.

Corollary 1

If every ℓ ∈ L is γ-upper smooth with respect to θ, and if λ is given by (7), then the scale-corrected ST prior lies in the Bayesian envelope of L + λ, and therefore the bound of Lemma 6 carries over with the weights rescaled to w_j/√γ.

4.2 Regret Bound with the ST Prior

Now we utilize Corollary 1 to bound the actual prediction performance of the ST prior. We consider the scenario of online learning under an ℓ_1-constraint.

Setup

Let x^n = (x_1, …, x_n) be a sequence of outcomes. Let the losses L(·, x_t) be logarithmic, i.e., such that ∫ e^{−L(θ, x)} dν(x) ≤ 1 for all θ. Then the conditional Bayesian pre-posterior with respect to a pre-prior Q given x^{t−1} is given by

$$Q_t(d\theta) \;\propto\; \exp\Bigl(-\sum_{s<t} L(\theta, x_s)\Bigr)\, Q(d\theta),$$

and the prediction at round t is p_{Q_t}. The online regret of the predictor is defined as

$$\mathrm{REG}_n(B) \;:=\; \sum_{t=1}^{n} L(p_{Q_t}, x_t) \;-\; \min_{\|\theta\|_1 \le B}\, \sum_{t=1}^{n} L(\theta, x_t). \tag{11}$$
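To make (11) concrete, the following sketch (our illustration; the grid-discretized prior, the Gaussian location model, and the data are assumptions of ours, not the paper's setup) runs the prequential Bayesian predictor and accumulates the predictive log losses; by the chain rule, their sum equals the negative log marginal likelihood, and the comparator is the best parameter on the grid, standing in for the constrained minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_grid = np.linspace(-5, 5, 2001)   # grid stand-in for the parameter ball
log_prior = -np.abs(theta_grid)          # a Laplace-like pre-prior (illustrative)
x = rng.normal(0.7, 1.0, size=50)        # outcome sequence

def log_lik(theta, x_t):
    return -0.5 * (x_t - theta) ** 2 - 0.5 * np.log(2 * np.pi)

log_w = log_prior.copy()
cum_loss = 0.0
for x_t in x:
    ll = log_lik(theta_grid, x_t)
    # predictive probability p(x_t | x^{t-1}) = E_posterior[f_theta(x_t)]
    log_pred = np.logaddexp.reduce(log_w + ll) - np.logaddexp.reduce(log_w)
    cum_loss += -log_pred
    log_w += ll                           # pre-posterior update

best = np.min(-np.array([log_lik(theta_grid, x_t) for x_t in x]).sum(axis=0))
print("online regret:", cum_loss - best)  # eq. (11) with the comparator on the grid
```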

Now, we can bound the online regret of the ST prior as follows.

Theorem 7 (Adaptive minimaxity over ℓ_1-balls)

Suppose that the losses L(·, x_t) are γ-upper smooth and logarithmic. Let λ be given by (7) with the weights taken of order √(γ n ln d). Then, with the corresponding scale-corrected ST prior, we have

$$\mathrm{REG}_n(B) \;=\; O\bigl( B \sqrt{\gamma\, n \ln d} \bigr)$$

for all B > 0 simultaneously. Moreover, this is the adaptive minimax rate, and it is not improvable by more than a factor of two even if B is fixed and non-Bayesian predictors are allowed.

Proof  Let L_n(θ) := Σ_{t=1}^n L(θ, x_t) be the cumulative loss, and observe that L_n is nγ-upper smooth and logarithmic; by the chain rule, the cumulative loss of the sequential predictions equals −ln p_Q(x^n). Let ι_B be the indicator penalty of the ℓ_1-ball, ι_B(θ) = 0 if and only if ‖θ‖_1 ≤ B and ι_B(θ) = ∞ otherwise. Then REG_n(B) is the luckiness regret with penalty ι_B, where the supremum is taken with respect to x^n. Now, observe that the comparator satisfies min_θ [L_n(θ) + λ(θ)] ≤ min_{‖θ‖_1 ≤ B} L_n(θ) + B max_j w_j, which, combined with Corollary 1 under the stated choice of weights, yields the claimed bound. The proof of the minimaxity is adapted from the existing analysis of the minimax risk (see Section B for the rigorous proof and Section 6.5 for detailed discussion).  ∎

5 Visual Comparison of the ST Prior and the Tilted Jeffreys Prior

Now we verify the results on the ℓ_1-regularization obtained above. In particular, we compare the worst-case regrets achievable with Bayesian predictors against the minimax regret, i.e., the Shtarkov complexity.

Setting

We adopted the one-dimensional quadratic loss functions with curvature one, ℓ_μ(θ) = (θ − μ)²/2 + (1/2) ln 2π, and the ℓ_1-penalty λ(θ) = w|θ|. We varied the penalty weight w over a wide range and observed how the worst-case regret of each Bayesian predictor changes. Specifically, we employed the spike-and-tails (ST) prior (9) and the tilted Jeffreys prior for the predictors. Note that, in this case, the tilted Jeffreys prior is nothing more than the double-exponential prior with density proportional to e^{−w|θ|}.
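A rough numerical sketch of this comparison is below (our own construction; the ST spike mass and tail shape are illustrative guesses, not the paper's tuned constants, so the values need not match Figure 2 exactly). For a prior π, it evaluates the worst-case luckiness regret sup_μ { −ln E_π[e^{−ℓ_μ(θ)}] − min_θ [ℓ_μ(θ) + w|θ|] }, with the spike handled analytically.

```python
import numpy as np

w = 4.0
theta = np.linspace(-40.0, 40.0, 40001)   # grid for the continuous prior parts
mus = np.linspace(0.0, 20.0, 801)         # symmetric in mu, so scan mu >= 0

def worst_case_regret(spike_mass, density):
    """Worst-case penalized regret of pi = spike_mass * delta_0 + continuous part."""
    dens = density / np.trapz(density, theta) * (1.0 - spike_mass)
    worst = -np.inf
    for mu in mus:
        ell = 0.5 * (theta - mu) ** 2
        mix = spike_mass * np.exp(-0.5 * mu ** 2) + np.trapz(np.exp(-ell) * dens, theta)
        best = np.min(ell + w * np.abs(theta))    # min of the penalized loss
        worst = max(worst, -np.log(mix) - best)
    return worst

# Tilted Jeffreys prior: for curvature-one quadratics, the Laplace density e^{-w|theta|}.
laplace = np.exp(-w * np.abs(theta))

# Schematic spike-and-tails prior: a heavy spike at 0 plus exponential tails beyond +-w.
tails = np.where(np.abs(theta) >= w, np.exp(-w * (np.abs(theta) - w)), 0.0)
spike_mass = 1.0 - np.exp(-0.5 * w * w)

print("tilted Jeffreys:", worst_case_regret(0.0, laplace))       # ~5e-2 for w = 4
print("spike-and-tails:", worst_case_regret(spike_mass, tails))  # ~3e-4 for w = 4
```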

Results

In Figure 2, the worst-case regrets of the ST prior and the tilted Jeffreys prior are shown along with the minimax regret (Optimal). While the regret of the tilted Jeffreys prior is almost the same as the optimal regret when w is small, it performs poorly when w is large. The ST prior, on the other hand, performs robustly well over the entire range of w; specifically, its worst-case regret converges to zero at the sharp rate of Lemma 6 as w grows. Since one must take w large when the dimensionality d is large, this implies that the ST prior is a better choice than the tilted Jeffreys prior in high dimensions.


Figure 2: Worst-case regrets of the spike-and-tails (ST) prior and the tilted Jeffreys prior

6 Implications and Discussions

In this section, we discuss interpretations of the results and present solutions to some technical difficulties.

6.1 Gap between the Bayesian Minimax Regret and the Minimax Regret

One may wonder whether there exists a prior that exactly achieves the minimax regret. Unfortunately, the answer is negative. With a technique similar to the higher-order differentiations used by Hedayati and Bartlett (2012), we can show that if the penalty λ is convex and not differentiable, like the ℓ_1-norm, then the gap is nonzero, i.e., the Bayesian minimax regret strictly exceeds the minimax regret. The detailed statement and proof are given in Section C.

6.2 Infinite-dimensional Models

If the dimensionality of the parameter space is countably infinite, the minimax regret with any nonzero radius diverges. In this case, one may apply different penalty weights to different dimensions: taking weights w_j that grow with the dimension index j, the separability of the envelope complexity guarantees that the total complexity is bounded by the sum of the per-coordinate bounds. The corresponding countably infinite tensor product of one-dimensional ST priors then gives a finite regret with respect to infinite-dimensional models.
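For instance, if the per-coordinate envelope bound decays like e^{−w_j²/2}, as in Lemma 6 (we take this form as an assumption here, suppressing constant and lower-order factors), then weights growing only logarithmically in squared magnitude already suffice for a finite total:

```latex
w_j \;=\; \sqrt{2(1+\delta)\ln (j+1)}
\quad\Longrightarrow\quad
\sum_{j=1}^{\infty} e^{-w_j^2/2}
\;=\; \sum_{j=1}^{\infty} (j+1)^{-(1+\delta)}
\;<\; \infty
\qquad (\delta > 0).
```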

6.3 Comparison to the Tilted Jeffreys Priors and Others

There have been previous studies on achieving the minimax regret with Bayesian predictors (Takeuchi and Barron, 1998, 2013; Watanabe and Roos, 2015; Xie and Barron, 2000). In these studies, the Bayesian predictor based on the Jeffreys prior (the Jeffreys predictor) is proved to attain the minimax regret asymptotically under some regularity conditions. The tilted Jeffreys prior, which takes the effect of penalization into account, is given by Grünwald (2007) as π(θ) ∝ √(det I(θ)) e^{−λ(θ)}, where I(θ) denotes the Fisher information matrix. In the case of quadratic loss functions, the Fisher information is the identity, and the tilted Jeffreys prior reduces to π(θ) ∝ e^{−λ(θ)}. This implies that taking the uniform pre-prior is good for smooth models in the conventional large-sample limit, in strong contrast with our result, where a completely nonuniform pre-prior performs better for high-dimensional models.

6.4 Comparison to Online Convex Optimization

So far, we have considered the luckiness minimax regret, which leads to the adaptive minimax regret. Perhaps surprisingly, our minimax regret bound coincides with results from the literature on online convex optimization, where different assumptions on the loss functions and predictors are made. Specifically, for an appropriate choice of the penalty weights, the regret bound reduces to a √(n ln d)-type rate, which coincides with the standard no-regret rates of online learning, such as those of the Hedge algorithm (Freund and Schapire, 1997) and high-dimensional online regression (Gerchinovitz and Yu, 2014), where n is the number of trials and d is the number of experts or dimensions. Moreover, for another choice, the regret bound reduces to a (d ln n)-type rate, which equals the minimax-regret rate achieved under large-sample asymptotics, as in Hazan et al. (2007) and Cover (2011).

Note that the conditions assumed in these two regimes are somewhat different. In our setting, the loss functions are assumed to be upper smooth and to satisfy a normalization condition (being logarithmic losses), whereas boundedness and convexity of the loss functions are often assumed in online learning. Moreover, we have employed Bayesian predictors, whereas simpler online predictors are typically used in the online-learning literature.

6.5 Comparison to the Minimax Risk over ℓ_1-balls

In the literature on high-dimensional statistics, the minimax rate of the statistical risk is also achieved with ℓ_1-regularization (Donoho and Johnstone, 1994) when the true parameter lies in a unit ℓ_1-ball. Although both risk and regret are performance measures of prediction, there are two notable differences. One is that risks are calculated under assumptions on the true statistical distribution, whereas regrets are defined without any assumption on the data. The other is that risks are typically considered with in-model predictors, i.e., predictors restricted to a given model, whereas regrets are often considered with out-model predictors such as Bayesian predictors and online predictors. Therefore, the minimax regret can be regarded as a more agnostic complexity measure than the minimax risk.

If we assume Gaussian noise models and adopt logarithmic loss functions, the minimax rate of the risk given by Donoho and Johnstone (1994) is, interestingly, of the same order as the rate of the regret bound given by Theorem 7 in the corresponding setting. Moreover, the minimax-risk-optimal penalty weight is also minimax-regret optimal in this case. Therefore, if the dimensionality is sufficiently large compared to the sample size (the number of rounds n in the case of online learning), making no distributional assumption on the data costs nothing in terms of the minimax rate.

7 Conclusion

In this study, we presented a novel characterization of the minimax regret for logarithmic loss functions under ℓ_1-regularization, called the envelope complexity. The virtue of the envelope complexity is that it is much easier to evaluate than the minimax regret itself and that it produces upper bounds systematically. Using the envelope complexity, we proposed the spike-and-tails (ST) prior, which almost achieves the luckiness minimax regret against smooth loss functions under ℓ_1-penalization. We also showed that the ST prior adaptively achieves the 2-approximate minimax regret under high-dimensional asymptotics. In the experiment, we confirmed our theoretical results: the ST prior outperforms the tilted Jeffreys prior when the dimensionality is high, whereas the tilted Jeffreys prior is optimal in the conventional large-sample regime with fixed dimensionality.

Limitation and future work

The present work relies on the assumptions of smoothness and the logarithmic property of the loss functions. The smoothness assumption may be removed by considering the smoothing effect of stochastic algorithms such as stochastic gradient descent, as in Kleinberg et al. (2018). As for the logarithmic assumption, it may be generalized to evaluate complexities with non-logarithmic loss functions with the help of tools developed in the information-theory literature, such as Yamanishi (1998). Finally, since our regret bound with the ST prior is quite simple (it involves only the smoothness and the radius, up to logarithmic terms), applying these results to concrete models such as deep learning models would be interesting future work, as would a comparison with existing generalization error bounds.

References

  • Barron et al. (2014) Barron, A., Roos, T., and Watanabe, K. (2014). Bayesian properties of normalized maximum likelihood and its fast computation. In IEEE International Symposium on Information Theory - Proceedings.
  • Cesa-Bianchi et al. (2004) Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057.
  • Cesa-Bianchi and Gentile (2008) Cesa-Bianchi, N. and Gentile, C. (2008). Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386–390.
  • Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
  • Cover (2011) Cover, T. M. (2011). Universal portfolios. In The Kelly Capital Growth Investment Criterion: Theory and Practice, pages 181–209. World Scientific.
  • Donoho and Johnstone (1994) Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over ℓp-balls for ℓp-error. Probability Theory and Related Fields, 99(2):277–303.
  • Freund and Schapire (1997) Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139.
  • Gerchinovitz and Yu (2014) Gerchinovitz, S. and Yu, J. Y. (2014). Adaptive and optimal online linear regression on ℓ1-balls. Theoretical Computer Science, 519:4–28.
  • Grünwald (2007) Grünwald, P. D. (2007). The minimum description length principle. MIT press.
  • Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.
  • Hedayati and Bartlett (2012) Hedayati, F. and Bartlett, P. L. (2012). The optimality of Jeffreys prior for online density estimation and the asymptotic normality of maximum likelihood estimators. In Conference on Learning Theory, pages 7.1–7.13.
  • Kleinberg et al. (2018) Kleinberg, R., Li, Y., and Yuan, Y. (2018). An alternative view: When does sgd escape local minima? arXiv preprint arXiv:1802.06175.
  • Komatu (1955) Komatu, Y. (1955). Elementary inequalities for Mills' ratio. Rep. Statist. Appl. Res. Un. Jap. Sci. Engrs, 4:69–70.
  • Koolen et al. (2014) Koolen, W. M., Malek, A., and Bartlett, P. L. (2014). Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems, pages 3230–3238.
  • Littlestone (1989) Littlestone, N. (1989). From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269–284.
  • Rissanen (1978) Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5):465–471.
  • Rissanen (1996) Rissanen, J. J. (1996). Fisher information and stochastic complexity. IEEE transactions on information theory, 42(1):40–47.
  • Shtar’kov (1987) Shtar’kov, Y. M. (1987). Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17.
  • Takeuchi and Barron (1998) Takeuchi, J. and Barron, A. R. (1998). Asymptotically minimax regret by bayes mixtures. In IEEE International Symposium on Information Theory - Proceedings.
  • Takeuchi and Barron (2013) Takeuchi, J. and Barron, A. R. (2013). Asymptotically minimax regret by bayes mixtures for non-exponential families. In Information Theory Workshop (ITW), 2013 IEEE, pages 1–5. IEEE.
  • Watanabe and Roos (2015) Watanabe, K. and Roos, T. (2015). Achievability of asymptotic minimax regret by horizon-dependent and horizon-independent strategies. The Journal of Machine Learning Research, 16(1):2357–2375.
  • Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
  • Xie and Barron (2000) Xie, Q. and Barron, A. R. (2000). Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46(2):431–445.
  • Yamanishi (1998) Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44(4):1424–1439.


Appendix A Asymptotic Lower Bound of Shtarkov Complexity for Standard Normal Location Models

We show an asymptotic lower bound of the Shtarkov complexity of standard normal location models.

Lemma 8

Consider the d-dimensional standard normal location model, given by the densities f_θ(x) = (2π)^{−d/2} exp(−‖x − θ‖²/2), where θ ∈ ℝ^d. Let λ(θ) = Σ_{j=1}^d w_j|θ_j| for w_j > 0. Then the luckiness minimax regret is bounded from below by a quantity with the same per-coordinate exponent, e^{−w_j²/2}, as the upper bound of Lemma 6.

Proof  By the definition of the Shtarkov complexity (3), the luckiness minimax regret can be written as an integral involving Φ, the standard normal distribution function, evaluated coordinate-wise. Now, by Komatu (1955), the Gaussian tail is bounded below as 1 − Φ(t) ≥ 2φ(t)/(t + √(t² + 4)), where φ is the standard normal density, which yields the lower bound of interest after a few lines of elementary calculation.  ∎

Appendix B Lower Bound on Minimax Regret of Smooth Models

We describe how we adapt the minimax risk lower bound to show the minimax-regret lower bound.

The outline of the proof is based on Donoho and Johnstone (1994). First, the so-called three-point prior is constructed to approximate the least favorable prior. Then, since the approximate prior violates the ℓ_1-constraint, the degree of the violation is shown to be appropriately bounded so as to derive a valid lower bound.

The goal of our proof is to establish a lower bound on the minimax regret with respect to logarithmic losses, whereas their proof concerns the minimax risk with respect to the ℓ_p-loss. Therefore, below we present the proof highlighting (i) an approximate least favorable prior for logarithmic losses over ℓ_1-balls and (ii) the way to bound regrets on the basis of risk bounds.

Let B_1(B) ⊂ ℝ^d be an ℓ_1-ball of radius B. Let X be a d-dimensional normal random variable with mean θ and precision n. We denote its distribution simply by P_θ where no confusion is likely. Let p be the predictor associated with an arbitrary sub-probability distribution. For notational simplicity, we may write p(x) for its density and p_θ(x) for that of P_θ, where the base measure is the Lebesgue measure over ℝ^d.

Consider the risk function R(θ, p) := E_{X∼P_θ}[ln(p_θ(X)/p(X))] and the Bayes risk function R(π) := inf_p E_{θ∼π}[R(θ, p)], where π ranges over prior distributions on ℝ^d. Then the minimax Bayes risk bounds the minimax regret from below.