1 Introduction
As a notion of complexity of predictive models (sets of predictors), the minimax regret has been studied in the literature on online learning (Cesa-Bianchi and Lugosi, 2006) and on the minimum description length (MDL) principle (Rissanen, 1978; Grünwald, 2007). The minimax regret of a model $\mathcal{M}$ is given by

$$\mathrm{REG}(\mathcal{M}) = \inf_{\bar p \in \mathcal{P}} \sup_{x \in \mathcal{X}} \Bigl\{ \ell(\bar p, x) - \inf_{p \in \mathcal{M}} \ell(p, x) \Bigr\}, \qquad (1)$$

where $\ell(p, x)$ denotes the loss of the prediction over data $x$ made by $p$, $\mathcal{P}$ denotes the set of feasible predictions, and $\mathcal{X}$ is the space of data. Here, the data may consist of a sequence of data points, $x = (x_1, \ldots, x_n)$, and the loss may be additive, $\ell(p, x) = \sum_{t=1}^{n} \ell(p, x_t)$, but we keep them implicit for generality. The minimax regret is a general complexity measure in the sense that it is defined without any assumptions on the process generating $x$. For instance, one can bound statistical risks with it regardless of the distribution of the data (Littlestone, 1989; Cesa-Bianchi et al., 2004; Cesa-Bianchi and Gentile, 2008). Therefore, bounding the minimax regret and constructing the corresponding predictor is important for making good and robust predictions.
We consider a model $\mathcal{M}$ parametrized by a real-valued vector $\theta \in \mathbb{R}^d$, $\mathcal{M} = \{p_\theta : \rho(\theta) \le B\}$, where $\rho$ denotes a radius function such as a norm of $\theta$. Thus, we may consider the luckiness minimax regret (Grünwald, 2007),

$$\mathrm{REG}_\lambda(\mathcal{M}) = \inf_{\bar p \in \mathcal{P}} \sup_{x \in \mathcal{X}} \Bigl\{ \ell(\bar p, x) - \inf_{\theta \in \mathbb{R}^d} \bigl[ \ell(\theta, x) + \lambda \rho(\theta) \bigr] \Bigr\}, \qquad (2)$$

instead of the original minimax regret. Here, we abuse the notation $\ell(\theta, x) = \ell(p_\theta, x)$. There are at least three reasons for adopting this formulation. Firstly, as we do not assume an underlying distribution of $x$, it may be more plausible to pose a soft restriction as in (2) than the hard restriction in (1). Secondly, it is straightforward to show that the luckiness minimax regret bounds the minimax regret from above; thus, it often suffices to bound the former in order to bound the latter. Finally, the luckiness minimax regret includes the original minimax regret as the special case in which the penalty is zero if $\rho(\theta) \le B$ and infinite otherwise. Therefore, we may avoid possible computational difficulties of the minimax regret by choosing the penalty carefully.
That being said, a closed-form expression of the exact (luckiness) minimax regret is intractable except in a few special cases (e.g., Shtar'kov (1987); Koolen et al. (2014)).
However, if we focus on information-theoretic settings, i.e., the model $\mathcal{M}$ is a set of probability distributions, everything becomes explicit. Now, let the predictors be sub-probability distributions and adopt the logarithmic loss $\ell(p, x) = -\log p(x)$ with respect to an appropriate base measure $\nu$, such as the counting or Lebesgue measure. Note that a number of practically important problems such as logistic regression and data compression can be handled within this framework. With the logarithmic loss, the closed form of the luckiness minimax regret is given by Shtar'kov (1987); Grünwald (2007) as

$$\mathrm{REG}_\lambda(\mathcal{M}) = \log \int_{\mathcal{X}} \sup_{\theta \in \mathbb{R}^d} p_\theta(x)\, e^{-\lambda\rho(\theta)}\, d\nu(x), \qquad (3)$$

where the pointwise supremum corresponds to the minimum operator $\min_\theta \{\ell(\theta, x) + \lambda\rho(\theta)\}$ in (2). We refer to the left-hand-side value as the Shtarkov complexity. Moreover, when all the distributions in $\mathcal{M}$ are i.i.d. regular distributions of sequences $x = (x_1, \ldots, x_n)$, under some regularity conditions, the celebrated asymptotic formula (Rissanen, 1996; Grünwald, 2007) is given by

$$\mathrm{REG}_\lambda(\mathcal{M}) = \frac{d}{2} \log \frac{n}{2\pi} + \log \int \sqrt{\det I(\theta)}\, e^{-\lambda\rho(\theta)}\, d\theta + o(1), \qquad (4)$$

where $I(\theta)$ is the Fisher information matrix and $o(1) \to 0$ as $n \to \infty$. More importantly, although the exact minimax-regret predictor achieving (3) is still intractable, the asymptotic formula implies that it is asymptotically achieved by the Bayesian predictor associated with the tilted Jeffreys prior $w(\theta) \propto \sqrt{\det I(\theta)}\, e^{-\lambda\rho(\theta)}$.
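As a quick numerical illustration of (3) and (4) (our own, not part of the original exposition), the Bernoulli model admits an exact Shtarkov sum, which can be compared against the asymptotic formula; here $d = 1$ and the Jeffreys integral $\int_0^1 \sqrt{I(\theta)}\, d\theta$ equals $\pi$:

```python
import math

def shtarkov_bernoulli(n):
    # log sum over k of C(n, k) * (k/n)^k * ((n-k)/n)^(n-k):
    # the maximized likelihood summed over all outcomes (0^0 = 1 by convention)
    total = sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                for k in range(n + 1))
    return math.log(total)

def asymptotic(n):
    # (d/2) log(n / 2 pi) + log of the Jeffreys integral, with d = 1 and integral = pi
    return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

for n in [10, 100, 1000]:
    print(n, round(shtarkov_bernoulli(n), 4), round(asymptotic(n), 4))
```

The two values agree to within a few hundredths of a nat already at $n = 1000$, which is the sense in which (4) approximates (3) in the fixed-$d$, large-$n$ regime.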
Here, our research questions are as follows. First, (Q1) how can we evaluate the minimax regret in modern high-dimensional contexts? In particular, the asymptotic formula (4) breaks down in high-dimensional learning problems where $d$ grows with $n$. The exact evaluation of the Shtarkov complexity (3), on the other hand, is often intractable due to the minimum operator inside the integral. Second, (Q2) how can we achieve the minimax regret with computationally feasible predictors? It is important to provide a counterpart of the tilted Jeffreys prior in order to make actual predictions.
Regarding the above questions, our contribution is summarized as follows:

We introduce the envelope complexity, a nonasymptotic approximation of the Shtarkov complexity that allows systematic computation of upper bounds on it and of predictors achieving these bounds. In particular, we show that the regret of such a predictor is characterized by the smoothness of the loss.

We demonstrate its usefulness by giving a Bayesian predictor that adaptively achieves the minimax regret within a factor of two over any high-dimensional smooth models under constraints .
The rest of the paper is organized as follows. In Section 2, we introduce the notion of the Bayesian minimax regret as an approximation of the minimax regret within a 'feasible' set of predictors. We then develop a complexity measure called the envelope complexity in Section 3 as a mathematical abstraction of the Bayesian minimax regret, and present a collection of techniques for bounding the Shtarkov complexity by the envelope complexity. In Section 4, we utilize the envelope complexity to construct a near-minimax Bayesian predictor under penalization, namely the spike-and-tails (ST) prior. We also show that it achieves the minimax rate under high-dimensional asymptotics. In Section 5, we present numerical experiments that visualize our theoretical results. A discussion of these results in comparison to existing studies is given in Section 6. Finally, we conclude the paper in Section 7.
2 Bayesian Minimax Regret
The minimax regret with logarithmic loss is given by the Shtarkov complexity (3). The computation of the Shtarkov complexity is often intractable if we consider practical models such as deep neural networks. This is because the landscapes of the loss functions are as complex as the models themselves, and hence their minima, and the complexity, which is an integral over a function of these minima, are not tractable. Moreover, the computation of the optimal predictor is often intractable even when the complexity is given. For instance, the minimax-regret prediction for Bernoulli models over outcomes incurs a nontrivial computational cost. Of course, there exist some special cases for which closed forms are known; however, so far they are limited to exponential families. One cause of this issue is that we seek the best predictor among all possible predictors, i.e., all probability distributions. This is so general that it may not be possible to compute either the complexity or the predictor. To avoid this difficulty, we narrow the set of feasible predictors to Bayesian predictors. Let a positive measure over the parameter space be given, which we may refer to as a pre-prior, and consider the Bayesian predictor associated with this prior. Then we have
(5) 
where denotes the integral operation with respect to . Now, we consider the Bayesian (luckiness) minimax regret given by
One advantage of considering the Bayesian minimax regret is that, given a measure, one can compute the corresponding predictor analytically or numerically utilizing techniques developed in the literature on Bayesian inference. In particular, a number of sophisticated variants of Markov chain Monte Carlo (MCMC) methods, such as stochastic gradient Langevin dynamics (Welling and Teh, 2011), have been developed for sampling from complex posteriors. Note that there does exist a case where the Bayesian minimax regret strictly differs from the minimax regret; see Barron et al. (2014) for an example. This implies that narrowing the range of predictors to Bayesian ones may worsen the achievable worst-case regret. However, as we will show shortly, the gap between these two minimax regrets can be controlled depending on the model.
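As a toy sketch of this gap (our illustration; see Xie and Barron (2000) for the rigorous analysis), one can compare the Shtarkov complexity of the Bernoulli model with the worst-case regret of one particular Bayesian predictor, the Beta(1/2, 1/2) (Jeffreys) mixture:

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_maxlik(n, k):
    # maximized log-likelihood of a binary sequence with k ones
    if k in (0, n):
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log(1 - k / n)

def worst_regret_jeffreys(n):
    # worst-case regret of the Bayes mixture under the Beta(1/2, 1/2) prior;
    # the mixture probability of a sequence with k ones is B(k+1/2, n-k+1/2) / B(1/2, 1/2)
    return max(log_maxlik(n, k) - (log_beta(k + 0.5, n - k + 0.5) - log_beta(0.5, 0.5))
               for k in range(n + 1))

def shtarkov(n):
    return math.log(sum(math.comb(n, k) * math.exp(log_maxlik(n, k))
                        for k in range(n + 1)))

n = 200
print(worst_regret_jeffreys(n) - shtarkov(n))  # a small positive gap
```

The printed gap is strictly positive: the Jeffreys mixture is asymptotically minimax only in the interior of the parameter space, and its worst case (attained at the boundary) exceeds the Shtarkov complexity by a small constant.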
3 Envelope Complexity
We have introduced the Bayesian minimax regret . In this section, we present a set representation of Bayesian minimax regret, namely the envelope complexity . Then, we show that the Shtarkov complexity is bounded by the envelope complexity and the envelope complexity can be easily bounded even if the models are complex.
3.1 Set Representation of Bayesian Minimax Regret
The envelope complexity is a simple mathematical abstraction of the Bayesian minimax regret and gives a fundamental basis for systematic computation of upper bounds on the (Bayesian) minimax regret. Let a set of continuous functions be given, which are not necessarily logarithmic. Define its Bayesian envelope as
and define the envelope complexity as
Then, the envelope complexity characterizes Bayesian minimax regret.
Theorem 1 (Set representation)
Let . Then, all measures in the envelope satisfy
Moreover, we have
Proof Let . Observe that
Then, since for all , we have the first inequality.
Note that for any , and whenever . Then we have
yielding the second equality.
This completes the proof.
We have seen that the envelope complexity is equivalent to the Bayesian minimax regret. Below, we present the upper bounds on the Shtarkov complexity on which we base the rest of the paper.
Theorem 2 (Bounds on Shtarkov complexity)
Let where is logarithmic. Then, for all , we have
Proof
The first inequality follows from the fact that the Bayesian minimax regret is no less than the minimax regret, as the range of the infimum is shrunk from all predictors to the Bayes class. The second inequality follows from the definition of the envelope complexity.
This completes the proof.
3.2 Useful Lemmas for Evaluating Envelope Complexity
Next, we show several lemmas that highlight the computational advantages of the envelope complexity. We start by showing that the envelope complexity is easily evaluated via the surrogate relation. We say that a function is a surrogate of another function if and only if it dominates it pointwise, which is denoted by . Moreover, if there is a one-to-one correspondence between and such that , then we may write .
Lemma 3 (Monotonicity)
Let . Then we have
and therefore
Proof Note that if , which means . Also, as increasing the argument from to only strengthens the predicate of the envelope, we have . Therefore, we have
This is especially useful when the loss functions are complex but there exist simple surrogates .
Consider any model for which the landscapes of the associated loss functions are not fully understood and the evaluation of is expensive. Then it is impossible to check whether is in the envelope, , and therefore Theorem 2 cannot be applied directly. However, even in such cases, one can possibly find a surrogate class of . If the surrogate is simple enough for checking whether , it is possible to bound the envelope complexity by utilizing Lemma 3 together with Theorem 2.
In what follows, we consider a specific instance of the surrogate relation based on smoothness. A function is upper smooth if and only if, for all , there exists such that
(6) 
Note that upper smoothness is weaker than (Lipschitz) smoothness. Thus, if is upper smooth and has at least one minimum , we can construct a simple quadratic surrogate of , .
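To make this construction concrete, here is a small self-contained check (the data, the one-dimensional logistic loss, and the smoothness constant below are our own illustrative choices, not the paper's): since the logistic loss has second derivative at most beta = sum of x_i^2 / 4, the quadratic expansion around the minimizer dominates the loss everywhere.

```python
import math

def logistic_loss(theta, data):
    # data: list of (x, y) with labels y in {-1, +1}
    return sum(math.log1p(math.exp(-y * x * theta)) for x, y in data)

data = [(1.0, 1), (0.5, -1), (2.0, 1), (1.5, -1)]
beta = 0.25 * sum(x * x for x, _ in data)   # smoothness constant of the logistic loss

# crude minimization by ternary search on a bracket
lo, hi = -10.0, 10.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if logistic_loss(m1, data) < logistic_loss(m2, data):
        hi = m2
    else:
        lo = m1
theta_hat = (lo + hi) / 2
f_min = logistic_loss(theta_hat, data)

# the quadratic surrogate built at the minimizer dominates the loss everywhere
ok = all(
    logistic_loss(t, data) <= f_min + 0.5 * beta * (t - theta_hat) ** 2 + 1e-9
    for t in [i / 10 - 5 for i in range(101)]
)
print(ok)  # True
```

Since the gradient vanishes at the minimizer, upper smoothness alone yields the surrogate; no convexity of the original loss is needed for this step.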
Motivated by the smoothness assumption, below we present more specific bounds for quadratic functions. Let be the set of all quadratic functions with curvature one, defined as . Moreover, for all sets of loss functions and penalty functions , we write . Then, the envelope complexity of is evaluated with that of .
Lemma 4 (Bounds of smoothness)
Suppose that all are upper smooth. Let be the scaling function. Then we have
and moreover,
Proof Note that since is a set of upper smooth functions. Observe that, for all ,
where and range over . Thus, by Lemma 3, we have . This proves the inclusion. Now we also have
which yields the inequality.
This lemma shows that, as long as we consider the envelope complexity of a set of upper smooth functions , it suffices, for bounding it from above, to evaluate the envelope complexity of the penalized quadratic functions .
Further, according to the lemma below, we can restrict ourselves to one-dimensional parametric models without loss of generality if the penalty function is separable. Here, is said to be separable if and only if it can be written in the form of .
Lemma 5 (Separability)
Suppose that is separable. Then, the envelope complexity of is bounded by a separable function, i.e.,
where is the set of normalized onedimensional quadratic functions with curvature one, .
Proof Note that every is separable, i.e., where and . Let . Then we have
Summary
We have defined the Bayesian envelope and the envelope complexity. The envelope complexity is equal to the Bayesian minimax regret if is the set of penalized logarithmic loss functions. Any measure in the Bayesian envelope can be utilized for bounding the Shtarkov complexity through the envelope complexity. Most importantly, the envelope complexity satisfies several useful properties such as monotonicity, parametrization invariance, and separability. In particular, the monotonicity differentiates the envelope complexity from the Shtarkov complexity.
4 The SpikeandTails Prior for HighDimensional Prediction
We leverage the envelope complexity to give a Bayesian predictor closely achieving the minimax regret where , namely the one based on the spike-and-tails prior. Moreover, the predictor is shown to be approximately minimax also without luckiness, where .
4.1 Envelope Complexity for Penalties
Let be the weighted norm given by
(7) 
where . Let be the spike-and-tails (ST) prior over given by
(8)  
(9) 
where denotes Kronecker's delta measure at . We call it the spike-and-tails prior because it consists of a delta measure (spike) and two exponential distributions (tails), as shown in Figure 1.
Then, the envelope complexities for quadratic loss functions can be bounded as follows.
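For illustration, sampling from a prior of this spike-plus-exponential-tails shape can be sketched as follows; the parameter names and values here are ours, whereas the paper fixes them in terms of the penalty weight and the smoothness:

```python
import random

def sample_st(w, b, lam, rng):
    # w: spike mass at zero; b: offset where the tails start; lam: tail decay rate.
    # These parameters are illustrative placeholders, not the paper's choices.
    if rng.random() < w:
        return 0.0                                  # the spike: a point mass at 0
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * (b + rng.expovariate(lam))        # one of the two exponential tails

rng = random.Random(0)
draws = [sample_st(0.5, 1.0, 2.0, rng) for _ in range(10000)]
spike_frac = sum(1 for t in draws if t == 0.0) / len(draws)
print(spike_frac)                                     # close to the spike mass 0.5
print(min(abs(t) for t in draws if t != 0.0) >= 1.0)  # True: tails start at |theta| = b
```

The point mass at zero is what allows the predictor to exploit sparsity, while the exponential tails keep the regret bounded when the data favor large parameters.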
Lemma 6 (Sharp bound on envelope complexity)
Proof Consider the logarithmic loss functions of the d-dimensional standard normal location model, given by , and let . Note that . Then, the lower bound follows from Lemma 8 in Section A with .
Note that is separable and by Lemma 4, we restrict ourselves to the case of . Let and be positive real numbers. Let be a measure over the real line, where denotes the delta measure and denotes the Lebesgue measures restricted to . That is, we have for measurable sets . Then we have
(10) 
We want to minimize (10) with respect to . Let . Then we have if , and otherwise. It suffices for and to have for all . Here, we only care about the case of since it is symmetric with respect to and trivially we have for all . Now, for , we have
Let .
Thus, a sufficient condition for is that , which is satisfied with .
Finally, evaluating (10) at yields the ST pre-prior .
Therefore, we have
and the upper bound is shown.
The equality is a result of straightforward calculation of .
According to Lemma 6,
the ST prior bounds the envelope complexity at a quadratic rate as .
The exponent, , is sharp, since the lower bound has the same exponent.
This gives an upper bound on the envelope complexity for general smooth loss functions. Let and be the scale-corrected ST (pre-)prior given by
Corollary 1
If all are upper smooth with respect to , and if is given by (7), then and therefore
4.2 Regret Bound with the ST Prior
Now, we utilize Corollary 1 to bound the actual prediction performance of the ST prior. Here, we consider the scenario of online learning under constraint.
Setup
Let be a sequence of outcomes. Let be a logarithmic loss function such that . Then, the conditional Bayesian preposterior with respect to given is given by
The online regret of the predictor is defined as
(11) 
Now, we can bound the online regret of the ST prior as follows.
Theorem 7 (Adaptive minimaxity over balls)
Suppose that are upper smooth and logarithmic. Let . Take . Then, with , we have
for all . Moreover, this is the adaptive minimax rate, and it cannot be improved by more than a factor of two even if is fixed and non-Bayesian predictors are allowed.
Proof Let be the cumulative loss, , and observe that is upper smooth and logarithmic. Let and . Also, let be the indicator penalty of the set such that if and only if and otherwise . Then, we have where is taken with respect to . Now, observe that
which, combined with Corollary 1 where ,
yields the asymptotic equality.
The proof of the minimaxity is adapted from the existing analysis of the minimax risk (see Section B for the rigorous proof and Section 6.5 for a detailed discussion).
5 Visual Comparison of the ST Prior and the Tilted Jeffreys Prior
Now, we verify the results on the regularization obtained above. In particular, we compare the worstcase regrets achievable with Bayesian predictors to the minimax regret, i.e., the Shtarkov complexity.
Setting
We adopted the one-dimensional quadratic loss functions with curvature one, , and the penalty function, . We varied the penalty weight from to and observed how the worst-case regret of each Bayesian predictor changes. Specifically, we employed the spike-and-tails (ST) prior (9) and the tilted Jeffreys prior for the predictors. Note that, in this case, the tilted Jeffreys prior is simply the double exponential prior given by .
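The mechanics of this experiment can be sketched numerically as follows (a rough reconstruction with our own grids and penalty weight, not the paper's exact code): each quadratic loss is treated as a Gaussian negative log-likelihood, the penalized minimum is obtained by soft-thresholding, and the worst-case regret of a Bayes mixture is compared with the luckiness Shtarkov complexity.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def min_penalized_loss(x, lam):
    # min over theta of (x - theta)^2 / 2 + lam * |theta|, plus the Gaussian
    # normalization log sqrt(2 pi); the minimizer is the soft-thresholding of x
    t = max(0.0, abs(x) - lam)                     # magnitude of the minimizer
    return (abs(x) - t) ** 2 / 2 + lam * t + math.log(SQRT2PI)

def worst_case_regret(weights, thetas, lam, xs):
    # worst-case regret of the Bayes mixture p(x) = sum_i w_i * N(x; theta_i, 1)
    worst = -float("inf")
    for x in xs:
        p = sum(w * math.exp(-(x - th) ** 2 / 2) / SQRT2PI
                for w, th in zip(weights, thetas))
        worst = max(worst, -math.log(p) - min_penalized_loss(x, lam))
    return worst

def shtarkov(lam, xs):
    # luckiness Shtarkov complexity: log-integral of the maximized, penalized
    # likelihood, here approximated by a simple Riemann sum
    dx = xs[1] - xs[0]
    return math.log(sum(math.exp(-min_penalized_loss(x, lam)) for x in xs) * dx)

lam = 5.0
xs = [i / 50 - 10 for i in range(1001)]            # data grid on [-10, 10]
thetas = [i / 100 - 5 for i in range(1001)]        # parameter grid on [-5, 5]
raw = [math.exp(-lam * abs(th)) for th in thetas]  # discretized double exponential
z = sum(raw)
weights = [r / z for r in raw]
wc = worst_case_regret(weights, thetas, lam, xs)
print(wc > shtarkov(lam, xs) > -1e-3)   # True: no predictor beats the Shtarkov bound
```

Swapping different pre-prior weights (e.g., a spike-plus-tails shape) into the same routine in place of the discretized double exponential reproduces the comparison between the two predictors.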
Results
In Figure 2, the worst-case regrets of the ST prior and the tilted Jeffreys prior are shown along with the minimax regret (Optimal). While the regret of the tilted Jeffreys prior is almost the same as the optimal regret where is small, it performs poorly where is large. The ST prior, on the other hand, performs robustly well over the entire range of . Specifically, it converges to zero quadratically where is large. Therefore, since one must take sufficiently large if is large, it is implied that the ST prior is a better choice than the tilted Jeffreys prior.
6 Implications and Discussions
In this section, we discuss interpretations of the results and present solutions to some technical difficulties.
6.1 Gap between and
One may wonder whether there exists a prior that achieves the lower bound where . Unfortunately, the answer is negative. With a technique similar to the higher-order differentiation used by Hedayati and Bartlett (2012), we can show that, if is convex and not differentiable like the norm, then the gap is nonzero, i.e., . The detailed statement and proof are given in Section C.
6.2 Infinitedimensional Models
If the dimensionality of the parameter space is countably infinite, the minimax regret with any nonzero radius diverges. In this case, one may apply different penalty weights to different dimensions. For instance, taking different penalty weights for different dimensions, e.g., for and , the separability of the envelope complexity guarantees that . Then, the corresponding countably-infinite tensor product of the one-dimensional ST prior gives a finite regret with respect to the infinite-dimensional model .
6.3 Comparison to the Tilted Jeffreys Prior and Others
There have been previous studies on the minimax regret with Bayesian predictors (Takeuchi and Barron, 1998, 2013; Watanabe and Roos, 2015; Xie and Barron, 2000). In these studies, the Bayesian predictor based on the Jeffreys prior (namely, the Jeffreys predictor) is proved to attain the minimax regret asymptotically under some regularity conditions. The tilted Jeffreys prior, which takes the effect of penalization into account, is given by Grünwald (2007) as , where denotes the Fisher information matrix. In the case of quadratic loss functions, as the Fisher information equals the identity, we have . Therefore, this implies that taking a uniform pre-prior is good for smooth models in the conventional large-sample limit. This is in strong contrast with our result, where a completely nonuniform pre-prior performs better with high-dimensional models.
6.4 Comparison to Online Convex Optimization
So far, we have considered the luckiness minimax regret, which leads to the adaptive minimax regret. Perhaps surprisingly, our minimax regret bound coincides with results given in the literature on online convex optimization, where different assumptions on the loss functions and predictors are made. Specifically, with , the regret bound reduces to . This coincides with the standard no-regret rates of online learning, such as those of the Hedge algorithm (Freund and Schapire, 1997) and high-dimensional online regression (Gerchinovitz and Yu, 2014), where is referred to as the number of trials and as the number of experts or dimensions. Moreover, with , the regret bound reduces to . This is equal to the minimax-regret rate achieved under large-sample asymptotics, such as in Hazan et al. (2007); Cover (2011).
Note that the conditions assumed in these two regimes are somewhat different. In our setting, the loss functions are assumed to be upper smooth and to satisfy a normalizing condition (to be logarithmic losses), whereas boundedness and convexity of the loss functions are often assumed in online learning. Moreover, we have employed Bayesian predictors, whereas simpler online predictors are typically used in the context of online learning.
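For reference, the Hedge rate mentioned above can be checked in a few lines (a standard textbook implementation, not code from this paper): with losses in [0, 1] and the tuned learning rate, the regret against the best expert never exceeds the sqrt((T/2) ln N) bound.

```python
import math, random

def hedge(losses, eta):
    # exponential-weights (Hedge) aggregation; losses: T x N matrix, entries in [0, 1]
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for round_losses in losses:
        s = sum(w)
        total += sum(wi / s * l for wi, l in zip(w, round_losses))  # expected loss
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, round_losses)]
    best = min(sum(col) for col in zip(*losses))   # cumulative loss of the best expert
    return total - best

rng = random.Random(1)
T, n = 1000, 16
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
eta = math.sqrt(8 * math.log(n) / T)               # tuned learning rate
regret = hedge(losses, eta)
print(regret <= math.sqrt(T * math.log(n) / 2))    # True: sqrt((T/2) ln N) bound
```

The bound holds for arbitrary (even adversarial) loss sequences in [0, 1], which is the sense in which this rate is distribution-free, like the regret bounds in this paper.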
6.5 Comparison to Minimax Risk over balls
In the literature on high-dimensional statistics, the minimax rate of the statistical risk is also achieved with regularization (Donoho and Johnstone, 1994) when the true parameter is in the unit ball. Although both risk and regret are performance measures of prediction, there are two notable differences. One is that risks are calculated under some assumptions on the true statistical distribution, whereas regrets are defined without any assumptions on the data. The other is that risks are typically considered with in-model predictors, i.e., predictors restricted to a given model, whereas regrets are often considered with out-model predictors such as Bayesian predictors and online predictors. Therefore, the minimax regret can be regarded as a more agnostic complexity measure than the minimax risk.
If we assume Gaussian noise models and adopt logarithmic loss functions, the minimax rate of the risk is given as according to Donoho and Johnstone (1994). Interestingly, this is the same as the rate of the regret bound given by Theorem 7 where . Moreover, the minimax-risk optimal penalty weight is also minimax-regret optimal in this case. Therefore, if the dimensionality is large enough compared to ( in the case of online learning), making no distributional assumption on the data costs nothing in terms of the minimax rate.
7 Conclusion
In this study, we presented a novel characterization of the minimax regret for logarithmic loss functions, called the envelope complexity, for regularization problems. The virtue of the envelope complexity is that it is much easier to evaluate than the minimax regret itself and that it produces upper bounds systematically. Using the envelope complexity, we proposed the spike-and-tails (ST) prior, which almost achieves the luckiness minimax regret against smooth loss functions under penalization. We also showed that the ST prior adaptively achieves the 2-approximate minimax regret under high-dimensional asymptotics . In the experiments, we confirmed our theoretical results: the ST prior outperforms the tilted Jeffreys prior where the dimensionality is high, whereas the tilted Jeffreys prior is optimal if .
Limitation and future work
The present work relies on the assumptions of smoothness and the logarithmic property of the loss functions. The smoothness assumption may be removable by considering the smoothing effect of stochastic algorithms such as stochastic gradient descent, as in Kleinberg et al. (2018). As for the logarithmic assumption, it may be generalized to evaluate complexities with non-logarithmic loss functions with the help of tools developed in the literature of information theory, such as in Yamanishi (1998). Finally, since our regret bound with the ST prior is quite simple (it involves only the smoothness and the radius, except for the logarithmic term), applying these results to concrete models such as deep learning models would be interesting future work, as would a comparison with existing generalization error bounds.
References
 Barron et al. (2014) Barron, A., Roos, T., and Watanabe, K. (2014). Bayesian properties of normalized maximum likelihood and its fast computation. In IEEE International Symposium on Information Theory  Proceedings.
 Cesa-Bianchi et al. (2004) Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057.
 Cesa-Bianchi and Gentile (2008) Cesa-Bianchi, N. and Gentile, C. (2008). Improved risk tail bounds for online algorithms. IEEE Transactions on Information Theory, 54(1):386–390.
 Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
 Cover (2011) Cover, T. M. (2011). Universal portfolios. In The Kelly Capital Growth Investment Criterion: Theory and Practice, pages 181–209. World Scientific.
 Donoho and Johnstone (1994) Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over ℓp-balls for ℓp-error. Probability Theory and Related Fields, 99(2):277–303.
 Freund and Schapire (1997) Freund, Y. and Schapire, R. E. (1997). A decisiontheoretic generalization of online learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139.

 Gerchinovitz and Yu (2014) Gerchinovitz, S. and Yu, J. Y. (2014). Adaptive and optimal online linear regression on ℓ1-balls. Theoretical Computer Science, 519:4–28.
 Grünwald (2007) Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press.
 Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.

 Hedayati and Bartlett (2012) Hedayati, F. and Bartlett, P. L. (2012). The optimality of Jeffreys prior for online density estimation and the asymptotic normality of maximum likelihood estimators. In Conference on Learning Theory, pages 7–1.
 Kleinberg et al. (2018) Kleinberg, R., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? arXiv preprint arXiv:1802.06175.
 Komatu (1955) Komatu, Y. (1955). Elementary inequalities for Mills' ratio. Rep. Statist. Appl. Res. Un. Jap. Sci. Engrs, 4:69–70.
 Koolen et al. (2014) Koolen, W. M., Malek, A., and Bartlett, P. L. (2014). Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems, pages 3230–3238.

 Littlestone (1989) Littlestone, N. (1989). From online to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269–284.
 Rissanen (1978) Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5):465–471.
 Rissanen (1996) Rissanen, J. J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47.
 Shtar’kov (1987) Shtar’kov, Y. M. (1987). Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17.
 Takeuchi and Barron (1998) Takeuchi, J. and Barron, A. R. (1998). Asymptotically minimax regret by bayes mixtures. In IEEE International Symposium on Information Theory  Proceedings.
 Takeuchi and Barron (2013) Takeuchi, J. and Barron, A. R. (2013). Asymptotically minimax regret by bayes mixtures for nonexponential families. In Information Theory Workshop (ITW), 2013 IEEE, pages 1–5. IEEE.
 Watanabe and Roos (2015) Watanabe, K. and Roos, T. (2015). Achievability of asymptotic minimax regret by horizondependent and horizonindependent strategies. The Journal of Machine Learning Research, 16(1):2357–2375.
 Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
 Xie and Barron (2000) Xie, Q. and Barron, A. R. (2000). Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, 46(2):431–445.
 Yamanishi (1998) Yamanishi, K. (1998). A decisiontheoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44(4):1424–1439.
Appendix A Asymptotic Lower Bound of Shtarkov Complexity for Standard Normal Location Models
We show an asymptotic lower bound of the Shtarkov complexity of standard normal location models.
Lemma 8
Consider the d-dimensional standard normal location model, given by , where . Let for . Then we have
Proof By definition of , we have
where denotes the standard normal distribution function. Now, by Komatu (1955), the normal tail probability is bounded below in terms of the standard normal density, which yields the lower bound of interest after a few lines of elementary calculation.
Appendix B Lower Bound on Minimax Regret of Smooth Models
We describe how we adapt the minimax-risk lower bound to show the minimax-regret lower bound.
The outline of the proof is based on Donoho and Johnstone (1994). First, the so-called three-point prior is constructed to approximate the least favorable prior. Then, since the approximate prior violates the constraint, the degree of the violation is shown to be appropriately bounded so as to derive a valid lower bound.
The goal of our proof is to establish a lower bound on the minimax regret with respect to logarithmic losses, whereas their proof is about the minimax risk with respect to loss. Therefore, below we present the proof highlighting (i) an approximate least favorable prior for logarithmic losses over balls and (ii) the way to bound regrets on the basis of risk bounds.
Let be a ball. Let be a d-dimensional normal random variable with mean and precision . We denote the distribution simply by when no confusion is likely. Let be a predictor associated with any sub-probability distribution . For notational simplicity, we may write and , where is the Lebesgue measure over .
Consider the risk function
and the Bayes risk function
where denotes prior distributions on . Then, the minimax Bayes risk bounds the minimax regret from below,