Model selection is a problem which underpins the field of machine learning. Key to its formulation is the notion of learning an appropriate predictor, from an underlying model class, , based on input training examples , with each . Typically, the predictor, , is chosen so as to minimize some risk functional; that is, with , where is the risk functional, and
denotes the probability density function (pdf) over the data. Fundamentally, the aim of such an approach is to ensure that the chosen predictor provides good generalization capability, so that after training it minimizes the out-of-sample test error bishop2006pattern. This is historically estimated via the Akaike information criterion (AIC) akaike1973information, the Bayesian information criterion (BIC) schwarz1978estimating, or through cross validation bishop2006pattern. AIC and BIC are derived based on asymptotic assumptions in the sample size , and work similarly. Moreover, both criteria suggest that out-of-sample error increases as (footnote 1: technically, AIC has a model complexity term, and BIC has a model complexity term; we will refer to the implied effects of both model complexities simply as the “ terms”), suggesting that an over-parameterized model should generalize poorly, which is consistent with traditional empirical evidence in the form of U-shaped train-test curves bishop2006pattern.
have tried to infer some similarities between ReLU networks and traditional kernel models, and Geiger et al. geiger2020scaling have tried to connect the double descent cusp-like behaviour with diverging norms, through a neural tangent kernel framework. In addition, double descent risk has been explored in a variety of simpler (and shallow) model classes ghorbani2019linearized; ba2020generalization; d2020double, with various risk asymptotics established bartlett2019benign; hastie2019surprises. Lastly, and rather interestingly, it has been found that there exist certain parallels between double descent risk and the notion of the jamming transition, which occurs in physical materials that undergo a phase transition spigler2018jamming; geiger2019jamming.
In this paper, we attempt to understand the double descent phenomenon from a geometrically driven perspective. This will be achieved via the notion of a model volume, and shown to have interpretations which stem from information theory (coding theory, and signal analysis), as well as Occam’s razor. We find that for a variety of simple statistical models, the model volume can decrease with increasing , which ultimately serves to overpower the magnitude of the lower order model complexity term: . This idea is also clarified in Figure 1.
1.1 Model Selection and Occam’s Razor
In the late 90s and early 2000s, extensions to the base AIC and BIC formulations were developed by Rissanen rissanen1997stochastic and Balasubramanian balasubramanian2005mdl, which include additional model-specific terms. From the perspective of coding theory, Rissanen developed a notion of stochastic model complexity, which builds upon Shannon’s information criteria used for lossless encoding shannon1948mathematical. Upon this notion, Rissanen formalized an extension of binary Shannon entropy to continuous function classes, via the discretization of the model manifold over approximately equivalent model classes. This approach establishes an intuition behind “model distinguishability”, which is also echoed by Balasubramanian. In particular, under Rissanen’s construction, information is encoded in nats (as opposed to bits), and the resulting criterion is formally recognized as the Minimum Description Length (MDL). This is shown in Equation (1),
denotes a random vector of data samples, , is the likelihood function evaluated at its optimal parameter setting, with being the space of possible parameter settings, and bishop2006pattern. In a parallel fashion, Balasubramanian approached the problem of model selection from a more Bayesian perspective, which curiously yielded the same equation as Rissanen’s MDL (with the exception of one additional term, and ignoring constants) balasubramanian1996geometric. To achieve this he takes an alternate route based on specifying a Jeffreys prior over the underlying parameter space, which is noted to act as an appropriate measure for the density of distinguishable distributions balasubramanian2005mdl. In both Rissanen’s and Balasubramanian’s model selection criteria, there is a term which acts like the model complexity used in AIC and BIC, and an additional term: , which they collectively refer to as a model distinguishability-like term. Moreover, since these methods are built upon the log-marginal: , they also appeal to a Bayesian Occam’s razor-like principle. This paper aims to explore this additional term: , through a geometric lens. It will be made clear that this term also describes the underlying log-model volume, which we denote by amari2016information. Finally, it will be shown that can in fact decrease with increasing for several simple model classes, thus serving to counteract the behaviour of the model complexity term. By implication, this suggests that certain model classes have the power to generalize well when transitioning into over-parameterized regimes.
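As a small illustration of the additional volume term, consider a Bernoulli model, whose Fisher information is 1/(θ(1−θ)). This model is not one of those studied in this paper; it is used here only because the integral of the square root of the Fisher information has the known closed form π. The sketch below (all function names are hypothetical) evaluates that integral numerically after the substitution θ = sin²(t), which removes the endpoint singularities.

```python
import math

def fisher_bernoulli(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def log_model_volume_bernoulli(num_steps=10_000):
    """log of the integral over (0, 1) of sqrt(I(theta)), via the midpoint rule.

    Substituting theta = sin(t)^2 with t in (0, pi/2) gives
    dtheta = 2 sin(t) cos(t) dt, which cancels the endpoint singularities.
    """
    h = (math.pi / 2) / num_steps
    total = 0.0
    for i in range(num_steps):
        t = (i + 0.5) * h
        theta = math.sin(t) ** 2
        jacobian = 2.0 * math.sin(t) * math.cos(t)
        total += math.sqrt(fisher_bernoulli(theta)) * jacobian * h
    return math.log(total)

log_volume = log_model_volume_bernoulli()
print(log_volume, math.log(math.pi))
```

For a one-parameter model the determinant reduces to the scalar Fisher information, so the printed value is the log-volume of this particular one-dimensional statistical manifold.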
1.2 Information Geometry
Information geometry concerns the application of differential geometry to statistical models. In particular, it considers a statistical manifold, , over a co-ordinate system. A Riemannian metric, , can be placed on , where for each point , with defined as the local tangent space at point on the manifold. Principally, is a generalization of the inner product on Euclidean spaces to Riemannian manifolds. In addition, Amari defines a dually coupled affine co-ordinate system on statistical manifolds. Dually coupled co-ordinates arise naturally from the dually flat property which is intrinsic to many information manifolds. These co-ordinates are known as the (e-flat) and (m-flat) co-ordinates for the exponential family in particular, and are related through the Legendre transformation and via two convex functions amari2007methods. The and co-ordinates for exponential models correspond to the natural and expectation parameters, respectively. Furthermore, the FIM defines a natural Riemannian metric tensor: rao1992information; amari2007methods. Thus the motivation for using information geometry is clear, as Rissanen’s MDL and Balasubramanian’s Occam’s razor both depend on the FIM, which is, geometrically speaking, the metric tensor. Consequently, the term has a clear definition in differential geometry as being the log-volume of the underlying information manifold; that is, the integral of the square root of the determinant of the Fisher information matrix is the manifold volume amari2007methods; jeffreys1946invariant. Unfortunately, the FIM is singular () for many modern statistical models, implying that classical assumptions such as the asymptotic normality of the maximum likelihood estimator do not hold, and that the BIC is not necessarily equal to the Bayes marginal likelihood watanabe2009algebraic. The implication of this will be explored for the studied models.
2 Double Descent Risk and the Volume of Statistical Models
In the following subsections we will explore how the notion of model volume suggests that in some models, increasing can result in a decreasing, or even saturating volume, which ultimately acts to counter-balance the poor generalization properties traditionally implied by the model complexity. This will be demonstrated for three popular models in machine learning: (1) Isotropic linear regression, (2) Statistical lattice models, and (3) Stochastic perceptron units.
2.1 Isotropic Linear Regression
We define isotropic linear regression as: , for a given dataset , with each , where and each . Furthermore, we place a power constraint on such that (for reasons which will become clear soon). It will be shown that for the regime the model volume is well defined (FIM is non-singular), but in the regime of interest () a singular geometry arises leading to zero determinants, thus rendering the volume expression of Equation (1) not meaningful. This singularity will be an issue central to every model investigated, and in particular for linear regression we draw inspiration from innovations in signal processing literature in order to help circumvent it. We begin by revealing the general form of for linear regression, which is elucidated in Theorem 1, where refers to the volume of a -ball of radius .
Theorem 1 (Regression Log Volume).
Corollary 1 ().
The proofs of Theorem 1 and Corollary 1 are provided in Appendix A.1 and A.2, respectively. Theorem 1 makes clear that the expression separates into two rather distinct components: (i) , which we claim possesses a notion of model richness, and (ii) , which is the volume of a -ball scaled against noise, and which defines a degree of model distinguishability. Considering richness, it is clear that it involves the expectation over the data. Whilst not explicit in Theorem 1, it also contains information regarding the model architecture, as this term arises as a by-product of the FIM calculation, which necessitates the second derivative with respect to the underlying model parameters. Thus, this term serves to connect the underlying data distribution to the (usually) implicitly made geometric assumptions of the model’s architecture. Considering now distinguishability, in the calculation of this term manifests as the volume implied by the constraints underlying the parameter space. In the case of Theorem 1, , implying behaves according to a ball geometry, wherein it is known that =0 (footnote 2: the volume of a -ball of radius is: , where ). This implies that the addition of new features at increasingly high can work to counteract the model complexity contribution traditionally implied by AIC and BIC. Moreover, distinguishability is noted to behave as a function of the model noise. This carries with it a nice intuition, as an increasingly noisy model will find it difficult to distinguish between similar datapoints. Thus, even though the traditional understanding of out-of-sample test error, as implied by AIC and BIC, is driven by (i) training error and (ii) model complexity, we can see that model (iii) richness and (iv) distinguishability also seem to play an important role. Ultimately, all the distinct model components (i) - (iv) seem to necessitate investigation, as each will respond in a unique way to the effects of , , and the underlying model architecture.
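The vanishing ball volume referred to above can be checked numerically. The sketch below uses the standard closed form for the volume of a Euclidean ball, π^(k/2) r^k / Γ(k/2 + 1), evaluated in log-space for stability; the dimension name `dim` and the chosen radii are illustrative assumptions rather than the paper's notation.

```python
import math

def log_ball_volume(dim, radius=1.0):
    """log-volume of a dim-dimensional Euclidean ball of the given radius."""
    return (0.5 * dim * math.log(math.pi)
            + dim * math.log(radius)
            - math.lgamma(0.5 * dim + 1.0))

dims = [1, 2, 5, 10, 50, 100, 500]
log_volumes = {d: log_ball_volume(d) for d in dims}
for d, lv in log_volumes.items():
    print(d, lv)
```

For the unit ball the volume first grows, peaks around dimension five, and then decays super-exponentially, which is the behaviour that lets the distinguishability term offset a complexity term that grows with dimension.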
From Corollary 1, it is shown that the richness term simplifies down to if , which is able to add directly onto the already existing term in AIC and BIC. However, extending this analysis to , is difficult as , even though , meaning . Thus we have an increasingly large null-space as increases, serving to exacerbate the singular geometry. By implication, , meaning that a naïve volume calculation lacks meaning. To circumvent this issue, we draw inspiration from results in communication theory. In particular, this line of research is often concerned with the same underlying system: ; however, the interpretations of the two terms, and , are appreciably different. In particular, is interpreted as an input signal, and is the signal mixing matrix. In other words, the known quantity in this literature is , and the unknown quantity acts like a confounding factor, which is contrary to the traditional statistical goal of regression, whereby is known data, and are the unknown coefficients. Moreover, the signal , is typically power constrained: , which in statistics often arises due to regularization, or through consideration of a prior distribution tse2005fundamentals; bishop2006pattern. For communication channels, a quantity of central interest is the channel capacity: . It acts as an upper bound on the rate of reliable information transfer for a communication channel, and it is mathematically equivalent to the supremum of the mutual information: , wherein we consider and clarke1994jeffreys. For the problem of isotropic linear regression with real feature variables, follows Theorem 2 (see Appendix A.4).
Theorem 2 (Channel Capacity).
The channel capacity for the system with , , and , where refers to the signal-to-noise ratio, is given as,
The importance of considering , is that it bears an equivalency (asymptotic in ) to the term in Equation (1). Specifically, its relationship can be summarized in Equation (3), with further clarification provided in Theorems 6 and 7 in Appendix A.3. Thus, is also reflective of out-of-sample test error.
However, not only does relate to out-of-sample test error through Equation (3), it: (i) Allows us to perform calculations for , circumventing the singularity issue, and (ii) Implicitly encodes the required forms of richness and distinguishability. In fact, as a consequence of point (i), it is possible to derive upper and lower bounds (Theorem 9 in Appendix A.5), and further it can be made clear that the channel capacity saturates as , by studying such bounds (Corollary 2 in Appendix A.6). The presence of saturation (that is, an asymptotic behaviour) in such linear models has been recently observed in the work of Hastie et al. hastie2019surprises, using a bias-variance trade off. Moreover, Xu & Raginsky xu2017information have shown that generalization error relates intimately to the notion of input-output mutual information–the supremum of which is known to be . It is interesting that such properties arise naturally in different fields, which in turn have different research aims.
We further study the interpretation of . In Remark 1 (see Appendix A.7), is shown to be upper bounded by an expression containing: , the regression richness term. However, here it is scaled with a factor of , and added against the identity matrix: . It is the presence of this identity matrix which acts as a regularizer against the null space of . This allows one to investigate in the domain of , by working with (a slightly modified) which avoids the singularity issues.
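The regularizing role of the identity matrix can be seen in a few lines of NumPy. This is a sketch under stated assumptions: `scale` is a hypothetical stand-in for the SNR-dependent factor (the exact factor in Remark 1 is not reproduced here), and the matrix sizes are arbitrary. It shows that the Gram matrix is rank deficient when there are more features than samples, while the identity-plus-scaled-Gram matrix has a finite log-determinant that can be computed from either the feature side or the sample side (Sylvester's determinant identity).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 60          # fewer samples than features: X^T X is rank deficient
scale = 0.5            # hypothetical stand-in for the SNR-dependent factor
X = rng.standard_normal((n, p))

gram = X.T @ X                                   # p x p, rank at most n
rank = np.linalg.matrix_rank(gram)

# Regularized version: identity plus scaled Gram matrix is positive definite.
sign_p, logdet_p = np.linalg.slogdet(np.eye(p) + scale * gram)
# Sylvester's determinant identity gives the same value from the n x n side.
sign_n, logdet_n = np.linalg.slogdet(np.eye(n) + scale * (X @ X.T))

print(rank, logdet_p, logdet_n)
```

Working from the n x n side is cheaper when the number of features greatly exceeds the number of samples.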
Remark 1 (Regularized Richness Bound).
In addition to richness, we make clear that the notion of distinguishability arises implicitly in , when considering a sphere packing argument. A common example of sphere packing as it arises in additive white Gaussian noise (AWGN) is provided for the reader in Appendix A.8 tse2005fundamentals. Here however, we focus on showing how it manifests in , considering the model , where each , and where we consider initially fixed (that is, non-random). We base this calculation on two parts: (i) Considering what the model is trying to achieve without noise: , and (ii) The impact of noise. Considering (i), we have established that is zero-mean Gaussian, which implies that will have an ellipsoidal shape. That is, it will be scaled and rotated due to . Moreover, considering that is ultimately a linear map, it is known that: . And since , where each , we are ultimately dealing with a space wherein . This results in , in which most vectors transformed under will lie inside with high probability. Considering now point (ii), since , and , the volume induced by the -ball in regards to the presence of noise is, . Ultimately a log-volume ratio can be calculated as: , which derives for the case of . As a last point of connection, notice that the distinguishability term in Theorem 1 may be re-expressed as: , which is simply the unit -ball scaled against the -th root of the signal-to-noise ratio. Therefore, not only does encode a notion of richness, it also strongly encodes a notion of volume distinguishability, the intuition of which is made clear in Figure 2. From a communication theory perspective, each transmitted signal vector should (i) Be decodable by the receiver (each transmitted codeword should map to a distinct code), and (ii) Maximize the total number of uniquely decodable codes that exist.
In this case, the transmitted signals can be interpreted as , and thus through training one hopes to find such a , which would allow one to map the new, and unseen data, , efficiently throughout the entire ellipsoid, thereby maximizing the generalization capability of the in question. Typically, if is large, then this can become increasingly difficult, as the outer ellipsoid will grow in volume relative to the noise spheres, providing yet another intuition for why we regularize models.
Empirical results for the out-of-sample test error of the isotropic linear regression model, and the impact of the magnitude of , are shown in Figure 3. To generate these results we used sklearn.linear_model.Ridge, as (i) The ridge regression hyper-parameter, , is used to emulate the effect of a power constraint, and (ii) We intend to keep the code as simple as possible to maximize reproducibility. For these experiments, we generated data according to , and we assumed that there exists a true underlying generating process, of some dimension, which we choose without loss of generality as 150. We keep increasing (the number of feature variables of the proposed model fit), up to and beyond 150, ranging from , up to . In this way, we transition from tall matrices to wide matrices, and are able to demonstrate a very clear transition from the classical to the modern train-test risk regimes. In addition, 10-fold cross validation is performed in order to produce the train-test curves, and we opt for three values of the ridge hyper-parameter: , chosen without loss of generality, in order to demonstrate how the double descent behaviour can be slowly switched on and off, and we consider datapoints. Moreover, the random seed is fixed between runs, and each is sampled from , with additive noise of , so that we work with a fixed SNR between each case. Lastly, the experiments were run on a 64-bit Windows OS, with an Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz, since these experiments are designed to be straightforward, and reproducible by everyone.
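A minimal sketch of an experiment of this kind is given below. It is not the code used to produce Figure 3: it swaps sklearn.linear_model.Ridge for a closed-form ridge solution computed through the SVD (so that only NumPy is needed), replaces 10-fold cross validation with a single held-out test set, and uses assumed values for the sample size, noise level, feature grid and ridge penalties. Only the true generating dimension of 150 and the fixed random seed are taken from the description above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients via the SVD of X (no intercept)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s ** 2 + lam)) * (U.T @ y))

def risk_curve(lam, feature_counts, n_train=40, n_test=1000,
               d_true=150, noise_std=0.5, seed=0):
    """Train/test mean squared error as the number of fitted features grows."""
    rng = np.random.default_rng(seed)          # fixed seed between runs
    d_max = max(max(feature_counts), d_true)
    beta = np.zeros(d_max)
    beta[:d_true] = rng.standard_normal(d_true) / np.sqrt(d_true)
    X_train = rng.standard_normal((n_train, d_max))
    X_test = rng.standard_normal((n_test, d_max))
    y_train = X_train @ beta + noise_std * rng.standard_normal(n_train)
    y_test = X_test @ beta + noise_std * rng.standard_normal(n_test)
    train_mse, test_mse = [], []
    for p in feature_counts:
        w = ridge_fit(X_train[:, :p], y_train, lam)   # fit on first p features
        train_mse.append(np.mean((X_train[:, :p] @ w - y_train) ** 2))
        test_mse.append(np.mean((X_test[:, :p] @ w - y_test) ** 2))
    return np.array(train_mse), np.array(test_mse)

feature_counts = [10, 20, 40, 80, 150, 300]
curves = {lam: risk_curve(lam, feature_counts) for lam in (1e-8, 1.0, 100.0)}
for lam, (train_mse, test_mse) in curves.items():
    print(lam, np.round(test_mse, 3))
```

With a near-zero penalty the fit interpolates the training data once the number of features reaches the number of samples, and the test error at that threshold is far larger than under a strong penalty, which is the qualitative pattern discussed for Figure 3.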
Firstly, in Figure 3 we can see that the classical (U-shaped), and modern (double descent) regimes are visible. As noted by many researchers, the peak of the observed double descent phenomenon (the interpolation threshold), occurs at belkin2019reconciling; hastie2019surprises. When considering the train-test curves based on , it is evident that out-of-sample test error indeed peaks at . However, for continually increasing , all the test risk curves reach a point of saturation, regardless of the choice of . This was implied to occur due to the saturating effect of , when . Moreover, it is clear that for each run, the train-test curves asymptote to the same error magnitude, which results from the fixed SNR between each run. Such a behaviour is predicted by Hastie et al. hastie2009elements, wherein they claim that the asymptotic behaviour of risk curves (in linear problems) is highly dependent upon SNR. And indeed here, SNR arises naturally in , and when considering . In addition to this, the peaking behaviour of double descent risk coincides with the peaking locations of , and it scales strongly with the choice of . Such a behaviour was anticipated via the considerations of model distinguishability, and sphere packing. In other words, the outer ellipsoid: , has a volume dependent on , which is in turn reflective of the maximum magnitude allowed by . Evidently if explodes, then will explode in volume, relative to the noise spheres: , and the relative sizes of these volumes are indicative of out-of-sample test error. In other words, when only weak regularisation is applied (), it is not possible to properly constrain the magnitude of , especially at the interpolation threshold (), and thus likewise explodes in magnitude. However, as we progressively strengthen the effect of regularisation: , we observe a gradual reduction, and even a complete elimination, of the double descent risk pathology. Recently, Nakkiran et al. nakkiran2020optimal have also investigated the impact of ridge regularization on the double descent peaking phenomenon, wherein they observe that an optimally tuned regularizer can work to eliminate the presence of this peaking behaviour. However, the intuition behind this has not yet been elucidated from a geometric angle.
In addition, the double descent peak is observed to shift to the right with increasing . Information theoretically, increasing proliferates the total number of possible encodings which may be able to explain the observed data. This is an interpretation which is consistent with Rissanen’s original derivation of MDL, in which he states: “the number of distinguishable models grows with the length of the data, which seems reasonable. In view of this we define the model complexity (as seen through the data)” rissanen1996fisher. Interestingly, this can imply that when a model’s total distinguishability is insufficient (weak regularization, and/or insufficient noise), it is possible for the model to generalize well on one quantity of data, , and then upon re-training on some new data such that , to then generalize poorly, due to the ability of the double-descent cusp to shift towards the right. Similar ideas have been expressed recently by Nakkiran et al., in that: “for a fixed architecture and training procedure, more data (can) actually hurt” nakkiran2019deep.
Finally, we make clear that the concept of distinguishability (as it stems from ), has an intuition which is shared by many schools of thought on double descent risk, and generalization error. For example: (i) It bears strong intuitive similarities to the idea of a jamming transition, which refers to the tight packing of particles in physics, when a material transitions from fluid to solid. This idea is used by Geiger et al. geiger2019jamming; geiger2020scaling to clarify several ideas on double descent risk. Moreover, (ii) Many statistical generalization theorems are based on the notion of “sphere coverings” over constrained function spaces, in which the sphere covering number, and sphere packing number, closely relate vapnik2013nature.
2.2 Statistical Lattice Models
Thus far it has been established that in the case of isotropic linear regression, the double descent phenomenon and notions of generalization possess strong geometric intuition. Here we show that this extends to a family of statistical lattice models which include Boltzmann machines (or Ising models) Ackley85, log-linear models Amari01, and the matrix balancing problem sugiyama2017tensor. We will encode the hierarchical structure of the domain of the probability distribution via a lattice, which will be shown to naturally lead into the co-ordinates (m-flat) from information geometry (see §1.2), and analyze the learning of distributions over a lattice structured domain. In order to achieve this, we formalize lattice domains using partial orders.
Formally, a partially ordered set (poset) is a tuple, , where is a set of elements, and denotes an ordering structure, such that (1) (reflexivity), (2) If , and , then (antisymmetry), and (3) If , and , then (transitivity). Note that not every element need be directly comparable to every other element in the set (if every pair were comparable, the ordering would be known as a total ordering). In addition, a poset is called a lattice if every pair of elements has a least upper bound and a greatest lower bound Davey02. We assume that is finite. In working with posets it is common to consider the zeta function, such that Gierz03. The lattice structure always gives us the - and -coordinates of information geometry: and sugiyama2017tensor. For example, for Boltzmann machines with binary variables, the lattice space , where if for all . The size in this case. The metric tensor for the information manifold of the proposed lattice structure is shown in Theorem 3, which was previously derived by Sugiyama et al. sugiyama2017tensor. We assume, without loss of generality, that such that corresponds to the least element.
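The poset construction above can be sketched concretely for the Boltzmann machine case, where the lattice is the power set of the variable indices ordered by inclusion. The sketch assumes the log-linear convention of the cited tensor balancing work, in which log p(x) is the sum of θ(s) over all s below x, and η(x) is the sum of p(s) over all s above x; the function names and the toy distribution are hypothetical.

```python
import math
from itertools import combinations

def powerset_lattice(n):
    """All subsets of {0, ..., n-1}, ordered by inclusion (a Boolean lattice)."""
    items = range(n)
    return [frozenset(c) for r in range(n + 1) for c in combinations(items, r)]

def zeta(s, x):
    """Zeta function of the poset: 1 if s <= x (here, subset inclusion), else 0."""
    return 1 if s <= x else 0

def eta_coordinates(p, S):
    """eta(x) = sum of p(s) over all s >= x."""
    return {x: sum(zeta(x, s) * p[s] for s in S) for x in S}

def theta_coordinates(p, S):
    """Solve log p(x) = sum_{s <= x} theta(s), processing smaller sets first."""
    theta = {}
    for x in sorted(S, key=len):
        theta[x] = math.log(p[x]) - sum(theta[s] for s in S if s < x)
    return theta

S = powerset_lattice(3)
weights = [i + 1.0 for i in range(len(S))]          # arbitrary positive weights
p = {x: w / sum(weights) for x, w in zip(S, weights)}
eta = eta_coordinates(p, S)
theta = theta_coordinates(p, S)
print(len(S), eta[frozenset()], theta[frozenset()])
```

In this toy example the η co-ordinate of the least element is always one, and its θ co-ordinate is fixed by normalization, which is consistent with that co-ordinate being dropped in practice.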
Theorem 3 (Lattice Metric Tensor sugiyama2017tensor).
In Theorem 3, we can replace with as we assume that is a lattice and always exists, which says that we only consider those poset structures in which the co-ordinate values are shared (nested) between and for the off-diagonal terms in the metric tensor. In Theorem 3 it is clear that this metric tensor is expressed in terms of the co-ordinates. The equivalent metric tensor in terms of the co-ordinates is available, but it is much more difficult to work with (it requires the Möbius function instead of the zeta function). Moreover, in this definition of the metric tensor, the first row and column are always zeros, resulting once again in a singular geometry. Luckily this time however, since corresponds to the partition function, it is generally removed in practice sugiyama2018legendre, so that we effectively work with co-ordinates , resulting in . Based on this set-up, it is possible to derive the upper and lower bounds for . For lattice models, the co-ordinates lie compactly in the unit -hypercube, that is , which makes evaluations much simpler sugiyama2016information. In fact, it is possible to perform the re-parameterization: , which allows to be sampled from a Dirichlet distribution. This makes the co-ordinate more intuitive to work with, and provides us with a tractable way to evaluate the volume integral via sampling. We provide details of this re-parameterization in Appendix B.1. The bounds which result are shown in Theorem 4, with the proof clarified in Appendix B.2.
Theorem 4 (Lattice Log Volume Bounds).
is bounded as in Equation (4), where , is a re-parameterization, and is the standard Gamma function.
For poset models, we once again find that can be decomposed into two distinct terms: (i) A model richness term, which this time is driven by the diagonal elements found in the metric tensor, and (ii) The model distinguishability term, this time relating to the volume of the probability simplex. Intuitively, for point (i), the metric tensor is used to describe notions of angles and lengths (divergences for dually flat manifolds), and so the product of its diagonal elements does encode a notion of complexity dependent on the choice of the model architecture. Once again, notice that this term manifests in the presence of the expectation operator, which summarizes how the model architecture relates to the data distribution. As for point (ii), we have a probability simplex geometry, of which the volume is: . Ultimately, the resulting behaviour of as is clarified in Remark 2 (see Appendix B.3).
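To give a feel for how quickly a simplex volume shrinks with dimension, the sketch below uses the corner simplex, that is, the set of non-negative vectors whose entries sum to at most one, whose volume is 1/d! = 1/Γ(d + 1). This choice of simplex is an assumption made for illustration; the exact expression used in Theorem 4 may differ by a dimension-dependent factor. A Monte Carlo estimate over the unit hypercube is included as a sanity check.

```python
import math
import random

def log_simplex_volume(dim):
    """log-volume of {x >= 0, sum(x) <= 1} in R^dim, i.e. -log(dim!)."""
    return -math.lgamma(dim + 1.0)

def monte_carlo_simplex_volume(dim, num_samples=100_000, seed=0):
    """Fraction of uniform points in the unit hypercube that fall in the simplex."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(num_samples)
        if sum(rng.random() for _ in range(dim)) <= 1.0
    )
    return hits / num_samples

estimate = monte_carlo_simplex_volume(3)
exact = math.exp(log_simplex_volume(3))
print(estimate, exact)
```

As with the ball geometry of the regression case, the log-volume decreases without bound as the dimension grows.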
Remark 2 (Limiting Lattice Volume).
2.3 The Stochastic Perceptron Unit
Finally, we extend the current thesis to the classic stochastic perceptron unit. In particular, we consider a simple perceptron unit of the form: , where , and is the sigmoid non-linearity. Moreover, we follow Amari’s parameterization from amari1997information, which considers the input data distributed as, , and defines, , and , so that we can deal with the term: . Based on this set-up, we obtain Theorem 5.
Theorem 5 (Log-Volume of Perceptron).
Theorem 5 makes clear that for the stochastic perceptron unit similarly decomposes into distinct distinguishability and richness components. The distinguishability term closely parallels that already derived for isotropic linear regression (Theorem 1), as it is driven by the volume of a -ball, and scaled against noise. However, the richness term here strongly encodes the model architecture, as opposed to linear regression, as it contains the derivatives of the non-linearity function, in relation to the norm of the input weights, with the distribution over . From observing this decomposition, it would appear beneficial for the network architecture to be modeled with non-linearities whose derivatives are well-constrained, so that as , we can better guarantee that , implying good generalizability. As it turns out, the popular sigmoid, ReLU, and tanh activation functions do satisfy this property, as common to all we have: . However, the perceptron unit has a strongly singular geometry, which does pose a marked difficulty when extending its geometry to deeper, and more complex network architectures. On this note, a possible solution may lie in the work of Sun & Nielsen sun2019lightlike. In particular, they have tried to circumvent the pathologies of singular semi-Riemannian geometries via a clever use of lightlike manifold structures. Doing so, they observe that under certain conditions for deep neural network (DNN) architectures, a Balasubramanian Occam’s razor-like term may be developed as: (Equation (21) in sun2019lightlike), wherein the higher order term is developed based on a combined understanding of the empirical FIM, and . At large , and for fixed , this term dominates, which they theorize may imply the strong compression, and generalization properties for DNNs; a conclusion which runs parallel to the thesis of this paper.
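The boundedness of the activation derivatives mentioned above is easy to check numerically. The sketch below evaluates the derivatives of the sigmoid, tanh and ReLU functions on a grid; the grid, the function names, and the convention of taking the ReLU derivative as zero at the origin are illustrative assumptions, and the specific bound stated in the paper is not reproduced here.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # Sub-gradient convention: derivative taken as 0 at x = 0.
    return 1.0 if x > 0 else 0.0

grid = [i / 100.0 for i in range(-1000, 1001)]
max_grads = {
    "sigmoid": max(sigmoid_grad(x) for x in grid),
    "tanh": max(tanh_grad(x) for x in grid),
    "relu": max(relu_grad(x) for x in grid),
}
print(max_grads)
```

All three derivatives stay within the unit interval on the grid, with the sigmoid derivative peaking at one quarter.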
Motivated through the works of Rissanen, and Balasubramanian, investigation of the term has shown that for a variety of statistical models, a geometrical understanding of out-of-sample test error and double descent risk involves considering: (i) Model training error, (ii) Model complexity, (iii) Model richness, and (iv) Model distinguishability. Such ideas carry an appealing intuition, and are firmly rooted in information geometry, coding theory, as well as in the principle of Occam’s razor. Future work in this area could entail an investigation into the practicality of such terms for the purpose of empirical model selection. Regardless however, the FIM does appear to be a key object in deepening one’s understanding of many statistical models, especially from a geometric perspective. However, its singular nature does necessitate a considered, and careful analysis, especially for models based upon deep, latent architectures.
Appendix A Isotropic Linear Regression
A.1 Proof of Theorem 1
Remark 3 (Linear Regression FIM).
The FIM (Riemannian metric tensor) for the proposed linear regression model is
The proposed regression can be represented probabilistically as:
where we have assumed variance is known. Generalizing this to the data matrix , negating the above expression, and taking the expectation, we obtain the Fisher information matrix (Riemannian metric tensor) as required. ∎
The proof of Theorem 1 thus completes as follows:
Since , where, .
Thus, arriving at , where is the volume of a -ball of radius . ∎
We make clear that the condition used to evaluate the integral in Theorem 1 is based on integrating over the domain: , where . However the proposed power constraint in this paper is defined as , so that ultimately is considered to be a random vector. This extension is necessary later when defining channel capacity, wherein it will be assumed that each , implying that . The choice of distribution on each is driven by a maximum entropy argument, based on the first and second statistical moments. Specifically, it is known that the maximal entropy distribution with a specified mean (which we take as zero), and a specified covariance, , is Gaussian, and further this covariance naturally relates to the required power constraint through: . Thus, based on a maximum entropy argument we opt for . Additionally, it is known that the norm of a -dimensional random vector of sub-Gaussian components, distributed as will take values close to (the radius of the -sphere in question) with high probability. This can be seen in the concentration inequality: , for all , where is a constant, and , where is the sub-Gaussian norm (see Equation (3.3) in [vershynin2018high]). Thus the domain of the -ball integral will contain most vectors with a high probability.
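The concentration of the norm can be illustrated with a short simulation. The sketch below draws a vector of independent zero-mean Gaussian components and compares its Euclidean norm with the standard deviation times the square root of the dimension; the standard deviation, the dimensions and the seed are illustrative assumptions and are not tied to the power constraint used in this paper.

```python
import math
import random

def norm_ratio(dim, std=1.0, seed=0):
    """Ratio ||theta|| / (std * sqrt(dim)) for one Gaussian draw of theta."""
    rng = random.Random(seed)
    squared_norm = sum(rng.gauss(0.0, std) ** 2 for _ in range(dim))
    return math.sqrt(squared_norm) / (std * math.sqrt(dim))

ratios = {dim: norm_ratio(dim) for dim in (10, 100, 1000, 10000)}
for dim, ratio in ratios.items():
    print(dim, ratio)
```

The ratio tightens around one as the dimension grows, which is why integrating over a ball of that radius captures most of the probability mass.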
A.2 Proof of Corollary 1
Consider , such that the -th component: . Now, considering , we arrive at a random matrix, with -distributed elements along its diagonal. In other words, the -th diagonal element: , implying that . Moreover, we obtain zeros in the off-diagonals due to the iid sampling conditions, meaning: since , so that due to independence we get: , and the summation term follows from the linearity of expectation. Thus, . Taking the logarithm we obtain for the richness term, as required. ∎
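The structure of this argument can be checked by simulation. The sketch below averages the Gram matrix over many draws of a data matrix with independent zero-mean Gaussian entries and compares it with the sample size times the feature variance times the identity; the sizes, the unit feature variance and the number of trials are illustrative assumptions, and any constant prefactor on the richness term in Corollary 1 is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 50, 4, 2000      # illustrative sizes, with p < n
sigma_x = 1.0                   # assumed feature standard deviation

mean_gram = np.zeros((p, p))
for _ in range(trials):
    X = sigma_x * rng.standard_normal((n, p))
    mean_gram += X.T @ X / trials

expected = n * sigma_x ** 2 * np.eye(p)
max_abs_error = np.max(np.abs(mean_gram - expected))
# log-determinant of the expected Gram matrix: p * log(n * sigma_x^2)
logdet_expected = p * np.log(n * sigma_x ** 2)
print(max_abs_error, np.linalg.slogdet(mean_gram)[1], logdet_expected)
```

The off-diagonal entries average out to zero while each diagonal entry averages to the sample size times the feature variance, so the log-determinant grows linearly in the number of features.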
A.3 Channel Capacity Redundancy Theorems
Here, we make clear that channel capacity (which manifests as the supremum of a mutual information), is related to the MDL generalization term via the min-max KL risk. These ideas are well-known, and stated by Clarke & Barron in and of [clarke1994jeffreys], and by Rissanen in and of [rissanen1996fisher].
Theorem 6 (Redundancy-Capacity Theorem [clarke1994jeffreys]).
The maximum channel capacity equals the min-max KL divergence,
Theorem 7 (Redundancy-Generalization Theorem [clarke1994jeffreys] ).
In the infinite limit of , the min-max KL risk approaches the MDL generalization estimator:
A.4 Proof of Theorem 2
In this subsection we provide a proof for as it appears in Theorem 2. We note that there exists a proof for a similar-looking system, involving , being circularly complex, in Telatar’s seminal paper [telatar1999capacity]. However, since properties of complex random matrix spaces do not necessarily carry over to real random matrices, we provide a short alternative proof for the case of , and with a more general noise term, , which are the required assumptions for this paper, and for the problem of isotropic linear regression in general. To prove Theorem 2, we call upon Theorem 9.2.1 in Pinsker [pinsker1964information], which expresses the mutual information between real Gaussian random vectors in a convenient form via the logarithm of the determinant. This is stated as Theorem 8 below.
Theorem 8 (Pinsker’s Mutual Information [pinsker1964information]).
Let , and be mean zero Gaussian random vectors taking values in some corresponding measurable Cartesian product spaces, , , and where . Then,
where , and .
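Theorem 8 can be sanity-checked numerically on a linear Gaussian model. The following is a minimal sketch, assuming the standard form of the result, namely that the mutual information equals half the log of the ratio between the product of the two marginal covariance determinants and the joint covariance determinant. It further assumes a model y = X theta + noise with a fixed X, an isotropic Gaussian prior on theta with variance `prior_var`, and isotropic Gaussian noise with variance `noise_var`. All names are illustrative and the normalization of the prior is an assumption, not the paper's exact choice.

```python
import numpy as np

def gaussian_mutual_information(cov_a, cov_b, cov_joint):
    """0.5 * log(|cov_a| |cov_b| / |cov_joint|) in nats, for jointly Gaussian vectors."""
    _, ld_a = np.linalg.slogdet(cov_a)
    _, ld_b = np.linalg.slogdet(cov_b)
    _, ld_j = np.linalg.slogdet(cov_joint)
    return 0.5 * (ld_a + ld_b - ld_j)

def linear_model_mi(x, prior_var, noise_var):
    """Mutual information between theta and y = x @ theta + noise, for fixed x."""
    n, p = x.shape
    cov_theta = prior_var * np.eye(p)
    cov_y = prior_var * (x @ x.T) + noise_var * np.eye(n)
    cross = prior_var * x.T  # Cov(theta, y), shape (p, n)
    cov_joint = np.block([[cov_theta, cross], [cross.T, cov_y]])
    return gaussian_mutual_information(cov_theta, cov_y, cov_joint)

def closed_form_mi(x, prior_var, noise_var):
    """0.5 * log det(I_n + (prior_var / noise_var) * x x^T)."""
    n = x.shape[0]
    _, ld = np.linalg.slogdet(np.eye(n) + (prior_var / noise_var) * (x @ x.T))
    return 0.5 * ld

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
print(linear_model_mi(x, 2.0, 0.5), closed_form_mi(x, 2.0, 0.5))
```

The two printed values should agree up to floating-point error, since expanding the block determinant through the Schur complement leaves only the noise covariance in the denominator.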
We now move onto proving Theorem 2, and begin by relating the notation used by Pinsker, to that used within this paper.
Consider , and , which implies that and . Moreover, we initially consider a fixed (that is, non-random) value of to perform calculations. Through maximum entropy arguments, we consider , which is based on the power constraint of , coupled with the fact that each . For details on the choice of this prior with regard to maximum entropy, see the end of Appendix A.1; in addition, see Section 4.1 of Telatar as to why this is a natural choice for , with regard to concavity of [telatar1999capacity]. Now consider: , via linearity of expectation. In order to calculate we consider block matrices as follows:
It can be shown that, , and note that the top right, and bottom left elements are transposes of one another. Thus by expanding the determinant of the above block matrix system, we obtain:
We once again emphasise that and that the determinants have originated from Theorem 8. Moreover, the expression thus far has been calculated for a fixed ; that is, more accurately we have an expression for , with the conditioning implicitly assumed, whereas we ultimately desire a form for . Considering now random :
A.5 Theorem 9 - Channel Capacity Bounds
In this section we derive upper and lower bounds for the channel capacity, . We note that upper and lower bounds are required separately, for the cases of , and . This results in four bounds in total.
Theorem 9 (Channel Capacity Bounds).
The channel capacity , where is the digamma function, is bounded as follows:
Lower Bound ():
(Minkowski’s inequality). We take , fix an , and proceed as follows:
Consider that for any , such that we have , for . Thus,
Now, since each , it follows that ; that is, is Wishart distributed, where is the degrees of freedom, and is the Wishart scale matrix, which in this case is the identity matrix. For the Wishart distribution, there is a known expansion for the expected log-determinant as follows:
with being the standard digamma function [bishop2006pattern]. Thus we arrive at:
where the lower bound for follows similarly.
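The Wishart log-determinant identity used in this proof can be verified by simulation. The following is a minimal sketch, assuming the standard form of the identity for an identity scale matrix: for a Wishart matrix W of dimension `dim` with `dof` degrees of freedom, E[log|W|] equals the sum over i = 1, ..., dim of psi((dof - i + 1)/2), plus dim times log 2. The names `dof` and `dim` are illustrative and stand in for the paper's own symbols. The digamma values at half-integers are computed exactly from the recurrence psi(x + 1) = psi(x) + 1/x.

```python
import math
import numpy as np

EULER_GAMMA = 0.5772156649015329

def digamma_half_integer(k):
    """Exact psi(k / 2) for integer k >= 1, using psi(x + 1) = psi(x) + 1 / x."""
    if k % 2 == 0:
        x, value = 1.0, -EULER_GAMMA                        # psi(1)
    else:
        x, value = 0.5, -EULER_GAMMA - 2.0 * math.log(2.0)  # psi(1/2)
    while x < k / 2 - 1e-12:
        value += 1.0 / x
        x += 1.0
    return value

def wishart_expected_logdet(dof, dim):
    """E[log|W|] for W ~ Wishart_dim(dof, I), assuming dof >= dim."""
    psi_sum = sum(digamma_half_integer(dof - i + 1) for i in range(1, dim + 1))
    return psi_sum + dim * math.log(2.0)

def monte_carlo_logdet(dof, dim, n_trials=20000, seed=0):
    """Average log|G^T G| over Gaussian matrices G of shape (dof, dim)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_trials, dof, dim))
    w = np.transpose(g, (0, 2, 1)) @ g
    _, logdets = np.linalg.slogdet(w)
    return logdets.mean()

print(wishart_expected_logdet(10, 3), monte_carlo_logdet(10, 3))
```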
Upper Bound :
From Jensen’s inequality,
The proof for the upper bound in follows similarly, where we would consider a term instead.
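The Jensen step can also be illustrated by simulation. The following is a minimal sketch, assuming a capacity-like quantity of the form E[0.5 log det(I_n + (snr / p) X X^T)], where X is an n-by-p matrix with iid standard normal entries. This normalization and the names `expected_half_logdet` and `jensen_upper_bound` are assumptions for illustration and may differ from the expression in Theorem 2. Because the log-determinant is concave on positive definite matrices and E[X X^T] is p times the identity under this assumption, Jensen's inequality gives the bound 0.5 n log(1 + snr).

```python
import numpy as np

def expected_half_logdet(n, p, snr, n_trials=5000, seed=0):
    """Monte Carlo estimate of E[0.5 * log det(I_n + (snr / p) X X^T)]."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_trials, n, p))
    gram = x @ np.transpose(x, (0, 2, 1))  # stacked X X^T, shape (n_trials, n, n)
    _, logdets = np.linalg.slogdet(np.eye(n) + (snr / p) * gram)
    return 0.5 * logdets.mean()

def jensen_upper_bound(n, snr):
    """0.5 * log det(I_n + (snr / p) E[X X^T]) with E[X X^T] = p * I_n."""
    return 0.5 * n * np.log1p(snr)

for n, p in ((4, 8), (8, 4)):
    print(n, p, expected_half_logdet(n, p, 10.0), jensen_upper_bound(n, 10.0))
```

In both regimes the simulated expectation should sit below the Jensen bound, with a visibly larger gap when p is smaller than n, since X X^T is then rank deficient.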
We can see how the bounds behave in Figure 5. In particular, the lower bound appears to be much tighter than the upper bound for high SNR. However, the lower bound exhibits a small “dip” at the transition point of . This is primarily due to the term in the lower bound, which for smaller SNR becomes negative if . Figure 5 naturally raises the question of whether something may be said about the limiting behaviour of the upper and lower bounds as . We thus establish Corollary 2.
A.6 Corollary 2 - Convergence of Channel Capacity Limit
In order to establish convergence we consider the following mild conditions: (1) A high SNR (SNR), and (2) , which is a common approximation to the digamma function, for large .
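The quality of the digamma approximation in condition (2) can be gauged numerically. The following is a minimal sketch, assuming that the approximation in question is the leading-order one, psi(x) ≈ log x, whose error is roughly 1/(2x) for large x. The `digamma` helper below is a standard recurrence-plus-asymptotic-series implementation written for this illustration and is not taken from the paper.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x up to at least 6, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series

for x in (2.0, 10.0, 100.0, 1000.0):
    gap = math.log(x) - digamma(x)
    print(x, digamma(x), math.log(x), gap, 1.0 / (2.0 * x))
```

The gap between log x and psi(x) shrinks at the rate 1/(2x), so the approximation is coarse for small arguments and accurate once the argument is large.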
Corollary 2 (Channel Capacity Convergence).
For , and under mild conditions we have, where and refer to the upper and lower capacity bounds which are established in Theorem 9.