## 1 Introduction

### 1.1 Motivation

We are concerned with the dimensionality of a probabilistic model. Model dimensionality usually means the number of independent parameters, which we call the parametric dimensionality. For example, it is the number of real-valued parameters in a regression model or the number of mixture components in a finite mixture model. This paper reinterprets model dimensionality from an information-theoretic viewpoint, specifically on the basis of the minimum description length (MDL) principle [17].

Our motivation is as follows. First, model dimensionality is an important notion for understanding how complex a given model class is. From an information-theoretic viewpoint, we may measure this complexity in terms of description length. Description length can be calculated regardless of whether the model class is parametric or non-parametric, and regardless of whether a number of model classes are synthesized. Measured this way, model dimensionality is not necessarily integer-valued. We are thus interested in how to formalize description-based model dimensionality in a general way and how it relates to the conventional parametric dimensionality.

Secondly, the complexity of a model class is closely related to the performance of algorithms that learn it. Specifically, the MDL-based learning algorithms [1, 25, 19, 13] and the MDL-based change detection algorithms [27, 29, 26], which have been designed on the basis of the MDL principle, have turned out to be effective in machine learning and information theory. We can expect that their rates of convergence or the exponents in their error probabilities can be characterized in terms of description-based model dimensionality. This is analogous to the fact that the rates of convergence of empirical risk minimization algorithms are characterized by the Vapnik-Chervonenkis dimension [2, 21] or the metric dimension of a function class [16, 11]. We are thus interested in analyzing the relations between the description-based dimensionality and the performance of the MDL-based learning and change detection algorithms.

The purpose of this paper is twofold. One is to introduce a new notion of description-based model dimensionality, which we call the descriptive dimensionality (Ddim). It extends the classical integer-valued parametric dimensionality into a real-valued one for the case where a number of model classes of different complexities are mixed. The other is to derive new theoretical results on the MDL-based algorithms for learning and change detection, thereby characterizing them in terms of Ddim.

### 1.2 Related Work

A number of notions of dimensionality have been proposed in physics and statistics. The metric dimension was proposed by Kolmogorov and Tihomirov [14] to measure the complexity of a given set of points on the basis of covering numbers. It evolved into the box counting dimension or, equivalently, the fractal dimension [15, 8]. It is also related to the capacity [6]. The metric dimension was used to measure the complexity of a class of functions and was related to the rate of uniform convergence over the class (see e.g. [6], [16]). The Vapnik-Chervonenkis (VC) dimension was proposed to measure the representational power of a given class of functions [21]. It was also related to the rate of uniform convergence of algorithms that minimize empirical losses over the class. See [11] for relations between dimensionality and learning. None of this previous work on dimensionality, however, relates it to information-theoretic notions such as codelengths.

As for the MDL principle, Rissanen has proven the consistency of the model estimated by the MDL criterion [18]. In earlier stages of MDL research, the two-stage MDL distribution was studied extensively: the data and the model are encoded in two steps, and the probability distribution with the model of the shortest two-stage codelength is analyzed. Recently, the normalized maximum likelihood (NML) distribution has received more attention than the two-stage MDL distribution. This is because the NML distribution achieves the minimum of Shtarkov’s minimax regret (see Section 2.1), and the NML distribution with the model of the shortest codelength has estimation optimality [19]. See [10] for recent advances in MDL.

The MDL principle has been applied extensively to learning theory and change detection. As for learning, i.e., model estimation, Barron and Cover first showed the rate of convergence of the MDL model estimator and characterized it in terms of index resolvability [1]. Note that they evaluated the model with the shortest two-stage codelength over the discretized model class. Yamanishi gave a non-asymptotic form of the rate of convergence of the model with the shortest two-stage codelength in the PAC (probably approximately correct) learning framework [25]. He focused on a specific model called stochastic rules with finite partitioning, which can be thought of as piecewise Bernoulli models. In [1, 25], the model class was discretized and the convergence was analyzed over the discretized class. Chatterjee and Barron [3] and Kawakita and Takeuchi [13] designed MDL learning algorithms without discretization and applied them to supervised learning with L1-regularization. In their analysis, the target remained the model of the shortest two-stage codelength, and the discretization technique still played an essential role.

As for model change detection, Yamanishi and Maruyama formulated the problem of dynamic model selection (DMS), in which the underlying probabilistic model changes over time [28, 27]. DMS aims to detect time-varying models from a data stream. They developed the MDL-based DMS algorithm to solve the problem. Issues similar to DMS have been addressed in a number of different scenarios, including switching distributions [7], derandomization [22], tracking best experts [12], Bayesian change point detection in multivariate time series [24], concept drift [9], and structural break estimation for autoregression models [5]. Vreeken et al. proposed Krimp as a description length-based test statistic for change point detection and demonstrated its empirical effectiveness [23]. Yamanishi and Fukushima [26] developed the MDL-based model change statistics using the NML codelength and proposed a hypothesis test for model change detection, which we call the MDL test. They derived non-asymptotic forms of the Type I and II error probabilities of the MDL test, both of which converge to zero exponentially in the data length.

### 1.3 Significance of This Paper

The significance of this paper is summarized as follows:

(1) Model dimensionality is reformulated through Ddim. We define Ddim similarly to the box counting dimension, in that it is given by the logarithm of the $\epsilon$-covering number divided by $\log(1/\epsilon)$. Hence Ddim can be defined regardless of whether the model class is parametric or non-parametric, as with the box counting dimension. However, Ddim is unique in that the covering number for Ddim is defined as the least number of representative points necessary for approximating the shortest codelength for the model class. We show that Ddim coincides with the parametric dimensionality when the model class is a single parametric class. Ddim can be further defined for the cases where a number of parametric model classes are probabilistically mixed (model fusion) or are concatenated (model concatenation). In these cases Ddim can be real-valued. Such cases may occur, for example, when the model changes over time.

(2) Theoretical results on the performance of MDL-based learning and change detection are updated to be characterized by Ddim. We derive the rate of convergence of the MDL learning algorithm both for the case where the true distribution is in a given family of model classes and for the case where it is not. In previous work on MDL learning [1, 25, 3, 13], the maximum likelihood distribution with the model of the shortest two-stage codelength was analyzed. In contrast, this paper derives upper bounds on the rate of convergence of the NML distribution with the model of the shortest NML codelength. Unlike all of the existing work, we do not use any discretization technique in either the design or the analysis of MDL learning. The result gives a new justification of MDL learning and a new rationale for the NML distribution. To the author’s knowledge, this is the first result on the rate of convergence of the NML distribution. We relate the rate of convergence of the NML distribution to Ddim for the true model. Furthermore, in the case where a number of models are probabilistically mixed, we prove that the rate of convergence of the NML distribution is governed by Ddim for model fusion.

We also consider the problem of model change detection. We test the hypothesis that a data sequence comes from a given model sequence. The model sequence may include multiple change points. This scenario differs from that of [26], which tested whether a model change has occurred at a given single change point. We propose the MDL change statistics for this new scenario, on the basis of the notion of dynamic model selection [28]. We derive upper bounds on the Type I and II error probabilities of hypothesis testing for the MDL-based model change detection. We prove that the error probabilities decay exponentially to zero as the sample size increases, and that their exponents are governed by Ddim for model concatenation.

Through the analysis, we demonstrate that Ddim is an intrinsic quantity which characterizes the performance of MDL-based learning and change detection.

The rest of this paper is organized as follows. Section 2 gives a formal definition of Ddim and its theoretical properties. Section 3 evaluates the rates of convergence of learning the NML distributions and relates them to Ddim. Section 4 evaluates the error probabilities of the MDL-based change detection and relates them to Ddim. Section 5 gives concluding remarks. The Appendix gives proofs of a number of theorems.

## 2 Descriptive Dimensionality

### 2.1 NML and Parametric Complexity

This section gives a formal definition of Ddim from an information-theoretic viewpoint. Let $\mathcal{X}$ be the data domain, where $\mathcal{X}$ is either discrete or continuous. Without loss of generality, we assume that $\mathcal{X}$ is discrete. Let $x^n = x_1, \dots, x_n \in \mathcal{X}^n$ be a data sequence of length $n$. We drop $n$ and write $x$ when it is clear from the context. We assume that each $x_i$ is independently generated. Let $\mathcal{P} = \{p(x)\}$ be a class of probabilistic models, where $p(x)$ is a probability mass function or a probability density function. $\mathcal{P}$ can be a real-valued or discrete-valued parametrized class. In either case, throughout the paper, we assume that $\max_{p \in \mathcal{P}} p(x)$ exists for any $x$. We start by defining the NML codelength, the fundamental notion in the MDL principle.

###### Definition 2.1

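We define the normalized maximum likelihood (NML) distribution over $\mathcal{X}^n$ relative to $\mathcal{P}$, which we denote as $p_{\mathrm{NML}}(x;\mathcal{P})$, by

$$p_{\mathrm{NML}}(x;\mathcal{P}) = \frac{\max_{p\in\mathcal{P}} p(x)}{\sum_{y}\max_{p\in\mathcal{P}} p(y)}. \tag{1}$$

The normalized maximum likelihood (NML) codelength of $x$ relative to $\mathcal{P}$, which we denote as $L_{\mathrm{NML}}(x;\mathcal{P})$, is given as follows:

$$L_{\mathrm{NML}}(x;\mathcal{P}) = -\log p_{\mathrm{NML}}(x;\mathcal{P}) = -\log \max_{p\in\mathcal{P}} p(x) + \log C_n(\mathcal{P}), \tag{2}$$

where

$$\log C_n(\mathcal{P}) = \log \sum_{y} \max_{p\in\mathcal{P}} p(y). \tag{3}$$

The first term in (2) is the negative logarithm of the maximum likelihood, while the second term (3) is the logarithm of the normalization term. The latter is called the parametric complexity of $\mathcal{P}$ [19]. It represents the information-theoretic complexity of the model class $\mathcal{P}$. The NML codelength can be thought of as an extension of Shannon information to the case where the true model is unknown and only $\mathcal{P}$ is known.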
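As a concrete illustration of the NML codelength and the parametric complexity, the following minimal sketch evaluates both for the Bernoulli model class over binary sequences. The choice of the Bernoulli class, the function names, and the use of natural logarithms (nats) are assumptions made for this example only; they are not taken from the paper. The sum over all $2^n$ sequences is grouped by the number of ones $k$, since the maximized likelihood depends on the sequence only through $k$.

```python
import math


def bernoulli_max_likelihood(k: int, n: int) -> float:
    """Maximized likelihood of one binary sequence of length n with k ones.

    The maximum over the Bernoulli parameter is attained at k / n.
    """
    if k == 0 or k == n:
        return 1.0
    t = k / n
    return t ** k * (1.0 - t) ** (n - k)


def parametric_complexity(n: int) -> float:
    """Normalizer: sum of maximized likelihoods over all 2^n sequences.

    Sequences are grouped by their number of ones k; there are
    math.comb(n, k) sequences in each group. Intended for moderate n only,
    since the products are formed in floating point.
    """
    return sum(math.comb(n, k) * bernoulli_max_likelihood(k, n)
               for k in range(n + 1))


def nml_codelength(k: int, n: int) -> float:
    """NML codelength (in nats) of any sequence of length n with k ones."""
    return (-math.log(bernoulli_max_likelihood(k, n))
            + math.log(parametric_complexity(n)))


if __name__ == "__main__":
    n = 20
    print("parametric complexity:", parametric_complexity(n))
    for k in (0, 5, 10):
        print(k, "ones ->", nml_codelength(k, n), "nats")
```

Because the NML distribution is normalized, summing `math.comb(n, k) * exp(-nml_codelength(k, n))` over `k` returns 1 up to rounding, which is a quick sanity check on the implementation.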

To understand the meaning of the NML codelength and the parametric complexity, following [20], we define the minimax regret as follows:

$$R_n(\mathcal{P}) = \min_{q}\max_{x}\left\{-\log q(x) - \min_{p\in\mathcal{P}}\left(-\log p(x)\right)\right\},$$

where the minimum is taken over the set of all probability distributions. The minimax regret represents the descriptive complexity of the model class: it indicates how much a prefix codelength must deviate, in the worst case over data, from the smallest negative log-likelihood over the model class. Shtarkov proved that the NML distribution (1) is optimal in the sense that it attains the minimum of the minimax regret [20]. In this sense the NML codelength is the optimal codelength required for encoding $x$ for given $\mathcal{P}$. The minimax regret then coincides with the parametric complexity [20]. That is,

$$R_n(\mathcal{P}) = \log C_n(\mathcal{P}). \tag{4}$$

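We next consider how to calculate the parametric complexity $C_n(\mathcal{P})$. According to [19] (pp. 43-44), the parametric complexity can be represented using a variable transformation technique as follows:

$$C_n(\mathcal{P}) = \int g(\hat{p}, \hat{p})\, d\hat{p}, \tag{5}$$

where $g(\hat{p}, p)$ is defined as

$$g(\hat{p}, p) = \sum_{x:\ \mathop{\mathrm{argmax}}_{\bar{p}\in\mathcal{P}} \bar{p}(x) = \hat{p}} p(x). \tag{6}$$

Note that for fixed $p$, $g(\hat{p}, p)$ forms a probability density function of $\hat{p}$. That is,

$$\int g(\hat{p}, p)\, d\hat{p} = 1.$$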

### 2.2 Definition of Descriptive Dimensionality

Below we give the definition of Ddim from the viewpoint of approximating the parametric complexity or, equivalently, the minimax regret (see (4)). The overall procedure for defining Ddim is as follows. We first count how many points are required to approximate the parametric complexity (5) under a discretization of the model class. We regard this count as the information-theoretic richness of representation of a model class. We then employ the count to define Ddim in a manner similar to the box counting dimension [6].

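We consider approximating (5) with a finite sum of partial integrals of $g(\hat{p}, p)$. Let $\overline{\mathcal{P}} = \{p_1, p_2, \dots\}$ be a finite subset of $\mathcal{P}$. For $\epsilon > 0$ and for $p_i \in \overline{\mathcal{P}}$, let

$$D_\epsilon(i) = \{p \in \mathcal{P} : d_n(p_i, p) \le \epsilon^2\},$$

where $d_n(p_i, p)$ is the Kullback-Leibler (KL) divergence between $p_i$ and $p$:

$$d_n(p_i, p) = \frac{1}{n}\sum_{x} p_i(x)\log\frac{p_i(x)}{p(x)}. \tag{7}$$

Then we approximate $C_n(\mathcal{P})$ as

$$\overline{C}_n(\mathcal{P}) = \sum_{i} Q_\epsilon(i), \tag{8}$$

where

$$Q_\epsilon(i) = \int_{\hat{p}\in D_\epsilon(i)} g(\hat{p}, p_i)\, d\hat{p}.$$

That is, (8) gives an approximation to $C_n(\mathcal{P})$ with a finite sum of integrals of $g(\hat{p}, p_i)$ over the $\epsilon^2$-neighborhood of a point $p_i$ with respect to the KL divergence. We define $m_n(\epsilon : \mathcal{P})$ as the smallest number of points $|\overline{\mathcal{P}}|$ with respect to $\overline{\mathcal{P}}$ such that $C_n(\mathcal{P}) \le \overline{C}_n(\mathcal{P})$. More precisely,

$$m_n(\epsilon : \mathcal{P}) = \min_{\overline{\mathcal{P}} \subset \mathcal{P}} |\overline{\mathcal{P}}| \quad \text{subject to} \quad C_n(\mathcal{P}) \le \overline{C}_n(\mathcal{P}). \tag{9}$$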

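We are now ready to introduce the descriptive dimension.

###### Definition 2.2

Let $\mathcal{P}$ be a class of probability distributions. We define the descriptive dimension (Ddim) of $\mathcal{P}$ by

$$\mathrm{Ddim}(\mathcal{P}) = \lim_{\epsilon\to 0}\frac{\log m(\epsilon : \mathcal{P})}{\log(1/\epsilon)} \tag{10}$$

when the limit exists.

The definition of Ddim is similar to that of the box counting dimension [6, 15, 8]. The main difference between them is how the number of points is counted. Ddim is calculated on the basis of the number of points required for approximating the parametric complexity, while the box counting dimension is calculated on the basis of the number of points required for covering a given object with their $\epsilon$-neighborhoods under some metric.

Here $m(\epsilon : \mathcal{P})$ denotes the total number of representative points for the parametric complexity of $\mathcal{P}$, obtained by choosing $\epsilon^2 n = O(1)$ in $m_n(\epsilon : \mathcal{P})$ as in (9). Eq. (10) is then equivalent to

$$\log m(\epsilon : \mathcal{P}) = \mathrm{Ddim}(\mathcal{P})\log\frac{1}{\epsilon} + o\!\left(\log\frac{1}{\epsilon}\right). \tag{11}$$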

In order to verify that Ddim is a reasonable definition of dimensionality, we show that Ddim coincides with the parametric dimensionality in the special case where the model class is a finite dimensional parametric one.

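Consider the case where $\mathcal{P}_k$ is a $k$-dimensional parametric class, i.e., $\mathcal{P}_k = \{p(x;\theta) : \theta \in \Theta_k \subset \mathbb{R}^k\}$, where $\Theta_k$ is a $k$-dimensional real-valued parameter space. Let $p(x;\theta) = f(x \mid \hat{\theta}(x))\, g(\hat{\theta}(x), \theta)$ for the conditional probability mass function $f(x \mid \hat{\theta}(x))$, where $\hat{\theta}(x)$ is the maximum likelihood estimator. We then write $g$ according to (6) as follows:

$$g(\hat{\theta}, \theta) = \sum_{x:\ \hat{\theta}(x) = \hat{\theta}} p(x;\theta).$$

Assume that the central limit theorem holds for the maximum likelihood estimator of the parameter vector $\theta$. Then according to [19], we can asymptotically take a Gaussian density function as (6). That is, for sufficiently large $n$, (6) converges to the Gaussian distribution in law:

$$g(\hat{\theta}, \theta) \to \left(\frac{n}{2\pi}\right)^{k/2} |I(\theta)|^{1/2} \exp\left(-\frac{n}{2}(\hat{\theta}-\theta)^{\top} I(\theta)(\hat{\theta}-\theta)\right), \tag{12}$$

where $I(\theta)$ is the Fisher information matrix.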

Under the assumption of (12), the following theorem shows the basic property of the number of representative points for the parametric case.

###### Theorem 2.1

Suppose that $p(x;\theta)$ is three times continuously differentiable with respect to $\theta$. Under the assumption of the central limit theorem so that (12) holds, for sufficiently large $n$, we have

(13) | |||

(Since the proof is long relative to the introduction of Ddim, we defer the proof of Theorem 2.1 to Appendix A.1.)

Theorem 2.2 relates Ddim to the parametric dimensionality for the parametric case.

###### Theorem 2.2

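For a $k$-dimensional parametric class $\mathcal{P}_k$, under the assumption of Theorem 2.1 for $\mathcal{P}_k$, we have

$$\mathrm{Ddim}(\mathcal{P}_k) = k. \tag{14}$$

Proof: We denote by $m(\epsilon : \mathcal{P}_k)$ the quantity obtained by choosing $\epsilon^2 n = O(1)$ in $m_n(\epsilon : \mathcal{P}_k)$. According to [19] (p. 53), when the class is a $k$-dimensional parametric class $\mathcal{P}_k = \{p(x;\theta)\}$, then under the central limit theorem condition for the maximum likelihood estimator for each $\theta$, the parametric complexity is asymptotically expanded as follows:

$$\log C_n(\mathcal{P}_k) = \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\, d\theta + o(1), \tag{15}$$

where $I(\theta)$ is the Fisher information matrix: $I(\theta) = \lim_{n\to\infty}\frac{1}{n}E_\theta\left[-\frac{\partial^2 \log p(x^n;\theta)}{\partial\theta\,\partial\theta^{\top}}\right]$. With $\epsilon^2 n = O(1)$, the leading term of (15) becomes $\frac{k}{2}\log n = k\log(1/\epsilon) + O(1)$. Plugging (13) with (15) into (11) yields (14). This completes the proof of Theorem 2.2.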
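The asymptotic expansion (15) used in this proof can be checked numerically in a simple case. The sketch below is an illustration under stated assumptions rather than part of the paper: it takes the one-parameter Bernoulli class ($k = 1$), for which $\int\sqrt{|I(\theta)|}\,d\theta = \pi$, computes $\log C_n$ exactly in log space, and compares it with $\frac{k}{2}\log\frac{n}{2\pi} + \log\pi$. It also estimates a dimension as the slope of $\log C_n$ against $\log(1/\epsilon)$ with $\epsilon = n^{-1/2}$. Using $\log C_n$ as a stand-in for $\log m(\epsilon)$ is an assumption of this sketch, and all function names are hypothetical.

```python
import math


def log_parametric_complexity(n: int) -> float:
    """Exact log C_n for the Bernoulli class, computed in log space."""
    terms = []
    for k in range(n + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        log_ml = 0.0
        if 0 < k < n:
            log_ml = k * math.log(k / n) + (n - k) * math.log((n - k) / n)
        terms.append(log_binom + log_ml)
    top = max(terms)
    # log-sum-exp to avoid overflow and underflow for large n
    return top + math.log(sum(math.exp(t - top) for t in terms))


def asymptotic_log_complexity(n: int, k: int = 1) -> float:
    """(k/2) log(n / (2 pi)) + log(pi); the log(pi) term is Bernoulli-specific."""
    return 0.5 * k * math.log(n / (2.0 * math.pi)) + math.log(math.pi)


def dimension_slope(n1: int, n2: int) -> float:
    """Slope of log C_n against log(1/eps) between two sample sizes,
    with eps = n ** -0.5, so that log(1/eps) = 0.5 * log(n)."""
    rise = log_parametric_complexity(n2) - log_parametric_complexity(n1)
    run = 0.5 * math.log(n2 / n1)
    return rise / run


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, log_parametric_complexity(n), asymptotic_log_complexity(n))
    print("estimated dimension:", dimension_slope(1000, 10000))
```

The slope approaches the parametric dimension 1 as the sample sizes grow, which is the behaviour that (14) describes for a single parametric class.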

Theorem 2.1 can be generalized to hold regardless of whether the model class is parametric or non-parametric. Eq. (61) in the proof of Theorem 2.1 (see Appendix A.1), which relates the central limit theorem to the volume of the discretized mass, is the key to proving Theorem 2.1. By generalizing (61), we obtain the following relation between the parametric complexity and the number of representative points.

###### Theorem 2.3

Suppose that for , it holds:

(16) | |||

Then we have

Note that (16) can be thought of as a generalized variant of the central limit theorem.

Theorem 2.2 shows that Ddim coincides with the parametric dimensionality when the model class is a single parametric one. Ddim can also be defined even for the case where a number of parametric classes are fused or concatenated, as shown below.

Figure 1: (a) Model fusion. (b) Model concatenation.

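We first consider model fusion, where a number of model classes are probabilistically mixed, as in Fig. 1 (a). Let $\mathcal{F} = \{\mathcal{P}_1, \dots, \mathcal{P}_s\}$ be a family of model classes and assume that a model class $\mathcal{P} \in \mathcal{F}$ is probabilistically distributed according to $p(\mathcal{P})$ over $\mathcal{F}$, where $p(\mathcal{P})$ does not change over time. The true distribution is determined once $\mathcal{P}$ is generated. We denote model fusion over $\mathcal{F}$ as $\mathcal{F}^{\odot} = \mathcal{P}_1 \odot \cdots \odot \mathcal{P}_s$.

Then by taking the expectation of $m(\epsilon : \mathcal{P})$ with respect to $p(\mathcal{P})$, the definition of Ddim of $\mathcal{F}^{\odot}$ is naturally induced as follows:

$$\mathrm{Ddim}(\mathcal{F}^{\odot}) = \lim_{\epsilon\to 0}\frac{\log \sum_{i=1}^{s} p(\mathcal{P}_i)\, m(\epsilon : \mathcal{P}_i)}{\log(1/\epsilon)}.$$

We immediately obtain the following lower bound on Ddim of model fusion.

###### Theorem 2.4

$$\mathrm{Ddim}(\mathcal{F}^{\odot}) \ge \sum_{i=1}^{s} p(\mathcal{P}_i)\,\mathrm{Ddim}(\mathcal{P}_i). \tag{18}$$

###### Definition 2.3

We define the pseudo Ddim for model fusion over $\mathcal{F}$, written $\underline{\mathrm{Ddim}}(\mathcal{F}^{\odot})$, as the right-hand side of (18).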

###### Example 2.1

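Let $\pi(\mathcal{P})$ be a prior distribution of $\mathcal{P}$ over $\mathcal{F} = \{\mathcal{P}_1, \dots, \mathcal{P}_s\}$. When a data sequence $x$ is given, for each $\mathcal{P}$, let the NML distribution associated with $\mathcal{P}$ be

$$p_{\mathrm{NML}}(x;\mathcal{P}) = \frac{\max_{p\in\mathcal{P}} p(x)}{C_n(\mathcal{P})}.$$

The posterior probability distribution of $\mathcal{P}$ for given $x$ is given by

$$p(\mathcal{P} \mid x) = \frac{\left(p_{\mathrm{NML}}(x;\mathcal{P})\,\pi(\mathcal{P})\right)^{\beta}}{\sum_{\mathcal{P}'\in\mathcal{F}}\left(p_{\mathrm{NML}}(x;\mathcal{P}')\,\pi(\mathcal{P}')\right)^{\beta}},$$

where $\beta$ is a temperature parameter.

Thus Ddim for model fusion associated with $x$, which we write as $\mathrm{Ddim}(\mathcal{F}^{\odot} \mid x)$, is calculated as

$$\mathrm{Ddim}(\mathcal{F}^{\odot} \mid x) \ge \sum_{\mathcal{P}\in\mathcal{F}} p(\mathcal{P} \mid x)\,\mathrm{Ddim}(\mathcal{P}). \tag{19}$$

We have used the relation in Theorem 2.4. We call the right-hand side of (19) the pseudo Ddim for model fusion associated with $x$, which we write as $\underline{\mathrm{Ddim}}(\mathcal{F}^{\odot} \mid x)$. Determining the model dimensionality as in (19) from data may be called continuous model selection.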
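The continuous model selection of Example 2.1 can be sketched in a few lines. The sketch rests on assumptions that should be read as such: the posterior over model classes is taken proportional to (NML probability times prior) raised to the temperature $\beta$, each class is a single parametric class so that its Ddim equals its parametric dimension, and the function names and input values are hypothetical. The pseudo Ddim is then the posterior-weighted average of the per-class dimensions, which is generally not an integer.

```python
import math
from typing import List, Sequence


def tempered_posterior(log_nml: Sequence[float],
                       log_prior: Sequence[float],
                       beta: float = 1.0) -> List[float]:
    """Posterior over model classes, proportional to (p_NML * prior) ** beta.

    Inputs are log NML probabilities of the observed data under each class
    and log prior probabilities of each class (both assumed given).
    """
    scores = [beta * (a + b) for a, b in zip(log_nml, log_prior)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # stabilized exponentials
    total = sum(weights)
    return [w / total for w in weights]


def pseudo_ddim(posterior: Sequence[float], dims: Sequence[float]) -> float:
    """Posterior-weighted average of per-class dimensionalities."""
    return sum(p * d for p, d in zip(posterior, dims))


if __name__ == "__main__":
    # Hypothetical values: three classes with parametric dimensions 1, 2, 3.
    log_nml = [-105.0, -101.5, -102.0]
    log_prior = [math.log(1.0 / 3.0)] * 3
    post = tempered_posterior(log_nml, log_prior, beta=1.0)
    print("posterior:", post)
    print("pseudo Ddim:", pseudo_ddim(post, [1, 2, 3]))
```

A larger `beta` concentrates the posterior on the class with the shortest codelength, so the pseudo Ddim moves toward that class's integer dimension; a smaller `beta` spreads the weight and yields a value between the dimensions.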

We next consider model concatenation where a number of model classes are concatenated along the timeline as in Fig. 1 (b).
Let
be a family of model classes.
Let a set of precision parameters .
For , let . Then
.
We write
model concatenation over with ratio as
, which means that a model class is specified with precision for any .
Then
the number of points is given by

(20) |

Then Ddim of with ratio is calculated as follows:

(21) |

As for Ddim for model concatenation we have the following theorem:

###### Theorem 2.5

Let . Ddim for model concatenation of with ratio is given by

(22) | |||

Proof. The definition (21) of Ddim of model concatenation with the property (20) directly induces the following:

This proves (22).

###### Corollary 2.1

Let where is the model with parametric dimensionality . Under the condition as in Theorem 2.1 for each , Ddim for model concatenation over with ratio is given as follows:

## 3 Rate of Convergence of Learning NML Distributions

This section gives the rates of convergence of the MDL learning algorithm and relates them to Ddim. The algorithm selects the model with the shortest total codelength required for encoding the data as well as the model itself. It is formalized as follows:

Let where .
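For a given training data sequence $x = x_1, \dots, x_n$, where each $x_i$ is independently drawn, the MDL learning algorithm selects $\hat{\mathcal{P}}$ from the family $\mathcal{F}$ of model classes such that

$$\hat{\mathcal{P}} = \mathop{\mathrm{argmin}}_{\mathcal{P}\in\mathcal{F}}\left\{-\log\max_{p\in\mathcal{P}} p(x) + \log C_n(\mathcal{P})\right\},$$

where $C_n(\mathcal{P})$ is the parametric complexity of $\mathcal{P}$ as in (5). The MDL learning algorithm outputs the NML distribution associated with $\hat{\mathcal{P}}$ as in (1): for a sequence $y = y_1, \dots, y_n$,

$$\hat{p}(y) = p_{\mathrm{NML}}(y;\hat{\mathcal{P}}) = \frac{\max_{p\in\hat{\mathcal{P}}} p(y)}{C_n(\hat{\mathcal{P}})}. \tag{24}$$

Note that $y$ is independent of the training sequence $x$ used to obtain $\hat{\mathcal{P}}$.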
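The selection rule described above can be illustrated with a toy family of two model classes for binary data. This is a minimal sketch under assumptions chosen for the example, not the paper's experimental setup: the family consists of a "fair coin" class containing only the Bernoulli(1/2) distribution (whose parametric complexity is 1) and the full Bernoulli class, and all names are hypothetical. The algorithm picks the class with the shorter NML codelength on the training sequence and then scores a fresh sequence with the NML distribution of the selected class.

```python
import math
from typing import Dict, Sequence


def log_max_likelihood(k: int, n: int) -> float:
    """Log of the maximized Bernoulli likelihood for k ones out of n."""
    if k == 0 or k == n:
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log((n - k) / n)


def log_complexity(n: int) -> float:
    """Log parametric complexity of the Bernoulli class (moderate n only)."""
    return math.log(sum(math.comb(n, k) * math.exp(log_max_likelihood(k, n))
                        for k in range(n + 1)))


def nml_codelengths(xs: Sequence[int]) -> Dict[str, float]:
    """NML codelength (nats) of a binary sequence under each model class."""
    n, k = len(xs), sum(xs)
    return {
        # Single distribution: the normalizer equals 1, so its log is 0.
        "fair_coin": n * math.log(2.0),
        "bernoulli": -log_max_likelihood(k, n) + log_complexity(n),
    }


def mdl_select(xs: Sequence[int]) -> str:
    """Return the model class with the shortest NML codelength."""
    lengths = nml_codelengths(xs)
    return min(lengths, key=lengths.get)


def nml_log_prob(model: str, ys: Sequence[int]) -> float:
    """Log NML probability of a new sequence under the selected class."""
    return -nml_codelengths(ys)[model]


if __name__ == "__main__":
    train = [1] * 18 + [0] * 2
    chosen = mdl_select(train)
    print("selected:", chosen)
    print("log p_NML of a fresh sequence:", nml_log_prob(chosen, [1] * 20))
```

On a strongly biased training sequence the richer class wins despite its complexity penalty, while on a balanced sequence the penalty tips the choice to the fair-coin class.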

We have the following theorem relating Ddim to the rate of convergence of the MDL learning algorithm.

###### Theorem 3.1

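Suppose that each $x_i$ is generated according to $p^{*} \in \mathcal{P}^{*} \in \mathcal{F}$. Let $\hat{p}$ be the output of the MDL learning algorithm as in (24). Let $d_B^{(n)}(p, q)$ be the Bhattacharyya distance between $p$ and $q$:

$$d_B^{(n)}(p, q) = -\frac{1}{n}\log\sum_{x}\left(p(x)\,q(x)\right)^{1/2}. \tag{25}$$

Then for any $\epsilon > 0$, we have the following upper bound on the probability that the Bhattacharyya distance between the output of the MDL learning algorithm and the true distribution exceeds $\epsilon$: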

(26) |

Further under the condition for as in Theorem 2.1 or 2.3, we have

Theorem 3.1 implies that if , the NML distribution with model of the shortest NML codelength converges exponentially to the true distribution in probability as increases and the rate is governed by Ddim for the true model.

For , we define the -divergence between two probability mass functions: and by

The Hellinger distance is defined as a specific case of : . Since we can easily verify the following relation between the Hellinger distance and the Bhattacharyya distance:

both bounds of Theorem 3.1 on the rates of convergence also hold for the Hellinger distance, within a constant factor. Further, using a technique similar to that of the proof of Theorem 3.1 shown below, we can verify that the same results hold for the divergence with , within a constant factor.
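The link between the two distances can be checked numerically on small discrete distributions. The sketch below uses the standard unnormalized definitions, which is an assumption of this example and may differ from the paper's conventions by a normalizing factor: the Bhattacharyya coefficient $BC = \sum_x \sqrt{p(x)q(x)}$, the Bhattacharyya distance $-\log BC$, and the squared Hellinger distance $\sum_x(\sqrt{p(x)} - \sqrt{q(x)})^2 = 2(1 - BC)$. Since $-\log t \ge 1 - t$ for $t > 0$, the squared Hellinger distance is at most twice the Bhattacharyya distance, so a bound on the latter transfers to the former up to a constant factor.

```python
import math
from typing import Sequence


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum over outcomes of sqrt(p(x) * q(x))."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def bhattacharyya_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Negative log of the Bhattacharyya coefficient (unnormalized)."""
    return -math.log(bhattacharyya_coefficient(p, q))


def squared_hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum over outcomes of (sqrt(p(x)) - sqrt(q(x))) ** 2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


if __name__ == "__main__":
    p = [0.2, 0.8]
    q = [0.5, 0.5]
    bc = bhattacharyya_coefficient(p, q)
    print("Bhattacharyya distance:", bhattacharyya_distance(p, q))
    print("squared Hellinger:     ", squared_hellinger(p, q))
    print("2 * (1 - BC):          ", 2.0 * (1.0 - bc))
```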

Proof of Theorem 3.1. Let be the true distribution associated with the true model . Let be the model selected by the MDL learning algorithm and let be the NML distribution associated with . We write it as .

By the definition of the MDL learning algorithm,
we have

Let be the NML distribution associated with defined as

For , the following inequalities hold:

(29) |

Let be the event that

Note that under the event ,
we have

Then under the condition that , we have

(30) | |||||

where we have used the fact that by (25), under , it holds

Plugging (30) into (29) yields

This implies (26). Further note that under the condition for as in Theorem 2.1 or Theorem 2.3, the following asymptotic relation holds:

(31) | |||

Plugging (31) into (26) yields the second bound of Theorem 3.1.
This completes the proof.

Theorem 3.1 deals with the case where the true distribution belongs to some model class in the family. We may further be interested in the agnostic case, where the true distribution does not necessarily belong to any of the model classes. Theorem 3.2 shows the rate of convergence of the MDL learning algorithm for this agnostic case.

###### Theorem 3.2

Let be the true distribution and be the output of the MDL learning algorithm. Let be the Kullback-Leibler divergence between and . For , let be the event that for any , for

(32) | |||

Let
where is the complementary set of . Then for any , the probability that the Bhattacharyya distance between and , is upper-bounded as follows:

where

Specifically,
if for some function of , for any ,
, then we have

Theorem 3.2 can be proven in essentially the same way as Theorem 3.1. However, the bound in Theorem 3.1 cannot be obtained as a special case of Theorem 3.2 in which the true distribution belongs to some model class in the family. The technical reason is that the convergence of the empirical log-likelihood ratios to the KL divergence must be evaluated explicitly in the proof of Theorem 3.2, while it need not be evaluated in the proof of Theorem 3.1. For this reason, we keep the proof of Theorem 3.1 in the main text and move that of Theorem 3.2 to Appendix A.2.

Theorem 3.2 shows that the NML distribution with model of the shortest NML codelength converges to the true model in probability as increases, provided that and .

Next we consider model fusion, where the model class is chosen randomly according to the probability distribution over the family. The unknown true distribution is then chosen from that model class. We have the following corollary relating Ddim for model fusion to the rate of convergence of the MDL learning algorithm.

###### Corollary 3.1

Let $\hat{\mathcal{P}}$ be the model selected by the MDL learning algorithm and $\hat{p}$ be the NML distribution associated with $\hat{\mathcal{P}}$. Then, under the condition of Theorem 2.1 or Theorem 2.3 for each model class in the family, for model fusion over the family we have the following upper bound on the expected Bhattacharyya distance between the output of the MDL learning algorithm and the true distribution:
