## 1 Introduction

The minimum description length (MDL) principle provides a general information-theoretic approach to model selection and other forms of statistical inference [5, 17]

. The MDL criterion for model selection is consistent, meaning that it will select the data-generating model from a countable set of competing parametric models with probability approaching

as the sample size goes to infinity [4]. For example, if each of the parametric models is a logistic regression model with predictor variables taken from a fixed set of potential predictors, then the MDL model-selection criterion will choose the correct combination of predictors with probability approaching

as . The MDL model-selection criterion also has a number of strong optimality properties, which greatly extend Shannon’s noiseless coding theorem [5, §III.E].In its simplest form, the MDL principle advocates choosing the model for which the observed data has the shortest message length under a particular prefix code defined by a minimax condition [11, §2.4.3]. Shtarkov [19] showed that this is equivalent to choosing the model with the largest normalized maximum likelihood (NML) for the observed data. Here, minus the logarithm of the NML for a model and observations is

where: is the likelihood function for data at model parameter ;

is the maximum likelihood estimate of

corresponding to ; and the parametric complexity of the model is(1) |

where the integral is over all possible values of the data (and is technically the Lebesgue integral with respect to some measure , so the integral becomes a sum if is discrete).

The NML balances how well the model fits the data (quantified by the maximized likelihood) against how complex or versatile the model is (quantified by the parametric complexity), so the NML gives a natural measure of parsimony. However, in many cases of interest, it is not practical to calculate the parametric complexity directly from the definition (1). For example, the data space for a logistic regression model consists of all binary sequences of length , and the parametric complexity is defined in terms of the sum of over this space, so calculating this sum directly is infeasible even for since there are terms. Clarke and Barron [7, 8] therefore obtained approximations to the parametric complexity in the limit for independent and identically distributed (IID) data, and Rissanen [15] greatly extended these results to the non-IID setting.

Approximations to are only sensible when is finite, but this condition fails in many cases. In proving their approximation, Clarke and Barron [7, 8] and Rissanen [15] therefore effectively restricted the data to those data points whose corresponding maximum likelihood estimates lie in a given compact (i.e., closed and bounded) subset of the parameter space . Restricting to this set of data points in the integral (1) gives a quantity which is always finite in our main case of interest. Then the approximation of [7, 8, 15] is

(2) |

as , where is the dimension of , is the Jeffreys prior on (scaled so it does not depend on ) and is the Fisher information volume of .

Suppose now that the data are IID and that is a natural exponential family, and let be any parameter space for , not necessarily the natural parameterisation. Assume that each distribution in satisfies Cramér’s condition (which holds for all continuous distributions) and that is a compact, -dimensional subset of with smooth boundary (see the end of Section 2 for precise statements of these conditions). Then our main result is the following refinement of (2), for any non-negative integer :

(3) |

as , where each is a smooth function which does not depend on .

Our methods can be used to calculate the asymptotic expansion (3) for to any desired degree of accuracy. The first few of the functions are given explicitly in Theorem 2 in terms of the Fisher information metric and cumulants. In particular, , so taking in (3) gives the approximation (2). Further, setting in (3) shows, at least for the case of exponential families and IID data, that the approximation (2) is actually valid to order , which appears to be previously unknown.

The functions which appear in (3) are likely to be interesting, since they arise naturally from the general, information-theoretic principles which underpin MDL. These functions appear to be more related to the statistical aspects of the model than to its curvature or other geometrical aspects, see Remark 4. Theorem 2 gives these functions in terms of the Fisher information metric and cumulants, but a more natural formulation in terms of the Amari-Chentsov tensors is also given in Section 7.

The rest of this paper is set out as follows. In Section 2 we define the main objects of this paper and fully describe the regularity conditions under which (3) holds. In Section 3 we give an asymptotic expansion for the parametric complexity in terms of a function which arises in Edgeworth expansions. Explicit formulae for the first few terms of this function are then obtained in Section 4, completing the proof of (3). In Section 5, the first two terms of (3) are calculated explicitly for some examples. In Section 6, the finite-sample performance of (2) and these higher order approximations are assessed, using an exact formula for the parametric complexity in the case of spherical normal data. We give alternative, co-ordinate-independent formulae for the functions in Section 7, before finishing with a summary of the paper in Section 8. Appendix A contains the proofs of all of our results and Appendix B gives a fairly self-contained description of Edgeworth expansions and the Hermite numbers.

## 2 Definitions and regularity conditions

As in the Introduction, let be a natural exponential family of order [13, §2.2.1] and let be any parameter space of , which we assume without loss of generality is an open subset of . Then is a set of probability measures on , with each distribution in of the form for some , where: is some measure on ; is the function of alone, with held fixed;

is the reparameterisation map from to the natural parameter space (a diffeomorphism onto its image); is the log-partition function, which is determined by the condition that each measure is normalised; and is the Euclidean inner product of and .

Let be the corresponding model for IID observations and let be the observed data for this model. Then is the set of all measures of the form for , where is the product likelihood and is the product measure on .

The (scaled) Fisher information metric on is the matrix-valued function on with entry

(4) |

for , where the factor of ensures that does not depend on (since the data are IID). Then the Jeffreys prior is times the Lebesgue measure on , and so is equal to the Fisher information volume density on .

Let be the support of , which can be interpreted as the set of all possible values of the observed data . Let be the maximum likelihood estimator for the parameter , so that is the maximum likelihood estimate corresponding to . This estimate does not exist for all but it is unique when it does exist, so is well-defined if we restrict appropriately [2, Corollary 9.6]. If is any compact subset of then let

be the set of data points whose maximum likelihood estimates exist and lie in . Then define

(5) |

to be the contribution to the parametric complexity corresponding to .

Now, the maximum likelihood estimate

is a random variable, so write its distribution as

, where is some smooth function and is some measure on which is independent of . In fact, the family of distributions for all is also an exponential family, though we will not need this result here. For now, we only use the fact that the maximum likelihood estimator is a sufficient statistic (by Theorem 2.2.6 of [13] and the paragraph preceding it) and that the parametric complexity can be calculated from the distribution of any sufficient statistic [5, §III.F]. Then (5) and the argument in [5, §III.F] imply(6) |

which is intuitively reasonable since is the maximized likelihood.

###### Remark 1.

) by using the central limit theorem to approximate the integrand of (

6), despite the fact that the central limit theorem describes the distribution of for any fixed while the integrand of (6) is , where is not fixed. To get around this problem, Rissanen effectively replaced the function by a step function which closely approximates , then he applied the central limit theorem in each of the small sets where is constant (and hence is fixed). By contrast, our approach will be to approximate the integral (6) by (a multiple of) the double integral of with respect to and , where ranges over a small neighbourhood of the set . If we integrate with respect to first and second in this double integral then is fixed in the innermost integral, so we can apply the central limit theorem (and higher-order Edgeworth expansions) there.We now impose the following regularity conditions, which will be needed when we use Edgeworth expansions in the next section to largely prove (3).

###### Condition 1.

We assume that each distribution in the exponential family satisfies Cramér’s condition

(7) |

where

is the characteristic function of

[6, eqn. 1.29]. Cramér’s condition is satisfied for many interesting distributions, such as all continuous distributions (those which admit probability density functions)

[12, p. 57] and the distributions of the minimal sufficient statistics of continuous exponential families [6, Lemma 1.5].###### Condition 2.

In addition to assuming that is compact, we assume that is a smooth -manifold with boundary. This means that each point of is at the centre of some small -dimensional ball which intersects either in the whole ball (for interior points of ) or in approximately a half ball (for boundary points of ).

## 3 Higher order asymptotics

In this section, we will mostly work in the expectation parameter space of the exponential family , though the details of this parameter space are not important here beyond two specific properties, which we now recall. Let denote the expectation parameter space and let denote a generic expectation parameter (which is equal to the expected value of the sufficient statistic). If are IID random variables governed by an unknown element of and , as in Section 2, then the first property is that the maximum likelihood estimator of the expectation parameter is simply the mean, i.e.,

(8) |

by Theorem 2.2.6 of [13] and the comments preceding it. The second property is that this estimator is exactly (as opposed to asymptotically) unbiased and efficient, meaning that

(9) |

by [13, Theorems 2.2.1 and 2.2.5], where is the Fisher information metric on and the superscripted indicates the matrix inverse. The first property allows us to apply Edgeworth expansions directly to the maximum likelihood estimator and the second property allows us to express the approximating distribution in terms of the Fisher information matrix.

Let be the probability density function (PDF) of the

-dimensional normal distribution

which has the same mean and variance as

, so that(10) |

for any . As in Section 2 (but with , , , in place of , , , ), let the distribution of the maximum likelihood estimator of be . For any fixed , (8) and the central limit theorem imply that this distribution can be approximated as

(11) |

as , where is the Lebesgue measure on and the approximation is uniform for all Borel sets satisfying some weak regularity conditions (e.g. see [6, Corollary 1.4]). Edgeworth expansions [3, 6] refine this result to imply, for fixed and for any integer or half-integer , that there is a function so that

(12) |

as , where the error is uniform for all Borel sets satisfying weak regularity conditions, as for (11).

The function appearing in (12) is a polynomial in and only depends on through the cumulants of the distribution in corresponding to . In Section 4 we will give explicit formulae for the function in the case of most interest to us. However, even without an explicit formula, the defining property (12) of allows us to state and prove the following theorem, which is the main ingredient in our proof of (3).

###### Theorem 1.

Let and satisfy the regularity conditions given at the end of Section 2, where is any parameter space for . Let be the reparameterisation map from to the expectation parameter space . Then for any integer ,

(13) |

as .

###### Remark 2.

Edgeworth expansions have a number of known weaknesses, such as potentially giving signed distributions (e.g., partly negative PDFs) and only controlling absolute errors, which are not very informative in the tails of distributions. However, neither of these defects affects our results, since we only need Edgeworth expansions in small neighbourhoods of the mean.

## 4 The integrand on the natural parameter space

In this section, we complete the proof of (3) by calculating the integrand appearing in (13) for the special case when is the natural parameter space of the exponential family. Note that the corresponding function for any other parameter space is simply obtained by composing the function found in this section with the appropriate reparameterisation map (see the first paragraph of Section A.1).

So let be the natural parameter space and let be the log-partition function of the natural exponential family , so that the distribution of one data point is

(14) |

for some measure on which does not depend on . The cumulant generating function of the distribution (14) is [13, eq. 2.2.4], so for any integer and any , the cumulant of the distribution (14) is

(15) |

Let be the component of the matrix inverse of the Fisher information metric on .

###### Theorem 2.

For the natural parameter space, the integrand appearing in (13) is given by

for any , where each is a function of but not . In particular, ,

(16) |

and

(17) |

where is defined below and each index ranges from to in each sum, e.g.

###### Proof.

The expressions appearing in (17), where is an even positive integer, are the (un-normalized) Hermite numbers given by

(18) |

where

are dummy variables, see Theorem

5 in Appendix B. The expression in square brackets is a degree polynomial in so does not depend on . Up to sign, is a certain sum of products of components of , e.g. see (39) and (A.2).###### Remark 3.

The un-normalized Hermite numbers (which corresponding to a normal distribution with a general, rather than identity, variance-covariance matrix) also arise in Gaussian processes and quantum field theory [14, eqn. 1 and 2].

###### Remark 4.

The geometrical or statistical significance of the functions is not clear. does not appear to be any sort of contraction of the Riemann curvature tensor (for either the Levi-Civita connection or any of Amari’s other -connections [1, §2.3]) because in the natural parameter space, only involves third derivatives of the log-partition function whereas involves fourth derivatives. Some of the terms in (2) have a superficial similarity to Efron’s curvature (see [9] or [13, eqn. 4.3.10]), but Efron’s formula measures the extrinsic curvature of curved exponential families and so is for exponential families. In Section 7, we give an expression for that is valid in any parameter space, in terms of contractions of products of the Amari-Chentsov tensors.

## 5 Examples

It is possible to calculate the functions appearing in (3) by using the formulae of Theorem 2 and a formula-manipulation program like Maxima [18]. This is especially easy when is small, though it can also be done for all , as we now illustrate.

### 5.1 Exponential data

Let

be the family of exponential distributions on

, so each observation is governed by some distribution of the form (14), where is the Lebesgue measure restricted to the positive reals, the natural parameter space is the set of negative reals, is the rate parameter of the exponential distribution and the log-partition function isEach distribution in is continuous so it satisfies Cramér’s condition by [12, p. 57].

Since the parameter space of this family has dimension , the indices in (2) and (17) all take the value , so each sum consists of a single term. Also, is the second derivative of by (4), so , and the matrix inverse of the matrix has the single component . Also, by (15), the cumulants appearing in Theorem 2 are just the derivatives of , e.g. and . So by (2),

A similar calculation, based on (17) and aided by Maxima [18], shows that

Since we have just shown that and are constant in , we will now write and for these constant functions. Then (3) implies

(19) |

where the second last step uses the asymptotic expansion as . Substituting the above values of and into (19) therefore gives

because vanishes.

### 5.2 Spherical normal data with unknown variance

We now consider a model where each observation is distributed according to the -dimensional spherical normal distribution , where is the identity matrix and the parameters and are to be estimated. One reason for studying this model is that an exact formula for is known, so in Section 6 we will verify the expression for given here.

For this model, a sufficient statistic and the corresponding natural parameter are given by

(20) |

where and . Then is governed by a member of a natural exponential family of the form (14), with log-partition function

Note that Cramér’s condition holds for each distribution in this natural exponential family, by [6, Lemma 1.5]. The (scaled) Fisher information metric is the Hessian of , by (4) or [13, Theorem 2.2.5], and the matrix inverse of has component

(21) |

Also, the cumulants can be obtained by differentiating , as in (15). Then a calculation based on (2) (which was aided by Maxima [18] and which considered all combinations of the two cases and for each index ) shows that the three sums appearing in (2) are constant in and are equal to the expressions in square brackets below:

(22) |

So when , (3) becomes

(23) |

as . Note that this is just the usual approximation (2) plus the correction term . We will investigate the finite-sample performance of the two approximations (23) and (2) in Section 6, below.

### 5.3 Spherical normal data with known variance

For spherical normal data with known variance, the log-partition function is quadratic in , so all of the cumulants vanish for , and hence the functions also vanish for . So by (3), the standard formula (2) is accurate to order for all integers . This is a positive check on our formulae, because it is well-known that (2) is the exact formula for this model, as we now briefly demonstrate.

Working in the expectation parameter space, if for known and unknown, then and hence . Therefore and

so . Hence if then the exact formula for is

since .

## 6 Finite sample performance

In this section we compare the finite-sample performance of the standard approximation (2) and its first-order correction (i.e., (3) with ), in the case of the spherical normal model with unknown variance (see Section 5.2).

Rissanen [16]

studied the linear regression model with the parameterisation

, where is the noise variance and is the regression coefficient (note that this is not the natural parameterisation of this exponential family). Let be the maximum likelihood estimate for . If the model has observations, covariates and design matrix (i.e., is the matrix whose columns are the covariates of the model) then Rissanen showed that the maximized likelihood is(24) |

with respect to the Lebesgue measure , where is the gamma function. It is not hard to show that the unscaled Jeffrey’s prior for the above parameterisation is given by

(25) |

where ‘unscaled’ means that is the volume density of the unscaled Fisher information metric, which is times the right-hand side of (4). Therefore

(26) |

where the last step follows because the integrand is constant in . Note that this is not an approximation, is given exactly by (26).

Now, take to be the matrix , where , and is the

identity matrix. Then the regression response variable

is distributed in such a way that are IID, as in Section 5.2. Substituting these values into (26) therefore gives the following exact formula for the parametric complexity of the model of Section 5.2:(27) |

where we have retained the notation for convenience and where, as is standard throughout this paper, is the volume of with respect to the scaled Jeffreys prior . We can use this exact formula in two ways: to assess the accuracy of our asymptotic expansion and to check the veracity of our calculations.

Figure 1 assesses the accuracy of both the standard approximation (2) to , in the case of spherical normal data with unknown variance, and its first-order correction (23). Note that the summand cancels when comparing these two approximations to the exact formula (27), so the the amount of over- or underestimation of does not depend on . Figure 1 shows that both formulae overestimate , and hence NML codelength, for all model dimensions and all sample sizes , with the amount of overestimation increasing with and decreasing with (note that results cannot be calculated for the under-determined models with .) The overestimation is negligible for the corrected formula (23) when , but it is significant for the standard formula (2) for moderate , e.g. it is greater than for and .

Substituting an asymptotic expansion for the gamma function into (27) gives

Comparing the above formula to (23), which was obtained directly from Theorems 1 and 2, shows that our expression (22) for is correct. The coefficient of in the above formula also verifies our calculation (not shown here) of for . This is therefore a positive check on the formulae of Theorems 1 and 2.

## 7 The integrand on a general parameter space

In this section, we give an expression for the integrand of (13) which is valid in all parameter spaces. The expression for in Theorem 2 is convenient for calculations, but it is not invariant under co-ordinate changes because cumulants are not tensors on . The cumulant for a distribution is independent of the family to which it belongs (cumulants are derivatives of with respect to the dummy variable ) so the cumulants transform like functions rather than the components of a tensor. A co-ordinate-dependent expression like the one in Theorem 2 is likely to be arbitrary in some ways, and therefore likely to hide the true mathematical structure, so in this section we give an expression for which holds for all parameterisations.

For any , define the Amari-Chentsov tensor to be the symmetric tensor on with components

(28) |

where is the likelihood function at parameter for one observation

. This is a tensor on any parameter space because it has a co-ordinate-independent definition (essentially just by replacing the partial differentials with arbitrary first-order partial differential operators, which are interpreted as vector fields in a way which is standard in differential geometry).

It is easy to show that the first Amari-Chentsov tensor is trivial, , but the second one is the Fisher information metric, , so the Amari-Chentsov tensors can be seen as generalisations of the Fisher information metric tensor. The third Amari-Chentsov tensor is also interesting, because the difference between any two of Amari’s -connections is a multiple of [1, §2.3].

In the natural parameterisation of an exponential family, the Amari-Chentsov tensors can be calculated using the following recurrence relation.

###### Lemma 3.

If is the natural parameterisation of the exponential family and is an integer then

where the hat indicates that the term should be left out.

###### Proof.

See Section A.3. ∎

So in the natural parameterisation, using , and Lemma 3, we have

Comments

There are no comments yet.