 # Efficient estimation of the ANOVA mean dimension, with an application to neural net classification

The mean dimension of a black box function of d variables is a convenient way to summarize the extent to which it is dominated by high or low order interactions. It is expressed in terms of 2^d-1 variance components but it can be written as the sum of d Sobol' indices that can be estimated by leave one out methods. We compare the variance of these leave one out methods: a Gibbs sampler called winding stairs, a radial sampler that changes each variable one at a time from a baseline, and a naive sampler that never reuses function evaluations and so costs about double the other methods. For an additive function the radial and winding stairs are most efficient. For a multiplicative function the naive method can easily be most efficient if the factors have high kurtosis. As an illustration we consider the mean dimension of a neural network classifier of digits from the MNIST data set. The classifier is a function of 784 pixels. For that problem, winding stairs is the best algorithm. We find that inputs to the final softmax layer have mean dimensions ranging from 1.35 to 2.0.


## 1 Introduction

The mean dimension of a square integrable function quantifies the extent to which higher order interactions among its input variables are important. At one extreme, an additive function has mean dimension one and this makes numerical tasks such as optimization and integration much simpler. It can also make it easier to compare the importance of the inputs to a function and it simplifies some visualizations. At the other extreme, a function that equals a $d$-fold interaction has mean dimension $d$ and can be much more difficult to study.

The mean dimension of a function can be estimated numerically by algorithms that change just one input variable at a time. A prominent example is the winding stairs estimator of Jansen et al. (1994) which runs a Gibbs sampler over the input space. The squared differences in a function’s value arising from changing one input at a time can be used to estimate a certain Sobol’ index described below. The mean dimension is defined in terms of a sum of such Sobol’ indices. When estimating the mean dimension, covariances among the corresponding Sobol’ estimates can greatly affect the efficiency of the estimation strategy. Sometimes a naive approach that uses roughly twice as many function evaluations can be more efficient than winding stairs because it eliminates covariances.

The outline of this paper is as follows. Section 2 introduces some notation, and defines the ANOVA decomposition, Sobol’ indices and the mean dimension. Section 3 presents three strategies for sampling pairs of input points that differ in just one component. A naive method takes $2dN$ function evaluations to get $N$ such pairs of points for each of $d$ input variables. It never reuses any function values. A radial strategy (Campolongo et al., 2011) uses $(d+1)N$ function evaluations in which $N$ baseline points each get paired with $d$ other points that change one of the inputs. The third strategy is winding stairs mentioned above, which uses $dN+1$ function evaluations. Section 4 compares the variances of mean dimension estimates based on these strategies. Those variances involve fourth moments of the original function. We consider additive and multiplicative functions. For additive functions all three methods have the same variance, making the naive method inefficient by a factor of about $2$ for large $d$. For more complicated functions, methods that save function evaluations by reusing some of them can introduce positive correlations yielding a less efficient estimate. The presence of high kurtoses can decrease the value of reusing evaluations. Section 5 presents an example where we measure the mean dimension of a neural network classifier designed to predict a digit ‘0’ through ‘9’ based on $784$ pixels. We find some mean dimensions in the range $1.35$ to $2.0$ for the penultimate layer of the network, suggesting that the information from those pixels is being used mostly one or two or three pixels at a time. For instance, there cannot be any meaningfully large interactions involving a great many of the inputs. Section 6 makes some concluding remarks. Notably, the circumstances that make the radial method inferior to the naive method or winding stairs for computing mean dimension serve to make it superior to them for some other uncertainty quantification tasks. We also discuss randomized quasi-Monte Carlo sampling alternatives. Finally, there is an Appendix in which we make a more detailed analysis of winding stairs.

## 2 Notation

We begin with the analysis of variance (ANOVA) decomposition for a function $f(x)$ where $x=(x_1,\dots,x_d)$. We let $x\in\mathcal{X}=\prod_{j=1}^d\mathcal{X}_j$ where $x_j\in\mathcal{X}_j$. The ANOVA is defined in terms of a distribution $P$ on $\mathcal{X}$ for which the $x_j$ are independent and for which $\sigma^2=\mathrm{Var}(f(x))<\infty$. The $\mathcal{X}_j$ are ordinarily subsets of $\mathbb{R}$ but the ANOVA is well defined for more general domains. We let $P_j$ denote the distribution of $x_j$ and $P=\prod_{j=1}^dP_j$ denote the distribution of $x$.

We will use $1{:}d$ as a short form for $\{1,2,\dots,d\}$. For sets $u\subseteq1{:}d$, their cardinality is $|u|$ and their complement $1{:}d\setminus u$ is denoted by $-u$. The components $x_j$ for $j\in u$ are collectively denoted by $x_u$. We will use hybrid points that merge components from two other points. The point $y=x_u{:}z_{-u}$ has $y_j=x_j$ for $j\in u$ and $y_j=z_j$ for $j\notin u$. It is typographically convenient to replace singletons $\{j\}$ by $j$, especially within subscripts.

The ANOVA decomposition writes $f(x)=\sum_{u\subseteq1{:}d}f_u(x)$ where the ‘effect’ $f_u(x)$ depends on $x$ only through $x_u$. The first term is $f_\varnothing(x)=\mu=\mathbb{E}(f(x))$ and the others are defined recursively via

$$f_u(x)=\mathbb{E}\Bigl(f(x)-\sum_{v\subsetneq u}f_v(x)\Bigm|x_u\Bigr).$$

The variance component for $u\subseteq1{:}d$ is

$$\sigma^2_u\equiv\mathrm{Var}(f_u(x))=\begin{cases}\mathbb{E}(f_u(x)^2),&u\ne\varnothing\\0,&u=\varnothing.\end{cases}$$

The effects are orthogonal under $P$ and so $\sigma^2=\sum_{u\subseteq1{:}d}\sigma^2_u$. We will assume that $\sigma^2>0$ in order to make some quantities well defined.

Sobol’ indices (Sobol’, 1990, 1993) quantify the importance of subsets of the input variables to $f$. They are a primary method in global sensitivity analysis (Saltelli et al., 2008; Iooss and Lemaître, 2015; Borgonovo and Plischke, 2016). Sobol’s lower and upper indices are

$$\underline{\tau}^2_u=\sum_{v\subseteq u}\sigma^2_v\quad\text{and}\quad\overline{\tau}^2_u=\sum_{v\cap u\ne\varnothing}\sigma^2_v,$$

respectively. These are commonly normalized, with $\underline{\tau}^2_u/\sigma^2$ known as the closed index and $\overline{\tau}^2_u/\sigma^2$ called the total index. Normalized indices are between $0$ and $1$, giving them interpretations as a proportion of variance explained, similar to $R^2$ from regression models. The Sobol’ indices $\underline{\tau}^2_j$ and $\overline{\tau}^2_j$ for singletons are of special interest. Sobol’ indices satisfy some identities

$$\underline{\tau}^2_u=\mathbb{E}\bigl(f(x)f(x_u{:}z_{-u})\bigr)-\mu^2=\mathbb{E}\bigl(f(x)\bigl(f(x_u{:}z_{-u})-f(z)\bigr)\bigr)\quad\text{and}\quad\overline{\tau}^2_u=\frac12\mathbb{E}\bigl((f(x)-f(x_{-u}{:}z_u))^2\bigr),$$

where $z\sim P$ is sampled independently of $x$. These identities make it possible to estimate the indices by Monte Carlo or quasi-Monte Carlo sampling without explicitly computing estimates of any of the effects $f_u$. The first and third identities are due to Sobol’ (1993). The second was proposed independently by Saltelli (2002) and Mauntz (2002).

The mean dimension of $f$ is

$$\nu(f)=\sum_{u\subseteq1{:}d}|u|\frac{\sigma^2_u}{\sigma^2}.$$

It satisfies $1\leqslant\nu(f)\leqslant d$. A low mean dimension indicates that $f$ is dominated by low order ANOVA terms, a favorable property for some numerical problems.

An easy identity from Liu and Owen (2006) shows that $\sum_{u\subseteq1{:}d}|u|\sigma^2_u=\sum_{j=1}^d\overline{\tau}^2_j$. Then the mean dimension of $f$ is

$$\nu(f)\equiv\frac{1}{\sigma^2}\sum_{j=1}^d\overline{\tau}^2_j,\quad\text{for}\quad\overline{\tau}^2_j=\frac12\mathbb{E}\bigl((f(x)-f(x_{-j}{:}z_j))^2\bigr).$$

Although the mean dimension combines $2^d-1$ nonzero variances, it can be computed from just $d$ Sobol’ indices (and the total variance $\sigma^2$).

We can get a Monte Carlo estimate of the numerator of $\nu(f)$ by summing estimates of $\overline{\tau}^2_j$ such as

$$\frac{1}{2N}\sum_{i=1}^N\bigl(f(x_i)-f(x_{i,-j}{:}z_{i,j})\bigr)^2 \tag{1}$$

for independent random points $x_i,z_i\sim P$. There is more than one way to arrange this computation and the choice can make a big difference to the accuracy.
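As a concrete illustration, the naive version of this computation can be sketched as follows. This is our own minimal code, not the authors', and it assumes i.i.d. $\mathcal{N}(0,1)$ inputs; all function and variable names are ours. An additive test function has true mean dimension one and a product of three mean zero factors has true mean dimension three, so the estimates should land near those values.

```python
import random

def mean_dimension_naive(f, d, N, rng):
    """Estimate nu(f) = delta / sigma^2 for f with i.i.d. N(0,1) inputs,
    using the naive strategy: a fresh pair (x, z_j) for every i and j."""
    delta = 0.0
    for j in range(d):
        acc = 0.0
        for _ in range(N):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            y = x.copy()
            y[j] = rng.gauss(0.0, 1.0)   # change only coordinate j
            acc += (f(x) - f(y)) ** 2
        delta += acc / (2.0 * N)         # estimate of tau-bar_j^2, summed over j
    # estimate sigma^2 = Var(f(x)) from an independent sample
    vals = [f([rng.gauss(0.0, 1.0) for _ in range(d)]) for _ in range(N)]
    mu = sum(vals) / N
    sigma2 = sum((v - mu) ** 2 for v in vals) / (N - 1)
    return delta / sigma2

rng = random.Random(1)
nu_add = mean_dimension_naive(lambda x: sum(x), d=3, N=2000, rng=rng)
nu_prod = mean_dimension_naive(lambda x: x[0] * x[1] * x[2], d=3, N=2000, rng=rng)
```

With $N=2000$ the additive estimate should be within a few percent of one, and the product estimate near three, though with noticeably more noise.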

## 3 Estimation strategies

Equation (1) gives an estimate of $\overline{\tau}^2_j$ evaluating $f$ at pairs of points that differ only in their $j$’th coordinate. An estimate for the numerator of $\nu(f)$ sums these estimates over $j=1,\dots,d$. We have found empirically and somewhat surprisingly that different sampling methods for computing the numerator can have markedly different variances.

A naive implementation uses $2dN$ function evaluations, taking independent $x^{(j)}_i,z^{(j)}_i\sim P$ for $i=1,\dots,N$ and each of $j=1,\dots,d$. In that strategy, the point $x_i$ in (1) is actually different for each $j$. Such a naive implementation is wasteful. We could instead use the same $x_i$ and $z_i$ for all $j$, in the radial method of Campolongo et al. (2011). This takes $(d+1)N$ evaluations of $f$. A third strategy is known as ‘winding stairs’ (Jansen et al., 1994). The data come from a Gibbs sampler that, in its most basic form, changes the inputs to $f$ one at a time, changing indices in this order: $1,2,\dots,d,1,2,\dots,d,1,2,\dots$. It uses only $dN+1$ evaluations of $f$. These three approaches are illustrated in Figure 1. We will also consider a variant of winding stairs that randomly refreshes $x$ after every block of $d$ evaluations.

Figure 1: Examples of three input sets to compute $\delta=\sum_{j=1}^d\overline{\tau}^2_j$ when $d=2$. The naive estimate uses $dN$ pairs of points, $N$ pairs for each of $d$ variables. Each edge connects a pair of points used in the estimate. The radial estimate uses $N$ baseline points and $d$ comparison points for each of them. The winding stairs estimate sequentially changes one input at a time.
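The bookkeeping behind the three designs can be sketched as below. This is our own illustration, with uniform inputs standing in for $P$, and it simply counts function evaluations: $2dN$ for the naive design, $(d+1)N$ for the radial design, and $dN+1$ for winding stairs.

```python
import random

def naive_pairs(d, N, rng):
    """dN pairs; each pair is two fresh points differing in one coordinate."""
    pairs = []
    for j in range(d):
        for _ in range(N):
            x = [rng.random() for _ in range(d)]
            y = x.copy()
            y[j] = rng.random()
            pairs.append((x, y))
    return pairs                        # costs 2 * d * N evaluations

def radial_evals(d, N, rng):
    """Each baseline x_i is paired with d one-coordinate changes."""
    evals = 0
    for _ in range(N):
        x = [rng.random() for _ in range(d)]
        evals += 1                      # f(x_i)
        for j in range(d):
            y = x.copy()
            y[j] = rng.random()
            evals += 1                  # f(x_{i,-j}:z_{i,j})
    return evals                        # (d + 1) * N evaluations

def winding_evals(d, N, rng):
    """Change coordinates 1, 2, ..., d cyclically; one evaluation per change."""
    x = [rng.random() for _ in range(d)]
    evals = 1                           # f(x_0)
    for i in range(d * N):
        x[i % d] = rng.random()
        evals += 1
    return evals                        # d * N + 1 evaluations

rng = random.Random(0)
d, N = 4, 10
n_naive = 2 * len(naive_pairs(d, N, rng))
n_radial = radial_evals(d, N, rng)
n_winding = winding_evals(d, N, rng)
```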

First we compare the naive to the radial strategy. We concentrate on estimation strategies for the numerator

$$\delta=\sigma^2\nu=\sum_{j=1}^d\overline{\tau}^2_j.$$

This quantity is much more challenging to estimate than the denominator $\sigma^2$, especially for large $d$, because estimates of its $d$ terms can be correlated.

The naive sampler takes

$$\hat\delta=\sum_{j=1}^d\hat{\overline{\tau}}^2_j\quad\text{where}\quad\hat{\overline{\tau}}^2_j=\frac{1}{2N}\sum_{i=1}^N\bigl(f(x^{(j)}_i)-f(x^{(j)}_{i,-j}{:}z^{(j)}_{i,j})\bigr)^2 \tag{2}$$

with independent $x^{(j)}_i,z^{(j)}_i\sim P$ for $i=1,\dots,N$ and $j=1,\dots,d$. It takes $2dN$ input vectors and $2dN$ evaluations of $f$. The radial sampler takes

$$\tilde\delta=\sum_{j=1}^d\tilde{\overline{\tau}}^2_j\quad\text{where}\quad\tilde{\overline{\tau}}^2_j=\frac{1}{2N}\sum_{i=1}^N\bigl(f(x_i)-f(x_{i,-j}{:}z_{i,j})\bigr)^2, \tag{3}$$

for independent $x_i,z_i\sim P$, $i=1,\dots,N$.

Both $\hat\delta$ and $\tilde\delta$ converge to $\delta$ as $N\to\infty$ by the law of large numbers. To compare the accuracy of these estimates we assume also that $\mathbb{E}(f(x)^4)<\infty$. Then both estimates have variances that are $O(1/N)$.

A first comparison is that

$$\mathrm{Var}(\hat\delta)=\sum_{j=1}^d\mathrm{Var}(\hat{\overline{\tau}}^2_j)\quad\text{while}\quad\mathrm{Var}(\tilde\delta)=\sum_{j=1}^d\mathrm{Var}(\tilde{\overline{\tau}}^2_j)+2\sum_{j<k}\mathrm{Cov}(\tilde{\overline{\tau}}^2_j,\tilde{\overline{\tau}}^2_k) \tag{4}$$

by independence of $\hat{\overline{\tau}}^2_j$ from $\hat{\overline{\tau}}^2_k$ for $j\ne k$. What we see from (4) is that while the naive estimate uses about twice as many function evaluations, the radial estimate's variance involves $d^2$ variance and covariance terms where the naive one has just $d$. The off diagonal covariances do not have to be very large for us to have $\mathrm{Var}(\tilde\delta)>2\,\mathrm{Var}(\hat\delta)$, in which case $\hat\delta$ becomes the more efficient estimate despite using more function evaluations. Intuitively, each time $f(x_i)$ takes an unusually large or small value it could make a large contribution to all of the $\tilde{\overline{\tau}}^2_j$ and this can result in positive covariances. We study this effect more precisely below, giving additional assumptions under which $\mathrm{Cov}(\tilde{\overline{\tau}}^2_j,\tilde{\overline{\tau}}^2_k)>0$. We also have a numerical counter-example at the end of this section, and so this positive covariance does not hold for all $f$.

The winding stairs algorithm starts at $x_0\sim P$ and then makes a sequence of single variable changes to generate $x_i$ for $i\geqslant1$. We let $\ell(i)\in1{:}d$ be the index of the component that is changed at step $i$. The new values are independent samples $z_i\sim P_{\ell(i)}$. That is, for $i\geqslant1$,

$$x_{i,j}=\begin{cases}z_i,&j=\ell(i)\\x_{i-1,j},&j\ne\ell(i).\end{cases}$$

We have a special interest in the case where $P=\mathcal{N}(0,I_d)$ and there each $z_i\sim\mathcal{N}(0,1)$.

The indices $\ell(i)$ can be either deterministic or random. We let $L$ be the entire collection of the $\ell(i)$. We assume that $L$ is independent of the $z_i$. The most simple deterministic update has $\ell(i)=1+((i-1)\bmod d)$ and it cycles through all indices in order. The simplest random update has $\ell(i)\sim\mathsf{U}(1{:}d)$. In usual Gibbs sampling it would be better to take $\ell(i+1)\ne\ell(i)$ for all $i$. Here, because we are accumulating squared differences, it is not very harmful to have $\ell(i+1)=\ell(i)$. The vector $x_i$ contains $d$ independently sampled Gaussian random variables. Which ones those are depends on $L$. Because $x_i\sim\mathcal{N}(0,I_d)$ conditionally on $L$, it also has that distribution unconditionally.

Letting $e_j$ be the $j$’th unit vector in $\mathbb{R}^d$ we can write

$$x_i=x_{i-1}+(z_i-x_{i-1,\ell(i)})e_{\ell(i)}.$$

If the $\ell(i)$ are random, then the distribution of $x_i$ given $x_{i-1}$ is a mixture of $d$ different Gaussian distributions, one for each value of $\ell(i)$. As a result the sequence $(x_0,x_1,\dots)$ does not then have a multivariate Gaussian distribution and is harder to study. For this reason, we focus on the deterministic update.

In the deterministic update we find that any finite set of the $x_i$ has a multivariate Gaussian distribution. We also know that $x_i$ and $x_{i'}$ are independent for $|i-i'|\geqslant d$ because after $d$ steps all components of $x_i$ have been replaced by new values. It remains to consider the correlations among a block of $d$ consecutive vectors. Those depend on the pattern of shared components within different observations as illustrated in the following diagram:

$$\begin{array}{cccccc}
x_d & x_{d+1} & x_{d+2} & \cdots & x_{2d-1} & x_{2d}\\[4pt]
\begin{pmatrix}z_1\\ z_2\\ \vdots\\ z_{d-1}\\ z_d\end{pmatrix} &
\begin{pmatrix}z_{d+1}\\ z_2\\ \vdots\\ z_{d-1}\\ z_d\end{pmatrix} &
\begin{pmatrix}z_{d+1}\\ z_{d+2}\\ \vdots\\ z_{d-1}\\ z_d\end{pmatrix} &
\cdots &
\begin{pmatrix}z_{d+1}\\ z_{d+2}\\ \vdots\\ z_{2d-1}\\ z_d\end{pmatrix} &
\begin{pmatrix}z_{d+1}\\ z_{d+2}\\ \vdots\\ z_{2d-1}\\ z_{2d}\end{pmatrix}
\end{array} \tag{5}$$

For $i\geqslant d$ and $j\in1{:}d$ we can write

$$x_{i,j}=z_{r(i,j)}\quad\text{where}\quad r(i,j)=d\Bigl\lfloor\frac{i-j}{d}\Bigr\rfloor+j. \tag{6}$$

It is convenient to use (6) for all $i\geqslant0$, which is equivalent to initializing the sampler at $x_0=(z_{1-d},z_{2-d},\dots,z_0)$. Equation (6) holds for any independent $z_r$ and does not depend on the distribution $P$.
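The indexing identity (6) can be checked mechanically. In this sketch of ours, we run the deterministic-order update symbolically, recording the subscript $r$ of the $z_r$ that each component currently holds, and compare with the closed form.

```python
d = 4
# x_0 = (z_{1-d}, ..., z_0): store the subscript r of the z_r held by each slot
x = [j - d for j in range(1, d + 1)]
history = {0: x.copy()}
for i in range(1, 3 * d + 1):
    ell = 1 + (i - 1) % d      # deterministic update: cycle through 1, 2, ..., d
    x[ell - 1] = i             # component ell(i) becomes z_i
    history[i] = x.copy()

def r(i, j, d=d):
    # claimed closed form: r(i, j) = d * floor((i - j) / d) + j
    return d * ((i - j) // d) + j

checks = all(history[i][j - 1] == r(i, j)
             for i in range(3 * d + 1) for j in range(1, d + 1))
```

The flag `checks` comes out true: every component of every $x_i$ matches $z_{r(i,j)}$.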

The winding stairs estimate of $\delta$ is

$$\check\delta=\sum_{j=1}^d\check{\overline{\tau}}^2_j\quad\text{for}\quad\check{\overline{\tau}}^2_j=\frac{1}{2N}\sum_{i=1}^N\Delta^2_{d(i-1)+j}, \tag{7}$$

where $\Delta_i=f(x_i)-f(x_{i-1})$. We will see that the covariances of $\check{\overline{\tau}}^2_j$ and $\check{\overline{\tau}}^2_k$ depend on the pattern of common components among the $x_i$. For our special case functions, certain kurtoses have an impact on the variance of winding stairs estimates.

A useful variant of winding stairs simply makes independent replicates of the vectors shown in (5). That raises the number of function evaluations from $dN+1$ to $(d+1)N$. It uses $N$ independent Markov chains of length $d+1$. For large $d$ the increased computation is negligible. For $d=2$ this disjoint winding stairs method is the same as the radial method. In original winding stairs, each squared difference can be correlated with up to $2(d-1)$ other squared differences. In disjoint winding stairs, it can only be correlated with $d-1$ other squared differences. We denote the resulting estimate by $\ddot\delta$, which is a sum of $\ddot{\overline{\tau}}^2_j$.

In Section 4 we present some multiplicative functions where the naive estimator of $\delta$ has much less than half of the variance of the radial estimator. To complete this section we exhibit a numerical example where the naive estimator has increased variance, which must mean that the correlations induced by the radial and winding estimators are at least slightly negative. The integrand is simply $f(x)=\lVert x\rVert_2$ for $x\sim\mathcal{N}(0,I)$ in $d$ dimensions. Figure 2 shows results. There, (truncated) winding stairs and radial sampling both have smaller variance than the naive algorithm for estimating $\delta$. We also see extremely small mean dimensions for $f$ that decrease as $d$ increases. It relates to some work in progress studying the mean dimension of radial basis functions as a counterpart to Hoyt and Owen (2020) on the mean dimension of ridge functions. The visible noise in that figure stems from the mean dimensions all being so very close to $1$ that the vertical range is quite small. The estimate for $d=1$ is close to, but not exactly, $1$, where the true value must be exactly $1$.

Figure 2: The left panel shows low and mostly decreasing estimates of $\nu(f)$ versus dimension for $f(x)=\lVert x\rVert_2$ when $x\sim\mathcal{N}(0,I)$. The right panel shows variances of estimates of $\delta$ for this function.

## 4 Additive and multiplicative functions

The variances of quadratic functions of the values $f(x_i)$, such as $\hat\delta$, $\tilde\delta$ and $\check\delta$, involve fourth moments of the original function. Whereas variance components are sufficient to define Sobol’ indices and numerous generalizations, fourth moments do not simplify nearly as much from orthogonality and involve considerably more quantities. While distinct pairs of ANOVA effects are orthogonal, we find for non-empty $u,v,w\subseteq1{:}d$ that

$$\mathbb{E}\bigl(f_u(x)f_v(x)f_w(x)\bigr)$$

does not in general vanish when $u\ne v$, $u\ne w$ and $v\ne w$ all hold. This ‘chaining phenomenon’ is worse for products of four effects: the number of non-vanishing combinations rises even more quickly with $d$. The chaining problem also comes up if we expand $f$ in an orthonormal basis and then look at fourth moments.

In this section we investigate some special functional forms. The first is an additive model

$$f_{\mathrm{A}}(x)=\mu+\sum_{j=1}^dg_j(x_j) \tag{8}$$

where $\mathbb{E}(g_j(x_j))=0$. An additive model with finite positive variance has mean dimension $1$. It represents one extreme in terms of mean dimension. The second function we consider is a product model

$$f_{\mathrm{P}}(x)=\prod_{j=1}^dg_j(x_j) \tag{9}$$

where $\mu_j=\mathbb{E}(g_j(x_j))$ and $\sigma^2_j=\mathrm{Var}(g_j(x_j))$. Product functions are frequently used as test functions. For instance, Sobol’s $g$-function (Saltelli and Sobol’, 1995) is the product of factors $(|4x_j-2|+a_j)/(1+a_j)$, in which later authors make various choices for the constants $a_j$.

If all $\mu_j=0$ then $\nu(f_{\mathrm{P}})=d$. In general, the mean dimension of a product function is

$$\nu(f_{\mathrm{P}})=\frac{\sum_{j=1}^d\sigma^2_j/(\mu^2_j+\sigma^2_j)}{1-\prod_{j=1}^d\mu^2_j/(\mu^2_j+\sigma^2_j)}.$$

See Owen (2003).
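This closed form can be checked against a brute force ANOVA computation, since for a product function $\sigma^2_u=\prod_{j\in u}\sigma^2_j\prod_{j\notin u}\mu^2_j$. The comparison below is our own illustration, done in exact rational arithmetic so the two sides agree exactly.

```python
from itertools import combinations
from fractions import Fraction as F

def nu_product_closed(mus, sigma2s):
    """Closed-form mean dimension of f(x) = prod_j g_j(x_j),
    where mu_j = E g_j and sigma_j^2 = Var g_j (Owen, 2003)."""
    num = sum(s / (m * m + s) for m, s in zip(mus, sigma2s))
    prod = F(1)
    for m, s in zip(mus, sigma2s):
        prod *= (m * m) / (m * m + s)
    return num / (1 - prod)

def nu_product_bruteforce(mus, sigma2s):
    """Direct ANOVA: sum |u| sigma_u^2 over u, divided by total variance,
    with sigma_u^2 = prod_{j in u} sigma_j^2 * prod_{j not in u} mu_j^2."""
    d = len(mus)
    total, weighted = F(0), F(0)
    for k in range(1, d + 1):
        for u in combinations(range(d), k):
            s2u = F(1)
            for j in range(d):
                s2u *= sigma2s[j] if j in u else mus[j] * mus[j]
            total += s2u
            weighted += k * s2u
    return weighted / total

mus = [F(1), F(2), F(1, 2)]
sigma2s = [F(1), F(3), F(1, 4)]
```

With all $\mu_j=0$ the closed form returns $d$, matching the remark above.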

We will use Lemma 1 below to compare the variances of our mean dimension estimators. We will need some additional moments. For a random variable $Y$ with mean $\mu$ and variance $\sigma^2>0$, define the skewness $\gamma=\mathbb{E}((Y-\mu)^3)/\sigma^3$ and the kurtosis $\kappa=\mathbb{E}((Y-\mu)^4)/\sigma^4-3$. Gaussian random variables have $\gamma=\kappa=0$.

###### Lemma 1.

Let $Y_1,Y_2,Y_3,Y_4$ be independent identically distributed random variables with variance $\sigma^2$ and kurtosis $\kappa$. Then

$$\begin{aligned}
\mathbb{E}\bigl((Y_1-Y_2)^4\bigr)&=(12+2\kappa)\sigma^4\\
\mathrm{Var}\bigl((Y_1-Y_2)^2\bigr)&=(8+2\kappa)\sigma^4\\
\mathbb{E}\bigl((Y_1-Y_2)^2(Y_3-Y_4)^2\bigr)&=4\sigma^4\\
\mathbb{E}\bigl((Y_1-Y_2)^2(Y_1-Y_3)^2\bigr)&=(6+\kappa)\sigma^4.
\end{aligned}$$
###### Proof.

These follow directly from independence of the $Y_i$ and the definitions of variance and kurtosis. ∎
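The four moment identities can be verified exactly by enumerating a small discrete distribution. Our check below takes $Y$ uniform on $\{-1,0,1\}$, a conveniently non-Gaussian case with $\kappa=-3/2$, and evaluates all expectations in exact rational arithmetic.

```python
from fractions import Fraction as F
from itertools import product

# test distribution: Y uniform on {-1, 0, 1}
vals = [F(-1), F(0), F(1)]
p = F(1, 3)
mean = sum(v * p for v in vals)                               # 0
sig2 = sum((v - mean) ** 2 * p for v in vals)                 # 2/3
kurt = sum((v - mean) ** 4 * p for v in vals) / sig2 ** 2 - 3  # -3/2

def E(g, n):
    """Exact expectation of g(y1, ..., yn) for n i.i.d. copies of Y."""
    return sum(g(*ys) * p ** n for ys in product(vals, repeat=n))

lhs1 = E(lambda a, b: (a - b) ** 4, 2)
lhs2 = E(lambda a, b: ((a - b) ** 2 - 2 * sig2) ** 2, 2)   # Var((Y1 - Y2)^2)
lhs3 = E(lambda a, b, c, e: (a - b) ** 2 * (c - e) ** 2, 4)
lhs4 = E(lambda a, b, c: (a - b) ** 2 * (a - c) ** 2, 3)
```

Each left-hand side equals the corresponding formula from Lemma 1 exactly.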

###### Theorem 1.

For the additive function $f_{\mathrm{A}}$ of (8),

$$\mathrm{Var}(\tilde\delta)=\mathrm{Var}(\hat\delta)=\mathrm{Var}(\ddot\delta)=\frac{1}{N}\sum_{j=1}^d\Bigl(2+\frac{\kappa_j}{2}\Bigr)\sigma^4_j \tag{10}$$

and

$$\mathrm{Var}(\check\delta)=\mathrm{Var}(\ddot\delta)+\frac{N-1}{2N^2}\sum_{j=1}^d(\kappa_j+2)\sigma^4_j, \tag{11}$$

where $\sigma^2_j$ and $\kappa_j$ are the variance and kurtosis of $g_j(x_j)$.
###### Proof.

The winding stairs results for $\check\delta$ and $\ddot\delta$ quoted above are proved in Theorem 3 of the Appendix. For the naive estimate, $\hat{\overline{\tau}}^2_j$ is independent of $\hat{\overline{\tau}}^2_k$ when $j\ne k$, as remarked upon at (4). For an additive function,

$$f_{\mathrm{A}}(x_i)-f_{\mathrm{A}}(x_{i,-j}{:}z_{i,j})=g_j(x_{ij})-g_j(z_{ij})$$

is independent of $g_k(x_{ik})-g_k(z_{ik})$ for $k\ne j$, and so the radial estimate has the same independence property as the naive estimate. Therefore

$$\mathrm{Var}(\hat{\overline{\tau}}^2_j)=\mathrm{Var}(\tilde{\overline{\tau}}^2_j)=\frac{1}{4N}\mathrm{Var}\bigl((g_j(x_{1j})-g_j(z_{1j}))^2\bigr)$$

and using Lemma 1, $\mathrm{Var}(\hat{\overline{\tau}}^2_j)=(2+\kappa_j/2)\sigma^4_j/N$. ∎

If $f$ is additive, then Theorem 1 shows that the radial method is better than the naive one. They have the same variance but the naive method uses roughly twice as many function evaluations. If the function is nearly additive, then it is reasonable to expect the variances to be nearly equal and the radial method to be superior. Because $\kappa_j\geqslant-2$ always holds, the theorem shows an advantage of disjoint winding stairs over plain winding stairs.

We turn next to functions of product form. To simplify some expressions for winding stairs we adopt the conventions that, for quantities $A_\ell$, $\prod_{\ell<j}A_\ell$ means $\prod_{\ell=1}^{j-1}A_\ell$, $\prod_{\ell>k}A_\ell$ means $\prod_{\ell=k+1}^{d}A_\ell$, and products over empty index sets equal one.

###### Theorem 2.

For the product function $f_{\mathrm{P}}$ of (9),

$$\mathrm{Var}(\hat\delta)=\frac{1}{N}\sum_{j=1}^d\sigma^4_j\Bigl(\Bigl(3+\frac{\kappa_j}{2}\Bigr)\prod_{\ell\ne j}\mu_{4\ell}-\prod_{\ell\ne j}\mu_{2\ell}^2\Bigr)\quad\text{and} \tag{12}$$

$$\mathrm{Var}(\tilde\delta)=\mathrm{Var}(\hat\delta)+\frac{1}{2N}\sum_{j<k}\Bigl(\eta_j\eta_k\prod_{\ell\notin\{j,k\}}\mu_{4\ell}-4\sigma^2_j\sigma^2_k\prod_{\ell\ne j}\mu_{2\ell}\prod_{\ell\ne k}\mu_{2\ell}\Bigr), \tag{13}$$

where $\mu_{2\ell}=\mathbb{E}(g_\ell(x_\ell)^2)$, $\mu_{4\ell}=\mathbb{E}(g_\ell(x_\ell)^4)$ and $\eta_j=\mathbb{E}\bigl(g_j(x_j)^2(g_j(x_j)-g_j(z_j))^2\bigr)$ for independent $x_j,z_j\sim P_j$. The winding stairs estimates satisfy

$$\mathrm{Var}(\ddot\delta)=\mathrm{Var}(\hat\delta)+\frac{1}{2N}\sum_{j<k}\Bigl(\eta_j\eta_k\prod_{\ell<j}\mu_{4\ell}\prod_{j<\ell<k}\mu_{2\ell}^2\prod_{\ell>k}\mu_{4\ell}-4\sigma^2_j\sigma^2_k\prod_{\ell\ne j}\mu_{2\ell}\prod_{\ell\ne k}\mu_{2\ell}\Bigr) \tag{14}$$

and

$$\mathrm{Var}(\check\delta)=\mathrm{Var}(\ddot\delta)+\frac{N-1}{2N^2}\sum_{j=1}^d(\kappa_j+2)\sigma^4_j\prod_{\ell\ne j}\mu_{2\ell}^2+\frac{N-1}{2N^2}\sum_{j<k}\Bigl(\eta_j\eta_k\prod_{j<\ell<k}\mu_{4\ell}\prod_{\ell<j}\mu_{2\ell}^2\prod_{\ell>k}\mu_{2\ell}^2-4\sigma^2_j\sigma^2_k\prod_{\ell\ne j}\mu_{2\ell}\prod_{\ell\ne k}\mu_{2\ell}\Bigr). \tag{15}$$
###### Proof.

The winding stairs results are from Theorem 4 in the Appendix. Next we turn to the naive estimator. For $x,z\sim P$ independently, define $\Delta_j=f_{\mathrm{P}}(x)-f_{\mathrm{P}}(x_{-j}{:}z_j)$. Now

$$\Delta_j=\bigl(g_j(x_j)-g_j(z_j)\bigr)\times\prod_{\ell\ne j}g_\ell(x_\ell)$$

and so $\mathbb{E}(\Delta_j^2)=2\sigma^2_j\prod_{\ell\ne j}\mu_{2\ell}$ and $\mathbb{E}(\Delta_j^4)=(12+2\kappa_j)\sigma^4_j\prod_{\ell\ne j}\mu_{4\ell}$, from Lemma 1. Therefore

$$\mathrm{Var}(\Delta^2_j)=(12+2\kappa_j)\sigma^4_j\times\prod_{\ell\ne j}\mu_{4\ell}-4\sigma^4_j\times\prod_{\ell\ne j}\mu_{2\ell}^2,$$

establishing (12) because $\mathrm{Var}(\hat{\overline{\tau}}^2_j)=\mathrm{Var}(\Delta^2_j)/(4N)$.

In the radial estimate, $\Delta_j$ is as above and $\Delta_k=(g_k(x_k)-g_k(z_k))\prod_{\ell\ne k}g_\ell(x_\ell)$. In this case, however, the same point $x$ is used in both and so $\mathbb{E}(\Delta_j^2\Delta_k^2)$ equals

$$\mathbb{E}\Bigl(g_j(x_j)^2g_k(x_k)^2\bigl(g_j(x_j)-g_j(z_j)\bigr)^2\bigl(g_k(x_k)-g_k(z_k)\bigr)^2\prod_{\ell\notin\{j,k\}}g_\ell(x_\ell)^4\Bigr)=\eta_j\eta_k\prod_{\ell\notin\{j,k\}}\mu_{4\ell}.$$

Then $\mathrm{Cov}(\Delta_j^2,\Delta_k^2)=\eta_j\eta_k\prod_{\ell\notin\{j,k\}}\mu_{4\ell}-4\sigma^2_j\sigma^2_k\prod_{\ell\ne j}\mu_{2\ell}\prod_{\ell\ne k}\mu_{2\ell}$, establishing (13). ∎

We comment below on interpretations of the winding stairs quantities. First we compare naive to radial sampling.

As an illustration, suppose that $g_j(x_j)=x_j$ with $x\sim\mathcal{N}(0,I_d)$. Then

$$\mathrm{Var}(\hat\delta)=\frac{1}{N}\sum_{j=1}^d(3^d-1)=\frac{(3^d-1)d}{N}$$

and, since this example has $\eta_j=4$, $\mu_{2\ell}=1$ and $\mu_{4\ell}=3$,

$$\mathrm{Var}(\tilde\delta)=\frac{d(3^d-1)}{N}+\frac{1}{2N}\sum_{j<k}\bigl(16\cdot3^{d-2}-4\bigr)=\frac{d(3^d-1)}{N}+\frac{d(d-1)(4\cdot3^{d-2}-1)}{N}.$$

For large $d$ the radial method has variance about $4d/9$ times as large as the naive method. Accounting for the reduced sample size of the radial method, it has efficiency approximately $9/(2d)$ compared to the naive method, for this function.

A product of mean zero functions has mean dimension $d$, making it an exceptionally hard case. More generally, if $\mu_j=0$ for $j=1,\dots,d$, then every covariance term in (13) is nonnegative, so $\mathrm{Var}(\hat\delta)$ is a sum of $d$ variance terms while $\mathrm{Var}(\tilde\delta)$ exceeds it by a sum of $d(d-1)/2$ nonnegative terms.

###### Corollary 1.

For the product function $f_{\mathrm{P}}$ of (9), suppose that $\kappa_j\geqslant0$ for $j=1,\dots,d$. Then $\eta_j\geqslant2\sigma^2_j\mu_{2j}-\mu_j^4$ for $j=1,\dots,d$, and so $\mathrm{Var}(\tilde\delta)\geqslant\mathrm{Var}(\hat\delta)$.

###### Proof.

It suffices to show that $\eta\geqslant2\sigma^2\mu_2-\mu^4$ when $\kappa\geqslant0$. Let $Y=g_j(x_j)$ have mean $\mu$, uncentered moments $\mu_2=\mathbb{E}(Y^2)$, $\mu_3=\mathbb{E}(Y^3)$ and $\mu_4=\mathbb{E}(Y^4)$ of orders $2$, $3$ and $4$, respectively, variance $\sigma^2$, skewness $\gamma$, and kurtosis $\kappa$. Now let $\eta=\mathbb{E}(Y^2(Y-Y')^2)=\mu_4-2\mu\mu_3+\mu_2^2$ for an independent copy $Y'$ of $Y$. This simplifies to

$$\eta=(\kappa+2)\sigma^4+2\mu\sigma^3\gamma+2\mu^2\sigma^2+2\sigma^4$$

and so

$$\eta-2\sigma^2\mu_2+\mu^4=(\kappa+2)\sigma^4+2\mu\sigma^3\gamma+\mu^4.$$

If $\mu=0$ then $\eta-2\sigma^2\mu_2+\mu^4=(\kappa+2)\sigma^4\geqslant0$ and so we suppose that $\mu\ne0$. Replacing $Y$ by $Y/\sigma$ does not change the sign of $\eta-2\sigma^2\mu_2+\mu^4$. It becomes $\kappa+2+2\mu_*\gamma+\mu_*^4$ for $\mu_*=\mu/\sigma$. If $\gamma$ and $\mu_*$ have equal signs, then $\kappa+2+2\mu_*\gamma+\mu_*^4>0$, so we consider the case where they have opposite signs. Without loss of generality we take $\mu_*>0>\gamma$. An inequality of Rohatgi and Székely (1989) shows that $\gamma^2\leqslant\kappa+2$ and so

$$\kappa+2+2\mu_*\gamma+\mu_*^4\geqslant\theta^2-2\mu_*\theta+\mu_*^4 \tag{16}$$

for $\theta=-\gamma\in(0,\sqrt{\kappa+2}\,]$. Equation (16) is minimized over $\mu_*>0$ at $\mu_*=(\theta/2)^{1/3}$, and so one last variable change to $\lambda=(\theta/2)^{1/3}$ gives

$$\kappa+2+2\mu_*\gamma+\mu_*^4\geqslant\lambda^4(4\lambda^2-3).$$

This is nonnegative for $\lambda^2\geqslant3/4$, and finally for $\lambda^2<3/4$ we may use $\kappa+2\geqslant2$ in place of $\kappa+2\geqslant\theta^2$, giving $\kappa+2+2\mu_*\gamma+\mu_*^4\geqslant2-2\mu_*\theta+\mu_*^4\geqslant2-3\lambda^4>0$. ∎

From the above discussion we can see that large kurtoses, and hence large values of the $\mu_{4\ell}$, create difficulties. In this light we can compare winding stairs to the radial sampler. The covariances in the radial sampler involve a product of $d-2$ of the $\mu_{4\ell}$. The winding stairs estimates involve products of fewer of those quantities. For disjoint winding stairs the $(j,k)$ covariance includes a product of only $d-1-(k-j)$ of them. The values $\mu_{4\ell}$ for $\ell$ nearest to $1$ and $d$ appear the most often and so the ordering of the variables makes a difference. For regular winding stairs some additional fourth moments appear in a second term.

## 5 Example: MNIST classification

In this section, we investigate the mean dimension of a neural network classifier that predicts a digit in $\{0,1,\dots,9\}$ based on an image of $784$ pixels. We compare algorithms for finding mean dimension, investigate some mean dimensions, and then plot some images of Sobol’ indices.

The MNIST data set from http://yann.lecun.com/exdb/mnist/ is a very standard benchmark problem for neural networks. It consists of 70,000 images of hand written digits that were size-normalized and centered within $28\times28$ pixel gray scale images. We normalize the image values to the unit interval, $[0,1]$. The prediction problem is to identify which of the ten digits ‘0’, ‘1’, $\dots$, ‘9’ is in one of the images based on the $784$ pixel values. We are interested in the mean dimension of a fitted prediction model.

The model we used is a convolutional neural network fit via tensorflow (Abadi et al., 2016). The architecture applied the following steps to the $28\times28$ input pixels in order:

1. a convolutional layer (with 28 kernels, each of size 3x3),

2. a max pooling layer (over 2x2 blocks),

3. a flattening layer,

4. a fully connected layer with 128 output neurons (ReLU activation),

5. a dropout layer (node values were set to 0 with probability 0.2), and

6. a final fully connected layer with 10 output neurons (softmax activation).

This model is from Yalcin (2018), who also defines those terms. The network was trained using 10 epochs of ADAM optimization, also described in Yalcin (2018), on 60,000 training images. For our purposes, it is enough to know that it is a complicated black box function of $784$ inputs. The accuracy on 10,000 held out images was 98.5%. This is not necessarily the best accuracy attained for this problem, but we consider it good enough to make the prediction function worth investigating.
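The six steps above can be written out in the tf.keras Sequential API. The following is our reconstruction from that list, in the style of the Yalcin (2018) tutorial; the optimizer and loss settings shown are reasonable assumptions rather than the authors' exact configuration.

```python
# Sketch of the six-layer architecture listed above (assumed settings noted).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(28, (3, 3), input_shape=(28, 28, 1)),  # 1. convolution
    tf.keras.layers.MaxPooling2D((2, 2)),                          # 2. max pooling
    tf.keras.layers.Flatten(),                                     # 3. flatten
    tf.keras.layers.Dense(128, activation="relu"),                 # 4. dense + ReLU
    tf.keras.layers.Dropout(0.2),                                  # 5. dropout
    tf.keras.layers.Dense(10, activation="softmax"),               # 6. softmax output
])

# Assumed training configuration matching "10 epochs of ADAM":
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10) would then train on the 60,000 images.
```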

There are $2^{784}-1$ nontrivial sets of pixels, each making their own contribution to the prediction functions, but the mean dimension can be estimated by summing only $784$ Sobol’ indices.

We view the neural network’s prediction as a function on input variables $x\in[0,1]^{784}$. For data $(x,y)$ where $y\in\{0,1,\dots,9\}$ is the true digit of the image, the estimated probability that the image shows digit $y$ is given by

$$f_y(x)=\frac{\exp(g_y(x))}{\sum_{\ell=0}^9\exp(g_\ell(x))}$$

for functions $g_\ell$, $\ell=0,1,\dots,9$. This last step, called the softmax layer, exponentiates and normalizes the functions $g_\ell$ that implement the prior layers. We study the mean dimension of $f_y$ as well as the mean dimensions of $g_y$. Studying the complexity of predictions via the inputs to softmax has been done earlier by Yosinski et al. (2015).
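The softmax step can be computed stably by subtracting $\max_\ell g_\ell(x)$ before exponentiating, which does not change the ratio. This small sketch is ours, not the authors' code:

```python
import math

def softmax_outputs(g):
    """f_y(x) = exp(g_y(x)) / sum_l exp(g_l(x)), computed stably by
    shifting by max(g); the shift cancels in the ratio."""
    m = max(g)
    exps = [math.exp(v - m) for v in g]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax_outputs([2.0, 1.0, -1.0])
```

The outputs lie strictly between 0 and 1 and sum to one, which is the range compression referred to below when comparing variances for $f_y$ and $g_y$.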

To compute mean dimension we need to have a model for $x$ with independent components. Real images are only on or near a very small manifold within $[0,1]^{784}$. We considered several distributions for the value of pixel $j$: $\mathsf{U}\{0,1\}$ (salt and pepper), $\mathsf{U}[0,1]$ (random gray), independent resampling from per pixel histograms of all images, and independent resampling per pixel just from images with a given value of $y$. We write $H_{j,y}$ for the histogram of values of pixel $j$ among images of digit $y$, and $H_j$ for the histogram over all of the images. Figure 3 shows some sample draws along with one real image. We think that resampling pixels from images given $y$ is the most relevant of these methods, though ways to get around the independence assumption would be valuable. We nonetheless include the other samplers in our computations.

Figure 3: From left to right: draws from $\mathsf{U}\{0,1\}^{28\times28}$, $\mathsf{U}[0,1]^{28\times28}$, margins of all images, margins of all 7s, an example 7.
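Resampling pixels independently from per pixel histograms can be sketched as follows. This is our own illustration, with a tiny stand-in for the $70{,}000\times784$ image array; drawing pixel $j$ from the empirical histogram of pixel $j$ amounts to picking a random source image separately for each pixel.

```python
import random

def resample_image(images, rng):
    """One synthetic image: pixel j is drawn independently from the
    empirical histogram of pixel j over the supplied images."""
    n_pix = len(images[0])
    return [images[rng.randrange(len(images))][j] for j in range(n_pix)]

rng = random.Random(0)
images = [[0.0, 0.5, 1.0],   # three toy "images" with three pixels each
          [0.2, 0.5, 0.8],
          [0.4, 0.5, 0.6]]
draw = resample_image(images, rng)
```

Restricting `images` to those of a single digit $y$ gives the $y$-specific sampler used below.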

Our main interest is in comparing the variance of estimates of $\delta$. We compared the naive method $\hat\delta$, the radial method $\tilde\delta$ and truncated winding stairs $\ddot\delta$. For $d=784$, our winding stairs algorithm changed pixels in raster order, left to right within rows, taking rows of the image from top to bottom. We omit plain winding stairs $\check\delta$ because we think there is no benefit from its more complicated model and additional correlations. Our variance comparisons are based on repeated independent samples.

Figure 4: The upper left histogram shows $\mathrm{Var}(\tilde\delta)/\mathrm{Var}(\hat\delta)$ for functions $g_y$ that exclude softmax. The upper right histogram shows $\mathrm{Var}(\tilde\delta)/\mathrm{Var}(\ddot\delta)$. The bottom two show the same ratios for functions $f_y$ that include softmax. The histograms include all 10 values of output $y$, and all 10 $y$-specific input histograms and the pooled input histogram.

Figure 4 shows the results for all output values $y$ and all input histogram distributions, with separate plots for functions $f_y$ that include softmax and $g_y$ that exclude it. The radial method always had greater variance than the naive method. For the functions $g_y$ it never had as much as twice the variance of the naive method, and so the radial method proves better for $g_y$. For $f_y$ there were some exceptions where the naive method is more efficient. In all of our comparisons the winding stairs method had lower variance than the radial method, and so for these functions, (truncated) winding stairs is clearly the best choice.

Figure 4 is a summary of many different variance estimates. We inspected the variances and found two more things worth mentioning but not presenting. The variances were all far smaller using softmax than not, which is not surprising since softmax compresses the range of $f_y$ to be within $[0,1]$, which will greatly affect the squared differences that go into estimates of $\delta$. The variances did not greatly depend on the input distribution. While there were some statistically significant differences, which is almost inevitable for so many comparisons, the main practical difference was that variances tended to be much smaller when sampling pixels from images of the digit 1. We believe that this is because images of the digit 1 have much less total illumination than the others.

While our main purpose is to compare estimation strategies for mean dimension, the mean dimensions for this problem are themselves of interest. Table 1 shows mean dimensions for functions $f_y$ that include softmax, as estimated via winding stairs. For this we used different sample sizes when resampling from image histograms than for the other input distributions. The first thing to note is an impossible estimate of $\nu(f)$ for binary and uniform sampling. The true value cannot be larger than $d=784$. The function in question has tiny variance under those distributions, and recall that $\nu(f)=\delta/\sigma^2$. Next we see that moving from binary to uniform to the combined histogram generally lowers the mean dimension. Third, for the $y$-specific histograms we typically see smaller mean dimensions for $f_y$ with the same $y$ that was used in sampling. That is, the diagonal of the lower block tends to have smaller values.