1 Introduction
Since the popularization of deep neural networks in the early 2010s, tailoring neural network architectures to specific tasks has been one of the main sources of activity for both academics and practitioners. Accordingly, a palette of empirical methods has been developed for automating the choice of neural network hyperparameters (a process sometimes called Neural Architecture Search), including – but not limited to – random search [2, 1][16, 13] and Bayesian methods [24, 12][29]. However, when the computational requirements for training a single model are high, such approaches might be too expensive or result in iteration cycles that are too long to be practically useful – though some work in that direction has been carried out recently [5, 14]. In other cases, when the loss function is only used as a proxy for the task at hand [25, 26, 10] or is not interpretable [8], a further perceptual evaluation is typically necessary to evaluate the quality of a model’s outputs and such systematic approaches at least partially break down. In both cases, an efficient and quantitative method to analyze and compare neural network architectures would be highly desirable – be it only to come up with a limited set of plausible candidates to pass on to the more expensive (or manual) methods.
In this paper, we introduce the notion of capacity allocation analysis, which is a systematic, quantitative and computationally efficient way to analyze neural network architectures by quantifying which dependencies between inputs and outputs a parameter or a set of parameters actually models. We develop a quantitative framework for assessing and comparing different architectures for a given task, providing insights that are complementary to the value of the loss function itself.
In this paper, we develop the theory in linear settings, where both models and data are linearized. Linearizing neural networks might be regarded as the inverse of “neuralizing” linear models: while the intuition for the latter is to augment the desirable properties of some well-known (often linear) method with the expressivity [19, 18, 9] of deep and nonlinear neural networks, linearizing neural networks provides a way to quantify some of the properties that might be characteristic of a given architecture through theoretical analysis. In some way, both approaches make the leap of faith that some properties remain more or less valid independently of the complexity of the data and the expressivity of the model (in particular, its degree of nonlinearity).
We focus more particularly on spatial capacity allocation, with the following intuition: a network’s spatial architecture (i.e. whether it uses fully connected, recurrent, convolutional, dilated layers, etc.) tends to define its capacity allocation across the input space, while its complexity (its nonlinearities, number of channels, etc.) tends to define the complexity of the dependencies that it can model. As hinted above, we mostly set aside the latter for now and choose to focus on the spatial aspect of neural network architectures by considering linearized versions of the models we wish to analyze. How much the two dimensions of the problem can be disentangled remains to be understood, but this is the leap of faith that we are willing to make here.
Our work is related to a posteriori analysis of trained models (also sometimes called network inspection), which has been the object of many recent studies [23, 27, 28, 22], as a way to peek into the neural black boxes. The goal of most of this literature is to analyze a network’s activations, to understand which properties of the input lead to or correlate with a given behaviour – for example, a final classification decision or simply the activation of a particular unit in the network. Such methods however differ from the present one in that they are mostly example-based rather than intrinsic, i.e. they mostly make sense on particular instances of the input rather than in general. In contrast, we are looking for an objective, quantitative and computationally efficient way to compare model architectures (rather than specific instances of such architectures that are obtained after expensive training) to provide grounds for principled network design.
We start by providing some guiding intuitions in Section 2 before formally introducing the concept of capacity allocation in linear systems in Section 3. We introduce the notion of spatial capacity allocation by showing that capacity can be broken down along subspaces of the input space, and show that capacity can be used to provide statistical upper bounds on the model error. We also introduce the notion of conditional capacity, which allows one to study the conditional influence of subsets of constraints. Section 4 then applies capacity allocation analysis in the case of linear(ized) models and linear tasks (e.g. Gaussian process prediction tasks). We show that the total model capacity at some optimal state corresponds to its effective number of parameters, and that it can be broken down across its input space and its parameters (or layers). Section 5 illustrates the theory on two common types of architecture – hierarchical and recurrent – and presents several insights in both cases. Finally, Section 6 considers how capacity allocation might translate to nonlinear settings.
2 Guiding intuitions
Our end goal is to define some extensive property which characterizes a model’s modelling capacity, and to use it to perform comparisons between models in a meaningful and objective way for a given task. (In physics, an extensive property is a property which is additive for subsystems, while an intensive property is a property that does not depend on the system size.) We wish to be able to break down this quantity (which we call the capacity, denoted ) across various dimensions or subspaces of the input space – for example, the spatial dimensions – to quantify how much of this capacity is allocated for each subspace: this is the spatial capacity allocation alluded to above.
The appeal of such a quantity is particularly salient when the loss function is only a proxy for the task one wants to achieve – which is often the case with generative models. In that case, minimizing the loss might result in spurious behaviour which is suboptimal from the point of view of the task. Such considerations must therefore be taken into account earlier in the design process. One example, which has been one motivation for the present work, is the task of artificial music generation using autoregressive models [25], which are usually trained in a 1-step-ahead fashion, hoping that long term dependencies will incidentally be captured to produce a musical output – rather than mere babbling. Such a tradeoff between audio quality and structure has been described in [4], whose authors point out that lower audio quality might be the price to pay to be able to capture long term structure – so they voluntarily limit the former to gain on the latter. One goal of the present theory is to make sense of such observations, and guide network design in a principled way: in the terms of capacity allocation analysis, one would want in their case to allocate enough capacity to remote inputs rather than focussing on the recent past, even though this might be suboptimal for the loss considered. At this stage, this is just a construction of the mind, but our goal is to make this intuition quantitative.
Figure 1 shows some fictitious spatial capacity curves for three different models, whose task is to make some prediction from inputs represented on the horizontal axis (this can be thought of as predicting the next sample of a 1-dimensional autoregressive process). Each model has a fixed intrinsic capacity (the area under the curve), which is being allocated spatially in a way that might be specific to the task at hand (i.e. the joint distribution of the input and output variables). For the same task, different architectures might focus on different dependencies – in the example, Model 1 focuses on the far right part of the space (i.e. the most recent past, in the autoregressive case), while Model 2 looks further to the left and Model 3 looks uniformly at the whole input space. In the context of the music generation example mentioned above, one might thus prefer Models 2 or 3 over Model 1, as they are more likely to capture long term structure. As we will see, this pattern of excessive capacity allocation to the recent past is in fact typical of time series prediction, and one major challenge associated with multiscale tasks such as audio modelling.
More generally, capacity analysis might be used (i) to tailor architectures to specific needs, as in the case just mentioned, or (ii) to simply analyze and compare architectures a posteriori to get a better understanding of what they achieve (see Appendix C for an example in the context of Wavenets).
3 Capacity allocated to a subspace
Before defining capacity allocation in the context of linear models (which will be the topic of Section 4), we start by defining the concept more generally for linear systems.
3.1 Total capacity
Let us consider a linear system, i.e. a set of linear orthonormal constraints:
(1) 
The orthonormality requirement means that the columns of (i.e. the constraints) are orthonormal vectors in . In this linear setting, we will call the number of (orthogonal) constraints, the total capacity . Naturally, each additional constraint decreases the dimension of the space of possible values that can take. Note that in a space of dimension , the subspace that satisfies independent linear constraints has a dimension , often called the number of degrees of freedom in statistics [3]. (We prefer to use the concept of capacity rather than degrees of freedom, as this concept will be better suited to analyze neural networks later on.) At full capacity (equivalently, degrees of freedom), is fully constrained and is equal to .
3.2 Spatial capacity
We now want to be more specific and define the notion of capacity allocated to a subspace, to quantify how many constraints are being applied along a given subspace of the input space.
Definition 1.
Let be a vector subspace of of dimension . Let be an orthonormal basis of , and the orthogonal complement of , defined by the set of points that satisfy the linear constraints . Let be a vector subspace of with orthonormal basis . We define the capacity allocated to subspace by , noted as the squared Frobenius norm of the matrix (the square is what makes Properties 2 and 3 below hold):
(2) 
For simplicity, when there is no ambiguity regarding the set of constraints, we will omit the subscript and use the notation . The capacity allocated to a subspace has a number of convenient and intuitive properties, which we detail below:
Property 1.
Since the Frobenius norm is rotation- and permutation-invariant, the capacity allocated to a subspace by does not depend on the particular orthonormal bases that are chosen to represent and . The capacity is therefore an intrinsic property of and .
Property 2.
If , then the orthonormality of gives : the capacity allocated to the whole space is equal to the number of independent constraints .
Property 3.
The capacity allocated to the vector subspace generated by two orthogonal vector subspaces is equal to the sum of their allocated capacities: . In particular, if is an orthonormal basis of , then the sum of the capacities allocated to each subspace is equal to the capacity allocated to the whole space:
(3) 
Property 4.
The capacity allocated to the 1-dimensional subspace , where belongs to the set of constraints, is equal to 1. The capacity allocated to the subspace , where is orthogonal to the set of constraints, is equal to 0.
Basically, the capacity allocated to a given subspace represents the number of independent constraints that are being used to constrain the projection of onto that subspace. For every vector in the space of constraints, the above property means that exactly one constraint is being used to enforce the constraint . For every vector orthogonal to the set of constraints, the projection is unconstrained (it uses 0 constraints).
For a given orthonormal basis of , the respective capacities represent the respective numbers of independent constraints (between 0 and 1) that are being used to constrain along each axis. These capacities sum to the number of free parameters in the model. In short, the notion of capacity allocated to subspaces allows us to break down how the constraints are being allocated with respect to a given partition of the space.
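As a minimal numerical sketch of Definition 1 and Properties 2 to 4, assume that the constraints are stored as the orthonormal columns of a matrix `K` and the subspace basis as the orthonormal columns of a matrix `S`. These names, and the helper `capacity`, are illustrative choices rather than notation taken from the text, and the capacity is computed as a squared Frobenius norm, which is the convention under which the properties above hold.

```python
import numpy as np

def orthonormal_columns(a):
    """Return an orthonormal basis of the column space of `a` (reduced QR)."""
    q, _ = np.linalg.qr(a)
    return q

def capacity(K, S):
    """Capacity allocated to span(S) by the constraints K.

    K: (n, m) matrix with orthonormal columns (the constraints).
    S: (n, d) matrix with orthonormal columns (a basis of the subspace).
    """
    return float(np.sum((K.T @ S) ** 2))

rng = np.random.default_rng(0)
n, m = 6, 3
K = orthonormal_columns(rng.normal(size=(n, m)))

# Property 2: the whole space receives a capacity equal to the number of constraints.
total = capacity(K, np.eye(n))

# Property 3: capacities along an orthonormal basis add up to the total.
per_axis = np.array([capacity(K, np.eye(n)[:, [i]]) for i in range(n)])

# Property 4: a constraint direction gets capacity 1, an orthogonal direction gets 0.
along_constraint = capacity(K, K[:, [0]])
Q_full, _ = np.linalg.qr(K, mode="complete")
orthogonal_dir = Q_full[:, [m]]  # unit vector orthogonal to every column of K
along_orthogonal = capacity(K, orthogonal_dir)
```

Each entry of `per_axis` lies between 0 and 1, and can be read as the fraction of a constraint spent on the corresponding axis.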
To give some flesh to the above ideas, let us jump ahead and anticipate Section 4, where will typically represent the difference between a linear model’s effective AR coefficients and the true AR coefficients of some Gaussian process. Each of the model parameters will give rise to exactly one (linear) constraint through the chosen optimality criterion. If the model allocates 1 free parameter to enforce a constraint along a given dimension, then the modelling error along that dimension will be zero. Analyzing the capacity allocated per dimension will thus allow us to understand which components of the AR coefficients are being captured, and to what extent. By definition, a linear model with a number of free parameters that matches the dimension of the input space (capacity , or equivalently number of degrees of freedom ) will be able to reproduce exactly all the true coefficients as the error will be fully constrained, while an underparametrized model (, ) will have to allocate its scarce resources to a larger number of coefficients – and choose which ones to put more capacity on.
3.3 Statistical bounds on errors
Given some capacity allocation along a given subspace, can we derive bounds on the errors along that same subspace? (We are again jumping ahead: by using this terminology, we assume that will correspond to errors with respect to the true model, such that the ideal state is , cf. Section 4.) Let us consider the constraint with , and a complementary basis of such that . A given vector of errors can be written as:
(4) 
Let us define the error along some subspace of dimension as , where is again an orthonormal basis, as well as the average squared error:
(5) 
where the expectation is taken over the distribution of ’s. (We will see later that the values of are constrained by the model space. Thus averaging over can be seen as “averaging over model spaces”.) This expectation is not tractable in general (and is meaningless in general without further assumptions), but by assuming some symmetry in (such that ’s are i.i.d. of variance ), some elementary manipulations then give:
(6)
This means that the squared error along a given subspace of dimension is statistically bounded by the dimension of that subspace minus the capacity allocated by the model to that subspace. If the model allocates a full capacity to a subspace, and therefore : that subspace is perfectly modelled.
This calculation only gives a statistical order of magnitude of the errors, for a fixed capacity allocation. In real settings, model spaces often have specific structures rather than “average” ones, such that the errors can differ greatly from the statistical bound. In particular, some true coefficients are often vanishingly small in practice, which leads to vanishingly small errors in spite of a zero or near-zero capacity allocation. Eq. (6) should therefore be understood as a statistical upper bound, rather than viewed as a general equality. In fact, this is precisely what makes the capacity theory appealing, as it quantifies the modelling capacity allocated to modelling some given dependencies (or conversely the degree of freedom allowed), regardless of the realized (and idiosyncratic) errors, which are input- and model-space-dependent.
There is however one case where the equality holds exactly, i.e. where it is not necessary to take the expectation in Eq. (5) to get closed-form results. Indeed when , there is only one element in the sum in Eqs. (4) and (6) and one can write directly:
(7) 
In that case, the relative squared errors along subspaces are exactly the complements of the corresponding capacities.
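The exact case can be checked numerically with a short sketch. It assumes that the condition above corresponds to a single unconstrained direction (one fewer constraint than the dimension of the space), and it uses illustrative names (`K`, `K_perp`, `e`) for the constraints, the complementary basis and the error vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 4  # assumed exact case: m = n - 1, one free direction remains

# Orthonormal constraints K and the complementary orthonormal basis K_perp.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
K, K_perp = Q[:, :m], Q[:, m:]

# Any error satisfying K^T e = 0 is a combination of the columns of K_perp.
a = rng.normal(size=n - m)
e = K_perp @ a

# Capacity allocated to each natural axis (sum of squares of each row of K).
capacity_per_axis = np.sum(K ** 2, axis=1)

# Relative squared error along each natural axis.
relative_sq_error = e ** 2 / np.sum(e ** 2)

# Rows of the orthogonal matrix Q have unit norm, hence
# relative_sq_error[i] = K_perp[i, 0]**2 = 1 - capacity_per_axis[i].
```

With more than one free direction the same identity only holds in expectation, under the i.i.d. assumption on the coefficients stated above.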
3.4 Conditional capacity
The concept of capacity is analogous to that of probabilities, in that this is an object that can only be defined jointly for a set of constraints. In particular, the sum of the capacities allocated by two sets of constraints is not equal in general to the capacity allocated by the direct sum of these two constraint spaces, unless these constraint spaces are orthogonal (this is akin to independence in probabilities):
Property 5.
Let and be two nontrivial spaces of constraints, and . Then the following equivalence holds:
(8) 
The proof of the above equivalence is presented in Appendix A. We therefore define the conditional capacity allocated by a set of constraints given a set of constraints :
Definition 2.
Let and be two spaces of constraints. The conditional capacity allocated by to a vector subspace given another space of constraints , noted , is defined as:
(9) 
This quantifies the additional capacity that brings over . If for example, the conditional capacity is equal to zero as no new constraints are being added. As in probabilities, this definition gives rise to various properties, among which the following identity on chains of conditional capacities:
Property 6.
Let be spaces of constraints and . Then the following holds:
(10) 
Note that this is akin to the chain rule of probabilities. This will later allow us to decompose a model’s capacity into the (conditional) capacities of each of its parameters, or each of its layers (see e.g. Section 5.1.4).
4 Capacity applied to linear models
In the previous section, we have defined the concept of capacity in the abstract. Here, we show how it applies to the case of trained linear models, by allowing one to determine a posteriori what dependencies a given model has captured once it has reached a (locally) optimal state. More precisely, we want to determine what part of the input space the model has focused its modelling capacity on, by determining which components are tightly fixed and which ones are free to vary – in a quantitative manner. Since one of the main tasks of model architecture design is to impose which dependencies between its inputs and outputs the model should try to capture, this framework should be useful for approaching the task in a more principled way.
As we will see, one can map a model’s parameters to a corresponding set of linear constraints, such that the capacity of the model, defined as the capacity of its associated set of constraints, is equal to its number of free parameters. For a given subspace of the input space, the model’s capacity allocated to then quantifies how many free parameters it allocates for reproducing dependencies along that subspace. In particular, this will allow us to define a model’s spatial capacity allocation, as its capacity allocation along the natural dimensions of the input space.
4.1 Models manifold
Let us start by defining some terminology related to linear models, which will be useful throughout the rest of the paper. We consider linear models with 1-dimensional outputs:
(11)  
The model space (i.e. the ensemble of possible values of ) is defined by some parametrization:
(12)  
which defines the space of models as from a parameter space .
The components of are called the model parameters, while the components of are called the model coefficients. Typically, the space of models will be a dimensional manifold where represents the number of effective parameters of the model (aka the total model capacity). We also define the space of errors with respect to some target model as:
(13) 
and accordingly we will note the model error .
4.2 Optimization program
Assume that one tries to learn some target model by minimizing some quadratic loss over some model space :
(14)  
or in the parameter space,
(15) 
For example, one might be trying to predict the next sample of some Gaussian process with lag covariance matrix using a linear model with a receptive field of size , parametrized by (the expression for the optimal model in that case is provided in Appendix B.2). In the case where the model space is a dimensional manifold (i.e. parametrized by independent parameters), selecting one model in (or equivalently, one error in ) requires imposing independent constraints on the system, which will stem from the optimality criterion. An optimal model therefore:
(i) belongs to the space of models parametrized by ,
(ii) satisfies a set of orthonormal linear constraints imposed by the optimality criterion (which are task-dependent).
The first condition is imposed by the parametrization of the model space (for example some linearized neural network architecture), while the second condition describes the tradeoffs that the model has to make when modelling the input space – i.e. which dependencies to focus on when allocating its parameters.
The intersection of the errors manifold (of dimension ) with the orthogonal of the constraints subspace, (of dimension ), will then give us a set of locally optimal errors, and therefore a set of locally optimal models. If the optimization program has only one local minimum equal to the global minimum, the intersection will be reduced to the singleton containing the optimal error: . A graphical representation is shown in Fig. 2 for a 2dimensional input space and a 1dimensional model manifold.
Our goal below will be to determine the set of orthonormal linear constraints derived from the above optimization program. We will then be able to perform a capacity analysis of the (locally) optimal model using the tools introduced in Section 3 by considering the constraints space .
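To make the optimization program concrete, here is a minimal sketch for the Gaussian process example, under assumptions of our own: a unit-variance stationary process with autocorrelation `rho ** lag`, a mean squared error loss, and an unrestricted linear model over the last `n` inputs. The optimal model is obtained from the standard normal equations, which may differ in form from the expression in Appendix B.2.

```python
import numpy as np

n, rho = 8, 0.9

# Covariance of the last n inputs, ordered from oldest (index 0) to most recent.
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma = rho ** lags.astype(float)

# Covariance between each input and the next sample (lags n, n-1, ..., 1).
c = rho ** np.arange(n, 0, -1).astype(float)

# Optimal unrestricted linear predictor (normal equations).
w_star = np.linalg.solve(Sigma, c)

def loss(w):
    """Mean squared prediction error of the linear model w."""
    return 1.0 - 2.0 * w @ c + w @ Sigma @ w

# For any other model, the excess loss is a quadratic form in the model error.
w = w_star + 0.1 * np.ones(n)
err = w - w_star
excess = err @ Sigma @ err
```

For this particular autocorrelation, `w_star` puts all its weight on the most recent input, and `loss(w) - loss(w_star)` equals `excess`, which is the quadratic structure that the constraints are derived from.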
4.3 Constraints subspace at a locally optimal state
Let be a set of parameters that achieves a local optimum of in Eq. (15). Then the following relations hold at :
(16)  
One can find an orthonormal basis of (the vector space generated by the columns of ) using the factorization of the Gram matrix , where is a rotation matrix and is a positive diagonal matrix with nonzero diagonal values, which we call the capacity weighting matrix (note that its number of nonzero eigenvalues is equal to the number of effective parameters , which is a convenient method to compute ). Then, define as the matrix containing the columns of that correspond to the nonzero eigenvalues. The above relations are then equivalent to:
(17)
where the columns of are orthonormal vectors and . These constraints determine which coefficients of the (locally) optimal model are tightly imposed at the optimal point (number of degrees of freedom per dimension close to 0, or equivalently allocated capacity per dimension close to 1), and which ones are virtually free to vary (number of degrees of freedom per dimension close to 1, or equivalently allocated capacity per dimension close to 0).
For a given subspace with orthonormal basis , we can therefore compute the corresponding capacity allocated by the model according to Definition 1, as:
(18) 
One particularly interesting partition of the space we will consider below is the partition according to the natural basis of , which will allow us to perform a spatial capacity analysis of our models, i.e. to analyze their capacity allocation along the spatial dimensions of the input space, for a range of model architectures and input distributions. Another interesting study would be to perform a frequency analysis along Fourier components.
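The procedure of this section can be sketched numerically as follows. The sketch assumes a loss of the form e^T Sigma e in the model error e, so that stationarity with respect to the parameters reads J^T Sigma e = 0, where J is the Jacobian of the model coefficients with respect to the parameters. The names `constraint_basis`, `J`, `Sigma` and the tolerance `rel_tol` are illustrative, and the toy model (free coefficients on the three most recent inputs only) is hypothetical.

```python
import numpy as np

def constraint_basis(J, Sigma, rel_tol=1e-10):
    """Orthonormal basis of the constraints implied by stationarity.

    Assumes the constraints are the columns of Sigma @ J, and keeps the
    eigenvectors of their Gram matrix with numerically nonzero eigenvalues.
    """
    A = Sigma @ J
    eigvals, eigvecs = np.linalg.eigh(A @ A.T)   # symmetric eigendecomposition
    keep = eigvals > rel_tol * eigvals.max()     # drop numerically zero values
    return eigvecs[:, keep], eigvals[keep]

n, rho = 8, 0.9
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma = rho ** lags.astype(float)

# Hypothetical linear model: one free coefficient per input, last 3 inputs only.
J = np.eye(n)[:, -3:]

K, weights = constraint_basis(J, Sigma)
total_capacity = K.shape[1]                    # number of effective parameters
capacity_per_input = np.sum(K ** 2, axis=1)    # capacity along each natural axis
```

For a nonlinear parametrization, `J` would be evaluated at the (locally) optimal parameters, and the same two quantities would be read off `K`.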
5 Examples
We now illustrate the theory above on two types of architecture that are popular for modelling 1-dimensional data with long-range dependencies: hierarchical models and recurrent models. In both cases, we will consider the task of predicting the next sample of a Gaussian process with autocorrelation process (equivalently, its autocorrelation matrix where ), from its last inputs. The exact solution and the associated optimal variance for this problem are given in Appendix B.
5.1 Hierarchical models
Hierarchical models have become popular since the introduction of Wavenets [25, 7, 17] for modelling audio signals, which are one prime example of 1D signals with long-range dependencies. Indeed, audio signals typically have tens of thousands of samples per second in order to cover the full spectrum of our auditory perception. In order to capture such long-range dependencies while keeping a manageable number of parameters and reasonable memory requirements, the authors of [25] have introduced Wavenets, which use a hierarchical architecture based on dilated convolutions with exponentially growing dilation rate, resulting in a receptive field that grows exponentially in the number of layers. In this section, we investigate simplified linearized versions of such hierarchical models using the tools introduced in the previous sections, to see what properties of the input space they capture – and what they focus their capacity on.
5.1.1 Model definition
The class of hierarchical linear models we consider here are the models of the form:
(19) 
where denotes the convolution operator. Each layer consists of a filter of size and dilation rate , where is the number of channels of layer and is the spatial extent of the filters at that layer. An example with is represented in Fig. 3.
In this particular case where and , the space of models is parametrized as:
(20) 
where the total receptive field of the model is . The above parametrization is such that every coefficient can be written as a product of coefficients, one from each layer. Note that in this case the mapping from parameters to models is affine in each of its inputs, since parameters are not shared across layers.
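The product structure of the coefficients can be illustrated with a short sketch, assuming a single channel, filters of size 2 and a dilation rate of 2**l at layer l (these values, and the helper names, are illustrative). Composing the dilated convolutions gives the same coefficients as a Kronecker product of the layer filters.

```python
import numpy as np
from functools import reduce

def dilate(filt, rate):
    """Insert rate - 1 zeros between the taps of a 1-D filter."""
    out = np.zeros((len(filt) - 1) * rate + 1)
    out[::rate] = filt
    return out

def effective_filter(layers):
    """Compose single-channel dilated convolutions (dilation 2**l at layer l)."""
    return reduce(np.convolve, [dilate(f, 2 ** l) for l, f in enumerate(layers)])

rng = np.random.default_rng(2)
layers = [rng.normal(size=2) for _ in range(3)]   # 3 layers, 2 taps each

# Effective coefficients over a receptive field of 2**3 = 8 inputs.
w_conv = effective_filter(layers)

# Same coefficients as a Kronecker product, most dilated layer first:
# each coefficient is a product of one tap per layer.
w_kron = reduce(np.kron, reversed(layers))
```

The sketch does not fix the orientation of the filter (which end corresponds to the most recent input), which depends on the convolution convention used.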
5.1.2 Capacity analysis
The space of models defined by Eq. (20) is quite complex and finding closed form solutions is not an easy task, therefore we use numerical optimization over the parameters to find the optimal solution to Eq. (15). We can then perform a capacity analysis of the optimal model according to the theory developed in the previous sections, and compare the optimal variance to the theoretical lower bound (note that they are related via the loss function through the relation ).
Fig. 4 plots the total capacity of hierarchical models with parameters and a variable number of channels as a function of their total number of parameters. (The number of effective parameters can be easily computed for and . In that case, it is equal to , whereas the total number of parameters is .) Although the total number of parameters scales quadratically with the number of channels, the number of effective parameters scales more slowly until it reaches the upper boundary , where the whole space becomes accessible and the exact model can be attained (note that the equality is exact, as is by definition an integer). As the graph on the right shows, the loss decreases as a power law of the number of parameters, then saturates when it reaches the loss obtained for a reference model with one parameter per input (i.e. a fully connected model). Interestingly, the transition happens beyond the point where is first reached, due to numerical errors: the optimization process seems to be more efficient when the model is overparametrized.
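The gap between the total and the effective number of parameters can be illustrated on a toy case, assuming a single-channel hierarchical model with filters of size 2 whose coefficients are a Kronecker product of the layer filters. The number of effective parameters is estimated here as the numerical rank of the Jacobian at a random parameter value; all names are illustrative.

```python
import numpy as np
from functools import reduce

def model(layers):
    """Coefficients of a single-channel hierarchical model (2 taps per layer)."""
    return reduce(np.kron, reversed(layers))

def jacobian(layers):
    """Jacobian of the coefficients with respect to every filter tap.

    The model is linear in each tap, so each column is the model evaluated
    with that layer's filter replaced by a one-hot vector.
    """
    cols = []
    for l in range(len(layers)):
        for tap in range(2):
            one_hot = np.zeros(2)
            one_hot[tap] = 1.0
            perturbed = layers[:l] + [one_hot] + layers[l + 1:]
            cols.append(model(perturbed))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(3)
num_layers = 4
layers = [rng.normal(size=2) for _ in range(num_layers)]

J = jacobian(layers)
num_parameters = J.shape[1]                         # 2 taps per layer
num_effective = int(np.linalg.matrix_rank(J))       # dimension of the model manifold
```

In this toy case the rank comes out at `num_layers + 1` rather than `2 * num_layers`, because the overall scale of each layer can be absorbed by the other layers.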
The spatial capacity allocation along the natural basis of the input space for the same models is shown in Fig. 5. The left plot represents the eigenvalues of the capacity weighting matrix defined in Section 4.3, and whose number of nonzero values corresponds to the model capacity . Because the optimization process has a finite precision, typically doesn’t have any zero eigenvalues, but in practice it is often possible to separate small but genuinely nonzero eigenvalues from noisy “zero” eigenvalues. (In particular, noise-induced nonzero eigenvalues are typically symmetric around zero, whereas genuinely nonzero eigenvalues are always positive. The distribution of negative eigenvalues can therefore be used to find the scale of the noise on the positive half-space.) The plot in the middle represents the capacity per natural input dimension, which we also call the capacity per input (CPI), and which corresponds to the number of parameters that the model dedicates to modelling direct dependencies on a given input. It is defined as the set where is the orthonormal basis of the space of constraints and is the one-hot vector corresponding to the input at distance . In this example, more capacity is allocated to the recent past (e.g. , on the very right) than to the distant past (e.g. , on the very left). As the total capacity increases with the number of channels, so does the spatial capacity per input dimension. Notably, as the capacity increases, the shortest range dependencies are fully modelled first. Longer range dependencies are only allocated capacity once shorter dependencies are modelled. Finally, the plot on the right shows the capacity allocated along the eigenvectors of the covariance matrix, which often shows a cleaner pattern but doesn’t allow for a spatial interpretation.
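One possible way to implement the separation between genuine and noisy eigenvalues is sketched below. It estimates the noise scale from the largest negative eigenvalue (in absolute value) and keeps the positive eigenvalues that exceed it by some margin. The function name and the `safety_factor` parameter are hypothetical, and the eigenvalues in the example are made up for illustration.

```python
import numpy as np

def count_effective_parameters(eigvals, safety_factor=10.0):
    """Count eigenvalues that stand clearly above the numerical noise.

    The noise scale is taken from the negative eigenvalues, which can only
    come from finite precision since the matrix is positive semi-definite.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    negative = eigvals[eigvals < 0]
    noise_scale = np.abs(negative).max() if negative.size else 0.0
    return int(np.sum(eigvals > safety_factor * noise_scale))

# Made-up spectrum: three genuine eigenvalues, four noise-level ones.
spectrum = [5.0, 1.0, 1e-3, 2e-9, -1e-9, 3e-10, -4e-10]
capacity_estimate = count_effective_parameters(spectrum)
```

A larger `safety_factor` is more conservative, and the choice matters mostly when genuine eigenvalues come close to the noise floor.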
5.1.3 Errors vs. capacity
We can compare the squared errors along the input space dimensions (i.e. the AR coefficients) with the bound from Eq. (6), to evaluate the relationship empirically (with along each input dimension). Figure 6 shows the average (solid line) as well as the standard deviation (colored area) across many runs of the optimization process for randomly initialized models with . Since the relationship between the capacity bound and the squared errors is defined up to some constant, both have been normalized to sum to 1. The figure confirms qualitatively the relationship from Eq. (6): . The relationship appears to hold quite accurately along the covariance components – better than along the input dimensions. In general, the capacity bound is much more stable across runs than the realized squared errors, which makes it a good candidate for analyzing an architecture in a more intrinsic way.
5.1.4 Further analysis
One conclusion from the spatial capacity analysis conducted in Figure 5 is that the hierarchical structure tends to focus the model capacity on short range dependencies for the process considered, at the expense of long range structure. Could we dissect this behaviour layer by layer?
We first use the conditional capacity defined in Section 3.4 to evaluate the contribution of each layer to the total model capacity, for a hierarchical model with . Figure 7 shows the chained conditional capacity contributions of the model layers, illustrating Property 6. Such analysis requires choosing an arbitrary order for the layers: here we compare the forward order (i.e. starting from the lowermost layer) with the backward order (starting from the uppermost layer). The figures show that most of the short-term capacity allocation is realized by the few lowermost layers, while most of the long-term capacity allocation is realized by the uppermost layers – as expected. Put differently, the lowermost layers end up allocating their capacity for reproducing the short-term dependencies – because they can, and because it is optimal for the prediction task. In light of these results, it seems unlikely for example that such a model trained on audio will learn wavelet-like filters (which would be uniformly useful across the input space). Instead, its modelling capacity will be allocated to extracting signal from the recent past, insofar as possible. In the context of audio modelling, this encourages the use of two-scale models, with one part of the model trying to capture short-term dependencies, and the other part trying to capture more universal features.
Figure 8 finally quantifies how each layer behaves individually, in two ways: (i) by analyzing its capacity allocation independently from all other layers (i.e. as if all other layers’ parameters were constants), and (ii) by analyzing its marginal contribution to the total model’s capacity allocation, defined as its conditional capacity given the space of constraints associated with all other parameters in the model. The observations are consistent with Figure 7: the lower the layer, the more the capacity allocation is peaked around the recent past, while higher layers tend to achieve a more uniform allocation. Analyzing the marginal contributions is also interesting. Naturally, the marginal contributions are lower than the independent contributions as the former are some residual of the latter. More specifically, it seems that the middle layers are the ones that are the least redundant, while some layers have a zero or near-zero marginal contribution to the model capacity allocation (more on this in Appendix D).
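The layer-by-layer decomposition can be sketched as follows, assuming that the conditional capacity of a block of constraints given previous blocks is the capacity of the combined constraint space minus the capacity of the previous blocks alone. The constraint blocks below are random placeholders standing in for the constraints associated with each layer, and all names are illustrative.

```python
import numpy as np

def orth(a, rel_tol=1e-10):
    """Orthonormal basis of the column space of `a` (SVD, null directions dropped)."""
    u, s, _ = np.linalg.svd(a, full_matrices=False)
    return u[:, s > rel_tol * s.max()]

def capacity_per_input(constraints):
    """Capacity allocated to each natural axis by span(constraints)."""
    K = orth(constraints)
    return np.sum(K ** 2, axis=1)

def chained_conditional_capacities(blocks):
    """Per-block contributions, each conditioned on the blocks before it."""
    contributions = []
    previous = np.zeros(blocks[0].shape[0])
    for i in range(len(blocks)):
        current = capacity_per_input(np.hstack(blocks[: i + 1]))
        contributions.append(current - previous)
        previous = current
    return contributions

rng = np.random.default_rng(4)
n = 10
# Placeholder constraint blocks, e.g. one block of constraint vectors per layer.
blocks = [rng.normal(size=(n, 2)) for _ in range(3)]

forward = chained_conditional_capacities(blocks)          # lowermost block first
backward = chained_conditional_capacities(blocks[::-1])   # uppermost block first
joint = capacity_per_input(np.hstack(blocks))
```

Both orders sum to the same joint allocation, but the individual contributions differ, which is why the forward and backward orders are worth comparing.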
5.2 Recurrent models
As an alternative to hierarchical models for audio modelling, [15, 11] have used recurrent models as a way to encode dependencies between inputs that are arbitrarily far apart (using some architecture and backpropagation tricks to make training manageable). In this section, we analyze the simplest linear recurrent models, and compare their behaviour to that of the hierarchical models of the previous section.
5.2.1 Model definition
For the purpose of this study, we consider one particular type of linear recurrent model with a single recurrent layer (cf. Figure 9), whose number of parameters scales linearly with the number of channels . More precisely, the space of models is parametrized as:
(21) 
where and are 1x1 convolutions and is a recurrent layer with no links across channels.
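As a rough illustration of this family of models, the sketch below implements a single-input, single-output linear recurrent model in NumPy. The names `w_in`, `decay` and `w_out`, as well as the exact update rule, are assumptions made for illustration and may differ from Eq. 21 in details such as indexing: the input 1x1 convolution maps the scalar input to the channels, each channel follows its own scalar recurrence (no links across channels), and the output 1x1 convolution mixes the channels into one prediction. Under these assumptions the model has three parameters per channel, so the parameter count grows linearly with the number of channels.

```python
import numpy as np

def recurrent_predict(x, w_in, decay, w_out):
    """Run a diagonal linear recurrence over a 1-D signal x.

    h[t] = decay * h[t-1] + w_in * x[t]   (elementwise, one state per channel)
    y[t] = w_out . h[t]
    """
    h = np.zeros_like(w_in)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = decay * h + w_in * x_t
        y[t] = w_out @ h
    return y

def equivalent_filter(w_in, decay, w_out, n_lags):
    """Coefficients of the causal linear filter implemented by the recurrence.

    The coefficient at lag k is sum_c w_out[c] * decay[c]**k * w_in[c].
    """
    lags = np.arange(n_lags)
    return (w_out * w_in) @ (decay[:, None] ** lags[None, :])

# Arbitrary illustrative parameters (not taken from the paper).
rng = np.random.default_rng(0)
n_channels = 4
w_in = rng.normal(size=n_channels)
decay = rng.uniform(0.1, 0.9, size=n_channels)
w_out = rng.normal(size=n_channels)
x = rng.normal(size=50)

y = recurrent_predict(x, w_in, decay, w_out)
coeffs = equivalent_filter(w_in, decay, w_out, n_lags=len(x))
# The same output is obtained by a causal convolution with the equivalent filter.
y_conv = np.convolve(x, coeffs)[: len(x)]
n_params = w_in.size + decay.size + w_out.size  # 3 * n_channels
```

Writing the recurrence as an equivalent filter makes explicit that, in the linear case, such a model is a constrained parametrization of the prediction coefficients: each channel contributes one geometrically decaying sequence.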
5.2.2 Capacity analysis
As in the previous section, we analyze the model capacity as a function of its number of parameters. We vary the number of channels , and plot the corresponding number of parameters and effective parameters (i.e. the total capacity) in Figure 10. We also plot the loss as a function of the number of channels and as a function of the number of (effective) parameters. In this case, it appears that the loss decreases as a power law of the number of effective parameters. Finally, Figure 11 shows the corresponding spatial capacity analysis, for . As above, the model shares its capacity between (i) a few close inputs, to which it allocates a capacity of 1, and (ii) more distant inputs, to which it allocates a capacity that decays as a power law.
6 Towards richer models
6.1 Multidimensional inputs
The measure of a model’s capacity allocation introduced above only makes sense in linear contexts: linear models, linear processes. To make a first step towards richer models, we now consider multidimensional gaussian input processes (of dimension ). In a way, this is just a repeat of everything that was presented in the previous sections, but with richer dependencies between the inputs and the variable to predict.
The spatial capacity analysis is of particular interest, as the subspace corresponding to inputs at a given spatial position is now dimensional. The maximum capacity allocation for one spatial position will thus also be equal to . Therefore, for large enough, it should be rarer to reach the degenerate situations where the capacity saturates at its maximal value for short-term dependencies (as seen in Figure 5). Rather, by increasing the dimensionality of the inputs, one should expect to observe different trade-offs between short-term and long-term capacity allocation.
The first analysis, which we present in Figure 12, compares the scaling of the capacity with the number of channels, for where the input components are taken to be independent processes with similar autocorrelations. For the 1-dimensional process, we observe the same pattern as in Section 5.1, where full capacity is first allocated to the most short-term dependencies, then spills over to longer ones as the ceiling is reached. For the 16-dimensional process, more capacity continues to be allocated to short-term dependencies beyond . For a similar number of parameters, more capacity is therefore allocated to modelling the recent past when the relationships between input and output are more complex.
Can we better quantify the interplay between and the number of parameters in the model? Figure 13 shows the capacity allocation normalized by the dimensionality of the input process, when the number of parameters is scaled proportionally to the dimensionality of the input (equivalently, the number of channels is scaled as ). As one might have expected, the relative capacities are almost equivalent, only smoother in the higher-dimensional case. This suggests that one can study the capacity allocated for high-dimensional processes and large numbers of parameters simply by scaling down the dimensionality of the input and the number of parameters proportionally.
6.2 Nonlinear models
6.2.1 Feature space
A simple class of nonlinear problems consists of those where the prediction is a linear function of some fixed nonlinear feature map applied to the input . The space of such functions is defined as:
(22)  
where is some fixed function that maps the input to some feature space and is a linear model that acts on the feature space. As in Section 4, is an element of the space of linear functions defined by some mapping from some parameter space:
(23)  
The loss function is then:
(24)  
where . In the trivial case , one recovers exactly the setting of Section 4. In general, can be nonlinear and can be arbitrarily large, leading to a much richer set of functions than considered above.
6.2.2 Capacity allocation in the feature space
Because of the linear nature of the problem in the feature space, one can apply capacity analysis as previously in the feature space, by substituting
(25)  
As above, one can compute , find an orthonormal basis of and define the capacity allocated to a subspace of the feature space:
(26) 
6.2.3 Capacity allocation in the input space
The remaining challenge is then to define a notion of capacity in the input space. While the task does not appear to be straightforward in general, there is one special case where the question is simpler: when acts on the different input components separately, such that we can write:
(27) 
where are linearly independent functions (for example, polynomial basis functions, Fourier basis, etc.). If we denote by the capacities corresponding to the natural dimensions of the feature space, then the capacity allocated to the th input component can be written as:
(28) 
Just like in the multidimensional case of the previous section, the maximum capacity per input component is , as one parameter per basis function is now necessary to fully model the dependencies. The size of the set of basis functions defines the complexity of the data dependencies – which is typically infinite for real data. This illustrates the fact that the notion of underparametrization becomes much more common as the data complexity increases, and so does the regime in which capacity analysis makes sense.
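The aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: monomials are used as the basis functions, the helper names are hypothetical, and the per-feature capacities are made-up numbers whose only purpose is to show how capacities in the feature space are summed per input component (they are not results from this paper).

```python
import numpy as np

def separable_feature_map(x, n_basis):
    """Apply the monomials x, x**2, ..., x**n_basis to each input component separately.

    x has shape (n_inputs,). The result has shape (n_inputs, n_basis), where row i
    holds the features derived from input component i only.
    """
    powers = np.arange(1, n_basis + 1)
    return x[:, None] ** powers[None, :]

def input_capacities(feature_capacities):
    """Sum the capacities of the feature dimensions belonging to each input component."""
    return feature_capacities.sum(axis=1)

x = np.array([0.5, -1.0, 2.0])
features = separable_feature_map(x, n_basis=3)   # shape (3, 3)

# Hypothetical per-feature capacities in [0, 1], laid out like `features`.
feature_caps = np.array([[1.0, 0.8, 0.3],
                         [0.6, 0.2, 0.0],
                         [0.1, 0.0, 0.0]])
per_input = input_capacities(feature_caps)       # each entry is at most n_basis
```

Because each feature dimension carries a capacity of at most 1, the capacity of an input component is bounded by the number of basis functions, in line with the maximum stated above.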
The above analysis is only a glimpse of how the notion of capacity generalizes in nonlinear settings. A more thorough study in the context of nonlinear neural network layers is presented in [6].
7 Conclusion
In this paper, we have introduced the notion of capacity analysis for linear systems. We have defined a linear model’s capacity , which represents the number of independent parameters that describe the model space, and shown how this capacity can be broken down along input subspaces. In particular, we have focussed on spatial capacity allocation along natural dimensions of the input space. We have illustrated these concepts in the case of one-dimensional hierarchical and recurrent models, and shown that some typical allocation patterns arise for each type of architecture. Finally, we have made a step towards capacity allocation in richer settings, by considering multidimensional inputs and nonlinear feature maps. This opens the door to more principled network design, by going beyond the value of the loss function and better understanding which dependencies a given architecture can be expected to capture. This is only a first step towards a deeper theoretical understanding of neural networks through the lens of capacity allocation, and the journey ahead is still long. One obvious next step is to perform capacity analysis across a number of architectural variants, and see whether and how this can guide architecture design. But to be really useful, the concept of capacity analysis first needs to be generalized to other nonlinear models, for example nonlinear neural networks.
Acknowledgements
The author would like to thank Martin Gould, Marc Sarfati and Antoine Tilloy for their very useful comments on the manuscript.
References

 [1] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 [2] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
 [3] George EP Box, William Gordon Hunter, J Stuart Hunter, et al. Statistics for experimenters. 1978.
 [4] Sander Dieleman, Aäron van den Oord, and Karen Simonyan. The challenge of realistic music generation: modelling raw audio at scale. arXiv preprint arXiv:1806.10474, 2018.
 [5] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, volume 15, pages 3460–8, 2015.
 [6] Jonathan Donier. Capacity allocation through neural networks layers. In preparation, 2019.
 [7] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. Neural audio synthesis of musical notes with wavenet autoencoders. arXiv preprint arXiv:1704.01279, 2017.
 [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [9] William H Guss and Ruslan Salakhutdinov. On characterizing the capacity of neural networks using algebraic topology. arXiv preprint arXiv:1802.04443, 2018.
 [10] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. 2017.
 [11] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.
 [12] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with bayesian optimisation and optimal transport. arXiv preprint arXiv:1802.07191, 2018.
 [13] Hiroaki Kitano. Designing neural networks using genetic algorithms with graph generation system. Complex systems, 4(4):461–476, 1990.
 [14] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079, 2016.
 [15] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. Samplernn: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
 [16] Geoffrey F Miller, Peter M Todd, and Shailesh U Hegde. Designing neural networks using genetic algorithms. In ICGA, volume 89, pages 379–384, 1989.
 [17] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
 [18] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360–3368, 2016.
 [19] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
 [20] Amir Rosenfeld and John K Tsotsos. Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing. arXiv preprint arXiv:1802.00844, 2018.
 [21] Andrew M Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In ICML, pages 1089–1096, 2011.
 [22] Thibault Sellam, Kevin Lin, Ian Yiran Huang, Michelle Yang, Carl Vondrick, and Eugene Wu. Deepbase: Deep inspection of neural networks. arXiv preprint arXiv:1808.04486, 2018.
 [23] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 [24] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142, 2016.
 [25] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, page 125, 2016.
 [26] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

 [27] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
 [28] Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
 [29] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A Proof of Property 5
First note that and therefore the left hand side is equivalent to . Since and have orthonormal columns, one can write where has unitary columns such that . Note that is a square rotation matrix if and only if , therefore one needs to prove the following equivalence:
(29) 
If is a rotation matrix, then the left hand side proposition is trivially true by the invariance property of the Frobenius norm. Let us now prove the forward implication and assume that the left hand side proposition holds. Since the columns of are orthonormal vectors, . Therefore, which implies that . Taking the trace, one obtains , which shows that . is therefore a square matrix that verifies , i.e. a rotation matrix.
Appendix B Gaussian processes and linear models
B.1 Gaussian process prediction
Gaussian processes (GP) are processes for which the joint distribution of any finite set of points is gaussian, and which can thus be fully characterized by their mean and covariance matrix. If the process is stationary, we can assume that the mean is zero without loss of generality, and the covariance matrix takes a simple symmetric shape where all diagonals are constant:
(30) 
where the function is called the autocovariance function of the gaussian process. In this framework, the conditional distribution of a sample conditioned on some samples takes the simple form:
(31)  
where is the autocorrelation matrix defined above and we have used the notation . The best prediction of given is therefore realized by a linear model with coefficients , and the corresponding residual variance is . Conversely, one can compute the autocovariance function of a gaussian process generated by the linear autoregressive model where , by reversing Eq. 31. Naturally, if one uses and , then one recovers the autocovariance matrix .
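As a minimal numerical sketch of the statement above, the following NumPy code computes the optimal linear prediction coefficients and the residual variance from an autocovariance function, by solving the linear system that involves the covariance matrix of the past samples. The function names are hypothetical, and the AR(1) autocovariance used as a test case is an assumption chosen because its optimal predictor is known in closed form.

```python
import numpy as np

def toeplitz_cov(autocov, n):
    """n x n covariance matrix of a stationary process: entry (i, j) = autocov(|i - j|)."""
    idx = np.arange(n)
    return autocov(np.abs(idx[:, None] - idx[None, :]))

def best_linear_predictor(autocov, n):
    """Coefficients on the n most recent samples (lag 1 first) and residual variance."""
    K = toeplitz_cov(autocov, n)        # covariance of the past samples
    k = autocov(np.arange(1, n + 1))    # covariance between past samples and target
    coeffs = np.linalg.solve(K, k)
    residual_var = autocov(0) - k @ coeffs
    return coeffs, residual_var

# Assumed test case: AR(1) process x_t = phi * x_{t-1} + eps_t with Var(eps) = sigma2,
# whose autocovariance is sigma2 * phi**|tau| / (1 - phi**2).
phi, sigma2 = 0.8, 1.0

def ar1_autocov(tau):
    return sigma2 * phi ** np.abs(tau) / (1.0 - phi ** 2)

coeffs, residual_var = best_linear_predictor(ar1_autocov, n=5)
```

For this test case the solver recovers the generating model: the coefficient on the most recent sample equals `phi`, the others vanish, and the residual variance equals the noise variance `sigma2`.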
B.2 The optimization problem
Let us consider a gaussian process of autocovariance matrix and the class of linear models:
(32) 
We want to find the optimal parameter that solves the following optimization problem:
(33)  
where is the optimal prediction from Section B.1, and the solution to this problem for . In general, for underparametrized models, has no a priori reason to be in , in which case and the residual variance is .
The above optimization problem has no general solution as the model space can be arbitrarily complex, as we have seen in Section 5. However, thanks to the linearity of the problem, the residual variance to be minimized can be expressed directly using the autocorrelation matrix , which eliminates the stochasticity of the problem and allows for a more stable and straightforward gradient descent (instead of a stochastic gradient descent). This enables us to use a second-order optimization method, which finds a near-optimal solution in seconds, even for models that have millions of parameters.
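To make the deterministic objective concrete, the sketch below evaluates the expected squared error of a linear predictor directly from the covariance of the process, with no sampling. Two simplifying assumptions are made and are not from the text: the prediction coefficients are optimized directly rather than through a structured model, and plain gradient descent is used in place of the second-order method mentioned above. The covariance values are illustrative.

```python
import numpy as np

def expected_sq_error(w, K, k, var0):
    """E[(x_t - w . x_past)^2] for a zero-mean stationary process.

    K: covariance of the past samples, k: covariance between past samples and
    target, var0: variance of the target. No sampling is involved.
    """
    return var0 - 2.0 * w @ k + w @ K @ w

def gradient(w, K, k):
    """Exact gradient of the expected squared error with respect to w."""
    return 2.0 * (K @ w - k)

# Assumed example: exponentially decaying autocovariance 0.8**|tau| (illustrative only).
n = 8
lags = np.arange(n)
K = 0.8 ** np.abs(lags[:, None] - lags[None, :])
k = 0.8 ** (lags + 1)
var0 = 1.0

w = np.zeros(n)
step = 0.5 / np.linalg.eigvalsh(K).max()   # conservative step size for this quadratic
losses = []
for _ in range(2000):
    losses.append(expected_sq_error(w, K, k, var0))
    w = w - step * gradient(w, K, k)

w_star = np.linalg.solve(K, k)             # closed-form minimizer, for comparison
```

Since the objective is an exact expectation, the loss decreases monotonically along the iterations, and for this unconstrained parametrization the iterates approach the closed-form solution `w_star`.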
B.3 The hierarchical example
One can compare the optimal parameters with the true parameters and the corresponding autocorrelation with the true autocorrelation . The curves as well as their relative and absolute differences are shown in Fig. 14 for a number of channels .
We can make the following two observations from the plots:
(i) The error on the autocorrelation is lower for short lags. This behaviour is expected as is obtained by solving the same linear system as for obtaining from , but with a boundary condition instead of .
(ii) The error on the coefficients shows a less clear pattern. There is in fact a competition between two forces: the model tends to allocate more capacity to reproducing the largest target coefficients (as observed in Fig. 5), but they are also more difficult to reproduce exactly. Depending on the dominant force, larger coefficients will be either better approximated (in absolute terms) or not. The capacity theory developed in the previous sections is in fact a good way to isolate and measure the first force. Indeed, as we’ve seen in Fig. 5, the patterns in the capacity plots are much cleaner and much more interpretable than the realized errors of Fig. 14.
Appendix C Wavenets and repeated layers
One architecture trick that was introduced in [25] is to tile dilated blocks, with a dilation pattern that looks like . One interesting question is how such an architectural choice differs from repeating layers instead of blocks: . Figure 15 shows the spatial capacity allocations for the two aforementioned variants and a number of channels , for the same task and the same data. Perhaps as expected, the capacity allocation patterns are markedly different: the tiled version has a larger total capacity for the same number of parameters, and allocates most of it to short-range dependencies. In contrast, the repeated version has a lower total capacity (more redundant parameters), but puts more focus on the distant past.
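The two variants can be sketched by generating their dilation sequences. The code below assumes doubling dilations, as in WaveNet-style stacks, and a kernel size of 2; the helper names, the number of scales and the number of repeats are arbitrary illustrative choices rather than the settings used for Figure 15.

```python
def tiled_dilations(n_scales, n_repeats):
    """Blocks of doubling dilations tiled one after another: 1, 2, 4, ..., 1, 2, 4, ..."""
    block = [2 ** i for i in range(n_scales)]
    return block * n_repeats

def repeated_dilations(n_scales, n_repeats):
    """Each layer repeated before moving to the next dilation: 1, 1, 2, 2, 4, 4, ..."""
    return [2 ** i for i in range(n_scales) for _ in range(n_repeats)]

def receptive_field(dilations, kernel_size=2):
    """Number of inputs visible to the top layer of a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

tiled = tiled_dilations(n_scales=4, n_repeats=2)        # [1, 2, 4, 8, 1, 2, 4, 8]
repeated = repeated_dilations(n_scales=4, n_repeats=2)  # [1, 1, 2, 2, 4, 4, 8, 8]
rf_tiled = receptive_field(tiled)
rf_repeated = receptive_field(repeated)
```

Both sequences contain the same layers and therefore span the same receptive field, so any difference between the two variants comes from the order in which the dilations are stacked.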
Appendix D Layer redundancy
In Section 3.4, we have defined the concept of conditional capacity, and we have considered some examples in the case of hierarchical models in Section 5.1.4. We have observed in particular that some layers have a marginal capacity equal to 0. Here we formulate some hypotheses regarding such layers with zero marginal capacity.
Definition 3.
Let be a set of optimal parameters with respect to some optimization criterion, be a subset of these parameters and be the set of all parameters except . Let and denote the spaces of constraints respectively associated with and . Then, if and only if the marginal contribution of to the capacity is zero for every subspace. In this case, the parameters are said to be (jointly) redundant.
The above definition specifies what it means for a set of parameters to be jointly redundant: namely, the constraints associated with such parameters could be ignored without affecting the capacity allocation. When this is the case, we make two conjectures regarding these parameters:
Conjecture 1.
If a set of parameters is redundant, then for almost all values of these parameters, the optimal model can be recovered by adapting the other parameters. The parameter values for which this does not hold are those that lead to degenerate cases, i.e. to spaces of constraints of lower dimension. The set of such values has measure zero.
Conjecture 2.
If a set of parameters is redundant, then with probability 1 these parameters can be fixed at random before learning the other parameters, without affecting the optimal model.
The first conjecture comes from the intuition that if a parameter is redundant, any change in this parameter can be compensated by tweaking other parameters. For example, if the model space is , and if one optimal model is found with parameters and , then for any fixed , the optimal model can be recovered for . The case is the only degenerate case in this example. The second conjecture immediately follows from the first, as the set of degenerate parameters has measure 0. Although redundant parameters are more likely to occur in linear settings, it has been observed that fixing a large fraction of the weights at random in deep networks might result in performance that is on par with fully learnable models [20, 21], which might be related to the above conjectures.
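The intuition behind the two conjectures can be checked numerically on a toy case. The sketch below assumes a hypothetical two-parameter model f(x) = a * b * x, chosen for illustration and not necessarily identical to the example above: only the product a * b matters, so a can be fixed at random and b learned afterwards. The data are synthetic and noiseless.

```python
import numpy as np

def fit_b_given_a(a, x, y):
    """Least-squares fit of b in the model y ~ a * b * x, with a held fixed."""
    z = a * x
    return (z @ y) / (z @ z)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_slope = 1.5            # product a * b of an optimal model (assumed value)
y = true_slope * x          # noiseless synthetic targets, for clarity

random_as = rng.normal(size=5)   # parameter a fixed at random before learning b
products = np.array([a * fit_b_given_a(a, x, y) for a in random_as])

# Degenerate case: a = 0 collapses the model to f(x) = 0, and no b can compensate.
```

Every random draw of `a` leads to the same optimal model once `b` is fitted; the only value for which this fails is `a = 0`, which is drawn with probability zero.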