1 Introduction
A central factor in the application of machine learning to a given task is the
inductive bias, i.e. the choice of hypotheses space from which learned functions are taken. The restriction posed by the inductive bias is necessary for practical learning, and reflects prior knowledge regarding the task at hand. Perhaps the most successful exemplar of inductive bias to date manifests itself in the use of convolutional networks (LeCun and Bengio (1995)) for computer vision tasks. These hypotheses spaces are delivering unprecedented visual recognition results (e.g. Krizhevsky et al. (2012); Szegedy et al. (2015); Simonyan and Zisserman (2014); He et al. (2015)), largely responsible for the resurgence of deep learning (
LeCun et al. (2015)). Unfortunately, our formal understanding of the inductive bias behind convolutional networks is limited – the assumptions encoded into these models, which seem to form excellent prior knowledge for imagery data, are for the most part a mystery.

Existing works studying the inductive bias of deep networks (not necessarily convolutional) do so in the context of depth efficiency, essentially arguing that for a given amount of resources, more layers result in higher expressiveness. More precisely, depth efficiency refers to a situation where a function realized by a deep network of polynomial size requires superpolynomial size in order to be realized (or approximated) by a shallower network. In recent years, a large body of research was devoted to proving existence of depth efficiency under different types of architectures (see for example Delalleau and Bengio (2011); Pascanu et al. (2013); Montufar et al. (2014); Telgarsky (2015); Eldan and Shamir (2015); Poggio et al. (2015); Mhaskar et al. (2016)). Nonetheless, despite the wide attention it is receiving, depth efficiency does not convey the complete story behind the inductive bias of deep networks. While it does suggest that depth brings forth functions that are otherwise unattainable, it does not explain why these functions are useful. Loosely speaking, the hypotheses space of a polynomially sized deep network covers a small fraction of the space of all functions. We would like to understand why this small fraction is so successful in practice.
A specific family of convolutional networks gaining increased attention is that of convolutional arithmetic circuits. These models follow the standard paradigm of locality, weight sharing and pooling, yet differ from the most conventional convolutional networks in that their pointwise activations are linear, with nonlinearity originating from product pooling. Recently, Cohen et al. (2016b) analyzed the depth efficiency of convolutional arithmetic circuits, showing that besides a negligible (zero measure) set, all functions realizable by a deep network require exponential size in order to be realized (or approximated) by a shallow one. This result, termed complete depth efficiency, stands in contrast to previous depth efficiency results, which merely showed existence of functions efficiently realizable by deep networks but not by shallow ones. Besides their analytic advantage, convolutional arithmetic circuits are also showing promising empirical performance. In particular, they are equivalent to SimNets – a deep learning architecture that excels in computationally constrained settings (Cohen and Shashua (2014); Cohen et al. (2016a)), and in addition, have recently been utilized for classification with missing data (Sharir et al. (2016)). Motivated by these theoretical and practical merits, we focus our analysis in this paper on convolutional arithmetic circuits, viewing them as representative of the class of convolutional networks. We empirically validate our conclusions with both convolutional arithmetic circuits and convolutional rectifier networks
– convolutional networks with rectified linear (ReLU,
Nair and Hinton (2010)) activation and max or average pooling. Adaptation of the formal analysis to networks of the latter type, similarly to the adaptation of the analysis in Cohen et al. (2016b) carried out by Cohen and Shashua (2016), is left for future work.

Our analysis approaches the study of inductive bias from the direction of function inputs. Specifically, we study the ability of convolutional arithmetic circuits to model correlation between regions of their input. To analyze the correlations of a function, we consider different partitions of input regions into disjoint sets, and ask how far the function is from being separable w.r.t. these partitions. Distance from separability is measured through the notion of separation rank (Beylkin and Mohlenkamp (2002)), which can be viewed as a surrogate of the distance from the closest separable function. For a given function and partition of its input, high separation rank implies that the function induces strong correlation between sides of the partition, and vice versa.
We show that a deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial or linear (in network size) separation ranks for others. The network’s pooling geometry effectively determines which input partitions are favored in terms of separation rank, i.e. which partitions enjoy the possibility of exponentially high separation rank with polynomial network size, and which require the network to be exponentially large. The standard choice of square contiguous pooling windows favors interleaved (entangled) partitions over coarse ones that divide the input into large distinct areas. Other choices lead to different preferences, for example pooling windows that join together nodes with their spatial reflections lead to favoring partitions that split the input symmetrically. We conclude that in terms of modeled correlations, pooling geometry controls the inductive bias, and the particular design commonly employed in practice orients it towards the statistics of natural images (nearby pixels more correlated than ones that are far apart). Moreover, when processing data that departs from the usual domain of natural imagery, prior knowledge regarding its statistics can be used to derive respective pooling schemes, and accordingly tailor the inductive bias.
With regards to depth efficiency, we show that separation ranks under favored input partitions are exponentially high for all but a negligible set of the functions realizable by a deep network. Shallow networks on the other hand, treat all partitions equally, and support only linear (in network size) separation ranks. Therefore, almost all functions that may be realized by a deep network require a replicating shallow network to have exponential size. By this we return to the complete depth efficiency result of Cohen et al. (2016b), but with an added important insight into the benefit of functions brought forth by depth – they are able to efficiently model strong correlation under favored partitions of the input.
The remainder of the paper is organized as follows. Sec. 2
provides a brief presentation of necessary background material from the field of tensor analysis. Sec.
3 describes the convolutional arithmetic circuits we analyze, and their relation to tensor decompositions. In sec. 4 we introduce the concept of separation rank, on which we base our analyses in sec. 5 and 6. The conclusions from our analyses are empirically validated in sec. 7. Finally, sec. 8 concludes.

2 Preliminaries
The analyses carried out in this paper rely on concepts and results from the field of tensor analysis. In this section we establish the minimal background required in order to follow our arguments, referring the interested reader to Hackbusch (2012) for a broad and comprehensive introduction to the field. (The definitions we give are actually concrete special cases of more abstract algebraic definitions as given in Hackbusch (2012); we limit the discussion to these special cases since they suffice for our needs and are easier to grasp.)
The core concept in tensor analysis is a tensor, which for our purposes may simply be thought of as a multi-dimensional array. The order of a tensor is defined to be the number of indexing entries in the array, which are referred to as modes. The dimension of a tensor in a particular mode is defined as the number of values that may be taken by the index in that mode. For example, an $M_1$-by-$M_2$ matrix is a tensor of order $2$, i.e. it has two modes, with dimension $M_1$ in mode $1$ and dimension $M_2$ in mode $2$. If $\mathcal{A}$ is a tensor of order $N$ and dimension $M_i$ in each mode $i\in[N]:=\{1,\ldots,N\}$, the space of all configurations it can take is denoted, quite naturally, by $\mathbb{R}^{M_1\times\cdots\times M_N}$.
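As a quick sanity check of this terminology, the following NumPy sketch (NumPy is used for illustration throughout; the array shapes are arbitrary choices, not taken from the text) matches order to `ndim` and mode dimensions to `shape`:

```python
import numpy as np

A = np.zeros((3, 4))                        # a 3-by-4 matrix: a tensor of order 2
assert A.ndim == 2                          # order = number of modes
assert A.shape == (3, 4)                    # dimension 3 in mode 1, 4 in mode 2
T = np.zeros((2, 2, 2))                     # order-3 tensor, dimension 2 per mode
assert T.ndim == 3 and T.size == 2 ** 3     # its configuration space has 2^3 entries
```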
A fundamental operator in tensor analysis is the tensor product, which we denote by $\otimes$. It is an operator that intakes two tensors $\mathcal{A}\in\mathbb{R}^{M_1\times\cdots\times M_P}$ and $\mathcal{B}\in\mathbb{R}^{M_{P+1}\times\cdots\times M_{P+Q}}$ (orders $P$ and $Q$ respectively), and returns a tensor $\mathcal{A}\otimes\mathcal{B}\in\mathbb{R}^{M_1\times\cdots\times M_{P+Q}}$ (order $P+Q$) defined by: $(\mathcal{A}\otimes\mathcal{B})_{d_1\ldots d_{P+Q}}=\mathcal{A}_{d_1\ldots d_P}\cdot\mathcal{B}_{d_{P+1}\ldots d_{P+Q}}$. Notice that in the case $P{=}Q{=}1$, the tensor product reduces to the standard outer product between vectors, i.e. if $\mathbf{u}\in\mathbb{R}^{M_1}$ and $\mathbf{v}\in\mathbb{R}^{M_2}$, then $\mathbf{u}\otimes\mathbf{v}$ is no other than the rank-1 matrix $\mathbf{u}\mathbf{v}^\top$.

We now introduce the important concept of matricization, which is essentially the rearrangement of a tensor as a matrix. Suppose $\mathcal{A}$ is a tensor of order $N$ and dimension $M_i$ in each mode $i\in[N]$, and let $(I,J)$ be a partition of $[N]$, i.e. $I$ and $J$ are disjoint subsets of $[N]$ whose union gives $[N]$. We may write $I=\{i_1,\ldots,i_{|I|}\}$ where $i_1<\cdots<i_{|I|}$, and similarly $J=\{j_1,\ldots,j_{|J|}\}$ where $j_1<\cdots<j_{|J|}$. The matricization of $\mathcal{A}$ w.r.t. the partition $(I,J)$, denoted $[\![\mathcal{A}]\!]_{I,J}$, is the $\prod_{t=1}^{|I|}M_{i_t}$-by-$\prod_{t=1}^{|J|}M_{j_t}$ matrix holding the entries of $\mathcal{A}$ such that $\mathcal{A}_{d_1\ldots d_N}$ is placed in row index $1+\sum_{t=1}^{|I|}(d_{i_t}-1)\prod_{t'=t+1}^{|I|}M_{i_{t'}}$ and column index $1+\sum_{t=1}^{|J|}(d_{j_t}-1)\prod_{t'=t+1}^{|J|}M_{j_{t'}}$. If $I=\emptyset$ or $J=\emptyset$, then by definition $[\![\mathcal{A}]\!]_{I,J}$ is a row or column (respectively) vector of dimension $\prod_{t=1}^{N}M_t$ holding $\mathcal{A}_{d_1\ldots d_N}$ in entry $1+\sum_{t=1}^{N}(d_t-1)\prod_{t'=t+1}^{N}M_{t'}$.
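The index arithmetic above can be realized with a transpose followed by a reshape. The helper below is a minimal sketch (the name `matricize` and the 0-based mode convention are our own; the text is 1-based):

```python
import numpy as np

def matricize(T, I, J):
    """[T]_{I,J}: modes listed in I index rows, modes in J index columns.
    I and J are disjoint 0-based mode lists covering all modes of T."""
    rows = int(np.prod([T.shape[i] for i in I]))
    return np.transpose(T, list(I) + list(J)).reshape(rows, -1)

T = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # an order-3 tensor
M = matricize(T, [0, 2], [1])               # a (2*4)-by-3 matrix
assert M.shape == (8, 3)
# Entry T[d1, d2, d3] lands in row d1*4 + d3 and column d2 -- the 0-based
# analogue of the lexicographic row/column indexing in the definition above:
assert M[1 * 4 + 2, 1] == T[1, 1, 2]
```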
A well known matrix operator is the Kronecker product, which we denote by $\odot$. For two matrices $A\in\mathbb{R}^{M_1\times M_2}$ and $B\in\mathbb{R}^{N_1\times N_2}$, $A\odot B$ is the matrix in $\mathbb{R}^{M_1N_1\times M_2N_2}$ holding $A_{ij}B_{kl}$ in row index $(i-1)N_1+k$ and column index $(j-1)N_2+l$. Let $\mathcal{A}$ and $\mathcal{B}$ be tensors of orders $P$ and $Q$ respectively, and let $(I,J)$ be a partition of $[P+Q]$. The basic relation that binds together the tensor product, the matricization operator, and the Kronecker product, is:

$$[\![\mathcal{A}\otimes\mathcal{B}]\!]_{I,J}=[\![\mathcal{A}]\!]_{I\cap[P],\,J\cap[P]}\odot[\![\mathcal{B}]\!]_{(I\setminus[P])-P,\,(J\setminus[P])-P}\qquad(1)$$

where $(I\setminus[P])-P$ and $(J\setminus[P])-P$ are simply the sets obtained by subtracting $P$ from each of the elements in $I\setminus[P]$ and $J\setminus[P]$ respectively. In words, eq. 1 implies that the matricization of the tensor product between $\mathcal{A}$ and $\mathcal{B}$ w.r.t. the partition $(I,J)$, is equal to the Kronecker product between two matricizations: that of $\mathcal{A}$ w.r.t. the partition of $[P]$ induced by the lower values of $(I,J)$, and that of $\mathcal{B}$ w.r.t. the partition of $[Q]$ induced by the higher values of $(I,J)$.
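Eq. 1 can be checked numerically for two order-2 tensors; the `matricize` helper and the concrete shapes below are our own illustrative choices:

```python
import numpy as np

def matricize(T, I, J):
    rows = int(np.prod([T.shape[i] for i in I]))
    return np.transpose(T, list(I) + list(J)).reshape(rows, -1)

# Order-2 tensors A (P=2) and B (Q=2); partition I={1,4}, J={2,3} of [4]
# (written 0-based below). Shapes are arbitrary.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 5))
T = np.einsum('ab,cd->abcd', A, B)          # the tensor product A (x) B

lhs = matricize(T, [0, 3], [1, 2])
# I n [P] = {1}, J n [P] = {2}             -> matricization of A is A itself;
# (I \ [P]) - P = {2}, (J \ [P]) - P = {1} -> matricization of B is B transposed.
rhs = np.kron(matricize(A, [0], [1]), matricize(B, [1], [0]))
assert np.allclose(lhs, rhs)                # eq. 1 holds
```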
3 Convolutional arithmetic circuits
The convolutional arithmetic circuit architecture on which we focus in this paper is the one considered in Cohen et al. (2016b), portrayed in fig. 1(a). Instances processed by a network are represented as $N$-tuples of $s$-dimensional vectors. They are generally thought of as images, with the $s$-dimensional vectors corresponding to local patches. For example, instances could be $32$-by-$32$ RGB images, with local patches being $5\times5$ regions crossing the three color bands. In this case, assuming a patch is taken around every pixel in an image (boundaries padded), we have $N=32\cdot32=1024$ and $s=5\cdot5\cdot3=75$. Throughout the paper, we denote a general instance by $X=(\mathbf{x}_1,\ldots,\mathbf{x}_N)$, with $\mathbf{x}_1,\ldots,\mathbf{x}_N\in\mathbb{R}^s$ standing for its patches.

The first layer in a network is referred to as representation. It consists of applying $M$ representation functions $f_{\theta_1},\ldots,f_{\theta_M}:\mathbb{R}^s\to\mathbb{R}$ to all patches, thereby creating $M$ feature maps. In the case where representation functions are chosen as $f_{\theta_d}(\mathbf{x})=\sigma(\mathbf{w}_d^\top\mathbf{x}+b_d)$, with parameters $\theta_d=(\mathbf{w}_d,b_d)$ and some point-wise activation $\sigma(\cdot)$, the representation layer reduces to a standard convolutional layer. More elaborate settings are also possible, for example modeling the representation as a cascade of convolutional layers with pooling in-between. Following the representation, a network includes $L$ hidden layers indexed by $l=0,\ldots,L-1$. Each hidden layer begins with a conv operator, which is simply a three-dimensional convolution with $r_l$ channels and filters of spatial dimensions $1$-by-$1$. (Cohen et al. (2016b) consider two settings for the conv operator. The first, referred to as weight sharing, is the one described above, and corresponds to standard convolution. The second is more general, allowing filters that slide across the previous layer to have different weights at different spatial locations. It is shown in Cohen et al. (2016b) that without weight sharing, a convolutional arithmetic circuit with one hidden layer (or more) is universal, i.e. can realize any function if its size (width) is unbounded. This property is imperative for the study of depth efficiency, as that requires shallow networks to ultimately be able to replicate any function realized by a deep network. In this paper we limit the presentation to networks with weight sharing, which are not universal. We do so because they are more conventional, and since our entire analysis is oblivious to whether or not weights are shared (applies as is to both settings). The only exception is the point at which we reproduce the depth efficiency result of Cohen et al. (2016b); there, we momentarily consider networks without weight sharing.)
This is followed by spatial pooling, that decimates feature maps by taking products of non-overlapping two-dimensional windows that cover the spatial extent. The last of the hidden layers ($l=L{-}1$) reduces feature maps to singletons (its pooling operator is global), creating a vector of dimension $r_{L-1}$. This vector is mapped into $Y$ network outputs through a final dense linear layer.
Altogether, the architectural parameters of a network are the type of representation functions ($f_{\theta_d}$), the pooling window shapes and sizes (which in turn determine the number of hidden layers $L$), and the number of channels in each layer ($M$ for representation, $r_0,\ldots,r_{L-1}$ for hidden layers, $Y$ for output). Given these architectural parameters, the learnable parameters of a network are the representation weights ($\theta_d$ for channel $d$), the conv weights ($\mathbf{a}^{l,\gamma}$ for channel $\gamma$ of hidden layer $l$), and the output weights ($\mathbf{a}^{L,y}$ for output node $y$).
For a particular setting of weights, every node (neuron) in a given network realizes a function from $(\mathbb{R}^s)^N$ to $\mathbb{R}$. The receptive field of a node refers to the indexes of input patches on which its function may depend. For example, the receptive field of node $j$ in channel $\gamma$ of conv operator at hidden layer $0$ is $\{j\}$, and that of an output node is $[N]$, corresponding to the entire input. Denote by $\phi^{l,j,\gamma}$ the function realized by node $j$ of channel $\gamma$ in conv operator at hidden layer $l$, and let $I^{l,j,\gamma}\subset[N]$ be its receptive field. By the structure of the network it is evident that $I^{l,j,\gamma}$ does not depend on $\gamma$, so we may write $I^{l,j}$ instead. Moreover, assuming pooling windows are uniform across channels (as customary with convolutional networks), and taking into account the fact that they do not overlap, we conclude that $I^{l,j_1}$ and $I^{l,j_2}$ are necessarily disjoint if $j_1\neq j_2$. A simple induction over $l$ then shows that $\phi^{l,j,\gamma}$ may be expressed as

$$\phi^{l,j,\gamma}(\mathbf{x}_{i_1},\ldots,\mathbf{x}_{i_T})=\sum_{d_1\ldots d_T=1}^{M}\mathcal{A}^{l,j,\gamma}_{d_1\ldots d_T}\prod_{t=1}^{T}f_{\theta_{d_t}}(\mathbf{x}_{i_t})$$

where $\{i_1,\ldots,i_T\}$ stands for the receptive field $I^{l,j}$ (with $T:=|I^{l,j}|$), and $\mathcal{A}^{l,j,\gamma}$ is a tensor of order $T$ and dimension $M$ in each mode, with entries given by polynomials in the network’s conv weights $\{\mathbf{a}^{l',\gamma'}\}$. Taking the induction one step further (from last hidden layer to network output), we obtain the following expression for functions realized by network outputs:

$$h_y(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum_{d_1\ldots d_N=1}^{M}\mathcal{A}^{y}_{d_1\ldots d_N}\prod_{i=1}^{N}f_{\theta_{d_i}}(\mathbf{x}_i)\qquad(2)$$

here $y\in[Y]$ is an output node index, and $h_y$ is the function realized by that node. $\mathcal{A}^y$ is a tensor of order $N$ and dimension $M$ in each mode, with entries given by polynomials in the network’s conv weights $\{\mathbf{a}^{l,\gamma}\}$ and output weights $\mathbf{a}^{L,y}$. Hereafter, terms such as function realized by a network or coefficient tensor realized by a network, are to be understood as referring to $h_y$ or $\mathcal{A}^y$ respectively. Next, we present explicit expressions for $\mathcal{A}^y$ under two canonical networks – deep and shallow.
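To make eq. 2 concrete, here is a minimal sketch evaluating such a function for $N=2$ patches, with hypothetical linear representation functions $f_{\theta_d}(\mathbf{x})=\langle\mathbf{w}_d,\mathbf{x}\rangle$ and a random coefficient tensor (all shapes and values are illustrative):

```python
import numpy as np

# h(x1, x2) = sum_{d1,d2} A[d1,d2] * f_{d1}(x1) * f_{d2}(x2), cf. eq. 2 with N=2.
rng = np.random.default_rng(2)
M, s = 3, 4
W = rng.standard_normal((M, s))             # rows are hypothetical weight vectors w_d
A = rng.standard_normal((M, M))             # coefficient tensor (order N=2)
x1, x2 = rng.standard_normal(s), rng.standard_normal(s)

f1, f2 = W @ x1, W @ x2                     # f_d(x1) and f_d(x2) for all d
h = np.einsum('ab,a,b->', A, f1, f2)        # contract against the coefficient tensor
brute = sum(A[d1, d2] * f1[d1] * f2[d2] for d1 in range(M) for d2 in range(M))
assert np.isclose(h, brute)
```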
Deep network.
Consider a network as in fig. 1(a), with pooling windows set to cover four entries each, resulting in $L=\log_4 N$ hidden layers. The linear weights of such a network are $\{\mathbf{a}^{0,\gamma}\in\mathbb{R}^{M}\}_{\gamma\in[r_0]}$ for conv operator in hidden layer $0$, $\{\mathbf{a}^{l,\gamma}\in\mathbb{R}^{r_{l-1}}\}_{\gamma\in[r_l]}$ for conv operator in hidden layer $l\in\{1,\ldots,L-1\}$, and $\{\mathbf{a}^{L,y}\in\mathbb{R}^{r_{L-1}}\}_{y\in[Y]}$ for dense output operator. They determine the coefficient tensor $\mathcal{A}^y$ (eq. 2) through the following recursive decomposition:

$$\phi^{1,\gamma}=\sum_{\alpha=1}^{r_0}a^{1,\gamma}_\alpha\left(\mathbf{a}^{0,\alpha}\right)^{\otimes4}$$
$$\vdots$$
$$\phi^{l,\gamma}=\sum_{\alpha=1}^{r_{l-1}}a^{l,\gamma}_\alpha\left(\phi^{l-1,\alpha}\right)^{\otimes4}$$
$$\vdots$$
$$\mathcal{A}^y=\sum_{\alpha=1}^{r_{L-1}}a^{L,y}_\alpha\left(\phi^{L-1,\alpha}\right)^{\otimes4}\qquad(3)$$

$a^{l,\gamma}_\alpha$ and $a^{L,y}_\alpha$ here are scalars representing entry $\alpha$ in the vectors $\mathbf{a}^{l,\gamma}$ and $\mathbf{a}^{L,y}$ respectively, and the symbol $\otimes$ with a superscript stands for a repeated tensor product, e.g. $\left(\mathbf{a}^{0,\alpha}\right)^{\otimes4}=\mathbf{a}^{0,\alpha}\otimes\mathbf{a}^{0,\alpha}\otimes\mathbf{a}^{0,\alpha}\otimes\mathbf{a}^{0,\alpha}$. To verify that $\mathcal{A}^y$ under pooling windows of size four is indeed given by eq. 3, simply plug the rows of the decomposition into eq. 2, starting from bottom and continuing upwards. For context, eq. 3 describes what is known as a hierarchical tensor decomposition (see chapter 11 in Hackbusch (2012)), with underlying tree over modes being a full quad-tree (corresponding to the fact that the network’s pooling windows cover four entries each).
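The recursion of eq. 3 can be sketched directly. The snippet below builds a coefficient tensor for a toy deep network with $L=2$ hidden layers (hence $N=4^2=16$ patches), weight sharing, and arbitrary small channel numbers (all our own choices):

```python
import numpy as np

# Toy instance of eq. 3: weight sharing, size-4 pooling, L=2 hidden layers.
rng = np.random.default_rng(3)
M, r0, r1 = 2, 3, 3
a0 = rng.standard_normal((r0, M))           # vectors a^{0,alpha}
a1 = rng.standard_normal((r1, r0))          # entries a^{1,gamma}_alpha
aL = rng.standard_normal(r1)                # entries a^{L,y}_alpha for one output y

def tensor_power(T, n):                     # repeated tensor product T^{(x) n}
    out = T
    for _ in range(n - 1):
        out = np.tensordot(out, T, axes=0)
    return out

# phi^{1,gamma} = sum_alpha a^{1,gamma}_alpha * (a^{0,alpha})^{(x)4}   (order 4)
phi1 = [sum(a1[g, al] * tensor_power(a0[al], 4) for al in range(r0))
        for g in range(r1)]
# A^y = sum_alpha a^{L,y}_alpha * (phi^{1,alpha})^{(x)4}               (order 16)
Ay = sum(aL[al] * tensor_power(phi1[al], 4) for al in range(r1))
assert Ay.shape == (M,) * 16                # order N=16, dimension M in each mode
```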
Shallow network.
The second network we pay special attention to is shallow, comprising a single hidden layer with global pooling – see illustration in fig. 1(b). The linear weights of such a network are $\{\mathbf{a}^{0,\gamma}\in\mathbb{R}^{M}\}_{\gamma\in[r_0]}$ for hidden conv operator and $\{\mathbf{a}^{1,y}\in\mathbb{R}^{r_0}\}_{y\in[Y]}$ for dense output operator. They determine the coefficient tensor $\mathcal{A}^y$ (eq. 2) as follows:

$$\mathcal{A}^y=\sum_{z=1}^{r_0}a^{1,y}_z\left(\mathbf{a}^{0,z}\right)^{\otimes N}\qquad(4)$$

where $a^{1,y}_z$ stands for entry $z$ of $\mathbf{a}^{1,y}$, and again, the symbol $\otimes$ with a superscript represents a repeated tensor product. The tensor decomposition in eq. 4 is an instance of the classic CP decomposition, also known as rank-1 decomposition (see Kolda and Bader (2009) for a historic survey).
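A sketch of eq. 4, together with the rank observation exploited later for shallow networks (every matricization of a summand $(\mathbf{a}^{0,z})^{\otimes N}$ is an outer product of two vectors, hence rank 1); the sizes are illustrative:

```python
import numpy as np

# Toy instance of eq. 4: A^y = sum_z a^{1,y}_z * (a^{0,z})^{(x)N}.
rng = np.random.default_rng(4)
N, M, r0 = 4, 3, 2
a0 = rng.standard_normal((r0, M))           # vectors a^{0,z}
a1y = rng.standard_normal(r0)               # entries a^{1,y}_z

def tensor_power(v, n):                     # repeated tensor product v^{(x) n}
    out = v
    for _ in range(n - 1):
        out = np.tensordot(out, v, axes=0)
    return out

Ay = sum(a1y[z] * tensor_power(a0[z], N) for z in range(r0))
assert Ay.shape == (M,) * N
# Matricizing each summand w.r.t. any partition gives an outer product of two
# vectors, so every matricization of A^y has rank at most r0:
mat = Ay.reshape(M**2, M**2)                # matricization w.r.t. ({1,2},{3,4})
assert np.linalg.matrix_rank(mat) <= r0
```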
To conclude this section, we relate the background material above, as well as our contribution described in the upcoming sections, to the work of Cohen et al. (2016b). The latter shows that with arbitrary coefficient tensors $\mathcal{A}^y$, functions $h_y$ as in eq. 2 form a universal hypotheses space. It is then shown that convolutional arithmetic circuits as in fig. 1(a) realize such functions by applying tensor decompositions to $\mathcal{A}^y$, with the type of decomposition determined by the structure of a network (number of layers, number of channels in each layer etc.). The deep network (fig. 1(a) with size-4 pooling windows and $L=\log_4 N$ hidden layers) and the shallow network (fig. 1(b)) presented hereinabove are two special cases, whose corresponding tensor decompositions are given in eq. 3 and 4 respectively. The central result in Cohen et al. (2016b) relates to inductive bias through the notion of depth efficiency – it is shown that in the parameter space of a deep network, all weight settings but a set of (Lebesgue) measure zero give rise to functions that can only be realized (or approximated) by a shallow network if the latter has exponential size. This result does not relate to the characteristics of instances $X=(\mathbf{x}_1,\ldots,\mathbf{x}_N)$, it only treats the ability of shallow networks to replicate functions realized by deep networks.
In this paper we draw a line connecting the inductive bias to the nature of $X$, by studying the relation between a network’s architecture and its ability to model correlation among patches $\mathbf{x}_1,\ldots,\mathbf{x}_N$. Specifically, in sec. 4 we consider partitions $(I,J)$ of $[N]$ ($[N]=I\,\dot\cup\,J$, where $\dot\cup$ stands for disjoint union), and present the notion of separation rank as a measure of the correlation modeled between the patches indexed by $I$ and those indexed by $J$. In sec. 5.1 the separation rank of a network’s function w.r.t. $(I,J)$ is proven to be equal to the rank of $[\![\mathcal{A}^y]\!]_{I,J}$ – the matricization of the coefficient tensor $\mathcal{A}^y$ w.r.t. $(I,J)$. Sec. 5.2 derives lower and upper bounds on this rank for a deep network, showing that it supports exponential separation ranks with polynomial size for certain partitions, whereas for others it is required to be exponentially large. Subsequently, sec. 5.3 establishes an upper bound on this rank for shallow networks, implying that these must be exponentially large in order to model exponential separation rank under any partition, and thus cannot efficiently replicate a deep network’s correlations. Our analysis concludes in sec. 6, where we discuss the pooling geometry of a deep network as a means for controlling the inductive bias by determining a correspondence between partitions $(I,J)$ and spatial partitions of the input. Finally, we demonstrate experimentally in sec. 7 how different pooling geometries lead to superior performance in different tasks. Our experiments include not only convolutional arithmetic circuits, but also convolutional rectifier networks, i.e. convolutional networks with ReLU activation and max or average pooling.
4 Separation rank
In this section we define the concept of separation rank for functions realized by convolutional arithmetic circuits (sec. 3), i.e. real functions that take as input $X=(\mathbf{x}_1,\ldots,\mathbf{x}_N)\in(\mathbb{R}^s)^N$. The separation rank serves as a measure of the correlations such functions induce between different sets of input patches, i.e. different subsets of the variable set $\{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$.
Let $(I,J)$ be a partition of input indexes, i.e. $I$ and $J$ are disjoint subsets of $[N]$ whose union gives $[N]$. We may write $I=\{i_1,\ldots,i_{|I|}\}$ where $i_1<\cdots<i_{|I|}$, and similarly $J=\{j_1,\ldots,j_{|J|}\}$ where $j_1<\cdots<j_{|J|}$. For a function $h:(\mathbb{R}^s)^N\to\mathbb{R}$, the separation rank w.r.t. the partition $(I,J)$ is defined as follows (if $I=\emptyset$ or $J=\emptyset$ then by definition $sep(h;I,J)=1$, unless $h\equiv0$, in which case $sep(h;I,J)=0$):

$$sep(h;I,J):=\min\Big\{R\in\mathbb{N}\cup\{0\}~:~\exists\,g_1,\ldots,g_R:(\mathbb{R}^s)^{|I|}\to\mathbb{R},~g'_1,\ldots,g'_R:(\mathbb{R}^s)^{|J|}\to\mathbb{R}~\text{s.t.}~h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\sum\nolimits_{\nu=1}^{R}g_\nu(\mathbf{x}_{i_1},\ldots,\mathbf{x}_{i_{|I|}})\,g'_\nu(\mathbf{x}_{j_1},\ldots,\mathbf{x}_{j_{|J|}})\Big\}\qquad(5)$$

In words, it is the minimal number of summands that together give $h$, where each summand is separable w.r.t. $(I,J)$, i.e. is equal to a product of two functions – one that intakes only patches indexed by $I$, and another that intakes only patches indexed by $J$. One may wonder if it is at all possible to express $h$ through such summands, i.e. if the separation rank of $h$ is finite. From the theory of tensor products between $L^2$ spaces (see Hackbusch (2012) for a comprehensive coverage), we know that any $h\in L^2((\mathbb{R}^s)^N)$, i.e. any $h$ that is measurable and square-integrable, may be approximated arbitrarily well by summations of the form $\sum_{\nu=1}^{R}g_\nu(\mathbf{x}_{i_1},\ldots,\mathbf{x}_{i_{|I|}})\,g'_\nu(\mathbf{x}_{j_1},\ldots,\mathbf{x}_{j_{|J|}})$. Exact realization however is only guaranteed at the limit $R\to\infty$, thus in general the separation rank of $h$ need not be finite. Nonetheless, as we show in sec. 5, for the class of functions we are interested in, namely functions realizable by convolutional arithmetic circuits, separation ranks are always finite.
The concept of separation rank was introduced in Beylkin and Mohlenkamp (2002) for numerical treatment of high-dimensional functions, and has since been employed for various applications, e.g. quantum chemistry (Harrison et al. (2003)), particle engineering (Hackbusch (2006)) and machine learning (Beylkin et al. (2009)). If the separation rank of a function w.r.t. a partition of its input is equal to $1$, the function is separable, meaning it does not model any interaction between the sets of variables. Specifically, if $sep(h;I,J)=1$ then there exist $g:(\mathbb{R}^s)^{|I|}\to\mathbb{R}$ and $g':(\mathbb{R}^s)^{|J|}\to\mathbb{R}$ such that $h(\mathbf{x}_1,\ldots,\mathbf{x}_N)=g(\mathbf{x}_{i_1},\ldots,\mathbf{x}_{i_{|I|}})\,g'(\mathbf{x}_{j_1},\ldots,\mathbf{x}_{j_{|J|}})$, and the function cannot take into account consistency between the values of $\{\mathbf{x}_i\}_{i\in I}$ and those of $\{\mathbf{x}_j\}_{j\in J}$. In a statistical setting, if $h$ is a probability density function, this would mean that $\{\mathbf{x}_i\}_{i\in I}$ and $\{\mathbf{x}_j\}_{j\in J}$ are statistically independent. The higher $sep(h;I,J)$ is, the farther $h$ is from this situation, i.e. the more it models dependency between $\{\mathbf{x}_i\}_{i\in I}$ and $\{\mathbf{x}_j\}_{j\in J}$, or equivalently, the stronger the correlation it induces between the patches indexed by $I$ and those indexed by $J$.

The interpretation of separation rank as a measure of deviation from separability is formalized in app. B, where it is shown that $sep(h;I,J)$ is closely related to the distance of $h$ from the set of separable functions w.r.t. $(I,J)$. Specifically, we define $D(h;I,J)$ as the latter distance divided by the norm of $h$ (the normalization, i.e. division by the norm, is of critical importance – without it rescaling $h$ would accordingly rescale the distance, rendering the latter uninformative in terms of deviation from separability), and show that $sep(h;I,J)$ provides an upper bound on $D(h;I,J)$. While it is not possible to lay out a general lower bound on $D(h;I,J)$ in terms of $sep(h;I,J)$, we show that the specific lower bounds on $sep(h;I,J)$ underlying our analyses can be translated into lower bounds on $D(h;I,J)$. This implies that our results, facilitated by upper and lower bounds on separation ranks of convolutional arithmetic circuits, may equivalently be framed in terms of distances from separable functions.
5 Correlation analysis
In this section we analyze convolutional arithmetic circuits (sec. 3) in terms of the correlations they can model between sides of different input partitions, i.e. in terms of the separation ranks (sec. 4) they support under different partitions $(I,J)$ of $[N]$. We begin in sec. 5.1, establishing a correspondence between separation ranks and coefficient tensor matricization ranks. This correspondence is then used in sec. 5.2 and 5.3 to analyze the deep and shallow networks (respectively) presented in sec. 3. We note that we focus on these particular networks merely for simplicity of presentation – the analysis can easily be adapted to account for alternative networks with different depths and pooling schemes.
5.1 From separation rank to matricization rank
Let $h_y$ be a function realized by a convolutional arithmetic circuit, with corresponding coefficient tensor $\mathcal{A}^y$ (eq. 2). Denote by $(I,J)$ an arbitrary partition of $[N]$, i.e. $[N]=I\,\dot\cup\,J$. We are interested in studying $sep(h_y;I,J)$ – the separation rank of $h_y$ w.r.t. $(I,J)$ (eq. 5). As claim 1 below states, assuming representation functions $f_{\theta_1},\ldots,f_{\theta_M}$ are linearly independent (if they are not, we drop dependent functions and modify $\mathcal{A}^y$ accordingly; suppose for example that $f_{\theta_M}$ is dependent, i.e. there exist $\alpha_1,\ldots,\alpha_{M-1}\in\mathbb{R}$ such that $f_{\theta_M}=\sum_{d=1}^{M-1}\alpha_d f_{\theta_d}$ – we may then plug this into eq. 2, and obtain an expression for $h_y$ that has $f_{\theta_1},\ldots,f_{\theta_{M-1}}$ as representation functions, and a coefficient tensor with dimension $M-1$ in each mode; continuing in this fashion, one arrives at an expression for $h_y$ whose representation functions are linearly independent), this separation rank is equal to the rank of $[\![\mathcal{A}^y]\!]_{I,J}$ – the matricization of the coefficient tensor w.r.t. the partition $(I,J)$. Our problem thus translates to studying ranks of matricized coefficient tensors.
Claim 1.
Let $h_y$ be a function realized by a convolutional arithmetic circuit (fig. 1(a)), with corresponding coefficient tensor $\mathcal{A}^y$ (eq. 2). Assume that the network’s representation functions $f_{\theta_d}$ are linearly independent, and that they, as well as the functions $g_\nu,g'_\nu$ in the definition of separation rank (eq. 5), are measurable and square-integrable. (Square-integrability of representation functions may seem as a limitation at first glance, as for example neurons $f_{\theta_d}(\mathbf{x})=\sigma(\mathbf{w}_d^\top\mathbf{x}+b_d)$, with parameters $\theta_d=(\mathbf{w}_d,b_d)$ and sigmoid or ReLU activation $\sigma(\cdot)$, do not meet this condition. However, since in practice our inputs are bounded (e.g. they represent image pixels by holding intensity values), we may view functions as having compact support, which, as long as they are continuous (holds in all cases of interest), ensures square-integrability.) Then, for any partition $(I,J)$ of $[N]$, it holds that $sep(h_y;I,J)=\mathrm{rank}\,[\![\mathcal{A}^y]\!]_{I,J}$.
Proof.
See app. A.1. ∎
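In finite terms, claim 1 lets one read separation ranks directly off coefficient matrices. A minimal illustration for $N=2$, assuming some fixed pair of linearly independent representation functions $f_1,f_2$:

```python
import numpy as np

# With N=2 and linearly independent f_1, f_2, claim 1 says sep(h; {1},{2})
# equals the rank of the 2x2 coefficient matrix of h.
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])                  # h = f1(x1)f1(x2) + f2(x1)f2(x2)
assert np.linalg.matrix_rank(A) == 2        # two separable summands are needed
B = np.outer([1.0, 2.0], [3.0, 4.0])        # h = (f1+2f2)(x1) * (3f1+4f2)(x2)
assert np.linalg.matrix_rank(B) == 1        # separable: separation rank 1
```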
As the linear weights of a network vary, so do the coefficient tensors ($\mathcal{A}^y$, $y\in[Y]$) it gives rise to. Accordingly, for a particular partition $(I,J)$, a network does not correspond to a single value of $\mathrm{rank}\,[\![\mathcal{A}^y]\!]_{I,J}$, but rather supports a range of values. We analyze this range by quantifying its maximum, which reflects the strongest correlation that the network can model between the input patches indexed by $I$ and those indexed by $J$. One may wonder if the maximal value of $\mathrm{rank}\,[\![\mathcal{A}^y]\!]_{I,J}$ is the appropriate statistic to measure, as a-priori, it may be that the rank is maximal for very few of the network’s weight settings, and much lower for all the rest. Apparently, as claim 2 below states, this is not the case, and in fact $\mathrm{rank}\,[\![\mathcal{A}^y]\!]_{I,J}$ is maximal under almost all of the network’s weight settings.
Claim 2.
Consider the setting of claim 1, and let $(I,J)$ be a partition of $[N]$. Then, the rank of $[\![\mathcal{A}^y]\!]_{I,J}$ is maximal under all of the network’s weight settings but a set of (Lebesgue) measure zero.
Proof.
See app. A.2. ∎
5.2 Deep network
In this subsection we study correlations modeled by the deep network presented in sec. 3 (fig. 1(a) with size-4 pooling windows and $L=\log_4 N$ hidden layers). In accordance with sec. 5.1, we do so by characterizing the maximal ranks of coefficient tensor matricizations under different partitions.
Recall from eq. 3 the hierarchical decomposition expressing a coefficient tensor $\mathcal{A}^y$ realized by the deep network. We are interested in matricizations of this tensor under different partitions of $[N]$. Let $(I,J)$ be an arbitrary partition, i.e. $[N]=I\,\dot\cup\,J$. Matricizing the last level of eq. 3 w.r.t. $(I,J)$, while applying the relation in eq. 1, gives:

$$[\![\mathcal{A}^y]\!]_{I,J}=\sum_{\alpha=1}^{r_{L-1}}a^{L,y}_\alpha\,[\![\phi^{L-1,\alpha}]\!]_{I\cap[N/4],\,J\cap[N/4]}\odot[\![\left(\phi^{L-1,\alpha}\right)^{\otimes3}]\!]_{(I\setminus[N/4])-\frac{N}{4},\,(J\setminus[N/4])-\frac{N}{4}}$$

Applying eq. 1 again, this time to matricizations of the tensor $\left(\phi^{L-1,\alpha}\right)^{\otimes3}$, and continuing until all tensor-product factors are separated, we obtain a Kronecker product over the four quadrants of $[N]$. For every $k\in[4]$ define $I^{L-1,k}:=\{i-(k-1)\frac{N}{4}\,:\,i\in I\cap\{(k-1)\frac{N}{4}+1,\ldots,k\frac{N}{4}\}\}$ and $J^{L-1,k}:=\{j-(k-1)\frac{N}{4}\,:\,j\in J\cap\{(k-1)\frac{N}{4}+1,\ldots,k\frac{N}{4}\}\}$. In words, $(I^{L-1,k},J^{L-1,k})$ represents the partition induced by $(I,J)$ on the $k$’th quadrant of $[N]$, i.e. on the $k$’th size-$N/4$ group of input patches. We now have the following matricized version of the last level in eq. 3:

$$[\![\mathcal{A}^y]\!]_{I,J}=\sum_{\alpha=1}^{r_{L-1}}a^{L,y}_\alpha\bigodot_{k=1}^{4}[\![\phi^{L-1,\alpha}]\!]_{I^{L-1,k},\,J^{L-1,k}}$$

where the symbol $\odot$ with a running index stands for an iterative Kronecker product. To derive analogous matricized versions for the upper levels of eq. 3, we define for $l\in\{0,\ldots,L-1\}$ and $k\in[N/4^l]$:

$$I^{l,k}:=\{i-(k-1)4^l\,:\,i\in I\cap\{(k-1)4^l+1,\ldots,k\cdot4^l\}\}$$
$$J^{l,k}:=\{j-(k-1)4^l\,:\,j\in J\cap\{(k-1)4^l+1,\ldots,k\cdot4^l\}\}\qquad(6)$$

That is to say, $(I^{l,k},J^{l,k})$ represents the partition induced by $(I,J)$ on the set of indexes $\{(k-1)4^l+1,\ldots,k\cdot4^l\}$, i.e. on the $k$’th size-$4^l$ group of input patches. With this notation in hand, traversing upwards through the levels of eq. 3, with repeated application of the relation in eq. 1, one arrives at the following matrix decomposition for $[\![\mathcal{A}^y]\!]_{I,J}$:

$$[\![\phi^{1,\gamma}]\!]_{I^{1,k},J^{1,k}}=\sum_{\alpha=1}^{r_0}a^{1,\gamma}_\alpha\bigodot_{k'=1}^{4}[\![\mathbf{a}^{0,\alpha}]\!]_{I^{0,4(k-1)+k'},\,J^{0,4(k-1)+k'}}$$
$$\vdots$$
$$[\![\phi^{l,\gamma}]\!]_{I^{l,k},J^{l,k}}=\sum_{\alpha=1}^{r_{l-1}}a^{l,\gamma}_\alpha\bigodot_{k'=1}^{4}[\![\phi^{l-1,\alpha}]\!]_{I^{l-1,4(k-1)+k'},\,J^{l-1,4(k-1)+k'}}$$
$$\vdots$$
$$[\![\mathcal{A}^y]\!]_{I,J}=\sum_{\alpha=1}^{r_{L-1}}a^{L,y}_\alpha\bigodot_{k=1}^{4}[\![\phi^{L-1,\alpha}]\!]_{I^{L-1,k},\,J^{L-1,k}}\qquad(7)$$

Eq. 7 expresses $[\![\mathcal{A}^y]\!]_{I,J}$ – the matricization w.r.t. the partition $(I,J)$ of a coefficient tensor realized by the deep network, in terms of the network’s conv weights $\{\mathbf{a}^{l,\gamma}\}$ and output weights $\mathbf{a}^{L,y}$. As discussed above, our interest lies in the maximal rank that this matricization can take. Theorem 1 below provides lower and upper bounds on this maximal rank, by making use of eq. 7, and of the rank-multiplicative property of the Kronecker product ($\mathrm{rank}(A\odot B)=\mathrm{rank}(A)\cdot\mathrm{rank}(B)$).
Eq. 7 expresses – the matricization w.r.t. the partition of a coefficient tensor realized by the deep network, in terms of the network’s conv weights and output weights . As discussed above, our interest lies in the maximal rank that this matricization can take. Theorem 1 below provides lower and upper bounds on this maximal rank, by making use of eq. 7, and of the rankmultiplicative property of the Kronecker product ().
Theorem 1.
Let $(I,J)$ be a partition of $[N]$, and $[\![\mathcal{A}^y]\!]_{I,J}$ be the matricization w.r.t. $(I,J)$ of a coefficient tensor $\mathcal{A}^y$ (eq. 2) realized by the deep network (fig. 1(a) with size-4 pooling windows). For every $l\in\{0,\ldots,L-1\}$ and $k\in[N/4^l]$, define $I^{l,k}$ and $J^{l,k}$ as in eq. 6. Then, the maximal rank that $[\![\mathcal{A}^y]\!]_{I,J}$ can take (when network weights vary) is:

No smaller than $\min\{r_0,M\}^{S}$, where $S:=\left|\{k\in[N/4]:I^{1,k}\neq\emptyset\wedge J^{1,k}\neq\emptyset\}\right|$.

No greater than $c^{L,1}$, where $c^{0,k}:=1$ for $k\in[N]$, and $c^{l,k}:=\min\left\{M^{\min\{|I^{l,k}|,|J^{l,k}|\}},\,r_{l-1}\cdot\prod_{k'=1}^{4}c^{l-1,4(k-1)+k'}\right\}$ for $l\in[L]$, $k\in[N/4^l]$ (where we let $I^{L,1}:=I$ and $J^{L,1}:=J$).
Proof.
See app. A.3. ∎
The lower bound in theorem 1 is exponential in $S$, the latter defined to be the number of size-$4$ patch groups that are split by the partition $(I,J)$, i.e. whose indexes are divided between $I$ and $J$. Partitions that split many of the size-$4$ patch groups will thus lead to a large lower bound. For example, consider the partition $(I^{odd},J^{even})$ defined as follows:

$$I^{odd}=\{1,3,\ldots,N-1\}\quad,\quad J^{even}=\{2,4,\ldots,N\}\qquad(8)$$

This partition splits all size-$4$ patch groups ($S=N/4$), leading to a lower bound that is exponential in the number of patches ($\min\{r_0,M\}^{N/4}$).
The upper bound in theorem 1 is expressed via constants $c^{l,k}$, defined recursively over levels $l=1,\ldots,L$, with $k$ ranging over $[N/4^l]$ for each level $l$. What prevents $c^{l,k}$ from growing double-exponentially fast (w.r.t. $l$) is the minimization with $M^{\min\{|I^{l,k}|,|J^{l,k}|\}}$. Specifically, if $\min\{|I^{l,k}|,|J^{l,k}|\}$ is small, i.e. if the partition induced by $(I,J)$ on the $k$’th size-$4^l$ group of patches is unbalanced (most of the patches belong to one side of the partition, and only a few belong to the other), $c^{l,k}$ will be of reasonable size. The higher this takes place in the hierarchy (i.e. the larger $l$ is), the lower our eventual upper bound will be. In other words, if partitions induced by $(I,J)$ on size-$4^l$ patch groups are unbalanced for large values of $l$, the upper bound in theorem 1 will be small. For example, consider the partition $(I^{low},J^{high})$ defined by:

$$I^{low}=\{1,\ldots,N/2\}\quad,\quad J^{high}=\{N/2+1,\ldots,N\}\qquad(9)$$

Under $(I^{low},J^{high})$, all partitions induced on size-$N/4$ patch groups (quadrants of $[N]$) are completely one-sided ($\min\{|I^{L-1,k}|,|J^{L-1,k}|\}=0$ for all $k\in[4]$), resulting in the upper bound being no greater than $r_{L-1}$ – linear in network size.
To summarize this discussion, theorem 1 states that with the deep network, the maximal rank of a coefficient tensor matricization w.r.t. $(I,J)$ highly depends on the nature of the partition $(I,J)$ – it will be exponentially high for partitions such as $(I^{odd},J^{even})$ (eq. 8) that split many size-$4$ patch groups, while being only polynomial (or linear) for partitions like $(I^{low},J^{high})$ (eq. 9), under which size-$4^l$ patch groups are unevenly divided for large values of $l$. Since the rank of a coefficient tensor matricization w.r.t. $(I,J)$ corresponds to the strength of correlation modeled between input patches indexed by $I$ and those indexed by $J$ (sec. 5.1), we conclude that the ability of a polynomially sized deep network to model correlation between sets of input patches highly depends on the nature of these sets.
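This dichotomy can be observed numerically on a toy analogue of eq. 3 with size-2 (rather than size-4) pooling and $N=4$ patches, comparing the matricization rank under a partition that splits both pooling groups against one that splits none (the 2-level hierarchy and all sizes are our own illustrative choices):

```python
import numpy as np

# Toy analogue of eq. 3: size-2 pooling, two levels, N=4 patches, M=2 features.
# A = sum_z c_z * (B_z (x) B_z), where each B_z = sum_t d_t * (u_t (x) u_t) is
# itself a sum of r tensor products -- mimicking the hierarchy with weight sharing.
rng = np.random.default_rng(0)
M, r = 2, 2
A = np.zeros((M, M, M, M))
for z in range(r):
    B = np.zeros((M, M))
    for t in range(r):
        u = rng.standard_normal(M)
        B += rng.standard_normal() * np.outer(u, u)
    A += rng.standard_normal() * np.einsum('ab,cd->abcd', B, B)

def matricize(T, I):                        # modes in I -> rows, the rest -> columns
    J = [m for m in range(T.ndim) if m not in I]
    return np.transpose(T, I + J).reshape(int(np.prod([T.shape[i] for i in I])), -1)

# Pooling groups are {1,2} and {3,4}. The partition {1,2}|{3,4} splits no group;
# the interleaved partition {1,3}|{2,4} splits both (0-based mode lists below).
rank_no_split = np.linalg.matrix_rank(matricize(A, [0, 1]))
rank_split = np.linalg.matrix_rank(matricize(A, [0, 2]))
assert rank_no_split <= r                   # capped by the top-level width
assert rank_split >= rank_no_split          # splitting partitions support higher rank
```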
5.3 Shallow network
We now turn to study correlations modeled by the shallow network presented in sec. 3 (fig. 1(b)). In line with sec. 5.1, this is achieved by characterizing the maximal ranks of coefficient tensor matricizations under different partitions.
Recall from eq. 4 the CP decomposition expressing a coefficient tensor $\mathcal{A}^y$ realized by the shallow network. For an arbitrary partition $(I,J)$ of $[N]$, i.e. $[N]=I\,\dot\cup\,J$, matricizing this decomposition with repeated application of the relation in eq. 1, gives the following expression for $[\![\mathcal{A}^y]\!]_{I,J}$ – the matricization w.r.t. $(I,J)$ of a coefficient tensor realized by the shallow network:

$$[\![\mathcal{A}^y]\!]_{I,J}=\sum_{z=1}^{r_0}a^{1,y}_z\left(\mathbf{a}^{0,z}\right)^{\odot|I|}\left(\left(\mathbf{a}^{0,z}\right)^{\odot|J|}\right)^\top\qquad(10)$$

$\left(\mathbf{a}^{0,z}\right)^{\odot|I|}$ and $\left(\mathbf{a}^{0,z}\right)^{\odot|J|}$ here are column vectors of dimensions $M^{|I|}$ and $M^{|J|}$ respectively, standing for the Kronecker products of $\mathbf{a}^{0,z}$ with itself $|I|$ and $|J|$ times (respectively). Eq. 10 immediately leads to two observations regarding the ranks that may be taken by $[\![\mathcal{A}^y]\!]_{I,J}$. First, they depend on the partition $(I,J)$ only through its division size, i.e. through $|I|$ and $|J|$. Second, they are no greater than $\min\{r_0,M^{\min\{|I|,|J|\}}\}$, meaning that the maximal rank is linear (or less) in network size. In light of sec. 5.1 and 5.2, these findings imply that in contrast to the deep network, which with polynomial size supports exponential separation ranks under favored partitions, the shallow network treats all partitions (of a given division size) equally, and can only give rise to an exponential separation rank if its size is exponential.
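Both observations follow from eq. 10 being a sum of $r_0$ rank-1 matrices built from Kronecker powers; a quick numerical check with illustrative sizes:

```python
import numpy as np

# Each summand of eq. 10 is an outer product of Kronecker-power vectors, hence
# rank 1; a sum of r0 such terms has rank at most r0, for ANY partition sizes
# |I|, |J|. Sizes here are arbitrary.
rng = np.random.default_rng(5)
M, r0, size_I, size_J = 2, 3, 3, 2          # N = |I| + |J| = 5

def kron_power(v, n):                       # v (.) v (.) ... (.) v, n times
    out = v
    for _ in range(n - 1):
        out = np.kron(out, v)
    return out

mat = np.zeros((M**size_I, M**size_J))
for z in range(r0):
    a0z = rng.standard_normal(M)            # hidden conv weight vector a^{0,z}
    a1z = rng.standard_normal()             # output weight entry a^{1,y}_z
    mat += a1z * np.outer(kron_power(a0z, size_I), kron_power(a0z, size_J))
assert np.linalg.matrix_rank(mat) <= r0     # rank linear (or less) in network size
```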
Suppose now that we would like to use the shallow network to replicate a function realized by a polynomially sized deep network. So long as the deep network’s function admits an exponential separation rank under at least one of the favored partitions (e.g. $(I^{odd},J^{even})$ – eq. 8), the shallow network would have to be exponentially large in order to replicate it, i.e. depth efficiency takes place. [footnote 7: Convolutional arithmetic circuits as we have defined them (sec. 3) are not universal. In particular, it may very well be that a function realized by a polynomially sized deep network cannot be replicated by the shallow network, no matter how large (wide) we allow it to be. In such scenarios depth efficiency does not provide insight into the complexity of functions brought forth by depth. To obtain a shallow network that is universal, and thus an appropriate gauge for depth efficiency, we may remove the constraint of weight sharing, i.e. allow the filters in the hidden conv operator to hold different weights at different spatial locations (see Cohen et al. (2016b) for proof that this indeed leads to universality). All results we have established for the original shallow network remain valid when weight sharing is removed. In particular, the separation ranks of the network are still linear in its size. This implies that, as suggested, depth efficiency indeed holds.] Since all but a negligible set of the functions realizable by the deep network give rise to maximal separation ranks (sec. 5.1), we obtain the complete depth efficiency result of Cohen et al. (2016b). However, unlike Cohen et al. (2016b), which did not provide any explanation for the usefulness of functions brought forth by depth, we obtain an insight into their utility – they are able to efficiently model strong correlation under favored partitions of the input.
6 Inductive bias through pooling geometry
The deep network presented in sec. 3, whose correlations we analyzed in sec. 5.2, was defined as having size-4 pooling windows, i.e. pooling windows covering four entries each. We have yet to specify the shapes of these windows, or equivalently, the spatial (two-dimensional) locations of nodes grouped together in the process of pooling. In compliance with standard convolutional network design, we now assume that the network’s (size-4) pooling windows are contiguous square blocks, i.e. have shape 2×2. Under this configuration, the network’s functional description (eq. 2 with the pooling operation given by eq. 3) induces a spatial ordering of input patches [footnote 8: The network’s functional description assumes a one-dimensional full quad-tree grouping of input patch indexes. That is to say, it assumes that in the first pooling operation (hidden layer 0), the nodes corresponding to patches $\mathbf{x}_1,\ldots,\mathbf{x}_4$ are pooled into one group, those corresponding to $\mathbf{x}_5,\ldots,\mathbf{x}_8$ are pooled into another, and so forth. Similar assumptions hold for the deeper layers. For example, in the second pooling operation (hidden layer 1), the node with receptive field $\{1,\ldots,4\}$, i.e. the one corresponding to the quadruple of patches $\mathbf{x}_1,\ldots,\mathbf{x}_4$, is assumed to be pooled together with the nodes whose receptive fields are $\{5,\ldots,8\}$, $\{9,\ldots,12\}$ and $\{13,\ldots,16\}$.], which may be described by the following recursive process:

Set the index of the top-left patch to 1.

For $l = 1,\ldots,\log_4 N$: Replicate the already-assigned top-left $2^{l-1} \times 2^{l-1}$ block of indexes, and place the copies on its right, bottom-right and bottom. Then, add a $4^{l-1}$ offset to all indexes in the right copy, a $2 \cdot 4^{l-1}$ offset to all indexes in the bottom-right copy, and a $3 \cdot 4^{l-1}$ offset to all indexes in the bottom copy.
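The recursive process above can be sketched in code. The offsets below follow the description of the right, bottom-right and bottom copies (a minimal reconstruction; the exact placement within each 2×2 block is illustrative):

```python
import numpy as np

def patch_order(L):
    """Spatial ordering of patch indexes induced by contiguous 2x2 pooling:
    start from index 1, then at each level replicate the top-left block
    to the right, bottom-right and bottom, with growing offsets."""
    grid = np.array([[1]])
    for l in range(1, L + 1):
        step = 4 ** (l - 1)
        top = np.hstack([grid, grid + step])                    # right copy
        bottom = np.hstack([grid + 3 * step, grid + 2 * step])  # bottom, bottom-right
        grid = np.vstack([top, bottom])
    return grid

print(patch_order(2))
```

With this ordering, every contiguous 2×2 block of the resulting grid holds a consecutive quadruple of indexes ({1,…,4}, {5,…,8}, and so on), matching the quad-tree grouping assumed by the network's functional description.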
With this spatial ordering (illustrated in fig. 1(c)), partitions $(I,J)$ of $[N]$ convey a spatial pattern. For example, the partition $(I^{odd},J^{even})$ (eq. 8) corresponds to the pattern illustrated on the left of fig. 1(c), whereas $(I^{low},J^{high})$ (eq. 9) corresponds to the pattern illustrated on the right. Our analysis (sec. 5.2) shows that the deep network is able to model strong correlation under $(I^{odd},J^{even})$, while being inefficient for modeling correlation under $(I^{low},J^{high})$. More generally, partitions for which the rank bound defined in theorem 1 is exponentially high convey patterns that split many patch blocks, i.e. are highly entangled. These partitions enjoy the possibility of strong correlation. On the other hand, partitions under which large patch blocks are divided unevenly (see eq. 6 for the relevant definitions) convey patterns that separate the input into distinct contiguous regions. These partitions, as we have seen, are limited to weak correlations.
We conclude that with contiguous 2×2 pooling, the deep network is able to model strong correlation between input regions that are highly entangled, at the expense of being inefficient for modeling correlation between input regions that are far apart. Had we selected a different pooling regime, the preference of input partition patterns in terms of modeled correlation would change. For example, if pooling windows were set to group nodes with their spatial reflections (horizontal, vertical and horizontal-vertical), coarse patterns that divide the input symmetrically, such as the one illustrated on the right of fig. 1(c), would enjoy the possibility of strong correlation, whereas many entangled patterns would now be limited to weak correlation. The choice of pooling shapes thus serves as a means for controlling the inductive bias in terms of correlations modeled between input regions. Square contiguous windows, as commonly employed in practice, lead to a preference that complies with our intuition regarding the statistics of natural images (nearby pixels are more correlated than distant ones). Other pooling schemes lead to different preferences, and this allows tailoring a network to data that departs from the usual domain of natural imagery. We demonstrate this experimentally in the next section, where it is shown how different pooling geometries lead to superior performance in different tasks.
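The two geometries discussed above can be contrasted with a short sketch (node indexing here is hypothetical – nodes are simply numbered row by row):

```python
import numpy as np

def square_groups(n):
    """Contiguous 2x2 pooling: each window covers a square block of neighbors."""
    idx = np.arange(n * n).reshape(n, n)
    return [idx[i:i + 2, j:j + 2].ravel().tolist()
            for i in range(0, n, 2) for j in range(0, n, 2)]

def mirror_groups(n):
    """Mirror pooling: each node is grouped with its horizontal, vertical
    and horizontal-vertical reflections."""
    idx = np.arange(n * n).reshape(n, n)
    return [[int(idx[i, j]), int(idx[i, n - 1 - j]),
             int(idx[n - 1 - i, j]), int(idx[n - 1 - i, n - 1 - j])]
            for i in range(n // 2) for j in range(n // 2)]

print(square_groups(4)[0])  # nearby nodes pooled together
print(mirror_groups(4)[0])  # symmetric nodes pooled together
```

Square windows bind neighboring nodes, supporting correlations between entangled regions; mirror windows bind reflections, supporting correlations between symmetric regions.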
7 Experiments
The main conclusion from our analyses (sec. 5 and 6) is that the pooling geometry of a deep convolutional network controls its inductive bias by determining which correlations between input regions can be modeled efficiently. We have also seen that shallow networks cannot model correlations efficiently, regardless of the considered input regions. In this section we validate these assertions empirically, not only with convolutional arithmetic circuits (the subject of our analyses), but also with convolutional rectifier networks – convolutional networks with ReLU activation and max or average pooling. For conciseness, we defer to app. C some details regarding our implementation. The latter is fully available online at https://github.com/HUJI-Deep/inductive-pooling.
Our experiments are based on a synthetic classification benchmark inspired by medical imaging tasks. Instances to be classified are 32-by-32 binary images, each displaying a random distorted oval shape (blob) with missing pixels in its interior (holes). For each image, two continuous scores in the range [0, 1] are computed. The first, referred to as closedness, reflects how morphologically closed a blob is, and is defined to be the ratio between the number of pixels in the blob and the number of pixels in its closure (see app. D for the exact definition of the latter). The second score, named symmetry, reflects the degree to which a blob is left-right symmetric about its center. It is measured by cropping the bounding box around a blob, applying a left-right flip to the latter, and computing the ratio between the number of pixels in the intersection of the blob and its reflection, and the number of pixels in the blob. To generate labeled sets for classification (train and test), we render multiple images, sort them according to their closedness and symmetry, and for each of the two scores, assign the label “high” to the top 40% and the label “low” to the bottom 40% (the middle 20% are considered ill-defined). This creates two binary (two-class) classification tasks – one for closedness and one for symmetry (see fig. 2 for a sample of images participating in both tasks). Given that closedness is a property of a local nature, we expect its classification task to require a predictor to be able to model strong correlations between neighboring pixels. Symmetry, on the other hand, is a property that relates pixels to their reflections, thus we expect its classification task to demand that a predictor be able to model correlations across distances.

We evaluated the deep convolutional arithmetic circuit considered throughout the paper (fig. 1(a) with size-4 pooling windows) under two different pooling geometries. The first, referred to as square, comprises the standard 2×2 contiguous pooling windows. The second, dubbed mirror, pools together nodes with their horizontal, vertical and horizontal-vertical reflections.
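The closedness and symmetry scores described above can be sketched as follows (a minimal numpy implementation; the 3×3 structuring element used for the morphological closure is an assumption – app. D holds the exact definition):

```python
import numpy as np

def _dilate(b):
    """3x3 binary dilation via shifted ORs."""
    p = np.pad(b, 1)
    h, w = b.shape
    return np.any([p[i:i + h, j:j + w] for i in range(3) for j in range(3)], axis=0)

def _erode(b):
    """3x3 binary erosion via shifted ANDs."""
    p = np.pad(b, 1)
    h, w = b.shape
    return np.all([p[i:i + h, j:j + w] for i in range(3) for j in range(3)], axis=0)

def closedness(blob):
    """Ratio between pixels in the blob and pixels in its morphological closure."""
    closure = _erode(_dilate(blob))
    return blob.sum() / max(closure.sum(), 1)

def symmetry(blob):
    """Crop the bounding box, flip left-right, and take the ratio between the
    intersection of the blob with its reflection and the blob itself."""
    ys, xs = np.nonzero(blob)
    box = blob[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return np.logical_and(box, box[:, ::-1]).sum() / box.sum()
```

A blob with small holes has closedness below 1 (the closure fills them), while a blob identical to its left-right reflection has symmetry exactly 1.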
In both cases, input patches were set as individual pixels, resulting in N = 1024 patches and L = 5 hidden layers. M = 2 representation functions were fixed, the first realizing the identity on binary inputs ($f(b) = b$ for $b \in \{0,1\}$), and the second realizing negation ($f(b) = 1 - b$ for $b \in \{0,1\}$). Classification was realized through two network outputs, with prediction following the stronger activation. The number of channels across all hidden layers was uniform, and was varied across experiments. Fig. 3 shows the results of applying the deep network with both square and mirror pooling to both the closedness and symmetry tasks, where each task has its own train and test sets. As can be seen in the figure, square pooling significantly outperforms mirror pooling in closedness classification, whereas the opposite occurs in symmetry classification. This complies with our discussion in sec. 6, according to which square pooling supports modeling correlations between entangled (neighboring) regions of the input, whereas mirror pooling puts focus on correlations between input regions that are symmetric w.r.t. one another. We thus obtain a demonstration of how prior knowledge regarding a task at hand may be used to tailor the inductive bias of a deep convolutional network by designing an appropriate pooling geometry.
In addition to the deep network, we also evaluated the shallow convolutional arithmetic circuit analyzed in the paper (fig. 1(b)). The architectural choices for this network were the same as those described above for the deep network, besides the number of hidden channels, which in this case applied to the network’s single hidden layer and was varied across experiments. Even with the largest number of hidden channels, the highest train and test accuracies delivered by this network fell well short of those of the deep network on both the closedness and symmetry tasks. This inferiority holds even when the deep network’s pooling geometry is not optimal for the task at hand, and complies with our analysis in sec. 5. Namely, it complies with the observation that separation ranks (correlations) are sometimes exponential and sometimes polynomial with the deep network, whereas with the shallow one they are never more than linear in network size.
Finally, to assess the validity of our findings for convolutional networks in general, not just convolutional arithmetic circuits, we repeated the above experiments with convolutional rectifier networks. Namely, we placed ReLU activations after every conv operator, switched the pooling operation from product to average, and re-evaluated the deep (square and mirror pooling geometries) and shallow networks. We then reiterated this process once more, with the pooling operation set to max instead of average. The results obtained by the deep networks are presented in fig. 4. The shallow network with average pooling reached train/test accuracies well below those of the deep networks on both the closedness and symmetry tasks, and with max pooling its performance did not exceed chance. Altogether, convolutional rectifier networks exhibit the same phenomena observed with convolutional arithmetic circuits, indicating that the conclusions from our analyses likely apply to such networks as well. Formal adaptation of the analyses to convolutional rectifier networks, similar to the adaptation of Cohen et al. (2016b) carried out in Cohen and Shashua (2016), is left for future work.

8 Discussion
Through the notion of separation rank, we studied the relation between the architecture of a convolutional network, and its ability to model correlations among input regions. For a given input partition, the separation rank quantifies how far a function is from separability, which in a probabilistic setting, corresponds to statistical independence between sides of the partition.
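Concretely, following the definition in sec. 5.1, the separation rank of a function $f$ w.r.t. a partition $(I, J)$ of $[N]$ is the minimal number of separable (product-form) terms needed to express it (notation here follows standard usage):

```latex
sep(f; I, J) := \min\Big\{ R \in \mathbb{N} \cup \{0\} \;:\;
  f(\mathbf{x}_1, \ldots, \mathbf{x}_N)
  = \sum_{r=1}^{R} g_r\big((\mathbf{x}_i)_{i \in I}\big) \, h_r\big((\mathbf{x}_j)_{j \in J}\big) \Big\}
```

$sep(f; I, J) = 1$ thus corresponds to separability, and in a probabilistic setting, to statistical independence between the two sides of the partition.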
Our analysis shows that a polynomially sized deep convolutional arithmetic circuit supports exponentially high separation ranks for certain input partitions, while being limited to polynomial or linear (in network size) separation ranks for others. The network’s pooling window shapes effectively determine which input partitions are favored in terms of separation rank, i.e. which partitions enjoy the possibility of exponentially high separation ranks with polynomial network size, and which require the network to be exponentially large. Pooling geometry thus serves as a means for controlling the inductive bias. The particular pooling scheme commonly employed in practice – square contiguous windows – favors interleaved partitions over ones that divide the input into distinct areas, thus orienting the inductive bias towards the statistics of natural images (nearby pixels are more correlated than distant ones). Other pooling schemes lead to different preferences, and this allows tailoring the network to data that departs from the usual domain of natural imagery.
As opposed to deep convolutional arithmetic circuits, shallow ones support only linear (in network size) separation ranks. Therefore, in order to replicate a function realized by a deep network (exponential separation rank), a shallow network must be exponentially large. By this we derive the depth efficiency result of Cohen et al. (2016b), but in addition, provide an insight into the benefit of functions brought forth by depth – they are able to efficiently model strong correlation under favored partitions of the input.
We validated our conclusions empirically, with convolutional arithmetic circuits as well as convolutional rectifier networks – convolutional networks with ReLU activation and max or average pooling. Our experiments demonstrate how different pooling geometries lead to superior performance in different tasks. Specifically, we evaluate deep networks in the measurement of shape continuity, a task of a local nature, and show that standard square pooling windows outperform ones that join together nodes with their spatial reflections. In contrast, when measuring shape symmetry, modeling correlations across distances is of vital importance, and the latter pooling geometry is superior to the conventional one. Shallow networks are inefficient at modeling correlations of any kind, and indeed lead to poor performance on both tasks.
Finally, our analyses and results bring forth the possibility of expanding the coverage of correlations efficiently modeled by a deep convolutional network. Specifically, by blending together multiple pooling geometries in the hidden layers of a network, it is possible to facilitate simultaneous support for a wide variety of correlations suiting data of different types. Investigation of this direction, from both theoretical and empirical perspectives, is viewed as a promising avenue for future research.
Acknowledgments
This work is supported by Intel grant ICRI-CI #9-2012-6133, by ISF Center grant 1790/12 and by the European Research Council (TheoryDL project). Nadav Cohen is supported by a Google Doctoral Fellowship in Machine Learning.
References
 Bellman [1970] Richard Bellman. Introduction to matrix analysis, volume 960. SIAM, 1970.
 Beylkin and Mohlenkamp [2002] Gregory Beylkin and Martin J Mohlenkamp. Numerical operator calculus in higher dimensions. Proceedings of the National Academy of Sciences, 99(16):10246–10251, 2002.
 Beylkin et al. [2009] Gregory Beylkin, Jochen Garcke, and Martin J Mohlenkamp. Multivariate regression and machine learning with sums of separable functions. SIAM Journal on Scientific Computing, 31(3):1840–1857, 2009.
 Caron and Traynor [2005] Richard Caron and Tim Traynor. The zero set of a polynomial. WSMR Report 0502, 2005.
 Cohen and Shashua [2014] Nadav Cohen and Amnon Shashua. Simnets: A generalization of convolutional networks. Advances in Neural Information Processing Systems (NIPS), Deep Learning Workshop, 2014.
 Cohen and Shashua [2016] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. International Conference on Machine Learning (ICML), 2016.

 Cohen et al. [2016a] Nadav Cohen, Or Sharir, and Amnon Shashua. Deep simnets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.
 Cohen et al. [2016b] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. Conference On Learning Theory (COLT), 2016b.
 Cover and Thomas [2012] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Delalleau and Bengio [2011] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.
 Eckart and Young [1936] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
 Eldan and Shamir [2015] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.
 Golub and Van Loan [2013] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, 2013. ISBN 9781421407944. URL https://books.google.co.il/books?id=X5YfsuCWpxMC.
 Hackbusch [2006] Wolfgang Hackbusch. On the efficient evaluation of coalescence integrals in population balance models. Computing, 78(2):145–159, 2006.
 Hackbusch [2012] Wolfgang Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer Science & Business Media, Berlin, Heidelberg, February 2012.
 Haralick et al. [1987] Robert M Haralick, Stanley R Sternberg, and Xinhua Zhuang. Image analysis using mathematical morphology. IEEE transactions on pattern analysis and machine intelligence, (4):532–550, 1987.
 Harrison et al. [2003] Robert J Harrison, George I Fann, Takeshi Yanai, and Gregory Beylkin. Multiresolution quantum chemistry in multiwavelet bases. In Computational ScienceICCS 2003, pages 103–110. Springer, 2003.
 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 Jia et al. [2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 Jones [2001] Frank Jones. Lebesgue integration on Euclidean space. Jones & Bartlett Learning, 2001.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kolda and Bader [2009] Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, 2009.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, pages 1106–1114, 2012.
 LeCun and Bengio [1995] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.
 LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
 Mhaskar et al. [2016] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning real and boolean functions: When is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016.
 Montufar et al. [2014] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.
 Nair and Hinton [2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.