# Width is Less Important than Depth in ReLU Neural Networks

We solve an open question from Lu et al. (2017), by showing that any target network with inputs in ℝ^d can be approximated by a width O(d) network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most d+2, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.


## 1 Introduction

The expressive power of neural networks has been widely studied in many previous works. A particular focus was given to the role of the network’s depth and width: How wide or how deep do we need to make the network, in order to express various target functions of interest? In an asymptotic sense, we know that making either the width or the depth large enough is sufficient to approximate any target function of interest. Specifically, classical universal approximation results (e.g. Cybenko (1989); Leshno et al. (1993); Hornik et al. (1989)) imply that even with depth , a wide enough neural network can approximate essentially any target function on a bounded domain in . More recently, it was shown that the same also applies to depth: Neural networks with width and sufficient depth can also approximate essentially any target function.

However, these results are asymptotic in nature, and do not provide a quantitative answer as to whether depth or width plays a more significant role in the expressive power of neural networks. A recent line of work has shown that for certain target functions, depth plays a more significant role than width, in the sense that slightly decreasing the depth requires a huge increase in the width to maintain approximation accuracy. For example, Eldan and Shamir (2016); Safran and Shamir (2017); Daniely (2017) constructed functions on that can be expressed by depth- neural networks with parameters, while depth- neural networks require a number of parameters at least exponential in to approximate them well. In Telgarsky (2016); Chatziafratis et al. (2019) a family of functions represented by depth and width neural networks is constructed such that approximating them up to arbitrarily small accuracy with depth would require width exponential in .

A natural question that arises is whether we can provide similar results in terms of width, namely:

Are there functions that can be expressed by wide and shallow neural networks, that cannot be approximated by any narrow neural network, unless its depth is very large?

This question was stated as an open problem in Lu et al. (2017). We note that both a positive and a negative answer to this question have interesting consequences. If the answer is positive, then width and depth, in principle, play an incomparable role in the expressive power of neural networks, as sometimes depth can be more significant, and sometimes width. On the other hand, if the answer is negative, then depth generally plays a more significant role than width for the expressive power of neural networks.

In this work we solve this open problem for ReLU neural networks, by providing a negative answer to the above question. In more detail, we prove the following theorem:

###### Theorem 1.1 (Informal).

Let be a ReLU neural network with width , depth , and let be some input distribution with an upper bounded density function over a bounded domain in . Then, for every there exists a neural network with width , and parameters such that with probability at least over we have:

$$\left|N_0(\mathbf{x}) - N(\mathbf{x})\right| \le \epsilon$$

where the notation hides logarithmic terms in the problem’s parameters (see Thm. 3.1 for a formal claim).

Note that a network with width and depth has parameters, whereas the theorem above proves the existence of an approximating narrow network with parameters. This means that any wide network can be approximated up to an arbitrarily small accuracy, by a narrow network where the number of parameters increases (up to log factors) only by a factor of . Hence, it shows that the price for making the width small is only a linear increase in the network depth, in sharp contrast to the results mentioned earlier on how making the depth small may require an exponential increase in the network width. In Subsection 3.2 we further discuss the extra factor, which occurs due to a rough estimate of the Lipschitz parameter of the network. We argue that this factor can also be avoided by having stronger assumptions on the Lipschitz parameter of the networks, which shows that only having a logarithmic blow-up in the number of parameters is enough for such cases.

In Lu et al. (2017) it was shown that the universal approximation property on a compact domain does not hold for networks with width less than . In Park et al. (2020b) it was shown that networks with width already have the universal approximation property. We extend our construction from Thm. 1.1 to approximating any wide network using a network with width , close to the minimal possible width. This construction has an additional blow-up from the bound in Thm. 1.1 on the number of parameters by a factor of . We also discuss how to extend Thm. 1.1 when the construction is restricted to having bounded weights. We show that we can approximate a wide network using a narrow network with weights bounded by , while suffering an additional blow-up from the bound in Thm. 1.1 by a factor of .

The above constructions only apply on a bounded domain, and they approximate the target network w.h.p over some distribution. We additionally provide a different construction which exactly represents for all a target network of width and depth , using a network of width , although its depth is . We show that for and , the number of parameters in this construction does not increase in comparison to the target network. Hence, for these cases, this exact representation is more efficient by a log factor in terms of parameters than the construction in Thm. 1.1.

### Related Work

##### The benefits of depth.

Quite a few theoretical works in recent years have explored the beneficial effect of depth on increasing the expressiveness of neural networks. A main focus is on depth separation, namely, showing that there is a function that can be approximated by a -sized network of a given depth, with respect to some input distribution, but cannot be approximated by -sized networks of a smaller depth. As we already mentioned, depth separation between depth and was shown in (Eldan and Shamir, 2016; Safran and Shamir, 2017; Daniely, 2017). A construction shown by Telgarsky (2016) gives separation between networks of a constant depth and networks of some non-constant depth. Complexity-theoretic barriers to proving separation between two constant depths beyond depth , and to proving separation for certain “well behaved” functions were established in Vardi and Shamir (2020); Vardi et al. (2021a). In Safran and Shamir (2017); Liang and Srikant (2016); Yarotsky (2017) another notion of depth separation is considered. They show that there are functions that can be -approximated by a network of width and depth, but cannot be -approximated by a network of depth unless its width is . Depth separation was also widely studied in other works in recent years (e.g., Martens et al. (2013); Safran et al. (2019); Chatziafratis et al. (2019); Bresler and Nagaraj (2020); Venturi et al. (2021); Malach et al. (2021)).

The expressivity benefits of depth in the context of the VC-dimension (namely, how the VC dimension increases with more depth, even if the total number of parameters remains the same) are implied by, e.g., Bartlett et al. (2019). Finally, Park et al. (2020a); Vardi et al. (2021b) proved that deep networks have more memorization power than shallow ones. That is, deep networks can memorize samples using roughly parameters, while shallow networks require parameters.

##### Deep and narrow networks.

The expressive power of narrow neural networks has been extensively studied in recent years (e.g., (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2018; Kidger and Lyons, 2020; Park et al., 2020b)). As we already discussed, Lu et al. (2017) posed the open question that we study in this work. They also showed that the minimal width for universal approximation (denoted ) using ReLU networks w.r.t. the norm of functions from to , satisfies . For -approximation of functions from a compact domain they showed a lower bound of . Kidger and Lyons (2020) extended their results to -approximation of functions from to , and obtained . Park et al. (2020b) further improved this result and obtained . Hanin and Sellke (2017) considered universal approximation (using ReLU networks) of functions from a compact domain to w.r.t. the norm, and proved that . Universal approximation using narrow networks with other activation functions has been studied in Johnson (2018); Kidger and Lyons (2020); Park et al. (2020b). We note that all prior results on universal approximation using deep and narrow networks require networks of depth exponential in the input dimension. However, our results are of a different nature, since we focus on approximating a given network of bounded size, while universal approximation results aim at approximating arbitrary functions. For a more detailed discussion on related prior works see Park et al. (2020b).

## 2 Preliminaries

For and we denote by the string of bits in places until inclusive, in the binary representation of and treat it as an integer (in binary basis). For example, , i.e. the three most significant bits (from the left). We denote by the minimal number of bits in its binary representation. We denote , i.e. the -th bit on . For a function and we denote by the composition of with itself times. We denote vectors in bold face. For a vector we denote by its -th coordinate. We use the notation to hide logarithmic factors, and use to hide constant factors. For we denote .

#### Neural Networks

We denote by the ReLU function. In this paper we only consider neural networks with the ReLU activation.

Let be the data input dimension. We define a neural network of depth as , where is computed recursively by

• for

• for for

• for

The width of the network is . We define the number of parameters of the network as the total number of coordinates in its weight matrices and biases , which is at most . Note that in some previous works (e.g. Bartlett et al. (2019); Vardi et al. (2021b)) the number of parameters of the network is defined as the number of weights of which are non-zero. Our definition is stricter, as we also count zero weights.
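The two ways of counting parameters mentioned above can be made concrete with a small sketch. It assumes the common convention that ReLU is applied after every layer except the last, which is affine; the helper names `forward`, `count_all_coordinates`, and `count_nonzero` are hypothetical and do not come from the paper.

```python
def relu(v):
    """Apply the ReLU function coordinate-wise."""
    return [max(0.0, t) for t in v]


def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * t for w, t in zip(row, v)) + bias for row, bias in zip(W, b)]


def forward(layers, x):
    """Evaluate the network; assumption: ReLU on hidden layers, affine output layer."""
    h = list(x)
    for W, b in layers[:-1]:
        h = relu(affine(W, b, h))
    W, b = layers[-1]
    return affine(W, b, h)


def count_all_coordinates(layers):
    """Stricter count: every coordinate of every weight matrix and bias, zeros included."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)


def count_nonzero(layers):
    """Alternative count used in some prior works: non-zero weights and biases only."""
    total = 0
    for W, b in layers:
        total += sum(1 for row in W for w in row if w != 0)
        total += sum(1 for t in b if t != 0)
    return total


# A toy network with input dimension 2, one hidden layer of width 3, and one output.
layers = [
    ([[1.0, 0.0], [0.0, -1.0], [0.0, 0.0]], [0.0, 0.0, 1.0]),
    ([[1.0, 1.0, 2.0]], [0.0]),
]
output = forward(layers, [2.0, 3.0])
```

For this toy network the stricter count is 13 while the non-zero count is 6, which illustrates why sparse constructions look cheaper under the alternative definition.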

### Input Dimension

Throughout the paper, we assume that , i.e. the input dimension is smaller than the width of the target network. This assumption is important, because our goal is to approximate a network of width and depth with a deep network, but with width bounded by . If , then the network we are given is already in the correct form and there is nothing to prove. We note that networks with width smaller than do not have the universal approximation property (see e.g. Lu et al. (2017); Johnson (2018); Park et al. (2020b); Hanin and Sellke (2017)), no matter how deep they are. This is in contrast to networks with depth , which have the universal approximation property (where the width is unbounded). This means that we cannot expect to approximate all networks of width using networks with width smaller than , hence constructing a network with width that depends on is unavoidable.

## 3 Narrow and Deep Networks Can Approximate Wide Networks

In this section we show that given a network of width , depth and input dimension , we can approximate it up to error using another network with width and depth .

###### Theorem 3.1.

Let and let be a neural network with width , depth and weights bounded in . Let be some distribution over with density function such that for every where . Then, there exists a neural network with width , depth , such that over we have that:

$$\left|N(\mathbf{x}) - N_0(\mathbf{x})\right| \le \epsilon .$$

The total number of parameters in is .

The full proof can be found in Appendix A. We note that the number of parameters in the target network is . Hence, the number of parameters in is larger only by a factor of ; we discuss this dependence later on. Specifically, the blow-up in the number of parameters w.r.t. the width is only logarithmic. The dependence on the approximation parameter is also logarithmic. Note that and do not affect the number of parameters in the network, as they only appear in the magnitude of the weights (see Thm. A.6 in the Appendix, and the discussion in Subsection 3.4). We also note that although our result shows an approximation w.h.p, it can be easily modified to obtain approximation w.r.t norms. This can be done by adding an extra layer that clips large output values of the network, and once the outputs are bounded, we can choose accordingly to get an approximation in .

### 3.1 Proof Intuition

The main idea for our proof is to encode for each layer (including the first layer) all its input coordinates into a single number. Now, when we want to apply some computation on an input coordinate (e.g. multiply it by a constant), we extract only the relevant bits out of that number and apply our computation on them. Encoding a vector of dimension into a single number can be done in the following way: For each coordinate we extract its most significant bits (for an appropriate ), and we concatenate all these bits into a single number with a total of bits. Now, to apply some computation on the -th coordinate, we first extract the bits in places until from the number we created, and apply the computation on these bits. The main novelty of the proof comes from using this encoding technique, such that given a layer with input coordinates and output coordinates, we simulate it with a network of width , and depth which depends on , and the number of extracted bits from each coordinate. We now explain in more detail the different building blocks of our proof.
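The encoding idea can be illustrated with plain integer arithmetic. This is a minimal sketch and not the ReLU construction from the proof: the helper names `encode` and `extract` are hypothetical, the inputs are assumed to lie in [0, 1), and `c` stands for the number of bits kept per coordinate.

```python
def encode(x, c):
    """Pack the c most significant bits of each coordinate of x into one integer.

    Assumption: every coordinate lies in [0, 1), so floor(x_i * 2^c) fits in c bits.
    """
    code = 0
    for xi in x:
        code = (code << c) | int(xi * 2**c)  # append floor(x_i * 2^c) as c new bits
    return code


def extract(code, i, c, d):
    """Return bits (i-1)*c+1 .. i*c of a (d*c)-bit code, counted from the most
    significant bit, as an integer. The index i is 1-based."""
    shift = (d - i) * c
    return (code >> shift) & ((1 << c) - 1)


x = [0.5, 0.25, 0.8125]
c = 4
code = encode(x, c)
second = extract(code, 2, c, len(x))  # the c-bit truncation of x[1]
```

Dividing an extracted chunk by `2**c` recovers the coordinate up to an error below `2**-c`, which is why the number of stored bits controls the approximation accuracy.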

#### Encoding the Input

The main idea in this part is to construct a subnetwork which encodes all the coordinates of the input within a single number. For simplicity, we assume here that the inputs are in . We construct a network such that for every :

$$\textsc{bin}_{(i-1)\cdot c+1\,:\,i\cdot c}\left(F_{\text{enc}}(\mathbf{x})\right) = \left\lfloor x_i \cdot 2^{c} \right\rfloor .$$

In words, each bits of the output of the network is an encoding of the most significant bits of the -th coordinate of the input. Note that if we want to approximate the input up to an error of , we need to use only the most significant bits. This construction uses an efficient bit extraction technique based on Telgarsky’s triangle function (Telgarsky, 2016), which was also used in Vardi et al. (2021b). The depth of this network depends only on the number of extracted bits, and the width depends on the dimension of the inputs. We note that exact bit extraction is not a continuous operation. We approximate this operation using the ReLU activation, such that it succeeds with probability at least . This parameter only affects the size of the weights, and not the number of parameters. We further discuss the size of the required weights of the network in Subsection 3.4.
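To see why exact bit extraction has to be approximated, and why the failure parameter shows up only in the weight magnitudes, consider the following sketch. It uses a simple one-bit-per-step threshold built from two ReLUs, which is a swapped-in illustration and not necessarily the triangle-function technique used in the proof; `soft_step` and `extract_bits` are hypothetical names, and `delta` here denotes the width of the interval on which the threshold is inexact.

```python
def relu(t):
    return max(0.0, t)


def soft_step(z, delta):
    """Continuous ReLU surrogate for the indicator [z >= 1/2].

    It equals 0 for z <= 1/2 - delta, equals 1 for z >= 1/2, and is linear in
    between. The largest weight involved is 1/delta.
    """
    s = (z - 0.5) / delta
    return relu(s + 1.0) - relu(s)


def extract_bits(z, c, delta):
    """Extract the c most significant bits of z in [0, 1), one bit per step.

    Each step uses a constant number of ReLU units, so depth grows with c while
    width stays constant. The result is exact unless some intermediate value
    falls inside the narrow interval (1/2 - delta, 1/2).
    """
    bits = []
    for _ in range(c):
        b = soft_step(z, delta)
        bits.append(b)
        z = 2.0 * z - b  # shift the extracted bit out
    return bits


bits = extract_bits(0.8125, 4, 2**-10)       # 0.8125 is 0.1101 in binary
bad = soft_step(0.5 - 2**-11, 2**-10)        # input inside the inexact interval
```

Shrinking `delta` makes the inexact interval smaller, so the extraction fails on a smaller set of inputs, at the price of a larger weight `1/delta` and with no change in the number of units.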

#### Encoding Each Layer

In this part, we construct deep and narrow subnetworks with real-valued inputs and outputs for , where each such subnetwork simulates the -th layer from the target network. We first explain how to simulate a single neuron, and then how to extend it to simulating a layer.

A single ReLU neuron is a function of the form , for some . Suppose that the input of this single neuron (i.e. ) is represented in a single coordinate with bits, where each bits represents the most significant bits of a coordinate of . We iteratively decode the representation of (the -th coordinate of ), multiply it by and add it to a designated output number. The decoding of the input is done using Telgarsky’s triangle function. To deal with both negative and positive ’s, we use two designated output numbers, one for the positive weights and one for the negative weights. The final layer adds up these designated outputs with their corresponding sign to get the correct result. Then, it adds the bias term to the output and applies the ReLU function. The depth of this network depends on , i.e., the number of input coordinates times the number of bits used for their encoding. The width of this network is .
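The neuron simulation described above can be sketched in ordinary arithmetic, assuming the input arrives packed as one integer with `c` bits per coordinate (most significant chunk first). The function name `simulate_neuron` is hypothetical, the weights are kept exact here for simplicity, and the decoding uses integer shifts in place of the ReLU-based decoding from the proof.

```python
def simulate_neuron(code, weights, bias, c):
    """Evaluate relu(<w, x> + b) where x is given only through its packed encoding.

    Coordinates are decoded one at a time, and two accumulators are kept so that
    each of them only ever grows: one for positive weights, one for negative ones.
    """
    d = len(weights)
    pos, neg = 0.0, 0.0
    for i, w in enumerate(weights, start=1):
        chunk = (code >> ((d - i) * c)) & ((1 << c) - 1)  # c-bit chunk of coordinate i
        xi = chunk / 2**c                                  # truncated value of x_i
        if w >= 0:
            pos += w * xi
        else:
            neg += -w * xi
    # Final step: combine the accumulators with their signs, add the bias, apply ReLU.
    return max(0.0, pos - neg + bias)


# 2125 packs the 4-bit chunks 8, 4, 13, i.e. x is roughly (0.5, 0.25, 0.8125).
value = simulate_neuron(2125, [1.0, -2.0, 0.5], 0.25, 4)
```

The loop runs once per input coordinate, which mirrors the statement that the depth of the simulating subnetwork scales with the number of coordinates times the number of bits per coordinate.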

Simulating an entire layer requires iteratively simulating each neuron of the layer as described above, and then encoding the output of each neuron from the target network within a single number. In more detail, the subnetwork iteratively simulates a single neuron from the -th layer of the target network using the method described above. It also keeps track of the input (which encodes in a single number all the output coordinates from the previous layer) and a single designated output coordinate. After simulating a neuron, the network truncates the output to having only bits, and stores it in an output coordinate, where the output of the -th neuron from the target network is stored in the until bits of this designated output coordinate. In total, the network has an input dimension of , i.e. its input is a single number representing an encoding of the -th layer’s outputs, and it outputs a single number with an encoding of the -th layer. The width of this network is , and its depth depends on , where are the input and output dimensions of the -th layer, and is the number of bits for each neuron.

### 3.2 On the Number of Parameters in the Construction

As we already discussed, the number of parameters in our construction is almost the same as the number of parameters in the target network. The main difference is that in our construction we have an extra term, and extra logarithmic terms in the other parameters of the problem. Here we will discuss why these extra terms come up in our construction, and in what situations they can be avoided.

The network we construct in Thm. 3.1 can be roughly represented as , that is, encoding the data and then simulating all the layers from the target network. Since we use bit extraction techniques for this construction, we cannot represent exactly the inputs and the weights of the target network. To this end, we only keep track of the most significant bits of each component of the target network (weights and inputs).

The approximation capacity of our construction depends on the Lipschitz parameter of each layer of the target network, and on , the number of bits we store. To see this, first note that approximating some number in up to an error of requires storing only its most significant bits. Recall that by our assumptions the weights of the target network are bounded in , and each coordinate of the input data is bounded in . The Lipschitz parameter of each layer of the network can be roughly upper bounded by , where is the width of the target network. This means that after layers, the Lipschitz parameter of the network can be bounded by . Using this estimate, it can be seen that to get an approximation of the output, storing bits for every weight and input coordinate can suffice.

The number of parameters for simulating each layer of the target network depends on the number of stored bits, hence the number of parameters in our construction increases by logarithmic factors, and an factor. We get an term in the total number of parameters, since there are layers, and the simulation of each layer involves a blow-up by a factor of .

We emphasize that this blow-up in the number of parameters is mainly due to a rough estimate of the Lipschitz constant for each layer of the target network. Since we use an efficient bit extraction technique, the number of parameters increases only by log of the Lipschitz constant.

Informally, if the Lipschitz parameter of the network and its intermediate computations is small (which seems to often occur in practice, see for example Fazlyab et al. (2019); Scaman and Virmaux (2018); Latorre et al. (2020)), then we believe that the extra factor can be reduced or even removed altogether. However, a formal statement requires a more delicate analysis, which we leave for future work.

### 3.3 Extension to Multiple Outputs

Our construction can be readily extended to the case where there are multiple outputs to the target network. Given some target network , we use a similar construction to Thm. 3.1, except for simulating the last layer. To simulate the last layer, given an encoding of the penultimate layer of , we simulate each output in parallel in a similar manner as we did for a single output in Thm. 3.1. In more detail, for each output coordinate we construct a subnetwork which given an encoding of the values from the penultimate layer, computes the -th output. The construction of each is exactly the same as the construction from the proof of Thm. 3.1 which simulates the last layer of a target network with a single output. Now, the last layer computes

$$x \mapsto \begin{pmatrix} F_1(x) \\ \vdots \\ F_{d_{\text{out}}}(x) \end{pmatrix} .$$

Since the width of the subnetwork which simulates a layer is , the width of this new network increases by a factor of , and the depth of this network does not change.

### 3.4 Approximation With Bounded Weights

Our construction in Thm. 3.1 uses a network with very large weights (exponential in and , see Thm. A.6 in the appendix for the exact expression), which may be seen as a limitation of our construction. In this section we show that having such large weights can be easily avoided by slightly altering our construction from Thm. 3.1. This change results in an extra linear factor, and some log factors on the number of parameters.

The reason we have such large weights is that we use a bit extraction technique, which requires that the subnetworks in our construction have a very large Lipschitz constant. For example, constructing a neural network which outputs the -th bit of its input (say, in dimension), requires that the Lipschitz constant will be approximately . For this reason, in several places in the proof our weights are exponential in the parameters of the problem, and they are exactly equal to for some large which depends on the parameters of the problem. To avoid such a blow-up in the size of the weights, we can approximate a weight of size by just using layers, and multiplying times the number to obtain the same result. Using this technique we are able to construct a network with bounded weights but at the cost of increasing the number of parameters of the network, up to logarithmic terms, by a linear term in and :

###### Corollary 3.2.

Under the same setting as in Thm. 3.1, there exists a neural network with width bounded by , depth bounded by and weights bounded by , such that w.p over we have that:

$$\left|N(\mathbf{x}) - N_0(\mathbf{x})\right| \le \epsilon .$$

The total number of parameters in is .

###### Proof.

We use the homogeneity of the ReLU activation. Note that for a neuron of the form , we can divide the weights by some constant, and multiply the output of the neuron by the same constant, and for all the result will stay the same. Given the network constructed in Thm. 3.1, denote its largest weight by , and its depth by . Denote by and the weight matrices and biases of this network. We divide and by . For each layer , we divide by , and by . In the last layer we will multiply by .

We simulate the multiplication by using small weights in the following way: We write , where with and . We note that the output may be negative, hence we need to simulate multiplication without the ReLU activation. To do that, we add a layer which acts as: . We now use layers to multiply each of the two outputs by the number , and in the penultimate layer we multiply the result by . The last layer acts as . Note that since the second coordinate is equal to and the first coordinate is equal to (for some ), then the output of the network is .

To prove the correctness of our construction, first note that the magnitude of each weight in our new network is bounded by , since we divided each weight of the original network by for some where is the size of the maximal weight. Second, we show that the output of the network is the same for all . Given some , denote by the output of the original network with input after layers. Assume by induction that after dividing the weights as explained above, the output of the -th layer is divided by , then for the -th layer we have:

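$$\sigma\left(\frac{1}{C}\tilde{W}^{(i)} \cdot \frac{1}{C^{i-1}}\mathbf{x}^{(i)} + \frac{1}{C^{i}}\tilde{\mathbf{b}}^{(i)}\right) = \frac{1}{C^{i}}\,\sigma\left(\tilde{W}^{(i)} \cdot \mathbf{x}^{(i)} + \tilde{\mathbf{b}}^{(i)}\right) .$$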

This means that after layers, the output is divided by . Since we also multiply by this term in the last layers of the network, the output of the network does not change.

The weights of our construction are bounded by . The width of our construction does not change from the width of the original network. The depth of our construction can be bounded by where is the depth of the original network, and is the maximal weight in the original network. The log of the largest weight can be bounded by (see Thm. A.6 in the appendix):

Hence, the depth of the network can be bounded by . The number of parameters in the network also increases by . Hence, the total number of parameters in the network can be bounded by . ∎
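The rescaling argument in the proof can be checked numerically on a toy network. The sketch below is an illustration under stated assumptions: the network applies ReLU on hidden layers and has an affine output, the scale factor is restored by repeated doubling followed by one multiplication by a remainder in [1, 2), and the names `rescale` and `multiply_back` are hypothetical.

```python
import math


def relu(v):
    return [max(0.0, t) for t in v]


def affine(W, b, v):
    return [sum(w * t for w, t in zip(row, v)) + bias for row, bias in zip(W, b)]


def forward(layers, x):
    """Assumption: ReLU on hidden layers, affine output layer."""
    h = list(x)
    for W, b in layers[:-1]:
        h = relu(affine(W, b, h))
    W, b = layers[-1]
    return affine(W, b, h)


def rescale(layers, C):
    """Divide the i-th weight matrix by C and the i-th bias by C^i (i = 1..L)."""
    return [([[w / C for w in row] for row in W], [t / C**i for t in b])
            for i, (W, b) in enumerate(layers, start=1)]


def multiply_back(y, C, L):
    """Multiply y by C^L using only factors of size at most 2.

    The sign is handled by splitting y into (relu(y), relu(-y)), scaling both
    non-negative parts, and subtracting at the end.
    """
    total = float(C) ** L
    k = math.floor(math.log2(total))
    r = total / 2**k                      # remaining factor, roughly in [1, 2)
    p, n = max(0.0, y), max(0.0, -y)
    for _ in range(k):
        p, n = max(0.0, 2.0 * p), max(0.0, 2.0 * n)
    p, n = r * p, r * n
    return p - n


layers = [
    ([[4.0, -2.0], [0.0, 3.0]], [1.0, -1.0]),
    ([[2.0, -4.0]], [0.5]),
]
C = 4.0                                   # largest weight magnitude in the toy network
x = [1.0, 2.0]
original = forward(layers, x)[0]
small = forward(rescale(layers, C), x)[0]
restored = multiply_back(small, C, len(layers))
```

In this example every rescaled weight has magnitude at most 1, and the extra doubling layers account for the additional depth, which is proportional to the logarithm of the restored factor.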

Corollary 3.2 shows that even if we use networks with constant weights, we can simulate any target network up to any accuracy using a deep and narrow network, while having only a polynomial blow-up in the parameters of the problem. Moreover, the number of parameters in this construction is only larger by a factor of than the construction in Thm. 3.1. An interesting question is whether a better bound can be achieved using a different construction. We leave this question for future research.

###### Remark 3.3.

Instead of bounding the magnitude of the weights in the network, we could have bounded the bit complexity of the network. By bit complexity, we mean the number of bits that are needed to represent all the weights of the network. By carefully following the proof of Thm. 3.1, it can be seen that each weight in our construction can be represented by at most bits. We note that although it seems possible to provide a construction where each weight can be represented with bits, at the cost of increasing the number of parameters in the network (by similar arguments to the proof of Corollary 3.2), such a construction does not seem to reduce the overall bit complexity of the network. This is because we still use the same total number of bits to represent all the weights of the network, but we spread those bits across more weights. An interesting question is whether it is possible to provide a construction with smaller bit complexity, and we leave it for future work.

## 4 Achieving Close to Minimal Width

Previous works have shown that neural networks over a compact input domain with width (where is the input dimension) are not universal approximators, in the sense that they cannot approximate any function w.r.t. the norm up to arbitrarily small accuracy (see e.g. Lu et al. (2017); Park et al. (2020a)). Hence, we cannot expect to approximate any wide network using a narrow network with width less than . (Note that the approximation in Thm. 3.1 is given w.h.p over some distribution, and not in the sense. However, any construction that achieves approximation w.p. can be used to obtain approximation, by bounding the output and choosing an appropriate .) In this section we show how to approximate any wide network using a narrow network with width , which is only larger than the lower bound by . We note that in Park et al. (2020b), both an upper and lower bound of is shown for universal approximation over an unbounded domain, although their construction uses an exponential number of parameters. Our main result in this section is the following:

###### Theorem 4.1.

Assume the same setting as in Thm. 3.1. Then, there exists a neural network with width , and depth , such that over we have that:

$$\left|N(\mathbf{x}) - N_0(\mathbf{x})\right| \le \epsilon .$$

The total number of parameters in is . If , then the total number of parameters is .

The full proof can be found in Appendix B. The proof is very similar to the proof of Thm. 3.1. The only difference is that we replace the first component of the network which encodes the input data. The new encoding scheme is more efficient in terms of width as it allows encoding the input coordinates using width instead of width . We extract the bits of each coordinate sequentially, instead of in parallel. This results in a blow-up on the number of parameters by a factor of . The bit extraction technique we use here also relies on Telgarsky’s triangle function, but to extract bits it requires a depth of , instead of a depth of as in the proof of Thm. 3.1. This results in a blow-up by a logarithmic factor on the number of parameters.

We note that in Park et al. (2020b) the authors achieved a universal approximation result using width , vs. width in our theorem. Moreover, they use a bit extraction technique somewhat reminiscent of ours. However, their required depth is exponential in the problem’s parameters, while in our construction it is polynomial. We conjecture that it is not possible to achieve a similar construction to ours with width less than , unless we increase the number of parameters by a factor which is polynomial in both , and the Lipschitz constant of the network. We leave this question for future research.

###### Remark 4.2.

In Thm. 4.1 the number of parameters in the network increases by a factor of compared to the number of parameters in Thm. 3.1. We note that this is due to the way we count the parameters of the network. We defined the number of parameters as the number of coordinates in its weight matrices and bias vectors. In Thm. 4.1, the width of the subnetwork that encodes the input is and its depth is , hence its number of parameters is . An alternative way to define the number of parameters is as the number of non-zero weights in the network. This alternative definition is used in many previous works (see, e.g., Bartlett et al. (2019); Vardi et al. (2021b)). Then, the number of parameters for the subnetwork which encodes the input is only , which gives us the exact same bound as in Thm. 3.1.

## 5 Exact Representation With Deep and Narrow Networks

In Sec. 3 we showed a construction for approximating a target shallow and wide neural network using a deep and narrow neural network. We note that this construction assumes that the data is bounded in and that we approximate the target network w.h.p over some distribution up to an error of . The number of parameters in the construction depends logarithmically on and .

In this section we show a different construction which exactly represents the target network for all using a deep and narrow construction. We will also discuss in which cases this construction is better than the one given in Thm. 3.1. We show the following:

###### Theorem 5.1.

Let be a neural network with layers and width . Then, there exists a neural network with width and depth such that for every we have that .

The full proof can be found in Appendix C. In a nutshell, it is based on an inductive argument over the layers of the network (starting from the bottom layer and ending in the output neuron). Specifically, fix some layer, and consider some neuron in that layer (where ranges from to the width of that layer). We can view the output of that neuron as the output of a subnetwork which ends at that neuron. Suppose by induction that we can convert this subnetwork to an equivalent subnetwork which is narrow. Doing this for all , we get a sequence of narrow subnetworks which represent the outputs of all neurons in the layer. Now, instead of placing them side-by-side (which would result in a wide network), we put them one after the other, using in parallel neurons to remember the original inputs, and another neurons per layer to incrementally accumulate a weighted linear combination of the subnetworks’ outputs, mimicking the computation of the layer in the original network. Overall, we end up with a narrow network which mimics the outputs of the original layer, which we can then use inductively for constructing the outputs of the following layers.
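The "one after the other" idea can be illustrated on the simplest case, a target network with a single hidden layer. The following is a minimal sketch under stated assumptions, not the construction from Appendix C: the function names, the split of the input into positive and negative parts, and the resulting register count (2d + 3 in this sketch) are illustrative choices, and the widths in Thm. 5.1 may differ.

```python
def relu(z):
    return max(z, 0)

def wide_net(x, W, b, u):
    """Direct evaluation of a shallow, wide network: sum_j u[j] * relu(W[j] . x + b[j])."""
    return sum(
        u_j * relu(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
        for w_j, b_j, u_j in zip(W, b, u)
    )

def narrow_net(x, W, b, u):
    """Same function, computed one hidden neuron per layer.

    Registers carried from layer to layer (all nonnegative, so ReLU acts
    as the identity on them): x_pos, x_neg (remember the input),
    h (the neuron computed in the previous layer), acc_pos, acc_neg
    (running weighted sums of positive and negative contributions).
    """
    x_pos = [relu(xi) for xi in x]    # x = x_pos - x_neg
    x_neg = [relu(-xi) for xi in x]
    h, u_prev = 0, 0
    acc_pos, acc_neg = 0, 0
    for w_j, b_j, u_j in zip(W, b, u):
        # One ReLU layer: compute the next neuron, fold the previous one
        # into the accumulators, and copy the input registers forward.
        new_h = relu(sum(w * (p - q) for w, p, q in zip(w_j, x_pos, x_neg)) + b_j)
        acc_pos = relu(acc_pos + max(u_prev, 0) * h)
        acc_neg = relu(acc_neg + max(-u_prev, 0) * h)
        x_pos = [relu(p) for p in x_pos]
        x_neg = [relu(q) for q in x_neg]
        h, u_prev = new_h, u_j
    # Output layer: fold in the last neuron and take the difference.
    acc_pos += max(u_prev, 0) * h
    acc_neg += max(-u_prev, 0) * h
    return acc_pos - acc_neg
```

In this sketch the number of registers depends only on the input dimension, while the depth grows with the number of hidden neurons. With integer weights and inputs the two functions agree exactly; with floats they can differ by rounding, since the summation order is different.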

We emphasize that this construction is not an approximation of the target network, but an exact representation of it using a deep and narrow network. Note that our construction is narrow only if ; otherwise the target network might be narrower than our construction. This construction is not efficient, in the sense that we compute each neuron many times. For example, a neuron in the first layer of the target network is computed exponentially many times (in ). This is because every neuron in a subsequent layer recomputes this neuron recursively.

#### Cases Where the Exact Representation is Efficient

We argue that for and , the construction presented in this section does not significantly increase the number of parameters compared to the target network. The construction in Thm. 5.1 has width throughout the entire network. Also, the depth of the network constructed in Thm. 5.1 is exponential in . For these reasons, the number of parameters in the network constructed in Thm. 5.1 is , which seems less efficient than the construction in Thm. 3.1.

Assume that the input dimension is constant, that is , and we are only interested in the asymptotic dependence on for different values of . If , then the target network has parameters, because it is a depth- network with constant input dimension. By the bound we saw above, the construction in Thm. 5.1 also has parameters. If , then the target network has parameters, and the construction presented in this section also has parameters. For , the target network has parameters, while the construction from Thm. 5.1 has parameters, so the construction does increase the number of parameters.

We emphasize that the construction here simulates a wide network using a deep network with width independent of (the width of the target network), and that it is an exact representation for every . On the other hand, in Thm. 3.1 the construction only approximates the target network up to some , with high probability and in a bounded domain. We conjecture that it is not possible to obtain an exact representation without increasing the number of parameters for general and .

## 6 Discussion

In this work we solved an open question from Lu et al. (2017). We proved that any target network with width , depth and inputs in can be approximated by a network with width , where the number of parameters increases by only a factor of over the target network (up to log factors). Relying on previous results on depth separation (e.g. Eldan and Shamir (2016); Safran and Shamir (2017); Telgarsky (2016); Daniely (2017)), this shows that depth plays a more significant role in the expressive power of neural networks than width. We also extend our construction to having bounded weights, and to having width at most , where previous lower bounds showed that such a construction is not possible for width less than . Both of these extensions cause an extra polynomial blow-up in the number of parameters. Finally, we show a different construction which allows exact representation of wide networks using deep and narrow networks. We argue that this construction does not increase the number of parameters by more than constant factors when and .

There are a couple of future research directions which may be interesting to pursue. First, it would be interesting to see if the upper bound established in Thm. 3.1 is tight. Namely, whether the extra blow-up by a factor of and by logarithmic factors is unavoidable. Second, it would be interesting to find a more efficient construction than in Thm. 5.1 for exact representation of wide networks using narrow networks, or to establish a lower bound which shows that it is not possible. Finally, in terms of optimization, given two approximations of the same function, one using a narrow and deep network, and the other using a shallow and wide network, it would be interesting to analyze their optimization process, and see which representation is easier to learn using standard methods (e.g. SGD).

## Appendix A Proofs from Sec. 3

### A.1 Encoding of the data

###### Lemma A.1.

Let , where . There exists a neural network with width depth at most and weights bounded by , such that if we sample , then w.p for every we have that:

$$\textsc{bin}_{(i-1)\cdot c+1:i\cdot c}\left(N(x)\right)=\left\lfloor \frac{x_i}{A}\cdot 2^{c_0}\right\rfloor\cdot A$$

###### Proof.

We first use Lemma A.2 to construct a network , such that w.p at least , if we sample then . Also, has width 5 and depth bounded by . We define a network which maps the following input to output:

$$\begin{pmatrix}x_1\\ \vdots\\ x_d\end{pmatrix}\mapsto\left(\sum_{i=1}^{d}2^{(i-1)\cdot c}A\cdot G\left(\frac{x_i}{A}\right)\right)$$

We can construct such that it has width and depth bounded by in the following way: We first map:

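$$\begin{pmatrix}x_1\\ \vdots\\ x_d\end{pmatrix}\mapsto\begin{pmatrix}\frac{x_1}{A}\\ \vdots\\ \frac{x_d}{A}\end{pmatrix}\mapsto\begin{pmatrix}A\cdot G\left(\frac{x_1}{A}\right)\\ \vdots\\ A\cdot G\left(\frac{x_d}{A}\right)\end{pmatrix}$$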

This can be done using width and depth , since calculating each requires a width of and depth . In the last layer of we sum all the with the corresponding weights.

We note that if , then . Hence, by Lemma A.2 and the union bound, the output is correct for all w.p . Hence, by our construction, satisfies the conditions of the lemma.

The maximal width of is the maximal width of its subnetworks which is . The depth of is the sum of the depths of its subnetworks which can be bounded by . The maximal weight of can be bounded by the maximal weight of its subnetworks. The maximal weight of can be bounded by , which appears in its last layer. ∎
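The weighted sum in the last layer places each quantized coordinate in its own block of c bits of a single number. A minimal sketch of that packing on plain integers follows. It is illustrative only: the helper names `pack` and `block` are hypothetical, and counting blocks from the least significant end is an assumption made for this sketch, since the definition of the block notation is not restated in this section.

```python
def pack(blocks, c):
    """Pack integers in [0, 2**c) into one integer.

    Computes sum_i 2**((i - 1) * c) * blocks[i - 1], mirroring the weights
    used when summing the encoded coordinates.
    """
    assert all(0 <= v < 2 ** c for v in blocks)
    return sum(v << (i * c) for i, v in enumerate(blocks))

def block(n, i, c):
    """Return the i-th c-bit block of n (1-indexed, least significant first)."""
    return (n >> ((i - 1) * c)) & ((1 << c) - 1)

# Usage: quantize coordinates in [0, 1) to c bits, pack them, read them back.
c = 4
xs = [0.3, 0.75, 0.5]
quantized = [int(x * 2 ** c) for x in xs]   # fixed-point truncation (assumed)
packed = pack(quantized, c)
assert [block(packed, i, c) for i in range(1, len(xs) + 1)] == quantized
```

Because every block holds a value below 2^c, the blocks never overlap, so all d coordinates can be carried through the rest of the network in a single neuron.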

###### Lemma A.2.

Let and . There exists a neural network with width 5, depth bounded by and weights bounded by , such that if we sample , w.p we have that .

###### Proof.

We define , which is Telgarsky’s triangle function (Telgarsky, 2016). We also define the following function for :

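$$\psi_i(x)=\frac{2^{c+2-i}}{\delta}\,\sigma\left(\varphi^{(i)}\left(x+\frac{\delta}{2^{c+2}}\right)-\varphi^{(i)}\left(x+\frac{\delta}{2^{c+1}}\right)\right). \tag{1}$$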

The intuition behind Eq. (1) is the following: The function is a piecewise linear function with “bumps”. Each such “bump” consists of two linear parts with a slope of ; the first linear part goes from 0 to 1, and the second goes from 1 to 0. Let ; it can be seen that the -th bit of is 1 if is on the second linear part (i.e. descending from 1 to 0) and its -th bit is 0 otherwise.

Assume that and are on the same linear piece of for . Then, the -th bit of is equal to 1 if , and otherwise. Also, if we sample , then w.p both terms are on different linear pieces for some .

Using this observation we get that the output of is equal to the -th bit of w.p over sampling .
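The bit-extraction claim can be checked numerically. The following is a minimal sketch rather than the network from the proof. It assumes the standard form of the triangle function, φ(z) = σ(2z) − σ(4z − 2), and the names `phi`, `phi_iter`, `psi` and `true_bits` are illustrative. It evaluates Eq. (1) on inputs of the form x = k/2^c, for which both shifted arguments fall strictly inside one linear piece of the i-fold composition.

```python
def relu(z):
    return max(z, 0.0)

def phi(z):
    # Triangle function on [0, 1]: rises from 0 to 1 on [0, 1/2],
    # then falls back to 0 on [1/2, 1] (assumed standard form).
    return relu(2 * z) - relu(4 * z - 2)

def phi_iter(z, i):
    # i-fold composition of phi with itself.
    for _ in range(i):
        z = phi(z)
    return z

def psi(x, i, c, delta):
    # Eq. (1): compare the composed triangle function at two nearby points.
    a = phi_iter(x + delta / 2 ** (c + 2), i)
    b = phi_iter(x + delta / 2 ** (c + 1), i)
    return 2 ** (c + 2 - i) / delta * relu(a - b)

def true_bits(x, c):
    # First c bits of x in [0, 1) after the binary point.
    return [int(x * 2 ** i) % 2 for i in range(1, c + 1)]

# Check every c-bit input; delta is a power of two, so all values are exact floats.
c, delta = 6, 0.5
for k in range(2 ** c):
    x = k / 2 ** c
    assert [psi(x, i, c, delta) for i in range(1, c + 1)] == true_bits(x, c)
```

The prefactor works because the i-fold composition has slope ±2^i: on a descending piece the difference inside σ equals 2^i · δ/2^{c+2}, which the prefactor rescales to exactly 1, while on an ascending piece the difference is negative and σ maps it to 0.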

We construct a network which maps the following input to output:

This network can be realized using four layers (two for Telgarsky’s function, one for and one for the output) and width (one for storing and four for applying Telgarsky’s function twice). We also define as:

$$f_0(x)=\begin{pmatrix}0\\ x+\frac{\delta}{2^{c+1}}\\ x+\frac{\delta}{2^{c+2}}\end{pmatrix}$$

Finally, we construct the network:

$$N:=P_1\circ f_c\circ\cdots\circ f_1\circ f_0,$$

where is the projection on the first coordinate. By our argument above, w.p over sampling we get that as required. The width of is the maximal width of each of its subnetworks which is at most . The depth of is the sum of the depths of its subnetworks, which can be bounded by . Each weight of can be bounded by

### A.2 Approximation of a single neuron

In the following we show that given a previous layer with neurons each encoded with bits, we can construct a network which outputs a single neuron defined by some given weights.

###### Lemma A.3.

Let , let with for every and let . There exists a neural network with width , depth at most and weights bounded by , such that for every with we have that:

$$N(x)=\sigma\left(\sum_{i=1}^{n-1}\alpha_i w_i\,\textsc{bin}_{(i-1)\cdot c+1:i\cdot c}(x)+b\right).$$

###### Proof.

We construct two sets of networks: for and for . The intuition is that each will decode the -th bit from the -th input neuron, and will add up the -th input neuron to the output neuron.

The construction of the bit extraction is similar to the one from Eq. (1). We first define which is Telgarsky’s triangle function. We also define the following function for every :

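$$\psi_\ell(x)=2^{n\cdot c+2-\ell}\,\sigma\left(\varphi^{(\ell)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)-\varphi^{(\ell)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\right). \tag{2}$$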

By the same reasoning as in Eq. (1), the output of is equal to the -th bit of , for every with .

Let and , then we define which maps the following input to output:

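$$\begin{pmatrix}x\\ x_{cur}\\ y_{pos}\\ y_{neg}\\ \varphi^{((i-1)\cdot c+j-1)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\\ \varphi^{((i-1)\cdot c+j-1)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)\end{pmatrix}\mapsto\begin{pmatrix}x\\ 2\cdot x_{cur}+\psi_{(i-1)\cdot c+j}(x)\\ y_{pos}\\ y_{neg}\\ \varphi^{((i-1)\cdot c+j)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\\ \varphi^{((i-1)\cdot c+j)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)\end{pmatrix}$$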

where we calculate using Eq. (2).

For we define which maps the following input to output if :

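$$\begin{pmatrix}x\\ x_{cur}\\ y_{pos}\\ y_{neg}\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)\end{pmatrix}\mapsto\begin{pmatrix}x\\ 0\\ y_{pos}+w_i\cdot x_{cur}\\ y_{neg}\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)\end{pmatrix}, \tag{3}$$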

and if :

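$$\begin{pmatrix}x\\ x_{cur}\\ y_{pos}\\ y_{neg}\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)\end{pmatrix}\mapsto\begin{pmatrix}x\\ 0\\ y_{pos}\\ y_{neg}+w_i\cdot x_{cur}\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\right)\\ \varphi^{(i\cdot c)}\left(\frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\right)\end{pmatrix}. \tag{4}$$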

We now define for as:

$$G_i:=F_i\circ f_{i,c}\circ\cdots\circ f_{i,1}.$$

In words, the goal of each is to add to the output neuron the output of the -th input neuron multiplied by its corresponding weight. We also define the input and output networks , as:

$$G_{in}(x)=\begin{pmatrix}x\\ 0\\ 0\\ 0\\ \frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+1}}\\ \frac{x}{2^{n\cdot c}}+\frac{1}{2^{n\cdot c+2}}\end{pmatrix},\qquad G_{out}\left(\begin{pmatrix}x\\ x_{cur}\\ y_{pos}\\ y_{neg}\\ z_1\\ z_2\end{pmatrix}\right)=\sigma\left(y_{pos}-y_{neg}+b\right)$$

Finally, we define the network as:

$$N:=G_{out}\circ G_n\circ\cdots\circ G_1\circ G_{in}.$$

By the construction of , and each of the we get that for every with :

$$N(x)=\sigma\left(\sum_{i=1}^{n-1}\alpha_i w_i\,\textsc{bin}_{(i-1)\cdot c+1:i\cdot c}(x)+b\right).$$

The width of each can be bounded by . This is because each requires a width of , and simulating the identity requires a width of 1, since all the inputs are positive (hence ). The width of each is , since it only requires addition and simulating the identity on positive inputs. The width of and can also be bounded by . Hence, the width of is at most . The depth of each can be bounded by , and the depth of each and of and can be bounded by . In total, the depth of , which is bounded by the sum of depths of its subnetworks, can be bounded by . The maximal weight of each is . The maximal weight of each can be bounded by . In total, the weights of can be bounded by . ∎
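The register updates in Eqs. (3) and (4) can be traced on plain integers. The sketch below is illustrative only and makes several assumptions: bits are read directly with shifts instead of through Eq. (2), block i is taken to be bits (i−1)·c+1 through i·c counted from the most significant end, the weights are treated as nonnegative magnitudes with signs given separately, and the loop runs over every block supplied. The names `bit_msb` and `neuron_from_packed` are hypothetical.

```python
def relu(z):
    return max(z, 0)

def bit_msb(x, l, total_bits):
    """l-th most significant bit (1-indexed) of x, viewed as a total_bits-bit integer."""
    return (x >> (total_bits - l)) & 1

def neuron_from_packed(x, weights, signs, b, c):
    """Decode c-bit blocks of x one bit at a time and accumulate a signed weighted sum.

    Tracks the registers x_cur, y_pos, y_neg used in the proof.
    """
    n = len(weights)
    y_pos, y_neg = 0, 0
    for i in range(1, n + 1):
        x_cur = 0
        for j in range(1, c + 1):
            # Role of f_{i,j}: shift in the next bit of the i-th block.
            x_cur = 2 * x_cur + bit_msb(x, (i - 1) * c + j, n * c)
        if signs[i - 1] == 1:
            y_pos += weights[i - 1] * x_cur   # update as in Eq. (3)
        else:
            y_neg += weights[i - 1] * x_cur   # update as in Eq. (4)
        # x_cur is reset to 0 at the top of the next iteration.
    return relu(y_pos - y_neg + b)            # role of G_out

# Usage: three 3-bit blocks [5, 2, 7], packed most significant block first.
x = (5 << 6) | (2 << 3) | 7
print(neuron_from_packed(x, weights=[1, 2, 1], signs=[1, -1, 1], b=-3, c=3))
```

Keeping positive and negative contributions in separate registers means every intermediate value is nonnegative, which is what lets a ReLU layer pass it through unchanged.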

### A.3 Approximation of a layer

In the following we show that given a previous layer with neurons, each encoded with bits, we can construct a network which outputs an encoded output layer with neurons.

###### Lemma A.4.

Let . For every let with and . For every let with . Then, there exists a neural network with width , depth bounded by , and weights bounded by with the following property: Let with such that for every we have:

$$\textsc{len}\left(\sum_{j=1}^{n_1}\alpha_{i,j}w_{i,j}\,\textsc{bin}_{(j-1)\cdot c+1:j\cdot c}(x)+b_i\right)\leq c. \tag{5}$$

Then, for every we get:

$$\textsc{bin}_{(i-1)\cdot c+1:i\cdot c}\left(N(x)\right)=\sigma\left(\sum_{j=1}^{n_1}\alpha_{i,j}w_{i,j}\,\textsc{bin}_{(j-1)\cdot c+1:j\cdot c}(x)+b_i\right). \tag{6}$$

###### Proof.

For every we use Lemma A.3 to construct a network such that for every with we get:

 ~Fi(x)=σ(n1∑j=1αi,jwi,j\textscbin(j−1)⋅c+1:j⋅c