# The phase diagram of approximation rates for deep neural networks

We explore the phase diagram of approximation rates for deep neural networks. The phase diagram describes theoretically optimal accuracy-complexity relations and their qualitative properties. Our contribution is three-fold. First, we generalize the existing result on the existence of a deep discontinuous phase in ReLU networks to functional classes of arbitrary positive smoothness, and identify the boundary between the feasible and infeasible rates. Second, we demonstrate that standard fully-connected architectures of a fixed width independent of smoothness can adapt to smoothness and achieve almost optimal rates. Finally, we discuss how the phase diagram can change in the case of non-ReLU activation functions. In particular, we prove that using both sine and ReLU activations theoretically leads to very fast, nearly exponential approximation rates, thanks to the emerging capability of the network to implement efficient lookup operations.



## 1 Introduction

The topic of expressiveness of deep neural networks has received much attention in recent years. One of the fundamental questions in this area is the complexity of networks required to approximate classes of functions of given smoothness. Given a class $F$ of maps from the $d$-dimensional cube $[0,1]^d$ to $\mathbb{R}$, we want to identify network architectures of minimal complexity sufficient to approximate all $f\in F$ with given accuracy. In this paper we focus on the classical setting in which the sets $F$ are Sobolev or Hölder balls, approximation is with respect to the uniform norm $\|\cdot\|_\infty$, and the complexity of the network is measured by the number of weights $W$. In this case we expect a power law relation between accuracy and complexity:

$$\|f-\widetilde{f}_W\|_\infty = O(W^{-p}),\quad \forall f\in F, \tag{1}$$

where $\widetilde{f}_W$ is an approximation of $f$ by a network with $W$ weights, and $p$ is an $F$-dependent constant that we will call an approximation rate.
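As a small illustration of how the rate in Eq.(1) can be read off from data, the sketch below fits the slope of log-error against log-weights. The data are synthetic and hypothetical (generated exactly from a power law), and the helper name `estimate_rate` is introduced here only for illustration.

```python
import math

def estimate_rate(weights, errors):
    """Least-squares slope of log(error) against log(W), negated to give p."""
    xs = [math.log(w) for w in weights]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den

# Hypothetical data following error = 3 * W^(-0.5) exactly.
weights = [10, 100, 1000, 10000]
errors = [3.0 * w ** -0.5 for w in weights]
rate = estimate_rate(weights, errors)
```

On real approximation errors the fit only estimates an empirical exponent; the rates discussed in this paper are worst-case statements over the whole class.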

There are several important general ideas explaining which approximation rates $p$ we can reasonably expect in Eq.(1). In the context of abstract approximation theory, we can forget (for a moment) about the network-based implementation of $\widetilde{f}_W$ and just think of it as some approximate parameterization of $F$ by vectors $\mathbf{w}\in\mathbb{R}^W$. Let us view the approximation process $f\mapsto\widetilde{f}_W$ as a composition of the weight assignment map $f\mapsto\mathbf{w}_f\in\mathbb{R}^W$ and the reconstruction map $\mathbf{w}_f\mapsto\widetilde{f}_W\in\mathcal{F}$, where $\mathcal{F}$ is the full normed space containing $F$. If both the weight assignment and reconstruction maps were linear, and so their composition, the l.h.s. of Eq.(1) could be estimated by the linear $W$-width of the set $F$ (see constrappr96). For a Sobolev ball of $d$-variate functions of smoothness $r$, the linear $W$-width is asymptotically $\sim W^{-r/d}$, suggesting the approximation rate $p=\frac{r}{d}$. Remarkably, this argument extends to non-linear weight assignment and reconstruction maps under the assumption that the weight assignment is continuous. More precisely, it was proved in continuous that, under this assumption, $p$ in Eq.(1) cannot be larger than $\frac{r}{d}$.

An even more important set of ideas is related to estimates of Vapnik-Chervonenkis dimensions of deep neural networks. The concept of expressiveness in terms of VC-dimension (based on finite set shattering) is weaker than expressiveness in terms of uniform approximation, but upper bounds on the VC-dimension directly imply upper bounds on feasible approximation rates. In particular, the VC-dimension of networks with piecewise-polynomial activations is $O(W^2)$ (goldberg1995bounding), which implies that $p$ cannot be larger than $\frac{2r}{d}$ (note the additional factor 2 coming from the power 2 in the VC bound). We refer to the book anthony2009neural for a detailed exposition of this and related results.

Returning to approximations with networks, the above arguments suggest that the rate $p$ in Eq.(1) can be up to $\frac{r}{d}$ assuming the continuity of the weight assignment, and up to $\frac{2r}{d}$ without assuming the continuity, but assuming a piecewise-polynomial activation function such as ReLU. We then face the constructive problem of showing that these rates can indeed be fulfilled by a network computation. One standard general strategy of proving the rate $p=\frac{r}{d}$ is based on polynomial approximations of $f$ (in particular, via the Taylor expansion). A survey of early results along this line for networks with a single hidden layer and suitable activation functions can be found in pinkus1999review. An interesting aspect of piecewise-linear activations such as ReLU is that the rate $p=\frac{r}{d}$ cannot be achieved with shallow networks, but can be achieved with deeper networks implementing approximate multiplication and polynomials (yarsawtooth; liang2016why; petersen2018optimal; safran2017depth).

It was shown in yaropt that ReLU networks can also achieve rates $p$ beyond $\frac{r}{d}$. The result of yaropt is stated in terms of the modulus of continuity of $f$; when restricted to Hölder functions with constant $r\le 1$, it implies that on such functions ReLU networks can provide rates $p$ in the interval $(\frac{r}{d},\frac{2r}{d}]$, in agreement with the mentioned upper bound $\frac{2r}{d}$. The construction is quite different from the case $p=\frac{r}{d}$ and has a “coding theory” rather than “analytic” flavor. The central idea is to divide the domain $[0,1]^d$ into suitable patches and encode an approximation to $f$ in each patch by a single network weight using a binary-type representation. Then, the network computes the approximation $\widetilde{f}(\mathbf{x})$ by finding the relevant weight and decoding it using the bit extraction technique of bartlett1998almost. In agreement with continuous approximation theory and existing VC bounds, the construction inherently requires discontinuous weight assignment (as a consequence of coding finitely many values) and network depth (necessary for the bit extraction part). In this sense, at least in the case of $r\le 1$, one can distinguish two qualitatively different “approximation phases”: the shallow continuous one corresponding to $p=\frac{r}{d}$ (and lower values), and the deep discontinuous one corresponding to $p\in(\frac{r}{d},\frac{2r}{d}]$. It was shown in petersen2018optimal; voigtlaender2019approximation that the shallow rate $p=\frac{r}{d}$, but not faster rates, can be achieved if the network weights are discretized with the precision of $O(\log(1/\epsilon))$ bits, where $\epsilon$ is the approximation accuracy.
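The coding idea described above can be illustrated with a simplified sketch: several binary values are packed into one number, and then recovered one at a time by repeated doubling and thresholding. This is not the construction of bartlett1998almost itself. The exact comparison `w >= 1` stands in for the ReLU-based threshold a network would use, exact rational arithmetic replaces real-valued weights, and the names `encode_bits` and `decode_bits` are hypothetical.

```python
from fractions import Fraction

def encode_bits(bits):
    """Pack 0/1 values into one number in [0, 1): w = sum_i b_i * 2^-(i+1)."""
    return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

def decode_bits(w, n):
    """Extract n bits sequentially: one threshold and one affine update per step."""
    bits = []
    for _ in range(n):
        w = 2 * w
        b = 1 if w >= 1 else 0   # stand-in for a ReLU-based threshold
        bits.append(b)
        w = w - b
    return bits

patch_values = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical per-patch bits
weight = encode_bits(patch_values)
decoded = decode_bits(weight, len(patch_values))
```

The loop makes the depth requirement visible: reading the last bit takes as many sequential steps as there are bits stored in the weight.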

#### Contribution of this paper.

The developments described above leave many questions open. One immediate question is whether and how the deep discontinuous approximation phase generalizes to higher values of smoothness ($r>1$). Another natural question is how much the network architectures providing the maximal rate $p=\frac{2r}{d}$ depend on the smoothness class. Yet another question is how sensitive the phase diagram is with respect to changing ReLU to other activation functions. In the present paper we resolve some of these questions. Our contribution is three-fold:

• In Section 3, we prove that the approximation phase diagram indeed generalizes to arbitrary $r>0$, with the deep discontinuous phase occupying the region $\frac{r}{d}<p\le\frac{2r}{d}$.

• In Section 4, we prove that the standard fully-connected architecture of a sufficiently large constant width $H$ only depending on the dimension $d$, say $H=2d+10$, can serve for implementing approximations that are asymptotically almost optimal (with rate $p=\frac{2r}{d}$), up to a logarithmic correction. This can be described as a phenomenon of “universal adaptivity to smoothness” exhibited by such architectures.

• In Section 5, we discuss how the ReLU phase diagram can change if ReLU is replaced or supplemented by other activation functions. In particular, we show that the ReLU-infeasible region $p>\frac{2r}{d}$ is fully feasible if the network is allowed to include the sine function in addition to ReLU. Our key observation leading to this result is that the networks containing both ReLU and $\sin$ can implement lookup operations more efficient than the sequential lookup provided by the bit extraction technique of bartlett1998almost.

## 2 Preliminaries

#### Smooth functions.

The paper revolves around what we informally describe as “functions of smoothness $r$”, for any $r>0$. It is convenient to precisely define them as follows. If $r$ is integer, we consider the standard Sobolev space $W^{r,\infty}([0,1]^d)$ with the norm

$$\|f\|_{W^{r,\infty}([0,1]^d)}=\max_{\mathbf{k}:|\mathbf{k}|\le r}\operatorname*{ess\,sup}_{\mathbf{x}\in[0,1]^d}|D^{\mathbf{k}}f(\mathbf{x})|.$$

Here denotes the (weak) partial derivative of . For an the derivatives of order exist in the strong sense and are continuous. The derivatives of order are Lipschitz, and can be upper- and lower-bounded in terms of the Lipschitz constants of these derivatives.

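In the case of non-integer $r$, we consider Hölder spaces that provide a natural interpolation between the above Sobolev spaces. For any integer $k\ge 0$ and $0<\alpha\le 1$, we define the Hölder space $C^{k,\alpha}([0,1]^d)$ as a subspace of $k$ times continuously differentiable functions having a finite norm

$$\|f\|_{C^{k,\alpha}([0,1]^d)}=\max\Big\{\|f\|_{W^{k,\infty}([0,1]^d)},\ \max_{\mathbf{k}:|\mathbf{k}|=k}\ \sup_{\mathbf{x},\mathbf{y}\in[0,1]^d,\ \mathbf{x}\ne\mathbf{y}}\frac{|D^{\mathbf{k}}f(\mathbf{x})-D^{\mathbf{k}}f(\mathbf{y})|}{\|\mathbf{x}-\mathbf{y}\|^{\alpha}}\Big\}.$$

At $\alpha=1$ the norm in $C^{k,1}([0,1]^d)$ is equivalent to the norm in $W^{k+1,\infty}([0,1]^d)$. Given a non-integer $r>0$, we define “$r$-smooth functions” as those belonging to $C^{\lfloor r\rfloor,\,r-\lfloor r\rfloor}([0,1]^d)$, where $\lfloor\cdot\rfloor$ is the floor function. We choose the sets $F$ appearing in Eq.(1) to be unit balls in the Sobolev spaces $W^{r,\infty}([0,1]^d)$ for integer $r$ or in the Hölder spaces $C^{\lfloor r\rfloor,\,r-\lfloor r\rfloor}([0,1]^d)$ for non-integer $r$; we denote these balls by $F_{r,d}$.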
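To make the Hölder quotient in the norm above concrete, here is a minimal one-dimensional sketch with $k=0$. It estimates the supremum of $|f(x)-f(y)|/|x-y|^\alpha$ over a uniform grid, which only gives a lower estimate of the true supremum. The test function $\sqrt{x}$ and the helper name `holder_seminorm` are choices made for this illustration.

```python
import math

def holder_seminorm(f, alpha, n=200):
    """Largest |f(x) - f(y)| / |x - y|^alpha over a uniform grid on [0, 1]."""
    pts = [i / n for i in range(n + 1)]
    best = 0.0
    for i, x in enumerate(pts):
        for y in pts[i + 1:]:
            best = max(best, abs(f(x) - f(y)) / (y - x) ** alpha)
    return best

sqrt_half = holder_seminorm(math.sqrt, 0.5)   # sqrt is 1/2-Hölder on [0, 1]
sqrt_lip = holder_seminorm(math.sqrt, 1.0)    # not Lipschitz: grows as n increases
```

With $\alpha=\frac12$ the quotient stays bounded by 1 (attained at $x=0$), while with $\alpha=1$ the grid estimate grows with the resolution, reflecting that $\sqrt{x}$ has smoothness $r=\frac12$ but not $r=1$ in the sense defined above.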

#### Neural networks.

We consider conventional feedforward neural networks in which each hidden unit performs a computation of the form $\sigma(\sum_{k=1}^{K}w_kz_k+h)$, where $z_k$ are input signals coming from some of the previous units, and $w_k$ and $h$ are the weights associated with this unit. In addition to input units and hidden units, the network is assumed to have a single output unit performing a computation similar to that of hidden units, but without the activation function. In Sections 3 and 4 we assume that the activation function is ReLU: $\sigma(x)=\max(0,x)$.

In the general results of Sections 3 and 5 we do not make any special connectivity assumptions about the architecture. On the other hand, in Section 4 we consider a particular family of architectures in which the hidden units are divided into a sequence of layers, with each layer having a constant number of units. Two units are connected if and only if they belong to neighboring layers. The input units are connected to the units of the first hidden layer and only to them; the output unit is connected to the units of the last hidden layer, and only to them. We refer to this as a standard deep fully-connected architecture of constant width.
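The standard deep fully-connected architecture of constant width can be sketched in a few lines of plain Python. The parameters below are random placeholders, and the helper names (`init_fixed_width`, `forward`, `count_weights`) are hypothetical; the sketch only illustrates the layer structure and how the number of weights $W$ is counted (connection weights plus one bias per unit).

```python
import random

def relu(v):
    return [max(0.0, t) for t in v]

def affine(weights, biases, v):
    return [sum(w * t for w, t in zip(row, v)) + b for row, b in zip(weights, biases)]

def init_fixed_width(d, width, depth, seed=0):
    """Random parameters for `depth` hidden layers of equal `width` and one output unit."""
    rng = random.Random(seed)
    sizes = [d] + [width] * depth + [1]
    return [([[rng.uniform(-1, 1) for _ in range(m)] for _ in range(n)],
             [rng.uniform(-1, 1) for _ in range(n)])
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    v = list(x)
    for weights, biases in params[:-1]:
        v = relu(affine(weights, biases, v))
    weights, biases = params[-1]           # output unit: no activation
    return affine(weights, biases, v)[0]

def count_weights(params):
    return sum(len(row) for weights, _ in params for row in weights) + \
           sum(len(biases) for _, biases in params)

params = init_fixed_width(d=2, width=5, depth=3)
W = count_weights(params)
y = forward(params, [0.3, 0.7])
```

At fixed width the count grows linearly with the number of hidden layers, so for this family the number of weights and the depth are proportional.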

#### Approximations.

In the accuracy-complexity relation (1) we assume that approximations $\widetilde{f}_W$ are obtained by assigning $f$-dependent weights to a network architecture $\eta_W$ common to all $f\in F$. In particular, this allows us to speak of the weight assignment map $G_W:f\mapsto\mathbf{w}_f\in\mathbb{R}^W$ associated with a particular architecture $\eta_W$. We say that the weight assignment is continuous if this map is continuous with respect to the topology of uniform norm on $F$. We will be interested in considering different approximation rates $p$, and we interpret Eq.(1) in a precise way by saying that a rate $p$ can be achieved iff

$$\inf_{\eta_W,G_W}\ \sup_{f\in F}\|f-\widetilde{f}_{\eta_W,G_W}\|_\infty\le c_{F,p}W^{-p}, \tag{2}$$

where $\widetilde{f}_{\eta_W,G_W}$ denotes the approximation obtained by the weight assignment $G_W$ in the architecture $\eta_W$. Here and in the sequel we generally denote by $c_{a,b,\ldots}$ various positive constants possibly dependent on $a,b,\ldots$ (typically on smoothness $r$ and dimension $d$). Throughout the paper, we will treat $r$ and $d$ as fixed parameters in the asymptotic accuracy-complexity relations.

## 3 The phase diagram of ReLU networks

Our first main result is the phase diagram of approximation rates for ReLU networks, shown in Fig.1. The “shallow continuous phase” corresponds to $p=\frac{r}{d}$, the “deep discontinuous phase” corresponds to $\frac{r}{d}<p\le\frac{2r}{d}$, and the infeasible region corresponds to $p>\frac{2r}{d}$. Our main new contribution is the exact location of the deep discontinuous phase for all $r>0$. The precise meaning of the diagram is explained by the following series of theorems (partly established in earlier works).

###### Theorem 3.1 (The shallow continuous phase).

The approximation rate $p=\frac{r}{d}$ in Eq.(2) can be achieved by ReLU networks having $O(\log W)$ layers, and with a continuous weight assignment.

This result was proved in yarsawtooth in a slightly weaker form, for integer $r$ and with error $O(W^{-r/d}\log^{r/d}W)$ instead of $O(W^{-r/d})$. The proof is based on ReLU approximations of local Taylor expansions of $f$. The extension to non-integer $r$ is immediate thanks to our definition of general $r$-smoothness in terms of Hölder spaces. The logarithmic factor can be removed by observing that the computation of the approximate Taylor polynomial can be isolated from determining its coefficients and hence only needs to be implemented once in the network rather than for each local patch as in yarsawtooth (see Remark A.1; the idea of isolation of operations common to all patches is developed much further in the proof of Theorem 3.3 below, and is applicable in the special case $p=\frac{r}{d}$).

###### Theorem 3.2 (Feasibility of rates $p>\frac{r}{d}$).

1. Approximation rates $p>\frac{2r}{d}$ are infeasible for networks with piecewise-polynomial activation function and, in particular, ReLU networks;

2. Approximation rates $p>\frac{r}{d}$ cannot be achieved with continuous weight assignment;

3. If an approximation rate $p\in(\frac{r}{d},\frac{2r}{d}]$ is achieved with ReLU networks, then the number of layers $L$ in $\eta_W$ must satisfy $L\ge cW^{pd/r-1}/\log W$ for some $c>0$.

These statements follow from existing results on continuous nonlinear approximation (continuous for statement 2) and from upper bounds on VC-dimensions of neural networks (goldberg1995bounding for statement 1 and bartlett2017nearly for statement 3), see (yarsawtooth, Theorem 1) for a derivation. The extensions to arbitrary $r$ are straightforward.

The main new result in this section is the existence of approximations with $p\in(\frac{r}{d},\frac{2r}{d}]$:

###### Theorem 3.3 (The deep discontinuous phase).

For any $r>0$, any rate $p\in(\frac{r}{d},\frac{2r}{d}]$ can be achieved with deep ReLU networks with $L=O(W^{pd/r-1})$ layers.

This result was proved in yaropt in the case $r\le 1$. We generalize this to arbitrary $r$ by combining the coding-based approach of yaropt with Taylor expansions. The technical details are given in Section A, but we explain now the main ideas.

Sketch of proof. We use two length scales for the approximation: the coarser one $\frac{1}{N}$ and the finer one $\frac{1}{M}$, with $M\gg N$. We start by partitioning the cube $[0,1]^d$ into $\sim N^d$ patches (particularly, simplexes) of linear size $\sim\frac{1}{N}$ and then sub-partitioning them into $\sim M^d$ patches of linear size $\sim\frac{1}{M}$. In each of the finer $M$-patches $\Delta_M$ we approximate the function $f\in F_{r,d}$ by a Taylor polynomial $P_{\Delta_M}$ of degree $\lceil r\rceil-1$. Then, from the standard Taylor remainder bound, we have $|f(\mathbf{x})-P_{\Delta_M}(\mathbf{x})|=O(M^{-r})$ on $\Delta_M$. This shows that if $\epsilon$ is the required approximation accuracy, we should choose $M\sim\epsilon^{-1/r}$.

Now, if we tried to simply save the Taylor coefficients for each $M$-patch in the weights of the network, we would need at least $\sim M^d$, i.e. $\sim\epsilon^{-d/r}$, weights in total. This corresponds to the classical rate $p=\frac{r}{d}$. In order to save on the number of weights and achieve higher rates, we collect Taylor coefficients of all $M$-patches lying in one $N$-patch and encode them in a single encoding weight associated with this $N$-patch. Given $p$, we choose $N\sim\epsilon^{-1/(pd)}$ so that in total we create $\sim N^d\sim\epsilon^{-1/p}$ encoding weights, each containing information about $\sim(M/N)^d$, i.e. $\sim\epsilon^{-(d/r-1/p)}$, Taylor coefficients. The number of encoding weights then matches the desired complexity $W\sim\epsilon^{-1/p}$.

To encode the Taylor coefficients we actually need to discretize them first. Note that to reconstruct the Taylor approximation in an -patch with accuracy we need to know the Taylor coefficients of order with precision . We implement an efficient sequential encoding/decoding procedure for the approximate Taylor coefficients of orders for all -patches lying in the given -patch . Specifically, choose some sequence of the -patches in so that neighboring elements of the sequence correspond to neighboring patches. Then, the order- Taylor coefficients at can be determined with precision from the respective and higher order coefficients at using predefined discrete values. This allows us to encode all the approximate Taylor coefficients in all the -patches of by a single -bit number.

To reconstruct the approximate Taylor polynomial for a particular input , we sequentially reconstruct all the coefficients for the sequence , and, among them, select the coefficients at the patch . The sequential reconstruction can be done by a deep subnetwork with the help of the bit extraction technique bartlett1998almost . The depth of this subnetwork is proportional to the number of -patches in , i.e. , which is according to our definitions of and . If then and hence this depth is smaller or comparable to the number of encoding weights, However, if then the depth is asymptotically larger than the number of encoding weights, so the total number of weights is dominated by the depth of the decoding subnetwork, which is , and the approximation becomes less efficient than at . This explains why is the boundary of the feasible region.
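The trade-off between the number of encoding weights and the decoding depth can be checked with a few lines of arithmetic. The sketch below assumes the scalings $M\sim\epsilon^{-1/r}$ and $N\sim\epsilon^{-1/(pd)}$ with all constants set to 1; these scalings are an assumption of this illustration, chosen to be consistent with the Taylor remainder bound and with the target complexity $W\sim\epsilon^{-1/p}$, and the function name `scales` is hypothetical.

```python
def scales(eps, r, d, p):
    """Patch scales under the assumed scalings M ~ eps^(-1/r), N ~ eps^(-1/(p d))."""
    M = eps ** (-1.0 / r)              # fine scale: Taylor remainder ~ M^(-r) ~ eps
    N = eps ** (-1.0 / (p * d))        # coarse scale: N^d ~ eps^(-1/p) encoding weights
    encoding_weights = N ** d
    patches_per_weight = (M / N) ** d  # assumed order of the sequential decoding depth
    return M, N, encoding_weights, patches_per_weight

r, d, eps = 2.0, 4, 1e-4
_, _, w_shallow, depth_shallow = scales(eps, r, d, p=r / d)
_, _, w_deep, depth_deep = scales(eps, r, d, p=2 * r / d)
```

Under these assumptions, at $p=\frac{r}{d}$ each coarse patch contains a single fine patch and no sequential decoding is needed, while at $p=\frac{2r}{d}$ the decoding depth becomes comparable to the number of encoding weights, which matches the boundary described in the sketch of proof.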

Once the (approximate) Taylor coefficients at are determined, an approximate Taylor polynomial can be computed by a ReLU subnetwork implementing efficient approximate multiplications yarsawtooth .
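A sketch of ReLU-based approximate multiplication in the spirit of yarsawtooth is given below: $x^2$ on $[0,1]$ is approximated by subtracting scaled compositions of a piecewise-linear “tooth” function from $x$, and a product is obtained from squares by polarization. The restriction to inputs in $[0,1]$, the particular polarization identity, and the function names are choices made for this illustration, not a reproduction of the cited construction.

```python
def relu(t):
    return max(0.0, t)

def tooth(x):
    """Piecewise-linear tooth on [0, 1] (0 -> 0, 1/2 -> 1, 1 -> 0) from three ReLU units."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, m):
    """Approximate x^2 on [0, 1] by x - sum_{s=1..m} g_s(x) / 4^s, g_s = s-fold tooth."""
    g, out = x, x
    for s in range(1, m + 1):
        g = tooth(g)
        out -= g / 4 ** s
    return out

def approx_product(x, y, m):
    """Uses xy = 2*((x+y)/2)^2 - x^2/2 - y^2/2 for x, y in [0, 1]."""
    return (2 * approx_square((x + y) / 2, m)
            - approx_square(x, m) / 2
            - approx_square(y, m) / 2)
```

The function `approx_square(x, m)` is the piecewise-linear interpolant of $x^2$ on a uniform grid with $2^m+1$ points, so its error is at most $2^{-2m-2}$ while the number of ReLU units grows only linearly in $m$.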

## 4 Fixed-width networks and their universal adaptivity to smoothness

The network architectures constructed in the proof of Theorem 3.3 to provide the faster rates are relatively complex and $r$-dependent. We can ask if such rates can be supported by some simple conventional architectures. It turns out that we can achieve nearly optimal rates with standard fully-connected architectures of sufficiently large constant widths only depending on $d$:

###### Theorem 4.1.

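Let $\eta_W$ be standard fully-connected ReLU architectures of width $2d+10$ with $W$ weights. Then

$$\inf_{G_W}\ \sup_{f\in F_{r,d}}\|f-\widetilde{f}_{\eta_W,G_W}\|_\infty\le c_{r,d}W^{-2r/d}\log^{2r/d}W. \tag{3}$$

The rate in Eq.(3) differs from the optimal rate with $p=\frac{2r}{d}$ only by the logarithmic factor $\log^{2r/d}W$.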

An interesting result proved in hanin2017approximating; lu2017expressive (see also lin2018resnet for a related result for ResNets) states that standard fully-connected ReLU architectures of a fixed width $H$ can approximate any $d$-variate continuous function if and only if $H\ge d+1$. Theorem 4.1 shows that with slightly larger widths, such networks can not only adapt to any function, but also adapt to its smoothness. The results of hanin2017approximating; lu2017expressive also show that Theorem 4.1 cannot hold with $d$-independent widths.

In the case $r\le 1$, it was proved in yaropt that standard networks of width $2d+10$ can achieve the highest feasible rate $p=\frac{2r}{d}$.

Details of the proof of Theorem 4.1 are given in Section B; we explain now the main idea.

Sketch of proof. The proof is similar to the proof of Theorem 3.3, but requires a different implementation of the reconstruction of $\widetilde{f}(\mathbf{x})$ from encoded Taylor coefficients. The network constructed in Theorem 3.3 traverses $M$-knots of an $N$-patch and computes Taylor coefficients at the new $M$-knot by updating the coefficients at the previous $M$-knot. This computation can be arranged within a fixed-width network, but its width depends on $r$, since we need to store the coefficients from the previous step, and the number of these coefficients grows with $r$ (see yaropt for the constant-width fully-connected implementation in the case of $r\le 1$, in which the Taylor expansion degenerates into the 0-order approximation).

To implement the approximation using an -independent network width, we can decode the Taylor coefficients afresh at each traversed -knot, instead of updating them. This is slightly less efficient and leads to the additional logarithmic factor in Eq.(3), as can be seen in the following way. First, since we need to reconstruct the Taylor coefficients of degree with precision we need to store bits for each coefficient in the encoding weight. Since this means a -fold increase in the depth of the decoding subnetwork. Moreover, an approximate Taylor polynomial must be computed separately for each -patch. Multiplications can be implemented with accuracy by a fixed-width ReLU network of depth (see yarsawtooth ). Computation of an approximate polynomial of the components of the input vector can be arranged as a chain of additions and multiplications in a network of constant width independent of the degree of the polynomial – assuming the coefficients of the polynomial are decoded from the encoding weight and supplied as they become required. This shows that we can achieve accuracy with a network of constant width independent of at the cost of taking the larger depth (instead of simply as in Theorem 3.3). Since is proportional to the depth, we get . By inverting this relation, we obtain Eq.(3).

## 5 Non-piecewise-polynomial activation functions

We discuss now how much the ReLU phase diagram depends on the activation function. We note first that some exotic activation functions are well known to achieve much higher rates than those discussed in the previous sections. For example, a result of maiorov1999lower based on the Kolmogorov Superposition Theorem (constrappr96, p. 553) shows the existence of a strictly increasing analytic activation function $\sigma$ such that any $f\in C([0,1]^d)$ can be approximated with arbitrary accuracy by a three-layer $\sigma$-network with only $9d+3$ units.

On the other hand, note that statement 1 of Theorem 3.2 holds not only for ReLU, but for any piecewise-polynomial activation functions, so that the region $p>\frac{2r}{d}$ remains infeasible for any such activation. Also, since all piecewise-linear activation functions are essentially equivalent (see e.g. (yarsawtooth, Proposition 1)), the phase diagram for any piecewise-linear activation is the same as for ReLU.

A remarkable class of functions that can be seen as a far-reaching generalization of polynomials are the Pfaffian functions khovanskii. Level sets of these functions admit bounds on the number of their connected components that are similar to analogous bounds for algebraic sets, and this is a key property in establishing upper bounds on VC dimensions of networks. In particular, it was proved in karpinski1997polynomial that the VC-dimension of networks with the standard sigmoid activation function $\sigma(x)=1/(1+e^{-x})$ is upper-bounded by $O(W^2k^2)$, where $k$ is the number of computation units (see also (anthony2009neural, Theorem 8.13)). Since $k\le W$, the bound $O(W^2k^2)$ implies the slightly weaker bound $O(W^4)$. Then, by mimicking the proof of statement 1 of Theorem 3.2 and replacing there the bound $O(W^2)$ for piecewise-polynomial activation by the bound $O(W^4)$ for the standard sigmoid activation, we find that the approximation rates $p>\frac{4r}{d}$ are infeasible for networks with the standard sigmoid activation function. It appears that there remains a significant gap between the upper and lower VC dimension bounds for networks with $\sigma(x)=1/(1+e^{-x})$ (see a discussion in (anthony2009neural, Chapter 8)). Likewise, we do not know if the approximation rates up to $p=\frac{4r}{d}$ are indeed feasible with this $\sigma$.

We note, at the same time, that the network expressiveness in terms of covering numbers can be upper bounded for any Lipschitz activation function if the network weights are bounded, see (anthony2009neural, , Theorem 14.5). Assuming moderately growing weights, this implies (see Section C).

Our main result in this section is the proof that the ReLU-infeasible sector $p>\frac{2r}{d}$ becomes fully feasible if we allow some hidden units of the network to have the activation function $\sin$ and make no restriction on the weights. Moreover, the approximation rate becomes exponential in a power of $W$:

###### Theorem 5.1.

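Let $\eta_W$ be feed-forward networks with $W$ weights containing both ReLU and $\sin$ activation functions. Then, for any $r>0$ and $d$,

$$\inf_{\eta_W,G_W}\ \sup_{f\in F_{r,d}}\|f-\widetilde{f}_{\eta_W,G_W}\|_\infty\le\exp(-c_{r,d}W^{1/2}) \tag{4}$$

with some $r,d$-dependent constant $c_{r,d}>0$.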

On the one hand, this result is not very surprising since $\sin$ has level sets with infinitely many connected components. It is well-known, for example, that the family of classifiers $x\mapsto\operatorname{sgn}(\sin(ax))$, where $a\in\mathbb{R}$, has an infinite VC-dimension. On the other hand, note that our network can be considered as a generalization of the Fourier series expansion of $f$, which can be viewed as a neural network with one hidden layer, the $\sin$ activation function, and predefined weights in the first layer. Standard convergence bounds for Fourier series (see e.g. jackson1930theory) correspond to the shallow continuous rate $p=\frac{r}{d}$, in agreement with the fact that the conventional assignment of Fourier coefficients is linear in $f$. Thus, adding depth and the ReLU activation to the Fourier expansion makes it substantially more expressive.

In any case, it is interesting to pinpoint the particular constructive mechanism that leads to the very fast approximation rates of Theorem 5.1. Our proof is based on the observation that networks including both ReLU and can implement an efficient, dichotomy-based lookup. We sketch the main idea of the proof; see details in Section D.

Sketch of proof. Recall the concepts of coarser partition on the scale and the finer partition on the scale used in the proofs of Theorem 3.3 and 4.1. In those theorems, both and were with some constant powers . In contrast, we choose now , and we’ll set to grow much faster (roughly exponentially) with : this will be possible thanks to the much more efficient decoding available with the activation.

Specifically, note first that we can implement an almost perfect approximation of the parity function using a constant size networks, by computing with a large and then thresholding the result at 1 and using ReLU operations (the approximation only fails in small neighborhoods of the integer points). If the cube is partitioned into cubic -patches, we can apply rescaled versions of coordinate-wise to create a binary dictionary of these patches. Specifically, we can construct a network of size that maps a given to a size- binary sequence encoding the place of the patch in the cube , with . We call this network the patch-encoder.
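The parity approximation and the patch-encoder can be sketched as follows. Several details are assumptions of this illustration, since the exact formulas are not recoverable from the text above: the parity function is taken to be $(-1)^{\lfloor x\rfloor}$, it is approximated by clamping a large multiple of $\sin(\pi x)$ to $[-1,1]$ with two ReLU units, and the patch is identified by the first binary digits of a single coordinate.

```python
import math

def relu(t):
    return max(0.0, t)

def clip_pm1(t):
    """Clamp t to [-1, 1] using two ReLU units."""
    return relu(t + 1.0) - relu(t - 1.0) - 1.0

def approx_parity(x, big=1e6):
    """Approximates (-1)^floor(x); inexact only within about 1/big of an integer."""
    return clip_pm1(big * math.sin(math.pi * x))

def patch_bits(x, K):
    """First K binary digits of x in [0, 1): digit k is the parity of floor(2^k x)."""
    return [round((1 - approx_parity(2 ** k * x)) / 2) for k in range(1, K + 1)]

bits = patch_bits(0.3, 3)
patch_index = int("".join(map(str, bits)), 2)
```

Each digit costs a constant number of units, so locating one of $2^K$ intervals along a coordinate takes a network whose size is proportional to $K$ rather than to $2^K$.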

Given a function we approximate it by a function which is constant in each -patch. Suppose for simplicity and without loss of generality that the smoothness then this approximation has accuracy . Let be the value that the approximation returns on the patch . It is sufficient to define with precision . Consider the binary expansion of that provides this precision: where and . Suppose that for each we can construct a network that maps each patch to the corresponding bit . Summing these patch-classifiers with coefficients , we then reconstruct the full approximation .

We have thus reduced the task to efficiently implementing an arbitrary binary classifier on the -partition of The patch-encoder constructed above efficiently encodes each -patch by a binary -bit sequence. We can then think of the classifier as an assignment that must be implemented by our network. We show below that this can be done by a size- network, with the assignment encoded in a single weight . The full number of network weights (including the patch-encoder and the patch-classifiers on all scales) can then be bounded by i.e. . The relations and then yield (with ), as claimed in Eq.(4).

To make these arguments fully rigorous, we need to handle the issue of our approximation to the parity function becoming invalid near the boundaries of the patches. This is done in Section D using partitions of unity; the resulting complications do not affect the asymptotic.

We explain now how an arbitrary assignment can be implemented by a network of size with a single encoding weight . Let us define two sequences, and :

$$l_1=0.1,\quad a_1=10,\quad l_k=\frac{l_{k-1}}{a_k},\quad a_k=\frac{4\pi}{l_{k-1}} \tag{5}$$

Consider $K$ iterations in which each step can be either the identity function $\mathrm{Id}$ or $\sin(a_k\,\cdot)$, with some initial value $w^*$. For each $\mathbf{z}\in\{0,1\}^K$, let us define $H_{K,w^*}(\mathbf{z})$ as the sign of the value obtained by substituting the respective functions:

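$$H_{K,w^*}(\mathbf{z})=\operatorname{sgn}\circ\begin{cases}\mathrm{Id}, & z_1=0\\ \sin(a_1\,\cdot), & z_1=1\end{cases}\circ\begin{cases}\mathrm{Id}, & z_2=0\\ \sin(a_2\,\cdot), & z_2=1\end{cases}\circ\ldots\circ\begin{cases}\mathrm{Id}, & z_K=0\\ \sin(a_K\,\cdot), & z_K=1\end{cases}(w^*)$$

###### Lemma 5.1.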

For any assignment $A:\{0,1\}^K\to\{-1,1\}$ there exists $w^*$ such that $H_{K,w^*}(\mathbf{z})=A(\mathbf{z})$ for all $\mathbf{z}\in\{0,1\}^K$.

###### Proof.

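Proof by induction on $K$, but of a slightly sharper statement: the desired $w^*$ not only exist, but fill (at least) an interval of length $l_K$.

The base $K=1$ is immediate. Suppose we have proved the statement for $K-1$. Given an assignment $A:\{0,1\}^K\to\{-1,1\}$, consider it as a pair of assignments $A^{(0)},A^{(1)}:\{0,1\}^{K-1}\to\{-1,1\}$, corresponding to $z_K=0$ and $z_K=1$. By the hypothesis, we can find two intervals $I^{(0)}_{K-1}$ and $I^{(1)}_{K-1}$ of length $l_{K-1}$ such that $H_{K-1,w}(\mathbf{z})=A^{(0)}(\mathbf{z})$ and $H_{K-1,w'}(\mathbf{z})=A^{(1)}(\mathbf{z})$ for all $w\in I^{(0)}_{K-1}$, $w'\in I^{(1)}_{K-1}$ and $\mathbf{z}\in\{0,1\}^{K-1}$. Consider the set

$$I=\{w\in\mathbb{R}: w\in I^{(0)}_{K-1}\text{ and }\sin(a_Kw)\in I^{(1)}_{K-1}\}.$$

Then for any $w\in I$, we have the desired property $H_{K,w}(\mathbf{z})=A(\mathbf{z})$. We need to show now that $I$ contains an interval of length $l_K$. This follows from Eq.(5) since $|\frac{d}{dw}\sin(a_Kw)|\le a_K$ and since $\sin(a_K\,\cdot)$ has the period twice as small as the length of $I^{(0)}_{K-1}$. ∎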

This lemma shows that the network can implement any classifier if the network can somehow branch into applying either or depending on the signal bit that is output by the patch-encoder subnetwork. This branching can be easily implemented by forming the linear combination , and also noting that a product of any and admits the ReLU implementation .
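The branching step can be illustrated with one possible ReLU identity for the product of a bounded signal and a bit. The particular identity below, valid for $x\in[-1,1]$ and $z\in\{0,1\}$, is an assumption of this sketch and may differ from the formula used in the original construction; the function names are hypothetical.

```python
import math

def relu(t):
    return max(0.0, t)

def gated(x, z):
    """x * z for x in [-1, 1] and z in {0, 1}, using two ReLU units."""
    return relu(x + z - 1.0) - relu(-x + z - 1.0)

def branch(x, z, a):
    """Identity if z == 0, sin(a * x) if z == 1, written as a sum of two gated terms."""
    return gated(x, 1 - z) + gated(math.sin(a * x), z)
```

Since the bit only switches between two bounded signals, no general-purpose multiplication subnetwork is needed for this step.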

We remark that the construction in Lemma 5.1 can be interpreted as an efficient lookup if we think of the assignment $A$ as a binary sequence of size $2^K$. In each of the $K$ network steps we divide the sequence in half, ultimately locating the desired bit in $K$ steps. We can compare this with the less efficient bit extraction procedure of bartlett1998almost (for which it is, however, sufficient to only have the ReLU activation in the network). In this latter procedure, the bits are extracted from the encoding weight one-by-one, and so the lookup requires $O(2^K)$ steps.
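Lemma 5.1 can be checked numerically for small $K$. The sketch below builds the sequences of Eq.(5), evaluates the nested composition with the innermost step governed by $z_K$, and scans a grid of candidate weights for $K=2$, recording which of the 16 possible assignments are realized. The search range $[-0.7,0.7]$, the grid step $l_2/8$ and the treatment of the sign of 0 are choices made for this illustration.

```python
import math

def sequences(K):
    """l_k, a_k as in Eq.(5): l_1 = 0.1, a_1 = 10, a_k = 4*pi/l_{k-1}, l_k = l_{k-1}/a_k."""
    l, a = [0.1], [10.0]
    for _ in range(1, K):
        a.append(4 * math.pi / l[-1])
        l.append(l[-1] / a[-1])
    return l, a

def H(z, w, a):
    """Sign of w after applying sin(a_k * .) for each k with z_k = 1, innermost k = K first."""
    v = w
    for zk, ak in reversed(list(zip(z, a))):
        if zk:
            v = math.sin(ak * v)
    return 1 if v > 0 else -1

K = 2
l, a = sequences(K)
inputs = [(z1, z2) for z1 in (0, 1) for z2 in (0, 1)]
step = l[-1] / 8              # finer than the guaranteed interval length l_K
found = {}                    # assignment (tuple of signs) -> one weight realizing it
w = -0.7
while w <= 0.7:
    found.setdefault(tuple(H(z, w, a) for z in inputs), w)
    w += step
```

Because $l_K$ shrinks very quickly with $K$, a brute-force scan like this is only practical for tiny $K$, which also illustrates how much precision the single encoding weight has to carry.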

Our results highlight a tradeoff between complexity of the architecture and complexity of network weights: optimization of the number of weights forces the weights to represent information in intricate ways. While we have not treated the topic of weight precision in this paper (cf. bolcskei2017memory; petersen2018optimal; voigtlaender2019approximation), we can give a rough estimate of how the required precision depends on discontinuous approximation rates. For ReLU networks and the arguments of Section 3 show that approximation with accuracy requires the encoding weights to contain i.e. bits. For ReLU/$\sin$ networks of Section 5, the required precision of the encoding weight can be estimated from the lengths of the intervals considered in Lemma 5.1. Using Eq.(5), we find that . Since we used , this means that should contain about bits. These estimates agree with the observation that the information required to specify a function with accuracy is bits (KolmogorovTikhomirov), and this information is uniformly distributed over the encoding weights of the network ( weights in ReLU networks or weights in ReLU/$\sin$ networks).

## Appendix A Theorem 3.3: proof details

We follow the paper [10], where Theorem 3.3 was proved for $r\le 1$, and generalize it to arbitrary $r$ using the strategy explained in Section 3. Given $p\in(\frac{r}{d},\frac{2r}{d}]$, we show that it is possible to construct a network architecture with $W$ weights and $L=O(W^{pd/r-1})$ layers which approximates every $f\in F_{r,d}$ with error $O(W^{-p})$. In Remark A.1 we deal with the case $p=\frac{r}{d}$.

We start by describing the space partition and related constructions. Then we give an overview of the network structure. Finally, we describe in more detail the network computation of the Taylor approximations, which is the main novel element of Theorem 3.3.

### A.1 Space partitions

For an integer we denote by a standard triangulation of into simplexes:

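$$\Delta_{N,\mathbf{n},\rho}=\Big\{x\in\mathbb{R}^d: 0\le x_{\rho(1)}-\frac{n_{\rho(1)}}{N}\le\dots\le x_{\rho(d)}-\frac{n_{\rho(d)}}{N}\Big\},$$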

where and is a permutation of elements. The vertices of these simplexes are the points of the grid . We call the set of all the vertices the -grid and a particular vertex an -knot. For an -knot we call the union of the simplexes it belongs to an -patch. We denote the set of all -knots by .

Let be the “spike” function defined as the continuous piecewise linear function such that:

1. is linear on every simplex from the triangulation ;

2. , for all other .

The function can be computed by a feed-forward ReLU network with weights (see [10, Section 4.2] for details). We treat as a constant, so we can say that can be computed by a network with a constant number of weights. Note that for integer and , the function is a continuous piecewise linear function which is linear in each simplex from , is equal to 1 at , and vanishes at all other N-knots of .
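A spike function with these properties can be written in closed form and checked numerically. The sketch below assumes the formula $\phi(x)=\big(\min(\min_{k\ne s}(1+x_k-x_s),\ \min_k(1+x_k),\ \min_k(1-x_k))\big)_+$, which is the form we recall from the cited construction; it should be treated as an assumption rather than a quotation, and the helper names are hypothetical.

```python
import itertools


def spike(x):
    """Piecewise linear spike: 1 at the origin, 0 at every other integer point (assumed closed form)."""
    d = len(x)
    terms = [1.0 + xk for xk in x] + [1.0 - xk for xk in x]
    terms += [1.0 + x[k] - x[s] for k in range(d) for s in range(d) if k != s]
    return max(min(terms), 0.0)  # the outer max(., 0) is a single ReLU


def shifted_spike(x, n, N):
    """The rescaled spike phi(N x - n), equal to 1 at the knot n / N."""
    return spike([N * xi - ni for xi, ni in zip(x, n)])


def sum_over_grid(x, N):
    """Sum of all shifted spikes over the knots of the grid in [0, 1]^d."""
    knots = itertools.product(range(N + 1), repeat=len(x))
    return sum(shifted_spike(x, n, N) for n in knots)


print(spike([0.0, 0.0]))                    # value at the central knot
print(shifted_spike([0.5, 0.25], (2, 1), 4))  # value at the knot (2/4, 1/4)
print(sum_over_grid([0.3, 0.7], 4))         # the shifted spikes form a partition of unity
```

Since the formula uses only minima of affine functions followed by one positive part, it is expressible with a number of ReLU units that depends on the dimension only.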

It is convenient to keep in mind the following two simple propositions:

###### Proposition A.1.

Suppose we have -knots , and corresponding numbers . Then the function

$$g(x)=\sum_{k=1}^{K}\ell_k\,\phi(Nx-\mathbf{n}_k)$$

has the following properties:

1. is linear on each simplex from ;

2. for . For other -knots , is zero: ;

3. can be computed exactly by a network with weights and layers.

###### Proposition A.2.

Suppose we have -knots , and corresponding numbers . Suppose also that the -patches associated with are disjoint. Then there exists a function with the following properties:

1. is linear on each simplex from ;

2. For , at an -patch associated with ;

3. can be computed exactly by a network with weights and layers.

###### Proof.

This follows directly from Prop. A.1. We assign the value to all -knots in the -patch associated with and apply Prop. A.1. Since the -patches of interest are disjoint, each -knot has at most one assigned value. ∎

### A.2 The filtering subgrids

Given the total number of weights , we set . We will assume without loss of generality that is an integer. We consider the triangulation of on the length scale .

It is convenient to split the -grid into disjoint subgrids with the grid spacing:

$$N_{\mathbf{q}}=\Big\{\frac{\mathbf{n}}{N}:\mathbf{n}\in(\mathbf{q}+(3\mathbb{Z})^d)\cap[0,N]^d\Big\},\quad \mathbf{q}\in\{0,1,2\}^d.$$

Clearly, each subgrid contains knots. Note that -patches associated with -knots in are disjoint. It means, in particular, that any point lies in at most one such -patch. It also means that Prop. A.2 is applicable to . We will use this observation in subsection A.3 for constructing an efficient approximation in a neighbourhood of for a single . We call the union of these -patches a domain of .
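The splitting of the grid by residues modulo 3 is easy to check on a small instance. The following sketch uses integer knot coordinates (i.e. the knots multiplied by the grid size) and hypothetical helper names; it verifies that the subgrids partition the grid and that two distinct knots of the same subgrid are at least three grid steps apart in the sup-norm, which is what makes their patches disjoint, since each patch lies within one grid step of its knot.

```python
import itertools


def subgrids(N, d):
    """Group the integer knots n in [0, N]^d by their residue vector q = n mod 3."""
    grids = {q: [] for q in itertools.product(range(3), repeat=d)}
    for n in itertools.product(range(N + 1), repeat=d):
        q = tuple(ni % 3 for ni in n)
        grids[q].append(n)
    return grids


def min_sup_distance(knots):
    """Smallest sup-norm distance (in grid steps) between two distinct knots."""
    return min(
        max(abs(a - b) for a, b in zip(n, m))
        for n, m in itertools.combinations(knots, 2)
    )


grids = subgrids(6, 2)
print(len(grids))                                     # number of subgrids, 3^d
print(sum(len(knots) for knots in grids.values()))    # every knot is in exactly one subgrid
print(min(min_sup_distance(k) for k in grids.values()))  # separation within a subgrid
```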

We compute the full approximation as a sum

$$\tilde{f}(x)=\sum_{\mathbf{q}\in\{0,1,2\}^d}\tilde{w}_{\mathbf{q}}(x)\,\tilde{f}_{\mathbf{q}}(x). \tag{6}$$

The function approximates with error for every in the domain of . For outside the domain of it computes some garbage value. We describe in subsection A.3. The final approximation is a weighted sum of with weights . We choose the functions so that vanishes outside the domain of and

$$\sum_{\mathbf{q}\in\{0,1,2\}^d}\tilde{w}_{\mathbf{q}}(x)\equiv 1.$$

It follows that is a weighted sum (with weights with the sum 1) of terms approximating with error . Consequently, approximates with error .
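The step from the per-subgrid errors to the error of the full sum can be spelled out as follows. This is a sketch in which $\epsilon$ is an assumed symbol for the per-subgrid error bound; it uses the identity (6), the fact that the weights sum to 1, and the nonnegativity of the weights (they are sums of nonnegative spike functions). Terms with $x$ outside the domain of a subgrid drop out because the corresponding weight vanishes there, so the garbage values never contribute.

```latex
\begin{aligned}
|\tilde f(x)-f(x)|
  &= \Big|\sum_{\mathbf q\in\{0,1,2\}^d}\tilde w_{\mathbf q}(x)\,\big(\tilde f_{\mathbf q}(x)-f(x)\big)\Big| \\
  &\le \sum_{\mathbf q:\ \tilde w_{\mathbf q}(x)\neq 0}\tilde w_{\mathbf q}(x)\,\big|\tilde f_{\mathbf q}(x)-f(x)\big| \\
  &\le \epsilon\sum_{\mathbf q\in\{0,1,2\}^d}\tilde w_{\mathbf q}(x)=\epsilon .
\end{aligned}
```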

The function is given by applying Prop. A.1 to the -knots from with all values equal to 1. Clearly, vanishes outside the domain of . The sum is linear on each simplex from and is equal to 1 at all -knots, because each -knot belongs to exactly one set . Consequently, this sum is equal to 1 for every . It follows from Prop. A.1 that the network implementing has weights and layers.

The multiplication is implemented approximately, with error , by the network given by [6, Proposition 3] and requires additional weights.

### A.3 The approximation for a subgrid

Here we describe how we construct for a single . Recall that computes an accurate approximation of only on the domain of .

For any -knot in we consider a cube with center at and edge :

$$\Big\{x\in\mathbb{R}^d:\max_{1\le i\le d}\Big|x_i-\frac{n_i}{N}\Big|\le\frac{1}{N}\Big\}.$$

We call such a cube an -cube and denote it by . Note that .

Recall that the domain of consists of disjoint -patches associated with -knots from . Each from the domain of belongs to exactly one such -patch. We call this patch the -patch for and the associated -knot the -knot for . Let us denote the -knot for by .

We set . Note that and, therefore, we need to construct an approximation of error . We will assume without loss of generality that is an integer and is divisible by . Then is a subpartition of . We define -knots and -patches similarly to -knots and -patches. We denote the set of all -knots by . Note that there are -knots in each -patch and -cube. See Fig. 2 for an illustration of all described constructions.

Suppose that lies in an -patch associated with an -knot . Consider the Taylor polynomial at of order . Standard bounds for the remainder of the Taylor polynomial imply that it approximates with error uniformly for . The Taylor polynomial at (and in fact any polynomial) can be implemented with error by a network with weights and layers. We refer the reader to [6, Proposition 3] and the proof of [6, Theorem 1] for details.

We can approximate with error by a weighted sum of Taylor polynomials at all -knots:

$$\tilde{f}(x)=\sum_{\frac{\mathbf{m}}{M}\in K_M}\phi(Mx-\mathbf{m})\,P_{\mathbf{m}/M}(x). \tag{7}$$

Note that vanishes outside an -patch associated with and

$$\sum_{\frac{\mathbf{m}}{M}\in K_M}\phi(Mx-\mathbf{m})\equiv 1.$$

There are terms in (7), and calculating a single term requires weights. So the total number of weights needed to implement (7) is . This is clearly infeasible for . For it leads to approximation error and yields the statement of [6, Theorem 1]. Note that in this construction the Taylor coefficients at the -knots are the weights of the network.
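A one-dimensional instance of the weighted sum (7) can be evaluated directly. The sketch below is an illustration under assumptions not taken from the text: the target function is the exponential on the unit interval, the spike function is the usual hat function, and all helper names are hypothetical. The error is compared against the Lagrange remainder bound for the local Taylor polynomials, which carries over to the weighted sum because the weights are nonnegative and sum to 1.

```python
import math


def hat(t):
    """One-dimensional spike function: 1 at 0, linear, vanishing outside (-1, 1)."""
    return max(1.0 - abs(t), 0.0)


def taylor_exp(x0, x, degree):
    """Taylor polynomial of exp at x0, evaluated at x (all derivatives equal exp(x0))."""
    return math.exp(x0) * sum((x - x0) ** j / math.factorial(j) for j in range(degree + 1))


def patchwork(x, M, degree):
    """Weighted sum of local Taylor polynomials over the knots m / M, as in (7) with d = 1."""
    return sum(hat(M * x - m) * taylor_exp(m / M, x, degree) for m in range(M + 1))


def max_error(M, degree, samples=1000):
    """Largest deviation from exp over a uniform sample of [0, 1]."""
    points = (i / samples for i in range(samples + 1))
    return max(abs(patchwork(x, M, degree) - math.exp(x)) for x in points)


# Remainder bound: sup|f^(degree+1)| * h^(degree+1) / (degree+1)! with h = 1/M.
bound = math.e / (8 ** 3 * math.factorial(3))
print(max_error(8, 2), bound)
print(max_error(16, 2))  # halving the knot spacing shrinks the error further
```

Here the Taylor coefficients at the knots play the role of the stored weights, so the number of stored values grows linearly with the number of knots, which is the cost the text refers to.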

Note that terms of (7) are nonzero only for -knots in an -cube for . Suppose that lies in the domain of and, therefore, has well defined -knot . For such we can write

 ˜fq(x)=∑mM∈KM∩Cn