Neural Network Approximation of Refinable Functions

In the desire to quantify the success of neural networks in deep learning and other applications, there is a great interest in understanding which functions are efficiently approximated by the outputs of neural networks. By now, there exists a variety of results which show that a wide range of functions can be approximated with sometimes surprising accuracy by these outputs. For example, it is known that the set of functions that can be approximated with exponential accuracy (in terms of the number of parameters used) includes, on one hand, very smooth functions such as polynomials and analytic functions (see e.g. <cit.>) and, on the other hand, very rough functions such as the Weierstrass function (see e.g. <cit.>), which is nowhere differentiable. In this paper, we add to the latter class of rough functions by showing that it also includes refinable functions. Namely, we show that refinable functions are approximated by the outputs of deep ReLU networks with a fixed width and increasing depth with accuracy exponential in terms of their number of parameters. Our results apply to functions used in the standard construction of wavelets as well as to functions constructed via subdivision algorithms in Computer Aided Geometric Design.


1 Introduction

Neural Network Approximation (NNA) is concerned with how efficiently a function, or a class of functions, is approximated by the outputs of neural networks. An overview of NNA is given in [7], but there are other noteworthy expositions on this subject, such as [10, 12]. The main theme of NNA is to understand, for specific functions or classes of functions, how fast the approximation error tends to zero as the number of parameters of the neural network grows. In this paper, we prove bounds on the rate of NNA for univariate refinable functions (see (1.2)) when using deep networks with ReLU activation.

We follow the notation and use the results in [7] for neural networks. In particular, we denote by $\Upsilon^{W,n}(\mathrm{ReLU};d,d')$ the set of outputs of a fully-connected neural network with width $W$, depth $n$, input dimension $d$, output dimension $d'$, and the Rectified Linear Unit (ReLU) as the activation function. Since we shall use deep networks for the approximation of univariate functions, we introduce the notation

(1.1) $\Upsilon^{W,n} := \Upsilon^{W,n}(\mathrm{ReLU};1,1),$

where the width $W$ and the activation ReLU are fixed. The set $\Upsilon^{W,n}$ is a nonlinear parameterized set depending on at most $Cn$ parameters, where $C$ depends only on $W$. The elements of $\Upsilon^{W,n}(\mathrm{ReLU};d,d')$ are known to be Continuous Piecewise Linear (CPwL) (vector valued) functions on $\mathbb{R}^d$. While each $S\in\Upsilon^{W,n}$ is determined by $O(n)$ parameters, the number of breakpoints of a given $S$ may be exponential in $n$. However, as shown in [8], not all CPwL functions with an exponential number of breakpoints are in $\Upsilon^{W,n}$; indeed, the membership of $S$ in $\Upsilon^{W,n}$ imposes strong dependencies between the linear pieces.

We consider a univariate function $\varphi$ which is refinable in the sense that there are constants $c_k$, $k=0,\dots,N$, such that

(1.2) $\varphi(x) = \sum_{k=0}^{N} c_k\,\varphi(2x-k), \qquad x\in\mathbb{R}.$

The sequence $c=(c_0,\dots,c_N)$ is called the mask of the refinement equation (1.2). Because of (1.2), refinable functions are self-similar. Note that the functions satisfying (1.2) are not unique since, for example, any multiple of $\varphi$ also satisfies the same equation. However, under very minimal requirements on the mask $c$, there is a unique solution to (1.2) up to scaling.
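As a concrete illustration (this example is standard and not taken from the paper's text), the piecewise linear hat function supported on $[0,2]$ is refinable with mask $(c_0,c_1,c_2)=(\tfrac12,1,\tfrac12)$; the following short Python check verifies (1.2) numerically for it.

```python
import numpy as np

def hat(x):
    """Hat function supported on [0, 2] with peak value 1 at x = 1."""
    return np.maximum(0.0, 1.0 - np.abs(np.asarray(x) - 1.0))

# Mask of the refinement equation phi(x) = sum_k c_k phi(2x - k).
mask = [0.5, 1.0, 0.5]

x = np.linspace(-1.0, 3.0, 2001)
lhs = hat(x)
rhs = sum(c * hat(2 * x - k) for k, c in enumerate(mask))
print("max |phi(x) - sum_k c_k phi(2x - k)| =", np.abs(lhs - rhs).max())  # ~ 0
```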

There is by now a vast literature on refinable functions (see for example [1]) which derives various properties of the function $\varphi$ from assumptions on the mask $c$. In our presentation, we describe our assumptions as properties imposed on $\varphi$ and thus, if the reader wishes to know which properties of the mask will guarantee our assumptions, they must refer to the existing literature, in particular [1, 5, 6].

Refinable functions are of particular interest in approximation theory because they provide the natural framework for every practical wavelet basis in which the basic wavelets have bounded support. One- and several-dimensional refinable functions are also the underlying mathematical constructs in subdivision schemes used in Computer Aided Geometric Design (CAGD).

We rely heavily on the results and techniques from [5, 6]. To keep our presentation as transparent as possible, we only consider refinable functions that satisfy the two scale relationship (1.2). There are various generalizations of (1.2), including the replacement of the dilation factor $2$ by other integers, as well as generalizations of the definition of refinability to the multivariate setting, where the dilation is given by general linear mappings (matrices). Generalizations of the results of the present paper to these broader settings are left to future work.

We next introduce the Banach space $C(I)$ of continuous and bounded functions $f$, defined on an interval $I$ (which can be all of $\mathbb{R}$), and the uniform norm

(1.3) $\|f\|_{C(I)} := \sup_{x\in I}|f(x)|.$

We consider the linear operator $T: C(\mathbb{R})\to C(\mathbb{R})$, given by

(1.4) $(Tf)(x) := \sum_{k=0}^{N} c_k\, f(2x-k), \qquad x\in\mathbb{R},$

and its composition with itself $n$ times

(1.5) $T^n f := \underbrace{T\circ\cdots\circ T}_{n}\, f, \qquad n\ge 1.$
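The following minimal sketch (ours, not the paper's code) implements the operator $T$ of (1.4) acting on a callable, together with its $n$-fold composition (1.5); the mask is passed in as a list $c_0,\dots,c_N$.

```python
import numpy as np

def make_T(mask):
    """Return the operator T of (1.4): (Tf)(x) = sum_k c_k f(2x - k)."""
    def T(f):
        return lambda x: sum(c * f(2 * np.asarray(x) - k) for k, c in enumerate(mask))
    return T

def iterate_T(mask, f, n):
    """The n-fold composition T^n f of (1.5)."""
    T = make_T(mask)
    for _ in range(n):
        f = T(f)
    return f
```

For the hat mask $(\tfrac12,1,\tfrac12)$, `iterate_T` applied to the hat function reproduces the hat function (up to floating point error), since the hat is the normalized solution of (1.2) for that mask.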

The main contribution of our article is described formally in the following theorem.

Theorem 1.1.

Let $c=(c_0,\dots,c_N)$ be any refinement mask, let $f$ be any CPwL function which vanishes outside of $[0,N]$, and let $T$ be the linear operator of (1.4). Then, for every $n\ge 1$, the function $T^n f$ is in $\Upsilon^{W,Cn}$, with $W$ and $C$ depending only on $N$ and the number of breakpoints of $f$.


As a corollary, under certain standard assumptions on the mask $c$, we show in Section 4 that a refinable function can be approximated by the elements of $\Upsilon^{W,n}$ with exponential accuracy. More precisely, we prove that under these assumptions the normalized solution $\varphi$ of (1.2) satisfies the following for $n\ge 1$:

(1.6) $\mathrm{dist}\big(\varphi,\ \Upsilon^{W,n}\big)_{C(\mathbb{R})} \le C_0\,\gamma^{n},$

where $C_0>0$ and $\gamma\in(0,1)$ depend on the mask.

Our main vehicle for proving these results is the cascade algorithm, which is used to compute $T^n f$. We describe this algorithm in Section 2. Note that in [5] the term cascade algorithm was used more narrowly, to indicate that as a consequence of (1.2) the numerical values $\varphi(x)$ could be computed easily from a few values $\varphi(y)$, where $y$ is close to $x$. We are using this terminology in a more general sense, including also what in [5] was given the more cumbersome name of two-scale difference equation. Notice that the cascade algorithm cannot be directly implemented by ReLU NNs, and so various modifications of this algorithm need to be made. These are given in the proof of the theorem in Section 3.
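In the broader sense used here, the cascade algorithm is simply the iteration $f_{n+1}=Tf_n$. A hedged numerical sketch of this iteration is given below; the Daubechies D4 mask is a standard example chosen by us for illustration and is not taken from the paper.

```python
import numpy as np

# Daubechies D4 mask, normalized so that the coefficients sum to 2,
# for the refinement equation phi(x) = sum_k c_k phi(2x - k).
s3 = np.sqrt(3.0)
mask = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]
N = len(mask) - 1          # the solution phi is supported on [0, N]

def T(f):
    """One cascade step: (Tf)(x) = sum_k c_k f(2x - k)."""
    return lambda x: sum(c * f(2 * np.asarray(x) - k) for k, c in enumerate(mask))

f = lambda x: np.maximum(0.0, 1.0 - np.abs(np.asarray(x) - 1.0))  # starting hat
x = np.linspace(0.0, N, 769)
prev = f(x)
for n in range(1, 8):
    f = T(f)
    cur = f(x)
    print(f"n = {n}:  max |T^n f - T^(n-1) f| = {np.abs(cur - prev).max():.3e}")
    prev = cur
```

The successive differences shrink geometrically, reflecting the uniform convergence of the cascade iteration to the D4 scaling function.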

We wish to stress here that some of the lemmas we use in our proofs may be applicable to other settings of NNA. In particular, we draw the reader's attention to the result of §3.3 and its utilization, which shows that in some cases the product of two functions from $\Upsilon^{W,n}$ can itself be described as a function in $\Upsilon^{W',n'}$.

It is well known that the solutions to refinement equations are used in the construction of wavelets with compact support (see [3]) and in subdivision algorithms in CAGD. In Section 4, we discuss how our main theorem can be combined with the existing theory of refinable functions and the existing convergence results for the cascade algorithm to prove that, under standard conditions on the mask $c$, the solution to the refinement equation can be approximated to exponential accuracy by the elements of $\Upsilon^{W,n}$. Finally, in Section 5, we discuss how our results relate to $n$-term wavelet approximation.

2 Preliminaries

In this section, we touch upon some of the necessary tools that describe the cascade algorithm as outlined in the works of Daubechies-Lagarias [5, 6].

2.1 The operator $T$

We consider the action of $T$ on any continuous function $f$ supported on $[0,N]$. It is hard to do a direct analysis of $Tf$ because the points $2x-k$, $k=0,\dots,N$, are spread out. However, note two important facts. The first is that the points $2x-k$ appearing in (1.4) are all equal to $2x$ modulo one. This means that there is at most one such point in each interval $[j,j+1)$, $j\in\mathbb{Z}$, and all these points differ by an integer amount. Secondly, since $f$ is supported on $[0,N]$, only the points $2x-k$ that land in $[0,N]$ contribute to (1.4). More precisely, the following statement about $Tf$ holds.

Remark 2.1.

If $f$ is a CPwL function supported on $[0,N]$, then $Tf$ is also a CPwL function supported on $[0,N]$. Moreover, each breakpoint $y$ of $Tf$ satisfies $y=(x+k)/2$, $k\in\mathbb{Z}$, where $x$ is a breakpoint of $f$. In particular, given a CPwL function $f$ supported on $[0,N]$ with breakpoints at the integers, $Tf$ has breakpoints at the half-integers $j/2$, $j=0,\dots,2N$.

Indeed, the fact that $Tf$ is supported on $[0,N]$ follows from the observation that each of the functions $f(2\cdot-k)$, $k=0,\dots,N$, is supported on $\big[\tfrac{k}{2},\tfrac{k+N}{2}\big]\subset[0,N]$.


It follows from Remark 2.1 that if $f$ is a CPwL function supported on $[0,N]$ and has breakpoints at the integers, then $T^n f$ is a CPwL function supported on $[0,N]$ with breakpoints among the dyadic points $j2^{-n}$, $j=0,\dots,2^nN$, and as such, it is the output of a ReLU network of fixed width whose depth is of order $2^nN$. We shall show that $T^n f$ is actually an output of an NN with much smaller depth, namely depth proportional to $n$.

In going forward, we put ourselves in the setting of the works of Daubechies-Lagarias [5, 6], where a better understanding of $T$ is facilitated by the introduction of the operator Vec. It assigns to each $f\in C(\mathbb{R})$ a vector valued function $v_f := \mathrm{Vec}(f)$, where $v_f:\mathbb{R}\to\mathbb{R}^N$ with

(2.1) $v_f(x) := \big(f(x),\, f(x+1),\,\dots,\, f(x+N-1)\big)^{T}.$

Even though $v_f$ is defined on all of $\mathbb{R}$, we are mainly concerned with its values on $[0,1]$. Note that the solution $\varphi$ of (1.2) turns out to be supported on $[0,N]$. Therefore, knowing the restriction to $[0,1]$ of $v_\varphi$ is equivalent to knowing $\varphi$ on its full support. On that interval, the $k$-th coordinate of $v_f$ is the piece of $f$ living on $[k-1,k]$, reparameterized to live on $[0,1]$. Note that

(2.2) $\|f\|_{C([0,N])} = \max_{1\le k\le N}\ \|(v_f)_k\|_{C([0,1])}.$
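A small sketch of the vectorization (2.1) (our illustration, assuming $f$ is supported on $[0,N]$):

```python
import numpy as np

def vec(f, N):
    """Vec of (2.1): v_f(x) = (f(x), f(x+1), ..., f(x+N-1))."""
    return lambda x: np.array([f(np.asarray(x) + k) for k in range(N)])

# Example with the hat function (N = 2): on [0,1] the two coordinates are the
# two linear pieces of the hat, reparameterized to live on [0,1].
hat = lambda x: np.maximum(0.0, 1.0 - np.abs(np.asarray(x) - 1.0))
v = vec(hat, 2)
x = np.linspace(0.0, 1.0, 5)
print(v(x))   # first row equals x, second row equals 1 - x
```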

Further, we define for and

(2.3)

Before describing the cascade algorithm, which represents $v_{T^n f}$ via bit extraction, we recall in the next subsection how we find the binary bits of a number $x\in[0,1)$.

2.2 Binary bits and quantization

Any $x\in[0,1)$ can be represented as

$x = \sum_{j=1}^{\infty} b_j(x)\,2^{-j},$

where the bits $b_j(x)\in\{0,1\}$. While such a representation of $x$ is not unique, we shall use one particular representation where the bits are found using the quantizer function

$Q := \chi_{[1/2,1)},$

with $\chi_S$ denoting the characteristic function of a set $S\subset\mathbb{R}$. The first bit of $x$ and its residual are defined as

(2.4) $b_1(x) := Q(x), \qquad R(x) := 2x - b_1(x),$

respectively. The graph of $R$ has two linear pieces, one on $[0,1/2)$ and the other on $[1/2,1)$, and a jump discontinuity at $x=1/2$. Each linear piece of $R$ has slope $2$.

While for the most part we consider $R(x)$ only for $x\in[0,1)$, there are occasions where we need $R$ to be defined for $x$ outside this interval. For such $x$, we define $R(x):=0$ when $x<0$ and $R(x):=1$ when $x\ge 1$. Figure 2.1 shows the graphs of $b_1$ and $R$.

Figure 2.1: (a) The graph of $b_1$; (b) the graph of $R$.

We find the later bits and residuals recursively as

(2.5) $b_{n+1}(x) := b_1(R_n(x)), \qquad R_{n+1}(x) := R(R_n(x)), \qquad n\ge 1,$

where $R_1 := R$. Note that on $[0,1)$

(2.6) $x = \sum_{j=1}^{n} b_j(x)\,2^{-j} + 2^{-n}R_n(x), \qquad n\ge 1.$

The $R_n$'s are piecewise linear functions with jump discontinuities at the dyadic points $j2^{-n}$, $j=1,\dots,2^n-1$; see, for example, Figure 2.2. Note that, as in the case of $R$, we define $R_n(x):=0$ for $x<0$ and $R_n(x):=1$ for $x\ge 1$. With this convention, (2.5) defines the $b_n$'s and $R_n$'s on the whole real line.

Figure 2.2: Graph of one of the residual maps $R_n$.
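The bit/residual recursion is easy to state in code; the following sketch uses the conventions of (2.4)-(2.6) as written above and checks the identity (2.6) at one point.

```python
def bits_and_residual(x, n):
    """First n binary bits of x in [0,1) and the residual R_n(x),
    following b_1(x) = chi_[1/2,1)(x) and R(x) = 2x - b_1(x)."""
    bits, r = [], x
    for _ in range(n):
        b = 1 if r >= 0.5 else 0
        r = 2 * r - b
        bits.append(b)
    return bits, r

x, n = 0.7132, 10
bits, r = bits_and_residual(x, n)
# Identity (2.6): x = sum_j b_j 2^{-j} + 2^{-n} R_n(x).
recon = sum(b * 2.0 ** -(j + 1) for j, b in enumerate(bits)) + 2.0 ** -n * r
print(bits, abs(x - recon))   # reconstruction error at the level of rounding
```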

2.3 The cascade algorithm

We now look at the computation of $v_{Tf}$ on $[0,1)$ for general continuous functions $f$ supported on $[0,N]$. Since for $x\in[0,1)$ we have

(2.7) $2x = b_1(x) + R(x),$

we can write

(2.8) $\big(v_{Tf}(x)\big)_i = (Tf)(x+i-1) = \sum_{k=0}^{N} c_k\, f\big(R(x)+b_1(x)+2(i-1)-k\big), \qquad i=1,\dots,N.$

In this way, we get two different formulas depending on whether $b_1(x)=0$ or $b_1(x)=1$. For example, when $b_1(x)=0$, using the fact that $f(y)=0$ if $y$ is not in $[0,N]$ and that $f(R(x)+j)=\big(v_f(R(x))\big)_{j+1}$ for $j=0,\dots,N-1$, we have for $x\in[0,1/2)$,

$v_{Tf}(x) = T_0\, v_f(R(x)),$

where $T_0$ is the $N\times N$ matrix with $(i,j)$-th entry equal to $c_{2i-j-1}$,

(2.9) $T_0 := \big(c_{2i-j-1}\big)_{i,j=1}^{N},$

with the convention that $c_k := 0$ for $k\notin\{0,\dots,N\}$. A similar derivation gives

$v_{Tf}(x) = T_1\, v_f(R(x)), \qquad x\in[1/2,1),$

where now $T_1$ is the $N\times N$ matrix with $(i,j)$-th entry equal to $c_{2i-j}$,

(2.10) $T_1 := \big(c_{2i-j}\big)_{i,j=1}^{N}.$

More succinctly, we have for $x\in[0,1)$

(2.11) $v_{Tf}(x) = T_{b_1(x)}\, v_f(R(x)).$

Then, using (2.5) we get

$v_{T^2 f}(x) = T_{b_1(x)}\, v_{Tf}(R(x)) = T_{b_1(x)}\,T_{b_2(x)}\, v_f(R_2(x)),$

and if we iterate this computation we get the cascade algorithm. Since $T^n f = T(T^{n-1}f)$, we have for $x\in[0,1)$

(2.13) $v_{T^n f}(x) = T_{b_1(x)}\,T_{b_2(x)}\cdots T_{b_n(x)}\, v_f(R_n(x)), \qquad n\ge 1.$

It is useful to keep in mind what $T_{b_n(x)}$ looks like as $x$ traverses $[0,1)$. It alternately takes the values $T_0$ and $T_1$, with the switch coming at the dyadic points $j2^{-n}$, $j=1,\dots,2^n-1$.
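A hedged sketch of the matrices $T_0$, $T_1$ and of the one-step relation (2.11), using the indexing $(T_0)_{ij}=c_{2i-j-1}$, $(T_1)_{ij}=c_{2i-j}$ adopted above; the hat mask serves as a test case because the hat function is the fixed point of $T$ for that mask.

```python
import numpy as np

def transfer_matrices(mask):
    """N x N matrices (T_0)_{ij} = c_{2i-j-1}, (T_1)_{ij} = c_{2i-j}
    (1-based i, j), with c_k := 0 for k outside 0..N."""
    N = len(mask) - 1
    c = lambda k: mask[k] if 0 <= k <= N else 0.0
    T0 = np.array([[c(2 * i - j - 1) for j in range(1, N + 1)] for i in range(1, N + 1)])
    T1 = np.array([[c(2 * i - j) for j in range(1, N + 1)] for i in range(1, N + 1)])
    return T0, T1

mask = [0.5, 1.0, 0.5]
T0, T1 = transfer_matrices(mask)
hat = lambda x: max(0.0, 1.0 - abs(x - 1.0))
v = lambda x: np.array([hat(x), hat(x + 1.0)])   # Vec of the hat, N = 2

for x in np.linspace(0.0, 0.999, 7):
    b1 = 1 if x >= 0.5 else 0
    R = 2 * x - b1
    lhs = v(x)                        # v_{T phi}(x) = v_phi(x), since T phi = phi
    rhs = (T1 if b1 else T0) @ v(R)   # T_{b_1(x)} v_phi(R(x))
    assert np.allclose(lhs, rhs)
print("One-step cascade relation (2.11) verified for the hat mask.")
```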

3 Proof of the main theorem

Before we start proving Theorem 1.1, we observe a simple fact regarding how $T$ behaves with respect to translation. This fact, which is described in the following lemma, will help us simplify the proof of Theorem 1.1.

Lemma 3.1.

Let $f$ be a continuous function on $\mathbb{R}$ and, for any $t\in\mathbb{R}$, consider the translated function

$f_t := f(\cdot - t).$

Then, for each $n\ge 1$,

(3.1) $T^n f_t = (T^n f)\big(\cdot - 2^{-n}t\big).$

Proof: Let us first see the action of $T$ on $f_t$. Since $(Tf_t)(x)=\sum_{k=0}^{N}c_k f(2x-k-t)=(Tf)(x-t/2)$, we have

(3.2) $Tf_t = (Tf)\big(\cdot - t/2\big).$

This proves the case $n=1$ in (3.1). We next complete the proof of (3.1) for all $n$ by induction. Suppose that we have established the result for a given value of $n$. Consider the function $g := T^n f_t$. Formula (3.1) says that $g$ satisfies

(3.3) $g = (T^n f)\big(\cdot - 2^{-n}t\big).$

So we can apply (3.2) with $T^n f$ in place of $f$ and $2^{-n}t$ in place of $t$ and obtain

$T^{n+1} f_t = Tg = (T^{n+1} f)\big(\cdot - 2^{-(n+1)}t\big).$

This advances the induction and proves (3.1).
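Lemma 3.1 is easy to test numerically; the sketch below (ours) checks (3.1) for the hat mask with a shift $t=1$ and $n=4$.

```python
import numpy as np

mask = [0.5, 1.0, 0.5]

def T(f):
    """(Tf)(x) = sum_k c_k f(2x - k)."""
    return lambda x: sum(c * f(2 * np.asarray(x) - k) for k, c in enumerate(mask))

def Tn(f, n):
    for _ in range(n):
        f = T(f)
    return f

hat = lambda x: np.maximum(0.0, 1.0 - np.abs(np.asarray(x) - 1.0))
t, n = 1.0, 4
shifted = lambda x: hat(np.asarray(x) - t)           # f_t = f(. - t)

x = np.linspace(-1.0, 4.0, 1001)
lhs = Tn(shifted, n)(x)                              # T^n f_t
rhs = Tn(hat, n)(np.asarray(x) - t / 2 ** n)         # (T^n f)(. - 2^{-n} t)
print("max deviation in (3.1):", np.abs(lhs - rhs).max())   # ~ 0
```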

We turn now to discuss the proof of Theorem 1.1. We first show how to prove the theorem when $f$ is a CPwL function that has support in $[0,N]$ and has breakpoints at the integers. We make this assumption on the breakpoints only for transparency of the proof. We remark later how the same method of proof gives the theorem for arbitrary CPwL functions $f$.

We represent $f$ as a linear combination of hat functions, each of which has a translate with support in $[0,1]$. The fact that these functions are supported on this sub-interval will be pivotal in many of the lemmas and theorems below, and we will therefore use the following terminology throughout the paper. We call a univariate function $h$ special if:

  • $h$ is a non-negative CPwL function defined on $\mathbb{R}$;

  • the support of $h$ is contained in $[0,1]$.

Therefore, the translates of the hat functions in the representation of $f$ are special functions, and all our results below, proven for special functions, can be applied to these translates.

Let $H$ be the 'hat function' with breakpoints at $0$, $1/2$, and $1$, see Figure 3.3. That is, $H$ is given by the formula

(3.4) $H(x) := \begin{cases} 2x, & x\in[0,1/2],\\ 2(1-x), & x\in[1/2,1],\\ 0, & \text{otherwise}.\end{cases}$

Figure 3.3: The graph of $H$.

Clearly $H$ is a special function with three breakpoints. Although the function $f$ under investigation is not special, it can be written as a linear combination of (half-integer) shifts of the special function $H$,

(3.5) $f = \sum_{m\in\mathbb{Z}} f\Big(\frac{m+1}{2}\Big)\, H\Big(\cdot - \frac{m}{2}\Big),$

where only finitely many terms are nonzero. Therefore, from (3.5), Lemma 3.1 and the fact that $T$ is a linear operator it follows that

(3.6) $T^n f = \sum_{m\in\mathbb{Z}} f\Big(\frac{m+1}{2}\Big)\, (T^n H)\Big(\cdot - \frac{m}{2^{n+1}}\Big).$
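The decomposition (3.5) is a nodal-basis expansion on the half-integer grid; the following sketch (with an arbitrary example function chosen by us, and assuming $H$ is the hat with breakpoints $0,\tfrac12,1$ as in (3.4)) verifies it numerically.

```python
import numpy as np

def H(x):
    """Hat function with breakpoints 0, 1/2, 1 (cf. (3.4))."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, 1.0 - 2.0 * np.abs(x - 0.5))

def f(x):
    """A CPwL function with integer breakpoints, supported on [0, 4]."""
    x = np.asarray(x, dtype=float)
    return np.where((x < 0) | (x > 4), 0.0,
                    np.interp(x, [0, 1, 2, 3, 4], [0.0, 2.0, 1.0, 3.0, 0.0]))

# Representation (3.5): f = sum_m f((m+1)/2) * H(. - m/2).
x = np.linspace(-1.0, 5.0, 1201)
rep = sum(f((m + 1) / 2.0) * H(x - m / 2.0) for m in range(-1, 9))
print(np.abs(rep - f(x)).max())   # ~ 0
```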

Accordingly, the discussion in the following subsections concentrates on special functions. We will come back to finalize the proof of the main theorem in the closing subsection.

3.1 The function $T^n h$ is an output of an NN for special functions

In this section, we shall show that, for a certain choice of $W$, the function $T^n h$ is in the set $\Upsilon^{W,n}$, where $W$ depends only on the mask $c$ and the function $h$. If $h$ is a special function, then the corresponding vector function $v_h=\mathrm{Vec}(h)$, viewed as a function on $[0,1]$, is

$v_h = (h, 0, \dots, 0)^{T} = h\, e_1.$

Namely, all its coordinates are zero except the first one, which is the nonzero function $h$ supported on $[0,1]$. Therefore $v_h(x) = h(x)\,e_1$ when $x$ is in $[0,1]$, and $v_h(R(x))$ is the zero vector when $R(x)$ takes a value outside $[0,1]$. Since the formula for $v_{Th}$ on $[0,1)$ is $v_{Th}(x)=T_{b_1(x)}\,v_h(R(x))$, this gives

(3.7) $v_{Th}(x) = h(R(x))\,\tau_{b_1(x)}, \qquad x\in[0,1),$

where

$\tau_0 := T_0\,e_1, \qquad \tau_1 := T_1\,e_1$

are the first columns of $T_0$ and $T_1$. In particular, the support of $h\circ R$ is contained in $[0,1]$.

The first step in our argument to prove that $T^n h$ is an output of an NN is to replace the discontinuous functions $R_n$ by compositions of a CPwL function $q$. We shall give a family of possible replacements $q=q_{\varepsilon,\delta}$, which depend on the choice of two parameters $\varepsilon$ and $\delta$ satisfying

(3.8) $0<\varepsilon,\ \delta<\tfrac14.$

Definition of $q=q_{\varepsilon,\delta}$: We let $q$ be the CPwL function defined on $\mathbb{R}$, see Figure 3.4(a), with breakpoints at $0$, $\tfrac12-\varepsilon-\delta$, $\tfrac12-\delta$, $\tfrac12$, $1-\varepsilon-\delta$, and $1-\delta$, which satisfies (an illustrative numerical sketch is given after this list):

  • $q(x) = 0$, $x\le 0$;

  • $q(x) = R(x)$, for $x\in[0,\tfrac12-\varepsilon-\delta]$ and for $x\in[\tfrac12,\,1-\varepsilon-\delta]$;

  • on $[\tfrac12-\varepsilon-\delta,\,\tfrac12-\delta]$ (and likewise on $[1-\varepsilon-\delta,\,1-\delta]$),

    $q$ is the linear function that interpolates

    the value of $R$ at the left endpoint and interpolates $0$ at the right endpoint;

  • $q = 0$ on $[\tfrac12-\delta,\,\tfrac12]$ and on $[1-\delta,\,\infty)$.
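The following is one possible realization of such a $q_{\varepsilon,\delta}$, written with `np.interp` purely for illustration (any CPwL function of this shape is realizable by a small ReLU network); the breakpoints used here follow the description above and are placeholders, not values taken from the paper.

```python
import numpy as np

def soft_residual(eps, delta):
    """Illustrative CPwL surrogate q_{eps,delta} for the residual map R:
    it agrees with R except on small intervals to the left of the jumps of R
    (at 1/2 and 1), where it is brought down to 0."""
    xs = [0.0, 0.5 - delta - eps, 0.5 - delta, 0.5,
          1.0 - delta - eps, 1.0 - delta, 1.0]
    ys = [0.0, 2 * (0.5 - delta - eps), 0.0, 0.0,
          2 * (1.0 - delta - eps) - 1.0, 0.0, 0.0]
    return lambda x: np.interp(np.asarray(x, dtype=float), xs, ys)

q = soft_residual(eps=0.05, delta=0.05)
R = lambda x: np.where(np.asarray(x) < 0.5, 2 * np.asarray(x), 2 * np.asarray(x) - 1.0)
x = np.linspace(0.0, 0.39, 40)       # away from the jumps, q agrees with R
print(np.abs(q(x) - R(x)).max())     # ~ 0
```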

In the next lemma, we summarize the properties of the compositions $q^{(n)} := q\circ\cdots\circ q$ ($n$ factors) that we will need in going forward. For its statement, we introduce, for $n\ge 1$, the sets

(3.9) $\Lambda^n_j, \qquad j=1,\dots,2^n,$

which are small intervals lying immediately to the left of the dyadic points $j2^{-n}$, whose lengths are determined by $\varepsilon$, $\delta$, and $n$.
Figure 3.4: (a) The graph of $q_{\varepsilon,\delta}$; (b) the graph of its composition with itself.
Lemma 3.2.

For each $n\ge 1$, the function $q^{(n)}$ has the following properties: (i) $q^{(n)}\in\Upsilon^{W_0,n}$, where $W_0$ is a fixed width depending only on $q$; (ii) $q^{(n)}(x)=R_n(x)$ for every $x$ outside the sets $\Lambda^n_j$ of (3.9); (iii) for $x\in\Lambda^n_j$ sufficiently close to $j2^{-n}$, we have $q^{(n)}(x)=0$; in particular, $q^{(n)}$ vanishes at least at every dyadic point $j2^{-n}$, $j=1,\dots,2^n$.

Proof: Since $q$ is in $\Upsilon^{W_0,1}$, we have that the composition $q^{(n)}$ is in $\Upsilon^{W_0,n}$, see page 10 in [7]. This proves (i).

We prove (ii) by induction on $n$. From the definition of $q$, we have that $q(x)=R(x)$ except on the exceptional intervals adjacent to the jumps of $R$, and therefore we have proved the case $n=1$. Suppose we have established (ii) for a value of $n$. Then $q^{(n+1)}(x)=q(q^{(n)}(x))=q(R_n(x))$ for every $x$ outside the level-$n$ exceptional sets, and $q(R_n(x))=R(R_n(x))=R_{n+1}(x)$ unless $R_n(x)$ itself lies in an exceptional interval of $q$; the latter happens only when $x$ belongs to one of the sets $\Lambda^{n+1}_j$. Therefore, we have proved (ii).

Finally, we prove (iii) also by induction on $n$. The statement is clear from the definition of $q$ when $n=1$. Suppose that we have proved (iii) for a value of $n$ and consider the statement for $n+1$. We consider two cases depending on the parity of $j$. If $j$ is even, then the interval under consideration lies just to the left of $j2^{-(n+1)}=(j/2)2^{-n}$; hence, we know that $q^{(n)}$ vanishes there, and since $q(0)=0$, we get that $q^{(n+1)}=q\circ q^{(n)}$ also vanishes on this interval. Consider now the case that $j$ is odd. Then, on the interval under consideration, because of (ii), $q^{(n)}$ agrees with $R_n$ and takes values just to the left of $1/2$, where $q$ vanishes. Hence, $q^{(n+1)}$ vanishes there as well.
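To make (iii) concrete, the sketch below composes the illustrative surrogate from the previous sketch with itself and checks numerically that the composition vanishes just to the left of every dyadic point $j2^{-n}$; the constants are again placeholders.

```python
import numpy as np

def soft_residual(eps, delta):
    """Illustrative CPwL surrogate for R (same construction as before)."""
    xs = [0.0, 0.5 - delta - eps, 0.5 - delta, 0.5,
          1.0 - delta - eps, 1.0 - delta, 1.0]
    ys = [0.0, 2 * (0.5 - delta - eps), 0.0, 0.0,
          2 * (1.0 - delta - eps) - 1.0, 0.0, 0.0]
    return lambda x: np.interp(np.asarray(x, dtype=float), xs, ys)

def compose(f, n):
    """The n-fold composition q^(n) = q o ... o q."""
    def fn(x):
        y = np.asarray(x, dtype=float)
        for _ in range(n):
            y = f(y)
        return y
    return fn

q = soft_residual(eps=0.05, delta=0.05)
n = 3
qn = compose(q, n)

# q^(n) vanishes on a small interval to the left of every dyadic point j * 2^-n:
probes = np.array([j * 2.0 ** -n - 1e-6 for j in range(1, 2 ** n + 1)])
print(np.abs(qn(probes)).max())   # 0.0
```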

Remark 3.3.

Note that in addition to the assumption (3.8), we require that $\varepsilon+\delta$ be small enough so that the sets $\Lambda^n_j$, $j=1,\dots,2^n$, do not overlap.

Remark 3.4.

In the construction of NNs in this paper, we usually construct NNs whose layers have different widths. However, we can always add additional nodes to these layers so that we end up with a fully-connected NN with all layers having the same width.

Now, we are ready to state and prove the main theorem in this section.

Theorem 3.5.

Let $h$ be a special function that has at most $m$ breakpoints. Then, for any $n\ge 1$, $T^n h$ is in $\Upsilon^{W,n}$, where $W$ depends only on $N$ and $m$.

Proof: We choose $\varepsilon$ and $\delta$ as specified below. Let us denote by $q$ the function $q_{\varepsilon,\delta}$ and by $q^{(n)}$ its $n$-fold composition. We claim that

(3.10)

Let us assume this claim for a moment and proceed to prove the theorem. Because of (i) in Lemma 3.2, we know that each of the functions $q^{(j)}$, $j=1,\dots,n$, is in $\Upsilon^{W_0,j}$. Since $h$ is a CPwL function with at most $m$ breakpoints, $h$ is itself the output of a shallow ReLU network whose width depends only on $m$. We can output $h\circ q^{(n)}$ by concatenating the network for $q^{(n)}$ with that for $h$. Namely, we place the neural network for $q^{(n)}$ in the first $n$ layers and then follow that with the network for $h$ in the last layer, using $q^{(n)}$ as its input. Thus, $h\circ q^{(n)}$ is an output of a network of fixed width and depth proportional to $n$. We now place the two neural networks whose outputs enter the right side of (3.10) stacked on top of each other. The resulting network has width $W$. At the final step, we recall that

(3.11)

Therefore, by adding another layer to the network we have already created, we can output the right side of (3.10). Hence, it is in $\Upsilon^{W,n}$ with the advertised width $W$.

Figure 3.5: (a) The graph of the special function $h$; (b) comparison between $h\circ R_n$ and $h\circ q^{(n)}$ on $[0,1]$.

With these remarks in hand, we are left with proving the claim. For the proof, let us first note that (3.10) holds when $x$ is outside $[0,1)$, because both of its sides vanish there. Now, consider a general $x\in[0,1)$ and understand what $q^{(n)}$ looks like as $x$ traverses a dyadic interval $[(j-1)2^{-n},\,j2^{-n})$ from left to right. We will track this behavior only when $x$ is extremely close to the right endpoint $j2^{-n}$. From (ii) of Lemma 3.2, we know that $q^{(n)}(x)=R_n(x)$ until $x$ gets close to the right endpoint of the interval, and in particular for all $x$ outside the set $\Lambda^n_j$.

This means that $h(q^{(n)}(x))=h(R_n(x))$ until $x$ enters $\Lambda^n_j$. On $\Lambda^n_j$, we have $h(R_n(x))\ge 0$ since $h$ is a special function (and therefore non-negative). By (iii) of Lemma 3.2, $q^{(n)}$ vanishes on the right portion of $\Lambda^n_j$, and therefore $h(q^{(n)}(x))=h(0)=0$ there.

Now, we return to our choice of $\varepsilon$ and $\delta$. We first choose them so small that, because of (3.7), the two sides of (3.10) can differ only for $x$ in the sets $\Lambda^n_j$. Moreover, it follows from the discussion so far that, on these sets, the quantities entering the right side of (3.10) are non-negative. We then choose $\varepsilon$ and $\delta$ so that, for each such $x$, one of the two non-negative numbers appearing there is zero, and thus we have (3.10).

3.2 Matrix constructions

In this section, we continue considering special functions $h$ and their vectorization $v_h=\mathrm{Vec}(h)$. Let us introduce the notation

$\mathcal{T}_n(x) := T_{b_n(x)}, \qquad n\ge 1,$

for the piecewise constant matrix valued functions $x\mapsto T_{b_n(x)}$. Let $M$ be the piecewise constant matrix valued function which is defined as

(3.12) $M(x) := T_0, \qquad x<\tfrac12,$

(3.13) $M(x) := T_1, \qquad x\ge\tfrac12.$

Notice that we have purposefully defined $M$ on all of $\mathbb{R}$. We have

(3.14) $\mathcal{T}_n(x) = M\big(R_{n-1}(x)\big), \qquad n\ge 1,$

if we set $R_0(x):=x$. We know from the cascade algorithm (2.13) that

(3.15) $v_{T^n h}(x) = \mathcal{T}_1(x)\,\mathcal{T}_2(x)\cdots\mathcal{T}_n(x)\, v_h(R_n(x)), \qquad x\in[0,1).$

We next introduce our technique for proving that $T^n h$ is an output of a neural network whose depth grows linearly in $n$. In what follows, we derive a new expression for $v_{T^n h}$ and then use it to prove the existence of such a neural network. Recall that for a special function $h$,

$v_h = h\, e_1,$

where the column vectors $e_1,\dots,e_N$ are the standard basis for $\mathbb{R}^N$.