1 Introduction
Neural Network Approximation (NNA) is concerned with how efficiently a function, or a class of functions, is approximated by the outputs of neural networks. One overview of NNA is given in [7], but there are other noteworthy expositions on this subject such as [10, 12]. The main theme of NNA is to understand, for specific functions or classes of functions, how fast the approximation error tends to zero as the number of parameters of the neural net grows. In this paper, we prove bounds on the rate of NNA for univariate refinable functions (see (1.2)) when using deep networks with ReLU activation.
We follow the notation and use the results in [7] for neural networks. In particular, we denote by the set of outputs of a fully-connected neural network with width , depth , input dimension , output dimension , and the Rectified Linear Unit
(ReLU) as the activation function. Since we shall use deep networks for the approximation of univariate functions, we introduce the notation
(1.1) 
where and are fixed. The set is a nonlinear parameterized set depending on at most parameters, where depends only on and . The elements in
are known to be Continuous Piecewise Linear (CPwL) (vector valued) functions on
. While each is determined by parameters, the number of breakpoints of a given may be exponential in . However, as shown in [8], not all CPwL functions with an exponential number of breakpoints are in , and indeed, the membership of in imposes strong dependencies between the linear pieces.

We consider a univariate function which is refinable in the sense that there are constants $c_0,\dots,c_N$ such that
(1.2)  $\varphi(x) \;=\; \sum_{k=0}^{N} c_k\,\varphi(2x-k), \qquad x\in\mathbb{R}.$
The sequence $(c_k)_{k=0}^{N}$ is called the mask of the refinement equation (1.2). Because of (1.2), refinable functions are self-similar. Note that the functions satisfying (1.2) are not unique since, for example, any multiple of a solution also satisfies the same equation. However, under very minimal requirements on the mask, there is a unique solution to (1.2) up to scaling.
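Although the paper treats general masks, the simplest concrete instance is worth keeping in mind: the piecewise-linear hat function is refinable. The following sketch is our own illustration, not part of the paper; the mask (1/2, 1, 1/2) and the normalization of B are assumptions chosen for this example. It verifies the two-scale relation numerically on a fine grid.

```python
# Hedged example: the hat function B(x) = max(0, 1 - |x - 1|), supported on
# [0, 2], satisfies the refinement equation with mask c = (1/2, 1, 1/2):
#     B(x) = (1/2) B(2x) + B(2x - 1) + (1/2) B(2x - 2).

def B(x):
    return max(0.0, 1.0 - abs(x - 1.0))

mask = [0.5, 1.0, 0.5]

def TB(x):
    # Right-hand side of the refinement equation: sum_k c_k B(2x - k).
    return sum(c * B(2 * x - k) for k, c in enumerate(mask))

# Check the two-scale relation on a dyadic grid of [0, 2].
xs = [i / 512 for i in range(1025)]
assert all(abs(B(x) - TB(x)) < 1e-12 for x in xs)
```

The same check fails for a generic CPwL bump, which is one way to see that refinability is a genuine structural constraint rather than a generic property.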
There is by now a vast literature on refinable functions (see for example [1]) which derives various properties of the refinable function from assumptions on the mask. In our presentation, we describe our assumptions as properties imposed on the function itself, and thus, if the reader wishes to know which properties of the mask will guarantee our assumptions, they must refer to the existing literature, in particular [1, 5, 6].
Refinable functions are of particular interest in approximation theory because they provide the natural framework for every practical wavelet basis in which the basic wavelets have bounded support. One- and several-dimensional refinable functions are also the underlying mathematical constructs in subdivision schemes used in Computer Aided Geometric Design (CAGD).
We rely heavily on the results and techniques from [5, 6]. To keep our presentation as transparent as possible, we only consider refinable functions that satisfy the two-scale relation (1.2). There are various generalizations of (1.2), including the replacement of the dilation factor 2 by an integer $p\ge 2$, as well as generalizations of the definition of refinability to the multivariate setting, where the dilation is given by general linear mappings (matrices). Generalizations of the results of the present paper to these broader settings are left to future work.
We next introduce the Banach space of continuous and bounded functions , defined on an interval (which can be all of ), and the uniform norm
(1.3)  $\|f\|_{C(I)} \;:=\; \sup_{x\in I} |f(x)|.$
We consider the linear operator $T$, acting on continuous functions, given by

(1.4)  $Tf(x) \;:=\; \sum_{k=0}^{N} c_k\, f(2x-k), \qquad x\in\mathbb{R},$
and its composition with itself $n$ times

(1.5)  $T^n f \;:=\; T(T^{n-1}f), \qquad n\ge 1, \qquad T^0 f := f.$
The main contribution of our article is described formally in the following theorem.
Theorem 1.1.
Let be any refinement mask, let be any CPwL function which vanishes outside of , and let be the linear operator of (1.4). Then, the function is in with depending only on and the number of breakpoints of .
As a corollary, under certain standard assumptions on , we show in Section 4 that a refinable function can be approximated by the elements of with exponential accuracy. More precisely, we prove that under certain assumptions on the mask , the normalized solution of (1.2) satisfies the following for
(1.6) 
where and depend on the mask.
Our main vehicle for proving these results is the cascade algorithm which is used to compute . We describe this algorithm in Section 2. Note that in [5] the term cascade algorithm was used more narrowly to indicate that as a consequence of (1.2), the numerical values of could be computed easily from a few , where is close to . Now we are using this terminology in a more general sense, including also what in [5] was given the more cumbersome name of two-scale difference equation. Notice that the cascade algorithm cannot be directly implemented by ReLU NNs, and so various modifications of this algorithm need to be made. These are given in the proofs of the theorem in Section 3.
We wish to stress here that some of the lemmas we use in our proofs may be applicable to other settings of NNA. In particular, we draw the reader’s attention to the result of §3.3 and its utilization, which show that in some cases we can describe a multiplication of two functions from as a function in .
It is well-known that the solutions to refinement equations are used in the construction of wavelets with compact support (see [3]) and in subdivision algorithms in CAGD. In Section 4, we discuss how our main theorem can be combined with the existing theory of refinable functions and the existing convergence results for the cascade algorithm to prove that under standard conditions on the mask, the solution to the refinement equation can be approximated to exponential accuracy by the elements of . Finally, in Section 5, we discuss how our results relate to $n$-term wavelet approximation.
2 Preliminaries
In this section, we touch upon some of the necessary tools that describe the cascade algorithm as outlined in the works of Daubechies-Lagarias [5, 6].
2.1 The operator $T$
We consider the action of on any continuous function supported on . It is hard to do a direct analysis of because the points , , are spread out. However, note two important facts. The first is that the points appearing in (1.4) are all equal to modulo one. This means that there is at most one point from each interval and all these points differ by an integer amount. Secondly, since is supported on , only the points that land in appear in (1.4). More precisely, the following statement about holds.
Remark 2.1.
If is a CPwL function supported on , then is also a CPwL supported on . Moreover, each breakpoint of satisfies , , where is a breakpoint of . In particular, given a CPwL function supported on with breakpoints at the integers , has breakpoints at , .
Indeed, the fact that is supported on follows from the observation that each of the functions , , is supported on .
It follows from Remark 2.1 that if is a CPwL function supported on and has breakpoints at the integers , then is a CPwL supported on with breakpoints , , and as such, . We shall show that is actually an output of an NN with much smaller depth.
In going forward, we put ourselves in the setting of the works of Daubechies-Lagarias [5, 6], where a better understanding of is facilitated by the introduction of the operator Vec. It assigns to each a vector valued function , where with
(2.1)  $\mathrm{Vec}(f)(x) \;:=\; \big(f(x),\, f(x+1),\, \ldots,\, f(x+N-1)\big)^{T}, \qquad x\in\mathbb{R}.$
Even though is defined on all of , we are mainly concerned with its values on . Note that the solution of (1.2) turns out to be supported on . Therefore, knowing the restriction to of is equivalent to knowing on its full support. On that interval, the $i$-th coordinate of $\mathrm{Vec}(f)$ is the piece of $f$ living on $[i-1,i]$, reparameterized to live on $[0,1]$. Note that
(2.2) 
Further, we define for and
(2.3) 
Before describing the cascade algorithm which represents via bit extraction, we recall in the next subsection how we find the binary bits of a number .
2.2 Binary bits and quantization
Any $x\in[0,1]$ can be represented as
$x \;=\; \sum_{j=1}^{\infty} b_j(x)\, 2^{-j},$
where the bits $b_j(x)\in\{0,1\}$. While such a representation of $x$ is not unique, we shall use one particular representation where the bits are found using the quantizer function
$q \;:=\; \chi_{[1/2,\,1]},$
with $\chi_A$ denoting the characteristic function of a set $A$. The first bit of $x$ and its residual are defined as

(2.4)  $b_1(x) := q(x), \qquad r_1(x) := 2x - b_1(x),$

respectively.
respectively. The graph of has two linear pieces, one on and the other on
and a jump discontinuity at
. Each linear piece for has slope .While for the most part we consider only for , there are occasions where we need to be defined for outside this interval. For such , we define when and when . Figure 2.1 shows the graphs of and .
We find the later bits and residuals recursively as

(2.5)  $b_{j+1}(x) := q(r_j(x)), \qquad r_{j+1}(x) := 2\,r_j(x) - b_{j+1}(x), \qquad j\ge 1.$
Note that on $[0,1]$, for each $n\ge 1$,

(2.6)  $x \;=\; \sum_{j=1}^{n} b_j(x)\,2^{-j} \;+\; 2^{-n}\,r_n(x).$
The $r_j$'s are piecewise linear functions with jump discontinuities at the dyadic points $k2^{-j}$, $0<k<2^j$; see, for example, Figure 2.2 for the graph of $r_2$. Note that, as in the case of $r_1$, we define $r_j(x):=0$ for $x<0$ and $r_j(x):=1$ for $x>1$. Then (2.5) holds on the whole real line.
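The bit/residual recursion is straightforward to implement. The sketch below is an illustration of ours (the helper names are hypothetical); it extracts the first n bits of a number in [0,1] and checks the decomposition identity (2.6).

```python
def q(x):
    # Hard quantizer: characteristic function of [1/2, 1].
    return 1 if x >= 0.5 else 0

def bits_and_residual(x, n):
    # Iterate b_{j+1} = q(r_j), r_{j+1} = 2 r_j - b_{j+1}, starting from r_0 = x.
    bits, r = [], x
    for _ in range(n):
        b = q(r)
        r = 2 * r - b
        bits.append(b)
    return bits, r

bits, r = bits_and_residual(0.625, 3)
assert bits == [1, 0, 1] and abs(r) < 1e-12

# Identity (2.6): x = sum_{j<=n} b_j(x) 2^{-j} + 2^{-n} r_n(x).
x, n = 0.3, 10
bits, r = bits_and_residual(x, n)
recon = sum(b * 2 ** -(j + 1) for j, b in enumerate(bits)) + 2 ** -n * r
assert abs(x - recon) < 1e-9
```

Note the convention at the breakpoints: since q takes the value 1 on [1/2, 1], a dyadic rational receives the terminating bit expansion followed by a residual that is exactly 0 or 1.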
2.3 The cascade algorithm
We now look at the computation of $Tf$ on $[0,1]$ for general continuous functions $f$ supported on $[0,N]$. Since for $x\in[0,1]$ we have

(2.7)  $2x \;=\; b_1(x) + r_1(x),$

we can write

(2.8)  $Tf(x) \;=\; \sum_{k=0}^{N} c_k\, f\big(r_1(x) + b_1(x) - k\big).$
In this way, we get two different formulas depending on whether $b_1(x)=0$ or $b_1(x)=1$. For example, when $b_1(x)=0$, using the fact that $c_k=0$ if $k$ is not in $\{0,1,\dots,N\}$ and that $f(y)=0$ when $y<0$ or $y>N$, we have for $x\in[0,1/2)$,
$\mathrm{Vec}(Tf)(x) \;=\; T_0\,\mathrm{Vec}(f)(r_1(x)),$
where $T_0$ is the $N\times N$ matrix with $(i,j)$th entry equal to $c_{2i-j-1}$,

(2.9)  $T_0 \;:=\; \big(c_{2i-j-1}\big)_{i,j=1}^{N}.$
A similar derivation gives, for $x\in[1/2,1]$,
$\mathrm{Vec}(Tf)(x) \;=\; T_1\,\mathrm{Vec}(f)(r_1(x)),$
where now $T_1$ is the $N\times N$ matrix with $(i,j)$th entry equal to $c_{2i-j}$,

(2.10)  $T_1 \;:=\; \big(c_{2i-j}\big)_{i,j=1}^{N}.$
More succinctly, we have for $x\in[0,1]$

(2.11)  $\mathrm{Vec}(Tf)(x) \;=\; T_{b_1(x)}\,\mathrm{Vec}(f)(r_1(x)).$
Then, using (2.5), we get

(2.12)  $\mathrm{Vec}(T^2f)(x) \;=\; T_{b_1(x)}\,T_{b_2(x)}\,\mathrm{Vec}(f)(r_2(x)),$

and if we iterate this computation we get the cascade algorithm. Since $T^nf = T(T^{n-1}f)$, we have for $x\in[0,1]$

(2.13)  $\mathrm{Vec}(T^nf)(x) \;=\; T_{b_1(x)}\,T_{b_2(x)}\cdots T_{b_n(x)}\,\mathrm{Vec}(f)(r_n(x)).$
It is useful to keep in mind what $b_n(x)$ looks like as $x$ traverses $[0,1]$. It alternately takes the values $0$ and $1$, with the switch coming at the dyadic points $k2^{-n}$, $k=1,\dots,2^n-1$.
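To see the cascade formula (2.13) in action, here is a hedged numerical sketch for the refinable hat function B(x) = max(0, 1 - |x - 1|) with mask (1/2, 1, 1/2), for which Vec(B)(x) = (B(x), B(x+1)) on [0,1] and B is a fixed point of T. The 2x2 matrices below use one consistent indexing convention, (T_0)_{ij} = c_{2i-j-1} and (T_1)_{ij} = c_{2i-j} with indices starting at 1; this convention is our assumption and should be checked against (2.9)-(2.10).

```python
# Hedged sketch: cascade algorithm for the hat B(x) = max(0, 1 - |x - 1|),
# mask c = (1/2, 1, 1/2), support [0, 2].  Since TB = B, the cascade formula
# gives  Vec(B)(x) = T_{b1(x)} ... T_{bn(x)} Vec(B)(r_n(x))  exactly.

c = {0: 0.5, 1: 1.0, 2: 0.5}
N = 2

def coeff(k):
    return c.get(k, 0.0)

# Assumed 1-based indexing: (T_0)_{ij} = c_{2i-j-1}, (T_1)_{ij} = c_{2i-j}.
T0 = [[coeff(2 * i - j - 1) for j in range(1, N + 1)] for i in range(1, N + 1)]
T1 = [[coeff(2 * i - j) for j in range(1, N + 1)] for i in range(1, N + 1)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(N)) for i in range(N)]

def vec_B(x):
    B = lambda t: max(0.0, 1.0 - abs(t - 1.0))
    return [B(x), B(x + 1.0)]  # Vec(B)(x) for x in [0, 1]

def cascade(x, n):
    # Extract n bits of x, then apply the corresponding matrix product.
    bits, r = [], x
    for _ in range(n):
        b = 1 if r >= 0.5 else 0
        r = 2 * r - b
        bits.append(b)
    v = vec_B(r)
    for b in reversed(bits):  # innermost matrix acts first
        v = matvec(T1 if b else T0, v)
    return v

v = cascade(0.625, 3)  # expect Vec(B)(0.625) = (0.625, 0.375)
assert abs(v[0] - 0.625) < 1e-12 and abs(v[1] - 0.375) < 1e-12
```

For this mask the columns of T_0 and T_1 sum to 1, so the matrix products remain bounded; this is a toy instance of the joint-spectral-radius considerations of [5, 6].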
3 Proof of the main theorem
Before we start proving Theorem 1.1, we observe a simple fact regarding how $T$ behaves with respect to the translation operator. This fact, which is described in the following lemma, will help us simplify the proof of Theorem 1.1.
Lemma 3.1.
Let $f$ be a continuous function on $\mathbb{R}$ and, for any $a\in\mathbb{R}$, consider the translated function $f_a := f(\cdot - a)$. Then, for each $n\ge 1$,

(3.1)  $T^n f_a \;=\; (T^n f)\big(\cdot - 2^{-n}a\big).$
Proof: Let us first see the action of $T$ on $f_a$. Since $f_a(2x-k) = f(2x-k-a)$, we have
$Tf_a(x) \;=\; \sum_{k=0}^{N} c_k\, f(2x-k-a) \;=\; (Tf)\big(x - \tfrac{a}{2}\big).$
This proves the case $n=1$ in (3.1). We next complete the proof of (3.1) for all $n$ by induction. Suppose that we have established the result for a given value of $n$, that is,

(3.3)  $T^n f_a \;=\; (T^n f)\big(\cdot - 2^{-n}a\big).$

So we can apply the case $n=1$ with $T^n f$ in place of $f$ and $2^{-n}a$ in place of $a$ and obtain
$T^{n+1} f_a \;=\; T\big(T^n f_a\big) \;=\; (T^{n+1} f)\big(\cdot - 2^{-n-1}a\big).$
This advances the induction and proves (3.1).
We turn now to discuss the proof of Theorem 1.1. We first show how to prove the theorem when , where is any CPwL function that has support in and has breakpoints at the integers. We make this assumption on the breakpoints only for transparency of the proof. We remark later how the same method of proof gives the theorem for arbitrary CPwL functions .
We represent as a linear combination of hat functions, each of which has a translate with support in $[0,1]$. The fact that these functions are supported on this subinterval will be pivotal in many of the lemmas and theorems below, and we will therefore use the following terminology throughout the paper. We call a univariate function $g$ special if:

• $g$ is a nonnegative CPwL function defined on $\mathbb{R}$;

• the support of $g$ is contained in $[0,1]$.

Therefore, the translates of the hat functions in the representation of are special functions and all our results below, proven for special functions, can be applied to these translates.
Let be the 'hat function' with breakpoints at and , see Figure 3.3. That is, is given by the formula
(3.4) 
Clearly is a special function with breakpoints. Although the function under investigation is not special, it can be written as a linear combination of shifts of the special function ,
(3.5) 
Therefore, from (3.5), Lemma 3.1 and the fact that is a linear operator it follows that
(3.6)  
Accordingly, the discussion in the following subsections concentrates on special functions. We will come back to finalize the proof of the main theorem in the closing subsection.
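Hat functions and their shifts translate directly into shallow ReLU computations, which is what makes a decomposition like (3.5) useful for NN constructions. The sketch below is our own illustration: the hat normalization (breakpoints 0, 1, 2, peak 1 at x = 1) and the reconstruction formula are assumptions chosen for this example and may differ from the paper's exact hat.

```python
# Hedged sketch: a hat function as a one-hidden-layer ReLU computation, and a
# CPwL function with integer breakpoints as a combination of shifted hats.

def relu(x):
    return max(0.0, x)

def hat(x):
    # Hat with breakpoints 0, 1, 2 and peak value 1 at x = 1:
    #   hat(x) = ReLU(x) - 2 ReLU(x - 1) + ReLU(x - 2).
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)

def cpwl_from_samples(values, x):
    # A CPwL f supported on [0, N] with integer breakpoints and f(0) = f(N) = 0
    # satisfies  f(x) = sum_{k=1}^{N-1} f(k) * hat(x - (k - 1)).
    return sum(v * hat(x - k) for k, v in enumerate(values))  # values[k] = f(k+1)

values = [2.0, -1.0, 3.0]  # f(1), f(2), f(3); here N = 4
assert abs(cpwl_from_samples(values, 2.0) - (-1.0)) < 1e-12
assert abs(cpwl_from_samples(values, 1.5) - 0.5) < 1e-12  # interpolates 2 and -1
assert cpwl_from_samples(values, 0.0) == 0.0 and cpwl_from_samples(values, 4.0) == 0.0
```

Each hat costs three ReLU units, so such a representation uses width proportional to the number of breakpoints, matching the depth-one membership claims made for CPwL functions in the proof of Theorem 3.5.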
3.1 The function is an output of an NN for special functions
In this section, we shall show that for certain choices of , the function is in the set , where depends only on the mask and the function . If is a special function, then the corresponding vector function , viewed as a function on , is
Namely, all its coordinates are zero except the first one, which is the nonzero function supported on . Therefore , when considered as a function on , and is the zero vector when takes a value outside . Since the formula for is on , , this gives
(3.7) 
where
In particular, the support of is contained in .
The first step in our argument to prove that is an output of an NN is to replace the discontinuous functions by the CPwL function . We shall give a family of possible replacements which depend on the choice of two parameters satisfying
(3.8) 
Definition of : We let be the CPwL function defined on , see Figure 3.4(a), with breakpoints at which satisfies:

, ;

, for and for ;

on .
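One plausible realization of such a CPwL replacement for the quantizer q is sketched below; the breakpoint placement and the role of the parameters are our assumptions for illustration, not the paper's exact definition. The surrogate vanishes up to 1/2, ramps linearly to 1 over a window of width delta, and is extended constantly outside [0, 1].

```python
def soft_q(x, delta=0.125):
    # Hedged CPwL surrogate for the hard quantizer q = chi_{[1/2, 1]}:
    # 0 on (-inf, 1/2], a linear ramp on [1/2, 1/2 + delta], 1 afterwards.
    if x <= 0.5:
        return 0.0
    if x >= 0.5 + delta:
        return 1.0
    return (x - 0.5) / delta

# Agrees with the hard quantizer away from the ramp [1/2, 1/2 + delta].
assert soft_q(0.3) == 0.0 and soft_q(0.9) == 1.0
assert abs(soft_q(0.5625) - 0.5) < 1e-12  # midpoint of the default ramp

xs = [i / 1000 for i in range(1001)]
assert all(soft_q(a) <= soft_q(b) for a, b in zip(xs, xs[1:]))  # monotone
```

Such a ramp is itself a ReLU output, soft_q(x) = (ReLU(x - 1/2) - ReLU(x - 1/2 - delta)) / delta, which is the reason a CPwL replacement, unlike the discontinuous q, can be realized inside the network.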
In the next lemma, we summarize the properties of that we will need in going forward. For its statement, we introduce for the sets
(3.9) 
Lemma 3.2.
For each , the function has the following properties: (i) ; (ii) for ; (iii) For , we have , at least at every , .
Proof: Since is in we have that the composition is in , see page 10 in [7]. This proves (i).
We prove (ii) by induction on . From the definition of , we have that except for the interval and therefore we have proved the case . Suppose we have established (ii) for a value of . Let . Then, because . Therefore, we only need to check that when , . Any such is in an interval for some and . If then and therefore . Therefore, we have proved (ii).
Finally, we prove (iii) also by induction on . This statement is clear from the definition of when . Suppose that we have proved (iii) for a value of and consider the statement for . We consider two cases depending on the parity of . If is even, then the interval under consideration is . Hence, we know vanishes on this interval and since , we get that also vanishes on this interval. Consider now the case that . Then, the interval under consideration is
since . For in that interval, because of (ii), we have
Hence, .
Remark 3.3.
Note that in addition to the assumption , we require that so that the sets , , do not overlap.
Remark 3.4.
In the construction of NNs in this paper, we usually construct NNs whose layers have different widths. However, we can always add additional nodes to these layers so that we end up with a fully-connected NN with all layers of the same width.
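The padding of Remark 3.4 can be made concrete: embedding a narrow affine-plus-ReLU layer into a wider one with zero weights leaves the computed function unchanged. A minimal sketch of ours, assuming plain fully-connected ReLU layers:

```python
# Hedged sketch of Remark 3.4: pad a width-2 ReLU layer to width 4 with zero
# weights and biases; the padded layer computes the same function, with the
# extra coordinates carrying zeros.

def relu_layer(W, b, v):
    # One fully-connected ReLU layer: ReLU(W v + b), applied row by row.
    return [max(0.0, sum(wij * vj for wij, vj in zip(row, v)) + bi)
            for row, bi in zip(W, b)]

W, b = [[1.0, -1.0], [2.0, 0.5]], [0.0, -1.0]        # width-2 layer
Wp = [row + [0.0, 0.0] for row in W] + [[0.0] * 4] * 2  # zero-padded to width 4
bp = b + [0.0, 0.0]

v = [0.7, 0.2]
out = relu_layer(W, b, v)
out_p = relu_layer(Wp, bp, v + [0.0, 0.0])            # pad the input as well
assert out_p[:2] == out and out_p[2:] == [0.0, 0.0]
```

Repeating this at every layer turns a network with varying widths into one of constant width equal to the maximum, at no change in the realized function.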
Now, we are ready to state and prove the main theorem in this section.
Theorem 3.5.
Let be a special function that has at most breakpoints. Then, for any ,
Proof: We choose . Let us denote by the function and denote by . We claim that
(3.10) 
Let us assume this claim for a moment and proceed to prove the theorem. Because of (i) in Lemma 3.2, we know that each of the functions is in . Since is a CPwL function with breakpoints, we have . We can output by concatenating the networks for with that for . Namely, we place the neural network for in the first layers and then follow that with the network for in the last layer using as its input. Thus, is in . We now place the two neural networks that output and stacked on top of each other. The resulting network has width . At the final step, we recall that

(3.11) 
Therefore, by adding another layer to the network we have already created, we can output the right side of (3.10). Hence, it is in with the advertised width .
With these remarks in hand, we are left with proving the claim. For the proof, let us first note that (3.10) holds when is outside because outside . Now, consider a general and understand what looks like as traverses a dyadic interval from left to right. We will track this behavior only when is extremely close to , i.e. is very small; at least . From (ii) of Lemma 3.2, we know that until gets close to the right endpoint of , and in particular on
This means that until . On , we have since is a special function (and therefore nonnegative). By (iii) of Lemma 3.2, on
where
and therefore on .
Now, we return to our choice of . We first choose the very close to so that at least , which ensures that
because of (3.7). Moreover, it follows from the discussion so far that
For any we have that on , but we choose so that
This guarantees that for one of the two nonnegative numbers or is zero, and thus we have (3.10).
3.2 Matrix constructions
In this section, we continue considering special functions and their vectorization . Let us introduce the notation
for the piecewise constant matrix valued functions . Let be the piecewise constant matrix valued function which is defined as
(3.12)  
(3.13) 
Notice that we have purposefully defined on all of . We have
(3.14) 
if we set
We know from the cascade algorithm (2.13) that
(3.15) 
We next introduce our technique for proving that is an output of a neural network whose depth grows linearly in . In what follows, we derive a new expression for and then use it to prove the existence of such a neural network. Recall that for a special function ,
where the column vectors , , are the standard basis for $\mathbb{R}^N$.