# Analyzing Upper Bounds on Mean Absolute Errors for Deep Neural Network Based Vector-to-Vector Regression

In this paper, we show that, in vector-to-vector regression utilizing deep neural networks (DNNs), a generalized loss of mean absolute error (MAE) between the predicted and expected feature vectors is upper bounded by the sum of an approximation error, an estimation error, and an optimization error. Leveraging upon error decomposition techniques in statistical learning theory and non-convex optimization theory, we derive upper bounds for each of the three aforementioned errors and impose necessary constraints on DNN models. Moreover, we assess our theoretical results through a set of image de-noising and speech enhancement experiments. Our proposed upper bounds of MAE for DNN based vector-to-vector regression are corroborated by the experimental results and the upper bounds are valid with and without the "over-parametrization" technique.

## I Introduction

Vector-to-vector regression, also known as multivariate regression, provides an effective way to find underlying relationships between input vectors and their corresponding output vectors at the same time. Vector-to-vector regression problems are of great interest in the signal processing, wireless communication, and machine learning communities. For example, speech enhancement aims at finding a vector-to-vector mapping to convert noisy speech spectral vectors to clean ones [62, 40]. Similarly, clean images can be extracted from corrupted ones by leveraging image de-noising techniques [60]. Besides, wireless communication systems are designed to transmit locally encrypted and corrupted codes to targeted receivers so that the decrypted information is as correct as possible [48, 17]. Moreover, vector-to-vector regression tasks are also commonly seen in ecological modeling, natural gas demand forecasting, and drug efficacy prediction [9].

Vector-to-vector regression can be theoretically formulated as follows: given a $d$-dimensional input vector space $\mathbb{R}^d$ and a measurable $q$-dimensional output vector space $\mathbb{R}^q$, the goal of vector-to-vector regression is to learn a functional relationship $f: \mathbb{R}^d \rightarrow \mathbb{R}^q$ such that the output vectors can approximate desirable target ones. The regression process is described as:

$$\mathbf{y} = f(\mathbf{x}) + \mathbf{e}, \tag{1}$$

where $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{y} \in \mathbb{R}^q$, $\mathbf{e}$ is an error vector, and $f$ refers to the regression function to be exploited. To implement the regression function $f$, linear regression [39] was the earliest approach, and several other methods, such as support vector regression [53] and decision tree regression [34], were further proposed to enhance regression performance. However, deep neural networks (DNNs) [22, 29] with multiple hidden layers offer a more efficient and robust solution for large-scale regression problems. For example, our previous experimental study [61] demonstrated that DNNs outperform shallow neural networks on speech enhancement. Similarly, auto-encoders with deep learning architectures can achieve better results on image de-noising [64].

Although most endeavors on DNN based vector-to-vector regression focus on the experimental gain in terms of mapping accuracy, the related theoretical performance of DNN has not been fully developed. Our recent work [42] tried to bridge the gap by analyzing the representation power of DNN based vector-to-vector regression and deriving upper bounds for different DNN architectures. However, those bounds particularly target experiments with consistent training and testing conditions, and they may not be adapted to the experimental tasks where unseen testing data are involved. Therefore, in this work, we focus on an analysis of the generalization power and investigate upper bounds on a generalized loss of mean absolute error (MAE) for DNN based vector-to-vector regression with mismatched training and testing scenarios. Moreover, we associate the required constraints with DNN models to attain the upper bounds.

The remainder of this paper is organized as follows: Section II highlights the contribution of our work and its relationship with the related work. Section III underpins concepts and notations used in this work. Section IV discusses the upper bounds on MAE for DNN based vector-to-vector regression by analyzing the approximation, estimation, and optimization errors, respectively. Section V presents how to utilize our derived upper bounds to estimate practical MAE values. Section VI shows the experiments of image de-noising and speech enhancement to validate our theorems. Finally, Section VII concludes our work.

## II Related Work and Contribution

The recent success of deep learning has inspired many studies on the expressive power of DNNs [32, 18, 31, 47], which extended the original universal approximation theory on shallow artificial neural networks (ANNs) [28, 13, 4, 5, 25] to DNNs. As discussed in [19], the approximation error is tightly associated with the DNN expressive power. Moreover, the estimation error and optimization error jointly represent the DNN generalization power, which can be reflected by error bounds on the out-of-sample error or the testing error. The methods of analyzing DNN generalization power are mainly divided into two classes: one refers to algorithm-independent controls [37, 6, 20] and another one denotes algorithm-dependent approaches [33, 12]. In the class of algorithm-independent controls, the upper bounds for the estimation error are based on the empirical Rademacher complexity [7] for a functional family of certain DNNs. In practice, those approaches concentrate on techniques of how weight regularization affects the generalization error without considering advanced optimizers and the configuration of hyper-parameters. As for the algorithm-dependent approaches [33, 12], several theoretical studies focus on the “over-parametrization” technique [16, 36, 2, 30], and they suggest that a global optimal point can be ensured if parameters of a neural network significantly exceed the amount of training data during the training process.

We notice that the generalization capability of deep models can also be investigated through the stability of the optimization algorithms. More specifically, an algorithm is stable if a small perturbation to the input does not significantly alter the output, and a precise connection between stability and generalization power can be found in [10, 63]. Besides, in [3], the authors investigate the stability and oscillations of various competitive neural networks from the perspective of equilibrium points. However, the analysis of the stability of the optimization algorithm is out of the scope of the present work, and we do not discuss it further in this study.

In this paper, the aforementioned issues are taken into account by employing the error decomposition technique [15] with respect to an empirical risk minimizer (ERM) [55, 54] using three error terms: an approximation error, an estimation error, and an optimization error. Then, we analyze generalized error bounds on MAE for DNN based vector-to-vector regression models. More specifically, the approximation error can be upper bounded by modifying our previous bound on the representation power of DNN based vector-to-vector regression [42]. The upper bound on the estimation error relies on the empirical Rademacher complexity [7] and necessary constraints imposed upon DNN parameters. The optimization error can be upper bounded by assuming the $\gamma$-Polyak-Łojasiewicz ($\gamma$-PL) condition [27] under the “over-parametrization” configuration for neural networks [1, 56]. Putting all the pieces together, we attain an aggregated upper bound on MAE by summing the three upper bounds. Furthermore, we exploit our derived upper bounds to estimate practical MAE values in experiments of DNN based vector-to-vector regression.

We use image de-noising and speech enhancement experiments to validate the theoretical results in this work. Image de-noising is a simple regression task from $[0,1]^d$ to $[0,1]^d$, where the configuration of “over-parametrization” can be simply satisfied on datasets like MNIST [14]. Speech enhancement is another useful illustration of the general theoretical analysis because it is an unbounded conversion from $\mathbb{R}^d$ to $\mathbb{R}^q$. Although the “over-parametrization” technique could not be employed in the speech enhancement task due to the very large amount of training data, we can relax the “over-parametrization” setup and solely assume the $\gamma$-PL condition to attain the upper bound for MAE. In doing so, the upper bound can be adopted in experiments of speech enhancement.

## III Preliminaries

### III-A Notations

• $f \circ g$: The composition of functions $f$ and $g$.

• $\|\mathbf{v}\|_p$: $L_p$ norm of the vector $\mathbf{v}$.

• $\langle \mathbf{x}, \mathbf{y} \rangle$ and $\mathbf{x}^T\mathbf{y}$: Inner product of two vectors $\mathbf{x}$ and $\mathbf{y}$.

• $[q]$: An integer set $\{1, 2, \ldots, q\}$.

• $\nabla f$: A first-order gradient of function $f$.

• $\mathbb{E}[X]$: Expectation over a random variable $X$.

• $w_i$: The $i$-th element in the vector $\mathbf{w}$.

• $f_v$: DNN based vector-to-vector regression function.

• $g_u$: Smooth ReLU function.

• $\mathbf{1}$: A vector of all ones.

• $\mathbf{1}_m$: Indicator vector of zeros but with the $m$-th dimension assigned to $1$.

• $\mathbb{R}^d$: $d$-dimensional real coordinate space.

• $\mathbb{F}$: A family of the DNN based vector-to-vector functions.

• $\mathbb{L}$: A family of generalized MAE loss functions.

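### III-B Numerical Linear Algebra

• Hölder’s inequality: Let $p, q \geq 1$ be conjugate: $\frac{1}{p} + \frac{1}{q} = 1$. Then, for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,

$$|\langle \mathbf{x}, \mathbf{y} \rangle| \leq \|\mathbf{x}\|_p \|\mathbf{y}\|_q, \tag{2}$$

with equality when $|y_i| = |x_i|^{p-1}$ for all $i \in [n]$. In particular, when $p = q = 2$, Hölder’s inequality becomes the Cauchy-Schwarz inequality.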
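As a quick numerical sanity check of Eq. (2), the following minimal sketch evaluates the slack in Hölder's inequality for a few conjugate pairs. The helper name `holder_gap`, the random vectors, and the chosen exponents are illustrative assumptions, not part of the paper.

```python
import numpy as np

def holder_gap(x, y, p):
    """Return ||x||_p * ||y||_q - |<x, y>| for the conjugate exponent q."""
    q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1
    lhs = abs(float(np.dot(x, y)))
    rhs = float(np.linalg.norm(x, ord=p) * np.linalg.norm(y, ord=q))
    return rhs - lhs

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = rng.normal(size=8)

# The slack should be non-negative for every conjugate pair (p, q);
# p = 2 corresponds to the Cauchy-Schwarz special case.
gaps = [holder_gap(x, y, p) for p in (1.5, 2.0, 3.0)]
```

A non-negative entry in `gaps` for each exponent is what Eq. (2) predicts for these particular vectors.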

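### III-C Convex and Non-Convex Optimization

• A function $f$ is $\beta$-Lipschitz continuous if $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,

$$\|f(\mathbf{x}) - f(\mathbf{y})\| \leq \beta \|\mathbf{x} - \mathbf{y}\|. \tag{3}$$

• Let $f$ be a $\beta$-smooth function on $\mathbb{R}^n$. Then, $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,

$$f(\mathbf{x}) - f(\mathbf{y}) \leq \nabla f(\mathbf{y})^T(\mathbf{x} - \mathbf{y}) + \frac{\beta}{2}\|\mathbf{x} - \mathbf{y}\|_2^2. \tag{4}$$

• A function $f$ satisfies the $\gamma$-Polyak-Łojasiewicz ($\gamma$-PL) condition [27] if, $\forall \mathbf{x} \in \mathbb{R}^n$,

$$\|\nabla f(\mathbf{x})\|_2^2 \geq \gamma\,(f(\mathbf{x}) - f^*), \tag{5}$$

where $f^*$ refers to the optimal value over the input domain. The $\gamma$-PL condition is a significant property for a non-convex function because a global minimization can be attained from $\nabla f(\mathbf{x}) = 0$, and a local minimum point corresponds to the global one. Furthermore, if a function is convex and also satisfies the $\gamma$-PL condition, the function is strongly convex.

• Jensen’s inequality: Let $X$ be a random vector taking values in a non-empty convex set $C \subseteq \mathbb{R}^n$ with a finite expectation $\mathbb{E}[X]$, and $f$ be a measurable convex function defined over $C$. Then, $\mathbb{E}[X]$ is in $C$, $\mathbb{E}[f(X)]$ is finite, and the following inequality holds

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]. \tag{6}$$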
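To make the $\gamma$-PL condition in Eq. (5) concrete, the sketch below checks it numerically on a one-dimensional non-convex function. The test function $f(x) = x^2 + 3\sin^2(x)$, the constant $\gamma = 1/16$, and the grid are illustrative assumptions chosen for this example; they do not come from the paper.

```python
import numpy as np

# Illustrative non-convex function with global minimum f* = 0 at x = 0.
def f(x):
    return x**2 + 3.0 * np.sin(x) ** 2

def grad_f(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def hess_f(x):
    return 2.0 + 6.0 * np.cos(2.0 * x)

gamma = 1.0 / 16.0                 # assumed PL constant for this example
x = np.linspace(-10.0, 10.0, 2001)
f_star = 0.0

# Eq. (5): squared gradient norm dominates gamma times the optimality gap.
pl_holds = bool(np.all(grad_f(x) ** 2 >= gamma * (f(x) - f_star) - 1e-12))

# A negative second derivative somewhere shows the function is not convex.
is_nonconvex = bool(hess_f(x).min() < 0.0)
```

On this grid both flags come out true, which illustrates that the inequality in Eq. (5) can hold for a function that is not convex.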

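### III-D Empirical Rademacher Complexity

Empirical Rademacher complexity [7] is a measure of how well a function class correlates with Rademacher random variables. The references [19, 65, 58] show that a function class with a larger empirical Rademacher complexity is more likely to overfit the training data.

###### Definition 1.

A Rademacher random variable $\sigma_i$ takes on values $\pm 1$ and is defined by the uniform distribution as:

$$\sigma_i = \begin{cases} 1, & \text{with probability } \frac{1}{2} \\ -1, & \text{with probability } \frac{1}{2}. \end{cases} \tag{7}$$

###### Definition 2.

The empirical Rademacher complexity of a hypothesis space $\mathbb{H}$ of functions $h: \mathbb{R}^n \rightarrow \mathbb{R}$ with respect to $N$ samples $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ is:

$$\hat{\mathcal{R}}_S(\mathbb{H}) := \mathbb{E}_{\sigma}\left[\sup_{h \in \mathbb{H}} \frac{1}{N}\sum_{i=1}^{N} \sigma_i h(\mathbf{x}_i)\right], \tag{8}$$

where $\sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_N\}$ indicates a set of Rademacher random variables.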
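Definition 2 can be approximated by Monte Carlo sampling when the supremum has a closed form. The sketch below does this for the norm-bounded linear class $\{\mathbf{x} \mapsto \mathbf{u}^T\mathbf{x} : \|\mathbf{u}\|_2 \leq \Lambda\}$, which is an assumption made for illustration (the paper's class is a DNN family); the function name, sample sizes, and synthetic data are likewise hypothetical.

```python
import numpy as np

def rademacher_linear(X, lam, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    H = {x -> u^T x : ||u||_2 <= lam} on the rows of X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, N))  # Rademacher draws
    # For this class the supremum over u has a closed form:
    # sup_u (1/N) sum_i sigma_i u^T x_i = (lam / N) * ||sum_i sigma_i x_i||_2.
    sups = lam * np.linalg.norm(sigma @ X, axis=1) / N
    return float(sups.mean())

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))          # synthetic inputs
estimate = rademacher_linear(X, lam=1.0)

# Reference value lam * s / sqrt(N), with s the largest input L2 norm.
s = float(np.linalg.norm(X, axis=1).max())
bound = 1.0 * s / np.sqrt(X.shape[0])
```

For this linear class the estimate is expected to stay below $\Lambda s / \sqrt{N}$, the same kind of quantity that appears later in the estimation error bound.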

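###### Lemma 1 (Talagrand’s Lemma [35]).

Let $\Phi_1, \ldots, \Phi_N$ be $L$-Lipschitz functions and $\sigma_1, \ldots, \sigma_N$ be Rademacher random variables. Then, for any hypothesis space $\mathbb{H}$ of functions $h: \mathbb{R}^n \rightarrow \mathbb{R}$ with respect to $N$ samples $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, the following inequality holds

$$\frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{h \in \mathbb{H}} \sum_{i=1}^{N} \sigma_i (\Phi_i \circ h)(\mathbf{x}_i)\right] \leq \frac{L}{N}\mathbb{E}_{\sigma}\left[\sup_{h \in \mathbb{H}} \sum_{i=1}^{N} \sigma_i h(\mathbf{x}_i)\right] = L\,\hat{\mathcal{R}}_S(\mathbb{H}). \tag{9}$$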

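### III-E MAE and MSE

###### Definition 3.

MAE measures the average magnitude of absolute differences between $N$ predicted vectors $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ and $N$ actual observations $S^* = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}$, which is related to the $L_1$ norm $\|\cdot\|_1$, and the corresponding loss function is defined as:

$$\mathcal{L}_{MAE}(S, S^*) = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{y}_i\|_1. \tag{10}$$

Mean Squared Error (MSE) [38] denotes a quadratic scoring rule that measures the average magnitude of squared differences between $N$ predicted vectors $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ and $N$ actual observations $S^* = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N\}$, which is related to the $L_2$ norm $\|\cdot\|_2$, and the corresponding loss function is shown as:

$$\mathcal{L}_{MSE}(S, S^*) = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{y}_i\|_2^2. \tag{11}$$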
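Eqs. (10) and (11) translate directly into a few lines of NumPy. The sketch below follows the definitions as written, summing over vector dimensions and averaging over the $N$ samples; the function names and the toy arrays are illustrative assumptions.

```python
import numpy as np

def mae_loss(pred, target):
    """Eq. (10): mean over samples of the L1 norm of each error vector."""
    return float(np.abs(pred - target).sum(axis=1).mean())

def mse_loss(pred, target):
    """Eq. (11): mean over samples of the squared L2 norm of each error vector."""
    return float(((pred - target) ** 2).sum(axis=1).mean())

# Two predicted vectors and two observations, each of dimension 2.
pred = np.array([[1.0, 2.0], [0.0, -1.0]])
target = np.array([[0.0, 2.0], [2.0, 1.0]])

mae = mae_loss(pred, target)   # per-sample L1 errors are 1 and 4
mse = mse_loss(pred, target)   # per-sample squared L2 errors are 1 and 8
```

Note that these definitions do not divide by the vector dimension, so the values can differ by a constant factor from losses that average over every element.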

## IV Upper Bounding MAE for DNN Based Vector-to-Vector Regression

This section derives the upper bound on a generalized loss of MAE for DNN based vector-to-vector regression. We first discuss the error decomposition technique for MAE. Then, we upper bound each decomposed error, and attain an aggregated upper bound on MAE.

### IV-A Error Decomposition of MAE

Based on the traditional error decomposition approach [35, 50], we generalize the technique to DNN based vector-to-vector regression, where the smooth ReLU activation function, the regression loss functions, and their associated hypothesis space are separately defined in Definition 4.

###### Definition 4.

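A smooth vector-to-vector regression function is defined as $f_v^*: \mathbb{R}^d \rightarrow \mathbb{R}^q$, and a family of DNN based vector-to-vector functions is represented as $\mathbb{F} = \{f_v: \mathbb{R}^d \rightarrow \mathbb{R}^q\}$, where a smooth ReLU activation is given as:

$$g_u(x) = \lim_{t \rightarrow \infty} \frac{\ln(1 + \exp(tx))}{t}. \tag{12}$$

Moreover, we assume $\mathbb{L} = \{\mathcal{L}(f_v^*, f_v): f_v \in \mathbb{F}\}$ as the family of generalized MAE loss functions. For simplicity, we denote $\mathcal{L}(f_v^*, f_v)$ as $\mathcal{L}(f_v)$. Besides, we denote $\mathcal{D}$ as a distribution over $\mathbb{R}^d$.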
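The behavior of the smooth ReLU in Eq. (12) can be examined at finite $t$. The sketch below is a minimal illustration, assuming a finite sharpness parameter `t` in place of the limit; the function name and the grid are hypothetical. It measures how far the finite-$t$ function is from the standard ReLU.

```python
import numpy as np

def smooth_relu(x, t):
    """Finite-t version of Eq. (12): ln(1 + exp(t*x)) / t, computed stably."""
    # logaddexp(0, t*x) equals ln(exp(0) + exp(t*x)) without overflow.
    return np.logaddexp(0.0, t * x) / t

x = np.linspace(-3.0, 3.0, 601)
relu = np.maximum(x, 0.0)

# Largest deviation from the standard ReLU for increasing sharpness t.
gaps = {t: float(np.max(np.abs(smooth_relu(x, t) - relu)))
        for t in (1.0, 10.0, 100.0)}
```

Since $\ln(1 + e^{tx})/t - \max(0, x) = \ln(1 + e^{-t|x|})/t \leq \ln(2)/t$, the deviation shrinks in proportion to $1/t$, which is consistent with treating the smooth ReLU as a close approximation to the standard one.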

The following proposition bridges the connection of Rademacher complexity between the family of generalized MAE loss functions and the family of DNN based vector-to-vector functions.

###### Proposition 1.

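For any sample set $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ drawn i.i.d. according to a given distribution $\mathcal{D}$, the Rademacher complexity of the family $\mathbb{L}$ is upper bounded as:

$$\hat{\mathcal{R}}_S(\mathbb{L}) \leq \hat{\mathcal{R}}_S(\mathbb{F}), \tag{13}$$

where $\hat{\mathcal{R}}_S(\mathbb{F})$ denotes the empirical Rademacher complexity over the family $\mathbb{F}$, and it is defined as:

$$\hat{\mathcal{R}}_S(\mathbb{F}) = \mathbb{E}_{\sigma}\left[\frac{1}{N}\sup_{f_v \in \mathbb{F}} \sum_{i=1}^{N} (\sigma_i \mathbf{1})^T f_v(\mathbf{x}_i)\right]. \tag{14}$$

###### Proof.

We first show that the MAE loss function is $1$-Lipschitz continuous. For two vectors $\mathbf{y}_1, \mathbf{y}_2 \in \mathbb{R}^q$ and a fixed vector $\mathbf{y} \in \mathbb{R}^q$, the MAE loss difference is

$$\left|\mathcal{L}(\mathbf{y}_1, \mathbf{y}) - \mathcal{L}(\mathbf{y}_2, \mathbf{y})\right| = \left|\,\|\mathbf{y}_1 - \mathbf{y}\|_1 - \|\mathbf{y}_2 - \mathbf{y}\|_1\right| \leq \|\mathbf{y}_1 - \mathbf{y}_2\|_1 \quad \text{(triangle inequality)}. \tag{15}$$

Since the target function $f_v^*$ is given, $\mathcal{L}(f_v)$ is $1$-Lipschitz. By applying Lemma 1, we obtain that

$$\begin{aligned}
\hat{\mathcal{R}}_S(\mathbb{L}) &= \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{f_v \in \mathbb{F}} \sum_{i=1}^{N} \sigma_i \mathcal{L}(f_v(\mathbf{x}_i))\right] \\
&= \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{f_v \in \mathbb{F}} \sum_{i=1}^{N} \sigma_i \mathcal{L}\left(\sum_{m=1}^{q} \langle \mathbf{1}_m, f_v(\mathbf{x}_i) \rangle \mathbf{1}_m\right)\right] \\
&\leq \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{f_v \in \mathbb{F}} \sum_{i=1}^{N} (\sigma_i \mathbf{1})^T f_v(\mathbf{x}_i)\right] \\
&= \hat{\mathcal{R}}_S(\mathbb{F}).
\end{aligned} \tag{16}$$

Since $\hat{\mathcal{R}}_S(\mathbb{F})$ is an upper bound of $\hat{\mathcal{R}}_S(\mathbb{L})$, we can utilize the upper bound on $\hat{\mathcal{R}}_S(\mathbb{F})$ to derive the upper bound for $\hat{\mathcal{R}}_S(\mathbb{L})$. Next, we adopt the error decomposition technique to attain an aggregated upper bound which consists of three error components.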

###### Theorem 1.

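Let $\hat{\mathcal{L}}$ denote the loss function for a set of samples $S$ drawn i.i.d. according to a given distribution $\mathcal{D}$, and define $\hat{f}_v$ as an ERM for $\hat{\mathcal{L}}$. For a generalized MAE loss function $\mathcal{L}$, $\forall \epsilon > 0$, and $0 < \delta < 1$, there exists $f_v^{\epsilon} \in \mathbb{F}$ such that $\mathcal{L}(f_v^{\epsilon}) \leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + \epsilon$. Then, with a probability of $\delta$, we attain that

$$\begin{aligned}
\mathcal{L}(\hat{f}_v) &\leq \underbrace{\inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v)}_{\text{Approx. error}} + \underbrace{2\sup_{f_v \in \mathbb{F}} |\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)|}_{\text{Estimation error}} + \underbrace{\mathcal{L}(f_v^{\epsilon}) - \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v)}_{\text{Optimization error}} \\
&\leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + 2\hat{\mathcal{R}}_S(\mathbb{F}) + \epsilon.
\end{aligned} \tag{17}$$

###### Proof.

$$\begin{aligned}
\mathcal{L}(\hat{f}_v) &= \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + \mathcal{L}(\hat{f}_v) - \mathcal{L}(f_v^{\epsilon}) + \mathcal{L}(f_v^{\epsilon}) - \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) \\
&\leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + \mathcal{L}(\hat{f}_v) - \mathcal{L}(f_v^{\epsilon}) + \epsilon \\
&\leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + \mathcal{L}(\hat{f}_v) - \hat{\mathcal{L}}(\hat{f}_v) + \hat{\mathcal{L}}(f_v^{\epsilon}) - \mathcal{L}(f_v^{\epsilon}) + \epsilon \\
&\leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + 2\sup_{f_v \in \mathbb{F}} |\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)| + \epsilon.
\end{aligned}$$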

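Then, we continue to upper bound the term $2\sup_{f_v \in \mathbb{F}} |\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)|$. We first define $\mu$ as the expected value of $\sup_{f_v \in \mathbb{F}} |\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)|$, and then introduce the fact that

$$\mu = \mathbb{E}\left[\sup_{f_v \in \mathbb{F}} \left|\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)\right|\right] \leq 2\hat{\mathcal{R}}_S(\mathbb{L}), \tag{18}$$

which is justified by Lemma 2 in Appendix A. Then, for a small $\delta$ ($0 < \delta < 1$), we apply Hoeffding’s bound [24] as follows

$$\begin{aligned}
P\left(2\sup_{f_v \in \mathbb{F}} \left|\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)\right| \leq \nu\right) &\geq 1 - 2\exp\left(-2N(\nu - \mu)^2\right) \\
&\geq 1 - 2\exp\left(-2N\left(\nu - 2\hat{\mathcal{R}}_S(\mathbb{L})\right)^2\right) = \delta,
\end{aligned}$$

from which $\nu$ can be derived as:

$$\nu = 2\hat{\mathcal{R}}_S(\mathbb{L}) + \sqrt{\frac{1}{2N}\ln\left(\frac{2}{1-\delta}\right)},$$

and we thus obtain that

$$2\sup_{f_v \in \mathbb{F}} \left|\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)\right| \leq 2\hat{\mathcal{R}}_S(\mathbb{L}) + \sqrt{\frac{1}{2N}\ln\left(\frac{2}{1-\delta}\right)}.$$

Therefore,

$$\begin{aligned}
\mathcal{L}(\hat{f}_v) &\leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + \left(2\hat{\mathcal{R}}_S(\mathbb{L}) + \sqrt{\frac{1}{2N}\ln\left(\frac{2}{1-\delta}\right)}\right) + \epsilon \\
&\leq \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + 2\hat{\mathcal{R}}_S(\mathbb{F}) + \sqrt{\frac{1}{2N}\ln\left(\frac{2}{1-\delta}\right)} + \epsilon \\
&\approx \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) + 2\hat{\mathcal{R}}_S(\mathbb{F}) + \epsilon \quad \text{(for sufficiently large } N\text{)}.
\end{aligned}$$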
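To see why the square-root term is dropped for sufficiently large $N$, the following sketch evaluates $\sqrt{\frac{1}{2N}\ln\frac{2}{1-\delta}}$ for several sample sizes. The value $\delta = 0.95$ and the sample sizes are illustrative assumptions, not settings reported in the paper.

```python
import math

def deviation_term(n_samples, delta):
    """sqrt( ln(2 / (1 - delta)) / (2 * N) ), the term neglected for large N."""
    return math.sqrt(math.log(2.0 / (1.0 - delta)) / (2.0 * n_samples))

# Evaluate the term for growing training-set sizes at a fixed delta.
terms = {n: deviation_term(n, delta=0.95) for n in (10**3, 10**4, 10**5, 10**6)}
```

The term decays as $1/\sqrt{N}$, so multiplying the number of samples by 100 reduces it by a factor of 10.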

The remainder of this section presents how to upper bound the approximation error, estimation error, and optimization error, respectively.

### IV-B An Upper Bound for Approximation Error

The upper bound for the approximation error is shown in Theorem 2, which is based on the modification of our previous theorem for the representation power of DNN based vector-to-vector regression [42].

###### Theorem 2.

For a smooth vector-to-vector regression target function $f_v^*: \mathbb{R}^d \rightarrow \mathbb{R}^q$, there exists a DNN $\bar{f}_v \in \mathbb{F}$ with $k$ modified smooth ReLU based hidden layers, where the width of each hidden layer is at least $d+2$ and the top hidden layer has $n_k$ units. Then, we derive the upper bound for the approximation error as:

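$$\inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) = \|f_v^* - \bar{f}_v\|_1 = \mathcal{O}\left(\frac{q}{(n_k + k - 1)^{r/d}}\right), \tag{19}$$

where a smooth ReLU function is defined in Eq. (12), and $r$ refers to the differential order of $f_v^*$.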

The smooth ReLU function in Eq. (12) is essential to derive the upper bound for the optimization error. Theorem 2 is a direct result of Lemma 2 in [21], where the standard ReLU is employed and Barron’s bound for activation functions [4] is not considered. The smooth ReLU function can therefore be flexibly utilized in Theorem 2 because it is a close approximation to the standard ReLU function. Moreover, Theorem 2 requires at least $d+2$ neurons for a $d$-dimensional input vector to achieve the upper bound.

### IV-C An Upper Bound for Estimation Error

Since the estimation error in Eq. (17) is upper bounded by the empirical Rademacher complexity $\hat{\mathcal{R}}_S(\mathbb{F})$, we derive Theorem 3 to present an upper bound on $\hat{\mathcal{R}}_S(\mathbb{F})$. The derived upper bound is explicitly controlled by the constraints of weights in the hidden layers, inputs, and the number of training data. In particular, the constraint of the $L_1$ norm is set to the top hidden layer, and the $L_2$ norm is imposed on the other hidden layers.

###### Theorem 3.

For a DNN based vector-to-vector mapping function $f_v: \mathbb{R}^d \rightarrow \mathbb{R}^q$ with a smooth ReLU function as in Eq. (12) and, $\forall j \in [k]$, $W_j$ being the weight matrix of the $j$-th hidden layer, we obtain an upper bound for the empirical Rademacher complexity $\hat{\mathcal{R}}_S(\mathbb{F})$ with regularized constraints of the weights in each hidden layer, where the $L_2$ norm of input vectors $\mathbf{x}$ is bounded by $s$:

$$\begin{aligned}
2\sup_{f_v \in \mathbb{F}} |\mathcal{L}(f_v) - \hat{\mathcal{L}}(f_v)| &\leq 2\hat{\mathcal{R}}_S(\mathbb{F}) \leq \frac{2q\Lambda'\Lambda^{k-1}s}{\sqrt{N}} \\
\text{s.t.}\quad & \|W_k(i,:)\|_1 \leq \Lambda', \ \forall i \in [q] \\
& \|W_j(a,:)\|_2 \leq \Lambda, \ \forall j \in [k-1],\ a \in [n_j] \\
& \|\mathbf{x}\|_2 \leq s,
\end{aligned} \tag{20}$$

where $W_j(m,n)$ is an element associated with the $j$-th hidden layer of the DNN, $m$ indexes neurons in the $j$-th hidden layer and $n$ points to units of the $(j-1)$-th hidden layer, and $W_j(a,:)$ contains all weights from the $a$-th neuron to all units in the $(j-1)$-th hidden layer.

###### Proof.

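We first consider an ANN with one hidden layer of $n$ neuron units with the smooth ReLU function $g_u$ as in Eq. (12), and also denote $\hat{\mathbb{F}}$ as a family of ANN based vector-to-vector regression functions. $\hat{\mathbb{F}}$ can be decomposed into the sum of $q$ subspaces $\sum_{m=1}^{q}\hat{\mathbb{F}}_m$, and each subspace $\hat{\mathbb{F}}_m$ is defined as:

$$\hat{\mathbb{F}}_m = \left\{\mathbf{x} \rightarrow \sum_{j=1}^{n} w_j g_u(\mathbf{u}_j^T\mathbf{x}) \cdot \mathbf{1}_m : \|\mathbf{w}\|_1 \leq \Lambda', \|\mathbf{u}_j\|_2 \leq \Lambda\right\},$$

where $n$ is the number of hidden neurons, $\forall j \in [n]$, $\mathbf{w}$ and $\mathbf{u}_j$ separately correspond to $W_k(i,:)$ and $W_j(a,:)$ in Eq. (20). Given $N$ data samples $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, the empirical Rademacher complexity of $\hat{\mathbb{F}}_m$ is bounded as:

$$\begin{aligned}
\hat{\mathcal{R}}_S(\hat{\mathbb{F}}_m) &= \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{\|\mathbf{w}\|_1 \leq \Lambda', \|\mathbf{u}_j\|_2 \leq \Lambda} \sum_{i=1}^{N}\sigma_i \sum_{j=1}^{n} w_j g_u(\mathbf{u}_j^T\mathbf{x}_i)\right] \\
&= \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{\|\mathbf{w}\|_1 \leq \Lambda', \|\mathbf{u}_j\|_2 \leq \Lambda} \sum_{j=1}^{n} w_j \sum_{i=1}^{N}\sigma_i g_u(\mathbf{u}_j^T\mathbf{x}_i)\right] \\
&\leq \frac{\Lambda'}{N}\mathbb{E}_{\sigma}\left[\sup_{\|\mathbf{u}_j\|_2 \leq \Lambda} \max_{j \in [n]} \left|\sum_{i=1}^{N}\sigma_i g_u(\mathbf{u}_j^T\mathbf{x}_i)\right|\right] && \text{(Hölder's ineq.)} \\
&= \frac{\Lambda'}{N}\mathbb{E}_{\sigma}\left[\sup_{\|\mathbf{u}\|_2 \leq \Lambda} \left|\sum_{i=1}^{N}\sigma_i g_u(\mathbf{u}^T\mathbf{x}_i)\right|\right] \\
&\leq \frac{\Lambda'}{N}\mathbb{E}_{\sigma}\left[\sup_{\|\mathbf{u}\|_2 \leq \Lambda} \left|\sum_{i=1}^{N}\sigma_i \mathbf{u}^T\mathbf{x}_i\right|\right] && \text{(cf. Lemma 1)} \\
&\leq \frac{\Lambda\Lambda'}{N}\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{N}\sigma_i \mathbf{x}_i\right\|_2\right] && \text{(Cauchy-Schwarz ineq.)} \\
&\leq \frac{\Lambda\Lambda'}{N}\sqrt{\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{N}\sigma_i \mathbf{x}_i\right\|_2^2\right]} && \text{(Jensen's inequality)}.
\end{aligned} \tag{21}$$

The last term in the inequality (21) can be further simplified based on the independence of the $\sigma_i$s. Thus, we finally derive the upper bound as:

$$\begin{aligned}
\hat{\mathcal{R}}_S(\hat{\mathbb{F}}_m) &\leq \frac{\Lambda\Lambda'}{N}\sqrt{\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{N}\sigma_i \mathbf{x}_i\right\|_2^2\right]} \\
&= \frac{\Lambda\Lambda'}{N}\sqrt{\sum_{i,j=1}^{N}\mathbb{E}_{\sigma}[\sigma_i\sigma_j](\mathbf{x}_i^T\mathbf{x}_j)} \\
&= \frac{\Lambda\Lambda'}{N}\sqrt{\sum_{i=1}^{N}\|\mathbf{x}_i\|_2^2} && \text{(independence of } \sigma_i\text{s)} \\
&\leq \frac{\Lambda\Lambda' s}{\sqrt{N}}.
\end{aligned} \tag{22}$$

The upper bound for $\hat{\mathcal{R}}_S(\hat{\mathbb{F}})$ is derived based on the fact that for $q$ families of functions $\hat{\mathbb{F}}_m$, $m \in [q]$, there is $\hat{\mathcal{R}}_S\left(\sum_{m=1}^{q}\hat{\mathbb{F}}_m\right) = \sum_{m=1}^{q}\hat{\mathcal{R}}_S(\hat{\mathbb{F}}_m)$, and thus

$$\hat{\mathcal{R}}_S(\hat{\mathbb{F}}) = \sum_{m=1}^{q}\hat{\mathcal{R}}_S(\hat{\mathbb{F}}_m) \leq \frac{q\Lambda\Lambda' s}{\sqrt{N}}, \tag{23}$$

which is an extension of the empirical Rademacher identities [35] and is demonstrated in Lemma 3 of Appendix A.

Then, for the family $\mathbb{F}$ of DNNs with $k$ hidden layers activated by the smooth ReLU function, we iteratively apply Lemma 1 and end up attaining the upper bound as:

$$\begin{aligned}
\hat{\mathcal{R}}_S(\mathbb{F}) &= \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{\forall l,\, w_{j_l} \in \mathcal{U}} \sum_{m=1}^{q}\sum_{i=1}^{N}\sigma_i \sum_{j_k=1}^{n_k} w_{j_k} g_u\left(\cdots \sum_{j_1=1}^{n_1} w_{j_1} g_u(\mathbf{u}_j^T\mathbf{x}_i)\right)\right] \\
&\leq \frac{1}{N}\mathbb{E}_{\sigma}\left[\sup_{\forall l,\, w_{j_l} \in \mathcal{U}} \sum_{m=1}^{q}\sum_{i=1}^{N}\sigma_i \sum_{j_k=1}^{n_k} w_{j_k} \cdots \sum_{j_1=1}^{n_1} w_{j_1} \mathbf{u}_j^T\mathbf{x}_i\right] \\
&\leq \frac{q\Lambda'\Lambda^{k-1}s}{\sqrt{N}},
\end{aligned}$$

where $w_{j_l}$ are selected from the hypothesis space $\mathcal{U}$. ∎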
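The right-hand side of Eq. (20) can be evaluated directly from a set of weight matrices and inputs. The sketch below is an illustration under assumptions: the helper name `estimation_bound`, the random weights, and the layer sizes are hypothetical, and $\Lambda'$, $\Lambda$, and $s$ are taken as the largest row norms and input norm actually present rather than as externally imposed constraints.

```python
import numpy as np

def estimation_bound(weights, X):
    """Evaluate 2 * q * Lambda' * Lambda^(k-1) * s / sqrt(N) from Eq. (20).

    weights: list [W_1, ..., W_k]; W_k is the top layer with q rows.
    X: array of shape (N, d) holding the training inputs.
    """
    top = weights[-1]
    q, k = top.shape[0], len(weights)
    lam_top = np.abs(top).sum(axis=1).max()            # max row L1 norm of W_k
    lam = max((np.linalg.norm(W, axis=1).max() for W in weights[:-1]),
              default=1.0)                             # max row L2 norm, lower layers
    s = np.linalg.norm(X, axis=1).max()                # largest input L2 norm
    N = X.shape[0]
    return float(2.0 * q * lam_top * lam ** (k - 1) * s / np.sqrt(N))

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)),    # W_1: d = 8 inputs -> 16 units
           rng.normal(size=(16, 16)),   # W_2
           rng.normal(size=(4, 16))]    # W_3 (top layer): q = 4 outputs
X = rng.normal(size=(100, 8))

bound_n = estimation_bound(weights, X)
bound_4n = estimation_bound(weights, np.tile(X, (4, 1)))  # same s, 4x samples
```

With the weights and input norm held fixed, quadrupling the number of samples halves the bound, which reflects the $1/\sqrt{N}$ dependence in Eq. (20).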

### IV-D An Upper Bound for Optimization Error

Next, we derive an upper bound for the optimization error. A recent work [8] has shown that the $\gamma$-PL property can be ensured if neural networks are configured with the setup of the “over-parametrization” [56], which is induced from the two facts as follows:

• Neural networks can satisfy the $\gamma$-PL condition when the weights of hidden layers are initialized near the global minimum point [56, 11].

• As the neural network involves more parameters, the update of parameters moves less, and there exists a global minimum point near the random initialization [2, 30].

Thus, the upper bound on the optimization error can be tractably derived in the context of the $\gamma$-PL condition for the generalized MAE loss $\mathcal{L}(f_v)$. The smooth ReLU function admits smooth DNN based vector-to-vector functions, which leads to an upper bound on the optimization error as:

$$\epsilon = \mathcal{L}(f_v^{\epsilon}) - \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) \leq \frac{\mu M^2 \beta}{2\gamma}. \tag{24}$$

To achieve the upper bound in Eq. (24), we assume that the stochastic gradient descent (SGD) algorithm can result in an approximately equal optimization error for both the generalized MAE loss $\mathcal{L}(f_v^{\epsilon})$ and the empirical MAE loss $\hat{\mathcal{L}}(\hat{f}_v^{\epsilon})$.

More specifically, for two DNN based vector-to-vector regression functions $f_v^{\epsilon}$ and $\hat{f}_v^{\epsilon}$, we have that

$$\epsilon = \mathcal{L}(f_v^{\epsilon}) - \inf_{f_v \in \mathbb{F}} \mathcal{L}(f_v) \approx \hat{\mathcal{L}}(\hat{f}_v^{\epsilon}) - \inf_{f_v \in \mathbb{F}} \hat{\mathcal{L}}(f_v). \tag{25}$$

Thus, we focus on analyzing $\hat{\mathcal{L}}(\hat{f}_v^{\epsilon}) - \inf_{f_v \in \mathbb{F}} \hat{\mathcal{L}}(f_v)$ because it can be updated during the training process. We assume that $\hat{\mathcal{L}}$ is $\beta$-smooth with $\|\nabla\hat{\mathcal{L}}(f_v)\|_2^2 \leq M^2$ and that it also satisfies the $\gamma$-PL condition from an early iteration $t_0$. Besides, the learning rate of SGD is set to $\mu$.

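Moreover, we define $f_{v, \mathbf{w}_t}$ as the function with an updated parameter $\mathbf{w}_t$ at the iteration $t$, and denote $f_{v, \mathbf{w}^*}$ as the function with the optimal parameter $\mathbf{w}^*$. The smoothness of $\hat{\mathcal{L}}$ implies that

$$\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t+1}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}_t}) - \langle \nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t}), \mathbf{w}_{t+1} - \mathbf{w}_t \rangle \leq \frac{\beta}{2}\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2. \tag{26}$$

Then, we apply the SGD algorithm to update model parameters at the iteration $t$ as:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \mu\nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t}). \tag{27}$$

Next, substituting $-\mu\nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t})$ from Eq. (27) for $\mathbf{w}_{t+1} - \mathbf{w}_t$ in the inequality (26), we have that

$$\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t+1}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}_t}) + \mu\|\nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t})\|_2^2 \leq \frac{\beta\mu^2}{2}\|\nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t})\|_2^2. \tag{28}$$

By employing the condition $\|\nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t})\|_2^2 \leq M^2$, we further derive that

$$\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t+1}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}_t}) + \mu\|\nabla\hat{\mathcal{L}}(f_{v, \mathbf{w}_t})\|_2^2 \leq \frac{\mu^2 M^2 \beta}{2}. \tag{29}$$

Furthermore, we apply the $\gamma$-PL condition to Eq. (29) and obtain the inequalities as:

$$\begin{aligned}
\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t+1}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*}) &\leq \left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_t}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*}) - \gamma\mu\left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_t}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*})\right)\right) + \frac{\mu^2 M^2 \beta}{2} \\
&\leq (1 - \mu\gamma)\left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_t}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*})\right) + \frac{\mu^2 M^2 \beta}{2} \\
&\leq (1 - \mu\gamma)^2\left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t-1}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*})\right) + \sum_{i=0}^{1}(1 - \gamma\mu)^i\frac{\mu^2 M^2 \beta}{2} \\
&\leq \cdots \\
&\leq (1 - \mu\gamma)^{t - t_0 + 1}\left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t_0}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*})\right) + \sum_{i=0}^{t - t_0}(1 - \gamma\mu)^i\frac{\mu^2 M^2 \beta}{2} \\
&\leq (1 - \mu\gamma)^{t - t_0 + 1}\left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t_0}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*})\right) + \frac{\mu M^2 \beta}{2\gamma} \\
&\leq \exp\left(-\mu\gamma(t - t_0 + 1)\right)\left(\hat{\mathcal{L}}(f_{v, \mathbf{w}_{t_0}}) - \hat{\mathcal{L}}(f_{v, \mathbf{w}^*})\right) + \frac{\mu M^2 \beta}{2\gamma}.
\end{aligned} \tag{30}$$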
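The recursion behind Eq. (30) can be traced numerically. The sketch below iterates the worst case of the second line of Eq. (30), treating it as an equality, and compares each iterate with the closed-form envelope in the last line. The constants `mu`, `gamma`, `beta`, `M`, and the initial gap are illustrative assumptions rather than values from the paper's experiments.

```python
import math

def worst_case_gap(a0, mu, gamma, beta, M, steps):
    """Iterate a_{n+1} = (1 - mu*gamma) * a_n + mu^2 * M^2 * beta / 2,
    where a_n stands for the optimality gap after n SGD steps."""
    gaps = [a0]
    for _ in range(steps):
        gaps.append((1.0 - mu * gamma) * gaps[-1] + mu**2 * M**2 * beta / 2.0)
    return gaps

mu, gamma, beta, M = 0.05, 1.0, 4.0, 2.0       # assumed constants
a0 = 10.0                                      # assumed gap at iteration t_0
gaps = worst_case_gap(a0, mu, gamma, beta, M, steps=400)

# Residual level mu * M^2 * beta / (2 * gamma), the right-hand side of Eq. (24).
floor = mu * M**2 * beta / (2.0 * gamma)

# Closed-form envelope from the last line of Eq. (30).
envelope = [math.exp(-mu * gamma * n) * a0 + floor for n in range(len(gaps))]
```

Under these assumptions the initial gap decays geometrically, while the iterates settle at the residual level `floor`, which shrinks linearly with the learning rate $\mu$.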

By connecting the optimization error in Eq. (24) to our derived Eq. (30), we further have that

 ϵ=L(fϵv)−inffv∈FL(fv)≈^L(fϵv)−inff