I Introduction
Vectortovector regression, also known as multivariate regression, provides an effective way to find underlying relationships between input vectors and their corresponding output ones at the same time. The problems of vectortovector regression are of great interest in signal processing, wireless communication, and machine learning communities. For example, speech enhancement aims at finding a vectortovector mapping to convert noisy speech spectral vectors to the clean ones
[62, 40]. Similarly, clean images can be extracted from the corrupted ones by leveraging upon image denoising techniques [60]. Besides, wireless communication systems are designed to transmit local encrypted and corrupted codes to targeted receivers with decrypted information as correct as possible [48, 17]. Moreover, the vectortovector regression tasks are also commonly seen in ecological modeling, natural gas demand forecasting, and drug efficacy prediction domains
[9].The vectortovector regression can be theoretically formulated as follows: given a dimensional input vector space and a measurable dimensional output vector space , the goal of vectortovector regression is to learn a functional relationship such that the output vectors can approximate desirable target ones. The regression process is described as:
(1) 
where , , e is an error vector, and refers to the regression function to be exploited. To implement the regression function
[39] was the earliest approach and several other methods, such as support vector regression [53]and decision tree regressions
[34], were further proposed to enhance regression performance. However, deep neural networks (DNN) [22, 29] with multiple hidden layers offer a more efficient and robust solution to dealing with largescale regression problems. For example, our previous experimental study [61]demonstrated that DNNs outperform shallow neural networks on speech enhancement. Similarly, autoencoders with deep learning architectures can achieve better results on image denoising
[64].Although most endeavors on DNN based vectortovector regression focus on the experimental gain in terms of mapping accuracy, the related theoretical performance of DNN has not been fully developed. Our recent work [42] tried to bridge the gap by analyzing the representation power of DNN based vectortovector regression and deriving upper bounds for different DNN architectures. However, those bounds particularly target experiments with consistent training and testing conditions, and they may not be adapted to the experimental tasks where unseen testing data are involved. Therefore, in this work, we focus on an analysis of the generalization power and investigate upper bounds on a generalized loss of mean absolute error (MAE) for DNN based vectortovector regression with mismatched training and testing scenarios. Moreover, we associate the required constraints with DNN models to attain the upper bounds.
The remainder of this paper is organized as follows: Section II highlights the contribution of our work and its relationship with the related work. Section III underpins concepts and notations used in this work. Section IV discusses the upper bounds on MAE for DNN based vectortovector regression by analyzing the approximation, estimation, and optimization errors, respectively. Section V presents how to utilize our derived upper bounds to estimate practical MAE values. Section VI shows the experiments of image denoising and speech enhancement to validate our theorems. Finally, Section VII concludes our work.
Ii Related Work and Contribution
The recent success of deep learning has inspired many studies on the expressive power of DNNs [32, 18, 31, 47], which extended the original universal approximation theory on shallow artificial neural networks (ANNs) [28, 13, 4, 5, 25] to DNNs. As discussed in [19], the approximation error is tightly associated with the DNN expressive power. Moreover, the estimation error and optimization error jointly represent the DNN generalization power, which can be reflected by error bounds on the outofsample error or the testing error. The methods of analyzing DNN generalization power are mainly divided into two classes: one refers to algorithmindependent controls [37, 6, 20] and another one denotes algorithmdependent approaches [33, 12]. In the class of algorithmindependent controls, the upper bounds for the estimation error are based on the empirical Rademacher complexity [7] for a functional family of certain DNNs. In practice, those approaches concentrate on techniques of how weight regularization affects the generalization error without considering advanced optimizers and the configuration of hyperparameters. As for the algorithmdependent approaches [33, 12], several theoretical studies focus on the “overparametrization” technique [16, 36, 2, 30], and they suggest that a global optimal point can be ensured if parameters of a neural network significantly exceed the amount of training data during the training process.
We notice that the generalization capability of deep models can also be investigated through the stability of the optimization algorithms. More specifically, an algorithm is stable if a small perturbation to the input does not significantly alter the output, and a precise connection between stability and generalization power can be found in [10, 63]. Besides, in [3], the authors investigate the stability and oscillations of various competitive neural networks from the perspective of equilibrium points. However, the analysis of the stability of the optimization algorithm is out of the scope of the present work, and we do not discuss it further in this study.
In this paper, the aforementioned issues are taken into account by employing the error decomposition technique [15] with respect to an empirical risk minimizer (ERM) [55, 54] using three error terms: an approximation error, an estimation error, and an optimization error. Then, we analyze generalized error bounds on MAE for DNN based vectortovector regression models. More specifically, the approximation error can be upper bounded by modifying our previous bound on the representation power of DNN based vectortovector regression [42]. The upper bound on the estimation error relies on the empirical Rademacher complexity [7] and necessary constraints imposed upon DNN parameters. The optimization error can be upper bounded by assuming PolyakLojasiewicz (PL) [27] condition under the “overparameterization” configuration for neural networks [1, 56]. Putting together all pieces, we attain an aggregated upper bound on MAE by summing the three upper bounds. Furthermore, we exploit our derived upper bounds to estimate practical MAE values in experiments of DNN based vectortovector regression.
We use image denoising and speech enhancement experiments to validate the theoretical results in this work. Image denoising is a simple regression task from to
, where the configuration of “overparametrization” can be simply satisfied on datasets like MNIST
[14]. Speech enhancement is another useful illustration of the general theoretical analysis because it is an unbounded conversion from . Although the “overparametrization” technique could not be employed in the speech enhancement task due to a significantly huge amount of training data, we can relax the “overparametrization” setup and solely assume the PL condition to attain the upper bound for MAE. In doing so, the upper bound can be adopted in experiments of speech enhancement.Iii Preliminaries
Iiia Notations

: The composition of functions and .

: norm of the vector v.

and : Inner product of two vectors x and y.

: An integer set .

: A firstorder gradient of function .

: The th element in the vector w.

: DNN based vectortovector regression function.

: Smooth ReLU function.

1: A vector of all ones.

: Indicator vector of zeros but with the th dimension assigned to .

: dimensional real coordinate space.

: A family of the DNN based vectortovector functions.

: A family of generalized MAE loss functions.
IiiB Numerical Linear Algebra

Hölder’s inequality: Let be conjugate: . Then, for all ,
(2) with equality when for all . In particular, when , Hölder’s inequality becomes the CauchyShwartz inequality.
IiiC Convex and NonConvex Optimization

A function is Lipschitz continuous if ,
(3) 
Let be a smooth function on . Then, ,
(4) 
A function satisfies the PolyakLojasiewicz (PL) condition [27]. Then, ,
(5) where refers to the optimal value over the input domain. The PL condition is a significant property for a nonconvex function because a global minimization can be attained from , and a local minimum point corresponds to the global one. Furthermore, if a function is convex and also satisfies PL condition, the function is strongly convex.

Jensen’s inequality: Let be a random vector taking values in a nonempty convex set with a finite expectation , and be a measurable convex function defined over . Then, is in , is finite, and the following inequality holds
(6)
IiiD Empirical Rademacher Complexity
Empirical Rademacher complexity [7] is a measure of how well the function class correlates with the Rademacher random value. The references [19, 65, 58] show that a function class with a larger empirical Rademacher complexity is more likely to be overfitted to the training data.
Definition 1.
Definition 2.
The empirical Rademacher complexity of a hypothesis space of functions with respect to samples is:
(8) 
where indicates a set of Rademacher random variables.
Lemma 1 (Talagrand’s Lemma [35]).
Let be Lipschitz functions and be Rademacher random variables. Then, for any hypothesis space of functions with respect to samples , the following inequality holds
(9) 
IiiE MAE and MSE
Definition 3.
MAE measures the average magnitude of absolute differences between predicted vectors and actual observations , which is related to norm and the corresponding loss function is defined as:
(10) 
Mean Squared Error (MSE) [38] denotes a quadratic scoring rule that measures the average magnitude of predicted vectors and actual observations , which is related to norm and the corresponding loss function is shown as:
(11) 
Iv Upper Bounding MAE for DNN Based VectortoVector Regression
This section derives the upper bound on a generalized loss of MAE for DNN based vectortovector regression. We first discuss the error decomposition technique for MAE. Then, we upper bound each decomposed error, and attain an aggregated upper bound on MAE.
Iva Error Decomposition of MAE
Based on the traditional error decomposition approach [35, 50]
, we generalize the technique to the DNN based vectortovector regression, where the smooth ReLU activation function, the regression loss functions, and their associated hypothesis space are separately defined in Definition
4.Definition 4.
A smooth vectortovector regression function is defined as , and a family of DNN based vectortovector functions is represented as , where a smooth ReLU activation is given as:
(12) 
Moreover, we assume as the family of generalized MAE loss functions. For simplicity, we denote as . Besides, we denote as a distribution over .
The following proposition bridges the connection of Rademacher complexity between the family of generalized MAE loss functions and the family of DNN based vectortovector functions.
Proposition 1.
For any sample set drawn i.i.d. according to a given distribution , the Rademacher complexity of the family is upper bounded as:
(13) 
where denotes the empirical Rademacher complexity over the family , and it is defined as:
(14) 
Proof.
We first show that MAE loss function is Lipschitz continuous. For two vectors and a fixed vector , the MAE loss difference is
(15) 
Since the target function is given, is Lipschitz. By applying Lemma 1, we obtain that
(16) 
∎
Since is an upper bound of , we can utilize the upper bound on to derive the upper bound for . Next, we adopt the error decomposition technique to attain an aggregated upper bound which consists of three error components.
Theorem 1.
Let denote the loss function for a set of samples drawn i.i.d. according to a given distribution , and define as an ERM for . For a generalized MAE loss function , , and , there exists such that . Then, with a probability of , we attain that
(17) 
Proof.
Next, the remainder of this section presents how to upper bound the approximation error, approximation error, and optimization error, respectively.
IvB An Upper Bound for Approximation Error
The upper bound for the approximation error is shown in Theorem 2, which is based on the modification of our previous theorem for the representation power of DNN based vectortovector regression [42].
Theorem 2.
For a smooth vectortovector regression target function , there exists a DNN with modified smooth ReLU based hidden layers, where the width of each hidden layer is at least and the top hidden layer has units. Then, we derive the upper bound for the approximation error as:
(19) 
where a smooth ReLU function is defined in Eq. (12), and refers to the differential order of .
The smooth ReLU function in Eq. (12) is essential to derive the upper bound for the optimization error. Since Theorem 2 is a direct result from Lemma 2 in [21] where the standard ReLU is employed and does not consider Barron’s bound for activation functions [4], the smooth ReLU function can be flexibly utilized in Theorem 2 because it is a close approximation to the standard ReLU function. Moreover, Theorem 2 requires at least neurons for a dimensional input vector to achieve the upper bound.
IvC An Upper Bound for Estimation Error
Since the estimation error in Eq. (17) is upper bounded by the empirical Rademacher complexity , we derive Theorem 3 to present an upper bound on . The derived upper bound is explicitly controlled by the constraints of weights in the hidden layers, inputs, and the number of training data. In particular, the constraint of norm is set to the top hidden layer, and norm is imposed on the other hidden layers.
Theorem 3.
For a DNN based vectortovector mapping function with a smooth ReLU function as in Eq. (12) and , being the weight matrix of the th hidden layer, we obtain an upper bound for the empirical Rademacher complexity with regularized constraints of the weights in each hidden layer, and the norm of input vectors x is bounded by .
(20) 
where is an element associated with the th hidden layer of DNN where is indexed to neurons in the th hidden layer and is pointed to units of the th hidden layer, and contains all weights from the th neuron to all units in the ()th hidden layer.
Proof.
We first consider an ANN with one hidden layer of neuron units with the smooth ReLU function as Eq. (12), and also denote as a family of ANN based vectortovector regression functions. can be decompoed into the sum of subspaces and each subspace is defined as:
where is the number of hidden neurons, , w and separately correspond to and in Eq. (20). Given data samples , the empirical Rademacher complexity of is bounded as:
(21) 
The last term in the inequality (21) can be further simplified based on the independence of s. Thus, we finally derive the upper bound as:
(22) 
The upper bound for is derived based on the fact that for families of functions , there is , and thus
(23) 
which is an extension of the empirical Rademacher identities [35], which is demonstrated in Lemma 3 of Appendix A.
Then, for the family of DNNs with hidden layers activated by the smooth ReLU function, we iteratively apply Lemma 1 and end up attaining the upper bound as:
where are selected from the hypothesis space . ∎
IvD An Upper Bound for Optimization Error
Next, we derive an upper bound for the optimization error. A recent work [8] has shown that the PL property can be ensured if neural networks are configured with the setup of the “overparametrization” [56], which is induced from the two facts as follows:
Thus, the upper bound on the optimization error can be tractably derived in the context of the PL condition for the generalized MAE loss . Since the smooth ReLU function admits smooth DNN based vectortovector functions, which can lead to an upper bound on the optimization error as:
(24) 
To achieve the upper bound in Eq. (24
), we assume that the stochastic gradient descent (SGD) algorithm can result in an approximately equal optimization error for both the generalized MAE loss
and the empirical MAE loss .More specifically, for two DNN based vectortovector regression functions and , we have that
(25) 
Thus, we focus on analyzing because it can be updated during the training process. We assume that is smooth with and it also satisfies the PL condition from an early iteration . Besides, the learning rate of SGD is set to .
Moreover, we define as the function with an updated parameter at the iteration , and denote as the function with the optimal parameter . The smoothness of implies that
(26) 
Then, we apply the SGD algorithm to update model parameters at the iteration as:
(27) 
By employing the condition , we further derive that
(29) 
Furthermore, we employ the PL condition to Eq. (29) and obtain the inequalities as:
(30) 
Comments
There are no comments yet.