Given their capability to approximate highly nonlinear functions, deep neural networks (DNNs) have found increasing application in domains such as image classification [krizhevsky2012imagenet, googlenet], natural language processing [hinton2012deep, hannun2014deep], and learning-based control [shi2019neural, chen2019large, zhou-cdc17]. Compared to their shallow counterparts, DNNs are often favoured in practice due to their compact representation of nonlinear functions [montufar2017number]. Despite their practical successes, the theoretical understanding of the representative power of such deep architectures remains an active research topic addressed by both the machine learning and neuroscience communities. In this work, we aim to contribute to the understanding of the expressiveness of DNNs by presenting a new perspective based on Lipschitz constant analysis that is interpretable for applications such as system control.
There are several recent works analyzing the expressive power of deep architectures. One notable work is [NIPS2011_4350], where the authors show that, for a sum-product network, a deep network is exponentially more efficient than a shallow network in representing the same function. Following this work, several researchers considered more practical DNNs with piecewise linear activation functions (e.g., rectified linear units (ReLU) and hard tanh) and showed that the expressiveness of a DNN measured by the number of linear regions partitioned by the DNN grows exponentially with depth and polynomially with width [pascanu2013number, montufar2014number, arora2016understanding, serra2017bounding]. In parallel to the work on piecewise linear DNNs, the authors of [raghu2017expressive] consider DNNs with independent and identically distributed (i.i.d.) Gaussian weight and bias parameters (i.e., random DNNs) and introduce a new measure of expressiveness based on the length of the output trajectory as the DNN traverses a one-dimensional trajectory in its input space. Similar to the other results, the authors show that the expressiveness of a DNN measured by the expected output trajectory length increases exponentially with the depth of the network.
While existing work has shown the exponential expressiveness of deep architectures, the measures of expressiveness are typically specific to the type of deep architecture being considered. For instance, for the sum-product networks considered in NIPS2011_4350, the measure of expressiveness is the number of monomials used to construct the polynomial function, and for DNNs with piecewise linear activation functions (pascanu2013number; montufar2014number; arora2016understanding; serra2017bounding), the number of linear regions is used as the measure to characterize the complexity of the DNN. These specialized notions of expressivity prohibit sensible comparisons between the complexity of a DNN and the underlying function it approximates. While the expressiveness measure based on output trajectory length raghu2017expressive is applicable to DNNs with more general activation functions, it is still not trivial to connect this measure to the properties of the function to be approximated by the DNN.
In this work, motivated by the theoretical analysis of DNNs in feedback control applications shi2019neural; fazlyab2019efficient, we introduce an alternative perspective on the expressive power of DNNs based on their Lipschitz properties. Similar to raghu2017expressive, we consider a DNN with random weight parameters. By leveraging results from random matrix theory, we provide an analysis of the expressive power of DNNs based on their Lipschitz constant and establish connections with earlier results using alternative measures of DNN expressiveness. Our ultimate goal is to understand the implications of choosing particular neural network architectures for learning in feedback control applications.
We consider fully-connected DNNs, , that are defined as follows:
where is the input, is the output, the subscripts denote the layer index with being the input layer, being the hidden layers, and being the output layer, is the output from the th layer with being the element-wise activation function and being the number of neurons in the th layer, and and are the weight and bias parameters between layers and . In our analysis, we focus on DNNs with 1-Lipschitz activation functions virmaux2018lipschitz, which include most commonly used activation functions such as ReLU, tanh, and sigmoid.
To facilitate our analysis, similar to raghu2017expressive, in this work, we consider DNNs with random weight matrices whose elements are i.i.d. zero-mean Gaussian random variables, where
3 Lipschitz Constant as a Measure of Expressiveness
In this work, we characterize the expressiveness of a DNN by its Lipschitz constant. Intuitively, a larger Lipschitz constant implies that small changes in the DNN input can lead to large changes at the output, which provides greater flexibility to model nonlinear functions.
Formally, a function is said to be Lipschitz continuous on if
and its Lipschitz constant on is the smallest such that the inequality in (2) holds. It is not hard to verify that common activation functions (e.g., ReLU, tanh, and sigmoid) are globally Lipschitz continuous. A DNN with such activation functions is a finite number of compositions of Lipschitz continuous functions and is thus Lipschitz continuous on its domain . Note that, in general, the Lipschitz continuity condition in (2) is independent of the choice of the norm; in this work, we will consider Lipschitz continuity in the -norm.
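The claim that common activation functions are 1-Lipschitz can be checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates finite-difference slopes of ReLU, tanh, and sigmoid on a dense grid. The helper name `max_slope` and the grid range are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def max_slope(fn, lo=-10.0, hi=10.0, num=100_001):
    """Largest |f(a) - f(b)| / |a - b| over adjacent points of a dense grid."""
    x = np.linspace(lo, hi, num)
    y = fn(x)
    return float(np.max(np.abs(np.diff(y)) / np.diff(x)))

# Empirical slope bounds; each should not exceed 1 for a 1-Lipschitz function.
slopes = {
    "relu": max_slope(relu),
    "tanh": max_slope(np.tanh),
    "sigmoid": max_slope(sigmoid),
}
```

A grid check of this kind only gives evidence on the sampled interval; the global statement follows from the derivatives of these functions being bounded by one in magnitude.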
In the following subsections, we establish a connection between the expected Lipschitz constant of a DNN and its architecture (i.e., width and depth), and compare the result to existing results on the expressive power of DNNs in the literature. We summarize our main results in this manuscript and provide details of the derivations and proofs in the appendices.
3.1 Upper and Lower Bounds on the Lipschitz Constant of a DNN
As noted in fazlyab2019efficient; virmaux2018lipschitz, the exact computation of the Lipschitz constant of a DNN is NP-hard; however, for our purpose of understanding the expressiveness of DNNs, estimates of the upper and lower bounds on the Lipschitz constant of a DNN based on its weight matrices are sufficient.
Recall that we consider a family of DNNs with 1-Lipschitz activation functions. By the Lipschitz continuity of composite functions, an upper bound on the Lipschitz constant of a DNN (1) with 1-Lipschitz activation functions is the product of the spectral norms, or equivalently, of the maximum singular values of the weight matrices:
where denotes the upper bound on the Lipschitz constant of the DNN, denotes the spectral norm or the maximum singular value of the weight matrix . As derived in combettes2019lipschitz, a lower bound on the Lipschitz constant of a DNN is
which corresponds to the Lipschitz constant of a purely linear network (i.e., a network with activation nonlinearities removed).
Note that the upper and lower bounds on the Lipschitz constant of a DNN in (3) and (4) depend only on the maximum singular values of the weight matrices and their product. In the following analysis, we leverage random matrix theory to derive expressions of the bounds in (3) and (4) in terms of the width and depth of the DNN and the variance of the weight parameters .
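As a rough illustration of the two bounds described above, the sketch below computes the product of the spectral norms of the weight matrices (upper bound) and the spectral norm of the product of the weight matrices (the Lipschitz constant of the corresponding purely linear network, used as the lower bound). The function names, the layer sizes, and the unit variance are assumptions chosen for illustration.

```python
import numpy as np

def random_weights(layer_sizes, sigma=1.0, seed=0):
    """Sample one Gaussian weight matrix per layer, each of shape (n_out, n_in)."""
    rng = np.random.default_rng(seed)
    return [sigma * rng.standard_normal((n_out, n_in))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def lipschitz_upper_bound(weights):
    """Product of the maximum singular values of the weight matrices."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

def lipschitz_lower_bound(weights):
    """Maximum singular value of the product of the weight matrices."""
    product = weights[0]
    for W in weights[1:]:
        product = W @ product
    return float(np.linalg.norm(product, 2))

# Hypothetical architecture: 2 inputs, three hidden layers of 20 neurons, 1 output.
weights = random_weights([2, 20, 20, 20, 1])
upper = lipschitz_upper_bound(weights)
lower = lipschitz_lower_bound(weights)
```

By submultiplicativity of the spectral norm, `lower` can never exceed `upper`, so the two values bracket the quantity of interest for any sampled network.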
3.2 Estimates of the Lipschitz Constant Bounds Based on Extreme Singular Value Theorem
In this subsection, we establish a connection between the Lipschitz constant of a DNN and its architecture (i.e., width and depth) based on the extreme singular value theory for random matrices.
3.2.1 Upper Bound
In this part, we show that, for a sufficiently large , the expected upper bound on the Lipschitz constant (3) and hence the attainable expressiveness of a DNN increases exponentially with depth and polynomially with width. To start our discussion, we state the following result from random matrix theory on the extreme singular values of Gaussian random matrices:
[Gaussian Random Matrix (rudelson2010non)] Let be an matrix whose elements are independent standard normal random variables. Then, , where and denote the minimum and maximum singular values of , respectively, and represents the expected value.
Note that, for a Gaussian random matrix, the theorem above allows us to infer the extreme singular values of the matrix without explicitly knowing the values of its elements. By representing the weight parameters of a DNN as i.i.d. Gaussian random variables, we can leverage this result to estimate the upper bound of the Lipschitz constant (3). In particular, by applying Theorem 3.2.1, we prove the following theorem in App. A.1:
[Upper Bound on Lipschitz Constant of a Gaussian Random DNN] Consider a DNN defined in (1), where the weight parameters are independent Gaussian random variables distributed as with denoting the variance of the Gaussian distribution, and where the activation functions are 1-Lipschitz. The expected Lipschitz constant of the DNN is upper bounded by .
Theorem 3.2.1 allows us to obtain an intuition about the expected attainable Lipschitz constant and thus the flexibility of a DNN as we vary its width for and depth . To compare to established results serra2017bounding; raghu2017expressive, we set the width of the hidden layers to (i.e., for ), then the expected Lipschitz constant of a DNN with Gaussian random weights is upper bounded by . For , this upper bound increases exponentially with depth and polynomially with width.
This observation is consistent with the results on the expressiveness measured by the number of linear regions for piecewise linear networks serra2017bounding; raghu2017expressive and the expressiveness measured by the trajectory length for Gaussian random networks raghu2017expressive.
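The singular value result used above can be probed with a small Monte Carlo experiment. The sketch below assumes the classical form of the bound for an N-by-n standard Gaussian matrix, namely that the expected maximum singular value is at most sqrt(N) + sqrt(n), scaled by the standard deviation sigma for non-unit variance. The function `architecture_upper_bound`, which multiplies this per-layer bound over all layers, reflects our reading of the proof in App. A.1 and is an assumption rather than a quotation of the theorem.

```python
import numpy as np

def mean_max_singular_value(n_rows, n_cols, trials=200, seed=0):
    """Monte Carlo estimate of E[s_max] for a standard Gaussian matrix."""
    rng = np.random.default_rng(seed)
    values = [np.linalg.norm(rng.standard_normal((n_rows, n_cols)), 2)
              for _ in range(trials)]
    return float(np.mean(values))

def gordon_upper_bound(n_rows, n_cols, sigma=1.0):
    """Assumed bound on E[s_max]: sigma * (sqrt(N) + sqrt(n))."""
    return float(sigma * (np.sqrt(n_rows) + np.sqrt(n_cols)))

def architecture_upper_bound(layer_sizes, sigma=1.0):
    """Product of the per-layer bounds; depends only on widths, depth, and sigma."""
    return float(np.prod([gordon_upper_bound(n_out, n_in, sigma)
                          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]))

empirical = mean_max_singular_value(100, 50)
bound = gordon_upper_bound(100, 50)
arch_bound = architecture_upper_bound([2, 20, 20, 1])
```

Under this reading, each additional hidden layer of constant width n multiplies the bound by 2 * sigma * sqrt(n), which is one way to see the exponential dependence on depth when that factor exceeds one.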
3.2.2 Lower Bound
Similarly, based on the extreme singular value theorem for random matrices, we present a conjecture on the lower bound of the Lipschitz constant (4). We include a justification of the conjecture in App. A.2 and empirically illustrate the result in Sec. 4.
[Lower Bound on Lipschitz Constant of a Gaussian Random DNN] Consider a DNN defined in (1) where the weight parameters are independent Gaussian random variables distributed as and the activation functions are 1-Lipschitz. The Lipschitz constant of the DNN is approximately lower bounded by .
Based on Conjecture 3.2.2, if we consider a DNN with constant width (i.e., for ), the Lipschitz constant of the DNN with independent Gaussian weight parameters is approximately lower bounded by , which also increases exponentially in depth and polynomially in the width of the DNN given sufficiently large (i.e., ). Interestingly, we note that, for the case where and , this asymptotic lower bound based on the Lipschitz constant of the DNN coincides with the expressiveness lower bound based on the output trajectory length measure for DNNs with ReLU activation functions raghu2017expressive. This connection is sensible since the expressiveness measure in raghu2017expressive can be intuitively thought of as the extent to which the DNN stretches a trajectory in its input space, which is a property related to the Lipschitz constant of a DNN (see App. B for further details).
Note that, for both the upper and lower bound analysis, we require the magnitude of to be sufficiently large. Intuitively, a small means that the magnitude of the weights is small. In the extreme case, where all weights are zero, a deep architecture cannot be expressive under any notion of expressiveness (e.g., number of linear regions). We therefore require the spread of the weights to be sufficiently large to exploit the expressivity of the deep layers. This lower bound is typically not restrictive; as an example, is approximately 0.22 for .
3.2.3 Differences Compared to Other Expressiveness Measures
In this work, we propose to use the Lipschitz constant of a DNN as a measure of its expressiveness. In contrast to existing expressiveness measures, a Lipschitz-based characterization has two benefits:
Fewer assumptions on the DNN: Compared to previous work on piecewise linear DNNs pascanu2013number; montufar2014number; arora2016understanding; serra2017bounding, by considering the Lipschitz constant as the expressiveness measure, we do not constrain ourselves to DNNs with specific activation functions such as ReLUs or hard tanh. In our analysis, we only require the activation function to be 1-Lipschitz, which is satisfied by most commonly used activations, including but not limited to ReLU, tanh, hard tanh, and sigmoid.
Towards understanding DNN expressiveness for practical applications: In contrast to expressiveness measures such as the number of linear regions pascanu2013number; montufar2014number; arora2016understanding; serra2017bounding and trajectory length raghu2017expressive, the Lipschitz constant is a generic property for Lipschitz continuous nonlinear functions. For regression problems, the expressiveness characterization through the Lipschitz constant allows us to make sensible comparisons between a DNN and the function it approximates. For control applications, the Lipschitz constant also plays a critical role in stability analysis. The Lipschitz-based characterization of the expressiveness of a DNN has the potential to facilitate the design of deep architectures for safe and efficient learning in a closed-loop control setup.
4 Numerical Examples
In this section, we provide numerical examples that illustrate the insights on the expressiveness of DNNs based on the results in Sec. 3. In particular, we show the connection between the architectural properties of a DNN and its expressiveness.
4.1 Bounds on the Lipschitz Constant of a DNN
To visualize the results of Sec. 3, we randomly sample the weight parameters of DNNs from a zero-mean, unit variance Gaussian distribution and compare the upper and lower bounds on the Lipschitz constants of these DNNs as we increase their width and depth. To examine the quality of the estimated Lipschitz constant bounds from Sec. 3, we show in Fig. 1 a comparison of the estimated bounds computed based on Theorem 3.2.1 and Conjecture 3.2.2 and the bounds computed directly based on (3) and (4). From these plots, we see that there is a close correspondence between the Lipschitz constant bounds computed based on Theorem 3.2.1 and Conjecture 3.2.2, which assume random matrices, and the bounds computed from (3) and (4) using the actual network weights. This result indicates that the bounds provided in Theorem 3.2.1 and Conjecture 3.2.2 are good approximations of the bounds on the Lipschitz constant of a fixed DNN based on (3) and (4). We note that here we compute the bounds in (3) and (4) directly based on the sampled weight parameters that are known for this simulation study; in general, to understand the implications of a DNN architecture based on Theorem 3.2.1 and Conjecture 3.2.2, we do not rely on knowing the weights explicitly.
Figure 2 shows the upper and lower bounds of the Lipschitz constant based on Theorem 3.2.1 and Conjecture 3.2.2 for different DNN architectures. By inspecting horizontal slices and vertical slices of the plots in Fig. 2, which correspond to the top and bottom plots in Fig. 1, we see that the upper and lower bounds of the Lipschitz constant of a DNN increase exponentially with depth and polynomially with width. The dashed contour lines in the plots show DNN architectures with the same number of neurons. As we trace one of the contour lines from left to right, we see that increasing width and decreasing depth reduce the bounds of the Lipschitz constants, which indicates a decrease in the expressiveness of the deep architecture. Similar to the discussion in montufar2017number, based on our formulation, we also see that, given the same number of neurons, deeper networks are more compact representations of nonlinear functions.
4.2 Towards Learning Deep Models for Control
To illustrate the implication of the expressiveness of a DNN for control, we consider a simple system setup and examine the stability of the system when we use a DNN with different architectures in the loop. In particular, we consider a system that is represented by
where is the state, is Hurwitz, and is a function parametrized by a DNN. By Lyapunov’s direct method, one can show that a condition that guarantees stability of the system (5) is
where denotes the Lipschitz constant of the DNN, is a positive definite matrix, is the corresponding solution to the Lyapunov equation , and and are the minimum and maximum eigenvalues of a matrix, respectively.
For an illustration, we set . We compare five DNN architectures with different widths and depths but the same number of neurons. For each DNN architecture, we sample 50 DNNs with i.i.d. zero-mean, unit variance Gaussian weight parameters. Out of the five architectures, we know based on Theorem 3.2.1 that the first case, a DNN with a hidden layer of 300 neurons, has an estimated upper bound on the Lipschitz constant less than the safe upper bound in (6), so system (5) is stable. In contrast, as we can see from Fig. 2, when we decrease the width and increase the depth of a DNN, its Lipschitz constant increases and system (5) is less likely to be stable. Table 1 shows empirical results for the relationship between the architectural properties of a DNN and the stability of the system. This means that, in practice, one may want to carefully choose an appropriate DNN architecture, or, alternatively, regularize the weight parameters, to ensure stability of a learning-based control system. We consider our insights to be a step towards providing design guidelines for DNN architectures, for example, for closed-loop control applications.
|Architecture (width × depth)||10 × 30|
|Likelihood of stable system (%)||100||100||40||32||32|
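The stability check described in this subsection can be sketched in a few lines. The code below assumes that the system has the form dx/dt = A x + f(x) with f(0) = 0, and that the sufficient condition takes the textbook Lyapunov form L < lambda_min(Q) / (2 * lambda_max(P)), where P solves A^T P + P A = -Q. Both the system form and the exact condition are assumptions made for this illustration, as are the example matrices `A` and `Q` and the function names.

```python
import numpy as np

def solve_lyapunov(A, Q):
    """Solve A^T P + P A = -Q for P using a Kronecker-product linear system."""
    n = A.shape[0]
    eye = np.eye(n)
    # Row-major vectorization: vec(A^T P) = kron(A^T, I) vec(P), vec(P A) = kron(I, A^T) vec(P).
    K = np.kron(A.T, eye) + np.kron(eye, A.T)
    P = np.linalg.solve(K, -Q.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

def safe_lipschitz_bound(A, Q):
    """Assumed sufficient bound on the DNN Lipschitz constant for stability."""
    P = solve_lyapunov(A, Q)
    lam_min_q = np.linalg.eigvalsh(Q).min()
    lam_max_p = np.linalg.eigvalsh(P).max()
    return float(lam_min_q / (2.0 * lam_max_p)), P

# Hypothetical Hurwitz matrix (eigenvalues -1 and -2) and Q = I.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
Q = np.eye(2)
bound, P = safe_lipschitz_bound(A, Q)
residual = float(np.max(np.abs(A.T @ P + P @ A + Q)))
```

A sampled DNN would then be declared safe under this sufficient condition whenever an upper bound on its Lipschitz constant, such as the product of the spectral norms of its weight matrices, falls below `bound`.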
5 Discussion on the Assumption of Gaussian Random Weight Matrices
In this work, we considered DNNs with Gaussian random weight matrices to facilitate analysis of their expressiveness. In this section, we examine if this assumption is reasonable for practical applications. In particular, we examine, through some examples, the accuracy of estimating the maximum singular value of the weight matrices based on Theorem 3.2.1 when the assumption of Gaussian random matrices does not hold exactly.
|Network 1 (64 Neurons)||Network 2 (256 Neurons)|
|True Norm||Estimated Norm||True Norm||Estimated Norm|
To examine the properties of weight matrices in trained networks, we consider a regression problem. The true function to be approximated has two inputs and one output. Fig. 3 shows the distributions of two weight matrices from two trained networks with different architectures, and Table 2 summarizes their maximum singular values. By inspecting the distributions (Fig. 3), we see that the weights are not necessarily Gaussian-distributed; however, the estimates of the maximum singular values of the matrices based on the assumption of random weights are very close to the true maximum singular values (Table 2). Based on Bai-Yin’s law for extreme singular values of random matrices with more general distributions rudelson2010non, we can infer that the expected maximum singular value based on Theorem 3.2.1 is an approximation of the true maximum singular value of a random matrix with an error of , where is the standard deviation of the weight distribution and is the matrix column dimension. In the future, we plan to explore the properties of the weight matrices of trained networks and examine their relation to random matrix theory.
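To illustrate how the singular value estimate behaves when the weights are not Gaussian, the sketch below draws a matrix with zero-mean uniform entries and compares its true spectral norm with the estimate sigma * (sqrt(N) + sqrt(n)), where sigma is the empirical standard deviation of the entries. The uniform distribution, the matrix size, and the function name are assumptions chosen for illustration; this is not the trained-network experiment reported in Table 2.

```python
import numpy as np

def estimated_spectral_norm(W):
    """Estimate s_max(W) from its shape and the empirical spread of its entries."""
    n_rows, n_cols = W.shape
    return float(np.std(W) * (np.sqrt(n_rows) + np.sqrt(n_cols)))

rng = np.random.default_rng(0)
# Zero-mean, non-Gaussian entries with finite fourth moment.
W = rng.uniform(-0.5, 0.5, size=(256, 64))

true_norm = float(np.linalg.norm(W, 2))
est_norm = estimated_spectral_norm(W)
rel_error = abs(est_norm - true_norm) / true_norm
```

The estimate relies on the entries being approximately zero-mean and independent; a matrix with a large common offset in its entries would have a much larger spectral norm than this formula suggests.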
In this paper, we presented a new perspective on the expressiveness of DNNs based on their Lipschitz properties. Using random matrix theory, we showed that, given the spread of the weights is sufficiently large (i.e., for ), the expressiveness of a DNN measured by its Lipschitz constant grows exponentially with depth and polynomially with width. This result is similar to the results based on other expressiveness measures discussed in the current literature. By considering the Lipschitz constant as a measure of DNN expressiveness, we can more sensibly understand the implication of being ‘deep’ in the context of function approximation for applications including safe learning-based control.
Appendix A Proofs of Main Results in Sec. 3
A.1 Proof of Theorem 3.2.1: Upper Bound on Lipschitz Constant of a Gaussian Random DNN
We now prove Theorem 3.2.1 presented in Sec. 3. Based on the extreme singular value theorem for random matrices (Theorem 3.2.1), we derive an expression for the upper bound on the Lipschitz constant of a DNN in terms of its width and depth.
Consider a random matrix whose elements are independent Gaussian random variables distributed as . As a result of Theorem 3.2.1 and the homogeneity of the matrix norm, the expected maximum singular value of is upper bounded by . By assumption, the elements of each weight matrix are distributed as . The expected spectral norm, or equivalently the expected maximum singular value, of each weight matrix is upper bounded as follows:
The expression in (8) establishes a connection between the upper bound on the Lipschitz constant of a DNN and its architecture, which is represented by the dimensions of the weight matrices in this analysis. This result allows us to obtain insights on the expressiveness of a DNN without explicitly knowing the values of its weights.
A.2 Justification of Conjecture 3.2.2: Lower Bound on Lipschitz Constant of a Gaussian Random DNN
To derive an estimate of the lower bound in (4), we first note that the product of random Gaussian matrices is in general not a Gaussian random matrix. In deriving the lower bound, we need to consider a more general class of matrices than in Theorem 3.2.1:
[Random Matrix (rudelson2010non)] Let be an matrix whose elements are independent random variables with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions and grow to infinity with converging to a constant in . Then, and almost surely.
In contrast to Theorem 3.2.1, the above theorem is applicable to a wider class of random matrices with independent elements; however, this result is an asymptotic result in the limit of sufficiently large and . For practical DNNs where the dimensions of the weight matrices are sufficiently large, this theorem allows us to derive an approximate lower bound for (4). We provide a justification of Conjecture 3.2.2 presented in Sec. 3 of our manuscript below:
We consider two random matrices and whose elements are independent zero-mean random variables with variances and , respectively. The th row and th column element of the matrix product is , where denotes the th row and th column element of and denotes the th row and th column element of . Here, in our derivation, we make a conjecture that the elements of the product matrix of random matrices with elements being i.i.d. zero-mean random variables approximately preserve independence. Based on this conjecture, we derive an expression for the variance of the elements of . Without loss of generality, we consider the th row and th column element of . Since, by assumption, the elements of and have zero mean and are i.i.d., the variance of the th row and th column element of is
where denotes the variance of a random variable, and is the variance of the product of an element of and an element of . The standard deviation of elements in the product of can be written as
By applying (10) recursively, we can derive an estimate of the bound in (4), which is the spectral norm of the product of random matrices. In particular, a recursive relationship in the standard deviations of the product of random matrices can be written as
where denotes the standard deviation of the product of random matrices . For the product random matrix in (4), we have
As above, we make a conjecture that the elements of the product matrix constructed from the random weight matrices are independent. Since the elements of the product matrix are the sums of products of independent zero-mean random variables by construction, the elements of the product matrix have zero mean. Moreover, since the elements of the weight matrices are assumed to be Gaussian distributed, they have finite fourth moments. Furthermore, by the properties of the sum and product of random variables (dufour2003properties), the elements of the product matrix constructed from the weight matrices also have finite fourth moments. By Theorem A.2 and the homogeneity of matrix norms, for a random matrix whose elements are i.i.d. random variables with mean 0, variance , and finite fourth moment, the expected maximum singular value of is given by
Similar to the upper bound, this expected lower bound on the Lipschitz constant allows us to infer the Lipschitz constant of a DNN based on its architectural properties.
In Sec. 4 of the manuscript, we empirically show that the expression in (14) is a reasonable approximation of the lower bound of the Lipschitz constant of a DNN in (4). However, we note that, in our justification above, we make an assumption that the elements of the product matrix constructed from random matrices whose elements are i.i.d. zero-mean Gaussian random variables preserve independence. This is a conjecture that requires further investigation. We would like to further look into results on multiplications of random matrices to improve this result.
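The variance step in the justification above can be checked numerically. The sketch below assumes the relation implied by the derivation: if two independent zero-mean random matrices have element standard deviations sigma_a and sigma_b and share an inner dimension n, then each element of their product has standard deviation sqrt(n) * sigma_a * sigma_b. The dimensions and standard deviations are illustrative assumptions. The check concerns only the element-wise variance and says nothing about whether the elements of the product are independent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and element standard deviations.
n_out, n_mid, n_in = 200, 300, 200
sigma_a, sigma_b = 0.5, 2.0

A = sigma_a * rng.standard_normal((n_out, n_mid))
B = sigma_b * rng.standard_normal((n_mid, n_in))
product = A @ B

# Each product element is a sum of n_mid products of independent zero-mean terms.
empirical_std = float(np.std(product))
predicted_std = float(np.sqrt(n_mid) * sigma_a * sigma_b)
rel_error = abs(empirical_std - predicted_std) / predicted_std
```

Applying the same relation layer by layer gives the recursive expression for the standard deviation of a product of several weight matrices that the derivation uses.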
Appendix B Connection to the Result Based on Output Trajectory Length
In this appendix, we show a connection between our result and the result in raghu2017expressive. Both our work and raghu2017expressive consider DNNs with i.i.d. zero-mean Gaussian weight parameters. In our work, we use the Lipschitz constant as a measure of the expressiveness of a DNN, while in raghu2017expressive, the proposed expressiveness measure of a DNN is the expected length of an output trajectory as the DNN traverses a one-dimensional trajectory in its input space. Intuitively, as an input trajectory is passed through a DNN, it is deformed by the linear weight layers and the nonlinear activation layers; the output trajectory length measure in raghu2017expressive is the extent to which the DNN ‘stretches’ a trajectory given in the input space.
By considering the expected output trajectory length as the expressiveness measure, raghu2017expressive prove the following result:
[Lower Bound on Output Trajectory Length (raghu2017expressive)] Let be a DNN with ReLU activation functions and weights being i.i.d. Gaussian random variables , and let be a one-dimensional trajectory with having a non-trivial perpendicular component to for all . Denote as the image of the trajectory in the th layer of the DNN. The expected output trajectory length of the DNN is lower bounded by
where is the trajectory length and is the width of the DNN.
Note that, if we consider the expected output trajectory length normalized by the input trajectory length (i.e., the ‘stretch’ of the trajectory), we can establish a connection with the lower bound in (15) and the lower bound we derived based on Lipschitz constant expressiveness characterization in Sec. 3.2.2. In particular, in Sec. 3.2.2, we showed that for a DNN with a constant width (i.e., for ), the asymptotic lower bound on the Lipschitz constant of the DNN is . On the other hand, the normalized lower bound on the expected output trajectory in (15) can be written as . For and , this asymptotic lower bound from (15) coincides with the asymptotic lower bound we obtained based on the Lipschitz constant measure of expressiveness. Fig. 4 illustrates this connection between our proposed expressiveness measure based on the Lipschitz constant of a DNN and the expressiveness measure based on the output trajectory length raghu2017expressive for a set of ReLU DNNs with different widths and depths. From the plot, we see that, for DNNs with different architectures, the correlation between the asymptotic lower bounds based on these two measures of expressiveness (grey dots) approximately coincides with the identity line (red line).
The observed connection between the two measures of expressiveness of a DNN is sensible. If we consider the input trajectory to a DNN to be represented by a set of discrete points, the length of the output trajectory captures the extent of ‘stretch’ between pairs of points as they are passed through the DNN. Mathematically, the extent of ‘stretch’ or the distance between two points in a DNN’s output space in relation to the distance between the corresponding points in the input space is characterized by the Lipschitz property of the DNN.
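The relationship between trajectory stretch and the Lipschitz constant can be illustrated with a short experiment. The sketch below discretizes a circular input trajectory, passes it through a bias-free ReLU network with Gaussian random weights, and compares the ratio of output to input trajectory length with the product of the spectral norms of the weight matrices. The architecture, the circular trajectory, and the omission of biases are assumptions made for illustration.

```python
import numpy as np

def relu_dnn(x, weights):
    """Forward pass of a bias-free ReLU network; x has shape (n_in, n_points)."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h  # linear output layer

def polyline_length(points):
    """Sum of Euclidean distances between consecutive columns of points."""
    return float(np.sum(np.linalg.norm(np.diff(points, axis=1), axis=0)))

rng = np.random.default_rng(0)
sizes = [2, 20, 20, 20, 1]  # hypothetical architecture
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Unit circle in the two-dimensional input space, sampled at 2001 points.
t = np.linspace(0.0, 2.0 * np.pi, 2001)
trajectory = np.vstack([np.cos(t), np.sin(t)])

stretch = polyline_length(relu_dnn(trajectory, weights)) / polyline_length(trajectory)
upper = float(np.prod([np.linalg.norm(W, 2) for W in weights]))
```

Because every segment of the discretized trajectory can be stretched by at most the Lipschitz constant, `stretch` cannot exceed `upper` for any sampled network, which is the inequality underlying the connection discussed above.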