# Depth separation for reduced deep networks in nonlinear model reduction: Distilling shock waves in nonlinear hyperbolic problems

Classical reduced models are low-rank approximations using a fixed basis designed to achieve dimensionality reduction of large-scale systems. In this work, we introduce reduced deep networks, a generalization of classical reduced models formulated as deep neural networks. We prove depth separation results showing that reduced deep networks approximate solutions of parametrized hyperbolic partial differential equations with approximation error ϵ with 𝒪(|log(ϵ)|) degrees of freedom, even in the nonlinear setting where solutions exhibit shock waves. We also show that classical reduced models achieve exponentially worse approximation rates by establishing lower bounds on the relevant Kolmogorov N-widths.


## 1 Introduction

We propose reduced deep networks (RDNs), which are deep neural network (DNN) constructions that generalize classical reduced models [21, 41]. We show that RDNs achieve exponentially faster error decay with respect to the number of degrees of freedom when approximating solution manifolds of certain nonlinear hyperbolic partial differential equations (PDEs), in contrast to classical reduced models. Our arguments yield lower bounds on the smallest number of degrees of freedom necessary to achieve a given accuracy with classical reduced models, by estimating the Kolmogorov N-width [44, 21]. The lower bounds apply in general to a function class we call sharply convective, and they advance the existing results [40, 20, 63] beyond constant-speed problems. The two results indicate a type of depth separation: RDNs can achieve dimensionality reduction where shallow approximations such as classical reduced models cannot. The results are shown for representative hyperbolic problems, the color equation (variable-speed transport) and Burgers’ equation in a single spatial dimension.

Classical reduced models fail to be efficient not only for hyperbolic problems but for transport-dominated problems in general [52, 39]. Nonlinear model reduction techniques have been developed to overcome these limitations. These include the removal of symmetry [52], dynamical low-rank (DLR) approximations or the dynamically orthogonal (DO) method [25, 53, 35], the method of freezing [39], approximated Lax pairs [18], reduction of optimal transport maps [24], calibrated manifolds [6, 37], shock curve estimation [58], adaptive online low-rank updates [43, 42], adaptive h-refinement [7], shifted proper orthogonal decomposition (sPOD) [47], the Lagrangian basis method [34], transport reversal [50], transformed snapshot interpolation [63, 64], the generalized Lax-Phillips representation [49, 48], deep autoencoders [30], characteristic dynamic mode decomposition [55], registration methods [57], Wasserstein barycenters [14], unsupervised traveling wave identification with shifting truncation [33], a generalization of the moving finite element method (MFEM) [3], and Manifold Approximations via Transported Subspaces (MATS) [51]. A common feature among these new methods is the dynamic adaptation of the low-rank representation. The adaptation is achieved using low-rank updates, adaptive refinements, or nonlinear transformations.

The works [64, 30] make use of DNNs. There have also been efforts to approximate the solution manifold of parametric PDEs directly with DNNs [26, 46, 27, 17], by exploiting the expressive power of DNNs for approximating solutions of PDEs and nonlinear functions in general [11, 59, 65, 45, 12, 54]. DNNs have also been used to compute the reduced coefficients [62]. The key challenge in these approaches is achieving the level of computational efficiency desired in model reduction, as these DNN constructions are more computationally expensive to evaluate or manipulate than the classical reduced models.

MATS is a nonlinear reduced solution that is written as a composition of two low-rank representations, which allows efficient computations. The efficiency is equivalent to that of classical reduced models and thus enables it to be used directly with the governing differential equations and achieve significant speed-ups [51]. MATS was motivated by the distinguishing feature of hyperbolic PDEs, namely that the solution propagates along characteristic curves [16, 31]. However, there are limitations in its applicability, as the numerical experiments in [51] indicate that the efficiency of MATS depends on the regularity of the characteristic curves.

The RDN introduced here is a generalization of MATS with additional hidden layers, where each layer has a low-rank representation. We will show that RDNs yield efficient approximations of singular characteristic curves by using additional hidden layers with regular representations. Thus, RDNs can approximate solution manifolds of nonlinear hyperbolic PDEs, even when nonlinear shocks are present.

The RDN is reminiscent of the compression framework for deep networks that is being studied theoretically for improving generalization bounds [36, 2], or being utilized in practice to accelerate the performance of large networks in practical applications [8, 38, 9]. However, the fact that an RDN is a set of networks with a specifically designed degree of freedom, rather than a single network exhibiting low-rank structure in its weights, distinguishes it from the compression frameworks. Furthermore, the specific architecture we use includes special components, such as layers that compute the inverse of a function, that are not very common in generic architectures used in machine learning.

RDNs are different from deep network approximations that have sparse connections [4, 29]. An RDN can be viewed as a dense network with a large number of activations, albeit with very few effective parameters. But beyond the differences in the architecture, the RDNs are constructed to maintain important properties that are indispensable in model reduction. While sparse approximations lead to efficient approximations of general function classes [60], such approximations are difficult to deploy in model reduction applications. For example, the choice of best terms is not necessarily regular with respect to the target of approximation, whereas the success of the reduced system relies crucially on such regularity.

In the machine learning literature, distillation or model compression refers to the transferring of the learned knowledge from an accurate model to another specialized model that is more efficient for deployment [5, 22]. Model reduction is driven by an identical motivation.

## 2 Reduced deep networks

In this section, we introduce RDNs and the notion of deep reduction. We first provide a brief overview of model reduction for computing reduced solutions and then show that reduced solutions can be represented as shallow networks. We then derive deep-network representations of reduced solutions, resulting in RDNs.

### 2.1 Model reduction

We give a brief overview of model reduction. For a comprehensive review, we refer the reader to the references [41, 21].

Our goal is to approximate solutions of PDEs. The specific PDEs will be defined later. For now, it is only important that the solution functions $u$ depend on the spatial variable $x$, time $t$, and parameters $\mu$. Let us denote by $\mathcal{M}$ the solution manifold,

$$\mathcal{M} := \{ u(\cdot, t; \mu) \in \mathcal{V} : \Omega \to \mathbb{R} \mid t \in [0, t_F],\ \mu \in \mathcal{D} \}, \tag{2.1}$$

which is a set of functions in a real Hilbert space $\mathcal{V}$ over the spatial domain $\Omega$. The parameter domain is $\mathcal{D}$ and the time interval is $[0, t_F]$. Throughout, $\mathbb{N}$ denotes the set of natural numbers.

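A full solution (or a full-model solution) is an approximation $u^{(f)}_{N_\delta}$ of a solution $u$ in a finite-dimensional subspace spanned by basis functions $\varphi_1, \dots, \varphi_{N_\delta}$,

$$u^{(f)}_{N_\delta}(x; t, \mu) = \sum_{n=1}^{N_\delta} w_n(t, \mu)\, \varphi_n(x), \tag{2.2}$$

with coefficients $w_n(t, \mu)$ that depend on time and parameter. For ease of exposition, we consider in the following full solutions that are piecewise linear in the spatial variable on an equidistant grid with $N_\delta$ grid points and $\{\varphi_n\}_{n=1}^{N_\delta}$ being the canonical nodal point basis [56]. Then, for all $\delta > 0$, there is $N_\delta$ large enough so that for each $(t, \mu) \in [0, t_F] \times \mathcal{D}$ the full solution of the form Eq. 2.2 approximates the solution $u$ with

$$\left\| u(\cdot; t, \mu) - u^{(f)}_{N_\delta}(\cdot; t, \mu) \right\|_{\mathcal{V}} < \delta. \tag{2.3}$$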

For a fixed , the approximate solution manifold is

$$\mathcal{M}^{(f)} := \{ u^{(f)}_{N_\delta}(\cdot; t, \mu) : (t, \mu) \in [0, t_F] \times \mathcal{D} \}. \tag{2.4}$$

Full solutions are typically computed with finite-difference, finite-element, or finite-volume methods, which can be computationally expensive if a large $N_\delta$ is required to achieve the desired tolerance $\delta$. Model reduction aims to construct reduced solutions in problem-dependent subspaces of much lower dimension to reduce computational costs [41, 21]. Model reduction consists of an offline stage and an online stage. During the offline stage, the basis of the low-dimensional subspace, the reduced space, is constructed. A reduced basis is typically computed by collecting a finite subset of full solutions in $\mathcal{M}^{(f)}$ and then computing a low-dimensional basis using, e.g., the singular value decomposition (SVD) [19]. Let $\{\xi_m\}_{m=1}^{M}$ be the set of the reduced-basis functions.

In the online stage, a reduced solution (or a reduced-model solution) is derived in the space spanned by the reduced basis,

$$u^{(r)}_{M}(x; t, \mu) := \sum_{m=1}^{M} \gamma_m(t, \mu)\, \xi_m(x). \tag{2.5}$$

The coefficients $\gamma_m(t, \mu)$ of the reduced solutions are obtained by solving a system of equations for any given $(t, \mu)$. The reduced system is derived using the PDE. The computational complexity of solving the reduced system scales with the dimension $M$ of the reduced space and is independent of the dimension $N_\delta$ of the full solutions. If the dimension of the reduced space is small compared to the dimension of the full solutions, then solving for the reduced solution can be computationally cheaper than solving for the full solution. At the same time, the dimension of the reduced space needs to be chosen sufficiently large so that the reduced solutions are sufficiently accurate.

Analogously to Eq. 2.2, we assume in the following that for all $\varepsilon > 0$, there exists $M$ such that for each $(t, \mu) \in [0, t_F] \times \mathcal{D}$, the solution $u$ can be approximated with a reduced solution of the form Eq. 2.5 satisfying

$$\left\| u(\cdot; t, \mu) - u^{(r)}_{M}(\cdot; t, \mu) \right\|_{\mathcal{V}} < \varepsilon. \tag{2.6}$$

Note that in the model reduction literature, the error Eq. 2.6 is typically obtained with respect to the full solution $u^{(f)}_{N_\delta}$, rather than the (exact) solution $u$. For a fixed reduced basis with $M$ basis functions, we call the set of reduced solutions that satisfies Eq. 2.6 the reduced solution manifold,

$$\mathcal{M}^{(r)} := \{ u^{(r)}_{M}(\cdot; t, \mu) : (t, \mu) \in [0, t_F] \times \mathcal{D} \}. \tag{2.7}$$
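
The offline/online split described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of any specific reference: the function `snapshot` is a hypothetical two-mode family standing in for full solutions, and the online coefficients are obtained by orthogonal projection rather than by solving a PDE-derived reduced system.

```python
import numpy as np

def snapshot(x, t, mu):
    """Hypothetical two-mode family standing in for a full solution u(x; t, mu)."""
    return (np.exp(-mu * np.pi**2 * t) * np.sin(np.pi * x)
            + 0.5 * np.exp(-4.0 * mu * np.pi**2 * t) * np.sin(2.0 * np.pi * x))

x = np.linspace(0.0, 1.0, 201)

# Offline stage: collect full solutions at sampled (t, mu) and extract a basis by SVD.
train = [(t, mu) for t in np.linspace(0.0, 1.0, 10) for mu in np.linspace(0.1, 1.0, 10)]
S = np.column_stack([snapshot(x, t, mu) for t, mu in train])
left, sing, _ = np.linalg.svd(S, full_matrices=False)
M = 2
xi = left[:, :M]  # columns play the role of the reduced basis xi_1, ..., xi_M

# Online stage (stand-in): coefficients gamma by orthogonal projection onto the basis.
u_new = snapshot(x, 0.37, 0.55)   # (t, mu) not in the training set
gamma = xi.T @ u_new
u_reduced = xi @ gamma
rel_err = np.linalg.norm(u_new - u_reduced) / np.linalg.norm(u_new)
print(f"M = {M}, relative error = {rel_err:.2e}")
```

Because every member of this toy family lies in the span of two fixed spatial profiles, two basis functions suffice; transport-dominated families, discussed later, behave very differently.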

### 2.2 Deep neural networks (DNNs)

We now define deep feed-forward neural networks. We define the set $\mathcal{P}$ to contain the two possible choices of activation functions in our networks: the rectified linear unit (ReLU) $\sigma(x) := \max\{0, x\}$ and the threshold function. The input variable is scalar unless specified otherwise, and the output is scalar-valued. Note that the inclusion of threshold functions in $\mathcal{P}$ is not strictly necessary, but simplifies the exposition. On the other hand, other activations yielding universal approximations can be used without affecting the results in this work (see, e.g., [15]).

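We denote by $\odot$ the entry-wise composition: given a vector of functions $\rho = [\rho_1, \dots, \rho_N]^T$ and a real vector $z = [z_1, \dots, z_N]^T$, the entry-wise composition is given by

$$\rho \odot z := [\rho_1(z_1), \dots, \rho_N(z_N)]^T.$$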

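For a specified total number of layers $L$ and widths $\mathbf{N} = (N_1, \dots, N_{L+1})$, we denote the weights, biases, and activations by

$$\begin{cases} W_\ell \in \mathbb{R}^{N_{\ell+1} \times N_\ell}, & \ell = 1, \dots, L, \\ b_\ell \in \mathbb{R}^{N_{\ell+1} \times 1}, & \ell = 1, \dots, L, \\ \rho_\ell \in \mathcal{P}^{N_{\ell+1}}, & \ell = 1, \dots, L-1, \end{cases} \qquad \begin{cases} \mathbf{W} := (W_1, \dots, W_L), \\ \mathbf{B} := (b_1, \dots, b_L), \\ \mathbf{P} := (\rho_1, \dots, \rho_{L-1}). \end{cases} \tag{2.8}$$

We define the corresponding sets of weights, biases, and activations

$$\begin{cases} \mathcal{W}(\mathbf{N}) = \mathbb{R}^{N_2 \times N_1} \times \cdots \times \mathbb{R}^{N_{L+1} \times N_L}, \\ \mathcal{B}(\mathbf{N}) = \mathbb{R}^{N_2 \times 1} \times \cdots \times \mathbb{R}^{N_{L+1} \times 1}, \\ \mathcal{P}(\mathbf{N}) = \mathcal{P}^{N_2 \times 1} \times \cdots \times \mathcal{P}^{N_L \times 1}. \end{cases} \tag{2.9}$$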

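Let us define the affine maps $A_\ell$ for $\ell = 1, \dots, L$,

$$A_\ell(z) = W_\ell z + b_\ell, \qquad W_\ell \in \mathbb{R}^{N_{\ell+1} \times N_\ell}, \quad b_\ell \in \mathbb{R}^{N_{\ell+1}}. \tag{2.10}$$

Entries of $W_\ell$ and those of $b_\ell$ are called weights and biases, respectively. A deep network is formed by the alternating compositions of these affine functions with activations in $\mathcal{P}$.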

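A deep neural network (DNN) or a deep network with $L$ layers is given by

$$\bar{f}_{\mathbf{P}}(x) = A_L \circ \rho_{L-1} \odot A_{L-1} \circ \cdots \circ \rho_2 \odot A_2 \circ \rho_1 \odot A_1(x), \tag{2.11}$$

where $(\mathbf{W}, \mathbf{B}, \mathbf{P}) \in \mathcal{W}(\mathbf{N}) \times \mathcal{B}(\mathbf{N}) \times \mathcal{P}(\mathbf{N})$ for some $\mathbf{N} \in \mathbb{N}^{L+1}$. We denote the class of such networks by $\bar{\mathcal{N}}$,

$$\bar{\mathcal{N}} := \{ \bar{f}_{\mathbf{P}} \mid \bar{f}_{\mathbf{P}} \text{ of the form Eq. 2.11 with } \mathbf{N} \in \mathbb{N}^{L+1},\ L \in \mathbb{N} \}. \tag{2.12}$$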
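
The alternating composition of affine maps and entry-wise activations can be made concrete with a short forward pass. The following is a minimal sketch: the function names `entrywise` and `forward` are illustrative, and the convention that the threshold function returns 1 for positive arguments and 0 otherwise is an assumption, since the text does not fix it.

```python
import numpy as np

def relu(z):
    return max(z, 0.0)

def threshold(z):
    # Assumed convention: 1 for z > 0, else 0.
    return 1.0 if z > 0 else 0.0

def entrywise(rho, z):
    """Entry-wise composition: apply the n-th function in rho to the n-th entry of z."""
    return np.array([rho_n(z_n) for rho_n, z_n in zip(rho, z)])

def forward(x, weights, biases, activations):
    """Evaluate A_L o rho_{L-1} (.) A_{L-1} o ... o rho_1 (.) A_1 at x."""
    z = np.atleast_1d(np.asarray(x, dtype=float))
    last = len(weights) - 1
    for ell, (W, b) in enumerate(zip(weights, biases)):
        z = W @ z + b                            # affine map A_ell
        if ell < last:
            z = entrywise(activations[ell], z)   # activation layer rho_ell
    return z

# Two-layer example: |x| = relu(x) + relu(-x).
weights = [np.array([[1.0], [-1.0]]), np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.zeros(1)]
activations = [[relu, relu]]
print(forward(-3.0, weights, biases, activations))
```

Each hidden layer may mix ReLU and threshold units, which is why the activations are stored as one list of functions per layer rather than a single function.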

A full deep network solution and the corresponding solution manifold are defined analogously to the full solution and the approximate solution manifold defined in Section 2.1.

###### Definition 1 (Full deep network solution)
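1. Given an error threshold $\delta > 0$, if for each $(t, \mu) \in [0, t_F] \times \mathcal{D}$ corresponding to $u(\cdot; t, \mu) \in \mathcal{M}$ there exists $\bar{u}_{\mathbf{P}}(\cdot; t, \mu) \in \bar{\mathcal{N}}$ that

• has dimensions $\mathbf{N}$ and the choice of activations $\mathbf{P}$, both independent of $(t, \mu)$,

• has weights $\mathbf{W}(t, \mu) \in \mathcal{W}(\mathbf{N})$ and biases $\mathbf{B}(t, \mu) \in \mathcal{B}(\mathbf{N})$,

• satisfies the estimate

$$\left\| u(\cdot; t, \mu) - \bar{u}_{\mathbf{P}}(\cdot; t, \mu) \right\|_{\mathcal{V}} < \delta, \tag{2.13}$$

then we call $\bar{u}_{\mathbf{P}}$ a full deep network solution.

2. We denote the full deep network solution manifold by

$$\bar{\mathcal{M}}^{(f)} := \{ \bar{u}_{\mathbf{P}}(\cdot; t, \mu) \in \bar{\mathcal{N}} \mid (t, \mu) \in [0, t_F] \times \mathcal{D} \}, \tag{2.14}$$

and say that $\bar{\mathcal{M}}^{(f)}$ has the dimensions $\mathbf{N}$.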

### 2.3 Reduced deep networks and deep reduction

We now introduce RDNs, a deep network generalization of classical reduced models. They are derived by writing down the low-rank approximation to the weight matrices in DNNs.

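Suppose we are given a finite sample of deep networks in $\bar{\mathcal{N}}$ with identical dimensions $\mathbf{N}$ ($\mathbf{N} \in \mathbb{N}^{L+1}$) and activations $\mathbf{P}$. That is,

$$\bar{\mathcal{N}}_S := \{ \bar{f}^{(i)}_{\mathbf{P}}(x) \in \bar{\mathcal{N}} \mid i = 1, \dots, S \}. \tag{2.15}$$

Then let us denote the weights and biases of the $\ell$-th layer of $\bar{f}^{(i)}_{\mathbf{P}}$ by $W_{\ell i}$ and $b_{\ell i}$ for $\ell = 1, \dots, L$ and $i = 1, \dots, S$. Then we may write

$$\begin{cases} W_{\ell i} = U_\ell \Gamma_{\ell i} V_\ell^T, \quad b_{\ell i} = U_\ell c_{\ell i}, \\ U_\ell \in \mathbb{R}^{N_\ell \times N_\ell}, \quad V_\ell \in \mathbb{R}^{N_{\ell-1} \times N_{\ell-1}}, \quad \Gamma_{\ell i} \in \mathbb{R}^{N_\ell \times N_{\ell-1}}, \end{cases} \tag{2.16}$$

in which $U_\ell$ and $V_\ell$ contain orthogonal columns.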

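Now, suppose that there are low-rank approximations $\tilde{W}_{\ell i}$ and $\tilde{b}_{\ell i}$ of the form

$$\tilde{W}_{\ell i} = \tilde{U}_\ell \tilde{\Gamma}_{\ell i} \tilde{V}_\ell^T, \qquad \tilde{b}_{\ell i} = \tilde{U}_\ell \tilde{c}_{\ell i} \tag{2.17}$$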

in which , , , with , the columns of are columns of , and and are sufficiently small. Then has a truncated version given by Projecting the input to the column space of , we obtain the reduced affine maps

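$$B_\ell(y) := \Gamma_\ell y + c_\ell, \qquad \Gamma_\ell \in \mathbb{R}^{M_\ell \times M_{\ell-1}}, \quad c_\ell \in \mathbb{R}^{M_\ell \times 1}. \tag{2.18}$$

By including a dummy input in $B_\ell$ for every $\ell$, we may drop the bias $c_\ell$. Hence, we let without loss of generality

$$B_\ell(y) := \Gamma_\ell y, \qquad \Gamma_\ell \in \mathbb{R}^{M_\ell \times M_{\ell-1}}. \tag{2.19}$$

Let us define the reduced activations

$$\xi_\ell(y) := \tilde{U}_{\ell+1}^T \rho_\ell \odot (\tilde{V}_\ell y), \qquad (\rho_\ell \in \mathcal{P}^{N_\ell}). \tag{2.20}$$

Collecting all the weights and reduced activations, let

$$\Gamma := (\Gamma_1, \dots, \Gamma_{L_0}), \qquad \Xi := (\xi_1, \dots, \xi_{L_0 - 1}), \tag{2.21}$$

and define the space of weights given by

$$\mathcal{G}(\mathbf{M}) = \mathbb{R}^{M_2 \times M_1} \times \cdots \times \mathbb{R}^{M_{L_0+1} \times M_{L_0}}. \tag{2.22}$$
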
###### Definition 2 (Reduced deep network)

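Given dimensions $\mathbf{M} \in \mathbb{N}^{L+1}$, reduced weights $\Gamma \in \mathcal{G}(\mathbf{M})$, reduced activations $\Xi$ of the form Eq. 2.20, and $B_\ell$ of the form Eq. 2.19, we define a reduced deep network (RDN) as given by

$$\bar{f}^{(r)}_{\Xi}(x) := B_L \circ \xi_{L-1} \odot B_{L-1} \circ \cdots \circ \xi_2 \odot B_2 \circ \xi_1 \odot B_1(x). \tag{2.23}$$

We will denote the class of reduced deep networks by

$$\bar{\mathcal{N}}^{(r)} := \{ \bar{f}^{(r)}_{\Xi} \mid \bar{f}^{(r)}_{\Xi} \text{ of the form Eq. 2.23 with } \mathbf{M} \in \mathbb{N}^{L+1},\ L \in \mathbb{N} \}. \tag{2.24}$$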
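
To illustrate the idea behind the reduced network, the sketch below builds a bias-free ReLU network whose weight matrices share fixed low-rank factors and checks that it can be evaluated using only the small matrices $\Gamma_\ell$ and the fixed reduced activations. It rests on several assumptions that are not taken from the text: the widths and ranks are arbitrary, the factors are random, and the sketch uses its own index convention, $W_\ell = U_\ell \Gamma_\ell V_\ell^T$ with $W_\ell$ mapping layer $\ell$ to layer $\ell+1$, so that the activation between layers $\ell$ and $\ell+1$ is $y \mapsto V_{\ell+1}^T \sigma(U_\ell y)$. The naming of the factors may differ from Eq. 2.20.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def orthonormal_columns(n, m):
    """Return an n x m matrix with orthonormal columns (requires m <= n)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, m)))
    return q

widths = [1, 50, 40, 1]   # full layer widths (illustrative)
ranks = [1, 4, 3, 1]      # reduced widths (illustrative)
L = len(widths) - 1

# Factors shared by every network in the family; only the gammas vary.
U = [orthonormal_columns(widths[l + 1], ranks[l + 1]) for l in range(L)]
V = [orthonormal_columns(widths[l], ranks[l]) for l in range(L)]

def full_network(x, gammas):
    """Evaluate the network with full weights W_l = U_l Gamma_l V_l^T."""
    z = np.atleast_1d(np.asarray(x, dtype=float))
    for l in range(L):
        z = (U[l] @ gammas[l] @ V[l].T) @ z
        if l < L - 1:
            z = relu(z)
    return z

def reduced_network(x, gammas):
    """Evaluate the same function through reduced weights and reduced activations."""
    y = V[0].T @ np.atleast_1d(np.asarray(x, dtype=float))
    for l in range(L):
        y = gammas[l] @ y                      # reduced linear map
        if l < L - 1:
            y = V[l + 1].T @ relu(U[l] @ y)    # reduced activation
    return U[L - 1] @ y

gammas = [rng.standard_normal((ranks[l + 1], ranks[l])) for l in range(L)]
gap = np.max(np.abs(full_network(0.3, gammas) - reduced_network(0.3, gammas)))
n_full = sum(widths[l + 1] * widths[l] for l in range(L))
n_reduced = sum(g.size for g in gammas)
print(f"gap = {gap:.2e}, full weights = {n_full}, reduced weights = {n_reduced}")
```

The reduced evaluation still applies many scalar activations per layer, but the quantities that change from one network in the family to the next are only the entries of the small matrices.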

We call the procedure of obtaining RDNs from a subset of discussed above deep reduction. The RDN is determined by the reduced activations the reduced weights , and the total number of degrees of freedom in the weight parameters is small, equal to minus the number of shared weights or biases.

The primary utility of RDN from the model reduction point of view is in finding with small degrees of freedom such that, for each it satisfies .

###### Definition 3 (Reduced deep network solution)
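1. Given an error threshold $\varepsilon > 0$, if for each $(t, \mu) \in [0, t_F] \times \mathcal{D}$ corresponding to $u(\cdot, t; \mu) \in \mathcal{M}$ there exists $\bar{u}^{(r)}_{\Xi}(\cdot, t; \mu) \in \bar{\mathcal{N}}^{(r)}$ that

• has dimensions $\mathbf{M}$ and reduced activations $\Xi$ of the form Eq. 2.20, both independent of $(t, \mu)$,

• has reduced weights $\Gamma(t, \mu) \in \mathcal{G}(\mathbf{M})$,

• satisfies the estimate

$$\left\| u(\cdot, t; \mu) - \bar{u}^{(r)}_{\Xi}(\cdot, t; \mu) \right\|_{\mathcal{V}} < \varepsilon, \tag{2.25}$$

then we call $\bar{u}^{(r)}_{\Xi}$ a reduced deep network solution.

2. Denote the reduced deep network solution manifold by

$$\bar{\mathcal{M}}^{(r)} := \{ \bar{u}^{(r)}_{\Xi}(\cdot; t, \mu) \in \bar{\mathcal{N}}^{(r)} \mid (t, \mu) \in [0, t_F] \times \mathcal{D} \}, \tag{2.26}$$

and say that $\bar{\mathcal{M}}^{(r)}$ has the dimensions $\mathbf{M}$.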

### 2.4 Example: Full and reduced solutions as 2-layer networks

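As an example, we will show that the classical model reduction framework from Section 2.1 can be expressed in terms of neural networks. A 2-layer network is a member of $\bar{\mathcal{N}}$ (Eq. 2.12) with two layers ($L = 2$ in Eq. 2.11). Such a network of width $N_\delta$ can be written in the form

$$f_\rho(x) = \sum_{n=1}^{N_\delta} w_{2,n}\, \rho_n(w_{1,n} x + b_{1,n}) + b_2 = w_2\, \rho \odot (w_1 x + b_1) + b_2, \tag{2.27}$$

where $w_1, b_1 \in \mathbb{R}^{N_\delta \times 1}$, $w_2 \in \mathbb{R}^{1 \times N_\delta}$, $b_2 \in \mathbb{R}$, and $\rho \in \mathcal{P}^{N_\delta}$.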

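We defined full solutions (2.2) as piecewise linear functions on an equidistant grid with $N_\delta$ grid points, which can be represented as a specific 2-layer network whose weights and biases in the hidden layer are fixed. With grid-width $\Delta x$ and the number of grid-points $N_\delta$, set

$$\begin{aligned} \underline{w}_1 &:= \frac{1}{\Delta x}[1, \dots, 1]^T = \frac{1}{\Delta x}\mathbf{1}_{N_\delta}, & \underline{b}_1 &:= [1, 0, -1, -2, \dots, 2 - N_\delta]^T, \\ \underline{\rho} &:= [\sigma, \dots, \sigma]^T, & \underline{b}_2 &:= 0. \end{aligned} \tag{2.28}$$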

Having fixed these weights and biases, only $w_2$ is allowed to vary, so we will simplify the notation by newly denoting the variable weights by $w$, and write

$$\underline{f}_{N_\delta}(x) := w\, \underline{\rho} \odot (\underline{w}_1 x + \underline{b}_1). \tag{2.29}$$

We will denote the class of these specific networks given by Eqs. 2.28 and 2.29 by

$$\underline{\mathcal{N}} := \{ \underline{f}_{N_\delta}(x) \mid \underline{f}_{N_\delta}(x) \text{ of the form Eq. 2.29},\ N_\delta \in \mathbb{N} \}. \tag{2.30}$$

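Then $\underline{\mathcal{N}}$ is equivalent to the set of continuous piecewise linear functions on the equidistant grid: any $\underline{f}_{N_\delta} \in \underline{\mathcal{N}}$ can be written as a special case of a full solution Eq. 2.2,

$$\underline{f}_{N_\delta}(x) = \sum_{n=1}^{N_\delta} w_n \varphi_n(x), \qquad \varphi_n(x) := \frac{1}{\Delta x}\, \sigma\big(x - \Delta x (n - 2)\big), \tag{2.31}$$

and $\{\varphi_n\}_{n=1}^{N_\delta}$ forms a basis of the space of continuous piecewise linear functions on an equidistant grid on $\Omega$ of grid-width $\Delta x$. Since $\underline{\mathcal{N}}$ is dense in $\mathcal{V}$, its members can serve the role of full solutions Eq. 2.3. Thus we can find the approximate solution manifold using the 2-layer networks in $\underline{\mathcal{N}}$, and denote it by

$$\underline{\mathcal{M}}^{(f)} = \{ \underline{u}_{N_\delta}(\cdot; t, \mu) \in \underline{\mathcal{N}} \mid (t, \mu) \in [0, t_F] \times \mathcal{D} \}. \tag{2.32}$$

The set $\underline{\mathcal{M}}^{(f)}$ corresponds to the set of full solutions Eq. 2.4 in classical model reduction.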
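
The correspondence between the fixed-hidden-layer 2-layer network and piecewise linear functions can be checked numerically. The sketch below assumes the domain $[0, 1]$ with the first grid point at $x = 0$, and takes the hidden biases as $[1, 0, -1, \dots, 2 - N_\delta]$ so that the $n$-th hidden unit equals $\sigma(x/\Delta x - (n - 2))$, matching the basis in Eq. 2.31. The helper names and the target function are illustrative.

```python
import numpy as np

def hidden_layer(x, n_grid, dx):
    """Fixed hidden layer: ReLU applied entry-wise to w1 * x + b1."""
    w1 = np.full(n_grid, 1.0 / dx)
    b1 = 1.0 - np.arange(n_grid)          # [1, 0, -1, ..., 2 - n_grid]
    return np.maximum(np.outer(np.atleast_1d(x), w1) + b1, 0.0)

def shallow_network(x, w, dx):
    """2-layer network whose only free parameters are the outer weights w."""
    return hidden_layer(x, w.size, dx) @ w

n_grid = 11
grid = np.linspace(0.0, 1.0, n_grid)
dx = grid[1] - grid[0]
values = np.sin(2.0 * np.pi * grid)       # nodal values of an illustrative target

# Outer weights follow from a lower-triangular system with unit diagonal.
Phi = hidden_layer(grid, n_grid, dx)
w = np.linalg.solve(Phi, values)

# Between grid points the network should agree with linear interpolation.
x_test = np.linspace(0.0, 1.0, 257)
max_gap = np.max(np.abs(shallow_network(x_test, w, dx) - np.interp(x_test, grid, values)))
print(f"max deviation from the piecewise linear interpolant: {max_gap:.2e}")
```

The outer weights are not the nodal values themselves; they are obtained from the nodal values by solving a triangular system, which is why the ReLU units span the same space as the nodal basis without coinciding with it.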

If the full 2-layer network solutions Eq. 2.32 have weights $w(t, \mu)$ that lie in a low-dimensional subspace with dimension $M$, then one may write

$$w(t, \mu) = \gamma(t, \mu) V^T, \qquad \gamma(t, \mu) = [\gamma_1(t, \mu), \dots, \gamma_M(t, \mu)] \in \mathbb{R}^{1 \times M}, \quad V \in \mathbb{R}^{N_\delta \times M}, \tag{2.33}$$

in which $V$ has orthogonal columns. Then one obtains the reduced representation

$$\xi(x) = [\xi_1(x), \dots, \xi_M(x)]^T := V^T \underline{\rho} \odot (\underline{w}_1 x + \underline{b}_1). \tag{2.34}$$

Each entry of $\xi$ is a reduced activation function Eq. 2.20. This leads to a reduced 2-layer network,

$$\underline{f}^{(r)}_{M}(x) := \gamma\, \xi(x) = \sum_{m=1}^{M} \gamma_m \xi_m(x). \tag{2.35}$$

We shall denote the class of such 2-layer networks by

$$\underline{\mathcal{N}}^{(r)} := \{ \underline{f}^{(r)}_{M} \mid \underline{f}^{(r)}_{M}, \xi \text{ of the form Eqs. 2.35 and 2.34},\ M \in \mathbb{N} \}. \tag{2.36}$$

The set of reduced solutions in $\underline{\mathcal{N}}^{(r)}$ that approximate the solution manifold form the reduced 2-layer network solution manifold,

$$\underline{\mathcal{M}}^{(r)} := \{ \underline{u}^{(r)}_{M}(\cdot; t, \mu) \in \underline{\mathcal{N}}^{(r)} \mid (t, \mu) \in [0, t_F] \times \mathcal{D} \}. \tag{2.37}$$

Then the reduced activations Eq. 2.34 correspond to the reduced basis functions Eq. 2.5 in classical model reduction, and the reduced 2-layer network solution Eq. 2.35 to a reduced solution with $M$ degrees of freedom.
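
The passage from the full to the reduced 2-layer network can be sketched as follows. The outer weights here are synthetic: they are drawn at random from a hypothetical $M$-dimensional subspace rather than computed from a PDE, the matrix $V$ is taken from an SVD of the sampled weights, and the hidden biases are assumed to be $[1, 0, -1, \dots, 2 - N_\delta]$ on the domain $[0, 1]$.

```python
import numpy as np

def hidden_layer(x, n_grid, dx):
    """Fixed hidden layer of the full 2-layer network (ReLU units on a uniform grid)."""
    b1 = 1.0 - np.arange(n_grid)
    return np.maximum(np.outer(np.atleast_1d(x), np.full(n_grid, 1.0 / dx)) + b1, 0.0)

rng = np.random.default_rng(1)
n_grid, M, n_samples = 64, 3, 20
dx = 1.0 / (n_grid - 1)

# Synthetic outer weights w(t, mu), one row per sample, confined to an M-dimensional subspace.
subspace = rng.standard_normal((M, n_grid))
weight_samples = rng.standard_normal((n_samples, M)) @ subspace

# V: orthonormal basis of the weight subspace, from the right singular vectors.
_, sing, vt = np.linalg.svd(weight_samples, full_matrices=False)
V = vt[:M].T                               # shape (n_grid, M)

x = np.linspace(0.0, 1.0, 101)
H = hidden_layer(x, n_grid, dx)            # hidden features at each x
xi = H @ V                                 # row j holds the reduced activations at x[j]

w = weight_samples[0]
gamma = w @ V                              # reduced coefficients, so that w = gamma V^T
full = H @ w                               # full 2-layer network
reduced = xi @ gamma                       # reduced 2-layer network
max_gap = np.max(np.abs(full - reduced))
print(f"M = {M}, max |full - reduced| = {max_gap:.2e}")
```

Only the $M$ entries of `gamma` change from sample to sample, while `xi` is computed once and reused.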

## 3 The Kolmogorov N-width of sharply convective class

In this section, we recall the notion of the Kolmogorov $N$-width and define the sharply convective class of functions. Then, we will prove a key lemma that establishes a lower bound of the Kolmogorov $N$-width of this class, showing that it decays with an algebraic rate with respect to $N$. This will be used to show the limitations of classical reduced models Eq. 2.5 and reduced 2-layer networks Eq. 2.35.

### 3.1 Kolmogorov N-width

Let us begin by defining the Kolmogorov -width. Within this section, we will let , since the results apply to dimensions , and recall that we let .

###### Definition 4 ([44])

The Kolmogorov $N$-width of the set of functions $\mathcal{M} \subset \mathcal{V}$ is

$$d(N; \mathcal{M}) = \inf_{\mathcal{V}_N} \sup_{u \in \mathcal{M}} \inf_{v \in \mathcal{V}_N} \| u - v \|_{\mathcal{V}}, \tag{3.1}$$

where the first infimum is taken over all $N$-dimensional subspaces $\mathcal{V}_N$ of $\mathcal{V}$.

When the Kolmogorov $N$-width of a solution manifold Eq. 2.1 is known, the smallest possible dimension of its reduced manifold Eq. 2.7 that satisfies the estimate Eq. 2.6 for given $\varepsilon$ is also known. This implies that classical reduced models of the form Eq. 2.35 are not efficient for problems whose solution manifolds do not have a fast decaying Kolmogorov $N$-width [21, 41]. For example, an exponential decay implies that an efficient classical reduced model exists, whereas an algebraic decay implies the contrary.
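
The contrast between fast and slow decay can be observed numerically with singular values of snapshot matrices. This is only a proxy and an assumption of this sketch: singular values measure the best subspace error in a mean-square sense over the sampled snapshots, not the worst-case error in Eq. 3.1. Both families below are illustrative and are not the PDE solutions studied in this work.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 400)
mus = np.linspace(0.1, 0.9, 200)

# Transported step: the jump location moves with the parameter mu.
step = (x[:, None] < mus[None, :]).astype(float)

# Smooth family: the parameter only changes the decay rate of a smooth profile.
smooth = np.exp(-np.outer(x, mus))

sv_step = np.linalg.svd(step, compute_uv=False)
sv_smooth = np.linalg.svd(smooth, compute_uv=False)

# Ratio of the 10th to the 1st singular value for each family.
ratio_step = sv_step[9] / sv_step[0]
ratio_smooth = sv_smooth[9] / sv_smooth[0]
print(f"transported step: sigma_10 / sigma_1 = {ratio_step:.2e}")
print(f"smooth family:    sigma_10 / sigma_1 = {ratio_smooth:.2e}")
```

For the smooth family a handful of basis functions already captures the snapshots to high accuracy, whereas the singular values of the transported step decay slowly, which is the behavior the lower bounds in this section quantify.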

### 3.2 Sharply convective class

Here, we describe a key criterion we use to determine if a profile with a sharp gradient is being convected. Then we show that a set of functions satisfying this criterion has a Kolmogorov $N$-width that decays slowly with respect to $N$.

###### Definition 5

A set is said to generate a -ball () if there is a set of linearly independent functions given by the sum

$$\phi_n = \sum_{k=1}^{\infty} a_{nk} u_{nk}, \qquad u_{nk} \in \mathcal{M}, \quad a_{nk} \in \mathbb{R}. \tag{3.2}$$

For such we will associate a real number given by

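$$A_{N,p} := \begin{cases} \sup_n \left( \sum_{k=1}^{\infty} |a_{nk}| \right), & \text{if } p = 1, \\ \sup_n \left( \sum_{k=1}^{\infty} k^p |a_{nk}|^p \right)^{\frac{1}{p}}, & \text{if } p \in (1, \infty). \end{cases} \tag{3.3}$$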

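We use the notation $f \lesssim g$ for real functions $f$ and $g$ to state that $f \le C g$ for some constant $C$ that does not depend on the arguments of $f$ and $g$. We also write $f \sim g$ if $f \lesssim g$ and $g \lesssim f$. We say the generated ball is orthogonal if the functions $\phi_n$ are pairwise orthogonal with respect to the inner product of $\mathcal{V}$.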

###### Definition 6 (Sharply convective class)

Let .

2. is said to be -convective for if it generates a -ball , with for all .

3. If each ball generated by generates an orthogonal -ball with for certain and for some , is said to be -sharply convective. If is -sharply convective for all , then it is called -sharply convective.

Examples of -sharply convective class of functions are shown in Fig. 1.

###### Lemma 1 (Kolmogorov N-width of convective classes)

Let .

2. If is -convective with the associated -ball then .

3. If is -sharply convective, then .

###### Proof

Suppose satisfies Eq. 3.2. Then we have for any ,

 (3.4)

for . If with as in Eq. 3.3, by Hölder’s inequality

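$$\left( \sum_{k=1}^{\infty} k^p |a_{nk}|^p \right)^{\frac{1}{p}} \left( \sum_{k=1}^{\infty} \left| \frac{\| u_{nk} - w_{nk} \|_{\mathcal{V}}}{k} \right|^q \right)^{\frac{1}{q}} \ge \sum_{k=1}^{\infty} |a_{nk}|\, \| u_{nk} - w_{nk} \|_{\mathcal{V}}, \tag{3.5}$$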

where . Then using the fact that ,

$$\sup_k \| u_{nk} - w_{nk} \|_{\mathcal{V}} \ge \frac{C_p}{A_{N,p}} \| \phi_n - v_n \|_{\mathcal{V}}. \tag{3.6}$$

This inequality is derived similarly for the case . Noting that was arbitrary, for any arbitrary subspace of dimensions of it follows,

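$$\sup_k \inf_{w_{nk} \in \mathcal{V}_N} \| u_{nk} - w_{nk} \|_{\mathcal{V}} \ge \frac{C_p}{A_{N,p}} \| \phi_n - v_n \|_{\mathcal{V}} \ge \frac{C_p}{A_{N,p}} \inf_{v \in \mathcal{V}_N} \| \phi_n - v \|_{\mathcal{V}},$$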

and thus

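$$\sup_{u \in \mathcal{M}} \inf_{w \in \mathcal{V}_N} \| u - w \|_{\mathcal{V}} \ge \frac{C_p}{A_{N,p}} \inf_{v \in \mathcal{V}_N} \| \phi_n - v \|_{\mathcal{V}},$$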

because . Since the above holds for any we take the supremum on the right-hand side,

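$$\sup_{u \in \mathcal{M}} \inf_{w \in \mathcal{V}_N} \| u - w \|_{\mathcal{V}} \ge \frac{C_p}{A_{N,p}} \sup_{\phi \in B_{2N}} \inf_{v \in \mathcal{V}_N} \| \phi - v \|_{\mathcal{V}}. \tag{3.7}$$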

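Taking the infimum on both sides over arbitrary $N$-dimensional subspaces of $\mathcal{V}$,

$$\inf_{\mathcal{V}_N} \sup_{u \in \mathcal{M}} \inf_{w \in \mathcal{V}_N} \| u - w \|_{\mathcal{V}} \ge \frac{C_p}{A_{N,p}} \inf_{\mathcal{W}_N} \sup_{\phi \in B_{2N}} \inf_{v \in \mathcal{W}_N} \| \phi - v \|_{\mathcal{V}}. \tag{3.8}$$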

Since for the given , this proves , the first part of the lemma.

Suppose each itself generates a -ball, with . Then for all . If is orthogonal, we can normalize with and set with . Recalling that for some constant by Eq. 3.3,

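$$d(N; \mathcal{M}) \ge \frac{C_p}{A_{N,p}} \inf_{\mathcal{V}_N} \sup_{n \in \{1, \dots, 2N\}} \frac{1}{D_n} \inf_{v \in \mathcal{V}_N} \| \hat{\psi}_n - v \|_{\mathcal{V}} \gtrsim \left( \frac{C_p}{\sqrt{2}\, C'} \right) N^{-\alpha},$$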

proving .

The rate in Lemma 1 is independent of $p$, which means that $p$ only affects the constants.

### 3.3 Example: Constant-speed transport

Numerical experiments suggest that