# Invertible Neural Networks for Graph Prediction

In this work, we address conditional generation using deep invertible neural networks. This is a type of problem where one aims to infer the most probable inputs X given outcomes Y. We call our method invertible graph neural network (iGNN) due to the primary focus on generating node features on graph data. A notable feature of our proposed method is that during network training, we revise the typically-used loss objective in normalizing flow and consider Wasserstein-2 regularization to facilitate the training process. Algorithm-wise, we adopt an end-to-end training approach since our objective is to address prediction and generation in the forward and backward processes at once through a single model. Theoretically, we characterize the conditions for identifiability of a true mapping, the existence and invertibility of the mapping, and the expressiveness of iGNN in learning the mapping. Experimentally, we verify the performance of iGNN on both simulated and real-world datasets. We demonstrate through extensive numerical experiments that iGNN shows clear improvement over competing conditional generation benchmarks on high-dimensional and/or non-convex data.



## 1 Introduction

We are interested in the problem of finding an invertible neural network F_Θ such that, given an outcome Y associated with an input X, we can find the corresponding X. In particular, this can be viewed as a type of inverse problem. Inverse problems can be largely categorized as one-to-one (i.e., admitting a one-to-one mapping between X and Y) or one-to-many (i.e., Y is categorical and many X ∈ ℝ^d, where d denotes the number of features, share the same Y). This work focuses on the latter (i.e., we do not necessarily need a one-to-one invertible mapping), which naturally arises in many applications. In molecular design (SnchezLengeling2018InverseMD), scientists want to infer features of the molecule that lead to certain outcomes (e.g., type of disease). In anomaly detection (Zong2018DeepAG) such as power outage detection (AlShaalan2019ReliabilityEO), studying distributions of normal vs. anomalous features (e.g., weather or past power output by sensors) is important for future prevention. In market analysis (Lu2015RecommenderSA), estimating patterns of customers based on their behaviors is crucial for accurate recommendation. Figure 1 illustrates how our iGNN applies to generic Euclidean and non-Euclidean (e.g., graph) data, which subsume the application examples above.

This one-to-many task is synonymous with conditional generation, where the goal is to generate X based on specific outcomes Y. Before introducing common approaches in conditional generation, we acknowledge several (unconditional) generation approaches, which address the simpler question of generating samples as close to the features X as possible. At present, generative adversarial networks (GAN) (GAN; WassersteinGAN) and variational auto-encoders (VAE) (VAE; VAE_review) are two of the most widely studied and popular frameworks. Both approaches have achieved remarkable successes in many fields (Makhzani2015AdversarialA; Zhu2017UnpairedIT; Ledig2017PhotoRealisticSI). However, they also have clear limitations, including the inability to provide an exact evaluation of data likelihood for new samples and difficulties in training, such as mode collapse (Salimans2016ImprovedTF) and posterior collapse (Lucas2019UnderstandingPC).

For our purpose of conditional generation, there have also been several streams of works, with an abstract description presented on the left of Figure 2. In the following, we first summarize the limitations of these methods and the open questions in Section 1.1. Then, we introduce the general structure of the invertible neural network used in our paper. Section 1.2 summarizes our contributions and presents the roadmap for this work.

### 1.1 Background

There have been many studies on conditional versions of GAN (cGAN) (cGAN_old; cGAN), where the outcome Y and random noise are both taken as inputs to the generator. However, these methods suffer from similar issues as their unconditional counterparts; we compare iGNN with cGAN in our experiments. More recently, the work (CINN_inverse) directly builds conditional invertible neural networks (CINN) for analyzing inverse problems. A single invertible NN is trained by minimizing maximum mean discrepancy (MMD) losses in the latent space and the original data space; we call their method CINN-MMD for comparison. However, this method also takes in the outcome Y as an additional input to its invertible NN, thus increasing the dimensionality of the original problem. In addition, the scalability of MMD-based methods in high-dimensional spaces and the choice of kernel bandwidths can also limit its practical performance, as will be shown in the experiments.

In this work, we tackle the question of conditional generation for inverse problems by extending techniques in normalizing flow (FFJORD; iResnet; Ziegler2019LatentNF; Nflow_review; CINN). Flow-based methods estimate arbitrarily complex densities via the direct maximum likelihood estimation (MLE) approach: they transport the original random features into more tractable ones (e.g., standard multivariate Gaussian) through invertible neural networks. However, most methods only focus on unconditional generation. More recently, (CINN) tackles the conditional generative task by combining conditional invertible neural networks (CINN_inverse) with the normalizing flow, a method which we call CINN-Nflow for comparison. Nevertheless, CINN-Nflow suffers from a similar issue as CINN-MMD: by treating Y as an input, the data dimension is increased so that training is more difficult.

In addition, there have been many architecture designs of the invertible neural network (see Figure 2, right). Some are based on coupling layers (RNVP) and autoregressive flows (Wehenkel2019UnconstrainedMN), while others build on residual networks (iResnet; FFJORD). At present, we choose F_Θ to be an invertible residual network (iResnet), due to its expressiveness and efficiently-generated inverses as proposed in (iResnet). Denote F_{b,Θ_b} as the b-th residual block with its own trainable parameters Θ_b, which has the form

 F_{b,Θ_b}(x) := x + g_{b,Θ_b}(x),   (1)
 g_{b,Θ_b} := W_b^{l} ∘ φ_b^{l} ∘ ⋯ ∘ φ_b^{1} ∘ W_b^{1},   (2)

where each pair (W_b^{i}, φ_b^{i}) in (2) denotes the weight matrix and the activation function at the i-th layer of the b-th residual block. Then, F_Θ is the concatenation of B residual blocks. To ensure invertibility of each block, (iResnet) suggested using spectral normalization so that the Lipschitz constant of each g_{b,Θ_b} is less than one. However, spectral normalization is computationally expensive and may not be sufficient beyond fully-connected layers, so we propose an alternative regularization technique in Section 2.3.
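To make Eqs. (1)–(2) concrete, the following minimal numpy sketch (our illustration, not the authors' implementation; all names are ours) builds one residual block with a contractive branch g and inverts it by fixed-point iteration, which converges whenever the Lipschitz constant of g is below one, the same property that spectral normalization is meant to enforce:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_contractive_g(d, hidden, lip=0.9):
    """Two-layer residual branch g(x) = W2 @ elu(W1 @ x), rescaled so that
    Lip(g) <= lip < 1 (ELU is 1-Lipschitz, so Lip(g) <= ||W2||_2 * ||W1||_2)."""
    W1 = rng.standard_normal((hidden, d))
    W2 = rng.standard_normal((d, hidden))
    scale = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
    W1 *= np.sqrt(lip / scale)
    W2 *= np.sqrt(lip / scale)
    elu = lambda z: np.where(z > 0, z, np.expm1(z))
    return lambda x: W2 @ elu(W1 @ x)

def block_forward(x, g):
    """One residual block F(x) = x + g(x), as in Eq. (1)."""
    return x + g(x)

def block_inverse(y, g, n_iter=200):
    """Invert y = x + g(x) by the fixed-point iteration x <- y - g(x);
    Banach's fixed-point theorem guarantees convergence since Lip(g) < 1."""
    x = y.copy()
    for _ in range(n_iter):
        x = y - g(x)
    return x
```

For instance, with `g = make_contractive_g(5, 64)`, applying `block_inverse` to `block_forward(x, g)` recovers `x` to high precision.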

Therefore, although conditional generation has been a long-standing and actively researched area, there are still several open questions.

1. How to design (graph) generative models that do not increase difficulties in training? Treating Y as an additional input increases the data dimension. For non-invertible networks such as cGAN and its variants, doing so requires even more careful hyperparameter tuning and tends to cause many practical issues, as described earlier. For invertible networks such as CINN-MMD or CINN-Nflow, doing so causes more scalability issues in higher dimensions due to their training objectives. In addition, conditional node feature generation on graphs without these training difficulties has not been well-addressed.

2. How to reformulate the training objective in a natural way that also allows for prediction? Most conditional generative models consider the same loss as their unconditional counterparts and mainly differ by taking in Y as a network input. In addition, for prediction, these models either require building separate supervised networks or require Y as an additional output of the generative model (e.g., CINN-MMD). Both of these approaches increase complexities in training.

### 1.2 Contribution

Compared to current approaches (see Figure 2, left), our approach (see Figure 2, right) circumvents the inevitable dimensionality issue: we first generate hidden variables H|Y that have disjoint supports over Y and then train a single invertible network that maintains the original data dimension. Due to disjointness, the prediction of Y can also be performed by simple linear classifiers. The method is called invertible graph neural network (iGNN) due to our primary focus on graph data. In summary, our technical contributions in response to the open questions above include

1. For the training objective, we formulate a conditional generative loss based on the change-of-variable formula (see Eq. (3)), which naturally extends to graph observations (see Eq. (7)). Predicting the outcome Y is simple based on our problem design and is built as a part of our end-to-end training. In particular, the invertible neural network does not treat the conditional variable Y as an additional network input, so as to maintain the original data dimension.

2. In network design, we impose the Wasserstein-2 regularization on each residual block of the invertible residual network in iGNN. In contrast to the much more expensive spectral normalization (iResnet), this regularization is easy to implement and facilitates smoother density transport.

3. Theoretically, we establish the identifiability of a true invertible mapping and verify its existence and invertibility based on the well-posedness of the initial-value problem for ordinary differential equations. We also show that graph neural networks with enough (resp. insufficient) expressiveness can (resp. cannot) theoretically learn the true invertible mapping.

4. In experiments, we compare the performance of our model with competing models on simulated and real data, with clear improvement over competing models on high-dimensional and/or non-convex graph data. We also verify the importance of model expressiveness through specific simulation designs.

The rest of the paper is organized as follows. Section 2 contains our method. We describe how the training objective is formulated to include prediction and conditional generation, how the framework is extended to graph data, and how each residual block is regularized through the Wasserstein-2 penalty. Section 3 contains theoretical analyses with three parts. We first establish the identifiability of the true invertible mapping, assuming its existence. Then, we analyze when such invertible mappings exist upon connecting with ODE. Lastly, we derive a closed-form expression for the invertible mapping in a special case and consider the expressiveness of our iGNN in two examples. Section 4 contains simulated and real-data experiments, which show the improved performance of our method over other conditional generative models, especially when data are high-dimensional and/or non-convex. In particular, we also verify the model expressiveness statements in Section 3 with numerical results. Section 5 concludes our work and discusses future directions. Section 6 contains complete derivation and proofs. Appendix A contains details regarding experiment setups and additional numerical results.

## 2 Method

We first describe how the training objective is formulated in Section 2.1 for generic Euclidean data. We then discuss the extension to non-Euclidean data, such as graph observations, in Section 2.2. Lastly, in Section 2.3 we propose the regularization technique for the residual blocks described in (2).

### 2.1 End-to-end training for conditional generation and prediction

We now present the concrete mathematical formulation of our training objective based on the abstract representation on the right of Figure 2. Compared to other conditional generative models, our iGNN performs a simple but effective modification of the unconditional generative loss based on the change-of-variable formula. As we will see, doing so has several benefits: (1) it maintains the original data dimension of X instead of increasing it to include Y, (2) it facilitates model training, as the network only needs to transport each X|Y to its corresponding H|Y instead of all X to a standard Gaussian variable, and (3) it enables easy classification, as the distributions H|Y are designed to be disjoint.

We first consider the case where the random feature X ∈ ℝ^d. Given an invertible network F_Θ and additional categorical outcome variables Y ∈ {1,…,K} for K classes, we consider the following conditional generative loss:

 log p_{X|Y}(X) = log p_{H|Y}(F_Θ(X)) + log |det J_{F_Θ}(X)|,   (3)
 L_g := −log p_{X|Y}(X),

where L_g denotes the generative loss to be minimized and each H|Y is disjoint over the value of Y. Note that conditional generation based on (3) is simple: given different outcomes Y = k, we can sample H from the different distributions H|Y = k and then perform the inverse mapping F_Θ^{−1}(H). In implementation, we let the distributions H|Y = k be isotropic multivariate Gaussian distributions, with mean parameters W_g e_k + b_g using trainable parameters {W_g, b_g}, and variances small enough for disjointness. The term e_k denotes the k-th standard basis vector in ℝ^K, and the subscript g denotes generation.

While the loss in (3) may seem simple, we highlight its benefit over the current CINN-Nflow approach (CINN), whose loss is also based on the change-of-variable formula. CINN-Nflow treats Y as a network input and writes

 log p_{X|Y}(X) := log p_Z(G_{Θ′}(X,Y)|_Z) + log |det J_{G_{Θ′}}(X,Y)|,   (4)
 G_{Θ′}(X,Y) := [G_{Θ′}(X,Y)|_Y, G_{Θ′}(X,Y)|_Z].   (Dimension increase)

In (4), Z denotes a standard multivariate Gaussian random vector and G_{Θ′} is the conditional invertible neural network. In contrast to our objective (3), the output in (4) contains predictions for both Y and Z, so the dimension of the original feature space is increased. In addition, because Y and Z are inherently independent, the conditional generative model must transport the uni-modal random vector Z into multiple X|Y, each of which may be multi-modal. Both procedures increase the difficulty in training.

In addition, our conditional formulation in (3) also allows one to perform classification (e.g., estimate Y given X) using simple classifiers. For instance, suppose Y is binary but the original decision boundary between the classes of X is highly nonlinear. Instead of relying on complex classifiers (e.g., another supervised neural network), one may build a simple one (e.g., a linear classifier) as long as (1) the distributions H|Y do not overlap (e.g., disjoint multivariate Gaussian distributions) and (2) the invertible mapping F_Θ can accurately transport the distribution of X|Y to match that of H|Y. Mathematically, we thus minimize the classification loss using a simple linear classifier on F_Θ(X):

 L_c := L(Y, W_c F_Θ(X) + b_c) = −e_Y^T log softmax(W_c F_Θ(X) + b_c).   (5)

Therefore, our end-to-end training objective can be written as

 min_{Θ, {W_c, b_c}, {W_g, b_g}}  E_{X,Y}[L_g + W₂(F_Θ) + μ L_c],   (6)

which is estimated using training data. The term W₂(F_Θ) will be introduced in Section 2.3, Eq. (8), to ensure model invertibility and smoothness in density transport. The term μ is a parameter that controls the relative rate at which the generative model trains with respect to the linear classifier. In practice, μ can be chosen small because of the problem formulation: classification is easy as long as the distributions estimated using F_Θ accurately match those of H|Y, which are disjoint. Figure 3 illustrates the result on simple simulated two-moon data. For prediction, both parts of the two-moon data are mapped separately through the invertible map to disjoint parts of a Gaussian mixture, which can be easily classified via the linear classifier in (5). For conditional generation, one simply samples H from the Gaussian mixture, and the inverse map F_Θ^{−1} generates the desired X|Y.
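To illustrate how the pieces of (3) and (6) fit together, here is a minimal numpy sketch on a toy affine invertible map. Everything below (the map A, the class means built from W_g and b_g, and the variance) is a hypothetical stand-in for trained quantities, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 2, 3   # feature dimension and number of classes (toy values)

A = np.array([[1.5, 0.2], [0.0, 0.8]])          # affine invertible map F(x) = A x
Wg = 5.0 * np.array([[1.0, 0.0, -1.0],
                     [0.0, 1.0, -1.0]])         # class means mu_k = Wg @ e_k + bg
bg = np.zeros(d)
sigma2 = 0.05                                   # small variance => disjoint H|Y

def gaussian_logpdf(h, mu, s2):
    return -0.5 * (np.sum((h - mu) ** 2) / s2 + d * np.log(2 * np.pi * s2))

def generative_loss(x, y):
    """L_g = -log p_{X|Y}(x) via the change of variables in Eq. (3):
    log p_{X|Y}(x) = log p_{H|Y=y}(F(x)) + log|det J_F(x)|."""
    h = A @ x                                   # F(x)
    logdet = np.log(abs(np.linalg.det(A)))      # constant Jacobian for affine F
    return -(gaussian_logpdf(h, Wg[:, y] + bg, sigma2) + logdet)

def sample_conditional(y, n):
    """Conditional generation: draw H ~ N(mu_y, sigma2 I), then apply F^{-1}."""
    h = Wg[:, y] + bg + np.sqrt(sigma2) * rng.standard_normal((n, d))
    return np.linalg.solve(A, h.T).T
```

Because the class means are far apart relative to the variance, samples of each class land in disjoint regions of the latent space, so even a nearest-mean (linear) rule classifies them perfectly, which is the point of the design.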

### 2.2 Generative loss on graph data

A key emphasis of our paper is on non-Euclidean observations, such as graph data, where we currently focus on node classification and node feature generation. Notation-wise, given a graph G = (V, E) with |V| = V nodes (with a slight abuse of notation), let A be its adjacency matrix and D be the degree matrix; for each node v, let X(v) be its node feature and Y(v) its node label. Denote X as the matrix of node features and Y as the node labels.

For training, we still minimize the end-to-end objective in (6). However, there is a key difference regarding the evaluation of the log-likelihood of the input node features X:

 log p_{X|Y}(X) = log p_{H|Y}(F_Θ(X)) + log det J_{F_Θ}(X)   (Change of variable)
 (i)= Σ_{v=1}^V [log p_{H(v)|Y(v)}(F_Θ(X)(v))] + log det J_{F_Θ}(X)   (Independence)
 (ii)= Σ_{v=1}^V [log p_{H(1)|Y(1)}(F_Θ(X)(v))] + log det J_{F_Θ}(X),   (Homogeneity)   (7)

where in (i), we assume independence of node features upon transporting the raw features, and in (ii), we further assume that the conditional distribution at each node is homogeneous, so the distribution depends only on the label of node v rather than on the node index itself. In practice, requiring (ii) is important, as the number of nodes is often significantly larger than the number of node classes. In that case, it is nearly infeasible to pre-specify node-dependent disjoint conditional distributions H(v)|Y(v). Computationally, we perform vectorize-then-unvectorize operations on inputs and outputs to make sure the network preserves the shape of X.

### 2.3 Invertible flow Resnet with Wasserstein-2 regularization

Recall that we choose F_Θ in our iGNN to be an invertible residual network (see right of Figure 2), which is a concatenation of B residual blocks defined in Eq. (2). To ensure network invertibility, instead of performing spectral normalization of the weight matrices in each residual block, we consider the following regularization on the movement of each residual block:

 W₂(F_Θ) := (1/2) Σ_{b=1}^B ‖g_{b,Θ_b}(X)‖₂².   (8)

Note that (8) is the Wasserstein-2 regularization (bolley2012convergence). Figure 12 in the appendix shows that our model maintains invertibility on both in-distribution and out-of-distribution data using the Wasserstein-2 regularization in (8).
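The penalty in (8) accumulates naturally during a forward pass. A minimal numpy sketch (our illustration; the block functions g_b are arbitrary stand-ins):

```python
import numpy as np

def forward_with_w2(x, blocks):
    """Pass x through residual blocks F_b(x) = x + g_b(x), accumulating the
    Wasserstein-2 movement penalty of Eq. (8): 0.5 * sum_b ||g_b(x_b)||_2^2."""
    w2 = 0.0
    for g in blocks:
        move = g(x)               # movement g_b(x_b) of this block
        w2 += 0.5 * np.sum(move ** 2)
        x = x + move              # residual update
    return x, w2
```

For example, a single toy block `g = lambda v: -0.1 * v` applied to `x = [3, 4]` moves it to `[2.7, 3.6]` and contributes `0.5 * (0.3² + 0.4²) = 0.125` to the penalty.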

There are several benefits of the Wasserstein-2 regularization compared to the spectral normalization technique (iResnet). Computationally, the Wasserstein-2 penalty can be easily implemented as the network performs forward passes on data. In contrast, spectral normalization relies on the power iteration at each training epoch to normalize the spectral norm of the weight matrices in all residual blocks. Doing so can be very computationally expensive when both the total number of residual blocks and the number of hidden neurons are large, whereby the procedure must spectrally normalize many high-dimensional weight matrices. In addition, the Wasserstein-2 penalty is agnostic of the residual block design: each block can consist of only fully-connected (FC) layers or generic graph filters (Chebnet; cheng2021graph). In contrast, spectral normalization is applicable when each residual block concatenates FC layers with contractive nonlinearities (e.g., ReLU, ELU, Tanh). When each residual block contains more than FC layers (e.g., general graph filters), it can be insufficient to require the spectral norms of all weight matrix parameters to be less than one.

## 3 Theory

We provide several theoretical results in this section. First, we show in Section 3.1 that under idealized assumptions, the optimal solution of the conditional generative loss (3) identifies the true invertible mapping that transports each X|Y to its H|Y. Then, we show in Section 3.2 that the true invertible mapping exists under the well-posedness of an ordinary differential equation, which can be verified under certain Lipschitz conditions. Lastly, we discuss in Section 3.3 the expressiveness of our iGNN for Gaussian processes in two examples that depend on graph topology. Throughout this section, we assume that X ∈ ℝ^d, where graph features can be equivalently interpreted as random vectors in ℝ^d, with correlation among features according to the graph structure. All derivations and proofs appear in Section 6.

### 3.1 Identifiability

Upon minimizing the generative loss L_g in (3), we first have the following proposition regarding the identifiability of a true invertible mapping F.

###### Proposition 3.1 (Identifiability of the true mapping).

Let Y ∈ {1,…,K} be discrete labels. Assume that

• For any y ≠ y′, the random vectors H|Y = y and H|Y = y′ follow disjoint tractable distributions (e.g., multivariate Gaussian).

• There exists a bijective mapping F such that for every y, F(X)|Y = y follows the same distribution as H|Y = y.

• There exist parameters Θ* such that F_{Θ*} = F, where {F_Θ} is the pre-specified class of invertible models that depends on parameters Θ.

Based on (3), denote

 Θ̂_N := argmax_Θ Σ_{n=1}^N log p_{H|Y}(F_Θ(X_n)) + log |det J_{F_Θ}(X_n)|,   (9)
 N_K := min_{i=1,…,K} Σ_{n=1}^N 1(Y_n = i),

where (9) is the maximum-likelihood estimator based on N independent and identically distributed (upon conditioning on values of Y) samples (X_n, Y_n). Then, under regularity conditions for maximum likelihood estimation, we have that F_{Θ̂_N} → F as N_K → ∞, with probability 1.

### 3.2 Existence and invertibility

Note that Proposition 3.1 relies on the idealized assumption that there exists a true invertible mapping that transports X|Y to H|Y as desired. In this subsection, we utilize the Fokker-Planck equation of an Ornstein-Uhlenbeck (OU) process to show that such an invertible mapping exists and can be approximated by a residual network with finite model complexity.

We first reduce the problem to the construction of an invertible flow in the unconditional case. The conditional case can be viewed as having a joint flow which coincides with each individual flow upon restricting to the support of X|Y. More precisely, this joint flow maps each X|Y to its corresponding H|Y, all of which are disjoint over Y; disjointness can be achieved by re-scaling and shifting of a standard Gaussian (see the explanation below Eq. (3)). As a result, once the existence of an invertible flow from some X|Y to a normal random vector is established, then, composed with the shifting and re-scaling operations, which are invertible transports, one can construct an invertible joint flow for the conditional generation of X|Y.

Below, we consider the unconditional case where the desired flow transports from the data distribution of X in ℝ^d to the standard normal N(0, I_d). Recall that the continuous-time flow is represented by an initial value problem (IVP) of the ODE

 dx_t/dt = f(x_t, t),   (10)

with initial value x_0 = X. The solution mapping x_0 ↦ x_t is invertible for every t > 0 as long as (10) is well-posed. There may be more than one f which induces such a flow towards N(0, I_d), and to show existence, we construct f to correspond to the following Fokker-Planck equation of an OU process (∇ denotes ∇_x)

 ∂_t ρ = ∇·(ρ∇V + ∇ρ),   V(x) = |x|²/2,   ρ(x, 0) = p_X(x),   (11)

where ρ(x, t) represents the probability density of the OU process at time t starting from p_X. Comparing (11) to the Liouville equation of (10), which is ∂_t ρ + ∇·(ρ f) = 0, where ρ(x, t) now represents the distribution of x_t, we see that the density transportation under the flow can be made the same as in (11) if we set the force f as

 −f(x, t) = ∇V(x) + ∇log ρ(x, t) = x + ∇log ρ(x, t).   (12)

We first show the differentiability of f in the following lemma.

###### Lemma 3.2.

Suppose ρ is the solution to (11) from ρ(x, 0) = p_X(x); then f as in (12) is smooth on ℝ^d × (0, ∞).

The following proposition directly gives the well-posedness of the IVP, and subsequently the invertibility of the continuous-time flow, assuming a bounded x-Lipschitz constant of f. The x-derivative of the vector field f is denoted as ∇⊗f, which is a d-by-d matrix, and we use ‖·‖ to stand for the matrix operator norm.

###### Proposition 3.3.

Suppose f as in (12), with ρ being the solution to (11), satisfies

 ‖∇⊗f(x, t)‖ ≤ L,   ∀t > 0, x ∈ ℝ^d,

then the IVP (10) is well-posed.

The boundedness of the x-Lipschitz constant of f can be shown when the initial distribution p_X is sufficiently regular and decays fast on ℝ^d, e.g., smooth and compactly supported. The precise analysis is deferred to future work. Under generic conditions, the solution of the Fokker-Planck equation converges to the equilibrium distribution exponentially fast under, e.g., the χ² or Wasserstein-2 distance (bolley2012convergence). This means that at some time T = O(log(1/ε)), the transported distribution of x_T is ε-close to the normal density. One can construct a discrete-time flow on [0, T] to approximate the continuous-time one, and when the force f is sufficiently regular, at each time step it can be approximated by a residual block with finitely many trainable parameters. Furthermore, since the x-Lipschitz constant of f is bounded by L, invertibility of the ResNet can be fulfilled as long as the discrete time-step h is less than 1/L, and the number of residual blocks equals T/h, which can be made O(L log(1/ε)).

### 3.3 Model expressiveness

Above in Section 3.2, we verified the existence of an invertible mapping and argued for the approximation of the force f in (12) using a residual network with finite model complexity. In this subsection, we further establish the exact form of f for Gaussian processes (GP). In particular, we discuss two graph cases where the covariance matrix Σ of the initial distribution of the GP directly relates to properties of the graph, with an interesting implication regarding the model expressiveness of GNN.

We first provide the following analytic form of f in (12), which applies to general vector data in ℝ^d:

###### Lemma 3.4.

If p_X = N(0, Σ) with Σ positive definite, then

 f(x, t) = −(I_d − Σ_t^{−1}) x,   (13)

where Σ_t is the covariance matrix of x_t, which converges to I_d exponentially fast.

Because the force f in (13) is linear in x, it can always be approximated by residual blocks of fully-connected layers, which are universal approximators (univ_approx; univ_approx1).
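Lemma 3.4 can be checked numerically. The sketch below assumes the standard OU closed form Σ_t = (1 − e^{−2t}) I_d + e^{−2t} Σ (an assumption on our part; the lemma only states exponential convergence to I_d) and verifies that the force of Eq. (13) vanishes as Σ_t → I_d:

```python
import numpy as np

def ou_covariance(Sigma, t):
    """Covariance of the OU process at time t started from N(0, Sigma),
    using the assumed closed form Sigma_t = (1 - e^{-2t}) I + e^{-2t} Sigma."""
    d = Sigma.shape[0]
    return (1.0 - np.exp(-2 * t)) * np.eye(d) + np.exp(-2 * t) * Sigma

def force(x, Sigma, t):
    """The linear force of Eq. (13): f(x, t) = -(I - Sigma_t^{-1}) x."""
    d = Sigma.shape[0]
    return -(np.eye(d) - np.linalg.inv(ou_covariance(Sigma, t))) @ x
```

At t = 0 the covariance is Σ itself, and for large t both the deviation of Σ_t from the identity and the force decay like e^{−2t}.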

On the other hand, one can interpret the random vector X ∈ ℝ^d as the collection of one-dimensional node features on a graph with d nodes, with Σ encoding feature dependency. In such cases, it is not always true that graph filters can approximate Σ and, as a consequence, the inverse covariance Σ_t^{−1} in (13) that depends on Σ; one example where spectral-based GNN fails to do so will be given in Example 1. In particular, we have the following propositions on the properties of Σ_t^{−1} under different assumptions on Σ, which help us identify networks that are able or unable to approximate the force f. More precisely, we relate Σ to spectral-based GNN in one case and spatial-based GNN in another.

For notational consistency with the previous discussion on graph data, we change the dimensionality from d to V in statements on graph data from now on. For simplicity, all analyses onward implicitly assume that node features have dimension one.

Case I. Spectral-based representation of Σ: We first show that when Σ can be written as a polynomial of the graph Laplacian L, then Σ_t^{−1} has a closed form in terms of the polynomial evaluated on the eigenvalues of the graph Laplacian, whereby Σ_t^{−1} can be approximated up to arbitrary precision by another polynomial. Hence, spectral-based GNN can in theory correctly approximate the true force f in (13).

###### Proposition 3.5 (Spectral-based Σ).

Denote L as the graph Laplacian. Suppose there exists a polynomial P such that Σ = P(L). Then

 Σ_t^{−1} = Σ_{i=1}^V ((1 − exp(−2t)) + exp(−2t) P(λ_i(L)))^{−1} u_i u_i^T,

where λ_i(L) denotes the i-th largest eigenvalue of L and u_i its corresponding eigenvector, with L = UΛU^T under the eigen-decomposition.

The implication of Proposition 3.5 is that under certain regularity conditions on P, there exists another polynomial that can approximate Σ_t^{−1} up to arbitrary precision. For instance, when P denotes a Chebyshev polynomial of a given order, this approximation can be achieved by (possibly) raising the degree of the approximating polynomial.
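The closed form in Proposition 3.5 can be verified numerically against direct inversion. In the sketch below, the graph, the polynomial P, and the OU form of Σ_t are all our illustrative assumptions:

```python
import numpy as np

# Path graph on three nodes; L = D - A is its (unnormalized) graph Laplacian.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
L = np.diag(A.sum(axis=1)) - A

# An arbitrary polynomial P with P(lambda) > 0 on the spectrum of L,
# so that Sigma = P(L) is a valid covariance (our illustrative choice).
P = lambda lam: 0.5 + 0.25 * lam + 0.1 * lam ** 2
Sigma = 0.5 * np.eye(3) + 0.25 * L + 0.1 * (L @ L)

def sigma_t_inv_spectral(t):
    """Proposition 3.5: sum_i ((1-e^{-2t}) + e^{-2t} P(lambda_i))^{-1} u_i u_i^T."""
    lam, U = np.linalg.eigh(L)
    coef = 1.0 / ((1.0 - np.exp(-2 * t)) + np.exp(-2 * t) * P(lam))
    return (U * coef) @ U.T          # equals U @ diag(coef) @ U.T

def sigma_t_inv_direct(t):
    """Direct inversion of Sigma_t = (1 - e^{-2t}) I + e^{-2t} Sigma (assumed OU form)."""
    St = (1.0 - np.exp(-2 * t)) * np.eye(3) + np.exp(-2 * t) * Sigma
    return np.linalg.inv(St)
```

The two routes agree because Σ_t shares the eigenvectors of L, so inverting Σ_t reduces to inverting its eigenvalues.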

We also consider a corollary below where the covariance matrix Σ is not exactly represented by a polynomial of L, so that estimation errors exist.

###### Corollary 3.6.

Under the same notation as in Proposition 3.5, suppose

(1) Σ and L share the same eigen-bases.

(2) For some polynomial P, λ_i(P(L)) = P(λ_i(L)), where λ_i(·) denotes the i-th largest eigenvalue of a matrix.

(3) There exists δ > 0 such that max_i |P(λ_i(L)) − λ_i(Σ)| ≤ δ.

Define Σ̃_t := (1 − e^{−2t}) I + e^{−2t} P(L). Then

 ρ(Σ_t^{−1} − Σ̃_t^{−1}) = O(e^{−2t} δ),

where ρ(·) denotes the spectral norm of the matrix difference, and the constant in the big-O depends on the minimum eigenvalues of Σ_t and Σ̃_t. In particular, note that Σ̃_t^{−1} can be approximated up to arbitrary precision based on the implication of Proposition 3.5.

Case II. Spatial-based representation of Σ: We next show that under certain spatial properties of Σ, Σ_t^{−1} satisfies similar spatial properties and can thus be approximated by spatial-based GNN filters such as the low-rank locally learnable network (L3net) (cheng2021graph). However, it may not be approximated by spectral-based GNN such as Chebnet (Chebnet), as shown in Example 1. We first define the desired spatial property in terms of locality.

###### Definition 3.7 (v-locality).

Denote A as the graph adjacency matrix for V nodes. We define that the covariance matrix Σ, which encodes correlation among node features, is v-local for some integer v if

 Σ_{ij} = 0 if j ∉ N_v(i),

where N_v(i) denotes the set of v-th neighbors of node i (i.e., all nodes accessible from node i within v steps along adjacent nodes).

###### Proposition 3.8 (Local Σ and Σ^{−1}).

Suppose Σ and Σ^{−1} are v₁-local and v₂-local, respectively, based on Definition 3.7. Then, depending on the value of t, Σ_t^{−1} in (13) can be expressed by power series of Σ or Σ^{−1}. If the power series are truncated at order k, Σ_t^{−1} can be approximated by a local matrix whose locality is determined by k, v₁, and v₂.

On the other hand, the following simple example shows that spectral-based GNN can never learn Σ, and therefore the inverse covariance Σ_t^{−1} required by the force f in (13). We will revisit this example in Section 4.3.

###### Example 1 (Σ_t^{−1} cannot be learned by spectral-based graph filters).

Consider a simple graph with three nodes and two edges between adjacent nodes. Self-loops at each node are also inserted. Note that the adjacency matrix A satisfies the property πAπ^T = A, where π is the permutation matrix below, which reverses the order of the graph nodes. Thus, any graph filter h(A) based on A satisfies πh(A)π^T = h(A). However, if the covariance matrix Σ and permutation matrix π take the form

 Σ := [1  ρ  0;  ρ  1  ρ₁;  0  ρ₁  1],   π := [0 0 1; 0 1 0; 1 0 0],

then it is easy to see that πΣπ^T ≠ Σ if ρ ≠ ρ₁. Therefore, any graph filter h(A) cannot approximate Σ up to arbitrary accuracy.
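The symmetry argument of Example 1 is easy to check numerically. In the sketch below, the numeric values of ρ and ρ₁ and the particular filter polynomial are our choices for illustration:

```python
import numpy as np

# Three-node path graph with self-loops; node reversal is a graph automorphism.
A = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]])
pi = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])

# Covariance from Example 1; rho != rho1 breaks the reversal symmetry of A.
rho, rho1 = 0.3, 0.6
Sigma = np.array([[1.0, rho, 0.0], [rho, 1.0, rho1], [0.0, rho1, 1.0]])

def spectral_filter(M):
    """An arbitrary polynomial graph filter h(M); any polynomial of A commutes
    with graph automorphisms: pi h(A) pi^T = h(pi A pi^T) = h(A)."""
    return 0.2 * np.eye(3) + 0.5 * M + 0.1 * (M @ M)
```

The filter output inherits the reversal symmetry of A while Σ does not, so no polynomial filter of A can match Σ.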

In summary, Propositions 3.5 and 3.8 state the existence of approximations for Σ_t^{−1} under correct model specification. In experiments, we will explore the importance of architecture design: whether Σ_t^{−1} can or cannot indeed be approximated by our iGNN under different designs of the residual blocks. Results are shown in Figures 10 (for Proposition 3.5) and 11 (for Proposition 3.8).

## 4 Numerical Examples

The experiments are organized as follows. In Section 4.1, we first compare our iGNN with competing methods on simulated examples, where one of the competitors performs similarly to iGNN. In Section 4.2, we then consider more challenging simulated-data and real-data examples. In particular, the competitors all perform worse than iGNN in terms of conditionally generating the high-dimensional and/or non-convex node features. In Section 4.3, we lastly verify the model expressiveness statements in Section 3.3. Code to produce our results can be found at https://github.com/hamrel-cxu/Invertible-Graph-Neural-Network-iGNN.

We consider three competing conditional generative models and several two-sample testing metrics for quantitative assessment. Regarding competing methods, we compare iGNN with the conditional generative adversarial network (cGAN) (cGAN), the conditional invertible neural network trained with maximum mean discrepancy (CINN-MMD) (CINN_inverse), and the same CINN trained with the normalizing-flow objective (CINN-Nflow) (CINN). Details and hyperparameters of the competitors in each setting are described in Appendix A.1. Regarding quantitative assessment, two-sample testing statistics are used to compare the similarity between two distributions (e.g., the generated versus the true conditional distribution at each value of $Y$). In particular, we use the kernel MMD under the Gaussian kernel with various bandwidth choices (Gretton2012AKT) and the energy-based test (Szkely2013EnergySA), which utilizes the energy distance, or E-statistic, to compare distributions. We remark that since CINN-MMD is trained with an MMD objective (although the kernel functions differ), the kernel MMD statistics can be biased towards this method.
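For reference, the two test statistics can be sketched as below; these are minimal illustrative implementations of ours, not the paper's code.

```python
import numpy as np

def gaussian_mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of the squared kernel MMD with a Gaussian kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def energy_distance(X, Y):
    """E-statistic: 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    def pdist_mean(A, B):
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).mean()
    return 2 * pdist_mean(X, Y) - pdist_mean(X, X) - pdist_mean(Y, Y)

rng = np.random.default_rng(0)
same = rng.normal(size=(500, 2))          # samples from one distribution
shifted = rng.normal(size=(500, 2)) + 3.0  # samples from a shifted distribution
# Both statistics are (near) zero when the two samples coincide,
# and grow with the discrepancy between the two distributions.
```

Both statistics vanish for identical samples and increase with distributional discrepancy, which is how Tables 1-5 are read.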

The following settings are fixed across all experiments; other hyperparameters are explained separately in each experiment. All graphs are undirected and unweighted, with self-loops inserted. The activation function is ELU (ELU), chosen for its continuous derivatives, and the hidden dimension is fixed at 64. During training, we use the Adam optimizer (Kingma2015AdamAM) to minimize the end-to-end objective (6). We terminate training of all methods once the percentage decrease of the training loss falls below 0.01%; for compactness and clarity of visualization, the training and test losses of competing methods in each experiment are shown in Appendix A.2, Figure 14.
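The stopping rule can be sketched as a small helper (the function name and interface are ours, purely illustrative):

```python
def should_stop(prev_loss, curr_loss, tol=1e-4):
    """Stop once the percentage decrease of the training loss
    falls below tol (0.01%, i.e. 1e-4, as in the stopping criterion)."""
    if prev_loss is None:
        return False  # no previous epoch to compare against
    return (prev_loss - curr_loss) / abs(prev_loss) < tol
```

For example, a drop from 1.0 to 0.9 (10%) continues training, while a drop from 1.0 to 0.99999 (0.001%) triggers termination.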

### 4.1 Simple examples

We consider two simulated examples in this section: the first uses non-graph Euclidean vector data, and the second deals with graph node features. In particular, we find that CINN-MMD yields performance competitive with iGNN, while the other two competing methods perform much worse.

1. Simulation: 8 Gaussians. We consider one non-graph toy example that is widely used to assess conditional generative methods. Each conditional distribution comprises two disjoint Gaussian distributions, for four possible choices of the label $Y$. The task is to estimate each conditional distribution as closely as possible. We build the iGNN with 40 residual blocks, where each block has four FC layers with 64 hidden nodes. During training, we generate 2500 i.i.d. training and test samples for each of the four labels, fix the learning rate at 5e-4, and train with a batch size of 1000. Figure 4 compares our generative results with those of CINN-MMD, where both methods reasonably and correctly estimate each conditional distribution. Table 1 provides the two-sample testing statistics, where the values of CINN-Nflow and cGAN are much larger than those of ours and of CINN-MMD. These larger values indicate a greater discrepancy between the true and estimated conditional distributions.
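A generator for this kind of toy data can be sketched as follows; the exact mode locations, radius, and noise scale below are illustrative assumptions, placing the eight Gaussian modes on a circle with two opposite modes per label.

```python
import numpy as np

def sample_eight_gaussians(n_per_label, rng=None):
    """Sample '8 Gaussians' toy data: for each of 4 labels y, X|Y=y is a
    balanced mixture of two disjoint Gaussians (assumed placement below)."""
    rng = rng or np.random.default_rng(0)
    Xs, Ys = [], []
    for y in range(4):
        # two opposite modes per label -> 8 modes total on a circle of radius 4
        for angle in (np.pi * y / 4, np.pi * y / 4 + np.pi):
            mean = 4.0 * np.array([np.cos(angle), np.sin(angle)])
            Xs.append(mean + 0.2 * rng.normal(size=(n_per_label // 2, 2)))
            Ys.append(np.full(n_per_label // 2, y))
    return np.concatenate(Xs), np.concatenate(Ys)

X, Y = sample_eight_gaussians(100)
```

Each label thus indexes a bimodal, disjoint conditional distribution, which is what makes the task a conditional generation benchmark.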

2. Simulation: 2D convex node features. We now consider the graph introduced in Example 1. Each node has a binary label, so that $Y$ is a three-dimensional binary vector. Conditioning on a specific binary vector $Y$ out of the eight choices, the random feature matrix $X|Y$ is defined as

$$X|Y := P_A\left(H|Y + \begin{bmatrix}-4 & 0\\ 4 & 0\\ 0 & 0\end{bmatrix}\right)\quad \text{s.t.}\quad H(v)|Y(v)\sim\begin{cases}\mathcal{N}\big((1.5,0)^T,\,0.1I_2\big) & \text{if } Y(v)=0,\\ \mathcal{N}\big((-1.5,0)^T,\,0.1I_2\big) & \text{if } Y(v)=1,\end{cases}$$

where $P_A$ is the graph average matrix. In particular, we purposefully design the example so that the marginal distribution of each node feature does not yield enough information for classification (i.e., $Y(v)$ can be either 0 or 1 for the same marginal feature distribution when the other nodes' features remain unobserved). We build the iGNN with 40 residual blocks, where each block has 3 layers of the same form (and the same neighborhood orders) as in section 4.1.2. During training, we generate 4000 i.i.d. training and test samples for each of the eight binary vectors $Y$, fix the learning rate at 5, and train with a batch size of 100. Figure 5 shows that the generative and classification performances of our method are both satisfactory, and that CINN-MMD yields similar performances. Meanwhile, Table 2 shows a similar issue as before: neither CINN-Nflow nor cGAN yields satisfactory performance.
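The sampling scheme above can be sketched as follows. The precise form of the graph average matrix $P_A$ is an assumption here (taken as the row-normalized adjacency with self-loops); the per-node means, offset, and noise follow the display equation.

```python
import numpy as np

# Three-node path graph with self-loops; P_A taken as the row-normalized
# adjacency ("graph average") -- an assumption on its exact form.
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
P_A = A / A.sum(axis=1, keepdims=True)

offset = np.array([[-4., 0.], [4., 0.], [0., 0.]])  # per-node mean shift

def sample_X_given_Y(Y, n, rng):
    """Y: binary vector of length 3; returns n samples of the 3x2 feature matrix."""
    means = np.where(Y[:, None] == 0, [1.5, 0.], [-1.5, 0.])
    H = means + np.sqrt(0.1) * rng.normal(size=(n, 3, 2))  # H(v)|Y(v) Gaussian
    return P_A @ (H + offset)  # graph averaging; broadcasts over the sample axis

rng = np.random.default_rng(0)
X = sample_X_given_Y(np.array([0, 1, 0]), 1000, rng)
```

The graph averaging step is what couples the nodes, so that marginal node features alone cannot recover the labels.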

###### Remark 4.1 (Using FC layers on graph data).

Note that one can always build each residual block using only FC layers, ignoring the relationship among graph nodes. The main difference from using a graph filter is that one must first vectorize the node feature matrix and then transport the resulting high-dimensional density. Doing so can significantly increase the number of parameters in the model when the graph is large. Generative performance may also deteriorate in some instances, since graph information is ignored.
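To make the parameter growth concrete, a back-of-the-envelope count (with assumed sizes of ours) compares a dense layer on the vectorized features against a shared-weight polynomial graph filter:

```python
# Illustrative parameter count: FC on vec(X) vs. a polynomial graph filter.
V, d, K = 100, 2, 3          # nodes, feature dim, filter order (assumed sizes)

fc_params = (V * d) ** 2     # dense weight on vec(X): grows quadratically in V
graph_filter_params = K * d * d  # one shared d x d weight per polynomial order

print(fc_params, graph_filter_params)  # 40000 vs 12
```

The filter's parameter count is independent of the number of nodes, which is the practical motivation for graph-aware residual blocks.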

### 4.2 Complex examples

We now consider more challenging node feature generation examples beyond those in Section 4.1. We first build upon the three-node graph by considering non-convex feature distributions. We then consider two high-dimensional real-data graph examples in anomaly detection, where it is essential to generate graph features jointly based on the label at each node (recall Figure 1). In all these cases, we will see that CINN-MMD performs worse than iGNN when node features are non-convex or high-dimensional, and that neither CINN-Nflow nor cGAN performs as satisfactorily as iGNN.

1. Simulation: 2D non-convex node features. We still use the graph introduced in Example 1, under a more complex node feature setting. At each node, the feature distribution can be either convex (e.g., a linear transform of Gaussian vectors) or non-convex (e.g., part of the two-moon dataset as in Figure 3). The response vector is still binary. We omit details regarding the generation of the features, which is very similar to before except for how the non-convex distribution is chosen. We train the same iGNN under the same hyperparameter settings. Figure 6 shows that our method tends to yield better performance when the true conditional distribution is non-convex, which is also verified by the smaller two-sample testing statistics in Table 3.

2. Real-data: solar ramping events. We then consider the anomaly detection task on California solar data in 2017 and 2018. The whole network has ten sensors with bi-hourly recordings. Features denote the average of raw radiation recordings every 12 hours in the past 24 hours, and response vectors contain the anomaly status of each city. The first 75% (resp. last 25%) of the data denote the training (resp. test) data, which have 1094 (resp. 365) graph observations. Figure 7 visualizes the graph and the raw features colored by anomaly status (i.e., black means normal). During training, we fix the learning rate at 1e-4 and train with a batch size of 150. We build iGNN with 40 residual blocks, where we considered three different architectures in each block: (a) of order 2, (b) of neighborhood order (0, 1, 2), and (c). In terms of generative performance, architecture (a) performs the best; its results are shown in Figure 7 in comparison with CINN-MMD, and we show results for architectures (b) and (c) in Appendix A.2, Figures 15 and 16. In particular, Figure 7 shows that while iGNN captures the entire distributional pattern for different $Y$, CINN-MMD fails to do so when the underlying distribution is non-convex (see the second row in (c) or (d)). Meanwhile, Table 4 shows that both CINN-Nflow and cGAN yield much larger test statistics than iGNN and CINN-MMD, indicating much poorer generative performance. Although CINN-MMD may yield smaller training statistics than iGNN, this is likely attributable to how we weight the statistics across different $Y$ and to their high variance under insufficient samples. Details can be found in Table 6 in Appendix A.2, where we show the test statistics at each $Y$ without weighting.

3. Real-data: traffic flow anomalies. We lastly consider the anomaly detection task on Los Angeles traffic flow data from April to September 2019. The whole network has 15 sensors with hourly recordings. Features denote the raw hourly recordings in the past two hours, and response vectors contain the anomaly status of each traffic sensor. The first 75% (resp. last 25%) of the data denote the training (resp. test) data, which have 3070 (resp. 1320) graph observations. Figure 8 visualizes the graph and the raw training features, which show different anomaly patterns from the solar ramping events in Figure 7. During training, we fix the learning rate at 1e-4 and train with a batch size of 200. We build iGNN with 40 residual blocks, where each block contains four FC layers; empirical evidence suggests that combining FC layers with graph filters yields poorer generative performance, which we will examine further in the future. Figure 8 shows that our generative performance is satisfactory in general, whereas CINN-MMD tends to generate distributions much more dispersed than the actual ones. Table 5 further verifies this through the much larger two-sample energy-based test statistics of CINN-MMD.

###### Remark 4.2 (Further implications).

For many real-world graph datasets, the test data may contain node label vectors that lie outside the set of node label vectors seen in training, but we can still estimate the corresponding conditional feature distribution by sampling in the latent space and applying the learned inverse mapping. In other words, this tells us "what circumstances (i.e., features) most likely result in (previously unseen) outcomes (i.e., labels)". In the context of anomaly detection on graphs, specific nodes may only contain normal observations in the training data, but for prognosis, we can examine what features would lead to anomalous outcomes at these nodes.

### 4.3 iGNN expressiveness

We verify the model expressiveness analyses of Section 3.3 with empirical evidence. We first revisit the three-node example in Figure 5, where we show that models lacking expressiveness can never correctly generate the features. Then, we design two examples according to Propositions 3.5 and 3.8: in the first, the covariance matrix matches the output of a spectral-based graph filter; in the second, the covariance satisfies certain spatial locality properties. We show that graph filters with enough expressiveness can accurately estimate the covariance matrices, hence transporting the densities correctly.

1. GNN lacks expressiveness. Recall Figure 5 on the three-node graph in Example 1, where we showed the satisfactory performance of our iGNN; there, each residual block takes the form described earlier. Figure 9 shows the results when we replace the first L3net in each residual block by a Chebnet of order 2 or by an FC layer. In particular, the Chebnet variant of iGNN cannot learn the conditional distribution, due to the symmetry issue inherent in spectral-based graph filters mentioned in Example 1.

2. Sufficient GNN expressiveness for spectral-based covariances, see Proposition 3.5. We first consider the case in which the covariance is a Chebyshev polynomial of the scaled and normalized graph Laplacian, where $T_k$ denotes the $k$-th degree Chebyshev polynomial. The polynomial coefficients are chosen so that the covariance is positive definite with a significant number of non-zero and large off-diagonal entries. We consider a seven-node chordal cycle graph (Lubotzky1994DiscreteGE), one example of an expander graph. Figure 10 visualizes the raw graph. To verify the existence statement, we design iGNN to have 40 residual blocks, where each residual block is a single Chebnet filter of order two (Chebnet). During training, we generate 4000 i.i.d. training and test samples following the GP, fix the learning rate at 5e-4, and train with a batch size of 400. Figure 10 shows that both the training and test losses converge in a small number of training epochs, and that both correlation matrices are estimated with high accuracy.
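As a rough illustration of the spectral-based filters used here, the sketch below builds an order-$K$ Chebyshev filter of a normalized graph Laplacian via the standard three-term recurrence; the spectrum shift $L-I$ assumes the eigenvalues of $L$ lie in $[0,2]$, and the coefficients are placeholders.

```python
import numpy as np

def chebyshev_filter(L, theta):
    """Form sum_k theta_k T_k(L_scaled), where L is a normalized graph
    Laplacian (eigenvalues assumed in [0, 2]) and T_k is the k-th
    Chebyshev polynomial, built via the recurrence T_k = 2x T_{k-1} - T_{k-2}."""
    V = L.shape[0]
    L_scaled = L - np.eye(V)              # shift spectrum into [-1, 1]
    Tk_prev, Tk = np.eye(V), L_scaled
    out = theta[0] * Tk_prev + theta[1] * Tk
    for k in range(2, len(theta)):
        Tk_prev, Tk = Tk, 2 * L_scaled @ Tk - Tk_prev
        out = out + theta[k] * Tk
    return out
```

Any such filter is a polynomial in $L$, so it shares eigenvectors with $L$ and commutes with it; this is exactly why spectral filters suffice here but fail for the permutation-asymmetric covariances discussed earlier.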

3. Sufficient GNN expressiveness for spatial-based covariances, see Proposition 3.8. We then consider the three-node graph in Example 1, where node 1 is connected to both nodes 0 and 2, which are not connected to each other. In particular, we specify the correlation between nodes 0 and 1 (resp. 2 and 1) as 0.6 (resp. ), so that the covariance (resp. its inverse) is one-neighbor local (resp. two-neighbor local). Figure 11 visualizes the raw graph. We again use a 40-block iGNN for estimation, where the only difference is that an L3net layer with the stated neighborhood orders replaces each Chebnet layer. All other hyperparameters are the same as those in section 4.1.1 above. Figure 11 shows that the losses converge reasonably fast and that the correlation matrices are estimated with high accuracy. Lastly, note that we intentionally design the correlation matrix to break the graph's permutation symmetry as in Example 1, whereas the graph Laplacian respects that symmetry by construction. As a result, any spectral-based graph filter is unable to express the matrix and therefore cannot perform the generative task. Results are shown in Appendix A.2, Figures 13 and 13, which show that even though the losses converge, neither correlation matrix is estimated correctly.

## 5 Conclusion

We proposed a general framework based on normalizing flow for conditional generation in this work. Although our primary focus is on graph data, the framework applies to any Euclidean data, which may be viewed as observations on a graph with a single node, so iGNN naturally extends to these problems. This framework benefits one-to-many inverse problems that aim to infer inputs from outcomes. In particular, our techniques address both prediction and conditional generation at once through a single invertible residual network, allowing for implementation simplicity and clarity in training. Theoretically, we analyzed the identifiability of a true invertible mapping assuming its existence, verified its existence and invertibility using the Fokker-Planck equation of an Ornstein-Uhlenbeck process, and examined the expressiveness of iGNN in learning the mapping. Experimentally, we compared iGNN with competing conditional generative models on simulated and real-data examples, verifying the improved performance of iGNN, especially in high-dimensional and non-convex graph cases.

There are many open avenues for research. Method-wise, we want to consider continuous outcomes and other generative tasks such as edge feature generation and graph generation. Regarding network training, we want to consider the continuous-time variant of residual networks (FFJORD) and utilize techniques based on monotone variational inequalities in each residual block for performance guarantees (xu2022training). Theory-wise, we will extend the analyses of the invertibility of the true force in (10) based on the Jordan-Kinderlehrer-Otto (JKO) proximal scheme regularization. We also want to study more deeply the expressiveness of iGNN in estimating the true force.

## Acknowledgement

The work is supported by NSF DMS-2134037. C.X. and Y.X. are supported by an NSF CAREER Award CCF-1650913, NSF DMS-2134037, CMMI-2015787, DMS-1938106, and DMS-1830210. X.C. is partially supported by NSF, NIH and the Alfred P. Sloan Foundation.

## 6 Proofs

###### Proof of Proposition 3.1.

Recall that the learned map is bijective and, for each potential value of $Y$, transports the conditional distribution of $X|Y$ to a common latent distribution. Hence, for each value of $Y$, the transported random vector follows the same distribution as the latent variable, and the random variables generated by the inverse map follow the conditional distributions prescribed by the model. In addition, since we assume that there exists a parameter value at which the model coincides with the true mapping, the generated variables also follow the true conditional distribution.

By the assumption that the sample size approaches infinity, infinitely many samples from each conditional distribution are available. Under additional regularity conditions, the MLE is consistent (Newey1986LargeSE): it converges to the true parameter with probability one. Therefore, the model recovers the true conditional distributions with probability one. ∎

###### Proof of Lemma 3.2.

The solution of (11) has the explicit expression as

$$\rho(x,t)=\int K_t(x,y)\rho_0(y)\,dy,\qquad K_t(x,y):=\frac{1}{(2\pi\sigma_t^2)^{d/2}}\,e^{-\frac{|x-e^{-t}y|^2}{2\sigma_t^2}},\qquad \sigma_t^2:=1-e^{-2t}. \tag{14}$$

By the smoothness of the kernel $K_t$, the solution $\rho(x,t)$ is smooth. By (12), $\rho(x,t)>0$ for any $(x,t)$, and thus the velocity field is also smooth. ∎

###### Proof of Proposition 3.3.

The well-posedness of the IVP is guaranteed as long as the velocity field is continuous and satisfies a Lipschitz condition w.r.t. the state variable (Sideris2013OrdinaryDE). The former follows from Lemma 3.2, and the latter is assumed in the proposition. ∎

###### Proof of Lemma 3.4.

We only need to derive the distribution of $X_t$ in order to find its density function. Because the Ornstein-Uhlenbeck process is a Gaussian process, $X_t\sim\mathcal{N}(0,\Sigma_t)$, where $\Sigma_t$ is the time-dependent covariance matrix.

By the transition probability, also known as the Green’s function

$$K(X,Y;t):=(2\pi\sigma_t^2)^{-d/2}\exp\!\left(-(2\sigma_t^2)^{-1}\|X-\exp(-t)Y\|_2^2\right),\qquad \sigma_t^2:=1-\exp(-2t),$$

we know that $X_t|X_0=Y\sim\mathcal{N}\big(\exp(-t)Y,\,\sigma_t^2 I_d\big)$. When $X_0\sim\mathcal{N}(0,\Sigma)$, we thus have

$$X_t=\exp(-t)X_0+E,\qquad E\sim\mathcal{N}(0,\sigma_t^2 I_d),\qquad X_0\perp E.$$

Hence, $X_t$ is a sum of independent Gaussian vectors.

As a result, $X_t\sim\mathcal{N}(0,\Sigma_t)$, with $\Sigma_t=\exp(-2t)\Sigma+(1-\exp(-2t))I_d$. ∎
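The covariance identity in the proof can be verified with a quick Monte Carlo simulation (a numerical sanity check of ours, not part of the proof):

```python
import numpy as np

# Check Lemma 3.4's covariance: with X_t = exp(-t) X_0 + E,
# E ~ N(0, (1 - exp(-2t)) I) independent of X_0 ~ N(0, Sigma),
# one has Cov(X_t) = exp(-2t) Sigma + (1 - exp(-2t)) I.
rng = np.random.default_rng(0)
t, n = 0.7, 200_000
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

X0 = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
E = np.sqrt(1 - np.exp(-2 * t)) * rng.normal(size=(n, 2))
Xt = np.exp(-t) * X0 + E

Sigma_t = np.exp(-2 * t) * Sigma + (1 - np.exp(-2 * t)) * np.eye(2)
err = np.abs(np.cov(Xt.T) - Sigma_t).max()  # small sampling error
```

As $t$ grows, $\Sigma_t$ interpolates from $\Sigma$ toward the identity, matching the OU process's convergence to the standard Gaussian.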

###### Proof of Proposition 3.5.

Note that by the spectral decomposition $L=U\Lambda U^T$ and the assumption $\Sigma=P(L)$, we have

$$\Sigma=\sum_{i=1}^{V}P(\lambda_i(L))\,U_iU_i^T,\qquad \Sigma^{-1}=\sum_{i=1}^{V}P(\lambda_i(L))^{-1}\,U_iU_i^T.$$

Thus,

$$\Sigma_t=(1-\exp(-2t))I_V+\exp(-2t)\Sigma=U\big[(1-\exp(-2t))I_V+\exp(-2t)P(\Lambda)\big]U^T.$$

As a result,

$$\Sigma_t^{-1}=\sum_{i=1}^{V}\big((1-\exp(-2t))+\exp(-2t)P(\lambda_i(L))\big)^{-1}U_iU_i^T.$$

∎
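The diagonalization above can be checked numerically; in the sketch below, an arbitrary symmetric matrix stands in for $L$ and the positive polynomial $P$ is an assumed example.

```python
import numpy as np

# Numerical check: when Sigma = P(L) shares eigenvectors U with L,
# Sigma_t = (1 - exp(-2t)) I + exp(-2t) Sigma and its inverse are
# diagonalized by the same U, with eigenvalues transformed entrywise.
rng = np.random.default_rng(0)
V, t = 5, 0.5
M = rng.normal(size=(V, V))
L = (M + M.T) / 2                      # symmetric "Laplacian-like" matrix
lam, U = np.linalg.eigh(L)
P = lambda x: 1.0 + 0.5 * x ** 2       # P(lambda) > 0 ensures Sigma is PD
Sigma = U @ np.diag(P(lam)) @ U.T

c = 1 - np.exp(-2 * t)                 # note exp(-2t) = 1 - c
Sigma_t = c * np.eye(V) + (1 - c) * Sigma
spectral_inverse = U @ np.diag(1.0 / (c + (1 - c) * P(lam))) @ U.T
```

Here `np.linalg.inv(Sigma_t)` agrees with `spectral_inverse`, confirming that inverting $\Sigma_t$ reduces to inverting its eigenvalues.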

###### Proof of Corollary 3.6.

By the assumption that $\Sigma$ and $L$ share the same eigen-bases, we have

$$\Sigma_t=\sum_{i=1}^{V}\big((1-\exp(-2t))+\exp(-2t)\lambda_i(\Sigma)\big)U_iU_i^T,\qquad \tilde{\Sigma}_t=\sum_{i=1}^{V}\big((1-\exp(-2t))+\exp(-2t)\tilde{P}(\lambda_i(L))\big)U_iU_i^T.$$

Under the same diagonalization, the inverses of $\Sigma_t$ and $\tilde{\Sigma}_t$ can be expressed as

$$\Sigma_t^{-1}=\sum_{i=1}^{V}\big((1-\exp(-2t))+\exp(-2t)\lambda_i(\Sigma)\big)^{-1}U_iU_i^T,\qquad \tilde{\Sigma}_t^{-1}=\sum_{i=1}^{V}\big((1-\exp(-2t))+\exp(-2t)\tilde{P}(\lambda_i(L))\big)^{-1}U_iU_i^T.$$

Thus, we have

$$\rho\big(\Sigma_t^{-1}-\tilde{\Sigma}_t^{-1}\big)\le\max_i\Big|\big((1-\exp(-2t))+\exp(-2t)\lambda_i(\Sigma)\big)^{-1}-\big((1-\exp(-2t))+\exp(-2t)\tilde{P}(\lambda_i(L))\big)^{-1}\Big|\le\left|\frac{(1-c_t)\delta}{\big(c_t+(1-c_t)\lambda_d(\Sigma)\big)\big(c_t+(1-c_t)\min_i\tilde{P}(\lambda_i(L))\big)}\right|,\qquad c_t:=1-\exp(-2t).$$

Note that the denominator