1 Introduction
We propose a deep generative model for molecular graphs based on invertible functions. In particular, we introduce an invertible function tailored to graph-structured data, which allows for flexible mappings with fewer parameters than previous invertible models for graphs.
Molecular graph generation is one of the hot topics in graph analysis, with potential for important applications such as in silico new material discovery and drug candidate screening. Earlier generative models for molecules deal with string representations called SMILES (e.g., Kusner et al. (2017); Gómez-Bombarelli et al. (2018)), which do not consider graph topology. Recent models such as (Jin et al., 2018; You et al., 2018; De Cao & Kipf, 2018; Madhawa et al., 2019) are able to handle graphs directly. Several researchers are investigating this topic using sophisticated statistical models, such as variational autoencoders (VAEs) (Kingma & Welling, 2014), adversarial-loss-based models such as generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015), and invertible flows (Kobyzev et al., 2019), and have achieved desirable performances. The decoders of these graph generation models generate discrete graph-structured data from a (typically continuous) representation of a data sample, which is modeled by the aforementioned statistical models. In general, it is difficult to design a decoder that balances the efficacy of the graph generation and the simplicity of the implementation and training. For example, MolGAN (De Cao & Kipf, 2018) has a relatively simple decoder but suffers from generating numerous duplicated graph samples. State-of-the-art VAE-based models such as (Jin et al., 2018; Liu et al., 2018) have good generation performance, but their decoding schemes are highly complicated and require careful training. On the contrary, invertible flow-based statistical models (Dinh et al., 2015; Kobyzev et al., 2019) do not require training for their decoders, because the decoders are simply the inverse mappings of the encoders, and they are known for good generation performance in image generation (Dinh et al., 2017; Kingma & Dhariwal, 2018). Liu et al. (2019) propose an invertible-flow-based graph generation model. However, their generative model is not fully invertible, because its decoder for the graph structure is not built upon invertible flows. GraphNVP by Madhawa et al. (2019) is the seminal fully invertible-flow approach for graph generation, which successfully combines invertible maps with generic graph convolutional networks (GCNs, e.g., Kipf & Welling (2017); Schlichtkrull et al. (2017)).
However, the coupling flow (Kobyzev et al., 2019) used in GraphNVP has a serious drawback when applied to sparse graphs such as the molecular graphs we are interested in. The coupling flow requires a disjoint partitioning of the latent representation of the data (graph) in each layer. We need to design this partitioning carefully so that all the attributes of a latent representation are well mixed through stacks of mapping layers. However, molecular graphs are highly sparse in general: the degree of each atom node is at most four (valency), and only a few kinds of atoms comprise the majority of molecules (low diversity). Madhawa et al. (2019) argued that, owing to this sparsity, only a specific form of partitioning can lead to a desirable performance: in each mapping layer, the representation of only one node is subject to update, and all the other nodes are kept intact. In other words, a graph with 100 nodes requires at least 100 layers. But even with 100 layers, only one affine mapping is executed for each attribute of the latent representation. Therefore, the complexity of the mappings of GraphNVP is extremely low in contrast to the number of stacked layers. We assume that this is why the generation performance of GraphNVP reported in the paper is less impressive than those of other state-of-the-art models (Jin et al., 2018; Liu et al., 2018).
In this paper, we propose a new graph flow, called the graph residual flow (GRF): a novel combination of a generic GCN and recently proposed residual flows (Behrmann et al., 2019; Song et al., 2019; Chen et al., 2019). The GRF does not require partitioning of a latent vector and can update all the node attributes in each layer. Thus, a 100-layer-stacked flow model can apply the (nonlinear) mappings 100 times for each attribute of the latent vector of a 100-node graph. We derive a theoretical guarantee of the invertibility of the GRF and introduce constraints on the GRF parameters, based on rigorous mathematical calculations. Through experiments on the most popular graph generation datasets, we observe that a generative model based on the proposed GRF can achieve a generation performance comparable to GraphNVP (Madhawa et al., 2019), but with far fewer trainable parameters. To summarize, our contributions in this paper are as follows:

We propose the graph residual flow (GRF): a novel residual flow model for graph generation that is compatible with generic GCNs.

We prove conditions under which GRFs are invertible and present how to keep the entire network invertible throughout training and sampling.

We demonstrate the efficacy of GRF-based models in generating molecular graphs; in other words, we show that a generative model based on the GRF has far fewer trainable parameters than GraphNVP, while still maintaining a comparable generation performance.
2 Background
2.1 GraphNVP
We first describe the GraphNVP (Madhawa et al., 2019), the first fully invertible model for chemical graph generation, as a baseline. We simultaneously introduce the necessary notations for graph generative models.
We use the notation $G = (A, X)$ to represent a graph $G$ comprising an adjacency tensor $A$ and a feature matrix $X$. Let $N$ be the number of nodes in the graph, $M$ be the number of types of nodes, and $R$ be the number of types of edges. Then, $A \in \{0, 1\}^{N \times N \times R}$ and $X \in \{0, 1\}^{N \times M}$. In the case of molecular graphs, $G$ represents a molecule with $R$ types of bonds (single, double, etc.) and $M$ types of atoms (e.g., oxygen, carbon, etc.). Our objective is to train an invertible model $f_\theta$ with parameters $\theta$ that maps $G$ into a latent point $z = f_\theta(G)$. We describe $f_\theta$ as a normalizing flow composed of multiple invertible functions. Let $z$ be a latent vector drawn from a known prior distribution (e.g., Gaussian): $z \sim p_z(z)$. After applying a variable transformation, the log probability of a given graph $G$ can be calculated as:

\log p(G) = \log p_z\big(f_\theta(G)\big) + \log \left| \det \frac{\partial f_\theta(G)}{\partial G} \right|,    (1)

where $\frac{\partial f_\theta(G)}{\partial G}$ is the Jacobian of $f_\theta$ at $G$.
In Madhawa et al. (2019), $f_\theta$ is modeled by two types of invertible non-volume preserving (NVP) mappings (Dinh et al., 2017). The first type of mapping transforms the adjacency tensor $A$, and the second type transforms the node attribute matrix $X$.
Let us divide the hidden variable $z$ into two parts, $z = [z_X, z_A]$; the former is derived from invertible mappings of $X$, and the latter is derived from invertible mappings of $A$. For the mapping of the feature matrix $X$, GraphNVP provides a node feature coupling:

z_X^{(\ell)}[\ell, :] \leftarrow z_X^{(\ell-1)}[\ell, :] \odot \exp\!\big( s\big( z_X^{(\ell-1)}[\bar{\ell}, :], A \big) \big) + t\big( z_X^{(\ell-1)}[\bar{\ell}, :], A \big),    (2)

where $\ell$ indicates the layer of the coupling, the functions $s$ and $t$ stand for scale and translation operations, respectively, and $\odot$ denotes element-wise multiplication. We use $z[\bar{\ell}, :]$ to denote a latent representation matrix excluding the $\ell$-th row (node). The rest of the rows of the feature matrix remain the same:

z_X^{(\ell)}[\bar{\ell}, :] \leftarrow z_X^{(\ell-1)}[\bar{\ell}, :].    (3)
$s$ and $t$ are modeled by a generic GCN, which requires the adjacency information of the nodes, $A$, for better interactions between the nodes.
For the mapping of the adjacency tensor, GraphNVP provides an adjacency coupling:

z_A^{(\ell)}[\ell, :, :] \leftarrow z_A^{(\ell-1)}[\ell, :, :] \odot \exp\!\big( s\big( z_A^{(\ell-1)}[\bar{\ell}, :, :] \big) \big) + t\big( z_A^{(\ell-1)}[\bar{\ell}, :, :] \big).    (4)

The rest of the rows remain as they are:

z_A^{(\ell)}[\bar{\ell}, :, :] \leftarrow z_A^{(\ell-1)}[\bar{\ell}, :, :].    (5)
The above-mentioned formulations map only those variables that are related to a single node in each $\ell$-th layer (Eqs. (2) and (4)), and the remaining nodes are kept intact (Eqs. (3) and (5)); i.e., the partitioning of the variables always occurs along the first (node) axis of the tensors. This limits the parameterization of the scaling and translation operations, resulting in reduced representation power of the model.
In the original paper, the authors mention: “masking (switching) … w.r.t the node axis performs the best. … We can easily formulate … the slice indexing based on the non-node axis … results in dramatically worse performance due to the sparsity of molecular graph.” Here, sparsity can be understood in two ways: one is the sparsity of non-carbon atoms in organic chemicals, and the other is the low degrees of atom nodes (because of valency).
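For concreteness, the partition-based update of Eqs. (2)–(3) can be sketched as follows. This is a minimal NumPy illustration with toy scale and translation functions; GraphNVP's actual $s$ and $t$ are GCNs conditioned on $A$, so the function bodies here are purely illustrative:

```python
import numpy as np

def node_coupling_forward(z, ell, s, t):
    """Affine coupling in the spirit of Eqs. (2)-(3): only row `ell`
    is updated; s and t see every row except `ell`."""
    z_out = z.copy()
    mask = np.arange(z.shape[0]) != ell
    z_out[ell] = z[ell] * np.exp(s(z[mask])) + t(z[mask])
    return z_out

def node_coupling_inverse(z, ell, s, t):
    """Analytic inverse: the masked rows are unchanged, so s and t
    can be re-evaluated on them and the affine map undone."""
    z_out = z.copy()
    mask = np.arange(z.shape[0]) != ell
    z_out[ell] = (z[ell] - t(z[mask])) * np.exp(-s(z[mask]))
    return z_out

# toy s and t that depend only on the unmasked rows
s = lambda h: np.tanh(h.sum(axis=0))
t = lambda h: h.mean(axis=0)

z = np.random.default_rng(0).standard_normal((4, 3))
z_fwd = node_coupling_forward(z, 1, s, t)
assert np.allclose(node_coupling_inverse(z_fwd, 1, s, t), z)
```

Note how a single layer touches only one row: this is exactly the sparsity-induced limitation discussed above.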
2.2 Invertible residual blocks
One of the major drawbacks of the partition-based coupling flow is that it covers a fairly limited family of mappings. In exchange, the coupling flow offers computationally cheap inversions in an analytic form. A series of recent invertible models (Behrmann et al., 2019; Song et al., 2019; Chen et al., 2019) proposes a different approach to invertible mappings, called the residual flow (Kobyzev et al., 2019). They formulate ResNets (He et al., 2016), the gold standard for image recognition, as invertible mappings. The general idea is described as follows.
Our objective is to develop an invertible residual layer for a vector $z$:

z^{(\ell+1)} = z^{(\ell)} + g_\ell\big(z^{(\ell)}\big),    (6)

where $z^{(\ell)}$ is the representation vector at the $\ell$-th layer, and $g_\ell$ is a residual block. If we constrain $g_\ell$ correctly, then we can assure the invertibility of the above-mentioned residual layer.
iResNet (Behrmann et al., 2019) presents a constraint on the Lipschitz constant of $g_\ell$. MintNet (Song et al., 2019) limits the shape of the residual block and derives non-singularity requirements on the Jacobian of the (limited) residual block.
Notably, the (invertible) residual connection (Eq. (6)) does not assume a partition of the variables into “intact” and “affine-map” parts. This means that each layer of an invertible residual connection updates all the variables at once. In both the aforementioned papers, a local convolutional network architecture (He et al., 2016) is proposed for the residual block on image tensor inputs, which can be applied to image generation/reconstruction for experimental validation. For example, in iResNet, the residual block is defined as:

g = W_3 \circ \phi \circ W_2 \circ \phi \circ W_1,    (7)

where $\phi$ denotes a contractive nonlinear function such as ReLU or ELU, and $W_1, W_2, W_3$ are (spatially) local convolutional layers (i.e., aggregating the neighboring pixels). In this case, we put a constraint that the spectral norms of all the $W_i$ are less than unity for the Lipschitz condition.

3 Invertible Graph Generation Model with Graph Residual Flow (GRF)
We observe that the limitations of GraphNVP cannot be avoided as long as we use partition-based coupling flows for sparse molecular graphs. Therefore, we aim to realize a different type of invertible coupling layer that does not depend on variable partitioning (for easier inversion and likelihood computation). For this purpose, we propose a new molecular graph generation model based on a more powerful and efficient Graph Residual Flow (GRF), our proposed invertible flow for graphs.
3.1 Setup
The overall setup is similar to that of the original GraphNVP. We use the notation $G = (A, X)$ to represent a graph comprising an adjacency tensor $A$ and a feature matrix $X$. Each tensor is mapped to a latent representation through invertible functions. Let $z_A$ be the latent representation of the adjacency tensor, and $p_{z_A}$ be its prior. Similarly, let $z_X$ be the latent representation of the feature matrix, and $p_{z_X}$ be its prior. We assume that both priors are multivariate normal distributions.
As $A$ and $X$ are originally binary, we cannot directly apply the change-of-variables formula. The widely used workaround (Dinh et al., 2017; Kingma & Dhariwal, 2018; Madhawa et al., 2019) is dequantization: adding noise drawn from a continuous distribution and regarding the tensors as continuous. The dequantized graph, denoted as $G' = (A', X')$, is used as the input in Eq. (1). Note that the original discrete inputs $A$ and $X$ can be recovered by simply applying a floor operation to each continuous value in $A'$ and $X'$. Hereafter, all transformations are performed on the dequantized inputs $A'$ and $X'$.
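The dequantization workaround can be sketched as follows. This is a minimal NumPy illustration; the noise scale of 0.9 is an assumed value chosen so that flooring recovers the discrete input exactly, not a value stated in this paper:

```python
import numpy as np

def dequantize(x, scale=0.9, rng=None):
    """Add uniform noise in [0, scale), with scale < 1, so that
    flooring recovers the original discrete values exactly."""
    rng = np.random.default_rng() if rng is None else rng
    return x + scale * rng.random(x.shape)

def quantize(x_cont):
    """Recover the discrete tensor by flooring."""
    return np.floor(x_cont).astype(np.int64)

# one-hot feature matrix for a 3-node graph with 2 atom types
X = np.array([[1, 0], [0, 1], [1, 0]])
assert np.array_equal(quantize(dequantize(X)), X)
```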
3.2 Forward model
We can instantly formulate a naive model; in doing so, we do not take the graph structure behind $G$ into consideration and regard $A'$ and $X'$ as simple tensors (multi-dimensional arrays). Namely, neighboring tensor entries are treated as a neighborhood regardless of the true adjacency between the nodes, and a similar discussion holds for the features. In such a case, we simply apply the invertible residual flow to the tensors, letting $z_A^{(0)} = A'$ and $z_X^{(0)} = X'$.
We formulate the invertible graph generation model based on GRFs. The fundamental idea is to replace the two coupling flows of GraphNVP with the new GRFs. A GRF comprises two sub-flows: a node feature residual flow and an adjacency residual flow.
For the feature matrix, we formulate a node feature residual flow for layer $\ell$ as:

z_X^{(\ell+1)} = z_X^{(\ell)} + g_X^{(\ell)}\big( z_X^{(\ell)}, A \big),    (10)

where $g_X^{(\ell)}$ is a residual block for the feature matrix at layer $\ell$. Similar to Eq. (2), we assume conditioning on the adjacency tensor $A$ for the coupling.

For the mapping of the adjacency tensor, we have a similar adjacency residual flow:

z_A^{(\ell+1)} = z_A^{(\ell)} + g_A^{(\ell)}\big( z_A^{(\ell)} \big),    (11)

where $g_A^{(\ell)}$ is a residual block for the adjacency tensor at layer $\ell$.
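The layer structure of Eqs. (10)–(11) can be sketched as follows. This is an illustrative NumPy fragment in which toy contractive functions stand in for the paper's residual blocks; the point is that every entry of the latent tensor is updated at every layer:

```python
import numpy as np

def grf_forward(z, blocks):
    """Stack of residual layers z <- z + g(z), as in Eqs. (10)-(11):
    every entry of z is updated in every layer, with no partitioning
    into 'intact' and 'updated' parts."""
    for g in blocks:
        z = z + g(z)
    return z

# two toy contractive blocks (each with Lipschitz constant 0.5)
blocks = [lambda z: 0.5 * np.tanh(z), lambda z: 0.5 * np.sin(z)]
z0 = np.array([0.1, -0.4, 0.8])
z2 = grf_forward(z0, blocks)
```

Contrast this with the coupling flow above, where a 100-node graph needs 100 layers for a single update per attribute.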
3.3 Residual Block Choices for GRFs
One of the technical contributions of this paper is the development of residual blocks for GRFs. The convolution architecture of ResNet is reminiscent of GCNs (e.g., Kipf & Welling (2017)), suggesting a possible application to graph input data. Therefore, we extend the invertible residual blocks of (Behrmann et al., 2019; Song et al., 2019) to the feature matrix $X$ and the adjacency tensor $A$, conditioned on the graph structure $G$.
The key issue here is the definition of the neighborhood for the local convolutions in the residual block (Eq. (7)).
The simplest approach to constructing a residual flow model is to use a linear layer as the residual block. In that case, we transform the adjacency tensor and the feature matrix into single vectors. However, we must then construct a large weight matrix so as not to reduce the dimension. Additionally, such naive vectorization destroys the local structure of the graphs. To address these issues, we propose two types of residual blocks, one for the adjacency tensor and one for the feature matrix.
In this paper, we propose a residual flow based on GCNs (e.g., Kipf & Welling (2017); Wu et al. (2019)) for graph-structured data.
We focus on modeling the residual block for the node feature matrix. Our approach is to replace the usual convolutional layers in Eq. (7) with graph convolutions:

g(X) = \phi\big( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X W \big),    (12)

where $\hat{A}$ and $\hat{D}$ are the adjacency matrix and degree matrix, respectively; $\hat{A}$ is a matrix representation of the adjacency tensor $A$; and $W$ is a learnable matrix parameter of the linear layer. For $g$ defined in this way, the following theorem holds.
Theorem 1. Let $g$ be the residual block defined in Eq. (12). Then the Lipschitz constant of $g$ is bounded as $\mathrm{Lip}(g) \leq L_\phi \|W\|_2$, where $L_\phi$ is the Lipschitz constant of the activation function $\phi$. The proof of this theorem is provided in the appendix.
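A GCN-style residual block whose Lipschitz constant is kept below one can be sketched as follows. This is a NumPy illustration; the helper names and the target norm of 0.9 are our choices, not the paper's. The contraction follows because the symmetrically normalized adjacency has spectral norm at most 1 and ELU is 1-Lipschitz:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}; its
    spectral norm is at most 1, which the Lipschitz bound relies on."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def elu(x):
    return np.where(x > 0, x, np.expm1(x))  # 1-Lipschitz nonlinearity

def spectral_normalize(W, target=0.9):
    """Rescale W so that its spectral norm is at most target < 1."""
    sigma = np.linalg.norm(W, ord=2)
    return W if sigma <= target else W * (target / sigma)

def gcn_residual_block(X, A, W):
    """g(X) = phi(A_norm X W), in the spirit of Eq. (12). With
    ||A_norm||_2 <= 1 and ||W||_2 < 1, g is a contraction."""
    return elu(normalized_adjacency(A) @ X @ W)
```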
The Lipschitz constraint not only enables the inverse operation (see Section 3.4) but also facilitates the computation of the log-determinant of the Jacobian matrix in Eq. (1), as performed in (Behrmann et al., 2019). In other words, the log-determinant of the Jacobian can be expressed as a matrix trace via a power series (Withers & Nadarajah, 2010; Hall, 2015), and the trace can be computed through truncated power-series iterations and stochastic approximation (Hutchinson's trick) (Hutchinson, 1990). Incorporating these tricks, the log-determinant can be obtained by the following equation:

\log \left| \det \big( I + J_g(z) \big) \right| = \operatorname{tr}\big( \log \big( I + J_g(z) \big) \big) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} \, \mathbb{E}_{p(v)}\!\left[ v^{\top} J_g(z)^{k} v \right],    (13)

where $J_g$ denotes the Jacobian matrix of the residual block $g$, and $p(v)$ is a probability distribution that satisfies $\mathbb{E}[v] = 0$ and $\operatorname{Cov}(v) = I$.
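Eq. (13) can be approximated numerically as follows. This is a NumPy sketch with a truncated power series and Gaussian probe vectors; the truncation length and sample count are illustrative choices, not values used in the paper:

```python
import numpy as np

def logdet_estimate(J_g, n_terms=20, n_samples=100, rng=None):
    """Estimate log|det(I + J_g)| = sum_k (-1)^{k+1}/k tr(J_g^k)
    via Hutchinson's trace estimator (Eq. (13)). Requires the
    spectral norm of J_g to be below 1 for the series to converge."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = J_g.shape[0]
    total = 0.0
    for _ in range(n_samples):
        v = rng.standard_normal(d)   # probe with E[v] = 0, Cov(v) = I
        w = v
        for k in range(1, n_terms + 1):
            w = J_g @ w              # w = J_g^k v
            total += (-1) ** (k + 1) / k * (v @ w)
    return total / n_samples
```

In practice the estimator is applied to the Jacobians of each residual block; here it is shown on an explicit matrix for clarity.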
3.4 Backward Model for Graph Generation
As our model is invertible, the graph generation process is as depicted in Fig. 1. The adjacency tensors and the atomic feature tensors can be computed simultaneously during training, because their calculations are independent of each other. However, we must note that during generation, a valid adjacency tensor is required for the inverse computation of the ResGCN. For this reason, we execute the following two-step generation: first, we generate the adjacency tensor, and subsequently we generate the atomic feature tensor. The above-mentioned generation process is shown in the right half of Fig. 1. The experiment section shows that this two-step generation process can efficiently generate chemically valid molecular graphs.
1st step: We sample $z$ from the prior and split it into two parts, one for $z_A$ and the other for $z_X$. Next, we compute the inverse of $z_A$ with respect to the adjacency residual blocks by fixed-point iteration. Consequently, we obtain a probabilistic adjacency tensor $\tilde{A}$. Finally, we construct a discrete adjacency tensor $A$ from $\tilde{A}$ by taking node-wise and edge-wise argmax operations.
2nd step: We treat the discrete adjacency tensor $A$ obtained above as a fixed parameter and compute the inverse image of $z_X$ through the ResGCN using fixed-point iteration. In this way, we obtain a probabilistic feature matrix $\tilde{X}$. Next, we construct a discrete feature matrix $X$ by taking a node-wise argmax operation. Finally, we construct the molecule from the obtained adjacency tensor and feature matrix.
3.4.1 Inversion algorithm: fixed point iteration
For the residual layer $z^{(\ell+1)} = z^{(\ell)} + g_\ell(z^{(\ell)})$, it is generally not feasible to compute the inverse image analytically. However, we have configured the layer so that $\mathrm{Lip}(g_\ell) < 1$, as described above. As in iResNet (Behrmann et al., 2019), the inverse image of $z^{(\ell+1)}$ can be computed using the fixed-point iteration of Algorithm 1. By the Banach fixed-point theorem, this iterative method converges exponentially.
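The fixed-point inversion (Algorithm 1) can be sketched as follows. This is a NumPy illustration; the toy contraction $g$ is ours, standing in for a trained residual block:

```python
import numpy as np

def invert_residual_layer(y, g, n_iters=50):
    """Invert y = x + g(x) by the fixed-point iteration x <- y - g(x).
    Converges exponentially when g is contractive (Lip(g) < 1)."""
    x = y.copy()
    for _ in range(n_iters):
        x = y - g(x)
    return x

# a contractive map with Lipschitz constant 0.5
g = lambda x: 0.5 * np.tanh(x)
x_true = np.array([0.3, -1.2, 2.0])
y = x_true + g(x_true)
x_rec = invert_residual_layer(y, g)
assert np.allclose(x_rec, x_true, atol=1e-8)
```

With Lip(g) = 0.5, the reconstruction error shrinks by at least half per iteration, so 50 iterations recover the input to machine precision.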
3.4.2 Condition for Guaranteed Inversion
From Theorem 1, the upper bound of $\mathrm{Lip}(g)$ is determined by $L_\phi$ and $\|W\|_2$. In this work, we select the exponential linear unit (ELU) as the function $\phi$; ELU is a differentiable nonlinear function with $L_\phi = 1$. For $W$, the constraint $\|W\|_2 < 1$ can be satisfied by using spectral normalization (Miyato et al., 2018). The layer configured in this manner satisfies $\mathrm{Lip}(g) < 1$; in other words, the layer is a contraction map. Hence, the input can be obtained by fixed-point iteration.
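Spectral normalization estimates the largest singular value of $W$ by power iteration and rescales the weight by it. A minimal sketch follows; the iteration count is an illustrative choice (practical implementations, as in Miyato et al. (2018), reuse the running estimate across training steps):

```python
import numpy as np

def spectral_norm_power_iter(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration,
    the core of spectral normalization (Miyato et al., 2018)."""
    u = np.ones(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v = v / np.linalg.norm(v)
        u = W @ v
        u = u / np.linalg.norm(u)
    return u @ W @ v  # converged estimate of sigma_max(W)

W = np.array([[3.0, 0.0], [0.0, 1.0]])
sigma = spectral_norm_power_iter(W)
W_sn = W / sigma  # weight rescaled to spectral norm ~1
```

To enforce a strict contraction one would rescale to a target strictly below 1 rather than exactly 1.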
4 Experiments
4.1 Procedure
For our experiments, we use two datasets of molecules: QM9 (Ramakrishnan et al., 2014) and ZINC250k (Irwin et al., 2012). The QM9 dataset contains 134,000 molecules with four atom types, and ZINC250k is a subset of the ZINC database that contains 250,000 drug-like molecules with nine atom types. The maximum number of heavy atoms in a molecule is nine for QM9 and 38 for ZINC250k. As standard preprocessing, the molecules are first kekulized, and the hydrogen atoms are subsequently removed. The resulting molecules contain only single, double, or triple bonds.
We represent each molecule as an adjacency tensor $A$ and a one-hot feature matrix $X$. $N$ denotes the maximum number of atoms a molecule in each dataset can have. If a molecule has fewer than $N$ atoms, it is padded by adding virtual nodes to keep the dimensions of $A$ and $X$ identical across molecules. As the adjacency tensors of molecular graphs are sparse, we add virtual bonds, referred to as "no bond," between the atoms that do not have a bond. Thus, an adjacency tensor comprises adjacency matrices stacked together, where each adjacency matrix corresponds to the existence of a certain type of bond (single, double, triple, or virtual) between the atoms. The feature matrix represents the type of each atom (e.g., oxygen, fluorine, etc.). As described in Section 3.1, $A$ and $X$ are dequantized to $A'$ and $X'$.
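The preprocessing described above can be sketched as follows. This is a NumPy illustration with hypothetical helper names; the bond-type indices are illustrative:

```python
import numpy as np

def pad_graph(bonds, N, n_bond_types=3):
    """Build an (N, N, R) adjacency tensor with R = n_bond_types + 1:
    the last channel is the virtual 'no bond' type, and nodes beyond
    the molecule's atoms act as padding (virtual) nodes.
    `bonds` is a list of (i, j, bond_type) tuples."""
    A = np.zeros((N, N, n_bond_types + 1), dtype=np.int64)
    for i, j, t in bonds:
        A[i, j, t] = A[j, i, t] = 1
    # every pair without a real bond gets the 'no bond' channel
    no_bond = A[:, :, :n_bond_types].sum(axis=-1) == 0
    A[:, :, n_bond_types] = no_bond
    return A

# a 3-atom fragment with two single bonds, padded to N = 4
A = pad_graph([(0, 1, 0), (1, 2, 0)], N=4)
assert A[:, :, 0].sum() == 4   # two symmetric single bonds
assert A[0, 2, 3] == 1         # unbonded pair marked 'no bond'
```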
We use a standard Gaussian distribution $\mathcal{N}(0, I)$ as the prior distribution. The objective function (1) is maximized by the Adam optimizer (Kingma & Ba, 2015). The hyperparameters are chosen by Optuna (Akiba et al., 2019) for QM9 and ZINC250k. See the appendix for the selected hyperparameter values. To reduce the model size, we adopt node-wise weight sharing for QM9, and the low-rank approximation and multi-scale architecture proposed in (Dinh et al., 2017) for ZINC250k.

4.2 Invertibility Check
We first examine the reconstruction performance of the GRF against the number of fixed-point iterations by encoding and decoding 1,000 molecules sampled from QM9 and ZINC250k. According to Figure 1(b), the L2 reconstruction error converges after about 30 fixed-point iterations. The reconstructed molecules are identical to the original molecules after convergence.
4.3 Numerical Evaluation
Following (Kingma & Dhariwal, 2018; Madhawa et al., 2019), we sample 1,000 latent vectors from a temperature-truncated normal distribution and transform them into molecular graphs by the inverse operations. Different temperatures are selected for $z_A$ and $z_X$, because they are handled separately in our model. We compare the performance of the proposed model with those of the baseline models using the following metrics. Validity (V) is the ratio of chemically valid molecules to generated graphs. Novelty (N) is the ratio of molecules not included in the training set to generated valid molecules. Uniqueness (U) is the ratio of unique molecules to generated valid molecules. Reconstruction accuracy (R) is the ratio of molecules that are perfectly reconstructed by the model. This metric is not defined for GANs, as they do not have encoders.
We choose GraphNVP (Madhawa et al., 2019), Junction Tree VAE (JT-VAE) (Jin et al., 2018), and Regularizing VAE (RVAE) (Ma et al., 2018) as state-of-the-art baseline models. We also choose two additional VAE baselines, the grammar VAE (GVAE) (Kusner et al., 2017) and the character VAE (CVAE) (Gómez-Bombarelli et al., 2018), which learn SMILES (string) representations of molecules.
We present the numerical evaluation results on the QM9 and ZINC250k datasets in Table 1 (QM9) and Table 2 (ZINC250k), respectively. As expected, the GRF achieves a 100% reconstruction rate, which is enabled by the ResNet architecture with spectral normalization and fixed-point iteration. This has not been achieved by any of the VAE-based baselines, which impose stochastic behavior in the bottleneck layers. Moreover, this is achieved without incorporating chemical knowledge, on which some baselines rely (e.g., valency checks for chemical graphs in RVAE and GVAE, and the subgraph vocabulary in JT-VAE). This is preferable because additional validity checks are computationally demanding, and a prepared subgraph vocabulary limits the extrapolation capacity of the generative model. As our model does not incorporate domain-specific procedures, it can easily be extended to general graph structures.
It is remarkable that our GRF-based generative model achieves generation performance scores comparable to those of GraphNVP, with an order of magnitude fewer trainable parameters. These results indicate the efficient construction of our GRF in terms of parameterization, as well as the power and flexibility of residual connections compared with coupling flows based on simple affine transformations. Therefore, our goal of proposing a novel and strong invertible flow for molecular graph generation is successfully achieved by the development of the GRF. We discuss the number of parameters of the GRF using big-O notation in Section 4.4.
The experiments also reveal a limitation of the current formulation of the GRF. One notable limitation is the lower uniqueness compared with GraphNVP. By examining the generated molecules manually, we found that they contain many more straight-chain molecules than those of GraphNVP. We attribute this phenomenon to the difficulty of generating realistic molecules without explicit chemical knowledge or autoregressive constraints. We plan to tackle this issue in future work.
Table 1: Performance of generative models with respect to quality metrics and numbers of their parameters for the QM9 dataset.

Method    | % V  | % N  | % U  | % R   | # Params
GRF       | –    | –    | –    | 100.0 | 56,120
GraphNVP  | 90.1 | 54.0 | 97.3 | 100.0 | 6,145,831
RVAE      | 96.6 | 97.5 | –    | 61.8  | –
GVAE      | 60.2 | 80.9 | 9.3  | 96.0  | –
CVAE      | 10.3 | 90.0 | 67.5 | 3.6   | –
Table 2: Performance of generative models with respect to quality metrics and numbers of their parameters for the ZINC250k dataset. Results of GraphNVP and JT-VAE are recomputed following the hyperparameter settings in the original papers. Other baseline scores are borrowed from the original papers. Scores of GRF are averages over 5 runs. Separate temperatures are used for $z_A$ and $z_X$ for ZINC250k.

Method    | % V  | % N   | % U   | % R   | # Params
GRF       | –    | –     | –     | 100.0 | 3,234,552
GraphNVP  | 77.3 | 100.0 | 94.8  | 100.0 | 245,792,665
JT-VAE    | 99.8 | 100.0 | 100.0 | 76.7  | –
RVAE      | 34.9 | 100.0 | –     | 54.7  | –
GVAE      | 7.2  | 100.0 | 9.0   | 53.7  | –
CVAE      | 0.7  | 100.0 | 67.5  | 44.6  | –
4.4 Efficiency in terms of model size
As we observe in the previous section, our GRFbased generative models are compact and memoryefficient in terms of the number of trainable parameters, compared to the existing GraphNVP flow model. In this section we discuss this issue in a more formal manner.
Let $L$ be the number of layers, $R$ be the number of bond types, and $M$ be the number of atom types. For GraphNVP, we need … and … parameters to construct the adjacency coupling layers and the atom coupling layers, respectively; thus, the whole GraphNVP requires … parameters. By contrast, our model only requires … and … parameters for resGraphLinear and resGCN, respectively; therefore, the whole GRF model requires … parameters. In most molecular graph generation settings, $N$ is large and the … term is dominant.
Our GRF for ZINC250k uses linear layers to handle adjacency matrices, but the number of parameters is substantially reduced by the low-rank approximation (introduced in Sec. 4.1). Let … be the approximated rank of each linear layer; then the whole GRF requires only … parameters. Notably, GraphLinear is equal to the low-rank approximation when ….
Our model’s efficiency in model size is even more important when generating large molecules. Suppose we want to generate molecules with … heavy atoms with a batch size of 64. Extrapolating from the memory usage of the GRF for ZINC250k, the GRF will consume 21 GB, whereas GraphNVP will consume as much as 2,100 GB. Since most GPUs currently in use (e.g., the NVIDIA Tesla V100) are equipped with 16–32 GB of memory, GraphNVP cannot process a batch on a single GPU, or batch normalization becomes unstable with small batches. On the other hand, our model will scale to larger graphs owing to the reduced number of parameters.
4.5 Smoothness of the Learned Latent Space
As a final experiment, we present a visualization of the learned latent space. First, we randomly choose 100 molecules from the training set and encode them into latent representations using the trained model. We compute the first and second principal components of the latent space by PCA, and project the encoded molecules onto the plane spanned by these two principal component vectors. Then we choose another random molecule, encode it, and project it onto the aforementioned principal plane. Finally, we decode the latent points on the principal plane, distributed in a grid-mesh pattern centered at this projection, and visualize them in Fig. 3. Figure 3 indicates that the latent spaces learned from both the QM9 (panel (a)) and ZINC250k (panel (b)) datasets are smooth, in the sense that the molecules gradually change along the two axes.
The visualized smoothness appears similar to that of VAE-based models, but differs in that our GRF is a bijective function: the data points and the latent points correspond to each other in a one-to-one manner. In contrast, to generate data points with VAE-based methods, one must decode the same latent point several times and select the most common molecule. Our model is efficient because it can generate a data point in one shot. Additionally, a smooth latent space and bijectivity are crucial in actual use cases: our model enables molecular graph generation through querying, i.e., encoding a molecule with the desired attributes and decoding the perturbed latents to obtain drug candidates with similar attributes.
5 Conclusion
In this paper, we proposed the Graph Residual Flow (GRF), an invertible residual flow for molecular graph generation. Our model exploits the expressive power of the ResNet architecture. The invertibility of our model is guaranteed by only a slight modification, i.e., the addition of spectral normalization to each layer. Owing to this feature, our model can generate valid molecules on both the QM9 and ZINC250k datasets. The reconstruction accuracy is inherently 100%, and our model is more efficient in terms of model size than GraphNVP, a previous flow model for graphs. In addition, the learned latent space of the GRF is sufficiently smooth to enable the generation of molecules similar to a query molecule with known chemical properties.
Future work includes the design of adjacency residual layers that are invariant to node permutation, and property optimization with the GRF.
References
 Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pp. 2623–2631. ACM, 2019.

 Behrmann et al. (2019) Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In Proceedings of the International Conference on Machine Learning (ICML), 2019.
 Chen et al. (2019) Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
 De Cao & Kipf (2018) Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
 Dinh et al. (2015) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. URL https://arxiv.org/abs/1605.08803.
 Gómez-Bombarelli et al. (2018) Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Hall (2015) Brian Hall. Lie groups, Lie algebras, and representations: an elementary introduction, volume 222. Springer, 2015.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 Hutchinson (1990) Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.
 Irwin et al. (2012) John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
 Jin et al. (2018) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2323–2332, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/jin18a.html.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 Kingma & Welling (2014) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014. URL https://arxiv.org/abs/1312.6114.
 Kingma & Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.
 Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semisupervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
 Kobyzev et al. (2019) Ivan Kobyzev, Simon Prince, and Marcus A Brubaker. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257, 2019.
 Kusner et al. (2017) Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1945–1954. PMLR, 2017.
 Liu et al. (2019) Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. Graph normalizing flows. arXiv preprint arXiv:1905.13177, 2019.
 Liu et al. (2018) Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, pp. 7806–7815, 2018.
 Ma et al. (2018) Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, pp. 7113–7124, 2018.
 Madhawa et al. (2019) Kaushalya Madhawa, Katushiko Ishiguro, Kosuke Nakago, and Motoki Abe. Graphnvp: An invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600, 2019.
 Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.
 Oono & Suzuki (2019) Kenta Oono and Taiji Suzuki. On asymptotic behaviors of graph CNNs from dynamical systems perspective, 2019.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR), 2015.
 Ramakrishnan et al. (2014) Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data, 1:140022, 2014.
 Schlichtkrull et al. (2017) Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
 Song et al. (2019) Yang Song, Chenlin Meng, and Stefano Ermon. Mintnet: Building invertible neural networks with masked convolutions. arXiv preprint arXiv:1907.07945, 2019.
 Withers & Nadarajah (2010) Christopher S Withers and Saralees Nadarajah. log det A = tr log A. International Journal of Mathematical Education in Science and Technology, 41(8):1121–1124, 2010.
 Wu et al. (2019) Felix Wu, Tianyi Zhang, Amauri Jr. Holanda de Souza, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
 You et al. (2018) Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goaldirected molecular graph generation. In Advances in Neural Information Processing Systems, pp. 6412–6422, 2018.
Appendix A Proof of theorem
Lemma 1.
Proof.
∎
Lemma 2.
Proof.
The augmented normalized Laplacian is defined as $\tilde{\Delta} = I - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ is the augmented adjacency matrix and $\tilde{D}$ is its degree matrix. Like the normal graph Laplacian, the $i$-th eigenvalue $\tilde{\lambda}_i$ of $\tilde{\Delta}$ satisfies $0 \le \tilde{\lambda}_i < 2$ (Oono & Suzuki, 2019). Here, for the eigenvector $v_i$ corresponding to $\tilde{\lambda}_i$, the $i$-th eigenvalue of $\tilde{\Delta}$:
$(I - \tilde{\Delta}) v_i = (1 - \tilde{\lambda}_i) v_i.$
As $0 \le \tilde{\lambda}_i < 2$, $|1 - \tilde{\lambda}_i| \le 1$ follows. Here, the operator norm $\|I - \tilde{\Delta}\|_{op}$ is bounded by the maximum singular value. As $I - \tilde{\Delta}$ is a symmetric matrix by construction, its maximum singular value equals the largest absolute value among its eigenvalues. From these conditions, $\|I - \tilde{\Delta}\|_{op} \le 1$. ∎
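The bound in Lemma 2 can also be checked numerically. The following sketch is our own illustration (not from the paper): it builds the augmented normalized Laplacian of a small path graph and verifies that its eigenvalues lie in $[0, 2)$ and that $\|I - \tilde{\Delta}\|_{op} \le 1$.

```python
import numpy as np

# Adjacency matrix of a small example graph (4-node path); illustrative only.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

N = A.shape[0]
A_tilde = A + np.eye(N)                     # augmented adjacency: A + I
d_tilde = A_tilde.sum(axis=1)               # augmented degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))

# Augmented normalized Laplacian: Delta = I - D^{-1/2} (A + I) D^{-1/2}
Delta = np.eye(N) - D_inv_sqrt @ A_tilde @ D_inv_sqrt

eigvals = np.linalg.eigvalsh(Delta)
print("eigenvalues:", eigvals)              # all lie in [0, 2)

# I - Delta is symmetric, so its operator (spectral) norm equals the
# largest absolute eigenvalue, which the lemma bounds by 1.
op_norm = np.linalg.norm(np.eye(N) - Delta, ord=2)
print("||I - Delta||_op =", op_norm)
```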
Theorem 1.
Proof.
∎
Appendix B Model Hyperparameters
We use a single-scale architecture for the QM9 dataset, while we use a multi-scale architecture (Dinh et al., 2017) for the ZINC250k dataset to scale to 38 heavy atoms. Other hyperparameters are shown in Table 3. We find that a spectral normalization factor of 0.9 is sufficient for numerical invertibility.
Dataset  GCN blocks  GCN layers  MLP blocks  MLP layers  Batch size  Learning rate  Epochs

QM9  1  1  32  25  2048  1e-3  70

ZINC250k  3  3  3  3  256  1e-4  70
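The spectral normalization used above can be sketched in a few lines. This is our own minimal NumPy illustration of the idea (a power-iteration estimate of the top singular value, as in Miyato et al. (2018), followed by rescaling to the target factor); the function name and interface are ours, not the paper's implementation.

```python
import numpy as np

def spectral_normalize(W, factor=0.9, n_iter=30):
    """Rescale W so its largest singular value is at most `factor`.

    Estimates the top singular value by power iteration, then scales
    the whole matrix when the estimate exceeds `factor`.
    """
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                     # estimated top singular value
    if sigma > factor:
        W = W * (factor / sigma)
    return W

W = np.random.default_rng(1).normal(size=(16, 16))
W_sn = spectral_normalize(W, factor=0.9)
# Top singular value after rescaling; close to 0.9 when W was larger.
print(np.linalg.svd(W_sn, compute_uv=False)[0])
```

Keeping the spectral norm strictly below 1 (here, at 0.9) bounds the Lipschitz constant of each residual branch, which is what guarantees the fixed-point inversion converges numerically.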