Graph Residual Flow for Molecular Graph Generation

09/30/2019 · Shion Honda, et al. · Preferred Infrastructure / The University of Tokyo

Statistical generative models for molecular graphs are attracting attention from researchers in the fields of bio- and chemo-informatics. Among these models, invertible flow-based approaches are not yet fully explored. In this paper, we propose a powerful invertible flow for molecular graphs, called the graph residual flow (GRF). The GRF is based on residual flows, which are known for more flexible and complex non-linear mappings than traditional coupling flows. We theoretically derive non-trivial conditions under which the GRF is invertible, and present a way of keeping the entire flow invertible throughout training and sampling. Experimental results show that a generative model based on the proposed GRF achieves comparable generation performance with a much smaller number of trainable parameters than the existing flow-based model.


1 Introduction

We propose a deep generative model for molecular graphs based on invertible functions. In particular, we introduce an invertible function tailored to graph-structured data, which allows for flexible mappings with fewer parameters than previous invertible models for graphs.

Molecular graph generation is one of the hot topics in graph analysis, with potential for important applications such as in silico new material discovery and drug candidate screening. Earlier generative models for molecules dealt with string representations called SMILES (e.g., Kusner et al. (2017); Gómez-Bombarelli et al. (2018)), which do not consider graph topology. Recent models such as (Jin et al., 2018; You et al., 2018; De Cao & Kipf, 2018; Madhawa et al., 2019) are able to handle graphs directly. Researchers are investigating this topic using sophisticated statistical models, such as variational autoencoders (VAEs) (Kingma & Welling, 2014), adversarial loss-based models such as generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015), and invertible flows (Kobyzev et al., 2019), and have achieved desirable performances.

The decoders of these graph generation models generate discrete graph-structured data from a (typically continuous) representation of a data sample, which is modeled by the aforementioned statistical models. In general, it is difficult to design a decoder that balances the efficacy of graph generation against the simplicity of implementation and training. For example, MolGAN (De Cao & Kipf, 2018) has a relatively simple decoder but suffers from generating numerous duplicated graph samples. State-of-the-art VAE-based models such as (Jin et al., 2018; Liu et al., 2018) have good generation performance, but their decoding schemes are highly complicated and require careful training. On the contrary, invertible flow-based statistical models (Dinh et al., 2015; Kobyzev et al., 2019) do not require training for their decoders, because the decoders are simply the inverse mappings of the encoders, and they are known for good generation performance in image generation (Dinh et al., 2017; Kingma & Dhariwal, 2018). Liu et al. (2019) propose an invertible-flow-based graph generation model; however, their generative model is not invertible because its decoder for the graph structure is not built upon invertible flows. GraphNVP by Madhawa et al. (2019) is the seminal fully invertible-flow approach for graph generation, which successfully combines invertible maps with generic graph convolutional networks (GCNs, e.g., Kipf & Welling (2017); Schlichtkrull et al. (2017)).

However, the coupling flow (Kobyzev et al., 2019) used in GraphNVP has a serious drawback when applied to sparse graphs such as the molecular graphs we are interested in. The coupling flow requires a disjoint partitioning of the latent representation of the data (graph) in each layer. We need to design this partitioning carefully so that all the attributes of a latent representation are well mixed through stacks of mapping layers. However, molecular graphs are highly sparse in general: the degree of each atom node is at most four (valency), and only a few kinds of atoms comprise the majority of molecules (low diversity). Madhawa et al. (2019) argued that, owing to this sparsity, only a specific form of partitioning leads to desirable performance: in each mapping layer, the representation of only one node is updated and all the other nodes are kept intact. In other words, a graph with 100 nodes requires at least 100 layers, and even with 100 layers, only one affine mapping is applied to each attribute of the latent representation. Therefore, the complexity of the mappings of GraphNVP is extremely low in contrast to the number of stacked layers. We assume this is why the generation performance of GraphNVP reported in the paper is less impressive than that of other state-of-the-art models (Jin et al., 2018; Liu et al., 2018).

In this paper, we propose a new graph flow, called the graph residual flow (GRF): a novel combination of a generic GCN and recently proposed residual flows (Behrmann et al., 2019; Song et al., 2019; Chen et al., 2019). The GRF does not require partitioning of a latent vector and can update all the node attributes in each layer. Thus, a 100-layer flow model can apply the (non-linear) mappings 100 times to each attribute of the latent vector of a 100-node graph. We derive a theoretical guarantee of the invertibility of the GRF and introduce constraints on the GRF parameters based on rigorous mathematical calculations. Through experiments on the most popular graph generation datasets, we observe that a generative model based on the proposed GRF achieves generation performance comparable to GraphNVP (Madhawa et al., 2019), but with far fewer trainable parameters.

To summarize, our contributions in this paper are as follows:

  • We propose the graph residual flow (GRF): a novel residual flow model for graph generation that is compatible with generic GCNs.

  • We prove conditions under which the GRFs are invertible, and present how to keep the entire network invertible throughout training and sampling.

  • We demonstrate the efficacy of GRF-based models in generating molecular graphs; in other words, we show that a generative model based on the GRF has far fewer trainable parameters than GraphNVP while maintaining comparable generation performance.

2 Background

2.1 GraphNVP

We first describe the GraphNVP (Madhawa et al., 2019), the first fully invertible model for chemical graph generation, as a baseline. We simultaneously introduce the necessary notations for graph generative models.

We use the notation $G = (A, X)$ to represent a graph comprising an adjacency tensor $A$ and a feature matrix $X$. Let $N$ be the number of nodes in the graph, $M$ be the number of types of nodes, and $R$ be the number of types of edges. Then, $A \in \{0, 1\}^{N \times N \times R}$ and $X \in \{0, 1\}^{N \times M}$. In the case of molecular graphs, $G = (A, X)$ represents a molecule with $R$ types of bonds (single, double, etc.) and $M$ types of atoms (e.g., oxygen, carbon, etc.). Our objective is to train an invertible model $f_\theta$ with parameters $\theta$ that maps $G$ into a latent point $z = f_\theta(G)$. We describe $f_\theta$ as a normalizing flow composed of multiple invertible functions.

Let $z$ be a latent vector drawn from a known prior distribution $p_z$ (e.g., Gaussian): $z \sim p_z(z)$. After applying a variable transformation, the log probability of a given graph $G$ can be calculated as:

$$\log p_G(G) = \log p_z(z) + \log \left| \det\!\left( \frac{\partial z}{\partial G} \right) \right| \qquad (1)$$

where $\frac{\partial z}{\partial G}$ is the Jacobian of $f_\theta$ at $G$.

In (Madhawa et al., 2019), $f_\theta$ is modeled by two types of invertible non-volume preserving (NVP) mappings (Dinh et al., 2017): the first type transforms the adjacency tensor $A$, and the second transforms the node attribute matrix $X$.

Let us divide the hidden variable $z$ into two parts, $z = [z_X, z_A]$; the former is derived from invertible mappings of $X$, and the latter from invertible mappings of $A$. For the mapping of the feature matrix $X$, GraphNVP provides a node feature coupling:

$$z_X^{(\ell)}[\ell, :] \leftarrow z_X^{(\ell-1)}[\ell, :] \odot \exp\!\big(s(z_X^{(\ell-1)}[\bar\ell, :], A)\big) + t(z_X^{(\ell-1)}[\bar\ell, :], A) \qquad (2)$$

where $\ell$ indicates the layer of the coupling, the functions $s$ and $t$ stand for scale and translation operations, respectively, and $\odot$ denotes element-wise multiplication. We use $[\bar\ell, :]$ to denote a latent representation matrix excluding the $\ell$-th row (node). The rest of the rows of the feature matrix remain the same:

$$z_X^{(\ell)}[\bar\ell, :] \leftarrow z_X^{(\ell-1)}[\bar\ell, :] \qquad (3)$$

$s$ and $t$ are modeled by a generic GCN, which requires the adjacency information of the nodes, $A$, for better interactions between the nodes.

For the mapping of the adjacency tensor, GraphNVP provides an adjacency coupling:

$$z_A^{(\ell)}[\ell, :, :] \leftarrow z_A^{(\ell-1)}[\ell, :, :] \odot \exp\!\big(s(z_A^{(\ell-1)}[\bar\ell, :, :])\big) + t(z_A^{(\ell-1)}[\bar\ell, :, :]) \qquad (4)$$

The rest of the rows remain as they are:

$$z_A^{(\ell)}[\bar\ell, :, :] \leftarrow z_A^{(\ell-1)}[\bar\ell, :, :] \qquad (5)$$

For the adjacency coupling, simple multi-layer perceptrons (MLPs) are employed for $s$ and $t$.

The abovementioned formulations map only the variables related to one node in each $\ell$-th layer (Eqs. (2), (4)), and the remaining nodes are kept intact (Eqs. (3), (5)); i.e., the partitioning of the variables always occurs along the first (node) axis of the tensors. This limits the parameterization of the scaling and translation operations, resulting in reduced representation power of the model.

In the original paper, the authors mention: “masking (switching) … w.r.t the node axis performs the best. … We can easily formulate … the slice indexing based on the non-node axis … results in dramatically worse performance due to the sparsity of molecular graph.” Here, sparsity manifests in two ways: the scarcity of non-carbon atoms in organic chemicals, and the low degree of atom nodes (because of valency).

2.2 Invertible residual blocks

One of the major drawbacks of the partition-based coupling flow is that it covers a fairly limited family of mappings; in exchange, the coupling flow offers computationally cheap, analytic inversion. A series of recent invertible models (Behrmann et al., 2019; Song et al., 2019; Chen et al., 2019) propose a different approach to invertible mappings, called the residual flow (Kobyzev et al., 2019). They formulate ResNets (He et al., 2016), the gold standard for image recognition, as invertible mappings. The general idea is as follows.

Our objective is to develop an invertible residual layer for a vector $x$:

$$x^{(\ell+1)} = x^{(\ell)} + g_\ell(x^{(\ell)}) \qquad (6)$$

where $x^{(\ell)}$ is the representation vector at the $\ell$-th layer, and $g_\ell$ is a residual block. If we correctly constrain $g_\ell$, then we can assure the invertibility of the above-mentioned residual layer.

i-ResNet (Behrmann et al., 2019) presents a constraint on the Lipschitz constant of $g_\ell$. MintNet (Song et al., 2019) limits the shape of the residual block and derives non-singularity requirements on the Jacobian of the (limited) residual block.

Notably, the (invertible) residual connection (Eq. (6)) does not assume a partition of the variables into “intact” and “affine-map” parts. This means that each invertible residual layer updates all the variables at once.

In both the aforementioned papers, a local convolutional network architecture (He et al., 2016) is proposed for the residual block over image tensor inputs, and is validated experimentally on image generation/reconstruction. For example, in i-ResNet, the residual block is defined as:

$$g = W_3 \circ \phi \circ W_2 \circ \phi \circ W_1 \qquad (7)$$

where $\phi$ denotes a contractive nonlinear function such as ReLU or ELU, and the $W_i$ are (spatially) local convolutional layers (i.e., aggregating the neighboring pixels). In this case, the spectral norms of all $W_i$ are constrained to be less than unity for the Lipschitz condition.

3 Invertible Graph Generation Model with Graph Residual Flow (GRF)

Figure 1: Overall architecture of the proposed generative model. During a forward path (encoding), discrete graph components are input for dequantization. We apply the proposed GRFs to the dequantized components and obtain the latent representations $z_A$ and $z_X$. During a backward path (graph generation), we first apply the inverse of the ResBlock to $z_A$, yielding the noise-overlapped $A'$. The denoised (recovered) $A$ and $z_X$ are the input arguments for the inverted ResGCN, recovering the noise-overlapped $X'$.

We observe that the limitations of GraphNVP cannot be avoided as long as we use partition-based coupling flows for sparse molecular graphs. Therefore, we aim to realize a different type of invertible coupling layer that does not depend on variable partitioning (for easier inversion and likelihood computation). For this, we propose a new molecular graph generation model based on a more powerful and efficient flow: the Graph Residual Flow (GRF), our proposed invertible flow for graphs.

3.1 Setup

The overall setup is similar to that of the original GraphNVP. We use the notation $G = (A, X)$ to represent a graph comprising an adjacency tensor $A$ and a feature matrix $X$. Each tensor is mapped to a latent representation through invertible functions. Let $z_A$ be the latent representation of the adjacency tensor, and $p_{z_A}$ be its prior. Similarly, let $z_X$ be the latent representation of the feature matrix, and $p_{z_X}$ be its prior. We assume that both priors are multivariate normal distributions.

As $A$ and $X$ are originally binary, we cannot directly apply the change-of-variables formula. The widely used workaround (Dinh et al., 2017; Kingma & Dhariwal, 2018; Madhawa et al., 2019) is dequantization: adding noise drawn from a continuous distribution and regarding the tensors as continuous. The dequantized graph, denoted as $G' = (A', X')$, is used as the input in Eq. (1):

$$A' = A + c\,u, \quad u \sim U[0, 1)^{N \times N \times R} \qquad (8)$$

$$X' = X + c\,u, \quad u \sim U[0, 1)^{N \times M} \qquad (9)$$

where $c \in (0, 1)$ is a scaling hyperparameter. We adopted $c = 0.9$ for our experiments.

Note that the original discrete inputs $A$ and $X$ can be recovered by simply applying the floor operation to each continuous value in $A'$ and $X'$. Hereafter, all transformations are performed on the dequantized inputs $A'$ and $X'$.
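A minimal sketch of this dequantize/quantize round trip (with the scaling hyperparameter named c here) could look as follows:

```python
import torch

def dequantize(x, c=0.9):
    # x is a binary {0,1} tensor; adding uniform noise scaled by c < 1
    # keeps each value inside [x, x + c), so no information is lost.
    return x + c * torch.rand_like(x)

def quantize(x_cont):
    # Exact inverse of dequantize for the discrete part: floor() drops
    # the added noise and recovers the original 0/1 entries.
    return torch.floor(x_cont)
```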

3.2 Forward model

We can instantly formulate a naive model by ignoring the graph structure behind $G$ and regarding $A$ and $X$ as plain tensors (multi-dimensional arrays). Namely, a tensor entry $A[i, j, k]$ is treated as a neighbor of the entries at adjacent indices, regardless of the true adjacency of nodes $i$ and $j$ or of their features; a similar argument holds for $X$. In such a case, we simply apply the invertible residual flow to the tensors. Let $z_A^{(0)} = A'$ and $z_X^{(0)} = X'$.

We formulate the invertible graph generation model based on GRFs. The fundamental idea is to replace the two coupling flows in GraphNVP with the new GRFs. A GRF comprises two sub-flows: a node feature residual flow and an adjacency residual flow.

For the feature matrix, we formulate a node feature residual flow for layer $\ell$ as:

$$z_X^{(\ell+1)} = z_X^{(\ell)} + g_{X,\ell}(z_X^{(\ell)}, A') \qquad (10)$$

where $g_{X,\ell}$ is the residual block for the feature matrix at layer $\ell$. Similar to Eq. (2), we condition on the adjacency tensor.

For the mapping of the adjacency tensor, we have a similar adjacency residual flow:

$$z_A^{(\ell+1)} = z_A^{(\ell)} + g_{A,\ell}(z_A^{(\ell)}) \qquad (11)$$

where $g_{A,\ell}$ is the residual block for the adjacency tensor at layer $\ell$.

Note that there is no slice indexing of the tensors $z_X$ and $z_A$ in Eqs. (10, 11). Therefore, every entry of the tensors is subject to update in every layer, in notable contrast to Eqs. (2, 4).

3.3 Residual Block Choices for GRFs

One of the technical contributions of this paper is the development of residual blocks for GRFs. The convolutional architecture of ResNet is reminiscent of GCNs (e.g., (Kipf & Welling, 2017)), suggesting a possible application to graph input data. Therefore, we extend the invertible residual blocks of (Behrmann et al., 2019; Song et al., 2019) to the feature matrix and the adjacency tensor, conditioned on the graph structure.

The key issue here is the definition of the neighborhood for local convolutions in the residual block (Eq. (7)).

The simplest approach to constructing a residual flow model is to use a linear layer as the residual block. In that case, we must flatten the adjacency tensor and the feature matrix into single vectors; however, we then need a large weight matrix so as not to reduce the dimensionality, and naive vectorization destroys the local structure of the graph. To address these issues, we propose two types of residual blocks, $g_A$ and $g_X$, for the adjacency tensor and the feature matrix, respectively.

In this paper, we propose a residual flow based on GCNs (e.g., (Kipf & Welling, 2017; Wu et al., 2019)) for graph-structured data.

We focus on modeling the residual block for the node feature matrix. Our approach is to replace the usual convolutional layers in Eq. (7) with graph convolutions:

$$g(X', A) = \phi\!\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X' W\big) \qquad (12)$$

Here, $\tilde{A} = A + I$ and $\tilde{D}$ are the augmented adjacency matrix and its degree matrix, respectively, $X'$ is the matrix representation of the node features, and $W$ is a learnable matrix parameter of the linear layer. For $g$ defined in this way, the following theorem holds.

Theorem 1.

For the residual block $g$ defined in Eq. (12), $\mathrm{Lip}(g) \le L_\phi \lVert W \rVert_2$. Here, $L_\phi$ is the Lipschitz constant of the nonlinear function $\phi$. The proof of this theorem is provided in the appendix.
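A minimal PyTorch sketch of one such graph-convolutional residual block is given below; it is our illustration of Eq. (12), not the authors' released code, and the helper for the normalized adjacency is an assumption on our part.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def normalized_adjacency(adj):
    # adj: (N, N) connectivity with bond channels collapsed.
    # \hat{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}; by Lemma 2
    # (appendix), its spectral norm is at most 1.
    a_tilde = adj + torch.eye(adj.size(-1), device=adj.device)
    d_inv_sqrt = a_tilde.sum(-1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(-1) * a_tilde * d_inv_sqrt.unsqueeze(-2)

class ResGCNBlock(nn.Module):
    """g(X, A) = phi(A_hat X W); Theorem 1 bounds Lip(g) by L_phi * ||W||_2."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Spectral normalization keeps ||W||_2 bounded, making g contractive.
        self.linear = spectral_norm(nn.Linear(in_dim, out_dim, bias=False))
        self.phi = nn.ELU()

    def forward(self, x, adj):
        return self.phi(self.linear(normalized_adjacency(adj) @ x))
```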

The Lipschitz constraint not only enables the inverse operation (see Section 3.4) but also facilitates the computation of the log-determinant of the Jacobian in Eq. (1), as in (Behrmann et al., 2019). In other words, the log-determinant of the Jacobian can be rewritten as a matrix trace (Withers & Nadarajah, 2010), and the trace can be computed through power-series truncation and stochastic approximation (Hutchinson's trick) (Hall, 2015; Hutchinson, 1990). Incorporating these tricks, the log-determinant can be obtained by the following equation:

$$\log \left| \det\!\left( I + \frac{\partial g}{\partial z} \right) \right| = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} \, \mathbb{E}_{p(v)}\!\left[ v^{\top} \left( \frac{\partial g}{\partial z} \right)^{k} v \right] \qquad (13)$$

where $\frac{\partial g}{\partial z}$ denotes the Jacobian matrix of the residual block and $p(v)$ is a probability distribution that satisfies $\mathbb{E}[v] = 0$ and $\mathrm{Cov}(v) = I$.
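The following sketch (our own, PyTorch-based) shows how Eq. (13) can be estimated in practice: the power series is truncated at n_terms, and each trace is estimated with a single Hutchinson probe vector via vector-Jacobian products.

```python
import torch

def log_det_estimator(g, x, n_terms=10):
    # Truncated power-series estimate of log|det(I + dg/dx)| using
    # Hutchinson's trick: tr(J^k) ~ v^T J^k v with E[v]=0, Cov(v)=I.
    x = x.detach().requires_grad_(True)
    y = g(x)
    v = torch.randn_like(x)  # probe vector
    w = v
    log_det = 0.0
    for k in range(1, n_terms + 1):
        # Vector-Jacobian product advances w <- w^T J, so (w * v).sum()
        # approximates tr(J^k).
        (w,) = torch.autograd.grad(y, x, grad_outputs=w, retain_graph=True)
        log_det = log_det + (-1) ** (k + 1) / k * (w * v).sum()
    return log_det
```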

3.4 Backward Model for Graph Generation

As our model is invertible, the graph generation process is as depicted in Fig. 1. The adjacency tensors and the atomic feature tensors can be computed simultaneously during training, because their calculations are independent of each other. During generation, however, a valid adjacency tensor is required for the inverse computation of the ResGCN. For this reason, we execute the following two-step generation: first, we generate the adjacency tensor, and subsequently we generate the atomic feature tensor. This generation process is shown in the right half of Fig. 1. The experiment section shows that this two-step generation process can efficiently generate chemically valid molecular graphs.

1st step: We sample $z$ from the prior and split it into two parts, one for $z_A$ and the other for $z_X$. Next, we compute the inverse of $z_A$ w.r.t. the residual block by fixed-point iteration. Consequently, we obtain a probabilistic (dequantized) adjacency tensor $A'$. Finally, we construct a discrete adjacency tensor $A$ from $A'$ by taking node-wise and edge-wise argmax operations.

2nd step: We consider the discrete adjacency tensor $A$ obtained above as a fixed parameter and compute the inverse image of $z_X$ under the ResGCN using fixed-point iteration. In this way, we obtain the probabilistic feature matrix $X'$. Next, we construct a discrete feature matrix $X$ by taking a node-wise argmax operation. Finally, we construct the molecule from the obtained adjacency tensor and feature matrix.

3.4.1 Inversion algorithm: fixed point iteration

For the residual layer $x^{(\ell+1)} = x^{(\ell)} + g_\ell(x^{(\ell)})$, it is generally not feasible to compute the inverse image analytically. However, we have configured the layer to satisfy $\mathrm{Lip}(g_\ell) < 1$ as described above. As in i-ResNet (Behrmann et al., 2019), the inverse image can be computed using the fixed-point iteration of Algorithm 1. By the Banach fixed-point theorem, this iterative method converges exponentially.

  Input: output $y$ of the residual layer, contractive residual block $g$, number of iterations $T$
  Output: inverse of $y$ w.r.t. the residual layer
  $x^{(0)} := y$
  for $t = 0, \dots, T - 1$ do
     $x^{(t+1)} := y - g(x^{(t)})$
  end for
  return $x^{(T)}$
Algorithm 1 Inverse of a residual layer via fixed-point iteration.
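In PyTorch-like code, Algorithm 1 is a direct transcription (a sketch, assuming g is the contractive residual block):

```python
import torch

@torch.no_grad()
def invert_residual_layer(y, g, n_iters=100):
    # Solves y = x + g(x) for x by iterating x <- y - g(x).
    # Because Lip(g) < 1, the Banach fixed-point theorem guarantees
    # geometric convergence to the unique inverse.
    x = y.clone()
    for _ in range(n_iters):
        x = y - g(x)
    return x
```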

3.4.2 Condition for Guaranteed Inversion

From Theorem 1, the upper bound of $\mathrm{Lip}(g)$ is determined by $L_\phi$ and $\lVert W \rVert_2$. In this work, we select the exponential linear unit (ELU) as the function $\phi$: it is nonlinear and differentiable, and by definition $L_\phi \le 1$. For $W$, the constraint can be satisfied by using spectral normalization (Miyato et al., 2018). A layer configured in this manner satisfies $\mathrm{Lip}(g) < 1$; in other words, the layer is a contraction map, and its input can be recovered by fixed-point iteration.
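As a sanity check on this contraction property, a simple Monte-Carlo lower bound on the Lipschitz constant can be estimated empirically (our own diagnostic sketch, not part of the training procedure):

```python
import torch

def estimate_lipschitz(g, dim, n_pairs=1000):
    # Lower-bounds Lip(g) by max ||g(a) - g(b)|| / ||a - b|| over random
    # input pairs; for a correctly constrained block this stays below 1.
    a, b = torch.randn(n_pairs, dim), torch.randn(n_pairs, dim)
    num = (g(a) - g(b)).norm(dim=1)
    den = (a - b).norm(dim=1)
    return (num / den).max().item()
```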

4 Experiments

4.1 Procedure

For our experiments, we use two datasets of molecules: QM9 (Ramakrishnan et al., 2014) and ZINC-250k (Irwin et al., 2012). The QM9 dataset contains 134,000 molecules with four atom types, and ZINC-250k is a subset of the ZINC database that contains 250,000 drug-like molecules with nine atom types. The maximum number of heavy atoms in a molecule is nine for QM9 and 38 for ZINC-250k. As standard preprocessing, the molecules are first kekulized, and the hydrogen atoms are subsequently removed. The resulting molecules contain only single, double, or triple bonds.

We represent each molecule as an adjacency tensor $A \in \{0, 1\}^{N \times N \times R}$ and a one-hot feature matrix $X \in \{0, 1\}^{N \times M}$, where $N$ denotes the maximum number of atoms a molecule in each dataset can have. If a molecule has fewer than $N$ atoms, it is padded with virtual nodes to keep the dimensions of $A$ and $X$ identical across molecules. As the adjacency tensors of molecular graphs are sparse, we add virtual bonds, referred to as "no bond," between the atoms that do not have a bond. Thus, an adjacency tensor comprises $R$ adjacency matrices stacked together, each corresponding to the existence of a certain type of bond (single, double, triple, or virtual) between the atoms. The feature matrix represents the type of each atom (e.g., oxygen, fluorine, etc.). As described in Section 3.1, $A$ and $X$ are dequantized to $A'$ and $X'$.
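For illustration, a sketch of this encoding (with hypothetical inputs: a list of atom-type indices and a bond list) might look like the following; the paper's actual preprocessing code is not shown here.

```python
import numpy as np

def to_graph_tensors(atom_types, bonds, n_max, n_atom_types, n_bond_types=4):
    # atom_types: type index per atom; bonds: tuples (i, j, t) with t in
    # {0: single, 1: double, 2: triple}. The last bond channel is the
    # virtual "no bond"; rows beyond len(atom_types) are virtual nodes.
    x = np.zeros((n_max, n_atom_types), dtype=np.float32)
    for i, t in enumerate(atom_types):
        x[i, t] = 1.0
    a = np.zeros((n_max, n_max, n_bond_types), dtype=np.float32)
    a[:, :, -1] = 1.0  # start with "no bond" everywhere
    for i, j, t in bonds:
        a[i, j, -1] = a[j, i, -1] = 0.0
        a[i, j, t] = a[j, i, t] = 1.0
    return a, x
```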

We use a standard Gaussian distribution $\mathcal{N}(0, I)$ as the prior $p_z$. The objective function (1) is maximized by the Adam optimizer (Kingma & Ba, 2015). The hyperparameters are chosen by Optuna (Akiba et al., 2019) for QM9 and ZINC-250k; see the appendix for the selected values. To reduce the model size, we adopt node-wise weight sharing for QM9, and low-rank approximation and the multi-scale architecture proposed in (Dinh et al., 2017) for ZINC-250k.

4.2 Invertibility Check

We first examine the reconstruction performance of the GRF against the number of fixed-point iterations by encoding and decoding 1,000 molecules sampled from QM9 and ZINC-250k. According to Figure 2(a), the L2 reconstruction error converges after about 30 fixed-point iterations. The reconstructed molecules are identical to the original molecules after convergence.

(a) Reconstruction error.
(b) Original and reconstructed molecules.
Figure 2: (a) L2 reconstruction error of the GRF against the number of fixed-point iterations. L2 errors are measured between the dequantized and reconstructed $G'$, normalized by the number of entries. We observe that the error decays exponentially with the number of iterations on both datasets. (b) Original and reconstructed molecules sampled from QM9 and ZINC-250k with 100 fixed-point iterations. The color maps depict the values of the dequantized feature and adjacency matrices. Feature matrices have a single channel, while adjacency matrices have three channels (excluding the virtual bond channel), visualized as RGB channels. Because the values are quantized by an argmax over each node and the noise scaling hyperparameter is set to $c = 0.9$, the molecule is accurately reconstructed whenever the value difference at every pixel is less than 0.1.

4.3 Numerical Evaluation

Following (Kingma & Dhariwal, 2018; Madhawa et al., 2019), we sample 1,000 latent vectors from a temperature-truncated normal distribution and transform them into molecular graphs by the inverse operations. Different temperatures are selected for $z_A$ and $z_X$ because they are handled separately in our model. We compare the performance of the proposed model with those of the baseline models using the following metrics. Validity (V) is the ratio of chemically valid molecules to all generated graphs. Novelty (N) is the ratio of molecules not included in the training set to the generated valid molecules. Uniqueness (U) is the ratio of unique molecules to the generated valid molecules. Reconstruction accuracy (R) is the ratio of molecules that are perfectly reconstructed by the model. This metric is not defined for GANs, as they do not have encoders.
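As a sketch, these metrics can be computed with RDKit if the generated and training molecules are given as SMILES strings (an assumption for this snippet):

```python
from rdkit import Chem

def evaluate(generated_smiles, training_smiles):
    # Validity: valid molecules / all generated graphs.
    mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]  # canonical
    validity = len(valid) / len(generated_smiles)
    # Uniqueness and novelty are both measured against the valid set;
    # training_smiles are assumed to be canonical SMILES already.
    uniqueness = len(set(valid)) / len(valid)
    train = set(training_smiles)
    novelty = sum(s not in train for s in valid) / len(valid)
    return validity, novelty, uniqueness
```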

We choose GraphNVP (Madhawa et al., 2019), Junction Tree VAE (JT-VAE) (Jin et al., 2018), and Regularizing-VAE (RVAE) (Ma et al., 2018) as state-of-the-art baseline models. We also choose two additional VAE baselines, grammar VAE (GVAE) (Kusner et al., 2017) and character VAE (CVAE) (Gómez-Bombarelli et al., 2018), which learn SMILES (string) representations of molecules.

We present the numerical evaluation results on the QM9 and ZINC-250k datasets in Table 1 (QM9) and Table 2 (ZINC-250k), respectively. As expected, the GRF achieves a 100% reconstruction rate, which is enabled by the ResNet architecture with spectral normalization and fixed-point iteration. This has never been achieved by the VAE-based baselines, which impose stochastic behavior in the bottleneck layers. Moreover, this is achieved without incorporating chemical knowledge, which some baselines do use (e.g., valency checks for chemical graphs in RVAE and GVAE, and a subgraph vocabulary in JT-VAE). This is preferable because additional validity checks are computationally demanding, and a prepared subgraph vocabulary limits the extrapolation capacity of the generative model. As our model does not incorporate domain-specific procedures, it can be easily extended to general graph structures.

It is remarkable that our GRF-based generative model achieves generation performance scores comparable to GraphNVP with orders of magnitude fewer trainable parameters. These results indicate the parameter efficiency of our GRF construction, as well as the power and flexibility of residual connections compared to coupling flows based on simple affine transformations. Therefore, our goal of proposing a novel and strong invertible flow for molecular graph generation is successfully achieved by the development of the GRF. We discuss the number of parameters of the GRF using Big-O notation in Section 4.4.

The experiments also reveal a limitation of the current formulation of the GRF: notably lower uniqueness than GraphNVP. By examining the generated molecules manually, we found that they contain many more straight-chain molecules than those of GraphNVP. We attribute this phenomenon to the difficulty of generating realistic molecules without explicit chemical knowledge or autoregressive constraints. We plan to tackle this issue in future work.

Method      % V            % N            % U            % R      # Params
GRF         84.5 (±0.70)   58.6 (±0.82)   66.0 (±1.15)   100.0    56,120
GraphNVP    90.1           54.0           97.3           100.0    6,145,831
RVAE        96.6           97.5           -              61.8     -
GVAE        60.2           80.9           9.3            96.0     -
CVAE        10.3           90.0           67.5           3.6      -
Table 1: Performance of generative models with respect to quality metrics and their numbers of parameters on the QM9 dataset. Results of GraphNVP are recomputed following the hyperparameter settings in the original paper. Other baseline scores are borrowed from the original papers. Scores of the GRF are averages over 5 runs, with standard deviations in parentheses. We use different temperatures for $z_A$ and $z_X$ for QM9.

Method      % V            % N            % U            % R      # Params
GRF         73.4 (±0.62)   100.0 (±0.0)   53.7 (±2.13)   100.0    3,234,552
GraphNVP    77.3           100.0          94.8           100.0    245,792,665
JT-VAE      99.8           100.0          100.0          76.7     -
RVAE        34.9           100.0          -              54.7     -
GVAE        7.2            100.0          9.0            53.7     -
CVAE        0.7            100.0          67.5           44.6     -
Table 2: Performance of generative models with respect to quality metrics and their numbers of parameters on the ZINC-250k dataset. Results of GraphNVP and JT-VAE are recomputed following the hyperparameter settings in the original papers. Other baseline scores are borrowed from the original papers. Scores of the GRF are averages over 5 runs, with standard deviations in parentheses. We use different temperatures for $z_A$ and $z_X$ for ZINC-250k.

4.4 Efficiency in terms of model size

As observed in the previous section, our GRF-based generative models are compact and memory-efficient in terms of the number of trainable parameters, compared to the existing GraphNVP flow model. In this section, we discuss this issue more formally.

Let $L$ be the number of layers, $R$ the number of bond types, and $M$ the number of atom types. For GraphNVP, the adjacency coupling layers and the atom coupling layers operate on flattened slices of the tensors, so their parameter counts grow on the order of $N^2 R$ and $N M$ per layer, respectively; the whole GraphNVP thus requires on the order of $L(N^2 R + N M)$ parameters. By contrast, our res-GraphLinear and res-GCN blocks share their weight matrices across nodes, so each requires far fewer parameters per layer, and so does the whole GRF model. In most molecular graph generation settings, $N \gg M, R$, and the adjacency-related term is dominant.

Our GRF for ZINC-250k uses linear layers to handle adjacency matrices, but the number of parameters is substantially reduced by the low-rank approximation introduced in Sec. 4.1. Let $k$ be the approximated rank of each linear layer; each weight matrix is then factorized into two rank-$k$ factors, so its parameter count scales linearly rather than quadratically in the layer width. Notably, GraphLinear coincides with the low-rank approximation when $k$ is the full rank.
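A minimal sketch of such a factorized layer (our illustration; the paper's GraphLinear implementation may differ):

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    # Approximates a (d_out x d_in) weight as U @ V with rank k, cutting
    # the parameter count from d_in * d_out to roughly k * (d_in + d_out).
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.v = nn.Linear(d_in, rank, bias=False)
        self.u = nn.Linear(rank, d_out)

    def forward(self, x):
        return self.u(self.v(x))
```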

Our model's efficiency in model size becomes even more important when generating large molecules. Suppose we want to generate molecules with many more heavy atoms, at a batch size of 64. Extrapolating from the memory usage of our GRF for ZINC-250k, the GRF would consume 21 GB in a setting where GraphNVP would consume as much as 2,100 GB. Since the GPUs most commonly used today (e.g., NVIDIA Tesla V100) are equipped with 16–32 GB of memory, GraphNVP either cannot process such a batch on a single GPU, or must shrink the batch, which makes batch normalization unstable. Our model, by contrast, scales to larger graphs owing to its reduced parameter count.

4.5 Smoothness of the Learned Latent Space

As a final experiment, we visualize the learned latent space. First, we randomly choose 100 molecules from the training set and encode them into latent representations using the trained model. We compute the first and second principal components of the latent space by PCA and project the encoded molecules onto the plane spanned by these two principal component vectors. Then, we choose another random molecule, encode it, and project it onto the aforementioned principal plane. Finally, we decode the latent points on the principal plane, distributed in a grid-mesh pattern centered at the projection of that molecule, and visualize them in Fig. 3. Figure 3 indicates that the latent spaces learned from both the QM9 (panel (a)) and ZINC-250k (panel (b)) datasets are smooth: the molecules change gradually along the two axes.
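The visualization procedure can be sketched as follows, assuming hypothetical model.encode / model.decode functions for the trained flow:

```python
import numpy as np
from sklearn.decomposition import PCA

def latent_grid(model, train_mols, query_mol, span=2.0, steps=7):
    # Fit a 2-D PCA on encodings of training molecules, then decode a
    # grid of latent points centered at the query molecule's projection.
    z = np.stack([model.encode(m) for m in train_mols])
    pca = PCA(n_components=2).fit(z)
    center = pca.transform(model.encode(query_mol)[None])[0]
    offsets = np.linspace(-span, span, steps)
    return [
        model.decode(pca.inverse_transform((center + [dx, dy])[None])[0])
        for dy in offsets
        for dx in offsets
    ]
```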

The smoothness visualized in Figure 3 appears similar to that of VAE-based models, but differs in that our GRF is a bijective function: data points and latent points correspond one-to-one. In contrast, to generate a data point with VAE-based methods, one must decode the same latent point several times and select the most common molecule; our model generates the data point in one shot. Additionally, a smooth latent space and bijectivity are crucial in practical use cases. Our model enables molecular graph generation through querying: encode a molecule with the desired attributes, then decode perturbed latent points to obtain drug candidates with similar attributes.

(a)
(b)
Figure 3: Visualization of the learned latent spaces along two randomly selected orthogonal axes. The red circled molecules are centers of the visualizations (and not the origin of the latent spaces). The empty space in the grid indicates that an invalid molecule is generated.

5 Conclusion

In this paper, we proposed the Graph Residual Flow (GRF), an invertible residual flow for molecular graph generation. Our model exploits the expressive power of the ResNet architecture, and its invertibility is guaranteed by only a slight modification: the addition of spectral normalization to each layer. Owing to this, our model can generate valid molecules on both the QM9 and ZINC-250k datasets. The reconstruction accuracy is inherently 100%, and our model is far more efficient in terms of model size than GraphNVP, a previous flow model for graphs. In addition, the learned latent space of the GRF is sufficiently smooth to enable the generation of molecules similar to a query molecule with known chemical properties.

Future work includes the design of adjacency residual layers that are invariant to node permutation, and property optimization with the GRF.

References

Appendix A Proof of Theorem 1

Lemma 1.

For Lipschitz functions $f$ and $g$, $\mathrm{Lip}(f \circ g) \le \mathrm{Lip}(f)\,\mathrm{Lip}(g)$.

Proof.

For any $x_1, x_2$, $\lVert f(g(x_1)) - f(g(x_2)) \rVert \le \mathrm{Lip}(f) \lVert g(x_1) - g(x_2) \rVert \le \mathrm{Lip}(f)\,\mathrm{Lip}(g) \lVert x_1 - x_2 \rVert$. ∎

Lemma 2.

$\lVert \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \rVert_2 \le 1$.

Proof.

The augmented normalized Laplacian is defined as $\tilde{\Delta} = I - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. Like the normal graph Laplacian, the $i$-th eigenvalue $\tilde{\lambda}_i$ of $\tilde{\Delta}$ satisfies $0 \le \tilde{\lambda}_i < 2$ (Oono & Suzuki, 2019). For the eigenvector $v_i$ corresponding to $\tilde{\lambda}_i$:

$$\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} v_i = (I - \tilde{\Delta}) v_i = (1 - \tilde{\lambda}_i) v_i.$$

As $0 \le \tilde{\lambda}_i < 2$, it follows that $-1 < 1 - \tilde{\lambda}_i \le 1$. The operator norm $\lVert \cdot \rVert_2$ is bounded by the maximum singular value. As $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is symmetric by construction, its maximum singular value equals the largest absolute eigenvalue. From these conditions, $\lVert \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \rVert_2 \le 1$. ∎

Theorem 1.

For the residual block $g$ of Eq. (12), $\mathrm{Lip}(g) \le L_\phi \lVert W \rVert_2$, where $L_\phi$ is the Lipschitz constant of $\phi$.

Proof.

For any inputs $X_1, X_2$,

$$\lVert g(X_1) - g(X_2) \rVert \le L_\phi \lVert \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} (X_1 - X_2) W \rVert \le L_\phi \lVert \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \rVert_2 \, \lVert X_1 - X_2 \rVert \, \lVert W \rVert_2,$$

using Lemma 1. By Lemma 2, $\lVert \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \rVert_2 \le 1$, and hence $\mathrm{Lip}(g) \le L_\phi \lVert W \rVert_2$. In particular, with $L_\phi \le 1$ and $\lVert W \rVert_2 < 1$, the block is a contraction and the residual layer is invertible. ∎

Appendix B Model Hyperparameters

We use a single-scale architecture for the QM9 dataset, and a multi-scale architecture (Dinh et al., 2017) for the ZINC-250k dataset to scale to 38 heavy atoms. The other hyperparameters are shown in Table 3. We find that a spectral normalization factor of 0.9 is sufficient for numerical invertibility.

Dataset      GCN blocks   GCN layers   MLP blocks   MLP layers   BS     LR     Epochs
QM9          1            1            32           25           2048   1e-3   70
ZINC-250k    3            3            3            3            256    1e-4   70
Table 3: Model hyperparameters for QM9 and ZINC-250k. BS and LR stand for batch size and learning rate, respectively.