Categorical Normalizing Flows via Continuous Transformations

06/17/2020 ∙ by Phillip Lippe, et al. ∙ University of Amsterdam, Google Inc

Despite their popularity, the application of normalizing flows to categorical data remains limited to date. The current practice of using dequantization to map discrete data to a continuous space is inapplicable as categorical data has no intrinsic order. Instead, categorical data have complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is, normalizing flows for categorical data. By casting the encoding of categorical data in continuous space as a variational inference problem, we jointly optimize the continuous representation and the model likelihood. To maintain unique decoding, we learn a partitioning of the latent space by factorizing the posterior. Meanwhile, the complex relations between the categorical variables are learned by the ensuing normalizing flow, thus maintaining a close-to-exact likelihood estimate and making it possible to scale up to a large number of categories. Based on Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant generative model on graphs, outperforming both one-shot and autoregressive flow-based state-of-the-art on molecule generation.




1 Introduction

Normalizing flows have recently become popular for tasks like image modeling Dinh et al. (2017); Kingma and Dhariwal (2018); Ho et al. (2019); Durkan et al. (2019) and speech generation Kim et al. (2019); Prenger et al. (2019) by providing efficient parallel sampling and exact density evaluation. Normalizing flows rely on the rule of change of variables, a continuous transformation that naturally applies to continuous data. However, for many data types like language and graphs that are typically encoded as discrete, categorical variables, normalizing flows are not straightforward to apply.

Recently proposed ideas of discretizing the transformations inside normalizing flows to act directly on discrete data have been shown to be limited in terms of vocabulary size and layer depth due to gradient approximations Hoogeboom et al. (2019); Tran et al. (2019). For discrete, ordinal data like images, where integers represent quantized values, a popular strategy is to add a small amount of noise to each value Dinh et al. (2017); Ho et al. (2019). Such dequantization techniques, however, cannot be applied as simply to nominal discrete data where the values represent categories with no intrinsic order. Treating these categories as integers for dequantization biases the data towards a non-existent order, and makes the modeling task significantly harder. Previous insights on variational dequantization Ho et al. (2019); Hoogeboom et al. (2020) have underlined the great importance of a flexible representation of ordinal data in normalizing flows, and hence we suspect a similar impact for categorical data.

In this paper, we investigate continuous encodings of categorical data in normalizing flows. Instead of pre-specifying non-overlapping volumes for each discrete value, we propose to use variational inference as a toolkit to jointly optimize the mapping to a continuous latent space and the likelihood modeled by a normalizing flow. Previous work on combining variational inference with normalizing flows has focused on improving the approximate posterior’s flexibility Kingma et al. (2016); Rezende and Mohamed (2015); Van Den Berg et al. (2018). Here, instead, we use variational inference to provide a continuous representation of the discrete data to a normalizing flow. As no information should be lost when mapping the data into continuous space, we limit the encoding distributions to ones whose (approximate) posterior is independent over discrete variables. This leads to a learned partitioning of the latent space with an almost unique decoding. We call this approach Categorical Normalizing Flows and experiment with encoding distributions of increasing flexibility, but find that a simple mixture model is sufficient for encoding categorical data well.

Categorical Normalizing Flows can be applied to any task involving categorical variables. Examples, which we visit experimentally in this work, include words as categorical (one-hot vector) variables, sets and graphs Zhou et al. (2018); Wu et al. (2020). We put particular emphasis on graphs, as current approaches are mostly autoregressive Li et al. (2018); Shi et al. (2020); You et al. (2018) and view graphs as sequences, although there exists no intrinsic order of the nodes. Normalizing flows, however, can perform generation in parallel, making a definition of order unnecessary. By treating both nodes and edges as categorical variables, we employ our variational inference encoding and propose GraphCNF. GraphCNF is a novel permutation-invariant normalizing flow on graph generation which assigns equal likelihood to any ordering of nodes. Meanwhile, GraphCNF encodes the node attributes, edge attributes and graph structure in three consecutive steps. As shown in the experiments, the improved encoding and flow architecture allow GraphCNF to significantly outperform both the autoregressive and parallel flow-based state-of-the-art.

Overall, our contributions are summarized as follows:

  • We propose Categorical Normalizing Flows, which apply a novel encoding method for categorical data in normalizing flows. By using variational inference with a factorized posterior, we still support a close-to-exact likelihood estimate and scale up to a large number of categories.

  • Starting from the Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant normalizing flow on graph generation. On molecule generation, GraphCNF sets a new state-of-the-art for flow-based methods outperforming one-shot and autoregressive baselines.

  • We experiment with encoding distributions of increasing flexibility on various tasks including sets, language and graphs, and show that a simple mixture model is sufficient for modeling discrete, categorical distributions accurately.

2 Preliminaries

A normalizing flow Rezende and Mohamed (2015); Tabak and Vanden Eijnden (2010) is a generative model that models a probability distribution $p(z^{(0)})$ by applying a sequence of invertible, smooth mappings $f_1, \dots, f_K : \mathbb{R}^n \to \mathbb{R}^n$. Using the rule of change of variables, the likelihood of the input $z^{(0)}$ is determined as follows:

$$p(z^{(0)}) = p(z^{(K)}) \cdot \prod_{k=1}^{K} \left| \det \frac{\partial f_k(z^{(k-1)})}{\partial z^{(k-1)}} \right| \qquad (1)$$

where $z^{(k)} = f_k(z^{(k-1)})$, and $p(z^{(K)})$ represents a prior distribution. This calculation requires computing the Jacobian of each mapping $f_k$, which is expensive for arbitrary functions. Thus, the mappings are often designed to allow an efficient computation of the Jacobian determinant. One such design is the coupling layer proposed by Dinh et al. (2017), which has been shown to work well with deep neural networks. For a detailed introduction to normalizing flows, we refer the reader to Kobyzev et al. (2019).
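To make the rule of change of variables concrete, the following is a minimal sketch in which each mapping is a simple elementwise affine transformation and the prior is a standard normal. Both choices, as well as all function names, are illustrative assumptions and not the layers used in this paper.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of an isotropic standard normal prior, summed over dimensions."""
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=-1)

def affine_layer(z, log_scale, shift):
    """Invertible elementwise map z -> z * exp(log_scale) + shift.

    Returns the transformed value and log|det Jacobian|, which for an
    elementwise map is the sum of the log-scales.
    """
    out = z * np.exp(log_scale) + shift
    log_det = np.sum(log_scale) * np.ones(z.shape[:-1])
    return out, log_det

def flow_log_likelihood(z0, layers):
    """log p(z0) = log p(zK) + sum over layers of log|det Jacobian|."""
    z, total_log_det = z0, np.zeros(z0.shape[:-1])
    for log_scale, shift in layers:
        z, log_det = affine_layer(z, log_scale, shift)
        total_log_det += log_det
    return standard_normal_logpdf(z) + total_log_det

# Two hypothetical layers acting on 2-dimensional inputs.
layers = [(np.array([0.3, -0.2]), np.array([1.0, 0.0])),
          (np.array([-0.1, 0.5]), np.array([-0.5, 2.0]))]
z0 = np.array([[0.2, -1.0], [1.5, 0.3]])
log_px = flow_log_likelihood(z0, layers)
```

Coupling layers follow the same pattern, except that the scale and shift of one part of the input are predicted from the other part, which keeps the Jacobian triangular and its determinant cheap to compute.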


Applying continuous normalizing flows on discrete data leads to undesired density models where arbitrarily high likelihoods are placed on particular values. This is because discrete data points represent delta peaks in a continuous distribution Theis et al. (2016); Uria et al. (2013). A common solution to this problem is to dequantize the data by adding noise. Considering $x$ as an integer vector, the dequantization can be formulated as $v = x + u$ where $u \in [0,1)^D$. The reverse mapping from $v$ back to $x$ is done by finding the next lower whole number for each element $v_i$. Theis et al. (2016) have shown that modeling the dequantized representation, $p_{model}(v)$, lower-bounds the modeled discrete distribution $P_{model}(x)$. Denoting the distribution over $u$ by $q(u|x)$, we can write the lower bound as:

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\log P_{model}(x)\right] \geq \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{u \sim q(\cdot|x)}\left[\log \frac{p_{model}(x+u)}{q(u|x)}\right] \qquad (2)$$

where $\mathcal{D}$ represents the dataset. The dequantization distribution is usually set to uniform or is learned by a second normalizing flow. The latter form of dequantization is referred to as variational dequantization and has proven crucial for state-of-the-art image modeling Ho et al. (2019); Hoogeboom et al. (2020).
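The uniform variant of dequantization and its reverse mapping can be sketched in a few lines. The array shapes and the value range below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dequantize(x, rng):
    """Uniform dequantization: add noise u drawn from U[0, 1) to every element."""
    u = rng.uniform(0.0, 1.0, size=x.shape)
    return x + u

def quantize(v):
    """Reverse mapping: take the next lower whole number of each element."""
    return np.floor(v).astype(np.int64)

x = rng.integers(0, 256, size=(4, 8))   # e.g. 8-bit ordinal values
v = dequantize(x, rng)
x_rec = quantize(v)
```

Because every integer owns the unit interval above it, the decoding is unique. For categories without an order, however, the adjacency of these intervals carries no meaning, which is the limitation addressed in the next section.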

3 Categorical Normalizing Flows

3.1 Encoding categorical data into continuous latent space

We define $x = \{x_1, \dots, x_S\}$ to be a multivariate, nominal discrete random variable, where each element $x_i$ is a categorical variable of $K$ categories with no intrinsic order. Our goal is to learn the joint probability mass function, $P_{model}(x)$, via a normalizing flow. As normalizing flows constitute a class of continuous transformations, it is not directly possible to rely on them for modeling $P_{model}(x)$. Instead, we propose to learn a continuous latent space in which each categorical choice of a variable $x_i$ maps to one distribution of a continuous variable $z_i \in \mathbb{R}^d$. Thereby, we want to have the following properties:

  • The continuous distributions corresponding to different categories should be non-overlapping to preserve a unique decoding, similar to current dequantization methods. Specifically, the latent space is ideally partitioned into regions, one for each category. This ensures that no information is lost when mapping the discrete data to continuous values.

  • In contrast to integers, categories do not have an intrinsic order which would provide a natural positioning of the non-overlapping volumes. However, there usually exist (hidden) relations between the categories which are beneficial to represent in the encoding. Thus, the positioning of the volumes and distributions per category need to be optimized instead of pre-specified.

  • Relations between data points are usually represented by distance in continuous space. Categories can have several multi-dimensional relations, as is the case for words and their meaning. To encode those relations into the latent space, a single dimension is not sufficient as it cannot represent all the different forms of relations. Thus, the encoding distribution needs to support an arbitrary number of dimensions for $z_i$.

In summary, the optimal encoding distribution would learn a partitioning of the continuous latent space into volumes, each representing one category with a flexible distribution within this part.

3.2 Normalizing flows on categorical data

In order to find such a function, we propose to learn a flexible encoding distribution $q(z|x)$ by simultaneously optimizing a decoder $p(x|z)$ for the reverse mapping. This allows us to jointly optimize the encoding of the categorical data with the normalizing flow on the continuous representation. A common framework for learning such an encoder-decoder structure on distributions is variational inference Kingma and Welling (2014); Rezende and Mohamed (2015). However, variational inference in its general form has two drawbacks. Firstly, defining a joint decoder distribution $p(x|z)$ does not fulfill our desired property of partitioning the latent space. Instead, the encoder-decoder model will compress the information as the decoder can infer categories from other continuous variables, which also leads to overlaps between the distributions of different categories. However, we want the interaction of the variables to be learned in the normalizing flow to utilize its parallel sampling and exact density evaluation. Secondly, $p(x|z)$ represents an approximate posterior of the likelihood $q(z|x)$. The difference between the true and approximate posterior is a KL divergence, which cannot be determined as the true posterior is unknown. Thus, we can only model a lower bound which increasingly diverges with the posterior’s complexity.

To overcome these issues, we propose to simplify the decoder by factorizing the posterior: $p(x|z) = \prod_i p(x_i|z_i)$. This limits the variational inference framework to a toolkit for learning the optimal partitioning of the latent space. Factorizing the posterior distribution means that we assume independence between the categorical variables given their learned continuous encodings. Therefore, any interaction between the categorical variables must be learned inside the normalizing flow. On the other hand, the encoder is optimized to provide suitable representations of the categorical variables to the flow while separating the different categories in latent space to improve the decoding. The KL divergence between true and approximate posterior is also expected to be close to zero as the posterior becomes almost deterministic. Overall, our objective becomes:

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\log P_{model}(x)\right] \geq \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{z \sim q(\cdot|x)}\left[\log \frac{p_{model}(z) \prod_i p(x_i|z_i)}{q(z|x)}\right] \qquad (3)$$

We refer to this framework as Categorical Normalizing Flows. In contrast to dequantization in Eqn 2, the continuous encoding $z$ is not bounded by the domain of the encoding distribution. Instead, the partitioning is jointly learned with the model likelihood. Furthermore, we can freely choose the dimensionality of the continuous variables, $z_i$, to fit the number of categories and their relations.
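For a single data point, the lower bound with a factorized posterior follows from the standard variational argument. The short derivation below is a sketch under assumed notation: $q(z|x)$ denotes the encoder, $p(x_i|z_i)$ the factorized decoder, and $p_{model}(z)$ the density modeled by the flow.

```latex
\begin{align*}
\log P_{model}(x)
  &= \log \int p_{model}(z)\, p(x|z)\, dz
   = \log \mathbb{E}_{z \sim q(\cdot|x)}\left[\frac{p_{model}(z)\, p(x|z)}{q(z|x)}\right] \\
  &\geq \mathbb{E}_{z \sim q(\cdot|x)}\left[\log \frac{p_{model}(z)\, p(x|z)}{q(z|x)}\right]
   && \text{(Jensen's inequality)} \\
  &= \mathbb{E}_{z \sim q(\cdot|x)}\left[\log p_{model}(z) + \sum_i \log p(x_i|z_i) - \log q(z|x)\right]
   && \text{(factorized posterior)}
\end{align*}
```

The bound becomes tight when the decoder term $\sum_i \log p(x_i|z_i)$ approaches zero for encoded samples, i.e. when the categories occupy nearly disjoint regions and decoding is almost deterministic.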

The encoder and decoder can be implemented in several ways. The first setup we consider is a mixture model, where each category is represented by a logistic distribution with different mean and scaling. With $g$ denoting the logistic, the encoder becomes $q(z|x) = \prod_i g(z_i \mid \mu(x_i), \sigma(x_i))$. In this setup, the true posterior can actually be found by applying Bayes: $p(x_i|z_i) = \frac{\tilde{p}(x_i)\, q(z_i|x_i)}{\sum_{\hat{x}} \tilde{p}(\hat{x})\, q(z_i|\hat{x})}$ with $\tilde{p}(x_i)$ being a prior over categories. The mixture model is simple and efficient to implement, but limited in the distributions it can express. To increase flexibility, we experiment with adding class-conditional flows which transform each logistic into a more complex distribution. We refer to this approach as linear flows. Nevertheless, we found that a standard mixture model is sufficient for modeling discrete distributions accurately. Even representing the encoder as a flow across categorical variables, as applied in variational dequantization Ho et al. (2019), did not improve upon the mixture model. We compare these setups experimentally in Section 6.
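The mixture-model setup can be illustrated with a small NumPy sketch: each category owns a logistic distribution in latent space, encoding samples from it, and decoding applies Bayes' rule over all categories. The number of categories, the latent dimensionality, the random (untrained) parameters and the uniform category prior are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 5, 3                                         # hypothetical: 5 categories, 3 latent dims
mu = rng.normal(size=(K, d))                        # per-category means (learned in practice)
log_s = rng.normal(scale=0.1, size=(K, d)) - 1.0    # per-category log-scales
log_prior = np.full(K, -np.log(K))                  # assumed uniform prior over categories

def logistic_logpdf(z, mu, log_s):
    """Log-density of a factorized logistic, summed over latent dimensions."""
    y = (z - mu) / np.exp(log_s)
    # log g(y) = -y - 2 * log(1 + exp(-y)); subtract log-scale for the rescaling
    return np.sum(-y - 2.0 * np.logaddexp(0.0, -y) - log_s, axis=-1)

def encode(x, rng):
    """Sample z_i ~ q(z_i | x_i) via the inverse CDF of the logistic."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=(len(x), d))
    eps = np.log(u) - np.log1p(-u)
    return mu[x] + np.exp(log_s[x]) * eps

def posterior(z):
    """p(x_i | z_i) by Bayes' rule, normalized over all K categories."""
    log_joint = log_prior + logistic_logpdf(z[:, None, :], mu, log_s)   # shape (n, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)                   # numerical stability
    probs = np.exp(log_joint)
    return probs / probs.sum(axis=1, keepdims=True)

x = rng.integers(0, K, size=1000)
z = encode(x, rng)
post = posterior(z)
accuracy = np.mean(post.argmax(axis=1) == x)
```

With untrained parameters the logistics still overlap, so decoding is not perfect. During training, the decoder term of the objective rewards means and scales that separate the categories.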

4 Graph generation with Categorical Normalizing Flows

Categorical Normalizing Flows can be applied to any task involving categorical data, of which one is graph modeling. A graph $G = (V, E)$ is defined by a set of nodes $V$ and a set of edges $E$ representing connections between nodes. Both the nodes and edges can have attributes which are often categorical. When modeling a graph, both the attributes and the overall graph structure need to be considered. The most successful current approaches Liao et al. (2019); Popova et al. (2019); Shi et al. (2020); You et al. (2018) are autoregressive although graphs are usually not sequential data. Vinyals et al. (2016) have shown that treating set-like data as a sequence can significantly hurt performance, and we validate this issue in experiments on graph coloring in Section 6.2. Furthermore, a likelihood-based model should intuitively assign equal probability to any permutation or order of the nodes as all of them represent the exact same graph.

Starting from Categorical Normalizing Flows, we propose GraphCNF, a normalizing flow for graph generation that is invariant to the order of nodes by generating all nodes and edges at once. Given a graph $G$, we model each node and edge as a separate categorical variable where the categories correspond to their discrete attributes. To represent the graph structure, i.e. which pairs of nodes are connected by an edge, we add an extra category to the edges representing the missing or virtual edges. Hence, to model an arbitrary graph, we consider an edge variable for every possible pair of nodes. To apply normalizing flows on the node and edge categorical variables, we map them into continuous latent space using Categorical Normalizing Flows. Subsequent coupling layers map those representations to a continuous prior distribution. Thereby, GraphCNF uses two crucial design choices for graph modeling: (1) we perform the generation stepwise for improved efficiency, and (2) we ensure that the model assigns equal likelihood to any ordering of the nodes.
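The representation of a graph as categorical variables can be sketched as follows. The example assumes an undirected graph, reserves category 0 for virtual edges, and uses a hypothetical dictionary format for the edges; none of these conventions are prescribed by the text above.

```python
import numpy as np

def graph_to_categorical(node_types, edges, num_nodes):
    """Return node categories and one edge category per unordered node pair.

    edges maps (i, j) with i < j to an attribute category in {1, ..., A};
    category 0 is reserved here for the virtual (missing) edge.
    """
    node_vars = np.asarray(node_types, dtype=np.int64)
    pairs = [(i, j) for i in range(num_nodes) for j in range(i + 1, num_nodes)]
    edge_vars = np.array([edges.get(p, 0) for p in pairs], dtype=np.int64)
    return node_vars, pairs, edge_vars

# Hypothetical 4-node example: a path 0-1-2 with attribute 1, and an edge 2-3 with attribute 2.
node_types = [0, 1, 1, 2]
edges = {(0, 1): 1, (1, 2): 1, (2, 3): 2}
node_vars, pairs, edge_vars = graph_to_categorical(node_types, edges, 4)
```

The number of edge variables grows quadratically with the number of nodes, and most of them are virtual in sparse graphs. This is the cost that motivates the stepwise generation described next.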

4.1 Three-step generation

Modeling all edges including the virtual ones requires a large number of latent variables and is computationally expensive. However, normalizing flows have been shown to benefit from splitting off latent variables at earlier layers, which also increases efficiency Dinh et al. (2017); Kingma and Dhariwal (2018). Thus, we propose to add the node types, edge attributes and graph structure stepwise to the latent space as visualized in Figure 1.

In the first step, we encode the nodes into continuous latent space, $z_0^{(V)}$, using Categorical Normalizing Flows. On those, we apply a group of coupling layers, $F_1$, which additionally use the adjacency matrix and the edge attributes, denoted by $E$ and $E_{attr}$ respectively, as input. Thus, we can summarize the first step as:

$$z_1^{(V)} = F_1\left(z_0^{(V)}; E, E_{attr}\right)$$

The second step incorporates the edge attributes, $E_{attr}$, into latent space. Hence, all edges of the graph except the virtual edges are encoded into latent variables, $z_0^{(E_{attr})}$, representing their attribute. The following coupling layers, denoted by $F_2$, transform both the node and edge attribute variables:

$$z_2^{(V)}, z_1^{(E_{attr})} = F_2\left(z_1^{(V)}, z_0^{(E_{attr})}; E\right)$$

Finally, we add the virtual edges to the latent variable model as $z_0^{(E^*)}$. Thereby, we need to slightly adjust our encoding from Categorical Normalizing Flows as we considered the virtual edges as an additional category of the edges. While the other categories are already encoded by $z_1^{(E_{attr})}$, we add a separate encoding distribution for the virtual edges, for which we use a simple logistic. Meanwhile, the posterior needs to be applied on all edges, as we need to distinguish the continuous representation between virtual and non-virtual edges. Overall, the mapping can be summarized as:

$$z_3^{(V)}, z_1^{(E)} = F_3\left(z_2^{(V)}, z_0^{(E)}\right) \quad \text{with} \quad z_0^{(E)} = \left[z_1^{(E_{attr})}, z_0^{(E^*)}\right]$$

where the latent variables $z_3^{(V)}$ and $z_1^{(E)}$ are trained to follow a prior distribution. During sampling, we first invert $F_3$ and determine the general graph structure. Next, we invert $F_2$ and reconstruct the edge attributes. Finally, we apply the inverse of $F_1$ and determine the node types.

Figure 1: Visualization of GraphCNF for an example graph of five nodes. We add the node and edge attributes, as well as the virtual edges stepwise to the latent space while leveraging the graph structure in the coupling layers. The last step considers a fully connected graph with features per edge.

4.2 Permutation-invariant graph modeling

To make the transformations of the coupling layers permutation invariant, we apply a channel masking strategy Dinh et al. (2017) such that the split between latent variables is independent of the order of the nodes. Specifically, the split is performed over the latent dimensions for each node and edge independently. In addition, we leverage the graph structure in the coupling networks by applying graph neural networks. In the first step, we use a Relation GCN Schlichtkrull et al. (2018) which incorporates the categorical edge attributes into the layer. For the second and third step, we need a graph network that supports the modeling of both node and edge features. We refer to this network as Edge-GNN, and as we found that various implementations work well, we lay out the details of the Edge-GNN in Appendix B. Using both design choices, GraphCNF assigns equal probability to any ordering of nodes in a graph.
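The effect of combining a channel-wise split with a permutation-equivariant coupling network can be checked numerically. The sketch below swaps in a single mean-aggregation message passing layer and a standard normal prior for simplicity; it is not the Relation GCN or Edge-GNN used in this paper, and all weights and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 8                                   # hypothetical latent dims per node, hidden size
W1 = rng.normal(scale=0.5, size=(d // 2, h))
W2 = rng.normal(scale=0.5, size=(h, h))
W_out = rng.normal(scale=0.5, size=(h, d))    # predicts log-scale and shift for d/2 dims

def message_passing(feat, adj):
    """One mean-aggregation graph layer; equivariant to node permutations."""
    hidden = np.tanh(feat @ W1)
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    neigh = (adj @ hidden) / deg
    return np.tanh((hidden + neigh) @ W2) @ W_out

def coupling_log_likelihood(z, adj):
    """Affine coupling with a channel split, followed by a standard normal prior."""
    z_a, z_b = z[:, : d // 2], z[:, d // 2 :]   # split over latent channels, not over nodes
    params = message_passing(z_a, adj)
    log_scale, shift = np.tanh(params[:, : d // 2]), params[:, d // 2 :]
    out_b = z_b * np.exp(log_scale) + shift
    out = np.concatenate([z_a, out_b], axis=1)
    log_prior = -0.5 * np.sum(out ** 2 + np.log(2.0 * np.pi))
    return log_prior + np.sum(log_scale)

n = 5
adj = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
adj = adj + adj.T                               # random undirected graph
z = rng.normal(size=(n, d))

perm = rng.permutation(n)
ll = coupling_log_likelihood(z, adj)
ll_perm = coupling_log_likelihood(z[perm], adj[perm][:, perm])
```

Both log-likelihoods agree up to floating point error. A split over nodes instead of channels would tie the transformation to node indices and break this property.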

5 Related work

Discrete NF

Recent works have investigated normalizing flows with discretized transformations. Hoogeboom et al. (2019) proposed to use additive coupling layers with rounding operators for ensuring discrete output. Tran et al. (2019) discretize the output by a Gumbel-Softmax approximating an argmax operator. Thereby, the coupling layers resemble a reversible shift operator. While both approaches achieved results competitive with continuous baselines, the gradient approximations have been shown to introduce new challenges to the models such as limiting the number of layers or the distribution size.

Variational inference with NF

Several works have investigated the application of normalizing flows in variational auto-encoders Kingma and Welling (2014) for increasing the flexibility of the approximate posterior Kingma et al. (2016); Van Den Berg et al. (2018); Tomczak and Welling (2017). However, VAEs model a lower bound of the true likelihood. To minimize this gap, Ziegler and Rush (2019) proposed to move the main model complexity into the prior by using normalizing flows. Experiments focused on sequence tasks showed competitive, but still worse results than an LSTM baseline. In this paper, instead, we use an even simpler decoder factorized over discrete variables, such that all interactions between variables are learned in the flow.

Graph modeling

The first generation models on graphs have been autoregressive Liao et al. (2019); You et al. (2018), generating nodes and edges in a sequential order. While being efficient in memory, they are slow in sampling and assume an order in the set of nodes. The first application of normalizing flows for graph generation was introduced by Liu et al. (2019), where a flow modeled the node representations of a pretrained autoencoder. Recent works of GraphNVP Madhawa et al. (2019) and GraphAF Shi et al. (2020) proposed normalizing flows for molecule generation. GraphNVP consists of two separate flows, one for modeling the adjacency matrix and a second for modeling the node types. Although allowing parallel generation, the model is sensitive to the node order due to its masking strategy and feature networks in the coupling layer. GraphAF is an autoregressive normalizing flow sampling nodes and edges sequentially but allowing parallel training. However, both flows use standard uniform dequantization to represent the node and edge categories. VAEs have also been proposed for latent-based graph generation Simonovsky and Komodakis (2018); Ma et al. (2018); Liu et al. (2018); Jin et al. (2018). Although those models can be permutation-invariant, they model a lower bound of the true likelihood.

6 Experimental results

To show the wide applicability of Categorical Normalizing Flows, we perform experiments on sets, graphs and language. The normalizing flows we use in our experiments consist of a sequence of logistic mixture coupling layers proposed by Ho et al. (2019) which map a mixture of logistic distributions back into a single mode. This is particularly of interest for our proposed encoding strategy as its simplest implementation is based on a logistic mixture model. Before each coupling layer, we also include an activation normalization layer and invertible 1x1 convolution Kingma and Dhariwal (2018). For full reproducibility, we outline each experiment’s hyperparameter details in Appendix D, and publish our code.

6.1 Set modeling

The first experiments we present are on sets of categorical variables. Our goal is to investigate whether Categorical Normalizing Flows can accurately model discrete distributions, and which encoding distribution is best suited. We compare our approach to variational dequantization Ho et al. (2019) and discrete normalizing flows by Tran et al. (2019). The two toy datasets we experiment on are set shuffling and set summation. In set shuffling, we model a set of $N$ categorical variables each having one out of $N$ categories. Each category has to appear exactly once, which leads to $N!$ possible assignments that need to be modeled. In set summation, we again consider a set of size $N$ with $N$ categories, but those categories represent the actual integers $1, \dots, N$. The task is to model those sets for which the sum of all elements is an arbitrary number, $L$. In contrast to set shuffling, the data is ordinal, which we initially expected to help dequantization methods. For both experiments, $N$ and $L$ are set to fixed values.
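For both toy datasets, the optimal likelihood can be computed in closed form or by counting, assuming that all valid assignments are equally likely and that an assignment is an ordered tuple of variables. The sketch below uses small hypothetical values for the set size and target sum; they are not the values used in the experiments.

```python
import math

def shuffling_optimal_bits(n):
    """Optimal bits per variable when all n! permutations are equally likely."""
    return math.log2(math.factorial(n)) / n

def count_summation_sets(n, total):
    """Number of length-n tuples over {1, ..., n} whose elements sum to total."""
    counts = [1] + [0] * total            # counts[s]: ways to reach sum s so far
    for _ in range(n):
        new = [0] * (total + 1)
        for s, c in enumerate(counts):
            if c:
                for v in range(1, n + 1):
                    if s + v <= total:
                        new[s + v] += c
        counts = new
    return counts[total]

def summation_optimal_bits(n, total):
    """Optimal bits per variable for a uniform distribution over valid tuples."""
    return math.log2(count_summation_sets(n, total)) / n

bits_shuffle = shuffling_optimal_bits(4)        # hypothetical set size 4
num_valid = count_summation_sets(3, 6)          # hypothetical set size 3, target sum 6
bits_sum = summation_optimal_bits(3, 6)
```

A model that reaches these values assigns probability mass only to valid assignments, which is the sense in which the "Optimal" row of the results table should be read.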

The results in Table 1 show that Categorical Normalizing Flows achieve nearly optimal performance. Although we model a lower bound in continuous space, our flows can indeed model discrete distributions precisely. Interestingly, representing the categories by a simple mixture model is sufficient for achieving these results. We observe the same trend in domains with more complex relations between categories, such as on graphs and language modeling, presumably because both the coupling layers and the prior distribution rest upon logistic distributions as well. Variational dequantization performs worse on the shuffling dataset, while on set summation with ordinal data, the gap to the optimum is smaller. The same holds for Discrete NFs, although it is worth noting that unlike Categorical Normalizing Flows, optimizing Discrete NFs had issues due to their gradient approximations.

Model Set shuffling Set summation
Discrete NF Tran et al. (2019) bpd bpd
Variational Dequant. Ho et al. (2019) bpd bpd
CNF Mixture model bpd bpd
CNF Linear flows bpd bpd
CNF Variational Encoding bpd bpd
Optimal bpd bpd
Table 1: Results on set modeling. Metric used is bits per categorical variable (dimension).

6.2 Graph coloring

Graph coloring is a well-known combinatorial problem Bondy et al. (1976) where for a graph $G$, the task is to assign each node one out of $K$ colors such that no two adjacent nodes have the same color. Modeling the distribution of valid color assignments to arbitrary graphs is NP-complete. To train models on such a distribution, we generate a dataset of valid graph colorings for randomly sampled graphs. To further investigate the effect of complexity, we create two dataset versions, one with small graphs and another with large graphs, as larger graphs are commonly harder to solve.

We compare GraphCNF to a variational autoencoder and an autoregressive prediction model which generates one node at a time. As graph coloring does not require edge generation, we only use the first step of GraphCNF’s three-step generation. For all models, we apply the same Graph Attention network Veličković et al. (2018) for fairness. As autoregressive models require a manually prescribed node order, we compare the following: a random ordering per graph; largest_first, which is inspired by heuristics of automated theorem provers that start from the nodes with the most connections; and smallest_first, where we reverse the order of the previous heuristic. We evaluate the models on a held-out test set on which we measure the likelihood of the color assignments in bits per node. Secondly, we sample one color assignment per model for each test graph, and report the proportion of valid colorings.
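The node orderings and the validity criterion can be sketched as follows. Interpreting "most connections" as node degree, the tie-breaking by node index, and the function names are assumptions of this example.

```python
import numpy as np

def node_order(adj, strategy, rng=None):
    """Return a node ordering for an autoregressive model (illustrative only)."""
    degrees = adj.sum(axis=1)
    if strategy == "largest_first":
        return np.argsort(-degrees, kind="stable")
    if strategy == "smallest_first":
        return np.argsort(degrees, kind="stable")
    if strategy == "random":
        return rng.permutation(len(adj))
    raise ValueError(f"unknown strategy: {strategy}")

def is_valid_coloring(adj, colors):
    """A coloring is valid if no edge connects two nodes of the same color."""
    same_color = colors[:, None] == colors[None, :]
    return not np.any(same_color & (adj > 0))

# Hypothetical 4-node graph: a triangle 0-1-2 with node 3 attached to node 0.
adj = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (0, 3)]:
    adj[i, j] = adj[j, i] = 1

order_large = node_order(adj, "largest_first")
order_small = node_order(adj, "smallest_first")
valid = is_valid_coloring(adj, np.array([0, 1, 2, 1]))
invalid = is_valid_coloring(adj, np.array([0, 1, 1, 2]))
```

The reported validity is then the fraction of test graphs for which the sampled color assignment passes this check.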

The results in Table 2 show that the node ordering indeed has a significant effect on the autoregressive model’s performance. While the smallest_first ordering leads to few valid solutions on the large dataset, reversing the order simplifies the task for the model such that it generates more than twice as many valid color assignments. In contrast, GraphCNF is invariant to the order of nodes. Despite generating all nodes in parallel, it outperforms all node orderings on the small dataset, while being close to the best ordering on the larger dataset. This invariance property is especially beneficial in tasks where an optimal order of nodes is not known, like molecule generation. Despite having more parameters, sampling with GraphCNF is also considerably faster than with the autoregressive models.

Method Validity Bits per node Time Validity Bits per node Time
VAE bpd s bpd s
RNN + Smallest_first bpd s bpd s
RNN + Random bpd s bpd s
RNN + Largest_first bpd s bpd s
GraphCNF bpd s bpd
Table 2: Results on the graph coloring problem. Runtimes are measured on a NVIDIA TitanRTX.

6.3 Molecule generation

Modeling and generating graphs is crucial in biology and chemistry for applications such as drug discovery and property optimization, where molecule generation has emerged as a common benchmark Jin et al. (2018); Madhawa et al. (2019); Shi et al. (2020). In a molecule graph, the nodes are atoms and the edges represent bonds between atoms, both represented by categorical features. Using a dataset of existing molecules, the goal is to learn a distribution of valid molecules as not all possible combinations of atoms and bonds are valid. We perform experiments on the Zinc250k Irwin et al. (2012) dataset which consists of 250,000 drug-like molecules. The molecules contain up to 38 atoms of 9 different types, with three different bond types possible between the atoms. For comparability, we follow the preprocessing of Shi et al. (2020).

As baselines for GraphCNF, we focus on models that consider molecules as graphs and not as text representations. As VAE-based approaches, we consider R-VAE Ma et al. (2018) and Junction-Tree VAE (JT-VAE) Jin et al. (2018). R-VAE is a one-shot generation model using regularization to ensure semantic validity. JT-VAE represents a molecule as a junction tree of sub-graphs which are obtained from the training dataset. We also compare our model to GraphNVP Madhawa et al. (2019) and GraphAF Shi et al. (2020). The models are evaluated by sampling 10,000 examples and measuring the proportion of valid molecules. We also report the proportion of unique molecules and novel samples that are not in the training dataset. These metrics prevent models from memorizing a small subset of graphs. Finally, the reconstruction rate describes whether graphs can be accurately decoded from latent space. Normalizing flows naturally score 100% due to their invertible mapping, and we achieve the same with our encoding despite no guarantees.
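The sample-based metrics can be computed with a few set operations. The sketch below assumes that each generated graph has been converted to a canonical string identifier and that a validity predicate is available; both are placeholders here. Conventions for the denominators differ between papers, so the normalization used below (uniqueness among valid samples, novelty among unique valid samples) is one assumed choice.

```python
def generation_metrics(samples, is_valid, training_set):
    """Compute validity, uniqueness and novelty for generated samples.

    samples: canonical string identifiers of generated graphs (hypothetical format).
    is_valid: callable returning True for a semantically valid sample.
    training_set: set of canonical identifiers seen during training.
    """
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(samples) if samples else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy example with a placeholder validity check.
samples = ["CCO", "CCO", "C=O", "invalid", "CCN"]
validity, uniqueness, novelty = generation_metrics(
    samples, lambda s: s != "invalid", {"CCO"}
)
```

In practice, the validity predicate would be a chemistry toolkit check on the decoded molecule rather than a string comparison.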

Table 3 shows that GraphCNF generates almost twice as many valid molecules as other one-shot approaches. Yet, the novelty and uniqueness stay at almost 100%. Even the autoregressive normalizing flow, GraphAF, is outperformed by GraphCNF by 15%. However, the rules for generating valid molecules can be enforced in autoregressive models by masking out the invalid outputs. This has been the case for JT-VAE as it has been trained with those manual rules, and thus achieves a validity of 100%. Nevertheless, we are mainly interested in the model’s capability of learning the rules by itself without being specific to any application. While GraphNVP and GraphAF sample with a lower standard deviation from the prior to increase validity, we explicitly sample from the original prior to underline that our model covers the whole latent space well. Surprisingly, we found that most invalid graphs actually consist of two or more disconnected sub-graphs that are valid in isolation. This can happen as one-shot generation models have no guidance regarding generating a single connected graph. By taking the largest sub-graph of these predictions, our model generates almost solely valid molecules without any manually encoded rules. We also evaluated our model on a second dataset, Moses Polykovskiy et al. (2018), and achieved similar scores as shown in Appendix C.
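Taking the largest sub-graph of a prediction amounts to finding its largest connected component, which can be sketched with a breadth-first search. The edge-list input format and the function name are assumptions of this example.

```python
from collections import deque

def largest_component(num_nodes, edges):
    """Return the sorted node indices of the largest connected component."""
    neighbors = {i: set() for i in range(num_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, best = set(), []
    for start in range(num_nodes):
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        if len(component) > len(best):
            best = component
    return sorted(best)

# Hypothetical prediction with two disconnected parts: 0-1-2-3 and 4-5.
kept = largest_component(6, [(0, 1), (1, 2), (2, 3), (4, 5)])
```

Only the virtual edges between the kept nodes and the remaining nodes are discarded, so no learned rule about atoms or bonds is overridden by this post-processing step.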

Method Validity Uniqueness Novelty Reconstruction Parallel General
JT-VAE Jin et al. (2018)
GraphAF Shi et al. (2020)
R-VAE Ma et al. (2018)
GraphNVP Madhawa et al. (2019)
Table 3: Performance on molecule generation trained on Zinc250k Irwin et al. (2012), calculated on 10k samples and averaged over 4 runs. Scores of baselines are taken from their respective papers.

6.4 Language modeling

Finally, we test Categorical Normalizing Flows on language modeling. We experiment with two popular character-level datasets, Penn Treebank Marcus et al. (1994) and text8. We also test a word-level dataset, Wikitext103 Merity et al. (2017), whose considerably larger vocabulary Discrete NF cannot handle due to its gradient approximations Tran et al. (2019). We follow the setup of Ziegler and Rush (2019) for the Penn Treebank and train on sequences of 256 tokens for the other two datasets. Each flow applies a single mixture coupling layer that is autoregressive across time and latent dimensions. We use the same LSTM Hochreiter and Schmidhuber (1997) with hidden size 1024 for all flows and baselines.

Model Penn Treebank text8 Wikitext103
LSTM baseline bpd bpd bpd
Latent NF Ziegler and Rush (2019) bpd - -
CNF - 1 layer bpd (0.00) bpd (0.00) bpd (0.32)
Table 4: Results on language modeling. The reconstruction error is shown in brackets.

As shown in Table 4, Categorical Normalizing Flows with a single layer are performing on par with their autoregressive baselines. When comparing to the flow by Ziegler and Rush (2019), we see a significant improvement while using only 1 instead of 5 flows. This underlines the importance of using a factorized posterior. For word-level language modeling, a single mixture coupling layer is not flexible enough to learn all possible sequences. Still, this could be improved by using deeper flows.

7 Conclusion

We present Categorical Normalizing Flows, which learn a categorical, discrete distribution by jointly optimizing the representation of categorical data in continuous latent space and the model likelihood of a normalizing flow. To this end, we apply variational inference with a factorized posterior to maintain an almost unique decoding while allowing flexible encoding distributions. We find that a plain mixture model is sufficient for modeling discrete distributions accurately while providing an efficient way of encoding and decoding categorical data. Furthermore, GraphCNF, a normalizing flow for graph modeling based on CNFs, outperforms autoregressive and one-shot approaches on molecule generation and graph coloring while being invariant to the node order. This emphasizes the potential of normalizing flows on categorical tasks, especially those with non-sequential data.


  • J. A. Bondy, U. S. R. Murty, et al. (1976) Graph theory with applications. Vol. 290, Macmillan London. Cited by: §6.2.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using Real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France. External Links: Link Cited by: §1, §1, §2, §4.1, §4.2.
  • C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019) Neural Spline Flows. In Advances in Neural Information Processing Systems, pp. 7509–7520. External Links: Link Cited by: §1.
  • D. Hendrycks and K. Gimpel (2016) Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415v3. Cited by: §B.1, §D.2, §D.4.
  • J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel (2019) Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Vol. 97, Long Beach, California, USA, pp. 2722–2730. External Links: Link Cited by: §A.3, §1, §1, §2, §3.2, §6.1, Table 1, §6.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §6.4.
  • E. Hoogeboom, T. S. Cohen, and J. M. Tomczak (2020) Learning Discrete Distributions by Dequantization. arXiv preprint arXiv:2001.11235v1. External Links: Link Cited by: §1, §2.
  • E. Hoogeboom, J. W. T. Peters, R. v. d. Berg, and M. Welling (2019) Integer Discrete Flows and Lossless Compression. In Advances in Neural Information Processing Systems 32, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), Vancouver, BC, Canada, pp. 12134–12144. External Links: Link Cited by: §1, §5.
  • J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman (2012) ZINC: A Free Tool to Discover Chemistry for Biology. Journal of Chemical Information and Modeling 52 (7), pp. 1757–1768. External Links: Link, Document Cited by: §B.2, Figure 4, Table 5, Appendix C, §D.3, §6.3, Table 3.
  • W. Jin, R. Barzilay, and T. Jaakkola (2018) Junction tree variational autoencoder for molecular graph generation. 35th International Conference on Machine Learning, ICML 2018 5, pp. 3632–3648. External Links: Link, ISBN 9781510867963 Cited by: Table 6, §5, §6.3, §6.3, Table 3.
  • S. Kim, S. Lee, J. Song, J. Kim, and S. Yoon (2019) FloWaveNet : A Generative Flow for Raw Audio. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 3370–3378. External Links: Link Cited by: §1.
  • D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, Banff, AB, Canada. External Links: Link Cited by: §3.2, §5.
  • D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR) 2015, Y. Bengio and Y. LeCun (Eds.), San Diego, CA, USA. External Links: Link, ISBN 9781450300728, Document, ISSN 09252312 Cited by: §D.1.
  • D. P. Kingma and P. Dhariwal (2018) Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. 10215–10224. External Links: Link Cited by: §A.2, §1, §4.1, §6.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems 29, pp. 4743–4751. External Links: Link, ISSN 10495258 Cited by: §1, §5.
  • I. Kobyzev, S. Prince, and M. A. Brubaker (2019) Normalizing Flows: Introduction and Ideas. arXiv preprint arXiv:1908.09257v1. External Links: Link Cited by: §2.
  • H. Lemos, M. Prates, P. Avelar, and L. Lamb (2019) Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems. Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, Vol. 2019-Novem, pp. 879–885. External Links: ISBN 9781728137988, Document, ISSN 10823409 Cited by: §D.2.
  • Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia (2018) Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324v1. External Links: Link Cited by: §1.
  • R. Liao, Y. Li, Y. Song, S. Wang, W. Hamilton, D. K. Duvenaud, R. Urtasun, and R. Zemel (2019) Efficient Graph Generation with Graph Recurrent Attention Networks. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. D’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 4255–4265. External Links: Link Cited by: §4, §5.
  • J. Liu, A. Kumar, J. Ba, J. Kiros, and K. Swersky (2019) Graph Normalizing Flows. In Advances in Neural Information Processing Systems, pp. 13556–13566. External Links: Link Cited by: §5.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the Variance of the Adaptive Learning Rate and Beyond. In International Conference on Learning Representations, External Links: Link Cited by: §D.1, Table 9.
  • Q. Liu, M. Allamanis, M. Brockschmidt, and A. Gaunt (2018) Constrained Graph Variational Autoencoders for Molecule Design. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7795–7804. External Links: Link Cited by: §5.
  • T. Ma, J. Chen, and C. Xiao (2018) Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7113–7124. External Links: Link Cited by: §5, §6.3, Table 3.
  • K. Madhawa, K. Ishiguro, K. Nakago, and M. Abe (2019) GraphNVP: An Invertible Flow Model for Generating Molecular Graphs. arXiv preprint arXiv:1905.11600v1. External Links: Link Cited by: §B.2, §5, §6.3, §6.3, Table 3.
  • M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger (1994) The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the Workshop on Human Language Technology, Plainsboro, NJ, pp. 114–119. External Links: Link Cited by: §D.4, §6.4.
  • S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer Sentinel Mixture Models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §D.4, §6.4.
  • T. Mikolov, I. Sutskever, A. Deoras, H. Le, S. Kombrink, and J. Cernocky (2012) Subword language modeling with neural networks. preprint (http://www. fit. vutbr. cz/imikolov/rnnlm/char. pdf) 8. Cited by: §D.4, §D.4.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §D.1, Appendix D.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link, Document Cited by: §D.4.
  • D. Polykovskiy, A. Zhebrak, B. Sanchez-Lengeling, S. Golovanov, O. Tatanov, S. Belyaev, R. Kurbanov, A. Artamonov, V. Aladinskiy, M. Veselov, A. Kadurin, S. Johansson, H. Chen, S. Nikolenko, A. Aspuru-Guzik, and A. Zhavoronkov (2018) Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Computing Research Repository (arXiv:1811.12823, Version 3), pp. 1–17. External Links: Link Cited by: Table 6, Appendix C, §6.3.
  • M. Popova, M. Shvets, J. Oliva, and O. Isayev (2019) MolecularRNN: Generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372v1. External Links: Link Cited by: §4.
  • R. Prenger, R. Valle, and B. Catanzaro (2019) Waveglow: A flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. Cited by: §1.
  • A. Rahimi, T. Cohn, and T. Baldwin (2018) Semi-supervised User Geolocation via Graph Convolutional Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2009–2019. External Links: Link, Document Cited by: §B.1.
  • D. J. Rezende and S. Mohamed (2015) Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, Lille, France. External Links: Link Cited by: §1, §2, §3.2.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling Relational Data with Graph Convolutional Networks. In The Semantic Web, A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam (Eds.), Cham, pp. 593–607. External Links: Link, ISBN 978-3-319-93417-4 Cited by: §4.2.
  • C. Shi, M. Xu, Z. Zhu, W. Zhang, M. Zhang, and J. Tang (2020) GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation. In International Conference on Learning Representations, External Links: Link Cited by: Table 6, Appendix C, §D.3, §1, §4, §5, §6.3, §6.3, Table 3.
  • M. Simonovsky and N. Komodakis (2018) GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders.. In International Conference on Artificial Neural Networks, Vol. abs/1802.0, pp. 412–422. External Links: Link Cited by: §5.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. External Links: ISBN 1532-4435, Document, ISSN 15337928 Cited by: §D.4.
  • E. Tabak and E. Vanden Eijnden (2010) Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences 8 (1), pp. 217–233. External Links: Link, ISSN 1539-6746 Cited by: §2.
  • L. Theis, A. Van Den Oord, and M. Bethge (2016) A note on the evaluation of generative models. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings. External Links: Link Cited by: §2.
  • J. M. Tomczak and M. Welling (2017) Improving Variational Auto-Encoders using Householder Flow. arXiv preprint arXiv:1611.09630v4. External Links: Link Cited by: §5.
  • D. Tran, K. Vafa, K. K. Agrawal, L. Dinh, and B. Poole (2019) Discrete Flows: Invertible Generative Models of Discrete Data. In Advances in Neural Information Processing Systems, pp. 14692–14701. External Links: Link Cited by: §D.1, §1, §5, §6.1, §6.4, Table 1.
  • B. Uria, I. Murray, and H. Larochelle (2013) RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, Vol. 26, pp. 2175–2183. External Links: Link, ISSN 10495258 Cited by: §2.
  • R. Van Den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling (2018) Sylvester normalizing flows for variational inference. 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018 1, pp. 393–402. External Links: Link, ISBN 9781510871601 Cited by: §1, §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is All you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §B.1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph Attention Networks. International Conference on Learning Representations. External Links: Link Cited by: §B.1, §6.2.
  • O. Vinyals, S. Bengio, and M. Kudlur (2016) Order Matters: Sequence to sequence for sets. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §4.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. External Links: Link Cited by: §1.
  • J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec (2018) GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 5708–5717. External Links: Link Cited by: §1, §4, §5.
  • J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2018) Graph Neural Networks: A Review of Methods and Applications. arXiv preprint arXiv:1812.08434. External Links: Link Cited by: §B.1, §1.
  • Z. M. Ziegler and A. M. Rush (2019) Latent Normalizing Flows for Discrete Sequences. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Vol. 97, Long Beach, California, USA, pp. 7673–7682. External Links: Link Cited by: §B.2, §D.4, §5, §6.4, §6.4, Table 4.

Appendix A Visualizations of encoding distributions

In the following, we visualize the different encoding distributions we tested in Categorical Normalizing Flows, and outline implementation details for full reproducibility.

A.1 Mixture of logistics

The mixture model represents each category by an independent logistic distribution in continuous latent space, as visualized in Figure 2. Specifically, the encoder distribution $q(z|x)$, with $x$ being the categorical input and $z$ the continuous latent representation, can be written as:

$$q(z|x) = \prod_{i=1}^{N} g\big(z_i \mid \mu(x_i), \sigma(x_i)\big), \qquad g(v \mid \mu, \sigma) = \prod_{j=1}^{d} \frac{\exp(-\epsilon_j)}{\sigma_j \left(1 + \exp(-\epsilon_j)\right)^{2}}, \quad \epsilon_j = \frac{v_j - \mu_j}{\sigma_j}$$

Here, $N$ is the number of categorical variables, $g$ represents the logistic distribution, and $d$ the dimensionality of the continuous latent space per category. Both parameters $\mu$ and $\sigma$ are learnable and can be implemented via a simple table lookup. For decoding the discrete categorical data from continuous space, the true posterior is calculated by applying Bayes' rule:

$$p(x_i \mid z_i) = \frac{\tilde{p}(x_i)\, q(z_i \mid x_i)}{\sum_{\hat{x}} \tilde{p}(\hat{x})\, q(z_i \mid \hat{x})}$$

where the prior over categories, $\tilde{p}(x_i)$, is calculated based on the category frequencies in the training dataset. Although the posterior models a distribution over categories, the distribution is strongly peaked for most continuous points in the latent space, as the probability steeply decreases the further a point is away from a specific mode. Furthermore, the distribution is trained to minimize the posterior entropy, which pushes the posterior to be deterministic for commonly sampled continuous points. Hence, the posterior partitions the latent space into fragments in which all continuous points are assigned to one discrete category. The borders between the fragments, where the posterior is not close to deterministic, are small and very rarely sampled by the encoder distribution. We visualize the partitioning for an example of three categories in Figure 2.
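The encoding and decoding steps described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the example lookup tables `means` and `scales`, and the prior values are hypothetical, and the code uses plain Python rather than PyTorch. It evaluates a factorized logistic log-density per category and applies Bayes' rule to obtain the posterior over categories for a latent point.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def logistic_log_prob(z, mean, scale):
    """Log-density of a factorized logistic distribution at the point z."""
    total = 0.0
    for z_j, m_j, s_j in zip(z, mean, scale):
        eps = (z_j - m_j) / s_j
        # log of exp(-eps) / (s * (1 + exp(-eps))^2)
        total += -eps - 2.0 * softplus(-eps) - math.log(s_j)
    return total

def posterior(z, means, scales, prior):
    """Bayes' rule over categories: p(x|z) is proportional to prior(x) * q(z|x)."""
    log_joint = [math.log(p) + logistic_log_prob(z, m, s)
                 for p, m, s in zip(prior, means, scales)]
    top = max(log_joint)  # subtract the maximum before exponentiating
    weights = [math.exp(value - top) for value in log_joint]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical lookup tables: 3 categories, 2 latent dimensions per category.
means = [[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
scales = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
prior = [0.5, 0.3, 0.2]  # assumed category frequencies

probs = posterior([-2.8, 0.1], means, scales, prior)
```

For a point close to the first mean, almost all posterior mass falls on the first category, which matches the strongly peaked posterior described in the text.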

(a) Encoding distribution (b) Posterior partitioning
Figure 2: Visualization of the mixture model encoding and decoding for 3 categories. Best viewed in color. (a) Each category is represented by a logistic distribution with independent mean and scale which are learned during training. (b) The posterior partitions the latent space which we visualize by the background color. The borders mark the regions in which we have an almost unique decoding of the corresponding mixture ( decoding probability). Note that these borders do not directly correspond to the Euclidean distance as we use logistic distributions instead of Gaussians.

Notably, the posterior can also be learned by a second, small linear network. While this possibly introduces a small KL divergence to the true posterior, we observed it to vanish quickly over training iterations and did not find any significant difference compared to using the true posterior, apart from slower training in the very early stages. Additionally, we were able to achieve very low reconstruction errors in two dimensions for most discrete distributions of categories. Nevertheless, a higher dimensionality of the latent space is not only crucial for a large number of categories, as in word-level vocabularies, but can also be beneficial for more complex problems. Still, using an even higher dimensionality rarely caused any problems or significantly decreased performance. Presumably, the flow learns to ignore latent dimensions that are not needed for modeling the discrete distribution. To summarize, the dimensionality of the latent space should be considered an important but robust hyperparameter, which can be tuned at an early stage of the hyperparameter search.

In the very first training iterations, it can happen that the mixtures of multiple categories are at the exact same spot and . This can be easily resolved by either weighting the reconstruction loss higher for the first 500 iterations, or initializing the means of the mixtures with a higher variance. Once the mixtures are separated, the model has no incentive to group them together again as it has started to learn the underlying discrete distribution, which results in a considerably higher likelihood than a plain uniform distribution.

A.2 Linear flows

The flexibility of the mixture model can be increased by applying normalizing flows on each mixture that depend on the discrete category. We refer to this approach as linear flows as the flows are applied for each categorical input variable independently. We visualize possible encoding distributions with linear flows in Figure 3. Formally, we can write the distribution as:


where are invertible, smooth mappings. In particular, we use here again a sequence of coupling layers with activation normalization and invertible 1x1 convolutions Kingma and Dhariwal (2018). Both the activation normalization and coupling use the category as additional external input to determine their transformation parameters by a neural network. The class-conditional transformations could also be implemented by storing parameter sets for the coupling layer neural networks, which is however inefficient for a larger number of categories. Furthermore, in coupling layers, we apply a channel mask that splits over latent dimensionality into two equally sized parts, of which one is transformed using the other as input.
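To make the channel mask and the category-conditioned coupling concrete, the following is a minimal sketch of a single affine coupling step. It is an illustration under assumptions, not the paper's architecture: `toy_param_net` is a hypothetical stand-in for the neural network that produces the transformation parameters, the latent dimensionality is assumed to be even, and activation normalization and the invertible 1x1 convolution are omitted.

```python
import math

def split_channels(z):
    """Channel mask: the first half conditions, the second half is transformed."""
    half = len(z) // 2
    return z[:half], z[half:]

def coupling_forward(z, param_net, category):
    """Affine coupling step; returns the output and the log-determinant."""
    z_cond, z_trans = split_channels(z)
    log_scale, shift = param_net(z_cond, category)
    out = [v * math.exp(s) + t for v, s, t in zip(z_trans, log_scale, shift)]
    return z_cond + out, sum(log_scale)

def coupling_inverse(y, param_net, category):
    """Exact inverse: the conditioning half is unchanged, so parameters match."""
    y_cond, y_trans = split_channels(y)
    log_scale, shift = param_net(y_cond, category)
    z_trans = [(v - t) * math.exp(-s) for v, s, t in zip(y_trans, log_scale, shift)]
    return y_cond + z_trans

def toy_param_net(z_cond, category):
    """Hypothetical parameter network taking the category as external input."""
    log_scale = [0.1 * category + 0.05 * v for v in z_cond]
    shift = [float(category) - v for v in z_cond]
    return log_scale, shift

z = [0.5, -1.0, 2.0, 0.3]
y, log_det = coupling_forward(z, toy_param_net, category=2)
z_back = coupling_inverse(y, toy_param_net, category=2)
```

Passing the category to one shared parameter network, rather than keeping one network per category, is what keeps the number of parameters independent of the number of categories.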

(a) Encoding distribution (b) Posterior partitioning
Figure 3: Visualization of the linear flow encoding and decoding for 3 categories. Best viewed in color. (a) The distribution per category is not restricted to a simple logistic and can be multi-modal, rotated or transformed even more. (b) The posterior partitions the latent space which we visualize by the background color. The borders show from when on we have an almost unique decoding of the corresponding category distribution ( decoding probability).

Similarly to the mixture model, we can calculate the true posterior using Bayes' rule. Thereby, we sample from the flow for , and need to invert the flows for all other categories. Note that as the inverse of the flow also needs to be differentiable in this situation, we apply affine coupling layers instead of logistic mixture layers. However, this gets computationally expensive for more than 20 categories, and thus we used a single-layer linear network as posterior in these situations. The partitions of the latent space that can be learned by the encoding distribution are much more flexible, as illustrated in Figure 3.

We experimented with increasing sizes of linear flows, but noticed that the encoding distribution usually fell back to rotated logistic distributions. The fact that the added complexity and flexibility by the flows is not being used further supports our observation that mixture models are indeed sufficient for representing categorical data well in normalizing flows.

A.3 Variational Encoding

The third encoding distribution we experimented with is inspired by variational dequantization Ho et al. (2019) and models by one flow across all categorical variables. Still, the posterior, , is applied per categorical variable independently to maintain a unique decoding and partitioning of the latent space. The normalizing flow again consists of a sequence of logistic mixture coupling layers with activation normalization and invertible 1x1 convolutions. The inner feature network of the coupling layers depends on the task the normalizing flow is applied to. Hence, for sets, we used a transformer architecture, while for the graph experiments, we used a GNN. On the language modeling task, we used a Bi-LSTM model to generate the transformation parameters. All those networks use the discrete, categorical data as additional input.

As the true posterior cannot be found for this distribution, we apply a two-layer linear network to determine . While the reconstruction error was again very low, we again observed that the model mainly relied on a logistic mixture model, even if we initialized it differently beforehand. Variational dequantization is presumably important for images as every pixel value has its own independent Gaussian noise signal. This noise can be nicely modeled by a flexible dequantization distribution, which needs to be complex enough to capture the true mean and variance of this Gaussian noise. In categorical distributions, however, we do not have such noise signals and therefore do not seem to benefit from variational encodings.

Appendix B Implementation details of GraphCNF

In this section, we describe further implementation details of GraphCNF. We detail the implementation of the Edge-GNN model used in the coupling layers of GraphCNF, and discuss how we encode graphs of different sizes.

B.1 Edge Graph Neural Network

GraphCNF implements a three-step generation approach, in which the second and third steps also model latent variables for edges. Hence, in the coupling layers, we need a graph neural network which supports both node and edge features. We implement this by alternating between updates of the edge and the node features. Specifically, given node features and edge features at layer , we update those as follows:


The update functions, and , are both common GNN layers with slight adjustments to allow a communication between nodes and edges. Before detailing the update layers, it should be noted that we use Highway GNNs Rahimi et al. (2018) which apply a gating mechanism. Specifically, the updates for the nodes are determined by:


where is the output of the GNN layer. and represent single linear layer networks where has a consecutive sigmoid activation to limit the outputs between 0 and 1. The edge updates are applied in a similar manner. We observed that such gated update functions help the gradient flow through the layers back to the input. This is important for normalizing flows as coupling layers, or transformations in general, strongly depend on previous transformations. Hence, we apply the same gating mechanism in the first step of GraphCNF, .
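The gating mechanism can be sketched as follows. Since the exact update equation is not reproduced here, the blend below follows the common highway-network form and is an assumption, in particular regarding which term the gate multiplies. The names `highway_update`, `H` and `T` and the dictionary layout of the linear layers are hypothetical.

```python
import math

def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def linear(layer, x):
    """Single linear layer; layer['W'] is a list of rows, layer['b'] the bias."""
    return [sum(w * x_k for w, x_k in zip(row, x)) + b
            for row, b in zip(layer["W"], layer["b"])]

def highway_update(v, gnn_out, H, T):
    """Blend previous features v with a transform of the GNN layer output."""
    candidate = linear(H, gnn_out)
    gate = [sigmoid(g) for g in linear(T, gnn_out)]  # values in (0, 1)
    # Assumed form: a gate near 0 keeps v, a gate near 1 takes the candidate.
    return [(1.0 - g) * v_k + g * c for g, v_k, c in zip(gate, v, candidate)]

identity = {"W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]}
closed_gate = {"W": [[0.0, 0.0], [0.0, 0.0]], "b": [-30.0, -30.0]}
open_gate = {"W": [[0.0, 0.0], [0.0, 0.0]], "b": [30.0, 30.0]}

kept = highway_update([1.0, 2.0], [5.0, 6.0], identity, closed_gate)
replaced = highway_update([1.0, 2.0], [5.0, 6.0], identity, open_gate)
```

With a closed gate the layer passes its input through almost unchanged, which is the property that helps gradients reach earlier layers.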

Next, we detail the GNN layers to obtain and . The edge update layer resembles a graph convolutional layer Zhou et al. (2018), and can be specified as follows:


where represents the features of the edge between node and . stands for a GELU Hendrycks and Gimpel (2016) non-linear activation. Using more complex transformations did not significantly improve the performance of GraphCNF.

To update the node representations, we took inspiration from the transformer architecture Vaswani et al. (2017) and use a modified multi-head attention layer. In particular, a linear transformation maps each node to a key, query and value vector:


The attention value is usually computed based on the dot product between two nodes. However, as we explicitly have features for the edge between the two nodes, we use those to control the attention mechanism. Hence, we have an additional weight matrix to map the edge features to an attention bias:


where represents the hidden dimensionality of the features. Finally, we also add an edge-based value vector to allow a full communication from edges to nodes. Overall, the updated node features are calculated by:


Alternatively to transformers, we also experimented with Graph Attention Networks Veličković et al. (2018). However, those showed slightly worse results which is why we used the transformer-based layer.
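A single-head version of the edge-controlled attention described in this subsection might look like the sketch below. It assumes that the edge-derived bias is added to the scaled dot product before the softmax and that the edge-based value is added to the neighbour's value vector; both choices, as well as all names, are illustrative assumptions. The linear maps that produce queries, keys, values, biases and edge values are taken as precomputed inputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def node_update(i, queries, keys, values, edge_bias, edge_values):
    """Attention output for node i with an edge-dependent bias and value."""
    d = len(queries[i])
    n = len(keys)
    # Scaled dot product plus a scalar bias computed from the edge (i, j).
    scores = [dot(queries[i], keys[j]) / math.sqrt(d) + edge_bias[i][j]
              for j in range(n)]
    attn = softmax(scores)
    out = [0.0] * len(values[0])
    for j in range(n):
        for k in range(len(out)):
            out[k] += attn[j] * (values[j][k] + edge_values[i][j][k])
    return out

# Two nodes with 2-dimensional queries/keys and 1-dimensional values.
queries = [[0.0, 0.0], [0.0, 0.0]]
keys = [[0.0, 0.0], [0.0, 0.0]]
values = [[1.0], [3.0]]
no_bias = [[0.0, 0.0], [0.0, 0.0]]
masked = [[-1e9, 0.0], [0.0, 0.0]]  # bias that suppresses attention 0 -> 0
edge_values = [[[0.0], [0.0]], [[0.0], [0.0]]]

uniform = node_update(0, queries, keys, values, no_bias, edge_values)
focused = node_update(0, queries, keys, values, masked, edge_values)
```

The second call shows how the edge bias alone can steer attention: a strongly negative bias removes a neighbour from the weighted sum even though the dot products are identical.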

In step 2, the (binary) adjacency matrix is given such that each node has a limited number of neighbours. A full transformer-based architecture as above is then not necessary anymore as every atom has usually between 1 and 3 neighbours. Especially the node-to-node dot product is expensive to perform. Hence, we experimented with a node update layer where the attention is purely based on the edge features in step 2. We found both to work equally well while the second is computationally more efficient.

B.2 Encoding graph size

The number of nodes varies across graphs in the dataset, and hence a generative model needs to be flexible regarding . To encode the number of nodes, we use a similar approach as Ziegler and Rush (2019) for sequences and add a simple prior over . The prior is parameterized based on the graph size frequency in the training set. Alternatively, to integrate the number of nodes in the latent space, we could add virtual nodes to the model, similar to virtual edges. Every graph in the training dataset would be filled up to the maximum number of nodes (38 for Zinc250k Irwin et al. (2012)) by adding such virtual nodes. Meanwhile, during sampling we remove virtual nodes if the model generates such. GraphNVP Madhawa et al. (2019) uses such an encoding as their coupling layers did not support flexible graph sizes. However, in experiments, we obtained similar performance with both size encodings while the external prior is computationally more efficient and therefore used in this paper.
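The external size prior can be sketched in a few lines. This is a minimal illustration, assuming the prior is the empirical frequency of each graph size in the training set and that the size is sampled first, before the flow generates a graph of that size. The function names are hypothetical.

```python
import math
import random
from collections import Counter

def fit_size_prior(train_sizes):
    """Empirical distribution over the number of nodes in the training set."""
    counts = Counter(train_sizes)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}

def sample_size(prior, rng):
    """Draw a graph size; the flow would then generate a graph of that size."""
    sizes = list(prior)
    return rng.choices(sizes, weights=[prior[n] for n in sizes], k=1)[0]

def total_log_likelihood(prior, num_nodes, flow_log_prob):
    """log p(graph) = log p(size) + log p(graph | size), in nats."""
    if num_nodes not in prior:
        return float("-inf")  # sizes never seen in training get zero mass
    return math.log(prior[num_nodes]) + flow_log_prob

prior = fit_size_prior([3, 3, 4, 5])
size = sample_size(prior, random.Random(0))
log_p = total_log_likelihood(prior, 3, -10.0)
```

Compared to padding every graph with virtual nodes, this keeps the flow itself free of any size modeling, which is consistent with the efficiency argument made above.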

Appendix C Additional results on molecule generation

In this section, we present additional results on the molecule generation task. Table 5 shows the results of our model on the Zinc250k Irwin et al. (2012) dataset including the likelihood on the test set in bits per node. We calculate this metric by summing the log likelihood of all latent variables, both nodes and edges, and dividing by the number of nodes. Although the number of edges scales quadratically with the number of nodes, a higher proportion of those are virtual and did not have a significant contribution to the likelihood. Thus, bits per node constitutes a good metric for comparing the likelihood of molecules of varying size. Additionally, we report the standard deviation for all metrics over 4 independent runs. For this, we initialized the random number generator with the seeds 42, 43, 44 and 45 before creating the model. The specific validity values we obtained are 80.74%, 81.16%, 85.3% and 86.44% (in no particular order). It should be noted that the standard deviation among those models is considerably high. This is because the models in molecule generation are trained on maximizing the likelihood of the training dataset and not explicitly on generating valid molecules. We observed that, across seeds, models that perform better in terms of likelihood do not necessarily perform better in validity.
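As a small worked example, the bits-per-node metric and the spread of the four validity values listed above can be computed as follows. The sketch assumes that log-likelihoods are given in nats and converts them to bits; the function name is hypothetical. Whether the reported standard deviation uses the sample (n - 1) or the population (n) normalization is not stated, so both are computed.

```python
import math
import statistics

def bits_per_node(node_log_probs, edge_log_probs, num_nodes):
    """Negative log-likelihood of one graph in bits, divided by its node count."""
    total_nats = sum(node_log_probs) + sum(edge_log_probs)
    return -total_nats / (math.log(2) * num_nodes)

# Validity values of the four runs as listed in the text (in percent).
validity = [80.74, 81.16, 85.3, 86.44]
mean_validity = statistics.mean(validity)
std_sample = statistics.stdev(validity)       # divides by n - 1
std_population = statistics.pstdev(validity)  # divides by n

# Toy graph: 2 nodes and 2 edge variables, each costing exactly one bit.
example = bits_per_node([-math.log(2)] * 2, [-math.log(2)] * 2, num_nodes=2)
```

The four runs average to 83.41% validity with a sample standard deviation of roughly 2.88 percentage points, which illustrates the considerable spread across seeds.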

Method Validity Uniqueness Novelty Reconstruction Bits per node
GraphCNF bpd
() () () () ()
() () () ()
Table 5: Performance on molecule generation trained on Zinc250k Irwin et al. (2012), with standard deviation calculated over 4 independent runs. See Table 3 for baselines.

We also evaluated GraphCNF on the Moses Polykovskiy et al. (2018) molecule dataset. Moses contains 1.9 million molecules with up to 30 heavy atoms of 7 different types. Again, we follow the preprocessing of Shi et al. (2020) and represent molecules in kekulized form in which hydrogen is removed. The results can be found in Table 6 and show that we achieve very similar scores to the experiments on Zinc250k. Compared to the normalizing flow baseline GraphAF, GraphCNF generates considerably more valid molecules while being parallel in generation, in contrast to the autoregressive GraphAF. JT-VAE uses manually encoded rules for generating valid molecules only, such that its validity rate is 100%. Overall, the experiment on Moses validates that GraphCNF is not specialized on a single dataset but can improve on current flow-based graph models across datasets.

Method Validity Uniqueness Novelty Bits per node
JT-VAE Jin et al. (2018) -
GraphAF Shi et al. (2020) -
GraphCNF bpd
() () () ()
() () ()
Table 6: Performance on molecule generation on Moses Polykovskiy et al. (2018), calculated on 10k samples and averaged over 4 runs. Score for GraphAF taken from Shi et al. (2020), and JT-VAE from Polykovskiy et al. (2018).

Finally, we show 12 randomly sampled molecules from our model in Figure 4. In general, GraphCNF is able to generate a very diverse set of molecules with a variety of atom types. This qualitative analysis supports the previous quantitative results of obtaining close to 100% uniqueness on 10k samples.

Figure 4: Visualization of molecules generated by GraphCNF which has been trained on the Zinc250k Irwin et al. (2012) dataset. Nodes with black connections and no description represent carbon atoms. All of the presented molecules are valid. Best viewed in color and electronically for large molecules.

Appendix D Experimental settings

In this section we detail the hyperparameter settings and datasets for all experiments. All experiments have been implemented using the deep learning framework PyTorch Paszke et al. (2019). The experiments for graph coloring and molecule generation have been executed on a single NVIDIA TitanRTX GPU. The average training time was between 1 and 2 days. The set and language experiments have been executed on a single NVIDIA GTX1080Ti in 4 to 16 hours. All experiments have been repeated with at least 3 different random seeds.

D.1 Set modeling

Dataset details

We use two toy datasets, set shuffling and set summation, to simulate a discrete distribution over sets in our experiments. Note that we do not have a classical train/val/test split, but instead train and test the models on samples from the same discrete distribution. This is because we want to verify whether a categorical normalizing flow and other baselines can model an arbitrary discrete distribution. The special property of sets is that permuting the elements of a set still represents the same set. However, a generative model still has to learn all possible permutations. While an autoregressive model considers those permutations as different data points, a permutation-invariant model such as Categorical Normalizing Flows contains an inductive bias to assign the exact same likelihood to any permutation.

In set shuffling, we only have one set to model which is the following (with categories to ):

This set has possible permutations and is therefore challenging to model. The optimal likelihood in bits per element is calculated by .
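As a hedged illustration of how such an optimum can be computed: if the data consists of all orderings of a single set with n distinct elements, each ordering being equally likely, the entropy per element is log2(n!)/n bits. The function name and the example value of n below are assumptions made for this sketch and are not taken from the paper's code.

```python
import math

def optimal_bits_per_element(n: int) -> float:
    """Entropy per element (in bits) of a uniform distribution over all
    n! orderings of a set with n distinct elements."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # lgamma(n + 1) equals ln(n!), which avoids building the huge factorial.
    log2_factorial = math.lgamma(n + 1) / math.log(2)
    return log2_factorial / n

# Illustrative set size only (an assumption for this sketch).
print(f"{optimal_bits_per_element(8):.3f} bits per element")
```

A model that reaches this value assigns equal probability to every permutation and no probability mass to invalid sequences.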

The set summation dataset consists of 2200 valid sets for and . An example for a valid set is:

For readability, the set is sorted by ascending values, although any permutation of the elements represents the exact same set. Taking into account all possible permutations of the sets in the dataset, we obtain an optimal likelihood of . The values for the sequence length and sum were chosen such that the task is challenging enough to show the differences between Categorical Normalizing Flows and the baselines, but not so challenging that it requires unnecessarily long training times and model complexities.

Hyperparameter details

Table 7 shows an overview of the hyperparameters per model applied on set modeling. We use the notation “{val1, val2, …}” to show the different values we have tried during hyperparameter search. The underlined value denotes the hyperparameter value with the best performance, which was used to generate the results in Table 1.

The number of encoding coupling layers in Categorical Normalizing Flows is listed in the order of the encoding distributions used. The mixture model uses no additional coupling layers, while for the linear flows, we apply 4 affine coupling layers using an external input for the discrete category. For the variational encoding distribution , we use 4 mixture coupling layers across all latent variables with external input for . A larger dimensionality of the latent space per element proved beneficial for all encoding distributions. Note that due to a dimensionality larger than 1 per element, we are able to apply the channel mask instead of a chess mask and maintain permutation invariance compared to the baselines.

In variational dequantization and Discrete NF, we sort the categories randomly for set shuffling (the distribution is invariant to the category order/assignment) and in ascending order for set summation. In Discrete NF, we followed the published code from Tran et al. (2019) for their coupling layers and implemented it in PyTorch Paszke et al. (2019). We use a discrete prior over the set elements which is jointly optimized with the flow. However, we experienced significant optimization issues due to the straight-through gradient estimator in the Gumbel Softmax.

Across this paper, we experiment with the two optimizers Adam Kingma and Ba (2015) and RAdam Liu et al. (2020), and found RAdam to work slightly better. The learning rate decay is applied after every update, which leads to an exponential decay. However, we did not observe the choice of this hyperparameter to be crucial.
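To make the effect of a per-update decay concrete, the following minimal sketch computes the learning rate after a given number of updates. It uses the learning rate and decay factor listed in Table 7; the function name is an assumption for illustration and does not come from the authors' code.

```python
def decayed_learning_rate(base_lr: float, decay: float, step: int) -> float:
    """Learning rate after `step` updates when it is multiplied by `decay`
    once per update (exponential decay)."""
    return base_lr * decay ** step

base_lr, decay = 7.5e-4, 0.999975  # values listed in Table 7
for step in (0, 10_000, 50_000, 100_000):
    print(step, f"{decayed_learning_rate(base_lr, decay, step):.3e}")
```

Under these values the learning rate falls to roughly 8% of its initial value after 100k updates. In PyTorch, one plausible way to obtain the same schedule is `torch.optim.lr_scheduler.ExponentialLR` with `gamma=0.999975`, calling `scheduler.step()` after every optimizer update; whether the authors did exactly this is not stated here.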

Hyperparameters | Categorical NF | Var. dequant. | Discrete NF
Latent dimension | {2, 4, 6} | 1 | 16
#Encoding couplings | - / 4 / 4 | 4 | -
#Coupling layers | 8 | 8 | {4, 8}
Coupling network | Transformer | Transformer | Transformer
- Number of layers | 2 | 2 | 2
- Hidden size | 256 | 256 | 256
Mask | Channel mask | Chess mask | Chess mask
#mixtures | 8 | 8 | -
Batch size | 1024 | 1024 | 1024
Training iterations | 100k | 100k | 100k
Optimizer | {Adam, RAdam} | RAdam | {SGD, Adam, RAdam}
Learning rate | 7.5e-4 | 7.5e-4 | {1e-3, 1e-4, 1e-5}
Learning rate decay | 0.999975 | 0.999975 | 0.999975
Temperature (GS) | - | - | {0.1, 0.2, 0.5}
Table 7: Hyperparameter overview for the set modeling experiments presented in Table 1

D.2 Graph coloring

Dataset details

In our experiments, we focus on the 3-color problem, meaning that a graph has to be colored using three colors. We generate the datasets by randomly sampling a graph and using an SAT solver (a solver from the OR-Tools library in Python) to find one valid coloring assignment. If no solution can be found, we discard the graph and sample a new one. We further ensure that no graph can be colored with fewer than 3 colors in order to exclude overly simple graphs. For creating the graphs, we take inspiration from Lemos et al. (2019) and first uniformly sample the number of nodes between for the small dataset, and for the large dataset. Next, we sample a value between and which represents the probability of having an edge between a random pair of nodes. Thus, this probability controls how dense a graph is, and we aim to have both dense and sparse graphs in our dataset. Then, for each pair of nodes, we sample from a Bernoulli distribution with this probability to decide whether to add an edge between the two nodes. Finally, we check that each node has at least one connection and that all nodes can be reached from any other node. This ensures that we have one connected graph and not multiple sub-graphs. Overall, we create a train/val/test split of 192k/24k/24k graphs for the small dataset, and 450k/20k/30k for the large dataset. We visualize examples of the datasets in Figure 5.
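The graph sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the node and density ranges, and the retry limit are placeholders chosen for the example, and the SAT-based coloring step and the check against colorings with fewer than three colors are omitted.

```python
import random
from collections import deque

def is_connected(num_nodes, edges):
    """Breadth-first search check that every node is reachable from node 0."""
    neighbors = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbors[u] - seen:
            seen.add(v)
            queue.append(v)
    return len(seen) == num_nodes

def sample_connected_graph(min_nodes, max_nodes, min_p, max_p, rng,
                           max_tries=10_000):
    """Rejection-sample one connected random graph as (num_nodes, edges)."""
    for _ in range(max_tries):
        num_nodes = rng.randint(min_nodes, max_nodes)  # uniform, inclusive
        p = rng.uniform(min_p, max_p)                  # edge probability
        edges = [(u, v)
                 for u in range(num_nodes)
                 for v in range(u + 1, num_nodes)
                 if rng.random() < p]                  # one Bernoulli draw per pair
        # A connected graph also has at least one edge at every node.
        if is_connected(num_nodes, edges):
            return num_nodes, edges
    raise RuntimeError("no connected graph found; consider raising max_p")

rng = random.Random(0)
# Placeholder ranges chosen only for this example.
n, edges = sample_connected_graph(8, 12, 0.2, 0.4, rng)
print(n, len(edges))
```

Rejection sampling keeps the sketch simple: disconnected samples are discarded in the same way the text discards graphs without a valid coloring.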

During training, we randomly permute the colors of a graph (e.g. red becomes blue, blue becomes green, green becomes red) as any permutation is a valid color assignment. When we sample a color assignment from our models, we explicitly use a temperature value of 1.0. For the autoregressive model and the VAE this means that we sample from the softmax output. A common alternative is to take the argmax, which corresponds to a temperature value of 0.0. However, we stick to the original distribution because we want to test whether the models capture the full discrete distribution of valid color assignments and not only the most likely solution. For the normalizing flow, a temperature of 1.0 corresponds to sampling from the prior distribution as it was used during training.
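The role of the temperature for the softmax-based models can be illustrated with a small sketch. The logits and function names below are hypothetical; the point is only that a temperature of 1.0 samples from the model's own distribution, while a temperature approaching 0 concentrates all mass on the argmax.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the limit 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_color(logits, temperature, rng):
    """Draw one category index from the tempered softmax distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three colors
rng = random.Random(0)
print(softmax_with_temperature(logits, 1.0))   # the model's own distribution
print(softmax_with_temperature(logits, 0.05))  # nearly one-hot on the argmax
print(sample_color(logits, 1.0, rng))
```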

Figure 5: Examples of valid graph color assignments from the dataset, shown in four panels (a) to (d) (best viewed in color). Due to the graph sizes and dense adjacency matrices, edges can be occluded or cluttered in (c) and (d).

Hyperparameter details

Table 8 shows an overview of the hyperparameters used. If “/” is used in the table, the first value refers to the hyperparameter used on the small dataset and the second to the one used on the large dataset. The activation function used within the graph neural networks is GELU Hendrycks and Gimpel (2016). Interestingly, we observed that a larger latent space dimensionality is crucial for larger graphs despite having the same number of categories as the small dataset. This shows that an encoding that is flexible in the number of dimensions can be particularly important for datasets where complex relations between categorical variables need to be modeled. Increasing the number of dimensions on the small dataset did not show any significant differences in performance. A larger number of mixtures in the mixture coupling layers is generally beneficial. However, it can also increase the sampling time. If sampling time is crucial, the number of mixtures can be decreased at the cost of slightly worse performance.

The input to the autoregressive model is the graph with the color assignment at the current time step, where each category, including unassigned nodes, is represented by an embedding vector. We experiment with an increasing number of hidden layers. While more layers are especially important for sub-optimal node orderings, the performance does not significantly improve for more than 5 layers. As the sampling time also increases linearly with the number of layers, we use 5 hidden layers for the models.

For the variational autoencoder, we encode each node by a latent vector of size 4. As VAEs have been shown to benefit from slowly adding the KL divergence between prior and posterior to the loss, we experiment with a scheduler where the slope is based on a sigmoid and stretched over 10k iterations.
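One possible form of such a sigmoid-based KL scheduler is sketched below. The exact parameterization is not given in the text, so the start and end weights, the steepness, and the function names are assumptions for illustration; only the sigmoid shape and the 10k-iteration span come from the description above.

```python
import math

def kl_weight(step, start=0.0, end=1.0, num_steps=10_000, steepness=10.0):
    """Sigmoid-shaped KL weight rising from about `start` to about `end`
    over `num_steps` iterations (all default values are assumptions)."""
    progress = min(max(step / num_steps, 0.0), 1.0)
    # Centre the sigmoid at the middle of the schedule.
    gate = 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))
    return start + (end - start) * gate

def vae_loss(reconstruction_nll, kl_divergence, step):
    """Negative ELBO with the KL term weighted by the schedule."""
    return reconstruction_nll + kl_weight(step) * kl_divergence

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, round(kl_weight(step), 4))
```

With this choice the KL term is almost switched off at the start, reaches half weight at the midpoint, and stays constant after the schedule ends.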

Hyperparameters | GraphCNF | Variational AE | Autoregressive
Latent dimension | {2, 4} / {2, 4, 6, 8} | 4 | -
#Coupling layers | {6, 8} | - | -
(Coupling) network | GAT | GAT | GAT
- Number of layers | {3, 4, 5} | 5 | {3, 4, 5, 6, 7}
- Hidden size | 384 | 384 | 384
- Number of heads | 4 | 4 | 4
Mask | Channel mask | - | -
#mixtures | {4, 8, 16} / {4, 8, 16} | - | -
Batch size | 384 / 128 | 384 / 128 | 384 / 128
Training iterations | 200k | 200k | 100k
Optimizer | RAdam | RAdam | RAdam
Learning rate | 7.5e-4 | 7.5e-4 | 7.5e-4
KL scheduler | - | {1.0, 0.10.5, 0.11.0} | -
Table 8: Hyperparameter overview for graph coloring experiments presented in Table 2

D.3 Molecule generation

Dataset details

The Zinc250k Irwin et al. (2012) dataset we use contains 239k molecules of which we use 214k molecules for training, 8k for validation and 17k for testing. We follow the preprocessing of Shi et al. (2020) and represent molecules in kekulized form in which hydrogen is removed. This leaves the molecules with up to 38 heavy atoms, with a mean and median size of about 23. The smallest graph consists of 8 nodes. Zinc250k contains molecules with 8 different atom types, whose distribution is significantly imbalanced. The most common atom is carbon with 73% of all nodes in the dataset. Besides oxygen (10%) and nitrogen (12%), the rest of the atoms occur in less than 2% of all nodes, with the rarest atom being bromine (0.002%). Between those atoms, the dataset contains 3 different bond or edge types, namely single, double and triple covalent bonds, describing how many electrons are shared among the atoms. For more than 90% of all node pairs, no bond exists. In 7% of the cases the atoms are connected with a single bond, in 2.4% with a double and in 0.02% with a triple bond. A similar imbalance is present in the Moses dataset and stems from the properties of molecules. Nevertheless, we observed that GraphCNF was able to generate a similar distribution, where adding the third stage (adding virtual edges later) considerably helped to stabilize the edge imbalance.

Hyperparameter details

We summarize our hyperparameters in Table 9. Generally, a higher latent dimensionality is beneficial for representing nodes/atoms, similarly to the graph coloring task. However, we observed that a lower dimensionality for edges is slightly better, presumably because the flow already has a significant number of latent variables for edges. Many edges, especially the virtual ones, do not contain much information. In addition, a deeper flow yielded better results by offering more complex transformations. However, in contrast to the graph coloring model, GraphCNF on molecule generation requires a considerable amount of memory as we have to model a feature vector per edge. Nevertheless, we did not experience any issues due to the limited batch size of 96, and during testing, we could easily scale up the batch size to more than 128 on an NVIDIA GTX 1080Ti for both datasets.

Hyperparameters | GraphCNF
Latent dimension (V/E) | {4, 6, 8} / {2, 3, 4}
#Coupling layers (//) | 4 / {4, 6} / {4, 6}
Coupling network (/) | Relational GCN / Edge-GNN
- Number of layers (//) | {3/3/3, 3/4/4, 4/4/4}
- Hidden size (V/E) | {256, 384} / {128, 192}
Mask | Channel mask
#mixtures (V/E) | {8, 16} / {4, 8, 16}
Batch size (Zinc250k/Moses) | 64 / 96
Training iterations | 150k
Optimizer | RAdam Liu et al. (2020)
Learning rate | 2e-4, 5e-4, 7.5e-4, 1e-3
Table 9: Hyperparameter overview for molecule generation experiments presented in Tables 3 and 6

D.4 Language modeling

Dataset details

The three datasets we use for language modeling are the Penn Treebank Marcus et al. (1994), text8 and Wikitext103 Merity et al. (2017). The Penn Treebank with a preprocessing of Mikolov et al. (2012) consists of approximately 5M characters and has a vocabulary size of . We follow the setup of Ziegler and Rush (2019) and split the dataset into sentences of a maximum length of 288. Furthermore, instead of an end-of-sentence token, the length is passed to the model and encoded by an external discrete prior which is created based on the sentence lengths in the training dataset.

Text8 contains about 100M characters and has a vocabulary size of . We again follow the preprocessing of Mikolov et al. (2012) and split the dataset into 90M characters for train, and 5M characters each for validation and testing. We train and test the models on a sequence length of 256.

In contrast to the previous two datasets, we use Wikitext103 as a word-level language dataset. First, we create a vocabulary and limit it to the most frequent 10,000 words in the training corpus. We use pre-trained GloVe Pennington et al. (2014) embeddings to represent the words in the baseline LSTM networks and to determine the logistic mixture parameters in the encoding distribution of Categorical Normalizing Flows. Due to this calculation of the mixture parameters, we use a small linear network as the decoder. The linear network consists of three linear layers of hidden size 512 with GELU Hendrycks and Gimpel (2016) activation and an output size of 10,000 (the vocabulary size). Similarly to text8, we train and test the models on an input sequence length of 256.

Hyperparameter details

The hyperparameters for the language modeling experiments are summarized in Table 10. We apply the same hyperparameters for the flow and the baseline where applicable. The best latent dimensionality for the character-level datasets was found to be 3, although larger dimensionalities achieved similar performance. For the word-level dataset, it is beneficial to increase the latent dimensionality to 10. However, note that 10 is still significantly smaller than the GloVe vector size of 300. As Penn Treebank has a limited training dataset on which LSTM networks easily overfit, we use a dropout Srivastava et al. (2014) of 0.3 throughout the models and drop an input token with a probability of 0.1. The other datasets seemed to benefit slightly from a small input dropout to prevent overfitting at later stages of training.
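Input dropout on tokens can be sketched as below. How a dropped token is represented is not specified in the text; replacing it with a reserved mask index is an assumption of this sketch, as are the function name and the example values.

```python
import random

def drop_input_tokens(tokens, drop_prob, mask_id, rng):
    """Independently replace each input token by `mask_id` with probability
    `drop_prob`. Using a mask index for dropped tokens is an assumption."""
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must lie in [0, 1]")
    return [mask_id if rng.random() < drop_prob else t for t in tokens]

rng = random.Random(0)
tokens = [5, 17, 3, 3, 42, 8, 11, 29]  # hypothetical token indices
print(drop_input_tokens(tokens, drop_prob=0.1, mask_id=0, rng=rng))
```

Such dropout would only be applied to the model inputs during training; the prediction targets stay unchanged.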

Hyperparameters | Penn Treebank | text8 | Wikitext103
(Max) Sequence length | 288 | 256 | 256
Latent dimension | {2, 3, 4} | {2, 3, 4} | {8, 10, 12}
#Coupling layers | 1 | 1 | 1
Coupling network | LSTM | LSTM | LSTM
- Number of layers | 1 | 2 | 2
- Hidden size | 1024 | 1024 | 1024
- Dropout | {0.0, 0.3} | 0.0 | 0.0
- Input dropout | {0.0, 0.05, 0.1, 0.2} | {0.0, 0.05, 0.1} | {0.0, 0.05, 0.1}
#mixtures | 51 | 27 | 64
Batch size | 128 | 128 | 128
Training iterations | 100k | 150k | 150k
Optimizer | RAdam | RAdam | RAdam
Learning rate | 7.5e-4 | 7.5e-4 | 7.5e-4
Table 10: Hyperparameter overview for the language modeling experiments presented in Table 4