AlignFlow: Cycle Consistent Learning from Multiple Domains via Normalizing Flows

05/30/2019, by Aditya Grover, et al., Stanford University

Given unpaired data from multiple domains, a key challenge is to efficiently exploit these data sources for modeling a target domain. Variants of this problem have been studied in many contexts, such as cross-domain translation and domain adaptation. We propose AlignFlow, a generative modeling framework for learning from multiple domains via normalizing flows. The use of normalizing flows in AlignFlow allows for a) flexibility in specifying learning objectives via adversarial training, maximum likelihood estimation, or a hybrid of the two methods; and b) exact inference of the shared latent factors across domains at test time. We derive theoretical results for the conditions under which AlignFlow guarantees marginal consistency for the different learning objectives. Furthermore, we show that AlignFlow guarantees exact cycle consistency in mapping datapoints from one domain to another. Empirically, AlignFlow can be used for data-efficient density estimation given multiple data sources and shows significant improvements over relevant baselines on unsupervised domain adaptation.


1 Introduction

In recent years, there has been an increase in the availability of both labeled and unlabeled datasets from multiple sources. For example, many variants of face datasets scraped from sources such as Wikipedia and IMDB are publicly available. Given data from two or more domains, we expect sample-efficient learning algorithms to be able to learn and align the shared structure across these domains for accurate downstream tasks. This perspective has a broad range of applications across machine learning, including relational learning [1], domain adaptation [2, 3, 4], image and video translation for computer vision [5, 6], and machine translation [7], especially for low-resource languages [8].

Many variants of the domain alignment problem have been studied in prior work. For instance, unpaired cross-domain translation refers to the task of learning a mapping from one domain to another given datasets from the two domains [9]. This task can be used as a subproblem in the domain adaptation setting, where the goal is to learn a classifier for the unlabeled domain given labeled data from a related source domain [10]. Many of these problems are underconstrained due to the limited supervision available, and an amalgam of inductive biases needs to be explicitly enforced (typically via additional loss terms) to learn meaningful solutions, e.g., cycle-consistency [9] and entropic regularization [11]. In many cases, models need to be augmented with additional networks to enforce these biases during learning or to enable flexible inference at test time.

Latent variable generative models are highly effective for inferring hidden structure within observed data from a single domain [12]. For example, recent works have shown that these models can learn useful disentangled representations in a fully unsupervised manner [13, 14]. In this work, we present AlignFlow, a latent variable generative framework that seeks to discover the shared structure across multiple data domains using normalizing flows [15, 16, 17]. AlignFlow models the data from each domain via an invertible generative model with a single latent space shared across all the domains. If we let the two domains be A and B with a shared latent space Z, then the latent variable generative model for A may additionally share some or all parameters with the model for domain B. Akin to a single invertible model, the collection of invertible models in AlignFlow provides great flexibility in specifying learning objectives and can be trained via maximum likelihood estimation, adversarial training, or a hybrid variant accounting for both objectives.

By virtue of an invertible design, AlignFlow naturally extends to a cross-domain translation model. To translate a datapoint from domain A to domain B, we first map it into the shared latent space via the inverse of the mapping for A, and then apply the forward mapping for B. Appealingly, we show that this composition of invertible mappings is exactly cycle-consistent, i.e., translating a datapoint from A to B using the forward mapping and back using the reverse mapping recovers the original datapoint, and vice versa from B to A. Cycle-consistency was first introduced in CycleGAN [18] and has been shown to be an excellent inductive bias for underconstrained problems such as unpaired domain alignment. While models such as CycleGAN only provide approximate cycle-consistency by incorporating additional loss terms, AlignFlow can omit these terms and guarantee exact cycle-consistency by design.

We analyze the AlignFlow framework extensively. Theoretically, we derive conditions under which the AlignFlow objective is consistent in the sense of recovering the true marginal distributions. Empirically, we consider two sets of tasks. In the first task, we demonstrate the ability of AlignFlow to effectively perform density estimation on a target domain given data from an additional related domain. Next, we compare AlignFlow against unsupervised domain adaptation approaches based on cross-domain image translation and observe consistent improvements over 3 benchmark configurations.

2 Preliminaries

In this section, we discuss the necessary background and notation on generative adversarial networks and normalizing flows. We use uppercase notation (e.g., X) to denote random variables and lowercase notation (e.g., x) to denote specific values in the corresponding sample spaces (e.g., 𝒳).

2.1 Generative Adversarial Networks

A generative adversarial network (GAN) is a latent variable model which specifies a deterministic mapping G_{Z→X} between a set of latent variables Z and a set of observed variables X [19]. In order to sample from GANs, we need a prior density p_Z over Z that permits efficient sampling. A GAN generator can also be conditional, where the conditioning is on another set of observed variables (and optionally the latent variables Z as before) [20].

A GAN is trained via adversarial training, wherein the generator G_{Z→X} plays a minimax game with an auxiliary critic C_X. The goal of the critic is to distinguish real samples from the observed dataset from samples generated via G_{Z→X}. The generator, on the other hand, tries to generate samples that can maximally confuse the critic. Many learning objectives have been proposed for adversarial training, including those based on f-divergences [21], the Wasserstein distance [22], and maximum mean discrepancy [23]. For the standard cross-entropy based GAN loss, the critic outputs the probability of a datapoint being real, and the two players optimize the following objective w.r.t. a data distribution p*_X:

min_{G_{Z→X}} max_{C_X}  𝔼_{x~p*_X}[log C_X(x)] + 𝔼_{z~p_Z}[log(1 − C_X(G_{Z→X}(z)))]   (1)

for a suitable choice of prior density p_Z. The generator and the critic are both parameterized by deep neural networks and learned via alternating gradient-based optimization. Because adversarial training only requires samples from the generative model, it can be used to train generative models with intractable or ill-defined likelihoods [24]. In practice, such likelihood-free methods give excellent performance on sampling-based tasks, unlike the alternative criterion of maximum likelihood estimation. However, these models are harder to train due to the alternating minimax optimization and suffer from issues such as mode collapse [25].
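
To make Eq. 1 concrete, the sketch below performs one alternating critic/generator update under the cross-entropy GAN loss (with the common non-saturating variant for the generator update). It is a minimal illustration: the modules, optimizers, and latent dimension are placeholders, not models used in this work.

import torch
import torch.nn.functional as F

def gan_step(generator, critic, real_batch, opt_g, opt_c, latent_dim=64):
    # Assumes critic(x) outputs probabilities of shape (batch, 1).
    batch = real_batch.size(0)
    z = torch.randn(batch, latent_dim)
    fake_batch = generator(z)

    # Critic update: maximize log C(x) + log(1 - C(G(z)))
    # (equivalently, minimize the two cross-entropy terms below).
    opt_c.zero_grad()
    loss_c = F.binary_cross_entropy(critic(real_batch), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(critic(fake_batch.detach()), torch.zeros(batch, 1))
    loss_c.backward()
    opt_c.step()

    # Generator update (non-saturating heuristic): push C(G(z)) towards 1.
    opt_g.zero_grad()
    loss_g = F.binary_cross_entropy(critic(generator(z)), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()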

2.2 Normalizing Flows

Normalizing flows represent a latent variable generative model that specifies an invertible mapping G_{Z→X}: 𝒵 → 𝒳 between a set of latent variables Z and a set of observed variables X. Let p_Z and p_X denote the marginal densities defined by the model over Z and X respectively. Using the change-of-variables formula, the marginal densities can be related as:

p_X(x) = p_Z(z) |det ∂G_{Z→X}^{-1}(x)/∂x|   (2)

where z = G_{Z→X}^{-1}(x) due to the invertibility constraints. Here, the second term on the RHS corresponds to the absolute value of the determinant of the Jacobian of the inverse transformation and signifies the change in volume when translating across the two sample spaces.

For evaluating likelihoods via the change-of-variables formula, we require efficient and tractable evaluation of the prior density, the inverse transformation G_{Z→X}^{-1}, and the determinant of its Jacobian. To draw a sample from this model, we perform ancestral sampling, i.e., we first sample a latent vector z ~ p_Z and obtain the sampled vector as x = G_{Z→X}(z). This requires the ability to efficiently (1) sample from the prior density and (2) evaluate the forward transformation G_{Z→X}. Many transformations parameterized by deep neural networks that satisfy one or more of these criteria have been proposed in the recent literature on normalizing flows, e.g., NICE [16] and autoregressive flows [26, 27]. By suitable design of transformations, both likelihood evaluation and sampling can be performed efficiently, as in Real-NVP [17]. Consequently, a flow model can be trained efficiently via maximum likelihood estimation as well as likelihood-free adversarial training [28].
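
As a minimal, self-contained illustration of Eq. 2, the sketch below builds a toy flow from a single affine coupling layer and evaluates exact log-likelihoods under a standard normal prior. It is not the Real-NVP/Glow architecture used in the experiments; the layer sizes are arbitrary and the input dimension is assumed even.

import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling layer: transforms half of the dimensions conditioned on
    the other half, so the Jacobian is triangular and its determinant is cheap."""
    def __init__(self, dim):
        super().__init__()
        # Small network producing per-dimension log-scale and shift (dim assumed even).
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, z):
        # Generation direction Z -> X.
        z1, z2 = z.chunk(2, dim=1)
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1)

    def inverse(self, x):
        # Inference direction X -> Z, also returning log|det dz/dx| for Eq. 2.
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        z = torch.cat([x1, (x2 - t) * torch.exp(-log_s)], dim=1)
        return z, -log_s.sum(dim=1)

def log_likelihood(flow, x):
    # log p_X(x) = log p_Z(z) + log|det dz/dx|, with z = flow.inverse(x) and p_Z standard normal.
    z, log_det = flow.inverse(x)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
    return log_pz + log_det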

3 The AlignFlow Framework

In this section, we present the AlignFlow framework for learning generative models in the presence of unpaired data from multiple domains. For ease of presentation, we consider the case of two domains. Unless mentioned otherwise, our results naturally extend to more than two domains as well.

3.1 Problem Setup

The learning setting we consider is as follows. We are given unpaired datasets 𝒟_A and 𝒟_B from two domains A and B respectively, where the datapoints are assumed to be sampled i.i.d. from the true marginals p*_A and p*_B respectively. We are interested in learning models for two sets of distributions: (a) the marginal likelihoods p_A and p_B for unconditional density estimation and sampling from domains A and B respectively, and (b) the conditional distributions p_{B|A} and p_{A|B} for translating (i.e., conditional sampling) from A to B and from B to A respectively.

Before presenting the AlignFlow framework, we note two observations. For task (a), simply having a dataset from the target domain is enough. However, given the easy availability of unpaired data, we hope to exploit the shared structure across related domains for sample-efficient learning. For task (b), we note that the problem is heavily underconstrained since we are only given data from the marginal distributions, and hence it is unclear how to learn the conditional distribution. However, the recent spate of empirical successes suggests that certain inductive biases can yield useful conditional distributions for tasks such as cross-domain translation and domain adaptation even for this underconstrained problem [9, 29].

3.2 Representation

We will use a graphical model to represent the relationships between the domains. Consider a Bayesian network A ← Z → B with two sets of observed random variables A and B with domains 𝒜 and ℬ respectively, and a parent set of latent random variables Z with domain 𝒵.

The latent variables Z indicate a shared feature space between the observed variables A and B, which will be exploited later for efficient learning and inference. While Z is unobserved, we assume a prior density p_Z over these variables, such as an isotropic Gaussian. Finally, to compactly specify the joint distribution over all sets of variables, we constrain the relationships between A and Z, and between B and Z, to be invertible. That is, we specify mappings G_{Z→A}: 𝒵 → 𝒜 and G_{Z→B}: 𝒵 → ℬ such that the respective inverses G_{A→Z} = G_{Z→A}^{-1} and G_{B→Z} = G_{Z→B}^{-1} exist. Notice that such a representation naturally provides a mechanism to translate from one domain to another as the composition of two invertible mappings:

G_{A→B} = G_{Z→B} ∘ G_{A→Z}   (3)
G_{B→A} = G_{Z→A} ∘ G_{B→Z}   (4)

Since the composition of invertible mappings is invertible, both G_{A→B} and G_{B→A} are invertible. In fact, it is straightforward to observe that G_{A→B} and G_{B→A} are inverses of each other:

G_{B→A} = G_{A→B}^{-1}   (5)
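
A minimal sketch of Eqs. 3-5 in code, assuming two invertible flow modules flow_a and flow_b that expose forward (Z → domain) and inverse (domain → Z) methods; the names are illustrative, not the paper's implementation.

def a_to_b(a, flow_a, flow_b):
    # G_{A->B} = G_{Z->B} o G_{A->Z}: encode into the shared latent space, decode into B.
    z = flow_a.inverse(a)
    return flow_b.forward(z)

def b_to_a(b, flow_a, flow_b):
    # G_{B->A} = G_{Z->A} o G_{B->Z}: by construction the exact inverse of a_to_b (Eq. 5).
    z = flow_b.inverse(b)
    return flow_a.forward(z)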

3.3 Learning Algorithms & Objectives

As discussed in the preliminaries, each of the individual flow models G_{Z→A} and G_{Z→B} expresses a model for p_A and p_B respectively and can be trained independently via maximum likelihood estimation, adversarial learning, or a hybrid objective. However, our goal is to perform sample-efficient learning by exploiting data from other domains, as well as to learn a conditional mapping across the two domains. For both these goals, we require learning algorithms which use data from both domains for parameter estimation. Unless mentioned otherwise, all our results that hold for a particular domain A will have a natural counterpart for the domain B.

Adversarial Training. Instead of generating data in domain A by sampling from the prior density p_Z, we can consider conditional sampling based on data sampled from domain B. That is, we introduce a critic C_A that plays a minimax game with the generator mapping G_{B→A}, with the prior density now given as p*_B. The critic C_A distinguishes real samples a ~ p*_A from the generated samples G_{B→A}(b) for b ~ p*_B. For example, the cross-entropy GAN loss in this case is given as:

L_GAN(C_A, G_{B→A}) = 𝔼_{a~p*_A}[log C_A(a)] + 𝔼_{b~p*_B}[log(1 − C_A(G_{B→A}(b)))]   (6)

The expectations above are approximated empirically via the datasets 𝒟_A and 𝒟_B respectively.

Maximum Likelihood Estimation. Unlike adversarial training, flow models trained with maximum likelihood estimation (MLE) explicitly require a prior with a tractable density to apply the change-of-variables formula. Due to this tractability requirement, we cannot substitute p*_B for p_Z in this case. Instead, we propose to share parameters between the two mappings G_{Z→A} and G_{Z→B}. The extent of parameter sharing depends on the similarity across the two domains; for highly similar domains, entire architectures could potentially be shared, in which case G_{Z→A} = G_{Z→B}.

Hybrid Training. Both the MLE and adversarial training objectives can be combined into a single training objective. In particular, the most expressive AlignFlow objective is given as:

L_AlignFlow(G_{A→B}, C_A, C_B; λ_A, λ_B) = L_GAN(C_A, G_{B→A}) + L_GAN(C_B, G_{A→B}) − λ_A L_MLE(G_{Z→A}) − λ_B L_MLE(G_{Z→B})   (7)

where λ_A ≥ 0 and λ_B ≥ 0 are hyperparameters that reflect the strength of the MLE terms for domains A and B respectively, and L_MLE(·) denotes the corresponding data log-likelihood via the change-of-variables formula (Eq. 2). The AlignFlow objective is minimized w.r.t. the parameters of the generator G_{A→B} and maximized w.r.t. the parameters of the critics C_A and C_B. Notice that L_AlignFlow is a function of the critics and G_{A→B} only, since the latter also encompasses the other parametric mappings appearing in the objective (G_{B→A}, G_{Z→A}, G_{Z→B}) via the invertibility constraints in Eqs. 3-5. When λ_A = λ_B = 0, we perform pure adversarial training and the prior over Z plays no role in learning. On the other hand, as λ_A, λ_B → ∞, we recover pure MLE training of the invertible generator. Here, the critics play no role since the adversarial training terms are ignored.
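
The generator-side loss corresponding to Eq. 7 can be assembled schematically as below, assuming invertible flow modules that expose forward (Z → domain), inverse (domain → Z), and an exact log_prob per Eq. 2, and critics that output probabilities; the critics would be updated separately to maximize their GAN terms, as in Section 2.1. All names are illustrative.

import torch
import torch.nn.functional as F

def alignflow_generator_loss(a_real, b_real, flow_a, flow_b, critic_a, critic_b,
                             lam_a=1.0, lam_b=1.0):
    # Cross-domain translations through the shared latent space (Eqs. 3-4).
    b_fake = flow_b.forward(flow_a.inverse(a_real))   # G_{A->B}(a)
    a_fake = flow_a.forward(flow_b.inverse(b_real))   # G_{B->A}(b)

    # Non-saturating GAN terms: try to fool C_A and C_B with the translations.
    gan = F.binary_cross_entropy(critic_a(a_fake), torch.ones(a_fake.size(0), 1)) + \
          F.binary_cross_entropy(critic_b(b_fake), torch.ones(b_fake.size(0), 1))

    # Exact MLE terms for each domain, weighted by lambda_A and lambda_B (Eq. 7).
    mle = lam_a * flow_a.log_prob(a_real).mean() + lam_b * flow_b.log_prob(b_real).mean()
    return gan - mle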

3.4 Inference

AlignFlow can be used for both conditional and unconditional sampling at test time. For conditional sampling, as in the case of domain translation, we are given a datapoint a ∈ 𝒜 and we can draw the corresponding cross-domain translation in domain B via the mapping G_{A→B} (and symmetrically from B to A via G_{B→A}). For unconditional sampling, we require λ_A, λ_B > 0, since doing so activates the use of the prior via the MLE terms in the learning objective. Thereafter, we can obtain samples by first drawing z ~ p_Z and then applying the mapping G_{Z→A} to z. Furthermore, the same z can be mapped to domain B via G_{Z→B}. Hence, we can sample paired data (G_{Z→A}(z), G_{Z→B}(z)) given z ~ p_Z.
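
Both inference modes reduce to a few lines under the same assumed interface:

import torch

def translate_a_to_b(a, flow_a, flow_b):
    # Conditional sampling: deterministic translation G_{A->B}(a).
    return flow_b.forward(flow_a.inverse(a))

def sample_paired(flow_a, flow_b, n, latent_shape):
    # Unconditional sampling (requires the MLE terms during training so the prior is used):
    # draw z ~ p_Z once and decode it into both domains, yielding paired samples.
    z = torch.randn(n, *latent_shape)
    return flow_a.forward(z), flow_b.forward(z)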

4 Theoretical Analysis

The AlignFlow objective consists of three parametric models: one generator G_{A→B} ∈ 𝒢, and two critics C_A ∈ 𝒞_A and C_B ∈ 𝒞_B. Here, 𝒢, 𝒞_A, and 𝒞_B denote model families specified, e.g., via deep neural network based architectures. In this section, we analyze the optimal solutions to these parameterized models within well-specified model families.

4.1 Optimal Generators

Our first result characterizes the conditions under which the optimal generators exhibit marginal-consistency for the data distributions defined over the domains A and B.

Definition 1.

Let p_{Y,X} denote a joint distribution over two domains 𝒴 and 𝒳. An invertible mapping G_{Y→X}: 𝒴 → 𝒳 is marginally-consistent w.r.t. two arbitrary distributions (p_Y, p_X) iff for all x ∈ 𝒳 and y = G_{Y→X}^{-1}(x):

p_X(x) = p_Y(y) |det ∂G_{Y→X}^{-1}(x)/∂x|   (8)

Next, we show that AlignFlow is marginally-consistent for well-specified model families.

Lemma 1.

Let 𝒢_{Z→A} and 𝒢_{Z→B} denote the classes of invertible mappings represented by the AlignFlow architecture for mapping 𝒵 → 𝒜 and 𝒵 → ℬ respectively. For a given choice of prior distribution p_Z, if there exist mappings G*_{Z→A} ∈ 𝒢_{Z→A} and G*_{Z→B} ∈ 𝒢_{Z→B} that are marginally consistent w.r.t. (p_Z, p*_A) and (p_Z, p*_B) respectively, then the mapping G*_{A→B} = G*_{Z→B} ∘ (G*_{Z→A})^{-1} is marginally-consistent w.r.t. (p*_A, p*_B).

The result follows directly from Definition 1 and the change-of-variables formula applied to the composed mapping G*_{A→B}.

Theorem 1.

Assume that the model families 𝒞_A and 𝒞_B for the critics C_A and C_B are the sets of all measurable functions for the cross-entropy GAN objective. Then, G*_{A→B} (as defined in Lemma 1) globally minimizes the AlignFlow objective in Eq. 7 for any value of λ_A ≥ 0, λ_B ≥ 0.

Proof. See Appendix A.1. Theorem 1 suggests that optimizing the AlignFlow objective will recover the marginal data distributions p*_A and p*_B under suitable conditions. For the other goal of learning cross-domain mappings, we note that marginally-consistent mappings w.r.t. a target data distribution (such as p*_B) and a target prior density (such as p_Z) need not be unique. While a cycle-consistent, invertible model family mitigates the underconstrained nature of the cross-domain translation problem, it does not provably eliminate it. We provide some non-identifiable constructions in Appendix A.3 and leave the exploration of additional constraints that guarantee identifiability to future work.

4.2 Optimal Critics

Unlike standard adversarial training of an unconditional normalizing flow model [28, 30], the AlignFlow model involves two critics. Here, we are interested in characterizing the dependence between the optimal critics for a given invertible mapping G_{A→B}. Consider the AlignFlow framework where the GAN loss terms in Eq. 7 are specified via the cross-entropy objective in Eq. 6. For this model, we can relate the optimal critics using the following result.

Theorem 2.

Let p*_A and p*_B denote the true data densities for domains A and B respectively. Let C*_A and C*_B denote the optimal critics for the AlignFlow objective with the cross-entropy GAN loss for any fixed choice of the invertible mapping G_{A→B}. Letting b = G_{A→B}(a) for any a ∈ 𝒜, we have:

C*_A(a) = 1 − C*_B(b)   (9)

Proof. See Appendix A.2. In essence, the above result shows that the optimal critic for one domain, w.l.o.g. say A, can be directly obtained from the optimal critic of the other domain B for any choice of the invertible mapping G_{A→B}, assuming one were given access to the data marginals p*_A and p*_B.

4.3 Exact Cycle Consistency

(a) CycleGAN   (b) AlignFlow
Figure 1: CycleGAN vs. AlignFlow for unpaired cross-domain translation. Unlike CycleGAN, AlignFlow specifies a single invertible mapping that is exactly cycle-consistent, represents a shared latent space Z between the two domains, and can be trained via both adversarial training and exact maximum likelihood estimation. Double-headed arrows denote invertible mappings, and the outputs of the critics used for adversarial training are shown as random variables.

So far, we have only discussed objectives that are marginally consistent with respect to the data distributions p*_A and p*_B. However, many domain alignment tasks, such as cross-domain translation, can be cast as learning a joint distribution p*_{A,B}. As discussed previously, this problem is underconstrained given unpaired datasets 𝒟_A and 𝒟_B, and the learned marginal densities alone do not guarantee learning a mapping that is useful for downstream tasks. Cycle consistency, as proposed in CycleGAN [18], is a highly effective learning objective that encourages learning of meaningful cross-domain mappings: data translated from domain A to B via G_{A→B} should be mapped back to the original datapoints in A via G_{B→A}. That is, G_{B→A}(G_{A→B}(a)) ≈ a for all a ∈ 𝒜. Formally, the cycle-consistency loss for translation from A to B and back is defined as:

L_cyc(G_{A→B}, G_{B→A}) = 𝔼_{a~p*_A}[‖G_{B→A}(G_{A→B}(a)) − a‖_1]   (10)

Symmetrically, a cycle-consistency term in the reverse direction encourages G_{A→B}(G_{B→A}(b)) ≈ b for all b ∈ ℬ. Next, we show that AlignFlow is exactly cycle-consistent.

Proposition 1.

Let 𝒢 denote the class of invertible mappings represented by an arbitrary AlignFlow architecture. For any G_{A→B} ∈ 𝒢, we have:

L_cyc(G_{A→B}, G_{B→A}) = 0   (11)
L_cyc(G_{B→A}, G_{A→B}) = 0   (12)

where G_{B→A} = G_{A→B}^{-1} by design.

The proposition follows directly from the invertible design of the AlignFlow framework (Eq. 5).
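
Since G_{B→A} is the exact inverse of G_{A→B}, cycle consistency can be verified numerically without any cycle loss; a small sanity check under the assumed flow interface from Section 3:

import torch

def check_cycle_consistency(a, flow_a, flow_b, tol=1e-4):
    # Round-trip A -> B -> A; with exact invertibility the reconstruction error is zero
    # up to floating-point error (Eqs. 11-12), so no cycle-consistency loss is needed.
    b = flow_b.forward(flow_a.inverse(a))          # G_{A->B}(a)
    a_rec = flow_a.forward(flow_b.inverse(b))      # G_{B->A}(G_{A->B}(a))
    return torch.max(torch.abs(a_rec - a)) < tol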

Comparison with CycleGAN. We illustrate and compare AlignFlow and CycleGAN in Figure 1. CycleGAN parameterizes two independent cross-domain mappings G_{A→B} and G_{B→A}, whereas AlignFlow only specifies a single, invertible mapping. Learning in CycleGAN is restricted to an adversarial training objective along with cycle-consistency loss terms, whereas AlignFlow is exactly cycle-consistent and can be trained via adversarial learning, MLE, or a hybrid (Eq. 7) without the need for additional loss terms to enforce cycle consistency. Finally, inference in CycleGAN is restricted to conditional sampling since it does not involve any latent variables Z with easy-to-sample prior densities. As described previously, AlignFlow permits both conditional and unconditional sampling.

Comparison with UNIT and CoGAN. Models such as CoGAN [31] and its extension UNIT [29] also impose a shared-space constraint between two decoders decoding into the different domains. However, these models can only enforce approximate cycle consistency, introduce additional encoders, and optimize approximate lower bounds on the log-likelihood, thereby prohibiting exact MLE training.

5 Experimental Evaluation

To achieve our two goals of data-efficient modeling of individual domains and effective cross-domain mappings, we evaluate AlignFlow on two tasks: (a) density estimation given data from multiple domains, and (b) unsupervised domain adaptation. For additional experimental details and analysis beyond those stated below, we refer the reader to Appendix B.

5.1 Data-efficient Density Estimation via pure MLE Training

In multi-domain density estimation, we are given data from two domains. The target domain is a data-limited domain which we augment with additional data from a related, but different domain. The goal is to learn a single generative model for the target domain using one or both datasets. We ignore the adversarial learning terms for AlignFlow in this experiment since the density estimation objective is directly related to MLE.

We experimented with the CIFAR-10 dataset using a simplified Glow architecture [32]. For the augmented dataset during training, we consider CIFAR-100 and ImageNet downscaled to 32x32 [33]. CIFAR-100 is drawn from the same source (the 80 Million Tiny Images dataset) as CIFAR-10 but has non-overlapping classes, and hence can be viewed as a different data distribution. ImageNet is derived from a different source than CIFAR-10. In Figure 2, we show the samples and held-out negative log-likelihoods of the best performing models. We defer samples with ImageNet augmentation to Appendix B.1, where we achieve an NLL of 3.45 bpd. A baseline approach that ignores the data available from the augmented domains underperforms, whereas training AlignFlow via pure MLE with weight sharing can effectively exploit data from related domains.
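
A minimal sketch of this data-augmented MLE setup: a single flow with fully shared parameters is trained on a weighted mixture of target-domain and auxiliary-domain negative log-likelihoods. The weight aux_weight plays the role of the relative MLE weight tuned in Appendix B.1; all names are illustrative, and log_prob is the exact flow log-likelihood from Eq. 2.

def augmented_mle_step(flow, target_batch, aux_batch, optimizer, aux_weight=0.5):
    # One MLE step on the target domain (e.g., CIFAR-10) augmented with a related
    # domain (e.g., CIFAR-100); parameters are fully shared, only the loss is weighted.
    optimizer.zero_grad()
    nll_target = -flow.log_prob(target_batch).mean()
    nll_aux = -flow.log_prob(aux_batch).mean()
    loss = nll_target + aux_weight * nll_aux
    loss.backward()
    optimizer.step()
    return loss.item()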

5.2 Unsupervised Domain Adaptation via pure Adversarial Training

(a) CIFAR-10 only: 3.48 bpd
(b) With CIFAR-100: 3.42 bpd
Figure 2: Generated samples and held-out negative log-likelihoods (NLL, in bits/dimension or bpd) for MLE training on the CIFAR-10 dataset alone (left) and augmented with CIFAR-100 (right).
Model          MNIST → USPS   USPS → MNIST   SVHN → MNIST
source only    82.2 ± 0.8     69.6 ± 3.8     67.1 ± 0.6
ADDA [34]      89.4 ± 0.2     90.1 ± 0.8     76.0 ± 1.8
CyCADA [3]     95.6 ± 0.2     96.5 ± 0.1     90.4 ± 0.4
UNIT [29]      95.97          93.58          90.53
AlignFlow      96.2 ± 0.2     96.7 ± 0.1     91.0 ± 0.3
target only    96.3 ± 0.1     99.2 ± 0.1     99.2 ± 0.1
Table 1: Test classification accuracies (%) for domain adaptation from source → target. The source only and target only models directly use classifiers trained on the source and target datasets respectively. Baseline numbers are directly reported from the cited works.

In unsupervised domain adaptation [10], we are given data from two related domains: a source and a target domain. For the source, we have access to both the input datapoints and their labels. For the target, we are only provided with input datapoints without any labels. Using the available data, the goal is to learn a classifier for the target domain. We extend [3] to use an AlignFlow architecture and objective (adversarially trained Real-NVPs [17] here) in place of CycleGAN for this task.

A variety of algorithms have been proposed for the above task, seeking to match pixel-level or feature-level distributions across the two domains. See Appendix B.3 for more details. For a fair comparison, we compare against the baselines CyCADA [3] and UNIT [29], which involve pixel-level translations and are closest to the current work. We evaluate across the same pairs of source and target datasets as in [3] and [29]: MNIST [35], USPS [36], and SVHN [37], which are all image datasets of handwritten digits with 10 classes. In Table 1, we see that AlignFlow outperforms both CyCADA [3] (based on CycleGAN) and UNIT [29] in all cases. This also suggests that combining AlignFlow with recent state-of-the-art adaptation approaches, e.g., [38, 39, 40, 41, 42, 43], is an interesting direction for future work.

5.3 Multi-domain Latent Interpolations via Hybrid Training

(a) MNIST → USPS
(b) USPS → MNIST
Figure 3: Multi-domain latent space interpolations. Top: the left-most and right-most images are real datapoints from the source domain (in red boxes). Interpolation is then performed in the latent space and decoded back into the source domain. Bottom: for each corresponding image in the top row, its latent representation is decoded into the target domain. Note how both class identity and style are preserved in the interpolated pairs of digits in the two domains. Also, notice that the USPS images (even the true ones in red boxes) are slightly blurred due to the upscaling applied as standard preprocessing.

The use of a shared latent space in AlignFlow allows us to perform paired interpolations in two domains simultaneously. While pure MLE without any parameter sharing does not give good alignment, pure adversarial training cannot be used for unconditional sampling since the prior is inactive. Hence, we use AlignFlow models trained via a hybrid objective for latent space interpolations. In particular, we sample two datapoints a_1, a_2 ∈ 𝒜 and obtain their latent representations z_1 = G_{A→Z}(a_1) and z_2 = G_{A→Z}(a_2). Following [17], we compute interpolations in the polar space as z_φ = z_1 cos φ + z_2 sin φ for several values of φ ∈ [0, π/2]. Finally, we map each z_φ back to domain A via G_{Z→A} and to domain B via G_{Z→B}. We show this empirically on the MNIST/USPS datasets in Figure 3. We see that many aspects of style and content are preserved in the samples corresponding to the latent space interpolations.
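
A sketch of the paired interpolation procedure under the assumed flow interface; the polar form follows [17].

import math
import torch

def paired_interpolation(a1, a2, flow_a, flow_b, steps=8):
    # Interpolate between two domain-A inputs in the shared latent space and decode
    # every intermediate point into both domains, giving aligned interpolation rows.
    z1, z2 = flow_a.inverse(a1), flow_a.inverse(a2)
    rows_a, rows_b = [], []
    for phi in torch.linspace(0, math.pi / 2, steps):
        z = torch.cos(phi) * z1 + torch.sin(phi) * z2   # polar interpolation
        rows_a.append(flow_a.forward(z))                # decode into domain A
        rows_b.append(flow_b.forward(z))                # decode into domain B
    return rows_a, rows_b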

6 Related Work

A key assumption in unsupervised domain alignment is the existence of a deterministic or stochastic mapping G_{A→B} such that the distribution of G_{A→B}(A) matches that of B, and vice versa. This assumption can be incorporated as a marginal distribution-matching constraint in the objective using an adversarially trained GAN critic [19]. However, this objective is underconstrained. To partially mitigate this issue, CycleGAN [18], DiscoGAN [1], and DualGAN [44] added an approximate cycle-consistency constraint that encourages the compositions G_{B→A} ∘ G_{A→B} and G_{A→B} ∘ G_{B→A} to behave like identity functions on domains A and B respectively. While cycle-consistency is empirically very effective, alternatives based on variational autoencoders that do not require either cycles or adversarial training have also been proposed recently [45, 46].

Models such as CoGAN [31], UNIT [29], and CycleGAN [18] have since been extended to enable one-to-many mappings [47, 9] as well as multi-domain alignment [48]. Our work focuses on the one-to-one unsupervised domain alignment setting. In contrast to previous models, AlignFlow leverages both a shared latent space and exact cycle-consistency. To our knowledge, AlignFlow provides the first demonstration that invertible models can be used successfully in lieu of the cycle-consistency objective. Furthermore, AlignFlow allows the incorporation of exact maximum likelihood training, which we demonstrated to induce a meaningful shared latent space that is amenable to interpolation.

7 Conclusion & Future Work

We presented AlignFlow, a generative framework for learning from multiple data sources based on normalizing flow models. The use of normalizing flow models is an attractive choice for several reasons we highlight: it guarantees exact cycle-consistency via a single cross-domain mapping, learns a shared latent space across two domains, and permits a flexible training objective which is a hybrid of terms corresponding to adversarial training and exact maximum likelihood estimation. Theoretically, we derived conditions under which the AlignFlow model learns marginals that are consistent with the underlying data distributions. Finally, our empirical evaluation demonstrated significant gains on the tasks of multi-domain density estimation and unsupervised domain adaptation, and an increase in inference capabilities, e.g., paired interpolations in the latent space for two domains.

In the future, we plan to consider extensions of AlignFlow for learning stochastic, multimodal mappings [9] and translations across more than two domains [48]. Exploring recent advancements in invertible architectures [49, 26, 27, 50, 13, 51, 52] for improved learning of AlignFlow is another promising direction. In spite of strong empirical results in domain alignment, theories explaining such results are limited [53, 54, 55, 56, 57]. With a handle on model likelihoods and invertible inference, we are optimistic that AlignFlow can potentially aid the development of such a theory and characterize useful structure for guaranteeing identifiability in underconstrained problems involving multiple domains.

References

Appendices

Appendix A Proofs of Theoretical Results

A.1 Proof of Theorem 1

Proof.

Since the maximum likelihood estimate minimizes the KL divergence between the data and model distributions, the optimal value for L_MLE(G_{Z→A}) is attained at a marginally-consistent mapping, say G*_{Z→A}. Symmetrically, there exists a marginally-consistent mapping G*_{Z→B} that optimizes L_MLE(G_{Z→B}).

From Theorem 1 of Goodfellow et al. [19], we know that the cross-entropy GAN objective is globally minimized when the generated distribution matches the data distribution and the critic is Bayes optimal. Further, from Lemma 1, we know that G*_{A→B} = G*_{Z→B} ∘ (G*_{Z→A})^{-1} is marginally-consistent w.r.t. (p*_A, p*_B). Hence, G*_{A→B} globally minimizes L_GAN(C_B, G_{A→B}). Symmetrically, G*_{B→A} = (G*_{A→B})^{-1} globally minimizes L_GAN(C_A, G_{B→A}).

Since G*_{A→B} globally optimizes all the individual loss terms in the AlignFlow objective in Eq. 7, it globally optimizes the overall objective for any value of λ_A ≥ 0, λ_B ≥ 0. ∎

A.2 Proof of Theorem 2

Proof.

First, we note that only the GAN loss terms depend on C_A and C_B. Hence, the MLE terms are constants for a fixed G_{A→B} and can be ignored when deriving the optimal critics. Next, for any GAN trained with the cross-entropy loss as specified in Eq. 6, we know that the Bayes optimal critic prediction for any a ∈ 𝒜 is given as:

C*_A(a) = p*_A(a) / (p*_A(a) + p_{G_{B→A}}(a))   (13)

where p_{G_{B→A}} denotes the density of G_{B→A}(b) for b ~ p*_B. See Proposition 1 in Goodfellow et al. [19] for a proof.

We can relate the densities p_{G_{B→A}} and p*_B via the change of variables as:

p_{G_{B→A}}(a) = p*_B(b) |det ∂G_{A→B}(a)/∂a|   (14)

where b = G_{A→B}(a).

Substituting the expression for the density p_{G_{B→A}} from Eq. 14 in Eq. 13, we get:

C*_A(a) = p*_A(a) / (p*_A(a) + p*_B(b) |det ∂G_{A→B}(a)/∂a|)   (15)

where b = G_{A→B}(a).

Symmetrically, using Proposition 1 in Goodfellow et al. [19], we have the Bayes optimal critic for any b ∈ ℬ given as:

C*_B(b) = p*_B(b) / (p*_B(b) + p_{G_{A→B}}(b))   (16)

Rearranging terms in Eq. 16, we have:

p*_B(b) = p_{G_{A→B}}(b) C*_B(b) / (1 − C*_B(b))   (17)

for any b ∈ ℬ.

Substituting the expression for the density p*_B from Eq. 17 in Eq. 15, and noting that p_{G_{A→B}}(b) |det ∂G_{A→B}(a)/∂a| = p*_A(a) by the change of variables, we get:

C*_A(a) = 1 − C*_B(b)   (18)

where b = G_{A→B}(a). ∎

A.3 Non-identifiability of Cross-domain Mappings

As discussed, marginal consistency along with invertibility can only reduce the underconstrained nature of the unpaired cross-domain translation problem, not completely eliminate it. In the following result, we identify one such class of non-identifiable model families for the MLE-only objective of AlignFlow (λ_A, λ_B → ∞). We will need the following definitions.

Definition 2.

Let S_n denote the symmetric group of n × n permutation matrices. A function class 𝒢 for the cross-domain mappings is closed under permutations iff for all G_{A→B} = G_{Z→B} ∘ G_{Z→A}^{-1} ∈ 𝒢 and all P ∈ S_n, we have G_{Z→B} ∘ P ∘ G_{Z→A}^{-1} ∈ 𝒢.

Definition 3.

A density p is symmetric iff for all x and all P ∈ S_n, we have p(x) = p(Px).

Examples of distributions with symmetric densities include the isotropic Gaussian and Laplacian distributions.

Proposition 2.

Consider the case where 𝒵 = 𝒜 = ℬ = ℝ^n and 𝒢 is closed under permutations. For a symmetric prior p_Z (e.g., an isotropic Gaussian), there exists an optimal solution G'_{A→B} to the AlignFlow objective (Eq. 7) for λ_A, λ_B → ∞ such that G'_{A→B} ≠ G*_{A→B}; i.e., the MLE-only optimum is not unique.

Proof.

We will prove the proposition via contradiction. That is, let us assume that G*_{A→B} = G*_{Z→B} ∘ (G*_{Z→A})^{-1} is the unique optimal solution to the AlignFlow objective for λ_A, λ_B → ∞ (Eq. 7). Now, consider the alternate mapping G'_{A→B} = G*_{Z→B} ∘ P ∘ (G*_{Z→A})^{-1} obtained by inserting an arbitrary non-identity permutation matrix P ∈ S_n in the shared latent space.

As before, we note that G*_{B→A} = (G*_{A→B})^{-1} and G'_{B→A} = (G'_{A→B})^{-1} due to the invertibility constraints in Eqs. 3-5. Since permutation matrices are invertible and so are G*_{Z→A} and G*_{Z→B}, their composition G'_{A→B} is also invertible. Further, since 𝒢 is closed under permutations and G*_{A→B} ∈ 𝒢, we also have G'_{A→B} ∈ 𝒢.

Next, we note that the inverse of a permutation matrix is also a permutation matrix. Since the prior p_Z is assumed to be symmetric and a transformation specified by a permutation matrix is volume-preserving (i.e., |det P| = 1 for all P ∈ S_n), we can use the change-of-variables formula in Eq. 2 to relate the marginal density p'_A induced by the alternate latent mapping G'_{Z→A} = G*_{Z→A} ∘ P^{-1} to the original one:

p'_A(a) = p_Z(P (G*_{Z→A})^{-1}(a)) |det P| |det ∂(G*_{Z→A})^{-1}(a)/∂a| = p_A(a)   (19)
p'_B(b) = p_B(b)   (20)

where Eq. 20 holds trivially since G'_{Z→B} = G*_{Z→B}. Noting that G'_{A→B} = G'_{Z→B} ∘ (G'_{Z→A})^{-1} and G'_{B→A} = G'_{Z→A} ∘ (G'_{Z→B})^{-1} due to the invertibility constraints in Eqs. 3-5, we can substitute the above equations in Eq. 7. In the MLE-only limit (λ_A, λ_B → ∞), for any choice of C_A and C_B we have:

L_AlignFlow(G'_{A→B}, C_A, C_B; λ_A, λ_B) = L_AlignFlow(G*_{A→B}, C_A, C_B; λ_A, λ_B)   (21)

The above equation implies that G'_{A→B} is also an optimal solution to the AlignFlow objective in Eq. 7 in the MLE-only limit. Thus, we arrive at a contradiction since G*_{A→B} is not the unique optimizer. Hence, proved. ∎

The above construction suggests that MLE-only training can fail to identify the optimal mapping corresponding to the joint distribution p*_{A,B}, even if that mapping lies within the family represented by the AlignFlow architecture. Failure modes due to non-identifiability could also potentially arise for adversarial and hybrid training. Empirically, we find that while MLE-only training gives poor performance for cross-domain translations, the hybrid and adversarial training objectives are much more effective, which suggests that these objectives are less susceptible to identifiability issues in recovering the true mapping.

Appendix B Experiment Details

We used PyTorch [58] for implementing our codebase. All models were trained on a single Nvidia TitanX GPU. We are attaching our anonymized code with the supplementary material and will make it publicly available at the end of the review process.

B.1 Density Estimation

Figure 4: With ImageNet: 3.45 bpd

Figure 4 shows the samples obtained by augmenting the CIFAR-10 dataset with ImageNet data. For these experiments, we shared all sets of parameters across the generators for the two domains (to keep the number of parameters fixed across all approaches) and performed a hyperparameter search over the relative weight of the MLE objective for the augmented dataset.

The Glow architecture used in these experiments is the one used by Kingma and Dhariwal [32] for their CIFAR-10 experiments which, in the notation of [32], corresponds to L = 3 levels, K = 32 flow steps per level, and coupling networks with 512 channels. We now give a brief explanation of these hyperparameters. The Glow architecture consists of levels, which are sequences of transformations that process their inputs at the same spatial scale. The number of levels is denoted by L, and following [32] we use L = 3 for all CIFAR-10 experiments. Each level consists of K flow steps (we use K = 32), and each flow step has three parts: (1) activation normalization, (2) an invertible 1x1 convolution, and (3) an affine coupling layer in which half of the channels are used to compute an affine transformation applied to the remaining channels. In the affine coupling layers, one must choose the function for computing the scale and translate factors: following [32], we use a simple 3-layer CNN with 512 channels. We note that, in contrast with prior works such as Real NVP [17], the Glow architecture does not make use of checkerboard masks in the coupling layers, instead using only channelwise masks.

Due to computational constraints, we run the full-scale Glow model on a single NVIDIA TitanX GPU with a smaller batch size than the original multi-GPU setup of [32]. In any case, the goal of this experiment is to demonstrate relative improvements from augmenting with data from a related domain when methods are compared under a fixed computational budget and model size. The remaining training procedures are standard and match those of [32].

B.2 Image-To-Image Translation

Dataset      Model                          MSE (A → B)   MSE (B → A)
Facades      CycleGAN                       0.7129        0.3286
             AlignFlow (Adversarial only)   0.6727        0.2679
             AlignFlow (Hybrid)             0.5801        0.2512
             AlignFlow (MLE only)           0.9014        0.5960
Maps         CycleGAN                       0.0245        0.0953
             AlignFlow (Adversarial only)   0.0385        0.1123
             AlignFlow (Hybrid)             0.0209        0.0897
             AlignFlow (MLE only)           0.0452        0.1746
CityScapes   CycleGAN                       0.1252        0.1200
             AlignFlow (Adversarial only)   0.2569        0.2196
             AlignFlow (Hybrid)             0.1130        0.1462
             AlignFlow (MLE only)           0.2526        0.2272
Table 2: Mean Squared Error (MSE) comparing CycleGAN and variants of AlignFlow on paired test sets, reported for both translation directions. MSE is computed pixelwise after normalizing image intensities to a fixed range.

In additional preliminary results on image-to-image translation tasks, we evaluate AlignFlow on three image-to-image translation datasets used by Zhu et al. [18]: Facades, Maps, and CityScapes [59]. These datasets are chosen because they provide one-to-one aligned image pairs, so one can quantitatively evaluate unpaired image-to-image translation models via a distance metric such as mean squared error (MSE) between generated examples and the corresponding ground truth. While MSE has limitations, it is reasonable for evaluating one-to-one paired datasets. Note that we restrict ourselves to unpaired translation, so the pairing information is omitted during training and only used for evaluation.

We report the MSE for translations on the test sets after cross-validation of hyperparameters in Table 2, leaving the exploration of other perceptual evaluation metrics for future work. For hybrid models, the MLE weights λ_A and λ_B were treated as additional hyperparameters. We observe that while learning AlignFlow via adversarial training or MLE alone is not as competitive as CycleGAN, hybrid training of AlignFlow significantly outperforms CycleGAN in almost all cases. Specifically, we observe that MLE alone typically performs worse than adversarial training, but together both objectives seem to have a regularizing effect on each other.

Figure 5: Latent space interpolation on Facades. Top: the left-most and right-most images are real datapoints from the source domain (in red boxes). Interpolation is then performed in the latent space and decoded back into the source domain. We see semantically meaningful changes across the row, e.g., in the shadow and the style of the entrance to the building. Bottom: for each corresponding image in the top row, its latent representation is decoded into the target domain. Inspection of the orange regions indicates a change from 3 floors (left) to 4 floors (right).

We use the standard training, validation, and test splits for each dataset. For datasets which do not provide a validation set (e.g., Facades and CityScapes), we randomly hold out a portion of the training set with the same number of images as the test set. We train each model for 200 epochs, with a fixed learning rate for the first 100 epochs followed by a linear decay schedule over the remaining 100 epochs from the initial learning rate to 0. We use the Adam [60] optimizer, and for AlignFlow we apply weight normalization [61] to the generator's parameters. When training with an MLE objective, we apply gradient clipping with a maximum gradient norm of 10. Scaling flow models to higher dimensionality is an active area of research; for this work we resized the images to reduced resolutions for CityScapes, Maps, and Facades. We use a batch size of 16 images.
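
The schedule above (constant learning rate for 100 epochs, then a linear decay to zero over the next 100, with gradient clipping under MLE) can be set up as follows; the learning rate below is a placeholder, not necessarily the value used for these experiments.

import torch

def make_optimizer_and_scheduler(model, lr=2e-4, total_epochs=200, decay_start=100):
    # lr is a placeholder value; the Adam betas are left at library defaults here.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Constant LR for the first decay_start epochs, then linear decay to 0.
    def lr_lambda(epoch):
        if epoch < decay_start:
            return 1.0
        return max(0.0, 1.0 - (epoch - decay_start) / (total_epochs - decay_start))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# When training with an MLE objective, clip gradients before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)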

For MLE/Hybrid models, we used an isotropic Gaussian prior. We use the following flow architecture to parameterize G_{Z→A} and G_{Z→B}:

Scale[Input: 32x32x3, Output: 16x16x6x2]
Scale[Input: 16x16x6, Output: 8x8x12x2]
Scale[Input: 8x8x12, Output: 4x4x24x2]
Scale[Input: 4x4x24, Output: 4x4x24]

where CheckerboardCoupling and ChannelwiseCoupling are affine coupling layers with checkerboard and channelwise masking respectively, and where Squeeze&Split first trades spatial extent for channels by turning each 2x2xC subvolume into a 1x1x4C subvolume, and then splits the volume along the channel dimension and sends half of the features directly to the latent space. See Dinh et al. [17] for more details. Within each affine coupling layer, we parametrize the scale and translate factors using a ResNet [62] architecture with the specified number of channels and residual blocks. We additionally use activation normalization [32] before each coupling layer.
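
For reference, a sketch of the two tensor manipulations named above: the squeeze that trades spatial extent for channels, and a checkerboard mask for partitioning pixels in a coupling layer. Shapes follow the (B, C, H, W) convention; this is an illustration, not the exact implementation used here.

import torch

def squeeze_2x2(x):
    # Squeeze: turn each 2x2xC subvolume into 1x1x4C, i.e. (B, C, H, W) -> (B, 4C, H/2, W/2).
    b, c, h, w = x.shape
    x = x.view(b, c, h // 2, 2, w // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(b, 4 * c, h // 2, w // 2)

def split_channels(x):
    # Split: send half of the channels directly to the latent space, keep the rest.
    return x.chunk(2, dim=1)

def checkerboard_mask(height, width):
    # Binary checkerboard pattern used to partition pixels in a checkerboard coupling layer.
    rows = torch.arange(height).view(-1, 1)
    cols = torch.arange(width).view(1, -1)
    return ((rows + cols) % 2).float()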

Latent space interpolations for the Facades dataset are shown in Figure 5. Again, we see that many aspects of style and content are preserved in the samples corresponding to the latent space interpolations.

Figure 6: Examples of failure modes for CycleGAN reconstructions in MNIST ↔ SVHN cross-domain translation. In each group of 3, a real example is shown on the left, the translated image is shown at center, and the reconstructed image is shown on the right.

B.3 Unsupervised Domain Adaptation

One such model relevant to this experiment is Cycle-Consistent Domain Adaptation (CyCADA) [3]. CyCADA first learns a cross-domain translation mapping from source to target domain via CycleGAN. This mapping is used to stylize the source dataset into the target domain, which is then subject to additional feature-level and semantic consistency losses for learning the target domain classifier [63, 34]. A full description of CyCADA is beyond the scope of discussion of this work; we direct the reader to Hoffman et al. [3] for further details.

We use the same training, validation, and test splits of the MNIST, USPS, and SVHN digit datasets as in CyCADA [3]. For all datasets, images are resized as in CyCADA. We employ the pixel-level and feature-level adaptation training pipeline of CyCADA but replace the CycleGAN-based image translation network with AlignFlow. The architectures for imposing semantic consistency and feature adaptation are the same as the ones used for CyCADA. The architecture and hyperparameter tuning protocol were consistent with those used for image-to-image translation with AlignFlow. For the hyperparameters of feature-level domain adaptation following the image translations, we adopted the optimal hyperparameter settings from ADDA [34].

Failure Modes of Approximate Cycle Consistency. To show the importance of exact cycle-consistency for domain adaptation, we present a few failure modes in the context of CyCADA [3]. In Figure 6, we consider some cross-domain translation cases between MNIST and SVHN. In each group of 3, a real example is shown on the left, the translated image at center, and the reconstructed image on the right. Notice that the class label changes or becomes unrecognizable when translating and reconstructing the input. AlignFlow does not have these failure modes: because cycle-consistency is ensured by design, the reverse translations exactly match the original image in the source domain, preserving essential properties like class identity.