## 1 Introduction

The goal of unsupervised domain alignment is to find a transformation of one dataset that makes it similar to another dataset while preserving the structure of the original. The majority of modern approaches to domain alignment directly search for a transformation of the data that minimizes an empirical estimate of some statistical distance: a non-negative quantity that takes lower values as datasets become more similar. The variability of what "similar" means in this context, which transformations are allowed, and whether data points themselves or their feature representations are aligned leads to a variety of domain alignment methods. Unfortunately, existing estimators of statistical distances either restrict the notion of similarity to enable closed-form estimation (sun2016deep), or rely on adversarial (min-max) training (tzeng2017adversarial), which makes it very difficult to reason quantitatively about the performance of such methods. In particular, the value of the optimized adversarial objective conveys very little about the quality of the alignment, which makes it difficult to perform automatic model selection on a new dataset pair. Normalizing flows (rezende2015variational), on the other hand, are an emerging class of deep neural density models that do not rely on adversarial training. They model a given dataset as a random variable with a simple known distribution transformed by an unknown invertible transformation, often parameterized using a deep neural network. Recent work on normalizing flows has made great strides in defining new rich parameterizations for these invertible transforms (kingma2018glow; grathwohl2018ffjord), but has focused almost exclusively on density estimation.

In this paper we present the Log-likelihood Ratio Minimizing Flow (LRMF), a new non-adversarial approach for aligning datasets "with respect to" a given family of distributions $\mathcal{H}$ (e.g. normal distributions, or PixelCNNs with a fixed architecture). It uses unique properties of normalizing flows to turn an otherwise adversarial optimization problem into a minimization problem. More specifically, we consider two datasets $A$ and $B$ equivalent with respect to $\mathcal{H}$ if there is a single density model in $\mathcal{H}$ that is optimal for both $A$ and $B$. If we fit a density model $h_A$ to $A$ and $h_B$ to $B$, and then fit another "shared" model $h_{AB}$ to the combined dataset containing samples from both $A$ and $B$, the average log-likelihood scores of the shared model on $A$ and $B$ would match the average log-likelihoods of $h_A$ on $A$ and $h_B$ on $B$ only if $h_{AB}$ was an optimal density model for both $A$ and $B$ independently, i.e. only if these datasets are equivalent w.r.t. $\mathcal{H}$. We want to find a transformation $T$ that transforms dataset $B$ in a way that makes $T(B)$ equivalent to $A$. We do that by minimizing the gap between the log-likelihood scores of the "shared" model and the "private" models. For a general $T$, such a transformation can be found only by solving a min-max optimization problem, but in this paper we show that if $T$ is restricted to a family of normalizing flows, then the flow that makes $T(B)$ and $A$ equivalent with respect to $\mathcal{H}$ can be found by minimizing a single objective that attains zero upon convergence. This enables automatic model validation and hyperparameter tuning on a held-out set.

To sum up, the novel non-adversarial data alignment method presented in this paper combines the clear convergence criteria found in non-parametric and simple parametric approaches with the power of the deep neural discriminators used in adversarial models. Our method finds a transformation of one dataset that makes it "equivalent" to another dataset with respect to a specified family of density models. We show that if that transformation is restricted to a normalizing flow, the resulting problem can be solved by minimizing a single simple objective, and that this objective attains zero only if the two domains are correctly aligned. We experimentally verify this claim, and show that the proposed method preserves the local structure of the transformed distribution and is robust to model misspecification by both over- and under-parameterization.

## 2 Log-Likelihood Ratio Minimizing Flow

In this section, we formally define the proposed method for aligning distributions. We assume that $\mathcal{H}$ is a manifold of densities, and we project datasets onto $\mathcal{H}$ by maximizing the likelihood of these datasets across models from $\mathcal{H}$, or equivalently by minimizing the KL divergence with the corresponding empirical Dirac delta mixture "densities". We introduce the log-likelihood ratio pseudo-distance and show its relation to the test statistic of the same name. Intuitively, if we project datasets $A$ and $B$ onto $\mathcal{H}$ independently (obtaining $h_A$ and $h_B$), and also find a point $h_{AB}$ on $\mathcal{H}$ that minimizes the combined distance from both $A$ and $B$, then the log-likelihood ratio distance equals the difference between the approximation quality (negative log-likelihood) of that optimal "shared" approximation and the two optimal "private" approximations (Definition 2.1). Figure 1 illustrates how this quantity changes as two datasets become more similar: the log-likelihood ratio distance between the first pair of datasets [(3)+(4)-(2)-(5)] is larger than the log-likelihood ratio distance between the second pair [(6)+(7)-(8)-(5)], because the shared approximation of the second pair approximates both datasets almost as well as their private approximations. After defining the distance, we consider the problem of finding a transformation $T$ that minimizes this distance (Eq. 1) and provide intuition on how different components of the objective would affect the learned transformation if we minimized it directly (Figure 2). In general, finding such a transformation requires solving an adversarial optimization problem, but we show that if the transformation is restricted to the family of normalizing flows, then the optimal one can be found by minimizing a simple non-adversarial objective (Theorem 2.3). We illustrate this result with an example that can be solved analytically: we show that minimizing the log-likelihood ratio distance between two random variables with respect to the normal density family is equivalent to directly matching their first two moments.

First, let $\mathcal{H} = \{h_\theta\}$ specify the parametric family of distributions we are going to align our datasets "against". We assume that the average negative log-likelihood of a dataset $X$ over some fixed domain under a model from $\mathcal{H}$ is given by $\ell(X; \theta) = -\frac{1}{|X|}\sum_{x \in X} \log h_\theta(x)$.

###### Definition 2.1.

Let us define the log-likelihood ratio distance between datasets $A$ and $B$ with respect to the family of densities $\mathcal{H}$ as the difference between the negative log-likelihoods of the optimal models with "shared" and "private" parameters:

$$d_{\mathcal{H}}(A, B) = \min_\theta \big[\ell(A; \theta) + \ell(B; \theta)\big] - \min_\theta \ell(A; \theta) - \min_\theta \ell(B; \theta).$$
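As a sanity check of this definition, the distance can be computed in closed form for the one-dimensional Gaussian family, where every fit is a moment-matching MLE. The sketch below is our illustrative code, not part of the original implementation:

```python
import numpy as np

def avg_nll_gauss(x, mu, var):
    # average negative log-likelihood of dataset x under N(mu, var)
    return 0.5 * np.log(2 * np.pi * var) + 0.5 * np.mean((x - mu) ** 2) / var

def llr_distance(a, b):
    # "private" fits: Gaussian MLE on each dataset separately
    nll_a = avg_nll_gauss(a, a.mean(), a.var())
    nll_b = avg_nll_gauss(b, b.mean(), b.var())
    # "shared" fit: a single Gaussian minimizing the sum of both average NLLs
    # (for equally sized datasets this is the MLE on the concatenation)
    ab = np.concatenate([a, b])
    mu, var = ab.mean(), ab.var()
    return avg_nll_gauss(a, mu, var) + avg_nll_gauss(b, mu, var) - nll_a - nll_b

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)
b = rng.normal(3.0, 1.0, 5000)
print(llr_distance(a, a))  # 0: a dataset is equivalent to itself
print(llr_distance(a, b))  # > 0: a shifted copy is not
```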

The expression above is also the log-likelihood ratio test statistic for the null hypothesis $H_0: \theta_A = \theta_B$ for the model described by the log-likelihood function $\ell(A; \theta_A) + \ell(B; \theta_B)$.

###### Lemma 2.1.

The log-likelihood ratio distance is always non-negative, $d_{\mathcal{H}}(A, B) \geq 0$, and equals zero only if there exists a single "shared" model $\theta^*$ that approximates both $A$ and $B$ as well as their "private" optimal approximations: $\ell(A; \theta^*) = \min_\theta \ell(A; \theta)$ and $\ell(B; \theta^*) = \min_\theta \ell(B; \theta)$.

###### Proof.

If we define $\ell_A(\theta) = \ell(A; \theta)$ and $\ell_B(\theta) = \ell(B; \theta)$, the first statement follows from the fact that

$$\min_\theta \big[\ell_A(\theta) + \ell_B(\theta)\big] \geq \min_\theta \ell_A(\theta) + \min_\theta \ell_B(\theta),$$

and the second statement comes from the fact that the equality holds only if there exists $\theta^*$ such that $\ell_A(\theta^*) = \min_\theta \ell_A(\theta)$ and $\ell_B(\theta^*) = \min_\theta \ell_B(\theta)$, since for all other $\theta'$ it holds that $\ell_A(\theta') \geq \min_\theta \ell_A(\theta)$, and analogously $\ell_B(\theta') \geq \min_\theta \ell_B(\theta)$. ∎

Now we will introduce the parametric family of transformations $\mathcal{T}$ and show that the adversarial problem that arises if we try to find $T \in \mathcal{T}$ that minimizes the log-likelihood ratio distance $d_{\mathcal{H}}(A, T(B))$, i.e.

$$\min_T d_{\mathcal{H}}(A, T(B)) = \min_T \Big[ \min_\theta \big[\ell(T(B);\theta) + \ell(A;\theta)\big] - \min_{\theta_{TB}} \ell(T(B);\theta_{TB}) - \min_{\theta_A} \ell(A;\theta_A) \Big], \qquad (1)$$

can be solved using non-adversarial minimization if $\mathcal{T}$ is a parametric family of flows. But first, let us examine the intuition of how different components of the objective above would affect the transformed dataset if we optimized them directly. The model $\theta_{TB}$ in the definition above stands for the optimal approximation of $T(B)$. Figure 2 shows that, when minimized over $T$, the first component of the loss, namely $\ell(T(B); \theta_s)$, pulls the transformed dataset towards the best shared model, and the third component $-\ell(T(B); \theta_{TB})$ pushes it away from its optimal private approximation, therefore ensuring that $T(B)$ becomes more similar to $A$ with respect to $\mathcal{H}$ without collapsing onto $\mathcal{H}$. For example, if the two datasets were already equivalent w.r.t. $\mathcal{H}$ but did not necessarily lie on $\mathcal{H}$ itself, the objective above would not encourage them to move towards $\mathcal{H}$. When optimized over $\theta_s$ and $\theta_{TB}$, the first three components of the objective ensure that $\theta_{TB}$ is still an optimal approximation of $T(B)$ and $\theta_s$ is still an optimal approximation of $A$ and $T(B)$ combined. The last component of the loss, optimized over $\theta_A$, is a constant that can be computed separately.

The optimization problem stated above is still adversarial, but the lemma presented below enables us to find the optimal transformation by simply minimizing a modified version of the objective (1) using an iterative method of one’s choice.

###### Lemma 2.2.

If $\mathcal{H}$ can approximate $B$ well, i.e. $\min_\theta \ell(B; \theta) \approx \mathbb{H}(p_B)$, where $p_B$ is the true data distribution of $B$ and $\mathbb{H}$ denotes the entropy, and if $\mathcal{T}$ is a family of normalizing flows from the support of $B$ to itself, then the amount of additional entropy (negative log-likelihood) the transformation $T$ induces upon $B$ according to $\mathcal{H}$ is bounded by the actual amount of entropy it induces upon it:

$$\min_\theta \ell(T(B); \theta) - \min_\theta \ell(B; \theta) \geq \frac{1}{|B|} \sum_{x \in B} \log \left| \det J_T(x) \right|.$$

###### Proof.

We know from the change of variables formula that if $g$ is a normalizing flow and $x = g(z)$ where $z \sim p_Z$, then the log-likelihood can be expressed as

$$\log p_X(x) = \log p_Z(g^{-1}(x)) + \log \left| \det J_{g^{-1}}(x) \right|.$$

Following the same logic, if $x' = T(x)$ and $x \sim p_B$, and considering the invertibility of $T$, we can express the true distribution of the transformed dataset as follows (the superscript in $p^T_B$ indicates that we used $T^{-1}$ instead of $g^{-1}$ in that flow likelihood):

$$\log p^T_B(x') = \log p_B(T^{-1}(x')) + \log \left| \det J_{T^{-1}}(x') \right|.$$

Considering that $\mathcal{H}$ can approximate $B$ well, i.e. $\min_\theta \ell(B;\theta) \approx \mathbb{H}(p_B)$, but might not approximate the true distribution of the transformed dataset as well:

$$-\min_\theta \ell(T(B); \theta) \leq -\mathbb{H}(p^T_B) = -\mathbb{H}(p_B) - \frac{1}{|B|}\sum_{x \in B} \log\left|\det J_T(x)\right| \approx -\min_\theta \ell(B;\theta) - \frac{1}{|B|}\sum_{x \in B} \log\left|\det J_T(x)\right|.$$

After multiplying both sides by negative one and reordering the terms, we get the statement of the lemma. ∎
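The change-of-variables identity at the heart of this proof is easy to verify numerically. In the hedged sketch below (ours, with an arbitrarily chosen flow), the flow $z \mapsto \exp(z)$ pushes a standard normal prior forward to a log-normal, and the density computed through the formula integrates to one:

```python
import numpy as np

# Change of variables: log p_X(x) = log p_Z(f_inv(x)) + log |d f_inv / dx|,
# illustrated with the flow f(z) = exp(z) applied to a standard normal prior.
def log_prior(z):
    return -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2

def log_px(x):
    z = np.log(x)         # inverse flow f_inv(x) = log(x)
    log_det = -np.log(x)  # log |d f_inv / dx| = log(1 / x)
    return log_prior(z) + log_det

# the resulting (log-normal) density should integrate to one
xs = np.linspace(1e-6, 60.0, 1_000_000)
mass = np.sum(np.exp(log_px(xs))) * (xs[1] - xs[0])
print(mass)  # ≈ 1
```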

We get a simpler formula that directly specifies which models should be trained on which data (and is therefore easier to translate into code) if we express the average log-determinant of the Jacobian of the inverse transform, computed at the transformed dataset, using the log-likelihood of the corresponding flow with an arbitrary choice of prior $p_Z$, since the prior term cancels out:

###### Definition 2.2.

Let us define the log-likelihood ratio minimizing flow (LRMF) for a pair of datasets $A$ and $B$, the family of densities $\mathcal{H}$ with the negative log-likelihood $\ell$, and the parametric family $\mathcal{T}$ of normalizing flows with an arbitrary choice of prior $p_Z$, as the flow $T$ that minimizes the following objective:

$$\mathcal{L}_{\mathrm{LRMF}}(T, \theta) = \ell(T(B); \theta) + \ell(A; \theta) - \frac{1}{|B|}\sum_{x \in B} \log\left|\det J_T(x)\right| - C, \qquad (2)$$

where the constant

$$C = \min_\theta \ell(A; \theta) + \min_\theta \ell(B; \theta)$$

does not depend on $T$ and $\theta$, and can be precomputed in advance.
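For intuition, the objective can be minimized directly when the density family is one-dimensional Gaussian and the flow is affine, because every inner fit is available in closed form (so no adversarial inner loop is needed). This is our own hedged illustration, with finite differences standing in for automatic differentiation:

```python
import numpy as np

def avg_nll(x, mu, var):
    # average Gaussian negative log-likelihood
    return 0.5 * np.log(2 * np.pi * var) + 0.5 * np.mean((x - mu) ** 2) / var

def lrmf_loss(params, a, b):
    # empirical log-likelihood ratio distance d(a, T(b)) for T(x) = s*x + t;
    # for the Gaussian family the bound of Lemma 2.2 is tight, so this matches
    # the LRMF objective with the constant already subtracted
    s, t = params
    tb = s * b + t
    ab = np.concatenate([a, tb])
    shared = avg_nll(a, ab.mean(), ab.var()) + avg_nll(tb, ab.mean(), ab.var())
    private = avg_nll(a, a.mean(), a.var()) + avg_nll(tb, tb.mean(), tb.var())
    return shared - private

rng = np.random.default_rng(0)
a = rng.normal(2.0, 0.5, 4000)
b = rng.normal(-1.0, 2.0, 4000)

params = np.array([1.0, 0.0])        # initialize T to the identity
for _ in range(3000):                # plain gradient descent
    grad = np.zeros(2)
    for i in range(2):               # central finite differences
        e = np.zeros(2)
        e[i] = 1e-4
        grad[i] = (lrmf_loss(params + e, a, b) -
                   lrmf_loss(params - e, a, b)) / 2e-4
    params -= 0.1 * grad

print(lrmf_loss(params, a, b))  # ≈ 0 at convergence
print(params)                   # s ≈ 0.25, t ≈ 2.25: first two moments match
```

Note that the loss reaching zero certifies the alignment: at the optimum the transformed dataset and the target have identical Gaussian fits.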

###### Theorem 2.3.

*The objective (2) is an upper bound on the log-likelihood ratio distance $d_{\mathcal{H}}(A, T(B))$, and it equals zero only if $A$ and $T(B)$ are equivalent with respect to $\mathcal{H}$.*

This theorem follows from the definition of $d_{\mathcal{H}}$ and the two lemmas provided above. This result enables us to find the parameters of the normalizing flow that make $A$ and $T(B)$ equivalent with respect to $\mathcal{H}$ using existing gradient descent iterations with known convergence guarantees. Intuitively, the reason we were able to replace the adversarial problem (1) with the minimization problem (2) is that the third term in Eq. 1 acts as an entropy-based "discriminator", but the flow family provides a closed-form expression for estimating the extra entropy the transformation induces upon the dataset without solving an inner optimization problem.

The example below shows that the affine transform that minimizes the log-likelihood ratio distance between two random variables with respect to the normal density family corresponds to shifting and scaling one variable to match the first two moments of the other, which makes intuitive sense.

###### Example 2.1.

Let us consider two random variables $a$ and $b$ with moments $(\mu_a, \sigma_a^2)$ and $(\mu_b, \sigma_b^2)$, restrict $\mathcal{H}$ to normal densities, and the transform to the affine family $T(x) = sx + t$. The negative log-likelihood of the normal distribution fitted to a dataset $X$ depends on its variance as follows:

$$\min_\theta \ell(X; \theta) = \frac{1}{2}\log\left(2\pi \mathrm{Var}[X]\right) + \frac{1}{2}.$$

In our case, $\mathbb{E}[T(b)] = s\mu_b + t$, $\mathrm{Var}[T(b)] = s^2\sigma_b^2$, and $\log|\det J_T| = \log|s|$. Using the fact that the variance of the equal mixture can be expressed using the moments of its components,

$$\sigma_s^2 = \frac{\sigma_a^2 + s^2\sigma_b^2}{2} + \frac{(\mu_a - s\mu_b - t)^2}{4},$$

we can solve the optimization over $\theta$ analytically. Combining the expressions above gives us the final objective:

$$\mathcal{L}(s, t) = \log \sigma_s^2 - \frac{1}{2}\log \sigma_a^2 - \frac{1}{2}\log\left(s^2 \sigma_b^2\right),$$

which can be solved analytically by setting the derivatives of $\mathcal{L}$ w.r.t. $s$ and $t$ to zero, and gives:

$$s = \pm\frac{\sigma_a}{\sigma_b}, \qquad t = \mu_a - s\mu_b.$$
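The closed-form solution is easy to verify numerically. In our sketch below, the analytic optimum is plugged into the empirical distance, written in a variance-only form (valid for equal-sized datasets at the Gaussian MLE):

```python
import numpy as np

def llr_gauss(a, b):
    # empirical log-likelihood ratio distance w.r.t. the 1-D normal family,
    # in variance-only form (holds for equal-sized datasets at the MLE)
    shared_var = np.concatenate([a, b]).var()
    return np.log(shared_var) - 0.5 * np.log(a.var()) - 0.5 * np.log(b.var())

rng = np.random.default_rng(0)
a = rng.normal(1.0, 2.0, 100_000)
b = rng.normal(-3.0, 0.5, 100_000)

s = a.std() / b.std()          # analytic optimum: s = sigma_a / sigma_b
t = a.mean() - s * b.mean()    # analytic optimum: t = mu_a - s * mu_b
print(llr_gauss(a, s * b + t))  # ≈ 0 after moment matching
print(llr_gauss(a, b))          # > 0 before alignment
```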

On replacing the Gaussian prior with a learned density in normalizing flows. We explored whether a similar distribution alignment effect can be achieved by directly fitting a density model to the target distribution first, and then fitting a flow model to the other dataset while replacing the Gaussian prior with the learned density, essentially training the flow to map one dataset onto the other instead of the usual mapping onto a standard Gaussian.

While this procedure worked on distributions that were very similar to begin with, in the majority of cases the log-likelihood fit to the target did not provide informative gradients when evaluated on the transformed dataset, as the KL divergence between distributions with disjoint supports is infinite. Moreover, even when this objective did not explode, the multi-modality of the learned target density often caused the learned transformation to map the source dataset to one of its modes. Training the flow and the target density jointly or in alternation yielded a procedure that was very sensitive to the choice of learning rates and hyperparameters and failed silently, which were the reasons we abandoned adversarial methods in the first place. The LRMF method described in this paper is not susceptible to this problem, because we never train a density estimator on one dataset and evaluate its log-likelihood on another dataset: the LRMF objective (2) shows that we both train and evaluate the shared model on $A$ and $T(B)$, and train and evaluate the private models on the datasets they were fit to.

On directly estimating likelihood scores across domains. One could suggest estimating the similarity between datasets by directly evaluating and optimizing some combination of cross-domain likelihood scores, such as the likelihood of $A$ under the model fitted to $B$ and vice versa. Unfortunately, high likelihood values by themselves are not very indicative of belonging to the dataset used for training the model, especially in higher dimensions, as explored by nalisnick2018deep. One intuitive example of this effect in action: for a high-dimensional standard normal random variable, the probability of observing a sample in the neighbourhood of zero is small, but if we had a dataset sampled from that neighbourhood, its log-likelihood would be high, even higher than the likelihood of a dataset sampled from the standard normal itself. The proposed method, however, is not susceptible to this issue, as we always evaluate the likelihood on the same dataset we used for training.
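This effect is easy to reproduce numerically (our sketch): in $d = 100$ dimensions, points concentrated near the origin receive a higher average log-likelihood under the standard normal than samples actually drawn from it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
typical = rng.normal(0.0, 1.0, (1000, d))    # samples from N(0, I)
near_zero = rng.normal(0.0, 0.1, (1000, d))  # samples from a small ball at 0

def avg_loglik(x):
    # average log-density under the standard normal N(0, I_d)
    return np.mean(-0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(x ** 2, axis=1))

# the "wrong" dataset scores strictly higher than the in-distribution one
print(avg_loglik(near_zero) - avg_loglik(typical))  # > 0
```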

On using the score test instead of the likelihood ratio. If we perform a Taylor expansion of the log-likelihood ratio statistic near the optimal shared model, we get the score test statistic, a "lighter" version of the log-likelihood ratio test that requires training only a single model. Intuitively, if we train a model from $\mathcal{H}$ simultaneously on two datasets $A$ and $B$ until convergence, i.e. until the average gradient of the loss w.r.t. the weights summed across both datasets becomes small, $\|\nabla_\theta [\ell(A;\theta) + \ell(B;\theta)]\| \approx 0$, then the combined norm of the two gradients computed across each dataset independently, $\|\nabla_\theta \ell(A;\theta)\| + \|\nabla_\theta \ell(B;\theta)\|$, would be small only under the null hypothesis ($A$ and $B$ are equivalent w.r.t. $\mathcal{H}$). In our experience, this approach works well for detecting the presence of domain shift, but is hardly suitable for direct minimization. We would also like to point readers to the relation between such score-based objectives and the recently proposed Invariant Risk Minimization objective (arjovsky2019invariant), discussed in more detail in the supplementary.
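A minimal sketch of this statistic for the Gaussian-mean family (our code and naming): the summed gradient vanishes at the shared optimum by construction, while the per-dataset gradient norms vanish only when the two datasets agree:

```python
import numpy as np

def score_stat(a, b):
    # fit the shared model: MLE of the mean under a unit-variance Gaussian
    mu = np.concatenate([a, b]).mean()
    g_a = mu - a.mean()   # gradient of the average NLL of A w.r.t. mu
    g_b = mu - b.mean()   # gradient of the average NLL of B w.r.t. mu
    # g_a + g_b == 0 at the shared optimum (equal-sized datasets); the score
    # test asks whether each per-dataset gradient vanishes individually
    return abs(g_a) + abs(g_b)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)
print(score_stat(a, rng.normal(0.0, 1.0, 5000)))  # small: no domain shift
print(score_stat(a, rng.normal(2.0, 1.0, 5000)))  # ≈ 2: shifted domain
```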

On matching the parameters of density models. Two major objections we have to directly minimizing the distance between the parameters of density models fitted to the respective datasets are that: a) the set of parameters that describes a given distribution might not be unique, and this objective does not account for that case; and b) one would have to employ higher-order derivatives of the likelihood function to account for the fact that not all parameters contribute equally to the learned density function, rendering this objective computationally infeasible to optimize for even moderately complicated density models.

## 3 Related work

Domain Adaptation. Early neural feature-level domain adaptation methods such as DAN (long2015learning) or JAN (long2017deep) directly optimized closed-form estimates of non-parametric statistical distances (e.g. maximum mean discrepancy, earth mover's distance, or energy distance) between deep features of data points from two domains. Other early neural DA methods such as DeepCORAL (sun2016deep) approximated domain distributions via simple parametric density models with known closed-form expressions for the KL divergence, e.g. pairs of Gaussians. Unfortunately, both non-parametric and simple parametric models are limited to feature-level adaptation because of their inability to capture the internal structure of complicated real-world datasets. These limitations were addressed by recent adversarial (GAN-based) parametric approaches that used deep convolutional networks for domain discrimination, including DANN (ganin2016domain) and ADDA (tzeng2017adversarial). Unfortunately, the adversarial nature of these objectives makes the respective models notoriously hard to train and provides no domain-agnostic convergence validation and automated model selection protocols beyond visually assessing generated images or evaluating the performance on a different downstream task, if ground truth target labels are available.

Normalizing Flows. The main assumption behind normalizing flows (rezende2015variational) is that the observed data can be modeled as a simple distribution transformed by an unknown invertible transformation. The density at a given point can then be estimated using the change of variables formula by estimating the determinant of the Jacobian of that transformation at the given point. The main challenge in developing such models is to define a class of transformations that are invertible, rich enough to model real-world distributions, and simple enough to enable direct estimation of the aforementioned Jacobian determinant. Notable examples of recently proposed normalizing flows include Real NVP (dinh2016density), the first to fit a good flow model of a real-world dataset; GLOW (kingma2018glow), which built upon Real NVP (more general learnable permutations at multiple scales) and was the first to show that a large enough flow can model high-resolution images; and the recent FFJORD (grathwohl2018ffjord), which made use of the fact that the forward simulation of an ordinary differential equation with an arbitrary velocity field satisfies all the requirements for a practically useful normalizing flow.

Composition of inverted flows. One natural alternative to the approach proposed in this paper, explored by the authors of AlignFlow (grover2019alignflow), is to train two flow models on datasets $A$ and $B$ independently and to use their "back-to-back" composition to map points from one dataset to the other. We argue that the structure of the dataset manifold is completely erased by this composition, since each flow is independently trained to map its structured input to the structureless standard Gaussian. Figure 3 shows what happens if we treat the vertices of two meshes as observations from two mesh surface point distributions, fit two different flows, and pass one vertex cloud through the back-to-back composition of the learned flows. The number of points in each sub-volume of the ambient space matches the corresponding number in the transformed point cloud (third column), but if we draw the faces of the original mesh at the new vertex positions, we observe that all relations between vertices (the local structure of the original mesh surface manifold) are distorted beyond recognition. We argue that this must be due to the fact that each flow is trained to "fold" a two-dimensional surface distribution into the interior of the three-dimensional Gaussian ball independently, and two "incompatible foldings" render correspondences between $A$ and $B$ meaningless. The authors of AlignFlow suggest sharing weights between the two flows to mitigate this issue, but we believe that this is only a partial solution that does not address the core of the issue. Training a flow model to directly map one distribution to the other with a likelihood objective (fourth column) preserves the local structure of the original distribution. The authors of PointFlow (yang2019pointflow) showed that a hierarchical FFJORD trained on point clouds of mesh surfaces can be used to align these point clouds in a similar fashion, but the point correspondences found by PointFlow are again due to the spatial co-occurrence of the respective parts of the meshes (the bottom-left leg of the humanoid is always at the bottom left) and do not respect the structure of the respective surface manifolds.

CycleGAN with normalizing flows. RevGAN (van2019reversible) made use of GLOW (kingma2018glow) to enforce the cycle consistency of the CycleGAN on the architecture level, but left the loss and the adversarial training procedure unchanged. We believe that the normalizing flow model for dataset alignment should make use of the maximum likelihood principle since the ability to fit rich models with plain minimization and validate their performance on held out sets are the primary selling points of normalizing flows that should not be dismissed.

Likelihood ratio testing for out-of-distribution detection. nalisnick2018deep recently observed that likelihood scores by themselves are not sufficient for determining whether a given data point came from the same dataset as the one used for training the density model. Independently from us, a recent paper by LLR_NIPS2019_9611 suggested using the log-likelihood ratio test for out-of-distribution detection of genomic sequences. We go one step further and propose a simple procedure for minimizing this measure of dataset discrepancy.

## 4 Results

(Figure: the probability density function of the shared model fitted to both datasets; when the LRMF objective converges, it matches the private fits. (f) The visualization of the trained normalizing flow: at each point we draw a vector pointing along the learned displacement, with color intensity and length proportional to its magnitude.)

(Figure: the top row shows how well two domains (red and blue) are aligned by different methods trained to transform the red dataset to match the blue dataset. The middle row shows the new positions of points, colored consistently with the first column. Each domain contains two moons, and the bottom row shows what happens to each moon from the red dataset after the alignment. Numbers at the bottom of each figure show the accuracy of the 1-nearest-neighbor classifier trained on labels from the blue domain and evaluated on transformed samples from the red domain. The dynamics of the LRMF alignment can be found in Figure 5.)

(Figure: according to the distribution of log-odds of the classifiers trained to discriminate USPS from MNIST (left) and transformed USPS from MNIST (right), LRMF learned a transformation of latent codes that made transformed USPS and MNIST indistinguishable by a CNN discriminator. (c) The LRMF objective attains zero upon convergence (red line). Intermediate steps and more examples of transformed USPS digits are given in the supplementary (Figure 10).)

In this section we show experiments on simple datasets that verify that minimizing the proposed LRMF objective (Eq. 2) with Gaussian, Real NVP, and FFJORD density estimators and transformations does indeed result in dataset alignment. We also show that both under- and over-parameterized LRMFs performed well in practice, and that the resulting flows preserved the local structure of the aligned datasets. We also show that the Real NVP LRMF produced a semantically meaningful alignment in the embedding space of an autoencoder trained simultaneously on two digit domains (MNIST and USPS). We provide LaTeXed pseudo-code of the training algorithm and Jupyter notebooks with code in JAX (jax2018github) and TensorFlow Probability (DBLP:journals/corr/abs-1711-10604) to reproduce these experiments in the supplementary.

Data. The blobs dataset pair contains two samples drawn from two 2-dimensional Gaussians. The moons dataset contains two pairs of moons rotated relative to one another. The exact parameters are given in the supplementary. Experiments on MNIST and USPS were conducted in the 32-dimensional embedding space of a VAE-GAN trained on unlabeled images from both digit domains scaled to 28x28.

Affine LRMF w.r.t. the Gaussian family. We parameterized the linear part of the affine transformation and the covariance of the Gaussian density so that both remain positive definite. The first row in Figure 5 shows that, just as in Example 2.1, the affine log-likelihood ratio minimizing flow with respect to the normal density family matches the first two moments of the aligned distributions. Blue, red, and cyan points represent the target, source, and transformed datasets respectively, and ellipses represent the levels of the fitted shared and private densities. The loss converges to zero at the optimum. The second row contains an experiment on the moons dataset pair containing two pairs of moons (blue and red). As in the previous experiment, the affine LRMF w.r.t. the normal density matches the first two moments of the aligned distributions. The loss value in the second experiment dropped below zero (red line) around the optimum because of the randomness and the size of the minibatch; in expectation the loss is always positive. This experiment shows that even though the normal family of distributions is not sufficient to approximate the moon distributions, it still aligns them while preserving their local structure.

Real NVP and FFJORD LRMF. For Real NVP experiments we stacked four NVP blocks (spaced by permutations), each block parameterized by a dense neural network for predicting shift and scale with two 512-neuron hidden layers with ReLUs (the "default" Real NVP). We used the Adam optimizer for training. The first row (i) in Figure 5 shows that even with a severely over-parameterized transformation and density family, the LRMF model successfully found a meaningful correspondence between the two blob datasets, the loss converged to zero, and the shared density at the optimum agreed with the private densities, as expected. The second row (ii) shows that the Real NVP LRMF learned the structure of the data manifolds and preserved it during translation. The FFJORD LRMF also converged to zero and performed marginally better than the Real NVP model; the visualizations can be found in the supplementary (Figure 9).

SN-GAN. The Spectral Normalization (miyato2018spectral) of the discriminator's weights prevented the adversarial training from collapsing. Unfortunately, as presented in Figure 8, neither the convergence (c, d; blue line) nor the performance (b) of the learned transformation can be inferred from the objective value alone, and the model tends to diverge from the aligned configuration over time (b). As a result, the overall procedure still requires either visual inspection of transformed samples or an external and domain-specific stopping heuristic, such as the Inception Score or the Frechet Inception Distance. The convergence of the proposed log-likelihood ratio minimizing flow (orange line), on the other hand, can be judged by examining the average objective value alone: if the objective reaches zero, the datasets are guaranteed to be aligned with respect to the specified density family.

Local structure. Results presented in Figure 8 show that, despite the perfect marginal alignment produced by the EMD and the back-to-back flow composition, among minimization objectives only LRMF correctly preserved the local structure of the transformed dataset (red). The middle row shows where the points of the "red" domain moved after the transformation. Numbers at the bottom of Figure 6 were obtained by training a 1-nearest-neighbor classifier on ground truth labels for one domain and testing it on transformed samples from the other. Results for MMD and EMD are deterministic, so no error bars are provided; LRMF experiments were repeated ten times. The bandwidth of the Gaussian kernel in MMD was chosen to maximize the classification accuracy. The best alignment produced by the SN-GAN was comparable to the LRMF alignment, but required manual visual assessment for early stopping, as mentioned above.

USPS-to-MNIST. We trained a VAE-GAN to embed the combined unlabeled dataset containing digit images from both USPS and MNIST into a shared 32-dimensional latent space. We then trained a Real NVP log-likelihood ratio minimizing flow to map latent codes of USPS digits to latent codes of MNIST digits. Figure 8 shows that the LRMF objective attained zero upon convergence and semantically aligned images from the two domains. The two bar charts in the middle of Figure 8 show the log-odds of a convolutional classifier pre-trained to discriminate original MNIST images from original USPS images (on the left), and the log-odds of another classifier (on the right) re-trained to discriminate original MNIST images from transformed USPS images at each iteration. Both classifiers were able to perfectly discriminate MNIST from USPS at first (top line), but after convergence neither was able to discriminate the original MNIST from the transformed USPS. A more detailed version of this figure with translation results and log-odds at intermediate steps is given in the supplementary (Figure 10).

## 5 Conclusion

In this paper we propose a new dataset alignment objective parameterized by a deep density model and a normalizing flow that, when it converges to zero, guarantees that a density model fitted to the transformed source dataset will be optimal for the target, and vice versa. We also show that the method is robust to model misspecification and preserves the local structure of the dataset better than other non-adversarial alignment objectives and a composition of inverted normalizing flows.

## 6 Acknowledgements

This work was partially supported by NSF award #1724237 and DARPA.

## References

## 7 Supplementary Material

### 7.1 Pseudo-code for the learning algorithm

This pseudo-code generally follows the JAX implementation we provide; the actual LRMF is trained in mini-batches.
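Since the original listing did not survive extraction, the following hedged sketch (our naming, reconstructed from objective (2), not the authors' verbatim listing) outlines the training loop:

```
# Hedged sketch of the LRMF training loop (our naming, not verbatim).
fit private density h_A to dataset A             # gives min avg NLL on A
fit private density h_B to dataset B             # gives min avg NLL on B
C <- avg_nll(h_A, A) + avg_nll(h_B, B)           # precomputed constant
initialize flow T and shared density h_s
repeat:
    x_A, x_B <- mini-batches from A and B
    loss <- avg_nll(h_s, x_A) + avg_nll(h_s, T(x_B))
            - mean(log |det J_T(x_B)|) - C       # objective (2)
    jointly update T and h_s with a gradient step on loss
until loss ~ 0 on a held-out split               # zero certifies alignment w.r.t. the family
```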

### 7.2 Attached code

Attached IPython notebooks were tested to work as expected in Colab. The JAX version includes experiments on 1D and 2D Gaussians and Real NVP; the TensorFlow Probability (TFP) version includes experiments on Real NVP and FFJORD.

### 7.3 Dataset parameters

Blobs datasets were sampled from 2-dimensional Gaussians with the following parameters:

Moons dataset pairs were generated with 2000 samples each.

### 7.4 On the relation to the Invariant Risk Minimization

In a recent arXiv submission, arjovsky2019invariant suggested that in the presence of observable variability in the environment (e.g. labeled night-vs-day variability in images), the representation function that minimizes the conventional empirical risk across all variations actually yields a subpar classifier. One interpretation of this statement is that instead of searching for a representation function that minimizes the expected value of the risk across all variations of the environment, one should look for a representation that is optimal under each individual variation of the environment. The authors linearize this objective combined with the conventional ERM around the optimal classifier, and express the aforementioned optimality across all environments as a gradient penalty term that equals zero only if the classifier is indeed optimal across all environment variations.

This procedure and the resulting objective are very much reminiscent of the log-likelihood ratio minimizing flow objective we propose in this paper, and of the score test version we would have obtained if we linearized our objective around the optimal shared model. The main difference is that arjovsky2019invariant applied the idea of invariance across changing environments to the setting of supervised training via risk minimization, whereas we apply it to unsupervised alignment via likelihood maximization.

### 7.5 FFJORD LRMF on moons.

As mentioned in the main paper, FFJORD LRMF performed on par with the Real NVP version. We had to fit the flow to the identity function prior to optimizing the LRMF objective, because the Glorot-uniform-initialized 5-layer neural network with tanh non-linearities (used as the velocity field in FFJORD) generated significantly non-zero outputs. The dynamics can be found in Figure 7.