Improved Conditional Flow Models for Molecule to Image Synthesis

by Karren Yang, et al.

In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. Building on the recent success of graph neural networks for learning molecular embeddings and flow-based models for image generation, we propose Mol2Image: a flow-based generative model for molecule to cell image synthesis. To generate cell features at different resolutions and scale to high-resolution images, we develop a novel multi-scale flow architecture based on a Haar wavelet image pyramid. To maximize the mutual information between the generated images and the molecular interventions, we devise a training strategy based on contrastive learning. To evaluate our model, we propose a new set of metrics for biological image generation that are robust, interpretable, and relevant to practitioners. We show quantitatively that our method learns a meaningful embedding of the molecular intervention, which is translated into an image representation reflecting the biological effects of the intervention.




1 Introduction

High-content cell microscopy assays have gained traction in recent years, as the rich morphological data in the images has proven more informative for drug discovery than conventional targeted screens Caicedo et al. (2016); Eggert (2013); Swinney and Anthony (2011). Motivated by these developments, we aim to build what is, to our knowledge, the first generative model to synthesize cell microscopy images under different molecular interventions, translating molecular information into a high-content and interpretable image representation of the intervention. Such a system has numerous practical applications in drug development. For example, it could enable practitioners to virtually screen compounds based on their predicted morphological effects on cells, allowing more efficient exploration of the vast chemical space and reducing the resources required to perform extensive experiments Reddy et al. (2007); Shoichet (2004); Walters et al. (1998). In contrast to conventional models that predict specific chemical properties, a molecule-to-image synthesis model has the potential to produce a panoptic view of the morphological effects of a drug that captures a broad spectrum of properties such as mechanisms of action Ljosa et al. (2013); Loo et al. (2009); Perlman et al. (2004) and gene targets Breinig et al. (2015).

To build our molecule-to-image synthesis model (Mol2Image), we integrate state-of-the-art graph neural networks for learning molecular representations with flow-based generative models. Flow-based models are a relatively recent class of generative models that learn the data distribution by directly inferring the latent distribution and maximizing the log-likelihood of the data Dinh et al. (2014, 2016); Kingma and Dhariwal (2018). Compared to other classes of deep generative models such as variational autoencoders (VAEs) Kingma and Welling (2013) and generative adversarial networks (GANs) Goodfellow et al. (2014), flow-based models do not rely on approximate posterior inference or adversarial training and are less prone to training instability and mode collapse, making them advantageous for biological applications Sun et al. (2019). Nevertheless, molecule-to-image synthesis is a challenging task that highlights key, unsolved problems in flow-based generation. Current flow architectures cannot scale to full-resolution cell images (e.g., 512 x 512) and are unable to separately generate image features at multiple spatial resolutions, which is important to disentangle coarse features (e.g., cell distribution) from fine features (e.g., subcellular localization of proteins). While separate generation of image features at different resolutions has been demonstrated using GANs Denton et al. (2015), this is still an open problem for flow-based models. Furthermore, existing formulations of flow models do not effectively leverage conditioning information when the relationship between the image and the conditioning information is complex or subtle, as is the case with molecular interventions. This results in generated samples that do not reflect the conditioning information.

Contributions. In this work, we develop (to our knowledge) the first molecule-to-image synthesis model for generating high-resolution cell images conditioned on molecular interventions. Specifically,

  • We develop a new architecture and approach for flow-based generation based on a Haar wavelet image pyramid, which generates image features at different spatial resolutions separately and enables scaling to high-resolution images.

  • We propose a new training strategy based on contrastive learning to maximize the mutual information between the latent variables of the flow model and the embedding of the molecular graph, ensuring that generated images reflect the molecular intervention.

  • We establish a set of evaluation metrics specific to biological image generation that are robust, interpretable, and relevant to biological practitioners.

  • We demonstrate that our method outperforms the baselines by a large margin, indicating potential for application to virtual screening.

Although we focus on molecule-to-image synthesis in this work, our generative approach can potentially extend to other applications, e.g., text-to-image synthesis Reed et al. (2016).

(a) Model Architecture
(b) Training Strategy
Figure 1: (a) (Red box) Our flow-based model architecture based on a Haar wavelet image pyramid. Information flow follows the black arrows during training/inference and the red arrows during generation. The dashed lines represent conditioning and are used in both training and generation. (Green box) Molecular information is processed and input to the network via a graph neural network. (b) Our training strategy for effective molecule-to-image synthesis. See text for details.

2 Related Work

Biological Image Generation. Osokin et al. use GAN architectures to generate cellular images of budding yeast to infer missing fluorescence channels (stained proteins) in a dataset where only two channels can be observed at a time Osokin et al. (2017). Separately, Goldsborough et al. qualitatively evaluate the use of different GAN variants in generating three-channel images of human breast cancer cell lines Goldsborough et al. (2017). While these works consider the task of generating single cell images, neither considers the generation of cells conditioned on complex inputs nor the generation of multi-cell images, which is useful in observing cell-to-cell interactions Okagaki et al. (2010) and variability Mattiazzi Usaj et al. (2020). A separate, similar line of investigation in histopathology and medical imagery has used GAN models to refine and generate synthetic datasets for training downstream classifiers but does not address the difficulty of conditional image generation necessary to capture drug interventions Hou et al. (2017); Mahmood et al. (2018); Yi et al. (2019). While both high throughput image-based drug screens Caicedo et al. (2018) and molecular structures Yang et al. (2019) have been used to generate representations of small molecules, little work has focused on learning representations of these modalities jointly.

Graph Neural Networks for Molecules. A neural network formulation on graphs was first proposed by Gori et al. (2005) and Scarselli et al. (2009) and later extended to various graph neural network (GNN) architectures Li et al. (2015); Dai et al. (2016); Niepert et al. (2016); Kipf and Welling (2017); Hamilton et al. (2017); Lei et al. (2017); Velickovic et al. (2017); Xu et al. (2018). In the context of molecule property prediction, Duvenaud et al. (2015) and Kearnes et al. (2016) first applied GNNs to learn neural fingerprints for molecules. Gilmer et al. (2017) further enhanced GNN performance by using set2set readout functions and adding virtual nodes into molecular graphs. Yang et al. (2019) provided extensive benchmarking of various GNN architectures and demonstrated the advantage of GNNs over traditional Morgan fingerprints Rogers and Hahn (2010) as well as domain-specific features. While these works mainly focused on predicting numerical chemical properties, we focus here on using GNNs to learn rich molecular representations for molecule-to-image synthesis.

Flow-Based Generative Models. A flow-based generative model (e.g., Glow) is a sequence of invertible networks that transforms the input distribution to a simple latent distribution such as a spherical Gaussian Dinh et al. (2014, 2016); Ho et al. (2019); Kingma and Dhariwal (2018); Louizos and Welling (2017); Rezende and Mohamed (2015). Conditional variants of Glow have recently been proposed for image segmentation Lu and Huang (2019); Winkler et al. (2019), modality transfer Kondo et al. (2019); Sun et al. (2019), image super-resolution Winkler et al. (2019), and image colorization Ardizzone et al. (2019). These applications are variants of image-to-image translation tasks and leverage the spatial correspondence between the conditioning information and the generated image. Other conditional models perform generation given an image class Kingma and Dhariwal (2018) or a binary attribute vector Liu et al. (2019). Since the condition is categorical, these models apply auxiliary classifiers in the latent space to ensure that the model learns the correspondence between the condition and the image. Unlike these works, we generate images from molecular graphs; here spatial correspondence is not present and the conditioning information cannot be learned using a classifier. Therefore we must leverage other techniques to ensure correspondence between the generated images and the conditioning information.

In addition to conditioning on molecular structure, our flow model architecture is based on an image pyramid, which conditions the generation of fine features at a particular spatial resolution on a coarse image from another level of the pyramid. Flow-based generation of images conditioned on other images has been explored in various previous works Ardizzone et al. (2019); Kondo et al. (2019); Lu and Huang (2019); Sun et al. (2019); Winkler et al. (2019), but different from these works, our flow-based model leverages conditioning to break generation into successive steps and refine features at different scales. Our approach is inspired by methods such as Laplacian Pyramid GANs Denton et al. (2015) that break GAN generation into successive steps. A key design choice here is our use of a Haar wavelet image pyramid instead of a Laplacian pyramid, which avoids introducing redundant variables into the model and is an important consideration for flow-based models. Ardizzone et al. (2019) use the Haar wavelet transform to improve training stability, but they do not consider the framework of an image pyramid for separately generating features at different spatial resolutions.

3 Method

Our approach is to develop a flow-based generative model for synthesizing cell images conditioned on the molecular embeddings of a graph neural network. We first provide an overview of graph neural networks (Section 3.1) and generative flows (Section 3.2). In Section 3.3, we describe our novel multi-scale flow architecture that generates images in a coarse-to-fine process based on the framework of an image pyramid. The architecture separates generation of image features at different spatial resolutions and scales to high-resolution cell images. In Section 3.4, we describe a novel training strategy using contrastive learning for effective molecule-to-image synthesis.

3.1 Preliminaries: Graph Neural Networks

A molecule $y$ can be represented as a labeled graph $G_y$ whose nodes are the atoms in the molecule and whose edges are the bonds between the atoms. Each node $v$ has a feature vector $f_v$ including its atom type, valence, and other atomic properties. Each edge $(u, v)$ is also associated with a feature vector $f_{uv}$ indicating its bond type. A graph neural network (GNN) $g$ learns to embed a graph $G_y$ into a continuous vector $g(y)$. In this paper, we adopt the GNN architecture from Dai et al. (2016); Yang et al. (2019), which associates hidden states $h_v$ with each node $v$ and updates these states by passing messages $m_{uv}$ over edges $(u, v)$. Each message $m_{uv}^{(0)}$ is initialized at zero. At time step $t$, the messages are updated as follows:

$$m_{uv}^{(t+1)} = \mathrm{MLP}\Big(f_u,\; f_{uv},\; \sum_{w \in N(u),\, w \neq v} m_{wu}^{(t)}\Big) \tag{1}$$

where $N(u)$ is the set of neighbor nodes of $u$ and MLP stands for a multilayer perceptron. After $T$ message passing steps, we compute hidden states $h_u$ as well as the final representation $g(y)$ as

$$h_u = \mathrm{MLP}\Big(f_u,\; \sum_{v \in N(u)} m_{vu}^{(T)}\Big), \qquad g(y) = \sum_{u} h_u \tag{2}$$

3.2 Preliminaries: Generative Flows

A generative flow $f$ consists of a sequence of invertible functions $f = f_L \circ \cdots \circ f_1$ that transform an input variable $x$ to a Gaussian latent variable $z$. The generative process is defined as:

$$z \sim \mathcal{N}(0, I), \qquad h_{L-1} = f_L^{-1}(z), \quad \ldots, \quad x = h_0 = f_1^{-1}(h_1) \tag{3}$$

where $h_i$ are the intermediate variables that arise from applying the inverse of individual flow functions $f_i$. By the change-of-variables formula, the log-likelihood of sampling $x$ is

$$\log p(x) = \log p_{\mathcal{N}}(z) + \sum_{i=1}^{L} \log \left| \det \frac{\partial h_i}{\partial h_{i-1}} \right| \tag{4}$$

where $h_0 = x$, $h_L = z$, and $p_{\mathcal{N}}$ is the Gaussian probability density function. In this paper, we adopt the flow functions from the Glow model Kingma and Dhariwal (2018), in which each flow consists of actnorm, invertible 1 x 1 convolution, and coupling layers (see Kingma and Dhariwal (2018) for details). The Jacobian matrices of these transformations are triangular and hence have log-determinants that are easy to compute. As a result, the log-likelihood of the data is tractable and can be efficiently optimized with respect to the parameters of the flow functions.
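To make the change-of-variables computation concrete, the following minimal sketch stacks two elementwise affine maps and evaluates the log-likelihood as the Gaussian log-density of the latent plus the summed log-determinants. The `AffineFlow` class is a hypothetical stand-in chosen for readability; it is not one of the Glow layers, and the scale and shift values are arbitrary.

```python
import math

class AffineFlow:
    """Elementwise invertible map h -> scale * h + shift (illustrative only)."""
    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift

    def forward(self, h):
        out = [s * x + b for s, x, b in zip(self.scale, h, self.shift)]
        # The Jacobian is diagonal, so its log-determinant is a sum of logs.
        log_det = sum(math.log(abs(s)) for s in self.scale)
        return out, log_det

    def inverse(self, h):
        return [(x - b) / s for s, x, b in zip(self.scale, h, self.shift)]

def standard_normal_logpdf(z):
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)

def log_likelihood(flows, x):
    """Map x to the latent z and accumulate log-determinants along the way."""
    h, total_log_det = x, 0.0
    for flow in flows:
        h, log_det = flow.forward(h)
        total_log_det += log_det
    return standard_normal_logpdf(h) + total_log_det, h

def generate(flows, z):
    """Run the flows in reverse order to map a latent back to data space."""
    h = z
    for flow in reversed(flows):
        h = flow.inverse(h)
    return h

flows = [AffineFlow([2.0, 0.5], [1.0, -1.0]), AffineFlow([0.25, 4.0], [0.0, 3.0])]
x = [0.3, -1.2]
log_px, z = log_likelihood(flows, x)
x_back = generate(flows, z)  # recovers x up to rounding
```

The same two functions, `log_likelihood` for training and `generate` for sampling, are what any invertible layer has to support; more expressive layers only change how `forward`, `inverse`, and the log-determinant are computed.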

3.3 Proposed Multi-Scale Flow Architecture

Existing multi-scale architectures for generative flows Dinh et al. (2016); Kingma and Dhariwal (2018) do not separately generate features for different spatial resolutions and cannot scale to full-resolution cell images. In the following, we propose a novel multi-scale architecture that generates cell images in a coarse-to-fine fashion and enables scaling to high-resolution images. Our architecture integrates flow units into the framework of an image pyramid generated by recursive 2D Haar wavelet transforms.

Haar Wavelets. Wavelets are functions that can be used to decompose an image into coarse and fine components. The Haar wavelet transform generates the coarse component in a way that is equivalent to nearest neighbor downsampling. The coarse component is obtained by convolving the image with an averaging matrix followed by sub-sampling by a factor of 2, and the fine components are obtained by convolving the image with three different matrices followed by sub-sampling by a factor of 2:

$$\Phi_{\text{coarse}} = \frac{1}{4}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad \Phi_{\text{fine},1} = \frac{1}{4}\begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}, \quad \Phi_{\text{fine},2} = \frac{1}{4}\begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}, \quad \Phi_{\text{fine},3} = \frac{1}{4}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \tag{5}$$

To generate an image pyramid that captures features at different spatial resolutions, we recursively apply Haar wavelet transforms to the coarse image. Specifically, let $\{x_0, x_1, \ldots, x_n\}$ be a pyramid of downsampled images, where $x_i$ represents the image $x$ after $i$ applications of the coarse operation. We apply the fine operation to each downsampled image except the last, resulting in the image pyramid $\{\tilde{x}_0, \ldots, \tilde{x}_{n-1}, x_n\}$. The image at each spatial resolution can be reconstructed recursively,

$$x_i = U\big(\Phi^{-1}[x_{i+1}, \tilde{x}_i]\big)$$

where $U$ represents spatial upsampling, the brackets indicate concatenation, and $\Phi^{-1}$ represents the inverse of the linear operation corresponding to the 2D Haar wavelet transform; see Equation (5).
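The recursive decomposition and reconstruction can be checked with a small single-channel sketch. It assumes the standard 2 x 2 Haar sign patterns with a 1/4 (averaging) normalization on all four filters; the fine-filter normalization in particular is an assumption, and the function names `haar_split`, `haar_merge`, `build_pyramid`, and `reconstruct` are hypothetical.

```python
def haar_split(image):
    """One Haar level on a 2D list with even height and width."""
    coarse, fine = [], []
    for r in range(0, len(image), 2):
        coarse_row, fine_row = [], []
        for c in range(0, len(image[0]), 2):
            p00, p01 = image[r][c], image[r][c + 1]
            p10, p11 = image[r + 1][c], image[r + 1][c + 1]
            coarse_row.append((p00 + p01 + p10 + p11) / 4.0)  # block average
            fine_row.append((
                (p00 - p01 + p10 - p11) / 4.0,  # left-minus-right detail
                (p00 + p01 - p10 - p11) / 4.0,  # top-minus-bottom detail
                (p00 - p01 - p10 + p11) / 4.0,  # diagonal detail
            ))
        coarse.append(coarse_row)
        fine.append(fine_row)
    return coarse, fine

def haar_merge(coarse, fine):
    """Invert haar_split: rebuild each 2x2 block from its four coefficients."""
    image = []
    for coarse_row, fine_row in zip(coarse, fine):
        top, bottom = [], []
        for a, (h, v, d) in zip(coarse_row, fine_row):
            top += [a + h + v + d, a - h + v - d]
            bottom += [a + h - v - d, a - h - v + d]
        image += [top, bottom]
    return image

def build_pyramid(image, levels):
    """Return the coarsest image and the fine components, finest level first."""
    fines = []
    for _ in range(levels):
        image, fine = haar_split(image)
        fines.append(fine)
    return image, fines

def reconstruct(coarsest, fines):
    image = coarsest
    for fine in reversed(fines):
        image = haar_merge(image, fine)
    return image

image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
coarsest, fines = build_pyramid(image, levels=2)
restored = reconstruct(coarsest, fines)  # equals the original 4x4 image
```

For the 4 x 4 example, the pyramid stores 1 coarse value, 3 fine values at the second level, and 12 at the first, which is exactly 16 numbers. This is the non-redundancy that distinguishes a Haar pyramid from a Laplacian pyramid and keeps the transform invertible.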

Haar Pyramid Generative Flow. Our flow architecture consists of multiple blocks $\{f^{(0)}, \ldots, f^{(n)}\}$, each responsible for generating the fine features $\tilde{x}_i$ for a different level of the Haar image pyramid conditioned on a coarse image $x_{i+1}$ from the next image in the pyramid; see Figure 1(a), red box. Note that each block consists of multiple invertible flow units, i.e., $f^{(i)} = f^{(i)}_L \circ \cdots \circ f^{(i)}_1$, and can be treated independently as a generative flow from Section 3.2. The generative process is defined as follows. First we generate the final downsampled image of the pyramid,

$$x_n = \big(f^{(n)}\big)^{-1}(z_n)$$

by sampling a latent vector $z_n$ that corresponds to the coarsest features and passing it through the first block. Then we recursively sample latent vectors $z_i$ corresponding to finer spatial features and generate the other images in the Haar image pyramid as follows:

$$\tilde{x}_i = \big(f^{(i)}\big)^{-1}(z_i;\, x_{i+1}), \qquad x_i = U\big(\Phi^{-1}[x_{i+1}, \tilde{x}_i]\big), \qquad i = n-1, \ldots, 0$$

where $x_0$ is the final full-resolution image. To perform conditioning on the coarse image $x_{i+1}$, we provide it as an additional input to both the prior distribution of $z_i$ and to the individual flow units in $f^{(i)}$. Computation of the log-likelihood within the image pyramid framework is straightforward, since the Haar wavelet transform is an invertible linear transformation with a block-diagonal Jacobian matrix that adds a constant factor to the log-determinant in Equation (4).

Conditioning on a Molecular Graph. To condition the generation of features by block $f^{(i)}$ on a molecular intervention $y$, we condition the distribution of latent variables $z_i$ on the output of a graph neural network. Specifically, we let the prior distribution of $z_i$ take $g(y)$ as input, where $g$ is a graph neural network described in Section 3.1; see Figure 1, green box.

3.4 Training Strategy: Maximizing Mutual Information using Contrastive Learning

The challenge of training a conditional flow model using log-likelihood is that it may not sufficiently leverage the shared information between the input image and the molecular intervention. Intuitively, the flow model can achieve a high log-likelihood by converting the input image distribution to a Gaussian distribution without using the condition. This is especially true for molecule-to-image synthesis because the effect of the molecular intervention on the cells is subtle in the image space.

To ensure that the conditional flow model extracts useful information from the molecular graph for generation, we propose a training strategy based on contrastive learning. As shown in Figure 1(b), during training, we use contrastive learning to maximize the mutual information between the latent variables $z = f(x)$ from the flow model and the molecular embedding $g(y)$ from the graph neural network. During generation, information flow is reversed through the flow model to generate an image that is tightly coupled to the conditioning molecular information.

The objective of contrastive learning is to learn embeddings of $x$ and $y$ that maximize their mutual information. Specifically, these embeddings should distinguish "matched" samples from the joint distribution $p(x, y)$ from "mismatched" samples from the product of the marginals $p(x)p(y)$. To obtain these embeddings, we train a critic function $h(x, y)$ to assign high values to matched samples and low values to mismatched samples by minimizing the following contrastive loss:

$$\mathcal{L}_{\text{contrast}} = -\mathbb{E}\left[ \log \frac{h(x_i, y_i)}{\sum_{j=1}^{N} h(x_i, y_j)} \right] \tag{7}$$

where the expectation is taken over batches of $N$ matched pairs $(x_i, y_i)$. In practice, we compute $h(x, y)$ by taking the cosine similarity of $f(x)$ and $g(y)$, where $f$ is the flow model and $g$ is the graph neural network that embeds the molecular structure graph $y$:

$$h(x, y) = \exp\left( \frac{1}{\tau} \cdot \frac{f(x)^\top g(y)}{\|f(x)\|\, \|g(y)\|} \right)$$

where $\tau$ is a temperature hyperparameter. Minimizing the contrastive loss in Equation (7) is equivalent to maximizing a lower bound on the mutual information between $f(x)$ and $g(y)$ and has been used in previous work for representation learning Oord et al. (2018). Our key insight is in leveraging contrastive learning in a conditional flow model to maximize the mutual information between the latent image variables $z$ and the molecular embedding $g(y)$, such that reversing information flow through $f$ generates images that share a high degree of information with the molecular graph $y$.
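A contrastive loss of this kind, with a cosine-similarity critic and a temperature, can be sketched as follows. This is a generic InfoNCE-style computation in the spirit of Oord et al. (2018), not the authors' implementation: the function names, the temperature of 0.1, and the assumption that image latents and molecular embeddings already share a dimension are all illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(image_latents, mol_embeddings, temperature=0.1):
    """Average negative log-probability of picking the matched molecule.

    Row i of image_latents is assumed to be matched with row i of
    mol_embeddings; every other row in the batch acts as a mismatch.
    """
    losses = []
    for i, z in enumerate(image_latents):
        logits = [cosine(z, g) / temperature for g in mol_embeddings]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

latents = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(latents, [[2.0, 0.0], [0.0, 3.0]])
shuffled = contrastive_loss(latents, [[0.0, 3.0], [2.0, 0.0]])
```

In the toy batch, the loss is near zero when each latent points in the same direction as its own molecule's embedding and close to 1 / temperature when the pairing is swapped, so minimizing it pushes matched pairs together and mismatched pairs apart.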

Figure 2: Examples of cell images generated by our method vs the baselines.

4 Experiments

Dataset. We perform our experiments on the Cell Painting dataset introduced by Bray et al. (2017, 2016) and preprocessed by Hofmarcher et al. (2019). The dataset consists of 284K cell images collected from 10.5K molecular interventions. We divide the dataset into a training set of 219K images corresponding to 8.5K molecular interventions, and hold out the remainder of the data for evaluation. The held-out data consists of images corresponding to each of the 8.5K molecules in the training set as well as images corresponding to 2K molecules that are not in the training set.

Implementation and Training Details. Our model for the molecule-to-image generation task consists of six flow modules that construct different levels of the Haar wavelet image pyramid, generating images from a resolution of 16 x 16 to 512 x 512. The lowest-resolution module consists of 64 flow units, and each of the other modules consists of 32 flow units. Each of the modules is trained to maximize the log-likelihood of the data (Equation 4). Additionally, the three flow modules that process low-resolution images (up to 64 x 64 resolution) are also trained to maximize the mutual information between the latent variables and the molecular features using a weighted contrastive loss. We train each flow module for approximately 50K iterations using Adam Kingma and Ba (2014), during which the highest-resolution block sees over 1M images and the lowest-resolution block sees over 10M images.

Robust Evaluation Metrics for Biological Image Generation.

For a molecule-to-image synthesis model to be useful to practitioners, it needs to generate image features that are meaningful from a biological standpoint. It has been shown that machine learning methods can discriminate between microscopy images using features that are irrelevant to the target content Shamir (2011); Lu et al. (2019). Therefore, in addition to more conventional vision metrics, we propose a new set of evaluation metrics based on CellProfiler cell morphology features McQuin et al. (2018) that are more robust, interpretable, and relevant to practitioners Rohban et al. (2017). We specifically consider the following morphological features:

  • Coverage. The total area of the regions covered by segmented cells.

  • Cell/Nuclei Count. The total number of nuclei/cells found in the image.

  • Cell Size. The average size of the segmented cells found in the image.

  • Zernike Shape. A set of 30 features that describe the shape of cells using a basis of Zernike polynomials (order 0 to order 9).

  • Expression Level. A set of five features, one per imaging channel, that measure the level of signal from the stained cellular compartments: DNA, mitochondria, endoplasmic reticulum, Golgi, cytoplasmic RNA, nucleoli, actin, and plasma membrane.

We extract these features from a subset of images and compute the Spearman correlation between the features of real and generated images corresponding to the same molecule (see Supplementary Material for details). Due to space constraints, we show the mean of the correlation coefficients for the 30 Zernike shape features and the five expression level features.
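The per-feature comparison between real and generated images amounts to a rank correlation across molecules. A minimal sketch of the Spearman coefficient, with tied values sharing their mean rank, is shown below; the feature values are made-up placeholders rather than data from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical cell counts for five molecules: real vs. generated images.
real_cell_counts = [12.0, 30.0, 45.0, 8.0, 22.0]
generated_cell_counts = [15.0, 28.0, 50.0, 5.0, 40.0]
rho = spearman(real_cell_counts, generated_cell_counts)
```

Using ranks rather than raw values makes the score insensitive to any monotone rescaling of a feature, so a generator is rewarded for ordering molecules correctly by, say, cell count even if its absolute counts are offset.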

Other Evaluation Metrics. In addition to these specialized metrics for biological images, we also evaluate our model using the following metrics that are more conventional for image generation tasks:

  • Sliced Wasserstein Distance (SWD). To assess the visual quality of the generated images, we consider the statistical similarity of image patches taken from multiple levels of a Laplacian pyramid representation of generated and real images, as described in Karras et al. (2018). This metric compares the unconditional distributions of patches between generated and real images, but it does not take into account the correspondence between the generated image and the molecular information.

  • Correspondence Classification Accuracy (Corr). To assess the correspondence between the generated images and the molecular information, we compute the accuracy of a pretrained correspondence classifier on the generated images. The classifier consists of a visual network and GNN that are trained on a binary classification task: detect whether the input cell image matches the input molecular intervention (positive sample) or whether they are mismatched (negative sample). The classifier detects correctly matched pairs of images and molecules with an accuracy of 0.65 on real data (upper bound for our evaluation of generated images).
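The slicing step behind the SWD metric above can be illustrated on plain feature vectors. The sketch below covers only the projection-and-sort core; the Laplacian pyramid patch extraction and normalization of Karras et al. (2018) are omitted, and the function name, the number of projections, and the synthetic Gaussian data are assumptions made for illustration.

```python
import math
import random

def sliced_wasserstein(set_a, set_b, n_projections=64, seed=0):
    """Approximate sliced Wasserstein-1 distance between equally sized sets."""
    assert len(set_a) == len(set_b)
    rng = random.Random(seed)
    dim = len(set_a[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction and project both sets onto it.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        proj_a = sorted(sum(d * x for d, x in zip(direction, v)) for v in set_a)
        proj_b = sorted(sum(d * x for d, x in zip(direction, v)) for v in set_b)
        # In 1D, the Wasserstein-1 distance pairs up the sorted samples.
        total += sum(abs(p - q) for p, q in zip(proj_a, proj_b)) / len(proj_a)
    return total / n_projections

rng = random.Random(1)
real = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
similar = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
shifted = [[rng.gauss(2.0, 1.0) for _ in range(8)] for _ in range(200)]
swd_similar = sliced_wasserstein(real, similar)
swd_shifted = sliced_wasserstein(real, shifted)
```

The distance is small for two samples of the same distribution and grows when one set is shifted. Note that nothing in it depends on which molecule produced which image, which is why a separate correspondence metric is needed.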

| Approach | Coverage | Cell Count | Cell Size | Zernike Shape | Exp. Level | SWD | Corr |
|---|---|---|---|---|---|---|---|
| CGAN Gulrajani et al. (2017) | 7.0 | 4.8 | -2.9 | -3.9 | 7.4 | 5.65 | 56.6 |
| CGlow Kingma and Dhariwal (2018) | -1.3 | 3.8 | 5.8 | 2.2 | 6.6 | 5.01 | 55.5 |
| w/o pyramid | 28.5 | 36.1 | 17.5 | 8.7 | 26.7 | 4.96 | 60.0 |
| w/o contrastive loss | 7.7 | 13.4 | 12.0 | 6.8 | 5.3 | 3.68 | 58.8 |
| Mol2Image (ours) | 44.6 | 54.4 | 27.5 | 15.8 | 37.3 | 4.63 | 63.2 |

Table 1: Evaluation of Mol2Image (our model) vs. the baselines on images generated from molecules from the training set. "Coverage", "Cell Count", "Cell Size", "Zernike Shape", and "Exp. Level" measure Spearman correlation coefficients (x100) between features from a subset of real and generated images; higher is better. "Corr" represents the correspondence classification accuracy (%) of a pretrained model; higher is better, and real images (ground truth) give the upper bound. "SWD" is the sliced Wasserstein distance metric from Karras et al. (2018); lower is better. See text for details.

| Approach | Coverage | Cell Count | Cell Size | Zernike Shape | Exp. Level | SWD | Corr |
|---|---|---|---|---|---|---|---|
| CGAN Gulrajani et al. (2017) | 6.4 | 1.9 | -1.5 | -1.0 | 9.2 | 5.60 | 56.1 |
| CGlow Kingma and Dhariwal (2018) | 3.1 | -3.7 | -3.0 | -3.1 | 3.7 | 5.40 | 54.5 |
| w/o pyramid | 9.2 | 1.7 | 12.9 | 6.1 | 8.6 | 4.20 | 59.1 |
| w/o contrastive loss | 5.0 | 9.1 | 6.1 | 2.9 | 9.2 | 3.41 | 55.7 |
| Mol2Image (ours) | 15.8 | 19.7 | 11.0 | 4.9 | 13.4 | 4.27 | 62.6 |

Table 2: Same as Table 1, but evaluated on images generated from held-out molecules. Real images (ground truth) again give the upper bound on the correspondence classification accuracy (Corr) metric.

Baselines and Ablations. Since molecule-to-image synthesis is a novel task, we develop our own baselines based on well-established generative models and perform ablations to determine the benefit of our approach. Since not all of the methods are capable of generating high-quality images at full resolution, we compare all of the model results at 64 x 64 spatial resolution.

  • Baseline: Conditional GAN with Graph Neural Network (CGAN). We train a CGAN in which a generator network learns to generate images conditioned on the corresponding molecule. Both the generator and discriminator are conditioned on the molecular representation learned by the same GNN architecture used in our model (Section 3.1). We use a Wasserstein GAN trained with a gradient penalty Gulrajani et al. (2017). This variant is able to consistently produce qualitatively realistic images in the unconditional setting, in agreement with previous generative models for cell image data Osokin et al. (2017).

  • Baseline: Conditional Glow with Graph Neural Network (CGlow). Since our model is an improved flow-based model for conditional generation, we develop a baseline approach based on existing work that is a straightforward extension of Glow to the conditional setting Kingma and Dhariwal (2018). Specifically, this baseline model conditions the distribution of latent variables introduced at every level of the multi-scale architecture on the output of the graph neural network and optimizes the conditional log-likelihood with respect to the model parameters. Alternatively, this model can be seen as an ablation of our model without pyramid architecture or contrastive training.

  • Ablation: Mol2Image without Pyramid Architecture (w/o pyramid). We train our model without the framework of the image pyramid for separately generating features at different scales. Instead, we directly generate the full resolution image.

  • Ablation: Mol2Image without Contrastive Learning (w/o contrastive loss). We train our model without using contrastive loss to maximize the mutual information between the latent variables of the image and the embeddings extracted by the graph neural network.

Results. Tables 1 and 2 show the results of our model in comparison to the baselines. Our conditional flow-based generative model, which is trained with the proposed pyramid architecture and the contrastive loss, outperforms the baselines in generating cell images that reflect the effects of the molecular interventions. Table 1 shows that our model performs well on generating cell images conditioned on molecules that were observed during training. Table 2 shows that our model generalizes better than the baselines to molecules that were held-out from the training set.

Effect of Contrastive Loss. Our training strategy, which uses contrastive loss to maximize the mutual information between the image latent variables and the molecular embedding, is essential for effective generation of images conditioned on the molecular intervention. In particular, there is much lower correspondence between the images and the molecular intervention when contrastive learning is omitted. This result holds both when we use the image pyramid framework (i.e., compare "Mol2Image" with "w/o contrastive loss") and when we directly generate 64 x 64 images using the standard multi-scale architecture (i.e., compare "w/o pyramid" with "CGlow"). This demonstrates that contrastive learning can provide a strong signal for learning the relation between the image and the conditioning information for generative modeling, in the absence of categorical labels that can be used in a supervised framework. On the other hand, contrastive loss does not appear to improve the unconditional quality of generated images (based on SWD).

Effect of Pyramid Framework. We proposed the pyramid structure to generate image features at different spatial resolutions, which is important to disentangle higher level features (e.g., cell distribution) from lower level features (e.g., cell shape), and to allow our model to scale to high-resolution cell images (512 x 512). Interestingly, we find that the image pyramid framework also improves the conditional generation of 64 x 64 images compared to the baseline model that directly generates images of this size (i.e., compare "Mol2Image" to "w/o pyramid"). We hypothesize that this is because it is more efficient and easier to learn the relation between images and conditions when starting with the low-resolution images at the bottom of the image pyramid. Consistent with our observations, previous works have reported that training GANs starting from lower-resolution images Karras et al. (2018) or using an image pyramid Denton et al. (2015) is more effective than training directly on full-resolution images.

Qualitative Examples. Figure 2 shows a qualitative comparison between the baselines (CGAN, CGlow) and our method on generating images conditioned on molecular structure. The generated images from our method (Figure 2, row 3) more closely reflect the real effect of the intervention (Figure 2, row 4) compared to other methods, both in terms of cell morphology and in terms of channel intensities (representing expression of different cellular components). More qualitative examples (including full-resolution images) are provided in the Supplementary Material.

| Molecular Embedding | Random | Morgan Fingerprint | w/o Contrastive Loss | Mol2Image (ours) |
|---|---|---|---|---|
| Mean AUC | 0.569 | 0.645 | 0.675 | 0.810 |
| Mean AUC (Held-Out) | 0.578 | 0.665 | 0.675 | 0.683 |

Table 3: Evaluation of molecular embeddings on predicting morphological labels. Higher AUC is better. "Random" refers to embeddings from a randomly initialized GNN. "Held-Out" refers to molecules held out from the training set. For reference, a fully-supervised model (in which the parameters of the graph neural network are trained) achieves an AUC of 0.702 on held-out molecules.

Analysis of Molecular Embeddings. Since our method performs well at generating cell images conditioned on molecular interventions, we hypothesize that the GNN learns a molecular representation that reflects the morphological features of the cell image. To determine whether the molecular embeddings are linearly separable based on the morphology they induce in treated cells, we train a linear classifier to predict a subset of 14 features curated from the morphological analysis of Bray et al. (2017) (see the Supplementary Material). For comparison, we consider embeddings from a randomly initialized GNN, Morgan/circular fingerprints Rogers and Hahn (2010), and an ablation of our model trained without contrastive loss. Table 3 shows the average AUC of the various embeddings on this task. The results suggest that our method learns molecular embeddings that are linearly separable based on morphological properties of the treated cells (Table 3, Row 1), and that the learned embeddings can also generalize to previously unseen molecules (Table 3, Row 2), although the margin over the ablation without contrastive loss is much smaller on held-out molecules (0.683 versus 0.675).
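The linear-probe evaluation described above can be sketched as follows: fit a logistic-regression classifier on frozen embeddings for each binary morphological label, score held-out molecules, and average the per-label AUC. The toy embeddings, the two label names, and the hyperparameters are hypothetical; the paper uses 14 curated features, and its exact classifier and training details are not specified here.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def linear_score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_linear_probe(X, y, lr=0.5, steps=500):
    """Logistic regression on fixed embeddings via batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-linear_score(w, b, x)))
            for k in range(dim):
                grad_w[k] += (p - t) * x[k]
            grad_b += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Hypothetical 2-D embeddings and two hypothetical binary labels.
train_X = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.3]]
train_Y = {"feature_a": [0, 0, 1, 1], "feature_b": [1, 1, 0, 0]}
held_X = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.7, 0.4]]
held_Y = {"feature_a": [0, 1, 0, 1], "feature_b": [1, 0, 1, 0]}

aucs = []
for name, y_train in train_Y.items():
    w, b = fit_linear_probe(train_X, y_train)
    held_scores = [linear_score(w, b, x) for x in held_X]
    aucs.append(roc_auc(held_Y[name], held_scores))
mean_auc = sum(aucs) / len(aucs)
print(mean_auc)
```

Because the embedding is frozen and only a linear layer is trained, a high mean AUC indicates that the morphological labels are linearly recoverable from the embedding itself rather than learned by the probe.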

5 Discussion

We have developed a new multi-scale flow-based architecture and training strategy for molecule-to-image synthesis and demonstrated the benefits of our approach on new evaluation metrics tailored to biological cell image generation. Our work represents a first step towards image-based virtual screening of chemicals and lays the groundwork for studying the shared information in molecular structures and perturbed cell morphology. A promising avenue for future work is integrating side information (e.g., known chemical properties, drug dosage) to impose constraints on the molecular embedding space and improve generalization to previously unseen molecules. Furthermore, even though we have focused on molecule-to-image synthesis in this paper, our contributions to flow-based models can potentially be applied in other contexts, e.g., text-to-image synthesis Reed et al. (2016).


Acknowledgments

Karren Dai Yang was supported by an NSF Graduate Research Fellowship and ONR (N00014-18-1-2765). Alex X. Lu was funded by a pre-doctoral award from the Natural Sciences and Engineering Research Council. Regina Barzilay and Tommi Jaakkola were partially supported by the MLPDS Consortium and the DARPA AMD program. Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), IBM, and a Simons Investigator Award.


References

  • L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392. Cited by: §2, §2.
  • M. Bray, S. M. Gustafsdottir, M. H. Rohban, S. Singh, V. Ljosa, K. L. Sokolnicki, J. A. Bittker, N. E. Bodycombe, V. Dančík, T. P. Hasaka, et al. (2017) A dataset of images and morphological profiles of 30 000 small-molecule treatments using the cell painting assay. Gigascience 6 (12), pp. giw014. Cited by: §4, §4.
  • M. Bray, S. Singh, H. Han, C. T. Davis, B. Borgeson, C. Hartland, M. Kost-Alimova, S. M. Gustafsdottir, C. C. Gibson, and A. E. Carpenter (2016) Cell painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nature protocols 11 (9), pp. 1757. Cited by: §4.
  • M. Breinig, F. A. Klein, W. Huber, and M. Boutros (2015) A chemical–genetic interaction map of small molecules using high-throughput imaging in cancer cells. Molecular systems biology 11 (12). Cited by: §1.
  • J. C. Caicedo, C. McQuin, A. Goodman, S. Singh, and A. E. Carpenter (2018) Weakly supervised learning of single-cell feature embeddings. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9309–9318. Cited by: §2.
  • J. C. Caicedo, S. Singh, and A. E. Carpenter (2016) Applications in image-based profiling of perturbations. Current opinion in biotechnology 39, pp. 134–142. Cited by: §1.
  • H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711. Cited by: §2, §3.1.
  • E. L. Denton, S. Chintala, R. Fergus, et al. (2015) Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pp. 1486–1494. Cited by: §1, §2, §4.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §1, §2.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §1, §2, §3.3.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.
  • U. S. Eggert (2013) The why and how of phenotypic small-molecule screens. Nature chemical biology 9 (4), pp. 206. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. Cited by: §2.
  • P. Goldsborough, N. Pawlowski, J. C. Caicedo, S. Singh, and A. Carpenter (2017) CytoGAN: generative modeling of cell images. bioRxiv, pp. 227645. Cited by: §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • M. Gori, G. Monfardini, and F. Scarselli (2005) A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, Vol. 2, pp. 729–734. Cited by: §2.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5767–5777. Cited by: 1st item, Table 1, Table 2.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: §2.
  • J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel (2019) Flow++: improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275. Cited by: §2.
  • M. Hofmarcher, E. Rumetshofer, D. Clevert, S. Hochreiter, and G. Klambauer (2019) Accurate prediction of biological assays with high-throughput microscopy images and convolutional networks. Journal of chemical information and modeling 59 (3), pp. 1163–1171. Cited by: §4.
  • L. Hou, A. Agarwal, D. Samaras, T. M. Kurc, R. R. Gupta, and J. H. Saltz (2017) Unsupervised histopathology image synthesis. arXiv preprint arXiv:1712.05021. Cited by: §2.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. ICLR. Cited by: 1st item, Table 1, §4.
  • S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design 30 (8), pp. 595–608. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §4.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §1, §2, §3.2, §3.3, 2nd item, Table 1, Table 2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations. Cited by: §2.
  • R. Kondo, K. Kawano, S. Koide, and T. Kutsuna (2019) Flow-based image-to-image translation with feature disentanglement. In Advances in Neural Information Processing Systems, pp. 4170–4180. Cited by: §2, §2.
  • T. Lei, W. Jin, R. Barzilay, and T. Jaakkola (2017) Deriving neural architectures from sequence and graph kernels. International Conference on Machine Learning. Cited by: §2.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.
  • R. Liu, Y. Liu, X. Gong, X. Wang, and H. Li (2019) Conditional adversarial generative flow for controllable image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7992–8001. Cited by: §2.
  • V. Ljosa, P. D. Caie, R. Ter Horst, K. L. Sokolnicki, E. L. Jenkins, S. Daya, M. E. Roberts, T. R. Jones, S. Singh, A. Genovesio, et al. (2013) Comparison of methods for image-based profiling of cellular morphological responses to small-molecule treatment. Journal of biomolecular screening 18 (10), pp. 1321–1329. Cited by: §1.
  • L. Loo, H. Lin, R. J. Steininger III, Y. Wang, L. F. Wu, and S. J. Altschuler (2009) An approach for extensibly profiling the molecular states of cellular subpopulations. Nature methods 6 (10), pp. 759. Cited by: §1.
  • C. Louizos and M. Welling (2017) Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2218–2227. Cited by: §2.
  • A. Lu, A. Lu, W. Schormann, M. Ghassemi, D. Andrews, and A. Moses (2019) The cells out of sample (coos) dataset and benchmarks for measuring out-of-sample generalization of image classifiers. In Advances in Neural Information Processing Systems, pp. 1852–1860. Cited by: §4.
  • Y. Lu and B. Huang (2019) Structured output learning with conditional generative flows. arXiv preprint arXiv:1905.13288. Cited by: §2, §2.
  • F. Mahmood, R. Chen, and N. J. Durr (2018) Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE transactions on medical imaging 37 (12), pp. 2572–2581. Cited by: §2.
  • M. Mattiazzi Usaj, N. Sahin, H. Friesen, C. Pons, M. Usaj, M. P. D. Masinas, E. Shuteriqi, A. Shkurin, P. Aloy, Q. Morris, et al. (2020) Systematic genetics and single-cell imaging reveal widespread morphological pleiotropy and cell-to-cell variability. Molecular systems biology 16 (2), pp. e9243. Cited by: §2.
  • C. McQuin, A. Goodman, V. Chernyshev, L. Kamentsky, B. A. Cimini, K. W. Karhohs, M. Doan, L. Ding, S. M. Rafelski, D. Thirstrup, et al. (2018) CellProfiler 3.0: next-generation image processing for biology. PLoS biology 16 (7). Cited by: §4.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023. Cited by: §2.
  • L. H. Okagaki, A. K. Strain, J. N. Nielsen, C. Charlier, N. J. Baltes, F. Chrétien, J. Heitman, F. Dromer, and K. Nielsen (2010) Cryptococcal cell morphology affects host cell interactions and pathogenicity. PLoS pathogens 6 (6). Cited by: §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.4.
  • A. Osokin, A. Chessel, R. E. Carazo Salas, and F. Vaggi (2017) GANs for biological image synthesis. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 2233–2242. Cited by: §2, 1st item.
  • Z. E. Perlman, M. D. Slack, Y. Feng, T. J. Mitchison, L. F. Wu, and S. J. Altschuler (2004) Multidimensional drug profiling by automated microscopy. Science 306 (5699), pp. 1194–1198. Cited by: §1.
  • A. S. Reddy, S. P. Pati, P. P. Kumar, H. Pradeep, and G. N. Sastry (2007) Virtual screening in drug discovery-a computational perspective. Current Protein and Peptide Science 8 (4), pp. 329–351. Cited by: §1.
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §1, §5.
  • D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §2.
  • D. Rogers and M. Hahn (2010) Extended-connectivity fingerprints. Journal of chemical information and modeling 50 (5), pp. 742–754. Cited by: §2, §4.
  • M. H. Rohban, S. Singh, X. Wu, J. B. Berthet, M. Bray, Y. Shrestha, X. Varelas, J. S. Boehm, and A. E. Carpenter (2017) Systematic morphological profiling of human gene and allele function via cell painting. Elife 6, pp. e24060. Cited by: §4.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.
  • L. Shamir (2011) Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis. Journal of microscopy 243 (3), pp. 284–292. Cited by: §4.
  • B. K. Shoichet (2004) Virtual screening of chemical libraries. Nature 432 (7019), pp. 862–865. Cited by: §1.
  • H. Sun, R. Mehta, H. H. Zhou, Z. Huang, S. C. Johnson, V. Prabhakaran, and V. Singh (2019) DUAL-glow: conditional flow-based generative model for modality transfer. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10611–10620. Cited by: §1, §2, §2.
  • D. C. Swinney and J. Anthony (2011) How were new medicines discovered?. Nature reviews Drug discovery 10 (7), pp. 507–519. Cited by: §1.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.
  • W. P. Walters, M. T. Stahl, and M. A. Murcko (1998) Virtual screening—an overview. Drug discovery today 3 (4), pp. 160–178. Cited by: §1.
  • C. Winkler, D. Worrall, E. Hoogeboom, and M. Welling (2019) Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042. Cited by: §2, §2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §2.
  • K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, A. Guzman-Perez, T. Hopper, B. Kelley, M. Mathea, et al. (2019) Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling 59 (8), pp. 3370–3388. Cited by: §2, §2, §3.1.
  • X. Yi, E. Walia, and P. Babyn (2019) Generative adversarial network in medical imaging: a review. Medical image analysis, pp. 101552. Cited by: §2.