1 Introduction
Machine learning models operating on medical images often aim to identify morphological features that distinguish unhealthy or stressed biological compositions from those that are normal eulenberg2017reconstructing ; esteva2017dermatologist ; kraus2017automated . Many times the morphological features change gradually as the composition deviates from normality (e.g., change of cell shape as concentration of applied drug increases). In such cases, in addition to modeling the data distribution, it would also be useful to explicitly construct features that capture this continuous change and influence the generative process, so that the resulting model has greater interpretability and fidelity. Specifically, provided side information responsible for certain morphological changes, we would like to train a generative model with interpretable latent representation that disentangles this side information, so that the model can generate the corresponding continuum.
Deep generative models have shown great success in modeling medical data johnson2017generative ; mcdermott2018semi . Typically, a deep generative model learns to transform a prior into the data distribution . While evaluating model likelihood is generally intractable, several good approximations have been proposed, such as optimizing the evidence lower bound (ELBO) kingma2013auto or parameterizing an adversarial divergence goodfellow2014generative . An encoderdecoder architecture bengio2009learning is useful when inference is required (e.g., given an cell image, encode its morphology into latent features with lower dimensions). In a regular autoencoder, minimizing the reconstruction cost alone may not result in good latent representations makhzani2015adversarial . Therefore, regularizing the encoded latent features is crucial in applying encoderdecoder architecture to generative modeling. The choice of regularizer steers the diversity of interpretations of the model such as optimizing the ELBO on the data distribution kingma2013auto , or minimizing the primal form of Wasserstein distance tolstikhin2017wasserstein .
Side information refers to additional feature that is not directly modeled but may be relevant to the primary task. In fewshot learning or transfer learning, side information is useful in learning robust and generalizable representations
tsai2017learning . For instance, tsai2017improving uses word embedding vectors as side information and applies the HSIC gretton2005measuringwith learned kernels to enforce its dependency with the learned representation. When dealing with data from source domain and target domain, a classifier can be trained between source and target to encourage features to be domaininvariant
ganin2016domain . On the other hand, regularization on the individual latent features in encoderdecoder models can also lead to better performance and interpretability. In chen2018isolating , the total correlation between latent axes is penalized in order to disentangle the latent features, whereas in lopez2018information this regularizer is replace by dHSIC.In this work, we employ a nonparametric independence measure (HSIC) to integrate side information into the latent representation of a generative model trained on biomedical data. Specifically, given side information that correlates to certain continuous morphological changes of the biological composition, we disentangle the latent features by incorporating this information into one axis, and forcing the remaining axes to be independent to the information. In contrast to classifierbased regularization, this approach does not require training an additional model and thus is more stable and dataefficient. We verify our method on two different biological datasets: lung cancer images acquired by CT scans armato2011lung , and singlecell leukemia images acquired by timestretch microscopy kobayashi2017scirep . In both experiments, our generative model successfully models continuous morphological changes (side informationdependent) and produces interpretable latent representation that captures the trend.
2 Methods
Wasserstein Autoencoder
In this work the generative model is a Wasserstein Autoencoder tolstikhin2017wasserstein . Given the data distribution and generated distribution , consider the reparameterized primal form of optimal transport villani2008optimal ; tolstikhin2017wasserstein :
(1) 
Relaxing , this objective can be written as the minimization of the following loss:
(2) 
in which is a divergence measure that matches and , which we set to be the maximum mean discrepancy (MMD).
Kernelbased Regularizers
We employ the maximum mean discrepancy (MMD) gretton2012kernel
to match the prior and aggregated posterior in WAE. The MMD is the RKHS distance between mean embeddings, and its unbiased empirical estimate can be computed in
:(3) 
With a divergence measure, we can also define an information measure as the discrepancy between the joint distribution and the product of marginal distributions. The HilbertSchmitt independence criterion (HSIC)
gretton2005measuringis defined as the squared MMD between the joint distribution and the product of marginals of two random variables, and its biased empirical estimate is given by:
(4) 
where , is the Gram matrix of with , and is the Gram matrix of with . Since the time complexity of HSIC is also quadratic, this additional regularizer would increase training time only marginally.
HSICregularized WAE
Given side information we want to incorporate into the latent representation, we can add the following regularizer to the WAE objective to encourage dependence or independence between the aggregated posterior and :
(5) 
Moreover, we can disentangle the correlation between the side information and certain axis by increasing dependence between and one axis and decreasing the dependence between and the remaining axes :
(6) 
In this case, information of would concentrate at ; therefore, by varying this axis we can generate a continuum of morphological changes that correspond to change in .
3 Data and Experiments
LIDCIDRI dataset
The Lung Image Data Consotium (LIDC) armato2011lung comprises of thoracic scans from 1018 patient cases with 2670 images. Each sample includes the coordinates contouring the susceptible nodule in the CT scan and radiologists’ assessment of likelihood of malignancy ranged from 1 to 5. Nodules were extracted and the final image was obtained by taking the union of all nodule. The dimension of input image was 48 x 48. Data was augmented by rotation and reflection.
K562 Cell Image Dataset
A leukemia cell line K562 was divided and incubated with 10fold serial dilutions of adriamycin ranging from 0.5 to 500 nM for 24 hours, followed by image acquisition with optofluidic timestretch microscope goda2009serial lei2016optical . Approximately 10,000 singlecell images were acquired for each concentration. Each image was normalized to 0mean, downsampled to 96 x 96, and labeled with the treated drug concentration. It has been reported that druginduced morphological changes of K562 cells can be captured by the microscopy images kobayashi2017scirep .
Experimental Details
We trained the HSICregularized WAEs on both datasets, with the side information being the malignancy score in LIDCIDRI dataset, and ADM concentration in K562 dataset. We set and as a factorized unit Gaussian . For MMD we used the inversemultiquadratic (IMQ) kernel , and for HSIC we used the Gaussian RBF kernel with bandwidth selected via the median trick. We used Adam kingma2014adam for optimization. Implementation details are included in the appendix.
4 Results
HSIC disentangles latent features with respect to side information
Training loss for the LIDCIDRI dataset is shown in Figure 1. It can be observed that in addition to minimizing the WAE loss, training also pushes the HSIC() towards 0, indicating that these axes contain little to no information of the label, and at the same time consistently increases HSIC(). To further verify that this dependency is captured by the model, we generated images from random (see Appendix for generated images) and found their 3 nearest neighbors in the test data; we then regressed the malignancy score of these nearest neighbors against the dependent axis . The regression plot in Figure 1 suggests a strong positive increasing trend of malignancy score, thus indicating that trend in does indeed match the increasing trend of malignancy score.
Latent representation and generated samples are consistent with side information
For the K562 dataset, we encode the test images and visualize the latent space via a scatter plot, in which the xaxis is the dependent axis , and the yaxis is the 1st principle component of the independent axes . Consistent with our expectation, Figure 2 shows that concentration of the encoded cell images vary dramatically along , but not in the other axes. This observation is further supported by the kernelfitted densities for each class: exhibits reasonable separation between different drug concentrations; in contrast, different classes are almost indistinguishable from . Random samples are also generated, and it can be observed that cells seem to become larger in size as the concentration of adriamycin increases. This finding agrees with the cellular mechanism: adriamycin can arrest cells in G2/M phase just before mitosis, and thus the drugeaffected cells tends to be larger in volume giuseppe1989adriamycin . Meanwhile, morphological changes other than size change are also present in the manifold, suggesting unelucidated features to be investigated in further studies.
5 Conclusion
In this work we proposed a regularized generative model that constructs interpretable latent features and models continuous morphological change that corresponds to the provided side information. We applied our model to two distinct biomedical datasets with different clinical significance and validated its effectiveness in incorporating and disentangling the side information in the latent representation as well as generation, which enables modeling of a continuous spectrum of morphological changes.
References

[1]
Philipp Eulenberg, Niklas Köhler, Thomas Blasi, Andrew Filby, Anne E
Carpenter, Paul Rees, Fabian J Theis, and F Alexander Wolf.
Reconstructing cell cycle and disease progression using deep learning.
Nature communications, 8(1):463, 2017. 
[2]
Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter,
Helen M Blau, and Sebastian Thrun.
Dermatologistlevel classification of skin cancer with deep neural networks.
Nature, 542(7639):115, 2017.  [3] Oren Z Kraus, Ben T Grys, Jimmy Ba, Yolanda Chong, Brendan J Frey, Charles Boone, and Brenda J Andrews. Automated analysis of highcontent microscopy data with deep learning. Molecular systems biology, 13(4):924, 2017.
 [4] Gregory R Johnson, Rory M DonovanMaiye, and Mary M Maleckar. Generative modeling with conditional autoencoders: Building an integrated cell. arXiv preprint arXiv:1705.00092, 2017.
 [5] Matthew BA McDermott, Tom Yan, Tristan Naumann, Nathan Hunt, Harini Suresh, Peter Szolovits, and Marzyeh Ghassemi. Semisupervised biomedical translation with cycle wasserstein regression gans. 2018.
 [6] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [7] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [8] Yoshua Bengio et al. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
 [9] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [10] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 [11] YaoHung Hubert Tsai, LiangKang Huang, and Ruslan Salakhutdinov. Learning robust visualsemantic embeddings. 2017.
 [12] YaoHung Hubert Tsai and Ruslan Salakhutdinov. Improving oneshot learning through fusing side information. arXiv preprint arXiv:1710.08347, 2017.
 [13] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbertschmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
 [14] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domainadversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 [15] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 [16] Romain Lopez, Jeffrey Regier, Nir Yosef, and Michael I Jordan. Information constraints on autoencoding variational bayes. arXiv preprint arXiv:1805.08672, 2018.
 [17] Bidaut L McNittGray MF Meyer CR Reeves AP Zhao B Aberle DR Henschke CI Hoffman EA Kazerooni EA MacMahon H Van Beeke EJ Yankelevitz D Biancardi AM Bland PH Brown MS Engelmann RM Laderach GE Max D Pais RC Qing DP Roberts RY Smith AR Starkey A Batrah P Caligiuri P Farooqi A Gladish GW Jude CM Munden RF Petkovska I Quint LE Schwartz LH Sundaram B Dodd LE Fenimore C Gur D Petrick N Freymann J Kirby J Hughes B Casteele AV Gupte S Sallamm M Heath MD Kuhn MH Dharaiya E Burns R Fryd DS Salganicoff M Anand V Shreter U Vastagh S Croft BY Armato, McLennan G. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. page 38(2):915–931, 2011.
 [18] Hirofumi Kobayashi, Cheng Lei, Yi Wu, Ailin Mao, Yiyue Jiang, Baoshan Guo, Yasuyuki Ozeki, and Keisuke Goda. Labelfree detection of cellular drug responses by highthroughput brightfield imaging and machine learning. Scientific Reports, 7(1):12454, 2017.
 [19] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 [20] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel twosample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 [21] K Goda, KK Tsia, and B Jalali. Serial timeencoded amplified imaging for realtime observation of fast dynamic phenomena. Nature, 458(7242):1145, 2009.
 [22] Cheng Lei, Baoshan Guo, Zhenzhou Cheng, and Keisuke Goda. Optical timestretch imaging: Principles and applications. Applied Physics Reviews, 3(1):011102, 2016.
 [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [24] Toffoli Giuseppe, Viel Alessandra, Bevilacqua Carla, Maestro Roberta, Tumiotto Loretta, and Boiocchi Mauro. In k562 leukemia cells treated with doxorubicin and hemin, a decrease in cmyc mrna expression correlates with loss of selfrenewal capability but not with erythroid differentiation. Leukemia Research, 13(4):279–287, 1989.
6 Appendix
6.1 Further details on experiments
Model hyperparameters for LIDCIDRI dataset
We used batches of size 512 and trained the models for 18000 steps, approximately 5 epochs. We used
, , and . We set for Adam optimizer.Encoder architecture:
Decoder architecture:
Where stands for the convolutional layer with k filters,
for the fractional strided convolution layer with k filters,
for the batch normalization,
for the leaky rectified linear units and
for the fully connected layer. All the convolutional layers in the encoder and decoder used vertical and horizontal strides of 2 and SAME padding.
Model hyperparameters for K562 Dataset
For training HSICregularized WAE on K562 dataset, we used batches of size 200, and trained the models for 8000 steps, approximately 50 epochs. We used , , and . We set for Adam optimizer.
Encoder architecture:
Decoder architecture: