Machine learning models operating on medical images often aim to identify morphological features that distinguish unhealthy or stressed biological compositions from normal ones eulenberg2017reconstructing ; esteva2017dermatologist ; kraus2017automated . Often, these morphological features change gradually as the composition deviates from normality (e.g., changes in cell shape as the concentration of an applied drug increases). In such cases, in addition to modeling the data distribution, it would also be useful to explicitly construct features that capture this continuous change and influence the generative process, so that the resulting model has greater interpretability and fidelity. Specifically, provided side information responsible for certain morphological changes, we would like to train a generative model whose interpretable latent representation disentangles this side information, so that the model can generate the corresponding continuum.
Deep generative models have shown great success in modeling medical data johnson2017generative ; mcdermott2018semi . Typically, a deep generative model learns to transform a prior distribution $P_Z$ into the data distribution $P_X$. While evaluating the model likelihood is generally intractable, several good approximations have been proposed, such as optimizing the evidence lower bound (ELBO) kingma2013auto or parameterizing an adversarial divergence goodfellow2014generative . An encoder-decoder architecture bengio2009learning is useful when inference is required (e.g., given a cell image, encode its morphology into lower-dimensional latent features). In a regular auto-encoder, minimizing the reconstruction cost alone may not result in good latent representations makhzani2015adversarial . Therefore, regularizing the encoded latent features is crucial when applying an encoder-decoder architecture to generative modeling. The choice of regularizer determines the interpretation of the model, such as optimizing the ELBO on the data distribution kingma2013auto or minimizing the primal form of the Wasserstein distance tolstikhin2017wasserstein .
Side information refers to additional features that are not directly modeled but may be relevant to the primary task. In few-shot learning or transfer learning, side information is useful in learning robust and generalizable representations tsai2017learning . For instance, tsai2017improving uses word embedding vectors as side information and applies the HSIC gretton2005measuring with learned kernels to enforce their dependency with the learned representation. When dealing with data from a source domain and a target domain, a classifier can be trained between source and target to encourage features to be domain-invariant ganin2016domain . On the other hand, regularization on the individual latent features in encoder-decoder models can also lead to better performance and interpretability. In chen2018isolating , the total correlation between latent axes is penalized in order to disentangle the latent features, whereas in lopez2018information this regularizer is replaced by the dHSIC.
In this work, we employ a non-parametric independence measure, the Hilbert-Schmidt independence criterion (HSIC), to integrate side information into the latent representation of a generative model trained on biomedical data. Specifically, given side information that correlates with certain continuous morphological changes of the biological composition, we disentangle the latent features by incorporating this information into one axis and forcing the remaining axes to be independent of it. In contrast to classifier-based regularization, this approach does not require training an additional model and thus is more stable and data-efficient. We verify our method on two different biological datasets: lung cancer images acquired by CT scans armato2011lung , and single-cell leukemia images acquired by time-stretch microscopy kobayashi2017scirep . In both experiments, our generative model successfully models continuous (side information-dependent) morphological changes and produces an interpretable latent representation that captures the trend.
In this work the generative model is a Wasserstein auto-encoder (WAE) tolstikhin2017wasserstein . Given the data distribution $P_X$ and the generated distribution $P_G$, consider the reparameterized primal form of optimal transport villani2008optimal ; tolstikhin2017wasserstein :
$$W_c(P_X, P_G) = \inf_{Q(Z|X):\, Q_Z = P_Z} \; \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} \big[ c(X, G(Z)) \big].$$
Relaxing the constraint $Q_Z = P_Z$, this objective can be written as the minimization of the following loss:
$$\mathcal{L}_{\mathrm{WAE}} = \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} \big[ c(X, G(Z)) \big] + \lambda \, \mathcal{D}_Z(Q_Z, P_Z),$$
in which $\mathcal{D}_Z$ is a divergence measure that matches $Q_Z$ and $P_Z$, which we set to be the maximum mean discrepancy (MMD).
We employ the maximum mean discrepancy (MMD) gretton2012kernel to match the prior $P_Z$ and the aggregated posterior $Q_Z$ in the WAE. The MMD is the RKHS distance between mean embeddings, and its unbiased empirical estimate over samples $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{m}$ is:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq i'} k(x_i, x_{i'}) + \frac{1}{m(m-1)} \sum_{j \neq j'} k(y_j, y_{j'}) - \frac{2}{nm} \sum_{i, j} k(x_i, y_j).$$
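As a concrete illustration, the unbiased estimate above reduces to averages of pairwise kernel evaluations with the diagonal excluded. The following NumPy sketch (function and variable names are ours, not from our released implementation) uses the IMQ kernel adopted in our experiments:

```python
import numpy as np

def imq_kernel(x, y, c=1.0):
    """Inverse-multiquadratic kernel k(x, y) = c / (c + ||x - y||^2)."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return c / (c + d2)

def mmd_unbiased(x, y, c=1.0):
    """Unbiased empirical estimate of squared MMD between samples x and y."""
    n, m = len(x), len(y)
    kxx = imq_kernel(x, x, c)
    kyy = imq_kernel(y, y, c)
    kxy = imq_kernel(x, y, c)
    # Exclude diagonal terms to obtain the unbiased within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```

The estimate is near zero for two samples drawn from the same distribution and grows as the distributions separate.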
With a divergence measure, we can also define an information measure as the discrepancy between the joint distribution and the product of marginal distributions. The Hilbert-Schmidt independence criterion (HSIC) gretton2005measuring is defined as the squared MMD between the joint distribution and the product of marginals of two random variables, and its biased empirical estimate over $n$ paired samples $\{(x_i, y_i)\}_{i=1}^{n}$ is given by:
$$\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{n^2} \operatorname{tr}(K H L H),$$
where $H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, $K$ is the Gram matrix of $X$ with $K_{ij} = k(x_i, x_j)$, and $L$ is the Gram matrix of $Y$ with $L_{ij} = l(y_i, y_j)$. Since the time complexity of HSIC, like that of the MMD, is quadratic in the batch size, this additional regularizer increases training time only marginally.
Given side information $s$ that we want to incorporate into the latent representation, we can add the following regularizer to the WAE objective to encourage dependence or independence between the aggregated posterior $Q_Z$ and $s$:
$$\pm \, \gamma \, \widehat{\mathrm{HSIC}}(Z, S).$$
Moreover, we can disentangle the correlation between the side information and a certain axis by increasing the dependence between $s$ and one axis $z_1$ and decreasing the dependence between $s$ and the remaining axes $z_{\setminus 1}$:
$$- \gamma_1 \, \widehat{\mathrm{HSIC}}(Z_1, S) + \gamma_2 \, \widehat{\mathrm{HSIC}}(Z_{\setminus 1}, S).$$
In this case, the information of $s$ concentrates at $z_1$; therefore, by varying this axis we can generate a continuum of morphological changes that correspond to changes in $s$.
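A minimal sketch of this disentangling regularizer, assuming the first latent axis is the dependent one and using illustrative Gaussian RBF kernels and weights (all names are ours):

```python
import numpy as np

def _rbf(a, sigma=1.0):
    """Gaussian RBF Gram matrix of a sample with itself."""
    d2 = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC, (1/n^2) tr(K H L H)."""
    n = len(x)
    h = np.eye(n) - 1.0 / n  # centering matrix I - (1/n) 1 1^T
    return np.trace(_rbf(x) @ h @ _rbf(y) @ h) / n ** 2

def disentangling_penalty(z, s, gamma1=1.0, gamma2=1.0):
    """-gamma1 * HSIC(z_1, s) + gamma2 * HSIC(z_rest, s):
    rewards dependence of the first axis on s, penalizes it elsewhere."""
    z1, z_rest = z[:, :1], z[:, 1:]
    return -gamma1 * hsic(z1, s) + gamma2 * hsic(z_rest, s)
```

Adding this penalty to the WAE loss makes the training objective smaller when the side information is captured by $z_1$ alone rather than by the remaining axes.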
3 Data and Experiments
The Lung Image Database Consortium (LIDC) armato2011lung comprises thoracic CT scans from 1018 patient cases with 2670 images. Each sample includes the coordinates contouring the suspicious nodule in the CT scan and radiologists' assessments of the likelihood of malignancy, ranging from 1 to 5. Nodules were extracted, and the final image was obtained by taking the union of all nodule contours. The dimension of the input image was 48 × 48. Data was augmented by rotation and reflection.
K562 Cell Image Dataset
A leukemia cell line, K562, was divided and incubated with 10-fold serial dilutions of adriamycin (ADM) ranging from 0.5 to 500 nM for 24 hours, followed by image acquisition with an optofluidic time-stretch microscope goda2009serial ; lei2016optical . Approximately 10,000 single-cell images were acquired for each concentration. Each image was normalized to zero mean, down-sampled to 96 × 96, and labeled with the treated drug concentration. It has been reported that drug-induced morphological changes of K562 cells can be captured by these microscopy images kobayashi2017scirep .
We trained the HSIC-regularized WAEs on both datasets, with the side information being the malignancy score in the LIDC-IDRI dataset and the ADM concentration in the K562 dataset. We set the prior $P_Z$ to a factorized unit Gaussian $\mathcal{N}(0, I)$. For the MMD we used the inverse-multiquadratic (IMQ) kernel $k(x, y) = C / (C + \|x - y\|_2^2)$, and for the HSIC we used the Gaussian RBF kernel with bandwidth selected via the median trick. We used Adam kingma2014adam for optimization. Implementation details are included in the appendix.
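The median trick sets the RBF bandwidth to the median of the pairwise distances in a sample; a minimal sketch (function name ours):

```python
import numpy as np

def median_bandwidth(x):
    """Median heuristic: sigma = median of pairwise Euclidean distances."""
    d = np.sqrt(np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1))
    iu = np.triu_indices(len(x), k=1)  # strict upper triangle: each pair once
    return np.median(d[iu])
```

This heuristic keeps the kernel values well-scaled without tuning the bandwidth as an extra hyperparameter.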
HSIC disentangles latent features with respect to side information
Training loss for the LIDC-IDRI dataset is shown in Figure 1. It can be observed that in addition to minimizing the WAE loss, training also pushes $\widehat{\mathrm{HSIC}}(z_{\setminus 1}, s)$ towards 0, indicating that these axes contain little to no information about the label, while consistently increasing $\widehat{\mathrm{HSIC}}(z_1, s)$. To further verify that this dependency is captured by the model, we generated images from random $z$ (see Appendix for generated images) and found their 3 nearest neighbors in the test data; we then regressed the malignancy score of these nearest neighbors against the dependent axis $z_1$. The regression plot in Figure 1 shows a strong positive trend, indicating that the trend in $z_1$ does indeed match the increasing trend of the malignancy score.
Latent representation and generated samples are consistent with side information
For the K562 dataset, we encode the test images and visualize the latent space via a scatter plot, in which the x-axis is the dependent axis $z_1$ and the y-axis is the first principal component of the independent axes $z_{\setminus 1}$. Consistent with our expectation, Figure 2 shows that the concentration of the encoded cell images varies dramatically along $z_1$ but not along the other axes. This observation is further supported by the kernel-fitted densities for each class: $z_1$ exhibits reasonable separation between different drug concentrations; in contrast, different classes are almost indistinguishable along the independent axes. Random samples are also generated, and it can be observed that cells seem to become larger in size as the concentration of adriamycin increases. This finding agrees with the cellular mechanism: adriamycin can arrest cells in G2/M phase just before mitosis, and thus drug-affected cells tend to be larger in volume giuseppe1989adriamycin . Meanwhile, morphological changes other than size are also present in the manifold, suggesting unelucidated features to be investigated in further studies.
In this work we proposed a regularized generative model that constructs interpretable latent features and models continuous morphological change that corresponds to the provided side information. We applied our model to two distinct biomedical datasets with different clinical significance and validated its effectiveness in incorporating and disentangling the side information in the latent representation as well as generation, which enables modeling of a continuous spectrum of morphological changes.
-  Philipp Eulenberg, Niklas Köhler, Thomas Blasi, Andrew Filby, Anne E Carpenter, Paul Rees, Fabian J Theis, and F Alexander Wolf. Reconstructing cell cycle and disease progression using deep learning. Nature communications, 8(1):463, 2017.
-  Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
-  Oren Z Kraus, Ben T Grys, Jimmy Ba, Yolanda Chong, Brendan J Frey, Charles Boone, and Brenda J Andrews. Automated analysis of high-content microscopy data with deep learning. Molecular systems biology, 13(4):924, 2017.
-  Gregory R Johnson, Rory M Donovan-Maiye, and Mary M Maleckar. Generative modeling with conditional autoencoders: Building an integrated cell. arXiv preprint arXiv:1705.00092, 2017.
-  Matthew BA McDermott, Tom Yan, Tristan Naumann, Nathan Hunt, Harini Suresh, Peter Szolovits, and Marzyeh Ghassemi. Semi-supervised biomedical translation with cycle wasserstein regression gans. 2018.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Yoshua Bengio et al. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
-  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
-  Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust visual-semantic embeddings. 2017.
-  Yao-Hung Hubert Tsai and Ruslan Salakhutdinov. Improving one-shot learning through fusing side information. arXiv preprint arXiv:1710.08347, 2017.
-  Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
-  Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
-  Romain Lopez, Jeffrey Regier, Nir Yosef, and Michael I Jordan. Information constraints on auto-encoding variational bayes. arXiv preprint arXiv:1805.08672, 2018.
-  Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915–931, 2011.
-  Hirofumi Kobayashi, Cheng Lei, Yi Wu, Ailin Mao, Yiyue Jiang, Baoshan Guo, Yasuyuki Ozeki, and Keisuke Goda. Label-free detection of cellular drug responses by high-throughput bright-field imaging and machine learning. Scientific Reports, 7(1):12454, 2017.
-  Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
-  Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
-  K Goda, KK Tsia, and B Jalali. Serial time-encoded amplified imaging for real-time observation of fast dynamic phenomena. Nature, 458(7242):1145, 2009.
-  Cheng Lei, Baoshan Guo, Zhenzhou Cheng, and Keisuke Goda. Optical time-stretch imaging: Principles and applications. Applied Physics Reviews, 3(1):011102, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Toffoli Giuseppe, Viel Alessandra, Bevilacqua Carla, Maestro Roberta, Tumiotto Loretta, and Boiocchi Mauro. In k562 leukemia cells treated with doxorubicin and hemin, a decrease in c-myc mrna expression correlates with loss of self-renewal capability but not with erythroid differentiation. Leukemia Research, 13(4):279–287, 1989.
6.1 Further details on experiments
Model hyperparameters for LIDC-IDRI dataset
We used batches of size 512 and trained the models for 18000 steps, approximately 5 epochs. We used , , and . We set for the Adam optimizer.
where Conv$_k$ stands for a convolutional layer with $k$ filters, FSConv$_k$ for a fractional-strided convolution layer with $k$ filters, BN for batch normalization, LReLU for leaky rectified linear units, and FC$_k$ for a fully connected layer. All the convolutional layers in the encoder and decoder used vertical and horizontal strides of 2 and SAME padding.
Model hyperparameters for K562 Dataset
For training the HSIC-regularized WAE on the K562 dataset, we used batches of size 200 and trained the models for 8000 steps, approximately 50 epochs. We used , , and . We set for the Adam optimizer.