Modeling the Biological Pathology Continuum with HSIC-regularized Wasserstein Auto-encoders

01/20/2019 ∙ by Denny Wu, et al. ∙ University of Toronto

A crucial challenge in image-based modeling of biomedical data is to identify trends and features that separate normality and pathology. In many cases, the morphology of the imaged object exhibits continuous change as it deviates from normality, and thus a generative model can be trained to model this morphological continuum. Moreover, given side information that correlates with a certain trend in morphological change, a latent variable model can be regularized such that its latent representation reflects this side information. In this work, we use the Wasserstein Auto-encoder to model this pathology continuum, and apply the Hilbert-Schmidt Independence Criterion (HSIC) to enforce dependency between certain latent features and the provided side information. We experimentally show that the model can provide disentangled and interpretable latent representations and also generate a continuum of morphological changes that corresponds to change in the side information.




1 Introduction

Machine learning models operating on medical images often aim to identify morphological features that distinguish unhealthy or stressed biological compositions from normal ones eulenberg2017reconstructing ; esteva2017dermatologist ; kraus2017automated . Often, the morphological features change gradually as the composition deviates from normality (e.g., change of cell shape as the concentration of an applied drug increases). In such cases, in addition to modeling the data distribution, it would also be useful to explicitly construct features that capture this continuous change and influence the generative process, so that the resulting model has greater interpretability and fidelity. Specifically, given side information responsible for certain morphological changes, we would like to train a generative model with an interpretable latent representation that disentangles this side information, so that the model can generate the corresponding continuum.

Deep generative models have shown great success in modeling medical data johnson2017generative ; mcdermott2018semi . Typically, a deep generative model learns to transform a prior distribution P_Z into the data distribution P_X. While evaluating model likelihood is generally intractable, several good approximations have been proposed, such as optimizing the evidence lower bound (ELBO) kingma2013auto or parameterizing an adversarial divergence goodfellow2014generative . An encoder-decoder architecture bengio2009learning is useful when inference is required (e.g., given a cell image, encode its morphology into latent features of lower dimension). In a regular auto-encoder, minimizing the reconstruction cost alone may not result in good latent representations makhzani2015adversarial . Therefore, regularizing the encoded latent features is crucial in applying the encoder-decoder architecture to generative modeling. The choice of regularizer determines the interpretation of the model, e.g., optimizing the ELBO on the data distribution kingma2013auto , or minimizing the primal form of the Wasserstein distance tolstikhin2017wasserstein .

Side information refers to additional features that are not directly modeled but may be relevant to the primary task. In few-shot learning or transfer learning, side information is useful for learning robust and generalizable representations tsai2017learning . For instance, tsai2017improving uses word embedding vectors as side information and applies the HSIC gretton2005measuring with learned kernels to enforce dependency between the side information and the learned representation. When dealing with data from a source domain and a target domain, a classifier can be trained between source and target to encourage features to be domain-invariant ganin2016domain . On the other hand, regularization of the individual latent features in encoder-decoder models can also lead to better performance and interpretability. In chen2018isolating , the total correlation between latent axes is penalized in order to disentangle the latent features, whereas in lopez2018information this regularizer is replaced by dHSIC.

In this work, we employ a non-parametric independence measure (HSIC) to integrate side information into the latent representation of a generative model trained on biomedical data. Specifically, given side information that correlates with certain continuous morphological changes of the biological composition, we disentangle the latent features by incorporating this information into one axis and forcing the remaining axes to be independent of it. In contrast to classifier-based regularization, this approach does not require training an additional model and is thus more stable and data-efficient. We verify our method on two different biological datasets: lung cancer images acquired by CT scans armato2011lung , and single-cell leukemia images acquired by time-stretch microscopy kobayashi2017scirep . In both experiments, our generative model successfully models continuous, side information-dependent morphological changes and produces an interpretable latent representation that captures the trend.

2 Methods

Wasserstein Auto-encoder

In this work the generative model is a Wasserstein Auto-encoder tolstikhin2017wasserstein . Given the data distribution P_X and the generated distribution P_G, consider the reparameterized primal form of optimal transport villani2008optimal ; tolstikhin2017wasserstein :

W_c(P_X, P_G) = \inf_{Q : Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} [ c(X, G(Z)) ],

where Q_Z is the aggregated posterior, i.e., the marginal of Q(Z|X) under P_X. Relaxing the constraint Q_Z = P_Z, this objective can be written as the minimization of the following loss:

D_{WAE}(P_X, P_G) = \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} [ c(X, G(Z)) ] + \lambda \cdot \mathcal{D}_Z(Q_Z, P_Z),

in which \mathcal{D}_Z is a divergence measure that matches Q_Z and P_Z, which we set to be the maximum mean discrepancy (MMD).

Kernel-based Regularizers

We employ the maximum mean discrepancy (MMD) gretton2012kernel to match the prior P_Z and the aggregated posterior Q_Z in the WAE. The MMD is the RKHS distance between the mean embeddings of the two distributions, and its unbiased empirical estimate

\widehat{MMD}^2_k(P, Q) = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)

can be computed in O(n^2) time.
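As a concrete illustration, the unbiased estimator above can be sketched in a few lines of NumPy. The Gaussian RBF kernel and the function names here are illustrative choices of ours, not the paper's exact configuration (the paper uses an IMQ kernel for the WAE penalty):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased estimate of squared MMD between samples X (n x d) and Y (m x d).
    # Within-sample sums exclude the diagonal, so the estimate can be slightly
    # negative under the null; the cost is O(n^2) kernel evaluations.
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())
```

The estimate is close to zero when both samples come from the same distribution and grows as the distributions separate.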

With a divergence measure, we can also define an information measure as the discrepancy between the joint distribution and the product of marginal distributions. The Hilbert-Schmidt independence criterion (HSIC) gretton2005measuring is defined as the squared MMD between the joint distribution and the product of marginals of two random variables, and its biased empirical estimate is given by:

\widehat{HSIC}(X, Y) = \frac{1}{n^2} \mathrm{tr}(K H L H),

where H = I_n - \frac{1}{n} \mathbf{1} \mathbf{1}^\top is the centering matrix, K is the Gram matrix of X with K_{ij} = k(x_i, x_j), and L is the Gram matrix of Y with L_{ij} = l(y_i, y_j). Since the time complexity of HSIC is also quadratic, this additional regularizer increases training time only marginally.
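The biased estimator tr(KHLH)/n^2 is equally short in NumPy; this sketch assumes Gaussian RBF Gram matrices with a fixed bandwidth, which is our illustrative choice rather than the paper's exact kernel setup:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix of a sample under the Gaussian RBF kernel.
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def hsic_biased(K, L):
    # Biased empirical HSIC: (1/n^2) tr(K H L H), where H = I - (1/n) 1 1^T
    # is the centering matrix and K, L are Gram matrices of the two samples.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2
```

For identical (fully dependent) samples the estimate is clearly positive, while for independent samples it shrinks towards zero at rate O(1/n).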

HSIC-regularized WAE

Given side information s that we want to incorporate into the latent representation, we can add the following regularizer to the WAE objective to encourage dependence or independence between the aggregated posterior Q_Z and s:

\pm \lambda_{HSIC} \cdot \widehat{HSIC}(Q_Z, s),

with the sign chosen to penalize or reward dependence. Moreover, we can disentangle the correlation between the side information and a particular axis by increasing the dependence between s and one axis z_d while decreasing the dependence between s and the remaining axes z_{\setminus d}:

- \lambda_1 \cdot \widehat{HSIC}(z_d, s) + \lambda_2 \cdot \widehat{HSIC}(z_{\setminus d}, s).

In this case, information about s concentrates in z_d; therefore, by varying this axis we can generate a continuum of morphological changes that corresponds to change in s.
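Putting the pieces together, a minimal NumPy sketch of a per-batch objective with the signed HSIC terms might look as follows. The function and argument names (hsic_wae_loss, dep_axis, the weights lam, g_dep, g_indep) are ours, the reconstruction cost is taken to be squared error, and Gaussian RBF kernels stand in for the paper's exact kernel choices:

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Pairwise Gaussian RBF kernel matrix.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def mmd2(X, Y):
    # Unbiased squared-MMD estimate between two samples.
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

def hsic(A, B):
    # Biased empirical HSIC between two samples.
    n = len(A)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf(A, A) @ H @ rbf(B, B) @ H) / n**2

def hsic_wae_loss(x, x_recon, z, z_prior, s, dep_axis=0,
                  lam=1.0, g_dep=1.0, g_indep=1.0):
    # Reconstruction + lam * MMD(Q_Z, P_Z)
    #   - g_dep * HSIC(z_d, s) + g_indep * HSIC(z_{\d}, s):
    # reward dependence of the chosen axis on s, penalize it elsewhere.
    recon = np.mean((x - x_recon) ** 2)
    z_dep = z[:, [dep_axis]]
    z_rest = np.delete(z, dep_axis, axis=1)
    return (recon + lam * mmd2(z, z_prior)
            - g_dep * hsic(z_dep, s)
            + g_indep * hsic(z_rest, s))
```

In practice the encoder and decoder would be neural networks and z, x_recon would be batch outputs; this function only illustrates how the four terms combine, and shows that a latent code whose dependent axis is tied to s attains a lower loss than an unrelated one.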

3 Data and Experiments

LIDC-IDRI dataset

The Lung Image Database Consortium (LIDC) armato2011lung comprises thoracic scans from 1018 patient cases with 2670 images. Each sample includes the coordinates contouring the suspicious nodule in the CT scan and radiologists' assessments of the likelihood of malignancy, ranging from 1 to 5. Nodules were extracted, and the final image was obtained by taking the union of all nodule contours. The dimension of the input image was 48 x 48. Data was augmented by rotation and reflection.

K562 Cell Image Dataset

A leukemia cell line K562 was divided and incubated with 10-fold serial dilutions of adriamycin ranging from 0.5 to 500 nM for 24 hours, followed by image acquisition with optofluidic time-stretch microscope goda2009serial lei2016optical . Approximately 10,000 single-cell images were acquired for each concentration. Each image was normalized to 0-mean, down-sampled to 96 x 96, and labeled with the treated drug concentration. It has been reported that drug-induced morphological changes of K562 cells can be captured by the microscopy images kobayashi2017scirep .

Experimental Details

We trained the HSIC-regularized WAEs on both datasets, with the side information being the malignancy score in the LIDC-IDRI dataset and the ADM concentration in the K562 dataset. We set the prior P_Z to be a factorized unit Gaussian N(0, I). For MMD we used the inverse multiquadratic (IMQ) kernel k(x, y) = C / (C + ||x - y||_2^2), and for HSIC we used the Gaussian RBF kernel with bandwidth selected via the median trick. We used Adam kingma2014adam for optimization. Implementation details are included in the appendix.
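For reference, the IMQ kernel and the median-trick bandwidth can be written as follows; the helper names and the default constant C are illustrative assumptions (a common WAE convention scales C with the latent dimension):

```python
import numpy as np

def imq_kernel(X, Y, C=1.0):
    # Inverse multiquadratic kernel: k(x, y) = C / (C + ||x - y||^2).
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return C / (C + np.maximum(sq, 0))

def median_bandwidth(X):
    # Median trick: set the RBF bandwidth to the median pairwise distance
    # between samples, a standard heuristic for kernel methods.
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    d = np.sqrt(np.maximum(sq, 0))
    iu = np.triu_indices(len(X), k=1)  # distinct pairs only
    return np.median(d[iu])
```

The IMQ kernel's heavy tail is often preferred over the RBF for matching the aggregated posterior to a Gaussian prior, since it penalizes outliers less aggressively.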

4 Results

Figure 1: Results from the LIDC-IDRI dataset. Left: training loss and HSIC loss vs. training steps. Right: malignancy score of the nearest neighbors of generated samples vs. the dependent axis z_d; the trend of malignancy correlates with the dependent axis.

HSIC disentangles latent features with respect to side information

Training loss for the LIDC-IDRI dataset is shown in Figure 1. It can be observed that in addition to minimizing the WAE loss, training also pushes HSIC(z_{\setminus d}, s) towards 0, indicating that these axes contain little to no information about the label, while consistently increasing HSIC(z_d, s). To further verify that this dependency is captured by the model, we generated images from random latent samples (see Appendix for generated images) and found their 3 nearest neighbors in the test data; we then regressed the malignancy score of these nearest neighbors against the dependent axis z_d. The regression plot in Figure 1 shows a strong increasing trend, indicating that the trend in z_d does indeed match the increasing trend of the malignancy score.

Latent representation and generated samples are consistent with side information

For the K562 dataset, we encode the test images and visualize the latent space via a scatter plot, in which the x-axis is the dependent axis z_d and the y-axis is the 1st principal component of the independent axes z_{\setminus d}. Consistent with our expectation, Figure 2 shows that the concentration of the encoded cell images varies dramatically along z_d, but not along the other axes. This observation is further supported by the kernel-fitted densities for each class: z_d exhibits reasonable separation between different drug concentrations; in contrast, different classes are almost indistinguishable along z_{\setminus d}. Random samples are also generated, and it can be observed that cells seem to become larger in size as the concentration of adriamycin increases. This finding agrees with the cellular mechanism: adriamycin can arrest cells in the G2/M phase just before mitosis, and thus the drug-affected cells tend to be larger in volume giuseppe1989adriamycin . Meanwhile, morphological changes other than size are also present in the manifold, suggesting unelucidated features to be investigated in further studies.

Figure 2: Results from the K562 dataset. Left: scatter plot of test images in the latent space, with z_d as the x-axis and the 1st principal component (PC) of z_{\setminus d} as the y-axis; class separation is obvious along z_d but not along the other axes. Right: generated images sampled along the dependent axis z_d and the 1st PC of all other axes; generated cells vary in shape along z_d.

5 Conclusion

In this work we proposed a regularized generative model that constructs interpretable latent features and models continuous morphological change that corresponds to the provided side information. We applied our model to two distinct biomedical datasets with different clinical significance and validated its effectiveness in incorporating and disentangling the side information in the latent representation as well as generation, which enables modeling of a continuous spectrum of morphological changes.


6 Appendix

6.1 Further details on experiments

Model hyperparameters for LIDC-IDRI dataset

We used batches of size 512 and trained the models for 18000 steps, approximately 5 epochs. We used , , and . We set for the Adam optimizer.

Encoder architecture:

Decoder architecture:

where the notation denotes, respectively, a convolutional layer with k filters, a fractional-strided convolution layer with k filters, batch normalization, leaky rectified linear units, and a fully connected layer. All the convolutional layers in the encoder and decoder used vertical and horizontal strides of 2 and SAME padding.

Model hyperparameters for K562 Dataset

For training the HSIC-regularized WAE on the K562 dataset, we used batches of size 200 and trained the models for 8000 steps, approximately 50 epochs. We used , , and . We set for the Adam optimizer.

Encoder architecture:

Decoder architecture:

6.2 Additional Results

Figure 3: Additional results from the LIDC-IDRI dataset. Left: scatter plot of test images in the 10D latent space; the x-axis is the feature dependent on the malignancy score, and the y-axis is the 1st principal component (PC) of all independent features. Consistent with our expectation and our observation from the K562 cell image dataset, Figure 3 shows that the malignancy of lung cancer nodules in CT scans varies dramatically along the HSIC-dependent axis, but not along the other axes. The kernel density estimation plot for each malignancy score group also shows strong separation between the distributions along the HSIC-dependent axis, while the other axes contribute little, resulting in indistinguishable distributions. Right: The samples generated from the HSIC-regularized WAE's decoder suggest that shape change is the principal feature influencing the malignancy score of lung cancer nodules in CT scans. The nodules' shape becomes pointier and less round, and the pixel intensity increases as the malignancy score increases.