The era of big data is marked not only by large datasets, but also complex modalities of data such as images, networks, and text. A key challenge in the analysis of complex data is data representation: what do you measure to capture the signal of interest (marron2014overview)? Automated, data-driven methods for feature extraction have proven to be powerful tools for analyzing complex data (bengio2013representation). This is particularly true for predictive problems involving data modalities such as images and text where deep learning has made major advances. Developing interpretable, data-driven feature extraction methods – particularly for inferential analyses where the goal is for humans to learn about the underlying signal – is an ongoing challenge (doshi2017towards; olah2018building).
Modern scientific applications increasingly involve relating multiple modalities of data to one another. The multi-block data setting111This setting is also referred to as multi-view, multi-modal, or multi-omic data. Related terms include data fusion and data integration. means a fixed set of observations (e.g. patients) and several sets of variables (e.g. gene expression, copy number, protein). The classical example of a multi-block data algorithm is canonical correlation analysis
(CCA) which, given two data blocks, estimates a low dimensionaljoint signal capturing the information shared by the two blocks (hotelling1936relation). The multi-block data setting arises in recent cancer studies such as the Cancer Genome Atlas (cancer2012comprehensive).
Two of the main approaches to the study of breast cancer are histopathology and genetics. Pathologic assessment of breast tissue is the current state of the art method for breast cancer diagnosis. A pathologist examines the tissue under a microscope in order to determine if there is cancer as well as to assess important features used for determining the course of treatment including histologic type, invasive status, tumor size, lymph node status, grade, as well as stain for hormone receptors and HER2. Histopathology is the classical approach for diagnosing cancer dating back to the 1800s (rosen2001rosen; titford2006short). Figure 1 shows an image of a hematoxylin and eosin stained (H&E) breast cancer tumor from the dataset analyzed in this paper. Complementary to histopathology, genetic information is increasingly used to understand the biological processes underlying cancer. Genetics also plays an important role in tumor clustering/classification as well as predicting patient prognosis and response to treatments (gyHorffy2015multigene).
Most existing research in breast cancer focuses on a single data modality e.g. either histopathology or genetics. For example, the development of automated algorithms to augment human pathologists’ judgements using machine learning is active area of research that aims to increase diagnostic accuracy and decrease costs(liu2017detecting; couture2018image; veta2019predicting). Subtype discovery (clustering) based on gene expression data is another major area of research (perou2000molecular; parker2009supervised). Exploratory/inferential analyses, which study direct associations between histopathology and genetics, are receiving increasing interest in recent years, but are hindered by the challenge of doing inferential analysis on image data (ash2018joint).
The Carolina Breast Cancer Study Phase 3 (CBCS) is a cohort of breast cancer survivors diagnosed between 2008 and 2013 and with both histologic and gene expression data (troester2017racial). CBCS provides a unique opportunity to explore both connections and differences between genomics and histopathology that could lead to new discoveries e.g. improved classification. It is known that genetics plays an important role in cancer development, so one might expect to see joint modes of variation between the genetic and histopathological data. However, not all information in one data modality is contained in the other, therefore, we expect to also see histopathology individual as well as genetic individual modes of variation. To estimate these three different kinds of signals, we use angle-based joint and individual variation explained (AJIVE) (feng2018angle).
Statistical analysis of images presents difficulties for feature extraction. Convolutional neural networks (CNNs) have proven adept at solving predictive tasks with images (fukushima1980neocognitron; lecun2015deep). Inspired by the approach in couture2018image, we use an intermediate layer of a pre-trained neural network (see Section 2.2) for automatic image feature extraction – so called transfer learning (yosinski2014transferable; sharif2014cnn)
. We explore the limits of transfer learning and show that pre-trained CNN features with no fine tuning are able to capture complex visual signals in a domain vastly different than the one they were originally trained for. See Section5.0.2 for a longer discussion of transfer learning in the context of this paper. While the use of CNNs allows us to capture rich visual information, it presents difficulties because neural network features are notoriously difficult to interpret.
Leveraging deep learning for exploratory/inferential tasks, where the goal is to learn about patterns in the data, presents new challenges and is an active area of investigation (ash2018joint; adebayo2018sanity; olah2018building).
We develop a multi-scaled approach called representative patch views (RPV) to interpret signals in the data captured by multivariate statistical algorithms222 Many statistical methods such as linear regression, PCA and AJIVE provide loadings vectors of neural features (i.e. vectors in neural feature space) that don’t immediately tell us humans anything about the underlying biology.
Many statistical methods such as linear regression, PCA and AJIVE provide loadings vectors of neural features (i.e. vectors in neural feature space) that don’t immediately tell us humans anything about the underlying biology.applied to neural features (Section 3.1). This approach takes a patch-based representation of each H&E image, allowing us to localize regions in each image that play an important role in a given mode of variation. The RPVs operate at both the core level and patch level, providing a pathologist with two different levels of granularity to explore the visual features associated with a particular mode of variation.
Our AJIVE analysis is able to discover clear, interpretable, joint signals such as associations between Basal-like (genetics) and high grade tumors (histology). The analysis also finds clear individual signals in the histopathological data such as fat content and mucinous tumors.
Section 1.1 below discusses the results of the first AJIVE joint component which captures tumor grade from the pathology perspective and the contrast between the Basal-like and Luminal A PAM50 subtypes from the genetic perspective. Section 2 presents the data provided by CBCS as well as the patch based, CNN image features extraction approach. Section 3 discusses the methods used in this paper including the representative patch views and AJIVE. Section 4 interprets the joint, image individual and genetic individual AJIVE components. Finally, Section 5 concludes with more discussion about transfer learning and exploratory analysis with deep learning.
All correlations and AUC statistics reported in the text of the paper are statistically significant at a level of 0.05 after controlling for multiple testing with the Benjamini-Hochberg procedure benjamini1995controlling unless stated otherwise (see Section A.3).
The Appendix provides additional results and discussion. Section A.1 contains a brief description of the tissue structures relevant to this paper. Section A.2 gives diagnostic plots for the AJIVE analysis of the CBCS data. Section A.3 gives additional details of statistical analyses used in the clinical data interpretation methods. The supplementary material includes a number of figures that were excluded from the main body of the paper for brevity. In particular, all representative patch views examined by pathologists are shown in Acknowledgements.
1.1 Results of the first AJIVE joint component
This section discusses the results of the first AJIVE joint component; many of the technical details (e.g. the RPVs and multiple testing control) are provided in Sections 2 and 3. In the CBCS dataset available to us there are subjects, genes (the PAM50 gene set) and neural network features extracted from the pathology images using a mean pool of the last convolutional layer of VGG16 (simonyan2014very). To help interpret the results of the AJIVE analysis we use a variety of additional variables that are provided by CBCS, but are not used in the AJIVE analysis. These include: PAM50 molecular subtypes, histopathological variables (e.g. pathologist determined tumor grade), immunohistochemical variables (e.g. estrogen receptor (ER) and clinical HER2 status), and patient variables (e.g. age).
1.1.1 Methods overview
Like principal components analysis (PCA), AJIVE returns an (ordered) set of components comprised of scores vectors and loadings vectors (see Sections 3.2). While PCA returns one set of components, AJIVE returns three sets of components when applied to two data blocks: joint, block one (pathology) individual, and block two (genetic) individual. The joint components capture joint signals shared between the two data blocks; the individual components capture signals present in one data block that are not related to the other data block.
For interpretation purposes, we hypothesize that – like PCA components – each AJIVE component discovers a spectrum contrasting interesting aspects of the data (e.g. Basal-like tumors vs. non-Basal-like tumors). Interpreting the results of the AJIVE analysis often involves understanding what biological phenomena are driving such a spectrum. One AJIVE component assigns a number to each subject (the scores) and subjects can be sorted from most negative to most positive. In the following analysis we consider the negative and positive extremes of each component.
By exploiting a patch based representation of the images, we can use the AJIVE neural network loadings vectors to create the RPVs which are discussed in detail in Section 3.1 (see Figure 8). Each RPV corresponds to a subject-component-end triple; one subject and one end of one AJIVE component (e.g. subject 12 and the negative end of the first AJIVE joint component). The image patches selected in the RPV are meant to be representative of the visual patterns characteristic of an end of a component. For the sake of exposition, we primarily show a smaller version of the RPVs in the body of the paper, which display only the top 8 patches (e.g. Figure 2).
We initially consider the negative and positive extremes of the first joint component separately.
From the pathology perspective, two distinct visual patterns show up in the negative end of the first joint component (Figures 2 and 3). Section A.1 has a brief explanation of the various tumor structures which are relevant to this paper. The first pattern is dense tumor infiltrating lymphocytes (TILs) and is illustrated in Figure 2 which shows the RPV of the most negative subject of the first joint component. The smaller cells which have hyperchromatic round nuclei and relatively scant cytoplasm (i.e dark, round, purple structures), are lymphocytes. In particular types of breast cancer, TILs in the intratumoral stroma are associated with prognosis and may be associated with response to immunomodulatory therapy (wein2017clinical).
The second visual pattern in the negative end of the first joint component is dense, high nuclear grade tumor cells and is illustrated in Figure 3. Nuclear grade describes how abnormal the tumor cells look: “low grade” means the tumor cells look similar to regular cells (“well-differentiated”) and “high grade” means the tumor cells look markedly abnormal (“poorly-differentiated”) e.g. are enlarged and irregularly shaped (rosen2001rosen).
On the positive end of the first joint component, the pathology review shows subjects whose cores contain mostly normal breast tissue i.e. little tumor tissue. This pattern is illustrated by Figure 4, which shows the subject with the most positive scores for the first joint component. These patches contain few tumor cells and are mostly normal breast structures such as collagenous stroma (the light pink, stringy tissue) and ducts.
The first joint component is related to histopathological features including tumor grade and histological type (ductal vs. lobular). For example, Figure 4(a) shows that high grade tumors cluster on the negative end of the first joint component while low grade tumors cluster on the positive end (AUC = 0.945). Tumor grade incorporates cellular differentiation and other architectural features as an indicator of aggressiveness (elston2002pathological). This first component is also statistically significantly related to histological type with ductal on the negative end and lobular on the positive end (AUC = 0.785).
From the genetics perspective, the first joint component strongly tracks the proliferation score as well as the contrast between Basal-like vs. Luminal A tumors. Figure 6 shows the PAM50 joint loadings vector for the first component. Several of the top negative genes (e.g. CCNB1, CENPF, MYC, MKI67) are associated with high tumor cell proliferation and tend to have low expression levels in normal breast tissue. Several of the top positive genes (e.g. MLPH, MMP11) tend to have high expression levels in normal breast tissue. Note that FOXC1 is highly expressed in both basal-like and normal-like breast myoepithelium.
Figure 4(b) shows that Basal-like tumors cluster on the negative end of the first joint component, molecular HER2 and Luminal B cluster in the middle while Luminal A and normal tumors cluster on the positive end. Note the AUC score for Basal-like vs. Lum A is 0.984 which is quite high. Luminal B and molecular HER2 are separated from Basal-like (AUCs of 0.886 and 0.876). The separation indicates that this joint component is distinguishing more subtle histopathological and molecular features beyond proliferation and cellularity. Figure 4(c) shows a strong, negative correlation between the first joint component scores and the proliferation score, which is a genetic measure indicative of how fast tumor cells grow (whitfield2002identification).
Strikingly, the first joint component almost perfectly separates ROR-PT, which is a combined genetic and pathology based risk of recurrence score parker2009supervised. Patients with high ROR-PT are clustered on the negative end while patients with a low ROR-PT are clustered on the positive end with an AUC of 0.999.
In addition to genetic phenotypes measured by RNA expression data as just discussed, we also have immunohistochemistry (IHC) data, a surrogate measure of RNA subtypes and the most common way of classifying tumors in a clinical setting. From the IHC perspective, the first joint component is strongly related to ER status and weakly related to clinical HER2 status (see Table2). Clinical ER negative tumors cluster on the negative end of this component with an AUC of 0.883.
In this first joint component, the pathology and genetics tell complementary stories that are familiar to breast cancer experts. The data raise the possibility that this joint component separates tumors based on one or more histologic features associated with tumor grade. These features could include aspects of nuclear atypia (i.e increased nuclear size, irregular shape, altered chromatin pattern, multiple nucleoli) which are reflected in the nuclear grade. Tumors with a high combined histologic grade also tend to be more cellular and show less tubule or gland formation as compared to low-grade tumors.
From the genetics perspective, Basal-like tumors are on the negative end, molecular HER2/Luminal B tumors are in the middle, and Luminal A/Normal like tumors are on the positive end. The joint scores are strongly negatively correlated with the proliferation score. The negative genes in Figure 6 are predominantly proliferation regulated genes; however, we note several of the positive genes are often considered basal-specific genes. These genes are also expressed in normal myoepthelieum and are representative of the normal ducts still observed within slides of the low grade tumors (livasy2006phenotypic; heng2017molecular).
Aggressive tumors tend to have high tumor cellularity and little benign tissue. In less aggressive tumors, there is typically more normal breast tissue. Basal-like tumors tend to be more aggressive and are generally associated with high tumor grade, ER negativity, ductal histology, and high proliferation score (livasy2006phenotypic; troester2017racial; williams2019differences). Luminal B and molecular HER2 tumors tend to be moderately aggressive. Luminal A and Normal like tumors are less aggressive and it is known these tend to be low grade.
It is promising that this mode of variation turned up in the first joint component. These connections between the underlying genetic drivers and the pathological impressions have both geneticists and pathologists excited about the potential of AJIVE to quantitatively integrate these different aspects of cancer.
1.2 Related literature
There is a large literature on dimensionality reduction for multi-block data including classical algorithms such as CCA hotelling1936relation and partial least squares wold1985partial as well as more modern methods such as: multi-block versions of CCA kettenring1971canonical; nielsen2002multiset; asendorf2015informative, iNFM yang2015non, SLIDE gaynanova2017structural and BASS (zhao2014bayesian). JIVE lock2013joint and AJIVE feng2018angle are some of the first methods to look at both joint as well as individual modes of variation.
Interpretability in deep learning is a growing field (vellido2012making; molnar2018interpretable; chen2018looks; kim2018interpretability; olah2018building; holzinger2019causability). Saliency maps interpretation methods involve visualizing the gradient (or a related quantity) of the output of a CNN with respect to a given input image (zeiler2014visualizing; springenberg2014striving; selvaraju2017grad; sundararajan2017axiomatic; adebayo2018sanity). We adapted several of these methods for the AJIVE analysis and explored their applicability to our data; unfortunately, none of the methods provided consistently interpretable outputs and raised issues which will be explored in a follow up paper.
Deep learning based predictive analysis of histological images is a growing area (komura2018machine; aeffner2019introduction; niazi2019digital; abels2019computational) which includes tasks such as classification/regression (wang2016deep; liu2017detecting; bejnordi2018using; ilse2018attention; liu2018artificial; veta2019predicting), semantic segmentation jimenez2019deep; mahmood2019deep, and microscope augmentation (chen2018microscope) CNN architectures which integrate genetic (or other) information are also being explored for these predictive tasks (couture2018image; srivastava2018building; mahmood2018multimodal). Other studies used non-deep learning based methods to do exploratory, integrative analysis of histological and genetic data beck2011systematic; wang2013identifying; cooper2015novel
A similar joint, exploratory analysis of breast cancer H&E image and gene expression data was performed by (ash2018joint). Our methods differ from theirs in a number of ways: they only examined joint signals while we examine both joint and individual signals; they used a sparse CCA while we use AJIVE; we develop and use the RPVs for image interpretation; they trained an auto-encoder while we use transfer learning. An important result of our paper is that even simple transfer learning effectively captures the important signals in the data.
1.3 Software and data release
The code to reproduce the analysis in this paper can be found at github.com/idc9/breast_cancer_image_analysis. The raw data e.g. H&E images, gene expression data, clinical variables cannot be released publicly due to patient confidentiality concerns. Researchers may request permission to access the raw data used in this study by visiting https://unclineberger.org/cbcs/for-researchers/.
The scikit-image library is used for various image processing tasks (van2014scikit)
. The PyTorch framework is used for all neural network computations(paszke2017automatic)
. The pre-trained VGG16 weights are downloaded with the PyTorch vision library. We used many of the standard python data science libraries e.g. NumPyvan2011numpy, SciPy jones2014scipy, sklearn scikit2011pedregosa, pandas mckinney2011pandas, Matplotlib hunter2007matplotlib, and seaborn waskom2018seaborn. AJIVE computations are done with the pyjive package carmichael2019pyjive which was developed for this project.
The data came from the Carolina Breast Cancer Study, a population-based study of black and white women with invasive breast cancer diagnosed between 2008-2013 in North Carolina. Tumor blocks were collected and cores were transferred from the donor paraffin blocks to prepare tissue microarrays as well as to isolate RNA for gene expression analysis. Tissue preparation and molecular methods are described in (troester2017racial; allott2018frequency). The current analysis includes patients for whom both image and gene expression data were available.
For each patient, a pathologist reviewed a paraffin-embedded tumor block and marked the area containing the invasive carcinoma. Then a lab technician extracted a number of circular “cores”, which were then transferred into a recipient TMA paraffin block and eosin (H&E) and imaged. Acknowledgements shows a graphical depiction of this process. The upshot is that for each patient we have a median333Minimum of 1 and maximum of 8. of 4 H&E stained core images. The images of these cores are roughly circular with an average width of approximately 2500 pixels444Min 600, max 3400.. An example core image is shown in Figure 1. It is appealing to work with cores and not the much larger whole slide images because the cores provide more concentrated tumor cells and are more computationally tractable555The whole slide images can be of order 50,000 50,000 pixels or larger..
Pathologic evaluation of the tumor (including histologic type and grade) was based on the original whole tissue sections. We also compute a number of variables describing image features such as the proportion of white background and the median intensity of the background pixels.
For each patient, we have the PAM50 gene expression measurements, which are 50 genes chosen to distinguish the 5 clinically relevant, genetic subtypes (Basal-like, Luminal A, Luminal B, molecular HER2 and Normal-like) (parker2009supervised). The intrinsic subtype gene list was developed using genes which were consistently expressed within the tumor while minimizing the contribution of the non-tumor microenvironment; therefore the PAM50 genes do not describe the tumor microenvironment (perou2000molecular). We also have a number of variables derived from the PAM50 gene expression such as proliferation score and ROR-PT (parker2009supervised). The ROR-PT is a risk of recurrence score based on both genetic and histological information (e.g. tumor size).
CBCS provides clinically relevant immunohistochemical variables (ER status, clinical HER2 status and PR status), which are derived from routine methods used in the clinical laboratory. Finally, there are other patient variables such as age which are discussed in Acknowledgements.
gene expression variables under go pre-precessing which includes centering and scaling by their standard deviation resulting in the gene expression data matrix.
2.2 Image processing and patch representation
In order to achieve uniform visual stain density, the raw H&E core images are stain normalized using the procedure described in (macenko2009method). The set of background pixels of each image (i.e. the whitespace in Figure 1) is then estimated via the following procedure. Each image is converted to grayscale, then a background pixel intensity threshold is estimated with weighted666Through exploratory analysis we noticed that off the shelf methods (e.g. Otsu alone) had systemic issues with images which have a high proportion of background (e.g. those with a high fat content or high mucin content). This particular combination was selected by tuning on a visual examination of the 100 images with the highest proportion background. combination of (0.1) Otsu’s method otsu1979threshold and (0.9) the triangle method zack1977automatic. The background mask (True/False array saying whether or not a pixel is in the background) is then used for a variety of downstream tasks. For example, using the background mask we can estimate the channel wise median background pixel and compute the proportion of background in the entire image.
Next we create a patch-based representation of each image. Each core image is broken into a grid of
pixel patches. To make an even grid of patches, the image is first padded with the estimated typical background pixel so its dimensions are divisible by 200. Using the background mask, patches which are more than 90% background are thrown out (Figures6(a) and 6(b)). The background threshold (90%) was selected via manual inspection to be the smallest value such that patches with large amounts of fat and some tissue are still included (Figure 6(b)).
There are a total of core images from the subjects resulting in patches. We estimate the channel (red, green, blue values for each pixel) mean and standard deviation from the patch dataset. Before being input into the neural network, each pixel channel is mean centered then scaled by the standard deviation.
2.3 CNN feature extraction
After the raw images are processed, CNN features are extracted from each patch. We use the last convolutional layer of the VGG16 architecture simonyan2014very with an additional spatial mean pooling layer added to the end of this architecture to average out spatial information resulting in 512 CNN features. In other words, if the output from the original network applied to a pixel image is sized (where 512 is the depth), the spatial mean pool will output a
dimensional vector. The pre-trained weights of the network are downloaded from the torch vision library. No additional fine-tuning is performed (see Section5.0.2 for discussion).
Finally, core-images are represented as an average of their patch features (again, ignoring patches which are over 90% background). Patients are then represented by an average of their cores. Acknowledgements provides a graphical representation of this process. Each CNN feature is first mean centered then scaled by its standard deviation resulting in the image feature data matrix .
A key challenge for doing statistical inference on populations of images with deep learning is interpretability. In section 3.1 we explain the novel, broadly applicable RPV method for interpretation of the visual signals captured by neural network features. In Section 3.2 we give an overview of the AJIVE procedure. Section 3.3 describes the pathology review process for interpreting the AJIVE image modes of variation.
3.1 Representative patch views
The RPV assumes that images777In the CBCS study each subject has a number of core-images and subjects are represented as an average of their images. For exposition purposes we pretend each subject has one image in this section, however, the extension to the multi-image case is clear. are represented via the patch based approach described in Sections 2.2 and 2.3. In other words, images are broken into a collection of patches; CNN features are extracted for each patch then images are represented as an average of their patch features. Suppose we compute a loadings vector of image features (e.g. the first image PCA component). The RPV illustrates one end of a loadings vector (e.g. the positive end of PC 1). For each end of one component, we select the top 15 subjects (e.g. the subjects with the 15 most positive PC scores) then create the RPVs for each of these subjects.
Figure 8 shows an example RPV displaying one subject for the negative end of AJIVE joint component 1. The leftmost column shows the four cores for this subject. The rightmost five columns show the top 20 patches for the negative end of the first joint component from the patch based localization approach described below in Section 3.1.1. The second column shows the location on each core of the top 20 patches. The RPVs are multi-scale in the sense that they give insights at both the core level and the patch level.
3.1.1 Patch based localization
Here we describe the general approach used to select the representative patches for the RPVs described in the above section. Patch based localization is accomplished by projecting patches onto the loadings vector. Main ideas are illustrated below in the context of PCA but this approach is quite general.
Let be the number of subjects (images) in the dataset, be the number of patches for the th image, be the th patch for the th subject, the feature extraction network outputting features. Also let be the features for patch , be the average patch features for the th subject, be the image feature dataset, and be after processing (e.g. centering and scaling). Let be a loadings vector and the scores vector888We assume that the scores are the projection of the data onto the loadings vector. computed from (e.g. PC component 1).
Consider the positive end of this component and let be the index of a particular image. We perform patch based localization999We use the term localization because this method helps identify which regions in the image are playing an important role in the given mode of variation. by projecting every patch of subject onto . In detail, let be the features of the th patch for subject after the processing. Let be the scores of this patch for . Now let be the indices of the patches with the 20 most positive patch scores (i.e. ). We call these the representative patches for the positive end of this component.
Principal components analysis can be viewed as finding important modes of variation in the data. If we used “interpretable” features, then inspecting the top entries of the loadings vector, , would shed light on what this given mode of variation is about. Unfortunately, for uninterpretable features such as abstract neural features, inspection of a loadings vector does not provide much interpretable information. Note this visualization procedure can be applied to statistical algorithms other than PCA; it simply requires an algorithm that outputs loadings vectors of neural features.
3.2 Angle-based joint and individual variation explained
Angle based joint and individual variation explained (AJIVE) is a statistical feature extraction/dimensionality reduction algorithm for multi-block data (feng2018angle). The goal of AJIVE is to find joint signals, if any exist, which are common to all data blocks as well as individual signals which are specific to each block, if they exist. Here we give a brief overview of AJIVE for two data blocks.
Consider two data blocks , on the same set of observations. AJIVE estimates what variation is joint to both data blocks as well as what variation is individual to each block. In particular, each matrix is decomposed into a sum of joint, individual, and error terms,
while imposing the following constraints
All subspaces live in where is the number of observations. The two joint matrices span the same joint subspace, . The two individual matrices span subspaces which are orthogonal to the joint subspace. We refer to the rank of the joint subspace as the joint rank, , and the rank of the two individual subspaces as the and individual ranks, .
The mechanics of AJIVE are outlined below for the case of data blocks101010The original paper describes the procedure some what differently, but this description is equivalent.. The properties of the common normalized scores discussed below follow from the fact they are the subspace flag mean of the PCA scores subspaces draper2014flag. We use a different estimate of the block common loadings, than in the original paper. One of the key statistical procedures in AJIVE is to estimate the joint rank111111This is accomplished by estimating which principal angles between and are smaller than random in an appropriate sense. which is achieved using the Wedin bound and the random direction bound detailed in (feng2018angle).
Initial signal extraction: Estimate low rank PCAs of with ranks (e.g. selected by inspecting the PCA scree plots). Denote this initial PCA of by where , . Similarly for .
Signal space extraction: Perform CCA on the PCA scores, . Using the random direction bound and the Wedin bound estimate the CCA rank, . Let be the matrices whose columns are the x/y CCA scores with unit norm. Let be the matrices whose columns are the CCA x/y loadings. Let be the common normalized scores which have the property of being proportional to the average of the CCA scores. In other words, the th column of is unit norm and is proportional to the average of the j columns of and . Additionally the common normalized scores are orthonormal i.e. . Finally, let121212Note the th column of is equivalent to the rank principal components regression coefficient of the th column of the common normalized scores, , regressed on the matrix. be the x-common loadings. Similarly for .
Signal space extraction: Let be the estimated joint matrix. Let . Let
be the number of singular values ofabove the threshold discussed in Section 2.4 of feng2018angle and let be the rank SVD approximation of . We denote the PCA of the individual matrix by , , which is also of interest. Similarly for .
The outputs of interest in this paper are the following
The joint rank, .
The common normalized scores, .
The common loadings131313These were not given names in the feng2018angle and were computed slightly differently., .
The and , which are referred to as the block specific, individual scores and loadings. Similarly for .
The common loadings, are different than those in (feng2018angle). The loadings computed here are the loadings such that i.e. the average of the resulting scores are proportional to the common normalized scores. Computing the loadings in this way ensures that they incorporate joint information only.
It can be checked that the random direction bound is equivalent to the classical Roy’s largest root test CCA rank selection method (johnstone2008multivariate).
3.2.1 AJIVE analysis of CBCS data
The only variables used in the AJIVE analysis are the CNN image features and expressions for 50 genes from PAM50; the other variables are used to interpret the AJIVE results. The initial signal ranks are 81 (image features) and 30 (genes) and were selected by inspection of the difference of the log-singular values and airing on the side of picking too high a rank. AJIVE estimates a joint rank of 7, image individual rank of 76 and genetic individual rank of 25. The AJIVE diagnostic plot, detailed in feng2018angle, is provided in A.2.
3.3 Pathology review of images
In close collaboration with pathologists (B.C. and J.G.), we reviewed the first three joint and image individual components at two levels of granularity. Tables 1 and 3 summarize the pathologists’ observations of these components. These observations are key to understanding the connections between the pathology and the genetics.
In the first approach, which we refer to as global sort, all core images are reviewed in sequence after sorting by the patient scores. Joint components are sorted by common normalized scores, , and individual components are sorted by block specific scores, (see Section 3.2). After sorting, the images are reviewed in sequence (e.g. from the negative to the positive end) to explore the visual signals captured by a given component. The benefits of the global sort method are i) a large number of images are inspected ii) we get a sense of the high level changes141414In a preliminary analysis where image patches with a large amount of background were not excluded (see Section 2.2), the global sort method on the first few principal components revealed that the primary modes of variation in the data are driven by the raw amount of background. This motivated the exclusion of patches with too much background. as we move along a component from the extreme negative to the middle then to the extreme positive end and iii) we can see if the trends found in the RPVs (see next paragraph) hold broadly for the entire component. The downsides of this method are that it is time intensive and does not provide explicit information about what visual signals are important in a given image. The H&E images are quite large and complex and finding patterns across a set of images is challenging.
The RPV approach developed in Section 3.1 extracts more fine-grained information at the patient level. The RPVs of the 15 most negative and 15 most positive subjects are inspected for each component. The RPVs are created with the common loadings for the joint components and the block specific individual loadings (see Section 3.2). The number 15 was selected to balance showing “enough” information without taking too much time. The RPVs have the benefit of highlighting a more focused set of visual patterns.
Tables 1 and 3 display the pathologist’s observations based on the RPVs at each end of each component. Each column summarizes the pathologist’s impression of a clinically relevant histological feature. The homogeneous column indicates whether or not there appeared to be a consistent pattern across the reviewed RPVs. The global sort review shows these trends hold for more than just the 15 most extreme images.
This section discusses the results for the joint AJIVE components (Section 4.1), the image individual (Section 4.2) and genetic individual (Section 4.3). For the sake of time – both the readers’ and the pathologists’ – we focus on the top 3 components from each of the joint, image individual and genetic individual.
The pathology review of the images from the joint and image individual components is described in Section 3.3. While the pathologist reviewed the full RPVs (Figure 8), only mini-RPVs (e.g. Figure 8(a)) displaying 8 patches are shown below. The full RPVs shown to the pathologists, all AJIVE genetic loadings, and all clinical data comparisons are provided in Acknowledgements. The methodology for clinical data comparisons (e.g. multiple testing control) is discussed in Section A.3.
4.1 Joint image and genetic information
|Component||ER status||Clinical HER2 status|
|2||0.752 ()||0.617 ()|
4.1.1 Joint component 2
From the pathology perspective, the tumors on the negative end of joint component 2 show mostly collagenous stroma surrounded by moderate nuclear grade tumor cells. Figure 8(a) shows the mini-RPVs of two subjects from the negative end of the second joint component. The positive end of this component was not homogeneous (Table 1).
From the genetics perspective, the negative end of joint component 2 picks out the Luminal B subtype (Figure 9(b)). Looking at the PAM50 loadings vector, ESR1, SLC39A6 are the two most negative genes in the PAM50 loadings (Figure 9(a)) and are known to be high in clinically ER+ cancers (parker2009supervised). The Luminal B observations cluster on the negative end of this direction and are statistical significantly separated from the other PAM50 subtypes with AUC scores of: Basal = 0.905, HER2 = 0.933, Luminal A = 0.760, Normal = 0.950 (Figure 9(b)).
From the immunohistochemical perspective, the second joint component is moderately related to ER status while weakly related to clinical HER2 status (Table 2). Clinical ER positive tumors cluster on the negative end of this component with an AUC of 0.752.
The pathology perspective of this second joint component appears to pick up on morphological features of Luminal B tumors i.e. intratumoral channels of stromal cells which are surrounded by moderate nuclear grade cancer cells. To our knowledge, little is known about the histological features of Luminal B tumors. Interestingly, beck2011systematic used image analysis approaches to demonstrate connections between certain stroma morphological features and patient survival. Pathologists do not currently use stromal features in the diagnosis and classification of tumors. However, tumor stroma and microenvironment (eiro2019breast) and the stromal features of benign and tumor-adjacent normal tissue (roman2012gene; chollet2018stroma) are areas of active investigation. Recent studies using CNNs have shown that breast biopsies may be accurately classified as malignant solely based on stromal features (bejnordi2018using).
4.1.2 Joint component 3
From the genetics perspective, the negative end of joint component 3 picks out molecular HER2. The HER2 observations are separated from the other PAM50 subtypes with AUC scores of: Basal = 0.947, Luminal A = 0.940, Luminal B= 0.833, Normal = 0.950. Interestingly, ERBB2 and EGFR have large negative values in the joint loadings vector while GRB7, which is on the same amplicon as ERBB2, is almost 0 (Figure 10(a)). The negative end of this component is also moderately related to clinical HER2 status with an AUC of 0.777 (Table 2). This component is identifying not only clinical HER2 samples (as determined by IHC staining) but more strongly the molecular HER2 subtype of samples (as determined by gene expression). Previous work (cancer2012comprehensive) has shown both gene expression and protein and phosphoprotein levels of ERBB2 and EGFR are significantly enriched in clinically HER2 samples that are also the molecular HER2 subtype compared to clinical HER2 samples that are Luminal subtypes. This is consistent with the separations we see in joint component 3.
From the pathology perspective, the negative end joint component 3 again shows collagenous stroma, but this time surrounded by high nuclear grade tumor cells (Figure 11(a)). Recall joint component 2 was similar but with moderate grade tumor cells. This third joint component appears to be picking up on morphological features of molecular HER2 tumors. Similar to joint component 2, it is interesting that the stroma appear to play an important role in this component.
4.2 Image individual information
As mentioned in Section 2.1, the PAM50 genes were selected to emphasize genes expressed in tumor epithelium, not genes highly expressed in tumor microenvironment features such as fat cells, collagenous stroma, and in some cases mucin. Several of these microenvironment features have clear visual signals (e.g. high fat content images have round, clear adipose cells) and show up prominently in the AJIVE individual components.
4.2.1 Image individual component 1
All of the images on the positive end of the first image individual component shows a very clear theme of tumors with high fat content (Figure 12(a)). High fat content is a strong visual signal so it makes sense that it shows up as an early individual mode of variation for image data. The negative end of the first image individual component shows tumors with low tumor cellularity and low/moderate grade nuclei (Table 3 and Figure 13(a)).
4.2.2 Image individual component 2
The negative end of image individual component 2 clearly picks out mucinous carcinoma tumors (Figure 14(a)). Mucinous carcinomas are characterized by tumor cells floating in pools of mucin. These cancers presents a very clear visual pattern of dark purple tumor cells surrounded by wispy looking mucin. Mucinous carcinoma is a rare histological subtype which the PAM50 genes perou2000molecular are not designed to identify.
Mucinous carcinomas are typically low-grade, hormone receptor-positive, have a good prognosis and appear to be genetically different from invasive ductal carcinomas of no special type (diab1999tumor; di2008retrospective; weigelt2009mucinous; lacroix2010mucinous). Mucinous carcinomas are usually genetically Luminal-type (typically Luminal A) (colleoni2011outcome; caldarella2013invasive; weigelt2009mucinous). All of the top 15 tumors on the negative end of this component are genetically Luminal (12 are Luminal A and 3 are Luminal B). Interestingly, neither the Luminal A nor B classes are strongly associated with the individual scores for this component overall; none of the difference in distribution tests (Section A.3) for Luminal A vs. another class were statistically significant (similarly for Luminal B). This is consistent with variation appearing in an image individual component.
The positive end of individual component 2 picks out images with moderate cellularity and collagenous stroma surrounded by moderate nuclear grade tumor cells.
The positive and negative ends of image individual component 2 show contrasting histologic features. The patches from the tumors on the positive end are entirely filled with a combination of tumor cell aggregates separated by areas of dense collagenous stroma. Adipocytic stroma is absent and the only optically clear space is in areas of retraction artifact where tumor cell groups appear to be pulled away from adjacent stroma (a known artifact of histologic preparation in some invasive tumors). The patches from the negative end show extracellular mucin from mucinous carcinomas with low or no tumor cellularity and just a few wispy bands of stromal collagen. The contrasting histology raises the possibility that this component may separate tumors based on one or more of the following features: tumor cellularity, tumor grade, extracellular stromal composition.
4.2.3 Image individual component 3
The negative end of image individual component 3 picks up on tumors whose patches contain a large amount of optically clear space. This includes tumors with: with high fat content (Figure 16(b)), where the cells discohesive (Figure 16(c)) and disrupted tissue sections (Figure 16(a)). Recall (Section 2.2) that patches with too much background (over 90%) are removed. Therefore white space surrounding the tumors and large white spaces in the core are unlikely to influence the amount of white space in the patches representing the image. Some of the features seen in the images in Figure 17 16(a) and 16(c) are likely related to technical variation in the tumor fixation/preservation and the quality of the histologic preparation. While the high fat content pattern seen in the positive end of the first image individual component is similar to this component (i.e. it picks up on large amounts of white space) the first component uniformly contains high fat content images in the top 15 images which is unlike this third component.
The positive end of this third component picks on images with a large amount of dense collagenous stroma (Figure 17(a)). This pattern is very clear in all 15 of the most positive subjects’ representative images views (see Acknowledgements). These tumors have lower tumor cellularity, moderate nuclear grade and have a moderate number of lymphocytes. Similar to the amount of white space, the dense collagenous stroma is a clear visual pattern.
4.3 Genetic individual information
Figure 18(a) shows the PAM50 loadings vector of the first genetic individual component. This component picks up on overall gene expression levels which is a common source of technical variation. Both the second and third genetic individual components show connections to the PAM50 subtypes based on the clinical data comparisons given in Acknowledgements albeit with weaker separations than the joint components.
The second genetic individual component identifies additional information which varies between Luminal A and Normal that is not dependent on cell proliferation and seems to be more related to features such as estrogen receptor signaling and keratin expression status. The scores for this second component separate Normal-like from Luminal A with an AUC of 0.801.
Figure 18(b) shows a scatter plot of the loadings vector from genetic individual component 2 compared to the Normal-Luminal A mean difference direction151515 The genes were first scaled by their standard deviation so this is the naive Bayes classification direction.
The genes were first scaled by their standard deviation so this is the naive Bayes classification direction.. Several of the genes on the top left of 18(b) (ESR1, FOXA1, PGR) are all part of the estrogen signaling pathway (oh2006estrogen). Several of the genes in the middle (CCNB1, MYC, MKI67, TYMS, MYBL2, CCNE1) are related to proliferation suggesting this component is unrelated to proliferation. Several of the genes in the bottom right (KRT5, KRT14, KRT17) are characteristic of normal myoepithelium as well as Basal-like like breast cancer (lazard1993expression).
This paper develops methods that, using deep learning and AJIVE give interpretable, simultaneous image and genetic results. Inferential and exploratory analysis leveraging deep learning is a promising area which presents many interesting, open questions – some of which are discussed below. These analytical tools enable simultaneous engagement from both the pathology and genetic communities which is critical for the fundamental biomedical interpretations.
Future research should evaluate whether the features learned in this paper can be reproducibly identified by pathologists and/or automated computer vision systems as well as whether these features can be validated in external test sets.
Scaling histological image analysis pipelines to gigapixel whole slide images (WSI) is an important future direction. In clinical practice, pathologists use WSIs; the core images used in this paper require additional preparation, are typically only available in some research settings and may ignore important tumor information (e.g. spatial heterogeneity across the tumor, particularly histological patterns not observed in the sampled region). Analyses of WSIs presents computational challenges as these images are orders of magnitude larger than the core images.
5.0.1 Patch representation
Patch based approaches have shown promise for predictive tasks using deep learning ilse2018attention. The patch based approach taken in this paper was selected because it i) will scale to whole slide images ii) can identify localized image information e.g. with the RPVs and iii) creates a smaller feature set161616Passing the full core images through the CNN resulted in features.. The approach of averaging of patch features ignores some within image heterogeneity. For image-only analysis, methods such as bishop1998hierarchical; backenroth2018modeling may be able to capture additional within-image heterogeneity. In the context of multi-view data, additional methodology needs to be developed to account for grouped observations (e.g. pourzanjani2017understanding).
5.0.2 Transfer learning
Training a neural network can be time and resource intensive. Furthermore, CNNs often require a large amount of training data to be fit effectively. Transfer learning allows the data analyst to use more powerful neural networks with less data and less time spent tuning CNN parameters (yosinski2014transferable; sharif2014cnn)
. First, a CNN is trained to solve a different predictive task on a large, external dataset – typically the famous ImageNet classification task(deng2009imagenet). Then the pre-trained network parameters may be fine-tuned on the dataset of interest to solve the predictive problem of interest.
The setting of this paper is a bit different. First, we are doing exploratory analysis, not predictive analysis. Second, while we do have labels which could be used to fine-tune the CNN (e.g. the PAM50 subtypes) we do not want to use these labels because then the network would be aware of information which we might want to (re)discover and/or validate in the following analysis. This leaves us with a couple of options to still use transfer learning including: training an unsupervised algorithm (e.g. auto-encoders kingma2013auto, generative adversarial networks goodfellow2014generative
or self-supervised learning algorithmsoord2018representation; lu2019semi) or not doing any fine tuning. ash2018joint trains auto-encoder which has some disadvantages: a CNN is required to be trained which is time/resource consuming, a number of new hyper-parameters are introduced into the problem and either the data are used twice or external data are needed. We explore the latter option and show, perhaps surprisingly, that pre-trained CNN features with no fine tuning are able to capture complex visual signals in a domain vastly different than the one they were originally trained on.
Even in the context of transfer learning, there are many choices to be made about how to extract neural network features from an image including: network architecture, layer (or layers) of the network, and feature aggregation (e.g. spatial mean pooling discussed in Section 2.3). For predictive modeling these hyper-parameters can be set using an error metric and methods such as cross-validation, however, as discussed in the above paragraph, we do not have such error metrics readily available to guide hyper-parameter choices. Preliminary sensitivity analysis showed that the results of our analysis are not particularly sensitive to mild differences in architecture choices. Better methods to select these CNN hyper-parameters is an open area of research.
Appendix A Appendix section
a.1 Common tissue structures
Below we give examples of some of the tissue structures which are relevant to this paper. Histopathology images are quite complex and pathologists are trained for years to interpret them.171717And pathologists don’t always agree with each other about their interpretations elmore2015diagnostic. For a more in-depth discussion of breast cancer pathology see rosen2001rosen; schnitt2009biopsy.
Figure 19(a) shows tumor infiltrating lymphocytes (TILs), the nuclei of which are basophilic (dark blue or purple) in conventional hematoxylin and eosin-stained (H&E) tissue sections. Hematoxylin quantitatively stains nucleic acids (DNA and RNA). Stromal TILs are more common in certain subtypes of breast cancer and may be associated with prognosis and response to treatment.
Figure 19(b) shows mostly high nuclear grade tumor cells (basophilic, dark blue or purple nuclei on H&E) as well as some stroma (eosinophilic or pink on H&E). Nuclear grade describes how abnormal the tumor cells look: “low grade” means the nuclei resemble those of normal cells and “high grade” means the nuclei are enlarged, hyperchromatic (more basophilic, darker blue/purple staining than normal nuclei), irregularly shaped and may contain multiple nucleoli. Nucleoli are small intranuclear organelles that contain DNA, RNA and protein and are responsible for the synthesis of ribosomes (ribosomes are the organelles that synthesize proteins).
Figure 19(c) shows collagenous stroma which is synthesized by fibroblasts and myofibroblasts and appears as eosinophilic (pink) fibrillar extracellular material on H&E. Collagenous stroma is the connective tissue that provides the scaffolding and support for epithelial structures. The nuclei (basophilic, dark blue or purple on H&E) of a few fibroblasts are visible in a collagenous stroma.
Figure 19(d) shows clusters of tumor cells separated by areas or eosinophilic (pink) collagenous stroma. On H&E-stained tissue sections, the tumor cell nuclei and their contents are basophilic (blue/purple) and the tumor cell cytoplasm varies from pale to eosinophilic (pink). This is a common histologic appearance of breast cancer as it invades into the stroma as aggregates or sheets of tumor cells.
The large white spaces in Figure 19(e) are the cytoplasm of adipocytes, cells that synthesize lipids (i.e, fat). The lipid-filled cytoplasm of adipocytes appears as these optically clear areas because the solvents used in the routine preparation of H&E tissue sections dissolve the lipids, leaving blank spaces where the lipids in the cytoplasm were. The ratio of fatty stoma to collagenous or fibrous stroma varies with age. Older patients will have more fatty stroma than younger patients. This age-related decrease in breast stromal density accounts for the increased accuracy of mammography in older patients.
Figure 19(f) shows extracellular mucin, a glycoprotein produced by epithelial cells which can be present in both normal and tumor tissue. Mucin appears almost clear or pale pink or blue in H&E-stained tissue sections. Invasive breast cancers with pure mucinous histology are often low-grade and are thought to have a better prognosis than invasive ductal carcinoma of no special type.
Figure 19(g) shows a normal duct in the lower left and low cellularity invasive carcinoma in the upper right part of the image. The benign cells contain nuclei that lack the enlargement, irregular shape and multiple nucleoli often seen in tumor cell nuclei. The benign cells have ample pale eosinophilic (pink) cytoplasm. The cells rest on a thin basement membrane which appears as a circumferential eosinophilic (pink) band around the periphery of the duct. The optically clear space in the middle of the duct is the lumen.
There are a couple of terms important to this paper which are similar but have different meanings. Clinical HER2 and molecular HER2 are two separate classifications used in breast cancer; the former is a immunohistochemical classification used in the clinic to determine clinical decision making while the latter is a genetic subtype. High nuclear grade refers to individual cancer cells; high tumor grade is based on a composite index including nuclear grade, tubule formation and mitotic activity. Collagenous stroma refers to (the pink) connective tissue; adipocytic stroma refers to (the white) fat cells.
a.2 AJIVE diagnostics
The initial signal ranks are 81 (image features) and 30 (genes). There were chosen by inspection of the the difference of the log-singular values and airing on the side of picking too high a rank.
Note the Wedin bound does not provide any value for these data. This is likely due to known conservativeness of the Wedin bound for non-square matrices. The random direction bound – which can be seen to be equivalent to the classical Roy’s latent root test for CCA rank selection – estimates the joint rank to be 7. This estimated joint rank was fairly robust to moderate changes in the initial signal ranks. The image individual rank is estimated to be 76 and genetic individual rank is estimated to be 25. These are likely overestimates, however, we focus only on the first few individual components.
a.3 Clinical data interpretation methods
In addition to the H&E images and gene expression data, we have a variety of clinical variables which can be used to interpret the different AJIVE components (e.g. PAM50 subtype). We compare each clinical variable of interest with the AJIVE scores for each component (i.e. the common normalized scores, image individual scores and genetic individual scores).
For continuous variables (e.g. proliferation score) we create a scatter plot, report the Pearson correlation and use the standard t-test test to determine if the association is statistically significant. For example, Figure4(c) shows the first joint component (x-axis) compared to the proliferation scores (y-axis). The text in the top left reports the Pearson correlation and is bolded if the correlation is statistically significant (after correction for multiple testing).
For categorical variables we show a conditional histogram of the scores and report difference in distribution tests for each possible class comparison using the Mann Whitney U test. This test test is used because it looks for location differences and because its test statistic is equivalent to the AUC statistic which gives an interpretable measure of how well separated two classes are. For categorical variables with more than 2 classes (e.g. there are five PAM50 subtypes) we do all one-vs-one comparisons.
For example, Figure 4(b) shows the first joint component (x-axis) conditioned on PAM50 subtype. The legend lists the classes, number of subjects in each class, and the the type of test. For a given class, the legend lists the other classes which were statistically significantly separated (after multiple testing adjustment) and reports the test statistics (AUC score) in parentheses. The class name is bold if at least one other class is statistically significantly separated. For example, in Figure 4(b) molecular Her2 is statistically significantly separated from Basal (AUC = 0.876 ), Luminal A (AUC = 0.897) and Normal (AUC = 0.827), but not Luminal B.
We compare each of the joint and individual AJIVE components to 33 variables. Additionally, for each multi-class categorical variable we do all of the one vs. one tests (e.g. for 5 classes we do tests). Therefore adjustment for multiple testing is necessary to avoid spurious results. For each of the AJIVE joint, image individual and genetic individual components we use the Benjamini-Hochberg procedure benjamini1995controlling which is implemented in Statsmodels seabold2010statsmodels.
Some of the clinical variables (e.g. proliferation scores, PAM50 molecular subtype) compared to the AJIVE scores are derived directly from the PAM50 gene expression which were used in the AJIVE analysis. This raises issues related to post selection inference when we compute p-values for these comparisons. Because the focus of this paper is on exploratory analysis we leave these issues for follow up work.
All joint, image individual and genetic individual clinical data comparisons are shown provided in Acknowledgements.
We thank the Carolina Breast Cancer Study participants and staff. We also want to acknowledge Robert C. Millikan, founder of the Carolina Breast Cancer Study Phase 3. Research reported in this publication was supported by a Specialized Program of Research Excellence (SPORE) in breast cancer (P50 CA058223), an award from the Susan G. Komen Foundation (OGUNC1202), the North Carolina University Cancer Research Fund, and a Cancer Center Support Grant (P30 CA016086). Iain Carmichael and J. S. Marron were partially supported by NSF Grant IIS-1633074, BIG DATA 2016-2019. Iain Carmichael is currently supported by NSF MSPRF DMS-1902440. Katherine Hoadley was supported by Komen Career Catalyst Grant (CCR16376756).
Supplement A Supplementary results [url]https://marronwebfiles.sites.oasis.unc.edu/AJIVE-Hist-Gene/ Additional results and figures can be found in the zipped folder located at the above link (this file is large – approximately 1.5 Gb). These include all representative patch views shown to pathologists, all AJIVE genetic loadings vectors and all clinical data comparisons. See the file readme.txt for details.