1.1 Feature Learning
We consider three complementary feature learning algorithms. The first is the VAE, which learns a $K$-dimensional reduction of a dataset by optimizing a reconstruction objective. It posits a generative model $p_{\lambda}\left(x \mid z\right) p\left(z\right)$ of the data; $p\left(z\right)$ is a prior on latent features $z \in \mathbb{R}^{K}$ and $p_{\lambda}\left(x \mid z\right)$ is a likelihood parameterized by $\lambda$. The algorithm finds a pair $\left(\lambda, \varphi\right)$ maximizing the lower bound
$$\mathbb{E}_{q_{\varphi}\left(z \mid x\right)}\left[\log p_{\lambda}\left(x \mid z\right)\right] - D_{\mathrm{KL}}\left(q_{\varphi}\left(z \mid x\right) \,\|\, p\left(z\right)\right),$$
where $q_{\varphi}$ maps raw data examples to distributions in the latent space. This problem is nonconvex, and the solution is nondeterministic. There are many implementations of VAEs; our experiments follow Van Den Oord et al. (2017).
Second, we learn supervised features through a CNN. A CNN regressor optimizes an empirical estimate of $\mathbb{E}\left(y - \beta^{\top} h_{W}\left(x\right)\right)^{2}$ over $\beta$ and $W$. Here, $h_{W}$ transforms the raw input into the "final layer" features and is defined recursively according to
$$h_{W}^{\ell}\left(x\right) = \sigma\left(W_{\ell}\, h_{W}^{\ell - 1}\left(x\right)\right),$$
where $h_{W}^{0}\left(x\right) = x$ and the matrices $W_{\ell}$ are restricted to the set of convolutions. As in the VAE, this problem is solved through first-order optimization methods. Our implementation is the CBR architecture from Raghu et al. (2017).
Third, we use a random convolutional features (RCF) model Rahimi and Recht (2008). A random sample of $K$ training examples is selected; the $x_i$'s are assumed to be $C$-channel images of a common dimension. For each sampled example, a random patch, denoted $w_k$, is extracted. For any $C$-channel image $x$, the $k^{th}$ feature $z_k$ is found by convolving $x$ with $w_k$ and spatially averaging over activations. This model uses random training image patches as convolutional kernels, rather than learning them from scratch. The features $z = \left(z_1, \dots, z_K\right)$ are analogous to the final-layer features in the CNN.
To train an RCF, the training data are featurized into $z_i = \left(z_{i1}, \dots, z_{iK}\right)$ for each example $x_i$. Then, a ridge regression model is trained from the $z_i$ to the responses $y_i$, giving an estimate $\hat{\beta}$. For a new example $x^{*}$, the same image patches are used to form $z^{*}$, and predictions are made with $\hat{y}^{*} = \hat{\beta}^{\top} z^{*}$. This model does not require gradient-based training, and it can serve as a fast baseline.
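As a concrete illustration, the RCF pipeline can be sketched in a few lines of numpy. The patch count, patch size, ReLU nonlinearity, and ridge penalty below are illustrative assumptions for a toy example, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(images, n_patches, patch_size):
    """Sample random patches from random training images to use as kernels."""
    n, h, w, _ = images.shape
    idx = rng.integers(0, n, n_patches)
    rows = rng.integers(0, h - patch_size + 1, n_patches)
    cols = rng.integers(0, w - patch_size + 1, n_patches)
    return np.stack([
        images[i, r:r + patch_size, s:s + patch_size, :]
        for i, r, s in zip(idx, rows, cols)
    ])

def featurize(images, patches):
    """Convolve with each patch, apply a ReLU, and spatially average."""
    n, h, w, _ = images.shape
    k, p, _, _ = patches.shape
    out_h, out_w = h - p + 1, w - p + 1
    feats = np.zeros((n, k))
    for i in range(n):
        for j in range(k):
            acts = np.zeros((out_h, out_w))
            for dr in range(p):
                for ds in range(p):
                    acts += images[i, dr:dr + out_h, ds:ds + out_w, :] @ patches[j, dr, ds, :]
            feats[i, j] = np.maximum(acts, 0).mean()
    return feats

def ridge(z, y, lam=1.0):
    """Closed-form ridge regression from features to responses."""
    return np.linalg.solve(z.T @ z + lam * np.eye(z.shape[1]), z.T @ y)

# toy demonstration on small random images
images = rng.random((20, 8, 8, 3))
y = images.mean(axis=(1, 2, 3))
patches = extract_patches(images, n_patches=10, patch_size=3)
z = featurize(images, patches)
beta_hat = ridge(z, y)
```

Because the kernels are never updated, retraining an RCF only changes which patches are drawn, which is what makes it a convenient baseline for studying feature variability.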
1.2 Procrustes Analysis
Given centered $X, Y \in \mathbb{R}^{n \times K}$, the Procrustes problem finds a rotation solving
$$\min_{R \in \mathcal{O}\left(K\right)} \left\|X - Y R\right\|_{F}^{2},$$
where $\mathcal{O}\left(K\right)$ is the space of $K \times K$ orthonormal matrices. The solution can be shown to be $\hat{R} = U V^{\top}$ for $U$ and $V$ obtained by the SVD $Y^{\top} X = U \Sigma V^{\top}$ Friedman et al. (2001); Gower (1975). For matrices $X^{1}, \dots, X^{B}$, the generalized Procrustes problem finds rotations $R^{1}, \dots, R^{B}$ and mean $M$ solving
$$\min_{R^{b} \in \mathcal{O}\left(K\right),\, M} \sum_{b = 1}^{B} \left\|X^{b} R^{b} - M\right\|_{F}^{2}.$$
While there is no closed-form solution, the optimization can be solved by cyclically updating each $R^{b}$ via a standard Procrustes problem and then updating $M \leftarrow \frac{1}{B} \sum_{b = 1}^{B} X^{b} R^{b}$ Friedman et al. (2001).
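Both problems above can be sketched compactly in numpy. The SVD-based rotation is the standard closed-form solution; the generalized version alternates between the rotations and the mean configuration, with an iteration count chosen arbitrarily here.

```python
import numpy as np

def procrustes_rotation(x, y):
    """Orthonormal R minimizing ||x - y @ R||_F, i.e. R = U V^T from svd(y^T x)."""
    u, _, vt = np.linalg.svd(y.T @ x)
    return u @ vt

def generalized_procrustes(mats, n_iter=20):
    """Alternately rotate each configuration toward the running mean."""
    mats = [m.copy() for m in mats]
    for _ in range(n_iter):
        mean = np.mean(mats, axis=0)
        mats = [m @ procrustes_rotation(mean, m) for m in mats]
    return mats, np.mean(mats, axis=0)

# a rotated copy of a centered configuration is recovered exactly
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
x -= x.mean(axis=0)                      # Procrustes assumes centered inputs
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
aligned = (x @ q) @ procrustes_rotation(x, x @ q)
aligned_pair, consensus = generalized_procrustes([x @ q, x])
```

In the sketch, the two rotated copies collapse onto a common configuration, which is exactly the property the bootstrap procedures below rely on.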
1.3 PCA and the Bootstrap
Several approaches are available for bootstrapping PCA. The total bootstrap computes principal planes by applying PCA to resampled versions of the data Chateau and Lebart (1996). For each replication, rows are sampled with replacement, viewed as draws from a larger population. The principal axes of different replications may be reflected or swapped with one another, so the associated sample coordinates are not directly comparable and must first be aligned. This is often accomplished through a Procrustes or conjoint analysis Elguero and Holmes-Junca (1988). In either case, the cloud of points associated with each sample in the resulting reference space is used to form a confidence region for that sample.
In contrast, fixed-effects PCA views the rows of the data matrix as the entire population of interest Josse et al. (2016). The source of randomness in this case is measurement noise around a low-rank model, not sampling from a larger population of rows. By fitting a measurement noise model and resampling residuals, a parametric bootstrap provides confidence regions for the true latent coordinates in the low-rank model. We also note that Bayesian factor analysis approaches sample from the posterior of latent coordinates Ren et al. (2017, 2020). Like the fixed-effects PCA model, these approaches specify an explicit low-rank model with measurement noise. Like the total bootstrap, they allow the underlying factors to be swapped, so alignment is necessary.
This section describes ways to adapt the bootstrap approaches above to the feature learning context. Our raw data are samples $x_{1}, \dots, x_{n} \in \mathcal{X}$, where $\mathcal{X}$ is the raw data domain, e.g., images, text sentences, or audio signals. A corresponding set of responses $y_{1}, \dots, y_{n}$ may be available. The full data are $\left(x_{i}, y_{i}\right)_{i = 1}^{n}$. A feature learner is a parameterized mapping $T\left(\cdot\,; \theta\right)$ taking data from $\mathcal{X}$ and representing it in $\mathbb{R}^{K}$.
For example, in a text data application, we expect the learner to transform a set of raw word sequences into a vector of features reflecting the topic of each document. The parameter $\theta$ is estimated from data, typically through an optimization
$$\hat{\theta} \in \arg\min_{\theta} \mathcal{L}\left(\theta; \left(x_{i}, y_{i}\right)_{i = 1}^{n}\right)$$
for some loss $\mathcal{L}$. In an unsupervised feature learner, candidate losses are functions of the $x_{i}$ alone. For a supervised feature learner, the class includes functions of both $x_{i}$ and $y_{i}$. To simplify notation, we will write $z_{i} = T\left(x_{i}; \hat{\theta}\right)$ to denote the learned features for observation $i$.
A challenge is that the learned features are not the same from one run to the next; a learned feature from run 1 need not have any relationship with the corresponding feature from run 2. This is a consequence of using stochastic optimization to solve equation 1. However, even if there is no direct correspondence across runs, the features may all reflect the same underlying latent structure. In particular, projections from dimensionality reductions of the two feature sets may be similar, after applying an appropriate alignment.
Suppose that the data have been split into a feature learning set, indexed by $I$, and an inference set, indexed by $I^{c}$. The fraction of samples used for feature learning is a hyperparameter whose influence is empirically studied below. The learning set is resampled $B$ times, leading to $B$ different feature extractors $T\left(\cdot\,; \hat{\theta}^{b}\right)$, which can then be applied to the full dataset, yielding learned features $z_{i}^{b} = T\left(x_{i}; \hat{\theta}^{b}\right)$.
2.1 Nonparametric Bootstrap
Like the total bootstrap, one approach to comparing embeddings across feature extractors is to perform a dimensionality reduction on each and then align the projections. For each $b = 1, \dots, B$, compute a singular value decomposition
$$Z_{I^{c}}^{b} = U^{b} D^{b} V^{b \top},$$
where the index $I^{c}$ means that only features associated with the inference set are used. Define coordinates for sample $i$ with respect to the top $K'$ right singular vectors using $u_{i}^{b} = V_{1:K'}^{b \top} z_{i}^{b}$. These can be stacked into a matrix $\tilde{U}^{b} \in \mathbb{R}^{\left|I^{c}\right| \times K'}$.
A Procrustes analysis applied to $\tilde{U}^{1}, \dots, \tilde{U}^{B}$ learns a series of rotation matrices aligning the projections. For each sample $i$, compute a mean and covariance matrix based on the $B$ aligned vectors $R^{b \top} u_{i}^{b}$. These are used to create level $1 - \alpha$ confidence areas for each inference sample in the $K'$-dimensional projection space. This approach plugs in an estimate for a "true" low-dimensional configuration, assuming that this representation is noisily observed and then subject to arbitrary rotations. Note that if the true latent representations are subject to more general transformations (e.g., translations) across runs of the extractor, this assumption may not be appropriate.
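The final step, from aligned per-sample projections to a confidence area, can be sketched as follows. We assume a normal-theory ellipse in two dimensions, where the chi-square quantile has the closed form $-2 \log \alpha$; the toy mean and covariance are invented for illustration.

```python
import numpy as np

def confidence_ellipse(points, alpha=0.05):
    """Normal-theory (1 - alpha) ellipse for one sample's aligned
    bootstrap projections; points has one row per bootstrap replicate."""
    mu = points.mean(axis=0)
    sigma = np.cov(points, rowvar=False)
    # for 2 degrees of freedom, the chi-square quantile is -2 log(alpha)
    radius2 = -2.0 * np.log(alpha)
    return mu, sigma, radius2

# toy check: empirical coverage of the ellipse on Gaussian replicates
rng = np.random.default_rng(0)
pts = rng.multivariate_normal([1.0, -2.0], [[0.3, 0.1], [0.1, 0.2]], size=500)
mu, sigma, r2 = confidence_ellipse(pts)
diff = pts - mu
mahal = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
coverage = (mahal <= r2).mean()   # should be close to the nominal 0.95
```

A point lies inside the ellipse exactly when its Mahalanobis distance from the mean is below the chi-square quantile, which is how the coverage check above is computed.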
The advantage of this approach is that it does not require a parametric model for simulating new versions of the learned features. The price to pay is that it is necessary to train $B$ feature extraction models, which can be a computational burden, even if it is parallelizable. Further, confidence areas are not computed for samples in the feature learning set $I$. However, if the uncertainty of sample-level projections is assumed to vary smoothly, then a heuristic is to consider the uncertainty of a sample in $I$ as comparable to those of its nearby samples in $I^{c}$.
2.2 Parametric Bootstrap
To avoid the computational complexity associated with training $B$ feature extractors, we consider a parametric bootstrap, which simulates new feature sets by resampling residuals from a fitted low-rank model, analogous to the fixed-effects PCA approach Josse et al. (2016). Suppose that variation across the $z_{i}$ is induced by latent features $u_{i}$. The feature learning process is modeled as
$$z_{ik} = u_{i}^{\top} v_{k} + \epsilon_{ik},$$
where $V$ stacks the $v_{k}$ and $\epsilon_{ik}$ is the $\left(i, k\right)$ element of a noise matrix $E$. Only $Z$ is available for predicting the response $y$.
To simulate new feature sets based on a single set of observed features $Z$, we resample rows of $Z$ in the inference set and compute the associated rank-$K'$ truncated SVD, $\hat{U} \hat{D} \hat{V}^{\top}$. Then we draw
$$Z^{b} = \hat{U} \hat{D} \hat{V}^{\top} P^{b} + E^{b},$$
where $E^{b}$ is obtained by resampling entries of the residual matrix $Z - \hat{U} \hat{D} \hat{V}^{\top}$ and $P^{b}$ is a random permutation matrix, reflecting the fact that coordinates of the feature extractor need not match from one run to the next. Alignment and confidence area construction then proceed as in section 2.1.
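One such parametric draw can be sketched in numpy. The rank, matrix dimensions, and noise level below are illustrative choices for a toy feature matrix, not the settings used later.

```python
import numpy as np

rng = np.random.default_rng(0)

def parametric_draw(z, rank):
    """Simulate one feature matrix: truncated SVD fit, resampled residual
    entries, and a random permutation of the latent coordinates."""
    u, d, vt = np.linalg.svd(z, full_matrices=False)
    z_hat = (u[:, :rank] * d[:rank]) @ vt[:rank]
    resid = (z - z_hat).ravel()
    e = rng.choice(resid, size=z.shape)          # resample residual entries
    perm = rng.permutation(z.shape[1])           # coordinate indeterminacy
    return z_hat[:, perm] + e

# toy feature matrix: rank-2 signal plus noise
n, k = 100, 10
signal = rng.normal(size=(n, 2)) @ rng.normal(size=(2, k))
z = signal + 0.1 * rng.normal(size=(n, k))
z_b = parametric_draw(z, rank=2)
```

Note that permuting the columns of the low-rank fit, but not the residuals, mirrors the model above, in which only the latent coordinates are subject to run-to-run relabeling.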
We adapt the mechanism above to simulate new feature sets in the case where we have more than one trained feature extractor $T\left(\cdot\,; \hat{\theta}^{b}\right)$, for $b = 1, \dots, B'$. We set $B'$ much smaller than $B$, so the feature learning phase is less costly than in section 2.1.
Begin by extracting features on resampled versions of the inference set, using the extractors $T\left(\cdot\,; \hat{\theta}^{1}\right), \dots, T\left(\cdot\,; \hat{\theta}^{B'}\right)$. Then compute their truncated, rank-$K'$ SVDs $\hat{U}^{b} \hat{D}^{b} \hat{V}^{b \top}$. New feature sets are simulated from
$$Z^{*} = \hat{U}^{b^{*}} \hat{D}^{b^{*}} \hat{V}^{b^{*} \top} P^{*} + E^{*},$$
where $b^{*}$ is drawn uniformly from $\left\{1, \dots, B'\right\}$ and $E^{*}$ resamples residual entries across all $B'$ fits. Given the resampled $Z^{*}$, we generate confidence regions as before.
We conduct two simulation studies. The first uses a low-rank model and permits calculation of coverage rates, but it is less representative of realistic feature learning settings. The second generates images using a spatial point process with variation reflecting a small set of latent parameters. The distributed feature learning associated with this setting prevents us from computing the coverage of confidence ellipsoids, but its complexity more accurately reflects practice.
3.1 Low-rank model
The first simulation generates samples using
$$X = U D V^{\top} + E, \qquad y = U \beta + \epsilon,$$
where $U \in \mathbb{R}^{n \times K}$ and $V \in \mathbb{R}^{D \times K}$ denote random orthonormal matrices. $X$ is a random rank-$K$ matrix observed with Gaussian noise $E$, and $y$ is a response depending on the latent coordinates of each row. We fix specific values of the dimensions, noise level, and coefficients $\beta$ across all runs. Note that this is a model of the data $X$, not the features $Z$, as in equation 2.
As a feature extractor, we use a randomly perturbed and permuted SVD-based estimate of the latent coordinates,
$$T\left(X; \hat{\theta}^{b}\right) = \left(\hat{U} \hat{D}\right)_{K} P^{b} + E^{b},$$
where the subscript $K$ denotes that only the top $K$ left singular vectors and values are used, $P^{b}$ is a random permutation matrix, and $E^{b}$ has independent Gaussian entries. The permutation and noise mimic the variation across retrained feature extractors. Given this source data and these feature extractors, we apply all three bootstrap methods, generating bootstrap replicates in each case. For the compromise approach, we train several separate feature extractors.
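This simulation design can be sketched as follows; the dimensions, singular values, and noise scales are illustrative assumptions rather than the values used to produce the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 50, 2

# rank-k signal observed with Gaussian noise, plus a latent-coordinate response
u, _ = np.linalg.qr(rng.normal(size=(n, k)))
v, _ = np.linalg.qr(rng.normal(size=(d, k)))
x = (u * np.array([10.0, 5.0])) @ v.T + 0.1 * rng.normal(size=(n, d))
y = u @ np.array([1.0, -1.0]) + 0.05 * rng.normal(size=n)

def extractor(x, k, noise=0.05):
    """Top-k SVD coordinates, randomly permuted and perturbed,
    mimicking variation across retrained feature extractors."""
    uu, dd, _ = np.linalg.svd(x, full_matrices=False)
    coords = uu[:, :k] * dd[:k]
    return coords[:, rng.permutation(k)] + noise * rng.normal(size=(x.shape[0], k))

z1, z2 = extractor(x, k), extractor(x, k)   # two "runs" of feature learning
```

Because the permutation and noise are the only sources of run-to-run variation here, the true latent coordinates are recoverable up to alignment, which is what makes coverage rates computable in this setting.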
The resulting 95% confidence ellipses are shown in Figure 2. Qualitatively, the parametric and nonparametric bootstrap approaches provide similar output. A gradient across the sample colors reflects accurate estimation of the true latent factors. We have Procrustes-aligned the true coordinates (squares) with the bootstrap replicates, and the fact that most squares are contained within ellipses suggests that the bootstrap accurately reflects uncertainty in the estimated projections. In fact, for the parametric and nonparametric approaches, the empirical coverage rates of these ellipses are 96.4% and 95.2%, respectively. On the other hand, the compromise approach appears to be overly conservative, with a coverage of 99.9%. This behavior arises in the remaining simulations and data analysis as well.
3.2 Spatial point process
In this simulation, we generate a collection of images using a point process whose parameters vary from image to image. Intuitively, each image represents cells viewed through a microscope, and different latent parameters influence the cell ecosystem. A single response value is associated with these latent parameters. Example images for varying parameter values are given in Figure 3. We generate 10,000 of these RGB images.
Locations of cells are governed by an intensity function drawn from a two-dimensional marked Log Cox Matern Process (LCMP) Diggle et al. (2013). Recall that a Matern process is a Gaussian process with covariance function
$$C_{\nu, \rho}\left(\left\|x - x'\right\|\right) = \sigma^{2} \frac{2^{1 - \nu}}{\Gamma\left(\nu\right)} \left(\sqrt{2 \nu}\, \frac{\left\|x - x'\right\|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu}\, \frac{\left\|x - x'\right\|}{\rho}\right),$$
where $\rho$ acts like a bandwidth parameter and $\nu$ controls roughness.
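For reference, this covariance can be evaluated directly with scipy's modified Bessel function; the parameter defaults below are arbitrary, and the check uses the known reduction to the exponential kernel at $\nu = 1/2$.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(r, sigma2=1.0, rho=1.0, nu=0.5):
    """Matern covariance at distances r; rho is the bandwidth, nu the roughness."""
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2 * nu) * r / rho
    out = np.full_like(r, sigma2)          # limiting value at distance zero
    pos = scaled > 0
    out[pos] = (
        sigma2 * 2 ** (1 - nu) / gamma(nu)
        * scaled[pos] ** nu * kv(nu, scaled[pos])
    )
    return out

# at nu = 1/2, the Matern kernel reduces to the exponential kernel exp(-r/rho)
r = np.array([0.0, 0.5, 1.0, 2.0])
vals = matern_cov(r, nu=0.5)
```

Larger $\nu$ yields smoother sample paths, so in the simulation it controls how gradually the cell intensity varies across the image.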
Our LCMP has $R$ classes (cell types). It can be constructed as follows. First, a nonnegative process $\Lambda\left(x\right) = \exp\left(Z\left(x\right)\right)$ is simulated along the image grid, where $Z$ is a mean-zero Gaussian process with the covariance induced by equation 9. This is a baseline intensity that determines the location of cells, regardless of cell type. $R$ further processes are then simulated, $\Lambda_{r}\left(x\right) = \exp\left(\beta_{r} + Z_{r}\left(x\right)\right)$. These processes reflect the relative frequencies of the $R$ classes at any location $x$; the intercept $\beta_{r}$ makes a class either more or less frequent across all positions $x$. Given these intensity functions, we can simulate cell locations by drawing from an inhomogeneous Poisson process with intensity $\Lambda\left(x\right)$. A cell at location $x$ is assigned cell type $r$ with probability proportional to $\exp\left(\Lambda_{r}\left(x\right) / \tau\right)$. Here we have introduced a temperature parameter $\tau$ controlling the degree of mixedness between cell types at a given location.
To complete the procedure for simulating images, we add two last sources of variation — the number of cells and cell size. The number of cells per image is drawn uniformly from 50 to 1000. The cells from each class are drawn with a random radius. A summary of all parameters used to generate each image is given in Supplementary Table 1. Each parameter is drawn uniformly within its range, which has been chosen to provide sufficient variation in image appearance. These parameters are the "true" underlying features associated with the simulated images; they give the most concise description of the variation observed across the images. The response is a hand-specified linear combination of these parameters; the reasoning behind the combination is discussed in the generate.Rmd script in the accompanying compendium (see Supplementary Section 6).
We study the influence of the following parameters,
Learning vs. inference split sizes. We vary the proportion of data used for learning and inference across a grid of split fractions.
Models trained. For feature extractors, we train CNN, VAE, and RCF models on the learning split.
Model complexity. We train VAEs whose hidden-layer dimensionality varies over a small grid. Similarly, we vary the number of first-layer convolutional filters in the CNN model. For the RCF, we use a larger range of numbers of random features; this increase reflects the fact that many random features must be considered before a subset of predictive ones is identified.
Inference strategy. We use the parametric, nonparametric, and compromise bootstrap strategies from section 2 to estimate confidence areas for the projections obtained by feature learners.
Figure 4 shows the activations of learned features across 2000 images for two perturbed versions of the training data, when 90% of the data are used for inference, at a fixed complexity for each of the CNN, VAE, and RCF. The learned features correspond to,
CNN: Activations from the final hidden layer of neurons, used directly as input for the regression.
VAE: Spatially-pooled activations from the middle, encoding layer of the variational autoencoder.
RCF: The spatially-pooled activations corresponding to each random convolutional feature.
Note that, across algorithms, there is no simple correspondence between learned and source features (i.e., parameters of the underlying simulation). Instead, there are clusters of learned features, each corresponding to a pattern across multiple source features. We also find subsets of features across all models that are only weakly correlated with any source feature. This has been referred to as distributed representation learning Hinton (1984); Le and Mikolov (2014).
Certain source features appear "easier" to represent than others, in the sense that more of the learned features are strongly correlated with them. Many features are correlated with the total number of cells in the image and with the size of the cells from Process 1. Depending on the model, the bandwidth, roughness, and prevalence parameters are either only weakly or not at all correlated with learned features. Even when features detect variation in bandwidth and roughness, they cannot disambiguate between these two parameters. Finally, the CNN and VAE features tend to be more clustered, with strong correlation across several source features. In contrast, the RCF features show more gradual shifts in correlation strength. They also show relatively little variation in correlation strength across source features other than cell count and cell size.
Example confidence areas across models and bootstrapping approaches are given in Figure 5a. In contrast to Figure 2 for the low-rank simulation, the areas from the nonparametric bootstrap are larger than those from the parametric bootstrap. This disagreement suggests that the proposed mechanism of equations 4 and 8 is insufficient for characterizing differences in learned features that arise between runs of more complex feature extractors; multiple runs must be used to account for randomness in algorithmic feature learning. As before, the compromise bootstrap has larger confidence areas than either the parametric or nonparametric approach on its own. In general, the RCF tends to have smaller confidence areas than the CNN and VAE.
Figure 5b shows confidence regions for a single model (CNN) and bootstrap procedure (parametric) across a range of model complexities and split proportions. For larger model complexities, projections are further from the origin, suggesting larger activations on average. The fraction of data used for feature learning does not appear to affect the strength of the association with the response or the size of the projection uncertainties. Corresponding figures for the other models are provided in Supplementary Section 7.
4 Data Analysis
We next analyze the spatial proteomics dataset reported in Keren et al. (2018), which found a relationship between the spatial organization of Triple Negative Breast Cancer (TNBC) tissues and disease progression. In a classical proteomics study, the expression levels for a set of proteins are measured for a collection of cells, but the cell locations are unknown. In contrast, these data provide, for each patient, (1) an image delineating cell boundaries and (2) the protein expression levels associated with each cell in the images.
We work only with the spatial cell delineations, not the protein expression levels. This allows us to study feature learning within the images without having to link expression and image data, which is itself a complex integration problem. The data are images, one for each of 41 patients. Each pixel has a value from 1 through 7, encoding which of 7 categories of tumor or immune cell types the pixel belongs to. To ensure that the cell types are treated as categorical, we transform pixels to their one-hot encodings, resulting in a collection of binary matrices, one per category.
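The one-hot transformation can be sketched directly; the seven channels come from the seven cell-type categories above, while the toy image is invented for illustration.

```python
import numpy as np

def one_hot(labels, n_classes=7):
    """Turn an integer-labeled image (values 1..n_classes) into a stack of
    binary masks, one channel per cell-type category."""
    return np.stack(
        [(labels == c).astype(np.uint8) for c in range(1, n_classes + 1)],
        axis=-1,
    )

img = np.array([[1, 2],
                [7, 1]])
enc = one_hot(img)        # shape (2, 2, 7), with exactly one 1 per pixel
```

Treating the channels as binary masks, rather than leaving the labels as integers, prevents the models from interpreting the arbitrary category codes as ordered intensities.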
To set up a prediction problem, we split each image into patches. These patches are our samples $x_{i}$. Patches from 32 of the patients are reserved for feature learning. Four among these 32 are used as a development split, to tune parameters of the feature learning algorithms. As a response variable, we use a patch-level summary $y_{i}$; these responses provide signal for the supervised feature learners. Example cell patches are shown in Figure 6. We fit the same models (CNN, VAE, and RCF) as discussed in section 3, varying model complexity over the same parameters as before.
As a baseline, we compare against a ridge regression with pixelwise composition features. We train a model with $y_{i}$ as the response and the average number of pixels belonging to each of the cell-type categories as the feature vector. This helps to determine whether the models have learned interesting features, like cell size and boundaries, rather than simply averaging across pixel values. Indeed, Figure 7 shows that, except in models with low capacity, performance is improved when learning features algorithmically.
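This baseline amounts to a ridge regression on per-class pixel proportions. The toy labels, penalty, and response below are invented for illustration and are not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def composition_features(encoded):
    """Per-class pixel proportions for one-hot-encoded patches (n, h, w, c)."""
    return encoded.mean(axis=(1, 2))

def ridge(z, y, lam=1e-4):
    """Closed-form ridge regression."""
    return np.linalg.solve(z.T @ z + lam * np.eye(z.shape[1]), z.T @ y)

# toy patches: random cell-type labels, response tied to class-1 prevalence
labels = rng.integers(1, 8, size=(100, 16, 16))
encoded = np.stack([(labels == c).astype(float) for c in range(1, 8)], axis=-1)
z = composition_features(encoded)          # each row sums to one
y = 3.0 * z[:, 0] + 0.01 * rng.normal(size=100)
beta_hat = ridge(z, y)                     # weight concentrates on class 1
```

Any model that cannot outperform this regression is, in effect, only measuring pixel composition, which is why it serves as a useful reference point in Figure 7.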
To characterize features and their uncertainty, we run each of the parametric, nonparametric, and compromise bootstrap strategies. In each case, the samples are generated from patches reserved for inference. Two-dimensional projections at fixed model complexities are given in Figure 8. As visible from the color gradients, all methods learn to differentiate between patches with small and large values of the response, even the VAE, which is unsupervised. Comparing rows, the RCF appears to give the most stable representations, while the coordinates for the VAE have larger confidence areas in general. For all models, some projections appear to be more uncertain than others. Moreover, certain axis directions tend to be more uncertain than others, reflected by the eccentricity of the ellipses. For example, viewing estimates from the nonparametric approach, the VAE projections have the highest uncertainty for low values of Dimension 2. Analogously, high values of Dimension 2 have high uncertainty in the RCF.
For the CNN and RCF, the three bootstrap approaches give qualitatively similar conclusions about regions with higher and lower uncertainty, though the average sizes of the confidence areas differ. The sizes of the confidence areas for the compromise approach in this case seem intermediate between those of the parametric and nonparametric approaches. For the VAE, the bootstrap approaches do not appear to agree. The compromise approach generally gives much larger confidence areas, potentially reflecting a failure of the Procrustes alignment in this case. Though this figure only displays one complexity for each model, we find few differences in projection uncertainty across models with different complexities; see Supplementary Figure 10.
Figure 9 overlays example patches onto the aligned coordinates. In the CNN, samples in the bottom right have a high fraction of immune cells, those on the top left are mostly tumor, and those at the top right have lower cell density. In light of the confidence regions in Figure 10, embeddings of immune-cell-rich images tend to show more variability across bootstraps. In the RCF, the first dimension similarly reflects the tumor vs. immune gradient. The lower uncertainty of projections along this axis suggests that this gradient is reliably captured across runs of the feature extractor. The lower right region of the VAE has high cell diversity, and the upper left has lower density. It seems that regions with higher cell diversity and larger proportions of immune cells also have more uncertain embeddings. From the top right to the bottom left, there is again a tumor-to-immune transition.
We have adapted existing approaches for evaluating the uncertainty of projections in dimensionality reduction for use in the context of algorithmically learned features. We have conducted an empirical study using simulations of varying complexity and a spatial proteomics data analysis problem, applying a representative suite of feature learning algorithms. We found that in more complex settings, a parametric bootstrap based on a single set of learned features does not reflect the degree of uncertainty present when comparing features derived from independently trained models.
Our results raise several questions for further study. It is natural to ask to what extent similar behaviors are exhibited across other data domains, model types, or training regimes. For example, it would not be unusual to represent the cell data in our case study using a marked graph linking neighboring cells. Do the features learned by a graph autoencoder have similar stability properties? More generally, are there classes of feature learning algorithms that all share behavior from the stability perspective? In other domains, we may ask whether our methods can be adapted to text or audio data.
Though the proposed bootstraps provide similar conclusions in the low-rank simulation, they differ in the point process simulation and the spatial proteomics data analysis. This suggests that the mechanisms in equations 4 and 8 may not reflect the behavior of repeated feature learning in more complex situations. The assumption that features can be rotated to align runs may also be problematic, since more general transformations across feature learning runs are plausible. For the CNN and RCF models in the data analysis example, we find that confidence areas for the compromise approach are intermediate between those for the parametric and nonparametric bootstraps. However, in both simulations, it tends to produce larger areas, and we do not have a full explanation for this behavior. Finally, though we have empirically found coverage rates for the parametric and nonparametric bootstraps to be acceptable in the low-rank simulation, we have not theoretically studied the properties of these procedures. Such study could clarify differences that arise in projection uncertainties between bootstrap and feature learning approaches.
The author thanks Susan Holmes, Karl Rohe, three reviewers, and the editor for feedback which improved the manuscript. Research was performed with assistance of the UW-Madison Center For High Throughput Computing (CHTC).
References

- Barter, R. L. and Yu, B. (2018). Superheat: an R package for creating beautiful and extendable heatmaps for visualizing complex data. Journal of Computational and Graphical Statistics 27 (4), pp. 910–922.
- Chateau, F. and Lebart, L. (1996). Assessing sample variability in the visualization techniques related to principal component analysis: bootstrap and alternative simulation methods. In COMPSTAT, pp. 205–210.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017). Double/debiased/Neyman machine learning of treatment effects. American Economic Review 107 (5), pp. 261–265.
- Diggle, P. J., Moraga, P., Rowlingson, B., and Taylor, B. M. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm. Statistical Science 28 (4), pp. 542–563.
- Elguero, E. and Holmes-Junca, S. (1988). Confidence regions for projection pursuit density estimates. In COMPSTAT, pp. 59–63.
- Erhan, D., Bengio, Y., Courville, A., and Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1.
- Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning. Vol. 1, Springer Series in Statistics, New York.
- Gower, J. C. (1975). Generalized Procrustes analysis. Psychometrika 40 (1), pp. 33–51.
- A selective approach to internal inference. arXiv preprint arXiv:1510.00486.
- Hinton, G. E. (1984). Distributed representations.
- Josse, J., Wager, S., and Husson, F. (2016). Confidence areas for fixed-effects PCA. Journal of Computational and Graphical Statistics 25 (1), pp. 28–48.
- Keren, L., et al. (2018). A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell 174 (6), pp. 1373–1387.
- Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196.
- Nguyen, A., Yosinski, J., and Clune, J. (2019). Understanding neural networks via feature visualization: a survey. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 55–76.
- Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. (2017). SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pp. 6076–6085.
- Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Advances in Neural Information Processing Systems 21, pp. 1313–1320.
- Ren, B., Bacallado, S., Favaro, S., Holmes, S., and Trippa, L. (2017). Bayesian nonparametric ordination for the analysis of microbial communities. Journal of the American Statistical Association 112 (520), pp. 1430–1442.
- Ren, B., et al. (2020). Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis. The Annals of Applied Statistics 14 (1), pp. 494–517.
- Van Den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
- Wang, S., McCormick, T. H., and Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences 117 (48), pp. 30266–30275.
We have prepared a research compendium at https://github.com/krisrs1128/LFBCR. Instructions for downloading raw and preprocessed data are given in the analysis/data/ subdirectory. The LFBCR compendium available in the repository above is also an R package and includes functions that can be used to apply the bootstrap and alignment methods more generally.
Rmarkdown files needed to reproduce the figures in Section 3 are given in the analysis/simulations subfolder. The analogous files for the spatial proteomics application are given in analysis/data_analysis. The learning subfolders at both of these locations can be used to retrain one instance of each of the feature learning algorithms. Further details are given in the README.md files in the repository.
7 Supplementary Tables and Figures
Supplementary Table 1: Parameters used to generate each image, with the coefficient each receives in the response.

| Parameter description | Coefficient in response |
| --- | --- |
| The total number of cells. | 0.5 |
| The roughness of the overall intensity process. | -0.5 |
| The bandwidth of the overall intensity process. | -0.5 |
| The intercept controlling the frequency of each class. | 1 or -1, depending on the class |
| The roughness of the relative intensity processes. | -0.5 |
| The bandwidth of the relative intensity processes. | -0.5 |
| The temperature used in cell type assignment. | 0.5 |
| The shape parameter controlling the sizes of each cell type. | 1 or 0, depending on the cell type |