
Bootstrap Confidence Regions for Learned Feature Embeddings

Algorithmic feature learners provide high-dimensional vector representations for non-matrix structured signals, like images, audio, text, and graphs. Low-dimensional projections derived from these representations can be used to explore variation across collections of these data. However, it is not clear how to assess the uncertainty associated with these projections. We adapt methods developed for bootstrapping principal components analysis to the setting where features are learned from non-matrix data. We empirically compare the derived confidence regions in simulations, varying factors that influence both feature learning and the bootstrap. Approaches are illustrated on spatial proteomic data. Code, data, and trained models are released as an R compendium.





1 Background

1.1 Feature Learning

We consider three complementary feature learning algorithms. The first is the variational autoencoder (VAE), which learns a $K$-dimensional reduction of a dataset by optimizing a reconstruction objective. It posits a generative model of the data: $p(z)$ is a prior on latent features and $p_\theta(x \mid z)$ is a likelihood parameterized by $\theta$. The algorithm finds a pair $(\theta, \varphi)$ maximizing the lower bound

$$\mathbb{E}_{q_\varphi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\varphi(z \mid x) \,\|\, p(z)\right),$$

where $q_\varphi$ maps raw data examples to distributions in the latent space. This problem is nonconvex, and the solution is non-deterministic. There are many implementations of VAEs; our experiments follow Van Den Oord et al. (2017).

Second, we learn supervised features through a convolutional neural network (CNN). A CNN regressor optimizes an empirical estimate of

$$\mathbb{E}\left[\left(y - \beta^\top f_\theta(x)\right)^2\right]$$

over $\beta$ and $\theta$. Here, $f_\theta$ transforms the raw input into the "final layer" features and is defined recursively through compositions of layers

$$f^l_{\theta_l}(x) = \sigma\left(W_l f^{l-1}_{\theta_{l-1}}(x)\right),$$

where $\sigma$ is a nonlinearity and the matrices $W_l$ are restricted to the set of convolutions. As in the VAE, this is solved through first-order optimization methods. Our implementation is the CBR architecture from Raghu et al. (2017).

Third, we use a random convolutional features (RCF) model Rahimi and Recht (2008). A random sample of $B$ training examples is selected; the $x_i$'s are assumed to be multi-channel images. For each sampled example, a random patch, denoted $w_b$, is extracted. For any image $x$ with the same number of channels, the $b$-th feature is found by convolving $x$ with $w_b$ and spatially averaging over activations. This model uses random training image patches as convolutional kernels, rather than learning them from scratch. The features are analogous to the final-layer features in the CNN.

To train an RCF, the training data are first featurized into a matrix $Z$ whose rows collect the $B$ patch features for each example. Then, a ridge regression model is trained from $Z$ to the responses $y_i$, giving an estimate $\hat{\beta}$. For a new example $x^*$, the same image patches are used to form its features $z(x^*)$, and predictions are made with $\hat{\beta}^\top z(x^*)$. This model does not require gradient-based training, and it can serve as a fast baseline.
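To make the featurization and ridge step concrete, here is a minimal numpy sketch. The function names and the tiny synthetic data are illustrative, not taken from the released compendium; a real implementation would use an optimized convolution routine.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(images, B, h=3, w=3):
    """Sample B random patches from random training images."""
    patches = []
    for _ in range(B):
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - h + 1)
        c = rng.integers(img.shape[1] - w + 1)
        patches.append(img[r:r + h, c:c + w])
    return np.stack(patches)

def rcf_featurize(images, patches):
    """Spatially averaged activation of each random patch on each image.

    images: (n, H, W, C); patches: (B, h, w, C). Uses a naive valid
    cross-correlation followed by spatial averaging.
    """
    n = images.shape[0]
    B, h, w, _ = patches.shape
    Z = np.zeros((n, B))
    for i, img in enumerate(images):
        H, W = img.shape[0], img.shape[1]
        for b, patch in enumerate(patches):
            acts = [
                np.sum(img[r:r + h, c:c + w] * patch)
                for r in range(H - h + 1)
                for c in range(W - w + 1)
            ]
            Z[i, b] = np.mean(acts)  # spatial average over activations
    return Z

# Tiny demonstration: featurize, then ridge-regress y on Z.
X = rng.normal(size=(20, 8, 8, 1))
y = rng.normal(size=20)
W_patch = extract_patches(X, B=16)
Z = rcf_featurize(X, W_patch)
lam = 1.0  # ridge penalty
beta_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
preds = Z @ beta_hat
```

At prediction time, the same stored patches `W_patch` featurize new images, so no gradient-based training is needed anywhere in the pipeline.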

1.2 Procrustes Analysis

Given centered matrices $X$ and $Y$, the Procrustes problem finds a rotation solving

$$\min_{R \in \mathcal{O}(k)} \left\|X - YR\right\|_F^2,$$

where $\mathcal{O}(k)$ is the space of $k \times k$ orthonormal matrices. The solution can be shown to be $R = UV^\top$ for $U$ and $V$ obtained by the SVD of $Y^\top X$ Friedman et al. (2001); Gower (1975). For matrices $X_1, \ldots, X_B$, the generalized Procrustes problem finds rotations $R_1, \ldots, R_B$ and a mean $M$ solving

$$\min_{R_1, \ldots, R_B \in \mathcal{O}(k),\, M} \sum_{b=1}^{B} \left\|X_b R_b - M\right\|_F^2.$$

While there is no closed-form solution, the optimization can be solved by cyclically updating each $R_b$ via standard Procrustes problems and then updating $M$ Friedman et al. (2001).
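Both problems can be sketched in a few lines of numpy: the ordinary rotation comes from one SVD, and the generalized problem alternates that step with a mean update. Function names here are illustrative.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Rotation R minimizing ||X - Y R||_F over orthonormal R.

    Closed form: if Y^T X has SVD U D V^T, then R = U V^T.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def generalized_procrustes(Xs, n_iter=20):
    """Cyclically align a list of matrices to their evolving mean."""
    Xs = [X.copy() for X in Xs]
    M = np.mean(Xs, axis=0)
    for _ in range(n_iter):
        for b in range(len(Xs)):
            Xs[b] = Xs[b] @ procrustes_rotation(M, Xs[b])
        M = np.mean(Xs, axis=0)  # update the mean configuration
    return Xs, M

# Sanity check: a rotated copy of X should be rotated back exactly.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthonormal matrix
Y = X @ Q
R = procrustes_rotation(X, Y)

# Generalized version: several rotated copies collapse onto one shape.
aligned, M = generalized_procrustes([X @ Q, X, Y @ Q])
```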

1.3 PCA and the Bootstrap

Several approaches are available for bootstrapping PCA. The total bootstrap computes principal planes by applying PCA to resampled versions of the data Chateau and Lebart (1996). For each replication, rows are sampled with replacement, viewed as draws from a larger population. The associated principal axes may be reflected or swapped with one another, so the associated sample coordinates are not directly comparable, and coordinates must be aligned. This is often accomplished through a Procrustes or conjoint analysis Elguero and Holmes-Junca (1988). In either case, the cloud of points associated with each sample in the resulting reference space is used to form a confidence region for it.

In contrast, fixed-effects PCA views the rows of the data matrix as the entire population of interest Josse et al. (2016). The source of randomness in this case is measurement noise around a low-rank model, not sampling from a larger population of rows. By fitting a measurement noise model and resampling residuals, a parametric bootstrap provides confidence regions for the true latent coordinates in the low-rank model. We also note Bayesian factor analysis approaches, which sample from the posterior of latent coordinates Ren et al. (2017, 2020). Like the fixed-effects PCA model, these approaches specify an explicit low-rank model with measurement noise. Like the total bootstrap, underlying factors may be swapped, and alignment is necessary.

2 Methods

This section describes ways to adapt the bootstrap approaches above to the feature learning context. Our raw data are samples $x_1, \ldots, x_n \in \mathcal{X}$, where $\mathcal{X}$ is the raw data domain, e.g., images, text sentences, or audio signals. A corresponding set of responses $y_1, \ldots, y_n$ may be available. The full data are $\mathcal{D} = \left\{(x_i, y_i)\right\}_{i=1}^{n}$. A feature learner is a parameterized mapping $T_\theta$ taking data from $\mathcal{X}$ and representing it in $\mathbb{R}^K$.

For example, in a text data application, we expect the learner to transform a set of raw word sequences into a vector of features reflecting the topic of the document. $T_\theta$ is estimated from data, typically through an optimization,

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell\left(x_i, y_i; T_\theta\right), \qquad (1)$$

for some loss $\ell$. In an unsupervised feature learner, candidate losses are functions of the $x_i$ alone. For a supervised feature learner, the class includes functions of both $x_i$ and $y_i$. To simplify notation, we will write $z_i = T_{\hat{\theta}}(x_i)$ to denote the learned features for observation $i$.

A challenge is that the learned features are not the same from one run to the next; the learned features from run 1 need not have any relationship with the features from run 2. This is a consequence of using stochastic optimization in equation 1. However, even if there is no direct correspondence across runs, they may all reflect the same underlying latent features. In particular, projections from dimensionality reductions of the two runs may be similar, after applying an appropriate alignment.

Suppose that data have been split into a feature learning set, indexed by $I_{\text{learn}}$, and an inference set, indexed by $I_{\text{infer}}$. The fraction of samples used for feature learning is a hyperparameter whose influence is empirically studied below. The learning set is resampled $B$ times, leading to $B$ different feature extractors $T_{\hat{\theta}^1}, \ldots, T_{\hat{\theta}^B}$, which can then be applied to the full dataset, yielding learned features $Z^b = \left[T_{\hat{\theta}^b}(x_1), \ldots, T_{\hat{\theta}^b}(x_n)\right]^\top$.

2.1 Nonparametric Bootstrap

Like the total bootstrap, one approach to comparing embeddings across feature extractors is to perform a dimensionality reduction on each and then align the projections. For each $b = 1, \ldots, B$, compute a singular value decomposition,

$$Z^b_{I_{\text{infer}}} = U^b D^b \left(V^b\right)^\top,$$

where the index $I_{\text{infer}}$ means that only features associated with the inference set are used. Define coordinates for sample $i$ with respect to the top $k$ right singular vectors using the $i$-th row of $U^b_k D^b_k$. These can be stacked into a matrix $\hat{\Lambda}^b$.

A Procrustes analysis applied to $\hat{\Lambda}^1, \ldots, \hat{\Lambda}^B$ learns a series of rotation matrices aligning the projections. For each sample $i$, compute a mean and covariance matrix based on the $B$ aligned coordinate vectors. These are used to create level $1 - \alpha$ confidence areas for each inference sample in the $k$-dimensional projection space. This approach plugs in an estimate for a "true" low-dimensional representation, assuming that this representation is noisily observed and then subject to arbitrary rotations. Note that if the true latent representations are subject to more general transformations (e.g., translations) across runs of the extractor, this assumption may not be appropriate.
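The alignment and per-sample summary steps can be sketched as follows, assuming each of the $B$ runs has already been reduced to $k$-dimensional coordinates. The synthetic stand-in data (a shared score matrix, randomly rotated and perturbed per run) and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def procrustes_rotation(X, Y):
    """Rotation R minimizing ||X - Y R||_F (SVD closed form)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Stand-in for B extractor runs: one latent score matrix, randomly
# rotated and perturbed per run, mimicking retrained extractors.
n, k, B = 40, 2, 30
truth = rng.normal(size=(n, k))
runs = []
for _ in range(B):
    Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
    runs.append((truth + 0.1 * rng.normal(size=(n, k))) @ Q)

# Align every run to the first, then summarize per-sample spread.
aligned = [runs[0]] + [L @ procrustes_rotation(runs[0], L) for L in runs[1:]]
stack = np.stack(aligned)                      # (B, n, k)
means = stack.mean(axis=0)                     # per-sample centers
covs = np.array([np.cov(stack[:, i, :].T) for i in range(n)])
# A 95% ellipse for sample i has axes from the eigendecomposition of
# covs[i], scaled by the sqrt of the chi-squared(k) 0.95 quantile.
```

The drop in per-sample spread after alignment is exactly what makes the pooled means and covariances interpretable as confidence regions.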

The advantage of this approach is that it does not require a parametric model for simulating new versions of the learned features. The price to pay is that it is necessary to train $B$ feature extraction models, which can be a computational burden, even if it is parallelizable. Further, confidence areas are not computed for samples in the feature learning set $I_{\text{learn}}$. However, if the uncertainty of sample-level projections is assumed to vary smoothly, then a heuristic is to consider the uncertainty of a sample in $I_{\text{learn}}$ as comparable to those of its nearby samples in $I_{\text{infer}}$.

Figure 1: A summary of the proposed bootstrap procedures. All begin by splitting data into training and inference sets, used for feature learning and confidence region construction, respectively. The nonparametric bootstrap (a) trains separate feature learners, each of which are used for feature extraction and dimensionality reduction before being aligned. The parametric bootstrap (b) trains a single feature learner and then simulates and aligns an ensemble of latent coordinates for each sample. The compromise (c) trains a smaller set of feature learners but further resamples residuals (like in the parametric bootstrap) to increase the number of apparent bootstrap replicates.

2.2 Parametric Bootstrap

To avoid the computational complexity associated with training $B$ feature extractors, we consider a parametric bootstrap, which simulates new feature sets by resampling residuals from a fitted low-rank model, analogous to the fixed-effects PCA approach Josse et al. (2016). Suppose that variation across the $z_i$ is induced by latent features $\lambda_i$. The feature learning process is modeled as,

$$z_i = V \lambda_i + \epsilon_i, \qquad (2)$$

where $\Lambda$ stacks the $\lambda_i$ and $\epsilon_{ij}$ is the $ij$-th element of the noise matrix $E$. Only $z_i$ is available for predicting the response $y_i$.

To simulate based on a single set of observed latent features, we resample rows of $Z$ in the inference set and compute the associated rank-$k$ truncated SVD, $\hat{U} \hat{D} \hat{V}^\top$. Then we draw,

$$Z^{b} = \left(\hat{U} \hat{D} \hat{V}^\top + E^b\right) P^b, \qquad (4)$$

where $E^b$ is obtained by resampling entries of the residual matrix $Z - \hat{U} \hat{D} \hat{V}^\top$ and $P^b$ is a random permutation matrix, reflecting the fact that coordinates of the feature extractor need not match from one run to the next. Alignment and confidence area construction then proceed as in section 2.1.
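A compact numpy sketch of the residual-resampling mechanism, under the assumption (mine, for illustration) that the learned features are a low-rank signal plus entrywise noise:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in learned features for the inference set: low-rank signal + noise.
n, p, k, B = 60, 10, 2, 25
Z = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) \
    + 0.05 * rng.normal(size=(n, p))

# Rank-k truncated SVD and its residual matrix.
U, D, Vt = np.linalg.svd(Z, full_matrices=False)
Z_hat = (U[:, :k] * D[:k]) @ Vt[:k]
resid = Z - Z_hat

# Each replicate: entrywise-resampled residuals plus a random column
# permutation, standing in for coordinate mismatch across runs.
reps = []
for _ in range(B):
    E = rng.choice(resid.ravel(), size=resid.shape)
    P = np.eye(p)[:, rng.permutation(p)]
    reps.append((Z_hat + E) @ P)
```

Each replicate in `reps` is then reduced and Procrustes-aligned exactly as in the nonparametric procedure, at the cost of training only one feature extractor.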

2.3 Compromise

We adapt the mechanism above to simulate in the case where we have more than one trained feature extractor $T_{\hat{\theta}^s}$, for $s = 1, \ldots, S$. Set $S < B$, so the feature learning phase is less costly than in section 2.1.

Begin by extracting features on resampled versions of the inference set, using the $S$ extractors. Then compute their truncated, rank-$k$ SVDs $\hat{U}^s \hat{D}^s \left(\hat{V}^s\right)^\top$. New feature sets are simulated from,

$$Z^{b} = \left(\hat{U}^{s} \hat{D}^{s} \left(\hat{V}^{s}\right)^\top + E^b\right) P^b, \qquad (8)$$

where $s$ is drawn uniformly from $\{1, \ldots, S\}$ and $E^b$ resamples residual entries across all $S$ extractors. Given the resampled $Z^b$, we generate confidence regions as before.

3 Simulations

We conduct two simulation studies. The first uses a low-rank model and permits calculation of coverage rates, but it is less representative of realistic feature learning settings. The second generates images using a spatial point process with variation reflecting a small set of latent parameters. The distributed feature learning associated with this setting prevents us from computing the coverage of confidence ellipsoids, but its complexity more accurately reflects practice.

3.1 Low-rank model

The first simulation generates samples using,

$$X = U D V^\top + E, \qquad y = U D \beta + \tau,$$

where $U$ and $V$ denote random orthonormal matrices with $k$ columns, so that $U D V^\top$ is a random rank-$k$ matrix observed with Gaussian noise $E$, and $y$ is a response depending on the latent coordinates of each row. The specific parameters we use are . Note that this is a model of the data $X$, not the features $Z$, as in equation 2.

As a feature extractor, we use a randomly perturbed and permuted SVD-based estimate of the latent coordinates,

$$T^b(X) = \left(\hat{U}_k \hat{D}_k + E^b\right) P^b,$$

where the subscript $k$ denotes that only the top $k$ left singular vectors and values are used, $E^b$ is Gaussian noise, and $P^b$ is a random permutation. The permutation and noise mimic the variation across retrained feature extractors. Given this source data and these feature extractors, we apply all three bootstrap methods, generating $B$ bootstrap replicates in each case. For the compromise approach, we train $S$ separate feature extractors.
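The perturbed-and-permuted SVD extractor can be sketched directly; the dimensions and noise scale below are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(4)

# Rank-k signal observed with Gaussian noise, as in the simulation model.
n, p, k = 100, 20, 2
U, _ = np.linalg.qr(rng.normal(size=(n, k)))
V, _ = np.linalg.qr(rng.normal(size=(p, k)))
D = np.diag([10.0, 5.0])
X = U @ D @ V.T + 0.1 * rng.normal(size=(n, p))

def perturbed_svd_extractor(X, k, noise=0.1):
    """Top-k SVD coordinates, plus Gaussian jitter and a random column
    permutation, mimicking variation across retrained extractors."""
    Uh, Dh, _ = np.linalg.svd(X, full_matrices=False)
    scores = Uh[:, :k] * Dh[:k]
    scores = scores + noise * rng.normal(size=scores.shape)
    return scores[:, rng.permutation(k)]

# Two "runs" of the extractor: same latent structure, scrambled axes.
Z1 = perturbed_svd_extractor(X, k)
Z2 = perturbed_svd_extractor(X, k)
```

Because the permutation is itself an orthonormal transformation, the Procrustes alignment of section 1.2 can undo it before confidence regions are formed.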

The resulting 95% confidence ellipses are shown in Figure 2. Qualitatively, the parametric and nonparametric bootstrap approaches provide similar output. A gradient across the colors reflects accurate estimation of the true latent factors. We have Procrustes-aligned the true coordinates (squares) with the bootstrap replicates, and the fact that most squares are contained within ellipses suggests that the bootstrap accurately reflects uncertainty in the estimated projections. In fact, for the parametric and nonparametric approaches, the empirical coverage rates of these ellipses are 96.4% and 95.2%, respectively. On the other hand, the compromise approach appears to be overly conservative, with a coverage of 99.9%. This behavior arises in the remaining simulations and data analysis as well.

Figure 2: Projections from the low-rank data simulation. Each ellipse gives the confidence area for the latent coordinates of one sample. Squares are the positions of the true low-rank coordinates after Procrustes rotating them to align with the centers of the ellipses. The confidence areas for the nonparametric bootstrap are smaller than those for the parametric bootstrap. Those for the compromise method are conservative.

3.2 Spatial point process

In this simulation, we generate a collection of images using a point process whose parameters vary from image to image. Intuitively, each image represents cells viewed through a microscope, and different latent parameters influence the cell ecosystem. A single response value is associated with these latent parameters. Example images for varying parameter values are given in Figure 3. We generate 10,000 of these RGB images.

3.2.1 Generation

Locations of cells are governed by an intensity function drawn from a two-dimensional marked Log Cox Matérn Process (LCMP) Diggle et al. (2013). Recall that a Matérn process is a Gaussian process with covariance function

$$C_{\nu}(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\frac{d}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2\nu}\,\frac{d}{\rho}\right), \qquad (9)$$

where $\rho$ acts like a bandwidth parameter, $\nu$ controls roughness, and $K_\nu$ denotes the modified Bessel function of the second kind.
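Equation 9 is the standard Matérn covariance; a direct transcription using scipy's Bessel function (assuming scipy is available) looks like this, with the $d = 0$ case handled separately since $K_\nu$ is singular at the origin:

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel, second kind

def matern_cov(d, rho=1.0, nu=1.5, sigma2=1.0):
    """Matern covariance at distance d; rho = bandwidth, nu = roughness."""
    d = np.asarray(d, dtype=float)
    scaled = np.sqrt(2 * nu) * d / rho
    out = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    return np.where(d == 0, sigma2, out)  # C(0) = sigma^2 by continuity
```

A quick check: for $\nu = 3/2$ the covariance reduces to the closed form $\sigma^2 (1 + \sqrt{3} d/\rho) \exp(-\sqrt{3} d/\rho)$, which the implementation reproduces.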

Our LCMP has $K$ classes (cell types). This can be constructed as follows. First, a nonnegative baseline process $\Lambda(x)$ is simulated along the image grid, with log-intensity drawn from a Gaussian process whose covariance matrix $\Sigma$ is induced by equation 9. This baseline intensity determines the location of cells, regardless of cell type. $K$ further processes $B_1(x), \ldots, B_K(x)$ are then simulated, each with its own intercept $\beta_k$. These processes reflect relative frequencies of the $K$ classes at any location $x$; the intercept $\beta_k$ makes a class either more or less frequent across all positions $x$. Given these intensity functions, we can simulate cell locations by drawing from an inhomogeneous Poisson process with intensity $\Lambda(x)$. For a cell at location $x$, we assign it cell type $k$ with probability proportional to $B_k(x)^{1/\tau}$. We have introduced a temperature $\tau$ controlling the degree of mixedness between cell types at a given location.
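The temperature mechanism can be read as a tempered softmax over the relative intensities; the exact normalization in the generator may differ, so the sketch below is one natural interpretation with illustrative values:

```python
import numpy as np

def class_probs(B_at_x, tau):
    """Class probabilities at a location from relative intensities B_k(x).

    Small tau approaches a hard argmax; large tau mixes the classes.
    """
    logits = np.log(B_at_x) / tau
    logits -= logits.max()           # numerical stability
    w = np.exp(logits)
    return w / w.sum()

B_at_x = np.array([0.5, 1.0, 2.0])   # relative intensities of 3 cell types
p_cold = class_probs(B_at_x, tau=0.1)   # nearly deterministic assignment
p_hot = class_probs(B_at_x, tau=10.0)   # classes thoroughly mixed
```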

Figure 3: Example images, for low (top), average (middle), and high (bottom) values of the varying parameter. For each sample, three relative intensity functions are generated, shown as greyscale heatmaps. Samples drawn from each process are overlaid as circles. The final images combine points across processes, removing the underlying intensity function, which is not available to the feature learner. Small values are associated with smoother, less structured intensity functions.

To complete the procedure for simulating images, we add two last sources of variation: the number of cells and the cell size. The number of cells per image is drawn uniformly from 50 to 1000. The cells from each class are drawn with a random radius. A summary of all parameters used to generate each image is given in Supplementary Table 1. Each parameter is drawn uniformly within its range, which has been chosen to provide sufficient variation in image appearance. These parameters are the "true" underlying features associated with the simulated images; they give the most concise description of the variation observed across the images. The response is a hand-specified linear combination of these parameters; the reasoning behind the combination is discussed in the generate.Rmd script in the accompanying compendium, see Supplementary Section 6.

3.2.2 Experiments

We study the influence of the following parameters,

  • Learning vs. inference split sizes. We vary the proportion of data used for feature learning and for inference.

  • Models trained. For feature extractors, we train CNN, VAE, and RCF models on the learning split .

  • Model complexity. We train VAEs across a range of hidden-layer dimensionalities, and we similarly vary the number of first-layer convolutional filters in the CNN model. For the RCF, we vary the number of random features, using more than for the other models; this increase reflects the fact that more random features must be considered before a subset of predictive ones is identified.

  • Inference strategy. We use the parametric, nonparametric, and compromise bootstrap strategies from section 2 to estimate confidence areas for the projections obtained by feature learners.

Figure 4 shows the activations of learned features across 2000 images for two perturbed versions of the training data when 90% of the data are used for inference. The learned features correspond to,

  • CNN: Activations from the final hidden layer of neurons, used directly as input for the regression.

  • VAE: Spatially-pooled activations from the middle, encoding layer of the variational autoencoder.

  • RCF: The spatially-pooled activations corresponding to each random convolutional feature.

Note that, across algorithms, there is no simple correspondence between learned and source features (i.e., parameters of the underlying simulation). Instead, there are clusters of learned features, each corresponding to a pattern across multiple source features. We also find subsets of features across all models that are only weakly correlated with any source feature. This has been referred to as distributed representation learning Hinton (1984); Le and Mikolov (2014).

Certain source features appear "easier" to represent than others, in the sense that more of the learned features are strongly correlated with them. Many features are correlated with the total number of cells in the image and with the size of the cells from Process 1. Depending on the model, the bandwidth, roughness, and prevalence parameters are either only weakly or not at all correlated with learned features. Even when features detect variation in bandwidth and roughness, they cannot disambiguate between these two parameters. Finally, the CNN and VAE features tend to be more clustered, with strong correlation across several source features. In contrast, the RCF features show more gradual shifts in correlation strength. They also show relatively little variation in correlation strength across source features other than cell count and cell size.

Figure 4: Each feature learning algorithm learns a distributed representation of the true underlying features in the simulation. Within each heatmap, rows correspond to the parameters in Supplementary Table 1. Columns are activations of learned features; they have been reordered using the superheat package Barter and Yu (2018). The color of a cell gives the correlation between true and learned features. Blue and burgundy encode positive and negative correlation, respectively.

Example confidence areas across models and bootstrapping approaches are given in Figure 5a. In contrast to Figure 2 of the low-rank simulation, the areas from the nonparametric bootstrap are larger than those from the parametric bootstrap. This disagreement suggests that the proposed mechanism of equations 4 and 8 is insufficient for characterizing differences in learned features that arise between runs of more complex feature extractors; multiple runs must be used to account for randomness in algorithmic feature learning. As before, the compromise bootstrap has larger confidence areas than either the parametric or nonparametric approach on its own. In general, the RCF tends to have smaller confidence areas than the CNN and VAE.

Figure 5: (a) The 95% confidence areas associated with projections from the spatial point process simulation. Each point corresponds to one image. Only the setting with 90% of the data used for feature learning and the midsize models are shown. (b) A view of confidence areas for the CNN across a range of learning split fractions and model complexities, all using the nonparametric approach.

Figure 5b shows confidence regions for a single model (CNN) and bootstrap procedure (nonparametric) across a range of model complexities and split proportions. For more complex models, projections are farther from the origin, suggesting larger activations on average. The fraction of data used for feature learning does not appear to affect the strength of the association with the response or the size of the projection uncertainties. Corresponding figures for the other models are provided in Supplementary Section 7.

4 Data Analysis

We next analyze the spatial proteomics dataset reported in Keren et al. (2018), which found a relationship between the spatial organization of Triple Negative Breast Cancer (TNBC) tissues and disease progression. In a classical proteomics study, the expression levels for a set of proteins are measured for a collection of cells, but the cell locations are unknown. In contrast, these data provide, for each patient, (1) an image delineating cell boundaries and (2) the protein expression levels associated with each cell in the images.

We only work with the spatial cell delineations, not the protein expression levels. This allows us to study feature learning within the images without having to worry about linking expression and image data, which is itself a complex integration problem. The data are images, one for each of 41 patients. Each pixel has a value 1 through 7 encoding which of 7 categories of tumor or immune cell types the pixel belongs to. To ensure that the cell types are treated as categorical, we transform pixels to their one-hot encodings, resulting in a collection of binary matrices.

Figure 6: Example patches from the TNBC data. Panels are ordered by the (log) fraction of cells they contain that belong to the tumor. This provides signal for the supervised algorithms, whose goal is to correctly place patches from new patients along this gradient.

To set up a prediction problem, we split each image into patches. These patches are our samples $x_i$. Patches from 32 of the patients are reserved for feature learning. Four among these 32 are used as a development split, to tune parameters of the feature learning algorithms. As a response variable, we use the (log) fraction of cells in each patch belonging to the tumor. These responses provide signal for the supervised feature learners. Example cell patches are shown in Figure 6. We fit the same models (CNN, VAE, and RCF) as discussed in section 3, varying model complexity over the same parameters as before.

As a baseline, we compare against a ridge regression with pixelwise composition features. We train a model with the same response but using the average number of pixels belonging to each of the cell-type categories as the feature vector. This helps to determine whether the models have learned interesting features, like cell size and boundaries, rather than simply averaging across pixel values. Indeed, Figure 7 shows that, except in models with low capacity, performance is improved when learning features algorithmically.

Figure 7: Relative performance of feature learning strategies on the TNBC data. Linked points come from the same bootstrap replicate. The black line gives the baseline ridge regression approach using the manually generated raw pixel counts for each of the cell types as predictors. Predictions from the development split are omitted; this split has few patches, and the estimated MSEs have high variance.

To characterize features and their uncertainty, we run each of the parametric, nonparametric, and compromise bootstrap strategies. In each case, the samples are generated from patches reserved for inference. Two-dimensional projections for fixed model complexities are given in Figure 8. As visible from the color gradients, all methods learn to differentiate between patches with small and large responses, even the VAE, which is unsupervised. Comparing rows, the RCF appears to give the most stable representations, while the coordinates for the VAE have larger confidence areas in general. For all models, some projections appear to be more uncertain than others. Moreover, certain axis directions tend to be more uncertain than others, reflected by the eccentricity of the ellipses. For example, viewing estimates from the nonparametric approach, the VAE projections have the highest uncertainty for low values of Dimension 2. Analogously, high values of Dimension 2 have high uncertainty in the RCF.

For the CNN and RCF, the three bootstrap approaches give qualitatively similar conclusions about regions with higher and lower uncertainty, though the average sizes of the confidence areas differ. The sizes of confidence areas for the compromise approach in this case seem intermediate between those of the parametric and nonparametric approaches. For the VAE, the bootstrap approaches do not appear to agree. The compromise approach generally gives much larger confidence areas, potentially reflecting a failure of the Procrustes alignment in this case. Though this figure only displays one complexity for each model, we find few differences in projection uncertainty across models with different complexities; see Supplementary Figure 10.

Figure 9 overlays example patches onto the aligned coordinates. In the CNN, samples in the bottom right have a high fraction of immune cells, those on the top left are mostly tumor, and those at the top right have lower cell density. In light of the confidence regions in Figure 10, embeddings of immune-cell-rich images tend to show more variability across bootstraps. In the RCF, the first dimension similarly reflects the tumor vs. immune gradient. The lower uncertainty of projections along this axis suggests that this gradient is reliably captured across runs of the feature extractor. The lower right region of the VAE has high cell diversity, and the upper left has lower density. It seems that regions with higher cell diversity and larger proportions of immune cells also have more uncertain embeddings. From the top right to the bottom left, there is again a tumor-to-immune transition.

Figure 8: Confidence areas from the TNBC application. Points are shaded by the response, which provides the supervisory signal to the CNN and RCF during feature extraction. Models and bootstrap procedures are arranged along rows and columns, respectively. Only the models with intermediate complexity are shown. Analogous figures for other complexities are given in the supplementary materials.

Figure 9: A version of the nonparametric bootstrap column from the dimensionality reductions in Figure 8, overlaying representative samples across the learned feature space. Cells are color coded as in Figure 6. Note that the overall shape of the region in which images are displayed mirrors the shape of the cloud of points in Figure 8.

5 Discussion

We have adapted existing approaches for evaluating the uncertainty of projections in dimensionality reduction for use in the context of algorithmically learned features. We have conducted an empirical study using simulations of varying complexity and a spatial proteomics data analysis problem, applying a representative suite of feature learning algorithms. We found that in more complex settings, a parametric bootstrap based on a single set of learned features does not reflect the degree of uncertainty present when comparing features derived from independently trained models.

Our results raise several questions for further study. It is natural to ask to what extent similar behaviors are exhibited across other data domains, model types, or training regimes. For example, it would not be unusual to represent the cell data in our case study using a marked graph linking neighboring cells. Do the features learned by a graph autoencoder have similar stability properties? More generally, are there classes of feature learning algorithms that all share behavior from the stability perspective? In other domains, we may ask whether our methods can be adapted to text or audio data.

Though the proposed bootstraps provide similar conclusions in the low-rank simulation, they differ in the point process simulation and spatial proteomics data analysis. This suggests that mechanisms in equations 4 and 8 may not reflect the behavior of repeated feature learning in more complex situations. The assumption that features can be rotated to align runs may also be problematic, and more general transformations across feature learning runs are plausible. For the CNN and RCF models in the data analysis example, we find that confidence areas for the compromise approach are intermediate between those for the parametric and nonparametric bootstraps. However, in both simulations, it tends to be larger, and we have no explanation. Finally, though we have empirically found coverage rates for the parametric and nonparametric bootstraps to be acceptable in the low-rank simulation, we have not theoretically studied the properties of these procedures. This could clarify differences that arise in projection uncertainties between bootstrap and feature learning approaches.


The author thanks Susan Holmes, Karl Rohe, three reviewers, and the editor for feedback which improved the manuscript. Research was performed with assistance of the UW-Madison Center For High Throughput Computing (CHTC).


  • R. L. Barter and B. Yu (2018) Superheat: an R package for creating beautiful and extendable heatmaps for visualizing complex data. Journal of Computational and Graphical Statistics 27 (4), pp. 910–922. Cited by: Figure 4.
  • F. Chateau and L. Lebart (1996) Assessing sample variability in the visualization techniques related to principal component analysis: bootstrap and alternative simulation methods. In COMPSTAT, pp. 205–210. Cited by: §1.3, Bootstrap Confidence Regions for Learned Feature Embeddings.
  • V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey (2017) Double/debiased/Neyman machine learning of treatment effects. American Economic Review 107 (5), pp. 261–65. Cited by: Bootstrap Confidence Regions for Learned Feature Embeddings.
  • P. J. Diggle, P. Moraga, B. Rowlingson, B. M. Taylor, et al. (2013) Spatial and spatio-temporal log-gaussian cox processes: extending the geostatistical paradigm. Statistical Science 28 (4), pp. 542–563. Cited by: §3.2.1.
  • E. Elguero and S. Holmes-Junca (1988) Confidence regions for projection pursuit density estimates. In Compstat, pp. 59–63. Cited by: §1.3, Bootstrap Confidence Regions for Learned Feature Embeddings.
  • D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: Bootstrap Confidence Regions for Learned Feature Embeddings.
  • J. Friedman, T. Hastie, R. Tibshirani, et al. (2001) The Elements of Statistical Learning. Vol. 1, Springer series in statistics New York. Cited by: §1.2, Bootstrap Confidence Regions for Learned Feature Embeddings.
  • J. C. Gower (1975) Generalized procrustes analysis. Psychometrika 40 (1), pp. 33–51. Cited by: §1.2.
  • S. M. Gross, J. Taylor, and R. Tibshirani (2015) A selective approach to internal inference. arXiv preprint arXiv:1510.00486. Cited by: Bootstrap Confidence Regions for Learned Feature Embeddings.
  • G. E. Hinton (1984) Distributed representations. Cited by: §3.2.2.
  • J. Josse, S. Wager, and F. Husson (2016) Confidence areas for fixed-effects PCA. Journal of Computational and Graphical Statistics 25 (1), pp. 28–48. Cited by: §1.3, §2.2, Bootstrap Confidence Regions for Learned Feature Embeddings.
  • L. Keren, M. Bosse, D. Marquez, R. Angoshtari, S. Jain, S. Varma, S. Yang, A. Kurian, D. Van Valen, R. West, et al. (2018) A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell 174 (6), pp. 1373–1387. Cited by: §4.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. Cited by: §3.2.2.
  • A. Nguyen, J. Yosinski, and J. Clune (2019) Understanding neural networks via feature visualization: a survey. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 55–76. Cited by: Bootstrap Confidence Regions for Learned Feature Embeddings.
  • M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein (2017) SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pp. 6076–6085. Cited by: §1.1.
  • A. Rahimi and B. Recht (2008) Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Advances in neural information processing systems 21, pp. 1313–1320. Cited by: §1.1.
  • B. Ren, S. Bacallado, S. Favaro, S. Holmes, and L. Trippa (2017) Bayesian nonparametric ordination for the analysis of microbial communities. Journal of the American Statistical Association 112 (520), pp. 1430–1442. Cited by: §1.3.
  • B. Ren, S. Bacallado, S. Favaro, T. Vatanen, C. Huttenhower, and L. Trippa (2020) Bayesian mixed effects models for zero-inflated compositions in microbiome data analysis. The Annals of Applied Statistics 14 (1), pp. 494–517. Cited by: §1.3.
  • A. Van Den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315. Cited by: §1.1.
  • S. Wang, T. H. McCormick, and J. T. Leek (2020) Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences 117 (48), pp. 30266–30275. Cited by: Bootstrap Confidence Regions for Learned Feature Embeddings.

6 Reproducibility

We have prepared a research compendium. Instructions for downloading the raw and preprocessed data are given in the analysis/data/ subdirectory. The LFBCR compendium available in the repository above is also an R package and includes functions that can be used to apply the bootstrap and alignment methods more generally.

R Markdown files needed to reproduce the figures in Section 3 are given in the analysis/simulations subfolder. The analogous files for the spatial proteomics application are given in analysis/data_analysis. The learning subfolders at both of these locations can be used to retrain one instance of each feature learning algorithm. Further details are given in the folders in the repository.
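The bootstrap-and-align idea that the compendium packages can be sketched independently of the R implementation. Below is a minimal, illustrative Python version: resample observations, recompute a low-dimensional projection on each resample, and rotate each bootstrap projection onto a reference via an orthogonal Procrustes fit so that per-sample scatter across resamples can be summarized as a confidence region. Function and argument names here are hypothetical, not the compendium's API, and PCA stands in for a generic learned feature extractor.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def bootstrap_projection_regions(X, n_boot=200, k=2, seed=0):
    """Nonparametric bootstrap of k-dimensional PCA projections,
    aligned to a reference fit by orthogonal Procrustes rotation.

    Returns the reference projection (n x k) and an array of aligned
    bootstrap projections (n_boot x n x k); the spread of each sample's
    bootstrap coordinates gives its confidence region.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    def project(Z):
        # Fit projection directions on Z, then project *all* samples,
        # so every observation appears in every bootstrap embedding.
        _, _, vt = np.linalg.svd(Z - Z.mean(0), full_matrices=False)
        return (X - X.mean(0)) @ vt[:k].T

    ref = project(X)
    boots = np.empty((n_boot, n, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample rows with replacement
        proj = project(X[idx])
        R, _ = orthogonal_procrustes(proj, ref)  # undo sign/rotation ambiguity
        boots[b] = proj @ R
    return ref, boots
```

In practice the `project` step would be replaced by refitting the feature learner (VAE, CNN, ...) on each resample; the alignment step is what makes the resulting embeddings comparable across bootstrap replicates.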

7 Supplementary Tables and Figures

| Feature | Description | Influence Range |
| --- | --- | --- |
|  | The total number of cells. | 0.5 |
|  | The roughness of the overall intensity process. | -0.5 |
|  | The bandwidth of the overall intensity process. | -0.5 |
|  | The intercept controlling the frequency of class . | 1 for , -1 otherwise |
|  | The roughness of the relative intensity processes. | -0.5 |
|  | The bandwidth of the relative intensity processes. | -0.5 |
|  | The temperature used in cell type assignment. | 0.5 |
|  | The shape parameter controlling the sizes of each cell type. | 1 for , 0 otherwise |
Table 1: Our simulation mechanism is governed by the above parameters. Parameters and control the number and sizes of imaginary cells. , , , , , and control the overall and relative intensities of the marked LCMP from which the cell positions are drawn. Example draws are displayed in Figure 3.
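The simulation mechanism in Table 1 can be illustrated with a compact sketch: log-intensities are drawn from a Gaussian process, the overall intensity governs Poisson counts per grid cell, and cell types (marks) come from a tempered softmax over class-specific relative intensities. This is a simplified, hypothetical version of the marked Cox process used in the paper; a squared-exponential kernel stands in for the Matérn family (so only the bandwidth, not the roughness, appears), and all names and defaults are illustrative.

```python
import numpy as np

def marked_cox_draw(n_grid=15, bandwidth=0.2, intercepts=(1.0, -1.0, -1.0),
                    temperature=0.5, mean_count=100.0, seed=0):
    """Draw per-cell counts and cell-type marks from a marked Cox process.

    Illustrative only: the overall log-intensity and one relative
    log-intensity per class are Gaussian process draws on a grid over the
    unit square; counts are Poisson, marks are tempered-softmax draws.
    """
    rng = np.random.default_rng(seed)

    # Grid over the unit square and a squared-exponential covariance.
    xs = (np.arange(n_grid) + 0.5) / n_grid
    xx, yy = np.meshgrid(xs, xs)
    pts = np.column_stack([xx.ravel(), yy.ravel()])
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    L = np.linalg.cholesky(np.exp(-d2 / (2 * bandwidth ** 2))
                           + 1e-8 * np.eye(len(pts)))

    # Overall intensity process -> Poisson counts per grid cell.
    log_lam = L @ rng.standard_normal(len(pts))
    lam = np.exp(log_lam - log_lam.max())
    counts = rng.poisson(mean_count * lam / lam.sum())

    # Relative intensity per class (intercepts shift class frequencies),
    # tempered softmax gives the mark distribution at each location.
    rel = np.stack([b + L @ rng.standard_normal(len(pts))
                    for b in intercepts], axis=1)
    probs = np.exp(rel / temperature)
    probs /= probs.sum(axis=1, keepdims=True)

    marks = np.concatenate(
        [rng.choice(len(intercepts), size=c, p=p)
         for c, p in zip(counts, probs)])
    return counts, marks
```

Lowering `temperature` sharpens the mark assignments toward the dominant class at each location, matching the role of the temperature parameter in Table 1.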
Figure 10: A comparison of the influence of models (rows) and model complexities (columns) on confidence regions constructed using the nonparametric bootstrap in the data analysis application. Each individual panel is read similarly to Figure 8.
Figure 11: The analog of Figure 5, still shown on CNN projections, but for the nonparametric (left) and compromise (right) bootstrap approaches.
Figure 12: The analog of Figure 5, but for VAE models evaluated using the parametric (left), nonparametric (middle), and compromise (right) bootstrap approaches.
Figure 13: Confidence areas corresponding to Figure 5a, but for models with lower and average complexity, still using 15% of data reserved for learning.
Figure 14: Confidence areas analogous to Figure 5a, but for models trained using 50% of the data. The source images can be viewed in full resolution at the compendium repository.
Figure 15: Confidence areas analogous to Figure 5a in the spatial point process simulation, but for models trained using 90% of the data. The source images can be viewed in full resolution at the compendium repository.
Figure 16: Confidence areas analogous to Figure 8 in the data analysis application, but for models with lower (left) and higher (right) complexity. The source images can be viewed in full resolution in the compendium repository.