1 Introduction
The main goal of many scientific imaging methods is to reconstruct a dimensional structure from (d)dimensional observations , where is either one or two. For the sake of simplicity we will talk about the case in the rest of this work. The contributions of this paper are:

We view the process of image formation through a graphical model in which latent variables correspond to physical quantities such as the hidden structure v or the relative orientation/pose of a specimen. This enables one to predict errors in the reconstruction of 3D structures through uncertainty estimates. This is especially interesting when objects v are only partially observable, as is the case in certain medical scans, such as breast cancer scans. Moreover, uncertainty prediction enables more data efficient model validation.

Based on the aforementioned innovations, we propose a new method for (unsupervised) reconstruction evaluation. Particularly, we demonstrate that learned uncertainties can replace currently used data inefficient methods of evaluation (see Section 6.2). We thus learn better model fits than traditional methods given the same amount of data.

We extend current approaches such as (Jaitly et al., 2010)
to describe the generative process as a differentiable map by adopting recent techniques from the deep learning community
(Jaderberg et al., 2015; Rezende et al., 2016). We demonstrate that this fenables more advanced joint inference schemes over object pose and structure estimation.
Our experimental validation focuses on single particle electron cryomicroscopy (cryoEM). CryoEM is a challenging scientific imaging task, as it suffers from complex sources of observation noise, low signal to noise ratios, and interference corruption. Interference corruption attenuates certain Fourier frequencies in the observations. Radiation exposure is minimized because electron radiation severely damages biological specimens during data collection. Minimal radiation, however, leads to low signaltonoise ratios, where sensor cells record relatively low electron counts. Since imaging techniques like CT suffer from a subset of these difficulties, we believe that evaluating and analyzing our method on cryoEM problems is appropriate.
2 Background and related work
Modelling nanoscale structures such as proteins or viruses is a central task in structural biology. By freezing such structures and subsequently projecting them via a parallel electron beam to a sensor grid (see figure 2), CryoEM enables reconstruction and visualization of such structures. The technique has been described as revolutionary because researchers are capable of observing structures that cannot be crystallized, as required for Xray crystallography (Rupp, 2009).
The main task of reconstructing the structure from projections in cryoEM, and the wider field of medical imaging, is somewhat similar to multiview scene reconstruction from natural images. There are, however, substantial differences. Most significantly, the projection operation in medical imaging is often an orthogonal integral projection, while in computer vision it is a nonlinear perspective projection for which materials exhibit different degrees of opacity. Thus, the generative model in computer vision is more complex. Medical imaging domains, on the other hand, face significant noise and measurement uncertainties, with signaltonoise ratios as low as 0.05
(Baxter et al., 2009).Most CryoEM techniques (De la RosaTrevín et al., 2013; Grigorieff, 2007; Scheres, 2012; Tang et al., 2007) iteratively refine an initial structure by matching a maximum a posteriori (MAP) estimate of the pose (orientation and position) under the proposal structure with the image observation. These approaches suffer from a range of problems such as high sensitivity to poor initialization (Henderson et al., 2012). In contrast to this approach, and closely related to our work, (Brubaker et al., 2015; Punjani et al., 2017) treat poses and structure as latent variables of a joint density model. MAP estimation enables efficient optimization in observation space. Previous work (Sigworth, 1998; Scheres et al., 2007a; Jaitly et al., 2010; Scheres, 2012) has suggested full marginalization, however due to its cost, it is usually intractable.
This paper extends the MAP approach by utilizing variational inference to approximate intractable integrals. Further, reparameterizing posterior distributions enables gradient based learning (Kingma and Welling, 2013; Rezende et al., 2014a). To our knowledge this is the first such approach that provides an efficient way to learn approximate posterior distributions in this domain.
3 Observation Formation Model
Given a structure v, we consider a generic generative model of observations, one that is common to many imaging modalities. As a specific example, we take the structure v to be a frozen (i.e. cryogenic) protein complex, although the procedure described below applies as well to CT scanning and optical microscopy. v is in a specific pose relative to the direction of the electron radiation beam. This yields a poseconditional projection, with observation . Specifically, the pose , consists of , corresponding to the rotation of the object with respect to the microscope coordinate frame, and , the translation of the structure with respect to the origin.
The observations are then generated as follows: The specimen in pose is subjected to radiation (the electron beam), yielding an orthographic integral projection onto the plane perpendicular to the beam. The expected projection can be formulated as
(1) 
Here is the projection operator, is the linear translation operator for translation , and is the linear operator corresponding to rotation . Without loss of generality we can choose the projection direction to be along the direction . When the projection is recorded with a discrete sensor grid (i.e., sampled), information beyond the Nyquist frequency is aliased. Additionally, the recording is corrupted with noise stemming from the stochastic nature of electron detection events and sensor failures (Egelman, 2016). Low doses are necessary since electron exposure causes damage to sensitive biological molecules. Logically, the effect is more severe for smaller objects of study.
Many sophisticated noise models have been proposed for these phenomena (Faruqi et al., 2003; Vulović et al., 2013; Scheres et al., 2007b). In this work, for simplicity, we assume isotropic Gaussian noise; i.e.,
(2) 
where models the magnitude of the observation noise. The image formation process is depicted in Figure 2.
The final section below discusses how one can generalize to more sophisticated (learnable) noise models. Note that we do not model interference patterns caused by electron scattering, called defocus and modelled with a contrast transfer function (CTF). This will lead to less realistic generative models, however we see the problem of CTF estimation as somewhat independent of our problem. Ideally, we would like to model the CTF across multiple datasets, but we leave this to future work.
4 Backpropagating through the generative model
In this section we aim to bridge the gap between our knowledge of the generative process and a differentiable mapping that facilitates direct optimization of hidden variables () with gradientdescent style schemes. We start with an explanation of a naive differentiable implementation in position space, followed by a computationally more efficient version by shifting the computations to the Fourier domain (momentum space).
4.1 Naive implementation: project in position space
Our goal is to optimize the conditional loglikehood with respect to the unobserved and v, maximizing the likelihood of the 2D observations. This requires equation (2) to be a differentiable operator with respect to and v. Note that the dependence on and v is fully determined by equation (1). In order to achieve this, we first need to apply the group action onto v. Matrix representations of the group action such as the Euler angles matrix are defined on the sampling grid of v rather than the voxel values . For example, the action induced by a rotation around the zaxis by an angle on the position of an arbitrary voxel can be written as,
(3) 
This entails two problems. First, the volume after transformation should be sampled at the same grid points as before. This requires interpolation. Second, to achieve a differentiable map we need to formulate the transformation of position values as a transformation of the voxel values.
Jaderberg et al. (2015) offers a solution to both problems, known as differentiable sampling^{1}^{1}1Originally invented to learn affine transformations on images to ease the classification task for standard neural networks, the approach has since been extend to 3D reconstruction problems from images
(Rezende et al., 2016)..The th voxel of the transformed volume, , with index vector , can be expressed as a weighted sum of all voxels before transformation . The weights are determined by a sampling kernel , the argument of which is the difference between the transformed voxel’s position and all transformed sampling grid vectors :
(4) 
A popular kernel in this context is the linear interpolation sampling kernel^{2}^{2}2Linear kernels are efficient and yield fairly good results. More complex ones such as the Lanczos resampling kernel may actually yield worse results due to smoothing.
(5) 
Computing one voxel only requires a sum over 8 voxels from the original structure. These are determined by flooring and ceiling the elements of . Furthermore, the partial derivatives are provided by,
(6)  
(7) 
This framework was originally proposed for any differentiable kernel and any differentiable affine position transformation . In our setting, we restrict ourselves to linear interpolation kernels. The group actions represented by are affine. In this work we represent rotations as Euler angles by using the ZYZ convention. One could also use quaternions or exponential maps. As with rotation, the translation operation is also a transformation of the voxel grid, rather than the voxel values. Thus, equation (4) can also be used to obtain a differentiable translation operation.
Finally, the orthogonal integral projection operator is applied by summing voxel values along one principal direction. Since the position of the hidden volume is arbitrary we can fix this direction to be the Zaxis as discussed in section 3. Denoting a volume, rotated and translated according to , by , the th element of its expected projection is given by
(8) 
where denotes the element of .
This concludes a naive approach to modelling a differential map of the expected observation in position space. This approach is not particularly efficient, as according to equation (8) we need to interpolate all voxels to compute one dimensional observation. Moreover, backpropagating through this mapping requires transporting gradients through all voxels. Next, we show how to reduce the cost of this naive approach without a loss of precision by shifting the problem to the Fourier domain.
4.2 Projectionslice theorem
The projectionslice theorem or Fourierslice theorem states the equivalence between the Fourier transform
of the projection of a dimensional function onto a dimensional submanifold and a dimensional slice of the dimensional Fourier transform of that function. This slice is a dimensional linear submanifold through the origin in the Fourier domain that is parallel to the projection submanifold .In our setting, given an axisaligned projection direction (see equation (1)) and the discrete grid , the expected observation in 2D Fourier space is equal to a central slice through the Fourier transformed 3D structure parallel to the projection direction :
(9) 
where and denote discrete Fourier transforms in 2 and 3 dimensions. The slice operator is the Fourier equivariant of the projection operator. In our case, it is applied as follows:
(10) 
where are Fourier indexes.
This allows one to execute the generative model in position or momentum space. It has proven more efficient for most reconstruction algorithms to do computation in the Fourier domain (Sigworth, 2016). This also applies to our algorithm: (i) We reconstruct the structure in the Fourier domain. This means we only need to apply an inverse Fourier transformation at the end of optimization. (ii) We may save the Fourier transformed expected projections apriori, further this is easily parallelized. Thus even though in its original formulation we can not expect a computational benefit, when sharing the computation across many data points to reconstruct one latent structure the gain is significant. We elaborate on this point with respect to differentiation below.
4.3 Differentiable orthographic integral projection
We incorporate the Fourier Slice Theorem into our approach to build an efficient differentiable generative model. For this we translate all elements of the model to their counterparts in the Fourier domain. The Gaussian noise model becomes (Havin and Jöricke, 1994),
(11) 
where . In the following, we aim to determine an expression for the expected Fourier projection by deriving the operators counterparts .
We start by noting that it is useful to keep in memory to avoid computing the 3D discrete Fourier transform multiple times during optimization. The inverse Fourier transform is then only applied once, after convergence of the algorithm. Next, we restate that the Fourier transformation is a rotationequivariant mapping (Chirikjian and Kyatkin, 2000). This means the derivations from Section 4.1 with respect to the rotation apply in this context as well. A translation in the Fourier domain , however, induces a reweighting of the original Fourier coefficients,
(12) 
Finally, the last section established the equivalence of the slice and the projection operator (see equation (4.2)) in momentum and position space. Specifically, for the linear interpolation kernel, we compute the set of interpolation points by flooring and ceiling the elements of the vector , . This entails 6 interpolation candidates per voxel of the central slice, in total . Remember, this computation above involved voxels and candidates.
We can further improve the efficiency of the algorithm by swapping the projection and translation operators. That is, due to parallel radiation, and hence orthographic projection,
(13) 
where . This is more efficient because we reduce the translation to a two dimensional translation. Thus this modifies equation (12) to its two dimensioanl equivalent.
For the naive implementation the cost of a real space forward projection is O. In contrast, converting the volume to the Fourier space O, projecting O and applying the inverse Fourier transform O. At first glance this implies a higher computational cost. However, for large datasets the cost of transforming the volume is amortized over all data points. For gradient descent schemes, we iterate over the dataset more than once, hence the cost of Fourier transforming the observations is further amortized. Furthermore, it is often reasonable to consider only a subset of Fourier frequencies, so back projection becomes O with . The efficiency of this algorithm in the context of cryoEM was first recognized by Grigorieff (1998). (We provide code for the differentiable observation model and in particular for the Fourier operators: https://github.com/KarenUllrich/pytorchbackprojection.)
5 Variational inference
Here we describe the variational inference procedure for 3D reconstruction in Fourier space, enabled by the Fourier slice theorem. We assume we have a dataset of observations from a single type of protein with ground truth structure v, and its Fourier transform . We consider two scenarios. In the first, both the pose of the protein and the projection are observed, and inference is only performed over the global protein structure. In the second, the pose of the protein for each observation is unknown. Therefore, inference is done over poses and the protein structure.
The first scenario is synonymous with the setting in tomography, where we observe a frozen cell or larger complex positioned in known poses. This case is often challenging because the sample cannot be observed from all viewing angles. For example, in cryoEM tomography the specimen frozen in an ice slice can only be rotated till the verge of the slice comes into view. We find similar problems in CT scanning, for example in breast cancer scans. The second scenario is relevant for cryoEM single particle analysis. In this case multiple identical particles are confined in a frozen sample and no information on their structure or position is available a priori.
In this work we lay the foundations for doing inference in either of the two scenarios. However, our experiments demonstrate that joint inference over the poses and the 3D structure is very sensitive to getting stuck in local minima that correspond to approximate symmetries of the 3D structure. Therefore, the main focus of this work is the setting where the poses are observed.
5.1 Inference over the 3D structure
Here the data comprise image projections and poses, , with Fourier transformed projections denoted . Our goal is to learn a model of the latent structure that as closely as possible resembles the true posterior . For this, we assume a joint latent variable model . To avoid clutter below, we use shorthand notations like for .
Specifically, we minimize an upper bound to the KullbackLeibler (KL) divergence:
(14) 
Here, we have assumed that, given the volume, the data are IID: . We have bounded the divergence by the data loglikelihood , a constant with respect to and . This is equivalent to lower bounding the model evidence by introducing a variational posterior (Jordan et al., 1998).
In this work we focus on modelling
as isotropic Gaussian distribution. The prior is assumed to be a standard Gaussian. In practice, we use stochastic gradient descentlike optimization, with the data organized in minibatches. That is, we learn the distribution parameters
for through stochastic optimization, efficiently by using the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014a).In equation (14), the reconstruction term depends on the number of datapoints. The KLdivergence between the prior and approximate posterior does not. As the sum of the minibatch objectives should be equal to the objective of the entire dataset, the minibatch objective is
(15) 
where is the set of indices of the data in minibatch , and denotes the size of the th minibatch.
5.2 Joint inference over the 3D structure and poses
In the second scenario the pose of the 3D structure for each observation is unknown. The data thus comprises the observed projections . Again, we perform inference in the Fourier domain, with transformed projections as data. We perform joint inference over the poses and the volume . We assume the latent variable model can be factored as follows, .
Upper bounding the KLdivergence, as above, we obtain
(16) 
The prior, approximate posterior and conditional likelihood all factorize across datapoints: , , and . Like equation (15), for minibatch optimization algorithms we make use of the objective
(17)  
In Section 5.1, we learn the structure parameters shared across all data points. Here the pose parameters are unique per observation and therefore require separate inference procedures per data point. As an alternative, we can also learn a function that approximates the inference procedure; This is called amortized inference (Kingma and Welling, 2013). In practice, a complex parameterized function such as a neural network predicts the parameters of the variational posterior .
6 Experiments
We empirically test the formulation above with simulated data from the wellknown GroELGroES protein (Xu et al., 1997). To this end we generate three datasets, each with 40K projections onto 128128 images at a resolution of 2.8Å per pixel. The three datasets have signaltonoise ratios (SNR) of 1.0, 0.04 and 0.01, referred to below as the noisefree, mediumnoise and highnoise cases. Figure 7 shows one sample per noise level. As previously stated in Section 3, we do not model the microscope’s defocus or electron scattering effects, as captured by the CTF (Kohl and Reimer, 2008).
Using synthetic data allows us to evaluate the algorithm with the groundtruth structure, e.g., in terms of meansquared error (MSE). With realdata, where ground truth is unknown, the resolution of a fitted 3D structure is often quantified using Fourier Shell Correlation (Rosenthal and Henderson, 2003): The observations are partitioned randomly into two sets, and , each of which is then independently modeled with the same reconstruction algorithm. The normalized crosscorrelation coefficient is then computed as a function of frequency to assess the agreement between the two reconstructions.
Given two 3D structures, and , in the Fourier domain, FSC at frequency is given by
(18) 
where denotes the set of frequencies in a shell at distance from the origin of the Fourier domain (i.e. with wavelength ). This yields FSC curves like those in Fig. 4. The quality (resolution) of the fit can be measured in terms of the frequency at which this curve crosses a threshold . When one of or is ground truth, then , and when both are noisy reconstructions it is common to use (Rosenthal and Henderson, 2003)). The structure is then said to be resolved to wavelength for which .
6.1 Comparison to Baseline algorithms
When poses are known, the current stateoftheart (SOTA) is a conventional tomographic reconstruction (a.k.a. backprojection). When poses are unknown, there are several wellknown SOTA cryoEM algorithms (De la RosaTrevín et al., 2013; Grigorieff, 2007; Scheres, 2012; Tang et al., 2007). All provide point estimates of the 3D structure. In terms of graphical models, point estimates correspond to the case in which the posterior is modeled as a delta function, , the parameter of which is the 3D voxel array, .
We compare this baseline to a model in which the posterior is a multivariate diagonal Gaussian, . While the latent structure is modeled in Fourier domain, the spatial domain signal is realvalued. We restrict the learnable parameters, and , accordingly ^{3}^{3}3That is : and with .. We use the reparameterization trick thus correlate the samples accordingly. Finally, the prior in equation (14) in a multivariate standard Gaussian, .
Table 1 shows results with these two models, with known poses (the tomographic setting), and with noisefree observations. Given the sensor resolution of , the highest possible resolution would be the Nyquist wavelength of . Our results show that both models approach this resolution, and in reasonable time.
Posterior  

Time until  390  480 
converged  
MSE  2.29  2.62 
[/voxel]  
Resolution [Å]  5.82  5.82 
6.2 Uncertainty estimation leads to data efficiency
In this section, we explore how modelling the latent structure with uncertainty can improve data efficiency. For this, recall that FSC is computed by comparing reconstructions based on dataset splits, and . As an alternative, we propose to utilize the learned model uncertainties to achieve a similar result. We thus only need one model fit that includes both dataset splits. Specifically, we propose to extend a measure first presented in Unser et al. (1987): the spectral SNR to the 3D case, and hence refer to it as spectral shell SNR (SSSNR). When modelling the latent structure as diagonal Gaussian , the SSSNR can be computed to be
(19) 
Following the formulation by Unser et al. (1987), we can then express the FSC in terms of the SSSNR, i.e., .
Figure 4 shows FSC curves based on reconstructions from the mediumnoise (top) and highnoise (bottom) datasets. First, we aim to demonstrate that the FSC curves between the Gaussian fit (with all data) vs ground truth, and the MAP estimate model with all data vs ground truth, i.e., FSC(, ) and FSC(, ), yield the same fit quality. Note that we would not usually have access to the ground truth structure. Secondly, because in a realistic scenario we would not have access to the ground truth we would need to split the dataset in two. For this we evaluate the FSC between ground truth and two disjoint splits of the dataset FSC(, ) and FSC(, ). This curve not surprisingly lies under the previous curves. Also note, that the actual measure we would consider FSC(, ) is more conservative. Finally, we show that curve has the same inflection points as the FSC curve. As one would expect, it lies above the conservative FSC(, ).
Using one can quantify the model fit with learned uncertainty rather than FSC curve. As a consequence there is no need to partition the data and perform two separate reconstructions, each with only half the data.
6.3 Uncertainty identifies missing information
Above we discussed how global uncertainty estimation can help estimate the quality of fit.
Here we demonstrate how local uncertainty can help evaluate the local quality of
fit^{4}^{4}4Other studies recognize the importance of local uncertainty estimation, measuring
FSC locally in a wavelet basis (Cardone et al., 2013; Kucukelbir et al., 2014; Vilas et al., 2018)..
In many medical settings, such as breast cancer scans, limited viewing directions are available.
This is an issue in tomography, and also occurs in single particle cryoEM
when the distribution of particle orientations around the viewing sphere are highly anisotropic.
To recreate the same effect we construct a dataset of 40K observations, as before with no noise
to separate the sources of information corruption^{5}^{5}5For completeness we present the same
experiment with noise in the appendix..
We restrict viewing directions to Euler angles with ;
i.e., no observations outside a cone
As above, we assume a Gaussian posterior over the latent protein structure.
Figure 5 shows the result.
We can show that the uncertainty the model has learned correlates with regions in which data has been observed (uncertainty close to 0) and has not been observed (uncertainty close to 1). Due to the pressure from the KLdivergence (see equation (14)) the latter areas of the model default to the prior .
This approach can be a helpful method to identify areas of low local quality of fit.
6.4 Limits: Treating poses as random variables
Extending our method from treating structures as latent variables to treating poses as latent variables is difficult. In the following we analyze why this is the case. Note though that, our method of estimating latent structure can be combined with common methods of pose estimation such as branch and bound (Punjani et al., 2017)
without losing any of the benefits we offer. However, ideally it would be interesting to also learn latent pose posteriors. This would be useful to for example detect outliers in the dataset which is common in real world scenarios.
In an initial experiment, we fix the volume to the ground truth. We subsequently only estimate the poses with a simple Dirac model for each data point . In figure 6, we demonstrate the problem of SGDbased learning for this. For one example (samples on top), we show its true error surface, the global optimum (red star) and the changes over the course of optimization (red line). We observe that due to the high symmetries in the structure, the true pose posterior error surface has many symmetries as well. An estimate depending on its starting position, seems to converge to the closest local optimum only rather than the global one. We would hope to be able to fix this problem in the future by applying more advanced density estimation approaches.
7 Model Criticism and Future Work
This paper introduces practical probabilistic models into the scientific imaging pipeline, where practical refers to scalability through the use of the reparameterization trick. We show how to turn the operators in the pipeline into differentiable maps, as this is required to apply the trick. The main focus of the experiments is to show why this novelty is important, addressing issues such as data efficiency, local uncertainty, and cross validation. Specifically, we found that a parameterized distribution, i.e. the Gaussian, achieves the same quality of fit as a point estimate, i.e. the dirac, while relying on less data. We conclude that our latent variable model is a suitable building block. It can be plugged into many SOTA approaches seamlessly, such as (De la RosaTrevín et al., 2013; Grigorieff, 2007; Scheres, 2012; Tang et al., 2007). We also established that the learned uncertainty is predictive of locations with too few samples. Finally, we demonstrated the limits of our current methods in treating poses as latent variables. This problem, however, does not limit the applicability of our method to latent structures. We thus propose to combine common pose estimation with our latent variable structure estimation. This method benefit from the uncertainty measure but also find globally optimal poses.
In future work we hope to find a way to efficiently learn pose posterior distributions as well. We hope that a reasonable approach would be to use multimodal distributions and thus more advanced density estimation techniques. We will also try to incorporate amortized inference, mentioned in Section 5. Amortization would give the additional advantage of being able to transfer knowledge from protein to protein. Transfer could then lead to more advanced noise and CTF models. Bias in transfer will be a key focus of this effort; i.e., we only want to transfer features of the noise and not the latent structure. Another problem we see with the field of reconstruction algorithms is that the model evaluation can only help to detect variance but not bias in a model class. This is a problem with FSC comparison, but also with our proposal. We believe that an estimate of the dataloglikelihood of a hold out test dataset is generally much better suited. In a probabilistic view, this can be achieved by importance weighting the ELBO (Rezende et al., 2014b).
Acknowledgements
We thank the reviewers for their valuable feedback, in particular AnonReviewer1. This research was funded in part by Google and by the Canadian Institute for Advanced Research.
References
 Alemi et al. (2017) Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. (2017). Fixing a broken elbo. arXiv preprint arXiv:1711.00464.
 Baxter et al. (2009) Baxter, W. T., Grassucci, R. A., Gao, H., and Frank, J. (2009). Determination of signaltonoise ratios and spectral snrs in cryoem lowdose imaging of molecules. Journal of structural biology, 166(2):126–132.

Brubaker et al. (2015)
Brubaker, M. A., Punjani, A., and Fleet, D. J. (2015).
Building proteins in a day: Efficient 3d molecular reconstruction.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 3099–3108.  Cardone et al. (2013) Cardone, G., Heymann, J. B., and Steven, A. C. (2013). One number does not fit all: Mapping local variations in resolution in cryoem reconstructions. Journal of structural biology, 184(2):226–236.
 Chirikjian and Kyatkin (2000) Chirikjian, G. and Kyatkin, A. (2000). Engineering Applications of Noncommutative Harmonic Analysis: With Emphasis on Rotation and Motion Groups. CRC Press.
 De la RosaTrevín et al. (2013) De la RosaTrevín, J., Otón, J., Marabini, R., Zaldivar, A., Vargas, J., Carazo, J., and Sorzano, C. (2013). Xmipp 3.0: an improved software suite for image processing in electron microscopy. Journal of structural biology, 184(2):321–328.
 Egelman (2016) Egelman, E. H. (2016). The current revolution in cryoem. Biophysical journal, 110(5):1008–1012.
 Faruqi et al. (2003) Faruqi, A., Cattermole, D., Henderson, R., Mikulec, B., and Raeburn, C. (2003). Evaluation of a hybrid pixel detector for electron microscopy. Ultramicroscopy, 94(34):263–276.
 Grigorieff (1998) Grigorieff, N. (1998). Threedimensional structure of bovine nadh: ubiquinone oxidoreductase (complex i) at 22 å in ice. Journal of molecular biology, 277(5):1033–1046.
 Grigorieff (2007) Grigorieff, N. (2007). Frealign: highresolution refinement of single particle structures. Journal of structural biology, 157(1):117–125.
 Havin and Jöricke (1994) Havin, V. and Jöricke, B. (1994). The Uncertainty Principle in Harmonic Analysis. SpringerVerlag.
 Henderson et al. (2012) Henderson, R., Sali, A., Baker, M. L., Carragher, B., Devkota, B., Downing, K. H., Egelman, E. H., Feng, Z., Frank, J., Grigorieff, N., et al. (2012). Outcome of the first electron microscopy validation task force meeting. Structure, 20(2):205–214.
 Jaderberg et al. (2015) Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025.
 Jaitly et al. (2010) Jaitly, N., Brubaker, M. A., Rubinstein, J. L., and Lilien, R. H. (2010). A bayesian method for 3d macromolecular structure inference using class average images from single particle electron microscopy. Bioinformatics, 26(19):2406–2415.
 Jordan et al. (1998) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1998). An introduction to variational methods for graphical models. In Learning in graphical models, pages 105–161. Springer.
 Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Autoencoding variational bayes. arXiv preprint arXiv:1312.6114.
 Kohl and Reimer (2008) Kohl, H. and Reimer, L. (2008). Transmission electron microscopy: physics of image formation. Springer.
 Kucukelbir et al. (2014) Kucukelbir, A., Sigworth, F. J., and Tagare, H. D. (2014). Quantifying the local resolution of cryoem density maps. Nature methods, 11(1):63.
 Langlois et al. (2014) Langlois, R., Pallesen, J., Ash, J. T., Ho, D. N., Rubinstein, J. L., and Frank, J. (2014). Automated particle picking for lowcontrast macromolecules in cryoelectron microscopy. Journal of structural biology, 186(1):1–7.
 Marino et al. (2018) Marino, J., Yue, Y., and Mandt, S. (2018). Iterative amortized inference. arXiv preprint arXiv:1807.09356.
 Pettersen et al. (2004) Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., and Ferrin, T. E. (2004). Ucsf chimera—a visualization system for exploratory research and analysis. Journal of computational chemistry, 25(13):1605–1612.
 Pintilie (2010) Pintilie, G. (2010). Greg pintilie’s homepage: people.csail.mit.edu/gdp/cryoem.html.
 Punjani et al. (2017) Punjani, A., Rubinstein, J. L., Fleet, D. J., and Brubaker, M. A. (2017). cryosparc: algorithms for rapid unsupervised cryoem structure determination. Nature Methods, 14(3):290–296.
 Rezende et al. (2016) Rezende, D. J., Eslami, S. A., Mohamed, S., Battaglia, P., Jaderberg, M., and Heess, N. (2016). Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems, pages 4996–5004.
 Rezende et al. (2014a) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014a). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Rezende et al. (2014b)
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014b).
Stochastic backpropagation and approximate inference in deep generative models.
In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1278–1286.
 Rosenthal and Henderson (2003) Rosenthal, P. B. and Henderson, R. (2003). Optimal determination of particle orientation, absolute hand, and contrast loss in singleparticle electron cryomicroscopy. Journal of Molecular Biology, 333(4):721–745.
 Rupp (2009) Rupp, B. (2009). Biomolecular crystallography: principles, practice, and application to structural biology. Garland Science.
 Scheres (2012) Scheres, S. H. (2012). Relion: implementation of a bayesian approach to cryoem structure determination. Journal of structural biology, 180(3):519–530.
 Scheres et al. (2007a) Scheres, S. H., Gao, H., Valle, M., Herman, G. T., Eggermont, P. P., Frank, J., and Carazo, J.M. (2007a). Disentangling conformational states of macromolecules in 3dem through likelihood optimization. Nature methods, 4(1):27.
 Scheres et al. (2007b) Scheres, S. H., NúñezRamírez, R., GómezLlorente, Y., San Martín, C., Eggermont, P. P., and Carazo, J. M. (2007b). Modeling experimental image formation for likelihoodbased classification of electron microscopy data. Structure, 15(10):1167–1177.
 Sigworth (1998) Sigworth, F. (1998). A maximumlikelihood approach to singleparticle image refinement. Journal of structural biology, 122(3):328–339.
 Sigworth (2016) Sigworth, F. J. (2016). Principles of cryoem singleparticle image processing. Microscopy, 65(1):57–67.
 Tang et al. (2007) Tang, G., Peng, L., Baldwin, P. R., Mann, D. S., Jiang, W., Rees, I., and Ludtke, S. J. (2007). Eman2: an extensible image processing suite for electron microscopy. Journal of structural biology, 157(1):38–46.
 Unser et al. (1987) Unser, M., Trus, B. L., and Steven, A. C. (1987). A new resolution criterion based on spectral signaltonoise ratios. Ultramicroscopy, 23(1):39–51.
 Vilas et al. (2018) Vilas, J. L., GómezBlanco, J., Conesa, P., Melero, R., de la RosaTrevín, J. M., Otón, J., Cuenca, J., Marabini, R., Carazo, J. M., Vargas, J., et al. (2018). Monores: automatic and accurate estimation of local resolution for electron microscopy maps. Structure, 26(2):337–344.
 Vulović et al. (2013) Vulović, M., Ravelli, R. B., van Vliet, L. J., Koster, A. J., Lazić, I., Lücken, U., Rullgård, H., Öktem, O., and Rieger, B. (2013). Image formation modeling in cryoelectron microscopy. Journal of structural biology, 183(1):19–32.
 Xu et al. (1997) Xu, Z., Horwich, A. L., and Sigler, P. B. (1997). The crystal structure of the asymmetric groel–groes–(adp) 7 chaperonin complex. Nature, 388(6644):741.
 Zhao et al. (2013) Zhao, J., Brubaker, M. A., and Rubinstein, J. L. (2013). Tmacs: A hybrid template matching and classification system for partiallyautomated particle selection. Journal of structural biology, 181(3):234–242.
Appendix A Visual impressions
First, to give a visual impression of the observations, we learn from we present in Figure 7 samples from the 3 data sets we use. All of the datasets are based on the same protein estimate of the GroELGroES protein(Xu et al., 1997). They differ in the signaltonoise ratio (SNR). The left row represents noise free data, the middle a moderate common noise level, and the right an extreme level of noise. For each observation example, we show also the corresponding Fourier transformations, their real part in the second column and their imaginary part in the third column.
Further we present the qualitative results of fitting the mid and high level datasets with our method. We visualize the respective protein fit from experiment 6.2 in Figure 8 with the Chimera X software package (Pettersen et al., 2004). The two pictures on the top row represent the middle noise fit, respectively the bottom two the high noise fit.
Appendix B Extension to experiment 6.3
We shall execute the same experiment as in section 6.3 given the dataset with intermediate noise. We display the experiments of this result in Figure 9. It is clear that while, missing information leads to large deviation in variance we also find that the noise leads to some variance in the observed area.
Again we visualize the result of the fit in Figure 10.
Appendix C Amortized inference and variational EM for pose estimation
We have not used amortized inference in our experiments. In experiment 6.4 we have modelled poses as local variables and trained them by variational expectation maximization. Other work shows that the amortization gab can be significant
(Scheres et al., 2007b; Alemi et al., 2017; Marino et al., 2018). Hence in order to exclude the gap as a reason for failure, we decided to model local variables. We did run experiments though with ResNets as encoders without success either. We believe the core problem in probabilistic pose estimation is the number of local optima. This makes simple SGD a somewhat poor choice, because we rely on finding the global minimum.Appendix D Remarks to the chosen observation model
The Gaussian noise model is a common but surely oversimplified model (Sigworth, 1998)
. A better model would be the Poisson distribution. The Gaussian is a good approximation to it given a high rate parameter meaning if there is a reasonable high count of radiation hitting the sensors. This is a good assumption for most methods of the field, but can actually be a poor model in some cases of cryo electron microscopy. An example of an elaborate model is presented in
Vulović et al. (2013).
Comments
There are no comments yet.