KLT Picker: Particle Picking Using Data-Driven Optimal Templates

12/12/2019
by   Amitay Eldar, et al.
Tel Aviv University
Yale University
16

Particle picking is currently a critical step in the cryo-EM single particle reconstruction pipeline. Despite extensive work on this problem, for many data sets it is still challenging, especially for low SNR micrographs. We present the KLT (Karhunen Loeve Transform) picker, which is fully automatic and requires as an input only the approximated particle size. In particular, it does not require any manual picking. Our method is designed especially to handle low SNR micrographs. It is based on learning a set of optimal templates through the use of multi-variate statistical analysis via the Karhunen Loeve Transform. We evaluate the KLT picker on publicly available data sets and present high-quality results with minimal manual effort.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 4

page 5

page 6

page 8

page 9

page 10

page 11

page 12

01/18/2022

ASOCEM: Automatic Segmentation Of Contaminations in cryo-EM

Particle picking is currently a critical step in the cryo-electron micro...
02/01/2018

APPLE Picker: Automatic Particle Picking, a Low-Effort Cryo-EM Framework

Particle picking is a crucial first step in the computational pipeline o...
12/14/2011

Automatic post-picking improves particle image detection from Cryo-EM micrographs

Cryo-electron microscopy (cryo-EM) studies using single particle reconst...
12/19/2012

Automatic post-picking using MAPPOS improves particle image detection from Cryo-EM micrographs

Cryo-electron microscopy (cryo-EM) studies using single particle reconst...
11/29/2010

Classifying extremely imbalanced data sets

Imbalanced data sets containing much more background than signal instanc...
04/15/2016

Unsupervised single-particle deep clustering via statistical manifold learning

Motivation: Structural heterogeneity in single-particle cryo-electron mi...
03/18/2021

Localization of Cochlear Implant Electrodes from Cone Beam Computed Tomography using Particle Belief Propagation

Cochlear implants (CIs) are implantable medical devices that can restore...

Code Repositories

KLTpicker

KLT picker: Particle picking using data-driven optimal templates


view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Single Particle Cryo-Electron Microscopy is a well known method used for 3D reconstruction of macro-molecules from their multiple 2D projections. In this process, a sample containing macro-molecules is rapidly frozen in a thin layer of ice, whereby the locations and orientations of the frozen samples in the ice layer are unknown. An electron beam is transmitted through the frozen samples creating their 2D projections. The resulting 2D micrographs suffer from low signal-to-noise ratio (SNR). The images containing the projections are called micrographs and the projections without the noise are called particles. The micrographs mostly contain three types of regions – regions of particles with added noise, regions of noise only, and regions of contaminations. One of the first steps towards 3D reconstruction is locating the particles on the micrographs, a step called “particle picking”. In order to achieve a high resolution 3D model of the macro-molecule, one needs many thousands of particle images (to overcome the high levels of noise).

Over the last years, many algorithms aimed to automate the particle picking step have been proposed. Roughly speaking, such algorithms can be divided into two main categories: semi-automatic and automatic.

Semi-automatic algorithms require the user to select manually hundreds to thousands of particles which are used in various ways in order to automatically detect more particles. RELION [Scheres, 2015] and SIGNATURE [Chen and Grigorieff, 2007]

use manually picked particles as templates and correlate them against the micrographs in order to compute statistical similarity measures. The main idea is that cross-correlation between a template and a particle image with added noise will be larger than cross-correlation with noise only. Another approach, gaining popularity recently, is the use of methods from the field of deep learning. TOPAZ 

[Bepler et al., 2018]

uses manually selected particles to train a convolutional neural network, where Xmipp 

[Sorzano et al., 2004]

uses them to train an ensemble of classifiers. In both cases, the network and the ensemble are later used to detect more particles. Common to all semi-automatic algorithms is the need for a high quality set of manually picked particles. Most of them could also benefit from a collection of images that contain only noise.

Automatic algorithms do not require the user to manually select particles. Nevertheless, they do require some prior knowledge about the macro-molecule. Most of the algorithms require at least an estimate of the particle size (i.e the maximal diameter of the particle) and some of them require the user to adjust all sorts of parameters. The DoG picker 

[Voss et al., 2009] is based on convolving difference of gaussian functions with a micrograph image. This method is suitable for identifying blobs in the image and to sort them by their sizes. The APPLE picker [Heimowitz et al., 2018]

uses cross-correlation between micrograph patches in order to detect areas that are more likely to contain a particle. These areas are then being used to train a support vector machine, which is used to detect particles. Deep learning methods are also gaining popularity in the automatic category. DeepPicker 

[Wang et al., 2016] and fast R-CNN [Xiao and Yang, 2017] use 3D reconstructions of known macro-molecules as labeled data sets to train neural networks that are later used to pick particles. This procedure leads to a significant speedup of the particle picking step.

In this paper we present the KLT (Karhunen Loeve Transform) picker, an automatic particle picking method in which the only input needed by the user is an estimate of the particle’s size. Our method is based on computing for each micrograph a set of data-adaptive templates, which are optimal in the sense that they admit maximal correlation with the particle images. Additionally, they are naturally adapted to coping with the unknown in-plane rotations of the particles, as the absolute value of their correlation with a particle image is rotationlly invariant. The above-mentioned properties of our templates eventually allow to detect particles under low SNR conditions. The estimation of these templates is achieved through the Karhunen Loeve Transform [Maccone, 2009] by utilizing the radial power spectral density (RPSD) [Papoulis, 1977] of the particles. Using these templates gives rise to a score for each patch in the micrograph, based on the well known Likelihood Ratio Test (LRT) [Wasserman, 2013]

. Specifically, the algorithm partitions each micrograph into overlapping patches, and assigns to each of them a score that reflects the probability that it contains a particle. The threshold probability for detecting particles is set either to the theoretical threshold (explained below), or, the user can adjust the threshold to pick a smaller number of patches with the highest scores (high probability for the presence of a particle). In addition the threshold probability can be set to pick patches that are more likely to contain only noise (patches with the lowest scores). One of the advantages of our method is that it can complement other semi-automatic algorithms, that require an initial set of training images. For example, the user can decide to pick only the 10 highest and lowest scored patches out of each micrograph and use them instead of an initial manual picking step. Nevertheless, the KLT picker is fully capable of preforming the entire particle picking step, as we demonstrate on three different data sets in Section 

3. In order to evaluate the picking results, we perform a 3D reconstruction for each data set and compare the resulting gold-standard Fourier Shell Correlation (FSC) resolution van Heel, M. and Schatz, M. [2005] to the published one. The data sets used are Plasmodium Falciparum 80S ribosome [Wong et al., 2014], -Galactosidase [Bartesaghi et al., 2015] and Synaptic RAG1-RAG2 [Ru et al., 2015]. The code for the KLT picker is available at https://github.com/amitayeldar/KLTpicker.

2 Materials and methods

We next detail the different steps in our particle picking algorithm.

2.1 Computing data adaptive particle templates

Let be the particle’s radius and let be a function representing a centered particle whose energy is concentrated in a disk of radius . We consider

to be a random function, with some probability distribution capturing the variability of particle images. For the purpose of particle picking, we seek the set of orthogonal functions that maximize their correlation with the particle images. Specifically,

(1)

where is the disk of radius , denotes the standard inner product on , and is the appropriate norm. In words, is the function which maximizes the correlation with the particle function , is the function which maximizes the correlation with while being orthogonal to , and so on. These functions are the results of the Karhunen Loeve Transform (KLT) [Maccone, 2009] of the particle function . We denote the maximal expected value attained in the right-hand side of (1) by , i.e.

(2)

describes the importance of every function in serving as a template for particle picking, as correlating corresponding to a large against a noisy particle image is expected to produce a larger value (on average) than correlating it against pure noise (which would produce the same correlation regardless of the template used).

Solving (1) (under assumption detailed below ) requires the Radial Power Spectral Density (RPSD) [Papoulis, 1977] of . In the next section, we present a method to estimate the RPSD under the assumptions that is band-limited and is generated from a wide sense stationary random process. Under these assumptions, we show in A that the solutions to (1), in polar coordinates, are of the form

(3)

where is the

’th eigenfunction of the integral equation

(4)

with

being the eigenvalues corresponding to

, and is given by

where is the ’th order Bessel function of the first kind, is the band-limit of , and is the RPSD of , given explicitly as

(5)

where

is the two-dimensional Fourier transform of 

. It is worthwhile to point out that of (3) are steerable functions [Teo, 1999], since correlating a template of the form of (3) with an arbitrarily rotated image has absolute value which is independent of the rotation.

In order to solve (4) and evaluate , one needs to know the particle’s RPSD . In the following section we describe a method to estimate it for a given micrograph. Then, equipped with an estimate for , we use the well known Nyström’s method [Atkinson, 1967] in order to solve (4) numerically. To simplify subsequent notation, we sort all eigenvalues (for all and ) in descending order, and denote the sorted eigenvalues by . We sort the eigenfunctions accordingly. Figure 1 illustrates the first 16 templates  (sorted according to their corresponding eigenvalues ) obtained from the EMPIAR-10028 data set [Wong et al., 2014], and Figure 2 depicts the first eigenvalues corresponding to the templates displayed in Figure 1.

Figure 1: The first 16 functions (for the largest ) obtained from solving (4) using the RPSD estimation procedure described in Section 2.2. The eigenfunctions and the eigenvalues were computed from a single typical micrograph of EMPIAR-10028 data set [Wong et al., 2014], and were normalized so that the largest eigenvalue is one.
Figure 2: The first 500 eigenvalues (sorted in descending order) obtained from solving (4) using the RPSD estimation procedure described in Section 2.2. The red dashed line is a normalized cumulative sum of the eigenvalues. The eigenvalues were computed from a single typical micrograph of EMPIAR-10028 data set [Wong et al., 2014] and were normalized so that the largest eigenvalue is one. It is evident that the eigenvalues decay rapidly, and eigenfunctions are already enough to capture more than of the variability in the particle images (since the normalized cumulative sum of the first 300 eigenvalues exceeds ).

2.2 Radial power spectral density estimation

In this section, we describe a method for estimating the particle’s RPSD (see (5)) for a given micrograph. Towards this end, we partition the micrograph into (possibly overlapping) patches of size

with stride

. Specifically, if is a micrograph, then the first patch is defined by , the second by and so on (for a detailed explanation on how are chosen see B). Then, for each patch we estimate it’s RPSD at equally-spaced points as explained in [Papoulis, 1977], and form the matrix whose th column is the RPSD computed from the ’th patch. We consider the following model for the RPSD of the patches in a given micrograph:

(6)

where are the samples of the noise’s RPSD, and are the samples of of (5) (the particle’s RPSD). We assume that remains the same across all patches, and that is multiplied by a factor in each patch, which accounts for the proportion of the particle contained in the ’th patch. That is, a patch which contains no particle admits , a patch which fully contains a particle admits , and a patch which contains part of a particle admits equal to the proportion of the particle in the patch. According to (6), we can estimate , and by solving

(7)

where is the standard Euclidean norm. See Figure 3 for an illustration of the RPSD estimation process. Note that (7) can be viewed as an instance of Non-Negative Matrix Factorization (NNMF) [Gillis, 2017] of the matrix . In B, we describe a method for estimating the solution of (7) by using alternating least-squares [Kim and Park, 2008], which is a popular approach for NNMF. We illustrate typical values of from the factorization (7) in Figure 4. It is evident that is larger in areas of the micrograph which are densely packed with particles, indicating that the factorization extracts the RPSD of the particles from the correct regions of the micrograph.

Figure 3: Particle and noise RPSD estimation procedure. We divide the micrograph into (possibly overlapping) patches where is less than or equal to the maximal diameter of the particle. For each patch we compute its RPSD and factorize it as a sum of the noise RPSD and the particle RPSD multiplied by a proportion factor .
Figure 4: Heat map of over the micrograph it’s estimated from. The values of were obtained by solving (7) for a typical micrograph of the EMPIAR-10028 data set [Wong et al., 2014], and were normalized so that their largest value is 1. In order to create the heat map, was used to color the ’th patch. As explained in Section 2.2, large values for (bright color) indicate the presence of a particle, and small values (dark color) indicate its absence.

2.3 Particle detection

In the previous sections, we described how to compute a set of templates (together with their eigenvalues describing their importance) adapted to the particle images in the micrograph. However, at this point it is unclear how to make use of the different templates. In particular, since a single template does not capture the variability of all particle images, multiple templates must be used. We should therefore determine how many templates to use and how to combine them for particle picking. To that end, we turn to classical hypothesis testing and derive a scheme where different templates are combined according to their relative importance (through the eigenvalues

). Specifically, we next derive a procedure for discriminating between an image of pure noise, which is defined as the null hypothesis and denoted by

, and an image of a particle plus noise, which is defined as the alternative hypothesis and denoted by . That is, we consider the model

(8)

where is a flattened micrograph patch, is a flattened image of the noise only, is a flattened particle image without noise and with zero values outside of , and is the floor function. We assume that the noise is white, which is enforced by prewhitening the micrograph using the samples of the estimated noise RPSD (obtained in Section 2.2). Given the model (8), we test for the hypotheses and by the well-known Likelihood Ratio Test (LRT) [Wasserman, 2013]. We derive the LRT for the case where and

are normally distributed, and demonstrate the practical effectiveness of the resulting test in Section 

3. To derive the LRT in this case, we need the covariance matrices for both hypotheses, namely and . Assuming that and are independent, it follows that

(9)

where

is the standard deviation of the noise, and

is the identity matrix. Since is typically unknown, we propose a method for estimating it from a given micrograph in C. To evaluate (9) it therefore remains to estimate . Let be an equally-spaced (Cartesian) grid with points over the square , and denote by the flattened version of . By setting the values of outside of to be zero, we have that for every . Setting also the values of of (1) outside of to zero, and denoting by the matrix whose ’th column consists of the flattened samples of on , leads to the following approximation of the particle’s covariance matrix (see D)

(10)

where is a diagonal matrix with the eigenvalues on its main diagonal. The parameter is chosen such that the error in the approximation (10) is small, which essentially depends on the decay rate of the eigenvalues (see D for a detailed explanation on how the parameter is determined).

Given the covariance matrix of (9) (computed via the approximation (10)), we use a score based on the LRT for comparing between two multivariate normal distributions as follows. Each micrograph patch of size (flattened to a vector) is considered as a realization of and is assigned a score

(11)

where is a matrix determinant. See D for an elaborate derivation of this score and an efficient algorithm to compute it. High value of indicates the presence of a particle in the patch. We consider the scores as a matrix whose elements are . Extracting the particle’s coordinates out of the scoring matrix is done by using a version of the non-maximum suppression algorithm [Neubeck and Van Gool, 2006]. This greedy algorithm first chooses the highest score in the scoring matrix as the first picked particle’s center. Then, in order to prevent nearby pixels from being considered as particles, all pixels within a user-defined radius are excluded from the picking, and the procedure resumes with the next largest score, and so on. We set the exclusion radius to be the particle size, however, smaller values may give better results for closely packed, irregularly shaped particles. The picking continues until the remaining scores are below zero or until the algorithm picks a predefined number of particles. Stopping when the remaining scores are negative is based on the observation that holds whenever , which means that all of the remaining patches have higher probability to contain noise only. See Figure 5 for an example of a scoring matrix and the particles picked from it.

Figure 5: Example of a typical scoring matrix as a heat map (right image) and particles picked from it (left image) with a threshold of zero. The data set is EMPIAR-10028 [Wong et al., 2014].

See Figure 6 and algorithm 1 for a diagram and an algorithm that summarize the particle picking process.

1:Required: Micrograph and the particles’ radius .
2:Partition the micrograph into patches with stride in each dimension (by default, is chosen to be 80% of the maximal diameter of the particle and , see B for more details).
3:Compute the RPSD of each patch as described in section 2.2.
4:Estimate the particle and noise RPSD , respectively by solving (7).
5:Prewhiten the micrograph using .
6:Compute the set of eigenfunctions and eigenvalues of (1) by solving (4) using and the Nyström method.
7:Estimate the particle’s covariance matrix according to (10) and the noise standard deviation , as detailed in C.
8:Compute (see (9)).
9:Partition the micrograph into patches with stride in each dimension and compute the LRT score for each patch according to (11). The outcome is a scoring matrix.
10:Extract the particles’ coordinates from the scoring matrix as described at the end of Section 2.3.
Algorithm 1 The KLT picker
Figure 6: For each micrograph, we estimate the particle and noise RPSD as explained in Section 2.2. Then, we prewhiten the micrograph, and estimate the templates using (4). Last, we compute the LRT scoring matrix and extract the particles’ coordinates from it, as described at the end of Section 2.3.

3 Experimental results

In this section, we demonstrate the performance of our algorithm on three different data sets, as follows. First, the micrographs were downsampled to 20%-50% of their original size and were normalized to zero mean and variance of one. In addition, each micrograph was bandpass filtered, eliminating 5% of its lowest and highest radial frequencies. The lowest frequencies were eliminated since they do not contain enough samples for estimating the RPSD, and the highest frequencies were eliminated since they contain mostly noise. Throughout all of the experiments, we used the following parameters for the KLT picker:

and are 20% and 80%, respectively, of the maximal diameter of the particle, , and as explained above, the detection threshold is set to zero. Note that is taken to be smaller than half of the estimated particle’s size, as this leads, in practice, to better centered particle selections. The obtained particle coordinates were imported into RELION [Scheres, 2015] and the routine described in [Scheres, 2014] was executed in order to obtain a 3D reconstruction. The EMDB [in Europe, ] model corresponding to each data set was used as the ab initio model for the 3D refinement process.

The procedure described above is demonstrated on the following data sets of the EMPIAR repository [Iudin et al., 2016]: EMPIAR-10028 (Plasmodium Falciparum 80S ribosome)  [Wong et al., 2014], EMPIAR-10061 (-Galactosidase) [Bartesaghi et al., 2015], and EMPIAR-10049 (Synaptic RAG1-RAG2) [Ru et al., 2015].

Our algorithm was implemented in Matlab, and all the experiments were executed on a Linux machine with 16 cores running at 2.1GHz, 768GB of memory, and an Nvidia Titan GPU.

3.1 Plasmodium Falciparum 80S

The Plasmodium Falciparum 80S data set consists of micrographs of size pixels, with pixel size of . The approximate size of the particle is pixels (according to [Wong et al., 2014]). Examples of the KLT picking results are shown in Figure 10. The KLT picker picked particles from the micrographs. In order to purge the set of picked particles, a 2D class averaging step was executed (using classes) and particles retained. Examples of the selected 2D classes are shown in Figure 7. The 3D reconstruction reached a gold standard Fourier Shell Correlation (FSC) resolution of . Figure 8 presents the FSC curve produced by RELION’s post-processing task. In comparison, particles were picked by [Wong et al., 2014] using RELION’s automated selection tool, and after 2D and 3D class averaging steps, particles remained. The reconstruction in [Wong et al., 2014] reached a gold standard FSC resolution of . Figure 9 presents surface views of our 3D reconstructed model and the reconstructed model of  [Wong et al., 2014]. The running time of the KLT picker for a single micrograph, using a GPU, was approximately seconds (following a one-time preprocessing step of approximately seconds).

3.2 -Galactosidase

The -Galactosidase data set consists of 1539 micrographs, each of size pixels, with pixel size of 0.64. The approximate size of the particle according to [Bartesaghi et al., 2015] is pixels. KLT picker picked particles. Examples of the results of the picking are shown in Figure 11. After executing a 2D class averaging step (using classes), particles retained. Examples of the selected 2D classes are shown in Figure 7. The 3D reconstruction reached a gold standard FSC resolution of . Figure 8 presents the FSC curve produced by RELION’s post-processing task. It is important to note that the micrographs’ raw movies were not available to us, therefore better results may be achieved by further polishing. In comparison, [Bartesaghi et al., 2015] picked particles using an automatic method based on correlation with a Gaussian ring. Than, a 3D classification in FREALIGN [Grigorieff, 2007] was used to discard bad selections, and particles retained for the 3D refinement step (that was executed in the same program). The reconstruction in [Bartesaghi et al., 2015] reached a gold standard FSC resolution of . Figure 9 presents surface views of the 3D models. The running time for a single micrograph, using a GPU, was approximately seconds (following a one-time preprocessing step of approximately seconds).

3.3 Synaptic RAG1-RAG2

The Synaptic RAG1-RAG2 consists of 680 micrographs, each of size pixels, with pixel size of 1.23. The approximate size of the particle according to [Ru et al., 2015] is pixels. The KLT picker picked particles. Examples of the results of the picking are shown in Figure 12. After executing a 2D class averaging step (using classes), particles retained. Examples of the selected 2D classes are shown in Figure 7. The 3D reconstruction reached a gold standard FSC resolution of . Figure 8 presents the FSC curve produced by RELION’s post-processing task. We note that similarly to the -Galactosidase data set, the raw movies were unavailable to us, hence the results are sub-optimal. In comparison, [Ru et al., 2015] used a semi-automatic method executed in SAMUEL [Shi et al., 2008] for the particle picking step. Initially, particles were manually selected and classified by a 2D class averaging step. Good classes were manually chosen and the particles in those classes were used as templates for an automatic picking process. As described in [Ru et al., 2015], additional manual effort was required to discard bad selections, which resulted in a final data set of particles. The 3D refinement step was executed in RELION, and the reconstruction reached a gold standard FSC resolution of . Figure 9 presents surface views of the 3D model. The running time for a single micrograph, using a GPU, was approximately seconds (following a one-time preprocessing step of approximately seconds).

Figure 7: Examples of 2D class averages of Plasmodium Falciparum 80S (left), -Galactosidase(middle), and Synaptic RAG1-RAG2 (right) produced by RELION from particles picked by the KLT picker.
Figure 8: FSC curves of Plasmodium Falciparum 80S (left), -Galactosidase (middle), and Synaptic RAG1-RAG2 (right) produced by RELION’s post-processing task.
Figure 9: Surface views taken from the 3D reconstructions of Plasmodium Falciparum 80S (left), -Galactosidase (middle), and Synaptic RAG1-RAG2 (right) . The top surface views correspond to reconstructions from particles picked by the KLT picker, and the bottom ones to the reconstruction of [Wong et al., 2014][Bartesaghi et al., 2015], and [Ru et al., 2015], respectively. The views were generated using UCSF Chimera software [Pettersen et al., 2004].
Figure 10: Examples of picking results of the Plasmodium Falciparum 80S data set. The micrographs are shown in the left column, picked particles with threshold equal to zero are in the middle columns, and the top 50 picked particles are in the right column.
Figure 11: Examples of picking results of the -Galactosidase data set. See Figure 10 for details.
Figure 12: Examples of picking results of the Synaptic RAG1-RAG2 data set. See Figure 10 for details.

4 Discussion

4.1 Review of the results

The data sets used to demonstrate our method represent different types of micrographs. The Plasmodium Falciparum 80S is the least challenging, as the macro-molecule is large () and the particles can easily be spotted by a naked eye. The -Galactosidase is more challenging, as the macro-molecule is smaller (), and it is harder to see the particles on the micrographs. Finally, the Synaptic RAG1-RAG2 is the most challenging data set, as it corresponds to smaller macro-molecule () and has low SNR. With all these data set, we achieved high resolutions with minimum manual effort. In addition, the surface views in Figure 9 show that the 3D reconstructions we got are compatible with the published ones.

Throughout the entire reconstruction process, the only manual step was the selection of classes out of RELION’s 2D classification output. We did not use 3D classification, nor any other method to filter the KLT picker’s picked particles. It is also important to note that the parameters in all of the experiments where chosen automatically based on the particle’s size, so there is no need to adjust them manually.

4.2 Comparison with other particle picking methods

Particle picking methods can be divided into two categories. The first relies on prior assumptions on the particle’s structure, such as Gaussian shape [Voss et al., 2009] or resemblance to other known macro-molecules [Wang et al., 2016, Xiao and Yang, 2017]. The second relies solely on information from the data itself, which can be a set of manually picked particles [Scheres, 2015, Bepler et al., 2018], or a set of micrograph patches that were automatically selected [Heimowitz et al., 2018]. The KLT picker is in between both categories. On one hand, it relies on the assumption that the projection images can be approximated by a band-limited, rotationally invariant, wide sense stationary process restricted to a disk. This assumption is non-restrictive in practice, as exemplified by our results. On the other hand, the templates of the KLT picker are adapted to every micrograph separately, through the RPSD of the particles and of the noise.

5 Conclusion

In this paper, we presented the KLT picker, an automated, data driven particle picker, that requires only one input which is the estimated particle’s size. The KLT picker is designed especially to handle low SNR micrographs, it is easy to use, fast, and mathematically rigorous. We evaluated the performance of the KLT picker on three different data sets. In all of them we achieved high-quality results with minimal manual effort.

The algorithm can be further improved as follows. Micrographs tend to contain contaminations of various forms, from small contaminations to whole areas containing either ice or the edges of the supporting grid. These areas can degrade the estimation of the RPSD, which is the essence of our method. Devising an algorithm to automatically identify these areas may improve the picking. Dense particle clusters can poses a problem to the KLT picker, as the scoring matrix, introduced in Section 2.3, tends to assign high scores to the interface between particles instead of to the center of each particle separately, see an example in Figure 13. One possible solution to this problem is to replace the current algorithm for extracting particles, which uses the individual values of the scoring matrix (described at the end of Section 2.3), with an algorithm that considers the entire neighborhood of a particle candidate.

Figure 13: An example of a dense particles cluster for which the KLT picker picked the interface between particles instead of the center of each particle. Micrograph taken from EMPIAR-10028 data set [Wong et al., 2014].

Acknowledgments

This research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 723991 - CRYOMATH).

Appendix A Deriving (3) and (4)

The Karhunen Loeve Theorem [Maccone, 2009] states that the solutions to (1) are the eigenfunctions of the autocorrelation operator , that is,

(12)

Under the assumptions that is band-limited, wide sense stationary, and that is radially-symmetric (i.e ), the Wiener–Khinchin theorem [Cohen, 1998] states that the radial power spectral density (RPSD) of , given by (5), is the Fourier transform of the autocorrelation . Hence, by taking the inverse Fourier transform,

(13)

where is the band-limit of . Denote , , and write the solutions of (12) as

(14)

Substituting (14) into (12) yields

(15)

Note that (13) in polar coordinates is given by

thus

(16)

Changing variables in the innermost integral in (16) to gives

(17)

where is the ’th order Bessel function of the first kind. Substituting back (17) in (16) gives

(18)

Changing variables in the innermost integral in (18) to gives

(19)

Using (19) in (18) we get

where .

Appendix B Estimating the solution of (7)

In this section, we outline an algorithm for estimating the solution of (7), that is of

(20)

Equation (20) can be viewed as an instance of a non-negative matrix factorization problem (NNMF) [Gillis, 2017], which is NP-hard. One method for estimating solutions of NNMF problems is by alternating least-squares (ALS) [Kim and Park, 2008]. The idea behind this iterative method is solving, in each step, the NNMF minimization problem for one of the variables, while treating the others as constants. If the method converges, then it converges to a local minimum of the NNMF problem. In our case, the variables are the vectors , and the ALS iterations are given by (for iteration )

(21)

As the initial vectors we set

(22)

where for . To justify this choice, recall that is modeled as

(23)

thus,

which implies

Since we expect at least one patch to include a complete particle, and at least one patch to contain only noise, we get , which yields

That is is initialized close to its true value. As for , (23) implies

where the third equality holds due to the positivity of all arguments. As written above, and hence

that is, is also initialized close to its true value.

Next, we derive an explicit formula for the solution of (21). Using basic algebra, we can rewrite the sum in  (20) in three equivalent forms

(24)

Treating as a variable and as constants in the second line, gives a sum of convex parabolas for the variables . The minimum of this sum, under the constraint is attained at the minimum of these parabolas whenever they are positive and at else. The same applies when treating as the variable in the third line (and the other as constants). Treating as the variable in the forth line, implies that the minimum is given as follow. For each parabola in the sum (24), if its minimum is obtained at a negative , then the minimum of the sum is attained at . If the minimum is obtained at , we set . Otherwise, the minimum of the sum is attained at the minimum of the parabola. To conclude, we get the following formulas for the solutions of the ’th iteration in (21)

(25)

where for

and

One can prove that if can be factorized as in  (23), and the iterations of (25) converge, then in the limit of in (25), it holds that

(26)

where can be estimated from the noise variance. This implies that we can recover . We omit the details of the proof. In C, we describe a method to estimate the noise variance from a given micrograph. The procedure for solving (20) is summarized in Algorithm 2.

1:Required: Matrix .
2:Compute the initial vectors using (22) and using (25).
3:
4:while   do
5:     Compute using (25)
6:     
7:end while
8:Output:
Algorithm 2 ALS algorithm

Finally, we return to the computation of the columns of . Each column is the RPSD of a micrograph patch of size with stride , as described at the beginning of Section 2.2. These parameters are determined as follows. In (23) we assume that can be factorized as a combination of only . However, this model is only an approximation, and in practice it holds that

(27)

where is an error term matrix. Under the assumptions that the expectancy of the rows of is zero, and that the columns of are uncorrelated and with the same variance, one can prove that the limit of (25) when using of (27) stays the same as in (26), but with different values for (we omit the proof). These values can be estimated from the noise variance, and therefore, we can recover for the more general model (27). Now, taking clearly breaks the assumptions for and taking does not use all of the data, hence we used . As for the parameter , ideally we would like a patch to contain no more then one complete particle. Taking much bigger then the particle diameter can cause patches to contain two or more particles, and taking it much smaller will not allow patches to contain enough of the particle. In practice, it seems that taking equal 80% of the particle diameter yields good results.

Appendix C Estimating the noise variance

The noise variance is needed to recover from (26) and to evaluate (9). Assuming that both the noise and the particle function are generated from a wide sense stationary random process and that they are uncorrelated, it follows that the variance of a given micrograph patch is the sum of the noise variance and a fraction of the particle’s variance, which is proportional to the fraction of the particle contained in the patch. As the variance is a positive number, patches with lower variance are more likely to contain only noise. With that in mind, we partition the micrograph into overlapping particle-size patches with stride equal to . Then, we approximate the variance of each patch as explained in Wasserman [2013] and finally estimate the noise variance by averaging of the lowest variance results. Choosing depends on the number of particles in the micrograph, which is unknown to us. We have observed experimentally that taking provides satisfactory results in low-SNR conditions.

Appendix D Deriving the approximation (10)

In this section, we derive an approximation for the particle’s covariance matrix , where

(28)

(see also Section 2.3). The Karhunen Loeve Theorem [Maccone, 2009] states that

where are given by (1), and moreover, . Therefore, the autocorrelation can be written as

Using (28) we get

(29)

In order to define a truncation rule for the infinite sum in (29), we consider the relative error term

(30)
(31)

where is the standard norm and (31) follows from (30) due to orthogonality of . Then, we define the following truncation rule. First we truncate the infinite sum at terms where is chosen such that

and then, we truncate it again to terms such that

(32)

By standard linear algebra, it can be shown that the truncation rule of (32) yields the desired approximation of (10)

(33)

where and are defined in Section 2.3

Appendix E Derivation and computation of (11)

Our goal is to compute the score

(34)

Under the assumptions that the particle is distributed multivariate normal, and that the noise is white, the probability density function (PDF) of a micrograph patch containing a particle plus noise is

(35)

where is a matrix determinant. The PDF of a patch with noise only is

(36)

Substituting the above PDFs in  (34

) and replacing the random variable

with its patch realization gives (see also (11))