KLT picker: Particle picking using data-driven optimal templates
Particle picking is currently a critical step in the cryo-EM single particle reconstruction pipeline. Despite extensive work on this problem, for many data sets it is still challenging, especially for low SNR micrographs. We present the KLT (Karhunen Loeve Transform) picker, which is fully automatic and requires as an input only the approximated particle size. In particular, it does not require any manual picking. Our method is designed especially to handle low SNR micrographs. It is based on learning a set of optimal templates through the use of multi-variate statistical analysis via the Karhunen Loeve Transform. We evaluate the KLT picker on publicly available data sets and present high-quality results with minimal manual effort.READ FULL TEXT VIEW PDF
KLT picker: Particle picking using data-driven optimal templates
Single Particle Cryo-Electron Microscopy is a well known method used for 3D reconstruction of macro-molecules from their multiple 2D projections. In this process, a sample containing macro-molecules is rapidly frozen in a thin layer of ice, whereby the locations and orientations of the frozen samples in the ice layer are unknown. An electron beam is transmitted through the frozen samples creating their 2D projections. The resulting 2D micrographs suffer from low signal-to-noise ratio (SNR). The images containing the projections are called micrographs and the projections without the noise are called particles. The micrographs mostly contain three types of regions – regions of particles with added noise, regions of noise only, and regions of contaminations. One of the first steps towards 3D reconstruction is locating the particles on the micrographs, a step called “particle picking”. In order to achieve a high resolution 3D model of the macro-molecule, one needs many thousands of particle images (to overcome the high levels of noise).
Over the last years, many algorithms aimed to automate the particle picking step have been proposed. Roughly speaking, such algorithms can be divided into two main categories: semi-automatic and automatic.
Semi-automatic algorithms require the user to select manually hundreds to thousands of particles which are used in various ways in order to automatically detect more particles. RELION [Scheres, 2015] and SIGNATURE [Chen and Grigorieff, 2007]
use manually picked particles as templates and correlate them against the micrographs in order to compute statistical similarity measures. The main idea is that cross-correlation between a template and a particle image with added noise will be larger than cross-correlation with noise only. Another approach, gaining popularity recently, is the use of methods from the field of deep learning. TOPAZ[Bepler et al., 2018]
uses manually selected particles to train a convolutional neural network, where Xmipp[Sorzano et al., 2004]
uses them to train an ensemble of classifiers. In both cases, the network and the ensemble are later used to detect more particles. Common to all semi-automatic algorithms is the need for a high quality set of manually picked particles. Most of them could also benefit from a collection of images that contain only noise.
Automatic algorithms do not require the user to manually select particles. Nevertheless, they do require some prior knowledge about the macro-molecule. Most of the algorithms require at least an estimate of the particle size (i.e the maximal diameter of the particle) and some of them require the user to adjust all sorts of parameters. The DoG picker[Voss et al., 2009] is based on convolving difference of gaussian functions with a micrograph image. This method is suitable for identifying blobs in the image and to sort them by their sizes. The APPLE picker [Heimowitz et al., 2018]
uses cross-correlation between micrograph patches in order to detect areas that are more likely to contain a particle. These areas are then being used to train a support vector machine, which is used to detect particles. Deep learning methods are also gaining popularity in the automatic category. DeepPicker[Wang et al., 2016] and fast R-CNN [Xiao and Yang, 2017] use 3D reconstructions of known macro-molecules as labeled data sets to train neural networks that are later used to pick particles. This procedure leads to a significant speedup of the particle picking step.
In this paper we present the KLT (Karhunen Loeve Transform) picker, an automatic particle picking method in which the only input needed by the user is an estimate of the particle’s size. Our method is based on computing for each micrograph a set of data-adaptive templates, which are optimal in the sense that they admit maximal correlation with the particle images. Additionally, they are naturally adapted to coping with the unknown in-plane rotations of the particles, as the absolute value of their correlation with a particle image is rotationlly invariant. The above-mentioned properties of our templates eventually allow to detect particles under low SNR conditions. The estimation of these templates is achieved through the Karhunen Loeve Transform [Maccone, 2009] by utilizing the radial power spectral density (RPSD) [Papoulis, 1977] of the particles. Using these templates gives rise to a score for each patch in the micrograph, based on the well known Likelihood Ratio Test (LRT) [Wasserman, 2013]
. Specifically, the algorithm partitions each micrograph into overlapping patches, and assigns to each of them a score that reflects the probability that it contains a particle. The threshold probability for detecting particles is set either to the theoretical threshold (explained below), or, the user can adjust the threshold to pick a smaller number of patches with the highest scores (high probability for the presence of a particle). In addition the threshold probability can be set to pick patches that are more likely to contain only noise (patches with the lowest scores). One of the advantages of our method is that it can complement other semi-automatic algorithms, that require an initial set of training images. For example, the user can decide to pick only the 10 highest and lowest scored patches out of each micrograph and use them instead of an initial manual picking step. Nevertheless, the KLT picker is fully capable of preforming the entire particle picking step, as we demonstrate on three different data sets in Section3. In order to evaluate the picking results, we perform a 3D reconstruction for each data set and compare the resulting gold-standard Fourier Shell Correlation (FSC) resolution van Heel, M. and Schatz, M.  to the published one. The data sets used are Plasmodium Falciparum 80S ribosome [Wong et al., 2014], -Galactosidase [Bartesaghi et al., 2015] and Synaptic RAG1-RAG2 [Ru et al., 2015]. The code for the KLT picker is available at https://github.com/amitayeldar/KLTpicker.
We next detail the different steps in our particle picking algorithm.
Let be the particle’s radius and let be a function representing a centered particle whose energy is concentrated in a disk of radius . We consider
to be a random function, with some probability distribution capturing the variability of particle images. For the purpose of particle picking, we seek the set of orthogonal functions that maximize their correlation with the particle images. Specifically,
where is the disk of radius , denotes the standard inner product on , and is the appropriate norm. In words, is the function which maximizes the correlation with the particle function , is the function which maximizes the correlation with while being orthogonal to , and so on. These functions are the results of the Karhunen Loeve Transform (KLT) [Maccone, 2009] of the particle function . We denote the maximal expected value attained in the right-hand side of (1) by , i.e.
describes the importance of every function in serving as a template for particle picking, as correlating corresponding to a large against a noisy particle image is expected to produce a larger value (on average) than correlating it against pure noise (which would produce the same correlation regardless of the template used).
Solving (1) (under assumption detailed below ) requires the Radial Power Spectral Density (RPSD) [Papoulis, 1977] of . In the next section, we present a method to estimate the RPSD under the assumptions that is band-limited and is generated from a wide sense stationary random process. Under these assumptions, we show in A that the solutions to (1), in polar coordinates, are of the form
where is the
’th eigenfunction of the integral equation
being the eigenvalues corresponding to, and is given by
where is the ’th order Bessel function of the first kind, is the band-limit of , and is the RPSD of , given explicitly as
is the two-dimensional Fourier transform of. It is worthwhile to point out that of (3) are steerable functions [Teo, 1999], since correlating a template of the form of (3) with an arbitrarily rotated image has absolute value which is independent of the rotation.
In order to solve (4) and evaluate , one needs to know the particle’s RPSD . In the following section we describe a method to estimate it for a given micrograph. Then, equipped with an estimate for , we use the well known Nyström’s method [Atkinson, 1967] in order to solve (4) numerically. To simplify subsequent notation, we sort all eigenvalues (for all and ) in descending order, and denote the sorted eigenvalues by . We sort the eigenfunctions accordingly. Figure 1 illustrates the first 16 templates (sorted according to their corresponding eigenvalues ) obtained from the EMPIAR-10028 data set [Wong et al., 2014], and Figure 2 depicts the first eigenvalues corresponding to the templates displayed in Figure 1.
In this section, we describe a method for estimating the particle’s RPSD (see (5)) for a given micrograph. Towards this end, we partition the micrograph into (possibly overlapping) patches of size
with stride. Specifically, if is a micrograph, then the first patch is defined by , the second by and so on (for a detailed explanation on how are chosen see B). Then, for each patch we estimate it’s RPSD at equally-spaced points as explained in [Papoulis, 1977], and form the matrix whose th column is the RPSD computed from the ’th patch. We consider the following model for the RPSD of the patches in a given micrograph:
where are the samples of the noise’s RPSD, and are the samples of of (5) (the particle’s RPSD). We assume that remains the same across all patches, and that is multiplied by a factor in each patch, which accounts for the proportion of the particle contained in the ’th patch. That is, a patch which contains no particle admits , a patch which fully contains a particle admits , and a patch which contains part of a particle admits equal to the proportion of the particle in the patch. According to (6), we can estimate , and by solving
where is the standard Euclidean norm. See Figure 3 for an illustration of the RPSD estimation process. Note that (7) can be viewed as an instance of Non-Negative Matrix Factorization (NNMF) [Gillis, 2017] of the matrix . In B, we describe a method for estimating the solution of (7) by using alternating least-squares [Kim and Park, 2008], which is a popular approach for NNMF. We illustrate typical values of from the factorization (7) in Figure 4. It is evident that is larger in areas of the micrograph which are densely packed with particles, indicating that the factorization extracts the RPSD of the particles from the correct regions of the micrograph.
In the previous sections, we described how to compute a set of templates (together with their eigenvalues describing their importance) adapted to the particle images in the micrograph. However, at this point it is unclear how to make use of the different templates. In particular, since a single template does not capture the variability of all particle images, multiple templates must be used. We should therefore determine how many templates to use and how to combine them for particle picking. To that end, we turn to classical hypothesis testing and derive a scheme where different templates are combined according to their relative importance (through the eigenvalues
). Specifically, we next derive a procedure for discriminating between an image of pure noise, which is defined as the null hypothesis and denoted by, and an image of a particle plus noise, which is defined as the alternative hypothesis and denoted by . That is, we consider the model
where is a flattened micrograph patch, is a flattened image of the noise only, is a flattened particle image without noise and with zero values outside of , and is the floor function. We assume that the noise is white, which is enforced by prewhitening the micrograph using the samples of the estimated noise RPSD (obtained in Section 2.2). Given the model (8), we test for the hypotheses and by the well-known Likelihood Ratio Test (LRT) [Wasserman, 2013]. We derive the LRT for the case where and
are normally distributed, and demonstrate the practical effectiveness of the resulting test in Section3. To derive the LRT in this case, we need the covariance matrices for both hypotheses, namely and . Assuming that and are independent, it follows that
is the standard deviation of the noise, andis the identity matrix. Since is typically unknown, we propose a method for estimating it from a given micrograph in C. To evaluate (9) it therefore remains to estimate . Let be an equally-spaced (Cartesian) grid with points over the square , and denote by the flattened version of . By setting the values of outside of to be zero, we have that for every . Setting also the values of of (1) outside of to zero, and denoting by the matrix whose ’th column consists of the flattened samples of on , leads to the following approximation of the particle’s covariance matrix (see D)
where is a diagonal matrix with the eigenvalues on its main diagonal. The parameter is chosen such that the error in the approximation (10) is small, which essentially depends on the decay rate of the eigenvalues (see D for a detailed explanation on how the parameter is determined).
Given the covariance matrix of (9) (computed via the approximation (10)), we use a score based on the LRT for comparing between two multivariate normal distributions as follows. Each micrograph patch of size (flattened to a vector) is considered as a realization of and is assigned a score
where is a matrix determinant. See D for an elaborate derivation of this score and an efficient algorithm to compute it. High value of indicates the presence of a particle in the patch. We consider the scores as a matrix whose elements are . Extracting the particle’s coordinates out of the scoring matrix is done by using a version of the non-maximum suppression algorithm [Neubeck and Van Gool, 2006]. This greedy algorithm first chooses the highest score in the scoring matrix as the first picked particle’s center. Then, in order to prevent nearby pixels from being considered as particles, all pixels within a user-defined radius are excluded from the picking, and the procedure resumes with the next largest score, and so on. We set the exclusion radius to be the particle size, however, smaller values may give better results for closely packed, irregularly shaped particles. The picking continues until the remaining scores are below zero or until the algorithm picks a predefined number of particles. Stopping when the remaining scores are negative is based on the observation that holds whenever , which means that all of the remaining patches have higher probability to contain noise only. See Figure 5 for an example of a scoring matrix and the particles picked from it.
In this section, we demonstrate the performance of our algorithm on three different data sets, as follows. First, the micrographs were downsampled to 20%-50% of their original size and were normalized to zero mean and variance of one. In addition, each micrograph was bandpass filtered, eliminating 5% of its lowest and highest radial frequencies. The lowest frequencies were eliminated since they do not contain enough samples for estimating the RPSD, and the highest frequencies were eliminated since they contain mostly noise. Throughout all of the experiments, we used the following parameters for the KLT picker:and are 20% and 80%, respectively, of the maximal diameter of the particle, , and as explained above, the detection threshold is set to zero. Note that is taken to be smaller than half of the estimated particle’s size, as this leads, in practice, to better centered particle selections. The obtained particle coordinates were imported into RELION [Scheres, 2015] and the routine described in [Scheres, 2014] was executed in order to obtain a 3D reconstruction. The EMDB [in Europe, ] model corresponding to each data set was used as the ab initio model for the 3D refinement process.
The procedure described above is demonstrated on the following data sets of the EMPIAR repository [Iudin et al., 2016]: EMPIAR-10028 (Plasmodium Falciparum 80S ribosome) [Wong et al., 2014], EMPIAR-10061 (-Galactosidase) [Bartesaghi et al., 2015], and EMPIAR-10049 (Synaptic RAG1-RAG2) [Ru et al., 2015].
Our algorithm was implemented in Matlab, and all the experiments were executed on a Linux machine with 16 cores running at 2.1GHz, 768GB of memory, and an Nvidia Titan GPU.
The Plasmodium Falciparum 80S data set consists of micrographs of size pixels, with pixel size of . The approximate size of the particle is pixels (according to [Wong et al., 2014]). Examples of the KLT picking results are shown in Figure 10. The KLT picker picked particles from the micrographs. In order to purge the set of picked particles, a 2D class averaging step was executed (using classes) and particles retained. Examples of the selected 2D classes are shown in Figure 7. The 3D reconstruction reached a gold standard Fourier Shell Correlation (FSC) resolution of . Figure 8 presents the FSC curve produced by RELION’s post-processing task. In comparison, particles were picked by [Wong et al., 2014] using RELION’s automated selection tool, and after 2D and 3D class averaging steps, particles remained. The reconstruction in [Wong et al., 2014] reached a gold standard FSC resolution of . Figure 9 presents surface views of our 3D reconstructed model and the reconstructed model of [Wong et al., 2014]. The running time of the KLT picker for a single micrograph, using a GPU, was approximately seconds (following a one-time preprocessing step of approximately seconds).
The -Galactosidase data set consists of 1539 micrographs, each of size pixels, with pixel size of 0.64. The approximate size of the particle according to [Bartesaghi et al., 2015] is pixels. KLT picker picked particles. Examples of the results of the picking are shown in Figure 11. After executing a 2D class averaging step (using classes), particles retained. Examples of the selected 2D classes are shown in Figure 7. The 3D reconstruction reached a gold standard FSC resolution of . Figure 8 presents the FSC curve produced by RELION’s post-processing task. It is important to note that the micrographs’ raw movies were not available to us, therefore better results may be achieved by further polishing. In comparison, [Bartesaghi et al., 2015] picked particles using an automatic method based on correlation with a Gaussian ring. Than, a 3D classification in FREALIGN [Grigorieff, 2007] was used to discard bad selections, and particles retained for the 3D refinement step (that was executed in the same program). The reconstruction in [Bartesaghi et al., 2015] reached a gold standard FSC resolution of . Figure 9 presents surface views of the 3D models. The running time for a single micrograph, using a GPU, was approximately seconds (following a one-time preprocessing step of approximately seconds).
The Synaptic RAG1-RAG2 consists of 680 micrographs, each of size pixels, with pixel size of 1.23. The approximate size of the particle according to [Ru et al., 2015] is pixels. The KLT picker picked particles. Examples of the results of the picking are shown in Figure 12. After executing a 2D class averaging step (using classes), particles retained. Examples of the selected 2D classes are shown in Figure 7. The 3D reconstruction reached a gold standard FSC resolution of . Figure 8 presents the FSC curve produced by RELION’s post-processing task. We note that similarly to the -Galactosidase data set, the raw movies were unavailable to us, hence the results are sub-optimal. In comparison, [Ru et al., 2015] used a semi-automatic method executed in SAMUEL [Shi et al., 2008] for the particle picking step. Initially, particles were manually selected and classified by a 2D class averaging step. Good classes were manually chosen and the particles in those classes were used as templates for an automatic picking process. As described in [Ru et al., 2015], additional manual effort was required to discard bad selections, which resulted in a final data set of particles. The 3D refinement step was executed in RELION, and the reconstruction reached a gold standard FSC resolution of . Figure 9 presents surface views of the 3D model. The running time for a single micrograph, using a GPU, was approximately seconds (following a one-time preprocessing step of approximately seconds).
The data sets used to demonstrate our method represent different types of micrographs. The Plasmodium Falciparum 80S is the least challenging, as the macro-molecule is large () and the particles can easily be spotted by a naked eye. The -Galactosidase is more challenging, as the macro-molecule is smaller (), and it is harder to see the particles on the micrographs. Finally, the Synaptic RAG1-RAG2 is the most challenging data set, as it corresponds to smaller macro-molecule () and has low SNR. With all these data set, we achieved high resolutions with minimum manual effort. In addition, the surface views in Figure 9 show that the 3D reconstructions we got are compatible with the published ones.
Throughout the entire reconstruction process, the only manual step was the selection of classes out of RELION’s 2D classification output. We did not use 3D classification, nor any other method to filter the KLT picker’s picked particles. It is also important to note that the parameters in all of the experiments where chosen automatically based on the particle’s size, so there is no need to adjust them manually.
Particle picking methods can be divided into two categories. The first relies on prior assumptions on the particle’s structure, such as Gaussian shape [Voss et al., 2009] or resemblance to other known macro-molecules [Wang et al., 2016, Xiao and Yang, 2017]. The second relies solely on information from the data itself, which can be a set of manually picked particles [Scheres, 2015, Bepler et al., 2018], or a set of micrograph patches that were automatically selected [Heimowitz et al., 2018]. The KLT picker is in between both categories. On one hand, it relies on the assumption that the projection images can be approximated by a band-limited, rotationally invariant, wide sense stationary process restricted to a disk. This assumption is non-restrictive in practice, as exemplified by our results. On the other hand, the templates of the KLT picker are adapted to every micrograph separately, through the RPSD of the particles and of the noise.
In this paper, we presented the KLT picker, an automated, data driven particle picker, that requires only one input which is the estimated particle’s size. The KLT picker is designed especially to handle low SNR micrographs, it is easy to use, fast, and mathematically rigorous. We evaluated the performance of the KLT picker on three different data sets. In all of them we achieved high-quality results with minimal manual effort.
The algorithm can be further improved as follows. Micrographs tend to contain contaminations of various forms, from small contaminations to whole areas containing either ice or the edges of the supporting grid. These areas can degrade the estimation of the RPSD, which is the essence of our method. Devising an algorithm to automatically identify these areas may improve the picking. Dense particle clusters can poses a problem to the KLT picker, as the scoring matrix, introduced in Section 2.3, tends to assign high scores to the interface between particles instead of to the center of each particle separately, see an example in Figure 13. One possible solution to this problem is to replace the current algorithm for extracting particles, which uses the individual values of the scoring matrix (described at the end of Section 2.3), with an algorithm that considers the entire neighborhood of a particle candidate.
This research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 723991 - CRYOMATH).
Under the assumptions that is band-limited, wide sense stationary, and that is radially-symmetric (i.e ), the Wiener–Khinchin theorem [Cohen, 1998] states that the radial power spectral density (RPSD) of , given by (5), is the Fourier transform of the autocorrelation . Hence, by taking the inverse Fourier transform,
where is the band-limit of . Denote , , and write the solutions of (12) as
Note that (13) in polar coordinates is given by
Changing variables in the innermost integral in (16) to gives
Changing variables in the innermost integral in (18) to gives
In this section, we outline an algorithm for estimating the solution of (7), that is of
Equation (20) can be viewed as an instance of a non-negative matrix factorization problem (NNMF) [Gillis, 2017], which is NP-hard. One method for estimating solutions of NNMF problems is by alternating least-squares (ALS) [Kim and Park, 2008]. The idea behind this iterative method is solving, in each step, the NNMF minimization problem for one of the variables, while treating the others as constants. If the method converges, then it converges to a local minimum of the NNMF problem. In our case, the variables are the vectors , and the ALS iterations are given by (for iteration )
As the initial vectors we set
where for . To justify this choice, recall that is modeled as
Since we expect at least one patch to include a complete particle, and at least one patch to contain only noise, we get , which yields
That is is initialized close to its true value. As for , (23) implies
where the third equality holds due to the positivity of all arguments. As written above, and hence
that is, is also initialized close to its true value.
Treating as a variable and as constants in the second line, gives a sum of convex parabolas for the variables . The minimum of this sum, under the constraint is attained at the minimum of these parabolas whenever they are positive and at else. The same applies when treating as the variable in the third line (and the other as constants). Treating as the variable in the forth line, implies that the minimum is given as follow. For each parabola in the sum (24), if its minimum is obtained at a negative , then the minimum of the sum is attained at . If the minimum is obtained at , we set . Otherwise, the minimum of the sum is attained at the minimum of the parabola. To conclude, we get the following formulas for the solutions of the ’th iteration in (21)
where can be estimated from the noise variance. This implies that we can recover . We omit the details of the proof. In C, we describe a method to estimate the noise variance from a given micrograph. The procedure for solving (20) is summarized in Algorithm 2.
Finally, we return to the computation of the columns of . Each column is the RPSD of a micrograph patch of size with stride , as described at the beginning of Section 2.2. These parameters are determined as follows. In (23) we assume that can be factorized as a combination of only . However, this model is only an approximation, and in practice it holds that
where is an error term matrix. Under the assumptions that the expectancy of the rows of is zero, and that the columns of are uncorrelated and with the same variance, one can prove that the limit of (25) when using of (27) stays the same as in (26), but with different values for (we omit the proof). These values can be estimated from the noise variance, and therefore, we can recover for the more general model (27). Now, taking clearly breaks the assumptions for and taking does not use all of the data, hence we used . As for the parameter , ideally we would like a patch to contain no more then one complete particle. Taking much bigger then the particle diameter can cause patches to contain two or more particles, and taking it much smaller will not allow patches to contain enough of the particle. In practice, it seems that taking equal 80% of the particle diameter yields good results.
The noise variance is needed to recover from (26) and to evaluate (9). Assuming that both the noise and the particle function are generated from a wide sense stationary random process and that they are uncorrelated, it follows that the variance of a given micrograph patch is the sum of the noise variance and a fraction of the particle’s variance, which is proportional to the fraction of the particle contained in the patch. As the variance is a positive number, patches with lower variance are more likely to contain only noise. With that in mind, we partition the micrograph into overlapping particle-size patches with stride equal to . Then, we approximate the variance of each patch as explained in Wasserman  and finally estimate the noise variance by averaging of the lowest variance results. Choosing depends on the number of particles in the micrograph, which is unknown to us. We have observed experimentally that taking provides satisfactory results in low-SNR conditions.
In this section, we derive an approximation for the particle’s covariance matrix , where
where are given by (1), and moreover, . Therefore, the autocorrelation can be written as
Using (28) we get
In order to define a truncation rule for the infinite sum in (29), we consider the relative error term
and then, we truncate it again to terms such that
where and are defined in Section 2.3
Our goal is to compute the score
Under the assumptions that the particle is distributed multivariate normal, and that the noise is white, the probability density function (PDF) of a micrograph patch containing a particle plus noise is
where is a matrix determinant. The PDF of a patch with noise only is
Substituting the above PDFs in (34
) and replacing the random variablewith its patch realization gives (see also (11))