1 Introduction
Atoms of the same element may have varying numbers of neutrons in their nuclei. The number of neutrons in the nucleus of an atom influences its mass. The relative abundances with which different isotopes naturally occur are well established.[5]
In compounds composed of several elements, finding the relative abundances of the most prevalent isotopes of the compound (and the respective masses at which these isotopes occur) is a difficult combinatorial problem.
For small problems, this can be solved via brute force: it is possible to compute all isotope masses and their respective abundances, sort them in descending order of abundance, and then retrieve the isotope peaks with greatest abundance (we will refer to the mass and relative abundance of a particular isotope as an "isotope peak", because that is how they are observed in mass spectrometry); however, this brute-force approach is far too inefficient, because its runtime grows exponentially in the number of atoms in the compound.
Łącki et al.[5] recently introduced a statistical approach to this problem, by which the top isotope peaks of a compound may be efficiently approximated. For each element, the method generates data from the isotopologue (i.e., from the contributions of all possible isotopes of that element). For each element, its possible contributions follow a multinomial distribution, which Łącki et al. approximate using multivariate Gaussians and generate in descending order of probability. The isotopologue contributions are combined over the relevant element sets. At this point, they generate the isotope configurations of the compound in descending order of probability, which is equivalent to finding the
most probable isotopologues of the molecule.

In this paper, we demonstrate that finding the top $k$ isotope peaks of a compound composed of $m$ elements is equivalent to finding the top $k$ values (and respective indices) in $X_1 + X_2 + \cdots + X_m$, where $X_1, X_2, \ldots, X_m$ are vectors of length $n$ and $X_1 + X_2 + \cdots + X_m$ is the Cartesian product of these vectors under the $+$ operator. This problem, which is important to other problems such as max-convolution[2] and max-product Bayesian inference [8, 6], is the generalization of the pairwise problem in which we compute the top $k$ values in $X + Y$.

Finding the top $k$ values in $X + Y$ is nontrivial. In fact, there is no known approach that sorts all $n^2$ values of $X + Y$ faster than naively computing and sorting them in $O(n^2 \log(n))$ [1]; however, Frederickson & Johnson demonstrated that the top $k$ values in $X + Y$ can be computed efficiently [4]. Frederickson & Johnson generalized the problem to obtaining the top value in a matrix which is sorted by both columns and rows [3]. Such a sorted matrix may be built with $A_{i,j} = X_i + Y_j$, but the method presented assumes the matrix is already of this form and does not take into account the work required to produce the matrix from the vectors.
This sorted structure can be obtained by sorting both $X$ and $Y$ in descending order and then sparsely building the matrix $A_{i,j} = X_i + Y_j$. If $X$ and $Y$ represent sorted vectors such that $X_1 \geq X_2 \geq \cdots \geq X_n$ and $Y_1 \geq Y_2 \geq \cdots \geq Y_n$, then the maximal value of $X + Y$ is $X_1 + Y_1$. The second largest value in $X + Y$ is $\max(X_2 + Y_1, X_1 + Y_2)$.
W.l.o.g., we know that we will never insert $X_{i+1} + Y_j$ into the sorted result of top $k$ values in $X + Y$ without first inserting the larger (or equal) value $X_i + Y_j$. Thus, each time a value $X_i + Y_j$ from the matrix is found to be the next largest in $X + Y$, the subsequent next largest value may be found among the unvisited neighbors (row or column, but not diagonal) of previously visited values. These considered values form a "fringe" around the indices whose values are already inserted into the result. We find the next value to insert into the result by finding the maximum value currently in this fringe. At most two values will be inserted into the fringe in each iteration (i.e., if $X_i + Y_j$ was just inserted into the result, then $X_{i+1} + Y_j$ and $X_i + Y_{j+1}$ will be added to the fringe if they are in bounds). Thus, the maximum value in the fringe can be updated sparsely by using a binary heap. Note that only the indices and values comprising the fringe are stored, and that the full matrix, which would have space $n^2$ and thus runtime $\in \Omega(n^2)$, is never realized. The fringe can never have size $> 2k$ (because in the worst-case scenario, it moves $k$ steps up and $k$ steps to the right), and so finding each next largest value in $X + Y$ costs $O(\log(k))$. Computing the top $k$ values in $X + Y$ is thus $O(n \log(n) + k \log(k))$.
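The two-dimensional fringe method above can be sketched as follows (a minimal illustration, not the paper's implementation; the helper name `top_k_pairwise` is hypothetical, and values are negated because Python's `heapq` is a min-heap):

```python
import heapq

def top_k_pairwise(x, y, k):
    """Return the top k values of the Cartesian sum X + Y, in
    descending order, using the fringe method."""
    x = sorted(x, reverse=True)
    y = sorted(y, reverse=True)
    # Max-heap via negated values; begin at the top-left corner (0, 0).
    fringe = [(-(x[0] + y[0]), 0, 0)]
    visited = {(0, 0)}
    result = []
    while fringe and len(result) < k:
        neg_val, i, j = heapq.heappop(fringe)
        result.append(-neg_val)
        # Insert the row and column neighbors that are in bounds.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(x) and nj < len(y) and (ni, nj) not in visited:
                visited.add((ni, nj))
                heapq.heappush(fringe, (-(x[ni] + y[nj]), ni, nj))
    return result
```

Note that the full $n \times n$ matrix is never materialized; only the fringe entries and their indices are stored.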
In this paper, we first construct a direct generalization of Frederickson & Johnson's method to the problem of finding the top $k$ values in $X_1 + X_2 + \cdots + X_m$. We then create a more efficient method by generalizing Frederickson & Johnson's method to the case where $X$ and $Y$ are arbitrary heap-like data structures, and compute the $k$ largest values in $X_1 + X_2 + \cdots + X_m$ by constructing a balanced binary tree whose nodes are each one of these data structures. This method is then applied to finding the most abundant isotope peaks for a given molecular formula.
2 Methods
2.1 A direct $m$-dimensional generalization of Frederickson & Johnson
The direct generalization of Frederickson & Johnson's method, which closely resembles Łącki et al.'s approach to generating the top isotope peaks[5], is straightforward: instead of a matrix, we have an $n \times n \times \cdots \times n$ tensor. As before, we store only the current fringe, which is kept in a heap. In each iteration, we remove the maximal value in the fringe and append it to the result vector. Let this value come from index $(i_1, i_2, \ldots, i_m)$. We then insert the values from the neighbors of this index into the heap holding the fringe: $(i_1 + 1, i_2, \ldots, i_m), (i_1, i_2 + 1, \ldots, i_m), \ldots, (i_1, i_2, \ldots, i_m + 1)$. As with the two-dimensional method, it is possible to store not only the value in the heap, but also the index tuple from which it came. This is shown in Listing 2.
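A sketch of this direct generalization (the function name is hypothetical; the paper's Listing 2 is the authoritative version), assuming each input vector is already sorted in descending order:

```python
import heapq

def top_k_tensor(vectors, k):
    """Return the top k values of X1 + ... + Xm via the direct tensor
    generalization, as (value, index-tuple) pairs in descending order.
    Assumes each vector in `vectors` is sorted in descending order."""
    m = len(vectors)
    start = (0,) * m
    # Max-heap via negated values; begin at the corner (0, 0, ..., 0).
    fringe = [(-sum(v[0] for v in vectors), start)]
    visited = {start}
    result = []
    while fringe and len(result) < k:
        neg_val, idx = heapq.heappop(fringe)
        result.append((-neg_val, idx))
        # Neighbors: advance exactly one coordinate of the index tuple.
        for d in range(m):
            nbr = idx[:d] + (idx[d] + 1,) + idx[d + 1:]
            if nbr[d] < len(vectors[d]) and nbr not in visited:
                visited.add(nbr)
                heapq.heappush(
                    fringe,
                    (-sum(v[i] for v, i in zip(vectors, nbr)), nbr))
    return result
```

Each popped index tuple identifies exactly which entry of each vector contributed to the sum.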
This $m$-dimensional method will be substantially slower than the two-dimensional method: the fringe in this version of the method is a surface of width $n$ and dimension $m - 1$, and thus can have size up to $n^{m-1}$.
2.2 A hierarchical $m$-dimensional method for finding the top $k$ values of $X_1 + X_2 + \cdots + X_m$
First, observe that in the two-dimensional case $X + Y$, it is not necessary for $X$ and $Y$ to be sorted vectors; instead, it is sufficient that $X$ and $Y$ simply be max-heap data structures, from which we can repeatedly request the next largest value. The method thus runs similarly to the two-dimensional case: the top-left corner of $A$ is computed via the maximal value of $X$ and the maximal value of $Y$. Removing it inserts two values into the fringe: the sum of the largest value in $X$ and the second-largest value in $Y$, and the sum of the second-largest value in $X$ and the largest value in $Y$; the next largest value of $X + Y$ must be one of these. Neither the full, sorted contents of $X$ nor the full, sorted contents of $Y$ are needed.
We thus construct a balanced binary tree of these heap-like structures: each leaf generates the values of one input vector in descending order, and each internal node applies the pairwise fringe method to its two children, generating the values of their Cartesian sum in descending order. Requesting the $k$ largest values from the root yields the top $k$ values of $X_1 + X_2 + \cdots + X_m$.
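A sketch of such a tree, under our reading of the construction (the class and function names are hypothetical, not the paper's): leaves serve one vector each, and internal nodes apply the pairwise fringe method, pulling values from their children only on demand.

```python
import heapq

class ListHeap:
    """Leaf: yields the values of a single vector in descending order."""
    def __init__(self, values):
        self._sorted = sorted(values, reverse=True)
        self._n = 0
    def next_largest(self):
        # Returns the next value, or None when exhausted.
        if self._n == len(self._sorted):
            return None
        self._n += 1
        return self._sorted[self._n - 1]

class CartesianSumHeap:
    """Internal node: yields values of left + right in descending order
    via the fringe method, pulling from the children lazily."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.lvals, self.rvals = [], []   # values pulled from children
        self._pull(self.left, self.lvals)
        self._pull(self.right, self.rvals)
        self.fringe = [(-(self.lvals[0] + self.rvals[0]), 0, 0)]
        self.visited = {(0, 0)}
    def _pull(self, child, cache):
        v = child.next_largest()
        if v is not None:
            cache.append(v)
    def next_largest(self):
        if not self.fringe:
            return None
        neg_val, i, j = heapq.heappop(self.fringe)
        # Ensure row i+1 and column j+1 exist before indexing into them.
        if i + 1 == len(self.lvals):
            self._pull(self.left, self.lvals)
        if j + 1 == len(self.rvals):
            self._pull(self.right, self.rvals)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if (ni < len(self.lvals) and nj < len(self.rvals)
                    and (ni, nj) not in self.visited):
                self.visited.add((ni, nj))
                heapq.heappush(
                    self.fringe,
                    (-(self.lvals[ni] + self.rvals[nj]), ni, nj))
        return -neg_val

def build_tree(vectors):
    """Balanced binary tree whose leaves are the (non-empty) vectors."""
    nodes = [ListHeap(v) for v in vectors]
    while len(nodes) > 1:
        nodes = [CartesianSumHeap(nodes[i], nodes[i + 1])
                 if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return nodes[0]
```

Repeatedly calling `next_largest` on the root then produces the values of the full Cartesian sum in descending order, while each node materializes only as much of its children's output as the requests require.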
2.3 Computing the most abundant isotope peaks
Let carbon be represented as the vector of isotope log-abundances $(\log(0.9893), \log(0.0107))$ and hydrogen as the vector $(\log(0.999885), \log(0.000115))$.[5] (Log-abundances are used so that multiplying abundances corresponds to adding vector elements.) Propane, C3H8, has abundances composed of $C + C + C + H + H + H + H + H + H + H + H$. To reduce the number of vectors from 11 to two, a multinomial for each element is computed to find the contributions from all of its isotopes. This multinomial is represented as a vector which takes the place of all vectors for that element. Thus, propane can be estimated as the sum of two multinomials, which are encoded as vectors: one for $C_3$ and one for $H_8$. Molecules that consist of any amounts of hydrogen, carbon, nitrogen, oxygen, sulfur, etc. can be solved using $m$ vectors: $X_1 + X_2 + \cdots + X_m$. The most abundant isotope peaks are found via the top $k$ values in $X_1 + X_2 + \cdots + X_m$. The specific isotopes to which these abundances correspond (and from which we can compute the masses corresponding to each abundance) can be computed easily from the tuple indices. Python code implementing the hierarchical method to calculate the most abundant isotope peaks is shown in Listing 5.
3 Results
3.1 Efficient isotope peak computation
On a fake compound composed of Cl800V800He800C800H800N800O100S6Cu800Ga800Ag800Tl800Ne800, with the cost of computing the multinomial isotopologues not included (this cost was identical for both methods), computing the top 512 isotope peaks via a TensorCartesianSumHeap.py-based implementation took 0.002984046 seconds, while the TreeCartesianSumHeap.py-based implementation took 0.00146198272 seconds.
3.2 Time and memory use on arbitrary $m$
Problems of various sizes $m$ were run with fixed vector length $n$, retrieving the top $k$ values in $X_1 + X_2 + \cdots + X_m$. The brute-force approach was not considered because of its prohibitive runtime. Time and memory use of TensorCartesianSumHeap.py and TreeCartesianSumHeap.py on a dual Xeon with 128GB of RAM are shown in Figure 2.
4 Discussion
As $m$ increases, the method we introduce here is far more time efficient and, more importantly, far more space efficient than the direct $m$-dimensional generalization of Frederickson & Johnson.
Although the approach we propose here does benefit the computation of intense isotope peaks, the limited number of elements (currently 118) benefits only slightly from our approach. Furthermore, it is rare for many elements to be combined in a single compound. Our demonstration implementation generates multinomials naively, unlike Łącki et al.; the longer runtime from this unnecessary naïveté in preprocessing mutes the speedup of the hierarchical method for elements with several probable isotopes; however, it is possible to use this hierarchical approach with multinomials generated non-naively, as Łącki et al. do, which would likely achieve a modest speedup over their current approach.
However, there are cases outside of the computation of intense isotope peaks in which the hierarchical method we propose would yield a large practical benefit. These include maximum a posteriori Bayesian inference on additive dependencies of the form $Z = X_1 + X_2 + \cdots + X_m$ [7, 6]. Operations research applications include financial markets, e.g., retrieving the lowest overall bids in each sector of the supply lines for a product.
This method can also be parallelized easily by parallelizing the heap pop operations in nodes at the same layer of the tree. If enough hardware parallelism is available, the runtime of a full propagation through the tree would thus be determined by the height of the tree, i.e., $O(\log(m))$ pop operations from fringe heaps. Each of these fringe heap pop operations costs $O(\log(k))$, and thus the runtime per retrieved value would be $O(\log(m) \cdot \log(k))$.
5 Availability
Python source code for this method is available at https://bitbucket.org/seranglab/theodolite/ (MIT license, free for both academic and commercial use).
References
 [1] D. Bremner, T. M. Chan, E. D. Demaine, J. Erickson, F. Hurtado, J. Iacono, S. Langerman, and P. Taslakian. Necklaces, convolutions, and $X + Y$. In Algorithms–ESA 2006, pages 160–171. Springer, 2006.
 [2] M. Bussieck, H. Hassler, G. J. Woeginger, and U. T. Zimmermann. Fast algorithms for the maximum convolution problem. Operations research letters, 15(3):133–141, 1994.
 [3] G. N. Frederickson and D. B. Johnson. Generalized selection and ranking: sorted matrices. SIAM Journal on Computing, 13(1):14–30, 1984.
 [4] G. N. Frederickson and D. B. Johnson. The complexity of selection and ranking in $X + Y$ and matrices with sorted columns. Journal of Computer and System Sciences, 24(2):197–208, 1982.
 [5] M. K. Łącki, M. Startek, D. Valkenborg, and A. Gambin. IsoSpec: hyperfast fine structure calculator. Analytical Chemistry, 89(6):3272–3277, 2017.

 [6] J. Pfeuffer and O. Serang. A bounded p-norm approximation of max-convolution for sub-quadratic Bayesian inference on additive factors. Journal of Machine Learning Research, 17(36):1–39, 2016.
 [7] O. Serang. The probabilistic convolution tree: efficient exact Bayesian inference for faster LC-MS/MS protein inference. PloS ONE, 9(3):e91507, 2014.

 [8] O. Serang. A fast numerical method for max-convolution and the application to efficient max-product inference in Bayesian networks. Journal of Computational Biology, 22:770–783, 2015.