Most abundant isotope peaks and efficient selection on Y=X_1+X_2+... + X_m

by   Patrick Kreitzberg, et al.
University of Montana

The isotope masses and relative abundances for each element are fundamental chemical knowledge. Computing the isotope masses of a compound and their relative abundances is an important and difficult analytical chemistry problem. We demonstrate that this problem is equivalent to sorting Y = X_1 + X_2 + ... + X_m. We introduce a novel, practically efficient method for computing the top k values in Y, then demonstrate the applicability of this method by computing the most abundant isotope masses (and their abundances) from compounds of nontrivial size.




1 Introduction

Atoms of the same element may have a variable number of neutrons in their nuclei. The number of neutrons in the nucleus of an atom influences its mass. The relative abundances with which different isotopes naturally occur are well established.[5]

In compounds composed of several elements, finding the relative abundances of the most prevalent isotopes of the compound (and the respective masses at which these isotopes occur) is a difficult combinatorial problem.

For small problems, this can be solved via brute force: it is possible to compute all isotope masses and their respective abundances, sort them in descending order of abundance, and then retrieve the isotope peaks (we will refer to the mass and relative abundance of a particular isotope as an isotope peak, because that is how they are observed in mass spectrometry) with greatest abundance; however, this brute-force approach is far too inefficient, because its runtime grows exponentially with the number of atoms in the compound.
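For concreteness, the brute-force baseline can be sketched as follows (a minimal sketch with made-up log-abundance values, not real isotope data; `brute_force_top_k` is an illustrative helper name):

```python
import itertools

def brute_force_top_k(vectors, k):
    # Form every sum in the Cartesian product Y = X_1 + ... + X_m,
    # then sort all n^m values in descending order and keep the top k.
    all_sums = [sum(tup) for tup in itertools.product(*vectors)]
    return sorted(all_sums, reverse=True)[:k]

# Toy log-abundance vectors (hypothetical values):
X1 = [0.0, -2.3, -4.1]
X2 = [0.0, -1.9]
print(brute_force_top_k([X1, X2], 3))  # -> [0.0, -1.9, -2.3]
```

The `itertools.product` call is exactly the exponential blow-up described above: with m vectors of length n, it enumerates n^m tuples.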

Łącki et al.[5] recently introduced a statistical approach to this problem, by which the top isotope peaks of a compound may be efficiently approximated. For each element, the method generates data from the isotopologue (i.e., from the contributions of all possible isotopes of that element). For each element, its possible contributions follow a multinomial distribution, which Łącki et al. approximate using multivariate Gaussians and generate in descending order of probability. The isotopologue contributions are combined over the relevant element sets. At this point, they generate the isotope configurations of the compound in descending order of probability, which is equivalent to finding the most probable isotopologues of the molecule.

In this paper, we demonstrate that finding the top k isotope peaks of a compound composed of m elements is equivalent to finding the top k values (and respective indices) in Y = X_1 + X_2 + ... + X_m, where X_1, X_2, ..., X_m are vectors of length n and Y is the Cartesian product of these vectors under the + operator. This problem, which is important to other problems such as max-convolution[2] and max-product Bayesian inference[8, 6], is the generalization of the pairwise problem in which we compute the top k values in X + Y.

Finding the top k values in Y is nontrivial. In fact, there is no known approach that sorts all n² values of X + Y faster than naively computing and sorting them in O(n² log(n))[1]; however, Frederickson & Johnson demonstrated that the top k values in X + Y can be computed without computing and sorting all n² values[4]. Frederickson & Johnson later generalized the problem to obtaining the top k values in a matrix which is sorted by both columns and rows[3]. Such a sorted matrix may be built from X + Y with an O(n log(n)) sort of each vector, but the method presented assumes the matrix is already of this form and does not take into account the work to produce the matrix from the vectors.

This can be observed by sorting both X and Y in descending order, and then sparsely building a matrix of sums A[i][j] = X[i] + Y[j]. If X and Y represent sorted vectors such that X[i] ≥ X[i+1] and Y[j] ≥ Y[j+1], then the maximal value of X + Y is A[0][0] = X[0] + Y[0]. The second largest value in X + Y is either A[0][1] or A[1][0].

W.l.o.g., we know that we will never insert A[i][j+1] into the sorted result of top k values in X + Y without first inserting the larger (or equal) value A[i][j]. Thus, each time a value from the matrix is found to be the next largest in X + Y, the subsequent next largest value in X + Y must lie among the unvisited neighbors (row or column, but not diagonal) of previously visited values. These candidate values form a “fringe” around the indices whose values are already inserted into the result. We find the next value to insert into the result by finding the maximum value currently in this fringe. Because at most two values are inserted into the fringe in each iteration (i.e., if A[i][j] was just inserted into the result, then A[i+1][j] and A[i][j+1] will be added into the fringe if they are in bounds), the maximum value in the fringe can be updated sparsely by using a binary heap. Note that only the indices and values comprising the fringe are stored, and that the full matrix, which would have size n² and thus runtime in Ω(n²), is never realized. The fringe can never have size > k + 1 (because each iteration removes one value from the fringe and inserts at most two), and so finding each next largest value in X + Y costs O(log(k)). Computing the top k values in X + Y is thus O(n log(n) + k log(k)).
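The pairwise fringe procedure described above can be sketched as follows (a minimal re-implementation for illustration, not the paper's exact code):

```python
import heapq

def top_k_pairwise(x, y, k):
    # Select the k largest values of X+Y without materializing the n*n matrix.
    x = sorted(x, reverse=True)
    y = sorted(y, reverse=True)
    # Max-heap via negated values; start from the top-left corner A[0][0].
    fringe = [(-(x[0] + y[0]), 0, 0)]
    seen = {(0, 0)}
    result = []
    while fringe and len(result) < k:
        neg_val, i, j = heapq.heappop(fringe)
        result.append(-neg_val)
        # The next largest value can only appear at a row or column neighbor:
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(x) and nj < len(y) and (ni, nj) not in seen:
                heapq.heappush(fringe, (-(x[ni] + y[nj]), ni, nj))
                seen.add((ni, nj))
    return result

print(top_k_pairwise([1, 5, 3], [4, 2], 3))  # -> [9, 7, 7]
```

The `seen` set prevents a cell from entering the fringe twice (A[i+1][j+1] is reachable from both A[i][j+1] and A[i+1][j]).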

In this paper, we first construct a direct generalization of Frederickson & Johnson’s method to the problem of finding the top k values in Y = X_1 + X_2 + ... + X_m. We then create a more efficient method by generalizing Frederickson & Johnson’s method to the case where X and Y are arbitrary heap data structures, and compute the k largest values in Y by constructing a balanced binary tree whose nodes each are one of these data structures. This method is then applied to finding the most abundant isotope peaks for a given molecular formula.

2 Methods

2.1 A direct m-dimensional generalization of Frederickson & Johnson

The direct generalization of Frederickson & Johnson’s method, which closely resembles Łącki et al.’s approach to generating the top isotope peaks[5], is straightforward: instead of a matrix, we have an n × n × ... × n tensor with m dimensions. As before, we store only the current fringe, which is kept in a heap. In each iteration, we remove the maximal value in the fringe and append it to the result vector. Let this maximal value come from index (i_1, i_2, ..., i_m). Now we insert the values from the m neighbors of this index into the heap holding the fringe: (i_1+1, i_2, ..., i_m), (i_1, i_2+1, ..., i_m), ..., (i_1, i_2, ..., i_m+1). As with the two-dimensional method, it is possible to store in the heap not only the value, but also the index tuple from which it came. This is shown in Listing 2.

This m-dimensional method will be substantially slower than the two-dimensional method: the fringe in this version is a surface of dimension m − 1 rather than a one-dimensional boundary, and each iteration performs up to m heap insertions, so the fringe can grow to size (m − 1) · k + 1 after k iterations.

import heapq
class MaxIndexHeap:
  def __init__(self, values_and_indices=()):
    # negate values so that python's min-heap behaves as a max-heap:
    self.min_heap = [ (-a,b) for a,b in values_and_indices ]
    heapq.heapify(self.min_heap)
    self.descending_values_and_indices = []
  def insert(self, val_and_index):
    # Note: does not guard against double insertion of an index
    val, index = val_and_index
    heapq.heappush(self.min_heap, (-val, index))
  def pop_max_and_index(self):
    neg_val, index = heapq.heappop(self.min_heap)
    self.descending_values_and_indices.append( (-neg_val, index))
    return (-neg_val, index)
  def get_value_and_index_at_rank(self, rank):
    assert(rank >= 0 and rank < len(self.descending_values_and_indices))
    return self.descending_values_and_indices[rank]
  def __len__(self):
    return len(self.min_heap)
Listing 1: A max heap Python class.
from MaxIndexHeap import *
class TensorCartesianSumHeap:
  def __init__(self, vectors):
    self.m = len(vectors)
    self.descending_vectors_and_unsorted_indices = [ sorted([(b,a) for a,b in enumerate(v)], reverse=True) for v in vectors ]
    max_value = sum([a for a,b in [v[0] for v in self.descending_vectors_and_unsorted_indices]])
    zero_tup = (0,)*self.m
    self.fringe = MaxIndexHeap([(max_value,zero_tup)])
    self.sorted_indices_in_fringe = set( [zero_tup] )
  def pop_max_and_index(self):
    val,index = self.fringe.pop_max_and_index()
    # insert neighbors from fringe if not already visited:
    for i in range(self.m):
      new_index = list(index)
      new_index[i] += 1
      new_index = tuple(new_index)
      desc_vec_i = self.descending_vectors_and_unsorted_indices[i]
      if new_index[i] < len(desc_vec_i) and new_index not in self.sorted_indices_in_fringe:
        old_val_at_index = desc_vec_i[new_index[i]-1][0]
        new_val_at_index = desc_vec_i[new_index[i]][0]
        new_val = val - old_val_at_index + new_val_at_index
        self.fringe.insert( (new_val, new_index) )
        self.sorted_indices_in_fringe.add( new_index )
    unsorted_index = tuple([ self.descending_vectors_and_unsorted_indices[i][j][1] for i,j in enumerate(index) ])
    return (val, unsorted_index)
Listing 2: A generalization of Frederickson & Johnson’s method, which computes the top k values of Y = X_1 + X_2 + ... + X_m. This method uses instances of the MaxIndexHeap class (Listing 1).
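As a compact, self-contained sanity check of the tensor approach (re-implemented here in a few lines with toy values rather than importing the classes above; `top_k_mdim` is an illustrative name), the following sketch also reports the largest fringe size reached:

```python
import heapq
import itertools

def top_k_mdim(vectors, k):
    # Direct m-dimensional fringe selection over the (never materialized)
    # tensor of sums; also tracks the largest fringe size reached.
    vecs = [sorted(v, reverse=True) for v in vectors]
    m = len(vecs)
    start = (0,) * m
    fringe = [(-sum(v[0] for v in vecs), start)]  # negate for a max-heap
    seen = {start}
    result, max_fringe = [], 1
    while fringe and len(result) < k:
        neg_val, idx = heapq.heappop(fringe)
        result.append(-neg_val)
        # each pop can insert up to m neighbors into the fringe:
        for d in range(m):
            nxt = idx[:d] + (idx[d] + 1,) + idx[d + 1:]
            if nxt[d] < len(vecs[d]) and nxt not in seen:
                val = -neg_val - vecs[d][idx[d]] + vecs[d][nxt[d]]
                heapq.heappush(fringe, (-val, nxt))
                seen.add(nxt)
        max_fringe = max(max_fringe, len(fringe))
    return result, max_fringe

vectors = [[0.0, -1.0, -2.0]] * 4  # four identical toy log-abundance vectors
top, fringe_size = top_k_mdim(vectors, 8)
```

Comparing `top` against the brute-force sort of all 3^4 sums confirms the selection is exact, while `fringe_size` illustrates the (m − 1) · k + 1 fringe growth discussed above.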

2.2 A hierarchical m-dimensional method for finding the top k values of Y = X_1 + X_2 + ... + X_m

First, observe that in the two-dimensional case X + Y, it is not necessary for X and Y to be sorted vectors; instead, it is sufficient that X and Y simply be max-heap data structures, from which we can repeatedly request the next largest value. The method thus runs similarly to the two-dimensional case: the top-left corner of the matrix is computed via the maximal value of X and the maximal value of Y. This inserts two values into the fringe: the sum of the largest value in X and the second-largest value in Y, and the sum of the second-largest value in X and the largest value in Y. Neither the full, sorted contents of X nor the full, sorted contents of Y are needed.

We thus construct a balanced binary tree of these heap-like structures: each heap-like structure is of the form A + B, where A and B are themselves heap-like structures (Figure 1). The base case (at the leaves) is achieved by simply using a binary heap of an input vector (Listing 4).

Figure 1: Illustration of the hierarchical method. Pairwise Cartesian sum heaps (Listing 3) are assembled into a balanced binary tree. The gray squares in each 2D grid represent values which have already been popped from that CartesianSumPairHeap at the request of a parent node in the tree. When a value is popped from a child to the parent, it advances the corresponding margin along one axis of the parent’s grid. The blue squares are values on the fringe, but which have not yet been popped. The child on the left has popped six values and currently has four values in its fringe; the row axis of the parent has six values that have been realized thus far. The child on the right has popped four values and currently has four values in its fringe; the column axis of the parent has four values that have been realized thus far. The indices from which the child popped are also included, enabling lookup of the index from which each next largest value came.
from MaxIndexHeap import *
def int_or_tuple_to_tuple(int_or_tuple):
  if type(int_or_tuple) is not tuple:
    return (int_or_tuple,)
  return int_or_tuple
class CartesianSumPairHeap:
  def __init__(self, heap_like_a, heap_like_b):
    self.heap_like_a = heap_like_a
    self.heap_like_b = heap_like_b
    self.number_remaining_elements = len(heap_like_a) * len(heap_like_b)
    max_a, index_a = heap_like_a.pop_max_and_index()
    max_b, index_b = heap_like_b.pop_max_and_index()
    index_a = int_or_tuple_to_tuple(index_a)
    index_b = int_or_tuple_to_tuple(index_b)
    max_value = max_a + max_b
    zero_tup = (0,0)
    self.fringe = MaxIndexHeap([(max_value, (zero_tup, index_a+index_b))])
    self.sorted_indices_in_fringe = set( [zero_tup] )
    self.descending_values_and_indices = []
  def insert_into_fringe(self, a_int_ind, b_int_ind):
    # check that indices are in bounds:
    if a_int_ind != len(self.heap_like_a.descending_values_and_indices) and b_int_ind != len(self.heap_like_b.descending_values_and_indices):
      matrix_index = (a_int_ind, b_int_ind)
      if matrix_index not in self.sorted_indices_in_fringe:
        new_val_a, new_index_a = self.heap_like_a.get_value_and_index_at_rank(a_int_ind)
        new_val_b, new_index_b = self.heap_like_b.get_value_and_index_at_rank(b_int_ind)
        new_index_a = int_or_tuple_to_tuple(new_index_a)
        new_index_b = int_or_tuple_to_tuple(new_index_b)
        self.fringe.insert( (new_val_a+new_val_b, (matrix_index, new_index_a+new_index_b)) )
        self.sorted_indices_in_fringe.add( matrix_index )
  def pop_max_and_index(self):
    self.number_remaining_elements -= 1
    val,(two_d_sorted_index,full_index) = self.fringe.pop_max_and_index()
    # insert neighbors of selected cell if not already visited:
    current_a = two_d_sorted_index[0]
    current_b = two_d_sorted_index[1]
    # lazily materialize the next rank from each child if not yet popped:
    if len(self.heap_like_a)!=0 and current_a+1 == len(self.heap_like_a.descending_values_and_indices):
      self.heap_like_a.pop_max_and_index()
    if len(self.heap_like_b)!=0 and current_b+1 == len(self.heap_like_b.descending_values_and_indices):
      self.heap_like_b.pop_max_and_index()
    # at this point, right and down cells can be found in heap_like_a/b:
    self.insert_into_fringe(current_a+1, current_b)
    self.insert_into_fringe(current_a, current_b+1)
    self.descending_values_and_indices.append( (val, full_index) )
    return (val, full_index)
  def get_value_and_index_at_rank(self, rank):
    assert(rank >= 0 and rank < len(self.descending_values_and_indices))
    return self.descending_values_and_indices[rank]
  def __len__(self):
    # hack because python only allows primitive int return value for __len__:
    if self.number_remaining_elements > 2**32-1:
      return 2**32-1
    return self.number_remaining_elements
Listing 3: The CartesianSumPairHeap class is a heap-like structure which gets its next largest value from the Cartesian sum of its two heap-like children (which may be MaxIndexHeap leaves built from the input vectors, or other CartesianSumPairHeap objects). This class is used in the hierarchical method.
from CartesianSumPairHeap import *
class TreeCartesianSumHeap:
  def __init__(self, vectors):
    vectors_and_indices = [ [(b,a) for a,b in enumerate(v)] for v in vectors ]
    # build a balanced binary tree and store the root:
    current_layer = [ MaxIndexHeap(v) for v in vectors_and_indices ]
    while len(current_layer) > 1:
      next_layer = []
      for i in range(len(current_layer) // 2):
        next_layer.append( CartesianSumPairHeap(current_layer[2*i], current_layer[2*i+1]) )
      if len(current_layer)%2 == 1:
        next_layer.append( current_layer[-1] )
      current_layer = next_layer
    self.root = current_layer[0]
  def pop_max_and_index(self):
    val,index = self.root.pop_max_and_index()
    return (val,int_or_tuple_to_tuple(index))
Listing 4: A hierarchical method for computing the top k values of Y = X_1 + X_2 + ... + X_m, which uses objects of the CartesianSumPairHeap class (Listing 3).
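The hierarchical principle can also be checked standalone: the top k values of A + B depend only on the top k values of A and the top k values of B, so pairwise selection can be folded layer by layer over the vectors (an illustrative sketch, independent of the listings above):

```python
import heapq

def top_k_pair(a, b, k):
    # a, b are descending lists; returns the k largest values of a+b
    # via the two-dimensional fringe procedure.
    fringe = [(-(a[0] + b[0]), 0, 0)]
    seen = {(0, 0)}
    out = []
    while fringe and len(out) < k:
        neg, i, j = heapq.heappop(fringe)
        out.append(-neg)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                heapq.heappush(fringe, (-(a[ni] + b[nj]), ni, nj))
                seen.add((ni, nj))
    return out

def top_k_tree(vectors, k):
    # Combine vectors pairwise, layer by layer, as in the balanced tree;
    # an odd leftover vector is carried up to the next layer.
    layer = [sorted(v, reverse=True) for v in vectors]
    while len(layer) > 1:
        nxt = [top_k_pair(layer[i], layer[i + 1], k)
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            nxt.append(layer[-1])
        layer = nxt
    return layer[0][:k]

vecs = [[3, 1], [4, 2, 0], [5, 1]]
print(top_k_tree(vecs, 4))  # -> [12, 10, 10, 8]
```

Each internal node only ever requests k values from each child, which is what bounds the work at every level of the tree.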

2.3 Computing the most abundant isotope peaks

Let carbon be represented as the vector of its isotope abundances and hydrogen likewise.[5] Propane, C3H8, has abundances composed of the Cartesian sum of 11 vectors: C+C+C+H+H+H+H+H+H+H+H. To reduce the number of vectors from 11 to two, a multinomial for each element is computed to find the contributions from all of its isotopes. This multinomial is represented as a vector which takes the place of all vectors for that element. Thus, propane could be estimated as the sum of two multinomials, which are encoded as vectors: one for C3 and one for H8. Molecules that consist of any amounts of hydrogen, carbon, nitrogen, oxygen, sulfur, etc. can be solved using one vector per element: Y = X_H + X_C + X_N + X_O + X_S + .... The most abundant isotope peaks are found via the top k values in Y.
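As a concrete sketch of the per-element multinomial (using the standard library's `lgamma` in place of scipy's `gammaln`, with approximate natural abundances for carbon; illustrative code, not Listing 5):

```python
import itertools
from math import lgamma, log, exp

def multinomial_log_probs(log_abundances, n_atoms):
    # log-probability of each isotope-count tuple (k_1, ..., k_b) that
    # sums to n_atoms; e.g., for C_3: (3,0) = three 12C,
    # (2,1) = two 12C and one 13C, and so on.
    b = len(log_abundances)
    out = {}
    for counts in itertools.product(range(n_atoms + 1), repeat=b):
        if sum(counts) != n_atoms:
            continue
        lp = lgamma(n_atoms + 1)  # log(n!)
        for c, la in zip(counts, log_abundances):
            lp += c * la - lgamma(c + 1)  # + c*log(p) - log(c!)
        out[counts] = lp
    return out

# approximate natural abundances of 12C and 13C (illustrative values):
log_ab = [log(0.9893), log(0.0107)]
probs = multinomial_log_probs(log_ab, 3)  # the C_3 sub-formula of propane
total = sum(exp(lp) for lp in probs.values())  # should be ~1
```

The resulting log-probabilities, one entry per isotope-count tuple, form the per-element vector that enters Y in place of the repeated single-atom vectors.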

The specific isotopes to which these abundances correspond (and from which we can compute the masses that correspond to each abundance) can be computed easily from the tuple indices. Python code implementing the hierarchical method to calculate the most abundant isotope peaks is shown in Listing 5.

import numpy
from collections import defaultdict
import sys
import itertools
from scipy.special import gammaln
from functools import reduce
from TreeCartesianSumHeap import *
from TensorCartesianSumHeap import *
from BruteForceCartesianSumHeap import *
def read_in_elements(element_file):
  element_to_masses_and_abundances = {}
  for line in open(element_file):
    words = line.split()
    element_abbrev = words[0]
    tuples = [ tuple([float(x) for x in w.split(',')]) for w in words[1:] ]
    element_to_masses_and_abundances[element_abbrev] = tuples
  return element_to_masses_and_abundances
def parse_molecular_formula(formula):
  element_to_count = defaultdict(int)
  elements_and_counts = formula.split(',')
  for words in elements_and_counts:
    e_and_count = words.split(':')
    if len(e_and_count) == 1:
      e = e_and_count[0]
      count = 1
    else:
      e, count = e_and_count
      count = int(count)
    element_to_count[e] += count
  return element_to_count
def log_multinomial(log_probs, tup):
  result = gammaln(sum(tup)+1)
  for x in tup:
    result -= gammaln(x+1)
  for i in range(len(log_probs)):
    result += log_probs[i]*tup[i]
  return result
def tuples_with_fixed_sum(b, n):
  masks = numpy.identity(b, dtype=int)
  for c in itertools.combinations_with_replacement(masks, n):
    yield tuple(sum(c))
# turn masses/log_abundances from single element into all combinations:
def element_to_multinomial_masses_and_log_abundances(element_masses, element_log_abundances, trials):
  n = len(element_masses)
  mass_results = []
  log_abundance_results = []
  for tup in tuples_with_fixed_sum(n, trials):
    mass = sum([ tup[i]*element_masses[i] for i in range(n) ])
    log_prob = log_multinomial(element_log_abundances, tup)
    mass_results.append(mass)
    log_abundance_results.append(log_prob)
  return mass_results, log_abundance_results
def usage():
  print( 'USAGE: <element file> <molecular formula, e.g., H:2,S,O:4> <number of peaks> [tree (DEFAULT) OR tensor OR bruteforce]' )
def main(args):
  if len(args) != 3:
    if len(args) != 4 or args[3] not in ('tree', 'tensor', 'bruteforce'):
      usage()
      return
  element_to_isotop_mass_and_abundance = read_in_elements(args[0])
  element_to_count = parse_molecular_formula(args[1])
  k = int(args[2])
  mode = 'tree'
  if len(args) == 4:
    mode = args[3]
  mass_vectors = []
  abundance_vectors = []
  for e,c in element_to_count.items():
    element_masses = [a for a,b in element_to_isotop_mass_and_abundance[e]]
    element_log_abundances = [numpy.log(b) for a,b in element_to_isotop_mass_and_abundance[e]]
    isotopologue_masses, isotopologue_log_abundances = element_to_multinomial_masses_and_log_abundances(element_masses, element_log_abundances, c)
    mass_vectors.append( isotopologue_masses )
    abundance_vectors.append( isotopologue_log_abundances )
  total_heap_size = reduce( (lambda x, y: x * y), [len(mv) for mv in mass_vectors])
  k = min(k, total_heap_size)
  if mode == 'tree':
    cartesian_heap = TreeCartesianSumHeap(abundance_vectors)
  elif mode == 'tensor':
    cartesian_heap = TensorCartesianSumHeap(abundance_vectors)
  else:
    cartesian_heap = BruteForceCartesianSumHeap(abundance_vectors)
  log_abundance_and_index = [ cartesian_heap.pop_max_and_index() for i in range(k) ]
  print( '{:8} {:8} {:8}'.format('log(prob)', 'prob', 'mass') )
  for i in range(k):
    log_abund, ind = log_abundance_and_index[i]
    mass = sum([ mass_vectors[ell][j] for ell,j in enumerate(ind) ])
    print( '{:8f} {:8f} {:8f}'.format(log_abund, numpy.exp(log_abund), mass) )
  mass_to_total_abundance = defaultdict(float)
  for i in range(k):
    log_abund, ind = log_abundance_and_index[i]
    mass = sum([ mass_vectors[ell][j] for ell,j in enumerate(ind) ])
    mass_to_total_abundance[mass] += numpy.exp(log_abund)
  print( '{:8} {:8}'.format('prob', 'mass') )
  for a,m in sorted([ (b,a) for a,b in mass_to_total_abundance.items() ], reverse=True):
    print( '{:8f} {:8f}'.format(a,m) )
if __name__ == '__main__':
  main(sys.argv[1:])
Listing 5: A method for computing the most abundant isotope peaks using the hierarchical m-dimensional method for finding the top k values of Y = X_1 + X_2 + ... + X_m.

3 Results

3.1 Efficient isotope peak computation

On a fake compound composed of Cl800V800He800C800H800N800O100S6Cu800Ga800Ag800Tl800Ne800, with the cost of computing the multinomial isotopologues not included (this cost was identical for both methods), computing the top 512 isotope peaks via the tensor implementation took 0.002984046 seconds, while the hierarchical tree implementation took 0.00146198272 seconds.

3.2 Time and memory use on arbitrary Y = X_1 + X_2 + ... + X_m

Problems of various sizes were run with fixed vector length n, retrieving the top k values in Y. The brute-force approach was not considered for efficiency reasons. Time and memory use of the tensor method and the hierarchical tree method on a dual Xeon with 128GB of RAM are shown in Figure 2.

Figure 2: Time and memory usage of the tensor and hierarchical tree methods. Problems of different size are timed (left panel) and their memory usage is plotted (right panel). Memory usage of 0GB is due to a footprint low enough that the ps command could not estimate it; therefore, we only show problems for which both methods had nonzero memory usage. The growing gap in both of these log-log plots shows a nonlinear speedup and a nonlinear memory benefit. At the largest problem size, the memory usage of the tensor method is 66.3GB while the proposed hierarchical method uses only 1.15GB.

4 Discussion

As m increases, the method we introduce here is far more time efficient and, more importantly, far more space efficient than the direct m-dimensional generalization of Frederickson & Johnson.

Although the approach we propose here does benefit the computation of intense isotope peaks, the limited number of elements (currently 118) means this application benefits only slightly from our approach. Furthermore, it is rare for many elements to be combined in a single compound. Our demonstration implementation generates multinomials naively, unlike Łącki et al.; the longer runtime from this unnecessary naïveté in preprocessing mutes the speedup of the hierarchical method for elements with several probable isotopes; however, it is possible to use this hierarchical approach with multinomials generated non-naively, as Łącki et al. do, which would likely achieve a modest speedup over their current approach.

However, there are cases outside of the computation of intense isotope peaks in which the hierarchical method we propose would yield large practical benefit. These include maximum a posteriori Bayesian inference on dependencies of the form Y = X_1 + X_2 + ... + X_m[7, 6]. Operations research applications include financial markets, e.g., retrieving the lowest overall bids in each sector of supply lines for a product.

Using a language like JavaScript, this method can be parallelized easily by parallelizing heap pop operations in nodes at the same layer of the tree. If enough hardware parallelism is available, the runtime of a full propagation through the tree would thus be proportional to the height of the tree, which is ceil(log2(m)) pop operations from fringe heaps. Each of these fringe heap pop operations is O(log(k)), and thus the runtime per popped value would be O(log(m) log(k)).

5 Availability

Python source code for this method is available at (MIT license, free for both academic and commercial use).


  • [1] D. Bremner, T. M. Chan, E. D. Demaine, J. Erickson, F. Hurtado, J. Iacono, S. Langerman, and P. Taslakian. Necklaces, convolutions, and X+Y. In Algorithms–ESA 2006, pages 160–171. Springer, 2006.
  • [2] M. Bussieck, H. Hassler, G. J. Woeginger, and U. T. Zimmermann. Fast algorithms for the maximum convolution problem. Operations Research Letters, 15(3):133–141, 1994.
  • [3] G. N. Frederickson and D. B. Johnson. Generalized selection and ranking: Sorted matrices. SIAM Journal on Computing, 13(1):14–30, 1984.
  • [4] G. N. Frederickson and D. B. Johnson. The complexity of selection and ranking in X+Y and matrices with sorted columns. Journal of Computer and System Sciences, 24(2):197–208, 1982.
  • [5] M. K. Łącki, M. Startek, D. Valkenborg, and A. Gambin. IsoSpec: Hyperfast fine structure calculator. Analytical Chemistry, 89(6):3272–3277, 2017.
  • [6] J. Pfeuffer and O. Serang. A bounded p-norm approximation of max-convolution for sub-quadratic Bayesian inference on additive factors. Journal of Machine Learning Research, 17(36):1–39, 2016.
  • [7] O. Serang. The probabilistic convolution tree: Efficient exact Bayesian inference for faster LC-MS/MS protein inference. PLoS ONE, 9(3):e91507, 2014.
  • [8] O. Serang. A fast numerical method for max-convolution and the application to efficient max-product inference in Bayesian networks. Journal of Computational Biology, 22:770–783, 2015.