Predicting Electron-Ionization Mass Spectrometry using Neural Networks

11/21/2018 ∙ by Jennifer N. Wei, et al. ∙ Google, Princeton University, Harvard University


Abstract

When confronted with a substance of unknown identity, researchers often perform mass spectrometry on the sample and compare the observed spectrum to a library of previously-collected spectra to identify the molecule. While popular, this approach will fail to identify molecules that are not in the existing library. In response, we propose to improve the library’s coverage by augmenting it with synthetic spectra that are predicted using machine learning. We contribute a lightweight neural network model that quickly predicts mass spectra for small molecules. Achieving high accuracy predictions requires a novel neural network architecture that is designed to capture typical fragmentation patterns from electron ionization. We analyze the effects of our modeling innovations on library matching performance and compare our models to prior machine learning-based work on spectrum prediction.

1 Introduction

Mass spectrometry (MS) is an important tool used to identify unknown molecular samples in a variety of applications, from characterization of organic synthesis products, to pharmacokinetic studies massspec_pharmakinetics , to forensic studies Zhou2017LatentFingerprints , to analyzing gaseous samples on remote satellites Petrie_ions_in_space .

In electron-ionization mass spectrometry (EI-MS), molecular samples are ionized by an electron beam and broken into fragments. The resultant ions are separated by an electric field until they reach a detector. The mass spectrum is a distribution of the frequency or intensity of each type of ion, ordered by mass-to-charge (m/z) ratio.

A popular method for identifying a sample from its mass spectrum is to look up the sample’s spectrum in a reference library. Here, a similarity function is used to measure the similarity between the query spectrum from the sample and each spectrum in the library. If the measurement noise when obtaining the query spectrum is reasonable, then the library spectrum with the highest similarity will have the identity of the sample stein1995ChemicalSubstructureIdentification ; stein1994optimization . A schematic of this process is shown in Figure 1a.

This library matching approach is very popular, but it suffers from a coverage problem: if the sample consists of a molecule that is not in the library, then correct identification is impossible. This is an issue in practice, since existing mass spectral reference libraries, such as the NIST/NIH/EPA MS database 2017nist , Wiley Registry of Mass Spectral Data mclafferty2016wiley , and MassBank horai2010massbank , only contain hundreds of thousands of reference spectra. The coverage problem could be reduced by recording spectra for additional molecules, but this is time consuming and expensive. For example, NIST releases updates to its library every 3 years, containing roughly 20,000 new spectra. Additionally, mass spectra of new molecules are only added to the library if the molecule is of common interest; spectra for newly synthesized compounds are typically not incorporated 2017nist ; stein2012MassLibReview .

An alternative solution is to use de novo methods that input a spectrum and directly generate a molecule, without using a fixed list of molecules (Section 2). However, these approaches currently have low accuracy and are difficult for practitioners to incorporate into their existing workflows.

Another method for alleviating the coverage problem is to augment existing libraries with synthetic spectra that are generated by a model. Thus far, this approach has not been practical, as existing spectrum prediction methods are very computationally expensive. These prediction models use quantum mechanics calculations bauer2016compute ; grimme2013towards ; Guerra_BEB_model or machine learning allen2016computational to estimate the probability of each bond breaking under ionization, and thus the frequency of each ion fragment. Since these methods must either compute molecular orbital energies with high accuracy using expensive calculations, or else stochastically simulate the fragmentation of the molecule, the time needed for each model to make a prediction scales with the size of the molecule, taking up to 10 minutes for large molecules bauer2016compute ; allen2016computational .

In response, we present Neural Electron Ionization Mass Spectrometry (NEIMS), a neural network that predicts the electron-ionization mass spectrum for a given small molecule. Since our model directly predicts spectra, instead of bond breaking probabilities, it is dramatically faster than previously reported methods, making it possible to generate predictions for thousands of possible candidates in seconds. Furthermore, the approach does not rely on specific details of EI, and thus our model could be easily retrained to predict mass spectra for other ionization methods.

We test the performance of our model by predicting mass spectra for small molecules from the NIST 2017 Mass Spectral Library. We find that the predictive capability of our model is similar to previously reported machine learning models, but requires much less time to make predictions. Additionally, we report the similarity between the spectra predicted by NEIMS and the corresponding measured spectra. The code repository for NEIMS is publicly available at github.com/brain-research/deep-molecular-massspec.

2 Related Work

Several algorithms have been developed previously for either predicting spectra or for predicting the molecule’s identity given the spectrum. We review some of these techniques here.

DENDRAL

One of the earliest efforts in artificial intelligence was a model used to identify molecules from their mass spectrum. Heuristic DENDRAL (Dendritic Algorithm) was a collaboration between chemists and computer scientists at Stanford in the 1960s buchanan1981dendral . This algorithm used expert rules from chemistry to help identify patterns in the spectra and suggest possible identities for the molecule. A few years later, Meta DENDRAL was introduced to learn the expert rules that had originally been given to Heuristic DENDRAL lindsay1993dendral .

De Novo Identification Methods

Several models have been reported to predict identities of samples directly from the spectrum. Many have been developed for tandem mass spectrometry, where the task is to predict the original peptide sequences from digested fragments given the mass spectrum Eng1994sequest . Some of these methods use machine learning to achieve this task Tran8247deepnovo ; Schoenholtz2018supervision . One work even uses machine learning models to identify personal characteristics by analyzing electrospray-ionization mass spectra of samples collected from human fingerprints Zhou2017LatentFingerprints .

While this approach is common for prediction of peptide sequences, it is uncommon for prediction of molecules from spectra. Several previously published models have used neural network models to predict molecule subgroups from spectra, or the class of the molecule curry1990msnet ; spec2smiles . One recent work attempts to employ an LSTM sequence-to-sequence model to predict the molecule directly from its mass spectrum, using the Simplified Molecular Input Line Entry Specification (SMILES) to output the molecule spec2smiles . Because of the difficulty of constructing syntactically correct SMILES, this approach was not able to successfully reconstruct the entire SMILES string for any of the input spectra.

In this work, we focus on the prediction of spectra from molecules, such that these predicted spectra can be used to improve the coverage of library-matching-based identification. The advantage of this approach over de novo approaches is that new libraries of synthetic spectra can be easily incorporated into the existing mass spectrometry software used by practitioners.

Quantum Mechanics Spectral Prediction Methods

The first prediction methods for EI-MS spectra used quantum mechanical simulation techniques to predict fragmentation events. There are three methods of predicting the mass spectrum from first principles bauer2016compute . The first is to use quasi-equilibrium theory, also known as Rice-Ramsperger-Kassel-Marcus theory, to estimate the rate constants for ionization reactions lorquet1994whither ; lorquet2000landmarks ; rosenstock1952absolute . The second is to estimate the bond order energies within a molecule, and from these estimate where the molecule may fragment. A related method is to calculate the cross-section of molecular orbitals upon electron impact to predict the molecule’s ionization behavior irikura2017ab ; Guerra_BEB_model . The third method uses Born-Oppenheimer Molecular Dynamics. Quantum Chemistry Electron-Ionization Mass Spectrometry (QCEIMS) is a particularly recent example of the ab initio molecular dynamics method grimme2013towards ; Asgeirsson_QCEIMS ; bauer2016compute . The trajectories resulting from this simulation are then analyzed for the presence of ionic fragments. The distribution of the ion fragments aggregated from all the simulations is then renormalized to generate a calculated EI-MS spectrum. Each of these methods requires at least 1000 seconds per molecule allen2016computational , and may even take days or weeks for molecules of 50 atoms. While these methods may be fast for methods involving density functional theory, they do not have the speed needed to rapidly generate a collection of spectra for thousands of molecules. Furthermore, some of the basis sets used for the density functional theory calculations might not support the presence of inorganic atoms.

Machine Learning Spectral Prediction Methods

Allen et al. allen2016computational introduced Competitive Fragmentation Modelling-Electron Ionization (CFM-EI) to predict EI-MS spectra. This probabilistic model predicts the probability of breaking molecular bonds under electron ionization, and also predicts the charged fragment that is likely to form. In order to generate the spectra, it is necessary to run a stochastic simulation to determine the frequency of each molecular fragment. In Section 4.2, we directly compare this method with our proposed model.

3 Methods

Figure 1: Library Matching Task. (a) A depiction of how query spectra are matched to a collection of reference spectra as performed by mass spectrometry software. (b) Query spectra are compared against a library comprised of spectra from the NIST 2017 main library and spectra predicted by our model (outlined in blue). Spectral images adapted from NIST Webbook NIST_WebBook .

Our goal is to design a model that will accurately predict the EI-MS spectrum for any molecule. This will be used to produce an augmented reference library containing both predicted spectra and experimentally-measured spectra. This task is outlined in Figure 1b.

We first discuss similarity metrics for spectra in Section 3.1. Next, we describe our method for spectra prediction in Sections 3.2 and 3.3. We then explain how we evaluate our model’s impact on the library matching task more thoroughly in Section 3.4.

3.1 Similarity Metrics for Mass Spectra

The ability to match a query spectrum from a sample to the correct spectrum in the library depends on the choice of similarity metric between spectra mclafferty1974probability ; stein1994optimization . A weighted cosine similarity is commonly used by mass spectrometry software. The exact form of the cosine similarity is given below stein1994optimization :

$$C(I_q, I_l) = \frac{\sum_{k=1}^{M_{\max}} m_k I_{q,k}^{0.5} \cdot m_k I_{l,k}^{0.5}}{\sqrt{\sum_{k=1}^{M_{\max}} m_k^2 I_{q,k}} \, \sqrt{\sum_{k=1}^{M_{\max}} m_k^2 I_{l,k}}} \qquad (1)$$

Here, $I_q$ and $I_l$ are vectors of m/z intensities representing the query spectrum and the library spectrum respectively, $m_k$ and $I_k$ are the mass-to-charge ratio and intensity found at $m/z = k$, $M_q$ and $M_l$ are the largest indices of $I_q$ and $I_l$ with non-zero values, and $M_{\max}$ is the larger of $M_q$ and $M_l$. The weighting by $m_k$ is motivated by the fact that the peaks in mass spectra corresponding to larger fragments are more characteristic and more useful in practice for identifying the true molecule.

Other similarity metrics besides cosine similarity are also employed. For example, one method involves estimating the relative importance of one peak given the other peaks mclafferty1974probability . Other methods use a Euclidean distance between peaks, or a variation of the Hamming distance stein1994optimization ; hertz1971identification . Another similarity metric accounts for neutral losses, or the intensity peaks corresponding to the loss of small, neutral fragments from the original molecular ion moorthy2017combining . It is also possible to use the same form of the similarity function as in (1), but with different weighting given to the intensity or the masses stein1994optimization . In principle, machine learning could also be used to learn a parameterized similarity metric that yields improved library matching performance. However, this custom metric would be difficult to deploy, since it would require changing the software used by practitioners.

We develop our model with the assumption that Eq. (1) will be used for the similarity metric in downstream library matching software that consumes an augmented library.
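As an illustration of the metric, the following is a minimal Python sketch of a weighted cosine similarity between two integer-binned spectra. It is not the NEIMS or NIST implementation. The function name, the list-based spectrum representation, and the default exponents (`mass_power=1.0`, `intensity_power=0.5`) are assumptions made for this example; the exponents are exposed as parameters precisely because different weightings are in use.

```python
import math


def weighted_cosine_similarity(query, library, mass_power=1.0, intensity_power=0.5):
    """Weighted cosine similarity between two integer-binned spectra.

    query, library: sequences of non-negative intensities; index k holds the
    intensity at m/z = k. The default exponents are assumptions of this sketch.
    """
    n = max(len(query), len(library))
    # Zero-pad so both spectra cover bins 0 .. n-1.
    q = list(query) + [0.0] * (n - len(query))
    lib = list(library) + [0.0] * (n - len(library))
    # Weight every peak by (m/z)^mass_power * intensity^intensity_power.
    wq = [(k ** mass_power) * (q[k] ** intensity_power) for k in range(n)]
    wl = [(k ** mass_power) * (lib[k] ** intensity_power) for k in range(n)]
    dot = sum(a * b for a, b in zip(wq, wl))
    norm = math.sqrt(sum(a * a for a in wq)) * math.sqrt(sum(b * b for b in wl))
    # An all-zero spectrum has no direction; report zero similarity.
    return dot / norm if norm > 0.0 else 0.0
```

With this weighting, a library spectrum that reproduces only the high-mass peak of a query scores higher than one that reproduces only the low-mass peak, which is the behavior the mass weighting is meant to produce.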

3.2 Spectral Prediction

We treat the prediction of mass spectrometry spectra as a multi-dimensional regression task. The output of our model is a vector that represents the intensity at every integral m/z bin. We use this discretization granularity for m/z because it is what is provided in the NIST datasets we use for training our model.

In the NEIMS model (Figure 3), we first map molecules to additive Extended Connectivity Fingerprints (ECFPs) rdkit . Like their binary counterparts Rogers_2010_ECFP , these fingerprints record molecular subgraphs made up of local neighborhoods around each atom in the molecule, and hash this information into a vector representation. They differ in that additive fingerprints record how many times each bit is set, rather than just whether it is set. The RDKit cheminformatics package rdkit was used to generate the fingerprints. These features are then passed into a multi-layer perceptron neural network (MLP). To account for some of the physical phenomena of ionization, we make some application-specific adjustments to the prediction from the MLP, described in Section 3.3.
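The difference between binary and additive fingerprints can be illustrated with a toy hashing scheme. The sketch below is not the ECFP algorithm and does not use RDKit: the substructure strings, the function name, and the CRC32-based hashing are all hypothetical stand-ins, chosen only to show presence versus counts in a fixed-length vector.

```python
import zlib


def hashed_fingerprints(substructures, num_bits=16):
    """Return (binary, additive) hashed vectors for substructure identifiers.

    substructures: iterable of strings, one per occurrence of a substructure.
    This toy hashing is NOT the RDKit/ECFP algorithm; it only contrasts
    recording presence with recording counts.
    """
    additive = [0] * num_bits
    for name in substructures:
        # crc32 is deterministic across runs, unlike Python's salted hash().
        index = zlib.crc32(name.encode("utf-8")) % num_bits
        additive[index] += 1
    binary = [1 if count > 0 else 0 for count in additive]
    return binary, additive
```

Under this toy scheme, a molecule with five occurrences of a substructure and one with a single occurrence share the same binary vector but have different additive vectors, so only the additive form can distinguish them.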

In Section 4.1 we compare the performance of NEIMS to that of a simple linear regression (LR) model. Here, we apply a linear transformation to the ECFP features.

To train the model, we use a modified mean-squared-error loss function. This loss function, shown below, follows the same weighting pattern as in Eq. 1:

$$L(I, \hat{I}, M(x)) = \sum_{k=1}^{M(x)} \left( m_k I_k^{0.5} - m_k \hat{I}_k^{0.5} \right)^2 \qquad (2)$$

where $I$ is the ground truth spectrum, $\hat{I}$ is the predicted spectrum, $m_k$ is the mass-to-charge ratio at bin $k$, and $M(x)$ is the mass of the input molecule $x$. We used stochastic gradient descent to optimize the parameters of the MLP with the Adam optimizer Kingma_adam_optimizer . We use TensorFlow Tensorflow-2016 to construct and train the model.
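A mass-weighted squared-error loss of this kind can be sketched in plain Python as follows. This is an illustrative sketch and not the training code: the function name, the list-based spectra, and the default exponents are assumptions, and the sum is left unnormalized for simplicity.

```python
def weighted_squared_error(true_spec, pred_spec, molecule_mass,
                           mass_power=1.0, intensity_power=0.5):
    """Sum of squared mass-weighted intensity errors over bins 1..molecule_mass.

    true_spec, pred_spec: sequences of non-negative intensities indexed by m/z.
    molecule_mass: integer mass of the input molecule; bins above it are ignored.
    The default exponents are assumptions of this sketch.
    """
    def at(spec, k):
        # Treat bins beyond the end of a spectrum as empty.
        return spec[k] if k < len(spec) else 0.0

    total = 0.0
    for k in range(1, molecule_mass + 1):
        diff = (k ** mass_power) * (
            at(true_spec, k) ** intensity_power - at(pred_spec, k) ** intensity_power
        )
        total += diff * diff
    return total
```

In this sketch, missing a unit-intensity peak at m/z = 40 costs sixteen times as much as missing one at m/z = 10, which mirrors the emphasis that the similarity metric places on high-mass peaks.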

3.3 Adjustments for Physical Phenomena

In practice, we have found that the conventional MLP described in the previous section struggles to accurately predict the right-hand side of spectra (Figure 2a). Errors in this region, which corresponds to large m/z, are particularly damaging for library matching with the weighting in (1).

This section introduces a revised neural network architecture (Figure 3) designed to better model the underlying fragmentation process that occurs in mass spectrometry. We have found that it improves prediction in the high mass region of the spectrum (Figure 2b), which yields improvements in library matching (Section 4.1).

As is standard for MLPs used for regression, the predictions of the above MLP model on an input molecule $x$ are an affine transformation of a set of features $f(x)$, which are computed by all but the final layer of the network. For reasons that will become apparent, we refer to the above MLP as performing forward prediction. At bin $i$, we have the following predicted intensity:

$$p^f_i(x) = {w^f_i}^\top f(x) + b^f_i \qquad (3)$$

where $w^f_i$ and $b^f_i$ are the model’s weights and biases for forward prediction at bin $i$.

The input ECFP features, from which $f(x)$ is computed, capture local structures in the molecule, so $f(x)$ generally will be more accurate in capturing the presence of small substructures of molecule $x$. Often, there is a direct correspondence between the presence of such substructures and spectral peaks with small m/z. For example, in Figure 2a a peak occurs at $m/z = 35$, due to the presence of chlorine. Therefore an accurate forward prediction model will have a learned weight $w^f_{35}$ that will output a high intensity at $m/z = 35$ if there is evidence in $f(x)$ for the presence of chlorine.

Figure 2: Spectral Prediction with MLP forward Model (a) and MLP bidirectional Model (b). For both spectra plots, the true spectrum is shown in blue on top, while the predicted spectrum is shown inverted in red.  Note that the spectrum predicted by the bidirectional model shows fewer stray peaks than the forward model, particularly for larger m/z values. These peaks are much easier to predict with the reverse prediction mode.

On the other hand, forward prediction often struggles to accurately predict intensities for large fragments that are the result of neutral losses stein1995ChemicalSubstructureIdentification . One reason for this is that the composition of large fragments is not captured well by the ECFP representation. Another reason is that information learned about the cleavage of a small group does not transfer well across molecules of different masses. For pentachlorobenzene, which has a molecular mass of 250 Da, the fragment that results from the loss of a neutral chlorine atom results in a peak at 215 Da. Meanwhile, for chlorobenzene, which has a mass of 112 Da, the fragment resulting from a loss of a chlorine atom would have a peak at 77 Da. Despite the clear relationship between these intensity peaks, the forward model is not parameterized to capture this pattern.

In response, following the physical phenomenon that created the fragments, we define larger ion peaks as a function of the residual groups that were broken off from the original molecule. Referring to our previous example of pentachlorobenzene ($M(x) = 250$), we can parameterize the m/z ratio of the fragment which lost a chlorine atom as $M(x) - 35 = 215$. The corresponding fragment in chlorobenzene ($M(x) = 112$) would have a mass of $M(x) - 35 = 77$. By defining the peaks in this way, it is possible for these predictions of spectral intensities to be linked by the prediction at index 35. This leads to the indexing scheme of our reverse prediction model:

$$p^r_{M(x) + \tau - i}(x) = {w^r_i}^\top f(x) + b^r_i \qquad (4)$$

Here, $M(x)$ is the mass of the input molecule $x$, and $\tau$ is a small shift that allows for peaks to occur at m/z values greater than $M(x)$, due to isotopes. In practice, reverse prediction is implemented using a copy of the forward model, with separate sets of parameters for the final affine layer, but shared parameters for $f(x)$. The outputs of this model are post-processed on a per-molecule basis to obey the indexing in (4), which depends on each molecule’s mass.
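The per-molecule post-processing step can be sketched as a simple re-indexing of the raw reverse outputs. The sketch below is illustrative only: the function name, the argument names, and the convention that raw output `i` lands in bin `molecule_mass + shift - i` are assumptions consistent with the description above, not code from the NEIMS repository.

```python
def reverse_to_spectrum(raw_reverse, molecule_mass, shift, num_bins):
    """Place raw reverse output i at m/z bin (molecule_mass + shift - i).

    raw_reverse: sequence of predicted intensities indexed by mass lost.
    molecule_mass: integer mass of the molecule.
    shift: small integer offset allowing peaks above the molecular mass.
    num_bins: length of the returned spectrum.
    """
    spectrum = [0.0] * num_bins
    for i, value in enumerate(raw_reverse):
        k = molecule_mass + shift - i
        # Outputs that would fall outside the spectrum are dropped.
        if 0 <= k < num_bins:
            spectrum[k] = value
    return spectrum
```

With a shift of zero, the same raw output at index 35 lands at bin 215 for a molecule of mass 250 and at bin 77 for a molecule of mass 112, so a single learned output can describe the loss of chlorine in both molecules.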

Both the forward and reverse predictions are combined to form a bidirectional prediction. That is, the final prediction at index $i$ is a combination of both the forward prediction $p^f_i(x)$ and the reverse prediction $p^r_i(x)$. In the case of pentachlorobenzene, the prediction of spectral intensity at $m/z = 215$ is a function of $w^f_{215}$ from the forward mode and $w^r_{35+\tau}$ from the reverse mode. Instead of simply averaging the two prediction modes, we have found that small additional performance improvements can be obtained using a coordinate-wise gate. Here, the output at position $i$ is given by:

$$p_i(x) = \sigma(g_i(x)) \, p^f_i(x) + \left(1 - \sigma(g_i(x))\right) p^r_i(x) \qquad (5)$$

where $g(x)$ is an affine transformation of $f(x)$ and $\sigma$ is a sigmoid function. This approach echoes the formulation of the Hybrid Similarity Search designed by Moorthy et al., which accounts for peaks that are created by small fragment ions and those which are created by large fragments which have lost smaller groups moorthy2017combining .

Finally, for all models, we zero out predicted intensities at m/z values greater than $M(x) + \tau$, i.e. above the molecular mass plus the small isotope shift.
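The gating and masking steps can be sketched together in a few lines of Python. This is a minimal sketch under stated assumptions: the function and argument names are hypothetical, the forward and reverse predictions are assumed to be already aligned to the same m/z bins, and the gate values are passed in as plain numbers rather than computed from learned features.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bidirectional_prediction(forward, reverse, gate_logits, molecule_mass, shift):
    """Gate forward and reverse predictions bin by bin, then mask high bins.

    forward, reverse, gate_logits: equal-length sequences indexed by m/z bin.
    Bins above molecule_mass + shift are set to zero.
    """
    output = []
    for k, (pf, pr, g) in enumerate(zip(forward, reverse, gate_logits)):
        if k > molecule_mass + shift:
            # No fragment can be heavier than the molecule plus the shift.
            output.append(0.0)
        else:
            s = sigmoid(g)
            output.append(s * pf + (1.0 - s) * pr)
    return output
```

A gate logit of zero reduces to a plain average of the two modes, while large positive or negative logits select the forward or reverse prediction respectively.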

By adding these features, we incorporate some of the physical phenomena that occur in mass spectrometry into our model while maintaining the overall simplicity of the MLP. In this way, we are able to predict the spectrum directly without resorting to sampling bond-breaking events within the molecule, which requires subsequent stochastic sampling to obtain a spectrum.

Figure 3: Molecular representations are passed into a multilayer perceptron to generate an initial output. This output is used to make a forward prediction starting at $m/z = 0$, and a reverse prediction starting from $m/z = M(x) + \tau$ and ending at $m/z = 0$. A sigmoid gating is applied to the inputs as shown in Eq. 5.

3.4 Library Matching Evaluation

We evaluate NEIMS using an augmented reference library consisting of a combination of observed spectra and model-predicted spectra, with library matching performance computed with respect to a query set of spectra. The query spectra are from the NIST 2017 replicates library, which is a collection of noisier spectra for molecules that are contained in the NIST main library. The inconsistencies in these spectra reflect experimental variation, and make an informative dataset to test our model’s performance.

To construct the augmented reference library, we edit the NIST main library, removing spectra corresponding to the query set molecules and replacing them with the predictions from NEIMS. We then perform library matching and calculate the similarity between each query spectrum and every spectrum from the augmented library. We record the rank of the correct spectrum, i.e. the rank of the predicted spectrum corresponding to the molecule which made the query spectrum. The similarity metric is Eq. (1).

For the purposes of tuning model hyperparameters, we chose to optimize recall@10, i.e. the percentage of our query set for which the correct spectra had a matching rank of less than or equal to 10 in the library matching task. Half of the replicates library was used for tuning hyperparameters, and the remaining half was used to evaluate test performance. All models were trained on the spectra prediction task for 100,000 training steps with a batch size of 100.
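The rank and recall@k computations can be sketched as follows. The function names are hypothetical, and the tie-breaking rule (ties do not worsen the rank of the correct entry) is an assumption of this sketch rather than something stated in the text.

```python
def rank_of_correct(similarities, correct_index):
    """1-based rank of the correct library entry, sorted by descending similarity.

    similarities: similarity of the query to every library spectrum.
    correct_index: position of the correct spectrum in that list.
    Ties are resolved in favor of the correct entry (an assumption).
    """
    target = similarities[correct_index]
    return 1 + sum(1 for s in similarities if s > target)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct spectrum has rank <= k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, ranks of 1, 3, 12, and 10 across four queries give a recall@10 of 0.75 and a recall@1 of 0.25.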

During the library match search, we have a mass filtering option. This feature reduces the library so that it only includes spectra from candidate molecules whose molecular mass is within a few daltons of the mass of the query molecule. If the EI-MS analysis is combined with mass spectrometry techniques using weak ionization methods, it is possible to determine the mass of the molecule being analyzed. In the CFM-EI model, the molecular formula is used to filter the search library allen2016computational . Using the molecular mass to filter the library allows more possible candidate spectra to be considered in the search than using a molecular formula filter.
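A mass filter of this kind amounts to a single pass over the library. In the sketch below, the library is assumed to be a list of dictionaries with a `"mass"` key; that layout, the function name, and the inclusive tolerance are illustrative assumptions.

```python
def mass_filter(library, query_mass, tolerance):
    """Keep library entries whose mass is within `tolerance` Da of the query.

    library: list of dicts, each with a numeric "mass" entry (hypothetical layout).
    query_mass: molecular mass of the query molecule, in daltons.
    tolerance: maximum allowed absolute mass difference, in daltons (inclusive).
    """
    return [entry for entry in library
            if abs(entry["mass"] - query_mass) <= tolerance]
```

The filtered list can then be passed to the similarity search in place of the full library, which shrinks the candidate set without changing the similarity metric itself.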

4 Results and Discussion

To analyze the performance of the models, we trained with 240,942 spectra from the NIST 2017 Mass Spectral Main Library. These spectra were selected so that no molecules in the replicates library have spectra in the training set.

After hyperparameter tuning using Vizier Google_Vizier , we found that the optimal MLP architecture has seven layers of 2000 nodes, with residual network connections between the layers he_resnet , using ReLU activation and a dropout rate of 0.25.

4.1 Library Matching Results

Figure 4: Performance of different model architectures. (a) Library matching performance on different models. (b) Recall results at various levels.

We first examine the effects of our various modeling decisions on performance. Figure 4a compares the performance of forward, reverse, and bidirectional versions of the linear regression and MLP models on the library matching task. For bidirectional prediction in the linear regression model, the forward and reverse predictions are simply averaged together, rather than applying the gate described in (5).

The top row of Figure 4a shows that it is not possible to achieve perfect recall accuracy on the library matching task even when using the full NIST main library as the reference library, without any model-predicted spectra. Observing Figure 4b, we see that using the NIST main library as the reference library, we have 86% recall@1 accuracy, and 98.3% recall@10 accuracy. This serves as a practical upper bound on achievable library matching accuracy and reflects the experimental inconsistencies between the main library spectra and replicates spectra stein2012MassLibReview .

The forward prediction mode for both the linear regression model and the multilayer perceptron (MLP) has poor performance. The linear regression model is improved by roughly 20% when switching to reverse mode prediction. Using bidirectional prediction mode improves recall@10 accuracy by 30% for both the linear regression and the multilayer perceptron model. This finding suggests that the bidirectional prediction mode is more effective at capturing the fragmentation events than the forward-only model.

Figure 2 shows the improvement in spectral prediction for pentachlorobenzene using the bidirectional MLP model. Note that the bidirectional model on the right more accurately models intensities at larger m/z. The intensity peaks for larger m/z are critical for determining the identity of a molecule, and are more heavily weighted in Eq. (1).

NEIMS achieves 91.7% recall@10 after applying a mass filter. The mass filter was set to a tolerance of 5 Daltons of the query molecule’s mass; this reduces the size of the library to a median of 6,696 spectra for each query molecule. In practice, this tolerance window could be set to a larger window, depending on the uncertainty of the information about the molecular mass of the ion. For the rest of this report, we will refer to the bi-directional multi-layer perceptron model with mass filtering of 5 Daltons as the default settings for NEIMS.

From Figure 4b we see that while NEIMS has decent performance for recall levels of 10 and above compared to the NIST spectral library, it has considerably worse performance for recall levels of 1 and 5. This result is unsurprising given that the hyperparameters of the model were tuned to maximize performance on recall@10. If recall@1 were instead selected to tune the hyperparameters, the accuracy on recall@1 would improve.

4.2 Comparison to previously reported models

We next compared our model’s performance directly to the performance of the CFM-EI model allen2016computational . The setup of Allen et al. differs from our current setup in a few ways. First, they evaluate their model on the NIST ’14 spectral library. Second, for the library matching task, their augmented reference library contains only spectra predicted by their model, and none from the original NIST collection. Third, the cosine similarity metric Eq. (1) used for evaluation in library matching in CFM-EI uses a different weighting scheme. In their analysis, the cosine similarity is weighted by $m_k^{0.5}$ instead of $m_k$ in order to de-emphasize the larger peaks in the mass spectrum, as they ran their experiments on other datasets with a higher proportion of larger molecules allen2016computational .

To compare the performance of NEIMS to that of CFM-EI, we match their setup identically. We retrain our NEIMS model on the NIST ’14 dataset, and evaluate the performance using the NIST ’14 replicates as the query set. For library matching, we incorporate only predicted spectra into our augmented library, and use the same modified similarity metric.

| Model | Recall@1 (%) | Recall@10 (%) | Average run time (ms) |
| NIST ’14 Reference Library | 77 | 99* | - |
| CFM-EI | 42.6 | 89* | 300,000 |
| NEIMS | 54.3 | 92.7 | 0.47 |

Table 1: Performance on the library matching task for NIST ’14. * indicates that values were estimated from Figure 4 of Allen et al. allen2016computational

The library matching performance of CFM-EI and NEIMS, alongside that of the NIST ’14 reference library, is reported in Table 1. NEIMS performs slightly better than CFM-EI on the library matching task. More importantly, NEIMS is able to make spectral predictions orders of magnitude faster than CFM-EI. With NEIMS, it would be possible to generate spectra for 1 million molecules in 90 min on a CPU, with potential for considerable speedup using a GPU.

4.3 Distances between predicted and ground truth spectra

Figure 5: Comparing the similarity between the predicted spectrum and the ground truth spectrum to the overall similarity between spectra for the same molecule.

So far, we have evaluated the quality of the NEIMS predictions indirectly, by way of how they affect library matching with an augmented library. Next, we assess the prediction accuracy directly, by measuring the similarity (Eq. 1) between spectra in the NIST main library and the model’s predictions. We refer to this similarity as the predicted similarity.

There is inherent noise in mass spectra due to stochasticity of the underlying physical process and also to experimental inconsistencies stein2012MassLibReview . The NIST replicates library provides multiple spectra for each molecule, and we can use these sets of spectra to characterize the scale of this noise for each molecule. Specifically, we define the inherent noise for a given molecule as the average pairwise similarity between all corresponding spectra, both in the NIST main library and the NIST replicates library, and refer to this as the overall similarity.

For each molecule, we compute the ratio of the predicted similarity to overall similarity as a normalized metric for the quality of our predictions. A ratio of 1.0 would suggest that there is limited available headroom for improvements using machine learning, since the model’s errors are comparable to the variability in the data.
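The normalized metric can be sketched as two small functions. The names are hypothetical, and the similarity function is passed in as an argument so that the sketch does not commit to a particular weighting; it assumes each molecule has at least two measured spectra and a non-zero overall similarity.

```python
from itertools import combinations


def overall_similarity(spectra, similarity):
    """Average pairwise similarity among all measured spectra of one molecule.

    spectra: list of at least two spectra (main library plus replicates).
    similarity: function taking two spectra and returning a number.
    """
    pairs = list(combinations(spectra, 2))
    if not pairs:
        raise ValueError("need at least two spectra to estimate inherent noise")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def similarity_ratio(predicted, main_spectrum, measured_spectra, similarity):
    """Ratio of predicted similarity to overall similarity for one molecule."""
    predicted_similarity = similarity(predicted, main_spectrum)
    return predicted_similarity / overall_similarity(measured_spectra, similarity)
```

A ratio above 1 arises in this sketch whenever the measured spectra agree with each other less well than the prediction agrees with the main library spectrum.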

Figure 5 shows the improvement in this ratio for the MLP bidirectional model over the MLP forward model, confirming that the bidirectional model has better spectral prediction performance. For the MLP bidirectional model, roughly half of the molecules have a predicted similarity to overall similarity ratio that is greater than 0.9, indicating that there is potential for further improvement to the model. Some of these molecules have ratios that are greater than 1, which is possible if there is more variation between the spectra (i.e. a lower overall similarity) than between the predicted spectrum and the main library spectrum (i.e. predicted similarity).

5 Conclusion

We demonstrate that NEIMS achieves high library matching performance on an augmented spectral library containing predictions for molecules in the query set. The performance of NEIMS is also slightly better than existing machine learning models for predicting EI-MS spectra, with a significant boost in speed of prediction.

The high performance in library matching is attributable to the bidirectional prediction mode. The reverse mode in particular allows the model to more accurately predict intensities for larger fragments which result from the loss of small neutral subgroups. We observe that the improvement in the library matching task also corresponds with improvement in the similarity of the predicted spectra to the ground truth spectra.

Several adjustments could be made to further improve NEIMS. For example, NEIMS currently does not have a method to model intensity peaks corresponding to isotopes in ion fragments. If we were to train on spectral data with greater precision in the peak locations, we might be able to learn the exact identities of the atoms based on the decimal values of the m/z peak locations.

Mass filtering improved the performance of NEIMS by 6%. This suggests that for experimental setups where it is possible to know the molecular mass of the sample with some accuracy, it is possible to improve the accuracy of matching on the augmented spectral library. It would also be interesting to explore other settings for mass filtering, such as filtering out spectra which have a molecular mass that is much smaller than the position of the largest m/z peak.
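As a rough sketch of how mass filtering could sit in front of library matching, the snippet below discards library entries whose molecular mass differs from the query's known mass by more than a tolerance, and then ranks the remaining entries by similarity. The entry fields (`name`, `mass`, `spectrum`), the default tolerance, and the unnormalized dot-product score are all assumptions made for illustration; they do not reflect the settings used in the experiments.

```python
def dot_similarity(a, b):
    """Unnormalized dot product between two equal-length intensity vectors.

    Assumption: a stand-in for the similarity metric used for matching.
    """
    return sum(x * y for x, y in zip(a, b))


def mass_filter(library, query_mass, tolerance=1.0):
    """Keep entries whose molecular mass is within `tolerance` of the query.

    `tolerance` is a hypothetical parameter in the same units as the masses.
    """
    return [e for e in library if abs(e["mass"] - query_mass) <= tolerance]


def rank_matches(query_spectrum, query_mass, library, tolerance=1.0):
    """Rank mass-filtered library entries by similarity to the query spectrum."""
    candidates = mass_filter(library, query_mass, tolerance)
    return sorted(
        candidates,
        key=lambda e: dot_similarity(query_spectrum, e["spectrum"]),
        reverse=True,
    )
```

Filtering first shrinks the candidate set, so an entry with a similar spectrum but an incompatible molecular mass can no longer outrank the correct molecule.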

Different molecular representations could also be tested. The predictions made from ECFP are limited by the descriptiveness of the fingerprint rdkit_blogpost_collide_bits . In particular, different molecular features can be mapped to the same part of the fingerprint, and this overlap limits how faithfully the molecule is represented. Additionally, ECFPs are not equipped to distinguish between stereoisomers of molecules with multiple stereocenters, which will have different spectra. It would also be interesting to explore whether a bond-based molecular fingerprint representation kearnes2016molecular or other graph-based molecular representations duvenaud_convolutional_2015 ; gilmer_2017_mpnn may improve performance.
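The overlap in representation mentioned above can be illustrated with a toy folding scheme. The sketch assumes that each molecular feature has an integer identifier and that identifiers are folded into a fixed-length bit vector by taking them modulo the vector length; both the identifiers and the helper names are hypothetical, and the snippet does not call any cheminformatics library.

```python
def fold_to_bits(feature_ids, n_bits):
    """Fold integer feature identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for fid in feature_ids:
        # Assumed folding rule: identifier modulo the vector length.
        bits[fid % n_bits] = 1
    return bits


def colliding_features(feature_ids, n_bits):
    """Return the bits that are set by more than one distinct feature."""
    by_bit = {}
    for fid in set(feature_ids):
        by_bit.setdefault(fid % n_bits, set()).add(fid)
    return {bit: ids for bit, ids in by_bit.items() if len(ids) > 1}
```

With an 8-bit vector, for example, features 3 and 11 set the same bit, so a model that only sees the folded vector cannot tell whether one or both of them are present.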

Combining NEIMS with transfer learning methods could allow for spectral prediction specific to individual spectrometry machines. A library of such machine-specific spectra would improve matching stein2012MassLibReview .

The lightweight framework of NEIMS makes it possible to rapidly generate spectral predictions for large numbers of molecular candidates. This collection of predicted spectra can then be used directly in mass spectrometry software to expand the coverage of molecules which can be identified by mass spectrometry. Because the requirements of NEIMS have limited dependence on EI mass spectrometry, it is likely that some of the principles used here could be extended to other types of mass spectrometry.

6 Acknowledgments

We thank Stephen Stein for fruitful discussions about mass spectrometry and for providing helpful feedback on this manuscript. We thank Laura Castellanos for her insights about mass spectrometry. We thank Steven Kearnes for his helpful comments, and Lucy Colwell and Michael Brenner for their helpful conversations.

7 Supplementary information

The code for this work can be found at github.com/brain-research/deep-molecular-massspec

References

  • [1] Yunsheng Hsieh and Walter A. Korfmacher. Increasing speed and throughput when using HPLC-MS/MS systems for drug metabolism and pharmacokinetic screening. Current Drug Metabolism, 7(5):479–489, 2006.
  • [2] Zhenpeng Zhou and Richard N. Zare. Personal information from latent fingerprints using desorption electrospray ionization mass spectrometry and machine learning. Analytical Chemistry, 89(2):1369–1372, 2017. PMID: 28194988.
  • [3] Simon Petrie and Diethard Kurt Bohme. Ions in space. Mass Spectrometry Reviews, 26(2):258–280, 2006.
  • [4] Stephen E Stein. Chemical substructure identification by mass spectral library searching. Journal of the American Society for Mass Spectrometry, 6(8):644–655, 1995.
  • [5] Stephen E Stein and Donald R Scott. Optimization and testing of mass spectral library search algorithms for compound identification. Journal of the American Society for Mass Spectrometry, 5(9):859–866, 1994.
  • [6] Stephen E Stein. National Institute of Standards and Technology (NIST) Mass Spectral Database, 2017.
  • [7] Fred W Mclafferty. Wiley Registry of Mass Spectral Data. John Wiley and Sons, 11th edition, 2016.
  • [8] Hisayuki Horai, Masanori Arita, Shigehiko Kanaya, Yoshito Nihei, Tasuku Ikeda, Kazuhiro Suwa, Yuya Ojima, Kenichi Tanaka, Satoshi Tanaka, Ken Aoshima, et al. Massbank: a public repository for sharing mass spectral data for life sciences. Journal of mass spectrometry, 45(7):703–714, 2010.
  • [9] Stephen Stein. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Analytical Chemistry, 84:7274 – 7282, 2012.
  • [10] Christoph Alexander Bauer and Stefan Grimme. How to compute electron ionization mass spectra from first principles. The Journal of Physical Chemistry A, 120(21):3755–3766, 2016.
  • [11] Stefan Grimme. Towards first principles calculation of electron impact mass spectra of molecules. Angewandte Chemie International Edition, 52(24):6306–6312, 2013.
  • [12] M. Guerra, F. Parente, P. Indelicato, and J. P. Santos. Modified binary encounter Bethe model for electron-impact ionization. ArXiv e-prints, June 2013.
  • [13] Felicity Allen, Allison Pon, Russ Greiner, and David Wishart. Computational prediction of electron ionization mass spectra to assist in gc/ms compound identification. Analytical chemistry, 88(15):7689–7697, 2016.
  • [14] Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313–322. Elsevier, 1981.
  • [15] Robert K Lindsay, Bruce G Buchanan, Edward A Feigenbaum, and Joshua Lederberg. Dendral: a case study of the first expert system for scientific hypothesis formation. Artificial intelligence, 61(2):209–261, 1993.
  • [16] Jimmy K. Eng, Ashley L. McCormack, and John R. Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11):976–989, Nov 1994.
  • [17] Ngoc Hieu Tran, Xianglilan Zhang, Lei Xin, Baozhen Shan, and Ming Li. De novo peptide sequencing by deep learning. Proceedings of the National Academy of Sciences, 114(31):8247–8252, 2017.
  • [18] S. S. Schoenholz, S. Hackett, L. Deming, E. Melamud, N. Jaitly, F. McAllister, J. O’Brien, G. Dahl, B. Bennett, A. M. Dai, and D. Koller. Peptide-Spectra Matching from Weak Supervision. ArXiv e-prints, August 2018.
  • [19] Bo Curry and David E Rumelhart. Msnet: A neural network which classifies mass spectra. Tetrahedron Computer Methodology, 3(3-4):213–237, 1990.
  • [20] T. Rabinowitz. Mass-spectrometry-prediction. https://github.com/terryrabinowitz/Mass-Spectrometry-Prediction/blob/master/readme.pdf, 2017.
  • [21] JC Lorquet. Whither the statistical theory of mass spectra? Mass Spectrometry Reviews, 13(3):233–257, 1994.
  • [22] Jean-Claude Lorquet. Landmarks in the theory of mass spectra. International Journal of Mass Spectrometry, 200(1-3):43–56, 2000.
  • [23] Henry Meyer Rosenstock, MB Wallenstein, AL Wahrhaftig, and Henry Eyring. Absolute rate theory for isolated systems and the mass spectra of polyatomic molecules. Proceedings of the National Academy of Sciences, 38(8):667–678, 1952.
  • [24] Karl K Irikura. Ab initio computation of energy deposition during electron ionization of molecules. The Journal of Physical Chemistry A, 121(40):7751–7760, 2017.
  • [25] Vilhjálmur Ásgeirsson, Christoph A. Bauer, and Stefan Grimme. Quantum chemical calculation of electron ionization mass spectra for general organic and inorganic molecules. Chem. Sci., 8:4879–4895, 2017.
  • [26] Peter J Linstrom and William G Mallard. The nist chemistry webbook: A chemical data resource on the internet. Journal of Chemical & Engineering Data, 46(5):1059–1063, 2001.
  • [27] FW McLafferty, RH Hertel, and RD Villwock. Probability based matching of mass spectra. rapid identification of specific compounds in mixtures. Journal of Mass Spectrometry, 9(7):690–702, 1974.
  • [28] Harry S Hertz, Ronald A Hites, and Klaus Biemann. Identification of mass spectra by computer-searching a file of known spectra. Analytical Chemistry, 43(6):681–691, 1971.
  • [29] Arun S Moorthy, William E Wallace, Anthony J Kearsley, Dmitrii V Tchekhovskoi, and Stephen E Stein. Combining fragment-ion and neutral-loss matching during mass spectral library searching: A new general purpose algorithm applicable to illicit drug identification. Analytical chemistry, 89(24):13261–13268, 2017.
  • [30] RDKit: Open-source cheminformatics, 2018. [Online; accessed 18-November-2018].
  • [31] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. PMID: 20426451.
  • [32] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014.
  • [33] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [34] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Elliot Karro, and D. Sculley, editors. Google Vizier: A Service for Black-Box Optimization, 2017.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. ArXiv e-prints, December 2015.
  • [36] G. Landrum. Colliding bits, 2014.
  • [37] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
  • [38] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. Adv. Neural. Inf. Process. Syst. 28, pages 2224–2232, 2015.
  • [39] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural Message Passing for Quantum Chemistry. ArXiv e-prints, April 2017.