Simulation Assisted Likelihood-free Anomaly Detection

01/14/2020 ∙ by Anders Andreassen, et al. ∙ Google, Rutgers University, Berkeley Lab

Given the lack of evidence for new particle discoveries at the Large Hadron Collider (LHC), it is critical to broaden the search program. A variety of model-independent searches have been proposed, adding sensitivity to unexpected signals. There are generally two types of such searches: those that rely heavily on simulations and those that are entirely based on (unlabeled) data. This paper introduces a hybrid method that makes the best of both approaches. For potential signals that are resonant in one known feature, this new method first learns a parameterized reweighting function to morph a given simulation to match the data in sidebands. This function is then interpolated into the signal region, where the reweighted background-only simulation can be used for supervised learning as well as for background estimation. The background estimation from the reweighted simulation allows for non-trivial correlations between features used for classification and the resonant feature. A dijet search with jet substructure is used to illustrate the new method. Future applications of Simulation Assisted Likelihood-free Anomaly Detection (SALAD) include a variety of final states and potential combinations with other model-independent approaches.




1 Introduction

An immense search effort by the LHC collaborations has successfully probed many extreme regions of the Standard Model phase space atlasexoticstwiki ; atlassusytwiki ; atlashdbspublictwiki ; cmsexoticstwiki ; cmssusytwiki ; cmsb2gtwiki ; lhcbtwiki . Despite strong theoretical and non-collider experimental motivation, there is currently no convincing evidence for new particles or forces of nature from the LHC searches. However, many final states remain unexplored Kim:2019rhy ; Craig:2016rqv , and the full hypervariate phase space accessible with modern detector technology is only starting to be probed holistically with deep learning methods Larkoski:2017jix ; Guest:2018yhq ; Abdughani:2019wuv ; Radovic:2018dip . There is a great need for new searches that can identify unexpected scenarios.

Until recently, nearly all model independent searches relied heavily on simulation. Generically, these searches operate by comparing data with background-only simulation in a large number of phase space regions. Such searches have been performed without machine learning at D0 sleuth ; Abbott:2000fb ; Abbott:2000gx ; Abbott:2001ke , H1 Aaron:2008aa ; Aktas:2004pz , CDF Aaltonen:2007dg ; Aaltonen:2007ab ; Aaltonen:2008vt , CMS CMS-PAS-EXO-14-016 ; CMS-PAS-EXO-10-021 , and ATLAS Aaboud:2018ufy ; ATLAS-CONF-2014-006 ; ATLAS-CONF-2012-107 . A recent phenomenological study proposed extending this idea to deep learning classifiers DAgnolo:2018cun ; DAgnolo:2019vbw . While independent of signal models, these approaches are dependent on the fidelity of the background model simulation for both signal sensitivity and background accuracy. If the background simulation is inaccurate, then differences between simulation and (background-only) data will hide potential signals. Even if a biased simulation can find a signal, if the background is mis-modeled, then the signal specificity will be poor.

A variety of approaches have been proposed to enhance signal sensitivity without simulations. Such proposals are based on clustering or nearest neighbor algorithms DeSimone:2018efk ; Mullin:2019mmh ; 1809.02977 , autoencoders Farina:2018fyg ; Heimel:2018mkt ; Roy:2019jae ; Cerri:2018anq ; Blance:2019ibf ; Hajer:2018kqm , probabilistic modeling Dillon:2019cqt , weak supervision Collins:2018epr ; Collins:2019jip , density estimation anode , and others Aguilar-Saavedra:2017rzt . These approaches must also be combined with a background estimation strategy. If simulation is used to estimate the background, then the specificity is the same as for the model-dependent searches. Many of these approaches can be combined with a resonance search, as explicitly demonstrated in Ref. Collins:2018epr ; Collins:2019jip . The background estimation strategy may impose additional constraints on the learning, such as the need for decorrelation between a resonant feature and other discriminative features Louppe:2016ylz ; Dolen:2016kst ; Moult:2017okx ; Stevens:2013dya ; Shimmin:2017mfk ; Bradshaw:2019ipy ; ATL-PHYS-PUB-2018-014 ; Xia:2018kgd ; Englert:2018cfo ; Wunsch:2019qbo ; Disco . A detailed overview of model independent approaches can be found in Ref. anode .

While it is desirable to be robust to background model inaccuracies, it is also useful to incorporate information from Standard Model simulations. Even though these simulations are only an approximation to nature, they include an extensive set of fundamental and phenomenological physics models describing the highest energy reactions all the way to signal formation in the detector electronics. This paper describes a method that uses a background simulation in a way that depends as little on that simulation as possible. In particular, a model based on the Deep neural networks using Classification for Tuning and Reweighting (dctr) procedure Andreassen:2019nnm is trained in a region of phase space that is expected to be devoid of signals. In a resonance search, there is one feature where the signal is known to be localized and the sideband can be used to train the dctr model. This reweighting function learns to morph the simulation into the data and is parameterized in the feature(s) used to mask potential signals. Then, the model is interpolated to the signal-sensitive region and the reweighted background simulation can be used for both enhancing signal sensitivity and estimating the Standard Model background. As deep learning classifiers can naturally probe high dimensional spaces, this reweighting model can in principle exploit the full phase space for both enhancing signal sensitivity and specificity.

This paper is organized as follows. Section 2 introduces the Simulation Assisted Likelihood-free Anomaly Detection (salad) method. A dijet search at the LHC is emulated to illustrate the new method. The simulation and deep learning setup are introduced in Sec. 3 and then the application of dctr is shown in Sec. 4. The signal sensitivity and specificity are presented in Sec. 5 and 6, respectively. The paper ends with conclusions and outlook in Sec. 7.

2 Methods

Let m be a feature (or set of features) that can be used to localize a potential signal in a signal region (SR). Furthermore, let x be another set of features which are useful for isolating a potential signal. The prototypical example is a resonance search, where m is the single resonant feature, such as the invariant mass of two jets, while x are other properties of the event, such as the substructure of the two jets. The salad method then proceeds as follows:

  1. Train a classifier f to distinguish data and simulation for m outside the SR. This classifier is parameterized in m by simply augmenting the feature set x with m, f = f(x, m) Cranmer:2015bka ; Baldi:2016fzo . If f is trained using the binary cross entropy or the mean squared error loss, then asymptotically, a weight function is defined by

    w(x, m) = f(x, m) / (1 − f(x, m)) = [p_data(x | m) / p_sim(x | m)] × (N_data / N_sim),   (1)

    where the last factor in Eq. 1, N_data / N_sim, is an overall constant that is the ratio of the total amount of data to the total amount of simulation. This property of neural networks to learn likelihood ratios has been exploited for a variety of full phase space reweighting and parameter estimation proposals in high energy physics Andreassen:2019nnm ; Brehmer:2018hga ; Brehmer:2018eca ; Brehmer:2018kdj ; Cranmer:2015bka ; Andreassen:2019cjw .

  2. Simulated events in the SR are reweighted using w(x, m). The function w is interpolated automatically by the neural network. A second classifier g(x) is used to distinguish the reweighted simulation from the data. This can be achieved in the usual way with a weighted loss function such as the binary cross-entropy:

    L[g] = − Σ_{x ∈ data} log g(x) − Σ_{x ∈ sim} w(x, m) log(1 − g(x)).   (2)

    Events are then selected with large values of g(x). Asymptotically (that is, with a sufficiently flexible neural network architecture, enough training data, and an effective optimization procedure), g will be monotonically related with the optimal classifier:

    g(x) / (1 − g(x)) ∝ p_data(x | m ∈ SR) / p_background(x | m ∈ SR).   (3)

    It is important that the same data are not used for training and testing. The easiest way to achieve this is to use different partitions of the data for these two tasks. One can make use of more data with a cross-validation procedure Collins:2018epr ; Collins:2019jip .

  3. One could combine the previous step with a standard data-driven background estimation technique like a sideband fit or the ABCD method. However, one can also directly use the weighted simulation to predict the number of events that should pass a threshold requirement on g:

    N_predicted(c) = Σ_{x ∈ sim} w(x, m) · I[g(x) > c],   (4)

    for some threshold value c, where I[·] is the indicator function. The advantage of Eq. 4 over other data-based methods is that x can be correlated with m; for sideband fits, threshold requirements on x must not sculpt local features in the m spectrum.
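The three steps above can be sketched end-to-end in a few lines. The snippet below uses scikit-learn's logistic regression in place of a deep network and Gaussian toys for the 'data' and 'simulation'; all distributions, shifts, and sample sizes are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-ins (illustrative only): one feature x, background-only "data"
# vs. a mis-modelled "simulation" with a shifted mean.
n = 50_000
x_data = rng.normal(0.0, 1.0, n)
x_sim = rng.normal(0.5, 1.0, n)

# Step 1: classify data vs. simulation; the likelihood-ratio weight is
# recovered as w = f / (1 - f), as in Eq. 1 (here without the m-dependence).
X = np.r_[x_data, x_sim].reshape(-1, 1)
y = np.r_[np.ones(n), np.zeros(n)]
f = LogisticRegression().fit(X, y)
p = f.predict_proba(x_sim.reshape(-1, 1))[:, 1]
w = p / (1.0 - p)

# The weighted simulation now resembles the data (mean moves from 0.5 to ~0).
assert abs(np.average(x_sim, weights=w)) < 0.1

# Step 2: a second classifier, data vs. reweighted simulation, trained with
# a weighted cross-entropy via sample_weight. With no signal injected it
# should find nothing: weighted AUC close to 0.5. (In a real analysis,
# training and evaluation must use disjoint events, as emphasized above.)
g = LogisticRegression().fit(X, y, sample_weight=np.r_[np.ones(n), w])
auc = roc_auc_score(y, g.predict_proba(X)[:, 1],
                    sample_weight=np.r_[np.ones(n), w])
```

With a signal injected into the toy 'data', the second classifier would instead concentrate on the signal-rich region of feature space, since the reweighting has removed the data/simulation difference in the background.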

3 Simulation

A large-radius dijet resonance search is used to illustrate the salad method. The simulations are from the LHC Olympics 2020 community challenge R&D dataset gregor_kasieczka_2019_2629073 . The background process is generic parton scattering (labeled QCD for quantum chromodynamics) and the signal is a hypothetical W′ boson that decays into an X boson and a Y boson. Each of the X and Y decays to quarks. The masses of the W′, X, and Y particles are 3.5, 0.5, and 0.1 TeV, respectively. The mass hierarchy between the W′ particle and its decay products means that the X and Y particles are Lorentz boosted in the lab frame and therefore their two-prong decay products are captured inside a single large-radius jet. Particle-level simulations are produced with Pythia 8 Sjostrand:2006za ; Sjostrand:2007gs or Herwig++ Bahr:2008pv without pileup or multiple parton interactions and a detector simulation is performed with Delphes 3.4.1 deFavereau:2013fsa ; Mertens:2015kba ; Selvaggi:2014mya . Particle flow objects are clustered into jets using the Fastjet Cacciari:2011ma ; Cacciari:2005hq implementation of the anti-kt algorithm Cacciari:2008gp with R = 1.0 as the jet radius. Events are selected by requiring at least one such jet with pT > 1.2 TeV. In the remaining studies, the Pythia dataset is treated as ‘data’, while the Herwig dataset is treated as ‘simulation’, to mimic the scenario in practice where the simulation is different than the data.

Figure 1 presents the invariant mass of the leading two jets. The jet selection is evident from the peak around 3 TeV. The signal peaks around the W′ mass and, aside from the kinematic feature from the jet selection, the background distribution is featureless. The spectra from Pythia and Herwig are nearly identical, which may be expected since the invariant mass is mostly determined by the hard-scatter matrix elements and not by final state effects.

Figure 1: The invariant mass of the leading two jets.

To demonstrate the salad approach, two features about each of the two leading jets are used for classification (in principle, salad can readily accommodate the full phase space as used by the original dctr method Andreassen:2019nnm based on particle flow networks Komiske:2018cqr ; this will be explored in future studies). The first feature is the jet mass and the second feature is the N-subjettiness ratio τ21 Thaler:2011gf ; Thaler:2010tr . This second feature is the most widely used observable for differentiating jets that have two hard prongs (as in the signal) from jets that have only one hard prong (as for most of the background). The two jets are ordered by their mass and the four features used for machine learning are presented in Fig. 2. As expected, the signal mass distributions show peaks at the X and Y masses and the τ21 distributions are small, indicating two-prong substructure. Pythia and Herwig differ mostly at low mass and across the entire τ21 distribution.

Figure 2: The four features used for machine learning: jet mass (top) and the N-subjettiness ratio τ21 (bottom) for the more massive jet (left) and the less massive jet (right).

The baseline performance for classifying signal versus the QCD background is presented in Fig. 3. As is the case for all neural networks presented in the following sections, three fully connected layers with 100 hidden nodes on each intermediate layer are implemented using Keras and TensorFlow tensorflow with the Adam adam optimization algorithm. Rectified linear units are the activation function for all intermediate layers and a sigmoid is used for the final output layer. Networks are trained with binary cross entropy for 50 epochs with early stopping (patience 10). The supervised classifier presented in Fig. 3 effectively differentiates signal from background, with a maximum significance improvement of about 10. It is expected that the performance of any model independent approach will be bounded from above by the performance of this classifier.
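As a concrete (hypothetical) starting point, the stated architecture can be approximated with scikit-learn's MLPClassifier standing in for the Keras/TensorFlow setup; the four-feature Gaussian dataset below is only an illustrative stand-in for the jet observables:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in for the four jet features: signal and background are
# Gaussian blobs with a modest separation (illustrative numbers only).
n = 5_000
bkg = rng.normal(0.0, 1.0, (n, 4))
sig = rng.normal(1.0, 1.0, (n, 4))
X = np.vstack([sig, bkg])
y = np.r_[np.ones(n), np.zeros(n)]

# Mirror the stated setup: three hidden layers of 100 ReLU nodes, logistic
# (sigmoid) output, binary cross-entropy (log-loss), Adam, 50 epochs, and
# early stopping with patience 10.
net = MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                    activation="relu",
                    solver="adam",
                    max_iter=50,          # "50 epochs"
                    early_stopping=True,
                    n_iter_no_change=10,  # "patience 10"
                    random_state=0)
net.fit(X, y)
accuracy = net.score(X, y)  # rough sanity check on the training sample
```

A Keras version would use three Dense(100, activation="relu") layers followed by Dense(1, activation="sigmoid") with the binary cross-entropy loss; the sketch above trades that for a dependency that is lighter to run.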

Figure 3: A supervised classifier trained to distinguish signal from Pythia QCD. The top plot is a histogram of the neural network output, the bottom left plot is a Receiver Operating Characteristic (ROC) curve, and the bottom right plot is a significance improvement (SIC) curve. ε_S is the signal efficiency or true positive rate and ε_B is the background efficiency or false positive rate.

4 Parameterized Reweighting with DCTR

The first step of the dctr reweighting procedure is to train a classifier to distinguish the ‘data’ (Pythia) from the ‘simulation’ (Herwig) in a sideband region. The output of such a classifier is shown in Fig. 4, where the signal region is defined by a window in the dijet invariant mass around the W′ mass. There are about 850k events in the sideband region and 150k events in the signal region. Unlike the classifier in Fig. 3, the separation in Fig. 4 is not as dramatic because Pythia and Herwig are much more similar to each other than the signal is to QCD. As expected, the network is a linear function of the likelihood ratio, so the ratio plot in Fig. 4 is linear. Interestingly, the signal is more Herwig-like than Pythia-like. The reweighting function is applied to the Herwig sample in Fig. 4 to show that the reweighted simulation (Sim.+dctr) looks nearly identical to the ‘data’. All of the events used for Fig. 4 are independent from the ones used for training the network. Figure 5 shows that this reweighting works for all of the input distributions to the neural network as well.

Figure 4: A histogram of the classifier output for a neural network trained to distinguish ‘data’ (Pythia) and ‘simulation’ (Herwig) in the sideband region.
Figure 5: The four features used for machine learning in the sideband region, before and after applying dctr: jet mass (top) and the N-subjettiness ratio τ21 (bottom) for the more massive jet (left) and the less massive jet (right).

The next step for salad is to interpolate the reweighting function. The neural network presented in Fig. 4 is trained conditional on the dijet invariant mass, so it can be evaluated in the SR for values of the invariant mass that were not available during the network training. Note that the signal region must be chosen large enough so that the signal contamination in the sideband does not bias the reweighting function. For this example, with a 25% signal fraction in the signal region, the contribution in the sideband is about 1% and has no impact on the dctr model. Figure 6 shows a classifier trained to distinguish ‘data’ and ‘simulation’ in the signal region before and after the application of the interpolated dctr model. There is excellent closure, also for each of the input features to the classifier, as shown in Fig. 7.
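The sideband-training-plus-interpolation logic can be checked on a toy with a known answer. Below, logistic regression stands in for the parameterized network, and the resonant feature m, the mis-modelling, and the signal-region window are all invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Toy analogue (all numbers illustrative): resonant feature m, one other
# feature x. In "data", x ~ N(m, 1); the "simulation" is mis-modelled with
# a constant shift, x ~ N(m + 0.5, 1).
n = 100_000
m_data, m_sim = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
x_data = rng.normal(m_data, 1.0)
x_sim = rng.normal(m_sim + 0.5, 1.0)

# Signal region (SR) in m; the sidebands are everything else.
in_sr_d = (m_data > 0.4) & (m_data < 0.6)
in_sr_s = (m_sim > 0.4) & (m_sim < 0.6)

# Train the parameterized classifier f(x, m) on sidebands only.
X_sb = np.c_[np.r_[x_data[~in_sr_d], x_sim[~in_sr_s]],
             np.r_[m_data[~in_sr_d], m_sim[~in_sr_s]]]
y_sb = np.r_[np.ones((~in_sr_d).sum()), np.zeros((~in_sr_s).sum())]
f = LogisticRegression().fit(X_sb, y_sb)

# Interpolate into the SR: evaluate w = f / (1 - f) on SR simulation.
p = f.predict_proba(np.c_[x_sim[in_sr_s], m_sim[in_sr_s]])[:, 1]
w = p / (1.0 - p)

# Closure check: the reweighted SR simulation should match the SR data.
mean_data = x_data[in_sr_d].mean()                 # ~0.5
mean_raw = x_sim[in_sr_s].mean()                   # ~1.0 before reweighting
mean_rw = np.average(x_sim[in_sr_s], weights=w)    # pulled back toward data
```

Because the mis-modelling here is smooth in m, the network trained on sidebands extrapolates correctly into the SR; the same smoothness assumption underlies the interpolation in the paper.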

Figure 6: A histogram of the classifier output for a neural network trained to distinguish ‘data’ (Pythia) and ‘simulation’ (Herwig) in the signal region.
Figure 7: The four features used for machine learning in the signal region, before and after applying dctr: jet mass (top) and the N-subjettiness ratio τ21 (bottom) for the more massive jet (left) and the less massive jet (right).

5 Sensitivity

After reweighting the signal region to match the data, the next step of the search is to train a classifier to distinguish the reweighted simulation from the data in the signal region. If the reweighting works exactly, then this new classifier will asymptotically learn the likelihood ratio p_data(x)/p_background(x), which yields the optimal classifier by the Neyman-Pearson lemma neyman1933ix . If the reweighting is suboptimal, then some of the classifier capacity will be diverted to learning the residual difference between the simulation and the background data. If the reweighted simulation is nothing like the data, then all of the capacity will go towards this task and the classifier will not be able to identify the signal. There is therefore a tradeoff between how different the (reweighted) simulation is from the data and how different the signal is from the background. If the signal is much more different from the background than the simulation is from the background data, it is possible that a sub-optimally reweighted simulation will still be able to identify the signal (see Sec. 6 for the consequences for background estimation).

Figure 8 shows the sensitivity of the salad tagger to signal as a function of the signal-to-background ratio (S/B) in the signal region. In all cases, the background is the QCD simulation using Pythia. (The full one million Pythia events are divided in two pieces: one half acts as the test set for all methods and the other half is used for further study. The latter half is further split in half to represent the data or the simulation for the lines marked ‘Pythia’ in Fig. 8. For a fair comparison, the Herwig statistics are comparable to 25% of the full Pythia dataset.) The Pythia lines correspond to the case where the simulation follows the same statistics as the data (‘data’ = Pythia). The area under the curve (AUC) should be as close to one as possible and a tagger that is operating uniformly at random will produce an AUC of 0.5. Anti-tagging (preferentially tagging events that are not signal-like) results in an AUC less than 0.5. The maximum significance improvement is calculated as the largest value of ε_S/√(ε_B + 0.01%), where the 0.01% offset regulates statistical fluctuations at low efficiency.
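The regulated maximum significance improvement can be computed compactly by scanning all thresholds; the Gaussian scores below are illustrative stand-ins for classifier outputs:

```python
import numpy as np

def max_significance_improvement(scores_sig, scores_bkg, eps=1e-4):
    """Largest eps_S / sqrt(eps_B + eps) over all score thresholds.

    The eps = 0.01% offset regulates statistical fluctuations at low
    background efficiency, as described in the text.
    """
    s_sig, s_bkg = np.sort(scores_sig), np.sort(scores_bkg)
    thresholds = np.unique(np.r_[s_sig, s_bkg])
    # Fraction of each sample at or above every threshold.
    eff_s = 1.0 - np.searchsorted(s_sig, thresholds, side="left") / len(s_sig)
    eff_b = 1.0 - np.searchsorted(s_bkg, thresholds, side="left") / len(s_bkg)
    return float(np.max(eff_s / np.sqrt(eff_b + eps)))

rng = np.random.default_rng(3)
# Two overlapping Gaussians: a weak but nonzero separation.
sic = max_significance_improvement(rng.normal(1.0, 1.0, 10_000),
                                   rng.normal(0.0, 1.0, 10_000))
```

The sort-plus-searchsorted pattern keeps the scan O(n log n) rather than rescanning the samples at every threshold.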

When S/B is large, the performance in Fig. 8 is similar to the fully supervised classifier presented in Sec. 3. As S/B → 0, the Pythia curves approach the random classifier, with an AUC of 0.5 and a maximum significance improvement of unity. The Herwig curve has an AUC less than 0.5 as S/B → 0 because the signal is more Herwig-like than Pythia-like (see Fig. 4), and thus a tagger that requires the features to be data-like (data = Pythia) will anti-tag the signal. Likewise, the efficiency of the tagger on the simulation is higher than 50% when placing a threshold on the NN that keeps 50% of the events in data. The maximum significance improvement quickly drops to unity for Herwig at small S/B, indicating that the network is spending more capacity on differentiating Pythia from Herwig than on finding signal.

For all four metrics, salad significantly improves on the performance of the Herwig-only approach. In particular, the salad tagger remains effective down to considerably smaller values of S/B than the Herwig-only tagger. For the significance improvement and the false positive rate at a fixed true positive rate, the salad tagger tracks the Pythia tagger almost exactly down to S/B below 1%. The AUC is about halfway between Pythia and Herwig at high S/B, which is indicative of poor performance at low efficiency.

Figure 8: Four metrics for the sensitivity of the salad classifier as a function of the signal-to-background ratio (S/B) in the signal region: the area under the curve (AUC) in the top left, the maximum significance improvement (top right), the false positive rate at a fixed 50% signal efficiency (bottom left), and the significance improvement at the same fixed 50% signal efficiency (bottom right). The evaluation of these metrics requires signal labels, even though the training of the classifiers themselves does not use signal labels. Error bars correspond to the standard deviation from training five different classifiers. Each classifier is itself the truncated mean over ten random initializations.

6 Background Estimation

The performance gains from Sec. 5 can be combined with a sideband background estimation strategy, as long as threshold requirements on the classifier do not sculpt bumps in the spectrum. However, there is also an opportunity to use salad to directly estimate the background from the interpolated simulation. Figure 9 illustrates the efficacy of the background estimation for a single classifier trained in the absence of signal. Without the dctr reweighting, the predicted background rate is too low by a factor of two or more below 10% data efficiency. With the interpolated reweighting function, the background prediction is accurate within a few percent down to about 1% data efficiency.
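A minimal numerical illustration of this direct background estimate, in the spirit of Eq. 4, is sketched below. Logistic regression stands in for the dctr network, the feature itself is used as the discriminant, and all numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Background-only toy: "data" vs. a mis-modelled "simulation" (shifted mean).
n = 100_000
x_data = rng.normal(0.0, 1.0, n)
x_sim = rng.normal(0.4, 1.0, n)

# Likelihood-ratio weights from a data-vs-simulation classifier (as in
# Eq. 1), fit on the first halves and applied to held-out second halves.
clf = LogisticRegression().fit(
    np.r_[x_data[: n // 2], x_sim[: n // 2]].reshape(-1, 1),
    np.r_[np.ones(n // 2), np.zeros(n // 2)])
x_sim_test = x_sim[n // 2:]
p = clf.predict_proba(x_sim_test.reshape(-1, 1))[:, 1]
w = p / (1.0 - p)

# Eq.-4-style count: predicted background passing a threshold, from the
# weighted simulation, compared with the actual data count.
c = 1.0
pred = w[x_sim_test > c].sum()
true = float((x_data[n // 2:] > c).sum())
ratio = pred / true                           # ~1 with weights
ratio_raw = (x_sim_test > c).sum() / true     # biased without weights
```

In this toy the unweighted simulation overpredicts the tail count substantially, while the weighted prediction closes to within a few percent, mirroring the behavior of Fig. 9.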

Figure 9: The predicted efficiency normalized to the true data efficiency in the signal region for various threshold requirements on the NN. The x-axis is the data efficiency resulting from the threshold. The error bars are due to statistical uncertainties.

In practice, the difficulty in using salad to directly estimate the background is the estimation of the residual bias. One may be able to use validation regions between the signal region and the sideband region, but they will never require as much interpolation as the signal region itself. One can rely on simulation variations and auxiliary measurements to estimate the systematic uncertainty of the direct salad background estimation, but estimating high-dimensional uncertainties is challenging Nachman:2019dol ; Nachman:2019yfl . With a low-dimensional reweighting or with a proper high-dimensional systematic uncertainty estimate, the parameterized reweighting used in salad should result in a lower uncertainty than directly estimating the uncertainty from simulation. In particular, any nuisance parameters that affect the sideband region and the signal region in the same way will cancel when reweighting and interpolating.

7 Conclusions

This paper has introduced Simulation Assisted Likelihood-free Anomaly Detection (salad), a new approach to search for resonant anomalies by using parameterized reweighting functions for classification and background estimation. The salad approach uses information from simulation in a way that is nearly background-model independent while remaining signal-model agnostic. The only requirement for the signal is that there is one feature where the signal is known to be localized. In the example presented in the paper, this feature was the invariant mass of two jets. The location of the resonance need not be known ahead of time and can be scanned using a series of signal and sideband regions. This scanning will result in a trials factor per non-overlapping signal region. An additional look elsewhere effect is incurred by scanning the threshold on the neural network. In practice, one could use a small number of widely separated thresholds to be broadly sensitive. As long as the data used for training and testing are independent, there is no additional trials factor for the feature space used for classification. Strategies for maximally using the data for training can be found in Ref. Collins:2018epr ; Collins:2019jip .

While the numerical salad results presented here did not fully achieve the performance of a fully supervised classifier trained directly with inside knowledge about the data, there is room for improvement. In particular, a detailed hyperparameter scan could improve the quality of the reweighting. Additionally, calibration techniques could be used to further increase the accuracy Cranmer:2015bka . Future work will investigate the potential of salad to analyze higher-dimensional feature spaces as well as classifier features that are strongly correlated with the resonant feature. It will also be interesting to compare salad with other recently proposed model independent methods. When the nominal background simulation is an excellent model of nature, salad should perform similarly to the methods presented in Ref. DAgnolo:2018cun ; DAgnolo:2019vbw and provide a strong sensitivity to new particles. In other regimes where the background simulation is biased, salad should continue to provide a physics-informed but still mostly background/signal model-independent approach to extend the search program for new particles at the LHC and beyond.

Code and data availability

The code is available online and the datasets are available on Zenodo as part of the LHC Olympics gregor_kasieczka_2019_2629073 .

BN would like to thank Jack Collins for countless discussions about anomaly detection, including ideas related to salad. Some of those discussions happened at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. DS is supported by DOE grant DOE-SC0010008. DS thanks LBNL, BCTP and BCCP for their generous support and hospitality during his sabbatical year. BN would like to thank NVIDIA for providing Volta GPUs for neural network training.