Anomaly Detection with Density Estimation

by Benjamin Nachman, et al.
Rutgers University
Berkeley Lab

We leverage recent breakthroughs in neural density estimation to propose a new unsupervised anomaly detection technique (ANODE). By estimating the probability density of the data in a signal region and in sidebands, and interpolating the latter into the signal region, a likelihood ratio of data vs. background can be constructed. This likelihood ratio is broadly sensitive to overdensities in the data that could be due to localized anomalies. In addition, a unique potential benefit of the ANODE method is that the background can be directly estimated using the learned densities. Finally, ANODE is robust against systematic differences between signal region and sidebands, giving it broader applicability than other methods. We demonstrate the power of this new approach using the LHC Olympics 2020 R&D Dataset. We show how ANODE can enhance the significance of a dijet bump hunt by up to a factor of 7 with a 10% accuracy on the background prediction. While the LHC is used as the recurring example, the methods developed here have a much broader applicability to anomaly detection in physics and beyond.




1 Introduction

Despite an impressive and extensive search program from ATLAS atlasexoticstwiki ; atlassusytwiki ; atlashdbspublictwiki , CMS cmsexoticstwiki ; cmssusytwiki ; cmsb2gtwiki , and LHCb lhcbtwiki for new particles and forces of nature, there is no convincing evidence for new phenomena at the Large Hadron Collider (LHC). However, there remain compelling theoretical (e.g. naturalness) and experimental (e.g. dark matter) reasons for fundamental structure to be observable with current LHC sensitivity. The vast majority of LHC searches are designed with specific signal models motivated by one of these reasons (e.g. gluino pair production from supersymmetry) in mind, and these searches are optimized with a heavy reliance on simulations, for both the signal and the Standard Model (SM) background. Given that it is impossible to cover every model with a specially optimized search (see e.g. Kim:2019rhy ; Craig:2016rqv for comprehensive lists of currently uncovered models), and given that there are vast regions of unexplored LHC phase space, it is critical to consider extending the search program to include more model-agnostic methods.

A variety of model-agnostic approaches have been proposed to search for physics beyond the Standard Model (BSM) at colliders. These approaches are designed to be broadly sensitive to anomalies in data without focusing on specific models. Yet, they have varying degrees of both signal-model and background-model independence, as there is often a tradeoff between the broadness of a search and how sensitive it is to particular classes of signal scenarios. Existing and proposed model-agnostic searches range from fully signal-model independent but fully background-model dependent sleuth ; Abbott:2000fb ; Abbott:2000gx ; Abbott:2001ke ; Aaron:2008aa ; Aktas:2004pz ; Aaltonen:2007dg ; Aaltonen:2007ab ; Aaltonen:2008vt ; CMS-PAS-EXO-14-016 ; CMS-PAS-EXO-10-021 ; Aaboud:2018ufy ; ATLAS-CONF-2014-006 ; ATLAS-CONF-2012-107 ; DAgnolo:2018cun ; DAgnolo:2019vbw (because they compare data to SM simulation) to varying degrees of partial signal-model and background-model independence Farina:2018fyg ; Heimel:2018mkt ; Roy:2019jae ; Cerri:2018anq ; Blance:2019ibf ; Hajer:2018kqm ; Collins:2018epr ; Collins:2019jip ; DeSimone:2018efk ; Mullin:2019mmh ; 1809.02977 ; Dillon:2019cqt ; Aguilar-Saavedra:2017rzt . A comprehensive overview of existing model-agnostic approaches and how they are classified in terms of signal and background model independence will be given in Section 2.

This paper introduces a new approach called ANOmaly detection with Density Estimation (ANODE) that is complementary to existing methods and aims to be largely background and signal model agnostic. Density estimation, especially in high dimensions, has traditionally been a difficult problem in unsupervised machine learning. The objective of density estimation is to learn the underlying probability density from which a set of independent and identically distributed examples were drawn. In the past few years, there have been a number of breakthroughs in density estimation using neural networks and the performance of high dimensional density estimation has greatly improved. The idea of ANODE is to make use of these recent breakthroughs in order to directly estimate the probability density of the data. Assuming the signal is localized somewhere, one can attempt to use sideband methods and interpolation to estimate the probability density of the background. Then, one can use this to construct a likelihood ratio generally sensitive to new physics.

As with any search for BSM, it is not enough to have a discriminant that is sensitive to signals; one must also have a valid method of background estimation, otherwise it will be impossible to claim a discovery of new physics. The method of background estimation can further introduce possible sources of signal and background model dependence, and it is important to avail oneself of data-driven background methods in any truly model-agnostic search. This paper will explore two methods of data-driven background estimation, one based on importance sampling, and the other based on directly integrating the background density estimate obtained in the ANODE procedure.

Other neural network approaches to density estimation have been studied in high energy physics. Such methods include Generative Adversarial Networks (GANs) Goodfellow:2014upx ; deOliveira:2017pjk ; Paganini:2017hrr ; Paganini:2017dwg ; Butter:2019eyo ; Martinez:2019jlu ; Bellagente:2019uyp ; Vallecorsa:2019ked ; SHiP:2019gcl ; Carrazza:2019cnt ; Butter:2019cae ; Lin:2019htn ; DiSipio:2019imz ; Hashemi:2019fkn ; Chekalina:2018hxi ; ATL-SOFT-PUB-2018-001 ; Zhou:2018ill ; Carminati:2018khv ; Vallecorsa:2018zco ; Datta:2018mwd ; Musella:2018rdi ; Erdmann:2018kuh ; Deja:2019vcv ; Derkach:2019qfk ; Erbin:2018csv ; Erdmann:2018jxd ; Urban:2018tqv , autoencoders Monk:2018zsb ; ATL-SOFT-PUB-2018-001 , physically-inspired networks Andreassen:2018apy ; Andreassen:2019txo , and flows pmlr-v37-rezende15 ; Albergo:2019eim . GANs are efficient for sampling from a density and are thus promising for accelerating slow simulations, but they do not provide an explicit representation of the density itself. For this reason, ANODE is built using normalizing flows pmlr-v37-rezende15 and in particular the recently proposed masked autoregressive flow (MAF) NIPS2017_6828 . These methods estimate densities by using a succession of neural networks to gradually map the original data to a transformed dataset that follows a simple distribution (e.g. normal or uniform).

The ANODE method is demonstrated using a simulated large-radius dijet search based on the LHC Olympics 2020 R&D dataset gregor_kasieczka_2019_2629073 . In particular, properties of hadronic jets are used as discriminating features to enhance a bump hunt in the invariant mass of pairs of jets. ANODE learns a parameterized density of the features using a sideband and this is combined with a density estimation of the same features in the signal region. The resulting likelihood ratio is able to enhance the significance of a traditional bump hunt by up to a factor of 7. There is currently no dedicated search for generic dijet signatures where each of the jets can also originate from a BSM resonance Kim:2019rhy ; Aguilar-Saavedra:2017zuc ; Aguilar-Saavedra:2019adu ; Agashe:2018leo ; Agashe:2017wss . Therefore, this particular application could be directly useful for extending the LHC physics search program. Many other applications to resonant new physics searches involving jets and other final states are also possible.

In order to benchmark the performance of ANODE, it is compared with the CWoLa hunting method Collins:2018epr ; Collins:2019jip . The CWoLa approach is also a neural network-based resonance search, but does not involve density estimation. Instead, CWoLa hunting uses neural networks to identify differences between signal regions and neighboring sideband regions. By turning the problem into a supervised learning task Metodiev:2017vrx , CWoLa is able to effectively find rare resonant signals. However, CWoLa hunting has certain requirements on the independence of the discriminating features and the resonant feature. ANODE does not have this requirement and the potential for exploiting correlated features is studied by introducing correlations.

This paper is organized as follows. Section 2 reviews the landscape of model independent searches at the LHC to provide context for the ANODE method. Section 3 introduces the details of the ANODE approach and provides a brief introduction to normalizing flows. The remainder of the paper illustrates ANODE through an example based on a dijet search using jet substructure. Details of the simulated samples are provided in Sec. 4 and the results for the signal sensitivity and background specificity are presented in Sec. 5.1 and 5.2, respectively. A study of correlations between the discriminating features and the resonant feature is in Sec. 5.3. The paper ends with conclusions and outlook in Sec. 6.

2 An Overview of Model (In)dependent Searches

A viable search for new physics generally must have two essential components: it must be sensitive to new phenomena and it must also be able to estimate the background under the null hypothesis (Standard Model only). The categorization of a search’s degree of model (in)dependence requires consideration of both of these components. Figure 1 illustrates how to characterize model independence for both BSM sensitivity and SM background specificity. We will now consider each in turn.

Figure 1: A graphical representation of searches for new particles in terms of the background and signal model dependence for achieving signal sensitivity (a) and background specificity (b). The Model Unspecific Search for New Physics (MUSiC) CMS-PAS-EXO-14-016 ; CMS-PAS-EXO-10-021 and General Search Aaboud:2018ufy ; ATLAS-CONF-2014-006 ; ATLAS-CONF-2012-107 strategies are from CMS and ATLAS, respectively. LDA stands for Latent Dirichlet Allocation 10.5555/944919.944937 ; Dillon:2019cqt , ANOmaly detection with Density Estimation (ANODE) is the method presented in this paper, and CWoLa stands for Classification Without Labels Metodiev:2017vrx ; Collins:2018epr ; Collins:2019jip . Direct density estimation is a form of side-banding where the multidimensional feature space density is learned conditional on the resonant feature (see Sec. 3.2).

2.1 BSM sensitivity

For BSM sensitivity, the various types of searches are categorized as follows:

In the upper-right corner of Fig. 1(a), we have also attempted to illustrate in finer detail the differences between some recent model-agnostic approaches. For example, the autoencoder is in the farthest corner since it assumes almost nothing about the signal or the background but can be run directly on the data, as long as the signal is sufficiently rare Farina:2018fyg ; Heimel:2018mkt . The tradeoff is that there is no optimality guarantee for the autoencoder – any signals that it does find will be found in a rather uncontrolled manner. Meanwhile, CWoLa hunting Collins:2018epr ; Collins:2019jip is somewhat more signal and background model-dependent than autoencoders, since this approach assumes that the signal is localized in a particular feature, and that there is an uncorrelated set of additional features on which one can train a classifier to distinguish signal region and sideband. In return, one obtains a guarantee of asymptotic optimality – the classifier approaches the likelihood ratio neyman1933ix in the limit of infinite statistics.

The ANODE method introduced in this paper complements the other recently proposed techniques and is asymptotically optimal. To do this, ANODE estimates the density of the background-only scenario using sidebands and compares that with the density estimated in a signal-sensitive region (details are in Sec. 3). Like the CWoLa hunting method, the new approach is broadly sensitive to resonant new physics and thus it is placed in the upper right part of Fig. 1(a). The reason that ANODE is placed further to the right of and above CWoLa hunting is that it is less sensitive to correlations, a feature that is discussed more below.

2.2 Background estimation

A variety of methods are commonly used for background estimation and are highlighted in Fig. 1(b). Generally, background estimation is less dependent on the signal model than achieving signal sensitivity and therefore the signal-model axis of Fig. 1(b) spans a more compressed range than that of Fig. 1(a).

  • In some cases, the simulation is used to directly estimate the background. This is often the case for well-understood backgrounds such as electroweak phenomena or very rare processes that are difficult to constrain with data.

  • Most searches use data in some way to constrain the background prediction. One common approach is the control region method, where a search is complemented by an auxiliary measurement to constrain the simulation. Knowledge of the signal is used to ensure that the auxiliary measurement is not biased by the presence of signal.

  • The two most common methods for background estimates that do not directly use simulation are the ABCD method and the sideband method (bump hunt). The ABCD method operates by identifying two independent features, each of which is sensitive to the presence of signal. Four regions, labeled A, B, C, and D, are constructed by (anti)requiring a threshold on the two features. The background rate in the most signal-sensitive region is estimated from the other three regions. Background simulations are required to verify independence of the two features.

  • Finally, the sideband fit only requires that the background be smooth in the region of a potential signal so that a parametric (or not Frate:2017mai ) function can be fit to sidebands and interpolated. However, this method only works for resonant new physics.
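The ABCD estimate described in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `abcd_estimate`, the labeling convention (A passes both thresholds, B and C pass exactly one, D passes neither), and the toy yields are all hypothetical and are not taken from the paper.

```python
def abcd_estimate(n_b: float, n_c: float, n_d: float) -> float:
    """Predict the background yield in region A from regions B, C and D.

    Assumes the two features are independent for background, so that
    N_A / N_B = N_C / N_D. The region labeling is a convention chosen
    for this sketch.
    """
    if n_d <= 0:
        raise ValueError("region D must contain events")
    return n_b * n_c / n_d


# Toy yields (hypothetical numbers, for illustration only).
predicted_a = abcd_estimate(n_b=200.0, n_c=50.0, n_d=1000.0)
print(predicted_a)
```

If the two features are correlated for background, the ratio relation no longer holds, which is why the text notes that simulation is needed to verify independence.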

While strategies from Fig. 1(a) can often be matched with any approach in Fig. 1(b), there is often one combination that is used in practice. Table 1 provides examples of various searches and the background estimation technique that typically is associated with that search. Searches with a complex background may use multiple background estimation procedures.

ANODE can be combined with any background estimation technique, but it can also be used directly since the background density is already estimated to construct a signal-sensitive classifier. Even though directly providing an accurate background estimation puts stringent requirements on the accuracy of the density estimation, it also reduces the need for a full decorrelation between classification features and the resonant feature. A variety of decorrelation techniques exist Louppe:2016ylz ; Dolen:2016kst ; Moult:2017okx ; Stevens:2013dya ; Shimmin:2017mfk ; Bradshaw:2019ipy ; ATL-PHYS-PUB-2018-014 ; DiscoFever ; Xia:2018kgd ; Englert:2018cfo ; Wunsch:2019qbo , but ultimately decorrelating removes information available for classification.

Search | Typical Background Strategy | Recent Examples
MUSiC & the General Search | Pure MC Prediction | Aaboud:2018ufy ; CMS-PAS-EXO-14-016
Pure electroweak processes | Pure MC Prediction | Aaboud:2017rel
SUSY with top quarks & bosons | Control Region Method | Aaboud:2017aeu ; CMS-PAS-SUS-19-009
All-hadronic searches | ABCD Method | Aaboud:2017hdf ; Sirunyan:2018rlj
Long-lived particle searches | ABCD Method | Aaboud:2018aqj ; Sirunyan:2018vlw
BSM resonance searches | Sideband Method | Aad:2019hjw ; Sirunyan:2019vgj
CWoLa hunting | Sideband Method | Collins:2018epr ; Collins:2019jip
ANODE | Sideband or Direct Density | This paper

Table 1: A table with the common pairings of search strategy for signal sensitivity (left column), the background estimation method (middle column), and an example search (right column).

3 The ANODE Method

This section will describe the ANODE proposal for an unsupervised method to search for resonant new physics using density estimation.

Let $m$ be a feature in which a signal (if it exists) is known to be localized around some $m_0$. The value of $m_0$ will be scanned for broad sensitivity and the following procedure will be repeated for each window in $m$. It is often the case that the width of the signal in $m$ is fixed by detector properties and is signal model independent. A region $m_0 \pm \delta$ is called the signal region (SR) and $m \notin [m_0 - \delta, m_0 + \delta]$ is defined as the sideband region (SB). A traditional, unsupervised, model-agnostic search is to perform a bump hunt in $m$, using the SB to interpolate into the SR in order to estimate the background.
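As a minimal sketch of the window definition above, events can be partitioned into signal region and sideband from the value of the resonant feature alone. The function name `split_regions`, the window center, the half-width, and the toy mass values are assumptions made for this illustration.

```python
def split_regions(masses, m0, delta):
    """Return (sr_indices, sb_indices) for a window of half-width delta around m0."""
    sr, sb = [], []
    for i, m in enumerate(masses):
        # Events inside the window go to the SR, everything else to the SB.
        (sr if abs(m - m0) <= delta else sb).append(i)
    return sr, sb


# Toy resonant-feature values in TeV (hypothetical).
masses = [3.05, 3.35, 3.50, 3.69, 3.90]
sr_idx, sb_idx = split_regions(masses, m0=3.5, delta=0.2)
print(sr_idx, sb_idx)
```

Scanning for broad sensitivity would then amount to repeating this split for a sequence of window centers.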

Let $x \in \mathbb{R}^d$ be some additional discriminating features in which the signal density is different than the background density. If we could find the region(s) where the signal differs from the background and then cut on $x$ to select these regions, we could improve the sensitivity of the original bump hunt in $m$. The goal of ANODE is to accomplish this in an unsupervised and model-agnostic way, via density estimation in the feature space $x$.

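More specifically, ANODE attempts to learn two densities: $p_\text{data}(x|m)$ and $p_\text{background}(x|m)$ for $m \in \text{SR}$. Then, classification is performed with the likelihood ratio

$$R(x|m) = \frac{p_\text{data}(x|m)}{p_\text{background}(x|m)}. \tag{1}$$

In the ideal case that $p_\text{data}(x|m) = \alpha\, p_\text{background}(x|m) + (1-\alpha)\, p_\text{signal}(x|m)$ for $0 \le \alpha \le 1$ and $m \in \text{SR}$, Eq. 1 is the optimal test statistic for identifying the presence of signal. In the absence of signal, $R(x|m) = 1$, so as long as $p_\text{signal}(x|m) \neq p_\text{background}(x|m)$, this leads to a zero-background search.

In practice, both $p_\text{data}(x|m)$ and $p_\text{background}(x|m)$ are approximations and so $R(x|m)$ is not unity in the absence of signal. The densities are estimated using conditional neural density estimation as described in Sec. 3.1. The function $p_\text{data}(x|m)$ is estimated in the signal region and the function $p_\text{background}(x|m)$ is estimated using the sideband region and then interpolated into the signal region. The interpolation is done automatically by the neural conditional density estimator. Effective density estimation will result in $R(x|m)$ in the SR that is localized near unity and then one can enhance the presence of signal by applying a threshold $R(x|m) > R_c$, for $R_c > 1$. The interpolated $p_\text{background}(x|m)$ can then also be used to estimate the background, as described in Sec. 3.2.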
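The classification step can be sketched as follows, assuming two already-trained density estimators that return log-densities for each event in the SR. The function names, the threshold value, and the toy log-densities are hypothetical; working with log-densities is a numerical choice made here, not something prescribed by the text.

```python
import math


def likelihood_ratio(log_p_data, log_p_bg):
    """Ratio p_data / p_background per event, computed from log-densities."""
    return [math.exp(ld - lb) for ld, lb in zip(log_p_data, log_p_bg)]


def select_anomalous(ratios, r_cut):
    """Indices of events passing R > r_cut (r_cut > 1 targets over-densities)."""
    return [i for i, r in enumerate(ratios) if r > r_cut]


# Toy log-densities for three SR events (hypothetical values).
log_p_data = [-2.0, -1.0, -3.0]
log_p_bg = [-2.0, -2.5, -2.9]
ratios = likelihood_ratio(log_p_data, log_p_bg)
selected = select_anomalous(ratios, r_cut=2.0)
print(ratios, selected)
```

In this toy input only the second event has a data density well above the interpolated background density, so it is the only one that survives the threshold.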

3.1 Neural Density Estimation

The ANODE procedure as described in the previous subsection is completely general with regards to the method of density estimation. In this work we will demonstrate a proof-of-concept using normalizing flow models for density estimation. Since normalizing flows were proposed in Ref. pmlr-v37-rezende15 , they have generated much activity and excitement in the machine learning community, achieving state-of-the-art performance on a variety of benchmark density estimation tasks.

The core idea behind a normalizing flow is to apply a change of variables from a random variable with a simple density (e.g. Gaussian or uniform) to one with a complex density that matches some training dataset. The transformation from one density $p_X$ describing random variable $X$ to another density $p_Y$ describing random variable $Y$ follows the usual change of variables formula using the Jacobian:

$$p_Y(y) = p_X(x) \left| \det \frac{\partial f}{\partial x} \right|^{-1}, \tag{2}$$

where $x$ and $y$ are realizations of $X$ and $Y$, respectively, and have the same dimension, and $y = f(x)$ is an invertible function. The process in Eq. 2 can be repeated to build a normalizing flow:

$$p_Y(y) = p_X(x) \prod_{k=1}^{K} \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|^{-1}, \tag{3}$$

where $z_0 = x$, $z_k = f_k(z_{k-1})$, and $z_K = y$. The first neural density estimation with normalizing flows had the following form for $f_k$:

$$f_k(z) = z + u_k\, h(w_k^\top z + b_k), \tag{4}$$

where $h$ is an element-wise non-linearity and $u_k$, $w_k$, $b_k$ are trainable parameters. The benefit of Eq. 4 is that the Jacobian evaluation is simple from the chain rule. Since the first development of normalizing flows, there has been significant development in extending their expressivity. One innovation is to combine flows with autoregressive density estimation NIPS2016_6581 . An autoregressive flow JMLR:v17:16-272 modifies the change of variables so that for $y = f(x)$, $y_i = f_i(x_i; y_1, \dots, y_{i-1})$, where the indices denote the dimension of $y$ and $x$ for $i = 1, \dots, d$. Any $f$ that satisfies this condition is amenable to neural density estimation because the Jacobian determinant evaluation is simple. In particular, the Jacobian is triangular and therefore the determinant is the product of the diagonal elements: $\det(\partial f / \partial x) = \prod_i \partial y_i / \partial x_i$. ANODE is built on a masked autoregressive flow (MAF) NIPS2017_6828 . For a MAF,

$$y_i = \mu_i(y_1, \dots, y_{i-1}) + \sigma_i(y_1, \dots, y_{i-1})\, x_i, \tag{5}$$

where $\mu_i$ and $\sigma_i$ are arbitrary functions and $y_1 = \mu_1 + \sigma_1 x_1$ for arbitrary numbers $\mu_1$, $\sigma_1$. As in Eq. 3, this procedure is repeated multiple times to build a deep autoregressive flow. The masking in MAF comes from its use of MADE pmlr-v37-germain15 to evaluate $\mu_i$ and $\sigma_i$ for all $i$ in one forward pass. This approach eliminates the need for the recursion in Eq. 5. MAF is nearly the same as inverse autoregressive flows (IAF) NIPS2016_6581 , which also use Gaussian autoregressions and are built on MADE. The main difference is that MAF is very efficient for density estimation and slow for sampling while IAF is slow for density estimation and fast for sampling. As ANODE only needs to estimate the density without producing new samples, MAF is selected as the method of choice.

The estimation of $p(x|m)$ for ANODE requires that the MAF provides a conditional density. This can be accomplished by adding $m$ as an input to all functions $\mu_i$ and $\sigma_i$.
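To make the conditional autoregressive construction concrete, the following is a toy sketch of a single affine autoregressive layer in two dimensions with a standard normal base density. The shift and log-scale functions `mu` and `log_sigma` are hand-written stand-ins (hypothetical) for what a MADE network would provide, and the conditioning variable `m` is simply passed to both. It is not the implementation used in the paper.

```python
import math


def mu(i, y_prev, m):
    """Hypothetical shift function; depends only on earlier dimensions and m."""
    return 0.5 * m if i == 0 else 0.8 * y_prev[0] + 0.1 * m


def log_sigma(i, y_prev, m):
    """Hypothetical log-scale function; depends only on earlier dimensions."""
    return 0.0 if i == 0 else 0.2 * math.tanh(y_prev[0])


def to_base(y, m):
    """Map data y to base variables; no recursion is needed since y is known."""
    base, log_det = [], 0.0
    for i in range(len(y)):
        s = log_sigma(i, y[:i], m)
        base.append((y[i] - mu(i, y[:i], m)) * math.exp(-s))
        log_det -= s  # diagonal Jacobian entry of this direction is exp(-s)
    return base, log_det


def log_density(y, m):
    """Conditional log-density of y given m for a standard normal base."""
    base, log_det = to_base(y, m)
    log_base = sum(-0.5 * b * b - 0.5 * math.log(2 * math.pi) for b in base)
    return log_base + log_det


def from_base(base, m):
    """Sampling direction: recursive, since each output needs the earlier ones."""
    y = []
    for i in range(len(base)):
        y.append(mu(i, y, m) + math.exp(log_sigma(i, y, m)) * base[i])
    return y


m_value = 3.5
y = from_base([0.3, -1.2], m_value)
base_back, _ = to_base(y, m_value)
print(y, base_back, log_density(y, m_value))
```

The density direction touches each dimension independently once the data point is given, whereas the sampling direction must proceed one dimension at a time, which mirrors the efficiency trade-off between MAF and IAF noted above.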

3.2 Estimating the Background

An anomaly detection technique is only useful for finding new particles if the Standard Model background can be estimated. As mentioned earlier, one benefit of the direct density estimation in ANODE is that the background can be directly estimated with $p_\text{background}(x|m)$. This results in multiple possibilities for background estimation that are considered in this work:

  • Direct density estimation. These methods use the interpolated $p_\text{background}(x|m)$ to directly compute the efficiency of the background after a threshold requirement on $R(x|m)$.

    • Density sampling. One could directly sample events from $p_\text{background}(x|m)$ using the stacked change of variables specified by Eq. 5. As mentioned in Sec. 3.1, this is less efficient for MAF compared with IAF. This sampling is not pursued in this paper.

    • Density integration. Another approach is to directly integrate $p_\text{background}(x|m)$ for events with $R(x|m) > R_c$:

      $$\epsilon_B(R_c) = \int dx\, dm\; p_\text{background}(x|m)\, p(m)\, \Theta\big(R(x|m) - R_c\big), \tag{6}$$

      where $p(m)$ is the density of $m$ in the SR and $\Theta$ is the step function.

    • Importance sampling. Analytically integrating a function in high dimensions is impractical, so one can estimate the integral with importance sampling. An effective method to implement this sampling is to make the following observation:

      $$\begin{aligned} \epsilon_B(R_c) &= \int dx\, dm\; p_\text{background}(x|m)\, p(m)\, \Theta\big(R(x|m) - R_c\big) \\ &= \int dx\, dm\; p_\text{data}(x|m)\, p(m)\, \frac{p_\text{background}(x|m)}{p_\text{data}(x|m)}\, \Theta\big(R(x|m) - R_c\big) \\ &= \int dx\, dm\; p_\text{data}(x|m)\, p(m)\, \frac{\Theta\big(R(x|m) - R_c\big)}{R(x|m)}. \end{aligned} \tag{7}$$

      The last line in Eq. 7 can be estimated by computing the fraction of events in the SR (representing the full distribution) with $R(x|m) > R_c$ and then weighting each event in the counting by $1/R(x|m)$.

  • Sideband in $m$. As long as the requirement $R(x|m) > R_c$ does not sculpt a localized feature in $m$, one can estimate the background prediction by performing a fit in the $m$ spectrum from the SB and interpolating to the SR. This is a standard approach, as discussed in Sec. 1.

Further details about background estimation are presented in Sec. 5.2 for the numerical example described in the next section.
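The importance-sampling estimate described above reduces to a weighted count over SR events, with each passing event weighted by the inverse of its likelihood ratio. The sketch below assumes the per-event ratios have already been evaluated; the function name and the toy ratio values are hypothetical.

```python
def background_efficiency(ratios, r_cut):
    """Estimate the background efficiency of a threshold on the ratio.

    Each SR event passing the threshold is counted with weight 1/R, which
    converts an average over the data density into an average over the
    background density. The result is normalized to all SR events.
    """
    if not ratios:
        raise ValueError("need at least one event")
    weighted_pass = sum(1.0 / r for r in ratios if r > r_cut)
    return weighted_pass / len(ratios)


# Toy likelihood ratios for four SR events (hypothetical values).
toy_ratios = [0.5, 1.0, 2.0, 4.0]
eff = background_efficiency(toy_ratios, r_cut=1.5)
print(eff)
```

Multiplying this efficiency by the number of background events expected in the SR before the threshold would then give a predicted background yield, subject to the accuracy of the two density estimates.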

3.3 Comparison with the CWoLa hunting method

The CWoLa hunting method Collins:2018epr ; Collins:2019jip is a recently-proposed model-agnostic sideband method that also uses machine learning and will serve as a benchmark for ANODE. In the CWoLa hunting approach, the signal sensitivity is achieved by training a classifier to distinguish the SR from the SB. This classifier will approach the likelihood ratio $R_\text{CWoLa}(x)$, which is optimal under certain conditions:

$$R_\text{CWoLa}(x) = \frac{p_\text{data}(x|m \in \text{SR})}{p_\text{data}(x|m \in \text{SB})} = \frac{p_\text{data}(x|m \in \text{SR})}{p_\text{background}(x|m \in \text{SB})} = \frac{p_\text{data}(x|m \in \text{SR})}{p_\text{background}(x|m \in \text{SR})}, \tag{8}$$

where the second equality is true in the absence of signal in the sideband (this is not strictly necessary: the classifier can still be optimal even if there is some signal in the sideband Metodiev:2017vrx ) and the third equality is true when $x$ and $m$ are independent. The background is estimated using a sideband fit after placing a selection based on the above classifier.

A key assumption of the CWoLa method is that $x$ and $m$ are independent. This condition is stronger than the requirement for the background fit, but is necessary for achieving signal sensitivity. In particular, in the presence of a dependence between $x$ and $m$, the CWoLa classifier will learn the true differences between SB and SR. If these differences are larger than the difference between signal and background in the SR, the CWoLa classifier may not succeed in finding the signal.

In contrast, the ANODE method does not require any particular relationship between $x$ and $m$ to achieve signal sensitivity. In fact, the information about $m$ could be fully contained within $x$, and ANODE could still succeed in principle. Therefore, ANODE can make use of features which are strongly correlated with $m$, thus extending the potential sensitivity to new signals. This is possible because of the two-step density estimation, interpolating from the sideband and then estimating from the SR. Such an approach is not possible with CWoLa hunting, which directly learns the likelihood ratio. The only requirement for ANODE is that there are no non-trivial features in the SR that cannot be smoothly predicted from the SB. Section 5.3 illustrates the ability of ANODE to cope with correlated features.

4 Details of the Sample

A simulated resonance search using large-radius dijets is used to illustrate ANODE. The simulated datasets are from the LHC Olympics 2020 challenge research and development dataset gregor_kasieczka_2019_2629073 . For a background process, one million quantum chromodynamic (QCD) dijet events are simulated with Pythia 8 Sjostrand:2006za ; Sjostrand:2007gs without pileup or multiple parton interactions. The signal is a hypothetical $W'$ boson ($m_{W'} = 3.5$ TeV) that decays into an $X$ boson ($m_X = 500$ GeV) and a $Y$ boson ($m_Y = 100$ GeV), with the same simulation setup as the QCD dijets. The $X$ and $Y$ bosons decay promptly into quarks and due to their large Lorentz boost in the lab frame, the resulting hadronic decay products are captured by a single large-radius jet. The detector simulation is performed with Delphes 3.4.1 deFavereau:2013fsa ; Mertens:2015kba ; Selvaggi:2014mya and particle flow objects are clustered into jets using the Fastjet Cacciari:2011ma ; Cacciari:2005hq implementation of the anti-$k_t$ algorithm Cacciari:2008gp using $R = 1.0$ as the jet radius. Events are selected by requiring at least one such jet with $p_T > 1.2$ TeV. While there exist LHC searches for the case that $X$ and $Y$ are electroweak bosons Aad:2019fbh ; Sirunyan:2019jbg , the generic case is currently uncovered by a dedicated search.

The resonant feature $m$ will be the invariant mass of the leading two jets, $m_{JJ}$. These two jets are ordered by their mass so that by construction, $m_{J_1} \le m_{J_2}$. The discriminating features $x$ are four-dimensional, consisting of the observables:

$$x = \left( m_{J_1},\; m_{J_2} - m_{J_1},\; \tau_{21}^{J_1},\; \tau_{21}^{J_2} \right), \tag{9}$$

where $\tau_{21}$ is the n-subjettiness ratio Thaler:2011gf ; Thaler:2010tr . This observable is the most widely used single feature for identifying jets with a two-prong substructure. While the ultimate goal of ANODE is to perform density estimation on high-dimensional, low-level features, there is already utility in a search with high-level features from Eq. 9. Thus to demonstrate how ANODE works, this will be the focus for the rest of this paper.

Simulated data are constructed by injecting 1000 signal events into the full background sample. A histogram of $m_{JJ}$ is presented in Fig. 2. As expected, the signal peaks near $m_{W'}$. The signal region is defined by $m_{JJ} \in [3.3, 3.7]$ TeV and then the sideband is the rest of the $m_{JJ}$ spectrum. The simulated data are divided into two equal samples for training and testing; thus we have 500,000 background and 500 signal events in each sample. Only a subset of these background and signal events falls in the SR in each sample, which fixes the signal-to-background ratio $S/B$ and the naive significance $S/\sqrt{B}$ in the SR. This value of $S/\sqrt{B}$ would be the approximate significance from a sideband fit (ignoring the fit errors). Section 5.1 will show how much this can be enhanced from ANODE.

The additional four features for classification are shown in Fig. 3. The lighter jet mass peaks near $m_Y = 100$ GeV and the difference between masses peaks at about $m_X - m_Y = 400$ GeV. The $\tau_{21}$ observables are lower for the two-prong signal jets than for the mostly one-prong background jets. Jet mass and $\tau_{21}$ are negatively correlated for QCD jets Dolen:2016kst and so $\tau_{21}$ is higher for $J_1$ than for $J_2$.

The conditional MAF (along with most methods of density estimation) has difficulty at sharp, discontinuous edges and boundaries, so we first transform the dataset before performing density estimation. First, all features are linearly scaled to be $x \in [0, 1]$. Then, the logit transformation $\log(x / (1 - x))$ is applied to map the scaled features to be between $(-\infty, \infty)$. The Jacobian for this map is accounted for when computing probability densities for the original feature space. Even with this transformation, density estimation is difficult near the boundaries. Therefore, the scaled features are required to lie away from the boundaries at 0 and 1. This keeps 95% (72%) of the signal (background) in the SR. Below we will refer to this as the “fiducial region.” All results below are computed with respect to the number of events after this truncation.
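The preprocessing described above, including the Jacobian correction, can be sketched for a single feature as follows. The function names and the feature range used in the example are assumptions made for illustration; the sketch only shows how a density learned in the transformed space would be converted back to the original feature space.

```python
import math


def to_logit_space(value, lo, hi):
    """Scale value from [lo, hi] to (0, 1) and apply the logit.

    Returns the transformed value and the log of the absolute derivative
    of the full map (scaling followed by logit) with respect to value.
    """
    u = (value - lo) / (hi - lo)
    if not 0.0 < u < 1.0:
        raise ValueError("value must lie strictly inside (lo, hi)")
    z = math.log(u / (1.0 - u))
    # d(logit)/du = 1 / (u (1 - u)) and du/dvalue = 1 / (hi - lo).
    log_jac = -math.log(u * (1.0 - u)) - math.log(hi - lo)
    return z, log_jac


def log_density_original(log_p_transformed, log_jac):
    """Density in the original space: transformed density times |dz/dvalue|."""
    return log_p_transformed + log_jac


# Hypothetical feature with range [0, 1], evaluated at its midpoint.
z_mid, log_jac_mid = to_logit_space(0.5, lo=0.0, hi=1.0)
print(z_mid, log_jac_mid)
```

The Jacobian term grows without bound as the scaled feature approaches 0 or 1, which is one way to see why the density becomes hard to estimate near the boundaries.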

Figure 2: Histograms for the invariant mass of the leading two jets for the Standard Model background as well as the injected signal. There are 1 million background events and 1000 signal events.
Figure 3: The four features used for classification: $m_{J_1}$ (top left), $m_{J_2} - m_{J_1}$ (top right), $\tau_{21}^{J_1}$ (bottom left), and $\tau_{21}^{J_2}$ (bottom right). These histograms are inclusive in $m_{JJ}$. There are 1 million background events and 1000 signal events for the mass histograms.

5 Results

5.1 Sensitivity

The conditional MAF (based on pre-existing code) is optimized using the log likelihood loss function, $\log p(x|m)$. All of the neural networks are trained with PyTorch. For the hyperparameters, there are 15 MADE blocks (one layer each) with 128 hidden units per block. Networks are optimized with Adam adam , configured with a learning rate and a weight decay. The SR and SB density estimators are each trained for 50 epochs. No systematic attempt was made to optimize these hyperparameters and it is likely that better performance could be obtained with further optimization. For the SR density estimator, the last epoch is chosen for simplicity and it was verified that the results are robust against this choice. The SB density estimator varies significantly from epoch to epoch. Averaging the density estimates point-wise over 10 consecutive epochs gives a stable result. Averaging over more epochs does not further improve the stability. All results with ANODE present the SB density estimator with this averaging scheme for the last 10 epochs.
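The point-wise epoch averaging of the SB density estimator can be sketched as below, assuming the log-density of every event has been stored for each of the epochs to be averaged. The function name and data layout are hypothetical, and the choice to average the densities themselves (rather than their logarithms) is an interpretation of "averaging the density estimates" made for this sketch.

```python
import math


def average_density(log_p_per_epoch):
    """Point-wise average of densities over epochs, returned as a log.

    log_p_per_epoch[e][i] is the log-density of event i from the model
    saved at epoch e. A log-sum-exp shift keeps the average stable.
    """
    n_epochs = len(log_p_per_epoch)
    n_events = len(log_p_per_epoch[0])
    averaged = []
    for i in range(n_events):
        logs = [log_p_per_epoch[e][i] for e in range(n_epochs)]
        top = max(logs)
        mean = sum(math.exp(v - top) for v in logs) / n_epochs
        averaged.append(top + math.log(mean))
    return averaged


# Two epochs, two events (toy values): densities 1 and 3, then 2 and 2.
toy_logs = [[math.log(1.0), math.log(2.0)], [math.log(3.0), math.log(2.0)]]
avg = average_density(toy_logs)
print(avg)
```

For the scheme described in the text, `log_p_per_epoch` would hold the last 10 epochs of the SB estimator evaluated on the SR events.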

Figure 4: Scatter plot of $R(x|m)$ versus $\log p_\text{background}(x|m)$ across the test set in the SR. Background events are shown (as a two-dimensional histogram) in grayscale and individual signal events are shown in red.

Figure 4 shows a scatter plot of $R(x|m)$ versus $\log p_\text{background}(x|m)$ for the test set in the SR. As desired, the background is mostly concentrated around $R(x|m) = 1$, while there is a long tail for signal events at higher values of $R(x|m)$ and within a limited range of $\log p_\text{background}(x|m)$. This is exactly what is expected for this signal: it is an over-density ($R > 1$) in a region of phase space that is relatively rare for the background ($p_\text{background}(x|m) \ll 1$).

The background density in Fig. 4 also shows that $R(x|m)$ is narrower around $1$ when $p_\text{background}(x|m)$ is large and more spread out when $p_\text{background}(x|m) \ll 1$. This is evidence that the density estimation is more accurate when the densities are high and worse when the densities are low. This is also to be expected: if there are many data points close to one another, it should be easier to estimate their density than if the data points are very sparse.

Another view of the results is presented in Fig. 5, with one-dimensional information about the likelihood ratio in the SR. The left plot of Fig. 5 shows that the background is centered and approximately symmetric around 1, with a standard deviation of approximately 17%. This width is due to various sources, including the accuracy of the SR density, the accuracy of the SB density, and the quality of the interpolation from SB to SR. Each of these sources has contributions from the finite size of the datasets used for training, the neural network flexibility, and the training procedure. The right plot of Fig. 5 presents the number of background and signal events as a function of a threshold on the likelihood ratio. The starting point is the original number of background (40,000) and signal (400) events in the SR window and the fiducial window. Starting from a low signal-to-background ratio and a low significance, one can achieve a much higher signal-to-background ratio and a high significance with a threshold requirement on the likelihood ratio. Figure 6 shows that the signal is clearly visible in the distribution after applying such a threshold requirement.

Figure 5: Left: histogram of the likelihood ratio evaluated on the test set. Right: the integrated number of events that survive a threshold on the likelihood ratio. The two distributions are scaled to represent the rates for 500,000 total background events and 500 total signal events, as introduced in Sec. 4.
Figure 6: Distributions of […] (left) and […] (right) in the signal region after applying a threshold requirement on the likelihood ratio.
Figure 7: Receiver Operating Characteristic (ROC) curve (left) and Significance Improvement Characteristic (SIC) curve (right).

The performance of the likelihood ratio as an anomaly detector is further quantified by the Receiver Operating Characteristic (ROC) and Significance Improvement Characteristic (SIC) curves in Fig. 7. These metrics are obtained by scanning the threshold and computing the signal efficiency (true positive rate) and background efficiency (false positive rate) after a threshold requirement on the likelihood ratio. The Area Under the Curve (AUC) for ANODE is 0.82. For comparison, the CWoLa hunting approach is also shown in the same plots. The CWoLa classifier is trained using sideband regions that are 200 GeV wide on either side of the SR. The sidebands are weighted to have the same number of events as each other and, in total, the same number as the SR. A single NN with four hidden layers of 64 nodes each is trained using Keras and TensorFlow [tensorflow]. Dropout [JMLR:v15:srivastava14a] of 10% is used for each intermediate layer. Intermediate layers use rectified linear unit activation functions and the last layer uses a sigmoid. The classifier is optimized using binary cross entropy and is trained for 300 epochs. As with ANODE, 10 epochs are averaged for the reported results (footnote 3: a different regularization procedure, based on the validation loss and k-folding, was used in Refs. [Collins:2018epr; Collins:2019jip]; the averaging here is expected to serve a similar purpose).
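The sideband weighting used for the CWoLa classifier can be illustrated with a short calculation. This is a sketch under stated assumptions: the function name `sideband_weights` and the event counts are hypothetical, and a single per-event weight is assumed for each sideband.

```python
def sideband_weights(n_sr, n_lower_sb, n_upper_sb):
    """Per-event weights such that both sidebands carry equal total weight
    and together match the number of events in the SR."""
    target_per_sideband = n_sr / 2.0
    w_lower = target_per_sideband / n_lower_sb
    w_upper = target_per_sideband / n_upper_sb
    return w_lower, w_upper

# Hypothetical event counts, chosen only for illustration.
N_SR, N_LOWER, N_UPPER = 40_000, 30_000, 20_000
w_lower, w_upper = sideband_weights(N_SR, N_LOWER, N_UPPER)

weighted_lower = w_lower * N_LOWER  # total weight of the lower sideband
weighted_upper = w_upper * N_UPPER  # total weight of the upper sideband
```

With these weights, each sideband contributes half of the SR yield to the loss, so neither the SR-versus-SB class balance nor the lower-versus-upper balance biases the classifier.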

The performance of ANODE in Fig. 7 is comparable to that of CWoLa hunting, which does slightly better at higher signal efficiencies and much better at lower signal efficiencies. This may be a reflection of the fact that CWoLa makes use of supervised learning and directly approaches the likelihood ratio, while ANODE is unsupervised and attempts to learn both the numerator and denominator of the likelihood ratio. With this dataset, ANODE is able to enhance the signal significance by about a factor of 7 and would therefore be able to achieve a local significance above […], given that the starting value of the significance is 1.6.
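The threshold scan behind the ROC and SIC curves can be sketched in a few lines. This is a minimal illustration rather than the analysis code: the function name `scan_thresholds` and the toy scores are hypothetical, and the significance improvement is taken in its usual form, signal efficiency divided by the square root of the background efficiency.

```python
import math

def scan_thresholds(signal_scores, background_scores, thresholds):
    """Return (threshold, signal eff., background eff., SIC) per threshold."""
    results = []
    for t in thresholds:
        eff_s = sum(score > t for score in signal_scores) / len(signal_scores)
        eff_b = sum(score > t for score in background_scores) / len(background_scores)
        # SIC = eff_S / sqrt(eff_B); undefined once no background survives.
        sic = eff_s / math.sqrt(eff_b) if eff_b > 0 else float("nan")
        results.append((t, eff_s, eff_b, sic))
    return results

# Hypothetical toy scores standing in for the likelihood ratio.
toy_signal = [0.5, 1.5, 2.5, 3.5]
toy_background = [1.0] * 99 + [3.0]
scan = scan_thresholds(toy_signal, toy_background, thresholds=[0.0, 2.0])

# Applying the quoted improvement factor of about 7 to the quoted
# starting significance of 1.6.
significance_after = 1.6 * 7
```

The ROC curve plots the signal efficiency against the background efficiency over the scan, while the SIC curve plots the last entry of each tuple against the signal efficiency.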

5.2 Background Estimation

This section explores the possibility of using the estimate of the background density to directly determine the background efficiency in the SR after a requirement on the likelihood ratio. Figure 8 presents a comparison between the two integration methods (direct integration and importance sampling) described in Sec. 3.2 and the true background yields. Qualitatively, both methods are able to characterize the yield across several orders of magnitude in background efficiency. However, both methods diverge from the truth in the extreme tails of the distribution. The right plot of Fig. 8 offers a quantitative comparison between the methods. For efficiencies down to about […], both methods are accurate within about 25%. The direct integration method has a smaller bias of about 10%. This is consistent with Fig. 5, for which the standard deviation is between 10% and 20%.
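To make the two integration strategies concrete, the following toy sketch estimates a background efficiency in both ways. It is a generic illustration of direct integration and importance sampling, not a reproduction of Sec. 3.2. The one-dimensional Gaussian densities, the 5% signal fraction, the threshold value, and all function names are assumptions, and exact analytic densities stand in for the learned ones.

```python
import random
from statistics import NormalDist

# Toy 1D densities (hypothetical): background N(0, 1) plus a narrow signal.
BACKGROUND = NormalDist(0.0, 1.0)
SIGNAL = NormalDist(2.0, 0.3)
SIGNAL_FRACTION = 0.05

def p_background(x):
    return BACKGROUND.pdf(x)

def p_data(x):
    return (1.0 - SIGNAL_FRACTION) * BACKGROUND.pdf(x) + SIGNAL_FRACTION * SIGNAL.pdf(x)

def likelihood_ratio(x):
    return p_data(x) / p_background(x)

def sample_data(rng):
    if rng.random() < SIGNAL_FRACTION:
        return rng.gauss(SIGNAL.mean, SIGNAL.stdev)
    return rng.gauss(BACKGROUND.mean, BACKGROUND.stdev)

def efficiency_direct(r_cut, n_samples, rng):
    """Sample x from the background density; count how often R(x) > r_cut."""
    passed = sum(
        likelihood_ratio(rng.gauss(BACKGROUND.mean, BACKGROUND.stdev)) > r_cut
        for _ in range(n_samples)
    )
    return passed / n_samples

def efficiency_importance(r_cut, n_samples, rng):
    """Sample x from the data density; weight passing events by p_bg/p_data = 1/R."""
    total = 0.0
    for _ in range(n_samples):
        r = likelihood_ratio(sample_data(rng))
        if r > r_cut:
            total += 1.0 / r
    return total / n_samples

def efficiency_grid(r_cut, lo=-8.0, hi=8.0, n_steps=16000):
    """Reference value: midpoint-rule integral of p_bg over the passing region."""
    step = (hi - lo) / n_steps
    total = 0.0
    for i in range(n_steps):
        x = lo + (i + 0.5) * step
        if likelihood_ratio(x) > r_cut:
            total += p_background(x) * step
    return total

rng = random.Random(1234)
R_CUT = 1.5  # hypothetical threshold on the likelihood ratio
eff_direct = efficiency_direct(R_CUT, 50_000, rng)
eff_importance = efficiency_importance(R_CUT, 50_000, rng)
eff_reference = efficiency_grid(R_CUT)
```

Both estimators target the same integral of the background density over the region that passes the threshold. In a real application the densities are learned, so their modelling errors propagate into both estimates.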

Figure 8: Left: The number of events after a threshold requirement using the two integration methods described in Sec. 3.2, as well as the true background yield. Right: The ratio of the predicted and true background yields from the left plot, as a function of the actual number of events that survive the threshold requirement. The shaded bands around the central predictions are the statistical (Poisson) uncertainty derived from the observed background counts. The black dashed and dotted lines mark ±10% and ±20% around a ratio of 1.

5.3 Performance on a Dataset with Correlated Features

The results presented in the previous sections have established that ANODE is able to identify the signal and estimate the corresponding SM backgrounds introduced in Sec. 4. One fortuitous aspect of the chosen features introduced in Sec. 4 is that they are all relatively independent of the dijet invariant mass. This is illustrated in Fig. 9, using the SR and neighboring sideband regions. As a result of this independence, the CWoLa method is able to find the signal, and presumably the ANODE interpolation from SB to SR is easier than it would be if there were a strong dependence.

Figure 9: A comparison of the four features between the SR and two nearby sidebands defined by […] TeV (lower sideband) and […] TeV (upper sideband).

The purpose of this section is to study the sensitivity of the ANODE and CWoLa hunting methods to correlations between the features and the dijet invariant mass. Based on the assumptions of the two methods, it is expected that with strong correlations, CWoLa hunting will fail to find the signal, while ANODE should still be able to identify the presence of signal in the SR as well as estimate the background. To study this sensitivity in a controlled fashion, correlations are introduced artificially. In practice, adding more features to the feature set will inevitably result in some dependence on the dijet invariant mass; the artificial example here illustrates the challenges already present in low dimensions. New jet mass observables are created, which are linearly shifted:


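$$ m_{J_{1,2}} \to m_{J_{1,2}} + c \, m_{JJ} \qquad (10) $$

where $m_{JJ}$ is the dijet invariant mass and $c = […]$ for this study. The resulting shifted lighter jet mass is presented in Fig. 10.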
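A linear shift of this kind can be sketched on toy data to show how it induces a correlation with the dijet mass. This is an illustration only: the form m_J → m_J + c·m_JJ, the coefficient value `C_SHIFT`, the toy mass distributions, and the helper names are all assumptions.

```python
import math
import random

def shift_jet_masses(m_j1, m_j2, m_jj, c):
    """Apply the assumed linear shift m_J -> m_J + c * m_JJ to both jet masses."""
    return m_j1 + c * m_jj, m_j2 + c * m_jj

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(7)
C_SHIFT = 0.1  # hypothetical coefficient, chosen only for illustration

# Toy events in GeV: jet masses drawn independently of the dijet mass.
m_jj = [rng.uniform(3000.0, 4000.0) for _ in range(5000)]
m_j1 = [rng.gauss(300.0, 100.0) for _ in range(5000)]
m_j2 = [rng.gauss(200.0, 80.0) for _ in range(5000)]

shifted = [shift_jet_masses(a, b, m, C_SHIFT) for a, b, m in zip(m_j1, m_j2, m_jj)]
m_j1_shifted = [pair[0] for pair in shifted]

corr_before = pearson(m_j1, m_jj)          # close to zero by construction
corr_after = pearson(m_j1_shifted, m_jj)   # clearly positive after the shift
```

Because the shifted jet mass now tracks the dijet mass, a classifier trained to separate SR from SB can succeed using the shift alone, which is the failure mode expected for CWoLa hunting in this setting.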

Figure 10: The lighter jet mass for the SR and the lower and upper sideband regions after the shift defined by Eq. 10.

New ANODE and CWoLa models are trained using the shifted dataset and their performance is quantified in Fig. 11. As expected, the performance of the fully supervised classifier is nearly the same as in Fig. 7. ANODE is still able to significantly enhance the signal, with a maximum significance improvement near 4. While in principle ANODE could achieve the same classification accuracy on the shifted and nominal datasets, the performance on the shifted examples is not as strong as in Fig. 7. In practice, the interpolation of the background density into the SR is now more challenging due to the linear correlations. This could possibly be overcome with improved training, better choices of hyperparameters, or more sophisticated density estimation techniques.

By construction, there are now bigger differences between the SR and SB than between the SR background and the SR signal. Therefore, the CWoLa hunting classifier is not able to find the signal. This is evident from the ROC curve in the left plot of Fig. 11, which shows that the signal-versus-background classifier is essentially random while the SR-versus-SB classifier has learned something non-trivial.

Lastly, Fig. 12 shows the performance of direct density estimation for the background prediction using the shifted dataset. The performance is comparable to the unshifted dataset (Fig. 8), meaning that ANODE could potentially be used as a complete anomaly detection method even in the presence of correlated feature spaces.

Figure 11: ROC (left) and SIC (right) curves in the signal region using the shifted dataset specified by Eq. 10.
Figure 12: The same as Fig. 8, but for the shifted dataset. In particular, these plots compare the background prediction from two direct density estimation techniques with the true background yield after a threshold requirement on the likelihood ratio.

6 Conclusions

This paper has presented a powerful new model-independent search method called ANOmaly detection with Density Estimation (ANODE), which is built on neural density estimation. Unlike other approaches, ANODE directly learns the background probability density and data probability density in a signal region. The ratio of these densities is a powerful classifier and the background density can be directly used to estimate the background efficiency from a threshold requirement on the classifier. Finally, ANODE is robust against correlations in the data, which tend to break other model-agnostic sideband methods such as CWoLa.

The results presented in this paper are meant to be a proof of concept of the general method, and there are many exciting future directions. For example, while this paper focused on collider searches for BSM, the ANODE method is completely general and could be applied to many areas beyond high energy physics, including astronomy and astrophysics. Similarly, while the demonstrations here were based on the innovative MAF density estimation technique, the ANODE method can be used in conjunction with any density estimation algorithm. Indeed, there are numerous other neural density estimation methods from the past few years that claim state-of-the-art performance, including Neural Autoregressive Flows [DBLP:journals/corr/abs-1804-00779] and Neural Spline Flows [durkan2019neural]; exploring these would be an obvious way to attempt to improve the results in this paper. In addition, it would be interesting to attempt the ANODE method on even higher-dimensional feature spaces, all the way up to the full low-level feature set of the four vectors of all the hadrons in the event. The prospects for the ANODE method are exciting: as the field of neural density estimation continues to grow within the machine learning community, ANODE will become more sensitive to resonant new physics in collider high energy physics and beyond.

DS is grateful to Matt Buckley and John Tamanas for many fruitful discussions on neural density estimation. We are especially grateful to John Tamanas for help with the conditional MAF code. Additionally, we would like to thank Uroš Seljak for helpful discussions and Nick Rodd for helpful comments on the draft. This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. DS is supported by DOE grant DOE-SC0010008. DS thanks LBNL, BCTP and BCCP for their generous support and hospitality during his sabbatical year.