New Methods and Datasets for Group Anomaly Detection From Fundamental Physics

The identification of anomalous overdensities in data - group or collective anomaly detection - is a rich problem with a large number of real world applications. However, it has received relatively little attention in the broader ML community, as compared to point anomalies or other types of single instance outliers. One reason for this is the lack of powerful benchmark datasets. In this paper, we first explain how, after the Nobel-prize winning discovery of the Higgs boson, unsupervised group anomaly detection has become a new frontier of fundamental physics (where the motivation is to find new particles and forces). Then we propose a realistic synthetic benchmark dataset (LHCO2020) for the development of group anomaly detection algorithms. Finally, we compare several existing statistically-sound techniques for unsupervised group anomaly detection, and demonstrate their performance on the LHCO2020 dataset.


1. Introduction

Unsupervised anomaly detection is a long-established area of statistics [Grubbs, 1969] and has recently seen substantial progress from modern machine learning approaches (see Refs. [Chalapathy and Chawla, 2019, Kwon et al., 2019] for recent reviews). Most of these methods are designed to identify examples $x$ that are individually anomalous, i.e. for which the probability density of normal data, $p_{\text{normal}}(x)$, is vanishingly small. An area of anomaly detection that has received comparatively less attention is the case where one cannot determine with certainty that any single example is anomalous. These “group anomalies” instead manifest as overdensities in the probability density of the data and occur naturally in a variety of applications ranging from computer security to pandemic detection, including many other scientific, industrial, and financial applications.

Group anomaly detection is also central to fundamental physics. In particular, most data analyses at the Large Hadron Collider (LHC) can be seen as group anomaly detection. Nearly all of these analyses are supervised and rely strongly on a particular anomaly hypothesis. Following the Nobel Prize-winning discovery of the Higgs boson in 2012 (which was a benchmark for group anomaly detection [Muandet and Schölkopf, 2013]), there is increasing urgency in searching for new phenomena beyond the Standard Model (BSM) of particle physics, for which there is ample indirect evidence (e.g. dark matter). Since these new BSM particles and interactions can take nearly any form, there is correspondingly a growing need for unsupervised approaches to group anomaly detection at the LHC.

As we will review in Sec. 1.3, existing approaches to group anomaly detection in the machine learning literature are either insufficiently sensitive for application to fundamental physics or operate under a different set of assumptions. This has inspired researchers in fundamental physics to devise a host of new approaches to unsupervised group anomaly detection. In this paper, our first goal is to clearly define the challenge of modern group anomaly detection at the LHC and how it relates to other applications (Sec. 1.1), and to highlight a few especially promising approaches that have been recently proposed in the LHC literature. Group anomaly detection at the LHC shares many properties with other possible applications, so our expectation is that results obtained there will be useful more widely.

The second purpose of this paper is to bring the LHC Olympics 2020 (LHCO2020) challenge and datasets [Kasieczka G., Nachman B., Shih D. (editors) and others, 2020] to the attention of the wider ML community. The LHCO2020 was initiated in order to facilitate the development and comparison of semi-, weakly-, and un-supervised (which we call less-than-supervised) group anomaly detection methods. In this paper, we propose the LHCO2020 datasets as high quality findable, accessible, interoperable, and reusable (FAIR) benchmarks for group anomaly detection in a broader setting. The datasets have been well-curated and documented by domain experts, but can be used by others without specific domain knowledge. In Sec. 2, we will describe the datasets in more detail. In Sec. 3, we will briefly review some machine learning methods that have already been developed and applied to the LHCO2020 dataset, and in Sec. 4 we will provide some quantitative comparisons of selected methods.

1.1. Problem statement

Data at the LHC are obtained by smashing beams of protons together at relativistic velocities. The large energy $E$ available in these collisions allows the creation of new particles with mass $m$ via $E = mc^2$, where $c$ is the speed of light. Such particles then decay and radiate, producing a spray of secondary particles that interact with the detector and register signals in millions of readout channels. Standard dimensionality reduction schemes are then used to reconstruct the trajectories of the secondary particles. Further (lossy) dimensionality reduction is accomplished by combining these trajectories using physics-inspired functions. A typical data analysis will use the resulting features $x \in \mathbb{R}^n$, where $n$ is small compared to the number of readout channels.

One aspect of data analysis at the LHC that distinguishes it from other areas is the excellent fidelity of available simulators. The typical analysis workflow at the LHC begins with a particular anomaly hypothesis. The probability densities of normal and anomalous events, $p_{\text{normal}}(x)$ and $p_{\text{anomaly}}(x)$, are then numerically estimated using simulators. Finally, a likelihood ratio test with the empirical data probability density $p_{\text{data}}(x)$ is used to test for the presence of anomalous events.

While this procedure has achieved broad sensitivity to a variety of anomaly hypotheses, it also has large gaps in coverage. In particular, not all collision types can be accurately simulated and not all possible anomaly hypotheses are known. It is therefore essential to use unsupervised approaches that do not rely on positing a particular anomaly hypothesis and can estimate likelihood ratios directly from unlabeled data.

There is no known general solution to this challenge, but one class of group anomalies in collider physics, called resonances, is amenable to less-than-supervised learning. These anomalies have the following generic characteristics:

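  • Rarity: $p_{\text{data}}(x) = (1-\epsilon)\, p_{\text{normal}}(x) + \epsilon\, p_{\text{anomaly}}(x)$ with $\epsilon \ll 1$.

  • Overlap: $\operatorname{supp}(p_{\text{anomaly}}) \subseteq \operatorname{supp}(p_{\text{normal}})$.

  • Resonance: $p_{\text{anomaly}}(m) \approx 0$ for $|m - m_0| > \delta$, for some feature $m$ (which is often a mass) and fixed $m_0$, $\delta$.

  • Smoothness: $p_{\text{normal}}(x \mid m)$ varies slowly with $m$ so that one can use data with $|m - m_0| > \delta$ to estimate $p_{\text{normal}}(x \mid m)$ for $|m - m_0| < \delta$.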

Apart from the above, no assumptions on the anomaly are made. Specifically, there is no preferred value of the resonance position $m_0$, and it is not known which features are sensitive to the anomaly. Furthermore, no group memberships are known.

Rarity is required because otherwise the anomalies would have been ruled out by non-observation in existing analyses. In the LHC Olympics, the anomalies constitute much less than 1% of the dataset. The overlapping support distinguishes these group anomalies from off-manifold anomalies and occurs physically due to radiative processes and detector effects that ensure that anything that can happen, will happen with some probability. Resonance is natural for searches looking for a particle of mass $m_0$ that decays into objects which can be fully observed. The reconstructed mass $m$ of the decay products will naturally be localized near $m_0$, where the spread $\delta$ is often dominated by detector effects and independent of the details of the new particle. Smoothness is an excellent approximation because the physical processes underlying the particle interactions do not change abruptly as a function of $m$.
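The four characteristics can be made concrete with a small toy dataset. The sketch below is purely illustrative and is not the LHCO2020 simulation: the function name, the falling exponential spectrum, and all numerical values are assumptions chosen only so that each property is visible.

```python
import numpy as np

def make_toy_resonance(n_normal=100_000, n_anomaly=500, m0=3.5, delta=0.1, seed=0):
    """Toy data with a resonant feature m and one extra feature x (all values hypothetical)."""
    rng = np.random.default_rng(seed)
    # Normal data: smoothly falling spectrum in m; x depends only weakly on m (Smoothness).
    m_norm = 2.0 + rng.exponential(scale=1.0, size=n_normal)
    x_norm = rng.normal(loc=0.05 * m_norm, scale=1.0, size=n_normal)
    # Anomaly: localized in m around m0 with width delta (Resonance),
    # shifted in x but still inside the support of the normal data (Overlap).
    m_anom = rng.normal(loc=m0, scale=delta, size=n_anomaly)
    x_anom = rng.normal(loc=1.5, scale=0.5, size=n_anomaly)
    m = np.concatenate([m_norm, m_anom])
    x = np.concatenate([x_norm, x_anom])
    label = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_anomaly, dtype=int)])
    perm = rng.permutation(m.size)  # shuffle so that ordering carries no group information
    return m[perm], x[perm], label[perm]

m, x, label = make_toy_resonance()
print(f"anomaly fraction: {label.mean():.4f}")  # Rarity: well below 1%
```

The truth labels are returned only so that a method can be evaluated afterwards; an unsupervised method would see `m` and `x` alone.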

Group anomaly detection in other domains has similar properties to resonant anomaly detection in collider physics. An example is the detection of distributed denial of service (DDoS) attacks on computer networks. In this case, the feature encoding time takes the role of the invariant mass in physics problems. The number of malicious packets is small compared to overall traffic (Rarity), and individual packets look innocent (Overlap) but share a set of properties such as the protocol used or the originating hosts. Finally, these attacks are limited in time (Resonance) and occur on top of the stationary or periodic background of network traffic (Smoothness).

Similar correspondences exist for other challenges in fundamental science (e.g. galaxy classification), engineering (e.g. predictive maintenance or production line monitoring), medicine (e.g. early warning systems for pandemics or cancer detection), social media analysis (e.g. trending tweets) and financial data analysis (e.g. insurance fraud, credit card fraud or stock market analysis).

1.2. Approaches

The proposed methods for group anomaly detection rely on defining a signal region (SR): a compact region defined solely in the resonant feature $m$ in which the data is possibly enriched in signal. The complement is termed the sideband (SB). At the LHC, a common choice for the SR is an interval in $m$ centered around a hypothesized resonance position $m_0$ with a width related to the resonance width $\delta$. However, in general the position of a potential signal is not known. This can be solved with a sliding window approach for the SR. Results obtained using different regions can be used individually (no group anomaly in this region) or statistically combined, by computing appropriate trial factors, for a global statement (presence or absence of a group anomaly in the data). This method is often referred to as bump hunting. In particle physics specifically, bump hunting has a long history [Aubert and others, 1974, Augustin and others, 1974], but the Rarity assumption in this problem requires additional statistical methods to enhance infrequent anomalies in an unsupervised way.
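A sliding-window definition of the SR and SB can be sketched in a few lines. This is a minimal illustration under assumed names and values (`sliding_signal_regions`, the window width, and the step are all hypothetical); it only builds the region masks and does not compute the trial factors needed to combine results across windows.

```python
import numpy as np

def sliding_signal_regions(m, m_min, m_max, width, step):
    """Yield (lo, hi, sr_mask, sb_mask) for SR windows [lo, hi) sliding over the feature m."""
    lo = m_min
    while lo + width <= m_max + 1e-12:
        hi = lo + width
        sr_mask = (m >= lo) & (m < hi)
        sb_mask = ~sr_mask  # the sideband is the complement of the signal region
        yield lo, hi, sr_mask, sb_mask
        lo += step

m = np.linspace(0.0, 10.0, 1001)  # placeholder values of the resonant feature
windows = list(sliding_signal_regions(m, m_min=2.0, m_max=8.0, width=1.0, step=0.5))
```

Overlapping windows (a step smaller than the width) reduce the chance that a signal is split across two neighbouring regions, at the cost of correlated tests.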

We will introduce two algorithms for group anomaly detection: one classification-based and one density-based. In the first case, we construct a binary classifier to distinguish between the SR and SB using features that are independent of $m$. Simply put: if non-trivial classification can be achieved, it is a sign of an underlying difference between the sideband and signal region, and hence of a group anomaly. The second approach directly estimates the density of normal events conditional on $m$ in the SB and then interpolates it to the SR, where it can be used for likelihood-ratio based anomaly detection. A detailed description of the algorithms is provided in Sec. 3 and code is available from the authors upon request.

1.3. Related work

Alternative approaches for group anomaly detection fall into two categories [Toth and Chawla, 2018]: discriminative and generative.

Two classic examples of discriminative methods are One-Class Support Measure Machines (OCSMM) [Muandet and Schölkopf, 2013] and Support Measure Data Description (SMDD) [Guevara et al., 2015]. Both assume known group memberships and test whether a given group is anomalous with respect to a distribution of normal groups. In contrast, we address the problem where no group membership is known for individual points. Imposing groups via clustering is not feasible, as the anomalous group overlaps with the normal group and differs only in density.

For the same reason, generative approaches that require known group memberships [Xiong et al., 2011b, a, Chalapathy et al., 2019] cannot be applied. Group Latent Anomaly Detection (GLAD) is a one-step model that unifies group discovery and anomaly detection [Yu et al., 2015], developed for social media analysis. A related approach based on topic modelling and Latent Dirichlet Allocation (LDA) [Dillon et al., 2019, 2020] was deployed on the LHCO2020 datasets [Kasieczka G., Nachman B., Shih D. (editors) and others, 2020]. However, other methods — such as density estimation based anomaly detection reviewed in this contribution — outperform LDA on the LHCO2020 dataset.

An interesting connection exists to developments in point-based anomaly detection using density estimation. Recently, [Le Lan and Dinh, 2020] showed that low background likelihood by itself is not a reliable anomaly metric and that a likelihood ratio is needed instead. Constructing such a likelihood ratio is often difficult, and ad-hoc background models based on perturbing the data [Ren et al., 2019], measuring input complexity [Serrà et al., 2020], or a second density estimator trained on generic data [Schirrmeister et al., 2020] are used in practice. Compared to these, the Resonance and Smoothness assumptions allow robust construction of a likelihood ratio.

The absence of realistic open benchmark datasets for development and evaluation is a known problem limiting progress in methods for anomaly detection [Toth and Chawla, 2018, Pang et al., 2020]. Purely synthetic datasets, such as Gaussian distributions with different correlations or mixed examples, e.g. from different image classes [Xiong et al., 2011b, Chalapathy and Chawla, 2019], have limited realism and practical applicability. Other datasets from the natural sciences [Xiong et al., 2011b, a, Muandet and Schölkopf, 2013, Guevara et al., 2015] are either not available at all or require substantial amounts of further processing before being suitable for anomaly detection studies. Similarly, datasets from network intrusion detection [Tavallaee et al., 2009, Divekar et al., 2018] in general require domain knowledge and feature engineering and are, depending on the time slice, too densely populated with anomalies.

Compared to other available datasets [Rayana, 2016], LHCO2020 is ready to use without additional pre-processing or feature engineering. The feature representations at two different levels of complexity allow developing end-to-end algorithms as well as fast prototyping on low-dimensional features. Injected anomalies are rare (e.g., 834 anomalous examples out of 1M data points for Black Box 1) and have high point difficulty (all anomalies are inliers) but have sufficient complexity (up to 2100 features per example) to allow successful detection. Although the dataset is based on simulation tools, consistent methods are employed to simulate background and anomaly, and the overall toolchain is well-validated, including comparisons to experimental data. For added realism, different background models are used in the development datasets and the individual challenge sets (see the following section for details). Finally, the LHCO2020 dataset was extensively vetted by domain experts and a number of anomaly detection algorithms were evaluated on it. (Footnote 1: After submission of this work, a second anomaly detection challenge aimed at finding anomalies in particle physics data was published [Aarrestad and others, 2021], further underscoring the relevance of this task.)

2. LHC Olympics Dataset

The portal for the LHC Olympics datasets can be found at the challenge website. The datasets described below are all publicly available and downloadable from Zenodo [Kasieczka et al., 2019]. The LHCO2020 datasets consist of a dataset for algorithm development (R&D Dataset) and three initially blinded datasets (Black Box 1–3). Following the conclusion of the challenge phase, labels (anomaly or not) are now provided for these datasets as well.

Standard Monte Carlo based tools were used to create these datasets. The underlying physics process was simulated via Pythia [Sjostrand et al., 2008] and Herwig++ [Bahr and others, 2008] while Delphes [de Favereau et al., 2014] was used to model the finite detector resolution. These software packages are highly configurable via so-called tunes, which encode empirical properties of fundamental physics and uncertainties. To increase the realism of the challenge, different tunes were used to create the datasets as described in the following. For more details on the (public, open source) tools that were used to produce these simulated events, we refer the reader to the challenge website.

All datasets are arrays (pandas dataframes [McKinney, 2010] saved in compressed HDF5 [Koranne, 2011] format) with shape ($N_{\text{events}}$, 2101), where $N_{\text{events}}$ is the number of events. Each row of the array is a single data instance (event), representing the products of a single collision in the simulated particle detector.

2.1. R&D Dataset

The events in the R&D dataset are of two types. The “normal” or “background” events consist of one million events produced through the strong interactions of the Standard Model. The “anomaly” or “signal” events consist of 100,000 events where a hypothetical heavy particle from a theory Beyond the Standard Model (BSM) (called the $Z'$) decays instantaneously to two additional heavy BSM particles (called $X$ and $Y$), which each decay to a collimated spray of Standard Model particles (hadrons). See Fig. 1 for an illustration of the anomaly events.


Figure 1. Schematic diagrams (Feynman diagrams) of the anomaly used for the R&D dataset and Black Box 1 (left) as well as tri-jet (center) and di-jet (right) anomalies used for Black Box 3. Incoming particles from collisions are on the left, outgoing particles eventually measured by detectors are on the right. Lines in the middle represent virtual resonances leading to anomalous correlations of features in observed events. Different linestyles denote different types of particles.

In each event (background or signal), the properties of the 700 most energetic collision products (particles known as “hadrons”) are recorded in standard particle physics detector coordinates: $p_T$ (“transverse momentum”), $\eta$ (“pseudorapidity”) and $\phi$ (“azimuthal angle”), as illustrated in Fig. 2. More detailed information such as particle charge is not included. If the event has fewer than 700 collision products, it is zero-padded. Finally, the truth bit (signal or background) is appended at the end of every event. In this way, each event comprises $3 \times 700 + 1 = 2101$ floating point numbers.
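As an illustration of this layout, the sketch below splits rows of length 2101 into per-particle coordinates and the truth bit. The helper name `unpack_events` is hypothetical, and the column order (the three coordinates of the first particle, then the second, and so on, followed by the truth bit) is an assumption that should be checked against the dataset documentation. Two hand-made rows stand in for real events.

```python
import numpy as np

N_PARTICLES = 700  # maximum number of hadrons stored per event

def unpack_events(rows):
    """Split an (n_events, 2101) array into particles (n_events, 700, 3) and truth labels."""
    rows = np.asarray(rows, dtype=float)
    assert rows.shape[1] == 3 * N_PARTICLES + 1
    # Assumed column order: pT, eta, phi repeated per particle, then the truth bit.
    particles = rows[:, :-1].reshape(-1, N_PARTICLES, 3)
    labels = rows[:, -1].astype(int)
    # Zero-padded slots have pT == 0; count the real particles in each event.
    n_real = (particles[:, :, 0] > 0).sum(axis=1)
    return particles, labels, n_real

# Stand-in for real data: two fake events with 2 and 1 particles respectively.
rows = np.zeros((2, 2101))
rows[0, :6] = [100.0, 0.5, 1.0, 50.0, -0.2, 2.0]
rows[1, :3] = [80.0, 1.1, -3.0]
rows[1, -1] = 1.0  # truth bit: the second event is labeled as signal
particles, labels, n_real = unpack_events(rows)
```

With the actual files, the input could be obtained with something like `pandas.read_hdf(path).to_numpy()`, where `path` points to a downloaded file; the file name is deliberately left unspecified here.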

Figure 2. A schematic diagram of a detector at the LHC to illustrate the standard coordinate system. In the top view, protons collide into and out of the page while in the bottom view, protons collide from the left and right. The collision debris flies out in all directions and for simplicity is represented by six particles. These particles register signals in a series of detector components. Their trajectories are then reconstructed using their transverse momentum $p_T$ and angular coordinates $\eta$ and $\phi$.

The purpose of the R&D dataset is to provide a common benchmark dataset for anomaly detection techniques on a realistic and well-motivated BSM signal. (Footnote 3: The signal model was discussed in Ref. [J. Kim, K. Kong, B. Nachman, D. Whiteson, 2020] and has the feature that existing searches for BSM physics may not be particularly sensitive to it.) For this purpose, many more anomaly events (100k) were provided than would be present in a realistic setting (e.g., 1000 out of 1M background events). Also, the truth bit labels were provided so that anomaly detection approaches could be evaluated, whereas in real data the truth bit is not known, hence the need for group anomaly detection.

Finally, in addition to the raw features dataset described above, a reduced dataset of high-level features was also made available during the course of the challenge. These high-level features are computed from those of the raw dataset, but they are provided separately for convenience. They are summaries of the overall geometrical arrangement and energy distribution of the low-level features.

Concretely, the particles are first grouped using a hierarchical clustering algorithm [Cacciari et al., 2008] to form so-called jets, which are collimated sprays of hadrons. The two most energetic jets in each event are selected and the remaining jets are discarded. Then the relativistic invariant mass of these two jets is calculated. This is the potentially resonant feature $m$. The other high-level features are: the invariant mass of the lighter jet; the mass difference of the two jets; and the $N$-subjettiness ratios $\tau_{21}$ [Thaler and Van Tilburg, 2011, 2012] of the leading two jets. The latter feature quantifies the degree to which a jet is characterized by two subjets or one subjet, with smaller values indicating two-prong substructure.
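The invariant mass of the two-jet system follows from standard relativistic kinematics once each jet is described by its transverse momentum, pseudorapidity, azimuthal angle, and mass. The sketch below shows that calculation in natural units ($c = 1$); the function names are illustrative, and the jet clustering step itself is not shown.

```python
import math

def four_vector(pt, eta, phi, mass):
    """Convert detector coordinates (pT, eta, phi, mass) to (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(mass**2 + px**2 + py**2 + pz**2)
    return energy, px, py, pz

def invariant_mass(jet1, jet2):
    """Invariant mass of the system formed by two jets, each given as (pT, eta, phi, mass)."""
    e1, px1, py1, pz1 = four_vector(*jet1)
    e2, px2, py2, pz2 = four_vector(*jet2)
    m2 = (e1 + e2) ** 2 - (px1 + px2) ** 2 - (py1 + py2) ** 2 - (pz1 + pz2) ** 2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative values from rounding

# Two back-to-back massless jets with pT = 1000 at eta = 0.
m_jj = invariant_mass((1000.0, 0.0, 0.0, 0.0), (1000.0, 0.0, math.pi, 0.0))
```

For this back-to-back configuration the momenta cancel, so the invariant mass equals the sum of the two jet energies.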

Many approaches in the LHC Olympics challenge were based on these features, instead of the low-level features. Plots of these high-level features (histograms marginalized over the rest of the feature space) are shown in Fig. 3. We see that many of them are quite useful for separating signal from background. The resonant feature $m$ is shown in Fig. 4.

Figure 3. Histograms of the four high-level features provided in the LHCO2020 data. The features in the right plot are dimensionless and the features in the left plot are given in units of TeV.

Figure 4. A histogram of the resonant feature $m$ in units of GeV with a parametric fit using the SB data overlaid. The Kolmogorov-Smirnov (KS) $p$-value of the fit is well above 0.05 in the SB.

2.2. Black Box 1

This box contained the same signal topology as the R&D dataset (see Fig. 1) but with different parameters for the anomalous particles, so that a method trained exclusively on the R&D dataset could not trivially succeed on the Black Box dataset. A total of 834 signal events were included (out of a total of 1M events in all). This number was chosen so that the approximate local significance of the signal in an inclusive analysis is not significant. (Footnote 4: It is important to keep in mind that in particle physics, the discovery threshold is conventionally taken to be $5\sigma$, corresponding to a one-sided $p$-value of about $3 \times 10^{-7}$ under the null hypothesis.)
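The relation between a significance quoted in standard deviations and a $p$-value can be checked directly. The sketch below assumes the one-sided Gaussian convention that is customary in particle physics and evaluates it for the conventional five-standard-deviation discovery threshold; the function name is illustrative.

```python
import math

def one_sided_p_value(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma standard deviations."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

p_discovery = one_sided_p_value(5.0)
print(f"{p_discovery:.2e}")  # about 2.87e-07
```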

In order to emulate reality, the background events in Black Box 1 are different to the ones from the R&D dataset. The background still uses the same generators as for the R&D dataset, but a number of Pythia and Delphes settings were changed from their defaults to mimic the domain shift between simulation and experimental data.

2.3. Black Box 2

This sample of 1M events was background only. The background was produced using a different publicly-available and standard particle-physics event generation tool, Herwig++ [Bahr and others, 2008], instead of Pythia. Also, it used a modified Delphes detector card that is different from Black Box 1 but with similar modifications on top of the R&D dataset card.

2.4. Black Box 3

The signal was based on Ref. [Agashe et al., 2017a, b] and consisted of a hypothetical heavy BSM particle with two different decay modes resulting in two collimated showers of particles (“dijets”) or with three collimated showers of particles (“trijets”) as illustrated in Fig. 1 center and right. These signals are inspired by theories introducing extra dimensions of space-time. 1200 dijet events and 2000 trijet events were included along with Standard Model backgrounds in Black Box 3 (for a total of 1M events). These numbers were chosen so that an analysis that found only one of the two modes would not observe a significant excess. The background events were produced with modified Pythia and Delphes settings (different than the R&D and other Black Box datasets).

2.5. Evaluation of the Challenge

During the initial challenge phase (see [Kasieczka G., Nachman B., Shih D. (editors) and others, 2020]), only the signal contained in the R&D Dataset was known to participants. For this, both the physical properties (decay topology, masses) and per-event labels were given. No such information was made available for Black Box 1–3. Participants were asked to submit (separately for each Black Box): I) A p-value associated with the dataset having no new particles (null hypothesis); II) As complete a characterization of the new physics as possible (in text-form) (e.g. masses and decay modes of all new particles with associated uncertainties); and III) How many signal events (central value and uncertainty) are in the dataset (before any selection criteria).

After the challenge phase, the physical properties and datasets with added per-event labels (signal or background) were made public, rendering the initial evaluation criteria obsolete. However, as better signal identification will aid better anomaly detection, quantities such as accuracy, area under the curve (AUC), or significance improvement (SIC, defined as the ratio of true positive rate over the square root of the false positive rate for a given working point) are useful metrics to report. We stress that while these metrics utilize the truth labels (signal or background) during the evaluation stage, a successful anomaly detection method would ideally not use these labels during the training stage.
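These metrics are straightforward to compute from per-event anomaly scores and truth labels. The following is a minimal NumPy sketch, not the evaluation code of the challenge; the function name is hypothetical, tied scores are not treated specially, and the Gaussian toy scores are placeholders.

```python
import numpy as np

def roc_and_sic(scores, labels):
    """Return (fpr, tpr, auc, sic) for anomaly scores, where higher means more anomalous."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores)       # sort by descending score
    tp = np.cumsum(labels[order])     # true positives when keeping the top-k examples
    fp = np.cumsum(~labels[order])    # false positives when keeping the top-k examples
    tpr = np.concatenate([[0.0], tp / labels.sum()])
    fpr = np.concatenate([[0.0], fp / (~labels).sum()])
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoidal rule
    # SIC = TPR / sqrt(FPR); left undefined (NaN) where FPR == 0.
    sic = np.full_like(tpr, np.nan)
    nonzero = fpr > 0
    sic[nonzero] = tpr[nonzero] / np.sqrt(fpr[nonzero])
    return fpr, tpr, auc, sic

rng = np.random.default_rng(1)
labels = np.concatenate([np.zeros(10_000), np.ones(1_000)])
scores = np.concatenate([rng.normal(0.0, 1.0, 10_000), rng.normal(2.0, 1.0, 1_000)])
fpr, tpr, auc, sic = roc_and_sic(scores, labels)
```

Because the SIC diverges formally as the false positive rate approaches zero, its maximum is usually only meaningful where enough background events survive the selection.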

3. Smooth background anomaly detection

Without a particular anomaly hypothesis, it is not possible to construct the optimal [Neyman and Pearson, 1933] classifier $p_{\text{anomaly}}(x)/p_{\text{normal}}(x)$. However, one can construct a related test statistic that strives to achieve optimality for a related hypothesis test: is the data in the SR more consistent with (a model of) itself or with a prediction for the normal data in the SR? In this case, the optimal test statistic would be $R(x) = p_{\text{data}}(x \mid m \in \text{SR}) / p_{\text{normal}}(x \mid m \in \text{SR})$. Rejecting the null hypothesis would be evidence of an anomaly. We describe two approaches that exploit the features of resonant group anomaly detection in order to approximate this test statistic directly from data.

Classification Without Labels (CWoLa) [Metodiev et al., 2017, Collins et al., 2018, 2019]:

Due to the smoothness condition, $p_{\text{normal}}(x \mid m \in \text{SR}) \approx p_{\text{normal}}(x \mid m \in \text{SS})$, where SS is a narrow part of the SB directly adjacent to the SR in $m$. We call SS the short sideband. In the CWoLa protocol, one trains a classifier using $x$ (without $m$) to distinguish data in the SR from data in the SS. The widths of the SR and SS are chosen to have enough examples in both regions, but also to make the regions as close in $m$ as possible. The observation of Refs. [Metodiev et al., 2017, Collins et al., 2018, 2019] (see also the analog with label noise in Ref. [Scott et al., 2013]) is that the classifier trained in this way is monotonically related to the likelihood ratio $p_{\text{data}}(x \mid m \in \text{SR}) / p_{\text{normal}}(x \mid m \in \text{SR})$ when optimal. Any classifier can be used and can be chosen based on the details of the data. An advantage of the CWoLa approach is that the problem of density estimation is converted into classification, a comparatively easier problem. However, this approach relies strongly on the smoothness assumption and an additional assumption about the feature space: that the CWoLa classifier cannot learn $m$ from $x$.
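The CWoLa idea can be illustrated on toy data. In the sketch below, all data, region boundaries, and names are assumptions, and the anomaly fraction is exaggerated relative to LHCO2020 so that the effect is visible with a very crude classifier. In place of a neural network, the "classifier" is a binned estimate of the probability that an event with feature $x$ comes from the SR rather than the SS, trained only on region labels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (hypothetical): m is the resonant feature, x one auxiliary feature.
n_norm, n_anom = 200_000, 2_000
m = np.concatenate([rng.uniform(0.0, 10.0, n_norm), rng.normal(5.0, 0.2, n_anom)])
x = np.concatenate([rng.normal(0.0, 1.0, n_norm), rng.normal(2.5, 0.5, n_anom)])
is_anom = np.concatenate([np.zeros(n_norm, bool), np.ones(n_anom, bool)])

sr = np.abs(m - 5.0) < 0.5                               # signal region
ss = (np.abs(m - 5.0) >= 0.5) & (np.abs(m - 5.0) < 1.5)  # short sideband

# "Classifier": binned estimate of P(SR | x), trained on region labels only (never on is_anom).
edges = np.linspace(-5.0, 5.0, 41)
h_sr, _ = np.histogram(x[sr], bins=edges)
h_ss, _ = np.histogram(x[ss], bins=edges)
p_sr = (h_sr + 1.0) / (h_sr + h_ss + 2.0)  # add-one smoothing avoids division by zero

def cwola_score(x_values):
    """Score events by the learned P(SR | x); larger values are more SR-like."""
    idx = np.clip(np.digitize(x_values, edges) - 1, 0, len(edges) - 2)
    return p_sr[idx]

# Evaluation only: compare scores of anomalous and normal SR events using the truth labels.
score_sr = cwola_score(x[sr])
mean_anom = score_sr[is_anom[sr]].mean()
mean_norm = score_sr[~is_anom[sr]].mean()
```

In this toy, $x$ is independent of $m$ for normal events, so any excess of SR-like scores can only come from the injected anomaly; if $x$ were correlated with $m$, the same classifier would also pick up that correlation.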

Anomaly Detection with Density Estimation (ANODE) [Nachman and Shih, 2020]:

The smoothness and resonance conditions can also be used to estimate $p_{\text{normal}}(x \mid m)$ directly from the SB. Then, this density can be interpolated into the SR. One can also estimate the probability density $p_{\text{data}}(x \mid m)$ directly in the SR. The ratio of the direct and interpolated densities will approximate $p_{\text{data}}(x \mid m) / p_{\text{normal}}(x \mid m)$ when optimal. Any explicit conditional density estimation strategy will work; the authors of Ref. [Nachman and Shih, 2020] used a masked auto-regressive [Papamakarios et al., 2017] normalizing flow [Rezende and Mohamed, 2015]. An advantage of ANODE is that $p_{\text{normal}}(x \mid m)$ can be different in the SR and SB; as long as it is smooth enough so that one can use the interpolation power of neural networks to estimate the density in the SR from the SB, the procedure should work.
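The three steps of the density-based approach can be sketched on toy data in which the auxiliary feature is correlated with the resonant feature. This is not the normalizing-flow implementation of Ref. [Nachman and Shih, 2020]: as a deliberately simple stand-in, the normal density is modeled as a Gaussian whose mean is linear in $m$, and the SR data density is a normalized histogram. All data and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (hypothetical): x is correlated with m for normal events.
n_norm, n_anom = 200_000, 2_000
m_norm = rng.uniform(0.0, 10.0, n_norm)
m = np.concatenate([m_norm, rng.normal(5.0, 0.2, n_anom)])
x = np.concatenate([rng.normal(0.1 * m_norm, 1.0), rng.normal(3.0, 0.5, n_anom)])
is_anom = np.concatenate([np.zeros(n_norm, bool), np.ones(n_anom, bool)])
sr = np.abs(m - 5.0) < 0.5
sb = ~sr

# Step 1: fit p_normal(x | m) in the SB as a Gaussian whose mean is linear in m.
slope, intercept = np.polyfit(m[sb], x[sb], deg=1)
sigma = np.std(x[sb] - (slope * m[sb] + intercept))

def p_normal(x_values, m_values):
    """Interpolated normal-data density, evaluated at any m (including the SR)."""
    z = (x_values - (slope * m_values + intercept)) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

# Step 2: estimate p_data(x | m in SR) directly with a normalized histogram.
edges = np.linspace(-6.0, 8.0, 57)
density_sr, _ = np.histogram(x[sr], bins=edges, density=True)

def p_data_sr(x_values):
    """Histogram estimate of the data density in the SR."""
    idx = np.clip(np.digitize(x_values, edges) - 1, 0, len(edges) - 2)
    return density_sr[idx]

# Step 3: the ratio of the two densities is the anomaly score for SR events.
ratio = p_data_sr(x[sr]) / p_normal(x[sr], m[sr])
mean_anom = ratio[is_anom[sr]].mean()
mean_norm = ratio[~is_anom[sr]].mean()
```

Because the fitted density depends explicitly on $m$, the linear correlation between $x$ and $m$ in the normal data is absorbed by the model rather than mistaken for an anomaly.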

For both CWoLa and ANODE, the estimate of the likelihood ratio $R(x) = p_{\text{data}}(x)/p_{\text{normal}}(x)$ in the SR is used to enhance the presence of a potential anomaly. As we do not know ahead of time how many anomalous examples there may be, we make a small number of fixed choices of a threshold $R_c$ and keep examples with $R(x) > R_c$, where $R_c$ could be defined by the fraction of normal events that pass the selection. After this requirement, we need to estimate the probability of observing at least as many examples in the SR as are seen in the data, assuming only normal examples are present, in order to compute a $p$-value in the SR for the observed data. This tail probability can be estimated by once again using the SB: we simply need to estimate the expected number of normal examples that would pass the selection in the SR. The probability mass function is then Poisson with this mean. The mean can be estimated using a histogram in $m$ and interpolating from the SB (see Fig. 4).
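The last step can be sketched as follows. The example assumes a histogram of $m$ after the selection, interpolates the expected normal count into the SR with a low-order polynomial fit to the logarithm of the SB bin counts (a stand-in for whatever parametric form is appropriate), and evaluates the Poisson tail probability. The function names and the toy histogram are hypothetical.

```python
import math
import numpy as np

def poisson_tail(n_obs, mu):
    """P(N >= n_obs) for a Poisson variable with mean mu."""
    # Sum the pmf for k < n_obs in log space, then take the complement.
    below = sum(math.exp(k * math.log(mu) - mu - math.lgamma(k + 1)) for k in range(n_obs))
    return max(0.0, 1.0 - below)

def expected_sr_count(centers, counts, sr_lo, sr_hi, degree=2):
    """Fit log(counts) in the SB bins with a polynomial and sum its prediction over the SR bins."""
    in_sr = (centers >= sr_lo) & (centers < sr_hi)
    coeffs = np.polyfit(centers[~in_sr], np.log(counts[~in_sr]), deg=degree)
    return float(np.exp(np.polyval(coeffs, centers[in_sr])).sum())

# Hypothetical histogram of m after the selection: falling background plus a bump.
centers = np.arange(2.05, 5.0, 0.1)
counts = np.round(1000.0 * np.exp(-1.2 * (centers - 2.0)))
in_sr = (centers >= 3.3) & (centers < 3.7)
counts[in_sr] += 30  # injected excess in the SR bins
mu = expected_sr_count(centers, counts, sr_lo=3.3, sr_hi=3.7)
n_obs = int(counts[in_sr].sum())
p_value = poisson_tail(n_obs, mu)
```

Taking the complement of a sum loses precision once the tail probability approaches machine precision, so very small $p$-values would need a dedicated survival-function routine. The sketch also ignores the uncertainty on the fitted mean, which a full analysis would propagate.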

4. Experiments

State-of-the-art performance on this problem, as measured by most closely predicting properties of an unseen anomaly in a blind study on the first Black Box, is achieved by density estimation [Kasieczka G., Nachman B., Shih D. (editors) and others, 2020]. However, no correct identification of the anomaly was claimed for the more challenging third Black Box during the blind phase, leaving such more complex multi-group anomalies as an open problem.

To provide quantitative results, we reproduce in Fig. 5 (top) the Receiver Operating Characteristic (ROC) curve for several methods on the R&D dataset. All algorithms use the high-level observables introduced in Sec. 2.1 as their input. Shown are: a classifier assuming known anomaly/background labels (Supervised), CWoLa’s performance on the SR/SB classification task, CWoLa’s performance on the anomaly detection task (S vs. B), density estimation (ANODE), and random guessing (Random) for reference. As can be seen from the figure, CWoLa is nearly random when trying to classify SR from SB; this is the expected (and desired) behavior, indicating that the central assumption of CWoLa (that the features $x$ carry no information about $m$ for normal data) is satisfied for this feature set. Maximal performance, i.e. the highest area under the curve (AUC), is of course achieved by supervised training. However, the less-than-supervised approaches still have excellent sensitivity.

In Fig. 5 (bottom) the possible gain in significance achieved by selecting a working point of a given signal efficiency is shown. This gain is calculated as the ratio of the true positive rate (TPR) and the square root of the false positive rate. It is an estimate of the significance assuming uncertainties dominated by Poisson statistics. Most relevant are the maxima of these curves, which reach 7 for ANODE (for a TPR of 0.25) and more than 10 for CWoLa (for a TPR of 0.1). Put differently, by selecting examples based on a working point chosen for ANODE (CWoLa) corresponding to a true positive rate of 0.25 (0.1), the statistical significance of the signal is improved seven-fold (more than ten-fold). The supervised approaches reach an even higher maximum above twenty, but rely on perfect truth labels, which are of course never present in real data.
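The origin of this ratio can be made explicit with a short derivation. The symbols below are introduced here for illustration only: $S$ and $B$ are the numbers of anomalous and normal examples in the SR before any selection, $\varepsilon_S$ and $\varepsilon_B$ are the true and false positive rates of the selection, and $Z$ is the approximate significance in the Poisson-dominated regime.

```latex
\begin{align}
  Z_0 &\approx \frac{S}{\sqrt{B}}, \\
  Z_{\text{cut}} &\approx \frac{\varepsilon_S S}{\sqrt{\varepsilon_B B}}
      = \frac{\varepsilon_S}{\sqrt{\varepsilon_B}} \, Z_0, \\
  \text{SIC} &= \frac{Z_{\text{cut}}}{Z_0}
      = \frac{\varepsilon_S}{\sqrt{\varepsilon_B}}
      = \frac{\text{TPR}}{\sqrt{\text{FPR}}}.
\end{align}
```

As a consistency check on the quoted numbers, a seven-fold improvement at a true positive rate of 0.25 corresponds to a false positive rate of $(0.25/7)^2 \approx 1.3 \times 10^{-3}$, i.e. roughly one normal example in 800 passes the selection.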

Figure 5. ROC curve (left) and significance improvement curve (right) for different anomaly detection algorithms. A detailed explanation of the different lines is provided in the text. Reproduced from Ref. [Nachman and Shih, 2020].

5. Conclusion

Unsupervised group anomaly detection is an active and intensely studied topic in fundamental physics where the goal is to detect faint signals hinting at new forces of Nature. Our contribution outlines the key assumptions of this challenge to facilitate significant contribution from non-domain experts. These assumptions can be applied to other domains as well and yield a potentially novel perspective on a wide range of group anomaly detection problems. This is further aided by a curated and validated challenge dataset following FAIR principles — thereby fulfilling a well documented community need.

We review two well-performing algorithms on the LHCO2020 datasets and show how robust anomaly detection without supervision and without group labels is possible. These results and the implementation are available as a reference for future studies. Both the classification-based CWoLa and the density-based ANODE methods are shown to improve an approximate measure of statistical significance by a factor of five or more. This large effect demonstrates the crucial power of group anomaly detection to increase the sensitivity of fundamental physics experiments, and heralds the promise of these methods for other domains.

Together, this contribution provides a bridge between the fundamental science and machine learning communities and introduces the next big challenge in particle physics (following the discovery of the Higgs boson) to a new audience.

The authors thank the participants of the LHC Olympics for many interesting discussions on using anomaly detection in particle physics. BN and GK are grateful to the NHETC Visitor Program at Rutgers University for the generous support and hospitality during the spring of 2019 where the idea for the LHC Olympics 2020 was conceived. GK acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2121 “Quantum Universe” – 390833306. BN was supported by the Department of Energy, Office of Science under contract number DE-AC02-05CH11231. DS was supported by the DOE under Award Number DOE-SC0010008.

Appendix A Glossary of physics acronyms used in the paper

Acronym   Definition
ANODE     Anomaly Detection with Density Estimation
BSM       Beyond the Standard Model
CWoLa     Classification without Labels
LHC       Large Hadron Collider
LHCO      LHC Olympics
SB        Sideband (region)
SR        Signal or Search Region
SS        Short Sideband
Table 1. A glossary of the physics acronyms used in this work.

References


  • T. Aarrestad et al. (2021) The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider. External Links: 2105.14027 Cited by: footnote 1.
  • K. Agashe, P. Du, S. Hong, and R. Sundrum (2017a) Flavor Universal Resonances and Warped Gravity. JHEP01(2017)016. External Links: 1608.00526 Cited by: §2.4.
  • K. S. Agashe, J. Collins, P. Du, S. Hong, D. Kim, and R. K. Mishra (2017b) LHC Signals from Cascade Decays of Warped Vector Resonances. JHEP05(2017)078. External Links: 1612.00047 Cited by: §2.4.
  • J. J. Aubert et al. (1974) Experimental observation of a heavy particle J. Phys. Rev. Lett. 33, pp. 1404–1406. Cited by: §1.2.
  • J.-E. Augustin et al. (1974) Discovery of a narrow resonance in e+e− annihilation. Phys. Rev. Lett. 33, pp. 1406–1408. Cited by: §1.2.
  • M. Bahr et al. (2008) Herwig++ Physics and Manual. Eur. Phys. J. C 58, pp. 639–707. External Links: 0803.0883 Cited by: §2.3, §2.
  • M. Cacciari, G. P. Salam, and G. Soyez (2008) The anti-kt jet clustering algorithm. JHEP 0804:063,2008. External Links: 0802.1189 Cited by: §2.1.
  • R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: A survey. CoRR abs/1901.03407. External Links: 1901.03407 Cited by: §1.3, §1.
  • R. Chalapathy, E. Toth, and S. Chawla (2019) Group anomaly detection using deep generative models. In ECML PKDD 2018: Machine Learning and Knowledge Discovery in Databases, pp. 173–189. Cited by: §1.3.
  • J. H. Collins, K. Howe, and B. Nachman (2018) Anomaly Detection for Resonant New Physics with Machine Learning. Phys. Rev. Lett. 121 (24), pp. 241803. External Links: 1805.02664 Cited by: §3, §3.
  • J. H. Collins, K. Howe, and B. Nachman (2019) Extending the search for new resonances with machine learning. Phys. Rev. D 99 (1), pp. 014038. External Links: 1902.02634 Cited by: §3, §3.
  • J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (2014) DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP02(2014)057. External Links: 1307.6346 Cited by: §2.
  • B. M. Dillon, D. A. Faroughy, J. F. Kamenik, and M. Szewc (2020) Learning the latent structure of collider events. JHEP10(2020)206. External Links: 2005.12319 Cited by: §1.3.
  • B. M. Dillon, D. A. Faroughy, and J. F. Kamenik (2019) Uncovering latent jet substructure. Phys. Rev. D 100 (5), pp. 056002. External Links: 1904.04200 Cited by: §1.3.
  • A. Divekar, M. Parekh, V. Savla, R. Mishra, and M. Shirole (2018) Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives. 2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS). Cited by: §1.3.
  • F. E. Grubbs (1969) Procedures for detecting outlying observations in samples. Technometrics 11 (1), pp. 1–21. Cited by: §1.
  • J. Guevara, S. Canu, and R. Hirata (2015) Support measure data description for group anomaly detection. In ODDx3 Workshop on Outlier Definition, Detection, and Description at the 21st ACM SIGKDD International Conference On Knowledge Discovery And Data Mining (KDD2015). Cited by: §1.3, §1.3.
  • J. Kim, K. Kong, B. Nachman, and D. Whiteson (2020) The motivation and status of two-body resonance decays after the LHC Run 2 and beyond. JHEP04(2020)030. External Links: 1907.06659 Cited by: footnote 3.
  • G. Kasieczka, B. Nachman, D. Shih (editors) et al. (2020) The LHC Olympics 2020: A Community Challenge for Anomaly Detection in High Energy Physics. External Links: 2101.08320 Cited by: §1.3, §1, §2.5, §4.
  • G. Kasieczka, B. Nachman, and D. Shih (2019) Official Datasets for LHC Olympics 2020 Anomaly Detection Challenge. Zenodo. Cited by: §2.
  • S. Koranne (2011) Hierarchical data format 5: HDF5. In Handbook of Open Source Tools, pp. 191–200. Cited by: §2.
  • D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim (2019) A survey of deep learning-based network anomaly detection. Cluster Computing, pp. 1–13. Cited by: §1.
  • C. Le Lan and L. Dinh (2020) Perfect density models cannot guarantee anomaly detection. In “I Can’t Believe It’s Not Better!” NeurIPS 2020 workshop. Cited by: §1.3.
  • W. McKinney (2010) Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pp. 51–56. Cited by: §2.
  • E. M. Metodiev, B. Nachman, and J. Thaler (2017) Classification without labels: Learning from mixed samples in high energy physics. JHEP10(2017)174. External Links: 1708.02949 Cited by: §3, §3.
  • K. Muandet and B. Schölkopf (2013) One-class support measure machines for group anomaly detection. In 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pp. 449–458. Cited by: §1.3, §1.3, §1.
  • B. Nachman and D. Shih (2020) Anomaly Detection with Density Estimation. Phys. Rev. D 101, pp. 075042. External Links: 2001.04990 Cited by: §3, §3, Figure 5.
  • J. Neyman and E. S. Pearson (1933) On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A 231, pp. 289. Cited by: §3.
  • G. Pang, C. Shen, L. Cao, and A. van den Hengel (2020) Deep learning for anomaly detection: A review. CoRR abs/2007.02500. External Links: 2007.02500 Cited by: §1.3.
  • G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30, pp. 2338–2347. Cited by: §3.
  • S. Rayana (2016) ODDS library. Stony Brook University, Department of Computer Sciences. Cited by: §1.3.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. External Links: 1906.02845 Cited by: §1.3.
  • D. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, pp. 1530–1538. Cited by: §3.
  • R. T. Schirrmeister, Y. Zhou, T. Ball, and D. Zhang (2020) Understanding anomaly detection with deep invertible networks through hierarchies of distributions and features. External Links: 2006.10848 Cited by: §1.3.
  • C. Scott, G. Blanchard, and G. Handy (2013) Classification with asymmetric label noise: consistency and maximal denoising. In Conference on learning theory, pp. 489–511. Cited by: §3.
  • J. Serrà, D. Álvarez, V. Gómez, O. Slizovskaia, J. F. Núñez, and J. Luque (2020) Input complexity and out-of-distribution detection with likelihood-based generative models. External Links: 1909.11480 Cited by: §1.3.
  • T. Sjostrand, S. Mrenna, and P. Z. Skands (2008) A Brief Introduction to PYTHIA 8.1. Comput. Phys. Commun. 178, pp. 852–867. External Links: 0710.3820 Cited by: §2.
  • M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani (2009) A detailed analysis of the KDD CUP 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 1–6. Cited by: §1.3.
  • J. Thaler and K. Van Tilburg (2011) Identifying Boosted Objects with N-subjettiness. JHEP03(2011)015. External Links: 1011.2268 Cited by: §2.1.
  • J. Thaler and K. Van Tilburg (2012) Maximizing Boosted Top Identification by Minimizing N-subjettiness. JHEP02(2012)093. External Links: 1108.2701 Cited by: §2.1.
  • E. Toth and S. Chawla (2018) Group deviation detection methods: a survey. ACM Computing Surveys (CSUR) 51 (4), pp. 1–38. Cited by: §1.3, §1.3.
  • L. Xiong, B. Póczos, J. Schneider, A. Connolly, and J. VanderPlas (2011a) Hierarchical probabilistic models for group anomaly detection. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 789–797. Cited by: §1.3, §1.3.
  • L. Xiong, B. Póczos, and J. Schneider (2011b) Group anomaly detection using flexible genre models. Advances in neural information processing systems 24, pp. 1071–1079. Cited by: §1.3, §1.3.
  • R. Yu, X. He, and Y. Liu (2015) GLAD: group anomaly detection in social media analysis. ACM Trans. Knowl. Discov. Data 10 (2). Cited by: §1.3.