
Challenges for Unsupervised Anomaly Detection in Particle Physics

by   Katherine Fraser, et al.
Harvard University

Anomaly detection relies on designing a score to determine whether a particular event is uncharacteristic of a given background distribution. One way to define a score is to use autoencoders, which rely on the ability to reconstruct certain types of data (background) but not others (signals). In this paper, we study some challenges associated with variational autoencoders, such as the dependence on hyperparameters and the metric used, in the context of anomalous signal (top and W) jets in a QCD background. We find that the hyperparameter choices strongly affect the network performance and that the optimal parameters for one signal are non-optimal for another. In exploring the networks, we uncover a connection between the latent space of a variational autoencoder trained using mean-squared-error and the optimal transport distances within the dataset. We then show that optimal transport distances to representative events in the background dataset can be used directly for anomaly detection, with performance comparable to the autoencoders. Whether using autoencoders or optimal transport distances for anomaly detection, we find that the choices that best represent the background are not necessarily best for signal identification. These challenges with unsupervised anomaly detection bolster the case for additional exploration of semi-supervised or alternative approaches.



1 Introduction

While many searches for physics beyond the Standard Model have been carried out at the Large Hadron Collider, new physics remains elusive. This may be due to a lack of new physics in the data, but it could also be due to us looking in the wrong place. Trying to design searches that are more robust to unexpected new physics has inspired a lot of work on anomaly detection using unsupervised methods, including community-wide challenges such as the LHC Olympics Kasieczka and others (2021) and the Dark Machines Anomaly Score Challenge Aarrestad and others (2021). The goal of anomaly detection is to search for events which are “different” from what is expected. When used for anomaly detection, unsupervised methods attempt to characterize the space of background events in some way, independent of signal. The hope is then that signal events will stand out as being uncharacteristic.

Anomaly detection techniques can be broadly split into two categories. For some signals, the signal events look similar to background events and one must exploit information about the expected probability distribution of the background to find the signal. Many anomaly detection techniques have been developed to find signals of this type Collins et al. (2018); D’Agnolo and Wulzer (2019); De Simone and Jacques (2019); Casa and Menardi (2018); Collins et al. (2019); Dillon et al. (2019); Mullin et al. (2021); D’Agnolo et al. (2021); Nachman and Shih (2020); Andreassen et al. (2020); Aad and others (2020); Dillon et al. (2020); Benkendorfer et al. (2021); Mikuni and Canelli (2021); Stein et al. (2020); Kasieczka and others (2021); Batson et al. (2021); Blance and Spannowsky (2021); Bortolato et al. (2021); Collins et al. (2021); Dorigo et al. (2021); Volkovich et al. (2021); Hallin et al. (2021). Alternatively, some signals are qualitatively different from background and then methods that try to characterize an individual event as anomalous can be used Aguilar-Saavedra et al. (2017); Hajer et al. (2020); Heimel et al. (2019); Farina et al. (2020); Cerri et al. (2019); Roy and Vijay (2019); Blance et al. (2019); Romão Crispim et al. (2020); Amram and Suarez (2021); Crispim Romão et al. (2021a); Knapp et al. (2021); Crispim Romão et al. (2021c); Cheng et al. (2020); Khosa and Sanz (2020); Thaprasop et al. (2021); Aguilar-Saavedra et al. (2021); Pol et al. (2020); van Beekveld et al. (2021); Park et al. (2020); Chakravarti et al. (2021); Faroughy (2021); Finke et al. (2021); Atkinson et al. (2021); Dillon et al. (2021); Kahn et al. (2021); Aarrestad and others (2021); Caron et al. (2021); Govorkova et al. (2021); Gonski et al. (2021); Govorkova and others (2021); Ostdiek (2021).

Here, we restrict to the latter type of anomaly detection, where an anomaly score for individual events can be used for discrimination, without needing to characterize the full probability distribution of the signal ensemble. With an effective method, events with a small score are likely to be a part of the background distribution, while events with a larger score are not. There are many different ways of defining an anomaly score. Some rely on traditional high-level observables, like mass or $N$-subjettiness. Others attempt to directly learn how likely a given event or object is using low-level information, like individual particle momenta (see e.g., ref. Caron et al. (2021)). Some methods that search for outliers rely on abstract representations to try to characterize the event space, such as the latent space of an autoencoder Dillon et al. (2021); Bortolato et al. (2021). Others give the event space itself a geometric interpretation in terms of distances Komiske et al. (2019, 2020); Cai et al. (2020). Given the complexity and high-dimensionality of data at the LHC, many anomaly detection techniques employ machine learning.

In this paper, we begin by exploring the use of autoencoders for anomaly detection. Autoencoders were initially introduced for dimensionality reduction, similar to principal component analysis, to learn the important information in data while ignoring insignificant information and noise Kramer (1991). Autoencoders contain an encoder, which reduces the dimensionality of the input to give some latent representation, and a decoder, which transforms the latent space back to the original space. In particle physics, autoencoders were first used for anomaly detection in refs. Farina et al. (2020); Heimel et al. (2019), where they are meant to reconstruct certain types of data (background) but not others (signals). In order to work as an anomaly detector, an autoencoder should have a small reconstruction error for background events and a large reconstruction error for signal events. To do so, the autoencoder must establish a delicate balance in achieving a reconstruction fidelity which is accurate, but not too accurate. There are several cases where this is especially difficult, such as when the signal-to-background ratio is small, when the dataset has certain topological properties Batson et al. (2021), or when innate characteristics of the samples make the signal sample simpler than the background sample to reconstruct Dillon et al. (2021); Finke et al. (2021).

Variational autoencoders (VAEs), a generalization of autoencoders, were introduced in ref. Kingma and Welling (2013). Unlike an ordinary autoencoder, where each input is mapped to an arbitrary point in the latent space, in a VAE each input is mapped to a probability distribution in the latent space, which is sampled and mapped back to the original space by the decoder. In addition to the usual reconstruction error, the VAE loss also includes a Kullback-Leibler (KL) divergence component that pushes the latent space towards a Gaussian prior and regularizes network training. The latent space of the VAE encodes the probability distribution of the background training sample, which can be used in the anomaly score. VAEs were first used in anomaly detection in computer science in ref. An and Cho (2015), and first used for particle physics anomaly detection in ref. Cerri et al. (2019). They have been widely studied since then Carrazza and Dreyer (2019); Cheng et al. (2020); Dohi (2020); Pol et al. (2020); van Beekveld et al. (2021); Park et al. (2020); Bortolato et al. (2021); Dillon et al. (2021); Govorkova and others (2021); Ostdiek (2021).

The task of an autoencoder, variational or not, for unsupervised anomaly detection is to provide a strong universal signal/background discriminant for a variety of signals having access only to background for training. In principle, this approach is advantageous because it opens the possibility to bypass Monte Carlo simulations and work directly with experimental data, which is almost completely background. The autoencoder paradigm is based on the vision that there is a trade-off between efficacy and generality: the ideal discriminant for a given signal and given background would be ineffective for a different signal and different background, while a general discriminant, like the autoencoder, would work decently on a broad class of signals and backgrounds. This vision assumes, first, that such a general discriminant exists with an appropriate use case, and second, that it can be found by training purely on one or more background samples without any direct information about the signal. However, one has reason to be suspicious: machine learning methods work well at optimizing a given loss, which is meant to correlate strongly with the problem one is trying to solve. For autoencoder anomaly detection, the optimization (background only) is not aligned with the ultimate problem of interest (signal discovery over background), so it should not be surprising if the autoencoder does poorly. In section 4, we explore the challenges induced by trying to optimize a VAE in a model agnostic way.

In order to understand what a VAE is learning, we study its latent space. In particular, we look at the distance between events in the VAE latent space (see Dillon et al. (2021); Collins (2021) for other studies of VAE latent spaces in particle physics). Since we can think of the VAE anomaly score as a “distance” encoding how far any given event is from the background distribution, it is also natural to ask about the distances between individual events. We find there is a significant correlation between the Euclidean distance between events represented in the VAE latent space and the Wasserstein optimal transport distance between events represented as images. We study Wasserstein distances in particular because they were physically motivated in refs. Komiske et al. (2019, 2020); Cai et al. (2020).

The correlation we observe between distances in the VAE latent space and between the event images motivates us to explore using optimal transport distances between events to define an anomaly score in section 5. One method for using distances directly is to identify representative events in the background sample, and use an event-to-event distance between a given event and the representative event as the score. The advantages of this method we propose are that it does not require training a neural network and that it is easily adaptable to different background samples.

This paper is organized as follows. In section 2, we provide information about the dataset used in our study. In section 3, we provide relevant background information on the metrics used (section 3.1), and the details of the VAE architecture (section 3.2). In section 4, we explore the effectiveness of an image based convolutional autoencoder for anomaly detection, including its sensitivity to hyperparameters. We also explore correlations between Euclidean distances in the autoencoder’s latent space and optimal-transport distances among the event images in section 4.2. This motivates the development of methods that directly use the optimal transport distances among events as an alternative to VAEs in section 5. We conclude in section 6.

2 Datasets

We begin by describing the datasets we use for our analysis. For concreteness, we focus on anomaly detection in simulated jet events at the LHC. We will consider QCD dijet events as the background, and consider both top and $W$ jets as representatives of anomalous signal events. The authors of ref. Cheng et al. (2020) have provided a suite of jets for Standard Model and Beyond the Standard Model particle resonances which are available on Zenodo Cheng (2021). A sample of QCD dijet background events is also provided on Zenodo using the same selection criteria, showering, and detector simulation parameters Leissner-Martin et al. (2020). The datasets were generated with MadGraph Alwall et al. (2014) and Pythia8 Sjöstrand et al. (2015) and used Delphes de Favereau et al. (2014) for fast detector simulation. Jets were clustered using FastJet Cacciari et al. (2012); Cacciari and Salam (2006) using the anti-$k_T$ algorithm Cacciari et al. (2008) with a cone size of $R = $ […]. The event selection requires two hard jets, with the leading jet having $p_T >$ […] GeV and the sub-leading jet having $p_T >$ […] GeV. The QCD jets are created using the process […] in MadGraph, while the top and $W$ jets we examine are produced through a […] which decays to […] or a […] which decays to […], respectively. There are around 700,000 QCD dijet events and 100,000 events for the “anomalous” top and $W$ events. We reserve 100,000 QCD events for testing and use 50,000 QCD events for validation when training the VAE.

The leading jet in each event is used for the analysis. We pre-process the raw four-vectors into an image following the procedure presented in ref. Macaluso and Shih (2018). Using the EnergyFlow package, we boost and rotate the jet along the beam direction so that the weighted centroid is located at $(\eta, \phi) = (0, 0)$. Next, the jet is rotated about the centroid such that the weighted major principal axis is vertical. After this, the jet is flipped along both the horizontal and vertical axes so that the maximum intensity is in the upper right quadrant. Only after the centering, rotations, and flipping do we pixelate the data Macaluso and Shih (2018). We use $40 \times 40$ pixel images covering a range of […]. The final step of the pre-processing is to divide by the total $p_T$ in the image. Note that we do not standardize each pixel by, e.g., subtracting the mean and dividing by the standard deviation for the entire training dataset, because optimal transport requires non-negative values in every pixel. It is important to note that the individual images are very sparse and do not resemble the average of the dataset. For instance, out of the 1600 pixels, only […], […], and […] pixels account for more than […] of the total $p_T$ of the image for the QCD, top, and $W$ jets, respectively.
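The pre-processing chain above can be sketched in a few lines of NumPy. This is only an illustrative stand-in for the EnergyFlow-based procedure, not the code used for the analysis: the function name `jet_to_image` and the `half_width` image range are hypothetical, the boost along the beam direction is replaced by a simple shift of the centroid, and the flip step uses the total $p_T$ in each half-plane as a proxy for the location of the maximum intensity.

```python
import numpy as np

def jet_to_image(pt, eta, phi, n_pix=40, half_width=1.0):
    """Center, rotate, flip, pixelate, and normalize one jet.

    n_pix=40 matches the 1600-pixel images in the text; half_width is a
    hypothetical placeholder for the image range.
    """
    pt = np.asarray(pt, dtype=float)
    eta = np.asarray(eta, dtype=float)
    phi = np.asarray(phi, dtype=float)

    # Center: move the pT-weighted centroid to the origin.
    eta = eta - np.average(eta, weights=pt)
    phi = phi - np.average(phi, weights=pt)

    # Rotate: align the pT-weighted major principal axis with the vertical.
    coords = np.vstack([eta, phi])
    cov = (coords * pt) @ coords.T / pt.sum()
    _, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    major = vecs[:, -1]
    angle = np.arctan2(major[0], major[1])  # angle measured from the vertical
    c, s = np.cos(angle), np.sin(angle)
    eta, phi = c * eta - s * phi, s * eta + c * phi

    # Flip: put the half-planes carrying more pT on the positive side.
    if pt[eta < 0].sum() > pt[eta > 0].sum():
        eta = -eta
    if pt[phi < 0].sum() > pt[phi > 0].sum():
        phi = -phi

    # Pixelate only after centering, rotating, and flipping, then normalize.
    edges = np.linspace(-half_width, half_width, n_pix + 1)
    image, _, _ = np.histogram2d(eta, phi, bins=[edges, edges], weights=pt)
    total = image.sum()
    return image / total if total > 0 else image

# Toy jet with 30 constituents (random numbers, not physics simulation).
rng = np.random.default_rng(0)
toy_pt = rng.exponential(10.0, size=30)
toy_eta = rng.normal(0.0, 0.2, size=30)
toy_phi = rng.normal(0.0, 0.2, size=30)
img = jet_to_image(toy_pt, toy_eta, toy_phi)
```

Because the image is normalized to unit total intensity and never standardized pixel by pixel, every entry stays non-negative, which is what the optimal transport distances used later require.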

3 Defining the Anomaly Score

Anomaly detection, in general, requires an anomaly score: we want to determine if an event is anomalous by measuring how far away it is from the elements of the background distribution. This anomaly score can also be thought of as the “distance” between an event and an ensemble. In order to define an event-to-ensemble distance it is helpful first to explore event-to-event distance measures. For instance, given an event-to-event metric, one could compute the distance from an event to some fiducial background event, and use this as a proxy for the event-to-ensemble distance. To understand both types of distances, we need to review the metrics used to define the distance, which we will do in Section 3.1. We can also use an autoencoder to generate an implicit construction of an approximate event-to-ensemble distance, in the form of an anomaly score. We will provide background and discuss the architecture of our autoencoder in section 3.2.

3.1 Metrics

First, we define the metrics that can be used to compute event-to-event distances. One of the simplest event representations is to treat an event as an image, with pixel intensities representing the particles’ energies Cogan et al. (2015).¹ It is also important to comment that our image representation is dependent on its preprocessing. Although recent studies have shown that the processing done to events before anomaly detection is inherently model dependent Finke et al. (2021), we work with the images as described. A simple event-to-event metric, the “mean power error” (MPE), can then be written as:

¹ In principle, it would be interesting to consider the complete set of four-vectors of the particles in an event as a representation, rather than the pixelated image, and define a metric on these. The $p$-Wasserstein distances described later in this section are well-suited for such a representation, but building an autoencoder architecture on the full set of four-vectors is more challenging.


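$$\mathrm{MPE}^{(\alpha)}(\mathcal{I}_1, \mathcal{I}_2) = \frac{1}{N_{\rm pix}} \sum_{i=1}^{N_{\rm pix}} \left| p_{T,i}^{(1)} - p_{T,i}^{(2)} \right|^{\alpha} \tag{1}$$

where $p_{T,i}^{(1)}$ ($p_{T,i}^{(2)}$) is the pixel intensity (transverse momentum) in pixel $i$ of image 1 (2), $N_{\rm pix}$ is the number of pixels, and $\alpha$ is a parameter that governs the relative importance of pixels with high/low intensity differences. This type of metric is often used for doing regression. Frequently, the choice $\alpha = 2$ is made, inspired by the $\chi^2$ statistic, in which case the MPE is known as the mean-squared error (MSE). The mean-absolute error (MAE) is another well-known choice, corresponding to $\alpha = 1$.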

While the MPE makes sense in regression, using it on images does not make much sense from a physics point of view.² For instance, let $\mathcal{I}_1$ be the image of a particle with energy $E$ in a single pixel and $\mathcal{I}_2$ be the image of a particle with the same energy in the neighboring pixel. These events are nearly identical physically, but will have a very large MSE distance. Moreover, we will still get the same MSE distance if we move one of the two pixels much farther away. Physically similar events do not necessarily result in small MSE distances.

² In contrast, if one designs a neural network with higher-level variables as the input data representation, using MPE as the metric is a sensible choice.
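The single-pixel argument can be checked numerically. The sketch below assumes the mean power error averages the pixel-wise differences, raised to a power `alpha`, over all pixels; the helper names `mpe` and `single_pixel_image` are ours and purely illustrative.

```python
import numpy as np

def mpe(img1, img2, alpha=2.0):
    """Mean power error between two equal-shape images.

    alpha=2 gives the MSE and alpha=1 gives the MAE.
    """
    diff = np.abs(np.asarray(img1, dtype=float) - np.asarray(img2, dtype=float))
    return float(np.mean(diff ** alpha))

def single_pixel_image(row, col, n_pix=40):
    """Normalized image with all intensity in one pixel."""
    img = np.zeros((n_pix, n_pix))
    img[row, col] = 1.0
    return img

reference = single_pixel_image(20, 20)
neighbor = single_pixel_image(20, 21)   # particle moved by one pixel
distant = single_pixel_image(20, 35)    # particle moved by fifteen pixels

mse_neighbor = mpe(reference, neighbor)
mse_distant = mpe(reference, distant)
print(mse_neighbor, mse_distant)
```

Both distances come out identical, since only two pixels differ in either comparison: the MSE carries no information about how far the intensity has moved.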

A completely different way to assign distance between two events is to compute the minimum “effort” needed to transform one image into the other, known as the optimal transport distance. There are many possible optimal transport algorithms (see ref. Villani (2009) for a broad review). Finding the minimum effort is an optimization problem: given a cost function $c_{ij}$, where $i$ and $j$ label elements (e.g. pixel labels) of the two events, we optimize over the transport plan, $f_{ij}$. The cost can be thought of as how much work it takes to transport a single unit of intensity a given distance, and the plan describes how much intensity to transport and where to transport it to. In terms of the cost and plan, the total optimal transport cost is then defined as

$$\mathrm{OT}(\mathcal{I}_1, \mathcal{I}_2) = \min_{\{f_{ij}\}} \sum_{i,j} f_{ij}\, c_{ij}. \tag{2}$$

In some cases, the cost function is itself a positive definite distance, in which case $\mathrm{OT}$ is also a distance. One example is the set of $p$-Wasserstein distances:

$$W_p(\mathcal{I}_1, \mathcal{I}_2) = \left( \min_{\{f_{ij}\}} \sum_{i,j} f_{ij}\, d_{ij}^{\,p} \right)^{1/p}. \tag{3}$$

Depending on the problem, the set of $f_{ij}$ may have to satisfy additional constraints.

We define the underlying cost $d_{ij}$ as the Euclidean distance in the plane between pixel $i$ in image $\mathcal{I}_1$ and pixel $j$ in image $\mathcal{I}_2$. The transport plan $f_{ij}$ is defined by the amount of $p_T$ that is moved from pixel $i$ in image $\mathcal{I}_1$ to pixel $j$ in image $\mathcal{I}_2$. The transport plan is constrained such that the amount of $p_T$ moved from a pixel cannot be more than what was there, $\sum_j f_{ij} \le p_{T,i}^{(1)}$. Similarly the amount of $p_T$ moved into a pixel cannot exceed the amount in that pixel in $\mathcal{I}_2$: $\sum_i f_{ij} \le p_{T,j}^{(2)}$. Here, we consider normalized images, preprocessed such that the total intensity summed over all pixels is equal to unity, so that there is no extra cost of creating or destroying $p_T$. In mathematical language, we are considering “balanced optimal transport”.
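For small images, balanced optimal transport can be solved exactly as a linear program. The sketch below is one way to do so with `scipy.optimize.linprog`; it is not the implementation used in this work. It assumes the $p$-Wasserstein distance is the $1/p$ power of the minimal transport cost with ground cost $d^p$, measures $d$ in pixel units rather than physical coordinates, and takes both images to be normalized to unit total intensity. The function name `wasserstein_p` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(img1, img2, p=1.0):
    """p-Wasserstein distance between two normalized images (exact LP)."""
    img1 = np.asarray(img1, dtype=float)
    img2 = np.asarray(img2, dtype=float)

    # Keep only occupied pixels: the images are sparse, so this keeps the LP small.
    src = np.argwhere(img1 > 0)
    dst = np.argwhere(img2 > 0)
    a = img1[img1 > 0]   # intensity to move out of each source pixel
    b = img2[img2 > 0]   # intensity required in each destination pixel

    # Ground cost: Euclidean pixel distance raised to the power p.
    dist = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    cost = dist ** p
    n, m = cost.shape

    # Balanced transport: row sums of the plan equal a, column sums equal b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([a, b])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return float(max(res.fun, 0.0) ** (1.0 / p))

def one_pixel(row, col, n_pix=10):
    img = np.zeros((n_pix, n_pix))
    img[row, col] = 1.0
    return img

w_near = wasserstein_p(one_pixel(2, 2), one_pixel(2, 3))
w_far = wasserstein_p(one_pixel(2, 2), one_pixel(2, 7))
print(w_near, w_far)
```

For two single-pixel images the plan is trivial and the distance is simply the pixel separation, so the distant configuration is five times farther than the neighboring one. The dense constraint matrix grows with the product of the occupied pixel counts, so approximate solvers are the practical choice for large samples.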

In particle physics applications, unbalanced optimal transport with the choice $p = 1$ is commonly referred to as the Energy Mover's Distance (EMD) Komiske et al. (2019, 2020), as it has the interpretation of the work required to rearrange an energy pattern. This interpretation makes the EMD a natural choice for a metric on collider events. This has prompted further work on using the EMD to define event shape observables characterizing the event isotropy Cesarotti and Thaler (2020), which can be useful in searching for signals that are far from QCD-like Cesarotti et al. (2020, 2020). Sometimes, $p = 2$ has been considered Komiske et al. (2020), but the less explored case of $p < 1$ can also be computed. Intuitively, $p < 1$ gives more importance to smaller distances. While the EMD includes an additional term to account for energy differences between jets, in our results, we will restrict to balanced optimal transport, since we normalize the images.

The $p$-Wasserstein optimal transport metrics are more aligned with what one expects for physical events than MPE. For example, two single-particle events where the particles are nearby will have a much smaller $p$-Wasserstein distance than when they are far from each other, in contrast to their MSE distance. However, as QCD prefers small angle radiation, we find that the 1-Wasserstein distance and MSE have mild correlations, as shown in figure 1.

Figure 1: The pairwise (event-to-event) distances between event images for different metrics. The mean squared error is displayed along the $x$-axis and the 1-Wasserstein distance is along the $y$-axis.

We reiterate that both the MPE and the $p$-Wasserstein distances are used to compare the distance between two images (or events). However, for anomaly detection, we want to know how far an event is from the expected distribution. One way to do this is with an autoencoder, which we describe next.

3.2 Autoencoders

A popular method for detecting anomalous data is with a neural-network autoencoder (AE). An autoencoder works by first encoding the data in a lower-dimensional latent space, and then decoding it back to the original higher-dimensional representation. The idea is that data similar to the training sample will be reconstructed well, whereas data that is not similar to the training sample may be reconstructed poorly. The reconstruction fidelity can then be used as an anomaly score. Often the data are represented as images, and the autoencoder uses the MSE metric (eq. (1) with $\alpha = 2$) to compare the input image to the reconstructed image.

In figure 2, we show an example of an autoencoder architecture that we will use. The encoder is made up of some number of downsampling blocks (there are two in the figure, each marked by a dashed blue line). Each block contains two sets of […] convolutional layers with a depth of five filters. The stride and padding are set to keep the image size the same and the ELU activation function is applied after each layer. After the convolutional layers, the data is downsampled through a $2 \times 2$ average pooling layer. After the final downsampling block, the data is flattened and then followed by a dense layer with 100 nodes and an ELU activation. Finally the network is mapped to the latent space through another dense layer. We experiment with one, two, and three downsampling blocks, and use a fixed latent size of 64 dimensions. Our latent space is substantially larger than what is often used; for example, ref. Farina et al. (2020) uses a six dimensional latent representation and ref. Heimel et al. (2019) finds the optimal size to be around 20–34 for their top-tagging data. We found the best performance with a 64 dimensional representation.
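To make the encoder dimensions concrete, the bookkeeping below tracks the tensor shape through the downsampling blocks. It is a sketch under two assumptions that should be checked against the actual network: the input is a $40 \times 40$ single-channel image (1600 pixels), and each pooling step halves the image side, mirroring the decoder's transposed convolution that doubles it. The function name `encoder_shapes` is ours.

```python
def encoder_shapes(n_blocks, n_pix=40, n_filters=5):
    """Shape after each downsampling block and the flattened size.

    Assumes size-preserving convolutions with n_filters output channels,
    followed by pooling that halves each image side.
    """
    size = n_pix
    shapes = []
    for _ in range(n_blocks):
        size //= 2  # convolutions keep the size; pooling halves it
        shapes.append((size, size, n_filters))
    flattened = size * size * n_filters
    return shapes, flattened

for n_blocks in (1, 2, 3):
    shapes, flattened = encoder_shapes(n_blocks)
    # flattened features -> dense(100, ELU) -> dense(latent size 64)
    print(n_blocks, shapes, flattened)
```

Under these assumptions the flattened representation feeding the 100-node dense layer has 2000, 500, or 125 features for one, two, or three blocks, so only the three-block network compresses the image substantially before the dense layers.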

Figure 2: Example architecture of an autoencoder, as used for this study. The autoencoder is made of two networks, the encoder and the decoder, each with one, two, or three down(up)-sampling blocks.

The second part of the AE is the decoder, which maps the latent space back to the space of the input data. In our setup, the decoder is a mirror of the encoder. The first step is a dense layer with 100 nodes and ELU activation. From here, another dense layer is used, where the number of nodes is set to the number of pixels in the final down sampling block. The ELU activation function is used again, and then the data is reshaped into a square array. From here, the same number of upsampling blocks is applied as the number of downsampling blocks. In each upsampling block, the first operation is a 2D transposed convolution which doubles the shape of the image and contains a depth of five filters, followed by the ELU activation. After this, two convolutional layers are used with the ELU activation with the stride and padding set to keep the image size the same. The final convolution operation reduces the depth to one.

During training and inference, the input image is compared with the reconstructed image via some choice of event-to-event metric. A common method is to use the MSE as the loss function, with the aim of reproducing the exact image. However, it is possible to use other metrics for the comparison. Furthermore, the metric used for training does not need to be the same as the metric used for the anomaly score (see for instance ref. Cheng et al. (2020)). We will refer to the difference between the input and reconstructed image as the image distance, also known as the reconstruction error.

A variational autoencoder enhances the basic autoencoder by adding stochasticity to the latent embedding. In a regular autoencoder, which is a deterministic function, very dissimilar events can be placed near each other in the latent space. Distances in the latent space of an ordinary AE therefore do not have a precise meaning. In a VAE, the stochastic element makes the network return a distribution in the latent space for each input event. Since the same input data can be mapped to several nearby points in a VAE, dissimilar events cannot be placed nearby. Returning a distribution in the latent space is therefore essential for making distances in the latent space meaningful. The stochasticity also connects the loss to the statistics method of variational inference Kingma and Welling (2013, 2019), as we summarize in appendix A (see also refs. Blei et al. (2017); Kingma and Welling (2019) for reviews). Specifically, we show that the autoencoder estimates a lower bound on the probability for any given event to be an element of the background sample that the network is trained on.

To implement the stochasticity of a VAE, our networks are trained using the standard reparameterization trick Kingma and Welling (2013); Jimenez Rezende and Mohamed (2015). A single element of the input data now yields a distribution, and these distributions are treated as a set of $n$ independent Gaussian distributions, where $n$ is the dimension of the latent space. The output of the encoder is then doubled: instead of returning a single point in the latent space, it now outputs both the means $\mu_i$ and the variances $\sigma_i^2$ of the distribution in latent space. The loss function for the network also has to be modified: we want the background sample to be well modeled by a set of Gaussian distributions in latent space. This is done by introducing a Kullback-Leibler divergence (KLD) term (see appendix A for details), which is estimated as:³

$$\mathrm{KLD} = \frac{1}{2} \sum_{i=1}^{n} \left( \mu_i^2 + \sigma_i^2 - \ln \sigma_i^2 - 1 \right). \tag{4}$$

³ This estimation assumes Standard Normal priors for the likelihood of the latent data, as described in appendix A. There is a great deal of ongoing research into methods to improve the likelihood estimate by changing the latent space priors or improving the posterior approximations of the encoder Kingma et al. (2016); van den Berg et al. (2018); Dillon et al. (2021); Aarrestad and others (2021).


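This KLD term acts to regularize the autoencoder by pushing the means in the latent space to zero and the variances to one. Depending on the metric used to determine the distance between the original and reconstructed data, more or less regularization may be needed. To account for this, we introduce another hyperparameter $\beta$, and define the loss function as

$$L = (1 - \beta)\, d(\mathcal{I}, \mathcal{I}') + \beta\, \mathrm{KLD}, \tag{5}$$

where $d(\mathcal{I}, \mathcal{I}')$ is the training metric evaluated between the input image $\mathcal{I}$ and its reconstruction $\mathcal{I}'$. We scan over $\beta$, typically finding the best results for small but nonzero $\beta$.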
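The two stochastic ingredients described above, the reparameterization trick and the KLD against a Standard Normal prior, can be written compactly. This NumPy sketch is illustrative only and makes one implementation assumption that is common but not stated in the text: the encoder is taken to output the log-variance rather than the variance. The function names are ours.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps z a differentiable function of the
    encoder outputs (mu, log_var), which is the point of the trick.
    """
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence of N(mu, sigma^2) from N(0, 1), summed over latent dims."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

rng = np.random.default_rng(1)
latent_dim = 64

# An encoder output that already matches the prior costs nothing.
kld_prior = kl_to_standard_normal(np.zeros(latent_dim), np.zeros(latent_dim))

# Shifting every mean by one unit costs 1/2 per latent dimension.
kld_shifted = kl_to_standard_normal(np.ones(latent_dim), np.zeros(latent_dim))

z = reparameterize(np.ones(latent_dim), np.zeros(latent_dim), rng)
print(kld_prior, kld_shifted, z.shape)
```

The penalty vanishes only at zero mean and unit variance, which is why this term regularizes the latent space toward the prior.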

To minimize the loss given in eq. (5), we use the Adam optimizer Kingma and Ba (2014) with the default parameters and an initial learning rate of […]. The training data consists of around 550,000 QCD dijet samples, and we reserve 50,000 QCD events for an independent validation set. After each epoch of training, the loss is evaluated on the validation set. When the loss has not improved on the validation set for five epochs, the learning rate is decreased by a factor of 10, with a minimum learning rate of […]. Training concludes when the validation loss has not improved for 12 epochs. We then restore the weights of the network from the epoch with the best validation loss.
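The learning-rate and stopping rules can be illustrated with a small simulation over a list of validation losses. The patience values (five and 12 epochs) and the factor of 10 come from the text; the initial and minimum learning rates in the sketch are hypothetical placeholders, and resetting the learning-rate counter after each decrease is our assumption about how the plateau rule is applied.

```python
def run_schedule(val_losses, lr0=1e-3, lr_min=1e-6,
                 lr_patience=5, stop_patience=12):
    """Return (best_epoch, per-epoch learning rates) for a loss history.

    lr0 and lr_min are placeholder values, not the ones used in the paper.
    """
    lr = lr0
    best_loss = float("inf")
    best_epoch = -1
    since_best = 0      # epochs since the best validation loss
    since_lr_step = 0   # stagnant epochs since the last learning-rate change
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            since_best = 0
            since_lr_step = 0
        else:
            since_best += 1
            since_lr_step += 1
        if since_lr_step >= lr_patience:
            lr = max(lr / 10.0, lr_min)  # decrease by a factor of 10, with a floor
            since_lr_step = 0
        history.append(lr)
        if since_best >= stop_patience:
            break  # early stopping; weights from best_epoch would be restored
    return best_epoch, history

# Toy history: improvement for three epochs, then a plateau.
losses = [1.0, 0.9, 0.8] + [0.85] * 20
best_epoch, lr_history = run_schedule(losses)
print(best_epoch, len(lr_history), lr_history[-1])
```

In this toy history the learning rate is reduced twice during the plateau before training stops 12 epochs after the best epoch, which is the epoch whose weights would be kept.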

4 Autoencoder results

Here we present the results of our studies of variational autoencoders. We start by studying the metric dependence of VAE performance as anomaly detectors. Then we study the latent space to understand what the VAE is learning.

4.1 Autoencoder performance

Now we study the performance of variational autoencoders as anomaly detectors using different metrics. Anomaly detection with an autoencoder requires two metric choices. First, one must choose a training metric, used for computing the image distance during training. Next, one must choose an anomaly metric to compute an anomaly score which determines how similar an event is to the training sample. The training metric and anomaly metric can be the same, but do not have to be.

For the training metric, we consider MSE-type metrics (the MSE and MAE) and $p$-Wasserstein metrics. Using a $p$-Wasserstein metric in the loss function to train an autoencoder is not standard, and requires a little bit of extra engineering.⁴ The challenge is that the optimal-transport metrics are not well-suited for the back-propagation part of the training procedure of a neural network. To get around this, we used the Sinkhorn approximation within the GeomLoss package Feydy et al. (2019). Even with this, training was slow and sometimes timed out after three days of training on GPU. In contrast, the MSE and MAE networks typically completed training in around 12 hours on the same platform.

⁴ Ref. Collins (2021) also implements a VAE trained with a $p$-Wasserstein metric.

For the anomaly metric, we consider either using the full loss (including both the training metric contribution and the KL-divergence part in the variational autoencoder), just the MSE between the input and output images, the MAE, or the $p$-Wasserstein distances with $p = 0.5$, 1.0, and 2.0. The value of each of these is computed for the test samples for the QCD dijet events, the top-jet events, and the $W$-jet events.

To evaluate performance in anomaly detection, we train the autoencoder on a QCD background using the training metric. Then we evaluate the anomaly score using the anomaly metric for a boosted top jet signal sample and a boosted $W$-jet signal sample. For a figure of merit of performance we use the Area Under the receiver operating characteristic Curve (AUC).
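As a reminder of what this figure of merit measures, the AUC equals the probability that a randomly chosen signal event receives a larger anomaly score than a randomly chosen background event. The sketch below computes it directly from two arrays of scores; the function name is ours, and the pairwise comparison is chosen for clarity rather than efficiency.

```python
import numpy as np

def auc_from_scores(background_scores, signal_scores):
    """AUC as P(signal score > background score), with ties counted as 1/2."""
    bkg = np.asarray(background_scores, dtype=float)
    sig = np.asarray(signal_scores, dtype=float)
    greater = (sig[:, None] > bkg[None, :]).sum()
    ties = (sig[:, None] == bkg[None, :]).sum()
    return float((greater + 0.5 * ties) / (sig.size * bkg.size))

# Toy anomaly scores (not results from the paper).
background = [0.1, 0.2, 0.3, 0.4]
separated_signal = [0.5, 0.6]   # all above background
inverted_signal = [0.0, 0.05]   # all below background

auc_good = auc_from_scores(background, separated_signal)
auc_bad = auc_from_scores(background, inverted_signal)
auc_same = auc_from_scores(background, background)
print(auc_good, auc_bad, auc_same)
```

An AUC of 0.5 corresponds to no discrimination, while a value below 0.5 means the signal sample is, on average, reconstructed better than the background it was supposed to stand out from.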

Training     Down        Anomaly      Top jet   W jet
Metric       Samplings   Metric       AUC       AUC
Supervised   -           -            0.94      0.96

MSE          1           Loss         0.82      0.61
                         MSE          0.82      0.60
                         MAE          0.79      0.48
                         Wass(0.5)    0.82      0.45
                         Wass(1)      0.83      0.41
                         Wass(2)      0.81      0.39
             2           Loss         0.83      0.65
                         MSE          0.83      0.65
                         MAE          0.80      0.53
                         Wass(0.5)    0.82      0.51
                         Wass(1)      0.82      0.51
                         Wass(2)      0.81      0.54
             3           Loss         0.84      0.65
                         MSE          0.84      0.65
                         MAE          0.81      0.53
                         Wass(0.5)    0.83      0.52
                         Wass(1)      0.84      0.52
                         Wass(2)      0.82      0.54

Wass(1)      1           Loss         0.78      0.44
                         MSE          0.71      0.57
                         MAE          0.72      0.49
                         Wass(0.5)    0.75      0.47
                         Wass(1)      0.78      0.44
                         Wass(2)      0.76      0.39
             2           Loss         0.79      0.46
                         MSE          0.76      0.61
                         MAE          0.75      0.52
                         Wass(0.5)    0.77      0.49
                         Wass(1)      0.79      0.46
                         Wass(2)      0.77      0.40
             3           Loss         0.79      0.41
                         MSE          0.79      0.60
                         MAE          0.77      0.51
                         Wass(0.5)    0.79      0.47
                         Wass(1)      0.79      0.41
                         Wass(2)      0.72      0.36

Table 1: Results showing the ability of a VAE trained on QCD only samples to distinguish top and $W$ jets as different from QCD. The Training Metric column shows which distance metric is used in the loss function for training, and the Anomaly Metric column shows the distance metric used at inference time. The bold blue entries mark the highest AUCs overall. We indicate the $p$-Wasserstein distance metric as Wass($p$), and the MPE with power $\alpha = 1, 2$ by MAE and MSE, respectively.

Results are shown in table 1 for the training metric choices MSE and Wass(1) and for different numbers of downsampling blocks in the network. For each number of down samplings, we trained the network with different values of the VAE parameter $\beta$, and in the table we present the results for the value of $\beta$ that achieved the best loss on the validation data. This loss-minimizing value was determined separately for the one, two, and three down-sample-block networks and for each training metric. The MSE-trained results are in the upper part of the table and the Wass(1)-trained results are in the lower part. The entries highlighted in blue indicate the configuration with the best AUC for top jets and $W$ jets across all of our considered architectures, training methods, and anomaly score methods. The top row in the table shows the AUC numbers (in red) from a supervised approach, for comparison (see appendix B for details of the supervised algorithm).

In general, we find that the networks trained with MSE as the training metric and using the full loss as the anomaly metric have the best performance. The exception is when only a single down-sample layer is used, in which case using Wass(1) as the anomaly metric does slightly better for the top-jet signal than using the full loss. When Wass(1) is used as the training metric, the full loss and Wass(1) anomaly metrics give the best top-jet AUCs, whereas the MSE anomaly metric gives the best $W$-jet AUCs.

We can see at this stage the proliferation of choices one has to make when deciding what architecture, training metric, and anomaly metric to use. Making these choices is especially hard if one wants to remain model agnostic. For instance, figure 3 shows the results of the network trained with MSE as the reconstruction loss. The left panel contains the loss on the QCD validation events. If minimizing the loss yields a better estimate of the probability of an event, one would expect that the network configuration (number of down samplings and value of $\beta$) which minimizes the loss will have learned the QCD distribution the best. However, the next two panels show the ability of the networks to distinguish top and $W$ jets from the QCD background. In particular, we see that the value of $\beta$ which minimizes any of the loss curves does not yield the best signal separation. We also point out that the network with a single down-sample block has the lowest loss, but is consistently the worst anomaly detector. This figure also highlights the danger of using the AUC of a particular signal to choose the hyperparameters of a universal anomaly detector. Examining only performance on the $W$ jets, it would be tempting to pick either the two or three down-sample networks at the value of $\beta$ that gives the best AUC for the $W$s. However, these particular networks have the worst score for the top jets. This is the challenge of signal-independent searches: without a signal model in mind, optimizing analysis strategies is hard to do in a principled manner.

Figure 3: Results from scanning over $\beta$. The value of $\beta$ which minimizes the validation loss does not yield the highest AUCs for either the top or $W$ samples. If one were to use one of the signal samples to choose the value of $\beta$, it could lead to worse results on the other signal.

The network trained with MSE and a small KL-divergence term yields the best anomaly detection performance. Therefore, we expect that it is learning a good representation of the underlying background distribution. We next explore this hypothesis by examining event-to-event distances among different metrics.

4.2 What has the VAE learned?

In order for a variational autoencoder to be able to judge how likely an event is to be in a particular sample, it must have a representation of the probability distribution of events in that sample. Moreover, since it first maps events to a lower-dimensional latent space, the information about the relative likelihood should be encoded in the latent space in some way. It would make sense if the network places similar events nearby in the latent space, and dissimilar events far apart. In this section we attempt to quantify if this is indeed true by comparing to the more physical Wasserstein distance.

Figure 4: Each panel shows correlations between pair-wise distances of events in the QCD test set. One axis always denotes the Wasserstein distance between the events; the other denotes the Euclidean distance in the latent space. The representation learned by the network is more correlated with the Wasserstein distances than the MSE distances (see figure 1).

Since each input is mapped to a (Gaussian) distribution in the latent space, we use the Euclidean distance between the means of these distributions, which is a simple measure of the distance in latent space. (Footnote 5: One could also try to take into account the variance of the distributions by, e.g., taking the KL divergence between the two distributions.) In figure 4, we show the correlations between the Euclidean latent space distance and the $p$-Wasserstein event distance among all the pairings of 1000 events in the QCD test set for various values of the VAE parameter $\beta$. For this study, the events are passed through the encoder part of a VAE with three down sampling layers, down to a 64 dimensional latent space where the Euclidean distance is computed. As the value of $\beta$ is increased, the network goes from having little regularization to being forced to approach a Gaussian. Correspondingly, the correlation initially grows as structure is forced upon the latent representation, and then decreases as $\beta$ becomes so large that the regularization dominates and the distribution becomes nearly Gaussian. We observe similar results for the networks with one and two downsampling layers that are trained with MSE in the loss function. In this figure, the value of $\beta$ which gives the minimum loss corresponds to the $\beta$ with maximum correlation, but we do not find this trend to hold in general. It seems that the VAE with an intermediate value of $\beta$ that balances the MSE and KLD terms in the loss function creates a latent space where distances between events are correlated with the Wasserstein distance in the image space.
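The comparison described here reduces to correlating two sets of pairwise distances. The sketch below shows one way to do that with NumPy; the random arrays merely stand in for encoder means and image-space distances, and the helper names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(z):
    """Matrix of Euclidean distances between all rows of z."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pairwise_distance_pearson(d_a, d_b):
    """Pearson correlation between two distance matrices, using each pair once."""
    upper = np.triu_indices_from(d_a, k=1)    # skip the diagonal and duplicates
    return float(np.corrcoef(d_a[upper], d_b[upper])[0, 1])

rng = np.random.default_rng(0)
latent_means = rng.normal(size=(50, 4))       # stand-in for encoder means

d_latent = pairwise_euclidean(latent_means)
d_related = pairwise_euclidean(3.0 * latent_means)          # perfectly related distances
d_unrelated = pairwise_euclidean(rng.normal(size=(50, 4)))  # independent distances

corr_related = pairwise_distance_pearson(d_latent, d_related)
corr_unrelated = pairwise_distance_pearson(d_latent, d_unrelated)
print(corr_related, corr_unrelated)
```

In the actual study one of the two matrices would hold Wasserstein distances between the original images, computed with an optimal-transport solver rather than the placeholder used here.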

The downsampling operations are critical to the production of the latent space. As they combine information from neighboring pixels, they introduce an element of scale which MPE would not exhibit. To verify the importance of downsampling, we show in figure 5 the pair-wise event distance correlations for the same network at different depths into the encoder. In the first panel, the network-side distances are computed after the first downsampling block, where each event is represented as a tensor and the distance runs over all 2000 “pixels” (see figure 2). The correlation between the distance in this first downsampled layer and the Wasserstein distance of the events is much larger than the correlation obtained with the MSE distance between the original events. The correlation further increases from the first down-sample block to the second, and then decreases after a third downsampling. When the information is further reduced to the latent space, we get smaller correlations than seen in the early stages of the network.

Figure 5: Correlations of the pair-wise event distance in the image space with the interior activations of the network after different number of down sample blocks.

Although the EMD metric is a $p$-Wasserstein metric with $p = 1$, there is no clear reason why $p = 1$ should be preferred to other values. So, next we consider a different value of $p$. In figure 6, we show the correlations for the same network with three down sampling layers, but now with the image distances computed using this different value of $p$. The distances between events at different layers in the network are more correlated with than .

Figure 6: Correlations of the pair-wise event distance in the image space with the interior activations of the network after different numbers of down sample blocks. This is the same network as in figure 5, but now the image distances use a different power of . The correlation is higher with than .

In this section, we have explored the representation of the QCD event distribution that variational autoencoders learn. Our conclusion from this study is that the Euclidean distances between QCD events in the latent space are highly correlated with the $p$-Wasserstein distances between the events themselves. This is particularly compelling because the VAE is trained with the MSE metric for its loss function and has no direct access to any $p$-Wasserstein metric. A related question is how the correlations would look if a $p$-Wasserstein metric were used for training. In that case, we find that the Euclidean distances between events in the latent space of the Wasserstein-trained networks are even more correlated with the Wasserstein distances in the image space. Thus, it could be argued that the Wasserstein-trained networks learn an even better representation of the QCD distribution than the MSE-trained networks. However, the VAE with MSE training worked better for finding the top- and $W$-jet signals than those trained with a Wasserstein loss. The fact that the method with the “best” latent representation does not yield the best signal separation highlights the challenges of model-agnostic anomaly detection.

5 Event-to-Ensemble Distance

In the previous section we showed that VAEs tend to work better when MSE loss is used for training than when Wasserstein metrics are used for training, and that the Euclidean distance in the latent space correlates strongly with the Wasserstein metric on the data, regardless of the metric used for training. Thus, there is a sense in which VAEs are learning the Wasserstein metrics. If the power of the VAE for anomaly detection in physical problems stems from it implicitly learning aspects of the $p$-Wasserstein metrics, we can then ask if there may be a way to use these metrics more directly for anomaly detection, sidestepping the VAE entirely. One way to do this is to use the metric to compute an event-to-ensemble distance, as we explore in this section.

We would like to use the $p$-Wasserstein distance, or another metric, to characterize the distance of an event to the background ensemble. There are already several options in the literature for using Wasserstein distances to characterize different types of events, such as $k$-nearest neighbor (kNN) classifiers Komiske et al. (2019), “linearized” optimal transport Cai et al. (2020), where all the events are compared to a single reference event and this is used to define a new distance, and event isotropy, which compares a given event to an isotropic configuration Cesarotti and Thaler (2020). Our goal, using a method like these, is to extract from the background ensemble one or more representative images and to compute the distance of a given signal or background event to those images. This direct event-to-ensemble distance measure can then be compared to the VAE anomaly score, which is also effectively an event-to-ensemble distance.

To compute the direct event-to-ensemble distance we need an algorithm to select or construct fiducial events from the ensemble and a metric with which to compute the distance. As with the VAE architecture, there may be no choice that is optimal for all signals. In choosing the fiducial events, we must decide which sample to choose events from, how to select those events, how many events to use, how to represent the fiducial events (e.g. as images), and how to combine the distances to the different events. To make a fair comparison to the VAE approach, we would like our algorithm for generating fiducial events to depend only on the background sample, independent of what anomalous signal we might search for. Thus we choose the QCD jet event ensemble as our reference sample. To select events from the sample, the simplest possibility is to arbitrarily select some number of random images. However, despite occasionally giving a large AUC for classification, results with random images are very sensitive to fluctuations between the random images. A second possibility that may seem sensible is to take the average of all events in the sample. A third option, which we find to be the most natural, is to use medoids as we now explain.

With a given metric, which we call the medoid metric, we can compute the pairwise distance between any two events in the ensemble. Then for each event we can sum the distances to all other events. The medoid of the ensemble is the event that minimizes this sum. $k$-medoids generalizes this to finding the $k$ events for which the sum of the distances of each event to the closest of the $k$ medoids is minimized. Thus the ensemble fragments into a set of $k$ clusters, with each cluster closest to one of the medoids. $k$-medoids clustering is similar to $k$-means clustering when the medoid metric is chosen to be the Euclidean metric, except that $k$-medoids clustering requires each medoid to be one of the events in the set. Medoids have previously been explored in other contexts in refs. Komiske et al. (2019, 2020); Crispim Romão et al. (2021b). To use medoids, we need to choose a value for $k$ and a medoid metric. Then it is natural to take for the event-to-ensemble distance the distance of an event to its closest medoid. Although one could in principle use a different metric to compute the event-to-ensemble distance, it is most natural to use the same medoid metric that determines the medoids.
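The medoid construction can be sketched with a precomputed distance matrix. The code below uses a simple alternating heuristic (assign events to the nearest medoid, then re-pick the medoid of each cluster); this is an assumption for illustration, not necessarily the algorithm used for the results in this paper, and the one-dimensional "events" and function names are made up.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Heuristic k-medoids on a precomputed (n, n) distance matrix.

    Returns the sorted indices of the k medoid events.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)    # nearest medoid per event
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # The cluster medoid minimizes the summed distance to its members.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.sort(medoids)

def event_to_ensemble(dist_to_medoids, combine="min"):
    """Combine an event's distances to the medoids into one anomaly score."""
    return float(dist_to_medoids.min() if combine == "min" else dist_to_medoids.sum())

# Toy ensemble: two well-separated groups of 1D "events".
events = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dist = np.abs(events[:, None] - events[None, :])
medoid_events = events[k_medoids(dist, k=2)]

new_event = 5.0
d_new = np.abs(new_event - medoid_events)
score_min = event_to_ensemble(d_new, "min")
score_sum = event_to_ensemble(d_new, "sum")
print(medoid_events, score_min, score_sum)
```

Because only a distance matrix is needed, the same code works unchanged whether the medoid metric is a Wasserstein distance, the MAE, or anything else.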

Figure 7: Example of the elbow method. Left shows histograms of the 1-Wasserstein distance to the closest medoid, with colors corresponding to the number of medoids. Right plots the peak of each of these histograms, as a function of the number of medoids.

Choosing the number of medoids $k$ is challenging to do in a signal-independent way. One approach is the elbow method, a common heuristic for determining how many clusters are in a dataset. In our case, to use the elbow method we scan over the number $k$ of medoids, and for each $k$ compute the distances of all the events in our sample to the nearest medoid and histogram the results. There will be a small number of events very close to a medoid and a small number very far from all medoids, so the histograms will have a peak at some finite value of the distance. Moreover, the peak distance will decrease monotonically as the number of medoids increases. In many applications the decrease is rapid for small $k$, but at some $k$ it abruptly stops decreasing rapidly and starts decreasing slowly. Thus the peak distance as a function of $k$ often has an elbow shape. To determine the elbow location algorithmically we perform a linear regression to an elbow function (two straight lines), and take the first integer value after the elbow as the suggested value of $k$. The result can be seen in figure 7. The idea behind the elbow method is that increasing $k$ past the location of the elbow should not give much improvement. Moreover, in the case of anomaly detection, if we have too many medoids, we can get one medoid that looks “background-like” rather than “signal-like”. We find that typically 2-4 medoids are selected according to this elbow method.
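One possible reading of the two-straight-line fit is sketched below: try every split of the scan into a left and a right segment, fit each with an independent least-squares line, keep the split with the smallest total residual, and locate the elbow where the two lines cross. The independent (unconstrained) fits, the function name, and the synthetic data are assumptions; the fit used for figure 7 may differ in detail.

```python
import numpy as np

def suggest_k_from_elbow(k_values, peak_distance):
    """Fit two straight lines and return (elbow location, suggested integer k)."""
    k = np.asarray(k_values, dtype=float)
    y = np.asarray(peak_distance, dtype=float)
    best = None
    for split in range(2, len(k) - 1):          # at least two points per segment
        left = np.polyfit(k[:split], y[:split], 1)
        right = np.polyfit(k[split:], y[split:], 1)
        resid = np.sum((np.polyval(left, k[:split]) - y[:split]) ** 2)
        resid += np.sum((np.polyval(right, k[split:]) - y[split:]) ** 2)
        if best is None or resid < best[0]:
            best = (resid, left, right)
    _, (m1, c1), (m2, c2) = best
    k_elbow = (c2 - c1) / (m1 - m2)             # crossing point; assumes m1 != m2
    return k_elbow, int(np.floor(k_elbow)) + 1  # first integer after the elbow

# Synthetic peak distances: steep drop for k <= 2, slow decrease afterwards.
k_scan = [1, 2, 3, 4, 5, 6, 7, 8]
peaks = [7.0, 4.0, 1.9, 1.7, 1.5, 1.3, 1.1, 0.9]
k_elbow, k_suggested = suggest_k_from_elbow(k_scan, peaks)
print(k_elbow, k_suggested)
```

Changing the range of `k_scan` changes which split wins, which is one concrete way to see how sensitive the suggested $k$ can be to the scan range.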

The main advantage of the elbow method is that it can be automated and used independent of the sample or the use case. However, there often is not a clear elbow. In figure 7, the elbow is only apparent because we have fit to a piecewise linear function; the data seem to follow more of a power-law behavior. In addition, the location of the elbow can be affected by the maximum number of medoids we include in the fit. Furthermore, the elbow can only be computed once we have already made the arbitrary choice of the medoid metric, and of the metric being used for the comparison between the full sample and the reference sample. Finally, there is no reason to expect that the elbow choice of $k$, which is determined only by the background sample, would be optimal for anomaly detection tasks. Thus, we also consider values of $k$ not determined by the elbow method for this study.

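| Reference | Metric | Number of medoids | Method | Top jet AUC | $W$ jet AUC |
|---|---|---|---|---|---|
| Supervised | - | - | - | 0.94 | 0.96 |
| QCD | Wass(1) | - | Avg | 0.81 | 0.62 |
| | | 1 | Medoid | 0.83 | 0.66 |
| | | 3 (elbow) | Medoids (min) | 0.85 | 0.68 |
| | | 5 | Medoids (min) | 0.87 | 0.60 |
| | | 7 | Medoids (min) | 0.87 | 0.61 |
| | Wass(5) | - | Avg | 0.53 | 0.60 |
| | | 1 | Medoid | 0.68 | 0.36 |
| | | 3 | Medoids (min) | 0.66 | 0.41 |
| | | 4 (elbow) | Medoids (min) | 0.67 | 0.41 |
| | | 5 | Medoids (min) | 0.71 | 0.43 |
| | MAE | - | Avg | 0.83 | 0.71 |
| | | 1 | Medoid | 0.82 | 0.71 |
| | | 3 (elbow) | Medoids (min) | 0.82 | 0.61 |
| | | 5 | Medoids (min) | 0.83 | 0.67 |
| | | 7 | Medoids (min) | 0.83 | 0.65 |
| Top | Wass(1) | - | Avg | 0.69 | 0.69 |
| | | 1 | Medoid | 0.58 | 0.79 |
| | | 3 (elbow) | Medoids (min) | 0.32 | 0.79 |
| | | 5 | Medoids (min) | 0.45 | 0.84 |
| | | 7 | Medoids (min) | 0.49 | 0.83 |
| | Wass(5) | - | Avg | 0.72 | 0.40 |
| | | 1 | Medoid | 0.53 | 0.52 |
| | | 2 (elbow) | Medoids (min) | 0.72 | 0.70 |
| | | 3 | Medoids (min) | 0.66 | 0.61 |
| | | 5 | Medoids (min) | 0.61 | 0.54 |
| | Wass(5) | 3 (elbow) | Medoids (sum) | 0.66 | 0.66 |
| | | 5 | Medoids (sum) | 0.73 | 0.58 |
| | | 7 | Medoids (sum) | 0.75 | 0.60 |
| | MAE | - | Avg | 0.48 | 0.57 |
| | | 1 | Medoid | 0.29 | 0.64 |
| | | 3 (elbow) | Medoids (min) | 0.25 | 0.36 |
| | | 5 | Medoids (min) | 0.32 | 0.58 |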
Table 2: AUC values for QCD vs. signal classification. Top rows use a QCD reference sample, and bottom rows use a top reference sample (assuming $W$ events are more “top-like” than QCD events). When there are multiple medoids, distances are combined either by taking the minimum or the sum of the distances to the $k$ different medoids, as denoted in the table. Medoids are selected with the same metric as is used to compare images. For each metric, we note which number of medoids corresponds to the elbow. The best AUC values with the QCD reference events are denoted in blue. We indicate the $p$-Wasserstein distance metric as Wass($p$).

The top of table 2 shows results for top-jet vs. QCD-jet discrimination and $W$-jet vs. QCD-jet discrimination when QCD jets are used for the reference sample. We show results for different values of $k$ with $k$-medoids, using different medoid metrics. We also show the result from using the distance to a single composite average event determined by averaging each pixel intensity over all events in the reference sample. When we study the elbow for the most common 1-Wasserstein metric, we see reasonable performance for both QCD and top jets, though it is best for neither of them. This is in line with what we expect for unsupervised anomaly detection. For simplicity, we report results where the metric used to select the medoids is the same one used to compute our observable. We could have chosen different metrics for selecting the medoids and for computing the event-to-ensemble distance, but restricting to the case where they are the same does not qualitatively change our results. Table 2 shows that the number of medoids and the choice of metric matter substantially.
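The signal dependence of the choice of $k$ can be read off directly from the table. The snippet below copies the AUCs for the QCD reference sample with the Wass(1) metric (the single-medoid and minimum-distance rows) and picks the best $k$ for each signal; the dictionary layout and function name are just illustrative scaffolding.

```python
# AUCs copied from table 2: QCD reference, Wass(1), single medoid / Medoids (min).
auc_by_k = {
    1: {"top": 0.83, "W": 0.66},
    3: {"top": 0.85, "W": 0.68},   # value suggested by the elbow method
    5: {"top": 0.87, "W": 0.60},
    7: {"top": 0.87, "W": 0.61},
}

def best_k(signal):
    """Number of medoids with the highest AUC for the given signal (first one on ties)."""
    return max(auc_by_k, key=lambda k: auc_by_k[k][signal])

k_top = best_k("top")
k_w = best_k("W")
# AUC given up on W jets if k is tuned on top jets instead.
w_auc_loss = auc_by_k[k_w]["W"] - auc_by_k[k_top]["W"]
print(k_top, k_w, round(w_auc_loss, 2))
```

Tuning $k$ on the top sample costs several points of AUC on the $W$ sample, and vice versa, which is the pattern the text describes.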

Figure 8: Images of example QCD and top events. The top row shows QCD events; the bottom row shows top events. The left column shows the average image in each case, and the other two columns show two medoids computed with the 1-Wasserstein metric. Note medoids are more sparse and varied than average images, and that one of the top medoids appears “QCD-like” when we include multiple medoids.

We find that the QCD medoids typically perform better than the average QCD jet. This is not surprising, since the average QCD jet is much more concentrated in the center of the image than any real QCD jet, as can be seen in figure 8. In this figure, the color shows the fraction of the total $p_T$ in each pixel on a logarithmic scale. We also find better performance for anomaly detection with 5-6 medoids, rather than the 2-4 medoids suggested by the elbow method. When detecting top jets with QCD reference images, we get the best results when the $p$-Wasserstein metric with $p = 1$ is used to compare images, though we also find reasonable performance for the $p$-Wasserstein metric with some other values of $p$ (not shown in the table) and when the MAE metric is used. Although MAE is not a physically motivated metric, the performance in this case is not surprising because MAE between events is highly correlated with Wasserstein distances between events for QCD jets (the Pearson correlation coefficient between MAE and the 1-Wasserstein distance is 0.87).

That our results depend on the exponent $p$ is suggestive. For the $p$-Wasserstein metric with larger $p$ we get substantially decreased performance when comparing to QCD reference jets. This suggests that the ideal value of $p$ is related to the relevant scales in the problem: a smaller value of $p$ places comparatively larger emphasis on pixels with smaller differences. This is consistent with results such as ref. Finke et al. (2021), which finds better AE performance when pixel intensities are remapped to emphasize dim pixels. When we choose a better (smaller) value of $p$, the results are also slightly less sensitive to exactly which QCD reference images are chosen than when a larger value is chosen.
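The effect of the exponent can be seen in a toy one-dimensional setting, where the $p$-Wasserstein distance between two equal-size, equal-weight point sets is obtained by matching sorted points (valid for $p \geq 1$). The normalization below, including the overall $1/p$ root, is a common convention and an assumption here; the definition used for the jet images earlier in the paper may differ.

```python
import numpy as np

def wasserstein_1d(x, y, p):
    """p-Wasserstein distance between equal-size, equal-weight 1D point sets (p >= 1)."""
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p))

reference = np.zeros(4)
one_far = np.array([0.0, 0.0, 0.0, 4.0])   # a single point displaced by 4
all_near = np.ones(4)                       # every point displaced by 1

for p in (1, 2, 5):
    print(p, wasserstein_1d(reference, one_far, p), wasserstein_1d(reference, all_near, p))
```

With $p = 1$ the two configurations are equally far from the reference, while larger $p$ increasingly singles out the one large displacement and de-emphasizes many small ones.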

While autoencoders are often trained on a QCD background, several studies have explored trying to train an AE on alternative samples. One example is ref. Finke et al. (2021), which showed that AEs perform poorly when tagging QCD jets if trained on a top jet sample. This can be attributed to top jets being more complex than QCD jets, so that an AE trained on top jets can also reconstruct QCD jets despite them being out of distribution samples. While modifications can be made so that an AE trained on top jets can tag QCD jets Finke et al. (2021), requiring sample dependent optimization defeats the point of unsupervised anomaly detection.

Unlike in the case of an AE, event-to-ensemble distances can be directly applied to other reference samples, assuming an applicable metric is selected. We include in table 2 results using a top-jet reference sample for concreteness and brevity, but the method can be easily applied to other reference samples. We find that for top reference samples, the best metric is not the same as for QCD reference samples. In contrast to the case of the QCD reference sample, we find the $p$-Wasserstein metric with higher $p$ does better for QCD vs. top classification than that with $p = 1$. However, this result is signal dependent, in addition to being dependent on the background sample: when trying to use the top reference sample to distinguish QCD vs. $W$-jet events, we find using the $p$-Wasserstein metric with $p = 1$ is better than higher $p$. We also find that whether the average event or the minimum distance to the medoids does better depends on the signal sample: for QCD vs. top classification with a top reference sample the average event does better than medoids (unlike the QCD reference sample case), but the opposite is true for QCD vs. $W$ classification. Furthermore, we find the somewhat counterintuitive result that the sum of the distances to the medoids does better than the minimum for top vs. QCD classification with top reference jets, which is surprising because only the distance to the closest event is actually used when determining the medoids. For QCD vs. top classification, using a QCD reference sample still outperforms the top reference sample, but the opposite is true when doing QCD vs. $W$ classification.

Our best results using the event-to-ensemble distance approach are comparable to and even slightly exceed the performance of the VAE in the previous section, which can be seen from comparing table 2 to table 1. This suggests that if we choose a smart, physically motivated metric like the $p$-Wasserstein distance then we can use the medoid method, which is much faster and simpler than the VAE, and avoids optimizing all the hyperparameters in the VAE network architecture. The trade-off is that we need to put effort into optimizing the metric and the choice of $k$ for the medoid approach. This is not surprising since, as we saw in the previous section, distances in the VAE latent space between two separate encoded latent representations are fairly correlated with the $p$-Wasserstein distance between the original images. This equivalence is further supported by our study of the number of downsampling layers in the previous section. As in the case of the $p$-Wasserstein metrics, locality is incorporated in the convolutional VAE on scales other than the arbitrary pixel size due to the convolutions and down samplings.

The ease and speed of using the event-to-ensemble approach is a distinct advantage when compared to AEs, where the architecture, normalization, and parameters all need to be optimized. However, since the ideal metric and fiducial sample selection depend on both the background and signal samples, the signal dependence of the event-to-ensemble approach further suggests that there may be advantages of weakly/semi-supervised learning as compared to unsupervised learning, and that weakly/semi-supervised methods should be explored further. The potential advantages of semi-supervised learning can be further seen from the fact that the best QCD vs. $W$ AUC values come from top reference samples, rather than QCD reference samples.

6 Conclusions

Using an autoencoder for anomaly detection is particularly challenging, since it must be trained well enough to reconstruct the background, but not so well that it also reconstructs the signal. There are many details about the network configuration that need to be optimized, such as the network size, the metric used to compare input and output images, the definition of the anomaly score, and hyperparameters. Many of the results currently in the literature do not sufficiently emphasize these difficulties, so we have attempted in this paper to characterize and resolve them. To be concrete, we considered detecting boosted hadronic top and $W$ jets over a QCD-jet background. We considered using different metrics for training the VAE; we found that using the more physically motivated optimal-transport-based metrics did not outperform the simpler mean-squared-error metrics, and actually performed slightly worse. We found that the optimal values of various hyperparameters depend on the signal that we are trying to detect and that the optimal hyperparameters for describing the QCD sample are not necessarily those that detect anomalies the best.

In order to understand what the autoencoder has learned, we also studied the autoencoder latent space. The latent space provides a representation of any particular event which can be used to study the background distribution. In order to characterize this latent space, we computed the distance between distinct events. One way to do this is by using the Euclidean distance between quantities in the latent space. Alternatively, if we rely on a more physical, optimal transport based metric, we can compute the distances between images directly. When we compared the two, we found that the event-to-event optimal transport based distances between the background events are highly correlated with the Euclidean distances between events in the latent space of the autoencoder. This suggests that the autoencoder is learning some aspects of optimal transport, despite being trained with only a mean-squared-error based loss function.

This motivated us to develop methods that use optimal transport more directly. By choosing a representation of the QCD background distribution, such as the average QCD image or several medoids of the set of QCD jets, we can directly compute the optimal transport distance to this fiducial sample and use it as an anomaly score. We found that this method is at least as effective as the autoencoder, with the added benefits of being easier and faster to optimize, and generalizing more easily than the autoencoder to more complicated background distributions. We also found that the best choice of optimal transport metric depends on both the new physics signal and the qualities of the expected background distribution.

Although we have shown that the performance of variational autoencoders can be reproduced, and improved upon, by the relatively simpler medoid method, neither approach is very close to optimal for signal detection. To be quantitative, when trained on a QCD sample, the best autoencoder performance we found gave an AUC of 0.65 for $W$ detection (see Table 1). The best performance using medoids with a QCD background gave a slightly better AUC of 0.71 (see Table 2). These are both worse than the performance of a fully supervised network, which gave a nearly perfect AUC of 0.96. Somewhat surprisingly, we found that when the medoids method was used on a top-jet background sample, it found $W$ jets over QCD better (AUC of 0.84) than when trained on a QCD background. This is comparable to what a supervised network trained to find tops over QCD gives when tested on $W$ vs. QCD (AUC of 0.86). These observations suggest that a path forward might be to use a semi-supervised approach Dery et al. (2017); Cohen et al. (2018); Komiske et al. (2018); Park et al. (2020), where a network is trained with an example signal in mind, and then used for anomaly detection more broadly.


We thank Jack Collins, Philip Harris, and Sang Eon Park for useful discussions and comments on a previous version of this manuscript. This work is supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions). The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. BO was partially supported by the Department of Energy (DOE) Grant No. DE-SC0020223. KF is supported in part by NASA Grant 80NSSC20K0506 and in part by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. SH is supported in part by the DOE Grant DE-SC0013607, and in part by the Alfred P. Sloan Foundation Grant No. G-2019-12504. RKM is supported by the National Science Foundation under Grant No. NSF PHY-1748958 and NSF PHY-1915071.

Appendix A Variational Inference for Autoencoders

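The idea behind variational inference for anomaly detection is to estimate the true probability distribution of the background, $p(x)$. Assuming we have an underlying latent space of elements $z$, we can write $p(x)$ as

$$
p(x) = \int dz\, p(x|z)\, p(z) = \mathbb{E}_{z \sim p(z)}\left[ p(x|z) \right], \tag{6}
$$

where $\mathbb{E}$ denotes the expectation value, $p(x|z)$ is the probability of $x$ given $z$, and $p(z)$ is the prior likelihood of the latent data. We can take the latent space prior to be a set of independent Gaussians with zero mean and unit standard deviation, $p(z) = \prod_i \mathcal{N}(z_i; 0, 1)$, where $i$ runs over the dimension of the latent space. At this point $p(x|z)$ is an unknown and intractable distribution.

To make progress, we introduce a new tractable distribution $q_\phi(z|x)$, where $\phi$ are some parameters to be optimized over. In an autoencoder architecture, this is the encoder. We can then write eq. (6) in a more useful form:

$$
p(x) = \int dz\, q_\phi(z|x)\, \frac{p(x|z)\, p(z)}{q_\phi(z|x)} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \frac{p(x|z)\, p(z)}{q_\phi(z|x)} \right].
$$

The log likelihood, $\log p(x)$, is then given as

$$
\begin{aligned}
\log p(x) &= \log \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \frac{p(x|z)\, p(z)}{q_\phi(z|x)} \right] \\
&\geq \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log \frac{p(x|z)\, p(z)}{q_\phi(z|x)} \right] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log p(x|z) \right] - \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log \frac{q_\phi(z|x)}{p(z)} \right],
\end{aligned} \tag{10}
$$

where we have used Jensen’s inequality in the second line above. Let’s first consider the first term in the last line in eq. (10). It is the expectation value of $\log p(x|z)$ given $x$ when $z$ is sampled from $q_\phi(z|x)$ (which is a distribution in $z$ given $x$). This term can be interpreted as a (negative) reconstruction error term. If we approximate $p(x|z)$ by a decoder part of the architecture $p_\theta(x|z)$ (where $\theta$ is to be optimized over), $\mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log p_\theta(x|z) \right]$ is the usual (negative) reconstruction error term in the loss function for an autoencoder with decoder $p_\theta(x|z)$ and encoder $q_\phi(z|x)$.

The second term is by definition the Kullback-Leibler divergence (KLD) between the distributions $q_\phi(z|x)$ and $p(z)$. Recall that $p(z) = \prod_i \mathcal{N}(z_i; 0, 1)$. We take $q_\phi(z|x)$ to also be a Gaussian distribution, but with an unknown mean and standard deviation (to be fixed by the optimization), i.e. $q_\phi(z|x) = \prod_i \mathcal{N}(z_i; \mu_i, \sigma_i)$. The KLD between these two distributions is then given exactly by eq. (4). Using the reparameterization trick Kingma and Welling (2013); Jimenez Rezende and Mohamed (2015), we can write $z$ in terms of a standard normal:

$$
z_i = \mu_i + \sigma_i\, \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1).
$$

Using the reparameterization trick allows for more efficient training of the network, as the back propagation of the gradients extends to the parameters of the distribution ($\mu$ and $\sigma$) even though a random draw from the distribution is passed to the decoder.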
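The two ingredients of this appendix that are specific to the VAE, the reparameterized draw and the Gaussian KLD term, can be sketched numerically. Here `mu` and `sigma` stand for the mean and standard deviation that the encoder outputs for one event; the function names and numerical values are illustrative, and the KLD expression is the standard closed form for a diagonal Gaussian against a unit Gaussian, which is presumably what eq. (4) expresses.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), so z depends smoothly on mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """KL divergence between N(mu, sigma^2) per dimension and the unit Gaussian prior."""
    return float(0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma)))

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])       # made-up encoder means for a 2D latent space
sigma = np.array([1.0, 0.5])    # made-up encoder standard deviations

# Many reparameterized draws reproduce the requested mean and spread.
draws = reparameterize(np.broadcast_to(mu, (20000, 2)), sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
kl_prior = kl_to_standard_normal(np.zeros(2), np.ones(2))
print(draws.mean(axis=0), draws.std(axis=0), kl, kl_prior)
```

The randomness enters only through `eps`, so gradients with respect to `mu` and `sigma` are well defined; and the KLD vanishes exactly when the encoder output matches the prior.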

It’s now clear that the last line in eq. (10) is the negative loss for a VAE architecture. By training the VAE, we are minimizing the loss. By the inequality in eq. (10), the last line is also a lower limit for the log likelihood. The optimized VAE therefore gives a maximized lower bound to the log likelihood, the so called Evidence LOwer Bound (ELBO). Notice that in this discussion it is imperative to use the full VAE loss in order for it to have the variational inference interpretation.

Appendix B Supervised results

It is well known that anomaly detection is sub-optimal when searching for any particular model; if the signal is known beforehand, supervised classification will yield the best results. We use a similar setup for our supervised classification as we did for the VAEs. The network consists of 1, 2, or 3 convolution blocks. Each block is made of two successive convolutional layers with 5 filters and a kernel size of 3 pixels, followed by an ELU activation function. After the convolutions, the data is downsampled with an average pool operation. Following the convolution blocks, the data is flattened to a vector and a fully connected layer reduces the output to a single number with a sigmoid activation.
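To give a feel for the size of such a network, the following sketch counts trainable parameters for 1, 2, or 3 convolution blocks. Several inputs are assumptions not stated in this appendix: a single-channel 40×40 input image, padding that preserves the image size in each convolution, a 2×2 average pool, and bias terms in every layer. The function name `count_parameters` is likewise hypothetical.

```python
def count_parameters(n_blocks, image_size=40, in_channels=1,
                     filters=5, kernel=3, pool=2):
    """Return (trainable parameter count, flattened feature length).

    Assumes size-preserving convolutions, biases in every layer, and a
    pool x pool average pool at the end of each block.
    """
    params = 0
    channels = in_channels
    size = image_size
    for _ in range(n_blocks):
        for _ in range(2):  # two successive convolutional layers per block
            params += (kernel * kernel * channels + 1) * filters  # weights + biases
            channels = filters
        size //= pool       # average pooling shrinks each spatial side
    flat = size * size * channels
    params += flat + 1      # fully connected layer to a single sigmoid output
    return params, flat


for n in (1, 2, 3):
    print(n, count_parameters(n))
```

Under these assumptions the deeper networks have fewer parameters in total, because each extra pooling step shrinks the flattened vector that feeds the final fully connected layer.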

The networks are trained using 50000 events from the QCD sample and 50000 events from either the top or W samples. Similarly, 5000 events from each dataset are used for validation and to stop the network training when the validation loss has stopped improving. The training minimizes the binary cross entropy. After training, the network is applied to the test data of 5000 events in each class. We find that the network with three downsampling layers achieves the best AUCs, with a score of 0.94 for top tagging and 0.96 for W tagging.
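For reference, the two quantities used above can be written out in a few lines of plain Python. This is a minimal sketch on a toy set of four events, not the evaluation code used for the quoted results; the function names and toy scores are illustrative. The AUC is computed here as the fraction of signal-background pairs in which the signal event receives the higher score, with ties counted as one half.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of signal and background."""
    sig = [s for y, s in zip(labels, scores) if y == 1]
    bkg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((s > b) + 0.5 * (s == b) for s in sig for b in bkg)
    return wins / (len(sig) * len(bkg))


labels = [0, 0, 1, 1]            # 0 = background (QCD), 1 = signal
scores = [0.1, 0.4, 0.35, 0.8]   # toy sigmoid outputs
print(binary_cross_entropy(labels, scores), auc(labels, scores))
```

The pairwise form scales quadratically with the number of events, so for samples of several thousand events per class a rank-based implementation would be the more practical choice.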


  • [1] External Links: Link Cited by: §2.
  • G. Aad et al. (2020) Dijet resonance search with weak supervision using $\sqrt{s}=13$ TeV $pp$ collisions in the ATLAS detector. Phys. Rev. Lett. 125 (13), pp. 131801. External Links: 2005.02983, Document Cited by: §1.
  • T. Aarrestad et al. (2021) The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider. External Links: 2105.14027 Cited by: §1, §1, footnote 3.
  • J. A. Aguilar-Saavedra, J. H. Collins, and R. K. Mishra (2017) A generic anti-QCD jet tagger. JHEP 11, pp. 163. External Links: 1709.01087, Document Cited by: §1.
  • J. A. Aguilar-Saavedra, F. R. Joaquim, and J. F. Seabra (2021) Mass Unspecific Supervised Tagging (MUST) for boosted jets. JHEP 03, pp. 012. Note: [Erratum: JHEP 04, 133 (2021)] External Links: 2008.12792, Document Cited by: §1.
  • J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. -S. Shao, T. Stelzer, P. Torrielli, and M. Zaro (2014) The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP 07, pp. 079. External Links: 1405.0301, Document Cited by: §2.
  • O. Amram and C. M. Suarez (2021) Tag N’ Train: a technique to train improved classifiers on unlabeled data. JHEP 01, pp. 153. External Links: 2002.12376, Document Cited by: §1.
  • J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2, pp. 1. Cited by: §1.
  • A. Andreassen, B. Nachman, and D. Shih (2020) Simulation Assisted Likelihood-free Anomaly Detection. Phys. Rev. D 101 (9), pp. 095004. External Links: 2001.05001, Document Cited by: §1.
  • O. Atkinson, A. Bhardwaj, C. Englert, V. S. Ngairangbam, and M. Spannowsky (2021) Anomaly detection with Convolutional Graph Neural Networks. External Links: 2105.07988 Cited by: §1.
  • J. Batson, C. G. Haaf, Y. Kahn, and D. A. Roberts (2021) Topological Obstructions to Autoencoding. JHEP 04, pp. 280. External Links: 2102.08380, Document Cited by: §1, §1.
  • K. Benkendorfer, L. L. Pottier, and B. Nachman (2021) Simulation-assisted decorrelation for resonant anomaly detection. Phys. Rev. D 104 (3), pp. 035003. External Links: 2009.02205, Document Cited by: §1.
  • A. Blance, M. Spannowsky, and P. Waite (2019) Adversarially-trained autoencoders for robust unsupervised new physics searches. JHEP 10, pp. 047. External Links: 1905.10384, Document Cited by: §1.
  • A. Blance and M. Spannowsky (2021) Unsupervised Event Classification with Graphs on Classical and Photonic Quantum Computers. External Links: 2103.03897 Cited by: §1.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. External Links: ISSN 1537-274X, Link, Document Cited by: §3.2.
  • B. Bortolato, B. M. Dillon, J. F. Kamenik, and A. Smolkovič (2021) Bump Hunting in Latent Space. External Links: 2103.06595 Cited by: §1, §1.
  • M. Cacciari, G. P. Salam, and G. Soyez (2008) The anti-$k_t$ jet clustering algorithm. JHEP 04, pp. 063. External Links: 0802.1189, Document Cited by: §2.
  • M. Cacciari, G. P. Salam, and G. Soyez (2012) FastJet User Manual. Eur. Phys. J. C 72, pp. 1896. External Links: 1111.6097, Document Cited by: §2.
  • M. Cacciari and G. P. Salam (2006) Dispelling the $N^{3}$ myth for the $k_t$ jet-finder. Phys. Lett. B 641, pp. 57–61. External Links: hep-ph/0512210, Document Cited by: §2.
  • T. Cai, J. Cheng, N. Craig, and K. Craig (2020) Linearized optimal transport for collider events. Phys. Rev. D 102 (11), pp. 116019. External Links: 2008.08604, Document Cited by: §1, §1, §5.
  • S. Caron, L. Hendriks, and R. Verheyen (2021) Rare and Different: Anomaly Scores from a combination of likelihood and out-of-distribution models to detect new physics at the LHC. External Links: 2106.10164 Cited by: §1.
  • S. Carrazza and F. A. Dreyer (2019) Lund jet images from generative and cycle-consistent adversarial networks. Eur. Phys. J. C 79 (11), pp. 979. External Links: 1909.01359, Document Cited by: §1.
  • A. Casa and G. Menardi (2018) Nonparametric semisupervised classification for signal detection in high energy physics. External Links: 1809.02977 Cited by: §1.
  • O. Cerri, T. Q. Nguyen, M. Pierini, M. Spiropulu, and J. Vlimant (2019) Variational Autoencoders for New Physics Mining at the Large Hadron Collider. JHEP 05, pp. 036. External Links: 1811.10276, Document Cited by: §1, §1.
  • C. Cesarotti, M. Reece, and M. J. Strassler (2020) Spheres To Jets: Tuning Event Shapes with 5d Simplified Models. External Links: 2009.08981 Cited by: §3.1.
  • C. Cesarotti, M. Reece, and M. J. Strassler (2020) The Efficacy of Event Isotropy as an Event Shape Observable. External Links: 2011.06599 Cited by: §3.1.
  • C. Cesarotti and J. Thaler (2020) A Robust Measure of Event Isotropy at Colliders. JHEP 08, pp. 084. External Links: 2004.06125, Document Cited by: §3.1, §5.
  • P. Chakravarti, M. Kuusela, J. Lei, and L. Wasserman (2021) Model-Independent Detection of New Physics Signals Using Interpretable Semi-Supervised Classifier Tests. External Links: 2102.07679 Cited by: §1.
  • T. Cheng, J. Arguin, J. Leissner-Martin, J. Pilette, and T. Golling (2020) Variational Autoencoders for Anomalous Jet Tagging. External Links: 2007.01850 Cited by: §1, §1, §2, §3.2.
  • T. Cheng (2021) Cited by: §2.
  • J. Cogan, M. Kagan, E. Strauss, and A. Schwartzman (2015) Jet-Images: Computer Vision Inspired Techniques for Jet Tagging. JHEP 02, pp. 118. External Links: 1407.5675, Document Cited by: §3.1.
  • T. Cohen, M. Freytsis, and B. Ostdiek (2018) (Machine) Learning to Do More with Less. JHEP 02, pp. 034. External Links: 1706.09451, Document Cited by: §6.
  • J. H. Collins, K. Howe, and B. Nachman (2018) Anomaly Detection for Resonant New Physics with Machine Learning. Phys. Rev. Lett. 121 (24), pp. 241803. External Links: 1805.02664, Document Cited by: §1.
  • J. H. Collins, K. Howe, and B. Nachman (2019) Extending the search for new resonances with machine learning. Phys. Rev. D 99 (1), pp. 014038. External Links: 1902.02634, Document Cited by: §1.
  • J. H. Collins, P. Martín-Ramiro, B. Nachman, and D. Shih (2021) Comparing weak- and unsupervised methods for resonant anomaly detection. Eur. Phys. J. C 81 (7), pp. 617. External Links: 2104.02092, Document Cited by: §1.
  • J. H. Collins (2021) An Exploration of Learnt Representations of W Jets. External Links: 2109.10919 Cited by: §1, footnote 4.
  • M. Crispim Romão, N. F. Castro, J. G. Milhano, R. Pedro, and T. Vale (2021a) Use of a generalized energy Mover’s distance in the search for rare phenomena at colliders. Eur. Phys. J. C 81 (2), pp. 192. External Links: 2004.09360, Document Cited by: §1.
  • M. Crispim Romão, N. F. Castro, J. G. Milhano, R. Pedro, and T. Vale (2021b) Use of a generalized energy Mover’s distance in the search for rare phenomena at colliders. Eur. Phys. J. C 81 (2), pp. 192. External Links: 2004.09360, Document Cited by: §5.
  • M. Crispim Romão, N. F. Castro, and R. Pedro (2021c) Finding New Physics without learning about it: Anomaly Detection as a tool for Searches at Colliders. Eur. Phys. J. C 81 (1), pp. 27. External Links: 2006.05432, Document Cited by: §1.
  • R. T. D’Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti (2021) Learning multivariate new physics. Eur. Phys. J. C 81 (1), pp. 89. External Links: 1912.12155, Document Cited by: §1.
  • R. T. D’Agnolo and A. Wulzer (2019) Learning New Physics from a Machine. Phys. Rev. D 99 (1), pp. 015014. External Links: 1806.02350, Document Cited by: §1.
  • J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (2014) DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP 02, pp. 057. External Links: 1307.6346, Document Cited by: §2.
  • A. De Simone and T. Jacques (2019) Guiding New Physics Searches with Unsupervised Learning. Eur. Phys. J. C 79 (4), pp. 289. External Links: 1807.06038, Document Cited by: §1.
  • L. M. Dery, B. Nachman, F. Rubbo, and A. Schwartzman (2017) Weakly Supervised Classification in High Energy Physics. JHEP 05, pp. 145. External Links: 1702.00414, Document Cited by: §6.
  • B. M. Dillon, D. A. Faroughy, J. F. Kamenik, and M. Szewc (2020) Learning the latent structure of collider events. JHEP 10, pp. 206. External Links: 2005.12319, Document Cited by: §1.
  • B. M. Dillon, D. A. Faroughy, and J. F. Kamenik (2019) Uncovering latent jet substructure. Phys. Rev. D 100 (5), pp. 056002. External Links: 1904.04200, Document Cited by: §1.
  • B. M. Dillon, T. Plehn, C. Sauer, and P. Sorrenson (2021) Better Latent Spaces for Better Autoencoders. External Links: 2104.08291 Cited by: §1, §1, §1, §1, footnote 3.
  • K. Dohi (2020) Variational Autoencoders for Jet Simulation. External Links: 2009.04842 Cited by: §1.
  • T. Dorigo, M. Fumanelli, C. Maccani, M. Mojsovska, G. C. Strong, and B. Scarpa (2021) RanBox: Anomaly Detection in the Copula Space. External Links: 2106.05747 Cited by: §1.
  • M. Farina, Y. Nakai, and D. Shih (2020) Searching for New Physics with Deep Autoencoders. Phys. Rev. D 101 (7), pp. 075021. External Links: 1808.08992, Document Cited by: §1, §1, §3.2.
  • D. A. Faroughy (2021) Uncovering hidden new physics patterns in collider events using Bayesian probabilistic models. PoS ICHEP2020, pp. 238. External Links: 2012.08579, Document Cited by: §1.
  • J. Feydy, T. Séjourné, F. Vialard, S. Amari, A. Trouve, and G. Peyré (2019) Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690. Cited by: §4.1.
  • T. Finke, M. Krämer, A. Morandini, A. Mück, and I. Oleksiyuk (2021) Autoencoders for unsupervised anomaly detection in high energy physics. JHEP 06, pp. 161. External Links: 2104.09051, Document Cited by: §1, §1, §5, §5, footnote 1.
  • J. Gonski, J. Lai, B. Nachman, and I. Ochoa (2021) High-dimensional Anomaly Detection with Radiative Return in $e^{+}e^{-}$ Collisions. External Links: 2108.13451 Cited by: §1.
  • E. Govorkova et al. (2021) Autoencoders on FPGAs for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider. External Links: 2108.03986 Cited by: §1, §1.
  • E. Govorkova, E. Puljak, T. Aarrestad, M. Pierini, K. A. Woźniak, and J. Ngadiuba (2021) LHC physics dataset for unsupervised New Physics detection at 40 MHz. External Links: 2107.02157 Cited by: §1.
  • J. Hajer, Y. Li, T. Liu, and H. Wang (2020) Novelty Detection Meets Collider Physics. Phys. Rev. D 101 (7), pp. 076015. External Links: 1807.10261, Document Cited by: §1.
  • A. Hallin, J. Isaacson, G. Kasieczka, C. Krause, B. Nachman, T. Quadfasel, M. Schlaffer, D. Shih, and M. Sommerhalder (2021) Classifying Anomalies THrough Outer Density Estimation (CATHODE). External Links: 2109.00546 Cited by: §1.
  • T. Heimel, G. Kasieczka, T. Plehn, and J. M. Thompson (2019) QCD or What?. SciPost Phys. 6 (3), pp. 030. External Links: 1808.08979, Document Cited by: §1, §1, §3.2.
  • D. Jimenez Rezende and S. Mohamed (2015) Variational Inference with Normalizing Flows. External Links: 1505.05770 Cited by: Appendix A, §3.2.
  • A. Kahn, J. Gonski, I. Ochoa, D. Williams, and G. Brooijmans (2021) Anomalous jet identification via sequence modeling. JINST 16 (08), pp. P08012. External Links: 2105.09274, Document Cited by: §1.
  • G. Kasieczka et al. (2021) The LHC Olympics 2020: A Community Challenge for Anomaly Detection in High Energy Physics. External Links: 2101.08320 Cited by: §1, §1.
  • C. K. Khosa and V. Sanz (2020) Anomaly Awareness. External Links: 2007.14462 Cited by: §1.
  • D. P. Kingma and M. Welling (2013) Auto-Encoding Variational Bayes. External Links: 1312.6114 Cited by: Appendix A, §1, §3.2, §3.2.
  • D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. External Links: 1412.6980 Cited by: §3.2.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improving Variational Inference with Inverse Autoregressive Flow. External Links: 1606.04934 Cited by: footnote 3.
  • D. P. Kingma and M. Welling (2019) An Introduction to Variational Autoencoders. External Links: 1906.02691 Cited by: §3.2.
  • O. Knapp, O. Cerri, G. Dissertori, T. Q. Nguyen, M. Pierini, and J. Vlimant (2021) Adversarially Learned Anomaly Detection on CMS Open Data: re-discovering the top quark. Eur. Phys. J. Plus 136 (2), pp. 236. External Links: 2005.01598, Document Cited by: §1.
  • P. T. Komiske, E. M. Metodiev, B. Nachman, and M. D. Schwartz (2018) Learning to classify from impure samples with high-dimensional data. Phys. Rev. D 98 (1), pp. 011502. External Links: 1801.10158, Document Cited by: §6.
  • P. T. Komiske, E. M. Metodiev, and J. Thaler (2019) Metric Space of Collider Events. Phys. Rev. Lett. 123 (4), pp. 041801. External Links: 1902.02346, Document Cited by: §1, §1, §3.1, §5, §5.
  • P. T. Komiske, E. M. Metodiev, and J. Thaler (2020) The Hidden Geometry of Particle Collisions. JHEP 07, pp. 006. External Links: 2004.04159, Document Cited by: §1, §1, §3.1, §5.
  • M. A. Kramer (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37 (2), pp. 233–243. External Links: Document, Link, Cited by: §1.
  • J. Leissner-Martin, T. Cheng, and J. Arguin (2020) Cited by: §2.
  • S. Macaluso and D. Shih (2018) Pulling Out All the Tops with Computer Vision and Deep Learning. JHEP 10, pp. 121. External Links: 1803.00107, Document Cited by: §2.
  • V. Mikuni and F. Canelli (2021) Unsupervised clustering for collider physics. Phys. Rev. D 103 (9), pp. 092007. External Links: 2010.07106, Document Cited by: §1.
  • A. Mullin, S. Nicholls, H. Pacey, M. Parker, M. White, and S. Williams (2021) Does SUSY have friends? A new approach for LHC event analysis. JHEP 02, pp. 160. External Links: 1912.10625, Document Cited by: §1.
  • B. Nachman and D. Shih (2020) Anomaly Detection with Density Estimation. Phys. Rev. D 101, pp. 075042. External Links: 2001.04990, Document Cited by: §1.
  • B. Ostdiek (2021) Deep Set Auto Encoders for Anomaly Detection in Particle Physics. External Links: 2109.01695 Cited by: §1, §1.
  • S. E. Park, D. Rankin, S. Udrescu, M. Yunus, and P. Harris (2020) Quasi Anomalous Knowledge: Searching for new physics with embedded knowledge. JHEP 21, pp. 030. External Links: 2011.03550, Document Cited by: §1, §1, §6.
  • A. A. Pol, V. Berger, G. Cerminara, C. Germain, and M. Pierini (2020) Anomaly Detection With Conditional Variational Autoencoders. In Eighteenth International Conference on Machine Learning and Applications, External Links: 2010.05531 Cited by: §1, §1.
  • M. Romão Crispim, N. F. Castro, R. Pedro, and T. Vale (2020) Transferability of Deep Learning Models in Searches for New Physics at Colliders. Phys. Rev. D 101 (3), pp. 035042. External Links: 1912.04220, Document Cited by: §1.
  • T. S. Roy and A. H. Vijay (2019) A robust anomaly finder based on autoencoders. External Links: 1903.02032 Cited by: §1.
  • T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands (2015) An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, pp. 159–177. External Links: 1410.3012, Document Cited by: §2.
  • G. Stein, U. Seljak, and B. Dai (2020) Unsupervised in-distribution anomaly detection of new physics through conditional density estimation. In 34th Conference on Neural Information Processing Systems, External Links: 2012.11638 Cited by: §1.
  • P. Thaprasop, K. Zhou, J. Steinheimer, and C. Herold (2021) Unsupervised Outlier Detection in Heavy-Ion Collisions. Phys. Scripta 96 (6), pp. 064003. External Links: 2007.15830, Document Cited by: §1.
  • M. van Beekveld, S. Caron, L. Hendriks, P. Jackson, A. Leinweber, S. Otten, R. Patrick, R. Ruiz De Austri, M. Santoni, and M. White (2021) Combining outlier analysis algorithms to identify new physics at the LHC. JHEP 09, pp. 024. External Links: 2010.07940, Document Cited by: §1, §1.
  • R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling (2018) Sylvester Normalizing Flows for Variational Inference. External Links: 1803.05649 Cited by: footnote 3.
  • C. Villani (2009) Optimal transport, old and new. Springer. External Links: ISBN 978-3-540-71050-9 Cited by: §3.1.
  • S. Volkovich, F. De Vito Halevy, and S. Bressler (2021) The Data-Directed Paradigm for BSM searches. External Links: 2107.11573 Cited by: §1.