1 Introduction
While many searches for physics beyond the Standard Model have been carried out at the Large Hadron Collider, new physics remains elusive. This may be due to a lack of new physics in the data, but it could also be due to us looking in the wrong place. Trying to design searches that are more robust to unexpected new physics has inspired a lot of work on anomaly detection using unsupervised methods, including community-wide challenges such as the LHC Olympics Kasieczka and others (2021) and the Dark Machines Anomaly Score Challenge Aarrestad and others (2021). The goal of anomaly detection is to search for events which are “different” from what is expected. When used for anomaly detection, unsupervised methods attempt to characterize the space of background events in some way, independent of signal. The hope is then that signal events will stand out as being uncharacteristic.
Anomaly detection techniques can be broadly split into two categories. For some signals, the signal events look similar to background events and one must exploit information about the expected probability distribution of the background to find the signal. Many anomaly detection techniques have been developed to find signals of this type Collins et al. (2018); D’Agnolo and Wulzer (2019); De Simone and Jacques (2019); Casa and Menardi (2018); Collins et al. (2019); Dillon et al. (2019); Mullin et al. (2021); D’Agnolo et al. (2021); Nachman and Shih (2020); Andreassen et al. (2020); Aad and others (2020); Dillon et al. (2020); Benkendorfer et al. (2021); Mikuni and Canelli (2021); Stein et al. (2020); Kasieczka and others (2021); Batson et al. (2021); Blance and Spannowsky (2021); Bortolato et al. (2021); Collins et al. (2021); Dorigo et al. (2021); Volkovich et al. (2021); Hallin et al. (2021). Alternatively, some signals are qualitatively different from background and then methods that try to characterize an individual event as anomalous can be used Aguilar-Saavedra et al. (2017); Hajer et al. (2020); Heimel et al. (2019); Farina et al. (2020); Cerri et al. (2019); Roy and Vijay (2019); Blance et al. (2019); Romão Crispim et al. (2020); Amram and Suarez (2021); Crispim Romão et al. (2021a); Knapp et al. (2021); Crispim Romão et al. (2021c); Cheng et al. (2020); Khosa and Sanz (2020); Thaprasop et al. (2021); Aguilar-Saavedra et al. (2021); Pol et al. (2020); van Beekveld et al. (2021); Park et al. (2020); Chakravarti et al. (2021); Faroughy (2021); Finke et al. (2021); Atkinson et al. (2021); Dillon et al. (2021); Kahn et al. (2021); Aarrestad and others (2021); Caron et al. (2021); Govorkova et al. (2021); Gonski et al. (2021); Govorkova and others (2021); Ostdiek (2021). Here, we restrict to the latter type of anomaly detection, where an anomaly score for individual events can be used for discrimination, without needing to characterize the full probability distribution of the signal ensemble. With an effective method, events with a small score are likely to be a part of the background distribution, while events with a larger score are not.
There are many different ways of defining an anomaly score. Some rely on traditional high-level observables, like mass or $N$-subjettiness. Others attempt to directly learn how likely a given event or object is using low-level information, like individual particle momenta (see e.g., ref. Caron et al. (2021)). Some methods that search for outliers rely on abstract representations to try to characterize the event space, such as the latent space of an autoencoder Dillon et al. (2021); Bortolato et al. (2021). Others give the event space itself a geometric interpretation in terms of distances Komiske et al. (2019, 2020); Cai et al. (2020). Given the complexity and high dimensionality of data at the LHC, many anomaly detection techniques employ machine learning.
In this paper, we begin by exploring the use of autoencoders for anomaly detection. Autoencoders were initially introduced for dimensionality reduction, similar to principal component analysis, to learn the important information in data while ignoring insignificant information and noise Kramer (1991). Autoencoders contain an encoder, which reduces the dimensionality of the input to give some latent representation, and a decoder, which transforms the latent space back to the original space. In particle physics, autoencoders were first used for anomaly detection in refs. Farina et al. (2020); Heimel et al. (2019), where they are meant to reconstruct certain types of data (background) but not others (signals). In order to work as an anomaly detector, an autoencoder should have a small reconstruction error for background events and a large reconstruction error for signal events. To do so, the autoencoder must establish a delicate balance in achieving a reconstruction fidelity which is accurate, but not too accurate. There are several cases where this is especially difficult, such as when the signal-to-background ratio is small, when the dataset has certain topological properties Batson et al. (2021), or when innate characteristics of the samples make the signal sample simpler than the background sample to reconstruct Dillon et al. (2021); Finke et al. (2021).
A generalization of autoencoders called variational autoencoders (VAEs) was introduced in ref. Kingma and Welling (2013). Unlike an ordinary autoencoder, where each input is mapped to an arbitrary point in the latent space, in a VAE, the latent space is a probability distribution which is sampled and mapped back to the original space by the decoder. In addition to the usual reconstruction error, the VAE loss also includes a Kullback-Leibler (KL) divergence component that pushes the latent space towards a Gaussian prior and regularizes network training. The latent space of the VAE encodes the probability distribution of the background training sample, which can be used in the anomaly score. VAEs were first used in anomaly detection in computer science in ref. An and Cho (2015), and first used for particle physics anomaly detection in ref. Cerri et al. (2019). They have been widely studied since then Carrazza and Dreyer (2019); Cheng et al. (2020); Dohi (2020); Pol et al. (2020); van Beekveld et al. (2021); Park et al. (2020); Bortolato et al. (2021); Dillon et al. (2021); Govorkova and others (2021); Ostdiek (2021).
The task of an autoencoder, variational or not, for unsupervised anomaly detection is to provide a strong universal signal/background discriminant for a variety of signals, having access only to background for training. In principle, this approach is advantageous because it opens the possibility to bypass Monte Carlo simulations and work directly with experimental data, which is almost completely background. The autoencoder paradigm is based on the vision that there is a tradeoff between efficacy and generality: the ideal discriminant for a given signal and given background would be ineffective for a different signal and different background, while a general discriminant, like the autoencoder, would work decently on a broad class of signals and backgrounds. The ideal assumes, first, that such a general discriminant exists with an appropriate use case, and second, that it can be found by training purely on one or more background samples without any direct information about the signal. However, one has reason to be suspicious: machine learning methods work very well at optimizing a given loss, which is meant to correlate strongly with the problem one is trying to solve. For autoencoder anomaly detection, the optimization (background only) is not aligned with the ultimate problem of interest (signal discovery over background), so it should not be surprising if the autoencoder does poorly. In section 4, we explore the challenges induced by trying to optimize a VAE in a model-agnostic way.
In order to understand what a VAE is learning, we study its latent space. In particular, we look at the distance between events in the VAE latent space (see Dillon et al. (2021); Collins (2021) for other studies of VAE latent spaces in particle physics). Since we can think of the VAE anomaly score as a “distance” encoding how far any given event is from the background distribution, it is also natural to ask about the distances between individual events. We find there is a significant correlation between the Euclidean distance between events represented in the VAE latent space and the Wasserstein optimal transport distance between events represented as images. We study Wasserstein distances in particular because they were physically motivated in refs. Komiske et al. (2019, 2020); Cai et al. (2020).
The correlation we observe between distances in the VAE latent space and between the event images motivates us to explore using optimal transport distances between events to define an anomaly score in section 5. One method for using distances directly is to identify representative events in the background sample, and use an event-to-event distance between a given event and the representative event as the score. The advantages of the method we propose are that it does not require training a neural network and that it is easily adaptable to different background samples.
This paper is organized as follows. In section 2, we provide information about the dataset used in our study. In section 3, we provide relevant background information on the metrics used (section 3.1), and the details of the VAE architecture (section 3.2). In section 4, we explore the effectiveness of an image-based convolutional autoencoder for anomaly detection, including its sensitivity to hyperparameters. We also explore correlations between Euclidean distances in the autoencoder’s latent space and optimal-transport distances among the event images in section 4.2. This motivates the development of methods that directly use the optimal transport distances among events as an alternative to VAEs in section 5. We conclude in section 6.
2 Datasets
We begin by describing the datasets we use for our analysis. For concreteness, we focus on anomaly detection in simulated jet events at the LHC. We will consider QCD dijet events as the background, and consider both top and jets as representatives of anomalous signal events. The authors of ref. Cheng et al. (2020) have provided a suite of jets for Standard Model and Beyond the Standard Model particle resonances which are available on Zenodo Cheng (2021). A sample of QCD dijet background events is also provided on Zenodo using the same selection criteria, showering, and detector simulation parameters Leissner-Martin et al. (2020). The datasets were generated with MadGraph Alwall et al. (2014) and Pythia8 Sjöstrand et al. (2015) and used Delphes de Favereau et al. (2014) for fast detector simulation. Jets were clustered using FastJet Cacciari et al. (2012); Cacciari and Salam (2006) using the anti-$k_T$ algorithm Cacciari et al. (2008) with a cone size of . The event selection requires two hard jets, with the leading jet having GeV and the subleading jet having GeV. The QCD jets are created using the process in MadGraph, while the top and jets we examine are produced through a which decays to or a which decays to , respectively. There are around 700,000 QCD dijet events and 100,000 events for the “anomalous” top and events. We reserve 100,000 QCD events for testing and use 50,000 QCD events for validation when training the VAE.
The leading jet in each event is used for the analysis. We preprocess the raw four-vectors into an image following the procedure presented in ref. Macaluso and Shih (2018). Using the EnergyFlow package 1, we boost and rotate the jet along the beam direction so that the weighted centroid is located at . Next, the jet is rotated about the centroid such that the weighted major principal axis is vertical. After this, the jet is flipped along both the horizontal and vertical axes so that the maximum intensity is in the upper right quadrant. Only after the centering, rotations, and flipping do we pixelate the data Macaluso and Shih (2018). We use $40 \times 40$ pixel images covering a range of . The final step of the preprocessing is to divide by the total in the image. Note that we do not standardize each pixel by, e.g., subtracting the mean and dividing by the standard deviation for the entire training dataset, because optimal transport requires positive values in every pixel. It is important to note that the individual images are very sparse and do not resemble the average of the dataset. For instance, out of the 1600 pixels, only , , and pixels account for more than of the total of the image for the QCD, top, and jets, respectively.
3 Defining the Anomaly Score
Anomaly detection, in general, requires an anomaly score: we want to determine if an event is anomalous by measuring how far away it is from the elements of the background distribution. This anomaly score can also be thought of as the “distance” between an event and an ensemble. In order to define an event-to-ensemble distance it is helpful first to explore event-to-event distance measures. For instance, given an event-to-event metric, one could compute the distance from an event to some fiducial background event, and use this as a proxy for the event-to-ensemble distance. To understand both types of distances, we need to review the metrics used to define the distance, which we will do in section 3.1. We can also use an autoencoder to generate an implicit construction of an approximate event-to-ensemble distance, in the form of an anomaly score. We will provide background and discuss the architecture of our autoencoder in section 3.2.
3.1 Metrics
First, we define the metrics that can be used to compute event-to-event distances. One of the simplest event representations is to treat an event as an image, with pixel intensities representing the particles’ energies Cogan et al. (2015). (In principle, it would be interesting to consider the complete set of four-vectors of the particles in an event as a representation, rather than the pixelated image, and define a metric on these. The Wasserstein distances described later in this section are well-suited for such a representation, but building an autoencoder architecture on the full set of four-vectors is more challenging.) It is also important to comment that our image representation is dependent on its preprocessing. Although recent studies have shown that the processing done to events before anomaly detection is inherently model dependent Finke et al. (2021), we work with the images as described. A simple event-to-event metric, the “mean power error” (MPE), can then be written as:
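$$\mathrm{MPE}^{(\alpha)}(\mathcal{I}_1, \mathcal{I}_2) = \frac{1}{N_{\rm pixels}} \sum_{i=1}^{N_{\rm pixels}} \left| I_{1,i} - I_{2,i} \right|^{\alpha}$$ (1)
where $I_{1(2),i}$ is the pixel intensity (transverse momentum) in pixel $i$ of the image 1(2), $N_{\rm pixels}$ is the number of pixels, and $\alpha$ is a parameter that governs the relative importance of pixels with high/low intensity differences. This type of metric is often used for doing regression. Frequently, the choice $\alpha = 2$ is made, inspired by the $\chi^2$ statistic, in which case $\mathrm{MPE}^{(2)}$ is known as the mean-square error (MSE). The mean-absolute error (MAE) is another well-known choice, corresponding to $\alpha = 1$.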
While MPE makes sense in regression, using it on images does not make much sense from a physics point of view. (In contrast, if one designs a neural network with higher-level variables as the input data representation, using MPE as the metric is a sensible choice.) For instance, let $\mathcal{I}_1$ be the image of a particle with energy $E$ in a single pixel and $\mathcal{I}_2$ be the image of a particle with the same energy in the neighboring pixel. These events are nearly identical physically, but will have a very large MSE distance. Moreover, we will still get the same MSE distance if we move one of the two pixels much farther away. Physically similar events do not necessarily result in small MSE distances.
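The single-pixel argument above can be checked numerically. The following is a minimal sketch in plain Python, not the code used for the results in this paper: the helper names `mean_power_error` and `single_pixel_image`, the 5×5 image size, and the pixel positions are all illustrative assumptions.

```python
def mean_power_error(img1, img2, alpha=2.0):
    """Mean over pixels of |I1 - I2|**alpha (alpha=2 is MSE, alpha=1 is MAE)."""
    flat1 = [v for row in img1 for v in row]
    flat2 = [v for row in img2 for v in row]
    if len(flat1) != len(flat2):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) ** alpha for a, b in zip(flat1, flat2)) / len(flat1)


def single_pixel_image(size, row, col):
    """Hypothetical toy image: all (normalized) intensity in one pixel."""
    img = [[0.0] * size for _ in range(size)]
    img[row][col] = 1.0
    return img


reference = single_pixel_image(5, 2, 0)
neighbor = single_pixel_image(5, 2, 1)   # particle shifted by one pixel
far_away = single_pixel_image(5, 2, 4)   # particle shifted by four pixels

mse_neighbor = mean_power_error(reference, neighbor, alpha=2.0)
mse_far = mean_power_error(reference, far_away, alpha=2.0)
```

Both `mse_neighbor` and `mse_far` come out equal, since only the two occupied pixels contribute regardless of how far apart they are.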
A completely different way to assign distance between two events is to compute the minimum “effort” needed to transform one image into the other, known as the optimal transport distance. There are many possible optimal transport algorithms (see ref. Villani (2009) for a broad review). Finding the minimum effort is an optimization problem: given a cost function $c_{ij}$, where $i$ and $j$ label elements (e.g. pixel labels) of the two events, we optimize over the transport plan, $T_{ij}$. The cost can be thought of as how much work it takes to transport a single unit of intensity a given distance, and the plan describes how much intensity to transport and where to transport it to. In terms of the cost and plan, the total optimal transport cost is then defined as
$$\mathrm{OT}(\mathcal{I}_1, \mathcal{I}_2) = \min_{T} \sum_{i,j} T_{ij} \, c_{ij}$$ (2)
In some cases, the cost function is itself a positive definite distance, in which case is also a distance. One example is the set of Wasserstein distances:
(3) 
Depending on the problem, the set of may have to satisfy additional constraints.
We define the underlying cost as the Euclidean distance in the plane between pixel $i$ in image $\mathcal{I}_1$ and pixel $j$ in image $\mathcal{I}_2$. The transport plan $T_{ij}$ is defined by the amount of intensity that is moved from pixel $i$ in image $\mathcal{I}_1$ to pixel $j$ in image $\mathcal{I}_2$. The transport plan is constrained such that the amount of intensity moved from a pixel cannot be more than what was there, $\sum_j T_{ij} \le I_{1,i}$. Similarly, the amount of intensity moved into a pixel cannot exceed the amount in that pixel in $\mathcal{I}_2$: $\sum_i T_{ij} \le I_{2,j}$. Here, we consider normalized images, preprocessed such that the total intensity summed over all pixels is equal to unity, so that there is no extra cost of creating or destroying intensity. In mathematical language, we are considering “balanced optimal transport”.
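For small images, balanced optimal transport can be solved directly as a linear program. The sketch below is illustrative only and is not the implementation used in this paper. It assumes NumPy and SciPy are available, measures distances in pixel units, treats the two constraints as equalities (which is what moving all of the intensity between two normalized images requires), and returns the raw minimum of the summed plan-times-cost without any overall root. The function name and the `power` argument are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def balanced_transport_cost(image1, image2, power=1.0):
    """Minimum of sum_ij T_ij * d_ij**power between two non-negative images."""
    image1 = np.asarray(image1, dtype=float)
    image2 = np.asarray(image2, dtype=float)
    # Keep only occupied pixels; sparse images give a small linear program.
    src = np.argwhere(image1 > 0)
    dst = np.argwhere(image2 > 0)
    a = image1[image1 > 0] / image1.sum()   # normalized source intensities
    b = image2[image2 > 0] / image2.sum()   # normalized target intensities
    n, m = len(a), len(b)

    # Euclidean distance between pixel positions, raised to the chosen power.
    diff = src[:, None, :] - dst[None, :, :]
    cost = np.sqrt((diff ** 2).sum(axis=-1)) ** power

    # The plan T is flattened row-major, so T_ij sits at index i * m + j.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0    # sum_j T_ij = a_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0             # sum_i T_ij = b_j
    b_eq = np.concatenate([a, b])

    result = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.fun)


def one_pixel(size, row, col):
    """Hypothetical toy image with unit intensity in a single pixel."""
    img = np.zeros((size, size))
    img[row, col] = 1.0
    return img


reference = one_pixel(5, 2, 0)
cost_neighbor = balanced_transport_cost(reference, one_pixel(5, 2, 1))
cost_far = balanced_transport_cost(reference, one_pixel(5, 2, 4))
```

For these single-pixel images the cost is just the pixel separation, so the far-away image is four times as distant as the neighboring one. A dense linear program like this scales poorly with the number of occupied pixels, so dedicated optimal transport solvers are preferable for full-size images.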
In particle physics applications, unbalanced optimal transport with the choice is commonly referred to as the Energy Mover's Distance (EMD) Komiske et al. (2019, 2020), as it has the interpretation of work required to rearrange an energy pattern. This interpretation makes the EMD a natural choice for a metric on collider events. This has prompted further work on using the EMD to define event shape observables characterizing the event isotropy Cesarotti and Thaler (2020), which can be useful in searching for signals that are far from QCD-like Cesarotti et al. (2020, 2020). Sometimes, has been considered Komiske et al. (2020), but the less explored case of can also be computed. Intuitively, gives more importance to smaller distances. While the EMD includes an additional term to account for energy differences between jets, in our results, we will restrict to balanced optimal transport, since we normalize the images.
The Wasserstein optimal transport metrics are more aligned with what one expects for physical events than MPE. For example, two single-particle events where the particles are nearby will have a much smaller Wasserstein distance than when they are far from each other, in contrast to their MSE distance. However, as QCD prefers small angle radiation, we find that the 1-Wasserstein distance and MSE have mild correlations, as shown in figure 1.
We reiterate that both and are used to compare the distance between two images (or events). However, for anomaly detection, we want to know how far an event is from the expected distribution. One way to do this is with an autoencoder, which we describe next.
3.2 Autoencoders
A popular method for detecting anomalous data is with a neural-network autoencoder (AE). An autoencoder works by first encoding the data in a lower-dimensional latent space, and then decoding it back to the original higher-dimensional representation. The idea is that data similar to the training sample will be reconstructed well, whereas data that is not similar to the training sample may be reconstructed poorly. The reconstruction fidelity can then be used as an anomaly score. Often the data are represented as images, and the autoencoder uses the MSE metric, a special case of eq. (1), to compare the input image to the reconstructed image.
In figure 2, we show an example of an autoencoder architecture that we will use. The encoder is made up of some number of downsampling blocks (there are two in the figure, each marked by a dashed blue line). Each block contains two sets of convolutional layers with a depth of five filters. The stride and padding are set to keep the image size the same and the ELU activation function is applied after each layer. After the convolutional layers, the data is downsampled through an average pooling layer. After the final downsampling block, the data is flattened and then followed by a dense layer with 100 nodes and an ELU activation. Finally the network is mapped to the latent space through another dense layer. We experiment with one, two, and three downsampling blocks, and use a fixed latent size of 64 dimensions. Our latent space is substantially larger than what is often used; for example, ref. Farina et al. (2020) uses a six-dimensional latent representation and ref. Heimel et al. (2019) finds the optimal size to be around 20-34 for their top-tagging data. We found the best performance with a 64-dimensional representation.
The second part of the AE is the decoder, which maps the latent space back to the space of the input data. In our setup, the decoder is a mirror of the encoder. The first step is a dense layer with 100 nodes and ELU activation. From here, another dense layer is used, where the number of nodes is set to the number of pixels in the final downsampling block. The ELU activation function is used again, and then the data is reshaped into a square array. From here, the same number of upsampling blocks is applied as the number of downsampling blocks. In each upsampling block, the first operation is a 2D transposed convolution which doubles the shape of the image and contains a depth of five filters, followed by the ELU activation. After this, two convolutional layers are used with the ELU activation with the stride and padding set to keep the image size the same. The final convolution operation reduces the depth to one.
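To make the bookkeeping of the downsampling blocks concrete, the sketch below tracks how the feature-map shape changes through the encoder. It is only an illustration under stated assumptions: the 40×40 input follows from the 1600 pixels quoted for the images, and the pooling factor of two is assumed from the fact that each upsampling block doubles the image size. The function name is hypothetical, and no neural-network library is involved.

```python
def encoder_feature_shapes(image_size=40, n_blocks=3, n_filters=5, pool=2):
    """Return (height, width, depth) after the input and after each block.

    Convolutions are assumed to preserve the image size (as described in the
    text), so only the pooling step changes the spatial dimensions.
    """
    shapes = [(image_size, image_size, 1)]
    size = image_size
    for _ in range(n_blocks):
        if size % pool:
            raise ValueError("image size must be divisible by the pooling factor")
        size //= pool
        shapes.append((size, size, n_filters))
    return shapes


# Size of the flattened encoder output for one, two, and three blocks.
flattened_sizes = {}
for n_blocks in (1, 2, 3):
    height, width, depth = encoder_feature_shapes(n_blocks=n_blocks)[-1]
    flattened_sizes[n_blocks] = height * width * depth
```

Under these assumptions, three blocks reduce the image to 5×5 with five filters before the dense layers, so the dense layer that follows sees far fewer inputs than in the one-block network.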
During training and inference, the input image is compared with the reconstructed image via some choice of event-to-event metric. A common method is to use the MSE as the loss function, with the aim of reproducing the exact image. However, it is possible to use other metrics for the comparison. Furthermore, the metric used for training does not need to be the same as the metric used for the anomaly score (see for instance ref. Cheng et al. (2020)). We will refer to the difference between the input and reconstructed image as the image distance, also known as the reconstruction error.
A variational autoencoder enhances the basic autoencoder by adding stochasticity to the latent embedding. In a regular autoencoder, which is a deterministic function, very dissimilar events can be placed near each other in the latent space. Distances in the latent space of an ordinary AE therefore do not have a precise meaning. In a VAE, the stochastic element makes the network return a distribution in the latent space for each input event. Since the same input data can be mapped to several nearby points in a VAE, dissimilar events cannot be placed nearby. Returning a distribution in the latent space is therefore essential for making distances in the latent space meaningful. The stochasticity also connects the loss to the statistical method of variational inference Kingma and Welling (2013, 2019), as we summarize in appendix A (see also refs. Blei et al. (2017); Kingma and Welling (2019) for reviews). Specifically, we show that the autoencoder estimates a lower bound on the probability for any given event to be an element of the background sample that the network is trained on.
To implement the stochasticity of a VAE, our networks are trained using the standard reparameterization trick Kingma and Welling (2013); Jimenez Rezende and Mohamed (2015). A single element of the input data now yields a distribution, and these distributions are treated as a set of $d$ independent Gaussian distributions, where $d$ is the dimension of the latent space. The output of the encoder is then doubled: instead of returning a single point in the latent space, it now outputs both the means $\mu_j$ and the variances $\sigma_j^2$ of the distribution in latent space. The loss function for the network also has to be modified: we want the background sample to be well modeled by a set of Gaussian distributions in latent space. This is done by introducing a Kullback-Leibler divergence (KLD) term (see appendix A for details), which is estimated as:
$$\mathrm{KLD} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)$$ (4)
(This estimation assumes Standard Normal priors for the likelihood of the latent data, as described in appendix A. There is a great deal of ongoing research into methods to improve the likelihood estimate by changing the latent space priors or improving the posterior approximations of the encoder Kingma et al. (2016); van den Berg et al. (2018); Dillon et al. (2021); Aarrestad and others (2021).)
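As a rough illustration of the two ingredients just described, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian and a Standard Normal prior, and draws a latent sample with the reparameterization trick. It is a plain-Python sketch rather than the training code of this paper; the function names are hypothetical, and the encoder is assumed to return variances directly (many implementations return log-variances instead).

```python
import math
import random


def kl_to_standard_normal(means, variances):
    """KL divergence of N(means, diag(variances)) from N(0, 1) per dimension, summed."""
    return 0.5 * sum(
        mu * mu + var - math.log(var) - 1.0
        for mu, var in zip(means, variances)
    )


def sample_latent(means, variances, rng=random):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives in eps only, so z is a differentiable function of
    the encoder outputs (mu, sigma) when used inside a training framework.
    """
    return [
        mu + math.sqrt(var) * rng.gauss(0.0, 1.0)
        for mu, var in zip(means, variances)
    ]


kld_at_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kld_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
latent_point = sample_latent([0.5, -0.5], [0.1, 0.2])
```

The divergence vanishes only when every mean is zero and every variance is one, which is the sense in which this term pulls the latent distributions toward the prior.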
This KLD term acts to regularize the autoencoder by pushing the means in the latent space to zero and the variances to one. Depending on the metric used to determine the distance between the original and reconstructed data, more or less regularization may be needed. To account for this, we introduce another hyperparameter , and define the loss function as
(5) 
We scan over , typically finding the best results for small but nonzero .
To minimize the loss given in eq. (5), we use the Adam optimizer Kingma and Ba (2014) with the default parameters and an initial learning rate of . The training data consists of around 550,000 QCD dijet samples, and we reserve 50,000 QCD events for an independent validation set. After each epoch of training, the loss is evaluated on the validation set. When the loss has not improved on the validation set for five epochs, the learning rate is decreased by a factor of 10, with a minimum learning rate of . Training concludes when the validation loss has not improved for 12 epochs. We then restore the weights of the network from the epoch with the best validation loss.
4 Autoencoder results
Here we present the results of our studies of variational autoencoders. We start by studying the metric dependence of VAE performance as anomaly detectors. Then we study the latent space to understand what the VAE is learning.
4.1 Autoencoder performance
Now we study the performance of variational autoencoders as anomaly detectors using different metrics. Anomaly detection with an autoencoder requires two metric choices. First, one must choose a training metric, used for computing the image distance during training. Next, one must choose an anomaly metric to compute an anomaly score which determines how similar an event is to the training sample. The training metric and anomaly metric can be the same, but do not have to be.
For the training metric, we consider MSE-type metrics and and Wasserstein metrics and . Using a Wasserstein metric in the loss function to train an autoencoder is not standard, and requires a little bit of extra engineering. (Ref. Collins (2021) also implements a VAE trained with a Wasserstein metric.) The challenge is that the optimal-transport metrics are not well-suited for the backpropagation part of the training procedure of a neural network. To get around this, we used the Sinkhorn approximation within the GeomLoss package Feydy et al. (2019). Even with this, training was slow and sometimes timed out after three days of training on GPU. In contrast, the MSE and MAE networks typically completed training in around 12 hours on the same platform.
For the anomaly metric, we consider either using the full loss (including both the training metric contribution and the KL-divergence part in the variational autoencoder), just the MSE between the input and output images, the MAE, or the Wasserstein distances Wass(0.5), Wass(1), and Wass(2). The value of each of these is computed for the test samples for the QCD dijet events, the top-jet events, and the jet events.
To evaluate performance in anomaly detection, we train the autoencoder on a QCD background using the training metric. Then we evaluate the anomaly score using the anomaly metric for a boosted top jet signal sample and a boosted jet signal sample. For a figure of merit of performance we use the Area Under the receiver operating characteristic Curve (AUC).
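For reference, the AUC can be computed directly from two lists of anomaly scores as the probability that a randomly chosen signal event scores higher than a randomly chosen background event. The sketch below is a simple quadratic-time illustration with made-up scores, not the evaluation code used for table 1; the function name is hypothetical.

```python
def auc_from_scores(background_scores, signal_scores):
    """Fraction of (signal, background) pairs where the signal scores higher.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    if not background_scores or not signal_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Made-up anomaly scores for illustration only.
toy_background = [0.1, 0.2, 0.3, 0.4]
toy_signal = [0.35, 0.5, 0.05]
toy_auc = auc_from_scores(toy_background, toy_signal)
```

An AUC of 0.5 corresponds to no discrimination, and a value below 0.5 means the signal events receive, on average, lower anomaly scores than the background.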
Training Metric  Down Samplings  Anomaly Metric  Top jet AUC  jet AUC
Supervised       -               -               0.94         0.96
MSE              1               Loss            0.82         0.61
MSE              1               MSE             0.82         0.60
MSE              1               MAE             0.79         0.48
MSE              1               Wass(0.5)       0.82         0.45
MSE              1               Wass(1)         0.83         0.41
MSE              1               Wass(2)         0.81         0.39
MSE              2               Loss            0.83         0.65
MSE              2               MSE             0.83         0.65
MSE              2               MAE             0.80         0.53
MSE              2               Wass(0.5)       0.82         0.51
MSE              2               Wass(1)         0.82         0.51
MSE              2               Wass(2)         0.81         0.54
MSE              3               Loss            0.84         0.65
MSE              3               MSE             0.84         0.65
MSE              3               MAE             0.81         0.53
MSE              3               Wass(0.5)       0.83         0.52
MSE              3               Wass(1)         0.84         0.52
MSE              3               Wass(2)         0.82         0.54
Wass(1)          1               Loss            0.78         0.44
Wass(1)          1               MSE             0.71         0.57
Wass(1)          1               MAE             0.72         0.49
Wass(1)          1               Wass(0.5)       0.75         0.47
Wass(1)          1               Wass(1)         0.78         0.44
Wass(1)          1               Wass(2)         0.76         0.39
Wass(1)          2               Loss            0.79         0.46
Wass(1)          2               MSE             0.76         0.61
Wass(1)          2               MAE             0.75         0.52
Wass(1)          2               Wass(0.5)       0.77         0.49
Wass(1)          2               Wass(1)         0.79         0.46
Wass(1)          2               Wass(2)         0.77         0.40
Wass(1)          3               Loss            0.79         0.41
Wass(1)          3               MSE             0.79         0.60
Wass(1)          3               MAE             0.77         0.51
Wass(1)          3               Wass(0.5)       0.79         0.47
Wass(1)          3               Wass(1)         0.79         0.41
Wass(1)          3               Wass(2)         0.72         0.36
Results are shown in table 1 for the training metric choices MSE and Wass(1) and for different numbers of downsampling blocks in the network. For each number of downsamplings, we trained the network with different values of the VAE parameter , and in the table present the results for the value of which achieved the best loss on the validation data. For the MSE-trained networks, the values of which minimized the loss were , , and , for the one, two, and three downsampling block networks, respectively. The Wass(1)-trained results are in the lower part of the table and had optimal values of of , , and for one, two, and three downsampling blocks, respectively. The entries highlighted in blue indicate the configuration with the best AUC for top jets and jets across all of our considered architectures, training methods, and anomaly score methods. The top row in the table shows the AUC numbers (in red) from a supervised approach, for comparison (see appendix B for details of the supervised algorithm).
In general, we find that the networks trained with MSE as the training metric and using the full loss as the anomaly metric have the best performance. The exception is when only a single down sample layer is used, in which case using Wass(1) as the anomaly metric does slightly better for the top-jet signal than using the full loss as the anomaly metric. When Wass(1) is used as the training metric, the best performance is with as the anomaly metric.
We can see at this stage the proliferation of choices one has to make when deciding what architecture, training metric, and anomaly metric to use. Making these choices is especially hard to do if one wants to remain model-agnostic. For instance, figure 3 shows the results of the network trained with as the reconstruction loss. The left panel contains the loss on the QCD validation events. Using the idea that minimizing the loss is getting a better estimate of the probability of an event, one would expect that the network configuration (number of downsamplings and value of ) which minimizes the loss will have learned the QCD distribution the best. However, the next two panels show the ability of the networks to distinguish top and jets from the QCD background. In particular, we see that the value of which minimizes any of the loss curves does not yield the best signal separation. We also point out that the network with a single down sample block has the lowest loss, but is consistently the worst anomaly detector. This figure also highlights the danger of using the AUC of a particular signal to choose the hyperparameters of a universal anomaly detector. Examining only performance on the jets, it would be tempting to pick either the two or three down sample networks with a value of , as this gives the best AUC for the s. However, these particular networks have the worst score for the top jets. This is the challenge of signal-independent searches: without a signal model in mind, optimizing analysis strategies is hard to do in a principled manner.
The network trained with MSE with a small KL divergence term yields the best anomaly detection performance. Therefore, we expect that it is learning a good representation of the underlying background distribution. We next explore this hypothesis by examining event-to-event distances among different metrics.
4.2 What has the VAE learned?
In order for a variational autoencoder to be able to judge how likely an event is to be in a particular sample, it must have a representation of the probability distribution of events in that sample. Moreover, since it first maps events to a lower-dimensional latent space, the information about the relative likelihood should be encoded in the latent space in some way. It would make sense if the network places similar events nearby in the latent space, and dissimilar events far apart. In this section we attempt to quantify whether this is indeed true by comparing to the more physical Wasserstein distance.
Since each input is mapped to a (Gaussian) distribution in the latent space, we use the Euclidean distance between the means of these distributions, which is a simple measure of the distance in latent space. (Footnote 5: One could also try to take into account the variance of the distributions, e.g. by taking the KL divergence between the two distributions.) In figure 4, we show the correlations between the Euclidean latent space distance and the Wasserstein event distance among all the pairings of 1000 events in the QCD test set for various values of the VAE parameter β. For this study, the events are passed through the encoder part of a VAE with three down sampling layers, down to a 64-dimensional latent space where the Euclidean distance is computed. As the value of β is increased, the network goes from having little regularization to having its latent distribution forced to approach a Gaussian. Correspondingly, the correlation initially grows as the structure is forced upon the latent representation, and then decreases as β becomes so large that the regularization dominates and the distribution becomes nearly Gaussian. We observe similar results for the networks with one and two downsampling layers that are trained with in the loss function. In this figure, the value of β which gives the minimum loss corresponds to the β with maximum correlation, but we do not find this trend to hold in general. It seems that the VAE with an intermediate value of β that balances the and KLD terms in the loss function creates a latent space where distances between events are correlated with the distance in the image space.
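The correlation study described above can be sketched in a few lines. This is a minimal illustration only, not the analysis code: the toy "events" are random 4-vectors, the "encoder" is a hypothetical stand-in that keeps the first two components as the latent mean, and the image-space Wasserstein distance is replaced by a Euclidean placeholder so that the example stays self-contained.

```python
import math
import random


def pairwise_distances(points, dist):
    """Return dist(a, b) for every unordered pair of points."""
    return [dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: 50 "events" and a stand-in "encoder".
rng = random.Random(0)
events = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
latent_means = [event[:2] for event in events]

d_latent = pairwise_distances(latent_means, euclidean)
d_event = pairwise_distances(events, euclidean)  # placeholder for a Wasserstein distance
rho = pearson(d_latent, d_event)
```

In a real study, `d_event` would be filled with Wasserstein distances between the jet images and `latent_means` with the encoder outputs; only the pairing and correlation logic carries over.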
The downsampling operations are critical to the production of the latent space. As they combine information from neighboring pixels, they introduce an element of scale which MPE would not exhibit. To verify the importance of downsampling, we show in figure 5 the pairwise event distance correlations for the same network at different depths into the encoder. In the first panel, distances on the axis are computed in the first downsampling block, where the events are represented as tensors and the goes across all 2000 “pixels” (see figure 2). The correlation between the distance in this first downsampled layer and the Wasserstein distance of the events is much larger than the MSE distance between the original events. The correlation further increases from the first down sample block to the second. The correlation then decreases after a third downsampling. Then, when the information is further reduced to the latent space, we get smaller correlations than seen in the early stages of the network.
Although the EMD metric is a Wasserstein metric with , there is no clear reason why should be preferred to other values. So, next we consider . In figure 6, we show the correlations for the same network with three down sampling layers but now using distances along the axis. The distances between events at different layers in the network are more correlated with than .
In this section, we have explored the representation of the QCD event distribution that variational autoencoders learn. Our conclusion from this study is that the Euclidean distances between QCD events in the latent space are highly correlated with the Wasserstein distances between the events themselves. This is particularly compelling because the VAE is trained with the MSE metric for its loss function and has no direct access to any Wasserstein metric. A related question is how the correlations would look if a Wasserstein metric were used for training. In that case, we find that the Euclidean distances between events in the latent space of the Wasserstein-trained networks are even more correlated with the Wasserstein distances in the image space. Thus, it could be argued that the Wasserstein-trained networks learn an even better representation of the QCD distribution than the MSE-trained networks. However, the VAE with MSE training worked better for finding the top and W jet signals than those trained with a Wasserstein loss. The fact that the method with the “best” latent representation does not yield the best signal separation highlights the challenges of model-agnostic anomaly detection.
5 Event-to-Ensemble Distance
In the previous section we showed that VAEs tend to work better when MSE loss is used for training than when Wasserstein metrics are used for training, and that the Euclidean distance in the latent space correlates strongly with the Wasserstein metric on the data, regardless of the metric used for training. Thus, there is a sense in which VAEs are learning the Wasserstein metrics. If the power of the VAE for anomaly detection in physical problems stems from it implicitly learning aspects of the Wasserstein metrics, we can then ask if there may be a way to use these metrics more directly for anomaly detection, sidestepping the VAE entirely. One way to do this is to use the metric to compute an event-to-ensemble distance, as we explore in this section.
We would like to use the Wasserstein distance, or another metric, to characterize the distance of an event to the background ensemble. There are already several options for using Wasserstein distances to characterize different types of events in the literature, such as k-nearest neighbor (kNN) classifiers Komiske et al. (2019), “linearized” optimal transport Cai et al. (2020), where all the events are compared to a single reference event and this is used to define a new distance, and event isotropy, which compares a given event to an isotropic configuration Cesarotti and Thaler (2020). Our goal, using a method like these, is to extract from the background ensemble one or more representative images and to compute the distance of a given signal or background event to those images. This direct event-to-ensemble distance measure can then be compared to the VAE anomaly score, which is also effectively an event-to-ensemble distance.
To compute the direct event-to-ensemble distance we need an algorithm to select or construct fiducial events from the ensemble and a metric with which to compute the distance. As with the VAE architecture, there may be no choice that is optimal for all signals. In choosing the fiducial events, we must decide which sample to choose events from, how to select those events, how many events to use, how to represent the fiducial events (e.g. as images), and how to combine the distances to the different events. To make a fair comparison to the VAE approach, we would like our algorithm for generating fiducial events to depend only on the background sample, independent of what anomalous signal we might search for. Thus we choose the QCD jet event ensemble as our reference sample. To select events from the sample, the simplest possibility is to arbitrarily select some number of random images. However, despite occasionally giving a large AUC for classification, results with random images are very sensitive to fluctuations between the random images. A second possibility that may seem sensible is to take the average of all events in the sample. A third option, which we find to be the most natural, is to use medoids, as we now explain.
With a given metric, which we call the medoid metric, we can compute the pairwise distance between any two events in the ensemble. Then for each event we can sum the distances to all other events. The medoid of the ensemble is the event that minimizes this sum. k-medoids generalizes this to finding the k events for which the sum of the distances of each event to the closest of the k medoids is minimized. Thus the ensemble fragments into a set of k clusters, with each cluster closest to one of the medoids. k-medoids clustering is similar to k-means clustering when the medoid metric is chosen to be the Euclidean metric, except that k-medoids clustering actually requires each medoid to be one of the events in the set. Medoids have previously been explored in other contexts in refs. Komiske et al. (2019, 2020); Crispim Romão et al. (2021b). To use k-medoids, we need to choose a value for k and a medoid metric. Then it is natural to take for the event-to-ensemble distance the distance of an event to its closest medoid. Although one could in principle use a different metric to compute the event-to-ensemble distance, it is also most natural to use the same medoid metric that determines the medoids.
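The medoid definition above can be made concrete with a short sketch. It assumes a precomputed symmetric distance matrix `dist` (any medoid metric could fill it) and uses a simple alternating assign-and-update scheme; this is an illustrative choice that only finds a local optimum and is not necessarily the clustering algorithm used for the results in this paper. All function names are hypothetical.

```python
import random


def medoid(dist, members):
    """Event in `members` with the smallest summed distance to all other members."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def assign(dist, medoids):
    """For each event, the position in `medoids` of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, seed=0, max_iter=100):
    """Alternating k-medoids on a distance matrix; returns (medoids, labels)."""
    medoids = random.Random(seed).sample(range(len(dist)), k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            # Keep the old medoid if a cluster happens to be empty.
            updated.append(medoid(dist, members) if members else medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


# Toy one-dimensional "events" forming two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, k=2)
```

For the toy data the two medoids are the central event of each group, and the event-to-ensemble distance of event `i` would be `min(dist[i][m] for m in medoids)`.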
Choosing the number of medoids k is challenging to do in a signal-independent way. One approach is the elbow method, a common heuristic for determining how many clusters are in a dataset. In our case, to use the elbow method we scan over the number k of medoids, and for each k compute the distances of all the events in our sample to the nearest medoid and histogram the results. There will be a small number of events very close to a medoid and a small number very far from all medoids, so the histograms will have a peak at some finite value of the distance. Moreover, the peak distance will decrease monotonically as the number of medoids increases. In many applications the decrease is rapid for small k, but at some k it abruptly stops decreasing rapidly and starts decreasing slowly. Thus the peak distance as a function of k often has an elbow shape. To determine the elbow location algorithmically we perform a linear regression to an elbow function (two straight lines), and take the first integer value after the elbow as the suggested value of k. The result can be seen in figure 7. The idea behind the elbow method is that increasing k past the location of the elbow should not give much improvement. Moreover, in the case of anomaly detection, if we have too many medoids, we can get one medoid that looks “background-like” rather than “signal-like”. We find that typically 2 to 4 medoids are selected according to this elbow method.
The main advantage of the elbow method is that it can be automated and used independent of the sample or the use case. However, there often is not a clear elbow. In figure 7, the elbow is only apparent because we have fit to a piecewise linear function. The data seems to follow more of a power law behavior. In addition, the location of the elbow can be affected by the maximum number of medoids we include in the fit. Additionally, the elbow can only be computed once we’ve already made the arbitrary choice of the medoid metric, and of the metric being used for the comparison between the full sample and the reference sample. Finally, there is no reason to expect that the elbow choice of k, which is determined only by the background sample, would be optimal for anomaly detection tasks. Thus, we also consider values of k not determined by the elbow method for this study.
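One way the two-line elbow fit could be automated is sketched below. The details are assumptions made for illustration: each candidate break splits the scan into two segments, each fitted by ordinary least squares; the split with the smallest total squared residual is kept; and the elbow is taken to be where the two fitted lines cross. The function names and the toy numbers are hypothetical, and the sketch needs at least four scan points and two segments with different slopes.

```python
import math


def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, sum of squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def elbow_k(ks, peak_distances):
    """Suggest k as the first integer after the crossing point of a two-line fit."""
    best = None
    # Each segment needs at least two points for a line fit.
    for split in range(2, len(ks) - 1):
        left = fit_line(ks[:split], peak_distances[:split])
        right = fit_line(ks[split:], peak_distances[split:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, left, right)
    _, (m1, b1, _), (m2, b2, _) = best
    crossing = (b2 - b1) / (m1 - m2)  # abscissa where the two fitted lines intersect
    return math.floor(crossing) + 1


# Toy scan: a steep drop for k <= 3 followed by a slow decrease.
ks = [1, 2, 3, 4, 5, 6, 7, 8]
peaks = [7.0, 4.0, 1.0, 0.6, 0.5, 0.4, 0.3, 0.2]
suggested_k = elbow_k(ks, peaks)
```

As the text notes, the answer can shift with the largest k included in the scan, which is easy to check with this sketch by truncating `ks` and `peaks`.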
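| Metric | k | Method | Top jet AUC | W jet AUC |
|---|---|---|---|---|
| Supervised | | | 0.94 | 0.96 |
| **QCD Reference** | | | | |
| Wass(1) | | Avg | 0.81 | 0.62 |
| | 1 | Medoid | 0.83 | 0.66 |
| | 3 (elbow) | Medoids (min) | 0.85 | 0.68 |
| | 5 | Medoids (min) | 0.87 | 0.60 |
| | 7 | Medoids (min) | 0.87 | 0.61 |
| Wass(5) | | Avg | 0.53 | 0.60 |
| | 1 | Medoid | 0.68 | 0.36 |
| | 3 | Medoids (min) | 0.66 | 0.41 |
| | 4 (elbow) | Medoids (min) | 0.67 | 0.41 |
| | 5 | Medoids (min) | 0.71 | 0.43 |
| MAE | | Avg | 0.83 | 0.71 |
| | 1 | Medoid | 0.82 | 0.71 |
| | 3 (elbow) | Medoids (min) | 0.82 | 0.61 |
| | 5 | Medoids (min) | 0.83 | 0.67 |
| | 7 | Medoids (min) | 0.83 | 0.65 |
| **Top Reference** | | | | |
| Wass(1) | | Avg | 0.69 | 0.69 |
| | 1 | Medoid | 0.58 | 0.79 |
| | 3 (elbow) | Medoids (min) | 0.32 | 0.79 |
| | 5 | Medoids (min) | 0.45 | 0.84 |
| | 7 | Medoids (min) | 0.49 | 0.83 |
| Wass(5) | | Avg | 0.72 | 0.40 |
| | 1 | Medoid | 0.53 | 0.52 |
| | 2 (elbow) | Medoids (min) | 0.72 | 0.70 |
| | 3 | Medoids (min) | 0.66 | 0.61 |
| | 5 | Medoids (min) | 0.61 | 0.54 |
| Wass(5) | 3 (elbow) | Medoids (sum) | 0.66 | 0.66 |
| | 5 | Medoids (sum) | 0.73 | 0.58 |
| | 7 | Medoids (sum) | 0.75 | 0.60 |
| MAE | | Avg | 0.48 | 0.57 |
| | 1 | Medoid | 0.29 | 0.64 |
| | 3 (elbow) | Medoids (min) | 0.25 | 0.36 |
| | 5 | Medoids (min) | 0.32 | 0.58 |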
different medoids, as denoted in the table. Medoids are selected with the same metric as is used to compare images. For each metric, we note which number of medoids corresponds to the elbow. The best AUC values with the QCD reference events are denoted in blue. We indicate the p-Wasserstein distance metric as Wass(p).
The top of table 2 shows results for top-jet vs. QCD-jet discrimination and W-jet vs. QCD-jet discrimination when QCD jets are used for the reference sample. We show results for different values of k with k-medoids, using different medoid metrics. We also show the result from using the distance to a single composite average event determined by averaging each pixel intensity over all events in the reference sample. When we study the elbow for the most common 1-Wasserstein metric, we see reasonable performance for both QCD and top jets, though it is best for neither of them. This is in line with what we expect for unsupervised anomaly detection. For simplicity, we report results where the metric used to select the medoids is the same one used to compute our observable. We could have chosen two different metrics for the medoid metric and that used to compute the event-to-ensemble distance, but restricting to the case where they are the same does not qualitatively change our results. Table 2 shows that the number of medoids and the choice of metric matter substantially.
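The scores behind table 2 can be illustrated with a small sketch: an event's distances to the medoids are combined with either the minimum or the sum, and the AUC is the probability that a random signal event receives a larger score than a random background event. The toy events below live on a line with two hypothetical medoids at 0 and 1; none of the numbers are taken from the paper.

```python
def ensemble_distance(dists_to_medoids, mode="min"):
    """Combine an event's distances to the medoids into a single anomaly score."""
    if mode == "min":
        return min(dists_to_medoids)
    if mode == "sum":
        return sum(dists_to_medoids)
    raise ValueError(f"unknown mode: {mode}")


def auc(background_scores, signal_scores):
    """Fraction of (signal, background) pairs in which the signal scores higher,
    counting ties as one half (the Mann-Whitney estimate of the ROC area)."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Hypothetical one-dimensional toy: medoid positions and event positions.
medoid_positions = [0.0, 1.0]
background = [0.1, 0.9, 0.4]
signal = [3.0, 0.2, -2.0]


def score(x, mode="min"):
    return ensemble_distance([abs(x - m) for m in medoid_positions], mode)


toy_auc = auc([score(x) for x in background], [score(x) for x in signal])
```

An AUC below 0.5, as for some entries in the table, means that the signal events sit closer to the reference medoids than the background events do.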
We find that the QCD medoids typically perform better than the average QCD jet. This is not surprising, since the average QCD jet is much more concentrated in the center of the image than any real QCD jet, as can be seen in figure 8. In this figure, the color shows the fraction of the total in each pixel on a logarithmic scale. We also find better performance for anomaly detection with 5-6 medoids, rather than the 2-4 medoids suggested by the elbow method. When detecting top jets with QCD reference images, we get the best results when the Wasserstein metric with is used to compare images, though we also find reasonable performance for the Wasserstein metric with or (not shown in the table) and when the MAE metric is used. Although MAE is not a physically motivated metric, the performance in this case is not surprising because MAE between events is highly correlated with Wasserstein distances between events for QCD jets (the Pearson correlation coefficient between MAE and the 1-Wasserstein distance is 0.87).
That our results depend on the exponent p is suggestive. For the Wasserstein metric with larger p we get substantially decreased performance when comparing to QCD reference jets. This suggests that the ideal value of p is related to the relevant scales in the problem: a smaller value of p places comparatively larger emphasis on pixels with smaller differences. This is consistent with results such as ref. Finke et al. (2021), which finds better AE performance when pixel intensities are remapped to emphasize dim pixels. When we choose a better (smaller) value of p, the results are also slightly less sensitive to exactly which QCD reference images are chosen than when a larger value is chosen.
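The role of the exponent can be seen in a one-dimensional toy, where the Wasserstein distance between two equal-size sets of equal-weight points has a closed form: sort both sets and pair the points in order. This is a simplification made for illustration only, since the jet images are two-dimensional and weighted by pixel intensity, and the normalization convention here is an assumption.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size sets of equal-weight points on a line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many points in both sets")
    # In one dimension the optimal transport plan pairs sorted points in order.
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)


reference = [0.0, 0.0, 0.0, 0.0]
many_small_moves = [1.0, 1.0, 1.0, 1.0]  # every point moves by 1
one_large_move = [0.0, 0.0, 0.0, 4.0]    # a single point moves by 4

w1_small = wasserstein_1d(reference, many_small_moves, p=1)
w1_large = wasserstein_1d(reference, one_large_move, p=1)
w5_small = wasserstein_1d(reference, many_small_moves, p=5)
w5_large = wasserstein_1d(reference, one_large_move, p=5)
```

With p = 1 the two configurations are equally far from the reference, while with p = 5 the single large displacement dominates, which is the sense in which a smaller exponent gives relatively more weight to small differences.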
While autoencoders are often trained on a QCD background, several studies have explored training an AE on alternative samples. One example is ref. Finke et al. (2021), which showed that AEs perform poorly when tagging QCD jets if trained on a top jet sample. This can be attributed to top jets being more complex than QCD jets, so that an AE trained on top jets can also reconstruct QCD jets despite them being out-of-distribution samples. While modifications can be made so that an AE trained on top jets can tag QCD jets Finke et al. (2021), requiring sample-dependent optimization defeats the point of unsupervised anomaly detection.
Unlike in the case of an AE, event-to-ensemble distances can be directly applied to other reference samples, assuming an applicable metric is selected. We include in table 2 results using a top-jet reference sample for concreteness and brevity, but the method can be easily applied to other reference samples. We find that for top reference samples, the best metric is not the same as for QCD reference samples. In contrast to the case of the QCD reference sample, we find the Wasserstein metric with higher p does better for QCD vs. top classification than that with p = 1. However, this result is signal dependent, in addition to being dependent on the background sample: when trying to use the top reference sample to distinguish QCD vs. W-jet events, we find using the Wasserstein metric with p = 1 is better than higher p. We also find that whether the average event or minimum distance to the medoids does better depends on the signal sample: for QCD versus top classification with a top reference sample the average event does better than medoids (unlike the QCD reference sample case), but the opposite is true for QCD vs. W classification. Furthermore, we find the somewhat counterintuitive result that the sum of the distances to the medoids does better than the minimum for top vs. QCD classification with top reference jets, which is surprising because only the distance to the closest event is actually used when determining the medoids. For QCD vs. top classification, using a QCD reference sample still outperforms the top reference sample, but the opposite is true when doing QCD vs. W classification.
Our best results using the event-to-ensemble distance approach are comparable to and even slightly exceed the performance of the VAE in the previous section, which can be seen from comparing table 2 to table 1. This suggests that if we choose a smart, physically motivated metric like the Wasserstein distance then we can use the medoid method, which is much faster and simpler than the VAE, and avoids optimizing all the hyperparameters in the VAE network architecture. The tradeoff is that we need to put effort into optimizing the metric and choice of k for the medoid approach. The comparable performance is not surprising, since as we saw in the previous section, distances in the VAE latent space between two separate encoded latent representations are fairly correlated with the Wasserstein distance between the original images. This equivalence is further supported by our study of the number of downsampling layers in the previous section. As in the case of the Wasserstein metrics, locality is incorporated in the convolutional VAE on scales other than the arbitrary pixel size due to the convolutions and down samplings.
The ease and speed of using the event-to-ensemble approach is a distinct advantage when compared to AEs, where the architecture, normalization, and parameters all need to be optimized. However, since the ideal metric and fiducial sample selection depend on both the background and signal samples, the signal dependence of the event-to-ensemble approach further suggests that there may be advantages of weakly/semi-supervised learning as compared to unsupervised learning, and that weakly/semi-supervised methods should be explored further. The potential advantages of semi-supervised learning can be further seen from the fact that the best QCD vs. W AUC values come from top reference samples, rather than QCD reference samples.
6 Conclusions
Using an autoencoder for anomaly detection is particularly challenging, since it must be trained well enough to reconstruct the background, but not so well that it also reconstructs the signal. There are many details about the network configuration that need to be optimized, such as the network size, the metric used to compare input and output images, the definition of the anomaly score, and hyperparameters. Many of the results currently in the literature do not sufficiently emphasize these difficulties, so we have attempted in this paper to characterize and resolve them. To be concrete, we considered detecting boosted hadronic top and W jets over a QCD-jet background. We considered using different metrics for training the VAE; we found that using the more physically motivated optimal-transport-based metrics did not outperform the simpler mean-squared-error metrics, and actually performed slightly worse. We found that the optimal values of various hyperparameters depend on the signal that we are trying to detect and that the optimal hyperparameters for describing the QCD sample are not necessarily those that detect anomalies the best.
In order to understand what the autoencoder has learned, we also studied the autoencoder latent space. The latent space provides a representation of any particular event which can be used to study the background distribution. In order to characterize this latent space, we computed the distance between distinct events. One way to do this is by using the Euclidean distance between quantities in the latent space. Alternatively, if we rely on a more physical, optimal-transport-based metric, we can compute the distances between images directly. When we compared the two, we found that the event-to-event optimal-transport-based distances between the background events are highly correlated with the Euclidean distances between events in the latent space of the autoencoder. This suggests that the autoencoder is learning some aspects of optimal transport, despite being trained with only a mean-squared-error-based loss function.
This motivated us to develop methods that use optimal transport more directly. By choosing a representation of the QCD background distribution, such as the average QCD image or several medoids of the set of QCD jets, we can directly compute the optimal transport distance to this fiducial sample and use it as an anomaly score. We found that this method is at least as effective as the autoencoder, with the added benefits of being easier and faster to optimize, and generalizing more easily than the autoencoder to more complicated background distributions. We also found that the best choice of optimal transport metric depends on both the new physics signal and the qualities of the expected background distribution.
Although we have shown that the performance of variational autoencoders can be reproduced, and improved upon, by the relatively simpler medoid method, neither approach is very close to optimal for signal detection. To be quantitative, when trained on a QCD sample, the best autoencoder performance we found gave an AUC of 0.65 for W detection (see Table 1). The best performance using medoids with a QCD background gave a slightly better AUC of 0.71 (see Table 2). These are both worse than the performance of a fully supervised network, which gave a nearly perfect AUC of 0.96. Somewhat surprisingly, we found that when the medoids method was used on a top-jet background sample, it found W jets over QCD better (AUC of 0.84) than when trained on a QCD background. This is comparable to what a supervised network trained to find tops over QCD gives when tested on W vs. QCD (AUC of 0.86). These observations suggest that a path forward might be to use a semi-supervised approach Dery et al. (2017); Cohen et al. (2018); Komiske et al. (2018); Park et al. (2020), where a network is trained with an example signal in mind, and then used for anomaly detection more broadly.
Acknowledgments
We thank Jack Collins, Philip Harris, and Sang Eon Park for useful discussions and comments on a previous version of this manuscript. This work is supported by the National Science Foundation under Cooperative Agreement PHY2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/). The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. BO was partially supported by the Department of Energy (DOE) Grant No. DESC0020223. KF is supported in part by NASA Grant 80NSSC20K0506 and in part by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. SH is supported in part by the DOE Grant DESC0013607, and in part by the Alfred P. Sloan Foundation Grant No. G201912504. RKM is supported by the National Science Foundation under Grant No. NSF PHY1748958 and NSF PHY1915071.
Appendix A Variational Inference for Autoencoders
The idea behind variational inference for anomaly detection is to estimate the true probability distribution of the background, $p(x)$. Assuming we have an underlying latent space of elements $z$, we can write $p(x)$ as
$$p(x) = \int dz\, p(x|z)\, p(z) = \mathbb{E}_{z \sim p(z)}\left[ p(x|z) \right], \tag{6}$$
where $\mathbb{E}$ denotes the expectation value, $p(x|z)$ is the probability of $x$ given $z$, and $p(z)$ is the prior likelihood of the latent data. We can take the latent space prior to be a set of independent Gaussians with zero mean and unit standard deviation, $p(z) = \prod_i \mathcal{N}(z_i; 0, 1)$, where $i$ runs over the dimension of the latent space. At this point $p(x|z)$ is an unknown and intractable distribution.
To make progress, we introduce a new tractable distribution $q_\phi(z|x)$, where $\phi$ are some parameters to be optimized over. In an autoencoder architecture, this is the encoder. We can then write eq. (6) in a more useful form:
$$p(x) = \int dz\, q_\phi(z|x)\, \frac{p(x|z)\, p(z)}{q_\phi(z|x)} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \frac{p(x|z)\, p(z)}{q_\phi(z|x)} \right]. \tag{7}$$
The log likelihood, $\log p(x)$, is then given as
$$\log p(x) = \log \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \frac{p(x|z)\, p(z)}{q_\phi(z|x)} \right] \tag{8}$$
$$\geq \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log \frac{p(x|z)\, p(z)}{q_\phi(z|x)} \right] \tag{9}$$
$$= \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log p(x|z) \right] - \mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log \frac{q_\phi(z|x)}{p(z)} \right], \tag{10}$$
where we have used Jensen’s inequality in the second line above. Let’s first consider the first term in the last line in eq. (10). It is the expectation value of $\log p(x|z)$ given $x$ when $z$ is sampled from $q_\phi(z|x)$ (which is a distribution in $z$ given $x$). This term can be interpreted as a (negative) reconstruction error term. If we approximate $p(x|z)$ by a decoder part of the architecture $p_\theta(x|z)$ (where $\theta$ is to be optimized over), $\mathbb{E}_{z \sim q_\phi(z|x)}\left[ \log p_\theta(x|z) \right]$ is the usual (negative) reconstruction error term in the loss function for an autoencoder with decoder $p_\theta(x|z)$ and encoder $q_\phi(z|x)$.
The second term is, up to its sign, by definition the Kullback-Leibler divergence (KLD) between the distributions $q_\phi(z|x)$ and $p(z)$. Recall that $p(z) = \prod_i \mathcal{N}(z_i; 0, 1)$. We take $q_\phi(z|x)$ to also be a Gaussian distribution, but with an unknown mean and standard deviation (to be fixed by the optimization), i.e. $q_\phi(z|x) = \prod_i \mathcal{N}(z_i; \mu_i, \sigma_i^2)$. The KLD between these two distributions is then given exactly by eq. (4). Using the reparameterization trick Kingma and Welling (2013); Jimenez Rezende and Mohamed (2015), we can write $z$ in terms of a standard normal:
$$z_i = \mu_i + \sigma_i\, \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1). \tag{11}$$
Using the reparameterization trick allows for more efficient training of the network, as the backpropagation of the gradients extends to the parameters of the distribution ($\mu$ and $\sigma$) even though a random draw from the distribution is passed to the decoder.
It’s now clear that the last line in eq. (10) is the negative loss for a VAE architecture. By training the VAE, we are minimizing the loss. By the inequality in eq. (10), the last line is also a lower limit for the log likelihood. The optimized VAE therefore gives a maximized lower bound to the log likelihood, the so-called Evidence LOwer Bound (ELBO). Notice that in this discussion it is imperative to use the full VAE loss in order for it to have the variational inference interpretation.
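The ingredients of this appendix can be written out numerically in a few lines. The sketch below is an assumption-laden illustration rather than the training code: it takes a diagonal Gaussian encoder parameterized by a mean and a log-variance per latent dimension, uses the standard closed form for the KL divergence to a unit Gaussian, takes mean squared error as the reconstruction term, and weights the KL term with a factor `beta`. The exact weighting convention used in the paper's loss may differ, and all function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z_i = mu_i + sigma_i * eps_i with eps_i ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, sigma^2) || N(0, 1) ), summed over the latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(x, x_reco, mu, log_var, beta=1.0):
    """Mean squared reconstruction error plus beta times the KL term."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_reco)) / len(x)
    return mse + beta * kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.0, -2.0], rng)
loss = vae_loss([1.0, 2.0], [1.5, 2.0], mu=[1.0], log_var=[0.0], beta=0.1)
```

Because the random draw enters only through `eps`, the sample `z` is a differentiable function of the mean and log-variance, which is what lets gradients reach the encoder parameters in an automatic-differentiation framework.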
Appendix B Supervised results
It is well known that anomaly detection is suboptimal for looking for any particular model; if the signal is known beforehand, supervised classification will yield the best results. We use a similar setup for our supervised classification as we did for the VAEs. The network consists of 1, 2, or 3 convolution blocks. Each block is made of two successive convolutional layers with 5 filters with a kernel size of 3 pixels, followed by an ELU activation function. After the convolutions, the data is downsampled with an average pool operation. Following the convolution blocks, the data is flattened to a vector and a fully connected layer reduces the output to a single number with a sigmoid activation.
The networks are trained using 50000 events from the QCD sample and 50000 events from either the top or W samples. Similarly, 5000 events from each dataset are used for validation and to stop the network training when the validation loss has stopped improving. The training minimizes the binary cross entropy. After training, the network is applied to the test data of 5000 events in each class. We find that the network with three down sample layers achieves the best AUCs, with a score of 0.94 for top tagging and 0.96 for W tagging.
References
 [1] External Links: Link Cited by: §2.
 Dijet resonance search with weak supervision using TeV collisions in the ATLAS detector. Phys. Rev. Lett. 125 (13), pp. 131801. External Links: 2005.02983, Document Cited by: §1.
 The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider. External Links: 2105.14027 Cited by: §1, §1, footnote 3.
 A generic antiQCD jet tagger. JHEP 11, pp. 163. External Links: 1709.01087, Document Cited by: §1.
 Mass Unspecific Supervised Tagging (MUST) for boosted jets. JHEP 03, pp. 012. Note: [Erratum: JHEP 04, 133 (2021)] External Links: 2008.12792, Document Cited by: §1.
 The automated computation of treelevel and nexttoleading order differential cross sections, and their matching to parton shower simulations. JHEP 07, pp. 079. External Links: 1405.0301, Document Cited by: §2.
 Tag N’ Train: a technique to train improved classifiers on unlabeled data. JHEP 01, pp. 153. External Links: 2002.12376, Document Cited by: §1.
 Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2, pp. 1. Cited by: §1.
 Simulation Assisted Likelihoodfree Anomaly Detection. Phys. Rev. D 101 (9), pp. 095004. External Links: 2001.05001, Document Cited by: §1.
 Anomaly detection with Convolutional Graph Neural Networks. External Links: 2105.07988 Cited by: §1.
 Topological Obstructions to Autoencoding. JHEP 04, pp. 280. External Links: 2102.08380, Document Cited by: §1, §1.
 Simulationassisted decorrelation for resonant anomaly detection. Phys. Rev. D 104 (3), pp. 035003. External Links: 2009.02205, Document Cited by: §1.
 Adversariallytrained autoencoders for robust unsupervised new physics searches. JHEP 10, pp. 047. External Links: 1905.10384, Document Cited by: §1.
 Unsupervised Event Classification with Graphs on Classical and Photonic Quantum Computers. External Links: 2103.03897 Cited by: §1.
 Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. External Links: ISSN 1537274X, Link, Document Cited by: §3.2.
 Bump Hunting in Latent Space. External Links: 2103.06595 Cited by: §1, §1.
 The anti jet clustering algorithm. JHEP 04, pp. 063. External Links: 0802.1189, Document Cited by: §2.
 FastJet User Manual. Eur. Phys. J. C 72, pp. 1896. External Links: 1111.6097, Document Cited by: §2.
 Dispelling the myth for the jetfinder. Phys. Lett. B 641, pp. 57–61. External Links: hepph/0512210, Document Cited by: §2.
 Linearized optimal transport for collider events. Phys. Rev. D 102 (11), pp. 116019. External Links: 2008.08604, Document Cited by: §1, §1, §5.
 Rare and Different: Anomaly Scores from a combination of likelihood and outofdistribution models to detect new physics at the LHC. External Links: 2106.10164 Cited by: §1.
 Lund jet images from generative and cycleconsistent adversarial networks. Eur. Phys. J. C 79 (11), pp. 979. External Links: 1909.01359, Document Cited by: §1.
 Nonparametric semisupervised classification for signal detection in high energy physics. External Links: 1809.02977 Cited by: §1.
 Variational Autoencoders for New Physics Mining at the Large Hadron Collider. JHEP 05, pp. 036. External Links: 1811.10276, Document Cited by: §1, §1.
 Spheres To Jets: Tuning Event Shapes with 5d Simplified Models. External Links: 2009.08981 Cited by: §3.1.
 The Efficacy of Event Isotropy as an Event Shape Observable. External Links: 2011.06599 Cited by: §3.1.
 A Robust Measure of Event Isotropy at Colliders. JHEP 08, pp. 084. External Links: 2004.06125, Document Cited by: §3.1, §5.
 ModelIndependent Detection of New Physics Signals Using Interpretable SemiSupervised Classifier Tests. External Links: 2102.07679 Cited by: §1.
 Variational Autoencoders for Anomalous Jet Tagging. External Links: 2007.01850 Cited by: §1, §1, §2, §3.2.
 Cited by: §2.

 Jet-Images: Computer Vision Inspired Techniques for Jet Tagging. JHEP 02, pp. 118. External Links: 1407.5675, Document Cited by: §3.1.
 (Machine) Learning to Do More with Less. JHEP 02, pp. 034. External Links: 1706.09451, Document Cited by: §6.
 Anomaly Detection for Resonant New Physics with Machine Learning. Phys. Rev. Lett. 121 (24), pp. 241803. External Links: 1805.02664, Document Cited by: §1.
 Extending the search for new resonances with machine learning. Phys. Rev. D 99 (1), pp. 014038. External Links: 1902.02634, Document Cited by: §1.
 Comparing weak and unsupervised methods for resonant anomaly detection. Eur. Phys. J. C 81 (7), pp. 617. External Links: 2104.02092, Document Cited by: §1.
 An Exploration of Learnt Representations of W Jets. External Links: 2109.10919 Cited by: §1, footnote 4.
 Use of a generalized energy Mover’s distance in the search for rare phenomena at colliders. Eur. Phys. J. C 81 (2), pp. 192. External Links: 2004.09360, Document Cited by: §1, §5.
 Finding New Physics without learning about it: Anomaly Detection as a tool for Searches at Colliders. Eur. Phys. J. C 81 (1), pp. 27. External Links: 2006.05432, Document Cited by: §1.
 Learning multivariate new physics. Eur. Phys. J. C 81 (1), pp. 89. External Links: 1912.12155, Document Cited by: §1.
 Learning New Physics from a Machine. Phys. Rev. D 99 (1), pp. 015014. External Links: 1806.02350, Document Cited by: §1.
 DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP 02, pp. 057. External Links: 1307.6346, Document Cited by: §2.
 Guiding New Physics Searches with Unsupervised Learning. Eur. Phys. J. C 79 (4), pp. 289. External Links: 1807.06038, Document Cited by: §1.
 Weakly Supervised Classification in High Energy Physics. JHEP 05, pp. 145. External Links: 1702.00414, Document Cited by: §6.
 Learning the latent structure of collider events. JHEP 10, pp. 206. External Links: 2005.12319, Document Cited by: §1.
 Uncovering latent jet substructure. Phys. Rev. D 100 (5), pp. 056002. External Links: 1904.04200, Document Cited by: §1.
 Better Latent Spaces for Better Autoencoders. External Links: 2104.08291 Cited by: §1, §1, §1, §1, footnote 3.
 Variational Autoencoders for Jet Simulation. External Links: 2009.04842 Cited by: §1.
 RanBox: Anomaly Detection in the Copula Space. External Links: 2106.05747 Cited by: §1.
 Searching for New Physics with Deep Autoencoders. Phys. Rev. D 101 (7), pp. 075021. External Links: 1808.08992, Document Cited by: §1, §1, §3.2.
 Uncovering hidden new physics patterns in collider events using Bayesian probabilistic models. PoS ICHEP2020, pp. 238. External Links: 2012.08579, Document Cited by: §1.
 Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690. Cited by: §4.1.
 Autoencoders for unsupervised anomaly detection in high energy physics. JHEP 06, pp. 161. External Links: 2104.09051, Document Cited by: §1, §1, §5, §5, footnote 1.
 High-dimensional Anomaly Detection with Radiative Return in $e^+e^-$ Collisions. External Links: 2108.13451 Cited by: §1.
 Autoencoders on FPGAs for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider. External Links: 2108.03986 Cited by: §1, §1.
 LHC physics dataset for unsupervised New Physics detection at 40 MHz. External Links: 2107.02157 Cited by: §1.
 Novelty Detection Meets Collider Physics. Phys. Rev. D 101 (7), pp. 076015. External Links: 1807.10261, Document Cited by: §1.
 Classifying Anomalies THrough Outer Density Estimation (CATHODE). External Links: 2109.00546 Cited by: §1.
 QCD or What? SciPost Phys. 6 (3), pp. 030. External Links: 1808.08979, Document Cited by: §1, §1, §3.2.
 Variational Inference with Normalizing Flows. External Links: 1505.05770 Cited by: Appendix A, §3.2.
 Anomalous jet identification via sequence modeling. JINST 16 (08), pp. P08012. External Links: 2105.09274, Document Cited by: §1.
 The LHC Olympics 2020: A Community Challenge for Anomaly Detection in High Energy Physics. External Links: 2101.08320 Cited by: §1, §1.
 Anomaly Awareness. External Links: 2007.14462 Cited by: §1.
 Auto-Encoding Variational Bayes. External Links: 1312.6114 Cited by: Appendix A, §1, §3.2, §3.2.
 Adam: A Method for Stochastic Optimization. External Links: 1412.6980 Cited by: §3.2.
 Improving Variational Inference with Inverse Autoregressive Flow. External Links: 1606.04934 Cited by: footnote 3.
 An Introduction to Variational Autoencoders. External Links: 1906.02691 Cited by: §3.2.
 Adversarially Learned Anomaly Detection on CMS Open Data: rediscovering the top quark. Eur. Phys. J. Plus 136 (2), pp. 236. External Links: 2005.01598, Document Cited by: §1.

 Learning to classify from impure samples with high-dimensional data. Phys. Rev. D 98 (1), pp. 011502. External Links: 1801.10158, Document Cited by: §6.
 Metric Space of Collider Events. Phys. Rev. Lett. 123 (4), pp. 041801. External Links: 1902.02346, Document Cited by: §1, §1, §3.1, §5, §5.
 The Hidden Geometry of Particle Collisions. JHEP 07, pp. 006. External Links: 2004.04159, Document Cited by: §1, §1, §3.1, §5.
 Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37 (2), pp. 233–243. External Links: Document, Link, https://aiche.onlinelibrary.wiley.com/doi/pdf/10.1002/aic.690370209 Cited by: §1.
 Cited by: §2.

 Pulling Out All the Tops with Computer Vision and Deep Learning. JHEP 10, pp. 121. External Links: 1803.00107, Document Cited by: §2.
 Unsupervised clustering for collider physics. Phys. Rev. D 103 (9), pp. 092007. External Links: 2010.07106, Document Cited by: §1.
 Does SUSY have friends? A new approach for LHC event analysis. JHEP 02, pp. 160. External Links: 1912.10625, Document Cited by: §1.
 Anomaly Detection with Density Estimation. Phys. Rev. D 101, pp. 075042. External Links: 2001.04990, Document Cited by: §1.
 Deep Set Auto Encoders for Anomaly Detection in Particle Physics. External Links: 2109.01695 Cited by: §1, §1.
 Quasi Anomalous Knowledge: Searching for new physics with embedded knowledge. JHEP 06, pp. 030. External Links: 2011.03550, Document Cited by: §1, §1, §6.
 Anomaly Detection With Conditional Variational Autoencoders. In Eighteenth International Conference on Machine Learning and Applications, External Links: 2010.05531 Cited by: §1, §1.
 Transferability of Deep Learning Models in Searches for New Physics at Colliders. Phys. Rev. D 101 (3), pp. 035042. External Links: 1912.04220, Document Cited by: §1.
 A robust anomaly finder based on autoencoders. External Links: 1903.02032 Cited by: §1.
 An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, pp. 159–177. External Links: 1410.3012, Document Cited by: §2.
 Unsupervised in-distribution anomaly detection of new physics through conditional density estimation. In 34th Conference on Neural Information Processing Systems, External Links: 2012.11638 Cited by: §1.

 Unsupervised Outlier Detection in Heavy-Ion Collisions. Phys. Scripta 96 (6), pp. 064003. External Links: 2007.15830, Document Cited by: §1.
 Combining outlier analysis algorithms to identify new physics at the LHC. JHEP 09, pp. 024. External Links: 2010.07940, Document Cited by: §1, §1.
 Sylvester Normalizing Flows for Variational Inference. External Links: 1803.05649 Cited by: footnote 3.
 Optimal transport, old and new. Springer. External Links: ISBN 9783540710509 Cited by: §3.1.
 The Data-Directed Paradigm for BSM searches. External Links: 2107.11573 Cited by: §1.