Variational Autoencoders for New Physics Mining at the Large Hadron Collider

by Olmo Cerri, et al.

Using variational autoencoders trained on known physics processes, we develop a one-sided p-value test to isolate previously unseen processes as outlier events. Since the autoencoder training does not depend on any specific new physics signature, the proposed procedure has a weak dependence on underlying assumptions about the nature of new physics. An event selection based on this algorithm would be complementary to classic LHC searches, typically based on model-dependent hypothesis testing. Such an algorithm would deliver a list of anomalous events that the experimental collaborations could further scrutinize and even release as a catalog, similarly to what is typically done in other scientific domains. Repeated patterns in this dataset could motivate new scenarios for beyond-the-standard-model physics and inspire new searches, to be performed on future data with traditional supervised approaches. Running in the trigger system of the LHC experiments, such an application could identify anomalous events that would otherwise be lost, extending the scientific reach of the LHC.



1 Introduction

One of the main motivations behind the construction of the CERN Large Hadron Collider (LHC) is the exploration of the high-energy frontier in search of new physics phenomena. This new physics could answer some of the standing fundamental questions in particle physics, e.g., the nature of dark matter or the origin of electroweak symmetry breaking. In LHC experiments, searches for physics beyond the Standard Model (BSM) are typically carried out as fully-supervised data analyses: assuming a new physics scenario of some kind, a search is structured as a hypothesis test based on profiled likelihood ratios ATLAS:2011tau . These searches are said to be model dependent, since they depend on considering a specific new physics model.

Assuming that one is testing the right model, this approach is very effective in discovering a signal, as demonstrated by the LHC searches for the Standard Model (SM) Higgs boson Aad:2012tfa ; Chatrchyan:2012xdj . On the other hand, given the (so far) negative outcome of many BSM searches at the LHC and at other particle-physics experiments, it is possible that a future BSM model, if any, is not among those typically tested. The problem is more profound if analyzed in the context of the LHC big-data problem: at the LHC, 40 million proton-beam collisions are produced every second, but only 1000 collision events/sec can be stored by the ATLAS and CMS experiments, due to limited bandwidth, processing, and storage resources. It is possible to imagine BSM scenarios that would escape detection, simply because the corresponding new physics events would be rejected by a typical set of online selection algorithms.

Establishing alternative search methodologies with reduced model dependence is an important aspect of future LHC runs. Traditionally, this issue was addressed with so-called model-independent searches, performed at the Tevatron Aaltonen:2008vt ; Abazov:2011ma , at HERA Aaron:2008aa , and at the LHC CMS-PAS-EXO-14-016 ; Aaboud:2018ufy , as discussed in Section 2.

In this paper, we propose to address this need by deploying an unsupervised algorithm in the online selection system of the LHC experiments. This algorithm would be trained on known SM processes and could be able to identify BSM events as anomalies. The selected events could be stored in a special stream, scrutinized by experts (e.g., to exclude detector malfunctioning that could explain the anomalies), and even released outside the experimental collaborations, in the form of an open-access catalog. The final goal of this application is to identify anomalous event topologies and inspire future supervised searches on data collected afterwards.

As a proof of principle, we consider the case of a typical single-lepton data stream, selected by the hardware-based L1 trigger system. On this stream of data, a variational autoencoder (VAE) is trained to compress the input event representation into a low-dimensional latent space and then decompress it, returning the shape parameters that describe the probability density function (pdf) of each input quantity, given a point in the compressed space. The distribution of events in a suitable test statistic, namely part of the VAE loss function, is used to perform a one-sided p-value test, which associates to each incoming event the probability of originating from known SM processes. A p-value threshold is applied to decide which events should be included in a low-rate anomalous-event data stream. In this work, we set the threshold such that  events could be collected every day under current LHC operation conditions. In particular, we took as a reference 8 months of data taking per year, with an integrated luminosity of about 40 fb$^{-1}$, as in 2016. Assuming an LHC duty cycle of 2/3, this corresponds to an average instantaneous luminosity of  cm$^{-2}$ s$^{-1}$.
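The conversion from a yearly integrated luminosity to an average instantaneous luminosity is a short piece of arithmetic, sketched below for illustration. The 40 fb$^{-1}$ input is an assumption (8 months at the 5 fb$^{-1}$ per month quoted later in the text), as are the 30-day month and the function name.

```python
FB_INV_TO_CM2_INV = 1e39  # 1 fb^-1 = 1e39 cm^-2, since 1 b = 1e-24 cm^2


def average_instantaneous_luminosity(int_lumi_fb_inv, months, duty_cycle,
                                     days_per_month=30.0):
    """Average instantaneous luminosity in cm^-2 s^-1.

    days_per_month is an assumption of this sketch, not a number from the text.
    """
    live_seconds = months * days_per_month * 86400.0 * duty_cycle
    return int_lumi_fb_inv * FB_INV_TO_CM2_INV / live_seconds


# Assumed inputs: 40 fb^-1 collected in 8 months, with a duty cycle of 2/3.
lumi = average_instantaneous_luminosity(40.0, months=8, duty_cycle=2.0 / 3.0)
print(f"average instantaneous luminosity: {lumi:.2e} cm^-2 s^-1")
```

With these inputs the result is a few times $10^{33}$ cm$^{-2}$ s$^{-1}$; the exact figure depends on the assumed number of live days.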

We then measure the BSM production cross section that would correspond to a signal excess of 100 events/month, as well as the one that would give a signal yield equal to one third of the SM yield. For this, we consider a set of low-mass BSM resonances, decaying to one or more leptons and light enough to be challenging for the currently employed LHC trigger algorithms.

This paper is structured as follows: we discuss related works in Section 2. Section 3 gives a brief description of the dataset used. Section 4 describes the VAE model used in the study, as well as a set of fully-supervised classifiers used for performance comparison. Results are discussed in Section 5. In Section 6 we discuss how such an application could be used in a typical LHC experimental environment. Conclusions are given in Section 7.

2 Related Work

Model-independent searches for new physics have been performed at the Tevatron Aaltonen:2008vt ; Abazov:2011ma , at HERA Aaron:2008aa , and the LHC CMS-PAS-EXO-14-016 ; Aaboud:2018ufy . These searches are based on the comparison of a large set of binned distributions to the prediction from Monte Carlo simulation, in search of bins exhibiting a deviation larger than some predefined threshold. While the effectiveness of this strategy in establishing a discovery has been a matter of discussion, a recent study by the ATLAS collaboration Aaboud:2018ufy has rephrased this model-independent search strategy into a tool to identify interesting excesses, on which traditional analysis techniques could be performed on independent datasets (e.g., the data collected after running the model-independent analysis). This change of scope has the advantage of reducing the trial factor (i.e., the so-called look-elsewhere effect 2008arXiv0811.1663L ; Gross:2010qma ), which washes out the significance of an observed excess.

Our strategy is similar to what is proposed in Ref. Aaboud:2018ufy , with two substantial differences: (i) we also aim to monitor those events that could be discarded by the online selection, by running the algorithm in the trigger system; (ii) we do so by exploiting deep-learning-based anomaly detection techniques.

Recent works DAgnolo:2018cun ; Collins:2018epr ; DeSimone:2018efk ; Hajer:2018kqm have investigated the use of machine-learning techniques to set up new strategies for BSM searches with minimal or no assumption on the specific new-physics scenario under investigation. In this work, we use variational autoencoders based on high-level features as a baseline. Previously, autoencoders have been used in collider physics for detector monitoring CMSdqm ; CMSdc and event generation ATL-SOFT-PUB-2018-001 . Autoencoders have also been explored to define a jet tagger that would identify new physics events with anomalous jets Heimel:2018mkt ; AEjets , with a strategy similar to what we apply to the full event in this work.

3 Data samples

The dataset used for this study is a refined version of the high-level-feature (HLF) dataset used in Ref. TOPCLASS . Proton-proton collisions are generated using the PYTHIA8 event-generation library pythia , fixing the center-of-mass energy to the LHC Run-II value (13 TeV) and the average number of overlapping collisions per beam crossing (pileup) to . These beam conditions loosely correspond to the LHC operating conditions in 2016.

Events generated by PYTHIA8 are processed with the DELPHES library delphes , to emulate detector efficiency and resolution effects. We take as benchmark detector description the upgraded design of the CMS detector, foreseen for the High-Luminosity LHC phase CMS_TP . In particular, we use the CMS HL-LHC detector card distributed with DELPHES. We run the DELPHES particle-flow (PF) algorithm, which combines information from different detector components to derive a list of reconstructed particles, the so-called PF candidates. For each particle, the algorithm returns the measured energy and flight direction. Each particle is associated to one of three classes: charged particles, photons, and neutral hadrons. In addition, lists of reconstructed electrons and muons are given.

Events are filtered at generation requiring an electron, muon, or tau lepton with  GeV. Once detector effects are taken into account through the DELPHES simulation, events are further selected requiring the presence of one reconstructed electron or muon with transverse momentum  GeV and a loose isolation requirement , where the isolation is computed as:

$$\mathrm{Iso} = \frac{\sum_{p \neq \ell} p_T^{p}}{p_T^{\ell}}$$

and the sum extends over all the photons, charged and neutral hadrons within a cone of fixed size around the lepton.¹

¹ As common for collider physics, we use a Cartesian coordinate system with the $z$ axis oriented along the beam axis, the $x$ axis on the horizontal plane, and the $y$ axis oriented upward. The $x$ and $y$ axes define the transverse plane, while the $z$ axis identifies the longitudinal direction. The azimuth angle $\phi$ is computed from the $x$ axis. The polar angle $\theta$ is used to compute the pseudorapidity $\eta = -\log(\tan(\theta/2))$. We fix units such that $c = \hbar = 1$.

The 21 considered HLF quantities are:

  • The absolute value of the isolated-lepton transverse momentum $p_T^{\ell}$.

  • The three isolation quantities (ChPFIso, NeuPFIso, GammaPFIso) for the isolated lepton, computed with respect to charged particles, neutral hadrons and photons, respectively.

  • The lepton charge.

  • A Boolean flag (isEle) set to 1 when the trigger lepton is an electron, 0 otherwise.

  • $H_T$, i.e. the scalar sum of the $p_T$ of all the jets, leptons, and photons in the event with  GeV and . Jets are clustered from the reconstructed PF candidates, using the FASTJET fastjet implementation of the anti-$k_T$ jet algorithm antikt , with jet-size parameter R=0.4.

  • The number of jets entering the $H_T$ sum ($N_J$).

  • The invariant mass of the set of jets entering the $H_T$ sum ($M_J$).

  • The number of these jets being identified as originating from a $b$ quark ($N_b$).

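  • The missing transverse momentum, decomposed into its parallel ($p_{T,\parallel}^{\mathrm{miss}}$) and orthogonal ($p_{T,\perp}^{\mathrm{miss}}$) components with respect to the isolated lepton direction. The missing transverse momentum is defined as the negative sum of the PF-candidate $\vec{p}_T$ vectors:

    $$\vec{p}_T^{\,\mathrm{miss}} = -\sum_{q} \vec{p}_T^{\,q}$$

  • The transverse mass, $M_T$, of the isolated lepton $\ell$ and the $E_T^{\mathrm{miss}}$ system, defined as:

    $$M_T = \sqrt{2\, p_T^{\ell}\, E_T^{\mathrm{miss}} \left(1 - \cos\Delta\phi\right)}$$

    with $\Delta\phi$ the azimuth separation between the $\vec{p}_T^{\,\ell}$ and $\vec{p}_T^{\,\mathrm{miss}}$ vectors, and $E_T^{\mathrm{miss}}$ the absolute value of $\vec{p}_T^{\,\mathrm{miss}}$.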

  • The number of selected muons ($N_\mu$).

  • The invariant mass of this set of muons ($M_\mu$).

  • The absolute value of the total transverse momentum of these muons ($p_{T,\mathrm{TOT}}^{\mu}$).

  • The number of selected electrons ($N_e$).

  • The invariant mass of this set of electrons ($M_e$).

  • The absolute value of the total transverse momentum of these electrons ($p_{T,\mathrm{TOT}}^{e}$).

  • The number of reconstructed charged hadrons.

  • The number of reconstructed neutral hadrons.
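Some of the kinematic quantities in the list above can be sketched in a few lines. The snippet below is a minimal illustration, not the code used to build the dataset: it assumes the standard definition of the transverse mass, represents PF candidates as hypothetical (px, py) pairs, and picks an arbitrary sign convention for the orthogonal component.

```python
import math


def missing_pt(candidates):
    """Negative vector sum of the (px, py) of all PF candidates."""
    mex = -sum(px for px, _ in candidates)
    mey = -sum(py for _, py in candidates)
    return mex, mey


def met_components(met_x, met_y, phi_lep):
    """Missing-pT components parallel and orthogonal to the lepton direction."""
    ux, uy = math.cos(phi_lep), math.sin(phi_lep)
    parallel = met_x * ux + met_y * uy
    orthogonal = -met_x * uy + met_y * ux  # sign convention is an assumption
    return parallel, orthogonal


def transverse_mass(pt_lep, phi_lep, met_x, met_y):
    """M_T = sqrt(2 * pT(lepton) * ETmiss * (1 - cos(delta phi)))."""
    met = math.hypot(met_x, met_y)
    dphi = phi_lep - math.atan2(met_y, met_x)
    return math.sqrt(max(0.0, 2.0 * pt_lep * met * (1.0 - math.cos(dphi))))


# Toy event: a 40 GeV lepton along +x plus two softer candidates.
toy_candidates = [(40, 0), (-10, 5), (-5, -5)]
met_x, met_y = missing_pt(toy_candidates)
parallel, orthogonal = met_components(met_x, met_y, phi_lep=0.0)
m_t = transverse_mass(40.0, 0.0, met_x, met_y)
print(met_x, met_y, parallel, orthogonal, round(m_t, 2))
```

In this toy event the missing transverse momentum points opposite to the lepton, so the parallel component is negative and the orthogonal one vanishes.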

This list of HLF quantities was not defined with a specific BSM scenario in mind. Instead, it is conceived to include relevant information to discriminate the various SM processes populating the single-lepton data stream. At the same time, it is generic enough to allow (at least in principle) the identification of a large set of new physics scenarios.

Many SM processes would contribute to the considered single-lepton dataset. For simplicity, we restrict the list of relevant SM processes to the four with highest production cross section, namely:

  • Inclusive $W$ production, with $W \to \ell\nu$ ($\ell = e, \mu, \tau$).

  • Inclusive $Z$ production, with $Z \to \ell\ell$ ($\ell = e, \mu, \tau$).

  • $t\bar{t}$ production.

  • QCD multijet production.²

² To speed up the generation process for QCD events, we require  GeV, since the fraction of QCD events with  GeV that produce a lepton within acceptance is negligible but computationally expensive to generate.

These samples are mixed to provide a SM cocktail dataset, which is then used to train autoencoder models and to tune the threshold requirement that defines what we consider an anomaly. The cocktail is built by scaling down the high-statistics samples ($W$, $Z$, and $t\bar{t}$) to the lowest-statistics one (QCD, whose generation is the most computationally expensive), according to their production cross-section values (estimated at leading order with PYTHIA) and selection efficiency (shown in Tab. 1).
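The mixing procedure can be illustrated with a short sketch: each process enters the cocktail with a weight proportional to cross section times acceptance times trigger efficiency, and the overall size is set by the sample that runs out first. All numbers and function names below are placeholders chosen for illustration; they are not the values of Tab. 1.

```python
def cocktail_fractions(processes):
    """processes maps name -> (cross_section, acceptance, trigger_efficiency)."""
    rates = {name: xs * acc * eff for name, (xs, acc, eff) in processes.items()}
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


def events_to_keep(fractions, available):
    """Largest cocktail with the target fractions that no sample can overfill."""
    n_total = min(available[name] / frac
                  for name, frac in fractions.items() if frac > 0)
    return {name: int(n_total * frac) for name, frac in fractions.items()}


# Placeholder inputs (arbitrary units), NOT the values of Tab. 1.
toy_processes = {
    "W": (60.0, 0.5, 0.7),
    "QCD": (2.0e4, 0.001, 0.1),
    "Z": (6.0, 0.4, 0.8),
    "ttbar": (0.7, 0.4, 0.9),
}
toy_available = {"W": 5_000_000, "QCD": 300_000, "Z": 2_000_000, "ttbar": 1_000_000}

fractions = cocktail_fractions(toy_processes)
kept = events_to_keep(fractions, toy_available)
for name in toy_processes:
    print(f"{name:6s} fraction = {fractions[name]:.4f}  kept = {kept[name]}")
```

In this toy configuration the QCD sample is the limiting one, so the other samples are scaled down to match it.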

Standard Model processes
Process | Acceptance | Trigger efficiency | Cross section [nb] | Event fraction | Events/month
BSM benchmark processes
Process | Acceptance | Trigger efficiency | Total efficiency | Cross section for 100 events/month
436 fb
166 fb
335 fb
163 fb
Table 1: Acceptance and trigger efficiency of SM processes and corresponding values for BSM benchmark models. For SM processes, we report the total cross section before the trigger, the expected number of events per month and the fraction in the SM cocktail. For BSM models, we compute the production cross section corresponding to an average of 100 events per month passing the acceptance and trigger requirements.
The monthly event yield is computed assuming the conditions discussed in Section 1.
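The relation between event yield, cross section, and efficiency behind the last column of Tab. 1 is N = σ × L × ε. The sketch below inverts it to obtain the total efficiency implied by each quoted cross section, assuming a monthly integrated luminosity of 5 fb$^{-1}$ (the value quoted later in the text); the function names are ours.

```python
def expected_events(cross_section_fb, int_lumi_fb_inv, total_efficiency):
    """N = sigma * L * efficiency."""
    return cross_section_fb * int_lumi_fb_inv * total_efficiency


def cross_section_for_yield(n_events, int_lumi_fb_inv, total_efficiency):
    """Cross section (fb) giving n_events after acceptance and trigger."""
    return n_events / (int_lumi_fb_inv * total_efficiency)


def implied_efficiency(n_events, int_lumi_fb_inv, cross_section_fb):
    """Total efficiency implied by a quoted cross section and yield."""
    return n_events / (int_lumi_fb_inv * cross_section_fb)


MONTHLY_LUMI_FB_INV = 5.0  # assumption: 5 fb^-1 per month

for xs_fb in (436.0, 166.0, 335.0, 163.0):  # cross sections listed in Tab. 1
    eff = implied_efficiency(100, MONTHLY_LUMI_FB_INV, xs_fb)
    print(f"{xs_fb:5.0f} fb -> implied total efficiency {eff:.3f}")
```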

In addition, we consider the following BSM models to benchmark anomaly-detection capabilities:

  • A leptoquark with mass 80 GeV, decaying to a quark and a lepton.

  • A neutral scalar boson with mass 50 GeV, decaying to two off-shell $Z$ bosons, each forced to decay to two leptons.

  • A scalar boson with mass 60 GeV, decaying to two tau leptons.

  • A charged scalar boson with mass 60 GeV, decaying to a tau lepton and a neutrino.

For each BSM scenario, we consider any direct production mechanism implemented in PYTHIA8, including associated jet production. We list in Tab. 1 the leading-order production cross section and selection efficiency for each model.

Figures 1 and 2 show the distribution of HLF quantities for the SM processes and the BSM benchmark models, respectively.

Figure 1: Distribution of the HLF quantities for the four considered SM processes.
Figure 2: Distribution of the HLF quantities for the four considered BSM benchmark models.

4 Model description

We train autoencoders (AEs) on the SM cocktail sample described in Section 3, taking as input the 21 HLF quantities listed there. The use of HLF quantities to represent events limits the model independence of the anomaly detection procedure. While the list of features is chosen to represent the main physics aspects of the considered SM processes and is in no way tailored to specific BSM models, such a list might be more suitable for certain models than for others. In this respect, one cannot guarantee that the anomaly-detection performance observed on a given BSM model would generalize to any BSM scenario. We will address in a future work a possible solution to reduce the model dependence carried by the input event representation.

In this section, we present both the best-performing autoencoder model and a set of supervised classifiers, trained to distinguish each of the four BSM benchmark models from SM events. We use the classification performance of these supervised algorithms as an estimate of the best performance that the VAE could reach.

4.1 Autoencoders

Autoencoders are algorithms that compress a given set of input variables into a latent space (encoding) and then, starting from the latent space, reconstruct the HLF input values (decoding). Autoencoders are used in the context of anomaly detection, associating a p-value to a given event through a quantification of the encoding-decoding distance.

In this work we focus on VAEs 2013arXiv1312.6114K . Unlike traditional AEs, VAEs return the event pdf in the latent and original space, instead of decoded values of the input quantities and the encoded point in the latent space. The functional form of the pdfs is specified through the loss function a priori and the pdfs’ shape parameters are the output of a trainable function of the inputs. Such a function is the VAE itself and is determined during training.

Figure 3: Schematics of the VAE used to perform anomaly detection, where represent the input variables and the latent space variables. The shape of each layer is reported in brackets.

We consider the VAE architecture shown in Fig. 3, characterized by a four-dimensional latent space. Each latent dimension is associated to a Gaussian pdf and its two degrees of freedom (mean and variance). The input layer consists of 21 nodes, corresponding to the 21 HLF quantities described in Section 3. This layer is connected to the latent space through two hidden dense layers, each consisting of 50 neurons with ReLU activation functions. Two four-neuron layers are connected to the second hidden layer. Linear activation functions are used for the first of these four-neuron layers. Its nodes are interpreted as the mean values of the latent-space Gaussian pdfs. The nodes of the second layer are activated by the function:


This activation, inspired by wiki:activation_function , has been chosen to increase training stability: it is strictly positive and non-linear, but does not involve exponentials, which might create instabilities in early epochs. These four nodes are interpreted as the width ($\sigma$) parameters of the latent-space four-dimensional Gaussian. After several trials, the dimension of the latent space has been set to 4 in order to keep good training stability without degrading the VAE performance. The decoding step originates from a point in the latent space, sampled according to the predicted pdf (green oval in Fig. 3). The coordinates of this point in the latent space are fed into a sequence of two hidden dense layers, each consisting of 50 neurons with ReLU activation functions. The last of these layers is connected to three dense layers of 21, 17, and 10 neurons, activated by linear, p-ISRLu and clipped-tanh functions, respectively. The clipped-tanh function is written as:


The 48 output nodes represent the parameters of the pdfs describing the input HLF quantities, which enter the loss function to be minimized.
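The two custom activations can be sketched as follows. The exact expressions are not reproduced in this text, so the forms below are assumptions: a strictly positive variant of the inverse square root linear unit (ISRLU, with α = 1) shifted upward by 1 plus a small offset, and a hyperbolic tangent rescaled to stay strictly inside (0, 1). The offset and clip constants are illustrative.

```python
import math


def shifted_isrlu(x, offset=5e-3):
    """Strictly positive ISRLU variant: 1 + offset + ISRLU(x), with alpha = 1.

    ISRLU(x) is x for x >= 0 and x / sqrt(1 + x^2) for x < 0, so it is bounded
    below by -1 and involves no exponential. The offset value is an assumption.
    """
    isrlu = x if x >= 0.0 else x / math.sqrt(1.0 + x * x)
    return 1.0 + offset + isrlu


def clipped_tanh(x, clip=0.999):
    """tanh rescaled to lie strictly inside (0, 1); clip is an assumption."""
    return 0.5 * (1.0 + clip * math.tanh(x))


# Output-layer bookkeeping from the text:
# 21 linear + 17 p-ISRLu + 10 clipped-tanh nodes.
n_outputs = 21 + 17 + 10

for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(f"x = {x:7.1f}  shifted_isrlu = {shifted_isrlu(x):9.4f}  "
          f"clipped_tanh = {clipped_tanh(x):.4f}")
```

A strictly positive output suits width-like parameters, while an output confined to (0, 1) suits fraction-like parameters.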

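The VAE loss function is a weighted sum of two pieces: a term quantifying the probability of the inputs given the predicted output pdf parameters ($\mathrm{Loss}_{\mathrm{reco}}$) and the Kullback-Leibler divergence ($D_{\mathrm{KL}}$) between the latent space pdf and the prior:

$$\mathrm{Loss}_{\mathrm{Tot}} = \mathrm{Loss}_{\mathrm{reco}} + \beta\, D_{\mathrm{KL}}$$

where $\beta$ is a free parameter set to 0.3. The prior chosen for the latent space is a 4-dim Gaussian with a diagonal covariance matrix. The means ($\mu_P$) and the diagonal terms of the covariance matrix ($\sigma_P^2$) are free parameters of the algorithm and are optimized during the back-propagation. The Kullback-Leibler divergence between two Gaussian distributions has an analytic form. Hence, for each batch, $D_{\mathrm{KL}}$ can be expressed as:

$$D_{\mathrm{KL}} = \frac{1}{k} \sum_{i} \sum_{j} \left[ \ln\frac{\sigma_P^{j}}{\sigma_z^{i,j}} + \frac{(\sigma_z^{i,j})^2 + (\mu_z^{i,j} - \mu_P^{j})^2}{2\,(\sigma_P^{j})^2} - \frac{1}{2} \right]$$

where $k$ is the batch size, $i$ runs over the samples and $j$ over the latent space dimensions, and $\mu_z^{i,j}$ and $\sigma_z^{i,j}$ are the mean and width predicted by the encoder for sample $i$ along latent dimension $j$. Similarly, $\mathrm{Loss}_{\mathrm{reco}}$ is the average negative log-likelihood of the inputs given the predicted values:

$$\mathrm{Loss}_{\mathrm{reco}} = -\frac{1}{k} \sum_{i} \sum_{j} \ln f_j\!\left(x_{i,j} \mid \alpha^{i,j}\right)$$

where $j$ runs over the input space dimensions, $f_j$ is the functional form chosen to describe the pdf of the $j$-th input space variable and $\alpha^{i,j}$ are the parameters of the function. Different functional forms have been chosen for $f_j$, to properly describe different classes of HLF distributions: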

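  • Clipped log-normal + $\delta$ function: used to describe $H_T$, $M_J$, the lepton $p_T$, the invariant mass and total transverse momentum of the selected muons and of the selected electrons, ChPFIso, NeuPFIso and GammaPFIso:

  • Gaussian: used for the parallel and orthogonal components of the missing transverse momentum:

  • Truncated Gaussian: a Gaussian function truncated for negative values and normalized to unit area for $x \geq 0$. Used to model $M_T$:

  • Discrete truncated Gaussian: like the truncated Gaussian, but normalized to be evaluated on integers (i.e. $x \in \mathbb{N}$). This function is used to describe the jet, $b$-jet, muon, and electron multiplicities. It is written as: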


    where the normalization factor is set to:

  • Binomial: used for isEle and lepton charge:


    where and are the two possible values of the variable (0 or 1 for isEle and -1 or 1 for lepton charge) and

  • Poisson: used for charged-particle and neutral-hadron multiplicities:


    where .
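To make the structure of the loss concrete, the sketch below implements per-variable negative log-likelihoods for some of the pdf families listed above, the analytic Kullback-Leibler divergence between two one-dimensional Gaussians, and their weighted combination. These are textbook forms written under our own assumptions: the parametrizations may differ from the ones actually used, the clipped log-normal and discrete truncated Gaussian are omitted, the weight of 0.3 is assumed to multiply the KL term, and all function names are ours.

```python
import math


def gaussian_nll(x, mu, sigma):
    """-ln of a Gaussian pdf with mean mu and width sigma."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + 0.5 * ((x - mu) / sigma) ** 2


def truncated_gaussian_nll(x, mu, sigma):
    """-ln of a Gaussian restricted to x >= 0 and renormalized to unit area."""
    if x < 0.0:
        return math.inf
    area = 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))  # P(X >= 0)
    return gaussian_nll(x, mu, sigma) + math.log(area)


def poisson_nll(n, mu):
    """-ln of a Poisson probability with expectation mu."""
    return mu - n * math.log(mu) + math.lgamma(n + 1.0)


def two_valued_nll(x, first_value, p_first):
    """-ln of a two-outcome pdf: p_first for first_value, 1 - p_first otherwise."""
    return -math.log(p_first if x == first_value else 1.0 - p_first)


def gaussian_kl(mu_z, sigma_z, mu_p, sigma_p):
    """KL( N(mu_z, sigma_z^2) || N(mu_p, sigma_p^2) ) for one latent dimension."""
    return (math.log(sigma_p / sigma_z)
            + (sigma_z ** 2 + (mu_z - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def total_loss(reco_nll_per_event, kl_per_event, beta=0.3):
    """Batch-averaged reconstruction NLL plus beta times batch-averaged KL."""
    k = len(reco_nll_per_event)
    return sum(reco_nll_per_event) / k + beta * sum(kl_per_event) / k


# Toy event with three features and a one-dimensional latent space.
reco = (gaussian_nll(12.0, 10.0, 5.0)
        + poisson_nll(3, 2.5)
        + two_valued_nll(1, 1, 0.8))
kl = gaussian_kl(0.2, 0.9, 0.0, 1.0)
print(total_loss([reco], [kl]))
```

For a full event, the reconstruction term is the sum of the per-feature negative log-likelihoods and the KL term is the sum over latent dimensions, before averaging over the batch.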

Figure 4: Training history for the VAE. Total loss, reconstruction NLL and KL divergence are shown separately for the training and validation sets throughout all the training epochs.

The model is implemented in KERAS+TENSORFLOW keras ; tensorflow , trained with the Adam optimizer adam on a SM dataset of 3.45M samples, equivalent to an integrated luminosity of  pb$^{-1}$. The SM validation dataset is made of 3.45M statistically independent samples. Such a sample would be collected in about ten hours of continuous running, under the assumptions made in this study (see Section 1). In training, we fix the batch size to 1000. We use early stopping with patience set to 20 and , and we progressively reduce the learning rate on plateau, with patience set to 8 and .

While optimizing anomaly-detection performance, alternative architectures were tested. For instance, we increased or decreased the dimensionality of the latent space, we changed the value of the free parameter in Eq. (6), we changed the number of neurons in the hidden layers, tried the RMSprop optimizer, and used plain Gaussian priors for the 21 input features. In addition, we tested the use of a VampPrior VAMP . While some of these alternative models improved the encoding-decoding capability of the VAE, no sizable improvement in anomaly-detection performance was observed. For simplicity, we limited our study to the architecture in Fig. 3 and dropped these alternative models.

The model’s training history is shown in Fig. 4. Figure 5 shows the comparison of the input and output distributions for the 21 HLF quantities in the validation dataset. While discrepancies are observed in some tails, good agreement is observed in the bulk of the distributions.

Figure 5: Comparison of input (blue) and output (red) probability distributions for the HLF quantities in the validation sample. Output distributions are obtained by adding the predicted pdf for each event, properly normalized.

4.2 Supervised classifiers

For each of the four BSM benchmark models, we train a fully-supervised classifier, based on a Boosted Decision Tree (BDT). Each BDT receives as input the same 21 features used by the VAE and is trained on a labelled dataset consisting of the SM cocktail (the background) and one of the four BSM benchmark models (the signal). The implementation is done through the Gradient Boosted Regressor of the scikit-learn library scikit-learn , with up to 150 estimators, minimum samples per leaf and maximum depth equal to 3, a learning rate of 0.1, and a tolerance of  on the validation loss function (chosen to be the default deviance). Each BDT, tailored to a specific BSM model, is trained on 3.45M SM events and about 0.5M BSM events, with the BSM events up-weighted so that the two classes have the same impact on the loss function (i.e. the weights are 1 for SM events and  for BSM ones, depending on the actual size of the BSM sample used). In addition, we experimented with fully-connected deep neural networks (DNNs) with two hidden layers. Despite trying different architectures, we did not find a configuration in which DNNs outperformed BDTs. We then decided to use the BDTs as a reference for fully-supervised discrimination capabilities.

Process AUC TPR []
0.98 5.4
0.94 0.2
0.90 0.1
0.97 0.3
Table 2: Classification performance of the four BDT classifiers, trained on the considered BSM benchmark models: area under ROC curve (AUC), and true positive rate (TPR) corresponding to a SM false positive rate , equivalent to the acceptance rate chosen for the VAE.
Figure 6: ROC curves for the fully-supervised BDT classifiers, optimized to separate each of the four BSM benchmark models from the SM cocktail dataset.

Figure 6 shows the ROC curves obtained for the four BDTs. We summarize in Tab. 2 the classification performance of the four supervised BDTs, which sets a qualitative upper limit for the VAE’s results. Overall, the four models can be discriminated with good accuracy, with some loss of performance for those models sharing similarities with specific SM processes (e.g., exhibiting single- and double-lepton topology with missing transverse energy, typical of $t\bar{t}$ events). In the table, we also quote the true-positive rate (TPR) corresponding to a SM false positive rate . This value of the efficiency is the one needed for an average of 1000 SM events per month.
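The two figures of merit quoted in Tab. 2 can be computed from classifier scores as sketched below. This is a plain-Python illustration on toy scores, with function names of our choosing; the quadratic-time AUC is only meant for small samples.

```python
def threshold_for_fpr(background_scores, target_fpr):
    """Score above which at most a fraction target_fpr of the background lies."""
    ordered = sorted(background_scores, reverse=True)
    # Number of background events allowed to pass (epsilon guards float rounding).
    n_pass = int(target_fpr * len(ordered) + 1e-9)
    return ordered[n_pass] if n_pass < len(ordered) else float("-inf")


def true_positive_rate(signal_scores, threshold):
    """Fraction of signal events with a score strictly above the threshold."""
    return sum(s > threshold for s in signal_scores) / len(signal_scores)


def auc(signal_scores, background_scores):
    """Probability that a random signal event outscores a random background one."""
    wins = sum((s > b) + 0.5 * (s == b)
               for s in signal_scores for b in background_scores)
    return wins / (len(signal_scores) * len(background_scores))


# Toy scores, for illustration only.
toy_background = list(range(10))
toy_signal = [5, 8, 9, 10]

cut = threshold_for_fpr(toy_background, target_fpr=0.2)
tpr = true_positive_rate(toy_signal, cut)
area = auc(toy_signal, toy_background)
print(cut, tpr, area)
```

The same threshold logic applies to the VAE loss: the cut is fixed on SM events alone, and the signal efficiency is then read off at that cut.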

Figure 7: Distribution of the loss components: (left) and (right) for the validation dataset. For comparison, the corresponding distribution for the SM processes and the four benchmark BSM models are shown. The vertical line represents a lower threshold such that of the SM events would be retained, equivalent to expected SM events per month.

5 Results with VAE

An event is classified as anomalous whenever the associated loss, computed from the VAE output, is above a given threshold. Since no BSM signal was observed so far, it is reasonable to expect that a new-physics signal, if any, would be characterized by a low production cross section and/or features very similar to those of a SM process. In view of this, we decided to use a tight threshold value, in order to reduce as much as possible any SM contribution.

Figure 7 shows the distribution of and loss components for the validation dataset. In both plots, the vertical line represents a lower threshold such that a of the SM events would be retained. This threshold value would result in SM events to be selected every month, i.e., a daily rate of events, as illustrated in Table 3. The acceptance rate is calculated assuming the LHC running conditions listed in Section 1. Table 3 also reports the by-process VAE selection efficiency and the relative background composition of the selected sample.

Figure 7 also shows and distribution for the four benchmark BSM models. We observe that the discrimination power, loosely quantified by the integral of these distributions above threshold, is better for than and that the impact of the term on discrimination is negligible. Anomalies are then defined as events lying on the right tail of the expected distribution.

The left plot in Fig. 8 shows the ROC curves obtained from the distribution of the four BSM benchmark models and the SM cocktail, compared to the corresponding BDT curves of Section 4.2. The right plot in Fig. 8 shows the p-value computed from the cocktail SM distribution, both for the SM events themselves (flat by construction) and for the four BSM processes. As the plot shows, BSM processes tend to concentrate at small p-values, which allows their identification as anomalies.
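The one-sided p-value associated to an event can be sketched as the fraction of SM reference events whose loss is at least as large as the observed one. The snippet below is a minimal empirical version on toy numbers, with function names of our choosing; evaluated on SM events themselves, it yields a flat distribution by construction.

```python
import bisect


def empirical_p_value(loss, sorted_sm_losses):
    """One-sided p-value: fraction of SM reference events with loss >= observed."""
    n_below = bisect.bisect_left(sorted_sm_losses, loss)  # events with smaller loss
    return (len(sorted_sm_losses) - n_below) / len(sorted_sm_losses)


def is_anomalous(loss, sorted_sm_losses, p_threshold):
    """Flag an event whose p-value falls below the chosen threshold."""
    return empirical_p_value(loss, sorted_sm_losses) < p_threshold


# Toy SM reference losses, for illustration only.
toy_sm_losses = sorted([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

for observed in (0.0, 9.5, 11.0):
    print(observed, empirical_p_value(observed, toy_sm_losses),
          is_anomalous(observed, toy_sm_losses, p_threshold=0.15))
```

The resolution of this estimate is limited by the size of the reference sample, so a tight threshold requires a correspondingly large SM validation set.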

Standard Model processes
Process VAE selection Sample composition Event/month
Table 3: By-process acceptance rate for the anomaly detection algorithm described in the text, computed applying the lower threshold on shown in Figure 7. The threshold is tuned such that a fraction of about of SM events would be accepted, corresponding to events/day and events/month (assuming an average luminosity per month of ). The sample composition refers to the subset of SM events accepted by the anomaly detection algorithm. All quoted uncertainties refer to CL regions.
Figure 8: Left: ROC curves for the VAE trained only on SM mix (solid), compared to the corresponding curves for the four supervised BDT models (dashed) described in Section 4.2. Right: p-value distribution for the SM cocktail events and the four BSM benchmark processes.

Table 4 summarizes the VAE’s performance on the four BSM benchmark models. Together with the selection efficiency corresponding to , the table reports the effective cross section (cross section after applying the trigger requirements) that would correspond to 100 selected events in a month (assuming an integrated luminosity of 5 fb$^{-1}$). Similarly, we quote the cross section that would result in a signal-to-background ratio of 1/3 on the sample of events selected by the VAE. The VAE can probe the four models down to relatively low cross section values, comparable to those that are typically probed in dedicated fully-supervised searches. As a comparison, Ref. Chatrchyan:2012sv excludes a with a mass of 150 GeV and production cross section larger than  pb, using 4.8 fb$^{-1}$ at a center-of-mass energy of 7 TeV, while most recent searches Sirunyan:2017yrk only cover larger mass values.

BSM benchmark processes
Process | VAE selection efficiency | Cross section for 100 events/month [pb] | Cross section for S/B = 1/3 [pb]
7.1 27
30 110
55 210
17 65
Table 4: Breakdown of BSM processes efficiency, and cross section values corresponding to 100 selected events in a month and to a signal-over-background ratio of 1/3. The monthly event yield is computed assuming an average luminosity per month of 5 fb$^{-1}$, computed from the LHC 2016 data delivery (40 fb$^{-1}$ collected in 8 months). All quoted efficiencies are computed fixing the VAE loss threshold . The quoted uncertainties correspond to a CL region.

6 How to deploy a VAE for BSM detection

The work presented in this paper suggests the possibility of deploying VAEs as trigger algorithms associated to dedicated data streams. These triggers would isolate anomalous events, similarly to what was done by the CMS experiment at the beginning of the first LHC run. At that time, with an early new physics signal being a possibility, the CMS experiment deployed online a set of algorithms (collectively called hot line) to select potentially interesting new-physics candidates. Anomalies were characterized as events with high-$p_T$ particles or high particle multiplicities, in line with the kind of early-discovery new physics scenarios then considered. The events populating the hot-line stream were immediately processed at the CERN computing center (as opposed to traditional physics streams, which are processed after 48 hours). The hot-line algorithms were tuned to collect O(10) events per day, which were then visually inspected by experts.

While the focus of the work presented in this paper is not an early discovery, the spirit of the application we propose would be similar: a set of VAEs deployed online would select a limited number of events every day. These events would be collected in a dedicated dataset and further analyzed. The analysis technique could range from visual inspection of the collisions to detailed studies of reconstructed objects, up to some kind of model-independent analysis of the collected dataset, e.g. a deep-learning implementation of a model-independent hypothesis test DAgnolo:2018cun applied directly to the loss distribution (provided a reliable sample of background-only data is available).

While a pure SM sample to train VAEs could only be obtained from Monte Carlo simulation, the presence of outlier contamination in the training sample typically has a tiny impact on performance. One could then imagine training the VAE models on the data collected so far and using them on a much larger dataset. In our study, we consider a training dataset of  pb⁻¹ and apply the VAE to a larger dataset. One could even envision more frequent re-trainings (e.g., every factor increase in integrated luminosity or in the presence of substantial changes in detector and/or accelerator conditions). Such a training could happen offline on a dedicated dataset, e.g., one collected by deploying triggers that randomly select events entering the last stage of the trigger system. The training could even happen online, assuming the availability of sufficient computing resources.

To demonstrate the feasibility of a train-on-data strategy, we enrich the dataset used in Section 4 with a signal contamination of events. As a starting point, the amount of injected signal is tuned to a luminosity of 100 pb⁻¹ and a cross section of 7.1 pb, corresponding to the value at which the VAE in Section 4 would select 100 events in 5 fb⁻¹. This results in about 700 events added to the training sample. The VAE is trained following the procedure outlined in Section 4 and its performance is compared to that obtained on a signal-free dataset of the same size. The comparison of the ROC curves for the two models is shown in Fig. 9. In the same figure, we show similar results, derived injecting a and signal contamination. A degradation of the VAE’s performance is observed once the signal cross section is set to 710 pb (i.e., 100 times the sensitivity value found in Section 4). At that point, the contamination is so large that the signal becomes as abundant as events and would have easily detectable consequences. For comparison, at a production cross section of 27 pb a third of the events selected by the VAE in Section 4 would come from production (see Table 4), and this would have negligible consequences on the training quality. This test shows that a robust anomaly-detecting VAE could be trained directly on data, even in the presence of previously undetected (e.g., at Tevatron and at the 7-TeV and 8-TeV LHC) BSM signals.
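
The size of the injected sample follows from N = σ · L. The following is a minimal sketch of that arithmetic, taking the base cross section as 7.1 pb (the value consistent with about 700 events in 100 pb⁻¹) and ignoring any acceptance or preselection efficiency at the injection stage, which is an assumption made here for simplicity. The function name is illustrative.

```python
def expected_events(xsec_pb, lumi_inv_pb, efficiency=1.0):
    """Expected event count N = sigma * L * efficiency (sigma in pb, L in pb^-1)."""
    return xsec_pb * lumi_inv_pb * efficiency


TRAIN_LUMI_INV_PB = 100.0   # luminosity of the training sample quoted in the text
BASE_XSEC_PB = 7.1          # cross section giving 100 selected events in 5 fb^-1

# Injected events at the base cross section and at 100 times that value.
for scale in (1, 100):
    n_injected = expected_events(BASE_XSEC_PB * scale, TRAIN_LUMI_INV_PB)
    print(f"sigma = {BASE_XSEC_PB * scale:g} pb -> about {n_injected:.0f} injected events")

# Selection efficiency implied by "100 selected events in 5 fb^-1" at 7.1 pb.
implied_efficiency = 100 / expected_events(BASE_XSEC_PB, 5_000.0)
print(f"implied selection efficiency = {implied_efficiency:.2e}")
```

The same relation shows why the 710 pb scenario is extreme: it injects two orders of magnitude more signal events into the same 100 pb⁻¹ training sample.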

Figure 9: ROC curves for the VAE trained on SM samples with and without signal contamination. Different levels of contamination are reported, corresponding to ( pb, equal to the cross section estimated to give 100 events per month), ( pb) and ( pb) of the training sample.
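
A ROC curve of this kind can be built directly from the VAE loss: each threshold on the loss gives one (false-positive rate, true-positive rate) point, and two models can then be compared point by point or through the area under the curve. The following is a minimal pure-Python sketch, with hypothetical function names and toy loss values standing in for the outputs of a trained model.

```python
def roc_points(background_losses, signal_losses):
    """Return (false-positive rate, true-positive rate) pairs, scanning the
    loss threshold from the largest observed value to the smallest."""
    thresholds = sorted(set(background_losses) | set(signal_losses), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(x >= t for x in background_losses) / len(background_losses)
        tpr = sum(x >= t for x in signal_losses) / len(signal_losses)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy losses (hypothetical): signal events tend to have a larger loss.
toy_background = [0.1, 0.2, 0.3, 0.4]
toy_signal = [0.35, 0.5, 0.6, 0.7]

toy_roc = roc_points(toy_background, toy_signal)
toy_auc = area_under_curve(toy_roc)
print(f"toy AUC = {toy_auc:.4f}")
```

Running the same two functions on the losses of a model trained on a clean sample and of one trained on a contaminated sample would give the kind of comparison shown in the figure. The quadratic-time scan is chosen for clarity and would need replacing for large samples.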

7 Conclusions

We present a strategy to isolate potential BSM events produced by the LHC, using variational autoencoders trained on a reference SM sample. Such an algorithm could be used in the trigger system of general-purpose LHC experiments to isolate recurrent anomalies, which might otherwise escape observation (e.g., being filtered out by a typical trigger selection). Taking as an example a single-lepton data stream, we show how such an algorithm could select datasets enriched with events originating from challenging BSM scenarios. We also discuss how the model training could happen directly on data, with no sizable performance loss.

The final outcome of the analysis would be a list of anomalous events, which the experimental collaborations could further scrutinize and even release as a catalog, similarly to what is typically done in other scientific domains. Repeated patterns in these events could motivate new scenarios for beyond-the-standard-model physics and inspire new searches, to be performed on future data with traditional supervised approaches.

We believe that such an application could help extend the physics reach of the current and next stages of the CERN LHC.


This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 772369) and the United States Department of Energy, Office of High Energy Physics Research under Caltech Contract No. DE-SC0011925. This work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".