1 Introduction
One of the main motivations behind the construction of the CERN Large Hadron Collider (LHC) is the exploration of the high-energy frontier in search of new physics phenomena. This new physics could answer some of the outstanding fundamental questions in particle physics, e.g., the nature of dark matter or the origin of electroweak symmetry breaking. In LHC experiments, searches for physics beyond the Standard Model (BSM) are typically carried out as fully-supervised data analyses: assuming a new physics scenario of some kind, a search is structured as a hypothesis test based on profiled likelihood ratios ATLAS:2011tau . These searches are said to be model dependent, since they rely on a specific new physics model.
Assuming that one is testing the right model, this approach is very effective in discovering a signal, as demonstrated by the LHC searches for the Standard Model (SM) Higgs boson Aad:2012tfa ; Chatrchyan:2012xdj . On the other hand, given the (so far) negative outcome of many BSM searches at the LHC and at other particle-physics experiments, it is possible that a future BSM model, if any, is not among those typically tested. The problem is more profound if analyzed in the context of the LHC big-data problem: at the LHC, 40 million proton-beam collisions are produced every second, but only 1000 collision events per second can be stored by the ATLAS and CMS experiments, due to limited bandwidth, processing, and storage resources. It is possible to imagine BSM scenarios that would escape detection, simply because the corresponding new physics events would be rejected by a typical set of online selection algorithms.
Establishing alternative search methodologies with reduced model dependence is an important aspect of future LHC runs. Traditionally, this issue was addressed with so-called model-independent searches, performed at the Tevatron Aaltonen:2008vt ; Abazov:2011ma , at HERA Aaron:2008aa , and at the LHC CMSPASEXO14016 ; Aaboud:2018ufy , as discussed in Section 2.
In this paper, we propose to address this need by deploying an unsupervised algorithm in the online selection system of the LHC experiments. This algorithm would be trained on known SM processes and could identify BSM events as anomalies. The selected events could be stored in a special stream, scrutinized by experts (e.g., to exclude detector malfunctioning that could explain the anomalies), and even released outside the experimental collaborations, in the form of an open-access catalog. The final goal of this application is to identify anomalous event topologies and inspire future supervised searches on data collected afterwards.
As a proof of principle, we consider the case of a typical single-lepton data stream, selected by the hardware-based L1 trigger system. On this stream of data, a variational autoencoder (VAE) is trained to compress the input event representation into a low-dimensional latent space, which is then decoded to return the shape parameters describing the probability density function (pdf) of each input quantity, given a point in the compressed space. The event distribution of a suitable test statistic, namely part of the VAE loss function, is used to perform a one-sided p-value test, associating to each incoming event the probability of originating from known SM processes. A p-value threshold is applied to decide which events should be included in a low-rate anomalous-event data stream. In this work, we set the threshold such that a fixed number of events could be collected every day under current LHC operating conditions. In particular, we took as a reference 8 months of data taking per year, with an integrated luminosity matching that of 2016. Assuming an LHC duty cycle of 2/3, this corresponds to the 2016 average instantaneous luminosity. We then measure the BSM production cross section that would correspond to a signal excess of 100 events/month, as well as the one that would give a signal yield equal to a fixed fraction of the daily SM yield. For this, we consider a set of low-mass BSM resonances, decaying to one or more leptons and light enough to be challenging for the currently employed LHC trigger algorithms.
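The threshold choice described above can be illustrated with a toy computation: given the loss distribution of SM events and the rate of the input stream, the cut is the quantile whose tail matches the target daily budget. The stream rate, budget, and loss shape below are illustrative placeholders, not the paper's values:

```python
import numpy as np

def loss_threshold(sm_losses, sm_rate_hz, target_per_day):
    """Pick the loss cut whose SM acceptance matches a daily event budget.

    sm_losses: per-event loss values evaluated on a pure-SM reference sample.
    sm_rate_hz: rate (Hz) of the input single-lepton stream (illustrative).
    target_per_day: desired number of accepted SM events per day.
    """
    seconds_per_day = 86400
    # Fraction of SM events that fits in the daily budget.
    eps = target_per_day / (sm_rate_hz * seconds_per_day)
    # The threshold is the (1 - eps) quantile of the SM loss distribution.
    return np.quantile(sm_losses, 1.0 - eps)

# Toy example: exponential-looking SM losses, 100 Hz stream, 1000 events/day.
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=1_000_000)
thr = loss_threshold(losses, sm_rate_hz=100.0, target_per_day=1000)
accepted = np.mean(losses > thr)  # fraction of SM events passing the cut
```

The same construction applies regardless of the loss shape, since only the empirical quantile of the SM reference sample enters the computation.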
This paper is structured as follows: we discuss related works in Section 2. Section 3 gives a brief description of the dataset used. Section 4
describes the VAE model used in the study, as well as a set of fullysupervised classifiers used for performance comparison. Results are discussed in Section
5. In Section 6 we discuss how such an application could be used in a typical LHC experimental environment. Conclusions are given in Section 7.

2 Related Work
Model-independent searches for new physics have been performed at the Tevatron Aaltonen:2008vt ; Abazov:2011ma , at HERA Aaron:2008aa , and at the LHC CMSPASEXO14016 ; Aaboud:2018ufy . These searches are based on the comparison of a large set of binned distributions to the predictions from Monte Carlo simulation, in search of bins exhibiting a deviation larger than some predefined threshold. While the effectiveness of this strategy in establishing a discovery has been a matter of discussion, a recent study by the ATLAS collaboration Aaboud:2018ufy has rephrased this model-independent search strategy as a tool to identify interesting excesses, on which traditional analysis techniques could be performed using independent datasets (e.g., the data collected after running the model-independent analysis). This change of scope has the advantage of reducing the trial factor (i.e., the so-called look-elsewhere effect 2008arXiv0811.1663L ; Gross:2010qma ), which washes out the significance of an observed excess.
Our strategy is similar to what is proposed in Ref. Aaboud:2018ufy
, with two substantial differences: (i) we aim to monitor also those events that would be discarded by the online selection, by running the algorithm in the trigger system; (ii) we do so by exploiting deep-learning-based anomaly-detection techniques.
Recent works DAgnolo:2018cun ; Collins:2018epr ; DeSimone:2018efk ; Hajer:2018kqm
have investigated the use of machine-learning techniques to set up new strategies for BSM searches with minimal or no assumptions on the specific new-physics scenario under investigation. In this work, we use variational autoencoders based on high-level features as a baseline. Previously, autoencoders have been used in collider physics for detector monitoring CMSdqm ; CMSdc and event generation ATLSOFTPUB2018001 . Autoencoders have also been explored to define a jet tagger that would identify new-physics events with anomalous jets Heimel:2018mkt ; AEjets , with a strategy similar to what we apply to the full event in this work.

3 Data samples
The dataset used for this study is a refined version of the high-level-feature (HLF) dataset used in Ref. TOPCLASS . Proton-proton collisions are generated using the PYTHIA8 event-generation library pythia , fixing the center-of-mass energy to the LHC Run II value (13 TeV) and the average number of overlapping collisions per beam crossing (pileup) to a fixed value. These beam conditions loosely correspond to the LHC operating conditions in 2016.
Events generated by PYTHIA8 are processed with the DELPHES library delphes , to emulate detector efficiency and resolution effects. We take as benchmark detector description the upgraded design of the CMS detector, foreseen for the High-Luminosity LHC phase CMS_TP . In particular, we use the CMS HL-LHC detector card distributed with DELPHES. We run the DELPHES particle-flow (PF) algorithm, which combines information from different detector components to derive a list of reconstructed particles, the so-called PF candidates. For each particle, the algorithm returns the measured energy and flight direction. Each particle is associated with one of three classes: charged particles, photons, and neutral hadrons. In addition, lists of reconstructed electrons and muons are given.
Events are filtered at generation by requiring an electron, muon, or tau lepton above a minimum transverse momentum. Once detector effects are taken into account through the DELPHES simulation, events are further selected by requiring the presence of one reconstructed electron or muon above a minimum transverse momentum and satisfying a loose isolation requirement, where the isolation is computed as:
$\mathrm{Iso} = \frac{\sum_{p} p_{\mathrm{T}}^{\,p}}{p_{\mathrm{T}}^{\,\ell}}$ (1)
and the sum extends over all the photons, charged hadrons, and neutral hadrons within a cone around the lepton.^1 ^1 As is common in collider physics, we use a Cartesian coordinate system with the $z$ axis oriented along the beam axis, the $x$ axis on the horizontal plane, and the $y$ axis oriented upward. The $x$ and $y$ axes define the transverse plane, while the $z$ axis identifies the longitudinal direction. The azimuth angle $\phi$ is computed from the $x$ axis. The polar angle $\theta$ is used to compute the pseudorapidity $\eta = -\log\tan(\theta/2)$. We fix units such that $c = \hbar = 1$.
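Schematically, the isolation of Eq. (1) is a scalar transverse-momentum sum inside a ΔR cone around the lepton, divided by the lepton transverse momentum. A minimal numpy sketch follows; the cone size used here is an illustrative choice, not necessarily the value used in the paper:

```python
import numpy as np

def pf_isolation(lep_pt, lep_eta, lep_phi, cand_pt, cand_eta, cand_phi, cone=0.3):
    """Relative PF isolation: scalar pT sum of the candidates inside a
    ΔR cone around the lepton, divided by the lepton pT (Eq. 1)."""
    dphi = np.abs(cand_phi - lep_phi)
    dphi = np.where(dphi > np.pi, 2 * np.pi - dphi, dphi)  # wrap azimuth into [0, pi]
    dr = np.sqrt((cand_eta - lep_eta) ** 2 + dphi ** 2)
    in_cone = dr < cone
    return np.sum(cand_pt[in_cone]) / lep_pt

# Toy check: one candidate inside the cone (pT = 5), one outside (pT = 10),
# around a lepton with pT = 50 at (eta, phi) = (0, 0).
cand_pt = np.array([5.0, 10.0])
cand_eta = np.array([0.1, 1.0])
cand_phi = np.array([0.1, 1.0])
iso = pf_isolation(50.0, 0.0, 0.0, cand_pt, cand_eta, cand_phi)  # 5 / 50 = 0.1
```

The per-class isolations (ChPFIso, NeuPFIso, GammaPFIso) would be obtained by passing only the candidates of the corresponding PF class.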
The 21 considered HLF quantities are:

The absolute value of the isolatedlepton transverse momentum .

The three isolation quantities (ChPFIso, NeuPFIso, GammaPFIso) for the isolated lepton, computed with respect to charged particles, neutral hadrons and photons, respectively.

The lepton charge.

A Boolean flag (isEle) set to 1 when the trigger lepton is an electron, 0 otherwise.

The number of jets entering the sum ().

The invariant mass of the set of jets entering the sum ().

The number of these jets identified as originating from a $b$ quark ().

The missing transverse momentum, decomposed into its parallel () and orthogonal () components with respect to the isolated-lepton direction. The missing transverse momentum is defined as the negative vector sum of the transverse momenta of the PF candidates:

$\vec{p}_{\mathrm{T}}^{\,\mathrm{miss}} = -\sum_{q} \vec{p}_{\mathrm{T}}^{\,q}$ (2)
The transverse mass, $M_{\mathrm{T}}$, of the isolated lepton and the missing transverse momentum, defined as:

$M_{\mathrm{T}} = \sqrt{2\, p_{\mathrm{T}}^{\,\ell}\, p_{\mathrm{T}}^{\,\mathrm{miss}} \left(1 - \cos\Delta\phi\right)}$ (3)

with $\Delta\phi$ the azimuthal separation between the lepton and missing-transverse-momentum vectors, and $p_{\mathrm{T}}^{\,\mathrm{miss}}$ the absolute value of $\vec{p}_{\mathrm{T}}^{\,\mathrm{miss}}$.

The number of selected muons ().

The invariant mass of this set of muons ().

The absolute value of the total transverse momentum of these muons ().

The number of selected electrons ().

The invariant mass of this set of electrons ().

The absolute value of the total transverse momentum of these electrons ().

The number of reconstructed charged hadrons.

The number of reconstructed neutral hadrons.
This list of HLF quantities is not defined with a specific BSM scenario in mind. Instead, it is conceived to include the relevant information to discriminate between the various SM processes populating the single-lepton data stream. At the same time, it is generic enough to allow (at least in principle) the identification of a large set of new-physics scenarios.
Many SM processes would contribute to the considered singlelepton dataset. For simplicity, we restrict the list of relevant SM processes to the four with highest production cross section, namely:

Inclusive production, with ().

Inclusive production, with ().

production.

QCD multijet production.^2 ^2 To speed up the generation process for QCD events, we apply a generator-level requirement on the transverse momentum, since the fraction of QCD events below this threshold that produce a lepton within acceptance is negligible but computationally expensive to simulate.
These samples are mixed to provide a SM cocktail dataset, which is then used to train the autoencoder models and to tune the threshold requirement that defines what we consider an anomaly. The cocktail is built by scaling down the three high-statistics samples to the lowest-statistics one (QCD, whose generation is the most computationally expensive), according to their production cross-section values (estimated at leading order with PYTHIA) and selection efficiencies (shown in Tab. 1).

Standard Model processes
Process  Acceptance  Trigger  Cross  Events  Event 
efficiency  section [nb]  fraction  /month  
110M  
QCD  63M  
12M  
0.6M 
BSM benchmark processes  
Process  Acceptance  Trigger  Total  Crosssection 
efficiency  efficiency  100 events/month  
436 fb  
166 fb  
335 fb  
163 fb 
The monthly event yield is computed assuming the conditions discussed in Section 1.
In addition, we consider the following BSM models to benchmark anomaly-detection capabilities:

A leptoquark with mass 80 GeV, decaying to a quark and a lepton.

A neutral scalar boson with a mass of 50 GeV, decaying to two off-shell bosons, each forced to decay to two leptons.

A scalar boson with mass 60 GeV, decaying to two tau leptons: .

A charged scalar boson with mass 60 GeV, decaying to a tau lepton and a neutrino: .
For each BSM scenario, we consider any direct production mechanism implemented in PYTHIA8, including associated jet production. We list in Tab. 1 the leading-order production cross section and selection efficiency for each model.
Figures 1 and 2 show the distribution of HLF quantities for the SM processes and the BSM benchmark models, respectively.
4 Model description
We train autoencoders (AEs) on the SM cocktail sample described in Section 3, taking as input the 21 HLF quantities listed there. The use of HLF quantities to represent events limits the model independence of the anomaly-detection procedure. While the list of features is chosen to represent the main physics aspects of the considered SM processes, and is in no way tailored to specific BSM models, such a list might be more suitable for certain models than for others. In this respect, one cannot guarantee that the anomaly-detection performance observed on a given BSM model would generalize to any BSM scenario. We will address in future work a possible solution to reduce the model dependence carried by the input event representation.
In this section, we present both the best-performing autoencoder model and a set of supervised classifiers, trained to distinguish each of the four BSM benchmark models from SM events. We use the classification performance of these supervised algorithms as an estimate of the best performance that the VAE could reach.
4.1 Autoencoders
Autoencoders are algorithms that compress a given set of input variables into a latent space (encoding) and then, starting from the latent space, reconstruct the HLF input values (decoding). In the context of anomaly detection, autoencoders are used to associate a p-value to a given event through a quantification of the encoding-decoding distance.
In this work we focus on VAEs 2013arXiv1312.6114K . Unlike traditional AEs, VAEs return pdfs for the event in the latent and original spaces, rather than the encoded point in the latent space and the decoded values of the input quantities. The functional form of each pdf is specified a priori through the loss function, while the pdf shape parameters are the output of a trainable function of the inputs. This function is the VAE itself and is determined during training.
We consider the VAE architecture shown in Fig. 3, characterized by a four-dimensional latent space. Each latent dimension is associated with a Gaussian pdf, described by its two degrees of freedom (mean and variance). The input layer consists of 21 nodes, corresponding to the 21 HLF quantities described in Section 3. This layer is connected to the latent space through two hidden dense layers, each consisting of 50 neurons with ReLU activation functions. Two four-neuron layers are connected to the second hidden layer. Linear activation functions are used for the first of these four-neuron layers, whose nodes are interpreted as the mean values of the latent-space Gaussian pdfs. The nodes of the second layer are activated by the function:

(4)

This activation, inspired by wiki:activation_function , was chosen to improve training stability: it is strictly positive, nonlinear, and does not involve exponentials, which might create instabilities in early epochs. These four nodes are interpreted as the standard-deviation parameters of the latent-space four-dimensional Gaussian. After several trials, the dimension of the latent space was set to 4, in order to preserve training stability without compromising the VAE performance. The decoding step originates from a point in the latent space, sampled according to the predicted pdf (green oval in Fig. 3). The coordinates of this point are fed into a sequence of two hidden dense layers, each consisting of 50 neurons with ReLU activation functions. The last of these layers is connected to three dense layers of 21, 17, and 10 neurons, activated by linear, pISRLU, and clipped-tanh functions, respectively. The clipped-tanh function is written as:

(5)
The 48 output nodes represent the parameters of the pdfs describing the input HLF quantities, which enter the loss function to be minimized.
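For concreteness, the encoder half of this architecture and the reparameterized sampling step can be sketched in plain numpy. The weights here are random and untrained, serving only to trace shapes; the positive activation shown is a stand-in, since the exact form of Eq. (4) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def positive_act(x):
    # Stand-in for the paper's strictly positive, exponential-free
    # activation (Eq. 4); an ISRLU-style shifted form is one possibility.
    return 1.0 + x / np.sqrt(1.0 + x * x)  # maps R into (0, 2)

def dense(n_in, n_out):
    # Random weights, just to trace shapes; training is not sketched here.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

# Encoder: 21 inputs -> 50 -> 50 -> (mu, sigma), each of latent dimension 4.
W1, b1 = dense(21, 50)
W2, b2 = dense(50, 50)
Wmu, bmu = dense(50, 4)
Wsig, bsig = dense(50, 4)

def encode(x):
    h = relu(relu(x @ W1 + b1) @ W2 + b2)
    mu = h @ Wmu + bmu                      # linear activation for the means
    sigma = positive_act(h @ Wsig + bsig)   # positive activation for the widths
    return mu, sigma

def sample(mu, sigma):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).
    return mu + sigma * rng.standard_normal(mu.shape)

x = rng.normal(size=(8, 21))   # a toy batch of 8 events, 21 HLF features each
mu, sigma = encode(x)
z = sample(mu, sigma)          # latent point fed to the decoder
```

The decoder would mirror this structure (two 50-neuron ReLU layers) and end in the 48 output nodes carrying the pdf shape parameters.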
The VAE loss function is a weighted sum of two terms: the negative log-probability of the inputs given the predicted output pdf parameters ($\mathcal{L}_{\mathrm{reco}}$) and the Kullback-Leibler divergence ($D_{\mathrm{KL}}$) between the latent-space pdf and the prior:

$\mathrm{Loss} = (1-\beta)\,\mathcal{L}_{\mathrm{reco}} + \beta\, D_{\mathrm{KL}}$ (6)

where $\beta$ is a free parameter set to 0.3. The prior chosen for the latent space is a four-dimensional Gaussian with a diagonal covariance matrix. The means ($\mu_{p,k}$) and the diagonal terms of the covariance matrix ($\sigma_{p,k}^2$) are free parameters of the algorithm and are optimized during backpropagation. The Kullback-Leibler divergence between two Gaussian distributions has an analytic form. Hence, for each batch, $D_{\mathrm{KL}}$ can be expressed as:

$D_{\mathrm{KL}} = \frac{1}{B}\sum_{i=1}^{B}\sum_{k=1}^{4}\left[\log\frac{\sigma_{p,k}}{\sigma_k(x_i)} + \frac{\sigma_k^2(x_i) + \left(\mu_k(x_i)-\mu_{p,k}\right)^2}{2\,\sigma_{p,k}^2} - \frac{1}{2}\right]$ (7)

where $B$ is the batch size, $i$ runs over the samples, and $k$ over the latent-space dimensions. Similarly, $\mathcal{L}_{\mathrm{reco}}$ is the negative average log-likelihood of the inputs given the predicted pdf parameters:

$\mathcal{L}_{\mathrm{reco}} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{21}\log P_j\!\left(x_{ij};\,\alpha_j(x_i)\right)$ (8)

where $j$ runs over the input-space dimensions, $P_j$ is the functional form chosen to describe the pdf of the $j$-th input-space variable, and $\alpha_j$ are the parameters of that function. Different functional forms are chosen for the $P_j$, to properly describe the different classes of HLF distributions:

Clipped lognormal + δ function: used to describe , , , , , , , ChPFIso, NeuPFIso, and GammaPFIso:

$P(x; f,\mu,\sigma) = f\,\delta(x) + (1-f)\,\frac{1}{x\,\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$ (9)
Gaussian: used for and :

$G(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ (10)
Truncated Gaussian: a Gaussian function truncated for negative values and normalized to unit area for $x \geq 0$. Used to model :

$P(x;\mu,\sigma) = \frac{G(x;\mu,\sigma)}{\int_0^{\infty} G(x';\mu,\sigma)\,dx'}\quad \text{for } x \geq 0$ (11)
Discrete truncated Gaussian: like the truncated Gaussian, but normalized to be evaluated on integers (i.e., $n = 0, 1, 2, \dots$). This function is used to describe , , and . It is written as:

$P(n;\mu,\sigma) = \frac{G(n;\mu,\sigma)}{A}$ (12)

where the normalization factor is set to:

$A = \sum_{m=0}^{\infty} G(m;\mu,\sigma)$ (13)
Binomial: used for IsEle and lepton charge:

$P(x;p) = p\,\delta_{x,x_1} + (1-p)\,\delta_{x,x_0}$ (14)

where $x_0$ and $x_1$ are the two possible values of the variable (0 or 1 for IsEle, −1 or 1 for lepton charge) and $p$ is the corresponding probability parameter returned by the network.

Poisson: used for the charged-particle and neutral-hadron multiplicities:

$P(n;\lambda) = \frac{\lambda^n e^{-\lambda}}{n!}$ (15)

where $\lambda$ is the predicted mean multiplicity, returned by the network.
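The two loss pieces are straightforward to compute: the KL term of Eq. (7) has a closed form for diagonal Gaussians, and each reconstruction term is a negative log-likelihood under the chosen pdf. A minimal numpy sketch, using the Gaussian form of Eq. (10) as the example pdf:

```python
import numpy as np

def kl_term(mu, sigma, mu_p, sigma_p):
    """Closed-form KL(N(mu, sigma^2) || N(mu_p, sigma_p^2)), summed over the
    latent dimensions and averaged over the batch, as in Eq. (7)."""
    kl = (np.log(sigma_p / sigma)
          + (sigma ** 2 + (mu - mu_p) ** 2) / (2.0 * sigma_p ** 2)
          - 0.5)
    return float(np.mean(np.sum(kl, axis=1)))

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of a Gaussian-modelled input feature: one of
    the per-feature terms entering the reconstruction loss of Eq. (8)."""
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                         + (x - mu) ** 2 / (2.0 * sigma ** 2)))

# The KL term vanishes when the encoded pdf matches the prior exactly.
mu = np.zeros((16, 4))
sigma = np.ones((16, 4))
kl_zero = kl_term(mu, sigma, np.zeros(4), np.ones(4))
```

The other pdf families (lognormal, truncated and discrete Gaussians, binomial, Poisson) would contribute analogous per-feature negative log-likelihood terms, each parameterized by the decoder outputs.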
The model is implemented in KERAS+TENSORFLOW keras ; tensorflow , trained with the Adam optimizer adam on a SM dataset of 3.45M samples. The SM validation dataset consists of 3.45M statistically independent samples. Such a sample would be collected in about ten hours of continuous running, under the assumptions made in this study (see Section 1). In training, we fix the batch size to 1000. We use early stopping with the patience set to 20 epochs, and we progressively reduce the learning rate on plateau, with the patience set to 8 epochs.
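In Keras, the early-stopping and learning-rate schedule described above map onto standard callbacks. The patience values follow the text; the `min_delta` and `factor` values below are illustrative placeholders, since the exact thresholds are not reproduced here:

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Early stopping (patience 20) and learning-rate reduction on plateau
# (patience 8), both monitoring the validation loss. min_delta and factor
# are illustrative, not the paper's exact settings.
callbacks = [
    EarlyStopping(monitor="val_loss", patience=20, min_delta=1e-4),
    ReduceLROnPlateau(monitor="val_loss", patience=8, factor=0.5, min_delta=1e-4),
]
```

These callbacks would be passed to `model.fit(..., callbacks=callbacks)` along with the training and validation datasets.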
While optimizing the anomaly-detection performance, alternative architectures were tested. For instance, we increased or decreased the dimensionality of the latent space, changed the value of the weight parameter in Eq. (6), changed the number of neurons in the hidden layers, tried the RMSprop optimizer, and used plain Gaussian priors for the 21 input features. In addition, we tested the use of a VampPrior VAMP . While some of these alternative models improved the encoding-decoding capability of the VAE, no sizable improvement in anomaly-detection performance was observed. For simplicity, we limited our study to the architecture in Fig. 3 and dropped these alternative models.

4.2 Supervised classifiers
For each of the four BSM benchmark models, we train a fully-supervised classifier, based on a Boosted Decision Tree (BDT). Each BDT receives as input the same 21 features used by the VAE and is trained on a labelled dataset consisting of the SM cocktail (the background) and one of the four BSM benchmark models (the signal). The implementation uses the gradient-boosting regressor of the scikit-learn library scikitlearn , with up to 150 estimators, a minimum number of samples per leaf, a maximum depth equal to 3, a learning rate of 0.1, and a tolerance on the validation loss function (chosen to be the default deviance). Each BDT, tailored to a specific BSM model, is trained on 3.45M SM events and about 0.5M BSM events, the latter upweighted so that the two classes have the same impact on the loss function (i.e., the weights are 1 for SM events and correspondingly larger for BSM ones, depending on the actual size of the BSM sample used). In addition, we experimented with fully-connected deep neural networks (DNNs) with two hidden layers. Despite trying different architectures, we did not find a configuration in which DNNs outperformed BDTs. We therefore use the BDTs as a reference for fully-supervised discrimination capabilities.
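A sketch of such a supervised training on toy data, using scikit-learn's gradient-boosting classifier (the classification counterpart of the regressor named above). The dataset, the class-balancing weights, and the early-stopping settings are illustrative, not the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the 21 HLF features: SM-like background vs a shifted "signal".
X_sm = rng.normal(0.0, 1.0, (2000, 21))
X_bsm = rng.normal(0.7, 1.0, (300, 21))
X = np.vstack([X_sm, X_bsm])
y = np.concatenate([np.zeros(len(X_sm)), np.ones(len(X_bsm))])

# Upweight the smaller BSM sample so both classes pull equally on the loss.
w = np.where(y == 1, len(X_sm) / len(X_bsm), 1.0)

bdt = GradientBoostingClassifier(
    n_estimators=150,        # up to 150 estimators, as in the text
    max_depth=3,             # maximum depth 3, as in the text
    learning_rate=0.1,       # learning rate 0.1, as in the text
    n_iter_no_change=10,     # early stopping on a held-out split (illustrative)
    validation_fraction=0.1,
    tol=1e-4,                # illustrative tolerance value
)
bdt.fit(X, y, sample_weight=w)
scores = bdt.predict_proba(X)[:, 1]  # signal-like score per event
```

One such classifier would be trained per BSM benchmark, and its ROC curve compared to the VAE's.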
Process  AUC  TPR [] 

0.98  5.4  
0.94  0.2  
0.90  0.1  
0.97  0.3 
Figure 6 shows the ROC curves obtained for the four BDTs. We summarize in Tab. 2 the classification performance of the four supervised BDTs, which sets a qualitative upper limit for the VAE's results. Overall, the four models can be discriminated with good accuracy, with some loss of performance for those models sharing similarities with specific SM processes (e.g., exhibiting the single- and double-lepton topologies with missing transverse energy typical of some SM events). In the table, we also quote the true-positive rate (TPR) corresponding to a fixed SM false-positive rate. This value of the efficiency is the one needed to keep an average of 1000 SM events per month.
5 Results with VAE
An event is classified as anomalous whenever the associated loss, computed from the VAE output, is above a given threshold. Since no BSM signal has been observed so far, it is reasonable to expect that a new-physics signal, if any, would be characterized by a low production cross section and/or features very similar to those of a SM process. In view of this, we decided to use a tight threshold value, in order to reduce the SM contribution as much as possible.
Figure 7 shows the distributions of the loss components for the validation dataset. In both plots, the vertical line represents a lower threshold chosen such that only a small, fixed fraction of the SM events would be retained. This threshold value results in the monthly and daily SM event yields illustrated in Table 3. The acceptance rate is calculated assuming the LHC running conditions listed in Section 1. Table 3 also reports the by-process VAE selection efficiency and the relative background composition of the selected sample.
Figure 7 also shows the corresponding distributions for the four benchmark BSM models. We observe that the discrimination power, loosely quantified by the integral of these distributions above threshold, is better for one of the two loss components than for the other, and that the impact of the subleading term on the discrimination is negligible. Anomalies are then defined as events lying on the right tail of the expected distribution.
The left plot in Fig. 8 shows the ROC curves obtained from the loss distributions of the four BSM benchmark models and the SM cocktail, compared to the corresponding BDT curves of Section 4.2. The right plot in Fig. 8 shows the p-value computed from the SM-cocktail loss distribution, both for the SM events themselves (flat by construction) and for the four BSM processes. As the plot shows, the BSM processes tend to concentrate at small p-values, which allows their identification as anomalies.
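The p-value assignment in the right plot reduces to ranking each event's loss against the SM reference distribution. A minimal sketch with toy distributions (the exponential shapes and the shift of the "anomalous" population are illustrative):

```python
import numpy as np

def sm_pvalue(event_losses, sm_losses):
    """One-sided p-value: fraction of reference SM events with a loss at
    least as large as the observed one (smaller p-value = more anomalous)."""
    sm_sorted = np.sort(sm_losses)
    # Count of SM losses >= each event loss, via binary search.
    n_ge = len(sm_sorted) - np.searchsorted(sm_sorted, event_losses, side="left")
    return n_ge / len(sm_sorted)

rng = np.random.default_rng(1)
sm_reference = rng.exponential(1.0, 100_000)

# SM events themselves give p-values that are uniform by construction;
# a loss-shifted "anomalous" population concentrates at small p-values.
p_sm = sm_pvalue(rng.exponential(1.0, 10_000), sm_reference)
p_bsm = sm_pvalue(3.0 + rng.exponential(1.0, 10_000), sm_reference)
```

A p-value threshold on this quantity is equivalent to the loss threshold discussed above, but expressed on a scale that is directly interpretable as a SM false-positive rate.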
Standard Model processes  

Process  VAE selection  Sample composition  Event/month 
QCD  
Tot 
Table 4 summarizes the VAE's performance on the four BSM benchmark models. Together with the selection efficiency corresponding to the chosen threshold, the table reports the effective cross section (cross section after applying the trigger requirements) that would correspond to 100 selected events in a month, assuming the monthly integrated luminosity discussed in Section 1. Similarly, we quote the cross section that would result in a signal-to-background ratio of 1/3 in the sample of events selected by the VAE. The VAE can probe the four models down to relatively low cross-section values, comparable to those typically probed in dedicated fully-supervised searches. As a comparison, Ref. Chatrchyan:2012sv excludes a leptoquark with a mass of 150 GeV and a production cross section above a threshold at the pb level, using 4.8 fb of data at a center-of-mass energy of 7 TeV, while more recent searches Sirunyan:2017yrk only cover larger mass values.
BSM benchmark processes  
Process  VAE selection  Crosssection  Crosssection 
efficiency  100 events/month [pb]  S/B = 1/3 [pb]  
7.1  27  
30  110  
55  210  
17  65 
6 How to deploy a VAE for BSM detection

The work presented in this paper suggests the possibility of deploying a VAE as a trigger algorithm associated with dedicated data streams. These triggers would isolate anomalous events, similarly to what was done by the CMS experiment at the beginning of the first LHC run. At that time, with an early new-physics signal still being a possibility, the CMS experiment deployed online a set of algorithms (collectively called the hot line) to select potentially interesting new-physics candidates. Anomalies were characterized as events with high-momentum particles or high particle multiplicities, in line with the kind of early-discovery new-physics scenarios considered at the time. The events populating the hot-line stream were immediately processed at the CERN computing center (as opposed to traditional physics streams, which are processed after 48 hours). The hot-line algorithms were tuned to collect O(10) events per day, which were then visually inspected by experts.
While the focus of the work presented in this paper is not an early discovery, the spirit of the application we propose would be similar: a set of VAEs deployed online would select a limited number of events every day. These events would be collected in a dedicated dataset and further analyzed. The analysis technique could range from visual inspection of the collisions to detailed studies of the reconstructed objects, up to some kind of model-independent analysis of the collected dataset, e.g., a deep-learning implementation of model-independent hypothesis testing DAgnolo:2018cun performed directly on the loss distribution (provided a reliable sample of background-only data is available).
While a pure SM sample to train the VAEs could only be obtained from Monte Carlo simulation, the presence of outlier contamination in the training sample typically has a tiny impact on performance. One could then imagine training the VAE models on the data collected so far and applying them to a much larger dataset. In our study, we consider a training dataset corresponding to a small integrated luminosity and apply the VAE to a larger dataset. One could even envision more frequent retrainings (e.g., after every substantial increase in integrated luminosity, or in the presence of substantial detector and/or accelerator condition changes). Such a training could happen offline on a dedicated dataset, e.g., deploying triggers that randomly select events entering the last stage of the trigger system. The training could even happen online, assuming the availability of sufficient computing resources.
To demonstrate the feasibility of a train-on-data strategy, we enrich the dataset used in Section 4 with a signal contamination from one of the benchmark processes. As a starting point, the amount of injected signal is tuned to a luminosity of 100 pb and a cross section of 7.1 fb, corresponding to the value at which the VAE in Section 4 would select 100 events in 5 fb. This results in about 700 events added to the training sample. The VAE is trained following the procedure outlined in Section 4 and its performance is compared to that obtained on a signal-free dataset of the same size. The comparison of the ROC curves for the two models is shown in Fig. 9. In the same figure, we show similar results, derived by injecting larger signal contaminations. A degradation of the VAE's performance is observed once the signal cross section is set to 710 pb (i.e., 100 times the sensitivity value found in Section 4). At that point, the contamination is so large that the signal becomes as abundant as some of the SM processes, and its presence would have easily detectable consequences. For comparison, at a production cross section of 27 pb, a third of the events selected by the VAE in Section 4 would come from the signal (see Table 4), and this would have negligible consequences on the training quality. This test shows that a robust anomaly-detecting VAE could be trained directly on data, even in the presence of previously undetected (e.g., at the Tevatron, or at the 7 TeV and 8 TeV LHC) BSM signals.
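The contamination test can be reproduced schematically: the expected number of injected signal events is luminosity × cross section × efficiency, and the chosen events are mixed into the SM training sample. All numbers below are illustrative placeholders, not the paper's inputs:

```python
import numpy as np

def contaminated_training_set(sm_events, bsm_events, lumi_pb, xsec_pb, eff, rng):
    """Mix a BSM contamination into the SM training sample.

    Expected contamination = luminosity x cross section x efficiency;
    the actual injected count fluctuates as a Poisson draw.
    """
    n_sig = rng.poisson(lumi_pb * xsec_pb * eff)
    idx = rng.choice(len(bsm_events), size=n_sig, replace=False)
    mixed = np.vstack([sm_events, bsm_events[idx]])
    rng.shuffle(mixed)  # shuffle rows so the contamination is interleaved
    return mixed, n_sig

# Toy samples standing in for the 21-feature SM and BSM event arrays.
rng = np.random.default_rng(3)
sm = rng.normal(0.0, 1.0, size=(5000, 21))
bsm = rng.normal(2.0, 1.0, size=(2000, 21))
mixed, n_sig = contaminated_training_set(
    sm, bsm, lumi_pb=100.0, xsec_pb=0.07, eff=1.0, rng=rng
)
```

Retraining the VAE on `mixed` and comparing its ROC curve to the contamination-free baseline reproduces the structure of the test described above.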
7 Conclusions
We present a strategy to isolate potential BSM events produced by the LHC, using variational autoencoders trained on a reference SM sample. Such an algorithm could be used in the trigger system of generalpurpose LHC experiments to isolate recurrent anomalies, which might otherwise escape observation (e.g., being filtered out by a typical trigger selection). Taking as an example a singlelepton data stream, we show how such an algorithm could select datasets enriched with events originating from challenging BSM scenarios. We also discuss how the model training could happen directly on data, with no sizable performance loss.
The final outcome of the analysis would be a list of anomalous events, which the experimental collaborations could further scrutinize and even release as a catalog, similarly to what is typically done in other scientific domains. Repeated patterns in these events could motivate new scenarios for beyond-the-Standard-Model physics and inspire new searches, to be performed on future data with traditional supervised approaches.
We believe that such an application could help extend the physics reach of the current and next stages of the CERN LHC.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no. 772369) and the United States Department of Energy, Office of High Energy Physics Research, under Caltech Contract No. DE-SC0011925. This work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".
References
 (1) ATLAS, CMS, LHC Higgs Combination Group Collaboration, Procedure for the LHC Higgs boson search combination in summer 2011, .
 (2) ATLAS Collaboration, G. Aad et al., Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC, Phys. Lett. B716 (2012) 1–29, [arXiv:1207.7214].
 (3) CMS Collaboration, S. Chatrchyan et al., Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC, Phys. Lett. B716 (2012) 30–61, [arXiv:1207.7235].
 (4) CDF Collaboration, T. Aaltonen et al., Global Search for New Physics with 2.0 fb at CDF, Phys. Rev. D79 (2009) 011101, [arXiv:0809.3781].
 (5) D0 Collaboration, V. M. Abazov et al., Model independent search for new phenomena in collisions at TeV, Phys. Rev. D85 (2012) 092015, [arXiv:1108.5362].
 (6) H1 Collaboration, F. D. Aaron et al., A General Search for New Phenomena at HERA, Phys. Lett. B674 (2009) 257–268, [arXiv:0901.0507].
 (7) CMS Collaboration, MUSiC, a Model Unspecific Search for New Physics, in pp Collisions at , Tech. Rep. CMSPASEXO14016, CERN, Geneva, 2017.
 (8) ATLAS Collaboration, M. Aaboud et al., A strategy for a general search for new phenomena using dataderived signal regions and its application within the ATLAS experiment, Submitted to: Eur. Phys. J. (2018) [arXiv:1807.07447].
 (9) L. Lyons, Open statistical issues in particle physics, ArXiv eprints (Nov., 2008) [arXiv:0811.1663].
 (10) E. Gross and O. Vitells, Trial factors for the look elsewhere effect in high energy physics, Eur. Phys. J. C70 (2010) 525–530, [arXiv:1005.1891].
 (11) R. T. D’Agnolo and A. Wulzer, Learning New Physics from a Machine, arXiv:1806.02350.
 (12) J. H. Collins, K. Howe, and B. Nachman, CWoLa Hunting: Extending the Bump Hunt with Machine Learning, arXiv:1805.02664.
 (13) A. De Simone and T. Jacques, Guiding New Physics Searches with Unsupervised Learning, arXiv:1807.06038.
 (14) J. Hajer, Y.Y. Li, T. Liu, and H. Wang, Novelty Detection Meets Collider Physics, arXiv:1807.10261.
 (15) A. A. Pol, G. Cerminara, C. Germain, M. Pierini, and A. Seth, Detector monitoring with artificial neural networks at the CMS experiment at the CERN Large Hadron Collider, arXiv:1808.00911.
 (16) CMS Collaboration, Anomaly detection using Deep Autoencoders for the assessment of the quality of the data acquired by the CMS experiment, tech. rep., CERN, Geneva, Jul, 2018.
 (17) ATLAS Collaboration, Deep generative models for fast shower simulation in ATLAS, Tech. Rep. ATLSOFTPUB2018001, CERN, Geneva, Jul, 2018.
 (18) T. Heimel, G. Kasieczka, T. Plehn, and J. M. Thompson, QCD or What?, arXiv:1808.08979.
 (19) M. Farina, Y. Nakai, and D. Shih, Searching for New Physics with Deep Autoencoders, arXiv:1808.08992.
 (20) T. Q. Nguyen et al., Topology classification with deep learning to improve realtime event selection at the LHC, arXiv:1807.00083.
 (21) T. Sjöstrand et al., An Introduction to PYTHIA 8.2, Comput. Phys. Commun. 191 (2015) 159–177, [arXiv:1410.3012].
 (22) DELPHES 3 Collaboration, J. de Favereau et al., DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP 02 (2014) 057, [arXiv:1307.6346].
 (23) CMS Collaboration, V. Khachatryan et al., Technical Proposal for the PhaseII Upgrade of the CMS Detector, Tech. Rep. CERNLHCC2015010. LHCCP008. CMSTDR1502, Geneva, Jun, 2015.
 (24) M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual, Eur. Phys. J. C72 (2012) 1896, [arXiv:1111.6097].
 (25) M. Cacciari, G. P. Salam, and G. Soyez, The anti jet clustering algorithm, JHEP 04 (2008) 063, [arXiv:0802.1189].
 (26) D. P. Kingma and M. Welling, AutoEncoding Variational Bayes, ArXiv eprints (Dec., 2013) [arXiv:1312.6114].
 (27) Wikipedia contributors, Activation function — Wikipedia, the free encyclopedia, 2018. [Online; accessed 25November2018].
 (28) F. Chollet et al., “Keras.” https://github.com/fchollet/keras, 2015.
 (29) M. Abadi et al., TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 (30) D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, ArXiv eprints (Dec., 2014) [arXiv:1412.6980].
 (31) J. M. Tomczak and M. Welling, VAE with a VampPrior, CoRR abs/1705.07120 (2017) [arXiv:1705.07120].
 (32) F. Pedregosa et al., Scikitlearn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
 (33) CMS Collaboration, S. Chatrchyan et al., Search for pair production of thirdgeneration leptoquarks and top squarks in collisions at TeV, Phys. Rev. Lett. 110 (2013), no. 8 081801, [arXiv:1210.5629].
 (34) CMS Collaboration, A. M. Sirunyan et al., Search for thirdgeneration scalar leptoquarks and heavy righthanded neutrinos in final states with two tau leptons and two jets in protonproton collisions at TeV, JHEP 07 (2017) 121, [arXiv:1703.03995].