Topology classification with deep learning to improve real-time event selection at the LHC

06/29/2018 ∙ by Thong Q. Nguyen, et al.

We show how event topology classification based on deep learning could be used to improve the purity of data samples selected in real time at the Large Hadron Collider. We consider different data representations, on which different kinds of multi-class classifiers are trained. Both raw data and high-level features are utilized. In the considered examples, a filter based on the classifier's score can be trained to retain about 99% of the interesting events and reduce the false-positive rate by as much as one order of magnitude for certain background processes. By operating such a filter as part of the online event selection infrastructure of the LHC experiments, one could benefit from a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted in processing false positives. The saved resources could be translated into a reduction of the detector operation cost or into an effective increase of storage and processing capabilities, which could be reinvested to extend the physics reach of the LHC experiments.




1 Introduction

The CERN Large Hadron Collider (LHC) collides protons every 25 ns. Each collision can result in any of hundreds of physics processes. The total data volume far exceeds what the experiments could record. For this reason, the incoming data flow is typically filtered through a set of rule-based algorithms, designed to retain only events with particular signatures (e.g., the presence of a high-energy particle of some kind). Such a system, commonly referred to as the trigger, consists of hundreds of algorithms, each designed to accept events with a specific topology. The ATLAS Aaboud:2016leb and CMS Adam:2005zf trigger systems are based on this idea. In their current implementation, given the throughput capability and the typical event size, these two experiments can write to disk only a limited number of events per second. A few processes, e.g., QCD multijet production, constitute the vast majority of the produced events. One is typically interested in selecting only a fraction of these events for further studies. On the other hand, the main interest of the LHC experiments lies in selecting and studying the many rare processes which occur at the LHC. In a typical data flow, these events are overwhelmed by the large amount of QCD multijet events. The trigger system is put in place to make sure that the majority of these rare events are among the events stored each second.

Trigger algorithms are typically designed to maximize the efficiency (i.e., the true-positive rate), resulting in a non-negligible false-positive rate and, consequently, in a substantial waste of resources at trigger level (i.e., data throughput that could have been used for other purposes) and downstream (i.e., storage disk, processing power, etc.).

The most commonly used selection rules are inclusive, i.e., more than one topology is selected by the same requirement. The so-called isolated lepton triggers are a typical example of this kind of algorithm. These triggers select events with a high-momentum electron or muon and no surrounding energetic particle, a typical signature of an interesting rare process, e.g., the production of a $W$ boson decaying to a neutrino and an electron or muon. With such a requirement, one can simultaneously collect $W$ bosons produced in the primary interaction ($W$ events) or from the cascade decay of other particles, e.g., top quarks (mainly in $t\bar{t}$ events, where a top quark-antiquark pair is produced). A sample selected this way is dominated by $W$ events but it retains a substantial contamination from QCD multijet events. The $t\bar{t}$ contribution is small by comparison. Events from $t\bar{t}$ production are sometimes triggered by a set of dedicated lepton+jets algorithms, capable of using looser requirements on the lepton at the cost of introducing requirements on jets (a jet is a spray of hadrons, typically originating from the hadronization of gluons and quarks produced in the proton collisions). Due to this additional complexity, the use of these triggers in a data analysis comes with additional complications. For instance, the applied jet requirements distort the offline distributions of jet-related quantities. To avoid this effect, a typical data analysis applies a tighter offline selection. This means that many of the selected events close to the online-selection threshold are discarded. This is not necessarily the most cost-effective way to retain an unbiased dataset for offline analysis.

Figure 1: Relative composition of the isolated-lepton sample after the acceptance requirement (left) and the trigger selection (right), as described in the text.

In this paper, we investigate the possibility of using machine learning to disentangle events from different event topologies at trigger level. Doing so, one could customize the trigger-selection strategy on individual processes (depending on the physics goals) while keeping the selection loose and simple. As a benchmark case, we consider a stream of data selected by requiring the presence of one electron or muon with transverse momentum $p_T > 23$ GeV (in this paper, we set units in such a way that $c = \hbar = 1$) and a loose requirement on the isolation. Details on the applied selection can be found in Sec. 2.

The considered benchmark sample is dominated by direct $W$ production, with a sizable contamination from QCD multijet events and a small contribution of $t\bar{t}$ events. Other interesting processes are usually selected with more exclusive and dedicated trigger algorithms (e.g., di-muon or di-electron triggers), or share the same kinematic properties as the two main interesting processes ($W$ and $t\bar{t}$). For the sake of simplicity, we ignore these sub-leading processes in our study, without compromising the validity of our conclusions. Fig. 1 shows the composition of a sample with one electron or muon within the defined acceptance (a minimum $p_T$ and a maximum pseudorapidity $|\eta|$, with $\eta = -\ln\tan(\theta/2)$ and $\theta$ the polar angle), before and after applying the trigger requirements ($p_T > 23$ GeV and loose isolation).

Such a loose set of requirements would translate into an event acceptance rate of 690 Hz for a luminosity of $2\times10^{34}$ cm$^{-2}$ s$^{-1}$, well beyond the currently allocated budget for these triggers. We suggest that, using the score of our topology classifier, one could tune the amount of each process to be stored for further analysis, within the boundaries of the allocated resources. For instance, one might be interested in retaining all the $t\bar{t}$ events and some fraction of the $W$ events, while rejecting the QCD multijet events. We envision two main applications. For a given total rate, one could loosen the baseline trigger requirements, increasing the acceptance efficiency at no cost. Alternatively, for a given acceptance efficiency (true-positive rate), one could save resources by reducing the overall rate, rejecting the contribution of unwanted topologies (see Appendix A).

We consider several topology classifiers based on deep learning model architectures: fully-connected deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks such as Long-Short-Term-Memory networks (LSTMs) and gated recurrent units (GRUs) GRU. We consider four different representations of the collision events: (i) a set of physics-motivated high-level features, (ii) the raw image of the detector hits, (iii) a sequence of particles, characterized by a limited set of basic features (energy, direction, etc.), and (iv) an abstract representation of this list of particles as an image.

The paper is structured as follows. In Sec. 2 we describe the four data representations. In Sec. 3 we describe the corresponding classification models. Results are discussed in Sec. 4. In Sec. 5 we investigate how the classifiers generalize to other topologies. In Sec. 6 we briefly discuss applications of machine learning algorithms to similar problems. Conclusions are given in Sec. 7. Appendix A describes a different scenario, in which the classifier is used to save resources by reducing the trigger acceptance rate, as opposed to using it to sustain a loose trigger selection that could otherwise require too many resources.

2 Dataset

Synthetic data corresponding to $W$, $t\bar{t}$, and QCD multijet production topologies are generated using the PYTHIA8 event generation library pythia. The setup of the proton-beam simulation is loosely inspired by the LHC running configuration in 2015-2016: two proton beams, each with an energy of 6.5 TeV, generate on average 20 proton-proton collisions per crossing.

Generated samples are processed with the DELPHES library delphes, which applies a parametric model of a detector response. The detector performance is tuned to the CMS upgrade design foreseen for the High-Luminosity LHC CMS_TP, as implemented in the corresponding default card provided with DELPHES. We run the DELPHES particle-flow (PF) algorithm, which combines the information from all the CMS detector components to derive a list of reconstructed particles, the so-called PF candidates. For each particle, the algorithm returns the measured energy and flight direction. Each particle is associated with one of three classes: charged particles, photons, and neutral hadrons.

The basic event representation consists of a list of reconstructed PF candidates. For each candidate $q$, the following information is given: (i) the particle four-momentum in Cartesian coordinates ($E$, $p_x$, $p_y$, $p_z$); (ii) the particle three-momentum in cylindrical coordinates: the transverse momentum $p_T$, the pseudorapidity $\eta$, and the azimuthal angle $\phi$; (iii) the Cartesian coordinates ($x$, $y$, $z$) of the particle point of origin. For all neutral particles, (0, 0, 0) is used in the absence of pointing information; (iv) the electric charge; (v) the particle isolation with respect to charged particles (ChPFIso), photons (GammaPFIso), or neutral hadrons (NeuPFIso). For each particle class, the isolation is quantified as

$$\mathrm{ISO} = \frac{\sum_{p \neq q} p_T^{p}}{p_T^{q}},$$

where the sum extends over all the particles $p$ of the appropriate class within a fixed angular distance $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$ from the particle $q$.
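As an illustration of the isolation definition, the following minimal sketch computes a relative isolation for one candidate. The dictionary layout of a particle and the `cone_size` value are assumptions made for this example only; they are not taken from the dataset format.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the (eta, phi) plane, with delta-phi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def relative_isolation(candidate, others, cone_size):
    """Sum of the pT of `others` inside the cone, divided by the candidate pT.

    Each particle is a dict with keys "pt", "eta", "phi" (hypothetical layout).
    `cone_size` is an assumed parameter of this sketch.
    """
    total = 0.0
    for p in others:
        if p is candidate:
            continue  # never count the candidate itself
        if delta_r(candidate["eta"], candidate["phi"], p["eta"], p["phi"]) < cone_size:
            total += p["pt"]
    return total / candidate["pt"]

lepton = {"pt": 30.0, "eta": 0.5, "phi": 3.1}
photons = [
    {"pt": 3.0, "eta": 0.6, "phi": -3.1},   # close to the lepton once phi is wrapped
    {"pt": 50.0, "eta": -2.0, "phi": 0.0},  # far from the lepton, outside the cone
]
gamma_iso = relative_isolation(lepton, photons, cone_size=0.3)
```

Running the same function once per particle class (charged particles, photons, neutral hadrons) would give the three isolation quantities listed above.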

The particle identity is categorized via a three-component one-hot-encoded representation, corresponding to a charged particle, a neutral hadron, or a photon. In addition, two boolean flags are stored to identify whether a given particle is an electron or a muon. In total, each particle is then described by 19 features.

The trigger selection is emulated by requiring all the events to include one isolated electron or muon with transverse momentum $p_T > 23$ GeV and particle-based isolation $\mathrm{ISO} < 0.45$. This baseline selection, which follows the typical requirements of an inclusive single-lepton trigger algorithm, accepts many QCD multijet and $W$ events for every $t\bar{t}$ event. Despite its large $W$ and $t\bar{t}$ efficiency, this trigger selection comes at a large cost in terms of QCD multijet events written to disk and processed offline. The cost is even larger if the main physics target is $t\bar{t}$ events and the $W$ contribution is seen as an additional source of background (e.g., in a high-statistics scenario, with all measurements of $W$ properties limited in precision by systematic uncertainties).

All particles are ranked in decreasing order of $p_T$. For each event, the isolated lepton is the first entry of the list of particles. To avoid double counting of this isolated lepton as a charged particle, each charged particle is required to be distinct from the isolated lepton. In addition to the isolated lepton, we consider the first 450 charged particles, the first 150 photons, and the first 200 neutral hadrons. This corresponds to a total of 801 particles per event, each characterized by the 19 features described above. If fewer particles are found in the event, zero padding is used to guarantee a fixed length of the particle list across different events. The events are then stored as numpy arrays in a set of compressed HDF5 files. The dataset is planned to be released on the CERN OpenData portal.
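The fixed-length particle list can be sketched as follows. This is an illustration under stated assumptions: the input layout (a list of `(pt, features)` tuples per class) is hypothetical, and padding each class block separately is one possible arrangement, not necessarily the one used to produce the dataset.

```python
N_FEATURES = 19
# Maximum number of particles kept per class, as quoted in the text.
MAX_PER_CLASS = {"charged": 450, "photon": 150, "neutral": 200}

def build_particle_list(lepton, particles_by_class):
    """Return 801 feature vectors: the lepton first, then one padded block per class.

    `lepton` is a list of 19 floats; `particles_by_class` maps a class name to
    a list of (pt, features) tuples. Both layouts are assumptions of this sketch.
    """
    rows = [list(lepton)]
    for name, max_count in MAX_PER_CLASS.items():
        # Rank by decreasing pT and keep at most `max_count` particles.
        ranked = sorted(particles_by_class.get(name, []), key=lambda p: p[0], reverse=True)
        kept = [list(features) for _, features in ranked[:max_count]]
        # Zero-pad so that the block has the same length in every event.
        kept += [[0.0] * N_FEATURES for _ in range(max_count - len(kept))]
        rows.extend(kept)
    return rows

event = build_particle_list(
    lepton=[1.0] * N_FEATURES,
    particles_by_class={
        "charged": [(5.0, [5.0] * N_FEATURES), (9.0, [9.0] * N_FEATURES)],
        "photon": [(2.0, [2.0] * N_FEATURES)],
    },
)
```

Every event built this way has the same shape (801 rows of 19 features), which is what allows the events to be stacked into a single array.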

In addition to this raw-event representation, we provide a list of physics-motivated high-level features, computed from the full event (the HLF dataset):

  • $H_T$, i.e., the scalar sum of the $p_T$ of all the jets, leptons, and photons in the event passing minimum-$p_T$ and maximum-$|\eta|$ requirements. Jets are clustered from the reconstructed PF candidates, using the FASTJET fastjet implementation of the anti-$k_T$ jet algorithm antikt, with jet-size parameter R=0.4.

  • The missing transverse energy $E_T^{\text{miss}}$, defined as the absolute value of the missing transverse momentum, computed summing over the full list of reconstructed PF candidates:

    $$E_T^{\text{miss}} = \left| -\sum_{q} \vec{p}_T^{\,q} \right|.$$

  • The squared transverse mass, $M_T^2$, of the isolated lepton and the $E_T^{\text{miss}}$ system, defined as:

    $$M_T^2 = 2\, p_T^{\ell}\, E_T^{\text{miss}} \left(1 - \cos\Delta\phi\right),$$

    with $p_T^{\ell}$ the transverse momentum of the lepton and $\Delta\phi$ the azimuthal separation between the lepton and the $\vec{E}_T^{\text{miss}}$ vector.

  • The azimuthal angle of the $\vec{E}_T^{\text{miss}}$ vector, $\phi^{\text{miss}}$.

  • The number of jets entering the $H_T$ sum.

  • The number of these jets identified as originating from a $b$ quark.

  • The isolated-lepton momentum, expressed in polar coordinates ($p_T$, $\eta$, $\phi$).

  • The three isolation quantities (ChPFIso, NeuPFIso, GammaPFIso) for the isolated lepton.

  • The lepton charge.

  • The flag identifying whether the isolated lepton is an electron or a muon.
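Two of the high-level features above, the missing transverse energy and the squared transverse mass, can be computed from the particle list as in the following sketch. It uses the standard definitions of these quantities; the dictionary layout of the particles is a hypothetical choice for this example.

```python
import math

def missing_et(particles):
    """Return (magnitude, azimuthal angle) of minus the vector sum of the particle pT."""
    px = -sum(p["pt"] * math.cos(p["phi"]) for p in particles)
    py = -sum(p["pt"] * math.sin(p["phi"]) for p in particles)
    return math.hypot(px, py), math.atan2(py, px)

def transverse_mass_squared(lepton_pt, lepton_phi, met, met_phi):
    """M_T^2 = 2 * pT(lepton) * MET * (1 - cos(delta phi))."""
    return 2.0 * lepton_pt * met * (1.0 - math.cos(lepton_phi - met_phi))

# Toy event: a 40 GeV lepton along phi = 0 and a 10 GeV particle along phi = pi/2.
particles = [{"pt": 40.0, "phi": 0.0}, {"pt": 10.0, "phi": math.pi / 2}]
met, met_phi = missing_et(particles)
mt2 = transverse_mass_squared(40.0, 0.0, met, met_phi)
```

Because the cosine is periodic, the azimuthal difference does not need to be wrapped before it enters the transverse-mass formula.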

The list of 801 particles is used to generate two visual representations of the events. In the first one, the ($\eta$, $\phi$) plane corresponding to the detector acceptance is divided into a barrel region, two end-cap regions, and two forward regions. The barrel and endcap regions of the electromagnetic calorimeter, as well as the endcap of the hadronic calorimeter (HCAL), share one cell size, while the barrel region of the HCAL is binned with cells of a different size. The forward regions are binned with cells of size 0.175 in $\eta$, while the dimension in $\phi$ varies from 0.175 to 0.35. Each cell is filled with the scalar sum of the $p_T$ of the particles pointing to that cell. The three classes of particles (charged particles, photons, and neutral hadrons) are considered separately, resulting in three adjacent images. An example event is shown in Fig. 2. This representation corresponds to the raw image recorded by the detector.

Figure 2: An example event as the input of the raw-image classifier.

Recently, it was proposed to represent LHC collision events as abstract images where reconstructed physics objects (jets, in that case) are represented as geometric shapes whose size reflects the energy of the particle Madrazo. We generalize this approach by applying it to the full list of particles. Each particle is represented as a unique geometric shape, centered at the particle’s ($\eta$, $\phi$) coordinates and with size proportional to its $p_T$. The geometric shapes are chosen as follows: (i) pentagons for the selected isolated electron or muon; (ii) triangles for photons; (iii) squares for charged particles; (iv) hexagons for neutral hadrons. The images are digitized as 5-channel arrays, where each of the first four channels contains a separate particle class, and the last channel contains the $E_T^{\text{miss}}$, represented as a circle. As an example, the abstract representation for the event in Fig. 2 is shown in Fig. 3.

This abstract representation allows mitigating the sparsity problem of the raw images. On the other hand, there is no guarantee that the physics information is fully retained in this translation. As a result, there could be a reduction of discrimination power. This is one of the points we aim to investigate in this study.

Figure 3: Example event, represented as a 5-channel abstract image. Panels: (a) photons, (b) charged particles, (c) neutral hadrons, (d) lepton.

3 Model description

In this section, we describe five types of multi-class classifiers, trained on the four data representations described in the previous section. We start by considering a state-of-the-art HEP application, based on the high-level features listed in Sec. 2. We then consider a convolutional neural network taking as input the raw images. This model offers the baseline point of comparison for the classifier using the abstract images. In order to have a fair comparison between the two approaches, the same kind of network architecture is used for the two sets of images. Next, we consider recurrent neural networks based on LSTMs and GRUs, trained directly on the lists of 801 particles. Finally, we consider a classifier taking both the high-level features and the list of 801 particles as inputs, using a combination of recurrent neural networks and fully connected neural networks.

The CNNs are implemented in PyTorch pytorch. The recurrent neural networks and feed-forward neural networks are implemented in Keras and trained using Theano theano as a back-end. The Adam optimizer Adam is used to adapt the learning rate. The training is capped at 50 epochs, and is stopped early if there is no improvement in terms of validation loss after 8 epochs. Categorical cross entropy is used as the loss function. All trainings are performed on a cluster of GeForce GTX 1080 GPUs. In an early stage of this work, experiments on the recurrent models were performed on the CSCS Piz Daint supercomputer, using the mpi-learn library mpi-learn for multiple-GPU training.
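The stopping rule described above (at most 50 epochs, stop after 8 epochs without improvement of the validation loss) can be sketched independently of any deep learning framework. In this illustration the list of validation losses stands in for a real training loop, and the function name is hypothetical.

```python
def train_with_early_stopping(validation_losses, max_epochs=50, patience=8):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch validation losses.

    `validation_losses` replaces an actual training loop in this sketch:
    element i is the validation loss measured after epoch i + 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if epoch > max_epochs:
            break  # hard cap on the number of epochs
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch

# The loss improves for three epochs and then stays flat.
losses = [1.0, 0.8, 0.7] + [0.75] * 60
epochs_run, best_epoch = train_with_early_stopping(losses)
```

In this toy sequence the best loss is reached at epoch 3, so training stops at epoch 11, eight epochs later.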

3.1 High-level-features classifier using feed-forward neural networks

A fully connected feed-forward DNN based on a set of high-level features (HLF classifier) is the closest approach to the currently used rule-based trigger algorithms. We train a model of this kind taking as input the 14 features contained in the HLF dataset (see Sec. 2). The 14 features are normalized to take values between 0 and 1.
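A minimal sketch of the normalization step, assuming a per-feature min-max rescaling (the text only states that the features take values between 0 and 1, so the exact procedure is an assumption here):

```python
def min_max_scale(rows):
    """Rescale each column of `rows` (a list of equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # a constant column maps to 0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Toy input with two features per event; the second one is constant.
features = [[10.0, 1.0], [20.0, 1.0], [40.0, 1.0]]
scaled = min_max_scale(features)
```

In practice the minimum and maximum would be computed on the training set and reused unchanged on validation and test data.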

The final network configuration is the result of an optimization process performed using the scikit-learn scikit-learn optimizer, which performs an exhaustive cross-validated grid search over a set of hyperparameters related to the network architecture and the training setup. The number of layers, the number of nodes in each layer, and the choice of optimizer have been considered in the scan. For a given number of layers, the discrimination performance was found to be constant over the considered range of number of nodes per layer. We believe that this is a direct consequence of the simple problem at hand: even relatively small networks achieve good classification performance. We then took the smallest network as the best compromise between performance and architecture minimality.

The chosen architecture consists of three hidden layers with 50, 20, and 10 nodes, activated by rectified linear units (ReLU). The output layer consists of 3 nodes, activated by a softmax activation function.

3.2 Raw-image and abstract-image classifiers using convolutional neural networks

To classify events represented as raw calorimeter images (raw-image classifier) and abstract images (abstract-image classifier), we use DenseNet-121, an instantiation of the Densely Connected Convolutional Network huang2017densely. The DenseNet-121 architecture includes 4 dense blocks, containing 6, 12, 24, and 16 dense layers, respectively. Each dense layer contains two 2D convolutional layers preceded by batch normalization layers. A dropout rate of 0.5 is applied after each dense layer. Between two subsequent dense blocks is a transition layer consisting of a batch normalization layer, a 2D convolutional layer, and an average pooling layer.

3.3 Particle-sequence classifier using recurrent neural networks

A particle-sequence classifier is trained using a recurrent layer, taking as input the 801 candidates. To feed these particles into a recurrent network, particles are ordered according to their increasing or decreasing distance from the isolated lepton. Different physics-inspired metrics are considered to quantify the distance, including the $k_T$ antikt and anti-$k_T$ kt metrics. The best results are obtained using the decreasing distance ordering.
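The ordering step can be sketched as below. The angular distance in the ($\eta$, $\phi$) plane is used here as an assumed example of a distance metric, and the dictionary layout of the particles is hypothetical.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the (eta, phi) plane, with delta-phi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def order_by_distance(lepton, particles, decreasing=True):
    """Sort particles by their distance from the lepton (farthest first by default)."""
    return sorted(
        particles,
        key=lambda p: delta_r(lepton["eta"], lepton["phi"], p["eta"], p["phi"]),
        reverse=decreasing,
    )

lepton = {"eta": 0.0, "phi": 0.0}
particles = [
    {"name": "near", "eta": 0.1, "phi": 0.0},
    {"name": "far", "eta": 2.0, "phi": 1.0},
    {"name": "mid", "eta": 0.5, "phi": -0.5},
]
ordered = [p["name"] for p in order_by_distance(lepton, particles)]
```

With the decreasing ordering, the particles closest to the lepton are the last ones seen by the recurrent layer.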

We use gated recurrent units (GRU) to aggregate the input sequence of particle flow candidate features into a fixed size encoding. The fixed encoding is fed into a fully connected layer with 3 softmax activated nodes. Input data is standardized so that each feature has zero mean and unit standard deviation. The zero-padded entries in the particle sequence are skipped with the Masking layer. The best internal width of the recurrent layers was found to be 50, determined by k-fold cross validation on a training set of 300,000 events. We also considered using long short-term memory networks (LSTM) to replace the GRU, but we found that the GRU architecture outperformed the LSTM architecture for the same number of internal cells.





Figure 4: Network architecture of the inclusive classifier. The output of the GRU (50) is concatenated with the high-level features (14) into a vector of size 64, followed by a dense layer (25) and the output layer (3).

3.4 Inclusive classifier

In order to inject some domain knowledge into the GRU classifier, we consider a modification of its architecture in which the 14 features of the HLF dataset are concatenated to the output of the GRU layer after some dropout (see Fig. 4). As for the other classifiers, the final output layer consists of 3 nodes, activated by a softmax activation function. We refer to this model as the inclusive classifier.

4 Results

Each of the models presented in the previous section returns the probability of each event to be associated with a given topology: $y_{\mathrm{QCD}}$, $y_W$, and $y_{t\bar{t}}$. By applying a threshold requirement on $y_{t\bar{t}}$ or $y_W$, one can define a $t\bar{t}$ or a $W$ classifier, respectively. By changing the threshold value, one can build the corresponding receiver operating characteristic (ROC) curve. Fig. 5 shows the comparison of the ROC curves for five classifiers: the DenseNets based on raw images and abstract images, the GRU using the list of particles, the DNN using the HLFs, and the inclusive classifier using both the HLFs and the list of particles. Results for both a $t\bar{t}$ and a $W$ selector are shown.
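One point of such a ROC curve can be obtained by choosing the score threshold that yields a target true-positive rate and then measuring the false-positive rate at that threshold. The sketch below uses made-up scores and hypothetical function names.

```python
import math

def threshold_for_tpr(signal_scores, target_tpr):
    """Largest threshold for which at least `target_tpr` of the signal has score >= threshold."""
    ranked = sorted(signal_scores, reverse=True)
    # Small tolerance guards against floating-point noise in the product.
    n_keep = max(1, math.ceil(target_tpr * len(ranked) - 1e-9))
    return ranked[n_keep - 1]

def pass_fraction(scores, threshold):
    """Fraction of events with score >= threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Toy classifier scores (made up for illustration).
signal = [0.9, 0.8, 0.6, 0.3]
background = [0.7, 0.4, 0.2, 0.1]

threshold = threshold_for_tpr(signal, target_tpr=0.75)
tpr = pass_fraction(signal, threshold)
fpr = pass_fraction(background, threshold)
```

Scanning `target_tpr` over a range of values and collecting the resulting (FPR, TPR) pairs traces out the full curve.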

Figure 5: ROC curves for the $t\bar{t}$ (left, panel a) and $W$ (right, panel b) selectors described in the paper.

Acceptable results are obtained already with the raw-image classifier. On the other hand, the use of abstract images allows us to reach better performance. A further improvement is observed for those models not using an image-based representation of the event. The fact that the HLF selectors perform so well doesn’t come as a surprise, given the considerable amount of physics knowledge implicitly provided by the choice of the relevant features. On the other hand, the fact that the particle-sequence classifier reaches performance comparable to the HLF selector is remarkable, as is the further improvement observed by merging the two approaches in the inclusive classifier. In some sense, the GRU layer is gaining a good part of the physics intuition that motivated the choice of the HLF quantities, but not all of it. Fig. 6 shows the Pearson correlation coefficients between the GRU scores ($y_{t\bar{t}}$ and $y_W$) and the HLF quantities. As one would expect, $y_{t\bar{t}}$ exhibits a stronger correlation with those features that quantify jet activity, as well as with the b-jet multiplicity. On the contrary, $y_W$ shows an anti-correlation with jet quantities, since the production of associated jets is much more suppressed in $W$ events than in $t\bar{t}$ events. As expected, both scores are anti-correlated with the isolation quantities, which take larger values for non-isolated leptons.
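For reference, the Pearson correlation coefficient between a classifier score and one HLF quantity can be computed as in this short sketch. The numerical values are invented for illustration and are not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented per-event values: a score, a jet multiplicity, and a lepton isolation.
score = [0.1, 0.4, 0.5, 0.9]
n_jets = [1, 2, 3, 5]
lepton_iso = [0.40, 0.30, 0.20, 0.05]

corr_jets = pearson(score, n_jets)
corr_iso = pearson(score, lepton_iso)
```

A positive coefficient, as for the jet multiplicity in this toy input, indicates that the score tends to grow with the feature; a negative one, as for the isolation, indicates the opposite trend.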

Figure 6: Pearson correlation coefficients between the $y_{t\bar{t}}$ (left) and $y_W$ (right) scores of the particle-sequence classifier and the 14 quantities of the HLF dataset.

The performance of each of the five classifiers is summarized in Tab. 1 in terms of false-positive rate (FPR) and trigger rate (TR) as a function of the true-positive rate (TPR). The best QCD rejection is obtained by the inclusive classifier, which can retain 99% of the $t\bar{t}$ or $W$ events with a false-positive rate of about 8% (7.9% and 8.0%, respectively).

$t\bar{t}$ selector   Raw-image    Abstract-image   HLF      Particle-sequence   Inclusive
                      (DenseNet)   (DenseNet)       (DNN)    (GRU)               (DNN+GRU)
FPR @99% TPR          76.5%        43.6%            41.1%    15.2%               7.9%
FPR @95% TPR          41.3%        13.7%            7.3%     4.0%                1.3%
FPR @90% TPR          26.5%        6.7%             3.5%     1.8%                0.4%
TR @99% TPR           382 Hz       250 Hz           202 Hz   78 Hz               42 Hz
TR @95% TPR           208 Hz       82 Hz            39 Hz    22 Hz               9 Hz
TR @90% TPR           134 Hz       39 Hz            20 Hz    11 Hz               4 Hz

$W$ selector          Raw-image    Abstract-image   HLF      Particle-sequence   Inclusive
                      (DenseNet)   (DenseNet)       (DNN)    (GRU)               (DNN+GRU)
FPR @99% TPR          79.0%        58.6%            26.3%    20.0%               8.0%
FPR @95% TPR          60.5%        26.4%            10.6%    7.5%                2.7%
FPR @90% TPR          48.1%        14.9%            5.8%     3.7%                1.2%
TR @99% TPR           488 Hz       462 Hz           316 Hz   290 Hz              262 Hz
TR @95% TPR           454 Hz       366 Hz           258 Hz   249 Hz              239 Hz
TR @90% TPR           408 Hz       301 Hz           235 Hz   228 Hz              223 Hz

Table 1: False positive rate (FPR) and trigger rate (TR) at different values of the true positive rate (TPR), for a $t\bar{t}$ (top) and a $W$ (bottom) selector. Rate values are estimated by scaling the TPR and process-dependent FPR values by the acceptance and efficiency, assuming a leading-order (LO) production cross section and a luminosity of $2\times10^{34}$ cm$^{-2}$ s$^{-1}$. TR values should be taken only as a loose indication of the actual rates, since the accuracy is limited by the use of LO cross sections and a parametric detector simulation.
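The rate estimate described in the caption amounts to weighting the baseline rate of each process by the fraction of its events that pass the classifier threshold (the TPR for the targeted process, the process-dependent FPR for the others). The sketch below illustrates the arithmetic; all numerical inputs are hypothetical placeholders, not the values used to produce the table.

```python
def trigger_rate(baseline_rates_hz, pass_fractions):
    """Sum over processes of (baseline rate in Hz) x (fraction passing the classifier)."""
    return sum(baseline_rates_hz[name] * pass_fractions[name] for name in baseline_rates_hz)

# Hypothetical baseline rates after the loose trigger selection (placeholders).
baseline_rates_hz = {"ttbar": 5.0, "W": 450.0, "QCD": 235.0}
# Hypothetical pass fractions for a ttbar selector: TPR for ttbar, FPRs for W and QCD.
pass_fractions = {"ttbar": 0.99, "W": 0.05, "QCD": 0.10}

rate_hz = trigger_rate(baseline_rates_hz, pass_fractions)
```

With these placeholder inputs the estimated rate is 4.95 + 22.5 + 23.5 = 50.95 Hz.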

The trigger baseline selection we use in this study, looser than what is used nowadays in CMS, gives an overall trigger rate (i.e., summing electron and muon events) of 690 Hz, more than a factor of two larger than what is currently allocated. Using the 99% working points of the two classifiers, one would reduce the overall rate (counting the overlap between the two triggers) to a level comparable to what is currently allocated for these triggers, but with a looser selection, i.e., with a less severe bias on the offline analysis. In addition, the trigger efficiency (the TPR) is so large that the bias imposed on offline quantities is quite minimal. This is illustrated in Fig. 7, where the dependence of the TPR on the most relevant HLF quantities is shown. In our experience, any rule-based algorithm with the same target trigger rate would result in larger inefficiencies at small values of at least some of these quantities, e.g., the lepton $p_T$. One should also consider that the principle of a topology classifier could be generalized to other physics cases, as well as to other uses (e.g., labels for fast reprocessing or access to specific subsets of the triggered samples).

Figure 7: Selection efficiency at the 99% TPR working point as a function of the lepton $p_T$ and other relevant HLF quantities, for $t\bar{t}$ and $W$ events. Panels: (a) $t\bar{t}$ selector for $t\bar{t}$ events; (b) $W$ selector for $W$ events.

5 Impact on other topologies

While reducing the resource consumption of standard physics analyses is the main motivation behind this study, it is important to evaluate the impact of the proposed classifiers on other kinds of topologies. For this purpose, we consider a handful of beyond-the-standard-model (BSM) scenarios, and we compute the TPR as a function of the most relevant kinematic quantities, similar to what was done in Fig. 7 for the standard topologies.


Figure 8: Selection efficiencies of different BSM models at the 99% TPR working point as a function of the lepton $p_T$ and other relevant kinematic quantities.

We consider the following BSM processes:

  • $H^0 \to W^\mp H^\pm$: a heavy Higgs boson $H^0$ with mass 425 GeV decaying to a charged Higgs boson $H^\pm$ of mass 325 GeV and a $W$ boson. The $H^\pm$ then decays to a $W h^0$ final state, where $h^0$ is the 125 GeV Higgs boson, which we force to decay to a bottom quark-antiquark pair. This model, introduced in Ref. baldi, generates a 2$W$2$b$ topology similar to that given by $t\bar{t}$ events.

  • High-mass $H^0 \to W^\mp H^\pm$: a high-mass variation of the previous model, in which the $H^0$ and $H^\pm$ masses are set to 1025 GeV and 625 GeV, respectively.

  • A light neutral scalar particle with mass 20 GeV, decaying to two neutral scalars of 5 GeV each, both decaying to muon pairs, for a total of four muons in the final state.

  • A $W'$ resonance with mass 300 GeV, decaying inclusively with $W$-like couplings.

  • A $Z'$ resonance with mass 600 GeV, decaying to a pair of electrons or muons.

These events are filtered with the baseline selection described in Sec. 2.

For each of these models, we consider the inclusive classifier and apply the 99%-TPR thresholds on $y_{t\bar{t}}$ and $y_W$. We then consider the fraction of events passing at least one of the two selectors. Results are shown in Fig. 8 for the most relevant kinematic quantities. While the individual selectors might show local inefficiencies, the combination of the two trigger paths is capable of retaining events with features different from those of a QCD multijet event. In this respect, the logical OR of our two exclusive topology classifiers is robust enough to also select a large spectrum of BSM topologies. On the other hand, one cannot guarantee that QCD-like topologies (e.g., a dark photon produced in jet showers and decaying to lepton pairs) would not be rejected, a limitation which also affects traditional inclusive trigger strategies.
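The combination of the two selectors can be sketched as a logical OR of two threshold requirements. The scores and thresholds below are made up for illustration; in the study the thresholds would be those of the 99%-TPR working points.

```python
def passes_either(score_tt, score_w, threshold_tt, threshold_w):
    """Logical OR of the two selectors: accept if either score passes its threshold."""
    return score_tt >= threshold_tt or score_w >= threshold_w

def combined_efficiency(events, threshold_tt, threshold_w):
    """Fraction of events accepted by at least one selector.

    `events` is a list of (score_tt, score_w) pairs (hypothetical layout).
    """
    passed = sum(passes_either(tt, w, threshold_tt, threshold_w) for tt, w in events)
    return passed / len(events)

# Made-up scores for four events.
events = [(0.90, 0.05), (0.10, 0.80), (0.20, 0.30), (0.05, 0.05)]
efficiency_or = combined_efficiency(events, threshold_tt=0.5, threshold_w=0.5)
efficiency_tt_only = sum(tt >= 0.5 for tt, _ in events) / len(events)
```

In this toy input each selector alone accepts one event out of four, while the OR accepts two, which illustrates how the combination can recover events lost by an individual path.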

6 Related work

Several classification algorithms have been studied in the context of LHC physics applications, notably for jet tagging deOliveira:2015xxd ; Guest:2016iqz ; Macaluso:2018tck ; Datta:2017lxt ; Butter:2017cot ; Kasieczka:2017nvn ; Komiske:2016rsd ; Schwartzman:2016jqu and event topology identification baldi ; Bhimji:2017qvb ; Madrazo using feed-forward neural networks, convolutional neural networks or physics-inspired architectures. Lists of particles have been used to define jet and event classifiers starting from a list of reconstructed particle momenta RecursiveJets ; Egan:2017ojy ; Cheng:2017rdo . These studies typically consider data analysis as the main use case, focusing on small FPR selections. This is the main difference with respect to this study, which is more related to an optimization of the data-taking procedure.

7 Conclusions

We show how deep neural networks can be used to train topology classifiers for LHC collision events, which could be used as a cleanup filter to select or reject specific event topologies in a trigger system. We consider several network architectures, applied to different representations of the same collision datasets. The best results are obtained by combining a set of physics-motivated high-level features with the output of a GRU unit applied to a list of particle-level features. For the most difficult case, i.e., selecting rare $t\bar{t}$ events, we show how a trigger based on this concept would retain 99% of the $t\bar{t}$ events while reducing the FPR by as much as one order of magnitude. We show that such a trigger would have a minimal impact on the main kinematic features of the event topologies under consideration. In addition, the logical OR of the $t\bar{t}$ and $W$ selections would also catch a broad class of new-physics topologies, on which the classifiers were not trained. In view of the challenging trigger environment foreseen for the High-Luminosity LHC, it would be important to test this trigger strategy as a way to preserve a good experimental reach with a substantial reduction of computational resources. In this respect, we look forward to the LHC Run III as an opportunity to test this technique with real data.

8 Acknowledgments

This work is partially supported by a grant from the Swiss National Supercomputing Center (CSCS) under project ID d59. We thank CERN OpenLab for supporting DW during his internship at CERN. We are grateful to Caltech and the Kavli Foundation for their support of undergraduate student research in cross-cutting areas of machine learning and domain sciences. Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks". TN would like to thank Duc Le for valuable discussions. This project is partially supported by the United States Department of Energy, Office of High Energy Physics Research under Caltech Contract No. DE-SC0011925. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 772369).


Appendix A

In this paper, we showed how one could use a topology classifier to keep the overall trigger rate under control while operating triggers with otherwise unsustainable loose selections. In this appendix we discuss how topology classifiers could be used to save resources for a pre-defined baseline trigger selection by rejecting events associated to unwanted topologies. In this case, the main goal is not to reduce the impact of the online selection. Instead, we focus on reducing resource consumption downstream for a given trigger selection.

For this purpose, we consider a copy of the dataset described in Sec. 2, obtained by tightening the $p_T$ threshold from 23 to 25 GeV and the isolation requirement from ISO < 0.45 to ISO < 0.20. Doing so, the sample composition changes as follows: 7.5% QCD; 92% $W$; 0.5% $t\bar{t}$. With such selections, the trigger acceptance rate would decrease from 690 Hz to 390 Hz, closer to what is currently allocated for these triggers in the CMS experiment.

Following the procedure described in Secs. 3 and 4, we train the same topology classifiers on this dataset. The corresponding ROC curves are presented in Fig. 9 for a $t\bar{t}$ and a $W$ selector.

Figure 9: ROC curves for the $t\bar{t}$ (left, panel a) and $W$ (right, panel b) selectors described in the paper, trained on a dataset defined by a tighter baseline selection.

We then define a set of trigger filters applying a lower threshold to the normalized score of the classifier, choosing the threshold value that corresponds to a certain TPR value. The result is presented in Table 2, in terms of the FPR and the trigger rate.

$t\bar{t}$ selector   Raw-image    Abstract-image   HLF      Particle-sequence   Inclusive
                      (DenseNet)   (DenseNet)       (DNN)    (GRU)               (DNN+GRU)
FPR @99% TPR          76.7%        55.5%            44.3%    13.4%               10.2%
FPR @95% TPR          43.5%        20.2%            9.1%     2.1%                1.5%
FPR @90% TPR          24.8%        9.9%             4.2%     0.6%                0.5%
TR @99% TPR           285 Hz       230 Hz           219 Hz   57 Hz               42 Hz
TR @95% TPR           148 Hz       85 Hz            37 Hz    10 Hz               9 Hz
TR @90% TPR           73 Hz        42 Hz            19 Hz    4 Hz                4 Hz

$W$ selector          Raw-image    Abstract-image   HLF      Particle-sequence   Inclusive
                      (DenseNet)   (DenseNet)       (DNN)    (GRU)               (DNN+GRU)
FPR @99% TPR          81.3%        68.9%            45.7%    17.3%               14.9%
FPR @95% TPR          58.4%        43.9%            19.6%    6.1%                5.2%
FPR @90% TPR          46.9%        30.2%            11.7%    3.0%                2.5%
TR @99% TPR           385 Hz       384 Hz           376 Hz   363 Hz              362 Hz
TR @95% TPR           367 Hz       360 Hz           349 Hz   343 Hz              342 Hz
TR @90% TPR           343 Hz       336 Hz           328 Hz   325 Hz              324 Hz

Table 2: False positive rate (FPR) and trigger rate (TR) corresponding to different values of the true positive rate (TPR), for a $t\bar{t}$ (top) and a $W$ (bottom) selector. Rate values are estimated by scaling the TPR and process-dependent FPR values by the acceptance and efficiency, assuming a leading-order (LO) production cross section and a luminosity of $2\times10^{34}$ cm$^{-2}$ s$^{-1}$. TR values should be taken only as a loose indication of the actual rates, since the accuracy is limited by the use of LO cross sections and a parametric detector simulation.

The trigger baseline selection we use in this appendix, close to what is used nowadays in CMS for muons, gives an overall trigger rate (i.e., summing electron and muon events) of 390 Hz (i.e., 190 Hz per lepton flavor). If one were willing to take (as an example) half the $W$ events and all the $t\bar{t}$ events, this rate could be reduced using the inclusive selectors presented in this study (taking into account the partial overlap between the two triggers). A more classic approach would consist in prescaling the isolated lepton triggers, i.e., randomly accepting half of the events. The effect on $W$ events would be the same, but one would lose half of the $t\bar{t}$ events while still writing 15 times more QCD than $t\bar{t}$ events. In this respect, the strategy we propose would be more flexible and cost-effective.