A major aim of experimental high-energy physics (HEP) is to find rare signals of new particles among the large numbers of collisions produced at accelerators such as the Large Hadron Collider (LHC) and recorded by experiments such as ATLAS and CMS. Improvements in classifying these collisions would improve sensitivity to discoveries that could overturn our understanding of the universe at the most fundamental level. Neural networks have been used in high-energy physics for some time. Recently, attention has focused on deep learning to improve sensitivity and to cope with the increase in detector resolution and data rates in HEP experiments.
Here we evaluate the use of deep neural networks (NNs) on low-level detector data directly for physics analysis: without reconstruction of physics objects such as jets, without tuning of analysis variables, and using data from the entire calorimeter together with other sub-detectors. We use modern deep learning methods for performance, relate that performance to that of physics variables, and ensure results are robust to factors such as pileup and alternative physics models.
The second goal of this work is to ensure these NN architectures run efficiently with popular deep-learning software frameworks on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). This is a major computing resource that is predominantly composed of Intel Knights Landing (KNL) XeonPhi CPUs. We seek to improve single-node performance on these CPUs, providing timings, optimisations and recipes, as well as distributed training up to the full scale of the machine (10k KNL nodes).
We provide background in Section 2.1 on the physics use-case studied and the benchmark selections; in Section 2.2 on the CNN architecture we employ; and in Section 2.3 on the computing resources at NERSC. We then present physics performance results in Section 3.1 and, in Section 3.2, examine how the learned network relates to the physics variables used in the benchmark analysis. In Section 3.3 we show timing results for implementations in various frameworks and on GPU and CPU resources at NERSC, before concluding in Section 4.
2.1 Physics and datasets
For this study, we take as a use-case a particular analysis searching for new massive supersymmetric (‘RPV-Susy’) particles in multi-jet final states at the LHC. We focus on ‘gluino-cascade decays’ with gluino and neutralino masses of 1400 GeV and 850 GeV by default (and varied for performance studies).
We use the Pythia event generator interfaced to the Delphes fast detector simulation, with the default Delphes ATLAS detector configuration (‘card’). We generate events for two classes: the RPV-Susy signal and the most prevalent background (‘QCD’). To have sufficient quantities of the background events at high transverse momentum ($p_T$), which are harder to separate, the QCD sample is generated in ranges of $p_T$ and then weighted according to the expected rate (cross-section) in both the training loss and the evaluation.
Before training our network we apply some of the physics selections used in that analysis to filter images to those more challenging to discriminate, resulting in a training sample of around 400k events. We compare the performance of our deep network to our own implementation of the full selections of that analysis as a baseline benchmark. To produce the jet variables we use the same jet algorithm as in the physics analysis, with parameters given in Appendix A, applied using ‘FastJet’ within Delphes. The preselection and benchmark selections are also summarized in Appendix A. We have verified that our samples and baseline selections give performance comparable to that obtained in the original analysis, providing a meaningful benchmark even though those selections were not tuned for exactly these datasets. We also compare our analysis to shallow classifiers based on these physics variables. As inputs to those classifiers, we use the full four-momentum of the five jets with highest transverse momentum as well as the jet variables used in the physics analysis selections: namely the sum of jet mass, the number of jets, and $\Delta\eta$ between the leading two jets.
Data from the surface of the cylindrical detector is represented as a 2D image with coordinates corresponding to azimuthal angle $\phi$ and pseudorapidity $\eta$. For the pixel intensity in this image, we either use the overall energy deposited in the combined calorimeter, or split the energy deposited in the electromagnetic and hadronic calorimeters and add the number of tracks formed from the inner detector in that region to provide a three-channel image. This is similar to earlier jet-image approaches, except that we use large images covering the entire detector and use these directly for classifying entire events rather than individual objects. We choose to bin the energy (and number of tracks) into uniform 64x64 bins, which correspond approximately to the 0.1x0.1 ($\eta$ x $\phi$) resolution of the ATLAS hadronic calorimeter, but also use larger 224x224 images for large-scale studies that could capture the granularity of the ATLAS electromagnetic calorimeter and future detectors.
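The image construction described above can be sketched with numpy as follows. This is only an illustration: the pseudorapidity coverage ($|\eta| < 3.2$, chosen so that 64 bins are 0.1 wide), the function names and the toy deposits are assumptions made for the example and are not taken from the study.

```python
import numpy as np

N_BINS = 64
ETA_MAX = 3.2  # assumed eta coverage; 6.4 / 64 gives 0.1-wide bins


def make_image(eta, phi, values, n_bins=N_BINS, eta_max=ETA_MAX):
    """Sum per-deposit values into a uniform (eta, phi) grid."""
    image, _, _ = np.histogram2d(
        eta, phi,
        bins=n_bins,
        range=[[-eta_max, eta_max], [-np.pi, np.pi]],
        weights=values,
    )
    return image


def make_three_channel_image(eta, phi, e_em, e_had, trk_eta, trk_phi):
    """Stack EM energy, hadronic energy and track counts as channels."""
    em = make_image(eta, phi, e_em)
    had = make_image(eta, phi, e_had)
    tracks = make_image(trk_eta, trk_phi, np.ones_like(trk_eta))
    return np.stack([em, had, tracks], axis=-1)


# Toy deposits, purely illustrative (not simulated physics events).
rng = np.random.default_rng(0)
eta = rng.uniform(-3.0, 3.0, size=500)
phi = rng.uniform(-np.pi, np.pi, size=500)
e_em = rng.exponential(5.0, size=500)
e_had = rng.exponential(10.0, size=500)
trk_eta = rng.uniform(-2.5, 2.5, size=50)
trk_phi = rng.uniform(-np.pi, np.pi, size=50)

image = make_three_channel_image(eta, phi, e_em, e_had, trk_eta, trk_phi)
```

The single-channel variant corresponds to calling `make_image` once with the summed calorimeter energy.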
2.2 Neural Net Architecture
We perform binary classification on the resulting images using a Convolutional Neural Net (CNN). The convolutional layers output into two fully connected layers, which project onto a two-dimensional vector on which a softmax function is applied to determine the signal and background class probabilities. We use softmax with cross-entropy as the loss function and the ADAM optimizer. The network employed is shown in figure 1.
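As a minimal numpy sketch of the output stage described above, the following computes softmax class probabilities from two-dimensional network outputs and a per-event weighted cross-entropy. The convention that column 1 is the signal class, the normalisation by the sum of weights, and the toy numbers are assumptions for illustration, not details confirmed by the study.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of -log p(true class); weights are per-event."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return np.sum(weights * -np.log(picked)) / np.sum(weights)


# Toy outputs for three events: columns are (background, signal), by assumption.
logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]])
labels = np.array([0, 1, 1])
weights = np.array([1.0, 1.0, 1.0])

probs = softmax(logits)
loss = weighted_cross_entropy(logits, labels, weights)
```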
2.3 Computational Resources
The CPU timing and scaling results reported here used the NERSC Cori system: a Cray XC40 with 2388 Intel Haswell (E5-2698 v3) and 9688 Intel XeonPhi 7250 Knights Landing (KNL) compute nodes. Each KNL processor includes 68 cores at 1.4 GHz with 4 hyperthreads each and a peak performance of 6 TeraFLOP/s (single precision). Each Haswell node has dual 16-core 2.3 GHz processors with a peak performance of 2.4 TeraFLOP/s. We compare performance to a standalone machine containing Titan X (Pascal) GPUs, each capable of 10.2 TeraFLOP/s peak.
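The quoted CPU peak figures can be roughly reproduced from core counts and clock speeds. The per-cycle throughput values below (two AVX-512 units per KNL core, two AVX2 fused multiply-add units per Haswell core) are assumptions about the microarchitectures that are not stated in the text, and sustained clocks under vector load are typically lower than nominal.

```python
def peak_tflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Nominal single-precision peak in TeraFLOP/s."""
    return sockets * cores_per_socket * ghz * flops_per_cycle / 1000.0


# Assumed per-core throughput: vector units x SP lanes x 2 (fused multiply-add).
KNL_FLOPS_PER_CYCLE = 2 * 16 * 2      # two 512-bit units, 16 floats each
HASWELL_FLOPS_PER_CYCLE = 2 * 8 * 2   # two 256-bit units, 8 floats each

knl = peak_tflops(1, 68, 1.4, KNL_FLOPS_PER_CYCLE)
haswell = peak_tflops(2, 16, 2.3, HASWELL_FLOPS_PER_CYCLE)
```

Under these assumptions `knl` is about 6.1 and `haswell` about 2.4, consistent with the rounded values quoted above.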
3.1 CNN Performance
To obtain good sensitivity to new physics it is necessary to combine high signal efficiency (True Positive Rate (TPR)) with very high background rejection (low False Positive Rate (FPR)). In figure 3 we show the ROC curve of TPR vs FPR for the CNN architecture described in Section 2.2 with a single channel for calorimeter energy (labelled ‘CNN’). We compare to the physics selections described in Section 2.1. We achieve increased signal efficiency at the same background rejection without using physics-based jet variables. We also compare the AMS (approximate median significance), which gives a single measure accounting for initial pre-selection efficiency and relative signal/background cross-sections, and achieve a 1.8x improvement in AMS at the same background rejection point.
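The two figures of merit used above can be sketched as follows. The AMS expression is the one given in the HiggsML challenge documentation, $\sqrt{2\,((s+b+b_{reg})\ln(1+s/(b+b_{reg}))-s)}$; the default regularisation term `b_reg=10` comes from that challenge and may differ from what was used in this study. The function names and toy scores are assumptions for illustration.

```python
import numpy as np


def tpr_at_fpr(scores, labels, weights, target_fpr):
    """Weighted TPR at the loosest score threshold with weighted FPR <= target."""
    order = np.argsort(-scores)
    labels, weights = labels[order], weights[order]
    sig_w = np.where(labels == 1, weights, 0.0)
    bkg_w = np.where(labels == 0, weights, 0.0)
    tpr = np.cumsum(sig_w) / sig_w.sum()
    fpr = np.cumsum(bkg_w) / bkg_w.sum()
    allowed = fpr <= target_fpr
    return tpr[allowed].max() if allowed.any() else 0.0


def ams(s, b, b_reg=10.0):
    """Approximate median significance for expected signal s and background b."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))


# Toy classifier outputs: label 1 is signal, 0 is background.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
weights = np.ones(6)

tight = tpr_at_fpr(scores, labels, weights, target_fpr=0.0)
loose = tpr_at_fpr(scores, labels, weights, target_fpr=0.34)
```

For small signal-to-background ratios the AMS approaches $s/\sqrt{b}$, which is why it is sensitive to both efficiency and background rejection.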
Comparison to shallow classifiers:
We also compare to shallow machine-learning approaches using high-level physics variables as inputs, as detailed in Section 2.1: these outperform the selections but underperform relative to the CNN.
Further improving performance: In figure 3 we show further improvements to performance from modifications to our network or training. Firstly, we alter the cross-section based weights applied to the training loss (described in Section 2.1). The background cross-section weights are in some cases seven orders of magnitude larger than that of the signal sample. This range of values causes some instability for training in some of the deep learning frameworks we use, so we also train the network applying the natural logarithm of these normalized positive weights. This slightly improves the performance in the false-positive region of interest, though it performs worse than the simple cross-section weights for higher false-positive rates.
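The effect of log-scaling the weights can be illustrated with a short sketch. The exact normalisation used in the study is not specified, so the choice below (dividing by the smallest weight and using `log1p` to keep every weight strictly positive) is an assumption, as are the toy weight values.

```python
import numpy as np


def log_scaled_weights(xsec_weights):
    """Compress the dynamic range of cross-section weights with a logarithm.

    Assumed scheme: normalise so the smallest weight is 1, then take
    log(1 + w) so that all resulting weights stay strictly positive.
    """
    w = np.asarray(xsec_weights, dtype=float)
    normalized = w / w.min()
    return np.log1p(normalized)


# Toy weights spanning seven orders of magnitude.
raw = np.array([1e-7, 1e-3, 1.0])
scaled = log_scaled_weights(raw)
raw_range = raw.max() / raw.min()
scaled_range = scaled.max() / scaled.min()
```

Here the ratio between the largest and smallest weight falls from about $10^7$ to roughly 23, while the ordering of the weights is preserved.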
Next, we explore using three input channels in our CNN: namely, separating the energy deposited in the electromagnetic and hadronic calorimeters, and adding the number of tracks reconstructed in the same bin. Figure 3 shows the further performance improvements.
Finally a slight further improvement can be seen in figure 3 from creating an ensemble of the networks trained with original weights and the log weights (with 3 channels). This is done by taking the mean of the two network outputs as a prediction.
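The ensembling step described above amounts to averaging two per-event predictions. A minimal sketch, with hypothetical array names and toy values:

```python
import numpy as np


def ensemble_mean(p_original_weights, p_log_weights):
    """Average the signal probabilities predicted by two trained networks."""
    return 0.5 * (np.asarray(p_original_weights) + np.asarray(p_log_weights))


# Toy signal probabilities for four events from the two networks.
p_a = np.array([0.90, 0.20, 0.60, 0.05])
p_b = np.array([0.80, 0.10, 0.40, 0.15])
p_ens = ensemble_mean(p_a, p_b)
```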
Robustness to different signal samples and pileup: The model described in preceding sections was trained on a specific cascade decay with gluino mass (MGlu) of 1400 GeV, and neutralino mass (MNeu) of 850 GeV. However, in figure 3 we show that applying this network to other signal samples with different input masses, without retraining, still offers good performance that exceeds that of the benchmark selections for each sample.
The above studies used simulated samples without a contribution from additional interactions per beam crossing (pile-up). We also generate samples using the default Delphes ATLAS detector configuration ‘card’ with added pile-up. The physics selections benchmark has a higher false-positive rate for this sample (0.001) with a TPR of 0.6 at that point. The CNN still performs well with a TPR of around 0.9 at that FPR.
3.2 Relation between CNN and physics variables
In figure 6 we compare the predicted class probability of the CNN with two of the jet variables used in the physics analysis; a clear correlation can be seen. We also train a one-layer network combining the CNN output with each of these physics variables in turn. The performance is also shown in figure 6: there is little or no further improvement within the statistics of the sample used, suggesting that the CNN is capturing much of the discriminating power of these jet variables, despite not being explicitly trained on them.
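A one-layer network on two inputs is equivalent to logistic regression, so the combination test described above can be sketched as below. The training details (gradient descent, learning rate, feature standardisation) and the toy data are assumptions for illustration and do not reproduce the study's setup or results.

```python
import numpy as np


def fit_one_layer(features, labels, lr=0.5, steps=500):
    """Fit a single sigmoid unit (logistic regression) by gradient descent."""
    mean, std = features.mean(axis=0), features.std(axis=0)
    x = (features - mean) / std
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = p - labels
        w -= lr * (x.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b, mean, std


def predict(features, w, b, mean, std):
    """Signal probability from the fitted unit."""
    x = (features - mean) / std
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


# Toy data: a CNN-like score plus one weaker, correlated jet variable.
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)
cnn_score = np.clip(0.5 + 0.3 * (2 * labels - 1) + 0.15 * rng.standard_normal(n), 0.0, 1.0)
jet_var = labels + rng.standard_normal(n)
features = np.column_stack([cnn_score, jet_var])

w, b, mean, std = fit_one_layer(features, labels.astype(float))
accuracy = np.mean((predict(features, w, b, mean, std) > 0.5) == labels)
```

Comparing the performance of this combined classifier with that of the CNN score alone indicates how much extra information the jet variable carries.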
3.3 Training Time Performance on CPU and GPU architectures
In figure 7 we compare the training time for implementations of our network in different popular deep-learning frameworks and on different architectures. We report the time per batch of 512 examples excluding I/O. Not all implementations have been optimized, so the aim is not to provide detailed comparisons of frameworks but rather to drive improved CPU performance. It can clearly be seen that the default Tensorflow (v1.2) implementation performed very poorly on the KNL architecture. Several optimizations were recently introduced by Intel for Tensorflow on CPU, building on the Intel Math Kernel Library (MKL), such as multi-threaded convolutional layers, vectorizing over channels or filters, and cache blocking. These offer substantial improvements, as shown in the times labeled ‘TF (Intel)’ in figure 7. Further optimizations have since been carried out in a collaboration between NERSC and Intel using the architecture and data from this paper, as well as other use-cases, resulting in the ‘TF (Latest)’ measurement in figure 7. These include an MKL implementation of element-wise operations to avoid MKL to Eigen conversions. This version is currently available for use on NERSC systems and will be included in future releases of Tensorflow. Code, datasets and recipes are provided in Appendix B.
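A time-per-batch measurement that excludes I/O can be sketched as follows: the batch is held in memory before the clock starts, and a few warm-up iterations are discarded. The `dummy_train_step` function is a hypothetical stand-in for a framework training step, and the warm-up and repeat counts are arbitrary assumptions; frameworks that execute asynchronously would also need an explicit synchronisation before the clock is read.

```python
import time

import numpy as np


def time_per_batch(train_step, batch, n_warmup=2, n_timed=10):
    """Mean wall-clock seconds per call of train_step on an in-memory batch."""
    for _ in range(n_warmup):
        train_step(batch)  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(n_timed):
        train_step(batch)
    return (time.perf_counter() - start) / n_timed


def dummy_train_step(batch):
    """Hypothetical stand-in for one training step (a single matrix product)."""
    images, kernel = batch
    return (images @ kernel).sum()


# A batch of 512 flattened 64x64 single-channel images, already in memory.
rng = np.random.default_rng(0)
batch = (rng.standard_normal((512, 64 * 64)), rng.standard_normal((64 * 64, 8)))
seconds = time_per_batch(dummy_train_step, batch)
```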
Performance improvements for CPU were also made in IntelCaffe. Figure 7 shows this implementation also performs well. In addition, a multi-node version of IntelCaffe using the Intel MLSL library is available at NERSC. Figure 7 shows that using 8 nodes (with a constant batch size per node) one can train this network approximately 6x faster than on a single node. With other collaborators, we have scaled this Caffe implementation on the same problem, but with large 224x224 images, to 9600 KNL nodes on Cori. That study is reported in detail in Kurth et al. (2017).
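The multi-node result quoted above can be expressed as a scaling efficiency, the ratio of achieved speedup to node count. The helper name below is an assumption for illustration.

```python
def scaling_efficiency(speedup, n_nodes):
    """Fraction of ideal linear speedup achieved on n_nodes."""
    return speedup / n_nodes


# Approximately 6x faster on 8 nodes, as quoted in the text.
efficiency = scaling_efficiency(6.0, 8)
```

With a constant batch size per node the global batch grows with the node count, so an efficiency of about 0.75 here refers to throughput rather than to time to reach a given accuracy.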
4 Conclusions
We have implemented deep convolutional networks on large whole-detector images directly for physics analysis. We find this offers better sensitivity than physics-variable based selections, and than shallow classifiers using those variables, without the need for jet reconstruction. Further improvements come, in particular, from adding multiple sub-detectors as channels. This network is robust to pileup and can be applied to other masses of the signal sample without retraining.
Furthermore, we have used this implementation and data to benchmark and improve popular deep learning libraries on CPU architectures including XeonPhi/KNL on the NERSC Cori supercomputer as well as to demonstrate distributed training up to 9600 KNL nodes.
-  ATLAS Collaboration 2016 Search for massive supersymmetric particles in multi-jet final states produced in pp collisions at 13 TeV using the ATLAS detector at the LHC Tech. Rep. ATLAS-CONF-2016-057 CERN Geneva URL http://cds.cern.ch/record/2206149
-  Sjöstrand T, Mrenna S and Skands P 2008 Computer Physics Communications 178 852 – 867
-  de Favereau J, Delaere C et al. 2014 JHEP 2014 57
-  Cacciari M, Salam G P and Soyez G 2012 The European Physical Journal C 72 1896
-  de Oliveira L, Kagan M, Mackey L, Nachman B and Schwartzman A 2016 JHEP 07 069
-  Komiske P T, Metodiev E M and Schwartz M D 2017 JHEP 01 110
-  LeCun Y, Bottou L, Bengio Y and Haffner P 1998 Proceedings of the IEEE 86 2278–2324
-  Nair V and Hinton G E 2010 Proceedings of ICML-10 pp 807–814
-  Kingma D P and Ba J 2014 CoRR abs/1412.6980 URL http://arxiv.org/abs/1412.6980
-  Adam-Bourdarios C, Cowan G, Germain C, Guyon I, Kegl B and Rousseau D 2014 URL http://higgsml.lal.in2p3.fr/documentation
-  2017 TensorFlow version 1.2.0 https://github.com/tensorflow/tensorflow/releases/tag/v1.2.0
-  2017 TensorFlow Optimizations on Modern Intel Architecture https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
-  2017 Introducing DNN primitives in Intel Math Kernel Library https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl
-  2017 Intel distribution of Caffe* https://github.com/intel/caffe
-  2017 Intel Machine Learning Scaling Library for Linux OS https://github.com/01org/MLSL
-  Kurth T et al. 2017 Accepted for SC17, arXiv:1708.05256 URL http://arxiv.org/abs/1708.05256
We thank Ben Nachman and Brian Amadio for many valuable discussions and physics input, and Mustafa Mustafa and the Intel Tensorflow team for CPU optimizations. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Appendix A Benchmark analysis and pre-selection
The selections applied for the benchmark analysis, following the ATLAS analysis, were:
|Fat-jet object selection:|
|AntiKt R=1.0 trimmed (, )|
|between leading two Fat-jets|
|( AND GeV)|
|OR ( AND GeV)|
Appendix B Datasets, Code and Recipes
Example code for implementing the networks used in this study, together with datasets and recipes for running at NERSC are provided at http://www.nersc.gov/users/data-analytics/data-analytics-2/deep-learning/deep-networks-for-hep/