Interpretability Study on Deep Learning for Jet Physics at the Large Hadron Collider

by   Taoli Cheng, et al.

Using deep neural networks to identify physics objects at the Large Hadron Collider (LHC) has become a powerful alternative approach in recent years. After a deep neural network has been trained successfully, examining the trained network not only helps us understand its behaviour, but can also help improve the performance of deep learning models through proper interpretation. Taking the jet tagging problem at the LHC as an example and recursive neural networks as a starting point, we aim at a thorough understanding of the behaviour of physics-oriented DNNs and of the information encoded in the embedding space. We make a comparative study of a series of jet tagging tasks dominated by different underlying physics, and report several observations on the latent space.



1 Introduction

The application of deep neural networks to jet tagging has made considerable progress in recent years. The community has reached a consensus on several points. 1) DNNs that take in low-level features are able to perform well in jet tagging tasks and carry out feature engineering automatically. 2) Different architectures have been explored, including image-based CNNs Cogan et al. (2015); de Oliveira et al. (2016), momentum-vector-based RNNs Egan et al. (2017), physics-inspired Recursive NNs Louppe et al. (2019), graph models Henrion (2017); Moreno et al. (2019) and point clouds Qu and Gouskos (2019). These architectures perform comparably well in terms of tagging efficiency, and their results support each other Butter et al. (2019). Although this is an encouraging message, it also reminds us that there might be information redundancy in the architecture search. It would be interesting to extract the part common to the different architectures in order to reduce this redundancy, and to combine the information uniquely learnt by each model in order to enhance tagging performance. 3) With well-performing jet tagging DNNs established, the next step is to push further by connecting them with our theoretical physics framework.

Previous attempts to visualize and interpret DNNs (de Oliveira et al. (2016); Komiske et al. (2019)) have focused on general activation and filter visualization. This gives some intuitive sense of what information has been learnt. However, interpreting DNNs in this way is not sufficient for robust physics analysis, and a clear path between DNNs and the system of physics-theory observables has remained obscure.

In this work, we try to construct this connection through a series of studies ranging from latent space probing to a tailored saliency study. We use different jet tagging tasks to form a comparative study, including W/QCD, Top/QCD and quark/gluon classification. Different factors dominate in these classification tasks, since the causal factor resides in different parts of the jet clustering tree. The very essence of object identification in High Energy Physics is to "see" the underlying quantum properties, which evolve down to low energies and appear as observed patterns in the detector space. Our ultimate hope is in the same vein, i.e. to probe the underlying physics structure rather than only to detect objects. As for the DNN architecture, we take advantage of tree-structured Recursive Neural Networks (RecNNs) Louppe et al. (2019), which utilize well-established jet clustering algorithms whose clustering history encodes the radiation patterns. A jet representation is obtained by embedding each clustering node recursively along the clustering tree, and can then easily be fed into downstream tasks. This approach yields interesting interpretations of the information encoded within neural nets and also provides a physics-friendly method for bridging physics theory and DNN interpretation.

Although all the interpretation studies here are carried out with tree-based RecNNs and interpreted within the scope of supervised classification problems, the proposed methods and findings can be adapted to other architectures and to unsupervised learning. As the last step of the interpretation loop, how to use the information gained from analyzing the latent space, relevance and sensitivities to improve DNN performance, and to clarify related aspects such as uncertainties and robustness, is worth further exploration.

We use training datasets, a neural network architecture and parameters similar to those in Cheng (2017); Louppe et al. (2019) throughout this work. For all tasks, we constrain the jet transverse momentum to be around 600 GeV for a more meaningful comparison across tasks. The input features of every jet constituent are fed along the jet clustering history to build up the jet embedding recursively. A more detailed description of the datasets and the neural network architecture can be found in Appendix A.

2 Probing Jet Embedding Space

Here we employ the “probing method” Alain and Bengio (2016) to discover what is represented in the latent space. By feeding the learnt latent embeddings into auxiliary probing tasks, one can find, in the most straightforward way, the relevance between the latent space and the target information, and thus get a hint of how well the latent space is aligned with the space of specific physics observables.

After the network has been trained on the input space of particle four-momenta (or other variants of the input vector) in a classification task, the embedding vectors of jets are investigated in a task-independent way, using linear classification or regression to link them with the space of physics observables. To be more specific, we take the Generalized Angularities Larkoski et al. (2014) as a systematic summary of common jet observables, as expressed in Eqn. 1, with $z_i$ being the transverse momentum fraction of the i-th jet constituent and $\theta_i = \Delta R_i / R$, where $\Delta R_i$ is the rapidity-azimuth distance to the jet axis and $R$ is the jet radius.

$$\lambda^{\kappa}_{\beta} = \sum_{i \in \mathrm{jet}} z_i^{\kappa}\, \theta_i^{\beta} \qquad (1)$$

In the $(\kappa, \beta)$ space we then have, for instance, jet mass at (1,2). In Table 1 we report the R2 score of the linear regression for jet angularities and for the jet N-subjettiness Thaler and Van Tilburg (2011) ($\tau_N$), which is very useful for probing the “prongness” substructure of jets. The R2 score is defined as the ratio of explained variance to total variance, with R2 = 1 corresponding to a perfect fit.
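As a concrete illustration of the generalized angularities of Eqn. 1, the following minimal sketch evaluates $\sum_i z_i^{\kappa} \theta_i^{\beta}$ for a toy jet. The function name, the toy constituents and the default jet radius of 1.0 are assumptions made for illustration only; they are not taken from the analysis code behind this study.

```python
import numpy as np


def generalized_angularity(pt, delta_r, kappa, beta, jet_radius=1.0):
    """Return sum_i z_i**kappa * theta_i**beta for a single jet.

    pt       : transverse momenta of the jet constituents
    delta_r  : distance of each constituent to the jet axis
    jet_radius is an assumed value, not the one used in the study.
    """
    pt = np.asarray(pt, dtype=float)
    delta_r = np.asarray(delta_r, dtype=float)
    z = pt / pt.sum()              # transverse momentum fractions z_i
    theta = delta_r / jet_radius   # angular distances in units of R
    return float(np.sum(z**kappa * theta**beta))


# Toy jet with three constituents (made-up numbers).
toy_pt = [300.0, 200.0, 100.0]
toy_delta_r = [0.05, 0.10, 0.30]

multiplicity = generalized_angularity(toy_pt, toy_delta_r, 0, 0)  # (0,0): counts constituents
width = generalized_angularity(toy_pt, toy_delta_r, 1, 1)         # (1,1): jet width
lambda_2_0 = generalized_angularity(toy_pt, toy_delta_r, 2, 0)    # (2,0): sum of z_i squared
```

The (0,0) and (2,0) points depend only on the momentum fractions, which is why they are sensitive to soft and collinear splittings, whereas (1,1) and (1,2) weight each constituent linearly in $z_i$.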

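| Jet observable | R2 (W/QCD) | R2 (Top/QCD) | R2 (q/g) |
|---|---|---|---|
| multiplicity ($\lambda^0_0$) | **0.68** | 0.55 | **0.75** |
| width ($\lambda^1_1$) | **0.74** | **0.82** | 0.55 |
| mass ($\lambda^1_2$) | **0.77** | **0.80** | 0.55 |
| $p_T^D$ ($\lambda^2_0$) | 0.30 | 0.36 | **0.65** |
| $\tau_1$ | 0.67 | **0.78** | 0.54 |
| $\tau_2$ | 0.67 | 0.62 | 0.56 |
| $\tau_3$ | 0.67 | 0.53 | **0.61** |
| $\tau_{21}$ | 0.39 | 0.35 | 0.20 |
| $\tau_{32}$ | 0.16 | 0.39 | 0.09 |

Table 1: R2 scores in linear regression for the latent jet representation, displayed for the different classification tasks (W/QCD, Top/QCD, quark/gluon). The three observables with the highest R2 scores are highlighted in bold for each classification task.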
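The linear probe behind scores such as those in Table 1 can be sketched as follows: a linear regression is fitted from the embedding vectors to one observable, and R2 is computed as one minus the ratio of residual to total variance. The synthetic embeddings, the 40-dimensional embedding size, the held-out split and the function name are all assumptions for illustration; the study's actual probing setup may differ.

```python
import numpy as np


def linear_probe_r2(embeddings, target, train_fraction=0.8, seed=0):
    """Fit a linear map embeddings -> target and return R2 on held-out jets."""
    rng = np.random.default_rng(seed)
    n = len(target)
    order = rng.permutation(n)
    n_train = int(train_fraction * n)
    train, test = order[:n_train], order[n_train:]

    # Append a constant column so the probe can learn an offset.
    design = np.hstack([embeddings, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(design[train], target[train], rcond=None)

    prediction = design[test] @ coef
    ss_res = np.sum((target[test] - prediction) ** 2)
    ss_tot = np.sum((target[test] - target[test].mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins: one observable that is linearly encoded, one that is not.
rng = np.random.default_rng(1)
toy_embeddings = rng.normal(size=(2000, 40))
encoded = toy_embeddings @ rng.normal(size=40) + 0.1 * rng.normal(size=2000)
unrelated = rng.normal(size=2000)

r2_encoded = linear_probe_r2(toy_embeddings, encoded)
r2_unrelated = linear_probe_r2(toy_embeddings, unrelated)
```

An observable that is linearly readable from the embedding gives an R2 close to 1, while an unrelated one gives an R2 near 0 (possibly slightly negative on held-out data).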

From Table 1, one can see that, among the three classification tasks, W/QCD and Top/QCD both give high probing scores for jet width ($\lambda^1_1$) Gallicchio and Schwartz (2013) and jet mass ($\lambda^1_2$), while quark/gluon gives high scores for jet multiplicity ($\lambda^0_0$) and $p_T^D$ ($\lambda^2_0$) Chatrchyan et al. (2012). This is interesting because, in $(\kappa, \beta)$ space, (1,1) and (1,2) are both IRC-safe observables, while the other two, IRC-unsafe, angularities carry important information for quark/gluon discrimination Larkoski et al. (2014). Given the earlier results on quark/gluon discrimination using mutual information in Larkoski et al. (2014), a further study of the full angularity space, or one using more sophisticated probing methods, may well lead to interesting observations. Another interesting point is that the ratios of N-subjettiness ($\tau_{21}$, $\tau_{32}$), which play a very important role in traditional prongness-based jet substructure tagging, including both W tagging and Top tagging, are not strongly manifested in the embedding representation. Rather, the N-subjettiness variables ($\tau_N$) themselves are much more directly related to the latent representation, indicating a more fundamental role in the latent space under investigation. Besides the characteristic jet observables studied above, the authors of Datta et al. (2019); Larkoski and Metodiev (2019) attempted to find the best discriminative observables, constructed from the N-subjettiness basis, for jet tagging problems with a machine learning approach. It will thus also be very interesting to see how these new observables are related to the latent embeddings.

Besides the embedded jet representation, we would like to see how the representation changes along the clustering tree. This is intrinsically difficult since every jet has a different clustering history. As a naive attempt, we calculated the mass direction and looked at how it shifts when passed down the clustering tree. We observed that a drastic change in the mass direction (i.e. the mass direction calculated with the jet embeddings no longer fits the clustering nodes further down) sometimes occurs close to where the hard splitting (resonance decay) happens. However, a more systematic investigation is needed to reach a confident conclusion.

Embedding Transferability

Since we have seen that some general physics observables (or related information) can be induced in the latent space by classification tasks, we expect the latent representation learnt in a specific classification task to be informative enough to apply to other physics processes in which some common features may be transferable. We pass the jet embeddings learnt in one classification task to another classification problem, to see how general or transferable the learnt jet representations are. The results show that for similar tasks (W/QCD and Top/QCD) the embeddings transfer with very good performance, while transferability decreases somewhat between different types of tasks (such as between W/QCD and quark/gluon). More details can be found in Appendix B.

Other Network Architectures

We have shown linear probing for RecNNs. In Appendix C, we also present linear probing results (for Top/QCD classification only) for other important architectures such as FCN, LSTM and CNN, with only low-level features as input. Some of these models have no clear embedding layer, so we investigated the last three layers. The R2 scores from these architectures are not as impressive as for the RecNN, but some common trends can still be observed. For all these architectures, jet width, jet mass and ( ) always dominate. Due to the image-based nature of the CNN, its hidden representation is always in pixel space, which makes it difficult to link directly with physics observables. Consequently, in the third-to-last layer Hidden(-3) (the flatten layer), the R2 scores are always close to 0. In the later dense layers R2 increases a bit, indicating that some abstract features are learnt there. By comparison, the RecNN clearly surpasses the other models in learning physics observables in its latent embeddings. While these analyses use only a very simple linear regression, we expect that more complex analyses will also bring interesting observations.

3 Interpreting on Jet Lund Plane

Gradient-based saliency maps Simonyan et al. (2013) give a general sense of how the input space affects activations in neural networks. However, this method is not straightforward when it comes to revealing the underlying physics. For jet physics, the most important building block is the jet splitting mechanism, which tells us which interactions dominate. Taking advantage of the clustering structure underlying RecNNs, we can easily look into every splitting, even within a neural network setup. In order to explore the jet splitting mechanisms and the corresponding neural network activations, we map the saliency onto jet Lund planes, which gives a physically meaningful visualization.

Lund diagrams Dasgupta et al. (2013); Dreyer et al. (2018) provide a theoretically useful framework for jet splitting by expressing emissions on the Lund plane $(\ln(1/\Delta), \ln k_t)$, where $\Delta$ denotes the angular distance between the two splitting branches and $k_t$ denotes the relative transverse momentum of the softer branch (the emission). The Lund plane gives a handy description of soft and collinear emissions within jets, and it displays different kinematic regions, each with clear underlying physics, separately. Soft-collinear emissions are distributed uniformly on the Lund plane, and the region where hard splitting happens can always be spotted easily. This gives us a navigation map for identifying the physical nature of the most sensitive nodes within neural networks.
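To make the mapping from a single splitting to the Lund plane concrete, the sketch below computes the angular distance between the two branches and the relative transverse momentum of the softer branch, following the convention of Dreyer et al. (2018) in which that momentum is the softer branch's transverse momentum times the angular distance. The function name, the (pt, rapidity, phi) input format and the use of natural logarithms of the inverse angle and of the momentum in GeV as plane coordinates are assumptions for illustration.

```python
import math


def lund_coordinates(branch_a, branch_b):
    """Map one declustering step to Lund-plane coordinates.

    Each branch is a tuple (pt, rapidity, phi). Returns
    (ln(1/Delta), ln(kt)) with kt = pt_soft * Delta.
    """
    pt_a, y_a, phi_a = branch_a
    pt_b, y_b, phi_b = branch_b

    # Wrap the azimuthal difference into [-pi, pi).
    dphi = (phi_a - phi_b + math.pi) % (2.0 * math.pi) - math.pi
    delta = math.hypot(y_a - y_b, dphi)   # angular distance between branches

    pt_soft = min(pt_a, pt_b)             # the softer branch is the emission
    kt = pt_soft * delta                  # relative transverse momentum
    return math.log(1.0 / delta), math.log(kt)


# Toy splitting (made-up numbers): a 400 GeV and a 200 GeV branch.
ln_inv_delta, ln_kt = lund_coordinates((400.0, 0.0, 0.0), (200.0, 0.3, 0.4))
```

Applying this function to every node of a clustering tree gives one point per splitting, which is what allows per-node quantities to be accumulated on a common plane even though each jet has its own tree.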


We display gradient-based saliency maps for the different jet classification tasks side by side. The saliency for Recursive Neural Networks Louppe et al. (2019) is defined, in Eqn. 2, as the norm of the gradient of the classifier output $y$ with respect to the embedding $h_k$ of every clustering node $k$, normalized by the norm of the gradient with respect to the final jet embedding $h_{\mathrm{jet}}$.

$$S_k = \frac{\lVert \partial y / \partial h_k \rVert}{\lVert \partial y / \partial h_{\mathrm{jet}} \rVert} \qquad (2)$$

Saliency maps can be examined within the tree structure itself, but only jet by jet, again because every jet has its own clustering history. To get a collective impression of the sensitivity distribution over all clustering nodes without losing track of the underlying physics, we map the saliency onto the jet Lund plane and compare across tasks and across kinematic regions dominated by different splitting mechanisms.
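The node-level saliency described above can be sketched on a toy recursive network: each node embedding is built from its two children, the classifier output is back-propagated through the tree, and the gradient norm at every node is divided by the gradient norm at the root (jet) embedding. The tiny tanh network, its random weights, the embedding size and the choice of the Euclidean norm are all assumptions made for this illustration and do not reproduce the trained RecNN of this study.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4                                               # toy embedding size (assumption)
W_leaf = rng.normal(scale=0.5, size=(DIM, 3))         # leaf features -> embedding
W_node = rng.normal(scale=0.5, size=(DIM, 2 * DIM))   # (left, right) -> parent
w_out = rng.normal(size=DIM)                          # embedding -> classifier logit


def embed(tree, nodes):
    """Embed a tree recursively; a tree is a leaf array or a (left, right) tuple.

    Nodes are appended in post-order, so children precede their parent.
    Returns the index of the node just created.
    """
    if isinstance(tree, tuple):
        left = embed(tree[0], nodes)
        right = embed(tree[1], nodes)
        h = np.tanh(W_node @ np.concatenate([nodes[left]["h"], nodes[right]["h"]]))
        nodes.append({"h": h, "children": (left, right)})
    else:
        nodes.append({"h": np.tanh(W_leaf @ tree), "children": None})
    return len(nodes) - 1


def node_saliency(tree):
    """Return ||dy/dh_k|| / ||dy/dh_root|| for every node k (Euclidean norm)."""
    nodes = []
    root = embed(tree, nodes)
    y = 1.0 / (1.0 + np.exp(-w_out @ nodes[root]["h"]))   # sigmoid output
    grads = {root: y * (1.0 - y) * w_out}                 # dy/dh at the root

    # Walk from the root downwards; a parent always has a larger index.
    for idx in range(root, -1, -1):
        children = nodes[idx]["children"]
        if children is None:
            continue
        back = W_node.T @ (grads[idx] * (1.0 - nodes[idx]["h"] ** 2))
        grads[children[0]] = back[:DIM]
        grads[children[1]] = back[DIM:]

    root_norm = np.linalg.norm(grads[root])
    return [float(np.linalg.norm(grads[i]) / root_norm) for i in range(len(nodes))]


# Toy clustering tree with three leaves: ((a, b), c).
leaves = [rng.normal(size=3) for _ in range(3)]
saliency = node_saliency(((leaves[0], leaves[1]), leaves[2]))
```

By construction the root node has saliency 1, so values above or below 1 indicate nodes to which the output is more or less sensitive than to the full jet embedding.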

Figure 1: Saliency sensitivity mapped onto Jet Lund Plane. Left: QCD and W jets in W/QCD Classification; Middle: QCD and Top jets in Top/QCD Classification; Right: Quark and Gluon jets in Quark/Gluon Classification.


We present the averaged saliency sensitivity in Fig. 1. For W/QCD, Top/QCD and quark/gluon classification, we show the saliency map on the Lund plane for both classes. This serves two purposes: comparing class saliency within a task and comparing same-class saliency across tasks. To make the illustration clearer, we restrict the saliency intensity to [0, 2] for W/QCD and to [0, 1.5] for the other two classifications (overflows are mapped onto the upper bound). Additional plots with a uniform range, which make comparison easier, are included in Appendix D.
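One plausible way to build such an averaged map is to bin all clustering nodes by their Lund-plane coordinates, average the saliency per bin, and clip overflows to the chosen upper bound. The sketch below does this with NumPy; the binning, the axis ranges and the function name are assumptions, since the text does not specify them.

```python
import numpy as np


def average_saliency_map(ln_inv_delta, ln_kt, saliency,
                         bins=25, plane_range=((0.0, 6.0), (-2.0, 6.0)),
                         upper_bound=2.0):
    """Average node saliency per Lund-plane bin and clip to [0, upper_bound].

    Empty bins are returned as NaN so they can be left blank in a plot.
    """
    total, x_edges, y_edges = np.histogram2d(
        ln_inv_delta, ln_kt, bins=bins, range=plane_range, weights=saliency)
    count, _, _ = np.histogram2d(
        ln_inv_delta, ln_kt, bins=bins, range=plane_range)

    with np.errstate(invalid="ignore", divide="ignore"):
        mean = np.where(count > 0, total / count, np.nan)
    return np.clip(mean, 0.0, upper_bound), x_edges, y_edges


# Toy input: two nodes share a bin (mean 3.0, clipped to 2.0), one sits alone.
toy_map, _, _ = average_saliency_map(
    ln_inv_delta=[1.0, 1.0, 3.0],
    ln_kt=[1.0, 1.5, 3.0],
    saliency=[1.0, 5.0, 0.5],
    bins=2, plane_range=((0.0, 4.0), (0.0, 4.0)), upper_bound=2.0)
```

Clipping is applied after averaging, so a bin saturates only when its mean saliency exceeds the bound, not when a single node does.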

From Fig. 1, one can observe that:

  • For the three different tasks, the sensitivity maps on the Lund plane show very different characteristic patterns. For W/QCD, the sensitivity is generally higher than for the other two tasks.

  • In the left column, for W/QCD classification, there is more activity along the diagonal boundary, which is where hard-collinear splitting happens. Notice that even for QCD jets there is a high-sensitivity spot close to location (2, 3.5), which corresponds to a hard splitting at a mass between 60 GeV (the mass peak of the QCD jet samples) and 80 GeV (the W boson mass). This might be related to two-prongness (hard splitting), with the characteristics of W jets shaping the sensitivity within the QCD jets they are compared against. It is not clear whether some of the high activity in the collinear region is related to the IRC-safety robustness of the network. (The IRC safety study in Louppe et al. (2019) showed that for “collinear10-max”, which applies collinear splits to the 10 highest-$p_T$ jet constituents, background rejection decreases a bit.)

  • From W/QCD to Top/QCD, the relatively sensitive regions of QCD jets change drastically. The left boundary corresponds to the final clustering step, which in C/A has large angles. In Top/QCD classification, the sensitivity slightly decreases along the axis (the red spot in the lower-left corner for QCD jets is curious, though). Compared to W/QCD, the hard-collinear region is much less sensitive. This may be partly because top jets, being heavier, generally have more decay constituents and correspondingly longer cascade chains and more complex jet substructure, so that saliency passed down longer chains ending in collinear splittings tends to decrease or vanish. The top-mass hard-splitting region in the upper part of the triangle is less pronounced than in the W/QCD case, but is still relatively important for Top jets.

  • Finally, for quark/gluon classification, large-angle and some hard-collinear splittings (especially for gluon jets) appear relatively important in the plots. Generally speaking, saliency effectively runs through the whole clustering tree in quark/gluon tagging, where multiplicity plays an important role. So we do not see much decrease in sensitivity along the axis as saliency passes from large angles down to small angles.

4 Conclusions and Extended Studies

We have shown some interesting results from probing the hidden representations learnt by deep neural networks for jet tagging, taking tree-structured recursive neural networks as an example. A cross-task comparative study helps to reveal and understand the information encoded in DNNs. In order to interpret the sensitivity within the learnt neural network models in a physical way, we combined saliency tree maps with the jet Lund plane and presented the mapped sensitivity across different jet classification tasks. The results show that, even when only low-level features of jet constituents are fed in, the latent embeddings are still highly relevant to general physics observables, and that the saliency maps show very characteristic patterns depending on the classification task.

We only presented results for RecNNs. Some of the methods are easier to explore within RecNNs because of their physics-inspired architecture. In general, one can probe the embedding space or the hidden layers to gain some basic information. However, it will, for instance, be very difficult to interpret CNNs in the Lund plane, since the pixelization of jet images washes out the individual particle information. We expect that more suitable methods can be found for these architectures. On the other hand, neural networks that take Lund coordinates directly as features might be another playground for Lund plane interpretability studies. A systematic study comparing different neural network architectures across different classification tasks would give clearer information on model interpretability. Depending on the specific architecture (image-based, four-vector-based, or based on theory-motivated variables), different models might be learning different discriminating factors, so combining these learnt representations might help to achieve better performance.

Although these studies are carried out on supervised classification problems, they may equally well (or even better) be applied to unsupervised learning. Besides the results presented here, there are more interesting points to investigate. One important topic is the robustness of deep learning models. It is possible that saliency analysis will help us find better ways of approaching robust machine learning models. We hope this effort will develop further into more powerful practical tools for detecting physics structure in the future.


This work is funded by an IVADO Fundamental Research Grant. The author would like to thank Gilles Louppe for collaboration at a very early stage, and is also grateful for helpful discussions with MILA colleagues. Finally, a thank you goes to Prof. Jean-Francois Arguin for kind support while this work was done.


  • [1] Note: Cited by: §3.
  • G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
  • A. Butter, G. Kasieczka, T. Plehn, and M. Russell (2018) Deep-learned Top Tagging with a Lorentz Layer. SciPost Phys. 5 (3), pp. 028. arXiv:1707.08966.
  • A. Butter et al. (2019) The Machine Learning Landscape of Top Taggers. SciPost Phys. 7, pp. 014. arXiv:1902.09914.
  • M. Cacciari, G. P. Salam, and G. Soyez (2008) The Anti-k(t) jet clustering algorithm. JHEP 04, pp. 063. arXiv:0802.1189.
  • M. Cacciari, G. P. Salam, and G. Soyez (2012) FastJet User Manual. Eur. Phys. J. C72, pp. 1896. arXiv:1111.6097.
  • S. Chatrchyan et al. (2012) Search for a Higgs boson in the decay channel H to ZZ(*) to q qbar l- l+ in pp collisions at sqrt(s) = 7 TeV. JHEP 04, pp. 036. arXiv:1202.1416.
  • T. Cheng (2017) Recursive Neural Networks in Quark/Gluon Tagging. arXiv:1711.02633.
  • J. Cogan, M. Kagan, E. Strauss, and A. Schwartzman (2015) Jet-Images: Computer Vision Inspired Techniques for Jet Tagging. JHEP 02, pp. 118. arXiv:1407.5675.
  • M. Dasgupta, A. Fregoso, S. Marzani, and G. P. Salam (2013) Towards an understanding of jet substructure. JHEP 09, pp. 029. arXiv:1307.0007.
  • K. Datta, A. Larkoski, and B. Nachman (2019) Automating the Construction of Jet Observables with Machine Learning. arXiv:1902.07180.
  • J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (2014) DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP 02, pp. 057. arXiv:1307.6346.
  • L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman (2016) Jet-images -- deep learning edition. JHEP 07, pp. 069. arXiv:1511.05190.
  • F. A. Dreyer, G. P. Salam, and G. Soyez (2018) The Lund Jet Plane. JHEP 12, pp. 064. arXiv:1807.04758.
  • S. Egan, W. Fedorko, A. Lister, J. Pearkes, and C. Gay (2017) Long Short-Term Memory (LSTM) networks with jet constituents for boosted top tagging at the LHC. arXiv:1711.09059.
  • J. Gallicchio and M. D. Schwartz (2013) Quark and Gluon Jet Substructure. JHEP 04, pp. 090. arXiv:1211.7038.
  • I. Henrion (2017) Neural Message Passing for Jet Physics. Proceedings of the Deep Learning for Physical Sciences Workshop at NIPS (2017).
  • P. T. Komiske, E. M. Metodiev, and J. Thaler (2019) Energy Flow Networks: Deep Sets for Particle Jets. JHEP 01, pp. 121. arXiv:1810.05165.
  • A. J. Larkoski and E. M. Metodiev (2019) A Theory of Quark vs. Gluon Discrimination. arXiv:1906.01639.
  • A. J. Larkoski, J. Thaler, and W. J. Waalewijn (2014) Gaining (Mutual) Information about Quark/Gluon Discrimination. JHEP 11, pp. 129. arXiv:1408.3122.
  • G. Louppe, K. Cho, C. Becot, and K. Cranmer (2019) QCD-Aware Recursive Neural Networks for Jet Physics. JHEP 01, pp. 057. arXiv:1702.00748.
  • E. A. Moreno, O. Cerri, J. M. Duarte, H. B. Newman, T. Q. Nguyen, A. Periwal, M. Pierini, A. Serikova, M. Spiropulu, and J. Vlimant (2019) JEDI-net: a jet identification algorithm based on interaction networks. arXiv:1908.05318.
  • J. Pearkes, W. Fedorko, A. Lister, and C. Gay (2017) Jet Constituents for Deep Neural Network Based Top Quark Tagging. arXiv:1704.02124.
  • H. Qu and L. Gouskos (2019) ParticleNet: Jet Tagging via Particle Clouds. arXiv:1902.08570.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR abs/1312.6034. arXiv:1312.6034.
  • T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands (2015) An Introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, pp. 159–177. arXiv:1410.3012.
  • J. Thaler and K. Van Tilburg (2011) Identifying Boosted Objects with N-subjettiness. JHEP 03, pp. 015. arXiv:1011.2268.

Appendix A Datasets and Neural Net Architecture


All the jet samples are generated with Pythia8 Sjöstrand et al. (2015) and passed to Delphes de Favereau et al. (2014) for fast detector simulation. Jets have $p_T \in [500, 700]$ GeV. They are clustered using the anti-kt algorithm Cacciari et al. (2008) with FastJet Cacciari et al. (2012). Jet clustering cone size is set to be .

Neural Net Settings

The neural networks used in this work are similar to those in Louppe et al. (2019); Cheng (2017). We embed jets along with their clustering histories, which are used as the scaffold for building the recursive embedding. The input features of the constituent particles within jets are fed directly into the neural networks. We use anti-kt clustered jets, then recluster them using the Cambridge-Aachen algorithm for simplicity in the Lund plane analysis. The network architecture is (RecNN Embedding () → Dense(70) → Dense(70) → Sigmoid).

Appendix B Transferability of RecNN

In the study of embedding transferability, we use the embedding layer learnt in classification task A to obtain jet embeddings for task B, and then feed these transferred embeddings to a dense network for classification (i.e. we freeze the embedding layer, which is initialized with the parameters learnt in task A). In Table 2, we show cross-task transfer AUCs for all combinations, with Base AUC denoting the AUC of the original full training on B, and Transfer AUC denoting the result obtained with the transferred embeddings.

| Tasks (A → B) | Base AUC | Transfer AUC |
|---|---|---|
| W/QCD → Top/QCD | 0.926 | 0.891 |
| q/g → Top/QCD | 0.926 | 0.791 |
| Top/QCD → W/QCD | 0.957 | 0.911 |
| q/g → W/QCD | 0.957 | 0.822 |
| W/QCD → q/g | 0.861 | 0.763 |
| Top/QCD → q/g | 0.861 | 0.759 |

Table 2: Transferability results. Base AUC is the AUC of the original training on the target task, while Transfer AUC is obtained when the transferred embedding is used for training the classifier.
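The transfer procedure of this appendix can be sketched in a few lines: an embedding map with fixed weights is applied to the jets of the target task, only a downstream classifier is trained on top of it, and the AUC is evaluated on held-out jets. Everything concrete below is an assumption for illustration: the random frozen weights stand in for parameters learnt in task A, a logistic head replaces the dense network, and the data are synthetic.

```python
import numpy as np


def auc(scores, labels):
    """Area under the ROC curve from the rank-sum statistic (ties not handled)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def train_head(features, labels, steps=500, lr=0.5):
    """Train a logistic-regression head by gradient descent; features stay fixed."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - labels
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


rng = np.random.default_rng(0)
n, n_train = 2000, 1500

# Synthetic task-B jets and labels.
x = rng.normal(size=(n, 5))
labels_b = (x[:, 0] + 0.5 * x[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(float)

# Frozen embedding: these weights are never updated while training on task B.
W_frozen = rng.normal(scale=0.5, size=(5, 16))
embeddings = np.tanh(x @ W_frozen)

w, b = train_head(embeddings[:n_train], labels_b[:n_train])
transfer_auc = auc(embeddings[n_train:] @ w + b, labels_b[n_train:])
```

The design point is that gradients only reach the head: the quality of the transferred AUC then measures how much task-B information the frozen embedding already carries.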

Appendix C Linear Probing in Other Models

To quickly explore the latent space of other popular architectures, we take a few models based on low-level features that have been used for jet tagging: FCN, LSTM and CNN.

  • FCN: takes in the four-momenta of the first 30 jet constituents in a $p_T$-ordered manner. The architecture is similar to that in Pearkes et al. (2017). The dense hidden layers have 300, 102, 12 and 6 nodes with ReLU activation. The output layer has a sigmoid activation.

  • LSTM: takes in the four-momenta of the first 30 jet constituents in a $p_T$-ordered manner. The LSTM layer outputs a 70-dimensional embedding, which is then fed into two ReLU dense layers with 50 and 20 nodes. The output layer has a sigmoid activation.

  • CNN: takes in 37×37 grey-scale images on the ($\eta$, $\phi$) plane with deposit as pixel intensity. We take an architecture similar to that in Butter et al. (2019): (Conv2D*16 → Conv2D*16 → MaxPooling2D → Conv2D*8 → Conv2D*8 → MaxPooling2D → Flatten → Dense(128) → Dense(64) → Sigmoid).
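The fixed-length input used by the FCN and LSTM models above can be prepared as in the following sketch, which orders the constituents by decreasing transverse momentum, keeps the first 30 and pads shorter jets. The (E, px, py, pz) component order, the zero padding and the function name are assumptions for illustration; the text only states that the first 30 constituents are used in an ordered manner.

```python
import numpy as np


def ordered_constituent_input(four_momenta, max_constituents=30):
    """Return a (max_constituents, 4) array of pT-ordered four-momenta.

    four_momenta: array-like of shape (n, 4) with rows (E, px, py, pz).
    Jets with fewer constituents are zero-padded; extra ones are dropped.
    """
    four_momenta = np.asarray(four_momenta, dtype=float).reshape(-1, 4)
    pt = np.hypot(four_momenta[:, 1], four_momenta[:, 2])
    order = np.argsort(-pt)                       # highest pT first
    selected = four_momenta[order][:max_constituents]

    padded = np.zeros((max_constituents, 4))
    padded[:len(selected)] = selected
    return padded


# Toy jet with three constituents (made-up numbers).
net_input = ordered_constituent_input([
    [50.0, 30.0, 40.0, 0.0],     # pT = 50
    [120.0, 60.0, 80.0, 60.0],   # pT = 100
    [10.0, 6.0, 8.0, 0.0],       # pT = 10
])
```

An LSTM can consume the resulting array row by row as a sequence, while an FCN would take it flattened to a vector of length 120.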

For simplicity, we only take Top/QCD classification as an example here. Models are trained on the datasets provided in Butter et al. (2018). R2 scores are shown for the last three layers in Fig. 2.

Figure 2: R2 scores for the last hidden layers of RecNN, FCN, LSTM and CNN. Hidden(-1) is the last layer, Hidden(-2) the second-to-last, and Hidden(-3) the third-to-last. For RecNN and LSTM, Hidden(-3) is just the embedding layer. For CNN, Hidden(-3) is the flatten layer.

Appendix D Extra Plots for Saliency Lund Plane

Figure 3: Saliency sensitivity mapped onto Jet Lund Plane. Left: QCD and W jets in W/QCD Classification; Middle: QCD and Top jets in Top/QCD Classification; Right: Quark and Gluon jets in Quark/Gluon Classification.