Applying deep neural networks to jet tagging has made substantial progress in recent years. The community has reached a consensus on several points. 1) DNNs taking in low-level features are able to perform well in jet tagging tasks and to automatically carry out feature engineering. 2) Different architectures have been explored, including image-based CNNs Cogan et al. (2015); de Oliveira et al. (2016), momentum-vector-based RNNs Egan et al. (2017), physics-inspired Recursive NNs Louppe et al. (2019), graph models Henrion (2017); Moreno et al. (2019), and point clouds Qu and Gouskos (2019). These architectures perform comparably well in terms of tagging efficiency, and their results corroborate each other Butter and others (2019). While this is an encouraging message, it also suggests that there may be information redundancy in architecture search. It would be interesting to extract the common part among different architecture models to reduce this redundancy, and to combine the uniquely learnt information from different models to enhance tagging performance. 3) With well-performing jet tagging DNNs established, the next step is to push further by connecting them with our theoretical physics framework.
Previous attempts to visualize and interpret DNNs (de Oliveira et al. (2016); Komiske et al. (2019)) have focused on general activation and filter visualization. This gives some intuitive sense of what information has been learnt. However, DNNs interpreted in this way are not sufficient for robust physics analysis: a clear path between DNNs and the system of physics-theory observables has remained obscure.
In this work, we try to construct this connection with a series of studies ranging from latent space probing to a tailored saliency study. We use different jet tagging tasks to form a comparative study, including W/QCD, Top/QCD, and quark/gluon classification. In these tasks, different factors dominate, since the causal factor resides in different parts of the jet clustering trees. The very essence of object identification in High Energy Physics is to "see" underlying quantum properties, which evolve at low energy into observed patterns in the detector space. Our ultimate hope is in the same vein, i.e. to probe the underlying physics structure, not only to detect objects. As for DNN architectures, we take advantage of tree-structured Recursive Neural Networks (RecNNs) Louppe et al. (2019), which utilize the well-established jet clustering algorithms that encode the radiation patterns within the clustering history. A jet representation is obtained by embedding each clustering node recursively along the clustering tree, and can then easily be fed into downstream tasks. This approach produces interesting interpretations of the information encoded within neural nets and provides a physics-friendly method for bridging physics theory and DNN interpretation.
Although all the interpretation studies here are carried out with the architecture of tree-based RecNNs and interpreted in the scope of supervised classification problems, the proposed methods and findings can be adapted to other architectures and to unsupervised learning. As the last step within the interpretation loop, how to use the information gained by analyzing latent spaces, relevance, and sensitivities to improve DNN performance and to clarify other relevant aspects such as uncertainty and robustness would be worth further exploration.
We use similar training datasets and the same neural network architecture and parameters as in Cheng (2017); Louppe et al. (2019) throughout this work. For all tasks, we constrain the jet transverse momentum to be around 600 GeV for a more meaningful comparison across tasks. We feed the input features () of every jet constituent into the jet clustering history to recursively build up the jet embedding. A more detailed description of the datasets and neural network architecture can be found in Appendix A.
2 Probing Jet Embedding Space
Here we employ the “probing method” Alain and Bengio (2016) to discover what is represented in the latent space. By feeding the learnt latent embedding representations into auxiliary probing tasks, one can, in the most straightforward way, find the relevance between the latent space and our target information, and thus get a hint of how well the latent space is aligned with the space of specific physics observables.
After being trained on the input space of particle four-momenta (or variants thereof) in classification tasks, the embedding vectors of jets are then investigated in a task-independent way, using linear classification or regression to link them with our space of physics observables. To be more specific, we take the Generalized Angularities Larkoski et al. (2014) as a systematic summary of common jet observables, as expressed in Eqn. 1, λ^κ_β = Σ_i z_i^κ θ_i^β, with z_i being the transverse momentum fraction of the i-th jet constituent and θ_i = ΔR_i/R, where ΔR_i is the angular distance to the jet axis and R is the jet radius.
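As a concrete illustration, a generalized angularity of this form can be computed directly from constituent (p_T, ΔR) pairs. This is a minimal sketch, not the paper's code; the function name and the example jet radius R = 0.8 are assumptions.

```python
import numpy as np

def angularity(pts, dRs, kappa, beta, R=0.8):
    """Generalized angularity lambda^kappa_beta = sum_i z_i^kappa * theta_i^beta,
    with z_i the constituent pT fraction and theta_i = dR_i / R the scaled
    angular distance to the jet axis (R = 0.8 is an assumed jet radius)."""
    pts = np.asarray(pts, dtype=float)
    z = pts / pts.sum()
    theta = np.asarray(dRs, dtype=float) / R
    return float(np.sum(z**kappa * theta**beta))

pts = [100.0, 50.0, 50.0]   # toy constituent transverse momenta
dRs = [0.1, 0.2, 0.4]       # toy angular distances to the jet axis

mult = angularity(pts, dRs, kappa=0, beta=0)   # (0, 0) counts constituents -> 3.0
width = angularity(pts, dRs, kappa=1, beta=1)  # (1, 1) is the jet width
```

Different (κ, β) choices thus sweep through the familiar jet observables within one formula.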
We will then have, in the (κ, β) space, for instance, jet mass at (1, 2). We report, in Table 1, the R² scores returned by the linear regression for jet angularities and jet N-subjettiness Thaler and Van Tilburg (2011), which is very useful in probing the “prongness” substructure of jets. The R² score is defined as the ratio of explained variance to total variance, with R² = 1 corresponding to a perfect fit.
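The linear probe itself can be sketched in a few lines. The embeddings and target below are toy stand-ins (in the real study they would be the frozen jet embeddings and a computed observable such as jet mass), and the helper name is hypothetical.

```python
import numpy as np

def linear_probe_r2(emb_train, y_train, emb_test, y_test):
    """Fit a least-squares linear probe (with a bias term) on the embeddings
    and return the held-out R^2 = 1 - SS_res / SS_tot."""
    X_tr = np.column_stack([emb_train, np.ones(len(emb_train))])
    X_te = np.column_stack([emb_test, np.ones(len(emb_test))])
    w, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
    pred = X_te @ w
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 40))                              # toy 40-dim jet embeddings
obs = emb @ rng.normal(size=40) + 0.1 * rng.normal(size=500)  # toy (linearly encoded) observable

score = linear_probe_r2(emb[:400], obs[:400], emb[400:], obs[400:])
```

For this toy target, which is linear in the embedding by construction, the held-out R² is close to 1; a real observable only scores highly if the latent space encodes it (near-)linearly.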
From Table 1, one can see that among the three classification tasks, W/QCD and Top/QCD both give a high probing score for jet width ((κ, β) = (1, 1)) Gallicchio and Schwartz (2013) and jet mass ((1, 2)), while quark/gluon gives a high score for jet multiplicity ((0, 0)) and p_T^D ((2, 0)) Chatrchyan and others (2012). This is interesting because, in the (κ, β) space, (1, 1) and (1, 2) are both IRC-safe observables, while the other two, IRC-unsafe, angularities carry important information in quark/gluon discrimination Larkoski et al. (2014). Given the previous results on quark/gluon discrimination using mutual information in Larkoski et al. (2014), it is foreseeable that further study of the full angularity space, or more sophisticated probing methods, may lead to very interesting observations. Another interesting point is that the ratios of N-subjettiness, which play a very important role in traditional prongness-based jet substructure tagging, including both W tagging and Top tagging, are not strongly manifested in the embedding representation. Rather, the N-subjettiness variables themselves are much more directly related to the latent representation, indicating a more fundamental role in the latent space under investigation. Besides the characteristic jet observables studied above, the authors in Datta et al. (2019); Larkoski and Metodiev (2019) attempted to find the best discriminative observables, constructed from the N-subjettiness basis, for jet tagging problems with a machine learning approach. It will thus also be very interesting to see how these new observables are related to the latent embeddings.
Besides the embedded jet representation, we would also like to see how the representation changes along the clustering trees. This is intrinsically difficult, since every jet has a different clustering history. As a first attempt, we calculated the mass direction and examined how it shifts when passed down the clustering tree. We observed that the points where the mass direction changes drastically (i.e. where the mass direction calculated with jet embeddings does not fit the downstream clustering nodes) are often close to where the hard splitting (resonance decay) happens. However, a more systematic investigation is needed before drawing a confident conclusion.
Since some general physics observables (or relevant information) can be induced in the latent space by classification tasks, we expect the latent representation learnt by one classification task to be informative enough to apply to other physics processes, where some common features may be transferable. We pass the jet embeddings learnt by one classification task to another classification problem to see how general, or transferable, the learnt jet representations are. Results show that for similar tasks (W/QCD and Top/QCD) the embeddings transfer very well, while the transferability decreases somewhat across different types of tasks (such as between W/QCD and quark/gluon). More details can be found in Appendix B.
Other Network Architectures
We have shown linear probing for RecNNs. In Appendix C, we also present linear probing results (only for Top/QCD classification) for other important architectures, such as FCN, LSTM and CNN, with only low-level features as input. Some of these models have no clear embedding layer, so we investigate the last three layers. The R² scores from these architectures are not as impressive as for the RecNN, but some common trends can still be observed. For all these architectures, jet width, jet mass and () always dominate. Due to the image-based nature of the CNN, the hidden representation is always in pixel space, which makes it difficult to link directly with physics observables. So in the third-to-last layer Hidden(-3) (the flatten layer), R² scores are always close to 0. In the later dense layers, R² increases a bit, indicating that some abstract features are learnt there. Comparatively, the RecNN clearly surpasses the other models in learning physics observables in its latent embeddings. While these analyses use only a very simple linear regression, we expect that more complex analyses will also bring interesting observations.
3 Interpreting on Jet Lund Plane
Gradient-based saliency maps Simonyan et al. (2013) give us a general sense of how the input space affects activations in neural networks. However, this method is not straightforward in revealing the underlying physics. For jet physics, the most important building block is the jet splitting mechanism, which tells us about the underlying dominant interactions. Taking advantage of the base clustering structure within RecNNs, we can easily inspect every splitting, even within a neural network setup. To explore the jet splitting mechanisms and the corresponding neural network activations, we map the saliency onto jet Lund planes, giving a visualization that is meaningful in a physics sense.
Lund diagrams Dasgupta et al. (2013); Dreyer et al. (2018) build a theoretically useful framework for jet splitting by expressing emissions on the plane (ln(1/Δ), ln(k_t)), where Δ denotes the angular distance between the two splitting branches and k_t denotes the relative transverse momentum of the softer branch (the emission). This gives a handy description of soft and collinear emissions within jets, and it displays the different kinematic regions, each with clear underlying physics, separately. Soft-collinear emissions populate the Lund plane uniformly, and the region where hard splitting happens can always be easily spotted. This gives us a navigation map for identifying the physics nature of the most sensitive nodes within neural networks.
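For a single declustering step, the Lund-plane coordinates can be computed from the two branches as below. This is a minimal sketch assuming branches are given as (p_T, rapidity, φ) triples; the function name is hypothetical and k_t is taken as p_T,soft × Δ, the usual soft/collinear approximation.

```python
import math

def lund_coords(branch_hard, branch_soft):
    """Lund-plane coordinates (ln(1/Delta), ln(kt)) of one declustering step.
    Each branch is (pt, rapidity, phi); Delta is the rapidity-azimuth distance
    between the branches and kt = pt_soft * Delta."""
    pt_h, y_h, phi_h = branch_hard
    pt_s, y_s, phi_s = branch_soft
    # wrap the azimuthal difference into (-pi, pi]
    dphi = math.atan2(math.sin(phi_h - phi_s), math.cos(phi_h - phi_s))
    delta = math.hypot(y_h - y_s, dphi)
    kt = pt_s * delta
    return math.log(1.0 / delta), math.log(kt)

# A fairly wide, soft emission off a hard core:
ln_inv_delta, ln_kt = lund_coords((300.0, 0.0, 0.0), (5.0, 0.4, 0.3))
```

Walking down the primary branch of the clustering tree and recording one such point per splitting fills in the Lund plane for a jet.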
We display gradient-based saliency maps for the different jet classification tasks comparatively. The saliency for Recursive Neural Networks Louppe et al. (2019) is defined, in Eqn. 2, as the L1 norm of the gradient sensitivity passed from the classifier output to every clustering embedding node, normalized with respect to the L1 norm of the gradient sensitivity of the final jet embedding node.
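This normalization can be sketched with toy stand-in gradients (in practice the per-node gradients would come from automatic differentiation through the recursive embedding; the helper name is hypothetical):

```python
import numpy as np

def rel_saliency(grad_node, grad_root):
    """Relative node saliency: the L1 norm of the classifier gradient at a
    clustering node's embedding, normalized by the L1 norm of the gradient
    at the final (root) jet embedding (a sketch of Eqn. 2)."""
    return float(np.abs(grad_node).sum() / np.abs(grad_root).sum())

rng = np.random.default_rng(1)
grad_root = rng.normal(size=40)        # toy gradient at the root jet embedding
grad_leaf = 0.3 * rng.normal(size=40)  # toy (attenuated) gradient at a deep node

s_root = rel_saliency(grad_root, grad_root)  # exactly 1 by construction
s_leaf = rel_saliency(grad_leaf, grad_root)
```

The root node thus always carries unit saliency, and every other clustering node is measured relative to it.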
Saliency maps can be examined within the tree structure itself, but only jet by jet, again because every jet has its own clustering history. In order to form a collective impression of the sensitivity distribution over all clustering nodes, without losing track of the underlying physics, we map the saliency onto the jet Lund plane, to compare across tasks and across different kinematic regions dominated by different splitting mechanisms.
We present the averaged saliency sensitivity in Fig. 1. For W/QCD, Top/QCD and quark/gluon classification, we show the saliency map on the Lund plane for both classes. This serves two purposes: class saliency comparison within tasks and same-class saliency comparison across tasks. To make the illustration clearer, we restrict the saliency intensity (overflows are mapped onto the upper bound) to be within [0, 2] for W/QCD and [0, 1.5] for the other two classifications. More plots with a uniform range, which make comparison easier, are included in Appendix D.
From Fig. 1, one can observe that:
For the three different tasks, the sensitivity maps on the Lund plane show very different characteristic patterns. For W/QCD, the sensitivity is generally higher than in the other two tasks.
In the left column, for W/QCD classification, there is more activity along the diagonal boundary, which is where hard-collinear splitting happens. Notice that even for QCD jets, there is a high-sensitivity spot close to (2, 3.5), which corresponds to hard splitting at a mass between 60 GeV (the mass peak of the QCD jet samples) and 80 GeV (the W boson mass). This might be related to two-prongness (hard splitting), where the characteristics of W jets shape the sensitivity activity within the contrasting QCD jets. It is not clear whether some of the high activity in the collinear region is related to the IRC-safety robustness of the network. (The IRC safety study in Louppe et al. (2019) showed that for “collinear10-max”, which applies collinear splits to the 10 highest-p_T jet constituents, the background rejection decreases slightly.)
From W/QCD to Top/QCD, the relatively sensitive regions of the QCD jets change drastically. The left boundary corresponds to the final clustering steps, which in C/A involve large angles. In Top/QCD classification, the sensitivity slightly decreases along the axis. (The red spot in the lower-left corner of the QCD plot is curious, though.) Compared to W/QCD, the hard-collinear region is much less sensitive. This may be partly because top jets, with their heavier mass, generally have more decay constituents and correspondingly longer cascading chains and more complex substructure, and saliency passed down longer chains ending in collinear splittings tends to decrease or vanish. The top-mass hard-splitting regions in the upper part of the triangle are less obvious than in the W/QCD case, but still relatively important in Top jets.
Finally, for quark/gluon classification, large-angle and some hard-collinear splittings (especially for gluon jets) appear relatively important in the plots. Generally speaking, saliency effectively runs through the whole clustering tree in quark/gluon tagging, where multiplicity plays an important role. So we do not see much sensitivity decrease along the axis as saliency passes from large angles down to small angles.
4 Conclusions and Extended Studies
We have shown some interesting results on probing the hidden representations learnt by jet-tagging-oriented deep neural networks, taking tree-structured recursive neural networks as an example. A cross-task comparative study helps us reveal and understand the information encoded in DNNs. In order to interpret the sensitivity within learnt neural net models in a physical way, we combine saliency tree maps with the jet Lund plane, and present the mapped sensitivity across different jet classification tasks. Results show that, when feeding in only low-level features of jet constituents, there is still high relevance to general physics observables in the latent embeddings, and the saliency maps show very characteristic patterns depending on the classification task.
We only presented results for RecNNs. Some of the methods are easier to explore within RecNNs because of their physics-inspired architecture. In general, one can probe the embedding space or hidden layers to gain some basic information. However, it would be very difficult, for instance, to interpret CNNs on the Lund plane, since the pixelization of jet images washes out the individual particle information. We expect that more suitable methods can be found for these architectures. On the other hand, neural networks directly built on Lund coordinates might be another playground for Lund-plane interpretability studies. A systematic study comparing different neural network architectures across different classification tasks would give clearer evidence on model interpretability. Depending on the specific architecture (image-based, four-vector-based, or theory-motivated-variables-based), different models might be learning different discriminating factors, so combining these learnt representations might help achieve better performance.
Although these studies were carried out on supervised classification problems, they may equally well (or even better) be applied to unsupervised learning. Besides the results presented here, there are more interesting points to investigate. One important topic is the robustness of deep learning models; saliency analysis may help us find better ways of approaching robust machine learning models. We hope this effort will develop further into more powerful practical tools for physics structure detection in the future.
This work is funded by an IVADO Fundamental Research Grant. The author would like to thank Gilles Louppe for collaboration in the very early stage, and is also grateful for helpful discussions with MILA colleagues. Finally, thanks go to Prof. Jean-Francois Arguin for his kind support while this work was done.
References
- Code: https://github.com/glouppe/recnn
- Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
- Deep-learned Top Tagging with a Lorentz Layer. SciPost Phys. 5 (3), pp. 028.
- The Machine Learning Landscape of Top Taggers. SciPost Phys. 7, pp. 014.
- The Anti-k(t) jet clustering algorithm. JHEP 04, pp. 063.
- FastJet User Manual. Eur. Phys. J. C72, pp. 1896.
- Search for a Higgs boson in the decay channel H to ZZ(*) to q qbar l-l+ in pp collisions at sqrt(s) = 7 TeV. JHEP 04, pp. 036.
- Recursive Neural Networks in Quark/Gluon Tagging.
- Jet-Images: Computer Vision Inspired Techniques for Jet Tagging. JHEP 02, pp. 118.
- Towards an understanding of jet substructure. JHEP 09, pp. 029.
- Automating the Construction of Jet Observables with Machine Learning.
- DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP 02, pp. 057.
- Jet-images — deep learning edition. JHEP 07, pp. 069.
- The Lund Jet Plane. JHEP 12, pp. 064.
- Long Short-Term Memory (LSTM) networks with jet constituents for boosted top tagging at the LHC.
- Quark and Gluon Jet Substructure. JHEP 04, pp. 090.
- Neural Message Passing for Jet Physics. Proceedings of the Deep Learning for Physical Sciences Workshop at NIPS (2017).
- Energy Flow Networks: Deep Sets for Particle Jets. JHEP 01, pp. 121.
- A Theory of Quark vs. Gluon Discrimination.
- Gaining (Mutual) Information about Quark/Gluon Discrimination. JHEP 11, pp. 129.
- QCD-Aware Recursive Neural Networks for Jet Physics. JHEP 01, pp. 057.
- JEDI-net: a jet identification algorithm based on interaction networks.
- Jet Constituents for Deep Neural Network Based Top Quark Tagging.
- ParticleNet: Jet Tagging via Particle Clouds.
- Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR abs/1312.6034.
- An Introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, pp. 159–177.
- Identifying Boosted Objects with N-subjettiness. JHEP 03, pp. 015.
Appendix A Datasets and Neural Net Architecture
Neural Net Settings
The neural networks used in this work are similar to those in [21, 8]. We embed jets along their clustering histories, which are used as a scaffold for the recursive embedding process. The input features () of the constituent particles within jets are fed directly into the neural networks. We use anti-kt clustered jets, then recluster them with the Cambridge-Aachen algorithm for simplicity in the Lund plane analysis. The network architecture is: RecNN Embedding () → Dense(70) → Dense(70) → Sigmoid.
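The recursive embedding along a clustering tree can be sketched as below. Everything here is a toy stand-in: the simple ReLU cell, the sizes, and the zero-padding conventions are assumptions (the actual RecNN of Louppe et al. (2019) uses a more elaborate, gated cell).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 40  # toy embedding dimension

# Toy parameters of a simplified recursive cell
#   h = ReLU(W_l h_left + W_r h_right + W_u u)
W_l = rng.normal(scale=0.1, size=(DIM, DIM))
W_r = rng.normal(scale=0.1, size=(DIM, DIM))
W_u = rng.normal(scale=0.1, size=(DIM, 4))  # 4 input features per constituent

def embed(node):
    """Recursively embed a clustering tree given as nested structures:
    a leaf is a list of 4 input features, an internal node is (left, right)."""
    if isinstance(node, tuple):
        hl, hr = embed(node[0]), embed(node[1])
        u = np.zeros(4)  # internal nodes carry no extra features in this sketch
    else:
        hl = hr = np.zeros(DIM)
        u = np.asarray(node, dtype=float)
    return np.maximum(0.0, W_l @ hl + W_r @ hr + W_u @ u)

# A jet with three constituents clustered as ((a, b), c):
jet = (([60.0, 0.1, 0.2, 61.0], [30.0, 0.3, 0.1, 31.0]), [10.0, 0.5, 0.4, 11.0])
h_jet = embed(jet)  # fixed-size jet embedding, fed to the dense classifier head
```

The key point is that jets with arbitrarily many constituents all map to one fixed-size vector, with the clustering history dictating the order of the recursive combinations.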
Appendix B Transferability of RecNN
In the study of embedding transferability, we use the embedding layer learnt from classification task A to obtain jet embeddings for task B, and then feed these transferred embeddings into a dense network for classification (i.e. we freeze the embedding layer, which is initialized with the parameters learnt in task A). In Table 2, we show cross-task transfer AUCs for all combinations, with Base AUC denoting the AUC of full training on task B, and Transfer AUC denoting the results obtained from the transferred embeddings.
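Schematically, the frozen task-A embedding acts as a fixed feature map for task B, and only the downstream classifier is fit. The sketch below uses toy stand-ins throughout, including a least-squares linear fit in place of the dense network head.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins: task-B jets passed through task-A's *frozen* embedding.
# The embedding map itself is never updated; only the downstream classifier
# (here a linear one, fit by least squares) is trained.
emb_B = rng.normal(size=(400, 40))
labels_B = emb_B[:, 0] + emb_B[:, 1] > 0.0  # toy task-B labels

X = np.column_stack([emb_B, np.ones(len(emb_B))])
w, *_ = np.linalg.lstsq(X, labels_B.astype(float), rcond=None)
pred = (X @ w) > 0.5
transfer_acc = float((pred == labels_B).mean())
```

If the frozen embedding already encodes the features task B needs (as for W/QCD to Top/QCD), such a shallow head is enough to recover most of the performance.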
Tasks | Base AUC | Transfer AUC
Appendix C Linear Probing in Other Models
To quickly explore the latent space of other popular architectures, we take a few models based on low-level features that have been used for jet tagging: FCN, LSTM and CNN.
LSTM: takes in the four-momenta of the first 30 jet constituents in p_T-ordered fashion. The LSTM layer outputs a 70-dimensional embedding, which is then fed into two ReLU dense layers with 50 and 20 nodes. The output layer has a sigmoid activation.
CNN: takes in 37×37 grayscale images on the (η, φ) plane with p_T deposits as pixel intensity. We take a similar architecture as in the jet-images literature: Conv2D*16 → Conv2D*16 → MaxPooling2D → Conv2D*8 → Conv2D*8 → MaxPooling2D → Flatten → Dense(128) → Dense(64) → Sigmoid.
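The image preprocessing for such a CNN amounts to a simple pixelization step, sketched below. This is a minimal sketch: the function name is hypothetical, and the centering and normalization steps used in practice are omitted.

```python
import numpy as np

def jet_image(etas, phis, pts, npix=37, extent=1.0):
    """Pixelate jet constituents into an npix x npix grayscale image on the
    (eta, phi) plane around the jet axis, with summed pT as pixel intensity.
    `extent` is the assumed half-width of the image window."""
    img = np.zeros((npix, npix))
    for eta, phi, pt in zip(etas, phis, pts):
        i = int((eta + extent) / (2 * extent) * npix)
        j = int((phi + extent) / (2 * extent) * npix)
        if 0 <= i < npix and 0 <= j < npix:
            img[i, j] += pt  # constituents outside the window are dropped
    return img

img = jet_image([0.0, 0.2, -0.3], [0.0, -0.1, 0.4], [100.0, 40.0, 20.0])
```

This step is exactly what washes out the individual-particle information discussed in the conclusions: once constituents share a pixel, their identities are merged into a single intensity.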