# Interpretable Deep Learning for Two-Prong Jet Classification with Jet Spectra

Classification of jets with deep learning has gained significant attention in recent years. However, the performance of deep neural networks is often achieved at the cost of interpretability. Here we propose an interpretable network trained on the jet spectrum S_2(R), which is a two-point correlation function of the jet constituents. The spectrum can be derived from a functional Taylor series of an arbitrary jet classifier function of energy flows. An interpretable network can be obtained by truncating the series. The intermediate feature of the network is an infrared and collinear safe C-correlator which allows us to estimate the importance of an S_2(R) deposit at an angular scale R in the classification. The performance of the architecture is comparable to that of a convolutional neural network (CNN) trained on jet images, although the number of inputs is much smaller and the architecture is significantly simpler than the CNN classifier. We consider two examples: one is the classification of two-prong jets which differ in the color charge of the mother particle, and the other is a comparison between Pythia 8 and Herwig 7 generated jets.



## 1 Introduction

Deep learning has recently gained significant interest in the field of collider data analysis. One of the primary motivations is to extract the maximum information from complex collision events. Deep learning in collider physics takes advantage of a large influx of data from experiments, more precise theoretical predictions, significant improvements in computing power, and ongoing progress in the field of machine learning itself. Such techniques offer advances in areas ranging from event selection to particle identification.

The large center-of-mass energy at the Large Hadron Collider (LHC) enables the production of boosted particles whose decay products are highly collimated. These collimated objects are reconstructed as a single jet, which is often misidentified as a QCD jet originating from light quarks or gluons. Many jet substructure techniques using the information of subjets Butterworth:2008iy ; Thaler:2008ju ; Kaplan:2008ie ; Plehn:2009rk ; Plehn:2010st ; Soper:2011cr ; Soper:2012pb ; Dasgupta:2013ihk ; Soper:2014rya ; Larkoski:2014wba and the distribution of jet constituents Gallicchio:2010sw ; Thaler:2010tr ; Gallicchio:2011xq ; Chien:2013kca ; Larkoski:2013eya ; Larkoski:2014gra ; Moult:2016cvt have been developed in order to improve the sensitivity of tagging and classifying these boosted particle jets. Deep learning methods Almeida:2015jua ; deOliveira:2015xxd ; Komiske:2016rsd ; Kasieczka:2017nvn ; Louppe:2017ipp ; Komiske:2017ubm ; Butter:2017cot ; Cheng:2017rdo ; Andreassen:2018apy ; Choi:2018dag ; Lim:2018toa ; Dreyer:2018nbf ; Lin:2018cin ; Komiske:2018cqr ; Martinez:2018fwc ; Kasieczka:2018lwf ; Qu:2019gqs have provided useful insight into the internal structure of jets and have thereby shown better performance than those jet substructure techniques.111For a review of recent theoretical and machine learning developments in jet substructure techniques at the LHC, we refer to Larkoski:2017jix ; Asquith:2018igt .

The flexibility of deep learning also enables us to solve problems beyond supervised classification, such as weakly supervised learning Dery:2017fap ; Cohen:2017exh ; Metodiev:2017vrx , adversarial learning to suppress learning from unwanted information Louppe:2016ylz ; Shimmin:2017mfk , and unsupervised learning for finding anomalous signatures Chakraborty:2017mbz ; Hajer:2018kqm ; Heimel:2018mkt ; Farina:2018fyg ; Cerri:2018anq ; Roy:2019jae . Neural networks can also be useful in new physics searches at the LHC Roxlo:2018adx ; Brehmer:2018kdj ; Brehmer:2018eca ; Guo:2018hbv ; Collins:2018epr ; DAgnolo:2018cun ; DeSimone:2018efk ; Englert:2019xhk ; Collins:2019jip .

The output of a neural network is, in general, a highly non-linear function of the inputs, and a neural network classifier often acts like a “black box”. One may consider architectures with post-hoc interpretability DBLP:journals/corr/Lipton16a which allow us to extract information, other than its prediction, from the learned model after training. A simple strategy is to use a predefined functional form to restrict the representation power of the neural network Komiske:2018cqr ; Datta:2019ndh ; the network is then interpreted in terms of that functional form. The aim of this paper is also to construct an interpretable neural network architecture that allows us not only to interpret the predictions of the network but also to visualize them in terms of trained weights connected to physical variables.

In Lim:2018toa , a multilayer perceptron (MLP) trained on two-point correlation functions S_2(R) and S_2^trim(R) of the angular scale R was introduced. The S_2(R) and S_2^trim(R) spectra are constructed from the constituents of a jet before and after the trimming Krohn:2009th , respectively. The angular scale is an important parameter for describing the kinematics of a decaying particle and the parton shower (PS); hence, these spectra efficiently encode the radiation pattern inside a jet. The MLP trained on these inputs learns relevant features for the classification between the Higgs boson jet (Higgs jet) and the QCD jet.

In this paper, we connect the spectra to energy flow functionals Tkachov:1995kk , i.e., we consider the transverse energy of a jet constituent as particle-specific information at its position \vec{R}_i in the pseudorapidity-azimuth plane 7974879 . The spectra are basis vectors of infrared and collinear (IRC) safe variables called bilinear C-correlators Tkachov:1995kk whose angular weighting function depends only on the relative distance between two constituents. Those correlators naturally appear in the functional Taylor series of a classifier of energy flows, and the MLP can be considered as a subseries of the Taylor series. We show that the performances of the MLP and of neural networks trained on jet images Cogan:2014oua ; Almeida:2015jua ; deOliveira:2015xxd ; Kasieczka:2017nvn are comparable. This strongly suggests that S_2(R) and S_2^trim(R) contain sufficient information for jet classification. Encouraged by this feature, we construct an interpretable architecture by truncating the series. Namely, the linear term in S_2(R) can be implemented in a classifier after proper discretization in R, with weight functions w(R; \vec{x}_kin), where \vec{x}_kin is a set of kinematic variables of the jet and w is a smooth function. By reading the weights w(R; \vec{x}_kin), we can quantify the important features for the given classification problem.

Jet substructure studies often suffer from systematic uncertainties of soft activities. The soft radiation generated by a Monte Carlo program is strongly model dependent. While this mismodeling could be corrected by using real data, it is certainly useful to use input variables with smaller systematic uncertainties. When hard substructures are important for solving the problem, we may use jet grooming techniques Butterworth:2008iy ; Krohn:2009th ; Ellis:2009su ; Ellis:2009me ; Larkoski:2014wba to remove the soft activity. Instead of throwing this soft activity away, we encode it in S_2^soft(R), which is S_2(R) − S_2^trim(R). Then, the inputs S_2^trim and S_2^soft include hard and soft substructure information, respectively. The interpretable architecture trained on S_2^trim and S_2^soft is able to quantify these features. We study two classification problems: one is the classification of two-prong jets to understand their hard substructures and color coherence, and the other is a comparison of Pythia 8 Sjostrand:2014zea and Herwig 7 Bellm:2015jjp ; Bahr:2008pv events to quantify their differences.

The paper is organized as follows. In section 2, we review S_2 and S_2^trim and show their relation to energy flows and C-correlators. We also show the S_2 and S_2^trim distributions of a typical Higgs jet and QCD jet. A hypothetical color-octet scalar particle, the sgluon, decaying to a bottom quark pair is considered to study the color connection in two-prong jets. In section 3, we first discuss the capability of S_2^trim and S_2^soft for the classification of two-prong jets and show the result of an MLP trained on those inputs. The results are then compared with those of a CNN trained on jet images. In section 4, we introduce a two-level architecture consisting of a softmax classifier and an MLP of jet kinematic variables. The intermediate feature of this architecture is the bilinear C-correlator whose basis vectors are S_2^trim and S_2^soft and whose components are generated by the MLP. We visualize and interpret the weights for the given classification problem. Finally, the summary and outlook are given in section 5.

## 2 Two-Point Correlation Spectrum and Two-Prong Jets

### 2.1 Jet Spectra

In Lim:2018toa , we introduced a two-point correlation spectral function S_2(R) which maps a jet to a function of the angular scale R,

 S_2(R) = \int d\vec{R}_1 \, d\vec{R}_2 \, P_T(\vec{R}_1) \, P_T(\vec{R}_2) \, \delta(R - R_{12}), \qquad (1)
 P_T(\vec{R}) = \sum_{i \in \mathrm{jet}} p_{T,i} \, \delta(\vec{R} - \vec{R}_i), \qquad (2)

where jet denotes the set of jet constituents, \vec{R}_i is the position of the i-th jet constituent in the pseudorapidity-azimuth plane, R_{12} is the angular distance between the two jet constituents 1 and 2, and P_T(\vec{R}) is an energy flow functional Tkachov:1995kk of the jet. For practical purposes, S_2(R) is discretized as below,

 S_2(R; \Delta R) = \frac{1}{\Delta R} \int_R^{R+\Delta R} dR' \, S_2(R') = \frac{1}{\Delta R} \sum_{i,j \in \mathrm{jet}} p_{T,i} \, p_{T,j} \, I_{[R, R+\Delta R)}(R_{ij}), \qquad (3)

where I_A(x) is the indicator function of the domain A,

 I_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}

The spectral function S_2(R; ΔR) is, therefore, the sum of the products of the p_T's of those pairs of jet constituents whose angular distance lies between R and R + ΔR.
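The binned spectrum of eq. (3) is straightforward to compute from the constituent list. Below is a minimal numpy sketch; the function name and the (pt, eta, phi) interface are ours, not the paper's.

```python
import numpy as np

def s2_binned(pt, eta, phi, dR=0.1, Rmax=2.0):
    """Binned two-point spectrum S_2(R; dR) of eq. (3): every pair of
    constituents (i, j), including i == j, adds pT_i * pT_j to the bin
    containing its angular distance R_ij; the histogram is divided by dR."""
    pt, eta, phi = (np.asarray(v, dtype=float) for v in (pt, eta, phi))
    deta = eta[:, None] - eta[None, :]
    dphi = phi[:, None] - phi[None, :]
    dphi = (dphi + np.pi) % (2.0 * np.pi) - np.pi  # wrap azimuthal difference
    Rij = np.sqrt(deta**2 + dphi**2)
    weights = pt[:, None] * pt[None, :]
    edges = np.arange(0.0, Rmax + dR, dR)
    hist, _ = np.histogram(Rij.ravel(), bins=edges, weights=weights.ravel())
    return hist / dR
```

Note that summing the bins times ΔR recovers the squared scalar sum of the constituent p_T's, which is a quick consistency check.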

We obtain IRC safe quantities by multiplying smooth functions222 Continuous functions are sufficient for the convergence and IRC safety Tkachov:1995kk , but we further restrict the w’s to smooth functions for perturbative calculations. w(\vec{R}) and P_T(\vec{R}) (or w(R) and S_2(R)), and integrating over \vec{R} (or R). To understand the IRC safety of P_T, let us consider a splitting of a given constituent i_0 in the jet into two constituents, i_0 → i_1 i_2. The inner product of w(\vec{R}) and the difference of the energy flows before and after the splitting, δP_T(\vec{R}), is given as follows,

 \int d\vec{R} \, \delta P_T(\vec{R}) \, w(\vec{R}) = p_{T,i_1} w(\vec{R}_{i_1}) + p_{T,i_2} w(\vec{R}_{i_2}) - p_{T,i_0} w(\vec{R}_{i_0}) \qquad (4)
 = \left[ \delta p_{T,i_0} - p_{T,i_1} (\delta\vec{R}_{i_1} \cdot \nabla_{\vec{R}}) - p_{T,i_2} (\delta\vec{R}_{i_2} \cdot \nabla_{\vec{R}}) + \cdots \right] w(\vec{R}_{i_0}), \qquad (5)

where \delta p_{T,i_0} = p_{T,i_1} + p_{T,i_2} - p_{T,i_0} and \delta\vec{R}_{i_a} = \vec{R}_{i_0} - \vec{R}_{i_a}. The soft limit, where one of the daughters carries a small momentum, corresponds to p_{T,i_a} → 0 together with \delta p_{T,i_0} → 0, while \delta\vec{R}_{i_1}, \delta\vec{R}_{i_2} → 0 in the collinear limit. The integral vanishes in these limits, namely the energy flow after the parton splitting converges weakly Tkachov:1995kk to the one before the splitting.

The spectrum inherits the same property. The inner product of a smooth function w(R) and the difference of the spectrum before and after the splitting, δS_2(R), is given as follows,

 \int dR \, \delta S_2(R) \, w(R) = 2 \sum_{j \in \mathrm{jet}} \left[ \delta p_{T,i_0} + p_{T,i_1} (\delta\vec{R}_{i_1} \cdot \nabla_{\vec{R}}) + p_{T,i_2} (\delta\vec{R}_{i_2} \cdot \nabla_{\vec{R}}) + \cdots \right] p_{T,j} \, w(R_{i_0 j}). \qquad (6)

Again, this integral vanishes in the IRC limits. Note that the binned spectrum S_2(R; ΔR) is not completely IRC safe because of the discontinuity of the indicator function at the bin boundaries. Nevertheless, when the domain is discretized into small sections of width ΔR, the IRC unsafe terms cancel in the sum over the bins, and the binned spectrum is approximately IRC safe up to binning errors.
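This approximate IRC safety can be illustrated numerically by splitting a constituent and comparing the binned spectra before and after. The toy jet configuration, splitting parameters, and tolerances below are ours, chosen away from bin boundaries to avoid the binning discontinuity discussed above.

```python
import numpy as np

def s2_binned(pt, eta, phi, dR=0.1, Rmax=2.0):
    # binned spectrum of eq. (3), repeated here to keep the demo self-contained
    pt, eta, phi = (np.asarray(v, dtype=float) for v in (pt, eta, phi))
    Rij = np.hypot(eta[:, None] - eta[None, :], phi[:, None] - phi[None, :])
    w = (pt[:, None] * pt[None, :]).ravel()
    hist, _ = np.histogram(Rij.ravel(), bins=np.arange(0.0, Rmax + dR, dR), weights=w)
    return hist / dR

# reference jet: two constituents
ref = s2_binned([1.0, 0.8], [0.0, 0.53], [0.0, 0.0])

def after_split(z, dr):
    # split the first constituent i0 -> i1 i2 with momentum fraction z, opening dr
    return s2_binned([1.0 * (1 - z), 1.0 * z, 0.8], [0.0, dr, 0.53], [0.0, 0.0, 0.0])

soft = np.abs(after_split(1e-6, 0.05) - ref).max()  # nearly soft daughter
coll = np.abs(after_split(0.4, 1e-6) - ref).max()   # nearly collinear splitting
```

Both `soft` and `coll` come out tiny, as the weak-convergence argument predicts.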

The resulting IRC safe observables belong to the C-correlators Tkachov:1995kk , which are multilinear forms of the energy flow. An n-linear C-correlator is expressed as follows,

 \int d\vec{R}_1 \cdots d\vec{R}_n \, P_T(\vec{R}_1) \cdots P_T(\vec{R}_n) \, w(\vec{R}_1, \cdots, \vec{R}_n), \qquad (7)

where w(\vec{R}_1, \cdots, \vec{R}_n) is a continuous function of the positions. For example, an inner product of w(\vec{R}) and P_T(\vec{R}) is a linear C-correlator, and an inner product of w(R) and S_2(R) is a bilinear C-correlator with w depending only on the relative distance R_{12},

 \int dR \, S_2(R) \, w(R) = \int d\vec{R}_1 \, d\vec{R}_2 \, P_T(\vec{R}_1) \, P_T(\vec{R}_2) \, w(R_{12}). \qquad (8)

Many well-known jet observables belong to the C-correlators; for example, the jet transverse momentum is a linear C-correlator with w = 1, and the jet mass squared is, in the small-angle approximation, a bilinear C-correlator with w(R_{12}) ∝ R_{12}^2.

The spectra use all the jet constituents, but it is useful to separate the correlations of constituents of the hard subjets; we consider jet trimming for this purpose. We recluster the constituents of a jet with a radius parameter R_jet into subjets with a smaller radius parameter R_sub. A subjet is discarded if p_{T,jet_a} < f_trim p_{T,jet}, where p_{T,jet} and p_{T,jet_a} are the transverse momenta of the jet and the a-th subjet, respectively. The trimmed jet is defined as a union of the remaining subjets,

 \mathrm{jet}^{\mathrm{trim}} = \bigcup_{a \,:\, p_{T,\mathrm{jet}_a} / p_{T,\mathrm{jet}} \geq f_{\mathrm{trim}}} \mathrm{jet}_a. \qquad (9)

The jet trimming is beneficial because it does not introduce additional angular scale parameters other than R_sub. The trimmed spectrum is then calculated using the constituents of the trimmed jet. We denote it as S_2^trim(R), together with its binned version S_2^trim(R; ΔR), defined as follows:

 S_2^{\mathrm{trim}}(R) = \int d\vec{R}_1 \, d\vec{R}_2 \, P_T^{\mathrm{trim}}(\vec{R}_1) \, P_T^{\mathrm{trim}}(\vec{R}_2) \, \delta(R - R_{12}), \qquad (10)
 P_T^{\mathrm{trim}}(\vec{R}) = \sum_{i \in \mathrm{jet}^{\mathrm{trim}}} p_{T,i} \, \delta(\vec{R} - \vec{R}_i), \qquad (11)
 S_2^{\mathrm{trim}}(R; \Delta R) = \frac{1}{\Delta R} \sum_{i,j \in \mathrm{jet}^{\mathrm{trim}}} p_{T,i} \, p_{T,j} \, I_{[R, R+\Delta R)}(R_{ij}), \qquad (12)

where P_T^trim(\vec{R}) is the energy flow of the trimmed jet.
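The subjet selection of eq. (9) can be sketched as follows. This is a schematic illustration only: in practice the subjets come from reclustering with a small radius R_sub (e.g., with the kT algorithm), whereas here they are assumed to be given, and each subjet is represented simply by its list of constituent p_T's.

```python
def trimmed_constituents(subjets, pt_jet, f_trim):
    """Jet trimming of eq. (9): keep the subjets carrying at least a fraction
    f_trim of the jet pT and return the constituent pTs of the trimmed jet.
    The subjets are assumed to be precomputed (normally by reclustering
    with radius R_sub); f_trim is the trimming threshold."""
    kept = [s for s in subjets if sum(s) >= f_trim * pt_jet]
    return [pt for s in kept for pt in s]
```

Feeding the returned constituents into the binned-spectrum routine gives S_2^trim(R; ΔR), and subtracting it from the full spectrum gives S_2^soft(R; ΔR).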

In the limit where the constituents of each subjet are localized, the energy flow and the jet spectrum can be approximated in terms of the subjet momenta. The energy flow of such a jet decomposes into a sum of the energy flows of all the subjets,

 P_T(\vec{R}) = \sum_a P_{T,a}(\vec{R}), \qquad P_{T,a}(\vec{R}) = \sum_{i \in \mathrm{jet}_a} p_{T,i} \, \delta(\vec{R} - \vec{R}_i). \qquad (13)

The energy flow of each subjet converges weakly to p_{T,\mathrm{jet}_a} \delta(\vec{R} - \vec{R}_{\mathrm{jet}_a}). The spectrum can then be approximated by the momenta of the subjets, i.e.,

 S_2(R; \Delta R) \approx \frac{1}{\Delta R} \sum_{a,b \,:\, \mathrm{jet}_a, \mathrm{jet}_b \subset \mathrm{jet}} p_{T,\mathrm{jet}_a} \, p_{T,\mathrm{jet}_b} \, I_{[R, R+\Delta R)}(R_{ab}). \qquad (14)

The jet trimming also introduces a scale hierarchy among the subjets, and so their pairwise contributions to S_2 can be classified by the scale. We define the soft spectrum S_2^soft as

 S_2^{\mathrm{soft}}(R; \Delta R) = S_2(R; \Delta R) - S_2^{\mathrm{trim}}(R; \Delta R). \qquad (15)

On the r.h.s. of the above equation, the correlations among the constituents of the hard subjets cancel, and we have

 S_2^{\mathrm{trim}}(R; \Delta R) = p_{T,\mathrm{jet}}^2 \cdot O[1], \qquad (16)
 S_2^{\mathrm{soft}}(R; \Delta R) = p_{T,\mathrm{jet}}^2 \cdot \left( O[f_{\mathrm{trim}}] + O[f_{\mathrm{trim}}^2] \right). \qquad (17)

The dominant contributions to S_2^soft (i.e., the O[f_trim] terms) come from the correlations between a constituent in the trimmed jet and a constituent outside it. The subleading O[f_trim^2] terms denote the correlations among the constituents outside the trimmed jet.

### 2.2 Derivation of Classifiers based on Energy Flows and Jet Spectra

We now discuss the relation between S_2 and neural network classifiers trained on the energy flow P_T. A general softmax classifier that solves a K-class jet classification problem can be expressed as functionals \hat{\Psi}_i which map the energy flow to real numbers h_i, i.e.,

 h_i = \hat{\Psi}_i[P_T], \qquad (18)
 \hat{y} = \varphi_{\mathrm{softmax}}(\vec{z}), \qquad z_k = w^{(\mathrm{out})}_{ki} h_i + b^{(\mathrm{out})}_k, \qquad k \in \{1, \cdots, K\}, \qquad (19)

where w^{(out)}_{ki} and b^{(out)}_k are the weights and biases of the output layer, and \hat{y} is the prediction of the classifier. Here \varphi_{\mathrm{softmax}} is the softmax function whose k-th component is expressed as follows,

 \varphi_{\mathrm{softmax},k}(\vec{z}) = \frac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}. \qquad (20)
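A standard numerically stable implementation of eq. (20) subtracts the maximum component before exponentiating, which leaves the output unchanged (the softmax is shift-invariant) while avoiding overflow:

```python
import numpy as np

def softmax(z):
    """Softmax of eq. (20); subtracting max(z) leaves the result unchanged
    but keeps the exponentials from overflowing."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()
```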

Many jet classifiers can be expressed in the form of eq. (18). For example, in a cut-based analysis, the h_i are jet substructure variables, such as ratios of N-subjettiness Thaler:2010tr , ratios of energy correlation functions Larkoski:2014gra ; Moult:2016cvt , etc. The deep neural network classifiers, such as the artificial neural network tagger Almeida:2015jua , convolutional neural networks using pixelated jet images deOliveira:2015xxd , the energy flow network Komiske:2018cqr , etc., are also described by eq. (18). The neural networks that are introduced in section 3 and section 4 also belong to this category.

The jet spectra S_2 and S_2^trim can be derived from eq. (18) using a functional Taylor expansion. The energy flow is decomposed by trimming as follows,

 P_{T,a}(\vec{R}) = \begin{cases} P_T^{\mathrm{trim}}(\vec{R}) & a = 1, \\ P_T(\vec{R}) - P_T^{\mathrm{trim}}(\vec{R}) & a = 2. \end{cases} \qquad (21)

One can express h_i as a functional series around the reference point P_{T,a} = 0,

 h_i = w^{(0)}_i + \int d\vec{R} \, P_{T,a}(\vec{R}) \, w^{(1)}_{i,a}(\vec{R}) + \frac{1}{2!} \int d\vec{R}_1 \, d\vec{R}_2 \, P_{T,a}(\vec{R}_1) \, P_{T,b}(\vec{R}_2) \, w^{(2)}_{i,ab}(\vec{R}_1, \vec{R}_2) + \cdots, \qquad (22)

where w^{(n)}_{i,a_1 \cdots a_n} is the coefficient of the n-th correlation function. The first order coefficient w^{(1)}_{i,a} can be chosen as a constant if we are not interested in features depending on reference vectors, for example, jet axes, beam directions, etc. The linear terms in eq. (22) are then related to the jet momentum and the trimmed jet momentum as follows,

 \int d\vec{R} \, P_{T,1}(\vec{R}) \simeq p_{T,\mathrm{jet}}^{\mathrm{trim}}, \qquad \int d\vec{R} \, P_{T,2}(\vec{R}) \simeq p_{T,\mathrm{jet}} - p_{T,\mathrm{jet}}^{\mathrm{trim}}. \qquad (23)

The second order coefficient w^{(2)}_{i,ab}, the first non-trivial term of the series expansion, is taken to be a function of the relative distance of \vec{R}_1 and \vec{R}_2. The basis vectors of this term are the two-point correlation functions S_{2,ab}(R),

 h_i = w^{(0)}_i + \int d\vec{R} \, P_{T,a}(\vec{R}) \, w^{(1)}_{i,a} + \frac{1}{2!} \int dR \, S_{2,ab}(R) \, w^{(2)}_{i,ab}(R) + \cdots, \qquad (24)
 S_{2,ab}(R) = \int d\vec{R}_1 \, d\vec{R}_2 \, P_{T,a}(\vec{R}_1) \, P_{T,b}(\vec{R}_2) \, \delta(R - R_{12}). \qquad (25)

The spectra S_2 and S_2^trim are expressed in terms of S_{2,ab} as follows,

 S_2(R) = \sum_{a,b} S_{2,ab}(R), \qquad S_2^{\mathrm{trim}}(R) = S_{2,11}(R). \qquad (26)

Instead of the energy flows, we consider a classifier of S_{2,A} (A = 1, 2),

 h_i = \Psi_i[S_{2,A}; \vec{x}_{\mathrm{kin}}], \qquad (27)

where \vec{x}_kin is a set of additional inputs to the classifier based on the kinematics of the jet, and the index A is equal to 1 for “trim” and 2 for “soft”. Similar to eq. (24), we expand eq. (27) around S_{2,A} = 0 as

 h_i = w^{(0)}_i(\vec{x}_{\mathrm{kin}}) + \int dR \, S_{2,A}(R) \, w^{(2)}_{i,A}(R; \vec{x}_{\mathrm{kin}}) + \frac{1}{2} \int dR_1 \, dR_2 \, S_{2,A}(R_1) \, S_{2,B}(R_2) \, w^{(4)}_{i,AB}(R_1, R_2; \vec{x}_{\mathrm{kin}}) + \cdots, \qquad (28)

where the w^{(2n)}_{i,A_1 \cdots A_n} are the corresponding weight functions. One may further truncate the series to get a linear form,

 h_i = \frac{1}{2} \int dR \, S_{2,A}(R) \, w^{(2)}_{i,A}(R; \vec{x}_{\mathrm{kin}}). \qquad (29)

The above-mentioned linear setup has an advantage in the interpretability and visualization of the network predictions; we discuss this network further in section 4.

### 2.3 Spectra of Two-Prong Jets

The spectrum S_2 is useful to identify the substructures of a jet and also to characterize the jet. Typically, the S_2 of a QCD jet has a peak at R = 0 with a long tail towards large R. The peak originates from the autocorrelation (i = j) terms in eq. (3). On the other hand, if a jet originates from a Higgs boson decaying into a bottom quark pair, the b-partons create two isolated cores inside the jet, and the spectrum has an additional peak at the angular scale equal to the angle between the two clusters. In addition, S_2 encodes the fragmentation pattern of the b-partons.

At the LHC, boosted heavy objects such as the top quark, gauge bosons, and the Higgs boson decaying into quarks can be studied by identifying jet substructures. Usually, these substructures are characterized by variables such as D_2^β, defined as,

 D_2^{\beta} = e_3^{\beta} / (e_2^{\beta})^3, \qquad (30)
 e_2^{\beta} = \frac{1}{p_{T,\mathrm{jet}}^2} \sum_{i<j \in \mathrm{jet}} p_{T,i} \, p_{T,j} \, R_{ij}^{\beta}, \qquad (31)
 e_3^{\beta} = \frac{1}{p_{T,\mathrm{jet}}^3} \sum_{i<j<k \in \mathrm{jet}} p_{T,i} \, p_{T,j} \, p_{T,k} \, R_{ij}^{\beta} \, R_{ik}^{\beta} \, R_{jk}^{\beta}, \qquad (32)

where β is the angular exponent. If a jet has a two-prong substructure, then D_2^β is much less than one. The jet spectrum contains more information than these moments, and therefore, an analysis with S_2 goes beyond the one using D_2^β. It was shown that a neural network trained on S_2 distinguishes the Higgs jet from the QCD jet better than one trained on D_2^β Lim:2018toa .
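For reference, the energy correlation ratios above can be evaluated directly from the constituent list by brute force. A sketch follows (O(n³), fine for calorimeter-level jets; the function name and the use of the scalar p_T sum as the jet p_T are our simplifications):

```python
import numpy as np
from itertools import combinations

def d2(pt, eta, phi, beta=1.0):
    """Energy correlation functions e2, e3 and D2 = e3 / e2^3 of eq. (30),
    computed by summing over constituent pairs and triplets."""
    pts = np.asarray(pt, dtype=float)
    pos = np.stack([np.asarray(eta, float), np.asarray(phi, float)], axis=1)
    ptj = pts.sum()  # scalar-summed pT as a proxy for the jet pT
    dist = lambda i, j: float(np.hypot(*(pos[i] - pos[j])))
    n = len(pts)
    e2 = sum(pts[i] * pts[j] * dist(i, j) ** beta
             for i, j in combinations(range(n), 2)) / ptj ** 2
    e3 = sum(pts[i] * pts[j] * pts[k]
             * (dist(i, j) * dist(i, k) * dist(j, k)) ** beta
             for i, j, k in combinations(range(n), 3)) / ptj ** 3
    return e2, e3, (e3 / e2 ** 3 if e2 > 0 else float('inf'))
```

A clean two-prong configuration, with at most a soft third constituent, indeed yields D_2^β well below one.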

To study the fragmentation pattern of the b-partons and their color connection to the mother particle, we introduce a color-octet scalar, the sgluon (σ). We assume that the Higgs boson (h) and the sgluon decay into a bottom quark pair through the interactions,

 \mathcal{L}_{\mathrm{SM}} \ni y_h \, h \, \bar{b} b + \mathrm{h.c.}, \qquad (33)
 \mathcal{L}_{\mathrm{sgluon}} \ni y_\sigma \, \sigma^a \, \bar{b} T^a b + \mathrm{h.c.} \qquad (34)

The Higgs boson is a color singlet particle, and its decay is isolated in color flows. Therefore, S_2 beyond the angle between the b-partons, i.e., at R > R_{b\bar{b}}, is suppressed due to the color coherence. No such constraint on the angular scale exists for sgluon and QCD jets. Meanwhile, the Higgs jet and the sgluon jet have the same two-prong substructure, unlike the QCD jet, as both originate from a particle decaying into a b\bar{b} final state.

To study the spectra of two-prong jets, we simulate events as follows. We use Madgraph5 2.6.1 Alwall:2014hca to generate the parton level events of the Higgs, sgluon, and QCD jet production processes, with the associated Z boson decaying to a pair of neutrinos. These events are then passed to Pythia 8.226 Sjostrand:2014zea for the parton shower and hadronization. To study the impact of the parton shower and hadronization schemes, the parton level events are also passed to Herwig 7.1.3 Bellm:2015jjp ; Bahr:2008pv . A color-octet scalar UFO model Degrande:2014sta ; NLOModels generated by FeynRules 2.0 Alloul:2013bka ; Degrande:2011ua is used to simulate the sgluon process. The masses of the Higgs boson and the sgluon are set to 125 GeV, and their widths are of order MeV. The detector response is simulated by Delphes 3.4.1 deFavereau:2013fsa with the default ATLAS detector configuration. We use FastJet 3.3.0 Cacciari:2011ma ; Cacciari:2005hq to reconstruct jets from the calorimeter towers using the anti-kt algorithm Cacciari:2008gp . We select the events with cuts on the leading jet transverse momentum and mass. For the Higgs jet and the sgluon jet, we additionally require that the two b-partons originating from their decay are located within the jet radius from the jet axis. More details of our simulations, including the jet radius, the trimming parameters, and the selection cuts, are described in appendix A.

In figure 1, we show the pixelated jet image (left panel) and the S_2 and S_2^trim spectra (right panel) of a QCD jet. There are high energy deposits in the jet image near the jet center, along with a wide spray of soft activity. The jet also has a moderate amount of radiation at intermediate angular scales. As a result, the S_2 spectrum has a long tail starting from R = 0. The jet trimming eliminates a significant amount of soft particles and, therefore, the tail does not appear in S_2^trim. The remaining cross-correlations contributing to S_2^trim are those between the high and moderate energy deposits. Most of the energy deposits are concentrated at the center, and the peak intensity at the corresponding angular scale is much lower than the intensity from the autocorrelations at R = 0.

In figure 2, we show the S_2 and S_2^trim distributions of a Higgs jet. For this particular event, the S_2 distribution is similar to the S_2^trim distribution, and their difference is hard to see. No significant activity is observed beyond the peak at R_{b\bar{b}}, mostly because the Higgs jet is very compact compared to the QCD jet. Correspondingly, there are two prominent subjets in the jet image, while most of the cells have no energy deposits.

Finally, we show the S_2 and S_2^trim distributions of a sgluon jet in figure 3. The S_2 distribution has a large peak at R_{b\bar{b}} which is as significant as the one at R = 0. This spectrum is qualitatively similar to that of the Higgs jet in figure 2. However, the spectrum has a long tail beyond R_{b\bar{b}} as compared with that of a Higgs jet. The tail disappears after jet trimming, as for the QCD jet in figure 1, which makes the S_2^trim distribution more compact. From figures 1-3, we observe that S_2^trim and S_2^soft include useful, complementary information.

In Lim:2018toa , it was shown that a neural network classifier trained on the S_2 and S_2^trim spectra performs better than one trained without S_2^trim. The reason is that the hard and soft correlation terms in S_2, i.e., the terms in eq. (16) and eq. (17) respectively, can be resolved by the jet trimming. Therefore, we use the orthogonal combinations, namely S_2^trim and S_2^soft, throughout this paper.

The S_2^trim and S_2^soft spectra encode the important features of the parton shower and fragmentation and, thus, may be regarded as well-motivated prototype inputs. The hard partons evolve by 1 → 2 parton splittings, which are parameterized by the splitting angle and the momentum fraction. Each splitting generates a two-point correlation at the corresponding angular scale. Therefore, the S_2 spectra encode the parton splittings at all angular scales.

## 3 Classifying Higgs jet, sgluon jet, and QCD jet

In this section, we introduce a neural network trained on the jet spectra for classifying Higgs jet, sgluon jet, and QCD jet. We first discuss the basic kinematic features of these jets and then outline their dependence on the parton shower simulators. Afterward, we show the details of the neural network and then present our results in terms of the receiver operating characteristic (ROC) curves.

### 3.1 Basic Kinematics

In figure 4, we show the p_T and S_2 distributions for the Higgs boson, sgluon, and QCD jets. The solid and dashed lines correspond to Pythia 8 (PY8) and Herwig 7 (HW7) generated jets, respectively. The mild differences in the p_T distributions are due to the difference of their matrix elements: the three jet samples are produced via different channels, so their p_T scalings are different. Not much difference is observed between the p_T distributions of the PY8 and HW7 samples. This is because the jet p_T is mostly determined by the matrix element of the leading parton, and the jet algorithm with a large radius parameter clusters most of the radiation from this parton into a single jet. However, the difference between the S_2 distributions is larger. The peak at R_{b\bar{b}} of the sgluon jet is significantly broader than that of the Higgs jet, because the radiation of the b-partons from the Higgs boson decay is mostly confined due to the color coherence while that of the sgluon is not. As a consequence, PY8 and HW7 generate different S_2 distributions.

We assume that both the Higgs boson and the sgluon have narrow widths, although the sgluon width can be large. An increase in the width will broaden the S_2 distribution, which has a peak at the characteristic angular scale,

 \hat{R}_{b\bar{b}} = \frac{2 m_h}{p_{T,\mathrm{jet}}} = \frac{2 m_\sigma}{p_{T,\mathrm{jet}}} \simeq \frac{2 m_{\mathrm{jet}}}{p_{T,\mathrm{jet}}}. \qquad (35)

For example, the variation of \hat{R}_{b\bar{b}} over the selected jet p_T range is only about 0.05-0.07. This is close to the calorimeter angular resolution of 0.1 and does not affect the calorimeter level analysis.

We first make a quantitative estimate of the radiation pattern inside the jet. To do so, we define two quantities comparing the spectra around \hat{R}_{b\bar{b}},

 R_{\mathrm{sym}} = \frac{\int_{a \hat{R}_{b\bar{b}}}^{\min[a' \hat{R}_{b\bar{b}}, R_{\mathrm{jet}}]} dR \, S_2(R)}{\int_0^{a \hat{R}_{b\bar{b}}} dR \, S_2(R) + \int_{\min[a' \hat{R}_{b\bar{b}}, R_{\mathrm{jet}}]}^{\infty} dR \, S_2(R)}, \qquad (36)
 R_{\mathrm{rad}} = \frac{C \int_{\min[a' \hat{R}_{b\bar{b}}, R_{\mathrm{jet}}]}^{\infty} dR \, S_2(R)}{\int_0^{\min[a' \hat{R}_{b\bar{b}}, R_{\mathrm{jet}}]} dR \, S_2(R) + C \int_{\min[a' \hat{R}_{b\bar{b}}, R_{\mathrm{jet}}]}^{\infty} dR \, S_2(R)}, \qquad (37)

where a, a', and C are constants. The ratio R_sym compares the energy deposits around \hat{R}_{b\bar{b}} and in its surrounding angular scales Lim:2018toa , and is sensitive to the correlation between the two hard substructures of the Higgs jet. On the other hand, R_rad is sensitive to the color charge of the mother particle, as it compares the energy deposits at large angular scales.
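These ratios are simple functionals of the binned spectrum. A sketch follows; the window constants a, a' and the normalization C are illustrative placeholders, since their values are not fixed here.

```python
import numpy as np

def spectrum_ratios(s2_bins, dR, R_bb, R_jet, a=0.75, a_prime=1.25, C=1.0):
    """R_sym and R_rad of eqs. (36)-(37) from a binned spectrum.
    a, a_prime and C are placeholder values, not the paper's choices."""
    s2_bins = np.asarray(s2_bins, dtype=float)
    centers = (np.arange(len(s2_bins)) + 0.5) * dR
    hi = min(a_prime * R_bb, R_jet)
    window = (centers >= a * R_bb) & (centers < hi)   # band around R_bb
    below, above = centers < a * R_bb, centers >= hi
    integ = lambda mask: s2_bins[mask].sum() * dR     # bin sum as integral
    r_sym = integ(window) / (integ(below) + integ(above))
    r_rad = C * integ(above) / (integ(~above) + C * integ(above))
    return r_sym, r_rad
```

A spectrum concentrated in the window around \hat{R}_{b\bar{b}} with no large-angle tail gives a large R_sym and a vanishing R_rad, as expected for a Higgs-like jet.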

We show the R_sym distributions in the left panel of figure 5. The distributions of the Higgs jet and the sgluon jet are similar because both of the spectra have a peak at \hat{R}_{b\bar{b}}. Meanwhile, the two-point correlations of the QCD jet are not localized around that scale, so its R_sym is smaller than that of a Higgs jet or a sgluon jet. In the right panel of figure 5, we show the R_rad distributions. The R_rad of the sgluon jet and the QCD jet are large on average, while R_rad is smaller for the Higgs jet because large angle radiation is suppressed.

The difference in the R_sym and R_rad distributions between the PY8 and HW7 samples is small; however, there is an appreciable difference in a restricted phase space. In figure 6, we plot the distributions after a selection requiring that the jets contain two hard subjets with similar transverse momenta. The PY8 (solid line) and HW7 (dashed line) samples have significantly different distributions for the Higgs jet. Such a difference is not observed for the QCD and sgluon jets. The observed deviation for the Higgs jets could originate from the difference of the parton shower schemes. The angular-ordered shower is adopted in HW7; on the other hand, the p_T-ordered shower is the default shower algorithm for PY8, where angular ordering is enforced by hand. The artificial veto in the p_T-ordered shower introduces a mismatch with the angular-ordered shower at the double-leading-log level Webber:2010vz ; Bhattacherjee:2015psa .

### 3.2 Multilayer Perceptron of Spectra

We introduce a neural network trained on the kinematic and spectrum (S_2^trim and S_2^soft) variables to classify the jets. A schematic diagram of the architecture of the classifier is shown in figure 7. The following set of inputs is used,

 \vec{x} = \{ p_{T,\mathrm{jet}}, \, m_{\mathrm{jet}}, \, p_{T,\mathrm{jet}}^{\mathrm{trim}}, \, m_{\mathrm{jet}}^{\mathrm{trim}} \} \cup \{ S^{\mathrm{trim}}_{2,k}, \, S^{\mathrm{soft}}_{2,k} \,|\, k \in \{0, \cdots, 19\} \}, \qquad (38)

where p_{T,jet}^trim and m_jet^trim are the transverse momentum and mass of the trimmed jet, respectively. The discretized spectra S^trim_{2,k} and S^soft_{2,k} are used to analyze the radiation pattern of the jet,

 S^{\mathrm{trim}}_{2,k} = S_2^{\mathrm{trim}}(0.1 k; 0.1), \qquad (39)
 S^{\mathrm{soft}}_{2,k} = S_2(0.1 k; 0.1) - S_2^{\mathrm{trim}}(0.1 k; 0.1). \qquad (40)

Here we take the bin width ΔR = 0.1, which is approximately the angular resolution of the hadronic calorimeter of the ATLAS detector. Note that the maximum separation between any two constituents of the jet is twice the jet radius.
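Assembling the 44-dimensional input of eq. (38) from the two binned spectra is then a simple concatenation; the function name below is ours.

```python
import numpy as np

def build_input(pt_jet, m_jet, pt_trim, m_trim, s2_full, s2_trim):
    """Input vector of eq. (38): four kinematic variables plus 20
    trimmed-spectrum bins and 20 soft-spectrum bins (S_soft = S_2 - S_2^trim)
    covering 0 <= R < 2.0 with dR = 0.1."""
    s2_full = np.asarray(s2_full, dtype=float)[:20]
    s2_trim = np.asarray(s2_trim, dtype=float)[:20]
    return np.concatenate([[pt_jet, m_jet, pt_trim, m_trim],
                           s2_trim, s2_full - s2_trim])
```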

A multilayer perceptron (MLP) with L layers is used to map the inputs to the class prediction. The following first order recurrence relation between the layers describes an MLP,

 h^{(\ell)}_i = \varphi^{(\ell)} \left( w^{(\ell)}_{ij} h^{(\ell-1)}_j + b^{(\ell)}_i \right), \qquad \vec{h}^{(0)} = \vec{x}, \qquad (41)

where w^{(\ell)}_{ij} and b^{(\ell)}_i are the weights and biases of the \ell-th layer. The activation function of the \ell-th layer, \varphi^{(\ell)}, is a monotonic and nonlinear function. We use four hidden layers with 1000, 800, 400, and 200 nodes, respectively, with a rectified linear unit (ReLU), \varphi(x) = \max(0, x), as the activation function. This MLP identifies the important features of the inputs for the classification after training. To make a class prediction, the outputs of the MLP are provided to the softmax classifier in eq. (19). The whole network architecture is illustrated in figure 8.
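The recurrence of eq. (41) can be sketched in a few lines of numpy. The layer sizes below are toy values for illustration (the paper's hidden layers have 1000, 800, 400 and 200 nodes), and the final layer is left linear since its outputs h_i feed the softmax classifier of eq. (19).

```python
import numpy as np

def mlp_forward(x, layers):
    """Forward pass of eq. (41): h(l) = phi(W h(l-1) + b), ReLU on the
    hidden layers, identity on the final layer."""
    h = np.asarray(x, dtype=float)
    for n, (W, b) in enumerate(layers):
        z = W @ h + b
        h = z if n == len(layers) - 1 else np.maximum(z, 0.0)  # ReLU
    return h

rng = np.random.default_rng(0)
sizes = [44, 10, 8, 3]  # toy dimensions: 44 inputs, 3 class scores
layers = [(rng.normal(size=(m, k)), np.zeros(m))
          for k, m in zip(sizes[:-1], sizes[1:])]
out = mlp_forward(rng.normal(size=44), layers)
```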

The MLP is trained by minimizing a loss function consisting of the categorical cross-entropy and an L2 weight regularization NIPS1991_563 ,

 L = -\frac{1}{N_{\mathrm{events}}} \sum_{\mathrm{events}} \sum_i y_i \log \hat{y}_i + \lambda \sum_{\ell=1}^{L} \sum_{i,j} |w^{(\ell)}_{ij}|^2, \qquad (42)

where N_events is the total number of events in the training data set and λ is a weight decay constant associated to the regularization term; we choose a small value of λ. The y_i (\hat{y}_i) denote the components of the truth (predicted) label vector \vec{y} (\hat{y}). The weight regularization reduces the over-fitting on the training data and also allows a smooth extrapolation to the phase space that is not covered by the training sample. The minimization is done with the ADAM optimizer DBLP:journals/corr/KingmaB14 . After the minimization of the loss function, the softmax layer provides the scores of the classes of a given event. The truth label vectors are defined as follows,

 \vec{y} = \begin{cases} (1, 0, 0) & \text{Higgs jet}, \\ (0, 1, 0) & \text{sgluon jet}, \\ (0, 0, 1) & \text{QCD jet}. \end{cases} \qquad (43)

The unnecessary symmetries in the neural network are broken by using the Glorot uniform initialization method pmlr-v9-glorot10a . The weights in the hidden layers are initialized by assigning random numbers between \pm\sqrt{6/(n_{\mathrm{in}} + n_{\mathrm{out}})}, where n_in and n_out are the numbers of inputs and outputs of a layer, respectively. The biases are initialized to zero. All the inputs are standardized before training. The architecture is implemented in Keras chollet2015keras with the TensorFlow backend tensorflow2015-whitepaper .
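The initialization and the loss of eq. (42) are easy to sketch in numpy without the full Keras stack. The value of λ below is an illustrative placeholder, not the paper's choice.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Glorot uniform initialization: U(-limit, limit) with
    limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def loss(y_true, y_pred, weights=(), lam=1e-4):
    """Loss of eq. (42): categorical cross-entropy plus L2 weight decay.
    lam is a placeholder value; a small epsilon guards the log."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ce = -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))
    return ce + lam * sum(float((W ** 2).sum()) for W in weights)
```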

In figure 9, we show ternary plots of the predicted label vector \hat{y}. The three sides of the triangle correspond to the three components of \hat{y}, i.e., the Higgs jet, sgluon jet, and QCD jet scores. The distributions of the Higgs jet and the QCD jet have high density spots that do not overlap with each other. This means that the network has found exclusive features of those two kinds of jets: the two-prong substructure of a Higgs jet and the one-prong structure of a QCD jet. However, the two-prong substructure of a sgluon jet is more radiative and less exclusive, and therefore, there are no high density spots in the distribution of the sgluon jet.

Next, we show the ROC curves of binary classifications in figure 10 with the red dotted lines. The following signal-background classifications are considered: Higgs-QCD, sgluon-QCD, and Higgs-sgluon. We assign the truth label vectors (1, 0) for the signal and (0, 1) for the background. The QCD jet mistag rates are comparable for the Higgs-QCD and sgluon-QCD classifications; however, the separation between the Higgs jet and the sgluon jet is weaker.

We now compare these ROC curves with those of a CNN trained on jet images.333The CNN setup is explained in detail in appendix B. The CNN classifier takes the jet images as inputs, while the S_2^trim and S_2^soft spectra are used for the MLP. The solid blue lines in figure 10 denote the ROC curves of the CNN. Some improvement in the background mistag rates is observed compared with the MLP classifier; quantitatively, it is only about 0.2% at a signal acceptance of 20% for the Higgs-QCD classification. Note that the MLP trained on S_2^trim and S_2^soft is expressed by eq. (28), while the CNN is expressed by eq. (22) when we use large kernel sizes for each convolutional layer, so that the CNN can be considered as an MLP trained on jet images. The terms which are not included in eq. (28) are the source of the degradation. Nevertheless, the small difference means that S_2^trim and S_2^soft capture the relevant features for these classifications.

### 3.3 Event Generator Dependence

The classifier introduced in the previous subsection uses not only the information of hard subjets encoded in S_2^trim but also the soft activities captured in S_2^soft. This leads to concerns about the accuracy of the modeling of soft physics: the performance of the classifier could be sensitive to the soft activities in the jet, while the simulated soft activities may be significantly different from the truth.

In figure 11, we compare the ROC curves of the MLP trained with PY8 and HW7 samples. As these two event generators are based on different modelings of the parton shower and hadronization schemes, the comparison gives us a reasonable estimate of the systematic uncertainty originating from the generator choice.

In the left panel of figure 11, we compare the ROC curves of the Higgs jet vs QCD jet classification for different generator choices. We estimate a systematic uncertainty in the predictions of the classifier by comparing the ROC(PY8, PY8) and ROC(HW7, HW7) curves, where the first and second entries in the parentheses correspond to the generators used to simulate the training and test samples, respectively. On the other hand, ROC(PY8, HW7) and ROC(HW7, PY8) show the degradation of the performance of a classifier trained on the “wrong sample” when used to analyse “real events”.

The performance of the classifier improves as we vary the generator combinations in the following order: ROC(HW7, HW7), ROC(PY8, HW7), ROC(PY8, PY8), and ROC(HW7, PY8). We find that the classification performance is significantly better for PY8 test samples than for HW7 samples. On the other hand, the classification performance for the same test samples hardly depends on the training sample, namely ROC(HW7, PY8) ≈ ROC(PY8, PY8) and ROC(PY8, HW7) ≈ ROC(HW7, HW7). For the Higgs jet vs QCD jet classification, the classifier mostly concentrates on the core substructures within the jet, for which both PY8 and HW7 provide similar kinematics and radiation spectra. Therefore, we do not observe any significant change in the ROC curves by varying the training samples while keeping the test samples the same.

In the middle panel of figure 11, we compare the classifier performance for the sgluon jet vs QCD jet classification. It improves in the following order: ROC(HW7, PY8), ROC(PY8, HW7), ROC(HW7, HW7), and ROC(PY8, PY8). The classifiers are indeed sensitive to the choice of generators: the network trained with PY8 (HW7) samples fails to capture the features of the HW7 (PY8) test samples, as the networks focus on different portions of the distribution of the fragmentation functions. In the right panel of figure 11, the ROC curves for the Higgs jet vs sgluon jet classification show similar behavior.

## 4 Interpretable Two-level Architecture

Quantitative understanding of a neural network is not straightforward because the parameters and intermediate outputs of the neural network are hardly readable. In this section, we propose an architecture constructed from the truncated series in eq. (29) and explain quantitatively how this network classifies events. The discretized architecture is then defined as follows,

 h_i = \frac{1}{2} \sum_k \left( S^{\mathrm{trim}}_{2,k} \, w^{\mathrm{trim}}_{i,k}(\vec{x}_{\mathrm{kin}}) + S^{\mathrm{soft}}_{2,k} \, w^{\mathrm{soft}}_{i,k}(\vec{x}_{\mathrm{kin}}) \right),
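The structure of this discretized classifier can be sketched as a dot product between the spectrum bins and angular weight vectors. In the two-level architecture the weight vectors are produced by an MLP of the jet kinematics \vec{x}_kin; here they are plain arrays for illustration (with an optional constant offset w0 in place of w^{(0)}_i), so that each entry can be read as the importance of an S_2 deposit at that angular bin.

```python
import numpy as np

def linear_classifier(s2_trim_bins, s2_soft_bins, w_trim, w_soft, w0=0.0):
    """Discretized truncation of eq. (29):
    h = w0 + (1/2) sum_k (S^trim_k w^trim_k + S^soft_k w^soft_k).
    w_trim and w_soft stand in for the MLP-generated weight functions."""
    return w0 + 0.5 * (np.dot(s2_trim_bins, w_trim)
                       + np.dot(s2_soft_bins, w_soft))
```

Because the intermediate feature is linear in the spectra, inspecting the trained weight vectors directly quantifies which angular scales drive the classification.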