# Quantum-inspired Machine Learning on high-energy physics data

One of the most challenging big data problems in high energy physics is the analysis and classification of the data produced by the Large Hadron Collider at CERN. Recently, machine learning techniques have been employed to tackle such challenges, which, despite being very effective, rely on classification schemes that are hard to interpret. Here, we introduce and apply a quantum-inspired machine learning technique and, exploiting tree tensor networks, we show how to efficiently classify b-jet events in proton-proton collisions at LHCb and to interpret the classification results. In particular, we show how to select important features and adapt the network geometry based on information acquired in the learning process. Moreover, the tree tensor network can be adapted for optimal precision or fast response in time without the need for repeating the learning process. This paves the way to high-frequency real-time applications as needed for current and future LHC event classification to trigger events at the tens of MHz scale.

## I LHCb framework and data description

LHCb is an experiment located at the LHC accelerator at CERN, Geneva, mainly dedicated to the study of the physics of $b$- and $c$-quarks produced in proton-proton collisions. The LHCb detector includes a high-precision tracking system, which measures the momentum of charged particles, and a particle identification system that distinguishes different types of charged hadrons, photons, electrons and muons Aaij et al. (2015a). The energy of charged and neutral particles is measured by electromagnetic and hadronic calorimeters.

LHCb is fully instrumented in the phase space region of proton-proton collisions defined by the pseudo-rapidity ($\eta$) range [2,5], with $\eta$ defined as

$$\eta = -\log\left[\tan\left(\frac{\theta}{2}\right)\right],$$

where $\theta$ is the angle between the particle momentum and the beam axis. The direction of a particle's momentum is fully identified by $\eta$ and by the azimuthal angle $\phi$, defined as the angle in the plane transverse to the beam axis. The projection of the momentum onto this plane is called the transverse momentum ($p_T$). In the following we work in natural units.
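As an illustration of the definition above, the following minimal Python sketch converts between the polar angle and the pseudo-rapidity and computes a transverse momentum from two momentum components. The function names and the choice of radians are assumptions made for this example and do not come from the LHCb software.

```python
import math


def pseudorapidity(theta: float) -> float:
    """Return eta = -log(tan(theta / 2)) for a polar angle theta in radians."""
    if not 0.0 < theta < math.pi:
        raise ValueError("theta must lie strictly between 0 and pi")
    return -math.log(math.tan(theta / 2.0))


def polar_angle(eta: float) -> float:
    """Invert the definition: theta = 2 * atan(exp(-eta))."""
    return 2.0 * math.atan(math.exp(-eta))


def transverse_momentum(px: float, py: float) -> float:
    """Magnitude of the momentum projected onto the plane transverse to the beam."""
    return math.hypot(px, py)


# Polar angles corresponding to the edges of the quoted range [2, 5].
acceptance_angles = (polar_angle(2.0), polar_angle(5.0))
```

With these definitions, the range [2,5] in pseudo-rapidity corresponds to polar angles between roughly 15 degrees and 0.8 degrees from the beam axis, which is why the detector is described as covering the forward region.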

At LHCb, jets are reconstructed using a Particle Flow algorithm A (1994) for the selection of charged and neutral particles and the anti-$k_T$ algorithm Cacciari et al. (2008) for clustering. The jet momentum is defined as the sum of the momenta of the particles that form the jet, while the jet axis is defined as the direction of the jet momentum. The particles that form the jet are contained in a cone of radius $R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$, where $\Delta\eta$ and $\Delta\phi$ are respectively the pseudo-rapidity difference and the azimuthal angle difference between the particle momenta and the jet axis. For each particle inside the jet cone, the momentum relative to the jet axis ($p_T^{\mathrm{rel}}$) is defined as the projection of the particle momentum onto the plane transverse to the jet axis.

A topic of great interest for the experiment is the identification of the charge of the quark that generated a $b$-jet, i.e. $b$ or $\bar{b}$. Such identification can be used in many physics measurements, and it is at the core of the determination of the charge asymmetry in $b\bar{b}$-pair production, which is sensitive to physics beyond the Standard Model Murphy (2015).

The separation between $b$- and $\bar{b}$-jets is a highly difficult task because the $b$-quark fragmentation produces dozens of particles via non-perturbative Quantum Chromodynamics processes, resulting in non-trivial correlations between them and the original particle. The algorithms used to identify the charge of the $b$-quarks are called tagging methods. Two categories of tagging algorithms exist: those based on a single particle inside the jet, such as the muon, and those that inclusively exploit the jet substructure, i.e. information on all the jet constituents, as shown in Fig. 2.

The tagging algorithm performance is typically quantified with the tagging power $\epsilon_{\mathrm{tag}}$. This tagging power represents the effective fraction of jets that contribute to the statistical uncertainty in an asymmetry measurement D0 collaboration et al. (2006); CDF collaboration et al. (2006). Thus, the tagging power takes into account the efficiency $\epsilon_{\mathrm{eff}}$, i.e. the fraction of jets for which the classifier takes a decision, and the prediction accuracy $a$, i.e. the fraction of classified jets for which the right decision is taken:

$$\epsilon_{\mathrm{tag}} = \epsilon_{\mathrm{eff}} \cdot (2a - 1)^2 .$$
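The tagging power formula can be made concrete with a short Python sketch. The function name and the count-based interface (total, tagged and correctly tagged jets) are assumptions chosen for illustration.

```python
def tagging_power(n_total: int, n_tagged: int, n_correct: int) -> float:
    """Tagging power = efficiency * (2 * accuracy - 1)**2.

    n_total   : all jets presented to the classifier
    n_tagged  : jets for which the classifier takes a decision
    n_correct : tagged jets for which the decision is right
    """
    if n_total <= 0 or not 0 <= n_correct <= n_tagged <= n_total:
        raise ValueError("expected 0 <= n_correct <= n_tagged <= n_total, n_total > 0")
    if n_tagged == 0:
        return 0.0  # no decision taken, hence no tagging power
    efficiency = n_tagged / n_total
    accuracy = n_correct / n_tagged
    return efficiency * (2.0 * accuracy - 1.0) ** 2
```

The factor $(2a-1)^2$ vanishes for random guessing ($a = 0.5$), so a tagger that decides on every jet but is right only half of the time carries no tagging power, while a tagger that is always right has a tagging power equal to its efficiency.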

LHCb measured the forward-central asymmetry using the dataset collected in the LHC Run I Aaij et al. (2014) with the muon tagging approach. In this method, the muon with the highest momentum in the jet cone is selected, and its electric charge is used to decide on the $b$-quark charge. In fact, if this muon is produced in the original semi-leptonic decay of the $b$-hadron, its charge is totally correlated with the $b$-quark charge. To date, the muon tagging method gives the best performance on the $b$- vs $\bar{b}$-jet discrimination. Although this method can distinguish between $b$- and $\bar{b}$-quarks with good accuracy, its efficiency is low, as it is only applicable to jets where a muon is found and it is intrinsically limited by the branching ratio of $b$-hadrons in semi-leptonic decays. Additionally, the muon tagging may fail in scenarios where the selected muon is produced not by the decay of the $b$-hadron but in other decay processes. In these cases, the muon charge may not be completely correlated with the $b$-quark charge.

The LHCb simulation datasets used for our analysis are produced with a Monte Carlo technique using the framework GAUSS Clemencic et al. (2011), which makes use of PYTHIA 8 Sjöstrand et al. (2008) to generate proton-proton interactions and jet fragmentation and uses EvtGen Lange (2001) to simulate $b$-hadron decays. The GEANT4 software Agostinelli et al. (2003); Allison et al. (2006) is used to simulate the detector response, and the signals are digitized and reconstructed using the LHCb analysis framework.

## II Machine Learning Methods

The identification of the $b$-quark charge described in Sec. I can be formulated as a supervised learning problem. As described in detail in App. A, we implemented a TTN as a classifier and applied it to the LHCb problem, analysing its performance; the same is done for a DNN, and both algorithms are compared with the muon tagging approach. Both the TTN and the DNN use features of the jet substructure as input for the supervised learning. The features are determined as follows: the muon with the highest $p_T$ among all detected muons in the jet cone is selected, and the same is done for the highest-$p_T$ kaon, pion, electron, and proton. In this way, 5 particles of different types are selected. For each particle, three observables are considered: (i) the momentum relative to the jet axis ($p_T^{\mathrm{rel}}$), (ii) the particle charge ($q$), and (iii) the distance in the ($\eta$, $\phi$) space between the particle and the jet axis ($\Delta R$), resulting in $5 \times 3 = 15$ observables. If a particle type is not found in a jet, the related features are set to . The 16th feature is the total jet charge $Q$, defined as the weighted average of the particle charges inside the jet, using the particles' $p_T^{\mathrm{rel}}$ as weights:

$$Q = \frac{\sum_i (p_T^{\mathrm{rel}})_i \, q_i}{\sum_i (p_T^{\mathrm{rel}})_i} .$$
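The weighted jet charge can be computed directly from its definition. The sketch below is a minimal Python version; representing each particle as a `(pt_rel, charge)` pair is an assumption made for this example.

```python
def jet_charge(particles):
    """Weighted average of particle charges, using pt_rel as weights.

    particles: iterable of (pt_rel, charge) pairs for the particles in the jet.
    """
    particles = list(particles)  # allow generators, since we iterate twice
    total_weight = sum(pt_rel for pt_rel, _ in particles)
    if total_weight <= 0.0:
        raise ValueError("the sum of pt_rel weights must be positive")
    return sum(pt_rel * charge for pt_rel, charge in particles) / total_weight


# Hypothetical jet: one hard positive particle, one softer negative, one neutral.
example_jet = [(2.0, +1), (1.0, -1), (1.0, 0)]
example_charge = jet_charge(example_jet)
```

Because the weights are the relative transverse momenta, particles carrying more momentum relative to the jet axis dominate $Q$, and neutral particles only dilute it through the denominator.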

The dataset used contains $b$- and $\bar{b}$-jets produced in proton-proton collisions at a center-of-mass energy of TeV Aaij et al. (2020a, b). First, pairs of $b$-jets and $\bar{b}$-jets are selected by requiring a jet $p_T$ greater than 20 GeV and $\eta$ in the range [2.2,4.2] for both jets. Then, the dataset of about jets (samples) is split into two datasets: of the samples are used in the training process while the remaining are used as test set to evaluate and compare the different methods.

We train the TTN as described in App. A and analyse the data with different bond dimensions $\chi$. The auxiliary dimension $\chi$ controls the number of free parameters within the variational TTN ansatz. While the TTN is able to capture more information from the training data with increasing bond dimension $\chi$, choosing $\chi$ too large may lead to overfitting and can thus worsen the results on the test set. For the DNN we use an optimized network with three hidden layers of 96 nodes (see App. E for details). Since we aim to compare the best possible performance of both approaches, we optimised the hyper-parameters of both methods, TTN and DNN, in order to obtain the best possible results from each of them.

For each event prediction, both methods give as output the probability to classify a jet as generated by a $b$- or a $\bar{b}$-quark. This probability (i.e. the confidence of the classifier) is normalized in the following way: for values of probability () a jet is classified as generated by a -quark (-quark), with an increasing confidence going to ( ). Therefore a completely confident classifier returns a probability distribution peaked at and for jets classified as generated by $b$- and $\bar{b}$-quark respectively.

## III Jet classification performance

In the following, we present the jet classification performance for the TTN and the DNN applied to the LHCb dataset, comparing both ML techniques also with the muon tagging approach.

We introduce a threshold symmetrically around the prediction confidence of in which we classify the event as unknown. We optimise the cut on the predictions of the classifiers (i.e. their confidences) to maximise the tagging power for each method based on the training samples. In the following analysis we find () for the TTN (DNN). Thereby, we predict for the TTN (DNN) a -quark with confidences (), a -quark with confidences () and no prediction for the range in between.
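The thresholding step can be sketched as follows, assuming that the band of undecided events is centred on a confidence of 0.5 and that confidences above the band map to one class (labelled 1 here) and confidences below it to the other (labelled 0). The labels, the function names and the grid scan over candidate cuts are illustrative assumptions; the sketch is not the optimisation procedure used for the results reported here.

```python
def decide(confidence, delta):
    """Return 1 above the band, 0 below it, and None (unknown) inside it."""
    if confidence > 0.5 + delta:
        return 1
    if confidence < 0.5 - delta:
        return 0
    return None


def tagging_power_for_cut(confidences, labels, delta):
    """Tagging power obtained when events inside the band are left undecided."""
    decisions = [decide(c, delta) for c in confidences]
    tagged = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    if not tagged:
        return 0.0
    efficiency = len(tagged) / len(confidences)
    accuracy = sum(d == y for d, y in tagged) / len(tagged)
    return efficiency * (2.0 * accuracy - 1.0) ** 2


def best_cut(confidences, labels, candidates):
    """Pick the candidate half-width that maximises the tagging power."""
    return max(candidates, key=lambda d: tagging_power_for_cut(confidences, labels, d))
```

Widening the band lowers the efficiency but usually raises the accuracy of the remaining decisions, so the tagging power typically has a maximum at an intermediate cut. As in the text, such a scan should be run on training samples only, so that the test set stays unbiased.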

Applying both ML approaches after the training procedure on the test data, we obtain similar performances in terms of the prediction accuracy. Taking the threshold for classifying data as unknown into account, the TTN takes a decision in of the cases with an overall accuracy of , while the DNN decides in of the samples with (see App. H for further details). We further checked both approaches for biases in physical quantities to ensure that both methods are able to properly capture the physical process behind the problem and thus that they can be used as valid tagging methods for LHCb events (see App. F).

In Fig. 3 we present the tagging power of the different approaches with respect to the jet transverse momentum $p_T$. Evidently, both Machine Learning methods perform significantly better than the muon tagging approach over the complete range of jet transverse momentum $p_T$, while the TTN and DNN show comparable performances within the statistical uncertainties.

In Figs. 3 and 3 we present the histograms of the confidences for predicting a $b$-flavored jet for all samples in the test data set for the DNN and the TTN respectively. Interestingly, even though both approaches give similar performances in terms of overall precision and tagging power, the prediction confidences are fundamentally different. For the DNN, we see a Gaussian-like distribution with, in general, not very high confidences for each prediction. Thus, we obtain fewer correct predictions with high confidence but, at the same time, fewer wrong predictions with high confidence compared to the TTN. On the other hand, the TTN shows a flatter distribution including more predictions - correct and incorrect - with higher confidence. Remarkably though, we can see peaks for extremely confident predictions (around 0 and around 1) for the TTN. These peaks can be traced back to the presence of the muon, whose charge is a well-defined predictor for a jet generated by a $b$-quark. The DNN lacks these confident predictions exploiting the muon charge.

Finally, in Fig. 3 the scatter plot of the TTN and DNN outputs is presented: for each jet classified as coming from a $b$-quark (green dots) or $\bar{b}$-quark (red dots) we relate the outputs of the two classifiers to spot any correlation between them. The graph shows that, despite the different confidence distributions, the outputs of the two classifiers are linearly correlated, with a Pearson correlation factor .

In conclusion, the two different approaches result in similar outcomes in terms of prediction performances. However, the underlying information used by the two discriminators is inherently different. For instance, the DNN predicts more conservatively, in the sense that the confidences for each prediction tend to be lower compared with the TTN. Additionally, the DNN does not exploit the presence of the muon as strongly as the TTN, even though the muon is a good predictor for the classification.

## IV Exploiting insights into the data with TTN

The TTN analysis allows us to efficiently measure the captured correlations and the entanglement within the classifier. These measurements give insight into the learned data and can be exploited to identify the most important features typically used for the classifications.

To this end, we interpret the TTN classifier as a set of quantum many-body wavefunctions $|\psi_l\rangle$ - one for each of the class labels $l$ (see App. A for further details). To perform the classification, each feature $x_i$ is encoded by the local feature map

$$\Phi^{[i]}(x_i) = \left[\cos\left(\frac{\pi x'_i}{2}\right), \sin\left(\frac{\pi x'_i}{2}\right)\right], \tag{1}$$

thus each feature is represented by a quantum spin. Accordingly, each sample $x$ is mapped into a product state $\Phi(x)$. When we classify a sample $x$, we compute the overlap $\langle\Phi(x)|\psi_l\rangle$ for all labels $l$ with the product state $\Phi(x)$, resulting in the weighted probabilities

$$P_l = \frac{|\langle\Phi(x)|\psi_l\rangle|^2}{\sum_l |\langle\Phi(x)|\psi_l\rangle|^2}$$

for each class. We stress that we can encode the input data in different non-linear feature maps as well (see App. A.4).
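The encoding and the class probabilities can be illustrated with a small NumPy sketch. It stores each class wavefunction as a dense vector of length $2^N$, which is only feasible for a handful of features; this is a simplifying assumption made for illustration, whereas the method described here keeps the wavefunctions in compressed TTN form. The function names are hypothetical, and the features are assumed to be already rescaled.

```python
from functools import reduce

import numpy as np


def local_feature_map(x_rescaled):
    """Map one rescaled feature to the two-component vector [cos, sin]."""
    angle = np.pi * x_rescaled / 2.0
    return np.array([np.cos(angle), np.sin(angle)])


def product_state(features):
    """Tensor product of the local feature vectors (dense, length 2**N)."""
    return reduce(np.kron, [local_feature_map(x) for x in features])


def class_probabilities(features, class_states):
    """Normalised squared overlaps of the sample with each class wavefunction."""
    phi = product_state(features)
    overlaps = np.array([abs(np.vdot(phi, psi)) ** 2 for psi in class_states])
    return overlaps / overlaps.sum()


# Toy example with two features and two hand-made class wavefunctions.
toy_states = [product_state([0.0, 0.0]), product_state([1.0, 1.0])]
toy_probabilities = class_probabilities([0.0, 0.0], toy_states)
```

Since $\cos^2 + \sin^2 = 1$ for every feature, the product state is normalised by construction, and the division by the sum of overlaps only normalises across classes.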

We can now calculate the correlation functions

$$C^l_{i,j} = \frac{\langle\psi_l|\sigma^z_i \sigma^z_j|\psi_l\rangle}{\langle\psi_l|\psi_l\rangle}$$

for each pair of features (located at sites $i$ and $j$), to gain an insight into the different information the features provide. In the case of maximum correlation or anti-correlation among them for all classes $l$, the information of one of the features can be obtained from the other one, and thus one of them can be neglected. In the case of no correlation among them, the two features may provide fundamentally different information for the classification. For both labels () the results are very similar, thus in Fig. 4 we present only (see App. B.1 for further discussion on the correlation measurements).
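For a small system, the correlation function can be evaluated directly on a dense state vector, as in the NumPy sketch below. The dense representation, the function name and the convention that local index 0 carries the $\sigma^z$ eigenvalue $+1$ are assumptions made for this example; in a TTN the same expectation value is obtained by tensor contractions instead.

```python
import numpy as np


def zz_correlation(psi, i, j, n_sites):
    """<psi| sigma^z_i sigma^z_j |psi> / <psi|psi> for a dense n_sites-spin state."""
    amplitudes = np.asarray(psi).reshape((2,) * n_sites)
    weights = np.abs(amplitudes) ** 2
    z = np.array([1.0, -1.0])  # sigma^z eigenvalues for local index 0 and 1
    shape_i = [1] * n_sites
    shape_i[i] = 2
    shape_j = [1] * n_sites
    shape_j[j] = 2
    # sigma^z is diagonal, so the expectation value is a weighted sum of signs.
    signs = z.reshape(shape_i) * z.reshape(shape_j)
    return float((weights * signs).sum() / weights.sum())


correlated = np.array([1.0, 0.0, 0.0, 1.0])       # |00> + |11>
anticorrelated = np.array([0.0, 1.0, 1.0, 0.0])   # |01> + |10>
uncorrelated = np.array([0.5, 0.5, 0.5, 0.5])     # product of two equal superpositions
```

The three toy states realise the limiting cases discussed above: full correlation, full anti-correlation and no correlation between the two sites. Dividing by the sum of weights means the input does not have to be normalised.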

The correlation analysis presented above allows pinpointing whether two features give independent information. However, the correlation itself does not tell whether this information is important for the classification. We thus computed the entanglement entropy $S$ of each feature, as reported in Fig. 4. The entanglement entropy reflects the shared information between two TN bipartitions. The entanglement is measured via the Schmidt decomposition, that is, decomposing $|\Psi\rangle$ into two bipartitions $A$ and $B$ Nielsen and Chuang (2000) such that

$$|\Psi\rangle = \sum_{\alpha}^{\chi} \lambda_\alpha |\Psi^A_\alpha\rangle \otimes |\Psi^B_\alpha\rangle,$$

where $\lambda_\alpha$ are the Schmidt coefficients (non-zero, normalised singular values of the decomposition). The entanglement entropy is then defined as $S = -\sum_\alpha \lambda_\alpha^2 \ln \lambda_\alpha^2$. Consequently, the minimal entropy $S = 0$ is obtained only if we have one single non-zero singular value $\lambda_1 = 1$. In this case, we can completely separate the two bipartitions as they share no information. On the contrary, higher values of $S$ mean that information is shared among the bipartitions.
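The Schmidt decomposition of a small dense state can be obtained from a singular value decomposition, which gives a direct way to evaluate the entanglement entropy. The sketch below assumes the standard von Neumann form $S = -\sum_\alpha \lambda_\alpha^2 \ln \lambda_\alpha^2$ with normalised Schmidt coefficients $\lambda_\alpha$; the function name and the dense-vector input are illustrative assumptions.

```python
import numpy as np


def entanglement_entropy(psi, dim_a, dim_b):
    """Entropy of the bipartition A|B of a dense state of dimension dim_a * dim_b."""
    matrix = np.asarray(psi).reshape(dim_a, dim_b)
    # The singular values of the reshaped state are the Schmidt coefficients.
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    probabilities = singular_values**2 / np.sum(singular_values**2)
    probabilities = probabilities[probabilities > 1e-15]  # drop numerical zeros
    return float(-np.sum(probabilities * np.log(probabilities)))


product = np.kron([1.0, 0.0], [0.6, 0.8])            # separable state
entangled = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2.0)  # maximally entangled pair
```

A separable state has a single non-zero Schmidt coefficient and therefore zero entropy, while the maximally entangled pair of two-level systems reaches $\ln 2$, the largest value possible for a bond of dimension 2.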

In the Machine Learning context, the entropy can be interpreted as follows: if the features in one bipartition provide no valuable information for the classification task, the entropy is zero. On the other hand, $S$ increases the more information between the two bipartitions is exploited. This analysis can be used to optimize the learning procedure: whenever $S = 0$, the feature can be discarded with no loss of information for the classification. Thereby, a new model with fewer features and fewer tensors can be introduced. The new, more efficient model gives the same predictions in less time. On the contrary, a high bipartition entropy highlights which features - or combinations of features - are important for the correct predictions. In conclusion, if the entropy of a feature bipartition is low, we can discard the corresponding features with negligible loss of information. Moreover, if the bipartition entropy is significantly large, features can be reordered for a better representation of the classifying wavefunction. Finally, if two features are completely (anti-)correlated we can neglect at least one.

Driven by the previous analysis, we introduce the Quantum Information Post-learning feature Selection (QuIPS) algorithm, which combines the insights of both of these measurements - correlations and entropy - to rank the input features according to their importance for the classification (see App. B). Employing QuIPS, we discarded half of the features by selecting the 8 most important ones: i.-iii. charge, momenta, and distance of the muon; iv.-vi. charge, momenta, and distance of the kaon; vii. charge of the pion; viii. total detected charge. To test the QuIPS performance, we compared it with an independent but more time-expensive analysis on the importance of the different particle types (see App. G): the two approaches perfectly matched. Finally, we studied two new models, one composed of the 8 most important features proposed by the QuIPS, and, for comparison, another with the 8 discarded features. In Fig. 4 we show the tagging power for the different analyses with the complete 16-sites (model ), the best 8 (), the worst 8 () and the muon tagging. Remarkably, we see that the models and give comparable results, while model results are even worse than the classical approach. These performances are confirmed by the prediction accuracy of the different models: While we only lose less than of accuracy from to , the accuracy of the model drastically drops to around - that is, almost random predictions. Finally, in this particular run, the model has been trained times faster with respect to model and predicts times faster as well (the actual speed-up highly depends on the bond-dimension and other hyperparameters, see App. D for details).

A critical point of interest in high energy physics applications is the prediction time. Indeed, short prediction times are necessary to perform real-time event selection. In the LHCb Run 2 data-taking, the high-level software trigger takes a decision approximately every Aaij et al. (2015a) and higher rates are expected in future Runs. Consequently, we aim to exploit the QuIPS to efficiently reduce the prediction time while maintaining a comparably high prediction power. Another step we can undertake to reduce the prediction time is to reduce the bond dimension after the training procedure. Here, we introduce the Quantum Information Adaptive Network Optimization (QIANO), which performs this truncation in a way that introduces the least infidelity possible (see App. C). In other words, QIANO can adjust the bond dimension to achieve a targeted prediction time while keeping the prediction accuracy reasonably high. We stress that this can be done without relearning a new model, as would be the case with NNs.
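The idea of reducing a bond dimension after training can be illustrated with a generic singular-value truncation of a single bond. This is a textbook sketch under simplifying assumptions (one bond, written as a dense matrix between its two sides) and is not the QIANO procedure itself, whose details are given in App. C; the function name is hypothetical.

```python
import numpy as np


def truncate_bond(matrix, chi):
    """Keep the chi largest singular values of the matrix connecting two parts.

    Returns the truncated factors (u, s, vh) and the discarded relative weight,
    i.e. the squared norm lost by the truncation.
    """
    u, s, vh = np.linalg.svd(matrix, full_matrices=False)
    kept = min(chi, len(s))
    discarded_weight = float(np.sum(s[kept:] ** 2) / np.sum(s**2))
    return u[:, :kept], s[:kept], vh[:kept, :], discarded_weight


# Toy bond with singular values 4, 3 and 0.
toy_matrix = np.diag([4.0, 3.0, 0.0])
u2, s2, vh2, lost_at_chi2 = truncate_bond(toy_matrix, chi=2)
u1, s1, vh1, lost_at_chi1 = truncate_bond(toy_matrix, chi=1)
```

Keeping the largest singular values gives the best approximation in the Frobenius norm for a given bond dimension, and the discarded weight quantifies the error, so the bond dimension can be lowered until a target prediction time is met while the loss is monitored.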

Finally, we apply QuIPS and QIANO to reduce the information in the TTN in an optimal way for a targeted balance between prediction time and accuracy. In Fig. 4 we show the tagging power obtained by taking the original TTN and truncating it to different bond dimensions . We can see that, even though we compress quite heavily, the overall tagging power does not change significantly. In fact, we only drop about in the overall prediction accuracy, while at the same time improving the average prediction time from s to s. Applying the same idea to the model we can reduce the average prediction time efficiently down to s (see App. D for more details), compatible with the current real-time classification rate.

## V Conclusions

We analysed an LHCb dataset for the classification of $b$- and $\bar{b}$-jets with two different ML approaches, a DNN and a TTN. We showed that both techniques reach a tagging power about one order of magnitude higher than the classical muon tagging approach, which to date is the best published result for this classification problem. We pointed out that, even though both approaches result in similar tagging power, they treat the data very differently. In particular, the TTN efficiently recognises the importance of the presence of the muon as a strong predictor for the jet classification.

We further explained the crucial benefits of the TTN approach over DNNs, namely (i) the ability to efficiently measure correlations and the entanglement entropy, and (ii) the power of compressing the network while keeping a high amount of information (to some extent even lossless compression). We showed how the former quantum-inspired measurements help to set up a more efficient ML model: in particular, by introducing an information-based heuristic technique, we can establish the importance of single features based only on the information captured within the trained TTN classifier. Using this insight, we introduced the QuIPS, which can significantly reduce the model complexity by discarding the least important features while maintaining high prediction accuracy. This selection of features based on their informational importance for the trained classifier is one major advantage of TNs for efficiently decreasing training and prediction time. Regarding the latter benefit of the TTN, we introduced the QIANO: once a TTN has been learned, its prediction time can be decreased by optimally reducing its representative power based on information from the quantum entropy, ensuring that each truncation introduces the least infidelity possible. In contrast to DNNs, with the QIANO we do not need to set up a new model and train it from scratch; instead, we can adapt the network after learning to the CPU used and to the prediction time required by the final application.

Finally, given the importance of prediction time in the LHCb experiment, we showed that using QuIPS and QIANO we can efficiently compress the trained TTN to target a given prediction time. In particular, we decreased our prediction times from to . Finally, while we used only one CPU for the predictions, by parallelising the tensor contractions on GPUs one can obtain a speed-up from to times Milsted et al. (2019). Thus, we are confident that it is possible to reach the MHz prediction rate while still obtaining results significantly better than the classical muon tagging approach.

A further application of our approach in the LHCb experiment is the discrimination between $b$-jets, $c$-jets and light flavour jets, which was already tackled by a Machine Learning approach using Boosted Decision Tree classifiers Aaij et al. (2015b). A fast and efficient real-time identification of $b$- and $c$-jets can be the key point for several studies in high energy physics, ranging from the search for the rare Higgs boson decay in two -quarks, up to the search for new particles decaying in a pair of heavy-flavour quarks ( or ). Given the optimal performance of the presented method, we envisage a multitude of possible future applications in high-energy experiments at CERN and in other fields of science.

## VI Data availability

This paper is based on data obtained by the LHCb experiment, but is analyzed independently, and has not been reviewed by the LHCb collaboration. The data are available in the official LHCb open data repository Aaij et al. (2020a, b).

## VII Acknowledgments

We are very grateful to Konstantin Schmitz for valuable comments and discussions on the Machine Learning comparison. We thank Miles Stoudenmire for fruitful discussions on the implementation of the Tensor Networks Machine Learning code.

This work is partially supported by the Italian PRIN 2017 and Fondazione CARIPARO, the Horizon 2020 research and innovation programme under grant agreement No 817482 (Quantum Flagship - PASQuanS), the QuantERA projects QTFLAG and QuantHEP, and the DFG project TWITTER. We acknowledge computational resources by CINECA, the Cloud Veneto and by the BwUniCluster.

We acknowledge the LHCb Collaboration for the valuable help and the Istituto Nazionale di Fisica Nucleare and the Department of Physics and Astronomy of the University of Padova for the support.

## Appendix A Tree Tensor Networks

Tensor Networks (TNs) have been developed for decades to investigate quantum many-body systems on classical computers. They provide an efficient representation of a quantum wavefunction in a compact form and have thereby proven to be an essential tool for a broad range of applications Schollwöck (2011); Silvi et al. (2019); McCulloch (2007); Singh and Vidal (2013); Felser et al. (2019); Dalmonte and Montangero (2016); Gerster et al. (2017). In a mathematical context, a TN approximates a high-order tensor by a set of low-order tensors that are contracted in a particular underlying geometry, and it has common roots with other decompositions, such as the Singular Value Decomposition (SVD) or the Tucker decomposition Tucker (1966). The accuracy of the TN approximation can be controlled with the so-called bond-dimension $\chi$, an auxiliary dimension for the indices of the connected local tensors. Among others, some of the most successful TN representations are the Matrix Product State (MPS) - or Tensor Trains Östlund and Rommer (1995); Schollwöck (2011); Oseledets (2011); Stoudenmire and Schwab (2016), the Tree Tensor Network (TTN) - or Hierarchical Tucker decomposition Gerster et al. (2014); Hackbusch and Kühn (2009); Liu et al. (2019), and the Projected Entangled Pair States (PEPS) Verstraete and Cirac (2004); Orús (2014).

In the following, we briefly describe the main principle of Tensor Networks and the concepts we refer to within the paper or later on in the appendix. For a more detailed insight into Tensor Networks, we refer to more comprehensive reviews and text books Orús (2014); Schollwöck (2011); Silvi et al. (2019); Novikov et al. (2016); Montangero (2018).

### A.1 Graphical representation of Tensor Networks

Within the original Tensor Network development and applications in physics, a graphical representation of the underlying mathematical tensor notation has been established for the sake of compactness. In a nutshell, we represent TNs with circles - for the tensors - and lines connecting the different tensors. Each line, also referred to as a link, indicates a contraction of the two connected tensors over a coinciding index. Fig. 5 shows the graphical representation of tensors with different ranks: a vector, a matrix, and a rank-3 tensor from left to right at the top, and a general rank-N tensor at the bottom.

Within TN algorithms, the tensors are constantly manipulated. The most important operations are the contraction of two tensors, the reshaping of a tensor, and its factorisation. The contraction of two tensors generalises the well-known matrix-matrix multiplication of linear algebra to tensors of arbitrary rank. As matrices are a special form of tensors, the matrix multiplication can be identified in terms of TNs as a contraction of two rank-two tensors over one coinciding link (see Fig. 6). Generalising this statement to arbitrary tensors $A$ and $B$, the tensor contraction can be performed over several coinciding links $m$. The resulting tensor $C$ is given by the summation over all indices of the coinciding links:

$$C_{l,n} = \sum_{m} A_{l,m} B_{m,n}, \quad \text{with } l = (l_1, \dots, l_\lambda),\; m = (m_1, \dots, m_\mu) \text{ and } n = (n_1, \dots, n_\nu).$$

Therefore, the links $l$ and $n$ are the remaining indices after the contraction.
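A contraction over several coinciding links can be sketched with NumPy's `tensordot`. The tensor shapes below are arbitrary choices for illustration, and the helper name `contract` is hypothetical; the second computation shows the same contraction carried out as a matrix-matrix multiplication after fusing the links.

```python
import numpy as np


def contract(a, b, links_a, links_b):
    """Contract tensor a with tensor b over the paired axes links_a / links_b."""
    return np.tensordot(a, b, axes=(links_a, links_b))


rng = np.random.default_rng(seed=0)
A = rng.normal(size=(2, 3, 4, 5))  # indices (l1, l2, m1, m2)
B = rng.normal(size=(4, 5, 6))     # indices (m1, m2, n1)

# Sum over the two coinciding links m1 and m2; l1, l2 and n1 remain.
C = contract(A, B, [2, 3], [0, 1])

# Same result as a matrix product: fuse (l1, l2) -> 6 and (m1, m2) -> 20.
C_as_matmul = (A.reshape(6, 20) @ B.reshape(20, 6)).reshape(2, 3, 6)
```

Fusing the open links into one row index and the contracted links into one column index is what allows a general contraction to be delegated to optimised matrix-matrix multiplication routines.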

For a general TN, a link connecting two tensors always indicates a contraction of both tensors. The algorithmic complexity of such a tensor contraction scales with the dimension of all involved links to (although the scaling can be reduced, when carried out as optimised matrix-matrix multiplication). Due to this complexity, the contractions play a crucial role in the efficiency of algorithms for Tensor Networks.

### A.2 Tree Tensor Network representation

In its original idea, the TTN represents an arbitrary pure quantum state $|\psi\rangle$ as a decomposition of the complete exponentially large tensor . The corresponding separable Hilbert space of the system consists of $L$ local subspaces , where each local state space shall be $d$-dimensional. The most general pure state in such a system can be written as

$$|\psi\rangle = \sum_{i_1,\dots,i_L=1}^{d} c_{i_1,\dots,i_L} |i_1\rangle \otimes |i_2\rangle \otimes \dots \otimes |i_L\rangle, \tag{2}$$

where describes the local state space of the site . This complete representation requires $d^L$ coefficients $c_{i_1,\dots,i_L}$, one for each possible combination of the local states. These coefficients can be recast in a tensor of rank $L$, where each leg of this tensor corresponds to a local Hilbert space . The normalisation of the state thereby coincides with the Frobenius norm of the rank-$L$ tensor. The TTN further decomposes this rank-$L$ tensor into a set of hierarchically connected rank- tensors.

### A.3 Machine Learning with Tree Tensor Network

Even though DNNs have been highly developed in recent decades by industry and research, the first approaches to Machine Learning with TNs already yield comparable results when applied to standardised datasets Stoudenmire and Schwab (2016); Stoudenmire (2018); Glasser et al. (2018a). In particular, we implemented a TTN as a classifier for supervised learning problems. In this section, we give insights into the TTN Machine Learning algorithm.

As for a general supervised learning problem, the data samples are given as input vectors $x$. Each sample is encoded into a higher dimensional feature space by a feature map $\Phi(x)$, and subsequently classified with the decision function

$$f(x) = W \cdot \Phi(x). \tag{3}$$

In general, the complete weight tensor $W$ can be used as a classifier; however, this tensor becomes exponentially large with an increasing number of features in the dataset. Therefore, we represent $W$ as a quantum-inspired Tensor Network, in particular a Tree Tensor Network, building on the idea proposed for an MPS in Stoudenmire and Schwab (2016).

A TN with sites addresses a global space spanned by a tensor product of local subspaces . Each subspace can in general have a different dimension . For the application in Machine Learning, a natural feature map suited for a TN is a product of local feature maps where and . All local feature maps together determine the global feature map

$$\Phi(x) = \Phi^{s_1}(x_1)\, \Phi^{s_2}(x_2) \cdots \Phi^{s_N}(x_N). \tag{4}$$

To point out the connection with quantum mechanics in this Machine Learning ansatz, we can describe the TTN classifier as a set of quantum many-body wavefunctions - one for each of the class label . Consequently, when we predict a sample we calculate its overlap for all labels with the product state given by a global feature map. The final prediction output for each class is then given by the weighted probabilities

$$P_l = \frac{|\langle \Phi(x) | \psi_l \rangle|^2}{\sum_l |\langle \Phi(x) | \psi_l \rangle|^2} . \tag{5}$$
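As a minimal sketch of Eq. (5), the snippet below computes the class probabilities from explicit state vectors. The function name `class_probabilities` and the toy vectors are illustrative assumptions; in the TTN the overlaps are obtained by contracting the network rather than from full vectors.

```python
def class_probabilities(phi, class_states):
    """Eq. (5): normalised squared overlaps of phi with each class state."""
    overlaps = []
    for psi in class_states:
        # <Phi(x)|psi_l>: the left-hand vector enters complex-conjugated.
        amplitude = sum(complex(p).conjugate() * c for p, c in zip(phi, psi))
        overlaps.append(abs(amplitude) ** 2)
    total = sum(overlaps)
    return [o / total for o in overlaps]


# Toy example (hypothetical numbers): one sample state, two class states.
probs = class_probabilities([1.0, 0.0], [[0.8, 0.6], [0.6, 0.8]])
print(probs)
```

By construction the returned probabilities sum to one, independently of the normalisation of the class states.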

For the identification of jets, the features are the detected physical observables from the LHCb simulation described in Section I. We rescale each of the features to with respect to the corresponding maximum value of all samples within the complete training set. We encode the rescaled features - following the inspiration of quantum spins - by choosing the local feature map

$$\Phi^{s_i}(x_i) = \left[ \cos\left(\frac{\pi x'_i}{2}\right), \sin\left(\frac{\pi x'_i}{2}\right) \right] . \tag{6}$$

In this way, we can think about each single feature being represented by a quantum spin (where is mapped to a spin down and to a spin up). Accordingly, each sample is mapped to the product state . After the transformation, the -th feature is addressed by the -th site of the TTN. In general, we can exploit different, more expressive feature maps than the one chosen in Eq. 6 (see App. A.4).
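The local map of Eq. (6) and the resulting product state can be sketched as follows. This is an illustration only: the function names are hypothetical, the inputs are assumed to be already rescaled to the interval [0, 1], and the explicit Kronecker product grows as 2^N, so it is practical only for a handful of features.

```python
import math


def local_feature_map(x_rescaled):
    """Eq. (6): map one rescaled feature to a two-component 'spin' vector."""
    angle = math.pi * x_rescaled / 2
    return [math.cos(angle), math.sin(angle)]


def product_state(sample):
    """Kronecker product of all local feature vectors (length 2**N)."""
    state = [1.0]
    for x in sample:
        local = local_feature_map(x)
        state = [a * b for a in state for b in local]
    return state


phi = product_state([0.0, 0.5, 1.0])  # three hypothetical rescaled features
norm = math.sqrt(sum(a * a for a in phi))
print(len(phi), norm)
```

Since every local vector has unit norm, the product state is normalised as well.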

For the learning procedure of the TN, we aim to minimise the quadratic cost function

$$C = \frac{1}{2} \sum_{n=1}^{N_T} \sum_l \left( f_l(x_n) - \delta_{l, L_n} \right)^2 ,$$

where the index runs over all training samples and is a Kronecker delta with being the correct label for the -th sample. Thus , if the label equals the known label for the supervised learning. We optimise the complete network by subsequently performing a gradient descent on local tensors until the cost function converges. We sweep through the network from the bottom to the top, so that after one sweep every tensor has been optimised once, concluding one learning iteration. In contrast to Ref. (Stoudenmire and Schwab, 2016), we keep the label fixed at the top tensor and optimise each tensor separately rather than optimising in the space of two tensors at once.
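For concreteness, the quadratic cost can be evaluated as in the sketch below. The function name and the toy outputs are assumptions made for illustration; labels are taken to be integer class indices.

```python
def quadratic_cost(outputs, labels):
    """C = 1/2 * sum_n sum_l (f_l(x_n) - delta_{l, L_n})**2."""
    cost = 0.0
    for f_n, true_label in zip(outputs, labels):
        for l, f_nl in enumerate(f_n):
            target = 1.0 if l == true_label else 0.0  # Kronecker delta
            cost += (f_nl - target) ** 2
    return 0.5 * cost


# A perfect classifier has zero cost; an undecided output is penalised.
print(quadratic_cost([[1.0, 0.0], [0.0, 1.0]], [0, 1]))
print(quadratic_cost([[0.5, 0.5]], [0]))
```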

Furthermore, we initialise the TTN by performing the unsupervised learning proposed in Ref. Stoudenmire (2018) up to the topmost layer in the tree and adding a random tensor on top connecting the remaining two bipartitions with the label for the classification. We start by optimising the random top tensor via conjugate gradient descent and afterwards sweep iteratively through the network from the bottom to the top.

### A.4 Higher-Dimensional Local Feature Maps

In Eq. (6), we presented the local feature map as a 2-dimensional vector inspired by the quantum spin representation. In general, we are not restricted to this feature map, as the different samples can be mapped by using more expressive feature maps, e.g., taking polynomial orders (e.g. ) or higher order spherical feature maps defined as

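$$\Phi^{s_j}_d(x_j) = \sqrt{\binom{d-1}{s_j-1}} \left( \cos\left(\frac{\pi}{2} x_j\right) \right)^{d-s_j} \left( \sin\left(\frac{\pi}{2} x_j\right) \right)^{s_j-1} .$$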

We analysed the data with different orders of the spherical feature map and presented in Sec. III the results obtained by the 5-th order map , as this order led to the best prediction accuracy. Nevertheless, the different feature maps all result in similar prediction accuracies in the end and the fundamental insights we obtained did not change. As an example, for the -dimensional feature map we obtained an accuracy of in contrast to for (both after applying the cuts).
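A short sketch of the spherical feature map of order d is given below. It assumes the square root covers the binomial coefficient, as in the map of Stoudenmire and Schwab (2016), and that the input is rescaled to [0, 1]; the function name is hypothetical.

```python
import math


def spherical_feature_map(x, d):
    """Order-d spherical feature map; component s runs from 1 to d."""
    angle = math.pi * x / 2
    c, s = math.cos(angle), math.sin(angle)
    return [
        math.sqrt(math.comb(d - 1, k - 1)) * c ** (d - k) * s ** (k - 1)
        for k in range(1, d + 1)
    ]


vec5 = spherical_feature_map(0.3, 5)
vec2 = spherical_feature_map(0.3, 2)
print(sum(v * v for v in vec5))
```

The squared components sum to (cos² + sin²)^(d-1) = 1 by the binomial theorem, so the map is normalised for every order, and for d = 2 it reduces to the map of Eq. (6).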

In Sec. IV we used the -dimensional feature map for the insights into the TTN by measuring the correlations and entropy. We stress that the operators used to measure correlations have to be adapted to the local Hilbert space as well. For spherical feature maps, we can exploit the Pauli matrices for , the Gell-Mann matrices for , or higher representations of the Lie group in order to investigate correlations in the classification.

### A.5 Isometrisation

Here, we restrict ourselves to rank-3 tensors for the sake of compactness and since the Tree TN is composed out of rank-3 tensors only. The generalisation to rank- tensors is straightforward.

A tensor of the TTN is isometrised with respect to the links if it is a unitary matrix when combining the two indices and , that is, it obeys the isometry condition

$$\sum_{k_1,k_2} (T)^{k_3}_{k_1,k_2} (T^\dagger)^{k_1,k_2}_{k'_3} = \delta_{k_3,k'_3} . \tag{7}$$

Hence, one isometrised tensor performs a unitary transformation on two subspaces . We can isometrise an arbitrary TTN by iteratively performing QR-decompositions Silvi et al. (2019) on each tensor of the tree from the bottom to the top. In particular, when going from the bottom to the top, we replace the original tensor with the unitary Q-tensor and contract the R-tensor upwards with the connected tensor over the link . This procedure results in all tensors being isometrised, except for the topmost one. The TTN is then isometrised towards the topmost tensor. In the same manner, we can isometrise the TTN towards different tensors within the network as well.

After we train a TTN in the Machine Learning application, we isometrise the complete TTN towards the upmost tensor for the predictions. Consequently, the prediction is a real space renormalisation of two neighboring sites for each layer within the tree Montangero (2018); Tagliacozzo et al. (2009). Each tensor simply performs a unitary transformation together with a truncation of two sites originating from the input sample. Consequently, when assessing the tensor entries we know exactly how the data will be processed for the general prediction.
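The QR step for a single rank-3 tensor can be sketched with NumPy as below. The index ordering T[k1, k2, k3] with k3 as the upward link, the function name, and the random test tensor are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def isometrise(tensor):
    """QR-decompose a rank-3 tensor T[k1, k2, k3] over its upward link k3.

    Returns the isometry Q (same index order) and the matrix R that would be
    contracted into the parent tensor.
    """
    k1, k2, k3 = tensor.shape
    q, r = np.linalg.qr(tensor.reshape(k1 * k2, k3))
    return q.reshape(k1, k2, q.shape[1]), r


rng = np.random.default_rng(0)
T = rng.normal(size=(2, 2, 3))
Q, R = isometrise(T)

# Isometry condition, Eq. (7): contracting k1 and k2 gives the identity on k3.
gram = np.einsum("abk,abl->kl", Q.conj(), Q)
# Contracting R back onto Q recovers the original tensor.
restored = np.einsum("abk,kl->abl", Q, R)
```

Applying this step to every tensor from the bottom layer upwards, and absorbing each R into the parent before the parent is decomposed, leaves only the topmost tensor non-isometric.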

### A.6 Schmidt-decomposition

In a loop-free TN state, every link between two tensors in the network bipartitions the underlying system into the subsystems and . This allows rewriting the TN at every link in terms of the Schmidt-decomposition

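$$|\psi\rangle = \sum_{\alpha=1}^{\chi_\nu} \lambda^{[A,B]}_\alpha \, |\psi^{[A]}_\alpha\rangle \otimes |\psi^{[B]}_\alpha\rangle \tag{8}$$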

with being the bond dimension of the link and the Schmidt-coefficients (or non-zero, normalised singular values of the decomposition). Thus, at each bipartition, the bond dimension of the link provides an upper bound for the Schmidt rank and consequently for the bipartite entanglement the TN is able to capture (compare App. B.2). Each of the two Schmidt vector sets forms an orthonormal basis for the associated subspaces Nielsen and Chuang (2000). In practice, we exploit the isometry condition for the tensors within the TTN by isometrising to one of the tensors attached to the link of the desired bipartition. The Schmidt-values then correspond to all non-zero singular values of an SVD decomposition with respect to attached to the link on the tensor the TTN is isometrised towards.

As we will see in the next section, this Schmidt-decomposition will allow us to calculate the information encoded in the TNs and based on this information to efficiently reduce the complexity of the Machine Learning model.

## Appendix B Quantum Information Post-learning feature Selection (QuIPS)

In this section, we introduce the Quantum Information Post-learning feature Selection (QuIPS), a protocol that exploits the information encoded in the TTN to reduce the input features in the model to the most valuable ones for the classification process. In particular, the quantum-inspired measurements of correlations and entropy are used to determine the importance of the different input features after the learning procedure based on the information they provide for the classification. Finally, the QuIPS allows us to rank all the input features according to their importance and to use this ranking to efficiently reduce the model by discarding the least important features.

In the following we first describe the two exploited quantum-inspired measurements, the correlations and the von Neumann entropy, and finally give an algorithmic protocol for the QuIPS.

### B.1 Correlation measurements

We can measure the correlations captured within the TTN classifier by exploiting the quantum-correlation measurements. As we chose the local map to represent the input features in quantum spins, we will measure the correlations in the basis for each pair of features (located at site and ), defined as follows:

$$C^l_{i,j} = \frac{\langle \psi_l | \sigma^z_i \sigma^z_j | \psi_l \rangle}{\langle \psi_l | \psi_l \rangle}$$

The correlation results in if the TTN recognizes that the two local features and are completely correlated - such that the rescaled input of always equals the rescaled input . We obtain for the two local features being completely anti-correlated - such that . Finally, we obtain in case of the two local features being completely uncorrelated. Thus, in case of no correlation, we know that the two features may provide fundamentally different information for the classification. However, we cannot tell whether this information is actually important for the classification itself. On the contrary, in the case of complete (anti-)correlation, the two features provide the same information and we can drop one of them in further analysis. As an example, consider learning from pictures where the first two pixels are always white in the complete dataset. We will then measure that both pixels are totally correlated, and we know we can discard at least one of them, as we can always reconstruct its information from the other one. In this case, however, neither pixel gives us valuable information for the actual classification problem, and we could go further and discard both. Measuring the correlation alone gives us no insight into this. When we take the entropy measurement of the subsequent section into account, however, we can measure the information provided for the classification and efficiently discard both pixels in this scenario.

This idea for the correlation measurement can be extended to the use of different local feature maps (see App. A.4) by using different operators as correlators. We mention as well, that this correlation measurement is purely based on the information within the TTN captured after the learning procedure.
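To make the correlation measurement concrete, the sketch below evaluates the normalised two-site correlator on an explicit state vector of two-level sites. The function name, the bit ordering, and the sign convention (bit 0 has eigenvalue +1) are assumptions; in the TTN the expectation value is obtained by contracting the network instead.

```python
def zz_correlation(state, i, j, n_sites):
    """<sigma^z_i sigma^z_j> / <psi|psi> for a state over n_sites spins."""

    def z(basis_index, site):
        # sigma^z eigenvalue of `site` in the computational basis state.
        bit = (basis_index >> (n_sites - 1 - site)) & 1
        return 1 if bit == 0 else -1

    norm = sum(abs(a) ** 2 for a in state)
    corr = sum(abs(a) ** 2 * z(b, i) * z(b, j) for b, a in enumerate(state))
    return corr / norm


correlated = [1, 0, 0, 1]       # |00> + |11>
anti_correlated = [0, 1, 1, 0]  # |01> + |10>
uncorrelated = [1, 1, 1, 1]     # product state, both spins along x
print(zz_correlation(correlated, 0, 1, 2))
print(zz_correlation(anti_correlated, 0, 1, 2))
print(zz_correlation(uncorrelated, 0, 1, 2))
```

The three toy states reproduce the limiting cases of complete correlation, complete anti-correlation, and no correlation.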

In principle, we can further measure correlations or local expectation values in the -basis. This can help to find further correlations within the data and can give insights on choosing the local feature map. If we find higher correlations in this basis for certain features, it might be interesting to actually change the input basis of the local feature map from a spin in to the -basis.

In the paper, we presented the correlations for the -quark classification. In Fig. 7 (a) we show the correlations for the -quark as well for the sake of completeness. As mentioned before, both cases are very similar and differ only slightly in the magnitude of the single correlations.

We can further generalise the correlation measurement and compute the cross-correlations of the two feature spaces and between two different classes and

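$$C^{l,l'}_{i,j} = \frac{\langle \psi_l | \sigma^z_i \sigma^z_j | \psi_{l'} \rangle}{\sqrt{\langle \psi_l | \psi_l \rangle \langle \psi_{l'} | \psi_{l'} \rangle}} .$$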

### B.2 Entropy measurements

Within the TTN, we also have access to the entanglement entropy. This expresses the correlations between the two parts of a general bipartition of the whole system. To compute it, we bring the state represented by the TTN into an orthogonal form with the Schmidt decomposition (see App. A.6). The von Neumann entropy is then defined by the Schmidt-coefficients as . Consequently, the minimal entropy is obtained only if we have one single non-zero singular value of . In this case, we can completely separate the two bipartitions as they share no information. This idea can be interpreted in the Machine Learning context: If the features in one bipartition provide no valuable information for distinguishing between the different classes, the entropy is zero. On the other hand, the entropy increases the more information between the two bipartitions is used for the classification. This criterion can be used to optimize the learning procedure: In the first scenario, we can efficiently discard the features with no - or negligibly little - information and introduce a new model with fewer features and, correspondingly, fewer tensors. With this more compact model, we are able to obtain the same predictions while requiring fewer contractions and thereby less time. The second scenario, where the entropy is high, helps us to understand which features - and more generally which combinations of features - are critically important for the correct predictions.
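As an illustration, the entropy of a bipartition can be computed from its Schmidt coefficients as sketched below. The sketch assumes the standard von Neumann form, minus the sum of λ² ln λ² over the normalised coefficients, with the natural logarithm; the function name is hypothetical.

```python
import math


def von_neumann_entropy(schmidt_values):
    """S = -sum_a p_a ln p_a with p_a the normalised squared coefficients."""
    weights = [s * s for s in schmidt_values if s > 0]
    total = sum(weights)
    entropy = -sum((w / total) * math.log(w / total) for w in weights)
    return entropy + 0.0  # turns a possible -0.0 into 0.0


# One non-zero coefficient: the two parts share no information.
print(von_neumann_entropy([1.0, 0.0]))
# Two equal coefficients: maximal entropy for bond dimension 2.
print(von_neumann_entropy([2 ** -0.5, 2 ** -0.5]))
```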

In fact, we can further exploit the mutual information

$$I_{ik} = S_i + S_k - S_{ik}$$

of two different features and if they are attached to the same tensor by measuring the entanglement entropy on a higher layer within the tree as illustrated in Fig. 8 as an example of the -classification. Here, () is the corresponding entropy for the two features () and the entropy of the combined features. If we see, that e.g. two features provide the same entropy , and additionally the coarse-grained space consisting of both features provides the exact same entropy , we know that the information we obtain from the two features is equivalent. Thus, in this case, we can neglect one of the two (in agreement with the correlation measurement). The same idea can be extended to the mutual information of different clusters of input features and opens a very promising direction for a deeper understanding of information captured within the TTN.

### B.3 QuIPS Protocol

Both above-mentioned measurements, the correlations together with the entanglement entropy, leave us with the following insights and recipe for increasing the model efficiency using QuIPS:

• If two learned features are completely correlated we can neglect at least one of them without loss of classification accuracy.

• If the entropy for a bipartition (set of features) is low, we can discard all the features.

• If the entropy is significantly large, we may reorder the features for a better representation of the classifying wavefunction.

Therefore, the QuIPS can be described by the following algorithmic idea for setting up the new model with a targeted number of total features:

• Add the feature with highest entropy to the new model

• while ( numberOfFeatures < targetNumber )

• take next feature with highest entropy

• if ( new model )

• end if

• end while
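The greedy selection above can be sketched in Python as follows. The acceptance condition inside the `if` is not legible in the listing as given here, so the sketch assumes that a candidate is added only if its absolute correlation with every feature already in the new model stays below a threshold; the function name, the `max_corr` parameter, and the toy numbers are hypothetical.

```python
def quips_select(entropy, correlation, target_number, max_corr=0.99):
    """Greedy feature selection by entropy with a correlation veto.

    entropy[f]        : entropy measured for feature f
    correlation[f][g] : correlation between features f and g
    """
    ranked = sorted(range(len(entropy)), key=lambda f: entropy[f], reverse=True)
    selected = [ranked[0]]  # start with the highest-entropy feature
    for f in ranked[1:]:
        if len(selected) >= target_number:
            break
        # Assumed criterion: skip features (anti-)correlated with the model.
        if all(abs(correlation[f][g]) < max_corr for g in selected):
            selected.append(f)
    return selected


entropy = [0.9, 0.8, 0.1, 0.5]
correlation = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(quips_select(entropy, correlation, target_number=2))
```

In this toy example feature 1 has the second-highest entropy but is fully correlated with feature 0, so the next-ranked feature is taken instead.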

In summary, the QuIPS offers valuable insights into the learned data. In particular, it provides an information-based heuristic for the importance of the single features based on the information within the trained TTN. Thereby, we can significantly reduce the model complexity by discarding the least important features while still maintaining high prediction accuracy. As an outlook on further investigation in this direction, we can include the mutual information in the heuristic and put more weight on the ordering of the features as well. As the TTN classifier breaks the symmetry of the position of input samples, this ordering of features can lead to critical gains in the performance as well.

Furthermore, an interesting approach is to exploit different metrics for measuring and describing the captured information. For instance, we might use the Kullback-Leibler Divergence Kullback and Leibler (1951), which is a more prominent measurement in Machine Learning.

## Appendix C Quantum Information Adaptive Network Optimization (QIANO)

We introduce the Quantum Information Adaptive Network Optimization (QIANO), which reduces the number of free parameters in the network while retaining as much information as possible. In other words, with QIANO we can adjust the bond dimension of the TTN classifier targeting a certain prediction time while controlling the prediction accuracy to stay reasonably high. Moreover, this adjustment can be done without the need to relearn a new model, as would be the case with neural networks. The underlying procedure for this truncation of a TTN is the Singular Value Decomposition (SVD), which is applied to the tensors within the network.

### C.1 Singular Value Decomposition

Every complex matrix can be decomposed into a matrix featuring orthonormal columns, a diagonal matrix and a matrix with orthonormal rows such that

$$M = U S V^\dagger .$$

The orthonormal columns of - also referred to as left singular vectors of - obey . Analogously, the orthonormal rows of - or right singular vectors of - entail . The diagonal matrix contains the singular values of (real, non-negative). The number of non-zero singular values equals the Schmidt rank of the matrix . With a descending order of the singular values, the matrix is unique for given , which is in general not the case for and .

The SVD provides the best possible approximation of a matrix by a matrix with lower rank with respect to its Frobenius norm . Indeed, performing the SVD on leads to

$$\|M\|_F^2 = \mathrm{tr}\{ U S V^\dagger V S^\dagger U^\dagger \} = \mathrm{tr}\{ S S^\dagger \} = \sum_{i=1}^{r} \sigma_i^2 ,$$

with denoting the Schmidt rank of the matrix . Thus the squared Frobenius Norm equals the sum of the squared non-zero singular values . Consequently, the error in made by the approximation is minimal for taking the highest singular values into account, together with their corresponding singular vectors in and . If the singular values are arranged in descending order, the error in the Frobenius norm can be calculated by the discarded values Schollwöck (2011).

$$\|\tilde{M}\|_F^2 = \sum_{i=1}^{\tilde{r}} \sigma_i^2 = \|M\|_F^2 - \underbrace{\sum_{k=\tilde{r}+1}^{r} \sigma_k^2}_{= \epsilon_{\tilde{r}}}$$

The SVD generalises straightforwardly to general tensors by splitting the tensor with respect to two different sets of its links.
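The relation between the truncation error and the discarded singular values can be checked numerically with the NumPy sketch below. The function name and the random test matrix are assumptions made for illustration.

```python
import numpy as np


def truncate_svd(matrix, rank):
    """Keep the `rank` largest singular values; also return the discarded
    weight, i.e. the sum of the squared discarded singular values."""
    u, s, vh = np.linalg.svd(matrix, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vh[:rank, :]
    discarded = float(np.sum(s[rank:] ** 2))
    return approx, discarded


rng = np.random.default_rng(1)
M = rng.normal(size=(6, 4))
M_trunc, eps = truncate_svd(M, rank=2)

# Squared Frobenius error of the approximation and norm of the truncated matrix.
error = np.linalg.norm(M - M_trunc, "fro") ** 2
norm_trunc = np.linalg.norm(M_trunc, "fro") ** 2
norm_full = np.linalg.norm(M, "fro") ** 2
```

Both the squared error and the loss in squared norm equal the discarded weight, which is what makes the truncation error of a TTN link directly controllable.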

### C.2 TTN truncation

When we truncate one link in the TTN to a lower bond dimension , we can isometrise the network, forming the Schmidt-decomposition by an SVD on a certain tensor. We truncate it by throwing away the smallest singular values. The error made within this truncation can be estimated accurately since the induced state fidelity explicitly depends on the sum of the squared discarded singular values. We can proceed by truncating all the links in the network in the same manner, reducing the sizes of the tensors and the total space of the TTN representation.

In this process of truncation, we stress again that the local truncation of the links ensures the mathematically best possible approximation within the lower subspace regarding the total fidelity of the TTN state. For the global truncation, however, we point out that as we iterate through the complete network with local truncations, the ordering of the local truncations may play a role in the final approximation. Even though this approach has proven to be extremely efficient (see the application to the LHCb data in App. D), the global truncation might be performed in an even further optimised way.

### C.3 QIANO Protocol

The QIANO implements a truncation of the TTN as mentioned above after the training procedure, which thereby reduces the free parameters in the network while introducing the least possible infidelity for each truncation step. In other words, one adjusts the bond dimension targeting a certain prediction time while keeping the critical amount of information captured by the TTN - and thereby a high prediction accuracy - within the smaller subspace in which we represent our quantum wave function classifier. A prediction with the truncated TTN amounts to performing a set of contractions - or vector-matrix multiplications - on lower-dimensional tensors and thus can be executed in less computational time. Performing a QIANO, we represent the TTN classifier more compactly based on the quantum information captured within the TN ansatz, resulting in a more efficient classifier concerning the prediction time versus its prediction accuracy. We can perform this reduction of free parameters without the need to relearn a new model, as would be the case with neural networks when we target a certain prediction time. We train the TTN once with a maximum bond dimension and then truncate it depending on the CPU architecture to the dimension in order to obtain the targeted .

This way of reducing the information in the TTN for the sake of prediction time, while maintaining an optimal balance between prediction time and accuracy, is of extreme value in a broad range of Machine Learning applications: The TTN offers the flexibility to adjust the prediction time depending on the requirements and the architecture of the computational system. For the analysis of the CERN data, we provide in App. D a deeper insight into the actual speed-up, including the measured prediction times for different bond dimensions (see Tab. 1).

### C.4 Equivalence with DNNs

It has been shown that artificial NNs can be mapped into TNs Glasser et al. (2018a); Robeva and Seigal (2017); Chen et al. (2018); Glasser et al. (2018b). Following the idea of the graph-based mapping, the post-learning reduction of free parameters by truncating the TTN is equivalent to reducing the number of free parameters in the DNN in a way that optimally preserves the amount of represented information within the TTN. Thus, this would not only include the dropout of different weights and neurons in the DNN, but furthermore, to some extent, restructuring the neuronal connections of the DNN, resulting in a different model. With the TTN this post-learning reduction of free parameters can be done without the need to relearn a new model, as would in general be necessary for DNNs.

## Appendix D Efficient Prediction time speed-up with QuIPS and QIANO

In the following, we present the application of the QuIPS and QIANO protocols to the LHCb problem in more detail.

### D.1 Speed-up for CERN data

Here, we consider both approaches to analyse the speed-up: Discarding features with QuIPS and optimizing the representation power of the TTN with QIANO. In Tab. 1 we show the different results for the complete 16 features and the best 8 features, both varying the bond dimension . Interestingly, when we truncate the TTN classifier from high bond-dimension we lose no prediction accuracy. In fact, we actually increase the accuracy in the test set, which might be due to the tendency towards overfitting with increasing . We notice a bigger drop in the prediction accuracy only when we truncate down to , thus gaining from to almost an order of magnitude in speed-up without decreasing the precision. Interestingly, when we train from the beginning with bond dimension , we are not able to achieve the same accuracy as we get after truncating from the learned model to .

The presented calculations were done at the CINECA Marconi Knights Landing cluster on one single processor core (Intel Xeon 8160, SkyLake at 2.10 GHz).

Furthermore, the performance of tensor contractions can be improved by a factor ranging from to , depending on the bond dimension, when executed in parallel on GPUs rather than CPUs Milsted et al. (2019). Thus, combining the TTN truncation and the post-analysis feature selection with a CPU, GPU or FPGA architecture optimally designed for TNs, we are confident that we can reach the order of prediction times required for real-time applications in the LHCb experiment.

### D.2 Minimum CPU time

During the prediction, we can perform the contractions in each layer in parallel. So in the following calculation of a lower boundary for the CPU-time we assume a perfect parallelisation and thus only take the contractions of one tensor within each layer into account. This contraction consists of two matrix-vector multiplications; the first with a -matrix and a -dimensional vector; the second with a -matrix and -vector, where is the actual dimension of the downwards directed links in the -th layer (the input layer is ; the link upwards of the last layer has dimension of the classification problem).

Therefore, the number of floating-point operations which cannot be parallelised on a higher level, but only within the execution of the single matrix-vector multiplications, is

$$N_F = \sum_{l=1}^{L+1} \chi_l^2 \chi_{l+1} + \chi_l \chi_{l+1} .$$

Thus, taking, for instance, a TTN with 8 input features and bond dimension , we can calculate the number of floating-point operations required for predicting a sample. In this case, the 4 tensors in the lowest level each address two different -dimensional local input feature spaces and merge them to a -dimensional space. Here we require FLOPS. In the next layer, we work with dimensions and , resulting in FLOPS. The last contraction projects the samples onto the -dimensional output space with FLOPS. Totalling FLOPS which can be computed by an ordinary CPU with a lower performance of GFLOPS/s (higher performance with GFLOPS/s) within ns (ns).
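The operation count can be reproduced with the short sketch below. The function names are hypothetical, and the link dimensions and processor performance in the example are illustrative values rather than the ones used in the text; the list `link_dims` holds the downward link dimension of each layer followed by the output dimension of the top tensor.

```python
def min_flops(link_dims):
    """Non-parallelisable operations for one prediction.

    Each layer contributes chi_l**2 * chi_next + chi_l * chi_next, the cost
    of the two matrix-vector multiplications of a single tensor.
    """
    return sum(
        lower ** 2 * upper + lower * upper
        for lower, upper in zip(link_dims, link_dims[1:])
    )


def prediction_time_ns(flops, gflops_per_second):
    """Lower bound in nanoseconds: 1 GFLOP/s is one operation per ns."""
    return flops / gflops_per_second


# Hypothetical example: 8 two-dimensional inputs, bond dimension 4, 2 classes.
dims = [2, 4, 4, 2]
n_ops = min_flops(dims)
print(n_ops, prediction_time_ns(n_ops, gflops_per_second=10.0))
```

With these assumed values the three layers contribute 24, 80 and 40 operations respectively.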

This lower bound for the prediction time is very encouraging in high energy physics for LHCb experiments in the real-time event selection. In particular, the high-level software trigger in the Run 2 data-taking has to take a decision approximately every 1 s Aaij et al. (2015a) and higher rates in the range of ns are expected in future Runs. Thus with the bond-dimension of this calculation example, we can reach these time scales and we showed in the paper that using QuIPS and QIANO, we still can obtain results about times more efficient than the muon tagging for such a TTN.

Finally, let us mention that we did not consider the parallelisation of the actual vector-matrix multiplication in this lower bound for the CPU time. Furthermore, this is a completely theoretical bound neglecting, for instance, the time to copy data into the cache and any overhead from the implementation.

## Appendix E Detailed description of the DNN

The DNNs studied here were implemented using the Keras Chollet et al. (2015) framework with the TensorFlow back-end. The network was built alternating a Dropout layer after each Dense layer, starting with a Batch Normalization layer. As mentioned before, we use 60% of the total data (totalling about 700k samples) for the learning procedure and 40% of it for the final testing and estimation of the performance. We further divide the learning data into training (60%) and validation set (40%). The hyper-parameters of the network (depth, number of nodes per layer, dropout rate, normalization moment, and kernel initialization) were tuned using the hyperopt Bergstra et al. (2013) package, exploring different parameter spaces in order to maximise the accuracy on the validation dataset. The ReLU activation was used for the hidden layers while a sigmoid is used for the output node. The network was trained with both the ADAM and SGD optimisers, optimising their learning rates.

The final chosen network architecture (see Fig. 10) consists of three pairs of a dense plus a dropout layer with 96 nodes per layer and a dropout rate of 0.1. The optimization was done with ADAM and a learning rate of 0.0001. The model was trained for a maximum of 250 epochs; early stopping with a patience parameter of 25 epochs on the loss in the validation set was used. The model used for evaluating the performance on the test set is the model with the best performance on the validation set.

Further, we investigated the use of different cost functions for the network optimisation. We performed the DNN optimisation for the cross-entropy loss function and with the Mean Squared Error (MSE), with which the TTN is trained as well. In the end, both cost functions lead to similar results in the prediction accuracy, the tagging power and the probability distribution (see Fig. 11). Introducing the cuts in the evaluation of the tagging power, we obtain with for the cross-entropy loss and with for the MSE.

## Appendix F Check for biases in physical quantities

Once the performances of the TTN and DNN algorithms have been established, it is necessary to check for biases in describing the main physical quantities related to jets physics: in this way, we can probe the feasibility of these new tagging methods to perform physical analysis. Typical quantities describing jets are the transverse momentum (which is the momentum perpendicular to the beam axis direction) and the pseudorapidity (defined as where is the polar angle); some cuts are applied to these quantities: GeV and , to ensure that both jets are contained in the LHCb acceptance. Every simulated event contains two jets generated by a - and a -quark, therefore labelled as and . To perform this check it is sufficient to require the TTN and the DNN to classify the flavour of just one jet: once one jet’s flavour has been established we are sure what the other jet’s flavour is.

In Figs. 12 and  12 jet distributions are shown while in Figs. 12 and  12 distributions are shown for jets generated by - and -quark respectively, all normalized to one. Results are shown for the TTN and DNN methods, compared to the so-called Monte Carlo truth (MC), which is the set of true known features of the jets resulting from the simulation process. From the plots it is clear that no evident biases are present, therefore we can conclude that not only the TTN and the DNN methods perform better than the usual muon tagging method (as seen by plotting the tagging power), but they also describe properly the physics behind the studied processes. This quick check allows us to possibly use a TTN to perform physics analyses, such as measuring the charge asymmetry in events.

## Appendix G Analysis with single particles

In the following, we present our study on the importance of the different particle types (which we recall to be electron, pion, proton, kaon and muon and their antiparticles respectively) for the final classification. In particular, we are looking at the contribution from every single particle, including all of the corresponding three features (relative momenta , charge and relative distance ), to the tagging power. This study has been performed for our TTN algorithm and provides (i) a deeper insight into the interesting features from the physical point of view, and (ii) further information to validate our feature reduction using the QuIPS protocol which, combining correlation and entropy arguments, reduces the number of features considered without appreciably decreasing the final accuracy and tagging power.

In the first part of this study, we discard one particle type only in the analysis. Thus we learn using only the remaining 4 particles and the total charge, resulting in 5 different simulations corresponding to discarding each particle correspondingly. In order to deselect just the -th particle we set all its input features , , and explicitly to zero; both in the train and in the test dataset.

In Fig. 13 the tagging power of the TTN analysis including all particles, and discarding the features of one particle type are shown. It is evident that by removing the kaon for the classification, we have a clear loss in the tagging power for the complete range of transverse momentum of the complete jet. This loss is even more significant for low jet transverse momentum . For other particles, a clear loss exceeding the statistical uncertainty can only be found for certain , such as the pion for high , or the muon for the middle range of . In order to properly understand the importance of the different particles, we can further compute the tagging power by considering only the features of one particle at a time.

In Fig. 13 we show the tagging power for using just one of the particle types for the classification: the contribution of the kaon at low jet is evident, meaning that in order to get high values of the tagging power we need to exploit the features of the kaon. Here, we can see again, even more clearly, the contribution of the pion for high transverse momentum of the jet, and of the muon for the middle range of , while the features of the proton do not play an important role in the tagging power. Interestingly, the TTN analysis using only the muon (lime curve) is still more efficient than the usual muon tagging: this is due to the fact that while the muon tagging algorithm considers only the charge of the muon, the TTN also considers other features such as and .

Concluding this study, we figured out that the kaon is the most crucial particle for the -classification followed by the muon or by the pion for very high transverse momentum of the detected jet. These insights perfectly align with the quantum-information based insights provided by our QuIPS protocol: Next to the total charge of the jet , the QuIPS suggests for this problem to use all available features of the kaon and muon, together with one feature (the charge) of the pion when reducing the total number of features from 16 to 8.

## Appendix H Other comparisons between TTN and DNN

When investigating the performances of the TTN and DNN methods, the two algorithms gave the same performances (as their accuracy and tagging power is the same within the statistical error) but we showed by the different shapes of the TTN and DNN confidence distributions that they exploit different kinds of information. Further, we found some correlation between the outputs of the two algorithms. Here, we further investigate this aspect of correlation in the predictions of the different methods by computing the so-called confusion matrix, a graphic comparison between the outputs of the two classifiers.

In Fig. 14 the confusion matrix for the TTN and the DNN methods is shown: the output of every jet analysed by the DNN (which can be $b$ ($\bar{b}$) for a jet generated by a $b$-quark ($\bar{b}$-quark), or NC for a non-classified jet) is compared to the output of the same jet classified by the TTN. A jet is labelled NC as a result of the cuts applied to the confidence distributions to maximize the tagging power. Although the outputs (i.e. the confidence distributions) of the TTN and the DNN are different, the two algorithms tend to classify jets in the same way: whenever the DNN classifies a jet as generated by a $b$-quark, it is very unlikely that the TTN wrongly classifies it as generated by a $\bar{b}$-quark (it happens of the times), and vice versa; moreover, in of the cases a jet is classified by just one classifier, while the other one does not classify it. As a last remark, when one classifier does not classify a jet (e.g. for the TTN this corresponds to the central row), the other one does not classify the same jet in of the cases, meaning that the TTN and the DNN classify (and do not classify) jets in the same way for the majority of the data. This is also confirmed by Fig. 14, where no cuts on the confidence distributions are applied: when the TTN classifies a jet as a $b$-quark ($\bar{b}$-quark), so does the DNN 95% (94%) of the time.
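The comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the analysis code: each classifier is assumed to output a probability `p_b` that the jet originates from a $b$-quark, the cut is assumed to be symmetric around 0.5, and the names `classify_with_cut`, `confusion_matrix`, and `row_normalise`, as well as the toy numbers, are hypothetical.

```python
import numpy as np

# Assumed label order; NC (non-classified) sits in the central row/column.
LABELS = ("b", "NC", "bbar")

def classify_with_cut(p_b, cut):
    """Map P(b) to 0 (b), 1 (NC) or 2 (bbar) using a symmetric cut (assumption)."""
    p_b = np.asarray(p_b, dtype=float)
    out = np.full(p_b.shape, 1, dtype=int)  # default: not classified
    out[p_b > 0.5 + cut] = 0                # confident b
    out[p_b < 0.5 - cut] = 2                # confident bbar
    return out

def confusion_matrix(rows, cols, n_classes=3):
    """Count jets by (row classifier output, column classifier output)."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(m, (rows, cols), 1)
    return m

def row_normalise(m):
    """Turn counts into per-row fractions; empty rows stay at zero."""
    totals = m.sum(axis=1, keepdims=True)
    return np.divide(m, totals, out=np.zeros(m.shape, dtype=float), where=totals > 0)

# Toy confidence outputs for six jets (made-up numbers, for illustration only).
p_ttn = [0.90, 0.80, 0.55, 0.45, 0.20, 0.10]
p_dnn = [0.95, 0.60, 0.50, 0.25, 0.15, 0.40]

ttn = classify_with_cut(p_ttn, cut=0.2)
dnn = classify_with_cut(p_dnn, cut=0.2)

matrix = confusion_matrix(ttn, dnn)   # rows: TTN output, columns: DNN output
fractions = row_normalise(matrix)
print(matrix)
print(fractions)
```

Each row of `fractions` answers the question posed in the text: given the TTN output for a jet, how often does the DNN return each of the three outputs. Setting `cut=0` reproduces the case without cuts, in which the NC row and column stay empty.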

We further checked the results of the two classifiers against the true labels coming from the MC simulation (i.e. the accuracy of the algorithms). Figs. 14 and 14 show the confusion matrices between the true labels and the DNN and TTN outputs, respectively (without cuts on the confidence distributions): in 64% of the cases the jet’s flavour is correctly classified, both for the TTN and the DNN. In Fig. 14 the same comparison is shown for the muon tagging approach: the jet’s flavour is correctly classified 75% of the time, but the number of classified jets is clearly reduced.

Figs. 15 and 15 show the confusion matrices between the two classifiers and the true labels, with the cuts maximizing the tagging power applied to the confidence distributions: when applying the cuts we correctly classify between - of the times, both for the TTN and the DNN. We can therefore conclude that both the TTN and the DNN have almost the same accuracy as the muon tagging algorithm, but they are able to process a larger number of jets, resulting in a greater efficiency and therefore a greater tagging power.
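The interplay between accuracy, efficiency, and tagging power can be made concrete with a short sketch. It assumes the standard flavour-tagging definition $\varepsilon_{\rm tag} = \varepsilon_{\rm eff}(1 - 2\omega)^2$, where $\varepsilon_{\rm eff}$ is the fraction of classified jets and $\omega$ the fraction of wrongly classified jets among them, and it assumes a symmetric cut on a single output `p_b`. The function names and the toy data are hypothetical.

```python
import numpy as np

def tagging_power(true_is_b, p_b, cut):
    """Return (efficiency, mistag fraction, tagging power) for a symmetric cut.

    Assumes tagging power = efficiency * (1 - 2 * mistag)**2 and that jets with
    |p_b - 0.5| <= cut are left unclassified (NC).
    """
    true_is_b = np.asarray(true_is_b, dtype=bool)
    p_b = np.asarray(p_b, dtype=float)
    tagged = np.abs(p_b - 0.5) > cut
    n_tagged = int(tagged.sum())
    if n_tagged == 0:
        return 0.0, 0.0, 0.0
    efficiency = n_tagged / p_b.size
    predicted_b = p_b[tagged] > 0.5
    mistag = float(np.mean(predicted_b != true_is_b[tagged]))
    return efficiency, mistag, efficiency * (1.0 - 2.0 * mistag) ** 2

def best_cut(true_is_b, p_b, cuts):
    """Pick the cut with the largest tagging power from a list of candidates."""
    return max(cuts, key=lambda c: tagging_power(true_is_b, p_b, c)[2])

# Toy sample of eight jets (made-up numbers, for illustration only).
truth = [1, 1, 1, 0, 0, 0, 1, 0]
p_b = [0.90, 0.80, 0.40, 0.10, 0.20, 0.60, 0.55, 0.45]

no_cut = tagging_power(truth, p_b, cut=0.0)     # every jet is classified
with_cut = tagging_power(truth, p_b, cut=0.25)  # only confident jets are kept
chosen = best_cut(truth, p_b, cuts=[0.0, 0.25])
print(no_cut, with_cut, chosen)
```

In this toy sample the cut halves the efficiency but removes all wrong classifications, so the tagging power rises. This is the trade-off behind choosing the cuts that maximize the tagging power.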

As a last comparison between the two classifiers, we consider the Receiver Operating Characteristic (ROC) curve in order to check the ability of the TTN and the DNN to classify $b$- and $\bar{b}$-jets as the discrimination threshold is varied. To this end, we plot the rate of correctly tagged jets (the True Positive Rate, TPR) against the rate of wrongly tagged jets (the False Positive Rate, FPR). In Fig. 16 the two ROC curves are plotted and compared with the so-called line of no-discrimination, which represents a randomly guessing classifier: the two ROC curves for the TTN and the DNN coincide, and the Area Under the Curve (AUC) for the two classifiers is almost the same ( and ).

This last check further confirms the similarity between the TTN and DNN in tagging $b$- and $\bar{b}$-jets despite relying on totally different confidence distributions.
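For reference, a ROC curve and its AUC can be computed from a classifier's confidence output as sketched below. This is a minimal NumPy illustration and not the code used for Fig. 16: it assumes that $b$-jets are treated as the positive class and that `scores` is the classifier confidence for that class. The helper names and the toy data are hypothetical.

```python
import numpy as np

def roc_curve(y_true, scores):
    """Return (FPR, TPR) arrays obtained by sweeping the threshold downwards."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)[::-1]  # distinct values, highest first
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    fpr, tpr = [0.0], [0.0]               # threshold above every score
    for t in thresholds:
        predicted_pos = scores >= t
        tpr.append((predicted_pos & y_true).sum() / n_pos)
        fpr.append((predicted_pos & ~y_true).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy sample: 1 marks the positive class (made-up numbers, for illustration only).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_curve(y_true, scores)
area = auc(fpr, tpr)
print(fpr, tpr, area)
```

A randomly guessing classifier follows the diagonal TPR = FPR, which gives an AUC of 0.5, so values above 0.5 measure how far a classifier departs from the line of no-discrimination.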