Particle Transformer for Jet Tagging

by   Huilin Qu, et al.
Peking University

Jet tagging is a critical yet challenging classification task in particle physics. While deep learning has transformed jet tagging and significantly improved performance, the lack of a large-scale public dataset impedes further enhancement. In this work, we present JetClass, a new comprehensive dataset for jet tagging. The JetClass dataset consists of 100 M jets, about two orders of magnitude larger than existing public datasets. A total of 10 types of jets are simulated, including several types unexplored for tagging so far. Based on the large dataset, we propose a new Transformer-based architecture for jet tagging, called Particle Transformer (ParT). By incorporating pairwise particle interactions in the attention mechanism, ParT achieves higher tagging performance than a plain Transformer and surpasses the previous state-of-the-art, ParticleNet, by a large margin. The pre-trained ParT models, once fine-tuned, also substantially enhance the performance on two widely adopted jet tagging benchmarks.



1 Introduction

Machine learning has revolutionized how large-scale data samples are analyzed in particle physics and greatly increased the discovery potential for new fundamental laws of nature Radovic et al. (2018). Specifically, deep learning has transformed how jet tagging, a critical classification task at high-energy particle colliders such as the CERN LHC, is performed, leading to a drastic improvement in its performance Kogler et al. (2019); Larkoski et al. (2020).

Figure 1: Illustration of jet tagging at the CERN LHC. High-energy proton-proton collisions at the LHC can produce new unstable particles that decay and yield a collimated spray of outgoing particles. These outgoing particles are measured by complex particle detector systems, and jets can be built (“reconstructed”) from these measured particles. The goal of jet tagging is to classify the jets and identify those arising from particles of high interest, e.g., the Higgs boson, the $W$ or $Z$ boson, or the top quark.

At the CERN LHC, two beams of protons are accelerated to nearly the speed of light and made to collide at a frequency of 40 million times per second (40 MHz). Such high-energy collisions can create new unstable particles, which then decay and produce sprays of outgoing particles. Complex detector systems, such as the general-purpose ATLAS Aad et al. (2008) and CMS Chatrchyan et al. (2008) detectors with $\mathcal{O}(100\,\mathrm{M})$ individual sensors of various types, are used to measure the positions, trajectories, energies, and momenta of the outgoing particles. From these measurements, an event is reconstructed for each collision. The primary goal in the analysis of the collision data is to identify events involving novel physics processes, an example of which is the discovery of the Higgs boson Aad et al. (2012); Chatrchyan et al. (2012).

Figure 2: Examples of the 10 types of jets in the JetClass dataset, viewed as particle clouds. Each particle is displayed as a marker, with its coordinates corresponding to the flying direction of the particle, and its size proportional to the energy. The circles, triangles (upward- or downward-directed), and pentagons represent the particle types, which are hadrons, leptons (electrons or muons), and photons, respectively. The solid (hollow) markers stand for electrically charged (neutral) particles. The marker color reflects the displacement of the particle trajectory from the interaction point of the proton-proton collision, where a larger displacement results in more blue.

A crucial step in the data analysis process is jet tagging. A jet refers to a collimated spray of outgoing particles. Jet tagging is the process of identifying the type of particle that initiates a jet. It is essentially a classification task that aims to distinguish jets arising from particles of interest, such as the Higgs boson or the top quark, from other less interesting types of jets. Jet tagging is a challenging task because the particle initiating a jet can radiate, and the radiated particles further produce more particles, leading to a cascade of $\mathcal{O}(10)$ to $\mathcal{O}(100)$ particles at the end. The radiation also smears the characteristics of the initial particle and makes the identification very difficult.

Traditional approaches for jet tagging rely on hand-crafted features motivated by the principles of quantum chromodynamics (QCD), the theory governing the evolution of particles inside a jet. The rise of deep learning has led to a plethora of new approaches Larkoski et al. (2020). The prevailing ones represent a jet as a particle cloud, i.e., an unordered and variable-sized set of the outgoing particles, as illustrated in Figure 1. Based on the particle cloud representation, ParticleNet Qu & Gouskos (2020) adapts the Dynamic Graph CNN architecture Wang et al. (2019) and achieves substantial performance improvement on two representative jet tagging benchmarks. Since then, several new models (e.g., Mikuni & Canelli (2020, 2021); Shimmin (2021)) have been proposed, but no significant performance improvement has been reported so far. We deem the lack of a sufficiently large public dataset an impeding factor.

In this work, we advocate for JetClass, a new large and comprehensive dataset to advance deep learning for jet tagging. The JetClass dataset consists of 100 M jets for training, about two orders of magnitude larger than existing public datasets. It also includes more types of jets, several of which have not been explored for tagging yet but are promising for future applications at the LHC.

Based on this dataset, we propose Particle Transformer (ParT), a new Transformer-based architecture for jet tagging. We demonstrate that Transformer architectures, together with a large dataset, can also lead to powerful performance on jet tagging. Furthermore, we introduce a small modification to the attention mechanism by incorporating a new term characterizing pairwise particle interactions. The resulting ParT achieves significantly higher performance than a plain Transformer and surpasses the previous state-of-the-art, ParticleNet, by a large margin. We also apply the pre-trained ParT models to the two widely adopted jet tagging benchmarks with fine-tuning and observe a substantial gain on these tasks. The dataset, code, and pre-trained models will be made publicly available.

2 The JetClass Dataset

We provide an overview of the new JetClass dataset in this section. The dataset includes a total of 10 types of jets. Representative jets of each type are visualized as particle clouds in Figure 2. The jets in this dataset generally fall into two categories. The background jets are initiated by light quarks or gluons ($q/g$) and are ubiquitously produced at the LHC. The signal jets are those arising either from the top quarks ($t$), or from the $W$, $Z$ or Higgs ($H$) bosons. For top quarks and Higgs bosons, we further consider their different decay modes as separate types, because the resulting jets have rather distinct characteristics and are often tagged individually. The use of jet tagging typically involves selecting one (or a few) specific types of signal jets with high confidence, and rejecting background jets as much as possible, since the background jets usually appear orders of magnitude more frequently than the targeted signal jets. Note that for several types of signal jets in this dataset, such as $H \to 4q$, $H \to \ell\nu qq'$, and $t \to b\ell\nu$, no dedicated methods have been developed so far to tag them. However, as we will demonstrate in Section 5.1, these types of jets can also be cleanly tagged with deep learning approaches, opening up new possible territories for jet tagging at the LHC.

Simulation setup. Jets in this dataset are simulated with standard Monte Carlo event generators used by LHC experiments. The production and decay of the top quarks and the $W$, $Z$ and Higgs bosons are generated with MadGraph5_aMC@NLO Alwall et al. (2014). We use Pythia Sjöstrand et al. (2015) to evolve the produced particles, i.e., performing parton showering and hadronization, and produce the final outgoing particles. (Footnote 1: We include multiple parton interactions but omit pileup interactions in the simulation.) To be close to realistic jets reconstructed at the ATLAS or CMS experiment, detector effects are simulated with Delphes de Favereau et al. (2014) using the CMS detector configuration provided in Delphes. In addition, the impact parameters of electrically charged particles are smeared to match the resolution of the CMS tracking detector Chatrchyan et al. (2014). Jets are clustered from Delphes E-Flow objects with the anti-$k_T$ algorithm Cacciari et al. (2008, 2012) using a distance parameter $R = 0.8$. Only jets with transverse momentum in 500–1000 GeV and pseudorapidity $|\eta| < 2$ are considered. For signal jets, only the “high-quality” ones that fully contain the decay products of initial particles are included. (Footnote 2: We require that all the quarks ($q$) and charged leptons (electrons or muons, denoted $\ell$) from the decay of the top quark or the $W$, $Z$ or Higgs boson satisfy $\Delta R(\mathrm{jet}, q/\ell) < 0.8$, where $\Delta R(a, b) \equiv \sqrt{(\eta_a - \eta_b)^2 + (\phi_a - \phi_b)^2}$, in which $\eta$ ($\phi$) is the pseudorapidity (azimuthal angle) of the momentum of the jet or the particle.)
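
The containment requirement for “high-quality” signal jets can be illustrated with a short sketch. This is not the authors' selection code: the function names, the `(eta, phi)` tuple layout, and the `max_dr` parameter are hypothetical, and wrapping the azimuthal difference into $[-\pi, \pi)$ is a common convention assumed here rather than something stated in the text.

```python
import math


def delta_phi(phi_a, phi_b):
    """Azimuthal difference wrapped into [-pi, pi) (assumed convention)."""
    return (phi_a - phi_b + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta_a, phi_a, eta_b, phi_b):
    """Angular distance sqrt(d_eta^2 + d_phi^2) between two directions."""
    return math.hypot(eta_a - eta_b, delta_phi(phi_a, phi_b))


def is_fully_contained(jet, decay_products, max_dr):
    """True if every decay product lies within max_dr of the jet axis.

    jet and each decay product are (eta, phi) tuples (hypothetical layout);
    max_dr is the containment radius chosen by the caller.
    """
    jet_eta, jet_phi = jet
    return all(
        delta_r(jet_eta, jet_phi, eta, phi) < max_dr
        for eta, phi in decay_products
    )
```

A call such as `is_fully_contained((0.0, 0.0), [(0.1, 0.2), (-0.3, 0.1)], max_dr=0.8)` would keep the jet, whereas a single decay product outside the radius rejects it.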

Input features. The dataset provides all constituent particles of each jet as inputs for jet tagging. Note that the number of particles varies from jet to jet, typically between 10 and 100, with an average of 30–50 depending on the jet type. For each particle of a jet, three categories of features are provided:

  • Kinematics. This includes the energy and momentum, described by the 4-vector $(E, p_x, p_y, p_z)$ in units of GeV, which are the most fundamental quantities measured by a particle detector. All other kinematic variables can be computed from the 4-vectors.

  • Particle identification. This includes the electric charge, with values of $\pm 1$ (positively/negatively charged particles) and 0 (neutral particles), and the particle identity determined by the detector systems. For the latter, a 5-class one-hot encoding should be used to be consistent with current LHC experiments: charged hadron ($\pm 211$), neutral hadron (0), electron ($\pm 11$), muon ($\pm 13$), and photon (22). Note that a different definition can easily lead to significantly altered performance, so the definition above should be followed carefully to ensure consistency in the comparison between algorithms. The particle identification information is especially important for tagging jets involving a charged lepton, e.g., $H \to \ell\nu qq'$ and $t \to b\ell\nu$, as leptons can be almost unambiguously identified at the LHC.

  • Trajectory displacement. This includes the measured values and errors of the transverse and longitudinal impact parameters of the particle trajectories in units of mm, in total 4 variables. These measurements are only available for electrically charged particles, and a value of 0 is used for neutral particles. The trajectory displacement information is critical for tagging jets involving a bottom ($b$) or charm ($c$) quark, such as $H \to b\bar{b}$, $H \to c\bar{c}$, $t \to bqq'$, etc., but is missing from most of the existing datasets.
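
The encoding rules in the list above can be sketched in a few lines. This is an illustrative sketch only: the function and constant names are hypothetical, the integer codes are assumed to follow the PDG-style numbering (absolute values 211, 0, 11, 13, 22 for the five classes), and the ordering of the one-hot vector is an arbitrary choice made here.

```python
# Order of the 5 classes in the one-hot vector (arbitrary choice for this sketch).
CLASSES = ("charged_hadron", "neutral_hadron", "electron", "muon", "photon")

# Assumed PDG-style codes, keyed by absolute value.
CODE_TO_CLASS = {
    211: "charged_hadron",
    0: "neutral_hadron",
    11: "electron",
    13: "muon",
    22: "photon",
}


def one_hot_pid(code):
    """Map a signed particle code to the 5-class one-hot encoding."""
    name = CODE_TO_CLASS[abs(code)]
    return [1 if cls == name else 0 for cls in CLASSES]


def displacement_features(charge, d0, d0_err, dz, dz_err):
    """Return the 4 impact-parameter variables; all zeros for neutral particles."""
    if charge == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [d0, d0_err, dz, dz_err]
```

Collapsing the sign of the code means that particles and antiparticles share a class, which matches the 5-class definition; the electric charge is carried separately as its own feature.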

Training, validation and test sets. The training set consists of 100 M jets in total, equally distributed in the 10 classes. An additional set of 5 M jets is intended for model validation. For the evaluation of performance, a separate test set with 2 M jets in each class (in total 20 M) is provided.

Evaluation metrics. To thoroughly evaluate the performance of deep learning models on this dataset, we advocate for a series of metrics. Since jet tagging on this dataset is naturally framed as a multi-class classification task, two common metrics, i.e., the accuracy and the area under the ROC curve (AUC), are adopted to quantify the overall performance. (Footnote 3: The AUC can be calculated using roc_auc_score in scikit-learn with average='macro' and multi_class='ovo'.) In addition, we propose the background rejection (i.e., the inverse of the false positive rate, FPR) at a certain signal efficiency (i.e., the true positive rate, TPR) of $X$%, i.e.,

$$\mathrm{Rej}_{X\%} \equiv 1/\mathrm{FPR} \quad \text{at} \quad \mathrm{TPR} = X\%, \tag{1}$$

for each type of signal jets. By default, the $q/g$ jets are considered as the background, as is the case in most LHC data analyses, and each of the other 9 types of jets can be considered as the signal. The signal efficiency (TPR) for each signal type is chosen to be representative of actual usages at the LHC experiments and is typically 50%. It is increased to 99% (99.5%) for $H \to \ell\nu qq'$ ($t \to b\ell\nu$), as these types of jets have more distinct characteristics and can be more easily separated from $q/g$ jets. Since the definition of the metric involves only two classes, i.e., the signal class under consideration ($S$) and the background class ($B$), we use a two-class score,

$$\mathrm{score}_{S\,\mathrm{vs}\,B} = \frac{\mathrm{score}(S)}{\mathrm{score}(S) + \mathrm{score}(B)}, \tag{2}$$

where $\mathrm{score}(S)$ and $\mathrm{score}(B)$ are the softmax outputs for class $S$ and $B$, respectively, to achieve optimal performance for $S$ vs $B$ separation. This is aligned with the convention adopted by the CMS experiment Sirunyan et al. (2020b). Note that the background rejection metric, although rarely used in vision or language tasks, is actually a standard metric for jet tagging because it is directly related to the discovery potential at the LHC experiments. A factor of two increase in background rejection can lead to about 40% increase in the discovery potential, which would otherwise require a dataset of twice the size, or in other words, doubling the running time of the LHC.
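
As an illustration of how the background rejection could be evaluated from model outputs, the sketch below computes the two-class score from a pair of softmax outputs and then the inverse false positive rate at a target signal efficiency. The function names are hypothetical, and the threshold rule (the tightest cut that keeps at least the target fraction of signal jets) is one reasonable convention assumed here, not necessarily the one used for the published numbers.

```python
import math


def two_class_score(p_sig, p_bkg):
    """Signal-vs-background score built from two softmax outputs."""
    total = p_sig + p_bkg
    return p_sig / total if total > 0.0 else 0.0


def background_rejection(sig_scores, bkg_scores, target_tpr):
    """Return 1/FPR at the tightest threshold with TPR >= target_tpr.

    sig_scores and bkg_scores are two-class scores of true signal and true
    background jets. Returns math.inf if no background jet passes the cut.
    """
    ranked = sorted(sig_scores, reverse=True)
    n_keep = max(1, math.ceil(target_tpr * len(ranked)))
    threshold = ranked[n_keep - 1]
    n_false_pos = sum(1 for s in bkg_scores if s >= threshold)
    if n_false_pos == 0:
        return math.inf
    return len(bkg_scores) / n_false_pos
```

With small samples the result is coarse, since the false positive rate can only take values that are multiples of one over the number of background jets; the test sets described above are large enough for this granularity not to matter.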

3 Related Work

Jet tagging with deep learning. Deep learning approaches have been proposed extensively to improve jet tagging. Previous models handle jets with different representations, e.g., images de Oliveira et al. (2016), sequences Guest et al. (2016), trees Louppe et al. (2019), and graphs Henrion et al. (2017), with corresponding deep learning architectures such as 2D CNNs, recurrent or recursive networks, and graph neural networks. More recently, the particle cloud representation Komiske et al. (2019b); Qu & Gouskos (2020), analogous to point clouds, which treats a jet as a permutation-invariant set of particles as visualized in Figure 2, has been proposed. The Deep Sets Zaheer et al. (2017) and Dynamic Graph CNN Wang et al. (2019) architectures are adapted for jet tagging, resulting in the Energy Flow Network Komiske et al. (2019b) and the state-of-the-art, ParticleNet Qu & Gouskos (2020), respectively. Since then, particle clouds have become the prevailing representation of jets, and more architectures, based on GAPNet Chen et al. (2019); Mikuni & Canelli (2020) and the Point Cloud Transformer Guo et al. (2021); Mikuni & Canelli (2021), have been studied, but no significant performance improvement over ParticleNet has been reported. Lately, research has focused more on incorporating inductive biases from physics principles in the architecture design, such as the usage of the Lund jet plane Dreyer et al. (2018); Dreyer & Qu (2021); Dreyer et al. (2021), the Lorentz group symmetry Bogatskiy et al. (2020); Gong et al. (2022), and the rotational symmetry Shimmin (2021).

Deep-learning-based jet tagging algorithms have been widely adopted in real-world data analysis at the LHC. For example, the CMS Collaboration develops the DeepAK8 Sirunyan et al. (2020b) algorithm to tag jets arising from the top quark or the Higgs, $W$, or $Z$ boson, using a 1D CNN following the ResNet He et al. (2016) architecture, and a significant increase in the discovery potential for new heavy particles has been achieved Sirunyan et al. (2021); Tumasyan et al. (2022). Moreover, the ParticleNet architecture has been used by CMS to probe the quartic interaction between the Higgs and vector bosons, indirectly confirming its existence for the first time CMS Collaboration (2021). Clearly, advances in jet tagging play a vital role in accelerating our understanding of elementary particles, the fundamental building blocks of nature.

Jet tagging datasets. A number of datasets have been published so far to study jet tagging:

  • Top quark tagging dataset Kasieczka et al. (2019) proposed in Butter et al. (2019), consisting of 2 M jets in 2 types ($t \to bqq'$ and $q/g$) and providing only the kinematic information.

  • Quark-gluon tagging dataset Komiske et al. (2019a) proposed in Komiske et al. (2019b), consisting of 2 M jets in 2 types (quark and gluon), and providing both the kinematic and particle identification information.

  • Higgs boson tagging dataset Duarte (2019); Chen et al. (2021), containing 3.9 M $H \to b\bar{b}$ jets and 1.9 M $q/g$ jets, with all three categories of information.

  • JetNet dataset Kansal et al. (2021b) proposed in Kansal et al. (2021a), containing jets in 3 types: gluon, light quark, and top quark, and providing only the kinematic information.

  • A multiclass dataset Pierini et al. (2020) proposed in Moreno et al. (2020), with 880 k jets in 5 classes: light quark, gluon, $W$ boson, $Z$ boson and top quark, and providing only the kinematic information.

Compared with existing datasets, the JetClass dataset is not only substantially larger in size, but also more inclusive in terms of the types of jets contained.


Transformers. Recent years have witnessed the enormous success of Transformer models. Starting from natural language processing and then spreading to computer vision, the original Transformer Vaswani et al. (2017), as well as its variants, e.g., BERT Devlin et al. (2019), ViT Dosovitskiy et al. (2021) and Swin-Transformer Liu et al. (2021), have refreshed the performance records in various tasks, demonstrating the power of the Transformer as a universal architecture. Transformers, and the attention mechanism at their core, have proved to be powerful for fundamental scientific problems as well. For example, AlphaFold2 Jumper et al. (2021), which reaches the state-of-the-art performance in protein structure prediction, employs the attention mechanism. In particular, adding a pair bias, derived from pairwise features, to the self attention helps improve the model explainability.

4 Model Architecture

Figure 3: The architecture of (a) Particle Transformer (b) Particle Attention Block (c) Class Attention Block.

Together with the JetClass dataset, we propose the Particle Transformer (ParT) as a new baseline for jet tagging. An overview of the ParT architecture is presented in Figure 3(a). For a jet with $N$ particles, ParT makes use of two sets of inputs: the particle input includes a list of $C$ features for every particle and forms an array of shape $(N, C)$; the interaction input is a matrix of $C'$ features for every pair of particles, of shape $(N, N, C')$. (Footnote 4: The term interaction here refers to any feature involving a pair of particles, which may or may not be related to the physical forces between them.) The particle and interaction inputs are each followed by an MLP to project them to a $d$- and $d'$-dimensional embedding, $\mathbf{x}^0 \in \mathbb{R}^{N \times d}$ and $\mathbf{U} \in \mathbb{R}^{N \times N \times d'}$, respectively. Unlike Transformers for NLP and vision, we do not add any ad-hoc positional encodings, as the particles in a jet are permutation invariant. The spatial information (i.e., the flying direction of each particle) is directly included in the particle inputs. We feed the particle embedding $\mathbf{x}^0$ into a stack of $L$ particle attention blocks to produce new embeddings, $\mathbf{x}^1, \ldots, \mathbf{x}^L$, via multi-head self attention. The interaction matrix $\mathbf{U}$ is used to augment the scaled dot-product attention by adding it as a bias to the pre-softmax attention weights. The same $\mathbf{U}$ is used for all the particle attention blocks. After that, the last particle embedding $\mathbf{x}^L$ is fed into two class attention blocks, and a global class token $\mathbf{x}_{\mathrm{class}}$ is used to extract information for jet classification via attention to all the particles, following the CaiT approach Touvron et al. (2021). The class token is passed to a single-layer MLP, followed by softmax, to produce the final classification scores.

Remark. ParT can also be viewed as a graph neural network on a fully-connected graph, in which each node corresponds to a particle, and the interactions are the edge features.

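Particle interaction features. While the ParT architecture is designed to be able to process any kind of pairwise interaction features, for this paper we only consider a specific scenario in which the interaction features are derived from the energy-momentum 4-vector, $p = (E, p_x, p_y, p_z)$, of each particle. This is the most general case for jet tagging, as the particle 4-vectors are available in every jet tagging task. Specifically, for a pair of particles $a$, $b$ with 4-vectors $p_a$, $p_b$, we calculate the following 4 features:

$$\begin{aligned}
\Delta &= \sqrt{(y_a - y_b)^2 + (\phi_a - \phi_b)^2}, \\
k_T &= \min(p_{T,a}, p_{T,b})\,\Delta, \\
z &= \min(p_{T,a}, p_{T,b}) / (p_{T,a} + p_{T,b}), \\
m^2 &= (E_a + E_b)^2 - \lVert \mathbf{p}_a + \mathbf{p}_b \rVert^2,
\end{aligned} \tag{3}$$

where $y_i$ is the rapidity, $\phi_i$ is the azimuthal angle, $p_{T,i} = (p_{x,i}^2 + p_{y,i}^2)^{1/2}$ is the transverse momentum, $\mathbf{p}_i = (p_{x,i}, p_{y,i}, p_{z,i})$ is the momentum 3-vector, and $\lVert \cdot \rVert$ is the norm, for $i = a$, $b$. Since these variables typically have a long-tail distribution, we take the logarithm and use $(\ln \Delta, \ln k_T, \ln z, \ln m^2)$ as the interaction features for each particle pair. The choice of this set of features is motivated by Dreyer & Qu (2021).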
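
A minimal sketch of the four pairwise features follows, assuming the definitions $\Delta = \sqrt{\Delta y^2 + \Delta\phi^2}$, $k_T = \min(p_{T,a}, p_{T,b})\,\Delta$, $z = \min(p_{T,a}, p_{T,b})/(p_{T,a} + p_{T,b})$ and $m^2 = (E_a + E_b)^2 - \lVert\mathbf{p}_a + \mathbf{p}_b\rVert^2$. The function name, the tuple layout, the wrapping of the azimuthal difference, and the small `eps` clamp applied before the logarithm are assumptions made for illustration and are not taken from the text.

```python
import math


def pairwise_features(pa, pb, eps=1e-8):
    """Return (ln Delta, ln kT, ln z, ln m^2) for two 4-vectors.

    pa and pb are (E, px, py, pz) tuples in GeV. Values are clamped to eps
    before the logarithm (an assumption, to avoid log(0) for collinear or
    massless pairs). Particles with E == |pz| are not handled.
    """
    def pt(p):
        return math.hypot(p[1], p[2])

    def rapidity(p):
        return 0.5 * math.log((p[0] + p[3]) / (p[0] - p[3]))

    def phi(p):
        return math.atan2(p[2], p[1])

    # Wrap the azimuthal difference into [-pi, pi).
    dphi = (phi(pa) - phi(pb) + math.pi) % (2.0 * math.pi) - math.pi
    delta = math.hypot(rapidity(pa) - rapidity(pb), dphi)

    pt_min = min(pt(pa), pt(pb))
    kt = pt_min * delta
    z = pt_min / (pt(pa) + pt(pb))

    e_sum = pa[0] + pb[0]
    p_sum_sq = sum((pa[i] + pb[i]) ** 2 for i in (1, 2, 3))
    m2 = e_sum * e_sum - p_sum_sq

    return tuple(math.log(max(v, eps)) for v in (delta, kt, z, m2))
```

All four quantities are symmetric under exchange of the two particles, so the resulting feature matrix is symmetric and only the upper triangle needs to be computed in practice.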

Particle attention block. A key component of ParT is the particle attention block. As illustrated in Figure 3(b), the particle attention block consists of two stages. The first stage includes a multi-head attention (MHA) module with a LayerNorm (LN) layer both before and afterwards. The second stage is a 2-layer MLP, with an LN before each linear layer and GELU nonlinearity in between. Residual connections are added after each stage. The overall block structure is based on NormFormer Shleifer et al. (2021). However, we replace the standard MHA with P-MHA, an augmented version that can also exploit the pairwise particle interactions directly. The P-MHA is computed as

$$\text{P-MHA}(Q, K, V) = \mathrm{SoftMax}\left(QK^T/\sqrt{d_k} + \mathbf{U}\right) V, \tag{4}$$

where $Q$, $K$ and $V$ are linear projections of the particle embedding $\mathbf{x}^l$, and $d_k$ is the query dimension of each head. Essentially, we add the interaction matrix $\mathbf{U}$ to the pre-softmax attention weights. This allows P-MHA to incorporate particle interaction features designed from physics principles and modify the dot-product attention weights, thus increasing the expressiveness of the attention mechanism.
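
The idea of adding the interaction matrix to the pre-softmax attention weights can be sketched in NumPy as follows. This is not the authors' implementation; it is a bare multi-head attention with an additive bias, under the assumptions that the interaction embedding provides one $N \times N$ bias matrix per head, that projection biases, dropout and LayerNorm are omitted, and that all names (`p_mha`, `Wq`, `Wo`, ...) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def p_mha(x, U, Wq, Wk, Wv, Wo):
    """Multi-head self attention with an additive pairwise bias.

    x:  (N, d) particle embeddings.
    U:  (H, N, N) bias, one matrix per head (assumed layout).
    Wq, Wk, Wv, Wo: (d, d) projection matrices; d must be divisible by H.
    Returns the (N, d) output and the (H, N, N) attention weights.
    """
    n, d = x.shape
    h = U.shape[0]
    dk = d // h

    def split_heads(t):
        # (N, d) -> (H, N, dk)
        return t.reshape(n, h, dk).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    # Scaled dot-product logits plus the pairwise bias, before the softmax.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(dk) + U
    attn = softmax(logits, axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ Wo, attn
```

Setting `U` to zeros recovers plain multi-head attention, which corresponds to the ablation in which the pairwise input is dropped; a large positive entry in `U` steers a particle's attention towards the corresponding partner regardless of the dot-product term.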

Class attention block. As illustrated in Figure 3(c), the class attention block has a similar structure to the particle attention block. However, unlike in the particle attention block where we compute the self attention between particles, here we compute the attention between a global class token $\mathbf{x}_{\mathrm{class}}$ and all the particles using the standard MHA. Specifically, the inputs to the MHA are

$$\begin{aligned}
Q &= W_q \mathbf{x}_{\mathrm{class}} + b_q, \\
K &= W_k \mathbf{z} + b_k, \\
V &= W_v \mathbf{z} + b_v,
\end{aligned} \tag{5}$$

where $\mathbf{z} = [\mathbf{x}_{\mathrm{class}}, \mathbf{x}^L]$ is the concatenation of the class token and the particle embedding after the last particle attention block, $\mathbf{x}^L$.
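
The class attention step can be illustrated with a single-head NumPy sketch in which only the class token forms the query, while keys and values come from the concatenation of the class token and the particle embeddings. The function name, the single-head simplification, and the row-vector convention (`x @ W + b`) are assumptions made for brevity, not details from the text.

```python
import numpy as np


def class_attention(x_class, x_last, Wq, bq, Wk, bk, Wv, bv):
    """Single-head attention of a class token over itself and all particles.

    x_class: (d,) class token.
    x_last:  (N, d) particle embeddings from the last particle attention block.
    W*: (d, d) weights and b*: (d,) biases (hypothetical parameter layout).
    Returns the updated (d,) class token and the (N + 1,) attention weights.
    """
    z = np.concatenate([x_class[None, :], x_last], axis=0)  # (N + 1, d)
    q = x_class @ Wq + bq   # query from the class token only
    k = z @ Wk + bk         # keys from class token and particles
    v = z @ Wv + bv         # values from class token and particles
    logits = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v, weights
```

Because the particles enter only through a weighted sum, the updated class token does not depend on the order in which the particles are listed, consistent with treating a jet as a permutation-invariant set.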


Implementation. We implement the ParT model in PyTorch Paszke et al. (2019). Specifically, the P-MHA is implemented using PyTorch's MultiheadAttention by providing the interaction matrix $\mathbf{U}$ as the attn_mask input. The baseline ParT model has a total of $L = 8$ particle attention blocks and 2 class attention blocks. It uses a particle embedding of dimension $d = 128$, encoded from the input particle features using a 3-layer MLP with (128, 512, 128) nodes in each layer with GELU nonlinearity, and LN is used in between for normalization. The interaction input features are encoded using a 4-layer pointwise 1D convolution with (64, 64, 64, 16) channels with GELU nonlinearity and batch normalization in between to yield the $d'$-dimensional interaction matrix. The P-MHA (MHA) in the particle (class) attention blocks all have 8 heads, with a query dimension $d_k = 16$ for each head, and an expansion factor of 4 for the MLP. We use a dropout of 0.1 for all particle attention blocks, and no dropout for the class attention block.

5 Experiments

We conduct experiments on the new JetClass dataset and show the results in Section 5.1. The pre-trained ParT models are also applied to two existing datasets with fine-tuning, and the performance is compared to the previous state of the art in Section 5.2.

5.1 Experiments on JetClass Dataset

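| Model | Accuracy (all classes) | AUC (all classes) | $H \to b\bar{b}$ $\mathrm{Rej}_{50\%}$ | $H \to c\bar{c}$ $\mathrm{Rej}_{50\%}$ | $H \to gg$ $\mathrm{Rej}_{50\%}$ | $H \to 4q$ $\mathrm{Rej}_{50\%}$ | $H \to \ell\nu qq'$ $\mathrm{Rej}_{99\%}$ | $t \to bqq'$ $\mathrm{Rej}_{50\%}$ | $t \to b\ell\nu$ $\mathrm{Rej}_{99.5\%}$ | $W \to qq'$ $\mathrm{Rej}_{50\%}$ | $Z \to q\bar{q}$ $\mathrm{Rej}_{50\%}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PFN | 0.772 | 0.9714 | 2924 | 841 | 75 | 198 | 265 | 797 | 721 | 189 | 159 |
| P-CNN | 0.809 | 0.9789 | 4890 | 1276 | 88 | 474 | 947 | 2907 | 2304 | 241 | 204 |
| ParticleNet | 0.844 | 0.9849 | 7634 | 2475 | 104 | 954 | 3339 | 10526 | 11173 | 347 | 283 |
| ParT | 0.861 | 0.9877 | 10638 | 4149 | 123 | 1864 | 5479 | 32787 | 15873 | 543 | 402 |
| ParT (plain) | 0.849 | 0.9859 | 9569 | 2911 | 112 | 1185 | 3868 | 17699 | 12987 | 384 | 311 |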
Table 1: Jet tagging performance on the JetClass dataset. ParT is compared to PFN Komiske et al. (2019b), P-CNN Sirunyan et al. (2020b) and the state-of-the-art ParticleNet Qu & Gouskos (2020). For all the metrics, a higher value indicates better performance. The ParT architecture using plain MHAs instead of P-MHAs, labelled as ParT (plain), is also shown for comparison.

Setup. For experiments on the JetClass dataset, we use the full set of particle features, including kinematics, particle identification, and trajectory displacement, as inputs. The full list of 17 features for each particle is summarized in Table 5. In addition, the 4 interaction features introduced in Equation 3 are also used for the ParT model. The training is performed on the full training set of 100 M jets. We employ the Lookahead optimizer Zhang et al. (2019) with $k = 6$ and $\alpha = 0.5$ to minimize the cross-entropy loss, and the inner optimizer is RAdam Liu et al. (2020) with $\beta_1 = 0.95$, $\beta_2 = 0.999$, and $\epsilon = 10^{-5}$. A batch size of 512 and an initial learning rate (LR) of 0.001 are used. No weight decay is applied. We train for a total of 1 M iterations, amounting to around 5 epochs over the full training set. The LR remains constant for the first 70% of the iterations, and then decays exponentially, at an interval of every 20 k iterations, down to 1% of the initial value at the end of the training. Performance of the model is evaluated every 20 k iterations on the validation set and a model checkpoint is saved. The checkpoint with the highest accuracy on the validation set is used to evaluate the final performance on the test set.
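
The learning-rate schedule described above (constant for the first 70% of the iterations, then an exponential decay applied in steps of 20 k iterations down to 1% of the initial value) can be sketched as a plain function. The function name and arguments are hypothetical, and the exact stepping convention, in particular whether the first decay step is applied immediately at the 70% mark, is an assumption.

```python
def learning_rate(iteration, total_iters=1_000_000, base_lr=1e-3,
                  flat_frac=0.7, final_ratio=0.01, step=20_000):
    """Flat-then-stepwise-exponential LR schedule (illustrative sketch).

    The LR equals base_lr for the first flat_frac of training, then is
    multiplied by a constant factor every `step` iterations so that it
    reaches base_lr * final_ratio during the last interval.
    """
    flat_iters = round(total_iters * flat_frac)
    if iteration < flat_iters:
        return base_lr
    n_steps = (total_iters - flat_iters) // step
    # Index of the current decay interval, counted from 1 (assumed convention).
    k = min((iteration - flat_iters) // step + 1, n_steps)
    return base_lr * final_ratio ** (k / n_steps)
```

With the default arguments there are 15 decay intervals, so each one scales the LR by a factor of $0.01^{1/15} \approx 0.74$. The quoted epoch count also follows directly from the setup: 1 M iterations at a batch size of 512 correspond to 512 M jets, i.e., about 5 passes over the 100 M training jets.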

Baselines. We compare the performance of ParT with 3 baseline models: the PFN Komiske et al. (2019b) architecture based on Deep Sets Zaheer et al. (2017), the P-CNN architecture used by the DeepAK8 algorithm of the CMS experiment Sirunyan et al. (2020b), and the state-of-the-art ParticleNet architecture Qu & Gouskos (2020) adapted from DGCNN Wang et al. (2019). All the models are trained end-to-end on the JetClass dataset for the same number of effective epochs for a direct comparison. For ParticleNet, we directly use the existing PyTorch implementation. For PFN and P-CNN, we re-implement them in PyTorch and verify that the published results are reproduced. The optimizer and LR schedule remain the same as in the training of ParT. The (batch size, LR) combination is re-optimized and chosen to be (512, 0.01) for ParticleNet and (4096, 0.02) for PFN and P-CNN.

Results. Performance on the JetClass dataset is evaluated using the metrics described in Section 2, and the results are summarized in Table 1. The proposed ParT architecture achieves the best performance on every metric, and outperforms the existing state-of-the-art, ParticleNet, by a large margin. The overall accuracy is increased by 1.7% compared to ParticleNet. Moreover, for the physics-oriented metric, the background rejection, ParT improves over ParticleNet by a factor of 3 for $t \to bqq'$, a factor of 2 for $H \to 4q$, and about 70% for $H \to c\bar{c}$. It is also clear that the earlier PFN and P-CNN models lag substantially behind ParticleNet and ParT on this large dataset, amounting to up to an order of magnitude difference in background rejection. The large improvement of ParT is likely to lead to a significant jump in the discovery potential for related physics searches at the LHC.

Another observation is that there is a large variation in tagging performance between signals of different types. The best separation against the background $q/g$ jets is achieved for $t \to b\ell\nu$ and $H \to \ell\nu qq'$ signals: with the powerful ParT model, these two can be selected almost perfectly, i.e., at an efficiency of more than 99% with nearly no contamination from background jets. This opens up new territory for jet tagging at the LHC, as these types of jets have not been exploited for tagging so far.

Ablation study. To quantify the effectiveness of the P-MHA introduced in ParT, we carried out an ablation study by replacing the P-MHA with a standard MHA. The resulting architecture is then a plain Transformer and is therefore denoted as ParT (plain). We train ParT (plain) with the same procedure as the full ParT and the performance is shown in Table 1. An accuracy drop of 1.2% is observed compared to the full ParT, and the background rejection is reduced by 20–30% for most signals. Note that replacing P-MHA with plain MHA implies that the particle interaction input is discarded completely, but this does not imply a reduction of information content, as the interaction features defined in Equation 3 are derived purely from the energy-momentum 4-vectors, which are already used as particle features via the 7 kinematic variables presented in Table 5. Therefore, the improvement of ParT over a plain Transformer indeed arises from an efficient exploitation of the particle kinematic information using the P-MHA.

Model complexity. Table 2 compares the model complexity of ParT with the baselines. While the number of trainable parameters is increased by more than $5\times$ compared to ParticleNet, the number of floating point operations (FLOPs) is actually 40% lower. We also observe that the FLOPs of ParT are 30% higher than ParT (plain), which mostly comes from the encoding of the pairwise features, because the computational cost there scales quadratically with the number of particles in a jet.
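
The relative figures quoted in this paragraph can be checked against the entries of Table 2 with a few lines of arithmetic. The dictionary below simply transcribes the table; the helper name `ratio` is hypothetical.

```python
# (trainable parameters, FLOPs) transcribed from Table 2.
TABLE_2 = {
    "PFN": (86.1e3, 4.62e6),
    "P-CNN": (354e3, 15.5e6),
    "ParticleNet": (370e3, 540e6),
    "ParT": (2.14e6, 340e6),
    "ParT (plain)": (2.13e6, 260e6),
}


def ratio(model, reference, column):
    """Ratio of one model's parameter count or FLOPs to a reference model's."""
    idx = {"params": 0, "flops": 1}[column]
    return TABLE_2[model][idx] / TABLE_2[reference][idx]


params_vs_particlenet = ratio("ParT", "ParticleNet", "params")   # about 5.8
flops_vs_particlenet = ratio("ParT", "ParticleNet", "flops")     # about 0.63
flops_vs_plain = ratio("ParT", "ParT (plain)", "flops")          # about 1.31
```

The table entries therefore give a parameter increase of roughly 5.8 times, a FLOPs reduction of about 37% relative to ParticleNet, and a FLOPs increase of about 31% relative to ParT (plain), in line with the rounded values in the text.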

| Model | Accuracy | # params | FLOPs |
|---|---|---|---|
| PFN | 0.772 | 86.1 k | 4.62 M |
| P-CNN | 0.809 | 354 k | 15.5 M |
| ParticleNet | 0.844 | 370 k | 540 M |
| ParT | 0.861 | 2.14 M | 340 M |
| ParT (plain) | 0.849 | 2.13 M | 260 M |
Table 2: Number of trainable parameters and FLOPs.

5.2 Fine-Tuning for Other Datasets

Top quark tagging dataset. The top quark tagging benchmark Butter et al. (2019) provides a dataset of 2 M (1.2/0.4/0.4 M for train/validation/test) jets in two classes, $t \to bqq'$ (signal) and $q/g$ (background). Only kinematic features, i.e., the energy-momentum 4-vectors, are provided. Therefore, we pre-train a ParT model on the JetClass dataset also using only the kinematic features, and then fine-tune it on the top quark tagging dataset. The particle input features are the 7 kinematic features listed in Table 5, the same as used by ParticleNet. The JetClass pre-training follows the same setup as described in Section 5.1. For the fine-tuning, we replace the last MLP with a new randomly-initialized MLP with 2 output nodes, and then fine-tune all the weights on the top tagging dataset for 20 epochs. A smaller LR of 0.0001 is used for the pre-trained weights, while a larger LR of 0.005 is used to update the randomly-initialized weights of the MLP. The LR remains constant across the full training, with a weight decay of 0.01. We run a total of 9 experiments, starting from the same pre-trained model but different random initializations of the replaced MLP, and report the performance of the model with median accuracy and the spread, following the procedure used by ParticleNet. For comparison, we also trained ParT from scratch on this dataset for 20 epochs, using an initial LR of 0.001, a schedule that decays the LR to 1% in the last 30% of the epochs, and a weight decay of 0.01. Both results are presented in Table 3. The pre-trained ParT achieves a significant improvement over the existing baselines, increasing $\mathrm{Rej}_{30\%}$ by 70% compared to the previous state-of-the-art, ParticleNet. On the other hand, the ParT model trained from scratch only reaches similar performance to ParticleNet. This highlights the benefits of a large dataset for jet tagging.

Model              Accuracy   AUC      Rej50%   Rej30%
P-CNN              0.930      0.9803
PFN                -          0.9819
ParticleNet        0.940      0.9858
JEDI-net (w/ ΣO)   0.930      0.9807            774.6
PCT                0.940      0.9855
LGN                0.929      0.964
rPCN               -          0.9845
ParT               0.940      0.9858
ParT-f.t.          0.944      0.9877
Table 3: Comparison between ParT and existing models on the top quark tagging dataset. ParT-f.t. denotes the model pre-trained on JetClass and fine-tuned on this dataset. ParT refers to the model trained from scratch on this dataset. Results for other models are quoted from their published results: P-CNN and ParticleNet Qu & Gouskos (2020), PFN Komiske et al. (2019b), JEDI-net Moreno et al. (2020), PCT Mikuni & Canelli (2021), LGN Bogatskiy et al. (2020), and rPCN Shimmin (2021).

Quark-gluon tagging dataset. We also benchmark ParT on the quark-gluon tagging dataset Komiske et al. (2019a) proposed in Komiske et al. (2019b), whose goal is to separate jets initiated by quarks (signal) from those initiated by gluons (background). This dataset also consists of 2 M jets, with a recommended train/validation/test split of 1.6/0.2/0.2 M. It provides not only the kinematic features but also particle identification information. We consider two scenarios for the use of the particle identification information. In the “exp” scenario, we restrict the information to only 5 classes and do not attempt to separate electrically charged (or neutral) hadrons of different types, which is the procedure adopted by ParticleNet and also the one prescribed by the JetClass dataset. In the “full” scenario, we consider all particle types and further distinguish electrically charged (and neutral) hadrons into more types. We perform the pre-training on JetClass using only the kinematic and particle identification inputs, which follows the first scenario, as that is the one prescribed for JetClass. For the fine-tuning, we then carry out experiments under both scenarios. The pre-training and fine-tuning setup is the same as in the top quark tagging benchmark, and the fine-tuning also lasts for 20 epochs. Results are summarized in Table 4. The pre-trained ParT achieves the best performance and improves on existing baselines by a large margin in both scenarios.
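The two scenarios can be illustrated with the particle identification definitions listed in Table 5 (Appendix A). The sketch below assumes that the weighted expressions given in the footnotes of that table are the ones used in the “full” scenario; the function names are hypothetical.

```python
def pid_flags_exp(pid):
    """5-class particle identification flags ("exp" scenario)."""
    a = abs(pid)
    return {
        "Electron": float(a == 11),
        "Muon": float(a == 13),
        "Photon": float(pid == 22),
        "CH": float(a in (211, 321, 2212)),   # any charged hadron
        "NH": float(a in (130, 2112, 0)),     # any neutral hadron
    }


def pid_flags_full(pid):
    """Flags that also separate hadron types ("full" scenario, assumed)."""
    a = abs(pid)
    flags = pid_flags_exp(pid)
    # Different charged hadron species map to different values.
    flags["CH"] = (a == 211) + (a == 321) * 0.5 + (a == 2212) * 0.2
    # Different neutral hadron species map to different values.
    flags["NH"] = (a == 130) + (a == 2112) * 0.2
    return flags


kaon_exp = pid_flags_exp(321)     # charged kaon, exp scenario
kaon_full = pid_flags_full(321)   # charged kaon, full scenario
```

In the “exp” encoding a charged pion, kaon and proton are indistinguishable (all give CH = 1), while the “full” encoding assigns them 1, 0.5 and 0.2 respectively.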

Model                Accuracy   AUC      Rej50%   Rej30%
P-CNN (exp)          0.827      0.9002   34.7     91.0
PFN (exp)            -          0.9005
ParticleNet (exp)    0.840      0.9116
rPCN (exp)           -          0.9081
ParT (exp)           0.840      0.9121
ParT-f.t. (exp)      0.843      0.9151

ABCNet (full)        0.840      0.9126
PCT (full)           0.841      0.9140
ParT (full)          0.849      0.9203
ParT-f.t. (full)     0.852      0.9230
Table 4: Comparison between ParT and existing models on the quark-gluon tagging dataset. ParT-f.t. denotes the model pre-trained on JetClass and fine-tuned on this dataset. ParT refers to the model trained from scratch on this dataset. Results for other models are quoted from their published results: P-CNN and ParticleNet Qu & Gouskos (2020), PFN Komiske et al. (2019b), ABCNet Mikuni & Canelli (2020), PCT Mikuni & Canelli (2021), and rPCN Shimmin (2021). The labels “exp” and “full” distinguish models using partial or full particle identification information.

6 Discussion and Conclusion

Large-scale datasets have always been a catalyst for new breakthroughs in deep learning. In this work, we present JetClass, a new large-scale open dataset to advance deep learning research in particle physics. The dataset consists of 100 M simulated jets, about two orders of magnitude larger than existing public jet datasets, and covers a broad spectrum of 10 classes of jets in total, including several novel types that have not been studied with deep learning so far. While we focus on a classification task, i.e., jet tagging, with this dataset, we highlight that it can serve as the basis for many important lines of deep learning research in particle physics, such as unsupervised or self-supervised training techniques for particle physics (e.g., Dillon et al. (2021)), generative models for high-fidelity fast simulation of particle collisions (e.g., Kansal et al. (2021a)), regression models to predict jet energy and momentum with higher precision (e.g., Sirunyan et al. (2020a)), and more. We invite the community to explore and experiment with this dataset and extend the boundary of deep learning and particle physics even further.

With this large dataset, we introduce Particle Transformer (ParT), a new architecture that substantially improves jet tagging performance over the previous state of the art. We propose it as a new jet tagging baseline for future research to improve upon. The effectiveness of ParT arises mainly from the augmented self-attention, in which we incorporate physics-inspired pairwise interactions together with the machine-learned dot-product attention. This approach is likely to be effective for other tasks on similar datasets, such as point clouds or many-body systems, especially when prior knowledge is available to describe the interaction or the geometry. On the other hand, one limitation of using the full pairwise interaction matrix is the increase in computational time and memory consumption. Novel approaches to particle (point) embeddings and self-attention that alleviate the computational cost could be an interesting direction for future research.
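As an illustration of the augmented self-attention discussed above, the following minimal single-head sketch adds a pairwise interaction matrix to the scaled dot-product scores before the softmax. Treating the interaction term as an additive pre-softmax bias is an assumption of this sketch, and the names `U` and `attention_with_pairwise_bias` are hypothetical; it also makes the quadratic cost in the number of particles explicit, since `U` has one entry per particle pair.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_pairwise_bias(Q, K, V, U):
    """Single-head attention with scores Q K^T / sqrt(d) + U.

    Q, K: n x d lists; V: n x d_v list; U: n x n pairwise interaction matrix.
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        scores = [
            sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d) + U[i][j]
            for j in range(n)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[j] * V[j][c] for j in range(n))
            for c in range(len(V[0]))
        ])
    return out


# Two particles with zero queries/keys: only the bias U shapes the weights.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
no_bias = attention_with_pairwise_bias(Q, K, V, [[0.0, 0.0], [0.0, 0.0]])
biased = attention_with_pairwise_bias(Q, K, V, [[0.0, -1e9], [0.0, -1e9]])
```

With no bias both particles attend equally to each other, whereas a strongly negative entry in `U` suppresses attention to the corresponding particle regardless of the learned dot product.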


References

  • Aad et al. (2008) Aad, G. et al. The ATLAS Experiment at the CERN Large Hadron Collider. JINST, 3:S08003, 2008. doi: 10.1088/1748-0221/3/08/S08003.
  • Aad et al. (2012) Aad, G. et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Phys. Lett. B, 716:1–29, 2012. doi: 10.1016/j.physletb.2012.08.020.
  • Alwall et al. (2014) Alwall, J., Frederix, R., Frixione, S., Hirschi, V., Maltoni, F., Mattelaer, O., Shao, H. S., Stelzer, T., Torrielli, P., and Zaro, M. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP, 07:079, 2014. doi: 10.1007/JHEP07(2014)079.
  • Bogatskiy et al. (2020) Bogatskiy, A., Anderson, B., Offermann, J., Roussi, M., Miller, D., and Kondor, R. Lorentz group equivariant neural network for particle physics. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 992–1002. PMLR, 13–18 Jul 2020.
  • Butter et al. (2019) Butter, A. et al. The Machine Learning landscape of top taggers. SciPost Phys., 7:014, 2019. doi: 10.21468/SciPostPhys.7.1.014.
  • Cacciari et al. (2008) Cacciari, M., Salam, G. P., and Soyez, G. The anti-kt jet clustering algorithm. JHEP, 04:063, 2008. doi: 10.1088/1126-6708/2008/04/063.
  • Cacciari et al. (2012) Cacciari, M., Salam, G. P., and Soyez, G. FastJet User Manual. Eur. Phys. J. C, 72:1896, 2012. doi: 10.1140/epjc/s10052-012-1896-2.
  • Chatrchyan et al. (2008) Chatrchyan, S. et al. The CMS Experiment at the CERN LHC. JINST, 3:S08004, 2008. doi: 10.1088/1748-0221/3/08/S08004.
  • Chatrchyan et al. (2012) Chatrchyan, S. et al. Observation of a New Boson at a Mass of 125 GeV with the CMS Experiment at the LHC. Phys. Lett. B, 716:30–61, 2012. doi: 10.1016/j.physletb.2012.08.021.
  • Chatrchyan et al. (2014) Chatrchyan, S. et al. Description and performance of track and primary-vertex reconstruction with the CMS tracker. JINST, 9(10):P10009, 2014. doi: 10.1088/1748-0221/9/10/P10009.
  • Chen et al. (2019) Chen, C., Fragonara, L. Z., and Tsourdos, A. Gapnet: Graph attention based point neural network for exploiting local feature of point cloud. arXiv preprint arXiv:1905.08705, 2019.
  • Chen et al. (2021) Chen, Y., Huerta, E. A., Duarte, J., Harris, P., Katz, D. S., Neubauer, M. S., Diaz, D., Mokhtar, F., Kansal, R., Park, S. E., Kindratenko, V. V., Zhao, Z., and Rusack, R. A FAIR and AI-ready Higgs Boson Decay Dataset. arXiv:2108.02214 [hep-ex, physics:hep-ph], August 2021.
  • CMS Collaboration (2021) CMS Collaboration. Search for Higgs boson pair production via vector boson fusion with highly Lorentz-boosted Higgs bosons in the four b quark final state at √s = 13 TeV. Technical report, CERN, Geneva, 2021.
  • de Favereau et al. (2014) de Favereau, J., Delaere, C., Demin, P., Giammanco, A., Lemaître, V., Mertens, A., and Selvaggi, M. DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP, 02:057, 2014. doi: 10.1007/JHEP02(2014)057.
  • de Oliveira et al. (2016) de Oliveira, L., Kagan, M., Mackey, L., Nachman, B., and Schwartzman, A. Jet-images — deep learning edition. JHEP, 07:069, 2016. doi: 10.1007/JHEP07(2016)069.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs], May 2019.
  • Dillon et al. (2021) Dillon, B. M., Kasieczka, G., Olischlager, H., Plehn, T., Sorrenson, P., and Vogel, L. Symmetries, Safety, and Self-Supervision. August 2021.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Dreyer et al. (2021) Dreyer, F., Soyez, G., and Takacs, A. Quarks and gluons in the Lund plane. December 2021.
  • Dreyer & Qu (2021) Dreyer, F. A. and Qu, H. Jet tagging in the Lund plane with graph networks. JHEP, 03:052, 2021. doi: 10.1007/JHEP03(2021)052.
  • Dreyer et al. (2018) Dreyer, F. A., Salam, G. P., and Soyez, G. The Lund Jet Plane. JHEP, 12:064, 2018. doi: 10.1007/JHEP12(2018)064.
  • Duarte (2019) Duarte, J. Sample with jet, track and secondary vertex properties for Hbb tagging ML studies HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC, 2019.
  • Gong et al. (2022) Gong, S., Meng, Q., Zhang, J., Qu, H., Li, C., Qian, S., Du, W., Ma, Z.-M., and Liu, T.-Y. An Efficient Lorentz Equivariant Graph Neural Network for Jet Tagging. January 2022.
  • Guest et al. (2016) Guest, D., Collado, J., Baldi, P., Hsu, S.-C., Urban, G., and Whiteson, D. Jet Flavor Classification in High-Energy Physics with Deep Neural Networks. Phys. Rev. D, 94(11):112002, 2016. doi: 10.1103/PhysRevD.94.112002.
  • Guo et al. (2021) Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R. R., and Hu, S.-M. PCT: Point cloud transformer. Computational Visual Media, 7(2):187–199, June 2021. ISSN 2096-0433, 2096-0662. doi: 10.1007/s41095-021-0229-5.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
  • Henrion et al. (2017) Henrion, I., Brehmer, J., Bruna, J., Cho, K., Cranmer, K., Louppe, G., and Rochette, G. Neural Message Passing for Jet Physics. In Deep Learning for Physical Sciences Workshop at the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017.
  • Jumper et al. (2021) Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., and Hassabis, D. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, August 2021. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-021-03819-2.
  • Kansal et al. (2021a) Kansal, R., Duarte, J., Su, H., Orzari, B., Tomei, T., Pierini, M., Touranakou, M., Vlimant, J.-R., and Gunopulos, D. Particle Cloud Generation with Message Passing Generative Adversarial Networks. arXiv:2106.11535 [hep-ex], October 2021a.
  • Kansal et al. (2021b) Kansal, R., Duarte, J., Su, H., Orzari, B., Tomei, T., Pierini, M., Touranakou, M., Vlimant, J.-R., and Gunopulos, D. Jetnet, May 2021b.
  • Kasieczka et al. (2019) Kasieczka, G., Plehn, T., Thompson, J., and Russel, M. Top Quark Tagging Reference Dataset, March 2019.
  • Kogler et al. (2019) Kogler, R. et al. Jet Substructure at the Large Hadron Collider: Experimental Review. Rev. Mod. Phys., 91(4):045003, 2019. doi: 10.1103/RevModPhys.91.045003.
  • Komiske et al. (2019a) Komiske, P., Metodiev, E., and Thaler, J. Pythia8 Quark and Gluon Jets for Energy Flow, May 2019a.
  • Komiske et al. (2019b) Komiske, P. T., Metodiev, E. M., and Thaler, J. Energy Flow Networks: Deep Sets for Particle Jets. JHEP, 01:121, 2019b. doi: 10.1007/JHEP01(2019)121.
  • Larkoski et al. (2020) Larkoski, A. J., Moult, I., and Nachman, B. Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning. Phys. Rept., 841:1–63, 2020. doi: 10.1016/j.physrep.2019.11.001.
  • Liu et al. (2020) Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2020.
  • Liu et al. (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022, October 2021.
  • Louppe et al. (2019) Louppe, G., Cho, K., Becot, C., and Cranmer, K. QCD-Aware Recursive Neural Networks for Jet Physics. JHEP, 01:057, 2019. doi: 10.1007/JHEP01(2019)057.
  • Mikuni & Canelli (2020) Mikuni, V. and Canelli, F. ABCNet: An attention-based method for particle tagging. Eur. Phys. J. Plus, 135(6):463, 2020. doi: 10.1140/epjp/s13360-020-00497-3.
  • Mikuni & Canelli (2021) Mikuni, V. and Canelli, F. Point cloud transformers applied to collider physics. Mach. Learn. Sci. Tech., 2(3):035027, 2021. doi: 10.1088/2632-2153/ac07f6.
  • Moreno et al. (2020) Moreno, E. A., Cerri, O., Duarte, J. M., Newman, H. B., Nguyen, T. Q., Periwal, A. a., Pierini, M., Serikova, A., Spiropulu, M., and Vlimant, J.-R. JEDI-net: a jet identification algorithm based on interaction networks. Eur. Phys. J. C, 80(1):58, 2020. doi: 10.1140/epjc/s10052-020-7608-4.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
  • Pierini et al. (2020) Pierini, M., Duarte, J. M., Tran, N., and Freytsis, M. Hls4ml lhc jet dataset (150 particles), January 2020.
  • Qu & Gouskos (2020) Qu, H. and Gouskos, L. ParticleNet: Jet Tagging via Particle Clouds. Phys. Rev. D, 101(5):056019, 2020. doi: 10.1103/PhysRevD.101.056019.
  • Radovic et al. (2018) Radovic, A., Williams, M., Rousseau, D., Kagan, M., Bonacorsi, D., Himmel, A., Aurisano, A., Terao, K., and Wongjirad, T. Machine learning at the energy and intensity frontiers of particle physics. Nature, 560(7716):41–48, 2018. doi: 10.1038/s41586-018-0361-2.
  • Shimmin (2021) Shimmin, C. Particle Convolution for High Energy Physics. July 2021.
  • Shleifer et al. (2021) Shleifer, S., Weston, J., and Ott, M. Normformer: Improved transformer pretraining with extra normalization. arXiv preprint arXiv:2110.09456, 2021.
  • Sirunyan et al. (2020a) Sirunyan, A. M. et al. A Deep Neural Network for Simultaneous Estimation of b Jet Energy and Resolution. Comput. Softw. Big Sci., 4(1):10, 2020a. doi: 10.1007/s41781-020-00041-z.
  • Sirunyan et al. (2020b) Sirunyan, A. M. et al. Identification of heavy, energetic, hadronically decaying particles using machine-learning techniques. JINST, 15(06):P06005, 2020b. doi: 10.1088/1748-0221/15/06/P06005.
  • Sirunyan et al. (2021) Sirunyan, A. M. et al. Search for top squark production in fully-hadronic final states in proton-proton collisions at 13 TeV. Phys. Rev. D, 104(5):052001, 2021. doi: 10.1103/PhysRevD.104.052001.
  • Sjöstrand et al. (2015) Sjöstrand, T., Ask, S., Christiansen, J. R., Corke, R., Desai, N., Ilten, P., Mrenna, S., Prestel, S., Rasmussen, C. O., and Skands, P. Z. An introduction to PYTHIA 8.2. Comput. Phys. Commun., 191:159–177, 2015. doi: 10.1016/j.cpc.2015.01.024.
  • Touvron et al. (2021) Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 32–42, October 2021.
  • Tumasyan et al. (2022) Tumasyan, A. et al. Search for resonances decaying to three W bosons in proton-proton collisions at √s = 13 TeV. January 2022.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Wang et al. (2019) Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., and Solomon, J. M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph., 38(5), October 2019. ISSN 0730-0301. doi: 10.1145/3326362.
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Zhang et al. (2019) Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Appendix A Input features

Category                 Variable          Datasets            Definition
Kinematics               Δη                JetClass, Top, QG   difference in pseudorapidity between the particle and the jet axis
Kinematics               Δφ                JetClass, Top, QG   difference in azimuthal angle between the particle and the jet axis
Kinematics               log pT            JetClass, Top, QG   logarithm of the particle’s transverse momentum
Kinematics               log E             JetClass, Top, QG   logarithm of the particle’s energy
Kinematics               log(pT/pT(jet))   JetClass, Top, QG   logarithm of the particle’s transverse momentum relative to the jet transverse momentum
Kinematics               log(E/E(jet))     JetClass, Top, QG   logarithm of the particle’s energy relative to the jet energy
Kinematics               ΔR                JetClass, Top, QG   angular separation between the particle and the jet axis (√((Δη)² + (Δφ)²))
Particle identification  charge            JetClass, QG        electric charge of the particle
Particle identification  Electron          JetClass, QG        if the particle is an electron (|pid|==11)
Particle identification  Muon              JetClass, QG        if the particle is a muon (|pid|==13)
Particle identification  Photon            JetClass, QG        if the particle is a photon (pid==22)
Particle identification  CH                JetClass, QG        if the particle is a charged hadron (|pid|==211 or 321 or 2212)
Particle identification  NH                JetClass, QG        if the particle is a neutral hadron (|pid|==130 or 2112 or 0)
Trajectory displacement  tanh d0           JetClass            hyperbolic tangent of the transverse impact parameter value
Trajectory displacement  tanh dz           JetClass            hyperbolic tangent of the longitudinal impact parameter value
Trajectory displacement  σ(d0)             JetClass            error of the measured transverse impact parameter
Trajectory displacement  σ(dz)             JetClass            error of the measured longitudinal impact parameter
Footnotes: for QG (full), CH is computed as (|pid|==211) + (|pid|==321)*0.5 + (|pid|==2212)*0.2 and NH as (|pid|==130) + (|pid|==2112)*0.2.
Table 5: Particle input features used for jet tagging on the JetClass, the top quark tagging (Top) and the quark-gluon tagging (QG) datasets. For QG, we consider two scenarios: QG (exp) is restricted to the 5-class, experimentally realistic particle identification information, while QG (full) uses the full set of particle identification information in the dataset and further distinguishes between different types of charged hadrons and neutral hadrons.
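The seven kinematic features of Table 5 can be computed from the particle and jet four-momenta as in the following minimal sketch. The dictionary keys, the sign convention (particle minus jet), the use of natural logarithms and the wrapping of the azimuthal difference into [-π, π) are assumptions made for illustration.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference phi1 - phi2 wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def kinematic_features(part, jet):
    """Kinematic input features of one particle relative to its jet.

    `part` and `jet` are dicts with keys "pt", "eta", "phi", "energy"
    (hypothetical layout chosen for this sketch).
    """
    deta = part["eta"] - jet["eta"]
    dphi = delta_phi(part["phi"], jet["phi"])
    return {
        "deta": deta,
        "dphi": dphi,
        "log_pt": math.log(part["pt"]),
        "log_e": math.log(part["energy"]),
        "log_ptrel": math.log(part["pt"] / jet["pt"]),
        "log_erel": math.log(part["energy"] / jet["energy"]),
        "delta_r": math.hypot(deta, dphi),  # sqrt(deta^2 + dphi^2)
    }


# Invented example values; the particle and jet sit on opposite sides
# of the phi = ±pi boundary, so the wrapping matters.
particle = {"pt": 10.0, "eta": 0.5, "phi": 3.0, "energy": 20.0}
jet = {"pt": 100.0, "eta": 0.2, "phi": -3.0, "energy": 200.0}
features = kinematic_features(particle, jet)
```

Without the wrapping step the azimuthal difference in this example would be 6.0 rather than roughly -0.28, which would distort both Δφ and ΔR for particles near the boundary.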