Learning representations of irregular particle-detector geometry with distance-weighted graph networks

02/21/2019 · by Shah Rukh Qasim, et al.

We explore the use of graph networks to deal with irregular-geometry detectors in the context of particle reconstruction. Thanks to their representation-learning capabilities, graph networks can exploit the full detector granularity, while natively managing the event sparsity and arbitrarily complex detector geometries. We introduce two distance-weighted graph network architectures, dubbed GarNet and GravNet layers, and apply them to a typical particle reconstruction task. The performance of the new architectures is evaluated on a data set of simulated particle interactions on a toy model of a highly granular calorimeter, loosely inspired by the endcap calorimeter to be installed in the CMS detector for the High-Luminosity LHC phase. We study the clustering of energy depositions, which is the basis for calorimetric particle reconstruction, and provide a quantitative comparison to alternative approaches. The proposed algorithms outperform existing methods or reach competitive performance with lower computing-resource consumption. Being geometry-agnostic, the new architectures are not restricted to calorimetry and can be easily adapted to other use cases, such as tracking in silicon detectors.


1 Introduction

Machine Learning (ML) techniques have traditionally been a key ingredient of event processing at particle colliders, employed in tasks such as particle reconstruction (clustering), identification (classification), and energy or direction measurement (regression) in calorimeters and tracking devices. The first applications of Neural Networks to High Energy Physics (HEP) date back to the ’80s [1, 2, 3, 4]. Starting with the MiniBooNE experiment [5], Boosted Decision Trees became the state of the art and played a crucial role in the discovery of the Higgs boson by the ATLAS and CMS experiments [6]. Recently, a series of studies on different aspects of LHC data taking and data processing workflows have demonstrated the potential of Deep Learning (DL) in collider applications, both as a way to speed up current algorithms and to improve their performance. Nevertheless, the list of DL models actually deployed in the centralized workflows of the LHC experiments remains quite short.¹

¹ As an example, at the moment such a list for the CMS experiment consists of a set of b-tagging algorithms [7, 8] and a data-quality monitoring algorithm for the muon drift-tube chambers [9]. Other applications exist at the analysis level, downstream from the centralized event processing. In data analyses, one typically considers abstract four-momenta rather than low-level quantities such as detector hits, making the use of DL techniques easier.

Many of these studies, which are typically proof-of-concept demonstrations, are based on convolutional neural networks (CNNs) [10], which perform computer vision tasks by applying translation-invariant kernels to raw digital images. CNN architectures applied to HEP data thus impose a requirement that the particle detectors be represented as regular arrays of sensors. This requirement, common to many of the approaches described in Section 2, creates problems for realistic applications of CNNs in collider experiments.²

² The picture is completely different in other HEP domains. For instance, CNNs have been successfully deployed in neutrino experiments, where the regular-array assumption matches the geometry of a typical detector.

In this work, we propose novel Deep Learning architectures based on graph networks to improve the performance and reduce the execution time of typical particle-reconstruction tasks, such as cluster reconstruction and particle identification. In contrast to CNNs, graph networks can learn optimal detector-hit representations without making specific assumptions on the detector geometry. In particular, no preprocessing of the detector data is required, even for detectors with irregular geometries. We consider the specific case of calorimetric particle reconstruction, for which this characteristic of graph networks may become especially relevant in the near future. In view of the High-Luminosity LHC phase, the endcap calorimeter of the CMS detector will be replaced by a novel-design digital calorimeter, the High Granularity Calorimeter (HGCAL), consisting of arrays of hexagonal silicon sensor cells interleaved with absorber layers [11]. Being positioned close to the beam pipe and exposed to a large number of overlapping proton-proton collisions per bunch crossing, this detector will be characterized by high occupancy over its large number of readout channels. Downstream in the data processing pipeline, the unprecedented number of sensors and their geometry will increase the event size and consequently the computational needs, calling for novel data processing approaches given the expected computing limitations [12]. The detector we consider in this study, described in detail in Section 4, is loosely inspired by the HGCAL geometry. In particular, it features a similarly irregular sensor structure, with sensor sizes varying with the detector depth as well as within a single layer. On the other hand, the hexagonal HGCAL sensors were traded for square-shaped sensors, in order to keep the computing resources needed to generate the training data set within a manageable limit.

As a benchmark application, we consider the basis for all further particle reconstruction tasks in a calorimeter: clustering of the recorded energy deposits into disentangled showers from individual particles. To this purpose, we introduce two novel distance-weighted graph network architectures, the GarNet and the GravNet layers, which are designed to provide a good balance between performance and computing-resource needs for inference. While our discussion is limited to a calorimetry-related problem, the design of these new layer architectures is such that it automatically generalizes to any kind of sparse data, such as the hits collected by a typical tracking device or the reconstructed particle candidates inside a hadronic jet. We believe that architectures of this kind are more practical to deploy in a realistic experimental environment and could become relevant for the LHC experiments, both for offline and real-time event processing and selection.

This paper is structured as follows: Section 2 reviews related previous works. In Section 3, we describe the GravNet and GarNet architectures. Section 4 describes the data set used for this study. Section 5 introduces the metric used to optimize the networks. Section 6 describes the models. The results are presented in Sections 7 and 8 in terms of the accuracy and computational efficiency, respectively. Conclusions are presented in Section 9.

2 Related Work

In recent years, deep learning models, and in particular CNNs, have become very popular in different areas of HEP. CNNs have been successfully applied to calorimeter-oriented tasks, including particle identification [13, 14, 15, 11, 16], energy regression [13, 15, 11, 16], hadronic jet identification [17, 18, 19, 20], fast simulation [21, 22, 13, 23, 24], and pileup subtraction in jets [25]. Many of these works assume a simplified detector description: the detector is represented as a more or less regular array of sensors expressed as a 2D or 3D image, and the problem of overlapping regions at the transition between detector components (e.g. barrel and endcap) is ignored. Sometimes the fixed-grid pixel shape is intended to reflect the typical angular resolution of the detector, which is implicitly assumed to be constant, while in reality it depends on the energy of the incoming particle.

Overcoming the need for a regular structure motivated original research on graph-based networks [26], which are in general suited to processing point-wise data with no regular structure by representing them as vertices in a graph. A comprehensive overview of various graph-based networks can be found in Ref. [27]. The connections between the vertices (edges) usually define paths of information exchange [28, 29]. In some cases, the edge and vertex properties are used to infer the attention (weight) assigned to each neighbour during this information exchange [30], however without changing the neighbour relations themselves. Particularly interesting for irregular detectors are networks that are capable of learning the geometry, as studied in combination with message passing [31]. Within this approach, the adjacency matrix is trainable. The idea has also been adopted in the context of jet identification [32]. Although this approach is promising, its downside is the need to connect all vertices with each other, which makes it unsuitable for graphs with a large number of vertices, as the memory requirement becomes prohibitively high. This problem is overcome by defining only a subset of connections between neighbours in a learnable space representation, where the edge properties of each vertex with respect to a limited number of its neighbours are used to calculate a new feature representation per vertex, which is then passed to the next layer of similar structure [33]. This approach is implemented in the EdgeConv layer and the corresponding DGCNN model [33]. The neighbours are selected based on the new vertex features, which makes it particularly challenging to create a gradient for training with respect to the neighbour selection; the DGCNN model works around this issue by using the edge features themselves. However, due to the dynamic calculation of neighbour relations in a high-dimensional space, this network requires substantial computing resources, which would make its use for triggering purposes in collider detectors unfeasible.

Some of these architectures have already been considered for collider physics, in the context of jet tagging [32], event topology classification [34], and pileup subtraction [35].

3 The GravNet and GarNet layers

Figure 1: Pictorial representation of the data flow across the GarNet and GravNet layers. (a) The input features of each vertex are processed by a dense neural network with two output arrays: a set of learned features and spatial information in a learned representation space. (b) In the case of the GravNet layer, the spatial output is interpreted as the coordinates of the vertices in an abstract space. The graph is built in this space, connecting each vertex to its N closest neighbours (N = 4 in the figure), ranked by the Euclidean distance between the vertices. (c) In the case of the GarNet layer, the spatial output is interpreted directly as the distances between each vertex and a set of aggregators in an abstract space. The graph is then built connecting each vertex to each aggregator, with these learned distances attached to the edges. (d) Once the graph structure is established, the features of the vertices connected to a given vertex or aggregator are scaled by a potential, a function of the distance, and gathered across the graph into new features of that vertex or aggregator (e.g. by summing over the edges or taking the maximum). (e) For each choice of gathering function, a new set of features is generated and concatenated to the initial feature vector. The resulting feature vector is given as input to a dense neural network with a nonlinear activation, which returns the output representation of the vertex.

The neural network layers proposed in this study are designed to provide competitive performance on particle reconstruction tasks while dealing with data sparsity in an efficient way, keeping a trainable space representation at minimal computational cost. The layers receive as input a batch of examples, each represented by a set of detector hits (vertices), with each hit described by a fixed number of features. For instance, the features could include the Cartesian coordinates of a given sensor, its address (layer number, module number, etc.), the sensor time stamp, the recorded energy, and so on.

A pictorial representation of the operations performed by the two layers is shown in Figure 1. For both architectures, the first step is to apply a dense³ neural network to each of the detector hits, deriving two output arrays from the input features: the first array is interpreted as a set of coordinates in a learned representation space (for the GravNet layer) or as the distances between the considered vertex and a set of aggregators (for the GarNet layer); the second array is interpreted as a learned representation of the vertex features. At this point, a given input example is converted into a graph whose vertices are embedded in the abstract space defined by the first array, each carrying the learned features derived from the initial inputs. The projection from the input features to this graph is linear, with trainable weights and bias vectors.

³ Here and in the following, a dense layer refers to a learnable weight-matrix multiplication and bias-vector addition applied to the last feature dimension, with weights shared over all other dimensions. In this case, the weights and biases act on the vertex features and are shared across the vertices. This can also be thought of as a 2D convolution with a 1×1 kernel.

The main difference between the GravNet and GarNet architectures lies in the way the vertices are connected when building the graph. In the case of the GravNet layer, the Euclidean distances between pairs of vertices in the learned space are used to associate to each vertex its N closest neighbours. In the case of the GarNet layer, the graph is built by connecting each of the vertices to a set of aggregators; what is learned in this case is the distance between each vertex and each aggregator.
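The following is a minimal TensorFlow sketch, not the authors' released implementation, of this first stage for a GravNet-style layer: a dense projection produces learned coordinates and learned features, and the N closest neighbours are selected in the learned coordinate space. The numbers of coordinates and learned features are illustrative assumptions; a GarNet-style layer would instead interpret a dense output directly as the distances to a fixed set of aggregators.

```python
# Sketch of the learned projection and neighbour selection (GravNet-style).
# Layer sizes are illustrative; only the 40-neighbour choice follows the text.
import tensorflow as tf

def project_and_find_neighbours(x, n_coords=4, n_features=22, n_neighbours=40):
    """x: [batch, n_vertices, n_input_features] tensor of detector hits."""
    s = tf.keras.layers.Dense(n_coords)(x)       # learned coordinates (abstract space)
    f_lr = tf.keras.layers.Dense(n_features)(x)  # learned vertex features

    # Pairwise squared Euclidean distances between vertices in the learned space.
    diff = tf.expand_dims(s, 2) - tf.expand_dims(s, 1)   # [B, V, V, n_coords]
    dist2 = tf.reduce_sum(diff ** 2, axis=-1)            # [B, V, V]

    # Indices of the closest neighbours (the vertex itself is included).
    _, idx = tf.math.top_k(-dist2, k=n_neighbours)       # [B, V, n_neighbours]
    return f_lr, dist2, idx
```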

Once the edges of the graph are built, each vertex (for the GravNet layer) or aggregator (for the GarNet layer) collects the information carried by the learned features across its edges. This is done in three steps; a minimal code sketch of the procedure is given after the list:

  1. The quantities

     $\tilde{f}^{\,i}_{jk} = f^{\,i}_{j} \times V(d_{jk})$     (1)

     are computed for each feature f^i_j of the vertices j connected to a given vertex or aggregator k, scaling the original value by a potential V(d_jk), a function of the Euclidean distance d_jk. This gravitational analogy gives the GravNet layer its name. The potential function is introduced to enhance the contribution of close-by vertices; for this reason, V has to be a decreasing function of d_jk. In this study, we use a Gaussian potential for the GravNet layer⁴ and an exponential potential for the GarNet layer.

     ⁴ A gravitational potential, proportional to 1/d_jk, has a singularity at d_jk = 0 and therefore cannot be used; the potential we use has a similar qualitative effect of pulling vertices together.

  2. The quantities computed from all the edges associated with a vertex or aggregator k are combined, generating a new feature of k. For instance, we consider the average of these quantities across the edges and their maximum. While the maximum was already used in similar architectures [33], the use of the mean function was particularly crucial in our case to improve the convergence with respect to the neighbour selection.

  3. Each combination rule adopted in the previous step generates a new set of features, which is concatenated to the original feature vector. This extended vector is transformed into a set of new vertex features using a fully connected dense layer with a nonlinear activation. The concatenation is done for each initial vertex. In the case of the GarNet layer, this requires an additional step of passing the features of the aggregators back to the initial vertices, weighted by the potential. This exchange of the garnered information through the aggregators gives the GarNet layer its name.
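Continuing the sketch above, the three steps can be written compactly as a distance-weighted aggregation followed by a dense transformation. The Gaussian form of the potential and the size of the output layer are assumptions for illustration, not the exact configuration used in the paper.

```python
# Sketch of steps 1-3: weight neighbour features by a potential of the distance,
# gather with mean and max, concatenate with the original features, transform.
import tensorflow as tf

def aggregate_neighbours(f_lr, dist2, idx, n_out=48):
    """f_lr: [B, V, F] features, dist2: [B, V, V] distances, idx: [B, V, N] neighbours."""
    neigh_feat = tf.gather(f_lr, idx, batch_dims=1)      # [B, V, N, F]
    neigh_dist2 = tf.gather(dist2, idx, batch_dims=2)    # [B, V, N]

    # Step 1: scale each neighbour feature by a (Gaussian) potential of the distance.
    weights = tf.exp(-neigh_dist2)
    weighted = neigh_feat * tf.expand_dims(weights, -1)  # [B, V, N, F]

    # Step 2: combine over the edges with more than one gathering function.
    agg_mean = tf.reduce_mean(weighted, axis=2)          # [B, V, F]
    agg_max = tf.reduce_max(weighted, axis=2)            # [B, V, F]

    # Step 3: concatenate with the original features and apply a dense layer.
    combined = tf.concat([f_lr, agg_mean, agg_max], axis=-1)
    return tf.keras.layers.Dense(n_out, activation="tanh")(combined)
```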

The full process transforms the initial set of vertex features into a new set of learned vertex features of, in general, different dimensionality. As is common with graph networks, the main advantage comes from the fact that the output (unlike the input) carries collective information from each vertex and its surroundings, providing a more informative input to downstream processing. Thanks to the distinction between the learned space information and the learned features, the dimensionality of the connections in the graph is kept under control, resulting in smaller memory consumption than, for instance, the EdgeConv layer.

The two layer architectures and the models based on them, described in the following sections, are implemented in TensorFlow [36]. The code for the models and layers can be found at https://github.com/jkiesele/caloGraphNN.

Figure 2: Calorimeter geometry. The markers indicate the centres of the sensors, and the marker size reflects the sensor size. Layers are colour-coded for better visualisation.

4 Data set

The data set used in this paper is based on a simplified calorimeter with irregular geometry, simulated with GEANT4 [37]. The calorimeter has a square cross section in the x and y directions and a length of 2 m in the longitudinal direction (z), corresponding to 20 nuclear interaction lengths. It is made entirely of tungsten and is split into 20 layers of equal width in z. Each layer contains square sensor cells, with a fine segmentation in one quadrant and a lower granularity elsewhere. The total number of cells and their individual sizes vary by layer, replicating the basic features of a slightly irregular calorimeter. For more details, see Figure 2 and Table 1.

Layer    Cells (fine quadrant)    Cells elsewhere
0        64                       48
1        64                       108
2–3      100                      192
4–7      64                       108
8–11     64                       48
12–13    16                       12
14–19    4                        3

Table 1: Number of cells in the finely segmented quadrant and in the rest of the layer, for the benchmark calorimeter geometry described in the text.
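For reference, the per-layer cell counts of Table 1 can be encoded directly; the short Python sketch below (illustrative, not part of the simulation code) computes the total number of sensors implied by the table.

```python
# Encode Table 1 as (layer, cells in fine quadrant, cells elsewhere) and
# compute the total sensor count of the toy calorimeter.
cells_per_layer = (
    [(0, 64, 48), (1, 64, 108)]
    + [(l, 100, 192) for l in range(2, 4)]
    + [(l, 64, 108) for l in range(4, 8)]
    + [(l, 64, 48) for l in range(8, 12)]
    + [(l, 16, 12) for l in range(12, 14)]
    + [(l, 4, 3) for l in range(14, 20)]
)
total_cells = sum(fine + coarse for _, fine, coarse in cells_per_layer)
print(total_cells)  # total number of sensors over the 20 layers
```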

Charged pions are generated in front of the calorimeter; the x and y coordinates of the generation vertex are randomly sampled within a limited window around the calorimeter axis. The x and y components of the particle momentum are set to 0, while the z component is sampled uniformly between 10 and 100 GeV. The particles therefore impinge on the calorimeter front face perpendicularly and shower along the longitudinal direction.

The resulting total energy deposit in each cell, as well as the cell position, width, and layer number, are recorded for each event. These quantities correspond to the feature vector given as input to the graph models (see Section 3). Each example consists of the result of two overlapping showers: cell by cell, the energies of the two showers are summed, and the fraction belonging to each shower in each cell is defined as the ground truth. In addition, the position of the largest energy deposit per shower is recorded. If this position is the same for the two overlapping showers, they are considered not separable and the event is discarded. This applies to about 5% of the events.
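As an illustration of this overlay procedure, the sketch below (plain numpy, with assumed per-cell energy arrays; not the authors' event-building code) sums two single-shower events cell by cell, derives the fraction targets, and rejects pairs whose maximum-energy cells coincide.

```python
import numpy as np

def overlay_showers(e1, e2):
    """e1, e2: float arrays with the per-cell energy deposits of the two showers."""
    if np.argmax(e1) == np.argmax(e2):
        return None                                   # not separable: discard event
    e_tot = e1 + e2
    # Ground-truth fractions per cell; empty cells get a fraction of zero.
    frac1 = np.divide(e1, e_tot, out=np.zeros_like(e1), where=e_tot > 0)
    frac2 = np.divide(e2, e_tot, out=np.zeros_like(e2), where=e_tot > 0)
    return e_tot, np.stack([frac1, frac2], axis=-1)   # summed energies, targets
```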

In total, 16 000 000 events are generated. Of these, 100 000 are used for validation and 20 000 for testing; the rest are used for training.

5 Clustering metrics

To identify individual showers and use their properties, e.g. for a subsequent particle identification task, the energy deposits should be clustered so that overlapping parts are identified without removing important parts of the original showers. The clustering algorithms should therefore predict the energy fraction of each sensor belonging to each shower, with lower energy deposits carrying slightly less importance. These considerations define the loss function

$L = \sum_{k} \frac{\sum_{i} \sqrt{E_{ik}}\,\left(p_{ik} - t_{ik}\right)^{2}}{\sum_{i} \sqrt{E_{ik}}}$     (2)

where p_ik and t_ik are the predicted and true energy fractions in sensor i and shower k. The terms are weighted by the square root of E_ik, the total energy deposit in sensor i belonging to shower k, to introduce a mild energy scaling within each shower. In addition, we define the clustering energy response for one test shower k as

$R_{k} = \frac{\sum_{i} p_{ik}\,E_{i}}{\sum_{i} t_{ik}\,E_{i}}$     (3)

i.e. the ratio of the energy clustered into the shower to its true energy, where E_i is the total energy recorded in sensor i.
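A per-event numpy sketch of these two quantities, following the reconstruction of Equations 2 and 3 given above (array names are assumptions):

```python
import numpy as np

def clustering_loss(p, t, e):
    """p, t: [n_cells, n_showers] predicted/true fractions; e: [n_cells] cell energies."""
    w = np.sqrt(e[:, None] * t)                       # sqrt(E_ik), with E_ik = t_ik * E_i
    per_shower = np.sum(w * (p - t) ** 2, axis=0) / np.maximum(np.sum(w, axis=0), 1e-9)
    return np.sum(per_shower)

def clustering_response(p, t, e, k=0):
    """Ratio of clustered to true energy for shower k."""
    return np.sum(p[:, k] * e) / np.sum(t[:, k] * e)
```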

6 Models

The models need to incorporate neural network layers that identify localized structures as well as perform global information exchange between the sensors. This can be achieved either by multiple message-passing iterations between neighbouring sensors or by a direct global information exchange; here, we employ a combination of both. The input to all models is an array of sensors, each holding its recorded energy deposit, global position coordinates, sensor size, and layer number. We compare three different graph-network approaches to a CNN-based approach (Binning), presented as a baseline. Each model is designed to contain approximately the same number of free parameters. The model structures are as follows:

  • Binning: a regular grid of pixels is imposed on the irregular geometry. Each pixel contains the information of at most one sensor.⁵ This information is concatenated with the mean of the sensor features over all pixels, pre-processed by one CNN layer with 20 nodes, and then fed through eight blocks of CNN layers. Each block consists of two CNN layers with different kernel sizes, each containing 14 filters. The output of each block is passed to the next block and also added to a list of all block outputs. All CNN layers employ tanh activation functions. Finally, the full list of block outputs per pixel is reshaped to represent the vertices of the graph and fed through a dense layer with 128 nodes and ReLU activation. Different CNN models have also been tested and showed similar or worse performance.

    ⁵ Alternative configurations with more than one sensor per pixel were also investigated and showed similar performance.

  • DGCNN model: adapting the model proposed in Ref. [33] to our problem, the sensor features are interpreted as positions of points in a 16-dimensional space and fed through one global space transformation followed by four blocks, each comprising one EdgeConv layer. Our EdgeConv layer has a configuration similar to that of Ref. [33], with 40 neighbouring vertices and three internal dense layers acting on the edges, each with 64 nodes and ReLU activation. The output of the EdgeConv layer is concatenated with its mean over all vertices and fed to one dense layer with 64 nodes and ReLU activation, which concludes the block. The output of each block is passed to the next block and simultaneously added to a list of all block outputs per vertex, which is finally fed to a dense layer with 32 nodes and ReLU activation.

  • GravNet model: the model consists of four blocks. Each block starts by concatenating the mean of the vertex features to the vertex features themselves, followed by three dense layers with 64 nodes and a nonlinear activation each, and one GravNet layer. For each vertex, 40 neighbours are considered. The output of each block is input to the next block and added to a list containing the output of all blocks. This list forms the full vector of vertex features passed to a final dense layer with 128 nodes and ReLU activation.

  • GarNet model: the original vertex features are concatenated with the mean of the vertex features and then passed to one dense layer with 32 nodes and tanh activation before entering 11 subsequent GarNet layers. Each GarNet layer passes features to a set of aggregators and returns new features per vertex. The output of each layer is passed to the next and in addition appended to a vector containing the concatenated outputs of all GarNet layers, which is finally passed to a dense layer with 48 nodes and ReLU activation.

In all cases, each output vertex of these model building blocks is fed through one dense layer with three nodes and ReLU activation, followed by a dense layer with two output nodes and softmax activation. This last processing step determines the energy fraction belonging to each shower. Batch normalisation [38] is applied in all models to the input and after each block.
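A minimal sketch of this common per-vertex output head (TensorFlow/Keras; the input dimensionality depends on the preceding model block):

```python
import tensorflow as tf

def fraction_head(vertex_features):
    """vertex_features: [batch, n_vertices, n_features] output of a model block."""
    x = tf.keras.layers.Dense(3, activation="relu")(vertex_features)
    # Two output nodes with softmax: per-sensor energy fractions of the two showers.
    return tf.keras.layers.Dense(2, activation="softmax")(x)
```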

All models are trained on the full training data set using the Adam optimizer [39], with an initial learning rate that depends on the model. The learning rate is reduced exponentially in steps towards its minimum value over the first 2 million iterations. Once the learning rate has reached the minimum level, it is modulated by 10% at a fixed frequency, following the method proposed in Ref. [40].
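An illustrative schedule in the spirit of this description is sketched below; the initial rate, the floor, and the modulation period are hypothetical placeholders rather than the values used for the paper.

```python
import math

def learning_rate(step, lr0=1e-3, lr_min=1e-5, decay_steps=2_000_000, period=10_000):
    """Exponential decay to lr_min over decay_steps, then a 10% periodic modulation."""
    if step < decay_steps:
        return lr0 * (lr_min / lr0) ** (step / decay_steps)
    return lr_min * (1.0 + 0.1 * math.sin(2.0 * math.pi * step / period))
```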

7 Clustering performance

All approaches described in Section 6 perform well for clustering purposes. An example is shown in Figure 3, where two charged pions with approximately 50 GeV of initial energy enter the calorimeter. One pion loses a significant fraction of its energy in an electromagnetic shower in the first calorimeter layers; the remaining energy is carried by a single particle traversing the central part of the calorimeter before showering. The second pion passes through the first layers as a minimum ionizing particle and showers in the central part of the calorimeter. Even though the two showers largely overlap, the GravNet model (shown here as an example) is able to identify and separate them very well. The track within the calorimeter is well identified and reconstructed, and the energy fractions are properly assigned, even in the regions where the two showers heavily overlap. Similar performance is observed with the other investigated methods.

(a) Truth
(b) Reconstructed
Figure 3: Comparison of true energy fractions and energy fractions reconstructed by the GravNet model for two charged pions with approximately 50 GeV of energy showering in different parts of the calorimeter. Colours indicate the fraction belonging to each of the showers. The size of the markers scales with the square root of the energy deposit in each sensor.

Quantitatively, the performance of the models is compared using the mean loss on the test data set, as well as the clustering response defined in Equations 2 and 3. For every event, we define one of the showers as the test shower and the other overlapping shower as the noise shower. Performance characteristics are evaluated only for the test shower and are quantified by the mean and variance of the response in the test data set. In addition, we define the clustering accuracy as the fraction of showers with a response between 0.7 and 1.3. Given that some showers are not properly clustered, the response distribution has a small fraction of outliers that disturb its otherwise rather Gaussian shape. Therefore, test showers with a response below 0.2 or above 2.8 are removed when computing the kernel mean and variance of the response.
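A numpy sketch of these summary statistics (per-shower responses as input):

```python
import numpy as np

def response_summary(r):
    """r: array of per-shower clustering responses."""
    accuracy = np.mean((r > 0.7) & (r < 1.3))          # clustering accuracy
    kernel = r[(r > 0.2) & (r < 2.8)]                  # drop far outliers
    return accuracy, kernel.mean(), kernel.var()       # accuracy, kernel mean/variance
```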

As listed in Table 2, the GravNet model outperforms the other approaches, including even the more resource-intensive DGCNN model. The GarNet model is slightly worse than the DGCNN model but still outperforms the binning approach as far as the reconstruction of individual shower hit fractions is concerned, as quantified by the loss function. With respect to the clustering response, however, the binning model slightly outperforms the GarNet and DGCNN models.

Model     Loss (mean)   Loss (var.)   Response (mean)   Response (var.)   Kernel resp. (mean)   Kernel resp. (var.)   Accuracy
Binning   0.192         0.017         1.085             0.187             1.046                 0.058                 0.867
DGCNN     0.174         0.012         1.084             0.203             1.045                 0.052                 0.880
GarNet    0.183         0.011         1.089             0.201             1.048                 0.056                 0.869
GravNet   0.172         0.012         1.079             0.187             1.042                 0.049                 0.884

Table 2: Mean and variance of the loss, of the response, and of the response within the Gaussian kernel, as well as the clustering accuracy, for each model.

One should note that some of the incorrectly predicted events are actually correctly clustered events in which the test shower is labelled as the noise shower (shower swapping). Since the labelling is irrelevant in a clustering problem, this behaviour is not a real inefficiency of the algorithm. We denote the fraction of events in which this behaviour is observed as the swap fraction. In Table 3, we calculate the loss for both label choices and evaluate the performance parameters for the assignment that minimises the loss. The binning model shows the largest fraction of swapped showers. The difference in response between the best-performing GravNet model and the GarNet model is enhanced, while the difference between the GravNet and DGCNN models scales similarly, likely because of their similar general structure.

Model     Loss (mean)   Loss (var.)   Response (mean)   Response (var.)   Kernel resp. (mean)   Kernel resp. (var.)   Accuracy   Swapped [%]
Binning   0.180         0.007         1.078             0.150             1.048                 0.055                 0.875      3.2
DGCNN     0.167         0.006         1.077             0.146             1.047                 0.051                 0.885      2.5
GarNet    0.176         0.006         1.084             0.168             1.049                 0.055                 0.874      2.4
GravNet   0.165         0.006         1.072             0.133             1.043                 0.048                 0.891      2.7

Table 3: Mean and variance of the loss, of the response, and of the response within the Gaussian kernel, as well as the clustering accuracy, corrected for shower swapping. The last column shows the fraction of swapped showers.
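The swap correction can be sketched as follows (numpy, reusing the clustering_loss function defined in the Section 5 sketch): the loss is evaluated for both possible shower-label assignments and the performance parameters are taken from the assignment with the smaller loss.

```python
import numpy as np

def swap_corrected_loss(p, t, e):
    """Return the smaller of the direct and label-swapped losses, and a swap flag."""
    direct = clustering_loss(p, t, e)
    swapped = clustering_loss(p, t[:, ::-1], e)        # exchange the two shower labels
    return min(direct, swapped), swapped < direct
```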

As shown in Figure 4, the reconstructed energy of the test shower is biased towards low values for all models, indicating that on average lower-energy showers are assigned a fraction of the energy belonging to the higher-energy shower. While this behaviour could be corrected a posteriori once the particle energy is measured, the variance of the response can only be corrected on an event-by-event basis and is therefore the more important metric. In both metrics, the GravNet model outperforms the other models over the full range, and the GarNet model shows the worst performance, albeit in a comparable range. The resource-intensive DGCNN model lies in between GravNet and GarNet.

(a) Mean
(b) Variance
Figure 4: Mean and variance of the test shower response as a function of the noise shower energy. Swapping of the showers is allowed here.

8 Resource requirements

In addition to the clustering performance, it is important to take into account the computational resources demanded by each model during inference. The inference time can have a significant impact on the applicability of a network to reconstruction tasks, in particular for the kind of real-time processing performed by the trigger systems of typical collider experiments. We evaluate the inference time and memory consumption of the models studied here on one NVIDIA GTX 1080 Ti GPU for batch sizes of 1 and 100, and the inference time on one Intel Xeon E5-2650 CPU core for a fixed batch size of 10. As shown in Figure 5, memory consumption and execution times differ significantly between the models. The binning approach outperforms all other models, owing to the highly optimised CNN implementations. The DGCNN model requires the largest amount of memory, while the model using the GravNet layers requires about 50% less. The GarNet model provides the best compromise between memory consumption and performance. In terms of inference time, the binning model is the fastest, and the graph-based models show similar behaviour for small batch sizes on a GPU. The GarNet and GravNet models benefit from parallelisation over a larger batch. The GarNet model is mostly sequential, which also explains its outstanding performance on a single CPU core, with an inference time almost a factor of 10 shorter than that of the DGCNN model.
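A simple timing harness of the kind used for such measurements is sketched below (TensorFlow; the model object, input shape, and trial count are placeholders, not the exact benchmarking setup of the paper).

```python
import time
import numpy as np
import tensorflow as tf

def time_inference(model, n_vertices, n_features, batch_size, n_trials=100):
    """Average per-event inference time of `model` for a given batch size."""
    x = tf.constant(np.random.rand(batch_size, n_vertices, n_features), dtype=tf.float32)
    model(x)                                            # warm-up call
    start = time.perf_counter()
    for _ in range(n_trials):
        model(x)
    return (time.perf_counter() - start) / (n_trials * batch_size)
```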

Figure 5: Comparison of inference time for the network architectures described in the text, evaluated on CPUs and GPUs with different choices of batch size. The shaded area represents the statistical uncertainty band.

9 Conclusions

In this work, we introduced the GarNet and GravNet layers, distance-weighted graph networks capable of learning irregular patterns of sparse data, such as the detector hits in a particle physics detector with realistic geometry. Using hit clustering in a highly granular calorimeter as a benchmark problem, we showed that these network architectures offer a good compromise between clustering performance and computational resource needs, when compared to CNN-based and other graph-based networks. In the specific case considered here, the performance of the GarNet and GravNet models is comparable to that of the CNN and graph baselines. On the other hand, the simulated calorimeter in the benchmark study is only slightly irregular and can still be represented by an almost regular array. In more realistic applications, e.g. with the hexagonal sensors and the non-projective geometry of the future HGCAL detector of CMS, the difference in performance between the graph-based and CNN-based approaches is expected to increase further. This makes the GarNet approach a very efficient candidate for fast and accurate inference, and the GravNet approach a good candidate for high-performance reconstruction with lower resource requirements and better performance than the DGCNN model.

It should also be noted that the GarNet and GravNet architectures make no specific assumption on the structure of the underlying data, and thus can be employed for many other applications related to particle and event reconstruction, such as tracking and jet identification. Exploring the extent of usability of these architectures will be the focus of follow-up work.

We thank our CMS colleagues for many suggestions received in the development of this work. The training of the models was performed on the GPU clusters of the CERN TechLab and the CERN CMG group. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 772369).

References