1 Introduction
Pooling has been an essential component of modern machine learning, allowing pertinent local information to be propagated to global intermediate feature sets or final discriminators. The shape of the pooling operation is typically determined by hand, setting the size of a convolutional filter and the number of pooling steps before an output layer. This process is difficult to optimize for graph neural networks [1], since neighbourhoods of nodes may vary in size and meaning depending on the problem at hand. In the area of message passing neural networks [6] there are recent advancements in learned pooling techniques on graphs [2], and there is small but steady progress on using pooling to alter input graph structures. This text describes a new pooling architecture using dynamic graph convolutions [13]
and clustering algorithms to learn an optimized representation and corresponding graph for pooling. The model used in the following text is implemented in PyTorch Geometric [4] (PyG). This architecture was derived in the context of energy regression for hadrons (particles that are bound together by the strong force) in High Energy Physics (HEP), where graph neural networks are beginning to solve difficult clustering problems [11, 10]
in novel ways. The objective of that problem is to determine the original energy of a particle incident upon a device called a “sampling calorimeter." Within the calorimeter, which is made of dense material like lead or steel interspersed with lighter material, incident particles above a threshold energy produce pairs of particles through nuclear interactions, creating a “shower" of particles. At a number of fixed depths within the calorimeter, scintillation or ionization signals are recorded as a proxy for the number of produced particles. These estimates of multiplicity can be used to infer the originating particle’s energy. The energy deposition patterns of hadrons are known to exhibit strong local fluctuations in particle multiplicity as the shower evolves. This means that throughout the shower there are randomly located regions that require different treatment from more homogeneous ones, and so there exists an optimal, dynamically determined clustering of each hadron shower’s data for best estimating the energy.
A human-designed algorithm called “software compensation" [12] has also been developed to solve this problem. It too is based on the principle of minimizing an objective function to generate learned weights that determine shower energies. However, a significant amount of detector-specific tuning is needed to make the algorithm work for varying detector designs, and its domain of applicability remains well within HEP. With the technique described in this paper, the entire algorithm is learned, rather than one specific part of a correction, and the algorithm can dynamically adapt to the topology of a given hadron shower. Using a machine learning algorithm for this task removes the need for manual specialization and makes it possible to investigate applications of the technique to tasks rather different from calorimetry, using more widely available datasets. Benchmarks on more classic machine learning estimation and classification tasks are demonstrated below.
The advantages of this dynamic reduction architecture are:

- Representation spaces with good performance are very small.

- No prior graph structure is necessary, and if one is provided it can be altered by the pooling layers, since the graph pooling structure is learned.

- Without a prior graph structure, data for training and inference need very little preprocessing beyond normalization and stacking.
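As an illustration of how light this preprocessing can be, the sketch below stacks hypothetical per-node features (centroid x, centroid y, intensity) into a node matrix and scales them by fixed constants. The feature names and scale values are assumptions chosen for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical preprocessing: stack three per-node features and divide by
# fixed constants so most values fall in a unit range; outliers are kept.
# The scale values (image size 28, grayscale max 255) are assumptions.
def preprocess(xs, ys, intensities, scale=(28.0, 28.0, 255.0)):
    nodes = np.stack([xs, ys, intensities], axis=1).astype(float)
    return nodes / np.asarray(scale)

xs = np.array([3.0, 14.0, 27.0])
ys = np.array([5.0, 14.0, 20.0])
it = np.array([128.0, 255.0, 64.0])
nodes = preprocess(xs, ys, it)
print(nodes.shape)  # (3, 3)
```

Beyond this normalization and stacking, no graph construction or other feature engineering is required before training.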
2 Related Work
The architecture proposed here is similar in outcome to the techniques proposed in [2, 7], but radically different in implementation. Our focus is on learning latent representations that optimize the pooling performance of an unsupervised clustering algorithm. In particular, previous works in graph learning [9, 5] demonstrate that controlling the behavior of an unsupervised algorithm can help in learning concise representations quickly. However, those works kept aspects of the original structure of the data and only generated the messages to pass within that structure. This text expands on both lines of work by combining dynamic graph convolutions [13] with control of a clustering algorithm in the latent space to produce a dynamically learned, optimized pooling.
3 Dynamic Reduction Network
All clustering algorithms can be treated as an indexing of nodes, so any clustering algorithm can be used in the latent space for this demonstration. The unsupervised clustering algorithm is treated as a black box, and the supervised task is to optimize its performance for the task at hand. Using a dynamic graph convolutional approach, it is possible to learn the parameters that maximize the impact of the clustering on reducing information for further processing; hence this network is called a dynamic reduction network (DRN). We have chosen a readily available greedy, popularity-based clustering algorithm [3] as an initial demonstrator. Other clustering algorithms will be tested and compared as their GPU implementations become available in PyG.
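The “clustering as indexing" view can be made concrete with a minimal sketch: given a node-to-cluster index array from any black-box clusterer, pooling reduces to a per-cluster aggregation (here a max, matching the pooling used later in the text). The clusterer itself is stubbed out with a hard-coded assignment.

```python
import numpy as np

# Any clustering algorithm can be viewed as producing one index per node.
# Given that assignment, pooling is a per-cluster reduction over features.
def max_pool_by_cluster(x, cluster):
    n_clusters = cluster.max() + 1
    pooled = np.full((n_clusters, x.shape[1]), -np.inf)
    for node, c in enumerate(cluster):
        pooled[c] = np.maximum(pooled[c], x[node])
    return pooled

x = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [3.0, 1.0],
              [0.0, 4.0]])
cluster = np.array([0, 0, 1, 1])   # stand-in for a black-box clusterer output
print(max_pool_by_cluster(x, cluster))
# [[1. 2.]
#  [3. 4.]]
```

Because the pooling only consumes the index array, the clusterer can be swapped without touching the rest of the network.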
The default model used for the MNIST superpixels [9] classification task in this paper is composed as follows:

- The input data are normalized by fixed values such that they largely occupy a fixed target range; outliers are allowed.

- A multilayer perceptron (MLP) encoding the normalized input data into a latent space of dimension N is applied to all input nodes. The default depth is 3 layers, with an intermediate layer half the size of the final output.

- The latent space data are processed by a dynamic graph convolution layer, i.e. neighbours are found in the latent space rather than the original representation. The internal messages are created using an MLP with three layers of widths 2N and 1.5N, outputting a message of width N. The update function can be summation, maximum, or average.

- The resulting latent nearest-neighbours graph is then weighted by distance and clustered using a greedy clustering algorithm, pooling the node features by taking the maximum over each cluster.

- The reduced latent data are passed through another dynamic convolutional filter and clustered again with the same algorithm.

- The results of the second learned pooling step are then globally max-pooled and passed through an MLP decoder to produce the output logits.
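The greedy clustering step above can be sketched as a toy weighted matching: edges of the latent graph are visited from smallest distance to largest, and two still-unmatched endpoints are merged into one cluster. This is an illustration of the idea only, not the GPU algorithm of [3].

```python
import numpy as np

# Toy greedy matching on a distance-weighted graph: closest pairs are
# merged first, leftover nodes become singleton clusters.
def greedy_match(edges, weights, num_nodes):
    cluster = -np.ones(num_nodes, dtype=int)
    next_id = 0
    for e in np.argsort(weights):            # visit closest pairs first
        i, j = edges[e]
        if cluster[i] < 0 and cluster[j] < 0:
            cluster[i] = cluster[j] = next_id
            next_id += 1
    for i in range(num_nodes):               # unmatched nodes stay alone
        if cluster[i] < 0:
            cluster[i] = next_id
            next_id += 1
    return cluster

edges = np.array([[0, 1], [1, 2], [2, 3], [0, 3]])
weights = np.array([0.1, 0.5, 0.2, 0.9])     # latent-space distances
print(greedy_match(edges, weights, 4))  # [0 0 1 1]
```

The returned index array is exactly the black-box clusterer output consumed by the max pooling described above, so the reduction step roughly halves the number of nodes per application.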
This process is summarized in Figure 1. Alterations to this model used for various tests are described later in the text. The depths of the MLPs in the various encoding, message passing, and decoding steps are parameterized. The number of nearest neighbours, k, is a hyperparameter of the model and may require tuning for a given task, depending on the relational data needed to make a prediction.
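The dynamic graph convolution at the heart of these steps can be sketched as follows: neighbours are found by k-nearest-neighbours in the latent space, and messages are built from the concatenation [x_i, x_j] (width 2N). Assumptions in this sketch: a single random linear map with ReLU stands in for the 2N → 1.5N → N message MLP, and the max update function is used.

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_indices(x, k):
    # Pairwise squared distances in the latent space; each node's k nearest
    # neighbours (excluding itself) define the dynamic graph.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

N, k = 4, 2
x = rng.normal(size=(6, N))               # 6 nodes in an N-dim latent space
W = rng.normal(size=(2 * N, N))           # stand-in for the 2N -> 1.5N -> N MLP
msg = lambda pairs: np.maximum(pairs @ W, 0.0)   # ReLU is an assumption

nbrs = knn_indices(x, k)
# For each node i, concatenate [x_i, x_j] over its neighbours j, form the
# messages, and aggregate with the max update function.
out = np.stack([
    msg(np.concatenate([np.repeat(x[i:i + 1], k, axis=0), x[nbrs[i]]], axis=1)).max(axis=0)
    for i in range(len(x))
])
print(out.shape)  # (6, 4)
```

Because the neighbour search runs in the learned latent space, the effective graph changes as training updates the encoder, which is what makes the convolution “dynamic".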
4 Results
This model was tested on the MNIST “superpixels" dataset with 75 superpixels. (This model was developed using private data of the CMS Collaboration, which cannot be published here.) The superpixels dataset is a downsampled and aggregated form of the full MNIST dataset. Previous graph models are shown in Refs. [9, 5]. The result of a scan in the hidden dimension (the “width") of the network is shown in Figure 2. During training and evaluation both the original graph structure and all unfilled pixels are dropped; the data are “zero-suppressed". Each superpixel has a pair of coordinates defining a centroid, and an intensity; the data passed to the model consist only of the centroid x and y and the grayscale value of that pixel. The performance using a width of 20 channels is a factor of two better on 75 superpixels than the reference models in [9, 5], and approaches the performance of models trained on full MNIST at a width of 256 channels. In Figure 2
each DRN is trained for 400 epochs on an NVIDIA Tesla V100 with the AdamW optimizer [8], using one-cycle cosine annealing (on the learning rate only) with a starting value of 0.001. The weight decay is held constant at 0.001. Best performance is often achieved quite early, so further tuning of model training and optimization is possible, and volatile GPU usage is low. There is significant room for improving the training and evaluation time performance of this model.

Model variant | No. of parameters | Best achieved performance | Epoch of best performance | Performance at 400 epochs
DRN20 | 5123 | 0.9761 | 180 | 0.9731
DRN32 | 12797 | 0.9806 | 205 | 0.9792
DRN64 | 50157 | 0.9872 | 347 | 0.9866
DRN128 | 198605 | 0.9891 | 232 | 0.9884
DRN256 | 790413 | – | 217 | 0.9899
MoNet | – | 0.9111 | – | –
SplineCNN | 63786 | 0.9522 | 40 | –
The superpixels dataset is known to yield poor performance with CNN-based networks, and graph networks have been proposed in previous work to improve upon this. The performance demonstrated in Figure 2 shows clearly that further improvements on very undersampled data are achievable, and this model sets new performance benchmarks on these datasets, especially in terms of model size.
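The zero-suppression described in this section can be sketched for a plain image: only non-empty pixels survive, each reduced to its coordinates and grayscale value. (The actual superpixels dataset already supplies centroids and intensities; this is an illustration of the principle.)

```python
import numpy as np

# Zero-suppression: drop empty pixels and keep, per filled pixel, only its
# coordinates and grayscale value as node features for the graph model.
def zero_suppress(image):
    ys, xs = np.nonzero(image)
    return np.stack([xs, ys, image[ys, xs]], axis=1).astype(float)

img = np.zeros((4, 4))
img[1, 2] = 0.8
img[3, 0] = 0.3
nodes = zero_suppress(img)
print(nodes)
# [[2.  1.  0.8]
#  [0.  3.  0.3]]
```

The resulting point cloud needs no prior graph structure, since the DRN builds its own neighbourhoods in the latent space.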
5 Conclusions
This text introduces the Dynamic Reduction Network as a new tool in High Energy Physics for processing sampled data from a highly varying multidimensional image. This is accomplished by designing a network that can learn effective pooling strategies by manipulating an unsupervised algorithm in a high-dimensional latent space. To demonstrate the efficacy of this network, a performance benchmark on an undersampled MNIST dataset indicates that the new architecture outperforms previous graph-based architectures, even for very small numbers of parameters. This outcome suggests a powerful new technique for approaching classification and regression problems in both computer vision and high energy physics.
6 Acknowledgements
This research was supported in part by the Office of Science, Office of High Energy Physics, of the US Department of Energy under Contract No. DE-AC02-07CH11359 through FNAL LDRD 2019-017.
Many thanks to Matthias Fey and Song Han for quick consultation on this model. Thanks as well to Nhan Tran and Salvatore Rappoccio for proofreading assistance.
References
 [1] (2017-07) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. External Links: ISSN 1053-5888, Link, Document Cited by: §1.
 [2] (2019) Edge contraction pooling for graph neural networks. External Links: 1905.10990 Cited by: §1, §2.
 [3] (2012) A GPU algorithm for greedy graph matching. In Facing the Multicore-Challenge II: Aspects of New Paradigms and Technologies in Parallel Computing, pp. 108–119. External Links: ISBN 978-3-642-30396-8 Cited by: §3.
 [4] (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds. Cited by: §1.
 [5] (2017) SplineCNN: fast geometric deep learning with continuous B-spline kernels. External Links: 1711.08920 Cited by: §2, §4.
 [6] (2017) Neural message passing for quantum chemistry. External Links: 1704.01212 Cited by: §1.
 [7] (2019) Self-attention graph pooling. External Links: 1904.08082 Cited by: §2.
 [8] (2017) Decoupled weight decay regularization. External Links: 1711.05101 Cited by: §4.
 [9] (2016) Geometric deep learning on graphs and manifolds using mixture model CNNs. External Links: 1611.08402 Cited by: §2, §3, Figure 2, §4.
 [10] (2019) Graph neural networks for particle reconstruction in high energy physics detectors. NeurIPS Proceedings. External Links: Link Cited by: §1.
 [11] (2019) Learning representations of irregular particle-detector geometry with distance-weighted graph networks. The European Physical Journal C 79 (7), pp. 608. External Links: Document, ISSN 1434-6052, Link Cited by: §1.
 [12] (2017-10) Software compensation in particle flow reconstruction. The European Physical Journal C 77 (10). External Links: ISSN 1434-6052, Link, Document Cited by: §1.
 [13] (2018) Dynamic graph CNN for learning on point clouds. External Links: 1801.07829 Cited by: §1, §2.