Energetic events at the Large Hadron Collider (LHC) consist of hundreds of particles, each described by four momentum components, leading to embedding spaces with dimensionality in the thousands. Dimensionality reduction is therefore critically important for understanding this data. Distance measures between events based on optimal transport, most notably the Earth Mover's Distance (EMD), have been introduced for use on particle physics datasets in recent years Komiske et al. (2019b, 2020a); Crispim Romão et al. (2021); Komiske et al. (2020b); Cai et al. (2020), leading to geometric interpretations of the data manifold. Different generative processes involved in creating events are associated with hierarchically distinct EMD scales.
Variational Autoencoders (VAEs) Kingma and Welling (2014) have been shown to produce semantically meaningful and interpretable dimensional reductions into their latent space in many contexts. To be trained, they require a notion of similarity between pairs of objects to be used in the reconstruction loss. Pixel-intensity-based losses have been used in various VAE studies for collider data Dillon et al. (2021); Cheng et al. (2020); Dohi (2020), but these fail to properly reflect the similarity of events related by collinear splittings or small displacements. Given its appealing physical properties, the EMD between events is a promising candidate to be used as the reconstruction loss in a VAE. This paper introduces a VAE trained with such a reconstruction loss, and describes some preliminary experiments and an investigation into their scaling behaviour, with the objective of understanding what can be gleaned about the underlying dataset from studying the properties of the learnt representation. Instead of training on full LHC events, this study is restricted to individual W-jets: collimated streams of particles formed from the decay of a W boson travelling with high momentum.
2 Data, Architecture, and Training
A sample of W-jets was simulated, with transverse momenta in a fixed high-momentum range, using a standard pipeline of MadGraph Alwall et al. (2011), Pythia8 Sjöstrand et al. (2015), and Delphes de Favereau et al. (2014). The simulated jets typically contain several dozen particles, and the 50 largest-momentum particles of each jet are selected and stored as 3-vectors (p_T, η, φ), with zero-padding used for jets with fewer particles. The momenta are prescaled so that their sum is one for each jet. Two VAEs with identical architecture are trained: VAE_uncent and VAE_cent are trained on jets which have not been centered in the detector and jets that have been centered at the origin of the (η, φ) plane, respectively.
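The preprocessing just described can be sketched in a few lines of numpy (a minimal sketch; the column layout with the momentum in the first slot, and the function name, are assumptions for illustration):

```python
import numpy as np

def preprocess_jet(particles, n_max=50):
    """Select the n_max highest-momentum particles, zero-pad, and
    rescale the momenta so that they sum to one.

    `particles` is assumed to be an (N, 3) array of per-particle
    3-vectors with the momentum in column 0 (illustrative layout,
    not necessarily the paper's exact convention)."""
    # Sort by descending momentum and keep the leading n_max particles.
    order = np.argsort(-particles[:, 0])
    kept = particles[order][:n_max]
    # Zero-pad jets with fewer than n_max particles.
    padded = np.zeros((n_max, 3))
    padded[:len(kept)] = kept
    # Prescale so that the momenta sum to one for this jet.
    total = padded[:, 0].sum()
    if total > 0:
        padded[:, 0] /= total
    return padded
```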
The VAEs are implemented in TensorFlow Abadi and others (2015), TensorFlow Probability Dillon et al. (2017), and Keras Chollet and others (2015), built with a Particle Flow Network Komiske et al. (2019a) encoder which takes input jets as point clouds, and a dense decoder which outputs jets with the same structure. The encoder parameterizes means μ_i and variances σ_i² for a multivariate diagonal normal distribution with 256 dimensions, from which latent space coordinates z_i are sampled. The loss function,

L = L_recon / (2β) + KL,    (1)

is composed of a reconstruction term L_recon, which is chosen to be a sharp Sinkhorn Sinkhorn and Knopp (1967); Cuturi (2013); Schmitzer (2016) approximation to the EMD, and a KL divergence term KL which penalizes information encoding in the latent space and which can be decomposed into a sum of separate contributions KL_i from each latent direction. L_recon is dimensionless due to the rescaling of the jets, but it is related to the true Sinkhorn distance between the physical jets by a rescaling with the jet momentum. Similarly, while the hyperparameter β is dimensionless, it is related to a dimensionful quantity β̂ defined by the same rescaling.
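The decomposition of the loss can be sketched in numpy as follows. The normalization in which the reconstruction term is weighted by 1/(2β), so that β plays the role of a posterior variance, is an assumption consistent with the analytic example of Appendix A.2; the function names are illustrative, and `recon_loss` stands for the sharp-Sinkhorn term computed separately:

```python
import numpy as np

def kl_per_dimension(mu, logvar):
    """Per-direction KL divergence of a diagonal Gaussian N(mu, exp(logvar))
    from the standard normal prior; the total KL is the sum over directions."""
    return 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)

def vae_loss(recon_loss, mu, logvar, beta):
    """Total loss L = L_recon / (2*beta) + KL (one plausible normalization,
    treating beta as the variance of the Gaussian reconstruction posterior)."""
    return recon_loss / (2.0 * beta) + kl_per_dimension(mu, logvar).sum()
```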
β can be interpreted equivalently as the variance of the Gaussian posterior probability from which the reconstruction loss descends, thereby setting the resolution scale of the VAE, or as the β hyperparameter of a β-VAE Higgins et al. (2017). Physical scales much larger than β̂ may be modelled in its latent space, but scales much smaller than β̂ will be ignored. This motivates a β-annealing training strategy Fu et al. (2019), whereby the VAE is trained in stages using a sequence of values for β, preserving the model weights from one stage to the next. This sequence is chosen from a set of values that are log-uniformly spaced over a fixed range. The procedure begins with an initial ‘priming’ run, starting at the smallest β and proceeding upwards. Next, β is annealed in a zig-zag pattern until it again reaches its smallest value. Finally, a ‘production’ run is performed, repeating the sequence of β values used for the priming run. The model weights saved at the end of each step in this production run are studied and used to generate the figures in this paper. Further details of the data generation and the training procedure are given in Appendix A.1.
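The overall shape of such a schedule can be sketched as below. The endpoints and counts are illustrative placeholders rather than the values used in training, and the actual schedule includes a more elaborate zig-zag (Fig. 4):

```python
import numpy as np

def annealing_schedule(beta_min=1e-4, beta_max=1.0, n_values=10):
    """Build a beta-annealing schedule of the general shape described in the
    text: a 'priming' sweep upward over log-uniformly spaced values, an
    annealing sweep back down to the smallest value, and a 'production'
    sweep repeating the upward sequence."""
    betas = np.geomspace(beta_min, beta_max, n_values)
    priming = list(betas)             # upward sweep: small -> large
    downward = list(betas[::-1][1:])  # anneal back down to the smallest value
    production = list(betas[1:])      # repeat the upward sweep
    return priming + downward + production
```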
Before proceeding to an investigation of the properties of the learnt representations, I present some qualitative observations from the training procedure. During most of the priming run the learnt representation is completely disorganized, taking advantage of all 256 latent directions to describe the data (i.e. all dimensions have associated KL divergence significantly greater than zero). By the production run the learnt representation has been organized into a small set of informative directions with KL_i ≫ 0, while the majority are uninformative with KL_i ≈ 0. For each direction which is informative for some values of β, there is some critical value β_crit above which the direction becomes uninformative. These critical values are associated with physical scales in the training data, for instance the translation of jets around the detector, or the orientation of hard prongs within the jet. When trained with β ≫ β_crit, the corresponding variations will not be learnt in the latent space. When trained with β ≪ β_crit, the variations may be learnt in the latent space, but they have no tendency to be organized into orthogonal or semantically meaningful directions. When subsequently trained with β just below β_crit, these dimensions tend to organize into a small number of semantically meaningful directions, which have a tendency to be preserved if β is subsequently gradually reduced to very small values (though this tendency appears to be weaker when the hierarchy of scales is greater or when the step size in β is larger). Some intuition for these features can be gleaned from a simple analytical example in Appendix A.2.
3 Structure of the Learnt Representations
VAE_uncent has three informative directions at large β, and an additional three when trained at smaller β. VAE_cent has no informative directions at the largest values of β, three at intermediate values, and many more at smaller values. For each VAE, the latent directions can be ordered by the sizes of their individual contributions to the KL divergence. Fig. (1) illustrates the physics encoded by the first three directions of VAE_uncent at large β. The blue contours, which indicate the support of the training data in the latent space, suggest that a two-dimensional manifold has been embedded into a three-dimensional space. One direction maps the jet pseudorapidity η, while a second maps half of the detector azimuth φ. The remaining half of the barrel is mapped with the aid of the third direction, which appears to encode which half of the barrel the jet lies in, and a full rotation around the detector barrel is obtained by traversing the ring in the plane of the latter two directions.
Fig. (2) illustrates the representation learnt in the two most informative directions of VAE_cent at intermediate β. The coordinates of this VAE map the polar and azimuthal angles of the W decay in its rest frame, where the polar angle can be mapped to the energy fraction of the hardest prong in the boosted frame. The topology of the decay of a massive particle into two identical particles is that of the real projective plane, RP², which in this context is naturally represented by the sphere with antipodes identified. Any hemisphere gives a single cover of the space of two-body decays everywhere except on its rim, with the identification of opposite points on the rim being the surviving remnant of the antipodal identification on the full sphere. It can be seen that this plane of the latent space represents an approximate projection of the hemisphere indicated by the grey region on the right of the figure, with the pole of the sphere represented approximately at the centre of the latent-space support. Jets located around the edge of the support region in the latent space satisfy the approximate antipodal identification, with opposite points on the rim mapping to similar jets.
In neither the case of the jet coordinates for VAE_uncent, nor that of the two-prong decay topology for VAE_cent, did the flat topology of the latent space match the cylindrical (η, φ) or RP² topology of the corresponding dataset. But in both cases a semantically meaningful and intuitive representation was found: in one case by an embedding into a higher-dimensional space, and in the other by projecting onto a plane.
The organization of the learnt information into a small number of latent dimensions, and the manner in which the properties of the learnt representations vary smoothly with β, suggest the possibility of notions of dimensionality or information complexity that depend on the way that properties of the learnt representation scale with β. To this end, I introduce definitions for two notions of dimensionality,

D_KL(β) = −2 dKL/d(log β),    D_recon(β) = dL_recon/dβ,    (2)

where in practice I estimate these derivatives using finite-difference approximations, taking quantities evaluated using VAEs trained at two nearby values of β and calculating the explicit fractions. D_KL, which can be thought of as representing the scaling of learnt information complexity with β, was originally motivated by the observation that informative directions typically have dKL_i/d(log β) ≈ −1/2, while dKL_i/d(log β) ≈ 0 for uninformative ones. D_recon takes advantage of the notion of β being a resolution on the reconstruction space. If individual informative directions in the latent space are imagined to map to orthogonal directions in the reconstruction space, and if the Gaussian sampling in the latent space maps approximately to Gaussian sampling in the corresponding reconstruction-space directions with variance β, then the full reconstruction error can be obtained by adding in quadrature the errors associated with the individual orthogonal directions. This leads to L_recon ≈ Dβ, and the derivative extracts the dimension. Of course, none of these assumptions are guaranteed to apply to the learnt representations of the VAE, and so the agreement of these notions of dimensionality can be regarded as a diagnostic of how well behaved the training data and the learnt representations are. In Appendix A.2 I describe a simple analytic example in which these formulae can be derived exactly. D_recon is related to the work of Rezende and Viola (2018), in which the derivative of the reconstruction loss is studied and spikes in this quantity are interpreted as indicating phase transitions.
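The finite-difference estimates just described can be sketched as follows (hypothetical helper functions, assuming total KL divergences and reconstruction losses measured from VAEs trained at two nearby β values):

```python
import numpy as np

def dim_kl(kl_a, kl_b, beta_a, beta_b):
    """KL-based dimensionality D_KL = -2 * dKL/d(log beta), estimated by
    a finite difference between two nearby training runs."""
    return -2.0 * (kl_b - kl_a) / (np.log(beta_b) - np.log(beta_a))

def dim_recon(recon_a, recon_b, beta_a, beta_b):
    """Reconstruction-based dimensionality D_recon = dL_recon/d(beta),
    motivated by L_recon ~ D * beta when each informative direction
    contributes reconstruction variance beta."""
    return (recon_b - recon_a) / (beta_b - beta_a)
```

For a single informative direction with KL = (1/2) log(1/β), both estimators return 1, matching the analytic example of Appendix A.2.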
Fig. (3) shows the dimensionalities calculated on both VAEs during the production run. Two observations are in order. Firstly, note that D_KL and D_recon are in rough agreement for both VAEs, except at low values of β for VAE_uncent, where the model struggles to represent the detailed substructure of the jet at the same time as its bulk position in the detector. Secondly, note that at high scales the dimensionality of VAE_uncent is very close to 2, while the corresponding physics is described using three latent dimensions, as was illustrated in Fig. (1). The third latent dimension, acting as a categorical variable, scales differently than the variables representing continuous transformations of the jet, and it contributes very little to the computed dimensionality. This illustrates the role that these quantities have in reflecting the true information complexity of the dataset, as compared to a naive counting of active latent directions. Indeed, the dimensionality scaling of the learnt representation agrees with an intuitive understanding of the dataset. At large β̂, the uncentered jets have dimensionality of 2 while the centered jets have dimensionality of 0. An order of magnitude below this, three new dimensions emerge: two describe the orientation of the hard prongs within the jet, associated with a scale set by the W mass, and a third describes the overall boost of the jet, associated with the spread of jet momenta in the sample. At scales below this, many new dimensions rapidly emerge, representing the various physical processes associated with showering, hadronization, decays, and detector effects.
4 Conclusions and Outlook
The VAEs trained for this study are effectively learning a concrete representation of the metric space of jets induced by the EMD. They identify semantically meaningful, intuitive, and approximately orthogonal principal axes of variation in the space of jets. Associated with these principal axes are concrete scales, which reflect scales associated with the physical generative process for the jets. As the resolution of the VAE is varied by adjusting the hyperparameter β, the learnt representation smoothly adjusts, and its varying properties can be probed to understand the information complexity of the EMD manifold of the jets. Potential applications of these properties remain to be explored, as does the question of mixed samples, which are expected in unsupervised training on LHC data. These questions will be tackled in a future publication. Further topics of exploration include alternative latent space topologies and additional probes of the geometry of the learnt manifold.
Code used for generating the results for this paper can be found at https://github.com/Jackadsa/EMD_VAE/tree/neurips_ml4ps.
This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-76SF00515. This work was initiated at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. I am grateful for the support of staff maintaining the SLAC Shared Data Facility, on which the models were trained. I am also thankful to the maintainers of the software packages Matplotlib Hunter (2007), Jupyter Kluyver and others (2016), NumPy Harris and others (2020), and SciPy Virtanen and others (2020). I would like to thank the following for many helpful discussions and continuing support, without which this work would not have been possible: Ben Nachman, Matthew Schwartz, Patrick Komiske III, Eric Metodiev, Zhen Liu, Jesse Thaler, Siddharth Mishra-Sharma, Michael Kagan, and Alex Alemi.
TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- MadGraph 5: Going Beyond. JHEP 06, pp. 128.
- The anti-k_t jet clustering algorithm. JHEP 04, pp. 063.
- Linearized optimal transport for collider events. Phys. Rev. D 102 (11), pp. 116019.
- Variational Autoencoders for Anomalous Jet Tagging.
- Keras. https://keras.io.
- Use of a generalized energy Mover's distance in the search for rare phenomena at colliders. Eur. Phys. J. C 81 (2), pp. 192.
- Sinkhorn distances: lightspeed computation of optimal transport. In NIPS.
- DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP 02, pp. 057.
- Better Latent Spaces for Better Autoencoders.
- TensorFlow Distributions.
- Variational Autoencoders for Jet Simulation.
- Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In NAACL-HLT.
- Array programming with NumPy. Nature 585 (7825), pp. 357–362.
- beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR.
- Matplotlib: a 2D graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95.
- Adam: a method for stochastic optimization.
- Auto-encoding variational Bayes. CoRR.
- Jupyter notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt (Eds.), pp. 87–90.
- Exploring the Space of Jets with CMS Open Data. Phys. Rev. D 101 (3), pp. 034009.
- Energy Flow Networks: Deep Sets for Particle Jets. JHEP 01, pp. 121.
- Metric Space of Collider Events. Phys. Rev. Lett. 123 (4), pp. 041801.
- The Hidden Geometry of Particle Collisions. JHEP 07, pp. 006.
- The invisible hand algorithm: solving the assignment problem with statistical physics. Neural Networks 7, pp. 477–490.
- Differential properties of Sinkhorn approximation for learning with Wasserstein distance. Advances in Neural Information Processing Systems 31, pp. 5859–5870.
- Taming VAEs.
- Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41.
- Solution of the optimal assignment problem by diagonal scaling algorithms.
- Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math. 21 (2), pp. 343–348.
- An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, pp. 159–177.
- SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, pp. 261–272.
Appendix A Appendix
A.1 More Details on Data, Architecture, and Training
W-jets are simulated and decayed in MadGraph, with a generation-level cut on the missing transverse momentum. The events are showered in Pythia8. Detector simulation is performed with Delphes using an ATLAS-based card and with particle-flow reconstruction. Particle-flow objects are then clustered into jets with the anti-k_t algorithm Cacciari et al. (2008). An event is selected if its leading jet has momentum, pseudorapidity, and mass within fixed ranges. For selected events, the constituents of the leading jet are reclustered with anti-k_t at a smaller radius, and the leading 50 particles are recorded. Events with fewer than 50 particles are zero-padded. The events surviving the cuts are split, with most used for training and the remainder for validation and testing.
The inputs to the VAE are jets, each with 50 particles represented as (p_T, η, cos φ, sin φ). The encoder network of the VAEs consists of four 1D convolution layers with filter size 1024, kernel size 1, and stride 1, followed by a sum layer, followed by four dense layers of size 1024. Unless otherwise specified, all layers have a leaky ReLU activation function with negative slope coefficient 0.1. The 256 latent-space means and log-variances are encoded with linear activation. The decoder consists of five dense layers of size 1024, followed by a linear dense layer which outputs fifty particles represented as (p_T, η, cos φ, sin φ), after which an arctangent function reduces this to (p_T, η, φ). The explicit arctangent allows the network to avoid learning a discontinuity in φ, which is also the motivation for the trigonometric form of the inputs.
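The permutation-invariant structure of such an encoder can be sketched as a plain numpy forward pass (illustrative layer widths and parameter layout; in the actual model the kernel-size-1 convolutions act as the same dense map applied to every particle, with widths of 1024 and a 256-dimensional latent space):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def pfn_encoder_forward(jet, params):
    """Forward pass of a Particle-Flow-Network-style encoder.

    `jet` has shape (n_particles, n_features).  Kernel-size-1 1D
    convolutions are equivalent to applying the same dense map to every
    particle, so after the sum the encoder is permutation-invariant."""
    h = jet
    for W, b in params["per_particle"]:  # kernel-1 convolution stack
        h = leaky_relu(h @ W + b)
    pooled = h.sum(axis=0)               # permutation-invariant sum layer
    h = pooled
    for W, b in params["dense"]:         # dense stack after pooling
        h = leaky_relu(h @ W + b)
    mu = h @ params["mu"][0] + params["mu"][1]            # linear heads
    logvar = h @ params["logvar"][0] + params["logvar"][1]
    return mu, logvar
```

Because the particles enter only through a sum of per-particle features, shuffling the input particles leaves the latent means and variances unchanged.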
The loss function is a custom implementation in TensorFlow of a sharp Sinkhorn divergence Luise et al. (2018) using ε-scaling Schmitzer (2016); Sharify et al. (2013); Kosowsky and Yuille (1994). In principle symbolic differentiation should be effective for this problem; however, I encountered debilitating numerical instabilities that I was unable to diagnose, and therefore implemented the explicit gradient introduced in Luise et al. (2018). The Sinkhorn distance is calculated with the regulator scaling from 1 to 0.01 in ten log-uniform steps, with ten iterations per step. Double floating-point precision is required for this calculation.
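The ε-scaling loop described above can be sketched in plain numpy as follows (a minimal sketch of the forward calculation only; the paper's implementation is in TensorFlow with a custom gradient, and a log-domain formulation would be more stable for very small ε):

```python
import numpy as np

def sharp_sinkhorn_eps_scaling(a, b, cost, eps_start=1.0, eps_end=0.01,
                               n_steps=10, iters_per_step=10):
    """Sharp Sinkhorn distance with the epsilon-scaling schedule described
    in the text: the regulator is lowered from 1 to 0.01 in ten log-uniform
    steps, with ten Sinkhorn iterations per step.  Returns <P, C>, the
    transport cost of the regularized plan with the entropy term dropped."""
    u = np.ones_like(a)
    v = np.ones_like(b)
    for eps in np.geomspace(eps_start, eps_end, n_steps):
        K = np.exp(-cost / eps)          # Gibbs kernel at this regulator
        for _ in range(iters_per_step):  # alternating marginal projections
            v = b / (K.T @ u + 1e-300)
            u = a / (K @ v + 1e-300)
    plan = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost))
```

For two point masses at different locations, the plan must move all the mass and the sharp value reduces to the ground cost between them; for identical distributions it tends to zero as ε shrinks.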
The β-annealing schedule used for training VAE_uncent is illustrated in Fig. (4), with each black dot representing a step in the annealing schedule. At each step, the VAE is trained for fifty epochs, or until the validation loss has not improved for ten epochs. Training is performed with a batch size of 100 and 1000 steps per epoch, so that five epochs are required to cycle through the whole training dataset. The Adam optimizer Kingma and Ba (2017) is used for training. The learning rate is reset at the beginning of each annealing step, and is reduced by a fixed factor if the validation loss has not improved in five epochs. The first straight line of annealing steps is in this paper called the ‘priming’ run, and the last straight line is called the ‘production’ run.
Training was performed on NVIDIA GeForce 2080Ti GPUs. An epoch takes approximately two minutes, and the full annealing schedule approximately four days.
A.2 A Simple Analytical Example
Consider a toy one-dimensional Gaussian-distributed dataset with variance σ_x². A linear VAE with one latent dimension is trained to reconstruct the coordinate x of the input, with loss function

L = ⟨(x − x̂)²⟩/(2β) + KL(q(z|x) ‖ N(0, 1)),    (3)

where both terms are averaged over the dataset and the latent sampling.
The VAE is linear in the sense that all quantities are linear functions of their inputs: the encoder mean μ(x) = ax, the encoder standard deviation σ(x) = s, and the decoder x̂(z) = bz. The linear term in σ(x) is 0 due to symmetry. Constant terms can be added to μ and x̂, but minimizing the loss sets them to 0. The loss function can be integrated analytically over x and z. Performing the integral explicitly and taking derivatives, this loss function has extrema with

a = b = 0, s = 1,    or    b² = σ_x² − β, s² = β/σ_x², a = b/σ_x².
The former results in an uninformative latent space with KL = 0, which is a minimum only for β > σ_x², and becomes a saddle point for β < σ_x². The latter, which leads to an informative latent space, is real only for β < σ_x² and is a minimum in this regime. Substituting this minimum into the expressions for the reconstruction and KL losses in Eq. (3) gives

L_recon = β,    KL = (1/2) log(σ_x²/β).
Evaluating Eq. (2) explicitly gives D_KL = D_recon = 0 for β > σ_x² and D_KL = D_recon = 1 for β < σ_x².
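The informative minimum can be cross-checked numerically. The sketch below minimizes the integrated loss by naive gradient descent with central finite-difference gradients, under the normalization assumption L = L_recon/(2β) + KL used above; the function names are illustrative:

```python
import numpy as np

def toy_loss(params, sigma2, beta):
    """Integrated loss of the linear one-dimensional VAE: Gaussian data with
    variance sigma2, encoder z ~ N(a*x, s^2), decoder x_hat = b*z."""
    a, b, log_s2 = params
    s2 = np.exp(log_s2)
    recon = (1 - a * b) ** 2 * sigma2 + b ** 2 * s2
    kl = 0.5 * (a ** 2 * sigma2 + s2 - 1.0 - log_s2)
    return recon / (2.0 * beta) + kl

def minimize_toy(sigma2, beta, lr=0.05, steps=20000):
    """Naive gradient descent with central finite-difference gradients,
    started away from the uninformative saddle point."""
    p = np.array([0.5, 0.5, 0.0])
    for _ in range(steps):
        g = np.zeros(3)
        for i in range(3):
            e = np.zeros(3)
            e[i] = 1e-6
            g[i] = (toy_loss(p + e, sigma2, beta)
                    - toy_loss(p - e, sigma2, beta)) / 2e-6
        p -= lr * g
    return p
```

For σ_x² = 1 and β = 0.25 this recovers b² ≈ 0.75, s² ≈ 0.25, and a reconstruction loss of approximately β, as in the analytic solution.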
The general case of an n-dimensional Gaussian dataset with variances σ_i² is more complicated, but it can be shown that there are minima in which the latent space axes are aligned with the principal axes of the data. In this case the problem reduces to n independent copies of the one-dimensional case, and both dimensionalities count the number of directions for which σ_i² > β, which is the same as the number of active dimensions, i.e. those with KL_i > 0.
In summary, the VAE probes in detail directions that have variance larger than β, and averages over directions whose variance is smaller than β. The σ_i² play the role of the critical values β_crit of the end of Section 2. Note that the decoder slope b, which for β ≪ σ_x² is given approximately by σ_x, gives the conversion factor between distances along latent coordinates and distances in the real space.
Nonlinear VAEs trained on non-Gaussian data are not guaranteed to follow this behaviour, in which case the behaviour of these quantities becomes a diagnostic of how closely the data resembles a Gaussian.