An Exploration of Learnt Representations of W Jets

09/22/2021 ∙ by Jack H. Collins (Stanford University)

I present a Variational Autoencoder (VAE) trained on collider physics data (specifically boosted W jets), with reconstruction error given by an approximation to the Earth Mover's Distance (EMD) between input and output jets. This VAE learns a concrete representation of the data manifold, with semantically meaningful and interpretable latent space directions which are hierarchically organized in terms of their relation to physical EMD scales in the underlying generative process. A hyperparameter β controls the resolution at which the VAE is sensitive to structures in the data manifold. The variation of the latent space structure with β, and the scaling of some VAE properties with β, provide insight into the scale-dependent structure of the dataset and its information complexity. I introduce two measures of the dimensionality of the learnt representation that are calculated from this scaling.




1 Introduction

Energetic events at the Large Hadron Collider (LHC) consist of hundreds of particles, each described by four momentum components, leading to embedding spaces with dimensionality in the thousands. Dimensionality reduction is therefore critically important for understanding this data. Distance measures between events based on optimal transport, most notably the Earth Mover's Distance (EMD), have been introduced for particle physics datasets in recent years Komiske et al. (2019b, 2020a); Crispim Romão et al. (2021); Komiske et al. (2020b); Cai et al. (2020), leading to geometric interpretations of the data manifold. Different generative processes involved in creating events are associated with hierarchically distinct EMD scales.

Variational Autoencoders (VAEs) Kingma and Welling (2014) have been shown to produce semantically meaningful and interpretable dimensional reductions into their latent space in many contexts. To be trained, they require a notion of similarity between pairs of objects to be used in the reconstruction loss. Pixel-intensity-based losses have been used in various VAE studies for collider data Dillon et al. (2021); Cheng et al. (2020); Dohi (2020), but these fail to properly reflect the similarity of events related by collinear splittings or small displacements. Given its appealing physical properties, the EMD between events is a promising candidate to be used as the reconstruction loss in a VAE. This paper serves to introduce a VAE trained with such a reconstruction loss, and to describe some preliminary experiments and an investigation into their scaling behaviour, with the objective of understanding what can be gleaned about the underlying dataset from studying the properties of the learnt representation. Instead of training on full LHC events, this study is restricted to individual W-jets: collimated streams of particles formed from the decay of a W boson travelling with high momentum.

2 Data, Architecture, and Training

A sample of W jets was simulated in a fixed transverse-momentum range using a standard pipeline of Madgraph Alwall et al. (2011), Pythia8 Sjöstrand et al. (2015), and Delphes de Favereau et al. (2014). From each simulated jet the 50 largest-momentum particles are selected and stored as 3-vectors (p_T, η, φ), with zero-padding used for jets with fewer than 50 particles. The momenta are prescaled so that their sum is one for each jet. Two VAEs with identical architecture are trained: VAE_uncent and VAE_cent are trained on jets which have not been centered in the detector and jets that have been centered, respectively.
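The selection, padding, and prescaling step can be sketched in a few lines of numpy; the (p_T, η, φ) row convention and the function name here are assumptions for illustration, not the paper's code:

```python
import numpy as np

def preprocess_jet(particles, n_keep=50):
    """Select the n_keep largest-pT particles of a jet, zero-pad,
    and rescale so the pT's sum to one.

    `particles` is an (N, 3) array of (pT, eta, phi) rows; this input
    convention is an assumption for illustration.
    """
    particles = np.asarray(particles, dtype=float)
    # Sort by descending pT and keep the leading n_keep particles.
    order = np.argsort(-particles[:, 0])
    kept = particles[order][:n_keep]
    # Zero-pad jets with fewer than n_keep particles.
    out = np.zeros((n_keep, 3))
    out[: len(kept)] = kept
    # Prescale so that the pT's of each jet sum to one.
    out[:, 0] /= out[:, 0].sum()
    return out
```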

The VAEs are implemented in Tensorflow Abadi and others (2015), Tensorflow Probability Dillon et al. (2017), and Keras Chollet and others (2015), built with a Particle Flow Network Komiske et al. (2019a) encoder which takes input jets as point clouds, and a dense decoder which outputs jets with the same structure. The encoder parameterizes means and variances for a multivariate diagonal normal distribution with 256 dimensions, from which latent space coordinates z are sampled. The loss function,

L = (1/(2β)) L_reco² + KL ,

is composed of a reconstruction term, where L_reco is chosen to be a sharp Sinkhorn Sinkhorn and Knopp (1967); Cuturi (2013); Schmitzer (2016) approximation to the EMD, and a KL divergence term which penalizes information encoding in the latent space and which can be decomposed into a sum of separate contributions KL_i from each latent direction. L_reco is dimensionless due to the rescaling of the jets, but it is related to the true Sinkhorn distance between the physical jets by an overall momentum rescaling. Similarly, while the hyperparameter β is dimensionless, it is related to a dimensionful quantity β̂ defined through the same rescaling.
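As an illustration of the kind of reconstruction distance involved, the following is a minimal numpy sketch of an entropy-regularized (Sinkhorn) transport cost between two jets treated as weighted point clouds in the (η, φ) plane. This is a plain fixed-ε Sinkhorn iteration with a "sharp" cost evaluation; the paper's actual implementation uses ε-scaling and a custom explicit gradient (see Appendix A.1), which this sketch does not reproduce:

```python
import numpy as np

def sinkhorn_emd(a, xa, b, xb, eps=0.01, n_iter=200):
    """Entropy-regularized optimal transport cost between two jets.

    a, b   : pT fractions of the two jets (each summing to one)
    xa, xb : (n, 2) arrays of (eta, phi) particle coordinates
    eps    : entropic regulator; the sharp EMD is recovered as eps -> 0
    """
    # Ground cost: pairwise Euclidean distance in the (eta, phi) plane.
    C = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=-1)
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate optimal transport plan
    # 'Sharp' evaluation: unregularized cost on the regularized plan.
    return float((P * C).sum())
```

In practice a log-domain implementation and double precision are needed for small ε, as noted in the appendix.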

The hyperparameter β can be interpreted equivalently as the variance of the Gaussian posterior probability from which the reconstruction loss descends, thereby setting the resolution scale of the VAE, or as the β hyperparameter of a β-VAE Higgins et al. (2017). Physical scales much larger than the resolution scale set by β̂ may be modelled in its latent space, but scales much smaller will be ignored. This motivates a β-annealing training strategy Fu et al. (2019), whereby the VAE is trained in stages using a sequence of values for β, preserving the model weights from one stage to the next. This sequence is chosen from a set of values that are log-uniformly spaced over a fixed range. The procedure begins with an initial 'priming' run, starting at the smallest β and proceeding upwards. Next, β is annealed in a zig-zag pattern until it again reaches its smallest value. Finally, a 'production' run is performed, repeating the sequence of β values used for the priming run. The model weights saved at the end of each step in this production run are studied and used to generate the figures in this paper. Further details of the data generation and the training procedure are given in Appendix A.1.
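A schedule of this priming/zig-zag/production form can be generated as follows; the β range, the number of steps, and the single down-up-down cycle are placeholder assumptions, not the values used in the paper:

```python
import numpy as np

def beta_annealing_schedule(beta_min=1e-4, beta_max=1e-1, n_steps=10):
    """Build a beta-annealing schedule: a 'priming' sweep upward over
    log-uniform values, a zig-zag back down to beta_min (illustrated here
    as one down-up-down cycle), and a 'production' sweep repeating the
    priming sequence."""
    up = np.geomspace(beta_min, beta_max, n_steps)   # log-uniform grid
    down = up[::-1]
    priming = list(up)
    zigzag = list(down[1:]) + list(up[1:]) + list(down[1:])  # ends at beta_min
    production = list(up)
    return priming + zigzag + production
```

Model weights would be carried over between consecutive entries of the returned list, with training at each β until convergence.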

Before proceeding to an investigation of the properties of the learnt representations, I present some qualitative observations from the training procedure. During most of the priming run the learnt representation is completely disorganized, taking advantage of all 256 latent directions to describe the data (i.e. all dimensions have associated KL divergence significantly greater than zero). By the production run the learnt representation has been organized into a small set of informative directions with KL_i > 0, while the majority are uninformative with KL_i ≈ 0. For each direction which is informative for some values of β, there is some critical value β̂_crit above which the dimension becomes uninformative. These critical values are associated with physical scales in the training data, for instance the translation of jets around the detector, or the orientation of hard prongs within the jet. When trained with β̂ well above β̂_crit, the corresponding variations will not be learnt in the latent space. When trained with β̂ well below β̂_crit, the variations may be learnt in the latent space, but they have no tendency to be organized into orthogonal or semantically meaningful directions. When subsequently trained with β̂ just below β̂_crit, these dimensions tend to organize into a small number of semantically meaningful directions, which have a tendency to be preserved if β̂ is subsequently gradually reduced to very small values (though this tendency appears to be weaker the greater the hierarchy or the larger the step size in β̂). Some intuition for some of these features can be gleaned from a simple analytical example in Appendix A.2.

3 Structure of the Learnt Representations

Figure 1: Learnt representation of jet coordinates for VAE_uncent at large β̂. The major axes of the left and center plots are pairs of the three most informative latent space coordinates. The blue contours indicate the probability density of encoded events. Overlaid are 7-by-7 grids of jet images in red; each small jet image has its own coordinate axes (η, φ). Each jet image is generated by decoding the latent code associated with the major-axis coordinates of the center of its small square, so that a horizontal or vertical sequence of images corresponds to the effect of the corresponding latent space displacement on the jet reconstruction. The jet images on the right zoom in on one of these small squares for illustration. Areas of discs are proportional to the p_T of the corresponding particle.

VAE_uncent has three informative directions at large β̂, and an additional three when trained at smaller β̂. VAE_cent has no informative directions at the largest β̂, three at intermediate values, and many more at smaller values. For each VAE, the latent directions can be ordered by the sizes of their individual contributions to the KL. Fig. (1) illustrates the physics encoded by the first three directions of VAE_uncent at large β̂. The blue contours, which indicate the support of the training data in the latent space, suggest that a two-dimensional manifold has been embedded into a three-dimensional space. One direction maps the jet pseudorapidity η, while a second maps the azimuth φ over half of the detector barrel. The remaining half of the barrel is mapped with the aid of the third direction, and a full rotation around the detector barrel is obtained by traversing a ring in the plane of the latter two directions.

Fig. (2) illustrates the representation learnt in the two most informative directions of VAE_cent at intermediate β̂. These coordinates map the polar and azimuthal angles (θ, φ) of the W decay in its rest frame, where θ can be mapped to the energy fraction of the hardest prong in the boosted frame. The topology of the decay of a massive particle into two identical particles is that of the real projective plane, RP², which in this context is naturally represented by the sphere with antipodes identified. Any hemisphere gives a single cover of the space of two-body decays everywhere except on its rim, with an identification of opposite points on the rim being the surviving remnant of the antipodal identification on the full sphere. It can be seen that this plane of the latent space represents an approximate projection of the hemisphere indicated by the grey region on the right of the figure, with the pole of the sphere being represented at an interior point of the support in the latent space. Jets located around the edge of the support region in the latent space satisfy the approximate identification of opposite points on the rim.

Figure 2: Learnt representation of two-body jet substructure in VAE_cent at intermediate β̂. The major axes of the plot correspond to the two most important latent directions. Each smaller circle contains a jet image with internal axes in the (η, φ) plane. The jet image in each circle is obtained by the decoder from the latent code associated with the coordinates of the center of the circle, so that a horizontal or vertical sequence of images corresponds to the effect of the corresponding latent space displacement on the jet reconstruction. Black and red lines are approximate contours of the polar and azimuthal angles (θ, φ) of the W-boson decay, drawn by eye. This plane approximates a projection of the hemisphere shaded grey on the right of the RP² space in which the W-decay kinematics lives.

In neither the case of the jet coordinates for VAE_uncent, nor the two-prong decay topology for VAE_cent, did the ℝⁿ topology of the latent space match the cylindrical or RP² topology of the corresponding datasets. But in both cases a semantically meaningful and intuitive representation was found: in one case by an embedding into a higher-dimensional space, and in the other by projecting onto a plane.

The organization of the learnt information into a small number of latent dimensions, and the manner in which the properties of the learnt representations vary smoothly with β, suggest notions of dimensionality or information complexity that depend on the way that properties of the learnt representation scale with β. To this end, I introduce definitions for two notions of dimensionality,

D_KL = -2 ∂(KL)/∂(ln β) ,    D_EMD = ∂(L_reco²)/∂β ,     (2)

where in practice I will estimate these derivatives using finite-difference approximations, taking quantities evaluated using VAEs trained at two nearby values of β and calculating the explicit fractions. D_KL, which can be thought of as representing the scaling of learnt information complexity with β, was originally motivated by the observation that informative directions typically have KL_i ≈ -½ ln β + const, so that each contributes one unit to D_KL, while uninformative directions with KL_i ≈ 0 contribute nothing. D_EMD takes advantage of the notion of √β being a resolution on the reconstruction space. If individual informative directions in the latent space are imagined to map to orthogonal directions in the reconstruction space, and if the Gaussian sampling in the latent space maps approximately to Gaussian sampling in the corresponding reconstruction space directions with variance β, then the full reconstruction error can be obtained by adding in quadrature those associated with the individual orthogonal directions. This leads to L_reco² ≈ nβ for n informative directions, and the derivative extracts the dimension n. Of course, none of these assumptions are guaranteed to apply to the learnt representations of the VAE, and so the agreement of these notions of dimensionality can be regarded as a diagnostic of how well behaved are the training data and the learnt representations. In Appendix A.2 I describe a simple analytic example in which these formulae can be derived exactly. D_KL is related to the work of Rezende and Viola (2018), in which the derivative of the KL term with respect to the constraint is studied and spikes in this quantity are interpreted as indicating phase transitions.
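The finite-difference estimates of the two dimensionality measures, D_KL = -2 dKL/d ln β and D_EMD = d(L_reco²)/dβ (as reconstructed from the surrounding text), can be sketched as follows; the function signature is an illustrative assumption:

```python
import numpy as np

def dimensionalities(betas, kl_vals, reco_vals):
    """Finite-difference estimates of the two dimensionality measures:
    D_KL = -2 dKL/dln(beta) and D_EMD = d(L_reco^2)/dbeta, evaluated
    between consecutive beta values of a production run.

    betas     : increasing sequence of beta values
    kl_vals   : total KL divergence measured at each beta
    reco_vals : EMD-like reconstruction error measured at each beta
    """
    betas = np.asarray(betas, float)
    kl = np.asarray(kl_vals, float)
    reco2 = np.asarray(reco_vals, float) ** 2
    d_kl = -2.0 * np.diff(kl) / np.diff(np.log(betas))
    d_emd = np.diff(reco2) / np.diff(betas)
    # Each estimate is attached to the geometric midpoint of a beta pair.
    mid = np.sqrt(betas[:-1] * betas[1:])
    return mid, d_kl, d_emd
```

For data following the one-active-dimension Gaussian behaviour of Appendix A.2 (KL = ½ ln(σ²/β), L_reco = √β), both estimates return 1.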

Figure 3: Dimensionality of learnt representations.

Fig. (3) shows the dimensionalities calculated on both VAEs during the production run. Two observations are in order. Firstly, note that D_KL and D_EMD are in rough agreement for both VAEs, except at low values of β̂ for VAE_uncent, where it struggles to represent the detailed substructure of the jet at the same time as its bulk position in the detector. Secondly, note that at high scales the dimensionality of VAE_uncent is very close to 2, while the corresponding physics is described using three latent dimensions, as was illustrated in Fig. (1). The third latent dimension, acting as a categorical variable, scales differently than the variables representing continuous transformations of the jet, and contributes very little to the computed dimensionality. This illustrates the role that these quantities have in reflecting the true information complexity of the dataset, as compared to a naive counting of active latent directions. Indeed, the dimensionality scaling of the learnt representation agrees with an intuitive understanding of the dataset. At large β̂, the uncentered jets have dimensionality 2 while the centered jets have dimensionality 0. An order of magnitude lower, three new dimensions emerge: two describe the orientation of the hard prongs within the jet, associated with the W mass scale, and a third describes the overall boost of the jet, associated with the spread of jet momenta. At scales below this, many new dimensions rapidly emerge, representing the various physical processes associated with showering, hadronization, decays, and detector effects.

4 Conclusions and Outlook

The VAEs trained for this study are effectively learning a concrete representation of the metric space of jets induced by the EMD. They identify semantically meaningful, intuitive, and approximately orthogonal principal axes of variation in the space of jets. Associated with these principal axes are concrete scales, which reflect scales in the physical generative process for the jets. As the resolution of the VAE is varied by adjusting the hyperparameter β, the learnt representation smoothly adjusts, and its varying properties can be probed to understand the information complexity of the EMD manifold of the jets. It remains to be seen what applications these properties may have, and how the approach fares on the mixed samples that are expected in unsupervised training on LHC data. These questions will be tackled in a future publication. Further topics of exploration include alternative latent space topologies, and additional probes of the geometry of the learnt manifold.

Code used for generating the results for this paper can be found at

This work was supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-76SF00515. This work was initiated at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. I am grateful for the support of staff maintaining the SLAC Shared Data Facility, on which the models were trained. I am also thankful to the maintainers of the software packages Matplotlib Hunter (2007), Jupyter Kluyver and others (2016), NumPy Harris and others (2020), and SciPy Virtanen and others (2020). I would like to thank the following for many helpful discussions and continuing support, without which this work would not have been possible: Ben Nachman, Matthew Schwartz, Patrick Komiske III, Eric Metodiev, Zhen Liu, Jesse Thaler, Siddharth Mishra-Sharma, Michael Kagan, and Alex Alemi.


  • M. Abadi et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Cited by: §2.
  • J. Alwall, M. Herquet, F. Maltoni, O. Mattelaer, and T. Stelzer (2011) MadGraph 5: Going Beyond. JHEP 06, pp. 128. External Links: 1106.0522, Document Cited by: §2.
  • M. Cacciari, G. P. Salam, and G. Soyez (2008) The anti-kt jet clustering algorithm. JHEP 04, pp. 063. External Links: 0802.1189, Document Cited by: §A.1.
  • T. Cai, J. Cheng, N. Craig, and K. Craig (2020) Linearized optimal transport for collider events. Phys. Rev. D 102 (11), pp. 116019. External Links: 2008.08604, Document Cited by: §1.
  • T. Cheng, J. Arguin, J. Leissner-Martin, J. Pilette, and T. Golling (2020) Variational Autoencoders for Anomalous Jet Tagging. External Links: 2007.01850 Cited by: §1.
  • F. Chollet et al. (2015) Keras. Cited by: §2.
  • M. Crispim Romão, N. F. Castro, J. G. Milhano, R. Pedro, and T. Vale (2021) Use of a generalized energy Mover’s distance in the search for rare phenomena at colliders. Eur. Phys. J. C 81 (2), pp. 192. External Links: 2004.09360, Document Cited by: §1.
  • M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In NIPS, External Links: 1306.0895 Cited by: §2.
  • J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (2014) DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP 02, pp. 057. External Links: 1307.6346, Document Cited by: §2.
  • B. M. Dillon, T. Plehn, C. Sauer, and P. Sorrenson (2021) Better Latent Spaces for Better Autoencoders. External Links: 2104.08291 Cited by: §1.
  • J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous (2017) TensorFlow distributions. External Links: 1711.10604 Cited by: §2.
  • K. Dohi (2020) Variational Autoencoders for Jet Simulation. External Links: 2009.04842 Cited by: §1.
  • H. Fu, C. Li, X. Liu, J. Gao, A. Çelikyilmaz, and L. Carin (2019) Cyclical annealing schedule: a simple approach to mitigating kl vanishing. In NAACL-HLT, External Links: 1903.10145 Cited by: §2.
  • C. R. Harris et al. (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362. External Links: Document, Link Cited by: §4.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §2.
  • J. D. Hunter (2007) Matplotlib: a 2d graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95. External Links: Document Cited by: §4.
  • D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §A.1.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. CoRR. External Links: 1312.6114 Cited by: §1.
  • T. Kluyver et al. (2016) Jupyter notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt (Eds.), pp. 87 – 90. Cited by: §4.
  • P. T. Komiske, R. Mastandrea, E. M. Metodiev, P. Naik, and J. Thaler (2020a) Exploring the Space of Jets with CMS Open Data. Phys. Rev. D 101 (3), pp. 034009. External Links: 1908.08542, Document Cited by: §1.
  • P. T. Komiske, E. M. Metodiev, and J. Thaler (2019a) Energy Flow Networks: Deep Sets for Particle Jets. JHEP 01, pp. 121. External Links: 1810.05165, Document Cited by: §2.
  • P. T. Komiske, E. M. Metodiev, and J. Thaler (2019b) Metric Space of Collider Events. Phys. Rev. Lett. 123 (4), pp. 041801. External Links: 1902.02346, Document Cited by: §1.
  • P. T. Komiske, E. M. Metodiev, and J. Thaler (2020b) The Hidden Geometry of Particle Collisions. JHEP 07, pp. 006. External Links: 2004.04159, Document Cited by: §1.
  • J. Kosowsky and A. Yuille (1994) The invisible hand algorithm: solving the assignment problem with statistical physics. Neural Networks 7, pp. 477–490. Cited by: §A.1.
  • G. Luise, A. Rudi, M. Pontil, and C. Ciliberto (2018) Differential properties of sinkhorn approximation for learning with wasserstein distance. Advances in Neural Information Processing Systems 31, pp. 5859–5870. External Links: 1805.11897 Cited by: §A.1.
  • D. J. Rezende and F. Viola (2018) Taming vaes. External Links: 1810.00597 Cited by: §3.
  • B. Schmitzer (2016) Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41, pp. . External Links: Document, 1610.06519 Cited by: §A.1, §2.
  • M. Sharify, S. Gaubert, and L. Grigori (2013) Solution of the optimal assignment problem by diagonal scaling algorithms. External Links: 1104.3830 Cited by: §A.1.
  • R. Sinkhorn and P. Knopp (1967) Concerning nonnegative matrices and doubly stochastic matrices.. Pacific J. Math. 21 (2), pp. 343–348. External Links: Link Cited by: §2.
  • T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands (2015) An introduction to PYTHIA 8.2. Comput. Phys. Commun. 191, pp. 159–177. External Links: 1410.3012, Document Cited by: §2.
  • P. Virtanen et al. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, pp. 261–272. External Links: Document Cited by: §4.

Appendix A Appendix

A.1 More Details on Data, Architecture, and Training

W jets are simulated and decayed in Madgraph, with a generation-level cut on the missing momentum. The events are showered in Pythia8. Detector simulation is performed with Delphes using an ATLAS-based card and with particle-flow reconstruction. Particle-flow objects are then clustered into jets with the anti-kt algorithm Cacciari et al. (2008). An event is selected if the leading jet has transverse momentum, pseudorapidity, and mass within fixed ranges. For selected events, the constituents of the leading jet are reclustered with anti-kt and the leading 50 particles are recorded, with zero-padding for jets with fewer than 50 particles. The events surviving the cuts are split, with the majority used for training and the remainder for validation and testing.

The inputs to the VAE are jets, each with 50 particles represented with the azimuthal angle in trigonometric (sine and cosine) form. The encoder network of the VAEs consists of four 1D convolution layers with filter size 1024, kernel size 1, and stride 1, followed by a sum layer, followed by four dense layers of size 1024. Unless otherwise specified, all layers have a Leaky ReLU activation with negative-slope coefficient 0.1. The 256 latent space means and variances are encoded with linear activation. The decoder consists of five layers of size 1024, followed by a linear dense layer which outputs fifty particles with the azimuth again represented by its sine and cosine, and then an atan2 function reduces these to the angle itself. The explicit atan2 allows the network to avoid learning a discontinuity in the azimuthal angle, which is also the motivation for the trigonometric form of the inputs.
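The trigonometric encoding and atan2 decoding can be sketched in numpy (the function names are illustrative, not the paper's):

```python
import numpy as np

def encode_phi(phi):
    """Represent an angle by (sin, cos), so that angles on either side of
    the phi = +/-pi boundary map to nearby points, avoiding the
    discontinuity a raw angle would have."""
    return np.stack([np.sin(phi), np.cos(phi)], axis=-1)

def decode_phi(sin_cos):
    """Recover the angle with atan2; this works even when a network
    output is not exactly on the unit circle."""
    return np.arctan2(sin_cos[..., 0], sin_cos[..., 1])
```

Because atan2 only uses the ratio and the quadrant of its arguments, the decoder is free to output any point on a ray without affecting the reconstructed angle.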

The loss function is a custom implementation in TensorFlow of a sharp Sinkhorn Luise et al. (2018) using ε-scaling Schmitzer (2016); Sharify et al. (2013); Kosowsky and Yuille (1994). In principle symbolic differentiation should be effective for this problem; however, I encountered debilitating numerical instabilities that I was unable to diagnose, and therefore implemented the explicit gradient introduced in Luise et al. (2018). The Sinkhorn distance is calculated with the regulator ε scaling from 1 to 0.01 in ten log-uniform steps, with ten iterations per step. Double floating-point precision is required for this calculation.

Figure 4: Annealing Schedule.

The β-annealing schedule used for training VAE_uncent is illustrated in Fig. (4), with each black dot representing a step in the annealing schedule. At each step, the VAE is trained for fifty epochs, or until the validation loss has not improved for ten epochs. Training is performed with a batch size of 100 and 1000 steps per epoch, so that five epochs are required to cycle through the whole training dataset. The Adam optimizer Kingma and Ba (2017) is used for training. The learning rate is reset at the beginning of each annealing step, and is reduced by a fixed factor if the validation loss has not improved in five epochs. The first straight line of annealing steps is in this paper called the 'priming' run; the last straight line is called the 'production' run.

Training was performed on NVIDIA GeForce 2080Ti GPUs. An epoch takes approximately two minutes, and the full annealing schedule approximately four days.

A.2 A Simple Analytical Example

Consider a toy one-dimensional Gaussian-distributed dataset x with variance σ². A linear VAE with one latent dimension z is trained to reconstruct the coordinate of the input x, with loss function

L = (1/(2β)) ⟨(x - x̂)²⟩ + ⟨KL(q(z|x) ‖ N(0,1))⟩ ,     (3)

where the averages are taken over the data distribution and the encoder sampling. The VAE is linear in the sense that all quantities are linear functions of their inputs: the encoder mean is μ(x) = a x, the decoder is x̂(z) = c z, and the encoder standard deviation is a constant s (its linear term in x is 0 due to symmetry). Constant terms can be added to μ and x̂, but minimizing the loss sets them to 0. The loss function can be integrated analytically over x and z. Performing the integral explicitly and taking derivatives, this loss function has extrema with

a = c = 0, s = 1 ,   or   c² = σ² - β , a = c/σ² , s² = β/σ² .

The former results in an uninformative latent space with KL = 0, which is a minimum only for β > σ² and becomes a saddle point for β < σ². The latter, which leads to an informative latent space, is real only for β < σ², and is a minimum in this regime. Substituting this minimum into the expressions for the reconstruction and KL losses in Eq. (3) gives

(1/(2β)) ⟨(x - x̂)²⟩ = 1/2 ,   KL = ½ ln(σ²/β)   for β ≤ σ² ;
(1/(2β)) ⟨(x - x̂)²⟩ = σ²/(2β) ,   KL = 0   for β ≥ σ² .     (7)

Evaluating Eq. (2) explicitly, with ⟨(x - x̂)²⟩ playing the role of the squared reconstruction EMD, gives D_KL = D_EMD = 1 for β < σ² and D_KL = D_EMD = 0 for β > σ².
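The phase structure of this toy model can be checked numerically. The sketch below integrates the linear-VAE loss analytically (reconstruction ⟨(x - c(ax + sε))²⟩/(2β) plus the KL of N(ax, s²) against the N(0,1) prior, averaged over x ~ N(0, σ²)) and locates its minimum by a crude grid search; the grid ranges are arbitrary illustrative choices:

```python
import numpy as np

def linear_vae_loss(a, c, s, sigma2, beta):
    """Analytically integrated loss of the 1D linear VAE."""
    recon = sigma2 * (1.0 - c * a) ** 2 + c**2 * s**2
    kl = 0.5 * (a**2 * sigma2 + s**2 - 1.0 - np.log(s**2))
    return recon / (2.0 * beta) + kl

def grid_minimum(sigma2, beta):
    """Crude grid search for the loss minimum (illustration only)."""
    a = np.linspace(-2, 2, 81)[:, None, None]
    c = np.linspace(-2, 2, 81)[None, :, None]
    s = np.linspace(0.05, 1.5, 59)[None, None, :]
    L = linear_vae_loss(a, c, s, sigma2, beta)
    i, j, k = np.unravel_index(np.argmin(L), L.shape)
    return a[i, 0, 0], c[0, j, 0], s[0, 0, k]
```

For β > σ² the minimum sits at a ≈ 0, c ≈ 0, s ≈ 1 (uninformative latent space), while for β < σ² it moves to |a| > 0 (informative), in line with the analytic extrema.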

The general case of a d-dimensional Gaussian dataset with variances σ_i² is more complicated, but it can be shown that there are minima when the latent space axes are aligned with the principal axes of the data. In this case, the problem reduces to independent copies of the one-dimensional case, and both dimensionality measures count the number of directions for which β < σ_i², which is the same as the number of active dimensions, i.e. those with KL_i > 0. In summary, the VAE probes in detail directions that have characteristic scale larger than √β, and averages over directions which have scale smaller than √β. The σ_i² play the role of the β̂_crit of the end of Section 2. Note that the quantity c = √(σ² - β), which for β ≪ σ² is given approximately by σ, gives the conversion factor between distances along latent coordinates and distances in the real space.

Nonlinear VAEs trained on non-Gaussian data are not guaranteed to follow this behaviour, in which case the behaviour of these quantities becomes a diagnostic of how closely the data resembles a Gaussian.