1 Introduction
The Wasserstein (Earth Mover’s) Distance is a metric defined between two probability measures. In the field of highenergy particle physics, a modified version of the Wasserstein distance, the Energy Mover’s Distance (EMD), serves as a metric for the space of collider events by defining the
work required to rearrange the radiation pattern of one event into another Komiske et al. (2019). In particular, the EMD is intimately connected to the structure of infrared and collinearsafe observables used in the ubiquitous task of clustering particles into jets Komiske et al. (2020), and is foundational in the SHAPER tool for developing geometric collider observables Gambhir et al. (2022).Recently, a novel neural architecture was developed that enforces an exact upper bound on the Lipschitz constant of the model by constraining the norm of its weights in a minimal way, resulting in higher expressiveness than other methods Kitouni et al. (2021); Anil et al. (2019). Here, we employ this architecture—leveraging its improved expressiveness for 1Lipschitz continuous networks—to replace the Sinkhorn estimation of the EMD in SHAPER Feydy et al. (2018); Gambhir et al. (2022)
by directly calculating the EMD using the KantorovicRubenstein (KR) dual formulation of the Wasserstein1 metric. The KR duality casts the optimal transport problem as an optimization over the space of 1Lipschitz functions, which we parameterize with dense neural networks using the architecture from
Kitouni et al. (2021). With small modifications to the KR dual formulation, we are able to reliably and accurately obtain the EMD and Kantorovic potential in a differentiable way, without any approximations. This makes it possible to run gradientbased optimization procedures over the exact EMD (see Fig. 1). In addition, we expect these improvements could potentially have a major impact on jet studies at the future ElectronIon Collider, where traditional clustering methods are not optimal Arratia et al. (2021), and more broadly in optimal transport problems.2 Lipschitz Networks and the Energy Mover’s Distance
Lipschitz Networks
Fully connected networks can be Lipschitz bounded by constraining the matrix norm of all weights Kitouni et al. (2021); Gouk et al. (2020); Miyato et al. (2018). Constraints with respect to a particular norm will be denoted as Lip. We start with a model that is Lip with Lipschitz constant i.e., :
(1) 
Without loss of generality, we take (rescaling the inputs would be equivalent to changing ). We recursively define the layer of the fully connected network of depth with activation as
(2) 
where is the input and is the output of the neural network. We have that satisfies equation 1 if
(3) 
and has a Lipschitz constant less than or equal to 1. Here, denotes the operator norm with norm in the domain and in the codomain. It is shown in Anil et al. (2019) that when using the GroupSort activation, can approximate any Lip function arbitrarily well, making weightnormed networks universal approximators. An implementation of the weight constraint along with a number of examples is provided in https://github.com/niklasnolte/MonotOneNorm.
Energy Mover’s Distance
The EMD is a metric between probability measures and . Using the standard Wassersteinmetric notation, the EMD is defined as
(4) 
where is the set of all joint probability measures whose marginals are and . The Wasserstein optimization problem can be cast as an optimization over Lipschitz continuous functions using the KantorovichRubinstein duality:
(5) 
where is continuous, i.e.,
. In highenergy particle collisions, the EMD is defined by using the energies of individual particles in place of probabilities, with their momentum directional coordinates representing the supports of the probability distribution. For more details, including on how unequal total energies are handled, see
Komiske et al. (2019). By performing optimizations over a constrained set of s, one can use the EMD to define observables over .3 NEEMo: Neural Estimation of the Energy Mover’s Distance
Algorithm
Consider a highenergy particlecollision event with particles. Let be the energy of particle , be the direction of its momentum, and be the set of all particles in the event. Following the SHAPER prescription Gambhir et al. (2022) for defining an observable , we first define to be any collection of points parameterized by , e.g., these points can be sampled from any geometric object with any density distribution. The EMD between the event and the geometric object can be computed with equation 5 as
(6) 
where is a 1Lipschitz neural network with parameters . At the expression above is maximized and is the Kantorovic potential from which the EMD is obtained as the RHS of equation 6. Since is differentiable, the optimum can be obtained using standard gradient descent techniques. This is the key improvement of NEEMo over SHAPER, which can only estimate the Kantorovic potential and the EMD up to a specified order . Note that in equation 6 the expectation is computed exactly but optimization can also be done stochastically by sampling from the discrete distributions with probabilities and and using the empirical mean to estimate the EMD. This can improve convergence in some cases.
Given that all of our operations are differentiable, gradients can flow back to . Therefore, one can also optimize the parameters to obtain the bestfitting collection of points in that class. We obtain the following minimax optimization problem:
(7) 
where quantifies how well the event is described by the class of geometric object Komiske et al. (2020); Gambhir et al. (2022).
Limitations
Unlike the conventional clustering algorithms used in highenergy particle physics, NEEMo relies on nonconvex gradientbased optimization of a neural network and a set of geometric parameters. This results in the clustering procedure itself being relatively slow and not easily implemented in real time. This problem can be alleviated with powerful custom optimizers and initialization techniques to guarantee fast convergence, though whether NEEMo could ever be run online during data taking is an open question. We note that for many potential applications, e.g. at the ElectronIon Collider, this is not a problem since running online is not required.
4 Experiments
Synthetic Data
We start with a few toy examples. First, consider an event consisting of three sets of particles distributed uniformly along the perimeters of circles. Here, we know the exact parameterization of our target distribution and use NEEMo to fit three randomly initialized circles to the event. Figure 1 shows a few steps in the fit evolution. The Kantorovic potential given by the Lipschitzconstrained network induces forces on the parameters of , which drive it to evolve from its random initialization to perfect alignment with the target distribution. In this example, in equation 7 quantifies the 3circliness of the event , an observable first defined in Gambhir et al. (2022). To highlight the flexibility, we next consider an event with two sets of particles distributed along the perimeters of a triangle and ellipse, respectively. Figure 3 shows that again evolves following the gradients of the Kantorovic potential to perfect alignment with the target distribution.
NSubjets
We now perform a model jetsubstructure study, clustering synthetic data into subjets
. First, we generate jets with 3, 4, or 5 subjet centers distributed uniformly. From each center we generate 10 particles drawn from a Gaussian distribution. We then use our algorithm to fit 3, 4, or 5 centers to the simulated jets. Figure
4 shows that our algorithm is able to estimate the correct number of subjets. The EMD of the Nsubjet fit is clearly lowest for jets with N true clusters.Future Directions
In the framework developed in these proceedings, any parameterized source distribution can be chosen to fit any target distribution using the EMD, without any approximations. This can be used, e.g., for constructing precision jet observables that are sensitive to percentlevel fluctuations for new physics searches at LHC experiments. In addition, NEEMo provides a more precise way to quantify event modifications due to hadronization and detector effects. Finally, the flexibility provided by NEEMo could potentially have a major impact on jet studies at the future ElectronIon Collider, where traditional clustering methods are not optimal. Rather than modifying the metric used in a sequentialrecombination algorithm as in Arratia et al. (2021), the jet geometry itself can be altered using NEEMo in an eventbyevent unsupervised manner. We plan to report on all of these novel directions in a followup journal article that is currently in preparation.
5 Broader Impacts
Comparing probability distributions is a fundamental task in statistics. Most commonly used methods only compare densities in a pointwise manner, whereas the Earth Mover’s Distance accounts for the geometry of the underlying space. This is easily visualized in our figures showing the Kantorovic potential. Due to space constraints we only showed a few toy example applications in collider physics, but we stress that the approach we present here—directly calculating the Earth Mover’s Distance using the KantorovicRubenstein dual formulation of the Wasserstein1 metric—can be applied to any optimal transport problem. While the existence of the KR duality has long been known, it only recently became possible to simultaneously enforce the exact 1Lipschitz bound while achieving enough expressiveness to find the optimal Kantorovic potential. Our approach now makes it possible to perform gradientbased optimizations over the exact Earth Mover’s Distance. Given the sizable impact of similar approximate methods, we expect our exact approach could have applications across many fields and types of problems.
Acknowledgements
We thank Rikab Gambhir and Jesse Thaler for helpful discussions about SHAPER, and for introducing us to this problem. This work was supported by NSF grant PHY2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/) and by DOE grant DEFG0294ER40818.
References
 Komiske et al. (2019) Patrick T. Komiske, Eric M. Metodiev, and Jesse Thaler. Metric space of collider events. Physical Review Letters, 123(4), jul 2019. doi: 10.1103/physrevlett.123.041801. URL https://doi.org/10.1103%2Fphysrevlett.123.041801.
 Komiske et al. (2020) Patrick T. Komiske, Eric M. Metodiev, and Jesse Thaler. The Hidden Geometry of Particle Collisions. JHEP, 07:006, 2020. doi: 10.1007/JHEP07(2020)006.
 Gambhir et al. (2022) Rikab Gambhir, Akshunna Dogra, Jesse Thaler, Demba Ba, and Abiy Tasissa. Can you hear the shape of a jet? 14th International Workshop on Boosted Object Phenomenology, Reconstruction, Measurements, and Searches in HEP, 2022. URL https://indi.to/rbQ5j.
 Kitouni et al. (2021) Ouail Kitouni, Niklas Nolte, and Mike Williams. Robust and provably monotonic networks. In 35th Conference on Neural Information Processing Systems, the Workshop on Machine Learning for the Physical Sciences, 2021. URL https://arxiv.org/abs/2112.00038.

Anil et al. (2019)
Cem Anil, James Lucas, and Roger Grosse.
Sorting out Lipschitz function approximation.
In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning
, volume 97 of Proceedings of Machine Learning Research, pages 291–301. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/anil19a.html.  Feydy et al. (2018) Jean Feydy, Thibault Séjourné, FrançoisXavier Vialard, Shunichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences, 2018. URL https://arxiv.org/abs/1810.08278.
 Arratia et al. (2021) Miguel Arratia, Yiannis Makris, Duff Neill, Felix Ringer, and Nobuo Sato. Asymmetric jet clustering in deepinelastic scattering. Phys. Rev. D, 104(3):034005, 2021. doi: 10.1103/PhysRevD.104.034005.
 Gouk et al. (2020) Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J. Cree. Regularisation of neural networks by enforcing lipschitz continuity, 2020.
 Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization for Generative Adversarial Networks. arXiv eprints, art. arXiv:1802.05957, February 2018.