
Finding NEEMo: Geometric Fitting using Neural Estimation of the Energy Mover's Distance

by Ouail Kitouni, et al.

A novel neural architecture was recently developed that enforces an exact upper bound on the Lipschitz constant of the model by constraining the norm of its weights in a minimal way, resulting in higher expressiveness compared to other techniques. We present a new and interesting direction for this architecture: estimation of the Wasserstein metric (Earth Mover's Distance) in optimal transport by employing the Kantorovich-Rubinstein duality to enable its use in geometric fitting applications. Specifically, we focus on the field of high-energy particle physics, where it has been shown that a metric for the space of particle-collider events can be defined based on the Wasserstein metric, referred to as the Energy Mover's Distance (EMD). This metrization has the potential to revolutionize data-driven collider phenomenology. The work presented here represents a major step towards realizing this goal by providing a differentiable way of directly calculating the EMD. We show how the flexibility that our approach enables can be used to develop novel clustering algorithms.



1 Introduction

The Wasserstein (Earth Mover’s) Distance is a metric defined between two probability measures. In the field of high-energy particle physics, a modified version of the Wasserstein distance, the Energy Mover’s Distance (EMD), serves as a metric for the space of collider events by defining the work required to rearrange the radiation pattern of one event into another Komiske et al. (2019). In particular, the EMD is intimately connected to the structure of infrared- and collinear-safe observables used in the ubiquitous task of clustering particles into jets Komiske et al. (2020), and is foundational in the SHAPER tool for developing geometric collider observables Gambhir et al. (2022).

Recently, a novel neural architecture was developed that enforces an exact upper bound on the Lipschitz constant of the model by constraining the norm of its weights in a minimal way, resulting in higher expressiveness than other methods Kitouni et al. (2021); Anil et al. (2019). Here, we employ this architecture, leveraging its improved expressiveness for 1-Lipschitz continuous networks, to replace the $\epsilon$-Sinkhorn estimation of the EMD in SHAPER Feydy et al. (2018); Gambhir et al. (2022) by directly calculating the EMD using the Kantorovich-Rubinstein (KR) dual formulation of the Wasserstein-1 metric. The KR duality casts the optimal transport problem as an optimization over the space of 1-Lipschitz functions, which we parameterize with dense neural networks using the architecture from Kitouni et al. (2021). With small modifications to the KR dual formulation, we are able to reliably and accurately obtain the EMD and Kantorovich potential in a differentiable way, without any approximations. This makes it possible to run gradient-based optimization procedures over the exact EMD (see Fig. 1). In addition, we expect these improvements could have a major impact on jet studies at the future Electron-Ion Collider, where traditional clustering methods are not optimal Arratia et al. (2021), and more broadly in optimal transport problems.

Figure 1: Fitting three synthetic clusters (green) with three circles (red) using NEEMo (see Sec. 3). The heatmap is the Kantorovich potential, parameterized as a Lipschitz-bounded network, which induces forces on the circles (shown as arrows) that drive them into perfect alignment with the target distribution (only a few steps in the evolution of the fit are shown).

2 Lipschitz Networks and the Energy Mover’s Distance

Lipschitz Networks

Fully connected networks can be Lipschitz bounded by constraining the matrix norm of all weights Kitouni et al. (2021); Gouk et al. (2020); Miyato et al. (2018). Constraints with respect to a particular $L^p$ norm will be denoted as Lip$^p$. We start with a model $f(x)$ that is Lip$^p$ with Lipschitz constant $\lambda$, i.e., for all $x, y \in \mathbb{R}^n$:

$$|f(x) - f(y)| \le \lambda \, \|x - y\|_p . \tag{1}$$

Without loss of generality, we take $\lambda = 1$ (rescaling the inputs would be equivalent to changing $\lambda$). We recursively define the layer $l$ of the fully connected network of depth $D$ with activation $\sigma$ as

$$z^{l} = W^{l} \sigma(z^{l-1}) + b^{l}, \tag{2}$$

where $z^{0} = x$ is the input and $f(x) = z^{D}$ is the output of the neural network. We have that $f(x)$ satisfies equation 1 if


$$\|W^{1}\|_{p,\infty} \le 1 \quad \text{and} \quad \|W^{i}\|_{\infty,\infty} \le 1 \;\; \text{for } 2 \le i \le D, \tag{3}$$

and $\sigma$ has a Lipschitz constant less than or equal to 1. Here, $\|W\|_{p,q}$ denotes the operator norm with norm $L^p$ in the domain and $L^q$ in the co-domain. It is shown in Anil et al. (2019) that when using the GroupSort activation, $f(x)$ can approximate any Lip$^p$ function arbitrarily well, making weight-normed networks universal approximators. An implementation of the weight constraint along with a number of examples is provided in
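As a rough illustration of the weight constraint and the GroupSort activation described above, the following minimal sketch builds a small Lipschitz-bounded network in plain Python. It is not the authors' implementation: the function names (`constrain`, `group_sort`, `forward`, `random_layer`) are hypothetical, the rescaling rule (divide by the norm only when it exceeds 1) is one possible way to enforce the bound, and the activation is applied between layers rather than to the raw input, which is an assumed convention.

```python
import math
import random


def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (handles p = 1 and p = inf)."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)


def max_row_norm(W, q):
    """Largest L^q norm over the rows of W. When q is the dual exponent of p,
    this equals the operator norm of W from (R^n, L^p) to (R^m, L^inf)."""
    if q == math.inf:
        return max(max(abs(w) for w in row) for row in W)
    return max(sum(abs(w) ** q for w in row) ** (1.0 / q) for row in W)


def constrain(W, p):
    """Rescale W only if its (p, inf) operator norm exceeds 1 (assumed rule)."""
    scale = max(1.0, max_row_norm(W, dual_exponent(p)))
    return [[w / scale for w in row] for row in W]


def group_sort(z, group_size=2):
    """GroupSort: sort entries inside consecutive groups. This only permutes
    the entries within each group, so it is 1-Lipschitz."""
    out = []
    for start in range(0, len(z), group_size):
        out.extend(sorted(z[start:start + group_size]))
    return out


def forward(x, layers, p=2):
    """Evaluate the network. The first layer is constrained in the (p, inf)
    norm and all later layers in the (inf, inf) norm."""
    z = list(x)
    for k, (W, b) in enumerate(layers):
        if k > 0:
            z = group_sort(z)
        Wc = constrain(W, p if k == 0 else math.inf)
        z = [sum(w * v for w, v in zip(row, z)) + bias
             for row, bias in zip(Wc, b)]
    return z[0]


def random_layer(n_out, n_in, rng):
    """Hypothetical helper: deliberately large random weights and small biases."""
    W = [[rng.uniform(-3.0, 3.0) for _ in range(n_in)] for _ in range(n_out)]
    return W, [rng.uniform(-1.0, 1.0) for _ in range(n_out)]


rng = random.Random(0)
layers = [random_layer(4, 2, rng), random_layer(4, 4, rng), random_layer(1, 4, rng)]
x, y = [0.3, -1.2], [1.0, 0.5]
lipschitz_ok = abs(forward(x, layers) - forward(y, layers)) <= math.dist(x, y)
```

Because each constrained layer is non-expansive and the biases cancel in differences, the output difference is bounded by the $L^2$ distance between the inputs regardless of how large the raw weights are.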

Energy Mover’s Distance

The EMD is a metric between probability measures $P$ and $Q$. Using the standard Wasserstein-metric notation, the EMD is defined as

$$W_1(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma} \big[ \|x - y\| \big], \tag{4}$$

where $\Pi(P, Q)$ is the set of all joint probability measures whose marginals are $P$ and $Q$. The Wasserstein optimization problem can be cast as an optimization over Lipschitz continuous functions using the Kantorovich-Rubinstein duality:

$$W_1(P, Q) = \sup_{\|f\|_{L} \le 1} \Big( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \Big), \tag{5}$$

where $f$ is 1-Lipschitz continuous, i.e., $|f(x) - f(y)| \le \|x - y\|$ for all $x$ and $y$. In high-energy particle collisions, the EMD is defined by using the energies of individual particles in place of probabilities, with their momentum directional coordinates representing the supports of the probability distribution. For more details, including on how unequal total energies are handled, see Komiske et al. (2019). By performing optimizations over a constrained set of $Q$s, one can use the EMD to define observables over $P$.
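To make the primal and dual forms of the Wasserstein-1 distance concrete, here is a small sketch for discrete measures on a line. It is an illustrative aside rather than part of the method: the one-dimensional setting, the CDF-based formula for the primal value, and the names `w1_1d` and `kr_dual_value` are all assumptions chosen for simplicity.

```python
def w1_1d(xs, ps, ys, qs):
    """Exact Wasserstein-1 distance between two discrete 1D measures,
    computed as the integral of |CDF_P - CDF_Q| along the line."""
    events = sorted([(x, p) for x, p in zip(xs, ps)] +
                    [(y, -q) for y, q in zip(ys, qs)])
    total, cdf_diff = 0.0, 0.0
    for (pos, mass), (next_pos, _) in zip(events, events[1:]):
        cdf_diff += mass                      # running CDF_P - CDF_Q
        total += abs(cdf_diff) * (next_pos - pos)
    return total


def kr_dual_value(f, xs, ps, ys, qs):
    """Kantorovich-Rubinstein dual objective E_P[f] - E_Q[f] for a given f.
    For any 1-Lipschitz f this is a lower bound on the W1 distance."""
    return (sum(p * f(x) for x, p in zip(xs, ps))
            - sum(q * f(y) for y, q in zip(ys, qs)))


# P puts mass 1/2 at 0 and 1; Q puts mass 1/2 at 2 and 3.
xs, ps = [0.0, 1.0], [0.5, 0.5]
ys, qs = [2.0, 3.0], [0.5, 0.5]

primal = w1_1d(xs, ps, ys, qs)                                  # 2.0
dual_optimal = kr_dual_value(lambda t: -t, xs, ps, ys, qs)      # 2.0
dual_weaker = kr_dual_value(lambda t: -0.5 * t, xs, ps, ys, qs)  # 1.0
```

In this example the 1-Lipschitz function $f(t) = -t$ attains the supremum, so the dual value matches the primal one, while the flatter function $f(t) = -t/2$ gives only a lower bound.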

Figure 2: Training procedure to fit a parameterized shape to a distribution. NEEMo replaces the $\epsilon$-Sinkhorn estimation in the standard SHAPER procedure with a Lipschitz network that evaluates the Kantorovich potential to obtain the EMD.

3 NEEMo: Neural Estimation of the Energy Mover’s Distance


Consider a high-energy particle-collision event with $n$ particles. Let $E_i$ be the energy of particle $i$, $x_i$ be the direction of its momentum, and $\mathcal{E} = \{(E_i, x_i)\}_{i=1}^{n}$ be the set of all particles in the event. Following the SHAPER prescription Gambhir et al. (2022) for defining an observable $\mathcal{O}$, we first define $\mathcal{E}'_\theta = \{(E'_j, y_j)\}_{j=1}^{m}$ to be any collection of points parameterized by $\theta$, e.g., these points can be sampled from any geometric object with any density distribution. The EMD between the event and the geometric object can be computed with equation 5 as

$$\mathrm{EMD}(\mathcal{E}, \mathcal{E}'_\theta) = \max_{\phi} \Big[ \sum_{i=1}^{n} E_i f_\phi(x_i) - \sum_{j=1}^{m} E'_j f_\phi(y_j) \Big], \tag{6}$$

where $f_\phi$ is a 1-Lipschitz neural network with parameters $\phi$. At $\phi = \phi^*$ the expression above is maximized and $f_{\phi^*}$ is the Kantorovich potential, from which the EMD is obtained as the RHS of equation 6. Since $f_\phi$ is differentiable, the optimum can be obtained using standard gradient descent techniques. This is the key improvement of NEEMo over SHAPER, which can only estimate the Kantorovich potential and the EMD up to a specified order $\epsilon$. Note that in equation 6 the expectation is computed exactly but optimization can also be done stochastically by sampling from the discrete distributions with probabilities $E_i$ and $E'_j$ and using the empirical mean to estimate the EMD. This can improve convergence in some cases.
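The difference between the exact and the stochastic evaluation of the dual objective can be sketched as follows. This is only a toy under stated assumptions: the critic `f` is a fixed hand-written 1-Lipschitz function rather than a trained network, energies are assumed to be normalized to sum to one, and the names `dual_objective_exact` and `dual_objective_sampled` are hypothetical.

```python
import random


def dual_objective_exact(f, event, shape):
    """Energy-weighted sums over all points. `event` and `shape` are lists of
    (energy, position) pairs with energies assumed normalized to sum to 1."""
    return (sum(E * f(x) for E, x in event)
            - sum(E * f(y) for E, y in shape))


def dual_objective_sampled(f, event, shape, n_samples, rng):
    """Monte Carlo estimate: draw points with probability equal to their
    energy fraction and average f(x) - f(y) over the draws."""
    xs = rng.choices([x for _, x in event],
                     weights=[E for E, _ in event], k=n_samples)
    ys = rng.choices([y for _, y in shape],
                     weights=[E for E, _ in shape], k=n_samples)
    return sum(f(x) - f(y) for x, y in zip(xs, ys)) / n_samples


def f(point):
    """A fixed 1-Lipschitz critic used purely for illustration."""
    return -point[0]


event = [(0.5, (0.0, 0.0)), (0.3, (1.0, 0.0)), (0.2, (0.0, 2.0))]
shape = [(0.5, (3.0, 0.0)), (0.5, (3.0, 1.0))]

exact = dual_objective_exact(f, event, shape)
estimate = dual_objective_sampled(f, event, shape, 20000, random.Random(0))
```

The sampled value fluctuates around the exact one, with a spread that shrinks as the number of samples grows. In a training loop, a fresh sample could be drawn at every step.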

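Given that all of our operations are differentiable, gradients can flow back to the shape parameters $\theta$. Therefore, one can also optimize the parameters $\theta$ to obtain the best-fitting collection of points in that class. We obtain the following minimax optimization problem:

$$\mathcal{O}(\mathcal{E}) = \min_{\theta} \max_{\phi} \Big[ \sum_{i=1}^{n} E_i f_\phi(x_i) - \sum_{j=1}^{m} E'_j f_\phi(y_j) \Big], \tag{7}$$

where the points $(E'_j, y_j)$ of $\mathcal{E}'_\theta$ depend on $\theta$, and $\mathcal{O}(\mathcal{E})$ quantifies how well the event $\mathcal{E}$ is described by the class of geometric object $\mathcal{E}'_\theta$ Komiske et al. (2020); Gambhir et al. (2022).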
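The structure of the minimax problem can be illustrated with a deliberately tiny one-dimensional toy. The sketch below is not the NEEMo training loop: it assumes the shape is a rigid translation of a fixed template by a single parameter `theta`, and it replaces the Lipschitz network with the linear critic $f_a(x) = a x$, $|a| \le 1$, which happens to be optimal for pure translations. The function name `fit_translation` and the step size are hypothetical.

```python
def fit_translation(target, template, lr=0.05, steps=200):
    """Fit a 1D shift `theta` so that the shifted template matches the target.

    target, template: lists of (energy, position) with energies summing to 1.
    Inner problem: maximize a * (mean_target - mean_shape) over |a| <= 1,
    solved in closed form. Outer problem: gradient descent on theta.
    """
    mean_target = sum(E * x for E, x in target)
    mean_template = sum(E * y for E, y in template)
    theta = 0.0
    for _ in range(steps):
        gap = mean_target - (mean_template + theta)
        a = 1.0 if gap >= 0 else -1.0   # optimal slope of the linear critic
        grad_theta = -a                 # d/dtheta of a * (gap at this theta)
        theta -= lr * grad_theta        # descent step on the shape parameter
    emd = abs(mean_target - (mean_template + theta))
    return theta, emd


template = [(0.25, -1.0), (0.5, 0.0), (0.25, 2.0)]
true_shift = 1.3
target = [(E, y + true_shift) for E, y in template]

theta, emd = fit_translation(target, template)
```

With a constant step size the parameter approaches the true shift and then oscillates within one step of it. A decaying step size, or a richer critic such as a Lipschitz-bounded network, would be needed for shapes that are not simple translations.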


Unlike the conventional clustering algorithms used in high-energy particle physics, NEEMo relies on nonconvex gradient-based optimization of a neural network and a set of geometric parameters. This results in the clustering procedure itself being relatively slow and not easily implemented in real time. This problem can be alleviated with powerful custom optimizers and initialization techniques to guarantee fast convergence, though whether NEEMo could ever be run online during data taking is an open question. We note that for many potential applications, e.g. at the Electron-Ion Collider, this is not a problem since running online is not required.

4 Experiments

Synthetic Data

We start with a few toy examples. First, consider an event consisting of three sets of particles distributed uniformly along the perimeters of circles. Here, we know the exact parameterization of our target distribution and use NEEMo to fit three randomly initialized circles to the event. Figure 1 shows a few steps in the fit evolution. The Kantorovich potential given by the Lipschitz-constrained network induces forces on the parameters $\theta$ of $\mathcal{E}'_\theta$, which drive it to evolve from its random initialization to perfect alignment with the target distribution. In this example, $\mathcal{O}(\mathcal{E})$ in equation 7 quantifies the 3-circliness of the event $\mathcal{E}$, an observable first defined in Gambhir et al. (2022). To highlight the flexibility, we next consider an event with two sets of particles distributed along the perimeters of a triangle and ellipse, respectively. Figure 3 shows that $\mathcal{E}'_\theta$ again evolves following the gradients of the Kantorovich potential to perfect alignment with the target distribution.
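One way to picture a parameterized shape such as the three circles used here is as a function that maps parameters to weighted points. The sketch below is an assumption-laden illustration: it places points at evenly spaced angles (rather than sampling them randomly), gives every point equal energy, and uses the hypothetical names `circle_points` and `three_circles` with a flat parameter list `(cx, cy, r)` per circle.

```python
import math


def circle_points(cx, cy, r, n):
    """n equally weighted points at evenly spaced angles on a circle."""
    return [(1.0 / n, (cx + r * math.cos(2.0 * math.pi * k / n),
                       cy + r * math.sin(2.0 * math.pi * k / n)))
            for k in range(n)]


def three_circles(theta, n_per_circle):
    """Shape made of three circles. `theta` is a flat list of nine numbers,
    (cx, cy, r) for each circle. Energies are rescaled to sum to 1."""
    points = []
    for start in range(0, 9, 3):
        cx, cy, r = theta[start:start + 3]
        points += [(E / 3.0, xy) for E, xy in circle_points(cx, cy, r, n_per_circle)]
    return points


theta = [0.0, 0.0, 1.0,   2.5, 0.0, 0.5,   0.0, 3.0, 0.8]
shape = three_circles(theta, n_per_circle=50)
```

Every point position is a smooth function of the centers and radii, which is what allows gradients of the dual objective to act on the circle parameters.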

Figure 3: Same as Fig. 1 but fitting to distributions parameterized by a triangle and an ellipse.


We now perform a model jet-substructure study, clustering synthetic data into $N$-subjets. First, we generate jets with 3, 4, or 5 subjet centers distributed uniformly. From each center we generate 10 particles drawn from a Gaussian distribution. We then use our algorithm to fit 3, 4, or 5 centers to the simulated jets. Figure 4 shows that our algorithm is able to estimate the correct number of subjets. The EMD of the $N$-subjet fit is clearly lowest for jets with $N$ true clusters.
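The synthetic jets described above could be generated along the following lines. This is a sketch only: the text does not specify the range of the uniform distribution, the Gaussian width, or the particle energies, so `extent`, `spread`, and the equal-energy choice are assumptions, as is the function name `make_jet`.

```python
import random


def make_jet(n_subjets, rng, particles_per_subjet=10, spread=0.05, extent=1.0):
    """Generate a toy jet: subjet centers drawn uniformly in a square of
    half-width `extent`, then Gaussian-smeared particles around each center.
    Returns (centers, particles), where particles are (energy, (x, y)) pairs
    with equal energies that sum to 1 (an assumed convention)."""
    centers = [(rng.uniform(-extent, extent), rng.uniform(-extent, extent))
               for _ in range(n_subjets)]
    n_total = n_subjets * particles_per_subjet
    particles = [(1.0 / n_total, (rng.gauss(cx, spread), rng.gauss(cy, spread)))
                 for cx, cy in centers
                 for _ in range(particles_per_subjet)]
    return centers, particles


rng = random.Random(42)
jets = {n: make_jet(n, rng) for n in (3, 4, 5)}
```

Fitting 3, 4, or 5 centers to each generated jet and comparing the resulting EMD values is then the model-selection step discussed in the text.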

Figure 4: From left to right: Fit of $N$ subjets (centers) to jets with 3, 4, or 5 true subjets.

Future Directions

In the framework developed in these proceedings, any parameterized source distribution can be chosen to fit any target distribution using the EMD, without any $\epsilon$-approximations. This can be used, e.g., for constructing precision jet observables that are sensitive to percent-level fluctuations for new physics searches at LHC experiments. In addition, NEEMo provides a more precise way to quantify event modifications due to hadronization and detector effects. Finally, the flexibility provided by NEEMo could potentially have a major impact on jet studies at the future Electron-Ion Collider, where traditional clustering methods are not optimal. Rather than modifying the metric used in a sequential-recombination algorithm as in Arratia et al. (2021), the jet geometry itself can be altered using NEEMo in an event-by-event unsupervised manner. We plan to report on all of these novel directions in a follow-up journal article that is currently in preparation.

5 Broader Impacts

Comparing probability distributions is a fundamental task in statistics. Most commonly used methods only compare densities in a point-wise manner, whereas the Earth Mover’s Distance accounts for the geometry of the underlying space. This is easily visualized in our figures showing the Kantorovich potential. Due to space constraints we only showed a few toy example applications in collider physics, but we stress that the approach we present here, directly calculating the Earth Mover’s Distance using the Kantorovich-Rubinstein dual formulation of the Wasserstein-1 metric, can be applied to any optimal transport problem. While the existence of the KR duality has long been known, it only recently became possible to simultaneously enforce the exact 1-Lipschitz bound while achieving enough expressiveness to find the optimal Kantorovich potential. Our approach now makes it possible to perform gradient-based optimizations over the exact Earth Mover’s Distance. Given the sizable impact of similar approximate methods, we expect our exact approach could have applications across many fields and types of problems.


We thank Rikab Gambhir and Jesse Thaler for helpful discussions about SHAPER, and for introducing us to this problem. This work was supported by NSF grant PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions), and by DOE grant DE-FG02-94ER40818.