DeepAI

# Background Modeling for Double Higgs Boson Production: Density Ratios and Optimal Transport

We study the problem of data-driven background estimation, arising in the search of physics signals predicted by the Standard Model at the Large Hadron Collider. Our work is motivated by the search for the production of pairs of Higgs bosons decaying into four bottom quarks. A number of other physical processes, known as background, also share the same final state. The data arising in this problem is therefore a mixture of unlabeled background and signal events, and the primary aim of the analysis is to determine whether the proportion of unlabeled signal events is nonzero. A challenging but necessary first step is to estimate the distribution of background events. Past work in this area has determined regions of the space of collider events where signal is unlikely to appear, and where the background distribution is therefore identifiable. The background distribution can be estimated in these regions, and extrapolated into the region of primary interest using transfer learning of a multivariate classifier. We build upon this existing approach in two ways. On the one hand, we revisit this method by developing a powerful new classifier architecture tailored to collider data. On the other hand, we develop a new method for background estimation, based on the optimal transport problem, which relies on distinct modeling assumptions. These two methods can serve as powerful cross-checks for each other in particle physics analyses, due to the complementarity of their underlying assumptions. We compare their performance on simulated collider data.

• 8 publications
• 2 publications
• 2 publications
• 9 publications
• 80 publications
02/15/2021

### Model-Independent Detection of New Physics Signals Using Interpretable Semi-Supervised Classifier Tests

A central goal in experimental high energy physics is to detect new phys...
10/13/2021

### Challenges for Unsupervised Anomaly Detection in Particle Physics

Anomaly detection relies on designing a score to determine whether a par...
05/25/2021

### Towards a method to anticipate dark matter signals with deep learning at the LHC

We study several simplified dark matter (DM) models and their signatures...
06/15/2019

### Detecting new signals under background mismodelling

Searches for new astrophysical phenomena often involve several sources o...
09/09/2018

### Nonparametric semisupervised classification for signal detection in high energy physics

Model-independent searches in particle physics aim at completing our kno...
10/20/2022

### Machine-Learning Compression for Particle Physics Discoveries

In collider-based particle and nuclear physics experiments, data are pro...
05/24/2019

### Large-Scale Statistical Survey of Magnetopause Reconnection

The Magnetospheric Multiscale Mission (MMS) seeks to study the micro-phy...

## 1 Introduction

The Standard Model (SM) of particle physics is a theory describing the interactions between elementary particles—the building blocks of matter. One key component of the SM is the presumed existence of a quantum field responsible for generating mass in certain elementary particles. This field is known as the Higgs field, originally theorized by Higgs (1964), Englert and Brout (1964). Excitations of the Higgs field produce particles, known as Higgs bosons, which were the subject of an intensive search by experimental particle physicists ever since the mid 1970s. In July 2012, two independent experiments at the Large Hadron Collider (LHC) at CERN (the European Organization for Nuclear Research) announced the observation of a new particle consistent with the SM Higgs boson (ATLAS, 2012; CMS, 2012). Having discovered this Higgs-like particle, current work is concerned with detailed studies of its properties, in order to confirm or refute those predicted by the SM. One such property is the so-called Higgs boson self-coupling, whereby a single excitation of the Higgs field can split into two Higgs bosons without intermediate interactions with other particles. Observing this phenomenon would provide compelling new information regarding the mechanism of particle mass generation. This paper is concerned with some of the statistical challenges posed by such a measurement.

The LHC is housed in a massive underground tunnel in which two counter-rotating beams of protons are accelerated to nearly the speed of light. When these protons collide, new particles are almost always formed, and their paths within particle detectors are recorded. Individual collisions are known as events. An event in which two Higgs boson are generated is called a double Higgs (or di-Higgs) event. The Higgs boson is a highly unstable particle, thus whenever it is produced, it decays into other particles almost immediately, making di-Higgs production impossible to observe directly.

The Higgs boson most commonly decays into a pair of so-called bottom quarks (-quarks). An event in which four -quarks are observed is thus a candidate di-Higgs event, but could also have arisen from various other physical processes that produce four -quarks. We say that a di-Higgs event in which the Higgs bosons decay into four -quarks is a signal event, while any other event tagged as having four -quarks is called a background event. The problem of searching for double Higgs boson production reduces to testing whether the proportion of signal events is nonzero among the observed data. As we describe in Section 3, carrying out this test is a well-understood statistical task when the distribution of both background and signal events is known. While the di-Higgs signal distribution can be approximated to sufficient accuracy with first-principles simulation, simulating the background distribution suffers from large high-order corrections which are computationally intractable (Di Micco et al., 2020). Instead, the background distribution must be estimated using the observed data. This is known as the problem of data-driven background modeling, which is the main subject of this paper.

As stated, the background distribution is not a statistically identifiable quantity, due to the potential presence of an unknown proportion of signal in the data. Any analysis strategy must therefore make some modeling assumptions to make the background estimation problem tractable. As we discuss below, it is standard to assume that the background distribution is related in some way to the distribution of certain auxiliary events, which in turn is identifiable. An example of useful auxiliary events is those consisting of less than four observed -quarks, since they are unlikely to be signal events, but are kinematically related to background events (Bryant, 2018). Stated differently, the distribution of appropriately chosen auxiliary events is an identifiable estimand which has undergone a distributional shift relative to the non-identifiable background distribution of interest. If the analyst has access to a sample of auxiliary events, its empirical distribution provides a first naive approximation of the desired background distribution. To obtain a more precise estimate, one must correct for the distributional shift and cross-check the assumptions used to do so.

### 1.1 Our Contributions

This paper develops two methodologies for data-driven background modeling in di-Higgs boson searches. Our methods are fully nonparametric, and each hinge upon a characteristic modeling assumption. The assumptions made by our methods are distinct and complementary, thus these methods can serve as cross-checks for each other in di-Higgs boson searches, an important benefit that will increase the analyst’s trust in the obtained background estimates.

Our first approach develops a probabilistic classifier for discriminating auxiliary events from background events. This classifier leads to an estimate of the density ratio between such events. We fit this density ratio in a signal-free region of the phase space, known as the Control Region, and then extrapolate it to the region of primary interest, known as the Signal Region. Any deviation of this extrapolated density ratio from unity is then used to correct the distributional shift undergone by the auxiliary sample. As we discuss in Section 1.2

, variants of this procedure are the most widely-used approach to data-driven background modeling; our contribution here is to develop a powerful new classifier for this method, which is specifically tailored to collider data. In particular, our classifier is a convolutional neural network with residual layers

(He et al., 2016), whose architecture accounts for the structure and symmetries of collider events with multiple identical final state objects.

Our second contribution is to develop a new background estimation method which is based on the optimal transport problem between distributions of collider events. Optimal transport has already proven to be a powerful tool for unsupervised domain adaptation in classification problems (Courty et al., 2016), and here we propose to use it rather differently to correct distributional shifts between estimates of the auxiliary and background distributions. Our approach involves the out-of-sample estimation of optimal transport maps, for which we consider two different estimators. While the first is based on smoothing of an in-sample optimal coupling, and has previously been proposed in the literature (cf. Section 1.2), our second estimator appears to be new, and also leverages our new classifier for collider data.

The optimal transport problem requires a metric on the space of collider events, for which we employ a variant of the metric proposed by Komiske, Metodiev and Thaler (2019), which is itself obtained through the optimal transport problem of matching clusters of energy deposits in collision events. Our approach therefore involves a nested use of optimal transport.

We illustrate the empirical performance of our two methodologies on realistic simulated collider data. We observe that both approaches lead to quantitatively similar background estimates, despite the complementarity of their underlying modeling assumptions. In particular, this example illustrates how our methods can be used to cross check each other in practice.

### 1.2 Related Work

Di-Higgs boson production has been the subject of numerous recent searches by the ATLAS and CMS collaborations at the LHC—we refer to the recent survey paper of Di Micco et al. (2020) for an overview. The four -quark final state is the most common decay channel for di-Higgs events, but suffers from a large multijet background as described previously. Each of the most recent searches in this final state (e.g. ATLAS (2018a), ATLAS (2019), ATLAS (2021), CMS (2022), ATLAS ) performed data-driven background estimation by first estimating a density ratio (between background and auxiliary events) in a Control Region, and extrapolating this density ratio to the Signal Region. Certain searches, such as ATLAS (2019)

, estimate the density ratio using heuristic one-dimensional reweighting schemes, while others, such as

CMS (2022), use off-the-shelf multivariate classifiers for this purpose. Our work builds upon the latter approach by designing a classifier tailored to the structure and symmetries present in collider data.

The idea of estimating density ratios using classifiers has a long history in statistics—see for instance Qin (1998), Cheng and Chu (2004), Kpotufe (2017)—and appears in a variety of applications in experimental particle physics (e.g. Cranmer, Pavez and Louppe (2015), Brehmer et al. (2020), CMS (2022)). Classification-based estimators have the practical advantage of circumventing the need for high-dimensional density estimation, which can be particularly challenging to perform over the space of collider events. In the context of background estimation, a density ratio is first estimated in the Control Region, and then extrapolated into the Signal Region; this extrapolation can be viewed as an instance of transfer learning (Weiss, Khoshgoftaar and Wang, 2016). It has been empirically witnessed that modern classification algorithms, such as deep neural networks, have the ability to transfer well to new distributions (Yosinski et al., 2014), which further motivates their use for density ratio estimation in the context of background estimation.

While a careful choice of classifier can greatly improve the accuracy of transfer learning, such a transfer cannot be possible if the underlying distribution in the Signal Region is unconstrained relative to its counterpart in the Control Region. Therefore, this type of procedure must necessarily place an implicit modeling assumption on the underlying distributions, which is typically difficult to verify in practice. Given that variants of this procedure are used in each of the most recent di-Higgs searches listed above, this motivates us to develop a background estimation method with a complementary modeling assumption, in Section 5. Rather than density ratios, the key object of interest in this methodology will be the notion of an optimal transport map, which arises from the celebrated optimal transport problem (Villani, 2003)

. Optimal transport has received a surge of recent interest in the statistics and machine learning literature—we refer to

Panaretos and Zemel (2019a, b); Kolouri et al. (2017); Peyré and Cuturi (2019) for recent reviews. Closest to our setting are recent applications of optimal transport to domain adaptation for classification problems; see for instance Courty et al. (2016), Redko, Habrard and Sebban (2017), Rakotomamonjy et al. (2022), and references therein. Nested optimal transport formulations, as in our work, have recently been used for other tasks such as multilevel clustering (Ho et al., 2017, 2019; Huynh et al., 2021) and multimodal distribution alignment (Lee et al., 2019).

Our methodology relies on the estimation of optimal transport maps or couplings between distributions of collider events. The question of estimating optimal transport maps over Euclidean space has been the subject of intensive recent study (Perrot et al., 2016; Forrow et al., 2018; Makkuva et al., 2019; Schiebinger et al., 2019; Nath and Jawanpuria, 2020; Hütter and Rigollet, 2021; de Lara, González-Sanz and Loubes, 2021; Deb, Ghosal and Sen, 2021; Manole et al., 2021; Ghosal and Sen, 2022; Gunsilius, 2022). While many of these works are tailored to the quadratic cost function, the widely-used nearest-neighbor estimator (Flamary et al., 2021; Manole et al., 2021) can naturally be defined over the non-Euclidean ground space appearing in our work, and is used in one of our background estimators defined in Section 5.2.2.

Beyond the search of di-Higgs boson production, we emphasize that the question of background estimation arises in a variety of problems in experimental high-energy physics, where our methodologies could also potentially be applied. We refer to the book of Behnke et al. (2013) for a pedagogical introduction to statistical aspects of the subject; see also Appendix 1 of Lyons (1989). Finally, we mention some recent advances on the widely-used sPlot (Barlow, 1987; Pivk and Le Diberder, 2005; Borisyak and Kazeev, 2019; Dembinski et al., 2022) and ABCD (Alison, 2015; Aad et al., 2015; Choi and Oh, 2021; Kasieczka et al., 2021) techniques for background estimation, the latter of which can be viewed as a precursor to the methods developed in this paper.

### 1.3 Paper Outline

The rest of this paper is organized as follows. Section 2 contains background about the LHC and di-Higgs boson production. Section 3 outlines the statistical procedure used for signal searches in collider experiments at the LHC, and mathematically formulates the data-driven background modeling problem. In Section 4, we describe how to perform background estimation using a classifier for discriminating auxiliary events from background events, and we describe our new classifier architecture for this purpose. In Section 5, we describe a new approach to background modeling based on the optimal transport problem. We then compare these methods in a simulated di-Higgs search in Section 6. We close with a discussion in Section 7.

## 2 Background

### 2.1 LHC Experiments and di-Higgs Boson Production

The LHC is the largest particle collider in the world, consisting of a 27 kilometer-long tunnel in which two counter-rotating beams of protons are accelerated to nearly the speed of light. These particles are primarily collided in one of four underground detectors, named ALICE, ATLAS, CMS and LHCb. ATLAS and CMS are general-purpose detectors used for a wide range of physics analyses, including Higgs boson-related searches, while ALICE and LHCb focus on specific physics phenomena. We focus on the CMS detector in what follows, but similar descriptions can be made for the ATLAS detector.

When two protons collide, their energy is converted into matter, in the form of new particles. The goal of the CMS (Compact Muon Solenoid) detector is to measure the momenta, energies and types of such particles. To measure their momenta, CMS is built around a giant superconducting solenoid magnet, depicted in Figure 1, which deforms the trajectories of particles as they move from the center of the detector outward through a silicon tracker. The extent to which the trajectory of a charged particle is bent depends on its momentum and can hence be used to measure the momentum. After the silicon tracker, CMS consists of several layers of calorimeters which measure the energies of the particles. We refer to CMS (2008) for a complete description of the CMS detector.

Certain proton-proton collisions give rise to highly unstable particles which decay almost instantly into more stable particles. In this case, the detector is only able to observe these long-lived particles. By measuring their energies and momenta, insight can be gained into the physical properties of the unstable particles from which they originate.

The Higgs boson is an example of an unstable particle, which decays within approximately seconds. The SM predicts that a Higgs boson decays into a pair of bottom quarks (-quarks) of the time, and this decay channel has indeed been observed experimentally (ATLAS, 2018b; CMS, 2018a). Other channels which have been observed experimentally include the decay of a Higgs boson into pairs of photons (ATLAS, 2018c; CMS, 2018b), W bosons (ATLAS, 2018d; CMS, 2019), Z bosons (ATLAS, 2018e; CMS, 2018c), and tau leptons (atlas2018d, ; CMS, 2018d). The SM further predicts the rare possibility that two Higgs bosons can be produced simultaneously, and this paper is concerned with the statistical challenges arising in the search for this process, which has yet to be observed experimentally. If this process were to occur, the two resulting Higgs bosons would each, in turn, be most likely to decay into two -quarks, thus making four -quarks the most common final state of di-Higgs boson events. We focus on this decay channel (abbreviated HH) throughout this paper. We note that -quarks form into bound states with other quarks called -hadrons which are themselves unstable, and rapidly decay into collimated sprays of stable particles called -jets, which can be efficiently identified by the CMS detector.

### 2.2 Collider Events and the CMS Coordinate System

Particles measured by the CMS detector are typically represented in spherical coordinates. Given a particle with momentum vector

, its azimuthal angle is defined as the angle increasing from the positive -axis to the positive -axis, while the polar angle is increasing from the positive -axis to the positive -axis. The length of its projection onto the plane is called the transverse momentum . It is common to replace the polar angle by the pseudorapidity of the particle, given by .

In addition to the variables and , the rest mass of each particle can be obtained from the energy measurements made by the calorimeters in the CMS detector. Altogether, a particle jet is analyzed as a single point in this coordinate system, and encoded as a four-dimensional vector . In our search channel, collisions lead to multiple, say , jets measured by the detector, which may be encoded as the -dimensional vector . We opt for an alternative notation, which will be particularly fruitful for the purpose of defining a metric between collider events in Section 5.3. Specifically, an event will henceforth be represented by the discrete measure

 g=K∑i=1pTiδ(ηi,ϕi,mi), (2.1)

where denotes the Dirac measure placing unit mass at a point . In particular, the representation (2.1) emphasizes the invariance of an event with respect to the ordering of its jets. The transverse momenta may be viewed as a proxy for the energy of each jet, thus the total measure of denotes its total energy, denoted . The set of events with jets of interest is denoted by

 \calG(K)={K∑j=1pTjδ(ηj,ϕj,mj):pTj,mj>0, ϕj,ηj∈\bbR, 1≤j≤K},

where the definition of is extended from to the entire real line by -periodicity. In the context of double Higgs boson production in the four -jet final state, the choice will be most frequently used, and in this case we simply write .

Finally, we note that events are deemed invariant under the orientation of the - and -axes. This fact, together with the periodicity of the angle , implies that two events and may be deemed equivalent if they are mirror-symmetric in , as well as rotationally symmetric in , that is, if there exist and such that

 K∑j=1pTjδ(ι1ηj,Δ+ι2ϕj,mj)=K∑j=1p′Tjδ(η′j,ϕ′j,m′j). (2.2)

Formally, we define an equivalence relation between events in , such that if and only if there exist for which (2.2) holds.

## 3 Problem Formulation

### 3.1 Overview of Signal Searches at the LHC

In order to make inferences about the presence or absence of a signal process in collider data, event counts are commonly analyzed as binned Poisson point processes. While we focus on the setting of double Higgs boson production in the four -quark final state, the description that follows is representative of a wide range of signal searches for high-energy physics experiments.

Let denote a -finite Borel measure over the state space of collider events, with respect to a fixed choice of Borel -algebra on denoted . Let denote an inhomogeneous Poisson point process (Reiss, 2012) with a nonnegative intensity function on , that is, is a random point measure on such that

1. , where is the intensity measure induced by , defined by for all ;

2. are independent for all pairwise disjoint sets , for all integers .

Every four -jet collision event is either a signal event, namely an event arising from two Higgs bosons, or a background event, arising from some other physical process. Letting denote the rate of signal events, we write the intensity measure as

 λ(B)=\binten4(B)+μ\sinten(B),B∈\bbB(\calG),

where and , respectively, denote nonnegative background and signal intensity measures. is typically normalized such that the value corresponds to the theoretical prediction of the signal rate. The measures and typically depend on nuisance parameters related to the calibration of the detector, the uncertain parameters of certain physical processes, such as the parton distribution functions of the proton (Placakyte, 2011), and so on. We suppress the dependence on such nuisance parameters for ease of exposition. The parameter is of primary interest, since non-zero values of indicate the existence of signal events. A search for the signal process therefore reduces to testing the following hypotheses on the basis of observations from the Poisson point process :

 H0:μ=0 vs. H1:μ>0. (3.1)

Given a sequence of observed events, we may write , where is independent of the observations, which satisfy

 G1,G2,⋯iid∼λλ(\calG)=ϵS+(1−ϵ)P4. (3.2)

Here, and denote the respective signal and background distributions, and the proportion of signal events.

The Poisson point process is often binned in practice. Let denote a dimensionality reduction map, to be discussed below, which will be used to bin the point process using univariate bins. Let denote a collection of bins forming a partition of , and define the event counts

 Dj=\fppp(h−1(Ij))=∣∣{1≤i≤\Nfppp:h(Gi)∈Ij}∣∣,j=1,…,J. (3.3)

The definition of

implies that the random variables

are independent and satisfy

 Dj∼Poisson(Bj+μSj),j=1,…,J, (3.4)

where and .

The likelihood ratio test with respect to the joint distribution of

is typically used to test the hypotheses (3.1) (ATLAS, CMS and Higgs Combination Group, 2011). The binned likelihood function for the parameter is given by

 L(μ)=J∏j=1(Bj+μSj)DjDj!e−(Bj+μSj). (3.5)

Di-Higgs events are rare in comparison to background events, thus the signal-to-background ratio is low. At the time of writing, values of

which are typically observed at the LHC may be too small for any test to have power in rejecting the null hypothesis in (

3.1) at desired significance levels. Analyses which fail to reject instead culminate in an upper confidence bound on , also known as an upper limit (ATLAS, CMS and Higgs Combination Group, 2011).

The power of the likelihood ratio test for (3.1) may be increased by choosing a function which maximizes the separation between background and signal event counts across the bins. Informally, the optimal such choice of is given by

 h(g)=\bbP(G is a Signal Event|G=g), (3.6)

which may be estimated using a multivariate classifier, such as a neural network or boosted decision trees, for discriminating background events from signal events.

The signal intensity measure is theoretically predicted by the SM, and can be approximated well using Monte Carlo event generators (Di Micco et al., 2020). The background intensity is, however, intractable due to the strongly interacting nature of quantum chromodynamics (QCD) in which events with the four -quark final state can be produced via an enormous number of relevant and complex pathways. The intensity measure , or its binned analogue , must therefore be estimated from the collider data itself, which we refer to as data-driven background modeling. This problem constitutes the primary focus of this paper.

### 3.2 Setup for Data-Driven Background Modeling

The aim of this paper is to develop data-driven estimators of the background intensity measure . The primary challenge is the fact that the sample is contaminated with an unknown proportion of signal events. The background estimation problem is thus statistically unidentifiable as stated, and it will be necessary to impose further modeling assumptions.

In order to formulate these assumptions and our resulting background modeling methods, we assume that the analyst has access to a second Poisson Point Process consisting of auxiliary events which were tagged by the CMS detector as having four jets, of which exactly three are -jets. We refer to such events as “ events”, as opposed to “ events” which were identified as having four -jets. We stress that the terms and do not refer to the true number of -quarks arising from the collision, rather the number of -jets identified by the detector111An event with exactly three -quarks is not physical, since -quarks can only arise in pairs; however, the detector may incorrectly label events as having three -jets.. As we further discuss in Section 6, the majority of events in fact arise from the hadronization of two -quarks and two charm or light quarks, while a small proportion arise from four -quarks. For the purpose of a discovery analysis, the sample can therefore be treated as having a negligible proportion of signal events (Bryant, 2018; CMS, 2022). We treat this proportion as zero for sake of exposition. We shall henceforth denote the intensity measure of the point process by , and we denote by

the corresponding probability distribution of the independent observations

.

The kinematics of the 3b events are expected to be similar, but not equal, to those of background events (CMS, 2022). However, unlike , notice that is an identifiable estimand, due to the lack of signal events in the process . Any consistent estimator of this quantity can be used to provide a zeroth-order approximation of (up to a correction for normalization). This approximation is, however, insufficiently accurate to be used as a final estimate of and our goal is to develop statistical methods for correcting this naive background estimate.

Recall that the four -jets of any signal event are naturally paired, with each pair arising from a Higgs boson. The true pairing of the jets is unknown to the detector; however, it may be approximated, for instance using an algorithm due to Bryant (2018). We use the same pairing algorithm in our work. Given as input an event , this deterministic algorithm outputs one among the three distinct unordered pairs of measures which satisfy . We refer to and as dijets. When is a signal event, we expect that each dijet arose from a decay of a Higgs boson, whereas when is a background event, we expect that at least one of the two dijets arose from the decay of a different particle.

The Higgs boson is known to have mass approximately equal to 125 GeV (ATLAS, 2012; CMS, 2012). It follows that the two dijets should approximately satisfy , where denotes the invariant mass222If denotes the sum of the energies of the constituent jets of , and denotes the magnitude of the sum of their momentum vectors, then the invariant mass of is defined by  (Hagedorn, 1964; Martin and Shaw, 2017). of any , . Large deviations of the dijet invariant masses from 125 GeV indicate that is not a signal event. This provides a heuristic for determining events among which are unlikely to be signal events. To this end, we form subsets such that , where is called the Signal Region, containing events with dijet masses near , and is called the Control Region, containing all other events which will be used in the analysis. We follow Bryant (2018) and employ the following specific definitions of and :

 \calGsr =⎧⎪⎨⎪⎩g∈\calG: ⎷(1−mHm(g1))2+(1−mHm(g2))2≤κs⎫⎪⎬⎪⎭, (3.7) \calGcr ={g∈\calG:√(m(g1)−σCmH)2+(m(g2)−σCmH)2≤κc}∖\calGsr, (3.8)

for some constants .

These regions are illustrated in Figure 2 for a simulated background sample. We similarly partition the Poisson intensity measures , by defining for all ,

 βcj(B)=βj(B∩\calGcr),βsj(B)=βj(B∩\calGsr),j=3,4.

These four measures are illustrated in Figure 3. Furthermore, we assume for ease of exposition that these measures are absolutely continuous with respect to the dominating measure , and we let for all and .

Recall that the collider events associated with the intensity measures and are signal-free by construction, and those from are also signal-free under the assumption that negligibly few signal events will fall outside of . These three intensity measures can therefore be estimated directly by means of their empirical intensity functions. We have thus reduced the background modeling problem to that of estimating , given estimates of , and .

To this end, we will partition the samples into the sets

 {Gs1,…,Gs\nfs} :={G1,…,G\nf}∩\calGsr, {Hs1,…,Hs\nts} :={H1,…,H\nt}∩\calGsr, {Gc1,…,Gc\nfc} :={G1,…,G\nf}∩\calGcr, {Hc1,…,Hc\ntc} :={H1,…,H\nt}∩\calGcr,

where and . Furthermore, let

 \bintenc3,\ntc=\tppp|\calGcr=\ntc∑i=1δHci,\bintens3,\nts=\tppp|\calGsr=\nts∑i=1δHsi,\bintenc4,\nfc=\fppp|\calGcr=\nfc∑i=1δGci

denote the empirical estimators of the measures , illustrated in the background of Figure 3. As previously noted, the measure provides a zeroth-order approximation of (after a normalization correction), thus a naive first estimate of is given by . As we shall see in the simulation study of Section 6, this approximation is insufficiently accurate to be used as a final estimator. Our methodologies improve upon it by modeling the discrepancy between the and distributions in the Control Region via , and then making use of that information in the Signal Region to improve the accuracy of as an estimator of .

Once we are able to derive an estimator of , based on the signal-free observations , , we may define the fitted histogram , . One may then test the hypotheses (3.1) using the likelihood ratio test, based on the following modification of the likelihood function in equation (3.5),

 ˜L(μ)=J∏J=1(^Bj+μSj)DsjDsj!e−(^Bj+μSj),where % Dsj=∣∣{1≤i≤ms:h(Gsi)∈Ij}∣∣. (3.9)

Here, can be viewed as a restriction of the likelihood to the Signal Region. Notice that is independent of , for any . In practice, it is also necessary to incorporate statistical and systematic uncertainties pertaining to the estimator into the hypothesis testing procedure (ATLAS, CMS and Higgs Combination Group, 2011). Since formal uncertainty quantification for background modeling is beyond the scope of this work, we omit further details, and provide further discussion of this point in Section 7. With this framework in place, the primary difficulty remaining in the testing problem (3.1) is that of deriving estimators of the background intensity measure . In what follows, we derive three estimators for this quantity, outlined in Figure 3, which rely on complementary modeling assumptions.

## 4 Background Modeling via Density Ratio Extrapolation

The discrepancy between and background distributions may be directly quantified in the Control Region, where no signal events are present. Under the assumption that these distributions vary sufficiently smoothly between the Control and Signal Regions, this discrepancy may be extrapolated into the signal region to produce a correction of the signal region intensity measure , leading to an estimate of . This general strategy forms the basis of most existing background modeling methodologies discussed in Section 1.2. We first recall in this section how this approach may be carried out using a classifier for discriminating and events. We then develop a new classifier architecture specifically tailored to this type of collider data. Beyond its use for background modeling, this classifier will make a reappearance in Section 6, where we use it as a tool for validating fitted background models. We will also employ the same classifier architecture for discriminating background and signal events, and ultimately use it as the final binning function in equation (3.6).

### 4.1 Reduction to Classification

Let denote a random collider event, arising from either the or distributions, and define the latent binary random variable indicating the component membership of ; that is, takes on the value 1 if is and takes on the value 0 if is . Setting , it is a straighforward consequence of Bayes Rule that

 bc4(g)bc3(g)=ψ(g)1−ψ(g),g∈\calGcr, (4.1)

and hence,

 \bintenc4(B)=∫Bψ(g)1−ψ(g)d\bintenc3(g),B∈\bbB(\calGcr). (4.2)

Equations (4.14.2

) are a reformulation for our context of the well-known fact that, up to normalization, a likelihood ratio may be expressed as an odds ratio (cf. Section

1.2). Estimating the ratio of to intensity functions in the Control Region thus reduces to the classification problem of estimating , say by a classifier . This observation has the practical advantage of circumventing the need of performing high-dimensional density estimation. Assuming that the estimator  can be evaluated in the Signal Region , disjoint from its training region , we may postulate that the measure

 B∈\bbB(\calGsr)⟼∫B^ψ(g)1−^ψ(g)d\bintens3(g), (4.3)

provides a reasonable approximation of . The quality of such an approximation is driven by the ability of the classifier to generalize between regions of the phase space. To formalize this, we will assume for simplicity that is a parametric classifier lying in a class , for some parameter space , , which is obtained via the minimization problem

 \hpsi=ψ\halpha,  where  \halpha∈\argminα∈Ω⎧⎨⎩1\ntc\ntc∑i=1\calL(ψα(Hci),0)+1\nfc\nfc∑j=1\calL(ψα(Gcj),1)⎫⎬⎭,

for a loss function

. We then make the following assumption. The conditional probability

satisfies the following conditions.

• (Correct Specification) There exists such that .

• (Generalization) We have,

Assumption 4.1 implies that a classifier trained solely in the Control Region can consistently estimate the full conditional probability , for events lying in both the Control and Signal Regions. Such an assumption guarantees the ability of the classifier to generalize from the Control Region, making the ansatz (4.3) justified. A natural estimator for is then obtained by replacing in equation (4.3) by its empirical counterpart . Doing so leads to the estimator

 ^\bintens4=\nts∑i=1^ψ(Hsi)1−^ψ(Hsi)δHsi.

is called the estimator, and we refer to as the FvT (Four vs. Three) classifier. A careful choice of this classifier can dramatically increase the quality of fit, and we now describe a new classifier architecture specifically tailored to collider events of the type considered in this work.

### 4.2 A Classifier for Collider Events

We aim to design a classifier over , which

1. is invariant to the ordering of the constituent jets in an input event ;

2. is invariant with respect to the equivalence relation defined in (2.2);

3. incorporates the dijet substructure of an event .

We opt for a convolutional neural network architecture with residual layers, or ResNet (He et al., 2016), as depicted in Figure 4. An input event to the network is treated as a one-dimensional image with 4 pixels, in any order, where each pixel corresponds to one jet parametrized by its coordinates . The layers of the network are then designed as follows.

1. The 4 input pixels are duplicated twice to obtain an image with 12 pixels, such that all 6 possible pairs of dijets appear exactly once, consecutively in the image. Each such pair is then convolved into a one-dimensional image of six dijet pixels.

2. Each pair of dijets from this image is further convolved into an image with three pixels, consisting of the three possible pairing representations of the original four-jet, or quadjet, event.

3. The three quadjet pixels are combined into the last pixel representing the whole event.

4. A final output layer passes the event through a softmax activation function to produce the fitted probability

.

Notice that (c) is satisfied by construction of the network. We approximately impose (a) by replacing dijet pixels in step (2) by their sums and absolute values of their differences. Similar modifications may be made in (1) to render the network entirely invariant to relabelling of the input jets, as in (a), but we have observed icreased performance by allowing the convolutions in step (1) to have information about the ordering of the jets. Quadjet pixels from (2) are added together weighted by a real-valued score in step (3) to guarantee permutation invariance of the three pixels.

To enforce (b), recall that the network should be invariant under transformations , and , for any , simultaneously across all constituent jets of an event. We enforce invariance under the former two transformations after step (3) by averaging the event-level pixel with its image under . In principle, the same may be done to enforce invariance under the rotation by averaging over a large grid of candidate values . To reduce the computational burden which would arise from such an operation, we instead apply random rotations at each batch used in the Adam optimizer (Kingma and Ba, 2015) which we choose for training the network.

In steps (1)-(3), we also add engineered features specifically designed for dijet, quadjet, and event-level pixels respectively. These engineered features are designed such that they preserve invariance properties (a) and (b). For instance, in step (1) we insert pixels containing dijet masses and the Euclidean distance between dijet angular variables .

## 5 Background Modeling via Optimal Transport

The methodology described in the previous section hinged upon the ability of the classifier to accurately extrapolate from the Control Region to the Signal Region, implying that the and intensity functions vary smoothly across these two regions. The validity of this assumption is difficult to verify in practice, and motivates us to develop a distinct approach with a complementary modeling assumption. In this section, rather than extrapolating the discrepancy between the and intensity functions, we will extrapolate the discrepancy between the intensity functions in the Control and Signal Regions, as illustrated in Figure 3.

We cannot use a density ratio to quantify the discrepancy between intensity functions in the Control and Signal Regions, because these regions are disjoint. We will instead use the notion of a transport map, which will be defined in the sequel. In order to employ transport maps, it will be convenient to normalize all intensity functions throughout this section. That is, we will define an estimator for by separately estimating the probability measure and the normalization . More generally, we write

 Pcj=\bintencj/\bintencj(\calGcr),Psj=\bintensj/\bintensj(\calGsr),j=3,4,

to denote the four population-level probability measures, with corresponding empirical measures

 Pa3,na=1nana∑i=1δHai,Pa4,ma=1mama∑i=1δGai,a∈{c,s}.

A transport map between and is any Borel-measurable function such that whenever , we have . When this condition holds, we write , and we say pushes forward onto , or that is the pushforward of under . Equivalently, this condition holds if and only if

 Ps3(B)=T#Pc3(B)=Pc3(T−1(B)),for all B∈\bbB(\calGsr).

We propose to perform background estimation under the following informal modeling assumption, which will be stated formally in the sequel. There exists a map such that

 T0#Pc3=Ps3,andT0#Pc4=Ps4. (5.1)

Assumption 5 requires the and distributions to be sufficiently similar for there to exist a shared map which pushes forward their restrictions to the Control Region into their counterparts in the Signal Region. If such a map were available, it would suggest the following procedure for estimating ,

1. Fit an estimator of based only on the observations;

2. Given any estimator of , use the pushforward as an estimator of .

For this approach to be practical, we must specify an explicit candidate satisfying Assumption 5. We propose to choose such that its movement of the probability mass from into that of is minimal. This leads us to consider the classical optimal transport problem, which we now describe.

### 5.1 The Optimal Transport Problem

Assume a metric on the space is given; we provide a candidate for such a metric in Section 5.3. For any transport map pushing forward onto , we refer to as the cost of moving an event to an event . The optimal transport problem seeks to find the choice of which minimizes the expected cost of transporting onto , which amounts to solving the following optimization problem

 \argminT:\calGcr→\calGsr∫\calGcrW(h,T(h))dPc3(h),s% .t. T#Pc3=Ps3. (5.2)

Equation (5.2) is known as the Monge problem (Monge, 1781). When a solution to the Monge problem exists, it is said to be an optimal transport map. We postulate that, when it exists, the optimal transport map from to is a sensible candidate for the map appearing in the statement of Assumption 5.

A shortcoming of this choice is the requirement that there exist a solution to the optimization problem (5.2). It is well-known that the Monge problem over Euclidean space admits a unique solution for absolutely continuous distributions, when the cost function is the squared Euclidean norm (Knott and Smith, 1984; Brenier, 1991). While sufficient conditions for the solvability of the Monge problem in more general spaces are given by Villani (2008, Chapter 9), we do not know whether they are satisfied by the metric space under consideration. Furthermore, the Monge problem may not even be feasible between distributions which are not absolutely continuous, which precludes the possibility of estimating using the optimal transport map between the empirical measures of and .

Motivated by these considerations, we introduce a classical relaxation of the Monge problem, known as the Kantorovich optimal transport problem (Kantorovich, 1942, 1948). Let denote the set of all joint Borel distributions over whose marginals are respectively and , in the sense that and . We refer to such joint distributions as couplings. Consider the minimization problem

 \calW(Pc3,Ps3)=infπ∈Π(Pc3,Ps3)∫\calGcr×\calGsrW(g,h)dπ(g,h). (5.3)

When the infimum in (5.3) is achieved by a coupling , this last is known as an optimal coupling. When an optimal coupling is supported on a set of the form , for some map , it can be seen that is in fact an optimal transport map between and . The Kantorovich problem (5.3) is therefore a relaxation of the Monge problem (5.2). Unlike the latter, however, the minimization problem (5.3) is always feasible since is non-empty; indeed, always contains the independence coupling between and . Moreover, the infimum in the Kantorovich problem is achieved as soon as the cost function is lower semi-continuous, and the measures and

satisfy a mild moment condition (

Villani (2008), Theorem 4.1). We also note that the optimal objective value defines a metric between probability measures called the (first-order) Wasserstein distance (Villani, 2003), or Earth Mover’s distance (Rubner, Tomasi and Guibas, 2000).

Using the Kantorovich relaxation, we now formalize Assumption 5 into the following condition, which we shall require throughout the remainder of this section. Assume there exists an optimal coupling between and . Given a pair of random variables , let denote the conditional distribution of given , for any . Then, the following implication holds:

 Gc∼Pc4Gs|Gc∼π0(⋅|Gc)  ⟹  Gs∼Ps4. (5.4)

Assumption 5.1 requires the and distributions to be sufficiently similar for their restrictions to the Signal and Control Regions to be related by a common conditional distribution. It further postulates that this conditional distribution is induced by the optimal coupling . Heuristically, plays the role of a multivalued optimal transport map for pushing an observation from the distribution onto . Assumption 5.1 requires this map to additionally push the distribution onto its counterpart in the Signal Region. In the special case where there exists an optimal transport map from to , we note that is an optimal coupling of with , where denotes the identity map. In this case, equation (5.4) is tantamount to equation (5.1).

### 5.2 Background Estimation

We next derive estimators for the background distribution under Assumption 5.1. Under this condition, it follows from equation (5.4

) and the law of total probability that

 Ps4(⋅)=∫\calGcπ0(⋅|g)dPc4(g).

Since is the distribution of given , it is an identified parameter which can be estimated using only the data. Given an estimator of this quantity, and an estimator of , it is natural to consider the plugin estimator of the background distribution , given by

 ^Ps4(⋅):=∫\calGc\hpi(⋅|g)d^Pc4,\nfc(g). (5.5)

In what follows, we begin by defining an estimator in Section 5.2.1, followed by two candidates for the estimator , leading to two distinct background estimation methods defined in Sections 5.2.2 and 5.2.3. In Section 5.2.4, we briefly discuss how these constructions also lead to estimators of the unnormalized intensity measure . We then provide discussion and comparison of these methodologies in Section 5.2.5.

#### 5.2.1 The Empirical Optimal Transport Coupling

A natural plugin estimator for the coupling is the optimal coupling between the empirical measures and . In detail, denoting by the joint probability mass function of , the empirical Kantorovich problem takes the following form:

 (5.6)

Equation (5.6

) is a finite-dimensional linear program, for which exact solutions may be computed using network simplex algorithms such as the Hungarian algorithm

(Kuhn, 1955). We refer to Peyré and Cuturi (2019) for a survey. We then define the estimator , for , as the discrete distribution supported on , with probability mass function defined by

 ^qj|i=^qij∑\ntsk=1^qik=\ntc⋅^qij,j=1,…,\nts.

With this definition in place, we are in a position to define estimators of the background distribution .

#### 5.2.2 The Estimator

We first consider the general estimator in equation (5.5) when is the empirical measure . This choice is perhaps most natural, but it requires us to perform out-of-sample evaluations of the estimator . Indeed, recall that we merely defined the latter for all , whereas is supported on the distinct set .

We extend the support of to all using a variant of the method of nearest neighbors for nonparametric regression (Biau and Devroye, 2015). Let be an integer. For all , let denote the indices of the -nearest neighbors of with respect to , among . Specifically, we set where

 W(g,Hcj1)≤⋯≤W(g,Hcjk)≤W(g,Hcj),for all j∈[\ntc]∖{j1,…,jk}.

Furthermore, define the inverse distance weights

 ωi(g)=1/W(g,Hci)∑l∈Ik(g)1/W(g,Hcl),i∈Ik(g), (5.7)

with the convention . We then define for all ,

 ^πNN(⋅|g)=∑i∈Ik(g)ωi(g)^π(⋅|Hci). (5.8)

The estimator couples with all of the events to which its -nearest neighbors are coupled under . The coupling values which correspond to the closest nearest neighbors are assigned higher weights . Furthermore, we note that when , it holds that .

Using these defintions, the generic estimator in equation (5.5) takes the following form,

 ^Ps4,NN(⋅)=∫^πNN(⋅|g)dPc4,\nfc(g)=1\nfc\nfc∑ℓ=1∑i∈Ik(Gcℓ)ωi(Gcℓ)^π(⋅|Hci).

or equivalently,

 ^Ps4,NN=\ntc\nfc\nts∑j=1⎛⎜⎝\nfc∑ℓ=1∑i∈Ik(Gcℓ)ωi(Gcℓ)^qij⎞⎟⎠δHsj.

We refer to as the OT-NN (Optimal Transport– Nearest Neighbor) estimator of .

#### 5.2.3 The OT-FvT Estimator

It is known that the rate of production of events typically exceeds that of events by one order of magnitude (cf. Section 6). As a result, in the general formulation (5.5) of our optimal transport map estimators, we expect to have access to a smaller sample size for estimating the distribution , than the sample sizes and for estimating the optimal transport coupling . Motivated by this observation, we next define an estimator which can leverage the larger sample size .

Let denote the density of for . Recall from Section 4 that for any event , denotes the probability that a random event arose from the distribution as opposed to the distribution, given that . Furthermore, denotes the -valued output of the FvT classifier for discriminating events from events. Recall further that for any , it holds that