A response-matrix-centred approach to presenting cross-section measurements

03/15/2019 ∙ by Lukas Koch, et al. ∙ Science and Technology Facilities Council

The current canonical approach to publishing cross-section data is to unfold the reconstructed distributions. Detector effects like efficiency and smearing are undone mathematically, yielding distributions in true event properties. This is an ill-posed problem, as even small statistical variations in the reconstructed data can lead to large changes in the unfolded spectra. This work presents an alternative or complementary approach: the response-matrix-centred forward-folding approach. It offers a convenient way to forward-fold model expectations in truth space to reconstructed quantities. These can then be compared to the data directly, similar to what is usually done with full detector simulations within the experimental collaborations. For this, the detector response (efficiency and smearing) is parametrised as a matrix. The effect of the detector on the measurement of a given model is simulated by simply multiplying the binned truth expectation values by this response matrix. Systematic uncertainties in the detector response are handled by providing a set of matrices according to the prior distribution of the detector properties and marginalising over them. Background events can be included in the likelihood calculation by giving them their own bins in truth space. To facilitate a straightforward use of response matrices, a new software framework has been developed: the Response Matrix Utilities (ReMU). ReMU is a Python package distributed via the Python Package Index. It only uses widely available, standard scientific Python libraries and does not depend on any custom experiment-specific software. It offers all methods needed to build response matrices from Monte Carlo data sets, use the response matrix to forward-fold truth-level model predictions, and compare the predictions to real data using Bayesian or frequentist statistical inference.




1 Motivation

Cross-section measurements are an important tool for investigating possible manifestations of “new physics”, i.e. phenomena beyond the currently accepted models. This is either done directly with the cross-section measurements, e.g. in beyond-the-standard-model searches at collider experiments, or by using cross-section measurements as inputs for other experiments, e.g. the use of neutrino cross-section measurements for constraining systematic uncertainties in oscillation experiments (see Figure 1).

Figure 1: Neutrino cross sections for oscillation experiments. Neutrino oscillation experiments need to reconstruct the neutrino energy event by event. Since the neutrino is invisible, only the products of the interaction can be used for this. Exact reconstruction is made impossible by undetectable particles below the detection threshold of the respective detectors. Models for electroweak nuclear interactions are used to correct for these effects. Currently there are large theoretical uncertainties, especially in the hadronic model (initial state of the nucleus, nucleon form factors, etc.) and the Final State Interactions (FSI). Cross-section measurements play an important role in constraining these uncertainties (see e.g. [Alvarez-Ruso2018]).

It is thus important to present these results in a way that allows re-interpretation of the data when new insights into the theoretical models are gained in the future. Especially in the case of neutrino cross-section measurements it can be difficult to disentangle the measured quantities from detector effects and (possibly poorly motivated) model assumptions (see [Uchida2018] for an overview of challenges and possible solutions). Measurements that do not take care of these issues can end up being difficult to interpret and ultimately become useless for global data comparisons, fits, etc.

There is no single recipe that ensures that a measurement is free of model assumptions or detector effects. A couple of points are important to keep in mind though. For example, when trying to constrain a certain aspect of an electroweak nuclear interaction model, it is important to make sure that no assumptions of that model are influencing the measurement. In the worst case, one can end up publishing “data” that is really just a carbon copy of the model. Checks against these kinds of model bias are a common part of physics analyses.

It is equally important though, to ensure that (possibly implicit) assumptions about model effects one is not interested in do not affect the measurement. The detection efficiency and reconstruction resolution of an event in a real detector can depend on a lot of variables. In principle, it depends on the type, momentum and direction of every single particle that leaves a vertex and can theoretically be detected. In practice, one is interested in the properties of only a few of those particles. Even when considering just one particle, it is often not fully characterised by its three momentum components, but the information is reduced to, for example, the magnitude of the momentum to avoid bins with very few events in n-dimensional histograms.

Unfortunately, not looking at the other variables does not mean that their influence on the detector efficiency goes away. Models that are well tuned to real data in the distribution of a certain quantity, like the total lepton momentum in a charged-current neutrino interaction, can differ wildly from reality in distributions that simply have not been looked at before, e.g. the second highest proton momentum in an event. Ignoring these quantities means that one uses the average efficiency of the events, obtained for a certain distribution of the ignored quantities.

This can lead to very different efficiencies and purities of event selections if two theories predict very different distributions. For example, all detectors have certain energy/momentum thresholds below which they are not sensitive to particles. If two theories (or a theory and reality for that matter) now predict different fractions of events/particles below that threshold, the resulting average efficiency of selecting the events will vary accordingly.
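This effect can be illustrated with a small toy calculation. The momentum spectra, the threshold value, and the assumption of perfect detection above threshold are purely hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def average_efficiency(momenta, threshold=0.2):
    """Selection efficiency of a toy detector: particles below the
    momentum threshold are never detected; above it, detection is
    assumed to be perfect."""
    return np.mean(momenta > threshold)

# Two toy models predicting different momentum spectra (hypothetical numbers)
model_1 = rng.exponential(scale=0.5, size=100_000)  # softer spectrum
model_2 = rng.exponential(scale=1.5, size=100_000)  # harder spectrum

eff_1 = average_efficiency(model_1)
eff_2 = average_efficiency(model_2)

# The harder spectrum puts fewer particles below threshold,
# so its average selection efficiency is higher.
```

Even though the detector is identical in both cases, the average efficiency differs because the two models populate the sub-threshold region differently.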

A lot of work is done on minimising or at least quantifying these effects. Strategies range from doing multi-dimensional differential cross-section measurements (to ensure all dependencies of the efficiency are modelled), to repeating the analysis with multiple theories and simply quoting how much the results depend on the model used. The former approach requires a lot of data to have a significant number of events in every bin, while the latter suffers from the uncertainty of whether the available models cover reality at all.

The response-matrix-centred method described in this work aims to combine the model independence of the multi-dimensional approach with the ability of the naive model test to work with low numbers of events. This is achieved by de-coupling the binning of the reconstructed events from the description of the events at the generator level. The high dimensionality of variables is only needed in truth space, i.e. the description of the events at the generator level. The actual recorded data can be binned much more coarsely in reco space, i.e. with wider bins and/or fewer reconstructed variables. The response matrix is the connecting piece between the two, describing how likely an event in a particular truth-space bin is to end up in each of the reco-space bins.

If the truth binning is chosen carefully, the response matrix should be sufficiently independent of any assumed physics model of the interactions. (A certain dependence on the event distribution within the bins will always remain, but it can be reduced to the point where it does not matter compared to other uncertainties.) That is, different models can predict different truth space distributions, but the values of the response matrix elements do not depend on the model that is used to build the matrix (aside from statistical effects from the number of available simulated events in each truth bin). The real data and response matrix can then be used with arbitrary models to calculate a likelihood and extract cross sections.
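In its simplest form, the forward fold described here is just a matrix multiplication of the binned truth-space expectation values. A minimal sketch with purely hypothetical numbers:

```python
import numpy as np

# Hypothetical 2-bin reco space and 3-bin truth space.
# R[i, j] = probability that an event in truth bin j is
# reconstructed in reco bin i. Columns sum to <= 1; the
# missing probability is the detection inefficiency.
R = np.array([
    [0.70, 0.10, 0.00],
    [0.10, 0.60, 0.05],
])

# A model's prediction of event counts in the truth bins
truth = np.array([1000.0, 500.0, 200.0])

# Forward folding: expected event counts in reco space
reco_expectation = R @ truth
```

The folded expectation can then be compared directly to the binned data, e.g. with a Poisson likelihood, without ever unfolding the data itself.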

This is so far not different from the naive model testing method. The advantage of the response matrix approach is realised when considering the matrix and the raw data as the main result of the measurement. They are (ideally) independent of any model assumptions and can be used to test any new model or model improvement that will be developed in the future. Furthermore, if the raw data and response matrix are published, model developers can use them directly to test new models against old data. Compared to the classical approach, where theories are developed by theorists and then tested within the experimental groups in dedicated analyses, this reduces the time of the development cycle considerably. In fact, considerable effort has been put into making old experimental results available for easy model tuning, for example with the NUISANCE[nuisance] or Rivet[Buckley2010] frameworks. Results obtained with the response-matrix-centred approach would be very easy to include in such global fits.

It might seem like a shortcut for lazy experimental physicists to simply publish the raw data and response matrix and leave the rest to the model builders. This is not the case though, since the construction of the response matrix requires exactly the same understanding of the detector and the same care to cover all systematics as a classical unfolding analysis. It is also unlikely that any experimental group would publish the data and response matrix without also using them for their own model tests.

It is worth noting that model comparisons in reconstructed (or smeared) space are in general more powerful than equivalent model tests in truth (or unfolded) space [Cousins2016]. The forward-folding approach might thus also be advantageous for analyses that do have enough statistics to do a multi-dimensional unfolding of the results. In any case, one is not restricted to doing one or the other. If the data, time and person-power allow it, it might be the best choice to publish both an unfolded result as well as the raw data with a response matrix. The additional work needed for an unfolding analysis on top of a forward-folding one is probably less than it might seem. The response matrix itself can be used to do the unfolding, e.g. using a likelihood fit or Markov Chain Monte Carlo with the bin-by-bin truth-level predictions as fit parameters. If the model-independence criterion of the forward-folding matrix leads to very underconstrained truth bins (i.e. a much finer truth binning than in reconstructed space), the dimensionality can be reduced by fitting templates of a theory. This would mean the result is no longer model-independent, but it could be argued that a purely unfolding analysis would suffer from the same problems.
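As a sketch of this idea (not ReMU's actual implementation), such a likelihood-fit unfolding with the response matrix might look as follows, using illustrative numbers and scipy. Note that with more truth bins than reco bins, as here, the truth-space solution is not unique, which is exactly the underconstrained situation described above:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical response matrix and observed reco-space data
R = np.array([
    [0.70, 0.10, 0.00],
    [0.10, 0.60, 0.05],
])
data = np.array([740, 420])

def neg_log_likelihood(truth):
    """Poisson negative log-likelihood of the data given truth-bin
    contents, up to a constant independent of the parameters."""
    mu = R @ truth
    return np.sum(mu - data * np.log(mu))

# Maximum likelihood unfolding: fit the truth-bin contents directly
result = minimize(
    neg_log_likelihood,
    x0=np.array([1000.0, 500.0, 200.0]),
    bounds=[(1e-9, None)] * 3,  # event counts must be non-negative
)
truth_fit = result.x
```

At the maximum, the forward-folded fit reproduces the observed reco-space counts; the spread of equally good truth-space solutions reflects the underconstrained bins.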

The description of the mathematical model of the response-matrix-centred approach can be found in LABEL:sec:strategy. Details on how to build the matrix and how to contain the knowledge about the systematic uncertainties in it are given in LABEL:sec:implementation. The algorithms are implemented in a Python software library called ReMU (Response Matrix Utilities). It is intended to make the usage of the data and response matrix as easy as possible. More information about the software and data formats is included in LABEL:sec:software.

2 Example analysis


2.1 Introduction

This section is intended as a rough outline of how ReMU and the response-matrix-centred approach could be used in practice. Since the actual software is subject to active development, we will concentrate on the principles rather than the actual implementation here. The example is taken from the documentation of ReMU, and the reader should refer to it for the full implementation details and additional information [ReMUdocs].

The mock experiment that is handled here is quite simple. It records events with only two properties, x and y. Only x is smeared by the detector (a Gaussian blur), while the efficiency of detecting an event depends only on y. There are no background events. Let us assume we are interested in a measurement of the distribution of x.
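A minimal simulation of such a mock experiment might look as follows. The smearing width and the efficiency curve are hypothetical choices for illustration, not the values used in the ReMU documentation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Truth: two event properties, x and y (toy distributions)
n_events = 10_000
true_x = rng.normal(0.0, 1.0, n_events)
true_y = rng.normal(0.0, 1.0, n_events)

# Detector model (illustrative choices):
# x is smeared with a Gaussian, y is measured perfectly
reco_x = true_x + rng.normal(0.0, 0.3, n_events)

# Detection efficiency depends only on y (a hypothetical sigmoid turn-on)
efficiency = 1.0 / (1.0 + np.exp(-2.0 * true_y))
detected = rng.random(n_events) < efficiency

# Keep only the detected events as the "recorded" data
reco_x = reco_x[detected]
reco_y = true_y[detected]
```

Even though the analysis targets x, the y-dependent efficiency shapes which events survive, which is why y must appear in the truth binning below.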

2.2 Preparation of the response matrix

The response matrix must be prepared by the detector experts within an experiment’s collaboration. Only they have the necessary knowledge about the detector response and its uncertainties, as well as access to the full detector simulation framework. Care must be taken to ensure that the response matrix is actually as model-independent as desired. ReMU offers a few methods to test the matrices for that property.

Response matrix objects are created by specifying the binning in reco and truth space. They are then filled with simulated events that were processed with the full detector simulation and analysis chain:

    respA = migration.ResponseMatrix(reco_binning, truth_binning)
    respA.fill_from_csv_file("modelA_data.txt")

A model-independent response matrix should not depend on the model that was used to populate it. ReMU offers a method to calculate the Mahalanobis distance between two matrices and compare the result with the expected distribution, assuming that the two matrices are random variations of the same matrix (see LABEL:sec:matrix_tests):

    respB = migration.ResponseMatrix(reco_binning, truth_binning)
    respB.fill_from_csv_file(["modelB_data.txt"])
    respA.plot_compatibility("compatibility.png", respB)

See Figure 2 for how these plots might look for compatible and incompatible matrices. Note that passing this test is a necessary condition for model-independent matrices, but not a sufficient one. The available models for this test might be too similar to show any intrinsic model-dependencies of the response matrix. It is up to the detector experts to make sure that they are covering the necessary response variations in the truth binning.
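The idea behind such a compatibility test can be sketched as follows. This is a simplified stand-in, not ReMU's actual implementation, and it assumes a diagonal covariance for the element differences:

```python
import numpy as np

def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance between two flattened matrices,
    given a covariance matrix for the difference of their elements."""
    d = a.ravel() - b.ravel()
    return d @ np.linalg.solve(cov, d)

# Two hypothetical 2x3 response matrices built from different models,
# and a diagonal covariance describing the statistical uncertainty
# on the element differences (illustrative numbers)
A = np.array([[0.70, 0.10, 0.00], [0.10, 0.60, 0.05]])
B = np.array([[0.68, 0.12, 0.01], [0.11, 0.58, 0.05]])
cov = np.diag(np.full(6, 0.01**2))

d2 = mahalanobis_sq(A, B, cov)
# In the Gaussian limit, d2 for compatible matrices follows a
# chi-squared distribution; a much larger value signals
# model-dependence of the response matrix.
```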

Figure 2: Matrix compatibility plots. The squared Mahalanobis distance (see LABEL:sec:matrix_tests) of two matrices populated with simulated events from different models. When the truth binning cannot cover the varying detector response between the models (left), the distance (vertical line) will be larger than expected from purely random variation (blue histogram). In the Gaussian limit, this variation should be chi-squared distributed (green curve). If the binning covers the model differences (right), the distance should fall within the expected distribution, or be smaller. The distance can be smaller than the expected distribution because the parametrisation of the uncertainties of the matrix elements starts with a prior (or pseudo observations) that is common to both compared matrices (see LABEL:sec:build). So two matrices with no data in them will be perfectly identical, despite large expected statistical uncertainties.

In the case of this example, it is necessary to bin the truth both in x (because this is the variable of interest) and in y (because the detection efficiency depends on this variable). Note that if the response matrix is model-independent, it can actually be populated by all available simulated data combined:

    resp.fill_from_csv_file(["modelA_data.txt", "modelB_data.txt"])

Figure 3 shows a plot of the 2D projections of the final response matrix. It was generated by one of the many methods in ReMU that are intended to help gain insight into the properties of the response matrices when building them:

    resp.plot_values("optimised_response_matrix.png", variables=(None, None))

Figure 3: Response matrix projection. 1D and 2D projections of the distribution of events that populate the response matrix. Only events that have been reconstructed are included, i.e. this shows the smearing/migration of events, but not the efficiency. Dotted lines outside the plot axes indicate that the corresponding bin is not constrained in that direction, i.e. it behaves like an over- or underflow bin.

2.3 Using the response matrix

Once the response matrix (or the set of response matrices, see LABEL:sec:likelihood) is prepared and published with the data vector, it can be used to do statistical tests. This can be done both inside the experiment's collaboration as well as outside of it. The usage of the response matrix does not require expert knowledge of the detector.

Within ReMU, the data and response matrices are combined into LikelihoodMachine objects. These then provide methods to do different statistical tests on model predictions. It does not matter what exactly the data looks like or how many matrices make up the set; the interface to the user stays the same.

The simplest kind of test that can be done is comparing the likelihoods of different models. For this, all that is needed is a model prediction for the events in truth space:

    truth_binning.fill_from_csv_file("modelA_truth.txt")
    modelA = truth_binning.get_values_as_ndarray()

The LikelihoodMachine can then calculate (log-)likelihoods of the measured data, given this prediction and marginalising over the detector uncertainties encoded in the set of response matrices:

    lm.log_likelihood(modelA)
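Conceptually, this marginal likelihood amounts to averaging the Poisson likelihood of the folded prediction over the set of response matrices. A rough sketch of the idea with toy numbers (not ReMU's internal code):

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def poisson_log_likelihood(data, mu):
    """Log-likelihood of Poisson-distributed bin counts."""
    return np.sum(data * np.log(mu) - mu - gammaln(data + 1.0))

def marginal_log_likelihood(data, truth, response_matrices):
    """Marginalise over detector systematics by averaging the
    likelihood over response matrices drawn from the systematics
    prior (average in probability, not in log space)."""
    log_ls = [poisson_log_likelihood(data, R @ truth) for R in response_matrices]
    return logsumexp(log_ls) - np.log(len(log_ls))

# Hypothetical toy inputs: truth prediction, two matrix variations, data
truth = np.array([1000.0, 500.0, 200.0])
matrices = [
    np.array([[0.70, 0.10, 0.00], [0.10, 0.60, 0.05]]),
    np.array([[0.68, 0.11, 0.01], [0.11, 0.59, 0.05]]),
]
data = np.array([740, 420])

log_l = marginal_log_likelihood(data, truth, matrices)
```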

In case models have free parameters, it is also possible to maximise the likelihood over the allowed parameter space. For example, one can use the (area-normalised) shape of models as templates and let the template weight (i.e. the number of true events) be fitted to the data:

    modelA_shape = TemplateHypothesis([modelA / np.sum(modelA)])
    lm.max_log_likelihood(modelA_shape)

This will return both the maximised (log-)likelihood and the parameter values at that point.
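For a single-template hypothesis, this maximisation reduces to a one-dimensional fit of the template weight. A sketch with illustrative numbers (not ReMU's implementation):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical response matrix and observed data
R = np.array([[0.70, 0.10, 0.00], [0.10, 0.60, 0.05]])
data = np.array([740, 420])

# Area-normalised truth-space shape of a hypothetical model
shape = np.array([1000.0, 500.0, 200.0])
shape = shape / shape.sum()

def neg_log_likelihood(weight):
    """Poisson negative log-likelihood as a function of the single
    template weight (total number of true events), constants dropped."""
    mu = R @ (weight * shape)
    return np.sum(mu - data * np.log(mu))

result = minimize_scalar(neg_log_likelihood, bounds=(1.0, 1e6), method="bounded")
best_weight = result.x
```

The best-fit weight is the total number of true events that, after folding, best reproduces the observed counts.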

Since likelihood values alone are hard to interpret, ReMU also offers several methods to calculate p-values:

    lm.likelihood_p_value(modelA)
    lm.max_likelihood_p_value(modelA_shape)
    lm.max_likelihood_ratio_p_value(model0, model1)

These respectively calculate the probability of

  • measuring data that yields a lower likelihood than the actual one, assuming the provided model is true,

  • measuring data that yields a lower maximum likelihood than the actual one, assuming the best fit point of the provided model is true,

  • measuring data that yields a lower ratio of maximised likelihoods between the two specified models than the actual one, assuming the best fit point of model0 is true.

These p-values can then be used to check goodness of fit, to construct different confidence intervals, or to do frequentist hypothesis tests.
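The first of these p-values can be estimated with pseudo-experiments. A rough sketch of the procedure (not ReMU's implementation):

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

def log_likelihood(data, mu):
    """Poisson log-likelihood of the binned data."""
    return np.sum(data * np.log(mu) - mu - gammaln(data + 1.0))

def likelihood_p_value(data, mu, n_toys=5000):
    """Fraction of pseudo-experiments drawn from the model prediction
    mu that yield a lower likelihood than the actually observed data."""
    observed = log_likelihood(data, mu)
    toys = rng.poisson(mu, size=(n_toys, len(mu)))
    toy_lls = np.array([log_likelihood(t, mu) for t in toys])
    return np.mean(toy_lls < observed)

mu = np.array([750.0, 410.0])   # folded model prediction (toy numbers)
data = np.array([740, 420])     # observed counts (toy numbers)
p = likelihood_p_value(data, mu)
```

A small p-value indicates that data as unlikely as the observed one is rarely produced by the model, i.e. a poor goodness of fit.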

Let us assume we have two models we want to compare to the data. Model A assumes that the true properties x and y of the events are uncorrelated and normally distributed. Model B assumes a correlation between x and y (see Figure 4). The two models also feature slightly different means of the distributions. See Table 1 for a summary of the model parameters. Each model only predicts the shape of the event distribution, but not the total number of events. Note that even though we are only interested in a measurement of x, the different behaviours in y lead to different average detection efficiencies between the models.

Model     μ_x   μ_y   σ_x   σ_y   ρ
Model A   0.1   0.2   1.0   1.0   0.0
Model B   0.0   0.0   1.0   1.0   0.5
Table 1: Example models.
Figure 4: Example models. True distribution of events in model A (left) and model B (right). Dotted lines outside the plot axes indicate that the corresponding bin is not constrained in that direction, i.e. it behaves like an over- or underflow bin.

Maximising the likelihood over the free normalisation parameter of the models yields two maximum likelihood solutions that both fit the data reasonably well (see Figure 5). A look at their respective max_likelihood_p_value results tells us that one of the models is slightly disfavoured.

Figure 5: Reco-space model comparison. The model predictions are shown with mean and standard deviation due to detector systematics. The data is shown with "error bars" for visualisation purposes only.

Instead of globally excluding a hypothesis, it can be useful to look at the local p-values in the parameter space. Figure 6 shows this for the two models. Again, the same model is disfavoured. This is not useful for constructing confidence intervals under the assumption that each model is true, though. This is done in Figure 7, using the max_likelihood_ratio_p_value of the fixed-normalisation prediction vs the floating-normalisation models. By construction, this p-value is 1 at the maximum likelihood fit point of the parameter space. Models A and B yield different confidence intervals for the total number of events, because their average detection efficiencies are different.

Figure 6: Local p-values, as a function of the template weight (number of true events). Vertical lines show the maximum likelihood solutions. The dotted lines show the results when not applying any detector systematics, i.e. using only a single (nominal) response matrix.
Figure 7: Likelihood ratio p-values, as a function of the template weight (number of true events). Vertical lines show the maximum likelihood solutions. The dotted lines show the results when not applying any detector systematics, i.e. using only a single (nominal) response matrix.

This kind of model-dependent result is a good illustration of the advantage of the response-matrix-centred forward-folding approach to sharing data. Had the data of this mock experiment been shared as an unfolded distribution, it would have had to include the different average efficiencies of possible models in the systematic uncertainties of the result. This would lead to inflated errors compared to the specific model tests done here. But even those inflated errors could not guarantee that the coverage of the true value is as expected if the true model is not among the considered ones. Alternatively, the result could have been unfolded in both x and y, but that approach is not always feasible. Depending on the available data statistics, the number of variables that influence the detector response, and how well the detector can measure those in the first place, the data might not be able to constrain all relevant truth bins. This is not an issue in the forward-folding approach.

3 Conclusions

The response-matrix-centred approach to presenting cross-section measurements promises to be a useful addition to the set of tools available to cross-section analysts. It combines the fine-grained model-independence of a multi-dimensionally unfolded differential cross-section measurement with the ability to work with low-statistics, coarsely binned real data. Even for measurements with enough real data for very fine reco-binning, the comparison in reco-space can offer superior model-separation power over comparisons in (unfolded) truth-space [Cousins2016].

A method was presented for handling systematic detector uncertainties by providing a set of possible response matrices and calculating the marginal likelihood of their reco-space predictions. Statistical uncertainties of the matrix elements, stemming from finite Monte Carlo statistics, are handled in a similar way: by quantifying the uncertainties and creating random variations of the response matrices accordingly.
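One simple way to generate such random variations is to sample each truth column of the migration counts from a Dirichlet posterior. This is a sketch of the general idea with hypothetical counts, not necessarily ReMU's exact parametrisation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical Monte Carlo event counts: rows = reco bins, with the
# last row collecting unreconstructed (lost) events; columns = truth bins.
mc_counts = np.array([
    [70.0, 10.0,  1.0],
    [10.0, 60.0,  5.0],
    [20.0, 30.0, 94.0],   # not reconstructed
])

def random_response_matrix(counts, prior=1.0):
    """Draw one response-matrix variation: each truth column is sampled
    from a Dirichlet posterior with a flat prior over the reco bins."""
    sampled = np.column_stack([
        rng.dirichlet(col + prior) for col in counts.T
    ])
    return sampled[:-1, :]  # drop the 'lost' row: columns now sum to <= 1

variations = [random_response_matrix(mc_counts) for _ in range(100)]
```

Each variation is a valid response matrix consistent with the finite Monte Carlo sample, and the set can be fed into the marginal likelihood described above.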

Three methods to deal with backgrounds were presented, avoiding subtracting any events from the original data vector. Irreducible background must be added to the signal truth bins by the (background) model. “Physics-like” background can be handled just like signal events, with its own set of truth bins and corresponding response parametrisation in the response matrix. Detector-specific background templates can be put directly into the response matrix, with a single truth bin determining their strength.
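The latter two options amount to appending truth bins to the response matrix. A minimal sketch with hypothetical numbers, where the last truth bin acts as a detector-background template:

```python
import numpy as np

# Hypothetical response matrix with an extra truth bin (last column)
# acting as a background template: its column gives the reco-space
# shape of the background, its truth value gives the strength.
R = np.array([
    [0.70, 0.10, 0.00, 0.30],
    [0.10, 0.60, 0.05, 0.70],
])

signal_truth = np.array([1000.0, 500.0, 200.0])
background_strength = 50.0

# Signal and background are folded together in a single operation,
# without subtracting anything from the data vector.
truth = np.concatenate([signal_truth, [background_strength]])
reco_expectation = R @ truth
```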

The biggest challenge for the analyst creating the matrix is to choose an appropriate truth binning. The detector response typically depends on a multitude of variables, but the number of bins grows exponentially with the number of binning dimensions. This leads to a very high demand for Monte Carlo statistics to build the response matrices.

Once the matrix is available though, it is relatively easy to test various physics models against the data, without having to re-evaluate the detector response and its uncertainties for each model. This will be useful for the NUISANCE[nuisance] and Rivet[Buckley2010] frameworks, for example.

I want to thank my colleagues of the T2K collaboration for supporting me during the genesis of this paper, including – but not limited to – Morgan Wascko, Kendall Mahn and Stephen Dolan. Especially Stephen has been very helpful, with many fruitful discussions about the finer points of (un-)folding, and feedback on paper drafts. I also want to thank my colleagues at the STFC Rutherford Appleton Laboratory for their feedback and for providing a "collider physics perspective". This work was partially supported by the Deutsche Forschungsgemeinschaft (DFG) [grant number RO3625-1].

Appendix A Statistical methods