Parametric unfolding. Method and restrictions

04/27/2020
by   Nikolay Gagunashvili, et al.

Parametric unfolding of a true distribution distorted due to finite resolution and limited efficiency for the registration of individual events is discussed. Details of the computational algorithm of the unfolding procedure are presented.

1 Introduction

The measured distribution $P(x')$ of events with a reconstructed characteristic $x'$, obtained from a detector with finite resolution and limited efficiency, can be represented as

$$P(x') \propto \int_{\Omega} p(x)\, A(x)\, R(x'|x)\, dx \qquad (1)$$

where $p(x)$ is the true density, $A(x)$ the efficiency function describing the probability of recording an event with a true characteristic $x$, and $R(x'|x)$ the experimental resolution function, i.e. the probability of obtaining $x'$ instead of $x$ after the reconstruction of the event. The integration in (1) is carried out over the domain $\Omega$ of the variable $x$.

If a parametric model of the true distribution $p(x; a)$ with parameters $a$ exists, the model parameters can be estimated by fitting

$$P(x'; a) \propto \int_{\Omega} p(x; a)\, A(x)\, R(x'|x)\, dx \qquad (2)$$

to the measured distribution $P(x')$, as discussed e.g. in zhigunov ; zechebook .

To apply this method, the efficiency function and the resolution function must be defined. In many cases, especially in particle physics, they are not known analytically and are instead obtained by computer simulation of the measurement process. A test statistic for comparing the histogram of the measured distribution with the histogram of the measured distribution obtained by simulation gagu_comp was used in gagu_f , where the model parameters were estimated by minimization of this statistic. The test was improved significantly in chicom , and computer code implementing this test was developed in prev ; prev2 .

This paper extends the previous work by presenting a detailed parametric unfolding algorithm using the above-mentioned results chicom ; prev ; prev2 . A bootstrap algorithm for the calculation of the statistical errors of the estimated parameters has been developed. To gauge the quality of the results, a method of residual analysis has been developed that complements the p-value of the chi-square test statistic. Application of the method and its evaluation are demonstrated on a numerical example.

2 Fitting a simulated parametric model to data

In experimental particle and nuclear physics analyses, modelling the measurement process is usually the most time-consuming step, since it requires simulating particle transport through a medium and the rather complex registration apparatus. The minimization algorithm used to estimate the model parameters is an iterative procedure that may need to calculate the simulated measured histogram many times. A way to decrease the CPU-time demand is to perform an initial calculation for some distribution $g(x)$, and then to calculate the simulated measured histogram for an alternative true distribution $p(x)$ by taking all entries with weights sobol

$$w(x) = \frac{p(x)}{g(x)} \qquad (3)$$

that exploit the equality

$$\int_{\Omega} p(x)\, A(x)\, R(x'|x)\, dx = \int_{\Omega} w(x)\, g(x)\, A(x)\, R(x'|x)\, dx ,$$

where $A(x)$ and $R(x'|x)$ are the efficiency and resolution functions of (1).

Let us denote the sum of the entries to the th bin of the measured histogram by , and the sum of the weights of the events in the th bin of the simulated measured histogram by

(4)

where is the number of events in bin and is the weight of the th event in the th bin. Events that are not registered due to inefficiency are placed in an overflow bin . The total number of events is denoted by and the total number of simulated events by .
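The reweighting and bin bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `weighted_histogram`, its arguments, and the use of the ratio of the new to the initial true density as event weight are assumptions made for this example. Registered events that are reconstructed outside the histogram range are also counted in the overflow bin here, which is a choice of this sketch.

```python
import numpy as np

def weighted_histogram(x_true, x_reco, registered, p_new, g_init, edges):
    """Sum of weights per bin of the simulated measured histogram.

    x_true, x_reco : true and reconstructed values of the simulated events
    registered     : boolean mask, False for events lost to inefficiency
    p_new, g_init  : callables for the new and the initial true density
                     (hypothetical names, assumed for this sketch)
    edges          : bin edges in the reconstructed variable

    Returns an array with len(edges) entries: the regular bins followed by
    one overflow bin.
    """
    w = p_new(x_true) / g_init(x_true)            # event weights
    sums = np.zeros(len(edges))
    in_bins, _ = np.histogram(x_reco[registered], bins=edges,
                              weights=w[registered])
    sums[:-1] = in_bins
    # Everything not in a regular bin goes to the overflow bin, so the
    # total weight of the simulated sample is conserved.
    sums[-1] = w.sum() - in_bins.sum()
    return sums
```

Because only the weights depend on the model parameters, the expensive detector simulation is run once and the histogram is refilled cheaply at every step of the minimization.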

Let the values of the parameter be fixed. The hypothesis of homogeneity that the measured histogram with bin contents and the simulated histogram with bin contents are drawn from the same parent distribution is probed by the test statistic

(5)

where the first sum is over all bins , and the second sum omits the least sensitive bin as defined below. The estimates minimize ,

(6)

subject to the constraints

(7)

As shown in chicom , the power of the test is optimized by the choice

(8)

Statistic (5) approximately follows a chi-square distribution if the hypothesis of homogeneity is valid chicom .

Varying the model parameters , estimators for the best-fit parameters are found by minimization of the statistic (5),

(9)

If the parametric model fits the data, the statistic follows a chi-square distribution with a reduced number of degrees of freedom, because parameters are estimated in addition to the probabilities . It can therefore be used for a goodness-of-fit test and for selecting the best of a set of alternative models.

Another approach to the evaluation of the fit quality is the analysis of the residuals. The definition of Pearson’s residuals for usual histograms is

(10)

which for weighted histograms generalizes to

(11)

For a homogeneity test of two unweighted histograms, an adjustment of the residual was proposed in haberman , which for histograms with weighted entries becomes

(12)

and

(13)

where is the equivalent number of unweighted events for a sample of weighted events

(14)
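As an illustration of the equivalent number of unweighted events, the sketch below uses the widely used definition, the squared sum of the weights divided by the sum of the squared weights. That (14) has exactly this form is an assumption, and the function name is hypothetical.

```python
import numpy as np

def equivalent_events(weights):
    """Equivalent number of unweighted events, (sum w)^2 / sum(w^2).

    Assumed form for illustration; it reduces to the plain event count
    when all weights are equal.
    """
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)
```

For equal weights the result is the number of events. Strongly varying weights reduce it, which reflects the loss of statistical power caused by reweighting.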

If the hypothesis of homogeneity is valid, then the adjusted residuals are approximately independent and identically distributed random variables with a standard normal PDF.

The statistical errors of the parameters can be estimated by the bootstrap method efron . To apply this method, a set of resampled histograms is generated, each according to a multinomial distribution with parameters . The fit is done for each histogram of the set. The resulting set of parameter estimates also permits one to calculate an estimate of the covariance matrix of the parameters.
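The bootstrap procedure can be sketched as below. The callable `fit`, which maps a resampled histogram to a vector of parameter estimates, stands in for the full minimization and is hypothetical, as are the function and argument names. Which bin probabilities are passed to the multinomial generator is left to the caller.

```python
import numpy as np

def bootstrap_fit(n_total, probs, fit, n_resamples=1000, seed=0):
    """Refit multinomial resamples of a histogram.

    n_total : total number of measured events
    probs   : bin probabilities used for resampling (must sum to 1)
    fit     : hypothetical callable, histogram -> array of parameter estimates

    Returns the array of estimates, one row per resample, and the sample
    covariance matrix of the parameters.
    """
    rng = np.random.default_rng(seed)
    estimates = np.array([fit(rng.multinomial(n_total, probs))
                          for _ in range(n_resamples)])
    # rowvar=False: each column is one parameter, each row one resample
    cov = np.atleast_2d(np.cov(estimates, rowvar=False))
    return estimates, cov
```

The rows of `estimates` can be sorted parameter by parameter to obtain error intervals, and the correlation matrix follows from `cov` by dividing by the outer product of the standard deviations.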

3 Numerical example

Starting from a true PDF of the form

(15)

the measured density was defined according to (1) with an acceptance function

(16)

and a Gaussian resolution function

(17)

A simulation of 10 000 events generated according to was done (see algorithm in jinst ) and is presented as a histogram with 77 bins in Figure 1.

Figure 1: Histogram of true distribution and measured distribution .
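A generic simulation of this kind of measurement can be sketched as follows. This is not the algorithm of jinst: the sampler for the true distribution, the acceptance function and the resolution width are passed in by the caller, and all names are hypothetical. Each event is kept with the acceptance probability at its true value and then smeared with a Gaussian.

```python
import numpy as np

def simulate_measurement(sample_true, acceptance, sigma, n_events, seed=0):
    """Simulate a measurement with limited efficiency and Gaussian smearing.

    sample_true : callable (rng, n) -> n true values (hypothetical sampler)
    acceptance  : callable x -> registration probability in [0, 1]
    sigma       : width of the Gaussian resolution function

    Returns true values, reconstructed values and the registration mask.
    """
    rng = np.random.default_rng(seed)
    x_true = sample_true(rng, n_events)
    # An event is registered with probability acceptance(x_true).
    registered = rng.random(n_events) < acceptance(x_true)
    # Finite resolution: add Gaussian noise to the true value.
    x_reco = x_true + rng.normal(0.0, sigma, n_events)
    return x_true, x_reco, registered
```

Keeping the unregistered events, rather than discarding them, makes it possible to fill the overflow bin used in the test statistic.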

As input for the fitting procedure, the true distribution used for the full simulation is taken as

(18)

The result from simulating 1 000 000 events according to the algorithm described in jinst is shown in Figure 2 together with the initial distribution .

Figure 2: Histogram of true distribution and measured distribution .

As a fit-model for the true distribution the following function with five free parameters was chosen,

(19)

In the fit, event weights were used according to formula (3). The test statistic for a fixed set of parameters was calculated with the methods and codes published in chicom ; prev ; prev2 . The parameter variations were driven by the SIMPLEX algorithm of the package MINUIT minuit in order to determine the best-fit parameters by minimizing the test statistic. Figure 3 shows the unfolded distribution compared to the true PDF. Figure 4 shows the weighted histogram with the optimal parameter values in comparison with the histogram representing the measured events.
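A derivative-free simplex minimization of this kind can be sketched with SciPy's Nelder-Mead method. This is a stand-in for illustration only and is not MINUIT's SIMPLEX implementation. The callable `statistic`, which returns the value of the test statistic for a given parameter vector, is hypothetical, and the tolerances are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def fit_by_simplex(statistic, start):
    """Minimize a test statistic over the model parameters.

    statistic : hypothetical callable, parameter vector -> statistic value
    start     : starting values of the parameters

    Returns the parameter estimates and the minimum of the statistic.
    """
    result = minimize(statistic, np.asarray(start, dtype=float),
                      method="Nelder-Mead",
                      options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 5000})
    return result.x, result.fun
```

A simplex search needs no derivatives of the statistic, which suits a statistic built from reweighted histograms of simulated events.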

 1
 0.648   1
 0.648   0.459   1
-0.586  -0.631  -0.679   1
 0.754   0.663   0.882  -0.830   1
Table 1: Best fit parameters with uncertainties and correlation matrix.

The bootstrap method, with 1000 resamples, was used for estimating the error intervals and the correlation matrix of the fit parameters. To resample the measured histogram, bin contents were generated as multinomial random numbers cern with parameter and probabilities . The central value and the statistical errors of a particular parameter, for example , are assigned based on the ordered list of bootstrap estimates

(20)

where the number in parentheses shows the location when sorting in ascending order.

Starting from the smallest 68% confidence interval for , with lower and upper limits estimated by

(21)

where

(22)

the central value is taken to be and the uncertainties are estimated by the signed deviations from the central value

(23)
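The interval construction can be sketched as a search for the narrowest window that contains 68% of the ordered bootstrap estimates. The index convention used here may differ from that of (21) and (22), and the function names are hypothetical. The central value is passed in by the caller.

```python
import numpy as np

def shortest_interval(estimates, level=0.68):
    """Shortest interval containing a fraction `level` of the estimates."""
    x = np.sort(np.asarray(estimates, dtype=float))
    b = len(x)
    k = int(np.ceil(level * b))          # number of estimates in the window
    # Width of every window of k consecutive ordered estimates.
    widths = x[k - 1:] - x[:b - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

def signed_errors(central, lower, upper):
    """Signed deviations (+, -) of the interval limits from the central value."""
    return upper - central, lower - central
```

For a skewed distribution of estimates the two signed errors differ in size, which is why asymmetric uncertainties are quoted.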

The results are given in Table 1. For the best-fit parameters the value of the test statistic is , which corresponds to a p-value of .

A graphical analysis of the residuals was done to evaluate the result of the fitting procedure. Figure 5 shows the distribution of the residuals, for which a Kolmogorov-Smirnov test of normality gives a p-value of . Figure 6 shows the quantile-quantile plot of the residuals with the 95% confidence band qq . Figures 3, 4, 5 and 6 illustrate the potential of the parametric unfolding method, with very satisfactory p-values for the test statistics considered.
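The normality check of the residuals can be sketched with SciPy as below. The function name and the plotting positions for the quantile-quantile points are assumptions of this example, and the 95% confidence band of qq is not reproduced.

```python
import numpy as np
from scipy import stats

def residual_normality(residuals):
    """Kolmogorov-Smirnov test against N(0, 1) and Q-Q plot points.

    Returns the KS p-value, the theoretical standard normal quantiles
    and the ordered residuals.
    """
    res = np.sort(np.asarray(residuals, dtype=float))
    n = len(res)
    p_value = stats.kstest(res, "norm").pvalue
    # Plotting positions (i - 0.5) / n, one common convention.
    theoretical = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return p_value, theoretical, res
```

Plotting the ordered residuals against the theoretical quantiles gives the quantile-quantile plot. Points close to the diagonal indicate agreement with the standard normal distribution.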

Figure 3: True PDF and unfolded PDF (dashed line).

Figure 4: Histogram of measured PDF and histogram of fitted measured PDF (solid line).

Figure 5: Distribution of residuals.

Figure 6: Quantile-quantile plot of residuals with 95% confidence band.

4 Evaluation of the method

For the evaluation of the method as a whole, the procedure described above was repeated 1000 times. Sets of 10 000 events distributed according to were simulated to create histograms of the measured distribution . The same set of 1 000 000 simulated events distributed according to was used in each run. Figure 7 shows plots for all pairs of the 5 parameters and the distribution of the estimators of . Figure 8 shows the region covered by the 1000 estimates of the unfolded distribution together with the true distribution . Figure 9 presents a histogram of the distribution of the p-values and confirms that the theoretical distribution can be used for a goodness-of-fit test.

Figure 7: Plots for all pairs of estimators of and resulting correlation matrix.

Figure 8: Comparison of the regions covered by 1000 estimates of the unfolded distribution with the true distribution (solid line).

Figure 9: Distribution of p-values derived from the test statistic.

The accuracy of the bootstrap error estimates is checked by doing 1000 toy experiments and determining the range covered by the smallest 68% quantile interval of the parameter values. The results are given in Table 2, together with the error estimates obtained by the bootstrap method for the example discussed before. One finds reasonable agreement, with some indication that the bootstrap error estimates are slightly conservative.

parameter   errors (toy experiments)   errors (BS)
            +0.036  -0.045             +0.051  -0.038
            +0.084  -0.076             +0.084  -0.086
            +0.048  -0.062             +0.061  -0.056
            +0.114  -0.105             +0.137  -0.117
            +0.017  -0.023             +0.025  -0.021
Table 2: Error estimates from the smallest 68% quantile interval of the parameter distributions from 1000 toy experiments compared to the errors obtained by the bootstrap method for the example discussed before.
 1       0.585   0.597  -0.519   0.714
 0.648   1       0.346  -0.553   0.589
 0.648   0.459   1      -0.614   0.868
-0.586  -0.631  -0.679   1      -0.777
 0.754   0.663   0.882  -0.830   1
Table 3: Correlation matrices obtained by the bootstrap method (lower triangle) and calculated from the distribution of the 1000 simulated toy experiments (upper triangle).

Finally, the simulation study was done for different values of the resolution parameter . Results are presented in Table 4, showing how the accuracy of the parameter estimates diminishes with the worsening of the detector resolution. The fits were done without constraints for the values of the parameters. The study indicates that for larger resolution parameters eventually constraints will be needed to obtain stable results.

  parameter
0.049 0.063 0.081 0.123
0.096 0.122 0.160 0.252
0.070 0.089 0.110 0.143
0.147 0.179 0.219 0.265
0.027 0.032 0.040 0.041
Table 4: Sizes of 68% confidence intervals for different values of the resolution parameter in the response function.

Conclusions

Parametric unfolding of data measured by a detector with finite resolution and limited efficiency is presented. The method is developed as an application of an improved test for comparing weighted histograms and incorporates new computational algorithms and codes. The bootstrap method is employed to estimate the errors of the fit parameters. Residual analysis generalized for weighted histograms has been developed to gauge the quality of the unfolding result. A numerical example is given to illustrate the method, and an extensive simulation study was done to confirm that the proposed method as a whole is valid.

Acknowledgments

The author is grateful to Michael Schmeling (Max Planck Institute for Nuclear Physics) for critical reading of the manuscript and comments, and to Hjörleifur Sveinbjörnsson and Helmut Neukirchen (University of Iceland) for their help and support of this work. This research was funded by the University of Iceland Research Fund (HI17080029).

References

  • (1) V. V. Ammosov, Z. U. Usubov, V. P. Zhigunov, Nucl. Instr. Meth. A295 (1990) 224-230.
  • (2) G. Bohm, G. Zech, Introduction to Statistics and Data Analysis for Physicists, Verlag Deutsches Elektronen-Synchrotron, 2010.
  • (3) N. D. Gagunashvili, Nucl. Instr. Meth. A635 (2011) 86-91.
  • (4) N. D. Gagunashvili, Nucl. Instr. Meth. A614 (2010) 287-296.
  • (5) N. D. Gagunashvili, Eur. Phys. J. Plus 132 (2017) 196.
  • (6) N. D. Gagunashvili, Comput. Phys. Commun. 183 (2012) 193-196.
  • (7) N. D. Gagunashvili, H. Halldorsson, H. Neukirchen, Comput. Phys. Commun. 245 (2019) 106872.
  • (8) I. M. Sobol’, Numerical Monte Carlo methods, Nauka, Moscow, 1973.
  • (9) S. J. Haberman, Biometrics 29 (1973) 205-220.
  • (10) B. Efron, R. Tibshirani, Statist. Sci. 1 (1986) 54-75.
  • (11) N. D. Gagunashvili, JINST 10 (2015) P05004.
  • (12) F. James, M. Roos, Comput. Phys. Commun. 10 (1975) 343-367.
  • (13) S. Aldor-Noiman, L. D. Brown, A. Buja, W. Rolke, R. A. Stine, Am. Stat. 67 (2013) 249-260.
  • (14) CERN Program Library (V138), http://cernlib.web.cern.ch/cernlib/.