1 Introduction
One of the most general goals of pattern recognition is to distinguish noisy and/or distorted realizations of some patterns of interest (POI) from realizations of other patterns and from noise. A common point of most approaches is that they explicitly or implicitly define and quantify a problem-specific notion of similarity (or, inversely, dissimilarity) of patterns, usually through (dis)similarity measures. The role of (dis)similarity measures depends on how the available knowledge and the POI are represented: by labeled datasets (leading to machine learning approaches) or by exact models (leading to template matching).
Machine learning approaches. If there is a hand-labeled dataset containing multiple realizations of the POI and counterexamples, one can exploit general-purpose machine learning techniques to address recognition tasks directly Wang (2016). Although some techniques like metric learning Kulis (2013) learn the most suitable (dis)similarity measure explicitly, many of the commonly used regressors and classifiers rely on relatively simple (dis)similarity measures (for example, the Euclidean distance in k-Nearest Neighbors (kNN) Shalev-Shwartz and Ben-David (2014) and kernel functions in Support Vector Machines (SVM) Shalev-Shwartz and Ben-David (2014)). One of the main reasons why advanced (dis)similarity measures rarely appear in general-purpose machine learning techniques is that they make assumptions on the distribution of the data, and these assumptions are likely to fail in general problems where one has no information about the possible distortions. Instead, machine learning techniques learn the application-specific meaning of (dis)similarity from the data and represent the advanced concepts of (dis)similarity in the inner structure of the model in terms of the simple measures.
Template matching approaches. There are numerous problems with no hand-labeled dataset, but with one high-quality realization or exact model of the POI. In these cases, one declares the (dis)similarity measure to be used according to the expected distortions of the POI, and considers sufficiently similar patterns as realizations of the POI. This approach is usually referred to as template matching. It is exploited in various problems where data is acquired in highly controlled environments with low and/or predictable variability (like quality checking on conveyor belts Wu et al. (2020)); where the POI is simple enough to be represented by one realization or exact model (for example, in medical imaging Kovács and Hajdu (2016)); or where the POI changes over time and the training of pattern-specific solutions is infeasible (like in object tracking applications Yan et al. (2019)).
This paper deals with the second class of problems (applications where (dis)similarity measures invariant to certain types of distortions are needed) and presents theoretical results related to the recently introduced dissimilarity measure Matching by Tone Mapping (MTM) Hel-Or et al. (2014), which was shown to provide superior performance in numerous template matching and even registration scenarios due to its approximate invariance even to nonlinear distortions. Before moving on to the presentation of the findings, we provide a brief overview of the most widely used measures to position the work in the literature of (dis)similarity measures. For the ease of discussion, we introduce the terminology of template matching: let $\mathbf{p}$ and $\mathbf{w}$ denote a template (pattern) and a window of a signal the template is being compared to, respectively. By intensity transformation (tone mapping/distortion) we refer to a deterministic function applied to each coordinate of its argument vector independently. We mention that the invariance of a (dis)similarity measure to such distortions is usually referred to as photometric invariance, although the concept is applicable in other fields of signal processing beyond imaging.
As the meaning of (dis)similarity is usually application-specific, numerous measures have been proposed in the last decades. Probably the simplest dissimilarity measures are the $\ell_1$ and $\ell_2$ distances (also known as Manhattan and Euclidean distances), with no invariance to any distortion. Cross-correlation (CC) and the normalized Euclidean distance Goshtasby (2012) are invariant to scaling, while the Pearson correlation coefficient (PCC) is invariant to linear distortions. Although these measures are invariant to linear transformations at most, they usually serve as building blocks of advanced techniques, or they are made invariant to certain classes of nonlinear transformations by the kernel trick
Kovács and Hajdu (2013). Numerous (dis)similarity measures (like Spearman's Rho Spearman (1904) and Kendall's Tau Kendall (1938)) are based on the rank transformation, in which each coordinate is replaced by its rank among the elements of the vector, and they use one of the simple measures to quantify the (dis)similarity of the rank-transformed vectors instead of the original ones. Although the ranking of elements is not affected by monotonic transformations, and consequently these methods are invariant to monotonic distortions, a common drawback is their sensitivity to noise and to ties among the elements of the vectors being compared.
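As an illustration (our own sketch, not part of the cited works), the following NumPy snippet contrasts the linear invariance of PCC with the monotonic invariance of Spearman's Rho; the helper names `pcc`, `ranks` and `spearman_rho` are hypothetical:

```python
import numpy as np

def pcc(a, b):
    # Pearson correlation coefficient: invariant to linear distortions
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranks(x):
    # rank transformation (1-based, assuming no ties among the coordinates)
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman_rho(a, b):
    # Pearson correlation of the rank-transformed vectors
    return pcc(ranks(a), ranks(b))

rng = np.random.default_rng(0)
p = rng.normal(size=50)
linear = 3.0 * p + 2.0            # linear distortion of the template
monotone = np.exp(2.0 * p) - 5.0  # strictly monotonic, nonlinear distortion

print(pcc(p, linear))             # ~1.0: PCC is invariant to linear distortions
print(pcc(p, monotone))           # < 1: PCC is not invariant to nonlinear ones
print(spearman_rho(p, monotone))  # ~1.0: ranks are unchanged by monotonic maps
```

Note that the tie sensitivity mentioned above does not appear here because the normally distributed coordinates are almost surely distinct.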
A large and popular family of (dis)similarity functions Pluim et al. (2003); Wachowiak et al. (2003) is based on information-theoretic concepts, quantifying the mutual information (MI) content in the intensity distributions of the template and the window. Alternatively, the comparison of the distributions of derived local quantities, like gradient orientation Liu et al. (2006), was also proposed. Although the MI-based measures are considered to be invariant even to nonlinear intensity transformations, the estimation of joint densities can be challenging, especially for small templates.
Correlation ratio and its variants Lau et al. (2001); Woods et al. (1993) characterize the degree to which the window can be treated as a single-valued function of the template, and were shown to provide better performance than MI in certain registration problems Woods et al. (1993).
Invariance to certain distortions can also be achieved by extracting invariant features from the template and the window and quantifying the (dis)similarity of the feature vectors. A brief overview of photometric invariant features can be found in Zickler (2014). Commonly used features in the imaging domain, invariant to certain types of geometric and photometric distortions, are Hu's descriptors Hu (1962) (combinations of statistical moments) and local binary patterns (LBP) Ojala et al. (1996) (based on the intensity differences of a pixel and its neighbors). Although geometric distortions are out of scope for this paper, we mention that some measures used in the imaging domain are invariant even to affine or projective geometric transformations Lowe (1999).
For further details on (dis)similarity measures, excellent overviews can be found in the books Brunelli (2009) and Goshtasby (2012).
Recently, the Matching by Tone Mapping (MTM) Hel-Or et al. (2014) measure was proposed for photometric invariant template matching and registration, and its restriction to monotonic distortions was also introduced Kovács (2018). As MTM was shown to give superior performance in numerous template matching and registration scenarios Hel-Or et al. (2014) and can be computed efficiently in terms of some convolution operations, it has many potential applications in signal processing. Similarly to MI and related techniques (being approximately invariant to nonlinear distortions), MTM operates by binning the template; however, the proper selection of bins providing optimal performance according to some criteria is still an open question. In this paper, we carry out a statistical analysis of the effect of bin selection for MTM.
The main contributions of the paper to the field are summarized as follows:

As the name suggests, MTM was developed for image processing, where various distortions of a template can be treated as tone mappings. We point out that MTM is a concept highly analogous to MI, with numerous potential applications beyond imaging. In order to emphasize the generality of the measure, we introduce the name normalized Unexplained Variance (nUV), which we found more conformant with the literature of statistics.

We define criteria for the ideal operation of the nUV measure, provide theoretical results on the ideal binning under these criteria and also provide algorithms to determine the ideal binning for particular problems.

By numerical simulations, we show that in the context of discriminating distorted templates from noise, the proposed binning techniques improve the discrimination power of nUV by 4-13% in terms of AUC scores, with statistical significance.
The paper is organized as follows. In Section 2, a brief introduction to MTM is given, its analogy to MI is pointed out, and the new nomenclature of nUV is introduced. The optimality criterion is defined, theoretical results are derived, and corresponding algorithms to find approximately optimal binnings are proposed in Section 3. The numerical experiments are described and evaluated in Section 4, and finally, conclusions are drawn in Section 5.
2 Brief Introduction to Matching by Tone Mapping (MTM) and Problem Formulation
In this section, we give a brief introduction of the MTM measure, discuss the importance of binning, formulate the problem we deal with in the rest of the paper and also point out a close relation between MTM and MI leading us to the introduction of the term normalized Unexplained Variance (nUV).
First, the notations used in the rest of the paper are introduced, trying to follow those of the related papers Hel-Or et al. (2014); Kovács (2018) for the compatibility of discussions. We use lowercase, boldface and uppercase letters to denote scalars, vectors and matrices, respectively, keeping the notations $\mathbf{p}$ and $\mathbf{w}$ for the template and the window and $d$ for the dimensionality of the feature space. Sets, and the special class of functions called intensity transformations (distortions or tone mappings in Hel-Or et al. (2014)), are denoted by calligraphic letters. For the ease of reading, and for compatibility with the literature Hel-Or et al. (2014), we also introduce Greek letters, which always denote vectors in special roles.
The MTM dissimilarity HelOr et al. (2014) of and is defined as
$\mathrm{MTM}(\mathbf{p}, \mathbf{w}) = \min_{\mathcal{M}} \dfrac{\lVert \mathcal{M}(\mathbf{p}) - \mathbf{w} \rVert_2^2}{d \cdot \mathrm{var}(\mathbf{w})}$ (1)
where the numerator measures how close $\mathbf{p}$ can be transformed to $\mathbf{w}$ by applying some tone mapping coordinate-wise, and $\mathrm{var}(\mathbf{w})$ in the denominator stands for the empirical variance of the elements of $\mathbf{w}$, ensuring invariance to intensity scaling. It is worth noting that MTM is not symmetric: the form (1) is referred to as the Pattern-to-Window (PtW) case, and the Window-to-Pattern (WtP) case is defined by interchanging $\mathbf{p}$ and $\mathbf{w}$ in (1). In the rest of the paper we focus on the Pattern-to-Window case, but emphasize that all results can be derived for the Window-to-Pattern (WtP) case analogously.
2.1 Piecewise constant approximation
The minimization problem (1) cannot be solved explicitly, but approximate solutions can be obtained by the linearization of the problem, particularly by replacing the term $\mathcal{M}(\mathbf{p})$ with a linear approximation. Let the coordinates of $\mathbf{p}$ be quantized into $k$ bins, and let the boundaries of the bins be arranged into a vector, supposing that the boundaries are increasing and each bin contains at least one element. One can form the piecewise constant (PWC) slice transform matrix $S$ of $\mathbf{p}$ as
$[S]_{i,j} = \begin{cases} 1, & \text{if } p_i \text{ falls in bin } j,\\ 0, & \text{otherwise.} \end{cases}$ (2)
It can be readily seen that the matrix $S$ contains structural information about $\mathbf{p}$: each column is related to a bin, and the $i$th element of column $j$ is set to $1$ only if $p_i$ falls in bin $j$. The columns of the matrix are referred to as slices, and the cardinalities of the slices are represented in a vector $\mathbf{n}$, with $n_j$ denoting the number of elements falling in slice $j$. Given $S$, one can approximate $\mathbf{p}$ as $S\boldsymbol{\beta}$ in many ways, e.g. by choosing the bin centers or the within-bin means as the entries of $\boldsymbol{\beta}$. Similarly, the matrix $S$ can be used to approximate various coordinate-wise transformations of $\mathbf{p}$: for any $\boldsymbol{\beta}$, the expression $S\boldsymbol{\beta}$ can be considered as the PWC approximation of some possibly nonlinear coordinate-wise transformation of $\mathbf{p}$. Obviously, the quality of the approximation highly depends on the intensity distribution of $\mathbf{p}$, the number of bins, and the smoothness of the transformation. Nevertheless, the linearization of the minimization problem (1) by $S\boldsymbol{\beta}$ is reasonable, and PWC MTM becomes
$\mathrm{MTM}_{\mathrm{PWC}}(\mathbf{p}, \mathbf{w}) = \min_{\boldsymbol{\beta}} \dfrac{\lVert S\boldsymbol{\beta} - \mathbf{w} \rVert_2^2}{d \cdot \mathrm{var}(\mathbf{w})}$ (3)
where $\boldsymbol{\beta}^* = (S^{\top}S)^{-1}S^{\top}\mathbf{w}$ is the exact solution of the least squares problem in the numerator:

$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \lVert S\boldsymbol{\beta} - \mathbf{w} \rVert_2^2 = (S^{\top}S)^{-1}S^{\top}\mathbf{w}$ (4)
The numerator can be interpreted as a PWC ordinary least squares regression. In principle, any regression technique could be used to approximate MTM; the benefits of the PWC regression are that it has an extremely low number of parameters and can be computed efficiently. To simplify notations, we substitute (4) into (3) and introduce the formalism

$\mathrm{MTM}_{\mathrm{PWC}}(\mathbf{p}, \mathbf{w}) = \dfrac{\lVert (I - H)\mathbf{w} \rVert_2^2}{d \cdot \mathrm{var}(\mathbf{w})}$ (5)
where $H = S(S^{\top}S)^{-1}S^{\top}$ is the projection matrix onto the subspace generated by the columns of $S$. To reduce clutter, we omit the binning argument of $S$ and $H$, but we highlight that both are implied by the structure of the bins.
The numerator being a least squares regression implies that $H$ is the hat matrix: it is idempotent, symmetric, and an orthogonal projection, thus self-adjoint Draper and Smith (1998).
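To make the construction concrete, the slice transform can be sketched in a few lines of NumPy (our own toy illustration; the helper name `slice_matrix`, the template values and the bin boundaries are hypothetical, not taken from the cited works):

```python
import numpy as np

def slice_matrix(p, edges):
    # S[i, j] = 1 iff p[i] falls in bin j; `edges` holds the k+1 bin boundaries
    k = len(edges) - 1
    j = np.clip(np.digitize(p, edges[1:-1]), 0, k - 1)  # bin index per coordinate
    S = np.zeros((len(p), k))
    S[np.arange(len(p)), j] = 1.0
    return S

p = np.array([0.1, 0.2, 0.35, 0.6, 0.7, 0.95])
S = slice_matrix(p, np.array([0.0, 0.3, 0.8, 1.0]))   # 3 bins
n = S.sum(axis=0)                                     # slice cardinalities

# each row selects exactly one slice, and S^T S is diagonal with the cardinalities
assert (S.sum(axis=1) == 1.0).all()
assert np.allclose(S.T @ S, np.diag(n))

# S @ beta applies one value per bin coordinate-wise: a PWC tone mapping of p
print(S @ np.array([-1.0, 0.0, 2.0]))
```

The diagonality of $S^{\top}S$ is what makes the least squares solution (4) cheap to evaluate: the inverse reduces to coordinate-wise division by the slice cardinalities.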
An insight into the operation of the measure can be gained by recognizing some further special properties of the matrix $H$, originating from its construction from the slice matrix $S$ with orthogonal columns, which we utilize in Section 3.
Lemma 1.
(Properties of the matrix $H$) Let $\mathcal{B}_j$ denote the set of indices of the coordinates of $\mathbf{p}$ falling in bin $j$. The matrix $H$ is a square matrix of size $d \times d$, with $H_{i,l} = 1/n_j$ if $i, l \in \mathcal{B}_j$, and $H_{i,l} = 0$ otherwise. As a consequence, $(H\mathbf{w})_i$ is the mean of the elements of $\mathbf{w}$ falling in the bin where $p_i$ falls.
Proof.
For the proof see Appendix A. ∎
The operation of MTM with PWC approximation is illustrated in Figure 1. The paired samples of $\mathbf{p}$ and $\mathbf{w}$ are visualized in a scatter plot and 3 bins are indicated by vertical lines. By Lemma 1 and the simplified form (5), MTM approximates the $\mathbf{w}$ values in a particular bin by their mean (red horizontal lines), and the numerator of the measure accumulates the squared differences of the $\mathbf{w}$ values from the corresponding means. Thus, the numerator is the sum of squared residuals of the PWC regression, which is divided by the total empirical variance of $\mathbf{w}$; hence, the measure is one minus the $R^2$ score of regressing $\mathbf{w}$ as the target variable on $\mathbf{p}$ as the explanatory variable using a piecewise constant regression. Another interpretation, discussed in detail in the next subsection, is that MTM measures the uncertainty of the $\mathbf{w}$ values falling in a particular bin by computing their variance (equal to the sum of squared residuals when $\mathbf{w}$ is approximated by the means within the bins). This uncertainty characterizes how much $\mathbf{w}$ can be treated as a function of $\mathbf{p}$.
To illustrate the structure of $H$ for a better insight, and also to validate the lemma qualitatively, suppose the template is binned into 2 bins, with its first two coordinates falling in the first bin. Then the first two elements of $H\mathbf{w}$ are the means of the first two elements of $\mathbf{w}$, i.e. of those belonging to the first slice of the slicing of $\mathbf{p}$.
We mention that the piecewise constant approximation enables the regularization of the measure through the number of bins. One can readily see that if all values of the template are unique and each value is treated as a separate slice, both $S^{\top}S$ and $H$ become the identity matrix, and the numerator of PWC MTM will give a perfect match ($0$ dissimilarity) for any $\mathbf{w}$. Consequently, the use of a low number of bins regularizes PWC MTM to prevent overfitting by controlling the smoothness of the nonlinear tone mappings used to approximate $\mathbf{w}$ from $\mathbf{p}$.
Finally, another important property of $H$ is pointed out: its relation to k-means clustering Bishop (2006).
Lemma 2.
(On the relation of the projection matrix $H$ to k-means clustering) Consider the set of all hat matrices implied by slice matrices binning $\mathbf{p}$ into $k$ non-empty bins. Over this set, minimizing the expression
$\lVert (I - H)\mathbf{p} \rVert_2^2$ (6)
is equivalent to solving the k-means clustering problem for the elements of $\mathbf{p}$, and constructing the projection matrix from the clusters interpreted as bins.
Proof.
The expression (6) measures the sum of squared residuals when each element of $\mathbf{p}$ within a slice is approximated by the mean of the elements in the slice. Treating the slices as clusters and summing the squared residuals for each cluster, one can recognize (6) as the objective function of k-means clustering, to be minimized over all partitionings of the coordinates of $\mathbf{p}$. ∎
Finally, we mention that solving the k-means clustering problem exactly is not equivalent to applying the well-known MLEM k-means clustering algorithm, which provides only a suboptimal solution Bishop (2006).
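Lemma 2 can be illustrated in one dimension, where the optimal k-means clusters are contiguous in sorted order, so for a small template the exact optimum is found by scanning cut points. The sketch below (our own illustration, k = 2, hypothetical data) finds the cut minimizing the within-bin sum of squared residuals, i.e. expression (6):

```python
import numpy as np

def within_bin_sse(p_sorted, cut):
    # sum of squared residuals when each bin is replaced by its mean
    left, right = p_sorted[:cut], p_sorted[cut:]
    return ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()

# two well-separated groups of template values
p = np.sort(np.array([0.1, 0.15, 0.2, 0.9, 1.0, 1.1, 1.15]))

# enumerate all 2-bin binnings (contiguous in sorted order) and pick the best
sses = [within_bin_sse(p, c) for c in range(1, len(p))]
best_cut = 1 + int(np.argmin(sses))
print(best_cut)   # the cut separates the two natural clusters
```

The resulting bins coincide with the two visually obvious clusters, exactly as the k-means interpretation predicts.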
2.2 The importance of binning and problem formulation
Bin selection is the problem of determining the proper number and widths of bins Scott (2015) used to group data, and the ideal binning strategy is usually data- and application-specific.
The authors of MTM Hel-Or et al. (2014) did not address the question of bin selection for the slice transform and used equal width binning (EQW) in the evaluation of the measure, with relatively low numbers of bins showing the best performance. Although there are numerous rules of thumb to determine the number of bins (the square root rule Jopia (2019), Sturges' rule Jopia (2019), the Rice rule Jopia (2019)), as pointed out in the previous subsection, the number of bins plays the role of a regularization parameter; thus, we keep it as a degree of freedom.
On the other hand, given a particular number of bins, the selection of bin boundaries can naturally be expected to affect the performance of PWC MTM. The goal of this paper is to examine the effect of bin boundary selection strategies on PWC MTM and to identify ideal binning techniques under specific conditions. We mention that there are results in the literature for the selection of the widths of bins (Scott's rule Scott (1979), the Freedman-Diaconis choice Freedman and Diaconis), and variable-width bins (like equal frequency binning (EQF) Peng et al. (2009), with each bin containing the same number of elements) have also been proposed. The common point of these techniques is that most of them are derived to optimize the construction of the empirical distribution function through histograms in terms of some optimality criteria. As MTM is intended to be used in pattern recognition scenarios to recognize patterns under some assumptions on the nature of the noise and possible distortions, the problem of binning is essentially different from that of constructing the empirical distribution function. To support this claim, we anticipate some results discussed in subsection 4.5, where EQF turns out to have extremely low performance in the pattern recognition settings of the numerical experiments.
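For reference, the two basic bin boundary families can be sketched as follows (a toy illustration of our own, not the cited authors' code); note how a single extreme value stretches the equal-width bins but leaves the equal-frequency boundaries near the bulk of the data:

```python
import numpy as np

def eqw_edges(p, k):
    # equal-width binning: k bins of identical width spanning the range of p
    return np.linspace(p.min(), p.max(), k + 1)

def eqf_edges(p, k):
    # equal-frequency binning: k bins holding (roughly) equally many elements
    return np.quantile(p, np.linspace(0.0, 1.0, k + 1))

p = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 10.0])   # one extreme value
print(eqw_edges(p, 2))   # inner boundary at 5.0 -> 5 elements vs 1 element
print(eqf_edges(p, 2))   # inner boundary at the median -> 3 vs 3 elements
```

Which behaviour is preferable depends on the objective: for density estimation the balanced occupancy of EQF is often desirable, while, as argued above, neither rule is derived with the discrimination power of a dissimilarity measure in mind.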
2.3 The relevance of MTM, its relation to MI, and the introduction of the term normalized Unexplained Variance
As there are dozens of (dis)similarity measures proposed in the literature, we found it crucial to point out some beneficial properties of MTM that make its statistical properties worth studying and that motivated the writing of this paper. The authors of Hel-Or et al. (2014) already showed that PWC MTM can be computed efficiently for a template and all windows of a signal, and also demonstrated that the performance of MTM in pattern recognition applications is highly competitive with that of MI. To further emphasize its relevance as a general-purpose dissimilarity measure, in this subsection we show that MTM is a concept highly analogous to a normalized variant of MI, leading us to change the nomenclature by introducing the term normalized Unexplained Variance (nUV), to make the name more aligned with the classical concepts being utilized under the hood.
In statistics, entropy and variance are probably the most widely used measures of uncertainty Zidek and van Eeden (2003). Related to entropy, mutual information (MI) treats the coordinates of the window and the template as corresponding realizations of two random variables $W$ and $P$ with a joint distribution, and is defined as the difference of the marginal entropy of $W$ and the conditional entropy of $W$ given $P$. For (dis)similarity measures it is usually desired to map into a bounded range in order to be easily interpretable and comparable across different problems; consequently, multiple normalized variants of MI have been proposed. One particular normalization leads to the uncertainty coefficient Press et al. (2007), but for clarity, in the rest of the paper we refer to it as normalized mutual information (nMI):

$\mathrm{nMI}(\mathbf{p}, \mathbf{w}) = \dfrac{\hat{H}(W) - \hat{H}(W \mid P)}{\hat{H}(W)}$ (7)
where the hat refers to empirical estimation using the paired samples of $\mathbf{p}$ and $\mathbf{w}$. nMI quantifies the relative amount of information the template and the window share, by subtracting the relative amount of uncertainty remaining in the window given the template from the total amount of uncertainty in the window, which is scaled to $[0, 1]$.
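A plug-in estimate of the uncertainty coefficient (7) via joint histograms might look as follows (our own sketch; the function name `nmi`, the bin count and the use of the natural logarithm are our choices):

```python
import numpy as np

def nmi(p, w, bins=8):
    # uncertainty coefficient: I(P; W) / H(W), estimated by binning both vectors
    joint, _, _ = np.histogram2d(p, w, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = (pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum()
    hw = -(py[py > 0] * np.log(py[py > 0])).sum()
    return mi / hw

rng = np.random.default_rng(3)
p = rng.normal(size=2000)
print(nmi(p, p))                      # w fully determined by p -> close to 1
print(nmi(p, rng.normal(size=2000)))  # independent -> close to 0
```

Note that both vectors are discretized here, unlike in PWC nUV, which bins only the template; this difference is revisited at the end of the subsection on binning.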
According to the law of total variance Weiss (2005), if has finite variance,
$\mathrm{var}(W) = \mathbb{E}[\mathrm{var}(W \mid P)] + \mathrm{var}(\mathbb{E}[W \mid P])$ (8)
holds. The first term on the right-hand side is usually referred to as the unexplained variance, characterizing the variance remaining in $W$ given $P$; the second term is called the explained variance, characterizing the variation of $W$ one can explain by the variation of $P$. Rearranging the equation and dividing both sides by $\mathrm{var}(W)$, one obtains
$1 - \dfrac{\mathbb{E}[\mathrm{var}(W \mid P)]}{\mathrm{var}(W)} = \dfrac{\mathrm{var}(\mathbb{E}[W \mid P])}{\mathrm{var}(W)}$ (9)
Given a paired sample for $P$ and $W$, the various terms can be approximated by empirical quantities. In the theory of regression, the conditional expectation function $\mathbb{E}[W \mid P]$ is proved Bishop (2006) to be the best predictor of $W$ from $P$ in terms of squared loss, and any least squares regression function estimating $W$ from $P$ can be treated as an approximation of $\mathbb{E}[W \mid P]$. Consequently, $\mathbb{E}[\mathrm{var}(W \mid P)]$ can be estimated by the squared residuals of the regression. Choosing a piecewise constant regression function and using the notations introduced before,
$\widehat{\mathbb{E}[\mathrm{var}(W \mid P)]} = \dfrac{1}{d} \lVert (I - H)\mathbf{w} \rVert_2^2$ (10)
with $d$ denoting the dimensionality of the feature space, and estimating $\mathrm{var}(W)$ by the empirical variance $\mathrm{var}(\mathbf{w})$:
$1 - \dfrac{\lVert (I - H)\mathbf{w} \rVert_2^2}{d \cdot \mathrm{var}(\mathbf{w})} = 1 - \mathrm{MTM}_{\mathrm{PWC}}(\mathbf{p}, \mathbf{w})$ (11)
The term on the left-hand side is the normalized explained variance, quantifying the variance of $\mathbf{w}$ explained by $\mathbf{p}$. Comparing the expressions (7) and (11), one can observe that nMI and $1 - \mathrm{MTM}_{\mathrm{PWC}}$ are highly analogous concepts, both quantifying the uncertainty disappearing from the window given the template: nMI uses entropy, while the latter uses variance to measure uncertainty.
This analogy shows that MTM is more general in principle than what the name Matching by Tone Mapping would suggest; it can be treated as a meaningful alternative to MI in any application where MI is used as a similarity measure. To emphasize its generality beyond image processing, and to be conformant with the literature of statistics and the classical principles utilized by its operation, we found it necessary to change the nomenclature; this change is formalized in the next definition.
Definition 1.
In the context of this paper, the piecewise constant approximation of the normalized Unexplained Variance (nUV) dissimilarity measure refers to the piecewise constant approximation of the MTM measure; the two are denoted identically.
As it is a common technique to estimate nMI through binning Carrara and Ernst (2019), we compare the role of binning in nMI to that in PWC nUV. Unlike the binning implementations of nMI (which discretize both vectors), PWC nUV carries out binning only for the template vector. Consequently, another beneficial property of PWC nUV is that one can expect less loss of information and less sensitivity to improper binning strategies.
Finally, we mention that although MI and its variants are widely used in pattern recognition, to our best knowledge there are no results in the literature on optimizing its operation by determining the ideal binning technique in terms of some pattern recognition specific optimality conditions, which could make PWC nUV a favorable choice over nMI in problems where the assumptions of optimality are met.
3 The optimal binning technique
In this section, we carry out the statistical analysis of bin boundary selection strategies for the PWC nUV measure. First, we introduce a statistical model and phrase the optimality criterion we aim to optimize by choosing the binning technique. Then, the main results on the optimality of binning are derived and two algorithms are proposed to determine the ideal binning for a particular template under mild assumptions on the nature of distortions and noise.
3.1 The optimal operation of the PWC nUV dissimilarity measure
Naturally, any binning technique (equal width, equal frequency, etc.) can be used to carry out the slice transform, and thus to drive the PWC nUV measure. In order to select the optimal binning technique for a given number of bins, we need to define when we consider the operation of the measure optimal. As the goal of pattern recognition is to recognize patterns under certain classes of distortions, we consider PWC nUV to operate optimally when it separates the noisy background from a noisy and distorted pattern as much as possible. In order to put this concept formally, in accordance with the notations so far, let $\mathbf{p}$ denote a template, and consider a window containing white noise (from a distribution with finite variance) as well as a window containing the template distorted by a tone mapping and corrupted by additive noise. We consider the tone mapping to be a stochastic process (a random real function) with finite first and second moments, defined by all of its finite dimensional probability distributions.
Definition 2.
(Optimal binning) With the notations introduced before, for a given template we consider the operation of PWC nUV optimal if its expected discrimination power
(12) 
regarding a noisy window and a noisy distorted template is maximal.
Put another way, we consider PWC nUV to operate optimally if the binning technique used in the slice transform is such that the expected dissimilarity of the template from noise and the expected dissimilarity of the template from the distorted template are as different as possible.
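The criterion can be probed by a small Monte Carlo experiment; the sketch below is our own toy setup (equal-width bins, a sine tone mapping, Gaussian noise, and the function name `pwc_nuv` are all our choices) estimating the discrimination power as the gap between the average score on pure noise and on the distorted template:

```python
import numpy as np

def pwc_nuv(p, w, k=8):
    # PWC nUV of window w given template p, with k equal-width bins on p
    edges = np.linspace(p.min(), p.max(), k + 1)
    j = np.clip(np.digitize(p, edges[1:-1]), 0, k - 1)
    resid = 0.0
    for b in range(k):
        wb = w[j == b]
        if wb.size:
            resid += ((wb - wb.mean())**2).sum()
    return resid / (len(w) * w.var())

rng = np.random.default_rng(4)
p = rng.uniform(size=256)
noise_scores, match_scores = [], []
for _ in range(200):
    noise = rng.normal(size=256)                             # pure noise window
    distorted = np.sin(3.0 * p) + 0.1 * rng.normal(size=256) # nonlinear tone map + noise
    noise_scores.append(pwc_nuv(p, noise))
    match_scores.append(pwc_nuv(p, distorted))

delta = np.mean(noise_scores) - np.mean(match_scores)
print(delta)   # positive: the distorted template scores as far more similar
```

The non-monotonic sine mapping is exactly the kind of distortion rank-based measures struggle with, while the binning-based measure still assigns the distorted template a much lower dissimilarity than noise.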
3.2 Linearization of the distortion and the models being examined
In this subsection, we apply equivalent transformations to replace the stochastic process modelling the distortion by a random variable from a finite-dimensional distribution.
If all the coordinates of the template are different, the distorted template is a random quantity governed by a $d$-dimensional distribution of the distortion. What makes the distorted template different from a general random vector is the presence of equal coordinates: if two coordinates of the template are equal, the corresponding distorted coordinates also need to be equal, as one realization of the random tone mapping is a function assigning the same value to both. In order to ensure that this condition holds, we factor the template by introducing its full-rank slice transform. Let $d_u$ denote the number of unique coordinates of the template, let $\mathbf{u}$ denote the vector of its unique elements in increasing order, and let $S_u$ denote the full-rank slice transform matrix, in which each unique coordinate falls in a distinct slice; furthermore, let the vector of the numbers of elements in the slices of $S_u$ be given. With these, the template can be reconstructed without any loss of information as $S_u \mathbf{u}$.
With these notations, the distorted template can be written as the full-rank slice matrix applied to a random vector from the finite-dimensional distribution of the distortion over the unique template values. Consequently, the optimality criterion becomes the maximization of
(13) 
3.3 The need for first order approximation of the expected values
Treating the measure as a random quantity through the random nature of the window implies some difficulties in the evaluation of the optimality criterion (13), as random variables appear in both the numerator and the denominator, leading to a ratio distribution which is analytically intractable. In order to carry out the analysis, we introduce the usual first-order approximation for the expected value of the ratio distribution Benaroya et al. (2005):

$\mathbb{E}\left[\dfrac{X}{Y}\right] \approx \dfrac{\mathbb{E}[X]}{\mathbb{E}[Y]}$ (14)
and use this approximation throughout the paper when evaluating the expected value of the PWC nUV measure.
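The quality of the approximation (14) when the denominator is concentrated away from zero can be checked numerically (our own sketch with hypothetical distribution parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
# a ratio X/Y with Y concentrated far from zero
x = rng.normal(loc=3.0, scale=0.5, size=200_000)
y = rng.normal(loc=10.0, scale=0.5, size=200_000)

exact = np.mean(x / y)             # Monte Carlo estimate of E[X/Y]
first_order = x.mean() / y.mean()  # first-order approximation E[X]/E[Y]
print(exact, first_order)
```

The second-order correction is proportional to the relative variance of the denominator, so the approximation is accurate here; it degrades when the denominator has substantial mass near zero.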
3.4 Statistical analysis of the model
In this section we carry out the statistical analysis of the model introduced.
Proposition 1.
Using the notations introduced before, in the first-order approximation of the ratio distribution,
(15) 
Proof.
For the proof see Appendix B. ∎
As a consequence of the proposition, the expected value of the dissimilarity of a template from a window containing only noise is a constant, which depends only on the dimensionality of the space and the number of bins, but is independent of the structure of the bins. This result suggests that the binning minimizing the expected dissimilarity of the template from the distorted template will maximize the expected discrimination power of the measure according to the optimality criterion (13).
Proposition 2.
(On the expected dissimilarity of a template and a distorted, noisy template) With the notations introduced before,
(16) 
where $C$ denotes the expected cross-product matrix of the distortion, $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product, and the vector contains the cardinalities of the slices in the full-rank slice transform of the template.
Proof.
For the proof see Appendix C. ∎
We note that the cross-product matrix $C$ encodes both the covariance structure and the mutual relationships of the coordinates of the mean vector, as $C = \Sigma + \boldsymbol{\mu}\boldsymbol{\mu}^{\top}$, where $\Sigma$ is the exact covariance matrix of the distortion and $\boldsymbol{\mu}$ is its mean vector.
Theorem 1.
(On the optimization of binning) In the first-order approximation of the expected value of the ratio distribution, the projection matrix maximizing the expression
(17) 
minimizes the expected dissimilarity of the template and the distorted, noisy template (equation (16)), and thus maximizes the separation power of the measure for the distortion.
Proof.
By Proposition 1, the expected dissimilarity from pure noise does not depend on the binning; hence, maximizing the discrimination power (13) is equivalent to minimizing the expected dissimilarity (16), whose only binning-dependent term is the Frobenius inner product (17), entering with a negative sign. Maximizing (17) therefore minimizes (16). ∎
The results give an interesting insight into the operation of the PWC nUV measure. In order to optimize its operation in terms of the optimality criterion (13), one needs to find a bin structure implying a projection matrix which maximizes the alignment of the projection matrix and the cross-product matrix of the distortion, by maximizing their Frobenius (matrix) inner product. Put another way, as the corresponding entry of the projection matrix is zero if two coordinates do not fall in the same bin, the maximization of the Frobenius product requires strongly covarying distortion coordinates having similar means to fall in the same bin, so that they contribute their high cross-product values to the objective function of the maximization (17).
One can readily see that the optimization problem (17) is a combinatorial optimization problem, as the feasible matrices are induced by the partitions of the coordinates; thus, the problem is hardly tractable analytically. However, greedy algorithms Cormen et al. (2009) can be derived to approximate the optimal solutions. Given a cross-product matrix, the number of bins, and the vector containing the cardinalities of the slices of the full-rank slice transform, a greedy algorithm to find a binning approximating the ideal one is provided in Algorithm 1. The algorithm initializes a random configuration of non-empty bins and computes the inner product with the matrix implied by the random configuration. Then, it iteratively checks if moving any of the bin boundaries one step to the left or right increases the inner product. In each iteration, the adjustment of bin boundaries leading to the highest increase in the inner product is chosen. The algorithm stops when no further increase can be achieved. The vector of bin boundaries computed by the algorithm contains the indices of the bin boundaries of the ideal binning in the vector containing the ordered, unique elements of the template.
// RowColSum computes the contribution of a row and column to the inner product for a bin
// Change computes the change in the objective function when a bin boundary is moved one step to the left or to the right
// GreedyBinning implements the proposed greedy binning algorithm
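A simplified sketch of the greedy search follows. It is our own condensed variant of the idea, not Algorithm 1 itself: it assumes all template elements are unique (so the hat matrix is constant within each diagonal block), starts from an equal split rather than a random one, and accepts any improving boundary move instead of the best move per iteration:

```python
import numpy as np

def frobenius_objective(C, cuts):
    # <H, C>_F for contiguous bins given by cut indices; with unique template
    # elements, H has value 1/n_b on each bin's diagonal block
    total, bounds = 0.0, [0, *cuts, C.shape[0]]
    for a, b in zip(bounds[:-1], bounds[1:]):
        total += C[a:b, a:b].sum() / (b - a)
    return total

def greedy_binning(C, k):
    d = C.shape[0]
    cuts = list(np.linspace(0, d, k + 1).astype(int)[1:-1])  # equal split start
    best, improved = frobenius_objective(C, cuts), True
    while improved:
        improved = False
        for i in range(len(cuts)):
            for step in (-1, 1):                 # move boundary left or right
                trial = cuts.copy()
                trial[i] += step
                lo = trial[i - 1] if i else 0
                hi = trial[i + 1] if i + 1 < len(trial) else d
                if lo < trial[i] < hi:           # keep every bin non-empty
                    val = frobenius_objective(C, trial)
                    if val > best + 1e-12:
                        best, cuts, improved = val, trial, True
    return cuts, best

# distortion whose coordinates covary in two blocks of sizes 3 and 5
m = np.concatenate([np.full(3, 1.0), np.full(5, -1.0)])
C = np.outer(m, m) + np.eye(8) * 0.1    # expected cross-product matrix
cuts, val = greedy_binning(C, 2)
print(cuts)   # the boundary lands between the two covarying blocks
```

In line with the discussion above, the recovered boundary separates the strongly covarying coordinate groups, which is exactly the structure the Frobenius objective rewards.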
As the following corollary shows, if the distortion is spherical and centered at the origin, that is, it makes no distinction between the various directions of the feature space, the expected value reduces to a constant, in accordance with the no free lunch theorems of machine learning Wolpert and Macready (1997): all machine learning techniques of a class (in this case, all binnings into a given number of bins) provide the same average performance when evaluated on all possible problems with no structural preference in their distribution (in this case, all distorted vectors from some spherical distribution).
Corollary 1.
(No free lunch theorem) If all elements of are unique, and the distortion has a spherical distribution in the d-dimensional feature space of , that is, ,
(18) 
Proof.
One can readily see by substituting in place of and in place of in Proposition 2, and utilizing . Without the unicity constraint on the elements of , the ties imply nondiagonal nonzero entries in , and would not reduce to . ∎
As a special case of Proposition 2, one can suppose that the distortion maps the template close to itself. This closeness can be modelled by a distribution whose mean is the vector of unique elements of the template in increasing order. The following proposition provides insight into the effect of localized distortions: in this case, the ideal quantization requires matching the covariance structure of the distortion and also minimizing the representation error of the template made by the binning.
Proposition 3.
(On the expected value of the measure with a localized distortion) Using the notations introduced before, with if , then
(19) 
Proof.
For the proof, see Appendix D. ∎
As a consequence of the proposition, if the distortion is centered in the sense that its mean equals the template, the ideal binning jointly minimizes the representation error of the binning and the alignment of the binning with the covariance structure of the distortion. According to Lemma 2, the representation error could be minimized by solving the k-means clustering problem (applying some k-means clustering technique like the well-known MLEM); however, the alignment of the binning with the covariance structure is not minimized by it, thus in these cases the optimization method formulated in Theorem 1 and Algorithm 1 is still recommended.
It is also reasonable to suppose that the distortion is not only centered at the template, but also spherical. The following proposition and theorem show that in these cases solving the k-means clustering problem leads to the ideal quantization.
Proposition 4.
(On the expected value of the measure with spherically distributed distortion) If has unique elements, and , then
(20) 
Proof.
Theorem 2.
(On the optimization of the binning for spherically distributed distortion) If the elements of the template are unique and the distortion is centered at the template with a spherical distribution, the ideal binning can be determined by solving the k-means clustering problem for the elements of the template.
Proof.
As a consequence of Theorem 2, when only white noise is expected, the solution of the k-means clustering problem still provides the ideal binning.
We highlight that Theorem 1 provides the general conditions of ideal binning applicable to any assumptions on the distortion. The greedy algorithm proposed in Algorithm 1 finds an approximate solution, but due to the combinatorial nature of the problem, it does not guarantee a global optimum. When the distortions imply that bin selection turns into the k-means clustering problem, advanced techniques developed to find the exact solution of the k-means clustering problem in 1D can be exploited to find the ideal solution Wang and Song (2011). Finally, one can readily see that, although the results are based on the first-order approximation, when the unexplained variance measure is not normalized (making it analogous to MI), the same conditions on the optimal binning are exact.
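For reference, the exact 1-D k-means solution mentioned above can be obtained with a simple dynamic program over the sorted values. The sketch below only illustrates the idea behind the cited Wang and Song (2011) approach; the function name is ours, and this O(k n^2) version omits the optimizations of the original.

```python
import numpy as np

def kmeans_1d_exact(x, k):
    """Minimal total within-cluster sum of squares of a 1-D sample split
    into k contiguous clusters, via dynamic programming."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    pre = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums
    pre2 = np.concatenate([[0.0], np.cumsum(x * x)])   # prefix sums of squares

    def sse(i, j):
        """Within-cluster sum of squares of x[i..j] (inclusive)."""
        s = pre[j + 1] - pre[i]
        return pre2[j + 1] - pre2[i] - s * s / (j - i + 1)

    # D[m][j]: best cost of splitting the first j values into m clusters
    D = np.full((k + 1, n + 1), np.inf)
    D[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            D[m][j] = min(D[m - 1][i] + sse(i, j - 1) for i in range(m - 1, j))
    return D[k][n]
```

Because in one dimension the optimal clusters are contiguous in the sorted order, the dynamic program is exact, unlike Lloyd-style iterations.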
3.5 Estimation of the crossproduct matrix of the distortion
In order to determine the ideal quantization for a template, one needs to make assumptions on the cross-product structure of the expected distortions. If the distortions are known to come from a particular class of functions, one can estimate the cross-product matrix by sampling the class of functions, applying each sampled function to the unique elements of the template, and computing the empirical cross-product matrix of the resulting vectors. For example, given a template in an image processing application and expecting gamma distortions with exponents in some range (related to overexposure), one can determine the vector of unique elements of the template and estimate the matrix by sampling exponents and averaging the cross-products of the distorted vectors.
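The sampling procedure just described can be sketched as follows. The function names, the template, and the range from which the gamma exponents are drawn are illustrative assumptions; only the recipe (distort the ordered unique values, average the outer products) comes from the text.

```python
import numpy as np

def empirical_crossproduct(t, sample_distortion, n_samples=2000, seed=0):
    """Estimate the cross-product matrix E[f(u) f(u)^T] over random
    distortions f applied to the ordered unique elements u of template t."""
    rng = np.random.default_rng(seed)
    u = np.unique(t)                    # ordered unique elements of t
    C = np.zeros((u.size, u.size))
    for _ in range(n_samples):
        v = sample_distortion(u, rng)   # one distorted realization
        C += np.outer(v, v)
    return C / n_samples

# Gamma distortions u**g; the sampling range for g is an illustrative
# assumption, not a value prescribed by the paper.
t = np.linspace(0.05, 1.0, 30)
C = empirical_crossproduct(t, lambda u, rng: u ** rng.uniform(0.5, 1.0))
```

As an average of outer products, the estimate is symmetric and positive semidefinite by construction, as a cross-product matrix must be.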
3.5.1 The probability of ideal binning
Working with digital signals, due to the finite precision of representation and/or sensitivity of acquisition devices, the space of windows is usually a bounded subset with volume . If the assumptions on are valid in a volume of the feature space and the ideal bins are determined by these assumptions, one can readily see that the probability of the ideal operation of the nUV measure becomes , thus, using the proposed binning techniques can still improve the separation power of PWC nUV, even though the assumptions are valid only in a subset of the entire space .
4 Tests and Results
As a dissimilarity measure highly analogous to MI, nUV has numerous potential applications, from template matching through registration Ruiz et al. (2009) to feature selection in machine learning Vergara and Estevez (2014). Due to this generality of the measure, the theoretical nature of the results we derived, and space limitations, we do not evaluate the measure on real data. The goals of the numerical experiments are twofold, summarized as follows.
Testing the accuracy of first-order approximations. By simulations we show quantitatively and illustrate qualitatively that the formulae derived in the previous section are aligned with the measurements, thus the first-order statistical approximation is acceptable within the scope of the experimental settings. This is carried out by simulating templates, noisy windows, and distorted templates, and comparing the predictions of Propositions 1, 2 and 4 with the real dissimilarity scores.
Testing the pattern recognition performance in terms of AUC. We characterize quantitatively how much improvement can be achieved by using the proposed binning techniques in pattern recognition scenarios. In each test case, for each binning technique, we record whether the dissimilarity of the distorted template is smaller than that of the noisy window (indicating the correct recognition of the distorted template). We note that the percentage of correct recognitions is an estimate of the probability that a randomly chosen positive sample (distorted template) will have a smaller dissimilarity score than a randomly chosen negative sample (noisy window). This estimate is equivalent to one of the common interpretations of the widely used AUC score (Area Under the Receiver Operating Characteristic curve) Flach et al. (2011).
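This pair-counting interpretation of the AUC can be written directly. A minimal sketch follows (our function name); note the orientation matches the text: smaller dissimilarity on the positive side counts as a correct comparison, and ties count as errors.

```python
import numpy as np

def pairwise_auc(pos_dissim, neg_dissim):
    """Fraction of (positive, negative) pairs in which the positive
    (distorted template) has the smaller dissimilarity score."""
    pos = np.asarray(pos_dissim, dtype=float)[:, None]
    neg = np.asarray(neg_dissim, dtype=float)[None, :]
    return float(np.mean(pos < neg))  # ties are not counted as wins
```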
All results are reproducible with the code (using fixed random seeds) in the GitHub repository https://github.com/gykovacs/ideal_binning_nuv.
4.1 One test case
The computational steps in one test case are summarized as follows.
Sampling of a random template. A random template is generated with a random dimensionality; the templates are vectors from three different distributions with equal probabilities: a standard normal distribution, a uniform distribution, and a mixture of two normals; this mixture is used to generate templates with non-unimodal intensity distributions. Then, the templates are normalized and a random exponent is used to adjust the template to alter its intensity distribution (in image processing this type of exponential adjustment is related to under- and overexposure). Finally, when general distortions are evaluated with greedy binning, the intensity values in the template are rounded to 3 digits, in this way introducing some ties between template values, which is also usual in digital signal processing. When Theorem 2 is examined, due to the unicity constraint on the values of the template, rounding is not applied.
Sampling of a noisy window. A noisy window and the white noise vector used to distort the template are generated from a normal distribution with a uniformly random standard deviation.
Sampling a distorted template and the cross-product structure. First, the full-rank decomposition is determined. For general distortions, a random mean vector and a covariance matrix are sampled. For spherical distortions, the mean is the template and the covariance matrix is spherical. The cross-product matrix is computed from these, a distortion vector is sampled, and a distorted template is generated.
Binning. The binning of the template is carried out by equal-width (EQW), equal-frequency (EQF), k-means, and greedy binning for various bin numbers.
Calculation of dissimilarity scores. The approximations of the expected values by Propositions 1, 2 and 4 and the true dissimilarity scores of the noisy window and the distorted template are computed.
All the fixed, constant parameters used in the simulations are selected to cover a reasonably wide range of possible applications, template structures, intensity distributions, and signaltonoise ratios.
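To make the compared strategies of the binning step concrete, minimal 1-D versions of EQW, EQF, and a Lloyd-style (hence only approximate) k-means binning might look as follows; the function names and implementation details are ours, not the paper's code.

```python
import numpy as np

def eqw_bins(u, b):
    """Equal-width: split the value range into b equal intervals."""
    edges = np.linspace(u.min(), u.max(), b + 1)[1:-1]
    return np.searchsorted(edges, u, side='right')

def eqf_bins(u, b):
    """Equal-frequency: each bin gets (roughly) the same number of values."""
    ranks = np.argsort(np.argsort(u))
    return np.minimum((ranks * b) // u.size, b - 1)

def kmeans_bins(u, b, n_iter=50):
    """1-D k-means via Lloyd's iterations on the template values."""
    centers = np.quantile(u, np.linspace(0, 1, b))
    labels = np.zeros(u.size, dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(u[:, None] - centers[None, :]), axis=1)
        for k in range(b):
            if np.any(labels == k):       # guard against empty clusters
                centers[k] = u[labels == k].mean()
    return labels
```

On a clearly bimodal template the three strategies agree, but on templates with unbalanced modes EQF splits similar values across bins, which is the failure mode discussed in subsection 4.5.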
4.2 Aggregation of the results
The results of Propositions 2 and 4 provide formulas for the expected values of the PWC nUV measure when the possible distortions have a particular cross-product structure. There can be numerous meaningful cross-product structures representing the possible distortions in various fields of application. Picking any of them would deteriorate the generality of the experiments, and due to space limitations, we cannot examine many different structures in detail. Therefore, in the test cases, almost all parameters of the templates, distortions, and noisy windows are sampled. For the analysis, the computed dissimilarity scores are averaged over the entire population, which enables us to draw conclusions about the operation of the measure with distortions from many different cross-product structures. In the rest of the section, we work with the averages of the expected values and the averages of the computed dissimilarity scores.
One can expect that for templates of varying sizes, different numbers of bins might be ideal for template matching. In order to compensate for this variation in sizes and to enable the meaningful aggregation of the results, the numbers of bins for each test case are varied as follows. All the figures are computed for fixed numbers of bins, and for the numbers of bins determined by the Sturges formula, the Rice rule, and the square root rule, applied to the number of unique elements in the template. One can easily confirm that in the range of the experiments the ordering of these bin numbers usually holds; therefore, we found it meaningful to plot the aggregated results in one figure and connect them with lines for a better visualization of trends, even though the bin numbers given by the rules depend on the sizes of the templates.
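For concreteness, the three rules of thumb can be computed as below; the exact ceiling conventions are our assumption (textbook variants differ slightly in rounding).

```python
import numpy as np

def rule_of_thumb_bins(n_unique):
    """Bin numbers from the Sturges formula, the Rice rule and the
    square root rule for n_unique distinct template values."""
    return {
        "sturges": int(np.ceil(np.log2(n_unique))) + 1,
        "rice": int(np.ceil(2.0 * n_unique ** (1.0 / 3.0))),
        "sqrt": int(np.ceil(np.sqrt(n_unique))),
    }
```

For example, 100 distinct values give 8, 10, and 10 bins respectively, consistent with the text's observation that the Sturges count is typically the smallest and the square root count the largest in the experimental range.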
4.3 General distortions and greedy binning
We have executed 5000 experiments of the test cases described above and plotted the aggregated results in Figure 2, with the standard deviations denoted by vertical lines with a minor horizontal shift for visibility. As one can observe, the predictions of Propositions 1 and 2 for the means of the distributions and the means of the real scores are very well aligned, both for the noise and for the distorted population. From this, we conclude that despite the first-order approximation (subsection 3.2), the formulae in Propositions 1 and 2 are close enough to the real values in the tested scenarios to expect an improvement in the separation power of PWC nUV by Theorem 1 and Algorithm 1.
The AUC scores for the various binning techniques are summarized in Figure 2(b) and aggregated in Table 1, together with the matrix of p-values of the McNemar tests for the equality of the scores. One can observe that the greedy binning outperforms both EQW (by 13% in aggregation) and EQF (by 26% in aggregation) with statistical significance, providing a numerical validation of Theorem 1 in the scope of the experiments. Interestingly, although k-means binning is proved to be ideal only when the distortion is spherically distributed around the template, it still outperforms EQW (by 4%) and EQF (by 17%) with statistical significance, indicating that there can be further configurations in which k-means binning works well. Comparing the greedy and the k-means binning, the most remarkable difference is that k-means binning needs no estimate of the cross-product or covariance structure of the distortions. Consequently, the results suggest that even in the absence of any knowledge about the possible distortions, using k-means binning instead of EQW could improve the matching results with statistical significance.
              General distortions               Spherical distortions
          EQW    EQF    k-means  greedy     EQW      EQF    k-means  greedy
EQW       1      0      0        0          1        0      2.3e-07  2.8e-04
EQF       0      1      0        0          0        1      0        0
k-means   0      0      1        0          2.3e-07  0      1        5.7e-02
greedy    0      0      0        1          2.8e-04  0      5.7e-02  1
AUC       0.63   0.5    0.67     0.76       0.83     0.5    0.84     0.84
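The p-values in Table 1 come from McNemar tests on the paired correct/incorrect recognitions. One way to compute an exact two-sided version from the discordant pair counts only is sketched below; the function name and the exact-binomial variant are our choices, not necessarily the variant used in the paper.

```python
from math import comb

def mcnemar_exact_p(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the numbers of test cases
    recognized correctly by exactly one of the two compared binnings."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0          # no discordant pairs: nothing to distinguish
    k = min(only_a_correct, only_b_correct)
    # binomial tail P(X <= k) for X ~ Binomial(n, 1/2), doubled
    tail = sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)
```

Perfectly balanced discordances give a p-value of 1, while strongly one-sided discordances give small p-values, matching the pattern of near-zero entries in the table.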
4.4 Spherical distortions
Again, we have executed 5000 experiments of the test cases described and plotted the results in Figure 3. The predictions of Propositions 1 and 4 for the means of the distributions and the means of the real scores are very well aligned, both for the noisy windows and for the distorted population. Despite the first-order approximation, the results suggest that Proposition 4 gives a good approximation of the expected value in the scope of the experiments. Comparing the AUC scores of recognition plotted in Figure 3(b) and aggregated in Table 1 with the p-values of the McNemar tests on the equality of the scores, one can observe that the k-means and greedy techniques outperform EQW (by 1% in aggregation) and EQF (by 34% in aggregation) with statistical significance. However, the improvement appears mainly for low numbers of bins: the performances quickly converge to that of EQW, and the greedy technique (due to its suboptimality) falls below EQW in terms of AUC when the square root rule is applied to determine the number of bins. The reason for the limited improvement in the case of spherical distortions is that, according to Proposition 4, the k-means and greedy techniques improve the discrimination power of PWC nUV by minimizing the representation error term only, which is a smaller gain than minimizing both this term and the alignment with the covariance structure, as pointed out in Proposition 3.
4.5 A note on the low pattern recognition performance of EQF
Interestingly, the AUC of EQF is 0.5 in both experiments, which means that it has no discriminative power in these settings. The operating principle of PWC nUV is that the slices describe the rough structure of the template, as the values within a bin are close to each other. This assumption is definitely not satisfied by EQF: forcing the same number of values into each bin completely neglects the structure of the template and can break many similar values into separate bins, providing a poor representation of the template. This phenomenon is a qualitative validation of our previous claim that binning techniques developed to reconstruct the empirical distribution function of a sample do not necessarily perform well in other binning problems.
5 Conclusions
In this paper, we have examined the effect of binning strategies on the piecewise constant approximation of the normalized Unexplained Variance (nUV) (also known as MTM) dissimilarity measure. We defined the criterion of ideal operation in Definition 2 and showed in Theorem 1 that the ideal binning needs to maximize the alignment of the projection matrix with the cross-product structure of the expected distortion. In order to obtain an approximate solution for this combinatorial optimization problem, we proposed a greedy algorithm in Algorithm 1. In subsequent propositions, we examined special cases of the general statement and arrived at the case of localized and spherically distributed distortions, for which the ideal binning can be determined by solving the k-means clustering problem according to Theorem 2.
In Section 4 we carried out experiments to see how well the simulation results align with the first-order statistical approximations. According to the results, the relative error is less than 0.1% in terms of the means. We also compared the performance of the proposed binning techniques to that of historical ones in pattern recognition scenarios and found the proposed approaches to outperform the historical ones by 13% AUC in the case of general distortions with greedy binning, and by 1% AUC in the case of spherical distortions with k-means binning, in both cases with statistical significance.
The conclusions we can draw are summarized as follows. Due to the analogies presented in Section 2.3, nUV can be treated as a powerful alternative to MI, quantifying the uncertainty remaining about the window given the template in terms of variance. Thus, nUV is potentially applicable in any problem where MI is used as a similarity measure (template matching, registration, feature selection, etc.). Although numerical experiments can never cover all the possible use cases of a general-purpose dissimilarity measure, due to the wide range of parameters used in the simulations, one can expect that using PWC nUV with the proposed binning techniques can improve its performance in terms of the AUC score.
Appendix A Proof of Lemma 1
Proof.
Due to the orthogonality of , is a diagonal matrix of type with being equal to the cardinality of slice . Inverting this matrix inverts the elements in the diagonal, with . Finally, due to the construction of and the orthogonality of , one can readily see that in , is nonzero only if and fall in the same slice , and the value it takes is . Due to the special structure of ,
(21) 
if , which is the mean of elements of in the slice implied by . ∎
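The computation in the lemma can be verified numerically on a tiny example. The slice configuration below is illustrative: two slices over three elements, with the slice indicator matrix built by hand.

```python
import numpy as np

# Slice indicator matrix: elements 0 and 1 form the first slice,
# element 2 the second slice.
S = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# The matrix of Lemma 1: S (S^T S)^{-1} S^T replaces every entry of a
# vector by the mean of its slice, and is idempotent (a projection).
A = S @ np.linalg.inv(S.T @ S) @ S.T
t = np.array([1.0, 3.0, 5.0])
print(A @ t)   # slice means: [2. 2. 5.]
```

Within-slice entries of A equal one over the slice cardinality (here 1/2 in the first 2x2 block and 1 for the singleton), exactly as derived in the proof.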
Appendix B Proof of Proposition 1
Proof.
Most of the proofs in the paper are analogous: expanding the inner products in the expressions and simplifying them by utilizing the special properties of the matrix highlighted in Lemma 1. Due to space limitations, these steps are carried out in full detail only in this proof.
According to subsection 3.3, the numerator and the denominator are evaluated separately. The numerator is expanded as
(22) 
Evaluating the first term, utilizing Lemma 1 on the special properties of , and the assumptions on the white noise ( mean, finite variance), one gets
(23) 
Similarly, and . For the denominator,
(24) 
∎
Appendix C Proof of Proposition 2
Proof.
First, we evaluate the numerator: expanding the inner product and carrying out the integration for the zero-mean white noise with finite variance leaves the following nonzero terms.
(25) 
Due to the idempotence of , , thus,
(26) 
Let . Carrying out the integration for ,
(27) 
for the expectation of the numerator, where denotes the Frobenius inner product. Similarly for the denominator, utilizing the special properties of :