Approximately Optimal Binning for the Piecewise Constant Approximation of the Normalized Unexplained Variance (nUV) Dissimilarity Measure

07/24/2020
by   Attila Fazekas, et al.
University of Debrecen (UD)

The recently introduced Matching by Tone Mapping (MTM) dissimilarity measure enables template matching under smooth non-linear distortions and also has a well-established mathematical background. MTM operates by binning the template, but the ideal binning for a particular problem is an open question. By pointing out an important analogy between the well-known mutual information (MI) and MTM, we introduce the term "normalized unexplained variance" (nUV) for MTM to emphasize its relevance and applicability beyond image processing. Then, we provide theoretical results on the optimal binning technique for the nUV measure and propose algorithms to find approximate solutions. The theoretical findings are supported by numerical experiments: the proposed binning techniques improve AUC scores by 4-13% with statistical significance, enabling us to conclude that they have the potential to improve the performance of the nUV measure in real applications.




1 Introduction

One of the most general goals of pattern recognition is to distinguish noisy and/or distorted realizations of some patterns of interest (POI) from realizations of other patterns and noise. A common point of most approaches is that they explicitly or implicitly define/quantify the problem-specific notion of similarity

(or the inversely proportional dissimilarity) of patterns, usually through (dis)similarity measures. The role of (dis)similarity measures depends on how the available knowledge and the POI are represented, by labeled datasets (leading to machine learning approaches) or by exact models (leading to template matching).

Machine learning approaches. If there is a hand-labeled dataset containing multiple realizations of the POI and counterexamples, one can exploit general-purpose machine learning techniques to address recognition tasks directly Wang (2016). Although some techniques like metric learning approaches Kulis (2013) learn the most suitable (dis)similarity measure explicitly, many of the commonly used regressors and classifiers use relatively simple (dis)similarity measures (for example, the Euclidean distance in k-Nearest Neighbors (kNN) Shalev-Shwartz and Ben-David (2014) and kernel functions in Support Vector Machines (SVM) Shalev-Shwartz and Ben-David (2014)). One of the main reasons why advanced (dis)similarity measures rarely appear in general-purpose machine learning techniques is that they make assumptions on the distribution of the data, and these assumptions are likely to fail in general problems when one has no information about the possible distortions. Instead, machine learning techniques learn the application-specific meaning of (dis)similarity from the data and represent the advanced concepts of (dis)similarity in the inner structure of the machine learning model in terms of the simple measures.

Template matching approaches. There are numerous problems with no hand-labeled datasets, but one high-quality realization or exact model of the POI. In these cases, one declares the (dis)similarity measure to be used according to the expected distortions of the POI, and considers sufficiently similar patterns as realizations of the POI. This approach is usually referred to as template matching, exploited in various problems where data is acquired in highly controlled environments with low and/or predictable variability (like quality checking on conveyor belts Wu et al. (2020)); where the POI is simple enough to be represented by one realization or exact model (for example, in medical imaging Kovács and Hajdu (2016)); or where the POI changes over time and the training of pattern-specific solutions is infeasible (like in object tracking applications Yan et al. (2019)).

This paper deals with the second class of problems (applications where (dis)similarity measures invariant to certain types of distortions are needed) and presents some theoretical results related to the recently introduced dissimilarity measure Matching by Tone Mapping (MTM) Hel-Or et al. (2014), which was shown to provide superior performance in numerous template matching and even registration scenarios due to its approximate invariance to even non-linear distortions. Before moving on to the presentation of the findings, we provide a brief overview of the most widely used measures to enable the positioning of the work in the literature of (dis)similarity measures. For the ease of discussion, we introduce the terminology of template matching: let p and w denote a template (pattern) and a window of a signal the template is being compared to, respectively. By intensity transformation (tone mapping/distortion) we refer to a deterministic function m applied to each coordinate of its parameter vector independently, introducing the notation m(p). We mention that the invariance of a (dis)similarity measure to distortions like m is usually referred to as photometric invariance, although the concept is applicable in other fields of signal processing beyond imaging.

As the meaning of (dis)similarity is usually application-specific, numerous measures have been proposed in the last decades. Probably the simplest dissimilarity measures are the l1 and l2 distances (also known as Manhattan and Euclidean distances) with no invariance to any distortion. Cross-correlation (CC) and the normalized Euclidean distance Goshtasby (2012) are invariant to scaling, while the Pearson correlation coefficient (PCC) is invariant to linear distortions. Although these measures are invariant to linear transformations at most, they usually serve as building blocks of advanced techniques, or they are made invariant to certain classes of non-linear transformations by the kernel-trick Kovács and Hajdu (2013).

Numerous (dis)similarity measures (like Spearman's Rho Spearman (1904) and Kendall's Tau Kendall (1938)) are based on the rank transformation, replacing each coordinate of a vector by its rank among all the coordinates, and use one of the simple measures to quantify the (dis)similarity of the rank-transformed vectors instead of p and w. Although the ranking of elements is not affected by monotonic transformations, and consequently these methods are invariant to monotonic distortions, a common drawback is their sensitivity to noise and to ties among the elements of p and w.
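The monotonic invariance of rank-based measures is easy to illustrate with a few lines of NumPy (a toy sketch; the exponential distortion and the helper name `ranks` are our illustrative choices, not taken from the cited works):

```python
import numpy as np

def ranks(x):
    """Rank transform: position i receives the rank of x[i] among all elements."""
    order = np.argsort(x)
    r = np.empty_like(order)
    r[order] = np.arange(len(x))
    return r

rng = np.random.default_rng(0)
p = rng.random(50)
w = np.exp(3.0 * p)                       # monotonic, non-linear distortion of p

# Pearson correlation of the raw vectors is below 1 (non-linear relation) ...
pearson_raw = np.corrcoef(p, w)[0, 1]
# ... but the rank transforms are identical, so their correlation is exactly 1
# (this is Spearman's Rho for tie-free data).
spearman = np.corrcoef(ranks(p), ranks(w))[0, 1]
print(pearson_raw, spearman)
```

Any strictly monotonic distortion leaves the ranks, and hence Spearman's Rho, unchanged, while the raw Pearson correlation degrades; additive noise, however, perturbs the ranks directly, which is the sensitivity mentioned above.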

A large and popular family of (dis)similarity functions Pluim et al. (2003); Wachowiak et al. (2003) is based on information-theoretical concepts, quantifying the mutual information (MI) content in the intensity distributions of the template and the window. Alternatively, the comparison of the distributions of derived local quantities, like gradient orientation Liu et al. (2006), was also proposed. Although the MI-based measures are considered to be invariant to even non-linear intensity transformations, the estimation of joint densities can be challenging, especially for small templates.

The correlation ratio and its variants Lau et al. (2001); Woods et al. (1993) characterize the degree to which w can be treated as a single-valued function of p, and were shown to provide better performance than MI in certain registration problems Woods et al. (1993).

Invariance to certain distortions can be achieved by extracting invariant features from p and w and quantifying the (dis)similarity of the feature vectors. A brief overview of photometric invariant features can be found in Zickler (2014). Commonly used features in the imaging domain, invariant to certain types of geometric and photometric distortions, are Hu's descriptors Hu (1962) (combinations of statistical moments) and local binary patterns (LBP) Ojala et al. (1996) (based on the intensity differences of a pixel and its neighbors).

Although geometric distortions are out of scope for this paper, we mention that some measures used in the imaging domain are invariant to even affine or projective geometrical transformations Lowe (1999).

For further details on (dis)similarity measures, excellent overviews can be found in the books Brunelli (2009), Goshtasby (2012).

Recently, the Matching by Tone Mapping (MTM) Hel-Or et al. (2014) measure was proposed for photometric invariant template matching and registration, and its restriction to monotonic distortions was also introduced Kovács (2018). As MTM was shown to give superior performance in numerous template matching and registration scenarios Hel-Or et al. (2014) and can be computed efficiently in terms of some convolution operations, it has many potential applications in signal processing. Similarly to MI and related techniques (being approximately invariant to non-linear distortions), MTM operates by binning the template; however, the proper selection of bins providing optimal performance according to some criteria is still an open question. In this paper, we carry out a statistical analysis of the effect of bin selection for MTM.

The main contributions of the paper to the field are summarized as follows:

  1. As the name suggests, MTM was developed for image processing, where various distortions of a template can be treated as tone mappings. We point out that MTM is a highly analogous concept to MI, with numerous potential applications beyond imaging. In order to emphasize the generality of the measure, we introduce the name normalized Unexplained Variance (nUV), which we found more conformant with the literature of statistics.

  2. We define criteria for the ideal operation of the nUV measure, provide theoretical results on the ideal binning under these criteria and also provide algorithms to determine the ideal binning for particular problems.

  3. By numerical simulations, we show that in the context of discriminating distorted templates from noise, the proposed binning techniques improve the discrimination power of nUV by 4-13% in terms of AUC scores, with statistical significance.

The paper is organized as follows. In Section 2 a brief introduction is given to MTM, its analogy to MI is pointed out and the new nomenclature of nUV is introduced. The optimality criterion is defined, theoretical results are derived and corresponding algorithms to find approximately optimal binnings are proposed in Section 3. The numerical experiments are described and evaluated in Section 4, and finally, conclusions are drawn in Section 5.

2 Brief Introduction to Matching by Tone Mapping (MTM) and Problem Formulation

In this section, we give a brief introduction of the MTM measure, discuss the importance of binning, formulate the problem we deal with in the rest of the paper and also point out a close relation between MTM and MI leading us to the introduction of the term normalized Unexplained Variance (nUV).

First, the notations used in the rest of the paper are introduced, trying to follow those of the related papers Hel-Or et al. (2014); Kovács (2018) for the compatibility of discussions. We use lowercase, boldface and uppercase letters to denote scalars, vectors and matrices, respectively, keeping the notations p and w for the template and the window, and N for the dimensionality of the feature space. Sets, and the special class of functions called intensity transformations (distortions or tone mappings in Hel-Or et al. (2014)), are denoted by calligraphic letters. For the ease of reading, and for compatibility with the literature Hel-Or et al. (2014), we also introduce Greek letters, which always denote vectors in special roles.

The MTM dissimilarity Hel-Or et al. (2014) of p and w is defined as

MTM(p, w) = min_m ||m(p) - w||^2 / (N var(w))    (1)

where the minimization goes over the admissible tone mappings m, the numerator measures how closely p can be transformed to w by applying some tone mapping coordinate-wise, and the function var in the denominator stands for the empirical variance of the elements of w, ensuring invariance to intensity scaling. It is worth noting that MTM is not symmetric: the form (1) is referred to as the Pattern-to-Window (PtW) case and the Window-to-Pattern (WtP) case is defined by interchanging p and w in (1). In the rest of the paper we focus on the Pattern-to-Window case, but emphasize that all results can be derived for the Window-to-Pattern (WtP) case analogously.

2.1 Piecewise constant approximation

The minimization problem (1) cannot be solved explicitly, but approximate solutions can be obtained by the linearization of the problem, particularly, by replacing the term m(p) with a linear approximation. Let the coordinates of p be quantized into k bins, let the boundaries of the bins be arranged into an increasing vector, and suppose that each bin contains at least one element. One can form the piecewise constant (PWC) slice transform matrix S(p) of p as

S(p)_{i,j} = 1 if p_i falls in bin j, and S(p)_{i,j} = 0 otherwise.    (2)

It can be readily seen that the matrix S(p) contains structural information about p: each column is related to a bin, and the i-th element of column j is set to 1 only if p_i falls in bin j. The columns of the matrix are referred to as slices, and the cardinalities of the slices are represented in the vector n, with n_j denoting the number of elements falling in slice j. Given S(p), one can approximate p as S(p)β in many ways, e.g. choosing the coefficient vector β as the bin centers or as the within-bin means. Similarly to the approximation of p, the matrix can be used to approximate various coordinate-wise transformations of p: for any β, the expression S(p)β can be considered as the PWC approximation of some possibly non-linear coordinate-wise transformation of p. Obviously, the quality of the approximation highly depends on the intensity distribution of p, the number of bins, and the smoothness of the transformation being approximated. Nevertheless, the linearization of the minimization problem (1) by S(p)β is reasonable, and PWC MTM becomes

MTM(p, w) = ||S(p)β* - w||^2 / (N var(w))    (3)

where β* is the exact solution of the least squares problem in the numerator:

β* = (S(p)^T S(p))^{-1} S(p)^T w    (4)
Figure 1: MTM approximates the values in a particular bin by their mean (red horizontal lines) and calculates the sum of squared differences of the values from the corresponding means.

The numerator can be interpreted as a PWC ordinary least squares regression. In principle, any regression technique could be used to approximate MTM; the benefits of the PWC regression are that it has an extremely low number of parameters and can be computed efficiently. To simplify notations, we substitute (4) into (3) and introduce the formalism

MTM(p, w) = ||(I - H)w||^2 / (N var(w))    (5)

where H = S(p)(S(p)^T S(p))^{-1} S(p)^T is the projection matrix into the subspace generated by the columns of S(p). To reduce clutter, we omitted the argument of S and H, but we highlight that both S and H are implied by the structure of p.

The numerator being a least squares regression implies that H is the hat-matrix: it is idempotent, symmetric, and an orthogonal projection, thus self-adjoint Draper and Smith (1998).
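A minimal sketch of the simplified form (5) under the notation above (template p, window w, user-chosen bin edges); the helper names are illustrative, not the authors' implementation:

```python
import numpy as np

def slice_matrix(p, edges):
    """PWC slice transform (2): column j indicates the coordinates of p
    falling into the bin [edges[j], edges[j+1])."""
    idx = np.clip(np.digitize(p, edges) - 1, 0, len(edges) - 2)
    S = np.zeros((len(p), len(edges) - 1))
    S[np.arange(len(p)), idx] = 1.0
    return S

def pwc_nuv(p, w, edges):
    """PWC nUV / MTM, Pattern-to-Window: ||(I - H)w||^2 / (N var(w)),
    where H projects onto the column space of S(p)."""
    S = slice_matrix(p, edges)
    H = S @ np.linalg.inv(S.T @ S) @ S.T   # hat matrix of the PWC regression
    r = w - H @ w                          # residuals of the per-bin mean fit
    return (r @ r) / (len(w) * np.var(w))

rng = np.random.default_rng(1)
p = rng.random(200)
edges = np.linspace(0.0, 1.0, 6)          # 5 equal-width bins
w_tone = np.sin(2 * np.pi * p) + 0.05 * rng.normal(size=200)  # tone-mapped p
w_rand = rng.normal(size=200)             # pure noise window
nuv_tone, nuv_rand = pwc_nuv(p, w_tone, edges), pwc_nuv(p, w_rand, edges)
print(nuv_tone, nuv_rand)
```

The hat matrix replaces each coordinate of w by the mean of the w values whose template counterparts share its bin, so the numerator is exactly the residual sum of squares of the per-bin mean fit; the distorted template scores far lower dissimilarity than the noise window.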


An insight into the operation of the measure can be gained by recognizing some further special properties of the matrix H, originating from its special construction from the slice matrix S(p) with pairwise orthogonal columns, which we utilize in Section 3.

Lemma 1.

(Properties of the matrix H) Let H denote the matrix S(p)(S(p)^T S(p))^{-1} S(p)^T, and let I_j denote the set of indices of the coordinates of p falling in bin j. The matrix H is a square matrix of type N x N with H_{i,l} = 1/n_j if i, l ∈ I_j, and H_{i,l} = 0 otherwise. As a consequence, the i-th coordinate of Hw is the mean of the elements of w falling in the bin where p_i falls.

Proof.

For the proof see A. ∎

The operation of MTM with PWC approximation is illustrated in Figure 1. The paired samples of p and w are visualized in a scatter plot and 3 bins are indicated by vertical lines. By Lemma 1 and the simplified form (5), MTM approximates the values in a particular bin by their mean (red horizontal lines) and the numerator of the measure calculates the sum of squared differences of the values from the corresponding means. Thus, the numerator is the sum of squared residuals in the PWC regression, which is divided by the total empirical variance of w; hence, the measure is the complement of the R^2 score of regressing w as the target variable on p as the explanatory variable using a piecewise constant regression. Another interpretation, discussed in detail in the next subsection, is that MTM measures the uncertainty of the w values falling in a particular bin by computing their variance (equal to the sum of squared residuals when w is approximated by the means within the bins). This uncertainty characterizes how much w can be treated as a function of p.

To illustrate the structure of H for a better insight, and also to validate the lemma qualitatively, consider a template p binned into 2 bins such that exactly its first two coordinates fall in the first bin. Then the first two elements of Hw are both equal to the mean of the first two elements of w, i.e. of the elements belonging to the first slice of the slicing of p.

We mention that the piecewise constant approximation enables the regularization of the measure through the number of bins. One can readily see that if all values of the template are unique, and each value is treated as a separate slice, both S(p)^T S(p) and H become the identity matrix, and the numerator of PWC MTM will give a perfect match (0 dissimilarity) for any w. Consequently, the use of a low number of bins regularizes PWC MTM to prevent overfitting by controlling the smoothness of the non-linear tone mappings used to approximate w from p.
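The overfitting effect is easy to reproduce numerically: with one bin per (unique) template value the slice matrix is a permutation matrix, so the hat matrix is the identity and the residual vanishes for any window (a toy check; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10
p = rng.random(N)               # all coordinates unique with probability 1
w = rng.normal(size=N)          # arbitrary window

# One bin per unique template value: S(p) becomes a permutation matrix,
# so H = S (S^T S)^{-1} S^T = S S^T = I, and ||(I - H) w||^2 = 0 for ANY w,
# i.e. a "perfect match" that carries no discriminative information.
S = np.zeros((N, N))
S[np.arange(N), np.argsort(np.argsort(p))] = 1.0   # each p_i in its own bin
H = S @ np.linalg.inv(S.T @ S) @ S.T
residual = np.linalg.norm(w - H @ w)
print(np.allclose(H, np.eye(N)), residual)
```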

Finally, another important property of H is pointed out: its relation to k-means clustering Bishop (2006).

Lemma 2.

(On the relation of the projection matrix to k-means clustering) Let H_k denote the set of all hat-matrices implied by slice matrices binning p into k non-empty bins. Minimizing the expression

min_{H in H_k} ||(I - H)p||^2    (6)

is equivalent to solving the k-means clustering problem for the elements of p, and constructing the projection matrix from the clusters interpreted as bins.

Proof.

The expression ||(I - H)p||^2 measures the sum of squared residuals when each element of p within a slice is approximated by the mean of the elements in the slice. Treating the slices as clusters and summing the squared residuals for each cluster, one can recognize (6) as the objective function of k-means clustering, to be minimized over the space of all k-partitionings of the coordinates of p. ∎

Finally, we mention that solving the k-means clustering problem is not equivalent to applying the well-known ML-EM k-means clustering algorithm, which provides only a suboptimal solution Bishop (2006).
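Lemma 2 can be checked numerically. For one-dimensional data the optimal k-means clusters are always contiguous intervals of the sorted values, so a brute-force search over contiguous partitions solves the clustering exactly; the sketch below (illustrative code, not the paper's) also verifies that the residual of the per-bin mean fit equals the k-means inertia of the same clusters:

```python
import numpy as np

def hat_from_labels(labels, k):
    """Hat matrix of the PWC regression implied by a bin assignment."""
    S = np.zeros((len(labels), k))
    S[np.arange(len(labels)), labels] = 1.0
    return S @ np.linalg.inv(S.T @ S) @ S.T

rng = np.random.default_rng(3)
p = np.sort(rng.random(30))
k = 3

# Exhaustive search over contiguous 3-partitions of the sorted template.
best = (np.inf, None)
for a in range(1, 29):
    for b in range(a + 1, 30):
        labels = np.concatenate([np.zeros(a, int), np.ones(b - a, int),
                                 np.full(30 - b, 2)])
        H = hat_from_labels(labels, k)
        ssr = np.sum((p - H @ p) ** 2)           # objective (6)
        # k-means inertia of the same clusters, computed independently:
        inertia = sum(np.sum((p[labels == j] - p[labels == j].mean()) ** 2)
                      for j in range(k))
        assert np.isclose(ssr, inertia)          # Lemma 2: the two coincide
        if ssr < best[0]:
            best = (ssr, (a, b))
print(best)
```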

2.2 The importance of binning and problem formulation

Bin selection is the problem of determining the proper number and widths of bins Scott (2015) to group data, and the ideal binning strategy is usually data and application-specific.

The authors of MTM Hel-Or et al. (2014) did not address the question of bin selection for the slice transform and used equal width binning (EQW) in the evaluation of the measure, with relatively low numbers of bins showing the best performance. Although there are numerous rules of thumb to determine the number of bins (the square root rule Jopia (2019), Sturges' rule Jopia (2019), the Rice rule Jopia (2019)), as pointed out in the previous subsection, the number of bins plays the role of a regularization parameter, thus we keep it as a degree of freedom.

On the other hand, given a particular number of bins, the selection of bin boundaries can naturally be expected to affect the performance of PWC MTM. The goal of this paper is to examine the effect of bin boundary selection strategies on PWC MTM to identify ideal binning techniques under specific conditions. We mention that there are results in the literature for the selection of the widths of bins (Scott's rule Scott (1979), the Freedman-Diaconis choice Freedman and Diaconis), and variable-width bins (like equal frequency binning (EQF) Peng et al. (2009), with each bin containing the same number of elements) have also been proposed. The common point of these techniques is that most of them are derived to optimize the construction of the empirical distribution function through histograms in terms of some optimality criteria. As MTM is intended to be used in pattern recognition scenarios to recognize patterns under some assumptions on the nature of the noise and possible distortions, the problem of binning is essentially different from that of constructing the empirical distribution function. To support this claim, we anticipate some results discussed in subsection 4.5, where EQF turns out to have extremely low performance in the pattern recognition settings of the numerical experiments.

2.3 The relevance of MTM, its relation to MI, and the introduction of the term normalized Unexplained Variance

As there are dozens of (dis)similarity measures proposed in the literature, we found it crucial to point out some beneficial properties of MTM that make its statistical properties worth studying and that motivated the writing of this paper. The authors of Hel-Or et al. (2014) already showed that PWC MTM can be computed efficiently for a template and all windows of a signal, and also demonstrated that the performance of MTM in pattern recognition applications is highly competitive with that of MI. To further emphasize its relevance as a general-purpose dissimilarity measure, in this subsection we show that MTM is a highly analogous concept to a normalized variant of MI, leading us to change the nomenclature by introducing the term normalized Unexplained Variance (nUV) to make the name more aligned with the classical concepts being utilized under the hood.

In statistics, entropy and variance are probably the most widely used measures of uncertainty Zidek and van Eeden (2003). Related to entropy, mutual information (MI) treats the coordinates of the window and template as corresponding realizations of two random variables (W, P) with a joint distribution, and is defined as MI(W, P) = H(W) - H(W | P), where H(W) and H(W | P) denote the marginal and conditional entropies of the distribution. For (dis)similarity measures it is usually desired to map into a bounded range in order to be easily interpretable and comparable across different problems; consequently, multiple normalized variants of MI have been proposed. One particular normalization leads to the uncertainty coefficient Press et al. (2007), but for clarity, in the rest of the paper we refer to it as normalized mutual information (nMI):

nMI(p, w) = (H*(W) - H*(W | P)) / H*(W)    (7)

where the asterisk refers to empirical estimation using the paired samples of p and w. nMI quantifies the relative amount of information W and P share, by subtracting the relative amount of uncertainty remaining in W given P from the total amount of uncertainty in W, scaled to the range [0, 1].

According to the law of total variance Weiss (2005), if W has finite variance,

Var(W) = E[Var(W | P)] + Var(E[W | P])    (8)

holds. The first term on the right-hand side is usually referred to as the unexplained variance, characterizing the variance remaining in W given P; and the second term is called the explained variance, characterizing the variation of W one can explain by the variation of P. Rearranging the equation and dividing both sides by Var(W), one obtains

Var(E[W | P]) / Var(W) = 1 - E[Var(W | P)] / Var(W)    (9)

Given a paired sample for W and P, the various terms can be approximated by empirical quantities. In the theory of regression, the conditional expectation function E[W | P] is proved Bishop (2006) to be the best predictor of W from P in terms of squared loss, and any least squares regression function estimating W from P can be treated as an approximation of E[W | P]. Consequently, E[Var(W | P)] can be estimated by the mean of the squared residuals of the regression. Choosing a piecewise constant regression function and using the notations introduced before,

E[Var(W | P)] ≈ ||(I - H)w||^2 / N    (10)

with N denoting the dimensionality of the feature space, and estimating Var(W) by the empirical variance var(w):

Var(E[W | P]) / Var(W) ≈ 1 - ||(I - H)w||^2 / (N var(w)) = 1 - MTM(p, w)    (11)

The term on the left-hand side is the normalized explained variance, quantifying the variance of W explained by P. Comparing the expressions (7) and (11), one can observe that nMI and 1 - MTM(p, w) are highly analogous concepts, both quantifying the uncertainty disappearing from W given P, nMI using entropy, and 1 - MTM(p, w) using variance to measure uncertainty.
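The identity (11) can be verified directly: since N times the (1/N-normalized) empirical variance equals the total sum of squares, one minus the PWC nUV value is exactly the R-squared score of the per-bin mean regression (a toy sketch with assumed helper names):

```python
import numpy as np

def pwc_fit(p, w, edges):
    """Per-bin means of w: the PWC least-squares regression of w on p."""
    idx = np.clip(np.digitize(p, edges) - 1, 0, len(edges) - 2)
    fitted = np.empty_like(w)
    for j in range(len(edges) - 1):
        fitted[idx == j] = w[idx == j].mean()
    return fitted

rng = np.random.default_rng(4)
p = rng.random(300)
w = p ** 2 + 0.1 * rng.normal(size=300)       # non-linear tone mapping + noise
edges = np.linspace(0, 1, 9)                  # 8 bins

fitted = pwc_fit(p, w, edges)
nuv = np.sum((w - fitted) ** 2) / (len(w) * np.var(w))        # as in (11)
r2 = 1.0 - np.sum((w - fitted) ** 2) / np.sum((w - w.mean()) ** 2)
print(1.0 - nuv, r2)   # the normalized explained variance is the R^2 score
```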

This analogy shows that MTM is more general in principle than what the name Matching by Tone Mapping would suggest: it can be treated as a meaningful alternative of MI in any application where MI is used as a similarity measure. To emphasize its generality beyond image processing, and to be conformant with the literature of statistics and the classical principles utilized by its operation, we found it necessary to change the nomenclature, and this change is formalized in the next definition.

Definition 1.

In the context of this paper, the piecewise constant approximation of the normalized unexplained variance (nUV) dissimilarity measure refers to the piecewise constant approximation of the MTM measure, both denoted by nUV(p, w) in the sequel.

As it is a common technique to estimate nMI through binning Carrara and Ernst (2019), we compare the role of binning in nMI to that in PWC nUV. Unlike the binning implementations of nMI (which discretize both vectors), PWC nUV carries out binning only for the template vector. Consequently, another beneficial property of PWC nUV is that one can expect less loss of information and less sensitivity to improper binning strategies.

Finally, we mention that although MI and its variants are widely used in pattern recognition, to the best of our knowledge, there are no results in the literature on optimizing its operation by determining the ideal binning technique in terms of some pattern recognition specific optimality conditions, which could make PWC nUV a favorable choice over nMI in problems where the assumptions of optimality are met.

3 The optimal binning technique

In this section, we carry out the statistical analysis of bin boundary selection strategies for the PWC nUV measure. First, we introduce a statistical model and phrase the optimality criterion we aim to optimize by choosing the binning technique. Then, the main results on the optimality of binning are derived and two algorithms are proposed to determine the ideal binning for a particular template under mild assumptions on the nature of distortions and noise.

3.1 The optimal operation of the PWC nUV dissimilarity measure

Naturally, any binning technique (equal width, equal frequency, etc.) can be used to carry out the slice transform, thus, to drive the PWC nUV measure. In order to select the optimal binning technique for a given number of bins, we need to define when we consider the operation of the measure optimal. As the goal of pattern recognition is to recognize patterns under certain classes of distortions, we consider PWC nUV operating optimally when it separates the noisy background from a noisy and distorted pattern as much as possible. In order to put this concept formally, in accordance with the notations so far, let p denote a template, w0 a window containing white noise (from a distribution with finite variance), and w1 = f(p) + e the window containing the template distorted by the tone mapping f and additive noise e. We consider the tone mapping f to be a stochastic process (a random real function) with finite first and second moments, defined by all of its finite dimensional probability distributions.

Definition 2.

(Optimal binning) With the notations introduced before, for a given template p we consider the operation of PWC nUV optimal if its expected discrimination power

E[nUV(p, w0)] - E[nUV(p, w1)]    (12)

regarding a noisy window and a noisy distorted template is maximal.

Put another way, we consider PWC nUV to operate optimally if the binning technique used in the slice transform is such that the expected dissimilarity of the template from noise (E[nUV(p, w0)]) and the expected dissimilarity of the template from the distorted template (E[nUV(p, w1)]) are as different as possible.

3.2 Linearization of the distortion and the models being examined

In this subsection, we apply equivalent transformations to replace the stochastic process f by a random vector from a finite-dimensional distribution.

If all the coordinates of p are different, f(p) is a random quantity governed by the N-dimensional distribution of the distortion f. What makes f(p) different from a real random vector variable is the presence of equal coordinates: if p_i = p_l holds for some indices i and l, then f(p)_i = f(p)_l also needs to hold, as one realization of the random tone mapping is a function assigning the same value to both p_i and p_l. In order to ensure that this condition holds, we factor p by introducing its full-rank slice transform. Let u denote the number of unique coordinates of p, let q denote the vector of the unique elements of p in increasing order, let S* denote the full-rank slice transform matrix, in which each unique coordinate of p falls in a distinct slice, and let n* denote the vector of the multiplicities of the unique elements in the slices of S*. With S*, p can be reconstructed without any loss of information, that is, p = S* q.

With these notations, f(p) = S* f(q), where f(q) is a random vector from the u-dimensional distribution of the distortion. Consequently, w1 = S* f(q) + e, and the optimality criterion becomes the maximization of

E[nUV(p, w0)] - E[nUV(p, S* f(q) + e)]    (13)

3.3 The need for first order approximation of the expected values

Treating the PWC nUV measure as a random quantity through the random nature of the window implies some difficulties in the evaluation of the optimality criterion (13), as random variables appear in both the numerator and the denominator, leading to a ratio distribution which is analytically intractable. In order to carry out the analysis, we introduce the usual first-order approximation of the expected value of a ratio distribution Benaroya et al. (2005):

E[X / Y] ≈ E[X] / E[Y]    (14)

and use this approximation throughout the paper when evaluating the expected value of the PWC nUV measure.
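The accuracy of the first-order approximation (14) is easy to probe by Monte Carlo; when the denominator is well concentrated around its mean, the approximation error is small (an illustrative sketch; the chi-square choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
# X and Y positive; Y has many degrees of freedom, so it is concentrated
# around its mean and the first-order approximation E[X/Y] ~ E[X]/E[Y] is good.
X = rng.chisquare(df=50, size=200_000)
Y = rng.chisquare(df=200, size=200_000)
exact = np.mean(X / Y)             # Monte Carlo estimate of E[X / Y]
approx = np.mean(X) / np.mean(Y)   # first-order approximation (14)
print(exact, approx)               # close, but not identical
```

The residual gap reflects the higher-order terms (involving the variance of the denominator and its covariance with the numerator) that the first-order expansion discards.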

3.4 Statistical analysis of the model

In this section we carry out the statistical analysis of the model introduced.

Proposition 1.

Using the notations introduced before, in the first order approximation of the ratio distribution,

(15)
Proof.

For the proof see B. ∎

As a consequence of the proposition, the expected value of the dissimilarity of a template from a window containing only noise is a constant, which depends only on the dimensionality of the space N and the number of bins k, but is independent of the structure of the bins (the matrix H). This result suggests that the matrix H minimizing the term E[nUV(p, w1)] will maximize the expected discrimination power of the measure according to the optimality criterion (13).
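Assuming white Gaussian noise and the 1/N convention for the empirical variance, this constant works out to (N - k)/(N - 1): the expected numerator is sigma^2 (N - k), the trace of I - H times the noise variance, while the expected denominator is sigma^2 (N - 1). This is our derivation, offered as a plausible reading of (15) rather than a quotation from the paper; a Monte Carlo sketch supports it and also illustrates that the bin structure is irrelevant for pure noise:

```python
import numpy as np

def nuv_noise(N, k, rng):
    """PWC nUV of an arbitrary k-bin structure against a pure-noise window;
    here equal-count bins are used, but any non-empty binning behaves alike."""
    labels = np.repeat(np.arange(k), N // k)
    S = np.zeros((N, k))
    S[np.arange(N), labels] = 1.0
    H = S @ np.linalg.inv(S.T @ S) @ S.T
    w = rng.normal(size=N)                 # white Gaussian noise window
    r = w - H @ w
    return (r @ r) / (N * np.var(w))

rng = np.random.default_rng(6)
N, k = 100, 5
mc = np.mean([nuv_noise(N, k, rng) for _ in range(2000)])
print(mc, (N - k) / (N - 1))   # Monte Carlo mean vs the conjectured constant
```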

Proposition 2.

(On the expected dissimilarity of a template and a distorted, noisy template) With the notations introduced before,

(16)

where F = E[f(q) f(q)^T] denotes the expected cross-product matrix of the distortion, <., .> denotes the Frobenius inner product, and the vector n* contains the cardinalities of the slices in the full-rank slice transform of p.

Proof.

For the proof see C. ∎

We note that the cross-product matrix F = E[f(q) f(q)^T] encodes the covariance structure and the mutual relationships of the coordinates of the mean vector, as F = C + E[f(q)] E[f(q)]^T, where C is the exact covariance matrix of the distortion.

Theorem 1.

(On the optimization of binning) In the first order approximation of the expected value of the ratio distribution, the projection matrix maximizing the expression

<H, S* F S*^T>    (17)

minimizes the expected dissimilarity of the template and the distorted, noisy template (equation (16)), thus maximizes the separation power of the measure for the distortion f.

Proof.

The statement can be readily seen as the projection matrix appears only in the numerator of (16) and has a negative sign, thus maximizing (17) minimizes the approximation in equation (16) and maximizes the optimality criterion in equation (13). ∎

The results give an interesting insight into the operation of the PWC nUV measure. In order to optimize its operation in terms of the optimality criterion (13), one needs to find a bin structure implying a projection matrix which maximizes the alignment of the projection matrix and the cross-product matrix of the distortion, by maximizing their Frobenius (matrix) inner product. Put another way, as is zero if and do not fall in the same bin, the maximization of the Frobenius product requires strongly covarying distortion coordinates with similar means to fall in the same bin, so that they contribute their high cross-product values to the objective function of the maximization (17).
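The alignment argument can be illustrated numerically. The sketch below uses a small, hypothetical cross-product matrix and assumes, following the characterization in Lemma 1, that the non-zero entries of the projection matrix equal the reciprocal of the bin cardinality:

```python
import numpy as np

def projection_matrix(bins, d):
    """Projection matrix implied by a binning: the (i, j) entry is 1/|bin|
    when coordinates i and j share a bin, and 0 otherwise (assumed form,
    following Lemma 1's characterization)."""
    P = np.zeros((d, d))
    for b in set(bins):
        idx = np.flatnonzero(np.asarray(bins) == b)
        P[np.ix_(idx, idx)] = 1.0 / len(idx)
    return P

def alignment(bins, C):
    """Frobenius inner product <P, C> -- the objective maximized in (17)."""
    return float(np.sum(projection_matrix(bins, C.shape[0]) * C))

# Hypothetical 4-coordinate example: coordinates {0, 1} and {2, 3} covary
# strongly, so an aligned binning should group them together.
C = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
good = alignment([0, 0, 1, 1], C)  # covarying coordinates share a bin
bad = alignment([0, 1, 0, 1], C)   # covarying coordinates split apart
print(good, bad)  # grouping covarying coordinates gives the larger product
```

The binning that keeps the covarying coordinate pairs together collects their large cross-product values into the objective, as the text describes.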

One can readily see that the optimization problem (17) is a combinatorial optimization problem, as the matrices in the set are induced by the -partitions of the set ; thus, the problem is hardly tractable analytically. However, greedy algorithms Cormen et al. (2009) can be derived to approximate the optimal solutions. Given a cross-product matrix , the number of bins and the vector containing the cardinalities of the slices of the full-rank slice transform , a greedy algorithm to find a binning approximating the ideal one is provided in Algorithm 1. The algorithm initializes a random configuration of non-empty bins and computes the inner product with the matrix implied by the random configuration. Then, it iteratively checks if moving any of the bin boundaries one step to the left or right increases the inner product. In each iteration, the adjustment of bin boundaries leading to the highest increase in the inner product is chosen. The algorithm stops when no further increase can be achieved. The vector of bin boundaries computed by the algorithm contains the indices of the bin boundaries of the ideal binning in the vector containing the ordered, unique elements of .

Algorithm 1 Greedy optimization of binning. Accessing an item in a vector is denoted by squared brackets, subscripts and superscripts are parts of the names of the variables. A Python implementation of the algorithm is available in the GitHub repository https://github.com/gykovacs/ideal_binning_nuv

//RowColSum computes the contribution of row and column to the inner product for bin

1:function RowColSum(, , , , )
2:   
3:   for  to  do // the upper bound exclusive
4:      if  then
5:                   
6:   return s;

//Change computes the change in the objective function when the th bin boundary is moved one step to the left () or to the right ()

1:function Change()
2:    // transforming the step to 0 or -1
3:   RowColSum() // new contribution of bin
4:   RowColSum() // new contribution of bin
5:    // new cardinalities
6:    // changes in the objective function
7:   return

// GreedyBinning implements the proposed greedy binning algorithm

1:function GreedyBinning(, , )
2:   // The parameters , and denote the cross-product matrix to fit, the vector and the number of bins, respectively.
3:    A random vector of size containing increasing integers from to , with q[0]=0
4:    Vectors of size , initialized by zeros.
5:    (The inner product (target function) to be maximized)
6:   //Initializing the target function and the vectors containing the sums and number of items related to the bins
7:   for  to  do
8:      for  to  do
9:         for  to  do
10:             // the initial contributions of the bins          
11:          // the initial cardinalities of the bins       
12:       // the initial objective function    
13:   // Iteratively checking for the largest improvement by moving bin boundaries one step to the left or right
14:   do
15:      
16:      for  to  do // for all bins (the upper bound exclusive)
17:         for  do // relative index of the shrinking bin
18:            if  then // if the shrinking bin does not get empty
19:                // transforming relative index to a step -1/+1
20:               Change() // the changes
21:               if  then // if the improvement larger than before, record its parameters
22:                                                             
23:      if  then // if there was an improvement, update the values accordingly
24:         
25:                
26:   while 
27:   return
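The pseudocode above can be condensed into a short Python sketch. This simplified version differs from Algorithm 1 in two ways: it recomputes the objective from scratch instead of using incremental updates, and it accepts the first improving boundary move rather than the best one per iteration. It also assumes the 1/|bin| form of the projection-matrix entries; the reference implementation is the one in the linked repository.

```python
import numpy as np

def frobenius_objective(boundaries, C):
    """<P, C> for the contiguous binning given by boundary indices:
    bin b covers coordinates boundaries[b] .. boundaries[b+1]-1."""
    total = 0.0
    for b in range(len(boundaries) - 1):
        lo, hi = boundaries[b], boundaries[b + 1]
        total += C[lo:hi, lo:hi].sum() / (hi - lo)
    return total

def greedy_binning(C, k, seed=0):
    """Greedy search over contiguous binnings: move one bin boundary one
    step left or right while the objective improves. A simplified,
    unoptimized sketch of Algorithm 1 (first-improvement, full recompute)."""
    rng = np.random.default_rng(seed)
    d = C.shape[0]
    inner = np.sort(rng.choice(np.arange(1, d), size=k - 1, replace=False))
    bounds = np.concatenate(([0], inner, [d]))
    best = frobenius_objective(bounds, C)
    improved = True
    while improved:
        improved = False
        for i in range(1, k):             # inner boundaries only
            for step in (-1, 1):
                cand = bounds.copy()
                cand[i] += step
                if cand[i - 1] < cand[i] < cand[i + 1]:  # bins stay non-empty
                    val = frobenius_objective(cand, C)
                    if val > best:
                        best, bounds, improved = val, cand, True
    return bounds, best

# Hypothetical example: two blocks of three strongly covarying coordinates.
B = np.full((3, 3), 0.8) + 0.2 * np.eye(3)
C = np.zeros((6, 6))
C[:3, :3] = B
C[3:, 3:] = B
bounds, best = greedy_binning(C, 2)
print(bounds, best)  # the boundary settles between the two blocks
```

On this toy matrix the objective is unimodal in the boundary position, so the greedy search finds the global optimum; in general, as noted below, only a local optimum is guaranteed.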

As the following corollary shows, if the distortion is spherical and centered to the origin, that is, it makes no distinction between various directions of the feature space, the expected value reduces to a constant – in accordance with the no free lunch theorems of machine learning Wolpert and Macready (1997): all machine learning techniques of a class (in this case all binnings of bins) provide the same average performance when evaluated on all possible problems with no structural preference in their distribution (in this case all distorted vectors from some spherical distribution).

Corollary 1.

(No free lunch theorem) If all elements of are unique, and the distortion has a spherical distribution in the d-dimensional feature space of , that is, ,

(18)
Proof.

The statement follows by substituting in place of and in place of in Proposition 2, and utilizing . Without the unicity constraint on the elements of , ties would imply non-zero off-diagonal entries in , and would not reduce to . ∎

As a special case of Proposition 2, one can consider distortions that map close to itself. This closeness can be modelled by a distribution which has the mean , where is the vector of unique elements of in increasing order. The following proposition provides an insight into the effect of localized distortions: in this case, the ideal quantization needs to match the covariance structure of the distortion and also to minimize the representation error of made by the binning.

Proposition 3.

(On the expected value of the measure with a localized distortion) Using the notations introduced before, with if , then

(19)
Proof.

For the proof, see Appendix D. ∎

As a consequence of the proposition, if the distortion is centered to in the sense that , the ideal binning jointly minimizes the representation error of the binning and maximizes the alignment of the binning with the covariance structure of the distortion . According to Lemma 2, the representation error could be minimized by solving the k-means clustering problem (applying some k-means clustering technique like the well-known ML-EM); however, the alignment of the binning and the covariance structure is not optimized by it, thus, in these cases the optimization method formulated in Theorem 1 and Algorithm 1 is still recommended.

It is also reasonable to suppose that the distortion is not only centered to , but spherical. The following proposition and theorem show that in these cases solving the k-means clustering problem leads to the ideal quantization.

Proposition 4.

(On the expected value of the measure with spherically distributed distortion) If has unique elements, and , then

(20)
Proof.

Substituting the evaluations

into (19) with completes the proof. ∎

Theorem 2.

(On the optimization of the binning for spherically distributed distortion) If the elements of are unique and the distortion is centered to with a spherical distribution, the ideal binning can be determined by solving the k-means clustering problem for the elements of .

Proof.

The theorem is a consequence of Lemma 2 and being the only dependent term in the numerator of (20). ∎

As a consequence of Theorem 2, when only white noise is expected, the solution of the k-means clustering problem still provides the ideal binning.
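A minimal sketch of k-means binning in 1D follows, using plain Lloyd iterations; the exact 1D dynamic-programming solver of Wang and Song (2011), referenced below, would replace this in practice. Cluster labels over the sorted unique template values define the bins.

```python
import numpy as np

def kmeans_binning_1d(values, k, iters=100):
    """Lloyd-style 1D k-means on the sorted unique template values.
    A plain sketch; the dynamic-programming method of Wang and Song
    gives the exact optimum in 1D."""
    v = np.sort(np.unique(values))
    # initialize centers at evenly spaced quantiles of the values
    centers = np.quantile(v, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        new = np.array([v[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Hypothetical bimodal template: two tight clusters of intensities.
values = np.array([0.10, 0.12, 0.11, 0.80, 0.82, 0.79])
labels, centers = kmeans_binning_1d(values, 2)
print(labels)  # the two intensity clusters form the two bins
```

Note that the labels refer to the sorted unique values; mapping them back to template coordinates yields the binning used by the measure.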

We highlight that Theorem 1 provides the general conditions of ideal binning, applicable to any assumptions on . The greedy algorithm proposed in Algorithm 1 finds an approximate solution, but due to the combinatorial nature of the problem, it does not guarantee a global optimum. When the assumptions on the distortions reduce bin selection to the k-means clustering problem, advanced techniques developed to find the exact solution of the k-means clustering problem in 1D can be exploited to find the ideal solution Wang and Song (2011). Finally, one can readily see that although the results are based on the first-order approximation, when the unexplained variance measure is not normalized (making it analogous to MI), the same conditions on the optimal binning are exact.

3.5 Estimation of the cross-product matrix of the distortion

In order to determine the ideal quantization for a template, one needs to make assumptions on the cross-product structure of the expected distortions . If the distortions are known to come from a particular class of functions, one can estimate the cross-product matrix by sampling the class of functions, applying each sample function to the unique elements of , and computing the empirical cross-product matrix of the resulting vectors. For example, given a template in an image processing application, expecting gamma distortions (), with (related to over-exposition), one can determine as the vector of unique elements in , and estimate the matrix by sampling from and where .
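The sampling procedure described above can be sketched as follows; the exponent range of the gamma distortion is an illustrative assumption, not a value prescribed by the paper.

```python
import numpy as np

def empirical_cross_product(v, sample_distortion, n_samples=10_000, seed=0):
    """Estimate S = E[f(v) f(v)^T] by sampling distortion functions f and
    applying each to the vector of unique template values v."""
    rng = np.random.default_rng(seed)
    d = len(v)
    S = np.zeros((d, d))
    for _ in range(n_samples):
        fv = sample_distortion(v, rng)
        S += np.outer(fv, fv)
    return S / n_samples

# Illustrative gamma distortion f(x) = x**g with g ~ Uniform(1, 3);
# the exponent distribution is an assumption for this sketch.
def gamma_distortion(v, rng):
    return v ** rng.uniform(1.0, 3.0)

v = np.linspace(0.1, 1.0, 8)  # unique intensities, normalized into (0, 1]
S = empirical_cross_product(v, gamma_distortion)
```

The resulting matrix S can then be passed to a binning optimizer as the cross-product matrix of the expected distortion.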

3.5.1 The probability of ideal binning

Working with digital signals, due to the finite precision of representation and/or the sensitivity of acquisition devices, the space of windows is usually a bounded subset with volume . If the assumptions on are valid in a volume of the feature space and the ideal bins are determined by these assumptions, one can readily see that the probability of the ideal operation of the nUV measure becomes ; thus, using the proposed binning techniques can still improve the separation power of PWC nUV, even though the assumptions are valid only in a subset of the entire space .

4 Tests and Results

As a dissimilarity measure highly analogous to MI, nUV has numerous potential applications, from template matching through registration Ruiz et al. (2009) to feature selection in machine learning Vergara and Estevez (2014). Due to the generality of the measure, the theoretical nature of the results we derived, and space limitations, we do not evaluate the measure on real data. The goal of the numerical experiments is twofold, summarized as follows.

Testing the accuracy of the first-order approximations. By simulations, we show quantitatively and illustrate qualitatively that the formulae derived in the previous section are aligned with the measurements; thus, the first-order statistical approximation is acceptable in the scope of the experimental settings. This is carried out by simulating templates, noisy windows, and distorted templates, then computing and comparing the predictions of Propositions 1, 2 and 4 with the real dissimilarity scores.

Testing the pattern recognition performance in terms of AUC. We characterize quantitatively how much improvement can be achieved by using the proposed binning techniques in pattern recognition scenarios. In each test case, for each binning technique we record if holds (indicating the correct recognition of the distorted template). We note that the percentage of correct recognitions is an estimation of the probability that a randomly chosen positive sample (distorted template) will have a smaller dissimilarity score than a randomly chosen negative sample (noisy window). This estimation is equivalent to one of the common interpretations of the widely used AUC score (Area Under the receiver operating Curve) Flach et al. (2011).
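The AUC estimate described above reduces to a fraction of paired comparisons. A minimal sketch with hypothetical dissimilarity scores:

```python
import numpy as np

def paired_auc_estimate(d_pos, d_neg):
    """Fraction of test cases in which the distorted template (positive)
    scores a smaller dissimilarity than the noisy window (negative) --
    the AUC interpretation used in the experiments. Inputs are paired
    per test case."""
    d_pos, d_neg = np.asarray(d_pos), np.asarray(d_neg)
    return float(np.mean(d_pos < d_neg))

# Hypothetical dissimilarity scores from four test cases.
print(paired_auc_estimate([0.2, 0.4, 0.1, 0.9], [0.5, 0.3, 0.6, 1.0]))  # 0.75
```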

All results are reproducible with the codes with fixed random seeds in the GitHub repository https://github.com/gykovacs/ideal_binning_nuv.

4.1 One test case

The computational steps in one test case are summarized as follows.

Sampling of a random template. A random template () is generated with a random dimensionality ; the templates are vectors from three different distributions with equal probabilities: standard normal distribution (), uniform distribution (), and a distribution composed from two normals () – this composition is used to generate templates with non-unimodal intensity distributions. Then, the templates are normalized into the range and a random exponent is used to adjust the template by to alter its intensity distribution (in image processing, this type of exponential adjustment is related to under- and over-exposition). Finally, when general distortions are evaluated with greedy binning, the intensity values in the template are rounded to 3 digits, in this way introducing some ties between the template values, which is also usual in digital signal processing. When Theorem 2 is examined, due to the unicity constraint on the values of , rounding is not applied.

Sampling of a noisy window. A noisy window () and the white noise vector used to distort the template () are generated from a normal distribution with uniformly random standard deviation .

Sampling a distorted template and the cross-product structure. First, the full-rank decomposition of is determined: . For general distortions, a random mean vector and a covariance matrix are sampled. For spherical distortions, and , with . The cross-product matrix is computed as . A distortion vector is sampled as and a distorted template () is generated.

Binning. The binning of the template is carried out by equal width (EQW), equal frequency (EQF), k-means and greedy binning for various bin numbers.

Calculation of dissimilarity scores. The approximations of the expected values by Propositions 1, 2 and 4 and the true dissimilarity scores and are computed.

All the fixed, constant parameters used in the simulations are selected to cover a reasonably wide range of possible applications, template structures, intensity distributions, and signal-to-noise ratios.

4.2 Aggregation of the results

The results of Propositions 2 and 4 provide formulas for the expected values of the PWC nUV measure when the possible distortions have a particular cross-product structure. There can be numerous meaningful cross-product structures representing the possible distortions in various fields of application. Picking any of them would deteriorate the generality of the experiments, and due to space limitations, we cannot examine many different structures in detail. Therefore, in the test cases, almost all parameters of the templates, distortions, and noisy windows are sampled. For the analysis, the computed dissimilarity scores are averaged over the entire population, which enables us to draw conclusions about the operation of the measure with distortions from many different cross-product structures. In the rest of the section, the averages of the expected values are denoted by , and the averages of the computed dissimilarity scores are denoted by .

One can expect that for templates of varying sizes, varying numbers of bins might be ideal for template matching. In order to compensate for this variation in the sizes and to enable the meaningful aggregation of the results, the number of bins for each test case is varied as follows. All the figures are computed for and bins, and for the numbers of bins determined by the Sturges formula (), the Rice rule () and the square root rule (), where denotes the number of unique elements in . One can easily confirm that in the range of the experiments usually holds; therefore, we found it meaningful to plot the aggregated results in one figure and connect them with lines for a better visualization of trends, even though the values , and depend on the sizes of the templates.
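The three rules have standard formulations; the sketch below uses the common ceiling-rounding convention, which may differ in detail from the variant used in the experiments.

```python
import math

def bin_numbers(n_unique):
    """Standard formulations of the three bin-number rules, applied to the
    number of unique template values (rounding conventions vary; ceiling
    is used here as one common choice)."""
    return {
        "sturges": math.ceil(math.log2(n_unique)) + 1,  # Sturges formula
        "rice": math.ceil(2 * n_unique ** (1 / 3)),     # Rice rule
        "sqrt": math.ceil(math.sqrt(n_unique)),         # square root rule
    }

print(bin_numbers(100))  # {'sturges': 8, 'rice': 10, 'sqrt': 10}
```

Consistent with the ordering noted above, the Sturges formula grows slowest and the square root rule fastest as the number of unique values increases.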

4.3 General distortions and greedy binning

We have executed 5000 test cases as described above and plotted the aggregated results in Figure 2, with the standard deviations denoted by vertical lines with a minor horizontal shift for visibility. As one can observe, the predictions of Propositions 1 and 2 for the means of the distributions and the means of the real scores are very well aligned, with the highest relative difference of for the noise and for the distorted population. From these, we can conclude that, despite the first-order approximation (subsection 3.2), the formulae in Propositions 1 and 2 are close enough to the real values in the tested scenarios to expect an improvement in the separation power of PWC nUV by Theorem 1 and Algorithm 1.

The AUC scores for the various binning techniques are summarized in Figure 2(b) and aggregated in Table 1 with the matrix of p-values of the McNemar tests for the equality of the scores. One can observe that greedy binning outperforms both EQW (by 13% in aggregation) and EQF (by 26% in aggregation) with statistical significance, providing a numerical validation of Theorem 1 in the scope of the experiments. Interestingly, although k-means binning is proven to be ideal only when the distortion is spherically distributed around the template, it also outperforms EQW (by 4%) and EQF (by 17%) with statistical significance, indicating that there can be further configurations in which k-means binning works well. Comparing greedy binning and k-means binning, the most remarkable difference is that k-means binning does not require the estimation of the cross-product or covariance structure of the distortions. Consequently, the results suggest that even in the absence of any knowledge of the possible distortions, using k-means binning instead of EQW could improve the matching results with statistical significance.

(a)
(b)
Figure 2: Results for general distortions: fitting of the measurements to the theoretical values (a); AUC scores (b).
General distortions Spherical distortions
EQW EQF k-means greedy EQW EQF k-means greedy
EQW 1 0 0 0 1 0 2.3e-07 2.8e-04
EQF 0 1 0 0 0 1 0 0
k-means 0 0 1 0 2.3e-07 0 1 5.7e-02
greedy 0 0 0 1 2.8e-04 0 5.7e-02 1
AUC 0.63 0.5 0.67 0.76 0.83 0.5 0.84 0.84
Table 1: The matrix of p-values and the AUC scores for the various binning techniques with general and spherical distortions (values smaller than are rounded to .)

4.4 Spherical distortions

Again, we have executed 5000 test cases as described and plotted the results in Figure 3. The predictions of Propositions 1 and 4 for the means of the distributions and the means of the real scores are very well aligned, with the highest relative difference of for the noisy windows and for the distorted population. Despite the first-order approximation, the results suggest that Proposition 4 gives a good approximation of the expected value in the scope of the experiments. Comparing the AUC scores of recognition plotted in Figure 3(b) and aggregated in Table 1 with the p-values of the McNemar tests on the equality of the scores, one can observe that the k-means and greedy techniques outperform EQW (by 1% in aggregation) and EQF (by 34% in aggregation) with statistical significance. However, the improvement appears mainly for low numbers of bins: the performances quickly converge to that of EQW, and the greedy technique (due to its suboptimality) falls below EQW in terms of AUC when the square root rule is applied to determine the number of bins. The reason for the limited improvement in the case of spherical distortions is that, according to Proposition 4, the k-means and greedy techniques improve the discrimination power of PWC nUV by minimizing only the term , which is a smaller decrease than minimizing both this term and the as pointed out in Proposition 3.

(a)
(b)
Figure 3: Results for spherical distortions: fitting of the measurements to the theoretical values (a); AUC scores (b).

4.5 A note on the low pattern recognition performance of EQF

Interestingly, the AUC of EQF is 0.5 in both experiments, which means that it has no discriminative power in these settings. The operating principle of PWC nUV is that the slices describe the rough structure of the template, as the values in the bins are close to each other. This assumption is definitely not satisfied by EQF: placing the same number of values in each bin completely neglects the structure of the template and can break many similar values into separate bins, providing a poor representation of the template. This phenomenon is a qualitative validation of our previous claim that binning techniques developed to reconstruct the empirical distribution function of a sample do not necessarily perform well in other binning problems.

5 Conclusions

In this paper, we have examined the effect of binning strategies on the piecewise constant approximation of the normalized unexplained variance (nUV) (also known as MTM) dissimilarity measure. We defined the criterion of ideal operation in Definition 2 and showed in Theorem 1 that the ideal binning needs to maximize the alignment of the projection matrix and the cross-product structure of the expected distortion. In order to obtain an approximate solution for this combinatorial optimization problem, we proposed a greedy algorithm in Algorithm 1. In subsequent propositions, we examined special cases of the general statement, arriving at the case of localized and spherically distributed distortions, for which the ideal binning can be determined by solving the k-means clustering problem according to Theorem 2.

In Section 4, we carried out experiments to see how well the simulation results are aligned with the first-order statistical approximations. According to the results, the relative error is less than 0.1% in terms of the means of the figures. We also compared the performance of the proposed binning techniques to that of the historical ones in pattern recognition scenarios and found the proposed approaches to outperform the historical ones by 13% AUC in the case of general distortions with greedy binning, and by 1% AUC in the case of spherical distortions with k-means binning, in both cases with statistical significance.

The conclusions we can draw are summarized as follows. Due to the analogies presented in Section 2.3, nUV can be treated as a powerful alternative to MI, quantifying the uncertainty remaining about the window given in terms of variance. Thus, nUV is potentially applicable in any problem where MI is used as a similarity measure (template matching, registration, feature selection, etc.). Although numerical experiments can never cover all the possible use cases of a general-purpose dissimilarity measure, due to the wide range of parameters used in the simulations, one can expect that using PWC nUV with the proposed binning techniques can improve its performance in terms of the AUC score.

Appendix A Proof of Lemma 1

Proof.

Due to the orthogonality of , is a diagonal matrix of type with being equal to the cardinality of slice . Inverting this matrix inverts the elements in the diagonal, with . Finally, due to the construction of and the orthogonality of , one can readily see that in , is non-zero only if and fall in the same slice , and the value it takes is . Due to the special structure of ,

(21)

if , which is the mean of elements of in the slice implied by . ∎

Appendix B Proof of Proposition 1

Proof.

Most of the proofs in the paper are analogous: they expand the inner products in the expressions and simplify them by utilizing the special properties of the matrix highlighted in Lemma 1. Due to space limitations, these steps are carried out in full detail only in this proof.

According to subsection 3.3, the numerator and the denominator are evaluated separately. The numerator is expanded as

(22)

Evaluating the first term, utilizing Lemma 1 on the special properties of , and the assumptions on the white noise ( mean, finite variance), one gets

(23)

Similarly, and . For the denominator,

(24)

Appendix C Proof of Proposition 2

Proof.

First, we evaluate the numerator (): Expanding the inner product and carrying out the integration for (zero-mean white noise with finite variance ) leaves the following non-zero terms.

(25)

Due to the idempotence of , , thus,

(26)

Let . Carrying out the integration for ,

(27)

for the expectation of the numerator, where denotes the Frobenius inner product. Similarly for the denominator, utilizing the special properties of :