Supervised classification is a well-studied problem in the machine learning and statistics literature, and the k-nearest neighbour algorithm (kNN) is one of the most popular approaches. It amounts to assigning an unlabelled feature vector to the most common class label among its neighbouring feature vectors. One of the key issues in implementing this algorithm is choosing the number of neighbours $k$, and various flavours of cross validation are used for this purpose. However, a drawback of kNN is that it does not have a probabilistic interpretation: for example, no uncertainty is associated with the inferred class label.
Several recent papers have addressed this deficiency [Holmes02:PKNN; cuc:mar09; Manocha07:PKNN; fri:pet11]. Indeed, from such a Bayesian perspective the issue of choosing the value of $k$ can be viewed as a model (order) selection problem. Several different approaches exist to tackle the model selection problem. One of the most popular is based on information criteria, including the Akaike Information Criterion (AIC), Schwarz's Bayesian Information Criterion (BIC) and the Deviance Information Criterion (DIC) [Akaike74:AIC; Schwarz78:BIC; Spiegelhalter02:bayesianmeasures]. Given a particular model $\mathcal{M}$, the well-known AIC and BIC are defined by $\mathrm{AIC} = -2\log L + 2m$ and $\mathrm{BIC} = -2\log L + m\log n$ for $n$ observations, where $L$ and $m$ denote the maximised likelihood and the number of parameters of $\mathcal{M}$, respectively.
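As a small illustration of how these criteria are used to compare candidate models, the following sketch computes AIC and BIC from a maximised log-likelihood. The candidate names, log-likelihood values and sample size are hypothetical and chosen only for the example.

```python
import math

def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike Information Criterion: -2 log L + 2 m."""
    return -2.0 * log_likelihood + 2.0 * num_params

def bic(log_likelihood: float, num_params: int, num_obs: int) -> float:
    """Bayesian Information Criterion: -2 log L + m log n."""
    return -2.0 * log_likelihood + num_params * math.log(num_obs)

# Hypothetical (maximised log-likelihood, number of parameters) per candidate model.
candidates = {"M1": (-120.0, 2), "M2": (-112.0, 4), "M3": (-111.5, 7)}
n_obs = 100

scores = {name: (aic(ll, m), bic(ll, m, n_obs)) for name, (ll, m) in candidates.items()}
# Smaller values are preferred under both criteria.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

Both criteria return a single preferred model rather than a distribution over models, which is the limitation the remainder of the paper addresses.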
It is known that many fast functional approximations and information criterion techniques do not adequately approximate the underlying posterior distribution of the model order. Monte Carlo based estimators, on the other hand, can provide approximate distributions of the model order but typically require excessive computation time.
Our main contribution is a new functional approximation technique to infer the posterior distribution $p(k \mid \mathcal{Y})$ of the model order, where $k$ and $\mathcal{Y}$ denote the model order and the observations, respectively. In particular, this paper demonstrates the applicability of the proposed algorithm by addressing the problem of finding the number of neighbours $k$ for probabilistic k-Nearest Neighbour (PKNN) classification. In addition, we design a new symmetrised neighbouring structure for the kNN classifier in order to conduct a fair comparison. From an application point of view, we also classify several benchmark datasets and a few real experimental datasets using the proposed algorithms.
In addition to model selection, we also consider improvements to the kNN approach itself for the purpose of a fair comparison. Although conventional kNN based on Euclidean distance is widely used in many application domains, it is not a correct model in that it does not guarantee the symmetry of the neighbouring structure.
It is important to state that PKNN formally defines a Markov random field over the joint distribution of the class labels. In turn, this yields a complication from an inferential point of view, since it is well understood that the Markov random field corresponding to the likelihood of the class labels involves an intractable normalising constant, sometimes called the partition function in statistical physics, rendering exact calculation of the likelihood function almost always impossible. Inference for such complicated likelihood functions is an active field of research. In the context of PKNN, [Holmes02:PKNN] and [Manocha07:PKNN] use the pseudo-likelihood function [bes74] as an approximation to the true likelihood, while [cuc:mar09] and [fri:pet11] consider improvements to pseudo-likelihood by using a Monte Carlo auxiliary variable technique, the exchange algorithm [Murray06], which targets the posterior distribution involving the true intractable likelihood function. Bayesian model selection is generally a computationally demanding exercise, particularly in the current context, due to the intractability of the likelihood function. For this reason we use a pseudo-likelihood approximation throughout this paper, although our efforts are focused on efficient means to improve upon this aspect using composite likelihood approximations [var:reid:fir11].
The paper is organised as follows. Section 3 gives the background of the statistical approaches used in this paper and presents two main techniques: k-Nearest Neighbour (kNN) classification and the Integrated Nested Laplace Approximation (INLA). As an extended literature review, probabilistic kNN (PKNN) is explained in detail. The proposed algorithm is introduced in Section 4, where we describe a generic algorithm to reconstruct and approximate the underlying model order posterior and to search efficiently for the optimal model order, and then show how to apply the generic algorithm to PKNN. In Section 5, PKNN with the proposed algorithm is applied to several real datasets. Finally, we conclude the paper with some discussion in Sections 6 and 7.
2 Related Work
One of the main aims of this paper is to explore nearest neighbour classification from a model selection perspective. Some popular model selection approaches in the literature include the following. Grenander et al. [Grenander94:JumpProcess; Stephens00:ModelSelection] proposed a model selection algorithm based on jump-diffusion dynamics, with the essential feature that at random times the process jumps between parameter spaces in different models and different dimensions. Similarly, Markov birth-death processes and point processes can be considered. One of the most popular approaches to infer the posterior distribution and to explore model uncertainty is Reversible Jump Markov Chain Monte Carlo, as applied to mixture models by Richardson and Green [Richardson97:GMM]. The composite model approach of Carlin and Chib [Carlin95:BayesianModelSelection] is a further approach. The relationship between the choice of pseudo-prior in Carlin and Chib's product composite model and the choice of proposal densities in reversible jump is discussed by Godsill [Godsill01:ModelSelection].
In addition, there are many similarities with the clustering domain. For instance, many clustering algorithms, such as K-means, Gaussian Mixture Models (GMM) and spectral clustering, face the challenging problem of inferring the number of clusters, much as (P)KNN requires the number of neighbours to be estimated.
3 Statistical Background
3.1 k-Nearest Neighbour (kNN) model
In pattern recognition, the k-Nearest Neighbour algorithm (kNN) is one of the most well-known and useful non-parametric methods for classifying and clustering objects based on classified features which are close, in some sense, in the feature space. kNN is built on the concept that the label or class of an object is determined by a majority vote of its neighbours. However, alongside such a simple implementation, kNN is sensitive to locality, which gives rise to two difficult problems: estimating the decision boundary, whose complexity must be determined, and choosing the number of neighbours that vote. In order to address this problem, adaptive kNN has been proposed to calculate the number of neighbours and the boundary efficiently and effectively [Wang06:neighborhoodsize; Hand03:KNN; Domeniconi00adaptivemetric; Guo10:KNN]. In addition, the probabilistic kNN (PKNN) model, which is more robust than conventional kNN, has been introduced, with the number of neighbours estimated by Markov chain Monte Carlo [Holmes02:PKNN; Manocha07:PKNN]. In this paper, we use the PKNN model since it provides a proper likelihood term given a particular model with $k$ neighbours.
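The majority-vote rule can be sketched in a few lines. This is a minimal illustration under assumptions that are not from the paper: Euclidean distance, a toy two-class dataset, and ties between classes broken by whichever class is encountered first among the nearest neighbours.

```python
import math
from collections import Counter

def knn_predict(train_x, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # most_common breaks ties by first occurrence, i.e. by the nearer neighbour.
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["a", "a", "a", "b", "b", "b"]

label_near_a = knn_predict(train_x, train_labels, (0.15, 0.15), k=3)
label_near_b = knn_predict(train_x, train_labels, (0.95, 1.0), k=3)
```

The function returns only a label, with no measure of uncertainty, which is the gap that the probabilistic formulation below fills.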
Figure 1: (a) Given data; (b) Asymmetric PKNN (K=2); (c) Symmetrised Boltzmann model.
3.1.1 An asymmetric Pseudo-likelihood of PKNN
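Let $\mathcal{Y} = \{(z_i, \mathbf{y}_i)\}_{i=1}^{n}$, where each $z_i \in \mathcal{C}$ denotes the class label of the $d$-dimensional feature vector $\mathbf{y}_i \in \mathbb{R}^{d}$. Then the pseudo-likelihood of the probabilistic kNN (PKNN) proposed by [Holmes02:PKNN] can be formed as
$$p(z_1, \dots, z_n \mid \mathbf{y}_1, \dots, \mathbf{y}_n, \beta, k) \approx \prod_{i=1}^{n} \frac{\exp\left\{\frac{\beta}{k}\sum_{j \in ne(i)} \delta(z_i, z_j)\right\}}{\sum_{c \in \mathcal{C}} \exp\left\{\frac{\beta}{k}\sum_{j \in ne(i)} \delta(c, z_j)\right\}} \qquad (1)$$
where $\beta$ is the unknown scaling value, $\mathcal{C}$ is the set of classes, $k$ denotes the number of neighbours, and $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise. In this equation, $ne(i)$ represents the set of the $k$ nearest neighbours of the $i$-th data point.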
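To make the pseudo-likelihood concrete, the sketch below evaluates its logarithm, assuming the Holmes and Adams form in which the conditional probability of each label is proportional to exp((beta / k) times the number of its k nearest neighbours sharing that label). The helper names, the Euclidean distance and the toy data are assumptions made for illustration.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def log_pseudo_likelihood(points, labels, classes, beta, k):
    """Sum over i of log p(z_i | labels of the k nearest neighbours, beta, k)."""
    total = 0.0
    for i in range(len(points)):
        neigh = k_nearest(points, i, k)
        # (beta / k) * number of neighbours with label c, for every class c.
        score = {c: beta / k * sum(labels[j] == c for j in neigh) for c in classes}
        log_norm = math.log(sum(math.exp(s) for s in score.values()))
        total += score[labels[i]] - log_norm
    return total

# Hypothetical toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
classes = ["a", "b"]

flat = log_pseudo_likelihood(points, labels, classes, beta=0.0, k=2)
peaked = log_pseudo_likelihood(points, labels, classes, beta=2.0, k=2)
```

With `beta = 0` every class is equally likely for every point, so the value is `n * log(1 / |C|)`; increasing `beta` rewards agreement with the neighbouring labels.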
Suppose that we have four data points as shown in Fig. 1-(a). Given $k = 2$, the conventional PKNN yields the network structure shown in Fig. 1-(b), in which arrows point from each data point to its neighbours. As Fig. 1-(b) shows, some pairs of data points (nodes) are connected bidirectionally but others only unidirectionally, resulting in an asymmetric structure. Unfortunately, this asymmetry violates the Markov random field assumption that is implicitly applied in Eq. (1).
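The asymmetry is easy to reproduce. The sketch below builds the directed 2-nearest-neighbour graph for four hypothetical points (not the points of Fig. 1) and lists the edges that exist in one direction only.

```python
import math

def knn_graph(points, k):
    """Directed kNN graph: edges[i] is the set of the k nearest neighbours of i."""
    edges = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        edges[i] = set(others[:k])
    return edges

def one_way_edges(edges):
    """Pairs (i, j) where j is a neighbour of i but i is not a neighbour of j."""
    return sorted((i, j) for i in edges for j in edges[i] if i not in edges[j])

# Hypothetical points on a line; the last one is an outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0)]
graph = knn_graph(points, k=2)
asymmetric = one_way_edges(graph)
```

Here the outlying point 3 selects points 1 and 2 as neighbours, but neither selects it back, so two of the edges are unidirectional.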
3.1.2 A symmetrised Boltzmann modelling for pseudo-likelihood of PKNN
Since the pseudo-likelihood of the conventional probabilistic kNN is not symmetrised, an approximate symmetrised model, referred to below as Eq. (2), has been proposed for PKNN [Cucala_Marin_Robert_Titterington_2008].
The Boltzmann modelling of PKNN resolves the asymmetry problem which arises from the conventional PKNN of Eq. (1). However, it reconstructs the symmetrised network by averaging the asymmetric effects from the principal structure of PKNN, as shown in Fig. 1-(c). This produces different interaction rates among the edges: in the subgraph, two edges have a value of one half and all others have a value of one, and this difference may again yield an inaccurate Markov random field model.
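The half-weight edges can be reproduced with a short sketch. It assumes that the symmetrisation averages the two directions, so that each direction in which an edge exists contributes one half; the directed graph below is a hypothetical example rather than the one in Fig. 1.

```python
def averaged_weights(edges):
    """Weight of the undirected pair {i, j}: 0.5 per direction in which the edge exists."""
    weights = {}
    for i, neigh in edges.items():
        for j in neigh:
            pair = (min(i, j), max(i, j))
            weights[pair] = weights.get(pair, 0.0) + 0.5
    return weights

# Hypothetical directed 2-nearest-neighbour graph of four points.
edges = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {1, 2}}
weights = averaged_weights(edges)
```

Mutual neighbours end up with weight one, while the two one-way edges from point 3 receive weight one half, mirroring the unequal interaction rates described above.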
3.2 Estimation of PKNN by Markov chain Monte Carlo (MCMC) - a conventional way
The most popular approach to estimating the parameters of PKNN is Markov chain Monte Carlo (MCMC). In this paper, PKNN via MCMC is also used for performance comparison. In particular, there are two different versions of MCMC.
The first approach is to infer the unknown model parameters ($\beta$ and $k$) in the training step via MCMC. Afterwards, given these estimates, we can classify new data from the testing set straightforwardly using the conditional posterior of the new labels. Suppose that we need to reconstruct the target posterior $p(\beta, k \mid \mathcal{Y})$ given the observations $\mathcal{Y}$, which is a set of training data. The standard MCMC approach uses a Metropolis-Hastings (MH) algorithm, so that each unknown parameter is updated according to an acceptance probability of the usual MH form
$$A = \min\left\{1, \frac{p(\mathcal{Y} \mid \beta^{*}, k^{*})\, p(\beta^{*}, k^{*})\, q(\beta, k \mid \beta^{*}, k^{*})}{p(\mathcal{Y} \mid \beta, k)\, p(\beta, k)\, q(\beta^{*}, k^{*} \mid \beta, k)}\right\}$$
where $\beta^{*}$ and $k^{*}$ denote the proposed new parameters and $q$ is the proposal distribution. In the training step, we estimate $\beta$ and $k$ from the above MCMC simulation. Afterwards, we simply classify the testing datasets given the estimates $\hat{\beta}$ and $\hat{k}$. That is, given a testing set, we estimate the class of each new test datum by maximising the conditional posterior of its unknown label given $\hat{\beta}$ and $\hat{k}$. However, since the uncertainty of the model parameters is ignored in the testing step, this first approach with two separate steps (training and testing) is less preferred from a statistical point of view, although it is often used in practice.
Unlike the first approach, the second approach jointly estimates the hidden model parameters to incorporate this uncertainty while classifying the testing datasets. Here we reconstruct not the conditional distribution but a marginalised distribution by jointly estimating the parameters: the target density is no longer the posterior of the parameters given the training data alone but the joint posterior of the unknown test labels and the parameters. Each unknown is then updated according to a correspondingly modified acceptance probability.
In this paper, we use the second approach to infer the parameters and classify the data in the MCMC simulation used for comparison, since the joint estimation that yields the marginalised distribution accounts for the uncertainty even in the classification of the new dataset. We use a simple design for the proposal distributions, with fixed parameters for the Gamma distribution. Given this particular setting of the proposal distributions, the acceptance probability simplifies.
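A Metropolis-Hastings sampler of this kind can be sketched as follows. This is not the sampler used in the paper: it swaps the Gamma-based proposal for symmetric random-walk proposals (Gaussian for beta, plus or minus one for k), so that the proposal terms cancel, and it assumes flat priors on `0 < beta <= 10` and `1 <= k <= k_max`. The pseudo-likelihood helper, the step size and the toy data are likewise assumptions.

```python
import math
import random

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def log_pseudo_likelihood(points, labels, classes, beta, k):
    """Log pseudo-likelihood of the labels under the assumed PKNN form."""
    total = 0.0
    for i in range(len(points)):
        neigh = k_nearest(points, i, k)
        score = {c: beta / k * sum(labels[j] == c for j in neigh) for c in classes}
        total += score[labels[i]] - math.log(sum(math.exp(s) for s in score.values()))
    return total

def mh_sampler(points, labels, classes, k_max, n_iter, seed=0):
    """Random-walk Metropolis-Hastings over (beta, k) with flat priors (assumed)."""
    rng = random.Random(seed)
    beta, k = 1.0, 1
    current = log_pseudo_likelihood(points, labels, classes, beta, k)
    samples = []
    for _ in range(n_iter):
        beta_new = beta + rng.gauss(0.0, 0.5)
        k_new = k + rng.choice((-1, 1))
        # Proposals outside the prior support have zero posterior mass: reject.
        if 0.0 < beta_new <= 10.0 and 1 <= k_new <= k_max:
            proposed = log_pseudo_likelihood(points, labels, classes, beta_new, k_new)
            # Symmetric proposals: the acceptance ratio is the posterior ratio.
            if rng.random() < math.exp(min(0.0, proposed - current)):
                beta, k, current = beta_new, k_new, proposed
        samples.append((beta, k))
    return samples

# Hypothetical toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
classes = ["a", "b"]
samples = mh_sampler(points, labels, classes, k_max=4, n_iter=300, seed=1)
```

The empirical frequencies of `k` in `samples` approximate the posterior of the number of neighbours, at the cost of one pseudo-likelihood evaluation per iteration.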
3.3 Integrated Nested Laplace Approximation (INLA)
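Suppose that we have a set of hidden variables $\mathbf{x}$ and a set of observations $\mathbf{y}$. MCMC can of course be used to infer the marginal density $p(\theta \mid \mathbf{y})$, where $\theta$ is a set of control parameters. In order to build the target density efficiently, we apply a remarkably fast and accurate functional approximation based on the Integrated Nested Laplace Approximation (INLA) developed by [Rue09:INLA]. This algorithm approximates the marginal posterior by
$$\tilde{p}(\theta \mid \mathbf{y}) \propto \left.\frac{p(\mathbf{x}, \theta, \mathbf{y})}{\tilde{p}_{G}(\mathbf{x} \mid \theta, \mathbf{y})}\right|_{\mathbf{x} = \mathbf{x}^{*}(\theta)} \qquad (11)$$
Here, $\tilde{p}_{G}(\mathbf{x} \mid \theta, \mathbf{y})$ denotes a simple functional approximation close to $p(\mathbf{x} \mid \theta, \mathbf{y})$, such as a Gaussian approximation, and $\mathbf{x}^{*}(\theta)$ is the value at which the functional approximation is evaluated. For the simple Gaussian approximation case, the proper choice of $\mathbf{x}^{*}(\theta)$ is the mode of the Gaussian approximation of $p(\mathbf{x} \mid \theta, \mathbf{y})$. Given the log of the posterior, we can calculate the mode $\theta^{*}$ and its Hessian matrix $\mathbf{H}$ via quasi-Newton style optimisation by $\theta^{*} = \arg\max_{\theta} \log \tilde{p}(\theta \mid \mathbf{y})$ and $\mathbf{H} = -\partial^{2} \log \tilde{p}(\theta \mid \mathbf{y}) / \partial\theta\,\partial\theta^{\top}$ evaluated at $\theta = \theta^{*}$. Finally, we do a grid search from the mode in all directions until $\log \tilde{p}(\theta^{*} \mid \mathbf{y}) - \log \tilde{p}(\theta \mid \mathbf{y}) > \epsilon$, for a given threshold $\epsilon$.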
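The mode-finding and grid-exploration steps can be illustrated in one dimension. The sketch below is an assumption-laden stand-in: the log-posterior is a hypothetical skewed function, plain Newton iterations with finite-difference derivatives replace a quasi-Newton optimiser, and the step size and threshold are arbitrary.

```python
import math

def log_post(theta):
    """Hypothetical unnormalised log-posterior with mode at 0 and a skewed shape."""
    return theta - math.exp(theta)

def derivatives(f, x, h=1e-4):
    """Central finite-difference first and second derivatives of f at x."""
    first = (f(x + h) - f(x - h)) / (2 * h)
    second = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)
    return first, second

def find_mode(f, x0, n_steps=50):
    """Newton iterations on the log-posterior (stand-in for quasi-Newton)."""
    x = x0
    for _ in range(n_steps):
        first, second = derivatives(f, x)
        x -= first / second
    return x

def explore_grid(f, mode, step, threshold):
    """Step away from the mode in both directions until the log-density drop exceeds threshold."""
    grid = [mode]
    for direction in (-1.0, 1.0):
        x = mode + direction * step
        while f(mode) - f(x) <= threshold:
            grid.append(x)
            x += direction * step
    return sorted(grid)

mode = find_mode(log_post, x0=0.5)
_, curvature = derivatives(log_post, mode)   # second derivative at the mode
laplace_variance = -1.0 / curvature          # variance of the Gaussian approximation
grid = explore_grid(log_post, mode, step=0.5, threshold=2.5)
```

Because the example density is skewed, the grid extends further on one side of the mode than the other, which is the information a single Gaussian fit at the mode would miss.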
4 Proposed Approach
Our proposed algorithm estimates the underlying densities for the number of neighbours of probabilistic kNN classification by using Eq. (11). To distinguish it from other model selection approaches, we term this approach KOREA, an acronym for "K-ORder Estimation Algorithm" in a Bayesian framework.
4.1 Obtaining the optimal number of neighbours
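Let $\mathcal{Y}$ denote a set of observations and let $\theta_k$ be the set of model parameters given a model order $k$. The first step of KOREA is to estimate the optimal number of neighbours, $k^{*}$:
$$k^{*} = \arg\max_{k}\, p(k \mid \mathcal{Y})$$
According to Eq. (11), we can obtain an approximated marginal posterior distribution by
$$\tilde{p}(k \mid \mathcal{Y}) \propto \left.\frac{p(\mathcal{Y}, \theta_k, k)}{\tilde{p}_{G}(\theta_k \mid \mathcal{Y}, k)}\right|_{\theta_k = \theta_k^{*}(k)}$$
where $\tilde{p}_{G}(\theta_k \mid \mathcal{Y}, k)$ is a Gaussian approximation of $p(\theta_k \mid \mathcal{Y}, k)$ and $\theta_k^{*}(k)$ is its mode. This equation has the property that $k$ is an integer variable, while $\theta$ of Eq. (11) is in general a vector of continuous variables. By ignoring the difference, we can still use the quasi-Newton method to obtain the optimal $k^{*}$ efficiently. Alternatively, we can evaluate the potential candidates between $1$ and a maximum order $k_{\max}$ if $k_{\max}$ is not too large. Otherwise, we may still use the quasi-Newton style algorithm with a rounding operator which transforms a real value to an integer for $k$.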
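The candidate-enumeration variant of this step can be sketched as follows: for each order k the joint density is maximised over beta, divided by a Gaussian approximation evaluated at its own mode, and the results are normalised over k. The priors (Gamma(2, 1) on beta, uniform on k), the ternary-search optimiser, the search interval and the toy data are all assumptions made for this illustration rather than the paper's settings.

```python
import math

def neighbour_counts(points, labels, classes, k):
    """For each point, the number of its k nearest neighbours in each class."""
    counts = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        counts.append({c: sum(labels[j] == c for j in others[:k]) for c in classes})
    return counts

def log_joint(beta, k, counts, labels, k_max):
    """log pseudo-likelihood + log prior on beta + log prior on k (up to a constant)."""
    log_pl = 0.0
    for cnt, z in zip(counts, labels):
        log_norm = math.log(sum(math.exp(beta / k * m) for m in cnt.values()))
        log_pl += beta / k * cnt[z] - log_norm
    log_prior_beta = math.log(beta) - beta   # assumed Gamma(2, 1) prior on beta
    log_prior_k = -math.log(k_max)           # assumed uniform prior on 1..k_max
    return log_pl + log_prior_beta + log_prior_k

def argmax_concave(f, lo, hi, n_iter=100):
    """Ternary search for the maximiser of a concave function on [lo, hi]."""
    for _ in range(n_iter):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

def order_posterior(points, labels, classes, k_max):
    """Laplace-style approximation of p(k | Y) for k = 1..k_max."""
    log_post = []
    for k in range(1, k_max + 1):
        counts = neighbour_counts(points, labels, classes, k)
        f = lambda b: log_joint(b, k, counts, labels, k_max)
        mode = argmax_concave(f, 1e-6, 20.0)
        h = 1e-3
        curvature = (f(mode + h) - 2 * f(mode) + f(mode - h)) / (h * h)
        # Joint density at the mode divided by the Gaussian density at its own mode.
        log_post.append(f(mode) + 0.5 * math.log(2 * math.pi / -curvature))
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    return [w / sum(weights) for w in weights]

# Hypothetical toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
posterior = order_posterior(points, labels, ["a", "b"], k_max=5)
k_star = 1 + max(range(len(posterior)), key=posterior.__getitem__)
```

With the assumed Gamma prior the objective is strictly concave in beta, which is why a simple bracketing search suffices here; the list `posterior` is a full distribution over k, and `k_star` is its mode.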
4.2 Bayesian Model Selection for PKNN classification
In general, one of the most significant problems in classification is to infer the joint posterior distribution of the hidden classes of the different observations. However, jointly inferring the hidden variables is not straightforward. We therefore make the assumption that the hidden class of the $i$-th observation is independent of that of the $j$-th observation given the $i$-th observation, where $i \neq j$, and then we have the following simpler form (similar to Naive Bayes): the joint posterior factorises into the product of the posteriors of the individual hidden classes, each of which is estimated by Eq. (21).
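Under this independence assumption the joint posterior is a product of per-observation terms, so its logarithm is a sum and its maximiser is found one observation at a time. The marginal probabilities below are hypothetical numbers used only to show the mechanics.

```python
import math

def joint_log_posterior(marginals, labels):
    """Log of the product of the per-observation posteriors of the chosen labels."""
    return sum(math.log(m[z]) for m, z in zip(marginals, labels))

def joint_map(marginals):
    """With a factorised posterior, the joint MAP is the per-observation argmax."""
    return [max(m, key=m.get) for m in marginals]

# Hypothetical per-observation class posteriors for two test points.
marginals = [{"a": 0.9, "b": 0.1}, {"a": 0.3, "b": 0.7}]
best = joint_map(marginals)
best_log_prob = joint_log_posterior(marginals, best)
```

No search over joint label configurations is needed, which is what makes the factorised form tractable.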
4.2.1 PKNN via KOREA
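In the probabilistic kNN model (PKNN), let us define the new dataset, which is not yet labelled, and treat its unknown labels as hidden variables. Here we use $\mathbf{y}^{*}_{i}$ and $z^{*}_{i}$ for the $i$-th new observation and its hidden label. That is, $z^{*}_{i}$ is the hidden variable of interest given $\mathbf{y}^{*}_{i}$ and the training data $\mathcal{Y}$. The target posterior is obtained in a similar form to Eq. (13), by marginalising over the scaling value $\beta$ and the number of neighbours $k$, and it involves three distributions:
$p(z^{*}_{i} \mid \mathbf{y}^{*}_{i}, \mathcal{Y}, \beta, k)$: conditional likelihood
$p(\beta \mid \mathcal{Y}, k)$: posterior of $\beta$
$p(k \mid \mathcal{Y})$: posterior of $k$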
The first of the three is the conditional distribution of the hidden label. It is a likelihood function given the neighbouring structure; that is, it measures the fit between the assumed or given labels and the given full data.
The second is the posterior of $\beta$ given $k$, but we defer its estimation since it is obtained automatically when we estimate the last distribution. Therefore, we infer the last distribution first. It is the marginal posterior of $k$ and, using a similar approach to INLA, it is defined as the ratio of the joint density of the observations, $\beta$ and $k$ to a Gaussian approximation of the posterior of $\beta$ given $k$, evaluated at the mode of that approximation. The denominator is thus the approximation of the second distribution of interest, so we can reuse it.
We can also easily obtain the marginal posterior of $\beta$. Since this marginal posterior is approximated by summing the Gaussian approximations of the posterior of $\beta$ given $k$ over $k$, weighted by the posterior of $k$, we can simply reconstruct the distribution by reusing the previously estimated distributions.
Finally, we can obtain the target distribution of interest from the three distributions. Since the target distribution is estimated as a mixture distribution, we can also obtain its expectation and variance. An inverse Gamma distribution is used as a prior, and in this paper its hyper-parameters are set to yield an almost flat prior.
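The mixture over model orders can be illustrated with a short sketch. It assumes that, for a single test point, a class-probability vector is already available for each candidate k (for example evaluated at the mode of beta); the weights and probabilities below are hypothetical, and the variance shown is the between-model variance of each class probability, which is one possible reading of the mixture variance.

```python
def model_average(order_posterior, class_probs):
    """Mixture mean and between-model variance of the class probabilities."""
    classes = list(class_probs[0])
    mean = {c: sum(w * p[c] for w, p in zip(order_posterior, class_probs))
            for c in classes}
    var = {c: sum(w * (p[c] - mean[c]) ** 2 for w, p in zip(order_posterior, class_probs))
           for c in classes}
    return mean, var

# Hypothetical p(k | Y) for k = 1, 2, 3 and class probabilities under each k.
order_posterior = [0.5, 0.3, 0.2]
class_probs = [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8}]

mean, var = model_average(order_posterior, class_probs)
# MAP alternative: use only the single most probable order.
k_map = 1 + max(range(len(order_posterior)), key=order_posterior.__getitem__)
label_average = max(mean, key=mean.get)
label_map = max(class_probs[k_map - 1], key=class_probs[k_map - 1].get)
```

The averaged estimate uses every candidate order in proportion to its posterior weight, whereas the MAP estimate commits to one order.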
4.3 Additional Neighbouring Rules
4.3.1 A Boltzmann modelling with equal weights
In the conventional Boltzmann modelling of the neighbouring structure, the interaction rate is divided by a fixed $k$, as shown in Eq. (2). This results in each neighbour having its own different weight. Therefore, we apply an equal weight to the neighbours by varying $k$ according to the neighbouring structure. To build this strategy, we adopt three sequential steps: (i) obtain a neighbour structure in the same way as in conventional Boltzmann modelling; (ii) modify the structure by transforming it from a directed graph to an undirected graph, that is, if $j \in ne(i)$ but $i \notin ne(j)$ then we add $i$ into $ne(j)$ for $i \neq j$; and (iii) apply the pseudo-likelihood for the likelihood. In this paper, we refer to this modelling as Boltzmann modelling with equal weights.
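Step (ii) amounts to taking the union of each directed edge with its reverse. A minimal sketch, using a hypothetical directed 2-nearest-neighbour graph and an assumed dictionary-of-sets representation:

```python
def symmetrise(edges):
    """Make the neighbour relation undirected: if j is in ne(i), add i to ne(j)."""
    undirected = {i: set(neigh) for i, neigh in edges.items()}
    for i, neigh in edges.items():
        for j in neigh:
            undirected[j].add(i)
    return undirected

# Hypothetical directed 2-nearest-neighbour graph of four points.
edges = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {1, 2}}
sym = symmetrise(edges)
sizes = {i: len(neigh) for i, neigh in sym.items()}
```

After symmetrisation the neighbourhood sizes differ between points (`sizes` is no longer constant), which is why the divisor has to vary with the neighbouring structure for every edge to carry an equal weight.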
5 Simulation Results
The performance of our algorithm is tested with a collection of benchmark datasets. All of the datasets (test and training) used in this paper can be found at http://mathsci.ucd.ie/~nial/dnn/. The six well-known benchmark datasets are presented in Table 1.
Table 1: Benchmark datasets (columns: name of data, d).
We test the performance using fold-based cross validation for a fair comparison across all approaches, although our proposed approach does not require it owing to its Bayesian nature.
Figure 2: Reconstructed densities of a testing datum for (a) Crabs, (b) Fglass, (c) Meat, (d) Oliveoil, (e) wine and (f) Iris.
Figure 2 shows the reconstructed densities of a testing datum. The top subgraphs show the two-dimensional densities over $\beta$ and $k$, while the bottom subgraphs show the one-dimensional densities over $k$, for all datasets. The graphs illustrate that the distribution is not unimodal but a complex multi-modal distribution. This also suggests that selecting an appropriate number of neighbours for PKNN is critical to obtaining high accuracy.
Asymptotically, MCMC with a large number of iterations will converge and can therefore be used in principle to estimate the underlying posterior density. Thus, to validate our proposed algorithm, we can check whether the density reconstructed by KOREA is close to that estimated by MCMC with a very large number of iterations. The two subgraphs of Figure 3 visualise the similarity between the posterior densities of a testing datum of the wine dataset reconstructed by KOREA (red circle line) and by MCMC (blue cross line) with a small (top) and a large (bottom) number of samples. (For MCMC, we set the sample size to 100 for the small case and 10000 for the large case.) As the figures show, the density from our proposed algorithm KOREA closely approximates that of MCMC with a large number of iterations, which is commonly regarded as the underlying reference or pseudo-ground-truth density.
In order to measure the similarity between the densities reconstructed by MCMC and KOREA, we use four different metrics, as shown in Figure 4: Root Mean Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), Kullback-Leibler Distance (KLD) and Structural SIMilarity (SSIM) [Wang04imagequality]. As in the case of Figure 3, MCMC with a large sample size produces densities very close to those produced by our proposed KOREA algorithm. As the number of MCMC samples increases, RMSE and KLD decrease while PSNR and SSIM increase for all datasets.
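Three of these metrics are simple to compute for densities evaluated on a common grid. The sketch below assumes discrete densities that sum to one, a peak value of 1 for PSNR, and the KLD direction sum p log(p / q) with p as the reference; the paper does not state these conventions, and SSIM is omitted because it requires windowed statistics. The example densities are hypothetical.

```python
import math

def rmse(p, q):
    """Root mean square error between two densities on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))

def psnr(p, q, peak=1.0):
    """Peak signal-to-noise ratio in dB; undefined when the densities are identical."""
    return 20.0 * math.log10(peak / rmse(p, q))

def kld(p, q):
    """Kullback-Leibler divergence sum p log(p / q); terms with p = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical reference density and two estimates of decreasing error.
reference = [0.1, 0.6, 0.3]
rough = [0.2, 0.5, 0.3]
closer = [0.12, 0.58, 0.3]
```

As the estimate approaches the reference, `rmse` and `kld` fall while `psnr` rises, matching the trend reported for increasing MCMC sample sizes.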
Table 2 shows the performance of each algorithm based on the F-measure for four cases: kNN, PKNN, KOREA (average) and KOREA (optimal). Since MCMC produces results which are very close to those of KOREA, as shown in Figures 3 and 4, we do not present these results. KOREA (average) and KOREA (optimal) represent the mean (marginalised) estimate and the MAP estimate of KOREA, respectively. As the table shows, KOREA outperforms the other conventional approaches on all datasets. The best-performing results are highlighted in bold in the table.
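For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it per class and averages over classes (a macro average); the averaging scheme is an assumption, since the paper does not say how multi-class scores were combined, and the labels are hypothetical.

```python
def f_measure(true_labels, predicted, positive):
    """Harmonic mean of precision and recall for one class treated as positive."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f_measure(true_labels, predicted):
    """Unweighted mean of the per-class F-measures (assumed averaging scheme)."""
    classes = sorted(set(true_labels))
    return sum(f_measure(true_labels, predicted, c) for c in classes) / len(classes)

# Hypothetical true and predicted labels.
true_labels = ["a", "a", "a", "b", "b"]
predicted = ["a", "a", "b", "b", "a"]
score = macro_f_measure(true_labels, predicted)
```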
In addition, we compared the simulation times of the algorithms. Table 3 shows the execution time for all algorithms. Our proposed algorithm (PKNN with KOREA) is slower than conventional kNN and PKNN with fixed $k$, but it is much faster than the MCMC technique, which is regarded as one of the best approaches to infer the model parameters and the number of neighbours in a Bayesian framework. Considering the accuracy in Table 2 and the execution time in Table 3, we find that PKNN can be efficiently improved by using our proposed KOREA algorithm, which is a practically useful technique compared to conventional approaches including kNN, PKNN and MCMC.
6 Discussion
Our proposed algorithm uses an approach similar to the idea of INLA, replacing the model parameters with the model order (the number of neighbours, $k$). This means that, as in the original INLA, we can speed up the computation by embedding a (quasi-)Newton method for the Laplace approximation rather than relying on grid sampling. However, as Fig. 2 shows, the posterior is not unimodal, so a simple Laplace approximation may find a local rather than the global optimum for the maximal mode of the posterior. Therefore, instead of the (quasi-)Newton methods employed in the original INLA, we reconstructed the density with a relatively slower grid approach for the real datasets in the PKNN experiments of this paper. Of course, if the distribution is unimodal, the quasi-Newton method can be used to speed up the algorithm.
7 Conclusion
We proposed a model selection algorithm for probabilistic k-nearest neighbour (PKNN) classification which is based on functional approximation in a Bayesian framework. This algorithm has several advantages compared to other conventional model selection techniques. First of all, the proposed approach can quickly provide a proper distribution of the model order, which is not given by other approaches, in contrast to time-consuming techniques like MCMC. In addition, since the proposed algorithm is based on a Bayesian scheme, we do not need to run cross validation, which is usually used for performance evaluation. The proposed algorithm also inherits the power of the fast functional approximation of INLA. For instance, it can quickly find the optimal number of neighbours and efficiently generate the grid samples by embedding a quasi-Newton method if the posterior is unimodal. Lastly, the proposed approach can calculate the model average, that is, the posterior marginalised over the model order. We also remark that our algorithm is based on a pseudo-likelihood approximation of the likelihood and suggest that, although our algorithm has yielded good performance, further improvements may result from utilising more accurate approximations of the likelihood, albeit at the expense of computational run time.
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency). Nial Friel's research was supported by a Science Foundation Ireland Research Frontiers Program grant, 09/RFP/MTH2199.
- (1) C. C. Holmes, N. M. Adams, A probabilistic nearest neighbour method for statistical pattern recognition, Journal of the Royal Statistical Society - Series B: Statistical Methodology 64 (2) (2002).
- (2) L. Cucala, J.-M. Marin, C. P. Robert, D. M. Titterington, A bayesian reassessment of nearest-neighbor classification, Journal of the American Statistical Association 104 (2009) 263–273.
- (3) S. Manocha, M. Girolami, An empirical analysis of the probabilistic k-nearest neighbour classifier, Pattern Recognition Letters 28 (13) (2007) 1818–1824.
- (4) N. Friel, A. N. Pettitt, Classification via distance nearest neighbours, Statistics and Computing 21 (2011) 431–437.
- (5) H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (1974) 716–723.
- (6) G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (2) (1978) 461–464.
- (7) D. J. Spiegelhalter, N. G. Best, B. P. Carlin, Bayesian measures of model complexity and fit, Journal of the Royal Statistical Society, Series B 64 (2002) 583–639.
- (8) J. E. Besag, Spatial interaction and the statistical analysis of lattice systems (with discussion), Journal of the Royal Statistical Society, Series B 36 (1974) 192–236.
- (9) I. Murray, Z. Ghahramani, D. MacKay, MCMC for doubly-intractable distributions, in: Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), AUAI Press, Arlington, Virginia, 2006.
- (10) C. Varin, N. Reid, D. Firth, An overview of composite likelihood methods, Statistica Sinica 21 (2011) 5–42.
- (11) U. Grenander, M. I. Miller, Representations of knowledge in complex systems, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 56 (4) (1994) 549–603.
- (12) M. Stephens, Bayesian Analysis of Mixture Models with an Unknown number of components – an alternative to reversible jump methods, The Annals of Statistics 28 (1) (2000) 40–74.
- (13) S. Richardson, P. J. Green, On bayesian analysis of mixtures with an unknown number of components, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (4) (1997) 731–792.
- (14) B. P. Carlin, S. Chib, Bayesian model choice via markov chain monte carlo methods, Journal of the Royal Statistical Society, Series B 57 (1995) 473–484.
- (15) S. J. Godsill, On the relationship between markov chain monte carlo methods for model uncertainty, Journal of Computational and Graphical Statistics 10 (2001) 230–248.
- (16) J. Wang, P. Neskovic, L. N. Cooper, Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence, Pattern Recognition 39.
- (17) D. J. Hand, V. Vinciotti, Choosing k for two-class nearest neighbour classifiers with unbalanced classes, Pattern Recognition Letters 24 (9–10) (2003) 1555–1562.
- (18) C. Domeniconi, J. Peng, D. Gunopulos, Adaptive metric nearest neighbor classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2000) 1281–1285.
- (19) R. Guo, S. Chakraborty, Bayesian adaptive nearest neighbor, Statistical Analysis and Data Mining 3 (2010) 92–105.
- (20) L. Cucala, J.-M. Marin, C. Robert, M. Titterington, A Bayesian Reassessment of Nearest-Neighbor Classification, Journal of the American Statistical Association 104 (485) (2008) 263–273.
- (21) H. Rue, S. Martino, N. Chopin, Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations, Journal of the Royal Statistical Society, Series B 71 (2) (2009) 319–392.
- (22) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612.