I Introduction
Bayesian classification is a naturally probabilistic method that performs classification based on the predicted class membership probabilities, i.e., the probability that a given sample belongs to each class [1]. With the output membership probabilities, a Bayesian classifier provides a degree of confidence for its decision, which is more informative than a bare assertion of a class label. The class membership probabilities in Bayesian classification are estimated via Bayes' theorem, which transforms the estimation of a class membership probability into the estimation of the prior probability and the corresponding conditional probability.
In Bayesian classification, the most important step is to estimate the conditional probability for each class; in multivariate cases, this usually amounts to a joint probability estimation using the samples of a given class. Research efforts have been made to estimate the multivariate joint Probability Density Function (PDF) for Bayesian classification. Naive Bayesian classifiers (NBC) [2, 3] adopt the class conditional independence assumption to factor the multivariate joint conditional PDF into a product of univariate conditional PDFs, which greatly reduces the computation of the joint PDF. However, the effectiveness of NBC relies heavily on the Class Conditional Independence Assumption (CCIA), which is rarely true in real-world applications. Bayesian belief networks [4] were proposed to represent dependence among the features, but training an unconstrained belief network is computationally intensive; it has been shown that probabilistic inference using Bayesian belief networks is NP-hard [5, 6]. Usually, a weaker conditional independence is assumed as a compromise, training a belief network whose complexity lies between naive Bayes and an unconstrained belief network, e.g., Tree Augmented Naive Bayes (TAN) [4] and Aggregating One-Dependence Estimators (AODE) [7]. In addition, the non-naive Bayesian classifier (NNBC) [8] was proposed to estimate the multivariate joint PDF directly using multivariate Kernel Density Estimation (KDE) with an optimal bandwidth.
Most of the previous methods try to establish a Global Probabilistic Model (GPM) over the whole sample space for the joint probability distribution estimation. However, real-world problems are usually too complex for an effective GPM over the whole sample space, or the corresponding GPM is too complex for efficient classification; some fundamental assumptions are then required to simplify the GPM, for example, the CCIA in NBC. In fact, for the classification of a particular sample, a GPM is unnecessary: a Local Probabilistic Model (LPM) that models the local distribution around the query sample is sufficient for estimating the probability distribution at the exact query point.
In addition, a Bayesian classifier based on an LPM is a generalized model that can be specialized to a number of existing classification methods, e.g., k-nearest neighbors (kNN) and NBC. We propose a unified form for these local classification methods, where different methods correspond to different parameter settings. By tuning the parameters, we can establish a fitting model for a particular classification problem. To the best of our knowledge, this is the first report of such a generalization of local classification methods.
There are several obvious advantages of using an LPM for Bayesian classification, as follows:

An LPM established for a local region is expected to be much simpler and can relax the fundamental assumptions that may not hold in the whole sample space;

Bayesian classification based on an LPM is a generalization of several existing classification models. A selective LPM can flexibly handle problems of various complexities by tuning the size of the local region and the corresponding local model assumption.
II Preliminaries and Related Work
II-A Bayesian Decision for Classification
Given a query sample $x$ with known feature description and unknown class label $y \in \mathcal{C}$, where $\mathcal{C} = \{c_1, c_2, \dots, c_M\}$ is a finite set of possible class labels, a Bayesian classifier estimates the posterior probability $P(c_i \mid x)$ of each class $c_i$, and predicts the best class label $c^*$ for $x$ using the optimal decision rule that minimizes the conditional risk [12]:

$$c^* = \arg\min_{c_i \in \mathcal{C}} \sum_{j=1}^{M} \lambda(c_i, c_j)\, P(c_j \mid x) \tag{1}$$

where $\lambda(c_i, c_j)$ is the loss function (LF) that states the cost of assigning label $c_i$ when the true class label is $c_j$. In classification problems, the zero-one loss function (0-1 LF) is usually assumed if the true LF is unknown. The 0-1 LF assigns no loss to a correct classification decision and a uniform unit loss to an incorrect decision:

$$\lambda(c_i, c_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \tag{2}$$

With the 0-1 LF, the predicted class simplifies to

$$c^* = \arg\max_{c_i \in \mathcal{C}} P(c_i \mid x) \tag{3}$$

$P(c_i \mid x)$ is the posterior probability of class $c_i$ given the sample $x$, and can be calculated by Bayes' theorem as

$$P(c_i \mid x) = \frac{P(c_i)\, p(x \mid c_i)}{p(x)} \tag{4}$$

As $p(x)$ is constant across classes, the best class label by the Bayesian rule is

$$c^* = \arg\max_{c_i \in \mathcal{C}} P(c_i)\, p(x \mid c_i) \tag{5}$$
Thus, two terms should be estimated from the training set. Let the training set be $D = \{(x_j, y_j)\}_{j=1}^{N}$; each sample has $d$ descriptors, i.e., $x_j = (x_{j1}, x_{j2}, \dots, x_{jd})$. $P(c_i)$ is usually estimated by the corresponding frequency as

$$\hat{P}(c_i) = \frac{|D_{c_i}|}{|D|} \tag{6}$$

where $D_{c_i}$ is the subset of $D$ whose samples all have class label $c_i$. $p(x \mid c_i)$ is the likelihood of sample $x$ with respect to class $c_i$ and can usually be estimated as a multivariate joint probability distribution of class $c_i$ at $x$ using the training samples in $D_{c_i}$. If discrete features exist among the descriptors, the estimation can be transformed into a conditional joint PDF estimation over the continuous variables by

$$p(x \mid c_i) = P(x^{(d)} \mid c_i)\, p(x^{(c)} \mid x^{(d)}, c_i) \tag{7}$$

where $x^{(c)}$ and $x^{(d)}$ are the continuous and discrete components of $x$, respectively. Thus, in this paper, we mainly focus on multivariate joint probability density estimation over continuous variables for Bayesian classification.
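As a concrete illustration of the decision rule in Equation 5, the following minimal sketch applies the Bayes rule with user-supplied priors and class-conditional densities. The two 1-D Gaussian class-conditional densities and the class names are illustrative assumptions, not part of the paper's method:

```python
import numpy as np

def bayes_predict(x, priors, likelihoods):
    """Bayes rule (Eq. 5): choose the class maximizing P(c_i) * p(x | c_i).

    priors: dict class -> P(c_i); likelihoods: dict class -> callable p(x | c_i).
    """
    scores = {c: priors[c] * likelihoods[c](x) for c in priors}
    return max(scores, key=scores.get)

# Toy 1-D example with two Gaussian class-conditional densities (illustrative).
def gaussian_pdf(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {"a": 0.5, "b": 0.5}
likelihoods = {"a": gaussian_pdf(0.0, 1.0), "b": gaussian_pdf(4.0, 1.0)}
print(bayes_predict(1.0, priors, likelihoods))  # → a
```

Any density estimator discussed below (global or local) can be plugged in as the `likelihoods` callables without changing the decision rule.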
II-B Related Work
Literature on using local probabilistic models for density estimation can be found in [13, 14, 15]; however, those methods mainly focus on univariate density estimation with large sample sizes and suffer from the curse of dimensionality when applied to classification problems with multiple features. The idea of localization for classification can also be found elsewhere. The k-nearest neighbor (kNN) methods, which classify a query sample by a majority vote among its nearest neighbors, are perhaps the earliest local methods for classification [16, 17]. Research on Support Vector Machines (SVM) constructed in a local region can be found in [18, 19, 20, 21]. Some Bayesian classifiers also train a series of local models based on a subset of features, including NBTree [22] and Lazy Bayesian Rules (LBR) [23]. Research on building a naive Bayesian classifier in a neighborhood includes [24, 25, 26]. However, limited research exists on Bayesian classification that uses local probabilistic models for likelihood estimation; indeed, several local classification methods can be obtained as specializations of LPM-based Bayesian classification (see Section IV-B).

III Local Probabilistic Model
For a probability density estimation problem, a simple parametric model cannot always describe the complex distributions found in the real world, while a nonparametric model usually requires a large number of samples for an effective estimation, especially in high-dimensional cases. It can be expected that the probability distribution in a local region is not so complex and can be estimated through a simple parametric model. Thus, probability density estimation through an LPM should be a feasible solution. In this section, we introduce the concept of the local probability distribution and the corresponding local probabilistic model.
III-A Local Probability Distribution
For a random variable $X$ (it can be a single variable for a univariate distribution or a multivariate variable for a multivariate joint distribution, and likewise hereafter), we use the local probability distribution to describe the probability distribution in a subset of the sample space. To facilitate understanding, we make the following definition.

Definition 1 (Local Probability). Suppose $A$ and $L$ are two subsets of the sample space of a random variable $X$; the local probability of $A$ in the local region $L$ is defined as the conditional probability of $X \in A$ given that $X \in L$, denoted as

$$P_L(A) = P(X \in A \mid X \in L) \tag{8}$$
For a continuous random variable, the local probability density is defined analogously.

Definition 2 (Local Probability Density (LPD)). In the sample space of a continuous random variable $X$, given a continuous closed region $L$, for an arbitrary point $x \in L$, the local probability density in $L$ is defined as

$$f_L(x) = \lim_{V(\delta(x)) \to 0} \frac{P(X \in \delta(x) \mid X \in L)}{V(\delta(x))} \tag{9}$$

where $\delta(x)$ and $V(\delta(x))$ respectively denote a neighborhood of $x$ and its volume ($V(\cdot)$ denotes the size of a region: volume in three-dimensional cases, area in two-dimensional cases, and hypervolume in higher-dimensional cases).
Similar to the global probability density over the whole sample space, the LPD describes the relative likelihood of a random variable taking on a certain point given that it lies in a certain local region. If the local region is extended to the whole sample space, the local distribution becomes the general global distribution, and the LP/LPD reduce to the conventional probability/probability density. Like the conventional probability density, the LPD satisfies non-negativity and unitarity in the corresponding local region.
Proposition 1. If $f_L(x)$ denotes the local probability density of a continuous random variable $X$ in a continuous closed region $L$, then $f_L$ has the following two properties.

(I) Non-negativity:

$$f_L(x) \geq 0, \quad \forall x \in L \tag{10}$$

(II) Unitarity:

$$\int_L f_L(x)\, \mathrm{d}x = 1 \tag{11}$$
The following proposition describes the relationship between the general global PDF and the LPD.

Proposition 2. In the sample space of a continuous random variable $X$, if a continuous closed region $L$ has the prior probability $P(L) = P(X \in L)$, and the local probability density function for region $L$ is $f_L(x)$, then the global probability density at a point $x \in L$ is

$$f(x) = P(L)\, f_L(x) \tag{12}$$
Proposition 2 provides a method to estimate the global density from the LPD, which is the basic idea of our method. Given an arbitrary point $x$ in the sample space, we consider a local region $L$ containing $x$; if the prior probability $P(L)$ is known or can be estimated, the estimation of the global density $f(x)$ is transformed into the estimation of the LPD $f_L(x)$.
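Proposition 2 can be checked numerically. The sketch below is a Monte Carlo verification under an assumed standard normal distribution, with the illustrative local region $L = [0, 1)$: it estimates $P(L)$ and the LPD at a point, and compares their product with the true global density:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)   # assumed standard normal population

# Local region L = [0, 1) and query point x = 0.5 (illustrative choices).
lo, hi = 0.0, 1.0
in_L = samples[(samples >= lo) & (samples < hi)]
p_L = in_L.size / samples.size                        # estimate of P(L)

# LPD at x: conditional density estimated from a narrow bin inside L.
x, half = 0.5, 0.05
f_L = ((np.abs(in_L - x) < half).sum() / in_L.size) / (2 * half)

f_global = p_L * f_L                                  # Proposition 2: f(x) = P(L) f_L(x)
true_pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # true N(0,1) density at 0.5
print(f_global, true_pdf)
```

With enough samples, the product $\hat{P}(L)\,\hat{f}_L(x)$ closely tracks the true density, even though $P(L)$ and $f_L(x)$ are estimated separately.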
III-B Local Model Assumption
The estimation of the LPD is based on the samples falling in the corresponding local region; it is similar to the estimation of the PDF over the whole sample space, except that the local distribution is supposed to be much simpler. Thus, we can assume a simple probabilistic model in the local region for the LPD estimation. Suppose we have $N$ observations in total, among which $k$ samples fall in the local region $L$, denoted by $D_L = \{x_1, x_2, \dots, x_k\}$; the following local model assumptions can be taken for the LPD estimation.
III-B1 Locally Uniform Assumption
The Locally Uniform Assumption (LUA) assumes a uniform distribution in the local region. Thus, according to the unitarity of the LPD in region $L$ (Formula 11), the LPD under the LUA is estimated as

$$\hat{f}_L(x) = \frac{1}{V(L)} \tag{13}$$

With the prior probability of $L$ estimated by $\hat{P}(L) = k/N$, the PDF can be estimated as $\hat{f}(x) = \frac{k}{N\, V(L)}$, in agreement with the histogram estimator. The histogram estimator achieves an accurate estimation when $N \to \infty$ and $V(L) \to 0$. This can also be interpreted through the LUA in a local region: $V(L) \to 0$ guarantees an accurate estimation of $f_L(x)$, and $\hat{P}(L)$ is more accurate with a larger number of samples.
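A minimal sketch of this LUA-based estimator $\hat{f}(x) = k/(N\,V(L))$, using an assumed hypercubical (Chebyshev-ball) local region; the 2-D standard normal test data are illustrative:

```python
import numpy as np

def lua_density(x, X, r):
    """LUA/histogram-style estimate: f(x) ≈ k / (N * V(L)), where L is the
    hypercube (Chebyshev ball) of half-width r centered at x."""
    N, d = X.shape
    k = np.sum(np.max(np.abs(X - x), axis=1) <= r)  # samples falling in L
    volume = (2 * r) ** d                           # hypercube volume V(L)
    return k / (N * volume)

rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 2))               # assumed 2-D standard normal
print(lua_density(np.zeros(2), X, r=0.2))           # true density at origin ≈ 0.159
```

Shrinking `r` reduces the bias from the uniformity assumption but raises the variance of `k`, which is exactly the trade-off discussed next.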
However, if the sample size is not large enough, a large $k$ cannot coexist with a small local region in practice. To achieve a more accurate estimate of $P(L)$, the region should be somewhat large and thus cannot always be regarded as uniform; reasonably, a more complex distribution should then be assumed.
III-B2 Locally Parametric Assumption
If the local region is not small enough for the simple LUA, we may assume a somewhat more complex parametric model for the local distribution, termed the Locally Parametric Assumption (LPA). The LPA assumes a parametric model $g(x; \theta)$ can model the distribution in the local region $L$ with an unknown parameter vector $\theta$. An estimate $\hat{\theta}$ can be obtained from the samples in $L$. Note that the LPD should obey the unitarity in $L$ (Formula 11). Thus, the LPD in $L$ should be normalized as

$$\hat{f}_L(x) = \frac{g(x; \hat{\theta})}{\int_L g(t; \hat{\theta})\, \mathrm{d}t} \tag{14}$$

Specifically, we can assume the samples in region $L$ follow a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ as

$$g(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{\mathsf T} \Sigma^{-1} (x-\mu)\right) \tag{15}$$

and the unknown parameters, the mean $\mu$ and/or the covariance matrix $\Sigma$, can be estimated as

$$\hat{\mu} = \frac{1}{k} \sum_{j=1}^{k} x_j \tag{16}$$

$$\hat{\Sigma} = \frac{1}{k} \sum_{j=1}^{k} (x_j - \hat{\mu})(x_j - \hat{\mu})^{\mathsf T} \tag{17}$$
We term this assumption the Locally Gaussian Assumption (LGA).
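A minimal sketch of the LGA estimate under Formulas 15-17. For illustration only, it evaluates the fitted Gaussian directly and omits the normalization over $L$ in Formula 14; the test data are an assumed 2-D standard normal sample:

```python
import numpy as np

def lga_density(x, X_local):
    """LGA estimate: fit a Gaussian (Eqs. 16-17) to the local samples and
    evaluate Eq. 15 at x; the normalization over L (Eq. 14) is omitted here."""
    k, d = X_local.shape
    mu = X_local.mean(axis=0)                       # Eq. 16: local mean
    diff = X_local - mu
    cov = diff.T @ diff / k                         # Eq. 17: local covariance (MLE)
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    z = x - mu
    return np.exp(-0.5 * z @ inv @ z) / np.sqrt((2 * np.pi) ** d * det)

rng = np.random.default_rng(2)
X_local = rng.standard_normal((50_000, 2))          # assumed local sample
print(lga_density(np.zeros(2), X_local))            # ≈ 1/(2*pi) ≈ 0.159
```

In a full implementation, dividing by the integral of the fitted Gaussian over $L$ would restore the unitarity required by Formula 11.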
III-B3 Locally Complex Assumption
Although the local distribution can be supposed to be simpler than the global distribution, if the original global distribution is too complex and the local region is not small enough for the LUA or LPA, we may take the Locally Complex Assumption (LCA) and employ a nonparametric method for the LPD estimation. Similarly, KDE can be employed for the LPD estimation with an assigned bandwidth vector $h$. KDE estimates a global distribution from the samples in the local region $L$ as

$$\hat{g}(x) = \frac{1}{k \operatorname{prod}(h)} \sum_{j=1}^{k} K\big((x - x_j) \oslash h\big) \tag{18}$$

where the operator $\oslash$ denotes element-wise division between two equal-sized matrices or vectors; $\operatorname{prod}(\cdot)$ returns the product of all elements of a vector; and $K(\cdot)$ is the kernel function, of which the Gaussian kernel is usually used:

$$K(u) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2} u^{\mathsf T} u\right) \tag{19}$$

where $d$ is the dimensionality of $x$.

The estimated distribution is then normalized to the region $L$ for the LPD estimation as

$$\hat{f}_L(x) = \frac{\hat{g}(x)}{\int_L \hat{g}(t)\, \mathrm{d}t} \tag{20}$$
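A minimal sketch of the LCA estimate under Formulas 18-19, again omitting the normalization over $L$ (Formula 20) for brevity; the bandwidth value and the 2-D standard normal sample are illustrative assumptions:

```python
import numpy as np

def lca_density(x, X_local, h):
    """LCA estimate: Gaussian-kernel KDE (Eqs. 18-19) over the local samples;
    the normalization over L (Eq. 20) is omitted for brevity."""
    k, d = X_local.shape
    u = (x - X_local) / h                           # element-wise division by bandwidths
    kernels = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2 * np.pi) ** (d / 2)
    return kernels.sum() / (k * np.prod(h))

rng = np.random.default_rng(3)
X_local = rng.standard_normal((20_000, 2))          # assumed local sample
print(lca_density(np.zeros(2), X_local, np.array([0.3, 0.3])))
```

Each local sample contributes one kernel, so the LCA is effectively a $k$-component mixture, which is why Section III-C ranks it as the most complex of the three LPMs.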
III-C Analysis
The complexity of a model is usually described by the number of effective parameters [27]. Given a certain local region, the LUA has no parameters to estimate; it is the simplest LPM. The LPA assumes a single parametric model whose intrinsic parameters must be estimated. The LCA virtually assumes a mixture of $k$ single parametric models, so its complexity is the sum of those of the single models. Therefore, in terms of complexity, LUA $<$ LPA $<$ LCA.
The local distribution should be much simpler than the global distribution and can be modeled easily. This simplicity is reflected in two aspects: (1) the relationship between features in a local region can be simpler than that in a larger region containing it; and (2) for a single feature, the local model is no more complex than the global model.
The relationship between features can be described by the following theorem.
Theorem 1.
If a set of random variables is independent in a region $R$, then these variables are independent in an arbitrary subregion $L \subseteq R$ in which the variables on the boundary of $L$ are independent.
Theorem 1 indicates that feature independence in a local region is a weaker constraint than independence in the whole sample space. An independence assumption that does not hold in the whole sample space may still be applicable in a local region. Thus, owing to this more plausible local independence, the joint PDF estimation can be simplified in a naive Bayesian manner within a local region. In effect, the independence assumption reduces the number of parameters of a multivariate probabilistic model by setting the correlations between features to zero.
Additionally, for a single variable, the local model cannot be more complex, and is usually simpler, than the original global model. For example, in Figure 1(a) the original global distribution is a uniform mixture distribution model with 5 parameters; in the corresponding local region it is a piecewise uniform distribution, simpler than the mixture. In Figure 1(b) the original global distribution is a Gaussian mixture distribution, also with 5 parameters; the local distribution can be approximately uniform (no parameters), linear (one parameter) or Gaussian (two parameters), all simpler than the Gaussian mixture.
From Equation 12, the effectiveness of the PDF estimation depends on the estimation of the LPD and of the prior probability of the corresponding local region, so it is very important to select the local region and the local model assumption well. If the local region is too small, the prior estimate $\hat{P}(L)$ has a larger relative error; in the extreme case there are no samples in the local region, the prior estimate is 0, and the PDF estimation is ineffective. Conversely, if we choose a large local region, the estimation of $P(L)$ is more accurate, but the LPM may be too complex for an effective estimate of $f_L(x)$; when the local region is extended to the whole sample space, the problem becomes the general global PDF estimation.
We provide a criterion for local region selection: while ensuring the accuracy of the estimate $\hat{P}(L)$, choose the smallest local region in which the LPM remains simple. Thus, the local region selection is related to the sample size $N$, and the selection of the LPM is related to the local region. If $N$ is sufficiently large, the estimation of $P(L)$ can be effective even in a relatively small local region where the local distribution is simple and can be assumed uniform. That is why the histogram estimator is effective when $N \to \infty$. In practice, however, the sample size is small, especially in high-dimensional cases. With a small sample size, ensuring the accuracy of $\hat{P}(L)$ produces a relatively large local region where the LUA does not hold, and a more complex local model assumption (LPA or LCA) should be taken.
IV Classification Rules
Because the global probability density can be effectively estimated through assuming an appropriate LPM, a Bayesian Classifier based on LPM (LPMBC) can be constructed.
IV-A Formulation
Given a query sample $x$ with unknown class label $y \in \mathcal{C}$, where $\mathcal{C} = \{c_1, \dots, c_M\}$ is a finite set of possible class labels, a Bayesian classifier estimates the posterior probability $P(c_i \mid x)$ for each class label $c_i$, and predicts the best class label $c^*$ for $x$ by maximizing the posterior probability based on Bayes' theorem, as

$$c^* = \arg\max_{c_i \in \mathcal{C}} P(c_i \mid x) = \arg\max_{c_i \in \mathcal{C}} P(c_i)\, p(x \mid c_i) \tag{21}$$

According to Formula 12, the probability density at point $x$ for class $c_i$ can be expanded as

$$p(x \mid c_i) = P(L_i \mid c_i)\, f_{L_i}(x \mid c_i) \tag{22}$$

For the classification of a certain sample $x$, a local region $L_i$ should be determined for each class, such that $x \in L_i$ ($i = 1, \dots, M$). The Bayesian classification rule of Equation 21 can then be further transformed into

$$c^* = \arg\max_{c_i \in \mathcal{C}} P(c_i)\, P(L_i \mid c_i)\, f_{L_i}(x \mid c_i) \tag{23}$$

The output posterior probability can also be computed according to the law of total probability as

$$P(c_i \mid x) = \frac{P(c_i)\, P(L_i \mid c_i)\, f_{L_i}(x \mid c_i)}{\sum_{j=1}^{M} P(c_j)\, P(L_j \mid c_j)\, f_{L_j}(x \mid c_j)} \tag{24}$$
Thus, to classify a certain sample $x$, we should select a local region $L_i$ for each class, then estimate the prior probability $P(c_i)\,P(L_i \mid c_i)$ and assume an LPM in each $L_i$ for the corresponding LPD estimation from the training set. $P(c_i)\,P(L_i \mid c_i) = P(c_i, L_i)$ represents the probability that a sample belongs both to class $c_i$ and to the local region $L_i$. If $k_i$ out of the $N$ training samples belong to class $c_i$ and fall into region $L_i$, it can usually be estimated as

$$\hat{P}(c_i, L_i) = \frac{k_i}{N} \tag{25}$$

Thus,

$$c^* = \arg\max_{c_i \in \mathcal{C}} \frac{k_i}{N}\, \hat{f}_{L_i}(x \mid c_i) \tag{26}$$

$$\hat{P}(c_i \mid x) = \frac{k_i\, \hat{f}_{L_i}(x \mid c_i)}{\sum_{j=1}^{M} k_j\, \hat{f}_{L_j}(x \mid c_j)} \tag{27}$$
The estimation of the LPD depends on the LPM assumption. If the LPM for region $L_i$ and class $c_i$ is taken to be the LUA, LPA or LCA, the corresponding LPD can be estimated from Equations 13, 14 and 20, respectively.
Since $x \in L_i$, $L_i$ is actually a neighborhood of $x$ and can be selected centered at $x$ to facilitate the related computation. Note that the neighborhoods of different samples may overlap, so the estimated PDF will not integrate to unity. However, for the classification of a certain sample, we only need the likelihood at that point for each class; the true PDF at every point is not necessary.
IV-B Specification
Formula 26 can be viewed as a generalized local classification model and can be specialized to a series of different classification rules by selecting various local regions and various LPM assumptions.
Bayesian rule. The common Bayesian classification rule is essentially an LPMBC with the local region extended to the whole sample space, where the LPD becomes the global PDF. In this global case, the LPMBC rule described by Formula 23 transforms back into the Bayesian rule of Formula 21.
kNN rule. If we select an identical neighborhood $L$ for all the classes and take the LUA for the LPD estimation, so that $\hat{f}_L(x \mid c_i) = 1/V(L)$ is constant across classes, then the LPMBC rule in Formula 26 reduces to $c^* = \arg\max_{c_i} k_i$. That is, it outputs the class that has the most samples in the neighborhood, in agreement with the traditional voting kNN rule.
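This reduction can be sketched directly: with an identical neighborhood and the LUA, the LPMBC score is proportional to the per-class neighbor count. The Chebyshev metric and the toy data below are illustrative assumptions:

```python
import numpy as np

def lpmbc_lua_predict(x, X, y, k):
    """LPMBC with an identical neighborhood and the LUA: the score
    (k_i / N) * (1 / V(L)) is maximized by the class holding most of the
    k nearest neighbors, i.e., the voting kNN rule (Chebyshev metric)."""
    idx = np.argsort(np.max(np.abs(X - x), axis=1))[:k]   # k nearest by Chebyshev
    labels, counts = np.unique(y[idx], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1])
print(lpmbc_lua_predict(np.array([0.05, 0.05]), X, y, k=3))  # → 0
```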
Distance weighted kNN rule (DWkNN). A DWkNN rule [28] is essentially a reduced form of LPMBC with an identical neighborhood $L$ for all the classes and with the LCA for the LPD estimation. The specialization can be described as

$$c^* = \arg\max_{c_i \in \mathcal{C}} \sum_{j=1}^{k_i} K\big((x - x_j^{(i)}) \oslash h\big) \tag{28}$$

where $x_j^{(i)}$ is the $j$th sample of class $c_i$ in the neighborhood. Thus, it can be seen as a DWkNN that assigns the weight $K\big((x - x_j^{(i)}) \oslash h\big)$ to sample $x_j^{(i)}$. A different kernel function corresponds to a different weight function.
Local mean method (LMM). An LMM computes a local center of the neighborhood of the query sample for each class, and then assigns the query sample to the class that minimizes the distance between its local center and the query sample. Usually, an equal number of nearest neighbors is selected from each class to estimate the corresponding local center. A number of articles [29, 30, 31, 32] have presented classifiers of this kind.
An LMM is a special case of LPMBC using the LGA with the same covariance matrix $\Sigma$ for each class. If we select the neighborhood $L_i$ for class $c_i$ such that there are $k$ samples of class $c_i$ in $L_i$, i.e., $k_i = k$ is constant across classes, the LPMBC rule can be transformed into

$$c^* = \arg\min_{c_i \in \mathcal{C}} (x - \hat{\mu}_i)^{\mathsf T} \Sigma^{-1} (x - \hat{\mu}_i) \tag{29}$$

where $\hat{\mu}_i$ is the local center estimated from the samples in the corresponding local region, and $(x - \hat{\mu}_i)^{\mathsf T} \Sigma^{-1} (x - \hat{\mu}_i)$ is the squared Mahalanobis distance between $x$ and $\hat{\mu}_i$. If $\Sigma$ is further taken to be the identity matrix, the Mahalanobis distance reduces to the Euclidean distance; that is, this rule assigns the query sample to the class whose local center is closest to the query sample.
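A sketch of this LMM specialization in the identity-covariance (Euclidean) case; the toy two-cluster data are illustrative assumptions:

```python
import numpy as np

def lmm_predict(x, X, y, k):
    """Local mean method (Eq. 29 with identity covariance): for each class,
    average its k nearest neighbors and pick the class whose local center
    is closest to the query sample (Euclidean distance)."""
    best, best_dist = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:k]
        center = Xc[idx].mean(axis=0)               # local center of class c
        dist = np.linalg.norm(x - center)
        if dist < best_dist:
            best, best_dist = c, dist
    return best

X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [2.0, 2.0], [2.2, 2.0], [2.0, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(lmm_predict(np.array([0.3, 0.3]), X, y, k=2))  # → 0
```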
Local distribution based kNN (LDkNN). We presented the LDkNN method [33], where the local distribution of each class is assumed Gaussian and the mean and covariance matrix are estimated from the samples in the corresponding neighborhood. An LDkNN rule is essentially an LPMBC with the LGA in an identical neighborhood for all classes.
V Experiments
V-A Experimental Setting
To evaluate the performance of LPMBC, experiments are performed on 16 benchmark datasets from the well-known UCI machine learning repository [34]. Detailed information on the datasets is summarized in Table I. For each dataset, we conduct the following setup.

Datasets  #Samples  #Features  #Classes

Blood  748  4  2 
Bupaliver  345  6  2 
Climate  540  18  2 
Diabete  768  8  2 
Haberman  306  3  2 
Heart  270  13  2 
Image  2310  19  7 
Ionosphere  351  34  2 
Iris  150  4  3 
Libras  360  90  15 
Parkinson  195  22  2 
Seeds  210  7  3 
Sonar  208  60  2 
spectf  267  44  2 
Vertebral  310  6  3 
Wine  178  13  3 
Cross test: each dataset is randomly stratified into 5 folds; in each iteration, 4 folds constitute the training set $D_{\mathrm{train}}$ and the remaining fold is the test set $D_{\mathrm{test}}$. The classification performance is assessed on $D_{\mathrm{test}}$ and averaged over the 5 folds.
Normalization: each feature is normalized over $D_{\mathrm{train}}$ to have mean 0 and standard deviation 1, and $D_{\mathrm{test}}$ is processed with the corresponding parameters.

Parameter selection: the neighborhood size and the LPM assumption are the two parameters of LPMBC. For a test sample $x$, the neighborhood $L_i$ associated with each class $c_i$ is selected so that it contains $k$ samples of the corresponding class in $D_{\mathrm{train}}$; the parameter $k$ thus indicates the neighborhood size. The Chebyshev distance is chosen as the metric for constructing the neighborhood, since it forms a hypercubical neighborhood where the local CCIA is more likely to hold and where the calculation of integrals and volumes is simpler. The neighborhood size and the LPM assumption are selected via an internal 4-fold cross-validation on $D_{\mathrm{train}}$. In the internal cross-validation, the neighborhood size $k$ is selected from $\{1, 2, \dots, N_{\min}\}$, where $N_{\min}$ is the minimum number of samples among all classes; the LPM assumption is selected among LUA, LGA and LCA. In addition, although an LPM can model the dependencies among features, feature-independent LPMs are established in our experiments, based on Theorem 1, to facilitate the computation.
Performance evaluation: the classification performance is evaluated by accuracy (ACC) and mean square error (MSE) [35]. To avoid bias, the 5-fold cross test is implemented 8 times and the performances are averaged.
Competing classifiers: the following classifiers are also implemented in our experiments for comparison: GNBC and KNBC (the global NBC using the Gaussian model assumption [1] and KDE [36], respectively), NNBC [8], TAN [4], NBTree [22], locally weighted naive Bayes (LWNB) [24], voting kNN (VkNN), the local mean method Categorical Average Pattern (CAP) [29], and LDkNN [33].
V-B Experimental Results
Datasets  LPMBC  GNBC  KNBC  NNBC  TAN  NBTree  LWNB  VkNN  CAP  LDkNN  

MSE  Blood  0.1591  0.1681  0.1657  0.1582  0.1610  0.1587  0.1604  0.1591  0.1711  0.1622 
Bupaliver  0.2160  0.2595  0.2198  0.2474  0.2443  0.2368  0.2347  0.2331  0.2291  0.2276  
Climate  0.0458  0.0399  0.0570  0.0871  0.0527  0.0506  0.0396  0.0709  0.0652  0.0458  
Diabete  0.1678  0.1779  0.1971  0.1887  0.1725  0.1799  0.1724  0.1832  0.1677  0.1814  
Haberman  0.1785  0.1918  0.2527  0.2000  0.1823  0.1840  0.1853  0.1785  0.1939  0.1962  
Heart  0.1391  0.1334  0.1514  0.1722  0.1345  0.1961  0.1320  0.1886  0.1389  0.1416  
Image  0.0341  0.1841  0.1387  0.0582  0.0385  0.0438  0.1474  0.0412  0.0798  0.0768  
Ionosphere  0.0858  0.1401  0.0783  0.1020  0.0775  0.1067  0.0807  0.1085  0.1175  0.0922  
Iris  0.0308  0.0364  0.0338  0.0502  0.0518  0.0549  0.0362  0.0468  0.0648  0.0351  
Libras  0.1579  0.3458  0.3167  0.1346  0.2490  0.2783  0.2228  0.1579  0.1329  0.2229  
Parkinson  0.0656  0.2898  0.2422  0.0422  0.1222  0.1126  0.2664  0.0656  0.0616  0.1130  
Seeds  0.0432  0.0813  0.0735  0.0482  0.0731  0.0803  0.0744  0.0582  0.0584  0.0711  
Sonar  0.1205  0.2869  0.2027  0.1093  0.1811  0.2188  0.2224  0.2212  0.1541  0.1564  
spectf  0.1531  0.3020  0.2460  0.1785  0.1586  0.1893  0.2467  0.1860  0.1468  0.1589  
Vertebral  0.1135  0.1202  0.1317  0.1271  0.1311  0.1478  0.1175  0.1428  0.1410  0.1140  
Wine  0.0069  0.0207  0.0201  0.0359  0.0215  0.0491  0.0159  0.0462  0.0200  0.0108  
Average rank  2.28  7.44  6.63  5.50  5.13  6.88  5.19  5.94  5.19  4.84  
ACC  Blood  0.7925  0.7644  0.7575  0.7797  0.7513  0.7767  0.7741  0.7821  0.7884  0.7925 
Bupaliver  0.6851  0.5569  0.6435  0.6322  0.5768  0.6551  0.6261  0.6243  0.6572  0.6783  
Climate  0.9414  0.9465  0.9188  0.8940  0.9296  0.9370  0.9519  0.9162  0.9236  0.9414  
Diabete  0.7697  0.7564  0.7355  0.7332  0.7448  0.7370  0.7643  0.7318  0.7660  0.7645  
Haberman  0.7464  0.7480  0.5295  0.7308  0.7255  0.7353  0.7516  0.7452  0.7463  0.7464  
Heart  0.8380  0.8366  0.7963  0.7787  0.8000  0.7556  0.8296  0.7458  0.8384  0.8329  
Image  0.9613  0.7972  0.8290  0.9243  0.9515  0.9506  0.8229  0.9484  0.9551  0.9154  
Ionosphere  0.9106  0.8429  0.9145  0.8771  0.9202  0.8775  0.9003  0.8608  0.8764  0.9031  
Iris  0.9625  0.9558  0.9617  0.9542  0.9267  0.9200  0.9467  0.9500  0.9575  0.9600  
Libras  0.8347  0.6287  0.6568  0.8439  0.6667  0.6806  0.7667  0.8347  0.8534  0.7707  
Parkinson  0.9186  0.7000  0.7437  0.9444  0.8410  0.8821  0.7179  0.9186  0.9135  0.8647  
Seeds  0.9488  0.9030  0.9065  0.9304  0.8857  0.9048  0.9000  0.9262  0.9274  0.9202  
Sonar  0.8756  0.6837  0.7656  0.8624  0.7548  0.7500  0.7500  0.7494  0.8210  0.8288  
spectf  0.8123  0.6794  0.7351  0.7187  0.7903  0.7566  0.7416  0.7940  0.8146  0.8100  
Vertebral  0.8464  0.8218  0.8085  0.8109  0.8000  0.7968  0.8290  0.7859  0.8081  0.8452  
Wine  0.9916  0.9755  0.9768  0.9564  0.9775  0.9438  0.9775  0.9416  0.9790  0.9880  
Average rank  1.84  7.13  6.69  5.88  6.72  6.78  5.81  6.94  3.63  3.59 
The classification results in terms of ACC and MSE are reported in Table II. We can observe that LPMBC performs best on 7 and 9 datasets in terms of MSE and ACC, respectively, more than any other classifier. The average rank of LPMBC on these datasets is 2.28 and 1.84, respectively, in both cases lower than all other classifiers. These results imply that LPMBC can flexibly handle various classification problems by tuning the neighborhood size and the corresponding LPM assumption. We employ Friedman tests [37, 38] for multiple comparisons among these classifiers. The resulting $p$-values in terms of MSE and ACC are both much less than 0.01, indicating a significant difference among these 10 classifiers. We further use the post-hoc Bonferroni-Dunn test [38] to reveal the differences among the classifiers. Figure 2 shows the results of the Bonferroni-Dunn test in which the other classifiers are compared to LPMBC. The results indicate that, in terms of MSE and ACC, LPMBC significantly outperforms all classifiers except LDkNN and CAP, and that the data are insufficient to discriminate the advantage of LPMBC over LDkNN.
V-C Parameter Analysis
The neighborhood size and the LPM assumption can influence the classification performance, as discussed before. In this experiment, we vary the neighborhood size and take different LPM assumptions to test the performance of LPMBC on both real and synthetic datasets. Each synthetic dataset is a 2-dimensional binary classification problem; each of the two classes consists of 100 samples drawn from a Gaussian distribution with a class-specific center and a shared covariance matrix, whose covariance term is selected from five values to construct 5 datasets. The performance curves of ACC and MSE on some representative datasets are shown in Figure 3, from which we can observe a few general trends. (1) A simple LPM (e.g., LUA) usually favors a small neighborhood, while a complex LPM (e.g., LCA) usually favors a larger one. (2) The LCA does not always perform best over the whole sample space; this depends on the inherent complexity of the true distribution. In the real-world problems, where the inherent distribution is complex, the LCA is more effective than the LGA over the whole sample space; in the synthetic datasets, where the inherent distribution is Gaussian and not so complex, the LCA is less effective than the LGA. (3) For real-world problems, LPMBC usually achieves its best performance in a medium-sized neighborhood with a certain LPM.
Feature independence. On the synthetic datasets, the covariance describes the dependency between the two features. From Figures 3(f) to 3(j), we can see that as the dependency increases, LPMBC with a large neighborhood size becomes increasingly ineffective, and the superiority of LPMBC with a small neighborhood over that with a large neighborhood becomes more and more obvious. These results support Theorem 1 and indicate that LPMBC can be a promising Bayesian classifier by relaxing the fundamental CCIA to a local region.
VI Conclusion
In this paper, we proposed implementing Bayesian classification based on a local probabilistic model. The idea is to transform the estimation of the global distribution into an estimation in a local region, where the distribution should be simple. LPMBC is a compromise between parametric and nonparametric methods: in the neighborhood of the query sample we assume a parametric probabilistic model, while over the whole sample space the method is nonparametric. If the neighborhood is small, the method leans nonparametric, and vice versa. By tuning the neighborhood size, we can control this trade-off. The LPMBC can also be viewed as a generalized local classification method: by specifying the local region and the LPM, it can be specialized to various local classifiers, and it should be more effective when an appropriate neighborhood and LPM are selected. We have discussed three kinds of LPMs in this paper; other probabilistic models can also be assumed. Rough rules for neighborhood and LPM selection have been discussed; however, deriving a general selection rule for the neighborhood and the corresponding LPM requires further investigation in our future research.
Acknowledgements
This work was supported by the National Basic Research Program of China (2014CB744600), the National Natural Science Foundation of China (61402211, 61063028 and 61210010).
References
 [1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley & Sons, 2012.
 [2] P. Domingos and M. Pazzani, “Beyond independence: Conditions for the optimality of the simple bayesian classifier,” in Proc. 13th Intl. Conf. Machine Learning, 1996, pp. 105–112.
 [3] D. J. Hand and K. Yu, “Idiot’s Bayes: not so stupid after all?” International statistical review, vol. 69, no. 3, pp. 385–398, 2001.

[4]
N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,”
Machine learning, vol. 29, no. 23, pp. 131–163, 1997.  [5] G. F. Cooper, “The computational complexity of probabilistic inference using bayesian belief networks,” Artificial intelligence, vol. 42, no. 2, pp. 393–405, 1990.
 [6] P. Dagum and M. Luby, “Approximating probabilistic inference in bayesian belief networks is nphard,” Artificial intelligence, vol. 60, no. 1, pp. 141–153, 1993.
 [7] G. I. Webb, J. R. Boughton, and Z. Wang, “Not so naive bayes: aggregating onedependence estimators,” Machine learning, vol. 58, no. 1, pp. 5–24, 2005.
 [8] X.Z. Wang, Y.L. He, and D. D. Wang, “Nonnaive bayesian classifiers for classification problems with continuous attributes,” Cybernetics, IEEE Transactions on, vol. 44, no. 1, pp. 21–39, 2014.
 [9] L. Bottou and V. Vapnik, “Local learning algorithms,” Neural computation, vol. 4, no. 6, pp. 888–900, 1992.
 [10] K. Huang, H. Yang, I. King, and M. R. Lyu, “Local learning vs. global learning: An introduction to maximin margin machine,” in Support vector machines: theory and applications. Springer, 2005, pp. 113–131.
 [11] M. Wu and B. Schölkopf, “A local learning approach for clustering,” in Advances in neural information processing systems, 2006, pp. 1529–1536.
 [12] C. M. Bishop et al., Pattern recognition and machine learning. Springer New York, 2006, vol. 1.
 [13] N. L. Hjort and M. Jones, “Locally parametric nonparametric density estimation,” The Annals of Statistics, pp. 1619–1647, 1996.
 [14] C. R. Loader et al., “Local likelihood density estimation,” The Annals of Statistics, vol. 24, no. 4, pp. 1602–1618, 1996.
 [15] P. Vincent, Y. Bengio et al., “Locally weighted full covariance gaussian density estimation,” Technical report 1240, Tech. Rep., 2003.
 [16] T. Cover and P. Hart, “Nearest neighbor pattern classification,” Information Theory, IEEE Transactions on, vol. 13, no. 1, pp. 21–27, 1967.
 [17] D. T. Larose and C. D. Larose, “k-nearest neighbor algorithm,” Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition, pp. 149–164, 2006.
 [18] H. Zhang, A. Berg, M. Maire, and J. Malik, “Svm-knn: Discriminative nearest neighbor classification for visual category recognition,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, 2006, pp. 2126–2136.
 [19] H. Cheng, P.-N. Tan, and R. Jin, “Localized support vector machine and its efficient algorithm,” in SDM. SIAM, 2007, pp. 461–466.
 [20] E. Blanzieri and F. Melgani, “Nearest neighbor classification of remote sensing images with the maximal margin principle,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 46, no. 6, pp. 1804–1811, 2008.
 [21] L. Ladicky and P. Torr, “Locally linear support vector machines,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 985–992.
 [22] R. Kohavi, “Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid,” in KDD, 1996, pp. 202–207.
 [23] Z. Zheng and G. I. Webb, “Lazy learning of bayesian rules,” Machine Learning, vol. 41, no. 1, pp. 53–84, 2000.
 [24] E. Frank, M. Hall, and B. Pfahringer, “Locally weighted naive bayes,” in Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2002, pp. 249–256.
 [25] Z. Xie, W. Hsu, Z. Liu, and M. L. Lee, “Snnb: A selective neighborhood based naive bayes for lazy learning,” in Advances in knowledge discovery and data mining. Springer, 2002, pp. 104–114.
 [26] B. Hu, C. Mao, X. Zhang, and Y. Dai, “Bayesian classification with local probabilistic model assumption in aiding medical diagnosis,” in Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. IEEE, 2015, pp. 691–694.
 [27] T. J. Hastie, R. J. Tibshirani, and J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction. Springer, 2009.
 [28] K. Hechenbichler and K. Schliep, “Weighted k-nearest-neighbor techniques and ordinal classification,” 2004.
 [29] S. Hotta, S. Kiyasu, and S. Miyahara, “Pattern recognition using average patterns of categorical k-nearest neighbors,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 4. IEEE, 2004, pp. 412–415.
 [30] Y. Mitani and Y. Hamamoto, “A local mean-based nonparametric classifier,” Pattern Recognition Letters, vol. 27, no. 10, pp. 1151–1159, 2006.
 [31] B. Li, Y. Chen, and Y. Chen, “The nearest neighbor algorithm of local probability centers,” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 38, no. 1, pp. 141–154, 2008.
 [32] J. Gou, Z. Yi, L. Du, and T. Xiong, “A local mean-based k-nearest centroid neighbor classifier,” The Computer Journal, vol. 55, no. 9, pp. 1058–1071, 2012.
 [33] C. Mao, B. Hu, P. Moore, Y. Su, and M. Wang, “Nearest neighbor method based on local distribution for classification,” in Advances in Knowledge Discovery and Data Mining. Springer, 2015, pp. 239–250.
 [34] K. Bache and M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
 [35] L. W. Zhong and J. T. Kwok, “Accurate probability calibration for multiple classifiers,” in Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 2013, pp. 1939–1945.
 [36] G. H. John and P. Langley, “Estimating continuous distributions in bayesian classifiers,” in Proceedings of the Eleventh conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1995, pp. 338–345.
 [37] M. Hollander and D. A. Wolfe, Nonparametric statistical methods. John Wiley & Sons, 1999.
 [38] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
Appendix A Proof of Theorem 1
We prove only the case with two variables; the mathematical statement of Theorem 1 is then as follows.

Let $X_1$ and $X_2$ be two random variables, and let $R = [a_1, b_1] \times [a_2, b_2]$ be a subregion of the sample space of $(X_1, X_2)$.

If

$p(x_1, x_2) = c$ for all $(x_1, x_2) \in R$,

then

$p(x_1, x_2 \mid R) = p(x_1 \mid R)\, p(x_2 \mid R)$,

i.e., we have that $X_1$ and $X_2$ are independent within $R$.

Proof.

Because $P(R) = \int_R p(x_1, x_2)\, dx_1\, dx_2 = c\,(b_1 - a_1)(b_2 - a_2)$, then

(30)  $p(x_1, x_2 \mid R) = \dfrac{p(x_1, x_2)}{P(R)} = \dfrac{1}{(b_1 - a_1)(b_2 - a_2)}$ for $(x_1, x_2) \in R$.

Then, marginalizing,

(31)  $p(x_1 \mid R) = \int_{a_2}^{b_2} p(x_1, x_2 \mid R)\, dx_2 = \dfrac{1}{b_1 - a_1}$, and likewise $p(x_2 \mid R) = \dfrac{1}{b_2 - a_2}$.

If $(x_1, x_2) \in R$,

(32)  $p(x_1 \mid R)\, p(x_2 \mid R) = \dfrac{1}{(b_1 - a_1)(b_2 - a_2)} = p(x_1, x_2 \mid R)$.

If $(x_1, x_2) \notin R$, then $x_1 \notin [a_1, b_1]$ or $x_2 \notin [a_2, b_2]$, so $p(x_1, x_2 \mid R) = 0$ and

(33)  $p(x_1 \mid R)\, p(x_2 \mid R) = 0 = p(x_1, x_2 \mid R)$.

Hence the factorization holds in both cases. ∎
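As a quick Monte Carlo sanity check of the local-independence idea behind Theorem 1 (assuming the theorem asserts that a constant joint density over a rectangular subregion implies independence of the two variables within it), the sketch below samples a uniform density on a rectangle and verifies that a joint event probability matches the product of the two marginal probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform density on the rectangle R = [0, 2] x [0, 3]: the joint
# probability of any sub-box should equal the product of the marginal
# probabilities, i.e. the two coordinates are independent within R.
n = 200_000
x1 = rng.uniform(0.0, 2.0, size=n)
x2 = rng.uniform(0.0, 3.0, size=n)

# P(x1 < 1, x2 < 1 | R) versus P(x1 < 1 | R) * P(x2 < 1 | R)
joint = np.mean((x1 < 1.0) & (x2 < 1.0))
product = np.mean(x1 < 1.0) * np.mean(x2 < 1.0)

# Both estimates should be close to (1/2) * (1/3)
print(joint, product)
```

The same check fails for a density that varies strongly across the region, which is why the factorization is only claimed locally, where the density is approximately constant.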