The naive Bayes (NB) classifier is well known for its simplicity, computational efficiency, and competitive performance. Let $X = (X_1, \ldots, X_p)$ be a vector of $p$ feature variables (also known as attributes) and $Y$ be a class variable. NB makes a bold and usually violated assumption that $X_1, \ldots, X_p$ are independent given $Y$. This is the famous conditional independence assumption (CIA). In this paper, all feature variables are assumed to be categorical. CIA implies that
$$P(X = x \mid Y = y) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = y).$$
We call a realization of $(X, Y)$ an instance. An instance can be labeled or unlabeled depending on whether the corresponding value of $Y$ is observed or not. We have a training data set $D$ comprising $n$ labeled instances which are random samples of $(X, Y)$. We call any labeled instance in $D$ a training instance. A test instance is an unlabeled instance whose $Y$-value we want to estimate. For any test instance with $X = x$, the NB estimate of $P(Y = y \mid X = x)$, denoted as $\hat{P}(Y = y \mid X = x)$, is
$$\hat{P}(Y = y \mid X = x) = \frac{\hat{P}(Y = y) \prod_{j=1}^{p} \hat{P}(X_j = x_j \mid Y = y)}{\sum_{y'} \hat{P}(Y = y') \prod_{j=1}^{p} \hat{P}(X_j = x_j \mid Y = y')},$$
where $\hat{P}(Y = y)$ and $\hat{P}(X_j = x_j \mid Y = y)$ are estimates of the corresponding probabilities. A common probability estimator is the relative frequency or its Laplace modification (Cestnik, 1990).
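The NB estimate described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `fit_nb` and `nb_posterior` are hypothetical, and add-one smoothing is used as one common form of Laplace's modification.

```python
from collections import Counter, defaultdict


def fit_nb(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(j, y)][v] = number of rows with attribute j == v and class y
    value_counts = defaultdict(Counter)
    domains = [set() for _ in rows[0]]
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, y)][v] += 1
            domains[j].add(v)
    return class_counts, value_counts, domains


def nb_posterior(x, class_counts, value_counts, domains):
    """Posterior over classes under the CIA with add-one (Laplace) smoothing."""
    n = sum(class_counts.values())
    k = len(class_counts)
    scores = {}
    for y, n_y in class_counts.items():
        score = (n_y + 1) / (n + k)
        for j, v in enumerate(x):
            score *= (value_counts[(j, y)][v] + 1) / (n_y + len(domains[j]))
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


# Toy data (assumed for illustration only).
rows = [("a", "u"), ("a", "v"), ("b", "v"), ("b", "v")]
labels = ["p", "p", "q", "q"]
posterior = nb_posterior(("a", "u"), *fit_nb(rows, labels))
```

The posterior is normalized over the classes, so only the products of the estimated probabilities matter for the classification.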
An obvious weakness of NB is that the CIA is unrealistic in most real applications. Although NB achieves quite good classification accuracy even when the CIA is violated by a wide margin, the CIA does have an adverse effect on the asymptotic behavior of the classifier: as we observe more and more data, NB in general does not converge to the Bayesian classification rule for the full multinomial model. This characteristic of NB is confirmed by empirical results showing that, when more data become available, the correct classification rate of NB does not scale up (Kohavi, 1996), and NB will eventually be overtaken by classifiers which make weaker assumptions.
Research has been done to remedy this defect of NB, along two directions. One is to extend the NB model to a larger Bayesian network model. The other modifies the training set to better suit the CIA. An early attempt in the first direction is the tree augmented naive Bayes (TAN) model (Friedman and Goldszmidt, 1996; Friedman et al, 1997), which embeds a tree topology on the Bayesian network of $X$. The averaged one-dependence estimator (AODE) (Webb et al, 2005) is another method. It uses the average of superparent-one-dependence estimators for classification. Cerquides and Lopez de Mantaras (2005) introduced the WAODE method, which extends AODE by replacing the simple average by a weighted average. Zhang et al (2005) proposed a hidden naive Bayes (HNB) classifier in which one hidden parent is added to each attribute.
In this paper, the second direction is taken. We modify the training data by assigning each training instance a weight which depends on how close it is to the test instance. NB is then fitted to the locally weighted data. The unfavorable impact of CIA is lessened as focus is laid on a small neighborhood of the test instance.
The remaining part of the paper is organized as follows. We consider how we can apply NB in the presence of weights in Section 2. The new classifier, which we call a lazy cell-weighted naive Bayes method, is proposed in Section 3. An empirical study is performed in Section 4, comparing the new method with seven commonly used classifiers of similar nature. The new method with an appropriate choice of parameter is found to outperform the other methods considered in the study. The last section concludes the paper.
2 Naive Bayes with weights
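Let $p(x, y)$ and $w(x, y)$ be the probability and the (nonnegative) weight of the cell $(x, y)$ respectively. Define a random vector with weight $w$ to be a random vector with the same support as $(X, Y)$, but with the joint probability
$$p_w(x, y) = \frac{w(x, y)\, p(x, y)}{\sum_{x', y'} w(x', y')\, p(x', y')}.$$
This distribution is well defined as long as $w(x, y)\, p(x, y) > 0$ for at least one $(x, y)$.
If all weights are equal, the weighted random vector reduces to the unweighted one. As a weight is assigned to each cell, we call it a cell weight. To avoid confusion, we denote a random vector with weight $w$ by $(X^w, Y^w)$. It satisfies the CIA if for any $(x, y)$
$$P(X^w = x \mid Y^w = y) = \prod_{j=1}^{p} P(X^w_j = x_j \mid Y^w = y). \qquad (1)$$
Under CIA, our estimate of $P(Y^w = y \mid X^w = x)$ for a test instance with $X = x$ is
$$\hat{P}(Y^w = y \mid X^w = x) = \frac{\hat{P}(Y^w = y) \prod_{j=1}^{p} \hat{P}(X^w_j = x_j \mid Y^w = y)}{\sum_{y'} \hat{P}(Y^w = y') \prod_{j=1}^{p} \hat{P}(X^w_j = x_j \mid Y^w = y')},$$
where $\hat{P}(\cdot)$ denotes an estimate of $P(\cdot)$. The weighted relative frequency estimators of probabilities are
$$\hat{P}(Y^w = y) = \frac{\sum_{x'} w(x', y)\, n(x', y)}{\sum_{x', y'} w(x', y')\, n(x', y')}, \qquad \hat{P}(X^w_j = x_j \mid Y^w = y) = \frac{\sum_{x' : x'_j = x_j} w(x', y)\, n(x', y)}{\sum_{x'} w(x', y)\, n(x', y)},$$
where $n(x, y)$ is the observed frequency of $X = x$ and $Y = y$ in $D$.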
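As a minimal sketch of the weighted relative frequency estimators, the function below takes one weight per training instance (the weight of the cell that the instance falls in) and returns the weighted NB posterior for a test instance. The name `weighted_nb_posterior` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_nb_posterior(x, rows, labels, weights):
    """Weighted NB posterior using weighted relative frequencies (no smoothing)."""
    class_w = defaultdict(float)  # total weight of each class
    match_w = defaultdict(float)  # (j, y) -> weight of class-y rows agreeing with x on attribute j
    for row, y, w in zip(rows, labels, weights):
        class_w[y] += w
        for j, v in enumerate(row):
            if v == x[j]:
                match_w[(j, y)] += w
    total = sum(class_w.values())
    scores = {}
    for y, w_y in class_w.items():
        if w_y == 0:
            scores[y] = 0.0
            continue
        score = w_y / total
        for j in range(len(x)):
            score *= match_w[(j, y)] / w_y
        scores[y] = score
    norm = sum(scores.values())
    if norm == 0:
        raise ValueError("all classes have zero estimated probability")
    return {y: s / norm for y, s in scores.items()}


rows = [("a", "u"), ("a", "v"), ("b", "v"), ("a", "v")]
labels = ["p", "p", "q", "q"]
posterior = weighted_nb_posterior(("a", "v"), rows, labels, [1.0, 0.5, 0.5, 0.25])
```

Multiplying all weights by the same positive constant leaves these estimates unchanged, which is why equal weights reproduce the unweighted classifier.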
3 Lazy cell-weighted naive Bayes
A locally weighted classifier is determined by a weighting function and a model. The former assigns a nonnegative weight to each training instance so that instances “closer” to the test instance, $x^*$, have larger weights. The model is then fitted to the weighted training data. Classification is based on the estimated posterior probability of $Y$ under the fitted model.
Definition. Let $w(x, y)$ be a local weighting function for a test instance $x^*$. It is called compatible with a model if
(a) If the model holds for the unweighted data, it also holds for the weighted data, and
(b) $w(x^*, y)$ does not depend on $y$.
Compatibility is fundamental for a weighting function and a model to form a reasonable locally weighted classifier if the aim of the weighting is to alleviate the possible failure of the model in a large region, rather than an intentional modification of the model. In other words, the weighting works as a means to weaken the sensitivity of the classifier to the model instead of as a technique to introduce a new model. Note that Condition (b) allows $w(x, y)$ to depend on $y$ when $x \neq x^*$. This condition ensures that
$$P(Y^w = y \mid X^w = x^*) = P(Y = y \mid X = x^*).$$
Therefore, an estimator of the former using the weighted data is just an alternative estimator of the latter. Hopefully, this new estimator is more robust than the original estimator to the model assumption.
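The role of Condition (b) can be made explicit in one line. The notation here is assumed for illustration: $p(x, y)$ is the cell probability, $w(x, y)$ the cell weight, $x^*$ the test instance, and $c > 0$ the common value of $w(x^*, y)$ over all $y$.

```latex
P(Y^{w} = y \mid X^{w} = x^{*})
  = \frac{w(x^{*}, y)\, p(x^{*}, y)}{\sum_{y'} w(x^{*}, y')\, p(x^{*}, y')}
  = \frac{c\, p(x^{*}, y)}{c \sum_{y'} p(x^{*}, y')}
  = P(Y = y \mid X = x^{*}).
```

The normalizing constant of the weighted distribution cancels in the first ratio, so only the weights attached to the cells $(x^*, y)$ matter for the posterior at the test instance.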
The use of NB as a local model is not new. A successful example is the hybrid classifier with a decision tree: a decision tree is built and, for each leaf, an NB classifier is fitted to the data associated with that leaf (Kohavi, 1996; Gama et al, 2003). The weighting function of this approach is compatible with NB when the feature space is partitioned by axis-parallel surfaces. Another approach utilizes only the nearest neighbors of the test instance (Frank et al, 2003; Jiang et al, 2005). Unfortunately, the weighting function is not compatible with NB because of the conflict between the CIA and the zero weight. In this section, we propose a local weighting function compatible with NB. We call the corresponding classifier the lazy cell-weighted naive Bayes (LCWNB) classifier.
3.1 Parametric structure of the LCWNB
For any two realizations $x$ and $x'$ of $X$, the Hamming distance of $x$ and $x'$ is the total number of $j$'s ($1 \le j \le p$) such that $x_j \neq x'_j$. Denote the Hamming distance between $x$ and $x'$ by $d(x, x')$. For each $Y$-value, $y$, choose a constant $\lambda_y$ such that $0 \le \lambda_y \le 1$. Given a test instance $x^*$, we attach to cell $(x, y)$ a weight $w(x, y) = \lambda_y^{d(x, x^*)}$. (We use the convention that $0^0 = 1$.) As the cell weight is a non-increasing function of the Hamming distance, $w$ is a local weighting function.
When $\lambda_y = 0$ for all $y$, only the cells $(x^*, y)$ for different $y$ have weight one and all other cells have weight zero. The estimator in Section 2 is then the Bayesian classification rule for the full multinomial model. Therefore, the magnitude of $\lambda_y$ determines how much we move from NB, when all $\lambda_y$'s are one, to the full multinomial model, when all $\lambda_y$'s are zero.
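To prove the compatibility of the weighting function, let the test instance be $x^*$. Under CIA in $(X, Y)$, for any $(x, y)$
$$P(X^w = x \mid Y^w = y) = \frac{\lambda_y^{d(x, x^*)} \prod_{j=1}^{p} P(X_j = x_j \mid Y = y)}{\sum_{x'} \lambda_y^{d(x', x^*)} \prod_{j=1}^{p} P(X_j = x'_j \mid Y = y)} = \prod_{j=1}^{p} \frac{\lambda_y^{I(x_j \neq x^*_j)}\, P(X_j = x_j \mid Y = y)}{\sum_{x'_j} \lambda_y^{I(x'_j \neq x^*_j)}\, P(X_j = x'_j \mid Y = y)},$$
where $I(\cdot)$ is the indicator function. Clearly Equation (1) holds and Condition (a) follows. Condition (b) holds because $w(x^*, y) = \lambda_y^{0} = 1$ for every $y$.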
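The compatibility argument can also be checked numerically. The sketch below, with hypothetical helper names, starts from one class whose attributes are independent, applies the Hamming-distance cell weights, and verifies that the weighted conditional distribution is still the product of its own marginals.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two tuples differ."""
    return sum(u != v for u, v in zip(a, b))


def weighted_conditional(marginals, lam, x_star):
    """Weighted conditional distribution of the attributes for one class.

    marginals[j][v] is the unweighted probability that attribute j equals v
    given the class; each cell is weighted by lam ** hamming(x, x_star).
    """
    cells = list(product(*[sorted(m) for m in marginals]))
    mass = {}
    for x in cells:
        prob = 1.0
        for j, v in enumerate(x):
            prob *= marginals[j][v]
        mass[x] = (lam ** hamming(x, x_star)) * prob
    total = sum(mass.values())
    return {x: m / total for x, m in mass.items()}


def is_product_form(joint, n_attrs, tol=1e-12):
    """Check that a joint distribution equals the product of its marginals."""
    margs = [dict() for _ in range(n_attrs)]
    for x, pr in joint.items():
        for j, v in enumerate(x):
            margs[j][v] = margs[j].get(v, 0.0) + pr
    for x, pr in joint.items():
        expected = 1.0
        for j, v in enumerate(x):
            expected *= margs[j][v]
        if abs(expected - pr) > tol:
            return False
    return True


marginals = [{"a": 0.3, "b": 0.7}, {"u": 0.6, "v": 0.4}]
joint = weighted_conditional(marginals, 0.5, ("a", "v"))
still_independent = is_product_form(joint, 2)
```

The check succeeds because the weight factorizes over the attributes, one factor per mismatching position.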
3.2 Parameter selection in the LCWNB
The choice of $\lambda_y$ has a dominating effect on the new method. It controls how much information in $D$ is retained for classification. A simple way to quantify the retained information is to count the number of random samples of $(X^w, Y^w)$ that can be generated using $D$. Such a random sample can be drawn using the following acceptance-rejection method. For each training instance $(x, y)$ in $D$, include this instance in a set $S$ with probability $w(x, y) = \lambda_y^{d(x, x^*)}$. Then the training instances in $S$ form a random sample of $(X^w, Y^w)$. Let $E_y$ be the expected frequency of $Y = y$ in $S$, and $n_y(h)$ be the total number of training instances in $D$ with $Y = y$ and $d(x, x^*) = h$. Then
$$E_y = \sum_{h=0}^{p} n_y(h)\, \lambda_y^{h}.$$
For the $k$-nearest neighbor ($k$-NN) method, where only the $k$ nearest neighbors of the test instance have weight one and all other instances have weight zero, a similar acceptance-rejection method yields a sample of size $k$. Thus $E_y$ acts like the constant $k$ in the $k$-NN method. Its value should be chosen to balance the bias and the variability.
Let $K$ be the number of classes. For NB, the training set can be partitioned into $K$ independent random samples, one for each class label. It is desirable to control the degree of localization separately for each sample. For $y$ with large frequency, we can afford using a small $\lambda_y$ to emphasize model fitting in a small region. However, for $y$ with small frequency, a large $\lambda_y$ is preferred as we want to retain enough information for estimation. Assigning different values of $\lambda_y$ for different $y$ is a means to handle class imbalance. The following simple rule is proposed.
Simple selection rule: Choose a positive real number $m$. Select $\lambda_y$ so that the corresponding $E_y$ is closest to $m$, i.e. $\lambda_y = \arg\min_{0 \le \lambda \le 1} \left| \sum_{h=0}^{p} n_y(h)\, \lambda^{h} - m \right|$.
As $E_y$ is a monotonically increasing polynomial of $\lambda_y$, the value of $\lambda_y$ can be found efficiently using binary search. As an analogy of $k$ in the $k$-NN method, $m$ should be small. In the comparison study, $m$ is selected to be 5, 10 or 20.
A desirable property of the simple selection rule is that when the number of training instances increases, all $\lambda_y$'s will eventually be zero, and the classifier approaches the Bayesian classification rule for the full multinomial model.
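The simple selection rule can be sketched as a bisection on the expected sample size, which is a polynomial in the class-specific constant with the distance counts as coefficients. The function names, the argument `dist_counts` (the number of class-$y$ training instances at each Hamming distance from the test instance) and the default of 30 iterations are assumptions made for illustration.

```python
def expected_size(dist_counts, lam):
    """Evaluate sum_h dist_counts[h] * lam**h by Horner's rule."""
    total = 0.0
    for c in reversed(dist_counts):
        total = total * lam + c
    return total


def select_lambda(dist_counts, target, iterations=30):
    """Find lam in [0, 1] whose expected sample size is closest to target."""
    if expected_size(dist_counts, 0.0) >= target:
        return 0.0  # enough instances coincide with the test instance
    if expected_size(dist_counts, 1.0) <= target:
        return 1.0  # the whole class is needed
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_size(dist_counts, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Two instances at distance 0, four at distance 1, eight at distance 2.
lam = select_lambda([2, 4, 8], target=5)
```

The two boundary cases correspond to the limits discussed in Section 3.1: a value of zero gives the full multinomial rule for that class, and a value of one gives ordinary NB.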
3.3 Laplace’s estimator of probabilities
The probability estimators in Section 2 are unreliable when zero or very small cell frequencies are encountered. Laplace's law of succession (Cestnik, 1990) is a common remedy for the problem.
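For the unweighted case, $w(x, y) = 1$ for all $(x, y)$. The total weight is $n$, which is the size of $D$. As Laplace's estimator is designed for the unweighted data, it is desirable to define a “sample size” for a weighted sample. For an importance estimator, a corresponding measure is the effective sample size. The effective sample size of a weighted average (a weighted relative frequency is a special kind of weighted average) is
$n_e$ if this importance estimator has the same variance as the simple average of $n_e$ random samples from the target distribution. Let $w_i$ be the weight attached to the $i$-th training instance. An approximate effective sample size for an importance estimator (see Liu, 2001, Section 2.5.3; note that the weights there are scaled so that the mean weight is 1) is
$$n_e = \frac{n}{1 + \mathrm{var}(w)} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}.$$
To rescale the weights so that the total revised weight is equal to $n_e$, we multiply each weight by a constant $c$, where
$$c = \frac{n_e}{\sum_{i=1}^{n} w_i} = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} w_i^2}.$$
The Laplace probability estimators for the rescaled weights are
$$\hat{P}(Y^w = y) = \frac{c\, W_y + \alpha}{n_e + \alpha K}, \qquad \hat{P}(X^w_j = x_j \mid Y^w = y) = \frac{c\, W_{jy}(x_j) + \alpha}{c\, W_y + \alpha v_j},$$
where $W_y$ is the total weight of the training instances with $Y = y$ and $W_{jy}(x_j)$ is the total weight of those with $X_j = x_j$ and $Y = y$.
In this paper, we use a common choice of the constants, which are $\alpha = 1$, $K$ (the total number of possible class labels), and $v_j$, which is the number of possible values of $X_j$.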
LWNB (Frank et al, 2003) also rescales its weights. For LWNB, the weights are multiplied by a constant to make the total weight equal to the total number of non-zero weights. That method cannot be applied to our classifier as we give all training instances positive weight. We need another measure of “sample size” that takes into account the distribution of weights. A common feature of their method and ours is that both multiplication factors are larger than 1.
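The rescaling step can be sketched as follows, assuming the approximation $n / (1 + \mathrm{var}(w))$ for weights scaled to mean one, which is algebraically the same as $(\sum_i w_i)^2 / \sum_i w_i^2$ for unscaled weights. The function names are hypothetical.

```python
def effective_sample_size(weights):
    """Approximate effective sample size, (sum of w)^2 / (sum of w^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2


def ess_from_variance(weights):
    """Same quantity computed as n / (1 + var(w)) after scaling to mean 1."""
    n = len(weights)
    mean = sum(weights) / n
    scaled = [w / mean for w in weights]
    var = sum((v - 1.0) ** 2 for v in scaled) / n
    return n / (1.0 + var)


def rescale_factor(weights):
    """Constant that makes the total weight equal the effective sample size."""
    return effective_sample_size(weights) / sum(weights)


weights = [1.0, 0.5, 0.5, 0.25]
factor = rescale_factor(weights)
```

When every weight lies in $[0, 1]$, the sum of squared weights cannot exceed the sum of weights, so the factor is at least one.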
3.4 Implementation details
The pseudo-code of the LCWNB classifier is given in Figure 1. The time complexity for classifying one test instance is $O(np + KpT)$, where $n$ is the total number of training instances, $K$ is the number of classes, and $T$ is the number of iterations in the binary search for the parameter $\lambda_y$ (the difference between the computed value and the true value of $\lambda_y$ is bounded by $2^{-T}$). The term $np$ is usually dominating as a large value of $T$ is unnecessary. As we need to store the whole training data set, the memory complexity is $O(np)$.
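Figure 1: Pseudo-code of the LCWNB classifier.
Input: The target expected sample size $m$, the training data set $D = \{(x_i, y_i)\}$ where $i = 1, \ldots, n$, and a test instance $x^*$
Output: The estimated class $\hat{y}$
Find $n_y(h)$ for all class $y$.
For each class $y$,
    perform binary search for $\lambda_y$ so that $E_y$ is closest to $m$.
Let $w(x, y) = \lambda_y^{d(x, x^*)}$ for the selected $\lambda_y$.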
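The steps of Sections 3.1 to 3.3 can be combined into one short sketch. It is not the authors' implementation: the function names are hypothetical, the rescaling uses the approximate effective sample size $(\sum_i w_i)^2 / \sum_i w_i^2$, and add-one Laplace smoothing is assumed for the probability estimators.

```python
from collections import defaultdict


def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))


def expected_size(dist_counts, lam):
    total = 0.0
    for c in reversed(dist_counts):
        total = total * lam + c
    return total


def select_lambda(dist_counts, target, iterations=30):
    if dist_counts[0] >= target:
        return 0.0
    if sum(dist_counts) <= target:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_size(dist_counts, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def lcwnb_predict(x_star, rows, labels, target, n_values):
    """Classify x_star; n_values[j] is the number of possible values of attribute j."""
    if target <= 0:
        raise ValueError("target must be positive")
    p = len(x_star)
    classes = sorted(set(labels))
    dists = [hamming(row, x_star) for row in rows]
    # Per-class histogram of Hamming distances to the test instance.
    hist = {y: [0] * (p + 1) for y in classes}
    for d, y in zip(dists, labels):
        hist[y][d] += 1
    # One constant per class, chosen so the expected sample size is near target.
    lam = {y: select_lambda(hist[y], target) for y in classes}
    # Cell weights lam_y ** distance (0.0 ** 0 evaluates to 1.0 in Python).
    weights = [lam[y] ** d for d, y in zip(dists, labels)]
    # Rescale so the total weight equals the approximate effective sample size.
    c = sum(weights) / sum(w * w for w in weights)
    class_w = defaultdict(float)
    match_w = defaultdict(float)
    for row, y, w in zip(rows, labels, weights):
        class_w[y] += c * w
        for j, v in enumerate(row):
            if v == x_star[j]:
                match_w[(j, y)] += c * w
    total_w = sum(class_w.values())
    scores = {}
    for y in classes:
        score = (class_w[y] + 1) / (total_w + len(classes))
        for j in range(p):
            score *= (match_w[(j, y)] + 1) / (class_w[y] + n_values[j])
        scores[y] = score
    norm = sum(scores.values())
    posterior = {y: s / norm for y, s in scores.items()}
    return max(posterior, key=posterior.get), posterior, lam


rows = [("a", "u"), ("a", "u"), ("a", "v"), ("b", "v"), ("b", "v"), ("b", "u")]
labels = ["p", "p", "p", "q", "q", "q"]
label, posterior, lam = lcwnb_predict(("a", "u"), rows, labels, target=2, n_values=[2, 2])
```

Because the classifier is lazy, all of this work is repeated for every test instance; nothing is precomputed from the training data apart from storing it.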
4 Empirical comparison
In this section, we conduct an empirical study of LCWNB. The aim is twofold. First, we look for a good choice of the parameter $m$ for the LCWNB method. Three candidate values of $m$, namely 5, 10 and 20, are considered. The LCWNB classifiers with these three values of $m$ are denoted as LCWNB5, LCWNB10 and LCWNB20 respectively. Second, we compare LCWNB with an appropriate choice of $m$ with seven existing methods. They are (i) NB: naive-Bayes classifier, (ii) TAN: the tree augmented naive-Bayes (TAN) (Friedman and Goldszmidt, 1996; Friedman et al, 1997), (iii) AODE: the averaged one-dependence estimator (Webb et al, 2005), (iv) WAODE: the weighted average one-dependence estimator (Cerquides and Lopez de Mantaras, 2005), (v) HNB: the hidden naive-Bayes method (Zhang et al, 2005), (vi) LWNB: the locally weighted naive Bayes method (Frank et al, 2003), and (vii) ICLNB: the instance cloning local naive Bayes (Jiang et al, 2005).
A collection of 36 benchmark data sets from the UCI repository (Frank and Asuncion, 2010) was downloaded from the website of Weka (Witten et al, 2011). They are used as test beds for the classifiers. A summary description of the data sets is given in Table 1.
Classification accuracy rate is used as the performance measure in this paper. The rates are computed using 10 independent runs of 10-fold cross-validation. All classifiers are trained and tested on exactly the same cross-validation folds. In the study, the filter ReplaceMissingValues in Weka is used to replace the missing values, and then the filter Discretize in Weka is used to perform unsupervised 10-bin discretization. If the number of values of an attribute is almost equal to the number of instances, that attribute is removed from the data in the preprocessing step.
Table 2 lists the classification accuracy rates of the methods when applied to the data sets. Some other statistics are given in the bottom three rows.
Since no single classifier outperforms the others in all data sets, statistical analyses are needed in the comparison. As the data sets are not randomly chosen, all statistical results apply only to an imaginary population of which the data sets are representative. For example, for the 36 data sets, the largest size of the training data is 20000. Extrapolating the results to data exceeding this limit is unsafe. Sixteen of the 36 data sets have $K = 2$. The results are therefore biased towards two-class classification. Some data sets have a common source. Such common sources have a larger impact on the comparison results.
To sketch out a general picture of the accuracy rates, two descriptive performance statistics are computed for each classifier. One is the average accuracy rate displayed in the third row from the bottom of Table 2. It is the most fundamental summary measure as its interpretation does not depend on what other classifiers are included in the study. The other measure is the mean rank which is listed in the second row from the bottom of the same table. It is the average rank of the classifier, with rank 1 assigned to the method having the largest rate for a data set, and rank 10 to the method having the smallest rate. Mean rank is robust to extraordinary accuracy rates.
Figure 2 presents a paired bar chart for the two measures with classifiers arranged in the descending order of the average classification accuracy rates. The two reference lines in each bar chart are the 95% simultaneous confidence bounds for the corresponding measure under the assumption that all methods perform equally well. They are computed from 9999 random permutations of the data. As there are bars lying outside the confidence bounds in the charts, the equal performance hypothesis should be rejected at the 5% significance level. NB and TAN are likely to be inferior to the other methods. LCWNB5 and LCWNB10 perform well in both measures. They are the best two, with WAODE the third best.
The first purpose of this study is to suggest an appropriate $m$-value. Nonparametric tests are performed to compare the three values. The $p$-values for the Friedman test and the Iman and Davenport modification (Iman and Davenport, 1980) are 0.2366 and 0.2393 respectively. The $p$-value for the Quade test (Quade, 1979) is 0.1449. None of the tests rejects the hypothesis that the three choices perform equally well at the 5% significance level. We get the same conclusion from the one-sided Wilcoxon tests when we investigate whether LCWNB5 is superior to LCWNB10 and LCWNB20.
When we rank these three methods, it is found that LCWNB10 is usually ranked second. Its accuracy rate lies between those of the other two methods in 27 out of 36 data sets. The $p$-value for this pattern is . LCWNB5 is ranked 1 in 18 out of the 36 data sets. The $p$-value for it is . Both $p$-values suggest that the three methods behave differently, contradicting the results of the Friedman and related tests. This inconsistency can be explained when we discover that LCWNB10 is usually ranked 2 while LCWNB5 and LCWNB20 are commonly ranked 1 or 3. As a result, their mean ranks are close to each other, but the distributions of ranks are different.
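One way such pattern probabilities could be obtained is from a binomial tail. The sketch below assumes, purely for illustration, that under equal performance each of the three methods is equally likely to hold a given rank in each data set and that the 36 data sets are independent; the text does not state which calculation was actually used.

```python
from math import comb


def binomial_upper_tail(n, k, prob):
    """P(B >= k) for B ~ Binomial(n, prob)."""
    return sum(comb(n, i) * prob ** i * (1 - prob) ** (n - i) for i in range(k, n + 1))


# Chance that one method is ranked second in at least 27 of 36 data sets.
p_middle = binomial_upper_tail(36, 27, 1 / 3)
# Chance that one method is ranked first in at least 18 of 36 data sets.
p_first = binomial_upper_tail(36, 18, 1 / 3)
```

Under this assumption both tails are small, and the first is far smaller than the second, which is in line with the remark that the rank distributions differ even though the mean ranks are close.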
It is of interest to inspect closely the performance of the three choices of $m$ when the data set characteristics are taken into account. Figure 3 shows the ranks of the LCWNB for the three $m$-values when the data sets are arranged in ascending order of $n$, $p$ and $K$.
While no obvious pattern is found in the top and the bottom panels of Figure 3, the middle panel displays a trend in the ranks. LCWNB5 is ranked 1 in the seven data sets with the largest $p$ and ranked 3 in seven of the ten data sets with the smallest $p$. The probability of observing a pattern as or more extreme than this one is . A similar but reversed pattern is found for LCWNB20. It suggests that we should use LCWNB20 when $p$ is small, and gradually reduce the value of $m$ when $p$ increases. Based on the empirical data, the optimal switching rule that minimizes the mean rank is to use LCWNB20 when , use LCWNB10 when or 16, and use LCWNB5 when . This close relation between $p$ and $m$ is not surprising because the smallest possible value of the weight is $\lambda_y^{p}$. The larger the $p$, the smaller the weight is expected to be.
If we have to fix $m$ to a single value, $m = 5$ is a reasonable choice. Let us use LCWNB5 as a standard and compare it with the other nine methods. A paired t-test is conducted for each data set and each of the nine methods. Solid dots and hollow dots are added in Table 2 to indicate whether the t-test shows a significant improvement or degradation (when LCWNB5 is compared to the method) respectively. The bottom row of the table summarizes the results of the 324 (= 36 × 9) tests by showing the frequencies of the following four categories: (1) LCWNB5 is significantly worse than the method; (2) LCWNB5 is worse than the method, but the difference is not significant; (3) LCWNB5 is better than the method, but the difference is not significant; and (4) LCWNB5 is significantly better than the method. A tie is counted as 0.5 in categories 2 and 3. The significance level used in the tests is 5%. The abbreviations, s.w./m.w./m.b./s.b., in the last row of Table 2 stand for “significantly worse”/“marginally worse”/“marginally better”/“significantly better.” As $P(B \ge 23) = 0.0662$ for $B \sim \mathrm{Bin}(36, 0.5)$, we accept the alternative hypothesis that a method is worse than LCWNB5 at the 6.62% significance level if (m.b. + s.b.) is 23 or larger. Dots are added in the table to NB, TAN and HNB to indicate that they are significantly worse than LCWNB5 in the above test.
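The 6.62% level quoted above is the upper tail of a fair-coin binomial distribution over the 36 data sets, which can be checked exactly. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def sign_test_level(n_datasets, threshold):
    """Exact P(B >= threshold) for B ~ Binomial(n_datasets, 1/2)."""
    favourable = sum(comb(n_datasets, i) for i in range(threshold, n_datasets + 1))
    return Fraction(favourable, 2 ** n_datasets)


level = float(sign_test_level(36, 23))
```

Raising the threshold by one roughly halves the level, so 23 is the smallest count that keeps the test close to the conventional 5% without dropping far below it.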
Nonparametric tests for LCWNB5 and the seven existing methods are performed. Again NB and TAN are found significantly inferior to the other methods. We exclude NB and TAN from the study and compare LCWNB5 with the remaining methods. The test statistic is the mean rank of LCWNB5. We would accept that LCWNB5 is superior if the observed mean rank 3.0556 of LCWNB5 is too small to be explained by chance under the equal performance assumption. Let $r_{ij}$ be the rank of method $j$ in the $i$-th data set
where . The (one-sided) $p$-value associated with the average rank of LCWNB5 is showing that the mean rank of LCWNB5 is significantly small if the level is set at 6%.
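A rough check of the mean-rank test is possible with a normal approximation. This is an assumption, not necessarily the statistic used in the paper: if the six remaining methods perform equally well and there are no ties, the rank of a given method in one data set is uniform on 1 to 6, with mean $(k + 1)/2$ and variance $(k^2 - 1)/12$ for $k = 6$, and the 36 data sets are treated as independent.

```python
from math import sqrt
from statistics import NormalDist


def mean_rank_p_value(observed_mean_rank, n_methods, n_datasets):
    """One-sided p-value for a small mean rank under equal performance."""
    null_mean = (n_methods + 1) / 2
    null_var = (n_methods ** 2 - 1) / (12 * n_datasets)
    z = (observed_mean_rank - null_mean) / sqrt(null_var)
    return NormalDist().cdf(z)


p_value = mean_rank_p_value(3.0556, n_methods=6, n_datasets=36)
```

Under these assumptions the approximation gives a one-sided $p$-value just under 0.06, which agrees with the 6% level mentioned in the text.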
We have discovered that it is better to choose $m$ according to the number of attributes in the data. We denote the corresponding LCWNB method by LCWNB* when the switching rule mentioned before is used. Applying the same test to compare LCWNB* with AODE, WAODE, HNB, LWNB and ICLNB, the (one-sided) $p$-value is 0.01559. It shows that LCWNB* is significantly better than the other methods at the 2% level.
5 Conclusion
In this paper, a new locally weighted classifier is proposed. It is closely related to the methods that use only the nearest neighbors of the test instance (Xie et al, 2002; Frank et al, 2003; Jiang et al, 2005). Their methods differ from ours in three ways. First, we control the expected sample size rather than the number of instances with positive weight. Second, their weighting functions are not compatible with NB. Their probability estimator is derived under an inaccurate model even when the CIA holds in the unweighted data. Third, our weights depend on the value of $Y$ and the attributes, while their weights depend only on the attributes. This discrepancy can make a significant difference when the empirical distribution of $Y$ is highly uneven. Using the same weighting function for all $Y$-values can lead to an unreliable probability estimator for those $Y$-values with small relative frequency.
On the whole, LCWNB is simple and easy to understand. It is sound theoretically and performs well empirically. It improves NB by using a probability estimator that is robust to violations of the CIA, without making any additional assumption.
References
- Cerquides J, Lopez de Mantaras R (2005) Robust Bayesian linear classifier ensembles. In: Proceedings of the Sixteenth European Conference on Machine Learning, pp 72–83
- Cestnik B (1990) Estimating probabilities: a crucial task in machine learning. In: Proceedings of the Ninth European Conference on Artificial Intelligence, Pitman, London, pp 147–149
- Frank A, Asuncion A (2010) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
- Frank E, Hall M, Pfahringer B (2003) Locally weighted naive Bayes. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp 249–256
- Friedman N, Goldszmidt M (1996) Building classifiers using Bayesian networks. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp 1277–1284
- Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine Learning 29(2):131–163
- Gama J, Rocha R, Medas P (2003) Accurate decision trees for mining high-speed data streams. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 523–528
- Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications in Statistics 9(6):571–595
- Jiang L, Zhang H, Su J (2005) Instance cloning local naive Bayes. In: Proceedings of the Eighteenth Conference of the Canadian Society for Computational Studies of Intelligence, pp 280–291
- Kohavi R (1996) Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp 202–207
- Liu JS (2001) Monte Carlo strategies in scientific computing. Springer-Verlag, New York
- Quade D (1979) Using weighted rankings in the analysis of complete blocks with additive block effects. Journal of the American Statistical Association 74(367):680–683
- Webb GI, Boughton JR, Wang Z (2005) Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning 58(1):5–24
- Witten IH, Frank E, Hall MA (2011) Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, Burlington
- Xie Z, Hsu W, Liu Z, Lee ML (2002) SNNB: A selective neighborhood based naive Bayes for lazy learning. In: Proceedings of the Sixth Pacific-Asia Conference on KDD, pp 104–114
- Zhang H, Jiang L, Su J (2005) Hidden naive Bayes. In: Proceedings of the 20th National Conference on Artificial Intelligence, pp 919–924