1 Introduction
In the past few years, the bag-of-words (BoW) model has gained popularity in visual recognition thanks to its simplicity and efficiency [5, 10, 12, 22]. It usually works as follows: a set of local patches (for still images) or local spatio-temporal volumes (for videos) is extracted and represented by local descriptors. These descriptors are processed, for example by k-means clustering [5], to form a collection of visual words, which in turn forms a visual codebook. By assigning each local descriptor to the closest visual word (or multiple closest words), a histogram indicating the number of occurrences of each visual word is obtained to characterise an image or video sequence. Among all the factors of the BoW model, the visual codebook plays a pivotal role in determining recognition performance. Usually, a sufficiently large codebook (for example, up to thousands of visual words) has to be used to ensure satisfactory recognition performance.
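As a concrete illustration of the pipeline just described, the following minimal sketch (NumPy; the toy codebook and descriptors are ours, not from any cited implementation) builds a BoW histogram by hard-assigning each descriptor to its nearest visual word:

```python
import numpy as np

def build_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and
    count occurrences, yielding a bag-of-words histogram."""
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                       # hard assignment
    hist = np.bincount(nearest, minlength=len(codebook))
    return hist

# Toy example: 6 two-dimensional descriptors and a 3-word codebook.
rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
descriptors = codebook[rng.integers(0, 3, size=6)] + 0.05 * rng.standard_normal((6, 2))
h = build_histogram(descriptors, codebook)
```

In a real system the codebook would come from k-means on training descriptors and the histogram would then be normalised before classification.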
However, a large codebook can be unfavourable in some cases. For example, as indicated in [1], when localising an object in an image, the computational cost and memory requirement for generating the histogram of each candidate window is proportional to the codebook size. To model the interaction between visual words, the pairwise relationship among visual words is considered in [15]. However, the number of pairs increases quadratically with codebook size. In addition, a large codebook leads to a high-dimensional image representation, which could make many machine learning algorithms inefficient, unreliable, or even break down. Nevertheless, simply reducing the number of clusters in k-means clustering will quickly degrade recognition performance due to the loss of discriminative information. To handle this situation, one effective approach in the literature is to hierarchically merge the visual words of an initial large codebook while minimising the loss of discriminative information throughout the process [18, 13, 25]. In this paper, we focus on this method and call it “word-merging” for short. Essentially, word-merging can be regarded as a dimensionality reduction method. However, compared with general-purpose dimensionality reduction methods, word-merging enjoys two major advantages: i) performing dimensionality reduction by merging words is much faster. Let D and d be the dimension of the original image representation and the targeted dimension, respectively. The computational cost of word-merging is merely O(D), which corresponds to a linear scan of the dimensions. This is in sharp contrast to the O(Dd) required by commonly used linear-projection-based dimensionality reduction methods. This advantage makes word-merging an attractive option in computation- or memory-sensitive applications, such as object detection in [1]; ii) unlike linear-projection methods, which merge all dimensions via a weighted linear combination, word-merging methods partition all dimensions into mutually exclusive clusters and then combine them. This process maintains the “visual word” concept, which is important when modelling the spatial relationship between visual words [15] or when visualising “discriminative visual words” is needed.
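The computational contrast between the two styles of reduction can be sketched as follows (an illustrative toy example; the word-to-cluster assignment is hypothetical). Merging is a single O(D) scan that sums bins within each cluster, whereas a linear projection multiplies the histogram by a d × D matrix:

```python
import numpy as np

def merge_words(hist, assignment, d):
    """Reduce a D-dimensional histogram to d bins by summing the bins
    that belong to the same word cluster -- a single O(D) scan."""
    out = np.zeros(d)
    for k, bin_value in zip(assignment, hist):
        out[k] += bin_value
    return out

def linear_project(hist, W):
    """Generic linear dimensionality reduction: O(D*d) per histogram."""
    return W @ hist

D, d = 8, 3
hist = np.arange(1.0, D + 1.0)
assignment = np.array([0, 0, 1, 1, 1, 2, 2, 2])  # which cluster each word joins
compact = merge_words(hist, assignment, d)

# Merging is equivalent to projecting with a 0/1 indicator matrix,
# but without materialising or multiplying the full d x D matrix.
W = np.zeros((d, D))
W[assignment, np.arange(D)] = 1.0
proj = linear_project(hist, W)
```

The indicator-matrix view also makes plain why merged dimensions remain interpretable as visual words while general projections do not.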
In the literature, a number of previous studies have implemented the idea of hierarchical visual word merging with different models and criteria. In [15, 1], the mutual information between words and class labels is used to identify the optimal pair of words to merge at each level of the hierarchy. In [26], the scatter-matrix-based class separability is taken as the criterion for seeking the optimal pair of words to merge. The work of [27] differs from the previous work in that a more rigorous probabilistic model is used to merge visual words. In their work, the optimal pair is sought as the one whose merging makes the resulting histograms maximize the posterior probability of the true class labels. Nevertheless, as reported in [1, 26], the merging criterion of [27] often produces results inferior to those in [1, 26]. This is in sharp contrast to the expected power of a rigorous probabilistic model. In this work, we follow the basic probabilistic model in [27] and discuss its two key factors: the function used to model the class-conditional distribution and the method used to estimate the distribution parameters. The difference between our work and [27] is that the two key factors are fixed in [27], whereas they are treated as flexible components in our work. As will be seen, this difference is critical because varying these two factors can bring forth markedly different characteristics in the probabilistic model. By properly choosing different settings for the two factors, we achieve a generalized probabilistic framework for merging visual words. With our framework, we show that existing merging criteria can be viewed as special cases of the probabilistic model, with different combinations of class-conditional distributions and parameter estimation methods. More importantly, by exploring new combinations of class-conditional distribution and parameter estimation method, we are able to produce a spectrum of new merging criteria. In particular, three of them are explored in this work. The first adopts the same parameter estimation method (the Bayesian method) as [27] but replaces its distribution model with the multinomial distribution. The second combines a Gaussian distribution with maximum likelihood parameter estimation. In the third merging criterion, we propose a max-margin-based parameter estimation method and apply it with the multinomial distribution. Through an extensive experimental study, we compare the performance of the various merging criteria and analyse their differences. Moreover, we show that the third merging criterion produced by our framework achieves the overall best performance among comparable algorithms in the literature.
In sum, this work has made the following contributions:

We propose a generalized probabilistic framework for merging visual words, which unifies existing merging criteria as special cases with different class-conditional distributions and parameter estimation methods.

With this framework, we propose a new criterion by modelling each class with a multinomial distribution function. It achieves better recognition performance than the criterion originally proposed in [27].

With this framework, we explore the combination of the Gaussian distribution and maximum likelihood estimation to produce another merging criterion.

Based on this framework, we put forward a max-margin-based parameter estimation method, leading to another new criterion. It gives the overall highest recognition performance when compared with all the above word-merging criteria.
2 Related Work
This section reviews the supervised compact codebook creation methods in [1, 26, 27], with a focus on [27], which inspires our work. As shown in [26], compact codebook creation can essentially be cast as a large-scale discrete optimization problem, subject to a criterion related to the discriminative power of the resultant compact codebook. Due to the difficulty of efficient and global optimization, hierarchically merging visual words is often adopted in the literature. That is, two words are identified at each level of the hierarchy such that merging them optimizes a given criterion. Let B_K denote a visual codebook consisting of K words, and let B_{K-1}^{ij} be the resultant codebook after merging the i-th and j-th words. The corresponding histogram for the n-th sample is denoted by h_n, and its k-th bin is h_{nk}, where k = 1, ..., K-1. Also, c_n is the class label of a training sample. In this paper, the criteria in [1, 26, 27] are termed AIB, CSM and UVD for short, respectively.
AIB: In [1], the mutual information, I(B_{K-1}^{ij}; C), between B_{K-1}^{ij} and the class labels C is used to measure its discriminative power as
(1)   $I(B_{K-1}^{ij};\, C) = \sum_{k=1}^{K-1} \sum_{c} p(b_k, c)\, \log \frac{p(b_k, c)}{p(b_k)\, p(c)}$
where b_k denotes the k-th word of B_{K-1}^{ij}, and p(b_k, c) and p(b_k) are estimated with the k-th bins of the training histograms. At each level of the hierarchy, the words i and j whose merging maximizes I(B_{K-1}^{ij}; C) are identified and merged. As noted in [1], this criterion can be related to the agglomerative information bottleneck [23].
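This criterion can be sketched with the standard definition of mutual information over a joint word/class distribution (the toy distribution below is ours, not from [1]); merging two words can only lose, never gain, mutual information:

```python
import numpy as np

def mutual_information(P):
    """I(W; C) for a joint distribution P over (word, class) pairs."""
    pw = P.sum(axis=1, keepdims=True)        # marginal over words
    pc = P.sum(axis=0, keepdims=True)        # marginal over classes
    nz = P > 0                               # skip zero cells (0 log 0 = 0)
    return (P[nz] * np.log(P[nz] / (pw @ pc)[nz])).sum()

def merge_loss(P, i, j):
    """Drop in mutual information caused by merging words i and j."""
    merged = np.delete(P, j, axis=0)
    merged[i if i < j else i - 1] = P[i] + P[j]
    return mutual_information(P) - mutual_information(merged)

# Joint word/class distribution for a 3-word, 2-class toy problem.
P = np.array([[0.2, 0.1],
              [0.1, 0.2],
              [0.2, 0.2]])
```

AIB would evaluate `merge_loss` for every candidate pair and merge the pair with the smallest loss.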
CSM: In [26], the scatter-matrix-based class separability is used to measure the goodness of B_{K-1}^{ij} as
(2)   $\mathrm{tr}(S_w)\,/\,\mathrm{tr}(S_t)$
where S_w and S_t are the within-class scatter matrix and the total scatter matrix, respectively, and tr(·) denotes the trace of a matrix. They are computed with the training histograms. At each level, the words i and j whose merging minimizes tr(S_w)/tr(S_t) are identified and merged. (To facilitate the subsequent analysis, we use the minimization of tr(S_w)/tr(S_t) here; because of the identity S_t = S_w + S_b, where S_b is the between-class scatter matrix, this is equivalent to the maximization used in [26].)
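The scatter-ratio computation can be sketched as follows (illustrative toy data; using the standard trace-of-scatter definitions, where a small ratio indicates well-separated classes):

```python
import numpy as np

def csm_ratio(X, y):
    """tr(S_w) / tr(S_t): within-class over total scatter, computed
    from training histograms X (one row per sample) and labels y."""
    mu = X.mean(axis=0)
    St = ((X - mu) ** 2).sum()               # trace of the total scatter
    Sw = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        Sw += ((Xc - Xc.mean(axis=0)) ** 2).sum()
    return Sw / St

# Two well-separated classes -> small ratio (high separability).
X = np.array([[1.0, 0.0], [1.1, 0.1], [0.0, 1.0], [0.1, 1.1]])
y = np.array([0, 0, 1, 1])
r = csm_ratio(X, y)
```

CSM evaluates this ratio on the histograms that would result from each candidate merge and keeps the merge with the smallest value.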
UVD: In [27], the posterior probability of the true class labels conditioned on the training histograms is proposed to measure the discriminative power of B_{K-1}^{ij}. Let C = {c_1, ..., c_N} be the label set of the training samples, and let H = {h_1, ..., h_N} be the set of training histograms obtained with B_{K-1}^{ij}. Using Bayes' theorem, this posterior probability is computed as
(3)   $p(C\,|\,H) = \dfrac{p(H\,|\,C)\, p(C)}{\sum_{C'} p(H\,|\,C')\, p(C')}$
where p(H | C) is the likelihood of the training histograms conditioned on the true label configuration C, and p(H | C') is the likelihood conditioned on any one of the possible label configurations C'. Due to the difficulty of enumerating all possible configurations, [27] approximates the denominator with two configurations only: the true configuration C and a special configuration C_0 in which all training samples have the same class label. Assuming an equal prior over these two configurations, this gives:

(4)   $p(C\,|\,H) \approx \dfrac{p(H\,|\,C)}{p(H\,|\,C) + p(H\,|\,C_0)}$

Thus, maximizing p(C | H) is (approximately) equivalent to maximizing the ratio p(H | C) / p(H | C_0). The likelihood is computed as
(5)   $p(H\,|\,C) = \prod_{c} \int p(H_c\,|\,\theta_c)\, p(\theta_c)\, d\theta_c$
where p(· | θ_c) is the class-conditional distribution for class c, θ_c its parameter set, and H_c the set of all training samples in class c. In [27], p(· | θ_c) is modeled as a Gaussian distribution. (As suggested in [27], the square root of each bin of h_n is used to better fit the Gaussian distribution assumption.) A conjugate Gaussian-gamma prior is defined over the Gaussian parameters, governed by a set of hyperparameters. Assuming the independence of different bins and i.i.d. samples in each class, the above likelihood is obtained as
(6) 
where h_{nk} is the k-th bin of the histogram h_n, and (μ_{ck}, σ²_{ck}) is the parameter set (mean and variance) for the k-th bin in class c. Since the Gaussian-gamma distribution is the conjugate prior of the Gaussian, the integral can be analytically worked out. At each level of the hierarchy, the pair of words i and j whose merging maximizes the criterion is identified and merged.

3 The proposed Generalized Probabilistic Framework
In this paper, we take the basic formulation in Eq. (3) and develop it into a general probabilistic framework. Any algorithm taking such a formulation needs to determine two key factors: i) how to model the class-conditional distribution p(· | θ_c) in Eq. (5) (in this section, we drop the superscript ij in B_{K-1}^{ij}; all the calculation is now at the level K−1 unless indicated otherwise); ii) how to estimate the model parameter θ_c. As shown in Section 2, UVD [27] models p(· | θ_c) with a Gaussian distribution and uses the Bayesian method to marginalize out the model parameter θ_c. The effect of θ_c is averaged with a Gaussian-gamma prior, and its value is not explicitly estimated.
Figure 1 illustrates the proposed probabilistic framework. By setting the two factors in different ways, the framework not only accommodates the existing criteria UVD, AIB and CSM, but also produces a matrix of new criteria. Three of them, called MLT, GMLE and MME for short, will be investigated.
In the following sections, we first interpret existing methods from the viewpoint of our framework. More specifically, after a brief interpretation of UVD in Section 3.1, we show in Section 3.2 that AIB is a special case of our framework that chooses the two factors as the multinomial distribution and maximum likelihood estimation. In Section 3.3, we show that CSM can be (approximately) interpreted as a special case of our framework that chooses the two factors as the Gaussian distribution and maximum likelihood estimation. A discussion of the impact of the two factors is given in Section 3.4. After that, we propose three new merging criteria in Sections 3.5 to 3.7. From now on, we define the merging criterion as the ratio p(H | C) / p(H | C_0) and use it throughout the following sections.
3.1 UVD [27]: Gaussian distribution + Bayesian method (Gaussian-gamma prior)
Our framework is inspired by the formulation of UVD, and therefore UVD naturally fits it. UVD uses a Gaussian distribution to model the image representation and employs the Bayesian method for parameter estimation. As mentioned above, UVD does not explicitly estimate the model parameters. Instead, it treats them as random variables and models their distribution through a prior distribution with a set of hyperparameters.
3.2 AIB [1]: Multinomial distribution + Maximum Likelihood Estimation
The multinomial distribution has been widely used in the literature to model the occurrence of words in a document. (Strictly speaking, the case in AIB is not exactly a multinomial distribution, and calling it a categorical distribution may be more precise. However, these two terms are usually used interchangeably in text analysis, and we follow this convention in this paper.) With the multinomial distribution, the conditional probability of a histogram h_n is modelled as
(7)   $p(h_n\,|\,\theta_c) \propto \prod_{k=1}^{K-1} \theta_{ck}^{\,h_{nk}}$
Assuming the i.i.d. property of samples and plugging this distribution model into our framework, we obtain p(H | C) as
(8)   $p(H\,|\,C) \propto \prod_{c} \prod_{n \in H_c} \prod_{k=1}^{K-1} \theta_{ck}^{\,h_{nk}}$
Thus, the merging criterion becomes:
(9)  
where \bar{h}_{ck} denotes the mean of the k-th bin in class c. With the training samples, it is not difficult to obtain the MLE of the model parameters as
(10)   $\hat{\theta}_{ck} = \bar{h}_{ck} \,/\, \textstyle\sum_{k'} \bar{h}_{ck'}$
Note that for the configuration C_0 the parameters are estimated from all training samples pooled together, because all samples are assumed to be in the same class in that configuration. In AIB [1], the terms p(H | C) and p(H | C_0) are computed in the same way as in Eq. (10). (This can be seen in the code provided in [24].) Also, AIB computes the joint probability p(b_k, c) as
(11) 
Note that the denominator keeps constant when merging different words at a given level. Substituting Eq. (11) into Eq. (9) and dropping the constant, we recover the AIB criterion in [1] because
(12) 
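The per-class MLE described above can be sketched as follows (toy counts; the normalisation step is shown explicitly for unnormalised histograms, and reduces to a plain per-class average when the histograms are already L1-normalised):

```python
import numpy as np

def multinomial_mle(H, y):
    """MLE of the per-class word probabilities: for each class c, the
    estimate is simply the (normalized) average training histogram."""
    classes = np.unique(y)
    P = np.zeros((len(classes), H.shape[1]))
    for idx, c in enumerate(classes):
        mean_hist = H[y == c].mean(axis=0)
        P[idx] = mean_hist / mean_hist.sum()
    return P

# Toy word-count histograms for two classes of two samples each.
H = np.array([[4.0, 1.0, 0.0],
              [2.0, 2.0, 1.0],
              [0.0, 1.0, 4.0],
              [1.0, 1.0, 3.0]])
y = np.array([0, 0, 1, 1])
P = multinomial_mle(H, y)
```

The estimate for the pooled configuration C_0 would be obtained the same way with all samples given one label.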
3.3 CSM [26]: Gaussian distribution + Maximum Likelihood Estimation
By modelling the training data with a Gaussian distribution and plugging it into our framework, we arrive at the criterion shown below,

(13)
where μ_c and Σ_c denote the mean and the covariance matrix for class c, and N_c is the number of training samples in class c. The criterion then becomes
(14)  
where μ and Σ denote the total mean and the covariance matrix of all the data. Assuming that Σ_c = σ²I and Σ = σ²I, Eq. (14) can be simplified as
(15)  
where S_w and S_t are the within-class scatter matrix and the total scatter matrix defined in [26]. The criterion in Eq. (15) strongly connects with the ratio tr(S_w)/tr(S_t) used in [26]. Minimizing tr(S_w)/tr(S_t) is a fractional programming problem. It can be effectively solved by Dinkelbach's algorithm [21], which iteratively minimizes tr(S_w) − λ tr(S_t), where λ is the ratio of tr(S_w) to tr(S_t) at the last iteration.
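Dinkelbach's scheme can be sketched generically as follows (a toy ratio over a finite candidate set, standing in for the word-pair search; the numerator and denominator functions are illustrative):

```python
def dinkelbach(candidates, f, g, tol=1e-9):
    """Dinkelbach's scheme for minimizing f(x)/g(x) (with g > 0) over a
    finite candidate set: repeatedly minimize f(x) - lam * g(x) and
    update lam with the ratio attained at the minimizer."""
    x = candidates[0]
    lam = f(x) / g(x)
    while True:
        x = min(candidates, key=lambda c: f(c) - lam * g(c))
        new_lam = f(x) / g(x)
        if abs(new_lam - lam) < tol:
            return x, new_lam
        lam = new_lam

# Toy fractional objective over integer candidates.
xs = list(range(1, 10))
best, ratio = dinkelbach(xs, f=lambda v: (v - 4) ** 2 + 2, g=lambda v: v)
```

Each iteration solves a linearised subproblem, and the ratio sequence is monotonically non-increasing until it converges to the optimum over the candidate set.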
3.4 Discussion on the two key factors
Parameter estimation. In UVD, parameter estimation is implicitly handled through the Bayesian method. The performance of the Bayesian method highly depends on the choice of the prior distribution and its hyperparameters. In practice, for the sake of computational feasibility, the hyperparameters are usually set empirically, and the same set of hyperparameters is often applied to all classes. This can negatively affect the practical performance of the Bayesian method. As a result, the Bayesian method does not necessarily outperform approaches that explicitly estimate model parameters from the training data, for example, maximum likelihood estimation (MLE).
Distribution model. If the true distribution of the data were known, we could employ it in our framework and produce a merging criterion of high quality. In practice, however, we do not have such information and have to rely on prior knowledge to choose the distribution model. The appropriateness of the chosen model plays a pivotal role.
From the three existing merging criteria discussed above, two distributions are employed, namely the multinomial distribution and the Gaussian distribution. Note that in the literature, the BoW model originates from document analysis, in which a histogram of words is usually modelled by a multinomial distribution [2]. In this sense, the multinomial distribution seems a suitable choice for modelling histogram-based image representation. On the other hand, with the recent development of the bag-of-features model, local features are usually sampled on a dense spatial grid [11]. This operation in effect reduces the sparsity of the histogram and may change the underlying distribution of the training data. In addition, post-processing such as the square-root operation [27] on the histogram can also alter the characteristics of the distribution of the training data. (For example, as indicated in UVD [27], the square-root operation has the effect of making the data distribution more Gaussian-like.) In this work, we find that the Gaussian distribution sometimes also results in a good merging criterion when the image representation is obtained using dense sampling and the square-root operation. Note that in our previous study on this framework [16], these settings were not considered.
3.5 MLT: Multinomial distribution + Bayesian Method (Dirichlet prior)
In this section, we first propose to use the multinomial distribution and a Dirichlet prior to replace the Gaussian distribution and the Gaussian-gamma prior in UVD [27]. This produces a new merging criterion called MLT, which can outperform UVD when the data is better characterized by a multinomial distribution.
In MLT, p(H | C) is still modelled as in Eq. (5), but the likelihood and the prior terms become:
(16)   $p(h_n\,|\,\theta_c) \propto \prod_{k=1}^{K-1} \theta_{ck}^{\,h_{nk}}, \qquad p(\theta_c) = \mathrm{Dir}(\theta_c\,|\,\alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K-1} \theta_{ck}^{\,\alpha_k - 1}$
where θ_{ck} is the model parameter representing the probability of the k-th word occurring in class c, B(α) is the multinomial Beta function, and α is the hyperparameter of the Dirichlet prior. Substituting Eq. (16) into Eq. (5), we can derive that
(17)   $p(H\,|\,C) = \prod_{c} \frac{B(\alpha + m_c)}{B(\alpha)}$
where we define m_{ck} = Σ_{n ∈ H_c} h_{nk} for class c and m_c = (m_{c1}, ..., m_{c,K−1}).
Note that the integral in Eq. (5) can be analytically worked out in this case because the Dirichlet distribution is the conjugate prior of the multinomial distribution. In this way, the proposed MLT criterion is obtained as
(18) 
Recall that the merging criterion is the ratio p(H | C) / p(H | C_0). At each level of the hierarchy, the pair of words i and j whose merging maximizes this ratio is identified and merged.
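The analytic marginalisation behind MLT can be sketched with a symmetric Dirichlet prior (the pooled counts and the α value below are illustrative; the log of the Dirichlet-multinomial marginal is written with log-gamma functions for numerical stability):

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of pooled word counts under a multinomial
    model with a symmetric Dirichlet(alpha) prior, obtained by
    integrating the word probabilities out analytically."""
    K = len(counts)
    n = sum(counts)
    out = lgamma(K * alpha) - lgamma(K * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out

# Pooled counts for one class before and after merging words 0 and 1.
counts = [5, 3, 2]
merged = [5 + 3, 2]
delta = log_dirichlet_multinomial(merged, 1.0) - log_dirichlet_multinomial(counts, 1.0)
```

MLT would compare such log-marginals (for the true and pooled label configurations) across all candidate merges at each level.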
3.6 GMLE: Gaussian distribution + Maximum Likelihood Estimation
From Section 3.3, we can see that CSM [26] simply treats the covariance matrices as scaled identity matrices and uses a formulation that only approximately connects with the proposed framework, as shown in Eq. (15). In this section, we propose another merging criterion, called GMLE for short, which strictly follows the proposed framework. For the sake of reliable parameter estimation and the computational efficiency of GMLE, we assume that the covariance matrices are diagonal (but not identity matrices), that is, Σ_c = diag(σ²_{c1}, ..., σ²_{c,K−1}) and similarly for the total covariance. By doing so, we can rewrite the criterion in Eq. (14) as:
(19)  
Note that Eq. (19) is a summation of terms, each depending on a single dimension. For a given merging pair i and j, the criterion value can therefore be efficiently re-evaluated by updating only the terms involving i and j, which significantly reduces the computational cost.
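The per-dimension bookkeeping can be sketched as follows (the per-dimension term used here, a within-class sum of squared deviations, is an illustrative stand-in for the actual terms of Eq. (19); the point is that the fast incremental update matches a full recomputation):

```python
import numpy as np

def per_dim_terms(X, y):
    """Per-dimension contributions to a diagonal-covariance criterion:
    here, the within-class sum of squared deviations per dimension
    (an illustrative stand-in for the true per-dimension terms)."""
    terms = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        terms += Xc.var(axis=0) * len(Xc)
    return terms

def criterion_after_merge(X, y, terms, i, j):
    """Re-evaluate the criterion after merging bins i and j by touching
    only the affected terms, instead of recomputing every dimension."""
    merged_col = (X[:, i] + X[:, j])[:, None]
    new_term = per_dim_terms(merged_col, y)[0]
    return terms.sum() - terms[i] - terms[j] + new_term

X = np.abs(np.random.default_rng(1).standard_normal((10, 5)))
y = np.array([0] * 5 + [1] * 5)
terms = per_dim_terms(X, y)
fast = criterion_after_merge(X, y, terms, 0, 1)

# Brute force: actually merge the columns and recompute everything.
Xm = np.column_stack([X[:, 0] + X[:, 1], X[:, 2:]])
slow = per_dim_terms(Xm, y).sum()
```

The incremental path touches two dimensions instead of all K−1, which is what makes scanning every candidate pair affordable.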
3.7 MME: Multinomial distribution + MaxMargin Parameter Estimation
The maximum likelihood estimation (MLE) of model parameters still has potential drawbacks. Due to its generative nature, it prevents us from using more information in the training data. In particular, when the multinomial distribution is employed, the MLE of its parameters is determined solely by the average histogram per class, and higher-order statistics such as the variances of visual words are completely neglected. Thus the creation of a compact codebook does not fully exploit the information in the training data, and consequently the resulting performance may be less satisfying. In the literature, this phenomenon is known as the exchangeability property [2]. One way to overcome this drawback is to adopt more complex distributions, for example the multivariate Polya distribution [17]. However, this leads to intractable computation because there is usually no analytical MLE for the parameters of these complex models. Another disadvantage of MLE is that the estimation can become unreliable when training samples are scarce or when many less discriminative visual words exist. MLE cannot effectively identify the discriminative words since the parameters are estimated from the data of each class individually. This limits the performance of the created compact codebooks.
To improve this situation, we propose a new Max-Margin parameter Estimation (MME) scheme for merging visual words. The idea is to seek the model parameters that maximize the margin of the posterior probability ratio of the true class label over all other possible labels, under certain regularization. The disadvantages of MLE mentioned above are removed because (i) the parameter estimation now considers all training samples from different classes together, and (ii) the max-margin principle emphasizes discriminative features. In the remainder of this section, we first present a detailed derivation of the proposed max-margin parameter estimation formulation. We then discuss how to solve the resulting optimisation problem and its implementation.
3.7.1 Problem formulation
We still model p(h | θ_c) by a multinomial distribution. The posterior probability ratio for the n-th training sample is defined as
(20)  
where c_n is the true label of sample n and c' is one of the other possible labels. Note that this ratio takes the form of a linear classifier if we treat the logarithms of the model parameters as variables, although the parameters to estimate are the multinomial parameters and the class priors. This ratio reflects how confidently a sample is classified into its ground-truth class, and a large ratio is preferred. The idea of max-margin parameter estimation can be intuitively understood as maximizing the lowest ratio over all pairs of samples n and competing labels c', that is, maximizing the minimum confidence score. However, merely optimizing this ratio could lead to severe overfitting, because the ratio can always be increased by driving some parameters towards zero. To avoid such a situation, we introduce a regularization term
(21) 
where a positive constant controls the relative strength of the regularization. This term attains its minimum at a uniform setting of the parameters; thus, it prefers a uniform estimation with respect to the different classes, which is consistent with the principle of maximum entropy [9]. Inspired by the margin definition in SVM [8], we formally define the margin of the posterior probability ratio as the ratio between the minimal posterior probability ratio and the regularization term:

(22)
The proposed max-margin parameter estimation then aims to maximize this margin over the model parameters. Note that jointly scaling the parameters does not change the value of the margin. As a result, the solution of this margin maximization problem is not unique, and different solutions are connected through a scaling factor. Without loss of generality, we can obtain one of these solutions by simply setting the regularization term to be a positive constant, that is,
(23) 
When this constant is fixed, the margin maximization problem becomes:
(24) 
Let us define new variables from the logarithms of the model parameters. By rescaling them (which is possible in the linearly separable case), it is easy to rewrite the above maximization problem as
Furthermore, by collecting these variables into vectors, the problem can be expressed in a compact form as
It is worth mentioning that if our interest were only to learn a model for classification, we could simply treat the rescaled quantities as variables, since the scaling does not change the decision function. This is why the scaling factor is usually ignored in max-margin learning problems, e.g. SVMs. However, our goal is to estimate the underlying distribution parameters, for which the scaling factor does affect the estimation. Hence, it has to be explicitly considered in our case.
3.7.2 Problem solution and implementation
It is not difficult to see that the problem derived in Section 3.7.1 is similar to a linear SVM. In fact, if we consider binary classification (the focus of this paper) and add slack variables to handle the non-separable case, it reduces to a standard binary linear SVM with several additional constraints, that is:
(27) 
where shorthand notation is introduced for simplicity. The first two constraints are identical to those in the standard SVM. The third and fourth constraints establish the relationship between the SVM solution and the multinomial distribution parameters via the scaling factor discussed in subsection 3.7.1. The last two constraints come from the properties of probability; they are needed to make the variables properly bounded. In this work, the quantities not determined by the SVM solution are simply obtained via MLE.
At first glance, solving this optimization problem is hard since it involves many nonlinear constraints. However, we show that under mild assumptions, the problem in Eq. (27) can be solved in two stages. In the first stage, we consider only the first two constraints to construct a subproblem and obtain its solution via an off-the-shelf SVM solver. In the second stage, we calculate the probability parameters from the third and fourth constraints. The key insight is that, for a given SVM solution and under a mild assumption, we can always find corresponding parameter values that satisfy the remaining constraints. The detailed analysis is presented in Appendix A.
After obtaining these model parameters, we can readily apply them to the multinomial distribution to compute the merging criterion and identify the optimal pair of words to merge at each level of the hierarchy.
Implementation

To evaluate the criterion for a pair of words i and j, we need to calculate the class-conditional probability for the merged word. If strictly following the max-margin parameter estimation, we would have to re-estimate the parameters by solving Eq. (27) for each possible pair of i and j, incurring a computational cost on the order of O(K²) SVM trainings at level K. Even with a highly efficient SVM solver, this repeated re-estimation would be too time-consuming. In practice, we adopt a compromise: the max-margin estimation is carried out only once at each level, after the optimal pair of words has been identified. In the course of identifying the optimal pair, a simple updating formula (summing the parameters of the two words) is used for the merged word. That is, the underlying criterion for identifying the optimal merging pair at each level is the same as that used in AIB, while the model parameters are estimated via the proposed max-margin estimation scheme. Experimental study shows that this strategy works very well in practice.

In our implementation, we use LIBSVM to solve Eq. (27). We set the scaling factor so as to be consistent with the formulation used in LIBSVM. Also, since LIBSVM solves the SVM in its dual form, we use a precomputed kernel as the input to the LIBSVM interface. At each level of the hierarchy, two words i and j are identified and merged. This leads to an update of the kernel matrix, which can be efficiently calculated as

(28)   $K_{\mathrm{new}} = K + x_i x_j^{\top} + x_j x_i^{\top}$

where K is the kernel matrix at the current level, and x_i denotes the column vector collecting the i-th bin of all training histograms; x_j is defined in a similar way. Note that this update is efficient since x_i and x_j are merely two column vectors.
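The rank-two kernel update can be verified numerically as follows (toy data; x_i denotes the column of the i-th bin over all samples, as in the text):

```python
import numpy as np

def merge_kernel_update(K, X, i, j):
    """Rank-two update of a precomputed linear kernel when histogram
    bins i and j are merged: only the outer products of the two
    affected columns are needed."""
    xi, xj = X[:, i:i + 1], X[:, j:j + 1]
    return K + xi @ xj.T + xj @ xi.T

rng = np.random.default_rng(2)
X = rng.random((6, 4))            # 6 samples, 4 visual words
K = X @ X.T                       # precomputed linear kernel
K_fast = merge_kernel_update(K, X, 1, 3)

# Brute force: merge the columns and rebuild the kernel from scratch.
Xm = np.column_stack([X[:, 0], X[:, 1] + X[:, 3], X[:, 2]])
K_slow = Xm @ Xm.T
```

Because merging replaces the two columns by their sum, the kernel changes only by the cross outer-product terms, so no full kernel recomputation is needed per level.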
4 Experimental Results
To examine the effectiveness of our framework and the impact of the two factors identified in it, we conduct a number of experiments in this section. The goodness of a compact codebook is evaluated by its performance on two applications: 1) Building a compact representation for image classification. In this application, the aim is to create a compact image representation that largely maintains the discriminative power of the initial codebook. The performance of a word-merging method is evaluated by the classification performance with respect to the reduced codebook size. 2) Using a compact codebook for efficient pixel-level object detection. This is an application in which the use of a compact codebook can significantly reduce the computational complexity. The aim of using this application for evaluation is to see whether the newly proposed methods can achieve better performance than the traditional ones in a real-world application.
The experiments are organized into two parts. The first part is based on the first application and demonstrates the impact of the two key factors identified in our framework. More specifically, we conduct three experiments in this part.

(1) The evaluation of MLT. In this experiment, we focus on the comparison between MLT and UVD. This comparison aims to show the importance of choosing an appropriate distribution model in our framework.

(2) The evaluation of GMLE. This experiment focuses on the comparison between GMLE and UVD. This comparison aims to validate the use of MLE as an appropriate parameter estimation method in our framework.

(3) The evaluation of MME. The purpose of this experiment is to demonstrate the advantage of using maxmargin parameter estimation in our framework.
In the second part of our experiments, we further show the excellent performance of the proposed methods, especially MME, on the second application.
In the proposed MME method, there is a scaling factor which can be chosen freely within a range. To investigate its impact on the performance of MME, we also conduct theoretical and experimental analysis on the choice of its value.
Throughout the experiments, six methods induced from our framework are compared: AIB [1], UVD [27], CSM [26], MLT, GMLE and MME. We focus on the binary-class classification/detection setting; the multi-class case can be handled by a one-vs-rest decomposition.
Five datasets are used in our experiments: Caltech256 [7], PASCAL VOC2007, PASCAL VOC2012 [6], KTH [4] and Graz02 [19]. The first four are used for the evaluation of the image-level classification task, while the last is mainly used for the evaluation of the pixel-level object detection task. These datasets and their preprocessing details are described as follows:
(1) Caltech256. Caltech256 contains 256 object classes and one background class. For this dataset, we create 256 object-vs-background classification tasks, that is, the task is to discriminate the images containing the object from the background-class images. For each object-vs-background task, we randomly split the images into 10 training/test sets and report the average performance over all 10 splits. To obtain the bag-of-features image representation, we first densely sample patches with a step size of 8 pixels and describe them with the SIFT descriptor using the implementation in [14]. An initial codebook with 1024 visual words is then created by applying k-means clustering to the local features sampled from the training images. Finally, we use this codebook to create a histogram for each image. We normalize each histogram to make its ℓ1-norm equal to 1, to eliminate the effect of image-size differences. In our evaluation, we also apply a square-root operation to each histogram, since it usually boosts the classification performance significantly. A linear SVM is applied as the classifier, and we use LIBSVM [3] as the SVM solver.
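The normalisation and square-root preprocessing just described can be sketched as:

```python
import numpy as np

def preprocess_histograms(H):
    """L1-normalize each histogram (removing the effect of image-size
    differences) and take the element-wise square root, as done before
    training the classifier."""
    H = H / H.sum(axis=1, keepdims=True)
    return np.sqrt(H)

# Toy histograms of raw visual-word counts (one row per image).
H = np.array([[8.0, 2.0, 0.0],
              [1.0, 1.0, 2.0]])
Hp = preprocess_histograms(H)
```

After the square root, each row has unit squared norm, which is one way to see why this transform tends to suit Gaussian-style models better.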
(2) PASCAL VOC2007. PASCAL VOC2007 is a commonly used evaluation benchmark for image classification. It contains 20 object classes and, in the standard evaluation protocol, the task is to distinguish the images containing an object from those that do not. In our experiment, we follow this evaluation protocol and use the training/validation/test sets provided with the dataset. We learn a linear SVM classifier from the training set together with the validation set and evaluate the performance by mean average precision (mAP) on the test set. We use the same image representation extraction approach as for Caltech256, but with a larger codebook containing 4000 visual words.
(3) PASCAL VOC2012. PASCAL VOC2012 is the latest PASCAL VOC dataset. Apart from containing more images, the settings and evaluation protocol are identical to those used for PASCAL VOC2007. Since the test set has not been released, we use the validation set as the test set instead.
(4) KTH. KTH is a commonly used action recognition benchmark. It consists of six actions: boxing, hand-clapping, jogging, running, walking and hand-waving. These actions are performed by 25 subjects in various scenarios, e.g. different lighting conditions, clothes and viewpoints. In our experiment, we randomly choose the actions performed by 16 subjects as the training set and those performed by the remaining 9 subjects as the test set. We repeat this random partition ten times and report the average performance over the ten groups of training/test sets. Histogram-of-optical-flow and histogram-of-gradient local features are extracted at the interest points detected by the spatio-temporal interest point detector [12]. Following [12], we create a 4000-visual-word codebook and represent each video by a histogram of 4000 visual words. Six one-vs-rest classification tasks are used for the evaluation.
(5) Graz02. Graz02 [19] contains three object categories: Car, Person and Bike. Pixel-wise object annotations are provided with this dataset. The task is to learn a detector that determines whether a pixel belongs to the foreground (object) or the background. We follow the framework in [1] and use the bag-of-features model to extract the feature representation for each pixel, that is, for each pixel in an image we crop a local region around it and use the histogram of visual words in this region as the feature representation for that pixel. This feature is then sent to a classifier to test whether the pixel belongs to foreground or background. We densely extract SIFT features with the same settings as in the previous experiments, and a 1000-word codebook is utilized. The local region size and the boundary issue are set and handled in the same way as in [1]. Following the setting in [1],
we use the first 150 odd-numbered images as the training set and the first 150 even-numbered images as the test set. The normalization and square-root operations are applied to the extracted histograms since they lead to better detection performance.
To generate the training set, we randomly sample pixels in a training image and use their feature representations as training samples. The class label of each sample is determined by whether its corresponding pixel belongs to the foreground or the background. Note that the above procedure differs from the way the training set is generated in [1], where each positive (negative) sample is the histogram of visual words computed over the whole foreground (background) region of an image rather than over the local region centered at each pixel. Compared with their method, our scheme keeps the sample generation process consistent between the training and test stages. Empirically, we find this simple modification leads to a significant improvement in detection performance.
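The sampling scheme just described can be sketched as follows. All names, sizes and the toy foreground mask are hypothetical; the real setup uses the region size and boundary handling of [1].

```python
import numpy as np

def sample_pixel_training_set(word_map, fg_mask, n_samples, half, k, rng):
    """For each randomly drawn pixel, the training feature is the visual-word
    histogram of the local region centred on it (same as the test stage),
    and the label is that pixel's foreground/background flag.
    `word_map` holds the visual-word index of each pixel's local descriptor."""
    H, W = word_map.shape
    X, y = [], []
    for _ in range(n_samples):
        r = int(rng.integers(half, H - half))
        c = int(rng.integers(half, W - half))
        region = word_map[r - half:r + half + 1, c - half:c + half + 1]
        hist = np.bincount(region.ravel(), minlength=k).astype(float)
        hist /= hist.sum()             # L1 normalization
        X.append(np.sqrt(hist))        # square-root, as in the test stage
        y.append(int(fg_mask[r, c]))
    return np.stack(X), np.array(y)

rng = np.random.default_rng(1)
word_map = rng.integers(0, 16, size=(64, 64))  # toy per-pixel word indices
fg_mask = np.zeros((64, 64), dtype=bool)
fg_mask[20:40, 20:40] = True                   # toy foreground square
X, y = sample_pixel_training_set(word_map, fg_mask, 50, half=5, k=16, rng=rng)
```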
4.1 Comparison of UVD and MLT
Compared with UVD, MLT only changes the distribution model from a Gaussian distribution to a multinomial distribution. Thus, the performance comparison between these two methods demonstrates the impact of using different distribution models. Recall that MLT is inspired by the fact that in text analysis the multinomial distribution is more commonly adopted in the bag-of-words model. But is the multinomial distribution still suitable for the visual words extracted by the bag-of-features model in visual recognition? To answer this question, in this section we first compare UVD and MLT on four datasets: KTH, Caltech256, PASCAL 2007 and PASCAL 2012.
The performance comparison between UVD and MLT on KTH is shown in Figure 2. As seen from the average performance (Figure 2 (a)) over the six one-vs-rest tasks, MLT significantly outperforms UVD, especially when the codebook size is reduced to a small number. The same trend is observed on the two most difficult tasks: ‘running vs. the rest’ in Figure 2 (b) and ‘jogging vs. the rest’ in Figure 2 (c). For the result on Caltech256 shown in Figure 3, we can see that UVD performs better at the beginning of the merging process, but once the codebook size is reduced below 300 it is outperformed by MLT. On PASCAL 2007 and PASCAL 2012, however, MLT performs much worse than UVD.
To explain the better performance of MLT over UVD on the KTH dataset, we notice that among these four datasets, the feature extraction scheme used for KTH differs from that used for the other three. In KTH, the local features are extracted from a set of detected interest points, while in the other three datasets the local features are extracted on a dense spatial grid, namely, using a dense sampling strategy. Note that in the multinomial distribution model, the occurrences of words are assumed to be i.i.d. However, with the dense sampling strategy, neighboring sampling points are spatially close to each other. Due to the Markov property of images, the visual patterns within a neighborhood often co-occur. Thus, the neighboring local features and their quantized visual words can be highly correlated, and the i.i.d. assumption made in the multinomial distribution model tends to be violated. In contrast, the strategy of extracting local features around interest points introduces much less correlation among visual words because interest points are usually spatially scattered. As a result, the multinomial distribution is more appropriate for modeling the histograms in KTH than those in the other three datasets.
To further verify the above interpretation, we apply both dense and sparse sampling strategies on Graz02 to create two datasets. (For simplicity, we employ Graz02 as the test benchmark because it is relatively smaller than Caltech256, PASCAL 07 and PASCAL 12; here it is used to evaluate image-level classification performance.) Following the same experimental protocol used above, we obtain the performance comparison between UVD and MLT on both datasets, shown in Figure 4 (a)-(b). It can be seen that UVD and MLT show quite similar performance on the dataset obtained with the dense sampling strategy, while MLT significantly outperforms UVD on the dataset obtained with the sparse sampling strategy. This is consistent with the observation made on KTH and supports the above interpretation.
From the above observations and discussion, the impact of the distribution model in our framework is clearly demonstrated: if the distribution model represents the image representation well, better performance can be obtained. Conversely, if the distribution model is inappropriate for modeling the image representation, the performance of the resulting merging algorithm will suffer. On KTH, the local feature extraction scheme makes the occurrences of visual words more independent of each other, and in this case the visual words resemble the keywords in document analysis. Consequently, the multinomial distribution becomes a better probabilistic model and MLT significantly outperforms UVD. In PASCAL VOC and Caltech256, the dense sampling strategy is adopted and the multinomial distribution becomes inappropriate for modeling the resulting image representation. Consequently, MLT performs less satisfactorily in such cases.
4.2 The Evaluation of GMLE
Compared with UVD, GMLE only replaces the parameter estimation method with maximum likelihood estimation. Thus, the comparison between GMLE and UVD demonstrates the importance of parameter estimation. Figure 5 shows the comparison. As seen, on all three datasets (Caltech256, PASCAL 2007 and PASCAL 2012) GMLE outperforms, or at least performs as well as, UVD. This supports our claim that MLE can be comparable to or even better than the Bayesian method for our framework, because it directly learns the model parameters from the training data rather than relying on an empirical choice of hyperparameters as in UVD.
4.3 The Evaluation of MME
In this section, we compare the performance of MME against the other five methods. MME adopts a more advanced parameter estimation method, which incorporates discriminative information into the parameter estimation process. Thus, it is expected to better maintain the discriminative power of an initial codebook. In Figure 6, we evaluate the performance of MME on Caltech256, PASCAL 2007 and PASCAL 2012.
On Caltech256, MME becomes the second best method when the codebook size is reduced to 200, and its difference from the best method is very marginal (less than 0.5%). (Generally speaking, for a supervised compact codebook creation method, the performance with a smaller codebook size is more important, since the advantage of using a compact codebook is more pronounced in that scenario.)
On the more challenging PASCAL 2007 and 2012 datasets, the advantage of MME is demonstrated even more clearly. As seen in Figure 6, after a slight drop while the codebook size is reduced from 4000 to 1500, its classification performance remains steady over the remaining course of the merging process (from 1500 to 100). This is in sharp contrast to the quick performance drop of the other merging methods. Compared with CSM, the method achieving the second best performance, the improvement can be as large as 4–5%. This well demonstrates the advantage of using max-margin parameter estimation in our framework.
Interestingly, MME adopts the multinomial distribution and yet still achieves excellent performance on the PASCAL datasets, where the multinomial distribution may not be an accurate model. It seems that supervised parameter estimation can compensate for the disadvantage caused by choosing a less appropriate distribution.
4.4 Evaluation on the application of pixel-level detection
In this section, we further compare the word merging algorithms on the pixel-wise object detection problem [1]. As indicated in [1], to perform pixel-wise detection we need to calculate the histogram of visual words occurring within the region centered at each pixel, and this can be time-consuming if implemented directly. An efficient alternative is to leverage the integral histogram [20] to quickly compute the histogram for a given region. However, its memory usage and computational cost increase linearly with the size of the codebook. If we could reduce the codebook size without significantly sacrificing classification performance, a better tradeoff between performance and computational complexity could be achieved. The compact codebook created by word merging algorithms fits this demand perfectly.
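The integral-histogram idea can be sketched as follows. This is a simplified sketch assuming one visual word per pixel; [20] gives the general method, and the function names here are illustrative.

```python
import numpy as np

def integral_histogram(word_map, k):
    """ih[r, c, w] = number of occurrences of word w in word_map[:r, :c].
    Storage is H x W x k, so memory grows linearly with codebook size k,
    which is exactly why a compact codebook pays off here."""
    onehot = np.eye(k)[word_map]                    # (H, W, k) one-hot map
    ih = onehot.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ih, ((1, 0), (1, 0), (0, 0)))     # zero top row / left col

def region_histogram(ih, r0, c0, r1, c1):
    """Histogram of word_map[r0:r1, c0:c1] from four lookups, O(k) per query."""
    return ih[r1, c1] - ih[r0, c1] - ih[r1, c0] + ih[r0, c0]

rng = np.random.default_rng(2)
wm = rng.integers(0, 6, size=(32, 32))    # toy word indices for a 6-word codebook
ih = integral_histogram(wm, 6)
h = region_histogram(ih, 4, 4, 20, 20)
```

Each per-pixel query then costs four lookups per word, so halving the codebook size roughly halves both query time and the memory held by the integral volume.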
In Table I, we compare the average detection performance obtained by applying different merging algorithms with respect to different codebook sizes. The performance is measured by EER (Equal Error Rate) as in [1]. As seen, MME achieves the overall best performance and well maintains the discriminative power of the initial codebook: the EER achieved with a 10-word codebook is comparable to that obtained with the 1000-word initial codebook. Thus, by using this compressed codebook, we only need 1/100 of the computational cost and memory usage required by the direct implementation.
codebook size  1000 (initial)  200    150    100    80     50     20
-------------  --------------  -----  -----  -----  -----  -----  -----
MME            0.621           0.623  0.620  0.614  0.635  0.621  0.625
AIB            0.621           0.604  0.598  0.622  0.600  0.617  0.596
CSM            0.621           0.588  0.587  0.609  0.612  0.599  0.587
UVD            0.621           0.604  0.600  0.602  0.578  0.582  0.601
MLT            0.621           0.573  0.571  0.584  0.581  0.605  0.596
GMLE           0.621           0.601  0.569  0.585  0.607  0.609  0.572
5 Discussion on the impact of the scaling factor
Throughout our experiments, we set the scaling factor to a small value (0.01) in our MME method. As discussed in Appendix A, a small scaling factor ensures that the estimated probability values lie between 0 and 1. However, among the possible values that guarantee valid probability estimates, we still have many choices. A question then arises: what is the impact of the scaling factor's value on the performance of MME? In this section, we discuss this issue with both empirical evaluation and theoretical analysis.
For the empirical evaluation, we re-evaluate the performance of MME on Caltech256 with different scaling factors. We test a range of scaling factor values and show the results in Fig. 7. From the results, it is clear that the performance obtained with different scaling factors is very similar. This suggests that the choice of the scaling factor has little impact on the performance of MME once it is set to a relatively small value.
To further justify our empirical observation, we analyse this issue from a theoretical perspective. As discussed in Section 3.7.2, once the max-margin parameter estimation is completed at each level, the identification of the word pair to merge follows the same criterion as in AIB, that is, the best word pair should maximize the merging criterion:

$(i^*, j^*) = \arg\max_{i,j} \mathcal{L}(\mathcal{X}_{ij}) \qquad (29)$

where $\mathcal{X}_{ij}$ denotes the training histograms obtained after merging the $i$-th and $j$-th words. We can show (see Appendix B) that when the scaling factor $\beta$ is small, $\mathcal{L}(\mathcal{X}_{ij})$ can be approximated by

$\mathcal{L}(\mathcal{X}_{ij}) \approx C + \beta C_{ij} \qquad (30)$

where $C$ is a term which does not involve the pair $(i, j)$ and $C_{ij}$ is the pair's cost term. In other words, Eq. (30) suggests that the scaling factor merely scales the cost term, so the relative relationship between the costs of different pairs is almost unaffected. Thus, the identified merging pair tends to remain the same even when different scaling factors are used.
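The level-by-level merging procedure that such a criterion drives can be sketched generically as follows. The criterion passed in here is a placeholder; the paper's methods (AIB, UVD, MLT, GMLE, MME) would each plug in their own objective, and the brute-force pair search is shown only for clarity.

```python
import numpy as np

def greedy_word_merging(hists, criterion, target_k):
    """Repeatedly merge the word pair (i, j) that maximizes
    criterion(histograms-after-merging) until target_k words remain.
    Each merge simply sums two histogram columns, which is why applying
    the final merging to a new histogram is only a linear scan."""
    X = hists.astype(float).copy()      # (n_samples, k) word counts
    merges = []
    while X.shape[1] > target_k:
        k = X.shape[1]
        best_score, best_pair = -np.inf, None
        for i in range(k):
            for j in range(i + 1, k):
                Xij = X.copy()
                Xij[:, i] += Xij[:, j]          # tentatively merge i and j
                Xij = np.delete(Xij, j, axis=1)
                score = criterion(Xij)
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        i, j = best_pair                        # commit the best merge
        X[:, i] += X[:, j]
        X = np.delete(X, j, axis=1)
        merges.append((i, j))
    return X, merges

# Toy example with a placeholder criterion favouring low-variance merges.
rng = np.random.default_rng(3)
H = rng.integers(0, 10, size=(6, 5))
merged, merges = greedy_word_merging(H, lambda M: -float(M.var()), target_k=3)
```

Note that merging never changes the total count of each sample, only how the counts are grouped into words.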
6 Conclusion
This paper presents a generalized probabilistic framework that both unifies existing visual word merging criteria and induces new criteria for compact codebook construction. The key insight of this framework is that different merging criteria can be realized by changing two key factors identified in the proposed framework, that is, the function used to model the class-conditional distribution and the method used to estimate the parameters of that distribution model. By appropriately setting these two factors, we not only recover the existing merging criteria but also create three new criteria, named MLT, GMLE and MME. Through the experimental comparison between these three criteria and the existing ones, we make three main discoveries: 1) the appropriateness of the distribution model can have a significant impact on the performance of a word merging criterion; 2) besides the Bayesian method, MLE and MME are also good parameter estimation methods in our framework, and MLE is comparable to or even better than the Bayesian method since it does not need an empirically set hyperparameter; 3) MME achieves the overall best performance, demonstrating the power of using the max-margin objective to perform parameter estimation. In future work, we will further study this framework for more visual learning tasks; for example, instead of focusing on classification, we could extend the proposed framework to create compact codebooks in a metric learning setting. The computational efficiency of the proposed MME will also be addressed to handle higher-dimensional image representations.
7 Appendix A: Discussion on the two-stage solution for Eq. (3.7.2)
The two-stage optimization method shown in Eq. (3.7.2) is valid if, for a solution obtained in the first stage, we can find a scaling factor that makes the last three constraints in Eq. (27) satisfied. This is because the solution attained without the last three constraints always gives an objective value lower than or equal to that of the solution which considers these additional constraints. Thus, the first-stage solution will be optimal if the last three constraints can be automatically satisfied by tuning the scaling factor. To examine when this is true, we first derive the solutions of the estimated probabilities from the equality constraints.
From these solutions, we can see that two of the estimated probabilities always lie between 0 and 1, while the other two could exceed 1 because the corresponding terms can be larger than 1. However, when the scaling factor approaches 0, these terms approach 1. Meanwhile, the remaining probability is calculated via the MLE method, that is,

$\hat{p} = \frac{\sum_{\mathbf{x} \in \mathcal{X}} (x_i + x_j)}{\sum_{\mathbf{x} \in \mathcal{X}} \sum_{k=1}^{m} x_k} \qquad (32)$

where $\mathcal{X}$ denotes the whole training set. Recall that $m$ is the compact codebook size and is generally much larger than 2; this makes the numerator much smaller than the denominator, so $\hat{p}$ is usually much smaller than 1. Hence, in practice, a small scaling factor will greatly scale down these terms and keep the estimated probabilities valid.