1 Introduction
While SVM [Vapnik 1995]
classification accuracy on many classification tasks is often competitive with that of human subjects, the number of training examples required to achieve this accuracy is prohibitively large for some domains. Intelligent user interfaces, for example, must adapt to the behavior of an individual user after a limited amount of interaction in order to be useful. Medical systems diagnosing rare diseases have to generalize well after seeing very few examples. Any natural language processing task that performs processing at the level of n-grams or phrases (which is frequent in translation systems) cannot expect to see the same sequence of words a sufficient number of times even in large training corpora. Moreover, supervised classification methods rely on manually labeled data, which can be expensive to obtain. Thus, it is important to improve classification performance on very small datasets. Most classifiers are not competitive with humans in their ability to generalize after seeing very few examples. Various techniques have been proposed to address this problem, such as active learning
[Tong & Koller 2000b, Campbell et al. 2000], hybrid generative/discriminative classification [Raina et al. 2003], learning-to-learn by extracting common information from related learning tasks [Thrun 1995, Baxter 2000, Fink 2004], and using prior knowledge. In this work, we concentrate on improving small-sample classification accuracy with prior knowledge. While prior knowledge has proven useful for classification [Scholkopf et al. 2002, Wu & Srihari 2004, Fung et al. 2002, Epshteyn & DeJong 2005, Sun & DeJong 2005], it is notoriously hard to apply in practice because there is a mismatch between the form of prior knowledge that can be employed by classification algorithms (either prior probabilities or explicit constraints on the hypothesis space of the classifier) and the domain theories articulated by human experts. This is unfortunate because various ontologies and domain theories are available in abundance, but a considerable amount of manual effort is required to incorporate existing prior knowledge into the native learning bias of the chosen algorithm. What would it take to apply an existing domain theory automatically to a classification task for which it was not specifically designed? In this work, we take the first steps towards answering this question.
In our experiments, such a domain theory is exemplified by WordNet, a linguistic database of semantic connections among English words [Miller 1990]. We apply WordNet to a standard benchmark task of newsgroup categorization. Conceptually, a generative model describes how the world works, while a discriminative model is inextricably linked to a specific classification task. Thus, there is reason to believe that a generative interpretation of a domain theory is more natural and generalizes better across different classification tasks. In Section 2 we present empirical evidence that this is, indeed, the case with WordNet in the context of newsgroup classification. For this reason, we interpret the domain theory in the generative setting. However, many successful learning algorithms (such as support vector machines) are discriminative. We present a framework which allows the use of a generative prior in the discriminative classification setting.
Our algorithm assumes that the generative distribution of the data is given in the Bayesian framework: the class-conditional distribution P(x | θ) and the prior P(θ) are known. However, instead of performing Bayesian model averaging, we assume that a single model θ has been selected a priori, and the observed data is a manifestation of that model (i.e., it is drawn according to P(x | θ)). The goal of the learning algorithm is to estimate θ. This estimation is performed as a two-player sequential game of full information. The bottom (generative) player chooses the Bayes-optimal discriminator function for the probability distribution P(x | θ) (without taking the training data into account) given the model θ. The model θ is chosen by the top (discriminative) player in such a way that its prior probability of occurring, given by P(θ), is high, and it forces the bottom player to minimize the training-set error of its Bayes-optimal discriminator. This estimation procedure gives rise to a bilevel program. We show that, while the problem is known to be NP-hard, its approximation can be solved efficiently by iterative application of second-order cone programming. The only remaining issue is how to construct the generative prior automatically from the domain theory. We describe how to solve this problem in Section 2, where we also argue that the generative setting is appropriate for capturing expert knowledge, employing WordNet as an illustrative example. In Section 3, we give the necessary preliminary information and important known facts and definitions. Our framework for incorporating a generative prior into discriminative classification is described in detail in Section 4. We demonstrate the efficacy of our approach experimentally by presenting the results of using WordNet for newsgroup classification in Section 5. A theoretical explanation of the improved generalization ability of our discriminative classifier constrained by generative prior knowledge appears in Section 6. Section 7 describes related work. Section 8 concludes the paper and outlines directions for future research.
2 Generative vs. Discriminative Interpretation of Domain Knowledge
WordNet can be viewed as a network, with nodes representing words and links representing relationships between two words (such as synonyms, hypernyms (is-a), meronyms (part-of), etc.). An important property of WordNet is that of semantic distance: the length (in links) of the shortest path between any two words. Semantic distance approximately captures the degree of semantic relatedness of two words. We set up an experiment to evaluate the usefulness of WordNet for the task of newsgroup categorization. Each posting was represented by a bag-of-words, with each binary feature representing the presence of the corresponding word. The evaluation was done on pairwise classification tasks in the following two settings:

The generative framework assumes that each posting is generated by a distinct probability distribution for each newsgroup. The simplest version of a Linear Discriminant Analysis (LDA) classifier posits that x | (y = 1) ~ N(μ1, I) and x | (y = -1) ~ N(μ2, I) for posting x given label y, where I
is the identity matrix. Classification is done by assigning the most probable label to
x: argmax_y P(x | y). It is well-known [e.g., see Duda & Hart] that this decision rule is equivalent to the one given by the hyperplane
w = μ1 - μ2, b = (||μ1||^2 - ||μ2||^2)/2. The means μ1, μ2 are estimated via maximum likelihood from the training data.^1 (^1 The standard LDA classifier assumes that x | (y = 1) ~ N(μ1, Σ) and x | (y = -1) ~ N(μ2, Σ) and estimates the covariance matrix Σ as well as the means from the training data. In our experiments, we take Σ = I.)
The discriminative SVM classifier sets the separating hyperplane (w, b) to directly minimize the number of errors on the training data.
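The simplest LDA rule above can be sketched in a few lines of pure Python; the vocabulary, class means, and postings below are illustrative made-up values, not data from the experiments:

```python
# Sketch of the simplest LDA rule: with identity covariance, the Bayes
# decision boundary between N(mu1, I) and N(mu2, I) is the hyperplane
# w . x = b with w = mu1 - mu2 and b = (|mu1|^2 - |mu2|^2) / 2.
# Feature vectors are illustrative binary bag-of-words vectors.

def lda_hyperplane(mu1, mu2):
    """Return (w, b) of the equal-covariance LDA decision boundary."""
    w = [a - c for a, c in zip(mu1, mu2)]
    b = (sum(a * a for a in mu1) - sum(c * c for c in mu2)) / 2.0
    return w, b

def classify(x, w, b):
    """Label +1 if w . x >= b (class 1), else -1 (class 2)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= b else -1

# Class means estimated (by maximum likelihood, i.e., averaging) from data;
# hypothetical probabilities of the words "gun", "god", "ammo" appearing.
mu_guns = [0.9, 0.1, 0.8]
mu_atheism = [0.1, 0.9, 0.0]

w, b = lda_hyperplane(mu_guns, mu_atheism)
print(classify([1, 0, 1], w, b))  # posting containing "gun", "ammo" -> 1
print(classify([0, 1, 0], w, b))  # posting containing "god" -> -1
```

The hyperplane form follows from taking the log-odds of the two spherical Gaussians, which is linear in x.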
Our experiment was conducted in the learning-to-learn framework [Thrun 1995, Baxter 2000, Fink 2004]. In the first stage, each classifier was trained using training data from the training task (e.g., classifying postings into the newsgroups 'atheism' and 'guns'). In the second stage, the classifier was generalized using WordNet's semantic information. In the third stage, the generalized classifier was applied to a different, test task (e.g., classifying postings for the newsgroups 'atheism' vs. 'mideast') without seeing any data from this new classification task. The only way for a classifier to generalize in this setting is to use the original sample to acquire information about WordNet, and then exploit this information to help it label examples from the test sample. In learning how to perform this task, the system also learns how to utilize the classification knowledge implicit in WordNet.
We now describe the second and third stages for the two classifiers in more detail:

It is intuitive to interpret the information embedded in WordNet as follows: if the title of the newsgroup is 'guns', then all the words with the same semantic distance to 'gun' (e.g., 'artillery', 'shooter', and 'ordnance', each with a distance of two) provide a similar degree of classification information. To quantify this intuition, let v be the vector of semantic distances in WordNet between each feature word and the label of the training task newsgroup. Define the compression function f by f(μ)(d) = (1/|{i : v_i = d}|) Σ_{i : v_i = d} μ_i, where |·| denotes the cardinality of a set. f compresses the information in μ based on the assumption that words equidistant from the newsgroup label are equally likely to appear in a posting from that newsgroup. To test the performance of this compressed classifier on a new task with semantic distances given by v', the generative distributions are reconstructed via μ'_i = f(μ)(v'_i). Notice that if the classifier is trained and tested on the same task, applying the function f is equivalent to averaging the components of the means of the generative distribution corresponding to the equivalence classes of words equidistant from the label. If the classifier is tested on a different classification task, the reconstruction process reassigns the averages based on the semantic distances to the new labels.
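The compression/reconstruction step can be sketched as follows; the words, distances, and mean counts are hypothetical toy values chosen only to make the averaging visible:

```python
# Hedged sketch of the compression/reconstruction step described above.
# dists[i] is the WordNet semantic distance from feature word i to the
# newsgroup label; words at the same distance form one equivalence class.

from collections import defaultdict

def compress(mu, dists):
    """Average the mean vector over equivalence classes of equal distance."""
    classes = defaultdict(list)
    for i, d in enumerate(dists):
        classes[d].append(mu[i])
    return {d: sum(vals) / len(vals) for d, vals in classes.items()}

def reconstruct(avg_by_dist, new_dists):
    """Reassign class averages according to distances to the *new* label."""
    return [avg_by_dist[d] for d in new_dists]

# Training task: mean per-posting word counts and distances to 'guns'
# (toy vocabulary: 'rifle', 'ammo', 'prayer', 'belief').
mu = [4.0, 2.0, 1.0, 3.0]
dist_to_guns = [1, 2, 3, 3]
avg = compress(mu, dist_to_guns)          # {1: 4.0, 2: 2.0, 3: 2.0}

# Test task: same vocabulary, distances measured to the label 'mideast'.
dist_to_mideast = [3, 3, 2, 1]
print(reconstruct(avg, dist_to_mideast))  # [2.0, 2.0, 2.0, 4.0]
```

On the same task, reconstruction merely averages within equivalence classes; on a new task, the averages are reassigned according to the new distance vector.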

It is less intuitive to interpret WordNet in a discriminative setting. One possible interpretation is that the coefficients of the separating hyperplane are governed by semantic distances to labels, as captured by the compression function f and reconstructed via w'_i = f(w)(v'_i).
Note that both the LDA generative classifier and the SVM discriminative classifier have the same hypothesis space of separating hyperplanes. The resulting test set classification accuracy of each classifier on several classification tasks from the 20-newsgroup dataset [Blake & Merz 1998] is presented in Figure 2.1. The x-axis of each graph represents the size of the training task sample, and the y-axis the classifier's performance on the test classification task. The generative classifier consistently outperforms the discriminative classifier. It converges much faster, and on two out of three tasks the discriminative classifier is not able to use prior knowledge nearly as effectively as the generative classifier even after seeing 90% of all of the available training data. The generative classifier is also more consistent in its performance: note that its error bars are much smaller than those of the discriminative classifier. The results clearly show the potential of using background knowledge as a vehicle for sharing information between tasks. But effective sharing is contingent on an appropriate task decomposition, here supplied by the tuned generative model.
The evidence in Figure 2.1 seemingly contradicts the conventional wisdom that discriminative training outperforms generative training for sufficiently large training samples. However, our experiment evaluates the two frameworks in the context of using an ontology to transfer information between learning tasks, which, to our knowledge, has not been done before. The experiment demonstrates that the interpretation of semantic distance in WordNet is more intuitive in the generative classification setting, probably because it better reflects the human intuitions behind WordNet.
However, our goal is not just to construct a classifier that performs well without seeing any examples of the test classification task. We also want a classifier that improves its behavior as it sees new labeled data from the test classification task. This presents us with a problem: one of the best-performing classifiers [and certainly the best on the text classification task, according to the study by Joachims 1998] is SVM, a discriminative classifier. Therefore, in the rest of this work, we focus on incorporating generative prior knowledge into the discriminative classification framework of support vector machines.
3 Preliminaries
It has been observed that constraints on the probability measure of a halfspace can be captured by second-order cone constraints for Gaussian distributions
[see, e.g., the tutorial by Lobo et al. 1998]. This allows for efficient processing of such constraints within the framework of second-order cone programming (SOCP). We intend to model prior knowledge with elliptical distributions, a family of probability distributions which generalizes Gaussians. In what follows, we give a brief overview of second-order cone programming and its relationship to constraints imposed on the Gaussian probability distribution. We also note that it is possible to extend the argument presented by Lobo et al. [1998] to elliptical distributions. A second-order cone program is a mathematical program of the form:
(3.1) min_x f^T x
(3.2) s.t. ||A_i x + b_i|| ≤ c_i^T x + d_i, i = 1, ..., N
where x is the optimization variable and f, A_i, b_i, c_i, d_i are problem parameters (||·|| represents the usual Euclidean norm in this paper). SOCPs can be solved efficiently with interior-point methods, as described by Lobo et al. [1998] in a tutorial which contains an excellent overview of the theory and applications of SOCP.
We use the elliptical distribution to model the distribution of the data a priori. Elliptical distributions are distributions with ellipsoidally-shaped equiprobable contours. The density function of the n-variate elliptical distribution has the form f(x) = c_n |Σ|^{-1/2} g((x - μ)^T Σ^{-1} (x - μ)), where
x ∈ R^n is the random variable,
μ is the location parameter, Σ is a positive definite matrix representing the scale parameter, the function g is the density generator, and c_n is the normalizing constant. We will use the notation x ~ E(μ, Σ, g) to denote that the random variable x has an elliptical distribution with parameters μ, Σ, and g. Choosing appropriate density generator functions, the Gaussian distribution, the Student-t distribution, the Cauchy distribution, the Laplace distribution, and the logistic distribution can be seen as special cases of the elliptical distribution. Using an elliptical distribution relaxes the restrictive assumptions the user has to make when imposing a Gaussian prior, while keeping many desirable properties of Gaussians, such as:

If x ~ E(μ, Σ, g), A is a matrix, and c is a vector, then Ax + c ~ E(Aμ + c, AΣA^T, g).

If x ~ E(μ, Σ, g), then E[x] = μ.

If x ~ E(μ, Σ, g), then Cov[x] = c_g Σ, where c_g is a constant that depends on the density generator g.
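In the Gaussian special case (density generator g(t) = exp(-t/2)), the first closure property can be checked empirically. The sketch below draws z ~ N(μ, Σ) via an affine map of standard normals and confirms that the sample mean of y = Az + c approaches Aμ + c; all numbers are illustrative:

```python
# Empirical check of the affine closure property for the Gaussian special
# case: if z ~ E(mu, Sigma, g), then A z + c ~ E(A mu + c, A Sigma A^T, g),
# so in particular the mean of y = A z + c is A mu + c.

import random

random.seed(0)
mu = [1.0, -2.0]
L = [[2.0, 0.0], [1.0, 0.5]]    # Sigma = L L^T (Cholesky factor)
A = [[1.0, 1.0]]                # project onto the direction (1, 1)
c = [3.0]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sample_z():
    e = [random.gauss(0, 1), random.gauss(0, 1)]
    return [m + s for m, s in zip(mu, matvec(L, e))]

n = 200_000
mean_y = sum(matvec(A, sample_z())[0] + c[0] for _ in range(n)) / n
expected = matvec(A, mu)[0] + c[0]    # A mu + c = 1 - 2 + 3 = 2
print(abs(mean_y - expected) < 0.1)   # True
```

The same sampling scheme (location plus scaled spherical noise) applies to any elliptical distribution whose generator admits a radial sampler.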
The following proposition shows that for elliptical distributions, the constraint P(a^T x ≥ b) ≤ ε (i.e., the probability that x takes values in the halfspace {z : a^T z ≥ b} is at most ε) is equivalent to a second-order cone constraint for ε ≤ 1/2:
Proposition 3.1.
If x ~ E(μ, Σ, g) and ε ≤ 1/2, then P(a^T x ≥ b) ≤ ε is equivalent to b ≥ a^T μ + φ(ε) ||Σ^{1/2} a||, where φ(ε) is a constant which only depends on ε and g.
Proof.
The proof is identical to the one given by Lobo et al. [1998] and Lanckriet et al. [2002] for Gaussian distributions and is provided here for completeness:
(3.3) P(a^T x ≥ b) ≤ ε
Let u = a^T x. Let ū denote the mean of u, and σ² denote its variance. Then the constraint
(3.3) can be written as
(3.4) P((u - ū)/σ ≥ (b - ū)/σ) ≤ ε
By the properties of elliptical distributions, u ~ E(a^T μ, a^T Σ a, g), ū = a^T μ, and σ² = c_g a^T Σ a. Thus, statement (3.4) above can be expressed as 1 - Φ_g((b - ū)/σ) ≤ ε, which is equivalent to b ≥ ū + σ Φ_g^{-1}(1 - ε), where Φ_g is the cumulative distribution function of the standardized one-dimensional elliptical distribution with generator g. The proposition follows with φ(ε) = √(c_g) Φ_g^{-1}(1 - ε). ∎
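Proposition 3.1 can be sanity-checked numerically in the Gaussian special case, where φ(ε) is the standard normal quantile Φ^{-1}(1 - ε). The sketch below verifies that b ≥ a^T μ + φ(ε)·sd(a^T x) holds exactly when the tail probability P(a^T x ≥ b) is at most ε; the parameter values are arbitrary:

```python
# Gaussian sanity check of Proposition 3.1. For z ~ N(mu, Sigma), the chance
# constraint P(a^T z >= b) <= eps is equivalent to the second-order cone
# constraint b >= a^T mu + phi(eps) * ||Sigma^{1/2} a||.

from statistics import NormalDist

mu = [1.0, 2.0]
L = [[1.0, 0.0], [0.5, 1.5]]    # Cholesky factor, so sd(a^T z) = |L^T a|
a = [2.0, -1.0]
eps = 0.05

a_mu = sum(ai * mi for ai, mi in zip(a, mu))    # a^T mu
Lt_a = [L[0][0] * a[0] + L[1][0] * a[1],        # L^T a
        L[0][1] * a[0] + L[1][1] * a[1]]
sd = sum(v * v for v in Lt_a) ** 0.5            # standard deviation of a^T z
phi = NormalDist().inv_cdf(1 - eps)             # phi(eps), about 1.645

threshold = a_mu + phi * sd   # smallest b satisfying the SOC constraint
for b in (threshold - 0.5, threshold, threshold + 0.5):
    tail = 1 - NormalDist(a_mu, sd).cdf(b)      # P(a^T z >= b)
    print(b >= threshold, tail <= eps + 1e-12)  # the two booleans agree
```

The two printed booleans agree for every b, illustrating the equivalence; for a non-Gaussian generator only the quantile function changes.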
Proposition 3.2.
For any monotonically decreasing density generator g, the constraint f(x) ≥ τ on the density of x ~ E(μ, Σ, g) is equivalent to ||Σ^{-1/2} (x - μ)|| ≤ η(τ), where η(τ) is a constant which only depends on τ.
Proof.
Follows directly from the definition of the elliptical density function. ∎
4 Generative Prior via Bilevel Programming
We deal with the binary classification task: the classifier is a function h: X → {-1, 1} which maps instances x ∈ X to labels y ∈ {-1, 1}. In the generative setting, the probability densities P(x | y = 1; θ1) and P(x | y = -1; θ2) parameterized by θ = (θ1, θ2) are provided (or estimated from the data), along with the prior probabilities on class labels P(y = 1) and P(y = -1), and the Bayes optimal decision rule is given by the classifier
h(x) = sign(log [P(x | y = 1; θ1) P(y = 1)] - log [P(x | y = -1; θ2) P(y = -1)]), where sign(t) = 1 if t > 0 and -1 otherwise. In LDA, for instance, the parameters θ1 and θ2 are the means of the two Gaussian distributions generating the data given each label.
Informally, our approach to incorporating prior knowledge is straightforward: we assume a two-level hierarchical generative probability distribution model. The low-level probability distribution of the data given the label is parameterized by θ, which, in turn, has a known probability distribution P(θ). The goal of the classifier is to estimate the values of the parameter vector θ from the training set of labeled points (x_1, y_1), ..., (x_n, y_n). This estimation is performed as a two-player sequential game of full information. The bottom (generative) player, given θ, selects the Bayes optimal decision rule h_θ. The top (discriminative) player selects the value of θ which has a high probability of occurring (according to P(θ)) and which will force the bottom player to select the decision rule which minimizes the discriminative error on the training set. We now give a more formal specification of this training problem and formulate it as a bilevel program. Some of the assumptions are subsequently relaxed to enforce both tractability and flexibility.
We use an elliptical distribution E(θ1, Σ1, g) to model P(x | y = 1), and another elliptical distribution E(θ2, Σ2, g) to model P(x | y = -1). If the parameters are known, the Bayes optimal decision rule restricted to the class of linear classifiers^2 of the form h(x) = sign(w^T x - b) is given by the hyperplane (w, b) which minimizes the probability of error among all linear discriminants: min_{w,b} [P(w^T x ≤ b | y = 1) + P(w^T x ≥ b | y = -1)], assuming equal prior probabilities for both classes. (^2 A decision rule restricted to some class of classifiers H is optimal if its probability of error is no larger than that of any other classifier in H [Tong & Koller 2000a].) We now model the uncertainty in the means of the elliptical distributions by imposing elliptical prior distributions on the locations of the means: θ1 ~ E(ν1, Ω1, g), θ2 ~ E(ν2, Ω2, g). In addition, to ensure the optimization problem is well-defined, we maximize the margin of the hyperplane subject to the imposed generative probability constraints:
(4.1) min_{θ1, θ2, w, b} ||w||
(4.2) s.t. P(θ1) ≥ τ', P(θ2) ≥ τ'
(4.3) y_i (w^T x_i - b) ≥ 1, i = 1, ..., n
(4.4) (w, b) = argmin_{w', b'} [P(w'^T x ≤ b' | y = 1) + P(w'^T x ≥ b' | y = -1)]
This is a bilevel mathematical program (i.e., an optimization problem in which the constraint region is implicitly defined by another optimization problem), which is strongly NP-hard even when all the constraints and both objectives are linear [Hansen, Jaumard, & Savard 1992]. However, we show that it is possible to solve a reasonable approximation of this problem efficiently with several iterations of second-order cone programming. First, we relax the second-level minimization (4.4) by breaking it up into two constraints: P(w^T x ≤ b | y = 1) ≤ ε and P(w^T x ≥ b | y = -1) ≤ ε. Thus, instead of looking for the Bayes optimal decision boundary, the algorithm looks for a decision boundary with a low probability of error, where low error is quantified by the choice of ε.
Propositions 3.1 and 3.2 enable us to rewrite the optimization problem resulting from this relaxation as follows:
(4.5) min_{θ1, θ2, w, b} ||w||
(4.6) s.t. y_i (w^T x_i - b) ≥ 1, i = 1, ..., n
(4.7) ||Ω1^{-1/2} (θ1 - ν1)|| ≤ τ, ||Ω2^{-1/2} (θ2 - ν2)|| ≤ τ
(4.8) w^T θ1 - b ≥ δ ||Σ1^{1/2} w||
(4.9) b - w^T θ2 ≥ δ ||Σ2^{1/2} w||
Notice that the form of this program does not depend on the generator function g of the elliptical distribution: only the constants τ and δ depend on it. τ defines how far the system is willing to deviate from the prior in its choice of a generative model, and δ bounds the tail probabilities of error (Type I and Type II) which the system will tolerate assuming its chosen generative model is correct. These constants depend both on the specific generator g and on the amount of error the user is willing to tolerate. In our experiments, we select the values of these constants to optimize performance. Unless the user wants to control the probability bounds through these constants, it is sufficient to assume a priori only that the probability distributions (both prior and hyperprior) are elliptical, without making any further commitments.
Our algorithm solves the above problem by repeating the following two steps:

Fix the top-level optimization parameters θ1 and θ2. This step combines the objectives of maximizing the margin of the classifier on the training data and ensuring that the decision boundary is (approximately) Bayes optimal with respect to the given generative probability densities specified by the θ's.

Fix the bottom-level optimization parameters w and b. Expand the feasible region of the program in step 1 as a function of θ1 and θ2. This step fixes the decision boundary and pushes the means of the generative distribution as far away from the boundary as the constraint (4.7) will allow.
The steps are repeated until convergence (in practice, convergence is detected when the optimization parameters do not change appreciably from one iteration to the next). Each step of the algorithm can be formulated as a second-order cone program:
Step 1. Fix θ1 and θ2. Removing unnecessary constraints from the mathematical program above and pushing the objective into the constraints, we get the following SOCP:
(4.10) min_{w, b, t} t
(4.11) s.t. ||w|| ≤ t
(4.12) y_i (w^T x_i - b) ≥ 1, i = 1, ..., n
(4.13) w^T θ1 - b ≥ δ ||Σ1^{1/2} w||
(4.14) b - w^T θ2 ≥ δ ||Σ2^{1/2} w||
Step 2. Fix w and b and expand the span of the feasible region, as measured by the scaled distance s of the means from the decision boundary. Removing unnecessary constraints, we get:
(4.15) max_{θ1, θ2, s} s
(4.16) s.t. w^T θ1 - b ≥ s ||Σ1^{1/2} w||, b - w^T θ2 ≥ s ||Σ2^{1/2} w||, ||Ω1^{-1/2} (θ1 - ν1)|| ≤ τ, ||Ω2^{-1/2} (θ2 - ν2)|| ≤ τ, s ≥ δ
The behavior of the algorithm is illustrated in Figure 4.1.
The following theorems state that the algorithm converges.
Theorem 4.1.
Suppose that the algorithm produces a sequence of iterates
(θ1^k, θ2^k, w^k, b^k), and the
quality of each iterate is evaluated by its margin 1/||w^k||.
This evaluation function converges.
Proof.
Let θ1^k, θ2^k be the values of the prior location parameters, and let (w^k, b^k) be the minimum-error hyperplane the algorithm finds at the end of the k-th iteration. At the end of the (k+1)-st iteration, (w^k, b^k) is still in the feasible region of the k-th step SOCP. This is true because the slack of each probability constraint is monotonically increasing in each one of its arguments when the other argument is fixed, and fixing (θ1, θ2) (or (w, b)) fixes exactly one argument. If the solution at the end of the (k+1)-st iteration were such that w^{kT} θ1^{k+1} - b^k < δ ||Σ1^{1/2} w^k||, then the step 2 objective could be increased by fixing (w^k, b^k) and using the value of θ1^k from the beginning of the step, which ensures that w^{kT} θ1^k - b^k ≥ δ ||Σ1^{1/2} w^k||; this contradicts the observation that the step 2 objective is maximized at the end of the second step. The same contradiction is reached if b^k - w^{kT} θ2^{k+1} < δ ||Σ2^{1/2} w^k||. Since the minimum-error hyperplane from the previous iteration is in the feasible region at the start of the next iteration, the objective ||w|| must decrease monotonically from one iteration to the next. Since it is bounded below by zero, the algorithm converges. ∎
In addition to the convergence of the objective function, the accumulation points of the sequence of iterates can be characterized by the following theorem:
Theorem 4.2.
The accumulation points of the sequence of iterates (i.e., limiting points of its convergent subsequences) have no feasible descent directions for the original optimization problem given by (4.5)-(4.9).
Proof.
See Appendix A. ∎
If a point has no feasible descent directions, then any sufficiently small step along any directional vector will either increase the objective function, leave it unchanged, or take the algorithm outside of the feasible region. The set of points with no feasible descent directions is a subset of the set of local minima. Hence, convergence to such a point is a somewhat weaker result than convergence to a local minimum.
In practice, we observed rapid convergence, usually within 2-4 iterations.
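The two-step procedure is an instance of alternating (block-coordinate) minimization: each step fixes one block of variables and optimizes the other, so the objective never increases, which is the essence of Theorem 4.1. The toy sketch below shows the same pattern on a simple quadratic whose block subproblems have closed-form minimizers (the function and starting point are arbitrary illustrations, not the SOCP from the text):

```python
# Alternating minimization of f(u, v) = (u - v)^2 + (u - 3)^2:
# step 1 fixes v and optimizes u; step 2 fixes u and optimizes v.
# Each step solves its subproblem exactly, so f decreases monotonically.

def step_u(v):
    # argmin_u (u - v)^2 + (u - 3)^2  ->  u = (v + 3) / 2
    return (v + 3.0) / 2.0

def step_v(u):
    # argmin_v (u - v)^2  ->  v = u
    return u

def f(u, v):
    return (u - v) ** 2 + (u - 3.0) ** 2

u, v = 0.0, 0.0
history = [f(u, v)]
while True:
    u = step_u(v)          # step 1: fix v, optimize u
    v = step_v(u)          # step 2: fix u, optimize v
    history.append(f(u, v))
    if history[-2] - history[-1] < 1e-9:   # convergence test from the text
        break

print(all(a >= b for a, b in zip(history, history[1:])))  # True: monotone
print(round(u, 3), round(v, 3))                           # 3.0 3.0
```

As in Theorem 4.1, the argument is generic: each block update keeps the previous iterate feasible for the other block's subproblem, so the objective is monotone and bounded, hence convergent.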
Finally, we may want to relax the strict assumptions of the correctness of the prior and of the linear separability of the data by introducing slack variables into the optimization problem above. This results in the following program:
(4.17) min_{θ1, θ2, w, b, ξ, η1, η2} ||w|| + C Σ_{i=1}^n ξ_i + D (η1 + η2)
(4.18) s.t. y_i (w^T x_i - b) ≥ 1 - ξ_i, i = 1, ..., n
(4.19) w^T θ1 - b ≥ δ ||Σ1^{1/2} w|| - η1
(4.20) b - w^T θ2 ≥ δ ||Σ2^{1/2} w|| - η2
(4.21) ||Ω1^{-1/2} (θ1 - ν1)|| ≤ τ
(4.22) ||Ω2^{-1/2} (θ2 - ν2)|| ≤ τ
(4.23) ξ_i ≥ 0, i = 1, ..., n
(4.24) η1 ≥ 0, η2 ≥ 0
As before, this problem can be solved with the two-step iterative SOCP procedure. Imposing the generative prior with soft constraints ensures that, as the amount of training data increases, the data overwhelms the prior and the algorithm converges to the maximum-margin separating hyperplane.
5 Experiments
The experiments were designed both to demonstrate the usefulness of the proposed approach for incorporating a generative prior into discriminative classification, and to address a broader question by showing that it is possible to use an existing domain theory to aid in a classification task for which it was not specifically designed. In order to construct the generative prior, the generative LDA classifier was trained on the data from the training classification task to estimate the Gaussian location parameters θ1, θ2, as described in Section 2. The compression function f
is subsequently computed (also as described in Section 2), and is used to set the hyperprior location parameters ν1, ν2 via reconstruction with respect to the test task semantic distances. In order to apply a domain theory effectively to a task for which it was not specifically designed, the algorithm must be able to estimate its confidence in the decomposition of the domain theory with respect to this new learning task. To model the uncertainty in the applicability of WordNet to newsgroup categorization, our system estimated its confidence in the homogeneity of each equivalence class of semantic distances by computing the empirical variance of the mean components within the class. The hyperprior confidence matrices Ω1, Ω2 were then reconstructed from these variances with respect to the test task semantic distances. Identity matrices were used as covariance matrices of the lower-level prior: Σ1 = Σ2 = I. The remaining constants (τ, δ, and the slack penalties) were chosen manually to optimize performance on Experiment 1 (training task: atheism vs. guns; test task: guns vs. mideast; see Figure 5.2) without observing any data from any other classification tasks. The resulting classifier was evaluated in different experimental setups (with different pairs of newsgroups chosen for the training and the test tasks) to justify the following claims:

The bilevel generative/discriminative classifier with WordNet-derived prior knowledge has good low-sample performance, showing both the feasibility of automatically interpreting the knowledge embedded in WordNet and the efficacy of the proposed algorithm.

The bilevel classifier’s performance improves with increasing training sample size.

Integrating generative prior into the discriminative classification framework results in better performance than integrating the same prior directly into the generative framework via Bayes’ rule.

The bilevel classifier outperforms a state-of-the-art discriminative multitask classifier proposed by Evgeniou and Pontil [2004] by taking advantage of the WordNet domain theory.
In order to evaluate the low-sample performance of the proposed classifier, four newsgroups from the 20-newsgroup dataset were selected for experiments: atheism, guns, middle east, and auto. Using these categories, thirty experimental setups were created for all the possible ways of assigning newsgroups to training and test tasks (with a pair of newsgroups assigned to each task, under the constraint that the training and test pairs cannot be identical).^3 (^3 Newsgroup articles were preprocessed by removing words which could not be interpreted as nouns by WordNet. This preprocessing ensured that only one part of the WordNet domain theory was exercised, and resulted in virtually no reduction in classification accuracy.) In each experiment, we compared the following two classifiers:

Our bilevel generative/discriminative classifier, with the knowledge transfer functions learned from the labeled training data provided for the training task (using 90% of all the available data for that task). The resulting prior was subsequently introduced into the discriminative classification framework via our approximate bilevel programming approach.

A vanilla SVM classifier which minimizes the regularized empirical risk:
(5.1) min_{w, b, ξ} (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i
(5.2) s.t. y_i (w^T x_i - b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, ..., n
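The vanilla soft-margin SVM objective can be sketched in a few lines of pure Python via subgradient descent on the regularized hinge loss; this is only an illustrative stand-in for the actual solver used in the experiments, on made-up 2-D data:

```python
# Minimal sketch of the vanilla soft-margin SVM baseline: minimize
# (1/2)|w|^2 + C * sum_i max(0, 1 - y_i (w . x_i - b)) by subgradient
# descent on a toy 2-D dataset.

def svm_subgradient(data, C=1.0, lr=0.01, epochs=2000):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0          # gradient of the (1/2)|w|^2 term
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] - b)
            if margin < 1:             # hinge-loss subgradient is active
                gw[0] -= C * y * x[0]
                gw[1] -= C * y * x[1]
                gb += C * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

def predict(x, w, b):
    return 1 if w[0] * x[0] + w[1] * x[1] - b >= 0 else -1

# Linearly separable toy set (label +1 upper-right, -1 lower-left).
data = [([2.0, 2.0], 1), ([3.0, 2.5], 1),
        ([-2.0, -2.0], -1), ([-3.0, -1.5], -1)]
w, b = svm_subgradient(data)
print([predict(x, w, b) for x, _ in data])  # [1, 1, -1, -1]
```

With soft constraints, points inside the margin simply incur slack (hinge loss) rather than making the problem infeasible, mirroring the role of ξ in (5.1)-(5.2).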
Both classifiers were trained on 0.5% of all the available data from the test classification task^4, and evaluated on the remaining 99.5% of the test task data. (^4 SeDuMi software [Sturm 1999] was used to solve the iterative SOCP programs.) The results, averaged over one hundred randomly selected datasets, are presented in Figure 5.1, which shows the plot of the accuracy of the bilevel generative/discriminative classifier versus the accuracy of the SVM classifier, evaluated in each of the thirty experimental setups. All the points lie above the 45° line, indicating an improvement in performance due to the incorporation of prior knowledge via the bilevel programming framework. The amount of improvement ranges from 10% to 30%, with all of the improvements being statistically significant at the 5% level.
The next experiment was conducted to evaluate the effect of increasing training data (from the test task) on the performance of the system. For this experiment, we selected three newsgroups (atheism, guns, and middle east) and generated six experimental setups based on all the possible ways of splitting these newsgroups into unique training/test pairs. In addition to the classifiers 1 and 2 above, the following classifiers were evaluated:

A state-of-the-art multitask classifier designed by Evgeniou and Pontil [2004]. The classifier learns a set of related classification functions w_t = w_0 + v_t for the classification tasks t ∈ {training task, test task}, given data points (x_it, y_it) for each task, by minimizing the regularized empirical risk:
(5.3) min_{w_0, v_t, b_t, ξ} Σ_t Σ_i ξ_it + λ1 Σ_t ||v_t||^2 + λ2 ||w_0||^2
(5.4) s.t. y_it ((w_0 + v_t)^T x_it - b_t) ≥ 1 - ξ_it
(5.5) ξ_it ≥ 0
The regularization captures a tradeoff between the final models being close to the average model and having a large margin on the training data. 90% of the training task data was made available to the classifier. The constant λ1 was fixed, and λ2 was selected from a set of candidate values to optimize the classifier's performance on Experiment 1 (training task: atheism vs. guns; test task: guns vs. mideast; see Figure 5.2) after observing 0.5% of the test task data (in addition to the training task data).

The LDA classifier described in Section 2, trained on 90% of the test task data. Since this classifier is the same as the bottom-level generative classifier used in the bilevel algorithm, its performance gives an upper bound on the performance of the bottom-level classifier trained in a generative fashion.
Figure 5.2 shows the performance of classifiers 1-3 as a function of the size of the training data from the test task (evaluation was done on the remaining test-task data). The results are averaged over one hundred randomly selected datasets. The performance of the bilevel classifier improves with increasing training data, both because the discriminative portion of the classifier aims to minimize the training error and because the generative prior is imposed with soft constraints. As expected, the performance curves of the classifiers converge as the amount of available training data increases. Even though the constants used in the mathematical program were selected in a single experimental setup, the classifier's performance is reasonable for a wide range of data sets across different experimental setups, with the possible exception of Experiment 4 (training task: guns vs. mideast; test task: atheism vs. mideast), where the means of the constructed elliptical priors are much closer to each other than in the other experiments. Thus, the prior is imposed with greater confidence than is warranted, adversely affecting the classifier's performance.
The multitask classifier 3 outperforms the vanilla SVM by generalizing from data points across classification tasks. However, it does not take advantage of prior knowledge, while our classifier does. The gain in performance of the bilevel generative/discriminative classifier is due to the fact that the relationship between the classification tasks is captured much better by WordNet than by simple linear averaging of weight vectors.
Because of the constants involved in both the bilevel classifier and the generative classifiers with Bayesian priors, it is hard to do a fair comparison between classifiers constrained by generative priors in these two frameworks. Instead, the generatively trained classifier 4 gives an empirical upper bound on the performance achievable by the bottom-level classifier trained generatively on the test task data. The accuracy of this classifier is shown as a horizontal line in the plots in Figure 5.2. Since discriminative classification is known to be superior to generative classification for this problem, the SVM classifier outperforms the generative classifier given enough data in four out of six experimental setups. More interestingly, for a range of training sample sizes, the bilevel classifier constrained by the generative prior outperforms both the SVM trained on the same sample and the generative classifier trained on a much larger sample in these four setups. This means that, unless prior knowledge outweighs the effect of learning, it cannot enable the LDA classifier to compete with our bilevel classifier on those problems.
Finally, a set of experiments was performed to determine the effect of varying the mathematical program parameters on the generalization error. Each parameter was varied over a set of values, with the rest of the parameters held fixed ( was increased up to its maximum feasible value). The evaluation was done in the setup of Experiment 1 (training task: atheism vs. guns; test task: guns vs. mideast), with a training set size of 9 points. The results are presented in Figure 5.3. Increasing the value of is equivalent to requiring the hyperplane separator to have smaller error given the prior. Decreasing the value of is equivalent to increasing the confidence in the hyperprior. Both of these actions tighten the constraints (i.e., decrease the feasible region). With good prior knowledge, this should improve generalization performance for small training samples, since the prior is imposed with higher confidence. This is precisely what we observe in the plots of Figure 5.3.
6 Generalization Performance
Why does the algorithm generalize well for low sample sizes? In this section, we derive a theorem which demonstrates that the convergence rate of the generalization error of the constrained generative-discriminative classifier depends on the parameters of the mathematical program, and not just the margin, as would be expected in the case of large-margin classification without the prior. In particular, we show that as the certainty of the generative prior knowledge increases, the upper bound on the generalization error of the classifier constrained by the prior decreases. By increasing certainty of the prior, we mean that either the hyperprior becomes more peaked (i.e., the confidence in the locations of the prior means increases) or the desired upper bounds on the Type I and Type II error probabilities of the classifier decrease (i.e., the requirement that the lower-level discriminative player choose the restricted Bayes-optimal hyperplane is more strictly enforced).
The argument proceeds by bounding the fat-shattering dimension of the classifier constrained by prior knowledge. The fat-shattering dimension of a large-margin classifier is given by the following definition [Taylor BartlettTaylor Bartlett1998]:
Definition 6.1.
A set of points x_1, …, x_n is γ-shattered by a set F of functions mapping from a domain X to ℝ if there are real numbers r_1, …, r_n such that, for each b ∈ {0, 1}^n, there is a function f_b in F with f_b(x_i) ≥ r_i + γ if b_i = 1 and f_b(x_i) ≤ r_i − γ otherwise. We say that r_1, …, r_n witness the shattering. The fat-shattering dimension of F is the function fat_F(γ) that maps γ to the cardinality of the largest γ-shattered set.
Specifically, we consider the class of functions
(6.1) 
The following theorem bounds the fat-shattering dimension of our classifier:
Theorem 6.2.
Let be the class of a priori constrained functions defined by (6.1), and let and
denote the minimum and maximum eigenvalues of matrix
, respectively. If a set of points is shattered by , then , where with and , assuming that , , and .
Proof.
See Appendix B. ∎
We have the following corollary, which follows directly from Taylor and Bartlett’s Theorem 1.5 [Taylor BartlettTaylor Bartlett1998] and bounds the classifier’s generalization error based on its fat-shattering dimension:
Corollary 6.3.
Let be a class of real-valued functions. Then, with probability at least over independently generated examples , if a classifier has margin at least on all the examples in , then the error of is no more than where . If is the class of functions defined by (6.1), then . If is the usual class of large-margin classifiers (without the prior), then the result in [Taylor BartlettTaylor Bartlett1998] shows that .
Notice that both bounds depend on . However, the bound for the classifier constrained by the generative prior also depends on and through the term . In particular, as increases, tightening the constraints, the bound decreases, ensuring, as expected, quicker convergence of the generalization error. Similarly, decreasing also tightens the constraints and decreases the upper bound on the generalization error. For , the factor is less than and the upper bound on the fat-shattering dimension is tighter than the usual bound in the no-prior case on .
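For reference, the classical no-prior bound that Corollary 6.3 builds on can be written as follows. This is the textbook Bartlett–Shawe-Taylor result in standard notation (R denotes the radius of a ball containing the data and γ the margin), not a restatement of Theorem 6.2:

```latex
% Linear functions of bounded norm over a ball of radius R:
F = \{\, x \mapsto \langle w, x \rangle \;:\; \|w\| \le 1,\ \|x\| \le R \,\},
\qquad
\mathrm{fat}_{F}(\gamma) \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
```

Theorem 6.2 multiplies this quantity by a constraint-dependent factor that falls below one as the prior constraints tighten, which is the source of the faster convergence discussed above.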
Since controls the amount of deviation of the decision boundary from the Bayes-optimal hyperplane and depends on the variance of the hyperprior distribution, tightening these constraints corresponds to increasing our confidence in the prior. Note that a high value of represents a high level of user confidence in the generative elliptical model. Also note that there are two ways of increasing the tightness of the hyperprior constraint (4.7): one is through the user-defined parameter , the other is through the automatically estimated covariance matrices . These matrices estimate the extent to which the equivalence classes defined by WordNet create an appropriate decomposition of the domain theory for the newsgroup categorization task. Thus, a tight constraint (4.7) represents both a high level of user confidence in the means of the generative classification model (estimated from WordNet) and a good correspondence between the partition of the words imposed by the semantic distance of WordNet and the elliptical generative model of the data. As approaches zero and approaches its highest feasible value, the solution of the bilevel mathematical program reduces to the restricted Bayes-optimal decision boundary computed solely from the generative prior distributions, without using the data.
Hence, we have shown that, as the prior is imposed with an increasing level of confidence (which means that the elliptical generative model is deemed good, or the estimates of its means are good, which in turn implies that the domain theory is well suited for the classification task at hand), the convergence rate of the generalization error of the classifier increases. Intuitively, this is precisely the desired effect of increased confidence in the prior, since the benefit derived from the training data is outweighed by the benefit derived from prior knowledge. For small training samples, this should result in improved accuracy assuming the domain theory is good, which is what the plots in Figure 5.3 show.
7 Related Work
There are a number of approaches to combining generative and discriminative models. Several of these focus on deriving discriminative classifiers from generative distributions [Tong KollerTong Koller2000a, TippingTipping2001] or on learning the parameters of generative classifiers via discriminative training methods [Greiner ZhouGreiner Zhou2002, Roos, Wettig, Grunwald, Myllymaki, TirriRoos et al.2005]. The closest in spirit to our approach is the Maximum Entropy Discrimination framework [JebaraJebara2004, Jaakkola, Meila, JebaraJaakkola et al.1999], which performs discriminative estimation of the parameters of a generative model, taking into account the constraints of fitting the data and respecting the prior. One important difference from our framework is that, in estimating these parameters, maximum entropy discrimination minimizes the distance between the generative model and the prior, subject to satisfying the discriminative constraint that the training data be classified correctly with a given margin. Our framework, on the other hand, maximizes the margin on the training data subject to the constraint that the generative model is not too far from the prior. This emphasis on maximizing the margin allows us to derive a priori bounds on the generalization error of our classifier based on the confidence in the prior, bounds which are not (yet) available for the maximum entropy framework. Another difference is that our approach performs classification via a single generative model, while maximum entropy discrimination averages over a set of generative models weighted by their probabilities. This is similar to the distinction between maximum-a-posteriori and Bayesian estimation and has repercussions for tractability. Maximum entropy discrimination, however, is more general than our framework in the sense of allowing a richer set of behaviors based on different priors.
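The contrast can be made concrete with a toy sketch. This is an illustrative stand-in, not the paper’s actual bilevel program: a linear classifier trained by hinge-loss minimization whose weight vector is softly tethered to a prior direction w0 (here, an assumed class-mean difference), mimicking “maximize the margin subject to staying near the prior.” The function name, solver, and all constants are our own choices.

```python
import numpy as np

def train_tethered(X, y, w0, lam=1.0, rho=0.5, lr=0.01, steps=2000):
    """Minimize hinge loss + lam*||w||^2 + rho*||w - w0||^2 by subgradient descent.

    rho controls confidence in the prior direction w0; rho = 0 recovers a
    plain soft-margin linear classifier (illustrative only).
    """
    w = w0.copy()
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                        # margin-violating examples
        g = -(y[active, None] * X[active]).sum(0)   # hinge subgradient
        g += 2 * lam * w + 2 * rho * (w - w0)       # ridge term + prior tether
        w -= lr * g / len(X)
    return w

# Toy data: two Gaussian classes; prior direction = assumed mean difference.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w = train_tethered(X, y, w0=np.array([1.0, 1.0]))
acc = np.mean(np.sign(X @ w) == y)
```

Swapping the roles of the two terms (minimize the distance to the prior subject to margin constraints) gives the MED-style formulation; the code above reflects our margin-first ordering.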
Ng et al. [ng:hybrid, ng:logregr] explore the relative advantages of discriminative and generative classification and propose a hybrid approach which improves classification accuracy in both low-sample and high-sample scenarios. Collins [hmmperceptron] proposes using the Viterbi algorithm for HMMs for inference (which is based on generative assumptions), combined with a discriminative learning algorithm for HMM parameter estimation. These research directions are orthogonal to our work, since they do not explicitly consider the question of integrating prior knowledge into the learning problem.
In the context of support vector classification, various forms of prior knowledge have been explored. Scholkopf et al. [scholkopf] demonstrate how to integrate prior knowledge about invariance under transformations and the importance of local structure into the kernel function. Fung et al. [fung] use domain knowledge in the form of labeled polyhedral sets to augment the training data. Wu and Srihari [wu] allow domain experts to specify their confidence in each example’s label, varying the effect of each example on the separating hyperplane in proportion to its confidence. Epshteyn and DeJong [rotatesvm] explore the effects of rotational constraints on the normal of the separating hyperplane. Sun and DeJong [qiangebl] propose an algorithm which uses domain knowledge (such as WordNet) to identify relevant features of examples and incorporates the resulting information in the form of soft constraints on the hypothesis space of the SVM classifier. Mangasarian et al. [mangasarian] suggest the use of prior knowledge for support vector regression. In all of these approaches, prior knowledge takes the form of explicit constraints on the hypothesis space of the large-margin classifier. In this work, the emphasis is on generating such constraints automatically from domain knowledge interpreted in the generative setting. As we demonstrate with our WordNet application, a generative interpretation of background knowledge is very intuitive for natural language processing problems.
Second-order cone constraints have been applied extensively to model probability constraints in robust convex optimization [Lobo, Vandenberghe, Boyd, LebretLobo et al.1998, Bhattacharyya, Pannagadatta, SmolaBhattacharyya et al.2004] and constraints on the distribution of the data in minimax machines [Lanckriet, Ghaoui, Bhattacharyya, JordanLanckriet et al.2001, Huang, King, Lyu, ChanHuang et al.2004]. Our work, as far as we know, is the first to model prior knowledge with such constraints. The resulting optimization problem and its connection to Bayes-optimal classification are very different from the approaches mentioned above.
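The standard reduction behind such constraints (used, e.g., in the minimax machines cited above) is that a Gaussian chance constraint Pr(wᵀx ≥ 0) ≥ η for x ~ N(μ, Σ) is equivalent to the second-order cone constraint wᵀμ ≥ κ‖Σ^{1/2}w‖ with κ = Φ⁻¹(η). A minimal Monte Carlo check of this equivalence (all numeric values are illustrative, not taken from the paper):

```python
import numpy as np

# Chance constraint Pr(w^T x >= 0) >= eta for x ~ N(mu, Sigma) is equivalent
# to the SOC constraint  w^T mu >= kappa * ||Sigma^{1/2} w||,
# where kappa = Phi^{-1}(eta) is the standard Gaussian quantile.
eta, kappa = 0.95, 1.6449   # Phi^{-1}(0.95) ~= 1.6449

mu = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
w = np.array([1.0, 0.5])

# Left- and right-hand sides of the SOC constraint.
lhs = w @ mu
rhs = kappa * np.sqrt(w @ Sigma @ w)
soc_holds = lhs >= rhs

# Monte Carlo estimate of the chance constraint for the same w.
rng = np.random.default_rng(1)
x = rng.multivariate_normal(mu, Sigma, size=200_000)
p = np.mean(x @ w >= 0)   # should be at least eta whenever soc_holds is True
```

In our setting the Gaussian is the elliptical prior over the data, so bounds on the Type I and Type II error probabilities of the lower-level classifier turn into cone constraints of exactly this shape.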
Our work is also related to empirical Bayes estimation [Carlin LouisCarlin Louis2000]. In empirical Bayes estimation, the hyperprior parameters of the generative model are estimated using statistical methods (usually maximum likelihood or the method of moments) through the marginal distribution of the data, while our approach learns those parameters discriminatively using the training data.
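As a toy illustration of the empirical Bayes side of this distinction (a standard Gaussian–Gaussian example, not the paper’s model): when θ ~ N(m, τ²) and x_i | θ ~ N(θ, σ²), each x_i is marginally N(m, τ² + σ²), so the maximum-marginal-likelihood estimate of the hyperprior mean m is simply the grand mean of the pooled data.

```python
import numpy as np

# Gaussian-Gaussian model: theta ~ N(m, tau^2), x_i | theta ~ N(theta, sigma^2).
# Marginally x_i ~ N(m, tau^2 + sigma^2), so the marginal-likelihood MLE of the
# hyperprior mean m is the sample mean of all observations.
rng = np.random.default_rng(2)
m_true, tau, sigma = 3.0, 1.0, 0.5
thetas = rng.normal(m_true, tau, size=200)              # one theta per task
x = rng.normal(thetas[:, None], sigma, size=(200, 5))   # 5 observations per task

m_hat = x.mean()   # empirical Bayes estimate of the hyperprior mean m
```

Our approach would instead choose the hyperprior parameters to serve the downstream discriminative objective on the labeled training data, rather than to fit the marginal distribution.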
8 Conclusions and Future Work
Since many sources of domain knowledge (such as WordNet) are readily available, we believe that significant benefit can be achieved by developing algorithms that automatically apply their information to new classification problems. In this paper, we argued that the generative paradigm for interpreting background knowledge is preferable to the discriminative interpretation, and presented a novel algorithm which enables discriminative classifiers to utilize generative prior knowledge. Our algorithm was evaluated in the context of a complete system which, faced with the newsgroup classification task, was able to estimate the parameters needed to construct the generative prior from the domain theory and to use this construction to achieve improved performance on new newsgroup classification tasks.
In this work, we restricted our hypothesis class to that of linear classifiers. Extending the form of the prior distribution to distributions other than elliptical and/or seeking Bayes-optimal classifiers restricted to a more expressive class than that of linear separators may improve classification accuracy for non-linearly-separable domains. However, it is not obvious how to approximate this more expressive form of prior knowledge with convex constraints. The kernel trick may be helpful in handling non-linear problems, assuming that it is possible to represent the optimization problem exclusively in terms of dot products of the data points and constraints. This is an important issue which requires further study.
We have demonstrated that interpreting a domain theory in the generative setting is intuitive and produces good empirical results. However, there are usually multiple ways of interpreting a domain theory. In WordNet, for instance, semantic distance between words is only one measure of the information contained in the domain theory. Other, more complicated, interpretations might, for example, take into account the types of links on the path between words (hypernyms, synonyms, meronyms, etc.) and exploit commonsense observations about WordNet, such as the observation that words closer to the category label are more likely to be informative than words farther away. Comparing multiple ways of constructing the generative prior from the domain theory and, ultimately, selecting one of these interpretations automatically is a fruitful direction for further research.
The authors thank the anonymous reviewers for valuable suggestions on improving the paper. This material is based upon work supported in part by the National Science Foundation under Award NSF IIS 0413161 and in part by the Information Processing Technology Office of the Defense Advanced Research Projects Agency under award HR00110510040. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Defense Advanced Research Projects Agency.
Appendix A Convergence of the Generative/Discriminative Algorithm
Let the map determine an algorithm that, given a point , generates a sequence of iterates through the iteration . The iterative algorithm in Section 4 generates a sequence of iterates by applying the following map :
(A.1) 
(A.2)  
(A.3)  
(A.4)  
(A.5)  
(A.6) 
with the conic constraints .
(A.7)  
(A.8)  
(A.9) 
with .
Notice that and are functions because the minima of the optimization problems (4.10)-(4.14) and (4.15)-(4.16) are unique. This is the case because Step 1 optimizes a strictly convex function on a convex set, and Step 2 optimizes a linear nonconstant function on a strictly convex set.
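The alternating structure can be sketched with a generic two-block coordinate-descent toy. The quadratic objective and the closed-form block solvers below are illustrative stand-ins for Steps 1 and 2, chosen only so that each subproblem has a unique minimizer: since each step solves its block exactly, the objective is non-increasing, mirroring the convergence argument of Theorem 4.1.

```python
def f(x, y):
    """Toy smooth objective with a unique joint minimizer (illustrative)."""
    return (x - 2 * y) ** 2 + (y - 1) ** 2 + x ** 2

def step1(y):
    # Exact minimizer over x with y fixed: d/dx = 2(x - 2y) + 2x = 0  =>  x = y.
    return y

def step2(x):
    # Exact minimizer over y with x fixed:
    # d/dy = -4(x - 2y) + 2(y - 1) = 0  =>  y = (4x + 2) / 10.
    return (4 * x + 2) / 10

x, y = 5.0, -3.0
vals = [f(x, y)]
for _ in range(50):
    x = step1(y)   # "Step 1": minimize over the first block
    y = step2(x)   # "Step 2": minimize over the second block
    vals.append(f(x, y))

# Exact block minimization can never increase the objective.
monotone = all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
```

The uniqueness of each block minimizer is what makes the maps single-valued, as in the paragraph above; without it, coordinate descent could cycle among minimizers.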
Convergence of the objective function of the algorithm was shown in Theorem 4.1. Let denote the set of points on which the map does not change the value of the objective function, i.e., . We will show that every accumulation point of lies in . We will also show that every point augmented with is a point with no feasible descent directions for the optimization problem (4.5)-(4.9), which can be equivalently expressed as:
(A.10) 
In order to formally state our result, we need a few concepts from duality theory. Let a constrained optimization problem be given by
(A.11) 
The following conditions, known as the Karush-Kuhn-Tucker (KKT) conditions, are necessary for to be a local minimum:
Proposition A.1.
The following well-known result states that the KKT conditions are sufficient for to be a point with no feasible descent directions:
Proposition A.2.
If such that the following conditions are satisfied at :