As the popularity of big data increases and more data is being gathered, the importance of sequential models that are able to continuously update with new data has increased. These models are particularly crucial in high throughput real-time applications such as speech or streaming text classification. To this end, we propose a sequential framework to update the probabilistic maximum margin classifier built from the Maximum Entropy Discrimination (MED) principle of .
The proposed sequential MED framework can be cast as recursive Bayesian estimation in which the likelihood function is a log-linear model formed from a series of constraints and weighted by Lagrange multipliers. In the Gaussian case it resembles Gaussian process classification, which has been studied previously [2, 3, 4, 5, 6, 7]; to the best of our knowledge, however, no method for recursively updating the Gaussian process classifier has been developed. In the single time point case, sequential MED can be specialized to the support vector machine  and Laplacian support vector machine  as previously discussed in  and .
We are interested in situations where we receive a stream of data over time where each is a matrix of dimension , with denoting the number of feature variables and denoting the number of i.i.d. samples, where may vary with time. In the fully labeled scenario, the data has corresponding labels ; however in the partially labeled scenario, at each time point , only of the samples have labels. We define the observed data at any time point as and all observed data up to time as . Such scenarios would arise in a variety of domains such as a satellite that only transmits its data daily or a government agency that only releases its data quarterly with their corresponding reports. The rest of the paper is organized as follows: Section 2 and Section 3 will discuss how to sequentially update the corresponding MED models for supervised and semi-supervised classification. Section 4 validates the method by simulation and we present an application to a dataset of spoken letters of the English alphabet.
2 Sequential MED
Constrained relative entropy minimization is used to estimate the closest distribution to a given prior distribution subject to a set of moment constraints. The authors of
show that, if the prior distribution is from the exponential family, then the density that optimizes the constrained relative entropy problem is also a member of the exponential family. Similar to Bayesian conjugate priors, there exist relative entropy conjugate priors that facilitate evaluation of the closest distribution. These produce optimal constrained relative entropy densities, which can be thought of as posteriors, from the same parametric family as the prior. Maximum entropy discrimination (MED) also admits conjugate priors, as it is a special case of constrained relative entropy minimization where one of the constraints is over a parametric family of discriminant functions .
2.1 Review of MED for Maximum Margin Classification
In this paper, we are interested in maximum margin binary classifiers. In this case the discriminant function is linear for some feature transformation , feature weights vector , and bias term . Slack variables are used to create a margin in the constraints , the expected hinge loss with slack variables. The MED objective function is
whose solution is the constrained minimum relative entropy posterior. The associated MED decision rule is a weighted combination of discriminant functions. The minimum relative entropy posterior has the form
where are Lagrange multipliers that minimize the partition function . It is common to set the initial prior distribution to the separable form:
. If in addition, we specify that , is , and is a zero mean Bayesian non-informative (diffuse) prior, denoted , then the Lagrange multipliers can be obtained as the solution to the constrained optimization
where . This objective function has a log barrier term instead of the inequality constraints commonly found in the dual form of the SVM. Except in some ill-defined cases where the maximum lies near the boundary of the feasible set, the will be identical to the optimal support vectors that maximize the SVM objective. The authors in [1, 9] show that the maximum a posteriori (MAP) estimator for of the MED posterior is related to the Lagrange multipliers by , so the MED posterior mode is equivalent to a maximum margin classifier.
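The effect of the log-barrier term can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the dual has the form usually given in the MED literature, J(λ) = Σ λ_i + Σ log(1 − λ_i / c) − ½ Σ λ_i λ_j y_i y_j K(x_i, x_j), and compares it with the standard SVM dual on toy data. The function names, the toy data, and the margin-prior parameter `c` are hypothetical.

```python
import numpy as np

def svm_dual_objective(lam, y, K):
    """Standard SVM dual: sum(lam) - 0.5 * (lam*y)^T K (lam*y)."""
    v = lam * y
    return lam.sum() - 0.5 * v @ K @ v

def med_dual_objective(lam, y, K, c):
    """Assumed MED dual: the SVM dual plus a log barrier that keeps lam below c."""
    lam = np.asarray(lam, dtype=float)
    if np.any(lam < 0) or np.any(lam >= c):
        return -np.inf  # outside the feasible set 0 <= lam < c
    barrier = np.sum(np.log1p(-lam / c))  # sum of log(1 - lam_i / c)
    return svm_dual_objective(lam, y, K) + barrier

# Toy, linearly separable data with a linear kernel (illustrative only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T

c = 10.0
lam = np.full(4, 0.05)  # satisfies the equality constraint sum(lam * y) = 0

med_value = med_dual_objective(lam, y, K, c)
svm_value = svm_dual_objective(lam, y, K)
gap = svm_value - med_value  # equals minus the barrier term
```

Far from the boundary λ_i = c the two objectives differ only by the small barrier term, which is consistent with the statement that the maximizers generally coincide with the SVM solution.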
2.2 Updating MED
Under the separable prior assumptions above, the MED posterior will take the factored form . Because the slack parameters do not depend on the data , the density does not affect the MED decision rule given after (1). Hence only and are important. This remaining part of the MED posterior has the form: , which is a conjugate distribution. Because of this conjugacy, the posterior distribution optimizing the objective in (1) can be propagated forward in time recursively. The updating procedure is given in the following theorem and corollaries.
Let the MED prior at be , and . Then given data at time point , the relative entropy conjugate priors are
and the MED posterior can be represented as
where is the prior mean and is the same as the Bayes non-informative prior.
Introducing the kernel function and the parameter transformation , the posterior at time can be represented in terms of this kernel.
The equivalent prior at for the transformed parameter is where . Furthermore, the posterior at time is of Gaussian form
where the mean parameter satisfies the recursions .
Since is Gaussian, the MAP estimator is simply the mean parameter given in Corollary 1.1. Thus the decision rule reduces to where the MAP estimator is a function of the previously estimated Lagrange multipliers and the maximizing values and for the current time point .
Given all previous , the current optimal Lagrange multipliers are the solution to
and, holding the Lagrange multipliers fixed, the optimal bias
ensures that the expectation constraints in the objective hold.
The above dual formulation for the Lagrange multipliers has some interesting implications. Since the Lagrange multipliers from the previous time points are fixed at time step , the factors are constants and can be thought of as (unnormalized) weights for , the Lagrange multipliers from the current time point. Thus the Lagrange multipliers for samples that are easily predicted using only the prior information will have lower weight than those for samples that are difficult to predict or are predicted incorrectly.
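The weighting effect described above can be illustrated with a small example. This is a sketch under stated assumptions rather than the expression from Corollary 1.2: it assumes that, for a Gaussian prior centered at the previous mean, the coefficient of the i-th current multiplier in the linear term of the dual is 1 − y_i f_prev(x_i), where f_prev is the decision function built from earlier time points only. The bias term is omitted, and all function names and data are hypothetical.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of inner products between rows of A and rows of B."""
    return A @ B.T

def prior_decision_values(X_new, X_prev, y_prev, lam_prev, kernel=linear_kernel):
    """f_prev(x) = sum_j lam_j * y_j * K(x_j, x), using earlier time points only."""
    return kernel(X_new, X_prev) @ (lam_prev * y_prev)

def current_multiplier_weights(X_new, y_new, X_prev, y_prev, lam_prev,
                               kernel=linear_kernel):
    """Assumed weight 1 - y_i * f_prev(x_i) on each current Lagrange multiplier."""
    f_prev = prior_decision_values(X_new, X_prev, y_prev, lam_prev, kernel)
    return 1.0 - y_new * f_prev

# Earlier time point: two samples with fixed multipliers, so f_prev(x) = x[0].
X_prev = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_prev = np.array([1.0, -1.0])
lam_prev = np.array([0.5, 0.5])

# Current time point: an easy sample, a borderline one, and a misclassified one.
X_new = np.array([[2.0, 0.0], [0.1, 0.0], [-1.0, 0.0]])
y_new = np.array([1.0, 1.0, 1.0])

weights = current_multiplier_weights(X_new, y_new, X_prev, y_prev, lam_prev)
```

In this toy case the sample that the prior model already classifies with a large margin receives the smallest weight, and the misclassified sample receives the largest.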
3 Manifold Regularization
Next we consider the case where some of the labels are missing. Without loss of generality we will assume the first points are labeled and the latter points are unlabeled.
We will adopt the semi-supervised MED classification framework of , called Laplacian MED (LapMED). LapMED introduces an additional “geometric” constraint
to (1) where is a compact submanifold, is the Laplace-Beltrami operator on , and controls the complexity of the decision boundary in the intrinsic geometry of . This constraint was motivated by the semi-supervised framework of  to encourage the function to be smooth over the support set of the feature distribution
, inducing a geometric interpolation of unlabeled points. Since the marginal distribution is unknown, from
where is the normalized graph Laplacian formed with a heat kernel. The LapMED posterior can be approximated as
where is a Lagrange multiplier for the smoothness constraint.
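For concreteness, one common way to build a normalized graph Laplacian from a heat kernel is sketched below. This is a minimal illustration under stated assumptions, not necessarily the construction used in the experiments: it assumes a symmetrized k-nearest-neighbor graph, heat-kernel weights exp(−‖x_i − x_j‖² / (4·width)), and the normalization I − D^(−1/2) W D^(−1/2). The function name and the parameters `n_neighbors` and `width` are hypothetical.

```python
import numpy as np

def normalized_graph_laplacian(X, n_neighbors=5, width=1.0):
    """Normalized Laplacian of a symmetrized kNN graph with heat-kernel weights."""
    n = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    # Indices of the k nearest neighbors of each point, excluding the point itself.
    masked = sq_dist.copy()
    np.fill_diagonal(masked, np.inf)
    idx = np.argsort(masked, axis=1)[:, :n_neighbors]

    # Symmetric adjacency mask: keep an edge if either endpoint selected it.
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), n_neighbors)
    mask[rows, idx.ravel()] = True
    mask |= mask.T

    # Heat-kernel edge weights on the retained edges.
    W = np.where(mask, np.exp(-sq_dist / (4.0 * width)), 0.0)

    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))  # one batch of samples, stored as rows here
L = normalized_graph_laplacian(X, n_neighbors=5, width=1.0)
eigenvalues = np.linalg.eigvalsh(L)
```

The resulting matrix is symmetric and positive semidefinite with eigenvalues in [0, 2], which is what allows it to act as a smoothness penalty.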
3.1 Sequential Laplacian MED
The distribution that minimizes the objective with the additional constraint (2) can similarly be factorized and, like the distribution of slack parameters considered in Section 2, the distribution of the smoothness parameter is also independent of the data . Likewise, the distributions of the decision rule coefficients are conjugate to their priors. Thus the updating procedure for the LapMED problem is similar to the updating procedure in Section 2.
At , the MED priors for (or ), , and are the same as in Theorem 1, and the prior for is a Bayesian zero mean point prior, denoted . Then given data at time point , the MED conjugate prior and posterior are still for , the same as in Theorem 1 for and , and Gaussian of form for (or ). Define an expansion matrix as . Then the mean and covariance parameters for the distribution of are
where is a recursive graph of vertex disjoint subgraphs, and for the distribution of are
where is a kernel function that can be recursively defined as
Theorem 2 gives the posterior distribution for semi-supervised classification whose form is comparable to the form given in Corollary 1.1 for the supervised case. Indeed, the forms are identical except for the presence of the precision matrix term in the semi-supervised case. As the sparsity of is associated with the graph Laplacian, the kernel function of the semi-supervised case is a regularized version of the kernel function that appears in Corollary 1.1. If we let be a fixed parameter, then and optimize an objective of the same form as in Corollary 1.2, but with kernel function . If is chosen to be 0, the sequential LapMED simply ignores the unlabeled data of time point , and if all ’s are , then the unlabeled data is always ignored and the updating procedure is exactly the same as in the supervised scenario. These parameters are functions of the and , which are identical to the penalty parameters in the Laplacian SVM , associated with the reproducing kernel Hilbert space and data distribution respectively: and .
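The claim that a zero smoothness weight recovers the supervised kernel can be illustrated with a related, well-known construction. The sketch below does not implement the recursive kernel of Theorem 2. It substitutes the single-batch graph-regularized ("deformed") kernel from the Laplacian SVM literature, K̃(x, z) = K(x, z) − k_x^T (I + M K)^(−1) M k_z, where M is a weighted graph Laplacian. The RBF parameterization, the unnormalized Laplacian used to form M, and all names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, width=1.0):
    """RBF kernel exp(-||a - b||^2 / (2 * width^2)) between rows of A and B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def deformed_kernel(X_eval, Z_eval, X_graph, M, kernel=rbf_kernel):
    """Graph-regularized kernel: K(x, z) - k_x^T (I + M K)^{-1} M k_z."""
    n = X_graph.shape[0]
    K = kernel(X_graph, X_graph)   # Gram matrix on the graph points
    k_x = kernel(X_eval, X_graph)  # shape (n_eval, n)
    k_z = kernel(X_graph, Z_eval)  # shape (n, n_z)
    correction = k_x @ np.linalg.solve(np.eye(n) + M @ K, M @ k_z)
    return kernel(X_eval, Z_eval) - correction

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
K_base = rbf_kernel(X, X)

# Unnormalized graph Laplacian from fully connected RBF weights (an assumption).
W = K_base.copy()
np.fill_diagonal(W, 0.0)
laplacian = np.diag(W.sum(axis=1)) - W

K_zero = deformed_kernel(X, X, X, np.zeros((8, 8)))  # smoothness weight 0
K_reg = deformed_kernel(X, X, X, 0.5 * laplacian)    # smoothness weight 0.5
```

With M set to zero the correction vanishes and the base kernel is returned unchanged. With a positive weight the kernel is shrunk along directions that vary quickly over the graph.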
3.2 Approximating the Kernel Function
Because the kernel function in (2) is a function of the previous kernel functions, calculating a map to its associated Hilbert space can be computationally expensive. Thus in this subsection, we derive an approximation to the map to , which is computationally easier than direct recursive calculation.
Recall that we approximate the constraint in (2), at any time point , empirically with the graph Laplacian formed using the data from that time point . However, the non-empirical constraint using the Laplace-Beltrami operator over the unknown marginal distribution , is actually the same at every time point. Thus as , the prior graph converges to
are the eigenvalues of the Laplace-Beltrami operator, and  and  are the infinite sequences of right singular functions and singular values of . The approximate decomposition arises since the left singular functions of  are the eigenfunctions of the Laplace-Beltrami operator and . Thus, instead of empirically approximating the Laplacian as a sum of subgraphs , we can implement approximations to the eigen/singular values and singular functions in (4).
Assuming that the sample size is large enough, the average eigenvalues of the graph Laplacians would be a good estimator for the eigenvalues of the Laplace-Beltrami operator. Additionally, the rows of the matrix from the singular value decomposition of  will contain the basis for its row space. Thus, because the right singular functions form an orthonormal basis for the coimage of , if the mapping approximately preserves the basis, the mapped average singular vectors would be good estimators for the right singular functions, and correspondingly so for the singular values.
The posterior kernel function using an approximation to the decomposition in (4) will no longer be a recursive function of prior kernel functions that have the same form, like in (2). Instead for , it uses a prior kernel function
where is the non-regularized kernel function. So at time , the singular vectors of are used to update the average singular vectors, in the above function, through
and similarly so for the average corresponding singular values and the average eigenvalues of the graph Laplacians .
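The bookkeeping for such averages can be sketched as follows. This is a minimal illustration that assumes a uniform running mean over time points, which may differ from the exact update used above. It is shown for eigenvalues only, since averaging singular vectors would additionally require resolving their sign ambiguity. The class name and the toy Laplacians are hypothetical.

```python
import numpy as np

class RunningMean:
    """Incremental mean: mean_t = mean_{t-1} + (value_t - mean_{t-1}) / t."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, value):
        value = np.asarray(value, dtype=float)
        self.count += 1
        if self.mean is None:
            self.mean = value.copy()
        else:
            self.mean = self.mean + (value - self.mean) / self.count
        return self.mean

# Toy graph Laplacians for three time points (two vertices, one weighted edge).
laplacians = [
    np.array([[1.0, -1.0], [-1.0, 1.0]]),
    np.array([[2.0, -2.0], [-2.0, 2.0]]),
    np.array([[3.0, -3.0], [-3.0, 3.0]]),
]

avg_eigenvalues = RunningMean()
for L_t in laplacians:
    # eigvalsh returns eigenvalues in ascending order, so positions line up.
    avg_eigenvalues.update(np.linalg.eigvalsh(L_t))

final_average = avg_eigenvalues.mean
```

Only the current average and a counter are stored, so the cost of each update does not grow with the number of time points already seen.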
In this section, we compare the proposed sequential maximum margin classifiers to popular supervised and semi-supervised maximum margin classifiers (SVM  and LapSVM ) where the model is trained using just the current time point's data and where the model has been re-trained on all previous data. The former type of model is a lower bound on performance since it ignores all previous data, and the latter type of model is an upper bound since it is re-trained on all previous data at every time point. Note that the MED and SVM models differ only by a weak log-barrier term in the objective function, making their performance identical, and similarly so for LapMED and LapSVM. Thus their performance curves will be referred to as Full SVM/MED and Full LapSVM/LapMED.
In both of the following simulations, the models receive roughly 100 samples () at every time point, the parameters are empirically chosen with a validation set, and then the models are tested on an independent data set of 1000 test points. The test accuracy is the average accuracy over 100 trials of simulation.
In the first simulation, we generate data from 200 categorical distributions where 100 of the variables are sparse so they have high probability of being 0, another 50 of the variables have lower probability of being 0, and the final 50 variables are used to distinguish between the two classes. We use the term frequency - inverse document frequency (TF-IDF) kernel of , which is used in document processing and topic models. Figure 1 shows that the accuracy of the sequential model (SeqMED) improves as the model is updated with more training data and that, even after one model update, it performs much better than the independent model (SVM) that ignores previous training data. Of course the sequential model does not improve as rapidly as the model that is re-trained on all the data (Full SVM/MED), but this is the price paid for lower computational complexity. For example, at , SeqMED updates and fits 100 coefficients for the new data whereas Full SVM/MED fits 3,000 coefficients for all the data.
In the second simulation, we generate data from the interior of a 3-dimensional sphere where one class is roughly at the center of the sphere and the other class is on the shell, but only 10% of the samples are labeled. We use an RBF kernel with width 1 for the kernel function and a heat kernel with width 0.01 and a 20-nearest-neighbor graph for the graph Laplacian. Figure 2 shows an improvement in the performance of the sequential model similar to that in Figure 1. We use the approximate kernel function of Subsection 3.2 to perform each update, which suggests that the approximation is adequate.
We compare the proposed algorithms on the Isolet speech database from the UCI machine learning repository following the experimental framework used in . To train the models, we take the entire training set of 120 speakers (isolet1 - isolet4) and break them into 24 groups (time points) of 5 speakers where only the first speaker is labeled. At each time point, the models train on 260 samples ( and only have 259) where 52 of the samples are labeled. The parameters are set in the same way as in  and the test set is similarly composed of the 1,559 samples from isolet5. Figure 3 shows that, after two time points, the sequential model always performs better than the model that ignores previous data, and comes close to performing as well as the fully re-trained model as time progresses.
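The sample counts in this setup can be verified with a short calculation. The sketch below assumes the usual Isolet recording protocol, in which each speaker utters each of the 26 letters twice. The variable names are illustrative.

```python
# Isolet training split described above: 120 speakers in groups of 5.
n_speakers = 120
speakers_per_group = 5
labeled_speakers_per_group = 1

# Assumed recording protocol: 26 letters, each spoken twice per speaker.
n_letters = 26
utterances_per_letter = 2

n_time_points = n_speakers // speakers_per_group
samples_per_speaker = n_letters * utterances_per_letter
samples_per_time_point = speakers_per_group * samples_per_speaker
labeled_per_time_point = labeled_speakers_per_group * samples_per_speaker
labeled_fraction = labeled_per_time_point / samples_per_time_point
```

This reproduces the 24 time points, 260 samples per time point, and 52 labeled samples stated above, so one fifth of each batch is labeled.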
We have proposed recursive versions of supervised and semi-supervised maximum margin classifiers in the maximum entropy discrimination (MED) classification framework. The proposed sequential maximum margin classifiers perform nearly as well as much more computationally expensive fully re-trained maximum margin classifiers and significantly better than a classifier that ignores previous data.
Appendix A
Proof of Theorem 1.
Let where . At time , let the priors be , where , and . Then the posterior
So the posterior of the weights
the posterior of the bias term
and the posterior of the margin parameters do not depend on the data.
Proof of Corollary 1.1.
At time , let have prior where . Then the posterior
Proof of Corollary 1.2.
The optimal Lagrange multipliers at are the solution to
Proof of Theorem 2.
At time , let the priors for and be the same as in Theorem 1, where , and . Then the posterior and partition function factorize similarly as
The bias and margin terms are independent of , so their posterior and partition functions are the same as in Theorem 1. The posterior of the smoothness parameter does not depend on the data and