The history of the large volume principle (LVP) goes back to the early days of statistical learning theory, when Vapnik (1982) introduced it for the case of hyperplanes. It did not gain much attention, however, until El-Yaniv et al. (2008) proposed a creative approximation to implement LVP for the case of soft response vectors. Since then, it has been applied successfully to various binary learning problems, such as binary transductive learning (El-Yaniv et al., 2008), binary clustering (Niu et al., 2013a), and outlier detection (Li and Ng, 2013).
LVP is a learning-theoretic principle which views learning as selecting a hypothesis from a certain hypothesis space $\mathcal{H}$. Regardless of the form of the hypotheses, $\mathcal{H}$ can always be partitioned into a finite number of equivalence classes after we observe certain data, where an equivalence class is a set of hypotheses that generate the same labeling of the observed data. LVP prioritizes those equivalence classes according to the volume they occupy in $\mathcal{H}$. See the illustration in Figure 1: the blue ellipse represents $\mathcal{H}$, and it is partitioned into four equivalence classes, each occupying a quadrant of the Cartesian coordinate system intersected with $\mathcal{H}$; LVP claims that the two equivalence classes with larger volume are preferable to the other two.
In practice, the hypothesis space $\mathcal{H}$ cannot be as simple as in Figure 1. It frequently lies in a very high-dimensional space where exact or even quantifiable volume estimation is challenging. Therefore, El-Yaniv et al. (2008) proposed a volume approximation to bypass the volume estimation. Instead of focusing on the equivalence classes of $\mathcal{H}$, it directly focuses on the hypotheses in $\mathcal{H}$, since learning is regarded as hypothesis selection in LVP. It defines $\mathcal{H}$ via an ellipsoid, measures the angles from hypotheses to the principal axes of $\mathcal{H}$, and then prefers hypotheses near the long principal axes to those near the short ones. This is reasonable, since the long principal axes of $\mathcal{H}$ lie in large-volume regions. In Figure 1, two hypotheses are shown together with the long and short principal axes; LVP advocates that the hypothesis close to the long axis is preferable to the one close to the short axis. We can adopt this volume approximation to regularize our loss function, which has been shown to be helpful for various binary learning problems.
Nevertheless, despite its potential advantages, the volume approximation in El-Yaniv et al. (2008) only fits binary learning problem settings. In this paper, we extend it to a more general definition that can be applied to several transductive problem settings, including but not limited to multi-class learning (Zhou et al., 2003), multi-label learning (Kong et al., 2013), and serendipitous learning (Zhang et al., 2011). We adopt the same strategy as El-Yaniv et al. (2008): for $n$ data and $c$ labels, a hypothesis space is defined in $\mathbb{R}^{n \times c}$ and linked to an ellipsoid in $\mathbb{R}^{nc}$, such that the equivalence classes and the volume approximation can be defined accordingly. Similarly to the binary volume approximation, our approach is also distribution free, that is, the labeled and unlabeled data do not necessarily share the same marginal distribution. This advantage of transductive learning over (semi-supervised) inductive learning is especially useful for serendipitous problems, where the labeled and unlabeled data cannot be identically distributed.
We name the learning method which realizes the proposed multi-class volume approximation multi-class approximate volume regularization (MAVR). It involves a non-convex optimization problem, but the globally optimal solution is almost surely unique and accessible in time following Forsythe and Golub (1965). Moreover, we theoretically provide stability and error analyses for MAVR, as well as experimentally compare it to two state-of-the-art methods in Zhou et al. (2003) and Belkin et al. (2006) using USPS, MNIST, 20Newsgroups and Isolet.
The rest of this paper is organized as follows. In Section 2 the binary volume approximation is reviewed, and in Section 3 the multi-class volume approximation is derived. In Section 4, we develop and analyze MAVR. Finally, the experimental results are in Section 5.
2 Binary Volume Approximation
The binary volume approximation in El-Yaniv et al. (2008) involves a few key concepts: The soft response vector, the hypothesis space and the equivalence class, and the power and volume of equivalence classes. We review the concepts in this section for later use in the next section.
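Suppose that $\mathcal{X}$ is the domain of input data, and most often but not necessarily, $\mathcal{X} \subset \mathbb{R}^d$ where $d$ is a natural number. Given a set of $n$ data $X_n = \{x_1, \ldots, x_n\}$ where $x_i \in \mathcal{X}$, a soft response vector is an $n$-dimensional vector
$$h := (h_1, \ldots, h_n)^\top \in \mathbb{R}^n, \tag{1}$$
so that $h_i$ stands for a soft or confidence-rated label of $x_i$. For binary transductive learning problems, a soft response vector $h$ suggests that $x_i$ is from the positive class if $h_i > 0$, $x_i$ is from the negative class if $h_i < 0$, and the above two cases are equally possible if $h_i = 0$.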
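A hypothesis space is a collection of hypotheses. The volume approximation requires a symmetric positive-definite matrix $Q \in \mathbb{R}^{n \times n}$ which contains the pairwise information about $X_n$. Consider the hypothesis space
$$\mathcal{H}_Q := \{h \mid h^\top Q h \le 1\}, \tag{2}$$
where the hypotheses are soft response vectors. The set of sign vectors $\{\operatorname{sign}(h) \mid h \in \mathcal{H}_Q\}$ contains all $N = 2^n$ possible dichotomies of $X_n$, and $\mathcal{H}_Q$ can be partitioned into a finite number of equivalence classes $\mathcal{C}_1, \ldots, \mathcal{C}_N$, such that for fixed $k$, all hypotheses in $\mathcal{C}_k$ generate the same labeling of $X_n$.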
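Then, in statistical learning theory, the power of an equivalence class $\mathcal{C}_k$ is defined as the probability mass of all hypotheses in it (Vapnik, 1998, p. 708), i.e.,
$$\mathcal{P}(\mathcal{C}_k) := \int_{\mathcal{C}_k} p(h)\,\mathrm{d}h,$$
where $p(h)$ is the underlying probability density of $h$ over $\mathcal{H}_Q$. The hypotheses in an equivalence class $\mathcal{C}_k$ that has a large power should be preferred according to Vapnik (1998).
When no specific domain knowledge is available (i.e., $p(h)$ is unknown), it would be natural to assume the continuous uniform distribution $p(h) = 1 / \sum_{k} \mathcal{V}(\mathcal{C}_k)$, where the sum runs over all equivalence classes and
$$\mathcal{V}(\mathcal{C}_k) := \int_{\mathcal{C}_k} \mathrm{d}h$$
is the volume of $\mathcal{C}_k$. That is, the volume of an equivalence class is defined as the geometric volume of all hypotheses in it. As a result, $\mathcal{P}(\mathcal{C}_k)$ is proportional to $\mathcal{V}(\mathcal{C}_k)$, and the larger the value $\mathcal{V}(\mathcal{C}_k)$ is, the more confident we are of the hypotheses chosen from $\mathcal{C}_k$.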
However, it is very hard to accurately compute the geometric volume of even a single convex body in $\mathbb{R}^n$, let alone all $2^n$ convex bodies, so El-Yaniv et al. (2008) introduced an efficient approximation. Let $\lambda_1 \le \cdots \le \lambda_n$ be the eigenvalues of $Q$, and $v_1, \ldots, v_n$ be the associated orthonormal eigenvectors. Actually, the hypothesis space $\mathcal{H}_Q$ in Eq. (2) is geometrically an origin-centered ellipsoid in $\mathbb{R}^n$ with $v_i$ and $1/\sqrt{\lambda_i}$ as the direction and length of its $i$-th principal axis. Note that a small angle from a hypothesis $h$ in $\mathcal{C}_k$ to some $v_i$ with a small/large index $i$ (i.e., a long/short principal axis) implies that $\mathcal{V}(\mathcal{C}_k)$ is large/small (cf. Figure 1). Based on this crucial observation, we define
$$V(h) := \sum_{i=1}^{n} \lambda_i \left( \frac{h^\top v_i}{\|h\|_2} \right)^2 = \frac{h^\top Q h}{\|h\|_2^2}, \tag{3}$$
where $h^\top v_i / \|h\|_2$ means the cosine of the angle between $h$ and $v_i$. We subsequently expect $V(h)$ to be small when $h$ lies in a large-volume equivalence class, and conversely to be large when $h$ lies in a small-volume equivalence class.
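As a rough numerical illustration of this definition, the following Python sketch evaluates the binary volume approximation in two ways: as the quotient $h^\top Q h / \|h\|_2^2$ and as the eigenvalue-weighted sum of squared cosines between $h$ and the principal axes. The function names and the toy matrix `Q_toy` are illustrative assumptions, not part of El-Yaniv et al. (2008).

```python
import numpy as np

def volume_approx(h, Q):
    """Quotient form of the volume approximation: h^T Q h / ||h||^2."""
    h = np.asarray(h, dtype=float)
    return float(h @ Q @ h / (h @ h))

def volume_approx_from_axes(h, Q):
    """Same quantity written as sum_i lambda_i * cos^2(angle(h, v_i))."""
    h = np.asarray(h, dtype=float)
    lam, V = np.linalg.eigh(Q)               # ascending eigenvalues, eigenvectors in columns
    cosines = (V.T @ h) / np.linalg.norm(h)  # cosines between h and each principal axis
    return float(np.sum(lam * cosines ** 2))

# Toy ellipsoid (assumed for illustration): the long axis is e_1 (eigenvalue 1),
# the short axis is e_2 (eigenvalue 4).
Q_toy = np.diag([1.0, 4.0])
v_long = volume_approx([1.0, 0.0], Q_toy)    # hypothesis on the long axis
v_short = volume_approx([0.0, 1.0], Q_toy)   # hypothesis on the short axis
```

In this toy case the hypothesis on the long axis receives the smaller value, which matches the preference for hypotheses near long principal axes.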
3 Multi-class Volume Approximation
In this section, we propose a more general multi-class volume approximation that fits several problem settings.
3.1 Problem settings
Recall the setting of binary transductive problems (Vapnik, 1998, p. 341). A fixed set of points from is observed, and the labels of these points are also fixed but unknown. A subset of size is picked uniformly at random, and then is revealed if . We call the labeled data and the unlabeled data. Using and , the goal is to predict of (while any unobserved is currently left out of account).
A slight modification suffices to extend the setting. Instead of , we assume that where is the domain of labels and is a natural number. Though the binary setting is popular, this multi-class setting has been studied in just a few previous works such as Szummer and Jaakkola (2001) and Zhou et al. (2003). Without loss of generality, we assume that each of the labels possesses some labeled data.
In addition, it would be a multi-label setting, if with where each is a label set, or if with where each is a label vector. To the best of our knowledge, the former setting has been studied only in Kong et al. (2013) and the latter setting has not been studied yet. The latter setting is more general, since the former one requires labeled data to be fully labeled, while the latter one allows labeled data to be partially labeled. A huge challenge of multi-label problems is that some label sets or label vectors might have no labeled data (Kong et al., 2013), since there are possible label sets and possible label vectors.
A more challenging serendipitous setting which is a multi-class setting but some labels have no labeled data has been studied in Zhang et al. (2011). Let and , then we have where measures the cardinality. It is still solvable when if a special label of outliers is allowed and when as a combination of classification and clustering problems. Zhang et al. (2011) is the unique previous work which successfully dealt with and .
The multi-class volume approximation to be proposed can handle all the problem settings discussed so far in a unified manner. In order to extend the binary definitions, we need only to extend the hypothesis and the hypothesis space.
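To begin with, we allocate a soft response vector in Eq. (1) for each of the $c$ labels:
$$h_1 = (h_{1,1}, \ldots, h_{n,1})^\top, \; \ldots, \; h_c = (h_{1,c}, \ldots, h_{n,c})^\top.$$
The value $h_{i,j}$ is a soft or confidence-rated label of $x_i$ concerning the $j$-th label and it suggests that
$x_i$ should possess the $j$-th label, if $h_{i,j} > 0$;
$x_i$ should not possess the $j$-th label, if $h_{i,j} < 0$;
the above two cases are equally possible, if $h_{i,j} = 0$.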
For multi-class and serendipitous problems, the label of $x_i$ is predicted by $\arg\max_j h_{i,j}$. For multi-label problems, we need a threshold that is either preset or learned, since positive and negative labels are usually imbalanced, and the labels of $x_i$ can be predicted as those $j$ for which $h_{i,j}$ exceeds the threshold; alternatively, we can use the label set prediction methods proposed in Kong et al. (2013).
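Then, a soft response matrix as our transductive hypothesis is an $n$-by-$c$ matrix defined by
$$H = (h_1, \ldots, h_c) \in \mathbb{R}^{n \times c},$$
and a stacked soft response vector as an equivalent hypothesis is an $nc$-dimensional vector defined by
$$h = \operatorname{vec}(H) = (h_1^\top, \ldots, h_c^\top)^\top \in \mathbb{R}^{nc},$$
where $\operatorname{vec}(H)$ is the vectorization of $H$ formed by stacking its columns into a single vector.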
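As in the binary definition of the hypothesis space, a symmetric positive-definite matrix $Q \in \mathbb{R}^{n \times n}$ which contains the pairwise information about $X_n$ is provided, and we assume further that a symmetric positive-definite matrix $P \in \mathbb{R}^{c \times c}$ which contains the pairwise information about the label domain $\mathcal{Y}$ is available. Consider the hypothesis space
$$\mathcal{H}_{P,Q} := \{H \mid \operatorname{tr}(H^\top Q H P) \le 1\}, \tag{5}$$
where the hypotheses are soft response matrices. Let $P \otimes Q \in \mathbb{R}^{nc \times nc}$ be the Kronecker product of $P$ and $Q$. Due to the symmetry and the positive definiteness of $P$ and $Q$, the Kronecker product $P \otimes Q$ is also symmetric and positive definite, and $\mathcal{H}_{P,Q}$ in (5) could be defined equivalently as
$$\mathcal{H}_{P,Q} := \{H \mid \operatorname{vec}(H)^\top (P \otimes Q) \operatorname{vec}(H) \le 1\}.$$
As a consequence, there is a bijection between $\mathcal{H}_{P,Q}$ and
$$\mathcal{E}_{P,Q} := \{h \mid h^\top (P \otimes Q) h \le 1\},$$
which is geometrically an origin-centered ellipsoid in $\mathbb{R}^{nc}$. The set of sign vectors $\{\operatorname{sign}(h) \mid h \in \mathcal{E}_{P,Q}\}$ spreads over all the $N = 2^{nc}$ quadrants of $\mathbb{R}^{nc}$, and thus the set of sign matrices $\{\operatorname{sign}(H) \mid H \in \mathcal{H}_{P,Q}\}$ contains all $N$ possible dichotomies of $X_n \times \{1, \ldots, c\}$. In other words, $\mathcal{H}_{P,Q}$ can be partitioned into $N$ equivalence classes $\mathcal{C}_1, \ldots, \mathcal{C}_N$, such that for fixed $k$, all soft response matrices in $\mathcal{C}_k$ generate the same labeling of $X_n \times \{1, \ldots, c\}$.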
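The equivalence between the trace form and the Kronecker form of the hypothesis space rests on the identity $\operatorname{vec}(H)^\top (P \otimes Q) \operatorname{vec}(H) = \operatorname{tr}(H^\top Q H P)$ for symmetric $P$. The following Python sketch checks this identity numerically; the helper names and the random test matrices are assumptions made for illustration only.

```python
import numpy as np

def quad_form_trace(H, Q, P):
    """Trace form: tr(H^T Q H P)."""
    return float(np.trace(H.T @ Q @ H @ P))

def quad_form_kron(H, Q, P):
    """Kronecker form: vec(H)^T (P kron Q) vec(H), with column-stacking vec."""
    h = H.flatten(order="F")          # order="F" stacks the columns of H
    return float(h @ np.kron(P, Q) @ h)

def random_spd(m, rng):
    """Random symmetric positive-definite matrix (illustrative helper)."""
    A = rng.standard_normal((m, m))
    return A @ A.T + m * np.eye(m)

rng = np.random.default_rng(0)
n, c = 5, 3
Q = random_spd(n, rng)
P = random_spd(c, rng)
H = rng.standard_normal((n, c))
gap = abs(quad_form_trace(H, Q, P) - quad_form_kron(H, Q, P))
```

Note that the column-stacking convention matters: with row-stacking, the matching Kronecker product would be $Q \otimes P$ instead.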
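The definition of the power is the same as before, and so is the definition of the volume:
$$\mathcal{V}(\mathcal{C}_k) := \int_{\mathcal{C}_k} \mathrm{d}H.$$
Because of the bijection between $\mathcal{H}_{P,Q}$ and $\mathcal{E}_{P,Q}$, $\mathcal{V}(\mathcal{C}_k)$ is likewise the geometric volume of all stacked soft response vectors in the intersection of the $k$-th quadrant of $\mathbb{R}^{nc}$ and $\mathcal{E}_{P,Q}$. By a similar argument to the definition of $V(h)$, we define
$$V(H) := \frac{h^\top (P \otimes Q) h}{\|h\|_2^2} = \frac{\operatorname{tr}(H^\top Q H P)}{\|H\|_{\mathrm{Fro}}^2},$$
where $h = \operatorname{vec}(H)$ and $\|H\|_{\mathrm{Fro}}$ means the Frobenius norm of $H$. We subsequently expect $V(H)$ to be small when $H$ lies in a large-volume equivalence class, and conversely to be large when $H$ lies in a small-volume equivalence class.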
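Note that $V(h)$ and $V(H)$ are consistent for binary learning problem settings. We can constrain $h_1 + h_2 = 0_n$ if $c = 2$, where $0_n$ is the all-zero vector in $\mathbb{R}^n$. Let $P = I_2$ where $I_2$ is the identity matrix of size 2, then
$$V(H) = \frac{h_1^\top Q h_1 + h_2^\top Q h_2}{\|h_1\|_2^2 + \|h_2\|_2^2} = \frac{h_1^\top Q h_1}{\|h_1\|_2^2} = V(h_1),$$
which coincides with $V(h)$ defined in Eq. (3). Similarly to $V(h)$, for two soft response matrices $H$ and $H'$ from the same equivalence class, $V(H)$ and $V(H')$ may not necessarily be the same value. In addition, the domain of $V(H)$ could be extended to $\mathbb{R}^{n \times c}$, though the definition of $V(H)$ is originally null for $H$ outside $\mathcal{H}_{P,Q}$.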
4 Multi-class Approximate Volume Regularization
The proposed volume approximation motivates a family of new transductive methods taking it as a regularization. We develop and analyze an instantiation in this section whose optimization problem is non-convex but can be solved exactly and efficiently.
First of all, we define the label indicator matrix $Y \in \mathbb{R}^{n \times c}$ for convenience, whose entries can be from either $\{0, 1\}$ or $\{-1, 0, 1\}$ depending on the problem settings and whether negative labels ever appear. Specifically, we can set $Y_{i,j} = 1$ if $x_i$ is labeled to have the $j$-th label and $Y_{i,j} = 0$ otherwise, or alternatively we can set $Y_{i,j} = 1$ if $x_i$ is labeled to have the $j$-th label, $Y_{i,j} = -1$ if $x_i$ is labeled to not have the $j$-th label, and $Y_{i,j} = 0$ otherwise.
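A minimal Python sketch of the first ($\{0, 1\}$-valued) construction is given below for the multi-class case. The function name `label_indicator`, the 0-based class indices, and the use of `-1` as a marker for unlabeled points are all assumptions made for this illustration.

```python
import numpy as np

def label_indicator(labels, c, unlabeled=-1):
    """Build an n-by-c indicator matrix with one 1 per labeled row.

    labels    : length-n sequence of 0-based class indices
    c         : number of labels
    unlabeled : marker for unlabeled points (hypothetical convention)
    """
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, c))
    labeled = labels != unlabeled
    # Set Y[i, j] = 1 when x_i is labeled with the j-th label; all other entries stay 0.
    Y[np.flatnonzero(labeled), labels[labeled]] = 1.0
    return Y

Y_example = label_indicator([0, -1, 2, 1, -1], c=3)
```

Rows of unlabeled points are entirely zero, so they carry no label information into the loss.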
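Let $\Delta(Y, H)$ be our loss function measuring the difference between $Y$ and $H$. The multi-class volume approximation motivates the following family of transductive methods:
$$\min_{H \in \mathcal{H}_{P,Q}} \Delta(Y, H) + \gamma \cdot \frac{\operatorname{tr}(H^\top Q H P)}{\|H\|_{\mathrm{Fro}}^2},$$
where $\gamma > 0$ is a regularization parameter. The denominator $\|H\|_{\mathrm{Fro}}^2$ is quite difficult to tackle, so we would like to eliminate it as in El-Yaniv et al. (2008) and Niu et al. (2013a). We fix a scale parameter $\tau > 0$, constrain $H$ to be of norm $\tau$, replace the feasible region $\mathcal{H}_{P,Q}$ with $\mathbb{R}^{n \times c}$ by extending the domain of $V(H)$ implicitly, and it becomes
$$\min_{H \in \mathbb{R}^{n \times c}} \Delta(Y, H) + \gamma \operatorname{tr}(H^\top Q H P) \quad \text{s.t.} \quad \|H\|_{\mathrm{Fro}} = \tau. \tag{8}$$
Although the optimization in (8) is done in $\mathbb{R}^{n \times c}$, the regularization is carried out relative to $\mathcal{H}_{P,Q}$, since under the constraint $\|H\|_{\mathrm{Fro}} = \tau$, the regularization $\operatorname{tr}(H^\top Q H P)$ is a weighted sum of the squares of cosines between $\operatorname{vec}(H)$ and the principal axes of $\mathcal{E}_{P,Q}$, as in El-Yaniv et al. (2008).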
Subsequently, we denote by and the -dimensional vectors that satisfy and . Consider the following loss functions to be in optimization (8):
Squared losses over all data ;
Squared losses over labeled data ;
Linear losses over all data ;
Linear losses over labeled data ;
The first loss function has been used for multi-class transductive learning (Zhou et al., 2003) and the binary counterparts of the fourth and third loss functions have been used for binary transductive learning (El-Yaniv et al., 2008) and clustering (Niu et al., 2013a). Actually, the third and fourth loss functions are identical since for is identically zero, and the first loss function is equivalent to them in (8) since and are constants and is also a constant. The second loss function is undesirable for (8) due to an issue of the time complexity which will be discussed later. Thus, we instantiate , and optimization (8) becomes
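$$\min_{H \in \mathbb{R}^{n \times c}} \|Y - H\|_{\mathrm{Fro}}^2 + \gamma \operatorname{tr}(H^\top Q H P) \quad \text{s.t.} \quad \|H\|_{\mathrm{Fro}} = \tau. \tag{9}$$
We refer to the constrained optimization problem (9) as multi-class approximate volume regularization (MAVR). An unconstrained version of MAVR is then
$$\min_{H \in \mathbb{R}^{n \times c}} \|Y - H\|_{\mathrm{Fro}}^2 + \gamma \operatorname{tr}(H^\top Q H P). \tag{10}$$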
Optimization (9) is non-convex, but we can rewrite it using the stacked soft response vector $h = \operatorname{vec}(H)$ as
$$\min_{h \in \mathbb{R}^{nc}} \|y - h\|_2^2 + \gamma h^\top (P \otimes Q) h \quad \text{s.t.} \quad \|h\|_2 = \tau, \tag{11}$$
where $y = \operatorname{vec}(Y)$ is the vectorization of $Y$. In this representation, the objective is a second-degree polynomial and the constraint is an origin-centered sphere, so fortunately we can solve it exactly and efficiently following Forsythe and Golub (1965). To this end, a fundamental property of the Kronecker product is necessary (see, e.g., Theorems 13.10 and 13.12 of Laub, 2005):
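Theorem 1. Let $\lambda_{Q,1}, \ldots, \lambda_{Q,n}$ be the eigenvalues and $v_{Q,1}, \ldots, v_{Q,n}$ be the associated orthonormal eigenvectors of $Q$, and $\lambda_{P,1}, \ldots, \lambda_{P,c}$ and $v_{P,1}, \ldots, v_{P,c}$ be those of $P$, and let the eigen-decompositions of $Q$ and $P$ be $Q = V_Q \Lambda_Q V_Q^\top$ and $P = V_P \Lambda_P V_P^\top$. Then, the eigenvalues of $P \otimes Q$ are $\lambda_{P,j} \lambda_{Q,i}$ associated with orthonormal eigenvectors $v_{P,j} \otimes v_{Q,i}$ for $j = 1, \ldots, c$, $i = 1, \ldots, n$, and the eigen-decomposition of $P \otimes Q$ is $P \otimes Q = V_{PQ} \Lambda_{PQ} V_{PQ}^\top$, where $\Lambda_{PQ} = \Lambda_P \otimes \Lambda_Q$ and $V_{PQ} = V_P \otimes V_Q$.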
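This Kronecker-product property can be checked numerically. The Python sketch below verifies, on small random symmetric positive-definite matrices (an assumption made for illustration), that products of eigenvalues and Kronecker products of eigenvectors of the two factors form an eigen-decomposition of $P \otimes Q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(m):
    """Random symmetric positive-definite matrix (illustrative helper)."""
    A = rng.standard_normal((m, m))
    return A @ A.T + m * np.eye(m)

n, c = 4, 3
Q, P = random_spd(n), random_spd(c)
lam_q, V_q = np.linalg.eigh(Q)
lam_p, V_p = np.linalg.eigh(P)
PQ = np.kron(P, Q)

# Each pair (j, i) gives an eigenpair of P kron Q.
max_residual = 0.0
for j in range(c):
    for i in range(n):
        v = np.kron(V_p[:, j], V_q[:, i])
        residual = np.linalg.norm(PQ @ v - lam_p[j] * lam_q[i] * v)
        max_residual = max(max_residual, residual)

# The full decomposition is assembled from the two small ones.
V_pq = np.kron(V_p, V_q)
lam_pq = np.kron(lam_p, lam_q)
recon_error = np.linalg.norm(V_pq @ np.diag(lam_pq) @ V_pq.T - PQ)
```

The practical consequence is that only the $n$-by-$n$ and $c$-by-$c$ factors need to be eigen-decomposed, never the $nc$-by-$nc$ product itself.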
After we ignore the constants and in the objective of optimization (11), the Lagrange function is
where is the Lagrangian multiplier for . The stationary conditions are
Hence, for any locally optimal solution where is not an eigenvalue of , we have
Let us sort the eigenvalues into a non-descending sequence , rearrange accordingly, and find the smallest which satisfies . As a result, Eq. (16) implies that
for any stationary . By Theorem 4.1 of Forsythe and Golub (1965), the smallest root of determines a unique so that is the globally optimal solution to , i.e., minimizes the objective of (11) globally. For this , the only exception when it cannot determine by Eq. (14) is that is an eigenvalue of , but this happens with probability zero. Finally, the theorem below points out the location of this (the proof is in the appendix):
The function defined in Eq. (17) has exactly one root in the interval and no root in the interval , where .
The algorithm of MAVR is summarized in Algorithm 1. It is easy to see that fixing in Algorithm 1 instead of finding the smallest root of suffices to solve optimization (10). Moreover, for a special case where is the identity matrix of size , any stationary is simply
Let where is the all-one vector in , and is the smallest number that satisfies . Then the smallest root of
gives us the feasible leading to the globally optimal .
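The overall procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: it assumes the objective $\|Y - H\|_{\mathrm{Fro}}^2 + \gamma \operatorname{tr}(H^\top Q H P)$ with the constraint $\|H\|_{\mathrm{Fro}} = \tau$, uses its own sign convention for the multiplier `rho` (stationarity reads $\gamma Q H P - \rho H = Y$), and handles only the generic case in which the component of $Y$ along the smallest eigen-direction is nonzero. The function name `mavr_sketch`, its parameters, and the tolerance `eps` are hypothetical.

```python
import numpy as np

def mavr_sketch(Q, P, Y, gamma, tau, n_bisect=200, eps=1e-12):
    """Sketch: minimize ||Y - H||_F^2 + gamma * tr(H^T Q H P) s.t. ||H||_F = tau."""
    lam_q, V_q = np.linalg.eigh(Q)
    lam_p, V_p = np.linalg.eigh(P)
    Lam = np.outer(lam_q, lam_p)          # Lam[i, j] = lam_q[i] * lam_p[j]
    Z = V_q.T @ Y @ V_p                   # Y expressed in the Kronecker eigenbasis
    active = np.abs(Z) > eps              # numerically nonzero components of Y

    # The multiplier is searched below the first pole of the secular function.
    upper = gamma * Lam[active].min()
    lower = upper - np.linalg.norm(Y) / tau   # the secular function is <= 0 here

    def g(rho):
        # Secular function: squared norm of the candidate solution minus tau^2.
        with np.errstate(divide="ignore"):
            return np.sum(Z[active] ** 2 / (gamma * Lam[active] - rho) ** 2) - tau ** 2

    # g is increasing on (lower, upper), so bisection finds its unique root there.
    lo, hi = lower, upper
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    rho = lo

    # Recover the solution in the eigenbasis and map it back.
    G = np.zeros_like(Z)
    G[active] = Z[active] / (gamma * Lam[active] - rho)
    return V_q @ G @ V_p.T, rho
```

In this sign convention, fixing `rho = -1` instead of running the bisection gives the minimizer of the same objective without the norm constraint, since the stationarity condition then reads $\gamma Q H P + H = Y$.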
The asymptotic time complexity of Algorithm 1 is . More specifically, eigen-decomposing in the first step of Algorithm 1 costs , and this is the dominating computation time. Eigen-decomposing just needs and is negligible under the assumption that without loss of generality. In the second step, it requires for sorting the eigenvalues of and for computing . Finding the smallest root of based on a binary search algorithm uses in the third step, and for multi-class problems and for multi-label problems. In the final step, recovering is essentially same as computing and costs .
We would like to comment a bit more on the asymptotic time complexity of MAVR. Firstly, we employ the squared losses over all data rather than the squared losses over labeled data. If the latter loss function was plugged in optimization (8), Eq. (14) would become
where is an -by- diagonal matrix such that if is labeled and if is unlabeled. The inverse in the expression above cannot be computed using the eigen-decompositions of and , and hence the computational complexity would increase from to . Secondly, given fixed and but different , , and , the computational complexity is if we reuse the eigen-decompositions of and and the sorted eigenvalues of
. This property is especially advantageous for validating and selecting hyperparameters. It is also quite useful for picking different subsets of the data to be labeled following transductive problem settings. Finally, the asymptotic time complexity can hardly be improved based on existing techniques for optimizations (9) and (10). Even if is fixed in optimization (10), the stationary condition Eq. (12) is a discrete Sylvester equation which consumes for solving it (Sima, 1996).
4.3 Theoretical analyses
We provide two theoretical results. Under certain assumptions, the stability analysis upper bounds the difference of two optimal and trained with two different label indicator matrices and , and the error analysis bounds the difference of from the ground truth.
Theorem 2 guarantees that . In fact, with high probability over the choice of , it holds that (we did not meet in our experiments). For this reason, we make the following assumption:
Fix and , and allow to change according to the partition of into different and . There is , which just depends on and , such that for all optimal trained with different , .
Note that for unconstrained MAVR, there must be since and . Based on the above assumption and the lower bound of in Theorem 2, we can prove the theorem below.
Theorem 3 (Stability of MAVR).
In order to present an error analysis, we assume there is a ground-truth soft response matrix with two properties. Firstly, the value of should be bounded, namely,
where is a small number. This ensures that lies in a large-volume region. Otherwise MAVR implementing the large volume principle can by no means learn some close to . Secondly, should contain certain information about . MAVR makes use of , and only and the meanings of and are fixed already, so MAVR may access the information about only through . To make and correlated, we assume that where is a noise matrix of the same size as and . All entries of
are independent with zero mean, and their variance takes one of two values, depending on whether the entry corresponds to a labeled or an unlabeled position in $Y$. We could expect that , such that the entries of in labeled positions are close to the corresponding entries of , but the entries of in unlabeled positions are completely corrupted and uninformative for recovering . Notice that we need this generating mechanism of even if is the smallest eigenvalue of , since may have multiple smallest eigenvalues and have totally different meanings. Based on these assumptions, we can prove the theorem below.
Theorem 4 (Accuracy of MAVR).
Assume the existence of , , and the generating process of from and . Let and be the numbers of the labeled and unlabeled positions in and assume that where the expectation is with respect to the noise matrix . For each possible , let be the globally optimal solution trained with it. Then,
for MAVR in optimization (9), and
for unconstrained MAVR in optimization (10).
The proofs of Theorems 3 and 4 are in the appendix. Considering the instability bounds in Theorem 3 and the error bounds in Theorem 4, unconstrained MAVR is superior to constrained MAVR in both cases. That being said, bounds are just bounds. We will demonstrate the potential of constrained MAVR in the next section by experiments.
In this section, we numerically evaluate MAVR.
5.1 Serendipitous learning
We show how to handle serendipitous problems by MAVR directly without performing clustering (Hartigan and Wong, 1979; Ng et al., 2001; Sugiyama et al., 2014) or estimating the class-prior change (du Plessis and Sugiyama, 2012). The experimental results are displayed in Figure 2. There are 5 data sets, and the latter 3 data sets are from Zelnik-Manor and Perona (2004). The matrix $Q$ was specified as the normalized graph Laplacian (see, e.g., von Luxburg, 2007) $L_{\mathrm{nor}} = I_n - D^{-1/2} W D^{-1/2}$, where $W$ is a similarity matrix and $D$ is the degree matrix of $W$. (Footnote 1: Though the graph Laplacian matrices have zero eigenvalues, they would not cause algorithmic problems when used as $Q$.) The matrix $P$ was specified by
For data sets 1 and 2 we used the Gaussian similarity
with the kernel width , and for data sets 3 to 5 we applied the local-scaling similarity (Zelnik-Manor and Perona, 2004)
with the number of nearest neighbors , where is the -th nearest neighbor of in . We set and . Furthermore, a class balance regularization was imposed for data sets 2 to 5. The detail is omitted here due to the space limit, while the idea is to encourage balanced total responses of all classes. For this regularization, the regularization parameter was . We can see that in Figure 2
, MAVR successfully classified the data belonging to the known classes and simultaneously clustered the data belonging to the unknown classes. By specifying different $P$, we could control the influence of the known classes on the unknown classes.
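For readers who wish to reproduce this kind of setup, the following Python sketch builds a Gaussian similarity matrix and the corresponding normalized graph Laplacian $I_n - D^{-1/2} W D^{-1/2}$. The exact form of the Gaussian similarity used in the experiments is not recoverable here, so the $2\sigma^2$ scaling and the zero diagonal are assumptions, as are the function names.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal (assumed form)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)      # guard against tiny negative round-off
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                  # no self-similarity (assumption)
    return W

def normalized_laplacian(W):
    """Normalized graph Laplacian I - D^{-1/2} W D^{-1/2}; assumes positive degrees."""
    degrees = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degrees)
    return np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
```

The resulting matrix is symmetric positive semi-definite with a zero eigenvalue, which is the situation the footnote on graph Laplacians refers to.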
5.2 Multi-class learning
Figure 3: Experimental results on the artificial data set 3circles. Means with standard errors are shown.
A state-of-the-art multi-class transductive learning method named learning with local and global consistency (LGC) (Zhou et al., 2003) is closely related to MAVR. Actually, if we specify $P = I_c$ and $Q = L_{\mathrm{nor}}$, unconstrained MAVR reduces to LGC exactly. Although LGC is motivated by the label propagation viewpoint, it can be written as optimization (4) in Zhou et al. (2003). Here, we illustrate the nuance between constrained MAVR and LGC (that is, unconstrained MAVR) using an artificial data set.
The artificial data set 3circles is generated as follows. We have three classes with the class ratio . Let be the ground-truth label of , then is generated by
where is an angle drawn i.i.d. from the uniform distribution , and and
are noises drawn i.i.d. from the normal distribution. We vary one factor and fix all other factors. The default values of these factors are , , , , , and . Figure 3 shows the experimental results, where the means with the standard errors of the classification error rates are plotted. For each task that corresponds to a full specification of all factors, MAVR and LGC were repeatedly run on 100 random samplings. We can see from Figure 3 that the performance of LGC was usually not as good as that of MAVR.
Over the past decades, a huge number of transductive learning and semi-supervised learning methods have been proposed based on various motivations such as graph cut (Blum and Chawla, 2001), random walk (Zhu et al., 2003), manifold regularization (Belkin et al., 2006), and information maximization (Niu et al., 2013b), just to name a few. A state-of-the-art semi-supervised learning method called Laplacian regularized least squares (LapRLS) (Belkin et al., 2006) is also compared with MAVR, in addition to LGC.
The experimental results are reported in Figure 4. Similarly to Figure 3, the means with the standard errors of the classification error rates are shown, where 4 methods were repeatedly run on 100 random samplings for each task. We considered another specification of as the unnormalized graph Laplacian
which was also employed by LapRLS. The cosine similarity is defined by
where means and are among the -nearest neighbors of each other. We set for all involved in Figure 4, and there seems no reliable model selection method given very few labeled data, so we select the best hyperparameters for each method using the labels of unlabeled data from 10 additional random samplings. Specifically, is the median distance , and is from for both local-scaling and cosine similarities; is . The hyperparameters are all fixed since it resulted in more stable performance. For MAVR, LGC, and of LapRLS, it was fixed to 99 if the Gaussian and cosine similarities were used and 1 if the local-scaling similarity was used; of LapRLS was if the Gaussian and local-scaling similarities were used and if the cosine similarity was used since LapRLS also needed that was too sparse and near singular, but an exception was panel (i) where gave lower error rates of LapRLS. We can see from Figure 4 that two MAVR methods often compared favorably with the state-of-the-art methods LGC and LapRLS, which implies that our proposed multi-class volume approximation is reasonable and practical.
We proposed a multi-class volume approximation that can be applied to several transductive problem settings such as multi-class, multi-label and serendipitous learning. The resultant learning method is non-convex in nature but can be solved exactly and efficiently. It is theoretically justified by our stability and error analyses and was experimentally demonstrated to be promising.
Appendix A Proofs
A.1 Proof of Theorem 2
The derivative of is
Hence, whenever , and is strictly increasing in the interval . Moreover,
and thus has exactly one root in . Notice that since is an orthonormal matrix, and then . As a result,
where the first inequality is because