We propose in this paper a general framework for deriving loss functions for structured prediction. In our framework, the user chooses a convex set including the output space and provides an oracle for projecting onto that set. Given that oracle, our framework automatically generates a corresponding convex and smooth loss function. As we show, adding a projection as an output layer provably makes the loss smaller. We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow any convex superset to be used instead, with a potentially cheaper-to-compute projection. Since efficient projection algorithms are available for numerous convex sets, this allows us to construct loss functions for a variety of tasks. On the theoretical side, when combined with calibrated decoding, we prove that our loss functions can be used as consistent surrogates for a (potentially non-convex) target loss function of interest. We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.
The goal of supervised learning is to learn a mapping that links an input to an output, using examples of such pairs. This task is noticeably more difficult when the output objects have a structure, i.e., when they are not mere vectors. This is the so-called structured prediction setting bakir_2007 .
We focus in this paper on the surrogate loss framework, in which a convex loss is used as a proxy for a (potentially non-convex) target loss of interest. Existing convex losses for structured prediction come with different trade-offs. On one hand, the structured perceptron structured_perceptron and hinge structured_hinge losses only require access to a maximum a posteriori (MAP) oracle for finding the highest-scoring structure, while the conditional random field (CRF) lafferty_2001
loss requires access to a marginal inference oracle, for evaluating the expectation under a Gibbs distribution. Since marginal inference is generally considered harder than MAP inference (it contains #P-complete counting problems, for instance), the CRF loss is less widely applicable. On the other hand, unlike the structured perceptron and hinge losses, the CRF loss is smooth, which is crucial for fast convergence, and comes with a probabilistic model, which is important for dealing with uncertainty. Unfortunately, when combined with MAP decoding, these losses are typically inconsistent, meaning that their optimal estimator does not converge to the target loss function's optimal estimator. Recently, several works ciliberto_2016 ; korba_2018 ; nowak_2018 ; luise_2019 showed good results and obtained consistency guarantees by combining a simple squared loss with calibrated decoding. Since these approaches only require a decoding oracle at test time and no oracle at train time, this calls into question whether structural information is even beneficial during training.
In this paper, we propose loss functions for structured prediction using a different kind of oracle: projections. Kullback-Leibler projections onto various polytopes have been used to derive online algorithms helmbold_2009 ; yasutake_2011 ; online_submodular ; ailon_2016 , but it is not obvious how to extract a loss from these works. In our framework, the user chooses a convex set containing the output space and provides an oracle for projecting onto that set. Given that oracle, we automatically generate an associated loss function. As we show, incorporating a projection as an output layer provably makes the loss smaller. We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow any convex superset to be used instead, with a potentially cheaper-to-compute projection. When using the marginal polytope as the convex set, our loss comes with an implicit probabilistic model. Our contributions are summarized as follows:
We study the consistency w.r.t. a target loss of interest when combined with calibrated decoding, extending a recent analysis nowak_2019 to the more general projection-based losses. We exhibit a trade-off between computational cost and statistical estimation.
We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.
We denote the probability simplex by, the domain of by , the Fenchel conjugate of by . We denote .
The goal of structured prediction is to learn a mapping , from an input to an output , minimizing the expected target risk
where is a typically unknown distribution and is a potentially non-convex target loss. We focus in this paper on surrogate methods, which attack the problem in two main phases. During the training phase, the labels are first mapped to using an encoding or embedding function . In this paper, we focus on , but some works consider general Hilbert spaces ciliberto_2016 ; korba_2018 ; luise_2019 . In most cases, will be a zero-one encoding of the parts of , i.e., . Given a surrogate loss , a model
(e.g., a neural network or a linear model) is then learned so as to minimize the surrogate risk
This allows to leverage the usual empirical risk minimization framework in the space . During the prediction phase, given an input , a model prediction is “pulled back” to a valid output using a decoding function . This is summarized in the following diagram:
Commonly used decoders include the pre-image oracle weston_2003 ; cortes_2005 ; kadri_2013 and the maximum a-posteriori inference oracle structured_perceptron ; structured_hinge ; lafferty_2001 , which finds the highest-scoring structure:
In the remainder of this paper, for conciseness, we will use as a shorthand for , but it is useful to bear in mind that surrogate losses are really always defined over vector spaces.
We now review classical examples of loss functions that fall within that framework. The structured perceptron structured_perceptron loss is defined by
Clearly, it requires a MAP inference oracle at training time in order to compute subgradients w.r.t.
. The structured hinge loss used by structured support vector machines structured_hinge is a simple variant of (5) with an additional loss term. Classically, this term is assumed to satisfy an affine decomposition, so that a MAP oracle is again all we need. The conditional random field (CRF) lafferty_2001 loss, on the other hand, requires a so-called marginal inference oracle wainwright_2008 , for evaluating the expectation under the Gibbs distribution . The loss and the oracle are defined by
When is a zero-one encoding of the parts of (i.e., a bit vector), can be interpreted as some marginal distribution over parts of the structures. The CRF loss is smooth and comes with a probabilistic model, but its applicability is hampered by the fact that marginal inference is generally harder than MAP inference. This is for instance the case for permutation-prediction problems, where exact marginal inference is intractable valiant1979complexity ; taskar-thesis ; petterson_2009 but MAP inference can be computed exactly.
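To make the contrast between the two oracles concrete, here is a minimal sketch of the perceptron and CRF losses in the simplest multiclass special case, where the encoding is one-hot, MAP inference is an argmax and marginal inference is a softmax. This is an illustration of the definitions only, not the general structured implementation.

```python
import numpy as np

def perceptron_loss(theta, y_enc):
    # Structured perceptron loss with a trivial MAP oracle (multiclass,
    # one-hot encoding): max_y <theta, phi(y)> - <theta, phi(y_true)>.
    return np.max(theta) - theta @ y_enc

def crf_loss(theta, y_enc):
    # CRF loss in the same multiclass case, where the log-partition is a
    # logsumexp: log sum_y exp(<theta, phi(y)>) - <theta, phi(y_true)>.
    m = theta.max()  # shift for numerical stability
    return m + np.log(np.sum(np.exp(theta - m))) - theta @ y_enc
```

The logsumexp upper-bounds the max, so the CRF loss is a smooth upper bound on the perceptron loss, which illustrates why the former is smooth while the latter is not.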
When working with surrogate losses, an important question is whether the surrogate and target risks are consistent, that is, whether an estimator minimizing produces an estimator minimizing . Although this question has been widely studied in the multiclass setting zhang_2004 ; bartlett_2006 ; tewari_2007 ; mroueh_2012 and in other specific settings duchi_2010 ; ravikumar_2011 , it is only recently that it was studied in a fully general structured prediction setting. The structured perceptron, hinge and CRF losses are generally not consistent when using MAP as decoder nowak_2019 . Inspired by kernel dependency estimation weston_2003 ; cortes_2005 ; kadri_2013 , several works ciliberto_2016 ; korba_2018 ; luise_2019 showed good empirical results and proved consistency by combining a squared loss with calibrated decoding (no oracle is needed during training). A drawback of this loss, however, is that it does not make use of the output space during training, ignoring precious structural information. More recently, the consistency of the CRF loss in combination with calibrated decoding was analyzed in nowak_2019 .
In this section, we build upon Fenchel-Young losses fy_losses ; fy_losses_journal to derive a class of smooth loss functions leveraging structural information through a different kind of oracle: projections. Our losses are applicable to a large variety of tasks (including permutation problems, for which CRF losses are intractable) and have consistency guarantees when combined with calibrated decoding (cf. §5).
Zero loss: ,
Convexity: is convex in ,
Smoothness: If is -strongly convex, then is -smooth,
Gradient as residual (generalizing the squared loss): .
In the Fenchel duality perspective, belongs to the dual space and is thus unconstrained. This is convenient, as it places no restriction on the model outputs . On the other hand, belongs to the primal space , which must include the encoded output space , i.e., , and is typically constrained. The gradient is a mapping from to and can be seen as a loss with mixed arguments between these two spaces. The theory of Fenchel-Young losses was recently extended to infinite spaces in mensch_2019 .
Let the Bregman divergence generated by be defined as . The Bregman projection of onto a closed convex set is
Intuitively, maps the unconstrained predictions to
, ensuring that the Bregman projection is well-defined. Let us define the Kullback-Leibler divergence by. Two examples of generating function are with and , and with and . This leads to the Euclidean projection and the KL projection , respectively.
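The two Bregman geometries can be made concrete with a small sketch. The constants below follow one common convention (half squared norm, and the generalized negative entropy yielding the generalized KL divergence); the paper's exact generating functions were lost in extraction, so treat these as an assumption.

```python
import numpy as np

def bregman_squared(u, v):
    # Bregman divergence generated by Phi(u) = 0.5 * ||u||^2:
    # D(u, v) = 0.5 * ||u - v||^2, the (halved) squared Euclidean distance.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 0.5 * np.sum((u - v) ** 2)

def bregman_kl(u, v):
    # Bregman divergence generated by the generalized negative entropy
    # Phi(u) = <u, log u> - <1, u>: the generalized Kullback-Leibler
    # divergence D(u, v) = sum_i u_i log(u_i / v_i) - u_i + v_i.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.sum(u * np.log(u / v) - u + v)
```

Minimizing the first divergence over a convex set gives the Euclidean projection; minimizing the second gives the KL projection.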
Our key insight is to use a projection onto a chosen convex set as output layer. If contains the encoded output space, i.e., , then for any ground truth . Therefore, if , then is necessarily a better prediction than , since it is closer to in the sense of . If already belongs to , then and thus is as good as . To summarize, we have for all and . Therefore, it is natural to choose so as to minimize the following compositional loss
Unfortunately, is non-convex in in general, and requires to compute the Jacobian of , which could be difficult, depending on . Other works have considered the output of an optimization program as input to a loss stoyanov_2011 ; domke_2012 ; belanger_2017 but these methods are non-convex too and typically require unrolling the program’s iterations. We address these issues, using Fenchel-Young losses.
We now set the generating function of the Fenchel-Young loss (7) to , where denotes the indicator function of . We assume that is Legendre type rockafellar_1970 ; wainwright_2008 , meaning that it is strictly convex and explodes at the boundary of the interior of . This assumption is satisfied by both and . With that assumption, as shown in fy_losses ; fy_losses_journal , we obtain for all , allowing us to use Fenchel-Young losses. For brevity, let us define the Fenchel-Young loss generated by as
Note that if (largest possible set), then . In particular, with and , recovers the squared loss .
Recall that should be a convex set such that . The next proposition, a simple consequence of (7), gives an argument in favor of using smaller sets. Using smaller sets results in a smaller loss
Let be two closed convex sets such that . Then,
As a corollary, combined with (11), we have
and in particular when , noticing that , we have
Therefore, the Euclidean projection always achieves a smaller squared loss than . This is intuitive, as is a smaller region than and is guaranteed to include the ground-truth . Our loss is a convex and structurally informed middle ground between and .
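In the Euclidean case, this claim is easy to check numerically: the Fenchel-Young loss generated by the half squared norm restricted to a convex set admits the closed form 0.5||y − θ||² − 0.5||proj_C(θ) − θ||², so any projection can only shrink the squared loss. The sketch below takes the projection oracle `project` as a user-supplied function (an assumption of this snippet).

```python
import numpy as np

def euclidean_fy_loss(theta, y, project):
    """Fenchel-Young loss generated by 0.5 * ||.||^2 + indicator of C.

    `project` is a user-supplied Euclidean projection oracle onto C.
    Closed form: 0.5 * ||y - theta||^2 - 0.5 * ||proj_C(theta) - theta||^2,
    hence the loss is never larger than the plain squared loss (C = R^d).
    """
    theta, y = np.asarray(theta, float), np.asarray(y, float)
    p = project(theta)
    return 0.5 * np.sum((y - theta) ** 2) - 0.5 * np.sum((p - theta) ** 2)
```

With `project = lambda t: t` (the whole space) this recovers the squared loss; swapping in a projection onto any set containing the targets gives a loss that is provably no larger.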
How to choose ? The smallest convex set such that is the convex hull of
When is a zero-one encoding of the parts of , is also known as the marginal polytope wainwright_2008 , since any point inside it can be interpreted as some marginal distribution over parts of the structures. The loss with and is exactly the sparseMAP loss proposed in sparsemap . More generally, we can use any superset of , with potentially cheaper-to-compute projections. For instance, when uses a zero-one encoding, the marginal polytope is always contained in the unit cube, i.e., , whose projection is very cheap to compute. We show in our experiments that even just using the unit cube typically improves over the squared loss. However, an advantage of using is that produces a convex combination of structures, i.e., an expectation.
The well-known equivalence between strong convexity of a function and the smoothness of its Fenchel conjugate implies that the following three statements are all equivalent:
is -strongly convex w.r.t. a norm over ,
is -Lipschitz continuous w.r.t. the dual norm over ,
is -smooth in its first argument w.r.t. over .
With the Euclidean geometry, since is -strongly-convex over w.r.t. , we have that is -smooth w.r.t. regardless of . With the KL geometry, the situation is different. The fact that is -strongly convex w.r.t. over is well-known (this is Pinsker’s inequality). The next proposition, proved in §C.1, shows that this straightforwardly extends to any bounded and that the strong convexity constant is inversely proportional to the size of . Strong convexity of over a bounded set
Let and . Then, is -strongly convex w.r.t. over . This implies that is -smooth w.r.t. . Since smaller sets yield smoother losses, this is another argument for preferring smaller sets . With the best choice of , we obtain .
Assuming is compact (closed and bounded), the Euclidean projection can always be computed using Frank-Wolfe or active-set algorithms, provided access to a linear maximization oracle . Note that in the case , assuming that is injective, meaning that it has a left inverse, MAP inference reduces to an LMO, since
(the LMO can be viewed as a linear program, whose solutions always hit a vertex of ). The KL projection is more problematic, but Frank-Wolfe variants have been proposed belanger_2013 ; krishnan_barrier . In the next section, we focus on examples of sets for which an efficient dedicated projection oracle is available.
For multiclass classification, we set , where is the number of classes. With
, the one-hot encoding of, MAP inference (4) becomes . The marginal polytope defined in (15) is now , the probability simplex. The Euclidean and KL projections onto then correspond to the sparsemax sparsemax and softmax transformations. We therefore recover the sparsemax and logistic losses as natural special cases of . Note that, although the CRF loss lafferty_2001 also comprises the logistic loss as a special case, it no longer coincides with our loss in the structured case.
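The two simplex projections just mentioned can be sketched in a few lines. Sparsemax is the Euclidean projection onto the probability simplex, computed by the standard sort-and-threshold routine; softmax is the KL counterpart.

```python
import numpy as np

def sparsemax(theta):
    # Euclidean projection onto the probability simplex: sort the scores,
    # find the support via the cumulative-sum threshold, shift and truncate.
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    ks = np.arange(1, len(theta) + 1)
    support = z - cssv / ks > 0
    tau = cssv[support][-1] / ks[support][-1]
    return np.maximum(theta - tau, 0.0)

def softmax(theta):
    # KL projection onto the probability simplex.
    e = np.exp(theta - theta.max())  # shift for numerical stability
    return e / e.sum()
```

Unlike softmax, sparsemax can return exactly sparse probability vectors, assigning zero probability to low-scoring classes.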
For multilabel classification, we choose , the powerset of . Let us set , the label indicator vector of (i.e., if and otherwise). MAP inference corresponds to predicting each label independently. More precisely, for each label , if we predict , otherwise we do not. The marginal polytope is now , the unit cube. Each vertex is in bijection with one possible subset of . The Euclidean projection of onto is equal to a coordinate-wise clipping of , i.e., for all . The KL projection is equal to for all . More generally, whenever the encoding for the task at hand uses zero-one values, we can use the unit cube as a superset with a computationally cheap projection.
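Both the cube projection and the independent MAP decoding are one-liners; here is a minimal sketch, with the usual positivity threshold for the per-label decision.

```python
import numpy as np

def cube_projection(theta):
    # Euclidean projection onto the unit cube [0, 1]^d:
    # coordinate-wise clipping.
    return np.clip(theta, 0.0, 1.0)

def cube_map(theta):
    # MAP over the vertices {0, 1}^d: each label is decided independently;
    # label i is predicted whenever its score theta_i is positive.
    return (np.asarray(theta) > 0).astype(float)
```

This cheap projection is the fallback used throughout the paper whenever the exact marginal polytope projection is too expensive.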
We now set , the subsets of of bounded size. We assume . This is useful for multilabel classification with known lower bound and upper bound on the number of labels per sample. Setting again , MAP inference is equivalent to the integer linear program s.t. . Let be a permutation sorting in descending order. An optimal solution is
The marginal polytope is an instance of a knapsack polytope almeida_2013 . It is equal to and is illustrated in Figure 1(c) with , and (i.e., we keep all elements of except ). The next proposition, proved in §C.2, shows how to efficiently project onto . Efficient Euclidean and KL projections on
Let be the projection of onto the unit cube (cf. “unit cube” paragraph).
If , then is optimal.
Otherwise, return the projection of onto , where if and otherwise.
The total cost is in the Euclidean case and in the KL case (cf. §C.2 for details).
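The two-step scheme of the proposition can be sketched as follows in the Euclidean case. When the cube projection already satisfies the budget, we are done; otherwise the active sum constraint is an equality and the projection takes the form clip(θ − τ) for a scalar τ, which we locate here by bisection. The bisection is a simple stand-in for the exact sorting-based routine of §C.2, and the tolerance is our choice.

```python
import numpy as np

def project_knapsack(theta, lo, hi, n_iter=100):
    """Euclidean projection onto {mu in [0,1]^d : lo <= sum(mu) <= hi}.

    Step 1: clip to the unit cube; if the budget holds, that is optimal.
    Step 2: otherwise, the binding bound is an equality; the projection is
    clip(theta - tau) with tau found by bisection, since the clipped sum is
    nonincreasing in tau.
    """
    theta = np.asarray(theta, float)
    p = np.clip(theta, 0.0, 1.0)
    s = p.sum()
    if lo <= s <= hi:
        return p
    target = lo if s < lo else hi
    lo_t, hi_t = theta.min() - 1.0, theta.max()  # brackets the optimal tau
    for _ in range(n_iter):
        tau = 0.5 * (lo_t + hi_t)
        if np.clip(theta - tau, 0.0, 1.0).sum() > target:
            lo_t = tau
        else:
            hi_t = tau
    return np.clip(theta - 0.5 * (lo_t + hi_t), 0.0, 1.0)
```
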
We view ranking as a structured prediction problem and let be the set of permutations of . Setting as the permutation matrix associated with , MAP inference becomes the linear assignment problem and can be computed exactly using the Hungarian algorithm hungarian . The marginal polytope becomes the Birkhoff polytope birkhoff , the set of doubly stochastic matrices
Noticeably, marginal inference is known to be #P-complete (valiant1979complexity ; taskar-thesis , §3.5), since it corresponds to computing a matrix permanent. In contrast, the KL projection on the Birkhoff polytope can be computed using the Sinkhorn algorithm sinkhorn ; cuturi_2013 . The Euclidean projection can be computed using Dykstra's algorithm rot_mover or dual approaches blondel_2018 . For both projections, the cost of obtaining an -precise solution is . To obtain cheaper projections, we can also use blondel_2018 ; nowak_2019 the set of row-stochastic matrices, a strict superset of the Birkhoff polytope and strict subset of the unit cube
Projections onto reduce to row-wise projections onto , for a worst-case total cost of in the Euclidean case and in the KL case.
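The Sinkhorn-based KL projection onto the Birkhoff polytope admits a very short sketch: exponentiate the scores, then alternately rescale rows and columns until the matrix is (approximately) doubly stochastic. The fixed iteration count is our simplification; a production implementation would test convergence.

```python
import numpy as np

def sinkhorn_projection(theta, n_iter=500):
    """Approximate KL projection onto the Birkhoff polytope (Sinkhorn).

    Alternately normalizes the rows and columns of exp(theta); the iterates
    converge to a doubly stochastic matrix.
    """
    P = np.exp(theta - theta.max())  # shift for numerical stability
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)  # make rows sum to one
        P /= P.sum(axis=0, keepdims=True)  # make columns sum to one
    return P
```
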
We again consider ranking and let be the set of permutations of but use a different encoding. This time, we define , where is a prescribed vector of weights, which, without loss of generality, we assume is sorted in descending order. MAP inference becomes , where denotes the inverse permutation of . The MAP solution is thus the inverse of the permutation sorting in descending order, and can be computed in time. When , which we use in our experiments, is known as the permutahedron. For arbitrary , we follow projection_permutahedron and call the permutahedron induced by . Its vertices correspond to the permutations of . Importantly, the Euclidean projection onto reduces to sorting, which takes , followed by isotonic regression, which takes zeng_2014 ; orbit_regul . Bregman projections reduce to isotonic optimization projection_permutahedron .
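The sort-then-isotonic reduction can be sketched as follows: sort the scores in descending order, run a nonincreasing isotonic regression (pool-adjacent-violators) on the shifted scores, and undo the sort. This is a minimal sketch of the reduction cited above, with a textbook PAV rather than an optimized solver.

```python
import numpy as np

def _isotonic_nonincreasing(y):
    # Pool-adjacent-violators: best nonincreasing least-squares fit to y.
    blocks = []  # each block stores [sum, count]; its value is sum / count
    for val in y:
        blocks.append([val, 1])
        # Pool while a block's mean is smaller than its successor's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return np.array(out)

def project_permutahedron(theta, w):
    """Euclidean projection onto the permutahedron induced by w.

    Assumes w is sorted in descending order. Sorts theta, subtracts the
    isotonic fit of the shifted scores, then undoes the sort.
    """
    theta = np.asarray(theta, float)
    sigma = np.argsort(-theta)
    z = theta[sigma]
    mu_sorted = z - _isotonic_nonincreasing(z - np.asarray(w, float))
    mu = np.empty_like(theta)
    mu[sigma] = mu_sorted
    return mu
```

Since every point of the permutahedron has the same coordinate sum as w, the projection preserves that sum, a handy sanity check.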
We again set but now consider the ordinal regression setting, where there is an intrinsic order . We need to use an encoding that takes into account that order. Inspired by the all-threshold method pedregosa_2017 ; nowak_2019 , we set . For instance, with , we have , , and . This encoding is also motivated by the fact that it enables consistency w.r.t. the absolute loss (§A). As proved in §C.3, with that encoding, the marginal polytope becomes the order simplex grotzinger_1984 . Vertices of the order simplex
Note that without the upper bound on , the resulting set is known as the monotone nonnegative cone boyd_2004 . The scores can be calculated using a cumulative sum in time, and MAP and marginal inference can therefore be performed in the same time. The Euclidean projection is equivalent to isotonic regression with lower and upper bounds, which can be computed in time best_1990 .
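The bounded isotonic regression above can be sketched compactly: run an unconstrained nonincreasing isotonic regression (pool-adjacent-violators), then clip to [0, 1], using the standard fact that box constraints on an isotonic regression can be enforced exactly by clipping the unconstrained solution. The PAV here is a textbook version, not an optimized solver.

```python
import numpy as np

def project_order_simplex(theta):
    """Euclidean projection onto {mu : 1 >= mu_1 >= ... >= mu_k >= 0}.

    Nonincreasing isotonic regression via pool-adjacent-violators,
    followed by clipping to [0, 1] (exact for box-constrained isotonic
    regression).
    """
    blocks = []  # each block stores [sum, count]; its value is sum / count
    for val in np.asarray(theta, float):
        blocks.append([val, 1])
        # Pool while a block's mean is smaller than its successor's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = np.concatenate([[s / c] * c for s, c in blocks])
    return np.clip(fit, 0.0, 1.0)
```
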
We now study the consistency of as a proxy for a possibly non-convex target loss .
We assume that the target loss satisfies the decomposition
This is a slight generalization of the decomposition of ciliberto_2016 , where we used an affine map instead of a linear one and where we added the term , which is independent of . This modification allows us to express certain losses using a zero-one encoding for instead of a signed encoding nowak_2019 . The latter is problematic when using KL projections and does not lead to sparse solutions with Euclidean projections. Examples of target losses satisfying (20) are discussed in §A.
A drawback of the classical inference pipeline (3) with decoder is that it is oblivious to the target loss . In this paper, we propose to use instead
where we define the decoding calibrated for the loss by
Under the decomposition (20), calibrated decoding therefore reduces to MAP inference with pre-processed input. It is a “rounding” to of the projection , one that takes the loss into account. Recently, ciliberto_2016 ; korba_2018 ; nowak_2018 ; luise_2019 used a similar calibrated decoding in conjunction with a squared loss (i.e., without an intermediate layer) and nowak_2019 used it with a CRF loss (with marginal inference as the intermediate layer). To our knowledge, we are the first to use a projection layer (in the Euclidean or KL senses) as an intermediate step.
Given a (typically unknown) joint distribution, let us define the target risk of and the surrogate risk of by
The quality of estimators and is measured in terms of the excess of risks
The following proposition shows that and are calibrated when using our proposed inference pipeline (21), i.e., when . Calibration of target and surrogate excess risks
The proof, given in §C.4, is based on the calibration function framework of osokin_2017 and extends a recent analysis nowak_2019 to projection-based losses. Our proof covers Euclidean projection losses, which were not covered by the previous analysis. This proposition implies Fisher consistency, i.e., , where and . Consequently, any optimization algorithm converging to will also recover an optimal estimator of . Combined with the propositions of §3, it suggests a trade-off between computational cost and statistical estimation, larger sets enjoying cheaper-to-compute projections but leading to slower rates.
We present in this section our empirical findings on three tasks: label ranking, ordinal regression and multilabel classification. In all cases, we use a linear model, solve by L-BFGS, and tune the regularization hyperparameter on the validation set. A Python implementation is available at https://github.com/mblondel/projection-losses.
We consider the label ranking setting where supervision is given as full rankings (e.g., ) rather than as label relevance scores. Note that the exact CRF loss is intractable for this task. We use the same six public datasets as in korba_2018 . We compare different convex sets for the projection and the decoding . For the Euclidean and KL projections onto the Birkhoff polytope, we solve the semi-dual formulation blondel_2018 by L-BFGS. We report the mean Hamming loss, for which our loss is consistent, between the ground-truth and predicted permutation matrices in the test set. Results are shown in Table 1 and Table 4. We summarize our findings below.
For decoding, using or instead of the Birkhoff polytope considerably degrades accuracy. This is not surprising, as these choices do not produce valid permutation matrices.
Using a squared loss (, no projection) works relatively well when combined with permutation decoding. Using supersets of the Birkhoff polytope as projection set , such as or , improves accuracy substantially. However, the best accuracy is obtained when using the Birkhoff polytope for both projections and decoding.
The losses derived from Euclidean and KL projections perform similarly. This is informative, as algorithms for Euclidean projections onto various sets are more widely available.
Beyond accuracy improvements, the projection is useful to visualize soft permutation matrices predicted by the model, an advantage lost when using supersets of the Birkhoff polytope.
We compared classical ridge regression to our order-simplex-based loss on sixteen publicly available datasets gutierrez_ordinal_2016 . For evaluation, we use the mean absolute error (MAE), for which our loss is consistent when suitably setting and (cf. §A). We find that ridge regression performs the worst with an average MAE of . Combining a squared loss (no projection) with order simplex decoding at prediction time improves the MAE to . Using a projection onto the unit cube, a superset of the order simplex, further improves the MAE to . Finally, using the Euclidean projection onto the order simplex achieves the best MAE of , confirming that using the order simplex for both projections and decoding works better. Detailed results are reported in Table 3.
We compared losses derived from the unit cube and the knapsack polytope on the same seven datasets as in sparsemax ; fy_losses . We set the lower bound to and the upper bound to , where and are computed over the training set. Although the unit cube is a strong baseline, we find that the knapsack polytope improves scores on some datasets, especially those with few labels per sample (“birds”, “emotions”, “scene”). Results are reported in Tables 5 and 6.
We proposed in this paper a general framework for deriving a smooth and convex loss function from the projection onto a convex set, bringing a computational geometry perspective to structured prediction. We discussed several examples of polytopes with efficient Euclidean or KL projections, making our losses useful for a variety of structured tasks. Our theoretical and empirical results suggest that the marginal polytope is the convex set of choice when the projection onto it is affordable. When it is not, our framework allows the use of any superset with a cheaper-to-compute projection.
We thank Vlad Niculae for suggesting the knapsack polytope for multilabel classification, and Tomoharu Iwata for suggesting to add a lower bound on the number of labels. We also thank Naoki Marumo for numerous fruitful discussions.
Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
For more generality, following nowak_2019 , we consider losses , where is the output space and is the ground-truth space. Typically, but we give examples below where . Our affine decomposition (20) now becomes
where . We give examples below of possibly non-convex losses satisfying decomposition (26). When not mentioned explicitly, we set , and . For more examples of decomposable losses, see also ciliberto_2016 ; nowak_2018 ; nowak_2019 .
Let . The 0-1 loss can be written as (26) if we set and , i.e., is the 0-1 cost matrix.
Let and . Then,
This can be written as (26) with , and .
Let be the set of permutations of . If is the permutation matrix associated with permutation , the Hamming loss is . It can thus be written as (26) with and .
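The Hamming decomposition is easy to verify numerically: with permutation-matrix encodings, the inner product ⟨phi(p), phi(q)⟩ counts the positions where the two permutations agree, so the (unnormalized) Hamming loss is n minus that inner product. The sketch below checks this identity; normalization constants are absorbed into the decomposition's affine terms.

```python
import numpy as np

def perm_matrix(p):
    # Permutation-matrix encoding phi(p) of a permutation p of {0, ..., n-1}.
    n = len(p)
    P = np.zeros((n, n))
    P[np.arange(n), p] = 1.0
    return P

def hamming(p, q):
    # Number of positions where the two permutations disagree.
    return int(np.sum(np.asarray(p) != np.asarray(q)))
```
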
Let be the permutations of and be the relevance scores of documents. The NDCG loss is , where is a normalization constant. Inspired by nowak_2019 , we can thus write as (26) by defining as the permutation of according to , , and . This shows the importance of learning to predict normalized relevance scores, as also noted in ravikumar_2011 . For this reason, we suggest using . Decoding reduces to linear maximization over the permutahedron.
Let be the permutations of and be binary relevance scores. Precision at corresponds to the number of relevant results (e.g., labels or documents) in the top results. The corresponding loss can be defined by , where and . If the number of positive labels is less than , we replace with . As with NDCG, we can therefore write as (26). Again, decoding reduces to linear maximization over the permutahedron.
Let and . Then,
This can be written as (26) with , and . A similar loss decomposition is derived in nowak_2019 in the case of a signed encoding, instead of the zero-one encoding we use. However, the signed encoding is problematic when using KL projections and does not lead to sparse projections when using Euclidean projections.
We discuss in this section our experimental setup and additional empirical results.
For all datasets, we normalized samples to have zero mean and unit variance. We use the train-test split from the dataset when provided. When not provided, we use 80% of the data for training and 20% for testing. We hold out 25% of the training data for hyperparameter validation purposes. For the regularization hyperparameter, we used ten log-spaced values between and . Once we have selected the best hyperparameter, we refit the model on the entire training set. We ran all experiments on a machine with an Intel(R) Xeon(R) CPU at 2.90GHz and 4GB of RAM.
In all experiments, we use publicly available datasets:
In this section, we compare the Birkhoff and permutahedron polytopes for the same label ranking task as described in the main manuscript. With the Birkhoff polytope,
can be interpreted as an affinity matrix between classes for the input. With the permutahedron, can be interpreted as a vector containing the score of each class for the input . Therefore, the two polytopes have different expressive power. For the model , we compare , where is a matrix or linear map of proper shape, and a polynomial model , where , and is the polynomial degree. For the Euclidean projection onto the permutahedron, we use the isotonic regression solver from scikit-learn scikit_learn .
Our results, shown in Table 2, indicate that in the case of a linear model, the Birkhoff polytope outperforms the permutahedron by a large margin. Using a polynomial model closes the gap between the two, but the model based on the Birkhoff polytope is still slightly better.