Learning Output Embeddings in Structured Prediction

07/29/2020 ∙ by Luc Brogat-Motte, et al. ∙ Télécom Paris aalto Télécom ParisTech Inria 0

A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension and then solving a regression problem in this output space. A prediction in the original space is computed by solving a pre-image problem. In such an approach, the embedding, linked to the target loss, is defined prior to the learning phase. In this work, we propose to jointly learn an approximation of the output embedding and the regression function into the new feature space. Output Embedding Learning (OEL) makes it possible to leverage a priori information on the outputs as well as otherwise unexploited unsupervised output data, both of which are often available in structured prediction problems. We give a general learning method that we study theoretically in the linear case, proving consistency and an excess-risk bound. OEL is tested on various structured prediction problems, showing its versatility, and proves especially useful when the training dataset is small compared to the complexity of the task.




1 Introduction

A large number of real-world applications involve the prediction of a structured output (Nowozin and Lampert, 2011), whether it be a sparse multiple label vector in recommendation systems (Tsoumakas and Katakis, 2007), a ranking over a finite number of objects in user preference prediction (Hüllermeier et al., 2008) or a labeled graph in metabolite identification (Nguyen et al., 2019). Embedding-based methods generalizing ridge regression to structured outputs (Weston et al., 2003; Cortes et al., 2005; Brouard et al., 2011; Kadri et al., 2013; Brouard et al., 2016b; Ciliberto et al., 2016), besides conditional generative models and margin-based methods (Tsochantaridis et al., 2004; Taskar et al., 2004; Bakhtin et al., 2020), represent one of the main theoretical and practical frameworks for solving structured prediction problems and also find use in other fields of supervised learning such as zero-shot learning (Palatucci et al., 2009).

In this work, we focus on Output Kernel Regression (OKR) methods, which rely on a simple idea: structured outputs are embedded into a Hilbert space, so that the initial structured output prediction problem can be replaced by a less complex vector-valued regression problem. Once this problem is solved, a structured prediction function is obtained by decoding the embedded prediction into the original output space, e.g. by solving a pre-image problem. To benefit from an infinite dimensional embedding, the kernel trick can be leveraged in the output space, opening the approach to a large variety of structured outputs.

A generalization of the OKR approaches under the name of Implicit Loss Embedding has recently been studied from a statistical point of view by Ciliberto et al. (2020), extending the theoretical guarantees developed for the Structure Encoding Loss Framework (SELF) (Ciliberto et al., 2016; Nowak-Vila et al., 2019). In particular, that work proved that the excess risk of the final structured output predictor depends on the excess risk of the surrogate regression estimator. This motivates the approach of this paper: controlling the error of the regression estimator by adapting the embedding to the observed data.

In this work, we propose to jointly learn a finite dimensional embedding that approximates the given embedding and regress the new embedded output variable instead of the original embedding. Our contributions are four-fold:

  • We introduce Output Embedding Learning (OEL), a novel approach to Structured Prediction that jointly learns the embedding of outputs and the regression of the embedded output given the input, leveraging the prior information about the structure and unlabeled output data.

  • We devise an OEL algorithm focusing on kernel ridge regression and a projection-based embedding that exploits the closed form of the regression problem. We provide an efficient algorithm based on randomized SVD and Nyström approximation of kernel ridge regression.

  • We derive excess risk bounds for this novel estimator, showing the relevance of this approach.

  • We provide a comprehensive experimental study on various Structured Prediction problems with a comparison with dedicated methods, showing the versatility of our approach and highlighting the benefits of our approach when the training dataset size is small compared to the complexity of the task or when unlabeled data are available.

2 Output Embedding Learning

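Notations: $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ is the set of structured objects of finite cardinality $|\mathcal{Y}|$. Given two spaces $E$, $F$, $\mathcal{F}(E, F)$ denotes the set of functions from $E$ to $F$. Given two Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, $\mathcal{B}(\mathcal{H}_1, \mathcal{H}_2)$ is the space of bounded linear operators from $\mathcal{H}_1$ to $\mathcal{H}_2$. $I_{\mathcal{H}}$ is the identity operator over $\mathcal{H}$. The adjoint of an operator $A$ is noted $A^*$.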

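Structured Prediction is generally associated with a loss $\Delta$ that takes into account the inherent structure of objects in $\mathcal{Y}$. In this work, we consider a structure-dependent loss by relying on an embedding $\psi: \mathcal{Y} \to \mathcal{H}_{\mathcal{Y}}$ that maps the structured objects into a Hilbert space $\mathcal{H}_{\mathcal{Y}}$ and the squared loss defined over pairs of elements of $\mathcal{H}_{\mathcal{Y}}$: $\Delta(y, y') = \|\psi(y) - \psi(y')\|^2_{\mathcal{H}_{\mathcal{Y}}}$.

A principled and general way to define the embedding $\psi$ consists in choosing $\psi(y) = k(\cdot, y)$, the canonical feature map of a positive definite symmetric kernel $k$ defined over $\mathcal{Y}$, referred to here as the output kernel. The space $\mathcal{H}_{\mathcal{Y}}$ is then the Reproducing Kernel Hilbert Space associated to kernel $k$. This choice makes it possible to solve various structured prediction problems within the same setting, drawing on the rich kernel literature devoted to structured objects (Gärtner, 2008).

Given an unknown joint probability distribution $\rho$ defined on $\mathcal{X} \times \mathcal{Y}$, the goal of structured prediction is to solve the following learning problem:

$f^* = \arg\min_{f: \mathcal{X} \to \mathcal{Y}} \mathbb{E}_{(x,y) \sim \rho}\left[\Delta(f(x), y)\right]$

with the help of a training i.i.d. sample $\{(x_i, y_i)\}_{i=1}^n$ drawn from $\rho$.

To overcome the inherent difficulty of learning directly through $\Delta$, embedding-based approaches address structured prediction by solving Output Kernel Regression as a surrogate problem, i.e. regressing the target variable $\psi(y)$ given $x$ (see Figure 1, left):

$h^*_\psi = \arg\min_{h: \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}} \mathbb{E}_{(x,y) \sim \rho}\left[\|h(x) - \psi(y)\|^2_{\mathcal{H}_{\mathcal{Y}}}\right]$

This regression step is then followed by a pre-image or decoding step in order to recover a prediction in $\mathcal{Y}$:

$f^*(x) = d \circ h^*_\psi(x)$

where the decoding function $d$ computes $d(z) = \arg\min_{y \in \mathcal{Y}} \|z - \psi(y)\|^2_{\mathcal{H}_{\mathcal{Y}}}$.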
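The regress-then-decode pipeline described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the Gaussian kernels, the finite candidate set used for decoding, the ridge scaling `n * lam`, the toy data, and the function names `gaussian_kernel`, `okr_fit` and `okr_decode` are all hypothetical choices made here, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def okr_fit(Kx, lam):
    """KRR step: W = (Kx + n*lam*I)^{-1}, so that alpha(x) = W k_x(x)."""
    n = Kx.shape[0]
    return np.linalg.inv(Kx + n * lam * np.eye(n))

def okr_decode(W, Kx_train_test, Ky_train_cand, ky_cand_diag):
    """Decoding step over a finite candidate set.

    With h(x) = sum_i alpha_i(x) psi(y_i), minimising ||h(x) - psi(c)||^2 over
    candidates c amounts to minimising k(c, c) - 2 sum_i alpha_i(x) k(y_i, c),
    because the term ||h(x)||^2 does not depend on c.
    """
    alpha = W @ Kx_train_test                                   # (n, n_test)
    scores = ky_cand_diag[None, :] - 2.0 * alpha.T @ Ky_train_cand
    return scores.argmin(axis=1)                                # candidate index

# Toy problem (hypothetical): three well-separated candidate outputs,
# each input is a noisy copy of its output.
rng = np.random.default_rng(0)
candidates = 3.0 * np.eye(3)
idx_train = np.arange(30) % 3
idx_test = np.arange(12) % 3
X_train = candidates[idx_train] + 0.1 * rng.normal(size=(30, 3))
X_test = candidates[idx_test] + 0.1 * rng.normal(size=(12, 3))

W = okr_fit(gaussian_kernel(X_train, X_train), lam=1e-3)
pred = okr_decode(
    W,
    gaussian_kernel(X_train, X_test),
    gaussian_kernel(candidates[idx_train], candidates),
    np.ones(len(candidates)),          # k(c, c) = 1 for a Gaussian kernel
)
```

Note that the embedded prediction is never formed explicitly: only kernel evaluations between training outputs and candidates are needed, which is what makes an infinite dimensional output embedding usable in practice.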

While the above is a powerful approach for structured prediction, relying on a fixed output embedding given by may not be optimal in terms of prediction error, and it is hard for a human expert to choose a good embedding.

Figure 1: Schematic illustration of (a) OKR and (b) OKR with OEL

In this paper, we propose to jointly learn a novel output embedding as a finite dimensional proxy of and the corresponding regression model .

Our novel approach, called Output Embedding Learning (OEL), thus consists in solving the two problems (Figure 1, right).

for a given hyperparameter





where the decoding output is .

In the learning objective of Eq. (3), the term expresses a surrogate regression problem from the input space to the learned output embedding space while is a reconstruction error that constrains to provide a good proxy of , and thus, encouraging the novel surrogate loss to be calibrated with loss .

This approach allows learning an output embedding that, intuitively, is easier to predict from inputs than and also provides control of the complexity of surrogate regression model by choosing the dimension .
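As a reading aid, the two-term objective described above can be written out as follows. The notation is an assumption made for this sketch, since the display equations are missing here: $g$ is the learned $p$-dimensional output embedding, $g^{\#}$ a decoder mapping back to the original embedding space, $h$ the regression model, $\psi$ the original embedding, $d$ the decoding function and $\gamma \in [0, 1]$ the balancing hyperparameter.

```latex
% Hedged sketch of the OEL objective; symbols are assumptions, not the source's.
\min_{g,\, g^{\#},\, h}\;
  \gamma \, \mathbb{E}_{(x,y)}\big[ \| h(x) - g(y) \|^2_{\mathbb{R}^p} \big]
  \;+\; (1 - \gamma) \, \mathbb{E}_{y}\big[ \| g^{\#}(g(y)) - \psi(y) \|^2_{\mathcal{H}_{\mathcal{Y}}} \big],
\qquad
f(x) = d\big( g^{\#}(h(x)) \big).
```

Under this reading, $\gamma$ close to one emphasizes the surrogate regression term, while $\gamma$ close to zero reduces the problem to unsupervised reconstruction of the outputs.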

To solve the learning problem in practice, a training i.i.d. sample is used for estimating . For estimating we can also benefit from additional i.i.d. samples of the outputs , denoted . Such data is generally easy to obtain for many structured output problems, including the metabolite identification task described in the experiments.

2.1 Solving OEL with a linear transformation of the embedding

We consider the case where the chosen model for the output embedding is a linear transformation of : where is an operator, , with the linear associated decodings: . Here can be interpreted as a one-layer linear autoencoder whose inputs and outputs belong to the Hilbert space (thus giving overall non-linear embedding ) (Laforgue et al., 2019a), and the hidden layer is trained in supervised mode (through ), or alternatively as a Kernel PCA model (Schölkopf et al., 1998) of the outputs, but trained in supervised mode.

We denote the conditional expectation of given , . Within this setting, the general problem depicted in Eq. 3 instantiates as follows:


Leveraging the regression , and ( is orthogonal), this is equivalent to solving the following problem (see details in Section 1 of the supplements):


where we restrict to ensure the objective is theoretically grounded. In the following, we use , with . The objective boils down to estimating the linear subspaces of the and the .

In the empirical context, given an i.i.d. labeled sample , and an i.i.d. unlabeled sample , we propose to use an empirical estimate of the unknown conditional expectation and solve the following remaining optimization problem in :


Learning in vector-valued RKHS:  To find an empirical estimate of , we need a hypothesis space , whose functions have infinite dimensional outputs. Following the Input Output Kernel Regression approach described in Brouard et al. (2016b), we solve a kernel ridge regression problem in , the RKHS associated to the operator-valued kernel and obtain the following closed-form expression:


where and is the ridge regularization hyperparameter.
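To make the closed form concrete, the sketch below assumes the usual kernel ridge regression solution for an identity-decomposable operator-valued kernel, namely a prediction of the form sum_i alpha_i(x) psi(y_i) with alpha(x) = (K_x + n*lam*I)^{-1} k_x(x). The `n * lam` scaling, the linear input kernel, the explicit finite-dimensional output features `Psi`, and the helper name `krr_weights` are assumptions made for illustration. The example checks numerically that this dual form agrees with primal ridge regression.

```python
import numpy as np

def krr_weights(Kx, Kx_new, lam):
    """Dual KRR weights alpha(x) = (Kx + n*lam*I)^{-1} k_x(x), one column per new x."""
    n = Kx.shape[0]
    return np.linalg.solve(Kx + n * lam * np.eye(n), Kx_new)

rng = np.random.default_rng(0)
n, d, q = 20, 4, 3
X = rng.normal(size=(n, d))        # training inputs
Psi = rng.normal(size=(n, q))      # explicit output features psi(y_i), for checking only
X_new = rng.normal(size=(5, d))
lam = 0.1

# Dual form with a linear input kernel: h(x) = sum_i alpha_i(x) psi(y_i)
alpha = krr_weights(X @ X.T, X @ X_new.T, lam)      # (n, 5)
h_dual = alpha.T @ Psi                              # (5, q)

# Primal ridge regression with the same regularisation, for comparison
B = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Psi)
h_primal = X_new @ B                                # (5, q)
```

In the kernelized setting `Psi` is never available explicitly; only the weights `alpha` are stored, and all later computations go through the output Gram matrix.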

OEL estimator:  For a given , denoting as the solution of the above convex problem, the proposed estimator for the solution of the problem stated in Eq. 5 can be expressed as: . However it is important to stress that we only need to compute the associated structured prediction function:


We derive Algorithm 1, which consists in computing the singular value decomposition of the mixed Gram matrix , noticing that objective 7 is equivalent to minimizing the empirical mean reconstruction error of the vectors of : .

  Input: (supervised data), (unsupervised data), KRR regularization, embedding dimension, supervised/unsupervised balance.
  KRR estimation:
  Subspace estimation:
  return   KRR coefficients, output embedding coefficients, GY new training embedding
Algorithm 1 Output Embedding Learning with KRR (Training)
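The two training steps can be sketched as follows. This is a reconstruction under explicit assumptions rather than the authors' exact procedure: the KRR weights use the `n * lam` scaling, the unlabeled outputs form a separate sample of size m >= 1, the supervised and unsupervised vectors are weighted by sqrt(gamma/n) and sqrt((1-gamma)/m), and the subspace is obtained by an uncentered kernel PCA (eigendecomposition of the resulting mixed Gram matrix). The function name `oel_train` and the variable names are hypothetical.

```python
import numpy as np

def oel_train(Kx, Ky_all, lam, p, gamma):
    """Sketch of OEL training with KRR and a rank-p projection.

    Kx     : (n, n) input Gram matrix of the labeled sample.
    Ky_all : (n+m, n+m) output Gram matrix, labeled outputs first, then unlabeled.
    Returns W (KRR matrix), beta (embedding coefficients), GY (training embeddings).
    """
    n = Kx.shape[0]
    m = Ky_all.shape[0] - n                      # assumed m >= 1
    # KRR estimation: column i of A holds the weights alpha(x_i)
    W = np.linalg.inv(Kx + n * lam * np.eye(n))
    A = W @ Kx
    # Coefficients of the weighted vectors in the span of the output features
    C = np.zeros((n + m, n + m))
    C[:n, :n] = np.sqrt(gamma / n) * A                        # fitted h(x_i)
    C[n:, n:] = np.sqrt((1.0 - gamma) / m) * np.eye(m)        # unlabeled psi(y_j)
    # Subspace estimation: top-p eigenvectors of the mixed Gram matrix
    G = C.T @ Ky_all @ C
    evals, evecs = np.linalg.eigh(G)
    top = np.argsort(evals)[::-1][:p]
    evals, evecs = evals[top], evecs[:, top]
    beta = C @ evecs / np.sqrt(evals)            # direction l = sum_j beta[j, l] psi(y_j)
    GY = Ky_all[:n] @ beta                       # coordinates of the labeled outputs
    return W, beta, GY

# Small check with explicit output features (available here for illustration only)
rng = np.random.default_rng(0)
n, m, q, p = 15, 10, 5, 2
X = rng.normal(size=(n, 3))
Psi = rng.normal(size=(n + m, q))
W, beta, GY = oel_train(X @ X.T, Psi @ Psi.T, lam=0.1, p=p, gamma=0.5)
U = Psi.T @ beta                                 # explicit basis of the learned subspace
```

The columns of `U` are orthonormal, so the new embedding of an output y is simply the vector of inner products between its feature and these directions, computed through kernel evaluations as `k_y(y) @ beta`.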
Computational complexity

The time complexity of the training Algorithm 1 is the sum of the complexity of a Kernel Ridge Regression (KRR) with data and of a Singular Value Decomposition with data: . However, this complexity can be greatly reduced, since a rich literature of approximation methods exists for both KRR and SVD (Rudi et al., 2017; Halko et al., 2011). For instance, using a Nyström KRR approximation of rank and a randomized SVD approximation of rank , the time complexity becomes: .
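For the SVD part, a basic randomized range finder in the spirit of Halko et al. (2011) can be sketched with NumPy alone. The oversampling and power-iteration defaults below are illustrative assumptions, the function name `randomized_svd` is chosen here for convenience, and the Nyström approximation of KRR is not shown.

```python
import numpy as np

def randomized_svd(M, rank, n_oversample=10, n_power_iter=2, seed=0):
    """Approximate top-`rank` SVD of M using a random projection of its range."""
    rng = np.random.default_rng(seed)
    k = min(rank + n_oversample, min(M.shape))
    # Sample the range of M with a Gaussian test matrix, then orthonormalise
    Q = np.linalg.qr(M @ rng.normal(size=(M.shape[1], k)))[0]
    # A few power iterations sharpen the captured subspace
    for _ in range(n_power_iter):
        Q = np.linalg.qr(M.T @ Q)[0]
        Q = np.linalg.qr(M @ Q)[0]
    # Exact SVD of the small projected matrix
    B = Q.T @ M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

# Example on an exactly rank-5 Gram-like matrix
rng = np.random.default_rng(1)
F = rng.normal(size=(200, 5))
M = F @ F.T
U, s, Vt = randomized_svd(M, rank=5)
s_exact = np.linalg.svd(M, compute_uv=False)[:5]
```

The cost is dominated by a handful of products between the matrix and a thin block of `k` columns, instead of a full decomposition of the whole matrix.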

3 Theoretical analysis

From a statistical viewpoint we are interested in controlling the expected risk of the estimator , that for the considered loss corresponds to

Interpreting the decoding step in the context of structured prediction, we can leverage the so-called comparison inequality from Ciliberto et al. (2016). This inequality is applied to our case in the next lemma and relates the excess-risk of to the distance of to (see Ciliberto et al. (2016) for more details on structured prediction and the comparison inequality).

Lemma 3.1.

For every measurable , ,

with , .

It is possible to apply the comparison inequality, since the considered loss function belongs to the wide family of SELF losses (Ciliberto et al., 2016) for which the comparison inequality holds. A loss is SELF if it satisfies the implicit embedding property (Ciliberto et al., 2020), i.e. there exists a Hilbert space and two feature maps such that

In our case the construction is direct and corresponds to , and .
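For the squared loss induced by an output embedding, one valid construction can be written out explicitly. The symbols below ($\psi$ for the output embedding, $\mathcal{H}_{\mathcal{Y}}$ for its Hilbert space, $\chi$ and $\varphi$ for the two feature maps) are notation assumed for this sketch, and the particular choice of feature maps is one possibility among several; it is not claimed to be the exact one used by the authors.

```latex
% Expanding the squared norm and reading it as an inner product
% in the direct sum H_Y (+) R (+) R.
\Delta(y, y')
  = \|\psi(y) - \psi(y')\|^2_{\mathcal{H}_{\mathcal{Y}}}
  = \|\psi(y)\|^2 - 2\,\langle \psi(y), \psi(y') \rangle + \|\psi(y')\|^2
  = \big\langle \chi(y), \varphi(y') \big\rangle,
\\[4pt]
\chi(y) = \big( \sqrt{2}\,\psi(y),\; \|\psi(y)\|^2,\; 1 \big),
\qquad
\varphi(y') = \big( -\sqrt{2}\,\psi(y'),\; 1,\; \|\psi(y')\|^2 \big).
```

Each of the three components contributes one term of the expansion, which is why the implicit embedding property holds directly for this loss.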

Intuitively, the idea of output embedding learning is to find a new embedding that provides an easier regression task while still being able to predict in the initial regression space. In our formulation this is possible due to the introduction of and a suitable choice of . In this construction can be decomposed into two parts:


In the case of KRR, linear projections , and an arbitrary set of functions from to , the left term is the KRR excess-risk on a linear subspace of of dimension . For the right term, defining the covariance for all ,


we have the following bound due to Jensen's inequality,

Lemma 3.2.

Under the assumptions of Lemma 3.1, when is a linear projection, we have

The closer is to , the tighter is the bound, but having close to could lead to a much easier learning objective. Relying on results on subspace learning in Rudi et al. (2013) and following their proofs, we bound this upper bound and obtain Theorem 3.3. In particular, we use a natural assumption on the spectral properties of the mixed covariance operator, as introduced in Rudi et al. (2013).

Assumption 1. There exist and such that the eigendecomposition of the positive operator has the following form


The assumption above controls the so-called degrees of freedom of the learning problem. A fast decay of can be interpreted as a problem that is well approximated by just learning the first few eigenvectors (see Caponnetto and De Vito (2007); Rudi et al. (2013) for more details). To conclude, the first part of the r.h.s. of 11 is further decomposed and then bounded via Lemma 18 of Ciliberto et al. (2016), leading to Theorem 3.3. Before stating it, we need an assumption on the approximability of .

Assumption 2. The function satisfies .

The assumption above where is the RKHS associated to , guarantees that is approximable by kernel ridge regression with the given choice of the kernel on the input. The kernel satisfies this assumption. Now we are ready to state the theorem.

Theorem 3.3.

[Excess-risk bound, KRR + linear OEL] Let be a distribution over , the marginal of , be i.i.d samples from , i.i.d samples from , , and and satisfy Assumption 1 and 2. When


then the following holds with probability at least ,

where , , with , . Finally , a constant depending only on (defined in the proof).

The first term in the above bound is the usual bias-variance trade-off that we expect from KRR (cf. Caponnetto and De Vito (2007)). The second term is an approximation error due to the projection.

We further see a trade-off in the choice of when we try to maximize . Choosing close to one aims to estimate the linear subspace of the , which is smaller than the one of the , leading to a better eigenvalue decay rate, but the learning is limited by the convergence of the least-squares estimator, as is clear from the term in Eq. (14) (via ). Choosing close to zero leads to completely unsupervised learning of the output embeddings with a smaller eigenvalue decay rate .


In the following corollary, we give a simplified version of the bound (we denote by the fact that there exists such that for ).

Corollary 3.3.1.

Under the same assumptions of Theorem 3.3, let . Then, running the proposed algorithm with a number of components

is enough to achieve an excess-risk of .

Note that is the typical rate for problems of structured prediction without further assumptions on the problem Ciliberto et al. (2016, 2020). Here it is interesting to note that the more regular the problem is, the more , and we can almost achieve rate with a number of components , in particular , leading to a relevant improvement in terms of computational complexity.

Finally, we note that while from an approximation viewpoint the largest would lead to better results, there is still a trade-off with the computational aspects, since an increased leads to greater computational complexity. Moreover, we expect to find a more subtle trade-off between the KRR error and the approximation error due to projection, since reducing the dimensionality of the output space has a beneficial impact on the degrees of freedom of the problem. We observed this effect experimentally and expect that a more refined analysis, which we leave as future work, would confirm it. We want to conclude with a remark on why projecting on with should be more beneficial than just projecting on the subspace spanned by the covariance operator associated to .

Remark 3.1 (Supervised learning benefit).

Learning the embedding in a supervised fashion is interesting as the principal components of the may differ from those of the . The most basic example: , with the relationship between and where independent from with . In this case unsupervised learning is not able to find the -dimensional subspace where the lie, whereas supervised learning could do so, assuming is a good estimate of . From this elementary example we could build more complex and realistic ones by adding any kind of non-isotropic noise that changes the shape of the subspace. For instance, defining eigenvalue decay of and with lower eigenvalue decay, then for sufficiently large the order of the principal components starts to change (comparing and ).
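The point of the remark can be illustrated on a two-dimensional toy problem. The construction below is a hypothetical example built for this sketch, not the authors' one: the conditional mean of the embedded output lies along the first axis, while independent noise with larger variance lies along the second axis. Unsupervised PCA on the outputs then picks the noise direction, whereas PCA on the fitted conditional means recovers the direction that matters for regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
signal = np.outer(x, [1.0, 0.0])                   # E[psi(y) | x] lies along e1
noise = np.outer(rng.normal(size=n), [0.0, 3.0])   # independent of x, along e2
psi = signal + noise                               # observed embedded outputs

def top_direction(V):
    """Leading eigenvector of the uncentered second-moment matrix of the rows of V."""
    evals, evecs = np.linalg.eigh(V.T @ V / len(V))
    return evecs[:, -1]

# Unsupervised: principal direction of the outputs themselves
u_unsup = top_direction(psi)

# Supervised: principal direction of the fitted conditional means
coef = (x @ psi) / (x @ x)          # least-squares slope of each output coordinate on x
h_hat = np.outer(x, coef)
u_sup = top_direction(h_hat)
```

Here `u_unsup` is close to the second axis and `u_sup` is close to the first, so a one-dimensional embedding learned without supervision would discard exactly the component that is predictable from the input.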

4 Experiments

We provide a detailed experimental study of OEL on image reconstruction, multilabel classification and labeled graph prediction. For each dataset/task, we selected the relevant state-of-the-art algorithms to compare against. Details about experimental protocols (hyperparameter grids, dataset splitting) are provided in the Supplements. Additionally, an exhaustive study on label ranking is also provided in the Supplements. We used the following notations to refer to variants of our method:
: () only reconstruction loss is used to learn with data coming only from
: () is learned using both criteria (regression and reconstruction) also using
 with : is selected by inner CV, and additional unlabeled dataset is leveraged to learn .

Image reconstruction.

In the image reconstruction problem provided by (Weston et al., 2003), the goal is to predict the bottom half of a USPS handwritten postal digit (16 x 16 pixels), given its top half. We have labeled images and test images. Fig. 3 presents the behaviour of OEL in terms of Mean Squared Error of  w.r.t. and , showing that a minimum is attained and highlighting a trade-off between the two regularization parameters. We compared OEL to IOKR Brouard et al. (2016b) and Kernel Dependency Estimation (KDE) (Weston et al., 2003). In KDE, KPCA is used to decompose the output feature vectors into orthogonal directions. Kernel ridge regression is then used for learning independently the mapping between the input feature vectors and each direction. By applying KPCA on the outputs KDE aims at estimating the linear subspace of the output embedding , while OEL aims at estimating the linear subspace of the . In addition the decoding problem in KDE is different from OEL.

The obtained results are given in Table 3. Firstly, OEL obtains improved results in comparison to IOKR and KDE. Moreover, adding unsupervised output data ("+ data") is beneficial. The selected dimensions follow the theoretical insights. Learning the embedding in a supervised fashion (OEL instead of OEL) makes it possible to learn a lower-dimensional embedding giving the same, or even better, performance. Learning the embedding with additional data makes it possible to learn a higher-dimensional linear subspace. Furthermore, on this problem, the selected balancing parameter is close to , that is, all output training data are given the same importance.

Method                       RBF loss         p
KDE (Weston et al., 2003)    0.764 ± 0.011
IOKR                         0.751 ± 0.011
OEL                          0.737 ± 0.011    71
OEL                          0.734 ± 0.011    64
OEL + data ()                0.725 ± 0.011    98
Figure 2: Test mean losses and standard errors for IOKR, OEL and KDE on the USPS digits reconstruction problem where

Figure 3: Test MSE w.r.t and (OEL)
Multi-label classification.

Here we compare OEL with several multi-label and structured prediction approaches including IOKR (Brouard et al., 2016b), logistic regression (LR) trained independently for each label (Lin et al., 2014), a two-layer neural network with cross entropy loss (NN) (Belanger and McCallum, 2016), the multi-label approach PRLR (Posterior-Regularized Low-Rank) (Lin et al., 2014), the energy-based model SPEN (Structured Prediction Energy Networks) (Belanger and McCallum, 2016) as well as DVN (Deep Value Networks) (Gygli et al., 2017).

The results in Table 1 (left) show that OEL can compete with state-of-the-art dedicated multilabel methods on the standard dataset Bibtex. In a second experiment (Table 1, right) we trained OEL by splitting the datasets into a smaller training set and using the rest of the examples as unsupervised output data. We observe that OEL obtains higher scores than IOKR in this setup. Using additional unsupervised data provides further improvement in the case of the Bookmarks and Corel5k datasets.

Method                                Bibtex
IOKR                                  44.0 ()
OEL                                   43.8 ()
LR (Lin et al., 2014)                 37.2
NN (Belanger and McCallum, 2016)      38.9
SPEN (Belanger and McCallum, 2016)    42.2
PRLR (Lin et al., 2014)               44.2
DVN (Gygli et al., 2017)              44.7

Method        Bibtex    Bookmarks    Corel5k
IOKR          35.9      22.9         13.7
OEL           39.7      25.9         16.1
OEL + data    39.7      27.1         19.0
Table 1: Left: scores of state-of-the-art methods on the Bibtex dataset. Right: scores of OEL and IOKR on different multi-label problems in a small training data regime.

Metabolite Identification. In metabolite identification, the problem is to predict the molecular structure of a metabolite given its tandem mass spectrum. State-of-the-art results for this problem have been obtained with the IOKR method by Brouard et al. (2016a); we adopt a similar numerical experimental protocol (5-CV Outer/4-CV Inner loops) and output kernel (Gaussian-Tanimoto) on the molecular fingerprints. Besides the labeled training set (), a very large additional dataset of molecules (candidate outputs), without corresponding inputs (spectra), is available and used by OEL. Details can be found in the Supplementary material. Analyzing the results in Table 2, given for three metrics including the top-k accuracy, we observe that without any additional data, and slightly improved upon plain IOKR, and improved even more when exploiting additional unlabeled data with . Such accuracy improvement is crucial in this real-world task. In , the balancing parameter selected by an inner cross-validation on the training set is on average on the outer splits, imposing a balance between the influence of the small labeled dataset and the large unsupervised output set.

Table 2: Test mean losses and standard errors for the metabolite identification problem (columns: Method, MSE, Tanimoto-Gaussian loss, Top-k accuracies; only the row label "OEL + data" survives, the values are missing).

5 Conclusion

We have presented a novel approach, OEL, for learning output embeddings in structured prediction. The OEL method allows us to leverage the rich Hilbert space representations of structured outputs while allowing adaptation to particular input spaces and controlling the excess risk of the model through supervised dimensionality reduction in the output space. Our experiments demonstrate the relevance of the method, in particular when a separate output sample is available, and show performance matching or exceeding the state of the art in various applications.

Broader Impact

Making machine learning models that are able to predict structured objects such as molecules has broad applications in society. For example, the metabolite identification task has relevance to biomedicine, pharmaceuticals, anti-doping, and regulatory affairs, to name a few examples, and structured prediction methods have state-of-the-art performance.

From the theoretical side, understanding how output representations can be learned better is an important consideration both in practice and theory. This paper provides a theoretically well-founded approach, which helps our understanding of how structured prediction methods can be improved.

Potential down-sides of the method are shared with all machine learning: the methods will make errors and human scrutiny may be needed to post-process the predictions before application.

The methods will be provided as open code for the disposal of the society as a whole.


  • Bakhtin et al. [2020] A. Bakhtin, Y. Deng, S. Gross, M. Ott, M. Ranzato, and A. Szlam. Energy-based models for text. CoRR, abs/2004.10188, 2020.
  • Belanger and McCallum [2016] D. Belanger and A. McCallum. Structured prediction energy networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, page 983–992. JMLR.org, 2016.
  • Brouard et al. [2011] C. Brouard, F. d’Alché-Buc, and M. Szafranski. Semi-supervised penalized output kernel regression for link prediction. In Proceedings of the 28th International Conference on Machine Learning, pages 593–600, 2011.
  • Brouard et al. [2016a] C. Brouard, H. Shen, K. Dührkop, F. d’Alché Buc, S. Böcker, and J. Rousu. Fast metabolite identification with input output kernel regression. Bioinformatics, 32(12):i28–i36, 2016a.
  • Brouard et al. [2016b] C. Brouard, M. Szafranski, and F. d’Alché-Buc. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. The Journal of Machine Learning Research, 17(1):6105–6152, 2016b.
  • Caponnetto and De Vito [2007] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • Cheng et al. [2010] W. Cheng, E. Hüllermeier, and K. J. Dembczynski. Label ranking methods based on the plackett-luce model. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 215–222, 2010.
  • Ciliberto et al. [2016] C. Ciliberto, L. Rosasco, and A. Rudi. A consistent regularization approach for structured prediction. In Advances in neural information processing systems, pages 4412–4420, 2016.
  • Ciliberto et al. [2020] C. Ciliberto, L. Rosasco, and A. Rudi. A general framework for consistent structured prediction with implicit loss embeddings. arXiv preprint arXiv:2002.05424, 2020.
  • Cortes et al. [2005] C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. In Proceedings of the 22nd International Conference on Machine Learning, page 153–160, 2005.
  • Djerrab et al. [2018] M. Djerrab, A. Garcia, M. Sangnier, and F. d’Alché Buc. Output fisher embedding regression. Machine Learning, 107(8-10):1229–1256, 2018.
  • Gärtner [2008] T. Gärtner. Kernels for Structured Data, volume 72 of Series in Machine Perception and Artificial Intelligence. WorldScientific, 2008.
  • Geurts et al. [2006] P. Geurts, L. Wehenkel, and F. d’Alché Buc. Kernelizing the output of tree-based methods. In Proceedings of the 23rd international conference on Machine learning, pages 345–352, 2006.
  • Gygli et al. [2017] M. Gygli, M. Norouzi, and A. Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1341–1351, 2017.
  • Halko et al. [2011] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
  • Hüllermeier et al. [2008] E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16):1897 – 1916, 2008.
  • Kadri et al. [2013] H. Kadri, M. Ghavamzadeh, and P. Preux. A generalized kernel approach to structured output learning. In International Conference on Machine Learning, pages 471–479, 2013.
  • Katakis et al. [2008] I. Katakis, G. Tsoumakas, and I. Vlahavas. Multilabel text classification for automated tag suggestion. ECML PKDD Discovery Challenge 2008, page 75, 2008.
  • Korba et al. [2018] A. Korba, A. Garcia, and F. d’Alché Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems, pages 8994–9004, 2018.
  • Laforgue et al. [2019a] P. Laforgue, S. Clémençon, and F. d’Alche Buc. Autoencoding any data through kernel autoencoders. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1061–1069, 2019a.
  • Laforgue et al. [2019b] P. Laforgue, A. Lambert, L. Motte, and F. d’Alché Buc. On the dualization of operator-valued kernel machines. arXiv preprint arXiv:1910.04621, 2019b.
  • Lapin et al. [2016] M. Lapin, M. Hein, and B. Schiele. Loss functions for top-k error: Analysis and insights. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1468–1477. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.163. URL https://doi.org/10.1109/CVPR.2016.163.
  • Lin et al. [2014] X. V. Lin, S. Singh, L. He, B. Taskar, and L. Zettlemoyer. Multi-label learning with posterior regularization. In NIPS Workshop on Modern Machine Learning and Natural Language Processing, 2014.
  • Luise et al. [2019] G. Luise, D. Stamos, M. Pontil, and C. Ciliberto. Leveraging low-rank relations between surrogate tasks in structured prediction. In International Conference on Machine Learning, pages 4193–4202, 2019.
  • Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X, 9780262018258.
  • Nguyen et al. [2019] D. H. Nguyen, C. H. Nguyen, and H. Mamitsuka. Recent advances and prospects of computational methods for metabolite identification: a review with emphasis on machine learning approaches. Briefings in bioinformatics, 20(6):2028–2043, 2019.
  • Nowak-Vila et al. [2019] A. Nowak-Vila, F. Bach, and A. Rudi. A general theory for structured prediction with smooth convex surrogates. arXiv preprint arXiv:1902.01958, 2019.
  • Nowozin and Lampert [2011] S. Nowozin and C. H. Lampert. Structured learning and prediction in computer vision. Found. Trends Comput. Graph. Vis., 6(3-4):185–365, 2011.
  • Osokin et al. [2017] A. Osokin, F. R. Bach, and S. Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems 30, pages 302–313, 2017.
  • Palatucci et al. [2009] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pages 1410–1418, 2009.
  • Pillutla et al. [2018] V. K. Pillutla, V. Roulet, S. M. Kakade, and Z. Harchaoui. A smoother way to train structured prediction models. In Advances in Neural Information Processing Systems, pages 4766–4778, 2018.
  • Rudi et al. [2013] A. Rudi, G. D. Cañas, and L. Rosasco. On the sample complexity of subspace learning. In Advances in Neural Information Processing Systems, pages 2067–2075, 2013.
  • Rudi et al. [2017] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.
  • Schölkopf et al. [1998] B. Schölkopf, A. J. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
  • Sohn et al. [2015] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015.
  • Struminsky et al. [2018] K. Struminsky, S. Lacoste-Julien, and A. Osokin. Quantifying learning guarantees for convex but inconsistent surrogates. In Advances in Neural Information Processing Systems, pages 669–677, 2018.
  • Taskar et al. [2004] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in neural information processing systems, pages 25–32, 2004.
  • Tsochantaridis et al. [2004] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the twenty-first international conference on Machine learning, page 104, 2004.
  • Tsoumakas and Katakis [2007] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. IJDWM, 3(3):1–13, 2007.
  • Weston et al. [2003] J. Weston, O. Chapelle, V. Vapnik, A. Elisseeff, and B. Schölkopf. Kernel dependency estimation. In Advances in neural information processing systems, pages 897–904, 2003.

6 Supplementary material

6.1 Definitions and Notation

Here we introduce the ideal and empirical linear operators used in the following to prove the excess-risk theorem, and give their basic properties.

  • , ,

  • , (with )

  • and its empirical counterpart

  • and its empirical counterpart

  • (with and its empirical counterpart ( with ).

  • and its empirical counterpart

  • and its empirical counterpart

  • The gamma function

We have the following properties:

  • If , then , with

  • , , with

6.2 Lemmas

First, we give the following lemma showing the equivalence in the linear case between the general initial objective (3) and the mixed linear subspace estimation one (6).

Lemma 6.1.

Using the solution of the regression problem, Eq. (5) is equivalent to:


In the linear case, (3) instantiates as


Decomposing the first term, with , and noticing that ( is orthogonal), one can check that we obtain the desired result. ∎

From here, we give the lemmas, and their proofs, that we used in order to prove the main theorem in the next section.

First, we leverage the comparison inequality from Ciliberto et al. [2016], allowing to relate the excess-risk of to the distance of to .

See Lemma 3.1


The considered loss function belongs to the wide family of SELF losses [Ciliberto et al., 2016] for which the comparison inequality holds. A loss is SELF if it satisfies the implicit embedding property [Ciliberto et al., 2020], i.e. there exists a Hilbert space and two feature maps such that

In our case the construction is direct and corresponds to , and

Hence, directly applying Theorem 3 from Ciliberto et al. [2020], we get a constant:

In the theorem’s proof, we will split the surrogate excess-risk of , using the triangle inequality, into two terms: 1) the KRR excess-risk of in estimating the new embedding , 2) the excess-risk or reconstruction error of the learned couple when recovering . For now, are supposed to be of the form: , with such that .

The following lemma gives a bound for the first term: the KRR excess-risk on the learned linear subspace of dimension . To do so, we simply bound it with the KRR excess-risk on the entire space , and leave as future work a refined analysis of this term studying its dependency w.r.t. .

Lemma 6.2 (Kernel Ridge Excess-risk Bound on a linear subspace of dimension ).

Let be such that , then with probability at least :

with , , .


We observe that:

Then, considering the assumptions of Theorem 3.3, we use the result for kernel ridge regression from Ciliberto et al. [2016], with , , .

Then, we show that we can upper bound the reconstruction error term, using Jensen's inequality, by the ideal counterpart of the empirical objective w.r.t. the output embedding of Algorithm 1.

See Lemma 3.2


Let , such that , ,

Defining the projection