Structured Prediction Theory Based on Factor Graph Complexity

05/20/2016
by   Corinna Cortes, et al.
Google
NYU college

We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities. We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. We also report the results of experiments with VCRF on several datasets to validate our theory.

1 Introduction

Structured prediction covers a broad family of important learning problems. These include key tasks in natural language processing such as part-of-speech tagging, parsing, machine translation, and named-entity recognition, important areas in computer vision such as image segmentation and object recognition, and also crucial areas in speech processing such as pronunciation modeling and speech recognition.

In all these problems, the output space admits some structure. This may be a sequence of tags as in part-of-speech tagging, a parse tree as in context-free parsing, an acyclic graph as in dependency parsing, or labels of image segments as in object detection. Another property common to these tasks is that, in each case, the natural loss function admits a decomposition along the output substructures. As an example, the loss function may be the Hamming loss as in part-of-speech tagging, or it may be the edit-distance, which is widely used in natural language and speech processing.

The output structure and corresponding loss function make these problems significantly different from the (unstructured) binary classification problems extensively studied in learning theory. In recent years, a number of different algorithms have been designed for structured prediction, including Conditional Random Field (CRF) (Lafferty et al., 2001), StructSVM (Tsochantaridis et al., 2005), Maximum-Margin Markov Network (M3N) (Taskar et al., 2003), a kernel-regression algorithm (Cortes et al., 2007), and search-based approaches (Daumé III et al., 2009; Doppa et al., 2014; Lam et al., 2015; Chang et al., 2015; Ross et al., 2011). More recently, deep learning techniques have also been developed for tasks including part-of-speech tagging (Jurafsky and Martin, 2009; Vinyals et al., 2015a), named-entity recognition (Nadeau and Sekine, 2007), machine translation (Zhang et al., 2008), image segmentation (Lucchi et al., 2013), and image annotation (Vinyals et al., 2015b).

However, in contrast to the plethora of algorithms, there have been relatively few studies devoted to the theoretical understanding of structured prediction (Bakir et al., 2007). Existing learning guarantees hold primarily for simple losses such as the Hamming loss (Taskar et al., 2003; Cortes et al., 2014; Collins, 2001) and do not cover other natural losses such as the edit-distance. They also typically only apply to specific factor graph models. The main exception is the work of McAllester (2007), which provides PAC-Bayesian guarantees for arbitrary losses, though only in the special case of randomized algorithms using linear (count-based) hypotheses.

This paper presents a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a broad family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. For special cases studied in the past, our learning bounds match or improve upon the previously best bounds (see Section 3.3). In particular, our bounds improve upon those of Taskar et al. (2003). Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets along with a sparsity measure for features and graphs.

We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (VStructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. As a proof of concept validating our theory, we also report the results of experiments with VCRF on several datasets.

The paper is organized as follows. In Section 2 we introduce the notation and definitions relevant to our discussion of structured prediction. In Section 3, we derive a series of new learning guarantees for structured prediction, which are then used to prove the VRM principle in Section 4. Section 5 develops the algorithmic framework which is directly based on our theory. In Section 6, we provide some preliminary experimental results that serve as a proof of concept for our theory.

2 Preliminaries

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. In structured prediction, the output space may be a set of sequences, images, graphs, parse trees, lists, or some other (typically discrete) objects admitting some possibly overlapping structure. Thus, we assume that the output structure can be decomposed into $l$ substructures. For example, this may be positions along a sequence, so that the output space $\mathcal{Y}$ is decomposable along these substructures: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$. Here, $\mathcal{Y}_k$ is the set of possible labels (or classes) that can be assigned to substructure $k$.

Loss functions. We denote by $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ a loss function measuring the dissimilarity of two elements of the output space $\mathcal{Y}$. We will assume that the loss function $L$ is definite, that is $L(y, y') = 0$ iff $y = y'$. This assumption holds for all loss functions commonly used in structured prediction. A key aspect of structured prediction is that the loss function can be decomposed along the substructures $\mathcal{Y}_k$. As an example, $L$ may be the Hamming loss defined by $L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$ for all $y = (y_1, \ldots, y_l)$ and $y' = (y'_1, \ldots, y'_l)$, with $y_k, y'_k \in \mathcal{Y}_k$. In the common case where $\mathcal{Y}$ is a set of sequences defined over a finite alphabet, $L$ may be the edit-distance, which is widely used in natural language and speech processing applications, with possibly different costs associated to insertions, deletions and substitutions. $L$ may also be a loss based on the negative inner product of the vectors of $n$-gram counts of two sequences, or its negative logarithm. Such losses have been used to approximate the BLEU score loss in machine translation. There are other losses defined in computational biology based on various string-similarity measures. Our theoretical analysis is general and applies to arbitrary bounded and definite loss functions.
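To make the two running examples of losses concrete, the following is a minimal sketch of a normalized Hamming loss and a weighted edit-distance. The function names, the default unit costs, and the choice to normalize the Hamming loss by the sequence length are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def hamming_loss(y: Sequence, y_prime: Sequence) -> float:
    """Fraction of positions at which two equal-length outputs differ."""
    if len(y) != len(y_prime):
        raise ValueError("Hamming loss needs outputs of equal length")
    if len(y) == 0:
        return 0.0
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)


def edit_distance(y: Sequence, y_prime: Sequence,
                  ins_cost: float = 1.0,
                  del_cost: float = 1.0,
                  sub_cost: float = 1.0) -> float:
    """Weighted edit-distance computed by dynamic programming.

    The three costs are hypothetical parameters for this sketch; with
    the default unit costs this is the classical Levenshtein distance.
    """
    # prev[j] is the cost of turning the current prefix of y into y_prime[:j].
    prev = [j * ins_cost for j in range(len(y_prime) + 1)]
    for i, a in enumerate(y, start=1):
        curr = [i * del_cost]
        for j, b in enumerate(y_prime, start=1):
            curr.append(min(
                prev[j] + del_cost,                            # delete a
                curr[j - 1] + ins_cost,                        # insert b
                prev[j - 1] + (0.0 if a == b else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]
```

Both functions return zero exactly when the two outputs coincide (for strictly positive costs), which is the definiteness property assumed throughout the analysis.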

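Scoring functions and factor graphs. We will adopt the common approach in structured prediction where predictions are based on a scoring function mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$. Let $\mathcal{H}$ be a family of scoring functions. For any $h \in \mathcal{H}$, we denote by $\mathsf{h}$ the predictor defined by $h$: for any $x \in \mathcal{X}$, $\mathsf{h}(x) = \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$.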

Furthermore, we will assume, as is standard in structured prediction, that each function $h \in \mathcal{H}$ can be decomposed as a sum. We will consider the most general case for such decompositions, which can be made explicit using the notion of factor graphs. [Footnote 1: Factor graphs are typically used to indicate the factorization of a probabilistic model. We are not assuming probabilistic models, but they would also be captured by our general framework: $h$ would then be $-\log$ of a probability.]

A factor graph $G$ is a tuple $G = (V, F, E)$, where $V$ is a set of variable nodes, $F$ a set of factor nodes, and $E$ a set of undirected edges between a variable node and a factor node. In our context, $V$ can be identified with the set of substructure indices, that is $V = \{1, \ldots, l\}$.

For any factor node $f$, denote by $\mathcal{N}(f) \subseteq V$ the set of variable nodes connected to $f$ via an edge and define $\mathcal{Y}_f$ as the substructure set cross-product $\mathcal{Y}_f = \prod_{k \in \mathcal{N}(f)} \mathcal{Y}_k$. Then, $h$ admits the following decomposition as a sum of functions $h_f$, each taking as argument an element of the input space $x \in \mathcal{X}$ and an element of $\mathcal{Y}_f$, $y_f \in \mathcal{Y}_f$:

$$h(x, y) = \sum_{f \in F} h_f(x, y_f). \tag{1}$$

Figure 1 illustrates this definition with two different decompositions. More generally, we will consider the setting in which a factor graph may depend on a particular example $(x_i, y_i)$: $G(x_i, y_i) = G_i = (V_i, F_i, E_i)$. A special case of this setting is for example when the size $l_i$ (or length) of each example is allowed to vary and where the number of possible labels $|\mathcal{Y}|$ is potentially infinite.

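Figure 1: Example of factor graphs. (a) Pairwise Markov network decomposition: $h(x, y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3)$. (b) Other decomposition: $h(x, y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3)$.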
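As an illustration of scoring with a factor graph, the sketch below evaluates a score as a sum of local factor scores, each seeing only the labels of its neighboring variable nodes, and predicts by brute-force maximization over the output space. The toy label set, the two pairwise factors, and their local scoring functions are hypothetical and chosen only to keep the example small; practical systems replace the exhaustive search with dynamic programming or another inference routine.

```python
from itertools import product

# Hypothetical toy problem: three substructures, each labelled "A" or "B".
LABELS = ("A", "B")
NUM_POSITIONS = 3


def h_f1(x, y_f):
    """Local score of factor f1 on positions (0, 1); made up for the example."""
    return 1.0 if y_f == ("A", "B") else 0.0


def h_f2(x, y_f):
    """Local score of factor f2 on positions (1, 2); rewards equal labels."""
    return x if y_f[0] == y_f[1] else 0.0


# Each factor is a pair (neighborhood of variable nodes, local scoring function).
# This is a pairwise chain: f1 touches positions 0 and 1, f2 touches 1 and 2.
FACTORS = [((0, 1), h_f1), ((1, 2), h_f2)]


def score(x, y):
    """Sum of local factor scores, each restricted to its neighborhood."""
    return sum(h_f(x, tuple(y[k] for k in nbrs)) for nbrs, h_f in FACTORS)


def predict(x):
    """Brute-force argmax of the score over all label sequences."""
    return max(product(LABELS, repeat=NUM_POSITIONS), key=lambda y: score(x, y))
```

For the input `x = 2.0`, the sequence `("A", "B", "B")` activates both factors and is the unique maximizer in this toy setup.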

We present other examples of such hypothesis sets and their decomposition in Section 3, where we discuss our learning guarantees. Note that such hypothesis sets with an additive decomposition are those commonly used in most structured prediction algorithms (Tsochantaridis et al., 2005; Taskar et al., 2003; Lafferty et al., 2001). This is largely motivated by the computational requirement for efficient training and inference. Our results, while very general, further provide a statistical learning motivation for such decompositions.

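Learning scenario. We consider the familiar supervised learning scenario where the training and test points are drawn i.i.d. according to some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We will further adopt the standard definitions of margin, generalization error and empirical error. The margin $\rho_h(x, y)$ of a hypothesis $h$ for a labeled example $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is defined by

$$\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y'). \tag{2}$$

Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a training sample of size $m$ drawn from $\mathcal{D}^m$. We denote by $R(h)$ the generalization error and by $\widehat{R}_S(h)$ the empirical error of $h$ over $S$:

$$R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[L(\mathsf{h}(x), y)\big] \quad \text{and} \quad \widehat{R}_S(h) = \mathbb{E}_{(x, y) \sim S}\big[L(\mathsf{h}(x), y)\big], \tag{3}$$

where $\mathsf{h}(x) = \operatorname{argmax}_{y} h(x, y)$ and where the notation $(x, y) \sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution defined by $S$. The learning problem consists of using the sample $S$ to select a hypothesis $h \in \mathcal{H}$ with small expected loss $R(h)$.

Observe that the definiteness of the loss function implies, for all $x \in \mathcal{X}$, the following equality:

$$L(\mathsf{h}(x), y) = L(\mathsf{h}(x), y) \, 1_{\rho_h(x, y) \le 0}. \tag{4}$$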

We will later use this identity in the derivation of surrogate loss functions.
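The identity can be checked numerically on a small example. The sketch below assumes the standard margin, that is the score of the correct output minus the best score of any other output, and verifies that the loss of the prediction is unchanged when multiplied by the indicator that the margin is non-positive. The three-label output set, the score table, and the zero-one loss are made up for the illustration.

```python
def margin(h, x, y, outputs):
    """Score of y minus the best score among the other outputs."""
    return h(x, y) - max(h(x, y2) for y2 in outputs if y2 != y)


def predict(h, x, outputs):
    """Predictor induced by the scoring function h."""
    return max(outputs, key=lambda y: h(x, y))


def zero_one(y_pred, y):
    """A simple definite loss: 0 iff the two outputs coincide."""
    return 0.0 if y_pred == y else 1.0


# Hypothetical scoring function over three outputs.
OUTPUTS = ("a", "b", "c")
TABLE = {"a": 2.0, "b": 0.5, "c": 1.0}


def h(x, y):
    return x * TABLE[y]


def identity_holds(x, y):
    """Check L(h(x), y) == L(h(x), y) * 1[margin <= 0] for one example."""
    loss = zero_one(predict(h, x, OUTPUTS), y)
    indicator = 1.0 if margin(h, x, y, OUTPUTS) <= 0 else 0.0
    return loss == loss * indicator
```

When the margin is positive, the correct label is the unique maximizer, so the prediction equals the label and definiteness forces the loss to zero; this is the only case in which the indicator vanishes.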

3 General learning bounds for structured prediction

In this section, we present new learning guarantees for structured prediction. Our analysis is general and applies to the broad family of definite and bounded loss functions described in the previous section. It is also general in the sense that it applies to general hypothesis sets and not just sub-families of linear functions. For linear hypotheses, we will give a more refined analysis that holds for arbitrary norm-$p$ regularized hypothesis sets.

The theoretical analysis of structured prediction is more complex than for classification since, by definition, it depends on the properties of the loss function and the factor graph. These attributes capture the combinatorial properties of the problem which must be exploited since the total number of labels is often exponential in the size of that graph. To tackle this problem, we first introduce a new complexity tool.

3.1 Complexity measure

A key ingredient of our analysis is a new data-dependent notion of complexity that extends the classical Rademacher complexity. We define the empirical factor graph Rademacher complexity $\widehat{\mathfrak{R}}^G_S(\mathcal{H})$ of a hypothesis set $\mathcal{H}$ for a sample $S = (x_1, \ldots, x_m)$ and factor graph $G$ as follows:

$$\widehat{\mathfrak{R}}^G_S(\mathcal{H}) = \frac{1}{m} \, \mathbb{E}_{\epsilon}\bigg[\sup_{h \in \mathcal{H}} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} \sqrt{|F_i|} \, \epsilon_{i,f,y} \, h_f(x_i, y)\bigg],$$

where $\epsilon = (\epsilon_{i,f,y})_{i \in [m], f \in F_i, y \in \mathcal{Y}_f}$ and where the $\epsilon_{i,f,y}$s are independent Rademacher random variables uniformly distributed over $\{\pm 1\}$. The factor graph Rademacher complexity of $\mathcal{H}$ for a factor graph $G$ is defined as the expectation: $\mathfrak{R}^G_m(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}\big[\widehat{\mathfrak{R}}^G_S(\mathcal{H})\big]$. It can be shown that the empirical factor graph Rademacher complexity is concentrated around its mean (Lemma 8). The factor graph Rademacher complexity is a natural extension of the standard Rademacher complexity to vector-valued hypothesis sets (with one coordinate per factor in our case). For binary classification, the factor graph and standard Rademacher complexities coincide. Otherwise, the factor graph complexity can be upper bounded in terms of the standard one. As with the standard Rademacher complexity, the factor graph Rademacher complexity of a hypothesis set can be estimated from data in many cases. In some important cases, it also admits explicit upper bounds similar to those for the standard Rademacher complexity but with an additional dependence on the factor graph quantities. We will prove this for several families of functions which are commonly used in structured prediction (Theorem 2).
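For a finite hypothesis set and a very small sample, the empirical complexity can be computed exactly by enumerating all sign patterns. The sketch below assumes the definition in which each term $h_f(x_i, y)$ is multiplied by an independent Rademacher variable and weighted by $\sqrt{|F_i|}$, with the supremum taken over the hypothesis set and the result divided by $m$. The nested-list encoding of hypotheses is a hypothetical convention for this example, and exhaustive enumeration is only feasible for a handful of variables; larger problems would need random sampling of the signs.

```python
import math
from itertools import product


def factor_graph_rademacher(hypotheses):
    """Exact empirical factor graph Rademacher complexity of a finite set.

    hypotheses[h][i][f][y] holds the local score h_f(x_i, y) of hypothesis
    h on example i, factor f, and local labelling y. All hypotheses must
    share the same shape.
    """
    m = len(hypotheses[0])
    # Flatten each hypothesis into a vector of sqrt(|F_i|) * h_f(x_i, y).
    vectors = []
    for hyp in hypotheses:
        vec = []
        for example in hyp:
            weight = math.sqrt(len(example))  # len(example) is |F_i|
            for factor in example:
                vec.extend(weight * value for value in factor)
        vectors.append(vec)
    dim = len(vectors[0])
    # Average the supremum over every assignment of Rademacher signs.
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=dim):
        total += max(sum(s * v for s, v in zip(signs, vec)) for vec in vectors)
    return total / (2 ** dim) / m
```

A single hypothesis has zero complexity, since the expectation of a fixed linear function of symmetric signs vanishes, while a set containing a hypothesis and its negation has strictly positive complexity.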

3.2 Generalization bounds

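In this section, we present new margin bounds for structured prediction based on the factor graph Rademacher complexity of $\mathcal{H}$. Our results hold both for the additive and the multiplicative empirical margin losses defined below:

$$\widehat{R}^{\mathrm{add}}_{S,\rho}(h) = \mathbb{E}_{(x, y) \sim S}\bigg[\Phi^*\Big(\max_{y' \neq y} L(y', y) - \tfrac{1}{\rho}\big[h(x, y) - h(x, y')\big]\Big)\bigg] \tag{5}$$

$$\widehat{R}^{\mathrm{mult}}_{S,\rho}(h) = \mathbb{E}_{(x, y) \sim S}\bigg[\Phi^*\Big(\max_{y' \neq y} L(y', y)\Big(1 - \tfrac{1}{\rho}\big[h(x, y) - h(x, y')\big]\Big)\Big)\bigg] \tag{6}$$

Here, $\Phi^*(r) = \min(M, \max(0, r))$ for all $r$, with $M = \max_{y, y'} L(y, y')$. As we show in Section 5, convex upper bounds on $\widehat{R}^{\mathrm{add}}_{S,\rho}(h)$ and $\widehat{R}^{\mathrm{mult}}_{S,\rho}(h)$ directly lead to many existing structured prediction algorithms. The following is our general data-dependent margin bound for structured prediction.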
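The two empirical margin losses can be sketched for a finite output space as follows. The code assumes that the additive loss clips $\max_{y' \neq y} L(y', y) - \frac{1}{\rho}[h(x, y) - h(x, y')]$ to the interval $[0, M]$, that the multiplicative loss clips $\max_{y' \neq y} L(y', y)\big(1 - \frac{1}{\rho}[h(x, y) - h(x, y')]\big)$ in the same way, and that $M$ is the largest value of the loss; the function names and the exhaustive maximization are illustrative choices.

```python
def phi_star(r, M):
    """Clip r to the interval [0, M]."""
    return min(M, max(0.0, r))


def empirical_margin_losses(h, loss, sample, outputs, rho):
    """Return (additive, multiplicative) empirical margin losses.

    h(x, y) is a scoring function, loss(y', y) a bounded definite loss,
    sample a list of (x, y) pairs, outputs the finite output space, and
    rho > 0 the margin parameter.
    """
    M = max(loss(a, b) for a in outputs for b in outputs)
    add_total = 0.0
    mult_total = 0.0
    for x, y in sample:
        others = [y2 for y2 in outputs if y2 != y]
        add_total += phi_star(
            max(loss(y2, y) - (h(x, y) - h(x, y2)) / rho for y2 in others), M)
        mult_total += phi_star(
            max(loss(y2, y) * (1.0 - (h(x, y) - h(x, y2)) / rho) for y2 in others), M)
    return add_total / len(sample), mult_total / len(sample)
```

With a zero-one loss and two outputs, an example classified correctly with a score gap of at least $\rho$ contributes zero to both losses, a smaller gap contributes a value strictly between 0 and 1, and a misclassified example contributes the full loss.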

Theorem 1.

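Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, the following holds for all $h \in \mathcal{H}$,

$$R(h) \le \widehat{R}^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt{2}}{\rho} \, \mathfrak{R}^G_m(\mathcal{H}) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$

$$R(h) \le \widehat{R}^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt{2}\, M}{\rho} \, \mathfrak{R}^G_m(\mathcal{H}) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$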

The full proof of Theorem 1 is given in Appendix A. It is based on a new contraction lemma (Lemma 5) generalizing Talagrand’s lemma that can be of independent interest. [Footnote 2: A result similar to Lemma 5 has also been recently proven independently in (Maurer, 2016).] We also present a more refined contraction lemma (Lemma 6) that can be used to improve the bounds of Theorem 1. Theorem 1 is the first data-dependent generalization guarantee for structured prediction with general loss functions, general hypothesis sets, and arbitrary factor graphs for both multiplicative and additive margins. We also present a version of this result with empirical complexities as Theorem 7 in the supplementary material. We will compare these guarantees to known special cases below.

The margin bounds above can be extended to hold uniformly over at the price of an additional term of the form in the bound, using known techniques (see for example (Mohri et al., 2012)).

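The hypothesis set used by convex structured prediction algorithms such as StructSVM (Tsochantaridis et al., 2005), Max-Margin Markov Networks (M3N) (Taskar et al., 2003) or Conditional Random Field (CRF) (Lafferty et al., 2001) is that of linear functions. More precisely, let $\Psi$ be a feature mapping from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}^N$ such that $\Psi(x, y) = \sum_{f \in F} \Psi_f(x, y_f)$. For any $p$, define $\mathcal{H}_p$ as follows:

$$\mathcal{H}_p = \big\{(x, y) \mapsto w \cdot \Psi(x, y) \colon w \in \mathbb{R}^N, \ \|w\|_p \le \Lambda_p\big\}.$$

Then, $\widehat{\mathfrak{R}}^G_S(\mathcal{H}_p)$ can be efficiently estimated using random sampling and solving LP programs. Moreover, one can obtain explicit upper bounds on $\widehat{\mathfrak{R}}^G_S(\mathcal{H}_p)$. To simplify our presentation, we will consider the case $p = 1, 2$, but our results can be extended to arbitrary $p \ge 1$ and, more generally, to arbitrary group norms.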

Theorem 2.

For any sample , the following upper bounds hold for the empirical factor graph complexity of and :

where , and where is a sparsity factor defined by .

Plugging in these factor graph complexity upper bounds into Theorem 1 immediately yields explicit data-dependent structured prediction learning guarantees for linear hypotheses with general loss functions and arbitrary factor graphs (see Corollary 10). Observe that, in the worst case, the sparsity factor can be bounded as follows:

where . Thus, the factor graph Rademacher complexities of linear hypotheses in scale as . An important observation is that and depend on the observed sample. This shows that the expected size of the factor graph is crucial for learning in this scenario. This should be contrasted with other existing structured prediction guarantees that we discuss below, which assume a fixed upper bound on the size of the factor graph. Note that our result shows that learning is possible even with an infinite set . To the best of our knowledge, this is the first learning guarantee for learning with infinitely many classes.

Our learning guarantee for can additionally benefit from the sparsity of the feature mapping and observed data. In particular, in many applications, is a binary indicator function that is non-zero for a single . For instance, in NLP, may indicate an occurrence of a certain -gram in the input and output . In this case, and the complexity term is only in , where may depend linearly on .

3.3 Special cases and comparisons

Markov networks. For the pairwise Markov networks with a fixed number of substructures studied by Taskar et al. (2003), our equivalent factor graph admits nodes, , and the maximum size of is if each substructure of a pair can be assigned one of classes. Thus, if we apply Corollary 10 with Hamming distance as our loss function and divide the bound through by , to normalize the loss to interval as in (Taskar et al., 2003), we obtain the following explicit form of our guarantee for an additive empirical margin loss, for all :

This bound can be further improved by eliminating the dependency on using an extension of our contraction Lemma 5 to (see Lemma 6). The complexity term of Taskar et al. (2003) is bounded by a quantity that varies as , where is the maximal out-degree of a factor graph. Our bound has the same dependence on these key quantities, but with no logarithmic term in our case. Note that, unlike the result of Taskar et al. (2003), our bound also holds for general loss functions and different -norm regularizers. Moreover, our result for a multiplicative empirical margin loss is new, even in this special case.

Multi-class classification. For standard (unstructured) multi-class classification, we have and , where is the number of classes. In that case, for linear hypotheses with norm-2 regularization, the complexity term of our bound varies as (Corollary 11). This improves upon the best known general margin bounds of Kuznetsov et al. (2014), who provide a guarantee that scales linearly with the number of classes instead. Moreover, in the special case where an individual is learned for each class , we retrieve the recent favorable bounds given by Lei et al. (2015), albeit with a somewhat simpler formulation. In that case, for any , all components of the feature vector are zero, except (perhaps) for the components corresponding to class , where is the dimension of . In view of that, for example for a group-norm -regularization, the complexity term of our bound varies as , which matches the results of Lei et al. (2015) with a logarithmic dependency on (ignoring some complex exponents of in their case). Additionally, note that unlike existing multi-class learning guarantees, our results hold for arbitrary loss functions. See Corollary 12 for further details. Our sparsity-based bounds can also be used to give bounds with logarithmic dependence on the number of classes when the features only take values in . Finally, using Lemma 6 instead of Lemma 5, the dependency on the number of classes can be further improved.

We conclude this section by observing that, since our guarantees are expressed in terms of the average size of the factor graph over a given sample, this invites us to search for a hypothesis set and predictor such that the tradeoff between the empirical size of the factor graph and empirical error is optimal. In the next section, we will make use of the recently developed principle of Voted Risk Minimization (VRM) (Cortes et al., 2015) to reach this objective.

4 Voted Risk Minimization

In many structured prediction applications such as natural language processing and computer vision, one may wish to exploit very rich features. However, the use of rich families of hypotheses could lead to overfitting. In this section, we show that it may be possible to use rich families in conjunction with simpler families, provided that fewer complex hypotheses are used (or that they are used with less mixture weight). We achieve this goal by deriving learning guarantees for ensembles of structured prediction rules that explicitly account for the differing complexities between families. This will motivate the algorithms that we present in Section 5.

Assume that we are given families of functions mapping from to . Define the ensemble family , that is the family of functions of the form , where is in the simplex and where, for each , is in for some . We further assume that . As an example, the s may be ordered by the size of the corresponding factor graphs.

The main result of this section is a generalization of the VRM theory to the structured prediction setting. The learning guarantees that we present are in terms of upper bounds on and , which are defined as follows for all :

(7)
(8)

Here, can be interpreted as a margin term that acts in conjunction with . For simplicity, we assume in this section that .

Theorem 3.

Fix . For any , with probability at least over the draw of a sample of size , each of the following inequalities holds for all :

where .

The proof of this theorem crucially depends on the theory we developed in Section 3 and is given in Appendix A. As with Theorem 1, we also present a version of this result with empirical complexities as Theorem 14 in the supplementary material. The explicit dependence of this bound on the parameter vector suggests that learning even with highly complex hypothesis sets could be possible so long as the complexity term, which is a weighted average of the factor graph complexities, is not too large. The theorem provides a quantitative way of determining the mixture weights that should be apportioned to each family. Furthermore, the dependency on the number of distinct feature map families is very mild and therefore suggests that a large number of families can be used. These properties will be useful for motivating new algorithms for structured prediction.

5 Algorithms

In this section, we derive several algorithms for structured prediction based on the VRM principle discussed in Section 4. We first give general convex upper bounds (Section 5.1) on the structured prediction loss which recover as special cases the loss functions used in StructSVM (Tsochantaridis et al., 2005), Max-Margin Markov Networks (M3N) (Taskar et al., 2003), and Conditional Random Field (CRF) (Lafferty et al., 2001). Next, we introduce a new algorithm, Voted Conditional Random Field (VCRF) (Section 5.2), with accompanying experiments as proof of concept. We also present another algorithm, Voted StructBoost (VStructBoost), in Appendix C.

5.1 General framework for convex surrogate losses

Given $(x, y) \in \mathcal{X} \times \mathcal{Y}$, the mapping $h \mapsto L(\mathsf{h}(x), y)$ is typically not a convex function of $h$, which leads to computationally hard optimization problems. This motivates the use of convex surrogate losses. We first introduce a general formulation of surrogate losses for structured prediction problems.

Lemma 4.

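For any $u \in \mathbb{R}_+$, let $\Phi_u\colon \mathbb{R} \to \mathbb{R}$ be an upper bound on $v \mapsto u \, 1_{v \le 0}$. Then, the following upper bound holds for any $h \in \mathcal{H}$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$,

$$L(\mathsf{h}(x), y) \le \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big). \tag{9}$$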

The proof is given in Appendix A. This result defines a general framework that enables us to straightforwardly recover many of the most common state-of-the-art structured prediction algorithms via suitable choices of $\Phi_u(v)$: (a) for $\Phi_u(v) = \max(0, u(1 - v))$, the right-hand side of (9) coincides with the surrogate loss defining StructSVM (Tsochantaridis et al., 2005); (b) for $\Phi_u(v) = \max(0, u - v)$, it coincides with the surrogate loss defining Max-Margin Markov Networks (M3N) (Taskar et al., 2003) when using for $L$ the Hamming loss; and (c) for $\Phi_u(v) = \log(1 + e^{u - v})$, it coincides with the surrogate loss defining the Conditional Random Field (CRF) (Lafferty et al., 2001).

Moreover, alternative choices of can help define new algorithms. In particular, we will refer to the algorithm based on the surrogate loss defined by as StructBoost, in reference to the exponential loss used in AdaBoost. Another related alternative is based on the choice . See Appendix C, for further details on this algorithm. In fact, for each described above, the corresponding convex surrogate is an upper bound on either the multiplicative or additive margin loss introduced in Section 3. Therefore, each of these algorithms seeks a hypothesis that minimizes the generalization bounds presented in Section 3. To the best of our knowledge, this interpretation of these well-known structured prediction algorithms is also new. In what follows, we derive new structured prediction algorithms that minimize finer generalization bounds presented in Section 4.
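The surrogate framework can be sketched in a few lines. The code below assumes the following forms for the surrogate functions, none of which is guaranteed to match the paper's exact notation: $\max(0, u(1 - v))$ for StructSVM, $\max(0, u - v)$ for M3N, $\log(1 + e^{u - v})$ for CRF, and $u e^{-v}$ for StructBoost. Each is checked on a grid to upper bound $u \, 1_{v \le 0}$, and the generic surrogate takes the maximum over competing outputs of $\Phi_{L(y', y)}(h(x, y) - h(x, y'))$.

```python
import math


def phi_structsvm(u, v):
    """Assumed StructSVM-style surrogate: max(0, u * (1 - v))."""
    return max(0.0, u * (1.0 - v))


def phi_m3n(u, v):
    """Assumed M3N-style surrogate: max(0, u - v)."""
    return max(0.0, u - v)


def phi_crf(u, v):
    """Assumed CRF-style surrogate: log(1 + exp(u - v))."""
    return math.log1p(math.exp(u - v))


def phi_structboost(u, v):
    """Assumed StructBoost-style surrogate: u * exp(-v)."""
    return u * math.exp(-v)


def upper_bounds_indicator(phi, us, vs, tol=1e-12):
    """Check phi(u, v) >= u * 1[v <= 0] on a finite grid of (u, v) pairs."""
    return all(phi(u, v) + tol >= (u if v <= 0 else 0.0) for u in us for v in vs)


def surrogate_loss(phi, h, loss, x, y, outputs):
    """Max over competing outputs y' of phi(L(y', y), h(x, y) - h(x, y'))."""
    return max(phi(loss(y2, y), h(x, y) - h(x, y2)) for y2 in outputs if y2 != y)
```

Because every function above dominates the scaled indicator, the resulting surrogate dominates the loss of the prediction, which is what allows the corresponding algorithms to be read as minimizers of an upper bound on the structured prediction loss.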

5.2 Voted Conditional Random Field (VCRF)

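We first consider the convex surrogate loss based on $\Phi_u(v) = \log(1 + e^{u - v})$, which corresponds to the loss defining CRF models. Using the monotonicity of the logarithm and upper bounding the maximum by a sum gives the following upper bound on the surrogate loss:

$$\max_{y' \neq y} \log\Big(1 + e^{L(y', y) - (h(x, y) - h(x, y'))}\Big) \le \log\bigg(\sum_{y' \in \mathcal{Y}} e^{L(y', y) - (h(x, y) - h(x, y'))}\bigg),$$

where the term $y' = y$ of the sum accounts for the constant 1, since $L(y, y) = 0$. Combined with the VRM principle, this leads to the following optimization problem: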

(10)

where . We refer to the learning algorithm based on the optimization problem (10) as VCRF. Note that for , (10) coincides with the objective function of -regularized CRF. Observe that we can also directly use or its upper bound as a convex surrogate. We can similarly derive an -regularization formulation of the VCRF algorithm. In Appendix D, we describe efficient algorithms for solving the VCRF and VStructBoost optimization problems.

6 Experiments

In Appendix B, we corroborate our theory by reporting experimental results suggesting that the VCRF algorithm can outperform the CRF algorithm on a number of part-of-speech (POS) datasets.

7 Conclusion

We presented a general theoretical analysis of structured prediction. Our data-dependent margin guarantees for structured prediction can be used to guide the design of new algorithms or to derive guarantees for existing ones. Their explicit dependency on the properties of the factor graph and on feature sparsity can help shed new light on the role played by the graph and features in generalization. Our extension of the VRM theory to structured prediction provides a new analysis of generalization when using a very rich set of features, which is common in applications such as natural language processing, and leads to new algorithms, VCRF and VStructBoost. Our experimental results for VCRF serve as a proof of concept and motivate more extensive empirical studies of these algorithms.

Acknowledgments

This work was partly funded by NSF CCF-1535987 and IIS-1618662 and NSF GRFP DGE-1342536.

References

  • Bakir et al. [2007] G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.
  • Chang et al. [2015] K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In ICML, 2015.
  • Collins [2001] M. Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Proceedings of IWPT, 2001.
  • Cortes et al. [2007] C. Cortes, M. Mohri, and J. Weston. A General Regression Framework for Learning String-to-String Mappings. In Predicting Structured Data. MIT Press, 2007.
  • Cortes et al. [2014] C. Cortes, V. Kuznetsov, and M. Mohri. Ensemble methods for structured prediction. In ICML, 2014.
  • Cortes et al. [2015] C. Cortes, P. Goyal, V. Kuznetsov, and M. Mohri. Kernel extraction via voted risk minimization. JMLR, 2015.
  • Daumé III et al. [2009] H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
  • Doppa et al. [2014] J. R. Doppa, A. Fern, and P. Tadepalli. Structured prediction via output space search. JMLR, 15(1):1317–1350, 2014.
  • Jurafsky and Martin [2009] D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009.
  • Kuznetsov et al. [2014] V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In NIPS, 2014.
  • Lafferty et al. [2001] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
  • Lam et al. [2015] M. Lam, J. R. Doppa, S. Todorovic, and T. G. Dietterich. c-search for structured prediction in computer vision. In CVPR, 2015.
  • Lei et al. [2015] Y. Lei, Ü. D. Dogan, A. Binder, and M. Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In NIPS, 2015.
  • Lucchi et al. [2013] A. Lucchi, L. Yunpeng, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In CVPR, 2013.
  • Maurer [2016] A. Maurer. A vector-contraction inequality for rademacher complexities. In ALT, 2016.
  • McAllester [2007] D. McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.
  • Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • Nadeau and Sekine [2007] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3–26, January 2007.
  • Ross et al. [2011] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
  • Taskar et al. [2003] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
  • Tsochantaridis et al. [2005] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, Dec. 2005.
  • Vinyals et al. [2015a] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015a.
  • Vinyals et al. [2015b] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015b.
  • Zhang et al. [2008] D. Zhang, L. Sun, and W. Li. A structured prediction approach for statistical machine translation. In IJCNLP, 2008.

Appendix A Proofs

This appendix section gathers detailed proofs of all of our main results. In Appendix A.1, we prove a contraction lemma used as a tool in the proof of our general factor graph Rademacher complexity bounds (Appendix A.3). In Appendix A.8, we further extend our bounds to the Voted Risk Minimization setting. Appendix A.5 gives explicit upper bounds on the factor graph Rademacher complexity of several commonly used hypothesis sets. In Appendix A.9, we prove a general upper bound on a loss function used in structured prediction in terms of a convex surrogate.

A.1 Contraction lemma

The following contraction lemma will be a key tool used in the proofs of our generalization bounds for structured prediction.

Lemma 5.

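Let $\mathcal{H}$ be a hypothesis set of functions mapping $\mathcal{X}$ to $\mathbb{R}^c$. Assume that for all $i = 1, \ldots, m$, $\Psi_i\colon \mathbb{R}^c \to \mathbb{R}$ is $\mu_i$-Lipschitz for $\mathbb{R}^c$ equipped with the 2-norm. That is:

$$|\Psi_i(x') - \Psi_i(x)| \le \mu_i \|x' - x\|_2,$$

for all $(x, x') \in (\mathbb{R}^c)^2$. Then, for any sample $S$ of $m$ points $x_1, \ldots, x_m \in \mathcal{X}$, the following inequality holds

$$\frac{1}{m} \, \mathbb{E}_{\sigma}\bigg[\sup_{h \in \mathcal{H}} \sum_{i=1}^{m} \sigma_i \Psi_i(h(x_i))\bigg] \le \frac{\sqrt{2}}{m} \, \mathbb{E}_{\epsilon}\bigg[\sup_{h \in \mathcal{H}} \sum_{i=1}^{m} \sum_{j=1}^{c} \epsilon_{ij} \, \mu_i \, h_j(x_i)\bigg], \tag{11}$$

where $\epsilon = (\epsilon_{ij})_{i,j}$ and the $\epsilon_{ij}$s are independent Rademacher variables uniformly distributed over $\{\pm 1\}$.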

Proof.

Fix a sample . Then, we can rewrite the left-hand side of (11) as follows:

where . Assume that the suprema can be attained and let be the hypotheses satisfying

When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are -close to the suprema for any . By definition of expectation, since is uniformly distributed over , we can write

Next, using the -Lipschitzness of and the Khintchine-Kahane inequality, we can write

Now, let denote and let denote the sign of . Then, the following holds:

Proceeding in the same way for all other s () completes the proof. ∎
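For intuition, the contraction inequality can be verified exactly on a tiny finite hypothesis set by enumerating all sign patterns. The sketch below assumes that (11) has the vector-contraction form $\frac{1}{m} \mathbb{E}_\sigma[\sup_h \sum_i \sigma_i \Psi_i(h(x_i))] \le \frac{\sqrt{2}}{m} \mathbb{E}_\epsilon[\sup_h \sum_i \sum_j \epsilon_{ij} \mu_i h_j(x_i)]$. The encoding of hypotheses by their output vectors and the use of the Euclidean norm as a 1-Lipschitz function in the usage below are hypothetical choices for the example.

```python
import math
from itertools import product


def contraction_sides(hypotheses, psis, mus):
    """Return (lhs, rhs) of the assumed contraction inequality.

    hypotheses[h][i] is the vector h(x_i) in R^c, psis[i] is a function
    from R^c to R that is mus[i]-Lipschitz for the 2-norm.
    """
    m = len(psis)
    c = len(hypotheses[0][0])
    # Left-hand side: one Rademacher sign per sample point.
    lhs = 0.0
    for sigma in product((-1, 1), repeat=m):
        lhs += max(
            sum(s * psi(hx) for s, psi, hx in zip(sigma, psis, hyp))
            for hyp in hypotheses)
    lhs /= (2 ** m) * m
    # Right-hand side: one Rademacher sign per (sample point, coordinate).
    rhs = 0.0
    for eps in product((-1, 1), repeat=m * c):
        rhs += max(
            sum(eps[i * c + j] * mus[i] * hyp[i][j]
                for i in range(m) for j in range(c))
            for hyp in hypotheses)
    rhs *= math.sqrt(2) / ((2 ** (m * c)) * m)
    return lhs, rhs


def euclidean_norm(z):
    """A 1-Lipschitz function for the 2-norm, used in the usage example."""
    return math.sqrt(sum(t * t for t in z))
```

Calling `contraction_sides` with three hypotheses on two points in $\mathbb{R}^2$, `psis = [euclidean_norm, euclidean_norm]` and `mus = [1.0, 1.0]` returns a left-hand side no larger than the right-hand side, as the lemma predicts.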

A.2 Contraction lemma for $\|\cdot\|_{\infty,2}$-norm

In this section, we present an extension of the contraction Lemma 5, that can be used to remove the dependency on the alphabet size in all of our bounds.

Lemma 6.

Let be a hypothesis set of functions mapping to . Assume that for all , is -Lipschitz for equipped with the norm-(, 2) for some . That is

for all . Then, for any sample of points , there exists a distribution over such that the following inequality holds:

(12)

where and s are independent Rademacher variables uniformly distributed over and is a sequence of random variables distributed according to . Note that s themselves do not need to be independent.

Proof.

Fix a sample $S = (x_1, \ldots, x_m)$. Then, we can rewrite the left-hand side of (12) as follows:

where . Assume that the suprema can be attained and let be the hypotheses satisfying

When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are -close to the suprema for any . By definition of expectation, since is uniformly distributed over , we can write

Next, using the -Lipschitzness of and the Khintchine-Kahane inequality, we can write

Define the random variables .

Now, let denote and let denote the sign of