Structured Output Learning with Abstention: Application to Accurate Opinion Prediction

03/22/2018 · Alexandre Garcia, et al.

Motivated by Supervised Opinion Analysis, we propose a novel framework devoted to Structured Output Learning with Abstention (SOLA). The structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. For that purpose, we decompose the problem into the learning of a pair of predictors, one devoted to structured abstention and the other, to structured output prediction. To compare fully labeled training data with predictions potentially containing abstentions, we define a wide class of asymmetric abstention-aware losses. Learning is achieved by surrogate regression in an appropriate feature space while prediction with abstention is performed by solving a new pre-image problem. Thus, SOLA extends recent ideas about Structured Output Prediction via surrogate problems and calibration theory and enjoys statistical guarantees on the resulting excess risk. Instantiated on a hierarchical abstention-aware loss, SOLA is shown to be relevant for fine-grained opinion mining and gives state-of-the-art results on this task. Moreover, the abstention-aware representations can be used to competitively predict user-review ratings based on a sentence-level opinion predictor.




1 Introduction

Up until recent years, opinion analysis in reviews has been commonly handled as a supervised polarity (positive vs. negative) classification problem. However, understanding the grounds on which an opinion is formed is of highest interest for decision makers. Aligned with this goal, the emerging field of aspect-based sentiment analysis

(pontiki2016semeval) has evolved towards a more involved machine learning task where opinions are considered to be structured objects—typically hierarchical structures linking polarities to aspects and relying on different units of analysis (i.e. sentence-level and review-level) as in (marcheggiani2014hierarchical). While this problem has attracted growing attention from the structured output prediction community, it has also raised an unprecedented challenge: the human interpretation of opinions expressed in reviews is subjective, and the opinion aspects and their related polarities are sometimes ambiguous and difficult to annotate (clavel2016sentiment; marcheggiani2014hierarchical). In this context, the prediction error should be flexible and should integrate this subjectivity so that, for example, mistakes on one aspect do not interfere with the prediction of polarity.

In order to address this issue, we propose a novel framework called Structured Output Learning with Abstention (SOLA) which allows for abstaining from predicting parts of the structure, so as to avoid providing erroneous insights about the object to be predicted, thereby increasing reliability. The new approach extends the principles of learning with abstention recently introduced for binary classification (CortesAbstention) and generalizes surrogate least-squares approaches to Structured Output Prediction recently studied in (brouard2016input; Ciliberto; osinski2017). The main novelty comes from the introduction of an asymmetric loss, based on embeddings of desired outputs and outputs predicted with abstention in the same space. Interestingly, similarly to the case of Output Kernel Regression (brouard2016input) and appropriate inner-product-based losses (Ciliberto), the approach relies on a simple surrogate formulation, namely a least-squares formulation followed by the resolution of a new pre-image problem. The paper is organized as follows. Section 2 introduces the problem to solve and the novel framework, SOLA. Section 3 provides statistical guarantees about the excess risk in the framework of Least Squares Surrogate Loss, while Section 4 is devoted to the pre-image developed for hierarchical output structures. Section 5 presents the numerical experiments and Section 6 draws a conclusion.

2 Structured Output Labeling with Abstention

Let be the input sample space. We assume a target graph structure of interest, where is the set of vertices and is the edge relationship between vertices. A legal labeling or assignment of the graph is a binary vector that satisfies some properties induced by the graph structure, i.e. by its edges. We call the subset of all possible legal labelings of the graph. Given these, the goal of Structured Output Labeling is to learn a function that predicts a legal labeling given some input. Let us emphasize that the inputs do not necessarily share the same structure as the output objects. For instance, in Supervised Opinion Analysis, the inputs are reviews in natural language described by a sequence of feature vectors, each of them representing a sentence. Extending Supervised Classification with Abstention (CortesAbstention), Structured Output Learning with Abstention aims at learning a pair of functions composed of a predictor that predicts the label of each component of the structure and an abstention function that determines on which components of the structure to abstain from predicting a label. If we denote the set of legal labelings with abstention, where a dedicated symbol denotes the abstention label, then the abstention-aware predictive model is defined as follows:


Now, assume we have a random variable taking its values in the input–output space and distributed according to a probability distribution. Learning the predictive model raises the issue of designing an appropriate abstention-aware loss function in order to define learning as a risk-minimization task. Given the relationship in Eq. (2), a risk on the predictive model can be converted into a risk on the pair of predictor and abstention functions using an abstention-aware loss:


In this paper, we propose a family of abstention-aware losses that both generalizes the abstention-aware loss in the binary classification case (see (CortesAbstention)) and extends the scope of hierarchical losses previously proposed by (cesa2006hierarchical) for Hierarchical Output Labeling tasks. An abstention-aware loss is required to deal asymmetrically with observed labels, which are supposed to be complete, and predicted labels, which may be incomplete due to partial abstention. We thus propose the following general form for the loss function:


relying on a bounded linear operator (a rectangular matrix) and two bounded feature maps: one devoted to outputs with abstention and the other devoted to outputs without abstention. These three ingredients must ensure that the loss is non-negative, as is the case in the following examples.

In binary classification with abstention, the abstention-aware loss is defined by:

where the rejection cost is paid in case of abstention, and the 0–1 cost otherwise. This loss can be written with corresponding feature maps and operator defined as:
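The binary abstention-aware loss described above can be sketched directly in label space. This is an illustrative simplification (function and argument names are ours, and the paper's actual definition goes through feature maps and an operator, which are omitted here):

```python
def abstention_loss(y_true, y_pred, c=0.3):
    """Binary 0-1 loss with an abstain option (illustrative sketch).

    y_true is in {-1, +1}; y_pred is in {-1, +1} or the string
    "abstain".  Abstaining always costs the rejection cost c; for
    abstention ever to be useful, c should be below the cost of a
    mistake (here 1).
    """
    if y_pred == "abstain":
        return c
    return 0.0 if y_pred == y_true else 1.0
```

With c < 0.5, a predictor whose conditional confidence is too low pays less by abstaining than by guessing, which is the classical trade-off of learning with rejection.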

H-loss (hierarchical loss): we now assume that the target structure is a hierarchical binary tree, so the edge set consists of directed edges reflecting a parent relationship among nodes (each node except the root has one parent). Regarding the labeling, we impose the following property: for every oriented edge, the label of a child node cannot be greater than that of its parent. The H-loss (cesa2006hierarchical), which measures the length of the common path from the root to the leaves between two assignments, is defined as follows:

where each node's parent index is given by the set of edges, and the weights are positive constants, non-increasing on paths from the root to the leaves.

Such a loss can be rewritten in a form involving the adjacency matrix of the underlying binary tree structure and the vector of weights defined above. The Hamming loss can also be recovered by choosing:

where is the identity matrix.
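A common variant of the H-loss can be sketched as follows: a node is charged its weight only when it is the first mistake on its path from the root, i.e. all its ancestors are predicted correctly. This is a plain-Python illustration under that reading of the definition (index conventions are ours):

```python
def h_loss(y_true, y_pred, parent, c):
    """H-loss of (cesa2006hierarchical), illustrative sketch.

    Nodes are 0..d-1 with node 0 the root; parent[i] is the parent
    index of node i (parent[0] is ignored).  Node i is charged its
    weight c[i] only if y_pred disagrees with y_true at i while all
    strict ancestors of i are predicted correctly.
    """
    d = len(y_true)
    loss = 0.0
    for i in range(d):
        # Walk up the tree: are all strict ancestors correct?
        j, ancestors_ok = i, True
        while j != 0:
            j = parent[j]
            if y_true[j] != y_pred[j]:
                ancestors_ok = False
                break
        if ancestors_ok and y_true[i] != y_pred[i]:
            loss += c[i]
    return loss
```

A mistake at the root thus masks all errors below it, which is what makes the weights "non-increasing on paths from the root to the leaves" natural.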

Abstention-aware H-loss (Ha-loss): by mixing the H-loss and the abstention-aware binary classification loss, we get the novel Ha-loss, defined as follows:


where the abstention costs can be chosen as constants or as functions of the predictions. We designed this loss so that it is adapted to hierarchies in which some nodes are known to be hard to predict while their children are easy to predict. In this case, abstention can be used at a particular node to pay the cost of predicting its child. If this prediction is still a mistake, an additional price is paid, acting as a regret cost that penalizes the unnecessary abstention chosen at the parent. Acting on these costs provides a way to control the number of abstentions, not only through the risk taken by predicting a given node but also through its children. For the sake of space, the dot-product representation of this loss is detailed in the supplementary material.

2.1 Empirical risk minimization for SOLA

The goal of SOLA is to learn a pair of predictor and abstention functions, from an i.i.d. (training) sample drawn from a probability distribution, that minimizes the true risk:

We notice that this risk can be rewritten as an expected value over the input variables only:

This pleads for considering the following surrogate problem:

  • Step 1: we define the surrogate target as the conditional expectation of the output embedding; it is then the minimizer of a least-squares surrogate risk.

  • Step 2: we solve the following pre-image or decoding problem:

Directly solving the problem above raises some difficulties:

  • In practice, as usual, we do not know the expected value of the output embedding conditioned on the input: it needs to be estimated from the training sample. This regression problem is referred to as the learning step and is solved in the next subsection.

  • The complexity of the problem depends on properties of the output embedding. We refer to this problem as the pre-image and show how to solve it practically at a later stage.

These pitfalls, common to all structured output learning problems, can be overcome by substituting a surrogate loss for the target loss and proceeding in two steps:

  1. Solve the surrogate penalized empirical problem (learning phase):


    where a penalty function and a positive regularization parameter are chosen. This yields a minimizer which is an estimate of the surrogate target.

  2. Solve the pre-image or decoding problem:


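The two steps above can be sketched generically. The decoding step below performs an exhaustive search over a candidate set of legal labelings with abstention, scoring each candidate through its embedding against the estimated conditional mean; `psi_a` and `A` stand in for the abstention-side feature map and cost operator, and are assumptions of this sketch (the paper replaces exhaustive search with the linear-program / branch-and-bound formulation of Section 4):

```python
import numpy as np

def decode(g_x, candidates, psi_a, A):
    """Pre-image (decoding) step, illustrative sketch.

    g_x        : estimated conditional mean of the output embedding
                 given the input x (output of the learning step)
    candidates : iterable of legal labelings-with-abstention
    psi_a, A   : abstention-side feature map and cost operator
                 (hypothetical stand-ins for the paper's objects)
    Returns the candidate minimizing the embedded score
    <psi_a(s), A g(x)>.  Exhaustive search is only viable for
    small structures.
    """
    return min(candidates, key=lambda s: psi_a(s) @ (A @ g_x))
```

For instance, with a two-node structure, the identity as both `psi_a` and `A`, and `g_x = [0.9, 0.1]`, the minimizer among all binary pairs is `(0, 0)`.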
2.2 Estimation of the conditional density from training data

We choose to solve this problem in a vector-valued Reproducing Kernel Hilbert Space associated to an operator-valued kernel. For the sake of simplicity, the kernel is chosen as a decomposable operator-valued kernel with identity, i.e. the product of a positive definite scalar kernel on the inputs and the identity matrix. The penalty is chosen as the squared RKHS norm. This choice leads to the ridge regression problem:


that admits a unique and well-known closed-form solution (micchelli2005learning; brouard2016input).

As the estimate is only needed at the prediction stage, within the pre-image problem to solve, it is important to emphasize its dependency on the feature vectors:


where the coefficient vector is:


where the coefficients are computed from the block Gram matrix of the training inputs, regularized by an identity matrix of the same size.
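Since the operator-valued kernel is decomposable with identity, the problem reduces to independent kernel ridge regressions sharing one Gram matrix, with the standard closed form. The sketch below uses the `lambda * n` scaling convention, which is an assumption (conventions differ across references); variable names are ours:

```python
import numpy as np

def krr_fit(K, Psi, lam):
    """Closed-form kernel ridge regression coefficients (sketch).

    K   : (n, n) Gram matrix k(x_i, x_j) of the training inputs
    Psi : (n, d) matrix whose rows are the output embeddings psi(y_i)
    lam : ridge parameter lambda > 0 (lambda * n scaling assumed)
    Returns alpha such that g(x) = alpha.T @ k_x, where k_x is the
    vector (k(x, x_1), ..., k(x, x_n)).
    """
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), Psi)

def krr_predict(alpha, k_x):
    """Evaluate the fitted vector-valued regressor at a new input."""
    return alpha.T @ k_x
```

As lambda tends to 0 with an invertible Gram matrix, the estimator interpolates the training embeddings, which is the usual sanity check for the closed form.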

3 Learning guarantee for structured losses with abstention

In this section, we give some statistical guarantees when learning predictors in the framework previously described. To this end, we build on recent results in the framework of Least Squares Loss Surrogate (Ciliberto) that are extended to abstention-aware prediction.

Theorem 1.

Given the definition of the loss in (3), let us denote the pair of predictor and reject functions associated to the estimate obtained by solving the learning problem stated in Eq. (7):

Its true risk with respect to the abstention-aware loss can be written as:

The optimal predictor is defined as:

The excess risk of an abstention-aware predictor is linked to the estimation error of the conditional density by the following inequality:


where , and .

The full proof is given in the Supplements. It is close to the one in (Ciliberto), extended by taking the supremum of the norm of the cost operator over the output space. Moreover, when problem (7) is solved by Kernel Ridge Regression, (Ciliberto) have shown universal consistency and obtained a generalization bound that still holds in our case, since it relies on the result of Theorem 1 only. As a consequence, the excess risk of predictors built in the SOLA framework is controlled by the risk suffered at the learning step, for which we can use off-the-shelf vector-valued regressors with their own convergence guarantees.

In the following, we specifically study the pre-image problem in the SOLA framework for a class of output structures that we detail hereafter.

4 Pre-image for hierarchical structures with Abstention

In what follows we focus on a class of structured outputs that can be viewed as hierarchical objects for which we show how to solve the pre-image problems involved for a large class of losses.

4.1 Hierarchical output structures

Definition 1.

A HEX graph is a graph consisting of a set of nodes, directed edges, and undirected edges, such that the directed subgraph is a directed acyclic graph (DAG) and the undirected subgraph has no self-loops.

Definition 2.

An assignment (state) of labels in a HEX graph is legal if, for every directed edge, a child node is labeled 1 only when its parent is, and, for every undirected exclusion edge, the two endpoints are not both labeled 1.

Definition 3.

The state space of graph is the set of all legal assignments of .

Thus a HEX graph can be described by a pair of (1) a directed graph over a set of binary nodes indicating that any child can be labeled 1 only if its parent is also labeled 1 and (2) an undirected graph of exclusions such that two nodes linked by an edge cannot be simultaneously labeled 1. Note that HEX graphs can represent any type of binary labeled graph, since either edge set can be empty. In previous works, they have been used to model coarse-to-fine ontologies through the hierarchy while incorporating prior-known label exclusions encoded by the exclusion edges (deng2014large; BenTaieb2016).
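Definitions 1–3 translate directly into a legality check over a binary assignment. This is a straightforward transcription (edge-list and index conventions are ours):

```python
def is_legal(y, h_edges, e_edges):
    """Check that a binary assignment y is legal for a HEX graph.

    y       : sequence of 0/1 node labels
    h_edges : directed hierarchy edges (parent, child); a child may
              be labeled 1 only if its parent is labeled 1
    e_edges : undirected exclusion edges (i, j); nodes i and j
              cannot both be labeled 1
    """
    for p, ch in h_edges:
        if y[ch] == 1 and y[p] == 0:
            return False
    for i, j in e_edges:
        if y[i] == 1 and y[j] == 1:
            return False
    return True
```

The state space of Definition 3 is then simply the set of binary vectors passing this check, which is what the pre-image problem of Section 4 searches over.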

While the output data we consider consist of HEX graph assignments, our predictions with abstention belong to another space, which we do not restrict to legal assignments but rather allow other choices, detailed in the next section.

4.2 Efficient solution for the preimage problem

The complexity of the pre-image problem is due to two aspects: i) the space in which we search for the solution can be hard to explore; and ii) the output mapping can lead to high-dimensional representations for which the minimization problem is harder.

The pre-image problem involves a minimization over a constrained set of binary variables. For a large class of abstention-aware predictors, we propose a branch-and-bound formulation for which a nearly optimal initialization point can be obtained in polynomial time. Following the form of our abstention-aware predictor defined in Section 2, we consider losses involving binary interactions between the prediction function and the reject function, and suppose that there exists a rectangular matrix expressing the loss through the Kronecker product of the two output vectors. This class covers the examples presented in Section 2 as special cases. We state the following linearization theorem under the binary-interaction hypothesis:

Theorem 2.

Let the abstention-aware loss be defined by its output mappings and the corresponding cost matrix.

If the mapping is a linear function of the binary interactions of the prediction and abstention vectors, i.e. there exists a matrix expressing it through their Kronecker product, then there exist a bounded linear operator and a vector such that the pre-image problem:

has the same solutions as the linear program:

where the auxiliary variable is a vector constrained to be equal to the Kronecker product of the prediction and abstention vectors.

The proof is detailed in the supplementary material.

The problem above still involves a minimization over the structured binary set . Such a set of solutions encodes some predefined constraints:

  • Since the objects we intend to predict are HEX graph assignments, the vectors of the output space should satisfy the hierarchical constraint: a node can be active only if its parent is, according to the hierarchy. When predicting with abstention we relax this condition, since we suppose that a descendant node can be active if its parent was active or if we abstained from predicting it. Such a condition is equivalent to the constraint

  • A second condition we use in practice is the restriction of abstention on two consecutive nodes: structured abstention at a layer must be used in order to reveal a subsequent prediction which is known to be easy. Such a condition can be encoded through the inequality:


In our experiments, the structured search space has been chosen as the set of binary vectors that respect the two conditions above. These choices are motivated by our application, but note that any subset of assignments can be built in a similar way by adding inequality constraints. Consequently, the constraints can be added to the previous minimization problem to build the canonical form:

where and .

The complexity of the problem above is linked to properties of the constraint operator. (goh) have shown that, in the case of the minimization of the H-loss with hierarchical constraints, the linear operator satisfies the property of total unimodularity (schrijver1998theory), a sufficient condition for the problem above to have the same solutions as its continuous relaxation, leading to a polynomial-time algorithm. In the more general case of the Ha-loss, solving such an integer program is NP-hard and the optimal solution can be obtained using a branch-and-bound algorithm. When implementing this type of approach, the choice of the initialization point can strongly influence the convergence time. Since in practical applications we expect the number of abstentions to remain low, such a point can be chosen as the solution of the original prediction problem without abstention (goh). Moreover, since the abstention mechanism should modify only a small subset of the predictions, we expect this solution to be close to the abstention-aware one.
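The continuous relaxation described above can be sketched with an off-the-shelf LP solver. The cost vector and constraint matrix are problem-specific inputs not spelled out in the paper, so this is a generic sketch: when the constraint matrix is totally unimodular (the H-loss case, per (goh)), every vertex of the relaxed polytope is integral and rounding is exact; in the general Ha-loss case this only provides the initialization/bounding step of a branch-and-bound search.

```python
import numpy as np
from scipy.optimize import linprog

def preimage_lp(q, A_ub, b_ub):
    """Continuous relaxation of the pre-image integer program
        min_z  q @ z   s.t.  A_ub @ z <= b_ub,  z in {0,1}^m.

    q, A_ub, b_ub : cost vector and inequality constraints (these
                    are assumptions standing in for the paper's
                    loss-specific operator and bounds)
    Relaxes z to [0,1]^m and rounds; exact when A_ub is totally
    unimodular, otherwise only a bound/initialization.
    """
    m = len(q)
    res = linprog(q, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * m, method="highs")
    return np.round(res.x).astype(int)
```

For a two-node hierarchy, the constraint "child active only if parent active" is the single row z_child − z_parent ≤ 0, and the relaxation returns integral optima as expected.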

5 Numerical Experiments

We study three subtasks of opinion mining, namely sentence-based aspect prediction, sentence-based joint prediction of aspects and polarities (possibly with abstention) and full review-based star rating. We show that these tasks can be linked using a hierarchical graph similar to the probabilistic model of (marcheggiani2014hierarchical) and exploit the abstention mechanism to build a robust pipeline: based on the opinion labels available at the sentence-level, we build a two-stage predictor that first predicts the aspects and polarities at the sentence level, before deducing the corresponding review-level values.

5.1 Parameterization of the Ha-loss

In all our experiments, we rely on the expression of the Ha-loss presented in Eq. (4). The linear-programming formulation of the pre-image problem used in the branch-and-bound solver is derived in the supplementary material and involves a decomposition similar to the one described in Section 2 for the H-loss. Implementing the Ha-loss requires choosing the prediction and abstention weights. We first fix the prediction weights in the following way:

Here, the first index is assumed to correspond to the root node. This weighting scheme has been commonly used in previous studies (rousu2006kernel; bi2012hierarchical) and is related to the minimization of the Hamming loss on a vectorized representation of the graph assignment. As far as the abstention weights are concerned, an exhaustive analysis of all possible choices is impossible due to the number of parameters involved. Therefore, our experiments focus on weighting schemes built in the following way:

The effect of these choices is illustrated below on the opinion prediction task. We also ran a set of experiments on a hierarchical classification task of MRI images from the IMAGECLEF2007 dataset, reusing the setting of (Dimitrovski08:proc), in which we compare different weighting schemes. The settings and the results are given in the supplementary material.

5.2 Learning with Abstention for aspect-based opinion mining

We test our model on the problem of aspect-based opinion mining on a subset of the TripAdvisor dataset released in (marcheggiani2014hierarchical). It consists of 369 hotel reviews, for a total of 4856 sentences, with predefined train and test sets. In addition to the review-level star ratings, the authors gathered opinion annotations at the sentence level for a set of 11 predefined aspects and their corresponding polarities. Like them, we discard the "NOT RELATED" aspect and consider the remaining 10 aspects, each with 3 possible polarities (positive, negative or neutral). We propose a graphical representation of the opinion structure at the sentence level (see Fig. 1). Objects in the output space are trees of depth 3, where the first node is the root, the second layer is made of aspect labels and the third one contains the polarities corresponding to each aspect. The corresponding assignments are encoded by a binary matrix obtained by concatenating the vectors indicating the presence of each aspect (depth 2) with those indicating the polarities.

An example of encoding is displayed in Fig. 1. Based on the recent results of (DBLP:journals/corr/ConneauKSBB17), we focus on the InferSent representation to encode our inputs. This dense sentence embedding corresponds to the inner representation of a deep neural network trained on a natural language inference task and has been shown to give competitive results on other natural language processing tasks.

We test our model on 3 different subtasks. In Exp1, we first apply our model (H Regression InferSent) to the task of opinion aspect prediction and compare it against two baselines and the original results of (marcheggiani2014hierarchical). In Exp2, we test our method and baselines on the problem of joint aspect and polarity prediction, in order to assess the ability of the hierarchical predictor to take advantage of the output structure. On this task, we additionally illustrate the behavior of abstention when varying the cost constants. In Exp3, we illustrate the use of abstention as a means to build a robust pipeline on the task of star-rating regression based on a sentence-level opinion predictor.

Figure 1: Graphical representation of the opinion structure

Exp1. Aspect prediction.  In this first task, we aim at predicting the different aspects discussed in each sentence. This problem can be cast as a multilabel classification problem targeting the aspect layer of the output objects, for which we devise two baselines. The first relies on a logistic regression model (Logistic Regression InferSent) trained separately for each aspect. The second baseline (Linear chain CRF InferSent), based on linear-chain Conditional Random Fields (sutton2012introduction), is inspired by the work of (marcheggiani2014hierarchical), who built a hierarchical CRF model on a handcrafted sparse feature set including one-hot word encodings, POS tags and sentiment vocabulary. Since the optimization of their model via Gibbs sampling relies on the sparsity of the feature set, we could not directly use it with our dense representation. Linear chain CRF InferSent takes advantage of our input features while remaining computationally tractable. One linear chain is trained for each node of the output structure, and the chain encodes the dependency between successive sentences.
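The per-aspect logistic regression baseline can be sketched with scikit-learn. This is a minimal sketch, not the paper's exact setup: hyper-parameters and the InferSent embedding step are our assumptions, and `X` here stands for precomputed sentence embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_aspect_baselines(X, Y):
    """One independent logistic regression per aspect, as in the
    'Logistic Regression InferSent' baseline (sketch).

    X : (n_sentences, d) sentence embeddings, e.g. InferSent
    Y : (n_sentences, n_aspects) binary aspect indicators
    Returns one fitted classifier per aspect column.
    """
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, a])
            for a in range(Y.shape[1])]

def predict_aspects(models, X):
    """Stack the per-aspect predictions into a multilabel matrix."""
    return np.column_stack([m.predict(X) for m in models])
```

Because each aspect is fitted independently, this baseline ignores the hierarchical structure that H Regression exploits, which is exactly what the comparison in Table 1 probes.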

Table 1 below shows the results in terms of micro-averaged F1 score obtained on the task of aspect prediction.

Method                               F1
H Regression InferSent               0.59
Logistic Regression InferSent        0.60
Linear chain CRF InferSent           0.59
Linear chain CRF (sparse features)   —
Hierarchical CRF (sparse features)   —

Table 1: Experimental results on the TripAdvisor dataset for the aspect prediction task.

The three methods using InferSent give significantly better results than (marcheggiani2014hierarchical); consequently, the next experiments do not consider the sparse-feature baselines. Even though H Regression was trained to predict the whole structure, it obtains results similar to the logistic regression and the linear chain CRF.

Exp2. Joint polarity and aspect prediction with abstention.  We take as output objects the assignments of the graph described in Fig. 1 and build an adapted abstention mechanism. Our intuition is that, in some cases, the polarity may be easier to predict than the aspect to which it is linked. This can typically happen when some vocabulary linked to the current aspect was unseen during training or is implicit, whereas the polarity vocabulary is correctly recognized. An example is the sentence "We had great views over the East River", where the aspect "Location" is implicit and where "views" could mislead the predictor into predicting the aspect "Other". In such cases, (marcheggiani2014hierarchical) underline that the inter-annotator agreement is low. For this reason, we want our classifier to allow multiple candidates for aspect prediction while providing the polarity corresponding to them. We illustrate this behavior by running two sets of experiments in which we do not allow the predictor to abstain on the polarity.

In the first experiment, we analyze the influence of the parameterization of the Ha-loss. Following the parameterization of the abstention weights previously proposed, we generated predictions for varying values of the cost constants. We display the Hamming loss between the true labels and the predictions as a function of the mean number of aspects on which the predictor abstained (Fig. 2) and consider two cases: in the left figure, all nodes except those on which we abstained were used to compute the Hamming loss; in the right one, all nodes except the aspects on which we abstained and their corresponding polarities were used.

Figure 2: Hamming loss as a function of the number of aspect labels on which the predictor abstained.

The results correspond to a predictor for which the original hierarchical constraint is enforced; the three other curves have been obtained under the generalized constraint hypothesis.

We additionally ran our model H Regression without abstention and our two baselines, for which we measured a similar Hamming loss of 0.03 (corresponding to 0 abstentions on the left of Figure 2). Concerning the micro-averaged F1 score, H Regression scored slightly above the logistic regression and the linear chain CRF.

Two conclusions can be drawn. Firstly, the value of the regret cost and the choice of the constraint hypothesis have little to no influence on the scores computed in the two cases previously described. Secondly, increasing the number of abstentions on aspects helps reduce the number of errors counted on the aspect nodes as long as the predictor abstains on fewer than 3 labels. Beyond this point, the quality of the overall prediction decreases, since the error rate on the remaining aspects selected for abstention is lower than the one on the polarity labels.

Subsequently, in Figure 3 we examine the Hamming loss on the polarity predictions located after an aspect node, in order to understand the influence of the coefficients and of relaxing the hierarchy hypothesis.

Figure 3: Hamming loss computed on polarity nodes located after an aspect on which the predictor abstained.

The orange curve gives the best score when the mean number of abstentions is between 2 and 4 per sentence. The only difference with the stricter hypothesis is the ability to predict the polarity of an aspect candidate for abstention even if the predictor function does not select it. This behavior is made possible by the fact that our predictions do not respect the original hierarchical constraints but instead belong to the more flexible relaxed space. Finally, we show how abstention can be used to build a robust pipeline for star-rating regression.

Exp3. Star rating regression at the review level based on sentence-level predictions.  In the last round of experiments, we show that abstention can be used to build a robust intermediate representation for the task of opinion rating regression (wang2011latent), which consists in predicting the overall average star rating given by each reviewer on a subset of six predefined aspects. Figure 4 below illustrates the different elements involved in our problem.

Figure 4: Star rating regression pipeline

The procedure is split into two steps. First, we learn a sentence-level opinion predictor that takes advantage of the available annotations; this step corresponds to the one studied in the previous experiment. Then a vector-valued regressor (star regressor in Figure 4) is built. It takes as input the component-wise average of the sentence-level opinion representations and aims to predict the star ratings at the review level. For each of the five overall aspects, a separate ridge regressor is trained on the true labels available. Once learned, the regressors take as input the predictions of the first step in a pipelined way.
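The second stage of the pipeline can be sketched as follows. This is a minimal sketch of the design described above, not the paper's implementation: the ridge penalty and the grouping convention are our assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_star_regressors(sentence_preds, review_ids, stars):
    """Second stage of the pipeline in Fig. 4 (sketch).

    sentence_preds : (n_sentences, d) sentence-level opinion
                     representations (first-stage outputs)
    review_ids     : (n_sentences,) review index of each sentence
    stars          : (n_reviews, n_aspects) review-level ratings
    Each review is represented by the component-wise average of its
    sentences' representations; one Ridge regressor is fitted per
    overall aspect.  Returns the regressors and the review features.
    """
    n_reviews = stars.shape[0]
    feats = np.vstack([
        sentence_preds[review_ids == r].mean(axis=0)
        for r in range(n_reviews)])
    models = [Ridge(alpha=1.0).fit(feats, stars[:, a])
              for a in range(stars.shape[1])]
    return models, feats
```

At test time the same averaging is applied to the first-stage predictions (possibly with the abstention-adjusted representations described below) before calling each regressor's `predict`.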

Similarly to (marcheggiani2014hierarchical), we rescale the star ratings to a (-1, 0, 1) scale and report the macro-averaged mean absolute error (MAE) on the test set in Table 2 below, under the column "MAE text level". We additionally include the MAE measured on polarity predictions at the sentence level, counted when the underlying predicted aspect is a true positive.

Method                                        MAE sentence level   MAE text level
Oracle: regression with true sentence labels  0                    0.38
Hierarchical CRF                              0.50                 0.50
H Regression                                  0.30                 0.45
H Regression with Abstention                  —                    0.43

Table 2: Experimental results on the TripAdvisor dataset for the polarity prediction task.

The first row is our oracle: the sentence-level opinion representations are assumed to be known on the test set and fed into the text-level opinion regressors to recover the star ratings. The Hierarchical CRF line corresponds to the best results reported by (marcheggiani2014hierarchical) on the two tasks. H Regression is our model without abstention, used as a predictor of the sentence-level representation in the pipeline shown in Fig. 4. Finally, for H Regression with abstention, we modify the sentence-level representation by subtracting the abstention indicators from the original prediction: since the only non-zero components of the abstention vector correspond to aspects on which we abstained, this reduces the confidence of the regressor for these aspects and biases the corresponding polarity predictions towards 0. H Regression strongly outperforms Hierarchical CRF on both tasks. We do not report the sentence-level score for H Regression with abstention, since it depends on the number of abstentions, but we show that abstention improves the results of the H Regression model on the text-level prediction task. The significance of the scores has been assessed with a Wilcoxon rank sum test.

6 Conclusion

The novel framework, Structured Output Learning with Abstention, extends two families of approaches: learning with abstention and least-squares surrogate structured prediction. It is important to notice that, beyond ridge regression, any vector-valued regression model that can be written as in Eq. (8) is eligible; this is typically the case of output-kernel tree-based methods (GeurtsWd06). SOLA has here been applied to opinion analysis, but it could prove suitable for more complex structure-labeling problems. Concerning opinion analysis, we have shown that abstention can be used to build a robust representation for star rating in a pipeline framework. One extension of our work would consist in learning how to abstain by jointly predicting the aspects and polarities at the sentence and text levels.


This work has been funded by the French Ministry of Research and by the chair Machine Learning for Big Data of Télécom ParisTech.