Adversarial Constraint Learning for Structured Prediction

by Hongyu Ren, et al.
Stanford University

Constraint-based learning reduces the burden of collecting labels by having users specify general properties of structured outputs, such as constraints imposed by physical laws. We propose a novel framework for simultaneously learning these constraints and using them for supervision, bypassing the difficulty of using domain expertise to manually specify constraints. Learning requires a black-box simulator of structured outputs, which generates valid labels but need not model their corresponding inputs or the input-label relationship. At training time, we use adversarial training to constrain the model to produce outputs that cannot be distinguished from simulated labels. Providing our framework with a small number of labeled inputs gives rise to a new semi-supervised structured prediction model; we evaluate this model on multiple tasks (tracking, pose estimation, and time series prediction) and find that it achieves high accuracy with only a small number of labeled inputs. In some cases, no labels are required at all.



1 Introduction

Large labeled datasets are a key component for building state-of-the-art systems in many applications of machine learning, including image recognition, machine translation, and speech recognition. Collecting such datasets can be expensive, which has driven significant research interest in unsupervised, semi-supervised, and weakly supervised learning approaches [Radford et al.2015, Kingma et al.2014, Papandreou et al.2015, Ratner et al.2016].

Constraint-based learning is a recently proposed form of weak supervision which aims to reduce the need for labeled inputs by having users supervise algorithms through general properties that hold over the label space [Shcherbatyi and Andres2016, Stewart and Ermon2017]. Examples of such properties include logical rules [Richardson and Domingos2006, Chang et al.2007, Choi et al.2015, Xu et al.2017] or physical laws [Stewart and Ermon2017, Ermon et al.2015].

Unlike labels — which only apply to their corresponding inputs — properties used in a constraint-based learning approach are specified once for the entire dataset, providing an opportunity for more cost-efficient supervision. Algorithms supervised with explicit constraints have shown promising results in object detection [Stewart and Ermon2017], preference learning [Choi et al.2015], materials science [Ermon et al.2012], and semantic segmentation [Pathak et al.2015].

However, describing the high-level invariants of a dataset may also require a non-trivial amount of effort. First, designing constraints requires strong domain expertise. Second, in the case of high-dimensional labels, it is difficult to encode the constraints using simple formulas. For example, suppose we want to constrain a pedestrian joint detector to produce skeletons that “look like a walking person”; in this case, it is difficult to capture invariants over human poses with simple logical or algebraic formulas that an annotator could specify. Third, constraints may change over time and across tasks; designing new constraints for each new task may not scale in many practical applications.

In this paper, we propose an implicit approach to constraint learning, in which invariants are automatically learned from a small set of representative label samples (see Figure 1). These samples do not need to be tied to corresponding inputs (as in supervised learning) and may come from a black-box simulator that abstracts away physics-based formulas or produces examples of labels collected by humans. Such simulators include physics engines, humanoid simulators from robotics, or driving simulators [Li et al.2017].

Inspired by recent advances in implicit (likelihood-free) generative modeling, we capture the distribution of outputs using an approach based on adversarial learning [Goodfellow et al.2014]. Specifically, we train two distinct learners: a primary model for the task at hand and an auxiliary classification algorithm called the discriminator. During training, we constrain the primary model such that its outputs cannot be distinguished by the discriminator from representative (true) label samples, thus forcing it to capture the structure of the label space. This approach forms a novel adversarial framework for performing weak supervision with learned constraints, which we call adversarial constraint learning.

Although constraint learning does not require input-label pairs, providing such pairs can improve performance and turns our problem into an instance of semi-supervised learning. In this setting, our approach combines supervised learning on a small labeled dataset with constraint learning on a large unlabeled set, where constraint learning enforces that the structure of predictions on unlabeled data matches the structure observed in the labeled data. Experimental results demonstrate that this method performs better than state-of-the-art semi-supervised learning methods on a variety of structured prediction problems.

Figure 1: Constraint learning allows us to learn a conditional probabilistic model $p_\theta(y \mid x)$ (parameterized by $\theta$) without direct labels by specifying properties $h$ that hold over the output space. In prior work (left), $h$ is defined as a formula describing known invariants. In this paper (right), we propose to instead learn $h$ through an auxiliary classifier $D_\phi$ (parameterized by $\phi$) that discriminates model outputs $\hat{y}$ (provided by $p_\theta$) from label samples $y$ (provided by an additional source unrelated to $x$, such as a simulator).

2 Background

In this section, we introduce structured prediction and constraint-based learning. The next section will expand upon these subjects to introduce the proposed adversarial constraint learning framework.

2.1 Structured Prediction

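Our work focuses on structured prediction, a form of supervised learning in which the outputs $y \in \mathcal{Y}$ can be complex objects such as vectors, trees, or graphs [Koller and Friedman2009]. We capture the distribution of $y$ using a conditional probabilistic model $p_\theta(y \mid x)$ parameterized by $\theta \in \Theta$. A model maps each input $x \in \mathcal{X}$ to the corresponding output distribution $p_\theta(y \mid x) \in \mathcal{P}(\mathcal{Y})$, where $\mathcal{P}(\mathcal{Y})$ denotes the set of all probability distributions over $\mathcal{Y}$. For example, we may take $p_\theta(y \mid x)$ to be a Gaussian distribution $\mathcal{N}(\mu_\theta(x), \Sigma_\theta(x))$ with mean $\mu_\theta(x)$ and variance $\Sigma_\theta(x)$.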

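A standard approach to learning $p_\theta(y \mid x)$ (or $p_\theta$ as an abbreviation) is to solve an optimization problem of the form

$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \ell\big(p_\theta(y \mid x_i), y_i\big) + R(p_\theta) \qquad (1)$$

over a labeled dataset $\mathbf{D} = \{(x_i, y_i)\}_{i=1}^{n}$. A typical supervised learning objective comprises a loss function $\ell : \mathcal{P}(\mathcal{Y}) \times \mathcal{Y} \to \mathbb{R}$ and a regularization term $R$ that encourages non-degenerate solutions or solutions that incorporate prior knowledge [Stewart and Ermon2017].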

2.2 Constraint-Based Learning

Collecting a large labeled dataset for supervised learning can often be tedious. Constraint-based learning is a form of weak supervision which instead asks users to specify high-level constraints over the output space, such as logical rules or physical laws [Shcherbatyi and Andres2016, Stewart and Ermon2017, Richardson and Domingos2006, Xu et al.2017]. For example, in an object tracking task where the output space $\mathcal{Y}$ corresponds to the space of joint positions over time, we expect correct outputs to be consistent with the laws of physical mechanics.

Let $\mathbf{X} = \{x_1, \dots, x_m\}$ be an unlabeled dataset of inputs. Formally, constraints can be specified via a function $h : \mathcal{P}(\mathcal{Y}) \to \mathbb{R}$, which penalizes conditional probabilistic models $p_\theta$ that are inconsistent with the known high-level structure of the label space. Learning from constraints proceeds by minimizing the following objective:

$$\sum_{i=1}^{m} h\big(p_\theta(y \mid x_i)\big) + R(p_\theta) \qquad (2)$$

over $\theta \in \Theta$. By solving this optimization problem, we look for a probabilistic model parameterized by $\theta$ that satisfies known constraints when applied to the unlabeled dataset $\mathbf{X}$ (through the $h$ term) and is likely a priori (through the $R$ term). Note that although the constraint $h$ is data-dependent, it does not require explicit labels. For example, in object tracking we could ask that, when making predictions on $\mathbf{X}$, joint positions over time are consistent with known kinematic equations, with $h$ measuring how the output distribution from $p_\theta$ deviates from those equations. The regularization term $R$ can be used to avoid overly complex and/or degenerate solutions, and may include $L_1$, $L_2$, or entropy regularization terms. Stewart and Ermon [Stewart and Ermon2017] have shown that a model trained with the objective described in Eq. 2 can learn to track objects.
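To make the idea of a hand-written constraint concrete, the sketch below penalizes a predicted height trajectory for deviating from constant-acceleration motion. It is only an illustration under stated assumptions: the function name `kinematic_penalty`, the free-fall setting, and the frame spacing `dt` are hypothetical and not taken from the paper. It relies on the fact that, for constant acceleration `accel`, the second finite difference of the positions equals `accel * dt**2`.

```python
def kinematic_penalty(heights, dt=1.0, accel=-9.8):
    """Hypothetical constraint: mean squared deviation of a predicted
    trajectory from constant-acceleration motion. No labels are needed."""
    if len(heights) < 3:
        raise ValueError("need at least three frames")
    expected = accel * dt ** 2  # second difference of an ideal parabola
    residuals = [
        (heights[t + 1] - 2.0 * heights[t] + heights[t - 1]) - expected
        for t in range(1, len(heights) - 1)
    ]
    return sum(r * r for r in residuals) / len(residuals)


# A trajectory obeying y(t) = 10 t - 4.9 t^2 incurs (near) zero penalty,
# while a constant trajectory is penalized.
free_fall = [10.0 * t - 4.9 * t ** 2 for t in range(6)]
constant = [1.0] * 6
print(kinematic_penalty(free_fall), kinematic_penalty(constant))
```

Writing even this simple penalty requires knowing the governing equation in advance, which is the effort that learned constraints aim to avoid.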

3 Adversarial Constraint Learning

The process of manually specifying high-level constraints, $h$, can be time-consuming and may require significant domain expertise. Such is the case in pose estimation, where it is difficult to describe high-dimensional rules for joint movements precisely; however, the wide availability of unpaired videos and motion capture data makes constraint learning attractive in spite of the difficulty of providing high-dimensional constraints.

In the sciences, discovering general invariants is often a data-driven approach; for example, the laws of physics are often discovered by validating hypotheses with experimental results. Motivated by this, we propose in this section a novel framework for learning constraints from data.

3.1 Learning Constraints from Data

Suppose we have a dataset of inputs $\mathbf{X} = \{x_i\}$, a dataset of labels $\mathbf{Y} = \{y_j\}$, and a set $\mathbf{D} = \{(x_i, y_i)\}$ that describes correspondences between some elements of $\mathbf{X}$ and $\mathbf{Y}$. We denote the empirical distributions of $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{D}$ as $p(x)$, $p(y)$, and $p(x, y)$, respectively. Note that $\mathbf{Y}$ can come from either a simulator (such as one based on physical rules) or from some other source of data (such as motion captures of people for which we have no corresponding videos).

Let us first consider the setting where $\mathbf{D} = \emptyset$; i.e., there are inputs and labels but no correspondences between them. In spite of the lack of correspondences, we will see that constraints can be learned from the prior knowledge that the same underlying distribution generates both the empirical labels $\mathbf{Y}$ and the structured predictions obtained from applying our model to $\mathbf{X}$. These learned constraints can then be used for supervision. Let structured predictions be given by the following implicit sampling procedure:

$$x \sim p(x), \qquad y \sim p_\theta(y \mid x) \qquad (3)$$

where $p_\theta(y \mid x)$ is a (parameterized) conditional distribution of outputs given inputs. Discarding $x$, the above procedure corresponds to sampling from the marginal distribution over $y$, $p_\theta(y) = \int p_\theta(y \mid x)\, p(x)\, dx$.
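The sampling procedure can be sketched in a few lines. This is a minimal illustration, assuming a deterministic toy model and scalar inputs; the names `sample_predictions` and `f_theta` are hypothetical. Each draw picks an input uniformly from the unlabeled set (the empirical input distribution), pushes it through the model, and keeps only the output.

```python
import random


def sample_predictions(inputs, f_theta, num_samples, rng):
    """Draw model outputs by sampling an input from the empirical input
    distribution, applying the model, and discarding the input."""
    return [f_theta(rng.choice(inputs)) for _ in range(num_samples)]


rng = random.Random(0)
inputs = [0.0, 1.0, 2.0, 3.0]          # toy unlabeled inputs


def f_theta(x):
    """Toy deterministic model (an assumption made for illustration)."""
    return 2.0 * x + 1.0


samples = sample_predictions(inputs, f_theta, 5, rng)
print(samples)
```

Only samples are produced here; no likelihood of the outputs is ever evaluated, which is what makes the procedure implicit.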

Labels drawn from $p(y)$ should have high likelihood values under $p_\theta(y)$, but optimizing this objective directly is computationally infeasible: evaluating the marginal likelihood exactly is expensive due to the integration over $x$. Instead, we formulate the task of learning a constraint loss from $\mathbf{Y}$ through a likelihood-free approach using the framework of generative adversarial learning [Goodfellow et al.2014], which only requires samples from $p(y)$ and $p_\theta(y)$.

We introduce an auxiliary classifier $D_\phi$ (parameterized by $\phi$), called the discriminator, which scores outputs in the label space $\mathcal{Y}$. It is trained to assign high scores to representative output labels from $p(y)$, while assigning low scores to samples from $p_\theta(y)$. It thereby learns to extract latent constraints that hold over the output space and that are implicitly encoded in the samples from $p(y)$. The goal of $p_\theta$ is to produce outputs that receive higher scores from the discriminator, satisfying the constraints imposed by $D_\phi$ in the process.

For practical reasons, we consider $p_\theta(y \mid x)$ to be a Dirac-delta distribution $\delta(y - f_\theta(x))$, and thus, for simplicity, we refer to the conditional probabilistic model as the mapping $f_\theta : \mathcal{X} \to \mathcal{Y}$ in the experiment section. We train $D_\phi$ and $f_\theta$ with the following objective [Arjovsky et al.2017]:

$$\min_\theta \max_\phi \; \mathcal{L}^{A}(\theta, \phi) = \mathbb{E}_{y \sim p(y)}\big[D_\phi(y)\big] - \mathbb{E}_{x \sim p(x)}\big[D_\phi(f_\theta(x))\big] \qquad (4)$$

Assuming infinite capacity, Theorem 1 of [Goodfellow et al.2014] shows that at the optimal solution of Eq. 4, $D_\phi$ cannot distinguish between the given set of labels $\mathbf{Y}$ and those predicted by the model $f_\theta$, suggesting that the latter satisfy the set of constraints defined by $D_\phi$. Unlike in constraint-based learning, where a (possibly incomplete) set of constraints is manually specified, convergence in the adversarial setting implies that the label and output distributions match on all possible discriminator projections. Figure 2(a) shows an overview of the adversarial constraint learning framework in the context of trajectory estimation.
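The adversarial objective can be estimated from samples alone. The sketch below assumes a Wasserstein-style form (mean discriminator score on label samples minus mean score on model outputs), in line with the [Arjovsky et al.2017] citation; the scalar setting, the linear `critic`, and the two toy models are hypothetical choices for illustration, not the networks used in the paper.

```python
def adversarial_objective(critic, f_theta, labels, inputs):
    """Monte Carlo estimate of E_y[D(y)] - E_x[D(f(x))] from unpaired
    label samples and unlabeled inputs."""
    real_score = sum(critic(y) for y in labels) / len(labels)
    fake_score = sum(critic(f_theta(x)) for x in inputs) / len(inputs)
    return real_score - fake_score


def critic(y):
    """Hypothetical linear discriminator."""
    return 0.5 * y


labels = [2.0, 4.0, 6.0]   # unpaired label samples, e.g. from a simulator
inputs = [0.0, 1.0, 2.0]   # unlabeled inputs


def bad_model(x):
    return x               # outputs 0, 1, 2: does not match the labels


def good_model(x):
    return 2.0 * x + 2.0   # outputs 2, 4, 6: matches the labels


print(adversarial_objective(critic, bad_model, labels, inputs))   # 1.5
print(adversarial_objective(critic, good_model, labels, inputs))  # 0.0
```

In training, the discriminator parameters would be updated to increase this quantity while the model parameters are updated to decrease it; the example fixes both to show only how the value is computed.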

3.2 Constraint Learning via Matching Distributions

Generative Adversarial Networks (GANs) are a prominent example of implicit probabilistic models [Mohamed and Lakshminarayanan2016] which are defined through a stochastic sampling procedure instead of an explicitly defined likelihood function. One advantage of implicit generative models is that they can be trained with methods that do not require likelihood evaluations.

Hence, our approach to learning constraints for structured prediction can also be interpreted as learning an implicit generative model that matches the empirical label distribution $p(y)$. Specifically, our adversarial constraint learning approach optimizes over an approximation to the optimal transport cost from $p_\theta(y)$ to $p(y)$ [Arjovsky et al.2017]; thus our constraint can be implicitly defined as “$\theta$ minimizes the optimal transport cost from $p_\theta(y)$ to $p(y)$”.
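For intuition about the quantity being approximated, the optimal transport (1-Wasserstein) distance has a closed form in one dimension: for two empirical distributions with the same number of samples, it is the mean absolute gap between sorted samples. The sketch below computes it exactly in that special case; it illustrates the distance itself, not the discriminator-based approximation used for high-dimensional outputs, and the function name is hypothetical.

```python
def wasserstein_1d(samples_a, samples_b):
    """Exact 1-Wasserstein distance between two equal-size 1-D empirical
    distributions: pair up sorted samples and average the absolute gaps."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)


labels = [0.0, 1.0, 2.0, 3.0]
shifted = [y + 0.5 for y in labels]
print(wasserstein_1d(labels, shifted))                 # 0.5
print(wasserstein_1d(labels, list(reversed(labels))))  # 0.0
```

The second call returns zero because the distance compares distributions, not orderings, which mirrors the fact that no input-label correspondence is needed.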

(a) Our architecture trains $f_\theta$ by asking it to take in frames and generate trajectories that cannot be discriminated from sample trajectories from a simulator. Training this way eliminates the need for hand-engineering constraints.
(b) Top: frames from the video used in the pendulum experiment. Bottom: the network is trained to predict angles that cannot be distinguished from the simulated dynamics, encouraging it to track the metal ball over time.
Figure 2: Architecture and results of the pendulum tracking experiment.

3.3 Semi-Supervised Structured Prediction

Table 1: Settings in different learning paradigms. Supervised Learning (SL) requires a dataset $\mathbf{D}$ of paired $(x, y)$. Semi-Supervised Learning (SSL) utilizes additional unlabeled inputs $\mathbf{X}$. Adversarial Constraint Learning (ACL) requires inputs $\mathbf{X}$ and labels $\mathbf{Y}$ but without correspondences between them. Semi-Supervised Adversarial Constraint Learning (SSACL) extends ACL by also considering labeled pairs $\mathbf{D}$.

Although our framework does not require datasets containing input-label pairs $(x, y)$, providing it with such data gives rise to a new semi-supervised structured prediction method.

When given a set of labeled examples, we may extend our constraint learning objective (over both labeled and unlabeled data) with a standard classification loss term (over labeled data):

$$\mathcal{L}^{SS}(\theta, \phi) = \mathcal{L}^{A}(\theta, \phi) + \alpha\, \mathbb{E}_{(x, y) \sim p(x, y)}\big[\ell(f_\theta(x), y)\big] \qquad (5)$$

where $\mathcal{L}^{A}$ is the adversarial constraint learning objective defined in Eq. 4, and $\alpha$ is a hyperparameter that balances between fitting to the general (implicit) label distribution (first term) and fitting to the explicit labeled dataset (second term).
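The combination of the two terms can be sketched as follows. This is a minimal illustration under stated assumptions: the adversarial term is taken to be a Wasserstein-style score difference, the supervised term is a squared error, and the names `semi_supervised_objective`, `critic`, and `identity_model` are hypothetical.

```python
def semi_supervised_objective(critic, f_theta, labels, inputs, pairs, alpha):
    """Adversarial term on unpaired data plus alpha times a supervised
    loss (squared error, assumed here) on the labeled pairs."""
    real_score = sum(critic(y) for y in labels) / len(labels)
    fake_score = sum(critic(f_theta(x)) for x in inputs) / len(inputs)
    adversarial = real_score - fake_score
    supervised = sum((f_theta(x) - y) ** 2 for x, y in pairs) / len(pairs)
    return adversarial + alpha * supervised


def critic(y):
    """Hypothetical linear discriminator."""
    return 0.5 * y


def identity_model(x):
    return x


labels = [2.0, 4.0, 6.0]   # unpaired label samples
inputs = [0.0, 1.0, 2.0]   # unlabeled inputs
pairs = [(0.0, 2.0)]       # a single labeled pair

value = semi_supervised_objective(
    critic, identity_model, labels, inputs, pairs, alpha=0.5
)
print(value)  # 3.5 = 1.5 (adversarial) + 0.5 * 4.0 (supervised)
```

A larger `alpha` pulls the model toward the few labeled pairs, while a smaller one leans on the structure learned from unpaired labels.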

Our semi-supervised constraint learning framework is different from traditional semi-supervised learning approaches, as listed in Table 1. In particular, traditional semi-supervised learning methods assume there is a large source of inputs and tend to impose regularization over the inputs $\mathbf{X}$, such as through latent variables [Kingma et al.2014], through outputs [Miyato et al.2017], or through another network [Salimans et al.2016]. We consider the case where there exists a source, e.g., a simulator, that can provide abundant samples from the label space that are not matched to particular inputs, and impose regularization over the labels $\mathbf{Y}$ by exploiting a discriminator that provides an implicit constraint over the predicted values. Therefore, we can also utilize sample labels that are not associated with particular inputs, instead of being restricted to standard labeled pairs. Moreover, our method can be easily combined with existing semi-supervised learning approaches [Kingma et al.2014, Li et al.2016, Miyato et al.2017] to further boost performance.

4 Experimental Results

We evaluate the proposed framework on three structured prediction problems. First, we aim to track the angle of a pendulum in a video without labels using supervision provided by a physics-based simulator. Next, we extend the output space to higher dimensions and perform human pose estimation in a semi-supervised setting. Lastly, we evaluate our approach on multivariate time series prediction, where the goal is to predict future temperature and humidity.

A label simulator is provided for each experiment in place of hand-written constraints. Although explicit constraints for the pendulum case can be written down analytically, we demonstrate that our adversarial framework is capable of learning the constraint from data. In the other two experiments, we consider structured prediction settings where the outputs are high dimensional; in these settings, the correct constraints are very complex and hand-engineering them would be difficult. Instead, our model learns these constraints from a small number of samples provided by the simulator.

4.1 Pendulum Tracking

For this task, we aim to predict the angle of the pendulum from images in a YouTube video, i.e., learn a regression mapping $f_\theta$ from an image of height $H$ and width $W$ to a scalar angle. Since the outputs of $f_\theta$ over consecutive frames are constrained by temporal structure (a sine wave in this case), we concatenate consecutive outputs of $f_\theta$ to form a high-dimensional trajectory, which serves as the structured output $y$. Critically, $f_\theta$ must make a separate prediction for each image, preventing it from simply memorizing the output structure. Unlike previous methods [Stewart and Ermon2017], no explicit formulas are provided for supervision, and the (implicit) constraints are learned through the discriminator using samples provided by the physics simulator.

Training Details

The video contains a total of 170 images, and we hold out a subset of images for evaluation. We manually observe that the pendulum completes one full oscillation approximately every 12 frames. Based on this observation, we write a simulator of these dynamics as a simple harmonic oscillator with a fixed amplitude and a randomly sampled period of 10 to 14 frames. $D_\phi$ is trained to distinguish between the output of $f_\theta$ across consecutive images and a random trajectory sampled from the simulator. We implement $f_\theta$ as a 5-layer convolutional neural network with ReLU nonlinearities, and $D_\phi$ as a 5-cell LSTM. We use a fixed weight $\alpha$ in Eq. 5 and the same training procedure and hyperparameters as [Gulrajani et al.2017] across our experiments.
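A label simulator of the kind described above can be very small. The sketch below samples simple-harmonic trajectories with a fixed amplitude and a random period of 10 to 14 frames, as stated in the text; the uniform sampling of the period, the random phase, the unit amplitude, and the function name `simulate_pendulum` are assumptions made for illustration.

```python
import math
import random


def simulate_pendulum(num_frames, rng, amplitude=1.0):
    """Sample one simple-harmonic trajectory with a fixed amplitude,
    a random period of 10 to 14 frames, and a random phase (assumed)."""
    period = rng.uniform(10.0, 14.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [
        amplitude * math.sin(2.0 * math.pi * t / period + phase)
        for t in range(num_frames)
    ]


rng = random.Random(0)
trajectory = simulate_pendulum(num_frames=5, rng=rng)
print(trajectory)
```

Note that the simulator never sees a video frame: it only produces plausible output sequences for the discriminator to compare against the model's predictions.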


We manually label the horizontal position of the ball of the pendulum for each frame in the test set, and measure the correlation between the predicted positions and the ground truth labels. Since the same $f_\theta$ is applied to each input frame independently, it cannot just memorize valid (i.e., simple harmonic) trajectory sequences and produce them while ignoring the inputs. The model must learn to track the pendulum in order to fool the discriminator and subsequently achieve a high correlation on the test set.
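The evaluation metric can be sketched as below, assuming the correlation in question is the Pearson correlation coefficient (the text does not name the variant). Because Pearson correlation is invariant to positive scaling and shifting, it can compare predictions with ground truth even when the two are expressed in different units; the function name and toy values are hypothetical.

```python
import math


def pearson_correlation(predicted, truth):
    """Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, truth))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)


truth = [0.0, 1.0, 2.0, 3.0]
predicted = [1.0, 3.0, 5.0, 7.0]  # scaled and shifted copy of the truth
print(pearson_correlation(predicted, truth))
```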

Our adversarial constraint learning approach achieves a correlation of , whereas training with hand-crafted constraints achieves a marginally higher correlation of . Both approaches are trained without labels. Example predictions on the test data are shown in Figure 2. This real-world experiment demonstrates the effectiveness of constraint-based learning in the absence of labels, and suggests that using learned constraints from data is almost as effective as using ideal hand-crafted constraints.

Figure 3: Pose estimation using the proposed semi-supervised adversarial constraint learning approach. $f_\theta$ takes in a single image and outputs the 2-D locations of 6 joints (in green). Lines (in red) are added automatically. The images show results across 4 test groups (horizontal strips) when only 3 out of 28 training groups were directly labeled.

4.2 Pose Estimation

In this experiment, we evaluate the proposed model on pose estimation, which has a significantly larger output space. We aim to learn a regression network $f_\theta$ that maps an image to $\mathbb{R}^{k \times 2}$, where $k$ denotes the number of joints to detect, and each joint has 2 coordinates. As in the pendulum tracking experiment, $f_\theta$ is mapped across several frames to produce a trajectory that is indistinguishable from samples provided by the simulator.

We evaluate the model with videos and joint trajectories from the CMU multi-modal action database (MAD) [Huang et al.2014]. MAD contains videos of 20 subjects performing a sequence of actions in each video. We extract frames from subjects performing the “Jump and Side-Kick” action and train $f_\theta$ to detect the location of the left/right hip/knee/foot ($k = 6$) in each frame. The processed dataset contains 35 groups (549 valid frames in total).

Training Details

We divide the 35 groups of motion data into training and testing sets of 28 groups and 7 groups, respectively, where direct labels will be provided for a subset of the 28 training groups. Each group contains 14 to 17 frames, and we train on randomly selected contiguous intervals of frames. Using the metric of PCK@0.1 [Yang and Ramanan2013] for evaluation, a prediction is considered correct if it lies within $\tau \max(h_b, w_b)$ pixels of the true location, where $h_b$ and $w_b$ denote the height and width of the subject’s body. We evaluate with $\tau = 0.1$.
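The PCK criterion can be sketched as follows. This is a minimal illustration assuming the usual definition, in which a joint counts as correct when its Euclidean distance to the true location is at most a fraction (0.1 for PCK@0.1) of the larger body dimension; the function name `pck` and the toy coordinates are hypothetical.

```python
import math


def pck(predicted, truth, body_height, body_width, threshold=0.1):
    """Fraction of joints predicted within threshold * max(height, width)
    pixels of their true locations."""
    radius = threshold * max(body_height, body_width)
    hits = sum(math.dist(p, t) <= radius for p, t in zip(predicted, truth))
    return hits / len(truth)


truth = [(100.0, 100.0), (120.0, 200.0)]
predicted = [(103.0, 104.0), (150.0, 200.0)]  # errors of 5 and 30 pixels
score = pck(predicted, truth, body_height=200.0, body_width=80.0)
print(score)  # 0.5: the tolerance radius is 20 pixels, so one joint is a hit
```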

We first design a simulator of valid labels (joint positions) based on known kinematics of skeletons. Specifically, the anatomical shape of the subject’s legs approximately forms an expanding isosceles trapezoid when they jump and side kick. We simulate a large range of trapezoidal motions capturing these trajectories, which requires much less effort than hand-engineering precise mathematical formulas to express explicit constraints. $f_\theta$ takes a single image as input and produces a 12-dimensional vector, representing the 2-D locations of the 6 joints. Critically, as in the pendulum experiment, $f_\theta$ is applied to each frame independently and has no knowledge of the neighboring frames. The outputs of $f_\theta$ are concatenated and passed to the discriminator for training.


We construct 50 random train/test splits of the dataset and report the averaged PCK@0.1 scores for evaluation. The results are summarized in Table 2 and Figure 3, where we compare three forms of learning when labels are only available for $i$ of the training groups:

  • “L(i)”: vanilla supervised learning on labeled groups

  • “L(i)+VAT”: a baseline form of semi-supervised learning leveraging virtual adversarial training on unlabeled groups (VAT, [Miyato et al.2017])

  • “L(i)+ADV”: semi-supervised learning with adversarial constraint learning (Eq. 5)

When no labels are provided (“L(0)+ADV”; i.e., optimizing just the adversarial objective $\mathcal{L}^{A}$ of Eq. 4), $f_\theta$ is able to find the correct “shape” of the joints for each frame, but the predictions are biased. Since the subjects are not strictly acting in the center of the image, a constant minor shift for all predicted joint locations still meets the requirements imposed by $D_\phi$, which encodes the structure of the label space. This problem is addressed when providing even a very small (“i=1”) number of labeled training groups and using the semi-supervised objective $\mathcal{L}^{SS}$ of Eq. 5. The availability of labels fixes the constant bias, and using adversarial training produces a large (25-30%) boost over both the supervised and “VAT” baselines when only 1 group of labeled data is available.

With only 3 groups of labeled data (“L(3)+ADV”), adversarial constraint learning achieves performance comparable to standard supervised learning with 7 groups of labeled inputs (“L(7)”). Adversarial constraint learning “L(i)+ADV” consistently outperforms the virtual adversarial training “L(i)+VAT” baseline for different values of $i$. When further combined with VAT regularization in the objective, our method achieves slightly better performance.

The strong performance of our model over baselines on the pose estimation task with a few or no labels demonstrates that constraint learning can work well over high-dimensional outputs when using our proposed adversarial framework. Designing precise constraints in high-dimensional spaces is often tedious, error-prone, and restricted to one particular domain. Our method avoids these downsides by learning these constraints implicitly through data generated from a simulator, even though the simulator can be a noisy (or even slightly biased) description of the true label distribution.

PCK@0.1      Left Hip   Left Knee   Left Foot   Right Hip   Right Knee   Right Foot
L(0)+ADV     0.6813     0.7326      0.6047      0.6669      0.6729       0.5834
L(1)         0.5453     0.5728      0.5464      0.5360      0.4983       0.4362
L(1)+VAT     0.5795     0.6086      0.5797      0.5608      0.5016       0.4571
L(1)+ADV     0.8529     0.8510      0.8151      0.8482      0.8531       0.7394
L(3)         0.8275     0.7937      0.6716      0.8092      0.7529       0.6196
L(3)+VAT     0.8334     0.7866      0.7082      0.8281      0.7760       0.6420
L(3)+ADV     0.8760     0.9097      0.8328      0.8601      0.8746       0.7549
L(5)         0.8603     0.8483      0.7493      0.8309      0.8267       0.6626
L(5)+VAT     0.8750     0.8764      0.7411      0.8471      0.8398       0.6596
L(5)+ADV     0.9022     0.9160      0.8581      0.9192      0.8706       0.7894
L(7)         0.9088     0.8639      0.8217      0.8887      0.8387       0.7338
L(7)+VAT     0.9201     0.8665      0.8436      0.9074      0.8312       0.7526
L(7)+ADV     0.9469     0.9347      0.8418      0.9367      0.8988       0.8161
L(ALL)       0.9622     0.9633      0.9290      0.9464      0.9133       0.8936
L(ALL)+ADV   0.9758     0.9882      0.9627      0.9708      0.9522       0.8740
Table 2: PCK@0.1 results on MAD. “L(i)” indicates supervised learning (SL) where labeled data is provided for only $i$ out of 28 groups in the training set. “L(i)+VAT” indicates SL with additional optimization over unlabeled groups using virtual adversarial training [Miyato et al.2017] (SSL). “L(i)+ADV” indicates SL with additional optimization over the entire training set with the ACL objective. Our approach outperforms the baselines, especially when very few labels are available.

4.3 Time Series Prediction

Lastly, we validate our model on another structured prediction problem: multi-step multivariate time series prediction. In this task, we aim to learn a mapping $f_\theta$ from a window of past observations to a window of future observations: given consecutive values of a series, we aim to predict the values that follow, where each value is a vector of several variables. Here, $D_\phi$ learns constraints that hold across both variables and time by distinguishing the output of $f_\theta$ from real label samples.

Figure 4: Mean absolute error of the predictions on temperature (top) and humidity (bottom) during training (left) and testing (right). Our method SSACL (trained on the semi-supervised objective $\mathcal{L}^{SS}$ of Eq. 5) consistently outperforms SL (supervised learning) on the test set with different numbers of “complete groups” used in training.

Training Details

We conduct experiments on the SML2010 Dataset [Zamora-Martínez et al.2014], which contains humidity and temperature data of indoor and outdoor environments over 40 days at 15 minute intervals.

We hold out 8 consecutive days for testing and leave the rest for training. From the train and test set, we sample 480 and 120 groups of time series data respectively, each having length of 28 hours, and smooth each group into 7 data points at 4-hour intervals. Each group uses the first data points as input, and leaves the final values as targets for prediction, with each data point having variables representing the indoor/outdoor temperature/humidity. We measure the mean absolute error (MAE) on the test set.
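The preprocessing arithmetic can be checked with a short sketch: 28 hours at 15-minute intervals gives 112 readings, and reducing them to 7 points at 4-hour intervals means 16 readings per point. The code below assumes that "smoothing" means averaging each 4-hour block, which the text does not specify; the function names and toy values are hypothetical.

```python
def smooth_group(readings, points=7):
    """Average consecutive equal-size blocks so that a 28-hour group of
    15-minute readings becomes 7 points at 4-hour intervals."""
    block = len(readings) // points
    if block * points != len(readings):
        raise ValueError("readings must split evenly into blocks")
    return [
        sum(readings[i * block:(i + 1) * block]) / block
        for i in range(points)
    ]


def mean_absolute_error(predicted, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


readings = [float(i // 16) for i in range(112)]  # 28 h x 4 readings per hour
smoothed = smooth_group(readings)
print(smoothed)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(mean_absolute_error([1.0, 2.0], [1.5, 1.0]))  # 0.75
```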

We further explore the setting in which not all groups in the training set are “complete”; for example, in some groups we may only have temperature information. This is reasonable in real-world scenarios where sensors fail to work properly from time to time. Hence, we use “complete groups” to denote groups with full information, and “incomplete groups” to denote groups with only temperature information. Without humidity records, we cannot perform supervised learning on these “incomplete groups”, since the input requires all values. In the context of adversarial constraint learning, however, such data can facilitate learning constraints over the temperature series. In this task, the simulator is designed to produce humidity samples from only the “complete groups” and temperature samples from all the groups in the training set. Both $f_\theta$ and $D_\phi$ are 4-layer MLPs with 64 neurons per layer.


We display quantitative results in Figure 4. The supervised baseline (trained only with labels) achieves lower training error but higher test error. Our model effectively avoids overfitting to the small portion of labeled data and consistently outperforms the baseline, achieving MAEs of 1.933 and 3.042 on the predictions of temperature and humidity, respectively, when all groups are “completely” labeled.

5 Conclusion

We have proposed adversarial constraint learning, a new framework for structured prediction that replaces hand-crafted, domain-specific constraints with implicit, domain-agnostic ones learned through adversarial methods. Experimental results on multiple structured prediction tasks demonstrate that adversarial constraint learning works across several real-world applications with limited data, and fits naturally into semi-supervised structured prediction problems. Our success with matching distributions of labeled and unlabeled model outputs motivates future work exploring analogous opportunities for adversarially matching labeled and unlabeled distributions of learned intermediate representations.


This work was supported by a grant from the SAIL-Toyota Center for AI Research, TRI, Siemens, ONR, NSF grants #1651565, #1522054, #1733686, and a Hellman Faculty Fellowship.