Joint Discovery of Object States and Manipulation Actions

by Jean-Baptiste Alayrac, et al.

Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa.




1 Introduction

Many of our activities involve changes in object states. We need to open a book to read it, to cut bread before eating it and to light candles before taking out a birthday cake. Transitions of object states are often coupled with particular manipulation actions (open, cut, light). Moreover, the success of an action is often signified by reaching the desired state of an object (whipped cream, ironed shirt) and avoiding other states (burned shirt). Recognizing object states and manipulation actions is, hence, expected to become a key component of future systems such as wearable automatic assistants or home robots helping people in their daily tasks.

The human visual system can easily distinguish different states of objects, such as open/closed bottle or full/empty coffee cup [Brady06]. Automatic recognition of object states and state changes, however, presents challenges as it requires distinguishing subtle changes in object appearance such as the presence of a cap on the bottle or screws on the car tire. Despite much work on object recognition and localization, recognition of object states has received only limited attention in computer vision.


Figure 1: We automatically discover object states such as empty/full coffee cup along with their corresponding manipulation actions by observing people interacting with the objects.

One solution to recognizing object states would be to manually annotate states for different objects, and treat the problem as a supervised fine-grained object classification task [duan2012discovering, Farhadi09ObjectsAttrib]. This approach, however, presents two problems. First, we would have to decide a priori on the set of state labels for each object, which can be ambiguous and not suitable for future tasks. Second, for each label we would need to collect a large number of examples, which can be very costly.

In this paper we propose to discover object states directly from videos with object manipulations. As state changes are often caused by specific actions, we attempt to jointly discover object states and corresponding manipulations. In our setup we assume that two distinct object states are temporally separated by a manipulation action. For example, the empty and full states of a coffee cup are separated by the “pouring coffee” action, as shown in Figure 1. Equipped with this constraint, we develop a clustering approach that jointly (i) groups object states with similar appearance and consistent temporal locations with respect to the action and (ii) finds similar manipulation actions separating those object states in the input videos. Our approach exploits the complementarity of both subproblems and finds a joint solution for states and actions. We formulate our problem by adopting a discriminative clustering loss [Bach07diffrac] and a joint consistency cost between states and actions. We introduce an effective optimization solution in order to handle the resulting non-convex loss function and the set of spatial-temporal constraints. To evaluate our method, we collect a new video dataset depicting real-life object manipulation actions in realistic videos. Given this dataset for training, our method demonstrates successful discovery of object states and manipulation actions. We also demonstrate that our joint formulation gives an improvement of object state discovery by action recognition and vice versa.

2 Related work

Below we review related work on person-object interaction, recognizing object states, action recognition and discriminative clustering that we employ in our model.

Person-object interactions. Many daily activities involve person-object interactions. Modeling co-occurrences of objects and actions has shown benefits for recognizing actions in [delaitre11personaction, gupta2009observing, kjellstrom2011visual, pirsiavash2012detecting, yao2011human]. Recent work has also focused on building realistic datasets with people manipulating objects, in instructional videos [Alayrac15Unsupervised, Malmaud15what, Sener15unsupervised] or while performing daily activities [varol16hollywood]. We build on this work but focus on joint modeling and recognition of actions and object states.

States of objects. Prior work has addressed recognition of object attributes [Farhadi09ObjectsAttrib, Parikh2011Relattrib, patterson2014sun], which can be seen as different object states in some cases. Differently from our approach, these works typically focus on classifying still images, do not consider human actions and assume an a priori known list of possible attributes. Closer to our setting, Isola et al. [Isola15State] discover object states and transformations between them by analyzing large collections of still images downloaded from the Internet. In contrast, our method does not require annotations of object states. Instead, we use the dynamics of consistent manipulations to discover object states in the video with minimal supervision. In [dima2014youdo], the authors use consistent manipulations to discover task relevant objects. However, they do not consider object states, rely mostly on first person cues (such as gaze) and take advantage of the fact that videos are taken in a single controlled environment.

Action recognition. Most of the prior work on action recognition has focused on designing features to describe time intervals of a video using motion and appearance [laptev08learning, simonyan2014two, tran2015learning, Wang13action]. This is effective for actions such as dancing or jumping. However, many of our daily activities are best distinguished by their effect on the environment. For example, opening a door and closing a door can look very similar using only motion and appearance descriptors, but their outcomes are completely different. This observation has been used to design action models in [fathi11modeling, fernando15vidDarwin, Wang16Transformation]. In [Wang16Transformation], for example, the authors propose to learn an embedding in which a given action acts as a transformation of features of the video. In our work we localize objects and recognize changes of their states using manipulation actions as a supervisory signal. Related to ours is also the work of Fathi et al. [fathi11modeling], who represent actions in egocentric videos by changes of appearance of objects (also called object states). However, their method requires manually annotated precise temporal localization of actions in training videos. In contrast, we focus on (non-egocentric) Internet videos depicting real-life object manipulations where actions are performed by different people in a variety of challenging indoor/outdoor environments. In addition, our model jointly learns to recognize both actions and object states with only minimal supervision.

Discriminative clustering. Our model builds on unsupervised discriminative clustering methods [Bach07diffrac, singh12discPat, Xu2004maximum] that group data samples according to a simultaneously learned classifier. Such methods can incorporate (weak) supervision that helps to steer the clustering towards a preferred solution [Bojanowski13finding, doersch2012what, feifei2016connectionist, jain2013representing, tang14coloc]. In particular, we build on the discriminative clustering approach of [Bach07diffrac] that has been shown to perform well in a variety of computer vision problems [Bojanowski13finding]. It leads to a quadratic optimization problem where different forms of supervision can be incorporated in the form of (typically) linear constraints. Building on this formalism, we develop a model that jointly finds object states and temporal locations of actions in the video. Part of our object state model is related to [Joulin14efficient], while our action model is related to [Bojanowski15weakly]. However, we introduce new spatial-temporal constraints together with a novel joint cost function linking object states and actions, as well as new effective optimization techniques.

Contributions. The contributions of this work are three-fold. First, we develop a new discriminative clustering model that jointly discovers object states and temporally localizes associated manipulation actions in video. Second, we introduce an effective optimization algorithm to handle the resulting non-convex constrained optimization problem. Finally, we experimentally demonstrate that our model discovers key object states and manipulation actions from input videos with minimal supervision.

Figure 2: Given a set of clips that depict a manipulated object, we wish to automatically discover the main states that the object can take along with localizing the associated manipulation action. In this example, we show one video of someone filling a coffee cup. The video starts with an empty cup (state 1), which is filled with coffee (action) to become full (state 2). Given imperfect object detectors, we wish to assign to the valid object candidates either the initial state or the final state (encoded in $Y$). We also want to localize the manipulating action in time (encoded in $Z$) while maintaining a joint action-state consistency.

3 Modeling manipulated objects

We are given a set of $N$ clips that contain a common manipulation of the same object (such as “open an oyster”). We also assume that we are given an a priori model of the corresponding object in the form of a pre-trained object detector [girsh15fastrcnn]. Given these inputs, our goal is twofold: (i) localize the temporal extent of the action and (ii) spatially/temporally localize the manipulated object and identify its states over time. This is achieved by jointly clustering the appearances of an object (such as an “oyster”) appearing in all clips into two classes, corresponding to the two different states (such as “closed” and “open”), while at the same time temporally localizing a consistent “opening” action that separates the two states consistently in all clips. More formally, we formulate the problem as a minimization of a joint cost function that ties together the action prediction in time, encoded in the assignment variable $Z$, with the object state discovery in space and time, defined by the assignment variable $Y$:

$$\min_{Y \in \{0,1\}^{M \times 2},\; Z \in \{0,1\}^{T}} \; f(Z) + g(Y) + d(Z, Y) \quad \text{s.t.} \quad Z \in \mathcal{Z} \ \text{(saliency)}, \quad Y \in \mathcal{Y} \ \text{(ordering + non overlap)}, \tag{1}$$

where $f$ is a discriminative clustering cost to temporally localize the action in each clip, $g$ is a discriminative clustering cost to identify and localize the different object states and $d$ is a joint cost that relates object states and actions together. $T$ denotes the total length of all video clips and $M$ denotes the total number of tracked object candidate boxes (tracklets). In addition, we impose constraints $\mathcal{Z}$ and $\mathcal{Y}$ that encode additional structure of the problem: we localize the action with its most salient time interval per clip (“saliency”); we assume that the ordering of object states is consistent in all clips (“ordering”) and that only one object is manipulated at a time (“non overlap”).

In the following, we proceed with describing different parts of the model (1). In Sec. 3.1 we describe the cost function for the discovery of object states. In Sec. 3.2 we detail our model for action localization. Finally, in Sec. 3.3 we describe and motivate the joint cost $d$.

3.1 Discovering object states

The goal here is to both (i) spatially localize the manipulated object and (ii) temporally identify its individual states. To address the first goal, we employ pre-trained object detectors. To address the second goal, we formulate the discovery of object states as a discriminative clustering task with constraints. We obtain candidate object detections using standard object detectors pre-trained on large scale existing datasets such as ImageNet [imagenet09]. We assume that each clip is accompanied with a set of tracklets of the object of interest.

Footnote 1: In this work, we use short tracks of objects (less than one second) that we call tracklets. We want to avoid long tracks that continue across a state change of objects. By using the finer granularity of tracklets, our model has the ability to correct for detection mistakes within a track as well as identify more precisely the state change.

We formalize the task of localizing the states of objects as a discriminative clustering problem where the goal is to find an assignment matrix $Y_n \in \{0,1\}^{M_n \times 2}$ for each clip $n$, where $(Y_n)_{is} = 1$ indicates that the $i$-th tracklet represents the object in state $s$. We also allow a complete row of $Y_n$ to be zero to encode that no state was assigned to the corresponding tracklet. This is to model the possibility of false positive detections of an object, or that another object of the same class appears in the video, but is not manipulated and thus is not undergoing any state change. In detail, we minimize the following discriminative clustering cost [Bach07diffrac]:

$$g(Y) = \min_{W_s \in \mathbb{R}^{d_s \times 2}} \frac{1}{2M} \| Y - X_s W_s \|_F^2 + \frac{\mu}{2} \| W_s \|_F^2, \tag{2}$$

where $W_s$ is the object state classifier that we seek to learn, $\mu$ is a regularization parameter and $X_s$ is an $M \times d_s$ matrix of features, where each row is a $d_s$-dimensional (state) feature vector storing features for one particular tracklet. The minimization in $W_s$ actually leads to a convex quadratic cost function in $Y$ (see [Bach07diffrac]). The first term in (2) is the discriminative loss on the data that measures how easily the input data $X_s$ is classified by the linear classifier $W_s$ when the object state assignment is given by matrix $Y$. In other words, we wish to find a labeling $Y$ for given object tracklets into two states (or no state) so that their appearance features $X_s$ are easily classified by a linear classifier. To steer the cost towards the right solution, we employ the following constraints (encoded by $Y \in \mathcal{Y}$ in (1)).

Footnote 2: We concatenate all the variables $Y_n$ into one $M \times 2$ matrix $Y$.

Only one object is manipulated at a time: non overlap constraint. As is common in instructional videos, we assume that only one object can be manipulated at a given time. However, in practice, it is common to have multiple (spatially diverse) tracklets that occur at the same time, for example, due to a false positive detection in the same frame. To overcome this issue, we impose that at most one tracklet can be labeled as belonging to state 1 or state 2 at any given time. We refer to this constraint as “non overlap” in problem (1).

State 1 → Action → State 2: ordering constraints. We assume that the manipulating action transforms the object from an initial state to a final state and that both states are present in each video. This naturally introduces two constraints. The first is an ordering constraint on the labeling $Y_n$: state 1 should occur before state 2 in each video. The second constraint imposes that we have at least one tracklet labeled as state 1 and at least one tracklet labeled as state 2. We call this last constraint the “at least one” constraint in contrast to forcing “exactly one” ordered prediction as previously proposed in a discriminative clustering approach on video for action localization [Bojanowski15weakly]. This new type of constraint brings additional optimization challenges that we address in Section 4.2.
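
As an illustration of why the clustering cost of this section is a quadratic function of the assignment matrix, the sketch below evaluates it in two ways on toy data. It assumes the ridge-regression form of the cost from [Bach07diffrac] (squared loss between the assignment $Y$ and the linear prediction $XW$, plus a squared-norm regularizer) without a bias term. The function names, the random features and the value of `mu` are illustrative assumptions, not part of our implementation.

```python
import numpy as np

def clustering_cost(Y, X, mu):
    """Return min_W 1/(2M) ||Y - X W||_F^2 + mu/2 ||W||_F^2 and the minimizer W."""
    M, d = X.shape
    # Ridge regression solution: W* = (X^T X + M mu I)^{-1} X^T Y.
    A = X.T @ X + M * mu * np.eye(d)
    W = np.linalg.solve(A, X.T @ Y)
    residual = Y - X @ W
    cost = np.sum(residual ** 2) / (2 * M) + mu / 2 * np.sum(W ** 2)
    return cost, W

def clustering_cost_quadratic(Y, X, mu):
    """Same value written as the quadratic form 1/(2M) tr(Y^T B Y)."""
    M, d = X.shape
    A = X.T @ X + M * mu * np.eye(d)
    B = np.eye(M) - X @ np.linalg.solve(A, X.T)
    return np.trace(Y.T @ B @ Y) / (2 * M)

# Toy data (assumption): 6 tracklets with 3-dimensional random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
# Rows: state 1, state 1, no state, state 2, state 2, no state.
Y = np.array([[1, 0], [1, 0], [0, 0], [0, 1], [0, 1], [0, 0]], dtype=float)

cost, W = clustering_cost(Y, X, mu=0.1)
print(cost, clustering_cost_quadratic(Y, X, mu=0.1))  # the two values agree
```

Because the matrix `B` depends only on the features, the cost can be evaluated for any candidate assignment without refitting the classifier.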

3.2 Action localization

Our action model is equivalent to the one of [Bojanowski15weakly] applied to only one action. More precisely, the goal is to find an assignment matrix $Z_n \in \{0,1\}^{T_n}$ for each clip $n$, where $Z_{nt} = 1$ encodes that the $t$-th time interval of video $n$ is assigned to an action and $Z_{nt} = 0$ encodes that no action is detected in interval $t$. The cost that we minimize for this problem is similar to the object states cost:

$$f(Z) = \min_{W_v \in \mathbb{R}^{d_v}} \frac{1}{2T} \| Z - X_v W_v \|_2^2 + \frac{\lambda}{2} \| W_v \|_2^2, \tag{3}$$

where $W_v$ is the action classifier, $\lambda$ is a regularization parameter and $X_v$ is a $T \times d_v$ matrix of visual features. We constrain our model to predict exactly one time interval for an action per clip, an approach for actions that was shown to be beneficial in a weakly supervised setting [Bojanowski15weakly] (referred to as “action saliency” constraint). As will be shown in experiments, this model alone is incomplete because the clips in our dataset can contain other actions that do not manipulate the object of interest. Our central contribution is to propose a joint formulation that links this action model with the object state prediction model, thereby resolving the ambiguity of actions. We detail the joint model next.

3.3 Linking actions and object states

Actions in our model are directly related to changes in object states. We therefore want to enforce consistency between the two problems. To do so, we design a novel joint cost function that operates on the action video labeling $Z_n$ and the state tracklet assignment $Y_n$ for each clip. We want to impose a constraint that the action occurs in between the presence of the two different object states. In other words, we want to penalize the fact that state 1 is detected after the action happens, or the fact that state 2 is triggered before the action occurs.

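Joint cost definition. We propose the following joint symmetric cost function for each clip:

$$d(Z_n, Y_n) = \sum_{y \in \mathcal{S}_1(Y_n)} [t_y - t_{Z_n}]_+ + \sum_{y \in \mathcal{S}_2(Y_n)} [t_{Z_n} - t_y]_+, \tag{4}$$

where $t_{Z_n}$ and $t_y$ are the times when the action $Z_n$ and the tracklet $y$ occur in clip $n$, respectively. $\mathcal{S}_1(Y_n)$ and $\mathcal{S}_2(Y_n)$ are the tracklets in the $n$-th clip that have been assigned to state 1 and state 2, respectively. Finally, $[x]_+ = \max(x, 0)$ is the positive part of $x$. In other words, the function penalizes the inconsistent assignment of object states $Y_n$ by the amount of time that separates the incorrectly assigned tracklet and the manipulation action in the clip. The overall joint cost is the sum over all clips weighted by a scaling hyperparameter $\nu > 0$:

$$d(Z, Y) = \frac{\nu}{T} \sum_{n=1}^{N} d(Z_n, Y_n). \tag{5}$$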
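
To make the per-clip penalty concrete, here is a minimal sketch for an integer assignment. The encoding of tracklet states as the integers 0 (no state), 1 and 2, the function name and the toy times are assumptions made for illustration only.

```python
def joint_cost(action_time, tracklet_times, tracklet_states):
    """Per-clip distortion between one action time and labeled tracklets.

    tracklet_states[i] is 0 (no state), 1 (state 1) or 2 (state 2).
    """
    cost = 0.0
    for t, state in zip(tracklet_times, tracklet_states):
        if state == 1:
            # State 1 should precede the action: penalize by how late it is.
            cost += max(t - action_time, 0.0)
        elif state == 2:
            # State 2 should follow the action: penalize by how early it is.
            cost += max(action_time - t, 0.0)
    return cost

times = [1.0, 2.0, 6.0, 8.0]                 # tracklet times in seconds (toy values)
print(joint_cost(5.0, times, [1, 1, 2, 2]))  # consistent labeling: 0.0
print(joint_cost(5.0, times, [1, 2, 1, 2]))  # two inconsistent tracklets: 3.0 + 1.0 = 4.0
```

Tracklets without a state contribute nothing, so false positive detections do not affect the penalty.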



4 Optimization

Optimizing problem (1) poses several challenges that need to be addressed. First, we propose a relaxation of the integer constraints and the distortion function (Section 4.1). Second, we optimize this relaxation using Frank-Wolfe with a new dynamic program able to handle our tracklet constraints (Section 4.2). Finally, we introduce a new rounding technique to obtain an integer candidate solution to our problem (Section 4.3).

4.1 Relaxation

Problem (1) is NP-hard in general [loiola07qap] due to its specific integer constraints. Inspired by the approach of [Bojanowski14weakly], which was successful in approximating combinatorial optimization problems, we propose to use the tightest convex relaxation of the feasible subset of binary matrices by taking its convex hull. As our variables can now take values in $[0,1]$, we also have to propose a consistent extension of the different cost functions to handle fractional values as input. For the cost functions $f$ and $g$, we can directly take their expression on the relaxed set as they are already expressed as (convex) quadratic functions. Similarly, for the joint cost function $d$ in (4), we use its natural bilinear relaxation:

$$d(Z_n, Y_n) = \sum_{i=1}^{M_n} \Big( (Y_n)_{i1} \sum_{t=1}^{T_n} (Z_n)_t \, [t_{ni} - t]_+ + (Y_n)_{i2} \sum_{t=1}^{T_n} (Z_n)_t \, [t - t_{ni}]_+ \Big), \tag{6}$$

where $t_{ni}$ denotes the video time of tracklet $i$ in clip $n$, and $M_n$ and $T_n$ are the numbers of tracklets and time intervals in clip $n$. This relaxation is equal to the function (4) on the integer points. However, it is not jointly convex in $Y$ and $Z$, thus we have to design an appropriate optimization technique to obtain good (relaxed) candidate solutions, as described next.
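
The following sketch evaluates a bilinear relaxation of this kind on fractional inputs. It assumes that each tracklet carries a pair of weights for state 1 and state 2 and that each time interval carries an action weight; the names and toy values are illustrative.

```python
def relaxed_joint_cost(z, Y, interval_times, tracklet_times):
    """Bilinear relaxation of the per-clip joint cost.

    z[t] in [0, 1] is the action weight of interval t.
    Y[i] = (y_i1, y_i2) in [0, 1]^2 are the state weights of tracklet i.
    """
    cost = 0.0
    for (y1, y2), ti in zip(Y, tracklet_times):
        # Expected lateness of a state 1 tracklet with respect to the action.
        late = sum(zt * max(ti - t, 0.0) for zt, t in zip(z, interval_times))
        # Expected earliness of a state 2 tracklet with respect to the action.
        early = sum(zt * max(t - ti, 0.0) for zt, t in zip(z, interval_times))
        cost += y1 * late + y2 * early
    return cost

interval_times = [float(t) for t in range(10)]
tracklet_times = [1.0, 2.0, 6.0, 8.0]
Y = [(1, 0), (0, 1), (1, 0), (0, 1)]          # states 1, 2, 1, 2

z_int = [0.0] * 10
z_int[5] = 1.0                                # action placed at t = 5
z_frac = [0.0] * 10
z_frac[4] = z_frac[6] = 0.5                   # fractional action placement

print(relaxed_joint_cost(z_int, Y, interval_times, tracklet_times))   # 4.0
print(relaxed_joint_cost(z_frac, Y, interval_times, tracklet_times))  # 4.0
```

With a one-hot `z` and binary `Y`, the value coincides with the integer penalty. For fixed `Y` the function is linear in `z`, and for fixed `z` it is linear in `Y`.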

4.2 Joint optimization using Frank-Wolfe

When dealing with a constrained optimization problem for which it is easy to solve linear programs but difficult to project on the feasible set, the Frank-Wolfe algorithm is an excellent choice [Jaggi2013, Lacoste15GlobalLinearFW]. This is exactly the case for our relaxed problem, where the linear program over the convex hull of feasible integer matrices can be solved efficiently via dynamic programming. Moreover, [lacoste16nonconvexFW] recently showed that the Frank-Wolfe algorithm with line-search converges to a stationary point for non-convex objectives at a rate of $O(1/\sqrt{k})$, where $k$ is the number of iterations. We thus use this algorithm for the joint optimization of (1). As the objective is quadratic, we can perform exact line-search analytically, which speeds up convergence in practice. Finally, in order to get a good initialization for both variables $Z$ and $Y$, we first optimize separately $f$ and $g$ (without the non-convex $d$), which are both convex functions.
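
A minimal sketch of Frank-Wolfe with exact line search on a quadratic objective is given below. To keep it short, the feasible set is the probability simplex, whose linear minimization oracle simply picks the best vertex. This replaces the dynamic program used for our constraint sets and is an assumption of the example, as are all names and the toy problem.

```python
import numpy as np

def frank_wolfe_simplex(Q, c, x0, n_iters=200, tol=1e-8):
    """Minimize 0.5 x^T Q x + c^T x over the probability simplex (Q symmetric)."""
    x = np.asarray(x0, dtype=float).copy()
    gap = float("inf")
    for _ in range(n_iters):
        grad = Q @ x + c
        # Linear minimization oracle over the simplex: the best vertex.
        s = np.zeros_like(x)
        s[np.argmin(grad)] = 1.0
        d = s - x
        gap = -grad @ d                  # Frank-Wolfe gap, always >= 0
        if gap < tol:
            break
        curvature = d @ Q @ d
        if curvature > 0:
            # Exact minimizer of the 1-D quadratic, clipped to the segment.
            gamma = min(1.0, gap / curvature)
        else:
            # Concave along d: the best point on the segment is the vertex.
            gamma = 1.0
        x = x + gamma * d
    return x, gap

# Toy problem: project p onto the simplex, i.e. minimize 0.5 ||x - p||^2.
p = np.array([0.2, 0.3, 0.5])
x, gap = frank_wolfe_simplex(np.eye(3), -p, np.array([1.0, 0.0, 0.0]), n_iters=2000)
print(x, gap)
```

Each iterate is a convex combination of feasible points, so no projection step is needed. The `curvature <= 0` branch handles directions along which a non-convex quadratic is concave.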

Dynamic program for the tracklets. In order to apply the Frank-Wolfe algorithm, we need to solve a linear program (LP) over our set of constraints. Previous work has explored “exactly one” ordering constraints for time localization problems [Bojanowski14weakly]. In contrast, we have to deal with the spatial (non overlap) constraint and with finding “at least one” candidate tracklet per state. To deal with these two requirements, we propose a novel dynamic programming approach. First, the “at least one” constraint is encoded by having a memory variable which indicates whether state 1 or state 2 has already been visited. This variable is used to propose valid state decisions for consecutive tracklets. Second, the challenging “non overlap” tracklet constraint is included by constructing valid left-to-right paths in a cost matrix while carefully considering the possible authorized transitions. We provide details of the formulation in Appendix C. In addition, we show in Section 5.2 that these new constraints are key for the success of the method.
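
The role of the memory variable can be sketched on a simplified problem. The example below assumes a single candidate tracklet per time step, so the “non overlap” constraint is trivially satisfied and is not modeled. It is therefore not the dynamic program of Appendix C, only an illustration of the “at least one” ordering logic. Names and costs are hypothetical.

```python
INF = float("inf")

def best_state_labeling(costs):
    """Minimize a linear cost over labelings with ordered states.

    costs[i] = (cost of state 1, cost of state 2) for the i-th tracklet in
    time order; assigning no state (label 0) costs nothing. Every state 1
    tracklet must precede every state 2 tracklet, and each state must be
    used at least once.
    """
    # Memory phase: 0 = no state seen yet, 1 = state 1 seen, 2 = state 2 seen.
    best = [0.0, INF, INF]
    back = []
    for c1, c2 in costs:
        new = [INF, INF, INF]
        choice = [None, None, None]
        for phase, value in enumerate(best):
            if value == INF:
                continue
            moves = [(0, 0.0, phase)]          # (label, added cost, next phase)
            if phase <= 1:
                moves.append((1, c1, 1))       # state 1 allowed before any state 2
            if phase >= 1:
                moves.append((2, c2, 2))       # state 2 allowed once state 1 was seen
            for label, added, nxt in moves:
                if value + added < new[nxt]:
                    new[nxt] = value + added
                    choice[nxt] = (phase, label)
        best = new
        back.append(choice)
    if best[2] == INF:
        raise ValueError("no feasible labeling: need at least two tracklets")
    labels, phase = [], 2
    for choice in reversed(back):
        phase, label = choice[phase]
        labels.append(label)
    labels.reverse()
    return best[2], labels

value, labels = best_state_labeling([(-1.0, 2.0), (0.5, 0.3), (1.0, -2.0), (-0.5, -0.1)])
print(value, labels)  # labels: [1, 0, 2, 2]
```

Requiring the final phase to be 2 enforces that both states appear, because phase 2 can only be reached through phase 1.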

4.3 Joint rounding method

Once we obtain a candidate solution of the relaxed problem, we have to round it to an integer solution in order to make predictions. Previous works [Alayrac15Unsupervised, Bojanowski15weakly] have observed that using the learned classifier for rounding gave better results than other possible alternatives. We extend this approach to our joint setup by proposing the following new rounding procedure. We optimize problem (1) but fix the values of the classifiers $W$ in the discriminative clustering costs. Specifically, we minimize the following cost function over the integer points $Z \in \mathcal{Z}$ and $Y \in \mathcal{Y}$:

$$\frac{1}{2T} \| Z - X_v W_v^* \|_2^2 + \frac{1}{2M} \| Y - X_s W_s^* \|_F^2 + d(Z, Y), \tag{7}$$

where $W_v^*$ and $W_s^*$ are the classifier weights obtained at the end of the relaxed optimization. Because $y^2 = y$ when $y$ is binary, (7) is actually a linear objective over the binary matrix $Y_n$ for $Z_n$ fixed. Thus we can optimize (7) exactly by solving a dynamic program on $Y_n$ for each of the $T_n$ possibilities of $Z_n$, yielding $O(M_n T_n)$ time complexity per clip (see Appendix D for details).
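
The linearity argument used for rounding can be checked numerically. The sketch below compares a squared loss between a binary vector and fixed classifier scores with its linear rewriting. The scores are arbitrary toy values and the function names are illustrative.

```python
import itertools

def quadratic_loss(y, scores):
    """Squared loss between a binary assignment y and fixed classifier scores."""
    return sum((yi - si) ** 2 for yi, si in zip(y, scores))

def linearized_loss(y, scores):
    """Same loss, using y**2 == y for binary y: (y - s)**2 = (1 - 2 s) y + s**2."""
    linear = sum((1.0 - 2.0 * si) * yi for yi, si in zip(y, scores))
    constant = sum(si ** 2 for si in scores)
    return linear + constant

scores = [0.9, 0.2, -0.3, 0.6]   # toy classifier outputs (assumption)
for y in itertools.product([0, 1], repeat=len(scores)):
    assert abs(quadratic_loss(y, scores) - linearized_loss(y, scores)) < 1e-12
print("quadratic and linearized losses agree on all binary vectors")
```

The coefficients `1 - 2 s` act as per-tracklet linear costs, which is the kind of input a dynamic program over labelings can consume once the action time is fixed.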

5 Experiments

In this section, we first describe our dataset, the object tracking pipeline and the feature representation for object tracklets and videos (Section 5.1). We consider two experimental set-ups. In the first weakly-supervised set-up (Section 5.2), we apply our method on a set of video clips which we know contain the action of interest but do not know its precise temporal localization. In the second, more challenging “in the wild” set-up (Section 5.3), the input set of weakly-supervised clips is obtained by automatic processing of text associated with the videos and hence may contain erroneous clips that do not contain the manipulation action of interest. The data and code are available online [Alayrac16ObjectStatesWeb].

5.1 Dataset and features

Dataset of manipulation actions. We build a dataset of manipulation actions by collecting videos from different sources: the instructional video dataset introduced in [Alayrac15Unsupervised], the Charades dataset from [varol16hollywood], and some additional videos downloaded from YouTube. We focus on “third person” videos (rather than egocentric) as such videos depict a variety of people in different settings and can be obtained on a large scale from YouTube. We annotate the precise temporal extent of seven different actions applied to five distinct objects. This results in 630 annotated occurrences of ground truth manipulation actions.

Footnote 3: The seven actions are put the wheel on the car (47 clips), withdraw the wheel from the car (46), place a plant inside a pot (27), open an oyster (28), open a refrigerator (234), close a refrigerator (191) and pour coffee (57).

Footnote 4: The five objects are car wheel, flower pot, oyster, refrigerator and coffee cup.

To evaluate object state recognition, we define a list of two states for each object. We then run an automatic object detector for each involved object, track the detected object occurrences throughout the video and then subdivide the resulting long tracks into short tracklets. Finally, we label ground truth object states for tracklets within 40 seconds of each manipulation action. We label four possible states: state 1, state 2, ambiguous state or false positive detection. The ambiguous state covers the (not so common) in-between cases, such as cup half-full. In total, we have 19,499 fully annotated tracklets, each labeled as state 1 or state 2, ambiguous, or false positive. Note that this annotation is only used for evaluation purposes, and not by any of our models. Detailed statistics of the dataset are given in Appendix B.

Object detection and tracking. In order to obtain detectors for the five objects, we finetune the FastRCNN network [girsh15fastrcnn] with training data from ImageNet [imagenet09]. We use bounding box annotations from ImageNet when available (the “wheel” class). For the other classes, we manually labeled more than 500 instances per class. In our set-up with only a moderate amount of training data, we observed that class-agnostic object proposals combined with FastRCNN performed better than FasterRCNN [ren2015faster]. In detail, we use geodesic object proposals [kra2014gop] and set a relatively low object detection threshold to obtain good recall. We track objects using a generic KLT tracker from [Bojanowski13finding]. The tracks are then post-processed into shorter tracklets that last about one second and thus are likely to have only one object state.

Object tracklet representation. For each detected object, represented by a set of bounding boxes over the course of the tracklet, we compute a CNN feature from each (extended) bounding box that we then average over the length of the tracklet to get the final representation. The CNN feature is extracted with a ROI pooling [ren2015faster] of ResNet50 [he16resnet]. The ROI pooling notably makes it possible to capture some context around the object, which is important in some cases (wheel “on” or “off” the car). The resulting feature descriptor of each object tracklet is 8,192 dimensional.

Representing video for recognizing actions. Following the approach of [Alayrac15Unsupervised, Bojanowski14weakly, Bojanowski15weakly], each video is divided into chunks of 10 frames that are represented by a motion and appearance descriptor averaged over 30 frames. For the motion we use a 2,000 dimensional bag-of-word representation of histogram of local optical flow (HOF) obtained from Improved Dense Trajectories [Wang13action]. Following [Alayrac15Unsupervised], we add an appearance vector that is obtained from a 1,000 dimensional bag-of-word vector of conv5 features from VGG16 [Simonyan14vggnets]. This results in a 3,000 dimensional feature vector for each chunk of 10 frames.


Table 1: State discovery (top) and action localization results (bottom).

State discovery
Method                                   put     remove  fill    open    fill      open    close   Average
                                         wheel   wheel   pot     oyster  coff.cup  fridge  fridge
(a) Chance                               0.10    0.11    0.10    0.07    0.06      0.10    0.10    0.09
(b) Kmeans                               0.25    0.12    0.11    0.23    0.14      0.19    0.22    0.18
(c) Constraints only                     0.35    0.38    0.35    0.36    0.31      0.29    0.42    0.35
(d) Salient state only                   0.35    0.48    0.35    0.38    0.30      0.40    0.37    0.38
(e) At least one state only              0.43    0.55    0.46    0.52    0.29      0.43    0.39    0.44
(f) Joint model                          0.52    0.59    0.50    0.45    0.39      0.47    0.47    0.48
(g) Joint model + det. scores            0.47    0.65    0.50    0.61    0.44      0.46    0.43    0.51
(h) Joint + GT act. feat.                0.55    0.56    0.56    0.52    0.46      0.45    0.49    0.51

Action localization
Method                                   put     remove  fill    open    fill      open    close   Average
                                         wheel   wheel   pot     oyster  coff.cup  fridge  fridge
(i) Chance                               0.31    0.20    0.15    0.11    0.40      0.23    0.17    0.22
(ii) [Bojanowski15weakly]                0.24    0.13    0.11    0.14    0.26      0.29    0.23    0.20
(iii) [Bojanowski15weakly] + object cues 0.24    0.13    0.26    0.07    0.84      0.33    0.37    0.32
(iv) Joint model                         0.67    0.57    0.48    0.32    0.82      0.57    0.44    0.55
(v) Joint + GT stat. feat.               0.72    0.66    0.44    0.46    0.86      0.55    0.44    0.59

5.2 Weakly supervised object state discovery

Experimental setup. We first apply our method in a weakly supervised set-up where for each action we provide an input set of clips, where we know the action occurs somewhere in the clip but we do not provide the precise temporal localization. Each clip may contain other actions that affect other objects or actions that do not affect any object at all (e.g. walking / jumping). The input clips are about 20s long and are obtained by taking approximately 10s of each annotated manipulation action.

Evaluation metric: average precision. For all variants of our method, we use the rounded solution that reached the smallest objective during optimization. We evaluate these predictions with a precision score averaged over all the videos. A temporal action localization is said to be correct if it falls within the ground truth time interval. Similarly, a state prediction is correct if it matches the ground truth state. Note that a “precision” metric is reasonable in our set-up as our method is forced to predict in all videos, i.e. the recall level is fixed to all videos and the method cannot produce high precision with low recall.

Footnote 5: In particular, we count “ambiguous” labels as incorrect.
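
The action localization criterion can be sketched as follows, assuming one predicted time and one ground truth interval per clip, with inclusive interval boundaries. The function name, the boundary convention and the toy values are assumptions for illustration.

```python
def action_precision(predicted_times, gt_intervals):
    """Fraction of clips whose predicted action time lies in the ground truth interval."""
    correct = sum(
        start <= t <= end
        for t, (start, end) in zip(predicted_times, gt_intervals)
    )
    return correct / len(predicted_times)

# Three clips: the first and third predictions fall inside their intervals.
predictions = [3.0, 12.0, 7.5]
intervals = [(2.0, 4.0), (5.0, 9.0), (7.0, 8.0)]
print(action_precision(predictions, intervals))  # 2 correct out of 3
```

Since exactly one prediction is made per clip, this score is computed over all clips.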

Hyperparameters. In all methods that involve a discriminative clustering objective, we used the same values of the regularization parameters (one for action localization, one for state discovery) for all 7 actions. For joint methods that optimize (1), the weight of the distortion measure (5) is likewise fixed.

State discovery results. Results are shown in the top part of Table 1. In the following, we refer to “State only” whenever we use our method without looking at the action cost or the distortion measure in (1). We compare to two baselines for the state discovery task. Baseline (a) evaluates chance performance. Baseline (b) performs K-means clustering of the tracklets with $k = 3$ (2 clusters for the states and 1 for false positives). We report performance of the best assignment for the solution with the lowest objective after 10 different initializations. Baseline (c) is obtained by running our “State only” method while using random features for tracklet representation as well as the “at least one ordering” and “non overlap” constraints. We use random features to avoid a non-trivial analytic derivation of the “Constraints only” performance. This baseline reveals the difficulty of the problem and quantifies the improvement brought by the ordering constraints. The next two methods are “State only” variants. Method (d) replaces the “at least one” constraint by an “exactly one” constraint, while method (e) uses our new constraint. Finally, we report three joint methods that use our new joint rounding technique (7) for prediction. Method (f) corresponds to our joint method that optimizes (1). Method (g) is a simple improvement taking into account the object detection score in the objective (details below). Finally, method (h) is our joint method but using the action ground truth labels as video features in order to test the effect of having perfect action localization for the task of object state discovery.

We first note that method (e) outperforms (d), thus highlighting the importance of the “at least one” constraint for modeling object states. While the saliency approach (taking only the most confident detection per video) was useful for action modeling in [Bojanowski15weakly], it is less suitable for our set-up where multiple tracklets can be in the same state. The joint approach with actions (f) outperforms the “State only” method (e) on 6 out of 7 actions and obtains better average performance, confirming the benefits of joint modeling of actions and object states. Using ground truth action locations further improves results (cf. (h) against (f)). Our weakly supervised approach (f) performs only slightly worse than the variant using ground truth actions (h), except for the states of the coffee cup (empty/full). In this case we observe that a high number of false positive detections confuses our method. A simple way to address this issue is to add the object detection score into the objective of our method, which then prefers to assign object states to higher scoring object candidates, further reducing the effect of false positives. This can be done easily by adding a linear cost reflecting the object detection score to objective (1). We denote this modified method “(g) Joint model + det. scores”. This method achieves the best average performance and highlights that additional information can be easily added to our model.

Action localization results. We compare our method to three different baselines and give results in the bottom part of Table 1. Baseline (i) corresponds to chance performance, where the precision for each clip is simply the proportion of the entire clip taken by the ground truth time interval. Baseline (ii) is the method introduced in [Bojanowski15weakly] used here with only one action. It also corresponds to a special case of our method where the object state part of the objective in equation (1) is turned off (salient action only). Interestingly, this baseline is actually worse than chance for several actions. This is because without additional information about objects, this method localizes other common actions in the clip and not the action manipulating the object of interest. This also demonstrates the difficulty of our experimental set-up where the input video clips often contain multiple different actions. To address this issue, we also evaluate baseline (iii), which complements [Bojanowski15weakly] with the additional constraint that the action prediction has to be within the first and the last frame where the object of interest is detected, improving the overall performance above chance. Our joint approach (iv) consistently outperforms these baselines on all actions, thus showing again the strong link between object states and actions. Finally, the approach (v) is the analog of method (h) for action localization, where we use ground truth state labels as tracklet features in our joint formulation, showing that the action localization can be further improved with better object state descriptors. In addition, we also compare to a supervised baseline. The average obtained performance is 0.58, which is not far from our method. This demonstrates the potential of using object states for action localization. More details on this experiment are provided in Appendix E.
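The chance baseline (i) has a simple closed form. As a small illustration (the tuple layout and the function name are assumptions made for this sketch), it can be computed as:

```python
from typing import List, Tuple


def chance_precision(clips: List[Tuple[float, float, float]]) -> float:
    """Expected precision of a uniformly random prediction, averaged over clips.

    Each entry of `clips` holds (clip_duration, gt_start, gt_end) in seconds.
    """
    fractions = [(gt_end - gt_start) / duration
                 for duration, gt_start, gt_end in clips]
    return sum(fractions) / len(fractions)
```

For example, a 20 s clip with a 5 s ground truth interval contributes 0.25 to the average.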

Benefits of joint object-action modeling. We observe that the joint modeling of object states and actions benefits both tasks. This effect is even stronger for actions. Intuitively, knowing the object states perfectly greatly reduces the search space for action localization. Moreover, despite the recent major progress in object recognition using CNNs, action recognition still remains a hard problem with much room for improvement. Qualitative results are shown in Fig. 3 and failure cases of our method are discussed in Appendix F.

Figure 3: Qualitative results for joint action localization (middle) and state discovery (left and right) (see Fig. 1 for “fill coffee cup”).

5.3 Object state discovery in the wild

Towards the discovery of a large number of manipulation actions and state changes, we next apply our method in an automatic setting, where action clips have been obtained using automatic text-based retrieval.

Clip retrieval by text. Instructional videos [Alayrac15Unsupervised, Malmaud15what, Sener15unsupervised] usually come with a narration provided by the speaker describing the performed sequence of actions. In this experiment, we keep only such narrated instructional videos from our dataset. This results in a total of 140 videos that are 3 minutes long on average. We extract the narration in the form of subtitles associated with the video. These subtitles have been directly downloaded from YouTube and have been obtained either by YouTube’s Automatic Speech Recognition (ASR) or provided by the users.

We use the resulting text to retrieve clip candidates that may contain the action modifying the state of an object. Obtaining the approximate temporal location of actions from the transcribed narration is still very challenging due to ambiguities in language (“undo bolt” and “loosen nut” refer to the same manipulation) and only coarse temporal localization of the action provided by the narration. Given a manipulation action such as “remove tire”, we first find positive and negative sentences relevant for the action from an instruction website such as Wikihow. We then train a linear SVM classifier [cortes95SVM] on bigram text features. Finally, we use the learned classifier to score clips from the input instructional videos. In detail, the classifier is applied in a sliding window of 10 words finding the best scoring window in each input video. The clip candidates are then obtained by trimming the input videos 5 seconds before and 15 seconds after the timing of the best scoring text window to account for the fact that people usually perform the action after having talked about it. We apply our method on the top 20 video clips based on the SVM score for each manipulation action. More details about this process are provided in Appendix A.

Results. As shown in Table 2, the pattern of results, where our joint method performs the best, is similar to the weakly supervised set-up described in Sec. 5.2. This highlights the robustness of our model to noisy input data – an important property for scaling-up the method to Internet scale datasets. To assess how well our joint method could do with perfect retrieval, we also report results for a “Curated” set-up where we replace the automatically retrieved clips with the 20s clips used in Sec. 5.2 for the corresponding videos.


(c) Cstrs only 0.23 0.34 0.25 0.29 0.11 0.24
State + det. sc. 0.33 0.48 0.28 0.40 0.13 0.32
(g) Joint 0.38 0.53 0.25 0.43 0.20 0.36
(g) Curated 0.63 0.68 0.63 0.63 0.53 0.62


(i) Chance 0.14 0.10 0.06 0.10 0.15 0.11
(iii) Action 0.05 0.10 0.00 0.15 0.25 0.11
(iv) Joint 0.30 0.30 0.20 0.20 0.20 0.24
(iv) Curated 0.53 0.35 0.32 0.40 0.59 0.44


Table 2: Results on noisy clips automatically retrieved by text.

6 Conclusion and future work

We have described a joint model that relates object states and manipulation actions. Given a set of input videos, our model both localizes the manipulation actions and discovers the corresponding object states. We have demonstrated that our joint approach improves performance of both object state recognition and action recognition. More generally, our work provides evidence that actions should be modeled in the larger context of goals and effects. Finally, our work opens up the possibility of Internet-scale learning of manipulation actions from narrated video sequences.


This research was supported in part by a Google Research Award, ERC grants Activia (no. 307574) and LEAP (no. 336845), the CIFAR Learning in Machines & Brains program and ESIF, OP Research, development and education Project IMPACT No. CZ.02.1.01/0.0/0.0/15_003/0000468.


Outline of Appendix

This Appendix gives additional details and quantitative results for our method. The organization is as follows. In Appendix A, we provide additional experimental details about the SVM training used to retrieve the clips with subtitles, described in Section 5.3 of the main paper, along with a visualization of results. In Appendix B, we give additional statistics and details about the dataset that was briefly introduced in Section 5.1 of the main paper. In Appendix C, we give additional details about the dynamic program that we use to solve the linear program over the track constraints defined in Section 3.1 of the main paper as needed for the Frank-Wolfe optimization algorithm. In Appendix D, we detail how we implement the new joint rounding method (7) that was introduced in Section 3.3 of the main paper. In Appendix E, we give additional details about the supervised baselines results given in Section 5.2. Finally, in Appendix F, we comment on the common failure cases of the method.

Appendix A SVM training for clip retrieval

In Section 5.3 we proposed an automatic method for retrieving video clips with manipulated objects. This method makes use of narrations that come along with instructional videos. Narrations in the form of text are first obtained automatically with Automatic Speech Recognition (ASR) (the ASR transcriptions are directly downloaded from YouTube) and then processed as detailed below.

Language dataset. For each manipulation action, we first find relevant positive and negative sentences on instruction websites such as Wikihow. On average we obtain about 12 positive and 50 negative sentences per action.

Language features. Off-the-shelf methods for text parsing typically fail in the absence of punctuation. To process ASR output, which comes without punctuation, we propose to use the following simple but robust text representation. We represent every 10-word window of the narration by a TF-IDF vector based on uni-grams and bi-grams. We use the same TF-IDF representation to encode text in our Language dataset on the level of sentences.
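To make this representation concrete, here is a minimal pure-Python sketch of 10-word windows encoded as TF-IDF vectors over uni-grams and bi-grams. The exact TF-IDF weighting is not specified in the text, so the smoothed IDF used below is an assumption, as are all function names.

```python
import math
from collections import Counter
from typing import Dict, List


def uni_bi_grams(words: List[str]) -> List[str]:
    """All uni-grams followed by all bi-grams of a token sequence."""
    return list(words) + [a + " " + b for a, b in zip(words, words[1:])]


def sliding_windows(words: List[str], size: int = 10) -> List[List[str]]:
    """Every window of `size` consecutive words (one window if the text is shorter)."""
    if len(words) <= size:
        return [list(words)]
    return [words[i:i + size] for i in range(len(words) - size + 1)]


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with a smoothed IDF: log((1 + n) / (1 + df)) + 1."""
    counts = [Counter(uni_bi_grams(doc)) for doc in docs]
    # Document frequency: each term is counted once per document containing it.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    return [{term: tf * (math.log((1 + n) / (1 + df[term])) + 1.0)
             for term, tf in c.items()} for c in counts]
```

Since no sentence boundaries are needed, the same functions apply both to unpunctuated ASR windows and to the sentences of the Language dataset.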

SVM training. We train binary linear SVM classifiers to identify manipulation actions using the Language dataset for training and the regularization parameter . The obtained classifiers are then used to score every 10-word window of text narrations. Video clips with the temporal correspondence to the top-scoring text narrations for each action are then retrieved. To deduce the temporal extent of the video clip given the top-scoring window, we trim 5 seconds before and 15 seconds after the corresponding timing. This is to account for the fact that people are usually doing the action after speaking about it. These are the clips we use for our evaluation in Section 5.3 of the main paper.
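The scoring and trimming steps can be sketched as follows. This is a minimal illustration under stated assumptions: `score_fn` is a hypothetical stand-in for the decision function of the trained linear SVM, the timestamp attached to a window is supplied by the caller (the text does not say which word's timing is used), and clamping the clip to the video extent is our own choice.

```python
from typing import Callable, Sequence, Tuple


def best_window(words: Sequence[str],
                score_fn: Callable[[Sequence[str]], float],
                size: int = 10) -> Tuple[int, float]:
    """Start index and score of the highest-scoring window of `size` words."""
    best_start, best_score = 0, float("-inf")
    for start in range(max(1, len(words) - size + 1)):
        score = score_fn(words[start:start + size])
        if score > best_score:  # strict comparison keeps the earliest best window
            best_start, best_score = start, score
    return best_start, best_score


def clip_bounds(window_time: float, video_duration: float,
                before: float = 5.0, after: float = 15.0) -> Tuple[float, float]:
    """Trim 5 s before and 15 s after the window time, clamped to the video."""
    return (max(0.0, window_time - before),
            min(video_duration, window_time + after))
```

The asymmetric margins reflect the observation that the action is usually performed after it has been announced.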

Illustration. In Figure 3, we provide an illustration of text based clip retrieval. This visualization demonstrates the difficulty of the addressed problem (see caption for details).

we ’ve also talked the wheel on the opposite side of the flat tire so we ’re ready to start changing it the first thing we ’ll do is jack the vehicle up you have to loosen the jack a little bit in order to get the handle off this particular model has a lug wrench built right in before we jack they would want to loosen the lug nuts because once it ’s jacked up it ’s going to be very difficult to get those lug nuts off this particular model has a hubcap which we ’ll just loosen those up does n’t take much because they ’re made out of plastic that the hubcap out of the way i have access to the lug nuts loosen the lug nuts position yourself firmly pressing counterclockwise to loosen the lug nuts not to lose just enough to crack them is a little stubborn position yourself over using your knee or your foot you can gain leverage now we ’re ready to jack it up this is the common scissors jack the screen moves in and out allowing the mac mechanism to move up and down and lift your car so as you turn it to the right it will go up as you loosen it the jack will collapse allowing the car to come down now will position the jack you want to be sure to get the jack on a good part of the frame your owners manual is a good place to find where did properly jack the vehicle you can raise it up by hand until it contacts the frame it is in good position we use the jack handle to raise the vehicle insert the end of the jack handle into the jack using it as leverage it will help make the car go up easier remember turning clockwise to go up and counterclockwise to go down carefully jack the car up never stick your hands or your legs under the vehicle as the car can fall and cause damage checking can take a while so be patient and cautious as you jack ok now that the car is jacked up now we ’ll take the lug nuts loose remember we already pre loosen them when the car was on the ground making it easy to notice will come right off now trying to do that while the car was jacked up would be 
very difficult remember to keep your lug nuts in close hand you do n’t want to roll away in the grass because it ’ll be hard to find and that ’s what holds your tire on again we ’re turning them counterclockwise to get them loose or to the left we ’ll remove the flat tire set it over here out of the way now there was heavy traffic area you would n’t want to leave it in the street you might want to put it to the rear of the car grab the spare tire

put it in position you ’re going to center the spare tire on the wheel studs that ’s what the lug nuts go so go ahead and do that lining it up you see it lines are pretty easy then install your lug nuts again clockwise is tight counterclockwise is loose so we ’ll take them up by hand as far as they ’ll go once the tire centered lug nuts are hand tight take the lug wrench again turning it clockwise to tighten the lugs to their firm not too tight because remember your cars up on a jack we would n’t want it falling off with too much leverage of force insert the jack handle back into the jack turning it counterclockwise again we ’re going to lower the vehicle again this will take some time but going down a lot easier than going up it seems to be going down easy you can just do it like this by doing it straight instead of having an angle but if it ’s too tough you can angle it and get the leverage that you ’ll need to lower the jack or to raise the jack turn the jack low enough you can do it by hand now that it does n’t have contact with the frame and we can remove it now that the tires on the ground we ’ll go ahead and give them that final type we ’re going to start down here then move to the top one then back down over and back again a little star pattern making sure we have equal torque on the wheel stud ok what we have our spare tire on it set it up we ’re ready to go a couple things to remember
Figure 3: Illustration of our text based clip retrieval approach for the action “put wheel on a car”. We display the narration of the video that is obtained from Automatic Speech Recognition (ASR). The text is highlighted with different colors. These colors correspond to the score of the SVM that has been trained to detect sentences which refer to the action of interest. More precisely, for each word, we compute the average score over all the windows that contain it. Red indicates high score, blue indicates low score. The shown frames correspond to the top scoring part of the narration. We can see the coherence between what the person says (highlighted in red) and does in the video. Note the different challenges of the problem. First, the input narration is very long. Second, the text directly comes from ASR, therefore it contains mistakes and does not have any punctuation. Third, several different expressions are similar and could refer to the action of interest. Despite all these challenges, our method is able to correctly retrieve the clip that contains the action of interest.

Appendix B Dataset of manipulated objects

Table 4 provides statistics for the dataset introduced in Section 5.1 of the main paper. For each object class we indicate associated action classes and the number of video clips for each action. We also provide the list of states and the number of object tracklets with state annotations. In total, we have around 20,000 annotated tracks which we use for the quantitative evaluation of state discovery.

Objects Actions (#clips) States #Tracklets
wheel {remove (47), put (46)} {attached, detached} 5447
coffee cup {fill (57)} {full, empty} 1819
flower pot {put plant (27)} {full, empty} 2463
fridge {open (234), close (191)} {open, closed} 7968
oyster {open (28)} {open, closed} 1802
Table 4: Statistics of our new dataset of manipulated objects
(a) Non-overlap data structure for tracklets
(b) Cost matrix for the dynamic program
Figure 5: In (a), we provide an illustration of a possible situation for the tracklets. and are two fictitious tracklets that encode the beginning and end of the video. Each tracklet is indexed based on its beginning time. The time overlap between tracklets is shown by the grey color. We specify for each tracklet its possible successors by the dotted red arrows (see main text). Finally an admissible labeling is illustrated by yellow tags where and have both been assigned to state  and to state . In (b), we give an illustration of our approach to solve (8) with a dynamic program. We display the modified cost matrix (see main text). A valid path has to go from the green dot () to the red dot (). The light yellow entries show part of the matrix that are inserted in , whereas white entries encode the rows of s that are inserted to impose the at least one ordering constraint. The red arrows specify an example optimal path inside the matrix. The red entries display the tracklets that have been assigned to state  ( and ) or state () (equivalent to putting ones in the appropriate corresponding entries in ). Finally, the grey arrows display the possible valid transitions that can be made for the entries along the red path, for clarity. We see for example that from , there are possible transitions: two column choices from the two red arrows from in (a) encoding the non-overlap constraint; and three row choices encoding the valid transition from “state 1” (corresponding to the choice “state 1”, or “state 2” for the next tracklet) encoding the “at least one” ordering constraint.

Appendix C Dynamic program for the tracklets

The track constraints defined in Section 3.1 introduce new challenges compared to the previous related work [Alayrac15Unsupervised, Bojanowski14weakly, Joulin14efficient]. Recall that there are three main components in the constraints. First, we assume that only one object is manipulated at a given time. Thus at most one tracklet can be assigned to a state at a given time. This constraint is referred to as the non-overlap constraint. Second, we have the ordering constraint that imposes that state 1 always happens before state 2. The last constraint imposes that we have at least one tracklet labeled as state 1 and at least one tracklet labeled as state 2. We need to be able to minimize linear functions over this set of constraints in order to use the Frank-Wolfe algorithm. More precisely, as the constraints decompose over the different clips, we can solve independently for each clip the following linear problem:


$$\min_{Y \in \mathcal{Y}} \; \operatorname{Tr}(C^\top Y), \qquad (8)$$

where $\mathcal{Y}$ denotes the set of assignments satisfying the track constraints and $C$ is a cost matrix that typically comes from the computation of the gradient of the cost function at the current iterate. In order to solve this problem, we use a dynamic program approach that we explain next. Recall that we are given a set of tracklets and our goal is to output the matrix $Y$ that assigns to each of these tracklets either state 1, state 2 or no state at all while respecting the constraints. The whole method is illustrated in Figure 5 with a toy example.

Non-overlap data structure for the tracklets. We first pre-process the tracklets to build an auxiliary data-structure that is used to enforce the non-overlap constraint between the tracklets, as illustrated in Figure 5(a). First, we sort and index each tracklet by their beginning time, and add two fictitious tracklets, one as the starting tracklet and one as the ending tracklet. These two tracklets are used to start and terminate the dynamic program. If all the tracklets were sequentially ordered without any overlap in time, then we could simply make a decision for each of them sequentially as was done in previous work on action localization for example (one decision per time step) [Bojanowski14weakly]. To enforce the non-overlap constraint, we force the decision process to choose only one possible successor among the group of overlapping valid immediate successors of a tracklet. For each tracklet, we thus define its (smallest) set of “valid successors” as the earliest tracklet after it that is also non-overlapping with it (earliest meaning the one with the smallest index), as well as any other later tracklet that overlaps with this earliest one (thus giving the earliest valid group of overlapping tracklets). The valid successors are illustrated by red dotted arrows in Figure 5(a). For example, the valid successors of  are (the earliest one that is non-overlapping) as well as (which overlaps with thus forming an overlapping group). Skipping a tracklet in this decision process means that we assign it to zero (which trivially always satisfies the non-overlapping constraint); whereas once we choose a tracklet to potentially assign it to state 1 or 2, we cannot visit any overlapping tracklet by construction of the valid successors, thus maintaining the non-overlap constraint.
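The construction of the valid successors can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes tracklets are given as half-open time intervals already sorted by beginning time, numbers the fictitious start as node 0 and the fictitious end as node n+1, and uses function names of our own choosing.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (begin, end), treated as half-open [begin, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """Two intervals overlap iff each one begins before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def valid_successors(tracklets: List[Interval]) -> List[List[int]]:
    """Valid successors of nodes 0..n, where node 0 is the fictitious start,
    nodes 1..n are the tracklets and node n+1 is the fictitious end."""
    if any(a[0] > b[0] for a, b in zip(tracklets, tracklets[1:])):
        raise ValueError("tracklets must be sorted by beginning time")
    n = len(tracklets)
    inf = float("inf")
    nodes = [(-inf, -inf)] + list(tracklets) + [(inf, inf)]
    successors = []
    for i in range(n + 1):
        # Earliest later node that does not overlap node i (the end never overlaps).
        j = next(k for k in range(i + 1, n + 2) if not overlaps(nodes[i], nodes[k]))
        # Add every later node overlapping j: the earliest valid overlapping group.
        group = [j] + [k for k in range(j + 1, n + 2) if overlaps(nodes[j], nodes[k])]
        successors.append(group)
    return successors
```

On the toy input `[(0, 4), (2, 6), (5, 9), (8, 12)]`, the start node can move to tracklets 1 or 2 (which overlap each other), while tracklet 1 can move to tracklets 3 or 4.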

Dynamic program. The dynamic programming approach is used when we can solve a large problem by solving a sequence of inclusive subproblems that are linked by a simple recursive formula and that use overlapping solutions (which can be stored in a table for efficiency). In terms of implementation, [Bojanowski14weakly] encoded their dynamic program as finding an optimal path inside a cost matrix. This approach is particularly suited when the update cost rule depends only on the arrival entry in the cost matrix, as opposed to being transition dependent. As we will show below, we can encode the solution to our problem in a way that satisfies this property. We therefore use the framework of [Bojanowski14weakly] by casting our problem as a search for an optimal path inside a cost matrix illustrated in Figure 5(b), where the valid transitions encode the possible constraints.

One main difference with [Bojanowski14weakly] is that we have to deal with the challenging “at least one” constraint in the context of ordered labels. To do so, we can further filter the set of valid decisions by using “memory states” that encode in which of the following three situations we are: (i) state 1 has not yet been visited, (ii) state 1 has already been visited, but state 2 has not yet been visited (and thus we can either come back to state 1 or go to state 2) and (iii) both states have been visited. These memory states can be encoded by interleaving complete rows of 0s in between the columns of the cost matrix stored as rows, to obtain the modified cost matrix. These new rows encode the three different memory states previously described when making a prediction of 0 for a specific tracklet, and we enforce the correct memory semantics by only allowing a path to move to the same row or the row immediately below, except for state 1, which can also move directly to state 2 (two rows below), and the middle “between state 1/2” row, from which one can additionally go up one row to state 1. Finally, the valid transitions between columns (tracklets) are given by the valid successors data structure as given in Figure 5(a) to encode the non-overlap constraints. Combining these two constraints (at least one ordering and non-overlap), we illustrate with grey arrows in Figure 5(b) the possible transitions from the states along the path in red. To describe the dynamic program recursion below, we need to go in the opposite direction from the successors, and thus we say that a tracklet is a predecessor of another if and only if the latter is a successor of the former.

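To perform the dynamic program, we maintain a matrix $D$ of the same size as the modified cost matrix $\tilde{C}$, where $D_{i,j}$ contains the minimal valid path cost of going from the starting entry to $(i,j)$ inside $\tilde{C}$. To define the cost update recursion to compute $D_{i,j}$, let $P(i,j)$ be the set of tuples $(k,l)$ for which it is possible to go from $(k,l)$ to $(i,j)$ according to the rules described above. The update rule is then as follows:

$$D_{i,j} = \min_{(k,l) \in P(i,j)} D_{k,l} + \tilde{C}_{i,j}.$$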


As we see here, the added cost depends only on the arrival entry . We can therefore use the approach of [Bojanowski14weakly] and only consider entry costs rather than edge costs. Thanks to our indexing property (tracklets are sorted by the beginning time), we can update the dynamic program matrix by filling each column of one after the other. Once this update is finished, we back-track to get the best path by starting from the ending track (predecessors of ) at the last row (to be sure that both states have been visited) that has the lowest score in the matrix. The total complexity of this algorithm is of order .
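A compact way to see how the memory states and the valid successors combine is the following Python sketch. It is a simplified variant, not the cost-matrix path search described above: instead of interleaving rows of zeros, it tracks the memory state explicitly (0: no state assigned yet, 1: state 1 seen, 2: state 2 seen). The function name, the `costs` layout and the node numbering (0 for the fictitious start, n+1 for the fictitious end) are assumptions made for illustration, and `successors` is expected to follow the valid-successor definition given earlier.

```python
from typing import List, Sequence, Tuple

INF = float("inf")


def solve_track_dp(costs: Sequence[Tuple[float, float]],
                   successors: Sequence[Sequence[int]]) -> Tuple[float, List[int]]:
    """Minimize a linear cost over the track constraints.

    costs[i - 1] = (cost of state 1, cost of state 2) for tracklet i in 1..n.
    successors[i] lists the valid successors of node i for i in 0..n.
    Returns (minimal cost, labels) with labels[i - 1] in {0, 1, 2}.
    """
    n = len(costs)
    end = n + 1
    # best[i][m]: cheapest path from the start to node i with memory m after node i.
    best = [[INF] * 3 for _ in range(n + 2)]
    parent = [[None] * 3 for _ in range(n + 2)]  # (previous node, previous memory, label)
    best[0][0] = 0.0
    for i in range(n + 1):  # successors always have larger indices
        for m in range(3):
            if best[i][m] == INF:
                continue
            for j in successors[i]:
                options = [(0, m, 0.0)]  # leave node j unlabeled, memory unchanged
                if j != end:
                    if m <= 1:  # state 1 is allowed until state 2 has been seen
                        options.append((1, 1, costs[j - 1][0]))
                    if m >= 1:  # state 2 requires state 1 to have been seen
                        options.append((2, 2, costs[j - 1][1]))
                for label, new_m, added in options:
                    candidate = best[i][m] + added
                    if candidate < best[j][new_m]:
                        best[j][new_m] = candidate
                        parent[j][new_m] = (i, m, label)
    if best[end][2] == INF:
        raise ValueError("no labeling satisfies the constraints")
    # Back-track from the end node with both states visited.
    labels = [0] * n
    node, memory = end, 2
    while node != 0:
        previous, previous_memory, label = parent[node][memory]
        if node != end:
            labels[node - 1] = label
        node, memory = previous, previous_memory
    return best[end][2], labels
```

As in the recursion above, the added cost depends only on the arrival node and its label, so each node is processed once in index order and back-tracking starts from the end node in the memory state where both states have been visited.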

Appendix D Joint cost rounding method

Recall that we propose to use a convex relaxation approach in order to obtain a candidate solution of main problem (1). Thus, we need to round the relaxed solution afterward in order to get a valid integer solution. We propose here a new rounding that is adapted to our joint problem. We referred to this rounding as the joint cost rounding (see Section 4 of main paper). This rounding is inspired by [Bojanowski15weakly, Alayrac15Unsupervised]. They observe that using the learned classifier to round gives them better solutions, both in terms of objective value and performance. We propose to use its natural extension for our joint model. We first fix the classifiers for actions and for states to their relaxed solution and find, for each clip , the couple that minimizes the joint cost (7). To do so, we observe that we can enumerate all possibilities for , and solve for each of them the minimization of the joint cost with respect to . The minimization with respect to can be addressed as follows. First, we observe that the distortion function (6) is bilinear in . Let be a vector, and let be a vector of ones of length 2. We can actually write: for some matrix . Thus, when is fixed, the joint term is actually a simple linear function of . In addition, the quadratic term in coming from (2) is also linear over the integer points (using the fact that for ). Thus, when , and are fixed, the minimization over is a linear program (8) that we solve using our dynamic program from the previous section. The final algorithm is given in Algorithm 1. Its complexity is of order .

Get and from the relaxed problem.
Initialize , and .
# Loop over all possibilities for (saliency)
for  in 1 :  do
      zeros(, ) # Set the -th entry of to
      # Definition of the cost matrix
      # Dynamic program for the tracks
      # Cost computation
      # Update solution if better
     if  then
     end if
end for
Algorithm 1 Joint cost rounding for video
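The structure of this rounding loop can be sketched in Python as follows. The sketch is deliberately abstract: `solve_states` and `joint_cost` are hypothetical callables standing in for the dynamic program over the tracklets and for the evaluation of the joint cost (7) with fixed classifiers, and the one-hot encoding of the salient action interval is an assumption of this illustration.

```python
from typing import Callable, List, Tuple


def joint_cost_rounding(num_intervals: int,
                        solve_states: Callable[[List[int]], List[int]],
                        joint_cost: Callable[[List[int], List[int]], float],
                        ) -> Tuple[List[int], List[int], float]:
    """Enumerate every salient action interval and keep the best joint solution."""
    best_z: List[int] = []
    best_y: List[int] = []
    best_cost = float("inf")
    for t in range(num_intervals):
        # One-hot action assignment: the action happens in interval t.
        z = [0] * num_intervals
        z[t] = 1
        # With z fixed, the state assignment is found by the dynamic program.
        y = solve_states(z)
        cost = joint_cost(z, y)
        if cost < best_cost:  # update the solution if better
            best_z, best_y, best_cost = z, y, cost
    return best_z, best_y, best_cost
```

The loop calls the state solver once per candidate action interval, which is what makes the enumeration over the action variable tractable.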

Appendix E Supervised baselines for Action Localization

Features put remove fill open fill open close Average
wheel wheel pot oyster coff.cup fridge fridge
(1) CNN + HOF 0.65 0.68 0.56 0.11 0.91 0.54 0.59 0.58
(2) CNN + IDT 0.65 0.72 0.56 0.21 0.93 0.60 0.62 0.61
Table 5: Results of supervised baselines for action localization.

We have run supervised baseline methods with state-of-the-art features. To be able to compare numbers with our experiment, we used a leave-one-out technique. For each action, we train a binary SVM classifier on all videos except one. Similarly to our setting, we then select the top scoring time interval of the held-out test video. We repeat this process for all videos and report the metric used in our paper. For baseline (1), we use the same features as in the main paper. For baseline (2), we complement our features with all channels of Improved Dense Trajectories (IDT) [Wang13action]. Detailed results are given in Table 5. We observe that our weakly supervised method obtains results that are on par with the supervised baselines (0.55 versus 0.58), therefore demonstrating the potential of using the information of object states for action localization.

Appendix F Failure cases

Figure 6: Typical failure cases for “removing car wheel” (top) and “fill coffee cup” (middle, bottom) actions. Yellow indicates correct predictions; red indicates mistakes. Top: the removed wheel is incorrectly localized (right). Middle: the “empty cup” is incorrectly localized (left). Bottom: In this case, both object tracklets are annotated as “ambiguous” in the ground truth as they occur during the pouring action and hence the predictions, while they appear reasonable, are deemed incorrect.

We observed two main types of failures, illustrated in Figure 6. The first one occurs when a false positive object detection consistently satisfies the hypothesis of our model in multiple videos (the top two rows in Figure 6). The second typical failure mode is due to ambiguous labels (bottom row in Figure 6). This highlights the difficulty in annotating ground truth for long actions such as “pouring coffee”.