Code repository for the paper: 'Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks'
Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. To train our model, we collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. The novel aspects of our model are applicable to activities with prominent object interaction dynamics and to objects which can be tracked using state-of-the-art approaches; for activities without clearly defined spatial object-agent interactions, we rely on baseline scene-level spatio-temporal representations. We show the effectiveness of our approach not only on the proposed compositional action recognition task, but also in a few-shot compositional setting which requires the model to generalize across both object appearance and action category.
Let’s look at the simple action of “taking something out of something” in Figure 1. Even though these two videos show human hands interacting with different objects, we recognize that they are the same action based on changes of the relative positions of the objects and hands involved in the activity. Further, we can easily recognize the action even when it is presented with previously unseen objects and tools. We ask, do current machine learning algorithms have the capability to generalize across different combinations of verbs and nouns?
We investigate actions represented by the changes of geometric arrangements between subjects (agents) and objects. We propose a compositional action recognition setting in which we decompose each action into a combination of a verb, a subject, and one or more objects. Instead of the traditional setting where training and testing splits include the same combinations of verbs and nouns, we train and test our model on the same set of verbs (actions) but combine them with different object categories, so that tested verb and object combinations have never been seen during training time (Figure 1 (b)).
This problem turns out to be very challenging for existing state-of-the-art action recognition models. Computer vision researchers have developed deep networks with temporal connections for action recognition by using Recurrent Neural Networks with 2D Convolutions [74, 11] and 3D ConvNets [7, 72, 59, 61]. However, both types of models have difficulty in this setting; our results below suggest that they cannot fully capture the compositionality of action and objects. These approaches focus on extracting features for the whole scene and do not explicitly recognize objects as individual entities; scene-level convolutional operators may rely more on spatial appearance than on temporal transformations or geometric relations, since the former alone is often highly predictive of the action class [52, 3].
Recently, researchers have investigated building spatial-temporal graph representations of videos [65, 66, 9, 25], leveraging recently proposed graph neural networks. These methods take dense object proposals as graph nodes and learn the relations between them. While this certainly opens a door for bringing relational reasoning to video understanding, the improvement over the 3D ConvNet baselines is not very significant. Generally, these methods have employed non-specific object graphs based on a large set of object proposals in each frame, rather than sparse semantically grounded graphs which model the specific interaction of an agent and constituent objects in an action.
In this paper, we propose a model based on a sparse and semantically-rich object graph learned for each action. We train our model with accurately localized object boxes in the demonstrated action at training time. Semantic role labels at training time can either be provided explicitly by annotators, or simply inferred from the action label, if the label is a string including the names of the constituent objects. Our model learns explicit relations between subjects and objects; these turn out to be the key for successful compositional action recognition. We leverage state-of-the-art object detectors to accurately locate the subject (agent) and constituent objects in the videos, perform multi-object tracking on them and form multiple tracklets for boxes belonging to the same instance. As shown in Figure 1, we localize the hand and the objects manipulated by the hand. We track the objects over time, and boxes belonging to the same instance are shown in the same color.
Our Spatial-Temporal Interaction Network (STIN) reasons on candidate sparse graphs found from these detection and tracking results. In general, the detector might return more objects present in the scene than are necessary for understanding the action. One potential solution is to search over the possible combinations of object candidates in order to infer the action. In our experiments, we train a detector on our dataset which contains labels only for the objects relevant to the action, as well as the hands of the actor. Thus searching over all of the configurations is not necessary. Our model takes the locations and shapes of object and subject in each frame as inputs. It first performs spatial interaction reasoning on them by propagating the information among the subjects and objects. Once the box representations are updated, we perform temporal interaction reasoning over the boxes along the same tracklet, which encodes the transformation of objects and the relation between subjects and objects in time. Finally, we compose the trajectories for the agent and the objects together to understand the action. Our model is designed for activities which have prominent interaction dynamics between a subject or agent (e.g., hand) and constituent objects; for activities where no such dynamics are clearly discernible with current detectors (e.g., pouring water, crushing paper), our model falls back to leverage baseline spatio-temporal scene representations.
We introduce the Something-Else task, which extends the Something-Something dataset with new annotations and a new compositional split. In our compositional split, methods are required to recognize an action when performed with unseen objects, i.e., objects which do not appear together with this action at training time. Thus methods are trained on “Something”, but are tested on their ability to generalize to “Something-Else”. Each action category in this dataset is described as a phrase composed of the same verb and different nouns. We reorganize the dataset for compositional action recognition and model the dynamics of inter-object geometric configurations across time per action. We investigate compositional action recognition tasks in both a standard setting (where training and testing are with the same categories) and a few-shot setting (where novel categories are introduced with only a few examples). To support these two tasks, we collect and will release annotations on object bounding boxes for each video frame. Surprisingly, we observe that even with only low-dimensional coordinate inputs, our model achieves comparable results and improves over the appearance-based models in the few-shot setting by a significant margin.
Our contributions include: (i) A Spatial-Temporal Interaction Network which explicitly models the changes of geometric configurations between agents and objects; (ii) Two new compositional tasks for testing model generalizability, together with dense object bounding box annotations in videos; (iii) Substantial performance gain over appearance-based models on compositional action recognition.
Action recognition is of central importance in computer vision. Over the past few years, researchers have been collecting larger-scale datasets including Jester, UCF101, Charades, Sports1M and Kinetics. Boosted by the scale of data, modern deep learning approaches, including two-stream ConvNets [55, 64], Recurrent Neural Networks [74, 12, 46, 5] and 3D ConvNets [31, 7, 14, 71, 60, 61, 13], have been developed and show encouraging results on these datasets. However, a recent study shows that most of the current models trained with the above-mentioned datasets focus not on temporal reasoning but on the appearance of the frames: reversing the order of the video frames at test time leads to almost the same classification result. In light of this problem, the Something-Something dataset was introduced to recognize action independent of object appearance. To push this direction forward, we propose the compositional action recognition task for this dataset and provide object bounding box annotations.
The idea of compositionality in computer vision originates from Hoffman’s research on Parts of Recognition. Following this work, models with pictorial structures have been widely studied in traditional computer vision [15, 76, 29]. For example, Felzenszwalb et al. propose a deformable part-based model which organizes a set of part classifiers in a deformable manner for object detection. The idea of composing visual primitives and concepts has also been brought back in the deep learning community recently [62, 45, 36, 1, 32, 28]. For example, Misra et al. propose a method to compose classifiers of known visual concepts and apply this model to recognize objects with unseen combinations of concepts. Motivated by this work, we propose to explicitly compose the subjects and objects in a video and reason about the relationships between them to recognize the action with unseen combinations of verbs and nouns.
The study of visual relationships has a long history in computer vision [23, 73, 50]. Recent works have shown relational reasoning with deep networks on images [19, 27, 53, 33]. For example, Gkioxari et al. propose to accurately detect the relations between the objects together with state-of-the-art object detectors. The idea of relational reasoning has also been extended to video understanding [65, 66, 68, 25, 58, 18, 2, 67, 30]. For instance, Wang et al. apply a space-time region graph to improve action classification in cluttered scenes. Instead of only relying on dense “objectness” region proposals, Wu et al. further extend this graph model with accurate human detection and reasoning over a longer time range. Motivated by these works, we build our spatial-temporal interaction network to reason about the relations between subjects and objects based on accurate detection and tracking results. Our work is also related to the Visual Interaction Network, which models the physical interactions between objects in a simulated environment.
To further illustrate the generalizability of our approach, we also apply our model in a few-shot setting. Few-shot image recognition has become a popular research topic in recent years [16, 56, 63, 8, 17, 47]. For example, Chen et al. have re-examined recent approaches in few-shot learning and found a simple baseline model which is very competitive compared to meta-learning approaches [16, 40, 51]. Going beyond the image domain, researchers have also investigated few-shot learning in videos [22, 6]. For example, Guo et al. propose to perform KNN on object graph representations for few-shot 3D action recognition. Motivated by these works, we also adopt our spatial-temporal interaction network for few-shot video classification, using the same learning scheme as the simple baseline of Chen et al.
We present Spatial-Temporal Interaction Networks (STIN) for compositional action recognition. Our model utilizes a generic detector and tracker to build object-graph representations that explicitly include hand and constituent object nodes. We perform spatial-temporal reasoning among these bounding boxes to understand how the relations between subjects and objects change over time for a given action (Figure 2). By explicitly modeling the transformation of object geometric relations in a video, our model can effectively generalize to videos with unseen combinations of verbs and nouns as demonstrated in Figure 3.
Given a video with $T$ frames, we first perform object detection on these video frames, using a detector which detects hands and generic candidate constituent objects. The generic object detector is trained on the set of all objects in the train split of the dataset as one class, and all hands in the training data as a second class. Assuming that we have detected $N$ instances, including the hands and the objects manipulated by the hands in the scene, we then perform multi-object tracking to find correspondences between boxes in different video frames. We extract two types of feature representation for each box: (a) bounding box coordinates; and (b) an object identity feature. Both of these features are designed for compositional generalization and for avoiding object appearance bias.
Bounding box coordinates. One straightforward way to represent an object and its movement is to use its location and shape. We use the center coordinate of each object along with its height and width as a quadruple, and forward it to a Multi-Layer Perceptron (MLP), yielding a $d$-dimensional feature. Surprisingly, this simple representation alone turns out to be highly effective in action recognition. Details and ablations are presented in Section 5.
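The box encoding described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the normalization by image size, the feature size `FEATURE_DIM`, the random initialization, and the helper names `box_to_quadruple`, `init_layer`, and `mlp` are all hypothetical and not taken from the paper.

```python
import random

FEATURE_DIM = 8  # hypothetical feature size, kept small for illustration


def box_to_quadruple(x1, y1, x2, y2, img_w, img_h):
    """Convert a corner-format box to (center x, center y, height, width).

    Dividing by the image size is an assumption made for this sketch.
    """
    return [
        (x1 + x2) / 2.0 / img_w,
        (y1 + y2) / 2.0 / img_h,
        (y2 - y1) / img_h,
        (x2 - x1) / img_w,
    ]


def init_layer(n_in, n_out, rng):
    """Randomly initialized fully connected layer as (weight rows, bias)."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def mlp(x, layers):
    """Apply each fully connected layer followed by a ReLU."""
    for weight, bias in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weight, bias)]
    return x


rng = random.Random(0)
# Two layers, matching the 2-layer MLP mentioned in the training details.
layers = [init_layer(4, FEATURE_DIM, rng), init_layer(FEATURE_DIM, FEATURE_DIM, rng)]
quad = box_to_quadruple(10, 20, 50, 100, img_w=100, img_h=200)
feature = mlp(quad, layers)  # one coordinate feature per box
```

Because the input is only four numbers per box, the resulting feature carries no appearance information, which is the property the text relies on.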
Object identity embedding. In addition to the object coordinate feature, we also utilize a learnable $d$-dimensional embedding to represent the identities of objects and subjects. We define three types of embedding: (i) subject (or equivalently, agent) embedding, i.e., representing hands in an action; (ii) object embedding, i.e., representing the objects involved in the action; (iii) null embedding, i.e., representing dummy boxes irrelevant to the action. The three embeddings are initialized from an independent multivariate normal distribution. The identity embedding can be concatenated with the box coordinate features as the input to our model. Since the identity (category) of the instances is predicted by the object detector, we can combine coordinate features with embedding features accordingly. Gradients can be directly backpropagated to the embeddings during training for the action recognition task. We note that these embeddings do not depend on the appearance of the input videos.
We find that combining box coordinate feature with identity feature significantly improves the performance of our model. Since we are using a general object embedding for all kinds of objects, this helps the model to generalize across different combinations of verbs and nouns in a compositional action recognition setting.
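As a rough sketch of how the three identity embeddings could be combined with a coordinate feature, the snippet below keeps one vector per role and concatenates it to the box feature. The embedding size `EMBED_DIM`, the role names, and the function names are assumptions for illustration; in a real model the vectors would be trainable parameters updated by backpropagation.

```python
import random

EMBED_DIM = 4  # hypothetical embedding size
ROLES = ("subject", "object", "null")


def init_identity_embeddings(dim, seed=0):
    """One vector per role, drawn from independent standard normals."""
    rng = random.Random(seed)
    return {role: [rng.gauss(0.0, 1.0) for _ in range(dim)] for role in ROLES}


def box_input(coord_feature, role, embeddings):
    """Concatenate a box's coordinate feature with its role embedding."""
    return list(coord_feature) + list(embeddings[role])


embeddings = init_identity_embeddings(EMBED_DIM)
hand = box_input([0.3, 0.3, 0.4, 0.4], "subject", embeddings)
cup = box_input([0.6, 0.5, 0.2, 0.2], "object", embeddings)
book = box_input([0.1, 0.8, 0.1, 0.3], "object", embeddings)
```

Note that `cup` and `book` share the same embedding suffix: every object uses one generic vector, regardless of its category.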
Robustness to Unstable Detection. In cases where the object detector is not reliable and the number of detected objects is larger than a fixed number $N$, we can perform object configuration search during inference. Each time, we randomly sample $N$ object tracklets and forward them to our model. We perform classification based on the most confident configuration, i.e., the one with the highest score. However, in our current experiments, we can already achieve significant improvement without this process.
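The configuration search could look roughly like the following. This is a hedged sketch, not the authors' procedure: the number of trials, the `score_fn` interface (a mapping from class name to score), and the toy scoring function are all hypothetical, and integers stand in for tracklets.

```python
import random


def search_configuration(tracklets, n_keep, score_fn, n_trials=20, seed=0):
    """Sample subsets of tracklets and keep the most confident prediction."""
    rng = random.Random(seed)
    best_label, best_conf = None, float("-inf")
    for _ in range(n_trials):
        subset = rng.sample(tracklets, n_keep)
        scores = score_fn(subset)            # hypothetical: class name -> score
        label = max(scores, key=scores.get)
        if scores[label] > best_conf:
            best_label, best_conf = label, scores[label]
    return best_label, best_conf


def toy_score_fn(subset):
    """Stand-in for the trained model: larger tracklet ids score higher."""
    return {"moving": sum(subset) / 100.0, "static": 0.005}


label, conf = search_configuration(list(range(6)), n_keep=2, score_fn=toy_score_fn)
```

Random sampling keeps the cost bounded by `n_trials`; an exhaustive search over all subsets would also be possible when the number of detections is small.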
Given $T$ video frames and $N$ objects per frame, we denote the set of object features as $X = (x_1^1, \dots, x_N^1, x_1^2, \dots, x_N^T)$, where $x_i^t$ represents the feature of object $i$ in frame $t$. Our goal is to perform spatial-temporal reasoning in $X$ for action recognition. As illustrated in Figure 2(a), we first perform spatial interaction reasoning on objects in each frame, then we connect these features together with temporal interaction reasoning.
Spatial interaction module. We perform spatial interaction reasoning among the $N$ objects in each frame. For each object $x_i^t$, we first aggregate the features from the other $N-1$ objects by averaging them, then we concatenate the aggregated feature with $x_i^t$. This process can be represented as
$$ f(x_i^t) = \mathrm{ReLU}\Big( W_f^\top \Big[ x_i^t, \ \frac{1}{N-1} \sum_{j \neq i} x_j^t \Big] \Big), $$
where $[\cdot, \cdot]$ denotes concatenation of two features in the channel dimension and $W_f$ denotes learnable weights implemented by a fully connected layer. We visualize this process in Figure 2(b).
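To make the spatial step concrete, here is a minimal sketch for a single frame: each object's feature is concatenated with the mean of the other objects' features and passed through one fully connected layer. The ReLU after the layer, the hand-picked toy weights, and the zero vector used when a frame holds a single object are assumptions of this sketch.

```python
def spatial_interaction(frame_feats, weight, bias):
    """Update every object feature in one frame using the other objects."""
    updated = []
    for i, x_i in enumerate(frame_feats):
        others = [x for j, x in enumerate(frame_feats) if j != i]
        if others:
            mean_others = [sum(col) / len(others) for col in zip(*others)]
        else:
            mean_others = [0.0] * len(x_i)  # assumption: nothing to aggregate
        combined = list(x_i) + mean_others  # channel-wise concatenation
        # Fully connected layer followed by a ReLU (nonlinearity assumed here).
        updated.append([max(0.0, sum(w * v for w, v in zip(row, combined)) + b)
                        for row, b in zip(weight, bias)])
    return updated


frame = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]           # three objects, 2-d features
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # toy weights: 4 inputs -> 2 outputs
bias = [0.0, 0.0]
out = spatial_interaction(frame, weight, bias)
# out == [[1.0, 1.5], [0.0, 1.0], [2.0, 0.5]]
```

With these toy weights the first output channel copies the object's own first coordinate and the second copies the neighbors' mean second coordinate, which makes the information flow easy to check by hand.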
Temporal interaction module. Given the aggregated features of objects in each individual frame, we perform temporal reasoning on top of these features. As tracklets have been formed previously, we can directly link objects of the same instance across time. Given the objects $(x_i^1, \dots, x_i^T)$ in the same tracklet $i$, we compute the feature of the tracklet as $g(x_i^1, \dots, x_i^T)$: we first concatenate the object features, then forward the combined feature to another MLP network. Given the set of temporal interaction results, we aggregate them together for action recognition as
$$ p(X) = W_p^\top \, h\big( \{ g(x_i^1, \dots, x_i^T) \}_{i=1}^{N} \big), $$
where $h$ is a function combining and aggregating the information of tracklets. In this study, we experiment with two different approaches to combine tracklets: (i) Design $h$ as a simple averaging function to prove the effectiveness of our spatial-temporal interaction reasoning. (ii) Utilize a non-local block as the function $h$. The non-local block encodes the pairwise relationships between every two trajectory features before averaging them. In our implementation, we adopt three non-local blocks succeeded by convolutional kernels. We use $W_p$ as our final classifier with cross-entropy loss.
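The temporal step with the simple averaging variant can be sketched as follows. This is an illustrative approximation only: a single fully connected layer with ReLU stands in for the tracklet MLP, the weights are toy values, and the function names are hypothetical. The non-local variant is not shown.

```python
import math


def linear(x, weight, bias):
    """Plain fully connected layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def temporal_interaction(tracklet, weight, bias):
    """Concatenate one tracklet's per-frame features, then apply FC + ReLU."""
    flat = [v for frame_feat in tracklet for v in frame_feat]
    return [max(0.0, v) for v in linear(flat, weight, bias)]


def aggregate_average(tracklet_feats):
    """Averaging variant of the aggregation function over tracklets."""
    n = len(tracklet_feats)
    return [sum(col) / n for col in zip(*tracklet_feats)]


def classify(video_feat, weight, bias):
    """Linear classifier followed by a softmax over action classes."""
    logits = linear(video_feat, weight, bias)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


tracklets = [                       # 2 tracklets, 2 frames each, 2-d features
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [1.0, -5.0]],
]
w_g = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy weights
b_g = [0.0, 0.0]
per_tracklet = [temporal_interaction(t, w_g, b_g) for t in tracklets]
video_feat = aggregate_average(per_tracklet)         # [1.5, 3.5]
probs = classify(video_feat, [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0])
```

Training would minimize the cross-entropy between `probs` and the ground-truth action label.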
Combining video appearance representation. Besides explicitly modeling the transformation of relationships of subjects and objects, our spatial-temporal interaction model can be easily combined with any video-level appearance representation. The presence of appearance features especially helps the action classes without prominent inter-object dynamics. To achieve this, we first forward the video frames to a 3D ConvNet backbone, which takes the frames as input and extracts a spatial-temporal feature representation. We perform average pooling across space and time on this feature representation, yielding a single feature vector per video. Video appearance representations are concatenated with the aggregated object representations before being fed into the classifier.
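A minimal sketch of the fusion step is given below, assuming the 3D ConvNet output is available as a nested list indexed by time, row, column, and channel. The layout, the helper names, and the toy values are assumptions for illustration.

```python
def global_average_pool(feature_map):
    """Average a [time][row][col][channel] feature map over time and space."""
    cells = [cell for frame in feature_map for row in frame for cell in row]
    return [sum(col) / len(cells) for col in zip(*cells)]


def fuse(appearance_feat, object_feat):
    """Concatenate the appearance vector with the object-graph vector."""
    return list(appearance_feat) + list(object_feat)


feature_map = [
    [[[1.0, 0.0], [3.0, 0.0]]],   # frame 0: 1 row, 2 columns, 2 channels
    [[[5.0, 4.0], [7.0, 4.0]]],   # frame 1
]
appearance = global_average_pool(feature_map)    # [4.0, 2.0]
fused = fuse(appearance, [0.5, 0.25, 0.125])     # input to the classifier
```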
To illustrate the idea of compositional action recognition, we adopt the Something-Something V2 dataset and create new annotations and splits from it. We name action recognition on the new splits the “Something-Else task”. We first discuss the limitations of the current Something-Something dataset organization and then introduce our proposed dataset reorganization and tasks.
The Something-Something V2 dataset contains 174 categories of common human-object interactions. Collected via Amazon Mechanical Turk in a crowd-sourcing manner, the protocol allows turkers to pick a candidate action category (verb), and perform and upload a video accordingly with arbitrary objects (noun). The lack of constraints in choosing the objects naturally results in a large variability in the dataset. There are different descriptions for objects in total. The original split does not consider the distribution of the objects in the training and the testing set; instead, it ensures that the videos recorded by the same person are in either the training or the testing set but not both. While this setting reduces environment and individual bias, it ignores the fact that the combination of verbs and nouns presented in the testing set may have been encountered in the training stage. The high performance obtained in this setting might indicate that models have learned the actions coupled with the objects that typically occur with them, yet it does not reflect the generalization capacity of models to actions with novel objects.
Compositional Action Recognition. In contrast to randomly assigning videos into training or testing sets, we present a compositional action recognition task. In our setting, the combinations of verb (action) and nouns in the training set do not exist in the testing set. We define a subset of frequent object categories as those appearing in more than 100 videos in the dataset. We split the frequent object categories into two disjoint groups, 1 and 2. Besides objects, action categories are divided into two groups, A and B, as well. These action categories are organized hierarchically, e.g., “moving something up” and “moving something down” belong to the same super-class. We randomly assign each action category into one of the two groups, and at the same time enforce that the actions belonging to the same super-class are assigned into the same group.
Given the splits of groups, we combine action group A with object group 1, and action group B with object group 2, to form the training set, termed 1A + 2B. The validation set is built by flipping the combination into 1B + 2A. Different combinations of verbs and nouns are thus divided into training or testing splits in this way. The statistics of the training and the validation sets under the compositional setting are shown in the second row of Table 1.
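The split construction can be illustrated with a short sketch. Action groups are labeled `A`/`B` and object groups `1`/`2` for illustration; the dictionary-based video records, the example category names, and the function name are hypothetical and not taken from the released data format.

```python
def compositional_split(videos, action_group, object_group):
    """Assign videos to train/validation by (action group, object group)."""
    train_pairs = {("A", 1), ("B", 2)}   # the flipped pairs go to validation
    train, val = [], []
    for video in videos:
        a = action_group.get(video["action"])
        o = object_group.get(video["object"])
        if a is None or o is None:
            continue  # action or object outside the grouped (frequent) categories
        (train if (a, o) in train_pairs else val).append(video)
    return train, val


action_group = {"moving something up": "A", "pushing something": "B"}
object_group = {"cup": 1, "book": 2}
videos = [{"action": a, "object": o}
          for a in action_group for o in ("cup", "book", "rare object")]
train, val = compositional_split(videos, action_group, object_group)

train_combos = {(v["action"], v["object"]) for v in train}
val_combos = {(v["action"], v["object"]) for v in val}
```

By construction `train_combos` and `val_combos` never intersect, even though every action and every frequent object appears on both sides.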
Table 1: Statistics of the task splits (columns: Task Split, # Classes, Training, Validation).
Few-shot Compositional Action Recognition. The compositional split challenges the network to generalize over object appearance. We further consider a few-shot dataset split, which indicates how well a trained action recognition model can generalize to novel action categories with only a few training examples. We assign the action classes in the Something-Something V2 dataset into a base split and a novel split, yielding 88 classes in the base set and 86 classes in the novel set. We randomly allocate a fraction of the videos from the base set to form a validation set and use the rest of the videos as the base training set. We then randomly select $k$ examples for each category in the novel set whose labels are present in the training stage, and the remaining videos from the novel set are designated as the validation set. We ensure that the object categories in the $k$-shot training videos do not appear in the novel validation set. In this way, our few-shot setting additionally challenges models to generalize over object appearance. We term this task few-shot compositional recognition. We set $k$ to 5 or 10 in our experiments. The statistics are shown in Table 1.
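One possible way to build such a novel-set split is sketched below. How object disjointness is enforced is not spelled out in the text; here validation videos whose object also appears in the sampled few-shot set are simply dropped, which is an assumption of this sketch, as are the record fields and the function name.

```python
import random
from collections import defaultdict


def few_shot_split(novel_videos, k, seed=0):
    """Sample k training videos per novel class; keep object-disjoint validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for video in novel_videos:
        by_class[video["action"]].append(video)

    train = []
    for action in sorted(by_class):
        candidates = by_class[action]
        train.extend(rng.sample(candidates, min(k, len(candidates))))

    train_ids = {v["id"] for v in train}
    train_objects = {v["object"] for v in train}
    # Assumption: drop validation videos that share an object with the k-shot set.
    val = [v for v in novel_videos
           if v["id"] not in train_ids and v["object"] not in train_objects]
    return train, val


novel = [{"id": f"{action}-{i}", "action": action, "object": f"obj{i}"}
         for action in ("tearing", "poking") for i in range(6)]
train, val = few_shot_split(novel, k=2)
```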
Bounding-box annotations. We annotated 180,049 videos of the Something-Something V2 dataset. For each video, we provide a bounding box of the hand (hands) and objects involved in the action. In total, 8,183,381 frames with 16,963,135 bounding boxes are annotated, with an average of 2.41 annotations per frame and 94.21 per video. Other large-scale video datasets use bounding box annotation in applications involving human-object interaction, action recognition, and tracking.
We perform experiments on the two proposed tasks: compositional action recognition and few-shot compositional action recognition.
We describe the implementation details of the detector and the tracker below.
Detector. We choose Faster R-CNN [49, 69] with Feature Pyramid Network (FPN) and ResNet-101 backbone. The model is first pre-trained with the COCO dataset, then finetuned with our object box annotations on the Something-Something dataset. During finetuning, only two categories are registered for the detector: hand and object involved in action.
Tracker. Once we have the object detection results, we apply multi-object tracking to find correspondence between the objects in different frames. The multi-object tracker is deliberately minimal to keep the system as simple as possible. Specifically, we use the Kalman Filter and the Kuhn-Munkres (KM) algorithm for tracking objects. At each time step, the Kalman Filter predicts plausible locations of instances in the current frame based on previous tracks, then the predictions are matched with single-frame detections by the KM algorithm.
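The matching step of such a tracker can be sketched as follows. Several things here are assumptions rather than details from the paper: IoU is used as the affinity between predicted and detected boxes, the `min_iou` threshold is hypothetical, and the optimal assignment is found by brute-force enumeration, which returns the same assignment the KM algorithm would for small inputs but does not scale. The Kalman prediction itself is taken as given.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(predicted, detections, min_iou=0.1):
    """Return (track index, detection index) pairs maximizing total IoU."""
    n_p, n_d = len(predicted), len(detections)
    if n_p == 0 or n_d == 0:
        return []
    if n_p <= n_d:
        candidates = ([(i, p[i]) for i in range(n_p)]
                      for p in permutations(range(n_d), n_p))
    else:
        candidates = ([(p[j], j) for j in range(n_d)]
                      for p in permutations(range(n_p), n_d))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(iou(predicted[i], detections[j]) for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Pairs with too little overlap are left unmatched.
    return [(i, j) for i, j in best_pairs
            if iou(predicted[i], detections[j]) >= min_iou]


predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]   # boxes predicted from existing tracks
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (80, 80, 90, 90)]
matches = match(predicted, detections)           # [(0, 1), (1, 0)]
```

Detections left unmatched (index 2 above) would typically start new tracklets, and a production system would replace the enumeration with a proper KM/Hungarian solver.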
Training details. The MLP in our model contains 2 layers with a fixed output dimension. We train all our models for 50 epochs with learning rate 0.01 using SGD with 0.0001 weight decay and 0.9 momentum; the learning rate is decayed by a factor of 10 at epochs 35 and 45.
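The step schedule described above can be written as a small function. Whether epochs are counted from 0 or 1 is not stated in the text; this sketch assumes zero-based epochs, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(35, 45), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


schedule = [learning_rate(epoch) for epoch in range(50)]
```

Under this assumption the rate is 0.01 for epochs 0 to 34, roughly 0.001 for epochs 35 to 44, and roughly 0.0001 for the last five epochs.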
The experiments aim to explore the effectiveness of different components in our Spatial-Temporal Interaction Networks for compositional action recognition. We also compare and combine our approach with the 3D ConvNet model as follows.
STIN: Spatial-Temporal Interaction Network with bounding box coordinates as input. Average pooling is used as the aggregation operator.
STIN + OIE: The STIN model takes both box coordinates and Object Identity Embeddings (OIE) as input.
STIN + OIE + NL: STIN + OIE with non-local operators as the aggregation operator.
I3D: A 3D ConvNet model with ResNet-50 backbone, with state-of-the-art performance.
STRG: Space-Time Region Graph (STRG) model with only the similarity graph.
I3D + STIN + OIE + NL: Combining the appearance feature from the I3D model and the feature from the STIN + OIE + NL model by joint learning.
I3D, STIN + OIE + NL: A simple ensemble model combining the separately trained I3D model and the trained STIN + OIE + NL model.
STRG, STIN + OIE + NL: An ensemble model combining the STRG model and the STIN + OIE + NL model, both trained separately.
Our experiments with STIN use either ground-truth boxes or the boxes detected by the object detector.
Figure 4 visualizes examples of how our STIN model and the I3D model perform. For the top-left example “Pretending to pick smth up”, our STIN model can keep tracking how the hand moves to understand the action. For the bottom-right example “Pulling smth onto smth”, our model can easily predict the action by seeing that one object box is moved onto another bounding box. These examples indicate that our model takes full advantage of the geometric arrangement of objects.
We first perform our experiments on the original Something-Something V2 split. We test our I3D baseline model and the STIN model with ground-truth object bounding boxes for action recognition. As shown in Table 2, our I3D baseline is much better than the recently proposed TRN model and is very close to the state-of-the-art TSM with RGB inputs. This suggests that improvements over our I3D baseline in the compositional action recognition experiments are meaningful.
The result of our STIN + OIE model with ground-truth annotations is reported in Table 2. We can see that with only coordinate inputs, our performance is comparable with TRN. After combining with the I3D baseline model, we can improve the baseline model by . This indicates the potential of our model and bounding box annotations even for the standard action recognition task.
We further evaluate our model on the compositional action recognition task. As described in Section 4, we take the split formed by pairing each action group with one object group as the training set, and the split formed by the flipped pairing as the validation set. They contain 55k videos and 57k videos respectively. All 174 action categories are used.
We first experiment with using the ground-truth object bounding boxes for the STIN model, as reported in Table 3(a). To illustrate the difficulty of our compositional task, we also report the results on a “shuffled” split of the videos: we use the same candidate videos but shuffle them randomly and form a new training and validation set. Note that the number of training videos is the same as in the compositional split. The performance of the I3D baseline sharply drops from the shuffled setting to the compositional setting by almost in terms of top-1 accuracy. On the shuffled split, although our STIN model trails I3D, it performs better than I3D in the compositional split. By applying the Object Identity Embedding (OIE), we can improve the STIN model by . This attests to the importance of explicit reasoning about the interactions between the agent and the objects. We can further combine our model with the I3D baseline: the joining of two models yields improvement over the baseline and the ensemble model significantly improves over the appearance only model (I3D) by .
Given these encouraging results, we build our model on object bounding boxes obtained via object detection and tracking, and show its results in Table 3(b). We observe that OIE still boosts the STIN model by . By combining I3D with our model, we observe improvement over I3D with joint learning and improvement with model ensemble. That we obtain better results when using an ensemble to combine the two models might be attributed to the fact that the two models converge at different paces during training, causing optimization difficulty for joint training.
We also see that by replacing the base network with STRG we obtain some improvement in performance over I3D. After combining the STRG model with our model (STRG, STIN + OIE + NL), we can still achieve a large relative improvement ( better than STRG). This shows that our method is complementary to the existing graph model, since our graph is modeling the changes of geometric arrangements between subjects and objects.
For the few-shot compositional action recognition task, we have 88 base categories and 86 novel categories as described in Section 4, instead of all 174 action categories. We first train our model with the videos from the base categories, then finetune on few-shot samples from the novel categories. We evaluate the model on the novel categories with more than 50k videos to benchmark the generalizability of our model.
For finetuning, instead of following the $n$-way, $k$-shot setting in few-shot learning, we directly finetune our model with all the novel categories. For example, if we perform 5-shot training, then the number of training examples is $5 \times 86 = 430$. During the fine-tuning stage, we randomly initialize the last classification layer and train this layer while fixing all other layers. We train the network for 50 epochs with a fixed learning rate . We perform both 5-shot and 10-shot learning in our experiment.
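The finetuning recipe (frozen backbone, freshly initialized last layer) can be illustrated with a tiny softmax classifier trained on fixed feature vectors. This is a sketch under assumptions: the learning rate `lr`, the initialization scale, the per-example SGD updates, and the toy two-class data are hypothetical, and the frozen network is represented only by its precomputed output features.

```python
import math
import random

N_NOVEL_CLASSES = 86
assert 5 * N_NOVEL_CLASSES == 430  # 5-shot training set size over all novel classes


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(x, weight, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def finetune_last_layer(features, labels, n_classes, lr=0.1, epochs=50, seed=0):
    """Train only a randomly initialized linear classifier on frozen features."""
    rng = random.Random(seed)
    dim = len(features[0])
    weight = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(x, weight, bias))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                bias[c] -= lr * grad
                for k in range(dim):
                    weight[c][k] -= lr * grad * x[k]
    return weight, bias


def predict(x, weight, bias):
    scores = logits_for(x, weight, bias)
    return scores.index(max(scores))


features = [[1.0, 0.0], [0.0, 1.0]]   # toy frozen features, one per class
labels = [0, 1]
weight, bias = finetune_last_layer(features, labels, n_classes=2)
predictions = [predict(x, weight, bias) for x in features]
```

Since only `weight` and `bias` are updated, the representation learned on the base categories is left untouched, which is what lets the few-shot results probe its generalization.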
We report our results with ground-truth object boxes in Table 4(a). Since we have a small validation set for the base categories, to reduce the variance of evaluations, we also evaluate our method on the base set before few-shot training is initiated. We can see that our full model (STIN+OIE+NL) outperforms the I3D model by almost in both 5-shot and 10-shot learning setting, even though our approach trails I3D on the validation set in base categories. This indicates that the I3D representation can easily overfit to object appearance while our model generalizes much better. We also observe that the OIE and the non-local block individually and cooperatively boost the few-shot performance. When combining with I3D with model ensemble, we achieve improvement on 5-shot and on 10-shot setting.
The results with object detection boxes are shown in Table 4(b). Although the best model STIN+OIE+NL trails I3D on base evaluation by a notably large margin, the performance in the few-shot setting is much closer. This indicates our model generalizes better in the compositional setting. When combining our model with the I3D model, joint learning yields improvement and model ensemble yields improvement in the 5-shot setting. We observe similar improvement in the 10-shot setting (). By replacing the I3D base network with STRG, our method (STRG, STIN + OIE + NL) still gives large improvement over STRG ( in 5-shot setting and in 10-shot).
|STIN + OIE (GT)||Compositional||28.5||54.1|
|STIN + OIE (Detector)||Compositional||20.3||40.4|
We push the compositional setting to an extreme, where we only select for training the videos whose actions involve the object category “box” ( videos in action categories). The rest of the videos form the validation set (170K videos). The objective of this experiment is to examine the generalizability of our STIN model, even when the training set is strongly biased toward one type of object.
The results are summarized in Table 5. Our model with ground-truth boxes almost doubles the I3D performance. Our model with detection boxes is also better than I3D. This attests to the advantage of our model in terms of generalizability across different object appearances.
We compare the performance difference between our Spatial-Temporal Interaction Networks and the I3D model for individual action categories. In Figure 5, we visualize the five action categories in which STIN surpasses or trails the I3D model by the largest margin. A priori, actions which are closely associated with the transformation of objects’ geometric relations should be better represented by the STIN model than by I3D. We can see that the actions in which STIN outperforms I3D by the largest margin are the ones that directly describe the movements of objects, such as “put something” and “take something”. On the other hand, STIN fails when actions are associated more with changes in the intrinsic properties of an object, such as “poking” and “tearing”.
Motivated by the appearance bias in current activity recognition models, we propose a new model for action recognition based on sparse semantically grounded subject-object graph representations. We validate our approach on novel compositional and few shot settings in the Something-Else dataset; our model is trained with new constituent object grounding annotations which will be made available. Our STIN approach models the interaction dynamics of objects composed in an action and outperforms all baselines.
Acknowledgement: Prof. Darrell’s group was supported in part by DoD, NSF, BAIR, and BDD. We would like to thank Fisher Yu and Haofeng Chen for helping set up the annotation pipeline. We would also like to thank Anna Rohrbach and Ronghang Hu for many helpful discussions.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
Video action transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2019.
3D convolutional neural networks for human action recognition. TPAMI, 2013.