Multi-modal Capsule Routing for Actor and Action Video Segmentation Conditioned on Natural Language Queries

12/02/2018 ∙ by Bruce McIntosh, et al. ∙ University of Central Florida

In this paper, we propose an end-to-end capsule network for pixel-level localization of actors and actions present in a video. The localization is performed based on a natural language query through which an actor and action are specified. We propose to encode both the video and the textual input in the form of capsules, which provide a more effective representation than standard convolution-based features. We introduce a novel capsule-based attention mechanism for fusing video and text capsules for text-selected video segmentation. The attention mechanism is performed via joint EM routing over video and text capsules for text-selected actor and action localization. Existing works on actor-action localization mainly focus on localization in a single frame instead of the full video. In contrast, we propose to perform the localization on all frames of the video. To validate the potential of the proposed network for actor and action localization on all the frames of a video, we extend an existing actor-action dataset (A2D) with annotations for all the frames. The experimental evaluation demonstrates the effectiveness of the proposed capsule network for text-selective actor and action localization in videos, and it also improves upon the performance of the existing state-of-the-art works on single frame-based localization.




1 Introduction and Related Work

Video semantic segmentation is an active topic of research in computer vision, as it serves as a foundation for various real-world applications, such as autonomous driving. It is a very challenging problem in which every pixel in the video must be assigned a semantic category. Video action detection is a subset of this problem, where we are only interested in segmenting/localizing the actors and objects involved in the activities present in a video. The problem becomes even more interesting when we want to focus on only selected actors in the video. In this work, we focus on selective localization of actors and actions in a video based on a textual query, as seen in Figure 1.

We have recently seen good progress in the task of action localization [7, 10, 5, 14, 27, 1, 2], and it is mainly credited to the success of deep learning along with the availability of large scale datasets [5, 28, 12]. However, one limitation of these datasets is that the actor is mainly a person performing various activities. Xu et al. [29] introduce an actor-action video dataset (A2D), which has several actor and action pairs; this dataset presents several challenges as there could be different types of actors besides just humans performing the actions. Also, there could be multiple actors present in a scene, which is quite challenging when compared to typical datasets, where we have only one actor performing the action. Their experiments on A2D showed that a joint inference over actor and action outperforms methods that treat them independently. Ji et al. [13] explored the role of optical flow along with RGB data and proposed an end-to-end deep network for joint actor and action localization.

Gavrilyuk et al. [4] recently extended the A2D dataset with human generated sentences, describing the actors and actions in the video, and proposed the task of actor and action segmentation from a sentence. Their method uses a convolutional network for both visual as well as textual inputs and predicts localization on one frame of the video. We propose a different approach, where we make use of capsules for both visual as well as text encoding, and perform localization on the full video instead of just one frame, in order to fully utilize the spatiotemporal information captured by the video.

Hinton et al. first introduced the idea of capsules in [9], and subsequently capsules were popularized in [24], where dynamic routing for capsules was proposed. This was further extended in [8], where a more effective EM routing algorithm was introduced. Recently, capsule networks have shown state-of-the-art results for human action localization in video [2], object segmentation in medical images [18], and text classification [31]. We propose to extend the use of capsule networks into the multi-modal domain, where the segmentation and localization of objects in video are conditioned on a natural language input. We introduce a novel capsule based attention mechanism for fusion of video and text capsules for text selected segmentation.

There are several existing works focusing on learning the interaction between text and visual data [19, 30, 3, 6]. These works are mainly focused on whole image and video level detections as opposed to pixel level segmentation. Hu et al. [11] introduced the problem of segmenting images based on a natural language expression; their method for merging images and text in a convolutional neural network (CNN) concatenates features extracted from both modalities and follows this with a convolution. Li et al. [20] propose a different approach to merge these two modalities for the task of tracking a target in a video; they use an element-wise multiplication between the image feature maps and the sentence features in a process called dynamic filtering. Our proposed method encodes both the video as well as the textual query as capsules, and makes use of an EM routing algorithm to learn the interaction between text and visual information. The method differs from conventional convolutional approaches as it takes advantage of capsule networks’ ability to model entities and their relationships; the routing procedure allows our network to learn the relationship between entities from different modalities.

Figure 2: Network Architecture. Capsules containing spatiotemporal features are created from video frames, and capsules representing a textual query are created from natural language sentences. These capsules are routed together to create capsules representing actors in the image. The actor capsule poses go through a masking procedure and an upsampling network to create binary segmentation masks of the actor specified in the query.

In summary, we make the following contributions in this work: (1) We propose an end-to-end capsule network for the task of selective actor and action localization in videos, which encodes both the video and the textual query in the form of capsules. (2) We introduce a novel multi-modal conditioning routing as an attention mechanism to address the issue of cross-modality capsule selection. (3) To demonstrate the potential of the proposed text selective actor and action localization in videos, we extend the annotations in the A2D dataset to full video clips. Our experiments demonstrate the effectiveness of the proposed method, and we show its advantage over the existing state-of-the-art works in terms of performance.

2 Conditioning with Multi-modal Capsule Routing

We argue that capsule networks can effectively perform multi-modal conditioning. Capsules represent entities and routing uses high-dimensional coincidence filtering [8] to learn part-to-whole relationships between these entities. There are several possible ways to incorporate conditioning into capsule networks. One trivial approach would be to apply a convolutional method (concatenation followed by a 1x1 convolution [11] or multiplication/dynamic filtering [20]) to create conditioned feature maps, and then extract a set of capsules from these feature maps. This, however, would not perform much better than the fully convolutional networks, since the same conditioned feature maps are obtained from the merging of the visual and textual modalities, and the only difference is how they are transformed into segmentation maps.

Another method would be to first extract a set of capsules from the video, and then apply the dynamic filtering on these capsules. This can be done by (1) applying a dynamic filter to the pose matrices of the capsules, or (2) applying a dynamic filter to the activations of the capsules. The first is not much different from the trivial approach described above, since the same set of conditioned features would be present in the capsule pose matrices, as opposed to the layer prior to the capsules. The second approach would just discount the importance of the votes corresponding to entities not present in the sentence; this is not ideal, since it does not take advantage of routing’s ability to find agreement between entities in both modalities.

Instead, we propose an approach that leverages the fact that the same entities exist in both the video and sentence inputs and that routing can find similarities between these entities. From the video, we extract a grid of capsules describing the visual entities, $C_v$, with pose matrices $M_v$ and activations $a_v$. Similarly, from the sentence, we generate sentence capsules, $C_s$, with pose matrices $M_s$ and activations $a_s$. Each set of capsules has transformation matrices $T_{vj}$ and $T_{sj}$, for video and text respectively, which are used to cast votes for the capsules $j$ in the following layer. Video capsules at different spatial locations share the same transformation matrices. Using the procedure described in Algorithm 1, we obtain a grid of higher-level capsules, $C_j$. This algorithm allows the network to find similarity, or agreement, between the votes of the video and sentence capsules at every location on the grid. If there is agreement between the votes, then the same entity exists in both the sentence and the given location in the video, leading to a high activation of the capsule corresponding to that entity. Conversely, if the sentence does not describe the entity present at the given spatial location, then the activation of the higher-level capsules will be low since the votes would disagree.

  Input: video capsules $C_v$ (poses $M_v$, activations $a_v$) on a $W \times H$ grid; sentence capsules $C_s$ (poses $M_s$, activations $a_s$)
  for $x = 1$ to $W$ do
     for $y = 1$ to $H$ do
        $V_{sj} = M_s T_{sj}$
        $V_{vj} = M_v[x, y] \, T_{vj}$
        $a = a_s \,\|\, a_v[x, y]$
        $V = V_{sj} \,\|\, V_{vj}$
        $C_j[x, y] = \mathrm{EMRouting}(a, V)$
     end for
  end for
Algorithm 1 Multi-modal Capsule Routing. The operation $\|$ is concatenation, such that the activations and votes of both the video and sentence capsules are inputs to the EM routing procedure described in [8].
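To make the per-location concatenation in Algorithm 1 concrete, the following is a minimal sketch in plain Python. It is not the EM routing of [8]: the function `agreement` is a simplified stand-in that scores how tightly the concatenated votes cluster, and its `scale` and `offset` parameters, the flat vote vectors, and the single output capsule type are all assumptions made for illustration.

```python
import math


def agreement(votes, acts, scale=10.0, offset=0.5):
    """Simplified stand-in for EM routing (an assumption, not the method of [8]):
    returns the activation-weighted mean vote and an activation that falls
    as the weighted variance of the votes grows."""
    total = sum(acts)
    dim = len(votes[0])
    mean = [sum(a * v[d] for a, v in zip(acts, votes)) / total for d in range(dim)]
    var = sum(
        a * (v[d] - mean[d]) ** 2 for a, v in zip(acts, votes) for d in range(dim)
    ) / total
    activation = 1.0 / (1.0 + math.exp(-scale * (offset - var)))
    return mean, activation


def multimodal_route(video_votes, video_acts, sent_votes, sent_acts):
    """video_votes[x][y] holds the vote vectors of the video capsules at one
    grid location; sent_votes is shared by (tiled over) every location."""
    grid = []
    for x in range(len(video_votes)):
        row = []
        for y in range(len(video_votes[x])):
            votes = sent_votes + video_votes[x][y]   # concatenate the votes
            acts = sent_acts + video_acts[x][y]      # concatenate the activations
            row.append(agreement(votes, acts))
        grid.append(row)
    return grid


# One sentence capsule and a 1 x 2 grid with one video capsule per location.
sent_votes, sent_acts = [[1.0, 0.0]], [1.0]
video_votes = [[[[1.0, 0.0]], [[0.0, 1.0]]]]
video_acts = [[[1.0], [1.0]]]
grid = multimodal_route(video_votes, video_acts, sent_votes, sent_acts)
```

In this toy input the first location carries a video vote identical to the sentence vote, so its activation is high, while the second location carries a conflicting vote and receives a lower activation. This is the behavior the text attributes to routing by agreement.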

This formulation of multi-modal conditioning using capsules allows the network to learn a set of entities (capsules) from both the visual and sentence inputs. Then, the voting and routing procedure allows us to find agreement between both modalities in the form of higher-level capsules. Suppose there is a higher-level capsule describing a dog. When given a query like “The brown dog running”, the sentence capsules’ vote matrices corresponding to the dog class contain the different properties of the dog, like the fact that it is running or that it is brown. If there is a running brown dog at some location in the visual input, then the visual votes at that location would be similar to the sentence’s votes, so the activation of the higher-level dog capsule would be high at that location. If, for instance, there is a black dog or a dog rolling at some location, then the votes would not agree, and the activation for the dog capsule would be low there.

Although this work mainly focuses on video and text, the multi-modal capsule routing procedure can be applied to capsules generated from many other modalities, like images or audio.

3 Network Architecture

The overall network architecture is shown in Figure 2. In this section, we discuss the various components of the architecture as well as the objective function used to train the network.

3.1 Video Capsules

The video input consists of 4 frames. The process for generating video capsules begins with an Inception based 3D convolutional network known as I3D [1], which generates 832 spatiotemporal feature maps taken from the maxpool3d_3a_3x3 layer. Capsule pose matrices and activations are generated by applying a convolution operation to these feature maps, with linear and sigmoid activations respectively. Since there is no padding for this operation, the result is a capsule layer with 8 capsule types.

Figure 3: Capsule Merging. Video capsules are created with n types of capsules at each pixel location. Sentence capsules are created with m types of capsules for the sentence. The sentence capsules are tiled over all of the spatial locations. EM routing is performed to create the next higher level of capsules representing actors at each spatial location.

3.2 Sentence Capsules

A series of convolutional and fully connected layers is used to generate the sentence capsules. First, each word from the sentence is converted into a size 300 vector using a word2vec model pre-trained on the Google News Corpus. Sentences in the network are set to be 16 words, so longer sentences are truncated, and shorter sentences are padded with zeros. The sentence representation is then passed through 3 parallel stages of 1D convolution with kernel sizes of 2, 3 and 4 with a ReLU activation. We then apply max-pooling to obtain 3 vectors, which are concatenated and passed through a max-pooling layer to obtain a single length 300 vector to describe the entire sentence. A fully connected layer then generates the 8 pose matrices and 8 activations for the capsules which represent the entire sentence. We found that this method of generating sentence capsules performed best in our network; various other methods are explored in the Supplementary Material.
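The fixed-length sentence input can be illustrated with a short sketch. The helper names below are hypothetical, and the stride of 1 with no padding in `conv1d_output_lengths` is an assumption; the text only states the kernel sizes, the 16-word length, and the 300-dimensional word vectors.

```python
def pad_or_truncate(word_vectors, max_words=16, dim=300):
    """Return exactly max_words vectors: long sentences are truncated,
    short sentences are padded with all-zero vectors."""
    clipped = [list(v) for v in word_vectors[:max_words]]
    padding = [[0.0] * dim for _ in range(max_words - len(clipped))]
    return clipped + padding


def conv1d_output_lengths(num_words=16, kernel_sizes=(2, 3, 4)):
    """Length of each parallel 1D convolution output, assuming stride 1
    and no padding (an assumption; the text does not state either)."""
    return {k: num_words - k + 1 for k in kernel_sizes}


short_sentence = [[1.0] * 300 for _ in range(3)]    # a 3-word query
long_sentence = [[1.0] * 300 for _ in range(20)]    # a 20-word query
padded = pad_or_truncate(short_sentence)
truncated = pad_or_truncate(long_sentence)
```

Max-pooling over the word axis then removes the dependence on these differing output lengths, which is why the three branches can be combined afterwards.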

3.3 Merging and Masking

Once the video and sentence capsules are obtained, we merge them in the manner described in Section 2 and depicted in Figure 3. The result of the routing operation is a grid with 8 capsule types - one for each actor class in the A2D dataset and one for a “background” class, which is used to route unnecessary information. The activations of these capsules correspond to the existence of the corresponding actor at the given location, so averaging the activations over all locations gives us a classification prediction over the video clip. We perform the capsule masking procedure described in [24]. When training the network, we mask (multiply by 0) all pose matrices not corresponding to the ground truth class. At test time, we mask the pose matrices not corresponding to the predicted class. These masked poses are then fed into an upsampling network to generate a foreground/background segmentation mask for the actor described by the sentence.
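The classification-by-averaging and masking steps can be sketched as follows. This is an illustrative sketch only: the nested-list layout (`activations[y][x][c]`, `poses[y][x][c]` as flat pose vectors) and both function names are assumptions, not the authors' data structures.

```python
def predict_class(activations):
    """activations[y][x][c]: average each capsule type over all grid
    locations, then return the argmax and the per-class means."""
    num_classes = len(activations[0][0])
    cells = [cell for row in activations for cell in row]
    means = [sum(cell[c] for cell in cells) / len(cells) for c in range(num_classes)]
    return max(range(num_classes), key=means.__getitem__), means


def mask_poses(poses, keep_class):
    """Multiply by 0 every pose that does not belong to keep_class.
    During training keep_class would be the ground-truth class; at test
    time it would be the class returned by predict_class."""
    return [
        [
            [list(p) if c == keep_class else [0.0] * len(p) for c, p in enumerate(cell)]
            for cell in row
        ]
        for row in poses
    ]


# A 1 x 2 grid with two capsule types and 2-element poses, for brevity.
activations = [[[0.1, 0.9], [0.2, 0.6]]]
poses = [[[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]]
predicted, means = predict_class(activations)
masked = mask_poses(poses, predicted)
```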

3.4 Upsampling Network

The upsampling network consists of 5 convolutional transpose layers. The first of these increases the feature map dimension with a kernel that corresponds to the kernel used to create the video capsules from the I3D feature maps. The following 3 layers are strided in both time and space, so that the output dimensions are equal to the input video dimensions. The final segmentation is produced by a final layer. We note that this method diverges from previous methods in that it outputs segmentations for all input frames, rather than a single frame segmentation per video clip input. We use parameterized skip connections from the I3D encoder to obtain more fine-grained segmentations. At each step of upsampling, lower resolution segmentation maps are generated to aid in the training of these skip connections.

3.5 Objective Function

The network is trained end-to-end using an objective function based on classification and segmentation losses. For classification, we use a spread loss which is computed as follows:

$$L_c = \sum_{i \neq t} \max\left(0,\; m - (a_t - a_i)\right)^2,$$

where $m$ is a margin, $a_i$ is the activation of the capsule corresponding to class $i$, and $a_t$ is the activation of the capsule corresponding to the ground-truth class. During training, $m$ is linearly increased.
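As a worked illustration of the spread loss, the sketch below follows the standard form from [8]: the squared hinge on the gap between the ground-truth activation and every other activation. The helper `linear_margin` and its `start` and `end` arguments are hypothetical; the endpoints of the margin schedule are left as parameters rather than assumed.

```python
def spread_loss(activations, target, margin):
    """Sum over non-target classes of max(0, margin - (a_target - a_i)) ** 2."""
    a_t = activations[target]
    return sum(
        max(0.0, margin - (a_t - a_i)) ** 2
        for i, a_i in enumerate(activations)
        if i != target
    )


def linear_margin(step, total_steps, start, end):
    """Margin increased linearly from start to end over total_steps
    (start and end are placeholders supplied by the caller)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


activations = [0.9, 0.1, 0.2]   # class 0 is the ground truth
small_margin_loss = spread_loss(activations, target=0, margin=0.2)
large_margin_loss = spread_loss(activations, target=0, margin=0.9)
```

With a small margin the example is already separated well enough and contributes no loss; raising the margin makes the same activations incur a penalty, which is why the margin is increased gradually during training.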

The segmentation loss is computed using sigmoid cross entropy. When averaged over all $N$ pixels in the segmentation map, we get the following loss:

$$L_s = -\frac{1}{N} \sum_{j=1}^{N} \left[ p_j \log \hat{p}_j + (1 - p_j) \log\left(1 - \hat{p}_j\right) \right],$$

where $p$ is the ground-truth segmentation map and $\hat{p}$ is the network’s output segmentation map. We use this segmentation loss at several resolutions to aid in the training of the skip connections.
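The pixel-averaged cross entropy can be written out directly. In this sketch the inputs are flat lists, the prediction is assumed to already be a probability (that is, after the sigmoid), and the `eps` clamp is an added numerical safeguard rather than something stated in the text.

```python
import math


def segmentation_loss(target, predicted, eps=1e-7):
    """Mean binary cross entropy over all pixels.
    target holds 0/1 labels; predicted holds probabilities in [0, 1]."""
    total = 0.0
    for p, q in zip(target, predicted):
        q = min(max(q, eps), 1.0 - eps)   # clamp so that log() stays finite
        total += p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return -total / len(target)


uncertain = segmentation_loss([1, 0], [0.5, 0.5])
confident = segmentation_loss([1, 0], [1.0, 0.0])
```

A maximally uncertain prediction gives a loss of log 2 per pixel, and a correct confident prediction gives a loss close to zero.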

The final loss is a weighted sum between the classification and segmentation losses:

$$L = \lambda L_c + (1 - \lambda) L_s,$$

where $L_c$ and $L_s$ are the classification and segmentation losses, and the weight $\lambda$ is fixed when training begins. Since the network quickly learns to classify the actor when given a sentence input, we lower $\lambda$ when the classification accuracy saturates (over 95% on the validation set). We find that this reduces over-fitting and results in better segmentations.
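The weighting schedule can be sketched as below. Two things here are assumptions: the convex-combination form of the weighted sum, and the function names. The weight values before and after saturation are deliberately left as arguments; only the 95% validation accuracy threshold comes from the text.

```python
def total_loss(cls_loss, seg_loss, weight):
    """Weighted sum of the two losses; `weight` multiplies the classification
    term (a convex combination is assumed here)."""
    return weight * cls_loss + (1.0 - weight) * seg_loss


def classification_weight(val_accuracy, initial, after_saturation, threshold=0.95):
    """Switch the classification weight once validation accuracy exceeds
    the threshold; both weight values are supplied by the caller."""
    return after_saturation if val_accuracy > threshold else initial
```

Switching the weight on a validation signal, rather than on a fixed step count, ties the change to the point where the classification loss stops providing useful gradient.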

4 Experiments

Implementation Details

The network was implemented using PyTorch [23]. The I3D used weights pretrained on Kinetics [15] and fine-tuned on Charades [26]. The network was trained using the Adam optimizer [17] with a learning rate of 0.001. As video resolutions vary within different datasets, all video inputs are scaled to a fixed resolution while maintaining aspect ratio through the use of horizontal black bars. When using bounding box annotations, we consider pixels within the bounding box to be foreground and pixels outside of the bounding box to be background.
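The two preprocessing steps described above, aspect-preserving scaling with black bars and turning a bounding box into a foreground mask, can be sketched in a few lines. The target size is a caller-supplied parameter, and the half-open `(x0, y0, x1, y1)` box convention is an assumption made for this example.

```python
def letterbox(width, height, target_w, target_h):
    """Scale to fit inside the target while keeping the aspect ratio.
    Returns the resized size and the (left, top, right, bottom) padding
    that would be filled with black."""
    scale = min(target_w / width, target_h / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w, pad_h = target_w - new_w, target_h - new_h
    left, top = pad_w // 2, pad_h // 2
    return (new_w, new_h), (left, top, pad_w - left, pad_h - top)


def box_to_mask(box, width, height):
    """Foreground (1) inside the bounding box, background (0) outside.
    The box is (x0, y0, x1, y1) with exclusive upper bounds (assumed)."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


size, padding = letterbox(640, 360, 128, 128)   # illustrative sizes only
mask = box_to_mask((1, 0, 3, 2), width=4, height=2)
```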

                        Overlap                            mAP        IoU
Method                  P@0.5  P@0.6  P@0.7  P@0.8  P@0.9  0.5:0.95   Overall  Mean
Hu et al. [11]          34.8   23.6   13.3   3.3    0.1    13.2       47.4     35.0
Li et al. [20]          38.7   29.0   17.5   6.6    0.1    16.3       51.5     35.4
Gavrilyuk et al. [4]    50.0   37.6   23.1   9.4    0.4    21.5       55.1     42.6
Our Network             52.6   45.0   34.5   20.7   3.6    30.3       56.8     46.0
Table 1: Results on A2D dataset with sentences. Baselines [11, 20] take only single image/frame inputs. Gavrilyuk et al. [4] uses multi-frame RGB and Flow inputs. Our model uses only multi-frame RGB inputs and outperforms other state-of-the-art methods in all metrics without the use of optical flow.

                        Overlap                            mAP        IoU
Method                  P@0.5  P@0.6  P@0.7  P@0.8  P@0.9  0.5:0.95   Overall  Mean
Hu et al. [11]          63.3   35.0   8.5    0.2    0.0    17.8       54.6     52.8
Li et al. [20]          57.8   33.5   10.3   0.6    0.0    17.3       52.9     49.1
Gavrilyuk et al. [4]    69.9   46.0   17.3   1.4    0.0    23.3       54.1     54.2
Our Network             63.8   47.9   26.3   4.0    0.0    24.3       49.2     52.0
Table 2: Results on JHMDB dataset with sentences. Our model outperforms other state-of-the-art methods at higher IoU thresholds and in the mean average precision metric.

4.1 Single-Frame Segmentation Conditioned on Sentences

In this experiment, a video clip and a human generated sentence describing one of the actors in the video are taken as inputs, and the network generates a binary segmentation mask localizing the described actor. Similar to previous methods, the network is trained and tested on the single frame annotations provided in the A2D dataset. To compare our method with previous approaches, we modify our network in these experiments: we replace the 3D convolutional transpose layers in our upsampling network with 2D convolutional transpose layers to output a single frame segmentation.


We conduct our experiments on two datasets: A2D [29] and J-HMDB [12]. The A2D dataset contains 3782 videos (3036 for training and 746 for testing) consisting of 7 actor classes, 8 action classes, and an extra action label none, which accounts for actors in the background or actions different from the 8 action classes. Since actors cannot perform all labeled actions, there are a total of 43 valid actor-action pairs. Each video in A2D has 3 to 5 frames which are annotated with pixel-level actor-action segmentations. The J-HMDB dataset contains 928 short videos with 21 different action classes. All frames in the J-HMDB dataset are annotated with pixel-level segmentation masks. Gavrilyuk et al. [4] extended both of these datasets with human generated sentences that describe the actors of interest for each video. These sentences use the actor and action as part of the description, but many do not include the action and rely on other descriptors such as location or color.


We evaluate our results using all metrics used in [4]. The overall IoU is the intersection-over-union (IoU) over all samples, which tends to favor larger actors and objects. The mean IoU is the IoU averaged over all samples, which treats samples of different sizes equally. We also measure the precision at 5 IoU thresholds and the mean average precision over the IoU thresholds 0.5:0.95 [21].
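The difference between these metrics is easiest to see in code. The sketch below reads overall IoU as total intersection over total union, counts a sample as correct when its IoU strictly exceeds the threshold, and averages precision over thresholds from 0.5 to 0.95 in steps of 0.05. The strict inequality and the 0.05 step are assumptions in the style of [21], and the function name is hypothetical.

```python
def segmentation_metrics(intersections, unions, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Per-sample intersection and union pixel counts in, metrics out."""
    ious = [i / u if u else 0.0 for i, u in zip(intersections, unions)]
    n = len(ious)
    overall_iou = sum(intersections) / sum(unions)   # dominated by large objects
    mean_iou = sum(ious) / n                         # every sample counts equally
    precision = {t: sum(iou > t for iou in ious) / n for t in thresholds}
    map_thresholds = [0.5 + 0.05 * k for k in range(10)]   # 0.50, 0.55, ..., 0.95
    mean_ap = sum(sum(iou > t for iou in ious) / n for t in map_thresholds) / 10
    return overall_iou, mean_iou, precision, mean_ap


# One large, well-segmented object and one small, poorly segmented object.
overall_iou, mean_iou, precision, mean_ap = segmentation_metrics([92, 1], [100, 10])
```

In this example the large object pulls the overall IoU well above the mean IoU, which illustrates why the overall IoU tends to favor larger actors and objects.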


We compare our results on A2D with previous approaches in Table 1. Our network outperforms previous state-of-the-art methods in all metrics, and has a notable 9% improvement in the mAP metric, even though we do not process optical flow, which would require extra computation. We also find that our network achieves much stronger results at higher IoU thresholds, which signifies that the segmentations produced by the network are more fine-grained and adhere to the contours of the queried objects. Qualitative results on A2D can be found in Figure 4.

Following the testing procedure in [4], we test on all the videos of J-HMDB using our model trained on A2D without fine-tuning. The results on J-HMDB are found in Table 2; our network outperforms other methods at the higher IoU thresholds (0.6, 0.7, and 0.8) and in the mAP metric.

4.2 Full Video Segmentation Conditioned on Sentences

In this set of experiments, we train the network using the bounding box annotations for all the frames. Since previous baselines only output single frame segmentations, we test our method against our single-frame segmentation network as a baseline. This baseline can generate segmentations for an entire video by processing the video frame-by-frame.


We extend the A2D dataset by adding bounding box localizations for the actors of interest in every frame of the dataset. This allows us to train and test our method using the entire video and not just 3 to 5 frames per video. The J-HMDB dataset has annotations on all frames, so we can evaluate the method on this dataset as well.


To evaluate the segmentation results for entire videos, we consider each video as a single sample. Thus, the IoU computed is the intersection-over-union between the ground-truth tube and the generated segmentation tube. Using this metric, we can calculate the video overall IoU and the video mean IoU; the former will favor both larger objects and objects in longer videos, while the latter will treat all videos equally. We also measure the precision at 5 different IoU thresholds and the video mean average precision over the IoU thresholds 0.5:0.95.
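A tube IoU that treats the whole video as one sample can be sketched by accumulating intersection and union over frames before dividing. Representing each frame as a set of `(y, x)` foreground pixels is an assumption chosen to keep the example short.

```python
def tube_iou(pred_frames, gt_frames):
    """IoU between two segmentation tubes. Each argument is a list with one
    set of (y, x) foreground pixels per frame."""
    inter = sum(len(p & g) for p, g in zip(pred_frames, gt_frames))
    union = sum(len(p | g) for p, g in zip(pred_frames, gt_frames))
    return inter / union if union else 0.0


pred = [{(0, 0), (0, 1)}, {(1, 1)}]
gt = [{(0, 1)}, {(1, 1), (1, 2)}]
iou = tube_iou(pred, gt)
```

Because the counts are summed before the division, longer videos and larger objects contribute more pixels to a video overall IoU, matching the behavior described above.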


Since the network is trained using the bounding box annotations, the produced segmentations are more block-like, but it is still able to successfully segment the actors described in given queries. We compare the qualitative results between the network trained only using fine-grained segmentations and the network trained using bounding box annotations in Figure 5. When tested on the A2D dataset, we find that there is a significant improvement in all metrics when compared to the network trained only on single frames with pixel-wise segmentations. However, this is to be expected, since the ground-truth tubes are bounding boxes and box-like segmentations around the actor would produce higher IoU scores. For a fairer comparison, we place a bounding box around the fine-grained segmentations produced by the network trained on the pixel-wise annotations; this produces better results since the new outputs more closely resemble the ground-truth tubes. Even with this change, the network trained on bounding box annotations has the strongest results since it learned from all frames in the training videos, as opposed to a handful of frames per video. The results of this experiment can be seen in Table 3.
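The fairer comparison described above, placing a bounding box around a fine-grained segmentation, amounts to filling the tight box around the foreground pixels. A minimal sketch, assuming a single box per frame and a mask stored as a list of rows of 0/1 values (the function name is hypothetical):

```python
def mask_to_box_mask(mask):
    """Replace a binary mask with its filled tight bounding box.
    An empty mask is returned unchanged."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return [list(row) for row in mask]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
    return [
        [1 if y0 <= y <= y1 and x0 <= x <= x1 else 0 for x in range(len(row))]
        for y, row in enumerate(mask)
    ]


fine_mask = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
boxed = mask_to_box_mask(fine_mask)
```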

The J-HMDB dataset has pixel-level annotations for all frames, so the box-like segmentations produced by the network should be detrimental to results. We found that this was the case: the network performed poorly in every metric when compared to the network trained on fine-grained pixel-level annotations.

Figure 4: A comparison of our results with [4]. The sentence query colors correspond with the segmentation colors. The first row shows frames from the input video. The second row shows the segmentation output from [4], and the third row shows the segmentation output from our model. In both examples, our model produces more finely detailed output, where the separation of the legs can be clearly seen. Our model also produces an output that is more accurately conditioned on the sentence query, as seen in the first example where our network segments the correct dog for each query, while [4] incorrectly selects the center dog for both queries.

Figure 5: Qualitative results. The sentence query colors correspond with the segmentation colors. The first row shows frames from the input video. The second row contains the segmentations from the network trained only using pixel-wise annotations, and the third row contains the segmentations from the network trained using bounding box annotations on all frames. The segmentations from the network trained using bounding boxes are more box-like, but the extra training data leads to fewer missegmentations or under-segmentations, as seen in the second example. Higher resolution qualitative results, with their corresponding ground truths, can be found in the Supplementary Material.

4.3 Image Segmentation Conditioned on Sentences

We also evaluate our method by segmenting images based on text queries. To make as few modifications to the network as possible, the single images are repeated to create a “boring” video input with 4 identical frames.


We train and test on the ReferItGame dataset [16], which contains 20000 images with 130525 natural language expressions describing various objects in the images. We use the same train/test splits as [11, 25], with 9000 training images and 10000 testing images. Unlike A2D, there is no predefined set of actors, so no classification loss or masking is used.


The results for this experiment can be seen in Table 4. We obtain similar results to other state-of-the-art approaches, even though our network architecture is designed for actor/action video segmentation. This demonstrates that our proposed capsule routing procedure is effective on multiple visual modalities - both videos and images.

                        Video Overlap                      v-mAP      Video IoU
Network                 P@0.5  P@0.6  P@0.7  P@0.8  P@0.9  0.5:0.95   Overall  Mean
Key frames (pixel)      9.6    1.6    0.4    0.0    0.0    1.8        34.4     26.6
Key frames (bbox)       41.9   33.3   22.2   10.0   0.1    21.2       51.5     41.3
All frames              45.6   37.4   25.3   10.0   0.4    23.3       55.7     41.8
Table 3: Results on A2D dataset with bounding box annotations. The first row is for the network trained with only pixel-level annotations on key frames of the video, and evaluated with its pixel-wise segmentation output. The second is the same network, but a bounding box is placed around its segmentation output for evaluation. The final row is the network trained with bounding box annotations on all frames. Significant performance gain is achieved when training with all frames.

Method                  Overall IoU  P@0.5  P@0.7  P@0.9
Hu et al. [11]          56.83        43.86  26.65  6.47
Shi et al. [25]         59.09        45.87  32.82  11.79
Our Network             55.7         43.4   28.3   9.7
Table 4: Results on ReferItGame dataset. This result for [11] is obtained by using Deeplab101 as a backbone network, as described in [25]. We achieve comparable results, even with a network designed for video inputs. This level of performance can be partially attributed to the lack of classification loss and masking, which tends to improve segmentation results in our network.

Figure 6: Failure Cases. These are frames from the same video in which different text queries resulted in distinct segmentation failure cases. In the first example the network chooses the wrong actor based on the query; in the second, the network is unable to find the queried actor.

4.4 Ablation Studies

The ablation experiments were trained and evaluated using the pixel-level segmentations from the A2D dataset. All ablation results can be found in Table 5.

Skip Connections

To understand the effectiveness of the parameterized skip connections from the I3D encoder, an experiment was run with these skip connections removed. This resulted in a 3% reduction in mean IoU score and mean average precision. The decrease in performance shows that the skip connections are necessary for the network to preserve fine-grained details from the input video.

Classification and Masking

We test the influence of the classification loss for this segmentation task by running an experiment without back-propagating this loss. Without classification, the masking procedure would fail at test time, so masking is not used and all poses are passed forward to the upsampling network. This performed worse than the baseline in all metrics, which shows the importance of classification loss and masking when training this capsule network. To further investigate the effects of masking, we perform an experiment with no masking, but with the classification loss. Surprisingly, it performs worse than the network with neither masking nor classification loss; this signifies that classification loss can be detrimental to this segmentation task if there is no masking to guide the flow of the segmentation loss gradient.

Multi-Resolution Segmentation Loss

The base network computes a segmentation loss not only at the final output, but also at multiple intermediate resolutions. This approach was used in [4] to great success. As an ablation, we trained the network ignoring the intermediate segmentation losses, which produced similar results to the baseline network. Thus, the multi-resolution segmentation loss has no noticeable effect on our network.

Alternative Conditioning

We run a series of experiments to test the effectiveness of our multi-modal capsule routing procedure. We test the four other conditioning methods described in Section 2: the two trivial approaches (concatenation and multiplication), and the two methods which apply dynamic filtering to the video capsules (filtering the pose matrices and filtering the activations). The results suggest that multi-modal capsule routing is an effective method for merging different data modalities, since the convolution-based approaches perform much worse in all metrics. Moreover, these experiments show that it is a non-trivial task to extend techniques developed for CNNs, like dynamic filtering for natural language conditioning, to capsule networks.

Configuration                              P@0.5  mAP    Mean IoU
No skip connections                        49.5   26.9   43.1
No classification loss nor masking         49.4   28.8   43.6
No masking (with classification loss)      48.3   27.8   42.5
Single Resolution                          51.9   31.2   45.0
Concatenation                              22.9   9.9    25.0
Multiplication                             38.4   19.4   35.0
Filter Poses                               49.1   29.1   42.7
Filter Activations                         48.8   29.2   43.0
Our Network                                52.6   30.3   46.0
Table 5: Ablations on the A2D dataset with sentences. We test the effect of parameterized skip connections, capsule masking, the classification loss, and the multi-resolution segmentation loss on our network. We also test conventional conditioning methods on our capsule network to evaluate the effectiveness of the proposed multi-modal capsule routing procedure. The final row contains the results of our network without any changes.

4.5 Failure Cases

We find that the network has two main failure cases: (1) the network incorrectly selects an actor which is not described in the query, and (2) the network fails to segment anything in the video. Figure 6 contains examples of both cases. The first case occurs when the text query refers to an actor/action pair and multiple actors are doing this action or the video is cluttered with many possible actors from which to choose. This suggests that an improved video encoder which extracts better video feature representations and creates more meaningful video capsules could reduce the number of these incorrect segmentations. The second failure case tends to occur when the queried object is small, which is often the case with the “ball” class or when the actor of interest is far away.

5 Conclusion and Future Work

In this work, we propose a capsule network for localization of actors and actions based on a textual query. The proposed framework makes use of capsules for both video and textual representation. We introduce the concept of multi-modal capsule networks, through multi-modal EM routing for the localization of actors and actions in video, conditioned on a textual query. The existing annotations on the A2D dataset are for single frames, and we extended the dataset with annotations for all the frames to validate the performance of our proposed approach. In our experiments, we demonstrate the effectiveness of multi-modal capsule routing, and observe an improvement in performance when compared to the state-of-the-art approaches. We found the capsule representation to be effective for both visual and text modalities; we plan to explore the interplay between these modalities using capsules and also apply it to other domains in future work.


This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.


Appendix A Appendices

Here we include additional qualitative and quantitative results which could not be included in the main text, along with figures and a more in-depth description of the network architecture.

Appendix B Network Architecture

When designing our network, we found several components key to obtaining our state-of-the-art results. We explain some of these components, and how they impacted our results. Furthermore, we include several figures (7, 8, 9) which illustrate the construction of the video capsules, the sentence network, and the upsampling network respectively.

B.1 Video Encoder

Our network uses a pretrained I3D network to encode the video into a set of feature maps of size . Originally, we used the simpler C3D network [tran2015learning] to encode the feature maps, but it achieved substantially worse results: a mean IoU of 34%. This shows that the features extracted from the I3D network are more useful for our video capsules than the features from the C3D network. It also suggests that future improvements in video feature extraction techniques and networks will lead to improved capsule representations and improved results.

B.2 Sentence Network

We also tested several different configurations for our sentence network. We began by using the capsule networks (both Capsule-A and Capsule-B) described in Figure 2 of [31], which showed strong results on text classification. These networks have several capsule layers, but their use led to poor results: a mean IoU of 35.7% for the Capsule-A network, and a mean IoU of 36.4% for the Capsule-B network. Although these networks performed well on the text classification task, a different set of text features must be learned to condition visual features. Therefore, our sentence network, with conventional convolutional layers, max-pooling, and a fully connected layer, is better able to extract textual features for the task of video segmentation from a sentence.

Appendix C Evaluations

We include several tables which we were unable to include in our main text. Table 6 shows the results of our network trained on the bounding box annotations from A2D, and tested on JHMDB. The network’s outputs are more box-like because it was trained on bounding boxes as opposed to pixel-wise segmentations; therefore, the network has its strongest results when tested using bounding box ground-truths. Tables 7 and 8 contain all the metrics which we were unable to include for the ablation experiments and ReferItGame experiments, respectively.
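To make the difference between the two evaluation settings of Table 6 concrete, the following is a minimal sketch of how a pixel-wise ground-truth mask can be turned into a bounding-box ground-truth before computing IoU. The function names and the nested-list mask format are assumptions made for illustration; this is not the evaluation code used for the reported numbers.

```python
def bounding_box_mask(mask):
    """Return a mask whose foreground is the tight box around `mask` (hypothetical helper)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:  # empty mask: there is nothing to box
        return [[0] * len(row) for row in mask]
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    return [
        [1 if top <= r <= bottom and left <= c <= right else 0
         for c in range(len(row))]
        for r, row in enumerate(mask)
    ]


def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    cells = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in cells if p and t)
    union = sum(1 for p, t in cells if p or t)
    return inter / union if union else 0.0


# Pixel-wise ground truth with a non-rectangular shape.
gt = [[0, 0, 0, 0],
      [0, 1, 0, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
# A box-like prediction, as produced by a network trained on boxes.
pred = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

iou_pixelwise = iou(pred, gt)                # scored against the mask
iou_box = iou(pred, bounding_box_mask(gt))   # scored against the box
```

In this toy case the box-like prediction is penalized for the background pixel it covers when scored against the pixel-wise mask, but matches the boxed ground-truth exactly, which mirrors the gap between the two rows of Table 6.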

Appendix D Qualitative Results

We have generated several videos with the segmentations produced by our networks for both the A2D and JHMDB datasets. Each video contains the input sentences, color coded to match the ground-truth colors. The first row of each video has the ground-truth bounding boxes (for A2D) or ground-truth segmentation masks (for JHMDB). The second row is the output of the network which was trained on the key frames of the A2D dataset, which had pixel-wise segmentation ground-truths. The third row is the output of the network which was trained on all the frames of the A2D dataset, using bounding-box ground-truths. In our analysis of the qualitative results, we will refer to the former network as the “Key Frame network” and the latter network as the “Bounding Box network”.

D.1 Single Actor

The networks seem to perform best when there is a single actor in the scene. In this case, we find that the Key Frame network produces very fine-grained segmentations which maintain the boundaries of the actors, while the Bounding Box network successfully segments a box around the actor. This behaviour can be observed in the videos in the A2DSingleActor folder, which contains examples from the A2D dataset, and the JHMDB folder, which contains examples from the JHMDB dataset.

D.2 Multiple Actors

The network can also perform well with multiple actors in the scene, as seen in the A2DMultiActor videos. In these cases, the segmentations are not as precise, but the general location of the actors is segmented by both networks. We note that the Bounding Box network, which was trained with all the frames of the dataset, tends to produce more consistent multi-actor segmentations, as seen in videos “video3_multi_a2d” and “video5_multi_a2d”. In both of these cases, each instance is of the same actor class, and the Key Frame network seems to incorrectly segment one of the instances.

D.3 Failure Cases

As mentioned in the main text, our network tends to fail on the A2D dataset when there are multiple instances of the same actor class. We present 5 videos in which our network fails in the A2DFailure folder. The probability of failure is increased when the sentence queries are vague, or could describe multiple actors within the scene, like in the video “video1_failure_a2d”. Several birds can fit the description of “sparrow sitting on the grass” or “sparrow is walking on the brown grass”. In these cases it would be very difficult for even a human to correctly segment the video. In many cases, the failure occurs when there are many similar actors near each other, like in the videos “video2_failure_a2d.mp4” and “video5_failure_a2d”. In the first, there are multiple people running next to each other, while the second contains several cars moving near each other.

Since the JHMDB dataset only has a single actor in each video, the failures encountered are not from “difficult” queries or videos, but rather a result of the mismatch between the training and testing data. A2D videos tend to have humans perform an action requiring large amounts of motion - like walking, running, jumping or rolling - while the JHMDB dataset has many videos in which the action has little motion - like brushing hair or archery. Thus, both the videos and the input textual queries are quite different between the datasets, which can cause a large performance discrepancy during evaluation. Examples of failure cases on the JHMDB dataset can be found in the JHMDBFailure folder.

Figure 7: Video Capsule Network. Video capsules are formed from the I3D features by convolution with one layer to create the 4x4 pose matrix for each capsule, and another layer to create the activation for each capsule.
Figure 8: Sentence Network. Each word from a natural language sentence is converted into a size 300 word2vec vector. The vectors go through a convolutional network and are then reshaped into the poses and activations of capsules representing the sentence.
Figure 9: Upsampling Network. The pose matrices undergo a masking procedure and are passed through a series of convolutional transpose layers to create the binary segmentation mask. Skip connections from the I3D network are made at 3 different resolutions. Segmentation loss is backpropagated from the final output as well as 3 intermediate levels.
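The video capsule construction described in Figure 7 (one layer producing a 4x4 pose matrix per capsule, a second layer producing an activation per capsule) can be sketched in NumPy as follows. This is only an illustrative sketch: the pointwise (1x1x1) kernel, the sigmoid on the activations, the number of capsule types, and all tensor sizes are assumptions chosen for the example, and the function name `form_video_capsules` is hypothetical.

```python
import numpy as np


def form_video_capsules(features, w_pose, b_pose, w_act, b_act):
    """Turn encoder features (T, H, W, C_in) into capsule poses and activations.

    Returns poses of shape (T, H, W, n_caps, 4, 4) and activations of shape
    (T, H, W, n_caps). A pointwise convolution is assumed, which reduces to a
    matrix product over the channel axis.
    """
    n_caps = w_act.shape[1]
    # First layer: 16 pose values (a 4x4 matrix) per capsule type and location.
    poses = features @ w_pose + b_pose
    poses = poses.reshape(features.shape[:3] + (n_caps, 4, 4))
    # Second layer: one activation per capsule, squashed into (0, 1)
    # with a sigmoid (an assumption of this sketch).
    logits = features @ w_act + b_act
    activations = 1.0 / (1.0 + np.exp(-logits))
    return poses, activations


# Toy sizes, chosen only for the example.
rng = np.random.default_rng(0)
c_in, n_caps = 32, 8
features = rng.standard_normal((4, 7, 7, c_in))
w_pose = 0.1 * rng.standard_normal((c_in, n_caps * 16))
b_pose = np.zeros(n_caps * 16)
w_act = 0.1 * rng.standard_normal((c_in, n_caps))
b_act = np.zeros(n_caps)

poses, activations = form_video_capsules(features, w_pose, b_pose, w_act, b_act)
```

Each spatio-temporal location then carries `n_caps` capsules, each with its own pose matrix and activation, which is the form a routing procedure operates on.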

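| Evaluation | Video Overlap P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | v-mAP 0.5:0.95 | Video IoU Overall | Video IoU Mean |
|---|---|---|---|---|---|---|---|---|
| All frames | 16.3 | 2.0 | 0.1 | 0.0 | 0.0 | 2.3 | 37.9 | 35.6 |
| All frames (bbox) | 46.7 | 32.1 | 15.8 | 3.2 | 0.0 | 16.3 | 44.5 | 44.4 |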
Table 6: Results for network trained on the bounding box annotations for A2D with sentences, and evaluated on JHMDB. The first row is the network tested against the ground-truth pixel-wise segmentations from the JHMDB dataset. The second row is the network tested against bounding boxes around the ground-truth segmentations from the JHMDB dataset. Since our network was trained with bounding boxes, it performs better when it is evaluated against bounding box ground-truths.
| Configuration | Overlap P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | mAP 0.5:0.95 | IoU Overall | IoU Mean |
|---|---|---|---|---|---|---|---|---|
| No skip connections | 49.5 | 41.2 | 29.1 | 14.6 | 1.5 | 26.9 | 56.7 | 43.1 |
| No nor Masking | 49.4 | 42.5 | 32.7 | 19.6 | 3.3 | 28.8 | 57.6 | 43.6 |
| No Masking (with ) | 48.3 | 41.4 | 31.2 | 18.4 | 3.1 | 27.8 | 56.6 | 42.5 |
| Single Resolution | 51.9 | 45.2 | 35.4 | 21.7 | 4.2 | 31.2 | 59.9 | 45.0 |
| Concatenation | 22.9 | 15.4 | 7.6 | 2.0 | 0.1 | 9.9 | 35.1 | 25.0 |
| Multiplication | 38.4 | 30.1 | 20.9 | 9.7 | 0.8 | 19.4 | 48.2 | 35.0 |
| Filter Poses | 49.1 | 42.3 | 32.6 | 19.1 | 3.2 | 29.1 | 57.2 | 42.7 |
| Filter Activations | 48.8 | 42.7 | 33.4 | 20.1 | 3.8 | 29.2 | 56.8 | 43.0 |
| Our Network | 52.6 | 45.0 | 34.5 | 20.7 | 3.6 | 30.3 | 56.8 | 46.0 |
Table 7: All metrics for the ablations trained and tested on the A2D dataset with sentences. We test the effect of parameterized skip connections, capsule masking, the classification loss, and the multi-resolution segmentation loss on our network. We also test conventional conditioning methods on our capsule network to evaluate the effectiveness of the proposed multi-modal capsule routing procedure. The final row contains the results of our network without any changes.
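| Method | Overall IoU | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 |
|---|---|---|---|---|---|---|
| Hu et al.* [11] | 56.83 | 43.86 | 35.75 | 26.65 | 16.75 | 6.47 |
| Shi et al. [25] | 59.09 | 45.87 | 39.80 | 32.82 | 23.81 | 11.79 |
| Our Network | 55.7 | 43.4 | 36.2 | 28.3 | 19.6 | 9.7 |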
Table 8: Results on ReferItGame dataset. *This result is using Deeplab101 as a backbone, as described in [25]. We achieve comparable results, even with a network designed for video inputs.
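For readers unfamiliar with the columns reported in Tables 6 to 8, the sketch below shows one common way the P@K, Overall IoU, and Mean IoU values are computed from per-sample intersection and union counts. The definitions used here (P@K as the fraction of samples whose IoU exceeds K, Overall IoU as total intersection over total union, Mean IoU as the average of per-sample IoUs) follow common usage in the referring-segmentation literature and are an assumption of this sketch, as is the function name; the mAP column is not reproduced.

```python
def segmentation_metrics(pairs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute P@K, overall IoU, and mean IoU.

    `pairs` is a list of (intersection, union) pixel counts, one per sample.
    """
    ious = [i / u if u else 0.0 for i, u in pairs]
    # P@K: fraction of samples whose IoU is strictly above the threshold K.
    precision = {k: sum(1 for v in ious if v > k) / len(ious) for k in thresholds}
    # Overall IoU pools pixels over the whole test set, so large objects dominate.
    overall = sum(i for i, _ in pairs) / sum(u for _, u in pairs)
    # Mean IoU weights every sample equally, regardless of object size.
    mean = sum(ious) / len(ious)
    return precision, overall, mean


# Three toy samples: a good small object, a poor small object,
# and a good large object.
pairs = [(85, 100), (10, 100), (300, 400)]
precision, overall_iou, mean_iou = segmentation_metrics(pairs)
```

Because the large object contributes most of the pooled pixels, the overall IoU in this toy example is higher than the mean IoU, which is one reason the two columns can rank methods differently.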