1 Introduction and Related Work
Video semantic segmentation is an active topic of research in computer vision, as it serves as a basic foundation for various real-world applications, such as autonomous driving. It is a very challenging problem, where every pixel in the video needs to be assigned a semantic category. Video action detection is a subset of this problem, where we are only interested in segmenting/localizing the actors and objects involved in the activities present in a video. The problem becomes even more interesting when we want to focus on only select actors in the video. In this work, we focus on the selective localization of actors and actions in a video based on a textual query, as seen in Figure 1.
Action recognition has seen rapid progress in recent years, mainly attributed to the success of deep learning along with the availability of large-scale datasets [5, 28, 12]. However, one limitation of these datasets is that the actor is mainly a person performing various activities. Xu et al.  introduced an actor-action video dataset (A2D), which has several actor and action pairs; this dataset presents several challenges, as there can be different types of actors besides humans performing the actions. Also, there can be multiple actors present in a scene, which is quite challenging when compared to typical datasets, where only one actor performs the action. Their experiments on A2D showed that joint inference over actor and action outperforms methods that treat them independently. Ji et al.  explored the role of optical flow along with RGB data and proposed an end-to-end deep network for joint actor and action localization.
Gavrilyuk et al.  recently extended the A2D dataset with human generated sentences, describing the actors and actions in the video, and proposed the task of actor and action segmentation from a sentence. Their method uses a convolutional network for both visual as well as textual inputs and predicts localization on one frame of the video. We propose a different approach, where we make use of capsules for both visual as well as text encoding, and perform localization on the full video instead of just one frame, in order to fully utilize the spatiotemporal information captured by the video.
Hinton et al. first introduced the idea of capsules in , and subsequently capsules were popularized in , where dynamic routing for capsules was proposed. This was further extended in , where a more effective EM routing algorithm was introduced. Recently, capsule networks have shown state-of-the-art results for human action localization in video , object segmentation in medical images , and text classification . We propose to extend the use of capsule networks into the multi-modal domain, where the segmentation and localization of objects in video are conditioned on a natural language input. We introduce a novel capsule based attention mechanism for fusion of video and text capsules for text selected segmentation.
There are several existing works focusing on learning the interaction between text and visual data [19, 30, 3, 6]. These works mainly focus on whole image and video level detections, as opposed to pixel-level segmentation. Hu et al.  introduced the problem of segmenting images based on a natural language expression; their method merges image and text features in a convolutional neural network (CNN) by concatenating features extracted from both modalities, followed by a convolution. Li et al.  propose a different approach to merging these two modalities for the task of tracking a target in a video; they use an element-wise multiplication between the image feature maps and the sentence features, in a process called dynamic filtering. Our proposed method encodes both the video and the textual query as capsules, and makes use of an EM routing algorithm to learn the interaction between textual and visual information. The method differs from conventional convolutional approaches as it takes advantage of capsule networks' ability to model entities and their relationships; the routing procedure allows our network to learn the relationships between entities from different modalities.
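The two convolutional fusion strategies mentioned above can be sketched in a few lines. The following is an illustrative numpy sketch, not code from the cited works: function names and weight shapes are our own assumptions. `concat_fuse` mirrors the concatenate-then-convolve approach of Hu et al., and `dynamic_filter` mirrors the element-wise dynamic filtering of Li et al.

```python
import numpy as np

def dynamic_filter(feature_maps, sentence_vec):
    """Dynamic filtering (illustrative): the sentence embedding acts as a
    filter applied to the visual feature maps via element-wise multiplication
    across the channel dimension.
    feature_maps: (H, W, C); sentence_vec: (C,)."""
    return feature_maps * sentence_vec[None, None, :]

def concat_fuse(feature_maps, sentence_vec, weights):
    """Concatenation-based fusion (illustrative): tile the sentence vector
    over the spatial grid, concatenate with the visual features, then apply
    a 1x1 convolution, implemented here as a per-location matmul with
    `weights` of shape (C + D, C_out)."""
    h, w, _ = feature_maps.shape
    tiled = np.broadcast_to(sentence_vec, (h, w, sentence_vec.size))
    fused = np.concatenate([feature_maps, tiled], axis=-1)
    return fused @ weights
```

Both produce conditioned feature maps; as discussed in Section 2, our capsule-based routing replaces this kind of feature-level merging.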
In summary, we make the following contributions in this work: (1) we propose an end-to-end capsule network for the task of selective actor and action localization in videos, which encodes both the video and the textual query in the form of capsules; (2) we introduce a novel multi-modal conditioning routing as an attention mechanism to address the issue of cross-modality capsule selection; and (3) to demonstrate the potential of the proposed text-selective actor and action localization in videos, we extend the annotations in the A2D dataset to full video clips. Our experiments demonstrate the effectiveness of the proposed method, and we show its advantage over existing state-of-the-art works in terms of performance.
2 Conditioning with Multi-modal Capsule Routing
We argue that capsule networks can effectively perform multi-modal conditioning. Capsules represent entities, and routing uses high-dimensional coincidence filtering  to learn part-to-whole relationships between these entities. There are several possible ways to incorporate conditioning into capsule networks. One trivial approach would be to apply a convolutional method (concatenation followed by a 1x1 convolution , or multiplication/dynamic filtering ) to create conditioned feature maps, and then extract a set of capsules from these feature maps. This, however, would not perform much better than fully convolutional networks, since the same conditioned feature maps are obtained from the merging of the visual and textual modalities, and the only difference is how they are transformed into segmentation maps.
Another method would be to first extract a set of capsules from the video, and then apply dynamic filtering on these capsules. This can be done by (1) applying a dynamic filter to the pose matrices of the capsules, or (2) applying a dynamic filter to the activations of the capsules. The first is not much different from the trivial approach described above, since the same set of conditioned features would be present in the capsule pose matrices, rather than in the layer prior to the capsules. The second approach would merely discount the importance of the votes corresponding to entities not present in the sentence; this is not ideal, since it does not take advantage of routing's ability to find agreement between entities in both modalities.
Instead, we propose an approach that leverages the fact that the same entities exist in both the video and sentence inputs, and that routing can find similarities between these entities. From the video, we extract a grid of capsules describing the visual entities, each with a pose matrix and an activation. Similarly, from the sentence, we generate sentence capsules, again with pose matrices and activations. Each set of capsules has its own transformation matrices, for video and text respectively, which are used to cast votes for the capsules in the following layer. Video capsules at different spatial locations share the same transformation matrices. Using the procedure described in Algorithm 1, we obtain a grid of higher-level capsules. This algorithm allows the network to find similarity, or agreement, between the votes of the video and sentence capsules at every location on the grid. If there is agreement between the votes, then the same entity exists in both the sentence and the given location in the video, leading to a high activation of the capsule corresponding to that entity. Conversely, if the sentence does not describe the entity present at a given spatial location, then the activation of the higher-level capsules will be low, since the votes disagree.
This formulation of multi-modal conditioning using capsules allows the network to learn a set of entities (capsules) from both the visual and sentence inputs. Then, the voting and routing procedure allows us to find agreement between both modalities in the form of higher-level capsules. Suppose there is a higher-level capsule describing a dog. When given a query like "The brown dog running", the sentence capsules' vote matrices corresponding to the dog class contain the different properties of the dog, such as the fact that it is running or that it is brown. If there is a running brown dog at some location in the visual input, then the visual votes at that location will be similar to the sentence's votes, so the activation of the higher-level dog capsule will be high at that location. If, for instance, there is a black dog or a rolling dog at some location, then the votes will not agree, and the activation of the dog capsule will be low there.
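To make this intuition concrete, the following is a heavily simplified numpy sketch of the agreement step: both modalities cast votes, and a higher-level capsule's activation is high only where the votes coincide. This is an illustration of the idea, not the full EM routing of Algorithm 1; the function name, vote shapes, and the exponential agreement measure are our own assumptions.

```python
import numpy as np

def multimodal_routing(video_votes, sentence_votes):
    """Simplified, illustrative agreement step (not the full EM routing).

    video_votes:    (H, W, n_caps, d)  votes cast by video capsules per location
    sentence_votes: (n_caps, d)        votes cast by the sentence capsules

    Returns pose (H, W, n_caps, d) and activation (H, W, n_caps); the
    activation is high where the video and sentence votes agree.
    """
    sv = sentence_votes[None, None]                       # broadcast over grid
    votes = np.stack(
        [video_votes, np.broadcast_to(sv, video_votes.shape)], axis=0)
    pose = votes.mean(axis=0)                             # mean vote -> pose
    disagreement = ((votes - pose) ** 2).mean(axis=(0, -1))
    activation = np.exp(-disagreement)                    # agreement -> high act
    return pose, activation
```

With agreeing votes the activation approaches 1; with conflicting votes it decays toward 0, mirroring the dog example above.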
Although this work mainly focuses on video and text, the multi-modal capsule routing procedure can be applied to capsules generated from many other modalities, like images or audio.
3 Network Architecture
The overall network architecture is shown in Figure 2. In this section, we discuss the various components of the architecture as well as the objective function used to train the network.
3.1 Video Capsules
The video input consists of 4 frames. The process for generating video capsules begins with an Inception-based 3D convolutional network known as I3D , which generates 832 spatiotemporal feature maps taken from the maxpool3d_3a_3x3 layer. Capsule pose matrices and activations are generated by applying a convolution operation to these feature maps, with linear and sigmoid activations respectively. Since there is no padding for this operation, the result is a capsule layer with 8 capsule types.
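The capsule head can be sketched as follows. This is an illustrative numpy sketch: the random projection weights stand in for the learned convolution, and the 4x4 pose-matrix shape is an assumption in line with EM-routing capsules.

```python
import numpy as np

def make_capsules(feature_maps, n_caps=8, pose_dim=4, rng=None):
    """Illustrative capsule head: project each spatial location of the
    feature maps to pose matrices (linear activation) and activations
    (sigmoid). feature_maps: (H, W, C)."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = feature_maps.shape
    flat = feature_maps.reshape(-1, C)
    W_pose = rng.standard_normal((C, n_caps * pose_dim * pose_dim)) * 0.01
    W_act = rng.standard_normal((C, n_caps)) * 0.01
    poses = (flat @ W_pose).reshape(H, W, n_caps, pose_dim, pose_dim)
    acts = 1.0 / (1.0 + np.exp(-(flat @ W_act)))          # sigmoid activations
    return poses, acts.reshape(H, W, n_caps)
```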
3.2 Sentence Capsules
A series of convolutional and fully connected layers is used to generate the sentence capsules. First, each word of the sentence is converted into a 300-dimensional vector using a word2vec model pre-trained on the Google News corpus. Sentences in the network are set to 16 words, so longer sentences are truncated and shorter sentences are padded with zeros. The sentence representation is then passed through 3 parallel stages of 1D convolution with kernel sizes of 2, 3, and 4, each with a ReLU activation. We then apply max-pooling to obtain 3 vectors, which are concatenated and passed through a max-pooling layer to obtain a single 300-dimensional vector describing the entire sentence. A fully connected layer then generates the 8 pose matrices and 8 activations for the capsules which represent the entire sentence. We found that this method of generating sentence capsules performed best in our network; various other methods are explored in the Supplementary Material.
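A minimal numpy sketch of this sentence branch follows. It is illustrative only: the weights are random placeholders rather than learned parameters, and the final fully connected layer producing the capsule poses and activations is omitted.

```python
import numpy as np

def sentence_encoder(word_vecs, rng=None):
    """Illustrative sketch of the sentence branch: parallel 1D convolutions
    with kernel sizes 2, 3, 4, ReLU, max-pooling over time, then an
    element-wise max over the three pooled vectors.
    word_vecs: (T, D), e.g. 16 words x 300-d word2vec embeddings."""
    if rng is None:
        rng = np.random.default_rng(0)
    T, D = word_vecs.shape
    pooled = []
    for k in (2, 3, 4):
        Wk = rng.standard_normal((k * D, D)) * 0.01
        # 1D convolution as a sliding window followed by a matmul
        windows = np.stack([word_vecs[t:t + k].ravel() for t in range(T - k + 1)])
        feat = np.maximum(windows @ Wk, 0.0)      # ReLU
        pooled.append(feat.max(axis=0))           # max-pool over time
    return np.stack(pooled).max(axis=0)           # single D-dim sentence vector
```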
3.3 Merging and Masking
Once the video and sentence capsules are obtained, we merge them in the manner described in Section 2, as depicted in Figure 3. The result of the routing operation is a grid with 8 capsule types: one for each actor class in the A2D dataset and one for a "background" class, which is used to route unnecessary information. The activations of these capsules correspond to the existence of the corresponding actor at the given location, so averaging the activations over all locations gives us a classification prediction over the video clip. We perform the capsule masking procedure described in . When training the network, we mask (multiply by 0) all pose matrices not corresponding to the ground-truth class. At test time, we mask the pose matrices not corresponding to the predicted class. These masked poses are then fed into an upsampling network to generate a foreground/background segmentation mask for the actor described by the sentence.
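The masking step amounts to zeroing the poses of all but one capsule type. A minimal numpy sketch (the function name and shapes are our own; at test time the target falls back to the class predicted from the averaged activations, as described above):

```python
import numpy as np

def mask_poses(poses, activations, target=None):
    """Capsule masking: zero out pose matrices of every capsule type except
    the target class (ground truth at train time, argmax prediction at test).
    poses: (H, W, n_caps, 4, 4); activations: (H, W, n_caps)."""
    if target is None:  # test time: classify by averaging over all locations
        target = int(activations.mean(axis=(0, 1)).argmax())
    masked = np.zeros_like(poses)
    masked[:, :, target] = poses[:, :, target]
    return masked, target
```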
3.4 Upsampling Network
The upsampling network consists of 5 convolutional transpose layers. The first of these increases the feature map dimension with a kernel that corresponds to the one used to create the video capsules from the I3D feature maps. The following 3 layers are strided in both time and space, so that the output dimensions are equal to the input video dimensions. The final segmentation is produced by one last convolutional transpose layer. We must note that this method diverges from previous methods in that it outputs segmentations for all input frames, rather than a single-frame segmentation per video clip. We use parameterized skip connections from the I3D encoder to obtain more fine-grained segmentations. At each step of upsampling, lower-resolution segmentation maps are generated to aid in the training of these skip connections.
3.5 Objective Function
The network is trained end-to-end using an objective function based on classification and segmentation losses. For classification, we use a spread loss, which is computed as follows:

$$L_{cls} = \sum_{i \neq t} \max\big(0,\; m - (a_t - a_i)\big)^2,$$

where $m$ is a margin, $a_i$ is the activation of the capsule corresponding to class $i$, and $a_t$ is the activation of the capsule corresponding to the ground-truth class $t$. During training, $m$ is linearly increased between its lower and upper bounds.
The segmentation loss is computed using sigmoid cross entropy. Averaged over all $N$ pixels in the segmentation map, the loss is:

$$L_{seg} = -\frac{1}{N} \sum_{p=1}^{N} \Big[\, y_p \log(\hat{y}_p) + (1 - y_p) \log(1 - \hat{y}_p) \,\Big],$$

where $y$ is the ground-truth segmentation map and $\hat{y}$ is the network's output segmentation map. We use this segmentation loss at several resolutions to aid in the training of the skip connections.
The final loss is a weighted sum of the classification and segmentation losses:

$$L = \lambda\, L_{cls} + (1 - \lambda)\, L_{seg},$$

where $\lambda$ is set when training begins. Since the network quickly learns to classify the actor when given a sentence input, we decrease $\lambda$ once the classification accuracy saturates (over 95% on the validation set). We find that this reduces over-fitting and results in better segmentations.
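A minimal numpy sketch of the two loss terms described above (illustrative only; `spread_loss` and `seg_loss` are our own names, and `pred` is assumed to already be a probability map with the sigmoid applied):

```python
import numpy as np

def spread_loss(acts, target, m):
    """Spread loss: penalize each wrong-class activation that comes within
    margin m of the target-class activation a_t."""
    a_t = acts[target]
    others = np.delete(acts, target)
    return np.sum(np.maximum(0.0, m - (a_t - others)) ** 2)

def seg_loss(gt, pred):
    """Sigmoid cross entropy averaged over all pixels."""
    eps = 1e-7
    p = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p))
```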
The network was implemented in PyTorch. The I3D encoder used weights pretrained on Kinetics  and fine-tuned on Charades . The network was trained using the Adam optimizer  with a learning rate of 0.001. As video resolutions vary across datasets, all video inputs are scaled to a fixed resolution while maintaining aspect ratio through the use of horizontal black bars. When using bounding box annotations, we consider pixels within the bounding box to be foreground and pixels outside to be background.
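The aspect-ratio-preserving rescaling can be sketched as follows. This is an illustrative numpy sketch: the target resolution is left as a parameter (the paper's exact value is not reproduced here), and nearest-neighbor resizing stands in for a proper interpolation.

```python
import numpy as np

def letterbox(frame, out_h, out_w):
    """Scale a frame to fit (out_h, out_w) preserving aspect ratio, padding
    the remainder with black bars."""
    h, w = frame.shape[:2]
    scale = min(out_h / h, out_w / w)
    nh, nw = int(h * scale), int(w * scale)
    # nearest-neighbor resize via index lookup
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = frame[ys][:, xs]
    out = np.zeros((out_h, out_w) + frame.shape[2:], dtype=frame.dtype)
    top, left = (out_h - nh) // 2, (out_w - nw) // 2
    out[top:top + nh, left:left + nw] = resized  # center, leaving black bars
    return out
```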
|Method|P@0.5|P@0.6|P@0.7|P@0.8|P@0.9|mAP|Overall IoU|Mean IoU|
|Hu et al. |34.8|23.6|13.3|3.3|0.1|13.2|47.4|35.0|
|Li et al. |38.7|29.0|17.5|6.6|0.1|16.3|51.5|35.4|
|Gavrilyuk et al. |50.0|37.6|23.1|9.4|0.4|21.5|55.1|42.6|

|Method|P@0.5|P@0.6|P@0.7|P@0.8|P@0.9|mAP|Overall IoU|Mean IoU|
|Hu et al. |63.3|35.0|8.5|0.2|0.0|17.8|54.6|52.8|
|Li et al. |57.8|33.5|10.3|0.6|0.0|17.3|52.9|49.1|
|Gavrilyuk et al. |69.9|46.0|17.3|1.4|0.0|23.3|54.1|54.2|
4.1 Single-Frame Segmentation Conditioned on Sentences
In this experiment, a video clip and a human-generated sentence describing one of the actors in the video are taken as inputs, and the network generates a binary segmentation mask localizing the described actor. Similar to previous methods, the network is trained and tested on the single-frame annotations provided in the A2D dataset. To compare our method with previous approaches, we modify our network in these experiments: we replace the 3D convolutional transpose layers in our upsampling network with 2D convolutional transpose layers to output a single-frame segmentation.
We conduct our experiments on two datasets: A2D  and J-HMDB . The A2D dataset contains 3782 videos (3036 for training and 746 for testing) covering 7 actor classes, 8 action classes, and an extra action label none, which accounts for actors in the background or actions different from the 8 action classes. Since actors cannot perform all labeled actions, there are a total of 43 valid actor-action pairs. Each video in A2D has 3 to 5 frames annotated with pixel-level actor-action segmentations. The J-HMDB dataset contains 928 short videos with 21 different action classes. All frames in the J-HMDB dataset are annotated with pixel-level segmentation masks. Gavrilyuk et al.  extended both of these datasets with human-generated sentences that describe the actors of interest in each video. These sentences typically use the actor and action as part of the description, but many omit the action and rely on other descriptors such as location or color.
We evaluate our results using all metrics used in . The overall IoU is the intersection-over-union (IoU) computed over all samples, which tends to favor larger actors and objects. The mean IoU is the IoU averaged over all samples, which treats samples of different sizes equally. We also measure the precision at 5 IoU thresholds and the mean average precision over a range of thresholds.
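These metrics can be sketched as follows (an illustrative numpy sketch; the function names and the single precision threshold are our own simplifications of the full protocol):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def evaluate(preds, gts, thresh=0.5):
    """Overall IoU pools intersection and union over all samples (favoring
    larger objects); mean IoU averages per-sample IoUs (treating samples
    equally); precision@thresh is the fraction of samples whose IoU exceeds
    the threshold."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union, float(np.mean(ious)), float(np.mean([i > thresh for i in ious]))
```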
We compare our results on A2D with previous approaches in Table 1. Our network outperforms previous state-of-the-art methods in all metrics, and has a notable 9% improvement in the mAP metric, even though we do not process optical flow, which would require extra computation. We also find that our network achieves much stronger results at higher IoU thresholds, which signifies that the segmentations produced by the network are more fine-grained and adhere to the contours of the queried objects. Qualitative results on A2D can be found in Figure 4.
4.2 Full Video Segmentation Conditioned on Sentences
In this set of experiments, we train the network using the bounding box annotations for all frames. Since previous baselines only output single-frame segmentations, we test our method against our own single-frame segmentation network as a baseline, which can generate segmentations for an entire video by processing it frame-by-frame.
We extend the A2D dataset by adding bounding box localizations for the actors of interest in every frame of the dataset. This allows us to train and test our method using the entire video and not just 3 to 5 frames per video. The J-HMDB dataset has annotations on all frames, so we can evaluate the method on this dataset as well.
To evaluate the segmentation results for entire videos, we consider each video as a single sample. Thus, the IoU computed is the intersection-over-union between the ground-truth tube and the generated segmentation tube. Using this metric, we can calculate the video overall IoU and the video mean IoU; the former favors both larger objects and objects in longer videos, while the latter treats all videos equally. We also measure the precision at 5 different IoU thresholds and the video mean average precision over a range of thresholds.
Since the network is trained using the bounding box annotations, the produced segmentations are more block-like, but it is still able to successfully segment the actors described in the given queries. We compare the qualitative results between the network trained only on fine-grained segmentations and the network trained on bounding box annotations in Figure 5. When tested on the A2D dataset, we find a significant improvement in all metrics compared to the network trained only on single frames with pixel-wise segmentations. However, this is to be expected, since the ground-truth tubes are bounding boxes, so box-like segmentations around the actor produce higher IoU scores. For a fairer comparison, we place a bounding box around the fine-grained segmentations produced by the network trained on the pixel-wise annotations: this produces better results, since the new outputs more closely resemble the ground-truth tubes. Even with this change, the network trained on bounding box annotations has the strongest results, since it learns from all frames in the training videos, as opposed to a handful of frames per video. The results of this experiment can be seen in Table 6.
The J-HMDB dataset has pixel-level annotations for all frames, so the box-like segmentations produced by the network should be detrimental to results; we found that this was indeed the case: the network performed poorly in every metric when compared to the network trained on fine-grained pixel-level annotations.
4.3 Image Segmentation Conditioned on Sentences
We also evaluate our method by segmenting images based on text queries. To make as few modifications to the network as possible, the single images are repeated to create a “boring” video input with 4 identical frames.
We train and test on the ReferItGame dataset , which contains 20000 images with 130525 natural language expressions describing various objects in the images. We use the same train/test splits as [11, 25], with 9000 training images and 10000 testing images. Unlike A2D, there is no predefined set of actors, so no classification loss or masking is used.
The results for this experiment can be seen in Table 8. We obtain similar results to other state-of-the-art approaches, even though our network architecture is designed for actor/action video segmentation. This demonstrates that our proposed capsule routing procedure is effective on multiple visual modalities - both videos and images.
|Method|P@0.5|P@0.6|P@0.7|P@0.8|P@0.9|v-mAP|Video Overall IoU|Video Mean IoU|
|Key frames (pixel)|9.6|1.6|0.4|0.0|0.0|1.8|34.4|26.6|
|Key frames (bbox)|41.9|33.3|22.2|10.0|0.1|21.2|51.5|41.3|
|Hu et al. |56.83|43.86|26.65|6.47|
|Shi et al. |59.09|45.87|32.82|11.79|
4.4 Ablation Studies
The ablation experiments were trained and evaluated using the pixel-level segmentations from the A2D dataset. All ablation results can be found in Table 7.
Parameterized Skip Connections
To understand the effectiveness of the parameterized skip connections from the I3D encoder, we ran an experiment with these skip connections removed. This resulted in a 3% reduction in both mean IoU and mean average precision. The decrease in performance shows that the skip connections are necessary for the network to preserve fine-grained details from the input video.
Classification and Masking
We test the influence of the classification loss on this segmentation task by running an experiment without back-propagating this loss. Without classification, the masking procedure would fail at test time, so masking is not used and all poses are passed forward to the upsampling network. This performed worse than the baseline in all metrics, which shows the importance of the classification loss and masking when training this capsule network. To further investigate the effects of masking, we perform an experiment with no masking, but with the classification loss. Surprisingly, it performs worse than the network with neither masking nor classification loss; this signifies that the classification loss can be detrimental to this segmentation task if there is no masking to guide the flow of the segmentation loss gradient.
Multi-Resolution Segmentation Loss
The base network computes a segmentation loss not only at the final output, but also at multiple intermediate resolutions. This approach was used to great success in . As an ablation, we trained the network ignoring the intermediate segmentation losses, which produced similar results to the baseline network. Thus, the multi-resolution segmentation loss has no noticeable effect on our network.
Multi-Modal Capsule Routing
We run a series of experiments to test the effectiveness of our multi-modal capsule routing procedure. We test the four other conditioning methods described in Section 2: the two trivial approaches (concatenation and multiplication), and the two methods which apply dynamic filtering to the video capsules (filtering the pose matrices and filtering the activations). The results suggest that multi-modal capsule routing is an effective method for merging different data modalities, since the convolution-based approaches perform much worse in all metrics. Moreover, these experiments show that it is a non-trivial task to extend techniques developed for CNNs, like dynamic filtering for natural language conditioning, to capsule networks.
|No skip connections|49.5|26.9|43.1|
|No classification loss nor masking|49.4|28.8|43.6|
|No masking (with classification loss)|48.3|27.8|42.5|
4.5 Failure Cases
We find that the network has two main failure cases: (1) the network incorrectly selects an actor which is not described in the query, and (2) the network fails to segment anything in the video. Figure 6 contains examples of both cases. The first case occurs when the text query refers to an actor/action pair and multiple actors are doing this action or the video is cluttered with many possible actors from which to choose. This suggests that an improved video encoder which extracts better video feature representations and creates more meaningful video capsules could reduce the number of these incorrect segmentations. The second failure case tends to occur when the queried object is small, which is often the case with the “ball” class or when the actor of interest is far away.
5 Conclusion and Future Work
In this work, we propose a capsule network for the localization of actors and actions based on a textual query. The proposed framework makes use of capsules for both the video and the textual representation. We introduce the concept of multi-modal capsule networks, through multi-modal EM routing, for the localization of actors and actions in video conditioned on a textual query. The existing annotations on the A2D dataset are for single frames, so we extended the dataset with annotations for all frames to validate the performance of our proposed approach. In our experiments, we demonstrate the effectiveness of multi-modal capsule routing and observe an improvement in performance when compared to state-of-the-art approaches. We found the capsule representation to be effective for both the visual and text modalities; in future work, we plan to explore the interplay between these modalities using capsules and to apply them to other domains.
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.
-  K. Duarte, Y. S. Rawat, and M. Shah. Videocapsulenet: A simplified network for action detection. In Advances in Neural Information Processing Systems, 2018.
-  J. Gao, C. Sun, Z. Yang, and R. Nevatia. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101, 2017.
-  K. Gavrilyuk, A. Ghodrati, Z. Li, and C. G. Snoek. Actor and action video segmentation from a sentence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5958–5966, 2018.
-  C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. CVPR, 2018.
-  L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5803–5812, 2017.
-  S. Herath, M. Harandi, and F. Porikli. Going deeper into action recognition: A survey. Image and vision computing, 60:4–21, 2017.
-  G. Hinton, S. Sabour, and N. Frosst. Matrix capsules with em routing. 2018.
-  G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
-  R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In IEEE International Conference on Computer Vision, 2017.
-  R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108–124. Springer, 2016.
-  H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3192–3199. IEEE, 2013.
-  J. Ji, S. Buch, A. Soto, and J. C. Niebles. End-to-end joint semantic segmentation of actors and actions in video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 702–717, 2018.
-  V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV-IEEE International Conference on Computer Vision, 2017.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
-  S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  R. LaLonde and U. Bagci. Capsules for object segmentation. arXiv preprint arXiv:1804.04241, 2018.
-  S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang. Person search with natural language description.
-  Z. Li, R. Tao, E. Gavves, C. G. Snoek, A. W. Smeulders, et al. Tracking by natural language specification. In CVPR, volume 1, page 5, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
-  H. Shi, H. Li, F. Meng, and Q. Wu. Key-word-aware network for referring expression image segmentation. In The European Conference on Computer Vision (ECCV), September 2018.
-  G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
-  K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  C. Xu, S.-H. Hsieh, C. Xiong, and J. J. Corso. Can humans fly? action understanding with multiple classes of actors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2264–2273, 2015.
-  M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. Spatio-temporal person retrieval via natural language queries. arXiv preprint arXiv:1704.07945, 2017.
-  W. Zhao, J. Ye, M. Yang, Z. Lei, S. Zhang, and Z. Zhao. Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538, 2018.
Appendix A Appendices
Here we include qualitative and quantitative results which could not be included in the main text, as well as figures and a more in-depth description of the network architecture.
Appendix B Network Architecture
When designing our network, we found several components key to obtaining our state-of-the-art results. We explain some of these components, and how they impacted our results. Furthermore, we include several figures (7, 8, 9) which illustrate the construction of the video capsules, the sentence network, and the upsampling network respectively.
B.1 Video Encoder
Our network uses a pretrained I3D network to encode the video into a set of feature maps of size . Originally we used the simpler C3D network [tran2015learning] to encode the feature maps, but it achieved substantially worse results: a mean IoU of 34%. This shows that the features extracted from the I3D network are more useful for our video capsules than those from the C3D network. It also suggests that future improvements in video feature extraction techniques and networks will lead to improved capsule representations and improved results.
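The encoder's effect on tensor shapes can be sketched with simple stride arithmetic (the function name and all strides/channel counts below are hypothetical placeholders, not I3D's actual configuration):

```python
def encoder_output_shape(frames, height, width,
                         temporal_stride=2, spatial_stride=16, channels=832):
    """Shape bookkeeping for a pretrained 3D CNN used as a video encoder:
    the clip is downsampled in time and space, and the channel dimension
    holds the learned features. All numbers here are illustrative."""
    return (channels,
            frames // temporal_stride,
            height // spatial_stride,
            width // spatial_stride)

# e.g. an 8-frame 224x224 clip -> feature maps of shape (832, 4, 14, 14)
print(encoder_output_shape(8, 224, 224))
```

Any encoder producing feature maps of a compatible shape could be swapped in, which is why better backbones translate directly into better capsule representations.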
B.2 Sentence Network
We also tested several different configurations for our sentence network. We began with the capsule networks (both Capsule-A and Capsule-B) described in Figure 2 of , which showed strong results on text classification. These networks have several capsule layers, but their use led to poor results: a mean IoU of 35.7% for the Capsule-A network, and a mean IoU of 36.4% for the Capsule-B network. Although these networks performed well on the text classification task, a different set of text features must be learned to condition visual features. Therefore, our sentence network with conventional convolutional layers, max-pooling, and a fully connected layer is better able to extract textual features for the task of video segmentation from a sentence.
Appendix C Evaluations
We include several tables that could not fit in our main text. Table 6 shows the results of our network trained on the bounding box annotations from A2D and tested on JHMDB. The network's outputs are more box-like because it was trained on bounding boxes as opposed to pixel-wise segmentations; therefore, the network achieves its strongest results when evaluated against bounding box ground-truths. Tables 7 and 8 contain all the metrics which we were unable to include for the ablation experiments and the ReferItGame experiments, respectively.
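The overlap metrics in these tables follow the standard definitions; a minimal sketch of pixel-wise IoU and precision at an overlap threshold (the function names are ours, for illustration):

```python
def iou(pred, gt):
    """Pixel-wise intersection-over-union between two binary masks,
    given as flattened lists of 0/1 values."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 0.0

def precision_at(ious, threshold):
    """Fraction of predictions whose IoU exceeds the given overlap
    threshold, as in the overlap columns of the result tables."""
    return sum(i > threshold for i in ious) / len(ious)

scores = [iou([1, 1, 0, 0], [1, 0, 0, 0]),   # 0.5
          iou([1, 1, 1, 0], [1, 1, 1, 0])]   # 1.0
print(precision_at(scores, 0.5))  # 0.5: only one prediction exceeds 0.5
```

Mean IoU averages `iou` over samples, while sweeping `threshold` produces the family of precision-at-overlap columns.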
Appendix D Qualitative Results
We have generated several videos with the segmentations produced by our networks for both the A2D and JHMDB datasets. Each video contains the input sentences, color coded to match the ground-truth colors. The first row of each video shows the ground-truth bounding boxes (for A2D) or ground-truth segmentation masks (for JHMDB). The second row is the output of the network trained on the key frames of the A2D dataset, which have pixel-wise segmentation ground-truths. The third row is the output of the network trained on all the frames of the A2D dataset, using bounding-box ground-truths. In our analysis of the qualitative results, we refer to the former network as the "Key Frame network" and the latter as the "Bounding Box network".
D.1 Single Actor
The networks perform best when there is a single actor in the scene. In this case, we find that the Key Frame network produces very fine-grained segmentations which maintain the boundaries of the actors, while the Bounding Box network successfully segments a box around the actor. This behavior can be observed in the videos in the A2DSingleActor folder, which contains examples from the A2D dataset, and the JHMDB folder, which contains examples from the JHMDB dataset.
D.2 Multiple Actors
The networks can also perform well with multiple actors in the scene, as seen in the A2DMultiActor videos. In these cases, the segmentations are not as precise, but the general location of the actors is segmented by both networks. We note that the Bounding Box network, which was trained on all the frames of the dataset, tends to produce more consistent multi-actor segmentations, as seen in the videos "video3_multi_a2d" and "video5_multi_a2d". In both of these cases, each instance is of the same actor class, and the Key Frame network incorrectly segments one of the instances.
D.3 Failure Cases
As mentioned in the main text, our network tends to fail on the A2D dataset when there are multiple instances of the same actor class. We present 5 such failure videos in the A2DFailure folder. The probability of failure increases when the sentence queries are vague or could describe multiple actors within the scene, as in the video "video1_failure_a2d": several birds fit the descriptions "sparrow sitting on the grass" and "sparrow is walking on the brown grass". In such cases it would be very difficult even for a human to correctly segment the video. In many cases, the failure occurs when there are many similar actors near each other, as in the videos "video2_failure_a2d.mp4" and "video5_failure_a2d". In the first, there are multiple people running next to each other, while the second contains several cars moving near each other.
Since the JHMDB dataset has only a single actor in each video, the failures encountered are not caused by "difficult" queries or videos, but rather by the mismatch between the training and testing data. A2D videos tend to show humans performing actions with large amounts of motion, like walking, running, jumping, or rolling, while JHMDB contains many videos in which the action involves little motion, like brushing hair or archery. Thus, both the videos and the input textual queries are quite different between the datasets, which can cause a large performance discrepancy during evaluation. Examples of failure cases on the JHMDB dataset can be found in the JHMDBFailure folder.
| Method | Video Overlap P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | v-mAP | Video IoU (Mean) | Video IoU (Overall) |
|---|---|---|---|---|---|---|---|---|
| All frames (bbox) | 46.7 | 32.1 | 15.8 | 3.2 | 0.0 | 16.3 | 44.5 | 44.4 |
| No skip connections | 49.5 | 41.2 | 29.1 | 14.6 | 1.5 | 26.9 | 56.7 | 43.1 |
| No nor Masking | 49.4 | 42.5 | 32.7 | 19.6 | 3.3 | 28.8 | 57.6 | 43.6 |
| No Masking (with ) | | | | | | | | |
| Method | Overlap P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | mAP |
|---|---|---|---|---|---|---|
| Hu et al.* | 56.83 | 43.86 | 35.75 | 26.65 | 16.75 | 6.47 |
| Shi et al. | 59.09 | 45.87 | 39.80 | 32.82 | 23.81 | 11.79 |