Classifying Collisions with Spatio-Temporal Action Graph Networks

12/04/2018 ∙ by Roei Herzig, et al. ∙ Tel Aviv University ∙ Nexar Ltd. ∙ UC Berkeley

Events defined by the interaction of objects in a scene are often of critical importance, yet such events are typically rare, and the available labeled examples are insufficient to train a conventional deep model that performs well across expected object appearances. Most deep learning activity recognition models focus on global context aggregation and do not explicitly consider object interactions inside the video, potentially overlooking important cues relevant to interpreting activity in the scene. In this paper, we show that a new model for explicit representation of object interactions significantly improves deep video activity classification for driving collision detection. We propose a Spatio-Temporal Action Graph (STAG) network, which incorporates spatial and temporal relations of objects. The network is automatically learned from data, with a latent graph structure inferred for the task. As a benchmark to evaluate performance on collision detection tasks, we introduce a novel dataset based on real-life driving collisions and near-collisions. This dataset reflects the challenging task of detecting and classifying accidents in a richly varying yet highly constrained setting that is highly relevant to the evaluation of autonomous driving and alerting systems. Our experiments confirm that our STAG model offers significantly improved results for collision activity classification.




1 Introduction

Figure 1: The Spatio-Temporal Action Graph (STAG) network for driving event classification: our end-to-end trainable network exploits a latent graph structure to leverage object-relational structure on individual frames. A temporal attention mechanism weights graph features over time, with a final classification output predicting, e.g., a collision or no-collision event.

Recognition of rare events in natural scenes poses a challenge for deep learning approaches to activity recognition, since an insufficient number of training examples are typically available to learn all required conditions. In driving scenarios, for example, critical events are often a function of the spatial relationships of prominent objects, yet they are rare enough that they will not be experienced across a sufficient range of object appearances. Conventional deep models work well when they are supervised across the range of conditions they will experience, and conversely, may perform poorly on new object-appearance combinations. We desire an activity recognition model which can generalize effectively across object appearances and inter-object relationships.

With the advent of deep learning techniques, models have generally been limited to holistic representations, directly applying convolutional filters to full video frames. Typical methods include an LSTM conditioned on per-frame CNN features [7], 3-dimensional convolutional networks (3D ConvNets) [37], and two-stream convolutional networks [35]. None of these models explicitly leverage object interactions to help activity classification, though most activities contain common object interaction patterns which are a clear cue for the activity category.

Historically, objects and their interactions have been considered valuable features for understanding activities, represented by spatio-temporal tubes (e.g., [12, 44]), spatially-aware embeddings [28], and/or spatio-temporal graphical models (e.g., [30, 3, 13]). But generally, when learning-based approaches were considered, the addition of explicit object detection often did not show significant improvements in real-world evaluation settings. The most successful SVM-based video classification methods were largely based on handcrafted and holistic features, including bag-of-words descriptors, dense trajectories, and motion boundary descriptors [39, 41, 40].

Recently, Wang et al. addressed the role of object temporal properties in deep activity classification [42]. They build a temporal graph to model object evolution, and combine this graph feature encoding with the holistic convolutional encoding of the whole video to improve video classification. While they show that temporal graph modeling of boxes offers significant improvement on activity classification, their model does not consider the spatial relations of boxes within each frame. In addition, in [42] the temporal graph connectivity is defined by two metrics: the similarity relation of correlated objects, and the spatio-temporal relation between nearby objects in adjacent frames. These two metrics may be too rigid to find the most suitable temporal connectivity for a specific dataset.

In this paper, we explore whether spatio-temporal object relationships offer improved activity recognition performance. We propose a Spatio-Temporal Action Graph (STAG) with both explicit spatial and explicit temporal connectivity between objects. A STAG not only models object appearance relationships on a per-frame basis but also models the temporal evolution of object relations. We learn the temporal relation structures from data, which may be more suitable for specific datasets and/or tasks. In Fig. 1, we demonstrate our approach on a driving scene for collision event detection.

To evaluate our model on realistic critical driving scenarios, we introduce a challenging dataset based on real-world dashcam data. We collect 803 videos from more than 10 million rides, focusing on rare events of collision and near-collision scenarios experienced by the (dashcam) vehicle, typically involving its contact with a fixed or moving object such as another vehicle or pedestrian. Such activity tasks, which involve recognizing an interaction with a specific, well-detected object, are a natural fit for our proposed approach.

Our experimental results validate the thesis that both spatial and temporal relations of objects are important for activity classification, and that explicitly modeling interactions offers advantages over implicit temporal modeling alone.

With an arbitrary amount of training data, end-to-end learning with a suitably high-capacity network could implicitly learn object relationships relevant to recognition tasks. But such amounts of training data are not realizable in practice. Our results confirm that explicit object representations can offer better generalization performance for deep activity recognition in realistic conditions with limited training data.

The following section reviews relevant work on activity classification and graph models. Sec. 3 describes our approach to object-aware activity recognition using a STAG architecture. Sec. 4 introduces our Collisions dataset. Sec. 5 presents the experimental results on the Collisions dataset. Sec. 6 concludes the paper.

2 Related Work

Figure 2: Object-aware Activity Recognition architecture. The input consists of a sequence of 20 frames. (1) A backbone of ResNet50 and an RPN produces a set of bounding-box proposals for each frame. (2) An RoIAlign layer extracts per-box features from the backbone using the proposals. In parallel, every pair of box proposals defines a union box, from which pairwise features are extracted in the same way. (3) The per-box and pairwise features are used as inputs to a Spatio-Temporal Action Graph Network module, which applies a graph network and attention over frames and objects to produce a single video representation. (4) Finally, the model outputs a collision classification.

Video Classification Early deep learning activity classification systems are in the spirit of bag-of-words models: per-frame features are pooled over an entire video sequence [22]. Simply pooling the per-frame features ignores the temporal structure of the video. Later work used a recurrent neural network, such as an LSTM, to temporally aggregate per-frame features [7, 48].

Another line of activity classifiers use 3D spatio-temporal filters to create hierarchical representations of the whole input video sequence [19, 36, 37, 38]. While spatio-temporal filters can benefit from large video datasets, they cannot be pretrained on images. Two-stream networks were proposed to leverage image pretraining in the RGB stream while capturing fine low-level motion in a separate optical-flow stream [35, 9]. I3D [4] inflates 2D kernels to 3D to learn spatio-temporal feature extractors from video while leveraging image-pretrained weights. In all these activity classifiers, whole-frame video features are extracted without looking at the object details inside each frame. Recently, some works have investigated the effect of either temporal or spatial object relations in activity classification [43, 26]. Our paper follows this direction and investigates the spatio-temporal relationships of objects for activity classification.

Graph Models

Traditional graph models include Bayesian networks and Conditional Random Fields (CRFs). Scene graphs were first used as a rich image representation for image retrieval tasks [21]. To apply deep learning to graph models, various graph convolution operations and architectures have been proposed [16, 23]. One unique characteristic of graph models is their relational reasoning ability [2, 49, 1]. Graph models have been applied in many applications, including image generation [20] and robotics [34]. Models for automatically constructing scene graphs based on a Faster R-CNN architecture have been proposed in [46]. In this paper, we build an object graph model to improve performance on video activity classification tasks.

Object Detection Faster R-CNN [32] extends the R-CNN [11] and Fast R-CNN [10] object detection approaches, incorporating region-of-interest (RoI) pooling and a region proposal network. More recent object detection models include SSD [25], R-FCN [5] and YOLO [31], which achieve faster and better object detection. Mask R-CNN [14] extends Faster R-CNN by adding an extra branch that predicts object masks in parallel with the original bounding-box predictions; it efficiently detects objects in an image while simultaneously generating high-quality instance segmentation masks. In this paper we take the high-quality proposals from Mask R-CNN and build the spatio-temporal graph over them for activity classification. Object detection proposals serve as a foundation for several computer vision tasks such as visual grounding [18] and robot learning [6]. Here we leverage object proposals for the activity classification task.

Objects for activities [24] classifies the event in an image while also assigning semantic labels to the objects and scene environment, showing that using scenes or objects alone cannot achieve good performance. [29] is early work exploiting human motion and object context to perform action recognition and object classification. Since many activity classes involve humans, a variety of papers model the interaction of humans and objects for activity recognition [33, 13, 47]. [8] exploits the consistent appearance of objects, hands, and actions from the egocentric view for activity recognition, and shows that jointly modeling activities, actions, and objects outperforms modeling them independently. [45] proposes an approach to activity recognition based on detecting and analyzing the temporal sequence of objects. We investigate spatio-temporal object interactions in the car-road scene for collision classification, and demonstrate the benefit of explicitly modeling object relationships through space and time.

3 Spatio-Temporal Action Graph

In what follows we describe our model for object-aware recognition using a Spatio-Temporal Action Graph (STAG) network. The key insight behind the Spatio-Temporal Action Graph module is that actions are characterized by the way different objects interact across space and time. To capture this, our model is constructed as follows. First, a detector module is applied to the image to extract bounding boxes of objects of interest (see Section 3.1). This results in a set of vector representations for the bounding boxes, and for pairs of bounding boxes. This is done for each time-frame in the sequence.

Next, these representations are re-processed via a graph network, resulting in a new representation per box, where now each such representation has incorporated information from all other boxes. Therefore, the new representation can take context into account. Again, this is done per time-frame. Additionally, at each time frame, a per-frame representation is calculated.

Finally, information is aggregated across frames using a variant of an attention module. This results in a representation of the entire sequence, which is then used to classify the action. These design choices are specifically targeted at modeling spatio-temporal patterns, whereby information is gradually aggregated along both axes.

We next describe each of the components in more detail. In what follows we use the following constants: T is the number of frames in the sequence to be classified, N is the maximum number of objects (i.e., bounding boxes) per frame, and d is the dimensionality of the feature vector for each bounding box (as calculated by the Mask R-CNN). We also use [T] to denote the set {1, …, T} and [N] to denote the set {1, …, N}.

Figure 3: A schematic representation of the Spatio-Temporal Action Graph Network module. (a) First, the pairwise features are processed element-wise. (b) The processed features are summed using an attention over the neighbors of each object to create one vector per object. (c) A representation of the entire graph per frame is created by processing these vectors and summing them using an attention over the objects. (d) Improved object features are then produced by processing each object feature concatenated with the expanded per-frame graph feature. (e) Finally, an attention over frames and objects is applied to obtain a single feature vector representing the video.

3.1 Detector

As a first step, we use a Mask R-CNN detector [15] to extract bounding boxes for region proposals and their corresponding features. Recall that a Mask R-CNN consists of three main components: (1) a backbone network of ResNet50 with a Feature Pyramid Network (FPN); (2) a Region Proposal Network (RPN); and (3) an RoIAlign layer for extracting feature vectors for the regions of interest.

Because we are interested in interactions between objects, we use the RoIAlign layer to also extract features for pairs of boxes. Specifically, for each pair of boxes, we consider its union (see Figure 2) and use an RoIAlign layer to extract its features. The union box should capture object interactions in their shared spatial context.
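As a concrete sketch of this pairwise step (function and variable names are our own, not the paper's code; boxes are assumed to be in (x1, y1, x2, y2) format), the union box is simply the smallest box enclosing both members of a pair:

```python
def union_box(box_a, box_b):
    """Smallest axis-aligned box enclosing both inputs; boxes are (x1, y1, x2, y2)."""
    x1 = min(box_a[0], box_b[0])
    y1 = min(box_a[1], box_b[1])
    x2 = max(box_a[2], box_b[2])
    y2 = max(box_a[3], box_b[3])
    return (x1, y1, x2, y2)

# The union box of every pair of proposals would then be fed to RoIAlign
# to extract the pairwise relation features.
```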

The detection stage therefore results in two sets of tensors:

  • Single-object features: for each time step and each box, a feature vector for the corresponding box, containing the output of the RoIAlign layer and the bounding-box coordinates.

  • Object-pair features: for each time step and each box pair, a feature vector for the corresponding pair of boxes, containing the output of the RoIAlign layer and the coordinates of the union bounding box.

Thus overall, we have a tensor of size T × N × d for single-object features, and a tensor of size T × N × N × d for object-pair features (T frames, N boxes, d feature dimensions).

3.2 STAG Module

In order to model the interaction between multiple objects in the same frame, we use a graph network that integrates information across all boxes and pairs of boxes in a single frame. The network, inspired by the architecture in [17], is shown in Fig. 3. It progressively aggregates information to eventually obtain a single vector describing the whole sequence. First, the pairwise features are processed element-wise. Then, in a reduction step, the object-pair tensor is reduced to a per-object tensor by having each box “attend” to all the other boxes, thereby integrating information from across the frame. The resulting per-object features are processed further, and in the next reduction phase they are reduced via a soft attention over boxes to a single vector per time frame, yielding a feature vector describing each complete frame. At this point, we reprocess the single-box features together with the complete-frame features to obtain improved object features. To obtain a representation of the whole sequence, we reduce over frames and objects, resulting in a single vector for the entire sequence. Finally, we apply a multi-layer neural network to this vector, with a softmax classifier. As described later in Sec. 5.1, the processing modules are learned functions mapping features from one dimension to another.
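A minimal numpy sketch of the attention-based reductions described above (dimensions and the random score tensors are illustrative placeholders; in the actual model, learned score networks produce the attention logits):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_reduce(features, scores, axis):
    """Soft-attention reduction: weight features by softmax(scores), sum over `axis`."""
    w = softmax(scores, axis=axis)
    return (features * np.expand_dims(w, -1)).sum(axis=axis)

T, N, d = 20, 12, 64                        # frames, boxes, feature dim (illustrative)
pair_feats = np.random.randn(T, N, N, d)    # per-frame object-pair features
pair_scores = np.random.randn(T, N, N)      # attention logits (would be learned)

# (b) reduce over neighbors: (T, N, N, d) -> (T, N, d)
obj_feats = attend_reduce(pair_feats, pair_scores, axis=2)
# (c) reduce over objects: (T, N, d) -> (T, d), one vector per frame
frame_feats = attend_reduce(obj_feats, np.random.randn(T, N), axis=1)
# (e) reduce over frames: (T, d) -> (d,), one vector per video
video_feat = attend_reduce(frame_feats, np.random.randn(T), axis=0)
```

The final `video_feat` would then be classified by a small fully connected head.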

3.3 Object Alignment

Figure 4: Attention mechanism for object alignment: in the above example we show, for each frame, the object with the highest attention score. In most frames this turns out to be the truck involved in the collision, even though the RPN outputs the corresponding box at a different index in each frame. The attention mechanism therefore effectively aligns the object across frames.

The motion of objects is a critical cue for correctly recognizing the action in a video clip. Since the order of the detected objects typically varies between frames (it simply corresponds to the order in which the RPN ranks them), the objects need to be aligned before their information can be mixed across the temporal dimension. In cases where the activity of one main object must be recognized (such as car collisions), the network architecture can exploit this property and concentrate on one object in each frame.

We implement this idea in the network architecture using soft attention over the objects (see the model description above) to reduce the object axis, allowing the network to concentrate on specific objects. Thus, instead of explicitly modeling all object interactions between frames (which would require a fixed object ordering across all frames), we first aggregate object relationships and then model temporal relationships. After this dimension reduction, we are left with one feature vector per frame. We then combine all the frames, reduce the temporal dimension, and predict the action.

4 The Collisions Dataset

To evaluate our approach on realistic driving scenes, we introduce a dataset comprising real-world driving videos of rare events. These events encompass collisions, i.e., scenarios involving contact of the (dashcam) vehicle with a fixed or moving object, and near-collisions, i.e., scenarios requiring an evasive maneuver to avoid a crash. Such driving scenarios most often contain interactions between two vehicles, or between a vehicle and a pedestrian or a bike, and thus are well suited to modeling object interactions. The dataset contains rare events from diverse driving scenes, including urban and highway areas in several large cities in the US.

Party type dist.
Vehicle 85%
Bike 6%
Pedestrian 6%
Road object 1%
Motorcycle 1%
Weather dist.
Clear 93%
Rain 5.3%
Snow 1.7%
Lighting dist.
Day 62%
Night 38%
Table 1: Rare event dataset statistics: 3rd party, weather, and lighting conditions.

Data collection. The data was collected from a large-scale deployment of connected dashcams. Each vehicle is equipped with a dashcam and a companion smartphone app that continuously captures and uploads sensor data such as GPS, IMU and gyroscope readings. Overall the vehicles collected more than 10 million rides. Rare events are automatically detected using a triggering algorithm based on the IMU and gyroscope sensor data (the algorithm is tuned to capture driving maneuvers such as harsh braking, acceleration, and sharp cornering). The events are then manually validated by human annotators based on visual inspection to identify edge-case events of collisions and near-collisions, as well as safe driving events. Of all the detected triggers, our subset contains 743 collisions and 60 near-collisions, all from different drivers. For each event, we maintain a video and the time of impact (or near-impact), which typically occurs in the middle of the video. Each video is approximately 40 seconds in length.

The full and few-shot datasets. To investigate the effect of the proposed graph model, we use the events to create two datasets: full and few-shot. The full dataset contains 803 videos. We use 732 videos as training data (44 of them near-collisions) and 71 as test data (6 of them near-collisions). The few-shot dataset contains only 125 videos: 25 videos used as training data and 100 as test data. We temporally downsample the videos to 5 frames per second to avoid feeding near-duplicate frames into our model. We use segments of the videos for model training to fully leverage our data. Specifically, we split each video into three segments: two non-overlapping segments of 20 frames each from before the time of impact, and one segment starting at the time of impact. Thus we have two negative segments (non-risky driving scenes) and one positive segment (a collision scene) for each collision event, and three negative segments for each near-collision event. After this processing, we have a total of 2,409 video segments, of which 1,656 are negative examples and 753 are positive.
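The segmentation scheme above can be sketched as follows (a hypothetical helper; frame indexing at 5 fps and the function name are our assumptions, not the paper's code):

```python
def make_segments(impact_frame, seg_len=20, is_collision=True):
    """Split a downsampled video into two pre-impact negative segments and one
    segment starting at the (near-)impact. Labels: 1 = collision, 0 = no collision."""
    segments = []
    # two non-overlapping negative segments taken before the impact
    start = impact_frame - 2 * seg_len
    for i in range(2):
        s = start + i * seg_len
        segments.append((s, s + seg_len, 0))      # pre-impact: no collision
    # one segment starting at the time of (near-)impact
    label = 1 if is_collision else 0              # near-collisions remain negative
    segments.append((impact_frame, impact_frame + seg_len, label))
    return segments
```

For a collision event this yields two negatives and one positive; for a near-collision event, all three segments are negative, matching the counts in the text.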

Figure 5: Different examples from the dataset: (a) extreme weather conditions such as snow and heavy rain; (b) near-collision of a truck and a bicycle rider; (c) day and night collisions.

Diversity Our dataset is collected to learn to recognize collisions in natural driving scenes. The coverage of the dataset includes various types of scenes, the parties involved, and lighting conditions. Fig. 5 shows different examples of collision scenes under various weather and lighting conditions. We use visual inspection to analyze the identity of 3rd parties involved in the collisions. We find that most of the data (85%) consists of crashes involving two vehicles, and collisions involving pedestrians and cyclists tally up to 6% each. Table 1 shows the distribution of the 3rd party types involved, as well as weather and lighting conditions.

Finally, we will release a publicly available challenge based on this data upon acceptance of the paper.

5 Experiments

We next describe the results of applying our model to the dataset of Sec. 4, compare it to a state-of-the-art activity recognition classifier, and study model variants.

5.1 Model Details

We chose the following parameters: T = 20 frames per video (a randomly sub-sampled segment of 20 frames from the dataset, as described in Sec. 4), N = 12 boxes per frame, and a fixed feature dimension d per box.

Detector. We used Mask R-CNN with a ResNet50 backbone, trained on sequences of 20 frames corresponding to one video. The FPN was trained with strides (4, 8, 16, 32, 64). The RPN was trained with anchor scales (32, 64, 128, 256, 512) and aspect ratios (0.5, 1, 2). The input image was resized with padding so that its maximum dimension is 256. The RPN proposals were filtered by non-maximum suppression with an IoU threshold of 0.7. The model then subsampled the 12 most likely RoIs from an initial 2000 RPN RoIs. Using more RoIs could substantially increase time and space complexity; in practice, a meaningful scene can be represented by only 12 objects, as they capture sufficient information about the scene.
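For reference, the proposal filtering described above can be sketched as a standard greedy NMS followed by keeping the top-scoring survivors (an illustrative sketch, not the detector's actual implementation):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.7, keep_top=12):
    """Greedy NMS at IoU `thresh`, keeping at most `keep_top` survivors."""
    order = list(np.argsort(scores)[::-1])     # highest score first
    keep = []
    while order and len(keep) < keep_top:
        i = order.pop(0)
        keep.append(i)
        # drop remaining boxes that overlap the kept box too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```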

Features for the objects and relations are extracted as explained in Sec. 3, yielding object features and relation features. Both the object and relation RoIs are pooled to a fixed spatial size, followed by a convolutional layer producing the object and relation feature vectors, respectively. The Mask R-CNN was pretrained on the BDD dataset [27] using the default split suggested by the authors.

Spatio-Temporal Action Graph Module. The feature-processing modules are each implemented using 3 fully connected (FC) layers (see Fig. 3). The attention is implemented in the standard way: a parallel network with the same structure and inputs outputs scalar scores, which are used as soft-attention weights for the dimension reductions in the model.

5.2 Training Details

Our model was trained using SGD with momentum; the learning rate was decayed by a fixed factor each epoch, and the gradient was clipped by norm. Each batch includes a video segment of 20 frames with ground-truth detection annotations per frame and a collision annotation per segment. The detection annotations came from the pretrained Mask R-CNN, which was trained on BDD [27]. The loss of the object-aware activity recognition model is the sum of two components: the Mask R-CNN detector loss and the Spatio-Temporal Action Graph loss. We trained with a 1:3 ratio of positive to negative examples, so the Spatio-Temporal Action Graph learned from both positive and negative examples.

Detector. The backbone and RPN are trained with a loss consisting of the RPN classification and regression terms, and the classification and regression terms of the second stage, as is standard for two-stage detectors. The two losses are given equal weight.

Spatio-Temporal Action Graph Classification Loss. The Spatio-Temporal Action Graph network (Sec. 3.2) has two logits at its output, corresponding to the two possible actions (“collision” or “no collision”). Since we have the ground truth for this class, we simply use a cross-entropy loss between the logits and the ground-truth label.
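The classification loss over the two logits is then the standard softmax cross-entropy (a numpy sketch with an illustrative function name):

```python
import numpy as np

def collision_loss(logits, label):
    """Cross-entropy between the two action logits and the ground-truth label.
    logits: array of shape (2,); label: 0 = "no collision", 1 = "collision"."""
    z = logits - logits.max()                  # stabilize before exponentiation
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax over the two classes
    return -log_probs[label]
```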

5.3 Model Variants

We refer to our model as a Spatio-Temporal Action Graph or STAG (see Sec. 3.2). STAG progressively processes the RPN features into a single vector describing the whole frame sequence. Through this process, it can capture the spatio-temporal aspects of the sequence. Another approach to capturing temporal structure is recurrent neural networks, in particular the LSTM. Here we describe two LSTM variants that use some aspects of STAG, but replace the temporal processing in STAG with an LSTM model. We considered the following models:
(1) LSTM Boxes – an LSTM model whose input at each time step is the RPN features for the boxes. This model does not use any component in Fig. 3. The final state of the LSTM is used to predict the binary action label “collision” or “no collision”.
(2) LSTM Spatial Graph – an LSTM model as above, but whose input at each time step is the per-frame graph vector of Fig. 3. Thus, this model replaces the final attention-based reduction step in STAG with an LSTM reduction.

5.4 A C3D Baseline

We compare our approach with C3D [37], which uses a 3D ConvNet to extract rich spatio-temporal feature hierarchies from a given input video buffer. We adopt the convolutional layers (conv1a to conv5b) and fc layers (fc6 to fc8) of the C3D model. The input is a sequence of RGB video frames; following [37], the height and width of the frames are taken as 112 each, and 16 frames are sampled from each video. We use Sports-1M pretrained C3D weights to initialize the model used in our training. To show the complementary effect of video feature encoding with spatio-temporal filters, we fuse the predictions of the C3D model and our proposed STAG model by averaging the classification scores of both models. Similarly, the C3D model is fused with the other model variants, LSTM Boxes and LSTM Spatial Graph (listed in Sec. 5.3).
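The late fusion used here is simply an average of the two models' class scores (a sketch; we assume the scores are already per-class probabilities):

```python
import numpy as np

def fuse_scores(scores_c3d, scores_stag):
    """Average the per-class scores of the two models and return the fused
    scores together with the predicted class index."""
    fused = (np.asarray(scores_c3d) + np.asarray(scores_stag)) / 2.0
    return fused, int(fused.argmax())
```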

5.5 Results

As described in Sec. 4, we have two datasets: the full dataset and the few-shot dataset. Table 2 reports classification accuracy on both, for our Spatio-Temporal Action Graph (STAG) model, two of its variants, and the C3D model. First, STAG outperforms all the other models. Second, the LSTM Boxes model performs very poorly, highlighting the importance of more expressive per-frame representations. Third, the LSTM variants still underperform STAG even when using part of the STAG representation, highlighting the advantage of the attention-based reduction in Fig. 3. Fourth, on the few-shot data the performance of all models decreases significantly, as expected; however, the STAG model still outperforms the others.

Table 3 reports accuracy for models fused with C3D. It can be seen that fusion improves the STAG models in all settings.

5.6 Qualitative Analysis

Visual inspection of success and failure cases reveals an interesting pattern. We observe that STAG outperforms C3D on “near-collision” cases such as that in the upper row of Fig. 6. Correctly classifying such cases requires understanding the relative configuration of the objects in order to determine whether there was an incident. On the other hand, C3D may put too much weight on cues such as speed change, which are not always predictive of an accident.

The C3D model outperforms STAG in cases where objects are not clearly visible, e.g., the lower row of Fig. 6.

Because C3D and STAG have complementary strengths, their fusion results in a stronger model, as shown in Table 3.

Figure 6: Success and failure cases for STAG. Top row: a near-collision example (i.e., the correct label is “no collision”) which the STAG model classifies correctly while the C3D model does not. Middle row: a collision example (i.e., the correct label is “collision”) where the C3D model is correct while the STAG model errs. Bottom row: a collision example (i.e., the correct label is “collision”) where the STAG model trained on the few-shot dataset is correct.
Model               Full Dataset   Few-shot Dataset
C3D                 0.799          0.72
LSTM Spatial Graph  0.775          0.67
LSTM Boxes          0.695          0.69
STAG                0.845          0.728
Table 2: Classification accuracy for the STAG model and its variants, and the C3D model.
Model               Full Dataset   Few-shot Dataset
LSTM Spatial Graph  0.8356         0.731
LSTM Boxes          0.812          0.723
STAG                0.855          0.734
Table 3: Classification accuracy for the STAG model and its variants, when fused with the C3D model.
Figure 7: Explainability via Attention. A sequence of frames, with superimposed heat-map as calculated using the attention component in Fig. 3. It can be seen that more “dangerous” objects tend to receive more attention.

5.7 Attention Visualization

Our STAG model uses attention across space and time to reduce a video to a single vector. One advantage of attention models is their interpretability: one can inspect the attention map the model uses to understand which parts of the input most influenced the decision. For collision detection, this is clearly an important feature. Fig. 7 illustrates the insight that attention maps provide: it shows per-scene heat maps for the attention used in Fig. 3. Objects that pose more danger to the driver tend to receive higher attention values.

6 Conclusion

In this paper, we introduce an approach to action recognition that specifically focuses on objects and their spatio-temporal interaction. Our approach integrates context from across the image, and then aggregates information using multiple uses of attention. Empirical results show that our STAG model outperforms the C3D model, as well as LSTM based approaches.

The use of attention turns out to be particularly useful in terms of explainability, since the model chooses to attend to risk factors in the video.

We believe these results highlight the importance of modeling objects for activity recognition. One promising extension of STAG is to consider semantic segmentation of objects, which may be particularly important for modeling pedestrian behavior. Another is to extend the model to longer time sequences, where more elaborate spatio-temporal patterns may be relevant. Finally, an exciting challenge is to integrate such prediction networks in autonomous driving systems, where the controller can use input from the collision predictor.

Our proposed model architecture mainly addresses the use case where prominent objects and their relations in a driving scene determine the activity class. In future work we would like to address more complex activities and scenes containing higher-order interactions between multiple objects. An example of such a use case in the driving domain is understanding third-party collisions, which could be useful in real-life scenarios.