Deep Learning Driven Visual Path Prediction from a Single Image

01/27/2016 ∙ by Siyu Huang, et al. ∙ Nanjing University Zhejiang University Columbia University

Capabilities of inference and prediction are significant components of visual systems. In this paper, we address one important and challenging task among them: visual path prediction. Its goal is to infer the future path of a visual object in a static scene. This task is complicated because it requires high-level semantic understanding of both the scenes and the motion patterns underlying video sequences. In practice, cluttered situations also place higher demands on the effectiveness and robustness of the considered models. Motivated by these observations, we propose a deep learning framework which simultaneously performs deep feature learning for visual representation in conjunction with spatio-temporal context modeling. After that, we propose a unified path planning scheme to make accurate future path predictions based on the analytic results of the context models. The highly effective visual representation and deep context models ensure that our framework gains a deep semantic understanding of the scene and motion pattern, consequently improving the performance of the visual path prediction task. In order to comprehensively evaluate the model's performance on the visual path prediction task, we construct two large benchmark datasets from the adaptation of video tracking datasets. The qualitative and quantitative experimental results show that our approach outperforms the existing approaches and has better generalization capability.




I Introduction

Inference and prediction are significant capabilities of intelligent visual systems [1], and they have been popular topics in the computer vision community in recent years. As part of visual inference and prediction, we address the visual path prediction problem, with the goal of inferring the most likely future path for an object in a static scene image. For instance, given a single static image like Fig. 1(a), we humans can easily recognize the objects inside it and tell which ones are active: the persons and the car will move, but the grass and the house will remain still. Furthermore, for the active objects, we naturally infer their intentions and future motions. Taking the man at the bottom left with the red bounding box as an example, he is most likely to walk straight while bypassing the car, which appears to be an obstacle for him. This visual inference process is illustrated by the red path in Fig. 1(d). As a matter of fact, these predictions are naturally driven by the human visual system and supported by the prior knowledge stored in it.

(a) Original Image
(b) Reward Map
(c) Estimated Orientation
(d) Predictions
Fig. 1: Illustration of our approach. Image (a) shows a man in the parking lots. The goal of visual path prediction is to infer the possible paths for him in the future. In this paper, we first generate a reward map (b) representing regions the man can reach in the future (green). Then, we estimate his facing orientation (c). Finally, we incorporate the results of (b) and (c) to plan the most likely paths as shown in (d), where the red line and the black lines respectively show our top-1 and top-10 predictions.

In this work, we aim to automatically learn this prior knowledge from a diverse set of videos, and further infer the possible future motions of objects. The prior knowledge here includes both the scene structure and the motion patterns underlying the frame sequences. More specifically, they can be respectively associated with the spatial and temporal contextual properties of the scene. Therefore, the key to solving the visual path prediction task is modeling the spatial and temporal context, followed by an inference algorithm to predict the future path. Such a task is very challenging because it not only needs deep semantic understanding of videos, but is also often confronted with very complicated and diverse situations. For instance, even a single scene in this task may contain various kinds of appearance which are easy to confuse with each other. To address this difficulty, the visual representation is required to be semantic and highly discriminative. On the other hand, the scenes and objects are usually diverse and cover a large number of cases. The context model and visual representation are therefore required to generalize well enough to adapt to complex scenarios.

In recent years, topics of visual inference and prediction have been widely studied by computer vision researchers, and some work has addressed the visual path prediction task. Earlier work [2, 3] focuses on matching-based approaches. For instance, Yuen et al. [3] explore scene modeling by searching directly in image space with keypoint matching techniques using descriptors like GIST and dense SIFT. In general, these matching-based methods rely on a large amount of data and do not really understand the scene. In the test phase, they have to compare the query with all the alternative samples, leading to high computation cost. In contrast, more recent work has turned its attention to learning-based approaches [4, 5]. The key concept is learning a context model to capture the structural relationships between the scene and specific objects, followed by learning robust temporal models like IOC [4] for inference. The learning-based approaches seek to establish inductive models to understand the scene in depth, which results in state-of-the-art performance in the visual path prediction task.

In practice, however, the complex and cluttered situations (e.g., a crowd of cars and people moving at a crossroads) in this task place higher demands on the effectiveness and robustness of the models. In general, the conventional visual representations are based on handcrafted features, which are often too restrictive in complex visual scenes and thus cannot provide abundant semantic information about the visual content. Besides, the context models built in the aforementioned approaches are relatively simple and shallow, which leaves them unable to model the intrinsic contextual interactions among objects as well as their associated scene structures. For instance, Walker et al. [5] build their context model by directly counting the votes from training data. Such a practice can hardly model the contextual information effectively.

Motivated by these observations, in this paper we propose a unified deep learning framework for visual path prediction, which simultaneously performs deep feature learning for visual representation in conjunction with spatio-temporal context modeling. After that, a unified path planning scheme is proposed to make accurate future path prediction based on the analytic results of the context models. Compared with the conventional approaches to visual path prediction, the visual representation employed in our framework is highly effective because it has a better discrimination and generalization capability. Meanwhile, our deep context models can be better adapted to the complex scenarios. These improvements ensure that our framework can make a deep semantic understanding about the scene and motion pattern, consequently improving the performance in the visual path prediction.

The key contributions of our paper are summarized as follows:

  1. We present a novel deep learning framework based on the CNNs for visual path prediction task. To the best of our knowledge, it is the first work to leverage a deep learning approach in this task. Our framework models both of the scene structure information and motion patterns. The abstraction of visual representation and the learning of context models are accomplished in a unified framework. It largely improves the scene understanding capability compared with the previous approaches.

  2. We propose a unified path planning scheme to infer the future paths on the basis of the analytic results returned by our context models. In this scheme, the problem of future path inference is equivalently converted into an optimization problem, which can be solved in an efficient way.

  3. We construct two benchmark datasets for visual path prediction from the adaptation of two large video tracking datasets [6, 7]. The adapted datasets are much larger than those used in the previous work and cover a diverse set of scenes and objects. They can be used for comprehensively evaluating the performance of a visual path prediction model. They will be publicly available on our homepage.

II Related Work

In general, the methods for visual path prediction contain two components: (1) Understanding the scene and motion pattern of the video sequences. (2) Inferring the future path based on information obtained by step (1). This section will review the representative methods of these two steps respectively.

Understanding the scene and motion pattern: Scene understanding is a prerequisite to many high level tasks for intelligent systems operating in real world environments. In the past ten years, researchers have made great efforts for understanding the static image scene at a deeper level, including several typical topics like scene classification [8, 9, 10, 11, 12], semantic segmentation and labeling [13, 14, 15, 16, 17], depth estimation [18, 19, 20], etc. What these approaches have in common is that they learn and model the scene structure prior to recover different aspects of a scene. Accordingly, Yao et al. [9] propose a holistic structure prediction model based on CRF to jointly solve several scene understanding problems.

Beyond modeling scenes in static images and inferring knowledge about the current state, an intelligent visual system is supposed to be able to infer what will happen in the near future. In more recent years, many researchers have paid attention to modeling the motion patterns in video sequences for the temporal aspect of recognition and prediction. For instance, recognition and forecasting of human actions [4, 21, 22, 23, 24, 25], events [3, 26, 27, 28] and scene transitions [29, 5] have attracted much interest. For dynamic scene understanding, the key is to model the structural relationships among different frames. Techniques of static scene understanding also play a significant role in it.

Path inference: Methods for path inference can be generally classified into two categories: the matching-based methods [2, 3, 30] and the learning-based methods [4, 5]. The matching-based methods simply retrieve the information from databases to the queries without building an inference model. For instance, Liu et al. [2] propose a method by matching a query image to a large amount of video data and warping the ground truth paths from the nearest neighbour videos to the static query image with SIFT Flow. Instead of the warping process, Yuen et al. [3] build localized motion maps as probability distributions after merging votes from several nearest neighbors. These matching-based approaches rely on the richness of the databases.

On the other hand, the learning-based methods learn temporal inference models to capture the spatio-temporal variation of scenes and objects. Temporal models such as Markov Logic Networks [31], IOC [4], CRF [29], ATCRF [21] and EDD [32] are often employed. These models help infer the future of individual objects. Further work has taken into consideration the relationships between objects and scenes. Kitani et al. [4] detect the physical scene features based on semantic scene labeling techniques [33, 34], and then, fuse them into the reward function of IOC. Walker et al. [5] build a temporal model based on the effective mid-level patches [35]. They learn patch-to-patch transition matrix, which serves as the temporal model, and learn a context model for the interaction between mid-level patch and the scene. These approaches draw on the strength of scene semantic understanding in depth, and successfully advance the overall performance for visual path prediction task.

Convolutional Neural Networks: The proposed framework in this paper is built upon the convolutional neural networks (CNNs). The CNNs are a popular and leading visual representation technique, for they are able to learn powerful and interpretable visual representations [12]. The CNNs have given the state-of-the-art performance on various computer vision tasks [36, 37, 14, 12, 15]. In recent years, some work has combined CNNs with temporal modeling. Donahue et al. [38] use CNNs and LSTM [39] for visual recognition and description tasks. On the perspective of temporal prediction, Walker et al. [40] employ the same architecture of [38] to predict long term motion of pixels in terms of optical flow.

III Our Approach

III-A Problem Formulation

In this work, we aim to build a framework to automatically solve the visual path prediction problem. Given a static scene image $I$ and the bounding box $B = (x_B, y_B, w_B, h_B)$ of an object in $I$, the goal is to infer the most likely path $P = \{p_1, p_2, \dots, p_n\}$ of the object in the future. Here, $(x_B, y_B)$, $w_B$, and $h_B$ are respectively the top left coordinate, the width, and the height of $B$, and $p_i = (x_i, y_i)$ represents the coordinate of a position, such that $P$ consists of a sequence of adjacent positions. Fig. 2 gives an illustration of the problem. We formalize the original scene into a grid graph such that each grid cell corresponds to a specific position of the scene. Between the object location $p_o$ (the center of $B$) and a certain edge location $p_e$, there are a large number of alternative paths. The question is how to select an appropriate path from the very large path space $\mathcal{P}$. We convert the original problem into an optimization problem of planning a path with the lowest cost $C(P)$:

$$P^{*} = \arg\min_{P \in \mathcal{P}} C(P) \tag{1}$$

Fig. 2: A simple illustration of the visual path prediction problem. Between the object location $p_o$ and the $k$-th edge point $p_e^{k}$, we desire to plan a path $P$ which has lower spatial matching costs on the cost map and, meanwhile, a smaller angular difference between its initial moving direction $\theta_P$ and the estimated direction $\hat{\theta}$.

Then, the issue is how to formulate the cost $C(P)$ of a path $P$. Intuitively, if there are more obstacles on a path, the associated cost ought to be higher:

$$C_S(P) = \sum_{i=1}^{n} M(p_i) \tag{2}$$

Here, $M$ is a cost map of the scene representing the cost $M(p)$ of each coordinate position $p$. Therefore, we need to discover which regions of the scene the object can reach. Such a structure relationship between the object and the scene is referred to as “spatial context matching”; thus $C_S(P)$ is referred to as the “spatial matching cost” in this paper. We build a deep context model called Spatial Matching Network to learn the spatial contextual information from the video sequences in the training phase. In the testing phase, Spatial Matching Network generates a cost map $M$ according to a testing scene image.

On the other hand, the object’s current moving direction also crucially influences the path selection. Hence, paths which are consistent with the object’s current moving direction should have lower costs:

$$C_O(P) = D(\theta_P, \hat{\theta}) \tag{3}$$

Here, $C_O(P)$ is called the “orientation cost” in this paper. $\theta_P$ is the initial moving direction of $P$, and $D(\theta_1, \theta_2)$ represents the angular difference between two angles $\theta_1$ and $\theta_2$. For the sake of motion orientation modeling, we build another deep context model called Orientation Network to learn the temporal contextual information underlying video sequences. In the testing phase, Orientation Network estimates an object’s facing orientation as $\hat{\theta}$ from the single object image.

The above two types of costs, the spatial matching cost and the orientation cost, help us semantically understand the scene and make a decision about the future path. As shown in Fig. 2, suppose the three paths have the same average accumulated costs on the cost map $M$. Which one is optimal? The path whose initial direction is closest to $\hat{\theta}$ wins out. Therefore, the path cost is written as

$$C(P) = C_S(P) + \lambda\, C_O(P) \tag{4}$$

where $\lambda$ is a trade-off coefficient between the two types of costs. Finally, substituting $C(P)$ into the optimization problem (1), we propose a unified path planning scheme to solve it in an easy and efficient way.
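The combination of the two costs can be illustrated with a minimal Python sketch. It assumes that the spatial matching cost is the sum of cost-map values along the path, that angles are in radians, and that the initial moving direction is supplied by the caller. The function names, the uniform toy map and the value of the trade-off coefficient are illustrative assumptions, not taken from the paper.

```python
import math


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def path_cost(path, cost_map, theta_init, theta_est, lam):
    """Spatial matching cost plus lam times the orientation cost.

    path       -- list of (row, col) grid positions
    cost_map   -- 2-D list of per-position costs
    theta_init -- initial moving direction of the path (radians)
    theta_est  -- estimated facing orientation (radians)
    lam        -- trade-off coefficient (illustrative value chosen by caller)
    """
    spatial = sum(cost_map[r][c] for r, c in path)
    orientation = angular_difference(theta_init, theta_est)
    return spatial + lam * orientation


# Two paths with equal spatial cost on a uniform toy map; the one whose
# initial direction matches the estimated orientation gets the lower cost.
uniform_map = [[1.0] * 3 for _ in range(3)]
straight = [(1, 0), (1, 1), (1, 2)]
slanted = [(1, 0), (0, 1), (0, 2)]
cost_straight = path_cost(straight, uniform_map, 0.0, 0.0, lam=1.0)
cost_slanted = path_cost(slanted, uniform_map, math.pi / 4, 0.0, lam=1.0)
```

Both toy paths accumulate the same spatial cost, so the orientation term alone decides between them, which mirrors the three-path situation described for Fig. 2.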

Fig. 3: The overview of our framework. Spatial Matching Network and Orientation Network are two CNNs, which respectively model the spatial and temporal contexts. We repeatedly input images of the object and local environment patches into Spatial Matching Network to generate the reward map of the scene. Intuitively, it helps us decide whether the object could reach certain areas of the scene. Orientation Network estimates the object’s facing orientation, which indicates the object’s preferred moving direction in the future. Then we incorporate this analysis and infer the most likely future paths with a unified path planning scheme.

Fig. 3 shows our general framework in the testing phase. The far left of the figure is the input of the visual path prediction problem, containing a scene image of a parking lot and a bounding box of a car. We employ two CNNs to semantically analyze different aspects of the scene and the object. The first CNN, which we call Spatial Matching Network, generates a reward map $R$ representing the reward of every pixel on the scene image. A larger reward means a higher probability that the car will reach that pixel position in the future. The reward map $R$ is then converted into a cost map $M$ for the subsequent path planning. The second CNN, which we call Orientation Network, outputs an estimated facing orientation $\hat{\theta}$ of the car. Then, based on the analytic results $M$ and $\hat{\theta}$, we infer the most likely future paths of the car by solving the optimization problem (1).

In such a framework, there are still some important problems to solve in what follows: For the two networks, how do we learn the contextual information from video sequences, and, what are the appropriate architectures of them? How do we efficiently solve the optimization problem (1)? We will discuss these issues in the following subsections.

III-B Spatial Matching Network

Fig. 4: Illustration of Spatial Matching Network. The bottom is the scene image. We crop out the object image patch with the bounding box shown in red. We use a sliding window on the entire scene image to crop out the local environment patches, shown as the blue boxes with dotted lines. Each time we input the object patch and an environment patch into the network. The network outputs the likelihood of spatial context matching between the two patches. In this figure, two inputs are the car and the ground. They are spatial context matching, so label for this sample is set as 1 during training.

We build Spatial Matching Network to model the interaction relationships between various objects and regions in scenes, namely the spatial context. More intuitively, for example, a pedestrian is more likely to walk on the pavement than climbing over the fence beside it. Here we call the pedestrian and the pavement as spatial context matching, while the pedestrian and the fence are not spatial context matching. As another example, if there is a house in front of a car, the car is supposed to detour the house but not to crash into it. Obviously the car is not spatial context matching with the house in this case. In our framework, such relationships are modeled by Spatial Matching Network.

Fig. 4 illustrates the architecture of Spatial Matching Network. We expect the network to model the relationship between two instances, so it takes two image patches as its input at the same time. One represents the given object and the other is a certain local environment patch obtained by a sliding window on the entire scene image, denoted as the blue boxes with dotted lines shown in Fig. 4. The two inputs respectively propagate through two CNNs from conv1 to fc7 and are then concatenated into a new fully connected layer fc8. The layers from conv1 to fc7 are similar to the AlexNet [36]. Note that the parameters of the two CNNs are different, as their inputs come from two different semantic spaces. We use a softmax layer at the output end of Spatial Matching Network. In the training phase, the label $y$ of the network is set as 1 if the two input patches are spatial context matching. Otherwise it is set as 0. In the testing phase, the network outputs the likelihood $s$ of spatial context matching between the object patch $I_O$ and the local environment patch $I_E$:

$$s = f_S(I_O, I_E; W_S) \tag{5}$$

$I_O$ is obtained according to the object bounding box $B$ on the scene image $I$. $f_S(\cdot)$ represents the forward propagation in Spatial Matching Network, and $W_S$ are its learned parameters. For a scene image $I$, we can crop out the local environment patches $I_E^{p}$ with an overlapped sliding window on $I$, where $p$ is the central position of patch $I_E^{p}$. In this way, we can generate a reward map $R$ for an object and a scene image by repeatedly inputting all the local environment patches with the same object patch into Spatial Matching Network:

$$R(p) = f_S(I_O, I_E^{p}; W_S) \tag{6}$$

$R(p)$ is the reward for each position $p$. A larger value means a higher reward for that position, namely a higher probability that the object will reach that position in the future. Visualizations of our reward maps generated on different scenes are shown in the middle column of Fig. 7.
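The sliding-window construction of the reward map can be sketched in a few lines of Python. The callable `match_score` is a hypothetical stand-in for the forward pass of Spatial Matching Network, and the toy scene, window size and stride are assumptions chosen only to make the example runnable.

```python
def reward_map(scene, object_patch, match_score, patch=3, stride=1):
    """Score every sliding-window patch of `scene` against one object patch.

    scene        -- 2-D list standing in for the scene image
    object_patch -- whatever representation match_score expects
    match_score  -- callable(object_patch, env_patch) -> reward in [0, 1];
                    a placeholder for the network's forward pass
    Returns a dict mapping each window's central (row, col) to its reward.
    """
    rows, cols = len(scene), len(scene[0])
    half = patch // 2
    rewards = {}
    for r in range(0, rows - patch + 1, stride):
        for c in range(0, cols - patch + 1, stride):
            env_patch = [row[c:c + patch] for row in scene[r:r + patch]]
            rewards[(r + half, c + half)] = match_score(object_patch, env_patch)
    return rewards


def toy_score(object_patch, env_patch):
    """Toy scorer: fraction of 'road' cells (value 0) in the window."""
    cells = [v for row in env_patch for v in row]
    return cells.count(0) / len(cells)


# Toy scene: 0 = road, 1 = obstacle.
scene = [[0, 0, 0, 1, 1] for _ in range(3)]
rewards = reward_map(scene, object_patch=None, match_score=toy_score)
```

The same object patch is reused for every window, so the resulting map depends on both the object and the local scene appearance.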

It is noted that the reward function in the previous work [4, 5] only models the scene itself. However, different objects may have different relationships with the same region of the scene. So the reward map in our method is built with respect to both of the specific object and the scene appearance, for the purpose of generalization across a diverse set of scenes and objects.

The reward map can be converted to the cost map , such that:


where is the tolerance to obstacles. is fixed to , as the scale of is . Based on this formulation, we can compute the spatial matching cost of a path according to Eq. (2).

III-C Orientation Network

In this subsection, we discuss how to build Orientation Network to learn the temporal context from video sequences in the training phase, and, to estimate an object’s facing orientation in the testing phase. Because the scene is assumed to be static in the visual path prediction task, we only focus on modeling the temporal context of the object itself. In other words, we are going to model the time-dependent variation of the object’s own state. The state here includes the physical appearance and the spatial position. As the information about physical appearance has been integrated in Spatial Matching Network, we only model the position variation of object itself, namely the relative position of the object at different time. When in the test phase, it is represented as the object’s facing orientation with the input of a single image. The temporal context also plays an important role in selecting the future path. For instance, imagine a man walking on the street; he is most likely to walk along his facing orientation. Similarly, any kind of active object follows this rule if there are no other external factors disturbing it.

Fig. 5: Illustration of Orientation Network. What is the facing orientation of the given object? We train Orientation Network to estimate it accurately. The network takes an object image as input. It outputs the estimated facing orientation angle of the object. The architecture from conv1 to fc7 is similar to AlexNet [36]. We add fc8 and fc9 to reduce features’ dimension, followed by a regression layer. The relative position of the same object between neighbouring frames serves as the ground truth label.

Therefore, we build Orientation Network to estimate an object’s facing orientation $\hat{\theta}$. The architecture of Orientation Network is shown in Fig. 5. We first extract image features using the standard seven-layer architecture similar to AlexNet [36] and then embed the features into a low-dimensional space with a linear mapping. At the output end of Orientation Network, the low-dimensional features are finally regressed into a single value $\hat{\theta}$, which represents the estimated angle of the object’s facing orientation. Next, we choose an appropriate loss function. Intuitively, the orientation estimation can be posed as either classification or regression. Walker et al. [5] treat it as classification because the state space of their temporal model is discrete. However, the orientation angle has reasonably high spatial self-correlation, whereas the labels in a classification task are often not sufficiently related to each other or are even mutually exclusive. Therefore, we use regression at the output of Orientation Network, which preserves the correlation between labels. We use Euclidean distance as the regression loss of Orientation Network:


where $\theta$ is the ground truth angle, set as the relative position of the same object between two neighbouring frames, and $\hat{\theta}$ is the output of Orientation Network. $D(\theta_1, \theta_2)$ is the angular difference between two angles $\theta_1$ and $\theta_2$:
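A form of the loss and of the angular difference that is consistent with this description can be sketched as follows. The symbols $L$, $D$, $\theta$ and $\hat{\theta}$ are assumed notation, angles are assumed to lie in $[0, 2\pi)$ radians, and any constant normalization factor of the loss is left out.

```latex
L(\theta, \hat{\theta}) = D(\theta, \hat{\theta})^{2},
\qquad
D(\theta_1, \theta_2) = \min\bigl(|\theta_1 - \theta_2|,\; 2\pi - |\theta_1 - \theta_2|\bigr)
```

Under this wrap-around definition, the angles $0.1$ and $2\pi - 0.1$ differ by $0.2$ rather than by almost a full turn, so a regression target near the boundary is not penalized for a small error.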


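In the testing phase, we can estimate the facing orientation $\hat{\theta}$ of the input object image $I_O$ by doing forward propagation in Orientation Network:

$$\hat{\theta} = f_O(I_O; W_O) \tag{10}$$

where $f_O(\cdot)$ represents the forward propagation and $W_O$ are the learned parameters of Orientation Network.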

III-D Path Planning

Up to now, the contextual properties of the scene structure are respectively formalized as a cost map $M$ corresponding to the scene and an estimated facing orientation $\hat{\theta}$ corresponding to the object. How do we plan the most probable future path for the given object? We propose a unified path planning scheme that efficiently solves the original optimization problem (1). The right part of Fig. 3 illustrates the function of this scheme.

The optimization problem (1) aims to find the optimal path $P^{*}$ from the path space $\mathcal{P}$, which has the lowest path cost $C(P)$. By combining Eq. (2), (3), (4) and a few constraints on $P$, we rewrite problem (1) as:

$$
\begin{aligned}
\min_{P} \quad & \sum_{i=1}^{n} M(p_i) + \lambda\, D(\theta_P, \hat{\theta}) \\
\text{s.t.} \quad & p_{i+1} \in \mathcal{N}(p_i), \quad i = 1, \dots, n-1, \\
& p_1 = p_o, \\
& p_n \in \{p_e^{1}, \dots, p_e^{K}\}
\end{aligned}
\tag{11}
$$

where the first constraint means that the object can only move to one of its adjacent positions $\mathcal{N}(p_i)$ in every step. In our experiments we use eight directions (top, left, bottom, right, top-left, top-right, bottom-left, bottom-right). The second and the third constraints specify the starting and ending positions of paths, where $K$ is the number of edge points. The initial moving direction $\theta_P$ of $P$ is obtained by computing the relative position between the initial position $p_1$ and a certain position $p_k$ on $P$. In our experiments, the distance between $p_1$ and $p_k$ is fixed to the diagonal length of the object bounding box $B$: $\lfloor \sqrt{w_B^2 + h_B^2} \rfloor$, where $\lfloor \cdot \rfloor$ is the rounding floor. $\lambda$ is set to 5 as a matter of experience.
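The look-ahead rule for the initial moving direction can be sketched as follows. This is a minimal illustration under stated assumptions: positions are (x, y) pairs on the same pixel grid as the bounding box, the "distance" between the two positions is taken as a number of steps along the path, the look-ahead index is clamped for short paths, and the angle is measured with `atan2` in radians. The function name is hypothetical.

```python
import math


def initial_direction(path, box_w, box_h):
    """Angle (radians) from the first path position to a look-ahead position.

    The look-ahead index is the floored diagonal length of the bounding box,
    clamped to the last index of the path (clamping is an assumption).
    path -- list of (x, y) positions, with y growing downward as in images.
    """
    k = min(int(math.floor(math.hypot(box_w, box_h))), len(path) - 1)
    (x0, y0), (xk, yk) = path[0], path[k]
    return math.atan2(yk - y0, xk - x0)


# A 3x4 box has a diagonal of 5, so the direction is read 5 steps ahead.
rightward = [(i, 0) for i in range(10)]
downward = [(0, i) for i in range(10)]
theta_right = initial_direction(rightward, 3, 4)
theta_down = initial_direction(downward, 3, 4)
```

Reading the direction several steps ahead rather than from the first step avoids quantizing the initial direction to the eight neighbour directions of the grid.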

Input: Scene image $I$, object bounding box $B$, network parameters $W_S$, $W_O$
Output: Predicted paths $P_1, P_2, \dots, P_K$
Scene Analysis
1. Generate the reward map $R$
   - Crop out the object image $I_O$ according to $B$, and the scene patches $I_E^{p}$ with an overlapped sliding window on $I$;
   - for each patch center $p$ do
           - $R(p) = f_S(I_O, I_E^{p}; W_S)$;
2. Estimate the object’s facing orientation $\hat{\theta}$
   - $\hat{\theta} = f_O(I_O; W_O)$;
Path Planning
1. Find the optimal paths
   - Obtain the cost map $M$ according to Eq. (7);
   - Build a directed graph $G$, whose edge weights are set according to Eq. (12);
   - Compute the shortest paths between $p_o$ and $p_e^{k}$ ($k = 1, \dots, K$) on graph $G$, and sort them as $P_1, P_2, \dots, P_K$ based on their lengths in an ascending order.
Algorithm 1 Visual path planning framework

In order to solve problem (11) more efficiently and easily, we employ a graph shortest path algorithm. We build a directed graph $G$ whose nodes correspond to the positions of the cost map $M$. The weight of edges is:



where is the relative position between and . On graph $G$ we can compute the shortest paths between the node of the initial position $p_o$ and the nodes of all the edge points using Dijkstra’s algorithm. These paths are sorted according to their lengths in ascending order, represented as $P_1, P_2, \dots, P_K$. The first $N$ of them are the top-$N$ predicted paths for the visual path prediction task. The shortest one is exactly the solution of problem (11) on large scales. The whole path planning procedure is summarized in Algorithm 1.
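The shortest-path step can be sketched with a standard Dijkstra search on an 8-connected grid. This is a simplified illustration, not the exact edge weighting of Eq. (12): each edge is assumed to cost the cost-map value of the cell it enters, diagonal and axis-aligned steps are treated alike, and the orientation term is left out. The function name is hypothetical.

```python
import heapq

# Eight moving directions: four axis-aligned and four diagonal.
NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]


def shortest_paths_to_border(cost_map, start):
    """Dijkstra from `start` to every border cell of a 2-D cost map.

    Returns a list of (length, path) tuples sorted by ascending length,
    where each path is a list of (row, col) cells from start to a border cell.
    """
    rows, cols = len(cost_map), len(cost_map[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        r, c = node
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost_map[nr][nc]  # cost of entering the neighbour
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    results = []
    for node, d in dist.items():
        r, c = node
        on_border = r in (0, rows - 1) or c in (0, cols - 1)
        if node != start and on_border:
            path = [node]
            while path[-1] != start:
                path.append(prev[path[-1]])
            results.append((d, path[::-1]))
    results.sort(key=lambda item: item[0])
    return results


# Toy cost map: a cheap horizontal corridor between two expensive rows.
toy_map = [[9, 9, 9, 9, 9],
           [1, 1, 1, 1, 1],
           [9, 9, 9, 9, 9]]
ranked = shortest_paths_to_border(toy_map, start=(1, 2))
```

On the toy map the two best-ranked paths follow the corridor to its left and right ends, while every path that leaves the corridor ranks lower.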

IV Experiments

IV-A Experimental Setup

We give the details of the network architecture, the datasets, the comparison algorithms, and the evaluation metric in the following.

Network Architecture: We build the CNNs based on the popular Caffe toolbox [41]. Fig. 4 and Fig. 5 respectively illustrate the network architectures of Spatial Matching Network and Orientation Network. Specifically, in the two figures ‘conv’ represents a convolution layer, ‘fc’ represents a fully connected layer, ‘pool’ represents a max-pooling layer, and ‘LRN’ represents a local response normalization layer. Numbers in the parentheses are respectively kernel size, number of outputs, and stride. All convolutional layers and fully connected layers are followed by ReLU activation function.

In the experiments, Spatial Matching Network is trained for 2K iterations with a batch size of 256 and learning rate of . The input images are uniformly resized to 256×256 and cropped to patches with the size of 227×227. Orientation Network is trained for 10K iterations with a batch size of 256 and learning rate of . The input images of Orientation Network are directly resized to 227×227 without any cropping operation, because a part of an object image often cannot represent its exact facing orientation. The weights of models are initialized randomly for a fair comparison with the other algorithms.

Datasets: For the evaluation of the visual path prediction task, we adapt a new large evaluation set. Raw data of this set come from VIRAT Video Dataset Release 2.0 [6]. VIRAT is a public video dataset collected in multiple natural scenes, with people or vehicles performing actions against cluttered backgrounds. It contains a total of 8.5 hours of HD video from 11 different outdoor scenes, with a variety of camera viewpoints and diverse types of activities which involve both humans and vehicles. The ground truth object bounding boxes are manually annotated. Previous work [4, 5] on path prediction has also built evaluation sets from VIRAT, but with only a single or very few scenes. We do it in a different manner. We select 9 applicable scenes from the total 11 scenes to form our evaluation set. Fig. 6 shows the chosen scenes, where the coloured box denotes the scene adopted by previous work. Among the total 195 videos, we use 152 videos for training and 43 videos for testing. From the testing set, we automatically extract objects with at least 200 pixels in length to form a total of 386 testing samples.

For evaluating the model’s generalization capability, we also adapt a novel evaluation set. Raw data of this set come from the KIT AIS Dataset, which comprises aerial image sequences with manually labeled trajectories of the visible vehicles. To our knowledge, this dataset has not been used for the visual path prediction task before. It is smaller than VIRAT, so we only use it for testing without training. From the total 9 scenes, we select 8 appropriate scenes and automatically extract 136 samples from the labeled trajectories to construct our evaluation set. The selected trajectories have larger distance between their starting and ending points.

Fig. 6: The nine scenes of the first evaluation set, which is used in the evaluation of visual path prediction performance. The scenes include different parking lots, streets, and campuses. They are in a semi-birdseye view, and the videos are shot by cameras at different heights and locations to the grounds. The blue box denotes the scene used in the previous work [5]. In this paper, we make quantitative experiments on every scene.

Comparison methods: There has been only a little work in the field of visual path prediction, so in this paper we compare our model with two methods:

  1. Nearest neighbour searching with SIFT Flow warping [3, 2]. Identical to the implementation in Walker et al. [5], we use a Gist-matching approach [42] similar to Yuen et al. [3], and warp the labeled path of the nearest neighbour scene into the test scene using SIFT Flow [2].

  2. The mid-level elements based temporal modeling [5]. It is the current state-of-the-art approach for visual path prediction task. We use their publicly available implementation code and train a model according to their parameters on the VIRAT dataset.

In our experiments, all the methods including ours share the same training and testing sets. Because of the larger size of the evaluation set and the higher resolution of the scene images, images are uniformly downsampled to 640×360 for methods (1) and (2).

Evaluation metric: Following previous work [4, 5], we use the modified Hausdorff distance (MHD) [43] as the metric for the distance between two paths. The MHD allows for finding the best local point correspondence and is robust to outlier points. Three indicators are used in this paper for comprehensive comparison: (1) top-1, (2) top-5 average and (3) top-10 average. The top-N average means that, for a given method and testing sample, we first compute the MHDs between the ground truth path and the top-N paths predicted by the method, and then take the average of these distances as the method’s performance on this sample.
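The metric can be sketched as follows. The MHD of Dubuisson and Jain [43] is the larger of the two directed distances, where each directed distance is the mean, over the points of one set, of the distance to the closest point in the other set. The sketch assumes paths are given as arrays of (x, y) pixel coordinates and that predictions are already ranked; any resampling or normalisation of paths used in the paper is not stated and is not modelled here. The function names are illustrative.

```python
import numpy as np


def directed_mean_distance(a, b):
    """Mean over points of `a` of the distance to the nearest point of `b`."""
    diffs = a[:, None, :] - b[None, :, :]      # shape (len(a), len(b), 2)
    dists = np.linalg.norm(diffs, axis=2)      # pairwise Euclidean distances
    return float(dists.min(axis=1).mean())


def modified_hausdorff(path_a, path_b):
    """Modified Hausdorff distance between two point sequences."""
    a = np.asarray(path_a, dtype=float)
    b = np.asarray(path_b, dtype=float)
    return max(directed_mean_distance(a, b), directed_mean_distance(b, a))


def top_n_average(ground_truth, ranked_predictions, n):
    """Average MHD between the ground truth and the first n predictions."""
    top = ranked_predictions[:n]
    return float(np.mean([modified_hausdorff(ground_truth, p) for p in top]))


# Toy example (hypothetical data): two predictions parallel to the
# ground truth, offset by 1 and 3 pixels respectively.
gt = [(0, 0), (1, 0), (2, 0)]
pred_near = [(0, 1), (1, 1), (2, 1)]
pred_far = [(0, 3), (1, 3), (2, 3)]

top1 = top_n_average(gt, [pred_near, pred_far], n=1)
top2 = top_n_average(gt, [pred_near, pred_far], n=2)
```

Because each directed term is a mean rather than a maximum, a single stray point shifts the score only slightly, which is the robustness to outliers mentioned above.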

IV-B Path Prediction

Fig. 7: Qualitative results generated by our algorithm, with columns (a) Original Image, (b) Reward Map and (c) Predicted Paths. Each row represents a sample. The left column shows the input images. Red boxes on them denote the given objects. The middle column shows the generated reward maps. The right column shows the predicted top-10 paths. Our framework can output discriminative reward maps and make accurate predictions on a diverse set of scenes.

Fig. 8: Some qualitative comparison results, with columns (a) Original Image, (b) Nearest Neighbour, (c) Mid-level Elements and (d) Ours. Each row represents a sample. Column (a) shows the input images with red boxes denoting the objects. Columns (b), (c) and (d) respectively show the predicted paths generated by the different approaches: NN [3], MLE [5] and ours. Our approach performs better in most of the scenes, and its predictions are closer to common sense.

Qualitative: Fig. 7 shows some qualitative results generated by our method on different scenes of the evaluation set. Each row represents a sample. The left column shows the input images. The middle column shows the reward maps generated by our algorithm, in which green areas are accessible (high reward) and pink areas are obstacles (low reward). In the maps, grass, trees and houses are assigned low reward, while roads and parking lots are assigned high reward. In the fourth and fifth maps, the sidewalk is recognized as a high-reward area for the corresponding pedestrians. The right column shows the predicted paths for the corresponding input images, where the red lines represent the top-1 predictions and the black lines represent the remaining top-10 predictions. Visually, the predicted paths are close to what a human would infer. They avoid other objects (cars, pedestrians) and obstacles (grass, trees, buildings) and follow the road. Furthermore, our framework is able to predict the destination correctly. In the third image, the red car will be parked in the square. In the fourth image, the person probably wants to walk across the street. A correct destination estimate largely improves the performance of path planning.

In addition, we make a qualitative comparison among the different methods, as shown in Fig. 8. We select various scenes and objects for testing. Each row represents a testing sample. Column (a) shows the input images, in which we mark the given objects with red boxes. Columns (b) and (c) show the predicted paths generated by the comparison methods NN [3] and MLE [5]. Our predictions are shown in column (d). The NN approach does not perform effectively: it essentially bets that appropriate paths are already stored in the database. The last image of column (b) shows this clearly, where most trajectories of the nearest samples in the database are distributed along the road. This makes it ineffective in practical use. The MLE approach performs comparatively better. However, limited by its visual representation ability on diverse scenes, its predicted paths do not always appear reasonable. In the fifth image of column (c), the man would attempt to climb over the fence in front of him. In the sixth image of column (c), the car would attempt to drive across the trees. On most scenes shown in Fig. 8, our approach makes reasonable predictions that are consistent with common sense. Furthermore, our method infers a variety of appropriate alternative paths, as shown in the first, third and sixth images of column (d).


Scene            A      B      C      D      E      F      G      H      I      Total
Samples          53     26     21     36     26     41     44     46     93     386

Top-1
NN [3]           19.57  25.47  16.15  18.19  24.78  29.16  23.16  14.84  12.31  19.12
MLE [5]          17.63  17.55  24.12  15.06  13.72  19.27  20.47  16.57  18.13  17.97
Rewards (ours)   20.83  20.43  13.72  19.01  21.67  17.13  16.06  16.03  13.30  16.98
Ours             13.37  13.81  16.34  13.29  12.95  10.99  10.41  12.24  10.42  12.09

Top-5 Average
NN [3]           22.34  25.75  16.35  17.10  28.89  31.09  22.86  14.65  12.72  19.95
MLE [5]          17.41  17.49  22.77  15.03  13.75  19.00  20.22  16.51  18.09  17.78
Rewards (ours)   18.87  18.80  13.68  17.94  20.55  17.13  17.06  13.50  11.73  15.86
Ours             13.21  13.43  15.71  13.17  12.78  11.71  10.22  11.60  10.57  12.00

Top-10 Average
NN [3]           22.81  26.31  15.86  16.19  28.31  31.38  23.37  15.88  14.42  20.55
MLE [5]          17.04  16.85  20.44  15.92  13.49  18.19  20.16  15.59  16.68  17.09
Rewards (ours)   17.62  17.43  14.97  17.57  19.65  16.16  17.27  12.24  11.16  15.20
Ours             13.44  12.89  15.59  12.96  12.55  12.11  11.79  11.14  10.69  12.15

TABLE I: Quantitative results for the visual path prediction task (MHD, lower is better)


Scene No.        1      2      3      4      5      6      7      8      Total
Samples          5      36     5      6      45     22     6      11     136

Top-1
NN [3]           14.54  18.75  52.77  40.67  51.58  43.38  17.67  15.54  35.35
MLE [5]          24.50  20.12  32.20  25.66  75.45  42.70  16.19  14.05  42.26
Rewards (ours)   22.42  8.57   56.96  23.71  23.97  37.52  15.42  10.73  21.78
Ours             18.23  8.99   49.24  23.71  21.53  34.19  19.77  10.14  20.25

Top-5 Average
NN [3]           17.28  21.91  50.80  34.78  64.64  44.38  16.04  16.71  40.46
MLE [5]          24.28  18.76  32.20  25.73  75.33  42.61  16.21  14.32  41.87
Rewards (ours)   18.90  6.70   55.31  14.90  19.46  34.79  12.16  8.82   18.48
Ours             16.29  6.63   48.76  12.82  19.70  32.12  14.97  9.76   17.88

Top-10 Average
NN [3]           17.36  20.31  53.27  34.82  61.38  46.13  19.29  15.75  39.41
MLE [5]          22.92  16.22  43.71  25.66  75.39  42.59  16.14  12.92  41.47
Rewards (ours)   17.68  6.56   47.84  13.93  21.24  30.02  11.78  9.65   17.94
Ours             15.78  6.37   43.82  14.80  21.47  27.55  12.45  10.01  17.45

TABLE II: Quantitative results for the generalization capability evaluation (MHD, lower is better)

Quantitative: For quantitative evaluation, we compare our method with the competing methods on all scenes in the evaluation set. Table I shows the results on each scene with a total of 386 testing samples. Our method outperforms the comparison methods by large margins on all of the scenes. Compared to the state-of-the-art methods on the entire evaluation set, our method achieves improvements of 33%, 33% and 29% under the top-1, top-5 average and top-10 average metrics, respectively. For each scene, the improvement varies from 6% to 49% under the top-1 metric. In addition, the results of our method show a relatively smaller inter-scene variance than the other methods. To some extent, this indicates that our model can be trained and tested robustly on a diverse set of scenes.
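The overall percentages can be reproduced from the "Total" column of Table I as the relative reduction in MHD against the better of the two comparison methods. The sketch below assumes that the last row of each block in Table I is the complete framework, and it uses the population standard deviation across scenes as one possible reading of "inter-scene variance"; the paper does not state how that spread was measured.

```python
import statistics

# "Total" column of Table I (MHD, lower is better).
totals = {
    "top-1": {"NN": 19.12, "MLE": 17.97, "Ours": 12.09},
    "top-5": {"NN": 19.95, "MLE": 17.78, "Ours": 12.00},
    "top-10": {"NN": 20.55, "MLE": 17.09, "Ours": 12.15},
}


def relative_improvement(baseline, ours):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline


improvements = {}
for metric, row in totals.items():
    best_baseline = min(row["NN"], row["MLE"])  # strongest competitor
    improvements[metric] = relative_improvement(best_baseline, row["Ours"])

# Per-scene top-1 values (scenes A to I) from Table I.
top1_by_scene = {
    "NN": [19.57, 25.47, 16.15, 18.19, 24.78, 29.16, 23.16, 14.84, 12.31],
    "MLE": [17.63, 17.55, 24.12, 15.06, 13.72, 19.27, 20.47, 16.57, 18.13],
    "Ours": [13.37, 13.81, 16.34, 13.29, 12.95, 10.99, 10.41, 12.24, 10.42],
}

# Spread across scenes; the standard deviation is an assumed proxy.
spread = {name: statistics.pstdev(vals) for name, vals in top1_by_scene.items()}

for metric, value in improvements.items():
    print(f"{metric}: {value:.1f}% improvement")
for name, value in spread.items():
    print(f"{name}: inter-scene std {value:.2f}")
```

Rounded to whole percentages, the three improvements match the figures quoted in the text, and the per-scene spread of the complete framework is the smallest of the three methods.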

The third row in each block of Table I shows the results of our rewards-only method, which uses only our reward map for prediction, without the help of the Orientation Network. It shows improvements of 6%, 11% and 11% over the other comparison methods under the three metrics, respectively, demonstrating the value of our Spatial Matching Network. However, the error of the rewards-only method is larger than that of our complete framework on most of the scenes. Fig. 9 shows a qualitative comparison between these two methods. It indicates that the temporal context modeled by the Orientation Network also contributes substantially to our complete framework.

Fig. 9: Comparison between (a) our path planning scheme using only rewards and (b) the complete framework (rewards + orientation). The Orientation Network estimates the facing orientation of the object. With its help, the path planning scheme is able to rectify the paths according to the object’s current moving direction, consequently improving the final performance.

IV-C Generalization Capability

So far, we have evaluated visual path prediction performance in a setting where the training set and testing set share the same scenes. However, a robust path prediction framework ought to perform well on novel scenes and objects. In this experiment, we evaluate the generalization capability of the methods on the second evaluation set described in subsection IV-A. We test the models on this evaluation set without retraining them. The parameters of the models remain the same as those in the path prediction experiment of subsection IV-B.

Table II documents the quantitative results of the generalization capability evaluation. Our method achieves improvements of 43%, 56% and 56% over the comparison methods under the top-1, top-5 average and top-10 average metrics, respectively, on the entire evaluation set. These improvements are larger than those in the primary experiments in Table I, showing that our method generalizes better than the existing work. Most of the absolute MHD values in Table II are larger than those in Table I, and the inter-scene variance has also increased. This is in line with our intuition, since the models have never seen the testing scenes in this experiment. In addition, unlike the other two methods, the top-5 average and top-10 average metrics of our method in Table II show some improvement over the top-1 metric on most of the scenes. To some extent, this indicates that our method can explore more plausible candidate paths on unknown scenes than the other methods. Compared with the rewards-only method, the complete framework performs better on about half of the scenes and only slightly better on the entire dataset. This indicates that in this experiment the Orientation Network does not help much, possibly due to inadequate training samples.
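The per-scene statements above can be checked by counting, scene by scene, where one row of Table II is lower than another. The sketch assumes that the unlabelled final row of the Top-10 block belongs to the complete framework, as in the other two blocks; the helper name is illustrative.

```python
# Per-scene MHD values (scenes 1 to 8) from Table II.
ours = {
    "top-1": [18.23, 8.99, 49.24, 23.71, 21.53, 34.19, 19.77, 10.14],
    "top-5": [16.29, 6.63, 48.76, 12.82, 19.70, 32.12, 14.97, 9.76],
    "top-10": [15.78, 6.37, 43.82, 14.80, 21.47, 27.55, 12.45, 10.01],
}
rewards_only = {
    "top-1": [22.42, 8.57, 56.96, 23.71, 23.97, 37.52, 15.42, 10.73],
    "top-5": [18.90, 6.70, 55.31, 14.90, 19.46, 34.79, 12.16, 8.82],
    "top-10": [17.68, 6.56, 47.84, 13.93, 21.24, 30.02, 11.78, 9.65],
}


def count_lower(candidate, reference):
    """Number of scenes where the candidate error is strictly lower."""
    return sum(c < r for c, r in zip(candidate, reference))


# Do the top-N averages improve on top-1 for the complete framework?
top5_vs_top1 = count_lower(ours["top-5"], ours["top-1"])
top10_vs_top1 = count_lower(ours["top-10"], ours["top-1"])

# On how many scenes does the complete framework beat rewards-only?
full_vs_rewards = {
    metric: count_lower(ours[metric], rewards_only[metric]) for metric in ours
}

print(top5_vs_top1, top10_vs_top1, full_vs_rewards)
```

With the values as transcribed, the top-5 and top-10 averages are lower than top-1 on all eight scenes, while the complete framework beats the rewards-only variant on four or five scenes depending on the metric (scene 4 is a tie under top-1).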

V Conclusion

In this paper, we proposed a deep learning framework to address the visual path prediction problem. The framework simultaneously performs deep feature learning for visual representation in conjunction with spatio-temporal context modeling, which largely enhances its scene understanding capability. In addition, we presented a unified path planning scheme to infer future paths on the basis of the analytic results returned by our context models. To comprehensively evaluate the model’s performance on the visual path prediction task, we constructed two large benchmark datasets by adapting video tracking datasets. The experimental results demonstrated the effectiveness and robustness of our approach in comparison with the state-of-the-art approaches.


  • [1] J. Hawkins and S. Blakeslee, On intelligence. Macmillan, 2007.
  • [2] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, “Sift flow: Dense correspondence across different scenes,” in Computer Vision–ECCV 2008, pp. 28–42, Springer, 2008.
  • [3] J. Yuen and A. Torralba, “A data-driven approach for event prediction,” in Computer Vision–ECCV 2010, pp. 707–720, Springer, 2010.
  • [4] K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert, “Activity forecasting,” Computer Vision–ECCV 2012, pp. 201–214, 2012.
  • [5] J. Walker, A. Gupta, and M. Hebert, “Patch to the future: Unsupervised visual prediction,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 3302–3309, IEEE, 2014.
  • [6] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis, et al., “A large-scale benchmark dataset for event recognition in surveillance video,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 3153–3160, IEEE, 2011.
  • [7] W. I. Weisbrich, “Kit-ipf-software and datasets,” 2014.
  • [8] L.-J. Li, H. Su, Y. Lim, and L. Fei-Fei, “Objects as attributes for scene classification,” in Trends and Topics in Computer Vision, pp. 57–69, Springer, 2012.
  • [9] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 702–709, IEEE, 2012.
  • [10] J. Yu, D. Tao, Y. Rui, and J. Cheng, “Pairwise constraints based multiview features fusion for scene classification,” Pattern Recognition, vol. 46, no. 2, pp. 483–496, 2013.
  • [11] M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman, “Blocks that shout: Distinctive parts for scene classification,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 923–930, IEEE, 2013.
  • [12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 1725–1732, IEEE, 2014.
  • [13] N. Silberman and R. Fergus, “Indoor scene segmentation using a structured light sensor,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 601–608, IEEE, 2011.
  • [14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1915–1929, 2013.
  • [15] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 580–587, IEEE, 2014.
  • [16] L. Zhang, Y. Yang, Y. Gao, Y.-T. Yu, C. Wang, and X. Li, “A probabilistic associative model for segmenting weakly supervised images,” Image Processing, IEEE Transactions on, vol. 23, no. 9, pp. 4150–4159, 2014.
  • [17] Q. Li, X. Chen, Y. Song, Y. Zhang, X. Jin, and Q. Zhao, “Geodesic propagation for semantic labeling,” Image Processing, IEEE Transactions on, vol. 23, no. 11, pp. 4812–4825, 2014.
  • [18] A. Saxena, S. H. Chung, and A. Y. Ng, “3-d depth reconstruction from a single still image,” International journal of computer vision, vol. 76, no. 1, pp. 53–69, 2008.
  • [19] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 1253–1260, IEEE, 2010.
  • [20] J. Lin, X. Ji, W. Xu, and Q. Dai, “Absolute depth estimation from a single defocused image,” Image Processing, IEEE Transactions on, vol. 22, no. 11, pp. 4545–4550, 2013.
  • [21] H. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 38, no. 1, pp. 14–29, 2016.
  • [22] J. C. Nascimento, M. Figueiredo, J. S. Marques, et al., “Activity recognition using a mixture of vector fields,” Image Processing, IEEE Transactions on, vol. 22, no. 5, pp. 1712–1725, 2013.
  • [23] T. Lan, T.-C. Chen, and S. Savarese, “A hierarchical representation for future action prediction,” in Computer Vision–ECCV 2014, pp. 689–704, Springer, 2014.
  • [24] H. Wang, C. Yuan, W. Hu, H. Ling, W. Yang, and C. Sun, “Action recognition using nonnegative action component representation and sparse basis selection,” Image Processing, IEEE Transactions on, vol. 23, no. 2, pp. 570–581, 2014.
  • [25] L. Wang, Y. Qiao, and X. Tang, “Latent hierarchical model of temporal structure for complex activity classification,” Image Processing, IEEE Transactions on, vol. 23, no. 2, pp. 810–822, 2014.
  • [26] L. Duan, D. Xu, I.-H. Tsang, and J. Luo, “Visual event recognition in videos by learning from web data,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 9, pp. 1667–1680, 2012.
  • [27] M. Merler, B. Huang, L. Xie, G. Hua, and A. Natsev, “Semantic model vectors for complex video event recognition,” Multimedia, IEEE Transactions on, vol. 14, no. 1, pp. 88–101, 2012.
  • [28] Y.-G. Jiang, S. Bhattacharya, S.-F. Chang, and M. Shah, “High-level event recognition in unconstrained videos,” International Journal of Multimedia Information Retrieval, vol. 2, no. 2, pp. 73–101, 2013.
  • [29] D. F. Fouhey and C. L. Zitnick, “Predicting object dynamics in scenes,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 2027–2034, IEEE, 2014.
  • [30] C. G. Keller, C. Hermes, and D. M. Gavrila, “Will the pedestrian cross? probabilistic path prediction based on learned motion features,” in Pattern Recognition, pp. 386–395, Springer, 2011.
  • [31] S. D. Tran and L. S. Davis, “Event modeling and recognition using markov logic networks,” in Computer Vision–ECCV 2008, pp. 610–623, Springer, 2008.
  • [32] C. H. Lampert, “Predicting the future behavior of a time-varying probability distribution,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 942–950, IEEE, 2015.
  • [33] D. Munoz, J. A. Bagnell, and M. Hebert, “Stacked hierarchical labeling,” in Computer Vision–ECCV 2010, pp. 57–70, Springer, 2010.
  • [34] D. Munoz, J. A. Bagnell, and M. Hebert, “Co-inference for multi-modal scene analysis,” in Computer Vision–ECCV 2012, pp. 668–681, Springer, 2012.
  • [35] S. Singh, A. Gupta, and A. Efros, “Unsupervised discovery of mid-level discriminative patches,” Computer Vision–ECCV 2012, pp. 73–86, 2012.
  • [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
  • [37] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649, IEEE, 2012.
  • [38] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 2625–2634, IEEE, 2015.
  • [39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [40] J. Walker, A. Gupta, and M. Hebert, “Dense optical flow prediction from a static image,” in International Conference on Computer Vision, 2015.
  • [41] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia, pp. 675–678, ACM, 2014.
  • [42] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
  • [43] M.-P. Dubuisson and A. K. Jain, “A modified hausdorff distance for object matching,” in Pattern Recognition, 1994. Vol. 1-Conference A: Computer Vision and Image Processing, Proceedings of the 12th IAPR International Conference on, vol. 1, pp. 566–568, IEEE, 1994.