
Future Localization from an Egocentric Depth Image

This paper presents a method for future localization: predicting a set of plausible trajectories of ego-motion given a depth image. We predict paths that avoid obstacles, pass between objects, and even turn around a corner into the space behind objects. As a byproduct of the predicted trajectories of ego-motion, we discover in the image the empty space occluded by foreground objects. The algorithm uses no image-based features such as semantic labeling/segmentation or object detection/recognition. Inspired by proxemics, we represent the space around a person using an EgoSpace map, akin to an illustrated tourist map, that measures a likelihood of occlusion in the egocentric coordinate system. A future trajectory of ego-motion is modeled by a linear combination of compact trajectory bases, which allows us to constrain the predicted trajectory. We learn the relationship between the EgoSpace map and the trajectory from the EgoMotion dataset, which provides in-situ measurements of the future trajectory. A trajectory is predicted by minimizing a cost function that takes into account partial occlusion due to foreground objects. This cost function generates a trajectory that passes through the occluded space, which allows us to discover the empty space behind the foreground objects. We quantitatively evaluate our method to show predictive validity and apply it to various real world scenes including walking, shopping, and social interactions.



1 Introduction

Consider a dynamic scene such as Figure 1 where you, as the camera wearer, plan to pass through the corridor in the shopping mall while others walk in different directions. You need to plan your trajectory to avoid collisions with other people and with objects such as walls and fences. Looking ahead, you would plan a trajectory that enters the shop by turning left at the corner, although that space cannot be seen directly from your perspective.

Figure 1: Where am I supposed to be after 5, 10, and 15 seconds? We present a method to predict a set of plausible trajectories given a first person depth image. As a byproduct of the predicted trajectories, the space occluded by foreground objects, such as the space inside the shop or behind the ladies, is discovered.

The fundamental problem we are interested in is future localization: where am I supposed to be after 5, 10, and 15 seconds? This challenging task requires understanding the scene in terms of long-term human behavior with respect to the spatial scene layout, with missing data due to occlusions.

We study the future localization problem using a first person depth (stereo) camera. We present a method to predict a set of plausible trajectories of ego-motion given a depth image captured from an egocentric view. As a byproduct of the predicted trajectories, the occluded space behind foreground objects is discovered. Our method relies purely on the depth measurements, i.e., no image-based features such as semantic labeling/segmentation or object detection/recognition are required.

Inspired by proxemics [10], we represent the space around a camera wearer using an EgoSpace map, which resembles an illustrated tourist map: an overhead map with objects seen from first person video projected onto it.

A predictive future localization model, using the EgoSpace map, is learned from in-situ first person stereo videos from various life logging activities such as commutes, shopping, and social interactions. By leveraging structure from motion, camera trajectories are reconstructed. These camera trajectories are associated with their depth images at each time instant, i.e., given the depth image, a future camera trajectory is precisely measured, while the depth image is obtained by the stereo cameras (any depth sensor, such as Kinect or Creative Senz3D, is complementary to our depth measurement), as shown in Figure 2(a).

In a training phase, we discriminatively learn the relationship between the EgoSpace map and the future camera trajectory. We model a trajectory of ego-motion using a linear combination of compact trajectory bases. Because ego-motion is aligned with gaze direction, the trajectory is highly structured. We empirically show that 46 linear trajectory bases are sufficient to express all plausible trajectories of ego-motion with high precision (99% accuracy). This compact representation allows us to efficiently find a set of trajectories that are compatible with the associated depth image using EgoSpace map matching. This provides an initialization of the predicted trajectories. However, not all these ‘re-imagined’ trajectories avoid objects in the current first person view. We refine them by minimizing a cost function that takes into account compatibility between the obstacles in the EgoSpace map and the trajectory. This cost function explicitly models partial occlusion of a trajectory, which allows us to discover the space behind foreground objects.

Why EgoSpace map? Two cues are strongly related to predicting a trajectory of ego-motion, e.g., where is he or she going? (1) Ego-cue: a vanishing point is often aligned with gaze direction, and the 2D visual layout of the obstacles in the first person view implicitly encodes the semantics of the scene. (2) Exo-cue: objects in a 3D scene such as roads, buildings, and tables constrain the space where the wearer can navigate. Such cues can be explicitly extracted from an ego-depth image, where the gaze direction of the wearer can be calibrated with respect to a ground plane (exocentric coordinate) while the depth provides obstacles with respect to the wearer (egocentric coordinate). Our EgoSpace map representation exploits these two cues: we measure depth from an egocentric view and create an illustrated tourist map representation capturing both the 2D visual arrangement of the obstacles (in first person view) and their 3D layout (in overhead view). This representation allows us to analyze and understand different scene types and gaze directions in the same coordinate system.

Contributions To the best of our knowledge, this is the first paper that predicts ego-motion from a depth image without semantic scene labels or object detection via in-situ first person measurements. The core technical contributions of our paper are (a) a predictive model that describes a spatial distribution of objects with respect to an egocentric view, allowing us to register different scenes in a unified coordinate system; (b) a compact subspace representation of the predicted trajectories, which makes a search for trajectory parameters feasible without explicit modeling of the dynamics of human behaviors; (c) occluded space discovery through trajectory prediction; and (d) the EgoMotion dataset of depth images and their long-term camera trajectories, which includes diverse daily activities across camera wearers. We evaluate our algorithm by predicting ego-motion in real world scenes.

(a) Ego-stereo cameras
(b) Geometry
(c) Depth image
(d) EgoSpace map
Figure 2: (a) We use ego-stereo cameras to capture our dataset where the depth image can be computed. Any depth sensor such as Kinect is complementary to our stereo setup. (b) Inspired by proxemics, we represent the space around a person using an EgoSpace map computed from (c) the depth image. (d) The EgoSpace map captures a likelihood of occlusion.

2 Related Work

Our framework lies at the intersection of behavior prediction and egocentric vision.

2.1 Human Behavior Prediction

Predicting where-to-go is a long standing task in behavioral science. This task requires understanding the interactions of agents with objects in a scene that afford a space to move. There is a large body of literature on human behavior prediction algorithms. Pentland and Lin [28] modeled human behaviors using a hidden Markov dynamic model to recognize driving patterns. Such a Markovian model is an attractive choice to encode human behaviors because it reflects the way humans make decisions [38, 17, 19]. These models, especially the partially observable Markov decision process (POMDP), have influenced motion planning in robotics [29, 15, 31].

In computer vision, Ali and Shah [3] developed a floor field model that predicts spatial crowd behaviors for tracking extremely cluttered crowd scenes. Inspired by the social force model [11], Mehran et al. [22] predicted pedestrian behaviors in a crowd scene to detect abnormal behaviors, and Pellegrini et al. [27] used a modified model to track multiple agents. Ryoo [32] presented a bag-of-words approach to recognize social activities at the early stage of videos. Vu et al. [36] predicted plausible activities from a static scene by associating the scene statistics and labeled actions. In terms of the trajectory prediction task, our work is closely related to three path planning frameworks by Gong et al. [9], Kitani et al. [13], and Alahi et al. [2]. Gong et al. presented a method to generate multiple plausible trajectories of each agent in the scene constructed by homotopy classes, which allows them to produce a long term trajectory for visual tracking in crowd scenes. Kitani et al. leveraged inverse optimal control theory to learn human preference with respect to the scene semantic labels, which enables them to predict the paths an agent follows. Alahi et al. introduced a geometric feature, the social affinity model, that captures a spatial relationship of neighboring agents to predict destinations of a crowd.

Unlike previous methods that use semantic labels/segmentation or object detection/tracking, which are often noisy in real world scenes, our measurement is a single depth image that can be reliably obtained by stereo cameras or depth sensors. Estimating optimal parameters for Markovian models is often intractable. In contrast, our trajectory representation in an egocentric view can be encoded using compact trajectory bases, which makes learning tractable because of the reduced number of parameters.

2.2 Egocentric Vision

A first person camera is an ideal camera placement to observe human activities because it reflects the attention of the camera wearer. This characteristic provides a powerful cue to understand human behaviors [5, 12, 7, 30, 33].

Kitani et al. [12] used scene statistics produced by camera ego-motion to recognize sport activities from a first person camera. Traditional vision frameworks such as object detection, recognition, and segmentation have been successfully integrated with first person data: Pirsiavash and Ramanan [30] recognized daily activities using deformable part models, Lee et al. [18] found important persons and objects, Fathi et al. [5] discovered objects, and Li et al. [21, 20] segmented pixels corresponding to hands. In a social setting, Fathi et al. [6] presented a method to recognize social interactions by detecting gaze directions of people, and Park et al. [25] introduced an algorithm to reconstruct joint attention in 3D by leveraging 3D reconstruction of camera ego-motion. This reconstruction makes prediction of joint attention possible by learning the spatial relationship between a social formation and joint attention [26].

Such characteristics of first person cameras have been used to generate interesting applications in vision, graphics, and robotics. Lee et al. [18] summarized a life logging video, and Xiong et al. [37] detected iconic images using a web image prior. Arev et al. [4] used 3D joint attention to edit social video footage, and Kopf et al. [14] used 3D camera motion to generate a hyperlapse first person video. In robotics, Ryoo et al. [34] predicted human activities for human-robot interactions.

Unlike most previous methods, our task primarily focuses on predicting future behaviors by leveraging in-situ measurements from 3D reconstruction of camera ego-motion. This also allows us to tackle a more challenging problem: discovering empty space that is not observable because of visual occlusion.

3 Representation

Inspired by proxemics [10], we present a characterization of space with respect to the egocentric coordinate system, called EgoSpace map.

3.1 EgoSpace Map

The EgoSpace map is a representation of space as experienced from a first-person view but visualized in an overhead bird's-eye map, akin to an illustrated tourist map.

It has three key ingredients. First, we define an egocentric coordinate system centered at the feet location, the projection of the center of eyes onto the ground plane as shown in Figure 2(b). The normal direction, $\mathbf{n}$, of the ground plane is aligned with the $Y$-axis, and the height of the eye location is $h$, i.e., $\mathbf{e} = [0\ \ h\ \ 0]^\top$ where $\mathbf{e}$ is the 3D location of the center of eyes. The gaze direction defines the tangential directions of the ground plane: the $Z$-axis is aligned with the projection of the gaze direction, $\mathbf{g}$, onto the ground plane, i.e., with $(\mathbf{I} - \mathbf{n}\mathbf{n}^\top)\mathbf{g}$.

Second, the EgoSpace map encodes the depth cue from a first person view onto an overhead view on the ground plane. Using a log-polar parametrization of the X-Z (ground) plane, we define the EgoSpace map as a function $\phi(r, \theta)$ that measures the likelihood of occlusion introduced by foreground objects from the gaze direction. One can think of the eye gaze as a light source shining on foreground objects and casting shadows onto the ground plane. On the shadow image we record the object height, which is proportional to the occlusion likelihood.

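Formally, $\phi(r, \theta)$ measures the height from the ground plane of the point, $\mathbf{p}$, at which the ray from the center of eyes, $\mathbf{e}$, to the ground point $\mathbf{x}(r, \theta)$ intersects an occluding object, $O$, i.e.,

$$\phi(r, \theta) = \mathbf{n}^\top \mathbf{p}, \qquad (1)$$

where $\mathbf{n}$ is the unit normal of the ground plane and $\mathbf{p} = \mathbf{e} + \lambda\,(\mathbf{x}(r, \theta) - \mathbf{e})$ with $0 \le \lambda \le 1$ such that $\mathbf{p} \in O$ for some $O \in \mathcal{O}$. $\mathcal{O}$ is a set of objects in the scene.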

We discretize the polar coordinate system by uniform sampling in angle between and and uniform sampling in the inverse of radius which results in uniform sampling in the egocentric view as shown in Figure 2(c). Note that the locations to measure the EgoSpace map are almost radially uniform from the first person view point. Figure 2(d) shows the EgoSpace map for Figure 2(c).

For future localization, the ground plane provides free space for us to move into. On the EgoSpace map, the value is zero if, from the first person view, the point lies on the ground plane. More interestingly, the space behind an object also indicates potential places to navigate. Since the EgoSpace map is represented in the ground plane, not in the first person view, the space behind the object is marked as an occluded area (the right few columns of the map).

Third, the area outside of a first person view depth image boundary is set to m. On the EgoSpace map, shape of the mask is uniquely defined by the gaze direction (roll and pitch angles of the head direction). For example, Figure 2(c) shows a case where the wearer is looking ahead almost parallel to the ground, the ground area close to the wearer (m) was not visible e.g., is marked as . If the wearer is looking down, the masked area on EgoSpace would be for large values of .

The EgoSpace representation supports learning future localization from first person videos by combining cues from 3D scene geometry and gaze direction. Its benefits include: 1) the gaze direction normalized coordinate system provides a common 3D reference frame to learn; 2) the overhead view representation removes the variations in first person 3D experience due to the head's pitch angle; 3) the log-polar encoding and sampling gives more importance to nearby space; and 4) the depth masking implicitly encodes both the roll and pitch angles of the head, making it more situation aware.
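The shadow-casting construction above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `egospace_map`, the bin counts, the angular range, and the radial range are all hypothetical, the input is assumed to be a point cloud already expressed in the egocentric frame (Y up, Z along the projected gaze, origin at the feet), and the masking of the area outside the depth image is omitted.

```python
import numpy as np

def egospace_map(points, eye_height, n_theta=8, n_rho=6,
                 theta_range=(-np.pi / 3, np.pi / 3), r_range=(1.0, 20.0)):
    """Cast each 3D point onto the ground along the ray from the eye and
    record its height in a polar grid sampled uniformly in angle and in
    inverse radius. All grid parameters are illustrative assumptions."""
    phi = np.zeros((n_rho, n_theta))
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Keep points in front of the wearer, above the ground, below the eye.
    keep = (y > 0) & (y < eye_height) & (z > 0)
    x, y, z = x[keep], y[keep], z[keep]
    # Extend the ray eye -> point until it reaches the ground (Y = 0).
    scale = eye_height / (eye_height - y)
    gx, gz = x * scale, z * scale
    r = np.hypot(gx, gz)
    theta = np.arctan2(gx, gz)          # angle measured from the gaze (Z) axis
    inv_lo, inv_hi = 1.0 / r_range[1], 1.0 / r_range[0]
    t_span = theta_range[1] - theta_range[0]
    ti = np.floor((theta - theta_range[0]) / t_span * n_theta).astype(int)
    # Row 0 holds the nearest ring; rows are uniform in 1 / r.
    ri = np.floor((inv_hi - 1.0 / r) / (inv_hi - inv_lo) * n_rho).astype(int)
    ok = (ti >= 0) & (ti < n_theta) & (ri >= 0) & (ri < n_rho)
    # Along one ray the point nearest the eye is also the highest one,
    # so keeping the maximum height per cell keeps the first occluder.
    np.maximum.at(phi, (ri[ok], ti[ok]), y[ok])
    return phi

cloud = np.array([[0.2, 0.8, 2.0],    # occluder 0.8 m high
                  [0.3, 0.4, 3.0],    # lower point on the same viewing ray
                  [1.0, 0.0, 5.0]])   # ground point, leaves the map at zero
phi = egospace_map(cloud, eye_height=1.6)
```

Cells that no occluder shadows stay at zero, which matches the convention that visible ground has zero occlusion likelihood.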

3.2 Compact Trajectory Representation

Let $\mathbf{X} = [x_1\ z_1\ \cdots\ x_T\ z_T]^\top \in \mathbb{R}^{2T}$ be a 2D trajectory on the ground plane of the egocentric coordinate system, where $T$ is the number of future frames to predict and $x_t$ and $z_t$ are the two coordinates at the $t$-th time instant as shown in Figure 2(b). In practice, this trajectory can be obtained by projecting the 3D camera poses between the $t_0$-th and $(t_0+T)$-th time instants onto the ground plane of the egocentric coordinate system at the $t_0$-th time instant. This allows us to represent all trajectories in the same egocentric coordinate system, normalized by gaze direction because the $Z$-axis is aligned with the gaze direction.

The gaze direction normalized trajectory is highly compressible. Most trajectories of ego-motion can be encoded using a linear combination of trajectory bases learned using Principal Component Analysis (PCA) from the EgoMotion dataset described in Section 5:

$$\mathbf{X} = \bar{\mathbf{X}} + \mathbf{B}\boldsymbol{\beta}, \qquad (2)$$

where $\mathbf{X} \in \mathbb{R}^{2T}$ is the trajectory, $\bar{\mathbf{X}}$ is a mean trajectory, and $\mathbf{B} \in \mathbb{R}^{2T \times K}$ is a collection of trajectory bases, i.e., each column of $\mathbf{B}$ is a trajectory basis, where $K$ is the number of bases. In practice, $K$ is selected as 46, which can express all ego-motion trajectories with 99% accuracy as shown in Figure 3(a) and Figure 3(b). $\boldsymbol{\beta} \in \mathbb{R}^{K}$ is the trajectory coefficient, which is the low dimensional parametrization of $\mathbf{X}$. In Figure 3(b), we compare the reconstruction error produced by PCA bases and DCT generic bases [1].

(a) Ego-motion trajectories
(b) Reconstruction error
Figure 3: (a) We register all trajectories in an ego-centric coordinate system, which results in highly redundant trajectories that can be represented by a linear combination of (b) compact trajectory bases.
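As a rough illustration of the compact trajectory representation, the sketch below learns PCA bases from a stack of gaze-normalized trajectories and converts between a trajectory and its low dimensional coefficient. The function names and the synthetic data are assumptions made for the example; the paper's own bases are learned from the EgoMotion dataset with 46 components.

```python
import numpy as np

def learn_trajectory_bases(trajs, num_bases):
    """trajs: (N, 2T) array, each row a flattened ground-plane trajectory.
    Returns the mean trajectory (2T,) and the top PCA bases (2T, num_bases)."""
    mean = trajs.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(trajs - mean, full_matrices=False)
    return mean, vt[:num_bases].T

def encode(traj, mean, bases):
    """Project a trajectory onto the bases to get its coefficient vector."""
    return bases.T @ (traj - mean)

def decode(beta, mean, bases):
    """Rebuild a trajectory as mean + bases @ beta."""
    return mean + bases @ beta

# Synthetic stand-in data: 200 trajectories of T = 20 steps that truly live
# in a 3-dimensional subspace, so 3 bases reconstruct them almost exactly.
rng = np.random.default_rng(0)
trajs = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 40)) + 1.0
mean, bases = learn_trajectory_bases(trajs, num_bases=3)
beta = encode(trajs[0], mean, bases)
recon = decode(beta, mean, bases)
```

Because the bases are orthonormal, encoding is a single matrix product, which is what keeps the later search over trajectory coefficients cheap.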

4 Prediction

A trajectory of ego-motion is associated with an EgoSpace map, i.e., given a depth image, we know how we explored the space in the training data (Section 5). By leveraging a computational representation of egocentric space and trajectory described in Section 3, in this section, we present a method to predict a set of plausible trajectories given an EgoSpace map and to discover the occluded space using the predicted trajectories.

4.1 Ego-motion Prediction

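Estimating the trajectory coefficient $\boldsymbol{\beta}$ that conforms to a depth image amounts to finding a path that stays on the ground plane by minimizing the following cost function along the trajectory:

$$\underset{\boldsymbol{\beta}}{\text{minimize}} \ \sum_{t=1}^{T} \phi_c\!\left(\bar{\mathbf{X}}_t + \mathbf{B}_t \boldsymbol{\beta}\right), \qquad (3)$$

where $\phi_c$ is the Cartesian coordinate representation of the EgoSpace map, $\bar{\mathbf{X}}_t$ is composed of the $(2t-1)$-th and $2t$-th entries of the mean trajectory $\bar{\mathbf{X}}$, and $\mathbf{B}_t$ is a matrix composed of the $(2t-1)$-th and $2t$-th rows of the basis matrix $\mathbf{B}$. Therefore, $\bar{\mathbf{X}}_t + \mathbf{B}_t \boldsymbol{\beta}$ is the point at the $t$-th time instant.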

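Equation (3) finds a trajectory that stays on the ground given a depth image. This approach has been used in robotics communities for various path planning tasks. However, it does not account for a trajectory that is partially occluded by objects, because the occluded part of the trajectory always produces a higher cost. Instead, we introduce a novel cost function that minimizes the trajectory cost difference between the given depth image and the depth image retrieved from the database:

$$\underset{\boldsymbol{\beta}}{\text{minimize}} \ \sum_{t=1}^{T} \left( \phi_c\!\left(\bar{\mathbf{X}}_t + \mathbf{B}_t \boldsymbol{\beta}\right) - \hat{\phi}_c\!\left(\bar{\mathbf{X}}_t + \mathbf{B}_t \hat{\boldsymbol{\beta}}\right) \right)^2, \qquad (4)$$

where $\hat{\phi}_c$ and $\hat{\boldsymbol{\beta}}$ are the EgoSpace map and trajectory parameter retrieved from the training dataset, and $\bar{\mathbf{X}}_t + \mathbf{B}_t \boldsymbol{\beta}$ is the trajectory point at the $t$-th time instant. This minimization finds a partially occluded trajectory as long as there exists a trajectory in the database that has a similar occlusion cost.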
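A minimal sketch of this refinement step is given below. It assumes a squared cost difference, a Cartesian map stored as a regular grid with bilinear lookup, and plain gradient descent with finite-difference gradients and step halving; none of these choices is stated in the text, and the names `sample_map`, `occlusion_cost`, `refine`, and the 0.5 m cell size are hypothetical.

```python
import numpy as np

def sample_map(phi_c, pts, cell=0.5):
    """Bilinearly sample a Cartesian occlusion map at ground points (x, z).
    Assumed layout: row = z / cell, column = x / cell + (width - 1) / 2."""
    h, w = phi_c.shape
    u = np.clip(pts[:, 0] / cell + (w - 1) / 2.0, 0, w - 1 - 1e-9)
    v = np.clip(pts[:, 1] / cell, 0, h - 1 - 1e-9)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * phi_c[v0, u0]
            + du * (1 - dv) * phi_c[v0, u0 + 1]
            + (1 - du) * dv * phi_c[v0 + 1, u0]
            + du * dv * phi_c[v0 + 1, u0 + 1])

def occlusion_cost(beta, mean, bases, phi_c, target):
    """Sum of squared differences between the occlusion values along the
    candidate trajectory and those along the retrieved trajectory."""
    pts = (mean + bases @ beta).reshape(-1, 2)
    return float(np.sum((sample_map(phi_c, pts) - target) ** 2))

def refine(beta0, mean, bases, phi_c, target, steps=100, lr=0.5, eps=1e-4):
    """Gradient descent from the retrieved coefficient beta0. A step is
    accepted only if it lowers the cost; otherwise the step size is halved."""
    beta = np.asarray(beta0, dtype=float).copy()
    best = occlusion_cost(beta, mean, bases, phi_c, target)
    for _ in range(steps):
        grad = np.zeros_like(beta)
        for i in range(len(beta)):
            step = np.zeros_like(beta)
            step[i] = eps
            grad[i] = (occlusion_cost(beta + step, mean, bases, phi_c, target)
                       - occlusion_cost(beta - step, mean, bases, phi_c, target)) / (2 * eps)
        candidate = beta - lr * grad
        cost = occlusion_cost(candidate, mean, bases, phi_c, target)
        if cost < best:
            beta, best = candidate, cost
        else:
            lr *= 0.5
    return beta
```

Here `target` holds the occlusion values that the retrieved map takes along the retrieved trajectory, so a retrieved path that ran behind an object lets the refined path do the same.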

There exist an infinite number of trajectories that are compatible with a given EgoSpace map. More importantly, the cost function in Equation (4) is nonlinear, so the initialization of the solution is critical.

We initialize the trajectory coefficient using a trajectory retrieved from the training data by EgoSpace map matching. The dataset is divided into 3 gaze directions (3 pitch angles) to reduce the false matches dominated by the area beyond the depth image. Given an EgoSpace map, k-nearest neighbors (KNN) are found using a k-d tree [23]. Other search or planning methods such as structured SVM [35] and Rapidly Exploring Random Tree (RRT) [16] can be complementary to the KNN search.
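The retrieval step can be illustrated with a brute-force nearest-neighbor search over flattened EgoSpace maps. This sketch swaps the k-d tree for an exhaustive Euclidean search to stay dependency-free, and it leaves out the split into three pitch groups; the function name and array layout are assumptions.

```python
import numpy as np

def retrieve_initial_betas(query_map, train_maps, train_betas, k=3):
    """Return the indices and trajectory coefficients of the k training
    EgoSpace maps closest to the query map in Euclidean distance.

    query_map:   (R, C) EgoSpace map of the test image
    train_maps:  (N, R, C) EgoSpace maps of the training images
    train_betas: (N, K) trajectory coefficients paired with train_maps
    """
    flat = train_maps.reshape(len(train_maps), -1)
    dist = np.linalg.norm(flat - query_map.ravel(), axis=1)
    idx = np.argsort(dist)[:k]
    return idx, train_betas[idx]

# Toy example: the query is a slightly perturbed copy of training map 5.
rng = np.random.default_rng(0)
train_maps = rng.normal(size=(50, 6, 8))
train_betas = rng.normal(size=(50, 4))
query = train_maps[5] + 0.01 * rng.normal(size=(6, 8))
idx, betas = retrieve_initial_betas(query, train_maps, train_betas, k=3)
```

Each retrieved coefficient then serves as one starting point for the nonlinear refinement, so k neighbors yield k candidate trajectories.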

4.2 Occluded Space Discovery

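The predicted trajectories of ego-motion allow us to discover the hidden space occluded by foreground objects because the trajectories can still be predicted in the hidden space. We build a likelihood map of the occluded space as follows:

$$O(\mathbf{x}) = \frac{\phi_c(\mathbf{x})}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{X}^i_t\|^2}{2\sigma^2}\right), \qquad (5)$$

where $O(\mathbf{x})$ is the likelihood of the occluded space that a trajectory can pass through at the evaluating point $\mathbf{x}$ on the ground. $\{\mathbf{X}^i\}_{i=1}^{N}$ are the predicted trajectories, with $\mathbf{X}^i_t$ the point of the $i$-th trajectory at the $t$-th time instant, $N$ is the number of predicted trajectories, $\phi_c(\mathbf{x})$ is the occlusion likelihood given by the EgoSpace map in Cartesian coordinates, and $\sigma$ is the bandwidth of the Gaussian kernel. Equation (5) takes into account the likelihood of the predicted trajectories weighted by the likelihood of the occlusion. $O(\mathbf{x})$ is high when many trajectories are predicted at $\mathbf{x}$ while $\phi_c(\mathbf{x})$ is high.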
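The occluded-space likelihood described above can be sketched as a Gaussian kernel density of the predicted trajectory points, weighted by the occlusion value at each evaluation point. The normalization by the number of trajectories, the function name, and the default bandwidth are assumptions made for the example.

```python
import numpy as np

def occluded_space_likelihood(grid_pts, phi_at_pts, trajectories, sigma=0.5):
    """Score ground points by how many predicted trajectories pass nearby,
    weighted by the occlusion likelihood at each point.

    grid_pts:     (M, 2) ground points (x, z) to evaluate
    phi_at_pts:   (M,)   occlusion likelihood of the EgoSpace map there
    trajectories: (N, T, 2) predicted trajectories
    """
    pts = trajectories.reshape(-1, 2)
    # Squared distance between every grid point and every trajectory point.
    d2 = ((grid_pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1) / len(trajectories)
    return phi_at_pts * density

# Two straight trajectories along x = 0; three evaluation points.
z = np.arange(1.0, 5.0)
trajs = np.stack([np.column_stack([np.zeros(4), z])] * 2)
grid = np.array([[0.0, 2.0],    # occluded and on the predicted path
                 [5.0, 2.0],    # occluded but far from any path
                 [0.0, 3.0]])   # on the path but visible ground
like = occluded_space_likelihood(grid, np.array([1.0, 1.0, 0.0]), trajs)
```

Thresholding this score gives the detections of traversable hidden space; visible ground scores zero regardless of how many trajectories cross it.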

5 EgoMotion Dataset

We present a new dataset, EgoMotion dataset, captured by first person stereo cameras. This dataset includes various indoor and outdoor scenes such as Park, Malls, and Campus with various activities such as walking, shopping, and social interactions.

5.1 Data Collection

A stereo pair of GoPro Hero 3 (Black Edition) cameras with a 100 mm baseline is used to capture the EgoMotion dataset as shown in Figure 2(a). All videos are recorded at 1280×960 with 100 fps. The stereo cameras are calibrated prior to the data collection and synchronized manually with a synchronization token at the beginning of each sequence.

Depth Computation We compute disparity between the stereo pair after stereo rectification. A cost space of stereo matching is generated for each scan line, and each pixel is matched by dynamic programming in a coarse-to-fine manner.
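Once disparity is available, depth follows from the standard rectified-stereo relation: depth equals focal length times baseline divided by disparity. The sketch below applies it with the 100 mm baseline mentioned above; the focal length of 800 pixels is a made-up value for illustration, and the function name is hypothetical.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (metres) for a rectified pair.
    Pixels with zero or negative disparity are treated as infinitely far."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Assumed focal length of 800 px with the 0.1 m baseline of the stereo rig.
depth = disparity_to_depth([8.0, 4.0, 0.0], focal_px=800.0, baseline_m=0.1)
```

Depth resolution degrades with distance because a one-pixel disparity change covers a larger depth interval at small disparities, which is one reason to sample the map more densely near the wearer.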

3D Reconstruction of Ego-motion We reconstruct a camera trajectory using a standard structure from motion pipeline with a few modifications to handle a large number of images (a 30 minute walking sequence at a 30 fps reconstruction rate produces 108,000 HD images). We partition the dataset such that each subset includes fewer than 500 images with sufficient overlap with neighboring image sets (100 image overlap). We reconstruct each subset independently and merge them by minimizing the cross reprojection error between two subsets, i.e., a point in one subset is reprojected to a camera in the other subset. Then, we project the reconstructed camera trajectory onto the ground plane estimated by fitting a plane using RANSAC [8].
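The last step, fitting a ground plane with RANSAC and projecting camera positions onto it, might look roughly as follows. The iteration count, the inlier tolerance, the least-squares refit on the inliers, and the function names are assumptions for this sketch rather than details given in the text.

```python
import numpy as np

def fit_plane_ransac(points, iters=200, tol=0.05, seed=0):
    """Fit a plane to 3D points with RANSAC, then refit on the inliers.
    Returns a unit normal (oriented towards +Y) and a point on the plane."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        a, b, c = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(b - a, c - a)
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        inliers = np.abs((points - a) @ (n / norm)) < tol
        if best is None or inliers.sum() > best.sum():
            best = inliers
    inl = points[best]
    centroid = inl.mean(axis=0)
    # The direction of least variance of the inliers is the plane normal.
    _, _, vt = np.linalg.svd(inl - centroid, full_matrices=False)
    normal = vt[-1]
    if normal[1] < 0:
        normal = -normal
    return normal, centroid

def project_to_plane(points, normal, centroid):
    """Orthogonally project 3D points onto the fitted plane."""
    offsets = (points - centroid) @ normal
    return points - offsets[:, None] * normal

# Synthetic scene: noisy ground points near Y = 0 plus elevated clutter.
rng = np.random.default_rng(1)
ground = np.column_stack([rng.uniform(-5, 5, 200),
                          rng.normal(0, 0.01, 200),
                          rng.uniform(0, 10, 200)])
clutter = np.column_stack([rng.uniform(-5, 5, 40),
                           rng.uniform(0.3, 2.0, 40),
                           rng.uniform(0, 10, 40)])
normal, centroid = fit_plane_ransac(np.vstack([ground, clutter]))
cams = np.array([[0.0, 1.6, 0.0], [0.1, 1.6, 1.0]])
footprints = project_to_plane(cams, normal, centroid)
```

The projected camera positions are the ground-plane trajectory points that the compact trajectory representation is built from.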

Scenes We collect both indoor and outdoor data, which consist of 21 scenes with 55,933 frames and 7.7 hours in total, including walking on campus, in parks, and on downtown streets, shopping in the mall, cafe, and grocery, as well as taking public transportation. The data cover various activities (walking, talking, and shopping), scenes (campus, park, malls, and downtown streets), cities, and times. We also collect repeated daily routines multiple times at a campus. The dataset is summarized in Table 1.

| Scene | IKEA | Costco | Mall | Park | School1/2 | Downtown1/2 | Grocery1/2/3 | Bus1/2 |
|---|---|---|---|---|---|---|---|---|
| Frames | 966 | 577 | 2683 | 3088 | 3754/3736 | 2856/3405 | 2858/2892/2834 | 2292/1850 |
| Duration (mm:ss) | 08:03 | 04:49 | 22:22 | 25:44 | 31:17/31:08 | 23:48/28:23 | 23:49/24:06/23:37 | 19:06/15:25 |

| Scene | Campus1 | Campus2 | Campus3 | Campus4 | Campus5 | Campus6 | Campus7 | Campus8 |
|---|---|---|---|---|---|---|---|---|
| Frames | 2607 | 1884 | 1975 | 2359 | 3337 | 4034 | 2568 | 3378 |
| Duration (mm:ss) | 21:44 | 15:42 | 16:28 | 19:40 | 27:49 | 33:37 | 21:24 | 28:09 |

Table 1: EgoMotion dataset

5.2 Data Analysis

We define the EgoSpace map with respect to a gaze direction, which allows us to canonicalize all trajectories in one coordinate system and further to represent them with compact bases. This stems from a primary conjecture: a gaze direction is aligned with ego-motion. In this section, we empirically validate the conjecture on our EgoMotion dataset.

(a) Attention
(b) Yaw distribution
Figure 4: From our dataset, we show empirically that the gaze direction is highly correlated with the direction of destination, i.e., we look where we go.

We compute the pitch angle of a gaze direction by calibrating the relationship between the first person camera and gaze direction [25]. The pitch angle is by definition in Section 3.1 and the position after 10 seconds is used to measure the direction of destination. Figure 4(a) shows a distribution of the direction of destination with respect to the gaze pitch angle, which indicates that the gaze direction is aligned with the pitch axis. Figure 4(b) shows a yaw distribution of the direction of destination given pitch angle (a horizontal cross section of Figure 4(a)). This also indicates that gaze direction is highly correlated with the direction of destination.

6 Result

Figure 5: We compare our method with four baseline representations: (1) Going straight; (2) Pure 2D: no EgoSpace map without adaptation of the ground plane by the test scene; (3) 2D + ground plane: no EgoSpace map with adaptation of the ground plane by the test scene; (4) EgoSpace without trajectory optimization. Our method outperforms other representations.

We apply our method to predict ego-motion and hidden space in real world scenes by leveraging the EgoMotion dataset. We divide all scenes into two categories: indoor and outdoor scenes as ego-motion has different characteristics, e.g., speed and scene layout. Note that for all evaluations, we predict a scene that is not included in training data, i.e., training and testing scenes are completely separated.

6.1 Quantitative Evaluation

We quantitatively evaluate our trajectory prediction by comparing with ground truth trajectories obtained by 3D reconstruction of the first person camera. Our evaluation addresses the future localization problem.

Multiple trajectories are often equally plausible, e.g., at a Y-junction, while only one ground truth trajectory is available per image. This results in a large prediction error. To address this multiple path configuration, we measure predictive precision: how often one of our predicted trajectories aligns with the ground truth trajectory, i.e., , where is the number of testing images. if , and otherwise where is the location at the time instant of the predicted trajectory and is the ground truth trajectory. We set . Note that unlike previous approaches, which measured a spatial distance between trajectories [13] (using dynamic time warping to handle the time scale), our evaluation measures a spatiotemporal distance between trajectories because the time scale also needs to be considered.
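The predictive precision described above can be sketched as follows. The exact alignment criterion and threshold are not recoverable from the text, so this example assumes that a prediction aligns with the ground truth when every time-matched point lies within a distance `eps`; both the criterion and the names are hypothetical.

```python
import numpy as np

def predictive_precision(predictions, ground_truths, eps):
    """Fraction of test images for which at least one predicted trajectory
    stays within eps of the ground truth at every time instant.

    predictions:   list of (k, T, 2) arrays, k candidates per test image
    ground_truths: list of (T, 2) arrays, one per test image
    eps:           assumed distance threshold, same units as the trajectories
    """
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        # Worst point-to-point error of each candidate, compared at the
        # same time instants so that timing matters as well as shape.
        worst = np.linalg.norm(preds - gt[None], axis=2).max(axis=1)
        hits += bool((worst < eps).any())
    return hits / len(ground_truths)

# Two test images, two candidates each: the first image has a close match,
# the second does not.
gt = np.column_stack([np.zeros(5), np.arange(1.0, 6.0)])
close = gt + 0.1
far = gt + 3.0
precision = predictive_precision([np.stack([far, close]), np.stack([far, far])],
                                 [gt, gt], eps=0.5)
```

Comparing points at matching time instants, rather than after time warping, is what makes the measure spatiotemporal.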

Four baseline methods are compared with our approach (we designed these baselines ourselves because no previous algorithm exists to predict the trajectories of ego-motion): one method solely based on gaze direction, two methods with a subsampled depth image at the same resolution as our EgoSpace map, and one method with the EgoSpace map but without trajectory refinement by Equation (4). (1) Going straight: we generate a trajectory aligned with the gaze direction to test gaze bias; (2) Pure 2D: we retrieve a set of trajectories using KNN solely based on a subsampled depth image; (3) 2D+ground plane: we retrieve trajectories using the subsampled depth image but transform the coordinates of the trajectories such that they lie on the ground plane of the test image. This coordinate transform takes into account the 3D camera direction with respect to the ground plane of the test image; (4) EgoSpace w/o trajectory optimization: the trajectories are retrieved by the EgoSpace map but are not adapted to the test image by Equation (4). In fact, this method provides the initialization of our predicted trajectories.

Figure 5 shows evaluations on indoor and outdoor depth images. We retrieve the k nearest neighbors from the dataset and measure precision. Our method outperforms the baseline algorithms by a large margin. These experiments indicate that the EgoSpace representation has strong predictive power compared to the camera pose oriented feature produced by the subsampled depth image. The scene adaptation by the trajectory optimization also allows us to produce more accurate predictions (see the performance gap from the initialization). As noted in Section 5.2, gaze direction is a good predictor, but it is not strong enough to predict long term behavior. Note that the precision at small k may be significantly improved by using N-best algorithms [24] based on homotopy classes [9], because KNN retrieves many redundant trajectories. In Table 2, we measure the average precision across all scenes in Section 5.

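| Indoor | 0-5 s, k=100 | 0-5 s, k=60 | 0-5 s, k=30 | 5-10 s, k=100 | 5-10 s, k=60 | 5-10 s, k=30 | 10-15 s, k=100 | 10-15 s, k=60 | 10-15 s, k=30 |
|---|---|---|---|---|---|---|---|---|---|
| Going straight | 0.571 | | | 0.221 | | | 0.124 | | |
| Pure 2D | 0.643 | 0.507 | 0.308 | 0.524 | 0.379 | 0.217 | 0.346 | 0.229 | 0.123 |
| 2D+Ground plane | 0.710 | 0.556 | 0.367 | 0.561 | 0.413 | 0.267 | 0.384 | 0.261 | 0.162 |
| EgoSpace w/o opt. | 0.690 | 0.534 | 0.341 | 0.570 | 0.265 | 0.255 | 0.401 | 0.265 | 0.156 |
| EgoSpace w/ opt. | 0.825 | 0.687 | 0.458 | 0.693 | 0.543 | 0.331 | 0.482 | 0.347 | 0.192 |

| Outdoor | 0-5 s, k=100 | 0-5 s, k=60 | 0-5 s, k=30 | 5-10 s, k=100 | 5-10 s, k=60 | 5-10 s, k=30 | 10-15 s, k=100 | 10-15 s, k=60 | 10-15 s, k=30 |
|---|---|---|---|---|---|---|---|---|---|
| Going straight | 0.443 | | | 0.259 | | | 0.103 | | |
| Pure 2D | 0.535 | 0.506 | 0.303 | 0.417 | 0.391 | 0.218 | 0.267 | 0.255 | 0.142 |
| 2D+Ground plane | 0.554 | 0.554 | 0.350 | 0.425 | 0.407 | 0.244 | 0.293 | 0.261 | 0.135 |
| EgoSpace w/o opt. | 0.567 | 0.527 | 0.329 | 0.432 | 0.399 | 0.233 | 0.289 | 0.250 | 0.141 |
| EgoSpace w/ opt. | 0.683 | 0.666 | 0.441 | 0.538 | 0.522 | 0.298 | 0.373 | 0.355 | 0.171 |

Table 2: Average precision (k is the number of neighbors; Going straight generates a single trajectory, so one value is reported per time window)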

Occluded Space Discovery We quantitatively evaluate our occluded space discovery by measuring the detection rate, $N_{TP}/N_{D}$, where $N_{TP}$ is the number of true positive detections and $N_{D}$ is the total number of detections produced by the space discovery. We threshold the likelihood of the occluded space from Equation (5) and manually evaluate whether each detection is correct. Note that no ground truth label is available unless the camera wearer has already passed through the space. The detection rate in Table 3 indicates that our method predicts the outdoor scenes better than the indoor scenes. This is because, in indoor scenes such as Grocery and IKEA, the camera wearer had a number of close interactions with objects such as shelves or products, where the view of the scene is substantially limited.

| Indoor | Mall I | Grocery | IKEA |
|---|---|---|---|
| Detection rate | 0.5882 | 0.2371 | 0.3937 |

| Outdoor | Park | Bus stop | Walk |
|---|---|---|---|
| Detection rate | 0.6234 | 0.6593 | 0.6338 |

Table 3: Detection rate

6.2 Qualitative Evaluation

We apply our method to real world examples to predict a set of plausible trajectories of ego-motion and the space occluded by foreground objects. Our training dataset is completely separated from the testing data, e.g., we trained on the Grocery scene to predict the IKEA scene. Given a depth image, we estimate the ground plane by RANSAC based plane fitting with gravity and height priors. This ground plane is used to define the EgoSpace map with respect to the camera direction (the yaw angle of the gaze direction is assumed to be aligned with the camera direction).

Figure 1 and Figure 7 illustrate our results from the EgoMotion dataset. In the testing phase, only a depth image was used, while 3D reconstruction of camera poses was used in the training phase. In Figure 7, we show (1) the image and ground truth ego-motion; (2) the input depth image; (3) the EgoSpace map overlaid with the predicted trajectories (gray) and ground truth trajectory (red); (4) reprojection of the trajectories; (5) reprojection of the occluded space computed by the EgoSpace map (inset image). For all scenes, our method predicts plausible trajectories that pass through unexplored space.

Obstacle Avoidance Our cost function in Equation (4) minimizes the cost difference between trajectories from training data and testing data. This precludes a trajectory passing through an object unless the retrieved trajectory was partially occluded. The EgoSpace map captures obstacle avoidance as shown in Campus and Grocery.

Multiple Plausible Trajectories Our prediction produces a number of plausible trajectories that conform to the testing scene. Examples include trifurcated trajectories in Campus, bifurcated trajectories in Bus stop, and multiple directions of trajectories in Mall I.

Occluded Space Discovery The space occluded by foreground objects is discovered by the predicted trajectories. Examples include the space inside the shop and behind the person in Figure 1; the space occluded by the left fence and persons in Campus; the space behind the cars and the parking vending machine in Bus stop; the space behind the persons and trees in Park; the space inside the shop and around the left corner; the space behind the column; and the space occluded by the fence.

7 Discussion

In this paper, we present a method to predict ego-motion and the space occluded by foreground objects from a first-person depth image. The EgoSpace map, which encodes the likelihood of occlusion, represents the scene around the camera wearer. We associate a trajectory with the EgoSpace map in the training phase to predict a set of plausible trajectories given a test depth image. The trajectories, parametrized as a linear combination of compact trajectory bases, are refined to conform to the test depth image. The occluded space is detected by measuring how often the predicted trajectories enter it.
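The idea of parametrizing a trajectory by a few basis coefficients can be illustrated with a small sketch. It assumes a DCT basis, a common choice in trajectory-space methods; the bases actually used in this work may differ (for example, they could be learned from data), and the function names below are hypothetical.

```python
import math

def dct_basis(T, K):
    """Return K orthonormal DCT-II basis vectors of length T (K <= T)."""
    basis = []
    for k in range(K):
        scale = math.sqrt(1.0 / T) if k == 0 else math.sqrt(2.0 / T)
        basis.append([scale * math.cos(math.pi * (2 * t + 1) * k / (2 * T))
                      for t in range(T)])
    return basis

def project(traj, basis):
    """Project a trajectory (list of coordinate tuples) onto the basis.

    Returns one coefficient list per coordinate, each of length len(basis).
    """
    dims = len(traj[0])
    return [[sum(b[t] * traj[t][d] for t in range(len(traj))) for b in basis]
            for d in range(dims)]

def reconstruct(coeffs, basis):
    """Rebuild the trajectory as a linear combination of the basis vectors."""
    T = len(basis[0])
    return [tuple(sum(c * b[t] for c, b in zip(coeffs[d], basis))
                  for d in range(len(coeffs)))
            for t in range(T)]
```

With `K` much smaller than `T`, each coordinate of the trajectory is described by only `K` numbers, and the reconstruction is restricted to smooth paths, which is one way a compact basis constrains the prediction.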

Figure 6: Our method fails due to mis-estimation of the ground plane, different scene distributions, and failure of depth estimation.

Limitation. Our framework needs three ingredients: training data from similar scenes, ground plane estimation, and depth computation. Failure cases that arise when one of these ingredients is unreliable are illustrated in Figure 6.

Figure 7: Given a depth image (the second column), we predict a set of plausible trajectories of ego-motion (the fourth column) and discover the occluded space (the fifth column) using the EgoSpace map (the third column: predicted trajectories (gray) and ground truth trajectory (red)). The first column shows an image with the ground truth trajectory of ego-motion measured by 3D reconstruction of a first-person camera (time is color-coded). For more scene description, see Section 6.2.


  • [1] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. PAMI, 2011.
  • [2] A. Alahi, V. Ramanathan, and L. Fei-Fei. Socially-aware large-scale crowd forecasting. In CVPR, 2014.
  • [3] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In ECCV, 2008.
  • [4] I. Arev, H. S. Park, Y. Sheikh, J. K. Hodgins, and A. Shamir. Automatic editing of footage from multiple social cameras. SIGGRAPH, 2014.
  • [5] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011.
  • [6] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interaction: A first-person perspective. In CVPR, 2012.
  • [7] A. Fathi and J. M. Rehg. Modeling actions through state changes. In CVPR, 2013.
  • [8] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
  • [9] H. Gong, J. Sim, M. Likhachev, and J. Shi. Multi-hypothesis motion planning for visual object tracking. In ICCV, 2011.
  • [10] E. T. Hall. A system for the notation of proxemic behaviour. American Anthropologist, 1963.
  • [11] D. Helbing and P. Molnár. Social force model for pedestrian dynamics. Physical Review E, 1995.
  • [12] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, 2011.
  • [13] K. M. Kitani, B. Ziebart, J. D. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
  • [14] J. Kopf, M. Cohen, and R. Szeliski. First person hyperlapse videos. SIGGRAPH, 2014.
  • [15] H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee. Motion planning under uncertainty for robotic tasks with long time horizons. In Robotics Research, 2009.
  • [16] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Technical Report 98-11, Computer Science Department, Iowa State University, 1999.
  • [17] R. Lee, D. H. Wolpert, S. Backhaus, R. Bent, J. Bono, and B. Tracey. Modeling humans as reinforcement learners: How to predict human behavior in multi-stage games. In NIPS, 2011.
  • [18] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
  • [19] S. Levine and V. Koltun. Continuous inverse optimal control with locally optimal examples. In ICML, 2012.
  • [20] C. Li and K. M. Kitani. Pixel-level hand detection for ego-centric videos. In CVPR, 2013.
  • [21] Y. Li, A. Fathi, and J. M. Rehg. Learning to predict gaze in egocentric video. In ICCV, 2013.
  • [22] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
  • [23] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. PAMI, 2014.
  • [24] D. Park and D. Ramanan. N-best maximal decoders for part models. In ICCV, 2011.
  • [25] H. S. Park, E. Jain, and Y. Sheikh. 3D social saliency from head-mounted cameras. In NIPS, 2012.
  • [26] H. S. Park and J. Shi. Social saliency prediction. In CVPR, 2015.
  • [27] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
  • [28] A. Pentland and A. Lin. Modeling and prediction of human behavior. Neural Computation, 1995.
  • [29] J. Pineau and G. J. Gordon. POMDP planning for robust robot control. In Robotics Research, 2007.
  • [30] H. Pirsiavash and D. Ramanan. Recognizing activities of daily living in first-person camera views. In CVPR, 2012.
  • [31] S. Ragi and E. K. P. Chong. UAV path planning in a dynamic environment via partially observable Markov decision process. IEEE Transactions on Aerospace and Electronic Systems, 2013.
  • [32] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
  • [33] M. S. Ryoo and L. Matthies. First-person activity recognition: What are they doing to me. In CVPR, 2013.
  • [34] M. S. Ryoo, B. Rothrock, and L. Matthies. Pooled motion features for first-person videos. In CVPR, 2015.
  • [35] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
  • [36] T. Vu, C. Olsson, I. Laptev, A. Oliva, and J. Sivic. Predicting actions from static scenes. In ECCV, 2014.
  • [37] B. Xiong and K. Grauman. Detecting snap points in egocentric video with a web photo prior. In ECCV, 2014.
  • [38] B. Ziebart, A. Maas, J. Bagnell, and A. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.