Despite successful demonstrations of autonomous driving, many open problems remain, including how autonomous vehicles will interact and communicate with human agents. These concerns are particularly important when considering vulnerable users like pedestrians. Although there has been some work on vehicle control in the presence of pedestrians, the majority of research has focused on improving perception for pedestrian detection [3, 4, 5].
While detection is an important part of a complete autonomous system, this paper considers a specific scenario concerning the interaction between pedestrians and human drivers and how it might influence map estimation, as a proxy for detection. In the scenario shown in Figure 1, a pedestrian may be starting to cross the street. From the perspective of the red car, the pedestrian is occluded. We examine how to take advantage of other agents' actions to make inferences about the presence of a pedestrian, despite occlusion. We present an approach based on a map estimation framework.
Mapping in mobile robotics refers to the process of representing an agent’s environment. Based on this representation of the environment, the agent can make intelligent decisions on how to behave in and interact with the environment.
One popular representation of the environment is the occupancy grid map. The occupancy grid map represents the environment as a grid of cells whose occupancy is modeled by independent binary random variables. It allows for tractability in representing large environments with a considerable amount of detail and provides a starting point for more advanced representations. Another type of map representation is the sparse landmark-based map, which only represents key objects in the environment.
There are many more representations, many of which are more detailed (e.g., point clouds and textured meshes), yet require more computational resources. The particular choice of representation is dictated by the environment, computational resources, and how the map representation will be used to make decisions.
Common to all these representations is the need for sensor modeling. Sensor models are usually derived from physical properties of how the sensor in question works. For example, the pinhole model is used for visual cameras and beam models for LIDAR and ultrasound sensors [9, 10].
In this work, we exploit the fact that, aside from the physical interaction of forces within the environment, the actions of other intelligent agents also give us useful information about the environment. As such, we derive a data-driven behavioral model for agents in the environment and incorporate these people as sensors. We show the usefulness of this model for improving the representation of the environment in the presence of uncertainty.
Behavioral driver modeling is an active area of research in many different applications, ranging from driver assistance systems to improved interaction for autonomy [1, 11, 12]. In prior work, the mapping of the environment state was learned with respect to the states of the surrounding vehicles. These influences can also be learned by estimating the cost function of the driver, which can determine what actions the driver might take given some feature representations. Driver models specifically considering pedestrian interactions have also been developed [15, 16, 17].
Few approaches have directly modeled the external influences of driver behavior in a manner that is amenable to improving environment estimation. Map-based approaches require directly modeling the connection between observable states of the vehicle (i.e., that which can be observed from a nearby vehicle) and the belief over the environment. In this work, we focus on developing a driver model that can act as a sensor for the environment. We apply learning techniques to approximate the distribution over pre-determined actions given a map representation. From this, we integrate the sensor model into mapping frameworks to improve our overall awareness of the environment. This paper presents four key contributions:
We introduce and formalize the concept of people as sensors for imputing maps;
We conduct an experiment with human drivers in a vehicle simulator to collect data on interactions between drivers and pedestrians;
We demonstrate improved environment estimation using occupancy grids on the collected data; and
We modify pedestrian motion estimation and prediction with the landmark representation of mapping, which we test on a real-world dataset.
This paper is organized as follows. The methodology used to integrate driver models and mapping is summarized in Section II. The experimental setup for the user studies is described in Section III-A. Section III-B presents our results using our dataset. This method is also validated using an existing real-world dataset in Section IV. Section V discusses our findings and outlines future work.
This section provides a brief overview of mapping and the methods used to incorporate people as sensors.
II-A Mapping Preliminaries
Let $\mathcal{M}$ represent the space of maps. A map $m \in \mathcal{M}$ represents a possible state of the environment. In addition, let $x_{1:t}$ represent relevant information about the mobile agent up to time $t$ (e.g., pose) and $z_{1:t}$ represent information about the physical state of the environment up to time $t$ (e.g., position of other vehicles in the environment). Then, conventionally, the problem of mapping can be formulated as estimating the posterior belief at time $t$, $p(m \mid x_{1:t}, z_{1:t})$, over the space of maps $\mathcal{M}$.
The choice of environment representation largely determines which algorithm to use to estimate this posterior belief. In this paper, we have chosen to represent the environment using two different approaches, depending on the structure of the data. We examine applying the people as sensors framework to (1) mapping the world using occupancy grids and (2) using a collection of sparse landmarks in the environment.
II-A1 Occupancy Grid Maps
When the environment is represented using an occupancy grid, the world is a set of binary random variables arranged in a grid. Each random variable indicates whether or not its corresponding grid cell is occupied. Therefore, each map is a realization of a set of binary random variables. If we denote the value of the grid cell with index $i$ as $m_i$, then

$$m = \{m_i\}_{i=1}^{N}, \qquad m_i \in \{0, 1\},$$

where $N$ is the number of grid cells.
Unfortunately, this choice of representation results in a space of maps that grows exponentially with the number of cells. However, most mapping algorithms make a further assumption of statistical independence between each binary random variable. Due to this assumption, one may compute the posterior belief over the space of maps, given the agent information $x_{1:t}$ and the environment information $z_{1:t}$ up to time $t$, as:

$$p(m \mid x_{1:t}, z_{1:t}) = \prod_{i=1}^{N} p(m_i \mid x_{1:t}, z_{1:t}), \tag{1}$$

leaving us to focus on the simpler and more tractable task of estimating $p(m_i \mid x_{1:t}, z_{1:t})$. Further, we make the simplifying assumption that the state of the world at any time $t$ only depends on data obtained at time $t$. This is a reasonable assumption, given rich enough sensor and mobile agent information at time $t$. As such, we may write that:

$$p(m_i \mid x_{1:t}, z_{1:t}) = p(m_i \mid x_t, z_t). \tag{2}$$

To compute $p(m_i \mid x_t, z_t)$, we make use of the mapping algorithm presented by Thrun et al. Occupancy grids are typically used for mapping in static environments. The application of our work focuses on non-static environments, including moving vehicles and pedestrians. Consequently, we modify the traditional mapping algorithm by removing the time dependence across maps, thus taking a one-shot approach with no prior knowledge of the environment.
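To make the effect of the independence assumption concrete, the following is a minimal Python sketch, not the authors' implementation. It evaluates the log-probability of one map realization as a sum over cells, so only one number per cell has to be stored instead of one per possible map. The function name `map_log_prob` and the example probabilities are assumptions made for illustration.

```python
import math


def map_log_prob(cell_probs, realization):
    """Log-probability of a map realization under independent cells.

    cell_probs:  per-cell occupancy probabilities p(m_i = 1 | x_t, z_t)
    realization: per-cell values m_i in {0, 1}
    """
    total = 0.0
    for p_occ, m_i in zip(cell_probs, realization):
        # Probability of the observed value of this cell.
        p = p_occ if m_i == 1 else 1.0 - p_occ
        if p == 0.0:
            return float("-inf")  # realization is impossible under the belief
        total += math.log(p)  # product of cells becomes a sum of logs
    return total


# Hypothetical two-cell belief: first cell likely occupied, second likely free.
log_p = map_log_prob([0.9, 0.2], [1, 0])
```

Working in log space is a design choice for the sketch: with thousands of cells, multiplying probabilities directly would underflow.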
II-A2 Landmark Representation
When the environment is represented as a collection of sparse landmarks, the world can be viewed as a collection of salient points in the environment (e.g., people, vehicles, key buildings and natural objects). Each of these salient points is termed a landmark, and the mapping task is to estimate the state of these landmarks given data obtained from sensors. Typically, the state of most interest is the pose of these landmarks. This approach represents the map as a collection of $K$ landmarks $l_1, \dots, l_K$, so that now $m = \{l_1, \dots, l_K\}$.
In this work, we have chosen to use pedestrians as landmarks. The state of interest is their position on the 2D ground plane. One could make use of a Kalman filter to estimate this position, but the publicly available dataset this model was tested on did not contain enough information to do this. Alternatively, since we are interested in modeling the position of the pedestrian in cases where they are occluded, we have assumed a uniform distribution over the pedestrian's position, approximated by a Gaussian distribution centered at zero with large variance. This simple approach represents the fact that when the pedestrian is occluded, we have no information about their possible location.
II-B Integrating Humans in Mapping
One of the main contributions of this paper is the use of human models as a source of sensor information. We argue that the actions of intelligent agents, specifically other drivers in this scenario, are a ubiquitous source of rich information that should not be ignored. However, since human agents are highly uncertain and are difficult to model, it is also important that this information be incorporated appropriately with other sources of information while estimating quantities of interest.
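Given an observed human action $a$ from a set of actions $\mathcal{A}$, we may reformulate the mapping problem as estimating $p(m \mid x_{1:t}, z_{1:t}, a_{1:t})$, where $m$ is the map, $x_{1:t}$ and $z_{1:t}$ are the agent and environment information up to time $t$, and $a_{1:t}$ is a sequence of observed actions from $\mathcal{A}$ up until time $t$. While this idea is indeed general, we restrict ourselves to the case where we only observe the action from the closest driver in front of our mobile agent.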
Building on the formulation discussed in the previous section, we estimate $p(m \mid x_{1:t}, z_{1:t}, a_{1:t})$, the belief over the map $m$ given agent information $x_{1:t}$, sensor information $z_{1:t}$, and observed driver actions $a_{1:t}$, using Bayes’ rule to fuse information obtained from driver actions with our map estimate obtained using other conventional sensors given by Equation 1:

$$p(m \mid x_{1:t}, z_{1:t}, a_{1:t}) \propto p(a_{1:t} \mid m, x_{1:t}, z_{1:t}) \, p(m \mid x_{1:t}, z_{1:t}).$$

As before, we make the assumption that the state of the world at any time $t$ only depends on data obtained at time $t$. We also assume that given a representation of the world, the action of the human driver does not depend on the pose of our mobile agent or our sensor information. While this generally might not be true, in the specific context of our application, this assumption is valid due to the relative positioning of the agents. Taking into account the recursive influences is left as future work. Given these assumptions, what we seek to estimate is:

$$p(m \mid x_t, z_t, a_t) \propto p(a_t \mid m) \, p(m \mid x_t, z_t).$$

In the case where the world is represented as an occupancy grid, we have used a driver model that depends on each grid cell $m_i$ and fused the information from the driver action to estimate the map:

$$p(m_i \mid x_t, z_t, a_t) \propto p(a_t \mid m_i) \, p(m_i \mid x_t, z_t),$$

using Equation 2 to obtain $p(m_i \mid x_t, z_t)$. The next section explains in detail how we obtain the driver model $p(a_t \mid m_i)$.
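The per-cell fusion step described above, in which a driver-model likelihood is multiplied with the sensor-based belief for a cell and then renormalized, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the names `fuse_cell` and `fuse_grid`, the representation of the driver model as two likelihood values per cell, and the numbers in the usage line are all hypothetical.

```python
def fuse_cell(prior_occ, lik_occ, lik_free):
    """Posterior probability that one cell is occupied after observing an action.

    prior_occ: belief that the cell is occupied from conventional sensors
    lik_occ:   likelihood of the observed action if the cell is occupied
    lik_free:  likelihood of the observed action if the cell is free
    """
    num = lik_occ * prior_occ
    den = num + lik_free * (1.0 - prior_occ)
    if den == 0.0:
        # Assumption: if the action is impossible under both hypotheses,
        # the observation carries no information and the prior is kept.
        return prior_occ
    return num / den


def fuse_grid(prior_grid, lik_occ_grid, lik_free_grid):
    """Apply fuse_cell independently to every cell of a 2D grid (lists of lists)."""
    return [
        [fuse_cell(p, lo, lf) for p, lo, lf in zip(p_row, lo_row, lf_row)]
        for p_row, lo_row, lf_row in zip(prior_grid, lik_occ_grid, lik_free_grid)
    ]


# Hypothetical occluded cell (prior 0.5) and an observed action that is
# three times as likely when the cell is occupied as when it is free.
posterior = fuse_cell(0.5, 0.6, 0.2)
```

When the two likelihoods are equal, the posterior equals the prior, which matches the intuition that an uninformative action should leave the map unchanged.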
II-C Sensor Models for Drivers
To model the driver as a sensor, we must estimate the probability distribution over human actions given the two different approaches to mapping. For both cases, we consider the following actions based on the vehicle’s velocity profile to be consistent with actions used in the literature:
Moving Fast: The vehicle is moving at a speed above a predetermined threshold.
Moving Slow: The vehicle is moving at a speed below a predetermined threshold.
Accelerating: The vehicle is increasing its speed.
Decelerating: The vehicle is decreasing its speed.
Standing: The vehicle is stopped.
The parameters and specific labeling methods used in this work are provided in the experimental sections.
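As an illustration of how the five actions above could be assigned from a velocity profile, the following Python sketch labels a single time step. It is a sketch under assumptions, not the labeling rule used in this work: the threshold values `v_stop`, `v_fast`, and `dv_eps`, the units, and the order in which the rules are checked are all hypothetical.

```python
def label_action(v, dv, v_stop=0.1, v_fast=5.0, dv_eps=0.05):
    """Return one of the five action labels for a single time step.

    v:  current speed (assumed m/s)
    dv: change in speed since the previous time step (assumed m/s)
    All thresholds are placeholder values chosen for the example.
    """
    if abs(v) < v_stop:
        return "standing"
    if dv > dv_eps:
        return "accelerating"
    if dv < -dv_eps:
        return "decelerating"
    # Speed is roughly constant: distinguish fast from slow.
    return "moving_fast" if v >= v_fast else "moving_slow"


# Label a short hypothetical speed trace (first sample has no predecessor).
speeds = [6.0, 6.0, 5.2, 4.1, 2.0, 0.0]
labels = [label_action(v, v - prev) for prev, v in zip(speeds, speeds[1:])]
```

Checking for a stopped vehicle first and for speed changes second means the two constant-velocity labels are only assigned when the speed is steady, which keeps the five labels mutually exclusive.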
II-C1 Estimating Actions from Occupancy Grids
Supposing we have a finite collection of driver actions and the assumption that each cell in the grid is an independent Bernoulli random variable, we can approximate the probability of an action given the state of cell $i$ empirically. For each action, we denote this empirical distribution as $\hat{p}(a_k \mid m_i)$, estimated from $N$ trials, where $a_k$ is the $k$-th action in the set $\mathcal{A}$.
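A counting-based estimate of this kind can be sketched as follows. The sketch assumes that training data for one grid cell is available as pairs of cell state and action label. The function name `empirical_action_model` and the additive smoothing term `alpha` are assumptions added for illustration (smoothing keeps unseen actions from receiving zero probability, and `alpha` should be positive if a state may have no samples); the source does not state whether smoothing was used.

```python
from collections import Counter


def empirical_action_model(samples, actions, alpha=1.0):
    """Estimate the probability of each action given a cell state by counting.

    samples: iterable of (cell_state, action) pairs, cell_state in {0, 1}
    actions: list of all action labels
    alpha:   additive smoothing constant (assumption, not from the source)
    Returns a dict: model[cell_state][action] -> probability.
    """
    counts = {0: Counter(), 1: Counter()}
    for state, action in samples:
        counts[state][action] += 1

    model = {}
    for state in (0, 1):
        total = sum(counts[state].values()) + alpha * len(actions)
        model[state] = {a: (counts[state][a] + alpha) / total for a in actions}
    return model


ACTIONS = ["moving_fast", "moving_slow", "accelerating", "decelerating", "standing"]
# Hypothetical training pairs for a single cell near the crosswalk.
samples = [
    (1, "decelerating"), (1, "decelerating"), (1, "standing"),
    (0, "moving_fast"), (0, "moving_fast"), (0, "decelerating"),
]
model = empirical_action_model(samples, ACTIONS)
```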
II-C2 Likelihood of Action from Landmarks
Using the landmark interpretation of the mapping problem, the sensor model of the driver must be approximated as the probability of an action given the position of the landmark obstacle. To do this, we use discrete choice theory and apply the logit model to find this mapping.
Previous work has demonstrated that this method can determine driver actions and intent with high accuracy. This approach employs the EM algorithm to iteratively find the optimal linear combination of features in the dataset to estimate the probability of an action given some map configuration, $p(a \mid m)$.
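For readers unfamiliar with the logit model, the sketch below shows only its evaluation step: each action receives a linear utility of the features, and the utilities are turned into probabilities with a softmax. Fitting the coefficients is not shown. The function name `logit_action_probs`, the feature layout, and the coefficient values are hypothetical.

```python
import math


def logit_action_probs(features, weights):
    """Multinomial logit: probability of each action given a feature vector.

    features: list of floats describing the map configuration
    weights:  dict mapping action label -> list of coefficients (same length)
    """
    utilities = {
        action: sum(w * f for w, f in zip(coeffs, features))
        for action, coeffs in weights.items()
    }
    u_max = max(utilities.values())  # subtract the max for numerical stability
    exps = {action: math.exp(u - u_max) for action, u in utilities.items()}
    norm = sum(exps.values())
    return {action: e / norm for action, e in exps.items()}


# Hypothetical features: [bias, lateral offset, longitudinal distance] of a pedestrian.
features = [1.0, 0.5, 8.0]
weights = {
    "decelerating": [0.2, -1.0, -0.1],
    "moving_fast": [0.0, 0.5, 0.1],
}
probs = logit_action_probs(features, weights)
```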
III Case 1: Occupancy Grid Formulation
We first evaluate our conceptual framework on the map representation of occupancy grids. In this test case, we carry out a user study to collect ground truth information about the state of the world, which is easily translated into the discretized space of occupancy grids.
III-A Experimental Setup
In order to build the driver model for mapping purposes, training, testing, and validation driving data are required. For the scenario considered, there are few publicly available datasets that provide the quality of data required for mapping and driver modeling purposes. Section IV presents the formulation and results on one of these real-world datasets.
Due to the lack of available data with full information about the vehicle and environment states, a new dataset was collected to study driver-pedestrian interaction. Driver data was collected using PreScan, an industry-standard simulation tool that provides vehicle dynamics and customizable driving environments. Using a force feedback steering wheel and pedals for the subject to control the ego vehicle, we created various intersection scenarios in which a pedestrian might appear, as shown in Figure 2.
In each trial, the ego vehicle began approaching an intersection at an initial distance and speed . There were a total of seven possible behaviors the pedestrian might exhibit, which were designed to recreate typical pedestrian behaviors. After appearing from behind an occluding obstacle, the prescribed behaviors included boldly crossing the road, waiting to cross until the approaching vehicle slowed down, and just standing at the side of the road. To discourage anticipating the pedestrian motion, the pedestrian did not appear in half of the instances.
Five subjects each completed approximately one hour of experiments. In each trial, the subject was asked to maintain a constant velocity between 10 and 15 mph and stay in their lane, if possible. This resulted in 1,440 example interactions each lasting approximately 5 to 10 seconds. From this, we generated a total of 281,506 maps to build our sensor models and test our mapping. Twenty percent of this data is used to generate the empirical distribution over actions.
For each trial, the following data is collected: the human-driven vehicle states (global and local), the human driver inputs, and the ground truth position of the pedestrian. Using this data, we created a ground truth occupancy grid for the region in front of the human driver that would be occluded for our ego vehicle. The occlusion is determined using a simple LIDAR model to determine what the closest obstacles are in the 360° view. We assume only some of the vehicle states are observable from the ego vehicle (i.e., relative position and velocity, distance to crosswalk).
Using the actions defined in Section II, we use the following parameters, denoting as current velocity and as change in velocity from the previous time step:
where the velocity thresholds are selected to be and .
To reiterate, we consider a scenario with three agents: the ego vehicle, the human driven vehicle, and the pedestrian, as visualized in Figure 1. The ego vehicle observes the human driven vehicle that is occluding the pedestrian. Based on the observed actions, we hope to construct a posterior belief across possible maps.
We train a sensor model that maps the ground truth occupancy grid and sensor measurements to a distribution over actions; this human driver model can then be used in imputing the occupancy map from observed actions. An example of the occupancy grid input and the associated action distribution is shown in Figure 3.
III-B Evaluation Metrics
We tested our work on multiple scenarios from our experimental dataset. Each scenario is composed of three agents: an autonomous vehicle in one lane, a vehicle in the other lane causing the occlusion, and a pedestrian. The actions of the other vehicle were generated from the human driver behaviors observed while gathering our experimental dataset. The autonomous vehicle was set to follow a constant velocity trajectory behind the other vehicle in the second lane (see Fig. 1). In some of the scenarios, the pedestrian is occluded by the vehicle in the other lane.
We used common metrics to evaluate the quality of our approach. The use of these metrics is dictated by the choice of model used to represent the environment. To evaluate the occupancy grids generated by our approach, we use a variation of the Martin-Moravec Score and Image Similarity.
III-B1 n-Martin-Moravec Score, nMM
Given an occupancy grid A and the ground truth map B, the Martin-Moravec Score is a metric used to compare the similarity of an occupancy grid to an ideal or ground truth measurement. We slightly modify this metric, with the formal definition:
where is the number of grid cells, is the event that grid cell in map A is occupied, and is the event that it is not occupied. The same notation follows for B.
Intuitively, this score provides insight into the cellwise dissimilarity between two maps. A perfect match would have a score of 0.
The Martin-Moravec Score has the property of only comparing grid cells with the same grid index. While this is a common metric, the downside is that it does not take into account the value of neighboring cells in computing a score, which means that a slight mis-estimation of an object’s location can greatly impact the score.
We desire the ability to differentiate among candidate occupancy grid maps based on the spatial proximity of their prediction to the ground truth.
III-B2 Image Similarity
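The Image Similarity metric is another metric used to evaluate the similarity of an occupancy grid to an ideal or ground truth measurement. Following the definition given by Birk and Carpin (see the reference list), this metric is computed as:

$$\psi(A, B) = \sum_{c \in \mathcal{C}} \big( d(A, B, c) + d(B, A, c) \big), \qquad d(A, B, c) = \frac{\sum_{i \,:\, A[i] = c} \min \{ \mathrm{md}(\mathrm{pos}(i), \mathrm{pos}(j)) : B[j] = c \}}{\#_c(A)},$$

where $\mathcal{C}$ is the set of occupancy values considered, $A[i]$ is the occupancy value at grid cell $i$ in map $A$, $\mathrm{pos}(i)$ returns the 2D coordinates of grid cell $i$, $\mathrm{md}(\cdot, \cdot)$ gives the Manhattan distance between coordinates, and $\#_c(A)$ is the number of cells in $A$ with occupancy value $c$. This score addresses some of the problems involved in using cellwise comparisons by taking into account the value of neighboring cells when assigning a score.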
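The neighborhood-based comparison described above can be sketched in Python as follows. The sketch follows the commonly cited form of the Image Similarity metric: for every occupancy value, the average Manhattan distance from each cell with that value in one map to the nearest cell with the same value in the other map, summed in both directions. The function name `image_similarity`, the brute-force nearest-cell search, and the handling of values that are absent from one map are assumptions of the sketch.

```python
def image_similarity(map_a, map_b, values=(0, 1)):
    """Image Similarity between two thresholded grids (lists of lists).

    Cells whose value is not in `values` (e.g. None for unknown) are ignored.
    Lower is better; identical maps score 0.
    """

    def cells_with(grid, value):
        return [
            (r, k)
            for r, row in enumerate(grid)
            for k, cell in enumerate(row)
            if cell == value
        ]

    def directed(src, dst, value):
        src_cells = cells_with(src, value)
        dst_cells = cells_with(dst, value)
        if not src_cells or not dst_cells:
            # Assumption: a value missing from either map contributes nothing.
            return 0.0
        total = sum(
            min(abs(r1 - r2) + abs(k1 - k2) for r2, k2 in dst_cells)
            for r1, k1 in src_cells
        )
        return total / len(src_cells)

    return sum(
        directed(map_a, map_b, v) + directed(map_b, map_a, v) for v in values
    )


# Hypothetical 2x2 maps whose single occupied cell is off by one column.
estimate = [[1, 0], [0, 0]]
truth = [[0, 1], [0, 0]]
score = image_similarity(estimate, truth)
```

The brute-force search is quadratic in the number of cells, which is acceptable for small grids; a distance transform would be the usual choice for large ones.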
Since this scoring method considers a neighborhood of cells, it is arguably a more robust method for evaluating occupancy grids. To make use of this metric, we indicate the occupancy value of each cell by thresholding the probability as follows:
Setting the value of occupancy to indicates an unknown state. These states are not included in the computation of , as discussed in .
We compare our results to a standard occupancy grid mapping algorithm that does not incorporate information from the actions of other drivers and to ground truth measurements. We refer to the results from the standard occupancy grid algorithm as “Vanilla Grid,” where the occluded regions are considered unknown.
The accompanying figure shows a sample result based on using occupancy grids to represent the environment. The orange and blue vehicle icons represent the ground truth positions of the autonomous vehicle and the other vehicle respectively, while the pedestrian icon represents the ground truth position of the pedestrian. In this scenario, the pedestrian is occluded from the view of the autonomous vehicle by the other vehicle in the scene. Due to this, there is a large uncertainty concerning the position of the pedestrian using the Vanilla Grid. Although we do not observe the pedestrian, we observe that the other car is coming to a stop. By incorporating this information, our driver model is able to help us reason about the likely positions of the pedestrian. Our algorithm can reduce the uncertainty present and provide a more accurate prediction about the position of the pedestrian. Intuitively, we are able to reproduce the common knowledge that if a car near you slows down at a crosswalk, there are probably pedestrians somewhere around, even if you cannot see them. This is one of the safe heuristics humans use in driving that naturally results from the mathematics.
Quantitatively, we apply the metrics presented in the previous subsection to compare our work to the Vanilla Grid. Table I shows the average scores under the two different metrics at the beginning, the middle, and the end of each scenario, providing insight as to how this method performs over time. The mean and standard deviation of the metrics over time are shown in Figures 5 and 6. Our work does significantly better under the Image Similarity metric. We perform comparably to the Vanilla Grid under the nMM score, as it does not consider spatial proximity.
IV Case 2: Landmarks in Real-World Dataset
To demonstrate the utility of this conceptual framework, we also test on a real-world dataset for pedestrian interaction. The publicly available dataset provides only partial information about the state of the world, making occupancy grids more difficult to consider without significant assumptions. Taking these constraints into account, we use driver models to improve tracking landmarks that may be occluded.
IV-A JAAD Dataset of Pedestrian Interactions
We use the JAAD dataset, which consists of 346 high-resolution video clips, lasting approximately 5 to 10 seconds each, that are representative of scenes in everyday urban driving. These clips are annotated, providing labels associated with the driver actions and bounding boxes of detected pedestrians.
From each bounding box, we obtain an estimated position of the pedestrian relative to the vehicle of interest. Given the assumptions required to get this estimate, we assume a Gaussian distribution over our estimates, making this partial, noisy data more in line with the landmark philosophy.
Using the driver model learned from the JAAD dataset, we estimate and predict the location of the pedestrian as a landmark, as described in Section II. We assume a uniform prior over the occluded space, and show how a posterior distribution conditioned on the human driver actions can improve the estimation of the pedestrian’s location. To evaluate the map generated by our landmarks model of the environment, we compare the likelihood of the pedestrian’s true location in the prior and posterior distributions.
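The evaluation described above can be illustrated with a small discretized example in Python. The sketch below is an assumption-laden illustration, not the procedure used to produce the reported results: it replaces the continuous position density with a handful of candidate positions, and it takes the ratio of posterior to prior probability at the true location as one possible reading of an improvement measure. The names `posterior_over_positions` and `improvement_ratio`, the candidate positions, and the likelihood values are hypothetical.

```python
def posterior_over_positions(prior, likelihood):
    """Combine a prior over candidate pedestrian positions with a driver model.

    prior:      dict position -> prior probability
    likelihood: dict position -> likelihood of the observed action at that position
    Returns the normalized posterior as a dict over the same positions.
    """
    unnormalized = {pos: prior[pos] * likelihood[pos] for pos in prior}
    norm = sum(unnormalized.values())
    if norm == 0.0:
        return dict(prior)  # assumption: an impossible action leaves the prior unchanged
    return {pos: w / norm for pos, w in unnormalized.items()}


def improvement_ratio(prior, posterior, true_pos):
    """Ratio of posterior to prior probability at the true position."""
    return posterior[true_pos] / prior[true_pos]


# Four hypothetical candidate positions in the occluded region, uniform prior.
positions = ["crosswalk", "curb", "sidewalk_near", "sidewalk_far"]
prior = {pos: 0.25 for pos in positions}
# Hypothetical likelihood of observing "decelerating" for each position.
likelihood = {"crosswalk": 0.6, "curb": 0.2, "sidewalk_near": 0.1, "sidewalk_far": 0.1}

posterior = posterior_over_positions(prior, likelihood)
ratio = improvement_ratio(prior, posterior, "crosswalk")
```

A ratio above one means the observed action moved probability mass toward the true location, while a ratio below one means the driver model was misleading for that sample.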
The results of our approach compared with the uniform prior are shown in Table II. Figure 8 presents a sample output of our work using the landmark representation and that of the uniform prior. The orange and blue vehicle icons represent the ground truth positions of the autonomous vehicle and the other vehicle respectively, while the pedestrian icon represents the ground truth position of the pedestrian.
Once again, in this scenario, the pedestrian is occluded from the view of the autonomous vehicle. The contour plots in the figure represent the estimated posterior density of the position of the pedestrian, with darker regions indicating higher density values. As shown, by incorporating the driver model learned from data, our work is able to place a higher posterior density on the position of the pedestrian compared to the uniform prior.
| Action | Uniform Prior | Our Work | Improvement Ratio |
As shown, our approach outperforms the uniform prior for a majority of the actions. While this may not seem like a promising result, the scenarios where our methodology fails are intuitive when we consider how drivers behave in the real world. When we drive in the streets and observe other drivers driving at a constant speed, we gain little information about the landmarks in the occluded part of the scene. We observe from the driver model derived from the JAAD dataset that the two constant velocity actions (moving fast and moving slow) are not very informative without detailed contextual information. Further, since we partition the dataset so that we only consider samples where the pedestrian might be occluded, these two actions are underrepresented relative to the other labels. Because of these two points, our approach only exhibits improved performance on a subset of the actions.
By exploiting the actions of other intelligent agents, a wealth of information can be inferred about the environment. We have presented a methodology that uses driver models as sensors to impute maps that can be used to improve planning in the face of uncertainty. Thus, regions of the map that would otherwise be occluded can be imputed, providing an estimation of the environment’s state. We validate this concept on two different mapping methods and datasets, demonstrating significantly improved performance over standard mapping techniques.
While we have presented promising results on an interesting case study, there is a great deal of future work to be done. First, given the data-driven nature of this framework, there is a strong dependence on the underlying data and the scenes that are represented. Further, expanding this work to more scenarios and to multi-agent settings is key to making sure it works in real-world scenarios. Additionally, in the presented implementation, the discretization of the map was quite coarse to account for computational complexity. We hope to explore techniques for generating compact representations to improve tractability in complex environments.
[1] K. Driggs-Campbell, V. Govindarajan, and R. Bajcsy, “Integrating intuitive driver models in autonomous planning for interactive maneuvers,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–12, 2017.
[2] B. Chen, D. Zhao, and H. Peng, “Evaluation of automated vehicles encountering pedestrians at unsignalized crossings,” Available on arXiv:1702.00785, 2017.
[3] T. Bandyopadhyay, C. Z. Jie, D. Hsu, M. H. Ang Jr, D. Rus, and E. Frazzoli, “Intention-aware pedestrian avoidance,” in Experimental Robotics. Springer, 2013, pp. 963–977.
[4] S. Y. Gelbal, S. Arslan, H. Wang, B. Aksun-Guvenc, and L. Guvenc, “Elastic band based pedestrian collision avoidance using V2X communication,” in IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 270–276.
[5] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[6] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” IEEE Computer, vol. 22, no. 6, pp. 46–57, June 1989.
[7] J. Guivant and E. Nebot, “Improving computational and memory requirements of simultaneous localization and map building algorithms,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), vol. 3, 2002, pp. 2731–2736.
[8] S. Thrun et al., “Robotic mapping: A survey,” Exploring Artificial Intelligence in the New Millennium, vol. 1, pp. 1–35, 2002.
[9] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in IEEE International Conference on Robotics and Automation (ICRA), 2007, pp. 3565–3572.
[10] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.
[11] V. A. Shia, Y. Gao, R. Vasudevan, K. D. Campbell, T. Lin, F. Borrelli, and R. Bajcsy, “Semiautonomous vehicular control using driver modeling,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2696–2709, Dec 2014.
[12] A. Doshi and M. M. Trivedi, “Tactical driver behavior prediction and intent inference: A review,” in 14th IEEE International Conference on Intelligent Transportation Systems (ITSC), 2011, pp. 1892–1897.
[13] K. Driggs-Campbell and R. Bajcsy, “Identifying modes of intent from driver behaviors in dynamic environments,” in IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), Sept 2015, pp. 739–744.
[14] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the 21st ACM International Conference on Machine Learning (ICML), 2004, p. 1.
[15] K. Salamati, B. Schroeder, D. Geruschat, and N. Rouphail, “Event-based modeling of driver yielding behavior to pedestrians at two-lane roundabout approaches,” Transportation Research Record: Journal of the Transportation Research Board, no. 2389, pp. 1–11, 2013.
[16] N. Guéguen, S. Meineri, and C. Eyssartier, “A pedestrian’s stare and drivers’ stopping behavior: A field experiment at the pedestrian crossing,” Safety Science, vol. 75, pp. 87–89, 2015.
[17] R. Sun, X. Zhuang, C. Wu, G. Zhao, and K. Zhang, “The estimation of vehicle speed and stopping distance by pedestrians crossing streets in a naturalistic traffic environment,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 30, pp. 97–106, 2015.
[18] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.
[19] I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Joint attention in autonomous driving (JAAD),” Available on arXiv:1609.04741, 2016.
[20] K. E. Train, Discrete choice methods with simulation. Cambridge University Press, 2009.
[21] K. Driggs Campbell and R. Bajcsy, “Experimental design for human-in-the-loop driving simulations,” Master’s thesis, EECS Department, University of California, Berkeley, May 2015.
[22] M. C. Martin and H. P. Moravec, “Robot evidence grids,” Robotics Institute, Carnegie-Mellon University, Pittsburgh, PA, Tech. Rep., 1996.
[23] A. Birk and S. Carpin, “Merging occupancy grid maps from multiple robots,” Proceedings of the IEEE, vol. 94, no. 7, pp. 1384–1397, 2006.
[24] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Agreeing to cross: How drivers and pedestrians communicate,” in IEEE Intelligent Vehicles Symposium (IV), June 2017, pp. 264–269.