1 Introduction
Recent advances in robotics and machine learning have enabled the deployment of mobile robots for day-to-day tasks, whether as domestic cleaners, navigational aids, autonomous vehicles or last-mile delivery agents. In most of these applications, robots navigate in close proximity to humans, so it is essential that they follow navigational conventions while also remaining robust to unexpected situations. Autonomously crossing street intersections is one such situation in which a robot can cause unintended outcomes for the surrounding traffic participants.
In order to decide if a street intersection is safe for crossing, humans are taught at an early age to follow a rigorous decision making process, which comprises checking and waiting for the traffic light signal, followed by looking in both directions to ensure that the intersection is safe to cross. Hence, relying solely on the traffic light information to make the crossing decision is suboptimal: not only is the traffic light recognition task challenging in itself, but the signal alone does not guarantee that the intersection is safe to cross. For example, when a speeding vehicle such as an ambulance or firetruck approaches an intersection, it has the right of way and does not necessarily follow the standard traffic regulations; other traffic participants such as pedestrians and vehicles are required to wait until the intersection becomes clear. The problem becomes even more challenging with the varying types of intersections and the associated rules on how to cross each variant. For instance, the standard convention at zebra crossings is that the pedestrian has priority for crossing the intersection, whereas the oncoming traffic slows down and stops until they have crossed. At unmarked intersections, such as a side street merging into a main road, there is neither a traffic light to regulate the crossing nor does the pedestrian have the right of way. Further complicating the problem, the topology of the road, such as the street width, the presence of a middle island and the direction of traffic, plays an important role in determining the crossing procedure. Hence, hard-coding a set of behavioral rules for a mobile robot to abide by at intersections is not only highly tedious, but also requires constant upkeep and tailoring to suit varying scenarios that change with each region or city.
In this work, we propose a convolutional neural network framework to address the problem of autonomous street crossing that considers the dynamicity of the scene as well as factors that influence the crossing decision such as the presence of a traffic light. Our network consists of two streams, an interaction-aware motion prediction stream to estimate the future states of all surrounding traffic participants and a traffic light recognition stream to predict the state of the traffic light. Our framework fuses feature maps from both network streams to learn the crossing decision in an end-to-end manner, rendering it tolerant to noise and mispredictions by either subnetwork, as well as making it inherently robust to the type of intersection and surrounding road topology.

Predicting and modeling the behavior of agents whether pedestrians or vehicles is an extremely challenging problem that requires understanding the navigation conventions as well as the complex interactions among the various agents. For humans, identifying and following these conventions during navigation is a skill learned over several years that often needs readjustment depending on the environment. Hence, formalizing a set of behavioral rules for a mobile robot to uphold is both complex and taxing, requiring constant maintenance for each new environment. Recently, learning based motion prediction approaches (Kim et al., 2017; Alahi et al., 2016) have shown considerable robustness in modeling interactions among agents in real-world scenarios. However, as the density of the scene increases, their run-time and representational capabilities decrease, as they rely on modeling each agent separately by considering only their local neighborhood. Furthermore, current approaches which employ recurrent networks for modeling the behavior of agents define a local neighborhood of surrounding agents whose actions are likely to affect the motion of the current “query” agent. Deciding which agents to include in the local neighborhood requires handcrafted definitions and domain-level knowledge which is undesirable for achieving a robust framework that can be applied in different settings.
In order to address these problems, in this paper we propose the novel IA-TCNN architecture for interaction-aware motion prediction to jointly estimate the future trajectories of all the observed agents in the scene from 2D trajectory data. We utilize a data-driven method that leverages the 2D tracks from LiDAR data provided by an obstacle detection/tracking module to represent the behavior of the different agents in the scene. By leveraging the inherent motion interdependencies between the different agents and by employing causal convolutions, our approach simultaneously predicts the behavior of all observed agents in the scene and learns these interactions without manually specifying a set of behavioral rules (Helbing and Molnar, 1995; Pellegrini et al., 2009). While sequence modeling problems such as trajectory estimation have mostly been tackled using recurrent neural networks, recent studies have shown that temporal convolutional neural networks are able to model sequence-to-sequence tasks more effectively (Bai et al., 2018). To the best of our knowledge, this work is the first to employ causal convolutions in networks that perform behavior and motion prediction.
We further propose the novel AtteNet architecture for traffic light recognition that is robust to varying weather and lighting conditions. Our architecture incorporates Squeeze-Excitation blocks (Hu et al., 2017), thereby enabling it to learn a robust feature recalibration method that explicitly models the complex interdependencies between the channels of the various feature maps. This allows the network to actively suppress irrelevant features in the scene and highlight the most relevant ones, which in turn enables it to learn representations that are robust to noisy data. Note that we do not strive to achieve state-of-the-art performance for traffic light recognition; our aim is rather to incorporate the information regarding the traffic light signal into the street crossing predictor in order to learn a model that acts in accordance with the navigational norms.
Figure 1 depicts the proposed architecture for intersection safety prediction along with the constituent subnetworks. The input to our network is an RGB image of the scene and the 2D trajectory information of all observed dynamic agents, given as spatial coordinates encoded relative to the robot over an interval of time. Our network simultaneously predicts the traffic light signal, the future states of all traffic participants over a prediction window and the safety of the intersection for crossing during this interval. As our method does not rely on structural knowledge of the environment or any form of communication with the surrounding traffic participants, it can be applied independently of the intersection type. Our model can be readily used to continuously track the states of surrounding agents as well as the traffic light status and to adapt the crossing behavior online. Due to the modularity of our network, the inference is easily interpretable by visualizing the predicted states of the surrounding agents and the traffic light status. We benchmark our IA-TCNN architecture on several publicly available datasets, namely ETH (Pellegrini et al., 2009), UCY (Lerner et al., 2007) and L-CAS (Yan et al., 2017), in addition to our own Freiburg Street Crossing dataset (Radwan et al., 2017). For the traffic light recognition task, we benchmark on the Nexar (Nexar, 2016) and Bosch (Behrendt and Novak, 2017) datasets as well as the Freiburg Street Crossing dataset, while for the autonomous crossing prediction, we perform extensive experimental evaluations on the extended Freiburg Street Crossing dataset. Additionally, in order to evaluate the performance of our approach for autonomous crossing prediction in real-world scenarios, we deploy our framework on a robotic platform and conduct real-world experiments at various street intersections in Freiburg. The results demonstrate that our architecture achieves state-of-the-art performance on each of the tasks, while being robust enough to generalize to new and unseen environments without the need for retraining or pre/post-processing.
In summary, the primary contributions of this work are as follows:
-
A novel multimodal convolutional neural network architecture for intersection crossing safety prediction that jointly predicts the state of the traffic light and the future trajectories of surrounding traffic participants, thereby learning a crossing decision that is intersection agnostic.
-
The novel IA-TCNN architecture for interaction-aware motion prediction to model the complex behavior and interactions among all observed agents in a scene while maintaining a fast inference time and being efficiently deployable in robots with limited resources.
-
We extend the previously introduced Freiburg Street Crossing dataset (Radwan et al., 2017), consisting of images as well as LiDAR and radar data, with eight additional sequences captured at various intersections, along with annotations for the traffic light state, trajectory annotations for the tracked dynamic objects and the corresponding crossing decision, and make the dataset publicly available.
-
We present extensive qualitative and quantitative analysis on each of the proposed modules on various publicly available benchmarks coupled with real-world experiments to demonstrate their utility in challenging real-world scenarios.
2 Related Works
Over the last decade, there has been significant work in the areas of motion prediction, traffic light recognition and intersection handling. In the following, we review some of the techniques developed thus far for addressing each of these tasks.
Motion Prediction approaches can be divided into two categories: methods modeling interactions among pedestrians and approaches modeling the behavior of vehicles. Among the first methods to model pedestrian interactions is the Social Forces (SF) method of Helbing and Molnar (1995), in which they apply a potential field based approach with attractive and repulsive forces to model the interactions among various pedestrians in the surrounding environment. An improvement of the SF method was subsequently proposed by Yamaguchi et al. (2011), in which the authors employ a data-driven approach to estimate the hidden variables affecting the behavior of the agents such as group affinity and destinations. Trautman and Krause (2010) propose a solution for the freezing robot problem in crowded navigation by utilizing concepts from human crowd navigation and collision avoidance. Using Gaussian processes to estimate the non-Markov nature of agents in the wild, they are able to predict navigation trajectories for the robot that are safer and shorter than the trajectories taken by comparable pedestrians. Subsequently, Trautman et al. (2015) present a detailed navigation study in crowded urban environments investigating the effect of different cooperation strategies on the overall navigation performance, utilizing insights from how humans navigate to deter the freezing robot problem. Lerner et al. (2007) use an example-based reactive approach to model pedestrian behavior by creating a database of local spatio-temporal scenarios. During an interaction, the autonomous agent samples its trajectory incrementally by considering similar spatio-temporal scenarios from the database. Subsequently, Pellegrini et al. (2009) introduced the Linear Trajectory Avoidance (LTA) method which uses similar concepts from crowd simulation to model the behavior of pedestrians in crowded environments using linear extrapolation over short intervals. Van den Berg et al. (2008) propose an alternative approach for multi-agent navigation, in which they extend the velocity obstacle concept by assuming that surrounding agents follow a similar “collision avoidance” policy. Kuderer et al. (2012)
employ a maximum entropy reinforcement learning approach to model human navigation behavior. In order to approximate the feature expectations, the proposed method employs Dirac delta functions at the modes of the distributions. However, while this approach was able to accurately model the behavior of pedestrians, suboptimal behavior often emerged due to the large amount of data required to capture the stochasticity of the human behavior. In order to address this problem and enable accurate modeling of the pedestrian behavior,
Kretzschmar et al. (2014) propose computing feature expectations using Hamiltonian Markov chain Monte Carlo sampling.
Kretzschmar et al. (2016) further expand on their proposed approach by utilizing a joint mixture distribution to capture both the discrete and continuous aspects of the design problem. They build upon the reinforcement learning approach with Hamiltonian Markov chain Monte Carlo sampling to learn a socially compliant navigation behavior. Pfeiffer et al. (2016) present a data-driven approach for interaction-aware motion prediction, wherein the authors employ a maximum entropy distribution over the navigation behavior observed from demonstrations. In order to facilitate deployment in real-world environments, the authors further propose a receding horizon motion planner that does not require knowledge of the destinations of the surrounding pedestrians. The authors additionally provide evidence that learning interaction-aware motion trajectories from human demonstrations alone is insufficient, due to the piqued interest of pedestrians in the robot, which results in behavior that differs from the trained policy.
While the aforementioned approaches are able to capture pedestrian behavior in specific situations, the need for defining hand-crafted features makes them undesirable for deployment in dynamic environments. Inspired by the success of deep learning based approaches in the various areas of computer vision and robotics, Alahi et al. (2016) propose an approach dubbed Social LSTM, which uses a Long Short-Term Memory (LSTM) network architecture and a Social Pooling layer that leverages spatial information of nearby pedestrians, thereby implicitly modeling interactions among them. Similarly, Sun et al. (2018) use a sequence-to-sequence LSTM encoder-decoder architecture to predict the pedestrian position and direction angle. The authors show that incorporating the angular information in addition to the temporal information leads to a significant improvement in the prediction accuracy. Vemula et al. (2018) propose an alternative Social Attention method to predict future trajectories based on capturing the relative importance of pedestrians regardless of their proximity. The authors formulate the problem as a spatio-temporal graph with nodes representing the pedestrians and edges capturing the dynamics of the interactions between two pedestrians such as orientation and distance. Concurrently, Pfeiffer et al. (2018) propose an LSTM-based data-driven model for motion prediction by incorporating the obstacle map of the environment and encoding the surrounding pedestrians in polar angular space, thereby enabling fast inference times in crowded environments. Chen et al. (2017) propose a deep reinforcement learning framework for socially aware motion planning. Unlike the aforementioned methods, this approach relies on the fact that it is easier to specify which behavior is undesirable than to define socially compliant navigation. The authors develop a symmetrical neural network architecture and demonstrate in simulation that it generalizes to densely crowded environments. Gupta et al. (2018) propose the use of a recurrent Generative Adversarial Network (GAN) to generate and predict socially acceptable paths. Their proposed SGAN approach is comprised of an LSTM-based encoder-decoder generator to predict the future trajectories, followed by an LSTM-based discriminator to predict whether each generated trajectory follows the social norms. Similarly, Sadeghian et al. (2018) present a GAN-based framework for predicting trajectories dubbed SoPhie. By utilizing an RGB image from the scene and the trajectory information of the pedestrians, the method computes both physical and social context vectors by focusing on only the relevant information for each observed pedestrian. The computed vectors are then utilized by an LSTM-based GAN module to generate physically and socially acceptable trajectories. More recently, approaches that leverage the scene structure for predicting the future trajectories have been proposed (Manh and Alaghband, 2018; Varshneya and Srinivasaraghavan, 2017; Xue et al., 2018). While utilizing structural knowledge of the scene enables such methods to achieve highly accurate motion estimates, it simultaneously limits their transferability to environments that are known in advance, due to either the need for a preprocessing mapping stage or training data from the environment. In the context of autonomous street crossing prediction, the ability to generalize to unseen intersections and new environments is a crucial requirement for our system. We therefore do not compare our proposed approach to such methods that depend on the scene structure.
Over the years, several methods have been proposed for trajectory estimation of vehicles (Lefèvre et al., 2014; Park et al., 2018). Lefèvre et al. (2011)
propose a Bayesian network to infer the driver’s intention by utilizing the digital map of the road network.
Kim et al. (2017) propose a trajectory prediction method that employs a recurrent approach to predict the future coordinates of all surrounding vehicles using an occupancy grid map representation with probability values to reflect the uncertainty of the predictions. Similarly,
Baumann et al. (2018) propose an encoder-decoder architecture to predict the ego-motion of the vehicle using previous path information. In order to minimize the potential collision risk, Park et al. (2018) propose an encoder-decoder LSTM architecture accompanied by beam search to produce the most likely trajectories.
Despite the varying application areas of the motion prediction task, there is a growing consensus that recurrent units in combination with trajectory information of the most relevant pedestrians/vehicles can provide accurate predictions. While this is true, it comes at the cost of the representational and run-time capabilities of these methods. As the majority of the aforementioned approaches model each pedestrian/vehicle separately by considering only their local neighborhood, suboptimal behavior often occurs in complex, densely populated environments. In this work, we propose a novel scalable neural network architecture to address the problem of learning trajectories in populated environments. Instead of the widely employed recurrent units such as LSTMs, our proposed network utilizes causal convolutions to model the sequential behavior of the various agents in the scene. Furthermore, by jointly learning the trajectories of all agents in the scene, our network is able to better leverage the interdependencies in their motion without the need for explicitly defining the relative importance of each agent. Finally, our approach is not restricted to modeling the behavior of either pedestrians or vehicles, but is rather able to learn and infer the complex interactions among the various types of agents in the scene.
Traffic Light Recognition is one of the vital tasks for autonomous agents operating in urban environments whether pedestrian assistant robots or autonomous vehicles. Although traffic lights are designed to be relatively easily perceived by humans, they are not always easily identified in camera images due to their small size, presence of other sources of similar lights e.g. brake lights, billboards and other traffic lights in different directions, and partial occlusions caused by different objects in the scene (Jensen et al., 2016). Furthermore, due to the highly dynamic nature of the environment, traffic light recognition approaches need to have fast inference times to enable safe deployment. In order to accurately recognize traffic lights in varying illumination conditions, John et al. (2014) employ a Convolutional Neural Network (CNN) based approach to extract features from the image. Accompanied by a GPS sensor to identify the regions of interest within the image, the approach produces a saliency map containing the traffic light location to enable recognition in low lighting conditions. Behrendt and Novak (2017) propose a system for detecting, tracking and recognizing traffic lights for autonomous vehicles. Their approach utilizes the YOLO architecture (Redmon et al., 2016) to detect the location of the traffic lights within the image. The traffic lights are then tracked using the ego-motion information and stereo imagery to triangulate their relative position. Finally the state of the light is identified using a small neural network trained on the extracted regions.
Similarly, in order to enable accurate traffic light recognition in complex scenes, Li et al. (2018) utilize prior information from the image regarding the position and size of the traffic light in order to reduce the computational complexity of locating it within the image. Additionally, they propose an aggregate channel feature method accompanied with inter-frame information analysis to facilitate accurate and consistent recognition across the different frames. With the goal of improving the run-time capabilities and reducing the computational resources, Liu et al. (2017) propose a traffic light recognition system operating in an online manner on smartphones. Using ellipsoid geometry in the HSV colorspace, their approach is able to extract region proposals which are in turn passed through a kernel function to recognize the phase and type of the traffic light.
In contrast to the aforementioned methods for traffic light recognition, we do not perform any preprocessing or utilize any structural prior from the scene, rather our proposed network is able to attend to areas in the image containing the traffic light, thereby increasing ease of deployment and robustness to new environments.
Intersection Safety Prediction: Among the early works on enabling autonomous street crossing for pedestrian assistant robots are the works of Baker and Yanco (2003, 2005) in which the authors propose a system to detect and track vehicles using cameras mounted on both sides of the robot. Using image differencing and edge extraction techniques, the method is able to identify and track vehicles in a two lane street. Subsequently, Bauer et al. (2009) proposed an autonomous city explorer robot to navigate in urban environments. In their approach, the robot is able to handle street crossings by identifying and recognizing the state of the traffic light. In order to identify the safety of intersections for autonomous vehicles, Campos et al. (2013) propose a negotiation approach by solving local optimization problems for each of the vehicles approaching the intersection. Similarly, Medina et al. (2015) propose a decentralized Cooperative Intersection Control (CIC) system to enable safe navigation of a T-intersection for a platoon of vehicles. An alternate approach to cooperative intersection crossing is proposed in Azimi et al. (2014), in which the authors propose a vehicle-to-vehicle intersection protocol guided by a GPS model, where each vehicle periodically broadcasts its pose and intent to nearby vehicles and the crossing priority is then decided by the vehicles among themselves.
Inspired by learning from demonstration approaches, Diaz et al. (2017) propose an approach to aid visually impaired users to remain within the crosswalk bounds while crossing a road. Their proposed method processes images from the scene to extract the relative destination of the user and in turn produces an audio signal as a beacon for the user to follow to reach the goal. More recently, Habibi et al. (2018); Jaipuria et al. (2018) present techniques for pedestrian intent prediction at intersections by utilizing the contextual information of the scene and Augmented Semi Non-negative Sparse Coding (ASNSC) for learning the motion primitives to enable more accurate predictions of the trajectories at street crossings. Fang and López (2018)
develop an approach for predicting the crossing intention of pedestrians. Their proposed method first detects and tracks pedestrians approaching the sidewalks, then utilizes this information to estimate the pose of the pedestrians by fitting skeletal features, which are in turn utilized by a Random Forest classifier to predict the crossing intent.
In our previous work (Radwan et al., 2017), we proposed a classification approach to predict the safety of an intersection for crossing by training a Random Forest classifier on tracked detections from both radar and LiDAR scanners, which enables fast and reliable detection of oncoming traffic. While this method has the advantage of being independent of the intersection type, it lacks the ability to generalize to new unseen scenarios as it learns a discriminative model of the problem. This in turn can lead to suboptimal behavior when learning from imperfect data or when encountering an unseen situation. To address this issue, in this work we utilize information from both the interaction-aware motion prediction and traffic light recognition approaches to predict the safety of the intersection for crossing. By leveraging the predicted trajectories of surrounding vehicles and pedestrians in addition to the state of the traffic light if present, our proposed approach is able to accurately estimate the safety of the intersection for crossing. Furthermore, as we do not rely on any prior knowledge of the environment or any form of communication with the surrounding traffic participants, our proposed approach can be easily deployed in various environments without any additional preprocessing steps.
3 Technical Approach
In this section, we detail our proposed system for predicting the safety of the intersection for crossing by jointly learning to predict the future motion of the observed traffic participants and simultaneously recognizing the traffic light state. Our framework fuses the predicted future states as well as the uncertainties of the traffic participants from the motion prediction stream with feature maps from the traffic light recognition stream in order to predict the safety of the intersection for crossing. Note that the proposed networks for interaction-aware motion prediction and traffic light recognition can be deployed separately for their respective tasks. Furthermore, it is worth noting that in this work our goal is to investigate the impact of utilizing the information from the traffic light signal in combination with the predicted trajectories of surrounding dynamic agents to learn a safe street crossing classifier. We propose a traffic light recognition architecture for the sake of completeness and to demonstrate its effect on the overall system; however, it can be easily replaced with other traffic light detection/recognition modules.
Our proposed architecture depicted in Figure 1
consists of two convolutional neural network streams: an interaction-aware motion prediction stream, IA-TCNN, and a traffic light recognition stream, AtteNet. The learned representations from both streams are concatenated channel-wise and passed to the road crossing module, which in turn produces a probability distribution over the crossing decision. Given an RGB image at the current timestep and the 2D trajectory information of each dynamic object over a window of time with respect to the position of the robot, the output of our model is the traffic light state, the predicted trajectory of each object over the prediction interval and the crossing decision. In the following sections, we first detail each of the networks, followed by the fusion procedure for predicting the safety of the intersection for crossing.
3.1 Interaction-aware Motion Prediction
Given the observed trajectory information of the dynamic agents in the vicinity of the robot, obtained either from a tracker or an object detection module, our motion prediction network predicts the future trajectory of each agent over a prediction interval.
In order to predict interaction-aware trajectories for each dynamic agent without explicitly specifying the importance of each agent for the behavior of the surrounding agents, we create a feature vector for each observation interval of size $N \times T_{obs} \times F$, where $N$ is the number of dynamic agents observed within the interval, $T_{obs}$ is the observation interval and $F$ is the set of features obtained from the tracker/detection module for each object. We order the dynamic agents in this feature vector with respect to their detection time and apply padding for objects that become visible later within the interval. Under this representation, the input feature vector for our network has the following format:
$$\mathbf{X} = \left[ X_1, X_2, \dots, X_N \right] \qquad (1)$$
We represent each trajectory point of agent $i$ at timestep $t$ by the spatial coordinates $(x_t^i, y_t^i)$, the velocity $v_t^i$ and the yaw angle in normalized quaternion representation $q_t^i$. The set of features representing each agent is encoded relative to the robot, thereby facilitating the transferability of the method to different environments. Furthermore, we disregard the size information for each agent; thus each observed dynamic agent is represented as a point with the aforementioned features. While employing this representation discards the size information for each agent, this information is not essential for predicting the future trajectories, and we further assume that it can be easily recovered from the object detection module from which the features are obtained. The trajectory of agent $i$ during the observation interval is represented as:
$$X_i = \left[ \left( x_t^i, y_t^i, v_t^i, q_t^i \right) \right]_{t=1}^{T_{obs}} \qquad (2)$$
Our network predicts a feature vector with the same order as Equation (1) over the prediction interval $T_{pred}$.
In order to represent this problem as a sequence-to-sequence modeling task, the predicted output at timestep $t$ can only depend on inputs from timesteps $1, \dots, t$; in other words, predictions cannot depend on future states of traffic participants. Moreover, we predict the future trajectories for an interval greater than or equal to the observation interval ($T_{pred} \geq T_{obs}$), as estimating the trajectories for a shorter interval is comparatively trivial. In this work, we strive to accurately predict the future states of dynamic agents for an interval longer than the observation interval.
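To make this input representation concrete, the following minimal sketch (in Python/NumPy, with hypothetical helper and variable names that are not part of our released code) illustrates how the padded per-interval feature tensor and the corresponding activation mask could be assembled, assuming seven features per agent per timestep (2D position, velocity and a quaternion):

```python
import numpy as np

def build_input_batch(trajectories, t_obs, num_features, max_agents):
    """trajectories: list of (start_frame, feature_rows) per observed agent,
    ordered by detection time as described in the text."""
    feats = np.zeros((max_agents, t_obs, num_features), dtype=np.float32)
    mask = np.zeros((max_agents, t_obs), dtype=np.float32)
    for i, (start, rows) in enumerate(trajectories[:max_agents]):
        rows = np.asarray(rows, dtype=np.float32)[: t_obs - start]
        feats[i, start:start + len(rows)] = rows   # agents detected later are padded
        mask[i, start:start + len(rows)] = 1.0     # mark the valid entries
    return feats, mask

# Example: two agents, the second one enters the scene at frame 3.
agents = [(0, np.random.randn(8, 7)), (3, np.random.randn(5, 7))]
x, m = build_input_batch(agents, t_obs=8, num_features=7, max_agents=40)
```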
We propose the Interaction-aware Temporal Convolutional Neural Network (IA-TCNN) architecture depicted in Figure 2(a), which fulfills the above criteria. Our network consists of three causal blocks, where each block contains zero-padding followed by dilated causal convolutions, cropping and an activation function. In each block, we employ zero-padding and cropping layers to fulfill the requirement of predicting a trajectory with length greater than or equal to that of the observed trajectory. We utilize causal convolutions where the output at each timestep is convolved only with elements from earlier timesteps, thereby preventing information leakage across the different layers.
Although the amount of previous information utilized by causal convolutions grows linearly with the network depth, increasing the depth or using extremely large filter sizes increases the inference time as well as the training complexity. We overcome this problem by employing dilated causal convolutions to increase the size of the receptive field without increasing the depth of the network. We use a constant kernel size for each of the convolutional layers with layer-specific filter sizes, and increase the dilation rate for each following convolution. Similar to current deep learning approaches for motion prediction, we model the predicted spatial coordinates of each pedestrian using a bivariate Gaussian distribution in order to obtain a measure of confidence over the output of the network (see Alahi et al. (2016); Sun et al. (2018); Vemula et al. (2018)). The output of the last block is passed to a time distributed fully connected layer to produce temporal predictions for each timestep of the prediction interval, where for each pedestrian the network predicts the mean $\hat{\mu}_t^i$, correlation coefficient $\hat{\rho}_t^i$ and quaternion components $\hat{q}_t^i$.
We propose two variants of our method, depicted in Figure 2, to further investigate the suitability of the proposed architecture for the sequence modeling task. IA-LinConv closely resembles the IA-TCNN architecture with the exception of setting the dilation rate and the number of dilated convolutions such that a single standard convolution is obtained per causal block. This variant is proposed to investigate the effect of adding a dilation factor on improving the representational learning ability of the network. In the second variant, IA-DResTCNN, we replace the middle causal block with a residual causal block and the activation function with a ReLU. By introducing residual connections in the network, we investigate whether the current depth, filter size and dilation factor affect the stability of the architecture. Employing residual connections within a temporal convolutional network was proposed in Lea et al. (2017) for action segmentation and detection, with encouraging results demonstrating that residual connections yield a larger receptive field without drastically increasing the number of parameters. We introduce this architectural variant with the goal of investigating whether such a hypothesis is valid for the task of motion prediction.
Figure 2: Network architectures of the proposed variants: (a) IA-TCNN, (b) IA-DResTCNN.
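The sketch below shows one way to realize a causal block with dilated convolutions in tf.keras; the kernel size, filter counts, dilation rates, activation function and output dimensionality are our own illustrative assumptions rather than the exact IA-TCNN configuration, and for brevity the sketch processes a single feature sequence over the observation interval.

```python
import tensorflow as tf

def causal_block(x, filters, kernel_size=3, dilation_rate=1):
    # pad only on the left so that the output at time t never sees inputs beyond t
    pad = (kernel_size - 1) * dilation_rate
    y = tf.keras.layers.ZeroPadding1D(padding=(pad, 0))(x)
    y = tf.keras.layers.Conv1D(filters, kernel_size,
                               dilation_rate=dilation_rate, padding='valid')(y)
    # no explicit cropping needed: left-only padding keeps the temporal length unchanged
    return tf.keras.layers.Activation('tanh')(y)

inputs = tf.keras.Input(shape=(None, 7))            # (time, per-agent features)
h = causal_block(inputs, 32, dilation_rate=1)
h = causal_block(h, 64, dilation_rate=2)
h = causal_block(h, 128, dilation_rate=4)           # growing receptive field
# 9 = 5 bivariate-Gaussian parameters + 4 quaternion components (our assumption)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(9))(h)
model = tf.keras.Model(inputs, outputs)
```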
We train our model by minimizing a weighted combination of the negative log-likelihood loss of the groundtruth position under the predicted Gaussian distribution parameters and the L2 loss of the orientation in normalized quaternion representation as follows:

$$\mathcal{L} := \mathcal{L}_{trans} \exp(-\hat{s}_x) + \hat{s}_x + \mathcal{L}_{rot} \exp(-\hat{s}_q) + \hat{s}_q, \qquad (3)$$

$$\mathcal{L}_{trans} := -\sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \log P\left( x_t^i, y_t^i \mid \hat{\mu}_t^i, \hat{\rho}_t^i \right), \qquad \mathcal{L}_{rot} := \sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \left\lVert q_t^i - \hat{q}_t^i \right\rVert_2,$$

where $N$ is the number of dynamic agents, and $\hat{s}_x$, $\hat{s}_q$ are learnable weighting variables for balancing the translational and rotational components of the predicted pose.
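A minimal sketch of this loss is given below, assuming the network also outputs the standard deviations needed to parameterize the bivariate Gaussians; the exp(−ŝ)+ŝ weighting is one plausible realization of the learnable balancing variables in Equation (3), and averaging rather than summing over agents and timesteps is our own simplification.

```python
import numpy as np
import tensorflow as tf

def bivariate_nll(gt_xy, mu, sigma, rho):
    """Negative log-likelihood of groundtruth positions under the predicted
    bivariate Gaussians; gt_xy, mu, sigma have shape (..., 2), rho has shape (...)."""
    dx = (gt_xy[..., 0] - mu[..., 0]) / sigma[..., 0]
    dy = (gt_xy[..., 1] - mu[..., 1]) / sigma[..., 1]
    one_minus_rho2 = 1.0 - tf.square(rho)
    z = tf.square(dx) + tf.square(dy) - 2.0 * rho * dx * dy
    log_p = -z / (2.0 * one_minus_rho2) - tf.math.log(
        2.0 * np.pi * sigma[..., 0] * sigma[..., 1] * tf.sqrt(one_minus_rho2))
    return -tf.reduce_mean(log_p)

# learnable weighting variables balancing translation and rotation (cf. Eq. 3)
s_x = tf.Variable(0.0, trainable=True)
s_q = tf.Variable(0.0, trainable=True)

def motion_prediction_loss(gt_xy, gt_q, mu, sigma, rho, pred_q):
    l_trans = bivariate_nll(gt_xy, mu, sigma, rho)
    l_rot = tf.reduce_mean(tf.norm(gt_q - pred_q, axis=-1))   # quaternion L2 loss
    return l_trans * tf.exp(-s_x) + s_x + l_rot * tf.exp(-s_q) + s_q
```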
Since in real-world data the trajectories of different dynamic agents have varying lengths due to the limited sensor range, and in order to fully leverage all the information available during training, we train our proposed IA-TCNN with dynamic sequence lengths by using binary activation masks predicted by the network to signify the end of a trajectory. More precisely, we fix the start and end of the observation interval and pad the beginning and end of the trajectories of each agent such that they are aligned within the observation interval. In order to cope with the varying number of dynamic agents observed within a single interval, we additionally pad the input features to a fixed value which corresponds to the maximum number of agents observed within an interval. During training, we use an input activation mask with the purpose of encoding the valid parts of the input. As our input feature vector has dynamic shape for the first two dimensions (the number of observed agents and the time for which each agent is observed), the input activation mask encodes the valid parts of both dimensions and “masks out” the padded sections with respect to the first two dimensions of the input. During inference, the network predicts a feature vector of fixed size and an output activation mask. Similar to the input activation mask, the output mask encodes the valid positions within the output feature vector along its first two dimensions. Training the network with dynamic feature lengths (in terms of the number of agents and the observation/prediction time) enables the network to learn a robust representation that better aligns with real-world data, for instance when a pedestrian or vehicle exits the field of view of the sensor. The predicted trajectory is first multiplied by the activation mask before computing the prediction error. Moreover, by utilizing the information from all agents during the observation interval, we eliminate the need for handcrafted definitions which attempt to explicitly model how the behavior of a dynamic agent is affected by the surrounding agents. Furthermore, it expedites the information flow throughout the various layers of the network, hence facilitating fast trajectory estimation for all the dynamic agents in the scene.
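The sketch below shows how such an output activation mask can be applied before computing the prediction error so that padded positions do not contribute to the loss; the function name and the squared-error form are illustrative only.

```python
import tensorflow as tf

def masked_prediction_error(pred, target, mask):
    """pred, target: (max_agents, t_pred, features); mask: (max_agents, t_pred)
    with 1 for valid entries and 0 for padded ones."""
    per_step = tf.reduce_sum(tf.square(pred - target), axis=-1)  # error per agent and timestep
    per_step = per_step * mask                                   # discard padded positions
    return tf.reduce_sum(per_step) / tf.maximum(tf.reduce_sum(mask), 1.0)
```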
3.2 Traffic Light Recognition
In this section, we describe the architecture of our traffic light recognition subnetwork, which, given an input RGB image, predicts the state of the traffic light. We build upon the ResNet-50 architecture (He et al., 2016b) with pre-activation residual units, which allow unimpeded information flow throughout the network, thus enabling the construction of more complex models that are easily trainable. We show the standard residual unit in Figure 3(a) and the pre-activation residual unit in Figure 3(b). Our proposed network AtteNet consists of five bottleneck residual blocks with multiple pre-activation residual units. Each residual unit, depicted in Figure 3(b), consists of batch normalization and ELU preceding the convolutional layers. By moving the batch normalization layer to the beginning of the residual unit, the input of the unit is renormalized after the addition operation of the previous layer, thereby improving the regularization of the network. Similarly, moving the ELU activations to the beginning of the unit, as opposed to placing them after the addition operation, ensures that the original information is preserved throughout the entire network. Furthermore, by replacing the traditional ReLU activation function with ELUs, our network is more robust to noise in the training data while achieving shorter convergence times. Note that unlike the IA-DResTCNN architecture for interaction-aware motion prediction, we utilize bottleneck residual units as the building block of our network due to their ability to aid in training deeper architectures without significantly increasing the number of parameters.
Figure 3: (a) Original residual unit, (b) pre-activation residual unit.
In order to improve the representational learning abilities of our network, we introduce Squeeze-Excitation (SE) blocks into our network (Hu et al., 2017). Using SE blocks enables the network to perform feature recalibration, which in turn allows it to utilize the global information in the images to selectively emphasize and suppress features depending on their usefulness for the task at hand. Instead of equally weighting all channels while creating the output feature maps, each SE block employs a content-aware mechanism which learns an adaptive weighting for each channel at minimal computational cost. An SE block, depicted in Figure 4, comprises two operations: squeeze and excite. During the squeeze operation, feature maps from the previous layer are aggregated across the spatial dimensions, thus embedding the global distribution of the features so that it can be leveraged by subsequent layers in the network. The excitation operation which follows emphasizes the informative features and suppresses the irrelevant ones, thus aiding in learning sample-specific activations for each channel. In order to further reduce the number of parameters of our model, we replace the fully connected layers in the SE blocks with convolutional layers. We add a global average pooling layer after the fifth residual block, followed by a fully connected layer which produces the prediction of the network. Our final architecture is shown in Figure 5. During training, we minimize the cross entropy loss between the labels and the predicted logits.
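The following sketch shows a Squeeze-Excitation block in which the fully connected layers are replaced by 1×1 convolutions, as described above; the reduction ratio is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    channels = x.shape[-1]
    # squeeze: aggregate each feature map into a single global descriptor
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Reshape((1, 1, channels))(s)
    # excite: learn a per-channel gating from the global descriptor (1x1 convolutions)
    s = layers.Conv2D(channels // reduction, 1, activation='relu')(s)
    s = layers.Conv2D(channels, 1, activation='sigmoid')(s)
    # recalibrate: emphasize informative channels, suppress irrelevant ones
    return layers.Multiply()([x, s])
```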
Figure 4: Structure of the Squeeze-Excitation block.
Figure 5: The proposed AtteNet architecture.
3.3 Learning To Cross The Road
In order to learn a crossing strategy that is robust to the type of intersection, we propose fusing the output predictions from the trajectory estimation subnetwork and the traffic light recognition subnetwork. Incorporating the traffic light recognition information is crucial at signalized intersections, as the robot is expected to act within the behavioral norms, obeying the intersection crossing rules such as crossing only when the light is green. At the same time, in certain situations one cannot rely solely on the traffic light information to cross, such as when an ambulance or police car is speeding towards an intersection. In such cases, despite the green pedestrian traffic light, the robot is expected to wait at the sidewalk until the intersection becomes safe for crossing. Similarly, at unsignalized intersections, the robot is expected to distinguish safe crossing intervals from unsafe ones. In these situations, utilizing the information from the trajectory estimation module is crucial to ensure safe crossing prediction. To achieve this goal, we perform element-wise concatenation of the feature maps from the traffic light recognition stream and the motion prediction stream. More specifically, the predicted Gaussian distribution parameters from IA-TCNN are first passed to a fully connected layer, the output of which is reshaped to correspond in shape to the output of layer Res-5c of AtteNet. Since we apply padding to the input and output of the IA-TCNN network to maintain a constant feature size corresponding to the maximum number of observable dynamic agents, fusing the features from both networks can be done in a rather straightforward manner. The reshaped output tensor is then concatenated with the output of layer Res-5c of AtteNet and fed to a fully connected layer. This is followed by another fully connected layer whose output units signal the intersection safety state. By utilizing the Gaussian distribution parameters to model the trajectories, we enable our model to incorporate the confidence information regarding the likelihood of the predictions, which in turn improves the robustness of our method to the prediction accuracy. We train the model by minimizing the cross entropy loss function. In Section 4.5, we evaluate the impact of incorporating information from each of the streams on the accuracy of the learned crossing decision.
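A minimal sketch of this fusion and crossing-decision head is given below; the layer sizes, the Res-5c feature map dimensions, the number of output units and the softmax activation are placeholders chosen for illustration and not the values used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def crossing_head(traj_params, res5c_features, num_classes=2):
    """traj_params: flattened Gaussian parameters of all (padded) agents;
    res5c_features: (H, W, C) feature map from the traffic light stream."""
    h, w, c = res5c_features.shape[1:]
    t = layers.Dense(h * w * c, activation='relu')(traj_params)   # project trajectories
    t = layers.Reshape((h, w, c))(t)                               # match Res-5c shape
    fused = layers.Concatenate(axis=-1)([t, res5c_features])       # fuse both streams
    fused = layers.Flatten()(fused)
    fused = layers.Dense(512, activation='relu')(fused)
    return layers.Dense(num_classes, activation='softmax')(fused)  # crossing decision

# Hypothetical sizes: 40 agents x 9 predicted values, 7x7x2048 Res-5c map.
traj_in = tf.keras.Input(shape=(40 * 9,))
res5c_in = tf.keras.Input(shape=(7, 7, 2048))
decision = crossing_head(traj_in, res5c_in)
model = tf.keras.Model([traj_in, res5c_in], decision)
```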
4 Experimental Evaluation
In order to evaluate our proposed system for predicting the safety of the intersection for crossing, we first evaluate each of the constituting subtasks followed by providing detailed results on the performance of the combined model. We evaluate our approach on multiple publicly available datasets as well as deploy it on our robotic platform shown in Figure 8 and evaluate the performance in a real-world environment in Section 4.6. We further provide comprehensive details of our evaluation protocol to facilitate comparison and benchmarking. We first present the datasets used for evaluating our approach, followed by the training schedule and extensive qualitative and quantitative analysis of the results.
4.1 Datasets & Augmentation
In the following, we discuss in detail each of the datasets used for evaluation as well as any preprocessing or augmentation procedures applied. As our architecture is comprised of multiple sub-networks, and due to the unavailability of a large public labeled dataset jointly covering motion prediction, traffic light recognition and intersection safety prediction, we split our evaluation procedure into three parts and employ public datasets for each of the aforementioned tasks.
4.1.1 Motion Prediction Datasets
In order to investigate the accuracy of our proposed architecture for the task of motion prediction, we rely on three benchmarking datasets that are commonly employed for the aforementioned problem. The selected datasets vary with respect to the environment in which they were captured (indoors versus outdoors), the capturing mode (camera versus LiDAR) and difficulty (crowded environment versus empty environment). Evaluating on such a large variety of environments enables us to gain a better understanding of the advantages and limitations of our proposed approach. Below, we provide a description for each of the datasets along with any pre-processing that was carried out prior to training.
L-CAS dataset is a recently proposed benchmark for pedestrian motion prediction (Yan et al., 2017). The data is captured using a 3D LiDAR scanner mounted on top of a Pioneer robot placed inside a university building. It consists of over 900 pedestrian tracks, each with an average length of seconds and is divided into a training and testing split. Each pedestrian is identified by a unique ID, a time frame at which they are detected, the spatial coordinates, and their orientation angle. Some of the challenges faced while benchmarking on this dataset are people pushing trolleys, children running and groups dispersing. Figure 6(c) shows an example scan from the dataset, where pedestrians are marked by bounding boxes with arrows showing their trajectories for a sample interval. We use the same training and test splits provided by the authors for this dataset to facilitate comparison with other approaches.
ETH crowd set dataset consists of two scenes: Univ and Hotel, containing a total of approximately 750 pedestrians exhibiting complex interactions (Pellegrini et al., 2009). For each scene, the dataset contains an obstacle map file providing the static map information of the surroundings, and an annotations file which provides the trajectory information for each pedestrian. Each tracked pedestrian is identified by a pedestrian ID, the frame number at which they were observed, the spatial coordinates and velocity with which they were traveling. The dataset additionally provides a groups file that provides information on pedestrians forming a group and a destinations file reporting the assumed destinations of all subjects in the scene. The dataset is one of the widely used benchmarks for pedestrian tracking and motion prediction as it represents real world crowded scenarios with multiple non linear trajectories, covering a wide range of group behavior such as crossings, dispersing and forming of groups. We show an example image from the Hotel sequence in Figure 6(a), where arrows represent the trajectories of the pedestrians for a sample interval. The sequence is recorded near a public transport stop. It captures the complex behavior of pedestrians as they enter/exit the vehicle as well as surrounding pedestrians navigating the scene. For this dataset, we utilize only the information from the annotations file, keeping track of the spatial coordinates of each pedestrian at each time frame. Furthermore, we assume no knowledge of the destination of each pedestrian, nor do we utilize any information regarding group behavior or the structure of the environment.
UCY dataset consists of three scenes: Zara01, Zara02 and Uni, with a total of approximately 780 pedestrians (Lerner et al., 2007). For each scene, the dataset provides an annotations file comprised of a series of splines, each describing the trajectory of a pedestrian using the spatial coordinates, frame number and the viewing direction of the pedestrian. This dataset, together with the ETH dataset, is widely used as a benchmark for motion prediction and pedestrian tracking due to the wide range of non-linear trajectories and pedestrian interactions exhibited, including group behavior and pedestrians idling near shop fronts. Figure 6(b) shows a sample image from the Uni sequence, where pedestrian trajectories are represented by arrows. This particular sequence is the most challenging among the three sequences forming this dataset due to the large number of pedestrians observed concurrently, in addition to the complex crowd behavior demonstrated. We combine this dataset with the ETH dataset, similar to previous works (Alahi et al., 2016; Vemula et al., 2018), and apply a leave-one-out procedure during training by randomly selecting trajectories from all scenes except the scenes used for testing. Furthermore, in order to facilitate the combination of the datasets, we predict only the 2D spatial coordinates for each pedestrian.
Figure 6: Sample scenes with trajectories of pedestrians shown as arrows: (a) ETH-Hotel, (b) UCY-Uni, (c) L-CAS, (d) Freiburg Street Crossing (FSC).
4.1.2 Traffic Light Recognition Datasets
We investigate the performance of our AtteNet architecture for the task of traffic light recognition by evaluating the model on two publicly available datasets. Although our approach targets pedestrian traffic lights, to the best of our knowledge there is a lack of publicly available pedestrian traffic light datasets. Nonetheless, the two tasks share a number of similarities which render benchmarking the performance on traffic light datasets a reliable estimate of the overall performance of the model on pedestrian traffic lights. Furthermore, our newly proposed Freiburg Street Crossing dataset contains several instances of pedestrian traffic lights, enabling us to evaluate the performance of AtteNet on real-world data. We benchmark the performance of our model by evaluating on the Nexar Traffic Lights Challenge dataset and the Bosch Small Traffic Lights dataset. These datasets were selected due to the challenging nature of the data, as images are captured in varying lighting conditions and contain multiple traffic lights as well as occlusions of the traffic light. Furthermore, the datasets contain a large number of sources of distraction such as brake lights of other cars and glass buildings which reflect the traffic light signal. Through evaluating our proposed AtteNet on such challenging datasets, we aim to gain an understanding of the capabilities of the network for the task at hand, as well as its generalizability to various environments and lighting conditions. In the following, we describe each of the datasets along with the data augmentation procedures that were carried out.
Nexar Traffic Lights dataset consists of over 18000 RGB images captured in varying weather and lighting conditions. The dataset was released as part of a challenge to recognize the traffic light state in images taken by drivers using the Nexar app (Nexar, 2016). Each image is labeled with the state of the traffic light in the driving direction. Several factors make benchmarking on this dataset extremely challenging, such as the varying lighting conditions, the presence of substantial motion blur and the presence of multiple traffic lights in the image. The top row in Figure 7 shows sample images from the dataset. In addition to the aforementioned challenges, the evaluation criteria for this dataset were selected to be the classification accuracy and model size, with a minimum accuracy required for submission acceptance. In order to train our method, we split the data into a training and a validation set and perform augmentations on the training split in the form of random brightness and contrast modulations.
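A short sketch of the brightness and contrast augmentation applied to the training split is shown below; the modulation ranges are our own choice.

```python
import tensorflow as tf

def augment(image):
    """image: float tensor in [0, 1]; randomly modulate brightness and contrast."""
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)
```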
Bosch Small Traffic Lights dataset contains RGB images captured in the San Francisco Bay Area and Palo Alto, California (Behrendt and Novak, 2017). The training set consists of over 5000 images which are annotated at a 2 second interval, while the test set consists of over 8000 images annotated at a frame rate of 15 fps. Each image contains multiple labeled traffic lights, amounting to a total of over 10000 annotated traffic lights in the training set and 13000 in the test set, with a median traffic light width of 8.6 pixels. For each image, the label file includes the bounding box coordinates of each traffic light, the status of the light, and whether the light is occluded by some object. This dataset is among the most challenging benchmarks for detecting and recognizing traffic lights due to the small size of the lights in the image as well as the varying lighting conditions, presence of shadows and occlusions. We show example images from this dataset in Figure 7(c, d). We use the same training and test split provided by the authors and apply augmentations on the training set in the form of random brightness and contrast modulations. As our approach only predicts the status of the traffic light and not its position, we preprocess each image by masking out all but one traffic light using the bounding box information from the label file. To learn to identify when no traffic light is present in the image, we additionally mask out all the traffic lights, thereby creating from each image $n+1$ images, where $n$ is the number of non-occluded traffic lights.
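The sketch below illustrates this masking-based preprocessing; the bounding box format and the zero-fill used for masking are assumptions made for illustration.

```python
import numpy as np

def mask_all_but_one(image, boxes, keep_index=None):
    """boxes: list of (x_min, y_min, x_max, y_max) for non-occluded lights."""
    out = image.copy()
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        if i == keep_index:
            continue
        out[int(y0):int(y1), int(x0):int(x1)] = 0   # mask out this traffic light
    return out

def expand_image(image, boxes):
    # n images with a single visible light + 1 image with no light: n + 1 in total
    samples = [mask_all_but_one(image, boxes, keep_index=i) for i in range(len(boxes))]
    samples.append(mask_all_but_one(image, boxes, keep_index=None))
    return samples
```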
Figure 7: Example images from the traffic light recognition datasets: (a, b) Nexar, (c, d) Bosch, (e, f) FSC.
4.1.3 Intersection Safety Prediction Dataset
In order to evaluate the overall performance of our model on the joint tasks of motion prediction, traffic light recognition and intersection safety prediction, we extend our previously proposed dataset (Radwan et al., 2017) with additional sequences and labels for each of the aforementioned tasks. The Freiburg Street Crossing (FSC) dataset consists of tracked detections of cars, cyclists and pedestrians captured at different intersections in Freiburg, Germany, using a 3D LiDAR scanner and Delphi Electronically Scanning Radars (ESRs) mounted on our robotic platform shown in Figure 8 (Radwan et al., 2017). Note that both the data capturing procedure and all experiments on this dataset were conducted using this robotic platform. Furthermore, we follow the same data collection and labeling procedure as in (Radwan et al., 2017). The data was captured over the course of two weeks and is divided into 10 different sequences containing over 2000 tracked objects¹. Each object is identified by a unique track ID, spatial coordinates, velocity and orientation angle. We obtain this information through the radar and LiDAR trackers (Kümmerle et al., 2015). Additionally, we augment the dataset with hand-labeled annotation information in the form of intervals indicating the safety of the intersection for crossing. In this work, we further augment the dataset with 8 extra sequences captured at different intersections. Figure 9 shows birds-eye-view images of some of the intersections captured in this dataset. For each sequence we provide the detected tracks from the radar and LiDAR trackers (Figure 6(d)), along with the RGB camera images (Figure 7(e,f)) and the intersection safety for crossing. Furthermore, we provide annotations of the camera images regarding the state of the traffic light.
¹ The extended dataset is publicly available at: http://aisdatasets.cs.uni-freiburg.de/streetcrossing/
Figure 8: Our robotic platform used for data collection and the real-world experiments.
Several factors make benchmarking on this dataset extremely challenging for the motion prediction and traffic light recognition tasks, including the large number of traffic participants, the varying motion dependencies among different dynamic objects (Figure 6(d)), motion blur in the images, the presence of reflecting surfaces and the varying lighting conditions shown in Figure 7(e,f). The dataset also covers a wide range of road topologies and intersection types, as depicted in Figure 9, which further adds to the difficulty of benchmarking on it. To the best of our knowledge, this is the first dataset with multitask labels for motion prediction, traffic light recognition and intersection safety prediction. During training, we apply a leave-one-out procedure for the motion prediction task by randomly selecting trajectories from all sequences except the testing sequence. For the traffic light recognition task, we divide the data into training and validation splits and apply random brightness and contrast modulations as an augmentation procedure for the training images.
Figure 9: Birds-eye-view images of intersections at which the Freiburg Street Crossing dataset was captured.
4.2 Training Schedule
In the following, we describe the training procedure used for each of the motion prediction, traffic light recognition and intersection safety prediction tasks. In order to train our IA-TCNN model such that it is robust to the varying number of pedestrians observable in each interval, we introduce a variable to represent the maximum number of distinct trajectories observed within an interval and initially set it to the maximum observed across all the datasets. During training and testing, we use an activation mask to encode the positions of valid trajectories and discard all remaining information. We train our model for epochs with a mini-batch size of , employing the Adam solver (Kingma and Ba, 2014) for optimization with a learning rate of , and apply gradient clipping. Details regarding the sequence length used for training and the observation and prediction lengths used for testing are covered in Section 4.3.2.
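To make the role of the activation mask concrete, the sketch below computes a displacement loss in which padded trajectory slots are masked out; the tensor shapes and the function name are assumptions for illustration rather than the exact implementation.

```python
import tensorflow as tf

def masked_displacement_loss(pred, target, mask):
    """Mean squared displacement error over valid trajectory entries only.

    pred, target: [batch, max_agents, time, 2] predicted / ground-truth positions.
    mask:         [batch, max_agents, time] binary activation mask, 1 for valid
                  (observed) entries and 0 for padded agent slots.
    """
    sq_err = tf.reduce_sum(tf.square(pred - target), axis=-1)   # [B, A, T]
    masked = sq_err * mask
    # Normalize by the number of valid entries to stay robust to scene density.
    return tf.reduce_sum(masked) / tf.maximum(tf.reduce_sum(mask), 1.0)
```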
We train our AtteNet model for traffic light recognition on random crops and test on the center crop, which we found adds more regularization to the network and helps learn a more generalized model. We use the SGD solver with momentum to optimize AtteNet, using an initial learning rate of and a polynomial weight decay. We train this model for epochs using a mini-batch size of and dropout. Note that both IA-TCNN and AtteNet are trained from scratch on their respective tasks for each of the aforementioned datasets.

In order to learn the final model for predicting the intersection safety for crossing, we initially bootstrap the training of both IA-TCNN and AtteNet using transfer learning from the respective optimization procedures. We combine each of the task-specific loss functions using learnable weighting parameters and use a single optimizer to train all subnetworks concurrently. Training all tasks jointly aims at finding optimal weights that satisfy the constraints of each task as well as their interdependencies. Moreover, employing learnable weighting parameters ensures proper balancing between the distinct tasks. We employ the Adam optimizer with an initial learning rate of . The final model is trained for epochs with a mini-batch size of . All experiments are conducted using the TensorFlow library (Abadi et al., 2015) on a single Nvidia Titan X GPU.

4.3 Evaluation of Motion Prediction
4.3.1 Comparison with the State-of-the-art
We benchmark the performance of our approach against several state-of-the-art methods for motion prediction including Social-LSTM (Alahi et al., 2016), Social-Attention (Vemula et al., 2018), Pose-LSTM (Sun et al., 2018), SGAN (Gupta et al., 2018) and SoPhie (Sadeghian et al., 2018). Furthermore, we compare against the Social Forces model (Helbing and Molnar, 1995) and a basic LSTM implementation as baselines. Note that for each of the methods, we report the numbers directly from the corresponding manuscripts, with the exception of the Social Forces model where we report the numbers from Alahi et al. (2016) as the original manuscript does not report evaluations using the metrics employed by the aforementioned methods. Furthermore, we use our own implementation for the LSTM baseline. We evaluate the accuracy of our motion prediction model by reporting the following metrics:
-
Average Displacement Error: mean squared error over all predicted and groundtruth points in the trajectory.
-
Final Displacement Error: distance between the predicted and groundtruth poses at the end of the prediction interval. A minimal sketch computing both metrics is shown after this list.
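The sketch below computes both metrics for a single trajectory using the per-step Euclidean distance; the squared-error variant of the average displacement error simply replaces the distance with its square.

```python
import numpy as np

def displacement_errors(pred, gt):
    """Compute the average and final displacement errors for one trajectory.

    pred, gt: arrays of shape [T, 2] holding predicted and ground-truth
    positions over the prediction interval.
    """
    dists = np.linalg.norm(pred - gt, axis=1)   # per-step Euclidean distance
    ade = dists.mean()                          # average displacement error
    fde = dists[-1]                             # final displacement error
    return ade, fde
```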
Dataset | Social-LSTM | Pose-LSTM | IA-LinConv | IA-DResTCNN | IA-TCNN (Ours) |
---|---|---|---|---|---|
L-CAS |
Dataset | Social Forces | Basic LSTM | Social-LSTM | Social-Attention | SGAN | SoPhie | IA-LinConv | IA-DResTCNN | IA-TCNN (fixed length) | IA-TCNN (Ours) |
---|---|---|---|---|---|---|---|---|---|---|
ETH-Univ | ||||||||||
ETH-Hotel | ||||||||||
Zara01 | ||||||||||
Zara02 | ||||||||||
UCY-Uni | ||||||||||
Average |
On the L-CAS, ETH and UCY datasets, we follow the standard evaluation procedure (Sun et al., 2018; Alahi et al., 2016) of training using a sequence length of 20 frames and using observation and prediction lengths of 8 frames and 12 frames respectively during testing. Note that the observation length is implemented as a sliding window, so the robot does not have to wait for a fixed amount of time before it can make a prediction. Only during system initialization does the window buffer need to be accumulated; subsequently, a prediction can be made as soon as new location information of the surrounding agents becomes available. We opted to use the same sequence length as existing methods in order to facilitate a fair comparison and avoid any performance bias induced by increasing or decreasing the observation interval, and we follow this setting for all the datasets. Additionally, we present results with varying observation and prediction lengths in Table 7. Table 1 shows the average displacement accuracy of our approach on the L-CAS dataset. As demonstrated by the results, both our proposed variants, IA-LinConv and IA-DResTCNN, are able to outperform the standard recurrent-based approaches in both the translational and rotational components. This in turn corroborates the advantage of utilizing a causal convolutional architecture over standard recurrent methods. Moreover, by utilizing our proposed IA-TCNN, we further improve upon these results in both translation and rotation. The improvement over IA-LinConv is attributed to employing dilated convolutions, which increase the size of the receptive field, thereby increasing the content captured in each layer. However, we observe that adding a residual block to our network to improve feature discriminability, as in IA-DResTCNN, does not help in improving the prediction accuracy despite being helpful for other sequence modeling tasks such as language modeling (Bai et al., 2018). We speculate that this model tends to accumulate more temporal information than the remaining methods through skip connections, which has the adverse effect of adding noise and eventually degrading the quality of the predicted motion trajectories.
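The sliding-window behavior described above can be realized with a simple ring buffer over the most recent frames of agent observations; the class and method names below are illustrative assumptions.

```python
from collections import deque

class ObservationBuffer:
    """Ring buffer holding the last `obs_len` frames of agent positions."""

    def __init__(self, obs_len=8):
        self.frames = deque(maxlen=obs_len)

    def add_frame(self, agent_positions):
        """agent_positions: dict mapping track id -> (x, y) for one frame."""
        self.frames.append(agent_positions)

    def ready(self):
        # A prediction can be made as soon as the buffer has been filled once;
        # afterwards every new frame immediately yields an updated window.
        return len(self.frames) == self.frames.maxlen

    def window(self):
        return list(self.frames)
```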
In Table 2, we present the average displacement accuracy of our proposed methods in comparison to state-of-the-art approaches on different sequences from the ETH and UCY crowd set datasets. Due to the complexity of the pedestrian interactions demonstrated in these datasets, employing the IA-LinConv model does not yield a significant improvement over recurrent-based approaches because of the small receptive field at each layer. By employing dilated convolutions in our IA-TCNN architecture, the network is able to better capture the interactions across the various pedestrians, thereby improving upon the previous state-of-the-art. Similar to IA-LinConv, the accuracy of IA-DResTCNN is comparable to that of recurrent-based approaches; however, its convergence time is longer than that of IA-LinConv and IA-TCNN. We additionally compare the performance of our IA-TCNN architecture with a fixed trajectory length variant (IA-TCNN fixed length). The fixed trajectory length variant was trained using only trajectories of pedestrians that have a minimum length equal to the sequence length (20 frames). We compare this variant with our dynamic length trained architecture to investigate whether using binary activation masks has an impact on the performance of our method. The results show that, as hypothesized, using binary activation masks during training enables our network to learn that not all pedestrians need to be observed for the same amount of time, which leads to an overall improvement in the accuracy of the trained model.
We report the final displacement accuracy of our proposed IA-TCNN on the various sequences of the ETH and UCY datasets in Table 3. Despite the small number of sequences available for each dataset and the complexity of the pedestrian interactions demonstrated, our method is able to achieve the lowest final displacement error on all sequences of the ETH and UCY datasets. It is worth noting that while other approaches incorporate information about nearby pedestrians or the surrounding environment to predict the trajectories, our proposed method is able to accurately infer the trajectories, surpassing the performance of state-of-the-art methods, without leveraging information about the structure of the scene or performing any pre/postprocessing on the trajectory data.
Dataset | Social Forces | Basic LSTM | Social-LSTM | Social-Attention | SGAN | SoPhie | IA-LinConv | IA-DResTCNN | IA-TCNN (Ours) |
---|---|---|---|---|---|---|---|---|---|
ETH-Univ | |||||||||
ETH-Hotel | |||||||||
Zara01 | |||||||||
Zara02 | |||||||||
UCY-Uni | |||||||||
Average |
We benchmark on the Freiburg Street Crossing (FSC) dataset largely due to the variety of motion trajectories and complex interactions. Furthermore, unlike the ETH, UCY and L-CAS datasets, the FSC dataset includes trajectories and interactions among various types of dynamic objects such as cyclists, vehicles and pedestrians, which both increases the difficulty of the prediction task and renders the data more representative of real-world scenarios. On the FSC dataset, we train using a sequence length of and use observation and prediction lengths of . As the radar sensor has a larger field-of-view than the LiDAR, and in order to observe the traffic participants in both sensors, we experimentally identified that a time window of is appropriate for correlating objects in both sensors on this dataset. We report the average displacement accuracy of our proposed method in Table 4. The results show an improvement from employing IA-LinConv and IA-DResTCNN over the LSTM baseline, specifically in terms of rotation and velocity estimation. We attribute this to the increased complexity of the interactions demonstrated in this dataset, in addition to the presence of multiple types of dynamic objects which exhibit different interaction and motion behavior. Furthermore, we observe that by employing the IA-DResTCNN architecture, the rotational accuracy of the pose is further improved in comparison to the IA-LinConv architecture. We attribute this improvement partially to the larger receptive field at each layer due to the dilation factor employed. The best performance is achieved by leveraging the proposed IA-TCNN architecture, which is able to balance the motion specific pose components for each dynamic object independent of their type, yielding the lowest average displacement error.
Dataset | Basic LSTM | IA-LinConv | IA-DResTCNN | IA-TCNN (Ours) |
---|---|---|---|---|
Seq.-1 | ||||
Seq.-2 | ||||
Seq.-3 | ||||
Seq.-4 | ||||
Seq.-5 | ||||
Seq.-6 | ||||
Seq.-7 | ||||
Seq.-8 | ||||
Seq.-9 | ||||
Seq.-10 | ||||
Seq.-11 | ||||
Seq.-12 | ||||
Seq.-13 | ||||
Seq.-14 | ||||
Seq.-15 | ||||
Seq.-16 | ||||
Seq.-17 | ||||
Seq.-18 | ||||
Average |
4.3.2 Ablation Study & Qualitative Evaluation
In this section, we perform detailed studies on the influence of the various components of our proposed architecture. Note that for all experiments, we use the same data-specific frame rate as in Section 4.3.1 to facilitate comparison. Table 7 shows the effect of varying the observation and prediction lengths on the average displacement accuracy of IA-TCNN on the Uni sequence of the UCY crowd set dataset. For short observation lengths, the error in the predicted trajectory increases approximately linearly with the prediction length, with the lowest error achieved using a prediction interval smaller than or equal to the observation interval. This reflects the increased difficulty of making accurate predictions given short trajectory information, as future interactions cannot be reliably predicted. At the same time, by increasing the observation length, the prediction accuracy gradually increases, with only small improvements between the longer observation intervals. This can be attributed to the reduction in the amount of significant information over time due to the short interaction times between the pedestrians and the low likelihood of abrupt changes in the behavior of one or more pedestrians.
[Figure 10: Observation sequences (top row) and ground-truth vs. predicted trajectories (bottom row) for four example scenes (a)–(d) from the Uni sequence of the UCY dataset]
In Table 5, we evaluate the effect of incorporating orientation and velocity information on the accuracy of the predicted motion trajectories on the FSC dataset. We compare the performance of the following variants:
-
LSTM: Basic LSTM architecture predicting the future position of each agent.
-
MP-M1: Predicting only the position of each agent.
-
MP-M2: Predicting the position and orientation for each agent.
-
MP-M3: Predicting the position, orientation and velocity of each agent, corresponding to our proposed IA-TCNN architecture.
Model | Orientation | Velocity | Avg. Disp. Error |
---|---|---|---|
LSTM | ✗ | ✗ | |
MP-M1 | ✗ | ✗ | |
MP-M2 | ✓ | ✗ | |
MP-M3 (IA-TCNN) | ✓ | ✓ |
Employing our proposed architecture with causal convolutions results in an improvement in the average displacement error over a standard LSTM-based approach. Incorporating the orientation information of each agent further improves the results over the MP-M1 model. This improvement corroborates the utility of predicting the orientation of each agent for the overall accuracy of the predicted poses. It is a direct consequence of the fact that most real-world agents rarely turn on the spot, but rather slowly change their orientation along the path; incorporating the orientation information therefore enables the model to predict more accurate trajectories in such situations. Finally, incorporating the velocity information further improves the average displacement error over the MP-M2 model. Although the velocity can be approximated through the difference in the predicted positions over the time interval, we believe that directly predicting the velocity provides a more accurate information source, particularly in highly dynamic scenarios such as street intersections and busy roads where the agents can quickly vary their speed. In such situations, the velocity obtained from differencing positions over the time interval is often too inaccurate for reliable estimation, whereas directly predicting the velocity enables the network to benefit from this information and hence produce robust trajectory estimates in various environments.
We further investigate the effect of the various architectural choices leading to the development of our IA-TCNN architecture for predicting interaction-aware motion trajectories in Table 6. More precisely, we investigate the average displacement accuracies of the following models on the FSC dataset:
-
MP-M31: Basic LSTM-based architecture for motion prediction with fixed equal weights assigned to the translation and orientation components of the pose along with the speed.
-
MP-M32: Causal convolutional architecture with fixed equal weights for the pose and speed components.
-
MP-M3: Causal convolutional architecture with adaptive learnable weights for the various pose and speed components. This architecture corresponds to the proposed IA-TCNN model.
The results in Table 6 show that utilizing causal convolutions for temporal feature aggregation results in an improvement in translation, orientation and speed over employing LSTMs. This improvement validates our hypothesis that employing causal convolutions over basic recurrent blocks such as LSTMs enables the network to better leverage the temporal information, thereby improving the quality of the predicted poses. Utilizing adaptive weighting parameters to learn the optimal weights for the translation and orientation components of the pose results in a further improvement in the average displacement accuracy in terms of translation, orientation and speed. By employing adaptive weights that vary during the optimization procedure, the most favorable set of weights is found, enabling an optimal balance between the various components and preventing a single term from dominating the loss without benefiting the remaining components.
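One common way to realize such adaptive weighting is to expose the per-component weights as trainable variables, for instance as log-variances optimized jointly with the network; the sketch below shows this formulation and is an assumption for illustration rather than the exact weighting scheme used here.

```python
import tensorflow as tf

class AdaptiveLossWeights(tf.Module):
    """Learnable weighting of translation, orientation and speed losses.

    Each component loss is scaled by exp(-s_i) and a regularizer s_i is added,
    so that the optimizer trades the components off instead of using fixed
    equal weights.
    """

    def __init__(self):
        self.log_vars = tf.Variable(tf.zeros(3), trainable=True, name="log_vars")

    def __call__(self, loss_trans, loss_rot, loss_speed):
        losses = tf.stack([loss_trans, loss_rot, loss_speed])
        return tf.reduce_sum(tf.exp(-self.log_vars) * losses + self.log_vars)
```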
Model | Temporal Block | Loss Weights | Avg. Disp. Error |
---|---|---|---|
MP-M31 | LSTM | fixed equal | |
MP-M32 | Causal Convolution | fixed equal | |
MP-M3 (IA-TCNN) | Causal Convolution | adaptive |
Obs. Length \ Pred. Length |
---|---|---|---|---|---|---|---|---|---|---|---|
In order to evaluate the performance of our proposed model in various types of interactions, we visualize four example scenes from the Uni sequence of the UCY dataset in Figure 10. The top row shows the observation sequence for each pedestrian represented by a solid line, with the current position of the pedestrian depicted by a circle. In the bottom row, we plot the groundtruth trajectory (solid line) and the predicted trajectory of the network (dashed line). Figure 10(a) presents a scenario with collision avoidance between two individuals. In this scenario, our IA-TCNN method is able to predict the temporary change in direction for both individuals to avoid collision. Our proposed model is also able to represent group behavior as shown in Figure 10(b), where it predicts a common change of direction for all members of the group. Figure 10(c) shows a more complex scene with collision avoidance and overtaking maneuvers. The pedestrians depicted in red, blue and black display an example of the overtaking maneuver, where the red colored pedestrian is walking with a slightly lower velocity. Our model predicts that the blue colored pedestrian will adjust their trajectory to the right while increasing their velocity in order to overtake the red colored pedestrian. In order to avoid a potential collision with the blue colored pedestrian, the model predicts a trajectory for the black colored pedestrian that deviates slightly to the right. As for the purple and olive colored individuals, the model incorrectly predicts a trajectory where the olive colored pedestrian attempts to overtake the purple colored pedestrian, whereas a more socially acceptable behavior, as shown by the groundtruth trajectory in this example, would be to halt and wait for the purple colored pedestrian to pass.
Figure 10(d) shows another complex scenario with one static pedestrian in the middle, and a crossing maneuver between the red and blue colored pedestrians. In this example, our model predicts a trajectory where the red colored pedestrian follows the blue one. Note that while our approach incorporates the rotational information of the various dynamic objects into the prediction, on the ETH and UCY datasets we do not utilize the information about the heading of each individual, as this information is only available for one of the datasets, which prevents it from being combined with the others. Nonetheless, we believe that in such scenarios information about the heading of each individual can significantly reduce the error in the predictions, as shown by the results in Table 1, since sudden changes in direction are uncommon in the behavior of pedestrians.
Method | Avg. Disp. Error () | Final Disp. Error () | Run-time () | Size () |
---|---|---|---|---|
Basic LSTM | ||||
Social-LSTM | ||||
SGAN | ||||
IA-TCNN (Ours) |
We further compare the run-time and model size of our approach with various recurrent-based approaches in Table 8. While the lowest average displacement error is achieved by the Social-LSTM approach (Alahi et al., 2016), both its run-time and model size render it infeasible to deploy in real-world scenarios. Similarly, while the SGAN method (Gupta et al., 2018) achieves a fast run-time, it has the lowest average and final displacement accuracies among all methods. The results show that using our proposed IA-TCNN, we improve upon the final displacement accuracy while achieving an average displacement error comparable to the best performing model, with a competitive run-time on a single Nvidia Titan X GPU. Moreover, our model requires little storage space, thereby making it efficiently deployable in resource limited systems while achieving accurate predictions in an online manner. It is worth noting that none of the aforementioned datasets contain a significant amount of clutter or static obstacles in the scene. Furthermore, as our model does not incorporate the structural information of the scene in any manner, its performance is expected to vary depending on the amount of clutter in the environment. However, as highly cluttered scenes with multiple static obstacles were rarely encountered in the investigated scenarios, we do not tackle this in our current work; our goal was rather to develop a dynamic interaction-aware motion prediction network that is capable of online performance in resource limited systems. For future work, we aim to additionally incorporate the structural information of the scene in order to achieve more reliable trajectory estimates.
4.4 Evaluation of Traffic Light Recognition
4.4.1 Comparison with the State-of-the-art
In this section, we evaluate the performance of our AtteNet on the task of traffic light recognition by benchmarking against several network architectures tailored for this task, namely SqueezeNet (Iandola et al., 2016), DenseNet (Huang et al., 2016) and ResNet (He et al., 2016a). We compare against the SqueezeNet architecture due to its relatively small size and high representational ability, which enables it to be efficiently deployed in an online manner, as demonstrated in the Nexar challenge where the first place winner used the SqueezeNet architecture. Concurrently, we benchmark against the DenseNet and ResNet architectures due to their top performance in various classification and regression tasks. We employ the ResNet-50 architecture with five residual blocks as a baseline. Similarly, we utilize the DenseNet-121 architecture with four dense blocks and a growth-rate of 16. We quantify the performance of each architecture by reporting the prediction accuracy, precision and recall rates. Furthermore, we visualize the precision-recall plots for each traffic light state on each of the compared datasets and compare the benchmarked architectures against our proposed architecture, with the goal of obtaining more insight into the performance of the individual approaches for each of the traffic light states.

Table 9 shows the classification accuracy of AtteNet on all three datasets: Nexar, Bosch and FSC. AtteNet outperforms all the baselines on each of the datasets, which in turn validates the suitability of our proposed architecture for the task of traffic light recognition. Furthermore, by employing AtteNet, we are able to outperform the state-of-the-art on the Nexar challenge dataset. Analyzing the precision and recall rates for each class on the Nexar dataset in Table 10 shows that our proposed AtteNet is capable of accurately identifying the various traffic light signals with the highest recall despite the challenging lighting conditions demonstrated in the dataset. Figure 11 depicts the precision-recall curves for the compared approaches on the Nexar dataset. The results show that AtteNet outperforms each of the compared methods on all traffic light states. This is further corroborated in Figure 15(a), which plots the 3D t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton, 2008) of the features learned by our proposed AtteNet on the Nexar dataset, in which data points belonging to the same traffic light category are distinctively clustered together. We discuss these plots further in the ablation study presented in the following section.
Table 11 shows the precision and recall rates of our proposed AtteNet in comparison to the baseline approaches on the Bosch dataset. Unlike the Nexar dataset, the Bosch traffic lights dataset contains four categories for the traffic light signal by including a label for the yellow state. This in turn increases the difficulty of the task at hand, as there exist only a few labeled examples for this class, creating an imbalance in the distribution of the distinct classes. Nonetheless, our proposed approach is able to achieve comparable precision to the baseline variants and the highest recall rate. Furthermore, we plot the precision-recall curves on the Bosch dataset in Figure 12. The results further corroborate the suitability of AtteNet for traffic light recognition, as it significantly outperforms the compared methods on the three major classes (Off, Red and Green). However, for the "Yellow" traffic light state, all methods perform sub-optimally as seen in Figure 12(d), which we attribute to the scarcity of examples containing the "Yellow" traffic light state, thereby increasing the difficulty of learning a good decision function for this class. In Table 12, we present the precision and recall rates on the FSC dataset. Our proposed AtteNet architecture outperforms the baselines in terms of precision on each of the individual classes, while achieving high recall rates. This further corroborates the suitability of our proposed method for recognizing traffic lights in various conditions, as shown in Figure 15(c), which depicts the distribution of the features learned by our model in comparison to the baseline. In Figure 13, we plot the precision-recall curves for the compared approaches on the FSC dataset. Despite the low performance of SqueezeNet on the "Off" class as shown in Figure 13(a), it significantly outperforms DenseNet on the remaining two classes. We hypothesize that this might be due to the ratio between the traffic light size and the image size, which renders it difficult for the network to accurately localize the traffic light, thereby leading to spurious recognition. Our proposed AtteNet architecture, on the other hand, significantly outperforms the compared approaches for each of the traffic light states, further corroborating its suitability for the task at hand.
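The per-class precision-recall curves can be obtained from the predicted class probabilities with a one-vs-rest evaluation; the following sketch uses scikit-learn, and the class names and ordering are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def per_class_pr_curves(y_true, y_prob, class_names=("Off", "Red", "Green")):
    """Compute one-vs-rest precision-recall curves for each traffic light state.

    y_true: integer labels of shape [N].
    y_prob: predicted class probabilities of shape [N, num_classes].
    """
    curves = {}
    for idx, name in enumerate(class_names):
        precision, recall, _ = precision_recall_curve(y_true == idx, y_prob[:, idx])
        curves[name] = (precision, recall)
    return curves
```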
Dataset | SqueezeNet | DenseNet | ResNet | AtteNet (Ours) |
---|---|---|---|---|
Nexar | ||||
Bosch | ||||
FSC |
[Figure 11: Precision-recall curves on the Nexar dataset for the (a) Off, (b) Red and (c) Green traffic light states]
[Figure 12: Precision-recall curves on the Bosch dataset for the (a) Off, (b) Red, (c) Green and (d) Yellow traffic light states]
[Figure 13: Precision-recall curves on the FSC dataset for the (a) Off, (b) Red and (c) Green traffic light states]
Model | Precision | Recall | ||||
---|---|---|---|---|---|---|
No Light | Red | Green | No Light | Red | Green | |
SqueezeNet | ||||||
DenseNet | ||||||
ResNet | ||||||
AtteNet |
Model | Precision | Recall | ||||||
---|---|---|---|---|---|---|---|---|
No Light | Red | Green | Yellow | No Light | Red | Green | Yellow | |
SqueezeNet | ||||||||
DenseNet | ||||||||
ResNet | ||||||||
AtteNet |
Model | Precision | Recall | ||||
---|---|---|---|---|---|---|
No Light | Red | Green | No Light | Red | Green | |
SqueezeNet | ||||||
DenseNet | ||||||
ResNet | ||||||
AtteNet |
4.4.2 Ablation Study & Qualitative Analysis
In this section, we investigate the various architectural decisions made while designing AtteNet and present a qualitative analysis of the obtained results on the benchmarking datasets. In order to understand the design choices made in AtteNet, we compare the improvements gained by employing each of the following variants:
-
ResNet: ResNet-50 base architecture
-
AtteNet-M1: ResNet-50 base architecture with pre-activation residual units
-
AtteNet-M2: ResNet-50 with pre-activation residual units and ELUs
-
AtteNet-M3: ResNet-50 with squeeze-excitation blocks, pre-activation residual units and ELUs
-
AtteNet-M4: ResNet-50 with convolution squeeze-excitation blocks, pre-activation residual units and ELUs (a sketch of such a block follows this list).
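To make the squeeze-excitation variant more concrete, the following is a minimal sketch of a squeeze-excitation block in which the fully connected layers are replaced by convolutional layers and ELU activations are used; the reduction ratio and layer configuration are illustrative assumptions rather than the exact AtteNet design.

```python
import tensorflow as tf

def squeeze_excitation_block(x, reduction=16):
    """Channel-wise recalibration of a feature map x of shape [B, H, W, C].

    Global average pooling ("squeeze") summarizes each channel; two 1x1
    convolutions with ELU and sigmoid activations ("excitation") produce
    per-channel weights that rescale the input feature map.
    """
    channels = x.shape[-1]
    squeezed = tf.reduce_mean(x, axis=[1, 2], keepdims=True)            # [B, 1, 1, C]
    excited = tf.keras.layers.Conv2D(channels // reduction, 1,
                                     activation="elu")(squeezed)
    excited = tf.keras.layers.Conv2D(channels, 1,
                                     activation="sigmoid")(excited)
    return x * excited                                                   # recalibrated map
```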
Table 13 reports the accuracy, precision and recall rates of the aforementioned variants on the Nexar dataset. We observe that the most notable improvement is achieved by replacing the traditional identity residual units with pre-activation residual units. This shows that utilizing the pre-activation residual units enables the network to better regularize the information flow, which in turn leads to better representational learning. Replacing the traditional ReLU activation function with ELUs yields an additional increase in the recognition accuracy, which validates the importance of applying activation functions that are robust to noisy and unbalanced data, as is the case in the current dataset. This can be primarily attributed to the ability of ELUs to reduce the bias shift in the neurons, which also yields faster convergence. By incorporating squeeze-excitation blocks and replacing the fully connected layers with convolutional layers, we are able to further improve the recognition accuracy of the model. This corroborates the significance of learning different weighting factors for the various channels of the feature maps, enabling the network to learn the interdependencies between the channels, which in turn improves its recognition capabilities as shown by the improved precision values.

Model | Accuracy | Precision | Recall |
---|---|---|---|---|---|---|---|
No Light | Red | Green | No Light | Red | Green | ||
ResNet | |||||||
AtteNet-M1 | |||||||
AtteNet-M2 | |||||||
AtteNet-M3 | |||||||
AtteNet-M4 |
Furthermore, we show the confusion matrices of AtteNet on the different datasets in Figure 14. On the Nexar dataset, our introduced architecture is able to accurately disambiguate the distinct classes as shown by the diagonal pattern of the confusion matrix. On the Bosch dataset shown in Figure 14(b), AtteNet is able to distinguish with high accuracy between three of the four classes, with the yellow traffic light often misclassified as red or green. We believe this occurs as a result of the large imbalance in the distribution of the training data, wherein the yellow traffic light occurs less often than the remaining classes. Potential remedies for this problem are to employ class balancing techniques, apply more augmentations to images belonging to this class, or add more images of the yellow class to the training set. Figure 14(c) shows the confusion matrix of AtteNet on the FSC dataset. The results indicate that our proposed AtteNet is able to accurately distinguish the various classes, further demonstrating the appropriateness of the architecture for the given task.

[Figure 14: Confusion matrices of AtteNet on the (a) Nexar, (b) Bosch and (c) FSC datasets]
In order to gain a better understanding of the representations learned by the network, we employ t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton, 2008) on the learned features of the network. By modeling pairwise similarities between data points and preserving their local neighborhood structure, t-SNE projects the data into a lower-dimensional space, thereby revealing cluster and subcluster structures. In Figure 15, we display the down-projected features obtained after applying t-SNE to the features from the penultimate layer of AtteNet and DenseNet on the various datasets. Unlike DenseNet, the features learned by AtteNet show clear cluster patterns separating the different classes, whereas for DenseNet there is no clear distinction between the features learned for the various classes, especially on the Bosch and FSC datasets shown in Figure 15(b-c). Examining the t-SNE results of AtteNet on the Bosch dataset shows three distinct clusters for the off, red and green classes, with the yellow class falling in between the red and green clusters. Nonetheless, the representations learned by AtteNet are able to better capture the distinct classes in the dataset in comparison to DenseNet, where all four classes are merged together in one cluster.
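A minimal sketch of the t-SNE projection of penultimate-layer features using scikit-learn is shown below; collecting the features from the network is assumed to have been done beforehand, and the perplexity value is an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_features(features, n_components=3, perplexity=30.0):
    """Project high-dimensional penultimate-layer features with t-SNE.

    features: array of shape [N, D] collected from the network's penultimate
    layer for N test images. Returns an [N, n_components] embedding whose
    cluster structure can be inspected per traffic light class.
    """
    tsne = TSNE(n_components=n_components, perplexity=perplexity, init="pca")
    return tsne.fit_transform(features)
```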
[Figure 15: t-SNE visualization of the features learned by AtteNet (top row) and DenseNet (bottom row) on the (a) Nexar, (b) Bosch and (c) FSC datasets]
Furthermore, we perform a qualitative analysis of the recognition accuracy of our proposed AtteNet on the Nexar dataset in Figure 16. Figures 16(d-f) show incorrect classifications by our method, where in Figure 16(d), green light reflected off a glass structure is misidentified as a green traffic light signal due to both the shape and position of the light matching the shape and potential placement of a traffic light. Similarly, in Figure 16(f), the green sign of the shop is misidentified as the traffic light, resulting in an incorrect classification. In Figure 16(e), the lack of information identifying the driving direction of the car results in a misclassification, as the network incorrectly identifies the left-most traffic light to be the relevant one. However, despite the significant motion blur and low lighting conditions, our proposed model is able to accurately predict the state of the traffic light as shown in Figure 16(a-c).
[Figure 16: Qualitative traffic light recognition results of AtteNet on the Nexar dataset, showing the input image, ground-truth label and predicted state; (a)–(c) correct predictions, (d)–(f) misclassifications]
[Figure 17: Qualitative traffic light recognition results of AtteNet on the Bosch dataset, showing the input image, ground-truth label and predicted state for panels (a)–(h)]
[Figure 18: Qualitative traffic light recognition results of AtteNet on the FSC dataset, showing the input image, ground-truth label and predicted state for panels (a)–(f)]
In Figure 17, we show qualitative results on the Bosch traffic light dataset. Figures 17(f,g) show misclassification examples where AtteNet incorrectly predicts no traffic light in the driving direction. In both cases this occurs due to the small size of the traffic light and the presence of partial occlusions, such as a pole hiding part of the traffic light. In Figure 17(h), a yellow traffic light signal is incorrectly classified as red. We attribute this misprediction to the close similarity of the red and yellow colors in this particular image, which can be verified by comparing the color of the brake lights of the cars to that of the traffic light signal. Figure 17(b) shows a correct classification of a green traffic light signal, where our proposed AtteNet is able to accurately recognize the traffic light signal despite its small size and the presence of partial occlusions. Similarly, in Figure 17(c, d), our network is able to accurately recognize the red traffic light despite the presence of several surrounding traffic lights, the small size of the light and the presence of blur.
[Figure 19: Grad-CAM activation masks of AtteNet on the FSC dataset; each panel (a)–(f) shows the input image, the activation mask, the ground-truth label and the network prediction]
We present a qualitative evaluation of the performance of AtteNet on the FSC dataset in Figure 18. In Figure 18(d), the red and white pattern on the traffic light pole causes our network to incorrectly predict the presence of a red traffic light signal despite the absence of a traffic light within the image. Figure 18(a-c) show challenging scenarios in which AtteNet is able to accurately recognize the state of the traffic light despite the lighting conditions, the presence of multiple light sources and the small size of the traffic light. Despite the accuracy achieved by the network, failing to recognize a traffic light, as shown in Figure 18(e, f), can lead to unintended consequences. While recognizing the traffic light in both images is quite challenging even for humans due to its small size, this underlines that one cannot rely solely on the traffic light recognition system to decide the safety of the intersection. Our proposed approach for autonomous street crossing prediction instead combines the information from both the traffic light recognition and motion prediction modules to accurately predict the intersection safety for crossing, as discussed in the following section.
In order to provide further insight into the AtteNet architecture, in Figure 19 we utilize the Grad-CAM method of Selvaraju et al. (2016) to visualize the activation masks of AtteNet on the FSC dataset. Visualizing the output of the penultimate layer of our network using Grad-CAM produces a gradient-weighted class activation mask highlighting the regions of the image most relevant for predicting the output, thus providing a better understanding of the network predictions. For each image we show the activation mask, the ground-truth label and the network prediction. In Figure 19(a, c, e), the attention of the network is placed on areas of the image that contain the traffic light, hence leading to correct predictions. Figure 19(b) shows an example image where a car crossing the intersection is occluding the traffic light. The activation mask shows that the attention of the network is incorrectly placed on the brake lights of the car, which in turn leads to the incorrect prediction of a red traffic light signal. In Figure 19(d, f), the small size of the traffic light in the image increases the difficulty of locating and recognizing it, as can be seen from the activation masks.
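The Grad-CAM masks can be reproduced by weighting the feature maps of the last convolutional layer with the gradients of the predicted class score; the sketch below assumes a Keras model and a known layer name, and is a generic sketch rather than the exact implementation used here.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Gradient-weighted class activation mask for one input image.

    model: a tf.keras.Model; conv_layer_name: name of the last convolutional
    layer (an assumption here); image: tensor of shape [1, H, W, 3].
    """
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, predictions = grad_model(image)
        class_score = predictions[:, class_index]
    grads = tape.gradient(class_score, conv_maps)              # [1, h, w, c]
    weights = tf.reduce_mean(grads, axis=[1, 2])               # channel importance
    cam = tf.reduce_sum(conv_maps * weights[:, None, None, :], axis=-1)
    return tf.nn.relu(cam)[0]                                  # [h, w] activation mask
```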
4.5 Evaluation of the Crossing Decision
We evaluate the performance of our proposed Autonomous Road Crossing Predictor (ARCP) by reporting the accuracy, precision and recall rates for the “Safe” class prediction on the FSC dataset. We compare the performance of our approach with the Random Forest classifier of Radwan et al. (2017). Furthermore, in order to evaluate the tolerance of the learned predictor to mispredictions and noise from the information source, we also report the individual performance of our proposed ARCP by utilizing data from either the traffic light recognition module ARCP(TLR) or the motion prediction module ARCP(MP) or their combination ARCP(TLR+MP). With the goal of investigating the impact of employing our motion prediction module on the overall crossing accuracy, we additionally compare the performance of our proposed framework with a constant velocity motion prediction model coupled with our proposed traffic light recognition module CV+TLR. Moreover, we compare the performance of using our proposed ARCP in comparison to a simple binary classifier trained on a concatenation of the predictions from the TLR and MP modules, which we refer to as NCP(TLR+MP). The goal of this experiment is to understand whether utilizing the proposed fusion strategy and incorporating the uncertainties in the predictions has an impact on the overall crossing strategy. We use the original 10 sequences from the FSC dataset for training, and test on the newly captured 8 sequences. Table 14 demonstrates the precision, recall and accuracy for each of the aforementioned methods. We compute the aforementioned metrics with respect to the “Safe” class. More precisely, a prediction is considered a true positive if both the groundtruth and the detection label are for the class “Safe”. We observe a drop in the accuracy of the Random Forest classifier in comparison to the reported results in Radwan et al. (2017), which we attribute to the inability of the classifier to generalize to unseen behavior as in the newly captured sequences. This occurs as the Random Forest classifier learns a discriminative model of the problem which leads to suboptimal behavior in new scenarios.
Method | Precision | Recall | Accuracy |
---|---|---|---|
Random Forest | |||
ARCP(TLR) | |||
ARCP(MP) | |||
CV+TLR | |||
NCP(TLR+MP) | |||
ARCP(TLR+MP) |
Furthermore, despite the high accuracy of the proposed AtteNet for traffic light recognition, we observe that utilizing only information from this module, as in ARCP(TLR), results in only a minor improvement in accuracy over random guessing. We attribute this to the difficulty of accurately predicting the intersection safety for crossing in the absence of a traffic light, or in cases where the classifier fails to detect the presence of one. This is further demonstrated in Figure 21(b), where the confusion matrix does not show a strong distinction between the various classes.
[Figure 20: Qualitative crossing decision results on five sequences (Seq-1 to Seq-5) of the FSC dataset; each sequence is shown at the beginning, middle and end of the prediction interval with the sensor detections overlaid on birds-eye-view images of the intersection]
[Figure 21: Confusion matrices for the crossing decision of (a) Random Forest, (b) ARCP(TLR), (c) ARCP(MP), (d) CV+TLR, (e) NCP(TLR+MP) and (f) ARCP(TLR+MP)]
On the other hand, by employing information from only the motion prediction module, the overall accuracy of the crossing decision, as well as the precision and recall, are improved. Unlike ARCP(TLR), the confusion matrix shown in Figure 21(c) shows that the learned classifier is able to better differentiate between safe and unsafe crossing intervals. However, comparing the top row of the confusion matrix of ARCP(MP) with that of the Random Forest baseline shows that the ARCP(MP) classifier is more likely to label safe intervals as unsafe, which could potentially lead to deadlock-type situations where the robot is stuck at the side of the road, unable to cross.
Inspecting the results of employing a constant velocity motion model coupled with the traffic light recognition module in Figure 21(d) shows that, similar to the ARCP(MP) model, this model improves the precision, recall and accuracy over solely relying on the traffic light classifier. However, unlike the ARCP(MP) classifier, we see more confusion between the safe and unsafe predictions. We hypothesize that this is due to the constant velocity assumption, which can result in suboptimal performance in cases where a car speeds up or slows down at the end of an interval. The performance of this model is, nonetheless, superior to that of NCP(TLR+MP) as shown in Figure 21(e). Simple fusion of the predictions of both modules into a binary classifier proves to be the least effective strategy, albeit performing slightly better than relying solely on the traffic light information. We believe this to be a direct consequence of discarding the confidence information from both modules and of the manner in which the features were fused, resulting in the TLR features overpowering the MP features.
Utilizing the ARCP(TLR+MP) model shown in Figure 21(f) reduces both the number of false positives and false negatives with respect to the intersection safety. By utilizing a structured approach for combining information from both the traffic light recognition and the motion prediction modules, the learned classifier is able to make accurate crossing predictions while being robust to the type of intersection encountered. Furthermore, by incorporating feature maps from the last downsampling stage in AtteNet and the Gaussian distribution parameters of IA-TCNN, the learned classifier can better generalize to unseen environments, as shown by the improvement in the accuracy, precision and recall rates over the Random Forest baseline classifier in Table 14.
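To illustrate one simple way of fusing the two streams, the sketch below pools the traffic light feature maps, concatenates them with the flattened Gaussian distribution parameters of the predicted trajectories, and feeds the result to a small classification head; the layer sizes and pooling choice are assumptions and do not reproduce the exact ARCP architecture.

```python
import tensorflow as tf

def crossing_decision_head(tlr_features, mp_gaussian_params):
    """Fuse traffic light and motion prediction streams into a crossing decision.

    tlr_features:       [B, H, W, C] feature maps from the traffic light stream.
    mp_gaussian_params: [B, K] flattened Gaussian parameters (means, variances,
                        correlations) of the predicted trajectories.
    Returns logits over {unsafe, safe}.
    """
    tlr_flat = tf.keras.layers.GlobalAveragePooling2D()(tlr_features)
    fused = tf.keras.layers.Concatenate()([tlr_flat, mp_gaussian_params])
    hidden = tf.keras.layers.Dense(128, activation="elu")(fused)
    return tf.keras.layers.Dense(2)(hidden)   # crossing decision logits
```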
In Figure 20, we perform qualitative analysis of the learned decision of our proposed ARCP(TLR+MP) classifier in comparison to the Random Forest classifier as a baseline on the FSC dataset. Each sequence is represented by three images corresponding to the beginning, middle and end of the interval. Furthermore, to provide a complete image of the scene, we overlay the sensor detections on birds-eye-view images of the intersections. Seq-1 depicts a situation where the robot is located at the side of a zebra crossing that is clear, with the exception of a cyclist (represented by the blue arrow) that is moving towards the robot. Despite the intersection being safe for crossing, the Random Forest classifier incorrectly labels the interval as unsafe for crossing. On the other hand, the ARCP(TLR+MP) classifier correctly identifies the intersection state by utilizing the information from the motion prediction module to infer the driving direction of the cyclist.
In Seq-2, the robot is located in the middle island at a signalized intersection, with traffic coming from the left-hand side. As the pedestrian traffic light is green, the approaching vehicle slows down throughout the observed interval rendering the intersection safe for crossing. By utilizing information from the traffic light recognition module to detect the state of the traffic light, in combination with the motion prediction module to identify that the approaching vehicle is reducing its velocity, our ARCP(TLR+MP) classifier is able to correctly label the interval as safe for crossing.
Seq-3 demonstrates a situation where a false positive detection by the tracker causes an incorrect classification of the intersection as unsafe by the Random Forest classifier. Our proposed classifier is however able to correctly predict the safety of the intersection for crossing as it is able to identify the spurious detection as a false positive or a ghost detection by the tracker. Another scenario is depicted in Seq-4, where the robot is located at a grid-type signalized intersection, with vehicles approaching from the upper right corner and heading towards the street parallel to the robot. By utilizing only information from the tracker, the Random Forest classifier labels the crossing unsafe as it appears that the cars are approaching perpendicularly to the robot. However, by predicting the behavior of the vehicles for the remainder of the interval, our proposed ARCP(TLR+MP) classifier is able to correctly classify the safety of the interval for crossing.
Finally, Seq-5 depicts an interval for which both classifiers incorrectly label the intersection as unsafe. The robot is placed at a signalized intersection with a green pedestrian light and a vehicle approaching from the lower left corner of the image. As the intersection contains a middle island, and since there is no traffic approaching from the significant direction (top left corner of the image), the crossing is labeled as safe. However, as neither classifier has a representation of the structure of the intersection showing the presence of the middle island, the interval is in turn misclassified as unsafe for crossing. By incorporating semantic knowledge of the scene or learning an obstacle map of the environment, the aforementioned problem can be rectified as the classifier can learn about the various road topologies and their effect on the crossing decision.
4.6 Generalization across Different Intersections
In the following, we perform an extended evaluation of our proposed pipeline for predicting the safety of the intersection for crossing by analyzing the performance of each module as well as the entire system in a real-world scenario. We place our robotic platform shown in Figure 8, running the proposed framework, at a busy street intersection which contains high-speed traffic, a pedestrian crossing and a tram line. Note that this intersection is not included in either the training or test sequences of the FSC dataset; rather, the goal of this experiment is to evaluate the generalization capabilities of our framework to new environments.
[Figure 22: Qualitative crossing predictions at a previously unseen intersection; three sequences (Seq-1 to Seq-3) are shown at the beginning, middle and end of the prediction interval with the sensor detections overlaid on birds-eye-view images]
Employing our IA-TCNN on this data to predict the future trajectories of all observable traffic participants, we achieve low average displacement errors in terms of translation, rotation and velocity. Furthermore, using AtteNet we accurately predict the state of the traffic light. Overall, for predicting the safety of the intersection for crossing, our ARCP classifier achieves high precision and recall. The low prediction errors achieved by the overall network as well as the individual modules demonstrate the generalization capabilities and efficacy of our proposed framework.
Additionally, we perform a qualitative analysis of the crossing decision predicted by our proposed ARCP classifier in Figure 22. We depict three sequences from the intersection, where each sequence is represented by three images from the beginning, middle and end of the prediction interval. As in Figure 20, we overlay the sensor detections on birds-eye-view images of the intersection to provide a more comprehensive picture of the scene. Seq-1 depicts a scenario wherein a cyclist approaches the robot on the sidewalk from the direction of oncoming traffic and continues to cycle past the robot. By utilizing and predicting the orientation information of the observed traffic participants, our IA-TCNN network is able to predict an accurate trajectory for the cyclist continuing on the sidewalk and hence passing behind the robot as opposed to in front of it. This information is in turn used by our ARCP classifier to correctly predict the safety of the intersection for crossing.
Seq-2 depicts a situation with heavy oncoming traffic, in which our ARCP classifier accurately predicts that the intersection is not safe for crossing during the observed interval. In Seq-3, a car approaches the intersection during the first half of the interval; in the remaining half it slows down as the traffic light signal changes. By utilizing both the traffic light predictions from AtteNet, showing the pedestrian traffic light to be green, and the trajectory information from IA-TCNN, predicting the continued deceleration of the car until it comes to a halt at the end of the interval, our ARCP classifier is able to accurately predict the safety of the intersection for crossing at the given interval.
5 Conclusion
In this paper, we proposed a system for autonomous street crossing using multimodal data. Our system consists of two main network streams: a traffic light recognition stream and an interaction-aware motion prediction stream. Information from both streams is fused as input to a convolutional neural network to predict the safety of the intersection for crossing. We proposed AtteNet, a convolutional neural network architecture for traffic light recognition that utilizes the global information in the images to selectively emphasize informative features while suppressing irrelevant ones, and remains robust to noisy data. We performed extensive experimental evaluations on various traffic light recognition benchmarks and showed that the proposed architecture outperforms the compared methods. Furthermore, we proposed an interaction-aware temporal convolutional neural network architecture that utilizes causal convolutions to accurately predict the trajectories of dynamic objects. We demonstrated that our approach is scalable to complex urban environments while simultaneously being able to predict accurate trajectories for all the observable traffic participants in the scene. Experimental evaluations on several benchmark datasets demonstrate that our architecture achieves state-of-the-art performance on both indoor and outdoor datasets, while achieving faster inference times and requiring less storage space in comparison to recurrent approaches.
In order to learn a classifier that is robust to the type of intersection, feature maps from the traffic light recognition network and the interaction-aware motion prediction network are fused to learn the final crossing decision. By incorporating the uncertainty information from the motion prediction stream and the learned representations from the traffic light recognition stream, the classifier is robust to incorrect predictions by either task-specific subnetwork. Moreover, we extended the previously introduced Freiburg Street Crossing dataset by additional sequences captured at various intersections including signalized and zebra intersections, as well as various road curvatures and topologies which affect the crossing procedure. We deployed our proposed framework on a robotic platform and conducted real-world experiments that demonstrate the accuracy, robustness and generalization capabilities of the proposed system to new environments. We conducted comprehensive experimental evaluations that demonstrate the efficacy of the proposed system for determining the safety of the intersection for crossing. Furthermore, the results demonstrate the tolerance of the system to noise and inaccuracies in the data, while accurately generalizing to new unseen scenarios.
For future work, we aim to additionally predict the obstacle map of the environment, as we believe that knowledge about the vicinity can improve the motion prediction accuracy by avoiding trajectories that may intersect with obstacles. Similarly, the crossing prediction accuracy would also benefit as it eliminates false negatives by leveraging the road structure. Moreover, learning to predict the traffic flow direction can aid in eliminating further sources of confusion for the crossing decision.
References
- Abadi et al. (2015) Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y and Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Alahi et al. (2016) Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L and Savarese S (2016) Social lstm: Human trajectory prediction in crowded spaces. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- Azimi et al. (2014) Azimi R, Bhatia G, Rajkumar RR and Mudalige P (2014) Stip: Spatio-temporal intersection protocols for autonomous vehicles. In: Int. Conf. on Cyber-Physical Systems (ICCPS).
- Bai et al. (2018) Bai S, Kolter JZ and Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 .
- Baker and Yanco (2003) Baker M and Yanco HA (2003) A vision-based tracking system for a street-crossing robot. Univ. Massachusetts Lowell Technical Report .
- Baker and Yanco (2005) Baker M and Yanco HA (2005) Automated street crossing for assistive robots. In: Int. Conf. on Rehabilitation Robotics.
- Bauer et al. (2009) Bauer A, Klasing K, Lidoris G, Mühlbauer Q, Rohrmüller F, Sosnowski S, Xu T, Kühnlenz K, Wollherr D and Buss M (2009) The autonomous city explorer: Towards natural human-robot interaction in urban environments. Int. Journal of Social Robotics 1(2): 127–140.
- Baumann et al. (2018) Baumann U, Glaeser C, Herman M and Zöllner JM (2018) Predicting ego-vehicle paths from environmental observations with a deep neural network. In: Int. Conf. on Robotics & Automation (ICRA).
- Behrendt and Novak (2017) Behrendt K and Novak L (2017) A deep learning approach to traffic lights: Detection, tracking, and classification. In: Int. Conf. on Robotics & Automation (ICRA).
- Campos et al. (2013) Campos GDD, Falcone P and Sjoberg J (2013) Autonomous cooperative driving: a velocity-based negotiation approach for intersection crossing. In: Int. Conf. on Intelligent Transportation Systems (ITSC).
- Chen et al. (2017) Chen YF, Everett M, Liu M and How JP (2017) Socially aware motion planning with deep reinforcement learning. In: Int. Conf. on Intelligent Robots and Systems (IROS).
- Diaz et al. (2017) Diaz M, Girgis R, Fevens T and Cooperstock J (2017) To veer or not to veer: Learning from experts how to stay within the crosswalk. In: Int. Conf. on Computer Vision Workshops (ICCV Workshops).
- Fang and López (2018) Fang Z and López AM (2018) Is the pedestrian going to cross? answering by 2d pose estimation. arXiv preprint arXiv:1807.10580 .
- Gupta et al. (2018) Gupta A, Johnson J, Fei-Fei L, Savarese S and Alahi A (2018) Social gan: Socially acceptable trajectories with generative adversarial networks. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- Habibi et al. (2018) Habibi G, Jaipuria N and How JP (2018) Context-aware pedestrian motion prediction in urban intersections. arXiv preprint arXiv:1806.09453 .
- He et al. (2016a) He K, Zhang X, Ren S and Sun J (2016a) Deep residual learning for image recognition. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- He et al. (2016b) He K, Zhang X, Ren S and Sun J (2016b) Identity mappings in deep residual networks. In: European Conf. on Computer Vision (ECCV).
- Helbing and Molnar (1995) Helbing D and Molnar P (1995) Social force model for pedestrian dynamics. Physical review E 51(5): 4282.
- Hu et al. (2017) Hu J, Shen L and Sun G (2017) Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 .
- Huang et al. (2016) Huang G, Liu Z, Weinberger KQ and van der Maaten L (2016) Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 .
- Iandola et al. (2016) Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ and Keutzer K (2016) Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360 .
- Jaipuria et al. (2018) Jaipuria N, Habibi G and How JP (2018) A transferable pedestrian motion prediction model for intersections with different geometries. arXiv preprint arXiv:1806.09444 .
- Jensen et al. (2016) Jensen MB, Philipsen MP, Møgelmose A, Moeslund TB and Trivedi MM (2016) Vision for looking at traffic lights: Issues, survey, and perspectives. IEEE Transactions on Intelligent Transportation Systems (ITS) 17(7): 1800–1815.
- John et al. (2014) John V, Yoneda K, Qi B, Liu Z and Mita S (2014) Traffic light recognition in varying illumination using deep learning and saliency map. In: Int. Conf. on Intelligent Transportation Systems (ITSC).
- Kim et al. (2017) Kim B, Kang CM, Lee SH, Chae H, Kim J, Chung CC and Choi JW (2017) Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. arXiv preprint arXiv:1704.07049 .
- Kingma and Ba (2014) Kingma DP and Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
- Kretzschmar et al. (2014) Kretzschmar H, Kuderer M and Burgard W (2014) Learning to predict trajectories of cooperatively navigating agents. In: Int. Conf. on Robotics & Automation (ICRA).
- Kretzschmar et al. (2016) Kretzschmar H, Spies M, Sprunk C and Burgard W (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. Int. J. of Robotics Research (IJRR) 35(11): 1289–1307.
- Kuderer et al. (2012) Kuderer M, Kretzschmar H, Sprunk C and Burgard W (2012) Feature-based prediction of trajectories for socially compliant navigation. In: Proc. of Robotics: Science and Systems.
- Kümmerle et al. (2015) Kümmerle R, Ruhnke M, Steder B, Stachniss C and Burgard W (2015) Autonomous robot navigation in highly populated pedestrian zones. Journal on Field Robotics 32(4): 565–589.
- Lea et al. (2017) Lea C, Flynn MD, Vidal R, Reiter A and Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- Lefèvre et al. (2011) Lefèvre S, Laugier C and Ibañez-Guzmán J (2011) Exploiting map information for driver intention estimation at road intersections. In: Intelligent Vehicles Symposium (IV).
- Lefèvre et al. (2014) Lefèvre S, Vasquez D and Laugier C (2014) A survey on motion prediction and risk assessment for intelligent vehicles. Robomech Journal 1(1): 1.
- Lerner et al. (2007) Lerner A, Chrysanthou Y and Lischinski D (2007) Crowds by example. In: Computer Graphics Forum, volume 26. Wiley Online Library, pp. 655–664.
- Li et al. (2018) Li X, Ma H, Wang X and Zhang X (2018) Traffic light recognition for complex scene with fusion detections. IEEE Transactions on Intelligent Transportation Systems (ITS) 19(1): 199–208.
- Liu et al. (2017) Liu W, Li S, Lv J, Yu B, Zhou T, Yuan H and Zhao H (2017) Real-time traffic light recognition based on smartphone platforms. IEEE Transactions on Circuits and Systems for Video Technology 27(5): 1118–1131.
- Manh and Alaghband (2018) Manh H and Alaghband G (2018) Scene-lstm: A model for human trajectory prediction. arXiv preprint arXiv:1808.04018 .
- Medina et al. (2015) Medina AIM, van de Wouw N and Nijmeijer H (2015) Automation of a t-intersection using virtual platoons of cooperative autonomous vehicles. In: Int. Conf. on Intelligent Transportation Systems (ITSC).
- Nexar (2016) Nexar (2016) Nexar challenge-1: Using deep learning for traffic light recognition. URL https://www.getnexar.com/challenge-1/.
- Park et al. (2018) Park S, Kim B, Kang CM, Chung CC and Choi JW (2018) Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture. arXiv preprint arXiv:1802.06338 .
- Pellegrini et al. (2009) Pellegrini S, Ess A, Schindler K and Gool LV (2009) You’ll never walk alone: Modeling social behavior for multi-target tracking. In: IEEE Int. Conf. on Computer Vision (ICCV).
- Pfeiffer et al. (2018) Pfeiffer M, Paolo G, Sommer H, Nieto J, Siegwart R and Cadena C (2018) A data-driven model for interaction-aware pedestrian motion prediction in object cluttered environments. In: Int. Conf. on Robotics & Automation (ICRA).
- Pfeiffer et al. (2016) Pfeiffer M, Schwesinger U, Sommer H, Galceran E and Siegwart R (2016) Predicting actions to act predictably: Cooperative partial motion planning with maximum entropy models. In: Int. Conf. on Intelligent Robots and Systems (IROS).
- Radwan et al. (2017) Radwan N, Winterhalter W, Dornhege C and Burgard W (2017) Why did the robot cross the road? - learning from multi-modal sensor data for autonomous road crossing. In: Int. Conf. on Intelligent Robots and Systems (IROS).
- Redmon et al. (2016) Redmon J, Divvala S, Girshick R and Farhadi A (2016) You only look once: Unified, real-time object detection. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- Sadeghian et al. (2018) Sadeghian A, Kosaraju V, Sadeghian A, Hirose N and Savarese S (2018) Sophie: An attentive gan for predicting paths compliant to social and physical constraints. arXiv preprint arXiv:1806.01482 .
- Selvaraju et al. (2016) Selvaraju RR, Das A, Vedantam R, Cogswell M, Parikh D and Batra D (2016) Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391 .
- Sun et al. (2018) Sun L, Yan Z, Mellado SM, Hanheide M and Duckett T (2018) 3dof pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data. In: Int. Conf. on Robotics & Automation (ICRA).
- Trautman and Krause (2010) Trautman P and Krause A (2010) Unfreezing the robot: Navigation in dense, interacting crowds. In: Int. Conf. on Intelligent Robots and Systems (IROS).
- Trautman et al. (2015) Trautman P, Ma J, Murray RM and Krause A (2015) Robot navigation in dense human crowds: Statistical models and experimental studies of human–robot cooperation. Int. J. of Robotics Research (IJRR) 34(3): 335–356.
- Van den Berg et al. (2008) Van den Berg J, Lin M and Manocha D (2008) Reciprocal velocity obstacles for real-time multi-agent navigation. In: Int. Conf. on Robotics & Automation (ICRA).
- Van der Maaten and Hinton (2008) Van der Maaten L and Hinton G (2008) Visualizing data using t-sne. Journal of machine learning research 9(Nov): 2579–2605.
- Varshneya and Srinivasaraghavan (2017) Varshneya D and Srinivasaraghavan G (2017) Human trajectory prediction using spatially aware deep attention models. arXiv preprint arXiv:1705.09436 .
- Vemula et al. (2018) Vemula A, Muelling K and Oh J (2018) Social attention: Modeling attention in human crowds. In: Int. Conf. on Robotics & Automation (ICRA).
- Xue et al. (2018) Xue H, Huynh DQ and Reynolds M (2018) Ss-lstm: a hierarchical lstm model for pedestrian trajectory prediction. In: IEEE Winter Conf. on Applications of Computer Vision (WACV).
- Yamaguchi et al. (2011) Yamaguchi K, Berg AC, Ortiz LE and Berg TL (2011) Who are you with and where are you going? In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- Yan et al. (2017) Yan Z, Duckett T and Bellotto N (2017) Online learning for human classification in 3d lidar-based tracking. In: Int. Conf. on Intelligent Robots and Systems (IROS).