NMR: Neural Manifold Representation for Autonomous Driving

by Unnikrishnan R. Nair, et al.

Autonomous driving requires efficient reasoning about the spatio-temporal nature of the semantics of the scene. Recent approaches have successfully amalgamated the traditional modular architecture of an autonomous driving stack, comprising perception, prediction, and planning, into an end-to-end trainable system. Such a system calls for a shared latent space embedding with an interpretable, trainable intermediate projected representation. One such successfully deployed representation is the Bird's-Eye View (BEV) representation of the scene in the ego-frame. However, a fundamental assumption for an undistorted BEV is the local coplanarity of the world around the ego-vehicle. This assumption is highly restrictive, as roads, in general, do have gradients. The resulting distortions make path planning inefficient and incorrect. To overcome this limitation, we propose Neural Manifold Representation (NMR), a representation for the task of autonomous driving that learns to infer semantics and predict way-points on a manifold over a finite horizon, centered on the ego-vehicle. We do this using an iterative attention mechanism applied to a latent high-dimensional embedding of surround monocular images and partial ego-vehicle state. This representation helps generate motion and behavior plans that are consistent with and cognizant of the surface geometry. We propose a sampling algorithm based on an edge-adaptive coverage loss on the BEV occupancy grid and an associated guidance flow field to generate the surface manifold while incurring minimal computational overhead. We aim to test the efficacy of our approach on CARLA and SYNTHIA-SF.






1 Introduction

Autonomous driving is one of the most active research areas of this decade. The operating environment for a self-driving system is highly complex, diverse, and dynamic. Navigating such scenes requires behavior planning that reasons jointly over the spatial and temporal structure of the scene. In this context, end-to-end systems with interpretable and trainable projections of shared latent representations have shown promising results. One such representation is the Bird’s-Eye View (BEV), a top-down view of the space around the ego-vehicle in an egocentric frame of reference. It is also the native space for path and behavior planning. Several architectures [mp3, p3, lss, fiery, nmp] have recently been proposed to derive the BEV of the scene from raw sensor inputs. However, the BEV assumes that the ego-vehicle and the other agents in the scene are coplanar. This assumption is highly restrictive: when it does not hold, the generated representation is distorted and becomes unintuitive for planning. We propose NMR, a surface representation amenable to end-to-end autonomous driving on non-planar roads. We test it on a network that performs waypoint prediction using a learnt Guidance Offset Field (GOF) together with dense Semantic Occupancy Grid (SOG) prediction, which aims to predict semantic labels at any spatio-temporal query location on the manifold, bounded by a spatial range and a time interval. The sparse task of waypoint prediction is thus aided by the dense task of predicting future semantic occupancy. We further improve the performance of the proposed approach by incorporating attention-based feature thresholding in the network. Finally, we improve the architecture’s scalability by incorporating adaptive sampling based on an edge distance transform and a coverage loss, generating well-resolved segmentation maps without incurring a high computation cost.

2 Problem Statement

An ego-vehicle has to travel from one point to another in a generic driving environment. The route is given as a sparse sequence of target points. The driving environment includes jaywalkers, pedestrians, traffic lights, and other vehicles. The driving surface can be a planar or a non-planar manifold.

Figure 1: The image with grey borders shows a scene from CARLA [carla] with a steeply graded road. Notice the loss of information (line of sight, gradient information of the surface) in the BEV representation (black border) compared to the actual surface topology (red/blue/green border). The NMR architecture infers scene semantics and waypoints that are consistent with the surface, as shown in the bottom-most image.

3 Methodology

3.1 Surface Representation

To represent surfaces smoothly, we propose NMR, a two-stage parametrization of the surface. In the first stage, the Cartesian points on the surface are mapped to a pair of surface parameters. In the second stage, this parametric space is transformed to an isometric arc-length mapping on the surface, which gives an intuitive two-dimensional coordinate system because it is topologically invariant. Combined with any set of smooth basis functions, the two mappings provide an intuitive two-dimensional representation of a surface. In this work, without loss of generality, we choose the Bézier surface [bezier1977essai], which uses the Bernstein polynomial basis. Any point on the surface is therefore represented as a linear combination of products of Bernstein basis polynomials weighted by a net of control points, mapping the two-dimensional parameter space to three-dimensional Cartesian space.
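As an illustration of the first-stage mapping, the following minimal sketch evaluates a point on a Bézier surface from a net of control points. The parameter names `u` and `v`, the function names, and the example control net are assumptions made for this sketch; they are not taken from the paper.

```python
from math import comb


def bernstein(i, n, t):
    """Bernstein basis polynomial B_{i,n}(t) for t in [0, 1]."""
    return comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))


def bezier_surface_point(control_net, u, v):
    """Evaluate S(u, v) = sum_i sum_j B_{i,n}(u) * B_{j,m}(v) * P_ij.

    control_net is an (n+1) x (m+1) nested list of (x, y, z) tuples.
    """
    n = len(control_net) - 1
    m = len(control_net[0]) - 1
    point = [0.0, 0.0, 0.0]
    for i, row in enumerate(control_net):
        bu = bernstein(i, n, u)
        for j, p in enumerate(row):
            w = bu * bernstein(j, m, v)
            for k in range(3):
                point[k] += w * p[k]
    return tuple(point)


# Hypothetical 3 x 3 control net: a road patch whose height grows along x.
net = [[(x, y, 0.01 * x * x) for y in (-5.0, 0.0, 5.0)]
       for x in (0.0, 10.0, 20.0)]

corner = bezier_surface_point(net, 0.0, 0.0)  # coincides with net[0][0]
center = bezier_surface_point(net, 0.5, 0.5)  # blend of all nine control points
```

A Bézier surface interpolates its corner control points and only approximates the interior ones, so moving an interior control point bends the patch smoothly without moving the corners.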


3.2 Network architecture

Motivated by [neat], we structure our approach (Fig. 4) to predict waypoints and semantics on the manifold in an end-to-end manner, learning from expert demonstrations. The vehicle coordinate system is defined with the ego-vehicle at the origin at the current instant. It is a right-handed coordinate system with the positive X-axis pointing toward the front of the vehicle and the Z-axis pointing upwards. The architecture consists of an encoder, an attention field, and a decoder.

Figure 2: Assuming co-planarity when an object lies on a different planar profile than the ego-vehicle can lead to errors in the localization of the object and to incorrect inputs to planning.

Encoder: As our agent drives through the scene, we collect sensor inputs X from each of the surround monocular cameras over a window of time steps. Each RGB image is passed through a ResNet [resnet] to obtain a feature representation of the image. These features are summed with the vehicle speed and a learned position embedding and passed through a transformer. The transformer integrates the features globally, adding contextual cues to each patch through its self-attention mechanism. This enables interactions over large spatial regions and across the different sensor outputs. The output of the transformer is a latent encoding, represented by Z. The encoder operation is depicted in Eq.2 below:


Here, the latent encoding Z is indexed by the number of spatial features and by the feature dimensionality.

Attention Field: We define a query point q on the manifold, consisting of a time, a query location, and a target location. To attend to patch features for a particular query point q, we adopt the iterative attention mechanism of [neat]. Specifically, at each iteration i, the output of the attention field is used to weigh each of the features according to its relevance to the query point q. The weighted features are then used, along with q, as the input to the attention field at the next iteration. For the first iteration, the attention weights are initialized to a uniform scalar, signifying uniform attention to start with. The weights of the attention network are shared across all N iterations. The attention mechanism is denoted in Eq.3 below:


To capture the correlation between the presence of traffic participants and the road profile on which they are present (e.g., a car is on the road surface), we propose a common attention field.
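The iterative scheme described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the network used in the paper: the attention network is replaced by a single hypothetical linear layer (`attention_net`), the query is assumed to be a 5-dimensional vector (time, 2D query location, 2D target location), and all numeric values are random placeholders.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, attn):
    """Attention-weighted sum of patch features."""
    dim = len(features[0])
    return [sum(a * z[d] for a, z in zip(attn, features)) for d in range(dim)]


def attention_net(query, context, weight):
    """Stand-in for the shared attention network (assumption): one linear
    layer mapping the concatenated [query, context] to one score per patch."""
    x = list(query) + list(context)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def iterative_attention(features, query, weight, num_iters):
    num_patches = len(features)
    attn = [1.0 / num_patches] * num_patches  # uniform attention to start
    context = aggregate(features, attn)
    contexts = []
    for _ in range(num_iters):
        # The same `weight` is reused at every iteration (shared weights).
        attn = softmax(attention_net(query, context, weight))
        context = aggregate(features, attn)
        contexts.append(context)  # one attended feature per iteration
    return attn, contexts


random.seed(0)
num_patches, feat_dim, query_dim = 4, 3, 5
features = [[random.uniform(-1, 1) for _ in range(feat_dim)]
            for _ in range(num_patches)]
weight = [[random.uniform(-1, 1) for _ in range(query_dim + feat_dim)]
          for _ in range(num_patches)]
query = [0.5, 4.0, 1.5, 20.0, 0.0]  # hypothetical (t, x, y, x_target, y_target)

attn, contexts = iterative_attention(features, query, weight, num_iters=2)
```

Returning the attended feature from every iteration, rather than only the last one, matches the idea of supervising the outputs of each iteration during training.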

Figure 3: (a) Error in localization of an object if the co-planarity assumption is enforced on non-planar profiles [mobileye, chandraker2015]. (b) Coordinate-wise error in estimates of position in the ego reference frame when the inclination is neglected. Unit dimensions are assumed.
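To give a feel for the magnitude of this effect, the sketch below uses a simplified two-dimensional pinhole model. It is an assumption of this example, not the formulation behind the figure. A camera at height `cam_height` above flat ground observes a contact point lying `dist_on_ramp` metres up a ramp that starts `ramp_start` metres ahead; back-projecting the viewing ray onto the plane z = 0 gives the position a flat-ground BEV would assign.

```python
import math


def flat_ground_error(cam_height, ramp_start, dist_on_ramp, incline_rad):
    """Longitudinal error caused by assuming the object lies on z = 0.

    Returns (flat-ground estimate) - (true x position), in the same units
    as the inputs. All parameter names are hypothetical.
    """
    true_x = ramp_start + dist_on_ramp * math.cos(incline_rad)
    true_z = dist_on_ramp * math.sin(incline_rad)
    if true_z >= cam_height:
        # The viewing ray is at or above the horizon and never meets z = 0.
        return math.inf
    # Ray from the camera (0, cam_height) through (true_x, true_z),
    # intersected with the plane z = 0.
    flat_x = cam_height * true_x / (cam_height - true_z)
    return flat_x - true_x


for deg in (0, 2, 5):
    err = flat_ground_error(cam_height=1.5, ramp_start=10.0,
                            dist_on_ramp=5.0, incline_rad=math.radians(deg))
    print(f"incline {deg} deg: longitudinal error {err:.2f} m")
```

In this model an uphill incline pushes the flat-ground estimate farther away than the true position, a downhill incline pulls it closer, and the error vanishes when the incline is zero.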

Decoder: Given the attended features, a grid of control points that govern the manifold representation of the scene is extracted by a multi-layer perceptron (MLP). Each control point is a point in three-dimensional space, which determines the output dimensionality of the MLP. Given a Cartesian point and the control points, we use a Deep Declarative Network (DDN) [ddn] layer to obtain the reverse mapping to the surface parameters by minimizing Eq.4 below, obtained by rearranging Eq.1:


Subsequently, we remap the parametric space to the isometric arc-length space through an MLP. Next, the decoder predicts the semantic class (one of M classes) and the waypoint offset at each of the N attention iterations.
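The reverse mapping can be illustrated as a small optimization problem: find the surface parameters whose surface point is closest to a given Cartesian point. The sketch below solves it with a coarse grid search followed by Gauss-Newton steps using finite-difference derivatives. It shows only the forward solve and is not a DDN layer, which would additionally differentiate through the minimizer; the names `u`, `v`, `invert_surface`, and the control net are assumptions of this example.

```python
from math import comb


def bezier_point(net, u, v):
    """Point on the Bezier surface defined by the control net `net`."""
    n, m = len(net) - 1, len(net[0]) - 1
    out = [0.0, 0.0, 0.0]
    for i in range(n + 1):
        bu = comb(n, i) * u ** i * (1 - u) ** (n - i)
        for j in range(m + 1):
            w = bu * comb(m, j) * v ** j * (1 - v) ** (m - j)
            for k in range(3):
                out[k] += w * net[i][j][k]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def invert_surface(net, target, grid=10, iters=50, eps=1e-6):
    """Find (u, v) in [0, 1]^2 minimizing ||S(u, v) - target||^2."""
    def cost(uv):
        s = bezier_point(net, uv[0], uv[1])
        return sum((a - b) ** 2 for a, b in zip(s, target))

    # Coarse grid search gives a starting point for the local refinement.
    candidates = [(i / grid, j / grid)
                  for i in range(grid + 1) for j in range(grid + 1)]
    u, v = min(candidates, key=cost)

    for _ in range(iters):
        s = bezier_point(net, u, v)
        r = [a - b for a, b in zip(s, target)]  # residual
        # Forward-difference partial derivatives of S with respect to u, v.
        su = [(a - b) / eps for a, b in zip(bezier_point(net, u + eps, v), s)]
        sv = [(a - b) / eps for a, b in zip(bezier_point(net, u, v + eps), s)]
        # Gauss-Newton normal equations: (J^T J) d = -J^T r, a 2 x 2 system.
        a11, a12, a22 = dot(su, su), dot(su, sv), dot(sv, sv)
        g1, g2 = dot(su, r), dot(sv, r)
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            break
        du = (-g1 * a22 + g2 * a12) / det
        dv = (-g2 * a11 + g1 * a12) / det
        u = min(1.0, max(0.0, u + du))  # keep parameters inside the patch
        v = min(1.0, max(0.0, v + dv))
        if abs(du) + abs(dv) < 1e-10:
            break
    return u, v


# Hypothetical control net and a target point that lies on the surface.
net = [[(x, y, 0.01 * x * x) for y in (-5.0, 0.0, 5.0)]
       for x in (0.0, 10.0, 20.0)]
target = bezier_point(net, 0.37, 0.62)
u_hat, v_hat = invert_surface(net, target)
```

The grid search guards against poor local minima on strongly curved patches, at the cost of extra surface evaluations per query.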

Figure 4: Architecture: The system takes in images from a surround monocular setup, along with the vehicle velocity, and predicts control points, offsets, and occupancy for a given query point q on the manifold. At test time, we sample points from the surface to obtain offsets and semantics, which are used to generate drive commands.

4 Training and Inference

At train time, the outputs from each of the attention iterations are supervised to assist with convergence of the loss. The predicted control points are supervised with a norm-based regression loss. For semantic and offset prediction, cross-entropy (CE) loss and a norm-based regression loss are used, respectively. The final loss function takes the form shown in Eq.5 below:


where the three weighting factors control the relative importance of the loss terms. We predict both observed and future semantic information to gain a more holistic understanding of the scene. For each of the semantic classes, points are sampled in an edge-aware manner to accurately capture the semantic boundaries. A coverage loss is also imposed on the sampling to ensure that the SOG space is covered as uniformly as possible. At test time, the surround monocular images and a query point q are the inputs. The semantic occupancy and offsets are obtained at the end of the attention iterations.
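One way to picture edge-aware sampling is sketched below. It assumes a city-block distance transform to the nearest semantic boundary, an exponential falloff `tau`, and a uniform `floor` weight that stands in for the coverage term. These choices and all names are assumptions for illustration and are not the sampling scheme or the coverage loss of the paper.

```python
import math
import random
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def edge_distance(labels):
    """City-block distance from each cell to the nearest semantic boundary.

    Cells keep the value None if the grid contains no boundary at all.
    """
    h, w = len(labels), len(labels[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for r in range(h):
        for c in range(w):
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if (0 <= rr < h and 0 <= cc < w
                        and labels[rr][cc] != labels[r][c]):
                    dist[r][c] = 0  # boundary cell: seed of the search
                    queue.append((r, c))
                    break
    while queue:  # multi-source breadth-first search
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and dist[rr][cc] is None:
                dist[rr][cc] = dist[r][c] + 1
                queue.append((rr, cc))
    return dist


def sample_query_cells(labels, num_samples, tau=1.5, floor=0.05, rng=random):
    """Sample grid cells, favouring cells close to semantic boundaries."""
    dist = edge_distance(labels)
    cells, weights = [], []
    for r, row in enumerate(dist):
        for c, d in enumerate(row):
            d = math.inf if d is None else d
            cells.append((r, c))
            # `floor` keeps every cell reachable so the grid stays covered.
            weights.append(math.exp(-d / tau) + floor)
    return rng.choices(cells, weights=weights, k=num_samples)


# Hypothetical 6 x 8 occupancy grid: a road strip (1) in columns 3 and 4.
labels = [[1 if 3 <= c <= 4 else 0 for c in range(8)] for _ in range(6)]
samples = sample_query_cells(labels, 20, rng=random.Random(0))
```

Raising `floor` trades boundary detail for more even coverage of the grid, and lowering `tau` concentrates the samples more tightly around class boundaries.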

5 Experiments

In Fig. 1 we visualize a scene from CARLA [carla] and show how ignoring road gradients when formulating motion planning in BEV can lead to incorrect inference. In Fig. 2 we show how ignoring the ground-plane topology leads to incorrect object localization. In Fig. 3(a) and Fig. 3(b), we show the error in localization and in the inputs to planning as the plane inclination changes. We plan to carry out experiments in CARLA [carla] and SYNTHIA-SF [synthia-sf], which contain scenarios with different road profiles and gradients. The experiments aim to predict accurate SOG and GOF. The proposed approach will be compared with [codevilla2019exploring, lbc, prakash2021multi, chen2021learning, toromanoff2020end] on CARLA and with [ansari2018earth] on SYNTHIA-SF [synthia-sf]. Note that [ansari2018earth] deals only with the task of object localization in 3D, so semantic occupancies would need to be lifted to 3D bounding boxes for a fair comparison. The semantic classes that we consider are {none, road, obstacle, red-light, green-light}.

6 Conclusion

We present NMR, an approach that accurately captures the road surface for the task of end-to-end autonomous driving. We lift the raw sensor input data and vehicle state to a high-dimensional latent embedding. Attention fields are used to extract the control points that govern the surface geometry, as well as semantic occupancy and offset information for any query point on the manifold. We present edge-aware sampling methods to accurately capture the occupancy information in the scene. We propose to test our approach on challenging road topologies in CARLA [carla] and SYNTHIA-SF [synthia-sf].