1 Introduction
Autonomous driving is one of the most active research areas of this decade. The operating environment for a self-driving system is highly complex, diverse, and dynamic. Navigating such scenes calls for behavior planning that reasons jointly over the spatio-temporal structure of the scene. In this context, end-to-end systems with interpretable and trainable projections of shared latent representations have shown promising results. One compelling representation is the Bird's-eye View (BEV), a top-down view of the space around the ego-vehicle in an egocentric frame of reference; it is also the native space for path and behavior planning. Several architectures [mp3, p3, lss, fiery, nmp] have recently been proposed to derive the BEV of the scene from raw sensor inputs. However, the BEV assumes coplanarity of the ego-vehicle and the other agents in the scene. This assumption is highly restrictive: when it does not hold, the generated representation is distorted and becomes unintuitive for planning. We propose NMR, a surface representation amenable to end-to-end autonomous driving on non-planar roads. We test it on a network that predicts waypoints using a learnt Guidance Offset Field (GOF) and performs dense Semantic Occupancy Grid (SOG) prediction, i.e., it predicts semantic labels at any spatio-temporal query location on the manifold, bounded by a spatial range and a time interval. The sparse task of waypoint prediction is thus aided by the dense task of predicting future semantic occupancy. We further improve the performance of the proposed approach by incorporating attention-based feature thresholding in the network. Finally, we improve the architecture's scalability by incorporating adaptive sampling based on an edge distance transform and a coverage loss, generating well-resolved segmentation maps without incurring a high computation cost.
2 Problem Statement
An ego-vehicle has to travel from point 1 to point 2 in a generic driving environment. The route is given in terms of sparse target points. The driving environment includes jaywalkers, pedestrians, traffic lights, and other vehicles. The driving surface can be a planar or non-planar manifold.

3 Methodology
3.1 Surface Representation
To represent surfaces smoothly, we propose NMR, a two-stage parametrization of the surface. In the first stage, Cartesian points on the surface are mapped to parameters $(u, v)$. In the second stage, the parametric space is transformed to the surface-isometric arc-length mapping $(s, t)$, giving an intuitive two-dimensional mapping on the surface that is topologically invariant. The two mappings, combined with any set of smooth basis functions, provide an intuitive two-dimensional representation of a surface. In this work, without loss of generality, we choose the Bézier surface [bezier1977essai], which uses the Bernstein polynomial basis $B_i^n$. Hence any point $\mathbf{S}(u, v)$ on the surface is represented as a linear combination of Bernstein polynomial product basis functions and an $(n+1) \times (m+1)$ net of control points $\mathbf{P}_{ij}$, mapping from $[0,1]^2$ to $\mathbb{R}^3$:
\[
\mathbf{S}(u, v) = \sum_{i=0}^{n} \sum_{j=0}^{m} B_i^n(u)\, B_j^m(v)\, \mathbf{P}_{ij},
\qquad
B_i^n(t) = \binom{n}{i}\, t^{i} (1-t)^{n-i}
\tag{1}
\]
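For illustration, below is a minimal NumPy sketch of Eq. 1 for a bicubic patch ($n = m = 3$); the control-net values and grid size are illustrative assumptions, not values from this work.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_i^n(t) = C(n,i) t^i (1-t)^(n-i)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_surface(P, u, v):
    """Evaluate Eq. 1: S(u, v) = sum_ij B_i^n(u) B_j^m(v) P_ij.

    P: (n+1, m+1, 3) net of control points; u, v in [0, 1].
    """
    n, m = P.shape[0] - 1, P.shape[1] - 1
    point = np.zeros(3)
    for i in range(n + 1):
        for j in range(m + 1):
            point += bernstein(n, i, u) * bernstein(m, j, v) * P[i, j]
    return point

# Illustrative bicubic net: a gently banked road patch (values are made up).
P = np.zeros((4, 4, 3))
P[..., 0], P[..., 1] = np.meshgrid(np.linspace(0, 30, 4),
                                   np.linspace(-5, 5, 4), indexing="ij")
P[..., 2] = 0.1 * P[..., 1]  # lateral slope -> non-planar surface
print(bezier_surface(P, 0.5, 0.5))  # Cartesian point at the patch centre
```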
3.2 Network architecture
Motivated by [neat], we structure our approach (Fig. 4) to predict waypoints and semantics on the manifold in an end-to-end manner, learning from expert demonstrations. The vehicle coordinate system is defined with the ego-vehicle at the origin at the current instant; in this right-handed system, the positive X-axis points toward the front of the vehicle and the Z-axis points upwards. The architecture consists of an encoder, an attention field, and a decoder.

Encoder: As our agent drives through the scene, we collect sensor inputs $\mathbf{X} = \{x_{s,t}\}$ from the surround monocular cameras over $T$ time steps, where $s \in \{1, \dots, S\}$, $t \in \{1, \dots, T\}$, and $S$ is the number of sensors. Each RGB image $x_{s,t}$ is passed through a ResNet [resnet] to obtain a feature representation of the image. This, along with an embedding of the vehicle speed and a learned position embedding, is summed and passed through a transformer. The transformer integrates the features globally, adding contextual cues to each patch with its self-attention mechanism; this enables interactions over large spatial regions and across the different sensor outputs. The output of the transformer is a latent encoding $\mathbf{Z}$. The encoder operation is depicted in Eq. 2 below:
\[
\mathbf{Z} = \mathrm{Enc}(\mathbf{X}, v), \qquad \mathbf{Z} \in \mathbb{R}^{K \times C}
\tag{2}
\]
where $K$ is the number of spatial features and $C$ is the feature dimensionality.
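A PyTorch sketch of the encoder of Eq. 2 is given below; the backbone choice (ResNet-18), feature width, patch count, and transformer depth are assumptions for exposition, not the settings of this work.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Sketch of Eq. 2: surround images + speed -> latent Z in R^{K x C}."""
    def __init__(self, c_feat=256, n_heads=4, n_layers=2, n_patches=64):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (.., 512, h, w)
        self.proj = nn.Conv2d(512, c_feat, kernel_size=1)
        self.speed_embed = nn.Linear(1, c_feat)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, c_feat))  # learned
        layer = nn.TransformerEncoderLayer(c_feat, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, images, speed):
        # images: (B, 3, H, W), one entry per sensor and time step; speed: (B, 1).
        f = self.proj(self.backbone(images))      # (B, C, h, w)
        f = f.flatten(2).transpose(1, 2)          # (B, h*w, C) patch tokens
        f = f + self.pos_embed[:, :f.shape[1]] + self.speed_embed(speed).unsqueeze(1)
        return self.transformer(f)                # Z: (B, K, C) with K = h*w
```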
Attention Field: We define a query point on the manifold as $\mathbf{q} = (\mathbf{p}, \tau, \mathbf{x}^{*})$, where $\tau$ is time, $\mathbf{p} = (s, t)$ is the query location, and $\mathbf{x}^{*}$ is the target location. To attend to patch features for a particular query point $\mathbf{q}$, we adopt the iterative attention mechanism of [neat]. Specifically, at each iteration $i$, the output of the attention field is used to weigh each of the features $\mathbf{z}_k \in \mathbf{Z}$ according to its relevance to the query point $\mathbf{q}$; the attended feature is then used, along with $\mathbf{q}$, as input to the attention at the next iteration. For the first iteration, the attention weights are initialized with a uniform scalar, signifying uniform attention to start with. The weights of the attention network are shared across all $N$ iterations. The attention mechanism is denoted in Eq. 3 below:
\[
\mathbf{a}_i = \mathrm{softmax}\big( A(\mathbf{q}, \mathbf{c}_{i-1}) \big),
\qquad
\mathbf{c}_i = \sum_{k=1}^{K} a_{i,k}\, \mathbf{z}_k
\tag{3}
\]
To capture the correlation between the presence of traffic participants and the road profile on which they rest (e.g., a car lies on the road surface), a common attention field is proposed.
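A minimal sketch of this $N$-iteration attention, following the recurrence of Eq. 3, is shown below; the query dimensionality, feature count, and MLP shape are illustrative assumptions (the logit head must match the number of features $K$).

```python
import torch
import torch.nn as nn

class AttentionField(nn.Module):
    """Sketch of Eq. 3: iteratively re-weigh the K patch features for a query q."""
    def __init__(self, c_feat=256, q_dim=5, n_feats=64, n_iters=2):
        super().__init__()
        self.n_iters = n_iters
        # One MLP, shared across all N iterations, emitting a logit per feature.
        self.attn = nn.Sequential(
            nn.Linear(q_dim + c_feat, c_feat), nn.ReLU(), nn.Linear(c_feat, n_feats))

    def forward(self, z, q):
        # z: (B, K, C) encoder output; q: (B, q_dim) query point on the manifold.
        B, K, _ = z.shape
        a = torch.full((B, K), 1.0 / K, device=z.device)      # uniform attention start
        for _ in range(self.n_iters):
            c = torch.einsum("bk,bkc->bc", a, z)              # context c_i
            a = torch.softmax(self.attn(torch.cat([q, c], -1)), dim=-1)
        return torch.einsum("bk,bkc->bc", a, z)               # final attended feature
```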
[Fig. 2: Error in estimates of position in the ego reference frame when the inclination is neglected. Unit dimensions are assumed.]
Decoder: Given the attended features $\mathbf{c}_i$, a grid of control points $\mathbf{P}$ that governs the manifold representation of the scene is extracted by a Multi-Layer Perceptron (MLP). Each control point $\mathbf{P}_{ij} \in \mathbb{R}^3$, which makes the output of the MLP of dimension $3(n+1)(m+1)$. Given a Cartesian point $\mathbf{x}$ and the control points $\mathbf{P}$, we use a Deep Declarative Network (DDN) [ddn] layer to obtain the reverse mapping $\mathbf{x} \mapsto (u, v)$ by minimizing Eq. 4 below, obtained by rearranging Eq. 1:
\[
(u^{*}, v^{*}) = \arg\min_{u, v} \;\Big\| \mathbf{x} - \sum_{i=0}^{n}\sum_{j=0}^{m} B_i^n(u)\, B_j^m(v)\, \mathbf{P}_{ij} \Big\|_2^2
\tag{4}
\]
Subsequently, we remap the $(u, v)$ space to the isometric arc-length space $(s, t)$ through an MLP. Next, the decoder predicts the semantic class (where $M$ is the number of classes) and the waypoint offset at each of the $N$ attention iterations.
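A compact sketch of these per-query decoder heads follows; the layer sizes and head structure are assumptions, and $M = 5$ mirrors the semantic classes listed in Sec. 5.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the per-query decoder heads on the arc-length parametrization."""
    def __init__(self, c_feat=256, n_classes=5):
        super().__init__()
        # MLP remapping (u, v) to the isometric arc-length space (s, t).
        self.arc_len = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
        self.sem_head = nn.Linear(c_feat + 2, n_classes)  # M-way semantic logits
        self.off_head = nn.Linear(c_feat + 2, 2)          # waypoint offset

    def forward(self, c, uv):
        # c: (B, C) attended feature from Eq. 3; uv: (B, 2) from the DDN layer.
        st = self.arc_len(uv)
        h = torch.cat([c, st], dim=-1)
        return self.sem_head(h), self.off_head(h)
```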

4 Training and Inference
At train time, the outputs from each of the $N$ iterations are supervised to assist convergence of the loss. The predicted control points are supervised via a regression loss. For the semantic and offset predictions in the $(s, t)$ space, Cross-Entropy (CE) and regression losses are used, respectively. The final loss function takes the form shown in Eq. 5 below:
\[
\mathcal{L} = \lambda_{1}\, \mathcal{L}_{\mathrm{ctrl}} + \lambda_{2}\, \mathcal{L}_{\mathrm{sem}} + \lambda_{3}\, \mathcal{L}_{\mathrm{off}}
\tag{5}
\]
where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighing factors that control the relative importance of each of the loss terms. We predict both observed and future semantic information to obtain a more holistic understanding of the scene. For each of the semantic classes, points are sampled in an edge-aware manner to accurately capture the semantic boundaries. A coverage loss is also imposed on the sampling to ensure the SOG space is covered as uniformly as possible.

At test time, the surround monocular images and a query point $\mathbf{q}$ are the inputs. The semantic occupancy and offsets are obtained at the end of the final attention iteration.
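As an illustration of the edge-aware sampling above, here is a minimal sketch using SciPy's distance transform; the Sobel edge detector, the $1/(1+d)$ weighting, and the mixing weight `mix` are assumptions for exposition rather than the exact scheme of this work.

```python
import numpy as np
from scipy import ndimage

def edge_aware_samples(sem, n_pts, mix=0.5, rng=None):
    """Sample (row, col) query points biased toward semantic boundaries.

    sem: (H, W) integer label map. A fraction `mix` of the probability mass
    concentrates near label edges (small edge distance transform); the
    uniform remainder plays the role of the coverage term.
    """
    rng = rng or np.random.default_rng(0)
    g = sem.astype(float)
    edges = (ndimage.sobel(g, 0)**2 + ndimage.sobel(g, 1)**2) > 0
    dist = ndimage.distance_transform_edt(~edges)   # distance to nearest boundary
    w = 1.0 / (1.0 + dist.ravel())                  # high weight near boundaries
    p = mix * w / w.sum() + (1.0 - mix) / w.size    # blend edge bias with coverage
    idx = rng.choice(w.size, size=n_pts, replace=False, p=p)
    return np.stack(np.unravel_index(idx, sem.shape), axis=1)
```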
5 Experiments
In Fig. 1, we visualize a scene from Carla [carla] and show how ignoring road gradients while formulating motion planning in BEV can lead to incorrect inference. In Fig. 2, we show how ignoring the ground-plane topology leads to incorrect object localization: Figs. 2(a) and 2(b) show the error in localization and in the inputs to planning as the plane inclination changes. We plan to carry out experiments in Carla [carla] and SYNTHIA-SF [synthia-sf], which contain scenarios with different road profiles and gradients. The experiments aim to predict accurate SOG and GOF. The proposed approach will be compared with [codevilla2019exploring, lbc, prakash2021multi, chen2021learning, toromanoff2020end] when evaluating on Carla and with [ansari2018earth] when evaluating on SYNTHIA-SF [synthia-sf]. It has to be noted that [ansari2018earth] only deals with the task of object localization in 3D; hence, semantic occupancies would need to be lifted to 3D bounding boxes in order to have a fair comparison. The semantic classes that we consider are {none, road, obstacle, red-light, green-light}.
6 Conclusion
We present NMR, an approach that captures the accurate road surface for the task of end-to-end autonomous driving. We lift the raw sensor data and vehicle state to a high-dimensional latent embedding. Attention fields are used to extract the control points that govern the surface geometry, as well as semantic occupancy and offset information at any query point on the manifold. We present edge-aware sampling methods to accurately capture the occupancy information in the scene. We propose to test our approach on challenging road topologies in Carla [carla] and SYNTHIA-SF [synthia-sf].