UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View

by   Zequn Qin, et al.
Zhejiang University

Bird's-eye-view (BEV) representation is a new perception formulation for autonomous driving that is based on spatial fusion. Temporal fusion has also been introduced into BEV representation with great success. In this work, we propose a new method that unifies spatial and temporal fusion and merges them into a single mathematical formulation. The unified fusion not only provides a new perspective on BEV fusion but also brings new capabilities. With the proposed unified spatial-temporal fusion, our method supports long-range fusion, which is hard to achieve in conventional BEV methods. Moreover, the BEV fusion in our work is temporal-adaptive: the weights of temporal fusion are learnable, whereas conventional methods mainly use fixed and equal weights. In addition, the proposed unified fusion avoids the information loss of conventional BEV fusion methods and makes full use of features. Extensive experiments and ablation studies on the NuScenes dataset show the effectiveness of the proposed method, which achieves state-of-the-art performance in the map segmentation task.



1 Introduction

Recently, bird’s-eye-view (BEV) representation [pan2020cross, philion2020lift, li2021hdmapnet] has become an emerging perception formulation in the autonomous driving field. The main idea of BEV representation is to map multi-camera features into the ego BEV space, i.e., spatial fusion, as shown in Fig. 1. This kind of spatial fusion composes an integrated BEV space in which duplicate results from different cameras are represented only once, which greatly reduces the difficulty of fusing multi-camera features. Moreover, BEV spatial fusion naturally shares the same 3D space as other modalities like LiDAR and radar, making multi-modality fusion simple. The integrated BEV representation based on spatial fusion also provides the basis for temporal fusion. Temporal fusion is a cornerstone of BEV representation and is useful in many ways, such as 1) representing temporarily occluded objects; 2) accumulating observations over a long range, which can be used for generating maps; 3) stabilizing the perception results for stationary vehicles. Many methods [li2021hdmapnet, huang2022bevdet4d, li2022bevformer] have shown the importance and effectiveness of temporal fusion.

(a) Inputs with surrounding images.
(b) Map.
Figure 1: Illustration of the map segmentation task in BEV.
Figure 2: Different methods in BEV temporal fusion. From left to right, they are methods with no temporal fusion, warp-based temporal fusion, and our unified multi-view fusion. For the method with no temporal fusion, the BEV space is only predicted with surrounding images at the current time step. The warp-based temporal fusion would warp the BEV space from the previous time step and is a serial fusion method. In this work, we propose unified multi-view fusion, which is a parallel method and could support long-range fusion.

Despite this progress, present methods usually use warp-based temporal fusion, i.e., warping past BEV features to the current time according to the positions of the BEV spaces at different time steps. Although this kind of design aligns temporal information well, some problems remain open. First, the warping is usually serial; that is, it is conducted only between adjacent time steps. This makes long-range temporal fusion hard to model: long-range history can only have an implicit effect and is forgotten rapidly. Moreover, excessively long temporal fusion can even harm the performance of warp-based methods. Second, warping causes information loss during temporal fusion, as shown in Figs. 3(a) and 3(b). Third, since the warping is serial, the weights for all time steps are equal, and it is hard to fuse temporal information adaptively. To solve the above problems, we propose a new perspective that combines spatial and temporal fusion into a unified multi-view fusion, termed UniFormer. Specifically, spatial fusion is regarded as a multi-view fusion of multi-camera features. For temporal fusion, since the temporal features come from the past and are absent at the current time, we create “virtual views” for the temporal features as if they were present at the current time. The idea of “virtual views” is to treat past camera views as current views and assign them virtual locations relative to the current BEV space based on the camera motion. In this way, the whole spatial-temporal representation in BEV can be treated simply as a unified multi-view fusion that contains both current views (spatial fusion) and past virtual views (temporal fusion), as shown in Fig. 2. With the proposed unified fusion, spatial and temporal fusion are conducted in parallel. We can directly access all useful features across space and time at once, which enables long-range fusion.
Another benefit is that we can realize adaptive temporal fusion, since all temporal features are directly accessible. Meanwhile, the parallel property guarantees that no information is lost during fusion. Furthermore, the multi-view unified fusion can even support different sensors, camera rigs, and camera types at different time steps. This can bridge higher-level and heterogeneous fusion such as vehicle-side and road-side perception. For example, we can fuse information from a car’s camera and a surveillance camera on top of a traffic light, as long as they overlap in the BEV space. The contributions of this work are as follows:

  • We propose a new parallel multi-view perspective for BEV representation, which unifies spatial and temporal fusion. The proposed unified parallel multi-view fusion addresses the problems of long-range fusion and information loss, and it enables adaptive temporal fusion. The unified method can also support arbitrary camera rigs and bridge higher-level and heterogeneous fusion.

  • We analyze the widely used evaluation settings in the map segmentation task on NuScenes [caesar2020nuscenes] and propose a new setting for a more comprehensive comparison in Sec. 4.1.

  • The proposed method achieves the state-of-the-art BEV map segmentation performance on the challenging benchmark NuScenes in all settings.

2 Related Work

Spatial fusion in BEV

Spatial fusion is the basis of BEV representation, i.e., how to transform and fuse information and features from surrounding multi-camera inputs into an ego BEV space that represents the surrounding 3D world. The earliest and most straightforward method is inverse perspective mapping (IPM) [matthies1992stereo, bertozzi1996real, aly2008real, deng2019restricted], which assumes the ground surface is flat and at a fixed height. Under this assumption, spatial fusion in BEV can be conducted with a homography transformation. Note that IPM is usually applied in the image space. However, IPM struggles with ground surfaces that are not flat or whose height is unknown. Later, View Parsing Network (VPN) [pan2020cross] uses a fully connected layer to transform image features into BEV features and directly supervises the features in the BEV space in an end-to-end manner. Similarly, BEVSegFormer [peng2022bevsegformer] uses the deformable attention [zhu2020deformable] mechanism to achieve end-to-end mapping. These methods avoid an explicit mapping between the image and BEV spaces, but this property also makes it hard for them to exploit geometric priors. Based on VPN, HDMapNet [li2021hdmapnet] proposes to map only the image space to the per-camera ego BEV space in an end-to-end manner, while the multi-camera BEV spaces are fused with the camera poses. In this way, part of the geometric prior, i.e., the camera extrinsic information, is utilized. To make full use of the geometric prior in the spatial fusion of BEV space, Lift-Splat-Shoot [philion2020lift] proposes a latent depth estimation network that predicts depth for each pixel in the image space. All pixels with depth can then be directly mapped into the BEV space. Another kind of method, OFT [roddick2018orthographic], does not predict depth. OFT directly copies the features in the image space to all locations along the ray from the camera in the BEV space.

Temporal fusion in BEV

Building on spatial fusion, temporal fusion can further boost the representation in BEV space. The mainstream temporal fusion methods are warp-based [zhang2022beverse, huang2022bevdet4d, li2022bevformer]. The main idea of the warp-based method is to warp and align BEV spaces at different time steps based on the ego motion of the vehicle. The major differences lie in how the warped BEV spaces are used. BEVFormer [li2022bevformer] uses deformable self-attention to fuse warped BEV spaces, while BEVDet4D directly concatenates the warped BEV spaces. In this work, we propose a new unified spatial-temporal fusion that can directly access BEV features from all locations and time steps and does not need to warp BEV spaces.

3 Method

In this section, we elaborate on the design of our method from two aspects. First, we show the derivation of the unified multi-view fusion. Then we demonstrate the network architecture with unified multi-view fusion.

3.1 Unified Fusion with Virtual Views

As discussed in the introduction, spatial fusion is the foundation of BEV representation, while temporal fusion opens a new direction for better BEV representation. Conventional BEV temporal fusion is warp-based, as shown in Fig. 3(a). Warp-based fusion warps past BEV features and information based on the ego motion between time steps. Since all features are already organized in a pre-defined ego BEV space at a certain time step before warping, this process loses information. The actual visible range of a camera is much larger than that of the ego BEV space. For example, 100m is a conservative visible range for typical cameras, while most BEV ranges are defined as no more than 52m [li2022bevformer, philion2020lift]. It is therefore possible to obtain better BEV temporal fusion than by simply warping BEV spaces, as shown in Fig. 3(b).
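The information-loss argument can be illustrated with a deliberately simplified 1D sketch. It assumes straight-line forward motion and BEV spaces that span a fixed range around the ego position; the function name `warp_coverage` and the 10m motion are hypothetical and chosen only for illustration.

```python
def warp_coverage(front_m, rear_m, forward_motion_m):
    """Fraction of the current BEV length that a warped previous BEV can fill.

    1D simplification (an assumption of this sketch): the ego vehicle drives
    straight ahead and both BEV spaces span [-rear_m, front_m] around their
    own ego position.
    """
    cur_lo, cur_hi = -rear_m, front_m
    # The previous BEV space, expressed in the current ego frame, lies further back.
    past_lo, past_hi = cur_lo - forward_motion_m, cur_hi - forward_motion_m
    overlap = max(0.0, min(cur_hi, past_hi) - max(cur_lo, past_lo))
    return overlap / (cur_hi - cur_lo)


print(warp_coverage(50.0, 50.0, 10.0))  # 0.9
```

In this toy case the front 10m of the current BEV space cannot be filled by warping, even though a camera with a 100m visible range already observed that area at the previous time step.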

(a) Warp-based BEV fusion. The fused area is marked in gray.
(b) Actual BEV space that can be fused.
(c) Illustration of virtual views.
Figure 3: Derivation of virtual views.

To achieve better temporal fusion, we propose a new concept, the virtual view, as shown in Fig. 3(c). Virtual views are the views of sensors that are not present at the current time step; these past views are rotated and translated with respect to the ego BEV space as if they were present at the current time step. Given the rotation and translation matrices of the current and past ego BEV spaces, and the rotation, translation, and intrinsic matrices of a certain view, the virtual rotation and translation matrices of that view are obtained by composing the view's own rotation and translation with the relative motion between the past and current ego BEV spaces.

The resulting virtual rotation and translation matrices are defined in the same way for any view, and the same expression also holds for the current views, for which the past and current ego BEV spaces coincide. In this way, all views can be mapped and utilized in the same way, whether they are past or current views. The mapping between the BEV space and all views takes the coordinates in the BEV space and, using each view's intrinsic matrix together with its virtual rotation and translation, yields the corresponding homogeneous coordinates in the image space.

Then we can map the image features to the BEV features.
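The virtual-view idea can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: poses are 4x4 homogeneous matrices, ego poses are ego-to-world transforms, camera extrinsics are camera-to-ego transforms, and the camera looks along its +z axis. The helper names `make_pose`, `virtual_extrinsic`, and `project_bev_points` are hypothetical.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (rad) and a 3D translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def virtual_extrinsic(cam_to_ego, ego_past_to_world, ego_cur_to_world):
    """Camera-to-current-ego transform of a (possibly past) view.

    Assumed conventions: ego poses map ego coordinates to world coordinates,
    and cam_to_ego maps camera coordinates to the ego frame at capture time.
    """
    past_to_cur = np.linalg.inv(ego_cur_to_world) @ ego_past_to_world
    return past_to_cur @ cam_to_ego


def project_bev_points(points_ego, intrinsic, cam_to_cur_ego):
    """Project Nx3 points given in the current ego frame into one view."""
    ego_to_cam = np.linalg.inv(cam_to_cur_ego)
    homo = np.concatenate([points_ego, np.ones((len(points_ego), 1))], axis=1)
    cam = (ego_to_cam @ homo.T)[:3]        # 3xN points in the camera frame
    uvw = intrinsic @ cam                  # homogeneous pixel coordinates
    valid = uvw[2] > 1e-6                  # keep points in front of the camera
    uv = uvw[:2] / np.where(valid, uvw[2], 1.0)
    return uv.T, valid
```

For a current view the past and current ego poses are identical, so `virtual_extrinsic` returns the ordinary camera extrinsic; this matches the statement that current and past views are handled by the same expression.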

Figure 4: Network architecture.

3.2 Network Design with Unified Fusion

With the help of the unified multi-view fusion, we show the network architecture in this part. The network is composed of three parts: the backbone network, the unified multi-view fusion Transformer, and the segmentation head, as shown in Fig. 4.


Backbone

We use three widely used backbones, ResNet50 [he2016deep], Swin-Tiny [liu2021swin], and VoVNet [lee2019energy], to extract multi-scale features from multi-camera images. For the ResNet50 and VoVNet models, only features from stages 2, 3, and 4 are used. Following Deformable-DETR [zhu2020deformable], an extra 3x3 convolution with a stride of 2 is used to generate the last feature. The backbone is shared across the images of all views. It is worth mentioning that the features of past images can be kept in a feature queue and reused without extra computational cost.
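A feature queue of this kind could look like the sketch below. The class name `FeatureQueue`, its methods, and the choice to store the ego pose next to each feature are assumptions made for illustration; the source only states that past features are kept in a queue and reused.

```python
from collections import deque


class FeatureQueue:
    """Keeps backbone features of the current and most recent past time steps."""

    def __init__(self, max_past_steps):
        # One slot for the current step plus max_past_steps past steps;
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.buffer = deque(maxlen=max_past_steps + 1)

    def push(self, features, ego_pose):
        """Store features computed once by the backbone, with their ego pose."""
        self.buffer.append((features, ego_pose))

    def views(self):
        """Return all stored (features, ego_pose) pairs, oldest first."""
        return list(self.buffer)
```

Because each frame passes through the backbone only once, adding more past steps increases memory and fusion cost but not backbone cost.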

Fusion Transformer

We use a Transformer [NIPS2017_3f5ee243] encoder to fuse features from all views. There are four major parts in the Transformer encoder: the BEV queries, the self-attention module, the cross-attention module, and the self-regression mechanism. To represent the BEV space, we use $H \times W$ queries arranged in a 2D grid, where $H$ and $W$ are the spatial sizes of the BEV grid. The second major part is the self-attention module, which lets all BEV queries interact and exchange information in the BEV space. Since the time complexity of vanilla self-attention is $O((HW)^2)$, we use deformable self-attention [zhu2020deformable] to reduce the computational cost.

The most important module of this work is the cross-attention used for unified multi-view spatial-temporal fusion. With the help of the unified multi-view fusion, all spatial-temporal features can be mapped to the same ego BEV space, and the goal of the cross-attention module is to fuse and integrate these mapped features. Each cell of the 2D BEV grid has real-world coordinates, and several real-world heights are sampled above it, so each BEV query corresponds to as many points as there are sampling heights, and the total number of coordinates in the BEV space is the number of BEV queries multiplied by the number of sampling heights. The mapped BEV features are obtained by projecting these coordinates into every view with the mapping of Sec. 3.1. For a given number of time steps in temporal fusion, the cross-attention (CA) output of a BEV query is an attention-weighted summation of sampled values, where each value is sampled at one of the query's points from the BEV features of one multi-scale level and one time step, and the summation runs over time steps, scales, and heights. The attention value of each term depends on the dimension of each BEV query and on an attention key composed of the input and the positional embedding.
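The summation over time steps, scales, and heights can be sketched for a single BEV query as below. The exact attention formula is not recoverable from the text, so this sketch assumes a standard scaled dot-product softmax; the function name `unified_cross_attention` and the tensor layout are likewise assumptions.

```python
import numpy as np


def unified_cross_attention(query, keys, values):
    """Fuse sampled values for one BEV query.

    query:  (d,)          one BEV query
    keys:   (T, L, Z, d)  attention keys per time step, scale level, height
    values: (T, L, Z, c)  features sampled at the projected points
    Assumption: weights are a scaled dot-product softmax over all T*L*Z terms.
    """
    d = query.shape[0]
    scores = keys.reshape(-1, d) @ query / np.sqrt(d)   # (T*L*Z,)
    scores = scores - scores.max()                      # numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()
    fused = weights @ values.reshape(len(weights), -1)  # (c,)
    return fused, weights.reshape(keys.shape[:3])
```

When all scores are equal, the weights become uniform and the output reduces to a plain average over time steps, scales, and heights, which corresponds to a fixed equal-weighted fusion.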

Setting        Front/rear range   Left/right range   BEV grid size    Map element type   Line width   Split
100m × 100m    50m                50m                0.5m × 0.5m      Line, polygon      1-pixel      Vanilla
60m × 30m      30m                15m                0.15m × 0.15m    Line               5-pixel      Vanilla
160m × 100m    100m/60m           50m                0.25m × 0.25m    Line               3-pixel      City-based
Table 1: Comparison of different map segmentation settings on NuScenes.

In this way, we can use BEV queries to iterate over features from different places in the BEV space, time steps, multi-scale levels, and sampling heights. Information from all places and all times can be directly retrieved in a unified manner without any loss. This design also makes long-range fusion possible, since all features are directly accessed no matter how far back in time they lie, which in turn enables adaptive temporal fusion.

The last major part of our method is the self-regression mechanism. BEVFormer [li2022bevformer] concatenates the warped previous BEV features with the BEV queries before the self-attention module to realize temporal fusion. Inspired by this, we use a self-regression mechanism that concatenates the output of the Transformer with the BEV queries as the new inputs and reruns the Transformer to get the final features. For the first run of the Transformer, we simply duplicate the BEV queries and concatenate them as the inputs. In BEVFormer, the concatenation of warped BEV features and BEV queries is believed to bring temporal fusion and to be the root cause of the performance gain. In this work, we propose another explanation for this phenomenon: the concatenation of BEV features and queries implicitly deepens the Transformer and doubles its number of layers. Because the warped BEV features have already been processed by the Transformer at previous time steps, the concatenation can be viewed as grafting two successive Transformers together. In this way, a simple self-regression without warping can achieve a performance gain similar to that of BEVFormer. The detailed ablation study can be found in Sec. 4.3.
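The two-pass structure of the self-regression mechanism can be sketched as follows. Here `transformer` stands for any callable that maps concatenated inputs of width 2d back to width d; the function name `self_regression` and the concatenation order are assumptions of this sketch.

```python
import numpy as np


def self_regression(transformer, bev_queries):
    """Run `transformer` twice, feeding its first output back with the queries.

    transformer: callable mapping an (N, 2d) array to an (N, d) array
    bev_queries: (N, d) array of BEV queries
    """
    # First run: the BEV queries are duplicated and concatenated.
    first_in = np.concatenate([bev_queries, bev_queries], axis=-1)
    first_out = transformer(first_in)
    # Second run: the previous output replaces one copy of the queries
    # (the order of the two halves is an assumption).
    second_in = np.concatenate([first_out, bev_queries], axis=-1)
    return transformer(second_in)
```

Since the same module is applied twice in succession, the effective depth is doubled without any warping, which is the interpretation offered above.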

Segmentation head

We use a lightweight, fully convolutional model ERFNet [romera2017erfnet] as our segmentation head, which will upsample the output of the Transformer to the given BEV space resolution.

4 Experiments

4.1 Dataset and Evaluation Settings


Dataset

In this work, we use NuScenes [caesar2020nuscenes] as the evaluation dataset for the map segmentation task. It contains 1,000 driving scenes collected in Boston and Singapore, with 28,130 and 6,019 keyframes in the training and validation sets, respectively. Each keyframe contains six surrounding images.

Evaluation settings

There are two widely used settings for the map segmentation task on NuScenes. The first one is the 100m × 100m setting [philion2020lift, li2022bevformer, xie2022m] with two classes, road and lane. The other one is the 60m × 30m setting [li2021hdmapnet, peng2022bevsegformer, zhang2022beverse] with three classes: boundary, divider, and ped crossing. In this work, we also propose a new 160m × 100m setting for a more comprehensive evaluation, as shown in Tab. 1. The key motivations of the new setting are: 1) the evaluation range should be as large as the visible limit; 2) the evaluation criterion should be discriminative for both bad and good predictions; 3) the evaluation should avoid overfitting and show the ability of generalization. The detailed information, motivation, and derivation of the new setting can be found in the supplementary materials. In the new setting, we also use two difficulty levels, “easy” and “hard”. For the “easy” level, the evaluation is conducted with front, rear, left, and right ranges of 50m, 30m, 30m, and 30m, respectively. The “hard” level is conducted on the remaining areas of the 160m × 100m range. For all settings, mean intersection-over-union (mIoU) is used as the evaluation metric.
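For reference, a per-sample mIoU over class masks can be sketched as below. The mask layout (one boolean channel per class) and the handling of classes that are absent from both prediction and ground truth are assumptions of this sketch; benchmark implementations may instead accumulate intersections and unions over the whole validation set.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU for boolean masks of shape (num_classes, H, W)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred[c], target[c]).sum()
        union = np.logical_or(pred[c], target[c]).sum()
        if union > 0:  # skip classes absent from both maps (assumption)
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")
```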

4.2 Implementation Details

To evaluate the results of our method, we use ResNet50 [he2016deep], Swin-Tiny [liu2021swin], and VoVNet [lee2019energy] as our backbones. The ResNet50 and Swin backbones are initialized from ImageNet [deng2009imagenet] pretraining, and the VoVNet backbone is initialized from a DD3D checkpoint [park2021pseudo]. The default number of Transformer layers is set to 12. ResNet50 and Swin share the same input image resolution, while VoVNet uses a smaller image size. We use the AdamW [loshchilov2018decoupled] optimizer with a learning rate of 2e-4 and a weight decay of 1e-4. The learning rate is decreased by a factor of 10 for the backbone. The batch size is set to 1 per GPU, and models are trained on eight GPUs for 24 epochs. At the 20th epoch, the learning rate is decreased by a factor of 10. The number of multi-scale features and the number of sampling heights are kept at their default values, and the sampling heights are spaced with a stride of 2m. The default number of previous time steps is 2 during training and 6 during inference (Sec. 4.3). For the 100m × 100m setting, we use BEV queries to represent the whole BEV space, and the results are then upsampled by a factor of 4 to match the BEV resolution. For the 60m × 30m setting, we use BEV queries with an upsampling similar to that of the 100m × 100m setting. For the 160m × 100m setting, we use BEV queries and then upsample 8x to match the ground truth resolution. We use cross-entropy loss to train on all settings. The loss weight for the background class is set to 0.4 by default to handle the class imbalance problem. Since the road class in the 100m × 100m setting is a polygon area without the class imbalance problem, the loss weight of the road background class is set to 1.0.
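The relation between BEV range, grid size, and upsampling factor can be made concrete with a short calculation. The ranges and cell sizes are taken from Tab. 1; pairing the upsampling factors 4, 4, and 8 with the three settings in table order, and treating the query grid as the output grid divided by the upsampling factor, are assumptions of this sketch, so the resulting query counts are implied values rather than numbers stated in the text.

```python
SETTINGS = {
    # name: (front_m, rear_m, left_m, right_m, cell_m, upsample)
    "100m x 100m": (50, 50, 50, 50, 0.5, 4),
    "60m x 30m": (30, 30, 15, 15, 0.15, 4),
    "160m x 100m": (100, 60, 50, 50, 0.25, 8),
}


def grid_and_queries(front, rear, left, right, cell, upsample):
    """Return the output BEV grid size and the implied query grid size."""
    rows = round((front + rear) / cell)
    cols = round((left + right) / cell)
    return (rows, cols), (rows // upsample, cols // upsample)


for name, params in SETTINGS.items():
    print(name, *grid_and_queries(*params))
```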

4.3 Ablation Study

Ability of long-range fusion

As discussed in the Introduction, the proposed unified multi-view fusion supports long-range fusion since it can directly access both spatial and temporal information. In this part, we show the results of different numbers of fusion time steps to examine this ability.

Figure 5: Ability of long-range temporal fusion.

From Fig. 5, we can see that our method consistently benefits from longer temporal fusion, even up to 10 steps, which corresponds to a fusion duration of 2 seconds. In contrast, the performance of the warp-based BEVFormer drops after 3 fusion steps. This agrees with the results in BEVFormer [li2022bevformer], where the performance of warp-based temporal fusion decreases when fusing more than 4 contiguous steps. This shows the effectiveness of the proposed multi-view unified temporal fusion and its ability of long-range fusion. Since the performance gradually converges after 6 fusion steps, we set the number of temporal fusion steps to 6 in this work.

Disentangled training and inference fusion

Although the proposed unified fusion has the ability of long-range fusion, this also brings another problem of computational complexity, especially during training. Longer fusion steps demand more memory and computational cost. We find a phenomenon that can alleviate this problem, i.e., the number of temporal fusion steps during training does not need to be the same as the one during inference. And a model trained with a short-range fusion setting still has the ability of long-range fusion during inference. We call this phenomenon disentangled training and inference fusion. The results are shown in Tab. 2.

#Fusion steps (training)   #Fusion steps (inference)   Road mIoU   Lane mIoU
0                          0                           79.04       22.64
1                          1                           79.48       23.03
1                          6                           81.12       24.24
2                          6                           80.91       24.99
3                          6                           81.02       24.48
4                          6                           81.25       24.75
Table 2: Comparison of different numbers of temporal fusion steps. Note that the number of steps does not include the current step.

From Tab. 2, we can see that no matter how many temporal fusion steps we use during training, the performance is very close when using 6 inference fusion steps. Moreover, even if we use only one previous step during training, the model still gains good performance with 6 temporal steps during inference. That is to say, the model still has the ability of long-range fusion when trained with a short-range fusion setting. By default, we use 2 temporal fusion steps during training.

Effectiveness of self-regression mechanism

In Sec. 3.2, we propose a self-regression mechanism to further boost the performance. In this part, we examine the effectiveness of the self-regression mechanism. As shown in Tab. 3, the model with self-regression always gains better performance. Interestingly, the performance of the 12-layer non-regression model is close to that of the 6-layer self-regression model. This verifies the analysis in Sec. 3.2. Moreover, we can see that the number of layers is also important for the final performance.

#Layers   Self-Reg   Road mIoU   Lane mIoU
6         No         80.42       24.26
6         Yes        80.91       24.99
12        No         81.13       25.29
12        Yes        81.97       25.76
Table 3: Comparison of different numbers of Transformer layers with and without self-regression.
Method        Years    Backbone     Parameters   FLOPs    FPS   Road mIoU (Vanilla / City-based)   Lane mIoU (Vanilla / City-based)
LSS           ECCV20   EffNetb0     -            -        -     72.9 / -                           20.0 / -
VPN*          IROS20   Res101DCN    -            -        -     76.9 / -                           19.4 / -
LSS*          ECCV20   Res101DCN    -            -        -     77.7 / -                           20.0 / -
M2BEV         -        ResNeXt101   112.5        -        1.4   77.2 / -                           40.5† / -
BEVFormer     -        Res101DCN    68.7         1303.5   1.7   80.1 / -                           25.7 / -
BEVFormer**   -        ResNet50     35.6         1020.8   4.1   80.6 / 41.9                        22.4 / 9.6
UniFormer     -        ResNet50     42.4         1586.7   2.6   82.0 / 42.6                        25.8 / 11.2
UniFormer     -        VoVNet99     84.0         -        2.7   85.4 / 47.9                        31.0 / 11.6
Table 4: Experiments on NuScenes with the 100m × 100m setting. * means the results are reported from BEVFormer [li2022bevformer]. † indicates that M2BEV uses a different setting, in which the BEV resolution is 2x larger, so its “Lane mIoU” is high. ** means that BEVFormer is re-implemented in this work.
Method         Years    Backbone    mIoU (Vanilla / City-based)
                                    Divider       Ped Crossing   Boundary      All
VPN*           IROS20   EffNetb0    36.5 / -      15.8 / -       35.6 / -      29.3 / -
LSS*           ECCV20   EffNetb0    38.3 / -      14.9 / -       39.3 / -      30.8 / -
HDMapNet       ICRA22   EffNetb0    40.6 / -      18.7 / -       39.5 / -      32.9 / -
BEVSegFormer   -        ResNet101   51.1 / -      32.6 / -       50.0 / -      44.6 / -
BEVerse        -        Swin-tiny   56.1 / -      44.9 / -       58.7 / -      53.2 / -
BEVFormer**    -        ResNet50    53.0 / 20.4   36.6 / 8.9     54.1 / 24.3   47.9 / 17.9
UniFormer      -        Swin-tiny   58.6 / 32.4   43.3 / 17.2    59.0 / 29.8   53.6 / 26.5
UniFormer      -        VoVNet99    60.6 / 32.5   49.0 / 11.5    62.5 / 32.9   57.4 / 25.6
Table 5: Experiments on NuScenes with the 60m × 30m setting. * means the results are reported from HDMapNet [li2021hdmapnet]. ** means that BEVFormer is re-implemented in this work.

Unified cross attention brings adaptive temporal fusion

In Sec. 3, we show that the core design of the unified multi-view spatial-temporal fusion is the unified cross-attention module based on virtual views. The cross-attention module can iterate over features from different time steps, which brings another important property, i.e., adaptive temporal fusion. To verify this, we directly average the temporal features before feeding them into the Transformer as the counterpart for comparison, which can be viewed as a fixed equal-weighted fusion. The results are shown in Tab. 6.

Method 1 2 3 4 5 6
UniFormer 24.03 25.08 25.46 25.61 25.72 25.76
Avg. 23.26 24.47 24.82 24.95 25.03 25.08
Table 6: Effectiveness of adaptive temporal fusion with different number of fusion steps. “Avg.” is the equal-weighted fusion.

We can see that our method outperforms the equal-weighted temporal fusion counterpart in all settings. This shows that our method could adaptively fuse information from different time steps.

4.4 Results

To validate the performance of our method, we use VPN [pan2020cross], Lift-Splat-Shoot [philion2020lift], M2BEV [xie2022m], and BEVFormer [li2022bevformer] for comparison in the 100m × 100m setting, as shown in Tab. 4. The number of parameters, FLOPs, FPS, Road mIoU, and Lane mIoU are compared. The FPS of our method is measured on an RTX 3090 GPU. We can see that the proposed method with a ResNet50 backbone even outperforms the BEVFormer model with a ResNet101DCN [dai2017deformable, wang2021fcos3d] backbone. In the road class, our method outperforms the previous SOTA BEVFormer by 1.9 points with the vanilla split. It is worth mentioning that BEVFormer uses many more BEV queries than ours, which could benefit the segmentation of thin lane lines. Nevertheless, our method still outperforms BEVFormer in the lane class with a smaller backbone and fewer BEV queries, which shows the effectiveness of the proposed UniFormer. In addition, our method is faster than the original BEVFormer and M2BEV (2.6 FPS vs. 1.7 and 1.4 FPS), although the re-implemented ResNet50 BEVFormer runs at 4.1 FPS. Finally, our method with a larger VoVNet99 backbone outperforms BEVFormer by more than 5 points in all classes with the vanilla split. The VoVNet99 model is faster than the ResNet50 model because VoVNet’s input image resolution is smaller, as discussed in Sec. 4.2.

Method      Years    Backbone   mIoU (Easy)                                               mIoU (Hard)
                                Divider       Crossing      Boundary      All           Divider      Crossing     Boundary      All
VPN         IROS20   ResNet50   25.4 / 8.3    6.7 / 0.5     25.3 / 14.6   19.1 / 7.8    13.4 / 2.9   4.3 / 0.0    13.1 / 6.5    10.3 / 3.1
LSS         ECCV20   ResNet50   11.3 / 6.4    0.3 / 0.2     10.8 / 4.4    7.5 / 3.7     6.0 / 1.2    0.4 / 0.2    6.2 / 1.1     4.2 / 0.8
BEVFormer   -        ResNet50   42.2 / 16.1   26.9 / 7.6    42.1 / 18.6   37.1 / 14.1   27.3 / 7.8   17.5 / 2.3   26.3 / 10.0   23.7 / 6.7
UniFormer   -        ResNet50   46.3 / 18.5   30.5 / 10.5   45.8 / 21.0   40.9 / 16.7   28.1 / 8.8   17.6 / 2.7   26.9 / 10.2   24.2 / 7.2
Table 7: Comparison on NuScenes with the 160m × 100m setting. We reimplement other methods with the same setting for comparison. All results are reported in the format Vanilla split / City-based split.
Figure 6: Visualization of our method on the NuScenes val set under complex road structures. From left to right: surrounding images, predictions, and ground truth. The red rectangle represents the ego car.

For the 60m × 30m setting, we adopt VPN [pan2020cross], Lift-Splat-Shoot [philion2020lift], HDMapNet [li2021hdmapnet], BEVSegFormer [peng2022bevsegformer], and BEVerse [zhang2022beverse] for comparison. The comparison results are shown in Tab. 5, from which we can see that our method still obtains the best results. To better evaluate different models and provide a scenario that is closer to real-world autonomous driving, we also introduce the new 160m × 100m setting. We use VPN [pan2020cross], LSS [philion2020lift], BEVFormer [li2022bevformer], and our method with the same backbone, input resolution, training setting, and segmentation head for comparison. The results are shown in Tab. 7. From Tab. 7 we can see that the visible range is crucial for the map segmentation task, and the relatively low performance suggests that large-range real-world map segmentation is still an open problem. Our method again obtains the best performance. It should be noted that the vanilla NuScenes train/val sets contain many similar samples, so evaluation based on the vanilla split is likely to be influenced by overfitting. We therefore introduce the new city-based split for NuScenes; the results can be seen in Tabs. 4, 5, and 7. With the city-based split, the performance of all methods drops significantly, which points to the overfitting problem in the vanilla split on NuScenes. The limited improvement of VoVNet in Tab. 5 with the city-based split also indicates a generalization problem. This could be an important direction for future work.

4.5 Visualization

In this section, we show the visualization results of our method in Fig. 6. From Fig. 6 we can see that our method obtains good results under complex road structures. It is worth mentioning that our method can even segment parts that are missing from the ground truth, as shown in the second row of Fig. 6. Moreover, for the irregular road boundary in the third row, our method still obtains good results.

5 Conclusion

In this work, we propose a unified spatial-temporal fusion method for BEV representation, termed UniFormer. Different from previous methods that use warping, we propose a new concept, i.e., virtual views, which merges spatial and temporal fusion into a unified formulation. With this design, we can realize long-range and adaptive temporal fusion with no information loss. The experiments and visualizations in this work validate the effectiveness of our method.


1 Overview

In this part, we provide more detailed illustration, explanation, and visualization of the following aspects: 1) the motivation of the new setting; 2) the long-range fusion ability of warp-based methods; 3) a visual comparison of different methods.

2 Motivation of the 160m × 100m setting

Generally speaking, we propose a new setting that differs from the existing 100m × 100m and 60m × 30m settings in BEV range, line width of map elements, and split. The key motivations of this setting are: 1) the evaluation range should be as large as the visible limit; 2) the evaluation criterion should be discriminative for both bad and good predictions; 3) the evaluation should avoid overfitting and show the ability of generalization.

2.1 BEV Range

To determine the BEV range, we consider the visible limit of the cameras. In this work, we define the visible range as the distance beyond which a lane is represented by fewer than two pixels in the feature map (since we need to distinguish the left and right boundaries of the lane, two pixels is the minimum requirement). Suppose $f$ is the focal length of the camera, $N$ is the minimal number of pixels needed to represent a lane, and $W$ is the width of the lane. The visible limit $d$ can be written as:

$$d = \frac{f \cdot W}{N}$$

Figure A: Derivation of BEV range.

An example of the derivation is shown in Fig. A. Typically, the focal length on NuScenes can be derived from the FOV and image resolution. Suppose the image resolution is $w$ and the FOV is $\theta$; then the focal length $f$ is:

$$f = \frac{w}{2\tan(\theta / 2)}$$

| Image resolution (px) | FOV (deg, front / rear) | Focal length (px, front / rear) | Lane width | Number of pixels |
|---|---|---|---|---|
| 1600 | 70 / 110 | 1142.5 / 560.2 | 3.0m | 32 |
Table A: The values on the NuScenes dataset. For the FOV and focal length, we list the values of the front and rear cameras separately. Lane width is about 3.0m-4.0m depending on local regulations, and we use the minimum value of 3.0m. Since the common network output stride is 32 or larger, one pixel in the feature map corresponds to at least 32 pixels in the original image.

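The detailed numbers are shown in Tab. A. Finally, we get the front and rear BEV ranges:

$$d_{\text{front}} = \frac{1142.5 \times 3.0}{32} \approx 107.1\text{m}, \qquad d_{\text{rear}} = \frac{560.2 \times 3.0}{32} \approx 52.5\text{m}$$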


However, the rear BEV range of 52.5m is slightly short for real scenarios, so we extend it to 60m. For the left and right range, we follow the existing setting with a distance of 50m. This composes the setting.
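The derivation in this subsection can be checked numerically with a minimal sketch. It assumes the standard pinhole relations (focal length equals half the image width divided by the tangent of half the FOV, and visible limit equals focal length times lane width divided by the minimal pixel count), which reproduce the focal lengths in Tab. A and the 52.5m rear range. The function and constant names are illustrative and not taken from the authors' code.

```python
import math


def focal_length_px(image_width_px: float, fov_deg: float) -> float:
    """Pinhole focal length in pixels from the horizontal FOV and image width."""
    return image_width_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


def visible_limit_m(focal_px: float, lane_width_m: float, min_pixels: float) -> float:
    """Distance at which a lane of the given width spans exactly min_pixels."""
    return focal_px * lane_width_m / min_pixels


# Values from Tab. A (the names are assumptions made for this sketch).
IMAGE_WIDTH_PX = 1600
LANE_WIDTH_M = 3.0
MIN_PIXELS = 32

for name, fov_deg in (("front", 70.0), ("rear", 110.0)):
    f = focal_length_px(IMAGE_WIDTH_PX, fov_deg)
    d = visible_limit_m(f, LANE_WIDTH_M, MIN_PIXELS)
    print(f"{name}: focal length = {f:.1f} px, visible limit = {d:.1f} m")
# front: focal length = 1142.5 px, visible limit = 107.1 m
# rear: focal length = 560.2 px, visible limit = 52.5 m
```

Under these assumptions the rear camera is the limiting factor, which is consistent with the rear range being the one that is extended by hand.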

2.2 Evaluation criterion

The first difference in the evaluation criterion is that all map elements are defined as “Line”. This is because polygon areas are not suitable for representing road structures, and the mIoU metric computed on polygons is abnormally high. For example, the “Road mIoU” is about 80 while the “Lane mIoU” is only about 20. The second difference is the line width. In this work, we use 3-pixel-wide lines to avoid the problems of 1-pixel evaluation. With 1-pixel lines, if the predicted lane is shifted by only 1 pixel from the ground truth, the mIoU is 0, so the metric cannot discriminate between “wrong but close” and “totally wrong” cases. This property causes another problem: if we simply upsample the ground truth and make the prediction work at a higher resolution, the performance increases significantly, which leads to an unfair comparison between methods. With 3-pixel-wide lines, predictions that are close to the ground truth but not exactly correct still receive partial credit, so the evaluation is more discriminative. As for the upsampling problem, since we turn the original 1-pixel “lane mIoU” into a 3-pixel “area mIoU”, the upsampled results are less affected.
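The effect of line width on IoU can be illustrated with a toy example. The sketch below is not the authors' evaluation code: it rasterizes a straight vertical line as a set of pixels (a simplifying assumption, with hypothetical helper names) and compares the IoU of 1-pixel and 3-pixel-wide lines when the prediction is shifted sideways.

```python
def vertical_line(col: int, height: int, width_px: int = 1) -> set:
    """Pixels (row, col) of a vertical line centered at `col`; width_px must be odd."""
    half = width_px // 2
    return {(r, c) for r in range(height) for c in range(col - half, col + half + 1)}


def iou(pred: set, gt: set) -> float:
    """Intersection over union of two pixel sets."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


HEIGHT = 100   # toy BEV map height in pixels (assumption)
GT_COL = 50    # column of the ground-truth line (assumption)

for shift in (0, 1, 2, 3):
    thin = iou(vertical_line(GT_COL + shift, HEIGHT, 1), vertical_line(GT_COL, HEIGHT, 1))
    thick = iou(vertical_line(GT_COL + shift, HEIGHT, 3), vertical_line(GT_COL, HEIGHT, 3))
    print(f"shift={shift}px  1-pixel IoU={thin:.2f}  3-pixel IoU={thick:.2f}")
# shift=0px  1-pixel IoU=1.00  3-pixel IoU=1.00
# shift=1px  1-pixel IoU=0.00  3-pixel IoU=0.50
# shift=2px  1-pixel IoU=0.00  3-pixel IoU=0.20
# shift=3px  1-pixel IoU=0.00  3-pixel IoU=0.00
```

In this toy case the 1-pixel metric drops straight from 1 to 0, while the 3-pixel metric degrades gradually with the shift, which is the discriminative behavior described above.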

2.3 City-based split

In our setting, we also propose a city-based split for NuScenes. This is because the vanilla training and validation splits in NuScenes contain many similar scenes and therefore potentially suffer from the overfitting problem. We thus propose a split that is based on the cities and locations in NuScenes. NuScenes is collected in four places: “singapore-onenorth”, “singapore-queenstown”, “singapore-hollandvillage”, and “boston-seaport”. We use the samples collected in “singapore-queenstown” and “singapore-hollandvillage” as the training split, and those from “singapore-onenorth” and “boston-seaport” as the validation split. The numbers of training and validation samples are 26,093 and 8,056, respectively. For comparison, the numbers of training and validation samples in the vanilla split are 28,130 and 6,019, respectively. The detailed split list can be found at https://github.com/cfzd/UniFormer.
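The split rule itself is simple to express in code. The following is a minimal sketch, assuming each sample is available as a dictionary with a hypothetical `location` key holding one of the four place names; the record layout and function name are illustrative, and the authoritative split list is the one published in the repository.

```python
TRAIN_LOCATIONS = {"singapore-queenstown", "singapore-hollandvillage"}
VAL_LOCATIONS = {"singapore-onenorth", "boston-seaport"}


def city_based_split(samples):
    """Partition samples into (train, val) by the location where they were collected."""
    train, val = [], []
    for sample in samples:
        location = sample["location"]  # hypothetical field name
        if location in TRAIN_LOCATIONS:
            train.append(sample)
        elif location in VAL_LOCATIONS:
            val.append(sample)
        else:
            raise ValueError(f"unknown location: {location!r}")
    return train, val


# Toy records standing in for real samples (tokens are made up).
samples = [
    {"token": "a", "location": "singapore-queenstown"},
    {"token": "b", "location": "boston-seaport"},
    {"token": "c", "location": "singapore-hollandvillage"},
    {"token": "d", "location": "singapore-onenorth"},
]
train, val = city_based_split(samples)
print([s["token"] for s in train], [s["token"] for s in val])
# ['a', 'c'] ['b', 'd']
```

Because the two location sets are disjoint, no place seen during training appears in validation, which is the property the city-based split is designed to guarantee.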

3 Visualization and Comparison

In this part, we show the visualization results on NuScenes with the setting. We also show the results of other methods for comparison in Fig. B. Our method obtains the best results, and its line predictions are smooth and clear.

Figure B: The visual comparison on the city-based val split of NuScenes with the setting. Best viewed when zoomed in.