READ: Large-Scale Neural Scene Rendering for Autonomous Driving

by   Zhuopeng Li, et al.
Zhejiang University

Synthesizing free-view photo-realistic images is an important task in multimedia. With the development of advanced driver assistance systems (ADAS) and their applications in autonomous vehicles, experimenting with different scenarios has become a challenge. Although photo-realistic street scenes can be synthesized by image-to-image translation methods, these methods cannot produce coherent scenes due to the lack of 3D information. In this paper, a large-scale neural rendering method is proposed to synthesize autonomous driving scenes (READ), which makes it possible to synthesize large-scale driving scenarios on a PC through a variety of sampling schemes. To represent driving scenarios, we propose an ω rendering network to learn neural descriptors from sparse point clouds. Our model can not only synthesize realistic driving scenes but also stitch and edit driving scenes. Experiments show that our model performs well in large-scale driving scenarios.









1. Introduction

Synthesizing free-view photo-realistic images is an important task in multimedia (Chen, 2019). In particular, synthesized large-scale street views are essential to a series of real-world applications, including autonomous driving (Li et al., 2019; Kim et al., 2013), robot simulation (Dosovitskiy et al., 2017; Wiriyathammabhum et al., 2016), object detection (Zhang et al., 2021b, c; He et al., 2021), and image segmentation (Ying et al., 2021; Gao et al., 2020; Tang et al., 2020a). As illustrated in Fig. 1, the objective of neural scene rendering is to synthesize the 3D scene from a moving camera, so that the user can browse the street scenery from different views and conduct autonomous driving simulation experiments. In addition, this can generate multi-view images to provide data for multimedia tasks.

With the development of autonomous driving, it is challenging to conduct experiments in various driving scenarios. Due to complicated geographic locations, varying surroundings, and road conditions, it is usually difficult to simulate outdoor environments. Additionally, it is hard to model some unexpected traffic scenarios, such as car accidents, where simulators can help to reduce the reality gap. However, the data generated by widely used simulators like CARLA (Dosovitskiy et al., 2017), which rely on the conventional rendering pipeline, are far different from real-world scenes.

The image-to-image translation-based methods (Gao et al., 2020; Tang et al., 2020a; Isola et al., 2017; Tang et al., 2020b) synthesize street views from semantic labels by learning the mapping between source images and targets. Despite generating encouraging street scenes, they produce some large artifacts and incoherent textures. Moreover, the synthesized image has only a single view and cannot provide the rich multi-view traffic conditions needed for autonomous vehicles. This hinders their use in many real-world applications.

Recently, Neural Radiance Field (NeRF) based methods (Zhang et al., 2021a; Niemeyer and Geiger, 2021; Mildenhall et al., 2020; Wang et al., 2021b) have achieved promising results in synthesizing photo-realistic scenes from multiple views. As suggested in (Deng et al., 2022), they cannot produce reasonable results with only a few input views, which typically happens in driving scenarios where objects appear in only a few frames. Moreover, the NeRF-based methods mainly render either interiors or single objects. They have difficulty synthesizing large-scale driving scenes with complicated environments, where large artifacts occur in close-up views and surroundings. To tackle this problem, NeRFW (Martin-Brualla et al., 2021) makes use of additional depth and segmentation annotations to synthesize an outdoor building, which takes about two days with 8 GPU devices. Such a long reconstruction time is mainly due to unnecessary sampling of the vast empty spaces.

Unlike the NeRF-based methods that purely depend on per-scene fitting, neural rendering approaches (Thies et al., 2019; Wang et al., 2021a; Wu et al., 2021) can be effectively initialized via neural textures, which are stored as maps on top of a 3D mesh proxy. Similarly, NPBG (Aliev et al., 2020) learns neural descriptors from a raw point cloud to encode the local geometry and appearance, which avoids sampling rays in empty scene space because the classical point cloud already reflects the real-world geometry of the scene. Moreover, ADOP (Rückert et al., 2021) improves NPBG by adding a differentiable camera model with a tone mapper, introducing a formulation that better approximates the spatial gradient of pixel rasterization. In general, point-based neural rendering methods can synthesize a larger scene with fewer captured images by initializing the scene with three-dimensional point cloud information. Although neural rendering-based methods can synthesize photo-realistic novel views in both indoor and outdoor scenes, it is still very challenging to deal with large-scale driving scenarios due to limitations on model capacity, as well as constraints on memory and computation. Additionally, it is difficult to render photo-realistic views rich in buildings, lanes, and road signs, because the sparse point cloud data obtained from few input images usually contain many holes.

In this paper, we propose an effective neural scene rendering approach, which makes it possible to synthesize large-scale driving scenarios through efficient Monte Carlo sampling, screening of large-scale point clouds, and patch sampling. It is worth mentioning that our method synthesizes large-scale driving scenarios with an average of two days of training on a PC with two RTX 2070 GPUs. This greatly reduces the computational cost, so that large-scale scene rendering can be achieved on affordable hardware. For sparse point clouds, we fill in the missing areas by multi-scale feature fusion. To synthesize photo-realistic driving scenes from sparse point clouds, we propose an ω rendering network that filters neural descriptors through basic gate modules and fuses features of the same scale and of different scales with different strategies. Through the ω rendering network, our model can not only synthesize realistic scenes but also edit and stitch scenes via neural descriptors. Moreover, we are able to update specific areas and stitch them together with the original scene. Scene editing can be used to synthesize diverse driving scene data from different views, even for traffic emergencies.

The main contributions of this paper are summarized as: 1) Based on our neural rendering engine (READ), a large-scale driving simulation environment is constructed to generate realistic data for advanced driver assistance systems; 2) An ω rendering network is proposed to obtain a more realistic and detailed driving scenario, where multiple sampling strategies are introduced to enable synthesis of large-scale driving scenes; 3) Experiments on the KITTI benchmark (Geiger et al., 2012) and the Brno Urban dataset (Ligocki et al., 2020) show good qualitative and quantitative results, and the driving scenes can be edited and stitched so as to synthesize larger and more diverse driving data.

2. Related Work

2.1. Image-to-image Translation

Many researchers (Gao et al., 2020; Tang et al., 2020a; Isola et al., 2017; Tang et al., 2020b; Richter et al., 2021) employ image-to-image translation techniques to synthesize photo-realistic street scenes. Gao et al. (Gao et al., 2020) propose an unsupervised GAN-based framework, which adaptively synthesizes images from segmentation labels by considering the specific attributes of the segmentation-to-image task. As in (Gao et al., 2020), Tang et al. (Tang et al., 2020a) present a dual-attention GAN that synthesizes photo-realistic and semantically consistent images with fine detail from input layouts without additional training overhead. To enhance the fidelity of images in games, Richter et al. (Richter et al., 2021) use the G-buffers generated in the rendering process of the game engine as an additional input signal to train a convolutional neural network, which is able to eliminate the disharmonious and unreasonable illusions generated by previous deep learning methods. Although image-to-image translation methods can synthesize realistic street scenes, they still cannot guarantee coherence across scene transformations. Moreover, they can only synthesize the scene from a single view, and their results are far from the real scene in terms of texture details.

2.2. Novel View Synthesis

Neural Radiance Fields (Mildenhall et al., 2020) became an important breakthrough for the novel view synthesis task, using a fully connected network over the entire scene optimized by differentiable volume rendering. Recently, there have been many variations of this method that render different objects, such as humans (Pang et al., 2021), cars (Niemeyer and Geiger, 2021), interior scenes (Wang et al., 2021b), and buildings (Martin-Brualla et al., 2021). However, since NeRF-based methods depend on per-scene fitting, it is hard to fit a large-scale driving scenario. NeRFW (Martin-Brualla et al., 2021) combines appearance embedding with a decomposition of transient and static elements through uncertainty fields. Unfortunately, dynamic objects are ignored, which may lead to occlusions in the static scene. Moreover, synthesizing scenes as large as street views requires huge computing resources and cannot be rendered in real time.

Figure 2. Overview of our proposed large-scale neural scene rendering (READ) for autonomous driving. The input images are first aligned, and then the point cloud of the scene is obtained by matching feature points and dense reconstruction. We rasterize points at several resolutions. Given the point cloud, the learnable neural descriptors, and the camera parameters, our presented ω rendering network synthesizes realistic driving scenes by filtering neural descriptors learned from the data and fusing features at the same scale and across different scales.

2.3. Scene Synthesis by Neural Rendering

Meshry et al. (Meshry et al., 2019) take a latent appearance vector and a semantic mask of the transient objects' locations as input, render the scene's points into a deep frame buffer, and learn the mapping from these initial renderings to real photos. This requires a lot of semantic annotation and ignores the transient objects. By combining the traditional graphics pipeline with learnable components, Thies et al. (Thies et al., 2019) introduce a new image synthesis paradigm, named Deferred Neural Rendering, where a feature mapping of the target image is learned from a UV-map through neural textures. Despite the promising results, it is time-consuming to obtain explicit surfaces of good quality from a point cloud.

Point-based neural rendering methods employ point clouds as input to learn the scene representation. NPBG (Aliev et al., 2020) encodes local geometric shapes and appearance by learning neural descriptors, which synthesizes high-quality novel indoor views from point clouds. TRANSPR (Kolos et al., 2020) extends NPBG by augmenting point descriptors with alpha values and replacing Z-buffer rasterization with ray marching, which enables it to synthesize semi-transparent parts of the scene. ADOP (Rückert et al., 2021) proposes a point-based differentiable neural renderer, where the parameters of all scenarios are optimized by designing the stages of the pipeline to be differentiable.

3. Large-Scale Neural Scene Render

Our proposed neural scene rendering approach aims to synthesize photo-realistic images from an arbitrary camera viewpoint by representing the driving scenes with point clouds. In this section, we first outline our proposed method. Secondly, multiple sampling strategies for sparse point clouds are proposed to reduce the computational cost of large-scale driving scenes. Thirdly, an ω rendering network is proposed to represent driving scenes with sparse point clouds and synthesize realistic driving scenarios. Finally, the driving scenes are edited and stitched to provide synthetic data for larger and richer driving scenes.

3.1. Overview

Given a set of input images for a driving scene and the point cloud with known camera parameters, our framework is capable of synthesizing photo-realistic driving scenes from multiple views, as well as stitching and editing driving scenes. To this end, we propose an end-to-end large-scale neural scene renderer that synthesizes realistic images from sparse point clouds. Our framework is divided into three parts: rasterization, sampling with the sparse point cloud, and the rendering network. The overview of our proposed framework is illustrated in Fig. 2.

A sparse 3D point cloud can be obtained through classic Structure-from-Motion and Multi-View Stereo pipelines, such as Agisoft Metashape (1). Each point is located at a 3D position and is associated with a neural descriptor vector encoding the local scene content. As in (Aliev et al., 2020), each input 3D point contains a position, whose appearance feature is extracted by mapping the RGB value of the image pixel to its corresponding 3D space. Neural descriptors are calculated from the input point cloud, namely latent vectors representing local geometry and photometric properties. We update these features by propagating gradients to the input, so that the features of the neural descriptors can be automatically learned from data. Given the camera's intrinsic and extrinsic parameters, we can view the scene from different views by designing 8-dimensional neural descriptors to represent the RGB values. In the rasterization phase, images of size W×H are captured by a pinhole camera, and we construct a pyramid of rasterized raw images (T = 4 in all our experiments), where the image at level t has spatial size W/2^t × H/2^t and is formed by assigning to each pixel the neural descriptor of the point passing the depth test, projected onto that pixel under the full projection transformation of the camera. Essentially, each neural descriptor encodes the local 3D scene content around its point, and the rendering network expresses a local 3D function that outputs the specific neural scene description at that point, modeled by the neural point in its local frame.
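As a concrete illustration of the rasterization phase described above, the sketch below projects points through a pinhole camera and assigns each pixel the descriptor of the nearest surviving point at several resolutions. All names are our own, and the loop-based depth test is for clarity only; the actual implementation is a GPU rasterizer.

```python
import numpy as np

def rasterize_pyramid(points, descriptors, K, R, t, width, height, T=4):
    """Rasterize point descriptors into a pyramid of T raw images.

    points: (N, 3) world coordinates; descriptors: (N, D) learnable vectors.
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns a list of T arrays; level t has shape (H / 2^t, W / 2^t, D).
    """
    cam = points @ R.T + t                     # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                           # keep points in front of the camera
    uv = (cam[valid] / z[valid, None]) @ K.T   # perspective projection
    u, v, depth = uv[:, 0], uv[:, 1], z[valid]
    desc, levels = descriptors[valid], []
    for tlev in range(T):
        w, h = width >> tlev, height >> tlev
        img = np.zeros((h, w, desc.shape[1]))
        zbuf = np.full((h, w), np.inf)
        px = (u / 2 ** tlev).astype(int)
        py = (v / 2 ** tlev).astype(int)
        inside = (px >= 0) & (px < w) & (py >= 0) & (py < h)
        for i in np.flatnonzero(inside):       # nearest point wins the depth test
            if depth[i] < zbuf[py[i], px[i]]:
                zbuf[py[i], px[i]] = depth[i]
                img[py[i], px[i]] = desc[i]
        levels.append(img)
    return levels
```

Pixels hit by no point stay zero, which is exactly the kind of hole the rendering network later has to fill.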

Figure 3. Basic gate module. Neural descriptors learned from sparse point clouds can effectively screen out invalid values.

3.2. Sampling with Sparse Point Cloud

Synthesizing driving scenes spanning thousands of meters requires enormous computational power. Therefore, the key to our proposed approach is to reduce memory usage and improve training efficiency. Instead of fitting each scene separately, we employ the point clouds generated by the off-the-shelf 3D reconstruction software (1) to initialize the geometry of the real-world scene. As a huge amount of point cloud data consumes a lot of memory, training remains difficult. To tackle this critical issue, we use a sparse sampling strategy that generates sparse point clouds from 1/4 of the originally available pixels, so that only 25% of the total point cloud is used in training.

3.2.1. Screen out occluded point clouds

To avoid updating the descriptors of occluded points, we approximate the visibility of each point. We use a nearest-point rasterization scheme by constructing a Z-buffer, keeping only the point with the lowest Z-value at each pixel position. This prevents the neural descriptors of points far from the current frame's camera from being updated, which leads to better scene synthesis. Thus, the computational cost of processing occluded points is reduced and training efficiency is greatly improved.

3.2.2. Monte Carlo sampling

Point clouds are unevenly distributed over the scene: areas with obvious features have abundant points, while regions such as the sky or dynamic objects have fewer points due to the lack of distinctive feature points. To train effectively, we propose a dynamic training strategy that takes advantage of the Monte Carlo method (Shapiro, 2003) to sample a large amount of driving scene data. For each image in the training set, we track its synthetic quality, which is calculated by the perceptual loss (Johnson et al., 2016) in our task, and we use the samples with the worst performance at each phase as training data. Through this dynamic sampling strategy, the model is pushed to learn the sparse regions of the point cloud, so the overall training time is reduced.
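The worst-sample selection described above can be sketched as follows; the function name is our own and the exact selection rule in the paper may differ in detail.

```python
import numpy as np

def sample_worst(losses, ratio=0.8):
    """Select the worst-synthesized images for the next training phase.

    losses: per-image perceptual-loss values from the previous phase.
    ratio: fraction of the image set kept each phase.
    Returns the indices of the highest-loss images, i.e. the regions the
    model currently renders worst (often sparse point-cloud areas).
    """
    losses = np.asarray(losses, dtype=float)
    k = max(1, int(ratio * len(losses)))
    return np.argsort(losses)[::-1][:k]   # descending by loss, take top k
```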

3.2.3. Patch sampling

Image resolution also plays a very important role in memory usage. To this end, we randomly divide the whole image into multiple patches through a sampling strategy, which selects random patches whose size is chosen according to the available GPU memory. It is worth mentioning that the proportion of pixels in a patch to the whole image is less than 15% in our task. The pinhole intrinsic matrix is

K = [ f_x  0    c_x
      0    f_y  c_y
      0    0    1  ]

where f_x and f_y represent the focal lengths along the x and y axes, respectively, and (c_x, c_y) is the position of the principal point with respect to the image plane.

For each image in the training set, the patch set is obtained by randomly shifting and zooming the crop window, with the intrinsic matrix adjusted accordingly, so that all areas in the scene can be trained. Shifting the patch enhances the synthetic quality of the scene from different views.
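The intrinsics update for a shifted and zoomed patch can be sketched with the standard pinhole-camera computation below; this is a generic formulation under our own naming, not necessarily the paper's exact formula.

```python
import numpy as np

def crop_intrinsics(K, x0, y0, s=1.0):
    """Adjust a pinhole intrinsic matrix for a shifted, zoomed patch.

    Cropping at pixel offset (x0, y0) moves the principal point into patch
    coordinates; scaling the patch by zoom ratio s scales both focal
    lengths and the principal point.
    """
    Kp = K.copy().astype(float)
    Kp[0, 2] -= x0            # shift principal point by the crop offset
    Kp[1, 2] -= y0
    Kp[:2] *= s               # zoom scales fx, fy, cx, cy
    return Kp
```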


3.3. ω Rendering Network

The point cloud, especially one from external reconstruction methods (e.g., Metashape (1) or COLMAP (Schonberger and Frahm, 2016)), often has holes and outliers that degrade the rendering quality. Motivated by MIMO-UNet (Cho et al., 2021), an ω rendering network is proposed to synthesize novel views from sparse point clouds, which consists of three parts.

Given the sparse point cloud, the purpose of the rendering network is to learn reliable neural descriptors to represent scenes. However, neural descriptors learned from point clouds still have holes. To deal with this problem, we design a basic gate module to filter the neural descriptors at different scales, as shown in Fig. 3.

For efficiency, we first employ convolution layers to extract the feature of the neural descriptor. A mask is then learned by a sigmoid function to filter the invalid values in the neural descriptor; its output lies in the range (0, 1) and represents the importance of each feature in the neural descriptor. To improve learning efficiency, we apply the ELU activation function to the neural descriptor feature and multiply it element-wise with the mask. We concatenate the initial feature with the filtered one as a new feature. Finally, we use an additional convolution layer to further refine the concatenated features. In addition, gated convolution (Yu et al., 2019) is introduced to re-filter the fused features.
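A minimal sketch of the gating idea follows, with the learned convolutions replaced by 1×1 convolutions expressed as plain matrix products for brevity; the real module uses full convolution layers plus gated convolution, so this only illustrates the mask-and-concatenate data flow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def basic_gate(feat, w_feat, w_mask):
    """Basic gate module sketch.

    feat: (H, W, C) rasterized neural descriptors (zeros where holes are).
    w_feat, w_mask: (C, C) weights standing in for learned conv layers.
    The sigmoid mask down-weights invalid descriptor values; the gated
    feature is concatenated back with the input (a further convolution
    would refine the result in the full network).
    """
    mask = sigmoid(feat @ w_mask)        # per-feature importance in (0, 1)
    gated = elu(feat @ w_feat) * mask    # filter invalid values
    return np.concatenate([feat, gated], axis=-1)
```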

Figure 4. Feature fusion module. Part 1 fuses the features at the same scale, which takes advantage of the complementary information of same scale. Part 2 learns missing points in neural descriptors by fusing the features at different scales.

3.3.1. Fusing features at different scales

The lack of topology information in the point cloud leads to holes and bleeding. Consider the feature of a neural descriptor with holes, as shown in the red box in Fig. 4. Although the full-resolution feature has fine details, it suffers from larger surface bleeding; for feature blocks with no values, rough values can be obtained after average pooling. The low-resolution feature, in turn, reduces surface bleeding. For a sparse point cloud, fusing features at only two scales still cannot completely fill the holes. Therefore, we fuse multi-scale features: our model uses neural descriptors at four scales to trade off efficiency against accuracy. By fusing features at different scales, our proposed model learns the missing points of the sparse point cloud from the data, so as to synthesize realistic novel views of the sky, distant objects, etc. Instead of transposed convolution, we employ bilinear interpolation in the upsampling phase, because transposed convolution is essentially learnable upsampling and its learnable parameters incur extra computation.
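The pooling, bilinear upsampling, and cross-scale fusion can be illustrated as below. The learned fusion is replaced here by a simple hole-filling average, so this is a sketch of the data flow under our own naming rather than the trained network.

```python
import numpy as np

def avg_pool2x(x):
    """2x average pooling of an (H, W, C) feature map."""
    h, w, c = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def bilinear_upsample2x(x):
    """2x bilinear upsampling of an (H, W, C) feature map."""
    h, w, c = x.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]
    wx = np.clip(xs - x0, 0, 1)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse(fine, coarse):
    """Fill hole pixels (all-zero descriptors) in the fine level with the
    upsampled coarse feature; elsewhere blend the two scales."""
    up = bilinear_upsample2x(coarse)
    hole = np.abs(fine).sum(-1, keepdims=True) == 0
    return np.where(hole, up, 0.5 * (fine + up))
```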

3.3.2. Fusing features at the same scale

In our presented method, the feature of a neural descriptor at a coarser scale is obtained from the finer one by an average pooling operation. The down-sampled feature is concatenated with the last-layer feature for detail enhancement, retaining information that would otherwise be lost. At the same time, a feature of the same size is produced by the gate module using gated convolution. Fusing these two features at the same scale makes use of the complementary information between them, as shown in the red circle of Fig. 4.

Figure 5. The scene stitching.

3.4. Scene Editing and Stitching

As our proposed model learns neural descriptors from point clouds to represent scenes, a scene can be edited by changing the trained neural descriptors. Therefore, we can move the cars, people, and houses in the scene at will, or even simulate extreme scenarios, such as a car going the wrong way and about to crash.

As shown in Fig. 5, a set of points represents the range of a car in the point cloud. Through back propagation, we employ a rendering network with learnable parameters to project all the neural descriptors onto an RGB image, where the output is the projected 2D image produced by the projection and rasterization process. We synthesize a novel view of the car by changing its position. By taking advantage of scene editing, we can not only move objects in the scene but also remove dynamic objects, so as to obtain more diverse driving scenes, as shown in Fig. 7.

To account for the large-scale driving scene, we propose a scene stitching method that is able to concatenate multiple scenes and update a block locally. The boundary coordinates of scene 1 need to be stitched to the boundary coordinates of scene 2. We first rotate the point clouds of the two scenes so that they are aligned in a common coordinate system at the boundary. After training by our rendering network, the feature descriptors of each scene represent its texture. Then the two sets of descriptors are stitched at the boundary to update and form the new scene.
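Both editing and stitching reduce to simple manipulations of the trained point cloud and its descriptors; the hedged sketch below assumes a known rigid transform aligning the two scenes at their boundary (names are our own).

```python
import numpy as np

def move_object(points, descriptors, mask, offset):
    """Edit a trained scene by relocating the points of one object.

    mask: boolean array marking the object's points (e.g., a car);
    offset: 3-vector translation. Descriptors travel with their points, so
    re-rendering the edited cloud synthesizes the object at its new pose.
    """
    edited = points.copy()
    edited[mask] += np.asarray(offset, dtype=float)
    return edited, descriptors          # descriptors are unchanged

def stitch_scenes(pts1, desc1, pts2, desc2, transform):
    """Stitch two trained scenes after aligning scene 2 to scene 1's frame.

    transform: (R, t) rotation and translation taking scene-2 coordinates
    into scene-1 coordinates, obtained by aligning the shared boundary.
    """
    R, t = transform
    pts2_aligned = pts2 @ R.T + t
    return np.vstack([pts1, pts2_aligned]), np.vstack([desc1, desc2])
```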

Figure 6. Comparative results of novel view synthesis on the Residential, Road, City scenes from the KITTI benchmark, and a multiple view scene from the Brno Urban dataset. Comparing to DAGAN (Tang et al., 2020a), NRW (Meshry et al., 2019), NPBG (Aliev et al., 2020) and ADOP (Rückert et al., 2021), our approach performs the best in cases of pedestrians, vehicles, sky, buildings and road signs. Please zoom in for more details.

3.4.1. Loss function

Perceptual loss (Johnson et al., 2016), also known as VGG loss, reflects the perceived image quality more effectively than other loss functions. Thus, we employ the perceptual loss to prevent smoothing of high-frequency details while encouraging color preservation. Specifically, we compute the perceptual loss between the synthetic novel view and the ground-truth image over randomly cropped patches, using the features of a pretrained VGG layer. Given the point cloud and camera parameters, our driving scene renderer learns the neural descriptors and the network parameters.
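A sketch of the patch-wise perceptual loss follows; `feature_fn` stands in for the pretrained VGG layer, which we do not reproduce here, and the patch size and count are illustrative.

```python
import numpy as np

def random_patch_coords(img, size, n, rng):
    """Top-left corners of n random crops of the given size."""
    h, w = img.shape[:2]
    return [(rng.integers(0, h - size + 1), rng.integers(0, w - size + 1))
            for _ in range(n)]

def perceptual_loss(pred, gt, feature_fn, size=32, n=4, seed=0):
    """Mean feature-space distance over randomly cropped patches.

    feature_fn maps an image patch to deep features (a pretrained VGG
    layer in the paper); identical images yield zero loss.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for y, x in random_patch_coords(gt, size, n, rng):
        fp = feature_fn(pred[y:y + size, x:x + size])
        fg = feature_fn(gt[y:y + size, x:x + size])
        total += np.mean((fp - fg) ** 2)
    return total / n
```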

4. Experiment

To fairly compare the qualitative and quantitative results of the various methods, we conduct the experiments on an Nvidia GeForce RTX 2070 GPU and evaluate our proposed approach on two autonomous driving datasets. To reduce the memory needed to load the point cloud of an entire large-scale scene, all compared methods use the sparse point cloud optimized by our method as input.

4.1. Datasets

KITTI Dataset (Geiger et al., 2012): KITTI is a large dataset of real driving scenarios containing rich scenes. We mainly conduct experiments in three different cases, namely Residential, Road, and City, covering 3724, 996, and 1335 meters, respectively. Due to overlapping parts, we finally adopt 3560 frames, 819 frames, and 1584 frames of the Residential, Road, and City scenes as the testbed. We evaluate every 10th frame (e.g., frames 0, 10, 20, …), following the training and testing split of (Aliev et al., 2020; Rückert et al., 2021); the rest of the frames are used for training. To demonstrate the effectiveness of our method, we also conduct a more challenging experiment by discarding the 5 frames before and after every 100th frame and using them as test frames. As the car drives at a fast speed, losing 10 frames may lose a lot of scene information, especially at corners.

Brno Urban Dataset (Ligocki et al., 2020): Compared to KITTI's single-view trajectory, the Brno Urban Dataset contains four views: left, right, left-front, and right-front. In this paper, we use 1641 frames of driving images in our experiments. The left view, the left-front view, and the right view are used as a set of data with evaluation criteria similar to KITTI.

4.2. Evaluation on the KITTI Testing Set

Since point clouds from external reconstruction methods such as Metashape (1) often contain holes and outliers, the rendering quality is usually degraded. Moreover, sparse point clouds as input bring a great challenge to scene synthesis.

To demonstrate the efficacy of our method, we compare it against recent image-to-image translation based and neural point-based approaches, including DAGAN (Tang et al., 2020a), NRW (Meshry et al., 2019), NPBG (Aliev et al., 2020), and ADOP (Rückert et al., 2021), which have achieved promising results in outdoor scene synthesis. Following these methods, we employ Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and perceptual loss (VGG loss) as the evaluation metrics. To facilitate a fair comparison, we also adopt the perceptual metric Learned Perceptual Image Patch Similarity (LPIPS) in our evaluation.
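For reference, PSNR, the first of these metrics, is a direct function of the mean squared error between the synthesized and ground-truth images:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio (dB) between two images in [0, max_val].

    Higher is better; identical images give infinite PSNR.
    """
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```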

Tang et al. (Tang et al., 2020a) propose a Dual Attention GAN (DAGAN) that effectively models semantic attention in the spatial and channel dimensions, thus improving the feature representation ability of semantic images. This GAN-based method can synthesize reasonable semantic images; however, there is a gap between its results and real scene textures, and its performance is weak across the metrics.

NRW (Meshry et al., 2019) renders the points into a deep frame buffer and learns the mapping from the initial rendering to the actual photo by training a neural network. Besides point clouds, camera views, and paired poses, NRW needs extra depth maps for supervision, which are obtained by Agisoft Metashape (1). Although it obtains good metrics on the KITTI dataset, it does not perform well on detailed textures, as shown in Fig. 6.

NPBG (Aliev et al., 2020) is one of the baseline methods. To facilitate fair comparisons, we use the same parameters, such as the number of neural descriptors and the number of layers in the network structure. We evaluate the effectiveness of our method in detail through ablation comparisons with NPBG.

ADOP (Rückert et al., 2021) proposes a differentiable renderer that can optimize camera pose, intrinsics, and texture color. However, its pipeline has complicated scene parameters, and it is a two-stage rendering network in which per-image exposure and per-image white balance need to be manually adjusted. ADOP has difficulty with sparse point clouds and tends to synthesize blurred images in areas where the point cloud has holes.

In our proposed approach, we initialize the driving scene through the point cloud in the four datasets and use neural descriptors to encode the geometry and appearance. At each iteration, we sample ten target patches on the KITTI dataset; due to the higher image resolution of the Brno Urban Dataset, the patch size is chosen accordingly for training. In Monte Carlo sampling, we set the sampling ratio to 80%. Table 1 shows the results of our method; our proposed approach is significantly better than the previous methods on all metrics.

                              KITTI Residential               KITTI Road                      KITTI City
                              VGG↓   PSNR↑  LPIPS↓  SSIM↑     VGG↓   PSNR↑  LPIPS↓  SSIM↑    VGG↓   PSNR↑  LPIPS↓  SSIM↑
Test every 100 frames (w/ discard)
DAGAN (Tang et al., 2020a) 1241.3  11.18   0.4968   0.3081 929.0   15.33   0.3570   0.4135 1301.3  10.74   0.4949   0.3014
NRW (Meshry et al., 2019) 923.0   15.70   0.3874   0.4748 860.8   17.01   0.3343   0.4311 1007.0  15.66   0.3847   0.4361
NPBG (Aliev et al., 2020) 924.7   14.98   0.4426   0.4733 791.4   17.63   0.3680   0.5080 994.5   14.97   0.4384   0.4518
ADOP (Rückert et al., 2021) 900.5   14.89   0.3590   0.4734 785.9   17.56   0.3275   0.4701 910.6   15.67   0.3497   0.4774
READ (Ours) 695.3   17.70   0.2875   0.5963 573.5   20.26   0.2408   0.6238 673.2   18.35   0.2529   0.6412
Test every 10 frames (w/o discard)
DAGAN (Tang et al., 2020a) 1031.2  14.27   0.3800   0.4337 847.2   16.84   0.2916   0.4638 1128.8  13.40   0.3971   0.3845
NRW (Meshry et al., 2019) 767.4   18.43   0.3197   0.5476 748.0   18.58   0.2809   0.4996 823.7   18.02   0.3102   0.5682
NPBG (Aliev et al., 2020) 621.2   19.32   0.2584   0.6316 597.3   20.25   0.2517   0.5919 632.8   19.58   0.2480   0.6277
ADOP (Rückert et al., 2021) 610.8   19.07   0.2116   0.5659 577.7   19.67   0.2150   0.5554 560.9   20.08   0.1825   0.6234
READ (Ours) 454.9   22.09   0.1755   0.7242 368.2   24.29   0.1465   0.7402 391.1   23.48   0.1321   0.7871
Table 1. Quantitative evaluation of novel view synthesis on three scenes from the KITTI dataset.
                              Left side view                  Left front side view            Right side view                 Total
                              VGG↓   PSNR↑  LPIPS↓  SSIM↑     VGG↓   PSNR↑  LPIPS↓  SSIM↑    VGG↓   PSNR↑  LPIPS↓  SSIM↑    PSNR↑  LPIPS↓
Test every 100 frames (w/ discard)
DAGAN (Tang et al., 2020a) 1055.5  13.93   0.3960   0.3705 754.9   16.95   0.3078   0.5234 1105.1  11.84   0.5323   0.3561 14.24   0.4120
NRW (Meshry et al., 2019) 919.2   15.25   0.4435   0.4397 712.7   17.87   0.4063   0.5513 949.5   13.49   0.5790   0.4405 15.54   0.4762
NPBG (Aliev et al., 2020) 1002.3  13.14   0.5242   0.3978 724.5   17.13   0.4098   0.5596 1024.4  12.22   0.6634   0.4333 14.17   0.5325
ADOP (Rückert et al., 2021) 997.1   14.08   0.4373   0.3915 683.6   18.24   0.3150   0.5618 1091.2  13.21   0.5531   0.3803 15.18   0.4352
READ (ours) 842.0   15.28   0.3992   0.4752 523.9   20.51   0.2467   0.6713 928.0   13.88   0.5464   0.4533 16.56   0.3974
Test every 10 frames(w/o discard)
DAGAN (Tang et al., 2020a) 851.4   16.67   0.2822   0.4766 657.6   19.08   0.2445   0.5662 1041.6  13.14   0.4514   0.3805 16.30   0.3260
NRW (Meshry et al., 2019) 735.0   18.64   0.3199   0.5422 619.6   19.74   0.3125   0.6062 864.6   16.05   0.4631   0.4749 18.14   0.3651
NPBG (Aliev et al., 2020) 659.4   18.56   0.3112   0.5849 531.6   20.30   0.2705   0.6773 813.1   16.00   0.4424   0.5093 18.28   0.3414
ADOP (Rückert et al., 2021) 634.0   19.19   0.2414   0.5927 520.6   20.83   0.2189   0.6633 807.1   16.51   0.3636   0.5009 18.84   0.2746
READ (ours) 459.8   21.79   0.1905   0.7067 341.1   24.85   0.1513   0.7836 663.6   18.44   0.3065   0.5771 21.69   0.2161
Table 2. Quantitative evaluation of novel view synthesis on Brno Urban Dataset.

4.3. Evaluation on the Brno Urban Testing Set

Unlike the KITTI dataset, the Brno Urban dataset is very challenging, as it provides three different camera views.

In the evaluation, we test the left side view, left front side view, and right side view separately. As shown in Table 2, our proposed approach significantly outperforms the other methods, although DAGAN obtains slightly better LPIPS scores on the side views in the 100-frame test and NRW's side-view results are close to ours. This is because the vehicle moves faster and the side views are narrower, so consecutive frames differ significantly. Since the training and test sets then follow different distributions, methods based on point-cloud rendering are seriously affected. NRW relies on additional segmentation annotations to synthesize unfamiliar scenes, and the images are projected onto the point cloud to initialize the deep buffer. Methods with a GAN generator can produce plausible images for regions missing from the point cloud; therefore, DAGAN and NRW achieve a slight improvement when fewer input images are available. However, such methods remain limited when synthesizing novel scenes from different views. In addition, ADOP uses a differentiable rendering pipeline to align camera images to the point cloud, so its results for the right side view are similar to ours.
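PSNR, one of the fidelity metrics reported above, can be computed directly from a rendered view and its ground truth. The sketch below is our own illustration (the helper name `psnr` and the [0, 1] pixel range are assumptions, not code from the paper):

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between a render and ground truth."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a render uniformly off by 0.1 gives MSE = 0.01, i.e. 20 dB.
gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
print(round(psnr(pred, gt), 2))  # 20.0
```

LPIPS and FID, by contrast, require pretrained networks and are usually taken from reference implementations rather than re-derived.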

4.4. Ablation Study

In the ablation experiment, we examine the effectiveness of each module; more results are given in the supplementary materials. For a fair comparison, we add the sampling strategy proposed in Section 3.2.3 to NPBG as our baseline, namely sampling NPBG, shown in the first line of Table 3. We then gradually add each module described in Section 3 and evaluate them on the KITTI Road scenario. Compared with sampling NPBG, our proposed basic gate module effectively filters the invalid values in the neural descriptors, yielding significant improvements in PSNR, SSIM, and the other metrics. Fusing features at the same scale (Same) enhances the texture of the scene with fine details, while fusing features across different scales (Differ) effectively fills in neural-descriptor values close to zero; all metrics improve greatly over the baseline. We also study the influence of different loss functions and observe that combining the PSNR loss and VGG loss improves the SSIM index slightly.
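The basic gate module works in the spirit of gated convolutions (Yu et al., 2019): a learned soft mask multiplies the rasterized neural descriptors so that invalid (hole) entries are suppressed. Below is a rough numpy sketch of the gating idea only; `gate_logits` stands in for the output of the learned gate convolution, which is an assumption — the real module is a trained layer inside the rendering network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_filter(descriptors: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Multiply descriptors by a soft mask in (0, 1): entries whose gate is
    'closed' (large negative logit) are driven toward zero and filtered out."""
    return descriptors * sigmoid(gate_logits)

# Toy 1x4 descriptor map: gates open on valid pixels, closed on hole pixels.
desc = np.array([[1.0, 0.0, 2.0, 0.0]])
gates = np.array([[6.0, -6.0, 6.0, -6.0]])
out = gated_filter(desc, gates)  # valid entries pass nearly unchanged
```

In the full model this gating is applied per scale before the same-scale (Same) and cross-scale (Differ) feature fusion.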




FID↓    PSNR↑   LPIPS↓  SSIM↑
572.1   20.63   0.2359  0.6109
477.6   22.29   0.1893  0.6883
392.1   23.76   0.1477  0.7205
368.2   24.29   0.1465  0.7402
401.3   24.16   0.1865  0.7487
383.3   23.96   0.1506  0.7325
401.1   24.19   0.1863  0.7490
Table 3. Ablation study of our method on the KITTI Road dataset.
Method               FID↓    PSNR↑   LPIPS↓  SSIM↑
READ                 454.9   22.08   0.1755  0.7242
READ w/ stitching    429.3   22.58   0.1625  0.7392
Table 4. Comparison of scene stitching on the KITTI dataset.

4.5. Driving Scene Editing

Editing driving scenarios not only provides more synthetic data for advanced driver assistance systems, but also simulates traffic conditions that are rare in daily life, e.g., a car illegally driving on the wrong side of the road. Moreover, our proposed approach can remove the dynamic objects in a scene, so data-collection staff do not need to worry about the impact of complex traffic and vehicles on restoring the real scene, which makes data collection more convenient. Additionally, a panoramic view can be synthesized with our method. The larger field of view provides more street-view information for driver assistance systems, making it easier to observe the surrounding environment and handle emergencies in a timely manner, as shown in Fig. 7. More results are presented in the supplementary materials.
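Removing a dynamic object from a point-cloud scene amounts to deleting its points before rendering. A minimal numpy sketch, assuming an axis-aligned 3D box around the object (the paper does not prescribe this exact API; `remove_object` is an illustrative name):

```python
import numpy as np

def remove_object(points: np.ndarray, box_min, box_max) -> np.ndarray:
    """Drop all points inside an axis-aligned 3D box (e.g. around a car),
    so the scene can be re-rendered without the object."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside]

cloud = np.array([[0.0, 0.0, 0.0],   # road point
                  [5.0, 1.0, 0.5],   # point on a car
                  [9.0, 2.0, 0.0]])  # building point
edited = remove_object(cloud, box_min=[4.0, 0.0, 0.0], box_max=[6.0, 2.0, 2.0])
# Moving a car instead of removing it would translate the selected points.
```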

Figure 7. Example results of scene editing. We can move and remove cars in different views. A panorama with a larger field of view can be synthesized by changing the camera parameters.

4.6. Driving Scene Stitching

By taking advantage of scene stitching, our model is able to synthesize larger driving scenes and to update local areas where road conditions have changed noticeably. This not only makes it possible to handle the driving area at a larger scale, but also allows dividing the large-scale scene into small parts for efficient parallel training. As shown in Fig. 5, we stitch two KITTI residential scenes, which share the same rendering network while each learns the texture of its corresponding part. Table 4 shows the comparison results on the stitched scenario: decomposing the large scene into small parts achieves better results, which indicates the effectiveness of our stitching method.
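The decomposition step above can be sketched as partitioning the point cloud into contiguous blocks along the driving direction, each with its own descriptors but one shared renderer. `split_scene` is an illustrative helper under that assumption, not the paper's code:

```python
import numpy as np

def split_scene(points: np.ndarray, axis: int = 0, n_parts: int = 2):
    """Partition a point cloud into contiguous blocks along one axis
    (e.g. the driving direction) for per-part training and later stitching."""
    edges = np.linspace(points[:, axis].min(), points[:, axis].max(), n_parts + 1)
    parts = []
    for i in range(n_parts):
        lo, hi = edges[i], edges[i + 1]
        if i == n_parts - 1:
            mask = points[:, axis] >= lo  # last block keeps the boundary point
        else:
            mask = (points[:, axis] >= lo) & (points[:, axis] < hi)
        parts.append(points[mask])
    return parts

# Toy trajectory: 11 points spaced along x, split into two stitched halves.
pts = np.array([[float(x), 0.0, 0.0] for x in range(11)])
parts = split_scene(pts, axis=0, n_parts=2)
```

Each part could then be trained in parallel, with the shared rendering network seeing batches from every part.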

5. Limitations

We propose a multi-scale rendering method that synthesizes photo-realistic driving scenes from sparse point clouds. However, it struggles with images that differ greatly from the training views; for example, in the right view of the Brno Urban dataset, the 10 frames nearest each test frame are discarded from training. As shown in Fig. 8, neural rendering-based methods find it difficult to synthesize a scene with few observations, resulting in blur. In addition, misalignment of the point clouds affects the rendering results. In the future, we will consider using point clouds scanned by a LiDAR sensor as training data to reduce reconstruction errors.
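The evaluation protocol referenced here and in Tables 1 and 2 ("test every N frames", with or without discarding neighbors) can be sketched as follows; `make_split` and its parameters are our illustrative names, not code from the paper:

```python
def make_split(n_frames: int, every: int, discard_radius: int = 0):
    """Hold out every `every`-th frame as a test view. Frames within
    `discard_radius` of any test frame are also dropped from training,
    so the held-out views stay genuinely unseen (the 'w/ discard' case)."""
    test = sorted(range(0, n_frames, every))
    train = [i for i in range(n_frames)
             if all(abs(i - t) > discard_radius for t in test)]
    return test, train

# Harder protocol: test every 100 frames, discard the 10 frames on either
# side of each test frame from the training set.
test, train = make_split(300, every=100, discard_radius=10)
```

With `discard_radius=0` this reduces to the easier "w/o discard" protocol, where training views directly adjacent to the test frames remain available.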

Figure 8. Failure cases.

6. Conclusion

This paper proposed an efficient neural scene rendering approach for autonomous driving, which makes it possible to synthesize large-scale scenarios on a PC through a variety of sampling schemes. We presented an ω rendering network that filters the neural descriptors through basic gate modules and fuses features at the same scale and across different scales with different strategies. Our proposed approach not only synthesizes photo-realistic views, but also edits and stitches driving scenes, which makes it possible to generate various photo-realistic images to train and test autonomous driving systems. The encouraging experimental results showed that our approach significantly outperforms the alternative methods both qualitatively and quantitatively.


  • Agisoft (2019) Metashape software. Retrieved 20.05.2019.
  • K. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. Lempitsky (2020) Neural point-based graphics. In European Conference on Computer Vision, pp. 696–712.
  • S. Chen (2019) Multimedia for autonomous driving. IEEE MultiMedia 26 (3), pp. 5–8.
  • S. Cho, S. Ji, J. Hong, S. Jung, and S. Ko (2021) Rethinking coarse-to-fine approach in single image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4641–4650.
  • K. Deng, A. Liu, J. Zhu, and D. Ramanan (2022) Depth-supervised NeRF: fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. In Conference on Robot Learning.
  • L. Gao, J. Zhu, J. Song, F. Zheng, and H. T. Shen (2020) Lab2Pix: label-adaptive generative adversarial network for unsupervised image synthesis. In Proceedings of the ACM International Conference on Multimedia, pp. 3734–3742.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
  • L. He, Q. Zhou, X. Li, L. Niu, G. Cheng, X. Li, W. Liu, Y. Tong, L. Ma, and L. Zhang (2021) End-to-end video object detection with spatial-temporal transformers. In Proceedings of the ACM International Conference on Multimedia, pp. 1507–1516.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
  • J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
  • J. Kim, H. Kim, K. Lakshmanan, and R. Rajkumar (2013) Parallel scheduling for cyber-physical systems: analysis and case study on a self-driving car. In Proceedings of the ACM/IEEE International Conference on Cyber-Physical Systems, pp. 31–40.
  • M. Kolos, A. Sevastopolsky, and V. Lempitsky (2020) TRANSPR: transparency ray-accumulating neural 3D scene point renderer. In International Conference on 3D Vision, pp. 1167–1175.
  • J. Li, S. Dong, Z. Yu, Y. Tian, and T. Huang (2019) Event-based vision enhanced: a joint detection framework in autonomous driving. In IEEE International Conference on Multimedia and Expo, pp. 1396–1401.
  • A. Ligocki, A. Jelinek, and L. Zalud (2020) Brno Urban Dataset – the new data for self-driving agents and mapping tasks. In IEEE International Conference on Robotics and Automation, pp. 3284–3290.
  • R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) NeRF in the wild: neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219.
  • M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla (2019) Neural rerendering in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6878–6887.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421.
  • M. Niemeyer and A. Geiger (2021) GIRAFFE: representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453–11464.
  • A. Pang, X. Chen, H. Luo, M. Wu, J. Yu, and L. Xu (2021) Few-shot neural human performance rendering from sparse RGBD videos. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 938–944.
  • S. R. Richter, H. A. AlHaija, and V. Koltun (2021) Enhancing photorealism enhancement. arXiv preprint arXiv:2105.04619.
  • D. Rückert, L. Franke, and M. Stamminger (2021) ADOP: approximate differentiable one-pixel point rendering. arXiv preprint arXiv:2110.06635.
  • J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4104–4113.
  • A. Shapiro (2003) Monte Carlo sampling methods. Handbooks in Operations Research and Management Science 10, pp. 353–425.
  • H. Tang, S. Bai, and N. Sebe (2020a) Dual attention GANs for semantic image synthesis. In Proceedings of the ACM International Conference on Multimedia, pp. 1994–2002.
  • H. Tang, D. Xu, Y. Yan, P. H. Torr, and N. Sebe (2020b) Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7870–7879.
  • J. Thies, M. Zollhöfer, and M. Nießner (2019) Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics 38 (4), pp. 1–12.
  • L. Wang, Z. Wang, P. Lin, Y. Jiang, X. Suo, M. Wu, L. Xu, and J. Yu (2021a) iButter: neural interactive bullet time generator for human free-viewpoint rendering. In Proceedings of the ACM International Conference on Multimedia, pp. 4641–4650.
  • Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser (2021b) IBRNet: learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
  • P. Wiriyathammabhum, D. Summers-Stay, C. Fermüller, and Y. Aloimonos (2016) Computer vision and natural language processing: recent approaches in multimedia and robotics. ACM Computing Surveys 49 (4), pp. 1–44.
  • H. Wu, J. Jia, H. Wang, Y. Dou, C. Duan, and Q. Deng (2021) Imitating arbitrary talking style for realistic audio-driven talking face synthesis. In Proceedings of the ACM International Conference on Multimedia, pp. 1478–1486.
  • X. Ying, X. Li, and M. C. Chuah (2021) SRNet: spatial relation network for efficient single-stage instance segmentation in videos. In Proceedings of the ACM International Conference on Multimedia, pp. 347–356.
  • J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480.
  • J. Zhang, G. Yang, S. Tulsiani, and D. Ramanan (2021a) NeRS: neural reflectance surfaces for sparse-view 3D reconstruction in the wild. Advances in Neural Information Processing Systems 34.
  • M. Zhang, T. Liu, Y. Piao, S. Yao, and H. Lu (2021b) Auto-MSFNet: search multi-scale fusion network for salient object detection. In Proceedings of the ACM International Conference on Multimedia, pp. 667–676.
  • W. Zhang, G. Ji, Z. Wang, K. Fu, and Q. Zhao (2021c) Depth quality-inspired feature manipulation for efficient RGB-D salient object detection. In Proceedings of the ACM International Conference on Multimedia, pp. 731–740.