This is the official repository of "Boosting Real-Time Driving Scene Parsing with Shared Semantics"
Real-time scene parsing is a fundamental feature for autonomous driving vehicles with multiple cameras. Compared with traditional methods, which process the frames from each camera individually, in this letter we demonstrate that sharing semantics between cameras with overlapped views can boost the parsing performance. Our framework is based on a deep neural network for semantic segmentation, with two kinds of additional modules for sharing and fusing semantics. On one hand, a semantics sharing module is designed to establish the pixel-wise mapping between the input image pair. Features as well as semantics are shared through this map to reduce duplicated workload, which leads to more efficient computation. On the other hand, feature fusion modules are designed to combine semantic features of different modalities, learning to leverage the information from both inputs for better results. To evaluate the effectiveness of the proposed framework, we collect a new dataset with a dual-camera vision system for driving scene parsing. Experimental results show that our network outperforms the baseline method in parsing accuracy with comparable computation.
With the development of autonomous driving in recent years, scene parsing, as a critical functionality of autonomous vehicles, has attracted more and more attention. Since scene parsing is a dense classification problem, it remains difficult to achieve accurate results in real-time applications, especially for vehicles with multiple cameras and limited computation resources.
Taking our autonomous vehicle platform shown in Fig. 1(a) as an example, a dual-camera vision system is mounted at the top of the vehicle, a configuration commonly adopted by modern advanced driver assistance systems (ADAS). The two cameras have different fields of view (FOVs). In the figure, CAM-60 refers to the camera with a 60° horizontal FOV (HFOV) and CAM-120 stands for the camera with a 120° HFOV. To get scene parsing results from both cameras, traditional approaches usually process the images from each camera individually, which neglects the connection inside the dual-camera system.
Since cameras with different perspectives have overlapped perception regions, as shown in Fig. 1(b), we seek a method to (1) build a pixel-wise mapping to share semantics between the two cameras and (2) leverage the complementary perspectives to get refined scene parsing results. More specifically, because the scenes captured by CAM-60 are almost completely contained in the image from CAM-120, the processing of the image from CAM-60 can benefit from the information propagated from CAM-120, which leads to more efficient computation. Meanwhile, because CAM-60 has a larger focal length, it perceives scenes far from the vehicle more clearly. Thus CAM-120 can fuse such information to enhance its original segmentation results.
In general, compared with classic approaches, our method boosts the scene parsing task for the dual-camera system in the following two aspects:
Reduce the computation load for CAM-60. Feature extraction needs to be performed only once in the overlapped regions between the two cameras. The heavy and slow feature extraction backbone for CAM-60 can be replaced with a lightweight one for extracting low-level features. The semantic information propagated from CAM-120 by a semantics sharing module ensures a basic level of segmentation performance for CAM-60, which can be further enhanced by fusion with its own low-level features.
Improve the scene parsing quality for CAM-120. The semantic features from CAM-60 are also propagated back to CAM-120 with the same semantics sharing module. By being appropriately fused with the original semantics of CAM-120, the semantics located in the overlapped regions can be further refined with the perspective advantage of CAM-60.
Scene parsing and semantic segmentation have been extensively investigated in recent years. Although state-of-the-art semantic segmentation networks can output high-quality results [3, 4], they are too heavy and computationally expensive to be adopted in real-time applications. Recently, some lightweight semantic segmentation networks have been designed to work online while giving satisfactory outputs [5, 6, 7, 8]. However, these networks are not designed for vision systems with multiple cameras, which makes them still too memory- or computation-consuming for autonomous driving applications. In our work, we aim to design an optimized architecture that reduces redundant computation and leads to a more efficient framework.
Semantics sharing or propagation seeks to find correspondences between different images which have overlapped views, e.g., the image pairs from stereo cameras or the consecutive frames in a video sequence. Semantics sharing is commonly conducted at two levels: pixel level and feature level.
For pixel-level sharing, a pixel-wise grid map is built to warp an image from one perspective to the other. The map can be derived from transformations in geometry space or image space. The transformation in geometry space generally uses prior knowledge, e.g., the planar assumption for perspective transformation, or the depth estimation of the scene [10, 11, 12]. The transformation in image space usually considers the correlations around the neighborhood of a pixel [13, 14]. With the recent development of lightweight optical flow estimation networks [15, 16, 17], it has become much more practical to exploit an optical flow estimation network in real-time applications. Xu et al. applied different segmentation strategies to various regions of the input image, exploiting optical flow to preserve the semantics in static regions. Zhu et al. investigated the generation of future semantic segmentation labels from current manual labels by video prediction based on motion vector estimation. Yin et al. combined a rigid structure reconstructor and a non-rigid motion localizer to warp from one view to the other. Similarly, in our framework we integrate both geometry-based and image-based methods for sharing semantic information between two cameras at the pixel level.
Feature-level sharing propagates information implicitly in the model, and is usually applied in video sequence processing. Jin et al. designed a network to learn predictive features in video scene parsing. Li et al. proposed a framework with adaptive feature propagation for high-level features to reduce the latency of video semantic segmentation. Wang et al. used an unsupervised method to learn feature representations for identifying correspondences across frames. Lee et al. attempted to derive semantic correspondences by object-aware losses. Compared with pixel-level sharing, feature-level sharing is learned in an end-to-end way, so it is difficult to evaluate its performance directly. In addition, feature-level sharing may rely on the training data more heavily than the pixel-level sharing used in our framework.
The idea of semantics fusion for improving the segmentation outputs has been widely applied in previous works. For example, features of different modalities or levels have been fused with each other to generate refined results. Li et al. hypothesized a scaled region from the original image by a perspective estimation network, which aimed to refine the original segmentation results of small objects. Jiao et al. proposed to improve and distill the semantic features with the estimated depth embeddings by geometry-aware propagation. All of the works above focus on fusion for a single image, while Hoyer et al. demonstrated a spatial-temporal fusion method for multiple camera sequences but with non-overlapped views. In our work, we follow the basic idea of semantics fusion and apply it to cameras with different perspectives and shared views to enhance the overall scene parsing performance.
In this section, we describe the proposed method in detail. First, an overview of our framework is given. Then the ideas behind the design of each core module are discussed. The detailed implementation information is given at the end of the section.
The proposed framework is illustrated in Fig. 2. The final goal is to output the scene parsing results for each input image from both CAM-120 and CAM-60.
From the view of structure, our framework can be divided into two branches. Unlike traditional designs with exactly the same pipeline for both branches, the input image from CAM-60 passes through a much more lightweight convolutional neural network (CNN) than the complete semantic segmentation network in the CAM-120 branch. The sharing and fusion of information between the two branches are realized with a semantics sharing module and two feature fusion modules, respectively.
From the view of functionality, the four kinds of modules in our framework play different roles. The semantic segmentation network provides high-level semantic features, while the lightweight CNN performs low-level feature extraction. The semantics sharing module establishes a bridge for bi-directional feature propagation from CAM-120 to CAM-60 and vice versa. The feature fusion modules merge the shared semantics for each branch to achieve better parsing results.
Since the semantic segmentation network is a full-function network which can output scene parsing results by itself, it can easily be replaced with any modern network designed for real-time scene parsing. The lightweight CNN can be designed either as a sequence of several convolutional layers or to share its structure with the feature extraction backbone of the semantic segmentation network. The implementation details of these two parts will be described in Sec. III-D. In the following we focus on the details of the semantics sharing module and the feature fusion module.
The task of the semantics sharing module is to remap the semantic features between the two branches. Through such a bridge, the semantic features from the CAM-120 branch can be propagated to the CAM-60 branch to speed up its processing, and then the results of CAM-60 are transferred back to refine the outputs of CAM-120, which forms a closed loop.
As concluded in Sec. II-B, the results from pixel-level sharing methods are more explicit and controllable. Thus we propose a two-stage image warping method to build the semantics sharing module as shown in Fig. 3.
In the first stage, the input image from CAM-120 is warped by a perspective transformation. The homography matrix used in the transformation can be derived from the intrinsic and extrinsic parameters of the dual-camera system:
$$H = K_{60} \, R \, K_{120}^{-1}$$
where $H$ is the homography matrix for mapping from CAM-120 to CAM-60, $K_{120}$ and $K_{60}$ are their camera matrices, and $R$ is the rotation matrix from CAM-120 to CAM-60. Due to the limitation of the perspective transformation, objects close to the camera will be distorted after the transformation. Thus the warped image from CAM-120 still needs to be adjusted to accurately match the ground-truth image from CAM-60.
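The first-stage mapping can be illustrated with a minimal sketch, assuming the standard rotation-only form $H = K_{60} R K_{120}^{-1}$ and skew-free pinhole camera matrices. The function names and the focal lengths in the usage note are hypothetical and are not taken from the paper's calibration.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def invert_intrinsics(k):
    """Invert a skew-free camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, fy, cx, cy = k[0][0], k[1][1], k[0][2], k[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def rotation_homography(k_src, k_dst, rot):
    """Build H = K_dst * R * K_src^{-1}, mapping source pixels to destination pixels."""
    return matmul3(matmul3(k_dst, rot), invert_intrinsics(k_src))


def warp_point(h, x, y):
    """Apply homography h to pixel (x, y) and de-homogenize the result."""
    u, v, w = (h[r][0] * x + h[r][1] * y + h[r][2] for r in range(3))
    return u / w, v / w


# Hypothetical intrinsics: the wide camera has the shorter focal length.
K_120 = [[500.0, 0.0, 960.0], [0.0, 500.0, 604.0], [0.0, 0.0, 1.0]]
K_60 = [[1500.0, 0.0, 960.0], [0.0, 1500.0, 604.0], [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_EXAMPLE = rotation_homography(K_120, K_60, IDENTITY)
```

With an identity rotation, the principal point of the wide image maps to the principal point of the narrow image, and offsets from it are scaled by the ratio of the focal lengths. A real calibration would supply a non-trivial rotation and measured intrinsics.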
In the second stage, the warped image from CAM-120 is further warped by the optical flow to compensate for the distortion effects. The core process of this stage is the precise estimation of the optical flow between the input image pair. Note that because the pose variation between the two cameras is very small and the input image pair is correctly synchronized, the scene can be considered static and the occlusion effect is negligible. Therefore the movement of pixels is small, and the artifacts of the warped image can also be ignored compared with the situation in video scene parsing.
The feature fusion modules are used to generate the segmentation results for CAM-60 as well as to refine the results of CAM-120. As shown in Fig. 4, we have implemented and evaluated three different types of feature fusion modules to compare with the direct output of the semantic segmentation network. In Sec. IV-D4 we present the ablation analysis of these blocks, which shows that even integrating the simple basic block can boost the parsing outputs to some extent.
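The second-stage warping can be sketched as a backward warp with bilinear sampling, in the spirit of what `grid_sample`-style operators do in deep learning frameworks. This is a minimal pure-Python illustration on a single-channel map; the convention that `flow[y][x]` holds the offset `(u, v)` to the sampling location and the clamping at the border are assumptions, not details from the paper.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a 2D list `img` at real-valued (x, y), clamping to the border."""
    height, width = len(img), len(img[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    top = img[y0][x0] * (1.0 - ax) + img[y0][x1] * ax
    bottom = img[y1][x0] * (1.0 - ax) + img[y1][x1] * ax
    return top * (1.0 - ay) + bottom * ay


def warp_with_flow(img, flow):
    """Backward warp: out[y][x] = img sampled at (x + u, y + v)."""
    height, width = len(img), len(img[0])
    return [[bilinear_sample(img, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(width)]
            for y in range(height)]
```

The same routine applies to feature maps channel by channel; in practice a batched GPU operator would replace the nested loops.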
In the following we will take the feature fusion module in the CAM-60 branch as an example to describe their structures.
The basic type of feature fusion module simply concatenates the input feature maps and outputs the semantic feature maps after an .
Since the effectiveness of the residual block has been widely demonstrated in previous works, we also apply it in our framework. The inputs are first concatenated and then passed through a standard residual block with layers. Finally the output is processed by an for classification.
To reduce the computation and the number of parameters in our framework, we also evaluate a bottleneck type of feature fusion module. The inputs are first converted to the same channels with an
Taking the implementation of the semantics sharing module into account, we have developed two types of structures for our framework: a) a loosely-coupled structure and b) a tightly-coupled structure. For the loosely-coupled structure, we simply place a complete optical flow network after the perspective transformation, which achieves the best estimation performance.
However, because the optical flow network has its own feature extraction modules, they may duplicate those in the semantic segmentation network. Therefore, in the tightly-coupled structure shown in Fig. 5, we remove the feature extraction part of the optical flow network and reuse the feature maps from the semantic segmentation network. With this adjustment, the whole model becomes more compact and the computation load can be further reduced.
We use a real-time semantic segmentation network based on MobileNetV3-large to get the initial semantic features for CAM-120. To train the semantic segmentation network, we apply the common cross entropy loss to supervise the training process.
The implementation of the lightweight CNN depends on the structure of the framework. For the loosely-coupled structure, we share the structure with the semantic segmentation network and output a feature pyramid at 1/8 of the original resolution for later fusion with the results from CAM-120. For the tightly-coupled structure, we reuse the feature extraction part of the optical flow network and adjust it to output a feature pyramid with exactly the same sizes (1/4, 1/8 and 1/16) as those from the semantic segmentation network.
The lightweight CNN also shares the weights with the semantic segmentation network in the loosely-coupled structure. In the tightly-coupled structure, it is trained together with the optical flow network.
The main part of the semantics sharing module is an optical flow estimation network. We use PWC-Net to provide grid maps for warping feature maps. Note that the original feature pyramid given by PWC-Net is not the same as that of MobileNetV3-large. Thus for the tightly-coupled structure, we need to modify the channels of the output feature maps to match those in MobileNetV3-large accordingly.
The training losses of the optical flow network in our case consist of three different types: a) supervised loss, b) unsupervised loss and c) semantic loss. The supervised loss is applied when the ground-truth flow is available, as with some synthetic datasets. It is defined as the average end-point error (AEPE):
$$L_{s} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert F_{gt}(i) - F_{est}(i) \right\rVert_2$$
where $i$ is the pixel index and $N$ is the total number of pixels in the flow image. $F_{gt}$ and $F_{est}$ are the ground-truth and the estimated flow, respectively.
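As a concrete reading of the AEPE definition, the following minimal sketch averages the Euclidean distance between ground-truth and estimated flow vectors over all pixels. The flat list-of-tuples layout and the function name are assumptions made for illustration.

```python
import math


def average_epe(flow_gt, flow_est):
    """Average end-point error between two flows given as lists of (u, v) tuples."""
    if len(flow_gt) != len(flow_est) or not flow_gt:
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(math.hypot(ug - ue, vg - ve)
                for (ug, vg), (ue, ve) in zip(flow_gt, flow_est))
    return total / len(flow_gt)
```

For example, one pixel that is off by the vector (3, 4) and one exact pixel give an AEPE of 2.5.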
The unsupervised loss is mainly for training on datasets without ground-truth flow. We choose the three most commonly used losses for unsupervised learning:
$$L_{u} = \lambda_{1} L_{p} + \lambda_{2} L_{ssim} + \lambda_{3} L_{smooth}$$
Here the first term $L_{p}$ is defined as the norm of the pixel intensity difference between the ground-truth image $I_{gt}$ and the flow-warped image $I_{w}$:
$$L_{p} = \left\lVert I_{gt} - I_{w} \right\rVert$$
The second term $L_{ssim}$ is the SSIM loss of the ground-truth image and the flow-warped image. The third term $L_{smooth}$ is the smoothness loss of the estimated flow. The weights of these three losses, $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$, are set to 0.1, 1.0 and 1.0, respectively.
The semantic loss is for the dataset with semantic labels. It can be regarded as a supervision for the flow at the boundaries of each semantic class. Here we also apply the cross entropy loss to supervise the fine-tuning of the optical flow network.
For the basic and residual type of feature fusion modules, they are applied to both branches without modification. However, since the bottleneck type has an element-wise addition unit, we will additionally need an to reshape to the same size as in the CAM-60 branch, as shown in Fig. 4(c).
Since we have not found any public dataset with a configuration matching our application, we built our own dataset with a dual-camera system on an autonomous vehicle. The videos were captured by a Sekonix SF3324 (CAM-120) and a Sekonix SF3325 (CAM-60) with an NVIDIA DRIVE AGX platform. The video sequences were collected inside the SAIC Motor Park and the driving route is shown in Fig. 6.
The statistics of the dataset are listed in Table I. The videos from the two cameras are synchronized by hardware. We automatically extract images from the videos at a rate of one frame per second. Then we select about 1000 image pairs and manually label six classes of semantics: background (BG), road, person, car, barrier and cycle. We use these images to train a PSPNet to automatically label the other images, which serve as the ground truth for later training the MobileNetV3-large based semantic segmentation network for real-time parsing.
In addition to the semantic labels, we also need ground-truth optical flow to train the PWC-Net, which is difficult to obtain. We therefore synthesize a warped image from an input image by a random perspective and affine transformation. The flow generated by the transformation is used as the ground truth. In Sec. IV-D2 we will show that training on this dataset improves the performance of PWC-Net on the original dataset.
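The synthetic ground truth described above follows directly from the transformation: each pixel's flow is its transformed position minus its original position. The sketch below is a hedged illustration of that idea; the perturbation ranges, the function names, and the forward-flow convention (`flow[y][x]` points from the source pixel to its location in the warped image) are all assumptions rather than the paper's settings.

```python
import random


def random_homography(rng, max_shift=10.0, max_scale=0.1, max_persp=1e-4):
    """Draw a small random affine-plus-perspective perturbation (hypothetical ranges)."""
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    return [[s, rng.uniform(-max_scale, max_scale), rng.uniform(-max_shift, max_shift)],
            [rng.uniform(-max_scale, max_scale), s, rng.uniform(-max_shift, max_shift)],
            [rng.uniform(-max_persp, max_persp), rng.uniform(-max_persp, max_persp), 1.0]]


def flow_from_homography(h, width, height):
    """Dense flow: flow[y][x] = (x' - x, y' - y), where (x', y') = h applied to (x, y)."""
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            w = h[2][0] * x + h[2][1] * y + h[2][2]
            xw = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
            yw = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
            row.append((xw - x, yw - y))
        flow.append(row)
    return flow
```

A training pair would then consist of the input image, its warp under the sampled transformation, and the flow returned here as the label.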
The performance of our network is evaluated with the mean intersection over union (mIoU) metric for semantic segmentation and end-point error (EPE) for optical flow estimation.
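For reference, mean IoU can be computed from per-class intersection and union counts, as in the minimal sketch below. Labels are assumed to be flat lists of integer class indices, and classes that appear in neither the prediction nor the ground truth are skipped when averaging; that skipping rule is a common convention assumed here, not one stated in the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flat lists of integer class labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatched pixel belongs to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)
```

With predictions `[0, 0, 1, 1]` against labels `[0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, giving a mean of 7/12.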
We follow a multi-stage training procedure to train each component of our framework.
Table II. Comparison of the baseline and our networks (columns: Network, CAM, Mean IoU of Semantic Segmentation (%), Model).
Semantic segmentation network: We trained the MobileNetV3-large with a segmentation head for 160K iterations using a mini-batch size of 16. The initial learning rate was set to 0.015 and followed a ‘poly’ policy with power 0.9.
Optical flow estimation network: For the loosely-coupled structure, the PWC-Net was trained separately from the segmentation network. It was first trained on the FlyingChairs dataset with the same settings as . Then we further trained it on our synthetic flow dataset for 300K iterations using a mini-batch size of 8. The initial learning rate was 0.0005 and was scaled by 0.5 at 100K, 200K and 250K iterations. Finally the model was fine-tuned on the real data with the unsupervised losses and the semantic loss sequentially. For the tightly-coupled structure, only one of the feature extraction parts and the optical flow estimation part needed to be trained. The training settings remained the same as for the loosely-coupled structure.
Feature fusion modules: The feature fusion modules were trained with the whole network with the fixed weights of MobileNetV3-large and PWC-Net. The feature fusion module in CAM-60 branch was first trained for 60K iterations with a mini-batch size of 4. The initial learning rate was set to 0.001. The other training settings were the same as MobileNetV3-large. The feature fusion module in CAM-120 was also trained in the same way.
Fine-tune: The whole network was finally fine-tuned together for 120K iterations with the same settings as training MobileNetV3-large. In order to keep a steady performance of optical flow estimation, the weights of PWC-Net were fixed in the final fine-tuning.
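The two learning-rate schedules quoted in the training procedure can be written down explicitly. The sketch below uses the common definition of the 'poly' policy, $lr = lr_0 (1 - t/T)^{p}$, and a step schedule that halves the rate once each milestone is reached. Whether the drop takes effect exactly at the milestone iteration is an assumption, and the function names are illustrative.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """'Poly' policy: decay base_lr by (1 - iteration / max_iterations) ** power."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


def step_lr(base_lr, iteration, milestones=(100_000, 200_000, 250_000), gamma=0.5):
    """Step policy: multiply base_lr by gamma once for every milestone reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops
```

For the segmentation network this corresponds to `poly_lr(0.015, t, 160_000)`, and for the synthetic-flow stage of PWC-Net to `step_lr(0.0005, t)` over 300K iterations.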
We use PyTorch to implement our network. The network is trained and tested on two NVIDIA Tesla V100 GPUs.
We have chosen the MobileNetV3-large based segmentation network  as our baseline, which is also used in the CAM-120 branch of our framework. As shown in Table II, we have compared the semantic segmentation performance and model statistics of the baseline and our network with loosely-coupled and tightly-coupled structures.
For the CAM-60 branch, the segmentation results show that our loosely-coupled structure slightly outperforms the baseline in general, although there is only a lightweight CNN in this branch. Besides, the performance of our tightly-coupled structure is also very close to the baseline, while reusing more intermediate features.
For the CAM-120 branch, both our loosely-coupled and tightly-coupled structures show an obvious improvement over the baseline, especially the loosely-coupled structure on the classes Person (2.3), Barrier (1.7) and Cycle (5.9). This reflects the effectiveness of our semantics sharing module and feature fusion module, which propagate and fuse the semantic information from CAM-60 to CAM-120. The sharing of such information compensates for and improves the features of small objects in the view of CAM-120. As shown in Fig. 7, our network successfully recovers the missing small objects that are far from the vehicle (refer to the first and second groups of image pairs), and classifies the boundaries of small objects more accurately (refer to the third group of image pairs).
Our loosely-coupled model has 2.3 more parameters than the baseline, which can be reduced to 1.6 with the tightly-coupled structure. The computation is evaluated with input images of 1920×1208 resolution for MobileNetV3-large and 768×483 for PWC-Net. The results show that our loosely-coupled model has comparable computation to the baseline, while the tightly-coupled model needs even fewer computation resources.
The influence of the loosely-coupled and tightly-coupled structures on optical flow estimation was evaluated. The results on Data_Sim (the synthesized image pairs) and Data_Real (the real image pairs) are listed in Table III. The loosely-coupled structure achieves lower AEPE and unsupervised loss on both datasets. This is mainly because, in the tightly-coupled structure, we fixed the weights of the reused feature extraction part from MobileNetV3-large and trained only the remaining part of PWC-Net. From the comparison of semantic segmentation in Table II for both structures, we can see that such inaccuracy of flow estimation has only a slight effect on the segmentation results after fine-tuning the whole network.
Since the performance of optical flow estimation can be influenced by the training schedules on different datasets, we also evaluated the effectiveness of the synthetic Data_Sim dataset. Table IV compares three different types of training schedules. It suggests that training on Data_Sim has a positive effect on the original network trained with the FlyingChairs dataset and improves its performance on the final Data_Real dataset.
The performance of the optical flow estimation network directly affects the shared semantics. In Table V we compare the semantic segmentation results for the CAM-60 branch with and without the optical flow warping in the semantics sharing module. Note that the feature fusion modules are also skipped in this evaluation. The results show that when warping only by the perspective transformation (P.T.), the segmentation results are relatively poor, especially for the classes of small objects, which means the semantics are badly propagated. After applying the warping with the optical flow, the performance is significantly enhanced (11.0), suggesting the importance of accurate remapping.
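Table III. Optical flow estimation for different network structures.
|Network Structure||AEPE||Unsupervised Loss|
Table IV. Optical flow estimation for different training schedules.
|Training Schedule||AEPE||Unsupervised Loss|
Table V. Semantic segmentation for CAM-60 with different warping.
|Warping||Mean IoU of Semantic Segmentation for CAM-60 (%)|
|P.T. + Flow||98.9||98.5||61.6||92.4||53.0||59.9||77.4|
Table VI. Semantic segmentation for CAM-60 with different feature fusion modules.
|Mean IoU of Semantic Segmentation for CAM-60 (%)|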
Table VI compares the integration of different types of feature fusion blocks, taking the CAM-60 branch as an example. Even the simplest basic block dramatically boosts the final segmentation performance. The bottleneck type achieves similar outputs to the residual type in most classes as well as in the total average, although it has far fewer parameters and needs less computation.
In this letter we demonstrate how to boost the performance of scene parsing for real-time autonomous driving applications with shared semantics. A semantics sharing and fusion framework was proposed to propagate semantic features between two cameras with different perspectives. The shared semantics not only reduce the duplicated computation in the feature extraction procedure, but also refine the segmentation results for both cameras. In future work we will further investigate sharing semantics in video scene parsing to realize a more compact and faster semantic perception system.