I Introduction
Disparity estimation (also referred to as stereo matching) is a classical and important problem in robotics and autonomous driving for 3D scene reconstruction [1, 2, 3]. While traditional methods based on handcrafted feature extraction and matching cost aggregation, such as Semi-Global Matching (SGM) [4], tend to fail on textureless and repetitive regions in the images, recent deep neural network (DNN) techniques surpass them with decent generalization and robustness on those challenging patches, and achieve state-of-the-art performance on many public datasets [5][6][7][8][9][10]. However, how to design an efficient DNN structure for disparity estimation under the limited computational budgets of Internet-of-Things (IoT) scenarios remains a concern.
DNN-based methods for disparity estimation are end-to-end frameworks which take stereo images (left and right) as input and predict the disparity directly. The DNN architecture is essential for accurate estimation, and existing architectures can be categorized into two classes: the encoder-decoder network with 2D convolution (ED-Conv2D) and the cost volume matching with 3D convolution (CVM-Conv3D). Besides, recent studies [11, 12]
begin to reveal the potential of automated machine learning (AutoML) for neural architecture search (NAS) on stereo matching. In practice, to measure whether a DNN model is applicable in real-world applications, we not only need to evaluate its accuracy on unseen stereo images (whether it can estimate the disparity correctly), but also its time efficiency (whether it can generate results in real time). However, existing methods focus either on model accuracy (e.g., [9][10]) or on time efficiency (e.g., [13][14][15]), which makes the trained models less applicable to real-world applications that require real-time inference on GPU servers or mobile devices together with good model accuracy.
Among ED-Conv2D methods, which are relatively compute-efficient compared to CVM-Conv3D, stereo matching neural networks [5][6][8] were first proposed for end-to-end disparity estimation by exploiting an encoder-decoder structure. The encoder extracts features from the input images, and the decoder predicts the disparity from the generated features. The disparity prediction is optimized as a regression or classification problem using large-scale datasets (e.g., SceneFlow [8]) with disparity ground truth. The correlation layer [7][8] was then proposed to increase the learning capability of DNNs in disparity estimation, and it has proved successful in learning strong features at multiple scales [7][8][17][18][19]. To further improve the capability of the models, residual networks [20][21][22] are introduced into disparity estimation architectures, since the residual structure makes much deeper networks easier to train [23]. The ED-Conv2D methods have proven computationally efficient, but they cannot achieve very high estimation accuracy [24].
To address the accuracy problem of disparity estimation, researchers have proposed CVM-Conv3D networks to better capture the features of stereo images and thus improve the estimation accuracy [6][25][9][10][26]. The key idea of the CVM-Conv3D methods is to generate a cost volume by concatenating left feature maps with their corresponding right counterparts across each disparity level [25][9]. The features of the cost volume are then automatically extracted by 3D convolution layers. However, 3D operations in DNNs are computing-intensive and hence very slow even on current powerful AI accelerators (e.g., GPUs). Although the 3D-convolution-based DNNs achieve state-of-the-art disparity estimation accuracy, they are difficult to deploy due to the very high latency of generating results. On one hand, they require a large amount of memory to load the model, so only a limited set of accelerators (such as the Nvidia Tesla V100 with 32 GB of memory) can run them. On the other hand, it takes several seconds to generate a single result with CVM-Conv3D models even on a very powerful Nvidia Tesla V100 GPU. The memory consumption and high computation workloads make the CVM-Conv3D methods difficult to deploy in practice. Therefore, it is crucial to address both the accuracy and efficiency problems for real-world applications.
To ease the human effort of designing an efficient network structure for stereo matching, some recent studies [11, 27] also take advantage of automated machine learning (AutoML) [12], especially the neural architecture search (NAS) technique, to search for the optimal set of network operators as well as their connections. However, those state-of-the-art methods are still far from real-time inference even on a server GPU, since they are still based on either complicated network stacking or inefficient 3D convolution operations. Besides, another series of studies focuses on lightweight network structures for fast inference, such as StereoNet [28] and AnyNet [29]. However, the lightweight models significantly sacrifice model accuracy, especially on complex realistic datasets such as KITTI [30] and Middlebury [31].
To achieve a practical model for stereo matching, we propose FADNet++, which produces real-time and accurate disparity estimation with configurable networks. This article is an extension of our previous conference paper [24]. Similar to the previous FADNet, in FADNet++ we first exploit multiple stacked 2D convolution layers for fast computation, then combine state-of-the-art residual architectures to improve the learning capability, and finally introduce multi-scale outputs so that FADNet++ can exploit multi-scale weight scheduling to improve the training speed. As illustrated in Fig. 7, our FADNet++ easily obtains performance comparable to the state-of-the-art GANet [10], while it runs approximately 70× faster than GANet and consumes 3× less GPU memory. Besides, the new FADNet++ advances the previous FADNet in [24] in three aspects. First, we allow configurable variants of FADNet++ to meet different demands on model accuracy and speed. Second, we conduct an extensive comparative study on the model accuracy and speed of different FADNet++ variants during both the training and inference stages. Third, compared to only two stereo datasets and two high-end GPUs in [24], we validate our proposed FADNet++ on four stereo datasets and six different GPU platforms, from server-level to edge-level. As shown in Fig. 8, the FADNet++ variants (denoted by “FADNet-*”) can adapt to platforms of different computing capability. On a server GPU, even the slowest FADNet++ variant achieves 30 FPS with a lower EPE than the CVM-Conv3D methods. On a mobile GPU, our FADNet++ can achieve up to 15 frames per second (FPS) with a much lower EPE than the fastest AnyNet [29]. We make the FADNet++ project publicly available at https://github.com/HKBU-HPML/FADNet. Our contributions are summarized as follows:

We propose an accurate yet efficient DNN architecture for disparity estimation named FADNet++ (with a configurable architecture to support multiple hardware platforms for efficient inference), which achieves prediction accuracy comparable to CVM-Conv3D models while running an order of magnitude faster than those 3D-based models.

We develop a multiple-round training scheme with multi-scale loss weight scheduling for FADNet++ and its variants, which improves the training speed while maintaining model accuracy.

We achieve state-of-the-art accuracy on the Scene Flow dataset with more than 14× and up to 69× faster disparity prediction than both the NAS-based (LEAStereo [27]) and the human-designed (PSMNet [9] and GANet [10]) models. Besides, by tuning the channel ratios of FADNet++ to meet limited computational resources, the variant FADNet-S advances the existing mobile solution, AnyNet [29], with much higher prediction accuracy and a competitive inference speed of 15 FPS on the mobile Jetson AGX.
The rest of the paper is organized as follows. We introduce related work on DNN-based solutions to disparity estimation in Section II. Section III introduces the methodology and implementation of our proposed network with configurable model sizes. We present our experimental settings and results in Section IV. We finally conclude the paper in Section V.
II Related Work
There exist many studies using deep learning methods to estimate image depth from monocular, stereo, and multi-view images. Although monocular vision is low-cost and commonly available in practice, it does not explicitly introduce any geometrical constraint, which is important for disparity estimation [32]. On the contrary, stereo vision leverages the cross-reference between the left and the right view, and usually shows greater performance and robustness in geometrical tasks. Thanks to the rapid and promising development of DNNs, stereo matching also benefits considerably from DNNs, which efficiently extract strong feature representations and fit the cost matching function between the left and right views. Early studies mainly focus on optimizing existing network architectures through enormous hands-on trial-and-error tweaking. Besides, recent studies also leverage multi-task learning [33, 34, 35] to combine other prior vision information, and NAS-based methods [27, 11] to tweak the network structure as well as the operator hyper-parameters (i.e., kernel size and channel number of the convolution layers). According to the basic operator (related to the computational efficiency) and the network pipeline, we mainly discuss two branches of network structures for disparity estimation: the ED-Conv2D series and the CVM-Conv3D series.
II-A Disparity Estimation with ED-Conv2D CNNs
In the ED-Conv2D series, end-to-end architectures with mainly 2D convolution layers [8][16] are proposed for disparity estimation; they take two stereo images as input and generate the disparity directly, optimizing the disparity as a regression task. This is achieved by adopting large U-shape encoder-decoder networks with 2D convolutions to predict the disparity map. However, pure 2D CNN architectures struggle to capture the matching features, so their estimation results are not satisfactory. To address the problem, the correlation layer, which expresses the relationship between the left and right images, is introduced into the end-to-end architecture (e.g., DispNetCorr1D [8], FlowNet [7], FlowNet2 [17], DenseMapNet [18]). The correlation layer significantly increases the estimation performance compared to pure CNNs, but existing architectures are still not accurate enough for production. Furthermore, CRL [16] and FADNet [24] introduce the idea of residual learning [20] to conduct efficient disparity refinement in a coarse-to-fine manner. Liang et al. [36] apply a similar idea but construct multi-scale cost volumes from the feature pyramid. Although the existing ED-Conv2D methods enjoy high model inference efficiency, they usually fail to produce satisfactory results in some challenging scenarios. Besides, some studies leverage multi-task learning to incorporate other visual information, such as edge cues [34] and semantic segmentation [33], to improve the accuracy on textureless regions, detailed structures, and small objects.
II-B Disparity Estimation with CVM-Conv3D CNNs
The CVM-Conv3D CNNs are further proposed to increase the estimation performance [6][25][9][10][26]; they leverage the concept of semi-global matching [4] to learn disparities from a 4D cost volume. The cost volume is mainly constructed by concatenating left feature maps with their corresponding right counterparts across each disparity level [25][9], and the features of the generated cost volumes are learned by 3D convolution layers. The CVM-Conv3D CNNs can automatically learn to regularize the cost volume and have achieved state-of-the-art accuracy on various datasets. However, the key limitation of the 3D-based CNNs is their extremely high computation resource requirements. For example, training GANet [10] on the Scene Flow [8] dataset takes weeks even using very powerful Nvidia Tesla V100 GPUs. Even though they achieve good accuracy, they are difficult to deploy due to their very low time efficiency. Thus, recent research proposes optimization solutions such as cost volume compression by grouping [37], efficient search space pruning [38], and cooperative learning of multi-scale features [39]. However, AANet [39], the fastest among all CVM-Conv3D CNNs, only runs at 12 FPS even on a powerful Tesla V100 GPU and is still far from real-time inference on lower-end devices. Besides, to lessen the effort dedicated to designing network architectures, automated machine learning (AutoML) [12], especially neural architecture search (NAS) [40, 41, 42], has also been applied to stereo matching in [27, 11, 43] and successfully achieved leading accuracy and generalization in several benchmarks. However, the low time efficiency and high memory footprint of those 3D-convolution-based architectures remain. To this end, we propose a fast and accurate DNN model for disparity estimation.
III Approach
III-A Model Design and Implementation
Our proposed FADNet++ exploits the structure of DispNetC [8] as a backbone, but it is extensively reformed to take care of both accuracy and inference speed, which is lacking in existing studies. We introduce four components in FADNet++ to enable good generalization ability and fast inference with a configurable size for different hardware. 1) We first change the structure in terms of branch depth and layer type by introducing two new modules, the residual block and the point-wise correlation; 2) we then exploit a multi-scale residual learning strategy for training the refinement network; 3) we design the model to be configurable (with scaling ratios) to balance accuracy and inference speed; 4) finally, a loss weight training schedule is used to train the network in a coarse-to-fine manner.
III-B Residual Block and Point-wise Correlation
DispNetC and DispNetS, both from the study in [8], basically use an encoder-decoder structure equipped with five feature extraction and down-sampling layers and five feature deconvolution layers. For feature extraction and down-sampling, DispNetC and DispNetS first adopt a convolution layer with a stride of 1 and then a convolution layer with a stride of 2, so that they consistently shrink the feature map size by half. We call this two-layer convolution with size reduction DualConv, as shown in Fig. 10(a). DispNetC, equipped with DualConv modules and a correlation layer, achieves an end-point error (EPE) of 1.68 on the SceneFlow dataset, as reported in [8].
The residual block, originally derived in [20] for image classification tasks, is widely used to learn robust features and train very deep networks, as it can well address the gradient vanishing problem. Thus, we replace the convolution layers in the DualConv module with residual blocks to construct a new module called DualResBlock, as shown in Fig. 10(b). With DualResBlock, we can make the network deeper without training difficulty, and we further increase the number of feature extraction and down-sampling layers from five to seven. Finally, DispNetC and DispNetS evolve into two new networks with better learning ability, called RBNetC and RBNetS respectively, as shown in Fig. 9.
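To make the structural difference concrete, the following is a minimal PyTorch sketch of a DualConv module and a DualResBlock module, assuming a standard basic residual block; the exact channel progression, normalization layers, and activation choices of FADNet++ may differ.

```python
import torch.nn as nn


class BasicResBlock(nn.Module):
    """A standard residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Downsample the identity path when the spatial size changes.
        self.shortcut = (nn.Sequential(
            nn.Conv2d(channels, channels, 1, stride, bias=False),
            nn.BatchNorm2d(channels)) if stride != 1 else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))


class DualConv(nn.Module):
    """DispNet-style block: a stride-1 conv followed by a stride-2 conv (halves H and W)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 2, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)


class DualResBlock(nn.Module):
    """Same role as DualConv, but the convolutions are replaced by residual blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 3, 1, 1)   # match the channel count first
        self.res1 = BasicResBlock(out_ch, stride=1)
        self.res2 = BasicResBlock(out_ch, stride=2)      # halves H and W

    def forward(self, x):
        return self.res2(self.res1(self.proj(x)))
```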
One of the most important contributions of DispNetC is the correlation layer, which aims to find correspondences between the left and right images. Given two multi-channel feature maps $\mathbf{f}_1, \mathbf{f}_2$ with $w$, $h$ and $c$ as their width, height and number of channels, the correlation layer calculates the cost volume of them using Eq. (1):
$$c(\mathbf{x}_1, \mathbf{x}_2) = \sum_{\mathbf{o} \in [-k,k] \times [-k,k]} \langle \mathbf{f}_1(\mathbf{x}_1 + \mathbf{o}), \mathbf{f}_2(\mathbf{x}_2 + \mathbf{o}) \rangle, \qquad (1)$$
where $K = 2k+1$ is the kernel size of cost matching, and $\mathbf{x}_1$ and $\mathbf{x}_2$ are the centers of two patches from $\mathbf{f}_1$ and $\mathbf{f}_2$ respectively. Computing all patch combinations involves $c \times K^2$ multiplications and produces a cost matching map of size $w \times h \times w \times h$. Given a maximum searching range $D$, we fix $\mathbf{x}_1$ and shift $\mathbf{x}_2$ along the x-axis from $-D$ to $D$ with a stride of two. Thus, the final output cost volume size becomes $w \times h \times (D+1)$.
However, the correlation operation assumes that each pixel in the patch contributes equally to the point-wise convolution results, which may limit the ability to learn more complicated matching patterns. Here we propose the point-wise correlation, composed of two modules. The first module is a classical convolution layer, and the second one is an element-wise multiplication defined by Eq. (2):
$$c(\mathbf{x}_1, \mathbf{x}_2) = \langle \mathbf{f}_1(\mathbf{x}_1), \mathbf{f}_2(\mathbf{x}_2) \rangle, \qquad (2)$$
where we remove the patch-wise aggregation of Eq. (1). Note that the maximum search range at the original image resolution should not be larger than the maximum valid disparity. For example, in the SceneFlow dataset the maximum valid disparity is 192, and the correlation layer of our FADNet++ is placed after the third DualResBlock, whose output feature resolution is 1/8 of the original, so a proper searching range should be on the order of 192/8. We set the value to 20. We also test other values, such as 10 and 40, which do not surpass the version using 20; applying a too small or too large search range may lead to under-fitting or over-fitting.
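The following is a minimal PyTorch sketch of the point-wise correlation described above: a plain convolution applied to both feature maps, followed by channel-wise inner products (Eq. (2)) between the left features and the right features shifted along the x-axis from -D to D with a stride of two. The actual FADNet++ implementation may differ in details such as padding, normalization, and the shift direction.

```python
import torch
import torch.nn as nn


class PointwiseCorrelation(nn.Module):
    """Sketch of the point-wise correlation: conv + shifted element-wise products."""

    def __init__(self, in_channels, max_disp=20):
        super().__init__()
        self.max_disp = max_disp
        # First module: a classical convolution applied to both feature maps.
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)

    def forward(self, f1, f2):
        f1, f2 = self.conv(f1), self.conv(f2)
        costs = []
        for d in range(-self.max_disp, self.max_disp + 1, 2):
            # Shift f2 by d pixels along the x-axis and zero the wrapped border.
            f2_shifted = torch.roll(f2, shifts=d, dims=3)
            if d > 0:
                f2_shifted[:, :, :, :d] = 0
            elif d < 0:
                f2_shifted[:, :, :, d:] = 0
            # Inner product over the channel dimension (Eq. 2).
            costs.append((f1 * f2_shifted).mean(dim=1, keepdim=True))
        return torch.cat(costs, dim=1)   # B x (D+1) x H x W


# Example with random feature maps at a reduced resolution.
corr = PointwiseCorrelation(in_channels=64, max_disp=20)
left = torch.randn(1, 64, 68, 120)
right = torch.randn(1, 64, 68, 120)
print(corr(left, right).shape)  # torch.Size([1, 21, 68, 120])
```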
Table I lists the accuracy improvement brought by the proposed DualResBlock and point-wise correlation. To simplify the validation experiment, we train the models on the same SceneFlow [8] dataset for only 20 epochs, which is different from the complete training scheme in Section IV. It is observed that RBNetC outperforms DispNetC with a much lower EPE, which indicates the effectiveness of the residual structure. We also notice that setting a proper searching range value for the correlation layer helps further improve the model accuracy.
Model  Searching Range  Training EPE [px]  Test EPE [px]
DispNetC  20  2.89  2.80 
RBNetC  10  2.28  2.06 
RBNetC  20  2.09  1.76 
RBNetC  40  2.12  1.83 
III-C Multi-Scale Residual Learning
Instead of directly stacking DispNetC and DispNetS sub-networks to conduct the disparity refinement procedure [18], we apply the multi-scale residual learning first proposed in [16]. The basic idea is that the second (refinement) network learns disparity residuals and adds them to the initial results generated by the first network, instead of directly predicting the whole disparity map. In this way, the second network only needs to focus on learning the highly non-linear residual, which helps avoid gradient vanishing. Our final FADNet++ is formed by stacking RBNetC and RBNetS with multi-scale residual learning, as shown in Fig. 9.
As illustrated in Fig. 9, the upper RBNetC takes the left and right images as input and produces disparity maps at a total of 7 scales, denoted by $d^{(s)}$, where $s$ ranges from 0 to 6. The bottom RBNetS exploits the left image, the right image, and the warped left image as inputs to predict the residuals. The generated residuals from RBNetS, denoted by $r^{(s)}$, are then added to the predictions of RBNetC to generate the final disparity maps at multiple scales. Thus, the final disparity maps predicted by FADNet++, denoted by $\tilde{d}^{(s)}$, can be calculated by
$$\tilde{d}^{(s)} = d^{(s)} + r^{(s)}, \quad s = 0, 1, \dots, 6. \qquad (3)$$
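The residual accumulation of Eq. (3), together with the disparity-based warping used to form the refinement inputs, can be sketched as follows; the exact warping convention and interpolation settings of our implementation may differ.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(right, disp):
    """Warp the right image to the left view using a left disparity map.
    right: B x C x H x W, disp: B x 1 x H x W (in pixels)."""
    b, _, h, w = right.shape
    # Build a sampling grid shifted by -disp along x, normalized to [-1, 1].
    xs = torch.arange(w, device=right.device).view(1, 1, w).expand(b, h, w).float()
    ys = torch.arange(h, device=right.device).view(1, h, 1).expand(b, h, w).float()
    xs = xs - disp.squeeze(1)
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=3)
    return F.grid_sample(right, grid, align_corners=True)


def refine(initial_disps, residuals):
    """Eq. (3): add the residuals predicted by RBNetS to RBNetC's predictions,
    scale by scale (s = 0 is the full-resolution output)."""
    return [d + r for d, r in zip(initial_disps, residuals)]
```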
III-D Configurable Network Size
Although recent state-of-the-art models, such as PSMNet [9], GANet [10], LEAStereo [27], and our previous FADNet [24], produce decent disparity estimation accuracy, their practicability on computing devices of different computational capability, especially low-end mobile ones, has not yet been extensively studied. Recently, AnyNet [29] reduced the inference overhead of stereo matching by refining the disparity map in a coarse-to-fine manner according to the target device, and made it possible to deploy on a mobile Jetson TX2 platform at over 20 FPS. However, the low-level features, which are important to recover object details and boundaries, may be discarded to keep a high inference speed on a low-end device. Different from AnyNet, we keep all the features from low to high scales but make the channel numbers of the convolution/deconvolution layers configurable, so that we can balance model accuracy and inference speed. Our design has three advantages. First, the network size can be easily controlled by two ratio parameters, which proves simple yet effective in our experiments. Second, the variants of different configurations still share the overall network structure of FADNet++ instead of dropping layers/modules (as adopted in [28]) or scales (as adopted in [29]), so the benefits of the FADNet++ backbone are maintained. Third, the configurable ratios are convenient for balancing accuracy and performance under different application requirements.
In our proposed FADNet++, RBNetC and RBNetS have the same numbers of layers in their encoder and decoder parts. Assume that the encoder part has $N_e$ layers and the decoder part has $N_d$ layers. The $i$-th layer in the encoder is denoted by $E_i$ and the $j$-th layer in the decoder by $D_j$. For each convolution layer we have a basic channel number, denoted by $C_b(\cdot)$, which also indicates the minimum number of channels. We then introduce two ratios, ERatio for the encoders and DRatio for the decoders, to conveniently configure the model size. By assigning different values to ERatio and DRatio, we can construct a set of FADNet++ variants; some of them are listed in Table II. The channel number of each convolution layer is calculated by
$$C(E_i) = C_b(E_i) \times \text{ERatio}, \qquad (4a)$$
$$C(D_j) = C_b(D_j) \times \text{DRatio}. \qquad (4b)$$
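Under our reading of Eq. (4), each layer's channel count is its basic (minimum) channel number scaled by the corresponding ratio. A small helper for instantiating the Table II variants might look like the following; the basic channel numbers shown are illustrative placeholders, not the exact FADNet++ layout.

```python
# Illustrative basic channel numbers C_b for the encoder and decoder layers.
ENCODER_BASE = [4, 8, 16, 32, 32, 64, 64]   # placeholders for E_1 .. E_Ne
DECODER_BASE = [64, 32, 32, 16, 8, 4, 4]    # placeholders for D_1 .. D_Nd

VARIANTS = {                                 # (ERatio, DRatio), as in Table II
    "FADNet++": (16, 16),
    "FADNet-M": (8, 8),
    "FADNet-S": (4, 4),
    "FADNet-T": (2, 1),
}


def channel_config(variant):
    """Return per-layer channel counts following Eqs. (4a) and (4b)."""
    e_ratio, d_ratio = VARIANTS[variant]
    enc = [c * e_ratio for c in ENCODER_BASE]   # Eq. (4a)
    dec = [c * d_ratio for c in DECODER_BASE]   # Eq. (4b)
    return enc, dec


print(channel_config("FADNet-S"))
```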
The configurable network size obviously improves the flexibility of FADNet++ in terms of network parameters as well as model inference speed. We will further evaluate its effectiveness and efficiency in Section IV by deploying different variants on a wide range of computing devices. On the one hand, on a server GPU, the full FADNet++ outperforms the expensive CVM-Conv3D methods with slightly better accuracy and a considerable margin in model speed. On the other hand, on a mobile device, the shrunk FADNet-T beats the real-time AnyNet with equivalent model speed but much lower prediction errors.
Network  ERatio  DRatio  Params [M] 
FADNet++  16  16  124.38 
FADNet-M  8  8  31.15
FADNet-S  4  4  7.82
FADNet-T  2  1  1.65
III-E Loss Function Design
Given a pair of stereo RGB images, our FADNet++ takes them as input and produces seven disparity maps at different scales. Assume that the input image size is $H \times W$. The dimensions of the seven output disparity maps are $H \times W$, $\frac{H}{2} \times \frac{W}{2}$, $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, $\frac{H}{32} \times \frac{W}{32}$, and $\frac{H}{64} \times \frac{W}{64}$, respectively. To train FADNet++ in an end-to-end manner, we adopt the pixel-wise smooth L1 loss between the predicted disparity map and the ground truth:
$$L_s\big(\tilde{d}^{(s)}, \hat{d}^{(s)}\big) = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathrm{smooth}_{L_1}\big(\tilde{d}^{(s)}_i - \hat{d}^{(s)}_i\big), \qquad (5)$$
where $N_s$ is the number of pixels of the disparity map at scale $s$, $\tilde{d}^{(s)}_i$ is the $i$-th element of $\tilde{d}^{(s)}$, and
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise}. \end{cases} \qquad (6)$$
Note that $\hat{d}^{(s)}$ is the ground truth disparity at scale $s$ and $\tilde{d}^{(s)}$ is the predicted disparity at scale $s$. The loss function is applied separately to the seven scales of outputs, which generates seven loss values. The loss values are then accumulated with loss weights.
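A minimal PyTorch sketch of Eqs. (5) and (6), applying the built-in smooth L1 loss per scale after resizing the ground truth to the matching resolution (the exact downsampling and invalid-pixel masking in our implementation may differ):

```python
import torch.nn.functional as F


def multiscale_smooth_l1(preds, gt):
    """preds: list of 7 predicted disparity maps (B x 1 x H/2^s x W/2^s, s = 0..6).
    gt: full-resolution ground-truth disparity (B x 1 x H x W).
    Returns one smooth L1 loss value per scale (Eqs. 5 and 6)."""
    losses = []
    for pred in preds:
        # Resize the ground truth to the prediction's resolution and rescale
        # the disparity values accordingly.
        scale = pred.shape[-1] / gt.shape[-1]
        gt_s = F.interpolate(gt, size=pred.shape[-2:], mode="bilinear",
                             align_corners=False) * scale
        losses.append(F.smooth_l1_loss(pred, gt_s))
    return losses
```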
Round  $w_0$  $w_1$  $w_2$  $w_3$  $w_4$  $w_5$  $w_6$
1  0.32  0.16  0.08  0.04  0.02  0.01  0.005 
2  0.6  0.32  0.08  0.04  0.02  0.01  0.005 
3  0.8  0.16  0.04  0.02  0.01  0.005  0.0025 
4  1.0  0  0  0  0  0  0 
The loss weight scheduling technique, initially proposed in [8], is useful to learn the disparity in a coarse-to-fine manner. Instead of just switching the losses of different scales on and off, we apply different non-zero weight groups to tackle the different scales of disparity. Let $w_s$ denote the weight for the loss at scale $s$. The final loss function is
$$L = \sum_{s=0}^{6} w_s \cdot L_s\big(\tilde{d}^{(s)}, \hat{d}^{(s)}\big). \qquad (7)$$
The specific setting is listed in Table III. In total there are seven scales of predicted disparity maps. At the beginning, we assign low weights to the losses of the large-scale (high-resolution) disparity maps so that the network learns the coarse features first. We then increase the loss weights of the large scales to let the network gradually learn finer features. Finally, we deactivate all the losses except the final prediction at the original input size. With the successive rounds of weight scheduling, the evaluation EPE gradually decreases to the final accuracy shown in Table IV on the SceneFlow dataset.
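The schedule of Table III can be expressed as one weight vector per round and combined with the per-scale losses of the previous sketch according to Eq. (7); a minimal sketch:

```python
# Loss weights w_0..w_6 for each training round (Table III). w_0 weights the
# full-resolution output; the last round keeps only the final prediction.
LOSS_WEIGHT_SCHEDULE = {
    1: [0.32, 0.16, 0.08, 0.04, 0.02, 0.01, 0.005],
    2: [0.60, 0.32, 0.08, 0.04, 0.02, 0.01, 0.005],
    3: [0.80, 0.16, 0.04, 0.02, 0.01, 0.005, 0.0025],
    4: [1.00, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}


def total_loss(per_scale_losses, round_idx):
    """Eq. (7): the weighted sum of the seven per-scale losses."""
    weights = LOSS_WEIGHT_SCHEDULE[round_idx]
    return sum(w * l for w, l in zip(weights, per_scale_losses))
```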
Network  Round  # Epochs  Training EPE [px]  Test EPE [px]  Improvement (%)
FADNet++  1  20  1.45  1.28    
2  20  1.07  0.96  25.0  
3  20  0.91  0.89  7.3  
4  30  0.74  0.76  14.6  
FADNet-M  1  20  1.61  1.38   
2  20  1.31  1.19  13.8  
3  20  1.16  1.02  14.3  
4  30  0.97  0.91  10.8  
FADNet-S  1  20  2.10  1.91   
2  20  1.72  1.54  19.4  
3  20  1.58  1.35  12.3  
4  30  1.47  1.19  11.9  
FADNet-T  1  20  3.10  2.52   
2  20  2.65  2.16  14.3  
3  20  2.49  2.11  2.3  
4  30  2.25  1.83  13.3 

Note: “Improvement” indicates the test EPE decrease of the current round of weight scheduling over the previous round.
Table IV lists the model accuracy improvements (13.3% on average and up to 25.0% among all rounds) brought by the multi-round training with the four loss weight groups. For each tested network, both the training and testing EPEs decrease smoothly and stay close to each other, which indicates good generalization and the advantage of our training strategy.
IV Experimental Studies
In this section, we present experimental studies to show the effectiveness of FADNet++. We first demonstrate the accuracy of our proposed networks on different datasets compared to existing state-of-the-art methods. We then present the inference performance on popular inference GPUs (including server GPUs and mobile GPUs) to show that our networks are able to support real-time disparity estimation (i.e., not less than 30 FPS).
IV-A Experimental Setup
Testbed. For model training, we use four Nvidia Tesla V100-PCIe GPUs to train all compared models. For model inference, to cover various types of inference GPUs, we choose a desktop-level Nvidia RTX2070 GPU and two server-level Nvidia GPUs (Tesla P40 and Tesla T4) to measure the inference speed. We also choose two mobile GPUs, Jetson TX2 and Jetson AGX, to evaluate the inference speed. The details of the training and inference servers are shown in Table V, and the mobile inference devices in Table VI. In terms of software related to time performance, the servers are installed with GPU driver 440.36, CUDA 10.2, and PyTorch 1.4.0 with cuDNN 7.6.
Training Server  Inference Servers  
IS1  IS2  IS3  
GPU  Tesla V100 ×4  RTX2070  Tesla P40  Tesla T4
Memory  512 GB  32 GB  256 GB  64 GB
OS  CentOS 7.2  Ubuntu 16.04
Jetson TX2  Jetson AGX  
CPU  4-Core ARM Cortex-A57 + 2-Core Denver 2  8-Core ARM v8.2
GPU  256-Core Pascal  512-Core Volta
Memory  8 GB  32 GB
OS  Ubuntu 18.04.5, JetPack 4.4 
Datasets. To cover a range of scenarios in disparity estimation, we use many popular public datasets, including Middlebury 2014 (M2014) [31], KITTI 2015 (K2015) [30], ETH3D 2017 (ETH3D) [44], and SceneFlow (SF) [8], to evaluate the performance of different algorithms. The details of the datasets are shown in Table VII.
Dataset  # of Training Samples  # of Test Samples  Resolution 
M2014 [31]  15  15  2960×1942
K2015 [30]  200  200  1242×375
ETH3D [44]  27  20  960×480
SF [8]  35454  4370  960×540
The disparity distributions of the different datasets are quite different, which is an important factor in guiding the network design, especially the disparity search range of the point-wise correlation layer discussed in Section III-B. We compute the disparity distribution statistics from the ground truth of the above datasets, as shown in Fig. 15.
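As an illustration, such statistics can be obtained by histogramming the valid ground-truth disparities of each dataset; a minimal sketch assuming the ground truth is already loaded as 2-D float arrays (dataset-specific file loading is omitted):

```python
import numpy as np


def disparity_histogram(disparity_maps, max_disp=192, bin_width=8):
    """Accumulate a histogram of valid ground-truth disparities over a dataset.
    disparity_maps: iterable of 2-D float arrays; invalid pixels are <= 0 or non-finite."""
    bins = np.arange(0, max_disp + bin_width, bin_width)
    hist = np.zeros(len(bins) - 1, dtype=np.int64)
    for disp in disparity_maps:
        valid = disp[np.isfinite(disp) & (disp > 0) & (disp < max_disp)]
        hist += np.histogram(valid, bins=bins)[0]
    return bins, hist
```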
Baselines. We choose existing state-of-the-art DNNs for estimating disparity from stereo images. For ED-Conv2D, we choose DispNetC [8], CRL [16], DN-CSS [18], AnyNet [29], and FADNet [24]. Regarding CVM-Conv3D, we use PSMNet [9], GANet [10], GWCNet [37], AANet [39], and LEAStereo [27]. From the model accuracy perspective, GANet and LEAStereo are the main top-ranked methods, while from the inference performance perspective, AnyNet and FADNet are very efficient. By comparing with these baselines, we show how our proposed framework balances model accuracy and inference speed.
Implementation Details. We first pre-train FADNet++ on the SceneFlow training samples for 90 epochs. Following the fine-tuning strategy proposed in [45], we then jointly fine-tune the pre-trained FADNet++ on the combination of training samples from M2014, K2015, and ETH3D for another 2400 epochs.
IV-B Model Accuracy
In this subsection, we train the chosen models on the selected datasets and evaluate their model accuracy in terms of the end-point error (EPE). We follow the same training scheme as [45], which first trains a base model on the SceneFlow dataset and then fine-tunes the model on the other datasets.
Type  Method  Memory [GB]  EPE [px]  Runtime [s]
ED-Conv2D  DispNetC [8]  1.9  1.68  0.015  
CRL [16]  2.2  1.32  0.026  
AnyNet [29]  1.31  3.39  0.013  
FADNet [24]  2.6  0.83  0.048  
CVM-Conv3D  PSMNet[9]  5.6  1.09  0.619  
GANet[10]  7.5  0.78  2.292  
GWCNet[37]  5.7  0.77  0.260  
AANet[39]  1.91  0.87  0.086  
LEAStereo[27]  25.3  0.78  0.478  

FADNet++  2.3  0.76  0.033  
FADNet-M  1.7  0.91  0.019  
FADNet-S  1.8  1.19  0.015  
FADNet-T  1.9  1.83  0.014
SceneFlow. The accuracy comparison of different models is shown in Table VIII. In terms of EPE on the SceneFlow dataset, our FADNet++ outperforms all the other models, including both ED-Conv2D and CVM-Conv3D ones, which shows the capability of our model to capture the disparity information of stereo images.
Compared to the ED-Conv2D methods, our FADNet++ significantly improves the model accuracy with comparable inference time. For example, among the ED-Conv2D methods, the most accurate model is FADNet with an EPE of 0.83 and an inference time of 0.048 seconds. Our FADNet++ outperforms FADNet in both EPE (around 9% improvement) and runtime (around 50% faster). Regarding the runtime of the ED-Conv2D methods, AnyNet is very efficient with only 0.013 seconds, but its EPE is very high, which is far from real-world production quality. The configurable feature of FADNet++ enables us to configure models of different sizes to balance EPE and runtime. For example, FADNet-T is as efficient as AnyNet, but it achieves a much lower EPE (1.83 vs. 3.39). With the larger FADNet-M, the runtime is only 0.003 seconds longer than AnyNet, but our method achieves a 3.7 times lower EPE than AnyNet.
Compared to the CVM-Conv3D methods, our FADNet++ achieves both better EPE and better inference time. The existing GANet, GWCNet, and LEAStereo obtain about 0.77–0.78 EPE on SceneFlow with at least 0.26 seconds of inference time, while our FADNet++ achieves 0.76 EPE with an order of magnitude smaller inference time. Even AANet, the most efficient 3D model, takes 0.086 seconds, which is more than 2 times slower than FADNet++, and its EPE is still larger than ours.
We also analyze the GPU memory footprint needed to support the runtime execution of each network. The memory space is typically used to hold the model parameters, the optimizer status, and the intermediate output tensors [46]. The memory footprint is managed by the deep learning toolkit (PyTorch in our implementation) and depends not only on the network characteristics listed above but also on the chosen forward/backward algorithms and the memory caching scheme. Notice that the CVM-Conv3D methods usually suffer from large memory requirements and fail to be deployed on low-end computing devices. In contrast, our FADNet++ and its variants only consume around 2 GB of memory, which makes them feasible on many platforms. We also observe that FADNet-S and FADNet-T consume a bit more memory than FADNet-M. The reason is that the cuDNN library may choose different convolution algorithms, which consume different amounts of memory, for different layer channel settings to achieve the best inference efficiency.
The visualization of some samples is shown in Fig. 32, which compares our FADNet++ with two ED-Conv2D networks, DispNetC and CRL, and three CVM-Conv3D networks, AANet, GANet, and LEAStereo. It is observed that DispNetC and CRL fail to produce accurate disparities for the object boundaries. Besides, the hole of the knife is not correctly recognized by those two ED-Conv2D methods. On the contrary, our FADNet++ works well on the boundaries and the details of the knife. The predicted disparity map of FADNet++ is close to those of AANet, GANet, and LEAStereo, while FADNet++ runs much faster than those CVM-Conv3D methods.
Robust Vision Challenge. To demonstrate the model robustness in different scenarios, we adopt a similar strategy to [45] and validate our model on three realistic stereo datasets using the Robust Vision Challenge (RVC) 2020 (http://www.robustvision.net/index.php). In RVC, each model is required to be trained on the dataset combining M2014, K2015, and ETH3D, and is then evaluated on M2014, K2015, and ETH3D separately. We choose the top-ranked representative models on the RVC leaderboard (http://www.robustvision.net/leaderboard.php?benchmark=stereo) to compare model accuracy and inference speed; from top-1 to top-6, the models are CFNet [45], NLCANet_V2 [47], HSMNet [48], CVANet (which has no published paper or code, so we cannot evaluate its runtime), AANet [39], and GANet [10], respectively.
Method  KITTI2015  Middlebury2014  ETH3D2017  Runtime  Rank  
D1_bg  D1_fg  D1_all  bad 4.0  rms  avg error  bad 1.0  bad 2.0  avg error  [s]  
CFNet [45]  1.65  3.53  1.96  11.3  18.2  5.07  3.7  0.97  0.26  0.234  1 
NLCANet_V2 [47]  1.51  3.97  1.92  10.3  21.9  5.60  4.11  1.2  0.29  0.44  2 
HSMNet [48]  2.74  8.73  3.74  9.68  13.4  3.44  4.40  1.51  0.28  0.15  3 
CVANet  1.74  4.98  2.28  23.1  25.9  8.64  4.68  1.37  0.34    4 
AANet [39]  2.23  4.89  2.67  25.8  32.8  12.8  5.41  1.95  0.33  0.062  5 
GANet [10]  1.88  4.58  2.33  16.3  42.0  15.8  6.97  1.25  0.45  1.71  6 
FADNet++  1.99  3.18  2.19  31.4  27.7  11.9  4.36  1.30  0.34  0.029   
The results are shown in Table IX. The runtime of the different models is measured on the same platform using their open-sourced code to guarantee a fair comparison; the runtime of CVANet is left empty as it has no publicly available code or paper. Our model ranks between 3 and 5 on the three datasets among the top-6 models. Specifically, on KITTI 2015, our model is slightly worse than CFNet and NLCANet_V2, and it outperforms the other four models in terms of D1_all. In terms of the average error on the M2014 dataset, our FADNet++ still outperforms AANet and GANet. Regarding the ETH3D dataset, our model outperforms GANet and is comparable with CVANet and AANet. In summary, the top-3 models have good accuracy, but their inference time is very long, while our FADNet++ runs an order of magnitude faster. Compared with the top-4 to top-6 models, FADNet++ achieves comparable model accuracy while being around 2× faster than AANet and around 59× faster than GANet. Note that among the compared methods, only our FADNet++ provides real-time inference speed (i.e., more than 30 FPS) on a Tesla V100 GPU.
Some visualization results on the K2015, M2014, and ETH3D datasets are shown in Fig. 41, Fig. 50, and Fig. 57, respectively. For K2015, compared to GANet and AANet, our FADNet++ generates disparity maps with richer details (see the left white boxes in Fig. 41) and smoother results (see the right white boxes in Fig. 41). For M2014, from the white desk in Fig. 50, it can also be clearly observed that our method produces much better and smoother results than GANet and AANet. For ETH3D, our FADNet++ performs well on textureless regions (such as the ping-pong table); its disparity is close to that of the top-1 CFNet and much smoother than those of the other state-of-the-art methods.
IV-C Inference Efficiency
In the previous subsection, we showed that our model achieves comparable model accuracy while providing very efficient inference on the Tesla V100 GPU. In this subsection, we provide more experimental results on inference GPUs and mobile GPUs to show how our configurable model achieves real-time inference performance on different platforms with good model accuracy.
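For reference, per-image latency and peak GPU memory of the kind reported below can be measured with a simple loop like the following; this is a generic sketch assuming a model that takes a (left, right) pair of tensors, not the exact benchmarking script we used.

```python
import time
import torch


def benchmark(model, height=540, width=960, warmup=10, iters=100):
    """Measure average inference latency (seconds) and peak GPU memory (GB)."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    left = torch.randn(1, 3, height, width, device=device)
    right = torch.randn(1, 3, height, width, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up cuDNN algorithm selection
            model(left, right)
        torch.cuda.synchronize(device)
        start = time.time()
        for _ in range(iters):
            model(left, right)
        torch.cuda.synchronize(device)
    latency = (time.time() - start) / iters
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return latency, peak_gb
```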
Method  EPE [px]  Runtime [s]  
RTX2070  P40  T4  
DispNetC[8]  1.68  0.022  0.025  0.04 
CRL[16]  1.32  0.042  0.047  0.074 
AnyNet[29]  3.39  0.012  0.017  0.014 
FADNet[24]  0.83  0.085  0.096  0.146 
PSMNet[9]  1.09  0.571  0.492  0.792 
GANet[10]  0.78  5.2  5.5  7.344 
GWCNet[37]  0.77  0.45  0.421  0.646 
AANet[39]  0.87  0.124  0.183  0.23 
LEAStereo[27]  0.78  0.851  0.71  0.978 
FADNet++  0.76  0.053  0.06  0.091 
FADNet-M  0.91  0.025  0.031  0.037
FADNet-S  1.19  0.017  0.023  0.023
FADNet-T  1.83  0.013  0.02  0.015
On Inference Server GPUs. The inference performance on the inference servers is shown in Table X. In terms of runtime, AnyNet achieves the fastest speed among the evaluated methods, but its EPE on the SceneFlow dataset is extremely high (3.39). Our FADNet-T achieves an inference speed very close to AnyNet while reducing the EPE from 3.39 to 1.83 (around a 46% improvement). Aiming at real-time inference (i.e., 30 FPS, corresponding to an inference time of around 0.033 s), our FADNet-M provides real-time inference on all three inference server GPUs with an EPE of 0.91. The other existing model that achieves real-time inference on all inference servers, DispNetC, has an EPE of 1.68, around 85% higher than ours. Although the CVM-Conv3D models achieve very good accuracy, they run very slowly on these inference GPUs, so they are far from providing real-time disparity estimation in production. In summary, our configurable framework can be configured into a relatively small model (i.e., FADNet-M) compared to FADNet++ and provides real-time inference speed with good model accuracy.
Method  Memory [GB]  EPE [px]  Runtime [s]
      TX2  AGX
DispNetC[8]  3.9  1.68  0.309  0.108  
StereoNet[28]  9.5  1.10  1.148  0.282  
AnyNet[29]  3.1  3.39  0.125  0.041  
AANet[39]  12.6  0.87  1.83  0.585  
FADNet[24]  4.9  0.83  1.176  0.413  
FADNet++  4.3  0.76  0.735  0.258  
FADNet-M  3.7  0.91  0.335  0.113  
FADNet-S  3.8  1.19  0.192  0.068  
FADNet-T  3.9  1.83  0.111  0.043
On Mobile GPUs. To demonstrate the feasibility of applying our model on mobile devices, we choose two mobile GPUs (Nvidia Jetson TX2 and AGX) to compare the performance. Due to the memory limitation, none of the CVM-Conv3D methods can run on these mobile devices. Therefore, we compare the inference speed with the ED-Conv2D methods and also include the occupied GPU memory footprints. The results are shown in Table XI. Again, AnyNet has very fast inference even on mobile GPUs, but its EPE is rather high. Our configured FADNet-T achieves inference speeds very close to AnyNet while having much better model accuracy. Comparing our configured FADNet-S with StereoNet, both of which have similar model accuracy (EPE around 1.1–1.2), FADNet-S runs about 6.0× and 4.1× faster than StereoNet on the TX2 and AGX GPUs, respectively. In summary, our configurable framework enables us to set different model sizes to adapt to devices of different computing power with reasonable model accuracy. We also profile the device memory usage of the different models. Notice that no CVM-Conv3D models are listed since they fail to run on the tested mobile platforms due to the memory limitation. Compared to existing real-time networks like DispNetC and AnyNet, our FADNet++ and FADNet-M achieve much lower EPEs with similar memory usage. Besides, since the cuDNN library in PyTorch may use different convolution algorithms for different layer channel numbers to achieve the best inference speed, the smaller FADNet-S and FADNet-T can even consume a bit more memory than FADNet-M. In addition, the GPU memory usage of the same network can also differ between two computing platforms, such as 2.3 GB on the V100 but 4.3 GB on the AGX for FADNet++. On the one hand, the memory on Jetson TX2 and AGX is shared by the CPU and GPU, so the memory management strategy differs from the dedicated GPU memory on the V100. On the other hand, the cuDNN library may have different implementations for x86-based and ARM-based systems.
We summarize the performance of the configured FADNet++ models on all evaluated devices in Table XII, which shows the configurable feature of our model for balancing model accuracy and inference speed on different hardware.
Model  EPE [px]  Runtime [s]
    RTX2070  P40  T4  V100  TX2  AGX
FADNet++  0.76  0.053  0.06  0.091  0.032  0.735  0.258 
FADNet-M  0.91  0.025  0.031  0.037  0.016  0.335  0.113
FADNet-S  1.19  0.017  0.023  0.023  0.015  0.192  0.068
FADNet-T  1.83  0.013  0.02  0.015  0.013  0.111  0.043
V Conclusion
In this paper, we proposed FADNet++, an efficient yet accurate neural network for end-to-end disparity estimation that embraces both time efficiency and estimation accuracy for the stereo matching problem. FADNet++ exploits point-wise correlation layers, residual blocks, and a multi-scale residual learning strategy to make the model accurate in many scenarios while preserving a fast inference time. Moreover, to adapt to target computing devices of different capability, we design a simple but effective configurable channel scaling ratio that generates FADNet++ variants with different inference performance. Our training solution can be applied to all the variants and boosts their accuracy. We conducted extensive experiments to compare FADNet++ with existing state-of-the-art 2D- and 3D-based methods in terms of accuracy and speed. Experimental results show that FADNet++ achieves comparable accuracy while running much faster than the 3D-based models. Compared to the existing mobile solution, FADNet++ achieves a competitive inference speed of 15 FPS while being nearly three times as accurate.
We have two future directions following the findings in this paper. First, we would like to improve the disparity estimation accuracy on low-end devices. To approach the accuracy that FADNet++ achieves on server GPUs, it is necessary to explore model compression techniques, including pruning, quantization, and so on. Second, we would also like to apply AutoML [12] to search for a well-performing network structure for disparity estimation.
Acknowledgments
This research was supported by Hong Kong RGC GRF grant HKBU 12200418. We thank the anonymous reviewers for their constructive comments and suggestions. We would also like to thank NVIDIA AI Technology Centre (NVAITC) for providing the GPU clusters for some experiments.
References
 [1] G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, “Building a 3D line-based map using stereo SLAM,” IEEE Transactions on Robotics, vol. 31, no. 6, pp. 1364–1377, 2015.
 [2] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 [3] A. Yang, C. Zhang, Y. Chen, Y. Zhuansun, and H. Liu, “Security and privacy of smart home systems based on the internet of things and stereo matching algorithms,” IEEE Internet of Things Journal, vol. 7, no. 4, pp. 2521–2530, 2020.
 [4] H. Hirschmuller, “Stereo processing by semi-global matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2007.
 [5] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4353–4361.
 [6] J. Zbontar, Y. LeCun et al., “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 132, p. 2, 2016.
 [7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
 [8] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048.
 [9] J.R. Chang and Y.S. Chen, “Pyramid stereo matching network,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [10] F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr, “Ganet: Guided aggregation net for endtoend stereo matching,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [11] T. Saikia, Y. Marrakchi, A. Zela, F. Hutter, and T. Brox, “Autodispnet: Improving disparity estimation with automl,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
 [12] X. He, K. Zhao, and X. Chu, “AutoML: A survey of the state-of-the-art,” arXiv preprint arXiv:1908.00709, 2019.
 [13] R. Atienza, “Fast disparity estimation using dense networks,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 3207–3212.
 [14] A. Cipolletta, V. Peluso, A. Calimera, M. Poggi, F. Tosi, F. Aleotti, and S. Mattoccia, “Energy-quality scalable monocular depth estimation on low-power CPUs,” IEEE Internet of Things Journal, pp. 1–1, 2021.
 [15] X. Chen, L. Xie, J. Wu, and Q. Tian, “Cyclic CNN: Image classification with multi-scale and multi-location contexts,” IEEE Internet of Things Journal, vol. 8, no. 9, pp. 7466–7475, 2021.
 [16] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, “Cascade residual learning: A two-stage convolutional neural network for stereo matching,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
 [17] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [18] E. Ilg, T. Saikia, M. Keuper, and T. Brox, “Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation,” in The European Conference on Computer Vision (ECCV), September 2018.
 [19] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang, “Learning for disparity estimation through feature constancy,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2811–2820.
 [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [21] A. E. Orhan and X. Pitkow, “Skip connections eliminate singularities,” arXiv preprint arXiv:1701.09175, 2017.
 [22] W. Zhan, X. Ou, Y. Yang, and L. Chen, “Dsnet: Joint learning for scene segmentation and disparity estimation,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2946–2952.
 [23] X. Du, M. El-Khamy, and J. Lee, “AMNet: Deep atrous multi-scale stereo disparity estimation networks,” arXiv preprint arXiv:1904.09099, 2019.
 [24] Q. Wang, S. Shi, S. Zheng, K. Zhao, and X. Chu, “FADNet: A fast and accurate network for disparity estimation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA 2020), 2020, pp. 101–107.
 [25] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “Endtoend learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66–75.
 [26] G.-Y. Nie, M.-M. Cheng, Y. Liu, Z. Liang, D.-P. Fan, Y. Liu, and Y. Wang, “Multi-level context ultra-aggregation for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3283–3291.
 [27] X. Cheng, Y. Zhong, M. Harandi, Y. Dai, X. Chang, H. Li, T. Drummond, and Z. Ge, “Hierarchical neural architecture search for deep stereo matching,” Advances in Neural Information Processing Systems, vol. 33, 2020.
 [28] S. Khamis, S. Fanello, C. Rhemann, A. Kowdle, J. Valentin, and S. Izadi, “StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
 [29] Y. Wang, Z. Lai, G. Huang, B. H. Wang, L. van der Maaten, M. Campbell, and K. Q. Weinberger, “Anytime stereo image depth estimation on mobile devices,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 5893–5900.
 [30] M. Menze, C. Heipke, and A. Geiger, “Joint 3d estimation of vehicles and scene flow,” in ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
 [31] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition. Springer, 2014, pp. 31–42.
 [32] Y. Luo, J. Ren, M. Lin, J. Pang, W. Sun, H. Li, and L. Lin, “Single view stereo matching,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [33] G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia, “Segstereo: Exploiting semantic information for disparity estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
 [34] X. Song, X. Zhao, L. Fang, H. Hu, and Y. Yu, “Edgestereo: An effective multitask learning network for stereo matching and edge detection,” International Journal of Computer Vision, vol. 128, no. 4, pp. 910–930, 2020.
 [35] Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, and X. Chu, “Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation,” in 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6.
 [36] Z. Liang, Y. Guo, Y. Feng, W. Chen, L. Qiao, L. Zhou, J. Zhang, and H. Liu, “Stereo matching using multilevel cost volume and multiscale feature constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 300–315, 2021.
 [37] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li, “Group-wise correlation stereo network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3273–3282.
 [38] S. Duggal, S. Wang, W.C. Ma, R. Hu, and R. Urtasun, “Deeppruner: Learning efficient stereo matching via differentiable patchmatch,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4383–4392.
 [39] H. Xu and J. Zhang, “Aanet: Adaptive aggregation network for efficient stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1959–1968.
 [40] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” vol. 20, no. 1, p. 1997–2017, Jan. 2019.
 [41] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” 2018.
 [42] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once for all: Train one network and specialize it for efficient deployment,” in International Conference on Learning Representations, 2020.
 [43] L.-C. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens, “Searching for efficient multi-scale architectures for dense image prediction,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 8713–8724.
 [44] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multiview stereo benchmark with highresolution images and multicamera videos,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [45] Z. Shen, Y. Dai, and Z. Rao, “Cfnet: Cascade and fused cost volume for robust stereo matching,” arXiv preprint arXiv:2104.04314, 2021.
 [46] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
 [47] Z. Rao, M. He, Y. Dai, Z. Zhu, B. Li, and R. He, “NLCA-Net: A non-local context attention network for stereo matching,” APSIPA Transactions on Signal and Information Processing, vol. 9, p. e18, 2020.
 [48] G. Yang, J. Manela, M. Happold, and D. Ramanan, “Hierarchical deep stereo matching on highresolution images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [49] K. Batsos, C. Cai, and P. Mordohai, “Cbmv: A coalesced bidirectional matching volume for disparity estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [50] J. L. Schonberger, S. N. Sinha, and M. Pollefeys, “Learning to fuse proposals from multiple scanline optimizations in semiglobal matching,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.