1 Introduction
Stereo matching is one of the techniques for depth perception of a scene, and it relies on the displacement of matching points in a binocular camera setup. Given a pair of rectified left/right images, depth can be computed by reducing the problem to disparity map estimation. Depth prediction is used in many real-world applications, such as self-driving cars [15], robotics [26], and object detection [25]. Compared to other techniques for depth perception, such as LiDAR and Time-of-Flight sensors, passive stereo vision is more desirable in real-world scenarios because of inherent problems in those techniques, such as sparsity of depth data, poor performance in sunlight or on reflective/absorbing surfaces, and limited operating depth range.
A stereo matching algorithm has three main components: feature extraction, regularization, and disparity selection. Before deep learning, numerous algorithms proposed different schemes for each step, such as local descriptors (Sum of Absolute Differences or Census Transform [30]) for feature extraction, Semi-global Matching (SGM) [7] for regularization, and winner-take-all (WTA) for final disparity selection. Nowadays, similar to other computer vision tasks, stereo matching has also benefited from deep networks. Initially, research incorporated deep learning in individual components of the pipeline, such as [31, 17] in matching costs and [18, 24] in regularization. Later, research shifted towards end-to-end frameworks, which take all the fundamental components into one network [13, 10, 2, 6].
Accordingly, in end-to-end pipelines, depending on the dimension of the cost volume built on top of the unary features, the subsequent convolutional layers in the regularization part (encoder-decoder) can be either 2D convolutions (for a 3D cost volume) [13] or 3D convolutions (for a 4D cost volume) [10]. We refer to the former and the latter group as “2D” and “3D” models, respectively. While the 3D end-to-end networks offer considerably higher accuracy in disparity estimation, they are computationally costly due to their more complex architectures. As a result, some of these networks cannot fit even on a moderate GPU and run out of memory (OOM). Moreover, many embedded platforms have memory constraints, and networks must fit and execute efficiently on them.
In order to reduce the complexity of 3D models, some works reconsider the configuration of the networks. For instance, DeepPruner [4] develops a PatchMatch module to skip the cost volume evaluation for most of the disparity range. In [29], the authors establish an aggregation module on top of a cost volume computed by traditional local descriptors. By skipping the convolutional feature extraction, the network benefits from a lighter learning paradigm. It also builds a 3D cost volume, which avoids the expense of 3D convolutions.
In this work, we develop two end-to-end stereo matching networks, which exploit MobileNetV1 [8] and MobileNetV2 [23] blocks to mitigate the computational burden in favor of real embedded platforms, such as FPGAs or mobile devices. Depending on the cost volume dimension, 3D or 4D, we propose a 2D and a 3D network. Moreover, for the 3D cost volume, a new learnable construction module is devised based on interlacing the features from the two viewpoints. Although reducing the computation cost is usually accompanied by a degradation in performance, we show that the proposed architectures are competitive with their state-of-the-art counterparts (Fig. 1).
Overall, our main contributions are as follows:
Two lightweight models (2D/3D) are designed and proposed for stereo matching using MobileNet blocks without sacrificing accuracy.
We raise MobileNet blocks from the originally proposed 2D convolutions to 3D for the application of stereo matching. Also, by analyzing their costs, we demonstrate their merit in reducing the computational load when processing 4D data.
We introduce a learnable cost volume module for the 2D model to keep the accuracy comparable with overparameterized 3D models.
Extensive experiments for analyzing the accuracy/complexity trade-off in different design choices are conducted. Our findings on these design choices can be applied to similar 2D/3D networks to reduce their complexity.
2 Related work
Stereo vision is one of the popular techniques for estimating depth from images. In the last decade, machine learning and deep learning approaches have progressed considerably in computer vision tasks, including stereo matching. Deep learning-based methods can be categorized into two groups: methods that focus on transferring only one or some of the general pipeline components into a deep learning framework [31, 24, 1], and approaches that formulate the whole process in an end-to-end scheme [13, 10, 2, 6, 32]. Most of the end-to-end methods are developed based on 3D convolutional layers. Although these architectures achieve a substantial increase in accuracy, they require a large amount of memory, making them impractical for mobile and real-time applications, such as robotics and autonomous vehicles.
Lighter networks. Lightweight architectures have become an active research domain, signaling that it is becoming impractical to rely on complex networks without considering their computational load. Generally, convolutions entail the most considerable computational load. Some deep networks for stereo reconstruction have been developed to achieve less complexity while being competitive in terms of accuracy with heavy 3D architectures. In [29], an initial matching cost is constructed based on traditional cost computation. After reducing the channel dimension through convolutions, the data is fed into a U-Net to regress the disparity map. In another work, DeepPruner [4] mitigates the computational complexity of 3D convolutional layers by calculating the matching cost for only a subset of possible disparity values. Recently, [20] proposed to use feature-wise and feature-disparity-wise separable convolutions for optimizing the 3D stereo models. On the whole, according to the results in [29], the 2D architectures can make a better trade-off between accuracy and speed.
Other works mainly focus on reducing the complexity of 3D convolutional layers for other 3D vision tasks. Qiu et al. [19] developed pseudo-3D convolutions that decouple a 3D convolution into a spatial convolution equivalent to a 2D convolution and a light 3D convolution that aggregates information only across the third dimension. The network is demonstrated on video classification. In [28], the authors made use of 3D depthwise convolutions for 3D reconstruction. Recently, [5] proposed progressively expanding a tiny 2D network along multiple axes to obtain fewer parameters for video classification and detection.
Cost volume computation. In a stereo model, measuring the similarity of left/right features is just as important as feature extraction and regularization. Traditional algorithms utilized simple calculations, such as absolute difference, Hamming distance, or correlation. Similar solutions carried over to deep learning-based networks. Specifically, correlation is used on top of unary features to compute a 3D cost volume [3, 12, 13, 31, 9]. Later, Kendall et al. [10] proposed to concatenate unary features to make a 4D cost volume, requiring 3D convolutions in the subsequent layers. After that, this approach was mainly adopted for 3D models with some modifications to enhance the accuracy, such as variance-based [21], group-wise correlation [6], and pyramid [27] cost volumes.
3 Methodology
As a first step, we reformulate two common light blocks [8, 23] to raise them from 2D convolutions to 3D for the application of stereo matching. We also analyze their computation cost against their standard 2D/3D convolution counterparts. The computation cost of a deep network is measured by the number of operations in MACs (Multiply-Accumulate operations) and the number of parameters. While the number of parameters is fixed for a model, MACs depend on the input size.
3.1 Light blocks replacing 2D/3D convolutions
As a pioneering work, MobileNetV1 [8] employs depthwise and pointwise convolutions to produce an output with the same size as the output of a standard convolution, but with fewer computations. Later, MobileNetV2 [23] was introduced, which formulates its block with pointwise, depthwise, and once again, pointwise layers. The number of input and output channels is specific to each layer, such that the channel dimension is expanded by an expansion factor ($t$) within the block. In a non-downsampling layer, a skip connection is included as well, making it a so-called Inverted Residual block. The MobileNetV1 and MobileNetV2 blocks are shown in Fig. 2.
Raising MobileNet blocks to 3D for stereo network. Originally, the MobileNetV1 and MobileNetV2 blocks were designed to replace 2D convolutions and have been validated mainly on sparse prediction tasks, such as image classification. Still, many other computer vision problems require 3D convolutions to operate on 4D data, e.g., tasks with temporal input in addition to the spatial data. Dense disparity estimation is likewise such a problem, as it explores the third dimension of the scene. This approach of using 3D convolutions for stereo matching has emerged recently, where they are utilized for processing 4D cost data. Hence, we raise the MobileNetV1 and MobileNetV2 blocks to their 3D counterparts for the stereo vision application and show their necessity for light stereo vision models.
For this, just as in 2D convolutions, we apply the depthwise and pointwise convolutions along the channel (feature) dimension in 3D convolutions. To be more precise, to raise the convolutions from 2D to 3D, the input data is extended from $H \times W \times C$ to $D \times H \times W \times C$, where $H$ and $W$ indicate the input height and width, respectively; $D$ is the new third dimension, and $C$ is the number of channels. In our formulation for stereo matching, the 4D data of the cost volume [10, 27, 6] is likewise of size $D \times H \times W \times C$, where $D$ is the predefined disparity range for building the cost volume. Accordingly, we raise kernels from 2D to 3D, considering the same size for the added dimension: if the 2D kernel is $k \times k$, the 3D kernel is $k \times k \times k$. Therefore, taking the MobileNetV1 and MobileNetV2 blocks to 3D is straightforward by applying depthwise and pointwise convolutions in the channel dimension. Figure 2 displays the new MobileNetV1 and MobileNetV2 blocks raised to 3D.
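As a rough illustration of the 3D depthwise and pointwise steps described above, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a channels-first nested-list layout `[C][D][H][W]`, stride 1, and zero padding; batch normalization, ReLU, the expansion layer, and the residual connection are left out, and all function names are hypothetical.

```python
def zeros(c, d, h, w):
    """Allocate a [C][D][H][W] nested list filled with zeros."""
    return [[[[0.0] * w for _ in range(h)] for _ in range(d)] for _ in range(c)]


def depthwise_conv3d(x, w):
    """Depthwise 3D convolution: one k x k x k kernel per input channel.

    x: [C][D][H][W], w: [C][k][k][k]. Stride 1, zero padding ("same" output).
    """
    C, D, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    k = len(w[0])
    r = k // 2
    out = zeros(C, D, H, W)
    for c in range(C):  # channels are never mixed in this step
        for d in range(D):
            for h in range(H):
                for v in range(W):
                    acc = 0.0
                    for i in range(k):
                        for j in range(k):
                            for l in range(k):
                                dd, hh, vv = d + i - r, h + j - r, v + l - r
                                if 0 <= dd < D and 0 <= hh < H and 0 <= vv < W:
                                    acc += x[c][dd][hh][vv] * w[c][i][j][l]
                    out[c][d][h][v] = acc
    return out


def pointwise_conv3d(x, w):
    """Pointwise (1 x 1 x 1) convolution: mixes channels at every voxel.

    x: [C_in][D][H][W], w: [C_out][C_in].
    """
    C_in, D, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    out = zeros(len(w), D, H, W)
    for o, row in enumerate(w):
        for d in range(D):
            for h in range(H):
                for v in range(W):
                    out[o][d][h][v] = sum(row[c] * x[c][d][h][v] for c in range(C_in))
    return out


# Toy example: all-ones input and weights, 2 input channels, 3 output channels.
C_in, C_out, k = 2, 3, 3
x = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(C_in)]
w_dw = [[[[1.0] * k for _ in range(k)] for _ in range(k)] for _ in range(C_in)]
w_pw = [[1.0] * C_in for _ in range(C_out)]
y = pointwise_conv3d(depthwise_conv3d(x, w_dw), w_pw)
```

In this toy case the center voxel sees all 27 taps per channel and a corner voxel only 8, so the two-step block behaves like a standard 3D convolution with all-ones weights while using far fewer weights.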
In Tab. 1, we compute the cost of the standard 2D/3D convolutions and the 2D/3D MobileNetV1 and MobileNetV2 blocks. Compared to their standard convolution counterparts, there is a reduction factor in computation cost that depends on the kernel size ($k$), the input and output channels ($C_{in}$ and $C_{out}$), and the expansion factor ($t$) in MobileNetV2 blocks. The example in the last column shows that MobileNetV1 blocks are lighter than MobileNetV2 blocks in both 2D and 3D versions, as expected. Moreover, exploiting MobileNetV1 and MobileNetV2 blocks in their 3D form is more desirable, as they yield a larger reduction in operations compared to their standard counterparts (18.9x and 7.0x in 3D versus 7.9x and 2.7x in 2D).
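The cost comparison can be sketched numerically. The snippet below is an illustration under stated assumptions, not a reproduction of the paper's table: it uses the usual per-position MAC counts for a standard convolution, a depthwise-plus-pointwise block, and an inverted residual block (batch normalization, ReLU, bias, and the residual connection ignored), and the settings `k=3`, `c_in=32`, `c_out=64`, `t=2` are assumed values chosen because they approximately match the reduction factors listed in Tab. 1.

```python
def std_conv_macs(k, c_in, c_out, ndim):
    """MACs per output position of a standard convolution (ndim = 2 or 3)."""
    return (k ** ndim) * c_in * c_out


def mobilenet_v1_macs(k, c_in, c_out, ndim):
    """Depthwise (k^ndim per channel) followed by pointwise (c_in * c_out)."""
    return c_in * (k ** ndim) + c_in * c_out


def mobilenet_v2_macs(k, c_in, c_out, t, ndim):
    """Pointwise expansion, depthwise, and pointwise projection."""
    hidden = t * c_in  # channels after expansion by factor t
    return c_in * hidden + hidden * (k ** ndim) + hidden * c_out


def reduction_factors(k, c_in, c_out, t, ndim):
    """Return (standard / MobileNetV1, standard / MobileNetV2) cost ratios."""
    std = std_conv_macs(k, c_in, c_out, ndim)
    return (std / mobilenet_v1_macs(k, c_in, c_out, ndim),
            std / mobilenet_v2_macs(k, c_in, c_out, t, ndim))


# Assumed example settings (not taken from the text).
r2d = reduction_factors(k=3, c_in=32, c_out=64, t=2, ndim=2)
r3d = reduction_factors(k=3, c_in=32, c_out=64, t=2, ndim=3)
print("2D: V1 %.2fx, V2 %.2fx" % r2d)
print("3D: V1 %.2fx, V2 %.2fx" % r3d)
```

Because the spatial size (and the disparity size in 3D) multiplies every term equally, it cancels in the ratios, which is why only per-position costs are computed. Under these assumptions the 3D MobileNetV1 ratio comes out near 18.99, slightly above the 18.9x listed in the table.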
Additionally, we examine the impact of the expansion factor of MobileNetV2 blocks in Fig. 3. We consider the common cases in an hourglass module (in [2, 6]) for evaluation, with . We see that by increasing the expansion factor, the reduction factor decreases until a point where the block becomes heavier than its standard convolution counterpart. Also, the 2D MobileNetV2 block is more sensitive to the expansion factor, such that its cost exceeds that of the standard convolution after . Here, once again, we can verify the merit of MobileNet architectures in our reformulation for the 3D convolutional layers of 3D networks.
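The break-even point discussed above can be written down explicitly. This is a sketch under an assumed cost model (the standard MAC counts for a convolution and an inverted residual block, ignoring batch normalization, ReLU, and the residual connection); $n$ is 2 for the 2D block and 3 for the 3D block, and the channel values in the example are illustrative assumptions rather than settings taken from the text.

```latex
% Reduction factor of an inverted residual block with expansion factor t
R(t) = \frac{k^{n}\, C_{in}\, C_{out}}{t\, C_{in}\,\bigl(C_{in} + k^{n} + C_{out}\bigr)}
     = \frac{k^{n}\, C_{out}}{t\,\bigl(C_{in} + k^{n} + C_{out}\bigr)}

% The block is heavier than the standard convolution when R(t) < 1, i.e.
t > t^{*} = \frac{k^{n}\, C_{out}}{C_{in} + k^{n} + C_{out}}

% Illustrative case k = 3, C_{in} = C_{out} = 32:
t^{*}_{2D} = \frac{9 \cdot 32}{32 + 9 + 32} = \frac{288}{73} \approx 3.95,
\qquad
t^{*}_{3D} = \frac{27 \cdot 32}{32 + 27 + 32} = \frac{864}{91} \approx 9.49
```

Since $t^{*}$ grows with $k^{n}$, the 3D block tolerates a larger expansion factor before losing its advantage, which is consistent with the observation that the 2D block is the more sensitive one.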
Type  Operator  Operations (MACs)  Example for Reduction Factor
2D  Std. Conv.    
2D  MobileNetV1 block    7.9x
2D  MobileNetV2 block    2.7x
3D  Std. Conv.    
3D  MobileNetV1 block    18.9x
3D  MobileNetV2 block    7.0x
 belongs to 3D types. We have ignored the computation cost of batch normalization, ReLU, and the residual connection of the MobileNetV2 block. The example reduction factor (relative to the standard convolution counterparts) is computed for , , , and .
3.2 Proposed models
Here, we describe two end-to-end baselines (2D and 3D) and design MobileStereoNets based on these two models and the 2D/3D MobileNetV1 and MobileNetV2 blocks. Following the common pipeline in recent work, we first feed the rectified left/right images to the feature extraction backbone and obtain the unary features. The backbone is shared between the left and right images. The results are passed into a cost volume construction module to merge the data from the two viewpoints. Finally, an encoder-decoder (hourglass) is applied on top of the cost volume to estimate the disparity map.
3D baseline. For this case, we adopt GwcNetg [6] with only one hourglass (Fig. 4), as it outperforms other similar designs, such as GCNet [10], PSMNet [2], and GANet [32]. This baseline utilizes a ResNet-like backbone and an encoder-decoder with 3D convolutions. Namely, a 4D cost volume is constructed by group-wise correlation of unary features, requiring 3D convolutions afterward. The encoder-decoder consists of an hourglass [2, 6] outputting a downsampled disparity map, which after upsampling is compared against the ground truth with a smooth L1 loss function. For the detailed architecture, we refer the reader to the appendix.
2D baseline. In order to develop a much lighter stereo network, we modify the 3D baseline such that it uses an encoder-decoder with 2D convolutions (Fig. 4). This approach contrasts with the recent trend, where 3D convolutions are deployed to add a feature dimension for disparity via a 4D cost volume. With a ResNet-like backbone similar to the 3D baseline, an input image with resolution is turned into a feature of size . We add further processing by four successive pointwise convolutions to reduce the number of channels and attain a size of . In order to aggregate the two unary features to form a cost volume, which indicates a similarity measurement between the left/right images across the disparity dimension, we propose a new Interlacing Cost Volume. Note that we need a 3D cost volume to retain the encoder-decoder with 2D convolutions. Ignoring the feature dimension for disparity, the 3D cost volume is of size , where is the maximum disparity level. Finally, the 3D cost volume is passed into the encoder-decoder module after passing through two convolutions as a pre-hourglass module.
Interlacing cost volume construction. Traditionally, a cost volume for stereo matching is computed by comparing the descriptors of the binocular images across the disparity dimension, mainly as 3D data as follows:
$$C(d, x, y) = \mathcal{F}\big(f_L(x, y),\, f_R(x - d, y)\big) \qquad (1)$$
where $(x, y)$ and $d$ are the spatial location and the disparity value within a range of $[0, D_{max})$, respectively. $f_R(x - d, y)$ is the traversed right feature for a specific disparity level. $\mathcal{F}$ indicates a similarity measurement function that was conventionally chosen as correlation or Hamming distance. With the advent of deep learning in stereo vision, as the unary features were raised into 3D data, $H \times W \times C$, DispNetC [13] proposed to use a correlation layer (dot product) to merge these data by $\langle f_L(x, y),\, f_R(x - d, y) \rangle$. Later, [10] introduced a cost volume as 4D data, $D \times H \times W \times 2C$, by concatenating the unary features.
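To make the correlation-based 3D cost volume concrete, here is a minimal pure-Python sketch. It assumes features stored as `[C][H][W]` nested lists, a dot product over channels as the similarity function, zero cost where the shifted right pixel falls outside the image, and a winner-take-all selection that picks the highest-similarity disparity; the function names and the toy data are hypothetical.

```python
def correlation_cost_volume(f_left, f_right, max_disp):
    """cost[d][y][x] = dot product over channels of f_left[:, y, x] and f_right[:, y, x - d]."""
    C, H, W = len(f_left), len(f_left[0]), len(f_left[0][0])
    cost = [[[0.0] * W for _ in range(H)] for _ in range(max_disp)]
    for d in range(max_disp):
        for y in range(H):
            for x in range(d, W):  # positions with x - d < 0 keep zero cost
                cost[d][y][x] = sum(
                    f_left[c][y][x] * f_right[c][y][x - d] for c in range(C)
                )
    return cost


def winner_take_all(cost):
    """Pick, per pixel, the disparity with the highest similarity (ties go to the smallest)."""
    D, H, W = len(cost), len(cost[0]), len(cost[0][0])
    return [
        [max(range(D), key=lambda d: cost[d][y][x]) for x in range(W)]
        for y in range(H)
    ]


# Toy scene: one image row, one-hot features, true disparity of 1 pixel.
W, C = 4, 4
f_right = [[[1.0 if c == x else 0.0 for x in range(W)]] for c in range(C)]
f_left = [[[1.0 if c == x - 1 else 0.0 for x in range(W)]] for c in range(C)]
cost = correlation_cost_volume(f_left, f_right, max_disp=3)
disparity = winner_take_all(cost)
```

In the toy scene the left feature at column `x` equals the right feature at column `x - 1`, so the similarity peaks at disparity 1 everywhere except the first column, which has no valid match.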
These 3D or 4D cost volumes are obtained in an unparameterized module of the network. Thus, we propose a subnetwork, named Interlacing Cost Volume Construction, with the following motivations:
In order to achieve a better aggregation of the two unary features, we propose a parameterized subnetwork to learn the aggregation.
By interlacing the left/right unary features, the corresponding feature maps are distilled by a kernel.
We aim to retain the encoder-decoder with only 2D convolutions (and not 3D, which is the case in recent work), as this can contribute to a significant reduction of operations.
Namely, given the left and traversed right unary features, each of size , we first interlace them across the channel dimension to form data with twice as many channels, (Fig. 5). After unsqueezing the output and raising it to 4D data (), a 3D convolutional layer is applied such that the 3D kernels cover a specific number of left/right feature pairs. That is, a kernel of size covers channel from each of the unary features. By increasing the , the kernel covers more features and thus integrates more information. The two following layers (Fig. 5) are also 3D convolutions, with double and the same number of kernels as the first layer, respectively. In general, the kernels are of size with strides as , showing that they cover non-overlapping channels. Finally, we convert the data to a single channel and pass it through one 2D convolutional layer. Note that, similar to other methods for cost volume construction, the spatial resolution is unchanged. We can write the general formula for a certain disparity level as Eq. 2. In Sec. 4, we show that the inclusion of learnable weights in the stereo network contributes to better aggregation of the left/right features.
(2) 
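The interlacing step itself can be sketched in a few lines of pure Python. This is one reading of the description above, not the authors' code: it assumes `[C][H][W]` nested lists, zero padding when the right feature is traversed by a disparity `d`, and a hypothetical parameter `i` for the number of channels that one kernel takes from each view, so that non-overlapping windows of `2 * i` interlaced channels are formed. The learnable 3D and 2D convolutions that follow are omitted.

```python
def shift_right_feature(f_right, d):
    """Traverse the right feature by d pixels: out[c][y][x] = f_right[c][y][x - d], zero-padded."""
    return [
        [[row[x - d] if x >= d else 0.0 for x in range(len(row))] for row in channel]
        for channel in f_right
    ]


def interlace_channels(f_left, f_right_shifted):
    """Interleave channels as L0, R0, L1, R1, ... giving 2C channels."""
    if len(f_left) != len(f_right_shifted):
        raise ValueError("both views must have the same number of channels")
    interlaced = []
    for left_channel, right_channel in zip(f_left, f_right_shifted):
        interlaced.append(left_channel)
        interlaced.append(right_channel)
    return interlaced


def channel_groups(channels_per_view, i):
    """Indices of interlaced channels covered by each non-overlapping window of 2*i channels."""
    if channels_per_view % i != 0:
        raise ValueError("channels_per_view must be divisible by i")
    total = 2 * channels_per_view
    return [list(range(start, start + 2 * i)) for start in range(0, total, 2 * i)]


# Toy example: 2 channels per view, one row of 3 pixels, disparity 1.
f_left = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]
f_right = [[[10.0, 20.0, 30.0]], [[40.0, 50.0, 60.0]]]
interlaced = interlace_channels(f_left, shift_right_feature(f_right, d=1))
groups_pairwise = channel_groups(channels_per_view=2, i=1)
groups_of_four = channel_groups(channels_per_view=4, i=2)
```

With `i = 1` each window holds exactly one left/right channel pair; larger `i` lets one kernel see several pairs at once, which matches the remark that wider kernels integrate more information.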
MobileStereoNets. In our baselines, the feature extraction and channel reduction are essentially processed with 2D convolutions. On the other hand, the hourglass employs 2D or 3D convolutions depending on the constructed cost volume. In order to obtain lighter networks, we replace these components with the 2D and 3D counterparts of the MobileNetV1 and MobileNetV2 blocks. There are different design choices when exploiting these blocks in individual modules of the networks. Thus, extensive experiments are conducted to answer the following questions: Can we replace different modules in the 2D/3D baselines with 2D/3D MobileNetV1 and MobileNetV2 blocks to achieve lighter stereo networks and keep the error rates low? If so, which modules should be replaced with them for a better compromise? Which block type performs better in terms of accuracy and computational load?
Our experiments (cf. Sec. 4) lead to our MobileStereoNets by modifying the baselines as follows:


First Convolutions: Each of the three initial convolutions is replaced with one MobileNetV2 block. Using an expansion factor of provides a favorable trade-off between performance and computational complexity.

Feature Extraction: We retain the original layer architecture and block structure, consisting of two convolutions and a residual connection between each block. Substituting these convolutions with MobileNetV1 blocks keeps performance competitive while reducing the computational complexity significantly.

Channel Reduction: In the 2D baseline, we keep this module, which consists of four convolutions, unchanged with standard convolutions, as replacing them with lighter blocks deteriorates the performance.

Pre-hourglass: In the 2D baseline, we replace the two convolutions in both of the blocks with MobileNetV2 blocks. For 3DMobileStereoNet, we use our extension of the MobileNetV2 block to 3D instead of 3D convolutions. In both models, we choose expansion factor as .

Hourglass: We employ a stack of three hourglasses for both 2D and 3D models. While the 3D network uses the same channel dimension as GwcNet [6] (32), the hourglass width is increased to 48 in the 2D model. 2DMobileStereoNet uses MobileNetV2 blocks instead of 2D convolutions. In 3DMobileStereoNet, we once again swap the 3D convolutions for our extension of the MobileNetV2 block to 3D. For both models, the expansion factor is .
4 Experimental results and discussion
Here, we first evaluate the performance of the proposed interlacing cost volume. Then, extensive experiments on incorporating the 2D/3D MobileNetV1 and MobileNetV2 blocks into the baselines are presented to show the path taken to reach the final architectures. Finally, we compare our developed networks with other methods. Note that the maximum disparity is 192 in all cases.
In order to analyze different design choices, we use the SceneFlow “final pass” dataset [13], consisting of 35,454/4,370 training/test samples with a resolution of $960 \times 540$. This dataset can also help to pretrain the networks for limited real datasets. The quantitative evaluation for SceneFlow images is mainly reported with End-Point-Error (EPE), the mean absolute disparity error in pixels. Two more errors are also reported, px3 and D1, which are percentages of the outliers with disparity errors larger than 3 px and $\max(3\,\text{px},\, 0.05\, d^{*})$, respectively, where $d^{*}$ is the ground-truth disparity.
4.1 Cost volume construction
To verify the performance of the interlacing cost volume, we replace the corresponding module in the 2D baseline (Fig. 4) with correlation, which is adopted by 2D networks [3, 13, 9]. Table 4 shows the evaluation of this baseline against the model embedded with our interlacing cost volume. indicates that channels are taken from each unary feature, and the kernel of the subsequent layer is of size . For instance, when , four channels from each feature data are combined by kernels. Therefore, cases of can be interpreted as group-wise interlacing. We observe that better similarity measurements between the left/right features are achieved by introducing this learnable cost volume construction, resulting in lower EPE. According to the table, the best case is achieved when (also depicted in Fig. 5), and hence, we consider this case for further experiments in the 2D baseline.
We also investigate the effect of interlacing against direct concatenation of left/right unary features. According to the table, the error has increased in this case, showing that direct concatenation is not efficiently distilling the corresponding left/right features, which is essential for stereo matching.
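The three error measures reported in the table can be computed as in the following minimal sketch. The thresholds are assumptions based on the common conventions for these metrics (px3: error above 3 px; D1: error above both 3 px and 5% of the ground-truth disparity), and the function name and sample values are hypothetical.

```python
def disparity_metrics(pred, gt):
    """Return (EPE in px, px3 in %, D1 in %) over flat lists of valid-pixel disparities."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    n = len(errors)
    epe = sum(errors) / n  # mean absolute disparity error
    px3 = 100.0 * sum(1 for e in errors if e > 3.0) / n
    # Assumed D1 rule: outlier if the error exceeds 3 px AND 5% of the true disparity.
    d1 = 100.0 * sum(1 for e, g in zip(errors, gt) if e > 3.0 and e > 0.05 * g) / n
    return epe, px3, d1


# Four sample pixels with absolute errors of 0.5, 4, 0, and 6 px.
gt = [10.0, 100.0, 50.0, 20.0]
pred = [10.5, 104.0, 50.0, 26.0]
epe, px3, d1 = disparity_metrics(pred, gt)
```

In the sample, the 4 px error on a 100 px disparity counts as a px3 outlier but not as a D1 outlier, which illustrates why D1 is never larger than px3 under these definitions.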
Method  EPE (px)  D1 (%)  px3 (%)
1.86  7.46  8.48  
1.71  6.80  7.84  
1.70  6.20  7.06  
1.61  6.39  7.31  
1.55  6.15  7.06  
1.64  6.41  7.35  
1.73  6.65  7.58 
4.2 Effect of incorporating MobileNet blocks
In this section, we incorporate MobileNetV1 and MobileNetV2 blocks (either 2D or 3D, depending on the type of convolutional module) in various components of the 2D and 3D baselines. In addition to the error metric, we monitor the reduction of computational complexity to help us choose lighter models. We analyze replacing the fundamental modules of the network, namely feature extraction and the hourglass, with lighter blocks. The results of substituting these modules with MobileNetV1 and MobileNetV2 blocks for the 2D and 3D models are tabulated in Tables 2(a) and 2(b). The best model is selected according to the least EPE obtained in 20 epochs. Also, the input resolution for computing MACs is . In these cases, the first convolutions of feature extraction are kept as standard convolutions.
(a) 2D baseline
FE  HG  EPE (px)  MACs (G)  Params (M)
conv.  conv.  1.55  74.42  
conv.  MobileNetV1  1.62  73.97  
MobileNetV1  conv.  1.66  30.43  
MobileNetV1  MobileNetV1  1.59  29.98  
conv.  MobileNetV2  1.63  74.32  
MobileNetV2  conv.  1.57  35.54  
MobileNetV2  MobileNetV2  1.53  35.44  
MobileNetV1  MobileNetV2  1.50  30.33  
MobileNetV2  MobileNetV1  1.60  35.10  
(b) 3D baseline
FE  HG  EPE (px)  MACs (G)  Params (M)
conv.  conv.  0.97  155.2  
conv.  MobileNetV1  0.99  143.66  
MobileNetV1  conv.  0.98  111.2  
MobileNetV1  MobileNetV1  1.03  99.67  
conv.  MobileNetV2  0.97  149.01  
MobileNetV2  conv.  0.96  116.32  
MobileNetV2  MobileNetV2  0.97  110.13  
MobileNetV1  MobileNetV2  0.99  105.01  
MobileNetV2  MobileNetV1  1.02  104.78  
We can conclude that: In both 2D and 3D baselines, feature extraction is responsible for much of the computational load. Substituting feature extraction with MobileNetV1 blocks and the hourglass with MobileNetV2 blocks yields a better compromise between accuracy and computational complexity. For the 2D baseline, this combination results in the least EPE (1.50 px at 30.33 GMACs, compared to 1.55 px at 74.42 GMACs with standard convolutions). We consider this combination for both 2D and 3D models.
We also examine replacing other modules with lighter blocks to make the network even lighter. Namely, the first convolutional layers in the feature extraction and pre-hourglass modules are replaced with MobileNetV2 blocks. The reason for choosing MobileNetV2 instead of MobileNetV1 blocks is the higher accuracy they maintain after substituting for standard convolutions. We observed that in this case, for the 2D baseline, both the complexity and the error are reduced (tables are available in the appendix). We also tried replacing other modules with MobileNet blocks, namely the channel reduction module and the convolutions in the interlacing cost volume construction in the 2D baseline. However, since these replacements deteriorate the learning capability of the network, these modules are kept unchanged with standard convolutions.
4.3 Quantitative and qualitative results
The discussed experiments support our design choice for the final 2D and 3D models, 2DMobileStereoNet and 3DMobileStereoNet. To increase the accuracy, we utilize a stack of three hourglasses. Also, we found out that higher values make the network less accurate and heavier.
Method  EPE (px)  Params (M)  Red. Params
DispNetC[9]  17.0x  
CRL[16]  35.3x  
AutoDispNetC[22]  16.6x  
iResNet[11]  19.3x  
2DMobileStereoNet  1.14  2.23   
Method  EPE (px)  MACs (G)  Params (M)  Red. MACs  Red. Params
GCNet[10]  1.84  718.01  3.18  4.7x  1.8x 
PSMNet[2]  0.88  256.66  5.22  1.7x  2.9x 
GANetdeep[32]  0.84  670.25  6.58  4.4x  3.7x 
GANet11[32]  0.93  383.42  4.48  2.5x  2.5x 
Gwc40Cat24Base[6]  1.12  169.42  4.60  1.1x  2.6x 
GwcNetgc[6]  0.76  260.49  6.82  1.7x  3.9x 
GwcNetg[6]  0.79  246.27  6.43  1.6x  3.6x 
DeepPrunerBest[4]  0.86  129.23  7.39  0.8x  4.2x 
DeepPrunerFast[4]  0.97  51.83  7.47  0.3x  4.2x 
3DMobileStereoNet  0.80  153.14  1.77     
SceneFlow dataset. The evaluation of the proposed models on SceneFlow is presented in Tables 3(a) and 3(b). Note that 2DMobileStereoNet has more parameters than 3DMobileStereoNet due to the channel reduction module, the parameterized cost volume construction, and the wider hourglass. Nevertheless, with the fewest operations, it can be practical on systems with limited computation capacity. Compared to other 2D models, a lower error is achieved by 2DMobileStereoNet with at least 16.6x fewer parameters. Moreover, we observe that 3DMobileStereoNet requires significantly fewer parameters, while its performance is competitive with or better than other methods. Compared to GwcNetgc, which has the best EPE in the table, 3DMobileStereoNet uses 3.9x fewer parameters (in millions) and 1.7x fewer GigaMACs. Note that although DeepPrunerFast [4] requires the fewest operations, it is still overparameterized, and this causes issues when fine-tuning on smaller datasets like KITTI (cf. Table 6). Figure 6 shows the disparity estimation results.
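The reduction factors listed for the 3D models can be re-derived from the MACs and parameter columns of the table. The short sketch below copies those two columns, divides each entry by the corresponding 3DMobileStereoNet value, and rounds to one decimal; the dictionary layout and function name are only illustrative.

```python
# (MACs in G, parameters in M), copied from the 3D comparison table.
OTHER_MODELS = {
    "GCNet": (718.01, 3.18),
    "PSMNet": (256.66, 5.22),
    "GANetdeep": (670.25, 6.58),
    "GANet11": (383.42, 4.48),
    "Gwc40Cat24Base": (169.42, 4.60),
    "GwcNetgc": (260.49, 6.82),
    "GwcNetg": (246.27, 6.43),
    "DeepPrunerBest": (129.23, 7.39),
    "DeepPrunerFast": (51.83, 7.47),
}
OURS_3D = (153.14, 1.77)  # 3DMobileStereoNet


def reduction(other, ours):
    """Return (MACs ratio, parameter ratio) of another model relative to ours."""
    return round(other[0] / ours[0], 1), round(other[1] / ours[1], 1)


factors = {name: reduction(values, OURS_3D) for name, values in OTHER_MODELS.items()}
for name, (red_macs, red_params) in factors.items():
    print(f"{name}: {red_macs}x MACs, {red_params}x parameters")
```

A ratio below 1 in the MACs column, as for the two DeepPruner variants, means that the other model needs fewer operations than 3DMobileStereoNet even though it has several times more parameters.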
KITTI 2015 dataset. This dataset consists of images of real-world driving scenarios [14], with resolution. To evaluate our models on this dataset, for which ground truth is available only for the 200 training samples, we use a 160/40 training/validation split. We fine-tune the networks that are pretrained on SceneFlow. For a fair comparison, we also train and evaluate other methods with the same data split. As shown in Tab. 5, 2DMobileStereoNet attains results comparable to PSMNet, which is a 3D model, with a much lower computational load (2.3x/8x fewer parameters/operations). Also, 3DMobileStereoNet outperforms PSMNet and GANet11, and is competitive with GANetdeep. Compared to GwcNetg, 3DMobileStereoNet is lighter, with 3.6x/1.6x fewer parameters/operations.
Method  EPE (px)  D1 (%)  px3 (%)  MACs (G)  Params (M)
PSMNet[2]  0.88  2.00  2.10  256.66  5.22 
GANetdeep[32]  0.63  1.61  1.67  670.25  6.58 
GANet11[32]  0.67  1.92  2.01  383.42  4.48 
GwcNetgc[6]  0.63  1.55  1.60  260.49  6.82 
GwcNetg[6]  0.62  1.49  1.53  246.27  6.43 
2DMobileStereoNet  0.79  2.53  2.67  32.2  2.32 
3DMobileStereoNet  0.66  1.59  1.69  153.14  1.77 
Methods  All: D1-bg (%)  All: D1-fg (%)  All: D1-all (%)  Noc: D1-bg (%)  Noc: D1-fg (%)  Noc: D1-all (%)
MCCNN[31]  2.89  8.88  3.89  2.48  7.64  3.33 
Fast DSCS[29]  2.83  4.31  3.08  2.53  3.74  2.73 
GCNet[10]  2.21  6.16  2.87  2.02  5.58  2.61 
DeepPrunerFast[4]  2.32  3.91  2.59  2.13  3.43  2.35 
PSMNet[2]  1.86  4.62  2.32  1.71  4.31  2.14 
AutoDispNetCSS[22]  1.94  3.37  2.18  1.80  2.98  2.00 
DeepPrunerBest[4]  1.87  3.56  2.15  1.71  3.18  1.95 
GwcNetg[6]  1.74  3.93  2.11  1.61  3.49  1.92 
2DMobileStereoNet  2.49  4.53  2.83  2.29  3.81  2.54 
3DMobileStereoNet  1.75  3.87  2.10  1.61  3.50  1.92 
Method  GANetdeep[32]  GANet11[32]  GwcNetgc[6]  GwcNetg[6]  PSMNet[2]  2DMobileStereoNet  3DMobileStereoNet 
Memory (MB)  26.4  18.0  27.9  26.3  21.1  10.03  7.99 
Additionally, we submitted the results of our fine-tuned models to the KITTI 2015 benchmark. To this end, we fine-tuned the epoch with the best cross-domain generalizability from SceneFlow to KITTI 2015. Table 6 shows that 2DMobileStereoNet surpasses GCNet (a 3D model) with 27%/95% fewer parameters/operations. Also, we can verify that 3DMobileStereoNet shows superior performance compared to GCNet and PSMNet, and it surpasses GwcNetg (in D1-fg and D1-all over all pixels) with 72%/38% fewer parameters/operations.
Figure 7 visualizes the results from the KITTI 2015 benchmark. Comparing 2DMobileStereoNet and 3DMobileStereoNet, we observe that 3DMobileStereoNet obtains crisp edges due to deploying 3D convolutions in the encoder-decoder. In other words, in the upsampling/downsampling process of the encoder-decoder, 3D convolutions can better preserve finer details compared to an encoder-decoder with 2D convolutions. Nevertheless, 2DMobileStereoNet achieves visually similar outputs to 3D models while requiring considerably fewer operations (87% fewer operations compared to GwcNetg). Also, 3DMobileStereoNet visually achieves competitive or better results compared to other methods.
Model size. We also report the memory requirement of our models in Table 7. Both of the proposed methods show smaller memory sizes, which is promising for memoryconstrained chips and their power consumption.
5 Conclusion
This paper presented lightweight stereo networks to alleviate high memory usage on embedded or mobile devices. Namely, we proposed two models (2D and 3D) with the primary goal of reducing the cost (in terms of parameters, operations, and model size) by using MobileNet blocks. To increase the accuracy of the 2D model, we also designed a new cost volume to learn the similarity of unary features. Yielding a favorable accuracy/complexity trade-off, these MobileStereoNets are promising for deploying end-to-end stereo networks on edge devices.
Acknowledgment
This work is supported by the German Federal Ministry of Education and Research (BMBF) for the project DeepStereoVision (FRE: 01I518024B).
References

[1]
(2018)
CBMV: a coalesced bidirectional matching volume for disparity estimation.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 2060–2069. Cited by: §2.  [2] (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418. Cited by: Figure 3, §1, §2, §3.1, §3.2, Figure 7, 3(b), Table 5, Table 6, Table 7.
 [3] (2015) FlowNet: learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766. Cited by: §2, §4.1.
 [4] (2019) DeepPruner: learning efficient stereo matching via differentiable patchmatch. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4384–4393. Cited by: §1, §2, §4.3, 3(b), Table 6.
 [5] (2020) X3D: expanding architectures for efficient video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 203–213. Cited by: §2.
 [6] (2019) Groupwise correlation stereo network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3273–3282. Cited by: Figure 3, §1, §2, §2, Figure 4, 5th item, §3.1, §3.1, §3.2, Figure 7, 2(b), 3(b), Table 5, Table 6, Table 7.
 [7] (2005) Accurate and efficient stereo processing by semiglobal matching and mutual information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 807–814. Cited by: §1.

[8]
(2017)
MobileNets: efficient convolutional neural networks for mobile vision applications
. arXiv preprint arXiv:1704.04861. Cited by: §1, §3.1, §3.  [9] (2018) Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In Proceedings of the European Conference on Computer Vision, pp. 614–630. Cited by: §2, §4.1, 3(b).
 [10] (2017) Endtoend learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §1, §1, §2, §2, §3.1, §3.2, §3.2, Figure 7, 3(b), Table 6.
 [11] (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2811–2820. Cited by: 3(b).
 [12] (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703. Cited by: §2.
 [13] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §1, §1, §2, §2, §3.2, §4.1, §4.
 [14] (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3061–3070. Cited by: §4.3.
 [15] (2020) “Looking at the right stuff” - guided semantic-gaze for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11883–11892. Cited by: §1.
 [16] (2017) Cascade residual learning: a two-stage convolutional neural network for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 887–895. Cited by: 3(b).
 [17] (2016) Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters 24 (12), pp. 1788–1792. Cited by: §1.
 [18] (2015) Leveraging stereo matching with learning-based confidence measures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 101–109. Cited by: §1.
 [19] (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §2.
 [20] (2021) Separable convolutions for optimizing 3d stereo networks. In Proceedings of the IEEE International Conference on Image Processing. Cited by: §2.
 [21] (2020) NLCA-Net: a non-local context attention network for stereo matching. APSIPA Transactions on Signal and Information Processing 9, pp. e18. Cited by: §2.
 [22] (2019) AutoDispNet: improving disparity estimation with automl. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1812–1823. Cited by: 3(b), Table 6.
 [23] (2018) MobileNetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1, §3.1, §3.
 [24] (2017) SGM-Nets: semi-global matching with neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 231–240. Cited by: §1, §2.
 [25] (2020) Point-GNN: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1711–1719. Cited by: §1.
 [26] (2019) Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2642–2651. Cited by: §1.
 [27] (2019) Semantic stereo matching with pyramid cost volumes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7484–7493. Cited by: §2, §3.1.
 [28] (2019) 3d depthwise convolution: reducing model parameters in 3d vision tasks. In Canadian Conference on Artificial Intelligence, pp. 186–199. Cited by: §2.
 [29] (2020) Fast deep stereo with 2d convolutional processing of cost signatures. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 183–191. Cited by: §1, §2, Table 6.
 [30] (1994) Non-parametric local transforms for computing visual correspondence. In Proceedings of the European Conference on Computer Vision, pp. 151–158. Cited by: §1.
 [31] (2016) Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17 (1), pp. 2287–2318. Cited by: §1, §2, §2, Table 6.
 [32] (2019) GA-Net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 185–194. Cited by: Figure 3, §2, §3.2, 3(b), Table 5, Table 7.
Appendix A Detailed baseline architectures
The detailed architectures of the 2D and 3D baseline models are shown in Fig. 1. The numbers in the blocks indicate the output size of each layer or module. The feature extraction step is identical in the two models. The hourglass architecture and its internal connections are also alike, except that the hourglass of the 2D baseline uses only 2D convolutions, whereas the hourglass of the 3D baseline uses 3D convolutions. The two models also differ in the cost volume construction and in the channel reduction module.
Appendix B More qualitative results
Figure 2 shows additional qualitative results on the SceneFlow dataset, and Fig. 3 gives a qualitative comparison on the KITTI 2015 validation set.
Fig. 3 confirms once again that 2D-MobileStereoNet performs close to the 3D models while requiring the fewest operations. Likewise, 3D-MobileStereoNet achieves competitive or better accuracy with the fewest parameters among the compared methods.
Appendix C Incorporating light blocks in other modules
As mentioned in the paper, to further reduce the complexity, the first convolutions in the feature extraction and the pre-hourglass convolutions (cf. Fig. 1) are replaced with MobileNetV2 blocks. The experimental results are reported in Tables 1 and 2. Note that the first convolutions are 2D in both the 2D and 3D baselines, whereas the pre-hourglass convolutions are 2D or 3D depending on the baseline. In 2D-MobileStereoNet, replacing both modules with MobileNetV2 blocks yields the lowest EPE. In 3D-MobileStereoNet, this combination yields a slightly higher EPE (1.01 versus 0.99). However, given the substantial reduction in computation cost (from 105.01 to 68.86 in MACs), we adopt the same design choice for the 3D network. We also examined MobileNetV1 blocks for these modules. Although they reduce the cost considerably, they degrade the accuracy, so we do not use them for these modules.
Table 1: 2D-MobileStereoNet.
first conv  pre-HG  EPE  MACs (G)  Params (M)
conv.  conv.  1.50  30.33
conv.  MobileNetV2  1.41  30.0
MobileNetV2  conv.  1.54  29.75
MobileNetV2  MobileNetV2  1.40  29.42
Table 2: 3D-MobileStereoNet.
first conv  pre-HG  EPE  MACs (G)  Params (M)
conv.  conv.  0.99  105.01
conv.  MobileNetV2  1.01  69.44
MobileNetV2  conv.  0.99  104.44
MobileNetV2  MobileNetV2  1.01  68.86
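As a quick check on the figures above, the following sketch recomputes the savings from the MACs columns of Tables 1 and 2. It assumes only the row order as listed (standard convolutions in both modules first, both modules replaced last). The function and variable names are illustrative and not part of the original work.

```python
# MACs values copied from Tables 1 and 2, in row order:
# [standard convs in both modules, one module replaced,
#  the other module replaced, both modules replaced]
macs_2d = [30.33, 30.0, 29.75, 29.42]
macs_3d = [105.01, 69.44, 104.44, 68.86]


def summarize(macs):
    """Return (cost predicted if the two savings simply add up,
    overall reduction in percent when both modules are replaced)."""
    base, one, other, both = macs
    additive = base - (base - one) - (base - other)
    reduction = 100.0 * (base - both) / base
    return round(additive, 2), round(reduction, 1)


for name, macs in (("2D", macs_2d), ("3D", macs_3d)):
    additive, reduction = summarize(macs)
    print(f"{name}: predicted {additive} vs. reported {macs[3]}, "
          f"overall reduction {reduction}%")
```

The two individual savings add up to the reported last row within rounding in both tables, and almost all of the reduction in the 3D network comes from a single module.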
Appendix D Implementation details
We implemented the networks and ran all experiments in PyTorch. All training runs are executed on four NVIDIA GeForce GTX 1080 Ti GPUs. We adopt the Adam optimizer. On the SceneFlow dataset, the networks are trained for 20 epochs with an initial learning rate of 0.001, which is halved after epochs 10, 12, 14, and 16. The best model is the one with the lowest EPE. For the experiments on the KITTI 2015 validation set, we fine-tune the best SceneFlow model for 400 epochs, reducing the initial learning rate of 0.001 by a factor of 10 after 200 epochs. For submission to the KITTI 2015 benchmark, we fine-tune from the SceneFlow checkpoint that generalizes best from SceneFlow to the KITTI 2015 images. We use a batch size of 4 for 3D-MobileStereoNet and 8 for 2D-MobileStereoNet.
Appendix E Analyzing the complexity
Table 3 shows the computation cost of the main modules, feature extraction and encoder-decoder, in the baselines (with standard convolutions) and in the MobileStereoNets. Note that feature extraction is the same in the 2D and 3D models. Our design choice for feature extraction significantly reduces the complexity, both in operations (from 52.07 to 7.84 GigaMACs) and in parameters (from 7.84 to only 0.39 million). The cost of the encoder-decoder modules, in both 2D and 3D, is also lower in the lighter networks in terms of both operations and parameters. Evidently, the major bottleneck of the 3D models is the encoder-decoder with 3D convolutions.
Baselines  MobileStereoNets
MACs (G)  Params (M)  MACs (G)  Params (M)
Feature extraction  2.95  7.84
Encoder-decoder in 2D-MobileStereoNet  2.61  3.92
Encoder-decoder in 3D-MobileStereoNet  3.45  128.73
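To illustrate why 3D convolutions dominate the cost and why lighter blocks help, the sketch below counts parameters and MACs for a standard convolution and for a depthwise separable replacement (the MobileNetV1-style factorization), in 2D and in 3D. The layer sizes are hypothetical and not taken from the paper. Bias terms, normalization layers, and the expansion layer of a MobileNetV2 block are ignored, and stride 1 is assumed.

```python
from math import prod


def standard_conv(c_in, c_out, kernel, out_size):
    """Parameters and MACs of a standard convolution (no bias).

    kernel and out_size are tuples with 2 entries for 2D and 3 for 3D.
    """
    params = c_in * c_out * prod(kernel)
    # Every weight is used once per output position.
    macs = params * prod(out_size)
    return params, macs


def depthwise_separable_conv(c_in, c_out, kernel, out_size):
    """Depthwise convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution (no bias)."""
    params = c_in * prod(kernel) + c_in * c_out
    macs = params * prod(out_size)
    return params, macs


# Hypothetical layer: 32 -> 32 channels, kernel size 3.
cases = {
    "2D": ((3, 3), (64, 128)),         # output H x W
    "3D": ((3, 3, 3), (48, 64, 128)),  # output D x H x W
}
for name, (kernel, out_size) in cases.items():
    p_std, m_std = standard_conv(32, 32, kernel, out_size)
    p_dws, m_dws = depthwise_separable_conv(32, 32, kernel, out_size)
    print(f"{name}: params {p_std} -> {p_dws}, "
          f"MACs {m_std / 1e9:.3f}G -> {m_dws / 1e9:.3f}G")
```

Under these assumptions, the cost ratio of the separable to the standard layer is 1/c_out + 1/prod(kernel), so the relative saving is larger for 3D kernels (27 taps) than for 2D kernels (9 taps).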