The frequent detection of different types of road damage, e.g., cracks and potholes, is a critical task in road maintenance. Road condition assessment reports allow governments to appraise long-term investment schemes and allocate limited resources for road maintenance. However, manual visual inspection is still the main form of road condition assessment. This process is not only tedious, time-consuming and costly, but also dangerous for the personnel. Furthermore, the detection results are subjective and qualitative because decisions depend entirely on the experience of the personnel. Therefore, there is an ever-increasing need to develop automated road inspection systems that can recognise and localise road damage both efficiently and objectively.
Over the past decades, various technologies, such as vibration sensing and active or passive sensing, have been used to acquire road data and help technicians assess the road condition. For example, Fox et al. developed a crowd-sourcing system to detect road damage by analysing accelerometer data obtained from multiple vehicles. Although vibration sensors are cost-effective and require only a small amount of storage space, the shape of a damaged road area cannot be explicitly inferred from the vibration data. Furthermore, Tsai et al. mounted two laser scanners on a digital inspection vehicle (DIV) to collect 3D road data for pothole detection. However, such vehicles are not widely used because of their high equipment and long-term maintenance costs.
The most commonly used passive sensors for road condition assessment include Microsoft Kinect and other types of digital cameras. Jahanshahi et al. utilised a Kinect to acquire depth maps, from which the damaged road areas were extracted using image segmentation algorithms. However, Kinect sensors were initially designed for indoor use, and they do not perform well when exposed to direct sunlight, which causes depth values to be recorded as zero. Therefore, it is more effective to detect road damage using digital cameras, as they are cost-effective and capable of working in outdoor environments.
With recent advances in airborne technology, unmanned aerial vehicles (UAVs) equipped with digital cameras provide new opportunities for road inspection. For example, Feng et al. mounted a camera on a UAV to capture road images, which were then analysed to illustrate conditions such as traffic congestion and road accidents. Furthermore, Zhang designed a robust photogrammetric mapping system for UAVs, which can recognise different road defects, such as ruts and potholes, from the captured RGB images. Although the aforementioned 2D computer vision methods can recognise damaged road areas with low computational complexity, the achieved level of accuracy is still far from satisfactory [14, 16]. Additionally, the structure of detected road damage is not obvious from a single video frame, and depth/disparity information is more effective than RGB information for detecting severe road damage, e.g., potholes. Therefore, it becomes increasingly important to use digital cameras for 3D road data acquisition.
To reconstruct 3D road scenery using digital cameras, multiple camera views are required. Images from different viewpoints can be captured using either a single movable camera or an array of synchronised cameras. Zhang and Elaksher reconstructed the 3D road scenery using structure from motion (SfM), where the keypoints in each frame were extracted using the scale-invariant feature transform (SIFT), and an energy function with respect to all camera poses was optimised for accurate 3D road scenery reconstruction. However, SfM can only acquire sparse point clouds, which are usually insufficient for road damage detection. In this regard, many researchers have resorted to stereo vision technology to acquire dense point clouds for road damage detection. Fan et al. developed an accurate dense stereo vision algorithm for road surface 3D reconstruction, and an accuracy of approximately 3 mm was achieved. However, the search range propagation strategy in their algorithm makes it difficult to fully exploit the parallel computing architecture of graphics cards. Therefore, the motivation of this paper is to explore a highly efficient dense stereo vision algorithm, which can be embedded in UAVs for real-time road inspection.
The remainder of this paper is organised as follows. Section 2 discusses the related work on stereo vision. Section 3 presents the proposed embedded stereo vision system. The experimental results for performance evaluation are provided in Section 4. Finally, Section 5 summarises the paper and provides recommendations for future work.
2 Related Work
The two key aspects of computer stereo vision are speed and accuracy. A lot of research has been carried out over the past decades to improve either the disparity accuracy or the computational complexity of the algorithms. Existing stereo vision algorithms can be broadly grouped into convolutional neural network (CNN)-based [20, 32, 33, 2, 36] and traditional [5, 13, 26, 1, 12, 23] approaches. For example, PSMNet generates the cost volumes by learning region-level features with different scales of receptive fields. Although these CNN-based approaches have produced highly accurate disparity maps, they usually require a large amount of labelled training data. Therefore, they cannot readily be applied to datasets that lack disparity ground truth. Moreover, predicting disparities with CNNs is still a computationally intensive task, which usually takes seconds or even minutes to execute on state-of-the-art graphics cards. Therefore, the existing CNN-based stereo vision algorithms are not suitable for real-time applications.
The traditional stereo vision algorithms can be classified as local, global and semi-global. The local algorithms typically select a series of blocks from the target image and match them with a constant block selected from the reference image. The disparities are then determined by finding the shifting distances corresponding to either the highest correlation or the lowest cost. This optimisation technique is also known as winner-take-all (WTA).
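The local matching idea described above can be illustrated with a minimal Python sketch. It assumes a single scanline, a sum-of-absolute-differences cost and hypothetical function names (`sad`, `wta_disparity`); it is not the cost function used by the proposed system.

```python
def sad(ref, tgt, u, d, half):
    """Sum of absolute differences between the reference block centred at u
    and the target block centred at u - d (one scanline, for brevity)."""
    return sum(abs(ref[u + k] - tgt[u - d + k]) for k in range(-half, half + 1))


def wta_disparity(ref, tgt, u, d_max, half=1):
    """Winner-take-all: return the shift with the lowest matching cost."""
    # Only keep shifts whose target block stays inside the scanline.
    candidates = [d for d in range(d_max + 1) if u - d - half >= 0]
    return min(candidates, key=lambda d: sad(ref, tgt, u, d, half))


# Toy scanlines: the pattern 9, 5, 7 appears 2 pixels further left in the target.
ref = [0, 0, 0, 0, 9, 5, 7, 0, 0, 0]
tgt = [0, 0, 9, 5, 7, 0, 0, 0, 0, 0]
print(wta_disparity(ref, tgt, u=5, d_max=3))  # prints 2
```

When a correlation score is used instead of a cost, the same search applies with `max` in place of `min`.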
Unlike the local algorithms, the global algorithms generally translate stereo matching into an energy minimisation problem, which can later be addressed using sophisticated optimisation techniques, e.g., belief propagation (BP) and graph cuts (GC). These techniques are commonly developed based on the Markov random field (MRF). Semi-global matching (SGM) approximates the MRF inference by performing cost aggregation along all directions in the image, and this greatly improves the accuracy and efficiency of stereo matching. However, finding the optimum smoothness values is a challenging task, due to the occlusion problem. Over-penalising the smoothness term can reduce ambiguities around discontinuous areas, but it can also cause incorrect matches in continuous areas. Furthermore, the computational complexities of the aforementioned optimisation techniques are high, making these algorithms difficult to run in real time.
Fan et al. proposed a novel perspective transformation method, which improves both the disparity accuracy and the computational complexity of the algorithm. Furthermore, Mozerov and van de Weijer proved that bilateral filtering is a feasible solution for the energy minimisation problem in a fully connected MRF model. The costs can be adaptively aggregated by performing bilateral filtering on the initial cost volumes. Therefore, the proposed stereo vision system is developed based on these two works. Finally, the estimated disparity maps are transformed by minimising an energy function with respect to the roll angle and disparity projection model. This makes the damaged road areas highly distinguishable from the road surface.
3 System Description
The workflow of the proposed stereo vision system is depicted in Figure 1, where the system consists of three main components: a) perspective transformation; b) dense road stereo; and c) disparity transformation. The following subsections describe each component in turn.
3.1 Perspective Transformation
In this paper, the road surface is treated as a ground plane:
$$\mathbf{n}^\top \mathbf{p}^{W} + \beta = 0, \tag{1}$$
where $\mathbf{p}^{W}$ is a 3D point on the road surface in the world coordinate system (WCS), $\mathbf{n}$ is the normal vector of the ground plane, and $\beta$ is the plane offset. The projections of $\mathbf{p}^{W}$ on the reference and target images are $\mathbf{p}_{\mathrm{ref}}$ and $\mathbf{p}_{\mathrm{tgt}}$, respectively. It should be noted that the left and right images are respectively referred to as the reference and target images in this paper. The two projections are related by a homography matrix as follows:
is a translation vector, represents the rotation from the WCS to the reference camera coordinate system (RCCS), denotes the rotation from the RCCS to the target camera coordinate system (TCCS), and and are the intrinsic matrices of the reference and target cameras, respectively. can be estimated using at least four pairs of matched correspondence points and . In order to simplify the estimation of , the authors of  made several hypotheses regarding , , , , and . (2) can be rewritten as follows:
where is the focal length of each camera, is the baseline, is the pitch angle, and is the principal point. (4) implies that a perspective distortion always exists for the ground plane in two images when is not equal to , and this further affects the stereo matching accuracy. Therefore, the perspective transformation aims to make the ground plane in the transformed target image similar to that in the reference image. This can be straightforwardly realised by shifting each point on row in the target image pixels to the right, where , and is a constant used to guarantee that all the disparities are non-negative. The values of , , , , and can be estimated from a set of reliable correspondence pairs and . The transformed target image is shown in Figure 1 as .
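The row-wise shift described above can be sketched in Python as follows. The shift expression did not survive in this copy of the text, so the linear form `round(a0 + a1 * v) - delta` and the names `a0`, `a1` and `delta` are assumptions made purely for illustration.

```python
def shift_rows_right(image, a0, a1, delta, fill=0):
    """Shift every row v of `image` to the right by a row-dependent amount.

    The amount round(a0 + a1 * v) - delta is a hypothetical stand-in for the
    paper's expression; negative shifts are clamped to zero.
    """
    shifted = []
    for v, row in enumerate(image):
        s = max(0, round(a0 + a1 * v) - delta)
        s = min(s, len(row))  # never shift by more than the row width
        # Pad on the left and drop the pixels pushed past the right border.
        shifted.append([fill] * s + row[:len(row) - s])
    return shifted


target = [[1, 2, 3, 4] for _ in range(3)]
for row in shift_rows_right(target, a0=0, a1=1, delta=0):
    print(row)
```

Rows nearer the bottom of the image receive larger shifts when `a1` is positive, which mirrors the way road disparities grow towards the bottom of the image.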
3.2 Dense Road Stereo
3.2.1 Cost Computation and Aggregation
where denotes a node at the position of in the graph , represents the intensity differences corresponding to different disparities , represents the neighbourhood system of , expresses the compatibility between each possible disparity and the corresponding intensity difference, and expresses the compatibility between and its neighbourhood system . It is noteworthy that refers to and refers to the reference image. In practice, maximising the joint probability in (5) is commonly formulated as an energy minimisation problem as follows :
where computes the matching cost of , and determines the aggregation strategy. For disparity estimation algorithms based on the MRF, formulating in an adaptive way is crucial and necessary, because the intensity of a pixel in a discontinuous area usually differs greatly from those of its neighbours . Since bilateral filtering is a feasible solution for the energy minimisation problem in a fully connected MRF model , and can be rewritten as follows:
is the cost function; and represent the pixel intensities at in the reference and target images, respectively; and represent the means of the pixel intensities within the reference and target blocks, respectively; and and denote the standard deviations of the reference and target blocks, respectively.
is controlled by two parameters and , with based on spatial distance and based on colour similarity. The cost of each neighbour can therefore be adaptively aggregated to . Finally, is normalised by rewriting (6) as follows:
The computed matching costs are stored in two cost volumes, as shown in Figure 1.
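The adaptive aggregation described in this subsection can be sketched as a normalised bilateral filter over one slice of a cost volume. The Gaussian weight forms and the names `sigma_s` and `sigma_r` below are assumptions; the exact weights used by the authors are not recoverable from this copy.

```python
import math


def aggregate_cost(cost, image, u, v, radius, sigma_s, sigma_r):
    """Bilateral aggregation of one cost slice at pixel (u, v).

    Each neighbour is weighted by spatial distance (sigma_s) and by intensity
    similarity to the centre pixel (sigma_r); the result is normalised by the
    sum of the weights.
    """
    height, width = len(image), len(image[0])
    num = den = 0.0
    for y in range(max(0, v - radius), min(height, v + radius + 1)):
        for x in range(max(0, u - radius), min(width, u + radius + 1)):
            w_s = math.exp(-((x - u) ** 2 + (y - v) ** 2) / sigma_s ** 2)
            w_r = math.exp(-((image[y][x] - image[v][u]) ** 2) / sigma_r ** 2)
            num += w_s * w_r * cost[y][x]
            den += w_s * w_r
    return num / den


# A step edge: costs from across the edge should barely influence the result.
image = [[0, 0, 100, 100] for _ in range(3)]
cost = [[1.0, 1.0, 5.0, 5.0] for _ in range(3)]
print(aggregate_cost(cost, image, u=1, v=1, radius=1, sigma_s=1.0, sigma_r=10.0))
```

Because the similarity weight collapses across the intensity edge, the aggregated cost at a pixel stays close to the costs on its own side of the discontinuity, which is the adaptive behaviour the text calls for.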
3.2.2 Disparity Optimisation and Refinement
By applying WTA optimisation on the reference and target cost volumes, the best disparities can be estimated. Since the perspective view of the target image has been transformed in Section 3.1, the estimated disparities on row should be added to obtain the disparity map between the original reference and target images. The occluded areas in the reference disparity map are then removed by finding the pixels satisfying the following condition :
where and represent the reference and target disparity maps, respectively. is the threshold for occlusion removal. Finally, a subpixel enhancement is performed to increase the resolution of the estimated disparity values :
where , illustrated in Figure 1, represents the final disparity map in the reference perspective view.
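The occlusion check and subpixel enhancement above can be illustrated with the following sketch. It assumes the common left-right consistency test and a parabola fitted through three neighbouring costs; the function names and the default threshold are hypothetical, and the authors' exact expressions may differ.

```python
def lr_consistent(disp_ref, disp_tgt, u, v, threshold=1):
    """Left-right consistency: the reference pixel (u, v) with disparity d
    should map to a target pixel (u - d, v) holding a similar disparity."""
    d = disp_ref[v][u]
    if u - d < 0:
        return False  # the match falls outside the target image
    return abs(d - disp_tgt[v][u - d]) <= threshold


def subpixel(cost_prev, cost_best, cost_next, d):
    """Refine integer disparity d with the vertex of the parabola through
    the costs at d - 1, d and d + 1 (assumes cost_best is the minimum)."""
    denom = cost_prev - 2.0 * cost_best + cost_next
    if denom == 0:
        return float(d)
    return d + 0.5 * (cost_prev - cost_next) / denom


disp_ref = [[0, 0, 2, 2]]
disp_tgt = [[2, 2, 0, 0]]
print(lr_consistent(disp_ref, disp_tgt, u=2, v=0))
print(subpixel(1.5625, 0.0625, 0.5625, d=10))
```

The three costs in the last call are samples of a parabola whose minimum lies 0.25 pixels to the right of the integer disparity, so the refinement recovers that offset.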
3.3 Disparity Transformation
The proposed system focuses entirely on the road surface whose disparity values decrease gradually from the bottom of the disparity map to its top, as shown in Figure 1. For a stereo rig whose baseline is perfectly parallel to the road surface, the roll angle equals zero and the disparities on each row have similar values, which can also be proved by (4). Therefore, the projection of the road disparities on a v-disparity image can be represented by a linear model: . A column vector storing the coefficients of the disparity projection model can be estimated as follows:
However, in practice, the stereo rig baseline is not always perfectly parallel to the road surface, and this introduces a non-zero roll angle into the imaging process. The disparity values will change gradually in the horizontal direction, and this makes the approach of representing the road disparity projection using a linear model problematic. Additionally, the minimum energy becomes higher, due to the disparity dispersion in the horizontal direction. Hence, the proposed disparity transformation first finds the angle corresponding to the minimum . The image rotation caused by is then eliminated, and is subsequently estimated.
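For the zero-roll case, the linear road disparity model can be fitted with ordinary least squares. The sketch below uses closed-form sums rather than matrix notation; the symbols `a0` and `a1` and the function name are assumptions for illustration.

```python
def fit_road_model(vs, ds):
    """Least-squares fit of d = a0 + a1 * v.

    Returns (a0, a1, energy), where energy is the sum of squared residuals.
    """
    n = len(vs)
    sv, sd = sum(vs), sum(ds)
    svv = sum(v * v for v in vs)
    svd = sum(v * d for v, d in zip(vs, ds))
    a1 = (n * svd - sv * sd) / (n * svv - sv * sv)
    a0 = (sd - a1 * sv) / n
    energy = sum((d - (a0 + a1 * v)) ** 2 for v, d in zip(vs, ds))
    return a0, a1, energy


rows = [0, 1, 2, 3]
disparities = [1.0, 3.0, 5.0, 7.0]
print(fit_road_model(rows, disparities))
```

A non-zero roll angle spreads the disparities within each row, so the residual energy returned here grows, which is the effect the roll angle estimation exploits.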
To rotate the disparity map around a given angle , each set of original coordinates is transformed to a set of new coordinates using the following equations :
The energy function in (15) can, therefore, be rewritten as follows:
Roll angle estimation is, therefore, equivalent to the following energy minimisation problem:
which can be formulated as an iterative optimisation problem as follows :
where is the learning rate. (25) is a standard form of gradient descent. The expression of is as follows:
is an identity matrix. If the learning rate is too high, (25) may overshoot the minimum. On the other hand, if it is set to a relatively low value, the convergence of (25) may require a lot of iterations. Therefore, selecting a proper learning rate is always essential for gradient descent. Instead of fixing the learning rate with a constant value, backtracking line search is utilised to produce an adaptive learning rate:
The selection of the initial learning rate will be discussed in Section 4. The initial approximation is set to , because the roll angle in practical experiments is usually small. It should be noted that the estimated at time is used as the initial approximation at time . The optimisation iterates until the absolute difference between and is smaller than a preset threshold . can be obtained by substituting the estimated roll angle into (21). Finally, each disparity is transformed using:
where , shown in Figure 1, represents the transformed disparity map, and is a constant used to make the transformed disparity values positive.
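The roll angle search described in this subsection can be sketched as gradient descent with a backtracking line search. Several choices below are assumptions rather than the authors' formulation: the rotation sign convention, a numerical central-difference gradient in place of the analytic expression, an Armijo constant of 0.5, and all function and parameter names.

```python
import math


def energy(theta, points):
    """Residual of the fit d = a0 + a1 * v' after rotating the image
    coordinates (u, v) by theta; only the rotated row v' is needed."""
    vs = [v * math.cos(theta) - u * math.sin(theta) for u, v, _ in points]
    ds = [d for _, _, d in points]
    n = len(points)
    sv, sd = sum(vs), sum(ds)
    svv = sum(v * v for v in vs)
    svd = sum(v * d for v, d in zip(vs, ds))
    a1 = (n * svd - sv * sd) / (n * svv - sv * sv)
    a0 = (sd - a1 * sv) / n
    return sum((d - a0 - a1 * v) ** 2 for v, d in zip(vs, ds))


def estimate_roll(points, theta=0.0, lam0=1.0, eps=1e-7, max_iter=100):
    """Gradient descent on energy(theta) with a backtracked learning rate."""
    h = 1e-6  # step for the numerical gradient (an assumption)
    for _ in range(max_iter):
        grad = (energy(theta + h, points) - energy(theta - h, points)) / (2 * h)
        e0, lam = energy(theta, points), lam0
        # Halve the learning rate until the Armijo decrease condition holds.
        while lam > 1e-12 and energy(theta - lam * grad, points) > e0 - 0.5 * lam * grad ** 2:
            lam *= 0.5
        new_theta = theta - lam * grad
        if abs(new_theta - theta) < eps:
            return new_theta
        theta = new_theta
    return theta


# Synthetic road disparities generated with a known roll angle of 0.05 rad.
true_roll = 0.05
points = [(u, v, 2.0 + 0.5 * (v * math.cos(true_roll) - u * math.sin(true_roll)))
          for u in range(-10, 11, 2) for v in range(0, 20, 2)]
print(estimate_roll(points))
```

Passing the previous frame's estimate as `theta` gives the warm start mentioned in the text.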
4 Experimental Results
In this section, we evaluate the performance of the proposed stereo vision system both qualitatively and quantitatively. The following subsections detail the experimental set-up, datasets, implementation notes and the performance evaluation.
4.1 Experimental Set-Up
In the experiments, a ZED stereo camera (https://www.stereolabs.com/) is mounted on a DJI Matrice 100 Drone (https://www.dji.com/uk/matrice100) to capture stereo road images. The maximum take-off weight of the drone is 3.6 kg. The stereo camera has two ultra-sharp six-element all-glass lenses, which can cover the scene up to 20 m. The captured stereo road images are processed using an NVIDIA Jetson TX2 GPU (https://developer.nvidia.com/embedded/buy/jetson-tx2), which has 8 GB LPDDR4 memory and 256 CUDA cores. An illustration of the experimental set-up is shown in Figure 2.
4.2 Datasets
Using the above experimental set-up, three datasets including 11368 stereo image pairs are created. The resolution of the original reference and target images is . In each dataset, the UAV flight trajectory forms a closed loop, which makes it possible to evaluate the performance of the state-of-the-art visual odometry algorithms using our created datasets. The datasets and a demo video are publicly available at http://www.ruirangerfan.com.
4.3 Implementation Notes
In the practical implementation, the reference and target images are first sent from the host memory to the global memory of the GPU. However, a thread is more likely to fetch data from addresses close to those accessed by its nearby threads (https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf). This access pattern makes the cache in global memory ineffective. Furthermore, constant memory and texture memory are read-only and cached on-chip, which makes them more efficient than global memory for memory requests. Therefore, we store the reference and target images in the texture memory to reduce the memory requests from the global memory. This is realised by creating two texture objects in the texture memory and binding these objects with the addresses of the reference and target images. The pixel intensities can therefore be fetched from the texture objects instead of the global memory. In addition, (10) is rewritten as follows:
The values of and are pre-calculated and stored in the constant memory to reduce the repetitive computations of . Moreover, the values of , , and are also pre-calculated and stored in the global memory to avoid the unnecessary computations in stereo matching.
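The benefit of pre-calculating block statistics can be illustrated with a normalised cross-correlation (NCC) style score. This sketch assumes the cost is built from NCC, which the definitions of block means and standard deviations suggest, and shows a standard algebraic rearrangement; whether it matches the rewritten form of (10) exactly cannot be confirmed from this copy.

```python
import math


def block_stats(block):
    """Mean and (population) standard deviation of a block of intensities."""
    n = len(block)
    mu = sum(block) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in block) / n)
    return mu, sigma


def ncc_direct(a, b):
    """NCC computed from mean-subtracted intensities."""
    n = len(a)
    (mu_a, sig_a), (mu_b, sig_b) = block_stats(a), block_stats(b)
    return sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / (n * sig_a * sig_b)


def ncc_fast(a, b, mu_a, sig_a, mu_b, sig_b):
    """Same score using pre-calculated statistics: only the dot product of
    the two blocks has to be evaluated for each candidate disparity."""
    n = len(a)
    return (sum(x * y for x, y in zip(a, b)) - n * mu_a * mu_b) / (n * sig_a * sig_b)


a = [10.0, 12.0, 15.0, 11.0]
b = [20.0, 25.0, 31.0, 22.0]
print(ncc_direct(a, b), ncc_fast(a, b, *block_stats(a), *block_stats(b)))
```

The two functions agree up to floating-point rounding, because the sum of products of mean-subtracted values equals the plain dot product minus n times the product of the means.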
4.4 Performance Evaluation
4.4.1 Disparity Estimation
Some experimental results are illustrated in Figure 3. is a 120-connected neighbourhood system. and are empirically set to and , respectively. Since the datasets we created do not contain disparity ground truth, the KITTI (http://www.cvlibs.net/datasets/kitti/) stereo 2012 and 2015 datasets [10, 22] are utilised to quantify the accuracy of the proposed system. Some experimental results of the KITTI stereo datasets are shown in Figure 4, where the road regions are manually selected to evaluate the accuracy of the road disparities. Furthermore, we compare the proposed method with PSMNet in terms of the percentage of error pixels and root mean squared error. The expressions of and are as follows:
is the total number of disparities used for evaluation, is the disparity error tolerance, and represents the ground truth disparity map. The comparison of and between these two methods is shown in Table 1, where it can be observed that the proposed method outperforms PSMNet in terms of and when is set to 2, while PSMNet performs better than our method when is set to 3. It should be noted that the proposed algorithm is capable of estimating disparity maps between a pair of motion blurred stereo images, as shown in Figure 5. This also demonstrates the robustness of the proposed dense stereo system.
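The two accuracy measures named above can be computed as in the following sketch. It assumes the usual definitions: an error pixel is one whose absolute disparity error exceeds the tolerance, and the percentage is taken over all evaluated disparities. The function name is hypothetical.

```python
import math


def error_metrics(estimated, ground_truth, tolerance):
    """Return (percentage of error pixels, root mean squared error)."""
    errors = [abs(e - g) for e, g in zip(estimated, ground_truth)]
    pep = 100.0 * sum(err > tolerance for err in errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return pep, rmse


estimated = [10.0, 12.0, 20.0, 31.0]
ground_truth = [10.0, 11.0, 24.0, 30.0]
print(error_metrics(estimated, ground_truth, tolerance=2))
```

Raising the tolerance can only lower the percentage of error pixels, whereas the root mean squared error does not depend on the tolerance at all.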
In addition to the disparity accuracy, the execution speed of the proposed dense stereo vision system is also quantified to evaluate the overall system’s performance. Owing to the fact that the image size and disparity range are not constant among different datasets, a general way of evaluating the performance in terms of processing speed is to measure millions of disparity evaluations per second :
where the resolution of the disparity map is , is the maximum disparity value, and is the processing time in seconds. The runtime of the proposed dense stereo vision system on the Jetson TX2 GPU is approximately 152.889 ms, and the resolution of the disparity map is . Therefore, the value of is 49.231, which is much higher than most stereo vision systems implemented on powerful graphics cards.
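The speed measure can be illustrated with a short sketch, assuming the usual definition of millions of disparity evaluations per second: the disparity map width times its height times the maximum disparity, divided by the processing time. The input values below are hypothetical and are not the figures reported in this paper.

```python
def mde_per_second(width, height, d_max, seconds):
    """Millions of disparity evaluations per second."""
    return width * height * d_max / seconds * 1e-6


# Hypothetical example: a 640 x 480 disparity map, 64 disparity levels, 0.1 s.
print(mde_per_second(640, 480, 64, 0.1))
```

Because the measure normalises by image size and disparity range, it allows systems evaluated on different datasets to be compared on processing speed.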
4.4.2 Roll Angle Estimation
In the experiments, we select a range of and record the number of iterations that (25) takes to converge to the minimum. It is shown that is the optimum value when the threshold is set to rad ().
Furthermore, a synthesised stereo dataset from EISATS (https://ccv.wordpress.fos.auckland.ac.nz/eisats/set-2/) [29, 31] is used to quantify the accuracy of the proposed roll angle estimation algorithm. The roll angle of each image in this dataset is perfectly zero. Therefore, we manually rotate the disparity maps around a given angle, and then estimate the roll angles from the rotated disparity maps. Examples of the roll angle estimation experiments are shown in Figure 6, where it can be observed that the effects due to image rotation are effectively corrected. When is set to rad, the average difference between the actual and estimated roll angles is approximately rad. The runtime of the proposed roll angle estimation on the Jetson TX2 GPU is approximately 7.842 ms.
4.4.3 Disparity Transformation
Fan et al. published three road datasets containing various types of road damage, such as potholes and cracks. Therefore, we first use their datasets to qualitatively evaluate the performance of the proposed disparity transformation algorithm. Examples of the transformed disparity maps are illustrated in Figure 7, where it can be observed that the disparities of the road surface have similar values, while their values differ greatly from those of the damaged areas. This enables the damaged road areas to be easily recognised from the transformed disparity maps.
The KITTI stereo datasets are further utilised to evaluate the performance of disparity transformation. Examples of the KITTI stereo datasets are shown in Figure 8. To quantify the accuracy of the transformed disparities, we compute the standard deviation of the transformed disparity values as follows:
where stores the transformed disparity values. The average value of the KITTI stereo datasets is 0.519 pixels. However, if the image rotation effects caused by the non-zero roll angle are not eliminated, the average value becomes 0.861 pixels. The runtime of the disparity transformation on the Jetson TX2 GPU is around 1.541 ms.
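The evaluation above can be illustrated with a small sketch that subtracts the fitted road model from each disparity and then measures the spread of the result. The transformation form `d - (a0 + a1 * v') + delta`, the rotation sign convention, the population standard deviation and all names are assumptions for illustration.

```python
import math


def transform_disparities(points, theta, a0, a1, delta):
    """Remove the modelled road disparity from each (u, v, d) sample.

    v' is the row coordinate after rotating by the roll angle theta, and
    delta keeps the transformed values positive.
    """
    return [d - (a0 + a1 * (v * math.cos(theta) - u * math.sin(theta))) + delta
            for u, v, d in points]


def disparity_std(values):
    """Population standard deviation of the transformed disparities."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


# Three samples lie on the road model d = 2 + 0.5 * v; the last one is 0.5
# pixels below it, as a pothole further from the camera would be.
samples = [(0, 0, 2.0), (0, 2, 3.0), (4, 2, 3.0), (2, 4, 3.5)]
transformed = transform_disparities(samples, theta=0.0, a0=2.0, a1=0.5, delta=10.0)
print(transformed, disparity_std(transformed[:3]))
```

Road samples collapse to the constant `delta`, so their standard deviation is close to zero, while the damaged sample stands out by its offset from the model.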
5 Conclusion and Future Work
This paper presented a robust dense stereo vision system embedded in a DJI Matrice 100 UAV for road condition assessment. The perspective transformation greatly improved the disparity accuracy and reduced the algorithm's computational complexity, while the disparity transformation algorithm enabled the UAV to estimate roll angles from disparity maps. The damaged road areas became highly distinguishable in the transformed disparity maps, and this can provide new opportunities for UAV-based road damage inspection. The proposed system was implemented with CUDA on a Jetson TX2 GPU, and real-time performance was achieved.
In the future, we plan to use the obtained disparity maps to estimate the flight trajectory of the UAV and reconstruct the 3D maps using the state-of-the-art simultaneous localisation and mapping (SLAM) algorithms.
This work is supported by grants from the Research Grants Council of the Hong Kong SAR Government, China (No. 11210017 and No. 21202816) awarded to Prof. Ming Liu. This work is also supported by grants from the Shenzhen Science, Technology and Innovation Commission, JCYJ20170818153518789, and National Natural Science Foundation of China (No. 61603376) awarded to Dr. Lujia Wang.
References
[1] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222–1239, Nov. 2001.
[2] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
[3] L. Cruz, D. Lucio, and L. Velho. Kinect and rgbd images: Challenges and applications. In Proc. Patterns and Images Tutorials 2012 25th SIBGRAPI Conf. Graphics, pages 36–49, Aug. 2012.
[4] R. Fan. Real-time computer stereo vision for automotive applications. PhD thesis, University of Bristol, 2018.
[5] R. Fan, X. Ai, and N. Dahnoun. Road surface 3D reconstruction based on dense subpixel disparity map estimation. IEEE Transactions on Image Processing, PP(99):1, 2018.
[6] R. Fan, M. J. Bocus, and N. Dahnoun. A novel disparity transformation algorithm for road segmentation. Information Processing Letters, 140:18–24, 2018.
[7] R. Fan, Y. Liu, X. Yang, M. J. Bocus, N. Dahnoun, and S. Tancock. Real-time stereo vision for road surface 3-d reconstruction. In 2018 IEEE International Conference on Imaging Systems and Techniques (IST), pages 1–6. IEEE, 2018.
[8] W. Feng, W. Yundong, and Z. Qiang. Uav borne real-time road mapping system. In 2009 Joint Urban Remote Sensing Event, pages 1–7. IEEE, 2009.
[9] A. Fox, B. V. Kumar, J. Chen, and F. Bai. Multi-lane pothole detection from crowdsourced undersampled vehicle sensor data. IEEE Transactions on Mobile Computing, 16(12):3417–3430, 2017.
[10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3354–3361, June 2012.
[11] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
[12] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, Feb. 2008.
[13] E. T. Ihler, J. W. F. Iii, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors, 2005.
[14] M. R. Jahanshahi, F. Jazizadeh, S. F. Masri, and B. Becerik-Gerber. Unsupervised approach for autonomous pavement-defect detection and quantification using an inexpensive depth sensor. Journal of Computing in Civil Engineering, 27(6):743–754, 2012.
[15] T. Kim and S.-K. Ryu. Review and analysis of pothole detection methods. Journal of Emerging Trends in Computing and Information Sciences, 5(8):603–608, 2014.
[16] C. Koch and I. Brilakis. Pothole detection in asphalt pavement images. Advanced Engineering Informatics, 25(3):507–515, 2011.
[17] C. Koch, K. Georgieva, V. Kasireddy, B. Akinci, and P. Fieguth. A review on computer vision based defect detection and condition assessment of concrete and asphalt civil infrastructure. Advanced Engineering Informatics, 29(2):196–210, 2015.
[18] C. Koch, G. M. Jog, and I. Brilakis. Automated pothole distress assessment using asphalt pavement video data. Journal of Computing in Civil Engineering, 27(4):370–378, 2012.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
[20] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
[21] S. Mathavan, K. Kamal, and M. Rahman. A review of three-dimensional imaging technologies for pavement distress detection and measurements. IEEE Transactions on Intelligent Transportation Systems, 16(5):2353–2362, 2015.
[22] M. Menze, C. Heipke, and A. Geiger. Joint 3d estimation of vehicles and scene flow. ISPRS Workshop on Image Sequence Analysis (ISA), II-3/W5:427–434, 2015.
[23] M. G. Mozerov and J. van de Weijer. Accurate stereo matching by two-step energy minimization. 24:1153–1163, 2015.
[24] P. Pedregal. Introduction to optimization, volume 46. Springer Science & Business Media, 2006.
[25] E. Schnebele, B. F. Tanyu, G. Cervone, and N. Waters. Review of remote sensing methodologies for pavement management and assessment. European Transport Research Review, 7(2):1, Mar. 2015.
[26] Tappen and Freeman. Comparison of graph cuts with belief propagation for stereo, using identical mrf parameters. In Proc. Ninth IEEE Int. Conf. Computer Vision, pages 900–906 vol.2, Oct. 2003.
[27] B. Tippetts, D. J. Lee, K. Lillywhite, and J. Archibald. Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing, 11(1):5–25, 2016.
[28] Y.-C. Tsai and A. Chatterjee. Pothole detection and classification using 3d technology and watershed method. Journal of Computing in Civil Engineering, 32(2):04017078, 2017.
[29] T. Vaudrey, C. Rabe, R. Klette, and J. Milburn. Differences between stereo and motion behaviour on synthetic and real-world stereo sequences. In Image and Vision Computing New Zealand, 2008. IVCNZ 2008. 23rd International Conference, pages 1–6. IEEE, 2008.
[30] P. Wang, Y. Hu, Y. Dai, and M. Tian. Asphalt pavement pothole detection and segmentation based on wavelet energy field. Mathematical Problems in Engineering, 2017, 2017.
[31] A. Wedel, C. Rabe, T. Vaudrey, T. Brox, U. Franke, and D. Cremers. Efficient dense scene flow from sparse or dense stereo data. In European conference on computer vision, pages 739–751. Springer, 2008.
[32] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2015.
[33] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1592–1599, 2015.
[34] C. Zhang. An uav-based photogrammetric mapping system for road condition assessment. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci, 37:627–632, 2008.
[35] C. Zhang and A. Elaksher. An unmanned aerial vehicle-based imaging system for 3d measurement of unpaved road surface distresses. Computer-Aided Civil and Infrastructure Engineering, 27(2):118–129, 2012.
[36] C. Zhou, H. Zhang, X. Shen, and J. Jia. Unsupervised learning of stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 1567–1575, 2017.