Depth prediction from images plays a significant role in autonomous driving and advanced driver assistance systems: it helps in understanding the geometric layout of a scene and can be leveraged to solve other tasks, including vehicle/pedestrian detection [chen2017coherent, keller2011benefits], traffic scene segmentation [li2018traffic], and 3D reconstruction [wu2017geometry]. Stereo matching is a typical approach to recovering depth that finds dense correspondences between a pair of stereo images [nguyen2017robust, van2006real, li2008binocular]. Stereo matching methods compute similarities between local patches [muresan2017mutlipatch] or optimize global objective functions with smoothness priors penalizing large derivatives of depth [hirschmuller2007stereo, miclea2018real, spangenberg2014large]. These approaches show state-of-the-art performance, but capturing pairs of stereo images requires multiple calibrated cameras, making them difficult to apply in practice. An alternative is to predict depth from a monocular video sequence, which has attracted great interest in recent years [eigen2014depth, liu2014discrete, zhou2017unsupervised, wang2018learning, godard2017unsupervised, kuznietsov2017semi, cs2018depthnet, fu2018deep]. This approach builds upon the insight that humans can perceive depth using monocular depth cues (e.g., occlusion, perspective, motion parallax) alone [rogers1979motion]. Eigen et al. [eigen2014depth]
first propose a supervised learning method for predicting depth from a single still image using CNNs. Zhou et al. [zhou2017unsupervised] and Wang et al. [wang2018learning] recently propose CNN architectures for predicting depth from a monocular video, where two networks are trained separately to estimate depth and camera pose. These methods are limited in that they predict depth independently for each frame, discarding temporal coherence in the video sequence. That is, they give temporally inconsistent results, causing serious temporal flickering artifacts. Recurrent neural networks (RNNs) have been widely used to model temporal dependency across sequential data (e.g., video and text), and they have proven effective in various applications including action recognition [du2018recurrent] and machine translation [zhang2017context]. They, however, still show a limited capability of handling the flickering artifacts [cs2018depthnet, shi2017deep].
In this paper, we present a simple yet effective method for a temporally consistent depth prediction from a monocular video sequence (Fig. 1). We transfer temporal consistency in the video to RNNs explicitly, particularly using convolutional gated recurrent units (ConvGRUs) [ballas2015delving]
. To implement this idea, we propose a flow-guided memory unit that uses optical flow specific to our task, maintaining long-term temporal consistency in depth prediction results. Our module uses spatial and temporal features extracted by a two-stream CNN. We have two main reasons for decoupling these features. First, it has been shown that learning spatiotemporal features jointly from a stack of frames does not capture motion well [simonyan2014two]. Second, optical flow itself provides an important cue for motion parallax, which is helpful for inferring depth from a monocular video sequence; for example, objects closer to the camera move faster than distant ones. We show that our method outperforms the state of the art in terms of temporal consistency, and shows a good trade-off between depth prediction accuracy and runtime. The main contributions of this paper can be summarized as follows:
We present an effective ConvGRU encoder-decoder module for temporally consistent depth prediction from a monocular video sequence. To our knowledge, this is the first approach based on convolutional/recurrent networks to consider temporal consistency in depth prediction.
We propose a flow-guided memory unit that explicitly retains long-term temporal consistency for individual pixels.
We present state-of-the-art results on the KITTI [geiger2013vision] benchmark. We additionally provide an extensive experimental analysis, clearly demonstrating the effectiveness of our approach to memorizing temporal consistency for depth prediction.
To encourage comparison and future work, we release our code and models online: https://cvlab-yonsei.github.io/projects/FlowGRU/.
II Related work
In this section, we briefly review representative works related to ours.
II-A Monocular depth prediction
The problem of predicting depth from monocular images or video sequences has attracted significant attention in recent years. Early works exploit hand-crafted features such as SIFT [lowe2004distinctive] and HOG [dalal2005histograms] together with graphical models [saxena2006learning, liu2014discrete] or nonparametric sampling [karsch2014depth]. Saxena et al. [saxena2006learning] estimate depth from monocular images using Markov random fields (MRFs) that incorporate multi-scale features. Liu et al. [liu2014discrete] extend this idea by using a discrete-continuous graphical model. Karsch et al. [karsch2014depth] introduce a nonparametric approach to depth prediction from monocular images and videos. They transfer depth labels from a large-scale RGB/D dataset using dense correspondences established by a SIFT flow method [liu2011sift]. CNNs have enabled remarkable advances in depth prediction in the past few years. Eigen et al. [eigen2014depth]
first leverage CNNs to predict depth from monocular images in a coarse-to-fine manner. In particular, they introduce a scale-invariant loss function that alleviates ambiguity in scale. Liu et al. [liu2015deep] combine CNNs with conditional random fields (CRFs) for structured prediction. Kuznietsov et al. [kuznietsov2017semi] propose to use additional stereo images at training time. They predict depth from left images and synthesize novel views by warping right ones using the estimated depth. The differences between left and synthesized images are then used as a supervisory signal for training. Unlike the aforementioned methods using monocular images, recent works [zhou2017unsupervised, wang2018learning, yin2018geonet] have shown success in learning depth from a monocular video sequence. Zhou et al. [zhou2017unsupervised] present an approach to estimating depth and camera pose simultaneously from a video sequence. Similar to [kuznietsov2017semi], they synthesize adjacent frames using estimated depth and camera pose, and use the discrepancy between the synthesized and original frames as a supervisory signal. This approach, however, requires ground-truth parameters for camera pose. Wang et al. [wang2018learning]
propose an unsupervised learning approach to estimating pose parameters using a differentiable version of the direct visual odometry (DVO) method [steinbrucker2011real], commonly employed in the SLAM community, and leverage it for depth prediction from a monocular video. Yin et al. [yin2018geonet] propose an unsupervised learning framework that learns depth, camera pose, and optical flow jointly. They first estimate depth and camera pose to obtain rigid flow, and then use them to compute optical flow. These CNN-based approaches outperform traditional methods by large margins, but none of them consider temporal coherence in a video sequence. They give temporally inconsistent results, causing temporal flickering artifacts. On the contrary, our method memorizes temporal coherence in a video sequence, enabling temporally consistent depth prediction.
II-B Recurrent models
RNNs have been widely used to capture temporal dependency in sequential data [hopfield1982neural, rumelhart1986learning]. Representative models include GRUs [cho2014learning]
and long short-term memory (LSTM) [hochreiter1997long], and they have been adopted successfully in various tasks such as video representation [srivastava2015unsupervised, donahue2015long] and car-following modeling [wang2018capturing]. LSTMs and GRUs, typically built on fully-connected layers, do not maintain spatial information and require a large number of network parameters. This is problematic especially for high-dimensional data (e.g., video sequences). ConvLSTM [xingjian2015convolutional] and ConvGRUs [ballas2015delving]
replace the fully-connected layers with convolutional ones, preserving spatial information while drastically reducing the number of parameters. They have been widely exploited in computer vision and image processing tasks including video recognition [tokmakov2017learning, ballas2015delving], depth prediction [jie2018left, cs2018depthnet], and precipitation nowcasting [xingjian2015convolutional, shi2017deep]. In particular, Shi et al. [shi2017deep] introduce a variant of ConvGRUs, the trajectory GRU (TrajGRU), and apply it to predicting future rainfall intensity. Similar to the deformable convolutional network [dai2017deformable], the TrajGRU learns local offsets (i.e., where to aggregate) and adds them to the regular grid of a standard convolution, which has the effect of using spatially-variant convolutional kernels. In contrast, our model memorizes spatiotemporal features aligned along the path guided by optical flow. Most similar to ours is a ConvLSTM-based framework for depth prediction [cs2018depthnet]. It uses ConvLSTMs to exploit spatiotemporal features from a video sequence, but does not give temporally coherent results. We also use a recurrent network to exploit spatiotemporal features, but the flow-guided memory unit in our model explicitly retains long-term temporal consistency in a video sequence. Note that the ability of ConvLSTMs or ConvGRUs to capture temporal dependency does not guarantee temporally consistent results [shi2017deep].
II-C Temporal coherence
Recently, several approaches have been introduced to model temporal coherence in a video sequence. They typically use optical flow to smooth results along dense trajectories [lang2012practical, aydin2014temporally, bonneel2013example], to construct loss functions penalizing the difference between current and synthesized frames [chen2017coherent, lai2018learning], or to align features from current and previous frames [chen2017coherent, gadde2017semantic]. Different from the first and second approaches, we focus on designing a recurrent model itself that transfers temporal coherence in the video sequence to depth prediction results, without using temporal filtering techniques or corresponding loss functions. Our method is similar to the last approach in that we use aligned features to obtain temporally consistent results. In contrast to [chen2017coherent, gadde2017semantic], we use a memory unit that filters out or retains spatiotemporal features aligned along dense trajectories. More recently, Miclea et al. [miclea2018real] propose to exploit temporal cues for depth prediction. They use a previous frame and the corresponding depth and segmentation results to refine incorrect depth values. Compared to this work, our method maintains long-term temporal consistency by using the memory unit.
III Proposed solution
In this section, we describe a recurrent model with a flow-guided memory module for a temporally consistent depth prediction (Section III-A). We then present loss functions for learning depth and refined flow (Section III-B). The entire network is trained end-to-end.
III-A Network architecture
Our network mainly consists of three parts (Fig. 2): A two-stream encoder extracts spatial and temporal features from a video frame I_t and optical flow F_t in a backward direction (i.e., a dense flow field from I_t to I_{t-1}), respectively, where t represents a time step. A flow-guided memory module takes both features as input, and retains parts of them along trajectories of individual pixels to memorize temporal coherence in a video sequence. A decoder takes the output of the flow-guided memory module and outputs a depth map D_t. In the following, we present a detailed description of each part.
III-A1 Two-stream encoder
A video sequence allows us to leverage spatial and temporal information for depth prediction. Motivated by two-stream works [simonyan2014two, wang2016temporal] for action recognition, we use a two-stream encoder where each stream has the same CNN architecture (but different parameters), takes the video frame I_t or the optical flow F_t as input, and extracts spatial or temporal features, respectively. The two kinds of features are complementary to each other: the spatial features capture the appearance of objects and the scene layout within each frame, while the temporal ones encode trajectories of individual pixels (i.e., motion) across frames. Monocular depth prediction using CNNs typically requires large receptive fields to extract monocular depth cues including motion parallax and perspective [eigen2014depth]
. We could enlarge the receptive fields by using convolutions with larger strides or pooling, but these lead to a loss of spatial resolution and of scene details such as small and thin structures [yu2017dilated]. We instead implement the two-stream encoder using a series of dilated convolutions [yu2015multi], with which we can adjust the size of the receptive fields by changing the dilation rate without loss of resolution. Note that the dilated version with a rate of 1 corresponds to a standard convolution.
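To make the receptive-field argument concrete, here is a minimal 1-D sketch of dilated convolution in plain NumPy (the function name and setup are our illustration, not the model code): a kernel with k taps and dilation rate r covers (k − 1)·r + 1 input samples, while the output keeps the input resolution.

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """1-D dilated convolution with zero padding ('same' output length).

    x: input signal, w: kernel (odd length), rate: dilation rate.
    A rate of 1 recovers the standard convolution.
    """
    k = len(w)
    span = (k - 1) * rate               # extent of the dilated kernel
    pad = span // 2
    xp = np.pad(x, pad)                 # zero padding on both sides
    out = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        taps = xp[i : i + span + 1 : rate]  # sample every `rate`-th input
        out[i] = np.dot(taps, w)
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, w, rate=1)   # receptive field: 3 samples
y2 = dilated_conv1d(x, w, rate=2)   # receptive field: 5 samples, same output size
```

Both outputs have the same length as the input; only the receptive field grows with the rate.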
III-A2 Flow refine network
In addition to the extraction of temporal features in the encoder, we also use optical flow to align video frames and hidden states in the flow-guided memory module over time. Although CNN-based optical flow methods give state-of-the-art results, they are still not accurate enough to propagate fine-grained information across video frames or hidden states, especially near motion boundaries. To address this problem, we use an additional network for refining the pre-computed optical flow (Fig. 3). In particular, we learn residuals between the pre-computed optical flow and the refined one [he2016deep], built upon the assumption that they are similar and the initial flow does not change drastically. Similar to [gadde2017semantic], we use an early fusion approach, directly concatenating video frames and optical flow, to transfer the low-level information in the frames to the initial flow effectively. In contrast to [gadde2017semantic], we compute the refined flow using a multi-scale architecture (Fig. 3) and use it to align both video frames and hidden states in the memory module for the task of depth prediction. Specifically, the network extracts spatiotemporal features from the video frames I_t and I_{t-1} and the optical flow F_t, and computes residuals through convolutional layers. These are concatenated with the pre-computed optical flow, and the results are then passed on to additional convolutional layers, resulting in a refined flow specific to aligning video frames. The spatial resolution of the first refined flow field is the same as that of the pre-computed optical flow F_t. The other refined flow fields are computed similarly by applying convolutional layers, while reducing the spatial resolution of the pre-computed optical flow by factors of 2 and 4 in each dimension, respectively.
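As an illustration of the multi-scale flow fields, the following sketch (our own simplification, not the paper's code) downsamples a flow field by an integer factor. Note that the flow vectors must be rescaled so they remain expressed in pixels of the coarser grid.

```python
import numpy as np

def downsample_flow(flow, factor):
    """Downsample a flow field (H x W x 2) by an integer factor using
    block averaging, rescaling the vectors to units of the coarser grid."""
    h, w, _ = flow.shape
    hs, ws = h // factor, w // factor
    blocks = flow[:hs * factor, :ws * factor].reshape(hs, factor, ws, factor, 2)
    return blocks.mean(axis=(1, 3)) / factor

# a uniform 4-pixel flow becomes a 2-pixel flow on the half-resolution grid
flow = np.full((8, 8, 2), 4.0)
flow_half = downsample_flow(flow, 2)
flow_quarter = downsample_flow(flow, 4)
```

Without the division by the factor, a vector that spans 4 pixels at full resolution would incorrectly still span 4 pixels on the half-resolution grid.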
III-A3 Flow-guided memory
Our memory module exploits trajectories of individual pixels, using the refined optical flow to align hidden states selectively across frames (Fig. 4). This allows us to transfer long-term temporal consistency in a video sequence to depth prediction results. We implement the memory module with a ConvGRU [ballas2015delving], since it does not suffer from spatial resolution loss and is more memory-efficient than vanilla RNNs and ConvLSTMs [jaderberg2015spatial], respectively. The flow-guided memory module is defined as follows:
z_t = σ(w_xz * x_t + w_hz * ĥ_{t−1} + b_z),
r_t = σ(w_xr * x_t + w_hr * ĥ_{t−1} + b_r),
h̄_t = tanh(w_xh * x_t + w_hh * (r_t ⊙ ĥ_{t−1}) + b_h),
h_t = (1 − z_t) ⊙ ĥ_{t−1} + z_t ⊙ h̄_t,  with  ĥ_{t−1} = c_t ⊙ W(h_{t−1}, F̃_t),
where ⊙ and * are element-wise multiplication and convolution, respectively. Here, we denote by W(h, F) a warping operator using a flow field F, e.g., W(h, F)(p) = h(p + F(p)) at position p. The w and b terms are weights and biases, respectively, and σ is the sigmoid function.
The flow-guided memory module takes as inputs the feature x_t obtained from the two-stream encoder and a previous hidden state h_{t−1} acting as a visual memory, and outputs a new state h_t by combining the aligned previous state and a candidate state h̄_t weighted by the output of an update gate z_t. The update and reset gates, z_t and r_t, selectively choose and discard information, respectively, from the input feature x_t and the previous hidden state h_{t−1}. Conventional GRUs aggregate features from the hidden state at time t−1 directly to compute the current one. This is problematic especially when input features from previous and current frames are not aligned with each other, for example when objects move across video frames or the viewpoint changes due to camera motion. Mixing features from different locations leads to temporally inconsistent results. To address this problem, we instead use a flow-guided memory where the features from the previous state are aligned to the current input feature by warping with the refined flow F̃_t. We implement this with a differentiable warping operator using bilinear interpolation [jaderberg2015spatial]. We additionally use a matching confidence c_t to account for the reliability of the refined optical flow as follows.
c_t(p) = exp( − ‖ Ĩ_t(p) − W(Ĩ_{t−1}, F̃_t)(p) ‖ / σ_c ),
where Ĩ_t denotes a resized video frame at time t that has the same spatial resolution as the hidden state, and σ_c is a bandwidth parameter. We use the same matching confidence in each channel of the hidden state h_t.
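A minimal NumPy sketch of the two operations used above, assuming a single-channel state and our own function names: backward bilinear warping and an exponential matching confidence of the form exp(−|I_t − warped I_{t−1}| / σ); the exact norm and bandwidth in our implementation may differ.

```python
import numpy as np

def warp_bilinear(img, flow):
    """Backward-warp `img` (H x W) with a flow field (H x W x 2).

    For each target pixel p, samples img at p + flow(p) with bilinear
    interpolation; out-of-range samples are clamped to the border.
    """
    h, w = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    ax, ay = sx - x0, sy - y0
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x1]
    bot = (1 - ax) * img[y1, x0] + ax * img[y1, x1]
    return (1 - ay) * top + ay * bot

def matching_confidence(cur, prev_warped, bandwidth=0.05):
    """exp(-|I_t - warped I_{t-1}| / sigma): close to 1 where the flow is
    reliable, close to 0 where the warped frame disagrees with the current one."""
    return np.exp(-np.abs(cur - prev_warped) / bandwidth)
```

With a constant one-pixel horizontal flow, each output pixel samples its right neighbor; identical frames yield a confidence of 1 everywhere.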
Recently, Shi et al. introduce the TrajGRU [shi2017deep] for precipitation nowcasting. Our model is closely related to the TrajGRU in that both consider temporally aligned hidden states to compute a new one. The TrajGRU learns offsets for sampling locations [dai2017deformable], typically defined on a regular grid in the standard convolution, to fetch information from the previous frame. Although this can be seen as an implicit feature alignment, the TrajGRU is not designed to enforce temporal consistency and does not consider large displacements. It may provide temporally inconsistent results when the learned offsets are wrong or displacements between video frames are large. The TrajGRU is also computationally inefficient, since it applies a warping operator for each offset. Compared to this work, we align hidden states in the memory module explicitly using the refined optical flow together with a matching confidence. This considers large motion and prevents aggregating the hidden states for unreliable correspondences, making it possible to obtain temporally consistent results.
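The flow-guided update can be sketched as follows, replacing the 5×5 convolutions with per-pixel (1×1) linear maps for brevity; this is our illustrative simplification, assuming the warped hidden state and the matching confidence have been computed beforehand.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def flowgru_step(x, h_warped, conf, params):
    """One step of a flow-guided GRU (1x1-convolution sketch).

    x        : current feature map, shape (H, W, C)
    h_warped : previous hidden state already warped by the refined flow
    conf     : matching confidence in [0, 1], shape (H, W, 1)
    params   : dict of weight matrices (C x C) and biases (C,)
    """
    h_prev = conf * h_warped                      # gate out unreliable pixels
    z = sigmoid(x @ params["Wxz"] + h_prev @ params["Whz"] + params["bz"])
    r = sigmoid(x @ params["Wxr"] + h_prev @ params["Whr"] + params["br"])
    h_cand = np.tanh(x @ params["Wxh"] + (r * h_prev) @ params["Whh"] + params["bh"])
    return (1.0 - z) * h_prev + z * h_cand        # new hidden state
```

The update and reset gates play their usual GRU roles; the only change is that the previous state enters the cell already aligned and confidence-weighted.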
III-A4 Decoder
The decoder takes the hidden state h_t from the flow-guided memory module as input and gives depth maps of the same resolution as the input images. To preserve fine details (e.g., depth boundaries), we additionally use low-level features from the spatial and temporal streams via skip connections (Fig. 2).
III-B Training loss
We use three types of losses for training: First, a scale-invariant term alleviates scale ambiguity in predicted depth. Second, a photometric consistency term is used to learn the refined flow, making the pre-computed optical flow specific to aligning video frames. Finally, smoothness terms regularize the depth and flow fields. Our final loss is a linear combination of them, balanced by a parameter λ, as
L = L^D + λ L^F,
where L^D and L^F are the losses for depth prediction and flow refinement, respectively. In the following, we describe each term in detail.
III-B1 Loss for depth prediction
Motivated by the work of [eigen2014depth], we define the scale-invariant loss as
L^si = (1/N) Σ_p d(p)^2 − (α/N^2) ( Σ_p d(p) )^2,
where d(p) is the difference between the predicted depth and the ground truth at position p in log space, and N is the total number of pixels. The first term encourages the predicted depth to be similar to the ground truth. Estimating the absolute scale of depth is, however, extremely hard, especially from monocular video sequences. The second term alleviates this problem by comparing relationships between pairs of pixels p, q in the image: expanding the squared sum yields cross terms d(p)d(q) that encourage the log-space differences to have the same direction, giving a lower error when d(p) and d(q) are both positive or both negative. The first and second terms are balanced by α. As α approaches one, the predicted depth becomes robust to scale variations. We also use a smoothness term that regularizes the prediction result, while preserving depth discontinuities, defined as
L^ds = (1/N) Σ_p | Δ D_t(p) | exp( − γ | Δ I_t(p) | ),
where Δ is a Laplace operator and γ is the smoothness bandwidth. We compute the second-order derivatives of a predicted depth map, weighted by the magnitude of image discontinuities, under the assumption that depth boundaries align well with image discontinuities. We define the total loss for depth prediction as
L^D = L^si + μ L^ds,
where μ balances the scale-invariant and smoothness terms.
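The two depth losses can be sketched in NumPy as follows. The scale-invariant term follows Eigen et al.'s published form; the edge-aware smoothness weighting is one plausible instantiation of "second-order derivatives weighted by image discontinuities", since the exact weighting function is not reproduced here.

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5):
    """Eigen-style scale-invariant depth loss.

    d = log(pred) - log(gt);  L = mean(d^2) - lam * mean(d)^2.
    With lam = 1, multiplying pred by any global constant leaves L unchanged.
    """
    d = np.log(pred) - np.log(gt)
    return np.mean(d ** 2) - lam * np.mean(d) ** 2

def laplacian(f):
    """Discrete 5-point Laplacian with replicated borders."""
    fp = np.pad(f, 1, mode="edge")
    return (fp[:-2, 1:-1] + fp[2:, 1:-1] + fp[1:-1, :-2] + fp[1:-1, 2:]
            - 4.0 * fp[1:-1, 1:-1])

def edge_aware_smoothness(depth, image, gamma=10.0):
    """Penalize depth curvature except where the image itself has strong
    discontinuities (assumed to coincide with true depth boundaries)."""
    weight = np.exp(-gamma * np.abs(laplacian(image)))
    return np.mean(np.abs(laplacian(depth)) * weight)
```

With lam = 1, rescaling the prediction by a constant factor costs nothing, which is exactly the scale invariance described above.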
III-B2 Loss for flow refinement
We use the photometric consistency loss to refine the pre-computed optical flow. This term encourages the refined flow to be specific to aligning video frames over time. Motivated by the works [zhao2017loss, godard2017unsupervised], we define the consistency term but in a multi-scale manner as
L^PH_i = (1/N_i) Σ_p ( β (1 − SSIM(I^t_i(p), Ī^t_i(p))) / 2 + (1 − β) ‖ I^t_i(p) − Ī^t_i(p) ‖_1 ),
where N_i is the total number of pixels in the image I^t_i at scale i. The first and second terms, balanced by β, compute the structural similarity (SSIM) and the per-pixel differences, respectively, between the original images I^t_i and the ones Ī^t_i synthesized using the corresponding refined flow. Similar to depth prediction, we define a smoothness term for the refined flow as
and use a sum of the photometric consistency and smoothness terms over all scales, balanced by a regularization parameter, as the total loss L^F.
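A sketch of the photometric term, with two simplifications of ours: SSIM is computed over the whole image instead of local windows, and β is left as a parameter (0.85 is a common choice in related work, not a value confirmed here).

```python
import numpy as np

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (a simplification: the usual
    definition averages SSIM over small local windows)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def photometric_loss(img, img_synth, beta=0.85):
    """beta-weighted mix of a DSSIM term and an L1 term, as in the text."""
    dssim = (1.0 - ssim_global(img, img_synth)) / 2.0
    l1 = np.mean(np.abs(img - img_synth))
    return beta * dssim + (1.0 - beta) * l1
```

The loss vanishes only when the synthesized frame matches the original, which is what drives the refined flow toward accurate frame alignment.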
IV Experimental results
| Layer | Type | K | S | I/O ch | I/O rs | Input |
|---|---|---|---|---|---|---|
| *Encoder (spatial & temporal)* | | | | | | |
| Gxz | c | 5 | 1 | 128/64 | 4/4 | Econv8b + Econv8b |
| Gz | s | - | - | 64/64 | 4/4 | Gxz + Ghz |
| Gxr | c | 5 | 1 | 128/64 | 4/4 | Econv8b + Econv8b |
| Gr | s | - | - | 64/64 | 4/4 | Gxr + Ghr |
| Gxh | c | 5 | 1 | 128/64 | 4/4 | Econv8b + Econv8b |
| Gh | t | - | - | 64/64 | 4/4 | Gxh + Ghh |
| h | - | - | - | 64/64 | 4/4 | (1 − Gz) ⊙ ĥ + Gz ⊙ Gh |
| *Flow refine network* | | | | | | |

Type: a type of operation; K: kernel size; S: stride; I/O ch: the number of channels of the input/output; I/O rs: the downsampling factor of the input/output relative to the input image. c: convolution; d: dilated convolution; u: up-convolution; s: sigmoid; t: hyperbolic tangent.
In this section, we present a detailed analysis and evaluation of our approach. Our code and more results, including depth videos, are available at our project webpage: https://cvlab-yonsei.github.io/projects/FlowGRU/
| Method | Data | Sup. | Range | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE (log) ↓ | δ₁ ↑ | δ₂ ↑ | δ₃ ↑ | Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Eigen et al. [eigen2014depth] | K | D | 0-80m | 0.215 | 1.515 | 7.156 | 0.270 | 0.692 | 0.899 | 0.967 | - |
| Liu et al. [liu2014discrete] | K | D | 0-80m | 0.217 | 1.841 | 6.986 | 0.289 | 0.647 | 0.882 | 0.961 | - |
| Godard et al. [godard2017unsupervised] | K | S | 0-80m | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 | 0.04 |
| Zhou et al. [zhou2017unsupervised] | K | M | 0-80m | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 | 0.03 |
| Wang et al. [wang2018learning] | K | M | 0-80m | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 | 0.03 |
| Yin et al. [yin2018geonet] | K | M | 0-80m | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 | 0.04 |
| Kuznietsov et al. [kuznietsov2017semi] | I+K | D+S | 0-80m | 0.113 | 0.741 | 4.621 | 0.189 | 0.862 | 0.960 | 0.986 | 0.06 |
| Kumar et al. [cs2018depthnet] | K | D+M | 0-80m | 0.137 | 1.019 | 5.187 | 0.218 | 0.809 | 0.928 | 0.971 | - |
| Fu et al. [fu2018deep] | I+K | D+M | 0-80m | 0.102 | 0.617 | 3.859 | 0.165 | 0.890 | 0.964 | 0.985 | 1.08 |
| Kuznietsov et al. [kuznietsov2017semi] | I+K | D+S | 1-50m | 0.108 | 0.595 | 3.518 | 0.179 | 0.875 | 0.964 | 0.988 | 0.06 |
| Garg et al. [garg2016unsupervised] | K | S | 1-50m | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 | 0.04 |
| Godard et al. [godard2017unsupervised] | K | S | 1-50m | 0.108 | 0.657 | 3.729 | 0.194 | 0.873 | 0.954 | 0.979 | 0.04 |

↓: lower is better; ↑: higher is better.
Abs Rel: absolute relative difference; Sq Rel: squared relative difference; RMSE: root mean square error; RMSE (log): RMSE in log scale; δ: the percentage of pixels where the ratio between estimated depth and ground truth is within a given threshold. D: ground-truth depth; S: rectified stereo pairs; M: monocular video sequences.
IV-A Training details
We train our model from scratch on the KITTI raw dataset [geiger2013vision], which provides pairs of stereo images for 61 scenes together with 3D points and camera parameters. In particular, we use the split provided by [eigen2014depth], which contains 35,600 and 697 images for training and test, respectively. We consider each view in the stereo image pairs as an individual monocular sequence. We also train our model on the Cityscapes dataset [Cordts2016Cityscapes], which consists of 89k, 15k, and 45k images for training, validation, and test, respectively. We split the training sets into chunks of 50 and 30 successive frames for the KITTI and Cityscapes datasets, respectively. We choose 20 and 5 nearby frames randomly for the KITTI and Cityscapes datasets, respectively, and augment the datasets by randomly cropping training samples to a fixed size. We use a batch size of 16 for 200 epochs, which corresponds to about 450k iterations for the KITTI dataset. For the Cityscapes dataset, the same batch size of 16 is used for 200 epochs (about 600k iterations), and the trained model is then fine-tuned for an additional 100 epochs on the train split provided by [eigen2014depth]. We use the Adam optimizer [kingma2014adam]. As the learning rate, we use 1e-4 for the first 100 epochs and gradually reduce it during training. We use a grid search to set the balance parameters of the loss, follow the experimental settings in [eigen2014depth, godard2017unsupervised, lai2018learning, wang2018occlusion] to set the other parameters, and fix them in all experiments. We compute optical flow using the DIS-Flow method [kroeger2016fast], which offers a good compromise between runtime and accuracy; for example, it requires about 0.1 seconds per image with an Intel i5 3.3GHz CPU. All networks are trained end-to-end using TensorFlow [abadi2016tensorflow]. With two Nvidia GTX Titan Xs, training our model takes about 10 and 15 days for the KITTI and Cityscapes datasets, respectively, including fine-tuning.
IV-B Network architecture details
We show a detailed description of the network architecture in Table I. We denote by "+", "⊙", and "↓" concatenation, element-wise multiplication, and 2× downsampling, respectively. We use the ReLU [krizhevsky2012imagenet] as an activation function except for the last layer. Each sub-network in the encoder consists of 9 convolutional and 7 dilated convolutional layers. A dilated convolution [yu2015multi] enables covering large receptive fields using small convolutions while maintaining the spatial resolution of feature maps, but it typically causes grid artifacts [yu2017dilated]. To alleviate this problem, we add a convolutional layer after each dilated one, except for the last two layers. The flow-guided memory module has an architecture similar to the ConvGRU [ballas2015delving], consisting of reset and update gates; differently, we align the previous hidden state w.r.t. the current input feature using the refined flow. The decoder has 2 up-convolutional and 3 convolutional layers. Following [mayer2016large], we add a convolutional layer after each up-convolutional operator, which gives smoother prediction results. We use skip connections from the encoder to leverage low-level but fine-grained features for depth prediction. The spatial resolution of the predicted depth is the same as that of an input frame. The flow refine network computes three residuals at different scales; the residual for each scale is computed through 3 convolutional layers, again with the ReLU as an activation function except for the last layer.
IV-C Results
Depth predicted by our model is defined up to a scale factor. Following the experimental protocol in [zhou2017unsupervised, wang2018learning], we multiply a predicted depth map by a constant so that the median values of the predicted depth and the ground truth are the same. To evaluate our model in terms of temporal consistency, we measure temporal differences along dense trajectories. To this end, we synthesize a depth map D̄_t at time t by warping D_{t−1} using optical flow. For a fair comparison, we use an optical flow method [sun2018pwc] different from the one [kroeger2016fast] used in our model. We then compute the differences between D_t and D̄_t over time. That is, we compute temporal differences along trajectories (TDT) as follows:
TDT_t = Σ_p m_t(p) | D_t(p) − D̄_t(p) | / Σ_p m_t(p),
where a binary confidence map indicates the reliability of the optical flow: a pixel is considered reliable only when its warping errors fall below fixed thresholds, which we set to 0.5 and 0.05, respectively. We also compute the percentage of erroneous pixels, denoted by TDT¹, TDT², and TDT³, where a point is considered erroneous when the difference is more than 1, 2, and 3, respectively.
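Our reading of the TDT metric can be sketched as follows (the function name and the exact handling of thresholds are our assumptions): the mean absolute depth difference along trajectories over reliable pixels, plus per-threshold fractions of pixels whose difference stays within each threshold.

```python
import numpy as np

def tdt(depth_t, depth_prev_warped, conf_mask, thresholds=(1.0, 2.0, 3.0)):
    """Temporal differences along trajectories (TDT).

    depth_t           : predicted depth at time t
    depth_prev_warped : depth at t-1 warped to t along optical flow
    conf_mask         : binary map marking pixels with reliable flow
    Returns the mean difference over reliable pixels and, per threshold,
    the fraction of reliable pixels whose difference is within it
    (erroneous fraction = 1 minus this value).
    """
    diff = np.abs(depth_t - depth_prev_warped)[conf_mask > 0]
    mean_tdt = diff.mean()
    within = [float(np.mean(diff <= th)) for th in thresholds]
    return mean_tdt, within
```

Pixels with unreliable flow are excluded entirely, so occlusions and flow failures do not dominate the score.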
IV-C1 Comparison with the state of the art
We compare in Table II our models with the state of the art on the test split of [eigen2014depth] in terms of prediction accuracy and runtime. We denote by “K”, “CS”, and “I” the KITTI [geiger2013vision, eigen2014depth], Cityscapes [Cordts2016Cityscapes]
and ImageNet [deng2009imagenet] datasets, respectively. Numbers in bold indicate the best performance and underlined ones the second best among monocular depth prediction methods. Following the experimental protocol in [eigen2014depth], we use standard metrics to measure depth prediction accuracy. The results for the comparison, except for [eigen2014depth, liu2014discrete, cs2018depthnet], have been obtained from models provided by the authors. The runtime is measured on an Nvidia GTX Titan X. From this table, we observe three things: (1) Our model trained on the KITTI dataset ("Ours") achieves comparable or better performance than others in terms of depth prediction accuracy. In particular, it gives results comparable to [kuznietsov2017semi, fu2018deep], even without using ResNet features [he2016deep] trained for ImageNet classification [kuznietsov2017semi, fu2018deep] or exploiting stereo images for training [kuznietsov2017semi]; (2) Our method benefits from additional training samples. Fine-tuning our model trained on the Cityscapes dataset [Cordts2016Cityscapes] with the KITTI dataset ("Ours-CS+ft-K") boosts the performance and outperforms the state of the art; (3) Our models show a good trade-off between runtime and depth prediction accuracy. They outperform other state-of-the-art methods, except [fu2018deep], in terms of accuracy with a small loss of speed. Our models are slightly outperformed by Fu et al. [fu2018deep] in terms of accuracy, but are significantly faster overall (0.13 vs. 1.08 seconds).
| Method | TDT ↓ | TDT¹ ↑ | TDT² ↑ | TDT³ ↑ |
|---|---|---|---|---|
| Godard et al. [godard2017unsupervised] | 2.964 | 0.759 | 0.856 | 0.898 |
| Zhou et al. [zhou2017unsupervised] | 1.578 | 0.786 | 0.893 | 0.935 |
| Wang et al. [wang2018learning] | 1.251 | 0.809 | 0.914 | 0.951 |
| Yin et al. [yin2018geonet] | 1.651 | 0.791 | 0.894 | 0.932 |
| Kuznietsov et al. [kuznietsov2017semi] | 1.335 | 0.805 | 0.907 | 0.947 |
| Fu et al. [fu2018deep] | 1.049 | 0.827 | 0.932 | 0.966 |

↓: lower is better; ↑: higher is better.
We show in Fig. 5 an example of the TDT comparison between the state of the art and our models on the KITTI dataset [geiger2013vision]. Although Zhou et al. [zhou2017unsupervised] and Wang et al. [wang2018learning] use a video sequence as a supervisory signal similar to ours, they do not consider temporal coherence in the video, producing temporally inconsistent results. Kuznietsov et al. [kuznietsov2017semi] and Fu et al. [fu2018deep] give results comparable to ours in terms of depth prediction accuracy, as shown in Table II, but their TDT scores are far from the ground truth. On the contrary, our models produce temporally stable and consistent results, with lower errors than the state of the art. In Table III, we show the average TDT scores on the test split of [eigen2014depth] and compare our models with the state of the art in terms of temporal consistency. Numbers in bold indicate the best performance and underlined ones the second best. Our method outperforms the state of the art, including [kuznietsov2017semi, fu2018deep], by a significant margin. For comparison, the scores computed with ground-truth depth are 0.712 for TDT, and 0.924, 0.982, and 0.989 for TDT¹, TDT², and TDT³, respectively. To compute these, we interpolate sparse ground-truth depth maps and discard values in highly sparse regions (e.g., upper parts of images) using the masks provided by [garg2016unsupervised]. Note that our method's ability to give temporally consistent results does not come from the use of ground-truth depth: the supervised learning approach [kuznietsov2017semi] shows much worse results than the unsupervised one [wang2018learning], indicating that using ground truth does not always give temporally consistent results.
IV-C2 Qualitative results
We show in Fig. 6 a visual comparison of depth prediction results on the KITTI dataset [eigen2014depth]. We can see that our models predict fine-grained depth (e.g., for distant objects and poles) and provide sharp depth transitions without artifacts. For comparison, Fu et al. [fu2018deep] show grid artifacts, often caused by dilated convolutions [yu2017dilated]. We can also see that our models are highly robust to occlusion compared to other methods. For example, they predict depth for the occluded cars on the bottom left of the images, while other methods fail to handle such objects. Figure 7 visualizes pixel-wise TDT scores: we show the temporal differences between predicted depth maps, weighted by the confidence map. It shows that our model gives temporally consistent results, especially for regions with large displacements (e.g., traffic signs), resulting in fewer flickering artifacts.
IV-C3 Refined optical flow
In Fig. 8(a), we show an example of the refined flow field and its difference from the input flow. We can see that the flow refine network modifies the input flow, particularly around moving objects, making it possible to capture fine details while preserving edges and object boundaries. Our model uses the refined flow to align video frames and hidden states in the visual memory. We show video frames and hidden states at times t−1 and t in Figs. 8(b-c), respectively. Warping results w.r.t. time t using the refined flow are shown in Fig. 8(d). By comparing Figs. 8(c) and (d), we can see that the refined flow aligns both the video frame and the hidden state well, which enables our model to aggregate temporally aligned features and to prevent flickering artifacts.
IV-C4 Generalization to other datasets
We test our model trained with the KITTI [eigen2014depth] on the Cityscapes [Cordts2016Cityscapes] and the NYU [Silberman:ECCV12] datasets to demonstrate its generalization ability. Examples shown in Fig. 9 demonstrate that our model generalizes well to other images outside the training dataset. Particularly, it infers both a geometric layout in a scene and object instances (e.g., cars and trees in Fig. 9(a) and a bed in Fig. 9(b)) well. Note that, for the Cityscapes and the NYU datasets, all previous works we are aware of (e.g., [zhou2017unsupervised, wang2018learning, godard2017unsupervised, kuznietsov2017semi, fu2018deep, yin2018geonet]) offer qualitative results only.
V Conclusion
We have presented a recurrent network for monocular depth prediction that gives temporally consistent results while preserving depth boundaries. In particular, we have introduced a flow-guided memory module that selectively retains hidden states aligned along motion trajectories, enforcing long-term temporal consistency in prediction results. We have also presented a flow refine network that outputs dense flow fields specific to our task. We have shown that the refined flow aligns both video frames and hidden states, preventing flickering artifacts. We have demonstrated that our method outperforms the state of the art by a large margin in terms of temporal consistency, shows a good trade-off between depth prediction accuracy and runtime, and performs well on images outside the training datasets.