Temporally Consistent Depth Prediction with Flow-Guided Memory Units

by   Chanho Eom, et al.
Yonsei University

Predicting depth from a monocular video sequence is an important task for autonomous driving. Although it has advanced considerably in the past few years, recent methods based on convolutional neural networks (CNNs) discard temporal coherence in the video sequence and estimate depth independently for each frame, which often leads to undesired inconsistent results over time. To address this problem, we propose to memorize temporal consistency in the video sequence, and leverage it for the task of depth prediction. To this end, we introduce a two-stream CNN with a flow-guided memory module, where each stream encodes visual and temporal features, respectively. The memory module, implemented using convolutional gated recurrent units (ConvGRUs), inputs visual and temporal features sequentially together with optical flow tailored to our task. It memorizes trajectories of individual features selectively and propagates spatial information over time, enforcing a long-term temporal consistency to prediction results. We evaluate our method on the KITTI benchmark dataset in terms of depth prediction accuracy, temporal consistency and runtime, and achieve a new state of the art. We also provide an extensive experimental analysis, clearly demonstrating the effectiveness of our approach to memorizing temporal consistency for depth prediction.





I Introduction

Depth prediction from images plays a significant role in autonomous driving and advanced driver assistance systems. It helps in understanding the geometric layout of a scene, and can be leveraged to solve other tasks, including vehicle/pedestrian detection [chen2017coherent, keller2011benefits], traffic scene segmentation [li2018traffic], and 3D reconstruction [wu2017geometry]. Stereo matching is a typical approach to recovering depth that finds dense correspondences between a pair of stereo images [nguyen2017robust, van2006real, li2008binocular]. Stereo matching methods compute similarities between local patches [muresan2017mutlipatch] or optimize global objective functions with smoothness priors that penalize large derivatives of depth [hirschmuller2007stereo, miclea2018real, spangenberg2014large]. These approaches show state-of-the-art performance, but capturing pairs of stereo images requires multiple calibrated cameras, making them difficult to apply in practice. An alternative is to predict depth from a monocular video sequence, which has attracted great interest in recent years [eigen2014depth, liu2014discrete, zhou2017unsupervised, wang2018learning, godard2017unsupervised, kuznietsov2017semi, cs2018depthnet, fu2018deep]. This approach builds upon the insight that humans can perceive depth using monocular depth cues alone (e.g., occlusion, perspective, motion parallax) [rogers1979motion]. Eigen et al. [eigen2014depth] first propose a supervised learning method for predicting depth from a single still image using CNNs. Zhou et al. [zhou2017unsupervised] and Wang et al. [wang2018learning] recently propose CNN architectures for predicting depth from a monocular video, where two networks are trained separately to estimate depth and camera pose. These methods are limited in that they predict depth independently for each frame, discarding temporal coherence in the video sequence. That is, they give temporally inconsistent results, causing serious temporal flickering artifacts. Recurrent neural networks (RNNs) have been widely used to model temporal dependency across sequential data (e.g., video and text), and they have shown their effectiveness in various applications including action recognition [du2018recurrent] and machine translation [zhang2017context]. They, however, still show a limited capability of handling the flickering artifacts [cs2018depthnet, shi2017deep].

In this paper, we present a simple yet effective method for temporally consistent depth prediction from a monocular video sequence (Fig. 1). We transfer temporal consistency in the video to RNNs explicitly, particularly using convolutional gated recurrent units (ConvGRUs) [ballas2015delving]. To implement this idea, we propose a flow-guided memory unit using optical flow specific to our task, maintaining long-term temporal consistency in depth prediction results. Our module uses spatial and temporal features extracted by a two-stream CNN. We have two main reasons for decoupling these features. First, it has been shown that learning spatiotemporal features jointly from a stack of frames does not capture the motion well [simonyan2014two]. Second, optical flow itself provides an important clue for motion parallax, which is helpful to infer depth from a monocular video sequence. For example, objects closer to a camera move faster than distant ones. We show that our method outperforms the state of the art in terms of temporal consistency, and shows a good trade-off between depth prediction accuracy and runtime. The main contributions of this paper can be summarized as follows:


  • We present an effective ConvGRU encoder-decoder module for temporally consistent depth prediction from a monocular video sequence. To our knowledge, this is the first approach based on convolutional/recurrent networks to consider temporal consistency in depth prediction.

  • We propose a flow-guided memory unit that retains a long-term temporal consistency explicitly for individual pixels.

  • We present state-of-the-art results on the KITTI [geiger2013vision] benchmark. We additionally provide an extensive experimental analysis, clearly demonstrating the effectiveness of our approach to memorizing temporal consistency for depth prediction.

To encourage comparison and future work, we release our code and models online: https://cvlab-yonsei.github.io/projects/FlowGRU/.

Fig. 1: Visual comparison of the state of the art and our model on monocular depth prediction. (a-b) Top to bottom: Video frames, depth maps obtained by Wang et al. [wang2018learning], Kuznietsov et al. [kuznietsov2017semi] and our model, and ground-truth depth at two time steps, respectively. (c) Absolute differences between the depth maps at the two time steps. Compared to other methods, our model gives a temporally consistent result similar to ground truth while providing a sharp depth transition (yellow: high, blue: low). (Best viewed in color.)

II Related work

In this section, we briefly review representative works related to ours.

II-A Monocular depth prediction

The problem of predicting depth from monocular images or video sequences has received significant attention in recent years. Early works exploit hand-crafted features such as SIFT [lowe2004distinctive] and HOG [dalal2005histograms] together with graphical models [saxena2006learning, liu2014discrete] or nonparametric sampling [karsch2014depth]. Saxena et al. [saxena2006learning] estimate depth from monocular images using Markov random fields (MRFs) where they incorporate multi-scale features. Liu et al. [liu2014discrete] extend this idea by using a discrete-continuous graphical model. Karsch et al. [karsch2014depth] introduce a nonparametric approach to depth prediction from monocular images and videos. They transfer depth labels from a large-scale RGB/D dataset using dense correspondences established by a SIFT flow method [liu2011sift]. CNNs have allowed remarkable advances in depth prediction in the past few years. Eigen et al. [eigen2014depth] first leverage CNNs to predict depth from monocular images in a coarse-to-fine manner. In particular, they introduce a scale-invariant loss function that alleviates ambiguity in scale. Liu et al. [liu2015deep] combine CNNs with conditional random fields (CRFs) for structured prediction. Kuznietsov et al. [kuznietsov2017semi] propose to use additional stereo images at training time. They predict depth from left images and synthesize novel views by warping right ones using estimated depth. The differences between left and synthesized images are then used as a supervisory signal for training. Unlike the aforementioned methods using monocular images, recent works [zhou2017unsupervised, wang2018learning, yin2018geonet] have shown success in learning depth from a monocular video sequence. Zhou et al. [zhou2017unsupervised] present an approach to estimating depth and camera pose simultaneously from a video sequence. Similar to [kuznietsov2017semi], they synthesize adjacent frames using estimated depth and camera pose, and use the discrepancy between the synthesized and original ones as a supervisory signal. This approach, however, requires ground-truth parameters for camera pose. Wang et al. [wang2018learning] propose an unsupervised learning approach to estimating pose parameters using a differentiable version of the direct visual odometry (DVO) method [steinbrucker2011real], commonly employed in the SLAM community, and leverage it for depth prediction from a monocular video. Yin et al. [yin2018geonet] propose an unsupervised learning framework that learns depth, camera pose, and optical flow jointly. They first estimate depth and camera pose to obtain rigid flow, and then use them to compute optical flow. These approaches using CNNs outperform traditional methods by large margins, but none of them consider temporal coherence in a video sequence. They give temporally inconsistent results, resulting in temporal flickering artifacts. In contrast, our method memorizes temporal coherence in a video sequence, enabling temporally consistent depth prediction.

Fig. 2: Overview of our framework. Our model inputs a video frame and optical flow at each time step, and extracts spatial and temporal features from each input, respectively, using a two-stream encoder. A flow-guided memory module takes the concatenated features and memorizes temporal coherence in the video using trajectories of individual pixels. Specifically, it aligns hidden states over time using a refined flow specific to our task (Fig. 3), while partially retaining or filtering out the hidden states in the visual memory. A decoder reconstructs a depth map for the same time step from the output of the memory module together with spatial and temporal features. For depth prediction, we use scale-invariant and smoothness terms, and train the whole network end-to-end. See Table I for the detailed description of the network structure. (Best viewed in color.)

II-B Recurrent models

RNNs have been widely used to capture temporal dependency in sequential data [hopfield1982neural, rumelhart1986learning]. Representative models include GRUs [cho2014learning] and long short-term memory (LSTM) [hochreiter1997long], and they have been applied successfully to various tasks such as video representation [srivastava2015unsupervised], image captioning [donahue2015long] and car-following modeling [wang2018capturing]. LSTMs and GRUs, typically using fully-connected layers, do not maintain spatial information and require a large number of network parameters. This is problematic especially for high-dimensional data (e.g., video sequences). ConvLSTM [xingjian2015convolutional] and ConvGRUs [ballas2015delving] replace the fully-connected layers with convolutional ones, preserving spatial information while reducing the number of parameters drastically. They have been widely exploited in computer vision and image processing tasks including video recognition [tokmakov2017learning, ballas2015delving], depth prediction [jie2018left, cs2018depthnet], and precipitation nowcasting [xingjian2015convolutional, shi2017deep]. In particular, Shi et al. [shi2017deep] introduce a variant of ConvGRUs, trajectory GRU (TrajGRU), and apply it to predict future rainfall intensity. Similar to the deformable convolutional network [dai2017deformable], TrajGRU learns local offsets (i.e., where to aggregate) and adds them to the regular grid of a standard convolution, which has the effect of using spatially-variant convolutional kernels. In contrast, our model memorizes spatiotemporal features aligned along the path guided by optical flow. Most similar to ours is a ConvLSTM-based framework for depth prediction [cs2018depthnet]. It uses ConvLSTMs to exploit spatiotemporal features from a video sequence, but does not give temporally coherent results. We also use a recurrent network to exploit spatiotemporal features, but the flow-guided memory unit in our model explicitly retains long-term temporal consistency in a video sequence. Note that the ability of ConvLSTMs or ConvGRUs to capture temporal dependency does not guarantee temporally consistent results [shi2017deep].

II-C Temporal coherency

Recently, several approaches have been introduced to model temporal coherence in a video sequence. They typically use optical flow to smooth results along dense trajectories [lang2012practical, aydin2014temporally, bonneel2013example], to construct loss functions penalizing the difference between current and synthesized frames [chen2017coherent, lai2018learning], or to align features from current and previous frames [chen2017coherent, gadde2017semantic]. Different from the first and second approaches, we focus on designing a recurrent model itself that transfers temporal coherence in the video sequence to depth prediction results, without using temporal filtering techniques or corresponding loss functions. Our method is similar to the last approach in that we use aligned features to obtain temporally consistent results. In contrast to [chen2017coherent, gadde2017semantic], we use a memory unit that filters out or retains spatiotemporal features aligned along dense trajectories. More recently, Miclea et al. [miclea2018real] propose to exploit temporal cues for depth prediction. They use a previous frame and the corresponding depth and segmentation results to refine incorrect depth values. Compared to this work, our method maintains long-term temporal consistency by using the memory unit.

III Proposed solution

In this section, we describe a recurrent model with a flow-guided memory module for a temporally consistent depth prediction (Section III-A). We then present loss functions for learning depth and refined flow (Section III-B). The entire network is trained end-to-end.

III-A Network architecture

Our network mainly consists of three parts (Fig. 2): A two-stream encoder extracts spatial and temporal features, at each time step, from a video frame and from optical flow in a backward direction (i.e., a dense flow field from the current frame to the previous one), respectively. A flow-guided memory module inputs both features, and retains parts of them along trajectories of individual pixels to memorize temporal coherence in a video sequence. A decoder takes the output of the flow-guided memory module and outputs a depth map. In the following, we describe each part in detail.

III-A1 Two-stream encoder

A video sequence allows us to leverage both spatial and temporal information for depth prediction. Motivated by the works [simonyan2014two, wang2016temporal] for action recognition, we use a two-stream encoder where each stream has the same CNN architecture (but different parameters), takes the video frame and optical flow, and then extracts spatial and temporal features, respectively. These features are complementary to each other. The spatial features capture the appearance of objects and the scene layout within each frame, while the temporal ones encode trajectories of individual pixels (i.e., motion) across frames. Monocular depth prediction using CNNs typically requires large receptive fields to extract monocular depth cues including motion parallax and perspective [eigen2014depth]. We can enlarge the receptive fields by using convolutions with strides larger than one or pooling, but these lead to a loss of spatial resolution and of scene details such as small and thin structures [yu2017dilated]. We instead implement the two-stream encoder using a series of dilated convolutions [yu2015multi], with which we can adjust the size of the receptive fields by changing the dilation rate without loss of resolution. Note that a dilated convolution with a rate of 1 corresponds to a standard convolution.
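The effect of the dilation rate on the receptive field can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the released code: the helper name `receptive_field` and the two toy layer stacks are illustrative assumptions, and the formula is the usual one for stacked convolutions (each layer adds `(kernel - 1) * dilation` input pixels, scaled by the accumulated stride).

```python
def receptive_field(layers):
    """Receptive field (per dimension, in input pixels) of a stack of conv layers.

    Each layer is a tuple (kernel_size, stride, dilation).
    """
    rf = 1      # receptive field of a single input pixel
    jump = 1    # distance in input pixels between neighbouring outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Three 3x3 layers at stride 1: standard convolutions vs. dilation rates 1, 2, 4.
standard = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]

print(receptive_field(standard))  # 7
print(receptive_field(dilated))   # 15
```

Both stacks have the same number of parameters and keep the feature map at full resolution, but the dilated stack covers roughly twice the extent per dimension, which is the property the encoder relies on.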

Fig. 3: Flow refine network. The flow refine network inputs two video frames and the pre-computed optical flow between them. Using features extracted from the concatenated inputs, the network estimates residuals at three different scales. They are concatenated to the pre-computed optical flow, and then passed on to additional convolutional layers to obtain three refined flows. The network keeps the spatial resolution of the first refined flow the same as that of the pre-computed optical flow, while reducing the resolution of the second and third by a factor of 2 and 4 in each dimension, respectively. To compute the refined flows, we use a multi-scale loss using photometric consistency and smoothness terms. The photometric consistency term computes the differences between video frames and warped ones using the refined flows. The smoothness term encourages the refined flows to be smooth while preserving flow discontinuities. See Table I for the detailed description of the network structure. (Best viewed in color.)

III-A2 Flow refine network

In addition to extracting temporal features in the encoder, we also use optical flow to align video frames and hidden states in the flow-guided memory module over time. Although CNN-based optical flow methods give state-of-the-art results, they are still not accurate enough to propagate fine-grained information across video frames or hidden states, especially at motion boundaries. To address this problem, we use an additional network for refining the pre-computed optical flow (Fig. 3). In particular, we learn residuals between the pre-computed optical flow and the refined one [he2016deep], building on the assumption that the two are similar and the initial flow does not change drastically. Similar to [gadde2017semantic], we use an early fusion approach, directly concatenating video frames and optical flow, to transfer the low-level information in the frames to the initial flow effectively. Unlike [gadde2017semantic], we compute the refined flow using a multi-scale architecture (Fig. 3) and use it to align both video frames and hidden states in the memory module for the task of depth prediction. Specifically, the network extracts spatiotemporal features from the two video frames and the optical flow, and computes the residuals through convolutional layers. They are concatenated to the pre-computed optical flow, and the results are then passed on to additional convolutional layers, resulting in a refined flow specific to aligning video frames. The spatial resolution of this refined flow is the same as that of the pre-computed optical flow. The other two refined flow fields are computed similarly by applying convolutional layers, while reducing the spatial resolution of the pre-computed optical flow by a factor of 2 and 4 in each dimension, respectively.

III-A3 Flow-guided memory

Our memory module exploits trajectories of individual pixels using the refined optical flow to align hidden states selectively across frames (Fig. 4). This allows us to transfer long-term temporal consistency in a video sequence to depth prediction results. We implement the memory module with a ConvGRU [ballas2015delving], since it does not suffer from spatial resolution loss and is more efficient in terms of memory, compared to vanilla RNNs and ConvLSTMs [xingjian2015convolutional], respectively. The flow-guided memory module follows the ConvGRU formulation, in which $\odot$ and $*$ denote element-wise multiplication and convolution, respectively, $W$ and $b$ are weight and bias terms, and $\sigma$ is the sigmoid function. In addition, a warping operator resamples a feature map at each position according to a flow field.

The flow-guided memory module inputs the feature obtained from the two-stream encoder and the previous hidden state acting as a visual memory, and outputs a new state by combining the previous state and a candidate state, weighted by the output of an update gate. The update and reset gates selectively choose and discard information, respectively, from the input feature and the previous hidden state. Conventional GRUs aggregate features from the hidden state at the previous time step directly to compute the current one. This is problematic especially when the input features from the previous and current frames are not aligned with each other, for example when objects move across video frames or the viewpoint changes due to camera motion. Mixing features from different locations leads to temporally inconsistent results. To address this problem, we instead use a flow-guided memory in which the features from the previous hidden state are aligned to the current input feature by warping with the refined flow. We implement this with a differentiable warping operator using bilinear interpolation [jaderberg2015spatial]. We additionally use a matching confidence to account for the reliability of the refined optical flow.


The matching confidence is controlled by a bandwidth parameter and is computed from video frames resized to the same spatial resolution as the hidden state. We use the same matching confidence in each channel of the hidden state.
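The alignment step, backward warping with bilinear interpolation, can be illustrated with a small NumPy sketch. This is not the released TensorFlow implementation: the function name `warp_bilinear`, the `(H, W, C)` layout, the `(dx, dy)` flow convention, and border clamping are all assumptions made for illustration.

```python
import numpy as np


def warp_bilinear(feat, flow):
    """Backward-warp feat (H, W, C) with flow (H, W, 2) by bilinear interpolation.

    flow[y, x] = (dx, dy) points from pixel (x, y) of the current frame to its
    match in the previous frame, so output[y, x] samples feat at (x + dx, y + dy).
    Sampling positions are clamped to the image border (an assumption).
    """
    h, w, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]

    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


# Usage: a zero flow leaves the features unchanged.
prev_state = np.random.rand(4, 5, 8)
aligned = warp_bilinear(prev_state, np.zeros((4, 5, 2)))
print(np.allclose(aligned, prev_state))
```

Because every output value is a weighted sum of four neighbouring inputs with weights that vary smoothly with the flow, gradients can pass through both the features and the flow, which is what makes end-to-end training of the refined flow possible.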

Fig. 4: Illustration of the flow-guided memory module. It uses the refined flow to align the hidden state at the previous time step to the input feature at the current time step, making a new candidate state. In contrast, conventional GRUs compute the candidate state using the unaligned previous hidden state directly. This is problematic when the previous hidden state and the current input are not spatially aligned, due to, e.g., viewpoint changes and moving objects. See Table I for the detailed description of the network structure. (Best viewed in color.)

Recently, Shi et al. introduce the TrajGRU [shi2017deep] for precipitation nowcasting. Our model is closely related to the TrajGRU in that both consider temporally aligned hidden states to compute a new one. The TrajGRU learns offsets for sampling locations [dai2017deformable], typically defined on a regular grid in the standard convolution, to fetch information from the previous frame. Although this can be seen as an implicit feature alignment, the TrajGRU is not designed to enforce temporal consistency and does not consider large displacements. It may provide temporally inconsistent results when the learned offsets are wrong or displacements between video frames are large. The TrajGRU is also computationally inefficient, since it applies a warping operator for each offset. Compared to this work, we align hidden states in the memory module explicitly using the refined optical flow together with a matching confidence. This considers large motion and prevents aggregating the hidden states for unreliable correspondences, making it possible to obtain temporally consistent results.
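For reference, a ConvGRU whose gates operate on a warped hidden state can be written as below. This is a sketch under assumed notation, not necessarily the exact formulation of the module: $x^t$ (input feature), $h^t$ (hidden state), $\hat{F}^t$ (refined flow), $\mathcal{W}$ (warping operator), and $c^t$ (matching confidence) are illustrative symbols, and multiplying the confidence into the warped state is an assumption. The gate structure follows the rows Gz, Gr, Gh of Table I.

```latex
\[
\begin{aligned}
\tilde{h}^{t-1} &= c^{t} \odot \mathcal{W}\big(h^{t-1};\, \hat{F}^{t}\big), \\
z^{t} &= \sigma\big(W_{xz} * x^{t} + W_{hz} * \tilde{h}^{t-1} + b_{z}\big), \\
r^{t} &= \sigma\big(W_{xr} * x^{t} + W_{hr} * \tilde{h}^{t-1} + b_{r}\big), \\
\hat{h}^{t} &= \tanh\big(W_{xh} * x^{t} + W_{hh} * (r^{t} \odot \tilde{h}^{t-1}) + b_{h}\big), \\
h^{t} &= (1 - z^{t}) \odot \tilde{h}^{t-1} + z^{t} \odot \hat{h}^{t}.
\end{aligned}
\]
```

Setting $\tilde{h}^{t-1} = h^{t-1}$ recovers a conventional ConvGRU, which makes the difference explicit: only the state fed to the gates and to the final blend changes, not the gating itself.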

III-A4 Decoder

The decoder inputs the hidden state in the flow-guided memory module and gives depth maps that have the same resolution as input images. In order to consider fine details (e.g., depth boundaries), we use additional low-level features from spatial and temporal streams by skip connections (Fig. 2).

III-B Training loss

We use three types of losses for training: First, a scale-invariant term is used to alleviate scale ambiguity in predicted depth. Second, we use a photometric consistency term to learn the refined flow, making the pre-computed optical flow specific to aligning video frames. Finally, smoothness terms regularize depth and flow fields. Our final loss is a linear combination of a loss for depth prediction and a loss for flow refinement, balanced by a weighting parameter. In the following, we describe each term in detail.

III-B1 Loss for depth prediction

Motivated by the work [eigen2014depth], we use a scale-invariant loss with two terms. Both are defined on the difference between the predicted depth and ground truth at each position in log space, and are normalized by the total number of pixels. The first term encourages predicted depth to be similar to ground truth. Estimating the absolute scale of depth is, however, extremely hard, especially from monocular video sequences. The second term alleviates this problem by comparing the differences at pairs of pixels. It encourages them to have the same direction, and gives a lower error when the differences at both pixels are positive or both are negative. The first and second terms are balanced by a weighting parameter. As this parameter approaches one, predicted depth becomes robust to scale variations. We also use a smoothness term that regularizes the prediction result while preserving depth discontinuities.
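The scale-invariant loss of [eigen2014depth] that motivates this term can be sketched as follows. The function name and the weight name `alpha` are placeholders chosen for illustration, and the exact weighting used in our model may differ. The pairwise second term is written in its equivalent closed form, since the sum of products over all pixel pairs equals the squared sum of the differences.

```python
import numpy as np


def scale_invariant_loss(pred, gt, alpha=0.5):
    """Scale-invariant log-depth loss in the style of Eigen et al.

    pred, gt: arrays of positive depths with the same shape.
    alpha:    balance between the two terms (hypothetical name); alpha = 1
              makes the loss fully invariant to a global scaling of pred.
    """
    d = np.log(pred) - np.log(gt)          # per-pixel difference in log space
    n = d.size
    first = np.sum(d ** 2) / n             # pulls prediction towards ground truth
    second = np.sum(d) ** 2 / n ** 2       # sum over all pixel pairs of d(p) * d(q)
    return first - alpha * second


pred = np.array([1.0, 2.0, 4.0])
gt = np.array([2.0, 2.0, 3.0])

# With alpha = 1, multiplying the prediction by any constant leaves the loss unchanged.
print(np.isclose(scale_invariant_loss(pred, gt, 1.0),
                 scale_invariant_loss(3.0 * pred, gt, 1.0)))
```

With `alpha = 0` the loss reduces to the mean squared error in log space, so the parameter interpolates between an absolute and a purely relative depth criterion.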


The smoothness term uses a Laplace operator and a smoothness bandwidth: we compute the second-order derivative of a predicted depth map, weighted using the magnitude of image discontinuities, under the assumption that depth boundaries are well aligned with image discontinuities. The total loss for depth prediction combines the scale-invariant and smoothness terms, balanced by a weighting parameter.
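An edge-aware second-order smoothness term of this kind can be sketched as below. The exact weighting function is an assumption: here the depth Laplacian is down-weighted by an exponential of the image Laplacian magnitude, and the names `laplacian`, `edge_aware_smoothness`, and `bandwidth` are illustrative.

```python
import numpy as np


def laplacian(img):
    """5-point discrete Laplacian of a 2D array with replicated borders."""
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img


def edge_aware_smoothness(depth, gray, bandwidth=1.0):
    """Mean absolute depth Laplacian, down-weighted at image discontinuities.

    depth, gray: 2D arrays of the same shape (predicted depth, grayscale frame).
    bandwidth:   controls how quickly the weight decays at image edges
                 (assumed exponential form).
    """
    weight = np.exp(-np.abs(laplacian(gray)) / bandwidth)
    return np.mean(np.abs(laplacian(depth)) * weight)


# A depth step is penalized less when the image has an edge at the same place.
step = np.zeros((4, 4))
step[:, 2:] = 1.0
flat = np.zeros((4, 4))
print(edge_aware_smoothness(step, step) < edge_aware_smoothness(step, flat))
```

The weighting lets sharp depth transitions survive where the frame itself changes abruptly, while still smoothing depth inside visually homogeneous regions.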

III-B2 Loss for flow refinement

We use the photometric consistency loss to refine the pre-computed optical flow. This term encourages the refined flow to be specific to aligning video frames over time. Motivated by the works [zhao2017loss, godard2017unsupervised], we define the consistency term, but in a multi-scale manner, as

$$\mathcal{L}^{PH}_i = \frac{1}{N_i} \sum_{p} \left( \beta \, \frac{1 - \mathrm{SSIM}\big(I_i^t(p), \bar{I}_i^t(p)\big)}{2} + (1-\beta) \, \big\| I_i^t(p) - \bar{I}_i^t(p) \big\|_1 \right),$$

where $N_i$ is the total number of pixels in the image $I_i^t$ at scale $i$. The first and second terms, balanced by $\beta$, compute the structural similarity (SSIM) and the absolute differences, respectively, between the original images $I_i^t$ and the synthesized ones $\bar{I}_i^t$ obtained by warping with the corresponding refined flow. Similar to depth prediction, we define a smoothness term for the refined flow, and use a sum of the photometric consistency and smoothness terms, balanced by a regularization parameter, as the total loss for flow refinement.

IV Experimental results

Layer Type K S I/O ch I/O rs Input
Encoder (Spatial & temporal)
Econv1a c 3 2 3/32 1/2 video frame or optical flow
Econv1b c 3 1 32/32 2/2 Econv1a
Econv2a c 3 2 32/64 2/4 Econv1b
Econv2b c 3 1 64/64 4/4 Econv2a
Econv3a d 3 2 64/64 4/4 Econv2b
Econv3b c 3 1 64/64 4/4 Econv3a
Econv4a d 3 4 64/64 4/4 Econv3b
Econv4b c 3 1 64/64 4/4 Econv4a
Econv5a d 3 8 64/128 4/4 Econv4b
Econv5b c 3 1 128/128 4/4 Econv5a
Econv6a d 3 16 128/128 4/4 Econv5b
Econv6b c 3 1 128/128 4/4 Econv6a
Econv7a d 3 16 128/256 4/4 Econv6b
Econv7b c 3 1 256/256 4/4 Econv7a
Econv8a d 3 1 256/256 4/4 Econv7b
Econv8b d 3 1 256/64 4/4 Econv8a
Flow-guided memory
Gxz c 5 1 128/64 4/4 Econv8b (spatial stream) + Econv8b (temporal stream)
Ghz c 5 1 64/64 4/4
Gz s - - 64/64 4/4 Gxz + Ghz
Gxr c 5 1 128/64 4/4 Econv8b (spatial stream) + Econv8b (temporal stream)
Ghr c 5 1 64/64 4/4
Gr s - - 64/64 4/4 Gxr + Ghr
Gxh c 5 1 128/64 4/4 Econv8b (spatial stream) + Econv8b (temporal stream)
Ghh c 5 1 64/64 4/4 Gr
Gh t - - 64/64 4/4 Gxh + Ghh
- - - 64/64 4/4 (Gz) + Gz Gh
- - - 64/64 4/4
Decoder
Dconv1a u 5 2 64/32 4/2
Dconv1b c 5 1 96/32 2/2 Dconv1a + Econv1b (spatial stream) + Econv1b (temporal stream)
Dconv2a u 5 2 32/16 2/1 Dconv1b
Dconv2b c 5 1 16/16 1/1 Dconv2a
Output c 5 1 16/1 1/1 Dconv2b
Flow refine network
Fconv1a c 3 1 8/32 1/1
Fconv1b c 3 1 32/2 1/1 Fconv1a
c 3 1 4/2 1/1 Fconv1b +
Fconv2a c 3 2 32/32 1/2 Fconv1a
Fconv2b c 3 1 34/2 2/2 Fconv2a +
c 3 1 4/2 2/2 Fconv2b +
Fconv3a c 3 2 32/32 2/4 Fconv1a
Fconv3b c 3 1 34/2 4/4 Fconv3a +
c 3 1 4/2 4/4 Fconv3b +
  • Type: Type of operation; K: Kernel size; S: Stride (dilation rate for dilated convolutions); I/O ch: Number of input/output channels; I/O rs: Downsampling factor of the input/output relative to the input image. c: Convolution; d: Dilated convolution; u: Up-convolution; s: Sigmoid; t: Hyperbolic tangent.

TABLE I: Network architecture details.

In this section we present a detailed analysis and evaluation of our approach. Our code and more results including depth videos are available at our project webpage: https://cvlab-yonsei.github.io/projects/FlowGRU/

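Method Dataset Supervision Cap Abs Rel Sq Rel RMSE RMSE (log) δ < 1.25 δ < 1.25² δ < 1.25³ Runtime(s)
(Abs Rel, Sq Rel, RMSE, RMSE (log): lower is better; δ accuracies: higher is better.)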
Eigen et al. [eigen2014depth] K D 0-80m 0.215 1.515 7.156 0.270 0.692 0.899 0.967 -
Liu et al. [liu2014discrete] K D 0-80m 0.217 1.841 6.986 0.289 0.647 0.882 0.961 -
Godard et al. [godard2017unsupervised] K S 0-80m 0.148 1.344 5.927 0.247 0.803 0.922 0.964 0.04
Zhou et al. [zhou2017unsupervised] K M 0-80m 0.208 1.768 6.856 0.283 0.678 0.885 0.957 0.03
Wang et al. [wang2018learning] K M 0-80m 0.151 1.257 5.583 0.228 0.810 0.936 0.974 0.03
Yin et al. [yin2018geonet] K M 0-80m 0.155 1.296 5.857 0.233 0.793 0.931 0.973 0.04
Kuznietsov et al. [kuznietsov2017semi] I+K D+S 0-80m 0.113 0.741 4.621 0.189 0.862 0.960 0.986 0.06
Kumar et al. [cs2018depthnet] K D+M 0-80m 0.137 1.019 5.187 0.218 0.809 0.928 0.971 -
Fu et al. [fu2018deep] I+K D+M 0-80m 0.102 0.617 3.859 0.165 0.890 0.964 0.985 1.08
Ours K D+M 0-80m 0.117 0.726 4.537 0.192 0.865 0.958 0.983 0.13
Ours-CS+ft-K CS+K D+M 0-80m 0.112 0.700 4.260 0.184 0.881 0.962 0.983 0.13
Kuznietsov et al. [kuznietsov2017semi] I+K D+S 1-50m 0.108 0.595 3.518 0.179 0.875 0.964 0.988 0.06
Garg et al. [garg2016unsupervised] K S 1-50m 0.169 1.080 5.104 0.273 0.740 0.904 0.962 0.04
Godard et al. [godard2017unsupervised] K S 1-50m 0.108 0.657 3.729 0.194 0.873 0.954 0.979 0.04
Ours K D+M 1-50m 0.113 0.580 3.493 0.181 0.877 0.963 0.985 0.13
Ours-CS+ft-K CS+K D+M 1-50m 0.109 0.580 3.359 0.176 0.891 0.965 0.985 0.13
  • Abs Rel: Absolute relative difference; Sq Rel: Square relative difference; RMSE: Root Mean Square Error; RMSE (log): RMSE in log scale; δ: The percentage of pixels where the ratio between estimated depth and ground truth is within the given threshold. D: Ground-truth depth; S: Rectified stereo pairs; M: Monocular video sequences.

TABLE II: Quantitative comparison with the state of the art on monocular depth prediction with the test split provided by [eigen2014depth].
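The metrics reported in Table II follow the standard definitions used with the split of [eigen2014depth]. A minimal NumPy sketch is given below; the function name `depth_metrics`, the dictionary keys, and the way the depth cap is applied as a mask on ground truth are assumptions for illustration, not the evaluation script used for the table.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=0.0, max_depth=80.0):
    """Standard monocular depth metrics over pixels with valid ground truth.

    pred, gt: arrays of the same shape; pixels whose ground truth lies outside
    (min_depth, max_depth) are ignored (assumed handling of the depth cap).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mask = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[mask], gt[mask]

    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "sq_rel": np.mean((pred - gt) ** 2 / gt),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }


gt = np.array([5.0, 10.0, 20.0, 40.0])
metrics = depth_metrics(1.1 * gt, gt)
print(round(metrics["abs_rel"], 3), metrics["delta_1"])
```

The first four entries are errors (lower is better), while the three threshold accuracies measure the fraction of pixels whose prediction is within a multiplicative factor of ground truth (higher is better).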
Fig. 5: Examples of TDT variations over time on the KITTI dataset [geiger2013vision]. Compared to the state of the art [godard2017unsupervised, zhou2017unsupervised, wang2018learning, yin2018geonet, kuznietsov2017semi, fu2018deep], our models give lower errors across all frames and show patterns analogous to those of ground truth. (Best viewed in color.)

IV-A Training

We train our model from scratch on the KITTI raw dataset [geiger2013vision], which provides stereo image pairs for 61 scenes together with 3D points and camera parameters. In particular, we use the split provided by [eigen2014depth], which contains 35,600 and 697 images for training and testing, respectively. We consider each view in the stereo image pairs as an individual monocular sequence. We also train our model on the Cityscapes dataset [Cordts2016Cityscapes], which consists of 89k, 15k and 45k images for training, validation and testing, respectively. We split the training sets into chunks of 50 and 30 successive frames for the KITTI and Cityscapes datasets, respectively. We choose 20 and 5 nearby frames randomly for the KITTI and Cityscapes datasets, respectively, and augment the datasets by randomly cropping training samples to the size of . We use a batch size of 16 for 200 epochs, which corresponds to about 450k iterations for the KITTI dataset. For the Cityscapes dataset, the same batch size of 16 is used with 200 epochs (about 600k iterations), and the trained model is then fine-tuned for an additional 100 epochs on the training split provided by [eigen2014depth]. We use the Adam optimizer [kingma2014adam] with and . We use a learning rate of 1e-4 for the first 100 epochs and gradually reduce it thereafter. We use a grid search to set the balance parameters, , and , to , and , respectively. We follow the experimental setting in [eigen2014depth, godard2017unsupervised, lai2018learning, wang2018occlusion] to set the other parameters, and fix them in all experiments: , , , and . We compute optical flow using the DIS-Flow method [kroeger2016fast], which offers a good compromise between runtime and accuracy. For example, it requires 0.1 seconds for images of size with an Intel i5 3.3 GHz CPU. All networks are trained end-to-end using TensorFlow [abadi2016tensorflow]. With two Nvidia GTX Titan Xs, training our model takes about 10 and 15 days for the KITTI and Cityscapes datasets, respectively, including fine-tuning.
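The chunking and sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, chunks are assumed to be non-overlapping, and "nearby frames" is read as frames drawn at random from within one chunk and kept in temporal order. The last line checks that 35,600 training images, a batch size of 16, and 200 epochs give roughly the 450k iterations quoted for KITTI.

```python
import random


def split_into_chunks(frame_ids, chunk_len):
    """Split a sequence into non-overlapping chunks of successive frames.

    Assumption: chunks do not overlap and a trailing partial chunk is dropped.
    """
    last_start = len(frame_ids) - chunk_len
    return [frame_ids[i:i + chunk_len] for i in range(0, last_start + 1, chunk_len)]


def sample_nearby_frames(chunk, num_frames, rng):
    """Pick num_frames frames from one chunk at random, kept in temporal order.

    Assumption: "nearby" means "from the same chunk".
    """
    picked = sorted(rng.sample(range(len(chunk)), num_frames))
    return [chunk[i] for i in picked]


def num_iterations(num_images, batch_size, epochs):
    """Number of parameter updates for a given dataset size and schedule."""
    return num_images * epochs // batch_size


rng = random.Random(0)
chunks = split_into_chunks(list(range(120)), chunk_len=50)  # KITTI chunk length
clip = sample_nearby_frames(chunks[0], num_frames=20, rng=rng)
kitti_iters = num_iterations(35_600, batch_size=16, epochs=200)  # 445,000 updates
```

The KITTI count comes out at 445,000 updates, consistent with the "about 450k iterations" stated above.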

IV-B Network architecture details

We show a detailed description of the network architecture in Table I. We denote by “+”, “”, and “” concatenation, element-wise multiplication, and 2× downsampling, respectively. We use the ReLU as an activation function except for the last layer. Each sub-network in the encoder consists of 9 convolutional and 7 dilated convolutional layers. A dilated convolution [yu2015multi] covers a large receptive field with small convolution kernels while maintaining the spatial resolution of feature maps, but it typically causes grid artifacts [yu2017dilated]. To alleviate this problem, we add a convolutional layer after each dilated one, except for the last two layers. The flow-guided memory module has an architecture similar to the ConvGRU [ballas2015delving], consisting of reset and update gates. Unlike the ConvGRU, however, we align the previous hidden state with the current input feature using the refined flow. The decoder has 2 up-convolutional and 3 convolutional layers. Following [mayer2016large], we add a convolutional layer after each up-convolutional operator, which gives smooth prediction results. We use skip connections from the encoder to leverage low-level but fine-grained features for depth prediction. The spatial resolution of the predicted depth is the same as that of an input frame. The flow refine network computes three residuals at different scales. The residual for each scale is computed through 3 convolutional layers. We use the ReLU [krizhevsky2012imagenet] as an activation function except for the last layer.
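To make the idea of a flow-aligned recurrent update concrete, the following NumPy sketch warps the previous hidden state toward the current frame and then applies a GRU-style update. It is a simplified illustration, not the architecture in Table I: the gates use per-pixel linear maps (equivalent to 1×1 convolutions) instead of spatial convolutions, the flow is assumed to point from each current-frame pixel to its location in the previous frame, and the function names, parameter names, and the particular gating convention are all assumptions.

```python
import numpy as np


def warp_backward(hidden, flow):
    """Bilinearly sample hidden (H, W, C) at positions displaced by flow (H, W, 2).

    Assumption: flow[y, x] = (dx, dy) points from pixel (x, y) in the current
    frame to its location in the previous frame; coordinates are clamped to
    the image border.
    """
    h, w, _ = hidden.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = (1 - wx) * hidden[y0, x0] + wx * hidden[y0, x1]
    bottom = (1 - wx) * hidden[y1, x0] + wx * hidden[y1, x1]
    return (1 - wy) * top + wy * bottom


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def flow_guided_gru_step(x, h_prev, flow, params):
    """One GRU-style update on a hidden state aligned to the current frame."""
    h_warp = warp_backward(h_prev, flow)              # align memory with the input
    xh = np.concatenate([x, h_warp], axis=-1)
    z = sigmoid(xh @ params["Wz"])                    # update gate
    r = sigmoid(xh @ params["Wr"])                    # reset gate
    xrh = np.concatenate([x, r * h_warp], axis=-1)
    h_cand = np.tanh(xrh @ params["Wh"])              # candidate state
    return (1 - z) * h_warp + z * h_cand              # one common GRU convention


rng = np.random.default_rng(0)
H, W, Cx, Ch = 4, 5, 3, 2
x = rng.standard_normal((H, W, Cx))
h_prev = rng.standard_normal((H, W, Ch))
flow = np.zeros((H, W, 2))
params = {k: 0.1 * rng.standard_normal((Cx + Ch, Ch)) for k in ("Wz", "Wr", "Wh")}
h_new = flow_guided_gru_step(x, h_prev, flow, params)
```

With a zero flow field the warp is the identity and the step reduces to an ordinary (1×1) ConvGRU update; with a non-zero flow, each location reads the memory of the point it came from, which is what allows features to be accumulated along motion trajectories.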

IV-C Evaluation

Depth predicted by our model is defined up to a scale factor. Following the experimental protocol in [zhou2017unsupervised, wang2018learning], we multiply a predicted depth map by a constant in order to make median values of predicted depth and ground truth the same. To evaluate our model in terms of temporal consistency, we measure temporal differences along dense trajectories. To this end, we synthesize a depth map  at time  by warping  using optical flow. For fair comparison, we use an optical flow method [sun2018pwc] different from the one [kroeger2016fast] used in our model. We then compute the differences between and over time. That is, we compute temporal differences along trajectories (TDT) as follows.


where a binary confidence map  represents reliability of optical flow, defined as


We set  and  to 0.5 and 0.05, respectively. We also compute the percentage of erroneous pixels, denoted by TDT , TDT , and TDT , where a point is considered to be erroneous when the differences are more than 1, 2, and 3, respectively.
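The two evaluation steps described in this subsection, median scaling and confidence-masked temporal differences, can be sketched as follows. This is a hedged illustration rather than the exact metric: it assumes the warped depth map and the binary confidence map are already given, uses the absolute difference averaged over reliable pixels, and follows the prose in counting pixels whose difference exceeds each threshold. All function and argument names are hypothetical.

```python
import numpy as np


def median_scale(pred, gt, valid):
    """Rescale pred so that its median matches the ground-truth median.

    valid is a boolean mask of pixels that have ground-truth depth.
    """
    return pred * (np.median(gt[valid]) / np.median(pred[valid]))


def tdt_scores(depth_warped, depth_curr, confidence, thresholds=(1.0, 2.0, 3.0)):
    """Temporal differences between a warped and a current depth map.

    Assumptions: differences are absolute, averaged over pixels where the
    binary confidence map is 1, and a pixel counts as erroneous when its
    difference exceeds a threshold.
    """
    mask = confidence.astype(bool)
    diff = np.abs(depth_warped - depth_curr)[mask]
    mean_diff = float(diff.mean())
    exceed = [float((diff > t).mean()) for t in thresholds]
    return mean_diff, exceed


# Median scaling: a prediction that is off by a constant factor is recovered.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = 2.0 * pred
scaled = median_scale(pred, gt, valid=np.ones_like(gt, dtype=bool))

# Temporal differences: the bottom-right pixel is masked out as unreliable.
warped = np.full((2, 2), 10.0)
current = np.array([[10.5, 12.5], [14.0, 10.0]])
conf = np.array([[1, 1], [1, 0]])
mean_diff, exceed = tdt_scores(warped, current, conf)
```

Masking with the confidence map keeps pixels with unreliable flow (for example, around occlusions) from dominating the score.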

Fig. 6: Visual comparison of predicted depth on the KITTI dataset [geiger2013vision]. Top to bottom: Video frames, depth images predicted by Godard et al. [godard2017unsupervised], Zhou et al. [zhou2017unsupervised], Wang et al. [wang2018learning], Yin et al. [yin2018geonet], Kuznietsov et al. [kuznietsov2017semi], Fu et al. [fu2018deep] and our models (Ours and Ours-CS+ft-K), and ground truth. We interpolate sparse ground-truth depth maps for the purpose of visualization only. Our method predicts depth for small-size or occluded objects (e.g., thin poles and occluded cars on the bottom left of images) and provides a sharp depth transition without artifacts. (Best viewed in color.)

IV-C1 Comparison with the state of the art

We compare in Table II our models with the state of the art on the test split of [eigen2014depth] in terms of prediction accuracy and runtime. We denote by “K”, “CS”, and “I” the KITTI [geiger2013vision, eigen2014depth], Cityscapes [Cordts2016Cityscapes] and ImageNet [deng2009imagenet] datasets, respectively. Numbers in bold indicate the best performance and underscored ones are the second best among monocular depth prediction methods. Following the experimental protocol in [eigen2014depth], we use standard metrics to measure depth prediction accuracy. The results for the comparison, except [eigen2014depth, liu2014discrete, cs2018depthnet], have been obtained from models provided by the authors. The runtime is measured with an Nvidia GTX Titan X. From this table, we observe three things: (1) Our model trained on the KITTI dataset (“Ours”) achieves comparable or better performance than others in terms of depth prediction accuracy. In particular, it gives results comparable to [kuznietsov2017semi, fu2018deep], even though it uses neither ResNet features [he2016deep] trained for ImageNet classification, as in [kuznietsov2017semi, fu2018deep], nor stereo images for training, as in [kuznietsov2017semi]; (2) Our method benefits from using additional training samples. We fine-tune our model trained on the Cityscapes dataset [Cordts2016Cityscapes] using the KITTI dataset (“Ours-CS+ft-K”), boosting the performance and outperforming the state of the art; (3) Our models show a good trade-off between runtime and depth prediction accuracy. They outperform other state-of-the-art methods, except [fu2018deep], in terms of accuracy with a small loss of speed. Our models are slightly outperformed by Fu et al. [fu2018deep] in terms of accuracy, but they are significantly faster (0.13 vs. 1.08 seconds).
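For reference, the standard accuracy metrics listed under Table II can be written compactly in NumPy. The sketch below uses the definitions commonly associated with the protocol of [eigen2014depth]; the accuracy thresholds of 1.25, 1.25² and 1.25³ are an assumption taken from that common practice, since the threshold symbol is not legible in the table footnote, and the function name is hypothetical. Inputs are assumed to be 1-D arrays of positive depths at valid pixels, already restricted to the evaluated depth range.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Common monocular depth metrics over valid pixels (1-D, positive depths)."""
    ratio = np.maximum(pred / gt, gt / pred)
    err = pred - gt
    return {
        "abs_rel": float(np.mean(np.abs(err) / gt)),          # absolute relative difference
        "sq_rel": float(np.mean(err ** 2 / gt)),              # square relative difference
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        # Assumed thresholds: 1.25, 1.25**2, 1.25**3.
        "acc_1": float(np.mean(ratio < 1.25)),
        "acc_2": float(np.mean(ratio < 1.25 ** 2)),
        "acc_3": float(np.mean(ratio < 1.25 ** 3)),
    }


gt = np.array([2.0, 4.0, 10.0])
pred = np.array([2.0, 6.0, 9.0])
metrics = depth_metrics(pred, gt)
```

The first four entries are error measures (lower is better), while the last three are fractions of pixels within a ratio threshold (higher is better).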

Columns: method; TDT (lower is better); thresholded TDT scores at 1, 2, and 3 (higher is better)
Godard et al. [godard2017unsupervised] 2.964 0.759 0.856 0.898
Zhou et al. [zhou2017unsupervised] 1.578 0.786 0.893 0.935
Wang et al. [wang2018learning] 1.251 0.809 0.914 0.951
Yin et al. [yin2018geonet] 1.651 0.791 0.894 0.932
Kuznietsov et al. [kuznietsov2017semi] 1.335 0.805 0.907 0.947
Fu et al. [fu2018deep] 1.049 0.827 0.932 0.966
Ours 0.940 0.835 0.951 0.979
Ours-CS+ft-K 0.896 0.848 0.952 0.979
Ground truth 0.712 0.924 0.982 0.989
TABLE III: Quantitative comparison with the state of the art on the test split provided by [eigen2014depth] in terms of the average TDT.
Fig. 7: Visual comparison of pixel-wise TDT scores. Two examples are shown for each method. The TDT scores are color-coded (blue: low, yellow: high). Our model shows lower TDT scores than the state of the art [godard2017unsupervised, zhou2017unsupervised, wang2018learning, yin2018geonet, kuznietsov2017semi, fu2018deep], especially for the regions near objects, demonstrating that it gives temporally consistent results. (Best viewed in color.)
Fig. 8: Examples of a refined flow field and warping results. (a) Top to bottom: A refined flow and its difference from the input optical flow. (b-c) Top to bottom: Video frames and hidden states at time  and , respectively. (d) A video frame and a hidden state aligned w.r.t. time  by warping using the refined flow. The refined flow captures structural details, particularly around moving objects, allowing our model to provide a sharp depth transition. It also aligns both video frames and hidden states well, making it possible for our model to give temporally consistent results without flickering artifacts. (Best viewed in color.)
(a) Cityscapes dataset
(b) NYU dataset
Fig. 9: Examples of depth predicted by our model on (a) the Cityscapes [Cordts2016Cityscapes] and (b) the NYU [Silberman:ECCV12] datasets. We apply our model trained on the KITTI dataset [eigen2014depth]. The examples demonstrate that our model performs well on images outside the training dataset.

We show in Fig. 5 an example of the TDT comparison between the state of the art and our models on the KITTI dataset [geiger2013vision]. Although Zhou et al. [zhou2017unsupervised] and Wang et al. [wang2018learning] use a video sequence as a supervisory signal, as we do, they do not consider temporal coherence in the video and thus produce temporally inconsistent results. Kuznietsov et al. [kuznietsov2017semi] and Fu et al. [fu2018deep] give results comparable to ours in terms of depth prediction accuracy, as shown in Table II, but their TDT scores are far from those of the ground truth. In contrast, our models produce temporally stable and consistent results, with lower errors than the state of the art. In Table III, we show the average TDT scores on the test split of [eigen2014depth] and compare our models with the state of the art in terms of temporal consistency. Numbers in bold indicate the best performance and underscored ones are the second best. Our method outperforms the state of the art, including [kuznietsov2017semi, fu2018deep], by a significant margin. For comparison, the scores computed with ground-truth depth are 0.712 for TDT, and 0.924, 0.982, 0.989 for TDT, TDT, TDT, respectively. To compute these scores, we interpolate sparse ground-truth depth maps and discard values in highly sparse regions (e.g., upper parts of images) using masks provided by [garg2016unsupervised]. Note that the superior temporal consistency of our method does not come from the use of ground-truth depth. The supervised learning approach [kuznietsov2017semi] shows much worse results than the unsupervised one [wang2018learning], indicating that using ground truth does not always give temporally consistent results.

IV-C2 Qualitative results

We show in Fig. 6 a visual comparison of depth prediction results on the KITTI dataset [eigen2014depth]. We can see that our models predict fine-grained depth (e.g., for distant objects and poles) and provide a sharp depth transition without artifacts. For comparison, the method of Fu et al. [fu2018deep] shows grid artifacts, which are often caused by dilated convolutions [yu2017dilated]. We can also see that our models are highly robust to occlusion compared to other methods. For example, they predict depth for occluded cars on the bottom left of images, whereas other methods struggle with such objects. Figure 7 visualizes pixel-wise TDT scores. We show temporal differences , weighted by the confidence map , between predicted depth maps. It shows that our model gives temporally consistent results, especially for regions having large displacements (e.g., traffic signs), resulting in fewer flickering artifacts.

IV-C3 Refined optical flow

In Fig. 8(a), we show an example of the refined flow field and its difference from the input flow. We can see that the flow refine network modifies the input flow, particularly around moving objects, making it possible to capture fine details while preserving edges and object boundaries. Our model uses the refined flow to align video frames and hidden states in the visual memory. We show video frames and hidden states at time  and in Figs. 8(b-c), respectively. Warping results w.r.t. time  using the refined flow are shown in Fig. 8(d). By comparing Figs. 8(c) and (d), we can see that the refined flow aligns both the video frame and the hidden state well, which enables our model to aggregate temporally aligned features and to prevent flickering artifacts.

IV-C4 Generalization to other datasets

We test our model trained on the KITTI dataset [eigen2014depth] on the Cityscapes [Cordts2016Cityscapes] and the NYU [Silberman:ECCV12] datasets to demonstrate its generalization ability. Examples shown in Fig. 9 demonstrate that our model generalizes well to images outside the training dataset. In particular, it infers both the geometric layout of a scene and object instances (e.g., cars and trees in Fig. 9(a) and a bed in Fig. 9(b)) well. Note that, for the Cityscapes and the NYU datasets, all previous works we are aware of (e.g., [zhou2017unsupervised, wang2018learning, godard2017unsupervised, kuznietsov2017semi, fu2018deep, yin2018geonet]) offer qualitative results only.

V Conclusion

We have presented a recurrent network for monocular depth prediction that gives temporally consistent results while preserving depth boundaries. In particular, we have introduced a flow-guided memory module that selectively retains hidden states aligned along motion trajectories, enforcing long-term temporal consistency in the prediction results. We have also presented a flow refine network that outputs dense flow fields specific to our task. We have shown that the refined flow aligns both video frames and hidden states, preventing flickering artifacts. We have demonstrated that our method outperforms the state of the art by a large margin in terms of temporal consistency, shows a good trade-off between depth prediction accuracy and runtime, and performs well on images outside the training datasets.