This is the UNOFFICIAL implementation of the ICCV 2019 paper 'Exploiting Temporal Consistency for Real-Time Video Depth Estimation'.
Accuracy of depth estimation from static images has been significantly improved recently by exploiting hierarchical features from deep convolutional neural networks (CNNs). Compared with static images, vast information exists among video frames and can be exploited to improve the depth estimation performance. In this work, we focus on exploring temporal information from monocular videos for depth estimation. Specifically, we take advantage of convolutional long short-term memory (CLSTM) and propose a novel spatial-temporal CLSTM (ST-CLSTM) structure. Our ST-CLSTM structure can capture not only the spatial features but also the temporal correlations/consistency among consecutive video frames with a negligible increase in computational cost. Additionally, in order to maintain the temporal consistency among the estimated depth frames, we apply the generative adversarial learning scheme and design a temporal consistency loss. The temporal consistency loss is combined with the spatial loss to update the model in an end-to-end fashion. By taking advantage of the temporal information, we build a video depth estimation framework that runs in real-time and generates visually pleasant results. Moreover, our approach is flexible and can be generalized to most existing depth estimation frameworks. Code is available at: https://tinyurl.com/STCLSTM
Exploiting temporal consistency for real-time video depth estimation (ICCV 2019) https://arxiv.org/abs/1908.03706
Benefiting from powerful convolutional neural networks (CNNs), some recent methods [1, 2, 3, 4, 5] have achieved outstanding performance on depth estimation from monocular static images. The success of these methods is based on deeply stacked network structures and large amounts of training data. For instance, the state-of-the-art depth estimation model DORN has more than one hundred convolution layers, and its high computational cost may hamper its use in practical applications. However, in some scenarios such as autonomous driving and robot navigation, estimating depths in real-time is required. Directly extending existing methods from static images to video sequences is not feasible because of the excessive computational cost. In addition, sequential frames that contain rich temporal information are usually provided in such scenarios, yet the existing methods fail to take this temporal information into consideration.
In this work, we exploit temporal information from videos by making use of the convolutional long short-term memory (CLSTM) and generative adversarial networks (GANs), and propose a real-time depth estimation framework. We illustrate our proposed framework in Fig. 1. It consists of three main parts: 1) a spatial feature extraction part; 2) a temporal correlation collection part; and 3) a spatial-temporal loss calculation part. The spatial feature extraction part and the temporal correlation collection part compose our novel spatial-temporal CLSTM (ST-CLSTM) structure. The spatial feature extraction part first takes as input continuous frames and outputs high-level features. The temporal correlation collection part then takes as input the high-level features and outputs depth estimations. With the cell and gate modules, the CLSTM can make use of the cues acquired from the previous frame to reason about the current frame, and thus encode the temporal information. As for the spatial-temporal loss calculation, we first calculate the spatial loss between the estimated and the ground-truth depths. In order to further enforce the temporal consistency, we design a new temporal loss by introducing a generative adversarial learning scheme. Specifically, we apply a 3D CNN as the discriminator, which takes as input the estimated and ground-truth depth sequences and outputs the temporal loss. The temporal loss is combined with the spatial loss and back-propagated through the entire framework to update the weights in an end-to-end fashion.
To summarize, our main contributions are as follows.
We propose a novel ST-CLSTM structure that is able to capture spatial features as well as temporal correlations for video depth estimation. To our knowledge, this is the first time that CLSTM is employed for video depth estimation.
We design a novel temporal consistency loss by using the generative adversarial learning scheme. Our temporal loss can further enforce the temporal consistency and improve the performance for video depth estimation.
Our proposed video depth estimation framework can execute in real-time and can be generalized to most existing depth estimation frameworks.
Recently, many deep learning based depth estimation methods have been proposed and have achieved significant results. To name a few, Eigen et al. employed a multi-scale neural network with two components to generate coarse estimations globally and refine the results locally. Xie et al. used shortcut connections in their network to fuse low-level and high-level features. Cao et al. proposed to formulate depth estimation as a classification problem instead of a regression problem. Laina et al. employed a reverse Huber loss to estimate depth distributions and an up-sampling module to overcome the low-resolution problem. Yin et al. designed a loss term to enforce geometric constraints. To further improve the performance, some methods incorporate conditional random fields [11, 12]. Recently, DORN proposed a spacing-increasing discretization (SID) policy and estimated depths with an ordinal regression loss. Although excellent performance has been achieved, the networks are deep and the computation is heavy.
Some other works focus on estimating depth values from videos. Zhou et al. proposed to use bundle adjustment as well as a super-resolution network to improve depth estimation. Specifically, the bundle adjustment is used to estimate depths and camera poses simultaneously, and the super-resolution network is used to recover details. Mahjourian et al. incorporated a 3D loss with geometric constraints to estimate depths and ego-motions simultaneously. In this work, we propose to estimate depths by exploiting temporal information from videos.
CLSTM in video analysis Recurrent neural networks (RNNs), especially long short-term memories (LSTMs), have achieved great success in various tasks such as language processing and speech recognition. With the memory cells, LSTMs can capture short- and long-term temporal dependencies. However, conventional LSTMs only take one-dimensional vectors as input and thus cannot be applied directly to image sequence processing.
To overcome this limitation, Shi et al. proposed the convolutional LSTM (CLSTM), which can capture long- and short-term temporal dependencies while retaining the ability to handle two-dimensional feature maps. Recently, CLSTMs have been used in video processing. Song et al. proposed a Deeper Bidirectional CLSTM (DB-CLSTM) structure which learns temporal characteristics in a cascaded and deeper way for video salient object detection. Liu et al. proposed a tree-structure based traversal method to model the 3D skeleton of a human being in the spatial-temporal domain. They applied CLSTM to handle the noise and occlusions in 3D skeleton data, which improves the temporal consistency of the results. Jiang et al. developed a two-layer ConvLSTM (2C-LSTM) to predict video saliency. An object-to-motion convolutional neural network has also been proposed.
GAN The generative adversarial network (GAN) has been an active research topic since it was proposed by Goodfellow et al. The basic idea of GAN is the training of two adversarial networks, a generator and a discriminator. During the process of adversarial training, both the generator and the discriminator become more robust. GANs have been widely used in various applications, such as image-to-image translation and synthetic data generation, and have mainly been used for generating images. Possibly one of the first works to apply adversarial training to structured output learning used a discriminator loss to distinguish predicted poses from ground-truth poses in pose estimation from monocular images. Recently, GANs have also been adopted in depth estimation. Almalioglu et al. employed a GAN to generate sharper and more accurate depth maps.
In this paper, we design a novel temporal loss by employing GAN. Our temporal loss can enforce the temporal consistency among video frames.
In this section, we elaborate on our proposed video depth estimation framework. We first introduce our ST-CLSTM structure; then we present our generative adversarial learning scheme and our spatial and temporal loss functions.
Our depth estimation framework contains three main components: spatial feature extraction; temporal correlation collection; and spatial-temporal loss calculation, as illustrated in Fig. 1.
Spatial feature extraction is the key to the performance and processing speed as it contains the majority of trainable parameters in our depth estimation framework. In our work, we use a modified version of the structure proposed by Hu et al.
We show the details of our spatial feature extraction network in Fig. 2. The network contains an encoder, a decoder and a multi-scale feature fusion module (MFF). The encoder can be any 2D CNN model, such as VGG-16, ResNet, or SENet, among many others. In order to build a real-time depth estimation framework, we apply a shallow ResNet-18 model instead of SENet-154 as the encoder.
The decoder employs four up-projection modules to increase the spatial resolution and decrease the number of channels of the feature maps. This encoder-decoder structure has been widely used in pixel-level tasks [28, 2]. The MFF module is designed to integrate features of different scales. Similar strategies have been used in prior work.
Note that, in our depth estimation framework, the spatial feature extraction network can be replaced by other depth estimation models. In other words, our proposed depth estimation framework can be applied to other state-of-the-art depth estimation methods with minimal modification.
As the input frames are continuous in the temporal dimension, taking the temporal correlations of these frames into consideration is intuitive and presumably helpful for improving depth estimation performance. Both the 3D CNN and the CLSTM are capable of achieving this goal. Here, we use the CLSTM, as it is more flexible than the 3D CNN for online inference. The structure of our proposed CLSTM is shown in Fig. 3 (b).
Fig. 3 (a) shows the traditional LSTM. The inputs and the outputs are vectors and the key operation is the Hadamard product. A single LSTM cell at time $t$ can be expressed as:
where $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions, and the two operator symbols represent the Hadamard product and pointwise multiplication.
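For reference, the standard LSTM cell is commonly written as below. This is the textbook formulation rather than a transcription of the original equation, so the symbol names are assumptions about the intended notation: $W_\ast$ and $b_\ast$ are weights and biases, $[x_t, h_{t-1}]$ is the concatenation of the current input and the previous hidden state, the weights act on it by matrix multiplication, and $\odot$ is the element-wise product used for gating.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[x_t, h_{t-1}] + b_f\right) \\
i_t &= \sigma\left(W_i\,[x_t, h_{t-1}] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C\,[x_t, h_{t-1}] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o\,[x_t, h_{t-1}] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

Here $f_t$, $i_t$ and $o_t$ are the forget, input and output gates, $C_t$ is the memory cell and $h_t$ is the hidden state.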
Compared with the traditional LSTM, our proposed CLSTM exhibits two main differences: 1) Operation. Following the original CLSTM formulation, we replace the Hadamard product in LSTM with convolution to handle the extracted 2D feature maps. 2) Structure. We adjust the structure of the CLSTM to suit the depth estimation task. Specifically, our proposed CLSTM cell can be expressed as:
where $*$ is the convolutional operator, and $W$ and $b$ denote the kernels and bias terms of the corresponding convolution layers. After we extract the spatial features of the video frames, we feed the feature map of the previous frame into a convolution layer to compress the number of channels from $c$ to 8. We then concatenate the result with the feature map of the current frame to form a feature map with $c + 8$ channels. Next, we feed the concatenated feature map to the CLSTM to update the information stored in the memory cell. Finally, we concatenate the information in the updated memory cell and the feature map of the output gate, then feed them to a refinement structure that consists of two convolution layers to obtain the final estimation result.
As shown in Fig. 1, the output of our ST-CLSTM is the estimated depth. We design two loss functions to train our ST-CLSTM model: a spatial loss to maintain the spatial features and a temporal loss to capture the temporal consistency.
Following prior work, we design a similar loss function as our spatial loss, which can be expressed as:
where the two weighting coefficients balance the terms. The spatial loss is composed of three terms. The depth term is applied to penalize inaccurate depth estimations. Most existing depth estimation methods simply apply the $\ell_1$ or $\ell_2$ loss. As has been pointed out, a problem with this type of loss is that its value tends to grow as the ground-truth depth gets larger. We therefore apply a logarithmic loss, which is expressed as:
Consequently, our depth term is defined as:
where $n$ is the number of pixels, and $d_i$ and $g_i$ are the estimated and ground-truth depths of pixel $i$, respectively.
The gradient term is designed to penalize the errors around edges. It is defined as:
where the two derivative operators represent the spatial derivatives along the $x$-axis and $y$-axis, respectively.
The last term is designed to measure the angle between two surface normals, and thus is sensitive to small depth structures. It is expressed as:
where $\langle \cdot, \cdot \rangle$ denotes the inner product of the two surface normals.
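The three terms described above can be sketched in plain Python on small nested-list depth maps. This is a minimal illustration under stated assumptions, not the authors' implementation: the logarithmic penalty is taken as `ln(x + ALPHA)` with `ALPHA = 0.5`, derivatives are forward differences, surface normals are built as `(-dz/dx, -dz/dy, 1)`, and the names `log_penalty`, `spatial_loss`, `lam` and `mu` are hypothetical.

```python
import math

ALPHA = 0.5  # assumed offset inside the logarithm; not stated in the text


def log_penalty(x, alpha=ALPHA):
    """Logarithmic penalty ln(x + alpha), which grows slowly for large errors."""
    return math.log(x + alpha)


def grad_x(img, r, c):
    """Forward difference along the x-axis (columns); zero at the border."""
    return img[r][c + 1] - img[r][c] if c + 1 < len(img[0]) else 0.0


def grad_y(img, r, c):
    """Forward difference along the y-axis (rows); zero at the border."""
    return img[r + 1][c] - img[r][c] if r + 1 < len(img) else 0.0


def spatial_loss(pred, gt, lam=1.0, mu=1.0):
    """Depth term + lam * gradient term + mu * normal term, averaged over pixels.

    lam and mu are the two weighting coefficients (default values assumed).
    """
    rows, cols = len(gt), len(gt[0])
    n = rows * cols
    err = [[pred[r][c] - gt[r][c] for c in range(cols)] for r in range(rows)]
    l_depth = l_grad = l_normal = 0.0
    for r in range(rows):
        for c in range(cols):
            # Depth term: penalize the per-pixel depth error.
            l_depth += log_penalty(abs(err[r][c]))
            # Gradient term: penalize derivatives of the error (edges).
            l_grad += log_penalty(abs(grad_x(err, r, c)))
            l_grad += log_penalty(abs(grad_y(err, r, c)))
            # Normal term: 1 - cosine of the angle between surface normals.
            nd = (-grad_x(pred, r, c), -grad_y(pred, r, c), 1.0)
            ng = (-grad_x(gt, r, c), -grad_y(gt, r, c), 1.0)
            dot = sum(a * b for a, b in zip(nd, ng))
            norm = math.sqrt(sum(a * a for a in nd)) * math.sqrt(sum(b * b for b in ng))
            l_normal += 1.0 - dot / norm
    return (l_depth + lam * l_grad + mu * l_normal) / n
```

With this choice of penalty, a perfect prediction yields a negative constant rather than zero, because `ln(0 + 0.5) < 0`; only differences between loss values are meaningful.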
Our proposed ST-CLSTM is able to exploit the temporal correlations among consecutive video frames. In order to further enforce the consistency among frames, we apply the generative adversarial learning scheme and design a temporal consistency loss. Specifically, after our ST-CLSTM produces depth estimations, we introduce a three-dimensional convolutional neural network (3D CNN) which takes as input the estimated depth sequence and outputs a score. This score represents the probability that the depth sequence comes from our ST-CLSTM rather than from the ground truth. The 3D CNN thus acts as a discriminator. We train the discriminator by maximizing the probability of assigning the correct label to both the estimated and ground-truth depth sequences. Our ST-CLSTM acts as the generator. The discriminator tries to distinguish the generator’s output (labelled as ‘fake’) from the ground-truth depth sequence (labelled as ‘real’). Upon convergence, we expect the generator’s output to be as close as possible to the ground truth so as to confuse the discriminator. While training the discriminator, we train the generator simultaneously. The objective of our generative adversarial learning is expressed as follows:
where the two arguments are the input RGB frames and the ground-truth depth frames, drawn from the distributions of input RGB frames and ground-truth depths, respectively.
Since our discriminator is a binary classifier, we train it using the cross entropy loss. The cross entropy loss then acts as our temporal loss function. During the training of our ST-CLSTM, we combine our temporal loss with the aforementioned spatial loss as follows:
where is a weighting coefficient. We empirically set it to .
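The interplay of the discriminator loss, the temporal loss and the combined generator loss can be sketched as follows. This is a hedged illustration on scalar scores, not the authors' code: the function names are hypothetical, the score is read as the probability that a sequence was generated (as described above), and `weight` is left as a required argument because its value is not given here.

```python
import math

FAKE, REAL = 1, 0  # assumption: the score is read as P(sequence was generated)


def bce(prob, label, eps=1e-12):
    """Binary cross entropy for one probability and one 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(score_generated, score_ground_truth):
    """Low when the discriminator labels both sequences correctly."""
    return bce(score_generated, FAKE) + bce(score_ground_truth, REAL)


def temporal_loss(score_generated):
    """Generator side: low when its output is scored as 'real'."""
    return bce(score_generated, REAL)


def generator_loss(spatial, score_generated, weight):
    """Spatial loss plus the weighted temporal (adversarial) loss."""
    return spatial + weight * temporal_loss(score_generated)
```

For example, `temporal_loss(0.1)` is smaller than `temporal_loss(0.9)`, since a generated sequence that the discriminator scores as unlikely to be fake has succeeded in confusing it.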
The detailed structure of our 3D CNN is illustrated in Fig. 4. It is composed of 4 convolution blocks, a global average pooling layer and a fully-connected layer. Each convolution block contains a 3D convolution layer, followed by a batch normalization layer, a ReLU layer and a max pooling layer. The first 3D convolution layer and all the max pooling layers have a stride of 2. In practice, as plotted in Fig. 4, our 3D CNN takes as input concatenated RGB and depth frames to enforce the consistency between each video frame and the corresponding depth. In order to increase the robustness of our discriminator, we randomly mix ground-truth depth frames into the generated depth sequences with a certain probability.
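The frame-mixing step mentioned above can be sketched as a per-frame random substitution. This is one plausible reading, given as an assumption: each generated frame is independently replaced by its ground-truth counterpart with probability `p`. The function name and the value of `p` are hypothetical, since the probability is not stated.

```python
import random


def mix_ground_truth(pred_frames, gt_frames, p, rng=random):
    """Replace each predicted frame by its ground truth with probability p.

    pred_frames and gt_frames are equal-length sequences of frames; the
    frames themselves can be any objects (arrays, tensors, lists).
    """
    if len(pred_frames) != len(gt_frames):
        raise ValueError("sequences must have the same length")
    return [g if rng.random() < p else d for d, g in zip(pred_frames, gt_frames)]
```

Passing a seeded `random.Random` instance as `rng` makes the mixing reproducible across runs.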
Note that the adversarial training here is mainly intended to enforce temporal consistency, rather than to improve single-frame depth accuracy as in previous GAN-based depth estimation.
In this section, we evaluate our proposed depth estimation framework on the indoor NYU Depth V2 dataset and the outdoor KITTI dataset, and compare against a few existing depth estimation approaches.
NYU Depth V2 contains 464 videos taken from indoor scenes. We apply the same train/test split as in Eigen et al., which contains 249 videos for training, and 654 samples from the remaining 215 videos for testing. During training, we resize the image from to and then crop patches of for training.
KITTI contains 61 outdoor video scenes captured by cameras and depth sensors mounted on a driving car. We apply the same train/test split as in Eigen et al., which contains 32 videos for training, and 697 samples from the remaining 29 videos for testing. During training, we randomly crop patches of size from the original images as inputs.
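Spatial Metrics We evaluate the performance of our framework using the commonly applied metrics defined as follows: 1) Mean relative error (Rel): $\frac{1}{N}\sum_{i}\frac{|d_i - g_i|}{g_i}$; 2) Root mean squared error (RMS): $\sqrt{\frac{1}{N}\sum_{i}(d_i - g_i)^2}$; 3) Mean $\log_{10}$ error (log10): $\frac{1}{N}\sum_{i}|\log_{10} d_i - \log_{10} g_i|$; 4) Accuracy with threshold $t$: percentage of $d_i$ such that $\max\left(\frac{d_i}{g_i}, \frac{g_i}{d_i}\right) < t$. $N$ denotes the total number of pixels. $d_i$ and $g_i$ are the estimated and ground-truth depths of pixel $i$, respectively.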
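The four spatial metrics can be computed as in the following sketch, which operates on flat lists of positive depth values. The function name is hypothetical, and the default thresholds 1.25, 1.25² and 1.25³ are the values conventionally used in the depth estimation literature, assumed here rather than taken from the text.

```python
import math


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return Rel, RMS, log10 error and threshold accuracies.

    pred and gt are equal-length sequences of positive depths.
    """
    n = len(gt)
    pairs = list(zip(pred, gt))
    rel = sum(abs(d - g) / g for d, g in pairs) / n
    rms = math.sqrt(sum((d - g) ** 2 for d, g in pairs) / n)
    log10 = sum(abs(math.log10(d) - math.log10(g)) for d, g in pairs) / n
    # Accuracy: fraction of pixels whose ratio max(d/g, g/d) is below t.
    ratios = [max(d / g, g / d) for d, g in pairs]
    acc = [sum(r < t for r in ratios) / n for t in thresholds]
    return {"rel": rel, "rms": rms, "log10": log10, "acc": acc}
```

Lower is better for Rel, RMS and log10, while higher is better for the threshold accuracies.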
Temporal Metrics Maintaining temporal consistency means keeping the changes and motions among adjacent frames of estimation results consistent with that of corresponding ground truths. In order to quantitatively evaluate the temporal consistency, we introduce two metrics: temporal change consistency (TCC) and temporal motion consistency (TMC). They are defined as:
We train our proposed framework for 20 epochs. The initial learning rate of the ST-CLSTM is set to 0.0001 and is decreased by a factor of 0.1 after every five epochs. Our spatial feature extraction network in the ST-CLSTM is pretrained on the ImageNet dataset. As for our 3D CNN, the initial learning rate is set to 0.1 for the NYU Depth V2 dataset and 0.01 for the KITTI dataset. The parameters of our 3D CNN are randomly initialized. During the generative adversarial training, before we start to update our 3D CNN parameters, we first train our ST-CLSTM for one epoch on the NYU Depth V2 dataset, and for two epochs on the KITTI dataset, to make sure that our ST-CLSTM is able to generate plausible depth estimations.
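The learning rate schedule of the ST-CLSTM described above is a step decay, which can be written in a few lines. The function name is hypothetical, and the zero-based epoch index is an assumption about how epochs are counted.

```python
def step_decay_lr(epoch, base_lr=1e-4, factor=0.1, step=5):
    """Learning rate at a zero-based epoch index under step decay.

    The rate starts at base_lr and is multiplied by factor after
    every `step` epochs.
    """
    return base_lr * factor ** (epoch // step)
```

Over the 20 training epochs this gives 1e-4 for epochs 0 to 4, 1e-5 for epochs 5 to 9, 1e-6 for epochs 10 to 14 and 1e-7 for epochs 15 to 19, up to floating-point rounding.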
Following prior work, we employ three data augmentation methods: 1) randomly flipping the RGB image and depth map horizontally with a probability of 50%; 2) rotating the RGB image and depth map by a random angle; 3) scaling the brightness, contrast and saturation values of the RGB image by a random ratio.
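The key point of the geometric augmentations is that the RGB image and the depth map must be transformed together, so that pixels stay aligned. A minimal sketch for the horizontal flip, using nested lists of rows as stand-ins for images (the function name is hypothetical):

```python
import random


def random_hflip(rgb, depth, p=0.5, rng=random):
    """Flip an RGB image and its depth map together with probability p.

    Both inputs are lists of rows; a single random draw decides the flip
    for both, so the pair stays pixel-aligned.
    """
    if rng.random() < p:
        rgb = [row[::-1] for row in rgb]
        depth = [row[::-1] for row in depth]
    return rgb, depth
```

The photometric augmentation in step 3, by contrast, is applied to the RGB image only, since brightness or contrast changes do not alter scene depth.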
The ST-CLSTM is the key component in our proposed depth estimation framework as it captures both spatial and temporal information. In this section, we evaluate the performance of our ST-CLSTM on both indoor and outdoor datasets. The results are reported in Table 1. We denote the baseline approach that captures no temporal information as 2DCNN. Specifically, we replace the CLSTM in our ST-CLSTM structure with 3 convolution layers. The numbers of channels are 128, 128 and 1, respectively. Since the temporal information exists among consecutive frames, the number of input frames influences the performance of our ST-CLSTM. We first evaluate the performance of our ST-CLSTM on the NYU Depth V2 dataset with different numbers of input frames and show the results in the first 4 rows of Table 1. We can see that as the number of frames increases, the performance improves, as our ST-CLSTM captures more temporal information. We use 5 input frames in our experiments in consideration of the computation cost.
We can see from Table 1 that our ST-CLSTM is able to capture the temporal information and improve the depth estimation performance on both indoor and outdoor datasets.
In this section, we evaluate the performance of our generative adversarial learning scheme which further enforces the temporal consistency among video frames. The evaluation results on the NYU Depth V2 and the KITTI dataset are reported in Table 2. For each dataset, we show the results of our ST-CLSTM without and with generative adversarial learning, denoted as ST-CLSTM and GAN respectively. We can see from Table 2 that our generative adversarial learning and temporal loss can enforce the temporal consistency and further improve the performance of our ST-CLSTM.
The major contribution of our work is to exploit temporal information for accurate depth estimation. The aforementioned experiments have revealed that our proposed ST-CLSTM and generative adversarial learning scheme are able to better capture the temporal information and improve the depth estimation performance. In this section, we show the improvement of our proposed framework in the temporal dimension with both visual effects and temporal consistency metrics.
We show the estimated depths of four consecutive frames with one frame gap between each frame in Fig. 5. We first show the RGB frames and the ground-truth depth maps in the first two rows, then we show the depth estimations of the baseline method (2DCNN) and our proposed framework in the last three rows.
We highlight a foreground area and a background area in blue and red dotted windows respectively, and we magnify the blue dotted window for better visualization. Since the four frames are consecutive, the ground-truth depths in these four frames change smoothly. However, the baseline method fails to maintain the smoothness, and its estimated depths vary considerably. Our ST-CLSTM captures the temporal correlations and produces visually better results, as demonstrated in Fig. 5. For all the frames, the edges of objects are sharper and the backgrounds are smoother. With our proposed generative adversarial learning scheme, the temporal consistency is enforced and the performance is further improved. The details are well maintained in all the frames, for instance, the bars of the chair in the red dotted window. (Readers may refer to the demonstration video: https://youtu.be/B705k8nunLU)
A 3D CNN can capture the change and motion information between consecutive frames, as it convolves the input along both the spatial and temporal dimensions. To confuse the 3D CNN discriminator, the changes and motions of the estimation results must remain consistent with those of the corresponding ground truths. We sampled 654 sequences from the test set with a length of 16 frames each and report the average TCC and TMC in Table 3, from which we can see that the 3D CNN discriminator not only improves the estimation accuracy, but also better enforces the temporal consistency.
Table 4 (NYU Depth V2)
|Method||Rel||RMS||log10||δ<1.25||δ<1.25²||δ<1.25³||Backbone|
|Liu et al.||0.335||1.060||0.127||-||-||-||-|
|Li et al.||0.232||0.821||0.094||0.621||0.886||0.968||-|
|Liu et al.||0.230||0.824||0.095||0.614||0.883||0.971||-|
|Wang et al.||0.220||0.824||-||0.605||0.890||0.970||-|
|Liu et al.||0.213||0.759||0.087||0.650||0.906||0.976||-|
|Eigen et al.||0.158||0.641||-||0.769||0.950||0.988||-|
|Chakrabarti et al.||0.149||0.620||-||0.806||0.958||0.987||VGG-19|
|Li et al.||0.143||0.635||0.063||0.788||0.958||0.991||VGG-16|
|Ma & Karaman||0.143||-||-||0.810||0.959||0.989||ResNet-50|
|Laina et al.||0.127||0.573||0.055||0.811||0.953||0.988||ResNet-50|
Table 5 (KITTI)
|Method||Rel||RMS||log10||δ<1.25||δ<1.25²||δ<1.25³||Backbone|
|Eigen et al.||0.190||7.156||-||0.692||0.899||0.967||-|
|Liu et al.||0.217||6.986||-||0.647||0.882||0.961||-|
|Kuznietsov et al.||0.113||4.621||-||0.862||0.960||0.986||ResNet-50|
|Mahjourian et al.||0.159||5.912||-||0.784||0.923||0.970||DispNet|
|Zhou et al.||0.143||5.370||-||0.824||0.937||0.974||VGG-19|
In this section, we evaluate our approach on the NYU Depth V2 dataset and the KITTI dataset and compare with some state-of-the-art results. The results are reported in Table 4 and Table 5 respectively. We can see that with our captured temporal information, we outperform most state-of-the-art methods which often use more complicated network structures. The aim of our work is to exploit temporal information for real-time depth estimation. We apply a shallow ResNet18 model as our backbone. The performance of our approach can be improved with deeper backbone networks. We leave this as future work.
|Model||Dataload||Time (ms per frame)||Speed (fps)|
One of the contributions of our work here is that our model can execute in real-time for practical applications. In this section, we evaluate the processing time of our model. Specifically, we feed our model videos with spatial resolution of . We test 600 frames for five epochs and report the mean values. We load the videos in two different ways: 1) Serial mode (S-mode). We load the video frames one by one. 2) Parallel+serial mode (PS-mode). We feed 120 frames to our spatial extraction network at one time to obtain the spatial features, then we feed the spatial features to our CLSTM one by one.
We implement our model in PyTorch, and perform the inference on a computer with 8GB RAM, an Intel i7-4790 CPU and a GTX1080Ti GPU. We report the processing time of one frame and the frame rate in Table 6. We can see that, compared with the baseline (2D CNN) method, our ST-CLSTM method shows a negligible drop in processing speed. Moreover, when we adopt the PS-mode for data loading, our processing speed increases dramatically. As the frame rate of common video formats is less than 30 fps, our model is sufficiently fast to work in real-time.
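The difference between the two loading modes, and the conversion between the two columns of Table 6, can be sketched as follows. This is an illustration only: `extract` stands for a hypothetical batched spatial feature extractor (a list of frames in, a list of features out) and `clstm_step` for a hypothetical recurrent step returning a depth and the updated state; neither is the authors' API.

```python
def run_serial(frames, extract, clstm_step, state=None):
    """S-mode: extract features and update the CLSTM one frame at a time."""
    depths = []
    for frame in frames:
        feature = extract([frame])[0]  # batch of size one
        depth, state = clstm_step(feature, state)
        depths.append(depth)
    return depths


def run_parallel_serial(frames, extract, clstm_step, chunk=120, state=None):
    """PS-mode: batch the feature extraction, keep the CLSTM sequential."""
    depths = []
    for start in range(0, len(frames), chunk):
        features = extract(frames[start:start + chunk])  # one batched call
        for feature in features:
            depth, state = clstm_step(feature, state)
            depths.append(depth)
    return depths


def fps(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame
```

Both modes produce the same depths, because the recurrent state is updated in the same frame order; the PS-mode only amortizes the cost of the spatial network over a batch. The trade-off is that a whole chunk of frames must be available before processing starts.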
In this work, we have proposed a novel ST-CLSTM structure by combining a shallow 2D CNN and a CLSTM. Our ST-CLSTM is able to capture both spatial features and temporal correlations among video frames for depth estimation. We have also designed a novel temporal loss by introducing the generative adversarial learning scheme. Our temporal loss is able to further enforce temporal consistencies among video frames. Experiments on benchmark indoor and outdoor datasets reveal that our proposed framework can effectively capture temporal information and achieve outstanding performance. Moreover, our proposed framework is able to execute in real-time for real-world applications, and can be easily generalized to most existing depth estimation frameworks.
Acknowledgments We would like to thank Huawei Technologies for the donation of GPU cloud computing resources. This work was in part supported by the National Natural Science Foundation of China (61871460, 61876152), Fundamental Research Funds for the Central Universities (3102019ghxm016) and Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University (CX201816).
R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5667–5675, 2018.
X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. Advances in Neural Inf. Process. Syst., pp. 802–810, 2015.
P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1125–1134, 2017.
B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1119–1127, 2015.