ST-CLSTM
This is the UNOFFICIAL implementation of the ICCV 2019 paper 'Exploiting Temporal Consistency for Real-Time Video Depth Estimation'.
The accuracy of depth estimation from static images has improved significantly in recent years by exploiting hierarchical features from deep convolutional neural networks (CNNs). Compared with static images, video frames carry vast additional information that can be exploited to improve depth estimation performance. In this work, we focus on exploring temporal information from monocular videos for depth estimation. Specifically, we take advantage of convolutional long short-term memory (CLSTM) and propose a novel spatial-temporal CLSTM (ST-CLSTM) structure. Our ST-CLSTM structure can capture not only the spatial features but also the temporal correlations/consistency among consecutive video frames with a negligible increase in computational cost. Additionally, in order to maintain the temporal consistency among the estimated depth frames, we apply the generative adversarial learning scheme and design a temporal consistency loss. The temporal consistency loss is combined with the spatial loss to update the model in an end-to-end fashion. By taking advantage of the temporal information, we build a video depth estimation framework that runs in real time and generates visually pleasant results. Moreover, our approach is flexible and can be generalized to most existing depth estimation frameworks. Code is available at: https://tinyurl.com/STCLSTM
Exploiting temporal consistency for real-time video depth estimation (ICCV 2019) https://arxiv.org/abs/1908.03706
Benefiting from powerful convolutional neural networks (CNNs), some recent methods [1, 2, 3, 4, 5] have achieved outstanding performance on depth estimation from monocular static images. The success of these methods is based on deeply stacked network structures and large amounts of training data. For instance, the state-of-the-art depth estimation model DORN [2] has more than one hundred convolution layers, and its high computational cost may hamper its use in practical applications. However, in some scenarios such as autonomous driving [6] and robot navigation [7], depth must be estimated in real time. Directly extending existing methods from static images to video sequences is not feasible because of the excessive computational cost. In addition, such scenarios usually provide sequential frames that contain rich temporal information, which existing methods fail to take into consideration.
In this work, we exploit temporal information from videos by making use of convolutional long short-term memory (CLSTM) and generative adversarial networks (GANs), and propose a real-time depth estimation framework. We illustrate our proposed framework in Fig. 1. It consists of three main parts: 1) a spatial feature extraction part; 2) a temporal correlation collection part; and 3) a spatial-temporal loss calculation part. The spatial feature extraction part and the temporal correlation collection part compose our novel spatial-temporal CLSTM (ST-CLSTM) structure. The spatial feature extraction part first takes as input continuous frames and outputs high-level features. The temporal correlation collection part then takes as input the high-level features and outputs depth estimations. With the cell and gate modules, the CLSTM can make use of the cues acquired from the previous frame to reason about the current frame, and thus encode the temporal information. As for the spatial-temporal loss calculation, we first calculate the spatial loss between the estimated and the ground-truth depths. In order to further enforce the temporal consistency, we design a new temporal loss by introducing a generative adversarial learning scheme. Specifically, we apply a 3D CNN as the discriminator, which takes as input the estimated and ground-truth depth sequences and outputs the temporal loss. The temporal loss is combined with the spatial loss and back-propagated through the entire framework to update the weights in an end-to-end fashion.

To summarize, our main contributions are as follows.
We propose a novel ST-CLSTM structure that is able to capture spatial features as well as temporal correlations for video depth estimation. To our knowledge, this is the first time that CLSTM is employed for video depth estimation.
We design a novel temporal consistency loss by using the generative adversarial learning scheme. Our temporal loss can further enforce the temporal consistency and improve the performance for video depth estimation.
Our proposed video depth estimation framework can execute in real time and can be generalized to most existing depth estimation frameworks.
Depth estimation
Recently, many deep-learning-based depth estimation methods have been proposed and have achieved significant results. To name a few, Eigen et al. [4] employed a multi-scale neural network with two components to generate coarse estimations globally and refine the results locally. Xie et al. [8] used shortcut connections in their network to fuse low-level and high-level features. Cao et al. [9] proposed to formulate depth estimation as a classification problem instead of a regression problem. Laina et al. [5] employed a reverse Huber loss to estimate depth distributions and an up-sampling module to overcome the low-resolution problem. Yin et al. [10] designed a loss term to enforce geometric constraints. To further improve the performance, some methods incorporate conditional random fields [11, 12]. Recently, DORN [2] proposed a spacing-increasing discretization (SID) policy and estimated depths with an ordinal regression loss. Although excellent performance has been achieved, the networks are deep and the computation is heavy.

Some other works focus on estimating depth values from videos. Zhou et al. [1] proposed to use bundle adjustment as well as a super-resolution network to improve depth estimation. Specifically, the bundle adjustment is used to estimate depths and camera poses simultaneously, and the super-resolution network is used to recover details. Mahjourian et al. [3] incorporated a 3D loss with geometric constraints to estimate depths and ego-motions simultaneously. In this work, we propose to estimate depths by exploiting temporal information from videos.

CLSTM in video analysis
Recurrent neural networks (RNNs), especially long short-term memories (LSTMs), have achieved great success in various tasks such as language processing [13] and speech recognition [14]. With their memory cells, LSTMs can capture short- and long-term temporal dependencies. However, conventional LSTMs only take one-dimensional vectors as input and thus cannot be applied to image sequence processing.
To overcome this limitation, Shi et al. [15] proposed the convolutional LSTM (CLSTM), which can capture long- and short-term temporal dependencies while retaining the ability to handle two-dimensional feature maps. Recently, CLSTMs have been used in video processing. In [16], Song et al. proposed a Deeper Bidirectional CLSTM (DB-CLSTM) structure which learns temporal characteristics in a cascaded and deeper way for video salient object detection. Liu et al. [17] proposed a tree-structure based traversal method to model the 3D skeleton of a human being in the spatial-temporal domain. They applied CLSTM to handle the noise and occlusions in 3D skeleton data, which improves the temporal consistency of the results. Jiang et al. [18] developed a two-layer ConvLSTM (2C-LSTM) to predict video saliency. An object-to-motion convolutional neural network has also been proposed.
GAN
The generative adversarial network (GAN) has been an active research topic since it was proposed by Goodfellow et al. in [19]. The basic idea of a GAN is to train two adversarial networks, a generator and a discriminator. During the process of adversarial training, both the generator and the discriminator become more robust. GANs have been widely used in various applications, such as image-to-image translation [20] and synthetic data generation [21]. GANs have mainly been used for generating images. One of the first works to apply adversarial training to improve structured output learning might be [22], where a discriminator loss is used to distinguish predicted poses from ground-truth poses for pose estimation from monocular images. Recently, GANs have also been adopted in depth estimation. In [23], Almalioglu et al. employed a GAN to generate sharper and more accurate depth maps.

In this paper, we design a novel temporal loss by employing a GAN. Our temporal loss can enforce the temporal consistency among video frames.
In this section, we elaborate on our proposed video depth estimation framework. We first introduce our ST-CLSTM structure; then we present our generative adversarial learning scheme and our spatial and temporal loss functions.
Our depth estimation framework contains three main components: spatial feature extraction; temporal correlation collection; and spatial-temporal loss calculation, as illustrated in Fig. 1.
Spatial feature extraction is the key to the performance and processing speed, as it contains the majority of trainable parameters in our depth estimation framework. In our work, we use a modified version of the structure proposed by Hu et al. [24].
We show the details of our spatial feature extraction network in Fig. 2. The network contains an encoder, a decoder and a multi-scale feature fusion module (MFF). The encoder can be any 2D CNN model, such as VGG-16 [25], ResNet [26], or SENet [27], among many others. In order to build a real-time depth estimation framework, we apply a shallow ResNet-18 model instead of SENet-154 as the encoder.
The decoder employs four up-projection modules to improve the spatial resolution and decrease the number of channels of the feature maps. This encoder-decoder structure has been widely used in pixel-level tasks [28, 2]. The MFF module is designed to integrate features of different scales. Similar strategies are used in [29].
Note that, in our depth estimation framework, the spatial feature extraction network can be replaced by other depth estimation models. In other words, our proposed depth estimation framework can be applied to other state-of-the-art depth estimation methods with minimal modification.
As the input frames are continuous in the temporal dimension, taking the temporal correlations of these frames into consideration is intuitive and presumably helpful for improving depth estimation performance. Both the 3D CNN and the CLSTM are capable of achieving this goal. Here, we use the CLSTM, as it is more flexible than the 3D CNN for online inference. The structure of our proposed CLSTM is shown in Fig. 3 (b).
Fig. 3 (a) shows the traditional LSTM. The inputs and the outputs are vectors and the key operation is the Hadamard product. A single LSTM cell at time $t$ can be expressed as:
(1)  
where $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions, and the remaining operators represent the Hadamard product and pointwise multiplication.

Compared with the traditional LSTM, our proposed CLSTM exhibits two main differences: 1) Operation. Following [15], we replace the Hadamard product in LSTM with convolution to handle the extracted 2D feature maps. 2) Structure. We adjust the structure of the CLSTM to deal with the depth estimation task. Specifically, our proposed CLSTM cell can be expressed as:
(2)  
where the operator in Eq. (2) denotes convolution, and the remaining parameters are the kernels and bias terms at the corresponding convolution layers. After we extract the spatial features of the video frames, we feed the feature map of the previous frame into a convolution layer to compress its number of channels to 8. Then we concatenate the compressed map with the feature map of the current frame to form a combined feature map. Next, we feed the concatenated feature map to the CLSTM to update the information stored in the memory cell. Finally, we concatenate the information in the updated memory cell with the feature map of the output gate, then feed them to a refinement structure that consists of two convolution layers to obtain the final estimation result.
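For orientation, the generic convolutional LSTM cell that this structure builds on can be written as below. This is a sketch of the widely used simplified form of the cell from [15] (without peephole terms), not a reconstruction of Eq. (2); the symbols $X_t$, $H_t$, $C_t$, $W$ and $b$ are assumptions chosen for illustration, with $*$ denoting convolution and $\odot$ the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
\tilde{C}_t &= \tanh\!\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
H_t &= o_t \odot \tanh\!\left(C_t\right)
\end{aligned}
```

The cell described in the text differs from this generic form in its inputs and outputs: it consumes the concatenation of the compressed previous-frame features and the current-frame features, and it passes the updated memory cell together with the output gate to a two-layer refinement structure.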
As shown in Fig. 1, the output of our ST-CLSTM is the estimated depth. We design two loss functions to train our ST-CLSTM model: a spatial loss to maintain the spatial features and a temporal loss to capture the temporal consistency.
We follow [24] and design a similar loss function as our spatial loss, which can be expressed as:
(3) 
where the two scalars are weighting coefficients. The loss is composed of three terms. The depth term is applied to penalize inaccurate depth estimations. Most existing depth estimation methods simply apply the $\ell_1$ or $\ell_2$ loss. As pointed out in [30], a problem with this type of loss is that its value tends to grow as the ground-truth depth gets larger. We therefore apply a logarithmic loss, which is expressed as:
(4) 
Consequently, our depth term is defined as:
(5) 
where $n$ is the number of pixels, and $d_i$ and $g_i$ are the estimated and ground-truth depths of pixel $i$, respectively.
The gradient term is designed to penalize the errors around edges. It is defined as:
(6)  
where the two operators represent the spatial derivatives along the $x$ axis and the $y$ axis, respectively.
The last term is designed to measure the angle between two surface normals, and thus is sensitive to small depth structures. It is expressed as:
(7) 
where the two vectors are the surface normals of the estimated and ground-truth depth maps, and $\langle \cdot, \cdot \rangle$ denotes the inner product.
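The three terms described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the log penalty has the form F(x) = ln(x + alpha) as in the loss family of [24], that normals are built as (-dx, -dy, 1), and that derivatives are simple forward differences. The names `log_penalty`, `spatial_loss`, `lam`, `mu` and `alpha`, and their default values, are hypothetical.

```python
import math


def log_penalty(x, alpha=0.5):
    """Assumed log penalty F(x) = ln(x + alpha); alpha is a hypothetical constant."""
    return math.log(x + alpha)


def grad_x(img):
    """Forward difference along x (columns); the last column gets a zero gradient."""
    return [[row[min(j + 1, len(row) - 1)] - row[j] for j in range(len(row))]
            for row in img]


def grad_y(img):
    """Forward difference along y (rows); the last row gets a zero gradient."""
    h = len(img)
    return [[img[min(i + 1, h - 1)][j] - img[i][j] for j in range(len(img[0]))]
            for i in range(h)]


def spatial_loss(pred, gt, lam=1.0, mu=1.0, alpha=0.5):
    """Depth term + lam * gradient term + mu * normal term, averaged over pixels."""
    h, w = len(pred), len(pred[0])
    n = h * w
    err = [[pred[i][j] - gt[i][j] for j in range(w)] for i in range(h)]
    ex, ey = grad_x(err), grad_y(err)
    px, py = grad_x(pred), grad_y(pred)
    gx, gy = grad_x(gt), grad_y(gt)
    l_depth = l_grad = l_normal = 0.0
    for i in range(h):
        for j in range(w):
            # Depth term: log penalty on the absolute depth error.
            l_depth += log_penalty(abs(err[i][j]), alpha)
            # Gradient term: log penalty on the error's spatial derivatives.
            l_grad += (log_penalty(abs(ex[i][j]), alpha)
                       + log_penalty(abs(ey[i][j]), alpha))
            # Normal term: one minus the cosine between the two surface normals.
            nd = (-px[i][j], -py[i][j], 1.0)
            ng = (-gx[i][j], -gy[i][j], 1.0)
            dot = sum(a * b for a, b in zip(nd, ng))
            norm = (math.sqrt(sum(a * a for a in nd))
                    * math.sqrt(sum(b * b for b in ng)))
            l_normal += 1.0 - dot / norm
    return (l_depth + lam * l_grad + mu * l_normal) / n
```

Note that with this assumed penalty a perfect prediction yields ln(alpha) per log term rather than zero, so the loss value is only meaningful relative to other predictions.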
Our proposed ST-CLSTM is able to exploit the temporal correlations among consecutive video frames. In order to further enforce the consistency among frames, we apply the generative adversarial learning scheme and design a temporal consistency loss. Specifically, after our ST-CLSTM produces depth estimations, we introduce a three-dimensional convolutional neural network (3D CNN) which takes as input the estimated depth sequence and outputs a score. This score represents the probability that the depth sequence comes from our ST-CLSTM rather than from the ground truths. The 3D CNN thus acts as a discriminator. We train the discriminator by maximizing the probability of assigning the correct label to both the estimated and ground-truth depth sequences. Our ST-CLSTM acts as the generator. The discriminator tries to distinguish the generator's output (labelled as 'fake') from the ground-truth depth sequence (labelled as 'real'). Upon convergence, we want the generator's output to appear as close as possible to the ground truth so as to confuse the discriminator. We train the generator and the discriminator simultaneously. The objective of our generative adversarial learning is expressed as follows:
(8)  
where the generator takes the input RGB frames and the real samples are the ground-truth depth frames, drawn from the distributions of input RGB frames and ground-truth depths, respectively.
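For reference, the standard adversarial objective of [19], adapted to this setting, has the following shape. This is a sketch, not a reconstruction of Eq. (8): the symbols $x$ (input RGB frames), $y$ (ground-truth depth frames), $p_{\mathrm{rgb}}$, $p_{\mathrm{gt}}$, $G$ (generator) and $D$ (discriminator) are assumptions, and $D$ is taken here to output the probability that its input is real, which is the complement of the score described in the text.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{y \sim p_{\mathrm{gt}}}\!\left[\log D(y)\right]
+ \mathbb{E}_{x \sim p_{\mathrm{rgb}}}\!\left[\log\!\left(1 - D\!\left(G(x)\right)\right)\right]
```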
Since our discriminator is a binary classifier, we train it using the cross-entropy loss. The cross-entropy loss then acts as our temporal loss function. During the training of our ST-CLSTM, we combine our temporal loss with the aforementioned spatial loss as follows:
(9) 
where the temporal loss is scaled by a weighting coefficient, which we set empirically.
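The way a cross-entropy temporal loss can be combined with the spatial loss is sketched below. It assumes the combination is a weighted sum and that the discriminator score is the probability of the sequence being real; the function names are hypothetical, and `weight` is left as a required argument because its value is not recoverable from the text.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(score_real, score_fake):
    """The discriminator should label ground truth as 1 and estimates as 0."""
    return bce(score_real, 1.0) + bce(score_fake, 0.0)


def generator_loss(spatial_loss_value, score_fake, weight):
    """Spatial loss plus a weighted temporal loss.

    The temporal loss is small when the discriminator mistakes the
    estimated depth sequence for a real one.
    """
    temporal_loss = bce(score_fake, 1.0)
    return spatial_loss_value + weight * temporal_loss
```

For example, `generator_loss(1.0, 0.5, 0.1)` adds one tenth of the cross-entropy of an undecided discriminator to a spatial loss of 1.0; the weight 0.1 is purely illustrative.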
The detailed structure of our 3D CNN is illustrated in Fig. 4. It is composed of 4 convolution blocks, a global average pooling layer and a fully-connected layer. Each convolution block contains a 3D convolution layer, followed by a batch normalization layer, a ReLU layer and a max pooling layer. The first 3D convolution layer and all the max pooling layers have a stride of 2. In practice, as plotted in Fig. 4, our 3D CNN takes as input concatenated RGB and depth frames to enforce the consistency between each video frame and the corresponding depth. In order to increase the robustness of our discriminator, we randomly mix some ground-truth depth frames into our generated input depth sequences with a certain probability.

Note that the adversarial training here is mainly to enforce temporal consistency, instead of improving the depth accuracy of a single frame as in [31].
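The frame-mixing step mentioned above could look like the following sketch. It assumes each frame of the generated sequence is replaced independently by its ground-truth counterpart; the function name `mix_ground_truth` and the parameter `p_mix` are hypothetical, and the probability actually used is not stated in the text.

```python
import random


def mix_ground_truth(est_frames, gt_frames, p_mix, rng=random):
    """Replace each estimated frame by its ground-truth frame with probability p_mix."""
    if len(est_frames) != len(gt_frames):
        raise ValueError("sequences must have the same length")
    return [gt if rng.random() < p_mix else est
            for est, gt in zip(est_frames, gt_frames)]
```

Passing a seeded `random.Random` instance as `rng` makes the mixing reproducible across runs.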
In this section, we evaluate our proposed depth estimation framework on the indoor NYU Depth V2 dataset and the outdoor KITTI dataset, and compare against a few existing depth estimation approaches.
NYU Depth V2 contains 464 videos taken from indoor scenes. We apply the same train/test split as in Eigen et al. [4], which contains 249 videos for training, and 654 samples from the remaining 215 videos for testing. During training, we resize each image and then crop patches from it.
KITTI contains 61 outdoor video scenes captured by cameras and depth sensors mounted on a driving car. We apply the same train/test split as in Eigen et al. [4], which contains 32 videos for training, and 697 samples from the remaining 29 videos for testing. During training, we randomly crop patches from the original images as inputs.
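Spatial Metrics We evaluate the performance of our framework using the commonly applied metrics defined as follows: 1) Mean relative error (Rel): $\frac{1}{n}\sum_{i}\frac{|d_i - g_i|}{g_i}$; 2) Root mean squared error (RMS): $\sqrt{\frac{1}{n}\sum_{i}(d_i - g_i)^2}$; 3) Mean $\log_{10}$ error (log10): $\frac{1}{n}\sum_{i}|\log_{10} d_i - \log_{10} g_i|$; 4) Accuracy with threshold $t$: the percentage of $d_i$ such that $\max(d_i / g_i,\, g_i / d_i) < t$. Here $n$ denotes the total number of pixels, and $d_i$ and $g_i$ are the estimated and ground-truth depths of pixel $i$, respectively.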
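The four spatial metrics can be computed as in the sketch below, which uses their standard definitions from the depth estimation literature. The function name `depth_metrics` and the default thresholds 1.25, 1.25² and 1.25³ are assumptions (the thresholds are the ones customarily reported on these benchmarks); inputs are flat lists of positive depth values.

```python
import math


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return Rel, RMS, log10 and threshold accuracies for paired depth values."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(pred)
    pairs = list(zip(pred, gt))
    # Mean relative error.
    rel = sum(abs(d - g) / g for d, g in pairs) / n
    # Root mean squared error.
    rms = math.sqrt(sum((d - g) ** 2 for d, g in pairs) / n)
    # Mean absolute error in log10 space.
    log10 = sum(abs(math.log10(d) - math.log10(g)) for d, g in pairs) / n
    # Fraction of pixels whose depth ratio stays below each threshold.
    ratios = [max(d / g, g / d) for d, g in pairs]
    acc = [sum(r < t for r in ratios) / n for t in thresholds]
    return {"rel": rel, "rms": rms, "log10": log10, "acc": acc}
```

Lower is better for Rel, RMS and log10, while higher is better for the threshold accuracies.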
Temporal Metrics Maintaining temporal consistency means keeping the changes and motions among adjacent frames of the estimation results consistent with those of the corresponding ground truths. In order to quantitatively evaluate the temporal consistency, we introduce two metrics: temporal change consistency (TCC) and temporal motion consistency (TMC). They are defined as:
(10)  
We train our proposed framework for 20 epochs. The initial learning rate of the ST-CLSTM is set to 0.0001 and decreased by a factor of 0.1 after every five epochs. Our spatial feature extraction network in the ST-CLSTM is pretrained on the ImageNet dataset. As for our 3D CNN, the initial learning rate is set to 0.1 for the NYU Depth V2 dataset and 0.01 for the KITTI dataset. The parameters of our 3D CNN are randomly initialized. During the generative adversarial training, before we start to update our 3D CNN parameters, we first train our ST-CLSTM for one epoch on the NYU Depth V2 dataset, and two epochs on the KITTI dataset, to make sure that our ST-CLSTM is able to generate plausible depth estimations.
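The schedule described above can be written down as two small helpers. This is a sketch that assumes zero-indexed epochs and reads "decreased by a factor of 0.1" as multiplying the rate by 0.1 every five epochs; the function names are hypothetical.

```python
def generator_lr(epoch, base_lr=1e-4, gamma=0.1, step=5):
    """Step decay: multiply the learning rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def discriminator_is_updated(epoch, warmup_epochs):
    """The 3D CNN is only updated once the generator warm-up epochs have passed."""
    return epoch >= warmup_epochs


if __name__ == "__main__":
    # Warm-up of 1 epoch corresponds to the NYU Depth V2 setting in the text.
    for epoch in range(20):
        print(epoch, generator_lr(epoch), discriminator_is_updated(epoch, 1))
```

Under these assumptions the ST-CLSTM learning rate is 1e-4 for epochs 0 to 4, 1e-5 for epochs 5 to 9, and so on over the 20 training epochs.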
Following [24], we employ three data augmentation methods: 1) randomly flipping the RGB image and depth map horizontally with a probability of 50%; 2) rotating the RGB image and depth map by a random angle; 3) scaling the brightness, contrast and saturation values of the RGB image by a random ratio.
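The key constraint in the geometric augmentations is that the RGB image and its depth map must receive the same transform. A minimal sketch of the horizontal flip on nested-list images follows; the function name `random_hflip` is hypothetical, and the rotation and color scaling steps are omitted because their ranges are not recoverable from the text.

```python
import random


def random_hflip(rgb, depth, p=0.5, rng=random):
    """Flip an RGB image and its depth map together with probability p.

    Both inputs are lists of rows; one random draw decides the flip for
    both, so the pixels stay aligned.
    """
    if rng.random() < p:
        rgb = [row[::-1] for row in rgb]
        depth = [row[::-1] for row in depth]
    return rgb, depth
```

Drawing the random number once and applying the decision to both arrays is what keeps the depth supervision valid after augmentation.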
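Table 1: Depth estimation results with different numbers of input frames (#).

| Dataset | # | Model | Rel | RMS | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| NYU Depth V2 | 1 | 2D-CNN | 0.139 | 0.585 | 0.059 | 0.819 | 0.961 | 0.990 |
| NYU Depth V2 | 3 | ST-CLSTM | 0.134 | 0.581 | 0.058 | 0.824 | 0.965 | 0.991 |
| NYU Depth V2 | 4 | ST-CLSTM | 0.133 | 0.577 | 0.057 | 0.831 | 0.963 | 0.990 |
| NYU Depth V2 | 5 | ST-CLSTM | 0.132 | 0.572 | 0.057 | 0.833 | 0.966 | 0.991 |
| KITTI | 1 | 2D-CNN | 0.111 | 4.385 | 0.048 | 0.871 | 0.962 | 0.987 |
| KITTI | 5 | ST-CLSTM | 0.104 | 4.139 | 0.045 | 0.883 | 0.967 | 0.988 |
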
The ST-CLSTM is the key component in our proposed depth estimation framework, as it captures both spatial and temporal information. In this section, we evaluate the performance of our ST-CLSTM on both indoor and outdoor datasets. The results are reported in Table 1. We denote the baseline approach that captures no temporal information as 2D-CNN. Specifically, we replace the CLSTM in our ST-CLSTM structure with 3 convolution layers, whose numbers of channels are 128, 128 and 1 respectively. Since the temporal information exists among consecutive frames, the number of input frames influences the performance of our ST-CLSTM. We first evaluate the performance of our ST-CLSTM on the NYU Depth V2 dataset with different numbers of input frames and show the results in the first 4 rows of Table 1. We can see that as the number of frames increases, the performance increases, as our ST-CLSTM captures more temporal information. We use 5 input frames in our experiments considering the computation cost.
We can see from Table 1 that our ST-CLSTM is able to capture the temporal information and improve the depth estimation performance on both indoor and outdoor datasets.
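Table 2: Results without (ST-CLSTM) and with (GAN) generative adversarial learning.

| Dataset | Model | Rel | RMS | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|
| NYU Depth V2 | ST-CLSTM | 0.132 | 0.572 | 0.057 | 0.833 | 0.966 | 0.991 |
| NYU Depth V2 | GAN | 0.131 | 0.571 | 0.056 | 0.833 | 0.965 | 0.991 |
| KITTI | ST-CLSTM | 0.104 | 4.139 | 0.045 | 0.883 | 0.967 | 0.988 |
| KITTI | GAN | 0.101 | 4.137 | 0.043 | 0.890 | 0.970 | 0.989 |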

In this section, we evaluate the performance of our generative adversarial learning scheme, which further enforces the temporal consistency among video frames. The evaluation results on the NYU Depth V2 and the KITTI datasets are reported in Table 2. For each dataset, we show the results of our ST-CLSTM without and with generative adversarial learning, denoted as ST-CLSTM and GAN respectively. We can see from Table 2 that our generative adversarial learning and temporal loss can enforce the temporal consistency and further improve the performance of our ST-CLSTM.
The major contribution of our work is to exploit temporal information for accurate depth estimation. The aforementioned experiments have revealed that our proposed ST-CLSTM and generative adversarial learning scheme are able to better capture the temporal information and improve the depth estimation performance. In this section, we show the improvement of our proposed framework in the temporal dimension with both visual effects and temporal consistency metrics.
We show the estimated depths of four consecutive frames with one frame gap between each frame in Fig. 5. We first show the RGB frames and the ground-truth depth maps in the first two rows, then we show the depth estimations of the baseline method (2D-CNN) and our proposed framework in the last three rows.
We highlight a foreground area and a background area in blue and red dotted windows respectively, and we enlarge the blue dotted window for better visualization. Since the four frames are consecutive, the ground-truth depths in these four frames change smoothly. However, the baseline method fails to maintain the smoothness, and its estimated depths vary considerably. Our ST-CLSTM captures the temporal correlations and produces visually better results, as demonstrated in Fig. 5. For all the frames, the edges of objects are sharper and the backgrounds are smoother. With our proposed generative adversarial learning scheme, the temporal consistency is enforced and the performance is further improved. The details are well maintained in all the frames, for instance the bars of the chair in the red dotted window. (Readers may refer to the demonstration video: https://youtu.be/B705k8nunLU)
A 3D CNN can capture the change and motion information between consecutive frames, as it convolves the input along both the spatial and temporal dimensions. To confuse the 3D CNN discriminator, the change and motion of the estimation results must stay consistent with those of the corresponding ground truths. We sampled 654 sequences from the test set with a length of 16 frames each and report the average TCC and TMC in Table 3, from which we can see that the 3D CNN discriminator not only improves the estimation accuracy, but also better enforces the temporal consistency.
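Table 3: Spatial metrics and temporal consistency metrics (TCC and TMC).

| Model | Rel | RMS | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ | TCC | TMC |
|---|---|---|---|---|---|---|---|---|
| Baseline | 0.139 | 0.585 | 0.059 | 0.819 | 0.961 | 0.990 | 0.846 | 0.956 |
| ST-CLSTM | 0.132 | 0.572 | 0.057 | 0.833 | 0.966 | 0.991 | 0.866 | 0.962 |
| 3D-GAN | 0.131 | 0.571 | 0.056 | 0.833 | 0.965 | 0.991 | 0.870 | 0.965 |
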
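
Table 4: Comparison with state-of-the-art methods on the NYU Depth V2 dataset. A dash marks a value that is not reported.

| Method | Rel | RMS | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Backbone |
|---|---|---|---|---|---|---|---|
| DepthTransfer [34] | 0.350 | 1.200 | 0.131 | - | - | - | - |
| Make3D [35] | 0.349 | 1.214 | - | 0.447 | 0.745 | 0.897 | - |
| Liu et al. [36] | 0.335 | 1.060 | 0.127 | - | - | - | - |
| Li et al. [37] | 0.232 | 0.821 | 0.094 | 0.621 | 0.886 | 0.968 | - |
| Liu et al. [38] | 0.230 | 0.824 | 0.095 | 0.614 | 0.883 | 0.971 | - |
| Wang et al. [11] | 0.220 | 0.824 | - | 0.605 | 0.890 | 0.970 | - |
| Liu et al. [12] | 0.213 | 0.759 | 0.087 | 0.650 | 0.906 | 0.976 | - |
| Eigen et al. [39] | 0.158 | 0.641 | - | 0.769 | 0.950 | 0.988 | - |
| Chakrabarti et al. [40] | 0.149 | 0.620 | - | 0.806 | 0.958 | 0.987 | VGG-19 |
| Li et al. [41] | 0.143 | 0.635 | 0.063 | 0.788 | 0.958 | 0.991 | VGG-16 |
| Ma & Karaman [42] | 0.143 | - | - | 0.810 | 0.959 | 0.989 | ResNet-50 |
| Laina et al. [5] | 0.127 | 0.573 | 0.055 | 0.811 | 0.953 | 0.988 | ResNet-50 |
| Padnet [43] | 0.120 | 0.582 | 0.055 | 0.817 | 0.954 | 0.987 | ResNet-50 |
| DORN [2] | 0.115 | 0.509 | 0.051 | 0.828 | 0.965 | 0.992 | ResNet-101 |
| Ours | 0.131 | 0.571 | 0.056 | 0.833 | 0.965 | 0.991 | ResNet-18 |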

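Table 5: Comparison with state-of-the-art methods on the KITTI dataset. A dash marks a value that is not reported.

| Method | Rel | RMS | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Backbone |
|---|---|---|---|---|---|---|---|
| Make3D [35] | 0.280 | 8.734 | - | 0.601 | 0.820 | 0.926 | - |
| Eigen et al. [4] | 0.190 | 7.156 | - | 0.692 | 0.899 | 0.967 | - |
| Liu et al. [12] | 0.217 | 6.986 | - | 0.647 | 0.882 | 0.961 | - |
| LRC [44] | 0.114 | 4.935 | - | 0.861 | 0.949 | 0.976 | ResNet-50 |
| Kuznietsov et al. [45] | 0.113 | 4.621 | - | 0.862 | 0.960 | 0.986 | ResNet-50 |
| Mahjourian et al. [3] | 0.159 | 5.912 | - | 0.784 | 0.923 | 0.970 | DispNet [46] |
| Zhou et al. [1] | 0.143 | 5.370 | - | 0.824 | 0.937 | 0.974 | VGG-19 |
| Ours | 0.101 | 4.137 | 0.043 | 0.890 | 0.970 | 0.989 | ResNet-18 |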

In this section, we evaluate our approach on the NYU Depth V2 dataset and the KITTI dataset and compare with some state-of-the-art results. The results are reported in Table 4 and Table 5 respectively. We can see that with our captured temporal information, we outperform most state-of-the-art methods, which often use more complicated network structures. The aim of our work is to exploit temporal information for real-time depth estimation, so we apply a shallow ResNet-18 model as our backbone. The performance of our approach can be improved with deeper backbone networks. We leave this as future work.
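Table 6: Processing time and frame rate.

| Model | Data loading | Time (ms per frame) | Speed (fps) |
|---|---|---|---|
| Baseline | S-mode | 28.90 | 34.60 |
| ST-CLSTM | S-mode | 30.22 | 33.09 |
| ST-CLSTM | PS-mode | 5.72 | 174.83 |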

One of the contributions of our work is that our model can execute in real time for practical applications. In this section, we evaluate the processing time of our model. Specifically, we feed videos to our model, test 600 frames for five epochs, and report the mean values. We load the videos in two different ways: 1) Serial mode (S-mode). We load the video frames one by one. 2) Parallel+serial mode (PS-mode). We feed 120 frames to our spatial extraction network at one time to obtain the spatial features, then we feed the spatial features to our CLSTM one by one.
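The difference between the two loading modes can be sketched as follows. This is an illustration only: `spatial_net` and `clstm_step` are hypothetical callables standing in for the spatial feature extraction network (which maps a batch of frames to a batch of features) and one recurrent CLSTM step (which maps a feature and a state to a depth map and a new state).

```python
def run_s_mode(frames, spatial_net, clstm_step, state=None):
    """Serial mode: extract features and run the recurrent step frame by frame."""
    outputs = []
    for frame in frames:
        feat = spatial_net([frame])[0]  # batch of size one
        depth, state = clstm_step(feat, state)
        outputs.append(depth)
    return outputs


def run_ps_mode(frames, spatial_net, clstm_step, chunk=120, state=None):
    """Parallel+serial mode: batch the feature extraction, keep the CLSTM sequential."""
    outputs = []
    for start in range(0, len(frames), chunk):
        # One batched call replaces `chunk` single-frame calls.
        feats = spatial_net(frames[start:start + chunk])
        for feat in feats:
            depth, state = clstm_step(feat, state)
            outputs.append(depth)
    return outputs
```

Both functions produce the same sequence of outputs; PS-mode only changes how often the feature extractor is invoked, which is where the batching gain comes from.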
We implement our model with PyTorch [47], and perform the inference on a computer with 8GB RAM, an Intel i7-4790 CPU and a GTX 1080 Ti GPU. We report the processing time of one frame and the frame rate in Table 6. We can see that, compared with the baseline (2D-CNN) method, our ST-CLSTM method shows a negligible drop in processing speed. Moreover, when we adopt the PS-mode for data loading, our processing speed increases dramatically. As the frame rate of common video formats is less than 30 fps, our model is sufficiently fast to work in real time.

In this work, we have proposed a novel ST-CLSTM structure by combining a shallow 2D CNN and a CLSTM. Our ST-CLSTM is able to capture both spatial features and temporal correlations among video frames for depth estimation. We have also designed a novel temporal loss by introducing the generative adversarial learning scheme. Our temporal loss is able to further enforce temporal consistency among video frames. Experiments on benchmark indoor and outdoor datasets reveal that our proposed framework can effectively capture temporal information and achieve outstanding performance. Moreover, our proposed framework is able to execute in real time for real-world applications, and can be easily generalized to most existing depth estimation frameworks.
Acknowledgments We would like to thank Huawei Technologies for the donation of GPU cloud computing resources. This work was in part supported by the National Natural Science Foundation of China (61871460, 61876152), Fundamental Research Funds for the Central Universities (3102019ghxm016) and Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University (CX201816).
[3] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5667–5675, 2018.
[15] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. Advances in Neural Inf. Process. Syst., pp. 802–810, 2015.
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1125–1134, 2017.
[37] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1119–1127, 2015.