Future semantic segmentation of time-lapsed videos with large temporal displacement

12/27/2018 ∙ by Talha Siddiqui, et al.

An important aspect of video understanding is the ability to predict the evolution of its content in the future. This paper presents a future frame semantic segmentation technique for predicting semantic masks of the current and future frames in a time-lapsed video. We specifically focus on time-lapsed videos with large temporal displacement to highlight the model's ability to capture large motions in time. We first introduce a unique semantic segmentation prediction dataset with over 120,000 time-lapsed sky-video frames and all corresponding semantic masks captured over a span of five years in the North America region. The dataset has immense practical value for cloud cover analysis, where clouds are treated as non-rigid objects of interest; it provides both semantic segmentation of the cloud regions and the solar irradiance emitted from the sky, derived from the sky-videos. Next, our proposed recurrent network architecture departs from the existing trend of using temporal convolutional networks (TCN) (or feed-forward networks) by explicitly learning an internal representation for the evolution of video content with time. Experimental evaluation shows an improvement of mean IoU over TCNs in the segmentation task by 10.8%. The model simultaneously measures both the current and future solar irradiance from the same video frames with a normalized-MAE of 10.5%. The results indicate that recurrent memory networks with an attention mechanism are able to capture the complex advective and diffused flow characteristic of dense fluids even with sparse temporal sampling, and are more suitable for future frame prediction tasks in longer-duration videos.








To translate the significant progress made by the community in image semantic segmentation into better video understanding, memory networks are being employed to encode video content. Much like with humans, memory networks allow systems to efficiently encode the higher-order information present in videos, such as the semantic structure across video frames. An interesting direction of research for enabling deep neural networks to encode this spatio-temporal semantic structure in videos is future frame prediction

[Luc et al.2017, Mathieu, Couprie, and LeCun2015, Srivastava, Mansimov, and Salakhudinov2015]. Pursuing future frame prediction in videos allows us to build models that construct an internal representation of a video capturing the evolution of its content over time. Such a representation provides both pixel-level understanding, which can be resolved into per-frame semantic segmentation, and frame-level property measurement, for both current and future frames.

This paper generalizes the problem of video future frame semantic segmentation prediction to time-lapsed videos with large temporal displacement between frames. Further, a secondary task of measuring a frame-level property related to the content of the video is used as an additional constraint. Our framework jointly learns the dual objective and implicitly localizes the pixel-level contribution of the measurement. Lastly, we show empirical results for the specific application of predicting cloud regions from time-lapsed sky-videos and measuring solar irradiance.

Figure 1:

Time-lapsed videos of meteorological phenomena captured for recreational and scientific observations. Estimating likely future weather patterns from such time-lapsed videos can influence project planning, leading to social and economic benefits.

Related Work

Semantic segmentation techniques for an image, or pixel-wise labeling, have made large strides beginning with the fully-convolutional augmentation of convolutional networks [Long, Shelhamer, and Darrell2015] and their variants with dilated or à trous convolutions [Yu and Koltun2015]. In order to replicate similar success in video understanding, the computer vision community has embraced memory (recurrent) networks and, more recently, temporal convolutional networks (TCN) [Bai, Kolter, and Koltun2018] in various interesting ways.

We first position our work by briefly reviewing the recent flavours of memory networks used to enhance video representations:

Encoding short video sequences was initially viewed as the next step in extending the performance of image understanding to videos. Early approaches used two-tier architectures [Karpathy et al.2014] to compute features at different scales. Soon the encoder-decoder framework was favoured [Venugopalan et al.2014], first by aggregation of frame representations and later with attention-based approaches and 3D convolutional operations [Yao et al.2015, Tran et al.2015]. [Klein, Wolf, and Afek2015] present a dynamic convolution approach to predict short-term weather from radar imaging sequences. However, a sufficiently robust representation of video frames becomes challenging with increasing length and complexity of context. [Vondrick, Pirsiavash, and Torralba2015] approach anticipating the likely labels of future frames without flow.

Unsupervised next frame prediction was introduced by [Srivastava, Mansimov, and Salakhudinov2015], which produces video representations using LSTMs in the encoder-decoder framework. The reconstruction error is minimized in a composite encoder-decoder model that simultaneously predicts now and future frames. [Shi et al.2015] extend the framework by introducing spatially constrained, or convolutional, LSTMs. The authors view the memory units as hidden layers and convert the input-state transitions to convolution operations. These operations induce spatial structure in the memory units that is lacking in LSTMs and can help encode evolving content in videos. Recently, [Hou and Wu2018] utilize mid-level video frame encodings to capture semantic concepts in action recognition.

Flow estimation based on recurrent networks is computed implicitly and is often intertwined with next frame prediction. Video encoder-decoder performance may improve with an additional penalty, added to the overall loss, on gradient smoothness parameters [Patraucean, Handa, and Cipolla2015]. Explicit optical flow approaches [Weinzaepfel et al.2013, Ilg et al.2016, Feichtenhofer, Pinz, and Zisserman2016, Hur and Roth2017] compute dense flow vectors for video understanding that track the motion of every pixel in the image. Such approaches are also shown to improve related tasks, such as occlusion detection, when exploited jointly.

[Feichtenhofer, Pinz, and Zisserman2017] use explicit-flow model architectures, with two consecutive frames as simultaneous inputs, to improve both tracking and object detection in videos.

Recurrent attention addresses the aspect of semantic segmentation that identifies individual instances of a class. While the approaches presented in the literature are evaluated on images, they use memory networks for sequentially predicting objects. Graphical models have been replaced in favour of recurrent networks with attention feedback, termed recurrent attention. [Romera-Paredes and Torr2016] use a convolutional LSTM (ConvLSTM), while [Ren and Zemel2016] later relax the spatial constraint for instance segmentation to capture disconnected instances of a given semantic class. Recently, [Piergiovanni, Fan, and Ryoo2017] use temporal attention filters to identify latent sub-events in videos.

Temporal convolutional networks (TCN) [Bai, Kolter, and Koltun2018] refer to convolutional network architectures that are utilized for temporal prediction tasks such as next frame prediction. Unsupervised reconstruction of videos produces next frames, but with noise and blurring; [Mathieu, Couprie, and LeCun2015] replace the squared-error loss with an adversarial training method and show improvement in reconstruction. [Luc et al.2016] first propose using adversarial training for semantic segmentation. They next extend the work [Luc et al.2017] to propose future frame and semantic segmentation prediction in the context of the automated driving problem. Their approach uses convolutional filters with auto-regression for very short-term forecasting of future semantic masks. While the approach has shown excellent results on the Cityscapes dataset [Cordts et al.2016], the prediction task in the dataset is limited to 0.5 seconds of future prediction, which may have limited practical use in automated driving applications. [Jin et al.2017a] propose video scene parsing by using predictive feature learning and prediction steering parsing. Their predictive feature learning architecture learns to extract spatiotemporal features by enforcing the model to predict a future frame in the sequence. They extend their work [Jin et al.2017b] and propose simultaneous scene parsing and optical flow for future video frames. They state that capturing motion dynamics and predicting semantic masks are correlated problems that benefit from each other. However, to the best of our knowledge, there is no recent work which illustrates the performance of spatiotemporal memory networks aided with spatial attention in predicting future frame semantic masks. Recently, [Miller and Hardt2018] argue that the performance of recurrent networks can often be matched simply with feed-forward networks applied in an auto-regressive fashion. Here, we show the merit of memory networks over auto-regression in time-lapsed videos with large temporal displacement.

Figure 2: [Best viewed in colour] The proposed future video frame semantic segmentation approach. For each video frame, a representation tensor is obtained using ResNet with dilating filters. This tensor is used in the now model to predict both the current semantic mask, with upsampling, and irradiance, with convolutions and batchnorm+ReLU. The future model is a 3-tier ConvLSTM over the frame representations of the previous frames. Further, spatial attention is applied on the frame representations to enhance performance. (Details in Section Model.)

Key Contributions

The key contributions of this research can be summarized as follows:

  • We propose a novel future semantic mask prediction framework for simultaneously performing two collegial tasks, namely segment and measure, that often occur together in various applications of video understanding.

  • We propose a memory network based approach for future frame segmentation in time-lapsed videos of weather phenomenon. We show that the performance of memory networks can be considerably amplified with spatial attention models and a dual objective.

  • The time-lapsed sky-video dataset introduced in this paper represents, in our view, a stronger challenge for the semantic segmentation prediction problem by relaxing two assumptions: object rigidity and temporal continuity. The dataset captures cloud motion at a frame rate of only 6 frames per hour.

  • Our proposed approach out-performs state-of-the-art semantic pixel-wise labeling approaches designed for short-term videos on the sky-video dataset. The result makes a case for ConvLSTMs with attention as a middle ground between unsupervised recurrent nets and temporal convolutional feed-forward networks.


Model

We present a future video-frame semantic segmentation approach designed for time-lapsed videos of weather phenomena. The proposed end-to-end trainable approach performs two tasks: segment provides a semantic segmentation (or pixel-wise labeling) of the objects of interest in a current or future frame, and measure regresses and also forecasts a frame-level property related to the content of the video. Problems in recent literature, such as object detection and semantic segmentation prediction in videos [Luc et al.2017], partial object counting [Seguí, Pujol, and Vitria2015, Chattopadhyay et al.2016], and video question answering [Zeng et al.2017], can be viewed within our framework.

We first describe the architecture of the model followed by our utilization of the spatial attention models to enhance the future prediction. We further describe the co-dependency in the architecture of the segment and measure components and argue that our architecture choice enables a spatially constrained representation to encode partial-contributions of the measured property from all localized regions of a given video frame.


Notationally, given a video containing a sequence of frames, we propose a framework to determine a model, composed of a now component and a future component, that produces both frame-wise semantic segmentation of the classes of interest and their corresponding measurements, where the measurement is a scalar predicted value per frame. The now and future predictions for a look-back period are obtained from these two components respectively, as described below.


The future model is built over an intermediary representation of an input video frame obtained from the now model. Further, the model is learnt over a training set with ground-truth semantic segmentation masks and the corresponding scalar property measurements.


An overview of the proposed architecture for segment and measure for both now and future video-frame prediction is illustrated in Fig. 2. We describe the important details of the architecture that are relevant to the analysis, deferring remaining details of the model, such as layer parameters, to the Supplementary.

Now model: Our approach utilizes ResNet50 [He et al.2016] as the front-end model. We further augment the last convolutional layer with dilation [Yu and Koltun2015] to obtain the fully convolutional layer that is used as the representation tensor. Dilated convolutions are used to capture features at multiple scales without adding additional layers. All convolutional blocks are interlaced with batch-norm and non-linearity, and bi-linear up-sampling is used to merge the skip-connections.

The representation tensor obtained from the front-end model is bifurcated into two branches. The segment branch is upsampled to the semantic segmentation mask resolution, and pixel label assignment is performed with softmax. The measure branch, in turn, passes through convolutional up-sampling layers, again interlaced with batch-norm and non-linearity, and the predicted measure is aggregated over a dense connection, with dropout, to a single value.
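The bifurcation can be sketched in numpy as below; the tensor shapes, projection weights and pooling are illustrative assumptions, not the paper's exact configuration (upsampling, batch-norm and dropout are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_branch_head(rep, w_seg, w_meas):
    """Bifurcate a coarse H x W x C representation tensor into the segment
    and measure branches (upsampling, batch-norm and dropout omitted)."""
    # segment branch: 1x1 projection to class logits, per-pixel softmax labels
    seg_probs = softmax(rep @ w_seg)          # (H, W, n_classes)
    # measure branch: spatial pooling followed by a dense reduction to a scalar
    measure = float(rep.mean(axis=(0, 1)) @ w_meas)
    return seg_probs, measure
```

Both outputs derive from the same spatially structured tensor, which is what later lets the measure be read as a sum of local contributions.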

Future model: To perform future frame prediction, we use a recurrent architecture to predict a future representation tensor by utilizing only the representation tensors from historical frames.¹ In order to preserve the spatial correspondence of the representation, we utilize convolutional LSTMs (ConvLSTM) [Shi et al.2015], whose memory gates are convolutional operations that preserve the spatial and temporal structure while inducing memory into the architecture.

¹We use the term tensors rather than vectors to indicate their multi-dimensional nature.

Specifically, multiple tiers of ConvLSTM are used over a look-back period to construct the future representations. We also observe that the performance of the stacked ConvLSTMs is significantly improved with the addition of spatial attention mechanisms, described next.

Future prediction with spatial soft-Attention

Our architecture uses stacked convolutional LSTMs to predict the semantic representations of future frames from a look-back period. Attention is computed over the spatial, temporal and class-label dimensions of these representations to induce more structure in the predictions, which can be particularly affected by the large temporal displacements in time-lapsed videos.

Spatial attention is assigned to a given pixel over all the values in the look-back period, for the given and neighbouring pixels, with a simple convolutional operation over an additional set of convolutional filters learnt in the overall architecture. The filters are oversampled and reduced with a densely connected operation to obtain the attention mask. The mask is further normalized with softmax and multiplied with the original samples. This attention model utilizes learned convolutional operations over the entire training batch, learning attention without explicit temporal structure or memory. We also compute softmax attention [Olah and Carter2016] averaged over all the pixels and use it for baseline comparison, which we term mean attention.
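A minimal numpy sketch of this idea follows; the 1x1 scoring filter is a simplifying assumption (the paper's learned filters span a spatial neighbourhood and are reduced through a dense layer), so this illustrates only the score-softmax-reweight pattern:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_soft_attention(reps, score_filter):
    """Score each pixel of each historical representation with a learned
    projection (here a 1x1 filter for simplicity), softmax-normalize the
    mask over the look-back dimension, and multiply it with the samples."""
    scores = np.einsum('thwc,c->thw', reps, score_filter)  # (T, H, W)
    mask = softmax(scores, axis=0)      # normalized over the look-back period
    return reps * mask[..., None]       # re-weighted representations (T, H, W, C)
```

The softmax over the look-back axis means each pixel distributes unit attention across the historical frames.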


As shown in Fig. 2, the model's end-to-end trainable architecture performs both semantic segmentation prediction and measure prediction for both the now and future problems. The loss function is a weighted mixture of the losses from the two tasks: the combined segmentation loss (Eq. 4) and the measurement loss, with a normalizing weight chosen to equally favour the predictions of segment and measure.


The combined loss is a combination of focal loss [Lin et al.2017] and categorical cross-entropy measured for the semantic segmentation over every pixel. The focal loss is given by Eq. 5, with tuneable parameters set to control the order of magnitude. The combined loss penalizes incorrect classification of pixel semantic labels, with a larger focus on harder predictions, such as cloud pixels. We observe a small improvement in segmentation compared to pixel-wise categorical cross-entropy alone.
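The per-pixel combination can be sketched as follows; alpha, gamma and the mixing weight lam are illustrative defaults rather than the paper's tuned values:

```python
import numpy as np

def combined_seg_loss(probs, onehot, alpha=0.25, gamma=2.0, lam=0.5):
    """Focal loss (Lin et al. 2017) plus categorical cross-entropy over
    per-pixel class probabilities; hyper-parameters are placeholders."""
    p_t = np.clip((probs * onehot).sum(axis=-1), 1e-8, 1.0)  # true-class prob
    ce = -np.log(p_t)                                  # cross-entropy per pixel
    focal = alpha * (1.0 - p_t) ** gamma * ce          # down-weights easy pixels
    return float(np.mean(lam * focal + (1.0 - lam) * ce))
```

The `(1 - p_t)**gamma` factor is what shifts the focus onto hard pixels such as cloud boundaries.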


To compute the prediction errors of the frame-level measurement, in this case the irradiance forecast, the smooth Huber loss is used (Eq. 6). We prefer the hyperbolic (pseudo-Huber) version for its smoothness around the transition point. Empirically, predictions at low irradiance values are less important.
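The smooth (pseudo-)Huber loss referenced here has the standard hyperbolic form; the transition scale `delta` below is a placeholder, as the paper's exact setting is not restated:

```python
import numpy as np

def pseudo_huber(y_true, y_pred, delta=1.0):
    """Smooth (pseudo-)Huber loss: approximately quadratic for residuals
    much smaller than `delta` and linear for much larger residuals."""
    r = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.mean(delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)))
```

The quadratic-to-linear transition keeps gradients bounded for large irradiance errors while remaining smooth everywhere.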

Figure 3: An illustration of the bifurcation point of the model. Note that the bifurcated representation is a tensor spatially consistent with the image rather than a flat vector.

Segment with partial Measure

A unique feature of our architecture is the strong constraint that every spatial region makes a partial contribution to the measured property. As illustrated in Fig. 3, the two task branches, segment and measure, emerge from a single down-sampled representation (rather than a flat vector), which is computed directly from the input image for now prediction and is predicted by the stacked convolutional LSTMs for future prediction. In either case, for a given pixel, the segment branch upsamples using only local neighbourhood vectors. Similarly, for the measure branch, the convolution filters operate only on the local neighbourhood of the pixel. Extending this intuition to the future prediction model, the stacked convolutional LSTMs are likewise constrained locally when predicting the representation for a future frame. Hence, we assert that our framework implicitly generates the partial contribution of the measured property over the spatial regions of the image: measure is an integral over a downsampled semantic mask prediction.
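The locality constraint can be sketched as follows; `w_local` is a hypothetical learned weight standing in for the architecture's local convolutions:

```python
import numpy as np

def partial_contribution_measure(rep, w_local):
    """Each coarse cell of the representation maps to its own scalar
    contribution via a purely local (1x1) projection `w_local` (a
    hypothetical learned weight); the frame-level measure is then the
    integral, i.e. the sum, of those partial contributions."""
    partials = np.einsum('hwc,c->hw', rep, w_local)  # per-cell contribution map
    return partials, float(partials.sum())           # measure = spatial integral
```

Because no cell's contribution depends on distant cells, the total measure decomposes exactly into localized partials, which is the property asserted above.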


Experiments

In this section, we describe the performance of the proposed approach on the newly introduced sky-video dataset. The proposed approach is compared, using the same parameters and protocol, with two different attention mechanisms and with a temporal convolutional network [Luc et al.2017] designed for shorter videos. The baseline persistence model simply uses the predicted semantic mask of the current frame as the semantic mask of the future frame.
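The persistence baseline is trivial to state in code, and is shown here for concreteness:

```python
import numpy as np

def persistence_forecast(current_mask, horizon_steps):
    """Persistence baseline: the predicted semantic mask at every future
    step is simply a copy of the current frame's predicted mask."""
    return [np.array(current_mask, copy=True) for _ in range(horizon_steps)]
```

Any learned future model must beat this copy-forward behaviour to demonstrate that it captures motion.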

Sky-video dataset and protocol

A sky-video is obtained from an upward-facing wide-angle-lensed video camera such as the one shown in Fig. 1. The dataset is recorded at the Solar Radiation Research Laboratory (SRRL), Golden, Colorado [Andreas1981]², situated in North America. The time-lapsed videos are recorded using a commercial total sky imager (TSI) [Morris2005] at 10-minute intervals. A mechanical sun tracker is used to block the sun, preventing saturation in the image and blooming effects. The dataset is available for the 13 years from 2005-2017. Over the same period, we obtained solar irradiance measurements (in W/m²) from the same location using a pyranometer. The cloud cover, defined as the ratio of pixels labeled cloud in the sky image, is correlated with the irradiance measure (0.67 on the training set). More details of the capture device and sample frame illustrations are in the Supplementary.

²The dataset is available for download at www.nrel.gov/midc/srrl_bms/. The processed dataset is available for easy reproducibility at https://bit.ly/2Bw7HGP.
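The cloud-cover statistic and its correlation with irradiance can be computed as below; the cloud label id is an illustrative assumption:

```python
import numpy as np

def cloud_cover(mask, cloud_label=1):
    """Cloud cover: the fraction of pixels labeled cloud in a sky frame
    (the label id 1 is an illustrative assumption)."""
    return float((np.asarray(mask) == cloud_label).mean())

def cover_irradiance_correlation(covers, irradiance):
    """Pearson correlation between per-frame cloud cover and measured
    solar irradiance; the paper reports 0.67 on its training set."""
    return float(np.corrcoef(covers, irradiance)[0, 1])
```

This link between a pixel-level quantity and a frame-level scalar is what motivates the joint segment-and-measure objective.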

Experiment n-MAE IoU Cloud IoU Sky
Baseline (only Irradiance) 11.31 - -
ResNet50 + dilation 10.96 83.31 87.89
Table 1: normalized-MAE (%) of Irradiance and IoU for segmentation in Now prediction on the test set

We showcase our experiments on the images captured in the years 2010-2014, with a total of 123,064 images in our dataset. The input video consists of RGB frames. The TSI imager also provides ground truth segmentation masks of four classes in the sky, namely sky, cloud, sun, and tracker. We use data from 2010-2012 for training (73,120 images) and from 2013-2014 (49,944 images) for testing, based on the availability and quality of ground truth.

For the future model, we use a look-back of one hour (six samples at 10-minute intervals) and compute a forward prediction over the same horizon. We start from an initial learning rate and reduce it with power decay per epoch. A weight decay of 0.00005 with an L2-norm regularizer is used for all convolutional layers. The Adam optimizer is used for all our experiments. We report pixel-level segmentation accuracy, Intersection over Union (IoU) and normalized mean absolute error (nMAE), as applicable.
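The two reported metrics can be computed as below; the nMAE normalizer shown (mean of the ground truth) is one common choice and an assumption, since the paper does not restate its exact normalization here:

```python
import numpy as np

def iou(pred, target, label):
    """Intersection over Union for one semantic class."""
    p, t = np.asarray(pred) == label, np.asarray(target) == label
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum() / union) if union else 1.0

def nmae(y_true, y_pred):
    """Normalized MAE: mean absolute error divided by the mean of the
    ground-truth signal (an assumed normalization)."""
    y_true = np.asarray(y_true, float)
    return float(np.abs(np.asarray(y_pred, float) - y_true).mean() / y_true.mean())
```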

Attention Mechanism +10 mins +20 mins +30 mins
Without 28.057 28.472 28.229
[Olah and Carter2016] 21.853 22.954 24.013
Spatial 19.486 22.051 23.472
Attention Mechanism +40 mins +50 mins +60 mins
Without 27.526 27.872 27.883
[Olah and Carter2016] 25.048 26.210 27.173
Spatial 24.934 25.982 26.938
Table 2: Measure Task: normalized-MAE (%) of Irradiance for Future prediction using the proposed Attention Mechanisms on Testing Data
Attention Mechanism Accuracy +10 mins (IoU Cloud, IoU Sky) +20 mins (IoU Cloud, IoU Sky) +30 mins (IoU Cloud, IoU Sky)
Persistence 80.55 64.04 69.77 59.31 65.65 55.97 62.74
Without 87.59 61.61 69.76 61.36 69.57 61.15 69.55
Mean [Olah and Carter2016] 89.00 70.39 77.25 68.39 75.34 66.15 73.38
Spatial 89.38 74.15 79.57 70.48 76.43 67.73 73.81
[Luc et al.2017] 75.94 63.15 68.88 58.08 64.08 52.64 56.72
Attention Mechanism Accuracy +40 mins (IoU Cloud, IoU Sky) +50 mins (IoU Cloud, IoU Sky) +60 mins (IoU Cloud, IoU Sky)
Persistence 80.55 53.14 60.31 50.14 57.90 47.05 55.77
Without 87.59 60.97 69.49 59.71 69.11 58.66 69.53
Mean [Olah and Carter2016] 89.00 63.89 71.75 60.96 70.13 58.87 69.65
Spatial 89.38 65.29 71.71 62.49 69.89 60.16 69.05
[Luc et al.2017] 75.94 48.02 49.61 43.76 42.89 39.55 35.53
Table 3: Segment Task: Accuracy (%) and IoU for Future Frames using various Attention Mechanisms on Testing Data


The performance of the now prediction model in Table 1 shows that the ResNet50 model (with dilated convolutional filters) captures the cloud region with an Intersection over Union (IoU) of 83.31 and an IoU of 87.89 on sky for the segment task on sky-video frames. Simultaneously, the model has a normalized mean absolute error (nMAE) of 10.96% on the measure task of solar irradiance. The total accuracy, which includes the segmentation accuracy of the tracker and the boundary bezel of the frame, is 94.57%. As shown in Fig. 4, the ground truth masks are grainy pixel segmentations, whereas our model generates semantic masks by linear up-sampling from a low-dimensional representation, resulting in smoother segmentation masks.

Figure 4: Sample semantic segmentation of now predictions. The three rows in the illustration are a sequence of input frames, the corresponding ground truth and semantic masks, respectively.

Fig. 4 also illustrates sample image frames and their corresponding segmentation masks. Unlike rigid objects in benchmark vision datasets, clouds exhibit non-compressible fluid behaviour such as advection and diffusion. The trained now prediction model weights are also used as pre-training initialization in the future prediction experiments. As shown in the now experiments in Table 1, our approach out-performs a direct regression model trained separately with the loss function corresponding to only measure (Eq. 6). We exclude a comparison with existing literature on sky-camera based irradiance measurement, as all deep neural network models far outperform previous approaches when trained with data from the same location [Paoli et al.2010]. The measure task of the future prediction model forecasts the solar irradiance for the next hour at 10-minute intervals. The performance of the model is improved by the spatial attention mechanism. Specifically, the normalized-MAE (%) for the 10-minute forecast of solar irradiance improves from 28.06 without attention, and 21.85 for mean attention, to 19.49 for spatial attention.

The segment task of the future prediction model forecasts semantic segmentation masks for the next hour at 10-minute intervals. For 10-minute ahead-of-time prediction, the IoU for cloud segmentation improves from 61.61 (with no attention) to 74.15 with spatial attention. A similar improvement, from 61.15 to 67.73, is observed for 30-minute ahead-of-time predictions. In order to maximize the effect of the attention mechanisms, we compute the attention vector only on the cloud dimension. We find empirically that this improves the effect of attention in the future prediction models. We choose the cloud class as it is of most value in this application scenario.

Fig. 5(a) shows one hour of temporal frames that are input to the model, the expected segmentation masks, and the true future frames for the next one-hour prediction. Six input frames at 10-minute intervals are used to generate the semantic segmentation masks for the +10 to +60 minute intervals. Fig. 5(b) shows the corresponding semantic segmentation masks as generated by the various attention mechanisms for one hour ahead. The first row shows predicted semantic masks without the aid of any attention; masks predicted for later intervals using this model are not accurate and mis-classify many cloud regions as sky. The next two rows show semantic masks generated using mean and spatial attention respectively. Masks generated using spatial attention attend to a more precise representation of the sky and hence produce better frames even for the later time intervals.

Figure 5: Performance of the segment task on 6 consecutive frames over one hour. The figure contains the input frames, the target ground truth, and 6 predictions (next 1 hour) for multiple attention methods and the TCN of [Luc et al.2017].

An empirical analysis of the partial contributions from localized regions of the video frame to the total estimate of solar irradiance would require pixel-level ground truth for solar irradiance. However, to our knowledge, such instrumentation is not available. As discussed above, our architecture design ensures that both the segment semantic mask of the sky region and the measure of solar irradiance are upsampled estimates from a single tensor of coarse resolution. We assert that the model implicitly localizes the individual contributions of each sky pixel to the total measurement.

Comparison with [Luc et al.2017]: We use the open-source implementation provided by the authors for this comparison. The base model (pre-trained ResNet with dilated convolution layers) is the same as in the proposed model, but computed at two scales as per their algorithm [Mathieu, Couprie, and LeCun2015], with a summation of the loss and a gradient difference loss, trained using the Adam optimizer. As shown in Table 3, the IoU for the first future frame prediction (+10 minutes ahead) is 63.15%, which is comparable to the persistence model. However, the performance degrades by the third frame (+30 minutes ahead) to 52.64% and by the sixth frame (+60 minutes) to 39.55% due to the compounding errors of the auto-regressive approach. The bottom row of Fig. 5(b) illustrates the effect of compounding errors.


Conclusion

We uniquely evaluate our future frame prediction approach for video understanding on time-lapsed videos of the sky. Our approach outperforms temporal CNNs at future pixel-wise labeling of sky regions. Further, the integral over the spatial regions of the image can produce an estimate of the total solar irradiance from the sky. The solar irradiance predictions so obtained closely approximate pyranometer readings over the two-year test period without re-training or online updates, indicating the efficacy of the frame representation. Further, the architecture compels the model to learn localized partial contributions of solar irradiance from different regions of the sky. All scripts will be open-sourced for easy reproducibility (https://bit.ly/2Bw7HGP).