
Continual Learning of Predictive Models in Video Sequences via Variational Autoencoders

by   Damian Campo, et al.

This paper proposes a method for the continual learning of predictive models that facilitate the inference of future frames in video sequences. For a first given experience, an initial Variational Autoencoder and a set of fully connected neural networks are used to learn, respectively, the appearance of video frames and their dynamics at the latent space level. By employing an adapted Markov Jump Particle Filter, the proposed method recognizes new situations and integrates them as predictive models, avoiding catastrophic forgetting of previously learned tasks. To evaluate the proposed method, this article uses video sequences from a vehicle that performs different tasks in a controlled environment.



1 Introduction

Some biological organisms, such as pigeons and large primates, possess the ability to learn new experiences continuously throughout their lifetime [5, 17, 13]. Such a capacity to preserve and use information from past experiences allows organisms to develop cognitive skills that are often critical for survival [2, 3]. As discussed in [4], the development of biological expertise follows a characteristic pattern of gradual improvement of performance over a particular task. The work in [4] also claims that the level of expertise reached by an organism depends on its i) long-term memory, ii) working memory capacity, iii) ability to focus attention on relevant information, iv) capability to anticipate, perceive and comprehend surroundings, v) speed of decision-making and vi) coordination of motor movements. The capabilities mentioned above can be seen as a set of cognitive abilities that organisms employ to solve problems and adapt to new situations systematically. We argue that each of those skills can be refined by recalling past experiences that match current situations, suggesting a major role of continual learning in solving problems and developing expertise.

Motivated by the aforementioned research studies on continual learning in living beings, we propose a method by which video sequences acquired by artificial systems are employed to create models that can predict future situations based on past experiences. The proposed method facilitates the continual learning of new experiences by using abnormal information (video frames) detected by the available predictive models. We believe artificial systems can benefit greatly from the continual learning of new experiences, which contributes to the automatic development of cognitive skills that facilitate the growth of expertise and adaptability in machines.

Several articles, inspired by findings in psychology and neuroscience, have tried to include continual learning (also known as lifelong learning) capabilities in artificial systems [20, 6]. Continual learning has primarily been studied in deep neural networks (DNNs) due to their remarkable advances across diverse applications [15]. Nonetheless, when new information is integrated into artificial neural networks (ANNs), a dramatic performance degradation is observed in previously learned tasks. This phenomenon is known as catastrophic forgetting, and various articles have tried to overcome it by using different techniques [11, 9, 19]. In particular, the work in [20] distinguishes three main approaches to deal with catastrophic forgetting and thereby facilitate continual learning in ANNs: i) Architectural, where the architecture of the network is modified and the objective function remains the same. ii) Functional, which modifies the loss function so that the learning of new tasks does not affect already learned information. iii) Structural, which consists of introducing penalties on parameters to avoid forgetting already learned experiences.

The proposed method consists of an architectural approach that enables the continual learning of predictive models in video sequences by adopting a duplication and tuning process over Variational Autoencoders (VAEs), which are trained as new situations are detected. For a given task, our work uses a VAE for learning the appearance features of video data and a set of fully connected ANNs for performing predictions of following video frames at the latent space level. Our work demonstrates how predictive models can be learned and used incrementally as new situations are observed.

Although various research works have tackled the problem of continual learning in DNNs, their contributions have mainly focused on classification tasks [16, 14, 10], which consist of continuously learning different classes from diverse datasets such as CIFAR-10/100, ImageNet, and MNIST. Our work, on the other hand, focuses on the ability to predict the subsequent frames of video sequences by associating visual observations with already identified/learned experiences. Moreover, the proposed method employs a Markov Jump Particle Filter (MJPF) [1] over latent space information for making predictions that can potentially be integrated with other sensory information, e.g., positional and control data.

The main contributions of the proposed method are: i) The employment of VAEs to obtain latent spaces from which predictive models can be created at a low-dimensional level. ii) The detection of new situations that autonomous systems may employ to continuously learn predictive models without forgetting previously learned experiences. iii) An evaluation on real video sequences from a vehicle performing different tasks in a controlled environment.

The rest of the paper is organized as follows: Section 2 explains the proposed method for enabling continual learning over video sequences based on VAEs’ latent spaces. Section 3 introduces the employed dataset. Section 4 discusses the obtained results, and Section 5 concludes the article and suggests future developments.

Figure 1: Block diagram of proposed method.

2 Method

The proposed method includes two major phases that are repeated each time new situations are detected: i) a training process (section 2.1) to learn probabilistic models based on observed data, and ii) an online testing procedure (section 2.2) for detecting possible new situations in observed data. Accordingly, when new situations are identified, they trigger a new training process that refines the available predictive models (section 2.3). The proposed method is summarized in the block diagram shown in Fig. 1.

2.1 Training phase

Variational Autoencoder. A Variational Autoencoder (VAE) is used for describing images in a latent space that significantly reduces the original dimension of video frames. Moreover, a VAE represents images in the latent space probabilistically by using a mean $\mu$ and a variance $\sigma^2$ to approximate each latent variable. As is well known, a VAE is composed of two parts: an encoder $q(z|x)$ and a decoder $p(x|z)$. The latent state $z$, sampled from $q(z|x)$, is passed to the decoder, which returns an approximate reconstruction of the observation $x$. The encoder and the decoder are each defined by their own set of parameters. To optimize them, the VAE maximizes the sum of the lower bounds on the marginal likelihood of each observation $x$ of the dataset, as described in [7, 8].
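For reference, the per-observation lower bound optimized in [7] can be written as below. The symbols follow [7] and are an assumption with respect to this section: $\phi$ denotes the encoder parameters, $\theta$ the decoder parameters and $p(z)$ the prior over the latent state.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p(z)\big)
  \;\le\; \log p_{\theta}(x)
```

The first term rewards accurate reconstructions, while the second keeps the encoded distribution close to the prior.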

This work uses the VAE’s ability to encode visual information into a significantly lower-dimensional probabilistic latent space, which is employed to make inferences about future instances. Consequently, we first train the VAE with the set of training images of the task to be learned (initially, the first task). By utilizing the trained VAE’s encoder, we obtain a set of latent features, described by their means $\mu$ and variances $\sigma^2$, which represent the training data.

Generalized states. Let the latent means $\mu$ produced by the VAE’s encoder be the states of the training images; we build a set of Generalized States (GSs) that also contain the first-order time derivatives of these states. Accordingly, let $\mu_t$ be the state of the training image at time $t$; its first-order time derivative can be approximated by $\dot{\mu}_t = (\mu_t - \mu_{t-1})/\Delta t$, where $\Delta t = 1$ assumes a normalized regular sampling of images. The GS at time $t$ can thus be written as $[\mu_t \;\; \dot{\mu}_t]^{\top}$. By repeating this for each image, we obtain the set of GSs for the training set.
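The construction of generalized states can be sketched as follows. This is a minimal illustration, assuming the latent means are stacked row-wise in a NumPy array; the function name `generalized_states` and the choice to drop the first frame (which has no predecessor) are assumptions, not details given in the text.

```python
import numpy as np

def generalized_states(mu, dt=1.0):
    """Stack each latent mean with its first-order time derivative."""
    mu = np.asarray(mu, dtype=float)       # shape (T, d): one row per frame
    mu_dot = np.diff(mu, axis=0) / dt      # shape (T-1, d): finite differences
    # Assumption: the first frame has no predecessor, so GSs start at frame 2.
    return np.hstack([mu[1:], mu_dot])     # shape (T-1, 2d)

mu = np.array([[0.0, 1.0], [0.5, 1.0], [1.5, 0.0]])
gs = generalized_states(mu)
```

Each row of `gs` holds the latent state followed by its approximate velocity, so clustering on these rows groups frames by both appearance and motion.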


Clustering and neural networks. After obtaining the set of GSs, we use a traditional k-means algorithm to cluster the GSs into groups that carry similar information. Since we use both the states and their time derivatives as input data, the obtained clusters capture information about the encoded images and their dynamics.

The identified clusters are indexed from one up to the total number of clusters. Once the clustering is performed, we calculate a transition matrix encoding the passage probabilities from each cluster to the others. Subsequently, the following features are extracted from each cluster: i) the cluster’s centroid, ii) the cluster’s covariance and iii) the cluster’s radius of acceptance. Finally, a fully connected neural network defining the dynamics of GSs, i.e., a continuous predictive model, is learned for each cluster. Each network is trained on the data assigned to its cluster, taking the states as input and the corresponding dynamics as output, so that the network approximates the cluster’s dynamics up to a residual error that remains after the convergence of the network.

Each network learns a sort of quasi-semantic information based on the particular image appearance and motion captured by its cluster, facilitating the estimation of future latent states, i.e., predicting the following frames. Such predictions can be employed to measure the similarity between new observations and previously learned experiences encoded into the NNs. If predictions from the NNs are not compliant with observations, an abnormality should be detected, and models should be adapted to learn new situations and semantic information. Consequently, each identified task can be described by the set of parameters learned in this training phase.
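The cluster statistics described in this section can be sketched as follows. The sketch assumes the k-means assignments are already available as an integer label per frame and that every cluster holds at least two frames. The names `transition_matrix` and `cluster_features` are hypothetical, and the radius of acceptance is taken here as the largest distance from the centroid, which is an assumption because the text does not define it.

```python
import numpy as np

def transition_matrix(labels, n_clusters):
    """Estimate cluster-to-cluster passage probabilities from a label sequence."""
    counts = np.zeros((n_clusters, n_clusters))
    for a, b in zip(labels[:-1], labels[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without outgoing transitions stay at zero instead of dividing by 0.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def cluster_features(gs, labels, n_clusters):
    """Centroid, covariance and (assumed) radius of acceptance per cluster."""
    gs, labels = np.asarray(gs, dtype=float), np.asarray(labels)
    features = []
    for m in range(n_clusters):
        points = gs[labels == m]
        centroid = points.mean(axis=0)
        covariance = np.cov(points, rowvar=False)
        # Assumption: radius = distance of the farthest member from the centroid.
        radius = np.linalg.norm(points - centroid, axis=1).max()
        features.append({"centroid": centroid, "covariance": covariance, "radius": radius})
    return features

gs = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0], [0.0, 1.0]])
labels = [0, 0, 1, 1, 0]
T = transition_matrix(labels, 2)
feats = cluster_features(gs, labels, 2)
```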

2.2 Testing Phase

During the testing phase, each image is processed through the VAE, and its respective GS is calculated. Then, an adapted version of the MJPF based on the learned information is used to detect new situations in video sequences.

Adapted Markov Jump Particle Filter. An MJPF, first proposed in [1], is adapted for prediction and abnormality detection purposes on visual data. The MJPF uses a Particle Filter coupled with a bank of Kalman Filters (KFs) for inferring continuous and discrete level information. Since this work tackles a problem that requires a non-linear predictive model and a non-linear observation model, provided respectively by the set of NNs and a VAE, a bank of unscented KFs (UKFs) and the VAE’s encoded information are employed for making inferences over video sequences.

The proposed adapted MJPF (A-MJPF) follows two stages at each time instant: prediction and update. During prediction, the next cluster (discrete level) and the next GS (continuous level) are estimated for each particle. The prediction at the discrete level is similar to the standard MJPF in [1]. At the continuous level, instead, the A-MJPF uses the cluster’s neural network to make predictions. Since non-linear models are considered for predicting continuous level information, a UKF is utilized as described in [18] by taking additional sigma points. For each sigma point, the prediction is obtained through two matrices that map, respectively, the previous state and the velocity predicted by the neural network onto the new state. The mean and covariance of the predicted state are then calculated through the UKF.
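A single sigma-point prediction of this kind can be sketched as follows. The exact matrices are not recoverable from the text, so the forms of `A` and `B` below are assumptions (keep the previous latent position, add the predicted velocity scaled by the sampling time, and set the new velocity to the predicted one). `predict_velocity` is a hypothetical stand-in for the cluster's trained network.

```python
import numpy as np

def predict_sigma_point(x_prev, predict_velocity, dt=1.0):
    """Propagate one generalized state [position, velocity] by one step."""
    d = x_prev.size // 2
    eye, zero = np.eye(d), np.zeros((d, d))
    A = np.block([[eye, zero], [zero, zero]])   # assumed: retains previous position
    B = np.vstack([dt * eye, eye])              # assumed: injects predicted velocity
    v = predict_velocity(x_prev[:d])            # stand-in for the cluster's network
    return A @ x_prev + B @ v

x_prev = np.array([1.0, 2.0, 0.3, -0.1])
x_next = predict_sigma_point(x_prev, lambda mu: np.array([0.5, -1.0]))
```

In a full UKF the same function would be applied to every sigma point before recombining them into a predicted mean and covariance.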

The update phase is performed when a new measurement (image) is observed. At the discrete level, particles are resampled based on an innovation measurement. At the continuous state level, a modified KF is in charge of the update. This update takes into consideration the fact that the mean and variance that the encoder associates with the observed image can be used as the mapped observation on the state space at the current time instant. Consistently, the encoded variance can approximate the observation covariance matrix, representing the uncertainty while encoding images. By assuming a negligible observation noise, it is possible to employ a modified version of the KF update equations where the observation matrix disappears. Algorithm 1 describes the employed KF’s steps.

Algorithm 1: Equations for the prediction and update phases of the Adapted Kalman Filter.
2: Calculation of the sigma points and of their respective weights, as described in [18].
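A Kalman update in which the observation matrix disappears can be sketched as below. This is the standard KF update with the observation matrix set to the identity, shown as an illustration rather than as the paper's Algorithm 1. The function name `kf_update_identity` is hypothetical, and using the encoder's variance as the observation covariance `R` is an assumption drawn from the description of the update phase.

```python
import numpy as np

def kf_update_identity(x_pred, P_pred, z, R):
    """Standard KF update with observation matrix H = I (observation lives in state space)."""
    S = P_pred + R                          # innovation covariance, since H = I
    K = P_pred @ np.linalg.inv(S)           # Kalman gain
    x_upd = x_pred + K @ (z - x_pred)       # correct the prediction with the residual
    P_upd = (np.eye(len(x_pred)) - K) @ P_pred
    return x_upd, P_upd

x_pred = np.array([0.0, 0.0])
P_pred = np.eye(2)
z = np.array([2.0, 4.0])                    # encoded observation (e.g. latent mean)
R = np.eye(2)                               # assumed: built from the encoder's variance
x_upd, P_upd = kf_update_identity(x_pred, P_pred, z, R)
```

With equal prediction and observation uncertainty, the updated state lands halfway between prediction and observation.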

Detection of new situation. After the update phase, at each time instant, the predicted value of each latent state component for each particle is compared with the actual updated value, outputting a measure of innovation (Eq. 4).

The innovation values of the training video sequences are used to set a threshold (Eq. 5), which is defined in terms of the mean and standard deviation of the innovations from the training data. When applying Algorithm 1 on testing data, frames producing innovation values above the threshold in Eq. (5) are considered as new situations. Moreover, to avoid spurious innovation peaks, a temporal window of 3 frames is used, such that new situations are recognized only if 3 consecutive frames are above the threshold.
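The thresholding and temporal-window rule can be sketched as follows. The threshold is assumed here to take the form of the training mean plus `k` standard deviations; the multiplier `k` and both function names are hypothetical, since only the use of the mean and standard deviation is stated.

```python
import numpy as np

def innovation_threshold(train_innovations, k=3.0):
    """Assumed form: mean + k * std of the training innovations (k is hypothetical)."""
    train_innovations = np.asarray(train_innovations, dtype=float)
    return train_innovations.mean() + k * train_innovations.std()

def detect_new_situations(innovations, threshold, window=3):
    """Flag frames that belong to a run of at least `window` frames above threshold."""
    above = np.asarray(innovations) > threshold
    flags = np.zeros(len(above), dtype=bool)
    run = 0
    for i, is_above in enumerate(above):
        run = run + 1 if is_above else 0
        if run >= window:
            flags[i - window + 1 : i + 1] = True   # mark the whole qualifying window
    return flags

innovations = [0, 5, 5, 0, 5, 5, 5, 5, 0]
flags = detect_new_situations(innovations, threshold=1.0, window=3)
```

The two-frame peak at the start is ignored, while the four-frame run is flagged in full.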

2.3 Continual Learning

The calculation of innovations makes it possible to define a continual learning process where frames belonging to new situations are detected and stored. These identified frames are employed to perform a new training process as described in section 2.1, involving a new VAE. The means and variances encoded by this new VAE are employed as bottleneck features related to the images of the new situation.

During the new testing phase, the bottleneck features produced by the different VAEs and their learned feature variables are used together in a single A-MJPF. The particles in the A-MJPF are then distributed among all the available clusters and consequently among the various VAEs, i.e., situations.

Since the bottleneck features of different VAEs carry different meanings, particles in the A-MJPF that are assigned to a particular VAE’s cluster cannot be reassigned to other VAEs’ clusters. Therefore, particles do not jump between VAEs; they always remain attached to a particular VAE. The innovation measurement (see Eq. 4) is again estimated in order to detect additional new situations.
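One way to express this constraint, offered only as an illustrative assumption because the text does not state how it is implemented, is to assemble a block-diagonal transition matrix from the per-VAE transition matrices. Zero off-diagonal blocks then give zero probability of moving from a cluster of one VAE to a cluster of another. The function name `combine_transition_matrices` is hypothetical.

```python
import numpy as np

def combine_transition_matrices(matrices):
    """Place each VAE's transition matrix on the diagonal; cross-VAE entries stay 0."""
    total = sum(m.shape[0] for m in matrices)
    combined = np.zeros((total, total))
    start = 0
    for m in matrices:
        size = m.shape[0]
        combined[start:start + size, start:start + size] = m
        start += size
    return combined

T_first = np.array([[0.9, 0.1], [0.2, 0.8]])   # clusters learned with the first VAE
T_second = np.array([[1.0]])                   # single cluster of a second VAE
T_all = combine_transition_matrices([T_first, T_second])
```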

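Figure 2: Testing phases on the pedestrian avoidance task: (a) Innovation signal using the model learned from Scenario I. (b) Color-coded innovation. (c) Innovation signal using the model learned from the pedestrian avoidance frames. (d) Innovation signal using the combined model.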

3 Employed Dataset

A real vehicle called “iCab” [12] is used to collect video sequences from an onboard front camera. A human drives the iCab, performing different tasks in a closed environment.

This work aims at studying situations that have not been previously seen in a normal situation (Scenario I), which is used for learning purposes. Scenario II includes unseen maneuvers caused by the presence of pedestrians while the vehicle performs a previously seen task. The two scenarios considered in this work are:

Scenario I (perimeter monitoring). The vehicle follows a rectangular trajectory around a closed building.

Scenario II (pedestrian avoidance maneuver). Two obstacles (stationary pedestrians) in different locations interfere with the perimeter monitoring task of Scenario I. The vehicle performs an avoidance maneuver and continues the perimeter monitoring. Fig. 3 shows a temporal evolution of video frames from both scenarios.

Figure 3: Video sequences from considered scenarios: a) perimeter monitoring task and b) pedestrian avoidance maneuver.

4 Experimental results

First training phase: Perimeter Monitoring. Initial models are trained based on the video frames from Scenario I. These data form the first training set and are used to obtain the first VAE and its set of learned features, see section 2.1. The threshold in Eq. (5) is then calculated based on the behavior of the training data on the initial A-MJPF. In what follows, the A-MJPF built on the first VAE and its features is referred to as the initial model.

Detection of new situation: Pedestrian Avoidance. The iCab faces a new situation: it encounters and avoids a static pedestrian. Fig. 2a) shows the resulting innovation signal from the initial model; blue zones refer to video frames containing the pedestrian and yellow regions encode the avoidance maneuvers. At each lap, the vehicle encounters two different static pedestrians, see images (1) and (6) in Fig. 2a). They wear t-shirts of different colors (black and white), which make them “camouflage” with the environment in some particular configurations due to changeable illumination conditions. This factor influences the innovation values of frames (1) and (6), with the second one generating higher values.

Each pedestrian avoidance maneuver generates two peaked zones, see frames (2) and (4) or (7) and (9). Between such peaks, there is a zone with low innovation values, see (3) or (8), due to the execution of behaviors similar to those already observed in the training set.

As described in section 2.2, the amplitude threshold obtained from the initial experience and a temporal window of 3 frames are used to detect the new situations. Fig. 2b) displays the frames that were classified as known experiences (green) or new situations (red).

4.1 Learning of the new situation

The frames classified as new are used as a second training set for generating an additional VAE and a set of feature variables, which can be used for generating a model that understands only the pedestrian avoidance maneuver; we refer to it as the pedestrian avoidance model. Innovation measurements from this model are displayed in Fig. 2c), where low innovations are obtained in zones related to the pedestrian presence and the vehicle’s avoidance maneuver.

Innovation signals from the initial model and from the pedestrian avoidance model can be seen as complementary information that facilitates the incremental understanding of the two proposed tasks together: large innovation values in Fig. 2a) correspond to low innovations in c). Accordingly, by employing both available VAEs and their feature variables, we generate a single A-MJPF, referred to as the combined model, that uses all previously learned concepts for prediction purposes, see section 2.3. Fig. 2d) shows how innovations from the combined model remain low through the entire Scenario II, confirming the continual learning of new experiences while avoiding the catastrophic forgetting of previously learned concepts. The percentage of false positive alarms of the combined model is .

Figure 4: Examples of image-level behaviour in perimeter monitoring (a) and pedestrian avoidance (b-c) cases.

Fig. 4 displays three examples that explain the performance of our algorithm visually. From left to right, the columns of the blue block correspond to: the image at the current time instant; its reconstruction by using, respectively, the first and the second VAE; and the decoded version of the predicted image at the next time instant, given the current image, when adopting, respectively, the model learned from Scenario I, the model learned from the pedestrian avoidance frames, and the combined model. Similarly, the columns of the green block represent: the image at the next time instant and its reconstruction when using, respectively, the first and the second VAE.

Row a) corresponds to the perimeter monitoring task, whereas rows b) and c) are related to the pedestrian avoidance situation. The displayed frames and their innovation values correspond to the ones marked in Fig. 2a) with a blue circle, see frames (5), (7), (9). Note how in a), the prediction of the model learned from Scenario I is accurate, while that of the model learned from the pedestrian avoidance frames is not, due to the wrong reconstruction of the image. In cases b) and c), the prediction of the pedestrian avoidance model is accurate and the prediction of the Scenario I model is not. The poor performance of the Scenario I model in case c) is again due to the image not being recognized, leading to inconsistencies while predicting. In the case of b), the Scenario I model can generalize the observed image to a similar one that was in the training set. However, the prediction still produces high innovations because of a discrepancy between the expected motion and the observed dynamics: instead of moving left, the vehicle moves right to finalize the pedestrian avoidance maneuver. It can be visually observed that the prediction of the combined model performs well in all three cases.

5 Conclusion and future work

This work proposes a method that facilitates the continual learning of dynamical situations in video data. The proposed method is based on a probabilistic approach that uses latent spaces from VAEs to represent the state of video frames at each time instant. The dynamics of video sequences are captured by a set of NNs that encode different types of video motions in a given task. Future work includes the insertion of multimodal data into the A-MJPF, allowing the model to make inferences by fusing heterogeneous sensory data, e.g., video and positional information. Another possible extension of the proposed work consists of improving the clustering process of latent space information, such that richer semantics can be obtained.


  • [1] M. Baydoun, D. Campo, V. Sanguineti, L. Marcenaro, A. Cavallaro, and C. Regazzoni (2018) Learning switching models for abnormality detection for autonomous driving. In 2018 21st International Conference on Information Fusion (FUSION), pp. 2606–2613.
  • [2] M. J. Beran, C. R. Menzel, A. E. Parrish, B. M. Perdue, K. Sayers, J. D. Smith, and D. A. Washburn (2016) Primate cognition: attention, episodic memory, prospective memory, self-control, and metacognition as examples of cognitive control in nonhuman primates. Wiley Interdisciplinary Reviews: Cognitive Science 7 (5), pp. 294–316.
  • [3] K. P. Darby, L. Castro, E. A. Wasserman, and V. M. Sloutsky (2018) Cognitive flexibility and memory in pigeons, human children, and adults. Cognition 177, pp. 30–40.
  • [4] R. Dukas (2019) Animal expertise: mechanisms, ecology and evolution. Animal Behaviour 147, pp. 199–210.
  • [5] J. Fagot and R. G. Cook (2006) Evidence for large long-term memory capacities in baboons and pigeons and its implications for learning and the evolution of cognition. Proceedings of the National Academy of Sciences 103 (46), pp. 17564–17567.
  • [6] T. Flesch, J. Balaguer, R. Dekker, H. Nili, and C. Summerfield (2018) Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences 115 (44), pp. E10313–E10322.
  • [7] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Y. Bengio and Y. LeCun (Eds.).
  • [8] D. P. Kingma and M. Welling (2019) An introduction to variational autoencoders. Foundations and Trends in Machine Learning 12 (4), pp. 307–392.
  • [9] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
  • [10] T. Lesort, H. Caselles-Dupré, M. Garcia-Ortiz, A. Stoian, and D. Filliat (2019) Generative models from the perspective of continual learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  • [11] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
  • [12] P. Marín-Plaza, J. Beltrán, A. Hussein, B. Musleh, D. Martín, A. de la Escalera, and J. M. Armingol (2016) Stereo vision-based local occupancy grid map for autonomous navigation in ROS. In Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP).
  • [13] A. E. Milne, C. I. Petkov, and B. Wilson (2018) Auditory and visual sequence learning in humans and monkeys using an artificial grammar learning paradigm. Neuroscience 389, pp. 104–117.
  • [14] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2017) Variational continual learning. arXiv preprint arXiv:1710.10628.
  • [15] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks.
  • [16] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010.
  • [17] M. Shanahan, V. P. Bingman, T. Shimizu, M. Wild, and O. Güntürkün (2013) Large-scale network organization in the avian forebrain: a connectivity matrix and theoretical analysis. Frontiers in Computational Neuroscience 7, pp. 89.
  • [18] E. A. Wan and R. van der Merwe (2000) The unscented Kalman filter for nonlinear estimation. In Symposium on Adaptive Systems for Signal Processing, Communication and Control.
  • [19] X. Yao, T. Huang, C. Wu, R. Zhang, and L. Sun (2019) Adversarial feature alignment: avoid catastrophic forgetting in incremental task lifelong learning. Neural Computation 31 (11), pp. 2266–2291.
  • [20] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3987–3995.