Temporal Event Segmentation using Attention-based Perceptual Prediction Model for Continual Learning

by   Ramy Mounir, et al.
University of South Florida

Temporal event segmentation of a long video into coherent events requires a high-level understanding of activities' temporal features. The event segmentation problem has been tackled by researchers in an offline training scheme, either by providing full or weak supervision through manually annotated labels or by self-supervised epoch-based training. In this work, we present a continual learning perceptual prediction framework (influenced by cognitive psychology) capable of temporal event segmentation through understanding of the underlying representation of objects within individual frames. Our framework also outputs attention maps which effectively localize and track event-causing objects in each frame. The model is tested on a wildlife monitoring dataset in a continual training manner, resulting in an 80% recall rate at a 20% false positive rate for frame level segmentation. Activity level testing has yielded an 80% activity recall rate for one false activity detection every 50 minutes.



1 Introduction

Figure 1: Prediction and Motion Weighted loss signals during a walk in and out event. Bahdanau Attention tracking the bird

The segmentation of long videos into meaningful segments depends heavily on a higher-level understanding of visual cues in a given scene; such understanding is required to comprehend how objects' underlying representations change over time to form events. Temporal segmentation models must also effectively analyze the temporal change of such higher-level frame representations and decide when to signal a new event.

Event segmentation research has largely focused on offline epoch-based training methods, which require training the model on the entire training dataset prior to testing its performance. This poses a challenge for many real-world applications, where the entire dataset simply does not exist in advance and has to be collected sequentially in a stream over time [online]. Continual learning has the added advantage of processing infinitely large datasets because, unlike offline training methods, continual training frameworks completely disregard datapoints that have previously been used for training. We adopt a continual training scheme to alleviate the need for epoch-based training, in order to appeal to more practical applications and reduce training time.

Our framework follows key ideas from the perceptual prediction line of work in cognitive psychology [zacks2001perceiving, zacks2001, zacks2007]. Research has shown that event segmentation is an ongoing process in human perception, which helps form the basis of memory and learning. Humans can identify event boundaries using a biological perceptual predictive model which predicts future perceptual states based on the currently perceived sensory information. Experiments have shown that the human perceptual system identifies event boundaries based on the perceived visual motion cues from the environment [zacks_movement, speer_movement]. Our model implements the perceptual predictive framework and introduces a motion weighted loss function to allow for the localization and processing of motion cues.

Our approach uses a feature encoding network to transform low-level perceptual information into a higher-level feature representation. The model is trained to predict the future encoded perceptual input and to signal an event if the prediction is significantly different from the actual features of the next perceived input. The prediction signal also incorporates a higher-level representation of the movement cues within frames.
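This predict-compare-signal loop can be sketched in a few lines. The encoder and predictor below are toy, hypothetical stand-ins (a flatten and an exponential moving average) for the CNN and recurrent layer described later; only the loop structure reflects the framework.

```python
import numpy as np

def encode(frame):
    """Toy stand-in for the CNN encoder: flatten the frame to a feature vector."""
    return frame.reshape(-1).astype(float)

def predict(state, features):
    """Toy stand-in for the recurrent predictor: an exponential moving average.
    Returns (prediction_for_next_frame, new_internal_state)."""
    new_state = 0.9 * state + 0.1 * features
    return new_state, new_state

def segment_stream(frames, threshold=5.0):
    """Signal an event boundary whenever the prediction error (L2 distance in
    feature space, not pixel space) exceeds a threshold."""
    boundaries = []
    state = np.zeros(frames[0].size)
    prediction = encode(frames[0])
    for t, frame in enumerate(frames[1:], start=1):
        features = encode(frame)
        error = np.linalg.norm(prediction - features)
        if error > threshold:
            boundaries.append(t)
        prediction, state = predict(state, features)
    return boundaries
```

Running this on a stream whose content changes abruptly flags the change point as an event boundary, which is the behavior the full model learns in feature space.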

Novel contributions: We are - to the best of our knowledge - the first to (1) introduce an attention-based mechanism to temporal event segmentation models, allowing the model to localize the event in each processed frame, (2) introduce a motion weighted loss function to stabilize the attention maps and incorporate the processing of encoded movement visual cues into the training procedure, and (3) evaluate and report the performance of a temporal segmentation framework on a remarkably long dataset (over 10 days of continuous wildlife monitoring).

2 Relevant Work

Supervised temporal event segmentation uses direct labelling (of frames) to segment videos into smaller constituent events. Fully supervised models are heavily dependent on vast amounts of training data to achieve good segmentation results. Different model variations and approaches have been tested, such as an encoder-decoder Temporal Convolutional Network (ED-TCN) [TCN] or a spatiotemporal CNN model [spatiotemporal_CNN]. To alleviate the need for expensive direct labelling, weakly supervised approaches [RNN, weakly_ordered, weakly_soft, weakly_connectionist] have emerged that attempt to use metadata (such as captions or narrations) to guide the training process without the need for explicit training labels [weakly_speech, weakly_narrated]. However, such metadata are not always available as part of the dataset, which makes weakly supervised approaches inapplicable to most practical applications.

Self-supervised temporal event segmentation attempts to completely eliminate the need for annotations [unsupervised_complex, unsupervised_predicting]. Many approaches rely heavily on clustering higher-level frame features into sub-activities [unsupervised_clustering, unsupervised_clustering_2]. The performance of clustering-based unsupervised event segmentation is bounded by the performance of the embedding/encoding model that transforms frames into higher-level feature representations, and clustering algorithms can be highly computationally expensive depending on the number of frames to be clustered. Recent work [perceptual_event_segmentation] uses a self-supervised perceptual predictive model to detect event boundaries; we improve upon this model by adding an attention unit, which helps the model focus on event-causing objects. Other work [event_boundaries] uses a self-supervised perceptual prediction model that is refined over a significant number of reinforcement learning iterations. Work in the neuroscience field [numenta_sparse, numenta_theory, numenta_synapse, heeger], focusing on utilizing cortical function theory in a continual learning framework, has influenced our approach to event segmentation.

Frame predictive models have attempted to provide accurate predictions of the next frame in a sequence [PredNet, PredRnn++, HPNet, Long_Term_Prediction, Physicial_interaction_prediction]; however, these models focus on predicting future frames in raw pixel format. Such models may generate a prediction loss that only captures frame motion differences, with limited understanding of the higher-level features that constitute event boundaries.

Attention units have been applied to fully supervised applications in image captioning [show_attend_tell] and natural language processing [attention_is_all_you_need, bahdanau_attention, luong_attention, bert, xlnet]. In those works, attention is used to expose different temporal or spatial segments of the input to the decoding LSTM at every time step using fully supervised model architectures. We use attention in a slightly different form, where the LSTM is decoded only once (per input frame) to predict future features, using the attention-weighted input to do so. Unlike [show_attend_tell, attention_is_all_you_need, bahdanau_attention, luong_attention, bert, xlnet], our attention weights and biases are trained using unsupervised loss functions.

3 Methodology

The proposed framework is inspired by the works of Zacks on perceptual prediction for event segmentation [zacks2007]. The proposed architecture, summarized in figure 2, can be divided into several individual components. In this section, we explain the role of each component, starting with the encoder network and attention unit in sections 3.1 & 3.2, followed by a discussion of the recurrent predictive layer in section 3.3. We also introduce the different loss functions (section 3.4) used for self-supervised continual learning, as well as the adaptive thresholding function (section 3.5). We conclude by providing the implementation details (section 3.6) used to generate the segmentation results in section 4.

Figure 2: The Perceptual Prediction Network Architecture

3.1 Input Encoding

The raw input images are transformed from pixel space into a higher-level feature space by an encoder (CNN) model. This encoded feature representation allows the network to extract features of higher importance to the task being learned. The encoding process is achieved by learning the parameters of the CNN layers, summarized by the function E_t = f(I_t; θ_e), where θ_e denotes the learnable weights and biases and I_t is the input image at time step t. The encoder network transforms an input image with dimensions w × h × c to output features with dimensions w′ × h′ × d, where w′ × h′ is the spatial dimension and d is the feature vector length.

3.2 Attention Unit

In this framework, we utilize Bahdanau attention [bahdanau_attention] to spatially localize the event in each processed frame. The attention unit receives as input the encoded features E_t and outputs a set of attention weights (α_t) with dimensions w′ × h′ × 1. The hidden feature vector (h_{t-1}) from the prediction layer of the previous time step is used to calculate the output set of weights using equation 1, expressed visually in figure 2.

α_t = softmax(FC(tanh(FC(E_t) + FC(h_{t-1}))))          (1)

where tanh represents the hyperbolic tangent function, FC represents a single fully connected neural network layer, and softmax represents a softmax function. The weights (α_t) are then multiplied by the encoded input feature vectors (E_t) to generate the masked feature vectors (Ẽ_t). The attention mask is extracted from α_t, linearly scaled and resized, then overlaid on the raw input image (I_t) to produce the attention visuals shown in figure 7.
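The attention computation amounts to additive (Bahdanau) scoring followed by a softmax. Below is a minimal NumPy sketch; the dimension names and the randomly initialized projections standing in for the learned FC layers are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(E, h_prev, W_e, W_h, v):
    """Additive (Bahdanau) attention over encoded feature vectors.

    E      : (n, d) encoded feature vectors (n spatial locations)
    h_prev : (d,)   previous hidden state of the predictor
    W_e, W_h : (d, a) projections; v : (a,) scoring vector.
    Returns attention weights alpha (n,) and the masked features (n, d).
    """
    scores = np.tanh(E @ W_e + h_prev @ W_h) @ v   # one score per location
    alpha = softmax(scores)                        # weights sum to 1
    E_masked = alpha[:, None] * E                  # weight each location's features
    return alpha, E_masked
```

The returned `alpha` is what gets scaled, resized, and overlaid on the input frame to produce the attention visualizations.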

3.3 Future Prediction Layer

The process of future prediction requires a layer capable of storing a flexible internal state of the previous frames. For this purpose, we use a recurrent layer, specifically a Long Short-Term Memory (LSTM) cell [lstm], which is designed to output a future prediction based on the current input and a feature representation of the internal state. More formally, the LSTM cell can be described by the function h_t = LSTM(x_t, h_{t-1}; θ_l), where h_t and h_{t-1} are the output hidden state and previous hidden state respectively, x_t is the encoded input feature vector at time step t, and θ_l is the set of weights and biases vectors controlling the internal state of the LSTM. Equation 2 expresses the mathematical operations defining the LSTM cell.

f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c)          (2)
c_t = f_t · c_{t-1} + i_t · c̃_t
h_t = o_t · tanh(c_t)

where σ is a non-linear sigmoid activation function and the dot operator (·) represents element-wise multiplication. The gates f_t, i_t, and o_t control adding to and removing from the internal state c_t depending on the previous internal state and the current input. The input to the LSTM can be formulated as:


x_t = FC(Ẽ_t ⊕ h_{t-1})          (3)

where FC is a single fully connected layer, Ẽ_t is the masked encoded input feature vector, and h_{t-1} is the hidden state from the previous time step. The symbol ⊕ represents vector concatenation.
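A minimal NumPy sketch of the standard LSTM cell equations above; packing the four gates into one stacked weight matrix is a common implementation choice (an assumption here, not mandated by the text).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell (equation 2 style).

    x, h_prev, c_prev : (d,) input, previous hidden state, previous cell state
    W : (4d, 2d) stacked gate weights; b : (4d,) stacked biases.
    Returns the new hidden state h and cell state c.
    """
    d = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:d])        # forget gate
    i = sigmoid(z[d:2*d])      # input gate
    o = sigmoid(z[2*d:3*d])    # output gate
    g = np.tanh(z[3*d:4*d])    # candidate cell state
    c = f * c_prev + i * g     # element-wise gate updates
    h = o * np.tanh(c)
    return h, c
```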

3.4 Loss Function

The perceptual prediction approach aims to train a model capable of predicting the feature vectors of the next time step. We define two loss functions: the prediction loss and the motion weighted loss.

Prediction Loss

This function is defined as the L2 Euclidean distance between the output prediction h_t and the next frame's encoded feature vectors E_{t+1}. Note that we apply the L2 loss only in the higher-level feature space, not in pixel space. The prediction loss function can be mathematically expressed by equation 4.

L_pred(t) = ||E_{t+1} − h_t||₂²          (4)


Motion Weighted Loss

This function aims to extract the motion-related feature vectors from two consecutive frames to generate a motion-dependent mask, which is applied to the prediction loss. The motion weighted loss function allows the network to benefit from motion information in the higher-level feature space rather than pixel space. This function is visually expressed in figure 3 and formally defined by equation 5.

L_mw(t) = ||(E_{t+1} − h_t) · (E_{t+1} − E_t)||₂²          (5)

Figure 3: Motion Weighted Loss. Formally defined in equation 5
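Both loss functions can be sketched directly from their definitions. The exact form of the motion mask below (squared difference of consecutive encoded features) is an assumption consistent with the description, not a verbatim reproduction of equation 5.

```python
import numpy as np

def prediction_loss(pred, E_next):
    """L2 distance in feature space between the prediction and the
    next frame's encoded features (not pixel space)."""
    return np.sum((pred - E_next) ** 2)

def motion_weighted_loss(pred, E_cur, E_next):
    """Prediction error weighted element-wise by a motion mask derived
    from the change between consecutive encoded frames, so features
    that moved contribute more to the loss (assumed form)."""
    motion_mask = (E_next - E_cur) ** 2
    return np.sum(((pred - E_next) ** 2) * motion_mask)
```

For a static scene (E_cur equal to E_next) the motion weighted loss vanishes even when the prediction loss is nonzero, which is exactly why it de-emphasizes stationary background.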

3.5 Error Gate

The error gating function receives, as input, the error signal e(t) defined in section 3.4, and applies a thresholding function to classify each frame. In this framework, we define two types of error gating functions: a simple threshold function and an adaptive threshold function. Equation 6 formally defines the smoothing function for the adaptive error gating implementation. Both error gating functions use equation 7 to threshold the error signal. Equations 6 & 7 apply the smoothing function to the full loss signal for analysis purposes; however, the convolution operation can be reduced to element-wise multiplication to calculate a single smoothed value at time step t.

ē(t) = e(t) ∗ k          (6)
B(t) = [e(t) > ψ · ē(t)]          (7)

where ∗ represents a 1D convolution operation, k is a smoothing kernel, ψ is the threshold value, and B(t) is the binary event signal per frame.
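A sketch of the two gating functions under the notation above; the moving-average kernel and the multiplicative adaptive rule are assumptions consistent with the description, not necessarily the paper's exact formulation.

```python
import numpy as np

def smooth(errors, kernel_size=5):
    """1D moving-average convolution of the loss signal (equation 6 style)."""
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(errors, kernel, mode="same")

def gate(errors, psi, adaptive=True, kernel_size=5):
    """Classify each frame as event (True) or no event (False).

    Simple gate  : error > psi (fixed threshold).
    Adaptive gate: error > psi * smoothed error, so the threshold
                   tracks the recent level of the loss signal.
    """
    errors = np.asarray(errors, dtype=float)
    if not adaptive:
        return errors > psi
    return errors > psi * smooth(errors, kernel_size)
```

The adaptive form flags frames whose loss spikes relative to its local baseline, which is why it tolerates slow drifts (lighting, time of day) that would fool a fixed threshold.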

3.6 Implementation Details

In our experiments, we use an Inception V3 [inception] encoding model (trained on the ImageNet dataset) to transform input images from raw pixel representation to a higher-level feature representation. We freeze the model's parameters (weights and biases) and remove the last layer. This setup outputs 8 × 8 × 2048 feature maps, which we reshape to 64 × 2048 feature vectors. Each 2048-dimensional feature vector requires one LSTM cell for future feature prediction. In other words, the encoded input frame (E_t) is provided to 64 LSTM cells, each simultaneously processing a 2048-dimensional feature vector (hidden state size). We use a 0.4 drop rate (recurrent dropout) on the hidden states to prevent overfitting, which may easily occur due to the stateful LSTM nature of the model and the continual training approach. The LSTMs' hidden states are initialized to zero. A teacher forcing [teacher] approach is utilized by concatenating the weighted encoded input image (Ẽ_t) with the encoded input image (E_t), instead of concatenating it with the prediction from the previous time step (h_{t-1}). A single optimization step is performed per frame; the Adam optimizer is used for the gradient descent algorithm. The dataset is divided into four equal portions and trained on four Nvidia GTX 1080 GPUs simultaneously.
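The reshape from the encoder's spatial grid to per-cell feature vectors can be illustrated as follows; random values stand in for real Inception V3 activations.

```python
import numpy as np

# Hypothetical stand-in for the frozen encoder's output: Inception V3's
# penultimate layer produces an 8 x 8 spatial grid of 2048-d features.
encoded = np.random.default_rng(0).normal(size=(8, 8, 2048))

# Flatten the 8 x 8 grid into 64 feature vectors of length 2048;
# each vector is then fed to its own stateful LSTM cell.
vectors = encoded.reshape(64, 2048)
```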

4 Experimental Evaluation

In this section, we present the results of our experiments for the approach defined in section 3. We begin by defining the continual learning dataset used for testing, followed by explaining the evaluation metrics used to quantify performance. We discuss the model variations evaluated and conclude by presenting quantitative and qualitative results in sections 4.4 & 4.5.

4.1 Dataset

We analyze the performance of our model on a wildlife monitoring dataset. The dataset consists of 10 days (254 hours) of continuous monitoring of a nest of the Kagu, a flightless bird of New Caledonia. The labels include four unique bird activities: {feeding the chick, incubation/brooding, nest building while sitting on the nest, nest building around the nest}. Start and end times for each instance of these activities are provided with the annotations. We modified the annotations to include walk in and walk out events, representing the transitions from an empty nest to incubation and vice versa. Our approach can flag the nest building (on and around the nest), feeding the chick, and walk in/out events. Other events, based on climate, time of day, or lighting conditions, are ignored by our segmentation network.

4.2 Evaluation Metrics

The Receiver Operating Characteristic (ROC) chart is used to quantify the performance of the model variations and parameters. We provide quantitative ROC results for both frame level (figure 4) and activity level (figures 5 & 6) event segmentation. The frame window size (w) is defined as the maximum joining window size between events; a high value can cause separate detected events to merge, which decreases the overall performance.

Frame Level

The recall value in the frame level ROC is calculated as the ratio of true positive frames (event present) to the number of positive frames in the annotated dataset, while the false positive rate is the ratio of false positive frames to the total number of negative frames (event not present) in the annotated dataset. The threshold value (ψ) is varied to obtain a single ROC line, while varying the frame window size (w) results in a different ROC line.
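The frame level recall and false positive rate described above reduce to simple counts over per-frame boolean labels; a minimal sketch:

```python
import numpy as np

def frame_level_roc_point(pred, gt):
    """Recall and false positive rate for one threshold setting.

    pred, gt : boolean arrays, one entry per frame (True = event present).
    Returns (recall, false_positive_rate).
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    tp = np.sum(pred & gt)            # correctly flagged event frames
    fp = np.sum(pred & ~gt)           # flagged frames with no event
    recall = tp / max(gt.sum(), 1)    # fraction of event frames recovered
    fpr = fp / max((~gt).sum(), 1)    # fraction of non-event frames flagged
    return recall, fpr
```

Sweeping the threshold ψ and recomputing this point traces out one ROC line.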

Activity Level

The Hungarian matching (Munkres assignment) algorithm is utilized to achieve a one-to-one mapping between the ground truth labeled events and the detected events. Recall is defined as the ratio of the number of correctly detected events (overlapping frames) to the total number of groundtruth events. For the activity level ROC chart, the recall values are plotted against the false positive rate per minute, defined as the ratio of the total number of false positive detected events to the total duration of the dataset in minutes. The false positive rate per minute evaluation metric is also used in the ActEV TRECVID challenge [ActEV]. The frame window size (w) is varied to obtain a single ROC line, while varying the threshold value (ψ) results in a different ROC line.
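The one-to-one event matching can be implemented with SciPy's Hungarian solver. A sketch, assuming events are represented as (start_frame, end_frame) tuples and temporal overlap (in frames) is used as the assignment score:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_events(gt_events, det_events):
    """One-to-one matching between ground-truth and detected events
    (Munkres / Hungarian assignment), maximizing temporal overlap.

    Events are (start_frame, end_frame) tuples. Returns a list of
    (gt_index, det_index) pairs that share at least one frame.
    """
    overlap = np.zeros((len(gt_events), len(det_events)))
    for i, (gs, ge) in enumerate(gt_events):
        for j, (ds, de) in enumerate(det_events):
            overlap[i, j] = max(0, min(ge, de) - max(gs, ds))
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximize
    return [(int(i), int(j)) for i, j in zip(rows, cols) if overlap[i, j] > 0]
```

Matched ground-truth events count toward recall; detected events left unmatched count as false positives.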

Figure 4: Frame Level Event Segmentation ROC charts for simple thresholding of the prediction and motion weighted loss signals
Figure 5: Activity Level Event Segmentation ROC charts for simple thresholding of the prediction and motion weighted loss signals

4.3 Ablative Studies

Different variations of our framework (section 3) have been evaluated to quantify the effect of individual components on the overall performance. In our experiments, we tested the base model, which trains the perceptual prediction framework - including the attention unit - using the prediction loss function for backpropagation of the error signal. We refer to the base model as LSTM+ATTN. We also experimented with the effect of removing the attention unit from the model architecture on the overall segmentation performance; results of this variation are reported under the model name LSTM. Further testing includes using the motion weighted loss for backpropagation of the error signal; we refer to this motion weighted model as LSTM+ATTN+MW. Each of the models has been tested extensively; results are reported in sections 4.4 & 4.5, as well as visually expressed in figures 1, 4, 5, 6 & 7.

Figure 6: Activity Level Event Segmentation ROC charts for adaptive thresholding of the prediction and motion weighted loss signals

4.4 Quantitative Evaluation

We tested three different models, LSTM, LSTM+ATTN, and LSTM+ATTN+MW, for frame level and activity level event segmentation. Simple and adaptive gating functions (section 3.5) were applied to the prediction and motion weighted loss signals (section 3.4) for the frame level and activity level experiments. For each model, we vary parameters such as the threshold value and the frame window size to obtain the ROC charts presented in figures 4, 5 & 6.

Note that thresholding a loss signal does not necessarily imply that the model was trained to minimize that particular signal. In other words, the loss function used for backpropagating the error to the model's learnable parameters is identified only in the model name (section 4.3); thresholding experiments, however, have been conducted on different types of loss signals, regardless of the loss function used for training.

The best performing model for frame level segmentation (LSTM+ATTN) is capable of achieving {40%, 60%, 80%} frame recall at a {5%, 10%, 20%} frame false positive rate, respectively. Activity level segmentation can recall {80%, 90%, 95%} of the activities at {0.02, 0.1, 0.2} false positive activities per minute, respectively, for the same model (LSTM+ATTN), as presented in figure 6. A 0.02 false positive activity rate per minute can also be interpreted as one false activity detection every 50 minutes (while detecting 80% of the groundtruth activities).

Comparing the results shown in figures 5 & 6 indicates a significant increase in overall performance when using an adaptive threshold for loss signal gating. The efficacy of adaptive thresholding is most evident when applied to activity level event segmentation. Results have also shown that the model can effectively generate attention maps (section 4.5) without impacting segmentation performance.

4.5 Qualitative Evaluation

A sample of the qualitative attention results is presented in figure 7. The attention mask, extracted from the model, has been trained to track the event in all processed frames. Our results show that events are tracked and localized under various lighting (shadows, day/night) and occlusion conditions. Attention has also learned to focus indefinitely on the bird regardless of its motion state (stationary/non-stationary), which indicates that the model has acquired a high-level temporal understanding of the events in the scene and learned the underlying structure of the bird. Supplemental results display attention-weighted frames during illumination changes and moving shadows. We also provide a supplemental video showing the prediction loss signal, motion weighted loss signal, and attention mask during a walk in and out event (summarized in figure 1).

Figure 7: Samples of Bahdanau attention weights visualized on input images

5 Conclusion

We demonstrate a continual self-supervised approach to temporal event segmentation. Our framework can effectively segment a long sequence of activities (video) into a set of individual events. We introduce a novel approach to extract attention results from an unsupervised temporal event segmentation network. Gating the loss signal with different threshold values can result in segmentation at different granularities. Quantitative and qualitative results are presented in the form of ROC charts and attention-weighted frames. Our results demonstrate the effectiveness of our approach in understanding the higher-level spatiotemporal features required for practical temporal event segmentation.

6 Acknowledgements

The bird video dataset used in this paper was made possible through funding from the Polish National Science Centre (grant NCN 2011/01/M/NZ8/03344 and 2018/29/B/NZ8/02312). Province Sud (New Caledonia) issued all permits - from 2002 to 2020 - required for data collection.