Learning event representations in image sequences by dynamic graph embedding

10/08/2019 · Mariella Dimiccoli et al. · Universitat Politècnica de Catalunya, IRIT

Recently, self-supervised learning has proved to be effective for learning representations of events in image sequences, where events are understood as sets of temporally adjacent images that are semantically perceived as a whole. However, although this approach does not require expensive manual annotations, it is data hungry and suffers from domain adaptation problems. As an alternative, in this work we propose a novel approach for learning event representations, named Dynamic Graph Embedding (DGE). The assumption underlying our model is that a sequence of images can be represented by a graph that encodes both semantic and temporal similarity. The key novelty of DGE is to learn jointly the graph and its graph embedding. At its core, DGE works by iterating over two steps: 1) updating the graph representing the semantic and temporal structure of the data based on the current data representation, and 2) updating the data representation to take into account the current data graph structure. The main advantage of DGE over state-of-the-art self-supervised approaches is that it does not require any training set, but instead learns iteratively from the data itself a low-dimensional embedding that reflects its temporal and semantic structure. Experimental results on two benchmark datasets of real image sequences captured at regular intervals demonstrate that the proposed DGE leads to effective event representations. In particular, it achieves robust temporal segmentation on the EDUBSeg and EDUBSeg-Desc benchmark datasets, outperforming the state of the art.


1 Introduction

Figure 1: Temporally adjacent frames in two events in first person image sequences.

Temporal segmentation of videos and image sequences has a long history of research, since it is crucial not only to video understanding but also to video browsing, indexing and summarization [32, 38, 19]. With the proliferation of wearable cameras in recent years, the field is facing new challenges. Indeed, wearable cameras make it possible to capture, from a first-person (egocentric) perspective and “in the wild”, long unconstrained videos (35 fps) and image sequences (aka photostreams, 2 fpm). Due to their low temporal resolution, the segmentation of first-person image sequences is particularly challenging. Indeed, abrupt changes in appearance may arise within an event due to sudden camera movements, and event transitions that are smooth at higher sampling rates due to continuous recording are reduced to a few frames that are difficult to detect. While for a human observer it is relatively easy to segment egocentric image sequences into discrete units, this poses serious difficulties for automated temporal segmentation (see Fig. 1 for an illustration).

Given the limited amount of annotated data, current state-of-the-art approaches for the temporal segmentation of first-person image sequences aim at obtaining event representations by encoding the temporal context of each frame of a sequence in an unsupervised fashion [13, 18]. These methods rely on neural or recurrent neural networks and are generally based on the idea of learning event representations by training the network to predict past and future frame representations. Recurrent neural networks have proved to be more efficient than simple neural networks for this temporal prediction task. The main limitation of these approaches is that they must rely on large training datasets to yield state-of-the-art performance. Even if, in this case, the training data do not require manual annotations, they can still introduce a bias and suffer from the domain adaptation problem. For instance, in the case of temporal segmentation of image sequences, the results will be difficult to generalize to data acquired with a camera with a different field of view or by people having different lifestyles.

In this paper, we aim at overcoming this limitation with a novel approach that is able to unveil a representation encoding the temporal structure of an image sequence from the single sequence itself. With this goal in mind, we propose to learn event representations as an embedding on a graph. Our model is based on the assumption that each event belongs to a community structure of semantically similar events on an unknown underlying graph, where communities are understood here as sets of nodes that are interconnected by edges with large weights. Moreover, the graph weights reflect not only temporal proximity, but also semantic similarity. This is motivated by neuroscientific findings showing that neural representations of events arise from temporal community structures [46]. In other words, frames that share the same temporal context are grouped together in the representational space. In Fig. 2 we illustrate the idea by means of an egocentric image sequence capturing the full day of a person: going from home to work using public transport, having a lunch break in a restaurant, going back home after doing some shopping, etc. Each point cloud corresponds to images that are similar in appearance, and most of these contexts are visited multiple times. This means that every pair of images in a point cloud is related semantically, but they may or may not be related at the temporal level.

Figure 2: The assumption underlying our learning approach is that image sequences captured at regular intervals (2 fpm) can be organized into a graph structure, where each community structure in the graph corresponds to a particular temporal context, capturing both temporal and semantic relatedness. Points of the same color in the figure are related semantically and may be more or less related at the temporal level. The arrows indicate temporal transitions between semantic clusters; they serve only visualization purposes, since temporal transitions are between pairs of points.

Based on this model, the proposed solution consists in learning simultaneously the graph structure (encoded by its weights) and the data representation. This is achieved by iterating over two alternate steps: 1) the updating of the graph structure as a function of the current data representation, where the graph structure is assumed to encode a finite number of community structures, and 2) the updating of the data representation as a function of the current graph structure in a low-dimensional embedding space. We term this solution dynamic graph embedding (DGE). We provide illustrative experiments on synthetic data, and we validate the proposed approach on two real world benchmark datasets for first person image sequence temporal segmentation. Our framework is the first attempt to learn simultaneously graph structure and data representations for temporal segmentation of image sequences.

Our main contributions are: (i) we re-frame the event learning problem as the problem of learning a graph embedding, (ii) we introduce an original graph initialization approach based on the concept of temporal self-similarity, (iii) we propose a novel technical approach to solve the graph embedding problem when the underlying graph structure is unknown, (iv) we demonstrate that the learnt graph embedding is suitable for the task of temporal segmentation, achieving state-of-the-art results on two challenging reference benchmark datasets [14, 4], without relying on any training set for learning the temporal segmentation model.

The structure of the paper is as follows. Section 2 highlights related work on data representation learning on graphs and on the temporal segmentation of videos and image sequences. In Section 3.1 we introduce our problem formulation, while in Sections 3.2 to 3.5 we detail the proposed graph embedding model. The performance of our algorithm on real world data is evaluated in Section 4. In Section 5, we conclude on our contributions and results.

2 Related work

2.1 Geometric learning

The proposed approach lies in the field of geometric learning, which is an umbrella term for those techniques that work in non-Euclidean domains such as graphs and manifolds. Following [5], geometric learning approaches either address the problem of characterizing the structure of the data, or deal with analyzing functions defined on a given non-Euclidean domain. In the former case, which is more closely related to the method proposed in this paper, the goal is to learn an embedding of the data in a low-dimensional space, such that the geometric relations in the embedding space reflect the graph structure. These methods are commonly referred to as node embedding and can be understood from an encoder-decoder perspective [25]. Given a graph G = (V, E), where V and E represent the sets of nodes and edges of the graph respectively, the encoder maps each node of V to a point in a low-dimensional space. The decoder is a function defined in the embedding space that acts on node pairs to compute a similarity between the nodes. Therefore, the graph embedding problem can be formulated as the problem of optimizing decoder and encoder mappings to minimize the discrepancy between similarity values in the embedding and original feature spaces. Within the general encoder-decoder framework, node embedding algorithms can be roughly classified into two classes: shallow embeddings [1, 41, 43, 22, 50, 10, 7] and generalized encoder-decoder architectures [8, 52, 27, 24]. In shallow embedding approaches, the encoder function acts simply as a lookup function and the input nodes are represented as one-hot vectors, so that node attributes cannot be leveraged during encoding.

Instead, in generalized encoder-decoder architectures [8, 52, 27], the encoders depend more generally on the structure and attributes of the graph. In particular, convolutional encoders [24] generate embeddings for a node by aggregating information from its local neighborhood, in a manner similar to the receptive field of a convolutional kernel in computer vision. They rely on node features to generate embeddings and, as the process iterates, the node embedding contains information aggregated from further and further reaches of the graph. Closely related to convolutional encoders are Graph Neural Networks (GNNs). The main difference is that GNNs capture the dependence structure of graphs via message passing between the nodes and utilize node attributes and node labels to train model parameters end-to-end for a specific task in a semi-supervised fashion [26, 20, 12, 39].

In all these methods, the graph structure is assumed to be given by the problem domain. For instance in social networks, the structure of the graph is given by the connection between people. However, in the case of temporal segmentation considered here, the problem is non-structural since the graph structure is not given by the problem domain but needs to be determined together with the node embedding.

2.2 Event segmentation

Extensive research has been conducted to temporally segment videos and image sequences into events. Early approaches aimed at segmenting edited videos such as TV programs and movies [31, 56, 9, 37, 36] into commercial, news-related or movie events. More recently, with the advent of wearable cameras and camera equipped smartphones, there has been an increasing interest in segmenting untrimmed videos or image sequences captured by nonprofessionals into semantically homogeneous units [28, 51, 30]. In particular, videos or image sequences captured by a wearable camera are typically long and unconstrained [3], therefore it is important for the user to have them segmented into semantically meaningful chapters. In addition to appearance-based features [2, 54], motion features have been extensively used for temporal segmentation of both third-person videos [32, 51] and first-person videos [29, 47, 33, 44]. In [47], motion cues from a wearable inertial sensor are leveraged for temporal segmentation of human motion into actions. Lee and Grauman [33] used temporally constrained clustering of motion and visual features to determine whether the differences in appearance correspond to event boundaries or just to abrupt head movements. Poleg et al. [44] proposed to use integrated motion vectors to segment egocentric videos into a hierarchy of long-term activities, whose first level corresponds to static/transit activities.

However, motion information is not available in first-person image sequences, which are the main focus of this paper. In addition, given the limited amount of annotated data, event segmentation is very often performed by using a clustering algorithm relying on hand-crafted visual features [16, 17, 34, 55, 35]. Talavera et al. [49] proposed to combine agglomerative clustering with a change detection method within a graph-cut energy minimization framework. Later on, [14] extended this framework by improving the feature representations through building a vocabulary of concepts for representing each frame. Paci et al. [42] proposed a Siamese ConvNet based approach that aims at learning a similarity function between low-resolution egocentric images. Recently, del Molino et al. [18] proposed to learn event representations by training, in an unsupervised fashion, an LSTM-based model to predict the temporal context of each frame, which is initially represented by the output of the pre-pooling layer of InceptionV3. This simple approach has shown outstanding results on reference benchmark datasets [14, 4], as long as the model is trained on a large training dataset (over 1.2 million images). The method in [18] is similar to the chronologically earlier approach reported in [13], which proposed to train the model in a fully self-supervised fashion instead. Specifically, starting from a concept vector representation of each frame, the authors proposed a neural network based model and an LSTM-based model performing a self-supervised pretext task that consists in predicting the concept vectors of neighboring frames given the concept vector of the current frame. Consequently, unlike [18], the single image sequence itself is used to learn the features encoding the temporal context, without the need for a training dataset. The performance achieved in [13] was, however, less impressive than that of [18].

Here, we propose a new model that, like [13], does not make use of any training set for learning the temporal event representation, but achieves state-of-the-art results. In particular, it outperforms [18] on the EDUBSeg and EDUBSeg-Desc benchmarks [14, 4].

3 Dynamic Graph Embedding (DGE)

3.1 Problem formulation and proposed model

We formulate the event learning problem as a geometric learning problem. More specifically, given a set of data points (the frames of the image sequence) embedded into a high-dimensional Euclidean space (the initial data representation), we assume that these data points are organized in an underlying low-dimensional graph structure. Our a priori on the structure of the graph in the underlying low-dimensional space is that it consists of a finite number of community structures corresponding to different temporal contexts. Since, along an image sequence, the same temporal context can be visited several times at different time intervals, we assume that edges between nodes belonging to different communities correspond to transitions between different temporal contexts, whereas edges between nodes belonging to the same community correspond to transitions between nodes sharing the same temporal context, whether or not they are temporally adjacent. This structure implicitly assumes that the graph topology models jointly temporal and semantic similarity relations.

More formally, given a sequence of N images and their feature vectors X = (x_1, …, x_N), x_t ∈ R^D, we aim at finding a fully connected, weighted graph G with node embedding Y = (y_1, …, y_N) in a low-dimensional space R^d, d ≪ D, and edge weights given by the entries of an affinity matrix A(Y) with elements κ(y_i, y_j), where κ is a similarity kernel, such that the similarity between any pair of nodes of the graph reflects both semantic relatedness and temporal adjacency between the corresponding images. Semantic relatedness is captured by a similarity function between semantic image descriptors, whereas temporal adjacency is imposed through temporal constraints injected in the graph structure. The above constraints lead to easily grouping the graph nodes into a finite number of clusters, each corresponding to a different temporal context. As seen in the previous section, in classical node embedding the low-dimensional representation of each node encodes information about the position and the structure of its local neighborhood in the graph. Since all these methods incorporate the graph structure in some way, the construction of the underlying graph is extremely important, but relatively little explored. In our problem at hand, the graph structure is initially unknown since it arises from the events, which are themselves unknown. Therefore, we aim at learning jointly the structure of the underlying graph and the node embedding.

3.2 Graph initialization by nonlocal self-similarity

Temporal nonlocal self-similarity.  Let X = (x_1, …, x_N) be the set of given data points in a high-dimensional space R^D, normalized to the interval [0, 1]. To obtain a first coarse estimate of the graph, we apply a nonlocal self-similarity algorithm in the temporal domain to the initial data [15]. The nonlocal self-similarity filtering creates temporal neighborhoods of frames that are likely to be in the same event. Let x_t denote the t-th row of X, that is, the vector of image features at time t, t = 1, …, N. Further, let P_t and S_t denote the index sets of the patch and search temporal neighborhoods of x_t, respectively, with P_t ⊂ S_t. In analogy with 2D data (images) [6], the self-similarity function of x_t in a temporal sequence, conditioned to its temporal neighborhood S_t, is given by the quantity [15]

w(t, s) = (1 / Z(t)) exp( − d(P_t, P_s) / h ),   s ∈ S_t,      (1)

where Z(t) is a normalizing factor such that Σ_{s ∈ S_t} w(t, s) = 1, ensuring that w(t, s) can be interpreted as a conditional probability of x_s given x_t, as detailed in [6]; d(P_t, P_s) is the sum of the distances between the feature vectors in the neighborhoods of x_t and x_s, and h is the parameter that tunes the decay of the exponential function. The key idea of our graph initialization is to model each frame by its denoised version, obtained as

x̂_t = Σ_{s ∈ S_t} w(t, s) x_s.      (2)

A numerical illustration on real data is provided in Fig. 4.
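To make this denoising step concrete, the following sketch implements a minimal temporal NL-means filter in the spirit of Eqs. (1)-(2). The patch size, search radius and decay parameter are illustrative placeholders, not the values used in the paper.

import numpy as np

def nlmeans_temporal(X, patch=2, search=10, h=0.5):
    """Temporal NL-means denoising of a frame feature sequence X (N x D),
    a sketch of Eqs. (1)-(2); parameter values are illustrative."""
    N = X.shape[0]
    X_hat = np.zeros_like(X)
    for t in range(N):
        # search neighbourhood S_t around frame t
        s_idx = np.arange(max(0, t - search), min(N, t + search + 1))
        d = np.zeros(len(s_idx))
        for k, s in enumerate(s_idx):
            # sum of distances between the patches centred at t and s
            for o in range(-patch, patch + 1):
                ti, si = np.clip(t + o, 0, N - 1), np.clip(s + o, 0, N - 1)
                d[k] += np.linalg.norm(X[ti] - X[si])
        w = np.exp(-d / h)
        w /= w.sum()                      # normalisation Z(t), Eq. (1)
        X_hat[t] = w @ X[s_idx]           # denoised frame, Eq. (2)
    return X_hat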

Initial graph and initial embedding.  An initial graph G^X is obtained by computing the affinity matrix of the denoised features X̂ = (x̂_1, …, x̂_N), defined elementwise as the pairwise similarity

G^X_{ij} = κ(x̂_i, x̂_j) = exp( − d_cos(x̂_i, x̂_j) / h_X ),      (3)

where d_cos is the cosine distance and h_X the filtering parameter of the exponential function. In the following, we will no longer distinguish between a graph and its representation by its affinity matrix G, and we use both symbols synonymously. In our model, G^X represents the initial data structure in the original high-dimensional space as a fully connected graph, from which we would like to learn the graph in the embedding space, say G^Y, that better encodes the temporal constraints. To obtain an initial embedding for the graph, we apply PCA on X̂, keep the d major principal components, and refine them by minimizing the cross-entropy loss between the affinity matrices A(Y) and G^X, where different filtering parameters h_X and h_Y account for the different dimensionality of X̂ and Y. Even if PCA is a linear operator and for small sets of high-dimensional vectors dual PCA could be more appropriate [21], we found it sufficient here for initializing the algorithm. The resulting embedding Y^(0) defines the initial graph in the embedding space, G^(0) = A(Y^(0)).
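A possible realization of this initialization, with the affinity of Eq. (3) built from cosine distances and the embedding seeded by PCA, is sketched below. The filtering parameters and the embedding dimension are placeholders, and the cross-entropy refinement of the PCA seed is omitted for brevity.

import numpy as np
from sklearn.decomposition import PCA

def cosine_affinity(F, h=0.5):
    """Affinity matrix with entries exp(-d_cos(f_i, f_j) / h), cf. Eq. (3)."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    d_cos = 1.0 - Fn @ Fn.T              # cosine distance
    return np.exp(-d_cos / h)

def init_graph_and_embedding(X_hat, d=15, h_X=0.5, h_Y=0.5):
    """Initial graph in the original space and PCA-seeded embedding."""
    G_X = cosine_affinity(X_hat, h_X)                  # graph in the original space
    Y0 = PCA(n_components=d).fit_transform(X_hat)      # initial embedding
    G0 = cosine_affinity(Y0, h_Y)                      # initial graph in embedding space
    return G_X, Y0, G0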

3.3 DGE core alternating steps

Given the initial embedding Y^(0) and graphs G^X and G^(0), the main loop of our DGE alternates over the following two steps:

  1. Assuming that the graph G^(k−1) is fixed, update the node representations Y^(k).

  2. Assuming that the representation Y^(k) is fixed, update the graph G^(k).

N : length of the image sequence
D : original feature dimension
d : embedding feature dimension (d ≪ D)
Input : initial feature matrix X ∈ R^(N×D)
Output : graph embedded feature matrix Y ∈ R^(N×d)
/* Reference graph initialization, Eqs. (1)-(3) */
X̂ ← NL-means(X)   /* denoise initial features */
G^X ← A(X̂)   /* initialize graph in original space */
/* Graph embedding initialization, Eq. (3) */
Y^(0) ← PCA(X̂)   /* initialize embedding features */
G^(0) ← A(Y^(0))   /* initialize graph in embedding space */
/* DGE core loop, Eqs. (4)-(6) */
for k ← 1 to n_it do
        /* --- graph embedding update */
        Y^(k) ← minimizer of Eq. (4)   /* update embedding features given current graph G^(k−1) */
        /* --- graph structure update: temporal prior */
        G ← local average of the edge weights of A(Y^(k))
        G ← Eq. (5) applied to G   /* strengthen temporally adjacent edges */
        /* --- graph structure update: semantic prior */
        c ← cluster labels estimated from Y^(k)   /* estimate semantic communities */
        G^(k) ← Eq. (6) applied to G   /* encode semantic similarity in graph */
end for
ALGORITHM 1: Dynamic Graph Embedding

Step (1) is inspired by graph embedding methods, such as the ones reviewed in Section 2.1, that have proved to be very good at encoding a given graph structure. Step (2) aims at enforcing temporal constraints and at fostering semantic similarity in the graph structure.

Graph embedding update.  To estimate the graph embedding Y^(k) at iteration k, assuming that the graph G^(k−1) is given, we solve

Y^(k) = argmin_Y  L( A(Y), G^(k−1) ) + λ L( A(Y), G^X ),      (4)

where L(·, ·) denotes the cross-entropy loss and A(Y) is the affinity matrix built from the cosine-based similarity defined in (3). The first loss term controls the fit of the representation with the learnt graph in the low-dimensional embedding space. The second loss term quantifies the fit of the representation with the fixed initial graph in the high-dimensional space and is reminiscent of shallow graph embedding; λ is a regularization parameter that controls the relative weight of each loss. Standard gradient descent can be used to solve (4).
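As an illustration, the embedding update of Eq. (4) can be written as a cross-entropy objective between affinity matrices and minimized numerically. The sketch below reuses the cosine_affinity helper defined above and relies on numerically approximated gradients, so it is only practical for small toy examples; the weight lam and the other parameters are placeholders, not the paper's settings.

import numpy as np
from scipy.optimize import minimize

def cross_entropy(A, G, eps=1e-6):
    """Cross-entropy between two affinity matrices with entries in (0, 1]."""
    A = np.clip(A, eps, 1 - eps)
    return -np.sum(G * np.log(A) + (1 - G) * np.log(1 - A))

def update_embedding(Y, G_prev, G_X, h_Y=0.5, lam=0.1, n_iter=50):
    """One embedding update in the spirit of Eq. (4): fit the affinities of Y
    to the current graph G_prev and, with weight lam, to the initial graph G_X."""
    shape = Y.shape

    def objective(y_flat):
        A = cosine_affinity(y_flat.reshape(shape), h_Y)
        return cross_entropy(A, G_prev) + lam * cross_entropy(A, G_X)

    res = minimize(objective, Y.ravel(), method="L-BFGS-B",
                   options={"maxiter": n_iter})   # gradients approximated numerically
    return res.x.reshape(shape)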

Graph structure update.  To obtain an estimate of the graph structure at the k-th iteration, say G^(k), assuming that the embedding Y^(k) is given, we start from the initial estimate A(Y^(k)) and then make use of the model assumptions described in Section 3.1 to modify the graph: temporal adjacency and semantic similarity.

i) To foster similarity between temporally adjacent nodes, we apply two operations. First, local averaging of the edge weights, G ← G ∗ K, where ∗ is the 2D convolution operator and K a kernel that is here simply the normalized bump function. Second, and more importantly, we apply a non-linear operation to G that scales the weights of edges between nodes i and j that are not direct temporal neighbors (i.e., for which |i − j| > 1) by a factor (1 − α), 0 < α < 1, while leaving the similarities of directly temporally adjacent nodes unchanged:

G_{ij} ← G_{ij}  if |i − j| ≤ 1,   G_{ij} ← (1 − α) G_{ij}  otherwise,      (5)

thus strengthening the temporal adjacency of the graph.
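A sketch of this temporal part of the graph update might look as follows; the bump kernel is approximated by a simple normalized box filter, and both the kernel size and the factor alpha are placeholders.

import numpy as np
from scipy.ndimage import uniform_filter

def temporal_graph_update(A, size=3, alpha=0.3):
    """Temporal prior: local averaging of edge weights followed by
    down-weighting of edges between non-adjacent frames, cf. Eq. (5)."""
    G = uniform_filter(A, size=size, mode="nearest")   # 2D local average of weights
    N = G.shape[0]
    i, j = np.indices((N, N))
    non_adjacent = np.abs(i - j) > 1
    G[non_adjacent] *= (1.0 - alpha)                   # strengthen temporal adjacency
    return G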

ii) To reinforce the semantic similarity encoded in G, we first obtain a coarse estimate of the community structure of the graph. To this end, we apply a clustering algorithm to Y^(k), which yields estimated cluster labels c_t, t = 1, …, N, for each frame. Then we update G by applying to it a non-linear function that scales the similarity between nodes i and j that do not belong to the same cluster, c_i ≠ c_j, by a factor (1 − β), 0 < β < 1, and does not change similarities within clusters, i.e.,

G_{ij} ← G_{ij}  if c_i = c_j,   G_{ij} ← (1 − β) G_{ij}  otherwise.      (6)

For β > 0, this operation hence reinforces within-event semantic similarity.
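Analogously, the semantic part can be sketched with k-means providing the coarse community estimate; the number of clusters and the penalty beta are illustrative placeholders.

import numpy as np
from sklearn.cluster import KMeans

def semantic_graph_update(G, Y, n_clusters=10, beta=0.1, seed=0):
    """Semantic prior: down-weight edges between frames assigned to
    different clusters of the current embedding, cf. Eq. (6)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(Y)
    different_cluster = labels[:, None] != labels[None, :]
    G = G.copy()
    G[different_cluster] *= (1.0 - beta)   # reinforce within-event similarity
    return G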

DGE aims at revealing the temporal and semantic relatedness of each pair of data vectors, and therefore the estimated graphs G^(k) are fully connected at each stage. A high-level overview of our DGE approach is given in Algorithm 1.
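Putting the pieces together, the core loop of Algorithm 1 could be assembled from the helper functions sketched above (assuming they are in scope); the iteration count and all parameters are again placeholders rather than the values used in the experiments.

def dge(X, n_iter=2, d=15, lam=0.1, alpha=0.3, beta=0.1, n_clusters=10):
    """Dynamic Graph Embedding: alternate embedding and graph updates,
    assembled from the helper sketches above (all parameters are placeholders)."""
    X_hat = nlmeans_temporal(X)                            # Eqs. (1)-(2)
    G_X, Y, G = init_graph_and_embedding(X_hat, d)         # Eq. (3) + PCA seed
    for _ in range(n_iter):
        Y = update_embedding(Y, G, G_X, lam=lam)           # Eq. (4)
        A = cosine_affinity(Y)                             # re-estimate affinities
        G = temporal_graph_update(A, alpha=alpha)          # Eq. (5) + local averaging
        G = semantic_graph_update(G, Y, n_clusters, beta)  # Eq. (6)
    return Y, G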

3.4 Graph post-processing: Event boundary detection

Depending on the problem, applicative context and objective, different standard and graph signal processing tools can be applied to the estimated graph in order to extract the desired information or to transform it [53, 40]. In this work, the focus is on event detection in image sequences, and we thus directly use the representation learnt at the final DGE iteration in a contextual event boundary detector.

Boundary detector.  Since the focus of this paper is on how to improve event representation and not on the temporal segmentation algorithm itself, we use the same boundary detector as in [18]. This boundary detector is based on the idea that the larger the distance between the visual contexts of a frame t, once computed forward (from past frames) and once backward (from future frames), the more likely this frame corresponds to an event boundary. Hence, the boundary prediction function is defined as the cosine distance between the past and future contexts of frame t, where the past and future contexts are defined as the averages of the representation vectors of a small number of the previous and following frames, respectively. Those frames for which the values of the boundary prediction function exceed a threshold are the detected event boundaries, see [18] for details. We use the same parameter values and thresholds as in [18].

Figure 3: Illustration of DGE on synthetic data (Gaussian mixture with four components): (a) scatter plots of the features at successive DGE iterations (left to right columns); mixture components are indicated with color, solid lines indicate temporal adjacency; (b) the features plotted as time series, with estimated segment boundaries (vertical dashed lines) and component number ground truth (solid black); (c) the corresponding graphs and (d) the graphs after pruning edges with weight smaller than 0.7 times that of the strongest edge (yellow and blue correspond to strong and weak similarity, respectively); (e) F-score, precision and recall for the estimated segmentation as a function of the iteration number.

Hereafter we call our temporal segmentation model relying on the features learnt by using the proposed DGE approach CES-DGE, in analogy with CES-VCP in [18] (where CES stands for contextual event segmentation and VCP for visual context prediction).
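For completeness, a minimal version of this contextual boundary detector on top of the learnt representation could look as follows; the context length and the peak threshold are placeholders rather than the values of [18].

import numpy as np

def boundary_scores(Y, context=5):
    """Cosine distance between past and future contexts of each frame."""
    N = Y.shape[0]
    b = np.zeros(N)
    for t in range(context, N - context):
        past = Y[t - context:t].mean(axis=0)             # backward (past) context
        future = Y[t + 1:t + 1 + context].mean(axis=0)   # forward (future) context
        cos = past @ future / (np.linalg.norm(past) * np.linalg.norm(future) + 1e-12)
        b[t] = 1.0 - cos                                  # cosine distance
    return b

def detect_boundaries(Y, context=5, threshold=0.3):
    """Frames whose boundary score exceeds a threshold are event boundaries."""
    b = boundary_scores(Y, context)
    return np.where(b > threshold)[0]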

3.5 Numerical illustration of CES-DGE on synthetic data

To illustrate the main idea behind our modeling, in Fig. 3 we show with a synthetic low-dimensional example how the original data and the associated initial graph change over the iterations of the DGE algorithm (here, the data are used directly, i.e., without the preprocessing of Section 3.2). The data consist of a temporal sequence of feature vectors drawn from a Gaussian mixture with four equi-probable components with distinct mean vectors and diagonal covariance matrices. To model the community structures in the data, after each switch to a new state the state is maintained with a probability that decreases at an exponential rate with time. In Fig. 3, panel (a) shows the learnt representations as scatter plots for successive iterations, where colors indicate the true cluster labels and connections between data points indicate temporal adjacency; panel (b) plots the representations as time series, together with event boundary detections as dashed bars and ground truth labels as solid bars; panel (c) shows the corresponding graphs and panel (e) summarizes boundary detection performance as a function of the iteration number; in addition, to highlight the estimated event graph structure, panel (d) plots a pruned version of the graph in which all edges with weight below 70% of that of the strongest edge are removed. Clearly, it is difficult to obtain a good segmentation for the initial data (left column). It can be appreciated how, with a few DGE iterations, the embedded data points belonging to the four different components are pushed apart, while the distance between temporally adjacent points belonging to the same component is reduced, increasing the similarity within each temporal context and revealing the underlying data generating mechanism (Fig. 3 (a-b), columns 2 to 4). Moreover, the learnt representation remains faithful to the temporal structure within each segment (e.g., the position and relative size of local maxima and minima are not changed). An alternative view is provided by the corresponding learnt graphs in Fig. 3 (c-d), for which increasingly homogeneous diagonal and off-diagonal blocks with sharp boundaries emerge as the number of iterations grows, reflecting the temporal community structure underlying the data. This improved representation leads in turn to significantly better event segmentation results, with the F-score increasing steadily over the DGE iterations (cf. Fig. 3 (e)).
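The synthetic sequence of Fig. 3 can be emulated with a few lines; the component means, noise level and switching-probability decay below are arbitrary stand-ins, since the exact values used for the figure are not reproduced here.

import numpy as np

def synthetic_sequence(n_frames=200, seed=0):
    """Temporal sequence drawn from a 4-component Gaussian mixture,
    with a state-holding probability that decays exponentially over time."""
    rng = np.random.default_rng(seed)
    means = np.array([[0, 0], [3, 0], [0, 3], [3, 3]], dtype=float)  # arbitrary means
    sigma, decay = 0.3, 0.05
    labels, X = [], []
    state, age = rng.integers(4), 0
    for _ in range(n_frames):
        X.append(rng.normal(means[state], sigma))
        labels.append(state)
        # probability of keeping the current state decays with its age
        if rng.random() > np.exp(-decay * age):
            state, age = rng.integers(4), 0
        else:
            age += 1
    return np.array(X), np.array(labels)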

Parameter (varied)                 Values: F-scores
DGE iterations                     1: 0.655    2: 0.698    3: 0.682    4: 0.671    5: 0.662
Initial graph memory               0.05: 0.679    0.1: 0.698    0.2: 0.690    0.3: 0.685    0.4: 0.682
2D local average size              2: 0.691    3: 0.698    4: 0.691    5: 0.687    6: 0.679
Graph temporal regularization      0.01: 0.685    0.1: 0.688    0.3: 0.698    0.4: 0.686    0.6: 0.672
Extra-cluster penalty              0.03: 0.675    0.05: 0.697    0.1: 0.698    0.2: 0.687    0.3: 0.670
Cluster number                     3: 0.650    4: 0.656    6: 0.666    8: 0.682    10: 0.698
                                   12: 0.686    14: 0.685    16: 0.677    18: 0.674    20: 0.676

Table 1: Robustness of CES-DGE with respect to hyperparameter values (EDUBSeg, tolerance 5): F-scores obtained when varying the DGE hyperparameter indicated in the first column, with all others held fixed (best result per row: 0.698).

Method F-score Rec Prec
k-Means smoothed 0.51 0.39 0.82
AC-color 0.38 0.25 0.90
SR-ClusteringCNN 0.53 0.68 0.49
KTS 0.53 0.40 0.87
CES-VCP 0.69 0.77 0.66
CES-DGE 0.70 0.70 0.72
Manual segmentation 0.72 0.68 0.80
Table 2: Comparison of CES-DGE with state-of-the-art methods and manual segmentation for EDUBSeg, tolerance 5.
Method       CES-VCP                   CES-DGE
Tolerance    F-score   Rec    Prec     F-score   Rec    Prec
5            0.69      0.77   0.66     0.70      0.70   0.72
4            0.67      0.75   0.63     0.68      0.69   0.70
3            0.64      0.62   0.71     0.65      0.64   0.68
2            0.59      0.67   0.56     0.59      0.59   0.61
1            0.44      0.44   0.49     0.48      0.48   0.50
Table 3: Comparison of CES-DGE with state-of-the-art CES-VCP for different levels of tolerance on EDUBSeg.

4 Performance evaluation

4.1 Datasets and experimental setup

Datasets. 

We used two temporal segmentation benchmark datasets for performance evaluation. The EDUBSeg dataset, introduced in [14], consists of 20 first-person image sequences acquired by seven people, with a total of 18,735 images. The dataset has been used to validate image sequence temporal segmentation in several recent works [14, 13, 42, 18]. For each image sequence, manual segmentations have been obtained by three different persons; in line with previous works, the first segmentation was used here as ground truth. The second dataset is the larger and more recent EDUBSeg-Desc dataset [4], with 46 sequences (42,947 images) acquired by eight people. Every frame of the egocentric image sequences is described using the output of the pre-pooling layer of InceptionV3 [48] pretrained on ImageNet, as in [18], yielding the initial frame features.

On average, the sequences of both datasets contain roughly the same number of ground truth event boundaries (28 for the former vs. 26 for the latter), but those of the EDUBSeg-Desc dataset consist of about 25% longer continuously recorded segments than those of EDUBSeg (3h46m25s vs. 3h1m29s of continuous “camera on” time, and 3.0 vs. 3.55 continuously recorded segments per sequence). Because of this increased continuity, with a larger number of harder-to-detect continuous event transitions, EDUBSeg-Desc is considered the more difficult dataset.

Other publicly available datasets of egocentric image sequences, such as CLEF [11], NTCIR [23] and the more recent R3 [18], do not have ground truth event segmentations. They can therefore not be used for performance evaluation. In [18], these datasets, comprising more than 1.2 million images, were used as training sets to learn the temporal context representation. We emphasize that, in contrast, our algorithm operates in a fully self-supervised manner, without any training dataset.

Performance evaluation.  Following previous work, we use F-score, Precision (Prec) and Recall (Rec) to evaluate the performance of our approach. In previous works [42, 14, 18, 13], results have been reported for a single level of tolerance (i.e., the length of the time interval within which a detected event boundary is considered correct). Here, we report results for several levels of tolerance.
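This evaluation protocol can be summarized by the following sketch, which matches each detected boundary to at most one ground-truth boundary within the given tolerance; it is a straightforward reading of the criterion, not necessarily the exact matching code used in the cited works.

import numpy as np

def prf_with_tolerance(detected, ground_truth, tol=5):
    """Precision, recall and F-score where a detection is correct if it lies
    within `tol` frames of an unmatched ground-truth boundary."""
    gt_free = list(ground_truth)
    tp = 0
    for d in sorted(detected):
        hits = [g for g in gt_free if abs(g - d) <= tol]
        if hits:
            gt_free.remove(min(hits, key=lambda g: abs(g - d)))
            tp += 1
    prec = tp / max(len(detected), 1)
    rec = tp / max(len(ground_truth), 1)
    f = 2 * prec * rec / max(prec + rec, 1e-12)
    return f, rec, prec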

DGE hyperparameters.  The hyperparameter values for the graph initialization and embedding (i.e., for the nonlocal self-similarity kernel (1) and for the similarity (3)) have been chosen a priori, based on visual inspection of the corresponding similarity matrices for a couple of sequences of EDUBSeg, and kept fixed for all experiments. The embedding dimension is set to a small value that is found sufficient for the representation to faithfully reproduce the graph topology underlying the data (larger values were tested but did not yield performance gains). The DGE core loop hyperparameters control the initial graph memory, the temporal prior (local average size and temporal edge weighting) and the semantic prior (extra-cluster penalty and cluster number; a k-means algorithm is used to estimate the cluster labels); the robustness of this choice is investigated in the next section.

4.2 Robustness to changes in hyperparameter values

Tab. 1 reports F-score values obtained on the EDUBSeg dataset when the number of DGE iterations and the DGE core parameters (initial graph memory, local average size, temporal regularization, extra-cluster penalty and cluster number) are varied individually. It can be appreciated that the performance of CES-DGE is overall very robust w.r.t. the precise hyperparameter values. The highest sensitivity is observed for the number of DGE iterations, which should be kept small (between two and four) so that F-score values remain within 3% of the best observed F-score. Results are very robust w.r.t. the temporal regularization parameters for reasonably small values; for larger values the learnt representation is over-smoothed. Similarly, F-score values vary little when the extra-cluster penalty is changed within a moderate range. Finally, the number of clusters used to estimate semantic similarity can be chosen within a reasonable range (F-score values drop by less than 3% for cluster numbers between 8 and 20). Overall, these results suggest that CES-DGE is quite insensitive to hyperparameter tuning and yields robust segmentation for a wide range of parameter values.

4.3 Comparative results for EDUBSeg dataset

Tab. 2 reports comparisons with five state-of-the-art methods for a fixed level of tolerance (results reproduced from [18]). The first four (smoothed k-means, AC-color [33], SR-Clustering [14], KTS [45]) are standard/generic approaches and achieve modest performance, with F-scores no better than 0.53. CES-VCP of [18] yields a significantly better F-score of 0.69, thanks to the use of a large training set for learning the event representation. The proposed CES-DGE approach further improves on this state-of-the-art result and yields an F-score of 0.70. Interestingly, CES-DGE also achieves more balanced Rec and Prec values of 0.70 and 0.72, as compared to 0.77 and 0.66 for CES-VCP. Finally, even when compared to average manual segmentation performance, estimated by averaging the performance of the two remaining manual segmentations for EDUBSeg against the selected ground truth, our results are only 2% below.

In Table 3, we provide comparisons between our proposed approach and CES-VCP for different levels of tolerance. We can observe that CES-DGE achieves systematically better results than CES-VCP in terms of F-score for all levels of tolerance. These improvements with respect to the state of the art reach up to 4%. Besides, the Rec/Prec values for CES-DGE are more balanced and lie within 3% of the F-score for all tolerance levels (versus up to 8% of the F-score for CES-VCP).

Overall, this leads us to conclude that the proposed CES-DGE approach is effective in learning event representations for image sequences and yields robust segmentation results. These results are all the more remarkable considering that CES-DGE learns feature representations from the sequence itself, without relying on any training dataset.

4.4 Comparative results for EDUBSeg-Desc dataset

Tab. 4 summarizes the event boundary detection performance of the proposed CES-DGE approach and of CES-VCP for the larger EDUBSeg-Desc dataset; since CES-VCP reported state-of-the-art results (with F-score values 16% above other methods, cf. Tab. 2), we omit comparison with other methods in what follows for space reasons. The same (hyper)parameter values as for EDUBSeg are used. It can be observed that the performance of both CES-VCP and CES-DGE is inferior to that reported for the EDUBSeg dataset in the previous section, for all tolerance values, indicating that EDUBSeg-Desc contains more difficult image sequences. Interestingly, while the F-scores achieved by CES-VCP are up to 12% (and more than 11% on average) below those reported for EDUBSeg, the F-scores of the proposed CES-DGE approach are at worst 5% smaller. In other words, CES-DGE yields up to 8% (on average 6%) better F-score values than the state of the art for the EDUBSeg-Desc dataset. Our CES-DGE also yields systematically better Rec and Prec values, for all levels of tolerance. Overall, these findings corroborate those obtained for the EDUBSeg dataset and confirm the excellent practical performance of the proposed approach. The results further suggest that CES-DGE, by relying only on the image sequence itself for learning event representations, benefits from improved adaptability and robustness as compared to approaches that rely on training datasets.

Method       CES-VCP                   CES-DGE
Tolerance    F-score   Rec    Prec     F-score   Rec    Prec
5            0.57      0.59   0.60     0.65      0.67   0.65
4            0.56      0.58   0.58     0.63      0.66   0.63
3            0.52      0.54   0.54     0.60      0.62   0.60
2            0.49      0.50   0.50     0.54      0.56   0.54
1            0.43      0.44   0.45     0.45      0.46   0.45
Table 4: Comparison of CES-DGE with state-of-the-art CES-VCP for different levels of tolerance on EDUBSeg-Desc.

4.5 Ablation study

Graph initialization.  We investigate the performance obtained by applying the boundary detector to features obtained at different stages of our method: the original features (denoted CES-raw) and the features obtained by applying NL-means along the temporal dimension (CES-NLmeans 1D), both of the original dimension; the initial embedded features (CES-Embedding) and the features obtained after running the DGE main loop (CES-DGE), both of the embedding dimension. The results, obtained for the EDUBSeg dataset, are reported in Tab. 5. They indicate that CES-NLmeans 1D increases the F-score by 8% w.r.t. CES-raw, and CES-Embedding adds another 1%. CES-DGE gains an additional 9%, hence significantly improving upon this initial embedding. This confirms that both the graph initialization and the reduction of the dimension of the graph representation are beneficial.

An illustration of the effect of the graph initialization and of the DGE steps for EDUBSeg Subject 1 Day 3 is provided in Fig. 4. It can be observed how boundaries between temporally adjacent frames along the diagonal of the graph are successively enhanced when the original features (column 1) are replaced with their denoised version (column 2), with the embedded features (column 3), and finally with the DGE representation estimates (columns 4 & 5, for successive DGE iterations). Moreover, the boundaries of off-diagonal blocks, which correspond to segments of frames that resemble segments at other temporal locations, presumably of the same event, are sharpened, revealing the community structures.

DGE core operations.  In Tab. 6, we report the performance obtained on the EDUBSeg dataset when the different operations in the DGE core iterations are removed one by one by setting the respective parameter to zero: graph regularization, edge local averaging, temporal edge weighting, and extra-cluster penalization. It is observed that the overall DGE F-score drops by 0.4% to 3.2% when a single one of these operations is deactivated (versus a drop of 8.6%, from 0.698 to 0.612, when no DGE operation is performed at all, as discussed above). The fact that removing any of the operations individually does not lead to a knock-out of the DGE loop suggests that the associated temporal and semantic model assumptions each contribute independently to the performance. Among all operations, the largest individual F-score drop (3.2%) corresponds to deactivating the extra-cluster penalization. This points to the essential role of semantic similarity in the graph model. The smallest F-score difference is associated with edge local averaging; to improve this temporal regularization step, future work could use nonlocal instead of local averaging, or learnt kernels. However, the graph temporal edge weighting is effective in encoding the temporal prior (1.5% F-score drop if deactivated).

Method F-score Rec Prec
CES-raw 0.52 0.56 0.56
CES-NLmeans 1D 0.60 0.63 0.61
CES-Embedding 0.61 0.61 0.65
CES-DGE 0.70 0.70 0.72
Table 5: CES-DGE ablation study for EDUBSeg and tolerance 5.
Deactivated DGE operation        F-score   Difference with full DGE (0.698)
Graph regularization             0.685     0.013
Edge local averaging             0.694     0.004
Temporal edge weighting          0.683     0.015
Extra-cluster penalization       0.666     0.032
Table 6: F-scores obtained when single core steps are removed from DGE (by setting the respective parameter to zero).
Figure 4: Illustration of DGE on Subject 1 Day 3 of EDUBSeg. Panel (a): data and learnt representations with boundary estimates (red dashed vertical bars), ground truth (blue solid vertical bars) and resulting F-score values; from left to right: initial features (1st column), denoised features (2nd column), initial embedded features (3rd column), learnt representation after a few DGE iterations (4th & 5th columns). Panel (b) shows the corresponding graphs, and panel (c) the graphs after pruning edges with weight smaller than 0.3 times that of the strongest edge (dark blue corresponds to small weights and yellow to large weights).

5 Conclusion

This paper proposed a novel approach, named Dynamic Graph Embedding (DGE), to learn representations of events in low temporal resolution image sequences. Unlike state-of-the-art work, which requires (large) datasets for training the model, DGE operates in a fully self-supervised way and learns the temporal event representation for an image sequence directly from the sequence itself. To this end, we introduced an original model based on the assumption that the sequence can be represented as a graph that captures both the temporal and the semantic similarity of the images. The key novelty of our DGE approach is then to learn the structure of this unknown underlying data generating graph jointly with a low-dimensional graph embedding. This is, to the best of our knowledge, one of the first attempts to learn simultaneously graph structure and data representations for image sequences. Experimental results have shown that DGE yields robust and effective event representations and outperforms the state of the art in terms of event boundary detection precision, improving F-score values by 1% and 8% on the EDUBSeg and EDUBSeg-Desc event segmentation benchmark datasets, respectively. Future work will include exploring the use of more sophisticated methods than the k-means algorithm in the semantic similarity estimation step, as well as other applications in the field of multimedia, such as the temporal segmentation of wearable sensor data streams.

References

  • [1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola (2013) Distributed large-scale natural graph factorization. In Proc. 22nd Int. Conf. on World Wide Web, pp. 37–48. Cited by: §2.1.
  • [2] V. Bettadapura, D. Castro, and I. Essa (2016) Discovering picturesque highlights from egocentric vacation videos. In Proc. IEEE Winter Conf. Applications of Computer Vision (WACV), pp. 1–9. Cited by: §2.2.
  • [3] M. Bolanos, M. Dimiccoli, and P. Radeva (2017) Toward storytelling from visual lifelogging: an overview. IEEE Transactions on Human-Machine Systems 47 (1), pp. 77–90. Cited by: §2.2.
  • [4] M. Bolaños, Á. Peris, F. Casacuberta, S. Soler, and P. Radeva (2018) Egocentric video description based on temporally-linked sequences. Journal of Visual Communication and Image Representation 50, pp. 205–216. Cited by: §1, §2.2, §2.2, §4.1.
  • [5] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §2.1.
  • [6] A. Buades, B. Coll, and J. Morel (2005) A non-local algorithm for image denoising. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 60–65. Cited by: §3.2.
  • [7] S. Cao, W. Lu, and Q. Xu (2015) Grarep: learning graph representations with global structural information. In Proc. 24th ACM Int. Conf. on information and knowledge management, pp. 891–900. Cited by: §2.1.
  • [8] S. Cao, W. Lu, and Q. Xu (2016) Deep neural networks for learning graph representations. In Proc. 30th AAAI Conf. on Artificial Intelligence. Cited by: §2.1, §2.1.
  • [9] V. Chasanis, A. Kalogeratos, and A. Likas (2009) Movie segmentation into scenes and chapters using locally weighted bag of visual words. In Proc. ACM Int. Conf. on Image and Video Retrieval, pp. 35. Cited by: §2.2.
  • [10] H. Chen, B. Perozzi, Y. Hu, and S. Skiena (2018) Harp: hierarchical representation learning for networks. In Proc. 32nd AAAI Conf. on Artificial Intelligence, Cited by: §2.1.
  • [11] D. Dang-Nguyen, L. Piras, M. Riegler, G. Boato, L. Zhou, and C. Gurrin (2017) Overview of imagecleflifelog 2017: lifelog retrieval and summarization.. In CLEF (Working Notes), Cited by: §4.1.
  • [12] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 3844–3852. Cited by: §2.1.
  • [13] C. Dias and M. Dimiccoli (2018) Learning event representations by encoding the temporal context. In Proc. IEEE European Conf. on Computer Vision Workshops (ECCVW), pp. 587–596. External Links: ISBN 978-3-030-11015-4 Cited by: §1, §2.2, §2.2, §4.1, §4.1.
  • [14] M. Dimiccoli, M. Bolaños, E. Talavera, M. Aghaei, S. G. Nikolov, and P. Radeva (2017) Sr-clustering: semantic regularized clustering for egocentric photo streams segmentation. Computer Vision and Image Understanding 155, pp. 55–69. Cited by: §1, §2.2, §2.2, §4.1, §4.1, §4.2.
  • [15] M. Dimiccoli and H. Wendt (2019) Enhancing temporal segmentation by nonlocal self-similarity. In Proc. IEEE Int. Conf. on Image Processing (ICIP). In press., Cited by: §3.2.
  • [16] A. R. Doherty, D. Byrne, A. F. Smeaton, G. J. Jones, and M. Hughes (2008) Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs. In Proc. Int. Conf. on Content-based image and video retrieval, pp. 259–268. Cited by: §2.2.
  • [17] A. R. Doherty, C. Ó Conaire, M. Blighe, A. F. Smeaton, and N. E. O’Connor (2008) Combining image descriptors to effectively retrieve events from visual lifelogs. In Proc. 1st ACM Int. Conf. on Multimedia information retrieval, pp. 10–17. Cited by: §2.2.
  • [18] A. Garcia del Molino, J. Lim, and A. Tan (2018) Predicting visual context for unsupervised event segmentation in continuous photo-streams. In Proc. 26th ACM Int. Conf. on Multimedia, pp. 10–17. External Links: ISBN 978-1-4503-5665-7 Cited by: §1, §2.2, §2.2, §3.4, §3.4, §3.4, §4.1, §4.1, §4.1, §4.2.
  • [19] A. Garcia del Molino, C. Tan, J. Lim, and A. Tan (2017) Summarization of egocentric videos: a comprehensive survey. IEEE Transactions on Human-Machine Systems 47 (1), pp. 65–76. Cited by: §1.
  • [20] V. Garcia and J. Bruna (2018) Few-shot learning with graph neural networks. In Proc. Int. Conf. on Learning Representations (ICLR), Cited by: §2.1.
  • [21] G. B. Giannakis, Y. Shen, and G. V. Karanikolas (2018) Topology identification and learning over graphs: accounting for nonlinearities and dynamics. Proceedings of the IEEE 106 (5), pp. 787–807. Cited by: §3.2.
  • [22] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.1.
  • [23] C. Gurrin, H. Joho, F. Hopfgartner, L. Zhou, R. Gupta, R. Albatal, D. Nguyen, and D. Tien (2017) Overview of NTCIR-13 lifelog-2 task. In Proc. 13th Conf. on Evaluation of Information Access Technology, Cited by: §4.1.
  • [24] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034. Cited by: §2.1, §2.1.
  • [25] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §2.1.
  • [26] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §2.1.
  • [27] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. Cited by: §2.1, §2.1.
  • [28] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank (2011) A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41 (6), pp. 797–819. Cited by: §2.2.
  • [29] S. Huang, W. Wang, S. He, and R. W. Lau (2017) Egocentric temporal action proposals. IEEE Transactions on Image Processing 27 (2), pp. 764–777. Cited by: §2.2.
  • [30] L. H. Iwan and J. A. Thom (2017-01-01) Temporal video segmentation: detecting the end-of-act in circus performance videos. Multimedia Tools and Applications 76 (1), pp. 1379–1401. External Links: ISSN 1573-7721, Document Cited by: §2.2.
  • [31] E. Kijak, G. Gravier, L. Oisel, and P. Gros (2006) Audiovisual integration for tennis broadcast structuring. Multimedia Tools and Applications 30 (3), pp. 289–311. Cited by: §2.2.
  • [32] I. Koprinska and S. Carrato (2001) Temporal video segmentation: a survey. Signal Processing: Image Communication 16 (5), pp. 477–500. Cited by: §1, §2.2.
  • [33] Y. J. Lee and K. Grauman (2015) Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114 (1), pp. 38–55. Cited by: §2.2, §4.2.
  • [34] J. Lin, A. G. del Molino, Q. Xu, F. Fang, V. Subbaraju, and J. Lim (2017) Vci2r at the ntcir-13 lifelog semantic access task. Proc. NTCIR-13, Tokyo, Japan. Cited by: §2.2.
  • [35] W. Lin and A. Hauptmann (2006) Structuring continuous video recordings of everyday life using time-constrained clustering. In Multimedia Content Analysis, Management, and Retrieval 2006, Vol. 6073, pp. 60730D. Cited by: §2.2.
  • [36] C. Liu, D. Wang, J. Zhu, and B. Zhang (2013) Learning a contextual multi-thread model for movie/tv scene segmentation. IEEE Transactions on Multimedia 15 (4), pp. 884–897. Cited by: §2.2.
  • [37] N. Liu, Y. Zhao, Z. Zhu, and H. Lu (2011) Exploiting visual-audio-textual characteristics for automatic tv commercial block detection and segmentation. IEEE Transactions on Multimedia 13 (5), pp. 961–973. Cited by: §2.2.
  • [38] A. G. Money and H. Agius (2008) Video summarisation: a conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation 19 (2), pp. 121–143. Cited by: §1.
  • [39] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In Proc. Int. Conf. on Machine Learning (ICML), pp. 2014–2023. Cited by: §2.1.
  • [40] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst (2018) Graph signal processing: overview, challenges, and applications. Proceedings of the IEEE 106 (5), pp. 808–828. Cited by: §3.4.
  • [41] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu (2016) Asymmetric transitivity preserving graph embedding. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 1105–1114. Cited by: §2.1.
  • [42] F. Paci, L. Baraldi, G. Serra, R. Cucchiara, and L. Benini (2016) Context change detection for an ultra-low power low-resolution ego-vision imager. In Proc. IEEE European Conf. Computer Vision Workshops (ECCVW), pp. 589–602. Cited by: §2.2, §4.1, §4.1.
  • [43] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proc. 20th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 701–710. Cited by: §2.1.
  • [44] Y. Poleg, C. Arora, and S. Peleg (2014) Temporal segmentation of egocentric videos. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pp. 2537–2544. Cited by: §2.2.
  • [45] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid (2014) Category-specific video summarization. In Proc. European Conf. on Computer Vision (ECCV), pp. 540–555. Cited by: §4.2.
  • [46] A. C. Schapiro, T. T. Rogers, N. I. Cordova, N. B. Turk-Browne, and M. M. Botvinick (2013) Neural representations of events arise from temporal community structure. Nature Neuroscience 16 (4), pp. 486. Cited by: §1.
  • [47] E. H. Spriggs, F. De La Torre, and M. Hebert (2009) Temporal segmentation and activity classification from first-person sensing. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 17–24. Cited by: §2.2.
  • [48] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proc. IEEE Conf. Computer vision and pattern recognition (CVPR), pp. 2818–2826. Cited by: §4.1.
  • [49] E. Talavera, M. Dimiccoli, M. Bolanos, M. Aghaei, and P. Radeva (2015) R-clustering for egocentric video segmentation. In Proc. Iberian Conf. Pattern Recognition and Image Analysis, pp. 327–336. Cited by: §2.2.
  • [50] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In Proc. 24th Int. Conf. on World Wide Web, pp. 1067–1077. Cited by: §2.1.
  • [51] K. Tang, L. Fei-Fei, and D. Koller (2012) Learning latent temporal structure for complex event detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1250–1257. Cited by: §2.2.
  • [52] D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §2.1, §2.1.
  • [53] X. Wang, Y. Tang, S. Masnou, and L. Chen (2015) A global/local affinity graph for image segmentation. IEEE Transactions on Image Processing 24 (4), pp. 1399–1411. Cited by: §3.4.
  • [54] J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh (2015) Gaze-enabled egocentric video summarization via constrained submodular maximization. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 2235–2244. Cited by: §2.2.
  • [55] S. Yamamoto, T. Nishimura, Y. Akagi, Y. Takimoto, T. Inoue, and H. Toda (2017) Pbg at the ntcir-13 lifelog-2 lat, lsat, and lest tasks. Proc. NTCIR-13, Tokyo, Japan. Cited by: §2.2.
  • [56] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang (2007) A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology 17 (2), pp. 168–186. Cited by: §2.2.