1 Introduction
Temporal segmentation of videos and image sequences has a long history of research, since it is crucial not only to video understanding but also to video browsing, indexing and summarization [32, 38, 19]. With the proliferation of wearable cameras in recent years, the field is facing new challenges. Indeed, wearable cameras make it possible to capture, from a first-person (egocentric) perspective and "in the wild", long unconstrained videos (35fps) and image sequences (aka photostreams, 2fpm). Due to their low temporal resolution, the segmentation of first-person image sequences is particularly challenging. Indeed, abrupt changes in appearance may arise within an event due to sudden camera movements, and event transitions that are smooth at the higher sampling rate of continuous video recording are reduced to a few frames that are difficult to detect. While for a human observer it is relatively easy to segment egocentric image sequences into discrete units, this poses serious difficulties to automated temporal segmentation (see Fig. 1 for an illustration).
Given the limited amount of annotated data, current state-of-the-art approaches for the temporal segmentation of first-person image sequences aim at obtaining event representations by encoding the temporal context of each frame of a sequence in an unsupervised fashion [13, 18]. These methods rely on neural or recurrent neural networks and are generally based on the idea of learning event representations by training the network to predict past and future frame representations. Recurrent neural networks have proved to be more effective than simple neural networks for this temporal prediction task. The main limitation of these approaches is that they must rely on large training datasets to yield state-of-the-art performance. Even if, in this case, the training data do not require manual annotations, they can still introduce a bias and suffer from the domain adaptation problem. For instance, in the case of temporal segmentation of image sequences, the results will be difficult to generalize to data acquired with a camera with a different field of view, or from people with different lifestyles.
In this paper, we aim at overcoming this limitation with a novel approach that is able to unveil a representation encoding the temporal structure of an image sequence from the single sequence itself. With this goal in mind, we propose to learn event representations as an embedding on a graph. Our model is based on the assumption that each event belongs to a community structure of semantically similar events on an unknown underlying graph, where communities are understood here as sets of nodes that are interconnected by edges with large weights. Moreover, the graph weights reflect not only temporal proximity, but also semantic similarity. This is motivated by neuroscientific findings showing that neural representations of events arise from temporal community structures [46]. In other words, frames that share a temporal context are grouped together in the representational space. In Fig. 2 we illustrate the idea by means of an egocentric image sequence capturing the full day of a person: going from home to work using public transport, having a lunch break in a restaurant, going back home after doing some shopping, etc. Each point cloud corresponds to images that are similar in appearance, and most of the point clouds are visited multiple times. This means that every pair of images in a point cloud is related semantically, but may or may not be related at the temporal level.
Based on this model, the proposed solution consists of simultaneously learning the graph structure (encoded by its weights) and the data representation. This is achieved by iterating over two alternating steps: 1) updating the graph structure as a function of the current data representation, where the graph structure is assumed to encode a finite number of community structures, and 2) updating the data representation as a function of the current graph structure in a low-dimensional embedding space. We term this solution dynamic graph embedding (DGE). We provide illustrative experiments on synthetic data, and we validate the proposed approach on two real-world benchmark datasets for first-person image sequence temporal segmentation. Our framework is the first attempt to simultaneously learn graph structure and data representations for the temporal segmentation of image sequences.
Our main contributions are: (i) we reframe the event learning problem as the problem of learning a graph embedding, (ii) we introduce an original graph initialization approach based on the concept of temporal self-similarity, (iii) we propose a novel technical approach to solve the graph embedding problem when the underlying graph structure is unknown, (iv) we demonstrate that the learnt graph embedding is suitable for the task of temporal segmentation, achieving state-of-the-art results on two challenging reference benchmark datasets [14, 4], without relying on any training set for learning the temporal segmentation model.
The structure of the paper is as follows. Section 2 highlights related work on data representation learning on graphs and on the temporal segmentation of videos and image sequences. In Section 3.1 we introduce our problem formulation, while in Sections 3.2 to 3.5 we detail the proposed graph embedding model. The performance of our algorithm on real-world data is evaluated in Section 4. In Section 5, we conclude on our contributions and results.
2 Related work
2.1 Geometric learning
The proposed approach lies in the field of geometric learning, which is an umbrella term for techniques that work in non-Euclidean domains such as graphs and manifolds. Following [5], geometric learning approaches either address the problem of characterizing the structure of the data, or deal with analyzing functions defined on a given non-Euclidean domain. In the former case, which is more closely related to the method proposed in this paper, the goal is to learn an embedding of the data in a low-dimensional space, such that the geometric relations in the embedding space reflect the graph structure. These methods are commonly referred to as node embedding and can be understood from an encoder-decoder perspective [25]. Given a graph G = (V, E), where V and E represent the sets of nodes and edges of the graph, respectively, the encoder maps each node of V into a low-dimensional space. The decoder is a function defined in the embedding space that acts on node pairs to compute a similarity between the nodes. Therefore, the graph embedding problem can be formulated as the problem of optimizing decoder and encoder mappings to minimize the discrepancy between similarity values in the embedding and original feature spaces. Within the general encoder-decoder framework, node embedding algorithms can be roughly classified into two classes: shallow embeddings [1, 41, 43, 22, 50, 10, 7] and generalized encoder-decoder architectures [8, 52, 27, 24]. In shallow embedding approaches, the encoder function acts simply as a lookup function and the input nodes are represented as one-hot vectors, so that node attributes cannot be leveraged during encoding.
Instead, in generalized encoder-decoder architectures [8, 52, 27], the encoders depend more generally on the structure and attributes of the graph. In particular, convolutional encoders [24] generate embeddings for a node by aggregating information from its local neighborhood, in a manner similar to the receptive field of a convolutional kernel in computer vision. They rely on node features to generate embeddings and, as the process iterates, the node embedding contains information aggregated from further and further reaches of the graph. Closely related to convolutional encoders are Graph Neural Networks (GNNs). The main difference is that GNNs capture the dependencies in graphs via message passing between the nodes, and utilize node attributes and node labels to train model parameters end-to-end for a specific task in a semi-supervised fashion [26, 20, 12, 39]. In all these methods, the graph structure is assumed to be given by the problem domain. For instance, in social networks, the structure of the graph is given by the connections between people. However, in the case of temporal segmentation considered here, the problem is non-structural, since the graph structure is not given by the problem domain but needs to be determined together with the node embedding.
2.2 Event segmentation
Extensive research has been conducted on temporally segmenting videos and image sequences into events. Early approaches aimed at segmenting edited videos such as TV programs and movies [31, 56, 9, 37, 36] into commercial, news-related or movie events. More recently, with the advent of wearable cameras and camera-equipped smartphones, there has been an increasing interest in segmenting untrimmed videos or image sequences captured by non-professionals into semantically homogeneous units [28, 51, 30]. In particular, videos or image sequences captured by a wearable camera are typically long and unconstrained [3]; it is therefore important for the user to have them segmented into semantically meaningful chapters. In addition to appearance-based features [2, 54], motion features have been extensively used for the temporal segmentation of both third-person videos [32, 51] and first-person videos [29, 47, 33, 44]. In [47], motion cues from a wearable inertial sensor are leveraged for the temporal segmentation of human motion into actions. Lee and Grauman [33] used temporally constrained clustering of motion and visual features to determine whether differences in appearance correspond to event boundaries or just to abrupt head movements. Poleg et al. [44] proposed to use integrated motion vectors to segment egocentric videos into a hierarchy of long-term activities, whose first level corresponds to static/transit activities.
However, motion information is not available in first-person image sequences, which are the main focus of this paper. In addition, given the limited amount of annotated data, event segmentation is very often performed by using a clustering algorithm relying on hand-crafted visual features [16, 17, 34, 55, 35]. Talavera et al. [49] proposed to combine agglomerative clustering with a change detection method within a graph-cut energy minimization framework. Later on, [14] extended this framework by improving the feature representations through building a vocabulary of concepts for representing each frame. Paci et al. [42] proposed a Siamese-ConvNets-based approach that aims at learning a similarity function between low-resolution egocentric images. Recently, del Molino et al. [18] proposed to learn event representations by training, in an unsupervised fashion, an LSTM-based model to predict the temporal context of each frame, initially represented by the output of the pre-pooling layer of InceptionV3. This simple approach has shown outstanding results on reference benchmark datasets [14, 4], as long as the model is trained on a large training dataset (over 1.2 million images). The method in [18] is similar to the chronologically earlier approach reported in [13], which proposed to train the model in a fully self-supervised fashion instead. Specifically, starting from a concept vector representation of each frame, the authors proposed a neural-network-based model and an LSTM-based model performing a self-supervised pretext task consisting of predicting the concept vectors of neighboring frames given the concept vector of the current frame. Consequently, unlike [18], the single image sequence itself is used to learn the features encoding the temporal context, without need for a training dataset. The performance achieved in [13] was, however, less impressive than that of [18].
3 Dynamic Graph Embedding (DGE)
3.1 Problem formulation and proposed model
We formulate the event learning problem as a geometric learning problem. More specifically, given a set of data points (the frames of the image sequence) embedded into a high-dimensional Euclidean space (the initial data representation), we assume that these data points are organized in an underlying low-dimensional graph structure. Our a priori on the graph in the underlying low-dimensional space is that it consists of a finite number of community structures corresponding to different temporal contexts. Since, along an image sequence, the same temporal context can be visited several times at different time intervals, edges between nodes belonging to different communities correspond to transitions between different temporal contexts, whereas edges between nodes belonging to the same community correspond to transitions between nodes sharing the same temporal context, whether or not they are temporally adjacent. This structure implicitly assumes that the graph topology jointly models temporal and semantic similarity relations.
More formally, given a sequence of T images and their feature vectors x_t in R^D, t = 1, ..., T, stacked in a matrix X, we aim at finding a fully connected, weighted graph G with node embedding z_t in a low-dimensional space R^m, m << D, and edge weights given by the entries of an affinity matrix A, A_ts = k(z_t, z_s), where k is a similarity kernel, such that the similarity between any pair of nodes of the graph reflects both semantic relatedness and temporal adjacency between the images. Semantic relatedness is captured by a similarity function between semantic image descriptors, whereas temporal adjacency is imposed through temporal constraints injected in the graph structure. The above constraints make it easy to group the graph nodes into a finite number of clusters, each corresponding to a different temporal context. As seen in the previous section, in classical node embedding the low-dimensional representation of each node encodes information about the position and the structure of the local neighborhood in the graph. Since all these methods incorporate the graph structure in some way, the construction of the underlying graph is extremely important, but relatively little explored. In our problem at hand, the graph structure is initially unknown, since it arises from events, which are themselves unknown. Therefore, we aim at jointly learning the structure of the underlying graph and the node embedding.

3.2 Graph initialization by non-local self-similarity
Temporal non-local self-similarity. Let X = {x_t, t = 1, ..., T} be the set of given data points in the high-dimensional space R^D, normalized to the interval [0, 1]. To obtain a first coarse estimate of the graph, we apply a non-local self-similarity algorithm in the temporal domain to the initial data [15]. The non-local self-similarity filtering creates temporal neighborhoods of frames that are likely to be in the same event. Let x_t denote the t-th row of X, that is, the vector of image features at time t. Further, let N_t and N_s denote the indices of the feature vectors neighboring x_t and x_s, respectively, with |N_t| = |N_s|. In analogy with 2D data (images) [6], the self-similarity function of x_t in a temporal sequence, conditioned to its temporal neighborhood N_t, is given by the quantity [15]

w(t, s) = (1/C(t)) exp(-d(N_t, N_s)/h^2),   (1)

where C(t) is a normalizing factor such that sum_s w(t, s) = 1, ensuring that w(t, s) can be interpreted as a conditional probability of s given t, as detailed in [6]; d(N_t, N_s) is the sum of the distances between the vectors in the neighborhoods of x_t and x_s, and h is the parameter that tunes the decay of the exponential function. The key idea of our graph initialization is to model each frame by its denoised version, obtained as

x̂_t = sum_s w(t, s) x_s.   (2)
A numerical illustration on real data is provided in Fig. 4.
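As a concrete sketch of this initialization step, the temporal non-local means filter of Eqs. (1)-(2) can be implemented as follows. The function and parameter names (`half_window`, `search_radius`, `h`) and their default values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def temporal_nlmeans(X, half_window=2, search_radius=10, h=0.5):
    """Denoise a (T, d) feature sequence by temporal non-local means.

    For each frame t, similar frames s are found by comparing small
    temporal patches around t and s; the denoised frame is the
    weighted average of Eq. (2), with weights w(t, s) of Eq. (1).
    """
    T, d = X.shape
    X_hat = np.zeros_like(X, dtype=float)
    for t in range(T):
        lo, hi = max(t - search_radius, 0), min(t + search_radius + 1, T)
        weights, neighbors = [], []
        for s in range(lo, hi):
            # d(N_t, N_s): sum of distances between the patch vectors
            dist = 0.0
            for k in range(-half_window, half_window + 1):
                ti = min(max(t + k, 0), T - 1)   # clamp at sequence borders
                si = min(max(s + k, 0), T - 1)
                dist += np.sum((X[ti] - X[si]) ** 2)
            weights.append(np.exp(-dist / h ** 2))
            neighbors.append(s)
        w = np.asarray(weights)
        w /= w.sum()                  # C(t): weights sum to one, like P(s | t)
        X_hat[t] = w @ X[neighbors]   # Eq. (2): weighted average of neighbors
    return X_hat
```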
Initial graph and initial embedding. An initial graph is obtained by computing the affinity matrix G of X̂, defined element-wise as the pairwise similarity

G_ts = exp(-d_cos(x̂_t, x̂_s)/σ),   (3)

where d_cos is the cosine distance and σ the filtering parameter of the exponential function. In the following, we will no longer distinguish between a graph and its representation by its affinity matrix G, and use both symbols synonymously. In our model, G represents the initial data structure in the original high-dimensional space as a fully connected graph, from which we would like to learn the graph in the embedding space, say G̃, that better encodes the temporal constraints. To obtain an initial embedding Z^(0) for the graph, we apply PCA on X̂, keep the m major principal components, and minimize the cross-entropy loss between the affinity matrices of X̂ and Z, where different filtering parameters account for the different dimensionality of X̂ and Z. Even if PCA is a linear operator, and for small sets of high-dimensional vectors dual PCA could be more appropriate [21], we found it sufficient here for initializing the algorithm. The initial graph G̃^(0) in the embedding space is then given by the affinity matrix of Z^(0).
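The cosine-based affinity of Eq. (3) and the PCA initialization can be sketched as follows; `sigma` and the embedding dimension `dim` are placeholders, not the values used in the paper:

```python
import numpy as np

def cosine_affinity(Z, sigma=1.0):
    """Affinity matrix with entries exp(-d_cos(z_t, z_s)/sigma),
    where d_cos is the cosine distance, as in Eq. (3)."""
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    cos_dist = 1.0 - Zn @ Zn.T          # cosine distance in [0, 2]
    return np.exp(-cos_dist / sigma)

def pca_embedding(X, dim=3):
    """Project (denoised) features onto the top principal components
    to obtain the initial low-dimensional node embedding Z^(0)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T
```

The initial graph in the embedding space is then simply `cosine_affinity(pca_embedding(X_hat))`.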
3.3 DGE core alternating steps
Given the initial embedding Z^(0) and the graphs G and G̃^(0), the main loop of our DGE alternates over the following two steps:

1) Assuming that the graph G̃ is fixed, update the node representations Z.

2) Assuming that the representation Z is fixed, update the graph G̃.

Step (1) is inspired by graph embedding methods, such as the ones reviewed in Section 2.1, which have proved to be very good at encoding a given graph structure. Step (2) aims at enforcing temporal constraints and at fostering semantic similarity in the graph structure.
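The alternating structure above can be summarized by a minimal driver loop; `embed_step` and `graph_step` stand for the two core updates and are passed in as callables (an illustrative decomposition, not the paper's code):

```python
def dge(Z0, G0, n_iter, embed_step, graph_step):
    """Alternate the two DGE core steps for n_iter iterations.

    embed_step(Z, G) -> Z : update the node representations given the graph.
    graph_step(Z, G) -> G : update the graph given the representations.
    """
    Z, G = Z0, G0
    for _ in range(n_iter):
        Z = embed_step(Z, G)   # step (1): representation update
        G = graph_step(Z, G)   # step (2): structure update
    return Z, G
```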
Graph embedding update. To estimate the graph embedding Z^(k+1) at iteration k+1, assuming that G̃^(k) is given, we solve

Z^(k+1) = argmin_Z L(A(Z), G̃^(k)) + λ L(A(Z), G).   (4)

Here, L denotes the cross-entropy loss and A(Z) is the cosine-based similarity defined in (3), evaluated on the embedded representation. The first loss term controls the fit of the representation with the learnt graph in the low-dimensional embedding space. The second loss term quantifies the fit of the representation with the fixed initial graph in the high-dimensional space and is reminiscent of shallow graph embedding; λ is a regularization parameter that controls the weight of each loss. Standard gradient descent can be used to solve (4).
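A toy sketch of this update follows, with the cross-entropy objective and plain gradient descent. For clarity rather than efficiency, the gradient is computed numerically and a simple backtracking step size is used; a real implementation would rely on automatic differentiation. All names and defaults are illustrative:

```python
import numpy as np

def ce_loss(A, B, eps=1e-9):
    """Element-wise cross-entropy between two affinity matrices in (0, 1]."""
    A = np.clip(A, eps, 1 - eps)
    B = np.clip(B, 0.0, 1.0)
    return -np.mean(B * np.log(A) + (1 - B) * np.log(1 - A))

def update_embedding(Z, G_low, G_high, lam=0.5, lr=0.05, n_steps=10, sigma=1.0):
    """One DGE embedding update: minimize Eq. (4) over Z (toy sizes only)."""
    def affinity(M):                     # cosine-based similarity of Eq. (3)
        Mn = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
        return np.exp(-(1.0 - Mn @ Mn.T) / sigma)

    def loss(zflat):
        A = affinity(zflat.reshape(Z.shape))
        return ce_loss(A, G_low) + lam * ce_loss(A, G_high)

    z, h = Z.astype(float).ravel().copy(), 1e-5
    for _ in range(n_steps):
        grad = np.zeros_like(z)
        for i in range(z.size):          # central-difference numerical gradient
            zp = z.copy(); zp[i] += h
            zm = z.copy(); zm[i] -= h
            grad[i] = (loss(zp) - loss(zm)) / (2 * h)
        cur, step = loss(z), lr
        while step > 1e-8:               # backtracking: never increase the loss
            z_try = z - step * grad
            if loss(z_try) <= cur:
                z = z_try
                break
            step *= 0.5
    return z.reshape(Z.shape)
```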
Graph structure update. To obtain an estimate of the graph structure at the (k+1)-th iteration, say G̃^(k+1), assuming that Z^(k+1) is given, we start from the initial estimate A(Z^(k+1)), and then make use of the two model assumptions described in Section 3.1, temporal adjacency and semantic similarity, to modify the graph.

i) To foster similarity for temporally adjacent nodes, we apply two operations. First, local averaging of the edge weights, G̃ ← K * G̃, where * is the 2D convolution operator and K a kernel that is here simply the normalized bump function. Second, and more importantly, we apply a nonlinear operation to G̃ that scales down by a factor α, 0 < α < 1, the weights of edges between nodes t and s that are not direct temporal neighbors (i.e., for which |t - s| > 1), while leaving similarities of directly temporally adjacent nodes unchanged:

G̃_ts ← α G̃_ts if |t - s| > 1, and G̃_ts otherwise,   (5)

thus strengthening the temporal adjacency of the graph.
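The temporal part of the structure update can be sketched as below. A normalized box kernel replaces the paper's bump function for brevity, and `alpha` and `kernel_size` are illustrative values:

```python
import numpy as np

def temporal_graph_update(G, alpha=0.5, kernel_size=3):
    """Temporal prior of the graph-structure step:
    (i) smooth edge weights with a local 2D average (box kernel here,
        a normalized bump kernel in the paper),
    (ii) damp by alpha every edge between nodes that are not direct
        temporal neighbours (|t - s| > 1), as in Eq. (5)."""
    T = G.shape[0]
    k, pad = kernel_size, kernel_size // 2
    Gp = np.pad(G, pad, mode="edge")
    S = np.zeros_like(G, dtype=float)
    for i in range(T):                    # (i) local averaging of edge weights
        for j in range(T):
            S[i, j] = Gp[i:i + k, j:j + k].mean()
    idx = np.arange(T)                    # (ii) keep direct temporal neighbours
    far = np.abs(idx[:, None] - idx[None, :]) > 1
    S[far] *= alpha
    return S
```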
ii) To reinforce the semantic similarity encoded in G̃, we first obtain a coarse estimate of the community structure of the graph. To this end, we apply a clustering algorithm on Z^(k+1), which yields estimated cluster labels c_t, t = 1, ..., T, for each frame. Then we update G̃ by applying to it a nonlinear function that reduces the similarity between nodes t and s that do not belong to the same cluster, c_t ≠ c_s, by a factor β, 0 < β < 1, and does not change similarities within clusters, i.e.,

G̃_ts ← β G̃_ts if c_t ≠ c_s, and G̃_ts otherwise.   (6)

For β < 1, this operation hence reinforces within-event semantic similarity.
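The semantic part of the update, Eq. (6), is a one-liner once cluster labels are available (in the paper they come from a clustering of the embedding; `beta` is an illustrative value):

```python
import numpy as np

def semantic_graph_update(G, labels, beta=0.1):
    """Semantic prior: damp by beta the weight of edges whose endpoints
    fall in different clusters (Eq. (6)); within-cluster edges are kept."""
    labels = np.asarray(labels)
    cross = labels[:, None] != labels[None, :]   # mask of extra-cluster edges
    H = G.astype(float).copy()
    H[cross] *= beta
    return H
```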
DGE aims at revealing the temporal and semantic relatedness of each pair of data vectors, and therefore the estimated graphs G̃^(k) are fully connected at each stage. A high-level overview of our DGE approach can be found in Algorithm 1.
3.4 Graph postprocessing: Event boundary detection
Depending on the problem, the applicative context and the objective, different standard and graph signal processing tools can be applied to the estimated graph in order to extract the desired information or to transform it [53, 40]. In this work, the focus is on event detection in image sequences, and we thus directly use the representation learnt at the last DGE iteration in a contextual event boundary detector.
Boundary detector. Since the focus of this paper is on how to improve event representation and not on the temporal segmentation algorithm itself, we use the same boundary detector as in [18]. This boundary detector is based on the idea that the larger the distance between the predicted visual contexts of a frame, computed once forward (from past frames) and once backward (from future frames), the more likely this frame corresponds to an event boundary. Hence, the boundary prediction function is defined as the cosine distance between the past and future contexts of frame t, where the temporal contexts are defined as the averages of the representation vectors of a small number of preceding and subsequent frames, respectively. The frames for which the value of the boundary prediction function exceeds a threshold are the detected event boundaries; see [18] for details. We use the same parameter values and thresholds as in [18].

Hereafter, we call our temporal segmentation model relying on the features learnt by the proposed DGE approach CES-DGE, in analogy with CES-VCP in [18] (where CES stands for contextual event segmentation and VCP for visual context prediction).
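The boundary detector described above can be sketched as follows; the context width `w` and the `threshold` are illustrative placeholders, not the values of [18]:

```python
import numpy as np

def boundary_scores(Z, w=5):
    """Boundary prediction: cosine distance between the forward (past)
    and backward (future) contexts of each frame, each context being the
    mean of up to w neighbouring representation vectors."""
    T = Z.shape[0]
    scores = np.zeros(T)
    for t in range(1, T - 1):
        past = Z[max(0, t - w):t].mean(axis=0)
        future = Z[t + 1:min(T, t + 1 + w)].mean(axis=0)
        den = np.linalg.norm(past) * np.linalg.norm(future) + 1e-12
        scores[t] = 1.0 - past @ future / den   # cosine distance
    return scores

def detect_boundaries(Z, threshold=0.3, w=5):
    """Frames whose score exceeds the threshold are event boundaries."""
    return np.where(boundary_scores(Z, w) > threshold)[0]
```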
3.5 Numerical illustration of CES-DGE on synthetic data
To illustrate the main idea behind our modeling, Fig. 3 shows with a low-dimensional synthetic example how the original data and the associated initial graph change over the iterations of the DGE algorithm (here, no preprocessing is applied). The data consist of a temporal sequence of feature vectors drawn from a Gaussian mixture with equiprobable components and diagonal covariance matrices. To model the community structures in the data, after each switch to a new state, the state is maintained with a probability that decreases at an exponential rate with time. In Fig. 3, panel (a) shows the learnt representations as scatter plots for several iterations, where colors indicate the true cluster labels and connections between data points indicate temporal adjacency; panel (b) plots the representations as time series, together with event boundary detections as dashed bars and ground-truth labels as solid bars; panel (c) shows the corresponding graph, and panel (e) summarizes boundary detection performance as a function of the iteration number; in addition, to highlight the estimated event graph structure, panel (d) plots a pruned version of the graph in which all edges with weight below 70% of the weight of the strongest edge are removed. Clearly, it is difficult to obtain a good segmentation for the initial, cleaned data (left column). It can be appreciated how, within a few DGE iterations, the embedded data points belonging to the four different components are pushed apart, while the distance between temporally adjacent points belonging to the same component is reduced, increasing the similarity within each temporal context and revealing the underlying data generating mechanism (Fig. 3 (a-b), columns 2 to 4, respectively). Moreover, the learnt representation remains faithful to the temporal structure within each segment (e.g., the position and relative size of local maxima and minima are not changed).
An alternative view is provided by the corresponding learnt graphs in Fig. 3 (c-d), for which increasingly homogeneous diagonal and off-diagonal blocks with sharp boundaries emerge as the number of iterations grows, reflecting the temporal community structure underlying the data. This improved representation leads in turn to significantly better event segmentation results, with the F-score increasing substantially over the first DGE iterations.
Table 1. Robustness of CES-DGE with respect to hyperparameter values (EDUBSeg, fixed tolerance): F-scores obtained when varying the DGE hyperparameter indicated on the left, with all others held fixed.

DGE iterations:                 1 → 0.655   2 → 0.698   3 → 0.682   4 → 0.671   5 → 0.662
Initial graph memory:           0.05 → 0.679   0.1 → 0.698   0.2 → 0.690   0.3 → 0.685   0.4 → 0.682
2D local average size:          2 → 0.691   3 → 0.698   4 → 0.691   5 → 0.687   6 → 0.679
Graph temporal regularization:  0.01 → 0.685   0.1 → 0.688   0.3 → 0.698   0.4 → 0.686   0.6 → 0.672
Extra-cluster penalty:          0.03 → 0.675   0.05 → 0.697   0.1 → 0.698   0.2 → 0.687   0.3 → 0.670
Cluster number:                 3 → 0.650   4 → 0.656   6 → 0.666   8 → 0.682   10 → 0.698
                                12 → 0.686   14 → 0.685   16 → 0.677   18 → 0.674   20 → 0.676

3.6 Comparative results for EDUBSeg dataset
Table 2. Comparative results on the EDUBSeg dataset.

Method                 F-score   Rec    Prec
k-means smoothed       0.51      0.39   0.82
AC-color               0.38      0.25   0.90
SR-Clustering-CNN      0.53      0.68   0.49
KTS                    0.53      0.40   0.87
CES-VCP                0.69      0.77   0.66
CES-DGE (proposed)     0.70      0.70   0.72
Manual segmentation    0.72      0.68   0.80
Table 3. CES-VCP vs. CES-DGE on the EDUBSeg dataset for different levels of tolerance.

              CES-VCP                   CES-DGE
Tolerance     F-score  Rec   Prec      F-score  Rec   Prec
              0.69     0.77  0.66      0.70     0.70  0.72
              0.67     0.75  0.63      0.68     0.69  0.70
              0.64     0.62  0.71      0.65     0.64  0.68
              0.59     0.67  0.56      0.59     0.59  0.61
              0.44     0.44  0.49      0.48     0.48  0.50
4 Performance evaluation
4.1 Datasets and experimental setup
Datasets.
We used two temporal segmentation benchmark datasets for performance evaluation. The EDUBSeg dataset, introduced in [14], consists of 20 first-person image sequences acquired by seven people, with a total of 18,735 images. The dataset has been used to validate image sequence temporal segmentation in several recent works [14, 13, 42, 18]. For each image sequence, manual segmentations have been obtained by different persons; in line with previous works, the first segmentation was used here as ground truth. The second dataset is the larger and more recent EDUBSegDesc dataset [4], with 46 sequences (42,947 images) acquired by eight people. Every frame of the egocentric image sequences is described by the output of the pre-pooling layer of InceptionV3 [48] pretrained on ImageNet, as in [18]. On average, the sequences of both datasets contain roughly the same number of ground-truth event boundaries (28 for the former vs. 26 for the latter), but those of the EDUBSegDesc dataset consist of 25% longer continuously recorded segments than EDUBSeg (3h46m25s vs. 3h1m29s continuous "camera on" time, and 3.0 vs. 3.55 continuously recorded segments per sequence). Because of this increased continuity, with a larger number of harder-to-detect continuous event transitions, EDUBSegDesc is considered more difficult.

Other publicly available datasets of egocentric image sequences, such as CLEF [11], NTCIR [23] and the more recent R3 [18], do not have ground-truth event segmentations. They can therefore not be used for performance evaluation. In [18], these datasets, comprising more than 1.2 million images, were used as training sets to learn the temporal context representation. We emphasize that, in contrast, our algorithm operates fully self-supervised, without any training dataset.
Performance evaluation. Following previous work, we use F-score, Precision (Prec) and Recall (Rec) to evaluate the performance of our approach. In previous works [42, 14, 18, 13], results have been reported for a single level of tolerance (i.e., the time interval within which a detected event boundary is considered correct). Here, we report results for several values of the level of tolerance.
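A plausible sketch of this tolerance-based evaluation is given below. The greedy one-to-one matching used here is an assumption for illustration; the exact matching protocol of the benchmarks may differ:

```python
def precision_recall_f(detected, ground_truth, tol=5):
    """Precision, recall and F-score for boundary detection: a detected
    boundary counts as a true positive if an unmatched ground-truth
    boundary lies within +/- tol frames (each matched at most once)."""
    gt = sorted(ground_truth)
    used = [False] * len(gt)
    tp = 0
    for d in sorted(detected):
        for i, g in enumerate(gt):
            if not used[i] and abs(d - g) <= tol:
                used[i] = True
                tp += 1
                break
    prec = tp / len(detected) if detected else 0.0
    rec = tp / len(gt) if gt else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

Increasing `tol` relaxes the matching, which is why the scores in Tables 3 and 4 improve with the tolerance level.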
DGE hyperparameters. The hyperparameter values for the graph initialization and embedding (i.e., the neighborhood sizes and decay parameter h of the non-local self-similarity kernel (1), and the filtering parameter σ of the similarity (3)) have been chosen a priori, based on visual inspection of the similarity matrices for a couple of sequences of EDUBSeg. The embedding dimension m is set to a small value, which is found sufficient for the representation to faithfully reproduce the graph topology underlying the data (larger values for m were tested but did not yield performance gains). The DGE core loop hyperparameters, i.e., the graph regularization, the temporal prior (local average size and scaling factor α) and the semantic prior (scaling factor β and cluster number; a k-means algorithm is used to estimate cluster labels), are held fixed; the robustness of this choice is investigated in the next section.
4.2 Robustness to changes in hyperparameter values
Tab. 1 reports F-score values obtained on the EDUBSeg dataset when the number of DGE iterations and the DGE core parameters (initial graph memory, local average size, temporal regularization, extra-cluster penalty and cluster number) are varied individually. It can be appreciated that the performance of CES-DGE is overall very robust w.r.t. the precise hyperparameter values. The highest sensitivity is observed for the DGE iteration number, which should be chosen small enough so that F-score values are at most 3% below the best observed F-score. Results are very robust w.r.t. temporal regularization for reasonably small parameter values; for larger values, the learnt representation is over-smoothed. Similarly, F-score values vary little when the semantic similarity parameter is changed within a reasonable range. Finally, the number of clusters used to estimate semantic similarity can also be chosen within a reasonable range (F-score values drop by less than 3%). Overall, these results suggest that CES-DGE is quite insensitive to hyperparameter tuning and yields robust segmentation for a wide range of parameter values.
Tab. 2 reports comparisons with five state-of-the-art methods for a fixed value of tolerance (results reproduced from [18]). The first four (k-means smoothed, AC-color [33], SR-Clustering [14], KTS [45]) are standard/generic approaches and achieve modest performance, with F-scores no better than 0.53. CES-VCP of [18] yields a significantly better F-score of 0.69, thanks to the use of a large training set for learning the event representation. The proposed CES-DGE approach further improves on this state-of-the-art result and yields an F-score of 0.70. Interestingly, CES-DGE also achieves more balanced Rec and Prec values of 0.70 and 0.72, as compared to 0.77 and 0.66 for CES-VCP. Finally, even when compared to average manual segmentation performance, estimated by averaging the performance of the two remaining manual segmentations for EDUBSeg against the selected ground truth, our results are only 2% below.

In Table 3, we compare our proposed approach and CES-VCP for different values of tolerance. We observe that CES-DGE achieves systematically better results than CES-VCP in terms of F-score for all values of tolerance. These improvements with respect to the state of the art reach up to 4%. Besides, Rec/Prec values for CES-DGE are more balanced and within 3% of the F-score values for all tolerance levels (vs. within 8% for CES-VCP).

Overall, this leads us to conclude that the proposed CES-DGE approach is effective in learning event representations for image sequences and yields robust segmentation results. These results are all the more remarkable considering that CES-DGE learns feature representations from the sequence itself, without relying on any training dataset.
4.3 Comparative results for EDUBSegDesc dataset
Tab. 4 summarizes the event boundary detection performance of the proposed CES-DGE approach and of CES-VCP for the larger EDUBSegDesc dataset; since CES-VCP reported state-of-the-art results (with F-score values 16% above other methods, cf. Tab. 2), we omit comparison with other methods in what follows for space reasons. The same (hyper)parameter values as for EDUBSeg are used. It can be observed that the performance of both CES-VCP and CES-DGE is inferior to that reported for the EDUBSeg dataset in the previous section, for all tolerance values, indicating that EDUBSegDesc contains more difficult image sequences. Interestingly, while the F-scores achieved by CES-VCP are up to 12% (and more than 11% on average) below those reported for EDUBSeg, the F-scores of the proposed CES-DGE approach are at worst 5% smaller. In other words, CES-DGE yields up to 8% (on average 6%) better F-score values than the state of the art on the EDUBSegDesc dataset. Our CES-DGE also yields systematically better Rec and Prec values, for all levels of tolerance. Overall, these findings corroborate those obtained for the EDUBSeg dataset and confirm the excellent practical performance of the proposed approach. The results further suggest that CES-DGE, by relying only on the image sequence itself for learning event representations, benefits from improved adaptability and robustness as compared to approaches that rely on training datasets.
Method       CES-VCP                    CES-DGE
Tolerance    F-score   Rec    Prec      F-score   Rec    Prec
             0.57      0.59   0.60      0.65      0.67   0.65
             0.56      0.58   0.58      0.63      0.66   0.63
             0.52      0.54   0.54      0.60      0.62   0.60
             0.49      0.50   0.50      0.54      0.56   0.54
             0.43      0.44   0.45      0.45      0.46   0.45
4.4 Ablation study
Graph initialization. We investigate the performance obtained by applying the boundary detector to features from different stages of our method: the original features (denoted CES-raw) and the features obtained by applying NL-means along the temporal dimension (CES-NLmeans 1D); the initial embedded features (CES-Embedding) and the features obtained after running the DGE main loop (CES-DGE). The results, obtained for the EDUB-Seg dataset, are reported in Tab. 5. They indicate that CES-NLmeans 1D increases the F-score by 8% w.r.t. CES-raw, and CES-Embedding adds another 1% in F-score. CES-DGE gains an additional 9%, hence significantly improves upon this initial embedding. This confirms that the graph initialization and the reduction of the dimension of the graph representation are beneficial.
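The CES-NLmeans 1D stage can be sketched as a non-local means filter applied along the temporal axis of the frame-feature matrix: each frame's feature vector is replaced by a similarity-weighted average of the vectors in a temporal window around it. The window radius and bandwidth `h` below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def nlmeans_1d(X, radius=10, h=1.0):
    """Non-local means along the temporal axis of a frame-feature
    matrix X (n_frames x dim). Weights decay with the squared feature
    distance to the center frame; radius and h are illustrative."""
    n = X.shape[0]
    Y = np.empty_like(X, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        d2 = np.sum((X[lo:hi] - X[i]) ** 2, axis=1)  # feature distances
        w = np.exp(-d2 / (h ** 2))                   # similarity weights
        Y[i] = w @ X[lo:hi] / w.sum()                # weighted average
    return Y
```

A constant sequence is left unchanged, while isolated feature noise within an event is smoothed out, which is what makes the subsequent boundary detection more reliable.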
An illustration of the effect of the graph initialization and of the DGE steps for EDUB-Seg Subject 1 Day 3 is provided in Fig. 4. It can be observed how boundaries between temporally adjacent frames along the diagonal of the graph are successively enhanced as the original features (column 1) are replaced with the denoised version (column 2), with the embedded features (column 3), and finally with the DGE representation estimates (columns 4 & 5, for successive DGE iterations). Moreover, the boundaries of off-diagonal blocks, which correspond to segments of frames that resemble segments at other temporal locations, presumably of the same event, are sharpened, revealing the community structure.
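The kind of matrix visualized in Fig. 4 can be sketched as a frame-to-frame similarity graph: semantic similarity between feature vectors, optionally modulated by a temporal prior. The cosine similarity and exponential decay below are illustrative stand-ins for the paper's exact graph construction:

```python
import numpy as np

def frame_graph(X, tau=None):
    """Adjacency matrix of a frame sequence X (n_frames x dim):
    cosine similarity between frame features, optionally multiplied by
    a temporal decay exp(-|i-j|/tau). The decay form and tau are
    illustrative assumptions, not the paper's exact edge weighting."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T  # semantic (cosine) similarity
    if tau is not None:
        idx = np.arange(X.shape[0])
        # temporal prior: distant frame pairs get down-weighted edges
        A = A * np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    return A
```

Plotting such a matrix as an image shows events as bright diagonal blocks and recurring events as off-diagonal blocks, as described above.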
DGE core operations. In Tab. 6, we report the performance obtained on the EDUB-Seg dataset when the different operations in the DGE core iterations are removed one by one by setting the respective parameter to zero: graph regularization, edge local averaging, temporal edge weighting, and extra-cluster penalization. The overall DGE F-score drops by 0.4% to 3.2% when a single one of these operations is deactivated (versus a drop of 8.6%, from 0.698 to 0.612, when no DGE operation is performed at all, as discussed above). The fact that each removal degrades performance, yet none alone disables the DGE loop, suggests that the associated individual (temporal & semantic) model assumptions all contribute, and do so independently. Among all operations, the largest individual F-score drop (3.2%) corresponds to deactivating the extra-cluster penalization, which points to the essential role of semantic similarity in the graph model. The smallest F-score difference is associated with edge local averaging; to improve this temporal regularization step, future work could use non-local instead of local averaging, or learnt kernels. The graph temporal edge weighting, in turn, is effective in encoding the temporal prior (1.5% F-score drop if deactivated).
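As an illustration of the extra-cluster penalization, one can down-weight graph edges between frames assigned to different semantic clusters (e.g. k-means labels, since the paper uses k-means in its semantic similarity estimation step). The multiplicative form and the parameter `lam` below are assumptions, not the paper's exact formulation:

```python
import numpy as np

def penalize_extra_cluster(A, labels, lam=0.5):
    """Down-weight adjacency entries between frames with different
    semantic cluster labels. A is (n x n), labels is (n,); the
    multiplicative penalty (1 - lam) is an illustrative choice."""
    same = labels[:, None] == labels[None, :]  # same-cluster mask
    return np.where(same, A, (1.0 - lam) * A)
```

Edges inside a cluster are kept intact while cross-cluster edges shrink, which is one simple way to make the community (event) structure of the graph more pronounced.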
Method          F-score   Rec    Prec
CES-raw         0.52      0.56   0.56
CES-NLmeans 1D  0.60      0.63   0.61
CES-Embedding   0.61      0.61   0.65
CES-DGE         0.70      0.70   0.72
Deactivated DGE operation          Graph reg.   Edge avg.   Temp. weighting   Extra-cluster pen.
F-score                            0.685        0.694       0.683             0.666
Difference with full DGE (0.698)   0.013        0.004       0.015             0.032
5 Conclusion
This paper proposed a novel approach to learning representations of events in low temporal resolution image sequences, named Dynamic Graph Embedding (DGE). Unlike state-of-the-art work, which requires (large) datasets for training the model, DGE operates in a fully self-supervised way and learns the temporal event representation for an image sequence directly from the sequence itself. To this end, we introduced an original model based on the assumption that the sequence can be represented as a graph that captures both the temporal and the semantic similarity of the images. The key novelty of our DGE approach is to learn the structure of this unknown underlying data-generating graph jointly with a low-dimensional graph embedding. This is, to the best of our knowledge, one of the first attempts to learn graph structure and data representations simultaneously for image sequences. Experimental results have shown that DGE yields robust and effective event representations and outperforms the state of the art in terms of event boundary detection precision, improving F-score values by 1% and 8% on the EDUB-Seg and EDUB-SegDesc event segmentation benchmark datasets, respectively. Future work will include exploring more sophisticated methods than the k-means algorithm in the semantic similarity estimation step, as well as other applications in the field of multimedia, such as the temporal segmentation of wearable sensor data streams.
References
 [1] (2013) Distributed large-scale natural graph factorization. In Proc. 22nd Int. Conf. on World Wide Web, pp. 37–48. Cited by: §2.1.
 [2] (2016) Discovering picturesque highlights from egocentric vacation videos. In Proc. IEEE Winter Conf. Applications of Computer Vision (WACV), pp. 1–9. Cited by: §2.2.
 [3] (2017) Toward storytelling from visual lifelogging: an overview. IEEE Transactions on Human-Machine Systems 47 (1), pp. 77–90. Cited by: §2.2.
 [4] (2018) Egocentric video description based on temporally-linked sequences. Journal of Visual Communication and Image Representation 50, pp. 205–216. Cited by: §1, §2.2, §2.2, §4.1.

 [5] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §2.1.
 [6] (2005) A non-local algorithm for image denoising. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 60–65. Cited by: §3.2.
 [7] (2015) GraRep: learning graph representations with global structural information. In Proc. 24th ACM Int. Conf. on Information and Knowledge Management, pp. 891–900. Cited by: §2.1.

 [8] (2016) Deep neural networks for learning graph representations. In Proc. 30th AAAI Conf. on Artificial Intelligence, Cited by: §2.1, §2.1.
 [9] (2009) Movie segmentation into scenes and chapters using locally weighted bag of visual words. In Proc. ACM Int. Conf. on Image and Video Retrieval, pp. 35. Cited by: §2.2.
 [10] (2018) HARP: hierarchical representation learning for networks. In Proc. 32nd AAAI Conf. on Artificial Intelligence, Cited by: §2.1.
 [11] (2017) Overview of ImageCLEFlifelog 2017: lifelog retrieval and summarization. In CLEF (Working Notes), Cited by: §4.1.
 [12] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 3844–3852. Cited by: §2.1.
 [13] (2018) Learning event representations by encoding the temporal context. In Proc. IEEE European Conf. on Computer Vision Workshops (ECCVW), pp. 587–596. External Links: ISBN 9783030110154 Cited by: §1, §2.2, §2.2, §4.1, §4.1.
 [14] (2017) SR-clustering: semantic regularized clustering for egocentric photo streams segmentation. Computer Vision and Image Understanding 155, pp. 55–69. Cited by: §1, §2.2, §2.2, §4.1, §4.1, §4.2.
 [15] (2019) Enhancing temporal segmentation by non-local self-similarity. In Proc. IEEE Int. Conf. on Image Processing (ICIP). In press. Cited by: §3.2.
 [16] (2008) Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs. In Proc. Int. Conf. on Content-based Image and Video Retrieval, pp. 259–268. Cited by: §2.2.
 [17] (2008) Combining image descriptors to effectively retrieve events from visual lifelogs. In Proc. 1st ACM Int. Conf. on Multimedia information retrieval, pp. 10–17. Cited by: §2.2.
 [18] (2018) Predicting visual context for unsupervised event segmentation in continuous photostreams. In Proc. 26th ACM Int. Conf. on Multimedia, pp. 10–17. External Links: ISBN 9781450356657 Cited by: §1, §2.2, §2.2, §3.4, §3.4, §3.4, §4.1, §4.1, §4.1, §4.2.
 [19] (2017) Summarization of egocentric videos: a comprehensive survey. IEEE Transactions on Human-Machine Systems 47 (1), pp. 65–76. Cited by: §1.
 [20] (2018) Fewshot learning with graph neural networks. In Proc. Int. Conf. on Learning Representations (ICLR), Cited by: §2.1.
 [21] (2018) Topology identification and learning over graphs: accounting for nonlinearities and dynamics. Proceedings of the IEEE 106 (5), pp. 787–807. Cited by: §3.2.
 [22] (2016) Node2vec: scalable feature learning for networks. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.1.
 [23] (2017) Overview of NTCIR-13 Lifelog-2 task. In Proc. 13th Conf. on Evaluation of Information Access Technology, Cited by: §4.1.
 [24] (2017) Inductive representation learning on large graphs. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034. Cited by: §2.1, §2.1.
 [25] (2017) Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §2.1.
 [26] (2015) Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163. Cited by: §2.1.
 [27] (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. Cited by: §2.1, §2.1.
 [28] (2011) A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41 (6), pp. 797–819. Cited by: §2.2.
 [29] (2017) Egocentric temporal action proposals. IEEE Transactions on Image Processing 27 (2), pp. 764–777. Cited by: §2.2.
 [30] (2017) Temporal video segmentation: detecting the end-of-act in circus performance videos. Multimedia Tools and Applications 76 (1), pp. 1379–1401. External Links: ISSN 1573-7721, Document Cited by: §2.2.
 [31] (2006) Audiovisual integration for tennis broadcast structuring. Multimedia Tools and Applications 30 (3), pp. 289–311. Cited by: §2.2.
 [32] (2001) Temporal video segmentation: a survey. Signal Processing: Image Communication 16 (5), pp. 477–500. Cited by: §1, §2.2.
 [33] (2015) Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114 (1), pp. 38–55. Cited by: §2.2, §4.2.
 [34] (2017) VCI2R at the NTCIR-13 lifelog semantic access task. Proc. NTCIR-13, Tokyo, Japan. Cited by: §2.2.
 [35] (2006) Structuring continuous video recordings of everyday life using time-constrained clustering. In Multimedia Content Analysis, Management, and Retrieval 2006, Vol. 6073, pp. 60730D. Cited by: §2.2.
 [36] (2013) Learning a contextual multi-thread model for movie/TV scene segmentation. IEEE Transactions on Multimedia 15 (4), pp. 884–897. Cited by: §2.2.
 [37] (2011) Exploiting visual-audio-textual characteristics for automatic TV commercial block detection and segmentation. IEEE Transactions on Multimedia 13 (5), pp. 961–973. Cited by: §2.2.
 [38] (2008) Video summarisation: a conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation 19 (2), pp. 121–143. Cited by: §1.

 [39] (2016) Learning convolutional neural networks for graphs. In Proc. Int. Conf. on Machine Learning (ICML), pp. 2014–2023. Cited by: §2.1.
 [40] (2018) Graph signal processing: overview, challenges, and applications. Proceedings of the IEEE 106 (5), pp. 808–828. Cited by: §3.4.
 [41] (2016) Asymmetric transitivity preserving graph embedding. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 1105–1114. Cited by: §2.1.
 [42] (2016) Context change detection for an ultra-low power, low-resolution ego-vision imager. In Proc. IEEE European Conf. Computer Vision Workshops (ECCVW), pp. 589–602. Cited by: §2.2, §4.1, §4.1.
 [43] (2014) DeepWalk: online learning of social representations. In Proc. 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 701–710. Cited by: §2.1.
 [44] (2014) Temporal segmentation of egocentric videos. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), pp. 2537–2544. Cited by: §2.2.
 [45] (2014) Categoryspecific video summarization. In Proc. European Conf. on Computer Vision (ECCV), pp. 540–555. Cited by: §4.2.
 [46] (2013) Neural representations of events arise from temporal community structure. Nature Neuroscience 16 (4), pp. 486. Cited by: §1.
 [47] (2009) Temporal segmentation and activity classification from first-person sensing. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 17–24. Cited by: §2.2.
 [48] (2016) Rethinking the inception architecture for computer vision. In Proc. IEEE Conf. Computer vision and pattern recognition (CVPR), pp. 2818–2826. Cited by: §4.1.
 [49] (2015) R-clustering for egocentric video segmentation. In Proc. Iberian Conf. Pattern Recognition and Image Analysis, pp. 327–336. Cited by: §2.2.
 [50] (2015) LINE: large-scale information network embedding. In Proc. 24th Int. Conf. on World Wide Web, pp. 1067–1077. Cited by: §2.1.
 [51] (2012) Learning latent temporal structure for complex event detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1250–1257. Cited by: §2.2.
 [52] (2016) Structural deep network embedding. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §2.1, §2.1.
 [53] (2015) A global/local affinity graph for image segmentation. IEEE Transactions on Image Processing 24 (4), pp. 1399–1411. Cited by: §3.4.
 [54] (2015) Gaze-enabled egocentric video summarization via constrained submodular maximization. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 2235–2244. Cited by: §2.2.
 [55] (2017) PBG at the NTCIR-13 Lifelog-2 LAT, LSAT, and LEST tasks. Proc. NTCIR-13, Tokyo, Japan. Cited by: §2.2.
 [56] (2007) A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology 17 (2), pp. 168–186. Cited by: §2.2.