Learning to Sort Image Sequences via Accumulated Temporal Differences

10/22/2020 ∙ by Gagan Kanojia, et al. ∙ IIT Gandhinagar

Consider a set of n images of a scene with dynamic objects captured with a static or a hand-held camera, where the temporal order in which these images were captured is unknown. There are n! possibilities for the temporal order in which these images could have been captured. In this work, we tackle the problem of temporally sequencing such an unordered set of images of a dynamic scene captured with a hand-held camera. We propose a convolutional block which captures the spatial information through a 2D convolution kernel and the temporal information by utilizing the differences among the feature maps extracted from the input images. We evaluate the performance of the proposed approach on a dataset extracted from the standard action recognition dataset UCF101 and show that it outperforms the state-of-the-art methods by a significant margin. We also show that the network generalizes well: when trained on the dataset extracted from UCF101 (an action recognition dataset), it performs well on a dataset extracted from the DAVIS dataset, which is meant for video object segmentation.


I Introduction

In today’s world of digital photography, when a group of people attend an event such as a sports match, they are likely to capture their moments of interest. These moments are generally short in duration and dynamic in nature, as they involve moving objects or moving people present in the scene. Analysis of a dynamic scene using still images has long been an active area of research in image processing, computer vision, and machine learning. However, the most common device for capturing such events is the mobile phone, which is a hand-held device. Even when the images are captured with a single hand-held device, they are prone to misalignment due to reasons like handshake. This makes the problem even more challenging, because apart from dealing with the object motion, the analysis also has to deal with the camera motion. The temporal information of a dynamic scene is an important tool for its analysis and visualization. However, when the images are obtained from sources like the internet, there may be no time stamp available, which makes the analysis of dynamic scenes extremely challenging. In [1], the authors showed that the temporal ordering plays an important role in recognizing several classes from standard action recognition datasets [2, 3].

In the past few years, 2D convolutional neural networks (CNNs) have been dominating several domains of computer vision such as object recognition [4], single image depth estimation [5], and semantic segmentation [6]. There are several 2D CNN architectures which are fueled by large still-image datasets like ImageNet [7]. Apart from still images, it has also been shown that 2D CNNs perform quite well when applied to videos [8, 9, 10]. They are applied on individual frames of the video to perform tasks such as action recognition. However, 2D CNNs fall short in exploiting the 3D structure present in the input. To cope with this issue, researchers moved on to the 2.5D approach, which exploits the 3D structure while utilizing 2D convolution kernels [11, 12]. In the 2.5D approach, the network is provided with some higher-level information about the input apart from the RGB channels of the images. For example, in the case of action recognition, the higher-level information could be optical flow, and in the case of dynamic object detection, it could be semantic maps.

Problem Statement. Consider a set of images captured from single or multiple hand-held uncalibrated cameras whose order of capture, i.e., the temporal order, is unknown. In this work, we tackle the challenging problem of image sequencing, in which we recover the unknown temporal order of the unordered set of images. Similar to [13], we formulate the problem of image sequencing as a classification problem in which the classes are all the possible permutations of the temporal order for the given sequence length. The objective is to map the given unordered image sequence to its corresponding permutation. Consider an input unordered image sequence of length 5 whose correct order is $(I_1, I_2, I_3, I_4, I_5)$. The input sequence can have $5! = 120$ possible permutations. In this work, we consider the forward and backward permutations as a single class, similar to [13, 14]. Hence, for a sequence of length 5, there are $5!/2 = 60$ classes. The objective is to map the input sequence to its correct permutation.
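For concreteness, the sketch below (Python, with hypothetical helper names, not part of the original work) enumerates the permutation classes when forward and backward orders are merged, confirming the $5!/2 = 60$ count used above.

```python
from itertools import permutations

def permutation_classes(n):
    """Enumerate the n!/2 classes obtained by merging each permutation of
    (0, ..., n-1) with its reverse, since forward and backward orders are
    treated as a single class."""
    classes = []
    seen = set()
    for p in permutations(range(n)):
        if p in seen:
            continue
        seen.add(p)
        seen.add(tuple(reversed(p)))
        classes.append(p)  # canonical representative of the class
    return classes

print(len(permutation_classes(5)))  # 60 classes for a sequence of length 5
```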

Contributions. In this work, we propose a novel convolutional block for the task of image sequencing, which extracts the spatial information using 2D convolution and the temporal information by exploiting the differences between the feature maps extracted from the input images. We do not provide any higher-level information, such as depth maps or semantic information, as input; we only utilize the raw RGB images. We use ResNet [4] as the back-bone architecture for the proposed convolutional block. We show that the proposed approach outperforms the state-of-the-art methods by a significant margin. The proposed approach can be used as a pre-processing step in cases where the images of a dynamic scene are obtained without time stamps from sources like the internet or a group of people [15, 16, 17, 18]. In [1], the authors have already shown that the action recognition accuracy drops for several classes when the frames are randomly shuffled. In [19], the authors have shown that even with 3 or 5 frames extracted from a video, significant accuracy can be obtained for the task of action recognition. Also, in the case of dynamic object detection and/or removal, recent works have used around six images in the input set [15, 16, 17, 18]. Hence, we limit our experiments to at most six images, which is also consistent with the recent works in image sequencing [13, 14].
The major contributions of the work are as follows.

  • We propose a novel convolutional block which captures spatial information by performing a 2D convolution and temporal information by exploiting differences between the feature maps extracted from the unordered input images.

  • We show that motion plays a key role in image sequencing through the motion heat maps computed using the outputs of the proposed block.

  • We show that the network learns to shift its focus onto dynamic objects, without being trained with any such supervision, by demonstrating the progression of motion heat maps along the depth of the network.

  • We show that the network generalizes well by evaluating it on the dataset extracted from the DAVIS dataset, a dataset meant for video object segmentation, when the network is trained with the dataset extracted from UCF101, a dataset meant for action recognition.

  • We surpass the state-of-the-art accuracy on the standard dataset used in previous works.

The rest of the paper is organized as follows. Section II discusses the previous works relevant to this work. Section III describes the proposed convolutional block in detail. Section IV discusses the network architecture and the experiments which show the effectiveness of the proposed method. It also discusses the ablation studies performed on the proposed convolutional block to justify the design choices. Section V provides the conclusion.

Fig. 1: Image sequencing. The figure demonstrates the task of image sequencing.
Fig. 2: The figure shows an illustration of the difference accumulator block.

II Related Work

In the past few years, 2D CNNs have enjoyed a huge amount of attention and have shown very promising results in several tasks of computer vision such as image classification, object detection, and image segmentation [4, 6]. They have been very successful in obtaining rich representations for still images. Many works extended 2D CNNs to operate on spatio-temporal data by extracting features of individual frames and then integrating the information along the temporal dimension [8, 9]. In [20], the authors study different approaches to incorporate the temporal dimension along with the spatial dimensions through a “slow fusion” model which extends the connectivity of the convolutional layers along the temporal dimension.
In many applications, the temporal structure of the input plays a very important role. In such cases, sequencing an unordered image sequence could help in better exploiting the temporal information [21, 22, 23, 24]. In [22], the authors learn to predict the future actions of a person in an egocentric video by performing two tasks related to temporal reasoning, one of which is the temporal ordering of two given short video snippets. In [23], the authors investigate whether a video is being played in the forward or the backward direction.
The problem of sequencing has been addressed in several scenarios like temporal ordering of the events in news [25], photo album creation from jumbled set of images [26], and estimating the 2-D rotation applied on the images to improve the feature representations [27]. In [28], the authors learn the video representations by learning to determine whether the input video is in correct temporal order or not. In [29], instead of using only the images, the authors used image-caption pairs of an event of sequence length 5 and sorted them to make a story.
The problem of sequencing images of a dynamic scene captured by a hand-held camera was first addressed by Basha et al. [30, 31]. In their work, in the case of multiple cameras, they assume that the cluster of images belonging to the same camera is known and that the temporal order of the images captured from the same camera is known. Also, they assume that at least two images are captured from almost the same location. In [32], the authors address these assumptions by proposing a methodology in which, given a set of images captured from multiple cameras, they first cluster the images captured with the same camera. After clustering, they sort the images temporally in their order of capture. However, these methods are not learning-based approaches, and they have not been evaluated on large datasets.
The recent works by Lee et al. [13] and Kanojia et al. [14] are the most relevant to ours. Lee et al. [13] propose a learning-based approach to sort an unordered image sequence. They formulate the task of sequencing as a multi-class classification problem in which the classes are all the possible permutations of the temporal order of the input unordered image set. However, they do not feed the images directly to the network. Given the images, they extract image regions which exhibit large motion and then feed these regions to the network along with some pre-processing. Unlike Lee et al. [13], Kanojia et al. [14] directly feed the images into their proposed LSTM-based network. They formulate the task of image sequencing as a sequence-to-sequence mapping task; their network maps each input image to its position in the ordered sequence.
The proposed approach uses only 2D convolution kernels. Since 2D convolution kernels fall short in capturing the temporal information, we adopt a 2.5D approach, in which the RGB channels of the image are appended with some higher-level information that captures the temporal structure of the input [11, 12]. In [12], the authors fuse the input image with its orthogonal views. In [11], the authors extend the dimension of the magnetic resonance volume along the RGB channels to exploit the 3D features. In the proposed approach, we extract the differences among the feature maps along the temporal direction and append them along the channels. Then, we perform a 2D convolution to extract the 3D structure present in the input unordered image set.
Temporal differences have been explored in the area of action recognition [19]. They can provide the rough locations of the non-rigid bodies performing the action in the videos [33, 34]. In [19], the authors propose a motion filter in which they compute the differences only among the feature maps of adjacent frames. However, in our case, we do not have the information regarding the adjacency of the input frames. Also, in [19], the authors perform a 1D convolution on the feature differences and then add them to the previous feature map, while we adopt a 2.5D approach.

III Proposed Convolutional Block

The proposed convolutional block has three parts: a 2D convolution kernel, a difference accumulator block (DAB), and a 2.5D convolutional block. Let the input feature map to the convolutional block be $X \in \mathbb{R}^{c \times n \times h \times w}$. Here, $c$ is the number of channels, $n$ corresponds to the number of images in the input set, and $h$ and $w$ are the height and width of the feature map $X$, respectively.

III-A 2D Convolution

It has been shown in early works [35, 36] that, in the initial layers, the 2D filters learn to capture information like edges and corners, while in the later layers they capture object-level information of the scene. The idea behind performing the 2D convolution is to first obtain rich spatial representations of the images individually. In the proposed framework, we first obtain $X_s$ as shown in Eq. 1.

$X_s = X * W$   (1)

Here, $*$ stands for convolution and $W$ is the filter for 2D convolution with a kernel size of $1 \times k \times k$ and $c'$ output channels. The reason behind representing the kernel size of the filter with three dimensions is to indicate that we are dealing with multiple images; we mention this for the sake of clarity that we are not performing convolution along the temporal direction. It can be seen that the kernel size along the temporal direction is 1. We apply $W$ on the feature maps corresponding to each image, which are provided in an unordered fashion, to obtain their spatial feature maps $X_s$. Then, we pass $X_s$ through the proposed difference accumulator block (DAB) to obtain the temporal structure of the feature maps.
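A minimal PyTorch sketch of this per-image spatial convolution, assuming it is realized as a 3D convolution with a kernel size of 1 along the temporal dimension so that no information is mixed across the $n$ images; tensor shapes and names are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn

# Spatial (2D) convolution applied to each image independently: a Conv3d with
# kernel size 1 along the temporal dimension does not mix the n images.
c_in, c_out, k = 64, 64, 3
spatial_conv = nn.Conv3d(c_in, c_out, kernel_size=(1, k, k),
                         padding=(0, k // 2, k // 2), bias=False)

x = torch.randn(2, c_in, 5, 56, 56)   # (batch, c, n images, h, w)
x_s = spatial_conv(x)                  # Eq. (1): same n, spatially filtered
print(x_s.shape)                       # torch.Size([2, 64, 5, 56, 56])
```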

III-B Difference Accumulator Block

The core idea behind the Difference Accumulator Block (DAB) is to capture the 3D structure present in the input data. Since we want to find the temporal order, we exploit the changes occurring among the images. The changes can be due to the object motion or the camera motion. In the proposed DAB, we rely on the differences among the feature maps extracted from the images, i.e., the change in the spatial information at different time instances, to extract the necessary temporal information present in the feature maps. These differences can provide rough locations of the non-rigid bodies performing the action in the videos, which could help the network in performing image sequencing [33, 34]. In a general sense, DAB tries to capture the notion of how different the volume of feature maps at the current temporal location is in comparison to the volumes at the other temporal locations.
Let $X_s \in \mathbb{R}^{c' \times n \times h \times w}$ be given as the input to DAB. Here, $c'$ is the number of channels, $n$ corresponds to the number of images in the input set, and $h$ and $w$ are the height and width of the feature map $X_s$, respectively. We pass $X_s$ through DAB. We accumulate the differences of the feature map corresponding to each image of the unordered sequence with the feature maps of the remaining images, i.e., the differences between the volumes along the second dimension of $X_s$. Let $X_s^1, X_s^2, \ldots, X_s^n$ be the volumes along the temporal depth of $X_s$. The output of DAB is obtained as shown in Eq. 2. Here, $D$ is the concatenation of $D_1, D_2, \ldots, D_n$ along the temporal depth.

$D_t = \sum_{j=t+1}^{n} \left(X_s^t - X_s^j\right)$   (2)

Here, $t$ is a location along the temporal dimension. For $t = n$, $D_n = 0$. It can be observed that the range of $j$ starts from $t+1$. This is performed to avoid symmetric computations. Fig. 2 shows an illustration of the proposed DAB.
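The following PyTorch sketch implements the difference accumulator under the reconstruction of Eq. 2 given above (each temporal slice accumulates its signed differences with the slices ahead of it); the explicit loops and variable names are illustrative, not the authors' implementation.

```python
import torch

def difference_accumulator(x_s):
    """Difference accumulator block (DAB), following Eq. (2) as reconstructed:
    D_t = sum_{j > t} (X_s^t - X_s^j), concatenated along the temporal depth.
    x_s: tensor of shape (batch, c, n, h, w)."""
    n = x_s.shape[2]
    diffs = []
    for t in range(n):
        d_t = torch.zeros_like(x_s[:, :, t])
        for j in range(t + 1, n):        # j starts at t+1 to avoid symmetric terms
            d_t = d_t + (x_s[:, :, t] - x_s[:, :, j])
        diffs.append(d_t)
    return torch.stack(diffs, dim=2)     # (batch, c, n, h, w)

x_s = torch.randn(2, 64, 5, 56, 56)
d = difference_accumulator(x_s)
print(d.shape)  # torch.Size([2, 64, 5, 56, 56])
```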

III-C 2.5D Convolutional Block

To capture the temporal information along with the spatial information, we adopt the 2.5D approach, in which the channels containing the spatial information are appended with some higher-level information that captures the temporal structure of the input [11, 12]. In our case, the output $D$ obtained from DAB contains essential information regarding the temporal structure of the input image set. The feature maps $X_s$ obtained by applying the 2D convolution kernel contain only the spatial information. To exploit both the spatial and the temporal structure of the input, we concatenate $X_s$ and $D$ along the channels to obtain $X_c$. Then, we pass $X_c$ through a 2D convolution kernel to obtain the final output of the block.
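Putting the three parts together, a hedged sketch of the full block (per-image 2D convolution, DAB, channel-wise concatenation, and a second convolution over the appended channels) is given below. It reuses the difference_accumulator function from the previous sketch; the kernel sizes and channel counts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ProposedBlock(nn.Module):
    """Sketch of the proposed convolutional block: per-image 2D convolution,
    difference accumulator (DAB), and a 2.5D convolution applied to the
    channel-wise concatenation of spatial and temporal feature maps."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(c_in, c_out, (1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        # 2.5D step: the input has 2 * c_out channels after concatenation.
        self.fuse = nn.Conv3d(2 * c_out, c_out, (1, k, k),
                              padding=(0, k // 2, k // 2), bias=False)

    def forward(self, x):                       # x: (batch, c_in, n, h, w)
        x_s = self.spatial(x)                   # spatial features, Eq. (1)
        d = difference_accumulator(x_s)         # temporal structure from DAB
        x_c = torch.cat([x_s, d], dim=1)        # append along the channels
        return self.fuse(x_c)                   # final output of the block

block = ProposedBlock(3, 64)
out = block(torch.randn(2, 3, 5, 56, 56))
print(out.shape)  # torch.Size([2, 64, 5, 56, 56])
```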

III-D Forward/Backward Propagation

The forward and backward propagation through the proposed convolutional block is quite straightforward. The first component of the proposed convolutional block is a standard 2-D convolution filter through which gradients can be passed using standard backpropagation. The second component is DAB. In DAB, we extract the feature maps from the input feature maps by tensor slicing and then, perform subtraction and addition to obtain the output. These operations can be done in a differentiable manner using standard deep learning libraries. Finally, the third component is again a convolution kernel through which the gradients can be passed using standard backpropagation.
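The differentiability claim can be checked numerically; the snippet below, which reuses the difference_accumulator sketch from Section III-B, runs torch.autograd.gradcheck on a tiny double-precision tensor (an illustration, not part of the original work).

```python
import torch

# Gradients flow through slicing, subtraction, addition, and stacking, so
# gradcheck can verify the analytic gradients against finite differences.
x = torch.randn(1, 2, 3, 4, 4, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(difference_accumulator, (x,)))  # expected: True
```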

IV Experiments and Discussions

IV-A Datasets

IV-A1 UCF101

UCF101 is a standard action recognition dataset which contains real-world action videos [37]. It contains 13320 videos, out of which 9537 are used for training and 3783 for testing. It is a diverse dataset which covers 101 action categories. The videos have large camera motion, cluttered backgrounds, and varying illumination conditions. It has been used as a benchmark dataset in several works such as [38, 10, 39]. Lee et al. [13] extract image sequences of lengths 3 and 4 from the training set of split-1 of UCF-101. To extract the image sequences, they estimate optical flow in the videos and select the image sequences based on the magnitude of the optical flow. Kanojia et al. [14] extend their dataset by including image sequences of lengths 5 and 6. They obtain sequences of length 5 by extracting a frame to the left of the sequences of length 4. Similarly, they obtain sequences of length 6 by extracting frames to the left and the right of the sequences of length 4. While extracting the frames, the authors made sure that the temporal spacing between the frames is consistent with the original sequence of length 4. We randomly split the datasets corresponding to each sequence length into training and testing sets comprising 70% and 30% of the data, respectively. The videos of UCF101 cover 101 action categories, and the videos of each category are divided into 25 groups containing 4-7 videos each. Videos belonging to the same group can share common features. Hence, while splitting, we made sure that the image sequences belonging to the same group fall into the same set, i.e., either training or test.
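A hedged sketch of such a group-aware split, assuming the extracted sequences keep UCF101's standard file naming (e.g., v_Biking_g05_c02), which encodes the group id; the function and the filename parsing are illustrative, not the authors' code.

```python
import random
from collections import defaultdict

def group_aware_split(sequence_files, test_frac=0.3, seed=0):
    """Split extracted sequences roughly 70/30 so that sequences from the
    same UCF101 group (the gXX token in names like v_Biking_g05_c02) never
    appear in both the training and the test set."""
    groups = defaultdict(list)
    for f in sequence_files:
        parts = f.split("_")                 # ['v', 'Biking', 'g05', 'c02']
        groups[(parts[1], parts[2])].append(f)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = int(test_frac * len(keys))      # split by groups, so 70/30 is approximate
    train = [f for k in keys[n_test:] for f in groups[k]]
    test = [f for k in keys[:n_test] for f in groups[k]]
    return train, test
```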

IV-A2 DAVIS

The DAVIS dataset is a benchmark dataset in the area of video object segmentation [40]. It contains fifty video sequences with several challenging scenarios like occlusions, appearance variation, and motion blur. The videos are captured with a moving camera and contain single as well as multiple dynamic objects. We extract datasets of evenly spaced image sequences of lengths 4 and 6 from the videos. We use these datasets to evaluate the generalization capability of the proposed approach for the task of image sequencing. We do not train the network on the sequences extracted from this dataset; we treat the whole dataset as a test set. We use the network trained on the dataset extracted from UCF101 to estimate the temporal order of the sequences extracted from the DAVIS dataset when they are provided in an unordered fashion.

Fig. 3: (a) ResNet (Basic), (b) ResNet (Bottleneck), (c) Ours (Basic), (d) Ours (Bottleneck). (a) and (b) show the basic and bottleneck blocks used in the ResNet architecture [4]. (c) and (d) show the residual blocks in which the 2D convolution kernel is replaced by the proposed convolutional block (in green). The abbreviations in the figure stand for kernel size, stride, output channels, convolution, batch normalization, and rectified linear unit, respectively.

IV-B Network Architecture

We use residual networks (ResNet) as the back-bone architecture to show the effectiveness of the proposed convolutional block [4]. Fig. 3 (a) and (b) show the basic and bottleneck blocks used in the ResNet architecture [4]. In each of the residual blocks, we replace the 2D convolution kernel by the proposed convolutional block (in green), as shown in Fig. 3 (c) and (d). We only replace the 2D convolution kernels by the proposed convolutional block while keeping the overall structure of the network intact. We perform the experiments with the 18-layer and 50-layer versions of the ResNet architecture. Similar to [14], we train separate networks for each sequence length. The input to the network is an unordered set of images of a dynamic scene captured with a hand-held camera. Similar to [13] and [14], the forward and the backward permutations are considered as a single class. Hence, the classification layer of the network has $n!/2$ classes, where $n$ is the sequence length, and each class corresponds to a permutation. The objective of the network is to map the input unordered image sequence to its corresponding permutation.
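As an illustration of this replacement, the sketch below shows what a ResNet-style basic block might look like with its convolutions swapped for the proposed block, together with a classification head of $n!/2$ classes. It reuses the ProposedBlock sketch from Section III; the pooled feature dimension (512) and other structural details are assumptions, not the paper's exact configuration.

```python
import math
import torch.nn as nn

class BasicBlock25D(nn.Module):
    """ResNet-style basic block with its 2D convolutions replaced by the
    proposed convolutional block; the surrounding structure is kept intact."""
    def __init__(self, channels):
        super().__init__()
        self.block1 = ProposedBlock(channels, channels)
        self.bn1 = nn.BatchNorm3d(channels)
        self.block2 = ProposedBlock(channels, channels)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (batch, c, n, h, w)
        out = self.relu(self.bn1(self.block1(x)))
        out = self.bn2(self.block2(out))
        return self.relu(out + x)              # residual connection

# Classification layer: one class per permutation, with forward and backward
# permutations merged, i.e. n!/2 classes for sequence length n.
n = 5
head = nn.Linear(512, math.factorial(n) // 2)  # 512-dim pooled features assumed
```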

Fig. 4: Motion Heat Maps. The figure shows five test sets of unordered image sequences extracted from the UCF-101 dataset which have been correctly classified by the proposed network trained on the training set extracted from UCF-101. In each set, the first column shows the unordered image set given as the input to the proposed network, and the second column shows the order of images produced by the proposed network as the output, along with the motion heat maps computed from the output of the last DAB of the network.

IV-C Training

We train the networks with Stochastic Gradient Descent (SGD) for the weight update with a momentum of 0.9, a weight decay of 0.001, and an initial learning rate of 0.1. We reduce the learning rate by a factor of 0.1 when the validation loss saturates. We use a batch of 16 clips for all the networks. The data augmentation used in training the networks is the same as that used in [14]. Similar to [14], we perform random cropping on the input image sets: the clips are spatially resized such that the shorter edge gets scaled to 136 pixels, and then we randomly crop a fixed-size region. The size of each data sample is $c \times n \times h \times w$, where $c$ is the number of channels, $n$ is the number of images in the input sequence, and $h \times w$ is the spatial size of the clips. We normalize the frames by subtracting the mean values and dividing by the variance values of ImageNet [7]. The training sets for sequences of lengths 3, 4, 5, and 6 contain around 87.7K, 87.7K, 85K, and 83K image sequences, respectively. We train the networks by feeding all the permutations of the image sequences in a random order; for example, to train the network for sequence length 6, we feed the network with all permutations of each of the roughly 83K training sequences. Similarly, we test the networks with all the permutations of the image sequences in the test sets. We use categorical cross-entropy as the loss function.
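A hedged sketch of this optimization setup (SGD with momentum 0.9, weight decay 0.001, initial learning rate 0.1, learning-rate reduction by 0.1 on plateau, and categorical cross-entropy); the stand-in model, spatial size, and class count are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Stand-in network and data; the spatial size (112) and the 60 classes
# (sequence length 5) are illustrative placeholders.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 5 * 112 * 112, 60))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.001)
# Reduce the learning rate by a factor of 0.1 when the monitored loss saturates.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
criterion = nn.CrossEntropyLoss()              # categorical cross-entropy

clips = torch.randn(16, 3, 5, 112, 112)        # a batch of 16 unordered clips
labels = torch.randint(0, 60, (16,))           # permutation-class labels
optimizer.zero_grad()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
scheduler.step(loss.item())                    # in practice, pass the validation loss
```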

IV-D Comparisons with the State-of-the-art Methods

Table I compares the test classification accuracy obtained on the datasets of unordered image sequences of different sequence lengths extracted from UCF-101 by the proposed approach with the state-of-the-art methods proposed by Lee et al. [13] and Kanojia et al. [14]. All the networks are trained from scratch. Table I shows the accuracy obtained using the proposed approach with ResNet (50 layers) as the backbone architecture. It can be observed that the proposed approach outperforms the state-of-the-art method by Kanojia et al. [14] by a significant margin, and that the margin grows as we move from sequence length 3 to 6. This shows that the proposed approach is better at handling longer sequences in comparison to Kanojia et al. [14] and Lee et al. [13]. Fig. 4 shows the results on test sets of unordered image sequences extracted from the UCF-101 dataset obtained by the proposed network trained on the training set extracted from UCF-101.

Sequence Length | Lee et al. [13] | Kanojia et al. [14] | Ours
3 | 63 | 67.18 | 83.04
4 | 41 | 60.33 | 80.10
5 | - | 54.78 | 81.85
6 | - | 51.30 | 78.29

TABLE I: Comparison with the state-of-the-art. The table compares the test classification accuracy (in percentage) obtained on the datasets of unordered image sequences of different sequence lengths extracted from UCF-101 by the proposed approach, with ResNet (50 layers) as the backbone architecture, with the state-of-the-art methods proposed by Lee et al. [13] and Kanojia et al. [14].

IV-E Ablation Study

IV-E1 Without DAB

In this study, we verify the importance of DAB. The output of DAB is $D$, which contains the temporal structure of the input obtained by accumulating the differences between the features extracted from the input image set. To check its importance, during training we set $D = 0$, i.e., we fill zeros at all positions in $D$, in all the layers of the network. We keep everything else exactly the same. We experimented with the image sequences of lengths 3 and 4 extracted from UCF-101, with ResNet (18 layers) as the back-bone. We observed that the network does not learn anything and gives an accuracy equivalent to the random probability, which is $2/n!$ for a sequence of length $n$. This shows that extracting the temporal structure is crucial for the task of image sequencing. Also, this study confirms that DAB plays a significant role in the task.
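In code, this ablation amounts to replacing the accumulator output with zeros; a minimal sketch, reusing the difference_accumulator function from the Section III-B sketch, is given below.

```python
import torch

def dab_output(x_s, use_dab=True):
    """With use_dab=False, the accumulated differences are replaced by zeros
    at all positions (the 'without DAB' ablation); otherwise the DAB sketch
    from Section III-B is applied unchanged."""
    if not use_dab:
        return torch.zeros_like(x_s)
    return difference_accumulator(x_s)
```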

IV-E2 Effect of Network Depth

In this study, we observe the effect of the depth of the back-bone network on sequences of lengths 3 and 4. We use ResNet with 18 layers and with 50 layers as the back-bone architecture for the proposed convolutional block, replacing the 2D convolution kernels in the residual blocks with the proposed convolutional block. We train these networks on the datasets of image sequences of lengths 3 and 4 extracted from UCF-101. Table II compares the test classification accuracy obtained on these datasets when trained with networks of different depths. It can be observed in Table II that the deeper network (ResNet-50) performs better than the shallower network (ResNet-18).

Sequence Length | Back-bone Network | Accuracy
3 | ResNet (18 layers) | 80.94
3 | ResNet (50 layers) | 83.04
4 | ResNet (18 layers) | 77.79
4 | ResNet (50 layers) | 80.10

TABLE II: Effect of network depth. The table compares the test classification accuracy (in percentage) obtained on the datasets of unordered image sequences of lengths 3 and 4 extracted from UCF-101 when trained with networks of different depths.

IV-E3 Effect of the Sign of Differences

In this study, we observe the effect of the sign of the differences computed in DAB. To observe its effect, we modify Eq. 2 of the block as shown in Eq. 3.

$D_t = \sum_{j=t+1}^{n} \left|X_s^t - X_s^j\right|$   (3)

Here, $|\cdot|$ stands for the absolute value of the input. $X_s^j$ is defined in Section III-B. Instead of accumulating the differences with their sign, we only accumulate their magnitudes. We train the network, with ResNet (18 layers) as the back-bone, on the dataset of unordered image sequences of length 4 extracted from UCF-101 with the modified DAB, i.e., with only the magnitudes of the differences. Table III compares the test classification accuracy obtained on this dataset when the network is trained with DAB comprising Eq. 2 against that obtained when it is trained with DAB comprising Eq. 3. It can be observed that the network performs significantly better with both the sign and the magnitude.
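A minimal sketch of this variant, following the reconstruction of Eq. 3 above: the signed differences of the Section III-B sketch are replaced by their absolute values.

```python
import torch

def difference_accumulator_abs(x_s):
    """Variant of the DAB sketch that accumulates only the magnitudes of the
    differences (Eq. 3 as reconstructed); x_s: (batch, c, n, h, w)."""
    n = x_s.shape[2]
    diffs = []
    for t in range(n):
        d_t = torch.zeros_like(x_s[:, :, t])
        for j in range(t + 1, n):
            d_t = d_t + (x_s[:, :, t] - x_s[:, :, j]).abs()
        diffs.append(d_t)
    return torch.stack(diffs, dim=2)
```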

Variant | Sequence Length | Accuracy
Magnitude (Eq. 3) | 4 | 60.82
Sign + Magnitude (Eq. 2) | 4 | 77.7

TABLE III: Effect of sign. The table compares the test classification accuracy (in percentage) obtained on the dataset of unordered image sequences of length 4 extracted from UCF-101 when the network is trained with the difference accumulator block comprising Eq. 2 against when it is trained with the block comprising Eq. 3. ResNet (18 layers) is used as the back-bone architecture for this experiment.
m | Back-bone Network | Accuracy
0 | ResNet (50 layers) | 1.667
1 | ResNet (50 layers) | 1.667
2 | ResNet (50 layers) | 77.16
n | ResNet (50 layers) | 81.85

TABLE IV: Effect of varying the number of images for accumulating differences. The table compares the test classification accuracy (in percentage) obtained on the dataset of unordered image sequences of length 5 extracted from UCF-101 when trained with different values of m (Eq. 4).
Step | Accuracy (length 4) | #Perms (length 4) | Accuracy (length 6) | #Perms (length 6)
1 | 92.99 | 0.142M | 90.7 | 4.146M
2 | 93.43 | 0.136M | 89.7 | 3.821M
3 | 91.8 | 0.129M | 84.5 | 3.498M
4 | 87.61 | 0.123M | 77.7 | 3.173M
5 | 84.04 | 0.116M | 72.3 | 2.849M
6 | 80.90 | 0.110M | 68.2 | 2.529M
7 | 77.35 | 0.103M | 64.6 | 2.209M
8 | 74.74 | 0.097M | 62.2 | 1.898M

TABLE V: Generalizability. The table shows the classification accuracy (in percentage) obtained on the datasets of unordered image sequences of lengths 4 and 6 extracted from the DAVIS dataset using the proposed approach (ResNet50 as the backbone) when the network is trained on the dataset extracted from UCF-101. The first column shows the spacing (in terms of frames) between the images of the extracted image sequences in the original video. For each sequence length, the Accuracy column shows the accuracy obtained on the sets of unordered image sequences extracted with the corresponding temporal spacing, and the #Perms column shows the number of permutations of the image sequences extracted from the DAVIS dataset used for obtaining the classification accuracy.
Fig. 5: Generalizability. The figure shows seven test sets (one column each) of unordered image sequences extracted from the DAVIS dataset which have been correctly classified by the proposed approach when the network is trained on the dataset extracted from UCF-101. Each column shows one image set. (a) shows the unordered image sets given as the input to the proposed network. (b) shows the order of images provided by the proposed approach as the output, along with the motion heat maps computed from the output of the last DAB of the network.
Fig. 6: Progression of motion heat maps. The figure shows the progression of the motion heat maps computed using the output of DAB along the depth of the ResNet (50 layers) used as the backbone architecture for the proposed convolutional block. The first column shows one image of the input image sequences. The second column shows the output of the DAB placed after conv1 of ResNet [4]. The third, fourth, fifth, and sixth columns show the outputs of the last DAB of layer1, layer2, layer3, and layer4 of the ResNet architecture, respectively.

IV-E4 Varying the Number of Images for Accumulating Temporal Differences

In this study, we observe the effect of accumulating the differences of the feature map corresponding to each image with the feature maps of a fixed number of images ahead of it in the input sequence. Let $X_s^1, X_s^2, \ldots, X_s^n$ be the volumes along the temporal depth of $X_s$. In this case, the output of DAB is obtained as shown in Eq. 4. Here, $D$ is the concatenation of $D_1, D_2, \ldots, D_n$ along the temporal depth.

$D_t = \sum_{j=t+1}^{\min(t+m,\, n)} \left(X_s^t - X_s^j\right)$   (4)

Here, $t$ is a location along the temporal dimension and $m$ is the number of images ahead of each image used for accumulating the differences. For $t = n$, $D_n = 0$. Table IV shows the effect of varying the number of images used for accumulating the temporal differences, i.e., the value of $m$ in Eq. 4, by comparing the test classification accuracy (in percentage) obtained on the dataset of unordered image sequences of length 5 extracted from UCF-101 when trained with different values of $m$. For a sequence of length 5, there are $5!/2 = 60$ classes. It can be observed that without temporal differences ($m = 0$), the network achieves an accuracy equivalent to the random probability of picking a class among the 60 classes, i.e., $1/60 \approx 0.0167$. Even with the temporal differences computed among the adjacent images ($m = 1$), the network still does not learn anything. It can be seen that as we increase the number of images used for accumulating the temporal differences, the test accuracy increases.
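A sketch of the limited accumulation in Eq. 4 as reconstructed above: each temporal slice accumulates differences with at most the next m slices, and a sufficiently large m recovers Eq. 2; the names and the small demonstration are illustrative.

```python
import torch

def difference_accumulator_limited(x_s, m):
    """Accumulate the differences of each temporal slice with at most the
    next m slices (Eq. 4 as reconstructed); x_s: (batch, c, n, h, w)."""
    n = x_s.shape[2]
    diffs = []
    for t in range(n):
        d_t = torch.zeros_like(x_s[:, :, t])
        for j in range(t + 1, min(t + m, n - 1) + 1):
            d_t = d_t + (x_s[:, :, t] - x_s[:, :, j])
        diffs.append(d_t)
    return torch.stack(diffs, dim=2)

x_s = torch.randn(1, 8, 5, 14, 14)
# m = 0 produces all-zero outputs (no temporal differences at all), while
# m = n - 1 matches the full accumulation of Eq. 2.
for m in (0, 1, 2, 4):
    print(m, bool(difference_accumulator_limited(x_s, m).abs().sum() > 0))
```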

IV-E5 Generalizability

In this study, we evaluate the generalizability of the proposed approach. We want to verify that the proposed approach (ResNet-50 as the backbone) is learning the task of sequencing rather than somehow learning the distribution of the dataset. For this purpose, we use the dataset of image sequences extracted from the DAVIS dataset. We use the networks (ResNet50 as the backbone) trained on the dataset of sequence lengths 4 and 6 extracted from the UCF-101 to obtain the classification accuracy on the dataset of unordered image sequences of lengths 4 and 6 extracted from the DAVIS dataset.
Table V shows the classification accuracy obtained on the dataset of unordered image sequences extracted from the DAVIS dataset. The dataset is extracted in such a way that the images in each sequence are evenly spaced temporally. However, the temporal spacing between the images could affect the classification accuracy of the temporal ordering. We therefore extract different test sets of image sequences from the DAVIS dataset by varying the number of frames skipped in the video while extracting the image sequences of lengths 4 and 6. Table V shows the variation of the classification accuracy as we change the temporal spacing between the images of the image sequences. It can be seen that as we increase the temporal spacing during extraction, the classification accuracy decreases. This is because when the temporal spacing is large, the dynamic objects undergo large motion, which could lead to erroneous ordering. However, considering that we are evaluating the network on a dataset (in this case, the DAVIS dataset) which is different from the dataset it is trained with (in this case, UCF101), the obtained accuracy is considerably high. This shows that the proposed approach (ResNet50 as the backbone) learns the task of image sequencing. Fig. 5 shows seven sets of unordered image sequences extracted from the DAVIS dataset which have been correctly classified by the proposed network trained on the dataset extracted from UCF-101. Fig. 5(a) shows the unordered image sets given as the input to the proposed network. Fig. 5(b) shows the order of the image sets provided by the proposed network as the output.
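A hedged sketch of how such evenly spaced test sequences with a given frame spacing might be extracted from a video's frame list; the function is illustrative and not the authors' extraction code.

```python
def extract_sequences(frames, seq_len, step):
    """Return all length-seq_len subsequences of `frames` whose images are
    evenly spaced `step` frames apart in the original video."""
    span = (seq_len - 1) * step
    return [frames[start:start + span + 1:step]
            for start in range(len(frames) - span)]

# e.g., 4-image sequences with a spacing of 3 frames from a 10-frame video
print(extract_sequences(list(range(10)), seq_len=4, step=3))
# [[0, 3, 6, 9]]
```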

IV-E6 Motion Heat Maps

We extract heat maps from the output of the last DAB of the network to understand the nature of the feature maps computed by DAB. DAB outputs the volumes of the accumulated temporal differences, i.e., $D_1, D_2, \ldots, D_n$, corresponding to each of the input images. We compute the heat map for the $i$-th image by averaging the absolute values of the feature maps belonging to $D_i$ along the channels. Figs. 4 and 5(b) show the motion heat maps obtained from the output of the last DAB of the network. It can be observed that DAB focuses on the regions with significant motion. This is significant, as we have not provided any motion-related cues to the network.
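A minimal sketch of this heat-map computation: the per-image map is the channel-wise mean of the absolute DAB output, with an illustrative normalization for display.

```python
import torch

def motion_heat_maps(d):
    """d: output of the last DAB, shape (batch, c, n, h, w). Returns one heat
    map per image, shape (batch, n, h, w), by averaging the absolute values
    of the feature maps along the channel dimension."""
    return d.abs().mean(dim=1)

d = torch.randn(1, 256, 6, 14, 14)             # illustrative DAB output
heat = motion_heat_maps(d)
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]
print(heat.shape)                               # torch.Size([1, 6, 14, 14])
```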

IV-E7 Progression of Motion Heat Maps

In this study, we observe the progression of the motion heat maps computed using the output of DAB along the depth of the network. For the experiment, we use ResNet (50 layers) as the backbone architecture, trained on the dataset of image sequences of length 6 extracted from UCF101. The ResNet architecture is a sequence of conv1, layer1, layer2, layer3, layer4, and fc, where conv1 is a convolution layer and fc is the fully connected layer. layer1, layer2, layer3, and layer4 comprise 3, 4, 6, and 3 residual blocks, respectively [4]. We use the output of the DAB placed after conv1 and the outputs of the last DAB of layer1, layer2, layer3, and layer4 to demonstrate the progression of the motion heat maps, which is shown in Figure 6.

V Conclusion

In this work, we propose a novel convolutional block for the task of image sequencing. We use the residual network architecture as the back-bone for the proposed convolutional block [4]. We outperform the state-of-the-art methods on the standard dataset used in previous works by a significant margin. Through experiments, we verify the significance of the proposed difference accumulator block (DAB). We also show that the sign of the differences of the feature maps carries important information. We further show that the proposed approach generalizes well by evaluating it on the DAVIS dataset, on which the network has not been trained. Generalizability has been a major concern in deep learning for a long time: networks trained on one dataset often do not perform well on datasets they have not been trained with, even when the task is the same and quite general, such as optical flow estimation or semantic segmentation. The proposed approach is observed to overcome this issue for the task of image sequencing.

References

  • [1] L. Sevilla-Lara, S. Zha, Z. Yan, V. Goswami, M. Feiszli, and L. Torresani, “Only time can tell: Discovering temporal data for temporal modeling,” arXiv preprint arXiv:1907.08340, 2019.
  • [2] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • [3] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., “The ‘something something’ video database for learning and evaluating visual common sense,” in Proceedings of the IEEE international conference on computer vision, vol. 1, no. 4, 2017, p. 5.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [5] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [8] Z. Xu, Y. Yang, and A. G. Hauptmann, “A discriminative cnn video representation for event detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1798–1807.
  • [9] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell, “Actionvlad: Learning spatio-temporal aggregation for action classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 971–980.
  • [10] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri, “Convnet architecture search for spatiotemporal feature learning,” arXiv preprint arXiv:1708.05038, 2017.
  • [11] R. Alkadi, A. El-Baz, F. Taher, and N. Werghi, “A 2.5 d deep learning-based approach for prostate cancer detection on t2-weighted magnetic resonance imaging,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.
  • [12] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, “A new 2.5 d representation for lymph node detection using random sets of deep convolutional neural network observations,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2014, pp. 520–527.
  • [13] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised representation learning by sorting sequences,” in IEEE International Conference on Computer Vision.   IEEE, 2017, pp. 667–676.
  • [14] G. Kanojia and S. Raman, “Deepimseq: Deep image sequencing for unsynchronized cameras,” Pattern Recognition Letters, vol. 117, pp. 9–15, 2019.
  • [15] ——, “Simultaneous detection and removal of dynamic objects in multi-view images,” in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1990–1999.
  • [16] ——, “Patch-based detection of dynamic objects in crowdcam images,” The Visual Computer, vol. 35, no. 4, pp. 521–534, 2019.
  • [17] N. Zarrabi, S. Avidan, and Y. Moses, “Crowdcam: Dynamic region segmentation,” arXiv preprint arXiv:1811.11455, 2018.
  • [18] A. Dafni, Y. Moses, S. Avidan, and T. Dekel, “Detecting moving regions in crowdcam images,” Computer Vision and Image Understanding, vol. 160, pp. 36–44, 2017.
  • [19] M. Lee, S. Lee, S. Son, G. Park, and N. Kwak, “Motion feature network: Fixed motion filter for action recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 387–403.
  • [20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • [21] R. Baker, M. Dexter, T. E. Hardwicke, A. Goldstone, and Z. Kourtzi, “Learning to predict: Exposure to temporal sequences facilitates prediction of future events,” Vision research, vol. 99, pp. 124–133, 2014.
  • [22] Y. Zhou and T. L. Berg, “Temporal perception and prediction in ego-centric video,” in IEEE International Conference on Computer Vision, 2015, pp. 4498–4506.
  • [23] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra et al., “Visual storytelling,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1233–1239.
  • [24] L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, B. Scholkopf, and W. T. Freeman, “Seeing the arrow of time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2035–2042.
  • [25] I. Mani and B. Schiffman, “Temporally anchoring and ordering events in news,” Time and Event Recognition in Natural Language. John Benjamins, 2005.
  • [26] F. Sadeghi, J. R. Tena, A. Farhadi, and L. Sigal, “Learning to select and order vacation photographs,” in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on.   IEEE, 2015, pp. 510–517.
  • [27] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision.   Springer, 2016, pp. 69–84.
  • [28] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in European Conference on Computer Vision.   Springer, 2016, pp. 527–544.
  • [29] H. Agrawal, A. Chandrasekaran, D. Batra, D. Parikh, and M. Bansal, “Sort story: Sorting jumbled images and captions into stories,” arXiv preprint arXiv:1606.07493, 2016.
  • [30] T. Basha, Y. Moses, and S. Avidan, “Photo sequencing,” in European Conference on Computer Vision.   Springer, 2012, pp. 654–667.
  • [31] Y. Moses, S. Avidan et al., “Space-time tradeoffs in photo sequencing,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 977–984.
  • [32] G. Kanojia, S. R. Malireddi, S. C. Gullapally, and S. Raman, “Who shot the picture and when?” in International Symposium on Visual Computing.   Springer, 2014, pp. 438–447.
  • [33] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár, “Exploring weak stabilization for motion feature extraction,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2882–2889.
  • [34] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” in BMVC 2009-British Machine Vision Conference.   BMVA Press, 2009, pp. 124–1.
  • [35] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision.   Springer, 2014, pp. 818–833.
  • [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [37] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [38] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • [39] A. Diba, M. Fayyaz, V. Sharma, M. Mahdi Arzani, R. Yousefzadeh, J. Gall, and L. Van Gool, “Spatio-temporal channel correlation networks for action classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 284–299.
  • [40] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 724–732.