Event Recognition with Automatic Album Detection based on Sequential Processing, Neural Attention and Image Captioning

11/25/2019 ∙ by Andrey V. Savchenko, et al.

In this paper a new formulation of the event recognition task is examined: it is required to predict event categories in a gallery of images for which the albums (groups of photos corresponding to a single event) are unknown. We propose a novel two-stage approach. At first, features are extracted from each photo using a pre-trained convolutional neural network. These features are classified individually, and the scores of the classifier are used to group sequential photos into several clusters. Finally, the features of the photos in each group are aggregated into a single descriptor using a neural attention mechanism. This algorithm is optionally extended to improve the accuracy of classification of each image in an album. In contrast to conventional fine-tuning of convolutional neural networks (CNN), we propose to use image captioning, i.e., a generative model that converts images into textual descriptions. The descriptions are one-hot encoded and summarized into a sparse feature vector suitable for learning of an arbitrary classifier. An experimental study with the Photo Event Collection and the Multi-Label Curation of Flickr Events Dataset demonstrates that our approach is 9-20% more accurate than event recognition of each photo separately, and that the proposed method has 13-16% higher accuracy than classification of groups of photos obtained with hierarchical clustering. It is experimentally shown that the image captions trained on the Conceptual Captions dataset can be classified more accurately than the features from an object detector, though both are obviously not as rich as the CNN-based features. However, it is possible to combine our approach with conventional CNNs in an ensemble to provide state-of-the-art results for several event datasets.


1 Introduction

People are taking more photos than ever before [guo2017multigranular] due to the rapid growth of social networks, cloud services and mobile technologies. To organize a personal collection, the photos are usually assigned to albums according to some events. Photo organizing systems (Apple iPhoto, Google Photos, etc.) allow the user to rapidly search for a required photo and increase the efficiency of work with a gallery [sokolova2017organizing]. Nowadays, these systems usually include content-based image analysis and automatic association of each photo with different tags (scene description, persons, objects, locations, etc.). Such analysis can be used not only to selectively retrieve photos for a particular tag in order to keep nice memories of some episodes of the user's life [wang2018transferring], but also to make personalized recommendations that assist customers in finding relevant items within large collections. The design of such systems requires careful consideration of the user modeling approach [savchenko2019user]. A large gallery of photos on a mobile device can be used for understanding such user interests as sport, gadgets, fitness, clothes, cars, food, travelling, pets, etc. [grechikhin2019user, rassadin2019scene].

In this paper we focus on one of the most challenging parts of a photo organizing engine, namely, image-based event recognition [ahmad2019deep], in order to extract such events as holidays, sport events, weddings, various activities, etc. An event can be defined as a category that captures the “complex behavior of a group of people, interacting with multiple objects, and taking place in a specific environment” [wang2018transferring]. There exist two different tasks of event recognition. The first one is focused on processing of single photos, i.e., an event is considered as a complex scene with large variations in visual appearance and structure [wang2018transferring]. The second task aims at predicting the event category of a group of photos (album) [bacha2016event]. In the latter case it is assumed that all photos in an album are weakly labeled [ahmad2017event], though the importance of each image may differ [wang2017recognizing]. However, in practice only a gallery of photos is available, so that the latter approach requires a user to manually choose the albums. Another option includes location-based album creation if the GPS tags are switched on. In both cases the usage of album-based event recognition is limited or even impossible.

Thus, in this paper we consider a new task of event recognition, in which a gallery of photos is given and it is known to contain ordered albums with unknown borders. We propose to automatically assign these borders based on the visual content of consecutive photos in a gallery. Next, consecutive photos are grouped, and a descriptor of each group is computed with an attention mechanism from the neural aggregation module [yang2017neural]. Finally, this approach is extended as follows. Despite the conventional usage of CNNs as discriminative models in classifier design, we propose to borrow generative models to represent an input image in another domain. In particular, we use existing methods of image captioning [hossain2019comprehensive] that generate textual descriptions of images. Our main contribution is a demonstration that the generated descriptions can be fed to the input of a classifier in an ensemble in order to improve the event recognition accuracy of traditional methods. Though the proposed visual representation is not as rich as features extracted by fine-tuned CNNs, it is better than the outputs of object detectors [rassadin2019scene].

2 Literature Survey

Annotating personal photo albums is an emerging trend in photo organizing services [dao2011signature]. A method for hierarchical photo organization into topics and topic-related categories on a smartphone is proposed in [lonn2019smartphone] based on the integration of a convolutional neural network (CNN) and topic modeling for image classification. An automatic hierarchical clustering and best photo selection solution is introduced in [kuzovkin2019context] for modeling user decisions in organizing or clustering similar photos in albums. Organizing photo albums for user preference prediction on a mobile device is considered in [savchenko2019efficient].

The task of event recognition in personal photo collections is to recognize the event not in an individual photo but in the whole album [tsai2011album]. The events and sub-events of sub-sequences of photos are identified in [dao2011signature] by integrating optimized linear programming with the color descriptor of the signature image. Stopwatch Hidden Markov Models were applied in [bossard2013event] by treating the photos in an album as sequential data. Detectors for objects relevant to the events were trained on the holiday dataset in [tsai2011album]; these holidays are then classified based on the outputs of the object detector. The paper [ahmad2017event] tackles the presence of irrelevant images in an album with multiple instance learning techniques. An iterative updating procedure for event type and image importance score prediction in a Siamese network is presented in [wang2017recognizing]. The authors of that paper used a CNN that recognizes the event type and a Long Short-Term Memory (LSTM)-based sequence-level event recognizer for the whole album. Moreover, they successfully applied a method for learning representative deep features for image set analysis [wu2015learning]. The latter approach focuses on capturing the co-occurrences and frequencies of features, so that temporal coherence of photos in an album is not required. A model to recognize events at coarse-to-fine hierarchical levels using multi-granular features [savchenko2019efficient] is proposed in [guo2017multigranular] based on an attention network that learns the representations of photo albums. The efficiency of re-finding expected photos on mobile phones was improved by a method that classifies personal photos based on the relationship of shooting time and shooting location to specific events [geng2018classifying].

The album information is not always available, so that a gallery contains an unstructured list of photos ordered by their creation time. In such a case it is possible to use existing methods of event recognition on single photos [ahmad2019deep]. Similarly to other computer vision domains, the mainstream approach tends to the application of CNN-based architectures. For example, four different layers of a fine-tuned CNN were used to extract features and perform Linear Discriminant Analysis in order to obtain the top entry in the ChaLearn LAP 2015 cultural event recognition challenge [escalera2015chalearn]. The bounding boxes of detected objects are projected onto multi-scale spatial maps to increase the accuracy of event recognition [xiong2015recognize]. A novel iterative selection method is introduced in [wang2018transferring] to identify a subset of classes that are most relevant for transferring deep representations learned from object (ImageNet) and scene (Places2) datasets.

Unfortunately, the accuracy of event classification on still photos [wang2018transferring] is in general much lower than the accuracy of album-based recognition [wang2017recognizing]. That is why in this paper we propose to concentrate on other suitable visual features extracted with generative models and, in particular, image captioning techniques. There is a wide range of applications of image captioning: from automatic generation of descriptions for photos posted in social networks to image retrieval from databases using generated text descriptions [vijayaraju2019image]. The image captioning methods are usually based on an encoder-decoder neural network, which first encodes an image into a fixed-length vector representation using a pre-trained CNN, and then decodes this representation into a caption (a natural language description). During the training of the decoder (generator), the input image and its ground-truth textual description are fed as inputs to the neural network, while the one-hot encoded description represents the desired network output. The description is encoded using text embeddings in the Embedding (look-up) layer [goodfellow2016deep]. The generated image and text embeddings are merged using concatenation or summation and form the input to the decoder part of the network. It is typical to include a recurrent neural network (RNN) layer followed by a fully connected layer with a Softmax output.

One of the first successful models, “Show and Tell” [cap_ST], won the first MS COCO Image Captioning Challenge in 2015. It uses an RNN with long short-term memory (LSTM) units in the decoder part. Its enhancement “Show, Attend and Tell” [cap_SAT] incorporates a soft attention mechanism to improve the quality of caption generation. The “Neural Baby Talk” image captioning model [cap_NBT] is based on generating a template with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified by object detectors. The foreground regions are obtained using the Faster R-CNN network [ren2015faster], and an LSTM with an attention mechanism serves as the decoder. The “Multimodal Recurrent Neural Network” (mRNN) [cap_mRNN] is based on the Inception network for image feature extraction and a deep RNN for sentence generation. One of the best models nowadays is the Auto-Reconstructor Network (ARNet) [cap_ARNet], which uses the Inception-V4 network [cap_inc4] as an encoder, while the decoder is based on an LSTM. There exist two pre-trained models that generate the final caption for each input image with greedy search (ARNet-g) and with beam search of size 3 (ARNet-b).

3 Materials and Methods

3.1 Problem formulation

As noticed above, an important task is the automatic extraction of albums from a personal gallery based on the visual content of photos. In this subsection we describe a technological engine that can solve this task by sequential processing of photos, similarly to cluster analysis with the Basic Sequential Algorithmic Scheme (BSAS) [ashour2019improve]. Our main task can be formulated as follows. It is required to assign each photo X(n), n ∈ {1, ..., N}, from a gallery of an input user to one of C event categories (classes). Here N is the total number of photos in the gallery. The training set of M albums is available for learning of an event classifier. The m-th reference album is defined by its K_m images {X_m(k)}, k ∈ {1, ..., K_m}. The class label c_m ∈ {1, ..., C} of each m-th album is supposed to be given, i.e., we assume that an album is associated with exactly one event type.

Conventional event recognition on single photos [wang2018transferring] is a special case of the above-formulated problem if each album contains only one photo. The main difference is the following assumption: the gallery is not a random collection of photos but can be represented as a sequence of disjoint albums, and each image in an album is associated with the same event. In contrast to album-based event recognition, the borders of each album are unknown. This task possesses several characteristics that make it extremely challenging compared to previously studied problems. One of them is the presence of irrelevant or unimportant photos that can in principle be associated with any event [ahmad2019deep]. These images are easily detected in attention-based models [guo2017multigranular, yang2017neural], but may have a significant impact on the quality of automatic album selection.

The baseline approach here is to classify all photos independently. In such a case it is typical to unfold the training albums into a set of photos, so that the collection-level label c_m of the m-th album is assigned to the labels of each of its photos X_m(k), k ∈ {1, ..., K_m}. Next, it is possible to train an arbitrary event classifier. If the training set is too small to train a deep CNN from scratch, transfer learning or domain adaptation can be applied [goodfellow2016deep]. In these methods a large external dataset, e.g. ImageNet-1000 or Places2 [zhou2018places], is used to pre-train a deep CNN. As we pay special attention to offline recognition on mobile devices, it is reasonable to use such CNNs as MobileNet v1/v2 [howard2017MobileNets, sandler_inverted_2018]. The final step in transfer learning is fine-tuning of this neural network on the unfolded training set. This step includes the replacement of the last layer of the pre-trained CNN by a new layer with Softmax activations and C outputs. During the classification process, each input image X(n) is fed to the fine-tuned CNN to compute the C-dimensional vector of scores p(n) = [p_1(n), ..., p_C(n)] (predictions at the last layer).

This procedure can be modified by replacing the logistic regression in the last layer with a more complex classifier, e.g., a random forest (RF), a support vector machine (SVM) or gradient boosting. In this case the features (embeddings) [savchenko2019sequential] are extracted using the outputs of one of the last layers of the pre-trained CNN. Namely, the images X(n) and X_m(k) are fed to the CNN, and the outputs of the one-but-last layer are used as the D-dimensional feature vectors x(n) and x_m(k), respectively. Such deep learning-based feature extractors allow training of a general classifier. The n-th photo is fed into this classifier to obtain the C-dimensional vector of confidence scores p(n).

Finally, the confidences p(n) computed in any of the above-mentioned ways are used to make a decision in favor of the most probable class:

c*(n) = argmax_{c ∈ {1,...,C}} p_c(n).     (1)
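To make the baseline concrete, here is a minimal Python sketch of the single-photo pipeline: a pre-trained CNN as a feature extractor, a linear classifier on the embeddings, and decision rule (1). The stock ImageNet weights and the MobileNetV2 input size stand in for the Places2-pretrained scene models used in the experiments; the helper names are illustrative.

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from sklearn.svm import LinearSVC

# Pre-trained CNN without its classification head: outputs D-dimensional embeddings x(n).
# ImageNet weights are used here only because they ship with Keras; the experiments in
# this paper rely on Places2-pretrained scene models instead.
cnn = MobileNetV2(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(224, 224, 3))

def extract_embeddings(images):
    """images: array of shape (n, 224, 224, 3) with pixel values in [0, 255]."""
    return cnn.predict(preprocess_input(images.astype("float32")), verbose=0)

def train_classifier(train_images, train_labels):
    """Train an arbitrary classifier (here a linear SVM) on the unfolded training albums."""
    return LinearSVC().fit(extract_embeddings(train_images), train_labels)

def predict_events(clf, gallery_images):
    """Decision rule (1): the class with the maximal confidence score for each photo."""
    scores = clf.decision_function(extract_embeddings(gallery_images))
    return np.argmax(scores, axis=1)
```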

3.2 Event recognition in a gallery of photos

The proposed pipeline is presented in Fig. 1.

Figure 1: Proposed gallery-based event recognition pipeline

Here, firstly, the module “Feature extractor” computes the embeddings x(n) of every n-th individual photo as described in Subsection 3.1. The classifier confidences p(n) are estimated in the “Classifier” block. Next, we use sequential analysis from BSAS clustering [ashour2019improve] in the “Sequential cluster analysis” module for the sequence of confidences in order to obtain the borders of albums. Namely, the distances ρ(p(n), p(n−1)) between neighboring photos are computed. If a distance does not exceed a certain threshold ρ0, it is assumed that both photos are included in the same album. If location information is available in the EXIF (Exchangeable Image File Format) data of these photos, the distance between their locations can be added to ρ(p(n), p(n−1)) in order to obtain the final distance to be matched with the threshold. Otherwise, the border between two albums is established at the n-th position. As a result, we obtain the borders of T albums, so that the t-th album contains photos X(n), n ∈ {n_{t−1}, ..., n_t − 1}, where n_0 = 1 < n_1 < ... < n_T = N + 1.
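A minimal Python sketch of this sequential border detection follows, assuming Euclidean distances between the confidence vectors of neighbouring photos (the concrete distance ρ and the optional EXIF location term may differ):

```python
import numpy as np

def detect_album_borders(confidences, threshold):
    """Sequentially split an ordered gallery into albums: confidences has shape (N, C)
    with classifier scores p(n) for each photo in shooting order; a new album starts
    whenever the distance between the scores of neighbouring photos exceeds the threshold.
    Returns borders [0, n_1, ..., N]; album t covers indices n_{t-1} <= n < n_t."""
    borders = [0]
    for n in range(1, len(confidences)):
        distance = np.linalg.norm(confidences[n] - confidences[n - 1])  # Euclidean rho
        # If EXIF locations are available, their distance can be added here as well.
        if distance > threshold:
            borders.append(n)
    borders.append(len(confidences))
    return borders
```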

Figure 2: Attention-based neural network for embeddings from MobileNet v2

At the second stage, the final descriptor x_t of the t-th album is produced as a weighted sum of the individual features x(n):

x_t = Σ_n w_n · x(n), n ∈ {n_{t−1}, ..., n_t − 1},     (2)

where the weights w_n may depend on the features x(n). It is typical to use here average pooling (AvgPool) with equal weights, so that conventional computation of the mean feature vector is implemented. However, in this paper we propose to learn the weights w_n, particularly, with an attention mechanism from the neural aggregation module used previously only for video recognition [yang2017neural]:

w_n = exp(q^T x(n)) / Σ_j exp(q^T x(j)), j ∈ {n_{t−1}, ..., n_t − 1}.     (3)

Here q is the learnable D-dimensional vector of attention weights. A dense (fully connected) layer is attached to the resulting descriptor x_t, and the whole neural network (Fig. 2) is trained in an end-to-end manner using the given training set of M albums. The event class predicted by this network in the “Neural attention model” block (Fig. 1) is assigned to all photos X(n), n ∈ {n_{t−1}, ..., n_t − 1}.
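The aggregation (2)-(3) can be expressed as a small Keras model. The following sketch assumes fixed-size sets of L2-normed embeddings and illustrative hyperparameters rather than the exact training configuration from the experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_attention_classifier(num_photos, emb_dim, num_classes, multi_label=False):
    """Attention-based aggregation (2)-(3): a learnable vector q scores every photo
    embedding, a softmax over the set turns the scores into weights w_n, and the album
    descriptor is the weighted sum of the (L2-normed) embeddings."""
    inputs = layers.Input(shape=(num_photos, emb_dim))       # set of K embeddings x(n)
    scores = layers.Dense(1, use_bias=False)(inputs)         # q^T x(n), shape (K, 1)
    weights = layers.Softmax(axis=1)(scores)                 # attention weights w_n
    descriptor = layers.Dot(axes=1)([weights, inputs])       # sum_n w_n x(n), shape (1, D)
    descriptor = layers.Flatten()(descriptor)
    activation = "sigmoid" if multi_label else "softmax"     # multi-label vs single-label
    outputs = layers.Dense(num_classes, activation=activation)(descriptor)
    model = Model(inputs, outputs)
    loss = "binary_crossentropy" if multi_label else "categorical_crossentropy"
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=loss, metrics=["accuracy"])
    return model
```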

Require: input gallery {X(n)}, n = 1, ..., N; trained classifier; threshold ρ0
Ensure: event labels c*(n) of all input images
1: Assign t := 0, initialize an empty list of album borders
2: for each input image X(n), n = 1, ..., N do
3:     Feed the n-th image into a CNN and compute embeddings x(n)
4:     Compute confidences p(n) using the classifier
5:     if n = 1 or ρ(p(n), p(n−1)) > ρ0 then
6:         Assign t := t + 1, append the border n_{t−1} := n to the list
7:     end if
8: end for
9: Append n_T := N + 1 to the list of borders, where T := t
10: for each extracted album t = 1, ..., T do
11:     Feed the input images X(n), n ∈ {n_{t−1}, ..., n_t − 1}, into the attention network (2)-(3) and obtain the event class c*_t
12:     Assign c*(n) := c*_t for all n ∈ {n_{t−1}, ..., n_t − 1}
13: end for
14: return labels {c*(n)}, n = 1, ..., N
Algorithm 1 Proposed gallery-based event recognition
1: for each training album m = 1, ..., M do
2:     for each image X_m(k), k = 1, ..., K_m do
3:         Feed the image X_m(k) into a CNN and compute embeddings x_m(k)
4:     end for
5: end for
6: Train the classifier using the unfolded training set of embeddings
7: Train the attention network (2)-(3) using fixed-size subsets of the features of all training albums
8: for each training album m = 1, ..., M do
9:     for each image X_m(k), k = 1, ..., K_m do
10:         Predict confidence scores p_m(k) for the embeddings x_m(k) using the classifier
11:     end for
12: end for
13: Randomly permute the album indices {1, ..., M} to obtain a sequence {m_1, ..., m_M}
14: Unfold all training embeddings using this permutation into an ordered training gallery
15: Assign the best accuracy A* := 0
16: for each potential threshold ρ0 do
17:     Call Algorithm 1 with the unfolded training gallery, the trained classifier and the threshold ρ0
18:     Compute the accuracy A using the predictions for all training images
19:     if A > A* then
20:         Assign A* := A, ρ0* := ρ0
21:     end if
22: end for
23: return the classifier, the attention network and the threshold ρ0*
Algorithm 2 Learning procedure in the proposed approach

Complete classification and learning procedures are presented in Algorithm 1 and Algorithm 2, respectively. For simplicity, we mentioned that the latter calls the event prediction in step 17. However, to speed up computations it is recommended to pre-compute the pairwise distance matrix between the confidence scores of all training images, so that feature extraction (steps 3-4 in Algorithm 1) and distance calculation are not needed during the learning of our model.
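One possible form of this pre-computation, assuming SciPy is available and that the confidence scores of all training photos have already been predicted (the Euclidean metric is an assumption):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def precompute_score_distances(train_confidences):
    """One-off pre-computation for Algorithm 2: pairwise distances between the
    confidence scores of all training photos (the Euclidean metric is an assumption)."""
    return squareform(pdist(train_confidences, metric="euclidean"))

def neighbour_distances(distance_matrix, permuted_photo_indices):
    """Distances between consecutive photos of the permuted training gallery; each call
    of Algorithm 1 then only compares these cached values with a candidate threshold."""
    idx = np.asarray(permuted_photo_indices)
    return distance_matrix[idx[1:], idx[:-1]]
```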

Figure 3: Mobile demo GUI: (a) main user interface, (b) detailed categories, (c) photos of a selected category

We implemented the whole pipeline (Fig. 1) in a publicly available demo application for Android (https://drive.google.com/open?id=1aYN0ZwU90T8ZruacvND01hbIaJS4EZLI), see Fig. 3, which was previously developed to extract user preferences by processing all photos from the gallery in a background thread [savchenko2019user]. Similar events found in photos made on the same day were united into high-level logs of the most important events. We display only those scenes/events for which there exist at least 2 photos and the average score of the scene/event predictions for all photos of the day exceeds a certain threshold. A sample screenshot of the main user interface is shown in Fig. 3(a). It is possible to tap any bar in this histogram to show a new form with detailed categories (Fig. 3(b)). If a concrete category is tapped, a “display” form appears, which contains a list of all photos from the gallery with this category (Fig. 3(c)). Here we group events by date and provide the possibility to choose a concrete day.
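A sketch of the display rule described above (at least two photos per event and an average prediction score above a threshold); the threshold value and the input format are assumptions:

```python
from collections import defaultdict

def select_display_events(day_predictions, min_photos=2, score_threshold=0.5):
    """day_predictions: (event_label, score) pairs for all photos taken on one day.
    An event is displayed only if it occurs in at least min_photos photos and its
    average score exceeds the threshold; the concrete threshold value is an assumption."""
    scores_per_event = defaultdict(list)
    for label, score in day_predictions:
        scores_per_event[label].append(score)
    return [label for label, scores in scores_per_event.items()
            if len(scores) >= min_photos and sum(scores) / len(scores) > score_threshold]
```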

3.3 Event recognition in single photos

Event recognition in single photos can be formulated as a typical image recognition problem. It is required to assign an input photo from a gallery to one of C event categories (classes). The training set of images with known event labels is available for classifier learning. Sometimes the training photos of the same event are associated with an album [bossard2013event, wang2017recognizing]. In such a case the training albums are unfolded into a set of photos, so that the collection-level label of an album is assigned to the labels of each photo from this album. This task possesses several characteristics that make it extremely challenging compared to album-based event recognition. One of them is the presence of irrelevant or unimportant photos that can be associated with any event [ahmad2019deep]. These images can be detected by attention-based models when the whole album is available [guo2017multigranular], but may have a significant impact on the quality of event recognition in single images.

As the training set is usually rather small, transfer learning may be applied [goodfellow2016deep]. A deep CNN is firstly pre-trained on a large dataset, e.g. ImageNet or Places [zhou2018places]. Secondly, this CNN is fine-tuned on the training set, i.e., the last layer is replaced by a new layer with Softmax activations and C outputs. An input image is classified by feeding it to the fine-tuned CNN to compute the C scores from the output layer, i.e., estimates of the posterior probabilities for all event categories. This procedure can be modified by extraction of deep image features (embeddings) using the outputs of one of the last layers of the pre-trained CNN. The input and training images are fed to the CNN, and the outputs of the one-but-last layer are used as D-dimensional feature vectors. Such deep learning-based feature extractors allow training of a general classifier, e.g., k-nearest neighbor, random forest (RF), support vector machine (SVM) or gradient boosting. The C-dimensional vector of confidence scores is predicted given the input image in both cases: fine-tuning with the last Softmax layer in the role of the classifier, and feature extraction with a general classifier. The final decision is made in favor of the class with the maximal confidence.

In this paper we use another approach to event recognition, based on generative models and image captioning. The proposed pipeline is presented in Fig. 4. At first, conventional extraction of embeddings is implemented using a pre-trained CNN. Next, these visual features and a vocabulary are fed to a special RNN-based neural network (generator) that produces the caption describing the input image. The caption is represented as a sequence of tokens from the vocabulary. It is generated sequentially, word by word, starting from a special start token until the end-of-sentence word is produced [cap_ARNet].

Figure 4: Proposed event recognition pipeline based on image captioning

The generated caption is fed into an event classifier. In order to learn its parameters, every training image is fed to the same image captioning network to produce its caption. Since the number of tokens is not the same for all images, it is necessary either to train a sequential RNN-based classifier or to transform all captions into feature vectors of the same dimensionality. As the number of training instances is not very large, we experimentally noticed that the latter approach is as accurate as the former, though its training time is significantly lower. Hence, we decided to use the one-hot encoding of the generated word sequences into vectors of 0s and 1s as described in [francois2017deep]. In particular, we select a subset of the vocabulary by choosing the top K most frequently occurring words in the training data, with optional exclusion of stop words. Next, the input image is represented as a K-dimensional sparse vector, where K is the size of the reduced vocabulary and the i-th component of the vector is equal to 1 only if at least one of the words in the caption is equal to the i-th word of the reduced vocabulary. This would mean, for instance, turning the sequence {1, 5, 10, 2} into a K-dimensional sparse vector that would be all 0s except for indices 1, 2, 5 and 10, which would be 1s [francois2017deep]. The same procedure is used to describe each training image with a K-dimensional sparse vector. After that, an arbitrary classifier of such textual representations suitable for sparse data can be used to predict the confidence scores. It was demonstrated in [francois2017deep] that such an approach is even more accurate than conventional RNN-based classifiers (including one layer of LSTMs) for the IMDB dataset.
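A minimal sketch of this multi-hot encoding with a linear classifier on top of it; the tokenization (lower-cased whitespace splitting), the helper names and the vocabulary size in the usage comment are assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def build_vocabulary(train_captions, top_k, stop_words=frozenset()):
    """Keep the top_k most frequent caption words (optionally excluding stop words)."""
    counts = {}
    for caption in train_captions:
        for word in caption.lower().split():
            if word not in stop_words:
                counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts, key=counts.get, reverse=True)[:top_k]
    return {word: index for index, word in enumerate(ranked)}

def encode_captions(captions, vocabulary):
    """Multi-hot encoding: the i-th component is 1 iff the i-th vocabulary word occurs."""
    features = np.zeros((len(captions), len(vocabulary)), dtype=np.float32)
    for row, caption in enumerate(captions):
        for word in caption.lower().split():
            if word in vocabulary:
                features[row, vocabulary[word]] = 1.0
    return features

# Any classifier suitable for sparse data can then be trained on these vectors, e.g.:
# vocabulary = build_vocabulary(train_captions, top_k=1000)  # illustrative vocabulary size
# clf = LinearSVC().fit(encode_captions(train_captions, vocabulary), train_labels)
```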

Figure 5: Sample results of event recognition (a-d)

In general we do not expect that classification of short textual descriptions is more accurate than conventional image recognition methods. Nevertheless, we believe that the presence of image captions in an ensemble of classifiers can significantly improve its diversity. Moreover, as the captions are generated based on the extracted feature vector, only one inference in the CNN is required if we combine a conventional general classifier of embeddings from the pre-trained CNN with the classifier of image captions. In this paper the outputs of the individual classifiers are combined by simple voting with soft aggregation. In particular, we compute the aggregated confidences as the weighted sum of the outputs of the individual classifiers:

p_c = w · p_c^(emb) + (1 − w) · p_c^(txt), c ∈ {1, ..., C},     (4)

where p_c^(emb) and p_c^(txt) are the confidences of the embeddings-based and the caption-based classifiers, respectively.

The decision is taken in favor of the class with the maximal confidence:

c* = argmax_{c ∈ {1,...,C}} p_c.     (5)

The weight w in (4) can be chosen using a special validation subset in order to obtain the highest accuracy of criterion (5).
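A sketch of the soft-voting ensemble (4)-(5) with the weight selected on a validation subset; the candidate grid and the assumption that both score matrices are comparable (e.g., already normalized) are illustrative choices:

```python
import numpy as np

def ensemble_scores(emb_scores, caption_scores, w):
    """Weighted sum (4) of the confidences of the two individual classifiers."""
    return w * emb_scores + (1.0 - w) * caption_scores

def choose_weight(val_emb_scores, val_caption_scores, val_labels, grid=None):
    """Pick the weight w maximizing the validation accuracy of decision rule (5).
    Both score matrices (shape: photos x classes) are assumed to be comparable."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    best_w, best_acc = 0.5, -1.0
    for w in grid:
        predictions = np.argmax(ensemble_scores(val_emb_scores, val_caption_scores, w), axis=1)
        accuracy = np.mean(predictions == np.asarray(val_labels))
        if accuracy > best_acc:
            best_w, best_acc = w, accuracy
    return best_w
```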

Let us provide qualitative examples of the usage of our pipeline (Fig. 4). The results of (correct) event recognition using our ensemble are presented in Fig. 5. Here the first line of each title contains the generated image caption. In addition, the title displays the result of event recognition using captions (second line), embeddings (third line), and the whole ensemble (last line). As one can notice, the individual classification of captions is not always correct. However, our ensemble is able to obtain a reliable solution even when individual classifiers make wrong decisions.

4 Experimental Study

4.1 Event recognition in a gallery of photos

Only a limited number of datasets is available for event recognition in personal photo-collections [ahmad2019deep]. Hence, we examined two main datasets in this field, namely:

  1. PEC [bossard2013event] with 61,364 images from 807 collections of 14 social event classes (birthday, wedding, graduation, etc.). We used the split provided by its authors: a training set with 667 albums (50,279 images) and a testing set with 140 albums (11,085 images).

  2. ML-CUFED [wang2017recognizing] contains 23 common event types. Each album is associated with several events, i.e., it is a multi-label classification task. The conventional split into a training set (1507 albums, 75,377 photos) and a test set (376 albums, 19,420 photos) was used.

The features were extracted using scene recognition models (Inception v3 and two configurations of MobileNet v2, denoted MobileNet v2 (1st) and MobileNet v2 (2nd) below) pre-trained on the Places2 dataset [zhou2018places]. We used two techniques in order to obtain the final descriptor of a set of images, namely: 1) simple averaging of the features of individual images in a set (AvgPool); and 2) our implementation of the neural attention mechanism (2)-(3) for L2-normed features. In the former case the linear SVM classifier from the scikit-learn library was used, because it has higher accuracy than RF, k-NN and RBF SVM. In the latter case the weights of the attention-based network (Fig. 2) are learned using sets with a fixed number of randomly chosen images from each album in order to make the shape of the input tensors identical. As a result, 667 and 1507 training subsets were obtained for PEC and ML-CUFED, respectively. As ML-CUFED contains multiple labels per album, we use sigmoid activations and the binary cross-entropy loss. Conventional Softmax activations and categorical cross-entropy are applied for PEC. The model was learned using the ADAM optimizer (learning rate 0.001) for 10 epochs with early stopping in the Keras 2.3 framework with the TensorFlow 1.15 backend.

CNN | Aggregation | PEC | ML-CUFED
MobileNet v2 (1st) | AvgPool | 86.42 | 81.38
MobileNet v2 (1st) | Attention | 89.29 | 84.04
MobileNet v2 (2nd) | AvgPool | 87.14 | 81.91
MobileNet v2 (2nd) | Attention | 87.36 | 84.31
Inception v3 | AvgPool | 86.43 | 82.45
Inception v3 | Attention | 87.86 | 84.84
AlexNet | CNN-LSTM-Iterative [wang2017recognizing] | 84.5 | 79.3
AlexNet | Aggregation of representative features [wu2015learning] | 87.9 | 84.5
ResNet-101 | CNN-LSTM-Iterative [wang2017recognizing] | 84.5 | 71.7
ResNet-101 | Aggregation of representative features [wu2015learning] | 89.1 | 83.4
Table 1: Accuracy (%) of event recognition in a set of images (album).

The recognition accuracies of the pre-trained CNNs are presented in Table 1. Here we computed the multi-label accuracy for ML-CUFED, so that a prediction is assumed to be correct if it corresponds to any label associated with an album. In this table we also provide the best known results for these datasets [wang2017recognizing, wu2015learning].

Here, in all cases the attention-based aggregation is 1-3% more accurate when compared to classification of average features. As one can notice, the proposed implementation of the attention mechanism achieves state-of-the-art results, though we used much faster CNNs (MobileNet and Inception rather than AlexNet and ResNet-101) and do not consider the sequential nature of photos in an album in our attention-based network (Fig. 2). The most remarkable fact here is that the best results for PEC are achieved by the simplest model (MobileNet v2 (1st)), which can be explained by the lack of training data for this particular dataset.

As we claimed above, in general there is no information about albums in a gallery. Hence, an event should be assigned to all photos individually. In the next experiment we directly assigned the first collection-level label to each image contained in both datasets and simply used the image itself for event recognition, without any meta-information. In addition to the baseline approach (Subsection 3.1), we used hierarchical agglomerative clustering of the entire testing gallery. We report only the best results achieved by average-linkage clustering of the embeddings extracted by the pre-trained CNN and of the confidence scores. In the former case we used both the Euclidean (L2) and the chi-squared distances. As the confidence scores returned by decision_function for LinearSVC are not always non-negative, only the Euclidean distance is implemented for the confidence scores. The results are shown in Table 2.

Dataset | CNN | Baseline | Clustering of embeddings (L2) | Clustering of embeddings (chi-squared) | Clustering of scores
PEC | MobileNet v2 (1st) | 58.32 | 60.42 | 60.69 | 58.44
PEC | MobileNet v2 (2nd) | 60.34 | 61.25 | 61.92 | 60.58
PEC | Inception v3 | 61.82 | 64.19 | 64.22 | 61.97
ML-CUFED | MobileNet v2 (1st) | 54.41 | 57.03 | 57.45 | 54.56
ML-CUFED | MobileNet v2 (2nd) | 53.54 | 54.97 | 55.98 | 54.03
ML-CUFED | Inception v3 | 57.26 | 59.19 | 60.12 | 57.87
Table 2: Accuracy (%) of event recognition in a single image.
Table 3: Accuracy (%) of the proposed approach, PEC. Rows correspond to the pre-trained (embeddings) and fine-tuned (scores) MobileNet v2 and Inception v3 models with AvgPool and attention aggregation; columns compare the single-image baseline with album border detection based on the distances between embeddings, scores and L2-normed scores.
Table 4: Accuracy (%) of the proposed approach, ML-CUFED. Rows correspond to the pre-trained (embeddings) and fine-tuned (scores) MobileNet v2 and Inception v3 models with AvgPool and attention aggregation; columns compare the single-image baseline with album border detection based on the distances between embeddings, scores and L2-normed scores.

Here, firstly, the accuracy of event recognition in single images is 25-30% lower than the accuracy of album-based classification (Table 1). Secondly, clustering of the confidence scores at the output of the best classifier does not significantly influence the overall accuracy. Thirdly, hierarchical clustering with the chi-squared distance leads to slightly more accurate results than the conventional Euclidean metric. Finally, preliminary clustering of embeddings decreases the error rate of the baseline by only 1.2-2%, even if the distance threshold in clustering is carefully chosen.

Let us demonstrate how the assumption about sequentially ordered photos in an album can increase the accuracy of event recognition. In order to make the task more complex, the following transformation of the order of testing photos was performed 10 times: we randomly shuffled the sequence of albums, and the photos in each album were also shuffled. In addition to matching the confidences from the decision_function of LinearSVC, we performed their L2 normalization. Moreover, we fine-tuned the CNNs using the unfolded training set as follows. At first, the weights in the base part of the CNN were frozen and the new head (a fully connected layer with C outputs and Softmax activation) was trained during 10 epochs. Next, the weights of the whole CNN were trained during 3 epochs with a 10-times lower learning rate.
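The shuffling protocol described above can be summarized in a few lines; this is a sketch of the testing-set transformation, not of the evaluation itself:

```python
import random

def shuffle_test_gallery(albums, seed=None):
    """Testing protocol from the text: shuffle the order of albums, shuffle the photos
    inside each album, and concatenate everything into one gallery with hidden borders."""
    rng = random.Random(seed)
    shuffled_albums = [list(album) for album in albums]
    rng.shuffle(shuffled_albums)
    for album in shuffled_albums:
        rng.shuffle(album)
    return [photo for album in shuffled_albums for photo in album]
```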

The results (mean accuracy ± standard deviation) of the proposed Algorithms 1 and 2 for PEC and ML-CUFED are presented in Table 3 and Table 4, respectively. Here the attention mechanism provides up to 8% lower error rates in most cases. It is remarkable that matching the distances between L2-normed confidences significantly improves the overall accuracy of the attention model for PEC (Table 3), though our experiments did not show any improvements in the conventional clustering from the previous experiment (Table 2). The fine-tuned CNNs obviously lead to the most accurate decisions, but the difference (0.1-1.6%) with the best results of the pre-trained models is rather small. However, the latter do not require additional inference in existing scene recognition models, so the implementation of event recognition in an album will be very fast if the scenes should be additionally classified, e.g., for more detailed user modeling [savchenko2019user]. Surprisingly, computing the distance between the confidence scores of the classifier reduces the error rate of conventional matching of embeddings by 2-7%. Let us recall that conventional clustering of embeddings was 1-2% more accurate when compared to clustering of the classifier's scores (Table 2). It seems that the threshold ρ0 can be estimated more reliably (Algorithm 2) in this particular case, when most images from the same event are matched in the prediction procedure (Algorithm 1). Finally, the most important conclusion is that the proposed approach has 9-20% higher accuracy when compared to the baseline. Moreover, our algorithm is 13-16% more accurate than classification of groups of photos obtained with hierarchical clustering (Table 2).

4.2 Event recognition in single photos

In addition to PEC and ML-CUFED, we examined WIDER (Web Image Dataset for Event Recognition) [xiong2015recognize] with 50,574 images of 61 event categories (parade, dancing, meeting, press conference, etc.). We used the standard train/test split for all datasets proposed by their creators. In PEC and ML-CUFED the collection-level label is directly assigned to each image contained in the collection. We completely ignore any metadata, e.g., temporal information, and use only the image itself, similarly to the paper [wang2018transferring].

As we mainly focus on the possibility of implementing offline event recognition on mobile devices [savchenko2019user], in order to compare the proposed approach with conventional classifiers we used the MobileNet v2 [cap_mobilenet] and Inception v4 [cap_inc4] CNNs. At first, we pre-trained them on the Places2 dataset [zhou2018places] for feature extraction. The linear SVM classifier from the scikit-learn library was used, because it has higher accuracy than other classifiers from this library (RF, k-NN and RBF SVM). Moreover, we fine-tuned these CNNs on the given training set as follows. At first, the weights in the base part of the CNN were frozen and the new head (a fully connected layer with C outputs and Softmax activation) was trained using the ADAM optimizer (learning rate 0.001) for 10 epochs with early stopping in the Keras 2.2 framework with the TensorFlow 1.15 backend. Next, the weights of the whole CNN were trained during 5 epochs using ADAM. Finally, the CNN was trained using SGD during 3 epochs with a 10-times lower learning rate.
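The staged fine-tuning described above might look roughly as follows in Keras; the epochs and the initial learning rate follow the text, while the stage-2 learning rate, the data pipeline (train_ds yielding preprocessed image batches with one-hot labels) and the headless base_cnn are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def fine_tune(base_cnn, train_ds, num_classes):
    """Staged fine-tuning: a new Softmax head first, then the whole CNN with smaller steps.
    base_cnn is a pre-trained CNN without its classification head (e.g. pooling='avg');
    train_ds yields batches of preprocessed images with one-hot labels."""
    outputs = layers.Dense(num_classes, activation="softmax")(base_cnn.output)
    model = Model(base_cnn.input, outputs)

    # Stage 1: freeze the base part and train only the new head (ADAM, lr=0.001, 10 epochs).
    base_cnn.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=10)

    # Stage 2: unfreeze everything and continue with ADAM (this learning rate is an assumption).
    base_cnn.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=5)

    # Stage 3: finish with SGD and a 10-times lower learning rate for 3 epochs.
    model.compile(optimizer=tf.keras.optimizers.SGD(1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=3)
    return model
```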

In addition, we used features from object detection models that are typical for event recognition [xiong2015recognize, savchenko2019user]. As many photos of the same event sometimes contain identical objects (e.g., a ball in football), they can be detected by contemporary CNN-based methods, e.g., SSDLite [cap_mobilenet] or Faster R-CNN [ren2015faster]. These methods detect the positions of several objects in the input image and predict the scores of each class from a predefined set of types. We extract a sparse vector of scores with one component per object type. If there are several objects of the same type, the maximal score is stored in this feature vector [rassadin2019scene]. This feature vector is either classified by the linear SVM or used to train a feed-forward neural network with two hidden layers containing 32 units each. Both classifiers were trained on the training set of each event dataset. In this study we examined SSD with the MobileNet backbone and Faster R-CNN with the InceptionResNet backbone. The models pre-trained on the Open Images Dataset v4 (600 object classes) were taken from the TensorFlow Object Detection Model Zoo.
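A sketch of how the detector outputs can be turned into this sparse feature vector; the decoded (class_id, score) input format and the 600-class dimensionality are assumptions based on the text:

```python
import numpy as np

def detection_features(detections, num_object_types=600):
    """Sparse detector-based descriptor: one component per object type, storing the
    maximal score when several objects of the same type are detected in the photo.
    `detections` is assumed to be an already decoded list of (class_id, score) pairs;
    600 corresponds to the boxable classes of the Open Images Dataset v4."""
    features = np.zeros(num_object_types, dtype=np.float32)
    for class_id, score in detections:
        features[class_id] = max(features[class_id], score)
    return features

# Example: two persons and one ball detected in a photo (class ids are hypothetical).
# x = detection_features([(0, 0.9), (0, 0.7), (37, 0.6)])
```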

Our preliminary experimental study with the pre-trained image captioning models discussed in Section 2 demonstrated that the best quality on the MS COCO captioning dataset is achieved by the ARNet model [cap_ARNet]. Thus, in this experiment we used ARNet's encoder-decoder model. However, it can be replaced by any other image captioning technique without modification of our event recognition algorithm. The ARNet was trained on the Conceptual Captions dataset, which contains more than 3.3M image-URL and caption pairs in the training set and about 15 thousand pairs in the validation set. The feature extraction in the encoder is implemented with the same CNNs (Inception and MobileNet v2). We extracted the most frequent words, except the special start and end tokens. They are classified by either the linear SVM or a feed-forward neural network with the same architecture as in the object detection case. Again, these classifiers are trained from scratch on each event training set. The weight w in our ensemble (4) was estimated using the same set.

The results of the lightweight mobile models (MobileNet and the SSD object detector) and the deep models (Inception and Faster R-CNN) for PEC, WIDER and ML-CUFED are presented in Table 5, Table 6 and Table 7, respectively. Here we added the best known results for the same experimental setups.

Classifier | Features | Lightweight models | Deep models
SVM | Embeddings | 59.72 | 61.82
SVM | Objects | 42.18 | 47.83
SVM | Texts | 43.77 | 47.24
SVM | Proposed ensemble (4), (5) | 60.56 | 62.87
Fine-tuned CNN | Embeddings | 62.33 | 63.56
Fine-tuned CNN | Objects | 40.17 | 47.42
Fine-tuned CNN | Texts | 43.52 | 46.89
Fine-tuned CNN | Proposed ensemble (4), (5) | 63.38 | 65.12
Best known results: Aggregated SVM [bossard2013event] 41.4; Bag of Sub-events [bossard2013event] 51.4; SHMM [bossard2013event] 55.7; Initialization-based transfer learning [wang2018transferring] 60.6; Transfer learning of data and knowledge [wang2018transferring] 62.2.
Table 5: Event recognition accuracy (%), PEC
Classifier | Features | Lightweight models | Deep models
SVM | Embeddings | 48.31 | 50.48
SVM | Objects | 19.91 | 28.66
SVM | Texts | 26.38 | 31.89
SVM | Proposed ensemble (4), (5) | 48.91 | 51.59
Fine-tuned CNN | Embeddings | 49.11 | 50.97
Fine-tuned CNN | Objects | 12.91 | 21.27
Fine-tuned CNN | Texts | 25.93 | 30.91
Fine-tuned CNN | Proposed ensemble (4), (5) | 49.80 | 51.84
Best known results: Baseline CNN [xiong2015recognize] 39.7; Deep channel fusion [xiong2015recognize] 42.4; Initialization-based transfer learning [wang2018transferring] 50.8; Transfer learning of data and knowledge [wang2018transferring] 53.0.
Table 6: Event recognition accuracy (%), WIDER
Classifier | Features | Lightweight models | Deep models
SVM | Embeddings | 53.54 | 57.27
SVM | Objects | 34.21 | 40.94
SVM | Texts | 37.24 | 41.52
SVM | Proposed ensemble (4), (5) | 55.26 | 58.86
Fine-tuned CNN | Embeddings | 56.01 | 57.12
Fine-tuned CNN | Objects | 32.05 | 40.12
Fine-tuned CNN | Texts | 36.74 | 41.35
Fine-tuned CNN | Proposed ensemble (4), (5) | 57.94 | 60.01
Table 7: Event recognition accuracy (%), ML-CUFED

Certainly, the proposed recognition of image captions is not as accurate as conventional CNN-based features. However, classification of textual descriptions is much better than random guessing, whose accuracy is approximately 7.1%, 1.6% and 4.3% for PEC, WIDER and ML-CUFED, respectively. It is important to emphasize that our approach has a lower error rate than classification of the features based on object detection in most cases. This gain is especially noticeable for the lightweight SSD models, which are 1.5-13% less accurate than the proposed classification of image captions due to the limitations of SSD-based models in detecting small objects (food, pets, fashion accessories, etc.). The Faster R-CNN-based detection features can be classified more accurately, but the inference in Faster R-CNN with the InceptionResNet backbone is several times slower than decoding in ARNet (6-10 seconds vs 0.5-2 seconds on a MacBook Pro 2015).

Finally, the most appropriate way to use image captioning in event classification is its fusion with conventional CNNs. In this case we improved the previous state of the art for PEC from 62.2% [wang2018transferring] even with the lightweight models (63.38%) when the fine-tuned CNNs are used in the ensemble. Our Inception-based model is even better (accuracy 65.12%). We have not yet reached the state-of-the-art accuracy of 53% [wang2018transferring] for the WIDER dataset, though our best accuracy (51.84%) is up to 9% higher when compared to the best results (42.4%) from the original paper [xiong2015recognize]. Our experimental setup for the ML-CUFED dataset is studied here for the first time, because this dataset was developed mostly for album-based event recognition.

In practice it is preferable to use a pre-trained CNN as a feature extractor in order to prevent an additional inference in the fine-tuned CNN when it differs from the encoder of the image captioning model. Unfortunately, the accuracies of the SVM for pre-trained CNN features are 1.5-3% lower when compared to the fine-tuned models for PEC and ML-CUFED. In this case an additional inference may be acceptable. However, the difference in error rates between the pre-trained and fine-tuned models for the WIDER dataset is not significant, so that the pre-trained CNNs are definitely worth being used there.

5 Conclusion

We have shown that existing studies of event recognition cannot be directly used for processing the gallery of a mobile device, because the albums of photos corresponding to the same event may be unavailable. The usage of event recognition in single images is possible but very inaccurate, even if similar photos are combined by clustering (Table 2). We have demonstrated that grouping of consecutive photos and attention-based recognition of the resulting image sets (Algorithm 1) can drastically improve the recognition accuracy (Tables 3, 4). It has been shown that the most important parameter, namely the distance threshold ρ0, can be automatically estimated by our learning procedure (Algorithm 2). It has been experimentally demonstrated that consecutive photos from the same album are better discovered if we match the confidence scores of the classifier learned on the unfolded training set.

In addition, we have proposed to apply generative models in a classical discriminative task, namely, image captioning in event recognition on still images. We have presented a novel pipeline of visual preference prediction using image captioning with classification of generated captions and retrieval of images based on their textual descriptions (Fig. 4). It has been experimentally demonstrated that our approach is more accurate and faster than the widely used image representations obtained by object detectors [xiong2015recognize, rassadin2019scene]. What is more important, the generated caption provides additional diversity to conventional CNN-based recognition, which is especially useful for ensemble models.

Our engine has been implemented in the publicly available Android application (Fig. 3) that extracts a profile of the user's interests. It is applicable for such personalized mobile services as recommender systems and targeted advertisements.

The main disadvantage of the proposed approach is its lower accuracy (up to 8-11%) when compared to the best models for the case of known album borders (Table 1). Moreover, short conceptual textual descriptions are obviously not enough to classify event categories with high accuracy even for a human, due to errors and lack of specificity (see examples of generated captions in Fig. 5). Another disadvantage of the proposed approach is the need to repeat inference if a fine-tuned CNN is applied in an ensemble. Hence, the decision-making time is significantly increased, though the overall accuracy also becomes higher in most cases (Tables 5-7).

Thus, in future work it is possible to extend our algorithm by, e.g., replacing the pre-defined distance ρ with a metric learned on a given training set [goodfellow2016deep]. Secondly, our attention model does not work well for single photos: its accuracy for the baseline with pre-trained CNNs is 4-5% worse than the accuracy of the linear SVM (row “AvgPool” in Tables 3, 4). Hence, it is desirable to examine appropriate enhancements of the attention model that are suitable even for small input sets [guo2017multigranular, wu2015learning]. Finally, it is necessary to make the classification of generated captions more accurate. Though our preliminary experiments with LSTMs did not decrease the error rate of our simple approach with the linear SVM and one-hot encoded words, we strongly believe that a thorough study of RNN-based classifiers of generated textual descriptions is required.

This research is based on the work supported by Samsung Research, Samsung Electronics.

References