On the Role of Event Boundaries in Egocentric Activity Recognition from Photostreams

09/02/2018 · Alejandro Cartas et al. · Universitat de Barcelona

Event boundaries play a crucial role as a pre-processing step for detection, localization, and recognition of human activities in videos. Typically, despite their intrinsic subjectiveness, temporal bounds are provided manually as input for training action recognition algorithms. However, their role for activity recognition in the domain of egocentric photostreams has been so far neglected. In this paper, we provide insight into how automatically computed boundaries impact activity recognition results in the emerging domain of egocentric photostreams. Furthermore, we collected a new annotated dataset acquired by 15 people with a wearable photo-camera and used it to show the generalization capabilities of several deep learning based architectures to unseen users.


1 Introduction

Wearable cameras offer a hands-free way to capture the world from a first-person perspective, hence providing rich contextual information about the activities being performed by the user [16]. Like other wearable sensors, wearable cameras are ubiquitous and make it possible to capture daily activities in natural settings.

Currently, recognizing daily activities from first-person (egocentric) images and videos is a very active area of research in computer vision [15, 18, 5, 2, 3]. In this paper, we focus on streams of images captured at regular intervals by a wearable photo-camera, also called photostreams, which have received comparatively little attention in the literature. Compared to egocentric videos, photostreams usually cover the full day of a person (see Fig. 1). However, since the photo-camera typically takes a picture only every 30 seconds, temporally adjacent images present abrupt changes. Consequently, optical flow cannot be reliably estimated, and several fine-grained actions are completely missed or too sparsely sampled to be identifiable. Since motion is an important cue to disambiguate activities, recognizing them becomes particularly challenging in the photostream domain.

Figure 1: Sample images captured by a wearable photo-camera user during a day, together with their timestamp and activity label.


Figure 2: Example of events obtained by applying SR-Clustering to a visual lifelog. The colors above the images indicate the event to which consecutive images belong.
Figure 3: Pipeline of our proposed approach.

Recently, several papers have proposed different deep learning architectures to recognize activities from egocentric photostreams. The earliest works [5, 3] focused on an image-based approach, aimed at classifying each image independently of its neighboring frames. To take advantage of the temporal coherence of objects that characterizes photostreams [1], instead of working at the image level, Cartas et al. [2, 4] proposed to train, in an end-to-end fashion, a Long Short Term Memory (LSTM) recurrent neural network on top of a CNN, feeding the LSTM with a sliding-window approach. This strategy copes with both the considerable length of photostreams and the lack of knowledge of event boundaries.
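For concreteness, the sliding-window feeding strategy can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the window length, stride, feature dimensionality, and the `lstm` model referenced in the comment are placeholders.

```python
import numpy as np

def sliding_windows(features, labels, window=10, stride=5):
    """Yield overlapping (features, labels) windows from one photostream.

    `features` holds per-frame CNN descriptors with shape (num_frames, feat_dim);
    `labels` holds one activity id per frame. Window length and stride are
    illustrative values, not the ones used in the paper.
    """
    n = len(features)
    for start in range(0, max(n - window + 1, 1), stride):
        end = min(start + window, n)
        yield features[start:end], labels[start:end]

# Example: 100 frames of 2048-d pooled CNN features, fed window by window to a
# many-to-many LSTM so that every frame in the window receives a prediction.
feats = np.random.rand(100, 2048).astype("float32")
labs = np.random.randint(0, 23, size=100)
for x_win, y_win in sliding_windows(feats, labs):
    pass  # e.g. lstm.train_on_batch(x_win[None, ...], y_win[None, ...])
```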

Figure 4: Dataset summary. Note that the distributions are normalized and the vertical axis has a logarithmic scale.
Figure 5: Example of automatically extracted events used in the experiments.

This approach showed that considering overlapping segments of fixed size is effective at capturing long-term temporal dependencies in photostreams. In this paper, we argue that knowing the event boundaries exactly would further improve activity recognition performance, since it would allow capturing temporal dependencies both within an event and across events.

2 Event boundaries for activity recognition

In this work, we investigate whether the use of event boundaries as additional input can improve the recognition of activities in egocentric photo-sequences. To this end, we used the temporal segmentation method introduced in [9], which extracts events from long unstructured photostreams. Events obtained with this approach correspond to temporally adjacent images that share both contextual and semantic features, as shown in Fig. 2. As can be observed, these events constitute a good basis for activity recognition since, typically, when the user is engaged in an activity, contextual and semantic features show little variation.

3 Experimental setup

The objective of our experiments was to determine whether the temporal coherence of segmented events from egocentric photostreams improves activity recognition at the frame level. Therefore, we trained three many-to-many LSTM models using both the full-day sequences and the automatically extracted event segments, namely CNN+RF+LSTM, CNN+LSTM, and CNN+Bidirectional LSTM (see Fig. 3). For comparative purposes, we used the Xception network [6] as the baseline and as the base CNN of all models. Additionally, we implemented the best model presented in [4], namely the CNN+RF+LSTM combination. We measured activity recognition performance using the classification accuracy and the associated macro metrics.
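As a rough illustration of the shared pattern behind these models (not the paper's exact configuration), the Keras sketch below builds a many-to-many Bidirectional LSTM classifier on top of pre-extracted, globally pooled Xception features. The hidden size is a placeholder; the class count and feature dimensionality follow from the dataset (23 activities) and Xception's pooled output.

```python
# Minimal sketch of the CNN+Bidirectional LSTM pattern, assuming Xception
# features are pre-extracted per frame. Layer sizes are illustrative.
from tensorflow.keras import layers, models

NUM_CLASSES = 23   # activity categories in our dataset
FEAT_DIM = 2048    # Xception global-average-pooled feature size

def build_bilstm_classifier(units=256):
    # Variable-length sequence of per-frame CNN features.
    inputs = layers.Input(shape=(None, FEAT_DIM))
    # Many-to-many: return one hidden state (hence one prediction) per frame.
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(inputs)
    outputs = layers.TimeDistributed(
        layers.Dense(NUM_CLASSES, activation="softmax"))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The unidirectional CNN+LSTM variant follows the same pattern with the `Bidirectional` wrapper removed.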

Dataset. We collected 102,227 pictures from 15 college students who were asked to wear an egocentric camera (http://getnarrative.com/) on their chest. The camera automatically captured an image approximately every 30 seconds at a 5MP resolution. The annotation process took into account the continuous context of the activity sequences. In order to split the data into training and test sets, all possible combinations of users for the two sets were computed, and only the combinations whose test set contained all the categories and 20-21% of all images were kept. A histogram of the number of photos per category and split is shown in Fig. 4.
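A hedged sketch of such a split search is given below. Only the all-categories constraint and the 20-21% bound come from the text; the number of held-out users, the data layout, and the function name are illustrative assumptions.

```python
from itertools import combinations

def candidate_splits(labels_per_user, test_size=3, lo=0.20, hi=0.21):
    """Enumerate user combinations whose held-out images cover every
    category and amount to roughly 20-21% of all images.

    `labels_per_user` maps user id -> list of per-frame activity labels.
    The number of held-out users is an illustrative choice.
    """
    users = list(labels_per_user)
    all_labels = [l for labs in labels_per_user.values() for l in labs]
    categories = set(all_labels)
    total = len(all_labels)

    for test_users in combinations(users, test_size):
        test_labels = [l for u in test_users for l in labels_per_user[u]]
        frac = len(test_labels) / total
        if lo <= frac <= hi and set(test_labels) == categories:
            yield test_users, frac
```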

Temporal sequences. The following temporal sequences were used in the experiments; a minimal construction sketch follows the list:

  1. Fixed size segments. The stateful sliding-window training procedure for fixed-size segments from [2] was also implemented for the LSTM models.

  2. Full sequence. The whole-day photostream sequence of each user was used as a single input.

  3. Event segmentation. Groups of sequential images were obtained by applying the method introduced by Dimiccoli et al. [9], which temporally segments the given photostream as illustrated in Fig. 5.
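The sketch below illustrates, under simplified assumptions, how the three temporal sequence types could be built from one day of per-frame features. The function name, window and stride values are placeholders, `boundaries` stands for the event start indices produced by SR-Clustering, and the stateful batching details of [2] are ignored.

```python
import numpy as np

def make_sequences(features, boundaries, mode, window=10, stride=5):
    """Split one day's per-frame features into training sequences.

    `boundaries` lists the start indices of the automatically detected
    events (including 0); window/stride values are illustrative.
    """
    n = len(features)
    if mode == "full_sequence":       # the whole day as a single input
        return [features]
    if mode == "event_segmentation":  # one sequence per detected event
        cuts = list(boundaries) + [n]
        return [features[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
    if mode == "fixed_size":          # overlapping fixed-size windows
        return [features[s:s + window] for s in range(0, n - window + 1, stride)]
    raise ValueError(f"unknown mode: {mode}")
```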

     
| Model | Temporal sequence | Accuracy | Macro precision | Macro recall | Macro F1-score |
|---|---|---|---|---|---|
| Xception | Frame level | 68.88 | 52.04 | 38.17 | 39.05 |
| Xception+RF | Frame level | 70.77 | 52.02 | 33.32 | 32.23 |
| Xception+RF+LSTM | Fixed size segments | 70.64 | 35.16 | 34.62 | 32.71 |
| Xception+RF+LSTM | Full sequence | 72.21 | 42.71 | 36.11 | 35.44 |
| Xception+RF+LSTM | Event segmentation | 72.52 | 54.81 | 36.98 | 36.44 |
| Xception+LSTM | Fixed size segments | 70.24 | 48.06 | 41.07 | 40.19 |
| Xception+LSTM | Full sequence | 74.27 | 59.03 | 50.71 | 50.85 |
| Xception+LSTM | Event segmentation | 73.28 | 57.71 | 49.75 | 48.94 |
| Xception+Bidirectional LSTM | Fixed size segments | 74.20 | 51.95 | 54.68 | 50.66 |
| Xception+Bidirectional LSTM | Full sequence | 75.59 | 56.81 | 48.30 | 48.50 |
| Xception+Bidirectional LSTM | Event segmentation | 76.09 | 59.29 | 50.22 | 51.21 |
Table 1: Activity classification performance. Accuracy and macro precision, recall, and F1-score (in %) for each model and temporal sequence type.
| Dataset | Photo-streams | Frames | Sequences | Action segments | Action/Activity classes | Participants |
|---|---|---|---|---|---|---|
| Ours | yes | 0.1M | 191 days | - | 23 | 15 |
| UT-Egocentric [14] | - | 0.9M | 4 | - | - | 4 |
| KrishnaCam [13] | - | 7.6M | 460 | - | - | 1 |
| DECADE [10] | - | 0.02M | 380 | - | 48 | 1 |
| ADL [17] | - | 1.0M | 20 | 436 | 32 | 20 |
| Epic Kitchens [7] | - | 11.5M | 432 | 39,596 | 149* | 32 |
| GTEA Gaze+ [11] | - | 0.4M | 35 | 3,371 | 42 | 13 |
| CMU [12] | - | 0.2M | 16 | 516 | 31 | 16 |
| BEOID [8] | - | 0.1M | 58 | 742 | 34 | 5 |
Table 2: Comparison of our dataset with other egocentric datasets. Information based on [7].

4 Experimental results

In Table 1, we present the performance of all the models using the full sequence, SR-Clustering (event segmentation), and the sliding-window training procedure (fixed-size segments) proposed in [2]. The performance was evaluated using the accuracy and the macro precision, recall, and F1-score.
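These frame-level metrics can be computed with standard tooling; below is a minimal sketch using scikit-learn's metric functions, where the label arrays are placeholders for the ground-truth and predicted activity ids of all test frames.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def frame_level_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision/recall/F1 over all frames."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```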

The results indicate that the CNN+Bidirectional LSTM model achieves the best performance over all the models and for each temporal segmentation. On the other hand, the CNN+RF+LSTM model did not improve the performance as much as the other models, and was even worse than its baseline when using the sliding-window training. This is a consequence of the overfitting of its base model (CNN+RF) on the training set, as reflected by its low macro recall in Table 1. This contrasts with the results previously obtained in [4] on another dataset, and is likely due to the fact that here the test set contains unseen users.

Furthermore, the results suggest that temporal segmentation increased the classification performance of the tested LSTM-based models; Fig. 6 shows some qualitative results. In particular, the automatic event segmentation (SR-Clustering) outperformed the full-day segmentation, improving the accuracy, macro precision, and macro F1-score for two of the three LSTM-based models. Since most of the test users had short day sequences, the full-day segmentation worked best for the CNN+LSTM model. Finally, the best macro recall was obtained using the sliding-window training [2, 4] with the CNN+Bidirectional LSTM model, which can be understood as a smoothing effect over the test sequences.

Figure 6: Examples of qualitative results obtained with three of the evaluated methods (Xception, Xception+RF+LSTM, and Xception+Bidirectional LSTM) for different activity classes. Incorrectly and correctly predicted activity labels for a given image are marked in red and green, respectively.

5 Conclusions

This paper has shed light on two poorly investigated issues in the context of activity recognition from egocentric photostreams. The first issue is the role of event boundaries as input for activity recognition in photostreams. By relying on manually annotated and automatically extracted event boundaries, in addition to overlapping batches of images of fixed size, this paper showed that activity recognition performance benefits from the knowledge of event boundaries. The second issue is the generalization capability of existing methods for activity recognition. By using a large egocentric dataset acquired from 15 users, this paper elucidated, for the first time, how activity recognition performance generalizes at test time to unseen users. The best results were achieved by a CNN+Bidirectional LSTM architecture applied on a temporal event segmentation.

Acknowledgments

A.C. was supported by a doctoral fellowship from the Mexican Council of Science and Technology (CONACYT) (grant no. 366596). This work was partially funded by TIN2015-66951-C2, SGR 1219, CERCA, ICREA Academia’2014, and 20141510 (Marató TV3). The funders had no role in the study design, data collection, analysis, or preparation of the manuscript. M.D. is grateful to the NVIDIA donation program for its support with a GPU card.

References

  • [1] D. Byrne, A. R. Doherty, C. G. Snoek, G. J. Jones, and A. F. Smeaton. Everyday concept detection in visual lifelogs: validation, relationships and trends. Multimedia Tools and Applications, 49(1):119–144, 2010.
  • [2] A. Cartas, M. Dimiccoli, and P. Radeva. Batch-based activity recognition from egocentric photo-streams. Proceedings on the International Conference in Computer Vision (ICCV), 2nd international workshop on Egocentric Perception, Interaction and Computing, Venice, Italy, 2017.
  • [3] A. Cartas, J. Marín, P. Radeva, and M. Dimiccoli. Recognizing activities of daily living from egocentric images. In L. A. Alexandre, J. Salvador Sánchez, and J. M. F. Rodrigues, editors, Pattern Recognition and Image Analysis, pages 87–95, Cham, 2017. Springer International Publishing.
  • [4] A. Cartas, J. Marín, P. Radeva, and M. Dimiccoli. Batch-based activity recognition from egocentric photo-streams revisited. Pattern Analysis and Applications, May 2018.
  • [5] D. Castro, S. Hickson, V. Bettadapura, E. Thomaz, G. Abowd, H. Christensen, and I. Essa. Predicting daily activities from egocentric images using deep learning. In proceedings of the 2015 ACM International symposium on Wearable Computers, pages 75–82. ACM, 2015.
  • [6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807. IEEE Computer Society, 2017.
  • [7] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
  • [8] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. Mayol-Cuevas. You-do, i-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
  • [9] M. Dimiccoli, M. Bolaños, E. Talavera, M. Aghaei, S. G. Nikolov, and P. Radeva. Sr-clustering: Semantic regularized clustering for egocentric photo streams segmentation. Computer Vision and Image Understanding, 2017.
  • [10] K. Ehsani, H. Bagherinezhad, J. Redmon, R. Mottaghi, and A. Farhadi. Who let the dogs out? modeling dog behavior from visual data. In CVPR, 2018.
  • [11] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision – ECCV 2012, pages 314–327, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [12] F. De la Torre, J. Hodgins, A. Bargteil, X. Martin, J. Macey, A. Collado, and P. Beltran. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database. Tech. report CMU-RI-TR-08-22, Robotics Institute, Carnegie Mellon University, April 2008.
  • [13] K. K. Singh, K. Fatahalian, and A. A. Efros. Krishnacam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.
  • [14] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1346–1353, June 2012.
  • [15] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1903, 2016.
  • [16] T.-H.-C. Nguyen, J.-C. Nebel, and F. Florez-Revuelta. Recognition of activities of daily living with egocentric vision: A review. Sensors (Basel), 16(1):72, Jan 2016.
  • [17] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2847–2854, June 2012.
  • [18] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact cnn for indexing egocentric videos. In Applications of Computer Vision (WACV), IEEE Winter Conference on, pages 1–9, 2016.