Recognition of Activities from Eye Gaze and Egocentric Video

by Anjith George, et al.

This paper presents a framework for recognizing human activity from egocentric video and eye tracking data obtained from a head-mounted eye tracker. Three channels of information, namely eye movement, ego-motion, and visual features, are combined for the classification of activities. Image features are extracted using a pre-trained convolutional neural network. Eye motion and ego-motion are quantized, and windowed histograms of the quantized sequences are used as features. The combination of features achieves better activity classification accuracy than the individual features.




I Introduction

Activity recognition from videos is an important topic in the computer vision community. Recognition of actions has applications in many areas such as human-computer interaction (HCI), robotics, surveillance, and image and video retrieval. Most of the literature in this field deals with action recognition from video streams captured by a camera which may be situated far away from the subjects (third person view) [1], [2], [3].

Recently, with the proliferation of wearable devices, there has been an upsurge in research on activity recognition from wearable devices. Recent works in egocentric video-based (first person view) activity recognition [4], [5], [6] have shown great promise in providing insights into various activities. The egocentric video gives direct information regarding the user’s environment. Head-mounted eye trackers can provide gaze locations and head movements along with the egocentric video.

Nowadays many virtual and augmented reality (VR and AR) devices are entering the consumer market, such as Oculus Rift, Hololens, and Google Glass [7]. They hold the potential to augment human capabilities. Eye tracking [8] and egocentric video could give important cues about the user’s point of attention and actions. Using visual features along with the eye movement behavior observed through eye tracking can lead to an understanding of activities and cognitive processes. Identification of human emotions [9] and cognitive states [10], [11] can lead to intelligent interaction modalities. Eye movements have been shown to contain information useful for biometric authentication [12]. Identification of human actions and intentions in real time could result in human-machine systems which are more natural and ‘pro-active’.

Fig. 1: The activity classes considered in the work, a) Read, b) Watching Video, c) Write, d) Copying text, and e) Browsing.

In this work, a framework for activity classification using egocentric information obtained from a head-mounted eye tracker is presented. Three channels of information, namely eye movement patterns, ego-motion patterns, and visual features as observed through the camera, are used for activity classification. We consider activities performed in office environments, which are difficult to classify using any single modality alone. Combining all these modalities can improve the accuracy of classification. The activity classes used in this work are shown in Fig. 1.


II Related works

An excellent review of recent works in egocentric activity recognition can be found in [13]. Some of the recent works related to activity recognition from eye gaze are described here.

Bulling et al. [14] presented an activity recognition scheme based on eye movement parameters obtained using electrooculography (EOG). They extracted a large number of features from fixations, saccades, and blinks. A feature selection approach was used to select the best features for activity classification. They considered five activities performed in the office environment, along with a null class. A support vector machine based classification was adopted for recognizing the activities. This work paved the way for further investigations using eye gaze in settings where activity recognition using other modalities is difficult. Hipiny and Mayol-Cuevas [15] presented an activity classification scheme using gaze data. They represented each activity as a record of fixation locations. A Bag of words based weighted voting scheme, along with the Bhattacharyya distance between templates and samples, was used for classification. Ogaki et al. [16] presented an approach for egocentric activity recognition by fusing eye movement and ego-motion features. They estimated ego-motion from the global optical flow computed from the “outward looking” camera. The eye tracking data was obtained from a head-mounted eye tracker. Both eye motion and ego-motion parameters were encoded into a string sequence using the motion pattern. The ‘N-gram’ statistics, computed over a sliding window, were used as features for classification. Their experiments demonstrated that the combination of features improves the accuracy compared to eye movement features alone. George et al. [17] presented an approach for gaze direction classification using a convolutional neural network. The eye direction obtained can provide important cues for activity recognition.

Li et al. [18] presented a novel scheme for combining different modalities of information for egocentric action recognition. From the egocentric video, they extracted dense trajectories and a set of local descriptors across the trajectories. The features included motion boundary histograms along the x and y directions, histogram of flow, histogram of gradients, and Lab color histogram. They computed these features within a grid, and the features were then concatenated. Egocentric features such as head motion and hand manipulation point were also extracted. They encoded the features using Improved Fisher Vector (IFV). Finally, the IFVs of different features were concatenated as a representation of the video. A support vector machine (SVM) was used for classification. However, they did not use eye movement patterns in their framework. Fathi et al. [19] demonstrated the relation between the task being performed and the locations of visual attention. They showed that information regarding hand-eye coordination could be beneficial in two different scenarios: predicting the probable gaze sequence given an action, and predicting the likely action given the gaze sequence. Shiga et al. [20] proposed a method for egocentric activity recognition by combining eye motion and visual features. The eye movement feature extraction scheme was similar to the method used in [14]. They used ‘N-gram’ statistics computed over sliding windows. The visual features were obtained by selecting a patch around the gaze location and extracting local features using SIFT-PCA and dense sampling. A Bag of words approach was used for the classification. They trained separate multi-class SVMs for visual and eye movement features, and a score fusion methodology was adopted for the final activity classification. Yan et al. [6] proposed a multi-task clustering approach for egocentric activity classification. They proposed two different algorithms for activity classification in unsupervised settings. Kunze et al. [21] described the possibilities of eye tracking in various use cases such as detection of fatigue and reading. Data from mobile eye trackers can be utilized for analyzing reading habits, the type of document read, reading speed, and comprehension level, and for identifying alertness levels.

While there are many approaches for activity classification in egocentric videos, classification in indoor environments is still a challenge. This can be mainly attributed to the lack of significant motion patterns and the limited variation in the environment. In most office activities (like reading, copying, browsing, watching a video, and writing), the variability in image background, as observed from the egocentric video, is limited. Any single feature therefore yields poor accuracy due to the lack of sufficient discriminative information. However, a fusion of visual, ego-motion, and eye movement features could improve the performance. The visual features can provide a context for the action, and the combination of ego-motion and eye movement patterns can result in better accuracy in the overall classification.

Fig. 2: The proposed framework, three channels of information are fused to classify the activities.

III Proposed method

In this work we propose to use information from the image, gaze locations, and ego-motion for the recognition of activities. The features extracted from each domain, along with the proposed fusion scheme, are described below. A schematic diagram of the proposed framework is shown in Fig. 2.

III-A Feature extraction from image

The location of gaze on images captured from a first person view (egocentric) camera carries valuable information which might be useful for activity classification. Previous work [20] used a dense SIFT descriptor with PCA in a Bag of words (BoW) framework. Features were extracted from the patch around the point of gaze, and the descriptors were computed for each frame separately. The accuracy of this method could fall when the training and testing environments are different. For example, the appearance of a book might differ with variations in size, pose, color, and type of binding. Ideally, the feature representation should be invariant to such changes, as it is intended to give a context to the actions. We have used a convolutional neural network [22] based feature extractor in this work owing to its high representation power. A pre-trained Alexnet model [23] (trained on the Imagenet dataset) is employed for this purpose. The fully connected output layer was removed, and a feature descriptor of dimension 4096 was obtained. The architecture of Alexnet excluding the final fully connected layer is shown in Fig. 3. We take the output from the fc7 layer after applying the rectified linear unit (ReLU) transformation.


Fig. 3: CNN feature extraction scheme, cropped and resized image is fed into the pretrained network, outputs from fc7 are used as the feature.

For each image in the training set, a fixed-size patch was selected around the gaze location. The image patch obtained was resized and fed to the CNN to obtain a 4096-dimensional feature vector. Features were extracted from all the images in the training set in this manner. K-means clustering was performed on this data, and 15 cluster centers were kept. Then, for each image, the feature representation is computed and the cluster center closest to it is found. Histogram voting across the cluster centers is carried out, and the normalized votes are computed in a temporal window of 25 seconds. The histogram obtained is used as the feature input for the activity classification.
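The assignment and voting steps above can be illustrated with a minimal sketch. It assumes the CNN descriptors and the k-means centers are already available; the function names `nearest_center` and `windowed_histograms` are hypothetical, and the toy 2-D vectors and 3 centers merely stand in for the 4096-dimensional fc7 descriptors and 15 centers described in the text. The window and stride are counted in frames here, not seconds.

```python
import math
from collections import Counter

def nearest_center(feature, centers):
    """Return the index of the cluster center closest to `feature` (Euclidean)."""
    return min(range(len(centers)), key=lambda k: math.dist(feature, centers[k]))

def windowed_histograms(labels, n_bins, window, stride):
    """Normalized histogram of cluster labels inside each sliding window."""
    hists = []
    for start in range(0, len(labels) - window + 1, stride):
        counts = Counter(labels[start:start + window])
        hists.append([counts.get(k, 0) / window for k in range(n_bins)])
    return hists

# Toy stand-ins (assumptions): 2-D descriptors and 3 cluster centers.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, 0.0), (0.9, 0.1), (1.1, 0.0), (0.0, 0.8)]

labels = [nearest_center(f, centers) for f in features]
hists = windowed_histograms(labels, n_bins=3, window=2, stride=1)
```

Each entry of `hists` sums to one, so windows of different lengths remain comparable.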

III-B Feature extraction from eye tracking data

The eye movement sequence is of the form

$$E = \{(x_t, y_t)\}_{t=1}^{T}$$

where $x_t$ and $y_t$ denote the $x$ and $y$ components of the gaze position at time instant $t$, and $T$ denotes the duration of the sequence. The raw sequence is median filtered to remove noise. Let $s$ be the input signal corresponding to the $x$ component of eye movement. The wavelet coefficient of $s$ at scale $a$ and position $b$ is defined as

$$C_b^a(s) = \int_{\mathbb{R}} s(t)\,\frac{1}{\sqrt{a}}\,\psi^{*}\!\left(\frac{t-b}{a}\right)dt$$

where $\psi$ is the wavelet function. Continuous 1D wavelet coefficients are computed at scale 10 using the Haar wavelet function.

The wavelet coefficients are computed separately for the $x$ and $y$ directions. The coefficients obtained for the $x$ direction are quantized into discrete levels using two empirically decided thresholds. The coefficients for the $y$ direction are quantized in a similar manner.

Based on the joint sequence of quantized $x$ and $y$ values, a string sequence is generated as in Fig. 4.

Fig. 4: Motion encoding scheme.

The normalized histogram of the string sequence over a sliding temporal window is used as the feature for classification.
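A rough sketch of the eye movement encoding pipeline may help make the steps concrete. Several details here are assumptions rather than the authors' settings: the median filter width, the five-level quantizer built from a small and a large threshold, and the 25-letter alphabet in `encode` (the actual mapping is the one in Fig. 4). The toy example uses scale 4 for brevity, whereas the text uses scale 10.

```python
import math
from statistics import median

def median_filter(signal, k=3):
    """Sliding median of odd width k; the edges use a truncated window."""
    h = k // 2
    return [median(signal[max(0, i - h):i + h + 1]) for i in range(len(signal))]

def haar_cwt(signal, scale):
    """Haar wavelet coefficients at one even scale, one per valid position.

    The wavelet is +1 on the first half of its support and -1 on the
    second half, so a rising signal produces a negative coefficient.
    """
    half = scale // 2
    coeffs = []
    for b in range(len(signal) - scale + 1):
        first = sum(signal[b:b + half])
        second = sum(signal[b + half:b + scale])
        coeffs.append((first - second) / math.sqrt(scale))
    return coeffs

def quantize(c, t_small, t_large):
    """Assumed five-level quantizer: small or large movement in either direction."""
    if c > t_large:
        return 2
    if c > t_small:
        return 1
    if c < -t_large:
        return -2
    if c < -t_small:
        return -1
    return 0

def encode(qx, qy):
    """Map each joint (qx, qy) pair to one letter of a hypothetical 25-letter alphabet."""
    return "".join(chr(ord("a") + 5 * (x + 2) + (y + 2)) for x, y in zip(qx, qy))

# Toy gaze trace: a horizontal jump, no vertical movement.
x = [0] * 10 + [10] * 10
y = [0] * 20
qx = [quantize(c, 2, 8) for c in haar_cwt(median_filter(x), 4)]
qy = [quantize(c, 2, 8) for c in haar_cwt(median_filter(y), 4)]
code = encode(qx, qy)
```

The resulting string can then be histogrammed over a sliding window in the same way as any other discrete label sequence.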

III-C Feature extraction from motion

Motion features are extracted from the optical flow between subsequent frames. Let the frame at time $t$ be denoted as $I_t$. For each frame, corner detection is performed to obtain the candidate points to track. The points are tracked using Lucas-Kanade optical flow. Successfully tracked points are identified using the forward-backward error [25]. The median flow between the frames can be computed as

$$(\bar{u}_t, \bar{v}_t) = \left(\operatorname{median}_{i=1,\dots,N} u_i,\ \operatorname{median}_{i=1,\dots,N} v_i\right)$$

where $N$ is the number of sparse points tracked between $I_t$ and $I_{t+1}$, and $u_i$ and $v_i$ denote the optical flow of point $i$ in the $x$ and $y$ directions respectively.

Once the global optical flow is obtained, we use a similar encoding scheme as used for eye gaze data. The histogram of the encoded sequence obtained over a temporal window is used as the feature for the classification task.
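The global flow estimate can be sketched as follows, assuming the tracked point coordinates are already available as plain lists. In practice they would come from a corner detector and a Lucas-Kanade tracker (OpenCV's `cv2.goodFeaturesToTrack` and `cv2.calcOpticalFlowPyrLK` are one possible choice; the source does not name a library). The function name `median_flow` and the threshold `fb_thresh` are hypothetical.

```python
import math
from statistics import median

def median_flow(pts_prev, pts_next, pts_back, fb_thresh=1.0):
    """Median (u, v) over points whose forward-backward error is small.

    pts_prev: points in frame t
    pts_next: the same points tracked forward into frame t+1
    pts_back: pts_next tracked backward into frame t
    """
    us, vs = [], []
    for p0, p1, pb in zip(pts_prev, pts_next, pts_back):
        if math.dist(p0, pb) <= fb_thresh:  # keep reliable tracks only
            us.append(p1[0] - p0[0])
            vs.append(p1[1] - p0[1])
    if not us:
        return None  # no reliable track between the two frames
    return median(us), median(vs)

# Toy example: three consistent tracks and one tracking failure.
pts_prev = [(0, 0), (10, 10), (20, 5), (5, 5)]
pts_next = [(2, 1), (12, 11), (23, 5), (40, 40)]
pts_back = [(0, 0), (10.2, 10), (20, 5.1), (30, 30)]
flow = median_flow(pts_prev, pts_next, pts_back)
```

Taking the median rather than the mean keeps the estimate stable when a few points follow a moving object instead of the head motion.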

III-D Fusion and classification framework

Features obtained from the three independent modalities, namely ego-motion, eye gaze, and visual features, are combined in the proposed approach. Feature level fusion [26] is adopted, where the features of the three modalities are concatenated to form the final feature vector. All the features are extracted using a temporal sliding window of 25 seconds with a stride of one second. The histogram of each independent feature is computed, and the histograms are concatenated for training the classifier model.
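As a small illustration of feature level fusion by concatenation, the sketch below joins three aligned per-window histograms and counts how many windows a recording yields. The helper names and the toy histogram sizes are assumptions made for the example.

```python
def n_windows(duration_s, window_s=25, stride_s=1):
    """Number of full sliding windows that fit in a recording."""
    if duration_s < window_s:
        return 0
    return (duration_s - window_s) // stride_s + 1

def fuse(eye_hist, ego_hist, visual_hist):
    """Feature level fusion: concatenate the per-window histograms."""
    return list(eye_hist) + list(ego_hist) + list(visual_hist)

def fuse_sequence(eye, ego, visual):
    """Fuse three aligned sequences of per-window histograms."""
    if not (len(eye) == len(ego) == len(visual)):
        raise ValueError("modalities must have the same number of windows")
    return [fuse(e, g, v) for e, g, v in zip(eye, ego, visual)]

# A two-minute activity with a 25 s window and a 1 s stride.
count = n_windows(120)

# Toy histograms for a single window (sizes are illustrative only).
fused = fuse_sequence([[0.5, 0.5]], [[1.0, 0.0]], [[0.2, 0.3, 0.5]])
```

Because each histogram is normalized separately, the concatenated vector keeps the three modalities on a comparable scale.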

The classification model chosen should be able to handle different types of data as inputs. We have chosen the Random Forest (RF) classifier for this task. The random forest algorithm is an ensemble of decision trees initially proposed by Breiman [27]. It can intrinsically handle multi-class classification problems. Instead of using a single tree for classification, predictions from a large number of trees are integrated to form the final prediction. Different trees in the forest are trained on bootstrap samples, which are obtained by sampling the original data with replacement. For each tree, a subset of predictors is randomly selected at each node and an optimal split is found [28]. The tree is grown without pruning. In the testing phase, the test sample is fed to every tree in the forest, and each tree makes its own prediction. The final prediction is obtained by voting among the outputs of the decision trees. Random forest is robust to noise and fast to train. The out-of-bag error estimate computed during training acts as an internal cross-validation, which helps RF give good predictions without overfitting.
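Two of the ingredients described above, bootstrap sampling and majority voting, can be sketched in a few lines. This is not a full random forest: tree training and the random selection of predictors at each node are omitted, and the tree outputs in the example are made up for illustration.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; the unused ones are 'out of bag'."""
    chosen = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(chosen))
    return chosen, oob

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
chosen, oob = bootstrap_indices(10, rng)

# Hypothetical outputs of five trees for one test window.
label = majority_vote(["Read", "Write", "Read", "Browse", "Read"])
```

The out-of-bag samples of each tree are the ones it never saw during training, which is what allows an error estimate without a separate validation set.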

IV Experiments and results

Activities performed in office environments are considered in the experiments as they are difficult to classify by other methods. We have evaluated the accuracy of individual features as well as joint representation in a multi-class scenario to assess their performance.

IV-A Database used

We have used the UTokyo First-Person Activity Recognition Dataset [16] for the evaluation task. The dataset contains recordings of five subjects performing five different actions in an office environment. The classes available were reading a book, watching a video, copying text, writing on paper, and internet browsing. Each of these activities was performed for two minutes. There was a time gap of thirty seconds (‘Void’ class) between activities, during which subjects were allowed to converse, sing, and move freely. Each subject performed the activities twice. The data from these two sessions were used as the training and test sets. The recordings obtained from the EMR-9 eye tracking device were also available with the dataset. For the analysis, we have used the eye tracking data and the low-resolution video from the dataset.

IV-B Experiment protocol

For each subject in the dataset, the visual, eye movement, and ego-motion features were extracted. For each subject, two separate instances of the same activity class are available, and these were used as two folds for the evaluations. Initially, the first fold was used for training and the second one for testing. The training and testing sets were then interchanged, and the average accuracy computed across the two runs is reported. The evaluations were performed in a multi-class scenario, with data across all subjects used for training and testing.
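The swap-and-average protocol can be written down as a short sketch. The `fit` and `predict` callables are placeholders for any classifier; the majority-label classifier below is only a stand-in to keep the example self-contained and is not the random forest used in the paper.

```python
from collections import Counter

def two_fold_accuracy(session1, session2, fit, predict):
    """Train on one session, test on the other, swap, and average."""
    accs = []
    for train, test in ((session1, session2), (session2, session1)):
        model = fit(train)
        correct = sum(predict(model, x) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

# Placeholder classifier (assumption): always predicts the most
# frequent label seen during training.
def fit_majority(samples):
    return Counter(y for _, y in samples).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Toy sessions of (feature vector, label) pairs.
s1 = [([0.1], "Read"), ([0.2], "Read"), ([0.9], "Write")]
s2 = [([0.3], "Read"), ([0.8], "Write"), ([0.7], "Write")]
acc = two_fold_accuracy(s1, s2, fit_majority, predict_majority)
```

Passing the classifier in as two functions keeps the protocol independent of the model being evaluated.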

IV-C Multi-class classification

We have analyzed the performance with two different scenarios namely five class and six class classification. In the latter, ‘Void’ class is also used as a valid label.

IV-C1 Experiments with five activity classes

We have used five activity classes in this trial. Experiments were performed in a multi-class classification scenario to evaluate the generalization capability of the features. Training and testing were done across all the individuals. The first session data from all the subjects were used for training. A Random Forest model was trained using the joint feature vector obtained from ego-motion, eye motion, and CNN features. The individual accuracy of the modalities was also tested by training separate models for the CNN features and for the joint eye-ego motion features. The experiment was repeated with the training and testing sets interchanged, and the results of the two folds were averaged. The normalized confusion matrix obtained is shown in Fig. 5. The individual confusion matrices for visual and motion features alone are also shown in Fig. 5. The average accuracy over multiple runs is shown in Table I.

Fig. 5: Normalized confusion matrix for five classes, a) Combined features, b) Joint Ego-Eye motion feature, c) Visual features
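For readers who want to reproduce such plots, a normalized confusion matrix can be computed as in the sketch below. It assumes row normalization (each row, corresponding to one true class, sums to one); the source does not state which normalization was used, and the labels in the example are made up.

```python
def normalized_confusion(y_true, y_pred, classes):
    """Row-normalized confusion matrix: row = true class, column = predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    out = []
    for row in m:
        total = sum(row)
        # A class with no samples yields an all-zero row.
        out.append([v / total if total else 0.0 for v in row])
    return out

classes = ["Read", "Write"]
y_true = ["Read", "Read", "Read", "Read", "Write", "Write"]
y_pred = ["Read", "Read", "Read", "Write", "Write", "Read"]
cm = normalized_confusion(y_true, y_pred, classes)
```

With this convention the diagonal entries are the per-class accuracies.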

IV-C2 Experiments with six activity classes

In this experiment, we have considered all six classes including the ‘Void’ class. We have followed the same testing methodology as described for the five class scenario. The results obtained are shown in Fig. 6 and Table I.

Fig. 6: Normalized confusion matrix for six classes, a) Combined features, b) Joint Ego-Eye motion feature, c) Visual features

IV-C3 Accuracy across different subjects

The variations in accuracy across different subjects are shown in Fig. 7. The combined feature gives better results for most of the subjects.

Fig. 7: Variation of accuracy across different subjects
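
| Scenario | Combined | Eye and Ego Motion | Visual (CNN) |
|----------|----------|--------------------|--------------|
| 6 class  | 77.09%   | 72.49%             | 45.03%       |
| 5 class  | 85.65%   | 79.38%             | 62.97%       |

TABLE I: Average accuracy of the three feature sets for the 5 and 6 class scenarios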

IV-C4 Accuracy across classes

The accuracy for each class with different feature combinations is shown in Fig. 8. The joint representation achieves better results than the individual features. The joint eye-ego motion feature obtains the best accuracy among the individual features. The ‘Void’ class shares visual and motion features with the other classes, as the subjects were allowed to interact freely during those periods. This could explain the low accuracy of the ‘Void’ class. Visual features give good results in activities like ‘Write’ and ‘Read’ since the field of view is different from that of other activities.

Fig. 8: Variation of accuracy across different classes
Fig. 9: Comparison with state of the art methods [16], SW+MW (Saccade Word+ Motion Word) [16], MH (Motion Histogram) [29], GIST [30]

IV-D Comparison with other methods

We have compared the results obtained with different methods. Saccade word and motion word (SW + MW) [16], which is a combination of eye movement and ego-motion ‘N-gram’ features, obtains the second best result. GIST features [30] capture the visual content of the scene and can be used for activity recognition in egocentric video [31]. A combination of saccade word (SW) and GIST effectively combines motion and visual features. Motion histogram (MH), proposed by Kitani et al. [29], encodes instantaneous as well as periodic motion using Fourier analysis. The accuracy of the combination of saccade word and motion histogram is also taken for comparison. The mean average precisions of the methods are compared in Fig. 9.

The proposed method outperforms all the other methods. The addition of visual features to the motion and eye gaze features improved the accuracy significantly. Compared to other methods, the higher representation power of the CNN based feature and the combination of ego-eye motion features make the algorithm more accurate.

IV-E Discussions

From the results obtained, it can be seen that combining the three modalities improves the accuracy. In the six class scenario, the highest accuracy is achieved for the class ‘Write’. This can be attributed to both the distinct gaze patterns and the visual features during the activity. In particular, the high accuracy of visual features during this activity may be due to the appearance of paper and pen, which are unique to this activity. Even though the addition of visual features increases the overall accuracy, the individual performance of visual features is poor in many cases. The activities used in this experiment were performed in an office environment, which does not have much diversity in visual information. The addition of the ‘Void’ class introduces more errors, as the same visual features appear in multiple activities.

In the five class scenario, ‘Void’ class was not present. The accuracy of visual features is much better than the six class case. The overall accuracy of classification is also much better in this scenario. The random forest based classifier tries to identify the important features for activity classification from the joint feature representation.

Some of the advantages of the proposed system are described here. Three distinct channels of information are fused in the proposed approach, which improves the generalizability of the approach to a larger number of classes. Representing one particular activity might not require features from all three channels. For example, reading has a characteristic pattern in eye tracking data (a sequence of small fixations and saccades), so it may be possible to identify reading from eye tracking data alone. Distinguishing browsing from watching movies, on the other hand, might require all three channels of information. The high-level CNN descriptors are suitable for giving a context to the actions. The random forest algorithm is capable of identifying, from the joint representation, the features that are relevant for recognizing a particular action. Even though the number of activity classes used in this work is small, the framework is capable of handling a larger number of classes.

V Conclusions

In this work, we have proposed an approach for combining different modalities, namely ego-motion features, eye movement features, and visual features, for the classification of activities. A joint feature vector is formed from the individual feature extractors, and a random forest classifier is used to classify the activities using this joint representation. The joint eye-ego motion feature gave the best individual accuracy among the features. However, the addition of visual features resulted in a higher accuracy in activity classification. Additional channels of information can be easily added to the framework. The addition of activity-dependent object detectors and a weighted fusion of the three modalities might improve the results.


  • [1] R. Poppe, “A survey on vision-based human action recognition,” Image and vision computing, vol. 28, no. 6, pp. 976–990, 2010.
  • [2] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Computer Vision and Image Understanding, vol. 115, no. 2, pp. 224–241, 2011.
  • [3] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine recognition of human activities: A survey,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 18, no. 11, pp. 1473–1488, 2008.
  • [4] A. Fathi, A. Farhadi, and J. M. Rehg, “Understanding egocentric activities,” in Computer Vision, IEEE International Conference on, 2011, pp. 407–414.
  • [5] H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in Computer Vision and Pattern Recognition, IEEE Conference on, 2012, pp. 2847–2854.
  • [6] Y. Yan, E. Ricci, G. Liu, and N. Sebe, “Egocentric daily activity recognition via multitask clustering,” Image Processing, IEEE Transactions on, vol. 24, no. 10, pp. 2984–2995, 2015.
  • [7] T. Starner, “Project glass: An extension of the self,” Pervasive Computing, IEEE, vol. 12, no. 2, pp. 14–16, 2013.
  • [8] A. George and A. Routray, “Fast and accurate algorithm for eye localisation for gaze tracking in low-resolution images,” IET Computer Vision, vol. 10, no. 7, pp. 660–669, 2016.
  • [9] S. Happy, A. George, and A. Routray, “A real time facial expression classification system using local binary patterns,” in Intelligent Human Computer Interaction (IHCI), 2012 4th International Conference on.   IEEE, 2012, pp. 1–5.
  • [10] A. Dasgupta, A. George, S. Happy, and A. Routray, “A vision-based system for monitoring the loss of attention in automotive drivers,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 4, pp. 1825–1838, 2013.
  • [11] A. Sengupta, A. Dasgupta, A. Chaudhuri, A. George, A. Routray, and R. Guha, “A multimodal system for assessing alertness levels due to cognitive loading,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 7, pp. 1037–1046, 2017.
  • [12] A. George and A. Routray, “A score level fusion method for eye movement biometrics,” Pattern Recognition Letters, vol. 82, pp. 207–215, 2016.
  • [13] T.-H.-C. Nguyen, J.-C. Nebel, and F. Florez-Revuelta, “Recognition of activities of daily living with egocentric vision: A review.” Sensors (Basel, Switzerland), vol. 16, no. 72, 2016.
  • [14] A. Bulling, J. A. Ward, H. Gellersen, and G. Troster, “Eye movement analysis for activity recognition using electrooculography,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 4, pp. 741–753, 2011.
  • [15] I. M. Hipiny and W. Mayol-Cuevas, “Recognising egocentric activities from gaze regions with multiple-voting bag of words,” University of Bristol, Tech. Rep., 2012.
  • [16] K. Ogaki, K. M. Kitani, Y. Sugano, and Y. Sato, “Coupling eye-motion and ego-motion features for first-person activity recognition,” in Conference on Computer Vision and Pattern Recognition Workshops.   IEEE, 2012, pp. 1–7.
  • [17] A. George and A. Routray, “Real-time eye gaze direction classification using convolutional neural network,” in Signal Processing and Communications (SPCOM), 2016 International Conference on.   IEEE, 2016, pp. 1–5.
  • [18] Y. Li, Z. Ye, and J. M. Rehg, “Delving into egocentric actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 287–295.
  • [19] A. Fathi, Y. Li, and J. M. Rehg, “Learning to recognize daily actions using gaze,” in European Conference on Computer Vision.   Springer, 2012, pp. 314–327.
  • [20] Y. Shiga, T. Toyama, Y. Utsumi, K. Kise, and A. Dengel, “Daily activity recognition combining gaze motion and visual features,” in Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication.   ACM, 2014, pp. 1103–1111.
  • [21] K. Kunze, M. Iwamura, K. Kise, S. Uchida, and S. Omachi, “Activity recognition for the mind: Toward a cognitive ‘quantified self’,” Computer, vol. 46, no. 10, pp. 105–108, 2013.
  • [22] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [24] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in European Conference on Computer Vision.   Springer, 2014, pp. 392–407.
  • [25] Z. Kalal, K. Mikolajczyk, and J. Matas, “Forward-backward error: Automatic detection of tracking failures,” in Pattern recognition, IEEE 20th international conference on, 2010, pp. 2756–2759.
  • [26] A. A. Ross and R. Govindarajan, “Feature level fusion of hand and face biometrics,” in Defense and Security.   International Society for Optics and Photonics, 2005, pp. 196–204.
  • [27] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [28] Y. Ma, B. Cukic, and H. Singh, “A classification approach to multi-biometric score fusion,” in International Conference on Audio-and Video-Based Biometric Person Authentication.   Springer, 2005, pp. 484–493.
  • [29] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in Computer Vision and Pattern Recognition, IEEE Conference on, 2011, pp. 3241–3248.
  • [30] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
  • [31] E. H. Spriggs, F. De La Torre, and M. Hebert, “Temporal segmentation and activity classification from first-person sensing,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009, pp. 17–24.