An egocentric camera captures rich and varied information about how the wearer interacts with their environment. Visually understanding this information remains a significant challenge, incited not only by the enormous variety of such interactions but also by limitations in the available visual descriptors, e.g. those rooted in motion or appearance. Supervised learning from labelled examples is used to alleviate some of these ambiguities. Egocentric datasets [12, 10, 34, 6] and interaction recognition methods [10, 28, 9, 23] differ in the features used and the classification techniques adopted, yet they all assume a semantically distinct set of pre-selected verbs or verb-noun combinations for supervision. When free annotations are available - an unbounded choice of verbs or verb-nouns - from audio scripts or textual annotations, a single label is selected to represent each interaction using a majority vote. Less frequent annotations are treated as outliers, though they typically represent a meaningful and correct annotation. For example, lifting an object from a workspace could be described as pick-up, lift, take or grab; all valid labels. Note that assuming multiple valid labels is different from the problem of Ambiguous Label Learning [3, 14], where the aim is to find a single valid label from a mixed set of related and unrelated labels.
Egocentric video offers a unique insight into object interactions in particular. The camera is ideally positioned to capture objects being used and, equally interesting, the different ways in which the same object is used. One interaction (e.g. open) applies to a wide variety of objects, and each video can be labelled by multiple valid labels (e.g. open door vs push door). In this context, recognition cannot be simplified as a one-vs-all classification task. Capturing the semantic relationships between annotations and the visual ambiguities between accompanying video segments can better represent the space of possible interactions. Figure 1 shows a graphical abstract of our work.
Given a dataset of egocentric object interactions with free annotations, we contribute four departures from previous attempts: (i) we treat all free annotations as valid, correct labellings; (ii) we build a graph that combines semantic relationships with visual similarities, inspired by previous work on object class categories in single images (Sec. 3.1); (iii) we embed a test video into the previously learnt semantic-visual graph and estimate the probability distribution over its possible annotations (Sec. 3.2); and (iv) when verb meanings are available, we discover semantic relationships between annotations using WordNet (Sec. 3.3).
We test semantic embedding (SEMBED) on three public egocentric datasets [6, 34, 9]. We show that as the number of verb annotations and their semantic ambiguities increase, SEMBED outperforms classification approaches. We also show that incorporating higher level semantic relationships, such as the hyponymy relationship, improves the results. Note that while we focus on egocentric object interaction recognition as a rich domain of semantic and visual ambiguities, some of the arguments can apply to action recognition in general.
2 Embedding Object Interactions - Prior Work
To the best of our knowledge, embedding for egocentric action recognition has not been attempted previously. We first review works on recognising egocentric object interactions, then review works which incorporate semantic knowledge for recognition tasks.
Egocentric Object Interaction Recognition: Egocentric action recognition works range from self-motion (e.g. walk, cycle) to high-level activities (e.g. [34, 18, 20, 2, 35]). On the task of object interaction recognition, approaches vary in whether they use hand-centred features [15, 19], object-specific features [10, 6, 23, 29] or a combination [12, 21]. Ishihara et al. use dense trajectories in addition to global hand shape features and apply a linear SVM to determine the action class. Kumar et al. sample and describe superpixel regions around the hand. Their method allows hand detectors to be trained on the fly while the user performs the action.
Object-specific features are better suited to recognising verb-noun actions (e.g. pick-cup vs pick-plate) than a general picking action. In Damen et al., spatio-temporal interest points are used to discover object interactions in an unsupervised manner. The works of Fathi et al. [10, 9, 21, 11] have tested features including gaze, colour, texture and shape for verb-noun action classification. One of these specifically discusses the change in the object's state as a useful feature for recognising object interactions. Though attempting video summarisation primarily, Ghosh et al. introduce a collection of features that could be used to classify object interactions, such as distance from the hand, saliency and objectness, represented using a spatio-temporal pyramid to detect change. These features were proven useful for segmenting object interactions from a lengthy video, but have not been tested for action classification per se. On several publicly available datasets, Li et al. compare motion, object, head motion and gaze information along with a linear SVM for object interaction classification. Their results show that Improved Dense Trajectories (IDT) outperform other motion features.
Based on the conclusions of [21, 30], in this work we report results on IDT as a state-of-the-art motion feature and on pre-trained CNN features as a state-of-the-art appearance feature. Testing fine-tuned CNNs is left for future work.
Semantic Embedding for Object and Action Recognition:
Using linguistic semantic knowledge for Computer Vision tasks, including action recognition, has been fuelled by the accessibility of text or audio descriptions from online sources.
One such dataset which made this possible was gathered from YouTube videos with free annotations. The dataset includes a variety of real-world scenarios, not limited to egocentric or object interactions. For each video, multiple annotators were asked to describe the video. Both [26, 13] use this dataset for action recognition. In Motwani and Mooney, the most frequently annotated verb for each video is used, and verbs are grouped into classes using semantic similarity measures extracted from the WordNet hierarchy as well as information corpora. Videos are described by HoG and HoF features around spatio-temporal interest points. Guadarrama et al. find subject, object and verb triplets in an attempt to automatically annotate the action. They create a separate semantic hierarchy for each, formulated from co-occurrences of words within the free annotations, and use Spearman's rank to find the distances between clusters. Semantic links are used to generate specific, rather than general, annotations, and a classifier is trained for each leaf node within the hierarchies. Their method allows zero-shot action annotation by trading off specificity and semantic similarity. While combining semantics, both works use majority voting to limit the description per class to a single verb.
Another recent YouTube dataset was collected of users performing tasks while narrating their actions. Labels are extracted from the audio descriptions using automatic speech recognition. Verb labels are then used to align videos, using a WordNet similarity measure as well as visual similarity (HoF and CNN), to find the sequence of actions in a task.
Semantics have also been used for object recognition in images. Jin et al. use WordNet to remove noisy labels from images which have multiple labels. Similarly, Ordonez et al. use WordNet to find the most frequently-used object labels amongst multiple annotations. We build our work on Fang and Torresani, where images are embedded in a semantic-visual graph. In their work, images are clustered depending on the semantic relationships between the labels, and edges of the graph are weighted by visual similarity. They use ImageNet as the database for training, benefiting from the fact that images within ImageNet are organised according to the WordNet hierarchy. We differ in how we add visual links to the semantic graph, as explained next.
3 Semantic Embedding of Egocentric Action Videos
We next, in Sec. 3.1, explain how we build a semantic-visual graph (SVG) that encodes label and visual ambiguities in the training set. In Sec. 3.2, we detail how videos with an unknown class are embedded in SVG, and how the probability distribution over their annotations is estimated. Finally, in Sec. 3.3 we explore further semantic relationships when verb meanings are annotated.
3.1 Learning the Semantic-Visual Graph
The Semantic-Visual Graph (SVG) is a representation of the training videos, with three sources of information encoded. First, videos that are semantically linked, e.g. have the same label, are linked in SVG. Second, nodes that are visually similar yet semantically distinct are also linked, as these indicate visual ambiguities. Third, edge weights correspond to the visual similarity, normalised over neighbouring nodes, using a chosen visual descriptor and a defined distance measure. In this section we explain how SVG is first constructed as an undirected graph, then normalised to obtain its directed counterpart.
In its undirected form, one node in SVG corresponds to one training video. Assume AX(i, j) is a binary function that checks whether the labels of two videos i and j are semantically related. Initially, AX(i, j) is true when both videos are annotated with exactly the same verb; this assumption is revisited in Sec. 3.3. Edges in SVG are created between nodes with a semantic relationship.
Each undirected edge is assigned a weight given by a distance measure defined over the chosen visual descriptor. We further rank each pairwise distance relative to all remaining pairs of videos, the minimum element in the list being the most visually similar pair, and similarly rank the distance from a node to all nodes it is not connected to. Further links are then added to SVG to encode visual ambiguities: each node is connected to a fixed number of its most visually similar yet semantically dissimilar nodes. We differ from Fang and Torresani in that we ensure each node is connected to its top visually similar but semantically distinct node.
The undirected graph SVG is then converted to a directed graph by replacing each edge with two directed edges.
The weights of the directed edges are initially the same as those of their undirected counterparts; they are then normalised to define the probability of traversing from one video to another.
The reciprocal of the weights is taken so that the most visually similar path will have the highest probability.
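The steps above can be sketched as follows; the toy features, labels and variable names are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

# Toy sketch of Semantic-Visual Graph (SVG) construction.
# Each training video is a node with a verb label and a visual descriptor.
features = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.15, 0.05]])
labels = ["open", "open", "pull", "push"]
n = len(labels)

# Pairwise visual distances under the chosen descriptor (Euclidean here).
dist = np.linalg.norm(features[:, None] - features[None, :], axis=2)

# 1) Semantic edges: link videos sharing the same annotation (the AX relation).
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(i + 1, n):
        if labels[i] == labels[j]:
            adj[i, j] = adj[j, i] = True

# 2) Visual-ambiguity edges: link each node to its most visually similar
#    but semantically distinct node (top-1 here for brevity).
for i in range(n):
    others = [j for j in range(n) if j != i and labels[j] != labels[i]]
    k = min(others, key=lambda j: dist[i, j])
    adj[i, k] = adj[k, i] = True

# 3) Directed transition probabilities: reciprocal of the visual distance,
#    normalised over each node's neighbours, so closer nodes are likelier.
eps = 1e-8
P = np.zeros((n, n))
for i in range(n):
    nbrs = np.where(adj[i])[0]
    w = 1.0 / (dist[i, nbrs] + eps)
    P[i, nbrs] = w / w.sum()
```

Each row of `P` is then a proper probability distribution over the node's neighbours, ready for the Markov Walk of Sec. 3.2.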
3.2 Embedding in Semantic-Visual Graph
Given a test video, we first embed it into SVG, then use the Markov Walk (MW) method of Fang and Torresani to determine the probability distribution over its possible labels. To embed the test video, we begin by finding the set of its closest neighbours in SVG based on visual distance. We then add directed edges, with normalised weights, connecting the test video to the nodes in this set. Following the embedding, MW traverses the nodes in the directed graph to estimate the probability of each labelling. Given the Markovian assumption and a predefined number of steps, we calculate the probability distribution of reaching each node.
To perform MW efficiently, we construct a vector q holding the test video's initial probabilities over its embedded neighbours, and a transition matrix holding the edge probabilities (Eq. 6); note that this matrix is asymmetrical, as nodes have different sets of neighbours in SVG. Accordingly, the distribution after the walk is obtained by repeatedly multiplying q by the transpose of the transition matrix, once per MW step. We then accumulate the node probabilities for every unique annotation and select the annotation with the highest probability as the semantic label of the test video. Figure 2 shows an example of SVG and video embedding: given two nearest neighbours and two steps in MW, the probability distribution over possible labellings is calculated.
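A minimal sketch of the walk, assuming a small hand-built transition matrix over three training nodes and a test video already embedded via its two nearest neighbours (all values illustrative):

```python
import numpy as np

# Illustrative directed SVG transition matrix P and training labels.
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
labels = ["open", "open", "push"]

# q holds the test video's normalised similarity to its nearest
# training nodes (the embedding step); zero elsewhere.
q = np.array([0.8, 0.2, 0.0])

s = 2  # number of MW steps
# Probability of reaching each node after s steps of the walk.
p = np.linalg.matrix_power(P.T, s) @ q

# Accumulate node probabilities per unique annotation and pick the mode.
scores = {}
for prob, lab in zip(p, labels):
    scores[lab] = scores.get(lab, 0.0) + prob
prediction = max(scores, key=scores.get)
```

Since every row of `P` and the vector `q` sum to one, the resulting `p` is itself a distribution over the training nodes, and `scores` a distribution over annotations.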
3.3 Semantic Relationships: Synsets and Hyponyms
In Sec. 3.1, videos are considered semantically linked only when the annotated verbs are the same. SVG then enables handling ambiguities by incorporating visual similarity links in the graph. However, further semantic relationships between annotations can be exploited, such as synonymy and hyponymy. In linguistics, two words are synonyms if they have the same meaning, and a set of synonyms is a synset. Moreover, two words are described as a hyponym and a hypernym respectively if the first is a more specific instance of the second; the prefixes originate from the Greek for under (hypo-) and over (hyper-).
Synonymy and hyponymy relationships are encoded in lexical databases. WordNet (v3.1, 2012) is a commonly-used lexical database based on six semantic relations. In the WordNet verb hierarchy, each verb is first separated into its disjoint meanings; the meanings are then arranged in hierarchies that encapsulate semantic relationships. To benefit from such hierarchies, verbs should be annotated with their meanings. We annotate BEOID using verb meanings, and Fig. 3 shows how such annotations of the same action can be synonyms and hyponyms, as annotators chose different or more specific action descriptions.
Given annotated meanings, we define the term action synset (AS) to indicate that annotations are linked solely by a synonymy relationship, and the term action hyponym (AH) to indicate that annotations are linked by either a synonymy or a hyponymy relationship. For comparison, we define the term action meaning (AM), where annotations are linked only when they match exactly. We use the general term AX to stand for any of the possible types of semantic relationship tested: AM, AS or AH.
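The three relations can be sketched as a single predicate; the tiny synset and hypernym tables below are hypothetical stand-ins, not actual WordNet data:

```python
# Hypothetical verb-meaning tables: each annotation maps to a synset id,
# and HYPERNYM records that one synset is a more specific form of another
# (here, "unwrap" as a specific way of "open").
SYNSETS = {"pick-up": "s1", "lift": "s1", "take": "s1",
           "open": "s2", "unwrap": "s3"}
HYPERNYM = {"s3": "s2"}

def related(a, b, relation):
    """True if annotations a and b are linked under AM, AS or AH."""
    if relation == "AM":       # action meaning: exact match only
        return a == b
    sa, sb = SYNSETS[a], SYNSETS[b]
    if relation == "AS":       # action synset: same synset (synonymy)
        return sa == sb
    if relation == "AH":       # action hyponym: synonymy or hyponymy
        return sa == sb or HYPERNYM.get(sa) == sb or HYPERNYM.get(sb) == sa
    raise ValueError(relation)
```

In the full system this predicate plays the role of AX(i, j) when building the semantic edges of SVG, so moving from AM to AS to AH strictly grows the set of semantic links.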
4 Datasets, Experiments and Results
Verb annotations: We exploited the annotations provided by the authors to split the CMU and GTEA+ sequences into object-interaction segments. For CMU, object-interaction annotations are only provided for the activity of making brownies; annotators chose from 12 disjoint verbs to ground-truth segments. In GTEA+, annotators chose from verb-noun pairings to ground-truth, e.g. cut_cucumber versus divide_bun, and similarly squeeze_ketchup versus compress_bun. When removing the nouns, verbs could be used interchangeably, but free annotations were not available to annotators.
While BEOID contains a variety of activities and locations, ranging from a desktop to operating a gym machine, it does not provide action-level annotations, so we annotated BEOID using free annotations (available at: http://www.cs.bris.ac.uk/~damen/BEOID/), allowing annotators to split video sequences into object-interaction segments in addition to choosing the verb. We recruited 20 native English speakers. These annotators were given a free textbox to label each segment with the verb that, in their opinion, best described the seen interaction. Once a verb had been chosen, the annotators were presented with the set of potential meanings extracted from WordNet for that verb, and were asked to select the meaning that, in their opinion, best suited the segment. Multiple annotators (8-10) were asked to label each task to intentionally introduce variability in the choice of verbs and in the start-end times of object-interaction segments.
Motion and Appearance Features: We test two state-of-the-art feature descriptors to represent the motion and the appearance of the videos: Improved Dense Trajectories (IDT) and Overfeat Convolutional Neural Networks pre-trained for ImageNet classes (CNN). For CNN features, we take every 5th frame from the 30fps video, always starting from the first frame, and rescale it to 320x240 pixels.
Encodings: We test two encodings, Bag of Words (BoW) and Fisher Vectors (FV), with Euclidean distance. For IDT, when creating the BoW and FV representations, we use a 25% random sample from every video to model the Gaussians for efficiency. We vary the number of Gaussians and the size of the codebook in the reported results.
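A minimal BoW encoding sketch under these assumptions; the codebook here is hand-picked for illustration, whereas in practice it would be learnt (e.g. k-means for BoW, a GMM for FV) from sampled training descriptors:

```python
import numpy as np

# Illustrative codebook of k = 3 visual words (learnt offline in practice).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

def bow_encode(descriptors, codebook):
    """Quantise a video's local descriptors against the codebook and
    return the L1-normalised histogram of visual-word assignments."""
    d = np.linalg.norm(descriptors[:, None] - codebook[None, :], axis=2)
    words = d.argmin(axis=1)  # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Four toy local descriptors standing in for one video's IDT features.
video = np.array([[0.1, 0.0], [0.9, 1.1], [0.05, 0.05], [0.1, 0.9]])
encoding = bow_encode(video, codebook)
```

The resulting fixed-length vector is what the Euclidean distance of Sec. 3.1 would operate on when comparing videos.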
Classification: In all results, leave-one-person-out cross validation is used: when testing a video of one person performing an action, all other videos captured from the same person are excluded from the training set. For SVM results, as the tested datasets contain an imbalance in the distribution of instances per class, we weight the classes by a term whose exponent best fits the distribution of segments per verb for a given dataset (see supplementary material).
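The exact weighting term does not survive here, so the sketch below uses a generic power of the per-class segment count as an illustrative stand-in; the counts and the exponent gamma are hypothetical:

```python
# Hypothetical per-class segment counts and a fitted exponent gamma;
# rarer classes receive larger SVM weights under this scheme.
counts = {"open": 120, "pull": 30, "push": 10}
gamma = 0.5  # illustrative; fitted per dataset in the paper

weights = {c: n ** -gamma for c, n in counts.items()}
```

Any monotone inverse-frequency scheme would serve the same purpose: countering the imbalance so frequent verbs do not dominate the decision boundary.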
Table 1 (excerpt): accuracy for each feature, encoding and classifier combination, followed by the number of verbs and results reported in other works.

| CMU   | 58.6 | 46.6 | 46.3 | 55.9 | 43.3 | 52.0 | 69.4 | 58.1 | 57.4 | 55.9 | 57.6 | 61.6 | 12 | 48.6, 73.4 |
| GTEA+ | 15.6 | 30.0 | 31.0 | 25.1 | 33.5 | 33.6 | 43.6 | 43.4 | 42.1 | 27.8 | 34.5 | 40.3 | 25 | 60.5, 65.1 |
Results on annotated verbs: Table 1 compares the three datasets for every feature, encoding and classifier combination. The following conclusions can be made: (i) for all datasets, motion features (IDT) outperform appearance features (CNN) when classifying verbs without considering the object used; (ii) for CMU and GTEA+, we produce comparable results to published results using motion information on the same datasets. These are reported under 'Other Works' but are not directly comparable, as published works tend to report on verb-noun classes; (iii) for the three datasets with varying numbers of verbs, as the number of verbs increases (from 12 to 75) with an increase in semantic ambiguity, SEMBED outperforms standard classifiers (SVM and k-NN). While the table shows the best results for each encoding, Fig. 5 reports comparative results as the encoding size is changed.
We test the sensitivity of SEMBED to its key parameters, the number of nearest neighbours k used in the embedding and the number of MW steps s, and report results in Fig. 6, showing the accuracy over various features for BEOID and across the three datasets for IDT-BoW (see supplementary material for all combinations). As noted, k and s behave differently for the various appearance and motion descriptors as well as for different encodings. Generally, SEMBED is more sensitive to the choice of k than of s; this is because the Markov Walk (MW) is unable to represent the probability distribution over labels unless the starting positions are representative of the visual ambiguity. Figure 6 also shows that MW is not particularly helpful for CMU (as s increases, accuracy decreases), because it has visually distinctive verb classes. On all datasets, SEMBED is resilient to changing s; the results are comparable over a range of values.
Results on annotated verbs and meanings: As mentioned earlier, we also annotate BEOID with verb-meaning ground-truth. This resulted in 108 annotations for the 1225 segments in the dataset. Note the increase in the number of classes from 75, when using verbs only, to 108 when using verb-meaning ground-truth. This increase is due to two reasons - one helpful, another problematic. For example, it is helpful when annotators choose between the meanings "keep in a certain state, position" and "hold in one's hand"; annotators would then use the first when a button is pressed and the second when an object is grasped. However, WordNet meanings can frequently appear ambiguous, resulting in problematic cases, especially in the context of egocentric actions. An example is the action of turning a tap on so water would flow: annotators used the meanings "change orientation or direction" and "cause to move around or rotate" interchangeably, yet in WordNet these two meanings are not semantically related, introducing unwanted ambiguity that affects the ground-truth labels. While we accept that WordNet may not be the best method to incorporate meaning, we report results as semantic links are incorporated.
We test the three types of semantic relationships: AM, AS and AH. Histograms of all classes for the various semantic relationships are included in the supplementary material. Table 2 shows that embedding consistently improves performance as synsets and then hypernyms are grouped. The results also demonstrate the advantages of introducing semantic links between videos. Additionally, IDT continues to outperform CNN. Figure 7 shows one example of SEMBED in action when using meanings and AH semantic links (video with results available on YouTube). It should be noted that the best performance of SEMBED on meanings is inferior to using verbs only; this is due to the difficulty in assigning meanings to verbs, as previously noted. Approaches to address meaning ambiguities are left for future work.
5 Conclusion and Future Directions
The paper proposes embedding an egocentric action video in a semantic-visual graph to estimate the probability distribution over potentially ambiguous labels. SEMBED profits from semantic knowledge to capture interchangeable labels for the same action, along with similarities in visual descriptors.
While showing clear potential, outperforming classification approaches on a challenging dataset, the results evaluate only the estimated label against the ground truth. Further analysis of the probability distribution will be targeted next. Other approaches to identifying semantically related object-interaction labels, for example from other lexical sources, overlapping annotations or object labels, will also be attempted. SEMBED's ability to scale to other object interactions and more discriminative visual descriptors will also be tested.
-  Alayrac, J., Bojanowski, P., Agrawal, N., Laptev, I., Sivic, J., Lacoste-Julien, S.: Unsupervised learning from narrated instruction videos. In: CVPR (2016)
-  Bleser, G., Damen, D., Behera, A., Hendeby, G., Mura, K., Miezal, M., Gee, A., Petersen, N., Macaes, G., Domingues, H., Gorecky, D., Almeida, L., Mayol-Cuevas, W., Calways, A., Cohen, A., Hogg, D., Stricker, D.: Cognitive learning, monitoring and assistance of industrial workflows using egocentric sensor networks. PLOS ONE (2015)
-  Chen, C.H., Patel, V.M., Chellappa, R.: Matrix completion for resolving label ambiguity. In: CVPR (2015)
-  Chen, D., Dolan, W.: Collecting highly parallel data for paraphrase evaluation. In: Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (2011)
-  Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV (2004)
-  Damen, D., Leelasawassuk, T., Haines, O., Calway, A., Mayol-Cuevas, W.: You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In: BMVC (2014)
-  De La Torre, F., Hodgins, J., Bargteil, A., Martin, X., Macey, J., Collado, A., Beltran, P.: Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database. Robotics Institute (2008)
-  Fang, C., Torresani, L.: Measuring image distances via embedding in a semantic manifold. In: ECCV (2012)
-  Fathi, A., Li, Y., Rehg, J.: Learning to recognize daily actions using gaze. In: ECCV (2012)
-  Fathi, A., Rehg, J.: Modeling actions through state changes. In: CVPR (2013)
-  Fathi, A., Ren, X., Rehg, J.: Learning to recognize objects in egocentric activities. In: CVPR (2011)
-  Ghosh, J., Lee, Y.J., Grauman, K.: Discovering important people and objects for egocentric video summarization. In: CVPR (2012)
-  Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: ICCV (2013)
-  Hüllermeier, E., Beringer, J.: Learning from ambiguously labeled examples. Intelligent Data Analysis pp. 419–439 (2006)
-  Ishihara, T., Kitani, K., Ma, W., Takagi, H., Asahawa, C.: Recognizing hand-object interactions in wearable camera videos. In: ICIP (2015)
-  Jin, Y., Khan, L., Wang, L., Awad, M.: Image annotations by combining multiple evidence & Wordnet. In: ACM international conference on Multimedia (2005)
-  Kitani, K., Okabe, T., Sato, Y., Sugimoto, A.: Fast unsupervised ego-action learning for first-person sports videos. In: CVPR (2011)
-  Kuehne, H., Serre, T.: Towards a generative approach to activity recognition and segmentation. ArXiv preprint arXiv:1509.01947 (2015)
-  Kumar, J., Li, Q., Kyal, S., Bernal, E., Bala, R.: On-the-fly hand detection training with application in egocentric action recognition. In: CVPRW (2015)
-  Lade, P., Krishnan, N., Panchanathan, S.: Task prediction in cooking activities using hierarchical state space markov chain and object based task grouping. In: ISM (2010)
-  Li, Y., Ye, Z., Rehg, J.: Delving into egocentric actions. In: CVPR (2015)
-  Ma, M., Fan, H., Kitani, K.: Going deeper into first-person activity recognition. In: CVPR (2016)
-  McCandless, T., Grauman, K.: Object-centric spatio-temporal pyramids for egocentric activity recognition. In: BMVC (2013)
-  Miller, G.: Wordnet: a lexical database for english. Communications of the ACM (1995)
-  Moghimi, M., Azagra, P., Montesano, L., Murillo, A., Belongie, S.: Experiments on an rgb-d wearable vision system for egocentric activity recognition. In: CVPRW (2014)
-  Motwani, T., Mooney, R.: Improving video activity recognition using object recognition and text mining. In: ECAI (2012)
-  Ordonez, V., Liu, W., Deng, J., Choi, Y., Berg, A., Berg, T.: Predicting entry-level categories. IJCV (2015)
-  Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: CVPR (2012)
-  Ren, X., Gu, C.: Figure-ground segmentation improves handled object recognition in egocentric video. In: CVPR (2010)
-  Ryoo, M., Rothrock, B., Matthies, L.: Pooled motion features for first-person videos. In: CVPR (2015)
-  Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: Theory and practice. IJCV (2013)
-  Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. ICLR (2013)
-  Singh, S., Arora, C., Jawahar, C.: First person action recognition using deep learned descriptors. In: CVPR (2016)
-  Spriggs, E., De La Torre, F., Hebert, M.: Temporal segmentation and activity classification from first-person sensing. In: CVPRW (2009)
-  Sundaram, S., Mayol-Cuevas, W.: Egocentric visual event classification with location-based priors. In: ISVC (2010)
-  Taralova, E., De La Torre, F., Hebert, M.: Source constrained clustering. In: ICCV (2011)
-  Wang, H., Kläser, A., Schmid, C., Liu, C.: Action recognition by dense trajectories. In: CVPR (2011)
-  Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013)