According to Wikipedia:
“A mirror neuron is a neuron that fires both when an animal acts and when the animal observes the same action performed by another. Thus, the neuron mirrors the behavior of the other, as though the observer were itself acting. Such neurons have been directly observed in primate species.”
Achieving human-like learning abilities requires modeling the mirror neuron phenomenon. In other words, intelligent systems should be able to relate visual information from a third-person perspective to a first-person perspective, and vice versa. Watching a human running, an intelligent system (e.g., a robot) should be able to imagine what the visual world would look like if the system itself attempted such an act. We believe now is the right time to take the first step towards modeling this concept from a computer vision standpoint.
During the past few years, egocentric cameras have given us the opportunity to study first-person vision extensively. Thanks to the affordability of wearable cameras and smart glasses (e.g., GoPro, Google Glass), a lot of interesting research has been done, ranging from action recognition to identification and localization. The history of computer vision, however, goes well beyond the past few years. A tremendous amount of research has been conducted in different areas of computer vision on more traditional types of videos, collected using static cameras from canonical, oblique, or top views. We refer to these videos as exocentric or third-person videos. Because egocentric vision is a relatively new area compared to exocentric vision, far more exocentric data is available than egocentric data. For example, while there are several datasets in the computer vision community for action and activity recognition in the exocentric domain, there are not nearly as many egocentric datasets.
In order to take advantage of the vast amount of knowledge that exists in the exocentric domain, there is a need for a systematic adaptation of exocentric information to the egocentric domain. In this study, we explore the relationship between the egocentric and exocentric views for visual motion transfer. In other words, we seek to learn a transformation from motion features in exocentric space to those in egocentric space, and vice versa. To do so, we collect a dataset of egocentric and exocentric videos captured simultaneously with body-mounted egocentric and static exocentric cameras, showing people performing diverse actions that cover a broad spectrum of motions. We then divide each pair of videos into time-synchronized short clips of 16 frames and extract motion features from each clip in each view, as illustrated in Figure 1. This provides a set of feature pairs from the two views. We then train different linear and non-linear mappings to learn a transformation between the two views. To test our models, we evaluate how well they retrieve the correct match from the other view. The test set contains video pairs, and therefore feature pairs, extracted from simultaneously recorded egocentric and exocentric videos. We map each feature from the first view (source view) to the other view (target view) and evaluate how well it finds its correct paired video/feature in the target set, i.e., we try to retrieve its corresponding video clip in the test set. We evaluate and analyze the performance of different mapping methods in different scenarios.
2 Related Work
First person vision, also known as egocentric vision, has become increasingly popular in the vision community. A lot of research has been conducted in the past few years [14, 5], including object detection, activity recognition [10, 9], and video summarization.
Motion in egocentric vision, in particular, has been studied as one of the fundamental features of first person video analysis. Costante et al. explore the use of convolutional neural networks (CNNs) to learn the best visual features and predict the camera motion in egocentric videos. Su and Grauman propose a learning-based approach to detect user engagement by using long-term egomotion cues. Jayaraman et al. learn the feature mapping from pixels in a video frame to a space that is equivariant to various motion classes. Ma et al. have proposed a twin-stream network architecture to analyze the appearance information and the motion information from egocentric videos and have used these features to recognize egocentric activities.
Action and activity recognition in egocentric videos have been hot topics in the community. Ogaki et al. jointly used eye motion and ego motion to compute a sequence of global optical flow from egocentric videos. Poleg et al. proposed a compact 3D convolutional neural network (3D CNN) architecture for long-term activity recognition in egocentric videos and extended it to egocentric video segmentation. Singh et al. used CNNs for end-to-end learning and classification of actions using hand pose, head motion, and saliency maps. Li et al. used gaze information, in addition to these features, to perform action recognition. Matsuo et al. proposed an attention-based approach for activity recognition by detecting visually salient objects.
The relationship between egocentric and top-view information has been explored in tasks such as human identification [2, 1], semantic segmentation, and temporal correspondence. In this work, we relate two different views of a motion, which can be considered a knowledge transfer or domain adaptation task. Knowledge transfer has been used for multi-view action recognition (e.g., [13, 17, 15]), in which multiple exocentric views of an action are related to each other. Having multiple exocentric views allows geometrical and visual reasoning, since a) the nature of the data is the same in different views and b) the actor is visible in all cameras. In contrast, our paper aims to automatically learn mappings between two drastically different views, egocentric and exocentric. To the best of our knowledge, this is the first attempt at relating these two domains for transferring motion information.
Our main goal is to transform motion features from a source view to a target view, where one is egocentric and the other is exocentric. In other words, we try to learn mapping models from egocentric to exocentric space, and vice versa. As shown in figure 1, the training set consists of pairs of datapoints, one egocentric and one exocentric feature extracted from the same time-synchronized clip, for which we try to learn a transformation that maps one view to the other; i.e., we estimate a mapping function such that the mapped egocentric feature approximates its paired exocentric feature, or vice versa. The datapoints are spatiotemporal features such as C3D (3D convolutional neural network based spatiotemporal features) and HOOF features, which capture a histogram of oriented optical flow. We then train linear and non-linear mapping models using these pairs. We evaluate the learned models on a test set, in terms of their capability to retrieve the ground-truth paired feature from the other view. As shown in figure 2, for each ego/exocentric video, we extract its motion feature descriptor, map it to the other view, and rank the videos in that view. The rank of the correct match is the metric we use to evaluate different models over different scenarios.
3.1 Extracting Motion Features
We represent each short video clip in each view using its motion feature. Two motion features with different levels of complexity are employed: the simple histogram of oriented optical flow (HOOF), and a more complex spatiotemporal feature computed with a 3D convolutional neural network (C3D), proposed by Tran et al. We study how well these features can be mapped between egocentric videos and exocentric videos (top-view and side-view), in both directions, using different mapping methods.
C3D Features: These 4096D feature descriptors are computed using a 3D convolutional neural network. In order to reduce the computational complexity of training our models, we reduce the dimensionality to 128D using Principal Component Analysis (PCA).
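The dimensionality reduction step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `fit_pca` and `apply_pca` and the random stand-in descriptors are hypothetical, and only the 4096D input and 128D output sizes come from the text.

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA on row-wise feature vectors; return (mean, components)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features onto the retained principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
c3d_train = rng.normal(size=(300, 4096))  # stand-in for real C3D descriptors
mean, components = fit_pca(c3d_train, n_components=128)
c3d_reduced = apply_pca(c3d_train, mean, components)  # shape (300, 128)
```

In a setup like this, one would typically fit the projection on training descriptors only and reuse the same `mean` and `components` for validation and test descriptors; whether the paper does so is not stated.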
HOOF Features: The histogram of oriented optical flow is extracted with 32 orientation bins, resulting in a 32D feature vector representing each clip.
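To make the 32D descriptor concrete, the sketch below builds a magnitude-weighted histogram of flow directions from a precomputed optical flow field. It is a simplified illustration: the function name `orientation_histogram`, the input layout, and the uniform binning over the full circle are assumptions, and the HOOF descriptor of Chaudhry et al. may bin and normalize differently.

```python
import numpy as np

def orientation_histogram(flow, n_bins=32):
    """Magnitude-weighted histogram of flow directions, normalized to sum to 1.

    flow: array of shape (..., 2) holding (dx, dy) per pixel.
    """
    dx = flow[..., 0].ravel()
    dy = flow[..., 1].ravel()
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx)  # in [-pi, pi]
    # Map each angle to a bin index in 0..n_bins-1.
    bins = np.floor((angle + np.pi) / (2 * np.pi) * n_bins).astype(int)
    bins = np.clip(bins, 0, n_bins - 1)  # angle == pi goes to the last bin
    hist = np.bincount(bins, weights=magnitude, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Hypothetical clip: 15 flow fields (between 16 frames), 8x8 pixels, all moving right.
flow = np.zeros((15, 8, 8, 2))
flow[..., 0] = 1.0
descriptor = orientation_histogram(flow)
```

Because every flow vector in the toy clip points in the same direction, all of the mass falls into a single bin.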
3.2 Mapping Models:
We train three linear baseline models and two non-linear models to evaluate the possibility of learning a mapping between egocentric and exocentric motion features. In what follows, we explain each mapping scheme along with its implementation details.
3.2.1 Linear Models:
We train three linear models as baselines: uniform transformation (direct matching), linear regression, and linear regression with L2 regularization.
Direct Matching: One might ask how the feature descriptors would perform if they were compared directly, i.e., if we simply retrieved the exocentric videos by comparing their features with the egocentric query feature. To determine whether more complicated mappings are necessary, we evaluate the performance of direct matching between the two spaces. Direct matching assumes a uniform transformation across the two spaces. Given that the source and target domains are very different, direct matching is not expected to outperform chance significantly. Our experiments confirm this expectation: direct matching always achieves near-random performance.
Linear Regression: We train a linear regression model from the source domain to the target domain. Linear regression fits a linear map from source features to target features and can be computed in closed form by least-squares optimization.
Our experiments indicate that linear regression consistently outperforms chance and direct matching by a large margin, but it suffers from the limits of linear models. This consistent edge over direct matching suggests that better mappings are possible.
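A least-squares mapping of this kind can be sketched in a few lines of NumPy. The helper names `fit_linear_map` and `apply_linear_map`, the bias column, and the synthetic features are assumptions made for illustration; the text does not specify whether a bias term is used.

```python
import numpy as np

def fit_linear_map(source, target):
    """Least-squares W (with a bias row) such that [source, 1] @ W ~ target."""
    ones = np.ones((source.shape[0], 1))
    design = np.hstack([source, ones])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights

def apply_linear_map(source, weights):
    """Map source-view features into the target view."""
    ones = np.ones((source.shape[0], 1))
    return np.hstack([source, ones]) @ weights

# Synthetic stand-ins for paired 32D features from the two views.
rng = np.random.default_rng(0)
ego = rng.normal(size=(200, 32))
true_map = rng.normal(size=(32, 32))
exo = ego @ true_map + 0.5
weights = fit_linear_map(ego, exo)
mapped = apply_linear_map(ego, weights)
```

On this noise-free toy data the fitted map reproduces the target features; on real paired features the residual would of course be non-zero.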
Regularized Linear Regression: Regularization has been shown to improve regression models by preventing them from overfitting to the data. We trained L2-regularized linear regression models and evaluated their accuracy as well. In our experiments, L2 regularization does not improve the accuracy considerably compared to plain linear regression in most scenarios.
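For reference, L2-regularized (ridge) regression also has a closed-form solution, W = (XᵀX + λI)⁻¹XᵀY. The sketch below assumes centered features and omits the bias term; the function name `fit_ridge_map` and the values of the regularization strength are illustrative and not taken from the paper.

```python
import numpy as np

def fit_ridge_map(source, target, l2=1.0):
    """Closed-form ridge solution W = (X^T X + l2 * I)^{-1} X^T Y."""
    n_features = source.shape[1]
    gram = source.T @ source + l2 * np.eye(n_features)
    return np.linalg.solve(gram, source.T @ target)

# Synthetic stand-ins for paired 32D features from the two views.
rng = np.random.default_rng(0)
ego = rng.normal(size=(200, 32))
exo = ego @ rng.normal(size=(32, 32)) + 0.1 * rng.normal(size=(200, 32))

w_plain = fit_ridge_map(ego, exo, l2=0.0)    # reduces to ordinary least squares
w_ridge = fit_ridge_map(ego, exo, l2=100.0)  # coefficients are shrunk towards zero
```

Setting `l2=0.0` recovers the unregularized solution, which makes it easy to compare both baselines with the same code path.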
3.2.2 Non-linear Models:
We test the performance of two generic neural network based architectures in order to explore the possibility of improving our linear baselines. Details are described in the following.
Non-linear Mapping with a Reconstruction Objective: The architecture is shown in figure 3. It was designed to reconstruct the target (ego/exocentric) features from the source (exo/egocentric) features. We used the least-squares loss and the Adam optimizer for training. For HOOF features, the dimensionalities of the fully connected layers are 32, 64, 128, 64, and 32. For C3D features, the dimensionalities are 128, 256, 256, 128, and 128. Our experiments show that this simple architecture outperforms the linear models by a large margin in most cases. The numbers of training and testing examples can be found in table 1. First, we train the model using a batch size of 100 for 60 epochs. We then find the epoch at which the validation loss is minimal (i.e., the optimal number of epochs). Finally, we add the validation set to the training set and train the model from scratch for the optimal number of epochs. We use the Keras platform with the Theano backend to implement and test this network.
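The layer stack and the epoch-selection rule can be sketched without a deep learning framework. The following NumPy code only shows the forward pass, the least-squares loss, and the choice of the best epoch; it is not the Keras model used in the paper. The ReLU hidden activations, the linear output layer, the weight initialization, and all function names are assumptions, since the text specifies only the layer widths, the loss, and the optimizer.

```python
import numpy as np

def init_mlp(input_dim, layer_widths, rng):
    """Create (weights, bias) pairs for a stack of fully connected layers."""
    params = []
    fan_in = input_dim
    for width in layer_widths:
        scale = np.sqrt(2.0 / fan_in)
        params.append((rng.normal(scale=scale, size=(fan_in, width)), np.zeros(width)))
        fan_in = width
    return params

def forward(params, x):
    """ReLU on hidden layers and a linear output layer (both assumed)."""
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)
    weights, bias = params[-1]
    return x @ weights + bias

def mse(prediction, target):
    """Least-squares reconstruction loss."""
    return float(np.mean((prediction - target) ** 2))

def best_epoch(validation_losses):
    """1-based index of the epoch with the lowest validation loss."""
    return int(np.argmin(validation_losses)) + 1

rng = np.random.default_rng(0)
hoof_widths = [32, 64, 128, 64, 32]          # layer widths given for HOOF features
params = init_mlp(32, hoof_widths, rng)
exo_batch = rng.normal(size=(10, 32))        # stand-in source features
ego_batch = rng.normal(size=(10, 32))        # stand-in target features
predicted = forward(params, exo_batch)
loss = mse(predicted, ego_batch)
```

After training with a held-out validation set, `best_epoch` applied to the recorded validation losses gives the number of epochs for the final run on the merged training and validation data.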
Two-stream Classification Network:
Two-stream classification networks have been a popular architecture for tasks such as matching and classification. We trained a two-stream neural network that takes a pair of features as input and passes the dot product of their non-linear transformations through a sigmoid function, so that the output can be interpreted as a probability. The target output is 1 for corresponding feature pairs and 0 for non-corresponding pairs. The intuition behind this network is the assumption that there is a common space for the two views, to which both are non-linearly transformable, and in which the dot product of corresponding features can be maximized. As shown in figure 4, each stream has two dense layers with ReLU activation and batch normalization. For HOOF features, the dimensionalities of the dense layers in each stream are 64 and 128. For C3D features, the dimensionalities are 128 and 256. Training requires both positive and negative pairs. Our original training data provides the positive pairs. To generate negative pairs, we pick random non-corresponding features from the two views. Since the negative examples drastically outnumber the positive ones, we assign different sample weights to negative and positive examples in order to balance the training data.
As with the other non-linear mapping method, we first train the model using a batch size of 100 for 60 epochs. We then find the epoch at which the validation loss is minimal. Finally, we add the validation set to the training set and train the model from scratch for the optimal number of epochs. We optimize this network with the binary cross-entropy loss.
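The pair construction, class balancing, and scoring described for the two-stream network can be illustrated as follows. This is a hedged sketch in NumPy: the helper names, the number of negatives per positive, and the weighting rule (each class receives the same total weight) are assumptions, and the paper's actual sample weights are not reproduced here.

```python
import numpy as np

def sample_pairs(n_clips, negatives_per_positive, rng):
    """Return index pairs (i, j) and labels: 1 if i == j (same clip), else 0."""
    pairs = [(i, i) for i in range(n_clips)]
    labels = [1] * n_clips
    for i in range(n_clips):
        others = np.delete(np.arange(n_clips), i)
        for j in rng.choice(others, size=negatives_per_positive, replace=False):
            pairs.append((i, int(j)))
            labels.append(0)
    return np.array(pairs), np.array(labels)

def balanced_weights(labels):
    """Per-example weights so that each class carries the same total weight."""
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return np.where(labels == 1, 0.5 / n_pos, 0.5 / n_neg)

def match_probability(ego_embedding, exo_embedding):
    """Sigmoid of the dot product of the two stream outputs."""
    logits = np.sum(ego_embedding * exo_embedding, axis=-1)
    return 1.0 / (1.0 + np.exp(-logits))

def weighted_bce(probabilities, labels, weights, eps=1e-7):
    """Sample-weighted binary cross-entropy."""
    p = np.clip(probabilities, eps, 1 - eps)
    losses = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(np.sum(weights * losses))

rng = np.random.default_rng(0)
pairs, labels = sample_pairs(n_clips=20, negatives_per_positive=5, rng=rng)
weights = balanced_weights(labels)
```

In a full model, `ego_embedding` and `exo_embedding` would be the outputs of the two dense streams for the features indexed by `pairs`, and `weighted_bce` would be minimized over the stream parameters.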
3.3 Testing our Mapping Models
We extract the motion features of the query video. We then use our trained linear and non-linear models to transform the query source feature to the target domain. Finally, we compare the features extracted from the target set to the transformed query feature and rank the target videos accordingly. Ideally, the correct match of the query video should appear at the top of the ranking. For all the linear models and the non-linear mapping with reconstruction objective, this process is straightforward. For the two-stream network, we feed the query source feature paired with each of the target features and obtain a score between 0 and 1, which denotes the probability of that target feature being matched with the query feature. Finally, we sort the target features in descending order of score and evaluate the method using the rank of the ground-truth match to the query feature.
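The ranking step for the regression-style models can be sketched as below. The text does not state which distance is used to compare the transformed query with the target features, so Euclidean distance is an assumption here, as is the function name `rank_of_correct_match`.

```python
import numpy as np

def rank_of_correct_match(mapped_queries, targets):
    """For query i, return the 1-based rank of target i by Euclidean distance."""
    diffs = mapped_queries[:, None, :] - targets[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # shape (n_queries, n_targets)
    correct = np.diag(distances)[:, None]
    # Rank = 1 + number of targets strictly closer than the correct one.
    return 1 + np.sum(distances < correct, axis=1)

# Toy target set; query 0 is deliberately mapped closer to target 1 than to target 0.
targets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
mapped = targets.copy()
mapped[0] = [0.9, 0.0]
ranks = rank_of_correct_match(mapped, targets)
```

For the two-stream network the same idea applies with the distance replaced by the negated match score, so that higher-scoring targets are ranked first.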
4 Experimental Results
In this section, we describe details of our collected dataset, evaluation metric, as well as performance of mapping models.
To the best of our knowledge, there is no dataset containing simultaneously recorded egocentric and exocentric (side-view and top-view) videos with a wide range of first- and third-person motions. Therefore, we collect a dataset containing 420 video pairs. Each video pair contains one egocentric and one exocentric (side-view or top-view) video. The videos in each pair are temporally aligned, so that their temporal features correspond to each other. Some examples are shown in figure 5. Each pair was collected by asking an actor to perform a range of actions (walking, jogging, running, hand waving, hand clapping, boxing, push-ups) covering a broad range of motions in front of an exocentric camera (top or side view), while wearing an egocentric body-worn camera capturing the actor's motion from the first-person perspective. Each pair of videos is divided into pairs of temporally aligned short clips (16 frames each), and feature descriptors are extracted from each clip. The numbers of videos and features used for training and testing are listed in the following table. In order to increase the number of training and testing examples, we also add flipped versions of the egocentric and side-view videos, as well as 12 rotated versions of the top-view videos (corresponding to 30 degrees of rotation).
|Training Pairs||Validation Pairs||Testing Pairs||Total Number of Pairs|
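The clip construction and the flip augmentation described above can be sketched as follows. Non-overlapping clips, the `(frames, height, width, channels)` array layout, and the function names are assumptions; the rotation augmentation of top-view videos is omitted from this sketch.

```python
import numpy as np

CLIP_LENGTH = 16  # frames per clip, as stated in the text

def split_into_clips(video, clip_length=CLIP_LENGTH):
    """Split (frames, H, W, C) into non-overlapping clips; drop the remainder."""
    n_clips = video.shape[0] // clip_length
    trimmed = video[: n_clips * clip_length]
    return trimmed.reshape(n_clips, clip_length, *video.shape[1:])

def paired_clips(ego_video, exo_video):
    """Clip i of one view corresponds to clip i of the other (time-aligned videos)."""
    ego = split_into_clips(ego_video)
    exo = split_into_clips(exo_video)
    n = min(len(ego), len(exo))
    return ego[:n], exo[:n]

def horizontal_flip(clips):
    """Mirror every frame left to right (clips: n, T, H, W, C)."""
    return clips[:, :, :, ::-1, :]

# Toy videos: 50 and 40 frames of 4x6 RGB pixels.
ego_video = np.arange(50 * 4 * 6 * 3).reshape(50, 4, 6, 3)
exo_video = np.zeros((40, 4, 6, 3))
ego_clips, exo_clips = paired_clips(ego_video, exo_video)
flipped = horizontal_flip(ego_clips)
```

Here the shorter video yields two full clips, so both views are truncated to two aligned clips of 16 frames each.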
The test set consists of a set of paired videos, from the source and target view. For each feature in the source view, we attempt to find its correct pair in the target view. To do so, we transform its feature to the target space using our learned mapping models. For each test feature, we evaluate the matching performance for each of the linear and non-linear mapping methods in terms of ranking. We then compute the area under the curve of the cumulative matching curve to have a quantitative measure of performance. We evaluate the mappings for two different features, C3D and HOOF, and four different scenarios including 1) egocentric to top-view, 2) top-view to egocentric, 3) egocentric to side-view, and 4) side-view to egocentric.
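The cumulative matching curve and its area can be computed from the ranks of the correct matches. In the sketch below the area is normalized as the mean of the curve, which is an assumption about the normalization rather than the paper's stated definition; the function names are likewise illustrative.

```python
import numpy as np

def cumulative_matching_curve(ranks, n_targets):
    """cmc[k-1] = fraction of queries whose correct match has rank <= k."""
    ranks = np.asarray(ranks)
    return np.array([(ranks <= k).mean() for k in range(1, n_targets + 1)])

def normalized_auc(cmc):
    """Mean of the curve; 1.0 when every correct match is ranked first."""
    return float(np.mean(cmc))

# Toy example: four queries against four targets.
ranks = [1, 2, 1, 4]
cmc = cumulative_matching_curve(ranks, n_targets=4)
auc = normalized_auc(cmc)
```

Under this normalization, ranks drawn uniformly at random give an expected area of (n + 1) / (2n) for n targets, i.e., roughly 0.5 for large test sets, which serves as the chance baseline.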
4.2.1 HOOF Features
As shown in figures 6 and 7, the non-linear models drastically outperform the linear models when transforming features from egocentric to top-view and vice versa. In particular, the two-stream classification network achieves the highest accuracy. For transferring features between egocentric and side-view, however, the non-linear mapping with reconstruction objective outperforms the two-stream classification network. A summary of all the quantitative results with HOOF features can be found in table 2.
Mapping between Egocentric and Top-View:
Figure 6 shows the cumulative matching curves for mapping from egocentric to top-view and vice versa. In both scenarios, all models outperform chance, which confirms the possibility of finding a mapping between the two spaces. Non-linear methods perform more favorably compared to linear models. The two-stream classification based network achieves the highest accuracy.
Mapping between Egocentric and Side-View:
Figure 7 shows the cumulative matching curves for mapping from egocentric to side-view and from side-view to egocentric. In both scenarios, the linear mappings outperform chance but underperform the non-linear models. The non-linear mapping with reconstruction objective achieves the best performance.
|-||Random||Uniform||Regression||Regression L2||Non-linear Mapping||Two-stream|
4.2.2 C3D Features
Here, we evaluate the retrieval performance of our mapping methods using C3D features. Figures 8 and 9 show the retrieval performance for transforming features across ego to top and ego to side, respectively.
It can be observed from figure 8 that the non-linear mapping outperforms the linear mappings, while the two-stream classification model does not. On the other hand, in figure 9, where the aim is to transform features between the egocentric and side views, the non-linear mapping does not perform as well as the linear methods, and the two-stream network drastically outperforms both. In general, with C3D features the non-linear models do not consistently outperform the linear models. We believe this is because C3D features, extracted from the last fully connected layers of the 3D CNN, offer meaningful and largely independent information, which makes them more suitable for linear models. As a result, a non-linear model trained on these features will not necessarily outperform a linear one. Following this reasoning, one might expect the non-linear layers to simply learn an identity transformation. However, He et al. show that it is very hard for non-linear fully connected layers to learn an identity transformation accurately. Therefore, additional non-linearity can further reduce the accuracy of a mapping for a feature such as C3D, which is already tailored to work with a linear model. A summary of all the quantitative results with C3D features can be found in table 3.
Mapping between Egocentric and Top-View: Figure 8 shows the cumulative matching curves for mapping from egocentric to top-view and top-view to egocentric. It can be seen that in both scenarios, the non-linear mapping achieves the best performance.
Mapping between Egocentric and Side-View: Figure 9 illustrates the performance for mapping from egocentric to side-view and vice versa. Here, we find that the two-stream classification network achieves the best performance.
|-||Random||Uniform||Regression||Regression L2||Non-linear Mapping||Two-stream|
5 Conclusion and Discussion
Inspired by the mirror neuron concept, we explored the possibility of transforming motion information across two drastically different views: egocentric (or first-person) and exocentric (or third-person). We showed that it is possible to learn a transformation, linear and non-linear, across the two spaces. We observe that depending on the scenario (e.g., ego to top, side to ego) and the feature type, linear models can outperform non-linear models. The opposite can happen as well. Overall, using both HOOF and C3D features, transferring motion to and from side view leads to higher accuracy compared to top-view. This intuitively makes sense, since side views are often visually more similar and contain more information regarding an activity.
In future work, we plan to extend our approach to action recognition and visual tracking. For the former, a classifier will be trained on data from a source domain, using the proposed features, and applied to data from a target domain. For the latter, we will use the recorded egocentric video, offline or online, to better track a person in a top-view surveillance camera. We will also consider more sophisticated spatio-temporal features as well as domain adaptation and task transfer approaches.
In this work, we explored the possibility of transferring knowledge across egocentric and exocentric domains. We believe that our work can be a stepping stone for further exploring the relationship between the two domains, with possible applications in action recognition and identification.
- [1] S. Ardeshir and A. Borji. From egocentric to top-view.
- [2] S. Ardeshir and A. Borji. Ego2top: Matching viewers in egocentric and top-view videos. In European Conference on Computer Vision, pages 253–268. Springer, 2016.
- [3] S. Ardeshir and A. Borji. Egocentric meets top-view. arXiv preprint arXiv:1608.08334, 2016.
- [4] S. Ardeshir, K. Malcolm Collins-Sibley, and M. Shah. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2792–2799, 2015.
- [5] A. Betancourt, P. Morerio, C. S. Regazzoni, and M. Rauterberg. The evolution of first person vision methods: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2015.
- [6] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1932–1939. IEEE, 2009.
- [7] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters, 1(1):18–25, 2016.
- [8] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
- [9] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2011.
- [10] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In European Conference on Computer Vision (ECCV), 2012.
- [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [12] D. Jayaraman and K. Grauman. Learning image representations equivariant to ego-motion. CoRR, abs/1505.02206, 2015.
- [13] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez. Cross-view action recognition from temporal self-similarities. In European Conference on Computer Vision, pages 293–306. Springer, 2008.
- [14] T. Kanade and M. Hebert. First-person vision. Proceedings of the IEEE, 100(8), 2012.
- [15] R. Li and T. Zickler. Discriminative virtual views for cross-view action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2855–2862. IEEE, 2012.
- [16] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [17] J. Liu, M. Shah, B. Kuipers, and S. Savarese. Cross-view action recognition via view knowledge transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3209–3216. IEEE, 2011.
- [18] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
- [19] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. CoRR, abs/1605.03688, 2016.
- [20] K. Matsuo, K. Yamada, S. Ueno, and S. Naito. An attention-based activity recognition for egocentric video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.
- [21] K. Ogaki, K. M. Kitani, Y. Sugano, and Y. Sato. Coupling eye-motion and ego-motion features for first-person activity recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–7, June 2012.
- [22] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact CNN for indexing egocentric videos. CoRR, abs/1504.07469, 2015.
- [23] S. Singh, C. Arora, and C. V. Jawahar. First person action recognition using deep learned descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [24] Y. Su and K. Grauman. Detecting engagement in egocentric video. CoRR, abs/1604.00906, 2016.
- [25] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.