Human re-identification has been studied extensively in the computer vision community. Due to changes in parameters such as lighting conditions, viewpoint, and occlusion, it is known to be a very challenging problem. However, in almost all attempts to solve human re-identification, the nature of the data is almost the same between the two cameras: both cameras usually capture humans from an oblique or side view. This allows reasoning about the geometry of the targets and about spatial correspondences among the pedestrians’ bounding boxes, such as the expected locations of the head, torso, and legs, which in turn supports better appearance-based reasoning. In contrast, in this paper we aim to perform human re-identification across two drastically different viewpoints. Our objective is to design a framework capable of addressing two tasks. Our first task is to identify the camera holder in the content of the top-view video. Our second task is to re-identify people visible in the egocentric camera in the surveillance top-view video, assuming the camera holder’s identity is given. The first task is challenging since we do not have any information regarding the visual appearance of the camera holder. Our only input for solving this problem is the content of the egocentric video. In this work, we focus solely on the content of the video in terms of the visual appearance and geometric position of the detected human bounding boxes.
On the one hand, egocentric vision has attracted a lot of interest during the past few years due to the abundance of affordable wearable cameras such as GoPro cameras and smart glasses. On the other hand, top- and oblique-view videos have been very useful during the past decade due to the increasing affordability and capability of surveillance cameras, UAVs, and drones. These two sources of information provide drastically different viewpoints, opening the door to a lot of exciting research aimed at relating these two sources of knowledge.
Since top, side, and front views of a human share very little appearance information, due to severe viewpoint changes and occlusion of the lower half of the human body, a direct comparison across these views will result in poor performance. Further, any sort of geometric reasoning requires some knowledge about the relative spatial position of the egocentric camera with respect to the top-view video. In other words, we need to know the identity and spatial location of the camera holder. This means that identifying the camera holder and identifying the people in the content of the egocentric camera are related problems, and each of them can be solved using the solution of the other. This motivates a joint formulation that seeks a labeling for both the camera holder and the people visible in the egocentric camera.
In this paper, we refer to identifying the camera holder as self-identification, and to re-identifying the people visible in the egocentric video simply as re-identification. We evaluate the cost of assigning each top-view identity to the camera holder by fixing the self-identification label and measuring the cost of the best possible re-identification labeling. Intuitively, if the self-identification is correct, the re-identification cost will be low, as the content of the camera is consistent with the expected content of the viewer in the top-view video. Assuming the self-identification identity is known, we estimate the initial re-identification labels based on rough geometric reasoning between the egocentric and top-view videos. By tracking the targets in the top-view video with a multiple object tracker, we obtain an estimate of the camera holder’s field of view. We then infer an initial labeling for the humans visible in the egocentric video, and refine it by taking the visual appearance of the human detections into account and enforcing visual consistency among bounding boxes that receive the same label. Once the re-identification labeling is finalized, its total cost is associated with the self-identification identity. Intuitively, the self-identification of an identity is evaluated based on the geometric consistency of the egocentric video with the identity’s expected field of view in the top-view video. Our experiments show that the geometric and visual cues provide complementary information and further improve our self-identification and re-identification results.
To the best of our knowledge, the only previous work tackling the self-identification of egocentric viewers in top-view videos is . This approach requires a set of egocentric videos and heavily relies on pairwise relationships among egocentric cameras to reason about the assignment. Having only one egocentric video, the problem would be very difficult. In our experiments, we compare our results to  as a baseline. In addition,  does not address our second task, which is re-identifying people visible in the egocentric video.
The rest of this paper is organized as follows. We selectively review the related work in Section 2. Our framework is described in Section 3. Experiments and results are explained in Section 4. Finally, Section 5 concludes the paper.
2 Related Work
In this section, we review related works in the areas of human re-identification and egocentric vision.
2.1 Person Re-identification
Person re-identification has been studied heavily during the past few years in the computer vision community [19, 10, 11]. The objective here is to find and identify people across multiple cameras. In other words, given a person present in one static camera, which person is it in another overlapping or non-overlapping static camera? The main cue in human re-identification is visual appearance of humans, which is absent in egocentric videos.
Tasks such as human identification and localization in egocentric cameras have been studied in the past.  uses the head motion of an egocentric viewer as a biometric signature to determine which videos have been captured by the same person. In , authors identify egocentric observers in other egocentric videos using their head motion. The work of  localizes the field of view of an egocentric camera by matching it against a reference dataset of videos or images, such as Google street view images. Landmarks and map symbols have been used in  to perform self localization on a map. The study reported in  addresses the problem of person re-identification in a surveillance network of wearable devices, and  performs re-identification on time-synchronized wearable cameras. The relationship between egocentric and top-view information has been explored in tasks such as human identification [6, 5], semantic segmentation and temporal correspondence.  also seeks an automated method for learning a transformation between motion features across egocentric and non-egocentric domains.
One of the popular approaches in recent years is using deep learning for person re-identification [39, 28, 2, 18, 37, 38]. Yi et al. use a “siamese” deep neural network to perform re-identification. The method proposed by Ahmed et al. uses improved deep neural network architectures to determine whether input pairs of images match. Cheng et al. use a multi-channel CNN model in a metric-learning-based approach. Chen et al. use spatial constraints for similarity learning in person re-identification and combine local and global similarities. Cho et al. use a multi-pose model to perform re-identification. Matsukawa et al. use a region descriptor based on a hierarchical Gaussian distribution of pixel features for person re-identification.
2.2 Egocentric Vision
Visual analysis of egocentric videos has recently become a hot research topic in computer vision [26, 12], from recognizing daily activities [24, 23] to object detection , video summarization , and predicting gaze behavior [29, 34, 14]. Some studies have addressed relationships between moving and static cameras. Interesting works reported in [3, 4] have explored the relationship between mobile and static cameras for the purpose of improving object detection accuracy.  fuses information from egocentric and exocentric vision (third-person static cameras in the environment) with laser range data to improve depth perception and 3D reconstruction. Park et al. predict gaze behavior in social scenes using first-person and third-person cameras. Soran et al. have addressed action recognition in the presence of an egocentric video and multiple static videos.
The block diagram of our proposed method can be seen in figure 1. Given an egocentric video and a top-view video containing identities, we run human detection on the egocentric video  which will provide us a set of bounding boxes . Re-identification is defined as labeling the human detection bounding boxes by assigning them to the viewers in the top-view video labeled with where is the number of people visible in the top-view video. We evaluate the re-identification labeling cost, assuming each of the viewers in the top-view video to be the camera-holder. We then rank the viewers in the top-view video based on their likelihood of being the cameraman. We evaluate the performance of that ranking and compute the effects of different parts of our formulation. We also evaluate the human re-identification labeling accuracy, assuming the correct egocentric ID is given. Assuming the self identification label for the egocentric video is , we seek the best set of re-identification labels .
Our framework for computing contains two main steps. First, using , we compute a set of initial labels based solely on the geometric configuration of targets in the top-view video. We then penalize the assignment of different labels to visually similar bounding boxes. We model our objective using a graph , where each node is a human detection bounding box, as shown in figure 2. Each node eventually receives a label by being matched to one of the top-view identities (). Each edge captures the cost of assigning the same label to the nodes on its two ends. The details of each of the two steps are explained in the following sections.
3.1 Geometric Reasoning from Top-view
As mentioned before, we independently evaluate each identity present in the top-view video as a candidate for the egocentric camera holder. We then compare their recommended labeling costs to perform self-identification. Having identities present in the top-view, we compute labeling costs , which capture how visually and geometrically consistent their recommended labeling is for the re-identification task. To perform geometric reasoning from the top view, similar to , we perform multiple object tracking on the provided top-view bounding boxes. Knowing the direction of motion of each trajectory at each moment, we employ the same assumptions as in and estimate the head direction of each top-view viewer by assuming that viewers tend to look straight ahead most of the time. Also, since we do not have access to the intrinsic parameters of the egocentric camera, such as focal length and sensor size, we assume a fixed angle and thereby estimate the field of view of each viewer, as illustrated in figure 3. As a result, we can determine which identities are expected to be visible in the field of view of each viewer. Thus, given the self-identity (), we can acquire a set of suggested re-identification labels from the top-view video.
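The field-of-view test described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in our experiments: the function names, the 2D point representation of top-view positions, and the 90-degree default angle are all assumptions made for the example.

```python
import math


def heading_from_track(prev_xy, next_xy):
    """Unit direction of motion between two consecutive track positions.

    Returns None when the target did not move, since the heading is then
    undefined under the 'looking straight ahead' assumption.
    """
    dx, dy = next_xy[0] - prev_xy[0], next_xy[1] - prev_xy[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return None
    return (dx / norm, dy / norm)


def visible_identities(viewer_id, positions, heading, fov_deg=90.0):
    """Ids of top-view targets lying inside the viewer's field-of-view cone.

    positions: dict mapping identity -> (x, y) top-view location.
    heading:   unit vector giving the viewer's assumed gaze direction.
    fov_deg:   full opening angle of the cone (hypothetical default).
    """
    half_angle = math.radians(fov_deg) / 2.0
    vx, vy = positions[viewer_id]
    visible = []
    for pid, (px, py) in positions.items():
        if pid == viewer_id:
            continue
        rx, ry = px - vx, py - vy
        dist = math.hypot(rx, ry)
        if dist == 0:
            continue
        # Angle between the gaze direction and the ray towards the target.
        cos_angle = (rx * heading[0] + ry * heading[1]) / dist
        cos_angle = max(-1.0, min(1.0, cos_angle))
        if math.acos(cos_angle) <= half_angle:
            visible.append(pid)
    return visible
```

Running this for every candidate camera holder yields, per candidate, the set of identities that the top-view video suggests should appear in the egocentric frame.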
As shown in figure 3, using the relative location and orientation of the visible top-view bounding boxes, we can estimate its spatial location in axis in the egocentric video content as relative to the center of the frame. In the previous term, is the spatial distance between the top-view bounding box to the orientation ray, and is the spatial distance between the orientation ray and the border of the field of view cone as depicted in figure 3. Also is the width of the egocentric video and therefore will encode the axis distance of the projection of the top-view bounding box in the content of the egocentric video relative to the center of the frame. Estimating a projection for each individual top-view visible bounding box, we will have a set of image -axis coordinates , where is the number of visible top-view bounding boxes(2 in the example shown).
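One way to read the projection described above is sketched below. The symbols were lost from the text, so the formula here is an assumption: the horizontal offset from the frame center is taken to be the ratio of the target's lateral distance from the orientation ray to the half-width of the cone at the same depth, scaled by half of the frame width. The sign convention (positive to the viewer's right in a y-up top-view frame) and all names are likewise illustrative.

```python
import math


def project_to_ego_x(viewer_xy, heading, target_xy, frame_width, fov_deg=90.0):
    """Approximate horizontal offset (pixels from frame center) of a top-view target.

    Returns None if the target is behind the viewer or outside the cone.
    """
    rx = target_xy[0] - viewer_xy[0]
    ry = target_xy[1] - viewer_xy[1]
    # Distance travelled along the orientation ray to reach the target's depth.
    depth = rx * heading[0] + ry * heading[1]
    if depth <= 0:
        return None
    # Signed lateral distance from the orientation ray (positive = right).
    lateral = heading[1] * rx - heading[0] * ry
    # Distance from the orientation ray to the cone border at this depth.
    cone_half_width = depth * math.tan(math.radians(fov_deg) / 2.0)
    if abs(lateral) > cone_half_width:
        return None
    return (lateral / cone_half_width) * (frame_width / 2.0)
```

A target on the orientation ray maps to an offset of zero (the frame center), and a target on the cone border maps to the left or right edge of the frame.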
To capture the cost of assigning each detection to a projection, we compute the distance between each projection and each human detection center , and form a matrix containing the matching probability of each projection-detection pair. The distance matrix is computed as. In order to maintain a notion of probability, we enforce that each projection should match one and only one detection and, at the same time, each detection should match one and only one projection. We therefore perform bi-stochastic normalization on matrix to ensure and . Bi-stochastic normalization can be done simply by iterating row-wise and column-wise normalization until the row and column sums converge within a tolerance.
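The iterative normalization can be sketched in a few lines of NumPy. The conversion from distances to positive affinities (a negative exponential with a hypothetical scale `sigma`) is an assumption made only so that the example is complete; the example also uses a square matrix, since row and column sums can both equal one only in that case.

```python
import numpy as np


def bistochastic_normalize(matrix, tol=1e-6, max_iter=1000):
    """Alternate row and column normalization until all sums are close to 1."""
    p = np.asarray(matrix, dtype=float).copy()
    for _ in range(max_iter):
        p /= p.sum(axis=1, keepdims=True)  # make every row sum to 1
        p /= p.sum(axis=0, keepdims=True)  # make every column sum to 1
        # Columns are exact after the last step, so only rows need checking.
        if np.abs(p.sum(axis=1) - 1.0).max() < tol:
            break
    return p


def matching_probabilities(projection_x, detection_x, sigma=100.0):
    """Soft assignment between projected top-view targets and detections.

    sigma is a hypothetical bandwidth (in pixels) turning distances into
    strictly positive affinities before normalization.
    """
    proj = np.asarray(projection_x, dtype=float)[:, None]
    det = np.asarray(detection_x, dtype=float)[None, :]
    affinity = np.exp(-np.abs(proj - det) / sigma)
    return bistochastic_normalize(affinity)
```

Because every affinity is strictly positive, the alternating normalization converges, and each row of the result can be read as a distribution over detections for one projected identity.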
We define the unary cost of a node by evaluating the cost of assigning each of the top-view labels to that node. Therefore, having people visible in the top-view video, we can represent the unary cost of a node (human detection bounding box) () to be a vector, where:
3.2 Visual Reasoning
Having human detection bounding boxes, and therefore edges carrying the cost of assigning the same label to the two nodes they connect, the cost contains the Euclidean distances computed on visual features that we extract from the bounding boxes. For computing this cost, we extract visual features including a color histogram, LBP texture , and CNN features using the VGG-19 deep network, from the human detection bounding boxes. We then use normalization, and PCA on the CNN features to reduce their dimensionality to 100. The features are concatenated and used to represent the visual information in a human detection bounding box.
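A minimal sketch of how such a pairwise visual cost could be assembled is given below. It assumes the three feature types have already been extracted into arrays with one row per bounding box; the use of L2 normalization, the order of normalization and PCA, and the function names are assumptions for illustration, and PCA is implemented directly with an SVD to keep the example self-contained.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale every row to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def pca_reduce(x, n_components):
    """Project rows of x onto their first n_components principal directions."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def pairwise_visual_cost(color_hist, lbp_hist, cnn_feat, n_components=100):
    """Euclidean distance between concatenated per-box descriptors."""
    cnn_feat = np.asarray(cnn_feat, dtype=float)
    # PCA cannot return more components than boxes or feature dimensions.
    k = min(n_components, cnn_feat.shape[0], cnn_feat.shape[1])
    feats = np.hstack([
        l2_normalize(color_hist),
        l2_normalize(lbp_hist),
        l2_normalize(pca_reduce(cnn_feat, k)),
    ])
    diff = feats[:, None, :] - feats[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))
```

Entry (i, j) of the returned matrix is the visual part of the cost on the edge between detections i and j; low values indicate boxes that look alike.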
3.3 Spatio-temporal Reasoning
We also incorporate a spatiotemporal cost in the graph edges, capturing some spatial and temporal constraints on bounding boxes. These constraints are defined as follows:
Constraint 1: Each pair of bounding boxes present in the same frame cannot belong to the same person. Therefore the binary cost between any pair of co-occurring bounding boxes is set to infinity.
Constraint 2: If two bounding boxes have a very high overlap in temporally nearby frames, their binary cost is reduced, as they probably belong to the same identity. An example is shown in figure 4.
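The two constraints can be expressed as a small adjustment applied to each edge cost. In the sketch below, the frame gap, the overlap threshold, and the discount factor are hypothetical values chosen for illustration, and overlap is measured with intersection over union, which is one common choice rather than a detail stated above.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatiotemporal_cost(det_a, det_b, visual_cost,
                        max_gap=5, iou_thresh=0.7, discount=0.5):
    """Adjust the same-label cost of an edge using the two constraints.

    det_a, det_b: (frame_index, box) pairs.
    """
    frame_a, box_a = det_a
    frame_b, box_b = det_b
    # Constraint 1: co-occurring boxes can never share a label.
    if frame_a == frame_b:
        return math.inf
    # Constraint 2: strongly overlapping boxes in nearby frames are
    # probably the same person, so sharing a label becomes cheaper.
    if abs(frame_a - frame_b) <= max_gap and iou(box_a, box_b) >= iou_thresh:
        return visual_cost * discount
    return visual_cost
```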
3.4 Handling Temporal Misalignment
As mentioned in , a perfect time-alignment between the egocentric and top-view videos is not available, and therefore our framework should be able to handle temporal misalignment between the sources. In order to cope with that, we compute the labeling cost for different relative time delays between the egocentric and top-view video and assign the lowest-cost set of labels to the human detection bounding boxes. Time delays will alter the initial labeling and the unary costs as frame in the egocentric video will correspond to frame in the top-view video. Therefore, the labels from the visible humans at frame will be propagated to the human detection bounding boxes in frame in the egocentric video. As a result, the labeling cost is a function of the time delay that we assign.
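The search over time delays amounts to evaluating the labeling cost once per candidate delay and keeping the cheapest result. The sketch below assumes a caller-supplied function `cost_for_delay` (a hypothetical name) that returns the labeling cost and labels for a given delay, and it assumes that a delay is simply added to the egocentric frame index to obtain the corresponding top-view frame.

```python
def top_view_frame(ego_frame, delay):
    """Top-view frame index assumed to correspond to an egocentric frame."""
    return ego_frame + delay


def best_time_delay(cost_for_delay, candidate_delays):
    """Return (delay, cost, labels) with the lowest labeling cost.

    cost_for_delay: callable mapping a delay to a (cost, labels) pair.
    """
    best_delay, best_cost, best_labels = None, float("inf"), None
    for delay in candidate_delays:
        cost, labels = cost_for_delay(delay)
        if cost < best_cost:
            best_delay, best_cost, best_labels = delay, cost, labels
    return best_delay, best_cost, best_labels
```

The range of candidate delays trades off robustness to misalignment against computation, since the full labeling has to be recomputed for each delay.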
In order to combine our initial labeling with the visual content of the human detection bounding boxes, we use graph cuts  to select the minimum cost labeling. The unary geometry-based term () comes from the initial labeling suggested by the top view, and the second and third terms include the binary costs for assigning different bounding boxes to different labels.
Intuitively, we initialize a labeling with different time-delays using the geometric reasoning, and then enforce visual and spatiotemporal consistency among the similarly labeled nodes by incorporating the binary costs. Our experiments show that the fusion will further improve the re-identification accuracy. At the end, we pick the configuration with the lowest labeling cost as in equation 3.
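To make the role of the unary and binary terms concrete, the sketch below evaluates the cost of a labeling and finds the cheapest one by exhaustive search. Exhaustive search is a stand-in used only because it is easy to verify on a toy graph; it is not the graph-cut optimization used in our framework and does not scale beyond a handful of nodes. The data layout (a list of unary cost vectors and a symmetric matrix of same-label costs) is an assumption for the example.

```python
import itertools
import math


def labeling_energy(labels, unary, pairwise):
    """Unary costs plus same-label costs over all pairs sharing a label."""
    energy = sum(unary[i][label] for i, label in enumerate(labels))
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                energy += pairwise[i][j]
    return energy


def brute_force_labeling(unary, pairwise):
    """Cheapest labeling by enumeration (toy-sized problems only)."""
    n_nodes, n_labels = len(unary), len(unary[0])
    best_labels, best_energy = None, math.inf
    for labels in itertools.product(range(n_labels), repeat=n_nodes):
        energy = labeling_energy(labels, unary, pairwise)
        if energy < best_energy:
            best_labels, best_energy = list(labels), energy
    return best_labels, best_energy
```

As a small example, take three detections and two top-view identities with unary costs `[[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]`, where detections 0 and 1 appear in the same frame (infinite same-label cost) and detections 1 and 2 look alike (low same-label cost). Geometry alone would give detections 0 and 1 the same label, but the co-occurrence constraint forces them apart, and visual similarity then pulls detection 1 towards the label of detection 2.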
4 Experimental Results
In this section, we explain our experimental setup and the dataset used for evaluating the two tasks of self-identification and human re-identification. We then evaluate our proposed method on the two tasks and analyze its performance.
We use the first 10 sequences of the dataset used in . The dataset contains test cases of videos shot in different indoor and outdoor environments. Each test case contains one top-view video and several egocentric videos captured by the people visible in the top-view camera. We annotated the labels for the human detection bounding boxes for each video and evaluated the accuracy for re-identification and self-identification. The first 10 sets contain 37 egocentric videos and 10 top-view videos. The number of people visible in the top-view cameras varies from 3 to 10, and the lengths of the videos vary from 1019 frames (33.9 seconds) up to 3132 frames (110 seconds).
We evaluate our framework in terms of egocentric self-identification within a top-view video, and also in terms of cross-view human re-identification.
For each egocentric video, the viewers visible in the top-view video are ranked, and self-identification performance is evaluated by computing the area under the cumulative matching curve (CMC), as illustrated in figure 7. We also compare the self-identification accuracy with that of , which uses only the results of human detection to identify the camera holder. The cumulative matching curves have jumps and non-smooth transitions because only a few people are visible in the top-view video, and therefore the normalized rank of the correct match can take only a limited number of values (e.g. for ranks 1 to 5 when 5 people are visible in the top-view video).
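The evaluation measure can be sketched as follows. The sketch assumes that the normalized rank of a video is the rank of the correct identity divided by the number of people visible in the top-view video, and it approximates the area under the curve with the trapezoidal rule on a uniform grid; both are assumptions about details not spelled out above.

```python
import numpy as np


def cmc_curve(ranks, num_candidates, grid_size=101):
    """Fraction of videos whose normalized rank is at most x, for x in [0, 1].

    ranks:          1-based rank of the true camera holder for each video.
    num_candidates: number of top-view identities for each video.
    """
    normalized = np.asarray(ranks, dtype=float) / np.asarray(num_candidates, dtype=float)
    xs = np.linspace(0.0, 1.0, grid_size)
    ys = np.array([(normalized <= x).mean() for x in xs])
    return xs, ys


def area_under_curve(xs, ys):
    """Trapezoidal approximation of the area under a sampled curve."""
    return float(np.sum((ys[1:] + ys[:-1]) / 2.0 * np.diff(xs)))
```

With 5 candidates the normalized rank can only be 0.2, 0.4, 0.6, 0.8, or 1.0, which is why the resulting curve is a step function rather than a smooth one.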
4.2.2 Cross-view Human Re-identification
Assuming the egocentric identity is known, the labeling accuracy is computed for the bounding boxes visible in the content of the egocentric video. The labeling accuracy is evaluated for the initial labeling suggested by the top-view video, and also after fusing that with the visual similarities of the bounding boxes.
In this work, we studied the problem of human re-identification and self-identification in egocentric videos by matching them to a reference top-view surveillance video. Our experiments show that both self-identification and re-identification are possible in a unified framework. If self-identification is given, re-identification can be done using some rough geometric reasoning from the top view and enforcing visual consistency.
For future work, a more general case of this problem can be explored, such as assigning multiple egocentric viewers to viewers in multiple top-view cameras. Also, other approaches can be explored for solving the introduced problem or slight variations of it (e.g., supervised methods for understanding the unary and pairwise relationships). Further, other computer vision techniques such as visual odometry can be explored for relating the two sources. We initially attempted to approach this problem using odometry; however, the results were not accurate, perhaps due to the large amount of jitter in egocentric videos. Nonetheless, this remains a potential direction for further research.
-  https://www.sighthound.com/.
-  E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  M. B. Alahi, Alexandre and M. Kunt. Object detection and matching with mobile cameras collaborating with fixed cameras. Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications-M2SFA2, 2008.
-  B. M. K. M. Alahi A, Marimon D. A master-slave approach for object detection and matching with fixed and mobile cameras. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference, 2008.
-  S. Ardeshir and A. Borji. From egocentric to top-view.
-  S. Ardeshir and A. Borji. Ego2top: Matching viewers in egocentric and top-view videos. arXiv preprint arXiv:1607.06986, 2016.
-  S. Ardeshir and A. Borji. Egocentric meets top-view. arXiv preprint arXiv:1608.08334, 2016.
-  S. Ardeshir, K. Malcolm Collins-Sibley, and M. Shah. Geo-semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2792–2799, 2015.
-  S. Ardeshir, K. Regmi, and A. Borji. Egotransfer: Transferring motion across egocentric and exocentric domains using deep neural networks. arXiv preprint arXiv:1612.05836, 2016.
-  B. F. T. M. Bak S, Corvee E. Multiple-shot human re-identification by mean riemannian covariance grid. In Advanced Video and Signal-Based Surveillance (AVSS), 8th IEEE International Conference on, 2011.
-  M. V. Bazzani L, Cristani M. Symmetry-driven accumulation of local features for human characterization and re-identification. Computer Vision and Image Understanding, 2013.
-  R. C. R. M. Betancourt A, Morerio P. The evolution of first person vision methods: A survey. Circuits and Systems for Video Technology, IEEE Transactions on, 2015.
-  I. E. Bettadapura, Vinay and C. Pantofaru. Egocentric field-of-view localization using first-person point-of-view devices. Applications of Computer Vision (WACV), IEEE Winter Conference on, 2015.
-  A. Borji, D. N. Sihite, and L. Itti. What/where to look next? modeling top-down visual attention in complex interactive environments. Systems, Man, and Cybernetics: Systems, IEEE Transactions on, 44(5):523–538, 2014.
-  Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222–1239, 2001.
-  A. Chakraborty, B. Mandal, and H. K. Galoogahi. Person re-identification using multiple first-person-views on wearable devices. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8. IEEE, 2016.
-  D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  S. M. B. L. M. V. Cheng DS, Cristani M. Custom pictorial structures for re-identification. BMVC, 2011.
-  S. M. B. L. M. V. Cheng DS, Cristani M. Head motion signatures from egocentric videos. In Computer Vision–ACCV. Springer International Publishing, 2014.
-  Y.-J. Cho and K.-J. Yoon. Improving person re-identification via pose-aware multi-shot matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  X. R. Fathi, Alireza and J. M. Rehg. Learning to recognize objects in egocentric activities. Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference On, 2011.
-  R. J. Fathi A, Farhadi A. Understanding egocentric activities. Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
-  R. J. Fathi A, Li Y. Learning to recognize daily actions using gaze. Computer Vision–ECCV, 2012.
-  L. D. C. M. F. Ferland F, Pomerleau F. Egocentric and exocentric teleoperation interface using real-time, 3d video projection. In Human-Robot Interaction (HRI), 2009 4th ACM/IEEE International Conference on, 2009.
-  T. Kanade and M. Hebert. First-person vision. Proceedings of the IEEE 100.8, 2012.
-  I. G. Kiefer, Peter and M. Raubal. Where am i? investigating map matching during self‐localization with mobile eye tracking in an urban environment. Transactions in GIS 18.5, 2014.
-  W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, June 2014.
-  Y. Li, A. Fathi, and J. Rehg. Learning to predict gaze in egocentric video. In Proceedings of the IEEE International Conference on Computer Vision, pages 3216–3223, 2013.
-  Z. Lu and K. Grauman. Story-driven summarization for egocentric video. Computer Vision and Pattern Recognition (CVPR), IEEE Conference On, 2013.
-  T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato. Hierarchical gaussian descriptor for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1):51–59, 1996.
-  E. J. Park, Hyun and Y. Sheikh. Predicting primary gaze behavior using social saliency fields. Proceedings of the IEEE International Conference on Computer Vision, 2013.
-  P. Polatsek, W. Benesova, L. Paletta, and R. Perko. Novelty-based spatiotemporal saliency detection for prediction of gaze in egocentric video.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  B. Soran, A. Farhadi, and L. Shapiro. Action recognition in the presence of one egocentric and multiple static cameras. In Asian Conference on Computer Vision, pages 178–193. Springer, 2014.
-  R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. CoRR, abs/1607.08378, 2016.
-  R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. CoRR, abs/1607.08381, 2016.
-  D. Yi, Z. Lei, and S. Z. Li. Deep metric learning for practical person re-identification. CoRR, abs/1407.4979, 2014.
-  K. M. K. Yonetani, Ryo and Y. Sato. Ego-surfing first person videos. Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
-  K. Zheng, H. Guo, X. Fan, H. Yu, and S. Wang. Identifying same persons from temporally synchronized videos taken by multiple wearable cameras.