In recent years, accurate localization and consistent tracking in large crowds (shopping malls, urban streets, airports, and public parks), possibly combined with interaction to identify specific requests, have been in high demand, especially for assisting visually impaired people and for urban navigation that requires high-accuracy localization. However, the large storage required for a pre-recorded feature map limits the use of map-based localization in large open areas. Besides, view blocking and the lack of static features for tracking make such approaches harder to deploy in urban areas. A stable, mobile-capable approach that solves this problem with high accuracy is therefore highly desirable.
In this paper, we propose to use a mobile camera together with a static third view camera, as illustrated in Fig.1, to address this problem. We assume that the person wears a head-mounted camera which observes a narrow downward area (as is the case for a VR game headset). We aim to verify how an ego-downward camera and a third view camera can be used for verification and localization in the wild.
Note that there are some existing works on third view and ego-centric view matching for humans. All of these approaches, however, focus on using a two-stream siamese or triplet network structure [14, 27, 5] to learn to match the third view and the ego view. These models deploy recent techniques including 3D convolutional neural networks [29, 31, 23] and segmental consensus for cross-domain verification [31, 14, 27]. However, these approaches cannot generalize knowledge of pose and motion for human tracking and cross view verification. Pure visual features are thus not capable of modeling the variance of human action across views for tracking, especially since the ego-downward view can only see the wearer's own body.
Unlike the top-and-forward view and third-forward view cases, the ego-downward mounting faces the following challenges: 1) appearance verification across views does not hold, since the camera does not point out toward the scene; 2) clothing texture verification does not work, since in a large crowd people may be similarly dressed or occluded; 3) the same action with a different initial pose (in the world coordinate system) also misleads the model, since the ego-downward frames cannot tell the difference (see Fig.3). Thus, a general siamese or triplet model that correlates the two views using temporal and spatial information would fail. Moreover, graph-based solutions that rely on relative view insight are not applicable in this situation.
In this paper, we propose a novel action and motion feature based model to address these challenges. Our key insight is that the ego view can always see part of the body, and can thus help to estimate the pose variation and body motion. Our main contribution is to learn action and 3D motion features for cross view verification by taking advantage of a third view person tracker and 3D pose estimation. It can be summarized as follows:
Secondly, we propose a novel action and motion verification and tracking model for cross views in Section 3. Using this model, the ego-view pose and transformation can be aligned with the third view, which is better suited for regression.
Finally, we build a comprehensive dataset and validate our proposed method in Section 4. Experimental results demonstrate that the proposed method achieves higher accuracy and is robust to body context as well as the background.
2 Related Works
Ego-centric and Third View Joint Modeling - The problem of associating a first (mobile) view with a third (static) view was first discussed as a way to improve object detection accuracy in the third view. Later work used egocentric and third view cameras together for action recognition and showed that egocentric cameras benefit recognition. The first view and third view were then correlated directly, using a 'graph' representation for temporal and spatial matching. Another line of work localizes the person in the third view given both the third view and ego camera frames; it studies spatial-domain semi-siamese, motion-domain semi-siamese, dual-domain semi-siamese, and dual-domain semi-triplet networks. Besides correlation methods, the "Charades-Ego Dataset" was released to study daily human activities and provides a baseline for basic frame-to-frame association. These works mainly rely on context features and do not consider pose features or motion (odometry) features for verification. Our work also differs from them in that we associate a downward view with a static third view, which could help increase the robustness of tracking.
Temporal and Spatial Model for Action Learning - Temporal information was first introduced for action recognition through convolution with max-pooling, which greatly improved the learning of temporal features. A ResNet-based 3D convolutional neural network was then proposed to achieve higher accuracy with a smaller model. Spatial information is commonly used in detection and correlation through context or object information. For egocentric and third view matching, temporal and spatial information was first combined using naive concatenation, and later work proposed a convolutional approach to temporal learning. However, none of the above methods learns pose information in the temporal or spatial domain to perform association. Recent success in human pose detection enables learning actions with graph convolution in both the temporal and spatial domains.
Learning for Localization - Localization based on RGB-D images was the first widely used localization approach. The first learning approach toward end-to-end localization was proposed afterwards. To address sequence continuity constraints, a recurrent network was proposed to enable smooth localization. It was further demonstrated that multitask learning, which incorporates visual odometry prediction and global localization, can relieve the need for a huge dataset and achieve higher localization accuracy as well. More recently, almost the same multitask idea was applied to localization, with the difference that both a pose loss and a velocity loss are introduced to improve the convergence of the model. Tracking is a traditional topic in both computer vision and robotics, and learning-based approaches have later been demonstrated with real-time performance.
The proposed model is illustrated in Fig.2. It contains two sub-blocks: an action sub-model and a motion sub-model. For the action sub-model, given a third view image of a person at the initial time, we perform 3D pose estimation to initialize the pose of the ego-downward view frame. At each following time step, the third view still performs 3D pose estimation, while the ego-downward view predicts the pose variation. Thus, we obtain two pose sequences, one for the ego-downward view and one for the third view, and the two sequences should be the same. For the motion sub-model, the 3D joints of the human body provide the transformation between the ego view and the third view. At each time step, the ego model predicts the transformation between consecutive frames, so a transformation is available for all consecutive frames. Meanwhile, the third view model directly predicts the relative translation in the image domain, from which the third view translation of the sequence is obtained. The two translations should also be the same in the third view. Note that the sequence translation is represented in the third view coordinate system, which is the default world frame in this paper (as illustrated in Fig.2).
3.1 Learning Action Feature by Applying 3D Pose
Preliminary Definitions: We represent the human pose by the 3D joints of the Skinned Multi-Person Linear (SMPL) model. Unlike the original 24 joints, we use a set of 19 joints. The SMPL model factors the human body into shape (how individuals vary in height, weight, and body proportions) and pose (how the 3D surface deforms with articulation). The whole model consists of vertices that form a 3D mesh with a continuous quad structure.
The tracked person in the third view is cropped out of the original RGB image, and out of the optical flow images, using the tracker's bounding box. The cropped third view images are directly used to estimate the 3D pose joints. In this paper, we use consecutive poses to represent an action.
Learning Third View Action
We first cluster the 3D poses into action clusters. Each run of consecutive frames is then assigned its corresponding 3D action cluster label. Each third view clip is described by its channels, width, height, and number of frames. The third view pose network is built on a 3D ResNet-18 with a total of 4 blocks. The first three blocks use max-pooling in both the spatial and temporal dimensions, and there is no temporal pooling in the fourth block. We only perform 2D convolution for feature extraction. The 3D ResNet doubles the depth while the feature dimension decreases from the first block to the fourth block. The final output after average pooling is a feature vector. The 3D ResNet-18 is followed by a fully connected network that performs action prediction.
Ego-downward View Pose Variation Prediction Model One key observation is illustrated in Fig.3. Consider two initial frames with different poses and the same pose variation over the consecutive frames. Each initial pose yields its own 3D action sequence. It can be clearly concluded from Fig.3 that the two actions are different in the global (third) view, even given the same ego view action.
The input is a clip of ego-downward flow images, which captures the major part of the body motion (as illustrated in Fig.2). The ego-downward (selfie) model is a 2D ResNet-50 whose input image has the same channels, width, and height as the original model. The output of the ResNet is a feature vector. We then use an iterative fully connected network to directly estimate the shape and pose, where the regressed quantity is the variation of the iterative error.
Thus, the ego-downward pose variation model directly estimates the pose error between two consecutive frames. Given the initial 3D pose, we can thus obtain the 3D joint poses for a selfie clip.
3.2 Learning Motion for Correlation
Preliminaries In this paper, we also introduce motion information, namely translation, to leverage geometric consistency between the third view and the ego-downward view. As illustrated in Fig.4, the third view tracker generates bounding boxes for a person, and the sequence of box centers (solid green dots) describes the translation of the person in the third view image. The center directly reflects the motion of the person.
Learning Third View Translation To learn the third view motion and obtain the translation, we use a 2D network followed by two fully connected layers to predict the frame-to-frame translation. The input is the consecutive third view cropped flow images, and the target is the sequence of tracked bounding box centers. We choose flow as input because a flow image encodes the pixel motion between two frames: it relates the velocity components of the optical flow along the two image axes to the spatial and temporal derivatives of each pixel. It therefore directly reflects the motion information needed for prediction.
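For reference, the description above matches the standard optical flow (brightness constancy) constraint. A sketch in the usual textbook notation follows; the symbol names $V_x$, $V_y$, $I_x$, $I_y$, and $I_t$ are assumed here and may differ from the notation used in the original equation:

```latex
I_x V_x + I_y V_y + I_t = 0
```

Here $V_x$ and $V_y$ are the flow velocity components along the image axes, and $I_x$, $I_y$, and $I_t$ are the partial derivatives of the image intensity with respect to $x$, $y$, and time.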
For each flow frame, the motion model predicts the translation of the person in the third view image. For a run of consecutive flow frames, the model outputs the frame-to-frame translations, from which the predicted translation over the corresponding RGB frames follows.
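As an illustration of how the supervision target for the third view translation can be derived from tracker output, the following minimal sketch computes box centers, frame-to-frame translations, and their accumulated sum. The box layout `(x_min, y_min, x_max, y_max)` and all function names are assumptions made for this example, not the authors' implementation.

```python
def box_center(box):
    """Center of a box given as (x_min, y_min, x_max, y_max); layout is assumed."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def frame_to_frame_translations(boxes):
    """Translation of the box center between each pair of consecutive frames."""
    centers = [box_center(b) for b in boxes]
    return [(c1[0] - c0[0], c1[1] - c0[1])
            for c0, c1 in zip(centers, centers[1:])]


def accumulated_translation(translations):
    """Total image-plane translation over a clip."""
    return (sum(dx for dx, _ in translations),
            sum(dy for _, dy in translations))


# Hypothetical tracker output for three consecutive frames.
boxes = [(0, 0, 10, 20), (2, 1, 12, 21), (5, 1, 15, 21)]
steps = frame_to_frame_translations(boxes)
total = accumulated_translation(steps)
```

By construction, the accumulated translation equals the displacement between the first and last box centers, which is a quick sanity check for tracker output.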
Learning Ego-downward View Translation Ego-downward motion is highly related to the initial pose in the third view: the same motion (a transformation over time in the third view coordinate system) with a different initialization would be totally different (see Section 3.1). The ego-downward view coordinate system is defined by body joints as illustrated in Fig.2 (the ego-downward body coordinate system block): one axis points from the left shoulder to the right shoulder, one points outward, perpendicular to the chest, and one points downward, perpendicular to the other two axes. In this paper, we represent the transformation between frames as a translation and a rotation in 3D space.
Given the 3D human body pose, we compute the center of the person and the orientation of the person in the third view coordinate system, where the up direction is obtained from a cross product. The transformation between the ego view and the third view is then represented by this center and orientation. The ego motion model takes ego flow images as input and predicts the frame-to-frame transformation, which consists of the relative rotation and the translation between two ego-downward frames.
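A minimal sketch of deriving a body-centered frame from 3D joints with cross products is given below. The choice of joints (two shoulders and a pelvis), the use of the shoulder midpoint as center, and the function names are assumptions for illustration; the sign of the outward axis depends on the handedness of the world frame and on how left and right are labeled.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def body_frame(left_shoulder, right_shoulder, pelvis):
    """Return (center, (x_axis, y_axis, z_axis)) in world coordinates.

    x_axis: left shoulder -> right shoulder
    y_axis: perpendicular to the chest plane (via a cross product)
    z_axis: roughly downward, perpendicular to x_axis and y_axis
    """
    center = tuple((l + r) / 2.0 for l, r in zip(left_shoulder, right_shoulder))
    x_axis = normalize(sub(right_shoulder, left_shoulder))
    down_hint = sub(pelvis, center)           # approximate torso direction
    y_axis = normalize(cross(x_axis, down_hint))
    z_axis = cross(y_axis, x_axis)            # unit length, since x and y are orthonormal
    return center, (x_axis, y_axis, z_axis)


# Hypothetical upright person in a world frame with x right, y forward, z up.
center, axes = body_frame((-0.5, 0.0, 1.5), (0.5, 0.0, 1.5), (0.0, 0.0, 1.0))
```

Stacking the three axes as columns gives a rotation matrix, which together with the center forms a rigid transformation between the body frame and the third view frame.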
In this paper, we use a quaternion to represent the predicted rotation. Since the rotation difference between any two consecutive frames is small, it can be represented in error quaternion form.
Thus, in this paper the ego-downward motion model predicts the error quaternion (which has only 3 parameters) together with the relative translation.
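To make the three-parameter rotation representation concrete, here is a minimal sketch of the common small-angle error quaternion, in which a small rotation vector is mapped to a unit quaternion whose vector part is half the rotation vector. The scalar-first ordering `(w, x, y, z)`, the half-angle form, and the function names are assumptions for this example.

```python
import math


def error_quaternion(delta_theta):
    """Small-angle approximation: dq ~ [1, dtheta / 2], normalized to unit length."""
    w = 1.0
    x, y, z = (delta_theta[0] / 2.0, delta_theta[1] / 2.0, delta_theta[2] / 2.0)
    n = math.sqrt(w * w + x * x + y * y + z * z)
    return (w / n, x / n, y / n, z / n)


def quat_multiply(a, b):
    """Hamilton product of two scalar-first quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


# A 0.02 rad rotation about the x axis, applied on top of the identity rotation.
dq = error_quaternion((0.02, 0.0, 0.0))
q_next = quat_multiply((1.0, 0.0, 0.0, 0.0), dq)
```

For small angles the result is close to the exact quaternion `(cos(theta/2), sin(theta/2) * axis)`, which is why only three numbers need to be regressed per frame pair.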
3.3 Training and Regression Details
Our ETVVT model is composed of an action block and a motion block. For the action block, a siamese structure takes the third view clip and the ego-downward clip to perform action prediction. A siamese network is also used to learn the motion information for cross view matching. Each block is trained independently and then serves as a pre-trained model for the ETVVT model.
Ego-downward View Action Regression To learn action classification in the ego-downward view, the input is the 3D pose and the ego-downward flow clip. The ego-downward action model is supervised to predict the action label using a cross entropy loss, in which a binary indicator marks whether a class label is the correct one for the current observation and each class has a corresponding predicted probability.
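As a worked illustration of the cross entropy loss described above, the sketch below uses the standard multi-class form, L = -sum_c y_c log(p_c), where y_c is the binary indicator and p_c the predicted probability of class c. The softmax over raw scores and the function names are assumptions for this example rather than details confirmed by the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, label):
    """-sum_c y_c * log(p_c) with a one-hot indicator y, i.e. -log(p_label)."""
    probs = softmax(logits)
    return -math.log(probs[label])


# Two action classes with equal scores: the loss is log(2) for either label.
loss_uniform = cross_entropy([0.0, 0.0], 0)
# A confident correct prediction gives a much smaller loss than a confident wrong one.
loss_correct = cross_entropy([10.0, 0.0], 0)
loss_wrong = cross_entropy([10.0, 0.0], 1)
```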
Third View Action Regression The third view action model directly uses the cropped person sequence as input. The fully connected layers then predict the action label from the 3D features with a cross entropy loss.
Ego-downward View Transformation Regression The ego-downward motion model predicts the error quaternion and the relative translation, which together form a frame-to-frame transformation. The transformation of an ego clip is obtained from these frame-to-frame transformations, and we then warp it into the 2D third view. The ego-downward transformation is regressed with a norm-based loss.
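The chaining of frame-to-frame transformations into a clip-level transformation can be sketched as below. Each transformation is stored as a rotation matrix and a translation vector, and the chain is formed by right-multiplying the initial transformation with each step. This composition order, the data layout, and the function names are assumptions for illustration; the projection into the 2D third view is not covered here.

```python
def compose(a, b):
    """Compose rigid transforms a and b (each a pair (R, t)) so that b is applied first."""
    Ra, ta = a
    Rb, tb = b
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(Ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return R, t


def chain(initial, steps):
    """Accumulate frame-to-frame transforms on top of the initial transform."""
    total = initial
    for step in steps:
        total = compose(total, step)
    return total


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degree rotation about z

# Hypothetical initial ego-to-third-view transform and one forward step in the ego frame.
initial = (ROT_Z_90, [1, 0, 0])
steps = [(IDENTITY, [1, 0, 0])]
R_total, t_total = chain(initial, steps)
```

In this example a step along the ego x axis moves the person along the third view y axis, which illustrates why the same ego motion leads to different third view translations under different initial poses.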
Third View Transformation Regression The third view model takes the third view images as input and uses the tracker bounding box centers as the target output. It predicts the frame-to-frame translation, so the output for a third view clip is the sequence of these translations. The loss is defined on these predicted translations.
ETVVT Model Learning The intermediate-layer features of the four sub-channels are concatenated into one feature vector, which is the input of the discriminator, a two-layer fully connected network. The verification loss is the cross entropy loss for predicting the binary verification label plus the sum of the sub-model losses, where the cross entropy uses a binary indicator of whether the prediction is correct and the corresponding probability.
4.1 Dataset Collection
The dataset collection considers the following challenges: 1) clothing of the same or similar color; 2) background differences that act as context cues for verification; 3) the effect of the number of people on accuracy; 4) similar motion. All the collected data are listed in Table.1. For training and validation, we collected single person ego-downward and third view videos under different backgrounds. Each pair contains an ego-downward video and a static third view video. For all video pairs, we generate clips that contain raw images and flow images for training and testing. We highlight the verification challenge when people in the third view are dressed the same and collect extra data for this case. The testing data covers cases with multiple people in view, and synchronization is performed using a GoPro camera remote controller.
|Category||Scenario||Data|
|Single Person||Three backgrounds||A total pair of videos containing over image pairs|
|Multi-person||Two Person: No Crossing||pair of videos|
|Multi-person||Two Person: Crossing||pair of videos|
|Multi-person||Three Person: No Crossing||pair of videos|
|Multi-person||Three Person: Crossing||pair of videos|
|Multi-person||Group Crossing||pair of videos|
|Same Dressing||Two Person: No Crossing||1 pair of videos|
|Same Dressing||Group Crossing||1 pair of videos|
4.2 Implementation Details
Dataset Preparation For each pair of videos, we perform the following operations step by step: 1) parse the videos into images; 2) generate dense optical flow and store its two directional components as separate images; 3) for the third view frames, first perform person detection and tracking to obtain the bounding boxes for cropping, then perform 3D pose estimation with HMR on each cropped image to generate the 3D joints; 4) cluster the set of 3D poses of each clip with the K-means algorithm to obtain the action label of each frame. We also tried other numbers of clusters. Note that a larger number of clusters should be more accurate for verification in a more general application.
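The clustering step can be sketched as follows. The reference list points to scikit-learn's K-means; the code below is a plain-Python stand-in (Lloyd's algorithm) so that it stays self-contained. The deterministic initialization from the first `k` points, the toy two-dimensional "poses", and all names are assumptions for illustration only.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. Returns (labels, centroids).

    Initialization uses the first k points, which keeps the example
    reproducible; practical implementations use randomized seeding.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: mean of the members of each cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


# Each row stands for one flattened pose vector (toy 2D values here).
pose_vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
action_labels, centers = kmeans(pose_vectors, k=2)
```

The cluster index assigned to each pose vector then plays the role of the action label used to supervise the action models.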
Following the above procedures, we obtain: 1) raw images, flow images, and action labels for the ego-downward view; 2) raw images, flow images, bounding boxes, and action labels for each person, together with the corresponding 3D pose given by joints, for the third view (the pose is used to calculate the initial transformation for the motion model). The single person videos are split into a training set and a testing set.
Training Details We initialize each model from a ResNet pre-trained on ImageNet-ILSVRC. All the models are implemented in PyTorch, with weight decay, and trained on two Nvidia 1080 GPUs. For our network, we train each sub-model independently and then perform joint optimization for the final verification.
4.3 Results and Comparison
Baselines We first implement multiple baselines to compare performance across inputs and models. These baseline methods were proposed in peer research [14, 27, 23] and include a spatial-domain siamese network, a motion-domain siamese network, a two-stream semi-siamese network, a triplet network, and temporal-domain image and flow networks [14, 23]. We also evaluate the effect of weight sharing for the siamese network. We deploy 2D and 3D ResNets to learn spatial and temporal features.
For features, we train and test using image data and flow data in independent networks, and we also learn from both in a semi-siamese approach. Table.2 summarizes the accuracies of the above models; accuracy is the metric used to evaluate the models in this paper. The table shows that temporal models are significantly better for our tracking problem, and that flow information is more accurate. This is because our dataset requires the person to move frequently and fast, which makes verification hard with pure context features. The maximum accuracy among these methods is achieved by the 3D temporal ResNet-34 model with optical flow as input. However, the semi-temporal model does not show any improvement, which may be caused by the limited color feature data in our dataset.
The table also shows that a shared-weight siamese model is more effective on average than the non-shared models. The semi-siamese model in the spatial domain is a four-channel network that takes both flow and image as input. The triplet model is implemented as proposed in the original paper, where a non-corresponding image is used as an additional input of the model. The resulting accuracy indicates that the triplet structure achieves performance similar to the temporal flow model without requiring a huge number of parameters to train.
For the baseline implementation, we did not implement the semi-triplet model, since we consider tracking in a large crowd, where the semi-triplet model would have to perform exponentially many verifications due to its input requirements. Nevertheless, the above results lead to the following findings: 1) flow information is more important for localization; 2) a complex model may not help if it simply uses spatial and temporal information.
ETVVT Model Testing
1) Performance and Analysis We also test our proposed model on the single person dataset. The results are summarized in Table.3, where we also test the action model and motion model separately. The proposed method outperforms the best baseline. Table.3 also reports the accuracy of the independent action model and of the translation model.
2) Action VS Motion Model The results show that the action model has a higher accuracy and a higher average precision than the motion model. This is because the motion model cannot tell any difference when the person is static or moves only part of the body. We also visualize the activations of the motion model and their overlay on the image in Fig.6. It can be seen that the third view translation model attends strongly to the center of the flow, while the ego motion model attends to the outer body region for translation estimation. For the action sub-model, the activations of the third block of each model are shown in Fig.7. We observe that the action model attends to joints to perceive pose information in both RGB images and flow images.
3) Ego Odometry VS Third View Odometry We also compare the importance of ego-view translation and third view translation. We add the translation as an independent channel to the temporal semi-siamese model through a fully connected layer (see the appendix). The results show that third view translation increases the validation accuracy. This can be explained with Fig.6: our ego view has a limited view of the world, and head motion also introduces error.
|Category||Test Case||Accuracy (%)||Accuracy with Bayes Filter (%)|
|Multi-person||Two Person No Crossing||72.26||96.17|
|Multi-person||Two Person Crossing||62.18||80.76|
|Multi-person||Three Person No Crossing||72.25||92.27|
|Multi-person||Three Person Crossing||65.39||91.52|
|Same Dressing||Two Person No Crossing||72.26||96.17|
|Same Dressing||Three Person Crossing||65.39||91.52|
4) Test On Multi-person Videos We then test the proposed model on our multi-person cases, with results shown in Table.4. The ground truth is obtained from the tracker and human labeling. Table.4 reports the accuracy of ETVVT for all the test cases. For the group crossing case, the filtering fails since too much crossing happens. In the implementation, we run prediction for every detected person and decide based on the maximum score.
The ETVVT model has lower accuracy when the person wearing the ego camera crosses other pedestrians. This is because the body is only partially observable, so the 3D pose estimation fails. In this paper, we also introduce a Bayes filter with velocity prediction to filter the verification results. The filtered results are shown in Table.4 and are promising in scenarios with few people in view.
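To illustrate how filtering can stabilize per-frame verification, the sketch below runs a simple discrete Bayes filter over the detected candidates and picks the candidate with the maximum belief. It is only a sketch under stated assumptions: the transition model is a fixed "stay" probability rather than the velocity prediction mentioned in the text, per-frame scores are treated as likelihoods, and all names are hypothetical.

```python
def select_person(scores):
    """Index of the candidate with the maximum score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def bayes_filter_step(belief, scores, stay_prob=0.9):
    """One predict-update step of a discrete Bayes filter over candidates."""
    n = len(belief)
    if n == 1:
        predicted = list(belief)
    else:
        switch = (1.0 - stay_prob) / (n - 1)
        # Prediction: the identity mostly persists between frames.
        predicted = [stay_prob * b + switch * (1.0 - b) for b in belief]
    # Update: weight the prediction by the per-frame verification scores.
    posterior = [p * s for p, s in zip(predicted, scores)]
    total = sum(posterior)
    if total == 0:
        return predicted
    return [p / total for p in posterior]


def filter_sequence(score_frames, stay_prob=0.9):
    """Filtered selection for each frame, starting from a uniform belief."""
    n = len(score_frames[0])
    belief = [1.0 / n] * n
    picks = []
    for scores in score_frames:
        belief = bayes_filter_step(belief, scores, stay_prob)
        picks.append(select_person(belief))
    return picks


# Hypothetical scores for two candidates; the middle frame is a noisy outlier.
frames = [[0.8, 0.2], [0.4, 0.6], [0.7, 0.3]]
raw_picks = [select_person(s) for s in frames]
filtered_picks = filter_sequence(frames)
```

In this toy sequence the raw maximum-score rule switches identity on the noisy frame, while the filtered belief keeps the same candidate throughout.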
ETVVT Adaptivity Analysis
Our model directly transforms ego view information into the third view coordinate system, and we are the first to introduce 3D pose for this understanding. The geometry and action models help to learn the pose and motion information of the two views for cross view verification. Besides, we use short-term video clips as input, which enables on-line processing.
We also find several limitations of our model at the current stage. First, if all the people in view are static or have similar poses, our algorithm fails. Second, if all the people perform the same action and motion, it also fails (see Fig.8). As illustrated in Fig.8(a), when two people are dressed the same and perform the same motion, the model localizes the wrong person. However, most of the time people have different motions and actions (Fig.8(b)), and our model obtains the correct result.
We present an action and motion learning model for cross view localization and tracking that introduces 3D pose as the transformation for alignment. It is motivated by the observation that the ego view cannot sense absolute coordinate information in the third view. Our experimental results show that our method outperforms state-of-the-art verification models on cross view verification, even with same dressing. It delivers competitive generalization of cross view verification with semi-supervised learning for localization and tracking using action and motion cues.
-  Google maps ar. In https://insights.dice.com/2018/03/19/google-opens-its-maps-api-to-augmented-reality-development/.
-  F. Ababsa and M. Mallem. Robust camera pose tracking for augmented reality using particle filtering framework. Machine Vision and applications, 22(1):181–195, 2011.
-  A. Alahi, M. Bierlaire, and M. Kunt. Object detection and matching with mobile cameras collaborating with fixed cameras. In Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications-M2SFA2 2008, 2008.
-  S. Ardeshir and A. Borji. Ego2top: Matching viewers in egocentric and top-view videos. In European Conference on Computer Vision, pages 253–268. Springer, 2016.
-  S. Ardeshir and A. Borji. Integrating egocentric videos in top-view surveillance videos: Joint identification and temporal alignment. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018.
-  L. Armesto, J. Tornero, and M. Vincze. Fast ego-motion estimation with multi-rate fusion of inertial and vision. The International Journal of Robotics Research, 26(6):577–589, 2007.
-  B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Sklearn k-means. volume 5, pages 622–633. VLDB Endowment, 2012.
-  S. Baker and I. Matthews. Opencv dense optical flow. volume 56, pages 221–255. Springer, 2004.
-  A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, 2016.
-  S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz. Geometry-aware learning of maps for camera localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  G. Bresson, Z. Alsayed, L. Yu, and S. Glaser. Simultaneous localization and mapping: A survey of current trends in autonomous driving. IEEE Transactions on Intelligent Vehicles, 20:1–1, 2017.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
-  C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and M. S. Ryoo. Identifying first-person camera wearers in third-person videos. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4734–4742. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.
-  J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3668–3678, 2015.
-  A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
-  M. Klingensmith, I. Dryanovski, S. Srinivasa, and J. Xiao. Chisel: Real time large scale 3d reconstruction onboard a mobile device using spatially hashed signed distance fields.
-  M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
-  J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
-  G. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari. Actor and observer joint modeling of first and third-person videos. In CVPR-IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  B. Soran, A. Farhadi, and L. Shapiro. Action recognition in the presence of one egocentric and multiple static cameras. In Asian Conference on Computer Vision, pages 178–193. Springer, 2014.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
-  A. Valada, N. Radwan, and W. Burgard. Deep auxiliary learning for visual localization and odometry. arXiv preprint arXiv:1803.03642, 2018.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
-  J. Watada, Z. Musa, L. C. Jain, and J. Fulcher. Human tracking: A state-of-art survey. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 454–463. Springer, 2010.
-  J. Xiao, S. L. Joseph, X. Zhang, B. Li, X. Li, and J. Zhang. An assistive navigation framework for the visually impaired. IEEE transactions on human-machine systems, 45(5):635–640, 2015.
-  W. Xu, A. Chatterjee, M. Zollhoefer, H. Rhodin, P. Fua, H.-P. Seidel, and C. Theobalt. Mo2cap2: Real-time mobile 3d motion capture with a cap-mounted fisheye camera. arXiv preprint arXiv:1803.05959, 2018.
-  S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455, 2018.
6 Ego Odometry VS Third View Odometry
In this section, we provide the network architectures that supplement the comparison described in Section 4.3.
Ego View Odometry Network Architecture
The ego-view translation model has been discussed in Section 3.2. An ego view translation model needs two inputs: 1) the initial transformation in the third view as described in Section 3; 2) the consecutive flow frames. The output is the sequence of translations, already transformed into the third view coordinate system and concatenated together.
To compare the importance of the odometry information, we further introduce the ego-translation only model which is illustrated in Fig.9.
Third View Odometry Network Architecture The third-view-odometry-only model is illustrated in Fig.10. For the third view translation prediction model, the input is the consecutive flow frames. It directly outputs the translation of the person center in the third view image.
7 Prepare Training and Testing Dataset
We also provide the detailed procedures to generate the dataset. The general procedures are:
Parse the videos into images.
Generate dense optical flow and store its two directional components as separate images.
7.1 Action Models
The action models must learn to recognize the action from both views and then perform verification. The dataset is prepared as follows:
We concatenate every run of consecutive 3D poses into a vector. Then, we cluster these vectors with K-means. The K-means index is the action label of the last frame of each run of frames.
From all the labeled images, we randomly split the data into a training set and a testing set.
For ego view data, we bundle the corresponding initial third view image and the consecutive ego view flow images. For third view data, we bundle the consecutive RGB images.
7.2 Odometry Models
Odometry models use the third view bounding box center translation as output, as described in Section 3.2. At the initial independent training stage, we proceed as follows:
Calculate the third view person translation at each time step.
For the ego view, we first calculate the initial pose according to Section 3.2. Then, we bundle the transformation and the ego view flow images.
For the third view, we bundle the consecutive flow images as training input.
We provide a video to demonstrate the performance of ETVVT. We show the following cases:
Model verification for localization and tracking with three people in view, two of whom have the same dressing and camera mounting.
A comparison between the filtered and the raw model predictions.
Three people in view crossing each other, two of whom have the same dressing and camera mounting.
A large group case in which: 1) only one person is moving; 2) several people are moving.