Multiple Human Tracking using Multi-Cues including Primitive Action Features

09/18/2019
by Hitoshi Nishimura, et al.

In this paper, we propose a Multiple Human Tracking method using multi-cues including Primitive Action Features (MHT-PAF). MHT-PAF can perform accurate human tracking in dynamic aerial videos captured by a drone. PAF employs a global context, rich information from multi-label actions, and a middle-level feature. The accurate human tracking result obtained with PAF helps multi-frame-based action recognition. In the experiments, we verified the effectiveness of the proposed method using the Okutama-Action dataset. Our code is available online.


Code repository: mht-paf — Multiple Human Tracking using Multi-Cues including Primitive Action Features (MHT-PAF)

I Introduction

Multiple human tracking is a fundamental technique widely used in various fields such as robotics, surveillance, and marketing. The task of multiple human tracking is to keep detecting multiple humans while maintaining their identities (IDs) based on time-series images [1]. Most state-of-the-art tracking methods [2, 3, 4, 5, 6] follow a tracking-by-detection approach owing to recent improvements in the accuracy of human detection. The tracking-by-detection approach treats multiple human tracking as data association [4]: detection results in consecutive frames are matched using an association metric.

In aerial images captured by a drone, human tracking is not very accurate: i) significant changes in human size and aspect ratio and ii) abrupt camera movement cause false positives and ID switches. An ID switch means that the target human's ID changes to another ID. Although human tracking methods [2, 3, 4, 5, 6] utilize a human appearance feature, a position, or both as cues, false positives and ID switches still occur frequently.

In this paper, we propose a Multiple Human Tracking method using multiple cues, including Primitive Action Features (MHT-PAF). Our idea is that an action cue is effective for tracking each human because a human action does not change frequently at the frame level. Fig. 1 shows the idea of the proposed method. Unlike previous methods (Fig. 1(a)), the proposed method employs the action feature for human tracking (Fig. 1(b)). For data association, the action feature needs to be extracted for each frame.

Fig. 1: Idea of the proposed method. (a) Previous methods; (b) proposed MHT-PAF.

However, the frame-level feature works poorly because a human action occurs across multiple frames. Therefore, we designed the primitive action feature (PAF) in terms of the following three points:

  1. PAF employs a spatial context with a global cropped image in order to capture actions including interactions with humans and objects.

  2. PAF is based on multi-label actions in order to extract rich information on human action.

  3. PAF employs a middle-level feature rather than a final action recognition label, because a human action includes some ambiguity at the frame level.

Accurate human tracking using PAF can be applied to the multi-frame-based action recognition described in Fig. 1(b).

Our main contributions are as follows:

  • We propose a multiple human tracking method using multi-cues that include primitive action features (MHT-PAF).

  • We design PAF, which employs a global context, rich information from multi-labeling of actions, and a middle-level feature.

  • We verify the effectiveness of the proposed method and make the code available on the web.

II Related Work

Human Tracking: In human tracking, humans continue to be detected while maintaining their IDs. Human tracking methods are classified into online and offline methods. While online methods estimate the human ID in a serial fashion, offline methods estimate it after all data have been stored. Online: Breitenstein et al. introduced human tracking in a particle filtering framework [2]. Wojke et al. proposed DeepSORT, which performs human tracking using human features and positions [3]. Offline: Zhang et al. proposed MCF, which solves human tracking as a minimum-cost flow problem [4]. Berclaz et al. reformulated human tracking as a constrained flow optimization in a convex problem [5]. Milan et al. proposed a human tracking method solved by continuous energy minimization [6]. These previous methods are not very accurate because they utilize only human appearance features, positions, or both as cues.

Action Detection: In action detection, spatio-temporal action positions and action classes are estimated. Many action detection methods have been proposed [7, 8, 9, 10, 11, 12, 13]. Action detection is classified into three categories. (1) Spatial action detection: Gkioxari et al. introduced action tubes that operate with region proposals, CNN features, and SVMs [7]. Lin et al. proposed SSAD, which is an end-to-end neural network [8]. (2) Temporal action detection: LRCN performs temporal action detection using LSTM [9]. Shou et al. introduced a multi-stage CNN, which employs 3D CNNs for temporal action detection [10]. (3) Spatio-temporal action detection: Hou et al. proposed T-CNN, a unified deep neural network that detects actions based on 3D convolution features [11]. Kalogeiton et al. proposed the ACT detector, which is also a unified deep neural network and is based on stacking single-frame features [12]. Singh et al. presented ROAD, which performs spatio-temporal action detection online [13]. All these methods utilize action information as a cue, but they do not use a specific human appearance feature that captures human ID.

Action Recognition: For action recognition, an action class is estimated, given a spatio-temporal action position. Many action recognition methods have been proposed [14, 15, 9, 16, 17]. Simonyan et al. introduced a two-stream network using RGB and flow images [14]. The proposed method is based on a two-stream network because of its simplicity. Wang et al. proposed TSN, which divides an image into several segments in a temporal domain [16]. Donahue et al. introduced LRCN, which performs long-term action recognition using LSTM [9]. Tran et al. proposed C3D, which extracts a feature by 3D convolution [15]. Carreira et al. proposed I3D, which uses a 3D convolution, parameters of which are based on 2D convolution [17].

III Proposed Method (MHT-PAF)

Fig. 2: Pipeline of the proposed method.

In order to prevent false positives and ID switches, we introduce the primitive action feature (PAF) for human tracking. First, we explain the problem formulation (Section III-A). Fig. 2 shows the pipeline of the proposed method. Multi-cues including PAF are extracted (Section III-B). After this procedure is completed for all frames, data association is performed (Section III-C). The data association results can be applied to multi-frame-based action recognition (Section III-D).

III-A Problem Formulation

Let $\mathcal{O} = \{o_i\}$ be a set of human observations, each of which is a human detection result. The $i$-th observation is defined as $o_i = (t_i, b_i, a_i, p_i)$. $t_i$ denotes a time step. $b_i = (x_i, y_i, w_i, h_i)$ denotes the bounding box of a human, where $x_i$ and $y_i$ are the x and y coordinates of the upper-left corner of the rectangle, and $w_i$ and $h_i$ are its width and height. $a_i$ and $p_i$ denote an appearance feature and a primitive action feature, respectively. Let $T_k$ be the $k$-th human trajectory. Human tracking estimates all trajectories $\mathcal{T} = \{T_k\}$, given time-series images.
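To make the notation concrete, the observation structure above can be sketched as a small Python data class. The field names are illustrative rather than the paper's notation, and the detection score field is added here only because the data association costs below use it.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One human detection o_i at a single frame (field names are illustrative)."""
    t: int                  # time step t_i (frame index)
    box: np.ndarray         # bounding box b_i = (x, y, w, h)
    score: float            # detection confidence, used later by the observation cost
    appearance: np.ndarray  # appearance feature a_i from the Siamese network
    paf: np.ndarray         # primitive action feature p_i (RGB and FLOW, concatenated)

# A trajectory is an ordered list of observations sharing one human ID.
Trajectory = list[Observation]
```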

III-B Multi-cue Extraction

For each observation $o_i$, three types of cues are extracted: a location feature ($b_i$), an appearance feature ($a_i$), and a primitive action feature ($p_i$).

III-B1 Location Feature

Each bounding box $b_i$ is estimated by SSD [18]. The backbone model of the SSD is VGG16 [19]. For the input, we use a 4K image in order to capture human actions in detail.
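The paper trains its own SSD with a VGG16 backbone on 4K Okutama-Action frames; as a rough sketch of the detection interface only, the snippet below uses torchvision's off-the-shelf SSD300-VGG16 as a stand-in. The model, confidence threshold, and file name are assumptions, not the authors' setup.

```python
import torch
from PIL import Image
from torchvision.models.detection import ssd300_vgg16
from torchvision.transforms.functional import to_tensor

# Stand-in detector: torchvision's COCO-pretrained SSD with a VGG16 backbone.
# The paper trains its own SSD on Okutama-Action with 4K input, so this only
# illustrates the detection interface, not the authors' model.
detector = ssd300_vgg16(weights="DEFAULT").eval()

image = to_tensor(Image.open("frame_000475.jpg").convert("RGB"))  # hypothetical frame
with torch.no_grad():
    pred = detector([image])[0]      # dict with 'boxes', 'scores', 'labels'

keep = pred["scores"] > 0.5          # illustrative confidence threshold
boxes = pred["boxes"][keep]          # (x1, y1, x2, y2); convert to (x, y, w, h) as needed
scores = pred["scores"][keep]
```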

III-B2 Appearance Feature

The appearance feature is defined in a feature space where the distance between two features is small when they indicate the same human. It is extracted by a Siamese network that has two inputs and one output [20]. Each backbone model of the Siamese network is a WideResNet [21]. In the training phase, a pair of images of the same human is labeled "1", while a pair of different humans is labeled "0". In the inference phase, one of the two backbone models is used for extracting the appearance feature.
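Downstream (Section III-C), the appearance features are compared with a cosine distance. A minimal helper, assuming the embeddings come from the Siamese model above, might look as follows.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embeddings (appearance features a_i, or
    later the primitive action features p_i). Smaller values mean the two
    observations are more likely to show the same person."""
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    return float(1.0 - a @ b)
```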

III-B3 Primitive Action Feature

Fig. 3: Primitive action feature extraction model.

In this section, we describe the primitive action feature (PAF), which is the key point of the proposed method. A human region is cropped corresponding to each bounding box $b_i$. For each cropped image, a primitive action feature is extracted. Fig. 3 shows the PAF extraction model. PAF employs a global context, rich information from multi-label actions, and a middle-level feature. The network is a four-stream neural network.

The network is based on the two-stream network [14, 16], which has two modalities: spatial and temporal. While the spatial network utilizes an RGB image, the temporal network utilizes an optical flow image. For optical flow calculation, we used TV-L1 optical flow [22], which is a fast and accurate method. The optical flow is calculated in the x and y directions separately. The backbone model of each stream is ResNet101 [23].

For each modality, two types of images are input to the network: local and global cropped images. The local cropped image is obtained from the bounding box $b_i$, squared to fit its long side. The global cropped image is an expansion of the local cropped image with $b_i$ as its center. The expansion ratio is a predefined parameter. The global cropped image introduces the spatial context, such as objects and other humans.
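A minimal sketch of this local/global cropping step, assuming (x, y, w, h) boxes and a user-chosen expansion ratio; the clamping behavior at image borders is an implementation choice, not taken from the paper.

```python
import numpy as np

def square_crop(image: np.ndarray, box, ratio: float = 1.0) -> np.ndarray:
    """Crop a square patch centered on box = (x, y, w, h).

    ratio=1.0 gives the local crop (a square fitted to the long side of the
    box); ratio>1.0 expands it around the same center to obtain the global
    crop that brings in surrounding humans and objects. The crop is clamped to
    the image bounds, so patches near the border may not be exactly square.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = max(w, h) * ratio
    H, W = image.shape[:2]
    x1, y1 = int(max(0, cx - side / 2)), int(max(0, cy - side / 2))
    x2, y2 = int(min(W, cx + side / 2)), int(min(H, cy + side / 2))
    return image[y1:y2, x1:x2]

# local_img  = square_crop(frame, box)             # local crop
# global_img = square_crop(frame, box, ratio=2.0)  # global crop (ratio assumed)
```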

Since the outputs of the network are multi-label actions, the feature carries rich information on human action. The loss function is a binary cross entropy loss for each class.

The primitive action feature (RGB) is directly extracted from the layer just before the fully connected layer of the RGB network. In the same way, the primitive action feature (FLOW) is extracted from the FLOW network. The primitive action feature is obtained by concatenating the RGB and FLOW features.
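The sketch below illustrates this extraction under several assumptions: the four-stream structure is collapsed to one crop per modality for brevity, the flow stream is assumed to take a two-channel x/y flow image, and the multi-label head with per-class binary cross entropy is only indicated, not trained here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

NUM_ACTIONS = 12  # Okutama-Action classes

def make_stream(in_channels: int):
    """One stream: ResNet-101 trunk plus a multi-label head (per-class BCE)."""
    backbone = resnet101(weights=None)
    if in_channels != 3:  # the flow stream takes stacked x/y flow maps (assumed)
        backbone.conv1 = nn.Conv2d(in_channels, 64, 7, 2, 3, bias=False)
    feat_dim = backbone.fc.in_features        # 2048
    backbone.fc = nn.Identity()               # expose the penultimate-layer feature
    head = nn.Linear(feat_dim, NUM_ACTIONS)   # multi-label logits
    return backbone, head

rgb_net, rgb_head = make_stream(3)
flow_net, flow_head = make_stream(2)
criterion = nn.BCEWithLogitsLoss()            # per-class binary cross entropy (training)

def extract_paf(rgb_crop: torch.Tensor, flow_crop: torch.Tensor) -> torch.Tensor:
    """PAF = concatenation of the RGB and FLOW penultimate-layer features."""
    with torch.no_grad():
        f_rgb = rgb_net(rgb_crop)             # (N, 2048)
        f_flow = flow_net(flow_crop)          # (N, 2048)
    return torch.cat([f_rgb, f_flow], dim=1)  # (N, 4096)
```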

III-C Data Association

Data association is performed based on the multi-cues described in Section III-B. Data association is regarded as a minimum-cost flow problem [4]. We define four types of costs: an observation cost $C^{\mathrm{obs}}_{i}$, a transition cost $C^{\mathrm{tra}}_{i,j}$, an entry cost $C^{\mathrm{en}}_{i}$, and an exit cost $C^{\mathrm{ex}}_{i}$.

The first, $C^{\mathrm{obs}}_{i}$, is the observation cost of the $i$-th observation and is based on the logistic function as follows:

$$C^{\mathrm{obs}}_{i} = -\log P(s_i) + b, \tag{1}$$
$$P(s_i) = \frac{1}{1 + \exp\left(-(w_1 s_i + w_0)\right)}, \tag{2}$$

where $b$ denotes a predefined bias, $s_i$ denotes the score of the human detection, and $w_0$ and $w_1$ denote the parameters of the logistic function. In the training phase, $w_0$ and $w_1$ are estimated by the Fisher scoring algorithm.
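A hedged sketch of fitting and evaluating this observation cost: scikit-learn's logistic regression stands in for the Fisher scoring fit, and the cost form follows the reconstruction of Eqs. (1)-(2) above, so treat it as an assumption rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_observation_cost(scores: np.ndarray, is_true_positive: np.ndarray, bias: float):
    """Fit P(s) on detection scores and return a function s -> observation cost.

    scikit-learn's logistic regression solves the same maximum-likelihood
    problem that Fisher scoring targets (solver="newton-cg" is the closest
    match). The cost form -log P(s) + bias follows the reconstruction of
    Eqs. (1)-(2) above and is an assumption.
    """
    lr = LogisticRegression(solver="newton-cg")
    lr.fit(scores.reshape(-1, 1), is_true_positive)

    def observation_cost(s: float) -> float:
        p = lr.predict_proba(np.array([[s]]))[0, 1]
        return float(-np.log(p + 1e-12) + bias)

    return observation_cost
```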

$C^{\mathrm{tra}}_{i,j}$ is the transition cost between the $i$-th observation and the $j$-th observation and is based on a nonlinear function as follows:

$$C^{\mathrm{tra}}_{i,j} = g(\mathbf{v}_{i,j}), \tag{3}$$
$$\mathbf{v}_{i,j} = \left(u_{i,j},\ d^{a}_{i,j},\ d^{p}_{i,j}\right), \tag{4}$$

where $u_{i,j}$, $d^{a}_{i,j}$, and $d^{p}_{i,j}$ respectively denote an IoU (Intersection over Union) score, the cosine distance between appearance features, and the cosine distance between primitive action features. $g(\cdot)$ is represented by multiple decision trees. In the training phase, the parameters of $g(\cdot)$ are estimated by a gradient boosting algorithm [24].
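As an illustration, the three pairwise cues can be stacked into a feature vector and mapped to a cost by a gradient-boosted model; using scikit-learn's GradientBoostingClassifier and turning its match probability into a cost via a negative log is an assumption consistent with, but not necessarily identical to, the paper's formulation.

```python
import numpy as np
from scipy.spatial.distance import cosine as cosine_distance
from sklearn.ensemble import GradientBoostingClassifier

def iou(b1, b2) -> float:
    """IoU of two (x, y, w, h) boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def pair_features(obs_i, obs_j) -> np.ndarray:
    """Pairwise cues: IoU, appearance cosine distance, PAF cosine distance."""
    return np.array([
        iou(obs_i.box, obs_j.box),
        cosine_distance(obs_i.appearance, obs_j.appearance),
        cosine_distance(obs_i.paf, obs_j.paf),
    ])

# Trained on observation pairs labeled 1 (same person in consecutive frames) or 0:
# gbdt.fit(np.stack([pair_features(a, b) for a, b in pairs]), labels)
gbdt = GradientBoostingClassifier()

def transition_cost(obs_i, obs_j) -> float:
    """Map the boosted model's match probability to a cost (assumed -log form)."""
    p_same = gbdt.predict_proba(pair_features(obs_i, obs_j)[None, :])[0, 1]
    return float(-np.log(p_same + 1e-12))
```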

$C^{\mathrm{en}}_{i}$ is the entry cost of the $i$-th observation, and $C^{\mathrm{ex}}_{i}$ is the exit cost of the $i$-th observation.

Human tracking is performed by estimating a set of indicator variables as follows:

$$\mathcal{F} = \left\{ f_i,\ f_{i,j},\ f^{\mathrm{en}}_{i},\ f^{\mathrm{ex}}_{i} \right\}, \tag{5}$$

where $f_i, f_{i,j}, f^{\mathrm{en}}_{i}, f^{\mathrm{ex}}_{i} \in \{0, 1\}$ [4]. $\mathcal{F}$ is estimated by minimizing the following objective function with non-overlap constraints [4]:

$$\mathcal{F}^{*} = \operatorname*{arg\,min}_{\mathcal{F}} \sum_{i} C^{\mathrm{en}}_{i} f^{\mathrm{en}}_{i} + \sum_{i,j} C^{\mathrm{tra}}_{i,j} f_{i,j} + \sum_{i} C^{\mathrm{obs}}_{i} f_i + \sum_{i} C^{\mathrm{ex}}_{i} f^{\mathrm{ex}}_{i}. \tag{6}$$

The objective function is minimized by the scaling push-relabel method. In the online solution, the objective function is solved by the Hungarian algorithm for each frame.
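For the online variant, a per-frame Hungarian assignment over the transition costs is enough to convey the idea; the sketch below uses SciPy's linear_sum_assignment, and the gating threshold for rejecting implausible matches is an illustrative parameter, not a value from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_frame(tracks, detections, transition_cost, max_cost=10.0):
    """One Hungarian assignment between existing tracks and current detections
    (the online, per-frame variant). `tracks` are lists of Observations,
    `transition_cost` is the boosted-tree cost above, and `max_cost` is an
    illustrative gating threshold beyond which a detection starts a new track."""
    if not tracks or not detections:
        return [], list(range(len(detections)))

    cost = np.array([[transition_cost(t[-1], d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)

    matches, unmatched = [], set(range(len(detections)))
    for r, c in zip(rows, cols):
        if cost[r, c] < max_cost:   # reject implausible assignments
            matches.append((r, c))
            unmatched.discard(c)
    return matches, sorted(unmatched)
```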

TABLE I: Performance of human tracking (Recall, Precision, ID switches, fragmentations, and MOTA) for DeepSORT [3], MCF [4], MHT-PAF (late), and MHT-PAF.

TABLE II: Average Precision (AP) of multi-frame-based action detection for the twelve action classes (human-to-human interactions: handshaking, hugging; human-to-object interactions: reading, drinking, pushing/pulling, carrying, calling; no-interaction: running, walking, lying, sitting, standing) and the mean (mAP), for DeepSORT [3], MCF [4], MHT-PAF (late), and MHT-PAF.

III-D Multi-frame-based Action Recognition

Fig. 4: Multi-frame-based action recognition.

The human tracking result described in Section III-C is applied to multi-frame-based action recognition. For each trajectory $T_k$, an action class is estimated. Fig. 4 shows multi-frame-based action recognition, which is based on a sliding window. At each time step, the average action recognition score within the window is calculated. Each action recognition score is directly extracted from the last class layer of the PAF extraction model. Then, the action recognition result is estimated. The window length is a predefined parameter. When the action recognition score is lower than a predefined threshold, the action is determined to be "Unknown".
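A minimal sketch of this sliding-window averaging, assuming per-frame multi-label scores for one tracked person; the window alignment and border handling are implementation choices made here, not details from the paper.

```python
import numpy as np

def sliding_window_actions(frame_scores: np.ndarray, window: int, threshold: float):
    """Average per-frame multi-label action scores over a sliding window.

    frame_scores: (T, C) scores for one tracked person, taken from the last
    class layer of the PAF extraction model. For each frame, return the class
    indices whose windowed average reaches `threshold`, or "Unknown" if none
    does. Border handling (shorter windows at the start) is a choice made here.
    """
    results = []
    for t in range(frame_scores.shape[0]):
        avg = frame_scores[max(0, t - window + 1):t + 1].mean(axis=0)
        labels = np.flatnonzero(avg >= threshold)
        results.append(labels.tolist() if labels.size else "Unknown")
    return results
```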

IV Experiments

We conducted human tracking experiments in order to verify the effectiveness of the proposed human tracking method and its usefulness for multi-frame-based action recognition.

IV-A Dataset

We used the Okutama-Action dataset [25], which is an aerial-view dataset for concurrent human action detection. The dataset is very challenging because it includes significant changes in human size and aspect ratio, abrupt camera movement, and dynamic transitions of multi-label actions. The dataset was split into training and test videos. Two drones captured the participants from various altitudes and camera angles, and the resolution of the images is 4K. Each bounding box has one or more action labels. Twelve action labels are divided into three categories: human-to-human interactions (handshaking, hugging), human-to-object interactions (reading, drinking, pushing/pulling, carrying, calling), and no-interaction (running, walking, lying, sitting, standing). Multiple actions almost always consist of one no-interaction action and one action from the other two categories.

TABLE III: Accuracy of action recognition for the twelve action classes and their average, given the ground truth of human tracking, comparing single-frame and multi-frame recognition with local and local+global cropped images.

IV-B Experimental Setting

The human detection model (SSD) was trained using the Okutama-Action dataset, and the same human detection results were used for both the previous methods and the proposed method. The appearance feature extraction model (WideResNet) was trained using the MARS dataset [27]. The primitive action feature (PAF) extraction model was trained using the Okutama-Action dataset with dropout; random cropping and horizontal/vertical cropping were employed for data augmentation. The PAF is the concatenation of the RGB and FLOW features. The observation cost model and the transition cost model were also trained using the Okutama-Action dataset, and the data association parameters were set empirically.

IV-C Evaluation of Proposed Human Tracking

We evaluated the human tracking, i.e., the estimation of the human trajectories. For the evaluation metric, we used Multiple Object Tracking Accuracy (MOTA). MOTA is a widely used and comprehensive metric defined by the following combination:

$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{IDs} + \mathrm{FP}}{\mathrm{DET}}, \tag{7}$$

where FN, IDs, FP, and DET respectively denote the total number of false negatives, ID switches, false positives, and detections. The MOTA score ranges from $-\infty$ to $1$. More details about these metrics are described in [26]. A fixed IoU threshold between the ground truth and the estimated bounding boxes was used.
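A small helper for the MOTA value as reconstructed in Eq. (7); the example numbers are illustrative only, not results from the paper.

```python
def mota(fn: int, fp: int, id_switches: int, num_detections: int) -> float:
    """MOTA as reconstructed in Eq. (7): 1 minus the error rate over DET.
    Higher is better; the value can go negative when errors outnumber detections."""
    return 1.0 - (fn + fp + id_switches) / max(num_detections, 1)

# Illustrative numbers only (not results from the paper):
print(mota(fn=120, fp=80, id_switches=15, num_detections=5000))  # 0.957
```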

Table I shows the performance of human tracking. The MOTA of MHT-PAF is higher than that of MCF [4], which does not utilize the action feature. While the recall is kept almost the same, the precision improved, the number of ID switches decreased, and the number of fragmentations decreased.

MHT-PAF (late) employs a late fusion that concatenates the RGB and FLOW outputs of the last class layer as the action feature. The MOTA of MHT-PAF is higher than that of MHT-PAF (late). The MOTA of MHT-PAF is also higher than that of DeepSORT [3].

IV-D Application to Multi-Frame-based Action Recognition

We evaluated the multi-frame-based action recognition. For the evaluation metric, we used mean Average Precision (mAP). The mAP is used for the action detection task, which estimates both the bounding boxes and the action classes. The same fixed IoU threshold between the ground truth and the estimated bounding boxes was used. Table II shows the results of multi-frame-based action detection. In DeepSORT and MCF, PAF was not utilized. The mAP of MHT-PAF is higher than that of MCF. This is due to the improvement in the accuracy of human tracking achieved using PAF.

IV-E Discussion

We evaluated the accuracy of action recognition (estimating the action classes), given the ground truth of human tracking. The purpose was to analyze the effect of the global cropped image and of single- versus multi-frame-based action recognition. The evaluation was performed at the frame level. Table III shows the accuracy of action recognition.

Global Cropped Image: Let us compare the local cropped image to the local+global cropped image in single-frame-based action recognition. The accuracy with the local+global cropped image is higher than that with the local cropped image alone. For human-to-human and human-to-object interactions, the global cropped image is effective; these interactions need a global context, such as surrounding humans or objects, for recognition. On the other hand, for no-interaction actions, the local cropped image is effective, since no-interaction actions require only the human's own motion for recognition. The average accuracy is highest for the combination of multi-frame-based action recognition and local+global cropped images.

Single-frame-based Action Recognition: Let us examine single-frame-based action recognition. The accuracy for walking, standing, sitting, carrying, pushing/pulling, and reading is high compared to the other actions. For such actions, the mAP shows an improvement (MCF vs. MHT-PAF), as shown in Table II. In order to improve the mAP, it is important to improve the accuracy of single-frame-based action recognition.

Multi-frame-based Action Recognition: For both the local and the local+global cropped images, the accuracy of multi-frame-based recognition is higher than that of single-frame-based recognition. Therefore, multi-frame-based action recognition is effective. In order to leverage multi-frame-based action recognition more effectively, further improvement in the accuracy of human tracking is needed.

IV-F Examples of Human Tracking and Action Recognition

Fig. 5: Examples of human tracking and action recognition.

Fig. 5 shows examples of human tracking and action recognition on video 1.2.10. For each bounding box, the estimated human ID and actions are indicated. If the action recognition result is "Unknown", it is not indicated in the image. #(number) denotes a frame ID. In MCF, ID switches (IDs) occur frequently (frames 475, 487, 497, and 498) and a false positive (FP) occurs (frame 476). In MHT-PAF, these ID switches and the false positive are prevented. PAF carries rich information on human action and can avoid data association errors.

V Conclusion

In this paper, we proposed a Multiple Human Tracking method using multi-cues including Primitive Action Features (MHT-PAF). PAF employs a global context, rich information from multi-label actions, and a middle-level feature. Accurate human tracking using PAF can be applied to multi-frame-based action recognition. In the experiments, we evaluated the proposed method using the Okutama-Action dataset, which consists of aerial-view videos. We verified that the human tracking accuracy (MOTA) improved: the number of ID switches decreased and the precision improved while the recall was retained. Owing to the improvement in human tracking accuracy, the action detection accuracy (mAP) also improved. We also discussed the effect of the global cropped image and of single/multi-frame-based action recognition. In future work, we will investigate a cooperative method in which human tracking and action recognition work complementarily.

References

  • [1] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt, et al., “A system for video surveillance and monitoring,” VSAM final report, pp. 1–68, March 2000.
  • [2] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Online multiperson tracking-by-detection from a single, uncalibrated camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 33, no. 9, pp. 1820–1833, 2010.
  • [3] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proc. IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649.
  • [4] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
  • [5] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 33, no. 9, pp. 1806–1819, 2011.
  • [6] A. Milan, S. Roth, and K. Schindler, “Continuous energy minimization for multitarget tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 36, no. 1, pp. 58–72, 2013.
  • [7] G. Gkioxari and J. Malik, “Finding action tubes,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 759–768.
  • [8] T. Lin, X. Zhao, and Z. Shou, “Single shot temporal action detection,” in Proc. ACM International Conference on Multimedia (ACMMM), 2017, pp. 988–996.
  • [9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2625–2634.
  • [10] Z. Shou, D. Wang, and S.-F. Chang, “Temporal action localization in untrimmed videos via multi-stage CNNs,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1049–1058.
  • [11] R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network (T-CNN) for action detection in videos,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5822–5831.
  • [12] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, “Action tubelet detector for spatio-temporal action localization,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4405–4413.
  • [13] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin, “Online real-time multiple spatiotemporal action localisation and prediction,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3637–3646.
  • [14] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
  • [15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
  • [16] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proc. European Conference on Computer Vision (ECCV), 2016, pp. 20–36.
  • [17] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6299–6308.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. European Conference on Computer Vision (ECCV), 2016, pp. 21–37.
  • [19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [20] N. Wojke and A. Bewley, “Deep cosine metric learning for person re-identification,” in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 748–756.
  • [21] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proc. British Machine Vision Conference (BMVC), 2016, pp. 1–12.
  • [22] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime TV-L1 optical flow,” in Proc. Joint Pattern Recognition Symposium, 2007, pp. 214–223.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [24] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, pp. 1189–1232, 2001.
  • [25] M. Barekatain, M. Martí, H.-F. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger, “Okutama-action: an aerial view video dataset for concurrent human action detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017, pp. 28–35.
  • [26] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, 2008.
  • [27] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “MARS: A video benchmark for large-scale person re-identification,” in Proc. European Conference on Computer Vision (ECCV), 2016, pp. 868–884.