Human pose estimation in images and articulated pose tracking in videos are important for visual understanding tasks [34, 13, 35, 36]. The research community has witnessed significant advances from single-person [3, 11, 29, 28, 30, 23, 33] to multi-person pose estimation [25, 16, 5, 24, 6], and from pose estimation in static images [26, 13] to articulated tracking in videos [17, 15, 9, 18, 4, 19, 38, 12, 32, 31]. However, pose estimation remains challenging in complex environments because of occlusion, intense light and rare poses [2, 20, 27]. Furthermore, articulated tracking in unconstrained videos encounters additional challenges such as camera motion, blur and viewpoint variation [1, 37, 39].
Single-person pose estimation has progressed from pictorial structure models [3, 11] to deep convolutional neural networks [29, 28, 30, 23, 33]. Motivated by practical applications in video surveillance, human-computer interaction and action recognition, researchers now focus on multi-person pose estimation in unconstrained environments. Multi-person pose estimation can be categorized into bottom-up [25, 16, 5] and top-down approaches [24, 6, 13, 31], where the latter dominate the COCO benchmarks [20]. Top-down approaches can be divided into two-stages based methods and unified frameworks. Two-stages methods [24, 6, 31] first detect and crop persons from the image, then perform single-person pose estimation on the cropped person patches. A representative unified framework is Mask R-CNN [13], which extracts the human bounding box and predicts keypoints from the corresponding feature maps simultaneously.
While there has been a significant advance in pose estimation, quantization errors still exist in most modern networks. Although Google [24] proposes to simultaneously classify the heatmaps and regress the offset field, recent human pose estimation approaches [13, 6, 31] tend to directly predict the location heatmaps. Because of the quantization effect between input and output, performance inevitably deteriorates when the network output resolution is reduced. While both deconv layers and offset regression can reduce quantization errors, offset regression is more attractive for resource-restricted applications because of its efficiency. In this paper, we revisit the heatmap-offset aggregation method and propose the Offset-guided Network (OGN) for both two-stages pose estimation and the unified Mask R-CNN framework. We extend modern frameworks by adding a branch for offset prediction in parallel with the existing branch. Meanwhile, an intuitive but effective fusion is adopted to obtain the final results, and we propose a greedy box generation strategy to keep more necessary candidates. The OGN aims at improving precision for all output sizes, especially low resolutions. Our network outputs keypoint locations in continuous space, which reduces the quantization error.
In experiments, the offset-guided two-stages pose estimation approach reaches an mAP of 74.0 on the COCO test-dev set, yielding a 14% relative gain compared with . On the PoseTrack dataset, we achieve 67.7 MOTA using two-stages pose input without optical flow, which is a new state-of-the-art result on this task.
The main contributions can be described as follows:
(1) The heatmap-offset aggregation method is revisited, and we propose the OGN for both two-stages pose estimation and Mask R-CNN. An intuitive but effective fusion strategy is adopted to obtain the final results by merging the two branches.
(2) As a novel alternative to NMS, a greedy box generation strategy is adopted to keep more necessary candidates for offset-guided two-stages pose estimation.
(3) In experiments, the offset-guided two-stages pose estimation approach reaches an mAP of 74.0 on the COCO test-dev set with a single model, yielding a 14% relative gain compared to . Furthermore, we achieve 67.7 MOTA on the PoseTrack dataset without optical flow, which is a new state-of-the-art result on this task.
2 Related Works
2.1 Single person pose estimation
Single person pose estimation predicts the pose of a single person in an image. Conventional methods [3, 11] exploit the pictorial structure model, which expresses the human body as a tree-structured graphical model. Andriluka et al. [3] claim that the right selection of components for both appearance and spatial modeling is crucial. The Deformable Part Model (DPM) [11] adopts HOG features to implement this idea. Recently, this task has advanced rapidly with the development of deep convolutional neural networks. DeepPose [29] is the first to utilize a CNN and directly regresses the coordinates of body parts. More recent work regresses heat maps, each of which stands for a body part. Tompson et al. [28] is the first work to solve the problem by using a CNN and graphical models to predict heat maps of each body part. Building on the continuous work of many researchers, novel architectures such as CPM [30], Stacked Hourglass [23] and PRMs [33] achieve state-of-the-art results.
2.2 Multi-person pose estimation
Motivated by practical applications, researchers now focus on multi-person pose estimation in unconstrained environments. Multi-person pose estimation can be categorized into bottom-up and top-down approaches, where the latter dominate the COCO benchmarks [20].
Representative bottom-up approaches include DeepCut [25] and DeeperCut [16]. The former adopts an integer linear programming based method, and the latter improves DeepCut by utilizing image-conditioned pairwise terms.
Cao et al. [5] predict heatmaps of body parts and a set of 2D vector fields of part affinities, and parse them by greedy inference to generate the final results.
Top-down approaches can be divided into two-stages based methods and unified frameworks. Two-stages methods [26, 24, 6, 31] first detect and crop persons from an image, then perform single person pose estimation on the cropped person patches. Pishchulin et al. [26] follow this two-step framework using a method based on pictorial structure models. Papandreou et al. [24] combine classification and regression tasks, which respectively predict the location heatmap and the offset vector of each body part. Chen et al. [6] propose a cascaded pyramid network containing a global pyramid network and a pyramid refined network, which aims at online hard keypoint mining. A representative unified framework is Mask R-CNN [13], which builds an end-to-end framework and yields impressive performance.
3 Overview of offset-guided network
For pose estimation, the precision of keypoint localization is limited by the size of the network output, because the downsampling process introduces a quantization error. OGN is designed to address this problem. We verify the effectiveness of OGN for two-stages pose estimation (shown in Figure 1) and for the extended Mask R-CNN framework (Figure 2). Following [24], the regions of interest (ROIs) detected and cropped by the person detector are fed to the pose estimator, where the offset regression branch guides the heatmap classification branch to refine the pose location. Differently, two deconv layers are used to enlarge the heatmaps by four times, and an intuitive but effective fusion is adopted to obtain the final results. Meanwhile, in the extended Mask R-CNN shown in Figure 2, the ROIs from the RPN are first extended to a fixed ratio, and then ROI-Align is used to extract the features in each extended ROI. Finally, a score map and an offset map are predicted and fused to obtain the final keypoint locations.
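To make the quantization effect concrete, the following minimal sketch snaps a keypoint to an output grid of a given stride and reports what is lost. It is an illustration only and not part of the paper: the function name `quantization_error` and the floor-based mapping between input pixels and heatmap cells are assumptions.

```python
import math


def quantization_error(x, y, stride):
    """Residual (in input pixels) after snapping a keypoint to a grid of the given stride.

    Assumed convention: cell index = floor(coord / stride), decoded as index * stride.
    """
    gx, gy = math.floor(x / stride), math.floor(y / stride)  # integer heatmap cell
    dx, dy = x - gx * stride, y - gy * stride                # residual an offset branch could regress
    return (gx, gy), (dx, dy), math.hypot(dx, dy)


if __name__ == "__main__":
    # The same keypoint decoded from coarser and coarser outputs.
    for s in (4, 8, 16, 32):
        cell, offset, err = quantization_error(101.0, 57.0, s)
        print(f"stride={s:2d} cell={cell} offset={offset} error={err:.2f}px")
```

Under this convention the residual grows with the stride, which is why a heatmap-only decoder loses precision at low output resolution while a regressed offset can recover it.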
3.1 Offset-guided two-stages Pose Estimation
We first address the OGN for the two-stages pose estimation framework. For the first stage, the results of the person detector are crucial for the subsequent pose estimator. However, a box with a lower score may have a higher IoU with the ground truth and yet be eliminated by the subsequent NMS [22] process. Therefore, a Greedy Box Generation (GBG) strategy is proposed to retain more necessary candidates. For the second stage, two branches are used to obtain the final results. The offset regression branch guides the heatmap classification branch to approach the ground truth. Meanwhile, the heatmap classification branch guides the offset regression branch to focus on the neighborhood of the ground truth.
3.1.1 Greedy Box Generation Strategy
We adopt Mask R-CNN [13] as the person detector, which achieves 51.7 AP for 80-category detection on COCO val2017. Different from most other approaches, we propose a greedy box generation (GBG) strategy as a novel alternative to Non-Maximum Suppression (NMS) [22]. It prefers to retain redundant boxes, which helps us obtain better poses selected by OKS+IoU NMS after pose estimation. Specifically, no filtering strategies, including NMS, are used in either the RPN or the R-CNN phase. As a result, person detection outputs thousands of boxes as candidates. The sequential selection of candidates is as follows. Firstly, based on the task limitation, we filter out the boxes whose size is smaller than the minimum threshold. Then, the boxes whose confidence score is larger than 0.8 are picked out. We argue that those boxes are reliable and call them equivalent ground truth (EGT). After that, the other predicted boxes that have a with all EGT are eliminated. Finally, all the remaining boxes are divided into groups where every box has a with each other, and top of each group (we use ) are preserved. By adopting the GBG strategy, we tend to keep boxes whose scores are relatively small but whose localization is more accurate.
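The selection steps above can be sketched as follows. This is a rough illustration under stated assumptions rather than the authors' implementation: only the 0.8 score threshold comes from the text, while `min_size`, `egt_iou`, `group_iou` and `top_k` are hypothetical placeholders for values not recoverable here. The sketch also assumes that "size" means the shorter box side, that a box is eliminated when its IoU with any EGT box exceeds `egt_iou`, and that groups are built greedily in descending score order.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_box_generation(boxes, scores, min_size=32.0, egt_score=0.8,
                          egt_iou=0.6, group_iou=0.6, top_k=2):
    """Return a list of (box, score); all defaults except egt_score are hypothetical."""
    # Step 1: drop boxes whose shorter side is below the minimum size (assumed meaning of "size").
    cands = [(b, s) for b, s in zip(boxes, scores)
             if min(b[2] - b[0], b[3] - b[1]) >= min_size]
    # Step 2: confident boxes become the "equivalent ground truth" (EGT).
    egt = [(b, s) for b, s in cands if s > egt_score]
    rest = [(b, s) for b, s in cands if s <= egt_score]
    # Step 3 (assumed direction): drop boxes that heavily overlap any EGT box.
    rest = [(b, s) for b, s in rest if all(iou(b, e) <= egt_iou for e, _ in egt)]
    # Step 4: greedily form groups whose members all overlap each other, best score first.
    rest.sort(key=lambda bs: bs[1], reverse=True)
    groups = []
    for b, s in rest:
        for group in groups:
            if all(iou(b, member) > group_iou for member, _ in group):
                group.append((b, s))
                break
        else:
            groups.append([(b, s)])
    # Keep the top_k highest-scoring boxes of every group.
    kept = [item for group in groups for item in group[:top_k]]
    return egt + kept
```

Unlike NMS, which keeps a single box per cluster, this sketch keeps up to `top_k` boxes per group, so a lower-scoring but better-localized box can still reach the pose estimator.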
3.1.2 Offset-guided Network
In this work, we utilize ResNet [14] as the backbone of the offset-guided network. Our offset-guided architecture addresses two main problems. Firstly, in order to preserve more local details, deconv layers are appended for higher resolution. In our practice, two deconv layers are used to enlarge the feature maps by four times. Secondly, following [24], we adopt an approach combining classification and regression branches to obtain the final pose results, which helps to reduce quantization errors. We denote the number of keypoints by K. A convolution layer of K channels is adopted to output the coarse location, and a convolution layer of 2K channels regresses the offset for a fine position. For each predicted position and each GT key point , the target label for the classification head is:
The target label for the x-axis of the offset is:
The same holds for the y-axis. The classification head considers the whole heatmap, while the offset loss is only computed within a disk of fixed radius around each keypoint. Our insight is that these two heads can revise each other. The regression head helps to revise the coarse location of keypoints. The classification head helps to exclude the invalid regions, so the regression head can focus on learning offsets within a small range. Besides, this heatmap-offset aggregation method outputs results in continuous space, which reduces the quantization errors. As shown in Figure 1, the OGN can be split into three stages. In experiments, the OGN dramatically improves the performance over a large range of output resolutions, especially low resolutions.
Inspired by , to make the pose estimator adapt to the boxes generated by our person detector, we mix the predicted boxes and ground truth boxes. With this strategy, our pose estimator can adapt to the variance of the box location distribution and perform better at test time. Once the ROIs are provided, the cropped areas of the original image are sent to a single pose estimator. In our practice, ResNet is used to extract features and some deconv layers are added to pursue higher resolution. Smooth L1 is used as the loss function for both classification and regression. In addition, we employ a Gaussian filter to make the output heatmaps smoother. The final results are obtained by merging the two branches using an intuitive but effective fusion method.
For the classification branch, each key point is predicted by one heatmap at the final output width and height. The other branch generates 2K offset maps: every pair of them stands for the x-axis and y-axis offsets at the corresponding positions of one keypoint heatmap, and each offset map has the same size as that heatmap. Firstly, we find the position of the maximum score in each heatmap and mark it as the coarse localization.
Then, the offsets at that position are read from the two corresponding offset maps. Finally, the output is the coarse location refined by these offsets.
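A minimal sketch of this fusion step is given below. It is not the authors' code: the function name `decode_keypoints`, the nested-list layout (one H x W map per keypoint), the convention that offsets are expressed in heatmap cells and added to the coarse location, and the final multiplication by a `stride` to return to input pixels are all assumptions made for illustration.

```python
def decode_keypoints(heatmaps, offsets_x, offsets_y, stride=1.0):
    """Fuse K heatmaps with K x-offset and K y-offset maps.

    Each argument is a list of K maps, and every map is a list of H rows with W values.
    Returns one (x, y, score) tuple per keypoint, in continuous coordinates.
    """
    keypoints = []
    for hm, ox, oy in zip(heatmaps, offsets_x, offsets_y):
        # Coarse localization: the cell with the maximum classification score.
        cells = [(row, col) for row in range(len(hm)) for col in range(len(hm[0]))]
        cy, cx = max(cells, key=lambda c: hm[c[0]][c[1]])
        # Fine localization: add the regressed offsets read at the coarse cell (assumed sign).
        fine_x = (cx + ox[cy][cx]) * stride
        fine_y = (cy + oy[cy][cx]) * stride
        keypoints.append((fine_x, fine_y, hm[cy][cx]))
    return keypoints
```

Selecting a single maximum keeps decoding to one lookup per keypoint, in contrast to a voting scheme that accumulates evidence from many positions.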
Compared to [24], our method emphasizes simplicity and effectiveness. [24] adopts a logistic loss for the classification head and a Huber robust loss for the regression head, while only the Smooth L1 loss is used for both of them in this paper. The totally different loss types in [24] introduce a hyper-parameter to keep the losses balanced. In contrast, we simply add the losses of the two branches together. For the fusion process, [24] adopts a Hough voting strategy, while we directly select the maximum prediction. What is more, our network still converges well without any intermediate supervision, while [24] adds an extra heatmap to contribute an auxiliary loss. For these reasons, OGN can be easily transferred to other frameworks such as Mask R-CNN [13]. Our approach is not only simple and intuitive, but also effective. In Section 4.1, comprehensive experiments are conducted to verify its effectiveness.
3.2 Offset-guided Mask R-CNN
Besides the above two-stages pose estimation, the effectiveness of the OGN is also evaluated on Mask R-CNN, which is an end-to-end framework producing detection and pose estimation results simultaneously. Mask R-CNN [13] models the location of a keypoint as a one-hot mask and produces masks for each keypoint based on the features from ROI-Align. However, ROI-Align outputs feature maps distorted to different degrees if the ratios of the ROIs differ, which increases the training difficulty of the subsequent prediction head. In addition, the resulting one-hot map may be less accurate because of the small resolution of the feature map.
Therefore, this paper proposes two techniques to improve the human pose estimation performance of Mask R-CNN. The first is to transform all human ROIs into a fixed ratio by extending the width or height of the ROIs, which ensures that the ROIs fed into the prediction head follow the same ratio distribution and improves generalization. This strategy is denoted as ratio-consistent in the following sections. The second is that the human pose is predicted with a score map and an offset map. Specifically, the prediction with the maximum response in the score map represents the coarse location, and the offset map further refines it to a finer location. Here the score map is the same as the score map in the single person pose estimation model described in Section 3.1, and we also use the Smooth L1 loss as the optimization target. The extended Mask R-CNN framework is illustrated in Figure 2.
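The ratio-consistent step can be illustrated with a short sketch. It assumes that the box is grown about its centre and never shrunk, and it uses a 3:4 width-to-height target as a default only because that ratio is quoted for the two-stages model in Section 4.1; the ratio actually used for Mask R-CNN and the helper name `extend_to_ratio` are assumptions.

```python
def extend_to_ratio(box, ratio_w=3.0, ratio_h=4.0):
    """Extend a box (x1, y1, x2, y2) about its centre so that width:height = ratio_w:ratio_h.

    Only one side is enlarged; the box is never shrunk.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    target = ratio_w / ratio_h
    if w / h < target:
        w = h * target  # box is too narrow: extend the width
    else:
        h = w / target  # box is too wide: extend the height
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

For example, a 30 x 80 box becomes 60 x 80 and a 100 x 100 box becomes roughly 100 x 133, so every ROI reaches ROI-Align with the same aspect ratio and is resampled without anisotropic distortion.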
|Method||Network||Input Size||Deconv||Feature Stride||Offset||GBG|
4 Experiments
In this section, the performance of the proposed offset-guided network is evaluated on the COCO and PoseTrack datasets.
4.1 Results on COCO dataset
Experiments are first conducted on the COCO [20] benchmark, which requires both person detection and body part localization in uncontrolled conditions. The COCO dataset contains more than 200k images and 250k person instances, split into train, validation and test sets. The ablation study is conducted on the validation set. To compare with other methods, we provide final results on both test-dev and test-challenge2018. Qualitative results on the COCO dataset are shown in Figure 4.
Our offset-guided two-stages model is pre-trained on the ImageNet classification dataset. For data augmentation, random flip, rotation and scale (0.9 to 1.2) on the original image are adopted. Considering the peculiarity of the multi-person pose estimation task, we use an ROI based sampling strategy to improve the model’s generalization ability. Eight TITAN X GPUs and a batch size of 64 are used. For every iteration, we randomly choose two images for each GPU and four ROIs for each image. The whole training process contains 22 epochs. The learning rate is 0.02 and drops twice, at the 17th epoch and the 21st epoch, with a decay of 0.1. The SGD optimizer is used.
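The training schedule described above can be written down as a small sketch. The numbers are taken from the text; the helper name `learning_rate`, the 1-based epoch counting, and the assumption that each drop takes effect from the start of the named epoch are illustrative choices.

```python
GPUS, IMAGES_PER_GPU, ROIS_PER_IMAGE = 8, 2, 4
BATCH_SIZE = GPUS * IMAGES_PER_GPU * ROIS_PER_IMAGE  # ROIs per iteration


def learning_rate(epoch, base_lr=0.02, milestones=(17, 21), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone epoch (epochs are 1-based)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    print("batch size:", BATCH_SIZE)
    for epoch in (1, 16, 17, 20, 21, 22):
        print(epoch, learning_rate(epoch))
```

The product of GPUs, images per GPU and ROIs per image reproduces the stated batch size of 64, which suggests that the batch is counted in ROIs rather than in images.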
Testing is conducted on COCO val2017, test-dev and test-challenge2018. Following our GBG strategy, all ROIs generated from detected boxes are adjusted to a fixed ratio of 3:4. For post-processing, a Gaussian filter is first used to smooth the heatmaps. Then, following , we use the product of the box score and the pose score as the final score for the sorting mechanism. Finally, NMS [22] based on OKS and IoU is employed.
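The rescoring rule and the OKS similarity used by the pose-level NMS can be sketched as follows. The OKS expression follows the standard COCO keypoint evaluation definition; the simplification that every keypoint is labeled, the function names, and passing the per-keypoint constants in as the `sigmas` argument are assumptions of this sketch. The NMS thresholds are not given here and are therefore left out.

```python
import math


def final_score(box_score, pose_score):
    """Ranking score of a pose: product of the detection score and the pose score."""
    return box_score * pose_score


def oks(pose_a, pose_b, area, sigmas):
    """Object keypoint similarity between two poses, assuming every keypoint is labeled.

    pose_a, pose_b: lists of (x, y); area: object scale squared; sigmas: per-keypoint constants.
    """
    total = 0.0
    for (xa, ya), (xb, yb), sigma in zip(pose_a, pose_b, sigmas):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2.0 * area * (2.0 * sigma) ** 2))
    return total / len(sigmas)
```

In a pose NMS loop, candidates would be visited in descending `final_score` order and a candidate suppressed when its OKS with an already kept pose is too high.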
4.1.1 Ablation study
The ablation study is conducted on the COCO val2017 set. Offset, GBG, resolution and network depth are considered, as shown in the COCO val2017 ablation table.
Offset. Networks with low-resolution output are important for resource-restricted applications because of their efficiency. From methods (a, d), it can be seen that low-resolution output inevitably deteriorates the performance if the offset is not considered. Performance improves by 15.7 AP when the offset is considered. When , our OGN method (c) improves the MSRA baseline method (b) by 1.5 AP. As shown in the Mask R-CNN comparison table on COCO test-dev, our offset-guided architecture also improves Mask R-CNN by 2.1 AP.
GBG. From the comparison of methods (g, h), the AP improves by 0.4 with our GBG strategy.
Resolution. Resolution is affected by input size and network stride. Comparing methods (e, f, g) with each other, one can find that a larger network input produces better results within a certain range. As the input size grows, the AP increases by 0.3 and 1.0 respectively. Deconv layers reduce the network stride, as shown in methods (f, i) and (g, j). Similarly with , AP increases by 0.9 from method f to i. When adopting a larger input size, our final AP increases by 1.3 from method g to j.
Network depth. Comparison of methods (j, k, l) shows that performance benefits from a deeper network. AP increases by 0.8 from ResNet-50 to ResNet-101 and by 1.0 from ResNet-50 to ResNet-152.
4.1.2 Comparison with state-of-the-art results
The proposed OGN method participated in both the COCO Keypoints 2017 and 2018 challenges. In 2017, our single model achieved 71.3 AP, and our final result reached 72.8 AP on the test-dev set when the state of the art was 73.0. In 2018, as shown in the COCO test-dev comparison table, our single model, without additional training data or ensembling, achieves new state-of-the-art performance on the COCO test-dev set with 74.0 AP, which yields a 14% relative gain compared to . Compared with the previous state of the art , our approach improves the result by 0.2 AP. Our result with 100k additional images and an ensemble of ResNet [14], ResNext and Xception [7] achieves 75.9 AP. On the test-challenge set, our result of 74.1 AP ranked 3rd on the COCO leaderboard at submission time.
|Extended Mask R-CNN||65.2||88.2||70.9||61.8||71.7||69.8|
4.2 Results on PoseTrack dataset
Experiments on pose estimation and tracking on the PoseTrack dataset are also conducted, including an ablation study on the offset-guided Mask R-CNN. A tracking strategy similar to  is adopted, except that appearance information is also taken into account. Specifically, we use a metric that integrates the spatial cue and the appearance cue. IoU measures the spatial similarity, and a human re-identification model extracts the appearance features of the targets, whose similarity is measured by the Euclidean distance.
4.2.1 Ablation study
We evaluate the proposed ratio-consistent strategy and the offset-guided Mask R-CNN on the PoseTrack val2017 dataset. The results are listed in the Mask R-CNN ablation table for PoseTrack val2017. We conduct experiments on three aspects: offset, ratio-consistent and the type of loss to optimize.
Offset. Comparing method a with , the performance improvement is 3.7 mAP, which demonstrates the effectiveness of OGN.
Ratio-consistent. Similar to single person pose estimation, the ROIs in this model are extended to a fixed ratio. This brings another 1.6 mAP improvement compared to method a.
Loss. We regress the score maps of keypoint locations and the offset maps with the Smooth L1 loss. With this technique, 66.7 mAP is obtained. Our final results on val2017 outperform  by 0.8 mAP.
|Method||Loss Type||Ratio Consistent||Offset-guided Refinement||mAP Total|
4.2.2 Comparison with state-of-the-art results
As shown in the PoseTrack performance table, without optical flow, our MOTA improves over the existing best method  by 2.3 on PoseTrack val2017. If optical flow is adopted, the MOTA improvement is 4.7. Meanwhile, our approach obtains state-of-the-art performance on the test2017 set. Qualitative results on the PoseTrack dataset are shown in Figure 3.
|Method||Dataset||Total mAP||Total MOTA|
|Ours (Mask R-CNN)||validation||66.7||60.7|
|Ours (two stages)||validation||75.1||67.7|
|Ours (two stages with optical flow)||validation||76.7||70.1|
|Ours (two stages with optical flow)||test||74.8||61.6|
5 Conclusion
In this paper, we revisit the heatmap-offset aggregation method for pose estimation and propose the Offset-guided Network (OGN) for both two-stages approaches and Mask R-CNN. The OGN is designed to reduce errors caused by the quantization effect between network input and output. A novel alternative to NMS for the two-stages network, named GBG, is proposed. For the offset-guided Mask R-CNN, the ratio-consistent strategy is adopted to improve the model’s generalization ability. State-of-the-art results are achieved on both the COCO and PoseTrack datasets.
References
- [1] M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele. Posetrack: A benchmark for human pose estimation and tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
- [2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
- [3] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1014–1021. IEEE, 2009.
- [4] S. Bai, Z. He, T.-B. Xu, Z. Zhu, Y. Dong, and H. Bai. Multi-hierarchical independent correlation filters for visual tracking. arXiv preprint arXiv:1811.10302, 2018.
- [5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [6] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1800–1807, 2016.
- [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [9] A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi person tracking. In British Machine Vision Conference, 2018.
- [10] H. Fang, S. Xie, Y.-W. Tai, and C. Lu. Rmpe: Regional multi-person pose estimation. In IEEE International Conference on Computer Vision, 2017.
- [11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
- [12] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 350–359, 2018.
- [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2980–2988. IEEE, 2017.
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [15] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. Arttrack: Articulated multi-person tracking in the wild. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017.
- [16] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
- [17] U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [18] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojíř, G. Bhat, A. Lukežič, A. Eldesokey, et al. The sixth visual object tracking vot2018 challenge results. In European Conference on Computer Vision, pages 3–53. Springer, Cham, 2018.
- [19] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
- [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [21] J. Liu, G. Wang, L. Y. Duan, K. Abdiyeva, and A. C. Kot. Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Transactions on Image Processing, PP(99):1–1, 2017.
- [22] A. Neubeck and L. Van Gool. Efficient non-maximum suppression. In International Conference on Pattern Recognition, volume 3, pages 850–855. IEEE, 2006.
- [23] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
- [24] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, volume 3, page 6, 2017.
- [25] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4929–4937, 2016.
- [26] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3178–3185. IEEE, 2012.
- [27] M. R. Ronchi and P. Perona. Benchmarking and error diagnosis in multi-instance pose estimation. In IEEE International Conference on Computer Vision, pages 369–378. IEEE, 2017.
- [28] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
- [29] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
- [30] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
- [31] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, 2018.
- [32] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose flow: Efficient online pose tracking. In British Machine Vision Conference, 2018.
- [33] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In IEEE International Conference on Computer Vision, volume 2, 2017.
- [34] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
- [35] J. Zhu, Z. Zhu, and W. Zou. End-to-end video-level representation learning for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 645–650. IEEE, 2018.
- [36] J. Zhu, W. Zou, and Z. Zhu. Two-stream gated fusion convnets for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 597–602. IEEE, 2018.
- [37] Z. Zhu, G. Huang, W. Zou, D. Du, and C. Huang. Uct: Learning unified convolutional networks for real-time visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1973–1982, 2017.
- [38] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 101–117, 2018.
- [39] Z. Zhu, W. Wu, W. Zou, and J. Yan. End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.