Joint Gaze-Location and Gaze-Object Detection

by Danyang Tu, et al.

This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), i.e., gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks and employ a multi-stage framework: human head crops must first be detected and then fed into a GL-D sub-network, which is followed by an additional object detector for GO-D. In contrast, we reframe gaze following detection as detecting human head locations and their gaze followings simultaneously, aiming to jointly detect human gaze locations and gaze objects in a unified, single-stage pipeline. To this end, we propose GTR, short for Gaze following detection TRansformer, which streamlines the gaze following detection pipeline by eliminating all additional components, leading to the first unified paradigm that unites GL-D and GO-D in a fully end-to-end manner. GTR enables an iterative interaction between holistic semantics and human head features through a hierarchical structure, inferring the relations between salient objects and human gaze from the global image context and achieving high accuracy. Concretely, GTR achieves a 12.1 mAP gain (25.1%) on GazeFollowing and an 18.2 mAP gain (43.3%) on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement (45.2%) on GOO-Real for GO-D. Meanwhile, unlike existing systems, which detect gaze following sequentially because they require a human head crop as input, GTR can infer the gaze followings of any number of people simultaneously, resulting in high efficiency. Specifically, GTR delivers a more than 9× improvement in FPS, and the relative gap becomes more pronounced as the number of people grows.
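The reported gains pair an absolute mAP difference with a percentage. As a rough sanity check, the sketch below assumes each percentage is the relative improvement over the previous best result (the abstract does not say so explicitly) and back-computes the baseline mAP this would imply. The function name `implied_baseline` and the dictionary layout are illustrative only; the resulting baselines are derived values, not numbers reported in the abstract.

```python
def implied_baseline(abs_gain: float, rel_gain_pct: float) -> float:
    """Return the baseline b for which abs_gain / b == rel_gain_pct / 100.

    Assumption: the percentage is a relative gain over the prior best mAP.
    """
    return abs_gain / (rel_gain_pct / 100.0)


# (absolute mAP gain, relative gain in %) as quoted in the abstract
reported = {
    "GazeFollowing (GL-D)": (12.1, 25.1),
    "VideoAttentionTarget (GL-D)": (18.2, 43.3),
    "GOO-Real (GO-D)": (19.0, 45.2),
}

# Implied prior-best mAP per benchmark, rounded to one decimal place
baselines = {
    name: round(implied_baseline(gain, pct), 1)
    for name, (gain, pct) in reported.items()
}

for name, base in baselines.items():
    gain = reported[name][0]
    print(f"{name}: implied baseline {base} mAP, implied GTR {base + gain:.1f} mAP")
```

Under this reading, the implied baselines come out at roughly 48 mAP on GazeFollowing and roughly 42 mAP on the other two benchmarks, which makes the three pairs of figures mutually consistent.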




