Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning

by   Lifeng Fan, et al.

This paper addresses a new problem of understanding human gaze communication in social videos at both the atomic level and the event level, which is significant for studying human social interactions. To tackle this novel and challenging problem, we contribute a large-scale video dataset, VACATION, which covers diverse daily social scenes and gaze communication behaviors with complete annotations of objects and human faces, human attention, and communication structures and labels at both the atomic level and the event level. Together with VACATION, we propose a spatio-temporal graph neural network to explicitly represent the diverse gaze interactions in social scenes and to infer atomic-level gaze communication by message passing. We further propose an event network with an encoder-decoder structure to predict event-level gaze communication. Our experiments demonstrate that the proposed model significantly outperforms various baselines in predicting atomic-level and event-level gaze communications.





1 Introduction

In this work, we introduce the task of understanding human gaze communication in social interactions. Evidence from psychology suggests that eyes are a cognitively special stimulus, with unique “hard-wired” pathways in the brain dedicated to their interpretation, and that humans have the unique ability to infer others’ intentions from eye gaze [15]. Gaze communication is a primitive form of human communication, whose underlying social-cognitive and social-motivational infrastructure acted as a psychological platform on which various linguistic systems could be built [59]. Though verbal communication has become the primary form in social interaction, gaze communication still plays an important role in conveying hidden mental states and augmenting verbal communication [2]. To better understand human communication, we need not only natural language processing (NLP) but also a systematic study of the human gaze communication mechanism.

The study of human gaze communication in social interaction is essential for several reasons: 1) it helps to better understand multi-agent gaze communication behaviors in realistic social scenes, especially from social and psychological views; 2) it provides evidence for robot systems to learn human behavior patterns in gaze communication and further facilitates intuitive and efficient interactions between humans and robots; 3) it enables simulation of more natural human gaze communication behaviors in Virtual Reality environments; 4) it builds up a common sense knowledge base of human gaze communication for studying human mental states in social interaction; 5) it helps to evaluate and diagnose children with autism.

Over the past decades, much research [22, 33, 26, 29] on the types and effects of social gazes has been done in the cognitive psychology and neuroscience communities. Building on previous efforts and established terminology, we distinguish six classes of atomic-level gaze communication:

  • Single refers to individual gaze behavior without any social communication intention (see panel (1) of the overview figure).

  • Mutual [2, 5] gaze occurs when two agents look into each other's eyes (see panel (2) of the overview figure), which is the strongest mode of establishing a communicative link between human agents. Mutual gaze can capture attention, initiate a conversation, maintain engagement, express feelings of trust and extroversion, and signal availability for interaction in cases like passing objects to a partner.

  • Avert [47, 21] refers to averted gaze and happens when the gaze of one agent is shifted away from another in order to avoid mutual gaze (see panel (3) of the overview figure). Averted gaze expresses distrust, introversion or fear, and can also modulate intimacy, communicate thoughtfulness or signal cognitive effort, such as looking away before responding to a question.

  • Refer [50] means referential gaze and happens when one agent tries to direct another agent’s attention to a target via gaze (see panel (4) of the overview figure). Referential gaze shows an intent to inform, share or request something. We can use refer gaze to eliminate uncertainty about a reference and respond quickly.

  • Follow [51, 64, 9] means following gaze and happens when one agent perceives the gaze of another and follows it to the stimulus the other is attending to (see panel (5) of the overview figure). The purpose of gaze following is to figure out the partner’s intention.

  • Share [43] means shared gaze and appears when two agents are gazing at the same stimulus (see panel (6) of the overview figure).

The above atomic-level gazes capture the most general, core and fine-grained gaze communication patterns in human social interactions. We further study the long-term, coarse-grained temporal compositions of the above six atomic-level gaze communication patterns, and generalize them into five gaze communication events in total, i.e., Non-communicative, Mutual Gaze, Gaze Aversion, Gaze Following and Joint Attention, as illustrated in the right part of the overview figure. Typically, the temporal order of atomic gazes marks the different phases of each event. Non-communicative (panel (a)) and Mutual Gaze (panel (b)) are one-phase events and simply consist of single and mutual, respectively. Gaze Aversion (panel (c)) starts from mutual, then moves through avert to single, demonstrating the avoidance of mutual eye contact. Gaze Following (panel (d)) is composed of follow and share, but without mutual, meaning that there is only one-way awareness and observation, with no shared attention or knowledge. Joint Attention (panel (e)) is the most advanced and appears when two agents have the same intention to share attention on a common stimulus and both know that they are sharing something as common ground. Such an event consists of several phases, typically beginning with mutual gaze to establish a communication channel, proceeding to refer gaze to draw attention to the target and follow gaze to check the referred stimulus, and cycling back to mutual gaze to ensure that the experience is shared [39]. Clearly, recognizing and understanding atomic-level gaze communication patterns is a necessary and significant first step towards comprehensively understanding human gaze behaviors.

To facilitate research on gaze communication understanding in the computer vision community, we propose a large-scale social video dataset named VACATION (Video gAze CommunicATION) with complete gaze communication annotations. With our dataset, we aim to build a spatio-temporal attention graph given a third-person social video sequence with human face and object bounding boxes, and to predict gaze communication relations for this video at both the atomic level and the event level. Clearly, this is a structured task that requires comprehensive modeling of human-human and human-scene interactions in both spatial and temporal domains.

Inspired by recent advances in graph neural networks [46, 60], we propose a novel spatio-temporal reasoning graph network for atomic-level gaze communication detection as well as an event network with an encoder-decoder structure for event-level gaze communication understanding. The reasoning model learns the relations among social entities and iteratively propagates information over a social graph. The event network utilizes the encoder-decoder structure to reduce the noise in gaze communications and learns the temporal coherence of each event to classify event-level gaze communication.

This paper makes three major contributions:

  • It proposes and addresses a new task of gaze communication learning in social interaction videos. To the best of our knowledge, this is the first work to tackle such a problem in the computer vision community.

  • It presents a large-scale video dataset, named VACATION, covering diverse social scenes with complete gaze communication annotations and benchmark results for advancing gaze communication study.

  • It proposes a spatio-temporal graph neural network and an event network to hierarchically reason both atomic- and event-level gaze communications in videos.

2 Related Work

2.1 Gaze Communication in HHI

Eye gaze is closely tied to underlying attention, intention, emotion and personality [32]. Gaze communication allows people to communicate with one another at the most basic level regardless of their familiarity with the prevailing verbal language system. Such social eye gaze functions thus transcend cultural differences, forming a universal language [11]. During conversations, eye gaze can be used to convey information, regulate social intimacy, manage turn-taking, control conversational pace, and convey social or emotional states [32]. People are also good at identifying the target of their partner’s referential gaze and use this information to predict what their partner is going to say [56, 8].

In a nutshell, gaze communication is omnipresent and multifunctional [11]. Exploring the role of gaze communication in HHI is an essential research subject, but it has rarely been touched by computer vision researchers. Current research in the computer vision community [27, 7, 63, 16, 62] mainly focuses on studying the salient properties of the natural environment to model the human visual attention mechanism. Only a few works [44, 45, 17] have studied human shared attention behaviors in social scenes.

2.2 Gaze Communication in HRI

To improve human-robot collaboration, the field of HRI strives to develop effective gaze communication for robots [2]. Researchers in robotics tried to incorporate responsive, meaningful and convincing eye gaze into HRI [1, 3], which helps the humanoid agent to engender the desired familiarity and trust, and makes HRI more intuitive and fluent. Their efforts vary widely [54, 4, 2], including human-robot visual dialogue interaction [41, 55, 36], storytelling [40], and socially assistive robotics [58]. For example, a tutoring or assistive robot can demonstrate attention to and engagement with the user by performing proper mutual and follow gazes [38], direct user attention to a target using refer gaze, and form joint attention with humans [25]. A collaborative assembly-line robot can also enable object reference and joint attention by gazes. Robots can also serve as therapy tools for children with autism.

Figure 1: Example frames and annotations of our VACATION dataset, showing that our dataset covers rich gaze communication behaviors, diverse general social scenes, different cultures, etc. It also provides rich annotations, e.g., human face and object bounding boxes, gaze communication structures and labels. Human faces and related objects are marked by boxes with the same color as the corresponding communication labels. White lines link entities with gaze relations in a temporal sequence and white arrows indicate gaze directions in the current frame. There may exist a varying number of agents, many different gaze communication types and complex communication relations in one frame, resulting in a highly challenging and structured task. See §3 for details.

2.3 Graph Neural Networks

Recently, graph neural networks [49, 35, 28, 20] have received increased interest since they inherit the complementary advantages of graphs (with strong representation ability) and neural networks (with end-to-end learning power). These models typically pass local messages on graphs to explicitly capture the relations among nodes, which are shown to be effective on a large range of structured tasks, such as graph-level classification [10, 13, 60], node-level classification [23], relational reasoning [48, 30], multi-agent communications [57, 6], human-object interactions [46, 18], and scene understanding [37, 34]. Some others [14, 42, 31, 52, 12] tried to generalize convolutional architectures to graph-structured data. Inspired by the above efforts, we build a spatio-temporal social graph to explicitly model the rich interactions in dynamic scenes. Then a spatio-temporal reasoning network is proposed to learn gaze communications by passing messages over the social graph.

3 The Proposed VACATION Dataset

VACATION contains 300 social videos with diverse gaze communication behaviors. Example frames can be found in Fig. 1. Next we will elaborate VACATION from the following essential aspects.

3.1 Data Collection

Quality and diversity are two essential factors considered in our data collection.


Category           Non-Comm.   Mutual Gaze   Gaze Aversion   Gaze Following   Joint Attention
Event-level (%)    28.16       24.00         10.00           10.64            27.20
Atomic-level (%)
  single           92.20       15.99         3.29            39.26            26.91
  mutual           0.76        75.64         14.15           0.00             16.90
  avert            1.34        6.21          81.71           0.00             1.18
  refer            0.00        0.37          0.15            0.62             7.08
  follow           1.04        0.29          0.00            10.71            2.69
  share            4.66        1.50          0.70            49.41            45.24
Table 1: Statistics of gaze communication categories in our VACATION dataset, including the distribution of event-level gaze communication categories over the full dataset and the distribution of atomic-level gaze communication within each event-level category (each column sums to 100%).

High quality. We searched YouTube for more than 50 famous TV shows and movies (e.g., The Big Bang Theory, Harry Potter, etc.). Compared with self-shot social data in laboratory or other limited environments, these stimuli provide much more natural and richer social interactions in general and representative scenes, and are closer to real human social behaviors, which helps to better understand and model real human gaze communication behaviors. After that, video clips are roughly split from the retrieved results. We further eliminate the videos with big logos or of low quality. Each of the remaining videos is then cropped with accurate shot boundaries and stored in MPEG-4 format at a uniform spatial resolution. VACATION finally comprises a total of 300 high-quality social video sequences with 96,993 frames and 3,880-second duration. The lengths of videos span from 2.2 to 74.56 seconds and are 13.28 seconds on average.

Diverse social scenes. The collected videos cover diverse daily social scenes (e.g., party, home, office, etc.), with different cultures (e.g., American, Chinese, Indian, etc.). The appearances of actors/actresses, costumes and props, and scenario settings also vary a lot, which makes our dataset more diverse and general. By training on such data, algorithms are expected to generalize better to diverse realistic social scenes.

3.2 Data Annotation and Statistics

Our dataset provides rich annotations, including human face and object bounding boxes, human attention, atomic-level and event-level gaze communication labels. The annotation takes about 1,616 hours in total, considering an average annotation time of 1 minute per frame. Three extra volunteers are included in this process.
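As a quick consistency check on the total above, assuming the stated average of 1 minute per frame applies to all 96,993 frames of the dataset:

```latex
96{,}993 \ \text{frames} \times 1 \ \tfrac{\text{min}}{\text{frame}}
  = 96{,}993 \ \text{min}
  = \tfrac{96{,}993}{60} \ \text{h}
  \approx 1{,}616.6 \ \text{h}
```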

Human face and object annotation. We first annotate each frame with bounding boxes of human face and key object, using the online video annotation platform Vatic [61]. 206,774 human face bounding boxes (avg. 2.13 per frame) and 85,441 key object bounding boxes (avg. 0.88 per frame) are annotated in total.

Human attention annotation. We annotate the attention of each person in each frame, i.e., the bounding box (human face or object) this person is gazing at.

Gaze communication labeling. The annotators are instructed to annotate both atomic-level and event-level gaze communication labels for every group of people in each frame. To ensure annotation accuracy, we used cross-validation in the annotation process, i.e., two volunteers annotated all the persons in the videos separately, and the differences between their annotations were judged by a specialist in this area. See Table 1 for the distributions of gaze communication categories.


VACATION       # Video   # Frame   # Human   # GCR
training       180       57,749    123,812   97,265
validation     60        22,005    49,012    42,066
testing        60        17,239    33,950    25,034
full dataset   300       96,993    206,774   164,365
Table 2: Statistics of dataset splitting. GCR refers to Gaze Communication Relation. See §3.2 for more details.

Dataset splitting. Our dataset is split into training, validation and testing sets with the ratio of 6:2:2. We arrive at a unique split consisting of 180 training (57,749 frames), 60 validation (22,005 frames), and 60 testing videos (17,239 frames). To avoid over-fitting, there is no source-overlap among videos in different sets (see Table 2 for more details).

4 Our Approach

We design a spatio-temporal graph neural network to explicitly represent the diverse interactions in social scenes and infer atomic-level gaze communications by passing messages over the graph. Given the atomic-level gaze interaction inferences, we further design an event network with an encoder-decoder structure for event-level gaze communication reasoning. As shown in Fig. 2, gaze communication entities, i.e., humans and the social scene, are represented by graph nodes, and gaze communication structures are represented by edges. We introduce notations and formulations in §4.1 and provide more implementation details in §4.2.

4.1 Model Formulation

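Social Graph. We first define a social graph as a complete graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where node $v \in \mathcal{V}$ takes a unique value from $\{1, \ldots, |\mathcal{V}|\}$, representing the entities (i.e., scene, human) in social scenes, and edge $e = (v, w) \in \mathcal{E}$ indicates a directed edge $v \rightarrow w$, representing all the possible human-human gaze interactions or human-scene relations. There is a special node $s \in \mathcal{V}$ representing the social scene. For node $v$, its node representation/embedding is denoted by a $V$-dimensional vector: $\mathbf{x}_v \in \mathbb{R}^{V}$. Similarly, the edge representation/embedding for edge $e = (v, w)$ is denoted by an $E$-dimensional vector: $\mathbf{x}_{v,w} \in \mathbb{R}^{E}$. Each human node $v \in \mathcal{V} \setminus s$ has an output state $l_v \in \mathcal{L}$ that takes a value from a set of atomic gaze labels: $\mathcal{L} = \{$single, mutual, avert, refer, follow, share$\}$. We further define an adjacency matrix $\mathbf{A} \in [0, 1]^{|\mathcal{V}| \times |\mathcal{V}|}$ to represent the communication structure over our complete social graph $\mathcal{G}$, where each element $a_{v,w}$ represents the connectivity from node $v$ to $w$.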
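The social graph for a single frame can be sketched as a small data structure. This is a minimal illustration under stated assumptions: placing the scene node at index 0, excluding self-loops, and the function name `build_social_graph` are choices made here, not details given in the text.

```python
from itertools import permutations

# The six atomic gaze labels a human node can take.
ATOMIC_LABELS = ["single", "mutual", "avert", "refer", "follow", "share"]


def build_social_graph(num_humans):
    """Return (nodes, edges, adjacency) for one frame.

    Node 0 is the scene node (an indexing convention assumed here);
    nodes 1..num_humans are the human nodes.
    """
    nodes = list(range(num_humans + 1))
    # Complete directed graph: one edge v -> w for every ordered pair v != w.
    edges = list(permutations(nodes, 2))
    # Adjacency matrix with entries in [0, 1]; all zero until inferred.
    size = len(nodes)
    adjacency = [[0.0] * size for _ in range(size)]
    return nodes, edges, adjacency


nodes, edges, adjacency = build_social_graph(num_humans=3)
print(len(nodes), len(edges))  # 4 nodes, 12 directed edges
```

With three people in the frame, the graph has four nodes and twelve directed edges, so both directions of every human-human and human-scene pair are available to the structure update.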

Different from most previous graph neural networks that only focus on inferring graph- or node-level labels, our model aims to learn the graph structure A and the visual labels of all the human nodes simultaneously.

Figure 2: Illustration of the proposed spatio-temporal reasoning model for gaze communication understanding. Given an input social video sequence (a), for each frame, a spatial reasoning process (b) is first performed for simultaneously capturing gaze communication relations (social graph structure) and updating node representations through message propagation. Then, in (c), a temporal reasoning process is applied for each node to dynamically update node representation over temporal domain, which is achieved by an LSTM. Bolder edges represent higher connectivity weight inferred in spatial reasoning step (b). See §4.1 for details.

To this end, our spatio-temporal reasoning model is designed to have two steps. First, in spatial domain, there is a message passing step (Fig. 2 (b)) that iteratively learns gaze communication structures A and propagates information over A to update node representations. Second, as shown in Fig. 2 (c), an LSTM is incorporated into our model for more robust node representation learning by considering temporal dynamics. A more detailed model architecture is schematically depicted in Fig. 3. In the following, we describe the above two steps in detail.

Message Passing based Spatial Reasoning. Inspired by previous graph neural networks [20, 46, 30], our message passing step is designed to have three phases: an edge update phase, a graph structure update phase, and a node update phase. The whole message passing process runs for $K$ iterations to iteratively propagate information. In the $k$-th iteration step, we first perform the edge update phase that updates edge representations by collecting information from connected nodes:

$$\mathbf{x}_{v,w}^{(k)} = f_E\big(\langle \mathbf{x}_v^{(k-1)}, \mathbf{x}_w^{(k-1)}, \mathbf{x}_{v,w}^{(k-1)} \rangle\big), \tag{1}$$

where $\mathbf{x}_v^{(k-1)}$ indicates the node representation of $v$ in the $(k-1)$-th step, and $\langle \cdot, \cdot \rangle$ denotes concatenation of vectors. $f_E$ represents an edge update function $f_E: \mathbb{R}^{2V+E} \rightarrow \mathbb{R}^{E}$, which is implemented by a neural network.

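After that, the graph structure update phase updates the adjacency matrix A to infer the current social graph structure, according to the updated edge representations $\mathbf{x}_{v,w}^{(k)}$:

$$a_{v,w}^{(k)} = \sigma\big(f_A(\mathbf{x}_{v,w}^{(k)})\big), \tag{2}$$

where the connectivity matrix $\mathbf{A}^{(k)} = [a_{v,w}^{(k)}]_{v,w}$ encodes current visual communication structures. $f_A: \mathbb{R}^{E} \rightarrow \mathbb{R}$ is a connectivity readout network that maps an edge representation into the connectivity weight, and $\sigma$ denotes a nonlinear activation function.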

Figure 3: Detailed architecture of the proposed spatio-temporal reasoning model for gaze communication understanding. See the last paragraph in §4.1 for detailed descriptions.

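Finally, in the node update phase, we update node representations by considering all the incoming edge information weighted by the corresponding connectivity:

$$\mathbf{x}_v^{(k)} = f_V\Big(\Big\langle \sum_{w \neq v} a_{w,v}^{(k)} \mathbf{x}_{w,v}^{(k)}, \; \mathbf{x}_v^{(k-1)} \Big\rangle\Big), \tag{3}$$

where $f_V: \mathbb{R}^{V+E} \rightarrow \mathbb{R}^{V}$ represents a node update network.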

The above functions are all learned differentiable functions. In the above message passing process, we infer social communication structures in the graph structure update phase (Eq. 2), where the relations between social entities are learned through updated edge representations (Eq. 1). Then, the information is propagated through the learned social graph structure and the hidden state of each node is updated based on its history and incoming messages from its neighbors (Eq. 3). If we know whether there exist interactions between nodes (human, object), i.e., given the ground truth of A, we can learn A in an explicit manner, which is similar to the graph parsing network [46]. Otherwise, the adjacency matrix A can be viewed as an attention or gating mechanism that automatically weights the messages and can be learned in an implicit manner; this shares a similar spirit with the graph attention network [60]. More implementation details can be found in §4.2.
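The three phases of Eqs. 1 to 3 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the random linear maps stand in for the learned networks (the paper uses fully connected layers and a GRU), the sigmoid is one possible choice of nonlinear activation, feeding the previous edge feature into the edge update is an assumption, and the toy sizes and all function names are hypothetical.

```python
import math
import random


def linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned network: a fixed random linear map."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(x):
        assert len(x) == in_dim
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return apply


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def message_passing_step(node_feats, edge_feats, f_edge, f_adj, f_node):
    """One iteration: edge update, structure update, node update."""
    n = len(node_feats)
    pairs = [(v, w) for v in range(n) for w in range(n) if v != w]
    # Edge update (Eq. 1): concatenate both endpoints with the old edge feature.
    new_edges = {
        (v, w): f_edge(node_feats[v] + node_feats[w] + edge_feats[(v, w)])
        for v, w in pairs
    }
    # Structure update (Eq. 2): one connectivity weight in (0, 1) per edge.
    adj = {pair: sigmoid(f_adj(feat)[0]) for pair, feat in new_edges.items()}
    # Node update (Eq. 3): connectivity-weighted sum of incoming edge messages,
    # concatenated with the node's previous feature.
    edge_dim = len(edge_feats[pairs[0]])
    new_nodes = []
    for v in range(n):
        message = [0.0] * edge_dim
        for w in range(n):
            if w != v:
                message = [m + adj[(w, v)] * x
                           for m, x in zip(message, new_edges[(w, v)])]
        new_nodes.append(f_node(message + node_feats[v]))
    return new_nodes, new_edges, adj


rng = random.Random(0)
V, E, N = 4, 8, 3  # toy node dim, edge dim, number of nodes
node_feats = [[rng.random() for _ in range(V)] for _ in range(N)]
edge_feats = {(v, w): node_feats[v] + node_feats[w]
              for v in range(N) for w in range(N) if v != w}
f_edge = linear(2 * V + E, E, rng)
f_adj = linear(E, 1, rng)
f_node = linear(E + V, V, rng)

for _ in range(2):  # two message passing iterations
    node_feats, edge_feats, adj = message_passing_step(
        node_feats, edge_feats, f_edge, f_adj, f_node)
```

In the explicit setting described above, the values in `adj` would additionally be supervised with the annotated gaze relations; in the implicit setting they act only as soft attention weights on the messages.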

Recurrent Network based Temporal Reasoning. Since our task is defined on a spatio-temporal domain, temporal dynamics should be considered for more comprehensive reasoning. With the updated human node representations from our message passing based spatial reasoning model, we further apply an LSTM to each node for temporal reasoning. More specifically, our temporal reasoning step has two phases: a temporal message passing phase and a readout phase. We denote by $\mathbf{x}_v^{t}$ the feature of a human node $v \in \mathcal{V} \setminus s$ at time $t$, which is obtained after $K$-iteration spatial message passing. In the temporal message passing phase, we propagate the information over the temporal axis using an LSTM:

$$\mathbf{h}_v^{t} = f_{\mathrm{LSTM}}\big(\mathbf{x}_v^{t} \mid \mathbf{h}_v^{t-1}\big), \tag{4}$$

where $f_{\mathrm{LSTM}}$ is an LSTM based temporal reasoning function that updates the node representation using temporal information. $\mathbf{x}_v^{t}$ is used as the input of the LSTM at time $t$, and $\mathbf{h}_v^{t}$ indicates the corresponding hidden state output obtained by considering the previous information $\mathbf{h}_v^{t-1}$.
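For reference, if the temporal reasoning function is a standard LSTM cell (an assumption; the text does not state which variant is used), one time step for a single node with input $\mathbf{x}^{t}$ and previous hidden state $\mathbf{h}^{t-1}$ can be written as below. The weight matrices $W$, $U$, the biases $\mathbf{b}$ and the cell state $\mathbf{c}^{t}$ are notation introduced here for illustration.

```latex
\begin{aligned}
\mathbf{i}^{t} &= \sigma\big(W_i \mathbf{x}^{t} + U_i \mathbf{h}^{t-1} + \mathbf{b}_i\big), \\
\mathbf{f}^{t} &= \sigma\big(W_f \mathbf{x}^{t} + U_f \mathbf{h}^{t-1} + \mathbf{b}_f\big), \\
\mathbf{o}^{t} &= \sigma\big(W_o \mathbf{x}^{t} + U_o \mathbf{h}^{t-1} + \mathbf{b}_o\big), \\
\mathbf{g}^{t} &= \tanh\big(W_g \mathbf{x}^{t} + U_g \mathbf{h}^{t-1} + \mathbf{b}_g\big), \\
\mathbf{c}^{t} &= \mathbf{f}^{t} \odot \mathbf{c}^{t-1} + \mathbf{i}^{t} \odot \mathbf{g}^{t}, \\
\mathbf{h}^{t} &= \mathbf{o}^{t} \odot \tanh\big(\mathbf{c}^{t}\big).
\end{aligned}
```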

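Then, in the readout phase, for each human node $v$, a corresponding gaze label $\hat{l}_v^{t} \in \mathcal{L}$ is predicted from the final node representation $\mathbf{h}_v^{t}$:

$$\hat{l}_v^{t} = f_R\big(\mathbf{h}_v^{t}\big), \tag{5}$$

where $f_R$ maps the node feature into the label space $\mathcal{L}$, which is implemented by a classifier network.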

Event Network. The event network is designed with an encoder-decoder structure to learn the correlation of the atomic gazes and classify the event-level gaze communication for each video sequence. To reduce the large variance of video length, we pre-process the input atomic gaze sequence into two vectors: i) the transition vector that records each transition from one category of atomic gaze to another, and ii) the frequency vector that computes the frequency of each atomic type. The encoder individually encodes the transition vector and frequency vector into two embedded vectors. The decoder decodes the concatenation of these two embedded vectors and makes the final event label prediction. Since the atomic gaze communications are noisy within communicative activities, the encoder-decoder structure tries to eliminate the noise and improve the prediction performance. The encoder and decoder are both implemented by fully-connected layers.
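The pre-processing into transition and frequency vectors can be sketched as follows. The exact encoding is not specified in the text, so this is one plausible reading: the transition vector is a flattened 6x6 count of changes between different consecutive labels, and the frequency vector is the normalized label histogram. Both function names are hypothetical.

```python
ATOMIC_LABELS = ["single", "mutual", "avert", "refer", "follow", "share"]


def transition_vector(sequence, labels=ATOMIC_LABELS):
    """Count transitions between different consecutive labels.

    The result is a flattened len(labels) x len(labels) matrix, so its
    length does not depend on the length of the video.
    """
    index = {name: i for i, name in enumerate(labels)}
    n = len(labels)
    counts = [0] * (n * n)
    for prev, curr in zip(sequence, sequence[1:]):
        if prev != curr:
            counts[index[prev] * n + index[curr]] += 1
    return counts


def frequency_vector(sequence, labels=ATOMIC_LABELS):
    """Fraction of frames carrying each atomic label."""
    total = len(sequence)
    if total == 0:
        return [0.0] * len(labels)
    return [sequence.count(name) / total for name in labels]


# A toy Gaze Aversion sequence: mutual, then avert, then single.
seq = ["mutual", "mutual", "avert", "avert",
       "single", "single", "single", "single"]
t_vec = transition_vector(seq)
f_vec = frequency_vector(seq)
print(sum(t_vec), f_vec)  # 2 [0.5, 0.25, 0.25, 0.0, 0.0, 0.0]
```

Both vectors have a fixed size (36 and 6 entries here) regardless of how many frames the sequence contains, which is what lets fully-connected encoders consume videos of very different lengths.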


Atomic-level Gaze Communication (Precision & F1-score), all values in %
Method                      single        mutual        avert         refer         follow        share         Avg. Acc.
                            P      F1     P      F1     P      F1     P      F1     P      F1     P      F1     top-1  top-2
Ours-full (iteration 2)     22.10  26.17  98.68  98.60  59.20  74.28  56.90  53.16  32.83  18.05  61.51  46.61  55.02  76.45
Chance                      16.50  16.45  16.42  16.65  16.65  16.51  16.07  16.06  16.80  16.74  16.20  16.25  16.44  -
CNN                         21.32  27.89  15.99  14.48  47.81  50.82  0.00   0.00   19.21  23.10  11.70  2.80   23.05  40.32
CNN+LSTM                    22.10  11.78  18.55  16.37  64.24  59.57  13.69  18.55  22.70  29.13  17.18  3.61   24.65  45.50
CNN+SVM                     19.92  23.63  28.46  38.30  68.53  76.07  15.15  6.32   23.28  16.87  40.76  49.24  36.23  -
CNN+RF                      53.12  57.98  20.78  0.24   0.00   0.00   51.88  27.31  15.90  19.39  35.56  44.42  37.68  -
PRNet                       0.00   0.00   47.52  52.54  89.63  58.00  19.49  21.52  19.72  22.05  48.69  62.40  39.59  61.45
VGG16                       35.55  48.93  99.70  99.85  76.95  13.04  37.02  31.88  26.62  20.89  53.05  59.88  49.91  72.18
Resnet50 (192-d)            33.61  38.19  78.22  85.66  62.27  76.75  18.58  11.21  35.89  18.55  57.82  60.26  53.72  77.16
AdjMat-only                 34.00  22.63  31.46  22.81  38.06  52.42  27.70  26.79  25.42  25.25  32.32  28.69  32.64  46.48
2 branch-iteration 2        20.43  8.93   92.65  76.03  47.57  59.47  40.34  45.35  36.36  35.77  55.15  57.93  49.57  80.33
2 branch-iteration 3        18.92  19.67  99.72  97.18  57.69  60.18  11.92  6.19   31.10  20.40  39.67  53.22  46.39  66.77
Ours-iteration 1            6.69   4.66   49.39  47.96  36.56  39.44  25.89  27.82  35.05  31.93  36.71  42.22  33.67  53.97
Ours-iteration 3            44.83  0.77   51.29  66.41  47.09  64.03  0.00   0.00   25.95  26.20  47.42  46.74  44.52  72.77
Ours-iteration 4            28.01  5.77   99.59  93.15  42.06  59.06  38.46  14.02  22.02  17.54  43.69  55.77  48.35  72.35
Ours w/o. temporal reason.  13.74  10.80  98.64  98.54  54.54  53.17  55.87  53.75  40.83  25.00  45.89  61.55  53.73  80.33
Ours w. implicit learn.     30.60  9.15   33.00  34.56  43.39  56.00  21.50  26.98  22.43  18.63  58.30  39.33  33.74  56.54
Table 3: Quantitative results of atomic-level gaze communication prediction. The best scores are marked in bold.

Before going deep into our model implementation, we offer a short summary of the whole spatio-temporal reasoning process. As shown in Fig. 3, with an input social video (a), for each frame, we build an initial complete graph (b) to represent the gaze communication entities (i.e., humans and social scene) by nodes and their relations by edges. During the spatial reasoning step (c), we first update edge representations using Eq. 1 (note the changed edge color compared to (b)). Then, in the graph structure update phase, we infer the graph structure by updating the connectivity between each pair of nodes using Eq. 2 (note the changed edge thickness compared to (b)). In the node update phase, we update node embeddings using Eq. 3 (note the changed node color compared to (b)). Iterating the above processes leads to efficient message propagation in the spatial domain. After several spatial message passing iterations, we feed the enhanced node features into an LSTM based temporal reasoning module to capture the temporal dynamics (Eq. 4) and predict the final atomic gaze communication labels (Eq. 5). We then use the event network to reason about event-level labels based on the previously inferred atomic-level label compositions for a long sequence at a larger time scale.

4.2 Detailed Network Architecture

Attention Graph Learning. In our social graph, the adjacency matrix A stores the attention relations between nodes, i.e., it represents the interactions between the entities in the social scene. Since we have annotated all the directed human-human interactions and human-scene relations (§3.2), we learn the adjacency matrix A in an explicit manner (under the supervision of ground truth). Additionally, for the scene node $s$, since it’s a ‘dummy’ node, we enforce $a_{w,s}$, the connectivity from node $w$ to $s$, as 0, where $w \in \mathcal{V} \setminus s$. In this way, other human nodes cannot influence the state of the scene node during message passing. In our experiments, we will offer more detailed results regarding learning A in an implicit (w/o. ground truth) or explicit manner.
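The scene-node constraint can be illustrated with a small masking helper. The sketch assumes the convention that `adjacency[w][s]` holds the connectivity from node `w` into scene node `s`, and that the scene node sits at index 0; the index direction and the name `mask_scene_node` are assumptions made for illustration.

```python
def mask_scene_node(adjacency, scene_index=0):
    """Zero every connectivity that points into the scene node.

    adjacency[w][s] is read as the weight of the edge w -> s, so clearing
    column s removes all messages the scene node would receive.
    """
    masked = [row[:] for row in adjacency]  # copy, leave the input untouched
    for w in range(len(masked)):
        masked[w][scene_index] = 0.0
    return masked


A = [[0.0, 0.3, 0.7],
     [0.9, 0.0, 0.4],
     [0.8, 0.6, 0.0]]
A_masked = mask_scene_node(A)
print(A_masked)  # column 0 is now all zeros
```

The scene node can still send messages to human nodes (row 0 is unchanged), which matches the stated intent that humans should not alter the scene representation.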

Node/Edge Feature Initialization. For each node $v \in \mathcal{V} \setminus s$, the 4096-d features (from the fc7 layer of a pre-trained ResNet50 [24]) are extracted from the corresponding bounding box as its initial feature $\mathbf{x}_v^{(0)}$. For the scene node $s$, the fc7 feature of the whole frame is used as its node representation $\mathbf{x}_s^{(0)}$. To decrease the number of parameters, we use a fully connected layer to compress all the node features to a lower dimension and then encode node position information with it. For an edge $e = (v, w)$, we just concatenate the two related node features as its initial feature $\mathbf{x}_{v,w}^{(0)}$. Thus, the edge dimension is twice the node dimension, $E = 2V$.
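A minimal sketch of this initialization is given below. The compressed appearance vectors are toy values, and encoding position as the box corners normalized by the frame size is an assumption; the text only says that position information is appended to the compressed feature.

```python
def node_feature(appearance, box, frame_size):
    """Concatenate a compressed appearance vector with a box position code.

    The position code (normalized x1, y1, x2, y2) is a hypothetical choice.
    """
    width, height = frame_size
    x1, y1, x2, y2 = box
    position = [x1 / width, y1 / height, x2 / width, y2 / height]
    return list(appearance) + position


def edge_feature(feat_v, feat_w):
    """Initial feature of edge v -> w: the two node features concatenated."""
    return list(feat_v) + list(feat_w)


x_v = node_feature([0.2, -0.1], box=(64, 48, 128, 96), frame_size=(640, 480))
x_w = node_feature([0.5, 0.3], box=(320, 120, 400, 240), frame_size=(640, 480))
x_vw = edge_feature(x_v, x_w)
print(len(x_v), len(x_vw))  # 6 12
```

Because the edge feature is a plain concatenation, its length is always twice the node feature length, and the edges v -> w and w -> v start from different (reordered) features.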

Graph Network Implementations. The functions in Eqs. 1, 2 and 5 are all implemented by fully connected layers, whose configurations can be determined according to their corresponding definitions. The node update function in Eq. 3 is implemented by a gated recurrent unit (GRU) network.

Loss functions. When explicitly learning the adjacency matrix, we treat it as a binary classification problem and use the cross entropy loss. We also employ standard cross entropy loss for the multi-class classification of gaze communication labels.
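The two loss terms can be written out in a few lines. This is a generic sketch of binary and multi-class cross entropy, not the authors' code; how the two terms are weighted against each other is not stated in the text, so they are kept separate here.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-12):
    """Mean binary cross entropy over adjacency entries.

    pred holds predicted connectivities in (0, 1); target holds 0/1 labels.
    """
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def cross_entropy(logits, label):
    """Softmax cross entropy for one node; label is the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


adj_loss = binary_cross_entropy([0.5, 0.5], [1, 0])
label_loss = cross_entropy([0.0] * 6, label=2)
print(round(adj_loss, 4), round(label_loss, 4))  # 0.6931 1.7918
```

An uninformative prediction gives log 2 for the adjacency term and log 6 for the six-way gaze label term, which is a useful reference point when reading training curves.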

5 Experiments


Event-level Gaze Communication (Precision & F1-score), all values in %
Method         Non-Comm.       Mutual Gaze     Gaze Aversion   Gaze Following  Joint Attention Avg. Acc.
               P       F1      P       F1      P       F1      P       F1      P       F1      top-1   top-2
Chance         21.3    29.3    25.0    23.0    20.0    14.8    36.3    15.1    20.3    22.1    22.7    45.0
FC-w/o. GT     43.7    44.3    16.9    23.3    6.2     10.0    8.3     9.1     60.9    40.2    35.6    69.1
Ours-w/o. GT   50.7    49.3    16.7    21.0    8.2     11.3    6.2     7.7     60.9    40.0    37.1    65.5
FC-w. GT       90.7    70.7    12.3    30.8    22.2    30.8    15.0    48.3    56.8    57.1    52.6    86.5
Ours-w. GT     91.4    72.7    14.5    32.3    18.5    45.5    20.0    66.7    62.2    30.8    55.9    79.4
Table 4: Quantitative results of event-level gaze communication prediction. The best scores are marked in bold.
Figure 4: Qualitative results of atomic-level gaze communication prediction. Correctly inferred labels are shown in black while error examples are shown in red.

5.1 Experimental Setup

Evaluation Metrics. We use four evaluation metrics in our experiments: precision, F1-score, top-1 Avg. Acc. and top-2 Avg. Acc. Precision refers to the ratio of true-positive classifications to all positive classifications. F1-score is the harmonic mean of the precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. Top-1 Avg. Acc. and top-2 Avg. Acc. calculate the average label classification accuracy over the whole test set.
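The metrics can be computed as in the sketch below. It assumes per-class precision and F1 and a sample-averaged top-k accuracy; whether the paper averages accuracy per sample or per class is not stated, and the function names and toy data are illustrative.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == label and t == label)
    fp = sum(1 for t, p in pairs if p == label and t != label)
    fn = sum(1 for t, p in pairs if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def top_k_accuracy(y_true, scores, k):
    """Fraction of samples whose true label is among the k best-scored labels.

    scores[i] is a dict mapping each label to its score for sample i.
    """
    hits = 0
    for truth, sample_scores in zip(y_true, scores):
        top = sorted(sample_scores, key=sample_scores.get, reverse=True)[:k]
        hits += truth in top
    return hits / len(y_true)


y_true = ["mutual", "mutual", "avert", "single"]
scores = [
    {"mutual": 0.6, "avert": 0.3, "single": 0.1},
    {"mutual": 0.3, "avert": 0.5, "single": 0.2},
    {"mutual": 0.2, "avert": 0.7, "single": 0.1},
    {"mutual": 0.5, "avert": 0.1, "single": 0.4},
]
y_pred = [max(s, key=s.get) for s in scores]

p, r, f1 = precision_recall_f1(y_true, y_pred, "mutual")
top1 = top_k_accuracy(y_true, scores, k=1)
top2 = top_k_accuracy(y_true, scores, k=2)
print(p, r, f1, top1, top2)  # 0.5 0.5 0.5 0.5 1.0
```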

Implementation Details. Our model is implemented in PyTorch. During the training phase, the learning rate is set to 1e-1 and decays by 0.1 per epoch. For the atomic-gaze interaction temporal reasoning module, we set the sequence length to 5 frames according to our dataset statistics. Training takes about 10 epochs (5 hours) to roughly converge on an NVIDIA TITAN X GPU.
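As a small illustration of the learning-rate schedule, the sketch below assumes that "decays by 0.1 per epoch" means the rate is multiplied by 0.1 after every epoch; the function name and this interpretation are assumptions.

```python
def learning_rate(epoch, base_lr=1e-1, gamma=0.1):
    """Learning rate at a given epoch under a multiplicative per-epoch decay."""
    return base_lr * gamma ** epoch

# Rates for the first three epochs: roughly 1e-1, 1e-2, 1e-3.
schedule = [learning_rate(e) for e in range(3)]
```

Under this reading, the equivalent PyTorch scheduler would be `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1)`, stepped once per epoch.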

Baselines. To better evaluate the performance of our model, we consider the following baselines:

 Chance is a weak baseline, i.e., randomly assigning an atomic gaze communication label to each human node.

 CNN uses three Conv2d layers to extract features for each human node and concatenates the features with position information for label classification (no spatial communication structure, no temporal relations).

 CNN+LSTM feeds the CNN-based node feature to an LSTM (only temporal dynamics, no spatial structures).


 CNN+SVM concatenates the CNN-based node features and feeds them into a Support Vector Machine classifier.


 CNN+RF replaces the above SVM classifier with a Random Forest classifier.

 FC-w/o. GT & FC-w. GT are fully connected layers without or with ground truth atomic gaze labels.

Ablation Study. To assess the effectiveness of our essential model components, we derive the following variants:

 Different node features. We try different ways to extract node features. PRNet uses 68 3D face keypoints extracted by PRNet [19]. VGG16 replaces ResNet50 with VGG16 [53]. Resnet50 (192-d) compresses the 4096-d features from the fc7 layer of ResNet50 [24] to 192-d.

 AdjMat-only directly feeds the explicitly learned adjacency matrix into some Conv3d layers for classification.

 2 branch concatenates a second adjacency matrix branch alongside the GNN branch for classification. We test with different message passing iterations.

 Ours-iteration 1,2,3,4 test different message passing iterations in the spatial reasoning phase of our full model.

 Ours w/o. temporal reason. replaces LSTM with Conv3d layers in the temporal reasoning phase of our full model.

 Ours w. implicit learn. learns the adjacency matrix in an unsupervised manner (w/o. attention ground truths).

5.2 Results and Analyses

Overall Quantitative Results. The quantitative results for the atomic-level and event-level gaze communication classification experiments are shown in Tables 3 and 4, respectively. For the atomic-level task, our full model achieves the best top-1 avg. acc. on the test set and shows good and balanced performance for each atomic type instead of overfitting to certain categories. For the event-level task, our event network improves the top-1 avg. acc. on the test set over the FC baselines, achieving 37.1% with the predicted atomic labels and 55.9% with the ground truth atomic labels.

In-depth Analyses. For the atomic-level task, we examine different ways to extract node features and find ResNet50 the best. Compressing the ResNet50 feature to a low dimension still performs well while being efficient (full model vs. Resnet50 192-d). AdjMat-only, which directly uses the concatenated adjacency matrix, obtains reasonable results compared to the weak baselines, but they are not good enough. This is probably because understanding gaze communication dynamics is not simply about geometric attention relations, but also depends on a deep and comprehensive understanding of the spatial-temporal scene context. We also examine the effect of iterative message passing and find that it gradually improves the performance in general; however, once the number of iterations increases beyond a certain point, the performance drops slightly.

Qualitative Results. Fig. 4 shows some visual results of our full model for atomic-level gaze communication recognition. The predicted communication structures are shown with bounding boxes and arrows. Our method can correctly recognize different atomic-level gaze communication types (shown in black) with effective spatial-temporal graph reasoning. We also present some failure cases (shown in red), which may be due to the ambiguity and subtlety of gaze interactions and the illegibility of eyes. Also, the shift between gaze phases can be fast and some phases are very short, making them hard to recognize.

6 Conclusion

We address a new problem of inferring human gaze communication at both the atomic level and the event level in third-person social videos. We propose a new video dataset, VACATION, and a spatial-temporal graph reasoning model, and show benchmark results on our dataset. Our model inherits the complementary advantages of graphs and standard feedforward neural networks, naturally capturing gaze patterns and providing better compositionality. We hope our work will serve as an important resource to facilitate future studies on this topic.

Acknowledgements. The authors thank Prof. Tao Gao, Tianmin Shu, Siyuan Qi and Keze Wang from UCLA VCLA Lab for helpful comments on this work. This work was supported by ONR MURI project N00014-16-1-2007, ONR Robotics project N00014-19-1-2153, DARPA XAI grant N66001-17-2-4029, ARO grant W911NF1810296, CCF-Tencent Open Fund and Zhijiang Lab’s International Talent Fund for Young Professionals. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.


  • [1] H. Admoni and B. Scassellati (2014) Data-driven model of nonverbal behavior for socially assistive human-robot interactions. In ICMI, Cited by: §2.2.
  • [2] H. Admoni and B. Scassellati (2017) Social eye gaze in human-robot interaction: a review. JHRI 6 (1). Cited by: §1, §1, §2.2.
  • [3] S. Andrist, B. Mutlu, and A. Tapus (2015) Look like me: matching robot personality via gaze to increase motivation. In CHI, Cited by: §2.2.
  • [4] S. Andrist, X. Z. Tan, M. Gleicher, and B. Mutlu (2014) Conversational gaze aversion for humanlike robots. In HRI, Cited by: §2.2.
  • [5] M. Argyle and M. Cook (1976) Gaze and mutual gaze. Cambridge U Press. Cited by: §1.
  • [6] P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. In NIPS, Cited by: §2.3.
  • [7] A. Borji and L. Itti (2013) State-of-the-art in visual attention modeling. IEEE TPAMI 35 (1), pp. 185–207. Cited by: §2.1.
  • [8] J. Boucher, U. Pattacini, A. Lelong, G. Bailly, F. Elisei, S. Fagel, P. F. Dominey, and J. Ventre-Dominey (2012) I reach faster when i see you look: gaze effects in human-human and human-robot face-to-face cooperation. Frontiers in Neurorobotics 6, pp. 3. Cited by: §2.1.
  • [9] R. Brooks and A. N. Meltzoff (2005) The development of gaze following and its relation to language. Developmental Science 8 (6), pp. 535–543. Cited by: §1.
  • [10] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun (2014) Spectral networks and locally connected networks on graphs. In ICLR, Cited by: §2.3.
  • [11] J. K. Burgoon, L. K. Guerrero, and K. Floyd (2016) Nonverbal communication. Routledge. Cited by: §2.1, §2.1.
  • [12] X. Chen, L. Li, L. Fei-Fei, and A. Gupta (2018) Iterative visual reasoning beyond convolutions. In CVPR, Cited by: §2.3.
  • [13] H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In ICML, Cited by: §2.3.
  • [14] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §2.3.
  • [15] N. J. Emery (2000) The eyes have it: the neuroethology, function and evolution of social gaze. Neuroscience & Biobehavioral Reviews 24 (6), pp. 581–604. Cited by: §1.
  • [16] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In CVPR, Cited by: §2.1.
  • [17] L. Fan, Y. Chen, P. Wei, W. Wang, and S. Zhu (2018) Inferring shared attention in social scene videos. In CVPR, Cited by: §2.1.
  • [18] H. Fang, J. Cao, Y. Tai, and C. Lu (2018) Pairwise body-part attention for recognizing human-object interactions. In ECCV, Cited by: §2.3.
  • [19] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, Cited by: §5.1.
  • [20] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §2.3, §4.1.
  • [21] A. M. Glenberg, J. L. Schroeder, and D. A. Robertson (1998) Averting the gaze disengages the environment and facilitates remembering. Memory & Cognition 26 (4), pp. 651–658. Cited by: §1.
  • [22] M. M. Haith, T. Bergman, and M. J. Moore (1977) Eye contact and face scanning in early infancy. Science 198 (4319), pp. 853–855. Cited by: §1.
  • [23] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §2.3.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.2, §5.1.
  • [25] C. Huang and A. L. Thomaz (2011) Effects of responding to, initiating and ensuring joint attention in human-robot interaction. In 2011 Ro-Man, Cited by: §2.2.
  • [26] R. J. Itier and M. Batty (2009) Neural bases of eye and gaze processing: the core of social cognition. Neuroscience & Biobehavioral Reviews 33 (6), pp. 843–863. Cited by: §1.
  • [27] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI 20 (11), pp. 1254–1259. Cited by: §2.1.
  • [28] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-RNN: deep learning on spatio-temporal graphs. In CVPR, Cited by: §2.3.
  • [29] M. Jording, A. Hartz, G. Bente, M. Schulte-Rüther, and K. Vogeley (2018) The “social gaze space”: a taxonomy for gaze-based communication in triadic interactions. Frontiers in Psychology 9, pp. 226. Cited by: §1.
  • [30] T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. Zemel (2018) Neural relational inference for interacting systems. In ICML, Cited by: §2.3, §4.1.
  • [31] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §2.3.
  • [32] C. L. Kleinke (1986) Gaze and eye contact: a research review. Psychological Bulletin 100 (1), pp. 78. Cited by: §2.1.
  • [33] H. Kobayashi and S. Kohshima (1997) Unique morphology of the human eye. Nature 387 (6635), pp. 767. Cited by: §1.
  • [34] R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun, and S. Fidler (2017) Situation recognition with graph neural networks. In ICCV, Cited by: §2.3.
  • [35] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In ICML, Cited by: §2.3.
  • [36] C. Liu, C. T. Ishi, H. Ishiguro, and N. Hagita (2012) Generation of nodding, head tilting and eye gazing for human-robot dialogue interaction. In HRI, Cited by: §2.2.
  • [37] K. Marino, R. Salakhutdinov, and A. Gupta (2017) The more you know: using knowledge graphs for image classification. In CVPR, Cited by: §2.3.
  • [38] A. N. Meltzoff, R. Brooks, A. P. Shon, and R. P.N. Rao (2010) “Social” robots are psychological agents for infants: a test of gaze following. Neural networks 23 (8-9), pp. 966–972. Cited by: §2.2.
  • [39] C. Moore, P. J. Dunham, and P. Dunham (2014) Joint attention: its origins and role in development. Psychology Press. Cited by: §1.
  • [40] B. Mutlu, J. Forlizzi, and J. Hodgins (2006) A storytelling robot: modeling and evaluation of human-like gaze behavior. In IEEE-RAS ICHR, Cited by: §2.2.
  • [41] B. Mutlu, T. Kanda, J. Forlizzi, J. Hodgins, and H. Ishiguro (2012) Conversational gaze mechanisms for humanlike robots. ACM TIIS 1 (2), pp. 12. Cited by: §2.2.
  • [42] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In ICML, Cited by: §2.3.
  • [43] S. Okamoto-Barth and M. Tomonaga (2006) Development of joint attention in infant chimpanzees. In Cognitive development in chimpanzees, pp. 155–171. Cited by: §1.
  • [44] H. S. Park, E. Jain, and Y. Sheikh (2012) 3D social saliency from head-mounted cameras. In NIPS, Cited by: §2.1.
  • [45] H. S. Park and J. Shi (2015) Social saliency prediction. In CVPR, Cited by: §2.1.
  • [46] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, Cited by: §1, §2.3, §4.1, §4.1.
  • [47] M. D. Riemer (1949) The averted gaze. Psychiatric Quarterly 23 (1), pp. 108–115. Cited by: §1.
  • [48] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In NIPS, Cited by: §2.3.
  • [49] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE TNNLS 20 (1), pp. 61–80. Cited by: §2.3.
  • [50] A. Senju, M. H. Johnson, and G. Csibra (2006) The development and neural basis of referential gaze perception. Social neuroscience 1 (3-4), pp. 220–234. Cited by: §1.
  • [51] S. Shepherd (2010) Following gaze: gaze-following behavior as a window into social cognition. Frontiers in integrative neuroscience 4, pp. 5. Cited by: §1.
  • [52] M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, Cited by: §2.3.
  • [53] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.
  • [54] V. Srinivasan and R. R. Murphy (2011) A survey of social gaze. In HRI, Cited by: §2.2.
  • [55] M. Staudte and M. W. Crocker (2009) Visual attention in spoken human-robot interaction. In HRI, Cited by: §2.2.
  • [56] M. Staudte and M. W. Crocker (2011) Investigating joint attention mechanisms through spoken human-robot interaction. Cognition 120 (2), pp. 268–291. Cited by: §2.1.
  • [57] S. Sukhbaatar, A. Szlam, and R. Fergus (2016) Learning multiagent communication with backpropagation. In NIPS, Cited by: §2.3.
  • [58] A. Tapus, M. J. Mataric, and B. Scassellati (2007) The grand challenges in socially assistive robotics. IEEE Robotics & Automation Magazine 14 (1), pp. 35–42. Cited by: §2.2.
  • [59] M. Tomasello (2010) Origins of human communication. MIT Press. Cited by: §1.
  • [60] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §1, §2.3, §4.1.
  • [61] C. Vondrick, D. Patterson, and D. Ramanan (2013) Efficiently scaling up crowdsourced video annotation. IJCV 101 (1), pp. 184–204. Cited by: §3.2.
  • [62] W. Wang, Q. Lai, H. Fu, J. Shen, and H. Ling (2019) Salient object detection in the deep learning era: an in-depth survey. arXiv preprint arXiv:1904.09146. Cited by: §2.1.
  • [63] W. Wang and J. Shen (2018) Deep visual attention prediction. IEEE TIP 27 (5), pp. 2368–2378. Cited by: §2.1.
  • [64] K. Zuberbühler (2008) Gaze following. Current Biology 18 (11), pp. R453–R455. Cited by: §1.