ScanGAN360: A Generative Model of Realistic Scanpaths for 360° Images

03/25/2021
by   Daniel Martin, et al.

Understanding and modeling the dynamics of human gaze behavior in 360° environments is a key challenge in computer vision and virtual reality. Generative adversarial approaches could alleviate this challenge by generating a large number of possible scanpaths for unseen images. Existing methods for scanpath generation, however, do not adequately predict realistic scanpaths for 360° images. We present ScanGAN360, a new generative adversarial approach to address this challenging problem. Our network generator is tailored to the specifics of 360° images representing immersive environments. Specifically, we accomplish this by leveraging the use of a spherical adaptation of dynamic-time warping as a loss function and proposing a novel parameterization of 360° scanpaths. The quality of our scanpaths outperforms competing approaches by a large margin and is almost on par with the human baseline. ScanGAN360 thus allows fast simulation of large numbers of virtual observers, whose behavior mimics real users, enabling a better understanding of gaze behavior and novel applications in virtual scene design.


1 Introduction

Virtual reality (VR) is an emerging medium that unlocks unprecedented user experiences. To optimize these experiences, however, it is crucial to develop computer vision techniques that help us understand how people explore immersive virtual environments. Models for time-dependent visual exploration behavior are important for designing and editing VR content [42], for generating realistic gaze trajectories of digital avatars [18], for understanding dynamic visual attention and visual search behavior [60], and for developing new rendering, display, and compression algorithms, among other applications.

Current approaches that model how people explore virtual environments often leverage saliency prediction [43, 13, 31, 2]. While this is useful for some applications, the fixation points predicted by these approaches do not account for the time-dependent visual behavior of the user, making it difficult to predict the order of fixations, or give insight into how people explore an environment over time. For this purpose, some recent work has explored scanpath prediction [2, 3, 62, 4], but these algorithms do not adequately model how people explore immersive virtual environments, resulting in erratic or non-plausible scanpaths.

Figure 1: We present ScanGAN360, a generative adversarial approach to scanpath generation for 360 images. ScanGAN360 generates realistic scanpaths (bottom rows), outperforming state-of-the-art methods and mimicking the human baseline (top row).

In this work, we present ScanGAN360, a novel framework for scanpath generation for 360 images (Figure 1). Our model builds on a conditional generative adversarial network (cGAN) architecture, for which we discuss and validate two important insights that we show are necessary for realistic scanpath generation. First, we propose a loss function based on a spherical adaptation of dynamic time warping (DTW), which is a key aspect for training our GAN robustly. DTW is a metric for measuring similarity between two time series, such as scanpaths, which to our knowledge has not been used to train scanpath-generating GANs. Second, to adequately tackle the problem of scanpath generation in 360 images, we present a novel parameterization of the scanpaths. These insights allow us to demonstrate state-of-the-art results for scanpath generation in VR, close to the human baseline and far surpassing the performance of existing methods. Our approach is the first to enable robust scanpath prediction over long time periods up to 30 seconds, and, unlike previous work, our model does not rely on saliency, which is typically not available as ground truth.

Our model produces about 1,000 scanpaths per second, which enables fast simulation of large numbers of virtual observers, whose behavior mimics that of real users. Using ScanGAN360, we explore applications in virtual scene design, which is useful in video games, interior design, cinematography, and tourism, and scanpath-driven video thumbnail generation of 360 images, which provides previews of VR content for social media platforms. Beyond these applications, we propose to use ScanGAN360 for applications such as gaze behavior simulation for virtual avatars or gaze-contingent rendering. Extended discussion and results on applications are included in the supplementary material and video.

We will make our source code and pre-trained model publicly available to promote future research.

2 Related work

Modeling and predicting attention

The multimodal nature of attention [30], together with the complexity of human gaze behavior, make this a very challenging task. Many works devoted to it have relied on representations such as saliency, which is a convenient representation for indicating the regions of an image more likely to attract attention. Early strategies for saliency modeling have focused on either creating hand-crafted features representative of saliency [19, 52, 61, 29, 20, 7], or directly learning data-driven features [49, 22]. With the proliferation of extensive datasets of human attention [43, 39, 20, 8, 59], deep learning–based methods for saliency prediction have been successfully applied, yielding impressive results [37, 36, 14, 50, 54, 55, 58].

However, saliency models do not take into account the dynamic nature of human gaze behavior, and therefore they are unable to model or predict time-varying aspects of attention. Being able to model and predict dynamic exploration patterns has been proven to be useful, for example, for avatar gaze control [12, 41], video rendering in virtual reality [26], or for directing users’ attention over time in many contexts [9, 38]. Scanpath models aim to predict visual patterns of exploration that an observer would perform when presented with an image. In contrast to saliency models, scanpath models typically focus on predicting plausible scanpaths, i.e., they do not predict a unique scanpath and instead try to mimic human behavior when exploring an image, taking into account the variability between different observers. Ellis and Smith [16] were pioneers in this field: they proposed a general framework for generating scanpaths based on Markov stochastic processes. Several approaches have followed this work, incorporating behavioral biases in the process in order to produce more plausible scanpaths [24, 47, 27, 48]. In recent years, deep learning models have been used to predict human scanpaths based on neural network features trained on object recognition [22, 53, 14, 5].

Attention in 360 images

Predicting plausible scanpaths in 360 imagery is a more complex task: Observers do not only scan a given image with their gaze, but they can now also turn their head or body, effectively changing their viewport over time. Several works have been proposed for modeling saliency in 360 images [33, 43, 31, 11, 44]. However, scanpath prediction has received less attention. In their recent work, Assens et al. [3] generalize their 2D model to 360 images, but their loss function is unable to reproduce the behavior of ground truth scanpaths (see Figure 4, third column). A few works have focused on predicting short-term sequential gaze points based on users’ previous history for 360 videos, but they are limited to small temporal windows (from one to ten seconds) [56, 25, 35]. For the case of images, a number of recent methods focus on developing improved saliency models and principled methods to sample from them [2, 4, 62].

Instead, we directly learn dynamic aspects of attention from ground truth scanpaths by training a generative model in an adversarial manner, with an architecture and loss function specifically designed for scanpaths in 360 images. This allows us to (i) effectively mimic human behavior when exploring scenes, bypassing the saliency generation and sampling steps, and (ii) optimize our network to stochastically generate 360 scanpaths, taking into account observer variability.

3 Our Model

We adopt a generative adversarial approach specifically designed for 360 content, in which the model learns to generate a plausible scanpath given the 360 image as a condition. In the following, we describe the parameterization employed for the scanpaths, the design of our loss function for the generator, and the particularities of our conditional GAN architecture, ending with details about the training process.

Figure 2: Illustration of our generator and discriminator networks. Both networks have a two-branch structure: Features extracted from the 360 image with the aid of a CoordConv layer and an encoder-like network are concatenated with the input vector for further processing. The generator learns to transform this input vector, conditioned by the image, into a plausible scanpath. The discriminator takes as input vector a scanpath (either captured or synthesized by the generator), as well as the corresponding image, and determines the probability of this scanpath being real (or fake). We train them end-to-end in an adversarial manner, following a conditional GAN scheme. Please refer to the text for details on the loss functions and architecture.

3.1 Scanpath Parameterization

Scanpaths are commonly provided as a sequence of two-dimensional values corresponding to the coordinates of each gaze point in the image. When dealing with 360 images in equirectangular projections, gaze points are also often represented by their latitude and longitude $(\phi, \theta)$. However, these parameterizations either suffer from discontinuities at the borders of a 360 image, or result in periodic, ambiguous values. The same point of the scene can have two different representations in these parameterizations, hindering the learning process.

We therefore resort to a three-dimensional parameterization of our scanpaths, where each gaze point $(\phi, \theta)$ is transformed into its three-dimensional representation $\mathbf{p} = (x, y, z)$ such that:

$$x = \cos\phi \cos\theta, \qquad y = \cos\phi \sin\theta, \qquad z = \sin\phi.$$

This transformation assumes, without loss of generality, that the panorama is projected over a unit sphere. We use this parameterization for our model, which learns a scanpath as a set of three-dimensional points over time; a scanpath with $T$ samples over time is thus represented as a sequence of $T$ such points. The results of the model are then converted back to a two-dimensional parameterization in terms of latitude ($\phi$) and longitude ($\theta$) for display and evaluation purposes.
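For concreteness, the following is a minimal NumPy sketch of such a parameterization; the exact axis convention and function names are our own assumptions for illustration, not the authors' code.

```python
# Sketch of the unit-sphere scanpath parameterization, assuming latitude
# phi in [-pi/2, pi/2] and longitude theta in [-pi, pi].
import numpy as np

def latlon_to_xyz(phi, theta):
    """Map latitude/longitude gaze points to 3D points on the unit sphere."""
    x = np.cos(phi) * np.cos(theta)
    y = np.cos(phi) * np.sin(theta)
    z = np.sin(phi)
    return np.stack([x, y, z], axis=-1)

def xyz_to_latlon(p):
    """Map 3D unit-sphere points back to latitude/longitude for display."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)  # re-project onto the sphere
    phi = np.arcsin(np.clip(p[..., 2], -1.0, 1.0))
    theta = np.arctan2(p[..., 1], p[..., 0])
    return phi, theta

# Example: a 30-sample scanpath (30 s at 1 Hz) becomes a 30x3 array.
phi = np.random.uniform(-np.pi / 2, np.pi / 2, size=30)
theta = np.random.uniform(-np.pi, np.pi, size=30)
scanpath_3d = latlon_to_xyz(phi, theta)   # shape (30, 3)
```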

3.2 Overview of the Model

Our model is a conditional GAN, where the condition is the RGB 360 image $x$ for which we wish to estimate a scanpath. The generator $G$ is trained to generate a scanpath $\hat{y} = G(z \mid x)$ from a latent code $z$ (drawn randomly from a uniform distribution), conditioned by the RGB 360 image $x$. The discriminator $D$ takes as input a potential scanpath (either a captured scanpath $y$ or a generated one $\hat{y}$), as well as the condition (the RGB 360 image), and outputs the probability of the scanpath being real (or fake). The architecture of both networks, generator and discriminator, can be seen in Figure 2, and further details of the architecture are described in Section 3.4.

3.3 Loss Function

The objective function of a conventional conditional GAN is inspired by a minimax objective from game theory [32]:

$$\min_G \max_D \; \mathbb{E}_{y,x}\left[\log D(y \mid x)\right] + \mathbb{E}_{z,x}\left[\log\left(1 - D(G(z \mid x) \mid x)\right)\right]. \qquad (1)$$

We can separate this into two losses, one for the generator, $\mathcal{L}_G$, and one for the discriminator, $\mathcal{L}_D$:

$$\mathcal{L}_G = \mathbb{E}_{z,x}\left[\log\left(1 - D(G(z \mid x) \mid x)\right)\right], \qquad (2)$$
$$\mathcal{L}_D = -\mathbb{E}_{y,x}\left[\log D(y \mid x)\right] - \mathbb{E}_{z,x}\left[\log\left(1 - D(G(z \mid x) \mid x)\right)\right]. \qquad (3)$$

While this objective function suffices in certain cases, as the complexity of the problem increases the generator may not be able to learn the transformation from the input distribution into the target one. One can resort to adding a loss term to $\mathcal{L}_G$, and in particular one that enforces similarity to the ground truth scanpath data. However, using a conventional data term, such as MSE, does not yield good results (Section 4.4 includes an evaluation of this). To address this issue, we introduce a novel term in $\mathcal{L}_G$ specifically targeted to our problem, based on dynamic time warping [34].

Dynamic time warping (DTW) measures the similarity between two temporal sequences, considering both the shape and the order of the elements of a sequence, without forcing a one-to-one correspondence between elements of the time series. For this purpose, it takes into account all the possible alignments of two time series $r$ and $s$, and computes the one that yields the minimal distance between them. Specifically, the DTW loss function between two time series $r$ and $s$ can be expressed as [15]:

$$\mathrm{DTW}(r, s) = \min_{A \in \mathcal{A}} \; \langle A, \Delta(r, s) \rangle, \qquad (4)$$

where $\Delta(r, s)$ is a matrix containing the distances between each pair of points in $r$ and $s$, $A \in \mathcal{A}$ is a binary matrix that accounts for the alignment (or correspondence) between $r$ and $s$, and $\langle \cdot, \cdot \rangle$ is the inner product between both matrices.

In our case, $r$ and $s$ are two scanpaths that we wish to compare. While the Euclidean distance between each pair of points is usually employed when computing $\Delta(r, s)$ for Equation 4, in our scenario that would yield erroneous distances derived from the projection of the 360 image (both if done in 2D over the image, or in 3D with the parameterization described in Section 3.1). We instead use the distance over the surface of a sphere, or spherical distance, and define $\Delta_{sph}(r, s)$ such that each element is the great-circle distance between the corresponding pair of unit-sphere gaze points:

$$\Delta_{sph}(r, s)_{ij} = \arccos\left( r_i \cdot s_j \right), \qquad (5)$$

leading to our spherical DTW:

$$\mathrm{DTW}_{sph}(r, s) = \min_{A \in \mathcal{A}} \; \langle A, \Delta_{sph}(r, s) \rangle. \qquad (6)$$

We incorporate the spherical DTW into the loss function of the generator ($\mathcal{L}_G$, Equation 2), yielding our final generator loss function $\mathcal{L}_G^{*}$:

$$\mathcal{L}_G^{*} = \mathcal{L}_G + \lambda \, \mathrm{DTW}_{sph}(G(z \mid x), y), \qquad (7)$$

where $y$ is a ground truth scanpath for the conditioning image $x$, and the weight $\lambda$ is set empirically.

A loss function incorporating DTW (or spherical DTW) is not differentiable; however, a differentiable version, soft-DTW, has been proposed. We use this soft-DTW in our model; details can be found in Section S1 in the supplementary material or in the original publication [15].
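To make the loss concrete, below is a minimal NumPy sketch of a spherical soft-DTW computation following Cuturi and Blondel's recursion [15]; it illustrates the idea rather than the authors' differentiable GPU implementation, and the gamma and lambda values are placeholders.

```python
# Illustrative (non-differentiable-framework) sketch of the spherical
# soft-DTW term used in the generator loss (Eqs. 6-7).
import numpy as np
from scipy.special import logsumexp

def spherical_cost_matrix(P, Q):
    """Great-circle distances between all pairs of unit-sphere gaze points."""
    dots = np.clip(P @ Q.T, -1.0, 1.0)
    return np.arccos(dots)                      # shape (len(P), len(Q))

def soft_dtw(P, Q, gamma=0.1):
    """Spherical soft-DTW between two scanpaths given as (T, 3) unit vectors."""
    D = spherical_cost_matrix(P, Q)
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            soft_min = -gamma * logsumexp(-prev / gamma)   # smoothed min operator
            R[i, j] = D[i - 1, j - 1] + soft_min
    return R[n, m]

# Generator objective sketch: adversarial term plus lambda * soft-DTW.
# (The 0.1 below is a hypothetical weight; the paper sets lambda empirically.)
# loss_G = adversarial_loss + 0.1 * soft_dtw(generated, ground_truth)
```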

3.4 Model Architecture

Both our generator and discriminator are based on a two-branch structure (see Figure 2), with one branch for the conditioning image and the other for the input vector ($z$ in the generator, and $y$ or $\hat{y}$ in the discriminator). The image branch extracts features from the 360 image, yielding a set of latent features that are concatenated with the input vector for further processing. Due to the distortion inherent to equirectangular projections, traditional convolutional feature extraction strategies are not well suited for 360 images: They use a kernel window where neighboring relations are established uniformly around a pixel. Instead, we extract features using panoramic (or spherical) convolutions [13]. Spherical convolutions are a type of dilated convolution in which the relations between elements of the image are established not in image space but in a gnomonic, non-distorted space: kernels are represented as patches tangent to the sphere onto which the 360 image is reprojected.

In our problem of scanpath generation, the location of features in the image is of particular importance. Therefore, to facilitate spatial learning of the network, we use the recently presented CoordConv strategy [28], which gives convolutions access to their own input coordinates by adding extra coordinate channels. We do this by concatenating a CoordConv layer to the input 360 image (see Figure 2). This layer also helps stabilize the training process, as shown in Section 4.4.
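As an illustration of the CoordConv step, the following PyTorch sketch concatenates two normalized coordinate channels to a batch of equirectangular images; the normalization range and channel order are our assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the CoordConv idea: append (x, y) coordinate channels
# to the input panorama before the feature-extraction branch.
import torch

def add_coord_channels(image):
    """image: (B, C, H, W) tensor -> (B, C + 2, H, W) with coordinate channels."""
    b, _, h, w = image.shape
    ys = torch.linspace(-1.0, 1.0, h, device=image.device)   # latitude-like axis
    xs = torch.linspace(-1.0, 1.0, w, device=image.device)   # longitude-like axis
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([xx, yy]).unsqueeze(0).expand(b, -1, -1, -1)
    return torch.cat([image, coords], dim=1)

# Example: a batch of 360 images of size 128x256 gains two extra channels.
panorama = torch.rand(4, 3, 128, 256)
with_coords = add_coord_channels(panorama)   # shape (4, 5, 128, 256)
```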

3.5 Dataset and Training Details

We train our model using Sitzmann et al.’s [43] dataset, composed of 22 different 360 images and a total of 1,980 scanpaths from 169 different users. Each scanpath contains gaze information captured during 30 seconds with a binocular eye tracking recorder at 120 Hz. We sample these captured scanpaths at 1 Hz (i.e., 30 gaze points per scanpath), and reparameterize them (Section 3.1), so that each scanpath is a sequence of 30 three-dimensional points. Given the relatively small size of the dataset, we perform data augmentation by longitudinally shifting the 360 images (and adjusting their scanpaths accordingly); specifically, for each image we generate six different variations with random longitudinal shifting. We use 19 of the 22 images in this dataset for training, and reserve three to be part of our test set (more details on the full test set are described in Section 4). With the data augmentation process, this yields 114 images in the training set.
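A minimal NumPy sketch of this longitudinal-shift augmentation is shown below; the assumption that longitude increases with image column, and all function names, are ours for illustration.

```python
# Roll the equirectangular image by a random horizontal offset and shift the
# scanpath longitudes by the corresponding angle, wrapping around the borders.
import numpy as np

def shift_panorama_and_scanpath(image, longitudes, rng=np.random):
    """image: (H, W, 3) equirectangular image; longitudes: (T,) in [-pi, pi)."""
    h, w, _ = image.shape
    shift_px = rng.randint(w)                          # random horizontal offset
    shifted_image = np.roll(image, shift_px, axis=1)   # wrap around horizontally
    shift_rad = 2.0 * np.pi * shift_px / w             # column shift as an angle
    # Keep longitudes in [-pi, pi) after the shift (assumes longitude grows
    # with image column).
    shifted_lon = (longitudes + shift_rad + np.pi) % (2.0 * np.pi) - np.pi
    return shifted_image, shifted_lon
```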

During our training process we use the Adam optimizer [21], with constant learning rates for the generator and the discriminator, and fixed momentum parameters. Further training and implementation details can be found in the supplementary material.

Figure 3: Results of our model for two different scenes: market and mall from Rai et al.’s dataset [39]. From left to right: 360 image, ground truth sample scanpath, and three scanpaths generated by our model. The generated scanpaths are plausible and focus on relevant parts of the scene, yet they exhibit the diversity expected among different human observers. Please refer to the supplementary material for a larger set of results.

4 Validation and Analysis

We evaluate the quality of the generated scanpaths with respect to the measured, ground truth scanpaths, as well as to other approaches. We also ablate our model to illustrate the contribution of the different design choices.

We evaluate our model on two different test sets. First, we use the three images from Sitzmann et al.’s dataset [43] that were left out of training (Section 3.5): room, chess, and robots. Second, to assess our model’s ability to generalize, we also evaluate it on a different dataset from Rai et al. [39]. This dataset consists of 60 scenes watched by 40 to 42 observers for 25 seconds; thus, when comparing to their ground truth, we cut our 30-second scanpaths to the maximum length of their data. Please also refer to the supplementary material for more details on the test set, as well as further evaluation and results.

Figure 4: Qualitative comparison to previous methods for five different scenes from Rai et al.’s dataset. In each row, from left to right: 360 image, and a sample scanpath obtained with our method, PathGAN [3], SaltiNet [4], and Zhu et al.’s [62]. Note that, in the case of PathGAN, we are including the results directly taken from their paper, thus the different visualization. Our method produces plausible scanpaths focused on meaningful regions, in comparison with other techniques. Please see text for details, and the supplementary material for a larger set of results, also including ground truth scanpaths.

4.1 Scanpath Similarity Metrics

Our evaluation is both quantitative and qualitative. Evaluating scanpath similarity is not a trivial task, and a number of metrics have been proposed in the literature, each focused on a different context or aspect of gaze behavior [17]. Proposed metrics can be roughly categorized into: (i) direct measures based on Euclidean distance; (ii) string-based measures based on string alignment techniques (such as the Levenshtein distance, LEV); (iii) curve similarity methods; (iv) metrics from time-series analysis (like DTW, on which our loss function is based); and (v) metrics from recurrence analysis (e.g., the recurrence measure REC and the determinism measure DET). We refer the reader to the supplementary material and the review by Fahimi and Bruce [17] for an in-depth explanation and comparison of existing metrics. Here, we include a subset of metrics that take into account both the position and the ordering of the points (namely LEV and DTW), and two metrics from recurrence analysis (REC and DET), which have been reported to be discriminative in revealing viewing behaviors and patterns when comparing scanpaths. We nevertheless compute our evaluation for the full set of metrics reviewed by Fahimi and Bruce [17] in the supplementary material.

Since for each image we have a number of ground truth scanpaths, and a set of generated scanpaths, we compute each similarity metric for all possible pairwise comparisons (each generated scanpath against each of the ground truth scanpaths), and average the result. In order to provide an upper baseline for each metric, we also compute the human baseline (Human BL) [57], which is obtained by comparing each ground truth scanpath against all the other ground truth ones, and averaging the results. In a similar fashion, we compute a lower baseline based on sampling gaze points randomly over the image (Random BL).
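A minimal sketch of this evaluation protocol follows, where `metric` is a placeholder for any of the scanpath similarity functions used below (e.g., LEV or DTW); the helper names are ours for illustration.

```python
# Average a similarity metric over all (generated, ground-truth) pairs, and
# compute the human baseline by comparing ground-truth scanpaths among
# themselves.
import itertools
import numpy as np

def average_score(generated, ground_truth, metric):
    scores = [metric(g, gt) for g in generated for gt in ground_truth]
    return float(np.mean(scores))

def human_baseline(ground_truth, metric):
    scores = [metric(a, b)
              for a, b in itertools.permutations(ground_truth, 2)]
    return float(np.mean(scores))
```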

4.2 Results

Qualitative results of our model can be seen in Figures 3 and 1 for scenes with different layouts. Figure 3, from left to right, shows: the scene, a sample ground truth (captured) scanpath, and three of our generated scanpaths sampled from the generator. Our model is able to produce plausible, coherent scanpaths that focus on relevant parts of the scene. In the generated scanpaths we observe regions where the user focuses (points of a similar color clustered together), as well as more exploratory behavior. The generated scanpaths are diverse but plausible, as one would expect if different users watched the scene (the supplementary material contains more ground truth, measured scanpaths, showing this diversity). Further, our model is not affected by the inherent distortions of the 360 image. This is apparent, for example, in the market scene: The central corridor, narrow and seemingly featureless, is observed by generated virtual observers. Quantitative results in Table 1 further show that our generated scanpaths are close to the human baseline (Human BL), both in the test set from Sitzmann et al.’s dataset and over Rai et al.’s dataset. A value close to Human BL indicates that the generated scanpaths are as valid or as plausible as the captured, ground truth ones. Note that obtaining a value lower than Human BL is possible if the generated scanpaths are, on average, closer to the ground truth ones and exhibit less variance.

Since our model is generative, it can generate as many scanpaths as needed and model many different potential observers. We perform our evaluations on a random set of 100 scanpaths generated by our model. We choose this number to match the number of generated scanpaths available for competing methods, to perform a fair comparison. Nevertheless, we have analyzed the stability of our generative model by computing our evaluation metrics for a variable number of generated scanpaths: Our results are very stable with respect to the number of scanpaths (please see Table 4 in the supplementary material).

4.3 Comparison to Other Methods

We compare ScanGAN360 to three methods devoted to scanpath prediction in 360 images: SaltiNet-based scanpath prediction [2, 4] (we will refer to it as SaltiNet in the following), PathGAN [3] and Zhu et al.’s method [62]. For comparisons to SaltiNet we use the public implementation of the authors, while the authors of Zhu et al. kindly provided us with the results of their method for the images from Rai et al.’s dataset (but not for Sitzmann et al.’s); we therefore have both qualitative (Figure 4) and quantitative (Table 1) comparisons to these two methods. In the case of PathGAN, no model or implementation could be obtained, so we compare qualitatively to the results extracted from their paper (Figure 4, third column).

Table 1 shows that our model consistently provides results closer to the ground truth scanpaths than Zhu et al.’s and SaltiNet. The latter is based on a saliency-sampling strategy, and thus these results indicate that the temporal information learnt by our model is indeed relevant for the final result. Our model, as expected, also amply surpasses the random baseline. In Figure 4 we see how PathGAN scanpaths fail to focus on the relevant parts of the scene (see, e.g., snow or square), while SaltiNet exhibits a somewhat erratic behavior, with large displacements and scarce areas of focus (train, snow, or square show this). Finally, Zhu et al.’s approach tends to place gaze points at high-contrast borders (see, e.g., square or resort).

Figure 5: Qualitative ablation results. From top to bottom: basic GAN strategy (baseline); adding MSE to the loss function of the former; our approach; and an example ground truth scanpath. These results illustrate the need for our DTW loss term.

4.4 Ablation Studies

We also evaluate the contribution of different elements of our model to the final result. For this purpose, we analyze a standard GAN strategy (i.e., using only the discriminative loss) as the baseline. Figure 5 shows how this model is unable to learn both the temporal nature of the scanpaths and their relation to image features. We also analyze the results yielded by adding a term based on the MSE between the ground truth and the generated scanpath to the loss function, instead of our DTW term (the only previous GAN approach for scanpath generation [3] relied on MSE for its loss term). The MSE only measures a one-to-one correspondence between points, considering for each time instant a single point, unrelated to the rest. This hinders the learning process, leading to non-plausible results (Figure 5, second row). This behavior is corrected when our DTW term is added instead, since it is specifically targeted to time series data and takes into account the actual spatial structure of the data (Figure 5, third row). The corresponding quantitative measures over our test set from Sitzmann et al. can be found in Table 2. We also analyze the effect of removing the CoordConv layer from our model: Results in Table 2 indicate that the use of CoordConv does have a positive effect on the results, helping to learn the transformation from the input to the target domain.

Dataset                Method               LEV ↓    DTW ↓     REC ↑   DET ↑
Test set from          Random BL            52.33    2370.56   0.47    0.93
Sitzmann et al.        SaltiNet             48.00    1928.85   1.45    1.78
                       ScanGAN360 (ours)    46.15    1921.95   4.82    2.32
                       Human BL             43.11    1843.72   7.81    4.07
Rai et al.'s dataset   Random BL            43.11    1659.75   0.21    0.94
                       SaltiNet             48.07    1928.41   1.43    1.81
                       Zhu et al.           43.55    1744.20   1.64    1.50
                       ScanGAN360 (ours)    40.99    1549.59   1.72    1.87
                       Human BL             39.59    1495.55   2.33    2.31
Table 1: Quantitative comparisons of our model against SaltiNet [4] and Zhu et al. [62]. We also include upper (human baseline, Human BL) and lower (randomly sampling over the image, Random BL) baselines. Arrows indicate whether higher or lower is better; our model achieves the best result for each metric (excluding the ground truth Human BL). SaltiNet is trained with Rai et al.’s dataset; we include it for completeness.
Metric                      LEV ↓    DTW ↓     REC ↑   DET ↑
Basic GAN                   49.42    2088.44   3.01    1.74
MSE                         48.90    1953.21   2.41    1.73
DTW (no CoordConv)          47.82    1988.38   3.67    1.99
DTW (ours)                  46.19    1925.20   4.50    2.33
Human Baseline (Human BL)   43.11    1843.72   7.81    4.07
Table 2: Quantitative results of our ablation study. Arrows indicate whether higher or lower is better; our full model (DTW, ours) achieves the best result for each metric (excluding the ground truth Human BL). Please refer to the text for details on the ablated models.

4.5 Behavioral Evaluation

Figure 6: Behavioral evaluation. Left: Exploration time for real captured data (left) and scanpaths generated by our model (center left). Speed and exploration time of our scanpaths are on par with those of real users. Center right: ROC curve of our generated scanpaths for each individual test scene (gray), and averaged across scenes (magenta). The faster it converges to the maximum rate, the higher the inter-observer congruency. Right: Aggregate maps for two different scenes, computed as heatmaps from 1,000 generated scanpaths. Our model is able to produce aggregate maps that focus on relevant areas of the scenes and exhibit the equator bias reported in the literature.

While the previous subsections employ well-known metrics from the literature to analyze the performance of our model, in this subsection we perform a higher-level analysis of its results. We assess whether the behavioral characteristics of our scanpaths match those reported for actual users watching 360 images.

Exploration time

Sitzmann et al. [43] measure the exploration time as the average time that users took to move their eyes to a certain longitude relative to their starting point, and measure how long it takes for users to fully explore the scene. Figure 6 (left) shows this exploration time, measured by Sitzmann et al. from captured data, for the three scenes from their dataset included in our test set (room, chess, and robots). To analyze whether our generated scanpaths mimic this behavior and exploration speed, we plot the exploration time of our generated scanpaths (Figure 6, center left) for the same scenes and number of scanpaths. We can see how the speed and exploration time are very similar between real and generated data. Individual results per scene can be found in the supplementary material.

Fixation bias

Similar to the center bias of human eye fixations observed in regular images [20], the existence of a Laplacian-like equator bias has been measured in 360 images [43]: The majority of fixations fall around the equator, to the detriment of the poles. We have evaluated whether the distribution of scanpaths generated by our model also presents this bias. This is to be expected, since the data our model is trained with exhibits it, but it is yet another indicator that we have succeeded in learning the ground truth distribution. We test this by generating, for each scene, 1,000 different scanpaths with our model, and aggregating them over time to produce a pseudo-saliency map, which we term aggregate map. Figure 6 (right) shows this for two scenes in our test set: We can see how this equator bias is indeed present in our generated scanpaths.
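A simple way to build such an aggregate map is to accumulate gaze points into a latitude-longitude histogram and blur it into a pseudo-saliency map; the sketch below illustrates this (the resolution and blur width are arbitrary choices, not the authors' settings).

```python
# Accumulate many generated scanpaths into a blurred equirectangular heatmap.
import numpy as np
from scipy.ndimage import gaussian_filter

def aggregate_map(scanpaths, height=256, width=512, sigma=5.0):
    """scanpaths: list of (T, 2) arrays of (latitude, longitude) in radians."""
    hist = np.zeros((height, width))
    for sp in scanpaths:
        rows = ((sp[:, 0] + np.pi / 2) / np.pi * (height - 1)).astype(int)
        cols = ((sp[:, 1] + np.pi) / (2 * np.pi) * (width - 1)).astype(int)
        np.add.at(hist, (rows, cols), 1.0)       # accumulate gaze point counts
    return gaussian_filter(hist, sigma=sigma)    # blur into a pseudo-saliency map
```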

Inter-observer congruency

It is common in the literature analyzing users’ gaze behavior to measure inter-observer congruency, often by means of a receiver operating characteristic (ROC) curve. We compute the congruency of our “generated observers” through this ROC curve for the three scenes in our test set from the Sitzmann et al. dataset (Figure 6, center right). The curve measures the ability of a scanpath to predict the aggregate map of the corresponding scene. Each point in the curve is computed by generating a map containing a given percentage of the most salient regions of the aggregate map (computed without the scanpath under evaluation), and calculating the percentage of gaze points of the scanpath that fall into that map. Our ROC curve indicates strong agreement between our scanpaths, with a large fraction of all gaze points falling within the most salient regions. These values are comparable to those measured in previous studies with captured gaze data [43, 23].
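The following sketch illustrates this ROC construction; the threshold grid is an arbitrary choice, and in the paper the aggregate map is computed without the scanpath under evaluation.

```python
# For each saliency threshold, keep the top fraction of the aggregate map and
# measure the fraction of a scanpath's gaze points that fall inside it.
import numpy as np

def roc_curve(scanpath_rc, agg_map, fractions=np.linspace(0.0, 1.0, 21)):
    """scanpath_rc: (T, 2) integer (row, col) gaze points; agg_map: 2D saliency."""
    flat = np.sort(agg_map.ravel())[::-1]          # saliency values, descending
    hits = []
    for f in fractions:
        k = max(1, int(f * flat.size))
        thresh = flat[k - 1]                       # value of the k-th most salient pixel
        mask = agg_map >= thresh                   # top-f fraction of the map
        inside = mask[scanpath_rc[:, 0], scanpath_rc[:, 1]]
        hits.append(inside.mean())
    return fractions, np.array(hits)
```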

Temporal and spatial coherence

Our generated scanpaths have a degree of stochasticity, to be able to model the diversity of real human observers. However, human gaze behavior follows specific patterns, and each gaze point is conditioned not only by the features in the scene but also by the previous history of gaze points of the user. If two users start watching a scene in the same region, a certain degree of coherence between their scanpaths is expected, which may decrease as more time passes. We analyze the temporal coherence of generated scanpaths that start in the same region, and observe that our generated scanpaths do follow a coherent pattern. Please refer to the supplementary material for more information on this part of the analysis.

5 Conclusion

In summary, we propose ScanGAN360, a conditional GAN approach to generating gaze scanpaths for immersive virtual environments. Our parameterization tailored to panoramic content, coupled with our novel usage of a DTW loss function, allows our model to generate scanpaths of significantly higher quality and duration than previous approaches. We further explore applications of our model: Please refer to the supplementary material for a description and examples of these.

Our GAN approach is well suited for the problem of scanpath generation: A single ground truth scanpath does not exist, yet real scanpaths follow certain patterns that are difficult to model explicitly but that are automatically learned by our approach. Note that our model is also very fast and can produce about 1,000 scanpaths per second. This may be a crucial capability for interactive applications: our model can generate virtual observers in real time.

Limitations and future work

Our model is trained with 30-second-long scanpaths, sampled at 1 Hz. Although this is significantly longer than most previous approaches [16, 23, 27], exploring different or variable lengths or sampling rates remains interesting for future work. When training our model, we focus on learning higher-level aspects of visual behavior, and we do not explicitly enforce low-level ocular movements (e.g., fixations or saccades). Currently, our relatively low sampling rate prevents us from modeling very fast dynamic phenomena, such as saccades. Yet, fixation patterns naturally emerge in our results, and future work could explicitly take low-level oculomotor aspects of visual search into account.

The model, parameterization, and loss function are tailored to 360 images. In a similar spirit, a DTW-based loss function could also be applied to conventional 2D images (using a Euclidean distance in 2D instead of our spherical distance), potentially leading to better results than current 2D approaches based on mean-squared error.

We believe that our work is a timely effort and a first step towards understanding and modeling dynamic aspects of attention in 360 images. We hope that our work will serve as a basis to advance this research, both in virtual reality and in conventional imagery, and extend it to other scenarios, such as dynamic or interactive content, analyzing the influence of the task, including the presence of motion parallax, or exploring multimodal experiences. We will make our model and training code available in order to facilitate the exploration of these and other possibilities.

References

  • [1] Elena Arabadzhiyska, Okan Tarhan Tursun, Karol Myszkowski, Hans-Peter Seidel, and Piotr Didyk. Saccade landing position prediction for gaze-contingent rendering. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017.
  • [2] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Saltinet: Scan-path prediction on 360 degree images using saliency volumes. In Proceedings of the IEEE ICCV Workshops, pages 2331–2338, 2017.
  • [3] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Pathgan: visual scanpath prediction with generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • [4] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Scanpath and saliency prediction on 360 degree images. Signal Processing: Image Communication, 69:8–14, 2018.
  • [5] Wentao Bao and Zhenzhong Chen. Human scanpath prediction based on deep convolutional saccadic model. Neurocomputing, 404:154 – 164, 2020.
  • [6] Mathieu Blondel, Arthur Mensch, and Jean-Philippe Vert. Differentiable divergences between time series. arXiv preprint arXiv:2010.08354, 2020.
  • [7] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [8] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. Mit saliency benchmark. http://saliency.mit.edu/, 2019.
  • [9] Ying Cao, Rynson WH Lau, and Antoni B Chan. Look over here: Attention-directing composition of manga elements. ACM Trans. Graph., 33(4):1–11, 2014.
  • [10] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [11] Fang-Yi Chao, Lu Zhang, Wassim Hamidouche, and Olivier Deforges. Salgan360: Visual saliency prediction on 360 degree images with generative adversarial networks. In 2018 IEEE Int. Conf. on Multim. & Expo Workshops (ICMEW), pages 01–04. IEEE, 2018.
  • [12] Alex Colburn, Michael F Cohen, and Steven Drucker. The role of eye gaze in avatar mediated conversational interfaces. Technical report, Citeseer, 2000.
  • [13] Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proc. of the European Conference on Computer Vision (ECCV), pages 518–533, 2018.
  • [14] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142–5154, 2018.
  • [15] Marco Cuturi and Mathieu Blondel. Soft-dtw: a differentiable loss function for time-series. arXiv preprint arXiv:1703.01541, 2017.
  • [16] Stephen R Ellis and James Darrell Smith. Patterns of statistical dependency in visual scanning. Eye movements and human information processing, pages 221–238, 1985.
  • [17] Ramin Fahimi and Neil DB Bruce. On metrics for measuring scanpath similarity. Behavior Research Methods, pages 1–20, 2020.
  • [18] Kaye Horley, Leanne M Williams, Craig Gonsalvez, and Evian Gordon. Face to face: visual scanpath evidence for abnormal processing of facial expressions in social phobia. Psychiatry research, 127(1-2):43–53, 2004.
  • [19] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
  • [20] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In IEEE ICCV, pages 2106–2113. IEEE, 2009.
  • [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014. Last updated in arXiv in 2017.
  • [22] Matthias Kümmerer, Thomas S. A. Wallis, and Matthias Bethge. Deepgaze ii: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.
  • [23] O. Le Meur and T. Baccino. Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods, pages 251–266, 2013.
  • [24] Olivier Le Meur and Zhi Liu. Saccadic model of eye movements for free-viewing condition. Vision Research, 116:152 – 164, 2015.
  • [25] Chenge Li, Weixi Zhang, Yong Liu, and Yao Wang. Very long term field of view prediction for 360-degree video streaming. In 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 297–302. IEEE, 2019.
  • [26] Suiyi Ling, Jesús Gutiérrez, Ke Gu, and Patrick Le Callet. Prediction of the influence of navigation scan-path on perceived quality of free-viewpoint videos. IEEE Journal on Emerging and Sel. Topics in Circ. and Sys., 9(1):204–216, 2019.
  • [27] Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, and Stephen Lin. Semantically-based human scanpath estimation with hmms. In Proceedings of the IEEE International Conference on Computer Vision, pages 3232–3239, 2013.
  • [28] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Neural Information Processing Systems, pages 9605–9616, 2018.
  • [29] Y. Lu, W. Zhang, C. Jin, and X. Xue. Learning attention map from images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [30] Daniel Martin, Sandra Malpica, Diego Gutierrez, Belen Masia, and Ana Serrano. Multimodality in VR: A survey. arXiv preprint arXiv:2101.07906, 2021.
  • [31] Daniel Martin, Ana Serrano, and Belen Masia. Panoramic convolutions for single-image saliency prediction. In CVPR Workshop on CV for AR/VR, 2020.
  • [32] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [33] Rafael Monroy, Sebastian Lutz, Tejo Chalasani, and Aljosa Smolic. Salnet360: Saliency maps for omni-directional images with cnn. Signal Processing: Image Communication, 69:26 – 34, 2018.
  • [34] Meinard Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
  • [35] Anh Nguyen, Zhisheng Yan, and Klara Nahrstedt. Your attention is unique: Detecting 360-degree video saliency in head-mounted display for head movement prediction. In Proc. ACM Intern. Conf. on Multimedia, pages 1190–1198, 2018.
  • [36] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O’Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto. SalGAN: Visual saliency prediction with generative adversarial networks. 2018.
  • [37] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E. O’Connor. Shallow and deep convolutional networks for saliency prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [38] Xufang Pang, Ying Cao, Rynson WH Lau, and Antoni B Chan. Directing user attention via visual flow on web designs. ACM Trans. on Graph., 35(6):1–11, 2016.
  • [39] Yashas Rai, Jesús Gutiérrez, and Patrick Le Callet. A dataset of head and eye movements for 360 degree images. In Proceedings of the 8th ACM on Multimedia Systems Conference, pages 205–210, 2017.
  • [40] Kerstin Ruhland, Christopher E Peters, Sean Andrist, Jeremy B Badler, Norman I Badler, Michael Gleicher, Bilge Mutlu, and Rachel McDonnell. A review of eye gaze in virtual agents, social robotics and hci: Behaviour generation, user interaction and perception. In Computer graphics forum, volume 34, pages 299–326. Wiley Online Library, 2015.
  • [41] Matan Sela, Pingmei Xu, Junfeng He, Vidhya Navalpakkam, and Dmitry Lagun. Gazegan-unpaired adversarial image generation for gaze estimation. arXiv preprint arXiv:1711.09767, 2017.
  • [42] Ana Serrano, Vincent Sitzmann, Jaime Ruiz-Borau, Gordon Wetzstein, Diego Gutierrez, and Belen Masia. Movie editing and cognitive event segmentation in virtual reality video. ACM Trans. Graph. (SIGGRAPH), 36(4), 2017.
  • [43] Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wetzstein. Saliency in VR: How do people explore virtual environments? IEEE Trans. on Vis. and Comp. Graph., 24(4):1633–1642, 2018.
  • [44] Mikhail Startsev and Michael Dorr. 360-aware saliency estimation with conventional image saliency predictors. Signal Proces.: Image Comm., 69:43–52, 2018.
  • [45] Yu-Chuan Su and Kristen Grauman. Making 360 video watchable in 2d: Learning videography for click free viewing. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1368–1376. IEEE, 2017.
  • [46] Yu-Chuan Su, Dinesh Jayaraman, and Kristen Grauman. Pano2vid: Automatic cinematography for watching 360 videos. In Asian Conf. on CV, pages 154–171. Springer, 2016.
  • [47] Benjamin W Tatler and Benjamin T Vincent. The prominence of behavioural biases in eye guidance. Visual Cognition, 17(6-7):1029–1054, 2009.
  • [48] Hamed Rezazadegan Tavakoli, Esa Rahtu, and Janne Heikkilä. Stochastic bottom–up fixation prediction and saccade generation. Image and Vision Computing, 31(9):686–693, 2013.
  • [49] Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological review, 113(4):766, 2006.
  • [50] Eleonora Vig, Michael Dorr, and David Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [51] LE Vincent and Nicolas Thome. Shape and time distortion loss for training deep time series forecasting models. In Advances in neural information processing systems, pages 4189–4201, 2019.
  • [52] Dirk Walther and Christof Koch. Modeling attention to salient proto-objects. Neural Networks, 19:1395–1407, 2006.
  • [53] Wenguan Wang and Jianbing Shen. Deep visual attention prediction. IEEE Transactions on Image Processing, 27(5):2368–2378, 2017.
  • [54] W. Wang and J. Shen. Deep visual attention prediction. IEEE Transactions on Image Processing, 27(5):2368–2378, 2018.
  • [55] Wenguan Wang, Jianbing Shen, Xingping Dong, and Ali Borji. Salient object detection driven by fixation prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [56] Chenglei Wu, Ruixiao Zhang, Zhi Wang, and Lifeng Sun. A spherical convolution approach for learning long term viewport prediction in 360 immersive video. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 14003–14040, 2020.
  • [57] Chen Xia, Junwei Han, Fei Qi, and Guangming Shi. Predicting human saccadic scanpaths based on iterative representation learning. IEEE Transactions on Image Processing, 28(7):3502–3515, 2019.
  • [58] M. Xu, Y. Song, J. Wang, M. Qiao, L. Huo, and Z. Wang. Predicting head movement in panoramic video: A deep reinforcement learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2693–2708, 2019.
  • [59] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3166–3173. IEEE, 2013.
  • [60] Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J Zelinsky, and Tamara L Berg. Exploring the role of gaze behavior and object detection in scene understanding. Frontiers in Psychology, 4:917, 2013.
  • [61] Qi Zhao and Christof Koch. Learning a saliency map using fixated locations in natural scenes. Journal of Vision, 11:9, 2011.
  • [62] Yucheng Zhu, Guangtao Zhai, and Xiongkuo Min. The prediction of head and eye movement for 360 degree images. Signal Processing: Image Communication, 69:15–25, 2018.

Supplementary Material

This document offers additional information and details on the following topics:

  • (S1) Extended description of the soft-DTW (differentiable version of DTW) distance metric used in our model.

  • (S2) Additional results (scanpaths generated with our method) for different scenes used in our evaluation in the main paper.

  • (S3) Additional ground truth scanpaths for the scenes used in our evaluation in the main paper.

  • (S4) Further details on our training process.

  • (S5) Further details on metrics and evaluation, including a larger set of metrics (which we briefly introduce), and extended analysis.

  • (S6) Further details on the behavioral evaluation of our scanpaths.

  • (S7) Example applications of our method.

S1 Differentiable Dynamic Time Warping: soft-DTW

One of the key aspects of our framework lies in the addition of a second term to the generator’s loss function, based on dynamic time warping [34]. As noted in Section 3.3 in the main paper, dynamic time warping (DTW) measures the similarity between two temporal sequences (see Figure 7 and https://databricks.com/blog/2019/04/30/understanding-dynamic-time-warping.html; see also Equation 4 in the main paper for the original DTW formulation, and Equations 5 and 6 for our spherical modification of DTW). However, the original DTW function is not differentiable, and therefore it is not suitable as a loss function. Instead, we use a differentiable version of it, soft-DTW, which has been recently proposed [15] and used as a loss function in different problems dealing with time series [6, 10, 51].

Differently from the original DTW formulation (Equation 4 in the main paper), soft-DTW is defined as follows:

$$\mathrm{soft\text{-}DTW}^{\gamma}(r, s) = {\min}^{\gamma}_{A \in \mathcal{A}} \; \langle A, \Delta(r, s) \rangle, \qquad (8)$$

where, as with traditional DTW, $\Delta(r, s)$ is a matrix containing the distances between each pair of points in $r$ and $s$, $A \in \mathcal{A}$ is a binary matrix that accounts for the alignment (or correspondence) between $r$ and $s$, and $\langle \cdot, \cdot \rangle$ is the inner product between both matrices. In our case, $r$ and $s$ are two scanpaths that we wish to compare.

The main difference lies in the replacement of the $\min$ operator with the soft-min operator ${\min}^{\gamma}$, which is defined as follows:

$${\min}^{\gamma}(a_1, \dots, a_n) = \begin{cases} \min_i a_i, & \gamma = 0, \\ -\gamma \log \sum_i e^{-a_i/\gamma}, & \gamma > 0. \end{cases} \qquad (9)$$

This soft-min function makes DTW differentiable, with the parameter $\gamma$ adjusting the similarity between the soft implementation and the original DTW algorithm, both being the same when $\gamma = 0$.
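A minimal NumPy sketch of this soft-min operator (names are ours for illustration):

```python
# Soft-min of Eq. (9); gamma = 0 recovers the hard minimum of the original DTW.
import numpy as np
from scipy.special import logsumexp

def soft_min(values, gamma):
    values = np.asarray(values, dtype=float)
    if gamma == 0.0:
        return values.min()
    return -gamma * logsumexp(-values / gamma)
```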

Figure 7: Simple visualization of dynamic time warping (DTW) alignment. Instead of assuming a pair-wise strict correspondence, DTW optimizes the alignment between two sequences to minimize their distance.

S2 Additional Results

We include in this section a more extended set of results. First, we include results for the scenes room (see Figures 17 to 20), chess (see Figures 21 to 24), and robots (see Figures 25 to 28) from the Sitzmann et al. dataset [43]. Then, we include results for the five scenes from the Rai et al. dataset [39] used in comparisons throughout the main paper: train (see Figures 29 to 32), resort (see Figures 33 to 36), square (see Figures 37 to 40), snow (see Figures 41 to 44), and museum (see Figures 45 to 48).

S3 Ground Truth Scanpaths for Comparison Scenes

We include in Figures 49 to 53 sets of ground truth scanpaths for all the images shown in Figure 4 in the main paper, which is devoted to comparisons of our method against other models; and in Figures 54 to 56 sets of ground truth scanpaths for the three images from our test set from Sitzmann et al.’s dataset.

S4 Additional Details on our Training Process

In addition to the details given in Section 3.5 in the main paper, our generator is updated twice per discriminator update, to prevent the discriminator from overpowering the generator. To enhance the training process, we also resort to a mini-batching strategy: Instead of inputting to our model a set containing all available scanpaths for a given image, we split our data into mini-batches of eight scanpaths each. This way, the same image is input to our network multiple times per epoch, also allowing more images to be included in the same batch, and therefore enhancing the training process. We trained our model for 217 epochs, as we found that epoch to yield the best evaluation results.

S5 Additional Details on Metrics and Evaluation

Throughout this work, we evaluate our model and compare to state-of-the-art works by means of several widely used metrics, recently reviewed by Fahimi and Bruce [17]. Table 3 shows a list of these metrics, indicating which ones take into account position and/or order of gaze points. In the following, we briefly introduce these metrics (please refer to Fahimi and Bruce [17] for a formal description):

  • Levenshtein distance: Transforms scanpaths into strings, and then calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string (scanpath) into the other. All edit costs are treated equally (a minimal sketch is shown after this list).

  • ScanMatch: An improved version of the Levenshtein distance. Unlike the Levenshtein distance, ScanMatch takes into account semantic information (as a score matrix), and can even take into account the duration of data points. This way, each of the edit operations can be weighted differently.

  • Hausdorff distance: Represents the degree of mismatch between two sets by measuring the farthest spatial distance from one set to the other, i.e., the distance between two different curves.

  • Frechet distance: Similar to the Hausdorff distance, it measures the similarity between curves. However, the Frechet distance takes into account both the position and ordering of all the points in the curves.

  • Dynamic time warping: Metric that compares two time-series with varying (and differing) lengths to find an optimal path to match both sequences while preserving boundary, continuity, and monotonicity to make sure that the path respects time.

  • Time delay embedding: Splits a scanpath into several sub-samples, i.e., small sub-scanpaths. This metric calculates a similarity score by performing several pair-wise Hausdorff comparisons over sub-samples from the two scanpaths being compared.

  • Recurrence: Measures the percentage of gaze points that match (are close) between the two scanpaths.

  • Determinism: Percentage of cross-recurrent points that form diagonal lines (i.e., percentage of gaze trajectories common to both scanpaths).

  • Laminarity: Measures locations that were fixated in detail in one of the scanpaths, but only fixated briefly in the other scanpath. This way, it indicates whether specific areas of a scene are repeatedly fixated.

  • Center of recurrence mass: Defined as the distance of the center of gravity from the main diagonal, it indicates the dominant lag of cross recurrences, i.e., whether the same gaze point in both scanpaths tends to occur close in time.
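As an illustration of the first metric in the list above, the following sketch quantizes two scanpaths into strings of grid-cell labels and computes the standard edit distance with unit costs; the grid resolution is an arbitrary choice, not the setting used in our evaluation.

```python
# String-based Levenshtein comparison of two scanpaths.
import numpy as np

def to_string(scanpath, rows=6, cols=12):
    """scanpath: (T, 2) array of (latitude, longitude) in radians -> cell labels."""
    r = ((scanpath[:, 0] + np.pi / 2) / np.pi * (rows - 1)).round().astype(int)
    c = ((scanpath[:, 1] + np.pi) / (2 * np.pi) * (cols - 1)).round().astype(int)
    return [ri * cols + ci for ri, ci in zip(r, c)]

def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions (unit costs)."""
    dp = np.arange(len(b) + 1)                  # single-row dynamic programming
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return int(dp[-1])
```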

Metric                       Abbreviation
Levenshtein distance         LEV
ScanMatch                    SMT
Hausdorff distance           HAU
Frechet distance             FRE
Dynamic time warping         DTW
Time delay embedding         TDE
Recurrence                   REC
Determinism                  DET
Laminarity                   LAM
Center of recurrence mass    COR
Table 3: Set of metrics to quantitatively evaluate scanpath similarity [17]. Each metric specializes in specific aspects of the scanpaths, and as a result using any of them in isolation may not be representative.

Our model is stochastic by nature. This means that the scanpaths it generates for a given scene are always different, simulating observer variability. We have analyzed whether the reported metrics vary depending on the number of scanpaths generated, to assess the stability and overall goodness of our model. Results can be seen in Table 4.

Dataset                # of samples   LEV ↓    DTW ↓     REC ↑   DET ↑
Test set from          100            46.19    1925.20   4.50    2.33
Sitzmann et al.        800            46.10    1916.26   4.75    2.34
                       2500           46.15    1921.95   4.82    2.32
                       Human BL       43.11    1843.72   7.81    4.07
Rai et al.'s dataset   100            40.95    1548.86   1.91    1.85
                       800            40.94    1542.82   1.86    1.86
                       2500           40.99    1549.59   1.72    1.87
                       Human BL       39.59    1495.55   2.33    2.31
Table 4: Quantitative results of our model for sets of generated scanpaths with different numbers of samples. Our results are stable regardless of the number of generated samples.

We include in Table 5 the evaluation results with the full set of metrics shown in Table 3 (extension to Table 1 in the main paper), and in Tables 6 and 7 the evaluation results of our ablation studies over the full set of metrics (extension to Table 2 in the main paper).

Dataset            Method              LEV    SMT   HAU    FRE     DTW      TDE    REC   DET   LAM    CORM
Test set from      Random BL           52.33  0.22  59.88  146.39  2370.56  27.93  0.47  0.93   9.19  33.19
Sitzmann et al.    SaltiNet            48.00  0.18  64.23  149.34  1928.85  28.19  1.45  1.78  10.45  29.23
                   ScanGAN360 (ours)   46.15  0.39  43.28  141.23  1921.95  18.62  4.82  2.32  24.51  35.78
                   Human BL            43.11  0.43  41.38  142.91  1843.72  16.05  7.81  4.07  24.69  35.32
Rai et al.'s       Random BL           43.11  0.17  65.71  144.73  1659.75  35.41  0.21  0.94   4.30  19.08
dataset            SaltiNet (*)        48.07  0.18  63.86  148.76  1928.41  28.42  1.43  1.81  10.22  29.33
                   Zhu et al.          43.55  0.20  73.09  136.37  1744.20  30.62  1.64  1.50   9.18  26.05
                   ScanGAN360 (ours)   40.99  0.24  61.86  139.10  1549.59  28.14  1.72  1.87  12.23  26.15
                   Human BL            39.59  0.24  66.23  136.70  1495.55  27.24  2.33  2.31  14.36  23.14
Table 5: Quantitative comparison of our model against different approaches, following the metrics introduced in Table 3. We evaluate our model over the test set held out from Sitzmann et al.'s dataset, and compare against SaltiNet [2]. We additionally validate our model over Rai et al.'s dataset, comparing against Zhu et al. [62], whose results over this dataset were provided by the authors, and against SaltiNet, which was trained over that specific dataset (*). Human BL denotes the human baseline, computed with the set of ground truth scanpaths. We also include a lower baseline (Random BL), computed by randomly sampling over the image.
Method               LEV    SMT   HAU    FRE     DTW      TDE    REC   DET   LAM    CORM
Basic GAN            49.42  0.36  43.69  145.95  2088.44  20.05  3.01  1.74  18.55  34.51
MSE                  48.90  0.37  42.27  133.24  1953.21  19.48  2.41  1.73  18.47  37.34
DTW (no CoordConv)   47.82  0.37  46.59  144.92  1988.38  20.13  3.67  1.99  18.09  35.66
DTW (ours)           46.15  0.39  43.28  141.23  1921.95  18.62  4.82  2.32  24.21  35.78
Human BL             43.11  0.43  41.38  142.91  1843.72  16.05  7.81  4.07  24.69  35.32
Table 6: Results of our ablation study over Sitzmann et al.'s test set. We take a basic GAN strategy as the baseline and evaluate the effect of adding a second term to our generator's loss function: we ablate a model with an MSE term (as used in the only previous GAN approach for scanpath generation [3]) and compare it against our spherical DTW approach. We also analyze the importance of the CoordConv layer, whose absence slightly worsens the results. See Section 4 in the main paper for further discussion; qualitative results of this ablation study are shown in Figure 5 in the main paper.
Method               LEV    SMT   HAU    FRE     DTW      TDE    REC   DET   LAM    CORM
Basic GAN            41.73  0.23  59.11  142.42  1542.52  28.40  0.99  1.47  8.08   24.55
MSE                  41.81  0.23  61.30  139.59  1541.44  28.66  1.01  1.51  8.56   24.45
DTW (no CoordConv)   41.42  0.23  61.55  148.13  1610.10  28.78  1.61  1.65  10.25  24.68
DTW (ours)           40.99  0.24  61.86  139.10  1549.59  28.14  1.72  1.87  12.23  26.15
Human BL             39.59  0.24  66.23  136.70  1495.55  27.24  2.33  2.31  14.36  23.14
Table 7: Results of our ablation study over Rai et al.'s dataset. We take a basic GAN strategy as the baseline and evaluate the effect of adding a second term to our generator's loss function: we ablate a model with an MSE term (as used in the only previous GAN approach for scanpath generation [3]) and compare it against our spherical DTW approach. We also analyze the importance of the CoordConv layer, whose absence slightly worsens the results. See Section 4 in the main paper for further discussion; qualitative results of this ablation study are shown in Figure 5 in the main paper.

Images for one of our test sets belong to Rai et al.'s dataset [39]. This dataset is larger than Sitzmann et al.'s in size (number of images), but provides gaze data in the form of fixations with associated timestamps rather than raw gaze points. Note that most of the metrics proposed in the literature for scanpath similarity are designed to work with time series of different lengths, and do not necessarily assume a direct pairwise equivalence, making them suitable for comparing our generated scanpaths to the ground-truth ones from Rai et al.'s dataset.
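Since DTW aligns two sequences of possibly different lengths, it illustrates this point well. Below is a minimal sketch of a DTW variant whose per-step cost is the orthodromic (great-circle) distance on the sphere; the function names and the (lat, lon)-in-radians parameterization are assumptions of this sketch rather than our exact formulation.

import numpy as np

def orthodromic(p, q):
    # Great-circle distance between two gaze points given as (lat, lon) in radians.
    return np.arccos(np.clip(
        np.sin(p[0]) * np.sin(q[0]) +
        np.cos(p[0]) * np.cos(q[0]) * np.cos(p[1] - q[1]), -1.0, 1.0))

def spherical_dtw(sp1, sp2):
    # Classic dynamic-programming DTW; sp1 and sp2 may have different lengths.
    n, m = len(sp1), len(sp2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = orthodromic(sp1[i - 1], sp2[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

When a differentiable version is required for training, a soft relaxation such as soft-DTW [15] is typically employed instead of this plain dynamic-programming form.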

S6 Behavioral Evaluation

In this section, we include further analysis and additional details on behavioral aspects of our scanpaths, extending Section 4.5 in the main paper.

Temporal and spatial coherence

As discussed in the main paper, our generated scanpaths have a degree of stochasticity, and different patterns arise depending on the users' previous history. To assess whether our scanpaths actually follow a coherent pattern, we generate a large set of scanpaths for each of the scenes in our test dataset and separate them according to the longitudinal region where each scanpath begins. Then, for each timestamp, we estimate the probability density of the generated scanpaths from each starting region using kernel density estimation (KDE). We include the complete KDE results for the three images from our test set in Figures 11 to 16, for different starting regions and timestamps, computed over 1000 generated scanpaths. During the first seconds (first column), gaze points tend to stay in a smaller area, close to the starting region; as time progresses, they exhibit a more exploratory behavior with higher divergence, and eventually may converge near regions of interest. We can also see how the behavior differs depending on the starting region.
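A minimal sketch of this per-timestamp density estimation is shown below. The array shapes, the grouping of scanpaths by starting longitude, and the use of scipy's gaussian_kde directly on (lat, lon) coordinates (which ignores the spherical topology and the longitude wrap-around) are simplifying assumptions of the sketch.

import numpy as np
from scipy.stats import gaussian_kde

def kde_per_timestamp(scanpaths, start_bins, grid):
    # scanpaths: array (S, T, 2) of (lat, lon) gaze points for S generated scanpaths.
    # start_bins: longitude bin edges used to group scanpaths by their starting region.
    # grid: array (2, G) of evaluation points; returns a density per (region, timestamp).
    start_lon = scanpaths[:, 0, 1]
    region = np.digitize(start_lon, start_bins)
    densities = {}
    for r in np.unique(region):
        group = scanpaths[region == r]            # scanpaths starting in region r
        for t in range(scanpaths.shape[1]):
            pts = group[:, t, :].T                # (2, S_r) gaze points at timestamp t
            kde = gaussian_kde(pts)               # assumes enough, non-degenerate samples
            densities[(r, t)] = kde(grid)
    return densities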

Exploration time

As introduced in the main paper, we also explore the time that users took to move their eyes to a certain longitude relative to their starting point, and measure how long it takes for users to fully explore the scene. We include in Figure 8 the comparison between ground-truth and generated scanpaths in terms of time to explore the scene, for all three scenes from our test set (room, chess, and robots), both individually and aggregated. The speed and exploration time are very similar between real and generated data.

Figure 8: Time to explore each of the scenes from the Sitzmann et al. test set, for our generated scanpaths together with their ground-truth counterparts.
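As an illustration of the exploration-time measurement discussed above, the sketch below computes, for a single scanpath, the first time at which the gaze longitude deviates from its starting longitude by at least a given offset; the sampling step dt, the degree parameterization, and the simple wrap-around handling are assumptions of this sketch.

import numpy as np

def time_to_reach(scanpath, offsets_deg, dt):
    # scanpath: array (T, 2) of (lat, lon) in degrees, sampled every dt seconds.
    lon = scanpath[:, 1]
    rel = (lon - lon[0] + 180.0) % 360.0 - 180.0          # longitude offset wrapped to [-180, 180)
    times = {}
    for off in offsets_deg:
        hits = np.nonzero(np.abs(rel) >= off)[0]
        times[off] = hits[0] * dt if hits.size else None  # None: offset never reached
    return times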

S7 Applications of the Model

Our model is able to generate plausible 30-second scanpaths, drawn from a distribution that mimics the behavior of human observers. As we briefly discuss throughout the paper, this enables a number of applications, starting with avoiding the need to recruit and measure gaze from large numbers of observers in certain scenarios. We show here two applications of our model, virtual scene design and scanpath-driven video thumbnail creation for static 360 images, and discuss other potential application scenarios.

Virtual scene design

In an immersive environment, the user has control over the camera when exploring it. This poses a challenge to content creators and designers, who have to learn from experience how to lay out a scene to elicit a specific viewing or exploration behavior. This is not only a problem in VR, but has also received attention in, e.g., manga composition [9] or web design [38]. However, measuring gaze from a sufficiently large number of users to determine optimal layouts can be challenging and time-consuming. While certain goals may require real users, others can make use of our model to generate plausible, realistic virtual observers.

As a proof of concept, we have analyzed our model's ability to adapt its behavior to different layouts of a scene (Figure 9). Specifically, we have removed certain elements from a scene and run our model to analyze whether these changes affect the behavior of the generated scanpaths. We plot the resulting probability density (using KDE, see Section S6) as a function of time. The presence of different elements in the scene affects the general viewing behavior, including the viewing direction or the time spent on a certain region. These examples are particularly promising if we consider that our model is trained with a relatively small number of generic scenes.

Figure 9: Our model can be used to aid the design of virtual scenes. We show two examples, each with two possible layouts (original, and with some significant elements removed). We generate a large number of scanpaths (virtual observers) starting from the same region, and compute the corresponding probability density as a function of time using KDE (see Section S6). room scene: The presence of the dining table and lamps (top) retains the viewers' attention longer, while in their absence they move faster towards the living room area, performing a more linear exploration. gallery scene: When the central picture is present (top), the viewers linger there before splitting to both sides of the scene. In its absence, observers move towards the left, then explore the scene linearly in that direction.

Scanpath-driven video thumbnails of static 360 images

360 images capture the full sphere and are thus unintuitive when projected into a conventional 2D image. To address this problem, a number of approaches have proposed to retarget 360 images or videos to 2D [46, 43, 45]. In the case of images, extracting a representative 2D visualization of the 360 image can be helpful to provide a thumbnail of it, for example as a preview on a social media platform. However, these thumbnails are static. The Ken Burns effect can be used to animate a static image by panning and zooming a cropping window over it. In the context of 360 images, however, it is not obvious what the trajectory of such a moving window should be.

To address this question, we leverage our generated scanpaths to drive a Ken Burns–like video thumbnail of a static panorama. For this purpose, we use an average scanpath, computed from the probability density of several generated scanpaths via KDE (see Section S6), as the trajectory of the virtual camera. Specifically, KDE allows us to find, at any point in time, the point of highest probability of all generated scanpaths, along with its variance; note that this point is not necessarily the average of the scanpaths. We use the time-varying point of highest probability as the center of our 2D viewport, and its variance to drive the field of view (zoom) of the moving viewport.
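A minimal sketch of how such a camera trajectory could be derived from a set of generated scanpaths is given below. The mapping from gaze-point spread to field of view, the candidate grid, and the array shapes are illustrative assumptions rather than our exact procedure, and the planar treatment of (lat, lon) ignores the longitude wrap-around.

import numpy as np
from scipy.stats import gaussian_kde

def thumbnail_trajectory(scanpaths, grid, fov_range=(40.0, 90.0)):
    # scanpaths: array (S, T, 2) of (lat, lon) in degrees; grid: (2, G) candidate points.
    # Returns per-timestamp viewport centers (KDE modes) and fields of view.
    centers, fovs = [], []
    for t in range(scanpaths.shape[1]):
        pts = scanpaths[:, t, :].T                 # (2, S) gaze points at timestamp t
        kde = gaussian_kde(pts)
        mode = grid[:, np.argmax(kde(grid))]       # point of highest probability
        spread = np.sqrt(pts.var(axis=1).sum())    # rough dispersion of the gaze points
        # Tight agreement -> small FOV (zoom in); large spread -> wide FOV (zoom out).
        alpha = np.clip(spread / 90.0, 0.0, 1.0)
        fovs.append(fov_range[0] + alpha * (fov_range[1] - fov_range[0]))
        centers.append(mode)
    return np.array(centers), np.array(fovs)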

Figure 10 shows several representative steps of this process for two different scenes (chess and street). Full videos of several scenes are included in the supplementary video. The generated Ken Burns–style panorama previews look like a human observer exploring these panoramas, and provide an intuitive preview of the complex scenes they depict.

Figure 10: Scanpath-driven video thumbnails of 360 images. We propose a technique to generate these videos that results in relevant and intuitive explorations of the 360 scenes. Top row: Points of highest probability at each time instant, displayed as scanpaths. These are used as a guiding trajectory for the virtual camera. Middle rows: Two viewports from the guiding trajectory, corresponding to the temporal window with lowest variance. Bottom row: 2D images retargeted from those viewports. Please refer to the text for details.

Other applications

Our model has the potential to enable other applications beyond what we have shown in this section. One such example is gaze simulation for virtual avatars. When displaying or interacting with virtual characters, eye gaze is one of the most critical, yet most difficult, aspects to simulate [40]. Accurately simulating gaze behavior not only aids in conveying realism, but can also provide additional information such as signalling interest, aiding the conversation through non-verbal cues, facilitating turn-taking in multi-party conversations, or indicating attentiveness, among others. Given an avatar immersed within a virtual scene, generating plausible scanpaths conditioned by a 360 image of their environment could be an efficient, affordable way of driving the avatar’s gaze behavior in a realistic manner.

Another potential application of our model is gaze-contingent rendering. Such approaches have been proposed to save rendering time and bandwidth in VR systems, or to drive the user's accommodation. Eye trackers are required for these applications, but they are often too slow, making computationally efficient approaches for predicting gaze trajectories or landing positions important [1]. Our method for generating scanpaths could not only help prototype and evaluate such systems in simulation, without the need for a physical eye tracker and actual users, but could also help optimize their latency and performance at runtime.

Figure 11: KDE for the room scene, for the first range of starting longitudes.
Figure 12: KDE for the room scene, for the second range of starting longitudes.
Figure 13: KDE for the chess scene, for the first range of starting longitudes.
Figure 14: KDE for the chess scene, for the second range of starting longitudes.
Figure 15: KDE for the robots scene, for the first range of starting longitudes.
Figure 16: KDE for the robots scene, for the second range of starting longitudes.

Figure 17: Generated scanpaths for the room scene.
Figure 18: Generated scanpaths for the room scene.
Figure 19: Generated scanpaths for the room scene.
Figure 20: Generated scanpaths for the room scene.
Figure 21: Generated scanpaths for the chess scene.
Figure 22: Generated scanpaths for the chess scene.
Figure 23: Generated scanpaths for the chess scene.
Figure 24: Generated scanpaths for the chess scene.
Figure 25: Generated scanpaths for the robots scene.
Figure 26: Generated scanpaths for the robots scene.
Figure 27: Generated scanpaths for the robots scene.
Figure 28: Generated scanpaths for the robots scene.
Figure 29: Generated scanpaths for the train scene.
Figure 30: Generated scanpaths for the train scene.
Figure 31: Generated scanpaths for the train scene.
Figure 32: Generated scanpaths for the train scene.
Figure 33: Generated scanpaths for the resort scene.
Figure 34: Generated scanpaths for the resort scene.
Figure 35: Generated scanpaths for the resort scene.
Figure 36: Generated scanpaths for the resort scene.
Figure 37: Generated scanpaths for the square scene.
Figure 38: Generated scanpaths for the square scene.
Figure 39: Generated scanpaths for the square scene.
Figure 40: Generated scanpaths for the square scene.
Figure 41: Generated scanpaths for the snow scene.
Figure 42: Generated scanpaths for the snow scene.
Figure 43: Generated scanpaths for the snow scene.
Figure 44: Generated scanpaths for the snow scene.
Figure 45: Generated scanpaths for the museum scene.
Figure 46: Generated scanpaths for the museum scene.
Figure 47: Generated scanpaths for the museum scene.
Figure 48: Generated scanpaths for the museum scene.
Figure 49: Ground truth scanpaths for the train scene.
Figure 50: Ground truth scanpaths for the resort scene.
Figure 51: Ground truth scanpaths for the snow scene.
Figure 52: Ground truth scanpaths for the museum scene.
Figure 53: Ground truth scanpaths for the square scene.
Figure 54: Ground truth scanpaths for the room scene.
Figure 55: Ground truth scanpaths for the chess scene.
Figure 56: Ground truth scanpaths for the robots scene.

References

  • [1] Elena Arabadzhiyska, Okan Tarhan Tursun, Karol Myszkowski, Hans-Peter Seidel, and Piotr Didyk. Saccade landing position prediction for gaze-contingent rendering. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017.
  • [2] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. SaltiNet: Scan-path prediction on 360 degree images using saliency volumes. In Proceedings of the IEEE ICCV Workshops, pages 2331–2338, 2017.
  • [3] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. PathGAN: Visual scanpath prediction with generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [4] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Scanpath and saliency prediction on 360 degree images. Signal Processing: Image Communication, 69:8–14, 2018.
  • [5] Wentao Bao and Zhenzhong Chen. Human scanpath prediction based on deep convolutional saccadic model. Neurocomputing, 404:154 – 164, 2020.
  • [6] Mathieu Blondel, Arthur Mensch, and Jean-Philippe Vert. Differentiable divergences between time series. arXiv preprint arXiv:2010.08354, 2020.
  • [7] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [8] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. MIT saliency benchmark. http://saliency.mit.edu/, 2019.
  • [9] Ying Cao, Rynson WH Lau, and Antoni B Chan. Look over here: Attention-directing composition of manga elements. ACM Trans. Graph., 33(4):1–11, 2014.
  • [10] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [11] Fang-Yi Chao, Lu Zhang, Wassim Hamidouche, and Olivier Deforges. SalGAN360: Visual saliency prediction on 360 degree images with generative adversarial networks. In 2018 IEEE Int. Conf. on Multim. & Expo Workshops (ICMEW), pages 01–04. IEEE, 2018.
  • [12] Alex Colburn, Michael F Cohen, and Steven Drucker. The role of eye gaze in avatar mediated conversational interfaces. Technical report, Citeseer, 2000.
  • [13] Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In Proc. of the European Conference on Computer Vision (ECCV), pages 518–533, 2018.
  • [14] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142–5154, 2018.
  • [15] Marco Cuturi and Mathieu Blondel. Soft-DTW: A differentiable loss function for time-series. arXiv preprint arXiv:1703.01541, 2017.
  • [16] Stephen R Ellis and James Darrell Smith. Patterns of statistical dependency in visual scanning. Eye movements and human information processing, pages 221–238, 1985.
  • [17] Ramin Fahimi and Neil DB Bruce. On metrics for measuring scanpath similarity. Behavior Research Methods, pages 1–20, 2020.
  • [18] Kaye Horley, Leanne M Williams, Craig Gonsalvez, and Evian Gordon. Face to face: visual scanpath evidence for abnormal processing of facial expressions in social phobia. Psychiatry research, 127(1-2):43–53, 2004.
  • [19] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
  • [20] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In IEEE ICCV, pages 2106–2113. IEEE, 2009.
  • [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014. Last updated in arXiv in 2017.
  • [22] Matthias Kümmerer, Thomas S. A. Wallis, and Matthias Bethge. DeepGaze II: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.
  • [23] O. Le Meur and T. Baccino. Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods, pages 251–266, 2013.
  • [24] Olivier Le Meur and Zhi Liu. Saccadic model of eye movements for free-viewing condition. Vision Research, 116:152 – 164, 2015.
  • [25] Chenge Li, Weixi Zhang, Yong Liu, and Yao Wang. Very long term field of view prediction for 360-degree video streaming. In 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 297–302. IEEE, 2019.
  • [26] Suiyi Ling, Jesús Gutiérrez, Ke Gu, and Patrick Le Callet. Prediction of the influence of navigation scan-path on perceived quality of free-viewpoint videos. IEEE Journal on Emerging and Sel. Topics in Circ. and Sys., 9(1):204–216, 2019.
  • [27] Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, and Stephen Lin. Semantically-based human scanpath estimation with hmms. In Proceedings of the IEEE International Conference on Computer Vision, pages 3232–3239, 2013.
  • [28] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pages 9605–9616, 2018.
  • [29] Y. Lu, W. Zhang, C. Jin, and X. Xue. Learning attention map from images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [30] Daniel Martin, Sandra Malpica, Diego Gutierrez, Belen Masia, and Ana Serrano. Multimodality in VR: A survey. arXiv preprint arXiv:2101.07906, 2021.
  • [31] Daniel Martin, Ana Serrano, and Belen Masia. Panoramic convolutions for single-image saliency prediction. In CVPR Workshop on CV for AR/VR, 2020.
  • [32] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [33] Rafael Monroy, Sebastian Lutz, Tejo Chalasani, and Aljosa Smolic. SalNet360: Saliency maps for omni-directional images with CNN. Signal Processing: Image Communication, 69:26–34, 2018.
  • [34] Meinard Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
  • [35] Anh Nguyen, Zhisheng Yan, and Klara Nahrstedt. Your attention is unique: Detecting 360-degree video saliency in head-mounted display for head movement prediction. In Proc. ACM Intern. Conf. on Multimedia, pages 1190–1198, 2018.
  • [36] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O’Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto. SalGAN: Visual saliency prediction with generative adversarial networks, 2018.
  • [37] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E. O’Connor. Shallow and deep convolutional networks for saliency prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [38] Xufang Pang, Ying Cao, Rynson WH Lau, and Antoni B Chan. Directing user attention via visual flow on web designs. ACM Trans. on Graph., 35(6):1–11, 2016.
  • [39] Yashas Rai, Jesús Gutiérrez, and Patrick Le Callet. A dataset of head and eye movements for 360 degree images. In Proceedings of the 8th ACM on Multimedia Systems Conference, pages 205–210, 2017.
  • [40] Kerstin Ruhland, Christopher E Peters, Sean Andrist, Jeremy B Badler, Norman I Badler, Michael Gleicher, Bilge Mutlu, and Rachel McDonnell. A review of eye gaze in virtual agents, social robotics and HCI: Behaviour generation, user interaction and perception. In Computer Graphics Forum, volume 34, pages 299–326. Wiley Online Library, 2015.
  • [41] Matan Sela, Pingmei Xu, Junfeng He, Vidhya Navalpakkam, and Dmitry Lagun. GazeGAN: Unpaired adversarial image generation for gaze estimation. arXiv preprint arXiv:1711.09767, 2017.
  • [42] Ana Serrano, Vincent Sitzmann, Jaime Ruiz-Borau, Gordon Wetzstein, Diego Gutierrez, and Belen Masia. Movie editing and cognitive event segmentation in virtual reality video. ACM Trans. Graph. (SIGGRAPH), 36(4), 2017.
  • [43] Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wetzstein. Saliency in VR: How do people explore virtual environments? IEEE Trans. on Vis. and Comp. Graph., 24(4):1633–1642, 2018.
  • [44] Mikhail Startsev and Michael Dorr. 360-aware saliency estimation with conventional image saliency predictors. Signal Proces.: Image Comm., 69:43–52, 2018.
  • [45] Yu-Chuan Su and Kristen Grauman. Making 360 video watchable in 2d: Learning videography for click free viewing. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1368–1376. IEEE, 2017.
  • [46] Yu-Chuan Su, Dinesh Jayaraman, and Kristen Grauman. Pano2vid: Automatic cinematography for watching 360 videos. In Asian Conf. on CV, pages 154–171. Springer, 2016.
  • [47] Benjamin W Tatler and Benjamin T Vincent. The prominence of behavioural biases in eye guidance. Visual Cognition, 17(6-7):1029–1054, 2009.
  • [48] Hamed Rezazadegan Tavakoli, Esa Rahtu, and Janne Heikkilä. Stochastic bottom–up fixation prediction and saccade generation. Image and Vision Computing, 31(9):686–693, 2013.
  • [49] Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological review, 113(4):766, 2006.
  • [50] Eleonora Vig, Michael Dorr, and David Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [51] Vincent Le Guen and Nicolas Thome. Shape and time distortion loss for training deep time series forecasting models. In Advances in Neural Information Processing Systems, pages 4189–4201, 2019.
  • [52] Dirk Walther and Christof Koch. Modeling attention to salient proto-objects. Neural Networks, 19:1395–1407, 2006.
  • [53] Wenguan Wang and Jianbing Shen. Deep visual attention prediction. IEEE Transactions on Image Processing, 27(5):2368–2378, 2017.
  • [54] W. Wang and J. Shen. Deep visual attention prediction. IEEE Transactions on Image Processing, 27(5):2368–2378, 2018.
  • [55] Wenguan Wang, Jianbing Shen, Xingping Dong, and Ali Borji. Salient object detection driven by fixation prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [56] Chenglei Wu, Ruixiao Zhang, Zhi Wang, and Lifeng Sun. A spherical convolution approach for learning long term viewport prediction in 360 immersive video. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 14003–14040, 2020.
  • [57] Chen Xia, Junwei Han, Fei Qi, and Guangming Shi. Predicting human saccadic scanpaths based on iterative representation learning. IEEE Transactions on Image Processing, 28(7):3502–3515, 2019.
  • [58] M. Xu, Y. Song, J. Wang, M. Qiao, L. Huo, and Z. Wang. Predicting head movement in panoramic video: A deep reinforcement learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2693–2708, 2019.
  • [59] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3166–3173. IEEE, 2013.
  • [60] Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J Zelinsky, and Tamara L Berg. Exploring the role of gaze behavior and object detection in scene understanding. Frontiers in psychology, 4:917, 2013.
  • [61] Qi Zhao and Christof Koch. Learning a saliency map using fixated locations in natural scenes. Journal of Vision, 11:9, 2011.
  • [62] Yucheng Zhu, Guangtao Zhai, and Xiongkuo Min. The prediction of head and eye movement for 360 degree images. Signal Processing: Image Communication, 69:15–25, 2018.