Saliency Prediction for Mobile User Interfaces

by Prakhar Gupta, et al.

We introduce models for saliency prediction for mobile user interfaces. In addition to natural images, a mobile interface may include elements such as buttons and text, which enable users to perform a variety of tasks. Saliency in natural images is a well-studied area. However, given the difference in what constitutes a mobile interface, and the usage context of these devices, we postulate that saliency prediction for mobile interface images requires a fresh approach. Mobile interface design involves operating on elements, the building blocks of the interface. We first collected eye-gaze data from mobile devices for a free-viewing task. Using this data, we develop a novel autoencoder-based multi-scale deep learning model that provides saliency prediction at the mobile interface element level. Compared to saliency prediction approaches developed for natural images, we show that our approach performs significantly better on a range of established metrics.






1 Introduction

Mobile devices have become ubiquitous in recent years, and this has been accompanied by an explosion in the number of applications available for them. In the U.S., mobile apps overtook the PC Internet in time spent in 2014 [29]. As the world moves towards pervasive mobile app usage, brands are increasingly trying to provide an engaging experience for their customers through apps [10]. Developing apps constitutes a significant cost for brands [39]. One part of the app development process is designing applications that help the user perform tasks efficiently and in an engaging manner. The user interface (UI) design process today involves designers creating UI mocks, which are improved iteratively. Part of this iterative process is feedback from focus groups and peers. This is time-consuming and expensive, and presents opportunities for automation. We build models that can predict the saliency of different sections of a mobile app, and propose their use as a feedback tool for designers.

For desktop devices, eye-gaze tracking as a form of user engagement feedback has been studied [16]. Most desktop-based eye-gaze tracking technologies rely on specialized hardware for capturing the human face and eyes while viewing [1]. Such techniques cannot be applied to mobile devices without compromising the natural usage pattern of these devices. However, modern mobile devices are almost always equipped with front-facing cameras. Using these, it is possible to capture a user's face and eyes while she views a mobile screen. In this work, we leverage iTracker [21], a Convolutional Neural Network (CNN) based model which can be used to predict the location of a user's fixation on the screen, using the video feed from a front-facing camera as input.

Figure 1: A sample UI image, its pixel-level ground truth saliency map inferred from the collected front-facing camera video feed, and the element-level saliency map.

One approach to predicting saliency for mobile UIs is to predict pixel-level saliency. However, UI designers do not work with pixels; they work with elements. We define mobile UI elements as the building blocks that are arranged to assemble the complete UI. An element can be a natural image, button, text box, or any other component present in the UI. During the design process, a designer can add, remove, edit, or change the relative position of an element. We therefore approach the problem as one of saliency detection at the element level. The saliency output must exhibit spatial coherence and a smooth transition between neighbouring pixels. Addressing eye-gaze saliency at the element level preserves the spatial coherence and correlation of all the pixels of an element, and enables the designer to modify the design based on the relative saliencies of elements. Figure 1 shows a sample UI image with its collected pixel-level ground truth and element-level saliency maps.

In our work, we introduce models which can be trained on a dataset of UI images along with corresponding eye-gaze data collected from users. A trained model can then be used to predict saliency for a new test UI, providing rapid feedback to the designer. In summary, the main contributions of our work are as follows. We propose a novel model that uses de-noising autoencoders on multiple scales of UI elements to provide saliency prediction at the element level. For the task of saliency prediction in mobile UIs, we achieve accuracies which are significantly better than the state of the art in saliency prediction. (Footnote 1: We can share our model with other researchers upon request.)

2 Related Work

In this section, we summarize the four broad areas of research that have an implication on our work.

2.1 Saliency Models for Natural Images

Predicting eye gaze for natural images is a well-explored topic in computer vision. Some early natural image saliency methods were based on concepts like Feature Integration Theory [15], graph-based normalization [11], a method that analyzes the log spectrum of an image [13], and information theory principles like self-information [41] and information maximization [3, 4]. Some models use supervised learning for saliency prediction based on manually designed feature sets [18, 42, 17, 2]. All these approaches model saliency in a bottom-up manner using low-level features, which leads to models that fail to generalize to complex scenes and new domains.

Recent progress in saliency prediction has been driven by deep learning methods trained on large datasets, which allow learning hierarchies of feature representations from pixel-level data. Some models, like [35] and SalNet [31], train their own networks to predict saliency from scratch, while others use features from pre-trained CNNs, such as DeepGaze [23], SALICON [14], ML-NET [8], and Deepfix [22]. More recent advances include training using adversarial examples [30] and neural attentive mechanisms that iteratively refine the predicted saliency map [9]. These methods are designed for natural image saliency prediction, and we explore their applicability to mobile UI saliency prediction.

2.2 Multi-Scale Feature Extraction

Some recent models have attempted to explicitly model how the neighborhood of a location affects saliency at a particular location. Mr-CNN [27] presents a multi-scale CNN which is trained from image regions centered on fixated and non-fixated image patches at multiple scales. A similar model is proposed in [26]. SALICON [14] also incorporates features learned at two scales, coarse and fine, and optimizes KL divergence in the last layer. A multi-context approach over a subsampled and upsampled image patch at the super-pixel level has been proposed in [43]. Such methods try to leverage the contrast of an image region against the surrounding area for saliency prediction. All the methods mentioned so far have been developed exclusively for analyzing natural images, and are not trained or tested on graphic designs.

2.3 Saliency and Attention Models for Webpages and Interfaces

Attempts at understanding visual perception of interfaces and designs have been made since the last decade [16]. One such work [34] predicts the entry point in webpages in a screenshot dataset using features such as the center-surround differences of colors, intensity, and orientations. A linear regression model on features extracted from the HTML-induced DOM for predicting visual attention on webpages is explored in [5] (this work uses a dataset of webpages). In [33], a model combining multi-scale low-level feature responses, explicit face maps, and positional bias was proposed to predict fixations on the Fixations in Webpage Images (FiWI) dataset; this dataset contains a total of screenshots of webpages. The work in [32] extends this by replacing the specific object detector with features from deep neural networks. Users' mouse and keyboard input along with the UI components have been used to predict their attention map [38, 7]. A manually designed feature set to predict human visual attention for free-viewing webpages is studied in [24].

While this line of work is the semantically closest area of research to ours, these models are limited in their application to webpages. Further, given the differences in size and structure between desktop webpages and mobile apps, these models cannot be directly ported to our problem.

2.4 Eye-Gaze Data Collection

Traditionally, all work involving collection of eye-gaze data has relied on custom hardware. For instance, all saliency datasets listed at [1] were collected using custom hardware. The ubiquity of mobile devices poses unique challenges and opportunities. Some recent works have explored the possibility of using the front-facing cameras of mobile devices to detect the eye-gaze location of users looking at their mobile screen [21, 25]. Of these, we find iTracker [21], a CNN-based model, to be the more sophisticated approach. The iTracker system was developed for iOS, and we modify it to work on Android-based mobile phones.

3 Approach

In this section, we describe the approach to saliency detection for mobile UIs.

3.1 Stimuli

In the absence of any available eye-gaze datasets for mobile UIs, we created our own dataset by assembling a set of mobile UI images from Android applications from the Google Play Store. We ensured that the selected apps represent a good spread with respect to their download counts (Table 1) and ratings (Table 2). For each application, UI screenshots were taken on an average, leading to a total of UI images.

Table 1: Distribution of mobile apps by number of downloads
Downloads   <1 M   1 M   5-10 M   50-100 M   >100 M
Apps        49     28    31       34         23

Table 2: Distribution of mobile apps by rating
Rating   2-3   3-4   4-5
Apps     6     33    114

Since our goal was to predict saliency at an element level, the bounding boxes of the elements were required. For this, two methods were used. In the first method, while capturing screenshots of the mobile UIs, we processed the logs from the official Android debug tools to get information about the underlying XML code of the application. The XML code was processed to obtain bounding boxes of the elements present in the UI. While doing this, an element with a smaller area was considered to be 'over' a larger element, so that a pixel belonging to more than one element is assigned the ID of the smaller element. This method does not work in scenarios where UI elements overlap heavily, so we used a second method, a semi-automated drag-and-drop scheme, to generate the bounding boxes. One example output is shown in Figure 2. The distribution of the number of elements per UI is shown in Figure 3. The mean number of elements per UI is (standard deviation of ), with all UIs having at least and at most elements.

Figure 2: Element box extraction
Figure 3: Histogram of elements per image; the curve represents the density estimate.

3.2 Eye-gaze Data Collection Experiment

In order to collect free viewing eye-gaze data for the set of mobile UI images described in the previous section, we developed a mobile application which displays screenshots of mobile UI images to participants in a natural environment. We conducted an experiment on Mechanical Turk, where participants downloaded our application on their mobile devices for the experiment. The participants belonged to the age group of . Participants were given comprehensive instructions on how to download and use the application to participate in the experiment. The application collected the front facing camera’s video feed from each participant across multiple sessions and this feed was sent to our server.

Each session began with a calibration task (described in Section 3.3), followed by the display of 10 mobile UI images for free viewing; that is, no instructions were given to perform specific tasks. The mobile interface screenshots were interspersed with filler images with a probability of occurrence of . This was done to remove any spatial bias from prior images. The filler images consisted of sceneries and abstract art; no video feed was collected for these images. Each participant was shown an average of different UI images across a span of or sessions. Each UI image persisted for seconds with a second gap in between each image. The participants were free to pause between sessions in case they wanted to take a break. In the experiment, each UI screenshot was shown to an average of different participants, while ensuring that the same image was not shown twice to the same participant.

3.3 Processing Eye-gaze Data

From the videos of the participants collected during the free-viewing task in the previous step, we generated the gaze points corresponding to where the participants were looking in the UI images shown to them. To achieve this, we started with iTracker [21], which introduced eye-tracking software that works on devices such as mobile phones and tablets without the need for additional sensors or devices. While the available software is designed for iOS devices, we modified it to run on Android devices. The captured videos were split into frames. For each frame, the crops of the face and both eyes were generated using [19]. These are required as input for iTracker. The output is the coordinates of the gaze point corresponding to each frame of the video. We can use this to locate the pixel of the UI a participant is looking at in each frame.

The gaze points predicted by iTracker were not accurate enough for the task at hand (with an average error from cm). To address this, we included a calibration task at the beginning of each session of the app. During this task, we showed a moving object at different positions on the screen for a total of seconds. The participants were instructed to follow the object on the screen. The video of the participant captured during the calibration task was processed as described earlier.

A linear regression model was trained on the calibration frames of each session, with the gaze points predicted by iTracker as features and the actual coordinates of the displayed object as targets. We divided the calibration frames into training and test sets in a ratio and measured the tracking error on the test set. The calibration task helped reduce the average error from cm to cm with mean standard deviation of cm. This error is in the range of the error reported in [21]. We used the regression output as the gaze point and also generated a 2-dimensional covariance matrix, which is used when processing the ground truth eye-gaze fixations.
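The per-session calibration step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `fit_calibration` and `apply_calibration` are hypothetical, the regression is assumed to be an affine map fitted by ordinary least squares, and the covariance matrix is assumed to be computed from the regression residuals.

```python
import numpy as np

def fit_calibration(pred_xy, true_xy):
    """Fit an affine map from raw gaze predictions to known target positions.

    pred_xy: (N, 2) gaze points predicted for the calibration frames.
    true_xy: (N, 2) on-screen coordinates of the moving calibration object.
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    true_xy = np.asarray(true_xy, dtype=float)
    # Append a bias column so the regression can learn an offset.
    X = np.hstack([pred_xy, np.ones((len(pred_xy), 1))])
    W, *_ = np.linalg.lstsq(X, true_xy, rcond=None)  # W has shape (3, 2)
    # Assumption: the 2x2 covariance is estimated from the residual errors.
    residuals = true_xy - X @ W
    cov = np.cov(residuals, rowvar=False)
    return W, cov

def apply_calibration(W, pred_xy):
    """Map raw gaze predictions to calibrated screen coordinates."""
    pred_xy = np.asarray(pred_xy, dtype=float)
    X = np.hstack([pred_xy, np.ones((len(pred_xy), 1))])
    return X @ W
```

In practice the fit would use only the training portion of the calibration frames, with the held-out portion used to measure the remaining tracking error.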

3.4 Generating Saliency Maps from Fixations

3.4.1 Pixel-level Saliency

We use the eye fixations predicted from the calibrated iTracker outputs to calculate the probability of a fixation point falling on a pixel. The 2-dimensional covariance matrix generated during calibration was used for Gaussian blurring of the fixation maps for each UI viewed by the participant in a session. Through this procedure, we obtain a pixel-level probabilistic heatmap from the fixation points. Converting fixation locations to a continuous distribution allows uncertainty in the eye-tracking ground truth measurements to be incorporated, as suggested in [6]. We leverage the error from calibration because it varied from one session to another, depending on how the mobile device was held and on the lighting conditions.
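As an illustration of how fixations can be turned into a probabilistic heatmap, the sketch below places one 2-D Gaussian per fixation, using the session covariance, and normalizes the result. The function name `fixation_heatmap` is hypothetical, and the covariance is assumed to be expressed in pixel units already (a covariance measured in cm would first need converting).

```python
import numpy as np

def fixation_heatmap(fixations, cov, height, width):
    """Sum one 2-D Gaussian per fixation and normalize to a probability map.

    fixations: iterable of (x, y) pixel coordinates.
    cov: 2x2 covariance matrix in pixel units (assumption).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(float)  # (H, W, 2) holding (x, y)
    inv_cov = np.linalg.inv(cov)
    heat = np.zeros((height, width))
    for fx, fy in fixations:
        d = grid - np.array([fx, fy], dtype=float)
        # Squared Mahalanobis distance of every pixel to this fixation.
        m = np.einsum("hwi,ij,hwj->hw", d, inv_cov, d)
        heat += np.exp(-0.5 * m)
    return heat / heat.sum()
```

A session with a larger calibration error yields a wider Gaussian, so its fixations contribute a more diffuse heatmap.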

The average saliency map can be seen in Figure 4. It indicates a top-left bias similar to the webpage saliency dataset in [32]. This is primarily because important UI elements are generally present in this area, and participants tend to browse the images top-to-bottom and left-to-right.

Figure 4: Average ground truth saliency map

3.4.2 Element-level Saliency

We convert the pixel-level saliency maps into UI element-level saliency maps. For this, we compute the integral of the pixel-level saliency density over the area covered by an element. This is followed by normalization over all elements to ensure that the vector sums to 1. Given a UI with $n$ elements, we represent the element saliency map $S$ as a vector of probabilities $(s_1, \dots, s_n)$, where $s_i$ is the probability of element $i$ being fixated. In case one UI element overlaps another element, we assign the saliency of the pixels in the overlapping region to the element on top. Sample pixel-level and element-level saliency maps are presented in Figure 7.
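The conversion from pixel-level to element-level saliency can be sketched as below. The encoding is an assumption made for illustration: `element_ids` is a hypothetical per-pixel map holding the index of the top-most element covering each pixel, with -1 where no element is present, which resolves overlaps in favour of the element on top.

```python
import numpy as np

def element_saliency(pixel_map, element_ids, n_elements):
    """Integrate pixel saliency over each element, then normalize to sum to 1."""
    pixel_map = np.asarray(pixel_map, dtype=float)
    element_ids = np.asarray(element_ids)
    covered = element_ids >= 0  # ignore pixels that belong to no element
    totals = np.bincount(
        element_ids[covered], weights=pixel_map[covered], minlength=n_elements
    )
    return totals / totals.sum()
```

The same routine applies to any pixel-level saliency map, whether it comes from ground truth fixations or from a pixel-level prediction model.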

3.5 Feature Extraction from UI Images

Saliency is driven by visual contrast, and it indicates which parts of an image are more visually appealing relative to the rest of the image. Thus, the saliency model needs to capture the contrast between a region of the image, a UI element in our case, and its surrounding area. Therefore, we extract features for every UI element at three scales. The first scale is the image of the UI element itself. The second scale consists of the UI element along with a region surrounding it, whose boundaries lie at the mid-point between the element's boundary and the entire UI image's boundary in both dimensions. The third scale consists of the entire UI image. Our saliency models, described in detail in later sections, then use these multi-scale features along with other low-level features to train fully connected neural network layers for saliency prediction at the UI element level. We now describe our feature generation methods.
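One reading of the three scales, written as crop boxes, is sketched below. The function name `three_scale_boxes` and the `(left, top, right, bottom)` box convention are assumptions; in particular, each edge of the second-scale box is taken to lie halfway between the corresponding element edge and image edge.

```python
def three_scale_boxes(box, image_w, image_h):
    """Return crop boxes (left, top, right, bottom) for the three scales."""
    left, top, right, bottom = box
    # Scale 1: the element itself.
    element = (left, top, right, bottom)
    # Scale 2: each edge lies halfway between the element edge and the image edge.
    context = (left / 2, top / 2, (right + image_w) / 2, (bottom + image_h) / 2)
    # Scale 3: the entire UI image.
    full = (0, 0, image_w, image_h)
    return element, context, full
```

For an element at (100, 200, 300, 400) in a 1080 by 1920 screenshot, the second-scale box under this reading is (50, 100, 690, 1160).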

3.5.1 Feature Extraction from Stacked De-noising Autoencoder

We use an autoencoder model to learn feature representations for our saliency models, as autoencoders provide an effective way to learn good feature representations from large amounts of unlabeled data [36]. Autoencoders are neural networks that consist of two parts, an encoder and a decoder. The encoder reduces the input to a lower-dimensional representation and the decoder reconstructs the original input from it. The objective of the autoencoder is to make the output as close as possible to the corresponding input.

However, it has been shown that the reconstruction criterion alone is not enough to guarantee the extraction of useful features, as it suffers from non-generalizability. A good feature representation should be stable and robust under corruptions of the input and should capture useful structure in the input distribution. Feature extractors learned by de-noising autoencoders are able to learn useful structure in the data that regular autoencoders seem unable to learn [36]. We adopt the concept of de-noising autoencoders to learn such a representation for images, where the input image is corrupted by setting a fraction of the pixels of the image to . Let $x$ be the original image and $\tilde{x}$ its noisy version. The de-noising autoencoder tries to reconstruct the original image by producing a reconstruction $\hat{x}$ from $\tilde{x}$. It minimizes the reconstruction error using the Euclidean loss,

$$L(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2 .$$
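The corruption step and the reconstruction objective can be sketched as follows, with an image represented as a flat list of pixel values. The names `corrupt` and `euclidean_loss` are hypothetical, and both the corrupted fraction and the replacement value (0.0 here, as in standard masking noise) are assumptions, since the text does not preserve them.

```python
import random

def corrupt(pixels, fraction, value=0.0, seed=None):
    """Return a copy of `pixels` with a random `fraction` of entries set to `value`."""
    rng = random.Random(seed)
    noisy = list(pixels)
    k = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), k):
        noisy[i] = value
    return noisy

def euclidean_loss(original, reconstruction):
    """Sum of squared differences between the clean image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction))
```

Note that the loss compares the reconstruction against the clean image rather than the corrupted input, which is what forces the encoder to learn features that are robust to the corruption.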

The architecture of our autoencoder consists of convolutional layers. We adopt and filters with size and a stride of in the first two convolutional layers, respectively. Both are succeeded by max-pooling layers. All max-pooling layers have size of and a stride of 1. This is the encoder part of our autoencoder. This is followed by another three convolutional layers with size and a stride of 1 and with , and filters, respectively. After the third and fourth convolutional layers we add upsampling layers with size of . All convolutional layers use ReLU activations [28]. The encoder part of the autoencoder converts a size input image into a sized encoded output.

We trained the autoencoder on all UI images in our dataset. By using autoencoders we were able to learn features for the UI images and at the same time reduce the input dimensions needed for the saliency prediction model. We learned separate autoencoder models for the three scales of the elements independently. All three autoencoders share the same architecture but have different parameter values. Section 3.6 describes in more detail how the autoencoder model contributes to the saliency prediction. Some sample UI elements, their noisy versions, and their reconstructed versions are shown in Figure 5.

Figure 5: From top to bottom - sample images of elements, their generated noisy versions, and their reconstructed versions. Only the first of three scales is displayed.
Figure 6: The overall architecture of the mobile user interface saliency prediction system. For each UI, we first segment it into elements. In the above example, we predict the saliency of the “Alarm off” element. In addition to this element, we also take two zoomed-out images of the element, which we call scales. The autoencoder versions of all three are fed into the deep model. The saliency of all the elements on the page is reconstructed by combining the saliency of individual elements.

3.5.2 Low-level Feature Extraction

We also computed low-level features based on the color distribution, size, and position of the elements in the UI image. There were such features generated for each element in the UI, including width, height, area, and position in pixel coordinates, along with the first and second color moments for each color channel [40] of both the element under consideration and the whole image, separately. We include area, width, and height in the feature set because all elements are rescaled before they are input to the model, and the information regarding their size and scale is lost in the process. The position features help capture the user's bias towards UI elements at the top and left of the screen. For $N$ pixels in the image or element, and $p_{ij}$ the value of the $j$-th pixel at the $i$-th color channel, the first color moment $E_i$, analogous to the mean, and the second color moment $\sigma_i$, analogous to the standard deviation, can be calculated by

$$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}, \qquad \sigma_i = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left( p_{ij} - E_i \right)^2 }.$$
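The two color moments can be computed per channel as in the sketch below, which assumes the image or element is given as a list of `(r, g, b)` tuples; the function name `color_moments` is hypothetical.

```python
import math

def color_moments(pixels):
    """Return a (mean, std) pair for each color channel.

    pixels: list of (r, g, b) tuples for an element or a whole image.
    """
    n = len(pixels)
    moments = []
    for channel in zip(*pixels):  # iterate over the R, G and B value sequences
        mean = sum(channel) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in channel) / n)
        moments.append((mean, std))
    return moments
```

Applying this to both the element crop and the full screenshot gives the color-moment part of the low-level feature vector.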

3.6 Saliency Prediction Model

Our primary aim is to predict the eye-gaze fixations at an element level. For each element, we predict the probability of fixation on the element by incorporating features learned at the three scales of the UI element and some low-level features. Motivated by works such as [27, 26], we combine information from the three scales to incorporate both the local and the global contrasts to infer the saliency. Combining features at different levels has been shown to increase the performance of predicting the saliency map [17, 37]. The idea behind this is that the saliency of an element depends not only on the element itself, but also on the content surrounding the element.

The architecture of our model is shown in Figure 6. For each element, we generated crops of the element from the UI at three scales, as described earlier. For image regions at each scale, we first resize them to the size of disregarding their aspect ratio. This is done so that the autoencoder models at each resolution level can share the same architecture. Then, the features coming from the different scales are fed into the three convolutional streams of the autoencoder. The details of the autoencoder model are given in Section 3.5.1. The output of the three parallel streams is concatenated with the low-level features mentioned in Section 3.5.2 and becomes the input for the subsequent three fully connected layers. These layers learn to predict the saliency of the element with respect to its appearance as well as its neighborhood. We used the ReLU [28] activation in all layers due to its superior effectiveness and efficiency. Dropout layers were used between every pair of fully connected layers in order to prevent over-fitting, as suggested in [12].

The element-level ground truth saliency maps are normalized to the range $[0, 1]$. However, since each UI has a different number of elements, we do not have a response of a consistent dimension. Hence, we treat the prediction for each element as independent of the others. We apply an element-wise activation function in the final layer, and treat the element-wise predictions as probabilities for independent binary random variables. We can then apply the binary cross entropy (BCE) loss function

$$L_{BCE}(s_i, \hat{s}_i) = -\left[ s_i \log \hat{s}_i + (1 - s_i) \log (1 - \hat{s}_i) \right]$$

between the predicted element-wise saliency $\hat{s}_i$ and the corresponding ground truth $s_i$ in this setting. We also experimented with mean squared error (MSE), or Euclidean loss, which has been successfully applied in similar settings [8, 31], but we found that BCE performed better in the experiments.

As described earlier, a number of saliency approaches for natural images have been studied in the literature. We hypothesized that leveraging the knowledge contained in these models may provide valuable information to our model. To this end, we propose another model called -SalNc. This model uses features from SalNet [31]. We generate a feature vector of dimension by passing the third-level scale for each UI element through SalNet's penultimate convolutional layer. This vector is concatenated with the features from the autoencoder and the low-level features, and learning is performed through fully connected and dropout layers, similar to the -Nc model.
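The binary cross entropy objective described above can be sketched in a few lines. The function name `bce_loss` is hypothetical; averaging over elements and clamping predictions away from 0 and 1 are common conventions assumed here rather than details taken from the paper.

```python
import math

def bce_loss(targets, predictions, eps=1e-7):
    """Mean binary cross entropy between soft targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)
```

Because each element contributes its own term, the loss is well defined regardless of how many elements a UI contains.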

4 Experiments and Results

4.1 Training and Validation on Mobile UI Dataset

We trained and validated our model using 4-fold cross validation on our dataset, which consists of eye-gaze data of users on mobile UI screenshot images. To generate the saliency map for a test image, the saliency of each element is predicted first. The predicted saliency values of all elements in the test image are then normalized so that the total saliency is 1. This is done since the saliency of the UI elements in a UI image is a probability distribution (positive numbers adding to 1 for a UI). The network was trained using stochastic gradient descent with a Euclidean loss. We used a batch size of . The network was validated against a validation set after every iterations to monitor convergence and over-fitting. We used the standard weight regularizer and the ADAM optimizer [20]. The autoencoder took approximately 15 hours and the saliency model took approximately 6 hours to train for 1000 epochs on a machine with 4 NVIDIA K520 GPUs.

4.2 Evaluation Metrics

We evaluate our approach using three metrics, described next. Our approach makes a saliency prediction for all elements that comprise a UI. The evaluation metrics thus apply to the vectors of element-level saliencies, the ground truth and the predictions. Denote the vector of ground truth saliencies as $S = (s_1, \dots, s_n)$. Further, let the predicted saliencies be $\hat{S} = (\hat{s}_1, \dots, \hat{s}_n)$.

AUC, or area under the Receiver Operating Characteristic (ROC) curve, is the most widely used score for saliency model evaluation. In AUC computation, the estimated saliency map is used as a binary classifier to separate the positive samples (human fixations) from the negatives (the rest). By varying the threshold on the saliency map, an ROC curve can be plotted as the true positive rate vs. the false positive rate. AUC is then calculated as the area under this curve. For the AUC score, 1 means perfect prediction while 0.5 indicates chance level. AUC requires discrete fixations in its calculation. We therefore treat the top 20 percent most salient UI elements in the continuous ground truth saliency map of an image as actual fixations, similar to what is described in [17]. We report the average AUC over all test images.

CC measures the linear correlation between the estimated saliency map and the ground truth fixation map, i.e., the correlation between the ground truth vector $S$ and the predicted vector $\hat{S}$. These values are then averaged over all the UI images in the test set. The closer CC is to 1, the better the performance of the saliency algorithm.

KL divergence measures the distance between a target distribution and a predicted distribution. It assigns a lower score to a saliency map that better approximates the ground truth. All metrics have their advantages and limitations, and a model that performs well should score well on all of them (high AUC and CC, low KL).
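The three metrics can be sketched on element-level saliency vectors as follows. This is an illustrative implementation under stated assumptions: the function names are hypothetical, KL is taken as the divergence of the prediction from the ground truth, and AUC is computed through its rank-based equivalent (the fraction of positive/negative pairs ranked correctly, with ties counting half) after labelling the top 20 percent of ground truth elements as fixated.

```python
import math

def normalize(values):
    """Scale non-negative scores so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def cc(truth, pred):
    """Pearson linear correlation between two saliency vectors."""
    n = len(truth)
    mt, mp = sum(truth) / n, sum(pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (st * sp)

def kl_divergence(truth, pred, eps=1e-12):
    """KL(truth || pred); lower is better."""
    return sum(
        t * math.log((t + eps) / (p + eps)) for t, p in zip(truth, pred) if t > 0
    )

def auc(truth, pred, top_fraction=0.2):
    """Area under the ROC curve with the top ground truth elements as positives."""
    k = max(1, round(top_fraction * len(truth)))
    order = sorted(range(len(truth)), key=lambda i: truth[i], reverse=True)
    positives = set(order[:k])
    pos = [pred[i] for i in positives]
    neg = [pred[i] for i in range(len(truth)) if i not in positives]
    # Probability that a positive outranks a negative (ties count half).
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Each function scores a single UI; the reported numbers would be averages of these per-UI scores over the test images.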

Method          AUC      CC       KL
-Nc             0.9256   0.8197   0.2340
-SalNc          0.9212   0.8094   0.2882
GBVS            0.8751   0.7613   0.2465
Itti            0.8423   0.7019   0.2843
SalNet          0.8725   0.7671   0.2495
SalGAN          0.8482   0.6894   0.3318
SAM             0.7316   0.4927   1.2603
ML-NET          0.8703   0.7541   0.2678
OPENSALICON     0.8629   0.7358   0.2694
Lab-Signature   0.7971   0.5177   0.4368
Table 3: Comparison of the proposed approaches against the baselines
Figure 7: Qualitative comparison of our model with other methods

4.3 Baselines and Comparisons

We compare our work with GBVS [11], Itti [15], SalNet [31], SalGAN [30], SAM [9], ML-NET [8], OPENSALICON [14], and Lab-Signature [13]. We use the pretrained models from the prior work, predicting saliency for resized images. All these models provide pixel-level saliency distributions, which are then converted to element-level saliency distributions (following the same procedure used for the ground truth) by summing over all pixels belonging to an element and dividing by the sum across all elements. For our model, we carried out 4-fold cross validation for reporting the results. We did not compare with prior work on webpage saliency detection because webpages differ considerably in size, number of elements, and orientation. Further, there were no open implementations available for comparison.

4.4 Discussions

The results on the evaluation metrics described in the previous sections are presented in Table 3. On all three metrics, we observe that -Nc performs the best. This approach is better than -SalNc. In other words, including features from saliency models for natural images, like SalNet, in the training model does not provide a better model for learning saliency in mobile UIs. Compared to the next best natural image saliency models, the proposed approach is 6%, 7%, and 5% better on the AUC, CC, and KL metrics, respectively. This shows that the best models trained for predicting saliency in natural images fall short for mobile UIs, justifying the need to address this problem anew. The proposed approach has two advantages over existing saliency prediction models. First, by collecting eye-gaze data for mobile UIs, we inform our model of patterns which are unique to mobile UIs and their viewing in natural environments. Second, by modeling saliency at the element level, we optimize our model to predict element-level saliency, leading to superior predictions.
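The quoted improvements can be checked against Table 3 with a short calculation. The sketch assumes the percentages are relative gains over the best baseline on each metric (GBVS for AUC and KL, SalNet for CC), which is one reading that reproduces the reported figures; the variable and function names are hypothetical.

```python
# Scores taken from Table 3.
ours = {"AUC": 0.9256, "CC": 0.8197, "KL": 0.2340}
best_baseline = {"AUC": 0.8751, "CC": 0.7671, "KL": 0.2465}  # GBVS, SalNet, GBVS

def relative_gain(metric):
    """Relative improvement of -Nc over the best baseline for one metric."""
    a, b = ours[metric], best_baseline[metric]
    # KL is lower-is-better, so the gain is the relative reduction.
    return (b - a) / b if metric == "KL" else (a - b) / b

gains = {m: round(100 * relative_gain(m)) for m in ("AUC", "CC", "KL")}
```

Under this reading, `gains` holds 6, 7, and 5 percent for AUC, CC, and KL respectively.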

Figure 7 presents a qualitative analysis of the different approaches' performance in element-level saliency prediction. For sample UI images, we show the original image, the pixel-level ground truth, the element-level ground truth, and predictions from -Nc, SalNet [31], Itti [15], Lab-Signature [13], and ML-NET [8] in the columns. Brighter shades indicate higher saliency. We make the following observations. First, the ground truth reflects a top-left bias for most images, and this is also reflected in the predictions from the different models. Second, other models tend to favour larger elements, whereas -Nc sometimes predicts small elements to have high saliency. Third, SalNet and -Nc often predict a skewed distribution, with some elements predicted to have high saliency and some low. The other approaches tend to predict a flatter distribution. Fourth, our model has learnt to give higher saliency to some UI-specific components like text, especially if it has high contrast with the background. Also, instead of simply relating saliency to low-level features such as color contrast, our model takes the surrounding region into consideration while making predictions, as can be seen in the prediction for the mic element in the fifth sample UI.

5 Conclusions and Future work

This paper presents a novel deep learning architecture for saliency prediction on mobile UI images. Our model learns a non-linear combination of low- and high-level features to predict saliency at the element level. Qualitative and quantitative comparisons with state-of-the-art approaches demonstrate the effectiveness of the proposed model. Learning from eye-gaze data on mobile UIs and predicting at the element level leads to a more accurate saliency model for mobile UIs. Our proposed model of element-level saliency prediction can help a UI designer make decisions in the following manner. The designer can change properties like the color, size, and aspect ratio at the element level, and the number of elements and their relative positioning at the UI level. For each modification, the designer can receive feedback in terms of saliency. The designer can also use this feedback to compare and decide among a set of variants of the same UI.
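
The variant-comparison workflow could look roughly like the sketch below. Everything here is hypothetical: `rank_variants`, the element dictionaries, and especially `area_share`, which is a trivial stand-in predictor (saliency proportional to area) used only so the example runs. It is not the model proposed in this paper, which would take its place.

```python
from typing import Any, Callable, Dict, List, Tuple

Element = Dict[str, Any]

def rank_variants(
    variants: Dict[str, List[Element]],
    target_id: str,
    predict: Callable[[List[Element]], List[float]],
) -> List[Tuple[str, float]]:
    """Rank UI variants by the saliency predicted for one target element."""
    scores = []
    for name, elements in variants.items():
        saliency = predict(elements)  # one value per element, summing to 1
        idx = next(i for i, e in enumerate(elements) if e["id"] == target_id)
        scores.append((name, saliency[idx]))
    # Highest predicted saliency for the target element first.
    return sorted(scores, key=lambda s: s[1], reverse=True)

def area_share(elements: List[Element]) -> List[float]:
    """Placeholder predictor (NOT the paper's model): saliency proportional to area."""
    areas = [e["w"] * e["h"] for e in elements]
    total = sum(areas)
    return [a / total for a in areas]

variants = {
    "A": [{"id": "buy", "w": 100, "h": 40}, {"id": "banner", "w": 300, "h": 120}],
    "B": [{"id": "buy", "w": 200, "h": 80}, {"id": "banner", "w": 300, "h": 120}],
}
ranking = rank_variants(variants, "buy", area_share)
print(ranking)  # variants ordered by predicted saliency of the "buy" element
```

Keeping the predictor as a parameter lets the same comparison loop be reused with any element-level saliency model.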

In future work, we will explore a model which learns to simultaneously detect UI components and predict saliency. Another direction of research is to understand the ease of task completion on mobile UIs through eye-gaze patterns.

References


  • [1] MIT Saliency Benchmark, 2017. [Online; accessed 09-September-2017].
  • [2] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 438–445. IEEE, 2012.
  • [3] N. Bruce and J. Tsotsos. Saliency based on information maximization. In Advances in neural information processing systems, pages 155–162, 2006.
  • [4] N. D. Bruce and J. K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of vision, 9(3):5–5, 2009.
  • [5] G. Buscher, E. Cutrell, and M. R. Morris. What do you see when you’re surfing?: using eye tracking to predict salient regions of web pages. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 21–30. ACM, 2009.
  • [6] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? arXiv preprint arXiv:1604.03605, 2016.
  • [7] Z. Bylinskii, N. W. Kim, P. O’Donovan, S. Alsheikh, S. Madan, H. Pfister, F. Durand, B. Russell, and A. Hertzmann. Learning visual importance for graphic designs and data visualizations. In Proceedings of the 30th Annual ACM Symposium on User Interface Software & Technology, 2017.
  • [8] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A Deep Multi-Level Network for Saliency Prediction. In International Conference on Pattern Recognition (ICPR), 2016.
  • [9] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. Predicting human eye fixations via an lstm-based saliency attentive model. arXiv preprint arXiv:1611.09571, 2016.
  • [10] S. Greengard. Mobile Users Say, ‘It’s All About That App, ‘Bout That App’, 2014. [Online; accessed 09-September-2017].
  • [11] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in neural information processing systems, pages 545–552, 2007.
  • [12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [13] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE transactions on pattern analysis and machine intelligence, 34(1):194–201, 2012.
  • [14] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 262–270, 2015.
  • [15] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
  • [16] R. Jacob and K. S. Karn. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. Mind, 2(3):4, 2003.
  • [17] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In Computer Vision, 2009 IEEE 12th international conference on, pages 2106–2113. IEEE, 2009.
  • [18] W. Kienzle, F. A. Wichmann, M. O. Franz, and B. Schölkopf. A nonparametric approach to bottom-up visual saliency. In Advances in neural information processing systems, pages 689–696, 2007.
  • [19] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
  • [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
  • [21] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2176–2184, 2016.
  • [22] S. S. Kruthiventi, K. Ayush, and R. V. Babu. Deepfix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 2017.
  • [23] M. Kümmerer, L. Theis, and M. Bethge. Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet. arXiv preprint arXiv:1411.1045, 2014.
  • [24] J. Li, L. Su, B. Wu, J. Pang, C. Wang, Z. Wu, and Q. Huang. Webpage saliency prediction with multi-features fusion. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 674–678. IEEE, 2016.
  • [25] Y. Li, P. Xu, D. Lagun, and V. Navalpakkam. Towards measuring and inferring user interest from gaze. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 525–533. International World Wide Web Conferences Steering Committee, 2017.
  • [26] N. Liu, J. Han, T. Liu, and X. Li. Learning to predict eye fixations via multiresolution convolutional neural networks. IEEE transactions on neural networks and learning systems, 2016.
  • [27] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu. Predicting eye fixations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 362–370, 2015.
  • [28] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • [29] J. O’Toole. Mobile apps overtake PC Internet usage in U.S., 2014. [Online; accessed 09-September-2017].
  • [30] J. Pan, C. Canton, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol, and X. Giro-i Nieto. Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081, 2017.
  • [31] J. Pan, E. Sayrol, X. Giro-i Nieto, K. McGuinness, and N. E. O’Connor. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 598–606, 2016.
  • [32] C. Shen, X. Huang, and Q. Zhao. Predicting eye fixations on webpage with an ensemble of early features and high-level representations from deep network. IEEE Transactions on Multimedia, 17(11):2084–2093, 2015.
  • [33] C. Shen and Q. Zhao. Webpage saliency. In European Conference on Computer Vision, pages 33–46. Springer, 2014.
  • [34] J. D. Still and C. M. Masciocchi. A saliency model predicts fixations in web interfaces. In 5th International Workshop on Model Driven Development of Advanced User Interfaces (MDDAUI 2010), page 25, 2010.
  • [35] E. Vig, M. Dorr, and D. Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2798–2805, 2014.
  • [36] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
  • [37] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao. Predicting human gaze beyond pixels. Journal of vision, 14(1):28–28, 2014.
  • [38] P. Xu, Y. Sugano, and A. Bulling. Spatio-temporal modeling and prediction of visual attention in graphical user interfaces. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 3299–3310. ACM, 2016.
  • [39] K. Yarmosh. How Much Does an App Cost: A Massive Review of Pricing and other Budget Considerations, 2017. [Online; accessed 09-September-2017].
  • [40] H. Yu, M. Li, H.-J. Zhang, and J. Feng. Color texture moments for content-based image retrieval. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 3, pages 929–932. IEEE, 2002.
  • [41] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. Sun: A bayesian framework for saliency using natural statistics. Journal of vision, 8(7):32–32, 2008.
  • [42] Q. Zhao and C. Koch. Learning a saliency map using fixated locations in natural scenes. Journal of vision, 11(3):9–9, 2011.
  • [43] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1265–1274, 2015.