Interest tracking is a powerful tool for user experience testing. The fields of advertising, entertainment, packaging, and web design have all benefited significantly from studying the visual behavior of consumers.
Eye tracking data is collected using either a remote or head-mounted ‘eye tracker’ connected to a computer. While there are many different types of non-intrusive eye trackers, they generally include two common components: a light source and a camera. The light source (usually infrared) is directed towards the eye. The camera tracks the reflection of the light source along with visible ocular features such as the pupil. This data is used to extrapolate the rotation of the eye and ultimately the direction of gaze. Additional information, such as blink frequency and changes in pupil diameter, is also detected by the eye tracker.
Eye tracking can be an interesting approach, but it does not fit naturally into users' everyday behavior. In current solutions, an eye-tracking device must be used on a test subject while a researcher conducts the experiment at their side. This limits the amount of data that can be retrieved.
We want to simplify the way human interest measurements can be retrieved from images, so that larger amounts of data can be gathered in a non-experimental environment. We also believe that our approach can outperform results achieved in an experimental setting, which biases the behavior of the test subjects.
Our goal is to validate whether our approach to interest-area detection, which uses only user-interface interaction, performs well enough overall to be a suitable alternative to hands-on experimentation with, for example, eye-tracking devices.
As a use case, our solution is very useful for the web pages of clothing and accessory retailers, such as Zara, AliExpress, and H&M. These pages display images of their products and give the user the possibility to zoom and move through the image. These images usually display an outfit composed of multiple garments. By implementing our proposed solution, these companies would learn which garments of the outfit users are most interested in.
1.1 Literature study
This project builds on the observation that current approaches for detecting areas of interest rely on direct user experimentation with technologies such as eye-tracking devices. Eye movements provide information about the location of areas of interest in an image (Mackworth and Morandi 1967; Just and Carpenter 1976; Henderson and Hollingworth 1998). Eye-tracking devices can record the position and the time an eye spends looking at a point in the image, known as the fixation time.
In order to obtain the fixation time, several techniques have been developed. For example, Mackworth and Morandi (1967) propose dividing the image into a regular grid and counting the time spent in each cell of the grid.
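The grid-counting idea can be sketched as follows. This is our own illustration of the technique, not code from the cited work; the fixation format and grid size are assumptions.

```python
def grid_fixation_times(fixations, image_size, grid=(4, 4)):
    """Accumulate fixation time per grid cell.

    fixations: list of (x, y, duration) tuples in image coordinates.
    image_size: (width, height) of the image in pixels.
    grid: (columns, rows) the image is divided into.
    """
    width, height = image_size
    cols, rows = grid
    times = [[0.0] * cols for _ in range(rows)]
    for x, y, duration in fixations:
        # Map the fixation point to a grid cell, clamping to the edges.
        col = min(int(x * cols / width), cols - 1)
        row = min(int(y * rows / height), rows - 1)
        times[row][col] += duration
    return times
```

Cells with the largest accumulated times would then correspond to the candidate areas of interest.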
Some research builds on this information. Santella and DeCarlo (2004) present an automatic, data-driven method that generates a representation of viewer interest by clustering visual point-of-regard measurements into gazes (spatial clusters of successive fixations) and regions of interest.
Another example of clustering to extract regions of interest is the method of Latimer (1988), which consists of creating a histogram of fixation durations over the image and then clustering that histogram using k-means.
Lastly, there is a clustering method in which eye tracking is not used at all. Instead of using user data to gather interest areas, G. Kim and A. Torralba (2009) propose an unsupervised approach that introduces a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels.
In order to compare areas of interest, we need to measure the distance between those areas. Huttenlocher et al. (1993) use the Hausdorff distance to obtain the degree of resemblance between two objects in an image. The Hausdorff distance measures the degree of difference between two shapes. Shapes can be seen as sets of points, and according to the Hausdorff distance, two sets of points are close if every point in one set is close to some point of the other set.
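The definition above translates directly into code. The following is a minimal, unoptimized sketch of the symmetric Hausdorff distance over small point sets (our own illustration, not the cited implementation):

```python
import math

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two point sets a and b.

    For each point in one set, find the distance to the nearest point
    of the other set; the Hausdorff distance is the worst such case,
    taken in both directions.
    """
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))
```

For large point sets, production code would use a spatial index instead of this quadratic scan.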
Jaccard (1901) introduces the Jaccard index, better known as the Jaccard similarity. This measure obtains the degree of similarity of two sets by counting the number of elements they have in common and dividing it by the total number of elements in both. It can also be applied to shapes by converting each shape to the set of points it contains.
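In set notation, the Jaccard similarity is |A ∩ B| / |A ∪ B|, which is a one-liner in code (the empty-set convention below is our own assumption):

```python
def jaccard_similarity(a, b):
    """Jaccard index of two sets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)
```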
After a thorough literature review, we found that only A. Carlier et al. (2010) have done something similar in the interaction aspect, albeit with a completely different scope. They used pan and zoom measurements to determine which parts of a video were interesting to certain users, in order to crop the video to those sections. The purpose is to enhance the user experience on devices with reduced screen size, such as smartphones.
Our purpose is to test the hypothesis that areas of interest in images can be detected through simple interaction recordings (zoom, panning, and time spent looking at a certain area).
The project is based on the empirical method, because the only way to verify the proposed solution is through validation, by checking the performance of the experimental results.
Given that this is a novel approach compared to current technologies, we pursue a single research iteration:
Develop a first version of the algorithm.
Gather data via user testing, and confirm whether our platform detects what they found interesting.
Analyze the results, and determine the problems of the algorithm.
Propose an improvement on those weak points.
After doing desk research, we could not find freely available datasets with records of the movements a subject makes in an image (zoom, panning). Consequently, we had to generate this information ourselves through tests.
Subjects are shown a few images and are free to interact with them in any manner. In order to force the test subjects to zoom and move through the image, the images have a high resolution and are displayed at a small size. After this process, the same images are shown once again, and the test subjects are asked to mark the areas of the image they consider most interesting.
Once we had retrieved the data from all our test subjects, we analyzed the results by comparing the output of our algorithm (a heat-map) with the areas that the test subjects marked as interesting. It is important to note that we have two different types of data. On the one hand, the heat-map has continuous values in the interval [0,1]. On the other hand, the areas marked by the test subjects are binary (these areas are interesting and the rest of the image is not). Consequently, we need a way to compare both types of values.
In order to make this comparison, we defined an interest threshold to convert the continuous values into the binary values we must compare against. The Jaccard similarity is then calculated by contrasting the pixels above that threshold in the heat-map with the pixels inside the areas marked by the user.
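The thresholded comparison can be sketched as follows; the data layout (nested lists of floats and 0/1 flags) is an illustrative assumption:

```python
def thresholded_jaccard(heatmap, marked, threshold):
    """Binarize a continuous heat-map and compare it with a binary mask.

    heatmap: 2D list of floats in [0, 1].
    marked: 2D list of 0/1 flags (areas the user marked as interesting).
    threshold: interest value above which a pixel counts as interesting.
    Returns the Jaccard similarity of the two pixel sets.
    """
    inter = union = 0
    for hm_row, mk_row in zip(heatmap, marked):
        for value, mark in zip(hm_row, mk_row):
            hot = value > threshold
            if hot and mark:
                inter += 1
            if hot or mark:
                union += 1
    return inter / union if union else 1.0
```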
2.1 Data retrieval
We developed a test platform to perform tests with users. This test platform is based on a web server that provides a web page where the tests are performed. The test is composed of two phases displaying a set of images.
Phase 1 - Free interaction with the image.
The web page first displays an explanatory message, giving the users some hints about the test and how to perform it. The message says: ”Check out these images. You can zoom and pan as you wish. We may ask you something about them later. When you finish, click Next.”
The message was worded to minimize the influence on the test subjects while providing enough information to perform the test, as we needed the test subjects to behave the same way they do when they see an image on a web page.
After that, an image is displayed to the users so that they can zoom and pan it freely to check its most interesting parts. Once they are finished, they can click the Next button, which displays the next image (see Figure 3).
In order to force the user to zoom and pan through the image, the high-resolution images selected for this test are initially displayed at a small size.
Phase 2 - Explicit selection of areas of interest.
Once all the test images have been displayed, the test enters a new phase intended to explicitly retrieve the areas of the image the test subject considers interesting. As in the previous phase, an explanatory message is displayed to the user. This message says: ”Now we need you to draw squares surrounding the interesting elements in the image. As in the previous phase, you can move and zoom to the areas you want and then click on the first button to enable the drawing mode (click again to disable). After that, select with your finger the interesting areas. If you want to undo your last action, use the second button. When you finish, click Next.”
After that, the same images used in phase 1 are displayed one by one to the test subject. They can move through the image as in phase 1, but now with the ability to mark with squares the areas they consider interesting (see Figure 4).
3 The Platform
In this section, we describe the proposed architecture (Figure 5) for a production platform to extract the interesting areas of images.
We are using a client-server architecture. The client (Retrieval System) gathers and sends information regarding user actions, while the server (Analysis System) processes this information and generates the interest metrics.
Additionally, we have a Validation System, which is used during the analysis of the platform's performance with human subjects. The communication between these components is described alongside each component.
3.1 Retrieval system
The Retrieval System is in charge of gathering interface information from the user. Zoom-able images are usually implemented in web pages with libraries, since the browser does not provide native support for this. These libraries, sometimes written ad hoc, implement zooming capabilities in heterogeneous ways, which makes it hard to find an automatic way to gather the interface events needed for our algorithm (zoom and pan).
3.2 Analysis system
The Analysis System is in charge of receiving the user-interaction data from the client, processing this data with an algorithm, and generating a heat-map representing the interest of the user in different areas.
The Analysis System executes in the following way:
The event handler receives a new interface event from a client.
The event handler sends this data to the storage system.
The storage system permanently saves this information.
When a process requests the interest metric of an image, the algorithm is executed for each user that has interacted with the image.
The results of all the users are normalized according to the maximum and minimum interest of each section of the image, and returned to the calling process.
We used a simple JSON REST server that exposes all JSON files contained in a directory tree. The server listens to RESTful requests on a given port, and stores the information in the Data Storage.
The format of the events allows them to be stacked: each event can be handled and incorporated into the system upon arrival. This behavior is typical of log-structured storage, that is, an append-only sequence of data entries.
Given the identifiers of the test, user, and image involved in the transmission, we can easily isolate each action per image and user. The only difference is that in SneakPeek there is more than a single log. We have used a simple file hierarchy for this purpose: /test/image/user.
Our API appends new data at the end of each file. Each file is indexed by [image, user], which provides natural mutual exclusion (each file is accessed by a single user).
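An append-only log under the /test/image/user hierarchy could look like the sketch below. The function names and the one-JSON-object-per-line format are our own assumptions, not the exact SneakPeek server code:

```python
import json
from pathlib import Path

def append_event(root, test_id, image_id, user_id, event):
    """Append one interface event to the per-(image, user) log file.

    Files follow the /test/image/user hierarchy, so each file is only
    ever written by a single user (natural mutual exclusion). Events
    are stored as one JSON object per line, i.e. an append-only log.
    """
    log = Path(root) / test_id / image_id / f"{user_id}.log"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a") as fh:
        fh.write(json.dumps(event) + "\n")

def read_events(root, test_id, image_id, user_id):
    """Read back the ordered event log for one (image, user) pair."""
    log = Path(root) / test_id / image_id / f"{user_id}.log"
    return [json.loads(line) for line in log.open()]
```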
The complete algorithm for generating the heat-maps can be found in Algorithm 1. An overview of the algorithm's workflow follows.
Gather the event-log from certain (user,image).
Generate a zero-interest heat-map with a size equal to the size of the image.
Transform each event into an effect on the temporary heat-map.
Screen-focus event: We maintain a counter of the time spent visualizing each pixel of the image, which is later normalized according to the minimum and maximum values. The main assumption is that the time spent on a certain area is correlated with the user's interest in that area.
Zoom event: If the user dives in or out of a section of the image, this event is recorded. We apply a multiplier to the interest of the zoomed-in area based on the interest of the outer area, and vice versa. The multiplier is inversely related to the area: the larger the area, the smaller the effect of the overall interest weight. We normalize this feature according to the size of the image (the total area of the image corresponds to zero weight, and the weight increases linearly as the area decreases).
Panning event: Fast movements across the image may indicate a lack of interest by the user. On the other hand, small movements around a certain area of the image may indicate interest in the scanned area.
Generate results: We produce two different outputs. First, the heat-map of the interest areas across the image. Second, a deterministic number of areas that the algorithm found most relevant. In the current version, we apply a threshold to the interest of each pixel: if the interest of a pixel is above the image average, we take it as relevant.
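The screen-focus step and the above-average output rule can be sketched as follows. The event schema (a viewport rectangle plus a dwell time) is an illustrative assumption, not the exact SneakPeek event format, and the zoom and pan multipliers described above are omitted for brevity:

```python
def build_heatmap(events, image_size):
    """Accumulate dwell time over each event's visible viewport,
    then min-max normalize the result into [0, 1].

    events: list of dicts like {"viewport": (x0, y0, x1, y1), "dwell": ms}.
    image_size: (width, height) of the image in pixels.
    """
    width, height = image_size
    heat = [[0.0] * width for _ in range(height)]
    for ev in events:
        x0, y0, x1, y1 = ev["viewport"]
        for y in range(y0, y1):
            for x in range(x0, x1):
                heat[y][x] += ev["dwell"]
    flat = [v for row in heat for v in row]
    lo, hi = min(flat), max(flat)
    if hi > lo:
        heat = [[(v - lo) / (hi - lo) for v in row] for row in heat]
    return heat

def relevant_pixels(heat):
    """Pixels whose normalized interest is above the image average."""
    flat = [v for row in heat for v in row]
    avg = sum(flat) / len(flat)
    return {(x, y) for y, row in enumerate(heat)
            for x, v in enumerate(row) if v > avg}
```

In the full algorithm, the zoom and pan events would scale each viewport's contribution inside the same accumulation loop.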
3.3 Validation System
We have implemented a Test Controller that is responsible for executing both phases of the test and sending the data retrieved in phase 2 to the Test Web Server. This server then stores the information in JSON format in a MySQL database.
Once we have the user input, we proceed to validate the algorithm. First, the algorithm is run with the event data gathered from the user. Then, the heat-map of these events is generated and validated against the areas marked by the user.
To measure the similarity between the output of the algorithm and the areas marked by the user, we define as output all pixels that have a higher interest value than the average over the whole image. Finally, we calculate the Jaccard similarity between the areas marked by the user and the areas output of the algorithm.
In Figure 9, we can find an example of such a validation. The red areas represent the heat-map output from SneakPeek; the green ones represent what the user marked as interesting; and the yellow ones represent the overlap between the two, which is what the Jaccard similarity measures.
4 Results and Analysis
We have gathered data from 34 different users for the first instance of SneakPeek. We added a visualization tool to showcase the results of the tests, which includes the heat-map, the areas marked by the user, and the intersection of both.
In Figure 10, the results show variance depending on the image under test. In the appendix, we can find the images used in the experiment. The first four represent different numbers of objects, object sizes, and image sizes. The last image was an easter egg in which the user was supposed to find Waldo.
For instance, Image 1 recorded the best performance of the platform. This image contains big objects in a medium-sized image, which seems to work best in SneakPeek. A major problem across the experiments is that, in all of them, the minimum and maximum Jaccard similarities differ widely from the average.
In the case of Image 3, the results were not as good: on average, the algorithm showcases a wider area of interest than the one the user points out. This could be tackled by increasing the threshold at which the algorithm classifies an area as interesting.
Lastly, in Image 4, some users achieved great results when they found the actors interesting, while others achieved poor results when they were interested in a particular face or object.
In conclusion, the detection of areas of interest seems to work very well with medium-to-big-sized objects in the image, while it tends to fall behind when small objects of interest are present.
5.1 Future work
This paper reflects a first iteration on the usage of interface metrics to record interest patterns from users. If the scope of this paper were extended further, a series of improvements could be implemented.
First of all, the amount of data gathered in the experiments represents only a small fraction of the data that could be gathered if SneakPeek were deployed in a commercial environment. Consumer-facing web pages have many thousands of users continuously navigating through their products.
Second, some aspects of the front-end appear faulty. For instance, SneakPeek records user interaction at fixed time intervals, which leads to a non-optimized recording of the interaction: if the user is looking at position (0,0) at time=0 ms and moves to (0,6) at time=100 ms, the platform should be able to infer that the user passed through (0,3) at time=50 ms.
Third, the threshold that determines whether the algorithm classifies an area as interesting is an optimizable parameter. After many trials, it is hard to come up with a single threshold that works best for all test images. Through model training, a process widely known in the field of Machine Learning, the threshold could be fitted to minimize a cost function, in this case one minus the Jaccard similarity.
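A simpler alternative to a learned model is a plain grid search over candidate thresholds, maximizing the mean Jaccard similarity over the collected test data. This sketch is our own illustration; the function names and the candidate grid are assumptions:

```python
def best_threshold(heatmaps_and_masks, jaccard, candidates=None):
    """Grid-search the interest threshold that maximizes the mean
    Jaccard similarity over a set of (heat-map, user mask) pairs.

    jaccard: a function (heatmap, mask, threshold) -> similarity,
    such as the thresholded comparison used in the validation.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    def mean_score(t):
        scores = [jaccard(hm, mask, t) for hm, mask in heatmaps_and_masks]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)
```

With enough users, the search could also be run per image category, since the results suggest that object size drives how wide the optimal threshold should be.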
5.2 Sustainability and Ethics
There are certain ethical challenges to our research. First and most important is the privacy and anonymity of the users of the platform. In our case, we have maintained the anonymity of our test subjects by storing all retrieved data anonymously, so the data cannot be related to a specific person. However, if an industry modified this aspect of our system, it could potentially be used to create a profile of a person and know exactly what they like.
Furthermore, we believe our system is less intrusive than current eye-tracking technologies. To retrieve the same data our system is designed to gather (from thousands of Internet users around the world), users would have to agree to activate their web cameras so that the web page could record a video of their faces while they navigate. Most users would not accept this. In our case, we only keep a log of the actions users perform through their screen interface.
Regarding sustainability, our system can help producers understand what clients want. This is related to Goal 12 of the United Nations’ (UN) Sustainable Development Goals for 2030 (UN17): ”Ensure sustainable consumption and production patterns”. With our system, fewer resources need to be spent producing goods that clients will not like. Consequently, the use of Earth’s resources can be optimized, and less material and energy will be wasted.
In addition, this system can enhance relations between producers and consumers by helping customers get what they want, so that they spend less time looking for it.
-  P. R. Report, “2015 Video Eye Trakker Industry Report - Global and Chinese Market Scenario.” [Online]. Available: http://healthcareanalysisreport.blogspot.se/2016/01/video-eye-trakker-market-size-share.html
-  N. H. Mackworth and A. J. Morandi, “The gaze selects informative details within pictures,” Perception & Psychophysics, vol. 2, no. 11, pp. 547–552, 1967. doi: 10.3758/BF03210264. [Online]. Available: http://dx.doi.org/10.3758/BF03210264
-  M. A. Just and P. A. Carpenter, “Eye fixations and cognitive processes,” Cognitive Psychology, vol. 8, no. 4, pp. 441 – 480, 1976. doi: http://dx.doi.org/10.1016/0010-0285(76)90015-3. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0010028576900153
-  J. M. Henderson and A. Hollingworth, “Chapter 12 - eye movements during scene viewing: An overview,” in Eye Guidance in Reading and Scene Perception, G. Underwood, Ed. Amsterdam: Elsevier Science Ltd, 1998, pp. 269 – 293. ISBN 978-0-08-043361-5. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780080433615500134
-  A. Santella and D. DeCarlo, “Robust clustering of eye movement recordings for quantification of visual interest,” in ETRA, 2004.
-  C. R. Latimer, “Eye-movement data: Cumulative fixation time and cluster analysis,” Behavior Research Methods, Instruments, & Computers, vol. 20, no. 5, pp. 437–470, 1988. doi: 10.3758/BF03202698. [Online]. Available: http://dx.doi.org/10.3758/BF03202698
-  G. Kim and A. Torralba, “Unsupervised detection of regions of interest using iterative link analysis,” Annual Conference on Neural Information Processing Systems, Dec. 2009. [Online]. Available: https://www.cs.cmu.edu/~gunhee/publish/nips09_gunhee.pdf
-  D. P. Huttenlocher, G. A. Klanderman, and W. A. Rucklidge, “Comparing images using the hausdorff distance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 9, pp. 850–863, Sep. 1993. doi: 10.1109/34.232073. [Online]. Available: http://dx.doi.org/10.1109/34.232073
-  P. Jaccard, “Étude comparative de la distribution florale dans une portion des Alpes et des Jura,” Bulletin del la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.
-  A. Carlier, V. Charvillat, W. T. Ooi, R. Grigoras, and G. Morin, “Crowdsourced automatic zoom and scroll for video retargeting,” Oct. 2010. doi: 10.1145/1880000. [Online]. Available: http://delivery.acm.org/10.1145/1880000/1873962/p201-carlier.pdf
-  A. Vera and D. Shahrokhian, “Interestjs,” 2016. [Online]. Available: https://github.com/AlejandroVera/interest-js
-  D. Shahrokhian and A. Vera, “SneakPeek web,” 2016, [Online; accessed 12-January-2017]. [Online]. Available: http://interest.ddns.net/interest/
-  A. Vera and D. Shahrokhian, “Retrieval system implementation,” 2016. [Online]. Available: https://github.com/dshahrokhian/inteREST-client
-  D. Shahrokhian and A. Vera, “Sneakpeek server,” 2016. [Online]. Available: https://github.com/dshahrokhian/sneakpeek-server
-  United Nations, Transforming Our World: The 2030 Agenda for Sustainable Development. United Nations, 2015. [Online]. Available: http://www.un.org/ga/search/view_doc.asp?symbol=A/RES/70/1&referer=/english/&Lang=E