Finding interesting images from a large collection is a need common to both laymen surfing the internet and professional graphic designers in their work. Content-based image retrieval (CBIR) systems try to fulfil the need by showing images that are similar to what they think the user is looking for. Unfortunately CBIR systems typically do not have a clear idea of what the user is looking for, due to difficulties in communicating visual content from the user to the system. One approach to get the search started is to prune the set of candidate images with a keyword or tag search, given that the images in the collection have been tagged beforehand. However, simple tags do not carry all interesting information about the images and hence the number of images after pruning might still be so large that a further content-based search is needed. Moreover, the tags are often imperfect and it may be difficult for the user to formulate relevant tags, for example if the goal is to find an image that fits well into a given poster, or to find a beautiful image of a flower. A common approach to continue the search, after possible pruning by tags, is to ask the user for explicit relevance feedback on the shown images . However, this is a laborious process and the user might be unwilling to invest such an effort, for example while casually browsing the web.
Another approach is to obtain this feedback implicitly, by measuring indirect signals on attention patterns of the users and inferring the relevance of the seen images from these [2, 3, 4, 5, 6, 7]. This is the approach taken also by PinView, a CBIR system presented in this paper. PinView uses implicit feedback from eye movements and explicit feedback from pointer clicks to infer the interests of the user, in order to iteratively show more relevant images. Our argument on the feasibility of using eye tracking is that even though trackers are still somewhat expensive and cumbersome to use, it is a plausible scenario that they will become widely available and widely used in several applications. There are no fundamental restrictions on why an eye tracker could not be integrated into every PC and smart phone, because mass-manufacturing core components of trackers based on infrared oculography is not expensive: they require a camera, an infra-red light source, and software. Once eye trackers are available, the added cost of using them in CBIR is very low.
PinView must solve several subproblems to take advantage of the recorded noisy implicit relevance feedback. The first problem is how to infer relevance of seen images from gaze patterns and clicks. The second problem is that there is a multitude of different visual (low-level) features and similarity measures for images. Each of these captures a specific aspect of similarity, like color, texture, or shape of edges in the image. Which of these features are relevant in the current search and how does the user perceive them? PinView infers a customized similarity metric for each search session with a multiple kernel learning algorithm and tensor projection working on these features.
The third and final problem is how to select images to show to the user. Given that the system is able to show the user only a limited number of images, how should it balance exploitation of its currently limited knowledge of the query and exploration of new kinds of images? PinView incorporates a specialized exploration-exploitation algorithm LinRel which uses the inferred metric between images to suggest new images to be shown to the user.
In this paper we introduce the full PinView system expanding from the two partial views in preliminary conference papers [8, 9]. We apply PinView to both offline and online CBIR tasks to study it both in controllable setups and in real image retrieval settings.
2 Background and Related Work
In this section we discuss related work in image retrieval and eye movement research. Content-based image retrieval (CBIR) is a well-researched topic, whose history can be followed and comprehensive introductions to which can be found in surveys such as [1, 10, 11, 12]. In addition, many of CBIR’s research questions have been covered by related works on content-based multimedia retrieval, including, e.g., reviews [13, 14].
The semantic gap , i.e., the unavoidable inability of low-level visual features to capture and mediate the semantic content and mutual similarity of images, has often been cited as the foremost hindrance of successful CBIR. In particular, the existence of the semantic gap has been given as a reason for the failure of image retrieval approaches that have relied on automatic image interpretation and textual querying. How severely the gap actually harms the accuracy and usability of a CBIR system will depend on the application and the particular image retrieval task at hand. In some types of searches it will be just the visual and not the semantic similarity between the searched and retrieved images that plays the primary role and, consequently, the problem of the semantic gap will be minimal .
Since the mid-1990’s, relevance feedback has been used for incorporating the user’s preferences and his understanding of the semantic similarity of images in the retrieval process [15, 16]. Research on relevance feedback techniques constitutes a subfield of CBIR research in its own right and the early works on the topic have been summarized in . The forms of explicit user interaction and giving of relevance feedback in interactive CBIR vary. In retrieval systems with multiple feature representations of the images, a straightforward approach could be to ask the user to tune the relative weights of the features in order to be able to find more relevant images . The weight tuning method and other approaches where the user is required to be able to modify the internal parameters of the CBIR system are, however, impractical for non-professional use.
In practical CBIR systems implementing relevance feedback, the standard setting is that after the user has been presented with a set of images, the system expects him to reliably assess the relevance of each retrieved image and to return this information to the system. This effectively reformulates the interactive image retrieval process as an online machine learning task with small but increasing numbers of training samples to learn the statistics of relevant (and non-relevant) images on each query round. From the user interface perspective, this type of relevance feedback is often implemented by means of the user clicking on the relevant images, checking associated check boxes or giving a numerical relevance assessment to each image with a slider or from a multi-value choice list. It is also possible that instead of assessing each image independently, the user is asked to rank the images on the page by their relevance in comparison searching.
In numerous studies (e.g. those cited in ), explicit interactive relevance feedback has been shown to provide a dramatic improvement in the accuracy of image retrieval. Giving explicit and accurate relevance feedback for each seen image is, however, bound to be time consuming and cognitively strenuous. Therefore, implicit feedback strategies have received considerable interest in the information retrieval (IR) community, due to the promise of decreasing the burden on the user. It has become clear that implicit feedback can improve information retrieval accuracy (see the review ), but figuring out the most effective modalities for various search scenarios is still a subject of ongoing research and various alternatives are being proposed ranging from simple measures like number of clicks to brain computer interfaces that are not yet practically feasible for real search tools.
The more traditional implicit feedback approaches rely on feedback obtained from the control devices. Claypool et al.  studied use of mouse and keyboard activity, as well as time spent on the page and scrolling, and  compared the amount of information between such implicit channels and explicit feedback. The most consistent finding in these kinds of works has been that the time spent on the page and the way the user exits the page are good indicators of relevance. More advanced works still using the regular control devices use click-through data, typically on the search result page . While these sources of implicit information are readily available for all search tools, they provide a rather limited view of the actions and intents of the user.
In the other extreme, a number of approaches have used brain computer interfaces for IR or related tasks. The C3Vision system  and a human-aided computing approach by  infer image categories or presence of distinct objects in images from EEG measurements, and [24, 25] use fMRI techniques for image categorization. Wang et al.  built a prototype image annotation system using these ideas; relevance of images is inferred from EEG and visual pattern mining is used to retrieve similar images. They do not, however, consider a full relevance feedback procedure for retrieval, but only study a single iteration and measure the performance as annotation accuracy. Brain activity measurements provide the most accurate picture of the intents of the user, but are clearly not yet practically feasible for real retrieval tools. Notable instrumentation and modeling challenges remain to be solved for making the devices applicable for daily use.
The most interesting implicit feedback modalities fall between these two extremes. Various information signals can be captured by microphones, cameras or other easily wearable sensors, and they are likely to contain more information on the intentions of the user than what can be observed through the traditional control devices. Both speech and gestures have been extensively used as explicit control modalities, but there are also a few studies on their implicit use. For example,  infers tags for images from implicit speech and  considers facial expressions as indicators of topical relevance. In addition, various physiological measurements are extensively used for inferring the affective state of the user, which can in turn be used as a feedback source [29, 30]. However, to our knowledge there are no fully fledged image retrieval systems that use these input modalities as implicit feedback.
The primary feedback in this work is based on eye movements, which have become an increasingly popular feedback source in recent years, following the early concepts by . The primary body of eye-tracking works in IR has been done for text retrieval, because the highly structured eye movements while reading are easier to model. The approaches range from explicit control and relevance estimation of text passages [33, 34] to inferring complete queries based on eye movements on the results pages.
The text retrieval works were followed by early attempts at utilizing eye movements in image retrieval. Based on the results of a comparison between a visual attention model and measured gaze fixations, it has been suggested that eye tracking could be used as an interface for image retrieval, but no actual retrieval setup was yet investigated. The Eye-Vision-Bot system integrated an eye tracker with the GIFT image retrieval system (http://www.gnu.org/software/gift/) merely as a demonstration of the possibilities of gaze-based interaction without any experimental evaluations. In [3, 37], a CBIR system was implemented that used offline image saliency and online gaze fixations for extracting visual features from those image areas that were likely to be relevant when determining the relevancy of the image. The system showed promising results in offline experiments, but was not ready for real interactive user experiments.
The first fully interactive and experimentally evaluated CBIR systems that made use of eye-tracking data were presented in [2, 6, 38]. The selection of an image as relevant was in  solely dependent on the accumulated fixation time exceeding a preset threshold, whereas in  also a richer set of gaze parameters, including saccadic speeds and the number of images with fixations, were used. Image similarity assessment was in  based on visual features extracted from non-overlapping tiles of the images. The user indicated the most relevant image by clicking, after which new images were retrieved based on the sum of tile-wise feature distances weighted with values from the fixation map. Clear performance improvements were obtained in the evaluations over random selection in [6, 38] and over simple image clicking without gaze-based distance weighting in .
Two decisive characteristics common to the setups of [2, 6, 38] should, however, be noticed. First, the user is expected to always explicitly select exactly one relevant image, by either eye fixation or mouse clicking. Second, the user interface has in the experiments been such that the target or query image is continuously visible on the screen, which is not plausible in real CBIR applications. Showing the target will also facilitate and even encourage the use of gaze for image comparison, which will certainly have an effect on the gaze patterns.
Later, also  and [40, 41] and  introduced their image retrieval systems using eye movements. The first one  was based on a conceptual interface designed to be controlled completely by implicit gaze, providing a mix of a browsing and search tool. A small-scale online experiment was provided, but it cannot be used for drawing strong conclusions on the accuracy of the retrieval results. The second study mostly concentrated on the accuracy of inferring the relevance in  and on fixation-weighted region matching between the query and database images in . The last one  used gaze data for genuinely implicit relevance feedback by means of reranking the results of Google Image Search. However, the system was not yet fully functional, as the described experimental evaluation was done in a non-interactive mode.
3 System Components
In this section we describe the main components of the system. It consists of four main components, which will be explained in more detail in the following sections. The first component predicts the relevance of seen images based on clicks and image features. Tensor decomposition and multiple kernel learning modules then infer a metric between images using known visual features of the images (see Table I and  for more detailed descriptions of the used features) and relevance feedback on the seen images. The final component, a specialized exploration-exploitation algorithm called LinRel, suggests new images to be shown to the user.
Figure 1 summarizes the flow of information and the relationships between the different components. The input from the user, captured by mouse clicks and the eye tracker, is fed into the image relevance predictor. The predicted relevance scores are then given to the multiple kernel learning module together with the image features extracted from the images, for the purpose of learning which feature sets the similarity metric should utilize for comparing the images. The metric is fed to the tensor decomposition module to be combined with the eye movement features, in order to learn a representation that enables implicitly estimating eye movement features also for unseen images. Finally, the system selects a new set of images with the LinRel algorithm based on the inferred relevance scores and the final metric given by the tensor decomposition, and the images are retrieved from a database and displayed through the PicSOM backend .
|Feature||Dimensions|
|DCT coefficients of average colour in rectangular grid||12|
|CIE Lab colour of two dominant colour clusters||6|
|Histogram of local edge statistics||80|
|Haar transform of quantised HSV colour histogram||256|
|Histogram of interest point SIFT features||256|
|Average CIE Lab colour||15|
|Three central moments of CIE Lab colour distribution|| |
|Histogram of four Sobel edge directions||20|
|Co-occurrence matrix of four Sobel edge directions||80|
|Magnitude of the FFT of Sobel edge image||128|
|Histogram of relative brightness of neighbouring pixels||40|
3.1 Relevance Prediction from Eye Movements and Clicks
PinView infers relevance of images during a search task from implicit feedback, explicit feedback given by the user, or their combination. As implicit feedback PinView uses eye movements of the user, building on the recent promising results on inferring image relevance from eye movements [39, 41].
The gaze direction is an indicator of the focus of attention, since accurate viewing is possible only in the central fovea area which covers 1–2 degrees of the visual angle. However, the correspondence is not one-to-one because the users can shift the attention without moving their eyes. Gaze tracking has been used extensively in the psychology literature, and more recently also in information retrieval settings to track attention patterns of users. Some examples include the human-computer interaction aspects of how users perform searches , analysis of user behavior in web search , and using eye movements as implicit relevance feedback in textual IR [47, 48]. The promising results on the textual IR task suggest that using eye movements for relevance determination could be possible also in image retrieval tasks, where they would be even more severely needed. Hence, the PinView system estimates the relationship between eye movement patterns and relevance of images from data. As explicit feedback PinView uses pointer clicks by the user.
We measured the eye movements with a Tobii 1750 eye tracker with a 50 Hz sampling rate. The tracker has two infra-red lights and an infra-red stereo camera attached to a flat-screen monitor, and the tracking is based on detection of pupil centers and measurement of corneal reflection. The eyes move in rapid ballistic movements called saccades, from one fixation to another. Within each fixation the eyes are fairly motionless. Raw eye measurements are preprocessed by first extracting fixations and saccades, judging a set of consecutive raw measurements to be a fixation if they occur within a dispersion of 30 pixels, which at normal viewing distance is equivalent to roughly 0.6 visual degrees (17-inch screen with a resolution of 1280×1024 pixels). A fixation is defined to be a period of at least 100 milliseconds of looking at a single location on the screen.
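The fixation extraction described above can be illustrated with a minimal dispersion-threshold sketch. The text does not define how dispersion is computed, so the sum of the horizontal and vertical ranges used here is an assumption, as are the function names and the output format; only the 50 Hz rate, the 30 pixel dispersion, and the 100 ms minimum duration come from the text.

```python
SAMPLE_PERIOD_MS = 20    # 50 Hz sampling rate
MAX_DISPERSION_PX = 30   # dispersion threshold from the text
MIN_DURATION_MS = 100    # minimum fixation duration from the text


def dispersion(points):
    """Assumed dispersion measure: x range plus y range of the samples."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples):
    """Return fixations as (start, end_exclusive, centroid_x, centroid_y).

    `samples` is a list of (x, y) gaze positions in pixels, one per
    measurement. Samples not covered by any fixation belong to saccades.
    """
    min_len = -(-MIN_DURATION_MS // SAMPLE_PERIOD_MS)  # ceiling division
    fixations = []
    start, n = 0, len(samples)
    while start + min_len <= n:
        end = start + min_len
        if dispersion(samples[start:end]) > MAX_DISPERSION_PX:
            start += 1  # window too spread out, slide forward by one sample
            continue
        # Grow the window while the samples stay within the threshold.
        while end < n and dispersion(samples[start:end + 1]) <= MAX_DISPERSION_PX:
            end += 1
        window = samples[start:end]
        cx = sum(p[0] for p in window) / len(window)
        cy = sum(p[1] for p in window) / len(window)
        fixations.append((start, end, cx, cy))
        start = end
    return fixations


gaze = [(100, 100), (102, 101), (101, 99), (100, 102), (103, 100), (101, 101),
        (200, 150), (300, 220),
        (400, 300), (401, 302), (399, 301), (402, 299), (400, 300)]
for start, end, cx, cy in detect_fixations(gaze):
    print(start, end, (end - start) * SAMPLE_PERIOD_MS, "ms")
```

In this toy trace the two samples between the clusters are treated as part of a saccade, since no 100 ms window containing them stays within the dispersion threshold.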
Inferring the relevance feedback requires a mapping from the gaze pattern to the relevances. It is infeasible to assume that such a mapping could be constructed from first principles of human vision, and therefore we take the machine learning approach of learning it from data. That is, we assume a simple parametric mapping from a set of gaze features computed from the eye movement trajectory to the relevances, and learn its parameters from a training data with known relevance scores. To avoid needing user adaptation, we learned a single user- and task-independent predictor from data collected from multiple users and a few search tasks. This was done on data collected in online search sessions separate from the actual experiments reported in Section 4, to avoid possible biases due to having trained the relevance predictor in the same search tasks.
For each viewed image PinView collects 19 features (Table II) computed from both raw eye movement samples and fixations, including aspects such as the logarithm of the total time the image was looked at and the number of regressions to already seen images. Instead of attempting to construct maximally psychovisually motivated features, the set of features was chosen so that they are efficient to compute and can intuitively be expected to be informative of the relevance. Furthermore, the features do not depend on the image content, so that the predictor can directly generalize to different search tasks and databases.
The learned parameters of the relevance predictor are a weight vector and a bias term. To improve the accuracy, each feature was standardized to zero mean and unit variance and the parameters were learned with 2-norm regularization on the weights, with the regularization constant selected by 5-fold cross-validation. Finally, the predicted relevance for images not viewed at all is set at a small constant value.
In the relevance predictor training data, six subjects (staff members of Aalto University who were not associated with this work) performed 12 different search tasks. The objective of each task was to find as many examples as possible of a given image category of the PASCAL Visual Object Classes Challenge 2007 (VOC2007) dataset. Ten collages consisting of 15 images chosen by the PicSOM system were shown in each task, containing a varying number of relevant images to cover various types of collages observed in real search tasks. In six of the 12 search tasks the objective was to find either a cat or a dog and the database was limited to cat and dog images, resulting in around 50% of images being relevant. The other six tasks had 8–12% of relevant images, and the target was either motorbikes or aircraft in the full VOC2007 collection.
Finally, when combining implicit and explicit feedback, we resorted to a simple and fast method: the information of which images were clicked is integrated into the model by adding a constant $\alpha$, determined in offline experiments, to the relevance score of the clicked image. The final relevance prediction is hence given by
$$\hat{r} = \hat{p} + \alpha\,\delta,$$
where $\hat{p}$ is the relevance score predicted from the eye movements, $\delta = 1$ for images that were clicked and $\delta = 0$ for all other images. As a side effect, the relevance score is not directly interpretable as a probability but that does not affect the next steps.
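The relevance prediction and its combination with clicks can be sketched as follows. The logistic link is an assumption, suggested only by the remark that the score stops being interpretable as a probability once the click constant is added; the names `UNSEEN_SCORE`, `click_bonus`, and all numeric values are hypothetical and are not taken from the experiments.

```python
import math

UNSEEN_SCORE = 0.05  # hypothetical "small constant" for images never viewed


def standardize(features, means, stds):
    """Scale each gaze feature to zero mean and unit variance."""
    return [(f - m) / s for f, m, s in zip(features, means, stds)]


def gaze_relevance(features, weights, bias):
    """Assumed logistic mapping from standardized gaze features to (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def combined_relevance(features, clicked, weights, bias, click_bonus):
    """Gaze-based score plus a constant bonus for clicked images.

    `features` is None for images that were not viewed at all.
    """
    if features is None:
        score = UNSEEN_SCORE
    else:
        score = gaze_relevance(features, weights, bias)
    return score + (click_bonus if clicked else 0.0)


# Toy example with two features instead of the 19 gaze features.
f = standardize([3.0, 10.0], means=[1.0, 10.0], stds=[2.0, 5.0])
print(combined_relevance(f, clicked=True, weights=[0.8, -0.2],
                         bias=0.0, click_bonus=1.0))
```

Because the bonus is simply added, a clicked image always outranks an unclicked image with the same gaze pattern, which is all that the later steps require.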
|Raw data features|
|1||numMeasurements||log of total time of viewing the image|
|2||numOutsideFix||total time for measurements outside fixations|
|3||ratioInsideOutside||percentage of measurements inside/outside fixations|
|4||speed||average distance between two consecutive measurements|
|5||coverage||number of subimages ( grid) that contain at least one measurement|
|6||normCoverage||coverage normalized by numMeasurements|
|7||pupil||maximal pupil diameter during viewing|
|8||nJumps1||number of breaks (measurements outside the image between two visits) longer than 60ms|
|9||nJumps2||number of breaks longer than 600ms|
|10||numFix||total number of fixations|
|11||meanFixLen||mean length of fixations|
|12||totalFixLen||total length of fixations|
|13||fixPrct||percentage of time spent in fixations|
|14||nJumpsFix||number of re-visits (regressions) to the image|
|15||maxAngle||maximal angle between two consecutive saccades, transitions from one fixation to another|
|16||firstFixLen||length of the first fixation|
|17||firstFixNum||number of fixations during the first visit|
|18||distPrev||distance to the fixation before the first visit|
|19||durPrev||duration of the fixation before the first visit|
3.2 Multiple Kernel Learning
Learning the similarity measures or metric of importance for our CBIR task is central in retrieval. Some image searches may require a combination of image features to quickly distinguish them from other less relevant images. For instance, colour and texture features may be important to find pictures of snowscapes, whereas colour may be the only important feature needed to find images of blue skies. We would like to use a combination of the metrics as a cue to finding relevant images quickly and efficiently, and then pass this learnt metric (kernel) to the LinRel algorithm of Section 3.4.
Given image feature vectors $x_i$ and $x_j$, let the inner product $\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ denote the kernel function between images $i$ and $j$, where $\phi$ is some feature mapping. Multiple kernel learning (MKL) attempts to find a combination of kernels by solving a classification (or regression) problem using a weighted combination of kernels [51, 52, 53]. Given that our PinView system will use several different image features, we view each one as a separate feature space, hence giving us $m$ different kernels, i.e., $\kappa_1, \ldots, \kappa_m$, for the different image features. Using MKL we construct the kernel function:
$$\kappa_{\eta}(x_i, x_j) = \sum_{k=1}^{m} \eta_k \, \kappa_k(x_i, x_j),$$
where $\eta_k$ are the weights of each kernel function between images $i$ and $j$. We follow an elastic-net formulation of ridge regression MKL, which uses a parameter in order to move between a 1-norm regularization and a 2-norm regularization. (We would be able to dynamically change the value of this parameter throughout the search; however, for simplicity we fix it in the experiments.) Given the Gram matrix of image features and the vector of relevance scores observed so far, we solve the following multiple kernel learning regression problem:
subject to , where is the weight vector corresponding to the th feature space (i.e. kernel). The justification for using this algorithm is that we expect to use many kernels in the first iteration rounds of our search and not too many near the end, as we gain a better understanding of relevance inferred through (explicit) pointer clicks and (implicit) eye movements (as described in Section 3.1). After each iteration of the search, when the user has indicated the relevance of the newly seen images, we can use this feedback as the labels (outputs) of our classification (regression) MKL problem to find a new set of kernel weights (based solely on the images seen thus far).
After we learn this new representation we supply these weighted kernels to the kernelized LinRel algorithm of Section 3.4. However, before that we describe the component of our system that uses eye movements as an extra set of features, by creating a combined space using the kernel learnt using Equation (2).
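To make the weighted kernel combination concrete, the following minimal sketch builds one Gram matrix per feature set and mixes them with fixed weights. It assumes linear kernels and hand-picked weights; in PinView the weights would come from the elastic-net MKL regression, which is not implemented here, and the feature names and values are purely illustrative.

```python
def linear_kernel(a, b):
    """Inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def gram_matrix(vectors, kernel=linear_kernel):
    """Kernel values between all pairs of images for one feature set."""
    return [[kernel(u, v) for v in vectors] for u in vectors]


def combine_kernels(grams, weights):
    """Weighted sum of per-feature Gram matrices."""
    n = len(grams[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, grams))
             for j in range(n)]
            for i in range(n)]


# Three images described by two hypothetical feature sets.
colour = [[1, 0], [0, 1], [1, 1]]
texture = [[2], [0], [1]]

# Hypothetical kernel weights, standing in for the learned MKL weights.
K = combine_kernels([gram_matrix(colour), gram_matrix(texture)], [0.75, 0.25])
for row in K:
    print(row)
```

A sparse weight vector (most weights zero) corresponds to a search in which only a few feature sets matter, as in the blue-sky example above.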
3.3 Tensor Decomposition
Since eye movements are available only for images already presented to the user, eye movement features cannot be used directly for predicting the relevance of unseen images. To alleviate this problem, we relate the known image features to the (yet) unknown eye movement features by learning a joint representation that combines these two views. We learn this relationship by using a tensor representation which creates an implicit correlation space . The tensor representation can be computed by taking dot products between each individual kernel matrix of each view [55, 56].
Hence, let $K_x$ be the kernel Gram matrix constructed from previously seen image feature vectors $x$. Similarly, let $K_e$ be the kernel Gram matrix constructed from eye movement features $e$. Given these two kernel matrices we can combine them by taking a component-wise product $K = K_x \circ K_e$, which corresponds to a tensor product $\phi_x(x) \otimes \phi_e(e)$ between the feature vectors $\phi_x(x)$ and $\phi_e(e)$. We then use the kernel matrix $K$ to train a tensor kernel SVM to generate a weight matrix $W$ which is composed of both views. As mentioned earlier, we do not have the eye movement features for images not yet displayed to the user. Hence, we need to decompose the weight matrix into one weight vector per view. This has been resolved by , who propose a novel singular value decomposition (SVD) like approach for decomposing the resulting tensor weight matrix into its two component parts, without needing to directly access the feature spaces $\phi_x$ and $\phi_e$.
Therefore, assume the weight matrix decomposed for the image features is and for the eye movement features is , where corresponds to the number of images seen and is the dimensionality of the decomposition. Given we can project any of the MKL combined image features as follows :
to produce a new feature vector which has been mapped into the eye movement feature space using the matrix . Finally, we can create the following kernel function from our new feature vectors and . After we have this new representation we can pass this new updated kernel to the kernelized LinRel algorithm of the following section.
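The correspondence between the component-wise product of two Gram matrices and a tensor product of feature vectors can be checked numerically. The sketch below assumes linear kernels and small integer toy vectors; it only verifies this identity and does not implement the tensor kernel SVM or the SVD-like decomposition.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram(vectors):
    """Linear-kernel Gram matrix of a list of feature vectors."""
    return [[dot(u, v) for v in vectors] for u in vectors]


def hadamard(A, B):
    """Component-wise product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def tensor_features(x, e):
    """Flattened tensor (outer) product of an image and a gaze vector."""
    return [xi * ej for xi in x for ej in e]


# Toy data: three seen images with image features and gaze features.
image_feats = [[1, 2], [0, 1], [3, 1]]
gaze_feats = [[1, 0, 2], [2, 1, 0], [1, 1, 1]]

K_product = hadamard(gram(image_feats), gram(gaze_feats))
K_tensor = gram([tensor_features(x, e)
                 for x, e in zip(image_feats, gaze_feats)])
print(K_product == K_tensor)
```

The practical point is that the joint space never has to be built explicitly: multiplying the two Gram matrices entry by entry gives the same kernel at a fraction of the cost.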
3.4 The LinRel Algorithm
After updating the similarity metric and the associated kernel through MKL and the tensor decomposition as described in the previous sections, the LinRel algorithm is used for selecting the next collage of images that is presented to the user. The LinRel algorithm (originally devised and analysed in ) is an exploration-exploitation oriented online learning algorithm. It aims to sequentially present images to the user such that the positive feedback from the user is maximized. Hence the LinRel algorithm is very well suited to be used in the PinView system for retrieving images that are of interest to the user.
Given the image features, the relevance of images is assumed to be mutually independent. LinRel then assumes that the expected relevance $y_i$ of an image $i$ is given by an (unknown) linear function of the image features $x_i$,
$$\mathbb{E}[y_i \mid x_i] = x_i \cdot w,$$
with an unknown weight vector $w$. Thus in each step of the search, LinRel estimates the weight vector by some $\hat{w}$ and uses this estimate to select an image which is likely to be relevant. But since the estimate might be inaccurate, LinRel also needs to ensure a sufficient amount of exploration. This is achieved by taking into account a bound $\hat{\sigma}_i$ on the variance of the estimated relevance $x_i \cdot \hat{w}$, by considering an appropriate confidence interval for the “true” expected relevance. Thus LinRel selects the image which maximizes the upper confidence bound,
$$i^{*} = \arg\max_i \left( x_i \cdot \hat{w} + c\,\hat{\sigma}_i \right), \tag{5}$$
where the parameter $c$ controls the amount of exploration. This rule selects an image if its predicted relevance is high (which is an exploitative selection), or if the variance of this estimate is high (which is an explorative selection). It is shown in the analysis of LinRel , that selecting an image with high variance according to the above rule improves the accuracy of the estimated weight vector $\hat{w}$. It is also shown that the error rate of LinRel, compared to the best linear predictor of the relevance, is essentially $\sqrt{d/t}$ after $t$ steps of the search, where $d$ is the number of dimensions of the feature vector $x_i$. While the original LinRel algorithm in  explores the dimensions of the feature vector explicitly, more recent variations of LinRel (e.g. LinUCB in ) use regularization to deal with large feature vectors. For the PinView system we also use regularized LinRel , which calculates an estimate $\hat{w}$ for the weight vector $w$ by regularized linear regression for the observed relevance scores of the selected images so far. The solution of the regularized regression can be written using the Gram matrix $XX^{\top}$ as
$$\hat{w} = X^{\top} \left( XX^{\top} + \mu I \right)^{-1} y,$$
where $X$ is the matrix of feature vectors of the images selected so far, $X = (x_{i_1}, \ldots, x_{i_t})^{\top}$, $y$ is the vector of relevance scores observed so far, $I$ denotes the identity matrix, and $\mu$ is the regularization parameter. Thus the estimated relevance of an image $i$ is given by
$$x_i \cdot \hat{w} = x_i^{\top} X^{\top} \left( XX^{\top} + \mu I \right)^{-1} y.$$
This rule can easily be kernelized to accommodate the kernels generated by MKL, since the Gram matrix $XX^{\top}$ can be expressed as the kernel matrix and $x_i^{\top} X^{\top}$ as the vector of kernel values between image $i$ and the images selected so far.
Since in each iteration of a search the PinView system selects not just a single image but a collage of several images, the LinRel algorithm needs to be extended to accommodate this. An obvious extension, implemented for the experiments reported in Section 4, is to select all images of the collage according to rule (5), while each image is selected at most once during the search. This method for selecting a collage rather emphasizes exploration, since all images of the collage are selected by taking also an exploration term into account. An alternative method would be to select only one image according to rule (5), and to select the remaining images to maximize the estimated relevance. This second method selects at most one explorative image, and is thus far less exploratory than the first method. By selecting more than one image according to rule (5), it is possible to interpolate between the first and the second method. Future work will show which collage selection method is most beneficial.
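A kernelized, regularized selection step in the spirit of LinRel can be sketched as below. This is an illustration under stated assumptions rather than the implementation used in the experiments: the exploration width is taken to be the norm of the regression coefficient vector of each candidate, the exploration parameter is applied as a plain multiplier, and the names `linrel_scores`, `select_collage`, `mu`, and `c` are hypothetical. A small Gaussian elimination routine keeps the example free of external dependencies.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def linrel_scores(K_seen, K_cand, y, mu=1.0, c=1.0):
    """Upper-confidence scores for candidate images.

    K_seen: kernel matrix of the images shown so far (t x t).
    K_cand: for each candidate, its kernel values against the shown images.
    y:      relevance scores observed for the shown images.
    """
    t = len(y)
    A = [[K_seen[r][s] + (mu if r == s else 0.0) for s in range(t)]
         for r in range(t)]
    scores = []
    for k in K_cand:
        a = solve(A, k)  # a = (K + mu I)^{-1} k, valid since A is symmetric
        estimate = sum(ai * yi for ai, yi in zip(a, y))
        width = math.sqrt(sum(ai * ai for ai in a))  # assumed exploration term
        scores.append(estimate + c * width)
    return scores


def select_collage(scores, size):
    """Indices of the `size` highest-scoring candidates, each picked once."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:size]


K_seen = [[1.0, 0.0], [0.0, 1.0]]   # two shown images
y = [1.0, 0.0]                      # first judged relevant, second not
K_cand = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
scores = linrel_scores(K_seen, K_cand, y, mu=1.0, c=1.0)
print(scores, select_collage(scores, 2))
```

Setting `c` to zero reduces the rule to pure exploitation, which corresponds to how the non-explorative images would be chosen in the alternative collage method described above.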
4 Experiments
In this section we describe experimental evaluations of the PinView system. We study empirically the following two questions:
How close to explicit feedback performance can we get with less laborious implicit feedback?
Is it possible to still improve performance by combining implicit and explicit feedback, especially when the explicit feedback is only partial (a single click on the most relevant image) and gaze patterns can be expected to reveal more relevant images?
In the experiments we use three variants of PinView:
PinView system with implicit feedback from gaze patterns.
PinView system with explicit feedback from clicks.
PinView system with both explicit and implicit feedback, from both gaze patterns and clicks.
For evaluation purposes these variants are compared with the baseline of browsing (that is, showing randomly ordered images) and with the PicSOM CBIR system, which shares the same interface as PinView but lacks the novel machine learning components. This way, the comparison emphasizes the effects caused by the new components instead of the interface.
To keep the experimental cost manageable, we started with extensive offline experiments and then validated the main findings later in online experiments with real users – performing all the comparisons with online users would not have been feasible. In offline setups we choose relevance of images based on their tags or classes, and simulate the feedback based on the relevance. Explicit feedback comes directly from the relevance and for implicit feedback we use eye movement features computed from relevant and nonrelevant images viewed in earlier experiments. We expect the simulated explicit feedback to be a reasonable approximation to real feedback, and hence in the online experiments we focus on validating the implicit feedback results.
4.1 Offline Experiments
The data set of images used in the offline experiments is the train subset of the PASCAL Visual Object Classes Challenge 2007 (VOC2007) dataset. The number of images in this dataset is 2501. It contains 20 overlapping categories whose summary statistics are given in Table III.
|Category name||Number of images||Percentage of images|
Experiment setup: Each offline experiment consists of simulated search sessions. In each search session PinView selects ten collages with 15 images each. The goal of a search session is to retrieve images from one of the categories. For simulating user feedback, images are divided into relevant images (all those from the desired category) and non-relevant images (all those not from this category). The calculation of different feedback modalities is detailed below. In each experiment the performance of the retrieval systems is measured in 40 search sessions on each of the 20 categories.
The regularization parameters of MKL and LinRel are set to a single combined regularization parameter which is found for each feedback modality with a grid search over values .
Feedback modalities: The following versions of PinView were compared:
Implicit feedback from simulated eye movements: SimulatedEye. The simulated eye movements are selected from a pool of previously recorded eye movements from online experiments. The eye movements are split into two groups, “positive” and “negative”, depending on whether the viewed image was relevant or nonrelevant in the task in which it was recorded. Both of these groups are divided into five subgroups depending on how many relevant images there were in the collage where the image was seen; the rationale is that the eye movements differ between collages having significantly different numbers of relevant images. The subgroups correspond to the following number of relevant images on a collage: 0, 1, 2–3, 4–6, 7–10, and 10–15. In the experiment, eye movements are sampled from the positive group for relevant images and from the negative group for nonrelevant ones, taking into account the number of relevant images in the current collage.
Explicit feedback from simulated clicks based on the known relevances of the images: SimulatedClick. To simulate an interface that still retains a low level of manual effort the system operates in a mode where only one of the images is clicked. If there are several relevant images, a random relevant image on the collage is selected as clicked. If the collage contains no relevant images, then an image is picked uniformly at random.
Combined explicit and implicit feedback: SimulatedEye+Click. Here both types of input are simulated, and used in the model as in Eq. (1). The explicit click weight of the model is found by running a grid search over the values , before choosing the regularization parameter of the PinView system.
For completeness we additionally include one more type of explicit feedback: Full, where the true class label of each seen image is given, corresponding to explicit feedback in which each relevant image is clicked.
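The two simulated explicit feedback modes described above can be sketched in a few lines. The function names and the representation of images as identifiers are assumptions made for illustration; the simulation of eye movements, which requires the recorded gaze pool, is not covered.

```python
import random


def simulate_click(collage, relevant, rng=random):
    """Simulate a single click on one collage.

    collage  : sequence of image identifiers shown to the simulated user
    relevant : set of identifiers belonging to the target category
    rng      : source of randomness (the `random` module or a random.Random instance)
    """
    relevant_shown = [img for img in collage if img in relevant]
    # Click a random relevant image if there is one, otherwise any image.
    pool = relevant_shown if relevant_shown else list(collage)
    return rng.choice(pool)


def full_feedback(collage, relevant):
    """Full feedback: the true relevance label of every image in the collage."""
    return [img in relevant for img in collage]


if __name__ == "__main__":
    rng = random.Random(0)
    collage = ["img1", "img2", "img3"]
    print(simulate_click(collage, {"img2"}, rng))
    print(full_feedback(collage, {"img2"}))
```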
Of the simulated feedbacks, Full feedback simulates the performance of PinView under ideal conditions, where the user is able and willing to provide perfect feedback. The other simulations provide lower bounds for the performance obtainable using only the implicit feedback, or by the partial explicit feedback of a single click that is still relatively effortless to provide. The real performance of the system in online experiments is expected to lie between these two extremes. This is because the simulated runs use only the incomplete tag information; in a real system the user is also able to give more refined feedback due to his ability to use the visual content of the images. Given a collage with more than one relevant image the user will not make the choice randomly, but will base his decision on the content, and the eye movements will also reflect the relative similarity of the images and the search target not captured by the simulation process.
Evaluation. To evaluate the model we record the performance on each feedback type separately. The measure of performance is mean average precision (MAP), i.e., the average fraction of relevant images that the system returned, averaged over the found relevant images and search sessions.
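As a minimal sketch of this measure, the snippet below computes the average precision of one search session from the sequence of relevance labels of the shown images, averaging the precision at each rank where a relevant image was found, and then averages over sessions. Treating a session with no relevant image found as having average precision zero is an assumption, as are the function names.

```python
def average_precision(shown_relevance):
    """Average precision of one session.

    shown_relevance : booleans in the order the images were shown,
                      True where the image was relevant
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(shown_relevance, start=1):
        if is_relevant:
            hits += 1
            # Fraction of relevant images among the first `rank` shown images.
            precisions.append(hits / rank)
    # Assumption: a session without any relevant image scores zero.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(sessions):
    """Mean of the per-session average precisions."""
    return sum(average_precision(s) for s in sessions) / len(sessions)


if __name__ == "__main__":
    sessions = [[True, False, True], [False, False]]
    print(mean_average_precision(sessions))
```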
Pairwise paired t-test p-values for methods in the offline experiment. The empirical performance of the methods increases from left to right and from top to bottom, and SimulatedEye is abbreviated as SimEye to save space.
The results of the offline experiments are presented in Figure 2. As expected, all PinView results lie in between the (laborious) Full feedback results and pure browsing results (Random). Implicit feedback (SimulatedEye) outperforms browsing, even if it does not provide as good feedback as pure explicit feedback (SimulatedClick). These differences are significant (Table IV). Combined explicit and implicit feedback (SimulatedEye+Click) gives very similar results to pure explicit feedback, and the small difference between the two is not significant. All of the reported results were run without the tensor decomposition of Section 3.3, since it did not increase the overall performance of the system despite showing improved accuracy for some users and tasks.
Comparing the MAP results between the PinView and PicSOM algorithms (reported as averages over all tasks in Figure 2), it is evident that the proposed PinView algorithm is superior for all feedback modalities. Most importantly, PinView seems to be better than PicSOM in making simultaneous use of both explicit and implicit relevance feedback, which can be seen when comparing the SimulatedClick and SimulatedEye+Click results.
4.2 Online Experiments
In this section we describe online experiments in which test subjects interact with the PinView system. The goal of the online experiments is to validate the offline findings about relative goodness of the different feedback modalities, and naturally also to give evidence of how well the system works in practice.
The online experiments use a subset of the ImageNet dataset, created by the authors and called the IMG2010 dataset. It contains 3720 images from several categories (synsets) of ImageNet, which is a database containing URLs to images available on the internet together with semantic category information (synsets of WordNet) and a hierarchy between the categories. Hence, IMG2010 contains images that are representative of ones that appear on the internet.
Experiment setup and evaluation. Each of the ten users performed 12 different search tasks which mimic different real-world scenarios. The tasks ranged from scenarios where a tag-based search had first been used to prune the eligible images, to scenarios where the images were more diverse. During one search task the system showed to the user a total of 120 images, contained in eight separate collages each having 15 images. Before the search session the system instructed the user to find a shown target image that belongs to a given category. In practice, the experiment took approximately 20–30 minutes per subject. The 12 search tasks were divided into four groups, each consisting of three sub-tasks. The four groups were:
Finding images of a particular sport from among sports images. The particular sport categories were ice hockey, gymnastics, and soccer. The image dataset for this group contains 1006 images sampled from the sports subcategory of ImageNet, which has 89 ice hockey, 92 gymnastics, and 88 soccer images.
Finding images of aeroplanes. The image dataset contains 900 uniformly sampled images that are not flowers or aircraft, and additional 150 images of both aircraft and flowers.
Finding images of flowers. The image dataset is the same as in the previous group.
Finding images of a given mammal, amongst other mammal images. The goal categories are deer and cheetah (twice). The dataset contains 105 images sampled from the deer category, 99 images sampled from the cheetah category, and 612 images sampled from a mammal category that are not deer or cheetahs.
As the goodness criterion we again used the number of relevant images. The different PinView variants were randomly allocated to the sub-tasks so that each sub-task had as uniform allocation of variants as possible. The regularization parameter was set for each PinView variant to the value that performed the best in the offline experiments.
Results. The quantitative performance of the PinView variants is shown in Figure 3 for each task. All input modalities used by PinView are clearly better than the baseline of browsing randomly ordered images, which is confirmed by t-tests in Table V. Only in one of the tasks, gymnastics, was the performance of PinView below the baseline, which might be due to random fluctuations caused by noise.
The relative performance of the variants varies between tasks. Implicit feedback from gaze is worse than explicit feedback from clicks, although the difference is not strongly significant (t-test, ). The number of relevant images retrieved by gaze is on average 67% of the number of relevant images returned by the best modality (the combined click and gaze). However, the gaze feedback performs well in many of the tasks and hence gaze is a viable source of implicit feedback information.
The paired t-test gives a p-value of on the hypothesis that the performances of the click modality and combined click and gaze modality are the same. There is evidence that combining information from click and gaze modalities improves the performance of the system, but more extensive testing would be needed for strong conclusions. The performance of the combined click and gaze modality is relatively better in online than in offline experiments, which might be due to the fact that the relevance feedback given by real users is more accurate than the simulated one.
|Gaze||Click||Gaze and click|
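The paired t-tests used to compare modalities can be illustrated with a short sketch of the test statistic. It takes two equally long lists of per-task scores for two modalities and returns the t statistic together with the degrees of freedom; converting this to a p-value requires the Student t distribution and is left out. The function name and the example numbers are hypothetical and are not results from the experiments.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two matched score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the pairwise differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1


if __name__ == "__main__":
    # Hypothetical counts of relevant images found per task by two modalities.
    t_value, dof = paired_t_statistic([3, 5, 7, 9], [1, 2, 3, 4])
    print(t_value, dof)
```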
5 Discussion and conclusions
In this paper we described our PinView CBIR system which records implicit relevance signals from the user and infers his image search intent by using several novel machine learning methods. We show that the PinView variants work better than browsing (a set of randomly ordered images), indicating that PinView would be useful at least in scenarios where tag-based evidence is not available or has already been used to narrow down the search to a subset of the original collection.
Implicit feedback from gaze outperformed the baseline, suggesting that pure implicit feedback is a viable option when it is difficult or too laborious to give explicit feedback. Explicit feedback by clicks gave more accurate results, and there was evidence that combined explicit and implicit feedback produced the best results. In summary, the compilation of algorithms in PinView is a very promising approach to content-based image retrieval. One of the main use scenarios is a search session where first a tag-based search is used to focus on a subset of potentially relevant images, and content-based search is then needed to do further retrieval in the still large result set.
Our final conclusions from the present work and other serious attempts [2, 6] to use and evaluate implicit relevance feedback from eye movements in iterative online content-based image retrieval are as follows: First, when used for purely implicit relevance feedback, eye movements perform better than random picking as was demonstrated in  and in this paper. This mode of operation can prove to be useful if the setup does not allow giving explicit feedback, or if the relevance feedback mechanism is used to secretly improve the efficiency of otherwise random browsing. Second, the performance level of gaze-based implicit relevance feedback with current hardware and algorithmic techniques cannot reach that of click-based explicit feedback.
Third, when combining explicit click-based and implicit gaze-based relevance feedback together, the system performance will exceed the level of solely explicit relevance feedback as was proven in  and in our experiments. To what extent this happens most likely depends on the experiment arrangements, including the data set, eye-tracking device, and the user interface design. In , the image collection was arguably simpler than ours. Additionally, the user interface allowed the use of gaze for comparing the query image and the candidates, which surely was beneficial for the proposed method. We thus argue that our experiments have resembled genuine use scenarios of content-based image retrieval, with respect to both the used data and the user interface, more than the previous attempts. We also argue that we have been able to show that even in such a difficult context, gaze tracking data has proven to be a useful source of implicit relevance feedback that can be beneficially used either alone or together with explicit feedback.
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 216529, Personal Information Navigator Adapting Through Viewing, PinView, IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886, and the Academy of Finland for the Finnish Centre of Excellence in Computational Inference Research (COIN, 251170). This publication only reflects the authors’ views.
-  R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Computing Surveys, vol. 40, no. 2, pp. 1–60, 2008.
-  K. Essig, “Vision-based image retrieval (vbir): A new eye-tracking based approach to efficient and intuitive image retrieval,” Ph.D. dissertation, Technischen Fakultät der Universität Bielefeld, 2007.
-  H. Grecu, “Content based image retrieval: An attention monitoring approach,” Ph.D. dissertation, Universitatea Politehnica din Bucuresti, 2006.
-  D. Kelly and J. Teevan, “Implicit feedback for inferring user preference: a bibliography,” SIGIR Forum, vol. 37, no. 2, pp. 18–28, 2003.
-  A. Klami, C. Saunders, T. E. de Campos, and S. Kaski, “Can relevance of images be inferred from eye movements?” in MIR ’08: Proceeding of the 1st ACM International Conference on Multimedia Information Retrieval. New York, NY, USA: ACM, 2008, pp. 134–140.
-  O. Oyekoya, “Eye tracking: A perceptual interface for content based image retrieval,” Ph.D. dissertation, University College London, 2007.
-  L. Scherffig, It’s in Your Eyes: Gaze Based Image Retrieval in Context. ZKM Institute for Basic Research, 2005. [Online]. Available: http://books.google.fi/books?id=bpV2AwAACAAJp
-  P. Auer, Z. Hussain, S. Kaski, A. Klami, J. Kujala, J. Laaksonen, A. P. Leung, K. Pasupa, and J. Shawe-Taylor, “Pinview: Implicit feedback in content-based image retrieval,” in Proc. of Workshop on Applications of Pattern Analysis, vol. 11, 2010, pp. 51–57.
-  Z. Hussain, A. P. Leung, K. Pasupa, D. R. Hardoon, P. Auer, and J. Shawe-Taylor, “Exploration-exploitation of eye movement enriched multiple feature spaces for content-based image retrieval,” in Machine Learning and Knowledge Discovery in Databases European Conference, ECML PKDD 2010. Berlin Heidelberg, Germany: Springer, 2010, pp. 554–569.
-  Y. Rui, T. S. Huang, and S.-F. Chang, “Image retrieval: Current techniques, promising directions, and open issues,” Journal of Visual Communication and Image Representation, vol. 10, no. 1, pp. 39–62, 1999.
-  A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.
-  R. C. Veltkamp and M. Tanase, “Content-based image retrieval systems: A survey,” Utrecht University, Information and Computing Sciences, Utrecht, The Netherlands, Tech. Rep. 2000-34 (revised version), October 2002, available at: http://www.aa-lab.cs.uu.nl/cbirsurvey/.
-  M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Trans. on Multimedia Computing, Communications and Applications, vol. 2, no. 1, pp. 1–19, 2006.
-  N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker, “The state of the art in image and video retrieval,” in CIVR’03: Proc. of the 2nd international conference on Image and video retrieval. Berlin, Heidelberg: Springer-Verlag, 2003, pp. 1–8.
-  R. W. Picard, T. P. Minka, and M. Szummer, “Modeling user subjectivity in image libraries,” M.I.T Media Laboratory, Tech. Rep. #382, 1996.
-  Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, “Relevance feedback: A power tool in interactive content-based image retrieval,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 644–655, 1998.
-  X. S. Zhou and T. S. Huang, “Relevance feedback for image retrieval: A comprehensive review,” Multimedia Systems, vol. 8, no. 6, pp. 536–544, 2003.
-  I. J. Cox, M. L. Miller, T. P. Minka, and P. N. Yianilos, “An optimized interaction strategy for Bayesian relevance feedback,” in
-  M. Claypool, P. Le, M. Wased, and D. Brown, “Implicit interest indicators,” in IUI’01: Proc. of the 6th International Conference on Intelligent User Interfaces. New York, NY, USA: ACM, 2001, pp. 33–40.
-  S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White, “Evaluating implicit measures to improve web search,” ACM Trans. on Information Systems, vol. 23, pp. 147–168, 2005.
-  T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, “Accurately interpreting clickthrough data as implicit feedback,” in Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2005, pp. 154–161.
-  A. Gerson, L. Passa, and P. Sajda, “Cortically coupled computer vision for rapid image search,” IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 14, no. 2, pp. 174–179, 2006.
-  P. Shenoy and D. S. Tan, “Human-aided computing: utilizing implicit human processing to classify images,” in Proceeding of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM, 2008, pp. 845–854.
-  K. Kay, T. Naselaris, R. Prenger, and J. Gallant, “Identifying natural images from human brain activity,” Nature, vol. 452, no. 7185, pp. 352–355, 2008.
-  T. Mitchell, R. Hutchinson, R. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman, “Learning to decode cognitive states from brain images,” Machine Learning, vol. 57, no. 1, pp. 145–175, 2004.
-  J. Wang, E. Pohlmeyer, B. Hanna, Y.-G. Jiang, P. Sajda, and S.-F. Chang, “Brain state decoding for rapid image retrieval,” in Proc. of the 17th ACM International Conference on Multimedia. New York, NY, USA: ACM, 2009, pp. 945–954.
-  A. Vinciarelli, N. Suditu, and M. Pantic, “Implicit human-centered tagging,” in Proc. of IEEE International Conference on Multimedia and Expo, ICME 2009. Piscataway, NJ, USA: IEEE press, 2009, pp. 1428–1431.
-  I. Arapakis, I. Konstas, and J. Jose, “Using facial expressions and peripheral physiological signals as implicit indicators of topical relevance,” in Proc. of the 17th ACM international conference on Multimedia. New York, NY, USA: ACM, 2009, pp. 461–470.
-  I. Arapakis, J. Jose, and P. Gray, “Affective feedback: an investigation into the role of emotions in the information seeking process,” in SIGIR’08: Proc. of the 31st Annual International ACM SIGIR conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2008, pp. 395–402.
-  M. Soleymani, G. Chanel, J. J. Kierkels, and T. Pun, “Affective ranking of movie scenes using physiological signals and content analysis,” in MS’08: Proc. of the 2nd ACM Workshop on Multimedia Semantics. New York, NY, USA: ACM, 2008, pp. 32–39.
-  P. P. Maglio, R. Barrett, C. S. Campbell, and T. Selker, “Suitor: an attentive information system,” in IUI’00: Proc. of the 5th International Conference on Intelligent User Interfaces. New York, NY, USA: ACM, 2000, pp. 169–176.
-  D. J. Ward and D. J. MacKay, “Fast hands-free writing by gaze direction,” Nature, vol. 418, p. 838, 2002.
-  G. Buscher, A. Dengel, and L. van Elst, “Eye movements as implicit relevance feedback,” in CHI ’08 Extended Abstracts on Human Factors in Computing Systems. New York, NY, USA: ACM, 2008, pp. 2991–2996.
-  K. Puolamäki, J. Salojärvi, E. Savia, J. Simola, and S. Kaski, “Combining eye movements and collaborative filtering for proactive information retrieval,” in SIGIR ’05: Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM press, 2005, pp. 146–153.
-  A. Ajanki, D. R. Hardoon, S. Kaski, K. Puolamäki, and J. Shawe-Taylor, “Can eyes reveal interest?—Implicit queries from gaze patterns,” User Modeling and User-Adapted Interaction: The Journal of Personalization Research, vol. 19, pp. 307–339, 2009.
-  O. Oyekoya and F. Stentiford, “Eye tracking as a new interface for image retrieval,” BT Technology Journal, vol. 22, no. 3, pp. 161–169, 2004.
-  H. Grecu, C. Cudalbu, and V. Buzuloiu, “Towards gaze-based relevance feedback in image retrieval,” in Proc. of International Workshop on Bioinspired Information Processing: Cognitive modeling and gaze-based communication (BIP 2005), 2005, poster abstract.
-  O. Oyekoya and F. Stentiford, “Perceptual image retrieval using eye movements,” in Proc. of the International Workshop on Intelligence Computing in Pattern Analysis/Synthesis 2006, 2006, pp. 281–289.
-  L. Kozma, A. Klami, and S. Kaski, “GaZIR: Gaze-based zooming interface for image retrieval,” in Proc. of ICMI-MLMI 2009, The 11th International Conference on Multimodal Interfaces and The 6th Workshop on Machine Learning for Multimodal Interaction. New York, NY, USA: ACM, 2009, pp. 305–312.
-  Z. Liang, H. Fu, Y. Zhang, Z. Chi, and D. Feng, “Content based image retrieval using a combination of visual features and eye tracking data,” in Proc. of ETRA 2010: ACM Symposium on Eye-Tracking Research & Applications. New York, NY, USA: ACM, 2010, pp. 41–44.
-  Y. Zhang, H. Fu, Z. Liang, Z. Chi, and D. Feng, “Eye movement as an interaction mechanism for relevance feedback in a content-based image retrieval system,” in Proc. of ETRA 2010: ACM Symposium on Eye-Tracking Research & Applications. New York, NY, USA: ACM, 2010, pp. 37–40.
-  A. Faro, D. Giordano, C. Pino, and C. Spampinato, “Visual attention for implicit relevance feedback in a content based image retrieval,” in Proc. of ETRA 2010: ACM Symposium on Eye-Tracking Research & Applications. New York, NY, USA: ACM, 2010, pp. 73–76.
-  V. Viitaniemi and J. Laaksonen, “Evaluating the performance in automatic image annotation: example case by adaptive fusion of global image features,” Signal Processing: Image Communications, vol. 22, no. 6, pp. 557–568, July 2007.
-  J. Laaksonen, M. Koskela, and E. Oja, “PicSOM — self-organizing image retrieval with MPEG-7 content descriptions,” IEEE Trans. on Neural Networks, vol. 13, pp. 841–853, 2002.
-  E. Cutrell and Z. Guan, “What are you looking for?: An eye-tracking study of information usage in web search,” in CHI ’07: Proc. of the SIGCHI conference on Human Factors in Computing Systems. New York, NY, USA: ACM, 2007, pp. 407–416.
-  L. A. Granka, T. Joachims, and G. Gay, “Eye-tracking analysis of user behavior in WWW search,” in SIGIR ’04: Proc. of the 27th annual international ACM SIGIR conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2004, pp. 478–479.
-  D. R. Hardoon, J. Shawe-Taylor, A. Ajanki, K. Puolamäki, and S. Kaski, “Information retrieval by inferring implicit queries from eye movements,” in 11th International Conference on Artificial Intelligence and Statistics, 2007.
-  J. Salojärvi, I. Kojo, J. Simola, and S. Kaski, “Can relevance be inferred from eye movements in information retrieval?” in Proc. of WSOM’03, Workshop on Self-Organizing Maps. Kitakyushu, Japan: Kyushu Institute of Technology, 2003, pp. 261–266.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
-  J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge University Press, 2004.
-  A. Argyriou, C. A. Micchelli, and M. Pontil, “Learning convex combinations of continuously parameterized basic kernels.” in Computational Learning Theory, ser. Lecture Notes in Computer Science, vol. 3559. Springer, 2005, pp. 338–352.
-  F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the SMO algorithm,” in Proc. of the 21st International Conference on Machine Learning, ICML. New York, NY, USA: ACM, 2004, pp. 41–48.
-  G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
-  D. R. Hardoon and K. Pasupa, “Image ranking with implicit feedback from eye movements,” in Proc. of ETRA 2010: ACM Symposium on Eye-Tracking Research & Applications. New York, NY, USA: ACM, 2010, pp. 291–298.
-  S. Pulmannová, “Tensor products of Hilbert space effect algebras,” Reports on Mathematical Physics, vol. 53(2), pp. 301–316, 2004.
-  S. Szedmak, J. Shawe-Taylor, and E. Parado-Hernandez, “Learning via linear operators: Maximum margin regression; multiclass and multiview learning at one-class complexity,” University of Southampton, Tech. Rep., 2005.
-  D. R. Hardoon and J. Shawe-Taylor, “Decomposing the tensor kernel support vector machine for neuroscience data with structure labels,” Machine Learning Journal: Special Issue on Learning From Multiple Sources, vol. 79, no. 1-2, pp. 29–46, 2010.
-  P. Auer, “Using confidence bounds for exploration-exploitation trade-offs,” Journal of Machine Learning Research, vol. 3, pp. 397–422, 2003.
-  L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in WWW2010: Proc. of the 19th International Conference on the World Wide Web. New York, NY, USA: ACM, 2010, pp. 661–670.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR’09: IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE, 2009, pp. 248–255.