Relevance Feedback is an important and widespread Information Retrieval technique for query modification that formulates a new query based on a previous one and on the user’s response to the results of that query. It is, by now, a classic: its origin in Information Retrieval can be traced back to the early 1970s, with the work of Rocchio on the SMART retrieval system. Like any true classic, relevance feedback has provided a nearly inexhaustible breeding ground for methods and algorithms whose development continues to this day.
In very general terms, let be a data base containing items, and let be a query. As a result of the query, the data base proposes a set of results, . Out of this set, the user selects two subsets: the set of positive (relevant) documents, , and the set of negative (counter-exemplar) ones with, in general, . This information is used by the system to compute a new query , which is then used to produce a new set of results . The process can be iterated ad libitum to obtain results sets and queries
Note that in document information retrieval the query is expressed directly by the user and is often considered more significant and stable than those which are automatically generated. For this reason, the parameter is always present in the calculation of . The situation is different in image retrieval, as we shall see in the following.
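As an illustration of the feedback step outlined above, the following is a minimal sketch of the classical Rocchio form, in which the new query is a combination of the original query, the centroid of the positive examples and the centroid of the negative ones. The function names and the coefficients alpha, beta and gamma (with their default values) are assumptions made for this example; they are not taken from this paper.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(points: Sequence[Sequence[float]], dim: int) -> Vector:
    """Mean of a set of vectors; the zero vector if the set is empty."""
    if not points:
        return [0.0] * dim
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def rocchio_update(
    original_query: Sequence[float],
    positives: Sequence[Sequence[float]],
    negatives: Sequence[Sequence[float]],
    alpha: float = 1.0,   # weight of the original query (assumed value)
    beta: float = 0.75,   # weight of the positive centroid (assumed value)
    gamma: float = 0.15,  # weight of the negative centroid (assumed value)
) -> Vector:
    """New query = alpha * q0 + beta * mean(positives) - gamma * mean(negatives)."""
    dim = len(original_query)
    pos = centroid(positives, dim)
    neg = centroid(negatives, dim)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(original_query, pos, neg)]


new_query = rocchio_update(
    [1.0, 0.0],
    positives=[[1.0, 1.0], [3.0, 1.0]],
    negatives=[[0.0, 2.0]],
    alpha=1.0, beta=0.5, gamma=0.5,
)
```

Keeping alpha strictly positive is what keeps the original query present in every update; with empty feedback sets and alpha equal to 1 the query is returned unchanged.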
Relevance feedback has been applied to many types of query systems, including systems based on Boolean queries [41, 17]. However, its most common embodiment is in similarity-based query systems, in which it offers a viable solution for expressing example-based queries . This is especially useful in a field like image retrieval, in which Boolean queries are seldom used. In this case, the items are points in a metric space , where is the metric of the space. Given a query , each item of the data base receives a score which depends on the distance between the item and the query as given by the metric , . In this case, it is possible not only to change the query based on the feedback, but to change the metric of the query space as well. Iteration is now characterized by the pair , where is the query (typically, is a point in or a set of such points) and is the metric on which the distance from the query and, consequently, the score of each image in will be computed. Given the feedback, the two are updated as
changes the metric space but doesn’t do explicit query rewriting. Algorithms that do metric modification usually use statistical methods to fit a parametric model to the observed relevance data. Many of these methods use only the set of positive examples, ignoring . The reason for this is that is usually a reasonably reliable statistical sample of what the user wants (except for the cases detailed in section 2.3). Not so , since there are often many and contrasting criteria under which an element can be deemed irrelevant. Paraphrasing Tolstoy, one could say that positive samples are all alike; every negative sample is negative in its own way. Rocchio’s algorithm, on the other hand, uses both positive and negative examples.
In this paper we shall present and evaluate two methods of relevance feedback. We shall build them in two stages. First, we shall endow a reduced-dimensionality, “semantic” feature space with a Gaussian similarity field and with the Riemann metric induced by this field. The extension of this model to a mixture of Gaussians leads quite naturally to our second model, with latent variables. This, in turn, will result in an extension to mixed (discrete/continuous) observations of what in Information Retrieval is known as Probabilistic Latent Semantic Analysis.
The paper is organized as follows. In section 2, we shall analyze in some detail two methods for metric change: MARS and MindReader. The purpose of this background is twofold. Firstly, it provides a general introduction to metric modification, making the paper more self-contained. Secondly, it serves the purpose of introducing the problem of dimensionality reduction, which we present in section 2.3 following the work of Rui and Huang.
In section 3, we shall analyze the dimensionality problem from an alternative point of view, leading to the definition of semantic spaces composed of groups of semantically homogeneous features, and the definition of a higher level, reduced dimensionality, feature space, in which the dimensions are given by distance measures in the two semantic spaces.
In section 4, we present the first of our relevance feedback schemes. We shall model the set of positive examples in a semantic space using a Normal distribution, and use this distribution to endow the feature space with a Riemann metric that we shall then use to “score” the images in the data base.
In section 5, we introduce the latent variable model. A series of binary stochastic variables is used to model abstract topics that are expressed in the positive samples. Each topic endows each semantic space with a Normal distribution. A mixed (discrete-continuous) version of Expectation Maximization (EM) is developed here to determine the parameters of the model, and then the results of section 4 are used to define a distance-like function that is used to score the data base.
In section 6, a system-neutral testing methodology is developed to evaluate the algorithms abstracting from the system of which they will be part, and is applied to evaluating the two schemes presented here vis-à-vis the three algorithms presented in section 2. Conclusions are given in section 7.
2 Relevant Background
In this section we shall describe in detail a limited number of approaches that bear direct relevance on the work presented here. We shall not try to present a bibliography of related work. Instead, we shall use this section to build a collection of techniques and open problems that we shall use in the following sections in relation to our work.
Table 1: The most important symbols and indices used in the paper.
|Number of elements in the data base|
|Data base of documents|
|Number of elements selected for feedback|
|Number of dimensions in the reduced representation of Rui and Huang|
|Number of word spaces in the latent model|
|Number of latent variables|
|Dimension of a generic vector|
|Dimension of the th word space|
|Complete feature space|
|Feature space of the th word|
|Feature vector of the th sample|
|Feature vector of the th word of the th sample|
|th dimension of the th word of the th sample|
|Elements in a data base (used rarely, will not conflict with the other use of the same symbols)|
|Elements in a feedback set or an image set|
|Dimensions of the feature spaces|
In all the following discussion we shall assume that we have a data base of elements, . We shall use the indices to span the elements of the data base. In its simplest representation each document is a point in a smooth manifold (in later sections we shall extend this representation to the Cartesian product of a finite number of manifolds). Depending on the model, this manifold will be either (the -dimensional Euclidean space) or (the unit sphere in ). We shall indicate this space as , specifying which manifold it represents whenever necessary. The individual coördinates will be identified using the indices . The th item of the database, is represented as the vector
The initial query will be denoted by (or ), with . Note that with this choice we assume that the query is a point in the feature space. The extension of all our considerations to queries represented as sets of points in the feature space is not hard, but would complicate our presentation considerably, so we shall not consider such case. The iterations of the relevance feedback will be indicated using the indices ; these indices will be used in a functional notation (). The query resulting after the th iteration of relevance feedback will be indicated as , and the relative result set as . After the th iteration, the user feedback will produce two sets of items: a set of positive items, and a set of negative items. The manifold is endowed with a metric that, in general, will vary as the relevance feedback progresses. We indicate with the metric and with the associated distance function at step . The query at step is the pair . In its most general formulation, a relevance feedback scheme is a function such that
A relevance feedback scheme is stable if, whenever , it is , that is, if
Stability entails that if an iteration doesn’t provide any new information, the query will not change. Note that in practice stability is not necessarily a desirable property: users tend to select positive examples more readily than negative ones and, if at any time the result set doesn’t provide any useful sample, the user might not select anything, leaving the system stuck in an unsatisfactory answer.
The various models presented in this paper require us to use a fairly extended apparatus of symbols. For the convenience of the reader, we have tried to keep the meaning of the symbols and the span of the indices as consistent as possible throughout the various models that we present. The most important symbols and indices used in the rest of the paper are available at a glance in Table 1.
2.1 Life on MARS
There are compelling reasons to believe that the metric of the feature space should indeed be affected by relevance feedback. Let us consider the following idealized case, represented in Figure 1: we have two kinds of images: checkerboards and vertical-striped, of different densities (total number of stripes) and different colors (not represented in the figure). We have a feature system with three axes: one measures color, the second measures line density, and the third the ratio between the number of horizontal and vertical lines.
Our goal is to find an image of a regular checkerboard. A typical user, when shown a sample from the data base, will select images with approximately the right density, rather regular, and of pretty much any color. That is, the selected items will be rather concentrated around the correct value, and the set of positive examples will have a small variance on the ratio axis (all the positive samples will have a ratio of approximately 1:1), probably a larger one, but not too large, on the density axis (images of very low density don’t have enough lines to qualify as a “checkerboard”) and very high variance on the color axis (color is irrelevant for the query, so the positive samples will be of many different colors). The idea of the MARS system is to use the inverse of the variance of the positive samples along an axis to measure the “importance” of that axis, and to use a weighted Euclidean distance in which each feature is weighted by the inverse of the variance of the positive examples along that axis. Let be the set of positive examples at iteration . Build the projection of all the results on the th feature axis as
and compute the variance
The query point of iteration , is determined using Rocchio’s algorithm, and the items in the data base are given scores that depend on the following distance from the query
As we mentioned above, only the positive examples are used for the determination of the metric.
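The inverse-variance weighting described in this section can be sketched as follows. This is an illustration under stated assumptions, not the MARS implementation: we use the population variance, a small floor to avoid division by zero, the centroid of the positive examples as a stand-in for the query point (instead of a full Rocchio update), and toy values on the three axes (color, density, ratio) of the checkerboard example.

```python
import math
from typing import List, Sequence


def axis_variances(positives: Sequence[Sequence[float]],
                   floor: float = 1e-9) -> List[float]:
    """Population variance of the positive examples along each feature axis.

    The floor (an assumption of this sketch) keeps the weights finite when
    all positives coincide on an axis.
    """
    n = len(positives)
    dim = len(positives[0])
    variances = []
    for i in range(dim):
        column = [p[i] for p in positives]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        variances.append(max(var, floor))
    return variances


def mars_distance(item: Sequence[float], query: Sequence[float],
                  variances: Sequence[float]) -> float:
    """Euclidean distance with each axis weighted by 1 / variance."""
    return math.sqrt(sum((x - q) ** 2 / v
                         for x, q, v in zip(item, query, variances)))


# Toy positives on the axes (color, density, ratio): color varies freely,
# density moderately, the ratio stays close to 1.
positives = [[0.1, 0.5, 1.00], [0.9, 0.6, 1.00],
             [0.5, 0.4, 1.02], [0.3, 0.5, 0.98]]
variances = axis_variances(positives)
query = [sum(p[i] for p in positives) / len(positives) for i in range(3)]

# The same displacement costs little along color and a lot along ratio.
color_shift = [query[0] + 0.2, query[1], query[2]]
ratio_shift = [query[0], query[1], query[2] + 0.2]
```

With these values, a displacement of 0.2 along the color axis yields a much smaller distance than the same displacement along the ratio axis, which is the qualitative behavior that the weighting is meant to produce.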
2.2 MindReader and optimal affine rotations
The idea of modifying the distance of the feature space to account for the relevance of each feature has proven to be a good one, but its execution in MARS has been criticized on two grounds: first, it doesn’t take into account that what is relevant (and therefore has low variance) might not be the individual features, but some linear combination of them; second, the weighting criterion looks ad hoc, and not rigorously justified.
In  an example of the first problem is given. The items in a data base are people represented by two features: their height and their weight. The query asks for “mildly overweight” people. The condition of being mildly overweight is not given by any specific value of any individual feature. If we consider, with a certain approximation, that being mildly overweight depends on one’s body mass index, then the relation is w/h^2 = constant, where w is the weight in kilograms and h the height in meters. So, a typical user might give a series of positive examples characterized as in figure 2.
If we consider the features individually, on each one of them the selected items have high variance, so we would conclude that the response carries little information. On the other hand, if we rotate the coördinate system as shown, the variance along the axis will be small, indicating that the corresponding weight/height ratio is relevant.
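The effect of the rotation can be checked numerically with a small sketch. The sample below is synthetic and purely illustrative (six invented people whose body mass index is close to 27). Each feature is standardized so that both have unit variance, and we use the fact that the 2x2 correlation matrix [[1, rho], [rho, 1]] has eigenvalues 1 + rho and 1 - rho, with eigenvectors along the two diagonals, i.e. along axes rotated by 45 degrees.

```python
import math
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Synthetic positive examples (illustrative values, not data from the paper).
heights = [1.55, 1.62, 1.70, 1.78, 1.85, 1.92]           # meters
bmis = [26.5, 27.4, 26.8, 27.2, 26.6, 27.5]              # close to 27
weights = [b * h * h for b, h in zip(bmis, heights)]     # kilograms

h = standardize(heights)
w = standardize(weights)
rho = sum(a * b for a, b in zip(h, w)) / len(h)  # correlation coefficient

# Variances along the coordinate axes are both 1 after standardization;
# along the rotated axes they are the eigenvalues of the correlation matrix.
var_along_diagonal = 1.0 + rho   # direction in which the positives spread
var_across_diagonal = 1.0 - rho  # direction in which they are concentrated
```

On the coördinate axes the two variances are equal and carry no information, while on the rotated axes almost all of the variance falls along one direction and very little along the other.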
MindReader  takes this into account by considering a more general distance function between an item and a query, one of the form
where is an symmetric matrix such that (viz., a rotation). These matrices generate distance functions in which the iso-distance curves are ellipsoids centered in and whose axes can be rotated by varying the coefficients of . The matrix and the query point are determined so as to minimize the sum of the weighted distances of the positive samples from the query. That is, given the weights (, ), the matrix and the vector are sought that minimize
subject to . The weights are introduced to handle multiple-level scores: the user, during the interaction, may decide that some of the positive examples are more significant than others, and assign to them correspondingly higher weights .
The problem can be solved separately for and . The optimal , independently of , is the weighted average of positive samples
In order to find the optimal , define the weighted correlation matrix as
It can be shown that and , where is the parameter of the Lagrangian optimization. The optimal is :
Note that the matrix depends on the query point. The optimal solution is obtained when is computed with respect to the optimal query point (11).
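The solution outlined above can be sketched for a two-dimensional feature space, where the inverse and the determinant have closed forms. The sketch assumes the standard MindReader result, namely that the optimal query point is the weighted average of the positive samples and that the optimal matrix is det(C)^(1/n) times the inverse of the weighted correlation matrix C, so that its determinant is 1. The function names, sample values and weights are ours and purely illustrative.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mindreader_2d(samples: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> Tuple[List[float], Matrix]:
    """Return (query point, metric matrix) for 2-D positive samples."""
    total = sum(weights)
    # Optimal query point: weighted average of the positive samples.
    q = [sum(w * s[i] for s, w in zip(samples, weights)) / total
         for i in range(2)]
    # Weighted correlation matrix computed around the optimal query point.
    c = [[sum(w * (s[i] - q[i]) * (s[j] - q[j])
              for s, w in zip(samples, weights))
          for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    if det <= 0.0:
        raise ValueError("singular correlation matrix: too few samples")
    inverse = [[c[1][1] / det, -c[0][1] / det],
               [-c[1][0] / det, c[0][0] / det]]
    scale = math.sqrt(det)  # det(C)^(1/n) with n = 2
    m = [[scale * inverse[i][j] for j in range(2)] for i in range(2)]
    return q, m


def quadratic_distance(x: Sequence[float], q: Sequence[float],
                       m: Matrix) -> float:
    """(x - q)^T M (x - q)."""
    d = [x[0] - q[0], x[1] - q[1]]
    return sum(d[i] * m[i][j] * d[j] for i in range(2) for j in range(2))


samples = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
weights = [1.0, 1.0, 2.0, 1.0]
q, m = mindreader_2d(samples, weights)
det_m = m[0][0] * m[1][1] - m[0][1] * m[1][0]
along = quadratic_distance([q[0] + 1.0, q[1] + 1.0], q, m)
across = quadratic_distance([q[0] + 1.0, q[1] - 1.0], q, m)
```

The positive samples lie roughly along the diagonal, so a displacement along the diagonal is penalized far less than one across it; the axis-parallel weighting of MARS cannot express this.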
2.3 Dimensionality Problems
The adaptation of the metric works well when the number of positive examples is at least of the same order of magnitude as the dimensionality of the feature space. A good example of this is given by MindReader. The affine matrix can be determined using (13) only if is non-singular, and this is not the case whenever . This is an important case in image search, as is in general of the order of 10 or, in very special cases, of 100 images, while may easily be of the order of 10,000. When the inverse doesn’t exist, and  uses in its stead a pseudo-inverse based on singular value decomposition. being symmetric, it can be decomposed as
where is an orthogonal matrix and
where is the rank of . The pseudo-inverse of is defined as
and that of as
The metric matrix is then defined as , where is chosen in such a way that .
The whole procedure depends only on parameters. To see what this entails, consider that is orthogonal, so the transformation is an isometry. In this coördinate system, and . That is, the distance depends on the value of only axes, a small fraction of those of the feature space.
In the general case, this situation is unavoidable: the matrix has coefficients, and we only have
coefficients of the positive samples that we can use to estimate them. In image data bases, feature spaces have high dimensionality, and can be of the order of . The number of selected images is limited by practical consideration to an order . So, we are trying to estimate coefficients using samples, an obviously under-determined problem.
A system like MARS, on the other hand, only requires the estimation of coefficients, making the estimation using the feature values of the positive example stable even for reasonably low values of . The price one pays is that the MARS metric matrix can weight the different axes of the feature space but it can’t rotate them, preventing the method from exploiting statistical regularities on axes other than the coördinate ones.
In order to alleviate this problem, Rui and Huang propose dividing the feature space into separate components and determining the distance between an item and the query by first computing the distance for each component and then combining the distances thus obtained.
More precisely, consider the feature space as the Cartesian product of manifolds: and let be the dimensionality of the manifold . In the following, the indices will span , while will span .
(Footnote 1: The construction that we are presenting could be called “top-down”: we have an overall feature space and we break it down into smaller, mutually orthogonal pieces. A different, “bottom-up”, point of view would simply ignore the overall feature space and consider that our items are described by feature vectors . We compute distances separately in these spaces and then stitch them together. The top-down point of view has the advantage of highlighting certain limitations of this decomposition. We are assuming here that the are independent of each other, and that the corresponding feature vectors can vary freely and independently. However, if , there is no guarantee that the Cartesian combination of independently varying vectors will have unit length, that is, there is no guarantee that is decomposable in this way. As we shall see, this is not a problem in the model of Rui and Huang, as the model of distance combination is very simple, and will not (per se) be a metric space; it will be one only qua combination of metric spaces. The problem has to be taken into account, however, for other combination models.)
Each item will be described by feature vectors with
In Rui and Huang’s model the user, in addition to selecting the set of positive samples, , can give a relevance to each one of them. Relevance is modeled as a vector . The overall distance between an item and the query is the weighted sum of the distances, in the component spaces , between the th feature vector of and the projection of on , . That is, if is the weight vector, then
where is the symmetric -dimensional metric matrix of . With these definitions the metric optimization problem is the following:
Defining as the matrix whose th column is
we have the optimal query point
which is the one given by Rocchio’s algorithm without negative examples applied to the th feature.
The matrices are determined as in the previous section, where all the quantities are now limited to feature . The solution is similar: defining the matrix as
The optimal weight vector is given by
where is the weighted average of the distance between the positive samples and the query in the subspace that defines feature .
Once the axes have been rotated, we find here the same general idea that we find in MARS: the subspaces in which the positive samples are far away from the query are less informative, and will receive a smaller weight; the average of the square distance along a direction is related to the variance of the coördinates along that direction, viz. to (7).
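The two-level distance described in this section can be sketched as follows: a quadratic-form distance is computed in each component space, and the results are combined linearly with the weight vector. The function and parameter names are ours, and the metric matrices and weights are passed in as given rather than optimized, so this illustrates only the structure of the distance, not the optimization procedure of Rui and Huang.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def quadratic_form(diff: Sequence[float], matrix: Matrix) -> float:
    """diff^T * matrix * diff for a square matrix of matching size."""
    n = len(diff)
    return sum(diff[a] * matrix[a][b] * diff[b]
               for a in range(n) for b in range(n))


def two_level_distance(item_features: Sequence[Sequence[float]],
                       query_features: Sequence[Sequence[float]],
                       metric_matrices: Sequence[Matrix],
                       weights: Sequence[float]) -> float:
    """Weighted sum, over the component spaces, of per-component distances."""
    total = 0.0
    for x, q, w_matrix, u in zip(item_features, query_features,
                                 metric_matrices, weights):
        diff = [a - b for a, b in zip(x, q)]
        total += u * quadratic_form(diff, w_matrix)
    return total


# Two component spaces of dimension 2 and 3, with identity metric matrices.
identity2 = [[1.0, 0.0], [0.0, 1.0]]
identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
item = [[1.0, 0.0], [0.0, 0.0, 2.0]]
query = [[0.0, 0.0], [0.0, 0.0, 0.0]]
distance = two_level_distance(item, query, [identity2, identity3], [0.5, 2.0])
```

Each component matrix has only as many coefficients as the square of the dimension of its own space, which is where the reduction in the number of parameters comes from.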
* * *
In this section we have limited our considerations to a handful of systems that are needed to provide the background for our discussion. We have, in other words, preferred depth of analysis to breadth of coverage. We should like, however, to give the briefest of mentions to a few more examples, in recognition of the pervasiveness of these techniques.
In the introduction, in (1) we have mentioned that the original query is kept fixed and contributes to the expression of . The extent to which this is done was left unspecified, as will be in the rest of the paper. The problem is analyzed in .
This paper focuses on the use of relevance feedback in image search but, of course, the general ideas have been applied to many areas, from information retrieval in collections of documents [22, 5] to web systems and other heterogeneous data. Relevance feedback is present in a number of methods and algorithms; in , user feedback is used in order to set system parameters, in  in order to understand user behavior in faceted searches. User expectation and feedback have also been used in order to measure the effectiveness of systems.
For more general information, the reader should consult the many excellent books [1, 6, 36, 10, 32] and reviews [4, 41, 49, 35] on Information Retrieval. Information on the application of Relevance Feedback to image retrieval can be found in general texts and reviews on this area [42, 45, 27, 15, 17].
3 Semantic Spaces
Rui & Huang try to solve the problems deriving from the extremely high dimensionality of the feature space by breaking the space in a two-level hierarchy. At the lower level, each feature defines a separate space, upon which one operates as in the usual case and in which one computes a distance from the query. At the higher level, these distances are linearly combined to provide the final distance. One useful point of view, one that isn’t explored in Rui & Huang’s paper, is to consider the latter as a higher order feature space, one in which the coördinates of the images are given by their distances from the query in each of the low level spaces. Since all coördinates are positive (and therefore equal to their absolute value), we can consider their linear combination as a weighted distance. Note that this choice of distance is essential to Rui & Huang’s method, since the combination function of the different feature spaces must be linear. Consequently, it is impossible to use such method to endow the high level space with any other metric.
In addition to this, Rui & Huang’s method doesn’t always succeed in reducing the problem to a manageable size, as the matrices can still be very large. Figure 3 shows the pre-computed features of the SUN data set together with their size.
The metric matrix for the whole feature space (of size ) contains a whopping coefficients. Breaking up the feature space using Rui & Huang’s method alleviates the dimensionality problem, but doesn’t quite solve it, as the total size of the matrices is coefficients: an order-of-magnitude improvement but, still, a problem too large for many applications.
3.1 Semantic partition
In this section, we shall take the essential idea of Rui & Huang, but we shall apply it to the point of view that we just expressed, that is, to define a higher order feature space. As in Rui & Huang’s work, we shall assume that the total feature space is the Cartesian composition of feature spaces
but, in this case, we shall assume that the spaces are not simply a partition dictated by technical matters, that is, they don’t simply segregate the components of different features, but have semantic relevance: the spaces must somehow correlate with the semantic characteristics that a person would use when doing a relevance judgment [23, 2].
The division of the feature space into sub-spaces entails a semantic choice of the designer, as each subspace should correlate with an aspect of the “meaning” of the image.
How are these groups to be selected? The general idea is that they be “meaningful”, and meaning is often assigned through language [20, 7, 19, 24, 43]. The psychology of feedback selection is still somewhat unexplored, but it is certain that if we ask somebody why he chose certain images as positive examples, we shall receive a linguistic answer. For lack of a better theory, we can assume that this answer is a reflection of the perception that made a person choose these images. Consequently, a dimension in the feature space should be something that we can easily describe in words (in practice: in a simple and direct sentence) without making reference to the underlying technical feature.
One possibility, which we shall not analyze in this paper, is that of a prosemantic space, in which each dimension in the reduced dimensionality space is the output of a classifier, trained to recognize a specific category of images [8, 9]. Here, we shall consider the individual feature spaces as given, and use the query to transform each one into a dimension of the semantic space.
3.2 The query space
Assume, according to our model, that we have a feature space defined as the Cartesian composition of feature spaces, as in (29), and let be the dimensionality of the th of such spaces. Each item will be described by the feature vectors
The query will also be defined by vectors
Each feature space is a metric space, endowed with a distance function . We consider the distance between the th feature of item and the th component of the query as the coördinate of along the th dimension of the query space, that is, we represent with the -dimensional feature vector
The space of these vectors is the query space of dimensionality . The query space itself can be given a metric structure defining a distance function in it. If the distance is a weighted Minkowski distance, we have
Note that in this space the query is always the origin of the coördinate system, so that the score of an image is a function of its distance from the origin.
In this space, all the coördinates are positive and, depending on the characteristics of the distance functions, they can be bounded. In later sections, we shall use suitable probability densities to model the distribution of images in this space. One reasonable model for many situations in which the coördinates are positive is the lognormal distribution, that is, a normal distribution of the logarithm of the coördinates [12, 18]. To this end, sometimes we shall use the transformed query space , in which the coördinates of the th image are
The distance in this space is defined as in (33).
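The construction of the query space can be sketched as follows: each coördinate of an item is its distance from the query in one of the feature spaces, and a weighted Minkowski distance is then defined on the resulting vectors. The function names, the use of the Euclidean distance in every feature space, and the small floor applied before taking logarithms are assumptions of this sketch.

```python
import math
from typing import Callable, List, Sequence

Distance = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance used here in every feature space (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query_space_vector(item_features: Sequence[Sequence[float]],
                       query_features: Sequence[Sequence[float]],
                       distances: Sequence[Distance]) -> List[float]:
    """One coordinate per feature space: distance between item and query."""
    return [d(x, q)
            for x, q, d in zip(item_features, query_features, distances)]


def log_transform(vector: Sequence[float], floor: float = 1e-12) -> List[float]:
    """Coordinates of the transformed query space; the floor avoids log(0)."""
    return [math.log(max(v, floor)) for v in vector]


def weighted_minkowski(u: Sequence[float], v: Sequence[float],
                       weights: Sequence[float], p: float = 2.0) -> float:
    """(sum_i w_i * |u_i - v_i|^p)^(1/p)."""
    return sum(w * abs(a - b) ** p
               for a, b, w in zip(u, v, weights)) ** (1.0 / p)


# An item described by two features (dimensions 2 and 1); query at zero.
item = [[3.0, 4.0], [1.0]]
query = [[0.0, 0.0], [0.0]]
coords = query_space_vector(item, query, [euclidean, euclidean])
score_distance = weighted_minkowski(coords, [0.0, 0.0], [1.0, 1.0], p=1.0)
```

Whatever the dimensions of the individual feature spaces, the vector `coords` has one entry per feature, and the query itself maps to the origin.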
Several arguments have been brought forth to argue that spaces of this kind are more “semantic” than normal feature spaces, in the sense that they correlate better with the linguistic descriptions of images or, at the very least, they are more amenable to a linguistic description than the usual feature spaces.
We shall appropriate these arguments and assume that the query space is the most suitable space in which relevance feedback should be implemented in the sense that, its reduced dimensionality notwithstanding, the query space contains the essential (semantic) information on which Relevance Feedback is based. In our tests section we shall validate this assumption by comparing the performance of the MARS algorithm in the query space with that of the same algorithm in the feature space.
Note that in all our tests the query vectors are obtained by applying Rocchio’s algorithm to the individual feature spaces. For this reason, it is not possible to implement Rocchio’s algorithm in the query space, as the algorithm is necessary in order to build it.
4 Riemann Relevance Feedback
Relevance feedback begins by placing a number of positive examples in a metric space which, in this case, is the query space of the previous section. We can consider these images as samples from a probability distribution that determines the probability that a certain region of the space contains semantically interesting images.
To be more precise, consider the problem of using relevance feedback to identify a target image in the query space, and let be a probability density on . Then models the semantics of if, given a volume around a point , the probability that is .
The idea of our method is to use this distribution to model the feedback process as a deformation in the metric of the query space. In particular, we shall use this distribution to determine a Riemann metric in such that images that differ in a significant area of the space will be fairly different, while images that differ in a non-significant area of the space won’t be as different. To clarify things, consider a one-dimensional query space and a distribution like that of figure 4.
Qualitatively, the area is the “interesting” area for the user, the area where most of the relevant examples are found. Two images placed in this area will be equally relevant, that is, the distance between them will be small. On the other hand, two images placed in the area will not be affected by relevance feedback, and the distance between them will be given by the normal Euclidean distance. Note that in this section we are assuming a unimodal distribution; we shall consider a more general case in the next section.
So, given the same difference in the coördinates of two points, their distance will be small in the area of high density, and will be (approximately) Euclidean where the density is close to zero. Consider the elementary distance element in a given position of the axis. We can write it as
In a uniform Euclidean space, . In the space that we have devised, should have a behavior qualitatively similar to that of figure 5.
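The qualitative behavior just described can be illustrated numerically in one dimension. The metric coefficient used below is a hypothetical choice made only for this sketch (it is not the tensor defined later in the paper): it is small where a Gaussian density centered at the origin is high, and tends to 1, the Euclidean value, where the density vanishes. In one dimension the distance between two points is simply the integral of the square root of the coefficient between them, which we approximate with the midpoint rule.

```python
import math


def metric_coefficient(x: float, sigma: float = 1.0, eps: float = 0.05) -> float:
    """Hypothetical coefficient: eps at the mode, tending to 1 far from it."""
    return 1.0 - (1.0 - eps) * math.exp(-x * x / (2.0 * sigma * sigma))


def riemann_length(a: float, b: float, sigma: float = 1.0,
                   eps: float = 0.05, steps: int = 10000) -> float:
    """Length of the segment [a, b]: integral of sqrt(g(x)) dx (midpoint rule)."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * h
        total += math.sqrt(metric_coefficient(x, sigma, eps))
    return abs(total * h)


# Two pairs of points with the same coordinate difference (1.0):
inside = riemann_length(-0.5, 0.5)    # in the high-density area
outside = riemann_length(9.5, 10.5)   # where the density is close to zero
```

The pair in the high-density area ends up much closer than the pair far from the mode, whose distance is practically the Euclidean one. The strictly positive value eps keeps the coefficient from vanishing at the origin.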
Let us now apply these considerations to our relevance feedback problem. We have obtained, from the user, a set of positive examples, each one being a vector in :
We arrange them into a matrix:
This matrix is a sample from our unknown probability distribution. If we assume that we are in the transformed feature space (34), we can model the unimodal distribution as a Gaussian
where and are the sampled average and covariance. For the sake of simplicity, we translate the coördinate system so that . We model the space as a Riemann space in which the distance element at position for a displacement of the coördinate is
(Footnote 2: In differential geometry it is customary to apply Einstein’s summation convention: whenever an index appears twice in a monomial expression, once as a contravariant index (viz. as a superscript) and once as a covariant index (viz. as a subscript), a summation over that index is implied. The components of the differentials are contravariant, while is a doubly covariant tensor. The distance element would therefore be written as )
Based on our qualitative considerations, we shall have
with . The factor is necessary in order to avoid that the Riemann tensor become degenerate in 0, and its necessity will be apparent in the following.
Working in a space with this Riemann tensor is a very complex problem, but it can be simplified if, before we define the tensor , we decouple the directions making them (approximately) independent. We apply singular value decomposition to write as
then, if we represent the images in the rotated coördinate system , where, for image ,
the covariance matrix is diagonal. Consequently, the Riemann tensor will also be diagonal:
The distance between two points in a Riemann space is given by the length of the geodesic that joins them, the geodesic being a curve of minimal length between two points (in a Euclidean space geodesics are straight lines, on a sphere they are great circles, and so on). Let be a geodesic curve in the query space parameterized by . Then, its coördinate expressions satisfy
(as customary, the dot indicates a derivative), where are the Christoffel symbols
and are the components of the inverse of . In our case, the only non-zero symbols are
The geodesic is therefore the solution of
Define the auxiliary variables . Then
With this change of variable we have
Defining we have
and defining we obtain
where is a constant, that is
Rolling back the variable changes, we have
The constants determine the direction of the geodesic. Let be the tangent vector that we want for the geodesic in 0, then
Note that if the geodesic degenerates, as (this is the reason why we introduced the constant ). If we can choose
This leads to
Defining the function
These equations define (implicitly) the geodesic
The geodesics are curves of constant velocity and, in this case, we have
Let be any curve such that and . The length of the segment of the curve is
Given an image in -coördinates (42), its distance from the origin (remember that in the query space the query is always placed at the origin) is the length of a segment of geodesic that joins the origin with the point . All geodesics of the form (62) go through the origin, so we only have to find one that, for a given , has . We can take, without loss of generality, : any geodesic through can be re-parameterized so that . That is, we must have
which entails an initial velocity vector
Since the geodesics are of constant speed, and due to the parameterization that we have chosen, we have
This is the distance function that we shall use to re-score the data base in response to the feedback of the user.
5 Relevance feedback with latent variables
In Information Retrieval one common and useful way to model sets of documents is through the use of latent variables, probabilistically related to the observed data, resulting in a method known as Probabilistic Latent Semantic Analysis. This method builds a semantic, low-dimensional representation of the data based on a collection of binary stochastic variables that are assumed to model the different aspects or topics [3, 48] of the data in which one might be interested. It should be noted that these topics are assigned no a priori linguistic characterization: they are simply binary variables whose significance is statistical, deriving from the analysis of the data. It has nevertheless been observed that they often do correlate with linguistic concepts in the data.
In information retrieval, we have a collection of documents and a collection of words . The observation is a set of pairs, , where is a document and a word that appears in it. From the observations we can estimate , that is, the probability that document and word be randomly selected from the corpus.
The model associates an unobserved variable to each observation . The unobserved variables are assumed to represent topics present in the collection of documents. Let be the probability that a document be selected, the probability that variable be active for (viz., the probability that document be about topic ), and the class-conditioned probability of a word given (viz., the probability that the topic produce word ). Using these probabilities, we define a generative model for the pair as follows (see also Figure 6):
select a document with probability ;
pick a latent variable with probability ;
generate a word with probability .
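The three generative steps above can be sketched as follows, together with the pair probability obtained by summing over the latent variable, that is, the probability of the document times the sum over topics of the probability of the topic given the document times the probability of the word given the topic. The probability tables, the document, topic and word labels, and the function names are toy values invented for the illustration.

```python
import random
from typing import Dict, Tuple

# Toy model: two documents, two topics, two words (illustrative values).
p_doc: Dict[str, float] = {"d1": 0.6, "d2": 0.4}
p_topic_given_doc: Dict[str, Dict[str, float]] = {
    "d1": {"z1": 0.9, "z2": 0.1},
    "d2": {"z1": 0.2, "z2": 0.8},
}
p_word_given_topic: Dict[str, Dict[str, float]] = {
    "z1": {"sky": 0.7, "sea": 0.3},
    "z2": {"sky": 0.1, "sea": 0.9},
}


def draw(table: Dict[str, float], rng: random.Random) -> str:
    """Draw one key of the table with probability given by its value."""
    keys = list(table)
    return rng.choices(keys, weights=[table[k] for k in keys])[0]


def sample_pair(rng: random.Random) -> Tuple[str, str]:
    """Generate one (document, word) observation; the topic stays hidden."""
    d = draw(p_doc, rng)                     # 1. select a document
    z = draw(p_topic_given_doc[d], rng)      # 2. pick a latent variable
    w = draw(p_word_given_topic[z], rng)     # 3. generate a word
    return d, w


def joint_probability(d: str, w: str) -> float:
    """P(d, w) = P(d) * sum_z P(z | d) * P(w | z)."""
    return p_doc[d] * sum(
        p_topic_given_doc[d][z] * p_word_given_topic[z].get(w, 0.0)
        for z in p_topic_given_doc[d]
    )


rng = random.Random(0)
observations = [sample_pair(rng) for _ in range(5)]
total = sum(joint_probability(d, w) for d in p_doc for w in ("sky", "sea"))
```

Only the pairs are observed; the topic drawn in the second step never appears in the output, which is why its distribution has to be estimated with a method such as Expectation Maximization.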
The model can be used to predict the observation probabilities of a pair :
This model is asymmetric in and , an undesirable characteristic. One can use Bayes’s theorem,
A model that is symmetric in and and whose interpretation is (Figure 7):
select a topic with probability ;
generate a document containing that topic with probability