Content-based image retrieval tutorial

08/12/2016 ∙ by Joani Mitro, et al. ∙ 0

This paper serves as a tutorial for readers who are interested in entering the field of information retrieval but do not know where to begin. It describes two fundamental yet efficient image retrieval techniques: k-nearest neighbors (k-NN) and support vector machines (SVM). The goal is to cover both the theoretical and the practical aspects so that the reader acquires a better understanding of each. Along with this tutorial we have also developed the accompanying software in the MATLAB environment to illustrate the techniques, so that the reader can gain hands-on experience.








Table 1: Notation
$\mathbf{X}, \mathbf{Y}, \mathbf{M}$ : bold-face roman capital letters indicate matrices
bold-face small letters indicate vectors
$a, b, c$ : small letters indicate scalar values
$\|\cdot\|_p$ : p-norm distance function
$\|\cdot\|_\infty$ : infinity-norm distance function
nonlinear function (e.g. sigmoid, tanh etc.)
score function, mapping function
$\|\cdot\|$ : vector norm
$\phi(\cdot)$ : feature mapping function
$K(\cdot, \cdot)$ : kernel function
$\mathbf{K}$ : kernel matrix

1 Introduction

As already mentioned, this tutorial serves as an introduction to the field of information retrieval for the interested reader. Beyond that, there has always been a motivation to develop efficient media retrieval systems, since the new era of digital communication has brought an explosion of multimedia data over the internet. This trend has continued with the increasing popularity of imaging devices, such as the digital cameras that nowadays are an inseparable part of any smartphone, together with an increasing proliferation of image data over communication networks.

2 Data pre-processing

As in any other case, before we use our data we first have to clean them, if necessary, and transform them into a format that the prediction algorithms can understand. In this particular case the process includes the following six steps, applied to each image in our dataset, which transform the raw pixel values into something meaningful for the prediction algorithms. In other words, we map the raw pixel values into a feature space.

  1. We start by computing the color histogram for each image. In this case the HSV color space has been chosen and the H, S and V components are uniformly quantized into 8, 2 and 2 bins respectively. This produces a vector of $8 \times 2 \times 2 = 32$ elements/values for each image.

  2. The next step is to compute the color auto-correlogram for each image, where the image is quantized into 64 colors in the RGB space. This process produces a vector of 64 elements/values for each image.

  3. Next, we extract the first two moments (i.e. mean and standard deviation) for each of the R, G and B color channels. This gives us a vector of 6 elements/values.

  4. Moving forward, we compute the mean and standard deviation of the Gabor wavelet coefficients, which produces a vector of 48 elements/values. This computation requires applying Gabor wavelet filters to each image across four scales (0.05, 0.1, 0.2, 0.4) and six orientations; four scales times six orientations times two statistics gives the 48 values.

  5. Last but not least, we apply the wavelet transform to each image with a 3-level decomposition. In this case the mean and standard deviation of the transform coefficients are used to form a feature vector of 40 elements/values for each image.

  6. Finally, we combine all the vectors from steps 1–5 into a new vector $\mathbf{x} \in \mathbb{R}^{32+64+6+48+40}$, that is, a vector of 190 values. Each number in the exponent indicates the dimensionality of one of the vectors from steps 1–5 that have been concatenated into the new vector $\mathbf{x}$.
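
Steps 1 and 6 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the accompanying MATLAB software: the function names `hsv_histogram` and `concatenate_features`, the representation of an image as a flat list of (R, G, B) tuples, and the normalization of the histogram are all assumptions, and zero vectors stand in for the descriptors of steps 2–5.

```python
import colorsys

def hsv_histogram(pixels, bins=(8, 2, 2)):
    """Return a normalized HSV color histogram as a flat list (8*2*2 = 32 values)."""
    h_bins, s_bins, v_bins = bins
    hist = [0.0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        # colorsys expects channels in [0, 1] and returns h, s, v in [0, 1]
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # uniform quantization; min() keeps the value 1.0 inside the last bin
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1.0
    total = sum(hist) or 1.0  # assumption: normalize counts to relative frequencies
    return [count / total for count in hist]

def concatenate_features(*parts):
    """Step 6: join the per-descriptor vectors into one feature vector."""
    return [value for part in parts for value in part]

# Toy "image": a flat list of (R, G, B) pixels.
pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
color_hist = hsv_histogram(pixels)

# Placeholder zero vectors stand in for steps 2-5 (sizes taken from the text).
placeholders = [[0.0] * n for n in (64, 6, 48, 40)]
feature_vector = concatenate_features(color_hist, *placeholders)
print(len(color_hist), len(feature_vector))
```

The two printed lengths correspond to the 32-bin histogram of step 1 and the 190-dimensional concatenated vector of step 6.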

3 Methodology

3.1 k-Nearest Neighbour

The k-Nearest Neighbour (k-NN) classifier belongs to the family of instance-based learning (IBL) algorithms. IBL algorithms construct hypotheses directly from the training data themselves, which means that the hypothesis complexity can grow with the data. One of their advantages is the ability to adapt the model to previously unseen data. Other advantages are the low cost of updating object instances and the fast learning rate, since no training is required. Other examples of IBL algorithms besides k-NN are kernel machines and RBF networks. Among the disadvantages of IBL algorithms, including k-NN, are the computational cost at query time, since every query has to be compared against the stored instances, and the fact that they fail to produce good results with noisy, irrelevant, nominal or missing attribute values. They also do not provide a natural way of explaining how the data is structured. The efficacy of the k-NN algorithm relies on a user-defined similarity function, for instance a p-norm distance function, which determines the nearest neighbours among the chosen set of examples. k-NN is also often used as a baseline in benchmarking and comparative studies: because it requires no training, any trained rule is expected to perform better, and if it does not, the trained rule is deemed useless for the application under study.

Since the nearest neighbour rule is a fairly simple algorithm, most textbooks will have a short reference to it but will neglect to provide any facts about who invented the rule in the first place. Marcello Pelillo [1] tried to give an answer to this question. Pelillo refers often to the famous Cover and Hart paper (1967) [4], which shows what happens if a very large, selectively chosen training set is used. Before Cover and Hart the rule was mentioned by Nilsson (1965) [5], who called it the “minimum distance classifier”, and by Sebestyen (1962) [2], who called it the “proximity algorithm”. Fix and Hodges [3], in their very early discussion of non-parametric discrimination (1951), already pointed to the nearest neighbour rule as well. The fundamental principle known as Ockham’s razor, “select the hypothesis with the fewest assumptions”, can be understood as the nearest neighbour rule for nominal properties. It is, however, not formulated in terms of observations. Ockham worked in the 14th century and emphasized observations before ideas. Pelillo pointed out that this was already done prior to Ockham by Alhazen [6] (Ibn al-Haytham), a very early scientist (965–1040) in the field of optics and perception. Pelillo cites some paragraphs in which Alhazen describes a training procedure as a “universal form”, which is completed by recalling the original objects, which Alhazen referred to as particular forms.

To better understand the k-NN rule we will set up the concept of our application. Suppose that we have a dataset of 1000 images in total, organized into 10 different categories/classes of 100 images each.

Given an image we would like to find all possible similar images from the pool of candidate images (i.e. all similar images from the dataset of 1000 total images). A sensible first attempt algorithm would look something like this:

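Data: $X$, “the set of all images”
Data: $\mathbf{Q}$, “the query image against which you are measuring similarity”
Result: $d$, “a scalar indicating how similar two images are”
for image $\mathbf{I}$ in $X$ do
       for row $r$ in image height: $1, \dots, H$ do
             for column $c$ in image width: $1, \dots, W$ do
                   $D(r, c) \leftarrow |\mathbf{I}(r, c) - \mathbf{Q}(r, c)|$
             end for
       end for
       $d \leftarrow \sum_{r, c} D(r, c)$, “sum across rows and columns”
end for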
Algorithm 1 naive k-NN algorithm.
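
A runnable counterpart of the nested-loop idea in Algorithm 1 might look as follows. It is a minimal sketch under stated assumptions: images are small grayscale lists of rows, the per-pixel difference is accumulated with a p-norm (p = 1 by default), and the names `naive_distance` and `naive_knn` are hypothetical rather than taken from the accompanying software.

```python
def naive_distance(image, query, p=1):
    """Accumulate |image[r][c] - query[r][c]|**p over every pixel, then take the p-th root."""
    total = 0.0
    for r in range(len(image)):          # loop over rows (image height)
        for c in range(len(image[0])):   # loop over columns (image width)
            total += abs(image[r][c] - query[r][c]) ** p
    return total ** (1.0 / p)

def naive_knn(images, query, k=3, p=1):
    """Return the indices of the k images closest to the query."""
    distances = [(naive_distance(img, query, p), idx) for idx, img in enumerate(images)]
    distances.sort()  # ties are broken by the smaller index
    return [idx for _, idx in distances[:k]]

images = [
    [[0, 0], [0, 0]],
    [[10, 10], [10, 10]],
    [[1, 0], [0, 1]],
]
query = [[1, 1], [0, 0]]
print(naive_knn(images, query, k=2))
```

The outer loop over images plus the two pixel loops are the three nested loops of the naive formulation.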

Here is a visual representation of what it might look like:

Figure 1: Final visual result of the Algorithm 1

The naive k-NN has three nested loops, so for $N$ images of height $H$ and width $W$ its complexity is $O(N \cdot H \cdot W)$ explicit loop iterations. Can we do better than that? We can, if we avoid some of those loops by vectorizing our main operations. Instead of operating on the 2-D images we vectorize them first and then perform the operations. First we transform our images from 2-D matrices to 1-D vectors, as demonstrated in the figure below.

Figure 2: Vectorizing images.

If we denote the vectorization of our query image as $\mathbf{q}$ and the vectorization of the $i$-th image in the dataset as $\mathbf{x}_i$, then our k-NN algorithm can be described as follows: $d_i = \|\mathbf{q} - \mathbf{x}_i\|_p$ for $i = 1, \dots, N$, where $N$ is the number of images, and the explicit looping has now been reduced to $O(N)$ vectorized distance evaluations. Each evaluation still touches every pixel, but as a single vector operation rather than two nested loops. The choice of distance metric or distance function is solely at the discretion of the user. Another view of how the k-NN algorithm operates is depicted in Figure 3. Notice that k-NN performs an implicit tessellation of the feature space that is not visible to the observer, but it is through this tessellation that it is able to identify nearby data points and classify them as similar. For instance, let’s pretend that the black capital “X” letters in Figure 3 denote some data projected onto the feature space. When a new data point comes in, such as the red capital “X” letter, which indicates a query image, the algorithm can easily find and assign to it the closest images, which are semantically similar.

Figure 3: Voronoi diagram of k-NN.
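
The vectorized formulation can be sketched as below. This is a hedged illustration in plain Python, so the "vector operation" is still a loop internally; in MATLAB or a numerical library each distance would be a single array expression. The function names `vectorize`, `p_norm_distance` and `retrieve` are assumptions made for this example.

```python
def vectorize(image):
    """Flatten a 2-D image (list of rows) into a 1-D vector."""
    return [pixel for row in image for pixel in row]

def p_norm_distance(x, q, p=2):
    """p-norm distance between two vectors; p=float('inf') gives the infinity norm."""
    diffs = [abs(a - b) for a, b in zip(x, q)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def retrieve(vectors, q, k=2, p=2):
    """Rank all dataset vectors by distance to q and return the k nearest indices."""
    order = sorted(range(len(vectors)), key=lambda i: p_norm_distance(vectors[i], q, p))
    return order[:k]

dataset = [
    [[0, 0], [0, 0]],
    [[3, 4], [0, 0]],
    [[9, 9], [9, 9]],
]
vectors = [vectorize(img) for img in dataset]  # done once, before any query arrives
q = vectorize([[0, 0], [0, 0]])

print(retrieve(vectors, q, k=2))
print(p_norm_distance(vectors[1], q, p=2))
print(p_norm_distance(vectors[1], q, p=float("inf")))
```

Vectorizing the dataset once up front means each query only needs one pass of distance evaluations over the stored vectors.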

3.2 Support Vector Machines

Support Vector Machines (SVMs, also known as support vector networks) are supervised learning models used, among other things, for classification and regression analysis. They were introduced in 1992 at the Conference on Learning Theory by Boser, Guyon and Vapnik. SVMs have become quite popular since then because they are a theoretically well-motivated algorithm, developed from Statistical Learning Theory (Vapnik and Chervonenkis) since the 1960s, and because they show good empirical performance in a diverse number of scientific fields and application domains such as bioinformatics, text and image recognition, music retrieval and many more. SVMs are based on the idea of separating data with a large “gap”, also known as the margin. During the presentation of SVMs we will also concern ourselves with the optimal margin classifier, which will act as a stepping stone for the introduction of Lagrange duality. Another important aspect of SVMs is the notion of kernels, which allow SVMs to be applied efficiently in high-dimensional feature spaces.

Let’s start by setting up our problem. The context is the same as before: we have images from different classes and we want to classify them accordingly. For the moment we consider only two classes, which makes this a binary classification problem. Based on this classification we will be able to retrieve images that are similar to our query image. Figure 4 depicts two classes of images, the positive ($y = 1$) and the negative ($y = -1$). For the sake of the example let’s consider the circles to be the positive and the triangles to be the negative class. We also have a hyperplane separating them, as well as three labeled data points, A, B and C. Notice that point A is the furthest from the decision boundary. If asked to predict the value of the label $y$ at point A, one might say that in this particular case we can be quite confident that the label is $y = 1$. Point C, on the other hand, is on the correct side of the decision boundary, where we would have predicted a label value of $y = 1$, but a small change to the decision boundary could have caused the prediction to be negative, $y = -1$. Therefore, one can say that we are much more confident about our prediction at A than at C. Point B lies in between these two cases. One can extrapolate and say that if a point is far from the separating hyperplane, then we may be significantly more confident in our prediction. What we are striving for is, given a training set, to find a decision boundary that allows us to make correct and confident predictions (i.e. far from the decision boundary) on the training examples.

Figure 4: Images projected on a 2-D plane.

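Let’s consider our binary classification problem, where we have labels $y \in \{-1, 1\}$ and features $\mathbf{x}$. Then our binary classifier might look like

$$h_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x} + b), \qquad g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases} \tag{1}$$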

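Now we have to distinguish between two different notions of margin, the functional and the geometric margin. The functional margin of $(\mathbf{w}, b)$ with respect to the training example $(\mathbf{x}^{(i)}, y^{(i)})$ is

$$\hat{\gamma}^{(i)} = y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) \tag{2}$$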

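If $y^{(i)} = 1$, then for our prediction to be confident and correct (i.e. for the functional margin to be large), $\mathbf{w}^T\mathbf{x}^{(i)} + b$ needs to be a large positive number. If $y^{(i)} = -1$, then for the functional margin to be large (i.e. to make a confident and correct prediction) $\mathbf{w}^T\mathbf{x}^{(i)} + b$ needs to be a large negative number. Note that if we replace $\mathbf{w}$ with $2\mathbf{w}$ and $b$ with $2b$ in Equation 1, then since $g(\mathbf{w}^T\mathbf{x} + b) = g(2\mathbf{w}^T\mathbf{x} + 2b)$, $h_{\mathbf{w},b}(\mathbf{x})$ would not change at all, which means that it depends only on the sign, but not on the magnitude, of $\mathbf{w}^T\mathbf{x} + b$. The functional margin, by contrast, doubles under this rescaling, so on its own it is not a reliable measure of confidence. Regarding the geometric margin, we will interpret it using Figure 5.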

Figure 5: Interpretation of geometric margin.

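We can see the decision boundary corresponding to $(\mathbf{w}, b)$ along with the orthogonal vector $\mathbf{w}$. Point A represents some training example $\mathbf{x}^{(i)}$ with label $y^{(i)} = 1$. Its distance to the decision boundary, denoted by $\gamma^{(i)}$, is given by the line segment AB. How can we compute $\gamma^{(i)}$? If we consider $\mathbf{w}/\|\mathbf{w}\|$ to be a unit-length vector pointing in the same direction as $\mathbf{w}$, then point B is given by $\mathbf{x}^{(i)} - \gamma^{(i)}\,\mathbf{w}/\|\mathbf{w}\|$. Since this point lies on the decision boundary, it satisfies $\mathbf{w}^T\mathbf{x} + b = 0$, as do all points on the decision boundary. Substituting $\mathbf{x}$ with the coordinates of B we get $\mathbf{w}^T\left(\mathbf{x}^{(i)} - \gamma^{(i)}\,\mathbf{w}/\|\mathbf{w}\|\right) + b = 0$. Solving for $\gamma^{(i)}$ we get $\gamma^{(i)} = \left(\mathbf{w}/\|\mathbf{w}\|\right)^T\mathbf{x}^{(i)} + b/\|\mathbf{w}\|$. In general, the geometric margin of $(\mathbf{w}, b)$ with respect to the training example $(\mathbf{x}^{(i)}, y^{(i)})$ is defined to be

$$\gamma^{(i)} = y^{(i)}\left(\left(\frac{\mathbf{w}}{\|\mathbf{w}\|}\right)^T\mathbf{x}^{(i)} + \frac{b}{\|\mathbf{w}\|}\right) \tag{3}$$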

If $\|\mathbf{w}\| = 1$, then the functional margin is equal to the geometric margin. Notice also that the geometric margin is invariant to rescaling of the parameters (i.e. if we replace $\mathbf{w}$ with $2\mathbf{w}$ and $b$ with $2b$, then the geometric margin does not change). This way it is possible to impose an arbitrary scaling constraint on $\mathbf{w}$ without changing anything significant in our original equation. Given a dataset $S = \{(\mathbf{x}^{(i)}, y^{(i)});\ i = 1, \dots, m\}$, it is also possible to define the geometric margin of $(\mathbf{w}, b)$ with respect to $S$ to be the smallest of the geometric margins on the individual training examples, $\gamma = \min_{i = 1, \dots, m} \gamma^{(i)}$. Thus, the goal for our classifier is to find a decision boundary that maximizes the geometric margin in order to reflect a confident and correct set of predictions, resulting in a classifier that separates the positive and negative training examples with a gap (the geometric margin). Suppose that our training data are linearly separable; how do we find a separating hyperplane that achieves the maximum geometric margin? We start by posing the following optimisation problem

$$\max_{\gamma, \mathbf{w}, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) \ge \gamma, \ i = 1, \dots, m, \qquad \|\mathbf{w}\| = 1 \tag{4}$$

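The constraint $\|\mathbf{w}\| = 1$ ensures that the functional margin equals the geometric margin, so we are guaranteed that all the geometric margins are at least $\gamma$. Since the “$\|\mathbf{w}\| = 1$” constraint is non-convex, this problem is hard to solve, so instead we will try to transform it into an easier one. Consider

$$\max_{\hat{\gamma}, \mathbf{w}, b} \ \frac{\hat{\gamma}}{\|\mathbf{w}\|} \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) \ge \hat{\gamma}, \ i = 1, \dots, m \tag{5}$$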

Notice that we have got rid of the constraint $\|\mathbf{w}\| = 1$ that was making our problem difficult, and since $\gamma = \hat{\gamma}/\|\mathbf{w}\|$, this formulation will still provide an acceptable and correct answer. The main problem is that our objective function $\hat{\gamma}/\|\mathbf{w}\|$ is still non-convex, thus we have to keep searching for a different representation. Recall that we can add an arbitrary scaling constraint on $\mathbf{w}$ and $b$ without changing anything in our original formulation. We will introduce the scaling constraint that the functional margin of $(\mathbf{w}, b)$ with respect to the training set must be 1, that is $\hat{\gamma} = 1$. Multiplying $\mathbf{w}$ and $b$ by some constant multiplies the functional margin by the same constant, so one can always satisfy the scaling constraint by rescaling $(\mathbf{w}, b)$. If we plug this constraint into Equation 5 and substitute $\hat{\gamma} = 1$, then we get the following optimization problem.

$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) \ge 1, \ i = 1, \dots, m \tag{6}$$

Notice that maximizing $\hat{\gamma}/\|\mathbf{w}\| = 1/\|\mathbf{w}\|$ is the same thing as minimizing $\|\mathbf{w}\|^2$. We have now transformed our optimization problem into one with a convex quadratic objective and linear constraints, which can be solved using quadratic programming. The solution to the above optimization problem gives us the optimal margin classifier. It also leads to the dual form of our optimization problem, which in turn plays an important role in the use of kernels to obtain optimal margin classifiers that work efficiently in very high-dimensional spaces. We can re-express the constraints of Equation 6 as $g_i(\mathbf{w}) = -y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) + 1 \le 0$. Notice that constraints that hold with equality, $g_i(\mathbf{w}) = 0$, correspond to training examples $(\mathbf{x}^{(i)}, y^{(i)})$ that have functional margin equal to one. Let’s have a look at the figure below.

Figure 6: Support vectors and maximum margin separating hyperplane.
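
The relationship between the functional margin, the geometric margin and the examples with functional margin equal to one can be checked numerically. The sketch below uses a hypothetical 2-D training set and a hand-picked $(\mathbf{w}, b)$ that satisfies the constraints of Equation 6; the data and all function names are assumptions made for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def functional_margin(w, b, x, y):
    """y * (w^T x + b)"""
    return y * (dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Functional margin divided by the norm of w."""
    return functional_margin(w, b, x, y) / math.sqrt(dot(w, w))

# Hypothetical training set: three positive and two negative examples.
X = [[1.0, 1.0], [3.0, 3.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0  # hand-picked so that every functional margin is at least 1

f_margins = [functional_margin(w, b, xi, yi) for xi, yi in zip(X, y)]
g_margins = [geometric_margin(w, b, xi, yi) for xi, yi in zip(X, y)]

# Geometric margin of (w, b) with respect to the whole set: the smallest one.
dataset_margin = min(g_margins)

# Examples whose constraint holds with equality (functional margin equal to one).
support = [i for i, f in enumerate(f_margins) if math.isclose(f, 1.0)]

# Rescaling (w, b) doubles the functional margins but leaves the geometric ones alone.
w2, b2 = [2.0 * wi for wi in w], 2.0 * b
f_scaled = [functional_margin(w2, b2, xi, yi) for xi, yi in zip(X, y)]
g_scaled = [geometric_margin(w2, b2, xi, yi) for xi, yi in zip(X, y)]

print(f_margins)
print(support)
print(dataset_margin)
```

In this toy set two positive examples and one negative example attain functional margin one, which mirrors the configuration described for the support vectors.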

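The three points that lie on the margin, that is, on the lines parallel to the decision boundary (two positive and one negative), are the ones with the smallest margins and are thus closest to the decision boundary. These three points are called support vectors, and their number is usually much smaller than the size of the training set. In order to tackle the problem we frame it as a Lagrangian optimization problem

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) - 1 \right] \tag{7}$$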

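with only one set of Lagrange multipliers, $\alpha_i$, since the problem has only inequality constraints and no equality constraints. First, we have to find the dual form of the problem. To do so we need to minimize $\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to $\mathbf{w}$ and $b$ for a fixed $\boldsymbol{\alpha}$. Setting the derivatives of $\mathcal{L}$ with respect to $\mathbf{w}$ and $b$ to zero, we get:

$$\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \mathbf{w} - \sum_{i=1}^{m} \alpha_i y^{(i)} \mathbf{x}^{(i)} = 0 \tag{8}$$
$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i y^{(i)} \mathbf{x}^{(i)} \tag{9}$$
$$\frac{\partial}{\partial b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \tag{10}$$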

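Substituting Equation 9 back into the Lagrangian and simplifying we get

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \left(\mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)} \tag{11}$$
By Equation 10 the last term is zero, which leaves
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \left(\mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)} \tag{12}$$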

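Utilizing the constraint $\alpha_i \ge 0$ and the constraint from Equation 10, the following dual optimization problem arises:

$$\max_{\boldsymbol{\alpha}} \ W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \rangle \quad \text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \tag{13}$$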

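If we are able to solve the dual problem, in other words to find the $\boldsymbol{\alpha}$ that maximizes $W(\boldsymbol{\alpha})$, then we can use Equation 9 to find the optimal $\mathbf{w}$ as a function of $\boldsymbol{\alpha}$. Once we have found the optimal $\mathbf{w}^*$, then by considering the primal problem we can also find the optimal value for the intercept term $b$.

$$b^* = -\frac{\max_{i:\, y^{(i)} = -1} \mathbf{w}^{*T}\mathbf{x}^{(i)} + \min_{i:\, y^{(i)} = 1} \mathbf{w}^{*T}\mathbf{x}^{(i)}}{2} \tag{14}$$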

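Suppose we have fit the parameters of our model to a training set, and we wish to make a prediction at a new input point $\mathbf{x}$. We would then calculate $\mathbf{w}^T\mathbf{x} + b$, and predict $y = 1$ if and only if this quantity is bigger than zero. But using Equation 9, this quantity can also be written:

$$\mathbf{w}^T\mathbf{x} + b = \left(\sum_{i=1}^{m} \alpha_i y^{(i)} \mathbf{x}^{(i)}\right)^T \mathbf{x} + b \tag{15}$$
$$\phantom{\mathbf{w}^T\mathbf{x} + b} = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \mathbf{x}^{(i)}, \mathbf{x} \rangle + b \tag{16}$$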

The values of $\alpha_i$ will all be zero except for the support vectors, because a multiplier can be nonzero only when its constraint holds with equality, $g_i(\mathbf{w}) = 0$. Many of the terms in the sum above will therefore be zero, and we really need only the inner products between $\mathbf{x}$ and the support vectors in order to calculate Equation 16 and make our prediction. We will exploit this property of using inner products between input feature vectors in order to apply kernels to our classification problem. To talk about kernels we have to think about our input data. As already mentioned, in this case we are referring to images, and images are usually described by a number of pixel values, which we will refer to as attributes, indicating the color intensities across the three color channels {R, G, B}. When we process the pixel values in order to retrieve more meaningful representations, in other words when we map our initial pixel values through some processing operation to some new values, these new values are called features and the operation is referred to as the feature mapping, usually denoted as $\phi$.
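
The observation that only the support vectors contribute to Equation 16 can be illustrated with a toy example. The multipliers below are hand-derived for a hypothetical three-point training set rather than produced by a quadratic programming solver, and the function names are assumptions; the sketch only checks that the primal form $\mathbf{w}^T\mathbf{x} + b$ and the inner-product form give the same value.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical training set with a hand-derived dual solution.
X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]]
y = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]  # the third point is not a support vector

# Equation 9: w is a weighted sum of the training examples.
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(len(X[0]))]

# Intercept recovered from the primal constraints (midpoint between the classes).
b = -(max(dot(w, xi) for xi, yi in zip(X, y) if yi == -1)
      + min(dot(w, xi) for xi, yi in zip(X, y) if yi == 1)) / 2.0

def decision_primal(x):
    """w^T x + b"""
    return dot(w, x) + b

def decision_dual(x):
    """Equation 16, summing only over examples with a nonzero multiplier."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X) if a > 0) + b

x_new = [2.0, 0.0]
print(decision_primal(x_new), decision_dual(x_new))
```

Because the dual form only needs inner products with the support vectors, the same code structure carries over once those inner products are replaced by a kernel.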

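Instead of applying the SVM directly to the attributes $\mathbf{x}$, we may want to use the SVM to learn from some features $\phi(\mathbf{x})$. Since the SVM algorithm can be written entirely in terms of inner products $\langle \mathbf{x}, \mathbf{z} \rangle$, we can replace them with $\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$. Given a feature mapping $\phi$, the corresponding kernel is defined as $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z})$. If we replace every inner product $\langle \mathbf{x}, \mathbf{z} \rangle$ in the algorithm with $K(\mathbf{x}, \mathbf{z})$, then the learning process will take place using the features $\phi$.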

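One can compute $K(\mathbf{x}, \mathbf{z})$ by finding $\phi(\mathbf{x})$ and $\phi(\mathbf{z})$ and taking their inner product, even though these vectors may be expensive to calculate because of their high dimensionality. Often, however, $K(\mathbf{x}, \mathbf{z})$ itself is cheap to evaluate, which allows SVMs to perform learning in high-dimensional feature spaces without the need to explicitly find or represent the vectors $\phi(\mathbf{x})$. For instance, suppose $\mathbf{x}, \mathbf{z} \in \mathbb{R}^n$, and let’s consider $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$, which is equivalent to

$$K(\mathbf{x}, \mathbf{z}) = \left(\sum_{i=1}^{n} x_i z_i\right)\left(\sum_{j=1}^{n} x_j z_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i x_j)(z_i z_j) = \phi(\mathbf{x})^T\phi(\mathbf{z})$$

For $n = 3$ the feature mapping $\phi$ is computed as

$$\phi(\mathbf{x}) = \left[x_1x_1,\ x_1x_2,\ x_1x_3,\ x_2x_1,\ x_2x_2,\ x_2x_3,\ x_3x_1,\ x_3x_2,\ x_3x_3\right]^T$$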

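Broadly speaking, a kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$ corresponds to a feature mapping to a $\binom{n+d}{d}$-dimensional feature space. Computing $K(\mathbf{x}, \mathbf{z})$ still takes only $O(n)$ time even though it is operating in an $O(n^d)$-dimensional space, because it does not need to explicitly represent feature vectors in this high-dimensional space. If we think of $K(\mathbf{x}, \mathbf{z})$ as some measurement of how similar $\phi(\mathbf{x})$ and $\phi(\mathbf{z})$, or $\mathbf{x}$ and $\mathbf{z}$, are, then we might expect $K(\mathbf{x}, \mathbf{z})$ to be large if $\phi(\mathbf{x})$ and $\phi(\mathbf{z})$ are close together, and small otherwise.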
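
The equivalence between evaluating a kernel directly and taking the inner product of explicit feature vectors can be verified for the quadratic kernel $(\mathbf{x}^T\mathbf{z})^2$. The following is a small sketch with hypothetical function names; it compares the $O(n)$ kernel evaluation with the $O(n^2)$ explicit mapping for $n = 3$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed in O(n) time."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature mapping: all n*n pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

x = [1.0, 2.0, 3.0]
z = [4.0, 5.0, 6.0]

via_kernel = quadratic_kernel(x, z)
via_features = dot(phi(x), phi(z))
print(len(phi(x)), via_kernel, via_features)
```

Both routes return the same number, but the explicit route first builds two 9-dimensional vectors, and that gap widens quickly as $n$ or the polynomial degree grows.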

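Suppose that for some learning problem we have thought of some kernel function $K(\mathbf{x}, \mathbf{z})$ that we consider a reasonable measure of how similar $\mathbf{x}$ and $\mathbf{z}$ are. For instance,

$$K(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$$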

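The question then becomes: can we use this definition as the kernel in an SVM algorithm? In general, given any function $K$, is there a procedure that tells us whether there exists some feature mapping $\phi$ so that $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z})$ for all $\mathbf{x}$ and $\mathbf{z}$, in other words whether it is a valid kernel or not? If we suppose that $K$ is a valid kernel, then $K_{ij} = K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(j)})^T\phi(\mathbf{x}^{(i)}) = K_{ji}$, meaning that the kernel matrix, denoted as $\mathbf{K}$ and describing the similarity between data points $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$, must be symmetric. If we denote by $\phi_k(\mathbf{x})$ the $k$-th coordinate of the vector $\phi(\mathbf{x})$, then for any vector $\mathbf{z}$ we have

$$\mathbf{z}^T\mathbf{K}\mathbf{z} = \sum_{i}\sum_{j} z_i K_{ij} z_j = \sum_{i}\sum_{j} z_i\, \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})\, z_j = \sum_{i}\sum_{j} z_i \sum_{k} \phi_k(\mathbf{x}^{(i)})\phi_k(\mathbf{x}^{(j)})\, z_j = \sum_{k} \left(\sum_{i} z_i \phi_k(\mathbf{x}^{(i)})\right)^2 \ge 0$$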

which shows that the kernel matrix is positive semi-definite ($\mathbf{K} \succeq 0$), since our choice of $\mathbf{z}$ was arbitrary. So if $K$ is a valid kernel, meaning that it corresponds to some feature mapping $\phi$, then the corresponding kernel matrix is symmetric positive semi-definite. This turns out to be not only a necessary but also a sufficient condition for $K$ to be a valid kernel, also called a Mercer kernel. The take-away message is that if you have any algorithm that you can write in terms of only inner products between the input attribute vectors, then by replacing those inner products with a kernel you can permit your algorithm to work efficiently in a high-dimensional feature space.
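
The symmetry and positive semi-definiteness conditions can be sanity-checked numerically on a small set of points. The sketch below builds the kernel matrix for a Gaussian kernel and evaluates the quadratic form $\mathbf{z}^T\mathbf{K}\mathbf{z}$ for random vectors; this is a spot check rather than a proof, and the sample points, the bandwidth $\sigma = 1$ and the counterexample function `neg_sq_dist` are assumptions chosen for illustration.

```python
import math
import random

def gaussian_kernel(x, z, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def neg_sq_dist(x, z):
    """A similarity-like function that is NOT a valid kernel."""
    return -sum((a - b) ** 2 for a, b in zip(x, z))

def gram_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, z):
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
K = gram_matrix(points, gaussian_kernel)

n = len(points)
is_symmetric = all(math.isclose(K[i][j], K[j][i]) for i in range(n) for j in range(n))

rng = random.Random(0)  # fixed seed so the check is repeatable
forms = [quadratic_form(K, [rng.uniform(-1.0, 1.0) for _ in range(n)]) for _ in range(100)]
looks_psd = all(value >= -1e-12 for value in forms)

# Counterexample: zero diagonal and negative off-diagonal entries.
K_bad = gram_matrix(points, neg_sq_dist)
bad_form = quadratic_form(K_bad, [1.0] * n)

print(is_symmetric, looks_psd, bad_form)
```

The negative squared distance fails the test with a single vector of ones, which is enough to rule it out as a kernel even though it looks like a plausible similarity measure.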

Switching gears for a moment and returning to our problem of actually classifying and semantically retrieving similar images: since we now understand how the SVM algorithm works, we can use it in our application. Recall that we have a dataset of 1000 images, organized into 10 classes of 100 images each.

As in Section 3.1 we have our dataset $X$ and our query image $\mathbf{q}$, both vectorized in order to perform mathematical operations seamlessly. To make things even more explicit, imagine that our system (i.e. the MATLAB software accompanying this tutorial) or the algorithm (i.e. the SVM in this case) receives a query image $\mathbf{q}$ from the user, and its job is to find and return to the user all the images which are similar to the query $\mathbf{q}$. For instance, if the query image of the user depicts a monument, then the job of our system or algorithm is to return to the user all the images depicting monuments from our dataset $X$.

In other words, we are treating our problem as a multiclass classification problem. Generally speaking, there are two broad approaches to resolving this with the SVM algorithm. The first one is called the “one-vs-all” approach and involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label.

The second approach is called “one-vs-one”, and here one has to train $k(k-1)/2$ binary classifiers for a k-way multiclass problem. Each classifier receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all $k(k-1)/2$ classifiers are applied to an unseen sample, and the class that gets the highest number of “+1” predictions is predicted by the combined classifier. This is the method that the accompanying software uses for the SVM solution.
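
The one-vs-one voting scheme can be sketched as follows. To keep the example self-contained, each pairwise classifier is a hypothetical nearest-centroid rule standing in for a trained binary SVM; the class names, centroids and function names are all assumptions and do not come from the accompanying software.

```python
from collections import Counter
from itertools import combinations

def make_pair_classifier(class_a, class_b, centroids):
    """Hypothetical stand-in for a trained binary SVM: picks the nearer class centroid."""
    def classify(x):
        da = sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[class_a]))
        db = sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[class_b]))
        return class_a if db >= da else class_b
    return classify

def one_vs_one_predict(x, classifiers):
    """Apply every pairwise classifier and return the class with the most votes."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy 2-D feature space with three classes (k = 3).
centroids = {"beach": [0.0, 0.0], "monument": [5.0, 5.0], "bus": [10.0, 0.0]}

# One classifier per unordered pair of classes: k(k-1)/2 in total.
classifiers = [make_pair_classifier(a, b, centroids)
               for a, b in combinations(sorted(centroids), 2)]

query = [4.5, 4.0]
predicted = one_vs_one_predict(query, classifiers)

# Retrieval step: return every dataset image that carries the predicted label.
dataset_labels = ["beach", "monument", "bus", "monument"]
retrieved = [i for i, label in enumerate(dataset_labels) if label == predicted]
print(len(classifiers), predicted, retrieved)
```

With three classes there are three pairwise classifiers, and the query is assigned to the class that wins two of the three pairwise contests.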

Notice that if the SVM algorithm predicts the wrong class label for a query image , then we end up retrieving and returning to the user all the images from the wrong category/class since we predicted the wrong label. How can we compensate for this shortcoming? This is left as an exercise to the reader to practice his/her skills.


  • [1] M. Pelillo, Pattern Recognition Letters, Alhazen and the nearest neighbor rule, pp. 34–37, Vol. 38, 2014.
  • [2] G. Sebestyen, IEEE Transactions on Information Theory, Review of Learning Machines, pp. 407, No. 3, Vol. 12, 1965–1966.
  • [3] E. Fix and J. L. Hodges, International Statistical Institute, Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties, pp. 238–247, Vol. 57, No. 3, 1989.
  • [4] T. Cover and P. Hart, IEEE Transactions on Information Theory, Nearest neighbor pattern classification, pp. 21–27, Vol. 13, 1967.
  • [5] N. Nilsson, Learning Machines: Foundations of Trainable Pattern Classifying Systems, First edition, 1965.
  • [6] Ibn al-Ḥasan, widely considered to be one of the first theoretical physicists, c. 965–c. 1040 CE.