Learning similarity measures from data

01/15/2020 · Bjørn Magnus Mathisen et al. · NTNU

Defining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR), where the similarity measure is used to retrieve the stored case or set of cases most similar to the query case. Describing a similarity measure analytically is challenging, even for domain experts working with CBR experts. However, datasets are typically gathered as part of constructing a CBR or machine learning system. These datasets are assumed to contain the features that correctly identify the solution from the problem features; thus they may also contain the knowledge needed to construct or learn such a similarity measure. The main motivation for this work is to automate the construction of similarity measures using machine learning while keeping training time as low as possible. Our objective is to investigate how to apply machine learning to effectively learn a similarity measure. Such a learned similarity measure could be used for CBR systems, but also for clustering data in semi-supervised learning or one-shot learning tasks. Recent work has advanced towards this goal but relies on either very long training times or manually modeling parts of the similarity measure. We created a framework to help us analyze current methods for learning similarity measures. This analysis resulted in two novel similarity measure designs: one uses a pre-trained classifier as the basis for a similarity measure, and the other uses as little modeling as possible while learning the similarity measure from data and keeping training time low. Both similarity measures were evaluated on 14 different datasets. The evaluation shows that using a classifier as the basis for a similarity measure gives state of the art performance. Finally, the evaluation shows that our fully data-driven similarity measure design outperforms state of the art methods while keeping training time low.

1 Introduction

Many artificial intelligence and machine learning (ML) methods, such as k-nearest neighbors (k-NN), rely on a similarity (or distance) measure Maggini et al. (2012) between data points. In case-based reasoning (CBR), a simple k-NN or a more complex similarity function is used to retrieve the stored cases that are most similar to the current query case. The similarity measure used in CBR systems for this purpose is typically built as a weighted Euclidean similarity measure (or as a weight matrix for discrete and symbolic variables). Such a similarity measure is designed with the assistance of domain experts by adjusting the weights for each attribute of the cases to represent how important they are (one example can be seen in Wienhofen and Mathisen (2016); the approach is described generally in chapter 4 of Bergmann (2002)).

In many situations the design of such a function is non-trivial. Domain experts with an understanding of CBR or machine learning are not easily available. However, before or during most CBR projects, data is gathered that relates to the problem being solved by the CBR system. This data is used to construct cases for populating the case base. If the data is labeled according to the solution/class, it can be used to learn a similarity measure that is relevant to the task being solved by the system. Exploring efficient methods of learning similarity measures and improving on them is the main motivation of this work.

Figure 1: Illustration of problem and solution spaces Leake (1996). x and y are two problem descriptions, each described by a set of features and each with a corresponding solution (s_x and s_y) in solution space. d_p illustrates the distance between a new problem x and a stored problem y. Correspondingly, d_s is the distance between the stored solution s_y and the solution s_x, which is the (unknown) ideal solution to x. A fundamental assumption in CBR is that if the similarity between x and y is high, then the similarity between the unknown solution s_x and s_y is also high: similar problems have similar solutions.

In the CBR literature, similarity measurement is often described in terms of problem and solution spaces. The problem space is spanned by the features that describe a problem; this is often called feature space in non-CBR ML literature. The solution space, also referred to as target space, is populated by points describing solutions to points in the problem space. Learning the function that maps a point from the problem space to its corresponding point in the solution space is typically the goal of supervised machine learning. This is illustrated in Figure 1.

A similarity measure in the problem space represents an approximation of the similarity between two cases or data points in the solution space (i.e. whether these two cases have similar or dissimilar solutions). Such a similarity measure would be of great help in situations where lots of labeled data is available, but domain knowledge is not available, or when the modeling of such a similarity measure is too complex.

Learned similarity measures can also be used in other settings, such as clustering. Another relevant method type is semi-supervised learning in which the labeled part of a dataset is used to cluster or label the unlabeled part.

How to automatically learn similarity measures has been an active area of research in CBR. For instance, Gabel et al. Gabel and Godehardt (2015) train a similarity measure by creating a dataset of collated pairs of data points and their respective similarities. This dataset is then used to train a neural network to represent the similarity measure. In this method the network needs to extract the most important features in terms of similarity for both data points, then combine these features to output a similarity value. Recent work (e.g. Martin et al. Martin et al. (2017)) has used Siamese neural networks (SNN) Bromley et al. (1994) to learn a similarity measure in CBR. SNNs have the advantage of sharing weights between two parts of the network, in this case the two parts that extract the useful information from the two data points being compared. All of these methods for learning similarity measures have in common that they are trained to compare two data points and return a similarity value. Our work on automatically learning similarity measures is also related to the work done by Hüllermeier et al. on preference-based CBR Hüllermeier and Schlegel (2011); Hüllermeier and Cheng (2013). In that work the authors learn a preference of similarity between cases/data points, which represents a more continuous space between solutions than a typical similarity measure in CBR. This approach is similar to learning similarity measures with machine learning models, in that both can always return a list of possible solutions sorted by their similarity.

In this work we have developed a framework to show the main differences between various types of similarity measures. Using this framework, we highlight the differences between existing approaches in Section 3. This analysis also reveals areas that have not received much attention in the research community so far. Based on this we developed two novel designs for using machine learning to learn similarity measures from data. Both designs are continuous in their representation of the estimated solution space.

The novelty of our work is three-fold. First, we show that using a classifier as a basis for a similarity measure gives adequate performance. Second, we demonstrate that a similarity measure designed to use as little modeling as possible, while keeping training time low, outperforms state of the art methods. Finally, to analyze the state of the art and compare it to our new similarity measure design, we introduce a simple mathematical framework, and we show how it is a useful tool for analyzing and categorizing similarity measures.

The remainder of this paper describes our method in more detail. Section 2 describes the novel framework for similarity measure learning, and Section 3 then summarizes previous relevant work in relation to this framework. In Section 4 we describe our proposed similarity measures and how we design the experimental evaluation. Subsequently, in Section 5 we show the results of this evaluation. Finally, in Section 6 we interpret and discuss the evaluation results, give some conclusions, and present some of the limitations of our work as well as possible future paths of research.

2 A framework for similarity measures

We suggest a framework for analyzing different similarity functions, with S(x, y) as a similarity measure applied to a pair of data points x and y:

S(x, y) = C(G(x), G(y))    (1)

where G(x) and G(y) represent embeddings or information extractions from the data points x and y, i.e. G highlights the parts of the data points most useful to calculate the similarity between them. An illustration of this process can be seen in Figure 2.

C models the distance between the two data points based on the embeddings G(x) and G(y). The functions G and C can be either manually modeled or learned from data. With respect to this we will enumerate all of the different configurations of Equation 1 and describe their main properties and how they have been implemented in state of the art research. Note that we will use S to denote the similarity measure and C for the sub-part of the similarity measure that calculates the distance between the two outputs of G. S is distinct from C unless G is the identity function I(x) = x.
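To make the decomposition concrete, the following minimal Python sketch (our illustration, not code from any of the cited systems) implements Equation 1 with hypothetical stand-ins: G is a hand-modeled feature weighting and C an inverse-distance combination; either or both could instead be learned models.

```python
# Minimal sketch of S(x, y) = C(G(x), G(y)) from Equation 1.
# G and C below are hypothetical stand-ins; either can be modeled or learned.
import numpy as np

def G(x: np.ndarray) -> np.ndarray:
    # Embedding / information extraction: here a fixed per-feature weighting.
    weights = np.array([1.0, 0.5, 2.0])  # hypothetical feature importances
    return x * weights

def C(gx: np.ndarray, gy: np.ndarray) -> float:
    # Binary combination of the two embeddings into a similarity in (0, 1].
    return 1.0 / (1.0 + np.linalg.norm(gx - gy, ord=1))

def S(x: np.ndarray, y: np.ndarray) -> float:
    return C(G(x), G(y))

print(S(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.5, 3.0])))  # ≈ 0.8
```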

Figure 2: Illustration of how G from Equation 1 adds another space, the embedding space, between the problem and the solution space Leake (1996) (see Figure 1). C then combines the two embeddings of x and y (G(x) and G(y) respectively) and calculates the similarity between them. The main assumption is that the distance in embedding space (d_e) is close to the distance in solution space (d_s): if the embedded points G(x) and G(y) are similar, then the true (unknown) solution s_x is similar to the solution s_y. The main contribution of G is to create an embedding space such that the distance in embedding space (d_e) is a better estimate of the distance in solution space (d_s) than the distance in problem space (d_p).
              C modeled   C learned
G modeled     Type 1      Type 2
G learned     Type 3      Type 4
Table 1: The different types of similarity measures in our proposed framework, according to whether the embedding function G and the combination function C are modeled or learned.

In the following we characterize the different types of similarity measures:

Type 1

A typical similarity measure in a CBR system would model both G and C from domain knowledge. Such a similarity measure is typically modeled by experts with the relevant domain knowledge together with CBR experts, who know how to encode this domain knowledge into the similarity measure.

For example, consider modeling the similarity of cars for sale, where the goal is to model the similarity of cars in terms of their final selling price. In this example, domain experts may model the embedding function G so that the number of miles driven has greater importance than the color of the car. C could be modeled such that a difference in miles driven is less important than a difference in the number of repairs done on the car. More details and examples can be found in Cunningham (2009).

Type 2

This type represents similarity measures that model G and learn the function C. In this context G can be viewed as an embedding function. Since G is not learned from the data, it is not interesting to analyze it as part of learning the similarity measure, as processing the data through G could be done in batch before applying the data to C. Learning C needs to be done with a dataset consisting of triplets of two data points x and y and the true similarity s between them.

A special case of Type 2 is when G is set to be the identity function I(x) = x, while C is learned from data. An example of this type is presented in Gabel and Godehardt (2015), where the similarity measure always looks at the two inputs together, never separately.

Type 3

In this type of similarity measure the embedding/feature extraction G is learned and C is modeled. Typically the embedding function learned by G resembles the function that is the goal of supervised machine learning. Within the similarity measure the output of G is used as an embedding vector for calculating similarity, whereas in classification it would be the softmax output vector. Using a pre-trained classification model as a starting point for G, feeding into e.g. a norm-based C, should give good results for similarity measurement if that model had high precision for classification on the same dataset.

However, it is not given that the best embedding vector for calculating similarity is the same as the embedding vector produced by a G trained to do classification.

Type 4

This measure is designed so that both G and C are learned.

We will design, implement and evaluate similarity measures based on Type 1, Type 2, Type 3 and Type 4 in Section 4. The results will be shown in Section 5.

To be usable as a similarity measure for retrieval and clustering, e.g. with k-nearest neighbors, S should fulfill the following requirements:

Symmetry

The similarity between x and y should be the same as the similarity between y and x.

Non-negative

The similarity between two data points cannot be negative.

Identity

The similarity between two data points x and y should be 1 if and only if x is equal to y.

Some of these requirements are not satisfied by all types of similarity measures; e.g. symmetry is not a direct design consequence of Type 2, but it is of Type 3 if C is symmetric. Even if symmetry is not present in all similarity measures Tversky (1977), it is important for reducing training time, as the training set only needs to contain unordered pairs of data points, roughly halving its size. Symmetry also enables the similarity measure to be used for clustering.
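As an illustration, the following small check (our own sketch with a toy similarity function, not one of the measures discussed in this paper) verifies the three requirements numerically:

```python
# Toy check of the three requirements: symmetry, non-negativity and identity.
import numpy as np

def S(x: np.ndarray, y: np.ndarray) -> float:
    # Toy symmetric similarity used only to illustrate the checks.
    return 1.0 / (1.0 + np.linalg.norm(x - y, ord=1))

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert abs(S(x, y) - S(y, x)) < 1e-12  # symmetry: S(x, y) == S(y, x)
    assert S(x, y) >= 0.0                  # non-negativity
assert S(np.ones(3), np.ones(3)) == 1.0    # identity: S(x, x) == 1
```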

In the next section we relate the current state of the art to the framework in the context of these different types.

3 Related work

To exemplify the framework presented in the previous section, we relate previous work to it and to the types of similarity measures derived from it. This also enables us to see possibilities for improvement and further research.

As stated in Section 1, our motivation is to automate the construction of similarity measures while keeping training time as low as possible. Thus we will not focus on Type 1 similarity measures, as this type uses no learning. Both Type 2 and Type 4 require a different type of training dataset than a typical supervised machine learning dataset, as the training data consists of pairs of data points and typically depends on their ordering (see Section 4). Given our initial motivation, Type 3 similarity measures therefore seem to be the most promising type to focus on. However, it is worth investigating Type 4 similarity measures to see whether the added benefit of learning C outweighs the added training time, or whether C can be made symmetric (as defined in the previous section) so that the training time becomes similar to Type 3.

Thus we will focus on summarizing related work from Type 3 similarity measures, but also add relevant work from Type 1, Type 2 and Type 4 for reference.

Type 1 is a type of similarity measure which is manually constructed. A general overview and examples of this type of similarity measure can be found in Cunningham (2009). Nikpour et al. Nikpour et al. (2018) present an alternative method which includes enrichment of the cases/data points via Bayesian networks.

Type 2

In Type 2 similarity measures only the binary operator C of the similarity measure is learned, while G is either modeled or left as the identity function I(x) = x. Stahl et al. have done a lot of work on learning Type 2 similarity measures from data or user feedback. In all of their work they formulate the similarity as a weighted sum of local similarities, S(x, y) = Σ_i w_i · sim_i(x_i, y_i), where for each feature i, sim_i is the local similarity measure and w_i is the weight of that feature. Stahl (2001) describes a method for learning the feature weights.

In Stahl and Gabel (2003) Stahl and Gabel introduce learning local similarity measures through an evolutionary algorithm (EA). First they learn the attribute weights (w_i) by adopting the method previously described in Stahl (2001), then they use an EA to learn the local similarity measures for each feature (sim_i). In Stahl and Gabel (2006) Stahl and Gabel present work where they learn the weights of a modeled similarity measure, and the local similarity for each attribute, through an ANN. Reategui et al. Reategui et al. (1997) learn and represent parts of the similarity function through an ANN. Langseth et al. Langseth et al. (1999) learn similarity knowledge (C) from data using Bayesian networks, which still partially relies on modeling the Bayesian networks with domain knowledge.

Abdel-Aziz et al. Abdel-Aziz et al. (2014) use the distribution of case attribute values to inform a polynomial local similarity function, which is better than guessing when domain knowledge is missing. This method thus extracts statistical properties from the dataset to parametrize C.

Gabel and Godehardt Gabel and Godehardt (2015) use a neural network to learn a similarity measure. Their work is done in the context of case-based reasoning (CBR), which uses the measure to retrieve similar cases. They concatenate the two data points into one input vector. Thus, in the context of our framework, G is modeled as the identity function and C is learned.

Maggini et al. Maggini et al. (2012) use SIMNNs, which they also see as a special case of Symmetry Networks Shawe-Taylor (1993) (SNs). In SIMNNs both G and C are functions of both data points, and there is thus no distinct G. They also impose a specialized structure on their network to make sure the learned function is symmetric. SIMNN is in essence an extended version of a Siamese neural network, but without the distinct distance layer usually present in SNN architectures. They focus on the specific properties of the network architecture and the application of such networks in semi-supervised settings such as k-means clustering. The pair of data points (x and y) are compared twice, first at the first hidden layer and then at the output layer. Since there are no learnable parameters before this comparison, all of the learning is done in C, and G is the activation function of the input layer.

Type 3

One way of looking at a similarity measure is as an inverse distance measure, as similarity is the semantic opposite of distance. There has been much work on learning distance measures. Most of this work can be categorized as Type 3 similarity measures, as the learning task only aims to learn the embedding function G and then combines the output of this function with a static C (e.g. a norm). The most well known instance of a Type 3 learned distance measure is the Siamese neural network (SNN), which is highly related to the Type 2 similarity measure of Maggini et al., the Similarity neural network (SIMNN) Maggini et al. (2012).

The main characteristic of SNNs is that the weights are shared between two identical neural networks, into which the two data points we want to measure the similarity of are fed. This frees the learning algorithm from learning two sets of weights for the same task. This was first used in Bromley et al. (1994) (with G learned from data) to measure similarity between signatures. Similar architectures are also discussed in Shawe-Taylor (1993).

Chopra et al. Chopra et al. (2005) use an SNN for face verification and pose the problem as an energy-based model. The outputs of the SNN are combined through an L1 norm (absolute-value norm) to calculate the similarity. They emphasize that using an L2 norm (Euclidean distance) as part of the loss function would make the gradient too small for effective gradient descent (i.e. create plateaus in the loss function). This work is closely related to Hadsell et al. Hadsell et al. (2006), where they explain the contrastive loss function used for training the SNN (also used in Chopra et al. (2005); Martin et al. (2017)) by analogy to a spring system.
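For reference, a small numpy sketch of the contrastive loss as described by Hadsell et al. is shown below; the label convention (0 for a genuine pair, 1 for an impostor pair) and the margin value are assumptions chosen for illustration.

```python
# Contrastive loss (Hadsell et al. 2006): pull genuine pairs together,
# push impostor pairs apart until they are at least `margin` away.
import numpy as np

def contrastive_loss(D: np.ndarray, y: np.ndarray, margin: float = 1.0) -> np.ndarray:
    # D: distances between embedding pairs; y: 0 = genuine pair, 1 = impostor pair.
    return (1 - y) * 0.5 * D ** 2 + y * 0.5 * np.maximum(0.0, margin - D) ** 2

D = np.array([0.1, 0.9, 2.0])   # distances for three hypothetical pairs
y = np.array([0, 1, 1])         # first pair genuine, the other two impostor
print(contrastive_loss(D, y))   # [0.005, 0.005, 0.0]
```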

Related to this, Vinyals et al. Vinyals et al. (2016) use a similar type of setup for matching an input data point to a support set. It is framed as a discriminative task, where they use two neural networks to parametrize an attention mechanism. They use these two networks to embed the two data points into a feature space where the similarity between them is measured. However, in contrast to SNNs and SIMNNs, their two networks for embedding the data points are not identical, as one network is tailored to embed a data point from the support set and is also given the rest of the support set. Thus the embedding of the support set data point is also a function of the rest of the support set. With C modeled as a cosine softmax, this is similar to the examples of Type 3 similarity measures mentioned previously (e.g. Bromley et al. (1994); Berlemont et al. (2015)). However, a major difference is that the two embedding functions are not equal; they may only coincidentally produce the same output. Since the two embedding functions do not share weights, the architecture is variant (or asymmetric) with respect to the ordering of the input pair. Thus the architecture needs up to twice as much training to achieve the same performance as an SNN.

In much the same fashion as Chopra et al. did in Chopra et al. (2005), Berlemont et al. Berlemont et al. (2015) use SNNs combined with an energy-based model to build a similarity measure between different gestures made with smartphones. However, they adapt the error estimation from using only separate positive and negative pairs to a training subset including a reference sample, a positive sample and a negative sample for every other class. They train G while keeping C static. This training method of using triplets for training SNNs was also described by Lefebvre et al. Lefebvre and Garcia (2013). A similar approach can be seen in Hoffer et al. Hoffer and Ailon (2015); however, they do not use a set of negative examples per reference point for each class as Berlemont et al. do. Instead they use triplets consisting of a reference point, a sample of the same class and a sample of a different class.

Koch et al. Koch et al. (2015) use a Convolutional Siamese Network (CSN), with G implemented as a CNN and C implemented as a weighted L1 distance. This is done in a semi-supervised fashion for one-shot learning within image recognition. They train this CSN to determine whether two pictures from the Omniglot dataset Lake et al. (2015) belong to the same class. The model can then be used to classify a data point representing an unseen class by comparing it to a repository of class representatives (support set).

CSNs are also used in the context of CBR by Martin et al. Martin et al. (2017) to represent a similarity measure in a CBR system. The CSN is trained with pairs of cases and the output is their similarity. During training they have to label pairs of cases as 'genuine' (both cases belong to the same class) or 'impostor' (the cases belong to different classes). This requires that the user has a clear boundary for the classes. In relation to our framework this similarity measure learns G while C is static, with G implemented as a convolutional neural network and C implemented as the Euclidean distance (L2 norm).

In general, using SNNs for constructing similarity measures has a major advantage, as one can easily adopt pre-trained models for G to embed/pre-process the data points. For example, to train a model for comparing two images one could use ResNet He et al. (2016) as G and then use a norm as C. This would be a very similar approach to the similarity measure used by Koch et al. Koch et al. (2015), the main difference being that ResNet is designed for bigger pictures.
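As a sketch of that idea (our illustration, with the pooling choice and the distance-to-similarity conversion as assumptions), one could combine a pre-trained ResNet50 from keras.applications as G with an L2-based C:

```python
# Sketch: pre-trained ResNet50 as G, static L2-based C -- a Type 3 measure
# for images with no similarity-specific training at all.
import numpy as np
from tensorflow import keras

G = keras.applications.ResNet50(include_top=False, pooling="avg",
                                weights="imagenet")  # 2048-dim embeddings

def similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    # img_a, img_b: (224, 224, 3) arrays in RGB order.
    batch = keras.applications.resnet50.preprocess_input(
        np.stack([img_a, img_b]).astype("float32"))
    ga, gb = G.predict(batch, verbose=0)
    return 1.0 / (1.0 + np.linalg.norm(ga - gb))  # L2 distance as inverse similarity
```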

There are only very few examples of Type 4 similarity measures in the literature. Zagoruyko and Komodakis Zagoruyko and Komodakis (2015) investigate different types of architectures for learning image comparison using convolutional neural networks. In all of the architectures they evaluate, C is learned, but in some of these architectures G is not symmetric, i.e. the two embedding branches do not share weights. Arandjelović and Zisserman Arandjelovic and Zisserman (2017) use a method very similar to many Type 3 similarity measures for calculating similarity. However, their input data is always a pair of two different data types and is as such different from most of the other relevant work, leaving G asymmetric just as in Zagoruyko and Komodakis (2015) and Vinyals et al. (2016). In contrast to the Type 3 similarity measures, including Vinyals et al. (2016), Arandjelović et al. also learn C, which they call a fusion layer.

All similarity measures of Type 3 that we found in the literature use a loss function that includes feedback from the binary operator part of S (i.e. C). In the case of SNNs, even if C is non-symmetric (C(a, b) ≠ C(b, a)), the loss for each of the two networks would be equal, as they are identical and share weights. That means that the ordering of the two data points being compared during training has no effect, i.e. the training effect of S(x, y) is equal to that of S(y, x). This saves a lot of time during training, as the training dataset can be halved without any negative effect on performance.

However, the implementation of C would then decide how much training one would need to adapt a pre-trained model from classifying single data points to measuring similarity between them. One could view the process of starting with a model pre-trained on the dataset, and then training it with the loss coming from C, as adapting the model from classification to similarity measurement.

One way of creating a Type 3 similarity measure using a minimal amount of training would be to pre-train a network on classifying individual data points, then apply that network as a G that feeds into a modeled C in a similarity measure. Evaluation of such a similarity measure has not been reported in the literature, and it will be explored in the next section.

4 Method

The framework presented in Section 2 and the subsequent analysis of previous relevant work in Section 3 show that there are unexplored opportunities within research on similarity measures.

Given the initial motivation, we seek methods that work well in domains where acquiring domain knowledge is very resource demanding. This requires that as much as possible of the similarity measure is learned from data rather than modeled from domain knowledge. There are some exceptions to this, such as applying general binary operations, e.g. norms (L1 or L2), on the two data points (x and y) pre-processed by G. In these cases there is little domain expertise involved in designing C, other than the intuition that the similarity of two data points is closely related to the norm of the difference between G(x) and G(y).

The most promising types of similarity measures from this perspective are Type 3 and Type 4, where G is learned in Type 3 and both G and C are learned in Type 4. However, to test any new design we need reference methods to compare against. For reference, we chose to implement one Type 1 similarity measure, two Type 2 similarity measures (including Gabel et al.'s) and Chopra et al.'s Type 3 similarity measure. The Type 1 similarity measure weights each feature uniformly. The first Type 2 measure is identical to the Type 1 measure except that it uses a local similarity function for each feature, parametrized by statistical properties of the values of that feature in the dataset.

One unexplored direction for creating similarity measures is to create an SNN similarity measure (Type 3) by training G as a classifier on the dataset later used for measuring similarity, and then using that trained G to construct an SNN similarity measure. This is in contrast to the usual way of training SNNs (as seen in e.g. Chopra et al. (2005); Bromley et al. (1994)), where the loss function is a function of pairs of data points, not single data points. The motivation for exploring this type of design is that it shows the similarity measuring performance of using networks pre-trained on classifying data points directly as part of an SNN similarity measure. This will be detailed in Subsection 4.2.

Finally, we will explore Type 4 similarity measures, which have seen little attention in research so far. To make our design as symmetric as possible we will use the same design as SNNs for G and introduce a novel design to also make C symmetric. That way our design is fully symmetric (invariant to the ordering of the input pair) and thus training becomes much more efficient. The details of this design are given in Subsection 4.3. Both of our proposed similarity measures implement G as neural networks. The Type 4 design implements C as a combination of a static binary function and a neural network.

4.1 Reference similarity measures

As references for our own similarity measures we implemented several similarity measures of Type 1, Type 2 and Type 3. First we implemented a standard uniformly weighted global similarity measure (t1,1), which can be defined as:

S(x, y) = Σ_{i=1}^{n} w_i · sim_i(x_i, y_i)    (2)

where sim_i denotes the local similarity of the i-th of the n attributes and w_i its weight. In t1,1 all weights and local similarity measures are uniform, and not parametrized by the data.

We extended this with a Type 2 similarity measure based on the work of Abdel-Aziz et al. Abdel-Aziz et al. (2014), where the local similarity measures are parametrized by the data of the corresponding features.
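The following sketch illustrates the form of these two reference measures under Equation 2. The uniform t1,1 variant and the range-based parametrization are our simplified reading; Abdel-Aziz et al. use polynomial local similarity functions, which are not reproduced here.

```python
# Sketch of the reference measures: a weighted sum of per-feature local
# similarities (Equation 2). t1,1 uses uniform weights and local similarities;
# the Type 2 variant parametrizes each local similarity with the feature's
# value range from the training data (a simplification of Abdel-Aziz et al.).
import numpy as np

def global_similarity(x, y, weights, local_sims):
    return sum(w * sim(a, b) for w, sim, a, b in zip(weights, local_sims, x, y))

def t11(x, y):
    n = len(x)
    uniform = [lambda a, b: 1.0 - abs(a - b)] * n  # assumes features scaled to [0, 1]
    return global_similarity(x, y, [1.0 / n] * n, uniform)

def t2_from_data(X_train):
    ranges = X_train.max(axis=0) - X_train.min(axis=0) + 1e-12
    local = [lambda a, b, r=r: max(0.0, 1.0 - abs(a - b) / r) for r in ranges]
    n = X_train.shape[1]
    return lambda x, y: global_similarity(x, y, [1.0 / n] * n, local)

X = np.random.rand(50, 4)           # hypothetical normalized dataset
t2 = t2_from_data(X)
print(t11(X[0], X[1]), t2(X[0], X[1]))
```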

Furthermore, we implemented the Type 2 similarity measure described by Gabel et al. Gabel and Godehardt (2015). Its architecture can be seen in Figure 3.

Figure 3: Architecture of the ANN similarity measure used by Gabel Gabel and Godehardt (2015) (Type 2), where G is set to be the identity function I(x) = x.

Lastly, we implemented the Type 3 similarity measure described by Chopra et al. We did not implement the extensions of the contrastive loss function seen in Berlemont et al. (2015); Lefebvre and Garcia (2013), as the required change in the training dataset would be too large and would make comparisons between the methods harder to justify. Also, none of these works compared against previous SNNs in terms of increased performance relative to regular contrastive loss.

4.2 Type 3 similarity measure

In this subsection we detail how we model the Type 3 similarity measure, which uses an embedding function G trained as a classifier. This embedding function maps the input point x to an embedding space (see Figure 2) whose dimensions represent the probabilities of x belonging to each class. We then model the similarity measure between two points as a static function C between their two respective embeddings.

For C we choose a norm-based function of the difference between the two embeddings. Substituting this C into Equation 1, we can redefine Equation 1 as:

S(x, y) = 1 − ‖G(x) − G(y)‖    (3)

where G(x) outputs the modeled solution as an n-dimensional vector (the feature vector output from the network to the softmax function for n classes) for the case based on the problem attributes of data point x. This means that if G evaluates the two cases as very similar in terms of classification, then G(x) ≈ G(y) and S(x, y) ≈ 1. This architecture is also illustrated in Figure 4.
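A minimal Keras sketch of this design is shown below. The network sizes are placeholders, and the conversion from the L1 distance between softmax vectors to a [0, 1] similarity (1 − L1/2) is an illustration choice in the spirit of Equation 3, not necessarily the paper's exact scaling.

```python
# Sketch of the classifier-based Type 3 measure: G is trained as an ordinary
# classifier, and similarity is computed from a norm between softmax outputs.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 20, 3  # hypothetical dataset dimensions

G = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
G.compile(optimizer="adam", loss="categorical_crossentropy")
# G.fit(X_train, S_train_onehot, epochs=200)  # plain classification training

def similarity(x: np.ndarray, y: np.ndarray) -> float:
    gx = G.predict(x[None, :], verbose=0)[0]
    gy = G.predict(y[None, :], verbose=0)[0]
    return 1.0 - 0.5 * float(np.abs(gx - gy).sum())  # L1 between softmax vectors
```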

Figure 4: Architecture of the similarity measure where G is trained to output softmax vectors for classification and the similarity is calculated as a modeled norm between these two vectors (Type 3).

Following the model for the similarity measure, we define the loss as the log-loss between the true classification softmax vector of a data point and the class probability vector output by G for that data point. Notice that the error estimate of G does not depend on the output of C.

A dataset of size N would then be defined as:

D = {(x_1, s_1), (x_2, s_2), …, (x_N, s_N)}    (4)

where x_i is the problem part of the i-th data point and s_i is the solution/target part of the i-th data point.

If the relation between the problem part of a data point (x_i) and its solution part (s_i) is complex, the network architecture needs to be able to represent that complexity and any generalizations of patterns in it.

4.3 Type 4 similarity measure

As previously explained, Type 4 similarity measures are currently the most unexplored type of similarity measure. It is also the type that requires the least amount of modeling. In principle a Type 4 similarity measure learns two things: G learns a useful embedding, where the most useful parts of x and y are encoded into G(x) and G(y), and C learns how to combine those embeddings to calculate the similarity of the original x and y.

In Type 4 similarity measures both G and C are learned. In our Type 4 similarity method we use an ANN to represent both G and C. This has the advantage that learning S is an end-to-end process: the loss computed after C can be used to compute gradients for both C and G. C will learn the binary combination best suited to calculate the similarity of the two embeddings, while G will learn to embed the two data points optimally for calculating their similarity in C. In principle any ML method could be used to learn G and C, but not all ML methods lend themselves naturally to back-propagating the error signal from C through G and back to the input.

We define our Type 4 similarity method, the Extended Siamese Neural Network (eSNN), as shown in Figure 5.

Given that this similarity method outputs a similarity and the loss function is a function of the input pair, we get a new general loss function for similarity, defined per pair of data points as:

L_sim(x, y) = l(S(x, y), s(x, y))    (5)

where s(x, y) is the true similarity of cases x and y and l is a per-pair loss. Since this loss function depends on pairs of data points and the true similarity between them, we need to create a new dataset based on the original dataset. This new dataset consists of triplets of two different data points from the original dataset and the true similarity of these two data points:

D_S = {(x_i, x_j, s(x_i, x_j)) | x_i, x_j ∈ D}    (6)

where s(x_i, x_j) is 1 if x_i and x_j belong to the same class and 0 otherwise.

It is worth mentioning that this dataset is of size N² if the similarity measure is to train on all possible ordered combinations of the data points. Certain similarity measure architectures (e.g. Gabel and Godehardt (2015) or Zagoruyko and Komodakis (2015)) need to train on a dataset containing all possible ordered combinations of data points (of size N²), as training on the triplet (x, y, s(x, y)) does not guarantee that the model learns that S(x, y) = S(y, x); thus the training dataset must also include the triplet (y, x, s(y, x)). However, this may be largely avoided by using architectures (such as those seen in SNNs and SNs) that exploit symmetry and weight sharing. To achieve this we modeled C as an ANN whose first layer is an absolute difference operator on the two embedding vectors, z = |G(x) − G(y)|, where z is the element-wise absolute difference between G(x) and G(y). The rest of C consists of hidden ANN layers that operate on z. This way C becomes invariant to the ordering of its inputs. Consequently the model only needs to train on order-invariant unique pairs of data points, roughly halving the training set size. The resulting architecture of eSNN can be seen in Figure 5.
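A short sketch of the corresponding dataset construction (our illustration) makes the size difference explicit: the ordered-pair dataset of Equation 6 grows as N², while an order-invariant measure such as eSNN only needs the unordered pairs.

```python
# Building the similarity training set of Equation 6: ordered pairs (size N^2)
# versus the unordered pairs sufficient for an order-invariant measure.
import itertools
import numpy as np

def ordered_pairs(X, labels):
    return [(X[i], X[j], float(labels[i] == labels[j]))
            for i, j in itertools.product(range(len(X)), repeat=2)]

def unordered_pairs(X, labels):
    return [(X[i], X[j], float(labels[i] == labels[j]))
            for i, j in itertools.combinations_with_replacement(range(len(X)), 2)]

X = np.random.rand(100, 4)
labels = np.random.randint(0, 3, size=100)
print(len(ordered_pairs(X, labels)), len(unordered_pairs(X, labels)))  # 10000 vs 5050
```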

Figure 5: Architecture of eSNN, where we combine the symmetry of SNNs with the ability to learn C. C is expanded in this picture to highlight the absolute-difference operation done as the first operation of C to keep it invariant to the ordering of the inputs. It also illustrates the two additional loss signals to G which help train the similarity measure.
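The following Keras sketch outlines an eSNN-style architecture as described above. It is our reading of the design rather than the authors' released code: the layer sizes and the sigmoid/binary cross-entropy similarity head are assumptions, and the published implementation is available at the repository referenced in Subsection 4.5.

```python
# eSNN-style Type 4 sketch: shared G (Siamese weight sharing), C starting with
# an element-wise absolute difference for order invariance, and three outputs
# (similarity plus the two classification signals used as extra losses).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 20, 3  # hypothetical dataset dimensions

G = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
], name="G")

x_in = keras.Input(shape=(n_features,), name="x")
y_in = keras.Input(shape=(n_features,), name="y")
gx, gy = G(x_in), G(y_in)  # weight sharing, as in SNNs

# C: absolute difference first (order invariance), then learned dense layers.
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]), name="abs_diff")([gx, gy])
h = layers.Dense(64, activation="relu")(diff)
similarity = layers.Dense(1, activation="sigmoid", name="similarity")(h)

esnn = keras.Model([x_in, y_in], [similarity, gx, gy])
alpha = 0.5  # hypothetical weighting between the similarity and classification losses
esnn.compile(
    optimizer="adam",  # the paper reports better results with RProp, which Keras does not ship with
    loss=["binary_crossentropy", "categorical_crossentropy", "categorical_crossentropy"],
    loss_weights=[alpha, (1 - alpha) / 2, (1 - alpha) / 2],
)
# esnn.fit([X1, X2], [s_true, labels1, labels2], epochs=200)
```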

In Subsection 4.2 we argue why a G trained to correctly classify its input is a good embedding function for calculating similarity. As a result we added two loss signals to G during training. These loss signals are calculated from the difference between the embedding of each data point produced by G and the correct softmax classification vector.

This also introduced an opportunity to explore the relative importance of the embedding function and the binary similarity function for the performance of the similarity measure. This can be done by weighting the three different loss signals (the two classification losses for G(x) and G(y), and the similarity loss, as shown in Figure 5) during training and measuring the effect of that weighting on the performance. We define our weighted loss function as:

L = α · L_sim(x, y) + (1 − α) · [L_c(t_x, G(x)) + L_c(t_y, G(y))]    (7)

where L_sim is defined in Equation 5, t_x is the true label of data point x, t_y is the true label of data point y, and L_c is the categorical cross-entropy loss between two softmax vectors. We tested 100 different values of α in the range [0, 1] to find the weighting scheme that performs best. The results can be seen in Figure 6.

Figure 6: Results of weighting the three different outputs in terms of their contribution to the loss, measured on the UCI balance scale dataset Dheeru and Karra Taniskidou (2017) (5-fold cross validation, repeated 5 times). The choice of weighting has much less impact on the validation result after 200 or more epochs of training, but choosing the correct weighting does impact how quickly the similarity measure trains.

Figure 6 seems to indicate which value of α is ideal for this dataset; we have used that value throughout the experiments reported in Section 5.

4.4 Network parameters

For all similarity measures tested using ANNs and all datasets except MNIST, G and C were implemented with two hidden layers. This was done to replicate the network parameters used by Gabel et al. to ensure comparable results. For the MNIST dataset test, three hidden layers were used for G, and the same for C.

Other than the network architecture, we also had to choose an optimizer for learning the ANN model. We chose RProp Riedmiller and Braun (1993) to be more comparable with the results of Gabel et al., who also used the RProp optimizer. Our tests, shown in Figure 7, confirm that RProp outperforms the other optimizers tested (ADAM and RMSProp). This is consistent with the results reported by Florescu and Igel Florescu and Igel (2018). This should hold until the run-time performance of RProp degrades with dataset size, as RProp uses full batches.

Figure 7: Testing how the RProp algorithm performs in comparison with ADAM and RMSProp. Our proposed architecture performs best using the RProp algorithm (5-fold cross validation and repeated 5 times).

4.5 Evaluation protocol and implementation

The different similarity measures presented earlier in this section require different training datasets. The reference Type 1 similarity measure (t1,1) requires no training, and the statistically parametrized Type 2 measure and the classifier-based Type 3 measure do not require a similarity training set consisting of triplets as described in Equation 6. All other similarity measures evaluated were trained using identical training datasets. As a result, they were trained on a dataset consisting of all possible combinations of data points (as explained in Subsection 4.3), since this is required by the similarity measure of Gabel et al. However, results highlighting the differences in training performance when using the different training datasets can be seen in Figure 13.

The results reported in the next section are all from 5-fold stratified cross validation repeated 5 times for robustness. The performance reported is an evaluation of each similarity measure using the part of the dataset (validation partition) that was not used for training. Using the similarity measure being evaluated, we computed the similarity between every data point in the validation partition and every data point in the training partition. For each validation data point we find the data point in the training set with the highest similarity. If that data point has the same class as the validation data point, the retrieval is counted as correct; if not, it is counted as an error. The reported loss is the fraction of retrieval errors.
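The following sketch (our paraphrase of the protocol, assuming correct retrievals score 0 and incorrect ones score 1) shows how the reported retrieval loss can be computed for any similarity function S:

```python
# Retrieval-based validation loss: for each validation point, retrieve the most
# similar training point under S and count an error if the classes differ.
import numpy as np

def retrieval_loss(S, X_train, y_train, X_val, y_val) -> float:
    errors = 0
    for xv, yv in zip(X_val, y_val):
        sims = [S(xv, xt) for xt in X_train]
        nearest = int(np.argmax(sims))           # most similar training point
        errors += int(y_train[nearest] != yv)    # 1 if the retrieved class is wrong
    return errors / len(X_val)
```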

The implementation was done in Keras with TensorFlow as backend (code available at the NTNU Open AI Lab GitHub page: https://github.com/ntnu-ai-lab). The methods were evaluated on 14 different datasets available from the UCI machine learning repository Dheeru and Karra Taniskidou (2017). Results were recorded after 200 epochs and after 2000 epochs (the latter number to be consistent with Gabel et al. Gabel and Godehardt (2015)) to reveal how fast the different methods achieved their performance.

5 Experimental evaluation

To measure the performance of our similarity measures we chose the same evaluation method as Gabel et al. Gabel and Godehardt (2015) to make the similarity measures more easily comparable. In addition, using publicly available datasets from the UCI machine learning repository Dheeru and Karra Taniskidou (2017) makes the results easy to reproduce. We selected a subset of the original 19 datasets, choosing not to use the regression datasets, resulting in a set of 14 classification datasets. The datasets' numerical features were all normalized, and categorical features were replaced by one-hot vectors.

The validation losses from evaluating the similarity measures on the 14 datasets are shown in Figures 8 and 9. Figure 8 shows the results after training for 200 epochs, while Figure 9 shows the results after 2000 epochs. This illustrates how the differences between the similarity measures develop during training. Note that the 200 and 2000 epoch runs are independent runs (i.e. Figure 9 does not show the same models as Figure 8 at a later point in training).

The numbers that are the basis of these figures are also reported in Table 2 for 200 epochs and Table 3 for 2000 epochs. The tables are highlighted to show the best result per dataset. In some cases the difference between two methods on a dataset was smaller than the standard deviation, in which case more than one result is highlighted.

Finally, to illustrate that eSNN scales to larger datasets, we report results from the MNIST dataset in Figure 10. The MNIST results are not validation results, as calculating the similarity between all the data points in the test set and the training set (as per the evaluation protocol described in Subsection 4.5) was too resource demanding.

Figure 8: Performance of eSNN in comparison to reference similarity measures and state of the art similarity methods over all test datasets, trained for 200 epochs.
Figure 9: Performance of eSNN in comparison to reference similarity measures and state of the art similarity methods over all test datasets, trained for 2000 epochs.
bal 0.01 0.00 0.14 0.10 0.42 0.81
car 0.04 0.02 0.19 0.16 0.25 0.25
cmc 0.52 0.53 0.54 0.55 0.54 0.58
eco 0.22 0.20 0.46 0.35 0.21 0.22
glass 0.08 0.08 0.12 0.10 0.06 0.07
hay 0.19 0.21 0.26 0.17 0.33 0.37
heart 0.21 0.24 0.28 0.24 0.24 0.23
iris 0.04 0.03 0.18 0.07 0.05 0.04
mam 0.21 0.25 0.26 0.27 0.28 0.29
mon 0.28 0.33 0.39 0.45 0.29 0.29
pim 0.28 0.30 0.35 0.35 0.31 0.32
ttt 0.03 0.03 0.17 0.07 0.32 0.07
use 0.07 0.08 0.08 0.39 0.21 0.18
who 0.29 0.45 0.33 0.45 0.46 0.45
Sum 2.47 2.75 3.75 3.72 3.97 4.17
Average 0.18 0.20 0.27 0.27 0.28 0.30
Table 2: Validation retrieval loss after 200 epochs of training, in comparison to state of the art methods. eSNN has the lowest average loss. The best result for each dataset is highlighted in bold.

Table 2 shows the validation losses of the different similarity measures on the different datasets. Our proposed Type 4 similarity measure, eSNN, has lower validation loss on average than the second best similarity measure, the Type 3 measure of Chopra et al. Chopra et al. (2005). The other learned similarity measures follow, with our classifier-based Type 3 measure and the similarity measure of Gabel et al. Gabel and Godehardt (2015) having higher loss. The Type 1 similarity measure had even more loss, but managed to be the best similarity measure for the glass dataset. Finally, the statistically parametrized Type 2 similarity measure had the highest average loss.

bal 0.02 0.00 0.08 0.01 0.43 0.83
car 0.01 0.01 0.06 0.02 0.24 0.24
cmc 0.52 0.53 0.54 0.53 0.54 0.58
eco 0.22 0.20 0.22 0.18 0.19 0.21
glass 0.06 0.07 0.08 0.09 0.05 0.06
hay 0.18 0.21 0.20 0.15 0.32 0.34
heart 0.21 0.27 0.23 0.22 0.24 0.23
iris 0.08 0.05 0.07 0.04 0.06 0.05
mam 0.21 0.27 0.25 0.27 0.29 0.28
mon 0.26 0.30 0.33 0.27 0.32 0.32
pim 0.27 0.31 0.25 0.30 0.30 0.31
ttt 0.03 0.03 0.07 0.03 0.32 0.08
use 0.08 0.10 0.07 0.08 0.18 0.16
who 0.30 0.46 0.29 0.43 0.47 0.45
Sum 2.45 2.81 2.74 2.62 3.95 4.14
Average 0.18 0.20 0.20 0.19 0.28 0.30
Table 3: Validation retrieval loss after 2000 epochs of training, in comparison to state of the art methods. eSNN has the lowest average loss. The best result for each dataset is highlighted in bold.

The results when training for 2000 epochs are quite different from those at 200 epochs, as seen by how much closer the other similarity measures are in Figure 9 than in Figure 8. eSNN still outperforms all other similarity measures on average, but the second best similarity measure is now much closer, with only slightly higher loss. The other learned similarity measures follow, while t1,1 and the statistically parametrized Type 2 measure remain considerably worse than eSNN.

The gap between eSNN and the state of the art is considerable at 200 epochs. The gap shrinks by 2000 epochs, but the remaining difference is still considerable.

Figure 10: Training loss (not validation retrieval loss) during training on the MNIST dataset for eSNN and the similarity measure of Chopra et al. The measure of Gabel et al. could not be evaluated, as training on an N²-sized dataset for MNIST is too resource demanding.

To illustrate the difference in training efficiency between the different types of similarity measures, we show the validation loss for eSNN and the measures of Chopra et al. and Gabel et al. during training. Specifically, for each epoch we compute the loss of each similarity measure by the same method as described in Subsection 4.5. Figure 11 and Figure 12 show the validation loss during training on the UCI Mammographic mass and Iris datasets Dheeru and Karra Taniskidou (2017), respectively. This exemplifies the training performance of these methods in relation to the Iris and Mammographic mass results reported in the tables above. One can also note that in training on the Mammographic mass dataset, as seen in Fig. 11, the measure of Chopra et al. never achieves the same performance as eSNN. In contrast, while training on the Iris dataset (as seen in Fig. 12), which is a less complex dataset than the Mammographic mass dataset, it achieves the same performance as eSNN.

Figure 11: Validation retrieval loss during training on the Mammographic mass UCI ML dataset Dheeru and Karra Taniskidou (2017). The figure shows that the Mammographic mass dataset needs learning beyond the embedding via G. The measure of Chopra et al. starts out well, as its C is already modeled as a norm. However, eSNN and the measure of Gabel et al. catch up when they learn an equivalent and eventually better function.
Figure 12: Validation retrieval loss during training on the Iris UCI ML dataset Dheeru and Karra Taniskidou (2017). Since the measure of Chopra et al. starts out with very low validation loss, it seems probable that the static norm it uses as C is close to optimal for correctly identifying whether two data points belong to the same class. The performance increase during its training is a slight optimization of G. The performance increase during training of eSNN and the measure of Gabel et al. comes mainly from learning a C equivalent in function to that static norm, and secondarily from a slight optimization of G. eSNN catches up to the measure of Chopra et al. after around 20 epochs; the measure of Gabel et al. takes longer (5% validation loss at 2000 epochs), as shown in Table 3.

Figure 13 shows the validation loss during training when eSNN and the measure of Chopra et al. use a training dataset of unordered pairs, while the measure of Gabel et al. uses a training dataset of all ordered pairs. This figure illustrates how many fewer evaluations an SNN similarity measure like Chopra et al.'s, or a symmetric Type 4 similarity measure such as eSNN, needs compared to a similarity measure that is not invariant to input ordering, while still achieving excellent relative performance.

Figure 13: Validation retrieval loss during training on the balance dataset, illustrating the difference in the number of evaluations needed to achieve acceptable performance. The measure of Chopra et al. achieves good performance very quickly, but is soon outperformed by eSNN. Both reach very good performance having evaluated fewer data points than a single epoch of the measure of Gabel et al. requires.

Finally, in Figures 14 and 15 we show how eSNN can be used for semi-supervised clustering. The figures show PCA and t-SNE projections of the embeddings produced by untrained and trained networks, respectively, on the MNIST dataset. The embeddings are the vector outputs of G for each of the data points in the test set, which was not used for training. The figures show that eSNN learns to correctly cluster data points that it was not trained on.
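A short sketch of this use case (our illustration; the placeholder array stands in for the output of a trained G) is shown below:

```python
# Semi-supervised clustering sketch: embed unseen points with the trained G and
# project the embeddings with PCA for inspection, as in Figure 14.
import numpy as np
from sklearn.decomposition import PCA

# embeddings = G.predict(X_test)        # G: trained embedding branch of eSNN
embeddings = np.random.rand(500, 10)    # placeholder standing in for G's output
proj = PCA(n_components=2).fit_transform(embeddings)
# proj[:, 0] and proj[:, 1] are the first two principal components to plot.
```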

(a) PCA projection of the MNIST test set before training
(b) PCA projection of the MNIST test set after training
Figure 14: PCA projection showing the first two principal components of the embeddings produced by G from MNIST input before (a) and after (b) training.
(a) t-SNE projection of the MNIST test set before training
(b) t-SNE projection of the MNIST test set after training
Figure 15: t-SNE projection of the embeddings produced by G from MNIST input before (a) and after (b) training.

6 Conclusions and future work

Section 5 shows that all of the learned similarity measures outperformed the classical similarity measure t1,1 and also the Type 2 variant where the local (per-feature) similarity measures were adapted to the statistical properties of the features Abdel-Aziz et al. (2014). In practice one should weight each feature according to how important it is for the similarity measurement. In many situations the number of possible attributes to include in such a function can be overwhelming, and modeling them in the way we did in t1,1 and its statistically parametrized extension also overlooks possible co-variations between the attributes. Both of these problems can be addressed by using the proposed method of modeling the similarity with machine learning on a dataset that maps from case problem attributes to case solution attributes.

However, one should note that all of the learned similarity measures are built on the assumption that similar data points have similar target values (i.e. that a small distance in embedding space implies a small d_s in Figure 2). If this assumption does not hold, learning the similarity measure may be much more difficult.

We have also presented a framework for how to analyze and group different types of similarity measures. We have used this framework to analyze previous work and highlight different strengths and weaknesses of the different types of similarity measures. This also highlighted unexplored types of similarity measures, such as Type 4 similarity measures.

As a result we designed and evaluated a Type 3 similarity measure based on a classifier. The evaluation showed that using a classifier as the basis for a similarity measure achieves results comparable to state of the art methods, while using far fewer training evaluations to achieve that performance.

We then combined strengths of Type 3 and Type 4 similarity measures into a new Type 4 similarity measure, called the Extended Siamese Neural Network (eSNN), which:

  • Learns an embedding of the data points using G in the same way as Type 3 similarity measures, but with shared weights in the same way as SNNs to make the operation symmetric.

  • Learns C, thus enabling better performance than SNNs and other Type 3 similarity measures.

  • Restricts C to make it invariant to input ordering, thus obtaining end-to-end symmetry through the similarity measure.

Keeping eSNN symmetric end-to-end enables the user of this similarity measure to train on much smaller datasets than required by other types of similarity measures. Type 3 measures based on SNNs also have this advantage, but our results show that the ability to learn C is important for performance on many of the 14 datasets we tested. Our results showed that eSNN outperformed state of the art methods on average over the 14 datasets by a large margin. We also demonstrated that eSNN achieves this performance much faster on the same dataset than the current state of the art. In addition, the symmetry of eSNN enables it to train on datasets that are orders of magnitude smaller. Our case study of clustering the embeddings produced by eSNN shows that the model can be used for semi-supervised clustering.

Finally, we demonstrated that the training of this similarity measure scales to large datasets like MNIST. Our main motivation for this work was to automate the construction of similarity measures while keeping training time as low as possible, and we have shown that eSNN is a step towards this. Our evaluation shows that it can learn similarity measures across a wide variety of datasets. We also show that it scales well in comparison to similar methods, including to datasets of considerable size such as MNIST.

The applications of eSNN are not limited to being the similarity measure in a CBR system. It can also be used for semi-supervised clustering: training eSNN on labeled data, then using the trained measure for clustering unlabeled data. In much the same fashion it could be used for one-shot learning, using eSNN as a matching network in the same way the distance measure is applied in Vinyals et al. Vinyals et al. (2016).

In continuation of this work we would like to explore what is actually encoded by the learned similarity measures. This could be done by varying the different features of a query data point and discovering when that data point would change from one class to another (when the class of the closest other data point changes); this would form a multi-dimensional boundary for each class. This boundary could be explored to determine what the similarity measure actually encoded during the learning phase.

Another interesting avenue of research would be to apply recurrent neural networks to embed time series into the embedding space (see Figure 2), enabling the similarity measure to calculate similarity between time series, which is currently a non-trivial problem.

The architecture of similarity measures still requires more investigation; e.g. is the optimal embedding from G different from the softmax classification vector used in normal supervised learning? If so, it is worth investigating why it is different.

7 Acknowledgements

We would like to thank the EXPOSED project and NTNU Open AI Lab for the support to do this work. Thanks also to Gunnar Senneset and Hans Vanhauwaert Bjelland for their great support during our work.

References

  • Abdel-Aziz et al. (2014) Abdel-Aziz A, Strickert M, Hüllermeier E (2014) Learning solution similarity in preference-based cbr. In: International Conference on Case-Based Reasoning, Springer, pp 17–31
  • Arandjelovic and Zisserman (2017) Arandjelovic R, Zisserman A (2017) Look, listen and learn. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, pp 609–617

  • Bergmann (2002) Bergmann R (2002) Experience management: foundations, development methodology, and internet-based applications. Springer-Verlag
  • Berlemont et al. (2015) Berlemont S, Lefebvre G, Duffner S, Garcia C (2015) Siamese neural network based similarity metric for inertial gesture classification and rejection. In: Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, IEEE, vol 1, pp 1–6
  • Bromley et al. (1994) Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R (1994) Signature verification using a "siamese" time delay neural network. In: Advances in neural information processing systems, pp 737–744
  • Chopra et al. (2005) Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, IEEE, vol 1, pp 539–546

  • Cunningham (2009) Cunningham P (2009) A taxonomy of similarity mechanisms for case-based reasoning. IEEE Transactions on Knowledge and Data Engineering 21(11):1532–1543
  • Dheeru and Karra Taniskidou (2017) Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
  • Florescu and Igel (2018) Florescu C, Igel C (2018) Resilient backpropagation (Rprop) for batch-learning in TensorFlow. In: ICLR 2018 Workshop Track

  • Gabel and Godehardt (2015) Gabel T, Godehardt E (2015) Top-down induction of similarity measures using similarity clouds. In: Hüllermeier E, Minor M (eds) Case-Based Reasoning Research and Development, Springer International Publishing, Cham, pp 149–164
  • Hadsell et al. (2006) Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), IEEE, pp 1735–1742
  • He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
  • Hoffer and Ailon (2015) Hoffer E, Ailon N (2015) Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition, Springer, pp 84–92
  • Hüllermeier and Cheng (2013) Hüllermeier E, Cheng W (2013) Preference-based cbr: General ideas and basic principles. In: IJCAI, pp 3012–3016
  • Hüllermeier and Schlegel (2011) Hüllermeier E, Schlegel P (2011) Preference-based cbr: First steps toward a methodological framework. In: International Conference on Case-Based Reasoning, Springer, pp 77–91
  • Koch et al. (2015) Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol 2

  • Lake et al. (2015) Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338
  • Langseth et al. (1999) Langseth H, Aamodt A, Winnem OM (1999) Learning retrieval knowledge from data. In: Sixteenth International Joint Conference on Artificial Intelligence, Workshop ML-5: Automating the Construction of Case-Based Reasoners. Stockholm, Citeseer, pp 77–82
  • Leake (1996) Leake DB (1996) Case-Based Reasoning: Experiences, lessons and future directions. MIT press
  • Lefebvre and Garcia (2013) Lefebvre G, Garcia C (2013) Learning a bag of features based nonlinear metric for facial similarity. In: Advanced Video and Signal Based Surveillance (AVSS), 2013 10th IEEE International Conference on, IEEE, pp 238–243
  • Maggini et al. (2012) Maggini M, Melacci S, Sarti L (2012) Learning from pairwise constraints by similarity neural networks. Neural Networks 26:141–158
  • Martin et al. (2017) Martin K, Wiratunga N, Sani S, Massie S, Clos J (2017) A convolutional siamese network for developing similarity knowledge in the selfback dataset. In: Proceedings of the International Conference on Case-Based Reasoning Workshops, CEUR Workshop Proceedings, ICCBR (Organisers), p 85–94
  • Nikpour et al. (2018) Nikpour H, Aamodt A, Bach K (2018) Bayesian-supported retrieval in bncreek: A knowledge-intensive case-based reasoning system. In: International Conference on Case-Based Reasoning, Springer, pp 323–338
  • Reategui et al. (1997) Reategui EB, Campbell JA, Leao BF (1997) Combining a neural network with case-based reasoning in a diagnostic system. Artificial Intelligence in Medicine 9(1):5–27
  • Riedmiller and Braun (1993) Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: The rprop algorithm. In: Neural Networks, 1993., IEEE International Conference on, IEEE, pp 586–591
  • Shawe-Taylor (1993) Shawe-Taylor J (1993) Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks 4(5):816–826
  • Stahl (2001) Stahl A (2001) Learning feature weights from case order feedback. In: International Conference on Case-Based Reasoning, Springer, pp 502–516
  • Stahl and Gabel (2003) Stahl A, Gabel T (2003) Using evolution programs to learn local similarity measures. In: International Conference on Case-Based Reasoning, pp 537–551
  • Stahl and Gabel (2006) Stahl A, Gabel T (2006) Optimizing similarity assessment in case-based reasoning. In: Proceedings of the National Conference on Artificial Intelligence, Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, vol 21, p 1667
  • Tversky (1977) Tversky A (1977) Features of similarity. Psychological review 84(4):327
  • Vinyals et al. (2016) Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. (2016) Matching networks for one shot learning. In: Advances in Neural Information Processing Systems, pp 3630–3638
  • Wienhofen and Mathisen (2016) Wienhofen LWM, Mathisen BM (2016) Defining the Initial Case-Base for a CBR Operator Support System in Digital Finishing, Springer International Publishing, Cham, pp 430–444. DOI 10.1007/978-3-319-47096-2_29, URL https://doi.org/10.1007/978-3-319-47096-2_29
  • Zagoruyko and Komodakis (2015) Zagoruyko S, Komodakis N (2015) Learning to compare image patches via convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4353–4361