1 Introduction
Many artificial intelligence and machine learning (ML) methods, such as k-nearest neighbors (kNN), rely on a similarity (or distance) measure between data points (Maggini et al., 2012). In case-based reasoning (CBR), a simple kNN or a more complex similarity function is used to retrieve the stored cases that are most similar to the current query case. The similarity measure used in CBR systems for this purpose is typically built as a weighted Euclidean similarity measure (or as a weight matrix for discrete and symbolic variables). Such a similarity measure is designed with the assistance of domain experts, who adjust the weight of each case attribute to represent how important it is (one example can be seen in Wienhofen and Mathisen (2016); a general description is given in chapter 4 of Bergmann (2002)).
In many situations the design of such a function is nontrivial. Domain experts with an understanding of CBR or machine learning are not easily available. However, before or during most CBR projects, data is gathered that relates to the problem being solved by the CBR system. This data is used to construct cases for populating the case base. If the data is labeled according to the solution/class, it can be used to learn a similarity measure that is relevant to the task being solved by the system. Exploring efficient methods of learning similarity measures and improving on them is the main motivation of this work.
In the CBR literature, similarity measurement is often described in terms of problem and solution spaces. The problem space is where the features of a problem describe the problem; this is often called feature space in the non-CBR ML literature. The solution space, also referred to as target space, is populated by points describing solutions to points in the problem space. Learning the function that maps a point in the problem space to its corresponding point in the solution space is typically the goal of supervised machine learning. This is illustrated in Figure 1.
A similarity measure in the problem space represents an approximation of the similarity between two cases or data points in the solution space (i.e. whether these two cases have similar or dissimilar solutions). Such a similarity measure would be of great help in situations where lots of labeled data is available, but domain knowledge is not available, or when the modeling of such a similarity measure is too complex.
Learned similarity measures can also be used in other settings, such as clustering. Another relevant setting is semi-supervised learning, in which the labeled part of a dataset is used to cluster or label the unlabeled part.
How to automatically learn similarity measures has been an active area of research in CBR. For instance, Gabel and Godehardt (2015) train a similarity measure by creating a dataset of collated pairs of data points and their respective similarities. This dataset is then used to train a neural network to represent the similarity measure. In this method the network needs to extract the features most important for similarity from both data points, then combine these features to output a similarity measure. Recent work (e.g. Martin et al. (2017)) has used Siamese neural networks (SNNs) (Bromley et al., 1994) to learn a similarity measure in CBR. SNNs have the advantage of sharing weights between two parts of the network, in this case the two parts that extract the useful information from the two data points being compared. All of these methods for learning similarity measures have in common that they are trained to compare two data points and return a similarity measurement. Our work on automatically learning similarity measures is also related to the work by Hüllermeier et al. on preference-based CBR (Hüllermeier and Schlegel, 2011; Hüllermeier and Cheng, 2013). In that work the authors learn a preference of similarity between cases/data points, which represents a more continuous space between solutions than a typical similarity measure in CBR. This approach is similar to learning similarity measures with machine learning models, in that both can always return a list of possible solutions sorted by their similarity.
In this work we have developed a framework to show the main differences between various types of similarity measures. Using this framework, we highlight the differences between existing approaches in Section 3. This analysis also reveals areas that have not received much attention in the research community so far. Based on this we developed two novel designs for using machine learning to learn similarity measures from data. Both designs are continuous in their representation of the estimated solution space.
The novelty of our work is threefold. First, we show that using a classifier as the basis for a similarity measure gives adequate performance. Second, we demonstrate that a similarity measure designed to use as little modeling as possible, while keeping training time low, outperforms state-of-the-art methods. Finally, to analyze the state of the art and compare it to our new similarity measure design, we introduce a simple mathematical framework and show that it is a useful tool for analyzing and categorizing similarity measures.
The remainder of this paper describes our method in more detail. Section 2 describes the novel framework for similarity measurement learning, and Section 3 then summarizes previous relevant work in relation to this framework. In Section 4 we describe suggestions of new similarity measures, and how we design the experimental evaluation. Subsequently, in Section 5 we show the results of this evaluation. Finally, in Section 6 we interpret and discuss the evaluation results and give some conclusions. We present some of the limitations of our work as well as possible future paths of research.
2 A framework for similarity measures
We suggest a framework for analyzing different functions for similarity, with $S$ as a similarity measure applied to pairs of data points $(x, y)$:
$$S(x, y) = C(G(x), G(y)) \qquad (1)$$
where $G(x) = \hat{x}$ and $G(y) = \hat{y}$ represent embedding or information extraction from the data points $x$ and $y$, i.e. $G$ highlights the parts of the data points most useful for calculating the similarity between them. An illustration of this process can be seen in Figure 2.
$C(\hat{x}, \hat{y})$ models the distance between the two data points based on the embeddings $\hat{x}$ and $\hat{y}$. The functions $C$ and $G$ can be either manually modeled or learned from data. With respect to this we will enumerate all of the different configurations of Equation 1 and describe their main properties and how they have been implemented in state-of-the-art research. Note that we will use $S$ to denote the similarity measure and $C$ for the subpart of the similarity measure that calculates the distance between the two outputs of $G$. $C$ is distinct from $S$ unless $G$ is the identity function, $G(x) = I(x) = x$.
            | $C$ modeled | $C$ learned
$G$ modeled | Type 1      | Type 2
$G$ learned | Type 3      | Type 4
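The composition of an embedding function and a binary operator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `make_similarity`, `identity`, `scale_features` and `inverse_euclidean`, and the particular choice of an inverse Euclidean distance, are hypothetical and not taken from the paper.

```python
import math


def make_similarity(G, C):
    """Compose a similarity measure S(x, y) = C(G(x), G(y))."""
    def S(x, y):
        return C(G(x), G(y))
    return S


def identity(x):
    # G as the identity function: the data point is its own embedding.
    return list(x)


def scale_features(weights):
    # Hypothetical *modeled* G: hand-set importance per feature.
    def G(x):
        return [w * xi for w, xi in zip(weights, x)]
    return G


def inverse_euclidean(a, b):
    # Hypothetical *modeled* C: maps Euclidean distance into (0, 1].
    distance = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return 1.0 / (1.0 + distance)


# Both measures below are fully modeled; swapping G or C for a trained
# model would give one of the other three configurations.
S_plain = make_similarity(identity, inverse_euclidean)
S_weighted = make_similarity(scale_features([1.0, 0.1]), inverse_euclidean)

print(S_plain([0.0, 0.0], [3.0, 4.0]))
print(S_weighted([0.0, 0.0], [3.0, 4.0]))
```

Keeping the two functions as separate arguments makes it explicit which part of a measure is modeled and which part is learned.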
In the following we characterize the different types of similarity measures:
 Type 1

A typical similarity measure in CBR systems would model both the embedding function $G$ and the binary operator $C$ from domain knowledge. Such a similarity measure is typically modeled by experts with the relevant domain knowledge together with CBR experts, who know how to encode this domain knowledge into the similarity measures.
Consider, for example, modeling the similarity of cars for sale, where the goal is to model the similarity of cars in terms of their final selling price. In this example, domain experts may model the embedding function $G$ so that the number of miles driven has greater importance than the color of the car. $C$ could be modeled such that a difference in miles driven is less important than a difference in the number of repairs done on the car. More details and examples can be found in Cunningham (2009).
 Type 2

This type represents similarity measures that model $G$ and learn the function $C$. In this context $G$ can be viewed as an embedding function. Since $G$ is not learned from the data, it is not interesting to analyze it as part of learning the similarity measure, as processing the data through $G$ could be done in batch before applying the data to $C$. Learning $C$ needs to be done with a dataset consisting of triplets $(x, y, s)$ of the data points $x$ and $y$, with $s$ being the true similarity between $x$ and $y$.
A special case of Type 2 is when $G$ is set to be the identity function, $G(x) = I(x) = x$, while $C$ is learned from data. An example of this type is presented by Gabel and Godehardt (2015), where the similarity measure always looks at the two inputs together, never separately.
 Type 3

In this type of similarity measure the embedding/feature extraction $G$ is learned and $C$ is modeled. Typically the embedding function learned by $G$ resembles the function that is the goal of supervised machine learning. Within the similarity measure, $G(x)$ is used as an embedding vector for calculating similarity, whereas in classification $G(x)$ would be the softmax vector output. Using a pretrained classification model as a starting point for $G$, as input to a modeled $C$, should give good results for similarity measurements if that model had high precision for classification within the same dataset. However, it is not given that the best embedding vector for calculating similarity is the same as the embedding vector produced by a $G$ trained to do classification.
 Type 4

This measure is designed so that both $G$ and $C$ are learned.
We will design, implement and evaluate similarity measures of Type 1, Type 2, Type 3 and Type 4 in Section 4. The results will be shown in Section 5.
To be usable as a similarity measure for clustering or for methods such as k-nearest neighbors, a similarity measure should fulfill the following requirements:
 Symmetry

The similarity between $x$ and $y$ should be the same as the similarity between $y$ and $x$: $S(x, y) = S(y, x)$.
 Nonnegative

The similarity between two data points cannot be negative: $S(x, y) \geq 0$.
 Identity

The similarity between two data points should be 1 iff $x$ is equal to $y$: $S(x, y) = 1 \Leftrightarrow x = y$.
Some of these requirements are not satisfied by all types of similarity measures; e.g. symmetry is not a direct design consequence of Type 2, but it is of Type 3 if $C$ is symmetric. Even though symmetry is not present in all similarity measures (Tversky, 1977), it is important for reducing training time, as the training set size goes from $N^2$ to $\frac{N(N-1)}{2}$ for a dataset of $N$ data points. Symmetry also enables the similarity measure to be used for clustering.
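The three requirements can be checked empirically on a finite sample of data points. The following is a minimal sketch, not a proof of the properties; `check_requirements`, `S_symmetric` and `S_asymmetric` are hypothetical names introduced for illustration. It also counts ordered versus unordered pairs to show why symmetry shrinks the set of pairs that has to be trained on.

```python
from itertools import combinations, product


def check_requirements(S, points, tol=1e-9):
    """Empirically test symmetry, non-negativity and identity on a sample."""
    pairs = list(product(points, repeat=2))
    return {
        "symmetry": all(abs(S(x, y) - S(y, x)) <= tol for x, y in pairs),
        "non_negative": all(S(x, y) >= 0.0 for x, y in pairs),
        # S(x, y) should equal 1 exactly when x and y are the same point.
        "identity": all((abs(S(x, y) - 1.0) <= tol) == (x == y) for x, y in pairs),
    }


def S_symmetric(x, y):
    # Inverse of the city-block distance; symmetric by construction.
    distance = sum(abs(a - b) for a, b in zip(x, y))
    return 1.0 / (1.0 + distance)


def S_asymmetric(x, y):
    # Only penalizes features where x exceeds y, so argument order matters.
    excess = sum(max(a - b, 0.0) for a, b in zip(x, y))
    return 1.0 / (1.0 + excess)


points = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(check_requirements(S_symmetric, points))
print(check_requirements(S_asymmetric, points))

# Ordered pairs (needed without symmetry) versus unordered pairs of
# distinct points (sufficient with symmetry).
ordered_pairs = len(list(product(points, repeat=2)))
unordered_pairs = len(list(combinations(points, 2)))
print(ordered_pairs, unordered_pairs)
```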
In the next section, we will relate current state of the art to the framework in context of the different types.
3 Related work
To exemplify the framework presented in the previous section we will relate previous work to the framework and the types of similarity measurements that derive from the framework. This will also enable us to see possibilities for improvement and further research.
As stated in Section 1, our motivation is to automate the construction of similarity measures. Additionally, we would like to do this while keeping training time as low as possible. Thus we will not focus on Type 1 similarity measures, as this type uses no learning. Both Type 2 and Type 4 require a different type of training dataset than a typical supervised machine learning dataset, as a learned $C$ is typically dependent on the order of the data points (see Section 4). Given our initial motivation, Type 3 similarity measures therefore seem to be the most promising type to focus on. However, it is worth investigating similarity measures of Type 4, to see if the added benefit of learning $C$ outweighs the added training time, or if it is possible to make $C$ symmetric (as defined in the previous section) so that the training time becomes similar to that of Type 3.
Thus we will focus on summarizing related work from Type 3 similarity measures, but also add relevant work from Type 1, Type 2 and Type 4 for reference.
Type 1 is a type of similarity measure which is manually constructed. A general overview and examples of this type of similarity measure can be found in Cunningham (2009). Nikpour et al. (2018) present an alternative method which includes enrichment of the cases/data points via Bayesian networks.
Type 2
In Type 2 similarity measures only the binary operator $C$ of the similarity measure is learned, while $G$ is either modeled or left as the identity function ($G(x) = I(x) = x$). Stahl et al. have done a lot of work on learning Type 2 similarity measures from data or user feedback. In all of their work they formulate $C(x, y) = \sum_i w_i \cdot sim_i(x_i, y_i)$, where for each feature $i$, $sim_i$ is the local similarity measure and $w_i$ is the weight of that feature. Stahl (2001) describes a method for learning the feature weights.
Stahl and Gabel (2003) introduce learning local similarity measures through an evolutionary algorithm (EA). First they learn the attribute weights ($w_i$) by adopting the method previously described in Stahl (2001). Then they use an EA to learn the local similarity measure for each feature ($sim_i$). Stahl and Gabel (2006) present work where they learn the weights of a modeled similarity measure, and the local similarity for each attribute, through an ANN. Reategui et al. (1997) learn and represent parts of the similarity function ($C$) through an ANN. Langseth et al. (1999) learn similarity knowledge ($C$) from data using Bayesian networks, which still partially relies on modeling the Bayesian networks with domain knowledge. Abdel-Aziz et al. (2014) use the distribution of case attribute values to inform a polynomial local similarity function, which is better than guessing when domain knowledge is missing. This method thus extracts statistical properties from the dataset to parametrize $C$.
Gabel and Godehardt (2015) use a neural network to learn a similarity measure. Their work is done in the context of case-based reasoning (CBR), which uses the measure to retrieve similar cases. They concatenate the two data points into one input vector. Thus, in the context of our framework, $G$ is modeled as an identity function and $C$ is learned.
Maggini et al. (2012) use similarity neural networks (SIMNNs), which they also see as a special case of Symmetry Networks (SNs) (Shawe-Taylor, 1993). In SIMNNs the embeddings $\hat{x}$ and $\hat{y}$ are each a function of both data points $x$ and $y$, and there is thus no distinct $G$. They also impose a specialized structure on their network to make sure the learned function is symmetric. A SIMNN is in essence an extended version of a Siamese neural network, but without the distinct distance layer usually present in SNN architectures. They focus on the specific properties of the network architecture and the application of such networks in semi-supervised settings such as k-means clustering. The pair of data points ($x$ and $y$) is compared twice: first at the first hidden layer, then at the output layer. Since there are no learnable parameters before this comparison, all the learning is done in $C$, and $G$ is the activation function of the input layer.
Type 3
One way of looking at a similarity measure is as an inverse distance measure, as similarity is the semantic opposite of distance. There has been much work on learning distance measures. Most of this work can be categorized as Type 3 similarity measures, as the learning task only aims to learn the embedding function $G$ and then combine the output of this function with a static $C$ (e.g. a norm function). The most well-known instance of a Type 3 learned distance measure is the Siamese neural network (SNN), which is closely related to the Type 2 similarity neural networks (SIMNNs) of Maggini et al. (2012).
The main characteristic of SNNs is the sharing of weights between two identical neural networks. The data points whose similarity we want to measure are then input to these networks. This frees the learning algorithm from learning two sets of weights for the same task. This was first used in Bromley et al. (1994) (with a modeled $C$ and $G$ being learned from data) to measure similarity between signatures. Similar architectures are also discussed in Shawe-Taylor (1993).
Chopra et al. (2005) use an SNN for face verification, and pose the problem as an energy-based model. The outputs of the SNN are combined through an $L_1$ norm (absolute-value norm) to calculate the similarity. They emphasize that using an $L_2$ norm (Euclidean distance) as part of the loss function would make the gradient too small for effective gradient descent (i.e. create plateaus in the loss function). This work is closely related to Hadsell et al. (2006), who explain the contrastive loss function used for training the SNN (also used in Chopra et al. (2005); Martin et al. (2017)) by analogy to a spring system.
Related to this, Vinyals et al. (2016) use a similar type of setup for matching an input data point to a support set. It is framed as a discriminative task, where they use two neural networks to parametrize an attention mechanism. They use these two networks to embed the two data points into a feature space where the similarity between them is measured. However, in contrast to SNNs and SIMNNs, their two networks for embedding the data points are not identical, as one network is tailored to embed a data point from the support set while also being given the rest of the support set. Thus the embedding of the support set data point is also a function of the rest of the support set. With $C$ being modeled as a cosine softmax, this is similar to the examples of Type 3 similarity measures mentioned previously (e.g. Bromley et al. (1994); Berlemont et al. (2015)). However, a major difference is that the signal extraction functions are not equal: $S(x, y) = C(G_1(x), G_2(y))$ with $G_1 \neq G_2$ (the authors only state that $G_1$ may potentially equal $G_2$). Since $G_1$ and $G_2$ do not share weights, the architecture is variant (or asymmetric) to the ordering of input pairs. Thus the architecture needs up to twice as much training to achieve the same performance as an SNN.
In much the same fashion as Chopra et al. (2005), Berlemont et al. (2015) use SNNs combined with an energy-based model to build a similarity measure between different gestures made with smartphones. However, they adapt the error estimation from using only separate positive and negative pairs to a training subset that includes a reference sample, a positive sample and a negative sample for every other class. They train $G$ while keeping a static $C$. This method of using triplets for training SNNs was also described by Lefebvre and Garcia (2013). A similar approach can be seen in Hoffer and Ailon (2015); however, they do not use a set of negative examples per reference point for each class as Berlemont et al. do. Instead they use triplets $(x, x^{+}, x^{-})$, with $x$ being the reference point, $x^{+}$ being of the same class and $x^{-}$ being of a different class.
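For reference, the pairwise contrastive loss mentioned above can be sketched as follows, using the commonly cited form from Hadsell et al. (2006): similar pairs are pulled together by a squared distance term, and dissimilar pairs are pushed apart until they are at least a margin away. The function names and the default margin are illustrative assumptions, and the embeddings are taken as plain lists of floats rather than network outputs.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def contrastive_loss(emb_x, emb_y, dissimilar, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    dissimilar=False: the pair should be close, loss is 0.5 * d^2.
    dissimilar=True: the pair should be at least `margin` apart,
    loss is 0.5 * max(0, margin - d)^2.
    """
    d = euclidean(emb_x, emb_y)
    if dissimilar:
        return 0.5 * max(0.0, margin - d) ** 2
    return 0.5 * d ** 2


# A similar pair that is far apart is penalized heavily ...
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], dissimilar=False))
# ... while a dissimilar pair beyond the margin contributes nothing.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], dissimilar=True))
```

The triplet-based variants discussed above replace the single pair with a reference, a positive and one or more negative samples; they are not shown here.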
Koch et al. (2015) use a Convolutional Siamese Network (CSN), with $G$ implemented as a CNN and $C$ implemented as an $L_1$ distance. This is done in a semi-supervised fashion for one-shot learning within image recognition. They train this CSN to determine whether two pictures from the Omniglot dataset (Lake et al., 2015) are within the same class. The model can then be used to classify a data point representing an unseen class by comparing it to a repository of class representatives (the support set).
CSNs are also used in the context of CBR by Martin et al. (2017) to represent a similarity measure in a CBR system. The CSN is trained with pairs of cases and the output is their similarity. During training they have to label pairs of cases as ’genuine’ (both cases belong to the same class) or ’impostor’ (the cases belong to different classes). This requires that the user has a clear boundary for the classes. In relation to our framework this similarity measure learns $G$, while $C$ is static, with $G$ implemented as a convolutional neural network and $C$ implemented as Euclidean distance ($L_2$ norm).
In general, using SNNs for constructing similarity measures has a major advantage, as one can easily adopt pretrained models for $G$ to embed/preprocess the data points. For example, to train a model for comparing two images one could use ResNet (He et al., 2016) for $G$ and then use a norm as $C$. This would be a very similar approach to the similarity measure used by Koch et al. (2015), the main difference being that ResNet is designed for bigger pictures.
There are only very few examples of Type 4 similarity measures in the literature. Zagoruyko and Komodakis (2015) investigate different types of architectures for learning image comparison using convolutional neural networks. In all of the architectures they evaluate, $C$ is learned, but in some of these architectures $G$ is not symmetric, i.e. $S(x, y) = C(G_1(x), G_2(y))$ where $G_1 \neq G_2$. Arandjelović and Zisserman (2017) use a method very similar to many Type 3 similarity measures for calculating similarity. However, their input data is always pairs of two different data types, which distinguishes it from most of the other relevant work and leaves $G$ unsymmetrical, just as in Zagoruyko and Komodakis (2015) and Vinyals et al. (2016). In contrast to the Type 3 similarity measures, including Vinyals et al. (2016), Arandjelović and Zisserman also learn $C$, which they call a fusion layer.
All similarity measures of Type 3 we found in the literature use a loss function that includes feedback from the binary operator part of $S$ (i.e. $C$). In the case of SNNs, even if $C$ is non-symmetric ($C(\hat{x}, \hat{y}) \neq C(\hat{y}, \hat{x})$), the loss for each network would be equal, as the networks are equal and share weights. That means that the ordering of the two data points being compared during training has no effect, i.e. the training effect of $(x, y)$ is equal to that of $(y, x)$. This saves a lot of time during training, as the training dataset could be halved without any negative effect on performance.
However, the implementation of $C$ would then decide how much training one would need to adapt a pretrained model from classifying single data points to measuring similarity between them. One could view the process of starting with a pretrained model for the dataset, then training the model with the loss coming from $C$, as adapting the model from classification to similarity measurement.
One way of creating a Type 3 similarity measure using a minimal amount of training would be to pretrain a network on classifying individual data points, then apply that network as the $G$ that feeds into a $C$ in a similarity measure. Evaluation of such a similarity measure has not been reported in the literature, and such a measure will be explored in the next section.
4 Method
The framework presented in Section 2 and the subsequent analysis of previous relevant work presented in Section 3 show that there are unexplored opportunities within research on similarity measures.
Given the initial motivation, we seek methods that work well in domains where acquiring domain knowledge is very resource demanding. This requires that as much as possible of the similarity measure is learned from data rather than modeled from domain knowledge. There are some exceptions to this, such as applying general binary operations, e.g. norms (the $L_1$ or $L_2$ norm), to the two data points ($x$ and $y$) preprocessed by $G$. In these cases there is little domain expertise involved in designing $C$, other than the intuition that the similarity of two data points is closely related to the norm between $\hat{x}$ and $\hat{y}$.
The most promising types of similarity measures from this perspective are Type 3 and Type 4, where $G$ is learned in Type 3 and both $G$ and $C$ are learned in Type 4. However, to test any new design we need reference methods to compare against. For reference, we chose to implement one Type 1 similarity measure, two similarity measures of Type 2 (including that of Gabel and Godehardt (2015)) and the Type 3 similarity measure of Chopra et al. (2005). The Type 1 similarity measure weights each feature uniformly. The first Type 2 similarity measure is identical to the Type 1 similarity measure except that it uses a local similarity function for each feature which is parametrized by statistical properties of the values of that feature in the dataset.
One unexplored direction is creating an SNN similarity measure (Type 3) by training $G$ as a classifier on the dataset that is later used for measuring similarity, then using that trained $G$ to construct an SNN similarity measure. This is in contrast to the usual way of training SNNs (as seen in e.g. Chopra et al. (2005); Bromley et al. (1994)), where the loss function is a function of pairs of data points, not single data points. The motivation for exploring this type of design is that it shows the similarity measuring performance of using networks pretrained on classifying data points directly as part of an SNN similarity measure. This will be detailed in Subsection 4.2.
Finally, we will explore Type 4 similarity measures, which have seen little attention in research so far. To make our design as symmetric as possible we will use the same design as SNNs for $G$ and introduce a novel design to also make $C$ symmetric. That way our design is fully symmetric (invariant to the ordering of the input pair) and thus training becomes much more efficient. All of the details of this design will be shown in Subsection 4.3. Both of our proposed similarity methods implement $G$ as neural networks. The Type 4 design implements $C$ as a combination of a static binary function and a neural network.
4.1 Reference similarity measures
As a reference for our own similarity measures we implemented several reference similarity measures of Type 1, Type 2 and Type 3. First we implemented a standard uniformly weighted global similarity measure ($t_{1,1}$), which can be defined as:
$$t_{1,1}(x, y) = \sum_{i=1}^{M} w_i \cdot sim_i(x_i, y_i) \qquad (2)$$
where $sim_i$ denotes the local similarity of the $i$-th of $M$ attributes and $w_i$ the weight of that attribute. In $t_{1,1}$ all weights and local similarity measures are uniformly distributed, and not parametrized by the data.
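A weighted global similarity built from per-attribute local similarities can be sketched as below. The linear local similarity and the explicit `ranges` argument are assumptions made for illustration only; the text does not specify the local measures beyond them being uniform, and `local_similarity` and `global_similarity` are hypothetical names.

```python
def local_similarity(a, b, value_range):
    # Hypothetical local measure: linear fall-off over the attribute's range.
    if value_range == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / value_range)


def global_similarity(x, y, ranges, weights=None):
    """Weighted sum of local similarities over all attributes."""
    m = len(x)
    if weights is None:
        # Uniform weighting: every attribute is equally important.
        weights = [1.0 / m] * m
    return sum(
        w * local_similarity(a, b, r)
        for w, a, b, r in zip(weights, x, y, ranges)
    )


x = [10.0, 2.0]
y = [20.0, 2.0]
ranges = [100.0, 5.0]
print(global_similarity(x, y, ranges))                      # uniform weights
print(global_similarity(x, y, ranges, weights=[0.9, 0.1]))  # expert-set weights
```

Passing explicit weights corresponds to the expert-modeled case, while learning them from data would move the measure towards Type 2.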
We extended this with a Type 2 similarity measure, which is based on the work of Abdel-Aziz et al. (2014), where the local similarity measures are parametrized by the data from the corresponding features.
Furthermore, we implemented a Type 2 similarity measure as described by Gabel and Godehardt (2015). The architecture of this measure can be seen in Figure 3.
Lastly we implemented the Type 3 similarity measure described by Chopra et al. We did not implement the extension done to the contrastive loss function as seen in Berlemont et al. (2015); Lefebvre and Garcia (2013) as the change in the training dataset would be too big. This change would make comparisons between the methods harder to justify. Also none of these works showed any comparisons with previous SNNs in terms of any increased performance in relation to regular contrastive loss.
4.2 Type 3 similarity measure
In this subsection we detail how we model the Type 3 similarity measure, which uses an embedding function $G$ trained as a classifier. This embedding function maps the input point $x$ to an embedding space (see Figure 2) whose dimensions represent the probabilities of $x$ belonging to each class. We then model the similarity measure between two points as a static function $C$ between their two respective embeddings:
$$S(x, y) = C(G(x), G(y)) \qquad (3)$$
where $G(x)$ outputs the modeled solution as a vector with one dimension per class (the feature vector output from the network to the softmax function) for the case, based on the problem attributes of data point $x$. This means that if $G$ evaluates the two cases as very similar in terms of classification, so that $G(x) \approx G(y)$, then $S(x, y) \approx 1$. This architecture is also illustrated in Figure 4.
Following the model for the similarity measure, we define the loss estimate as the log-loss between $G(x)$ and $t_x$, where $t_x$ is the true classification softmax vector and $G(x)$ is the class probability vector output from $G$. Notice that the error estimate of $G$ does not depend on the output of $C$.
A dataset of size $N$ would then be defined as:
$$T = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\} \qquad (4)$$
where $x_i$ is the problem part of the $i$-th data point and $t_i$ is the solution/target part of the $i$-th data point.
If the relation between the problem part of the data point ($x_i$) and the solution part of the data point ($t_i$) is complex, the network architecture needs to be able to represent that complexity and any generalizations of patterns in it.
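The idea of reusing a classifier as the embedding function can be sketched as follows. This is a toy illustration under several assumptions: the classifier is a fixed linear softmax model with hand-set weights standing in for a trained network, the embedding is taken to be the softmax output, and the static combination is chosen as one minus half the absolute difference between the two probability vectors. None of these specific choices are stated in the text, and all names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def make_classifier(W, b):
    # Stand-in for a trained classifier: one linear layer plus softmax.
    def G(x):
        logits = [sum(w * xi for w, xi in zip(row, x)) + bi
                  for row, bi in zip(W, b)]
        return softmax(logits)
    return G


def log_loss(probs, target, eps=1e-12):
    # Training signal for G: depends on a single data point and its label,
    # never on the output of the similarity combination below.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))


def C_static(a, b):
    # Assumed static combination: lies in [0, 1] for probability vectors.
    return 1.0 - 0.5 * sum(abs(ai - bi) for ai, bi in zip(a, b))


def similarity(G, x, y):
    return C_static(G(x), G(y))


G = make_classifier(W=[[2.0, 0.0], [0.0, 2.0]], b=[0.0, 0.0])
x, y, z = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(similarity(G, x, y))  # both points are classified alike
print(similarity(G, x, z))  # points classified differently
```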
4.3 Type 4 similarity measure
As previously explained, Type 4 similarity measures are currently the least explored type of similarity measure. It is also the type that requires the least amount of modeling. In principle a Type 4 similarity measure learns two things: $G$ learns a useful embedding, where the most useful parts of $x$ and $y$ are encoded into $\hat{x}$ and $\hat{y}$, and $C$ learns how to combine those embeddings to calculate the similarity of the original $x$ and $y$.
In our Type 4 similarity method we use an ANN to represent both $G$ and $C$. This has the advantage that learning $S$ is an end-to-end process: the loss computed after $C$ can be used to compute gradients for both $C$ and $G$. $C$ will learn the binary combination best suited to calculate the similarity of the two embeddings, while $G$ will learn to embed the two data points optimally for calculating their similarity in $C$. In principle any ML method could be used to learn $G$ and $C$, but not all ML methods lend themselves naturally to backpropagating the error signal from $C$ through $G$ and back to the input.
We define our Type 4 similarity method, the Extended Siamese Neural Network (eSNN), as shown in Figure 5.
Given that this similarity method outputs similarity and the loss function is a function of the input, we get a new general loss function for similarity, defined per datapoint as follows:
(5) 
where $s$ is the true similarity of cases $x$ and $y$. Since this loss function depends on pairs of data points and the true similarity between them, we need to create a new dataset based on the original dataset. This new dataset consists of triplets of two different data points from the original dataset and the true similarity of these two data points:
$$T_S = \{(x_i, x_j, s_{ij})\} \qquad (6)$$
where $s_{ij}$ is $1$ if $x_i$ and $x_j$ belong to the same class and $0$ otherwise.
It is worth mentioning that this dataset must be of size $N^2$ for the similarity measure to train on all possible combinations of the data points. Certain similarity measure architectures (e.g. those of Gabel and Godehardt (2015) or Zagoruyko and Komodakis (2015)) need to train on a dataset containing all possible combinations of data points (of size $N^2$), as training on the triplet $(x, y, s)$ does not guarantee that the model learns that $S(x, y) = S(y, x)$. Thus the training dataset must also include the triplet $(y, x, s)$. However, this may be largely avoided by using architectures (such as those seen in SNNs and SNs) that exploit symmetry and weight sharing. To achieve this we modeled $C$ as an ANN where the first layer is an absolute difference operator on two vectors, $\hat{z} = |\hat{x} - \hat{y}|$, where $\hat{z}$ is the element-wise absolute difference between $\hat{x}$ and $\hat{y}$. The rest of $C$ consists of hidden ANN layers that operate on $\hat{z}$. This way $C$ becomes invariant to the ordering of its inputs. Consequently the model only needs to train on order-invariant unique pairs of data points, reducing the training set size from $N^2$ to $\frac{N(N-1)}{2}$. The resulting architecture of the eSNN can be seen in Figure 5.
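The effect of placing an element-wise absolute difference in front of the learned layers can be sketched as follows. The tiny network below uses fixed, untrained weights purely to show the order invariance, and `pair_dataset` builds the reduced set of unordered pairs with same-class labels. All names, weights and the sigmoid output are illustrative assumptions rather than the architecture used in the paper.

```python
import math
from itertools import combinations


def abs_diff(a, b):
    # First layer: element-wise absolute difference, symmetric in a and b.
    return [abs(ai - bi) for ai, bi in zip(a, b)]


def make_C(hidden_weights, out_weights):
    # One hidden ReLU layer and a sigmoid output operating on |a - b|.
    def C(a, b):
        z = abs_diff(a, b)
        hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z)))
                  for row in hidden_weights]
        out = sum(w * h for w, h in zip(out_weights, hidden))
        return 1.0 / (1.0 + math.exp(-out))
    return C


def pair_dataset(points, labels):
    # Unordered pairs of distinct points with a 1/0 same-class target.
    return [
        (points[i], points[j], 1.0 if labels[i] == labels[j] else 0.0)
        for i, j in combinations(range(len(points)), 2)
    ]


C = make_C(hidden_weights=[[1.0, -0.5], [0.5, 1.0]], out_weights=[-1.0, -1.0])
a, b = [0.2, 0.9], [0.7, 0.1]
print(C(a, b), C(b, a))  # identical, whatever the weights are

points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
labels = ["a", "a", "b", "b"]
pairs = pair_dataset(points, labels)
print(len(pairs))  # 4 * 3 / 2 unordered pairs instead of 16 ordered ones
```

Because the asymmetry is removed before any learnable parameter is applied, no constraint on the weights themselves is needed to guarantee order invariance.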
In Subsection 4.2 we argue why a $G$ trained to correctly classify its input is a good embedding function for calculating similarity. As a result we added two loss signals to $G$ during training. These loss signals are calculated from the difference between the embedding of the data point produced by $G$ and the correct softmax classification vector.
This also introduced an opportunity for exploring the relative importance of the embedding function and the binary similarity function in terms of the performance of the similarity measure. This could be done by weighting the three different loss signals (, and similarity as shown in Figure 5) during training and measuring the effect of that weighting on the performance. We define our weighted loss function as such:
(7) 
where is defined in Equation 5, is the true label of data point , is the true label of data point and is the categorical cross entropy loss between two softmax vectors. We use this formula and tested with different 100 different values of in the range to find the weighting scheme best for performance. The results can be seen in Figure 6.
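One possible way to weight the two classification losses against the similarity loss is sketched below. This is an assumption made for illustration, not the paper's definition: the three signals are combined as a convex combination controlled by a single hypothetical parameter `alpha` in [0, 1], the similarity loss is taken to be an absolute error, and all function names are invented for this example.

```python
import math


def cross_entropy(predicted, target, eps=1e-12):
    # Categorical cross entropy between a predicted and a true softmax vector.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, target))


def similarity_loss(predicted_similarity, true_similarity):
    # Assumed form: absolute error between predicted and true similarity.
    return abs(true_similarity - predicted_similarity)


def weighted_loss(emb_x, emb_y, label_x, label_y,
                  predicted_similarity, true_similarity, alpha):
    """Assumed convex combination of classification and similarity losses."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    classification = 0.5 * (cross_entropy(emb_x, label_x)
                            + cross_entropy(emb_y, label_y))
    similarity = similarity_loss(predicted_similarity, true_similarity)
    return (1.0 - alpha) * classification + alpha * similarity


# A grid of 100 candidate weightings to evaluate, from 0 to 1 inclusive.
alphas = [i / 99 for i in range(100)]

emb_x, emb_y = [0.8, 0.2], [0.3, 0.7]
label_x, label_y = [1.0, 0.0], [0.0, 1.0]
for alpha in (alphas[0], alphas[49], alphas[-1]):
    print(alpha, weighted_loss(emb_x, emb_y, label_x, label_y, 0.4, 0.0, alpha))
```

With `alpha = 0` only the embedding function receives a training signal from the labels, and with `alpha = 1` only the pairwise similarity error is used.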
4.4 Network parameters
For all similarity measures tested using ANNs and for all datasets except MNIST, the embedding function and the binary similarity function were each implemented with two hidden layers. This was done to replicate the network parameters used by Gabel et al. to ensure we had comparable results. For the MNIST dataset test, both functions used three hidden layers.
Other than the network architecture, we also had to choose which optimizer to use for learning the ANN model. We chose RProp (Riedmiller and Braun (1993)) to be more comparable with the results from Gabel et al., who also used the RProp optimizer. Our tests, seen in Figure 7, show that RProp outperforms the other optimizers tested (ADAM and RMSProp). This is consistent with the results reported by Florescu and Igel (2018). This should hold true until the runtime performance of RProp degrades with dataset size, as RProp uses full batch sizes.
4.5 Evaluation protocol and implementation
The different similarity measures presented earlier in this section require different training datasets. The reference Type 1 similarity measures require no training, and two further measures do not require a similarity training set consisting of triplets as described in Equation 6. All other similarity measures evaluated were trained using identical training datasets. As a result, these similarity measures were all trained on a dataset consisting of all possible combinations of data points (as explained in Section 4.3), as this is required by the measures that are not invariant to input ordering. However, results highlighting the differences in training performance when using the different training datasets can be seen in Figure 13.
The results reported in the next section are all from 5-fold stratified cross-validation repeated 5 times for robustness. The performance reported is an evaluation of each similarity measure using the part of the dataset (the validation partition) that was not used for training. Using the similarity measure being evaluated, we computed the similarity between every data point in the validation partition and every data point in the training partition. For each validation data point we find the data point in the training set with the highest similarity. If that training data point has the same class as the validation data point, the retrieval is scored as correct; if not, it is scored as incorrect.
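The retrieval-based scoring described above can be sketched as follows. This is a simplified illustration under assumptions: the numeric scores are not preserved in the text, so the sketch counts a class mismatch as one error and reports the error fraction; `toy_similarity` is a hypothetical stand-in for a learned measure; and the cross-validation splitting is omitted.

```python
def retrieval_loss(similarity, train_x, train_y, val_x, val_y):
    """Fraction of validation points whose most similar training point has another class."""
    errors = 0
    for query, true_label in zip(val_x, val_y):
        # Index of the training point with the highest similarity to the query.
        best = max(range(len(train_x)), key=lambda i: similarity(query, train_x[i]))
        if train_y[best] != true_label:
            errors += 1
    return errors / len(val_x)


def toy_similarity(a, b):
    """Stand-in similarity: decreases with the summed absolute feature difference."""
    return 1.0 / (1.0 + sum(abs(p - q) for p, q in zip(a, b)))


train_x = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]]
train_y = ["a", "a", "b", "b"]
val_x = [[0.05, 0.1], [0.95, 0.9], [0.4, 0.4]]
val_y = ["a", "b", "b"]

loss = retrieval_loss(toy_similarity, train_x, train_y, val_x, val_y)
print(round(loss, 3))
```

In this toy data the third validation point is retrieved with the wrong class, so one of three retrievals fails.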
The implementation was done in Keras with TensorFlow as backend (code available at the NTNU Open AI Lab GitHub page: https://github.com/ntnuailab). The methods were measured on 14 different datasets available from the UCI machine learning repository (Dheeru and Karra Taniskidou (2017)). Results were recorded after 200 epochs and 2000 epochs (the latter number to be consistent with Gabel and Godehardt (2015)) to reveal how fast the different methods were achieving their performance.
5 Experimental evaluation
To calculate the performance of our similarity measure we chose to use the same method of evaluation as Gabel and Godehardt (2015) to make the similarity metrics more easily comparable. In addition, this evaluation method of using publicly available datasets from the UCI machine learning repository (Dheeru and Karra Taniskidou (2017)) makes the results easy to reproduce. We selected a subset of the original 19 datasets, choosing not to use regression datasets, resulting in a set of 14 classification datasets. The numerical features of the datasets were all normalized, and categorical features were replaced by one-hot vectors.
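The preprocessing step can be sketched as below. The text states only that numerical features were normalized, so min-max scaling is an assumption here, and the helper names and example columns are hypothetical.

```python
def min_max_normalize(column):
    """Scale a numerical column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(column):
    """Replace each categorical value by a one-hot vector over the sorted categories."""
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]


ages = [20, 30, 40]               # hypothetical numerical feature
colors = ["red", "blue", "red"]   # hypothetical categorical feature

norm = min_max_normalize(ages)
encoded = one_hot(colors)
# Concatenate the processed features into one input vector per data point.
rows = [[n] + e for n, e in zip(norm, encoded)]
```

Each data point then becomes a single numeric vector that can be fed to the embedding function.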
The validation losses from evaluating the similarity measures on the 14 datasets are shown in Figures 8 and 9. Figure 8 shows the results after training for 200 epochs, while Figure 9 shows the results after 2000 epochs. This has been done to illustrate how the differences between the similarity measures develop during training. In addition, the 200 and 2000 epoch runs are independent runs (i.e. Figure 9 does not show the same models as Figure 8 after further training).
The numbers that are the basis of these figures are also reported in Table 2 for 200 epochs and Table 3 for 2000 epochs. The tables are highlighted to show the best result per dataset. In some cases the difference between two methods for one dataset was smaller than the standard deviation, so more than one result is highlighted.
Finally, to illustrate that the proposed measure scales to larger datasets, we report results from the MNIST dataset in Figure 10. The MNIST results are not validation results, as calculating the similarity between all the data points in the test set and the training set (as per the evaluation protocol described in Section 4.5) was too resource demanding.
Table 2: Validation loss per dataset after 200 epochs.
Dataset  Type 4 (proposed)  Type 3 (Chopra et al.)  Gabel et al.  Type 3 (other)  Type 1 (t_{1,1})  Type 2
bal  0.01  0.00  0.14  0.10  0.42  0.81
car  0.04  0.02  0.19  0.16  0.25  0.25
cmc  0.52  0.53  0.54  0.55  0.54  0.58
eco  0.22  0.20  0.46  0.35  0.21  0.22
glass  0.08  0.08  0.12  0.10  0.06  0.07
hay  0.19  0.21  0.26  0.17  0.33  0.37
heart  0.21  0.24  0.28  0.24  0.24  0.23
iris  0.04  0.03  0.18  0.07  0.05  0.04
mam  0.21  0.25  0.26  0.27  0.28  0.29
mon  0.28  0.33  0.39  0.45  0.29  0.29
pim  0.28  0.30  0.35  0.35  0.31  0.32
ttt  0.03  0.03  0.17  0.07  0.32  0.07
use  0.07  0.08  0.08  0.39  0.21  0.18
who  0.29  0.45  0.33  0.45  0.46  0.45
Sum  2.47  2.75  3.75  3.72  3.97  4.17
Average  0.18  0.20  0.27  0.27  0.28  0.30
Table 2 shows the validation losses of the different similarity measures on the different datasets. Our proposed Type 4 similarity measure has a lower validation loss than the second best (Type 3) similarity measure of Chopra et al. (2005), with a summed loss of 2.47 against 2.75. The other Type 3 similarity measure and the measure of Gabel and Godehardt (2015) follow with higher losses (sums of 3.72 and 3.75). The Type 1 similarity measure had a higher loss still (sum of 3.97) but managed to be the best similarity measure for the glass dataset. Finally, the Type 2 similarity measure had the highest loss on average (0.30).
Table 3: Validation loss per dataset after 2000 epochs.
Dataset  Type 4 (proposed)  Type 3 (Chopra et al.)  Gabel et al.  Type 3 (other)  Type 1 (t_{1,1})  Type 2
bal  0.02  0.00  0.08  0.01  0.43  0.83
car  0.01  0.01  0.06  0.02  0.24  0.24
cmc  0.52  0.53  0.54  0.53  0.54  0.58
eco  0.22  0.20  0.22  0.18  0.19  0.21
glass  0.06  0.07  0.08  0.09  0.05  0.06
hay  0.18  0.21  0.20  0.15  0.32  0.34
heart  0.21  0.27  0.23  0.22  0.24  0.23
iris  0.08  0.05  0.07  0.04  0.06  0.05
mam  0.21  0.27  0.25  0.27  0.29  0.28
mon  0.26  0.30  0.33  0.27  0.32  0.32
pim  0.27  0.31  0.25  0.30  0.30  0.31
ttt  0.03  0.03  0.07  0.03  0.32  0.08
use  0.08  0.10  0.07  0.08  0.18  0.16
who  0.30  0.46  0.29  0.43  0.47  0.45
Sum  2.45  2.81  2.74  2.62  3.95  4.14
Average  0.18  0.20  0.20  0.19  0.28  0.30
The results when training for 2000 epochs are quite different from those at 200 epochs, as seen by how much closer the other similarity measures are in Figure 9 than in Figure 8. The proposed Type 4 measure still outperforms all other similarity measures on average, but the second best similarity measure is much closer, with a summed loss of 2.62 against 2.45 in Table 3. The next two measures follow with sums of 2.74 and 2.81, then t_{1,1} with 3.95, and finally the last measure with 4.14.
The gap between the proposed measure and the state of the art is considerable at 200 epochs. This gap shrinks between 200 and 2000 epochs, but is still a considerable difference.
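The size of these gaps can be recomputed from the "Sum" rows of Tables 2 and 3, as in the short sketch below. Because the sums are rounded to two decimals, the resulting percentages are approximations and may differ slightly from figures computed on the unrounded results; the helper name is hypothetical.

```python
def relative_increase(value, reference):
    """Relative increase of value over reference, in percent."""
    return 100.0 * (value - reference) / reference


# "Sum" rows copied from Tables 2 and 3, in the tables' column order.
sum_200 = [2.47, 2.75, 3.75, 3.72, 3.97, 4.17]
sum_2000 = [2.45, 2.81, 2.74, 2.62, 3.95, 4.14]

gaps = {}
for name, sums in (("200 epochs", sum_200), ("2000 epochs", sum_2000)):
    ordered = sorted(sums)
    # Gap between the lowest summed loss and the runner-up.
    gaps[name] = relative_increase(ordered[1], ordered[0])
    print(f"{name}: runner-up is {gaps[name]:.1f}% above the best")
```

On these rounded sums the runner-up lies roughly 11% above the best measure at 200 epochs and roughly 7% above it at 2000 epochs.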
To illustrate the difference in training efficiency between different types of similarity measures, we show the validation loss of three of the measures during training. Specifically, for each epoch we test the loss of each similarity measure by the same method as described in Subsection 4.5. Figure 11 and Figure 12 show the validation loss during training of these measures on the UCI Iris and Mammographic mass datasets (Dheeru and Karra Taniskidou (2017)) respectively. This exemplifies the training performance of these methods in relation to the Iris and Mammographic mass dataset results reported in the tables above. One can also note that, when training on the Mammographic mass dataset (as seen in Fig. 11), the measures never reach the same performance. In contrast, when training on the Iris dataset (as seen in Fig. 12), which is a less complex dataset than the Mammographic mass dataset, they do reach the same performance.
Figure 13 shows the validation loss during training when the SNN measure and the proposed symmetric Type 4 measure use the smaller training dataset and the measure that is not invariant to input ordering uses the larger one. This figure illustrates how many fewer evaluations an SNN similarity measure or a symmetric Type 4 similarity measure needs than a similarity measure that is not invariant to input ordering, while still having excellent relative performance.
Finally, in Figures 14 and 15 we show how the proposed measure can be used for semi-supervised clustering. The figures show PCA and t-SNE clustering of embeddings produced by untrained and trained networks, respectively, from the MNIST dataset. The embeddings are the vector output of the embedding function for each of the data points in the test set. The embeddings shown are computed from a test set that is not used for training. The figures show that the model learns a way to correctly cluster data points that it has not used for training.
6 Conclusions and future work
Section 5 shows that all of the learned similarity measures outperformed the classical similarity measure and also the measure in which the local (per-feature) similarity measures were adapted to the statistical properties of the features (Abdel-Aziz et al. (2014)). In practice one should weight each feature according to how important it is for the similarity measurement. In many situations the number of possible attributes to include in such a function can be overwhelming, and modeling them in the way we did in these two reference measures also overlooks possible covariations between the attributes. Both of these problems can be addressed by using the proposed method to model the similarity using machine learning on a dataset that maps from case problem attributes to case solution attributes.
However, one should be careful to note that all of the learned similarity measures are built on the assumption that similar data points have similar target values (see Figure 2). If this assumption does not hold, learning the similarity measure might be much more difficult.
We have also presented a framework for how to analyze and group different types of similarity measures. We have used this framework to analyze previous work and highlight different strengths and weaknesses of the different types of similarity measures. This also highlighted unexplored types of similarity measures, such as Type 4 similarity measures.
As a result we designed and evaluated a Type 3 similarity measure based on a classifier. The evaluations showed that using a classifier as a basis for a similarity measure achieves results comparable to state-of-the-art methods, while using far fewer training evaluations to achieve that performance.
We then combined strengths from Type 4 and Type 3 similarity measures into a new Type 4 similarity measure, called Extended Siamese Neural Networks, which:

Learns an embedding of the data points using an embedding function in the same way as Type 3 similarity measures, but using shared weights in the same way as SNNs to make the operation symmetrical.

Learns the binary similarity function, thus enabling improved performance in relation to SNNs and other Type 3 similarity measures.

Restricts the binary similarity function to make it invariant to input ordering, thus obtaining end-to-end symmetry through the similarity measure.
Keeping the measure symmetrical end-to-end enables the user of this similarity measure to train on much smaller datasets than required by other types of similarity measures. Type 3 measures based on SNNs also have this advantage, but our results show that the ability to learn the binary similarity function is important for performance in many of the 14 datasets we tested. Our results showed that Extended Siamese Neural Networks outperformed state-of-the-art methods on average over the 14 datasets by a large margin. We also demonstrated that they achieved this performance much faster, given the same dataset, than the current state of the art. In addition, their symmetry enables them to train on datasets that are orders of magnitude smaller. Our case study of clustering the embeddings they produce shows that the model can be used for semi-supervised clustering.
Finally, we demonstrated that the training of this similarity measure scales to large datasets like MNIST. Our main motivation for this work was to automate the construction of similarity measures while keeping training time as low as possible. We have shown that the proposed measure is a step towards this. Our evaluation shows that it can learn similarity measures across a wide variety of datasets. We also show that it scales well in comparison to similar methods and scales to datasets of some size, such as MNIST.
The proposed measure is not only applicable as a similarity measure in a CBR system. It can also be used for semi-supervised clustering: training on labeled data, then using the trained model for clustering unlabeled data. In much the same fashion, it could be used as a matching network, in the same way as the distance measure is applied in Vinyals et al. (2016).
In continuation of this work we would like to explore what is actually encoded by learned similarity measures. This could be done by varying the different features of a query data point and discovering when that data point would change from one class to another (i.e. when the class of the closest other data point changes). This would form a multidimensional boundary for each class. This boundary could be explored to determine what the similarity measure actually encoded during the learning phase.
Another interesting avenue of research would be to apply recurrent neural networks to embed time series into embedding space (see Figure 2) to enable the similarity measure to calculate similarity between time series, which is currently a nontrivial problem.
The architecture of similarity measures still requires more investigation, e.g. is the optimal embedding different from the softmax classification vector used in normal supervised learning? If so, it is worth investigating why it is different.
7 Acknowledgements
We would like to thank the EXPOSED project and NTNU Open AI Lab for the support to do this work. Thanks also to Gunnar Senneset and Hans Vanhauwaert Bjelland for their great support during our work.
References
Abdel-Aziz et al. (2014) Abdel-Aziz A, Strickert M, Hüllermeier E (2014) Learning solution similarity in preference-based CBR. In: International Conference on Case-Based Reasoning, Springer, pp 17–31
Arandjelovic and Zisserman (2017) Arandjelovic R, Zisserman A (2017) Look, listen and learn. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, pp 609–617
Bergmann (2002) Bergmann R (2002) Experience management: foundations, development methodology, and internet-based applications. Springer-Verlag
Berlemont et al. (2015) Berlemont S, Lefebvre G, Duffner S, Garcia C (2015) Siamese neural network based similarity metric for inertial gesture classification and rejection. In: Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, IEEE, vol 1, pp 1–6
Bromley et al. (1994) Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R (1994) Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems, pp 737–744
Chopra et al. (2005) Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, IEEE, vol 1, pp 539–546
Cunningham (2009) Cunningham P (2009) A taxonomy of similarity mechanisms for case-based reasoning. IEEE Transactions on Knowledge and Data Engineering 21(11):1532–1543
Dheeru and Karra Taniskidou (2017) Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Florescu and Igel (2018) Florescu C, Igel C (2018) Resilient backpropagation (Rprop) for batch-learning in TensorFlow. ICLR 2018 workshop, to appear
Gabel and Godehardt (2015) Gabel T, Godehardt E (2015) Top-down induction of similarity measures using similarity clouds. In: Hüllermeier E, Minor M (eds) Case-Based Reasoning Research and Development, Springer International Publishing, Cham, pp 149–164
Hadsell et al. (2006) Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 1735–1742
 He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hoffer and Ailon (2015) Hoffer E, Ailon N (2015) Deep metric learning using triplet network. In: International Workshop on Similarity-Based Pattern Recognition, Springer, pp 84–92
Hüllermeier and Cheng (2013) Hüllermeier E, Cheng W (2013) Preference-based CBR: General ideas and basic principles. In: IJCAI, pp 3012–3016
Hüllermeier and Schlegel (2011) Hüllermeier E, Schlegel P (2011) Preference-based CBR: First steps toward a methodological framework. In: International Conference on Case-Based Reasoning, Springer, pp 77–91
Koch et al. (2015) Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol 2
Lake et al. (2015) Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338
Langseth et al. (1999) Langseth H, Aamodt A, Winnem OM (1999) Learning retrieval knowledge from data. In: Sixteenth International Joint Conference on Artificial Intelligence, Workshop ML-5: Automating the Construction of Case-Based Reasoners. Stockholm, Citeseer, pp 77–82
Leake (1996) Leake DB (1996) Case-Based Reasoning: Experiences, lessons and future directions. MIT Press
Lefebvre and Garcia (2013) Lefebvre G, Garcia C (2013) Learning a bag of features based nonlinear metric for facial similarity. In: Advanced Video and Signal Based Surveillance (AVSS), 2013 10th IEEE International Conference on, IEEE, pp 238–243
 Maggini et al. (2012) Maggini M, Melacci S, Sarti L (2012) Learning from pairwise constraints by similarity neural networks. Neural Networks 26:141–158
Martin et al. (2017) Martin K, Wiratunga N, Sani S, Massie S, Clos J (2017) A convolutional siamese network for developing similarity knowledge in the SelfBACK dataset. In: Proceedings of the International Conference on Case-Based Reasoning Workshops, CEUR Workshop Proceedings, ICCBR (Organisers), pp 85–94
Nikpour et al. (2018) Nikpour H, Aamodt A, Bach K (2018) Bayesian-supported retrieval in BNCreek: A knowledge-intensive case-based reasoning system. In: International Conference on Case-Based Reasoning, Springer, pp 323–338
Reategui et al. (1997) Reategui EB, Campbell JA, Leao BF (1997) Combining a neural network with case-based reasoning in a diagnostic system. Artificial Intelligence in Medicine 9(1):5–27
 Riedmiller and Braun (1993) Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: The rprop algorithm. In: Neural Networks, 1993., IEEE International Conference on, IEEE, pp 586–591
Shawe-Taylor (1993) Shawe-Taylor J (1993) Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks 4(5):816–826
Stahl (2001) Stahl A (2001) Learning feature weights from case order feedback. In: International Conference on Case-Based Reasoning, Springer, pp 502–516
Stahl and Gabel (2003) Stahl A, Gabel T (2003) Using evolution programs to learn local similarity measures. In: International Conference on Case-Based Reasoning, pp 537–551
Stahl and Gabel (2006) Stahl A, Gabel T (2006) Optimizing similarity assessment in case-based reasoning. In: Proceedings of the National Conference on Artificial Intelligence, Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, vol 21, p 1667
 Tversky (1977) Tversky A (1977) Features of similarity. Psychological review 84(4):327
 Vinyals et al. (2016) Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. (2016) Matching networks for one shot learning. In: Advances in Neural Information Processing Systems, pp 3630–3638
Wienhofen and Mathisen (2016) Wienhofen LWM, Mathisen BM (2016) Defining the Initial Case-Base for a CBR Operator Support System in Digital Finishing, Springer International Publishing, Cham, pp 430–444. DOI 10.1007/978-3-319-47096-2_29, URL https://doi.org/10.1007/978-3-319-47096-2_29
 Zagoruyko and Komodakis (2015) Zagoruyko S, Komodakis N (2015) Learning to compare image patches via convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4353–4361