1 Introduction
Some real-world problems do not have an exact algorithmic solution. Currently, a vast number of Artificial Intelligence (AI) methods can be used to solve them. One of the branches of AI is Artificial Neural Networks, mathematical models inspired by biology. In 1982 T. Kohonen presented an architecture called Self-Organising Maps (SOM) [10], which provides a method of feature mapping from a multidimensional space onto a (usually two-dimensional) grid of neurons in an unsupervised way. This way of data analysis has proved to be an efficient tool in many applications, both in academic and industrial solutions [11], for example in character recognition, image recognition, face recognition [2], analysis of words [12], grouping of documents [9], visualisation [5], and even bioinformatics (for example phylogenetic tree reconstruction [4]). A large number of methods for improving SOM performance exist. Some of them concentrate on finding an optimal size of the network [1], faster learning [14], or applying different neighbourhood functions [8]. In this paper we investigate two additional improvements. The first [6], [2] uses the Mahalanobis metric instead of the Euclidean one. The second [13] shows how to use SOM in a supervised manner.
In our approach, contrary to [6], [2], [3], instead of computing the Mahalanobis matrix as the inverse of the covariance matrix, the matrix is learned in a way that assures the smallest distance between points from the same class and large-margin separation of points from different classes. Several algorithms exist for distance metric learning (DML) [17], [7]. In this paper we use the so-called Large Margin Nearest Neighbour (LMNN) method [16]. It formulates the distance metric learning problem as a convex optimization, which assures that the global minimum can be efficiently computed. First, we shortly describe the SOM model used in a supervised manner and the LMNN method. Then we show how we combine these two approaches into our SOM+DML model. Finally, we present results on real data sets.
2 Methods
Let us denote the data set as $D = \{(\mathbf{x}_i, \mathbf{c}_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is the attribute vector of the $i$-th sample and $\mathbf{c}_i \in \{0, 1\}^{K}$ is its class vector, with $c_{ij} = 1$ for $j = y_i$ and $c_{ij} = 0$ for $j \neq y_i$, where $y_i$ is the class number of the $i$-th sample and $K$ is the number of classes.
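To make the notation concrete, the sketch below builds the one-hot class vectors $\mathbf{c}_i$ from integer labels. It is a minimal Python/NumPy illustration; the names (one_hot, labels, C) are ours, not from the original paper.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Build class vectors c_i: c_ij = 1 if j == y_i, else 0."""
    C = np.zeros((len(labels), num_classes))
    C[np.arange(len(labels)), labels] = 1.0
    return C

# Example: three samples from classes 0, 2 and 1 (K = 3 classes).
C = one_hot(np.array([0, 2, 1]), num_classes=3)
```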
2.1 SOM model
In this paper, we used the SOM architecture in a supervised manner, the so-called 'Learning Associations by Self-Organisation' (LASSO), first described in [13]. The main difference between this SOM architecture and the original Kohonen architecture [10] is that during the learning phase the LASSO method takes the class vector into consideration in addition to the attributes. Herein, we used a two-dimensional grid of neurons. Each neuron is represented by a weight vector $\mathbf{w}_{uv} = [\mathbf{w}^{a}_{uv}, \mathbf{w}^{c}_{uv}]$, consisting of a part $\mathbf{w}^{a}_{uv}$ corresponding to the attributes and a part $\mathbf{w}^{c}_{uv}$ corresponding to the class, where $(u, v)$ are the indexes of the neuron in the grid. In the learning phase all samples are shown to the network in one epoch. For each sample we search for the neuron which is closest to the $i$-th sample. The distance is computed by:

$$d(\mathbf{x}_i, \mathbf{c}_i; \mathbf{w}_{uv}) = \|\mathbf{x}_i - \mathbf{w}^{a}_{uv}\|^{2} + \|\mathbf{c}_i - \mathbf{w}^{c}_{uv}\|^{2} \qquad (1)$$

The neuron with the smallest distance to the $i$-th sample is called the Best Matching Unit (BMU); we denote its indexes by $(u^{*}, v^{*})$. Once the BMU is found, the weight update step is executed. The weights of each neuron are updated with the following formulas:
$$\mathbf{w}^{a}_{uv}(t+1) = \mathbf{w}^{a}_{uv}(t) + \alpha(t)\,(\mathbf{x}_i - \mathbf{w}^{a}_{uv}(t)) \qquad (2)$$

$$\mathbf{w}^{c}_{uv}(t+1) = \mathbf{w}^{c}_{uv}(t) + \alpha(t)\,(\mathbf{c}_i - \mathbf{w}^{c}_{uv}(t)) \qquad (3)$$
where $\alpha(t)$ is a learning coefficient, consisting of a learning step size parameter $\mu(t)$ and a neighbourhood function $h$, so $\alpha(t) = \mu(t)\,h$. The learning speed parameter is decreased between consecutive epochs, so that the network's ability to remember patterns is improved. It is described by $\mu(t) = \mu_{0}\exp(-t/\lambda)$, where $\mu_{0}$ is the starting value of the learning speed, $t$ is the current epoch and $\lambda$ is responsible for regulating the speed of the decrease. The neighbourhood function controls the change of the weights with respect to the distance to the BMU. It is given by $h = \exp(-S^{2}/(2\sigma^{2}))$, where $\sigma$ describes the neighbourhood function width and $S$ is the distance in the grid between the neuron and the BMU, computed by the following formula:

$$S = \sqrt{(u - u^{*})^{2} + (v - v^{*})^{2}} \qquad (4)$$
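The following sketch illustrates one learning epoch of the LASSO-style SOM described above, implementing equations (1)-(4). It is a minimal NumPy illustration under our own naming conventions (W_a, W_c for the attribute and class parts of the weights), not the authors' implementation.

```python
import numpy as np

def som_epoch(W_a, W_c, X, C, mu, sigma):
    """One LASSO-SOM epoch. W_a has shape (G, G, d), W_c has shape (G, G, K)."""
    G = W_a.shape[0]
    # Grid coordinates of every neuron, used for the neighbourhood term.
    u, v = np.meshgrid(np.arange(G), np.arange(G), indexing="ij")
    for x, c in zip(X, C):
        # Eq. (1): the distance uses both the attribute and the class parts.
        dist = ((W_a - x) ** 2).sum(axis=2) + ((W_c - c) ** 2).sum(axis=2)
        # The BMU is the neuron with the smallest distance.
        u_star, v_star = np.unravel_index(np.argmin(dist), dist.shape)
        # Eq. (4): distance in the grid between each neuron and the BMU.
        S = np.sqrt((u - u_star) ** 2 + (v - v_star) ** 2)
        # Neighbourhood function h and learning coefficient alpha = mu * h.
        alpha = mu * np.exp(-S ** 2 / (2 * sigma ** 2))
        # Eqs. (2) and (3): move the weights towards the sample.
        W_a += alpha[:, :, None] * (x - W_a)
        W_c += alpha[:, :, None] * (c - W_c)
    return W_a, W_c
```

A full training loop would call som_epoch once per epoch, decaying the step size between epochs as $\mu(t) = \mu_{0}\exp(-t/\lambda)$.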
We assume the cost function to be the sum of the distances between the samples and their corresponding BMUs:

$$F = \sum_{i=1}^{n} \left( \|\mathbf{x}_i - \mathbf{w}^{a}_{u^{*}v^{*}}\|^{2} + \|\mathbf{c}_i - \mathbf{w}^{c}_{u^{*}v^{*}}\|^{2} \right) \qquad (5)$$

where $(u^{*}, v^{*})$ are the indexes of the BMU of the $i$-th sample. We train the network until the cost function stops decreasing or a selected number of learning iterations is exceeded.
The exploitation phase is performed after the learning phase. New samples, which did not take part in the training, are shown to the network in order to designate their class. It should be noted that only the attribute part of the input is presented to the network. The BMU is found by computing the distance between the attribute input vector and the attribute part of the weights using the following formula:

$$d^{a}(\mathbf{x}_i; \mathbf{w}_{uv}) = \|\mathbf{x}_i - \mathbf{w}^{a}_{uv}\|^{2} \qquad (6)$$

For the tested sample, the designated class corresponds to the position of the maximum value in the part of the BMU weights that codes the class information.
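A matching sketch of the exploitation phase (equation (6)): only the attribute part of the weights participates in the BMU search, and the class is read off the class part of the BMU. Again a minimal illustration with our own names, not the paper's code.

```python
import numpy as np

def som_classify(W_a, W_c, x):
    """Classify a new sample x using only the attribute part of the weights."""
    # Eq. (6): distance between input attributes and attribute weights.
    dist = ((W_a - x) ** 2).sum(axis=2)
    u_star, v_star = np.unravel_index(np.argmin(dist), dist.shape)
    # Predicted class = position of the maximum in the class part of the BMU.
    return int(np.argmax(W_c[u_star, v_star]))
```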
2.2 LMNN method
In many cases the most commonly used metric is the Euclidean one. It often gives poor accuracy, because it takes all dimensions with equal contribution and assumes no correlations between the dimensions. The Mahalanobis distance seems a better choice of metric, because it is scale-invariant and takes into account the correlations between input dimensions. It is defined by:

$$d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^{T}\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j) \qquad (7)$$

where $\mathbf{M}$ is usually the inverse of the covariance matrix. In the case where $\mathbf{M}$ is the identity matrix, the distance (7) reduces to the squared Euclidean distance.
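For concreteness, equation (7) in NumPy: with $\mathbf{M}$ set to the inverse covariance of the data this is the classical Mahalanobis distance, and with $\mathbf{M} = \mathbf{I}$ it reduces to the squared Euclidean distance. A small sketch with our own names.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (7): (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

X = np.random.randn(100, 5)
M = np.linalg.inv(np.cov(X, rowvar=False))  # the classical choice of M
d = mahalanobis_sq(X[0], X[1], M)
```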
In this paper, we learned the Mahalanobis matrix using the method described in [16]. The matrix coefficients are computed in a way that assures large-margin separation of points from different classes and a small distance between points of the same class. Before the matrix learning starts, the $k$ nearest neighbours with the same class are found for each point. They are called target neighbours, denoted by $\eta_{il} \in \{0, 1\}$, where $\eta_{il} = 1$ means that $\mathbf{x}_l$ is a target neighbour of $\mathbf{x}_i$, and $\eta_{il} = 0$ otherwise. With no prior knowledge, the Euclidean distance can be used to select the target neighbours. The target neighbours remain unchanged during the whole learning. Let us also add a variable indicating whether two samples share the same class, denoted $y_{il}$, where $y_{il} = 1$ when the $i$-th and $l$-th samples are from the same class and $y_{il} = 0$ when they are from different classes.
Finding an optimal matrix $\mathbf{M}$ can be expressed as a semidefinite programming (SDP) optimization problem with the following cost function (8) and constraints (9), (10), (11):

$$\min_{\mathbf{M},\,\xi} \; (1 - c) \sum_{i,l} \eta_{il}\, d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_l) \; + \; c \sum_{i,l,j} \eta_{il} (1 - y_{ij})\, \xi_{ilj} \qquad (8)$$

$$d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) - d_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_l) \geq 1 - \xi_{ilj} \qquad (9)$$

$$\xi_{ilj} \geq 0 \qquad (10)$$

$$\mathbf{M} \succeq 0 \qquad (11)$$

where constraint (9) is imposed for all triples $(i, l, j)$ with $\eta_{il} = 1$ and $y_{ij} = 0$.
The first term in the minimized cost penalizes large distances between samples and their target neighbours. The second term penalizes small distances between samples from different classes; it is expressed through the slack variables $\xi_{ilj}$. The parameter $c$ balances the influence of the two terms; in this paper it is set to $c = 0.5$, which gives equal strength to each term. The constraint (11) requires the matrix $\mathbf{M}$ to be positive semidefinite, i.e. all its eigenvalues should be nonnegative. Semidefinite programming is a convex optimization problem, so the global minimum can be computed efficiently. The SDP can be solved using general-purpose solvers; however, in our approach we used the Matlab implementation of the LMNN algorithm described in [16] (available from http://www.cse.wustl.edu/~kilian/code/code.html), which is finely tuned to solve this kind of problem efficiently. Here most of the slack variables are not used, because the samples are well separated.
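We did not reimplement the SDP solver; the sketch below only evaluates the LMNN objective (8) for a candidate matrix $\mathbf{M}$, with slack values implied by constraint (9), which can be useful as a sanity check. The names (lmnn_objective, eta) are ours; the solver itself is the Matlab code referenced above.

```python
import numpy as np

def lmnn_objective(M, X, labels, eta, c=0.5):
    """Evaluate the LMNN cost (8). eta[i, l] = 1 iff x_l is a target
    neighbour of x_i. Slacks follow from constraint (9):
    xi_ilj = max(0, 1 + d_M(x_i, x_l) - d_M(x_i, x_j))."""
    diff = X[:, None, :] - X[None, :, :]            # all pairwise differences
    d = np.einsum("ijk,kl,ijl->ij", diff, M, diff)  # all d_M(x_i, x_j)
    pull = (eta * d).sum()                          # first term: target neighbours
    push = 0.0
    for i, l in zip(*np.nonzero(eta)):
        for j in np.nonzero(labels != labels[i])[0]:  # differently labelled j
            push += max(0.0, 1.0 + d[i, l] - d[i, j])
    return (1.0 - c) * pull + c * push
```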
2.3 SOM+DML model
We are interested in such a linear transformation of the sample attributes that the Euclidean distance computed on the transformed attributes is equal to the Mahalanobis distance computed on the original attributes. The Mahalanobis matrix $\mathbf{M}$ can be written as $\mathbf{M} = \mathbf{L}^{T}\mathbf{L}$, where $\mathbf{L}$ is the searched transformation. Let us denote by $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{x}}_j$ the transformed attributes of the $i$-th and $j$-th samples:

$$\tilde{\mathbf{x}}_i = \mathbf{L}\mathbf{x}_i \qquad (12)$$

$$\tilde{\mathbf{x}}_j = \mathbf{L}\mathbf{x}_j \qquad (13)$$

The Euclidean distance between the transformed attributes should be equal to the Mahalanobis distance between the original attributes:

$$\|\tilde{\mathbf{x}}_i - \tilde{\mathbf{x}}_j\|^{2} = (\mathbf{x}_i - \mathbf{x}_j)^{T}\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j) \qquad (14)$$
We will now search for $\mathbf{L}$. For the matrix $\mathbf{M}$ we find the matrix of eigenvectors $\mathbf{V}$ and the diagonal matrix $\mathbf{\Lambda}$ with the eigenvalues on its diagonal:

$$\mathbf{M}\mathbf{V} = \mathbf{V}\mathbf{\Lambda} \qquad (15)$$

The matrix $\mathbf{M}$ can then be expressed as:

$$\mathbf{M} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^{T} \qquad (16)$$
Since the matrix $\mathbf{M}$ found by the LMNN algorithm is positive semidefinite, the diagonal elements of $\mathbf{\Lambda}$ are nonnegative. Thus, we can write:

$$\mathbf{\Lambda} = \mathbf{\Lambda}^{1/2}\mathbf{\Lambda}^{1/2} \qquad (17)$$

Substituting (17) into (16), we obtain:

$$\mathbf{M} = \mathbf{V}\mathbf{\Lambda}^{1/2}\mathbf{\Lambda}^{1/2}\mathbf{V}^{T} \qquad (18)$$
Now we can denote $\mathbf{L}$ as:

$$\mathbf{L} = \mathbf{\Lambda}^{1/2}\mathbf{V}^{T} \qquad (19)$$

Using (19), we can write (18) as:

$$\mathbf{M} = (\mathbf{\Lambda}^{1/2}\mathbf{V}^{T})^{T}(\mathbf{\Lambda}^{1/2}\mathbf{V}^{T}) = \mathbf{L}^{T}\mathbf{L} \qquad (20)$$

We see that $\mathbf{M} = \mathbf{L}^{T}\mathbf{L}$, and hence we can write:

$$\|\tilde{\mathbf{x}}_i - \tilde{\mathbf{x}}_j\|^{2} = (\mathbf{x}_i - \mathbf{x}_j)^{T}\mathbf{L}^{T}\mathbf{L}(\mathbf{x}_i - \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^{T}\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j) \qquad (21)$$
From (21) we see that (14) holds. This transformation of the input attributes is also known as the whitening transform. It is worth mentioning that, in the transformation phase, only the attributes are transformed; the class part of the input vector remains unchanged. Therefore, even though the data were preprocessed using the transformation, we can still use the original SOM algorithm.
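The derivation above translates directly into a few lines of NumPy: eigendecompose $\mathbf{M}$, form $\mathbf{L} = \mathbf{\Lambda}^{1/2}\mathbf{V}^{T}$, and transform the attributes, leaving the class part of each input vector untouched. A minimal sketch, assuming $\mathbf{M}$ is the symmetric positive semidefinite output of LMNN; the synthetic test matrix is ours.

```python
import numpy as np

def dml_transform(M, X):
    """Whitening-style transform (12)-(21): returns x_tilde = L x for all rows of X."""
    # Eqs. (15)/(16): eigendecomposition of the symmetric PSD matrix M.
    eigvals, V = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    # Eq. (19): L = Lambda^{1/2} V^T, so that L^T L = M (eq. (20)).
    L = np.diag(np.sqrt(eigvals)) @ V.T
    return X @ L.T  # row-wise L x_i

# Sanity check of eq. (14): the Euclidean distance on transformed attributes
# equals the Mahalanobis distance on the original ones.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A.T @ A                        # a synthetic PSD matrix standing in for LMNN output
X = rng.standard_normal((10, 4))
Xt = dml_transform(M, X)
d_euc = ((Xt[0] - Xt[1]) ** 2).sum()
d_mah = (X[0] - X[1]) @ M @ (X[0] - X[1])
assert np.allclose(d_euc, d_mah)
```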
3 Results
Performance of the SOM+DML method was compared to the SOM model on six real data sets. As an accuracy measure we take the percentage of incorrect classifications. It is worth mentioning that achieving the highest number of correct classifications is not the goal of this paper. The data sets are described in Table 1. The sets 'Wine', 'Ionosphere', 'Iris', 'Isolet' and 'Digits' are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/); the set 'Faces' is from The ORL Database of Faces (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html).
Now we briefly introduce the origin of the sets. The data sets 'Wine', 'Ionosphere' and 'Iris' are classic benchmark sets, often used in testing newly developed classification algorithms. The 'Isolet' data set represents a spoken letter recognition task; its samples correspond to the 26 letters of the alphabet. The original number of attributes (617) was projected by PCA onto 100 principal components. The 'Digits' data set represents a handwritten digit recognition problem. The samples are from 10 classes, obtained from 43 people. The original 32x32 bitmap images were downsampled to 64 attributes by the creators. The face recognition task is represented by the 'Faces' data set. It contains images of faces obtained from 40 people (40 classes). For each person, 10 images were taken at different times, with varying lighting, facial expressions and facial details. The original data (92x112 pixel images in 256 gray levels) were projected by PCA onto the 50 leading components, the so-called eigenfaces method [15]. Each data set, if not originally divided into train/test subsets, was randomly divided by us: 70% of the data into the training subset and 30% into the testing subset.

Table 1. Description of the data sets and experiment settings.

                  Wine  Ionosphere  Iris  Isolet  Digits  Faces
Train examples     126         246   105    6238    3823    280
Test examples       52         105    45    1559    1797    120
Attributes          13          34     4     100      64     50
Classes              3           2     3      26      10     40
Runs               100         100   100      20      20     20
Net size           4x4         6x6   6x6   20x20   20x20  20x20
k in LMNN            2           5     1       4       3      3
For each data set, we arbitrarily chose the network size (selecting an optimal network size is not in the scope of this paper). The network size for the SOM and the SOM+DML models was equal. For all data sets, the same values of the learning parameters $\mu_{0}$, $\lambda$ and $\sigma$ were used. For each data set, the number of target neighbours $k$ in the LMNN algorithm was selected using cross-validation repeated ten times. Fig. 1 presents the results of selecting the target-neighbours parameter $k$ for all sets; the resulting values are shown in Table 1. SOM weights were initialized with random numbers drawn from a normal distribution
. Several runs were performed for each data set (see Table 1), so that local minima were avoided. The final result is the mean over all runs. The SOM and SOM+DML models were trained with an identical number of iterations.
The comparison of results obtained by the SOM and the SOM+DML method is presented in Table 2. With one exception, the SOM+DML method achieves lower error rates on both the training and the testing subsets. On the 'Iris' data set the LMNN algorithm seems to cause an overfitting effect; it is clearly visible in Fig. 1 during the search for the $k$ parameter. The greatest improvement was achieved on the 'Faces' set, where the testing error dropped from 32.75% for SOM to 11.88% for SOM+DML. A comparison of example face pictures classified as belonging to the same class by the SOM method and by the SOM+DML method is presented in Fig. 2. For the SOM+DML method on the 'Faces' set we observed a significant difference between the training error and the testing error: it is a small set, so the metric was well matched to the training set, giving a small training error. On the 'Wine' and 'Isolet' data sets the testing error was reduced from 5.31% to 3.56% and from 9.20% to 7.31%, respectively. On the 'Digits' set the improvement on the testing subset was 2.68 percentage points, which corresponds to roughly 48 digits. For the 'Ionosphere' data set the testing error dropped from 18.10% to 11.43%. It is worth mentioning that for this set the largest value of $k$ (see Table 1) was used.
Table 2. Classification error rates (%) for the SOM and SOM+DML models.

           Wine          Ionosphere     Iris          Isolet        Digits        Faces
           Train  Test   Train  Test    Train  Test   Train  Test   Train  Test   Train  Test
SOM         4.66  5.31   16.26  18.10    9.14  7.78    7.90  9.20    6.89  8.84   26.84  32.75
SOM+DML     1.05  3.56    8.13  11.43    8.76  8.44    5.32  7.31    3.86  6.16    4.52  11.88
4 Conclusions
A method of improving the performance of Self-Organising Maps in classification tasks was described. A linear transformation of the data was performed before the SOM training phase. The matrix for the transformation was obtained from the LMNN algorithm, which computes the Mahalanobis matrix while assuring large-margin separation between points of different classes. We call our method SOM+DML. The method was tested on several data sets, focused on the recognition of faces, handwritten digits and spoken letters. The test results confirm that the distance metric learning method improves the performance of the SOM network. Finding an optimal matrix for the linear transformation plays a crucial role in obtaining the improved results. A Matlab implementation of the SOM+DML model is available from http://home.elka.pw.edu.pl/~pplonski/som_dml.
References

[1] Alahakoon, D., Halgamuge, S.K., Srinivasan, B.: Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery. IEEE Transactions on Neural Networks, vol. 11, pp 601–614 (2000)
[2] Aly, S., Tsuruta, N., Taniguchi, R.: Face Recognition under Varying Illumination Using Mahalanobis Self-Organizing Map. Artificial Life and Robotics, vol. 13, no. 1, pp 298–301 (2008)
[3] Bobrowski, L., Topczewska, M.: Improving the K-NN Classification with the Euclidean Distance Through Linear Data Transformations. In Industrial Conference on Data Mining (2004), pp 23–32
[4] Dopazo, J., Carazo, J.M.: Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. Journal of Molecular Evolution, vol. 44(2), pp 226–233 (1997)
[5] Duch, W., Naud, A.: On Global Self-Organizing Maps. In ESANN (1996), pp 91–96
[6] Fessant, F., Aknin, P., Oukhellou, L., Midenet, S.: Comparison of Supervised Self-Organizing Maps Using Euclidian or Mahalanobis Distance in Classification Context. In IWANN (2001), pp 637–644
[7] Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighbourhood Components Analysis. In NIPS (2005), pp 513–520
[8] Jiang, F., Berry, H., Schoenauer, M.: The impact of network topology on self-organizing maps. In GEC Summit (2009), pp 247–254
[9] Kłopotek, M.A., Pachecki, T.: Create Self-Organizing Maps of Documents in a Distributed System. Intelligent Information Systems, Siedlce, pp 315–320 (2010)
[10] Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics, vol. 43, pp 59–69 (1982)
[11] Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J.: Engineering applications of the self-organizing map. Proceedings of the IEEE, vol. 84, no. 10, pp 1358–1384 (1996)
[12] Kohonen, T., Xing, H.: Contextually Self-Organized Maps of Chinese Words. In WSOM (2011), pp 16–29
[13] Midenet, S., Grumbach, A.: Learning Associations by Self-Organization: the LASSO model. Neurocomputing, vol. 6(3), pp 343–361 (1994)
[14] Rauber, A., Tomsich, P., Merkl, D.: parSOM: A Parallel Implementation of the Self-Organizing Map Exploiting Cache Effects: Making the SOM Fit for Interactive High-Performance Data Analysis. In International Joint Conference on Neural Networks (2000), pp 177–182
[15] Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience, vol. 3, no. 1, pp 71–86 (1991)
[16] Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance Metric Learning for Large Margin Nearest Neighbor Classification. In NIPS (2006), pp 1473–1480
[17] Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.J.: Distance Metric Learning with Application to Clustering with Side-Information. In NIPS (2002), pp 505–512