Minimal Learning Machine: Theoretical Results and Clustering-Based Reference Point Selection

09/22/2019 ∙ by Joonas Hämäläinen et al. ∙ University of Jyväskylä ∙ Universidade Federal do Ceará ∙ Instituto Federal de Educação

The Minimal Learning Machine (MLM) is a nonlinear supervised approach based on learning a linear mapping between distance matrices computed in the input and output data spaces, where distances are calculated with respect to a subset of points called reference points. Its simple formulation has attracted several recent works on extensions and applications. In this paper, we aim to address some open questions related to the MLM. First, we detail theoretical aspects that assure the interpolation and universal approximation capabilities of the MLM, which were previously only empirically verified. Second, we identify the task of selecting reference points as having major importance for the MLM's generalization capability; furthermore, we assess several clustering-based methods in regression scenarios. Based on an extensive empirical evaluation, we conclude that the evaluated methods are both scalable and useful. Specifically, for a small number of reference points, the clustering-based methods outperformed the standard random selection of the original MLM formulation.


1 Introduction

Machine learning techniques can be roughly categorized as unsupervised and supervised, depending on whether the learning data comprises only input data or a complete set of input–output pairs (Shalev-Shwartz and Ben-David, 2014). In terms of target data, semi-supervised learning typically lies somewhere in the middle of these extremes (Gan et al., 2013), and active (Aggarwal et al., 2014) or incremental (Losing et al., 2018) learning techniques acquire the desired outputs incrementally during model construction, on a need-to-know basis. A key concept in unsupervised learning, especially clustering, is the distance or dissimilarity between two observations or between an observation and a metaobservation (e.g., a cluster prototype) (Reddy and Vinzamuri, 2013). Currently, supervised learning is making heavy use of deep structures and stochastic optimization methods to recover unknown weights (Hubara et al., 2017).

The distance-based supervised methods provide a methodological middle ground and linkage between unsupervised and supervised learning. Examples of such methods include the Minimal Learning Machine (de Souza Junior et al., 2015) and the Extreme Minimal Learning Machine (Kärkkäinen, 2019). The core learning construct in these methods is the distance regression, based on the dissimilarity between the observations. Hence, nonlinear regression and classification can be performed for all entities where the dissimilarity can be metrically defined. During learning, incremental use of the so-called reference points, together with the solution of the corresponding SPD linear system, is needed without any optimization procedure (Kärkkäinen, 2019). Note that such distance-based supervised techniques also enable direct utilization of metric learning techniques in the core construction (e.g., Kulis 2013).

More precisely, the Minimal Learning Machine (MLM) is a supervised learning algorithm whose learning capability is based on a linear mapping between input and output distance matrices. The increasing popularity of the MLM can be explained by its simple formulation, easy implementation, and promising results in several applications (Mesquita et al., 2017a; Coelho et al., 2014; Marinho et al., 2017, 2018). Apart from the applications of the MLM, many studies have been carried out in recent years to improve and augment the basic form of the MLM to handle missing values (Mesquita et al., 2015, 2017b) and outliers (Gomes et al., 2017), perform ensemble learning (Mesquita et al., 2017a) and semi-supervised learning (Caldas et al., 2018), speed up its computations (Mesquita et al., 2017a; Marinho et al., 2016), and include a reject option in classification tasks (de Oliveira et al., 2016).

A significant number of these proposals for improvement have addressed the problem of reference point selection. In the MLM, reference points are a subset of the training points and are used to build the distance matrices that are a key part of the MLM’s induction process. In the original MLM formulation, the reference points are randomly selected. As empirically demonstrated by de Souza Junior et al. (2015), a bad choice of reference points can damage the MLM’s generalization capability. This phenomenon is even more likely to occur when the number of reference points is small (de Souza Junior et al., 2015).

To illustrate this situation, Figure 1 presents the results of 500 MLM executions using different methods of reference point selection. Note that the random selection method generates a much more dispersed cloud of results than the clustering-based method, which is also nondeterministic, whereas the deterministic method shows no variation in the results at all.

Figure 1: Reference point selection variance.

Dias et al. (2018) proposed a strategy to select reference points based on the identification of the class boundaries in a binary classification problem. In that proposal, points located in the class boundary region were excluded from selection as reference points. A similar objective was pursued by Florêncio et al. (2018), who identified such a region using fuzzy c-means. Maia et al. (2018) used a sparse regression method to build the linear mapping between distance matrices; in that work, the reference points were selected according to the resulting non-zero coefficients of the linear model.

Even though previous works on reference point selection led to more compact models with better generalization, such works only focused on classification problems. Additionally, none of these works presented any theoretical results that could explain the impact of choosing reference points in a general setting.

In the present work, we address the aforementioned issues by: i) presenting a proof of the MLM’s interpolation capability when all training points are used as reference points; ii) demonstrating the universal approximation capability of the MLM even in scenarios in which reference point selection is considered; and iii) proposing and analyzing several reference point selection strategies for regression problems based on clustering methods.

When we choose clustering-based approaches, our basic hypothesis is that a set of well-spread reference points in the data space will improve the performance of the MLM compared to random selection. We validate the empirical contributions of this paper through computational experiments with 15 regression datasets.

The remainder of the paper is organized as follows: Section 2 presents the formulation of the MLM. Section 3 details our theoretical contributions on the interpolation and generalization capabilities of the MLM. Section 4 describes clustering-based methodologies of reference point selection. Section 5 presents a comprehensive set of experiments to evaluate clustering-based methodologies for reference point selection. Finally, Section 6 concludes the paper.

2 Minimal Learning Machine

As previously discussed, the MLM is a distance-based supervised machine learning method. The basic algorithm (de Souza Junior et al., 2013, 2015) comprises two main steps: i) regression estimation using the distance-based kernel; and ii) distance-based interpolation of a new output. For clarity, we describe these two steps below.

Let $X = \{x_i\}_{i=1}^{N}$ be a set of training inputs, where $x_i \in \mathbb{R}^P$, and let $Y = \{y_i\}_{i=1}^{N}$ be the set of the corresponding outputs, $y_i \in \mathbb{R}^Q$, for $i = 1, \dots, N$, respectively. Moreover, we define the set of (input) reference points $R = \{r_k\}_{k=1}^{K}$ as a non-empty subset of $X$, $R \subseteq X$, and let $T = \{t_k\}_{k=1}^{K}$ refer to the outputs of the corresponding reference inputs, i.e., $t_k$ is the output associated with $r_k$.

Next, we define two distance matrices, $\mathbf{D}_x \in \mathbb{R}^{N \times K}$ and $\mathbf{D}_y \in \mathbb{R}^{N \times K}$, using the Euclidean distance as follows:

$$[\mathbf{D}_x]_{ik} = \|x_i - r_k\|, \qquad (1)$$
$$[\mathbf{D}_y]_{ik} = \|y_i - t_k\|. \qquad (2)$$

The key idea for the first step of the MLM is the assumption of a regression model between the distance matrices: $\mathbf{D}_y = \mathbf{D}_x \mathbf{B} + \mathbf{E}$, where $\mathbf{E}$ denotes the residuals/error of this transformation. Assuming that the unknown regression model is of the linear form, its transformation matrix $\mathbf{B}$ can be estimated using the well-known ordinary least squares formulation, as follows:

$$\hat{\mathbf{B}} = \left(\mathbf{D}_x^\top \mathbf{D}_x\right)^{-1} \mathbf{D}_x^\top \mathbf{D}_y. \qquad (3)$$

The linear mapping represented by the matrix $\hat{\mathbf{B}}$, obtained in Eq. (3), is the first step of the MLM.
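To make this first step concrete, the following Python sketch (our illustration, not the authors' MATLAB implementation; the function and variable names are ours) computes the distance matrices of Eqs. (1)–(2) with SciPy and estimates $\hat{\mathbf{B}}$ via ordinary least squares:

```python
import numpy as np
from scipy.spatial.distance import cdist


def mlm_train(X, Y, ref_idx):
    """Sketch of the MLM training step, Eq. (3).

    X: (N, P) training inputs, Y: (N, Q) training outputs,
    ref_idx: indices of the K reference points (a subset of 0..N-1).
    Returns the reference points R, T and the coefficient matrix B_hat.
    """
    R, T = X[ref_idx], Y[ref_idx]
    Dx = cdist(X, R)    # N x K input distance matrix, Eq. (1)
    Dy = cdist(Y, T)    # N x K output distance matrix, Eq. (2)
    B_hat, *_ = np.linalg.lstsq(Dx, Dy, rcond=None)   # OLS solution of Dx B = Dy
    return R, T, B_hat
```

With all training points used as reference points (`ref_idx = np.arange(len(X))`), `Dx` is the square distance matrix whose invertibility is discussed in Section 3.1.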

For the second step, let $x \in \mathbb{R}^P$ be a new input vector whose output needs to be estimated. Hence, based on the distance regression model from the first step, we seek the corresponding output $y \in \mathbb{R}^Q$, satisfying

$$\|y - t_k\| \approx \hat{\delta}_k, \qquad k = 1, \dots, K, \qquad (4)$$

where $\hat{\delta}_k = \left[\mathbf{d}(x)\,\hat{\mathbf{B}}\right]_k$ and $\mathbf{d}(x) = \left(\|x - r_1\|, \dots, \|x - r_K\|\right)$ is the row vector of distances between $x$ and the input reference points.

The solution to the multilateration problem in Eq. (4) can also be obtained using the least-squares formulation by letting

$$\hat{y} = \underset{y \in \mathbb{R}^Q}{\arg\min} \; \sum_{k=1}^{K} \left( \|y - t_k\|^2 - \hat{\delta}_k^2 \right)^2. \qquad (5)$$

As stated by de Souza Junior et al. (2015), there are many possible solvers for Eq. (5). In the original formulation, the MLM solves the output estimation step using a nonlinear optimization algorithm, which finds the point that minimizes the double-quadratic error between the estimated and the true distances evaluated at each candidate point. However, we want to verify whether, when the distances are perfectly estimated, the position of the point can be recovered without error. To that end, we follow an alternative formulation of the multilateration problem, called the localization linear system (LLS), detailed in Appendix A. This formulation provides an efficient method for output estimation: the LLS computes the output position by solving a linear system. An output prediction algorithm for the MLM with the LLS is depicted in Algorithm 1. The substitution notation in the algorithm refers to the removal of an element from a vector.

0:  input $x$, distance regression model $\hat{\mathbf{B}}$, reference points $R$ and $T$.
0:  predicted output $\hat{y}$.
1:  
2:  
3:  
4:  
5:  
6:  
7:  for  do
8:     
9:     
10:  
11:  
Algorithm 1 MLM output prediction with LLS

In summary, the LLS solves a system of the form $\mathbf{A}\mathbf{z} = \mathbf{b}$. The coefficient matrix $\mathbf{A}$ is constructed from all reference points but one, named the benchmark-anchor-node (BAN); each row is given by the difference between the $k$-th reference point and the BAN. The unknown vector $\mathbf{z}$ is a simple translation of the target position, and the vector $\mathbf{b}$ is computed from the estimated distances between the target point and the reference points, as well as from the distances between the BAN and the other reference points.

In Algorithm 1, the linear system of equations in Step 10 is usually overdetermined. An approximate solution can be obtained with the ordinary least squares (OLS) method at a computational cost of $\mathcal{O}(KQ^2)$. Step 1 has a computational cost of $\mathcal{O}(KP)$. Usually, Step 2 is computationally the most expensive step and determines the asymptotic behavior of the computational complexity, $\mathcal{O}(K^2)$, when $P \ll K$ and $Q \ll K$. Therefore, models with a reduced number of reference points can lead to a significant reduction in the computational time of MLM prediction with the LLS when the input and output space dimensions are small with respect to $K$.
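The following Python sketch (our reconstruction based on the LLS description above and Appendix A, not the authors' code) illustrates the prediction step: it estimates the output-space distances with $\hat{\mathbf{B}}$ and then solves the localization linear system, using the last reference point as the BAN by default:

```python
import numpy as np
from scipy.spatial.distance import cdist


def mlm_predict_lls(x, R, T, B_hat, ban=-1):
    """Sketch of MLM output prediction with the LLS (cf. Algorithm 1).

    x: (P,) new input, R: (K, P) input reference points,
    T: (K, Q) output reference points, B_hat: (K, K) from Eq. (3),
    ban: index of the reference point used as the benchmark-anchor-node.
    """
    d = cdist(x[None, :], R)[0]     # distances to the input reference points
    delta = d @ B_hat               # estimated output-space distances
    K = len(T)
    ban = ban % K
    t_b, delta_b = T[ban], delta[ban]
    keep = np.arange(K) != ban      # all reference points except the BAN
    A = T[keep] - t_b               # rows: differences to the BAN
    b = 0.5 * (delta_b**2 - delta[keep]**2 + np.sum(A**2, axis=1))
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z + t_b                  # translate back to the output space
```

Under the conditions discussed in Section 3 (exactly estimated distances and a BAN that is not an affine combination of the other reference points), the solved position coincides with the true output up to numerical precision.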

3 MLM Theoretical Results

In this section, we detail some theoretical guarantees of the MLM. These results are divided into two subsections: interpolation theory and universal approximation capability.

3.1 Interpolation Theory

We show that the MLM can interpolate the data in two steps. First, we show that the distance matrix $\mathbf{D}_x$, constructed using all points in the available data as reference points, is invertible. According to Eq. (3), and given that $\mathbf{D}_x$ is square and invertible when all data points act as reference points, the distances can be estimated exactly. In the second step, we prove that, under certain conditions to be described, the estimation of the output recovers the position of the original points with zero error.

3.1.1 Inverse of distance matrices

In the training phase of the MLM, we need to solve a linear system whose coefficient matrix is given by the distances between the points of the dataset and the reference points, that is, a matrix whose $(i,k)$ entry is $\|x_i - r_k\|$, i.e., the distance between the $i$-th point of the training set and the $k$-th reference point. If we consider the specific case in which all points in the dataset are reference points, then the coefficient matrix is a square matrix of order equal to the number of training points $N$. We rearrange the points so that $r_i = x_i$ for $i = 1, \dots, N$, and the matrix of coefficients is then such that each element $(i,j)$ is given by $\|x_i - x_j\|$. A matrix with this characteristic is formally called a distance matrix. To find an exact solution, we must show that every distance matrix admits an inverse.

The invertibility of the distance matrix was first demonstrated by Micchelli (1986); Auer (1995) offered a simplified proof. The main result is given by the following theorem:

Given a distance matrix $\mathbf{D}$ computed from a set of $N$ distinct points, the determinant of $\mathbf{D}$ is positive if $N$ is odd and negative if $N$ is even; in particular, $\mathbf{D}$ is invertible.

With this result, we can guarantee that when the distance matrix in the input space is multiplied by the coefficient matrix obtained in the MLM training, the result is the distance matrix in the output space, without any error.
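A quick numerical illustration of the sign pattern stated above (our own check, not part of the paper) draws random distinct points and inspects the determinant of their Euclidean distance matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
for n in (2, 3, 4, 5, 6):
    X = rng.standard_normal((n, 2))     # n distinct random points in the plane
    D = squareform(pdist(X))            # n x n Euclidean distance matrix
    det = np.linalg.det(D)
    # Expected: det > 0 for odd n, det < 0 for even n, never zero
    print(n, "odd" if n % 2 else "even", np.sign(det))
```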

3.1.2 Condition for the perfect estimation of the multilateration

The result of the previous subsection is important since it shows that the MLM can recover the distances in the output space between the reference points and the training data with zero error. However, this is not sufficient to conclude that the MLM is capable of interpolating any dataset. For that, we must prove the model's ability to estimate the output, i.e., to retrieve the positions of the points in the output space from the perfectly estimated distances.

Solving the LLS exactly is only possible when the coefficient matrix is non-singular, which is not necessarily true for an arbitrary set of points. In fact, the theorem below shows that the matrix is invertible when the reference points, together with the BAN, form an affinely independent set.

[Perfect estimation with the multilateration] Let $\{v_1, \dots, v_Q\}$ be a linearly independent spanning set of $\mathbb{R}^Q$ and let $v_0 \in \mathbb{R}^Q$. If $v_0$ is not an affine combination of $v_1, \dots, v_Q$, then the set of vectors $\{v_1 - v_0, \dots, v_Q - v_0\}$ is linearly independent.

Proof:

Suppose that $\{v_1, \dots, v_Q\}$ is a linearly independent spanning set, that $v_0$ is not an affine combination of $v_1, \dots, v_Q$, and that $\{v_1 - v_0, \dots, v_Q - v_0\}$ is linearly dependent. There then exist $\alpha_1, \dots, \alpha_Q$, not all equal to zero, such that

$$\sum_{k=1}^{Q} \alpha_k (v_k - v_0) = \sum_{k=1}^{Q} \alpha_k v_k - \Big(\sum_{k=1}^{Q} \alpha_k\Big) v_0 = 0. \qquad (6)$$

Let $s = \sum_{k=1}^{Q} \alpha_k$. If $s = 0$, then Eq. (6) reduces to $\sum_{k=1}^{Q} \alpha_k v_k = 0$; since $\{v_1, \dots, v_Q\}$ is linearly independent, this can only be satisfied when all $\alpha_k$ are equal to zero, which cannot be true since we assumed that the $\alpha_k$ are not all zero. Assuming, then, that $s \neq 0$, we have $v_0 = \sum_{k=1}^{Q} (\alpha_k/s)\, v_k$ with $\sum_{k=1}^{Q} \alpha_k/s = 1$; however, this expresses $v_0$ as an affine combination of $v_1, \dots, v_Q$. Since we assumed that $v_0$ is not an affine combination of $v_1, \dots, v_Q$, we arrive at a contradiction and conclude the proof.

The above theorem shows that the multilateration recovers the exact position of the point in the output space when we choose $Q$ linearly independent points from the training set and another one (the BAN) that is not an affine combination of the others. Since the number of training points is usually much larger than the dimension of the output, this is usually possible.

3.2 Universal Approximation Property

We will now verify an important theoretical result for the MLM: its universal approximation capability. The result is divided into two parts: one for the distance estimation error after the linear transformation, and the other for the multilateration estimation error when recovering the output position. This clarifies that the MLM can be used to approximate arbitrary functions.

3.2.1 Upper bound for distance estimation error

To show that the distance estimation error of the MLM is bounded, we use a result presented in Park and Sandberg (1991), in which the authors show that a Radial Basis Function (RBF) network is a universal approximator. The result is summarized by the following theorem:

[RBF Universal Approximation] Let $g: \mathbb{R}^P \to \mathbb{R}$ be a nonzero integrable function such that $g$ is continuous and radially symmetric with respect to the Euclidean norm. Then the family $S_g$ is dense in the space of continuous $\mathbb{R}$-valued maps defined on any compact subset of $\mathbb{R}^P$ with respect to the uniform norm, where $S_g$ is the family of RBF networks with kernel function $g$:

$$q(x) = \sum_{k=1}^{M} w_k\, g\!\left(\frac{x - c_k}{\sigma}\right), \qquad \sigma > 0,\; w_k \in \mathbb{R},\; c_k \in \mathbb{R}^P.$$

This theorem shows that an RBF network can approximate a significant set of functions with an arbitrarily small error. For the MLM, we can rely on this result by considering that the desired output of the dataset is the distance to the reference points in the output space. With that modification, we must show that the MLM can be described in the RBF network formalism, which ensures that the MLM can estimate the distances to the reference points in the output space with an arbitrarily small error.

We first take the centroids $c_k$ of the RBF network to be the reference points $r_k$ of the MLM. The kernel function $g$ is then taken to be the Euclidean norm, $g(u) = \|u\|$. The presented RBF formulation has a width parameter $\sigma$ that does not appear in the MLM; however, if there is a combination of weights and $\sigma$ that satisfies the approximation property, we can rescale the weights by $1/\sigma$ and get the same result, since $g\!\left((x - r_k)/\sigma\right) = \|x - r_k\|/\sigma$. Finally, both the RBF network and the MLM apply a linear regression to compute the output, so the weights of the RBF network are equivalent to the coefficients of the matrix $\hat{\mathbf{B}}$ in Eq. (3). Thus, we conclude that the error of the MLM-estimated distances in the output space can be made arbitrarily small.
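To make the correspondence explicit, the estimated distance to the $j$-th output reference point can be written as an RBF expansion with the norm kernel (our notation, consistent with Eqs. (1)–(3)):

$$\hat{\delta}_j(x) \;=\; \sum_{k=1}^{K} [\hat{\mathbf{B}}]_{kj}\, \|x - r_k\| \;=\; \sum_{k=1}^{K} w_{kj}\, g\!\left(\frac{x - r_k}{\sigma}\right), \qquad g(u) = \|u\|, \quad w_{kj} = \sigma\, [\hat{\mathbf{B}}]_{kj},$$

so each column of $\hat{\mathbf{B}}$ plays the role of the output weights of an RBF network whose centroids are the reference points.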

3.2.2 Upper bound for the multilateration prediction error

In the previous section, we showed the MLM can provide a good estimate of the distances in the output space. However, the MLM needs an additional step to compute the output: the multilateration. This section shows that the multilateration estimation error is bounded.

In Hu et al. (2016), an upper bound is found for the error of multilateration, given by the method detailed in Appendix A. This work was carried out in the context of localization of mobile autonomous robots and is different in some ways from the MLM. In summary, the objective of that work is to locate a mobile robot based on estimated distances for some fixed points of known locations, called anchor points. Both the distance estimates and the anchor point locations themselves may present noise. Thus, in that context, the upper bound for the multilateration error is expressed by the following theorem:

[Upper bound for the LLS error] An LLS is constructed as described in Eq. (9) in Appendix A and is expressed as

$$(\mathbf{A} + \Delta\mathbf{A})\,\hat{\mathbf{z}} = \mathbf{b}_0 + \Delta\mathbf{b},$$

where $\mathbf{A} + \Delta\mathbf{A}$ is the matrix constructed from the anchors' positions, $\mathbf{A}$ represents the precise positions of the anchor nodes, $\Delta\mathbf{A}$ contains the anchors' coordinate errors, $\mathbf{b}_0 + \Delta\mathbf{b}$ is a vector collecting the anchors' positions and the measurement data, $\mathbf{b}_0$ denotes the noiseless measurement data, and $\Delta\mathbf{b}$ represents the noise of the measurement data. The ratio between the estimated coordinate $\hat{\mathbf{z}}$ and the true coordinate then satisfies an upper bound expressed in terms of $\mathbf{A}$, $\Delta\mathbf{A}$, $\mathbf{b}_0$, and $\Delta\mathbf{b}$.

An MLM analogy can be made with the presented context by considering that the location of the robot is the desired output and the locations of the anchor points are the locations of the reference points in the output space. Distance estimates from the robot to the anchor points are given by the output of the MLM before the multilateration step.

Theorem 3.2.2 presents an upper bound for the multilateration error. However, the characteristics of the MLM allow us to make the bound tighter. First, we consider that the locations of the reference points are exact, which means that $\Delta\mathbf{A} = \mathbf{0}$; thus, the corresponding term of the bound vanishes. In addition, we saw in the previous section that the MLM distance estimation errors can be made arbitrarily small; this means that $\Delta\mathbf{b} \to \mathbf{0}$, which implies that its contribution to the bound also vanishes. Thus, we can present the following corollary:

The error of the MLM multilateration step is bounded by

where

In addition, since we have , we have .

The result of this corollary shows that the ratio between the returned and the desired output is bounded. We will develop this relation to show that the distance between them is also bounded:

(7)

We can therefore conclude that if the norm of the target output is bounded, then the distance between the desired output and the output estimated by the multilateration is also bounded.

3.3 Discussion

Corollary 3.2.2 indicates that the upper bound of the multilateration error depends on the coefficient matrix $\mathbf{A}$, which is itself determined by the reference points used to compute the distances. This observation indicates that we can make the bound tighter with certain choices of reference points, thereby reducing the output estimation error limit. This idea was previously demonstrated empirically (Dias et al., 2018; Florêncio et al., 2018; Maia et al., 2018). In the present work, we have now theoretically motivated a non-random selection of the reference points. We assess that motivation by performing comprehensive computational experiments, detailed in the next sections, with a focus on clustering-based approaches.

4 Clustering-based selection of reference points

In this section, we evaluate four clustering-based methods in the reference point selection problem. The methods include two nondeterministic and two deterministic ones. A general algorithm for the selection of clustering-based reference points is depicted in Algorithm 2. All the methods are based on a common strategy, where the selection of reference points is performed only in the input space. The corresponding points (indices) are simply selected as output references. Therefore, in the following, we consider only the input space when describing the proposed methods.

0:  input points $X$, output points $Y$, and number of reference points $K$.
0:  reference points $R$ and $T$.
1:  Cluster $X$ into $K$ clusters.
2:  Select a cluster prototype from each cluster.
3:  Select $R$ from $X$ according to the cluster prototypes.
4:  Select $T$ from $Y$ corresponding to the indices of $R$.
Algorithm 2 Clustering-based selection of reference points

The K-means++ initialization method (Arthur and Vassilvitskii, 2007) is one of the most popular methods for K-means initialization. The first method we evaluate is the use of the K-means++ initialization with the Euclidean distance for the selection of reference points; see Hämäläinen et al. (2017) for a description of the algorithm. We refer to this approach as reference point selection with K-means++ (RS-K-means++).

The second evaluated approach begins by running the K-means++ initialization with the Euclidean distance and then refines the initial prototypes with Lloyd's algorithm (Lloyd, 1982) until convergence. Finally, the observation closest to each final prototype (the medoid) is picked as a reference point. These closest points then establish the set of selected reference points. This method is referred to as RS-K-medoids++. Both RS-K-medoids++ and RS-K-means++ are nondeterministic methods because of the random sampling of the initial prototypes from a probability distribution constructed with the Euclidean distance (see Hämäläinen et al. 2017 and articles therein). A concrete sketch of this selection is given below.
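As a concrete instance of Algorithm 2, the following Python sketch (our illustration; it relies on scikit-learn and is not the authors' implementation) implements the RS-K-medoids++-style selection: K-means with k-means++ initialization, followed by picking the training point closest to each final prototype:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans


def rs_kmedoids_pp(X, Y, K, random_state=None):
    """Cluster X with K-means (k-means++ init) and return the point
    closest to each final prototype as a reference point."""
    km = KMeans(n_clusters=K, init="k-means++", n_init=10,
                random_state=random_state).fit(X)
    # Medoid index per cluster: training point nearest to each final prototype
    idx = cdist(km.cluster_centers_, X).argmin(axis=1)
    return X[idx], Y[idx]
```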

The unweighted pair group method with arithmetic mean (UPGMA; Sokal 1958) is an agglomerative clustering algorithm that starts from the initial state in which each point forms its own cluster. Then, in each step, the two clusters with the smallest average distance between their members are joined. The third evaluated method runs UPGMA on the data, computes the mean prototype of each cluster, and finally selects the point closest to each prototype as a reference point. Similarly to RS-K-medoids++, those closest points form the set of selected reference points. We refer to this method as RS-UPGMA.

The fourth evaluated method is based on the maximin clustering initialization algorithm (Gonzalez, 1985). The original method starts from a random initial point and then picks new points one at a time, similarly to the K-means++ method. However, unlike K-means++, the point with the largest distance to its closest already-selected point is chosen deterministically as the new point. Our modification of maximin selects the point closest to the data mean as the first point, making the whole algorithm completely deterministic. This approach is referred to as RS-maximin. We emphasize that the latter two approaches, RS-UPGMA and RS-maximin, are deterministic.
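The deterministic RS-maximin rule just described can be sketched as follows (again our illustration, with our own function names):

```python
import numpy as np
from scipy.spatial.distance import cdist


def rs_maximin(X, Y, K):
    """Deterministic maximin selection of K reference points: start from the
    observation closest to the data mean, then repeatedly add the observation
    farthest from the already selected set."""
    mean = X.mean(axis=0, keepdims=True)
    idx = [int(cdist(mean, X).argmin())]          # closest point to the data mean
    d_min = cdist(X[idx], X)[0]                   # distance of each point to the selected set
    for _ in range(K - 1):
        new = int(d_min.argmax())                 # farthest point from the selected set
        idx.append(new)
        d_min = np.minimum(d_min, cdist(X[[new]], X)[0])
    return X[idx], Y[idx]
```

Each round costs $N$ distance evaluations, giving the linear (in $N$) complexity reported in Table 1.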

One of the justifications for selecting this specific set of clustering methods is the highly different amount of separation between the selected reference points (see Figure 4 in Appendix B). Random selection has the smallest amount of separation among the reference points and the RS-maximin method has the largest; RS-K-means++, RS-K-medoids++, and RS-UPGMA interpolate between these two extremes. There are plenty of clustering methods available; the methods evaluated here are straightforward and easy to implement. Moreover, the MLM has only one hyperparameter, the number of reference points $K$ to be selected, which the methods keep unchanged.

A summary of the evaluated approaches is shown in Table 1, where the time complexities are also presented with respect to the number of training observations $N$. RS-K-means++, RS-K-medoids++, and RS-maximin have linear time complexity. UPGMA has quadratic complexity (Gronau and Moran, 2007); therefore, the complexity of RS-UPGMA is also quadratic, since the post-processing after the UPGMA clustering step has linear time complexity. Since the MLM training phase itself has a time complexity of $\mathcal{O}(NK^2)$ for solving Eq. (3) (de Souza Junior et al., 2015), a reference point selection method with a linear computational cost (with respect to $N$) and the ability to build an accurate model with a small $K$ is highly desirable.

Method          Based on                                        Deterministic  Type          Complexity
RS-K-means++    K-means++ initialization                        No             Partitional   $\mathcal{O}(N)$
RS-K-medoids++  K-means++ initialization and K-medoids          No             Partitional   $\mathcal{O}(N)$
                clustering
RS-UPGMA        Agglomerative clustering                        Yes            Hierarchical  $\mathcal{O}(N^2)$
RS-maximin      Maximin clustering initialization               Yes            Partitional   $\mathcal{O}(N)$

Table 1: Summary of the evaluated reference point selection approaches.
Figure 2: Distance regression model size in the MATLAB workspace (blue curve) and training time of the MLM's MATLAB implementation (orange curve) as a function of the relative number of reference points (see Eq. (8)) for the MNIST dataset. Available computing resources might force training the MLM model with a small/moderate set of reference points for very large datasets.

Figure 2 shows the motivation for the development of efficient reference point selection for a small/moderate set of reference points in terms of computational resources. Based on experiments with the MNIST dataset and an MLM MATLAB implementation, training the full MLM model requires 39 GB of space for the regression model coefficient matrix $\hat{\mathbf{B}}$. The training time (computation of the distance matrices $\mathbf{D}_x$ and $\mathbf{D}_y$) is 100 minutes. In comparison, training a sparse model with a 10% subset of reference points requires only 0.39 GB of space for the model and 43 seconds of training time. If we consider the MNIST8M dataset, which comprises 8.1 million images generated from the original MNIST dataset, the full MLM's $\hat{\mathbf{B}}$ matrix would require about 525 TB of space. The sparse model, with a 10% subset of reference points, would require 525 GB of space.

5 Experiments and results

5.1 Experimental setup

We selected 13 real datasets and two synthetic datasets (S1, BNK) to evaluate the reference point selection methods. The selected datasets are summarized in Table 2. All datasets have one-dimensional output values. The S1 dataset was modified for a regression task: we randomly selected 1000 observations from the original S1 data, scaled their values, and then computed the output values with a sine-based function. The original S1 dataset is available at http://cs.uef.fi/sipu/datasets/. The remaining datasets are available at http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html and at http://archive.ics.uci.edu/ml/index.php.

For a more rigorous comparison, we performed model selection and assessment as follows: we divided the original datasets into training–validation–test sets and performed cross-validation (see, e.g., Friedman et al. 2001, Chapter 7). More precisely, we used the 3-DOB-SCV (Moreno-Torres et al., 2012) approach to divide each dataset into a training and a test set. The test set is thereby forced to approximate the same distribution as the training set, which makes the comparison more reliable when concept drift is not considered. Because we focused only on regression tasks, we used DOB-SCV as a one-class case (Hämäläinen, 2018; Hämäläinen and Kärkkäinen, 2016). Moreover, we obtained three training sets and three test sets for each dataset, with sizes of 2/3 and 1/3 of the number of observations, respectively. In training, we used the 10-DOB-SCV approach to select the optimal number of reference points; hence, 90% of the training observations were used to train the model and 10% were used to compute the validation error. Therefore, we have a two-level division of the datasets.

Dataset # Observations # Features
Auto Price (AP) 159 15
Servo (SRV) 167 4
Breast Cancer (BC) 194 32
Computer Hardware (CHA) 209 6
Boston Housing (BH) 506 13
Forest Fires (FF) 517 12
Stocks (STC) 950 9
S1 (S1) 1000 2
Bank (BNK) 4499 8
Ailerons (ALR) 7129 5
Computer Activity (CA) 8192 12
Elevators (ELV) 9517 6
Combined Cycle Power Plant (CCP) 9568 4
California Housing (CH) 20640 8
Census (CNS) 22784 8
Table 2: Characteristics of the datasets used in the experiments.

We evaluated the quality of the models using the root mean squared error (RMSE). In addition to the validation error, a test error was also computed for all 10-DOB-SCV training sets, resulting in 10 test RMSEs for each training set and 30 test RMSEs for the overall dataset. For more interpretable results, we express the number of selected reference points in a relative manner:

$$K\% = \frac{K}{N} \cdot 100\%, \qquad (8)$$

where $N$ is the number of observations in the training data. In training, the number of reference points was varied over a fixed grid of $K\%$ values. We used the LLS method for the output prediction (Algorithm 1). To solve the linear systems of equations in the MLM implementation, we utilized MATLAB's mldivide function. We scaled all training observations to a common range. All the experiments were conducted in a MATLAB environment.

5.2 Results for optimal $K\%$

Table 3 in Appendix C shows the median test RMSE and the best number of reference points. The optimal number of reference points was selected based on the smallest mean validation RMSE. A symbol next to a result indicates a statistically significant difference between test RMSEs, based on a Kruskal–Wallis H test; dedicated symbols denote that a method has a statistically significantly smaller RMSE in a pairwise comparison to Random, RS-K-means++, RS-K-medoids++, RS-UPGMA, and RS-maximin, respectively. The same significance level was used in the pairwise comparisons. The Kruskal–Wallis H test assumes equal variances for the groups; therefore, we tested the equality of variances with a Brown–Forsythe test. Based on that test, the variances related to the optimal results are equal for all datasets. The best median test RMSE and the smallest number of reference points (with respect to the mean value) are shown in boldface for each dataset. Note that in Table 3 there are three optimal $K\%$ values for each method, since we used the 3-DOB-SCV approach in the experiments. Rounded Kruskal–Wallis scores are shown inside the brackets, and the best scores are in boldface. The dataset-wise ranking of the methods is calculated from the raw Kruskal–Wallis scores. Based on these rankings, the final ranking of the methods is shown at the bottom of Table 3. In addition, the average $K\%$ is also shown at the bottom of Table 3 for each method.

Based on Table 3, RS-UPGMA and RS-maximin perform equally well in the final ranking, while RS-K-medoids++ and RS-K-means++ perform similarly to each other. In terms of the final ranking and the model size ($K\%$), Random has the worst performance and the deterministic methods RS-UPGMA and RS-maximin have the best performance. In general, the clustering-based methods give sparser models that reduce the computational cost and the space complexity. In addition, the clustering-based models have better generalization ability. Based on the Kruskal–Wallis test, there are statistically significant differences between the methods for the CHA and BNK datasets, in favor of the deterministic methods. For the BNK dataset, RS-maximin builds the MLM model with only $K\% = 10$, while Random must select almost the entire dataset as reference points ($K\% = 90$) and still has a clearly larger RMSE. Reducing $K\%$ from 90 to 10 reduces the space requirement for the distance regression model coefficient matrix by a factor of 81. For large datasets where $N \gg P$ and $N \gg Q$, this coefficient matrix determines the space complexity of the full MLM model ($\mathcal{O}(N^2)$).

The selection of the best $K\%$ based on the smallest mean validation RMSE is dubious for some of the datasets, since the complexity of the model is not taken into account. For example, for a large dataset, if increasing $K\%$ from 50 to 100 leads to only a marginal improvement in the mean validation RMSE, the model with the higher $K\%$ and the marginally smaller mean validation RMSE is still selected. For instance, for the S1 dataset, RS-maximin already achieves the full MLM error level with a clearly smaller $K\%$ (see Table 6 and Table 7 in Appendix C). There is thus still room for improvement in this respect in future work.

5.3 Results for fixed $K\%$

In Appendix C, Tables 4–7 show the test RMSEs. They are organized like Table 3, but with a fixed number of reference points. Based on the Brown–Forsythe test of group variances, the variances of the error distributions are not equal for SRV, CHA, BH, FF, STC, S1, CA, and ELV for some of the fixed $K\%$ values. Therefore, the results given by the Kruskal–Wallis test are questionable for these cases. However, the ordering of the methods can still be compared.

As expected, based on the final ranking, all the proposed methods have better RMSE than Random when the number of reference points is small to moderate (Tables 4–7). RS-K-means++ has better RMSEs than RS-K-medoids++ for small $K\%$; thus, refinement of the reference points with K-means does not seem to be beneficial for small $K\%$. For larger $K\%$, in contrast, accuracy is improved by the K-means refinement. In general, the RS-maximin method obtained the best RMSE in the comparison. RS-UPGMA has results quite similar to those of RS-maximin for larger $K\%$; therefore, running the whole clustering (not only the initialization step) seems to work better for higher $K\%$ values. For the largest evaluated $K\%$, RS-UPGMA is the best approach based on the final ranking.

A drawback of RS-K-medoids++, RS-UPGMA, and RS-maximin is that if the data contain anomalies, these methods are prone to selecting them as reference points. This is probably the reason why Random obtains a smaller RMSE than RS-UPGMA and RS-maximin for the CH dataset with small to moderate $K\%$, since that dataset is known to contain some large anomalies. Therefore, we combined a simple anomaly detection method (k-nearest neighbors) with RS-UPGMA and tested it with the CH dataset. We observed that anomaly detection improved the test error for RS-UPGMA. Similar observations can also be drawn from the results for the S1 dataset. S1 is the cleanest dataset in our experiments: all input points are mapped to output points with sine-based function evaluations without any distortions. Based on Tables 4–7, RS-UPGMA and RS-maximin show larger error differences relative to Random for the S1 dataset than for any other dataset. Therefore, a robust variant of the MLM combined with RS-UPGMA or RS-maximin should be considered for regression tasks with anomalies.

5.4 Case S1: comparison of methods

To demonstrate the differences among the five approaches we examined, we ran only the reference point selection methods for the S1 data, considering 100 reference points (10%), and plotted the selected reference points (marked as blue pentagrams). These are shown in Figure 3 (Appendix B). The proposed methods clearly cover the data space better than Random. Notably, the difference between Random and RS-maximin is obvious: RS-maximin creates a grid of reference points in which the points are sparsely spread and approximately evenly spaced in the input space, whereas the reference points selected by Random accumulate near the cluster centers and near each other. Selecting reference points from near the data cloud boundaries improves the extrapolation capability of the MLM regression model (see Figures 6 and 7 in Hämäläinen 2018, p. 35).

In Figure 4 (Appendix B), the 500 smallest pairwise Euclidean distances among the selected 100 reference points for the S1 dataset are plotted in ascending order. The selected 100 reference points for each method are the same as in Figure 3. Figure 4 also illustrates the differences between the reference point selection approaches. Overall, Random selection is the worst and RS-maximin the best method for identifying well-separated sets of reference points that cover the input space in a balanced manner. Interestingly, the ordering of the methods' pairwise distance curves is the same as the ordering of the methods' RMSE performance.

As noted in the results of Section 5.3, the variances are not equal for several datasets based on the Brown–Forsythe test. Clustering-based reference point selection gives smaller variances than the Random method for a small $K\%$. This is illustrated in Figure 5 and Figure 6 for the S1 dataset: the variance of the RMSE for the Random method is 8 times larger than for the RS-maximin method for the smallest $K\%$, and when $K\%$ reaches 40, the variances are equal.

5.5 Discussion

We evaluated four clustering-based methods for the selection of reference points for the MLM. We focused on testing the methods against the Random approach in regression tasks with 15 datasets. An extensive experimental evaluation shows that the clustering-based methods can improve the performance of the MLM; a good set of reference points covers the data space well. When an optimal number of reference points is desired, RS-UPGMA and RS-maximin are valid choices. With respect to accuracy for a fixed number of reference points, RS-maximin is the best choice for low $K\%$ values, while for higher $K\%$ values RS-UPGMA and RS-maximin are the best choices. However, RS-maximin is the most efficient approach, since its computational cost with respect to the number of observations is linear, whereas RS-UPGMA has quadratic time complexity with respect to $N$. Together with the LLS method for the second step of the MLM, we obtain, on the whole, a very computationally efficient approach. Note that the deterministic reference point selection methods need to be run only once for each dataset during hyperparameter tuning, while, for example, RS-K-medoids++ must be rerun from scratch for each hyperparameter value. Moreover, the deterministic reference point selection methods reduce the MLM model's space and computational complexity, because they can build the optimal model with a smaller set of reference points.

Even though the maximin method is not recommended for K-means initialization based on the extensive study by Celebi et al. (2013), our study shows that it is a valid method for selecting reference points in the MLM. This indicates that reference point selection has a slightly different aim than K-means initialization. For example, based on the performed experiments, the maximin method selects points such that extreme points are very valuable if they are not anomalies; in terms of K-means initialization, in contrast, those points are far from the cluster centers and hence not optimal choices for clustering initialization.

Finally, the clustering-based methods are less robust to outliers than the Random approach. Therefore, integration with outlier detection, or the use of a robust approach for the mapping between the input and output distance matrices, should be considered for distorted datasets. Based on the experiments, it seems that reference point selection controls the balance between interpolation and extrapolation for the regression model. Selecting reference points from the boundaries of the data clouds improves the extrapolation ability, but might lead (in rare cases) to worse interpolation in the dense areas, as most likely occurred for the CH dataset.

6 Conclusion

In this paper, we addressed important open questions related to research on the MLM. Based on previous related works, we demonstrated the theory behind the MLM’s interpolation and universal approximation properties by considering the behavior of its two main components, the linear mapping between distance matrices and the multilateration for output estimation. Our results ensure the MLM’s generalization capability and indicate the role of the reference points in the bounded estimation error.

Motivated by our findings, we performed comprehensive computational experiments to evaluate different clustering-based approaches for selecting reference points for the MLM in regression scenarios. In summary, all the methods achieved better performance than standard random selection. The RS-maximin approach was the best choice due to its better generalization capability, compact model size, simplicity, and more efficient computational implementation.

In the future, it would be interesting to adapt and evaluate the presented methods for classification tasks as well. Moreover, we could also analyze how such methods are affected if the number of reference points is different for input and output spaces. The latter consideration may modify the reference point selection problem and might result in additional interpretations of the MLM’s generalization capability.

Acknowledgments

The work was supported by the Academy of Finland from grants 311877 and 315550. The authors would also like to thank the Cearense Foundation for the Support of Scientific and Technological Development (FUNCAP) for its financial support.

Appendix A Localization Linear System

Consider $T = \{t_1, \dots, t_K\}$, a set of known points in $\mathbb{R}^Q$. Suppose that the position of a point $y \in \mathbb{R}^Q$ is unknown, but its distances to each $t_k$, given by $\delta_k = \|y - t_k\|$, are known. Suppose that we have another point $t_b$, called the benchmark-anchor-node (BAN), such that $t_b$ and $\delta_b = \|y - t_b\|$ are also known. Thus, we have:

$$2\,(t_k - t_b)^\top (y - t_b) \;=\; \delta_b^2 - \delta_k^2 + \|t_k - t_b\|^2, \qquad k = 1, \dots, K, \qquad (9)$$

that is, a linear system $\mathbf{A}\mathbf{z} = \mathbf{b}$ with rows $[\mathbf{A}]_{k\cdot} = (t_k - t_b)^\top$, unknown $\mathbf{z} = y - t_b$, and right-hand side $b_k = \tfrac{1}{2}\left(\delta_b^2 - \delta_k^2 + \|t_k - t_b\|^2\right)$.

Thus, after solving the system $\mathbf{A}\mathbf{z} = \mathbf{b}$, we compute $y = \mathbf{z} + t_b$ to recover the position of $y$. Note that the BAN can be selected from $T$ and thus satisfies all the necessary conditions for the application of the technique.
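For completeness, the following display (our notation, matching the reconstruction above) shows how Eq. (9) follows from the squared-distance equations by expanding around the BAN and subtracting:

$$\|y - t_k\|^2 = \|y - t_b\|^2 - 2\,(t_k - t_b)^\top (y - t_b) + \|t_k - t_b\|^2 = \delta_k^2, \qquad \|y - t_b\|^2 = \delta_b^2,$$

and substituting the second equation into the first gives $2\,(t_k - t_b)^\top (y - t_b) = \delta_b^2 - \delta_k^2 + \|t_k - t_b\|^2$, i.e., one row of Eq. (9).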

Appendix B Figures

(a) Random
(b) RS-K-means++
(c) RS-K-medoids++
(d) RS-UPGMA
(e) RS-maximin
Figure 3: Selected 100 reference points for the S1 dataset.
Figure 4: The 500 smallest pairwise Euclidean distances among the selected 100 reference points for S1, in ascending order. The clustering-based methods select sets of reference points that are more separated from each other than those selected by the random approach.
Figure 5: Variances of the RMSE test errors for the S1 dataset.
Figure 6: Boxplot of the RMSE test errors for the S1 dataset.

Appendix C Tables

Random | RS-K-means++ | RS-K-medoids++ | RS-UPGMA | RS-maximin (median test RMSE and optimal $K\%$ per method for AP, SRV, BC, CHA, BH, FF, STC, S1, BNK, ALR, CA, ELV, CCP, CH, CNS).
Final rank (Kruskal–Wallis score) and average $K\%$: Random 5 (54), 64.44; RS-K-means++ 4 (48), 59.56; RS-K-medoids++ 3 (44), 58.44; RS-UPGMA 2 (40), 55.22; RS-maximin 1 (39), 56.33.

Table 3: RMSE for the optimal $K\%$.

Dataset | Random | RS-K-means++ | RS-K-medoids++ | RS-UPGMA | RS-maximin (test RMSE for AP, SRV, BC, CHA, BH, FF, STC, S1, BNK, ALR, CA, ELV, CCP, CH, CNS).
Final rank (Kruskal–Wallis score): Random 5 (55); RS-K-means++ 2 (42); RS-K-medoids++ 4 (52); RS-UPGMA 3 (43); RS-maximin 1 (33).

Table 4: RMSE for a fixed $K\%$.

Dataset | Random | RS-K-means++ | RS-K-medoids++ | RS-UPGMA | RS-maximin (test RMSE for AP, SRV, BC, CHA, BH, FF, STC, S1, BNK, ALR, CA, ELV, CCP, CH, CNS).
Final rank (Kruskal–Wallis score): Random 5 (58); RS-K-means++ 4 (52); RS-K-medoids++ 3 (46); RS-UPGMA 2 (41); RS-maximin 1 (28).

Table 5: RMSE for a fixed $K\%$.

Dataset | Random | RS-K-means++ | RS-K-medoids++ | RS-UPGMA | RS-maximin (test RMSE for AP, SRV, BC, CHA, ...).