## 1 Introduction

Many methods of speeding up the querying process of the kernel density estimator (KDE) have been proposed in the literature [silverman1982algorithm, yang2003improved, elgammal2003efficient]. As the optimization problem introduced in the Multithreshold Entropy Linear Classifier [melc] is closely related to the equations of KDE, it appears natural that similar techniques can be used to simplify its computations with a bounded error. The importance of such reductions comes from the high (quadratic) complexity of evaluating the functions required during training of this model, which makes it hard to use for any dataset with more than a thousand points. In this paper we investigate two such approaches. The first, sorting and discarding, skips the computation of similarities between points that are too far apart to have a big impact on the function’s value. The second, binning, smooths the function construction in order to heavily reduce the number of unique points. Both methods are introduced in an adaptive manner, so that the optimization process has a fixed error bound despite many different linear projections being analyzed during the training phase. We also show a very simple method which makes it possible to use a wide range of optimization algorithms even though the proposed model requires optimization under a specific (sphere-bounded) constraint.

## 2 Multithreshold Entropy Linear Classifier

Multithreshold Entropy Linear Classifier (MELC [melc]) has recently been proposed as an information-theoretic approach to building a model from the multithreshold linear family [anthony]. Its core idea is to find a linear operator (with unit norm) such that the kernel density estimates of the projected training samples of each class maximize the Cauchy-Schwarz Divergence (D [principe2010information]). Let us recall the equation of D in order to identify the core computational bottleneck which appears in MELC optimization

for being a kernel density estimator of with Silverman’s rule [silverman], thus from the definition of Renyi’s quadratic entropy, Renyi’s quadratic cross entropy and the fact that we have

As whole D function is composed of evaluations, in the rest of our paper we focus purely on the , which we expand using Gaussian kernel density estimation [melc] and denote .

where

is a sum of each classes estimated variances using Silverman’s rule

[silverman]. Naive computation of this function has quadratic complexity in the number of points, due to the summation over all possible pairs. In the following sections we focus on methods which reduce this computational bottleneck while still preserving a given approximation of its value.
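To make the bottleneck concrete, the following is a minimal sketch of the naive pairwise evaluation for already projected (one-dimensional) points. It assumes the standard Gaussian form of the cross term, i.e. a normalized sum of `exp(-(a - b)^2 / (2h))` over all pairs, with `h` the sum of the two classes' squared Silverman bandwidths. The function names `silverman_variance` and `cross_ip_naive` are hypothetical, and any additional scaling hyperparameter used by the model is deliberately left out.

```python
import math
from statistics import pvariance


def silverman_variance(xs):
    """Squared Silverman rule-of-thumb bandwidth for 1-D data (assumed form)."""
    n = len(xs)
    return (4.0 / (3.0 * n)) ** 0.4 * pvariance(xs)


def cross_ip_naive(a, b, h):
    """Normalized Gaussian similarity summed over all len(a) * len(b) pairs."""
    norm = 1.0 / (len(a) * len(b) * math.sqrt(2.0 * math.pi * h))
    total = 0.0
    for x in a:
        for y in b:  # every pair is visited: quadratic cost
            total += math.exp(-((x - y) ** 2) / (2.0 * h))
    return norm * total


# Hypothetical projected samples of the two classes.
a = [0.1, 0.4, 0.5, 1.2]
b = [0.9, 1.5, 1.7]
h = silverman_variance(a) + silverman_variance(b)
value = cross_ip_naive(a, b, h)
```

The double loop is exactly what the two approximation schemes try to shorten: one by skipping pairs, the other by merging points.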

## 3 Reduction of computational complexity

### Sorting and discarding

Let us begin with the very simple idea of computing values only for those pairs which are close enough to have an impact on the value of the function. If we assume that the projections of the points are sorted (which in general takes O(N log N) time for N points; in fact, for iterative optimization techniques the ordering of points does not change much between subsequent calls, so after the initial sorting it can be maintained in close to linear time using insertion sort), we can scan the dataset in linear time and identify, for each point, the indices of the first and last points which lie within the chosen distance threshold of it. The following theorem shows which threshold to choose in order to stay within a given acceptable error.

###### Theorem 1.

Using adaptive sorting and discarding with distance threshold in each iteration of at least

where is a sum of each classes estimated variances, leads to the computation of the function with at most error, assuming that at most fraction of points is located closer than .

###### Proof.

We assume that for pairs of points which are being ignored during computation of so thus

If we look for an approximation of non-regularized MELC objective we put and consequently

thus

obviously if then any satisfies this inequality (as it can only happen if we choose very big acceptable error ), so for simplicity we add the maximum of this value with .

∎
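The sorting and discarding idea can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the threshold `eps` is passed in by hand instead of being derived from Theorem 1, the window for each point is located by binary search (`bisect`) rather than by the single linear scan described above, and the name `cross_ip_discard` is hypothetical. The function also reports how many pairs were actually evaluated.

```python
import math
from bisect import bisect_left, bisect_right


def cross_ip_discard(a, b, h, eps):
    """Pairwise Gaussian sum that skips pairs farther apart than eps.

    Returns (value, number_of_pairs_evaluated).
    """
    b_sorted = sorted(b)
    norm = 1.0 / (len(a) * len(b) * math.sqrt(2.0 * math.pi * h))
    total, pairs = 0.0, 0
    for x in a:
        lo = bisect_left(b_sorted, x - eps)   # index of first point >= x - eps
        hi = bisect_right(b_sorted, x + eps)  # one past last point <= x + eps
        for y in b_sorted[lo:hi]:
            total += math.exp(-((x - y) ** 2) / (2.0 * h))
        pairs += hi - lo
    return norm * total, pairs
```

Every skipped pair would have contributed less than `exp(-eps**2 / (2 * h))` to the sum, so the approximation never exceeds the exact value and the gap is at most `exp(-eps**2 / (2 * h)) / sqrt(2 * pi * h)`. This crude bound follows directly from the kernel and is looser than the adaptive threshold of the theorem.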

### Binning

While the sorting and discarding technique is quite easy to implement and analyze, its practical speedup might be limited for densely packed datasets. In such cases it might be more valuable to bin the projected points, so that those located near each other are approximated by their empirical mean. Such an approach works well for densely packed datasets, which makes it complementary to the previous one.

Let us assume that we have some partitioning of the where each is an interval. We define a binning operator as , where . We use the following notation for simplicity . Similarly to the previous strategy, in order to preserve a good approximation, the bin width needs to be adapted in each iteration; the exact expression is given in the following theorem.

###### Theorem 2.

Using adaptive binning technique with bin width in each iteration at most

where is a sum of each classes estimated variances, leads to the computation of the function with at most error.

###### Proof.

we assume that so

Let us now assume that we are given some acceptable error . We will show how small bins have to be used based on our dataset and current projection.

but , so

thus

Naturally if then any satisfies this inequality (similarly to the sorting and discarding method, it may only happen if we choose very large acceptable error ) so we introduce maximum function here.

∎
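A minimal sketch of the binning idea is given below. It assumes equal-width bins aligned at zero and uses the empirical mean of each bin as its representative, weighted by the number of points in the bin. The bin `width` is supplied by hand rather than computed from Theorem 2, and the names `bin_points` and `cross_ip_binned` are hypothetical.

```python
import math
from collections import defaultdict


def bin_points(xs, width):
    """Group points into bins of the given width; return (mean, count) per bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x in xs:
        k = math.floor(x / width)  # index of the bin containing x
        sums[k] += x
        counts[k] += 1
    return [(sums[k] / counts[k], counts[k]) for k in sorted(counts)]


def cross_ip_binned(a, b, h, width):
    """Pairwise Gaussian sum over bin means, weighted by bin counts.

    Returns (value, number_of_pairs_evaluated).
    """
    bins_a, bins_b = bin_points(a, width), bin_points(b, width)
    norm = 1.0 / (len(a) * len(b) * math.sqrt(2.0 * math.pi * h))
    total = 0.0
    for mean_a, n_a in bins_a:
        for mean_b, n_b in bins_b:
            total += n_a * n_b * math.exp(-((mean_a - mean_b) ** 2) / (2.0 * h))
    return norm * total, len(bins_a) * len(bins_b)
```

The number of kernel evaluations now depends on the number of occupied bins rather than on the number of points, which is why the gain is largest for densely packed projections.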

Figure 1 shows how these two bounds behave with increasing size of the acceptable error. In particular, one can see that both methods have very similar growth (up to the maximization/minimization symmetry) as the acceptable error changes. As a result, because binning is a much more aggressive technique, we should expect that using these bounds as the actual bin width and discarding threshold will lead to a much greater reduction of the computational complexity when using binning.

## 4 Out of sphere optimization

We now show that the MELC objective function can be efficiently optimized in the whole space by adding a custom regularization term. The importance of this result is that it enables us to use a vast number of existing optimization techniques (such as adaptive gradient descent, Conjugate Gradients, BFGS, L-BFGS, etc.) without adapting them to the sphere constraint. The second important aspect is that this modification does not introduce any additional constants which would have to be fitted. The following theorem describes the modified objective function.

###### Theorem 3.

Given arbitrary sets and corresponding function we have:

and

###### Proof.

According to [melc], D is scale invariant so for any

As a result also

but as and we have that is maximized for with norm and that it is equal to . As a result sets of solutions of both problems are identical.

∎

Consequently, we can apply any advanced optimization technique which is not designed to work on the sphere to optimize the D criterion. In particular, we can use L-BFGS [byrd1995limited] instead of the more complex and less popular RBFGS [qi2010riemannian] or the previously proposed [melc], less efficient gradient descent on the sphere. At the same time, the norm of the candidate solution will stay close to 1, so we will not suffer from numerical problems [melc].

It is worth noting that despite its similarity to the L2 regularization [vapnik2000nature] of an additive loss function (or to weight decay in neural networks), this additional term serves no regularization purpose, nor does it affect the optimal function value. It only guides gradient-based optimizers towards more informative regions of the state space.

From the practical point of view we also need a gradient of the new function but thanks to the additivity of derivative operator we get

and we can use any optimization software able to maximize a function given its value and gradient.
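The effect of the added term can be illustrated on a toy problem. The sketch below does not use the MELC objective: as a stand-in it takes another scale-invariant function, the Rayleigh quotient of a diagonal matrix, and it assumes a penalty of the form `(||v|| - 1)^2`, which is one term with the properties described above and not necessarily the exact term of Theorem 3. Plain gradient ascent is used to keep the example dependency-free; in practice the value and gradient would be handed to an off-the-shelf routine such as `scipy.optimize.minimize` with `method="L-BFGS-B"` (applied to the negated objective). All function names are hypothetical.

```python
import math


def rayleigh(v, diag):
    """Scale-invariant toy objective: (v^T M v) / (v^T v) for diagonal M."""
    num = sum(m * x * x for m, x in zip(diag, v))
    den = sum(x * x for x in v)
    return num / den


def penalized_value_and_grad(v, diag):
    """Value and gradient of f(v) - (||v|| - 1)^2 (penalty form is assumed)."""
    norm = math.sqrt(sum(x * x for x in v))
    f = rayleigh(v, diag)
    value = f - (norm - 1.0) ** 2
    grad = [
        2.0 * (m - f) * x / norm ** 2      # gradient of the quotient
        - 2.0 * (norm - 1.0) * x / norm    # gradient of the penalty
        for m, x in zip(diag, v)
    ]
    return value, grad


def gradient_ascent(v, diag, step=0.1, iters=2000):
    """Unconstrained gradient ascent; no projection onto the sphere."""
    for _ in range(iters):
        _, g = penalized_value_and_grad(v, diag)
        v = [x + step * gx for x, gx in zip(v, g)]
    return v


v = gradient_ascent([3.0, 4.0], diag=[3.0, 1.0])
final_norm = math.sqrt(sum(x * x for x in v))
```

Although the starting point has norm 5 and the optimizer knows nothing about the sphere, the iterate ends up on the unit sphere at the maximizer of the original quotient, because the penalty vanishes there and the quotient itself does not depend on the norm.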

## 5 Evaluation

We evaluate the proposed approximations on 10 datasets from the UCI repository [uci] and libSVM’s repository [chang2011libsvm, ho1996building]. Both D and its approximations are coded in Python using numpy and scipy [jones2001scipy]. We use scipy’s optimization module to train all models using two optimization techniques: Conjugate Gradients (CG) and L-BFGS-B [byrd1995limited]. Each experiment is performed in a cross-validation manner with multiple starting points (randomly selected, but constant across methods to achieve comparable results), because MELC optimization converges to local optima. We analyze a range of values of the hyperparameter of D and of the acceptable error. Similarly to the original paper, we use Balanced Accuracy (BAC) as the measure of classification correctness, due to MELC’s highly balanced formulation.
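For reference, balanced accuracy can be computed as below. This sketch assumes the usual two-class definition, the mean of the true-positive rate and the true-negative rate; the function name and the label convention (`positive=1`) are hypothetical.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of the true-positive rate and the true-negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# A classifier that always predicts the majority class:
score = balanced_accuracy([1, 1, 1, 1, -1], [1, 1, 1, 1, 1])
```

In this example plain accuracy would be 0.8, while the balanced score is 0.5, which is why the measure suits a model that treats both classes symmetrically.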

First, we investigate how large the mean reduction in computations is for each of the approximation schemes. Table 1 reports the mean ratio of the number of exp function calls (which is equivalent to the number of pairs analyzed in each evaluation when optimizing the whole D function and its gradient) in a given method to that of the original implementation.

| dataset | CG: bin | CG: dist | L-BFGS-B: bin | L-BFGS-B: dist |
|---|---|---|---|---|
| australian | 0.11 | 0.44 | 0.11 | 0.45 |
| breast-cancer | 0.10 | 0.46 | 0.10 | 0.46 |
| diabetes | 0.21 | 0.56 | 0.22 | 0.54 |
| fourclass | 0.19 | 0.51 | 0.19 | 0.49 |
| german.numer | 0.15 | 0.47 | 0.19 | 0.46 |
| heart | 0.29 | 0.47 | 0.26 | 0.47 |
| ionosphere | 0.25 | 0.55 | 0.24 | 0.54 |
| liver-disorders | 0.29 | 0.65 | 0.31 | 0.67 |
| sonar | 0.32 | 0.53 | 0.29 | 0.50 |
| splice | 0.19 | 0.44 | 0.16 | 0.43 |

One can easily notice that the sorting and discarding method (denoted as "dist") roughly halves the number of analyzed pairs, while binning (denoted as "bin") reduces it 3-10 times. This is an obvious consequence of binning being a much more aggressive method. It appears that the strength of the reduction depends only on the dataset, not on the optimization algorithm used, which suggests that projections admitting a particular level of reduction are uniformly distributed over the space of all projections. These effects also depend heavily on the choice of the hyperparameter and the acceptable error (we do not include the exact values in the table for better readability), which is an obvious consequence of Theorems 1 and 2: with increasing variance (which scales with the hyperparameter) the reduction strength decreases superlinearly.

The set of heat maps in Figure 2 shows differences between BAC obtained by the original D and each approximation for a given dataset and hyperparameters pair. In general, up to few isolated cases errors are on the level of . For small values errors introduced by the approximation are significantly higher and for sonar and splice datasets can grow to even . Fortunately, these are very rare phenomena.

Even more interesting is the fact that for many experiments we actually noticed an increase in the BAC score (bluish elements). This might be the consequence of a rougher evaluation of the function (and gradient) values, which makes the optimization less prone to falling into local maxima. Our hypothesis is that it acts like a regularization which helps to train the MELC model.

Analysis of the number of iterations each optimization method requires to converge (see Table 2) shows that both approximations significantly simplify the problem. It is important to note that the number of iterations is not the number of D function evaluations (as both Conjugate Gradients and L-BFGS-B evaluate it multiple times in each iteration, especially during line searches). Consequently, the number of iterations cannot be used as a measure of optimization speed, but it says much about the complexity of the function being maximized.

| dataset | CG: bin | CG: D | CG: dist | L-BFGS-B: bin | L-BFGS-B: D | L-BFGS-B: dist |
|---|---|---|---|---|---|---|
| australian | 4 | 36 | 22 | 11 | 39 | 37 |
| breast-cancer | 4 | 35 | 8 | 6 | 39 | 14 |
| diabetes | 3 | 30 | 20 | 18 | 36 | 29 |
| fourclass | 4 | 12 | 10 | 6 | 15 | 14 |
| german.numer | 7 | 60 | 32 | 7 | 58 | 38 |
| heart | 3 | 40 | 19 | 12 | 34 | 20 |
| ionosphere | 5 | 600 | 216 | 18 | 384 | 152 |
| liver-disorders | 4 | 30 | 22 | 22 | 43 | 30 |
| sonar | 4 | 262 | 115 | 15 | 139 | 100 |
| splice | 4 | 92 | 26 | 14 | 65 | 41 |

This seems to confirm our claim that the approximation works similarly to regularization: it reduces small irregularities of the error surface by removing small elements from the internal summation.

Experiments also showed the importance of the regularization term added to perform out of sphere optimization. During maximization of D on the sonar and german datasets, the norm of the candidate solution grew rapidly when this modification was turned off while still using CG/L-BFGS-B. As a result, the optimization problem became extremely hard and we needed tens of thousands of D evaluations in order to converge. Adding the regularizing term reduced the norm to nearly 1 and the number of required function calls by two orders of magnitude.

## 6 Conclusions

In this paper we proposed two simple approximation schemes for faster computation of the MELC objective function and its gradient. We proved that in order to achieve a constant error bound during optimization one needs a specific adaptive strategy for each of them, and gave simple, closed-form equations for setting the required parameters based on the user-specified acceptable level of error in the function value. We also showed how one can easily change the objective function in order to use a wide range of existing optimizers while still working near the unit sphere which, as described in the MELC theory [melc], is important from the numerical point of view.

During extensive evaluation we confirmed that this approach is valid: it reduces the mean number of exp function calls by up to an order of magnitude without sacrificing the accuracy of the resulting classifiers. In fact, the experiments suggest that the proposed method acts like a kind of regularization, which might not only simplify the optimization problem but also slightly improve the obtained results.