I Introduction
In a growing number of applications, including computer vision, biometrics, text categorization and information retrieval, samples are often represented more naturally in terms of similarities between each other, rather than in an explicit feature vector space [1, 2]. Traditional machine-learning algorithms can still be used to learn over similarity-based representations; e.g., linear classification algorithms like Support Vector Machines (SVMs) [3, 4] can be trained in the space implicitly induced by the similarity measure (i.e., the kernel function) to learn nonlinear functions in input space. However, the main drawback of similarity-based techniques is their high computational complexity at test time, since computing their classification function often requires matching the input sample against a large set of reference prototypes, and evaluating such similarity measures is usually computationally demanding. Even SVMs, which induce sparsity in the number of required prototypes (the so-called support vectors, SVs), may not provide solutions that are sparse enough, as the number of prototypes (i.e., SVs) grows linearly with respect to the number of training samples [5, 6]. To reduce the number of reference prototypes, several state-of-the-art approaches select them from the training data, and then separately train the classification function using the reduced set of prototypes. However, decoupling these two steps may not effectively reduce the number of prototypes without significantly affecting classification accuracy [7, 2, 8].
In this work, we first discuss the relationship between current prototype-selection methods and our recently-proposed approach for learning super-sparse machines on similarity-based representations [9, 10, 11]. We then show that our approach can successfully tackle this issue by jointly learning the classification function along with an optimal set of virtual prototypes. The number of prototypes required by our approach can be either fixed a priori, or optimized through a carefully-designed, incremental cross-validation (CV) procedure. Creating a super-sparse set of virtual prototypes allows us to learn much sparser solutions, drastically reducing the computational complexity at test time, at the expense of a slightly increased computational complexity during training. A much smaller set of reference prototypes also provides decisions that are easier to interpret. We validate our approach on two application examples, namely biometric face verification and age estimation from faces. Our approach hardly affects the generalization capability of SVMs, LASSO [12], and ridge regression [13], while reducing their complexity by more than ten times and outperforming other reduction methods. We conclude this paper with a discussion of future research directions.
II Learning in Similarity Spaces
Two main approaches can be used to learn classification and regression functions in similarity spaces, i.e., functions that depend only on similarities between samples, and not on their features [2, 1]. The first one consists of computing an explicit representation of samples in a vector space that aims to preserve the similarity values between the original samples. A well-known example is related to kernel functions (i.e., positive semidefinite similarity functions), as one may exploit the eigenvalue or Cholesky decomposition of the kernel matrix to represent samples in an explicit vector space, called the empirical kernel mapping [8, 2, 7, 1, 19]. The underlying idea is to decompose the kernel (similarity) matrix computed on training samples as $K = \Phi \Phi^\top$, where $\Phi$ represents the training samples in the empirical kernel space. Then, at test time, never-before-seen samples should be mapped onto the same space, to be classified in a consistent manner, and this often requires matching them against all training samples [2, 1]. Clearly, the aforementioned decompositions are only possible for positive semidefinite similarities (i.e., kernel matrices). For indefinite similarities, different techniques can be exploited to account for the negative eigenvalues of $K$, and potentially exploit the same decompositions, or adapt the corresponding classification functions [2, 1, 19]. They include: spectrum clip, in which the negative eigenvalues are set to zero; spectrum flip, in which their absolute values are used; spectrum shift, in which all eigenvalues are increased by the absolute value of the minimum (negative) eigenvalue, such that the smallest one becomes exactly zero; and spectrum square, in which the eigenvalues are squared [1]. Notably, spectrum flip amounts to mapping the input data onto a pseudo-Euclidean space. This space consists of two Euclidean spaces: one for which the inner product is positive definite, and one for which it is negative definite. This enables computing quadratic forms (e.g., vector norms and inner products) as the difference of two positive-definite norms. Pseudo-Euclidean spaces are a particular case of finite-dimensional Kreĭn spaces consisting of two real subspaces. In general, indefinite kernels and similarities allow a consistent (e.g., infinite-dimensional) representation in a Kreĭn space [2, 1, 19].
The second approach consists of learning classifiers directly in the similarity space, i.e., exploiting similarities computed against a set of reference prototypes as feature values. This is equivalent to the former approach, if the spectrum-square technique is used [1, 2] (see the example in Fig. 1). It is worth remarking that some learning algorithms have been explicitly modified to deal with similarities, instead of adapting the similarity-based representation to existing learning algorithms. Examples can be found in the area of relational fuzzy clustering [20], and relational lexical variant generation [21, 22].
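The four spectrum modifications listed above can be illustrated with a minimal NumPy sketch. The function name `modify_spectrum` and the toy matrix `S` are assumptions made for illustration only; they do not come from the cited works.

```python
import numpy as np

def modify_spectrum(S, method="clip"):
    """Return a positive semidefinite version of a symmetric similarity matrix."""
    S = 0.5 * (S + S.T)                      # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S)     # real spectrum of a symmetric matrix
    if method == "clip":                     # negative eigenvalues set to zero
        eigvals = np.maximum(eigvals, 0.0)
    elif method == "flip":                   # absolute values of the eigenvalues
        eigvals = np.abs(eigvals)
    elif method == "shift":                  # add |min eigenvalue| if it is negative
        eigvals = eigvals - min(eigvals.min(), 0.0)
    elif method == "square":                 # squared eigenvalues, i.e. S @ S
        eigvals = eigvals ** 2
    else:
        raise ValueError(f"unknown method: {method}")
    # Rebuild the matrix: scale each eigenvector column by its new eigenvalue.
    return (eigvecs * eigvals) @ eigvecs.T

# Toy indefinite similarity matrix (hypothetical values).
S = np.array([[1.0, 0.9, -0.8],
              [0.9, 1.0, 0.7],
              [-0.8, 0.7, 1.0]])
```

Note that the "square" option returns `S @ S`, which is the Gram matrix obtained when the rows of `S` (similarities to all training samples) are used directly as feature vectors.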
Feature- and similarity-based representations may be thought of as two sides of the same coin: modifying the similarity measure amounts to modifying the implicit feature space in which the linear decision function operates, and vice versa. This means that, to achieve good generalization capabilities, it is necessary to properly define this space on the basis of the given learning algorithm. When one is given a similarity-based representation, this space can be manipulated essentially in two ways, i.e., by modifying either the similarity measure or the prototypes. Several approaches have been proposed to manipulate the similarity measure, including multiple kernel learning and similarity learning [14, 15]. They exploit a parametric similarity or distance measure, whose parameters are tuned by running a learning algorithm on the training data. In particular, in the case of multiple kernel learning, the goal is to learn the coefficients of a convex linear combination of a set of given kernels [14]. Conversely, only a few works have addressed the problem of selecting the reference prototypes to reduce the complexity of similarity-based classifiers [7], especially in the context of structured inputs like graphs and strings [16, 17]. In this respect, it is also worth remarking that the Nyström approximation can be exploited to approximate similarity matrices based on a subset of randomly-chosen prototypes, reducing the complexity of computing all pairwise similarities from $O(n^2)$ to $O(nm)$, where $n$ is the number of samples and $m$ the number of chosen prototypes. This approximation also works for indefinite kernels and similarities [18].
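As a rough illustration of the Nyström approximation mentioned above, the sketch below approximates a full similarity matrix from the similarities to a few randomly chosen prototypes. The RBF similarity, the data, and the names `rbf` and `nystrom` are assumptions made for this example.

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    """RBF similarities between the rows of X and the rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def nystrom(X, idx, similarity):
    """Approximate the n x n similarity matrix using only n x m evaluations."""
    C = similarity(X, X[idx])          # similarities to the m chosen prototypes
    W = C[idx]                         # m x m block among the prototypes
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                       # hypothetical data
idx = rng.choice(len(X), size=8, replace=False)    # randomly chosen prototypes
K_full = rbf(X, X)                                 # n^2 evaluations (reference only)
K_approx = nystrom(X, idx, rbf)                    # n * m evaluations
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
```

The columns of the full matrix that correspond to the chosen prototypes are reproduced up to numerical error; the remaining entries are only approximated, and `rel_err` measures by how much.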
The latter approach, based on learning classifiers directly in the similarity space, includes the solution we propose in this paper and it is of particular interest in applications where the similarity metric is: (i) predefined and cannot be modified, (ii) not given in analytical terms, and (iii) not necessarily positive semidefinite. For instance, in fingerprint recognition, the similarity measure is often defined a priori, as it encodes the knowledge of domain experts, and in most cases it is not positive semidefinite (it does not obey the triangle inequality). In addition, it is usually computed by a physical device, called a matcher, and it is not even analytically defined. Modifying the similarity measure in these cases is not possible. The only way of manipulating the space induced by such a measure consists of modifying the prototypes. The main limitation of the corresponding state-of-the-art approaches is however intrinsic to the fact that they separately select the prototypes and then learn the classification function [7]. The approach advocated in the next section, instead, is based on the idea of jointly optimizing the prototypes and the parameters of the classification function, to outperform existing reduction methods in similarity spaces. Furthermore, with respect to methods devoted to reducing the number of SVs in SVMs, our approach requires neither the similarity to be a positive semidefinite kernel, nor the learning algorithm to be an SVM [9]. It can be applied, in principle, to reduce the complexity of any similarity-based learning algorithm.
III Super-Sparse Virtual Vector Machines
We present here our approach to learn super-sparse machines, inspired by [9, 10, 11]. The underlying idea is to reduce the complexity of similarity-based learning algorithms by employing a very small set of virtual prototypes. The virtual prototypes obtained by our learning algorithm are not necessarily part of the training data, but are specifically created with the goal of retaining a very high generalization capability.
Let us denote with $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ the training data, where $x_i \in \mathcal{X}$ are the input samples and $y_i \in \mathcal{Y}$ are their labels. We consider here a vectorial representation of the input data, i.e., we assume that $\mathcal{X}$ is a vector space. The output space $\mathcal{Y}$ depends on whether we are considering a regression or classification problem. For regression, we consider $\mathcal{Y} = \mathbb{R}$, whereas for two-class classification, we set $\mathcal{Y} = \{-1, +1\}$. The set of virtual prototypes is denoted with $z = \{z_j\}_{j=1}^{m}$, where $m \ll n$ to obtain much sparser solutions. The value of $m$ can be either fixed a priori, depending on application-specific constraints (e.g., specific requirements on storage and time complexity, as in match-on-board biometric verification systems), or it can be optimized through a well-crafted CV procedure, as defined at the end of this section. Note that the virtual prototypes belong to the same input space as the training samples. We finally denote the similarity function with $s : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. It can be any symmetric similarity function (not necessarily positive semidefinite).
Our goal is to learn a discriminant function $f(x)$ consisting of a very sparse linear combination of similarities computed against the virtual prototypes:
$$f(x) = \sum_{j=1}^{m} \beta_j \, s(x, z_j) + b, \tag{1}$$
where $\beta = (\beta_1, \ldots, \beta_m)$ is the vector of coefficients (one per virtual prototype), and $b$ is the bias. Our approach is specifically designed to work in the similarity space, i.e., our basis functions are similarity functions $s(\cdot, z_j)$. The reason is that we aim to optimize the virtual prototypes $z_j$ without changing $s$, as in several applications such a function cannot be changed; e.g., in biometric verification, one is often given a matching function which is neither customizable nor even known analytically.
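To make the role of the similarity function concrete, here is a minimal sketch of how a discriminant of this form (a weighted sum of similarities to a few prototypes, plus a bias) is evaluated when the similarity is only available as a black-box callable. The names `discriminant` and `neg_manhattan`, and all numeric values, are hypothetical and chosen purely for illustration.

```python
import numpy as np

def neg_manhattan(a, b):
    """Toy black-box similarity: negative L1 distance (need not be a kernel)."""
    return -float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def discriminant(x, prototypes, beta, b, similarity):
    """Sparse linear combination of similarities to the prototypes, plus bias."""
    sims = np.array([similarity(x, z) for z in prototypes])  # m similarity calls
    return float(np.dot(beta, sims) + b)

prototypes = np.array([[1.0, 0.0], [0.0, 2.0]])   # m = 2 virtual prototypes
beta = np.array([0.5, -1.0])                      # one coefficient per prototype
score = discriminant(np.array([0.0, 0.0]), prototypes, beta, 0.1, neg_manhattan)
```

The test-time cost is exactly one similarity evaluation per prototype, which is why keeping the number of prototypes small matters when each evaluation is expensive.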
As in [9, 8], we consider a regression problem in which the goal is to minimize the distance between the target variables $y_i$ and the outputs $f(x_i)$ on the training points, with respect to the parameters of $f$, i.e., the coefficients $\beta$, the bias $b$, and the set of virtual prototypes $z$. We do not constrain the prototypes to be in $\mathcal{D}$, but enable the creation of novel (virtual) prototypes. This allows our algorithm to achieve a better trade-off between accuracy and the number of required prototypes.
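The problem can thus be formulated as:
$$\min_{\beta, b, z} \; \Omega(\beta, b, z) = \sum_{i=1}^{n} u_i \left( f(x_i) - y_i \right)^2 + \lambda \beta^\top \beta, \tag{2}$$
where the scalars $u_i$ balance the contribution of each training sample $x_i$ to the loss function (which may be useful when training classes are imbalanced), the quadratic regularizer $\beta^\top \beta$ controls overfitting, and $\lambda$ is a regularization parameter. Note that sparsity is not induced here by a sparse regularizer on $\beta$, but rather by setting $m$ to a small value. This approach is clearly linear in the space induced by the similarity function, but not necessarily in the input space $\mathcal{X}$, i.e., the similarity mapping can be used to induce nonlinearity as in kernel methods. The objective function in Problem (2) can be rewritten in matrix form:
$$\Omega(\beta, b, z) = (\hat{y} - y)^\top U (\hat{y} - y) + \lambda \beta^\top \beta, \qquad \hat{y} = S\beta + b\mathbf{1}, \tag{3}$$
where the column vectors $\hat{y}$ and $y$ consist of the values of $f(x_i)$ and $y_i$ for the training data, $U$ is a diagonal matrix such that $\mathrm{diag}(U) = u = (u_1, \ldots, u_n)$, $S$ is the $n \times m$ similarity matrix computed between $x_1, \ldots, x_n$ and the prototypes $z_1, \ldots, z_m$, and $\mathbf{1}$ is a column vector of $n$ ones.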
The objective function in Eq. (3) can be iteratively minimized by modifying $\beta$, $b$ and $z$. First, we randomly initialize the prototypes $z$ with training samples from $\mathcal{D}$, and then iteratively repeat the two steps described below.
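(1) $\beta$-step. The optimal coefficients $\beta$ and $b$ are computed while keeping the prototypes $z$ fixed. This amounts to solving a standard ridge regression problem, whose analytical solution is given by differentiating Eq. (3) with respect to $\beta$ and $b$ (with $z$ constant), and then setting the corresponding gradients to zero:
$$\begin{bmatrix} \beta \\ b \end{bmatrix} = M^{-1} \begin{bmatrix} S^\top U \\ u^\top \end{bmatrix} y, \qquad M = \begin{bmatrix} S^\top U S + \lambda I & S^\top u \\ u^\top S & u^\top \mathbf{1} \end{bmatrix}, \tag{4}$$
where $I$ denotes the identity matrix. Note that the system given by Eq. (4) can be iteratively solved without necessarily inverting $M$, e.g., using stochastic gradient descent [23].
(2) $z$-step. If the similarity function is differentiable, Eq. (3) can be minimized through gradient descent (as no closed-form solution exists for this problem). Differentiating with respect to a given $z_j$, we obtain:
$$\frac{\partial \Omega}{\partial z_j} = 2 (\hat{y} - y)^\top U \left( \beta_j \frac{\partial s_j}{\partial z_j} + S \frac{\partial \beta}{\partial z_j} + \mathbf{1} \frac{\partial b}{\partial z_j} \right) + 2 \lambda \beta^\top \frac{\partial \beta}{\partial z_j}, \tag{5}$$
where $s_j$ is the $j$-th column of $S$ and $\hat{y} = S\beta + b\mathbf{1}$ collects the values of $f$ on the training data. Note that all the derivatives computed with respect to $z_j$ here are vectors or matrices with the same number of columns as the dimensionality of $z_j$. To compute $\partial \beta / \partial z_j$ and $\partial b / \partial z_j$, as required by Eq. (5), we differentiate Eq. (4) with respect to $z_j$ and solve for the required quantities:
$$\begin{bmatrix} \partial \beta / \partial z_j \\ \partial b / \partial z_j \end{bmatrix} = -M^{-1} \begin{bmatrix} \beta_j S^\top + \bar{S}^\top \\ \beta_j \mathbf{1}^\top \end{bmatrix} U \frac{\partial s_j}{\partial z_j}, \tag{6}$$
where $\bar{S} = [\mathbf{0}, \ldots, \hat{y} - y, \ldots, \mathbf{0}]$ is an $n \times m$ matrix that consists of all zeros except for the $j$-th column, which is equal to $\hat{y} - y$, and $\mathbf{0}$ and $\mathbf{1}$ are column vectors of respectively zeros and ones.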
The complete algorithm is given as Algorithm 1. Note that we invert the $\beta$-step and the $z$-step detailed above for the sake of compactness. In fact, the coefficients $\beta$ and $b$ should always be updated after changing the prototypes. Our method can also be used to reduce the number of prototypes used by kernel-based or prototype-based classifiers, like the SVM, by setting the target variables $y$ to the values of the discriminant function of the target classifier for each training point. An example is reported in Fig. 2 (see Footnote 1). It is worth remarking that the virtual prototypes found by our algorithm are quite different from the SVs found by the SVM. In fact, the latter are close to the boundary of the discriminant function, whereas our prototypes are found approximately at the centers of small clusters of training points.
Footnote 1: Exploiting the values of a classifier's discriminant function as the target variables in our approach usually works better than using the true classification labels directly. The reason is that our approach uses the squared ($\ell_2$) loss, which is best suited to regression tasks. For classification, one should indeed exploit a loss function tailored to classification approaches, e.g., the hinge loss.
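Since the alternating procedure is only described in words here, the following NumPy sketch shows one way the two steps could be put together: a closed-form weighted ridge solution for the coefficients and the bias, followed by a gradient step on one prototype at a time. It is an illustration under stated assumptions, not the authors' Algorithm 1: it assumes an RBF similarity with parameter `gamma`, uniform sample weights, no penalty term, and a simple backtracking rule on the step size (our own addition, used only so that the objective never increases). All function names and default values are hypothetical.

```python
import numpy as np

def rbf(X, Z, gamma):
    """n x m RBF similarity matrix between samples X and prototypes Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def beta_step(S, y, u, lam):
    """Weighted ridge regression for (beta, b) with the prototypes fixed."""
    m = S.shape[1]
    M = np.empty((m + 1, m + 1))
    M[:m, :m] = S.T @ (u[:, None] * S) + lam * np.eye(m)
    M[:m, m] = S.T @ u
    M[m, :m] = u @ S
    M[m, m] = u.sum()
    sol = np.linalg.solve(M, np.append(S.T @ (u * y), u @ y))
    return sol[:m], sol[m], M

def objective(X, Z, y, u, lam, gamma):
    """Regularized weighted squared error after re-solving for (beta, b)."""
    S = rbf(X, Z, gamma)
    beta, b, _ = beta_step(S, y, u, lam)
    r = S @ beta + b - y
    return r @ (u * r) + lam * (beta @ beta)

def z_gradient(X, Z, y, u, lam, gamma, j):
    """Gradient of the objective w.r.t. prototype j, including the
    dependence of beta and b on that prototype."""
    S = rbf(X, Z, gamma)
    beta, b, M = beta_step(S, y, u, lam)
    m = Z.shape[0]
    r = S @ beta + b - y                            # residuals f(x_i) - y_i
    dS = 2.0 * gamma * S[:, [j]] * (X - Z[j])       # d s_j / d z_j  (n x d)
    top = beta[j] * (S.T @ (u[:, None] * dS))       # m x d
    top[j] += (u * r) @ dS
    bottom = beta[j] * (u @ dS)                     # d
    D = -np.linalg.solve(M, np.vstack([top, bottom]))
    dbeta, db = D[:m], D[m]                         # d beta / d z_j, d b / d z_j
    inner = beta[j] * dS + S @ dbeta + db           # db broadcasts over rows
    return 2.0 * (u * r) @ inner + 2.0 * lam * (beta @ dbeta)

def train(X, y, m, lam=0.1, gamma=0.5, step=0.5, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    u = np.ones(len(y))                             # uniform sample weights
    Z = X[rng.choice(len(X), size=m, replace=False)].copy()  # init from data
    current = objective(X, Z, y, u, lam, gamma)
    history = [current]
    for _ in range(n_iter):
        for j in range(m):                          # one prototype at a time
            g = z_gradient(X, Z, y, u, lam, gamma, j)
            eta = step
            while eta > 1e-8:                       # backtracking (assumption)
                Z_try = Z.copy()
                Z_try[j] -= eta * g
                value = objective(X, Z_try, y, u, lam, gamma)
                if value < current:
                    Z, current = Z_try, value
                    break
                eta *= 0.5
        history.append(current)
    beta, b, _ = beta_step(rbf(X, Z, gamma), y, u, lam)
    return Z, beta, b, history
```

A call such as `Z, beta, b, history = train(X, y, m=3)` returns the learned virtual prototypes together with the objective value after each pass. To mimic the reduction of an existing classifier, `y` could be set to that classifier's discriminant values on the training points, as described above.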
Gradient of $s$. In Eq. (6), the computation of the derivative of $s(x_1, z_j), \ldots, s(x_n, z_j)$ with respect to the corresponding $z_j$ depends on the given similarity measure $s$. If $s$ has an analytical representation, like in the case of kernels, the derivative can be easily computed; e.g., for the RBF kernel, $s(x_i, z_j) = \exp(-\gamma \|x_i - z_j\|^2)$, and $\partial s(x_i, z_j) / \partial z_j = 2 \gamma \, s(x_i, z_j) (x_i - z_j)$. Otherwise, the gradient can only be approximated numerically, by querying $s$ in a neighborhood of $z_j$. This is computationally costly, especially if $z_j$ is high dimensional. In the case of images, we have found that the similarity $s(x_i, z_j)$ tends to increase while shifting $z_j$ towards $x_i$, even if this shift is operated linearly in the space of the pixel values (e.g., by computing a convex combination of the two images) [9, 10, 11]. This amounts to approximating the gradient as
$$\frac{\partial s(x_i, z_j)}{\partial z_j} \approx x_i - z_j. \tag{7}$$
Although using this heuristic to approximate the gradient of $s$ may affect, in general, the convergence of our algorithm, in the next section we show that it works quite well even when the similarity function is not analytically given (i.e., for a graph-based face matching algorithm). Clearly, different heuristics may be considered, if the proposed one turns out not to be suited to the task at hand.
Prototype Initialization. Our approach might suffer from the intrinsic nature of the non-convex optimization problem faced in the $z$-step, i.e., when optimizing the virtual prototypes. In fact, due to the presence of multiple local minima (in which some prototypes may be too close to each other), our algorithm turns out to be quite sensitive to the initialization of the virtual prototypes [9, 10, 11]. To overcome this limitation, we propose the following strategy. Instead of running the algorithm multiple times with different initializations (which would increase the overall computational complexity of our approach), we modify the gradient defined in Eq. (5) to account for a penalty term. It aims to reduce similarities between virtual prototypes, preventing them from converging towards the same points. The penalty that we add to the gradient of Eq. (5) is computed from the similarities between $z_j$ and the other virtual prototypes. Since this term achieves large values for prototypes that are closer to $z_j$, it is clear that $z_j$ will be shifted away from them during optimization. We further multiply the penalty term by a coefficient that decays with the iteration count, to distance the prototypes sufficiently during the first iterations of the algorithm, without affecting the convergence of our algorithm.
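The gradient heuristic discussed above (replacing the derivative of the similarity with the direction from the prototype to the sample) can be checked on a case where the true gradient is known. The sketch below uses the RBF kernel only as a reference: there the exact gradient is a positive multiple of that direction, so the heuristic points the same way. Function names and values are illustrative assumptions.

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    """RBF similarity between two vectors."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def exact_rbf_gradient(x, z, gamma=0.5):
    """Analytical derivative of the RBF similarity with respect to z."""
    return 2.0 * gamma * rbf(x, z, gamma) * (x - z)

def heuristic_gradient(x, z):
    """Heuristic direction: move the prototype z towards the sample x."""
    return x - z

def shift_towards(x, z, t):
    """Convex combination of z and x, with t in [0, 1]."""
    return (1.0 - t) * z + t * x

x = np.array([1.0, 2.0])     # a training sample (hypothetical)
z = np.array([0.0, 0.0])     # a virtual prototype (hypothetical)
g_exact = exact_rbf_gradient(x, z)
g_heur = heuristic_gradient(x, z)
cosine = float(g_exact @ g_heur / (np.linalg.norm(g_exact) * np.linalg.norm(g_heur)))
```

For a matcher that is not analytically given, only `heuristic_gradient` would be available; whether shifting the prototype actually increases the similarity then has to be verified empirically, for instance by comparing `similarity(x, shift_towards(x, z, t))` with `similarity(x, z)` for a small `t`.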
Selecting the number of virtual prototypes. Another important issue in our method regards the selection of $m$. As already discussed, $m$ can be directly defined from specific constraints; otherwise, when some degree of flexibility is allowed, it is possible to tune $m$ by minimizing the number of prototypes without significantly compromising accuracy. To this end, we define an objective function characterizing this trade-off as
(8) 
and set $m$ to the value that minimizes it. In the above objective, the loss (error) measure is evaluated on a validation set, and a trade-off parameter weights the number of prototypes against it. For higher values of this parameter, fewer prototypes are selected, at the expense of higher error rates.
The value of $m$ that minimizes Eq. (8) can be efficiently selected through an incremental CV procedure which uses a grid search on some predefined values $m_1, \ldots, m_k$, as described in what follows. Let us assume that $m_1 > m_2 > \ldots > m_k$. We first learn our approach using the largest number of prototypes $m_1$. Then, to learn the solution with $m_2$ prototypes, we remove the $m_1 - m_2$ prototypes assigned to the smallest coefficients (in absolute value), and update the remaining coefficients and prototypes by re-running the learning algorithm from the current solution (warm start). We iterate until the most compact solution with $m_k$ prototypes is learned, and then select the value of $m$ that minimizes Eq. (8). An example is given in Fig. 3.
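The incremental selection procedure can be sketched as follows. Two simplifications are assumptions of this example rather than part of the method as described: the trade-off objective is taken to be the validation loss plus `rho` times the number of prototypes (one plausible form, not necessarily the exact objective used here), and after pruning only the coefficients are re-fitted, whereas the full procedure would also re-optimize the surviving prototypes from the warm start. All names are hypothetical.

```python
import numpy as np

def rbf(X, Z, gamma=0.5):
    """RBF similarities between the rows of X and the rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fit_coefficients(S, y, lam):
    """Ridge regression with an unpenalized bias on similarity features."""
    n, m = S.shape
    A = np.hstack([S, np.ones((n, 1))])
    R = lam * np.eye(m + 1)
    R[m, m] = 0.0                                  # do not regularize the bias
    sol = np.linalg.solve(A.T @ A + R, A.T @ y)
    return sol[:m], sol[m]

def select_m(X_tr, y_tr, X_val, y_val, Z, grid, rho, lam=1e-2, gamma=0.5):
    """Z must contain max(grid) initial prototypes (one per row)."""
    scores = {}
    for m in sorted(grid, reverse=True):           # largest model first
        beta, b = fit_coefficients(rbf(X_tr, Z, gamma), y_tr, lam)
        if len(beta) > m:
            # Drop the prototypes with the smallest |beta| and warm-start
            # from the survivors (prototype re-optimization omitted here).
            keep = np.argsort(-np.abs(beta))[:m]
            Z = Z[keep]
            beta, b = fit_coefficients(rbf(X_tr, Z, gamma), y_tr, lam)
        pred = rbf(X_val, Z, gamma) @ beta + b
        loss = float(np.mean(np.abs(pred - y_val)))   # validation error
        scores[m] = loss + rho * m                    # assumed trade-off form
    best = min(scores, key=scores.get)
    return best, scores
```

With `rho = 0` the procedure simply picks the value with the lowest validation error; increasing `rho` pushes the choice towards smaller prototype sets.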
Computational Complexity. As previously discussed, our learning algorithm consists of two steps. The $\beta$-step is computationally lightweight, as it only requires solving a linear system involving $m+1$ variables. Furthermore, this system has to be solved from scratch only at the first iteration, while for subsequent iterations one can exploit the previous solution as a warm start, and use an iterative algorithm to converge to the solution very quickly. The most computationally-demanding part is instead the $z$-step, in which we optimize one prototype at a time. Clearly this has to be repeated at each iteration. However, as our approach typically converges within 20 to 30 iterations, it is likely that it remains faster than other prototype-based learning algorithms. The reason is that the latter usually require computing the entire similarity matrix, which costs $O(n^2)$ operations. We will nevertheless discuss some techniques to reduce the training complexity of our approach in Sect. VI. As for the complexity at test time, it is clear that our approach drastically reduces it to $O(m)$ operations.
Structured Inputs. It is worth remarking that, if the prototypes are not represented in terms of feature vectors, but using more complex structures (e.g., graphs or strings), then the $z$-step of our algorithm cannot be optimized through gradient descent. More generally, one may define a set of minimal modifications to each prototype (e.g., adding or removing a vertex in a graph, or characters in a string), and select those that greedily minimize the objective in Eq. (3). Note however that this black-box optimization procedure is clearly computationally demanding, and further empirical investigations are required to validate its effectiveness and extend our approach to more generic, structured inputs.
IV Application Examples
We report here two application examples related to face verification and age estimation from faces [9, 10]. The goal of these examples is to show how and to what extent our super-sparse reduction approach can improve computational efficiency at test time, almost without affecting system performance.
IV-A Biometric Identity Verification from Faces
Face verification consists of validating if a client is claiming his/her real identity (genuine claim) or he/she is pretending to be someone else (impostor claim). To this end, his/her face is acquired through a camera and compared against the reference prototypes of the claimed identity. To save time and memory, few reference prototypes are used. The corresponding similarity values are then combined with heuristic schemes, separating prototype selection from the algorithm used to combine the similarities [9].
We train a one-vs-all SVM for each client. This automatically selects the optimal prototype gallery (i.e., the SVs), which however is often too large. We show that the proposed algorithm can successfully reduce the number of SVs, without affecting the recognition performance. We use the benchmark AT&T (Footnote 2) and BioID (Footnote 3) face datasets, respectively consisting of 40 clients with 10 face images each, and of 1,521 face images belonging to 23 clients. We assume that half of the clients are enrolled into the system, while the remaining ones are used only as impostors at test time. For training we randomly select 5 face images per enrolled client, and simulate impostors using face images of enrolled clients that do not belong to the claimed identity. At test time, we simulate impostors using the non-enrolled clients, and genuine claims using all the remaining face images of the claimed identity not used in the training set. This guarantees that the impostors are different between training and test sets (a common practice when testing biometric systems). Results are averaged over five different training-test pairs and client splits.
Footnote 2: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
Footnote 3: https://www.bioid.com/About/BioIDFaceDatabase
Matching algorithms. We use two different matching algorithms to compute similarities between faces.
1) Eigenface-based RBF Kernel [24, 25]. This algorithm maps each face image onto a reduced $d$-dimensional feature vector using principal component analysis (PCA). We select $d$ so as to preserve 95% of the variance of the data. We then use the RBF kernel $s(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ as the similarity measure, and set $\gamma$ as in [26]. On average, we got for AT&T, and for BioID.
2) Elastic Bunch Graph Matching (EBGM) [27]. It extracts a bunch of image descriptors (i.e., Gabor wavelets) at predefined landmark points in face images (e.g., eyes and nose). These points correspond to nodes in a graph, which is elastically deformed to account for different face orientations and expressions during matching. In this case, $s$ is neither analytically given, nor positive semidefinite.
Verification methods. We compare our approach to learn super-sparse SVMs (SSVM) with the SVMsel and SVMred techniques for reducing the number of SVs [8] (see Sect. V), and also consider the standard SVM for comparison.
We set the SVM regularization parameter $C$ by maximizing recognition accuracy through a 5-fold CV. As the two classes are highly unbalanced, we used a different $C$ value for each class, multiplying it by the prior probability of the opposite one, estimated from training data. To ensure fast convergence of our Algorithm 1 to a stable solution, we ran a set of preliminary experiments to set the gradient step for both datasets, and the regularization factor separately for the Eigenface-based RBF Kernel and for the EBGM on the BioID and AT&T data. The gradients of $s$ are analytically computable for the RBF Kernel; for the EBGM (which is not given analytically) we use the approximate gradient given in Eq. (7).
Results. Fig. 4 shows the fraction of incorrectly-rejected genuine claims (false rejection rate, FRR) vs. the fraction of incorrectly-accepted impostors (false acceptance rate, FAR) for each method, obtained by varying each client-specific threshold, and then by averaging over all clients and repetitions. The average number of matchings (prototypes) required by each method at test time is also reported. Except for SVM, this number has to be fixed in advance: for SVMsel, SVMred, and SSVM we respectively set it to 10, 2 and 2, when using the RBF Kernel, and to 5, for all methods, when using the EBGM.
Our SSVM achieves performance comparable to that of the SVM while using only 2 and 5 virtual prototypes (instead of more than 20 and 15), respectively, for the RBF kernel and EBGM matching algorithms. Conversely, SVMsel and SVMred require a higher number of prototypes to achieve a comparable performance. This highlights that a principled approach may guarantee high recognition accuracy and efficiency using an extremely sparse set of virtual prototypes, even without knowing the matching algorithm analytically.
Interpretability. An example of the virtual prototypes found by our SSVM is shown in Fig. 6. We can appreciate that the genuine (virtual) prototypes are obtained by merging genuine prototypes, preserving the aspect of the given client. Impostor prototypes are instead the combination of faces of different identities, to compactly represent information about impostors. Although these prototypes do not correspond to any real user, interestingly they still resemble face images. This makes our SSVM approach interpretable, in the sense that a face image is considered genuine if it is sufficiently similar to the genuine prototypes found by our algorithm (and different from the impostor ones).
IV-B Age Estimation from Faces
The goal here is to predict the age of a person from a photograph of his/her face. We tackle this problem as a regression task, as do most existing methods, and show that our approach can be helpful in this context to: (i) speed up age estimation at test time by dramatically reducing the number of reference prototypes; (ii) provide more interpretable decisions; and (iii) mitigate the risk of overfitting to specific face databases. The experimental setup is similar to the one defined in our previous work [10].
Datasets. We use two publicly-available benchmark face databases: FgNet Aging and FRGC. FgNet is the main database for this task. It includes about 1,000 images of 82 subjects acquired in a totally uncontrolled condition, which makes it particularly challenging. Many images are blurred, exhibit different resolutions and illumination conditions, and the number of subjects per age is not equally distributed. The age range for each subject varies from 0 to 69 years, although the majority of images belong to 20-year-old people. FRGC consists of about 50,000 face images acquired in different time sessions, belonging to about 500 people (about 200 females and 300 males) of different ethnicity, with ages spanning from 17 to 69 years. Face images were acquired in a controlled indoor environment, in frontal pose, to facilitate the recognition task. To keep the complexity of our experiments manageable, we restrict our analysis to a subset of about 5,000 images, randomly selected from this dataset. The age distributions of both datasets are shown in Fig. 7.
Experimental Setup. We normalize images as discussed in [10], and reduce the resulting set of 19,500 features (i.e., pixel values) through linear discriminant analysis (LDA), retaining the maximum number of components (i.e., the number of different age values minus one). We evaluate performance in terms of Mean Absolute Error: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{a}_i - a_i|$, where $\hat{a}_i$ is the regression estimate of our approach for the $i$-th subject, whose true age is $a_i$, and $N$ is the number of test images. We average results using a 5-fold CV procedure where each subject appears only in one fold. We use the RBF kernel as the similarity measure. We consider LASSO [12] and ridge [13] regression, and optimize their regularization parameter through CV. We compare our approach against the following prototype-selection methods [16, 17]: Random (PSR), which randomly selects prototypes from the training data; Border (PSB), which selects prototypes from the frontier of the training data; Spanning (PSS), which selects the first prototype as the training-set median, and the remaining ones through an iterative procedure that maximizes the distance to the set of previously-selected prototypes; and $k$-medians (PSKM), which runs $k$-means clustering to obtain $k$ clusters from the training set, and then selects the prototypes as their set medians (i.e., the medians of the clusters). We optimize $m$ according to the CV procedure defined in Sect. III, using the MAE as the loss function in Eq. (8). We consider also a cross-database scenario in which training and test sets are drawn from different databases, to verify if prototype-selection methods are less prone to overfitting.
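Two ingredients of this setup, the Mean Absolute Error and a fold assignment in which each subject appears in only one fold, can be sketched as below. The helper names, the round-robin assignment of subjects to folds, and the toy subject identifiers are assumptions made for illustration; the actual split used in the experiments is not specified beyond the subject-disjointness constraint.

```python
import numpy as np

def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages."""
    true_ages = np.asarray(true_ages, dtype=float)
    predicted_ages = np.asarray(predicted_ages, dtype=float)
    return float(np.mean(np.abs(predicted_ages - true_ages)))

def subject_folds(subject_ids, n_folds=5, seed=0):
    """Assign each image to a fold so that all images of a subject share it."""
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)                          # random order of subjects
    fold_of = {s: k % n_folds for k, s in enumerate(subjects)}  # round-robin
    return np.array([fold_of[s] for s in subject_ids])

subject_ids = [1, 1, 2, 2, 3, 3, 4, 5, 6, 7]       # one entry per image (toy)
folds = subject_folds(subject_ids, n_folds=5)
```

Images with `folds == k` form the test set of the k-th split, so that no subject contributes images to both training and test data within a split.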
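TABLE I: MAE of each method, with the average number of prototypes in parentheses. Column headers denote the training/test databases.
| Method     | FgNet/FgNet  | FgNet/FRGC   | FRGC/FgNet     | FRGC/FRGC     |
|------------|--------------|--------------|----------------|---------------|
| Ridge      | 8.00 (781.6) | 8.46 (781.6) | 14.85 (2747.2) | 4.53 (2747.2) |
| PSR Ridge  | 9.93 (5.0)   | 9.11 (5.0)   | 12.98 (4.0)    | 4.10 (4.0)    |
| PSB Ridge  | 36.78 (5.0)  | 30.48 (5.0)  | 26.34 (4.0)    | 17.35 (3.2)   |
| PSS Ridge  | 11.13 (5.0)  | 10.36 (5.0)  | 15.29 (4.0)    | 4.85 (4.0)    |
| PSKM Ridge | 10.45 (5.0)  | 13.94 (5.0)  | 13.62 (4.0)    | 4.04 (4.0)    |
| SRidge     | 9.06 (5.2)   | 7.96 (4.6)   | 14.42 (4.4)    | 4.31 (4.2)    |
| LASSO      | 7.92 (60.0)  | 8.99 (60.2)  | 14.71 (20.8)   | 4.67 (22.4)   |
| PSR LASSO  | 11.77 (5.0)  | 7.81 (7.0)   | 14.54 (3.0)    | 5.40 (2.4)    |
| PSB LASSO  | 36.78 (5.2)  | 28.66 (7.0)  | 25.80 (2.6)    | 17.48 (2.2)   |
| PSS LASSO  | 10.54 (5.0)  | 10.34 (7.2)  | 14.15 (3.0)    | 4.48 (3.0)    |
| PSKM LASSO | 12.54 (5.0)  | 9.96 (7.0)   | 15.37 (3.0)    | 5.14 (3.0)    |
| SLASSO     | 7.99 (6.2)   | 9.09 (7.6)   | 14.76 (3.6)    | 4.75 (4.0)    |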
Results. As shown in Table I, both for the standard and cross-database evaluations, our algorithms exhibit almost the same performance as their non-sparse versions, despite using far fewer prototypes. They also often outperform the competing prototype-selection methods, or achieve a comparable performance.
Interpretability. Interpretability of decisions is important to determine whether meaningful aging patterns are learned, i.e., if the age of a subject can be correctly predicted from face images of different datasets. Fig. 6 shows a set of prototypes found by our super-sparse LASSO algorithm, which correctly assigns higher values to older people.
V Related Work
Besides work in similarity-based learning, discussed in Sects. II-III, another line of research related to super-sparse learning in similarity spaces is the one related to SVM reduction approaches, i.e., approaches that aim at reducing the number of SVs of an SVM, or at learning SVMs with reduced complexity directly [8, 28, 29]. It is worth remarking, however, that our approach is not specifically designed for SVMs and kernel machines, as it can be in principle applied to generic similarity functions and learning algorithms. Moreover, the first reported application example on face verification has also demonstrated that some of the existing methods for reduction of SVs in SVMs (i.e., SVMsel and SVMred) cannot achieve reduction rates and accuracy that are comparable with those achieved by our method. These methods in practice reduce the number of SVs by minimizing the $\ell_2$ distance computed between the hyperplane normal of the given (unpruned) SVM and that of the reduced SVM in kernel space, as a function of the dual coefficients and the set of reduced prototypes [8]. While the coefficients can be analytically found, as in our method, the choice of the reduced prototypes is different: SVMsel eliminates one prototype at each iteration from the initial set using Kernel PCA to drive the selection; and SVMred creates a new prototype at each iteration by minimizing the aforementioned $\ell_2$-norm distance. Both approaches are thus greedy, as the reduced set of prototypes is constructed iteratively by removing or adding a prototype at a time, up to the desired number $m$. The reason for the superior performance exhibited by our algorithm is thus twofold: (i) SVMsel and SVMred require the matching algorithm to be a positive semidefinite kernel to uniquely find the coefficients (Footnote 4); (ii) they do not modify the prototypes that are already part of the reduced expansion, and do not even reconsider the discarded ones. Our approach overcomes such limitations by optimizing a different objective (suited to non-positive semidefinite kernels too) and by iteratively modifying the virtual prototypes during the optimization. These may be common advantages also with respect to more recent reduction methods, but this deserves further investigation [28, 29].
Footnote 4: In fact, the notion of hyperplane exploited in their objective is only consistent for positive semidefinite kernels.
VI Summary and Open Problems
The proposed approach aims to tackle the computational complexity of similarity-based classifiers at test time. It builds on [9, 10, 11], where we originally defined our super-sparse learning machines. Here we have further extended our approach, especially from an algorithmic viewpoint, by including a penalty term to reduce sensitivity to the initialization of the virtual prototypes, and by designing a specific CV procedure to tune the number of prototypes. We remark that we have not considered multiclass classification problems in this work; however, an extension of our approach to deal with them has already been developed, exhibiting outstanding performance on image classification [11].
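To make the test-time argument concrete, the following minimal sketch counts the similarity evaluations needed to classify one sample. It assumes a discriminant function of the form f(x) = sum_j beta_j s(x, z_j) + b; the RBF similarity, the function names, and the prototype counts (500 versus 5) are illustrative assumptions, and the prototypes are random placeholders rather than learned virtual prototypes.

```python
import numpy as np

def rbf_similarity(x, z, gamma=0.5):
    """Stand-in similarity measure (assumption): RBF kernel between two vectors."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_function(x, prototypes, coeffs, bias, similarity=rbf_similarity):
    """Evaluate f(x) = sum_j coeffs[j] * s(x, prototypes[j]) + bias.

    Returns the score and the number of similarity evaluations performed,
    which is the dominant cost at test time.
    """
    sims = np.array([similarity(x, z) for z in prototypes])
    return float(coeffs @ sims + bias), len(prototypes)

rng = np.random.default_rng(0)
x = rng.normal(size=16)  # one test sample

# Hypothetical dense expansion: 500 reference prototypes (e.g., SVs).
dense_protos = rng.normal(size=(500, 16))
dense_coeffs = rng.normal(size=500)
dense_score, dense_cost = decision_function(x, dense_protos, dense_coeffs, 0.0)

# Hypothetical super-sparse expansion: 5 virtual prototypes (placeholders).
sparse_protos = rng.normal(size=(5, 16))
sparse_coeffs = rng.normal(size=5)
sparse_score, sparse_cost = decision_function(x, sparse_protos, sparse_coeffs, 0.0)

print(dense_cost, sparse_cost)  # 500 5
```

The cost per test sample scales linearly with the number of prototypes, so the saving is larger the more expensive each similarity evaluation is (e.g., a face-matching algorithm instead of the RBF stand-in used here).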
Our future research directions aim at designing machines that are very efficient (and interpretable) at test time. We will first explore different possibilities to overcome the computational bottleneck of our approach during training; e.g., similarities between the virtual prototypes and the training samples can be recomputed only after a given number of iterations, instead of at each shift of a virtual prototype. During the intermediate steps, the similarity values can be updated using a first-order approximation, which is provided by the computations performed at the previous iterations. Another interesting research direction consists of exploiting our super-sparse reduction approach to reduce the complexity of other nonparametric estimators, like kernel density estimators and nearest neighbors. Finally, it would also be interesting to extend our approach to handle complex input structures like graphs and strings, considering efficient black-box optimization techniques.

References
 [1] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” J. Mach. Learn. Res., vol. 10, pp. 747–776, March 2009.
 [2] E. Pȩkalska, P. Paclík, and R. P. W. Duin, “A generalized kernel approach to dissimilarity-based classification,” J. Mach. Learn. Res., vol. 2, pp. 175–211, Dec. 2001.
 [3] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc., 1995.
 [4] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sept. 1995.
 [5] I. Steinwart, “Sparseness of support vector machines,” J. Mach. Learn. Res., vol. 4, pp. 1071–1105, Nov. 2003.
 [6] O. Chapelle, “Training a support vector machine in the primal,” Neural Comput., vol. 19, no. 5, pp. 1155–1178, May 2007.
 [7] E. Pȩkalska, R. P. W. Duin, and P. Paclík, “Prototype selection for dissimilarity-based classifiers,” Patt. Rec., vol. 39, no. 2, pp. 189–208, Feb. 2006.
 [8] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, “Input space versus feature space in kernel-based methods,” IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1000–1017, Sept. 1999.
 [9] B. Biggio, M. Melis, G. Fumera, and F. Roli, “Sparse support faces,” in Proc. Int. Conf. Biometrics, 2015, pp. 208–213.
 [10] A. Demontis, B. Biggio, G. Fumera, and F. Roli, “Supersparse regression for fast age estimation from faces at test time,” in Proc. 18th Int. Conf. Image Analysis and Processing, 2015, pp. 551–562.
 [11] M. Melis, L. Piras, B. Biggio, G. Giacinto, G. Fumera, and F. Roli, “Fast image classification with reduced multiclass support vector machines,” in Proc. 18th Int. Conf. Image Analysis and Processing, 2015, pp. 78–88.
 [12] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal Stat. Soc. (Ser. B), vol. 58, no. 1, pp. 267–288, 1996.
 [13] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, Feb. 1970.
 [14] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” J. Mach. Learn. Res., vol. 7, pp. 1531–1565, Dec. 2006.
 [15] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, March 2010.
 [16] K. Riesen, M. Neuhaus, and H. Bunke, “Graph embedding in vector spaces by means of prototype selection,” in Proc. 6th Int. Conf. Graph-Based Repr. in Patt. Rec., 2007, pp. 383–393.
 [17] B. Spillmann, M. Neuhaus, H. Bunke, E. Pȩkalska, and R. P. W. Duin, “Transforming strings to vector spaces using prototype selection,” in Proc. Joint IAPR Int. Conf. Structural, Syntactic, and Stat. Patt. Rec., 2006, pp. 287–296.
 [18] A. Gisbrecht, B. Mokbel, and B. Hammer, “The Nyström approximation for relational generative topographic mappings,” in NIPS Workshop on Challenges of Data Visualization, 2010.
 [19] E. Pȩkalska and B. Haasdonk, “Kernel discriminant analysis for positive definite and indefinite kernels,” IEEE Trans. Patt. An. and Mach. Intell., vol. 31, no. 6, pp. 1017–1032, June 2009.
 [20] R. J. Hathaway and J. C. Bezdek, “NERF c-means: Non-Euclidean relational fuzzy clustering,” Patt. Rec., vol. 27, no. 3, pp. 429–437, March 1994.
 [21] D. O. Seaghdha and A. Copestake, “Using lexical and relational similarity to classify semantic relations,” in Proc. 12th Conf. Assoc. for Computational Linguistics, 2009, pp. 621–629.
 [22] T. Pedersen, S. V. S. Pakhomov, S. Patwardhan, and C. G. Chute, “Measures of semantic similarity and relatedness in the biomedical domain,” J. of Biomedical Informatics, vol. 40, no. 3, pp. 288–299, June 2007.
 [23] T. Zhang, “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 116–123.
 [24] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
 [25] K. Jonsson, J. Kittler, Y. Li, and J. Matas, “Support vector machines for face authentication,” Image and Vision Comput., vol. 20, no. 5–6, pp. 369–375, April 2002.
 [26] C.-C. Chang and C.-J. Lin, “LibSVM: A library for support vector machines,” ACM Trans. Intelligent Systems and Technology, vol. 2, no. 3, pp. 5412–5475, April 2011.
 [27] J. Beveridge, D. Bolme, B. Draper, and M. Teixeira, “The CSU face identification evaluation system,” Machine Vision and Applications, vol. 16, no. 2, pp. 128–138, Feb. 2005.
 [28] S. S. Keerthi, O. Chapelle, and D. DeCoste, “Building support vector machines with reduced classifier complexity,” J. Mach. Learn. Res., vol. 7, pp. 1493–1515, July 2006.
 [29] Z. Wang, K. Crammer, and S. Vucetic, “Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training,” J. Mach. Learn. Res., vol. 13, pp. 3103–3131, Oct. 2012.