The effectiveness of many machine learning and data mining algorithms depends on an appropriate measure of pairwise distance between data points that accurately reflects the learning task, e.g., prediction, clustering or classification. The kNN classifier, K-means clustering, and the Laplacian-SVM semi-supervised classifier are examples of such distance-based machine learning algorithms. In settings where there is clean, appropriately-scaled spherical Gaussian data, standard Euclidean distance can be utilized. However, when the data is heavy tailed, multimodal, or contaminated by outliers, observation noise, or irrelevant or replicated features, use of Euclidean inter-point distance can be problematic, leading to bias or loss of discriminative power.
To reduce bias and loss of discriminative power of distance-based machine learning algorithms, data-driven approaches for optimizing the distance metric have been proposed. These methodologies, generally taking the form of dimensionality reduction or data “whitening”, aim to utilize the data itself to learn a transformation of the data that embeds it into a space where Euclidean distance is appropriate. Examples of such techniques include Principal Component Analysis Bishop (2006), Multidimensional Scaling Hastie et al. (2005), covariance estimation Hastie et al. (2005); Bishop (2006), and manifold learning Lee and Verleysen (2007). Such unsupervised methods do not exploit human input on the distance metric, and they rely heavily on prior assumptions, e.g., local linearity or smoothness.
In distance metric learning one seeks to learn transformations of the data that are well matched to a particular task specified by the user. Point labels or constraints indicating point similarity or dissimilarity are used to learn a transformation of the data such that similar points are “close” to one another and dissimilar points are distant in the transformed space. Learning distance metrics in this manner allows a more precise notion of distance or similarity to be defined that is related to the task at hand.
Many supervised and semi-supervised distance metric learning approaches have been developed Kulis (2012). These include online algorithms Kunapuli and Shavlik (2012) with regret guarantees for situations where similarity constraints are received sequentially.
This paper proposes a new method that provides distance metric tracking. Specifically, we suppose the underlying ground-truth (or optimal) distance metric from which constraints are generated is evolving over time, in an unknown and potentially nonstationary way. We propose an adaptive, online approach to track the underlying metric as the constraints are received. Our algorithm, which we call COMID-Strongly Adaptive Dynamic Learning (COMID-SADL), is inspired by recent advances in composite objective mirror descent for metric learning Duchi et al. (2010b) (COMID) and the Strongly Adaptive Online Learning (SAOL) framework proposed in Daniely et al. (2015). We prove strong bounds on the dynamic regret of every subinterval, guaranteeing strong adaptivity and robustness to nonstationary metric drift such as discrete shifts, slow drift with a nonstationary drift rate, and combinations thereof.
1.1 Related Work
Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are classic examples of using linear transformations for projecting data into more interpretable low dimensional spaces. Unsupervised PCA seeks to identify a set of axes that best explain the variance contained in the data. LDA takes a supervised approach, minimizing the intra-class variance and maximizing the inter-class variance given class labeled data points.
Much of the recent work in Distance Metric Learning has focused on learning Mahalanobis distances on the basis of pairwise similarity/dissimilarity constraints. These methods have the same goals as LDA; pairs of points labeled “similar” should be close to one another while pairs labeled “dissimilar” should be distant. MMC (Xing et al., 2002), a method for identifying a Mahalanobis metric for clustering with side information, uses semidefinite programming to identify a metric that maximizes the sum of distances between points labeled with different classes subject to the constraint that the sum of distances between all points with similar labels be less than some constant.
Large Margin Nearest Neighbor (LMNN) (Weinberger et al., 2005) similarly uses semidefinite programming to identify a Mahalanobis distance. In this setting, the algorithm minimizes the sum of distances between a given point and its similarly labeled neighbors while forcing differently labeled neighbors outside of its neighborhood. This method has been shown to be computationally efficient (Weinberger and Saul, 2008) and, in contrast to the similarly motivated Neighborhood Component Analysis (Goldberger et al., 2004), is guaranteed to converge to a globally optimal solution. Information Theoretic Metric Learning (ITML) (Davis et al., 2007) is another popular Distance Metric Learning technique. ITML minimizes the Kullback-Leibler divergence between an initial guess of the matrix that parameterizes the Mahalanobis distance and a solution that satisfies a set of constraints. For surveys of the vast metric learning literature, see (Kulis, 2012; Bellet et al., 2013; Yang and Jin, 2006).
In a dynamic environment, it is necessary to track the changing metric at different times, computing a sequence of estimates of the metric, and to be able to compute those estimates online. Online learning Cesa-Bianchi and Lugosi (2006) meets these criteria by efficiently updating the estimate every time a new data point is obtained, instead of solving an objective function formed from the entire dataset. Many online learning methods have regret guarantees, that is, the loss in performance relative to a batch method is provably small Cesa-Bianchi and Lugosi (2006); Duchi et al. (2010b). In practice, however, the performance of an online learning method is strongly influenced by the learning rate, which may need to vary over time in a dynamic environment Daniely et al. (2015); McMahan and Streeter (2010); Duchi et al. (2010a).
Adaptive online learning methods attempt to address the learning rate problem by continuously updating the learning rate as new observations become available. For learning static parameters, AdaGrad-style methods McMahan and Streeter (2010); Duchi et al. (2010a) perform gradient descent steps with the step size adapted based on the magnitude of recent gradients. Follow the regularized leader (FTRL) type algorithms adapt the regularization to the observations McMahan (2014). Recently, a method called Strongly Adaptive Online Learning (SAOL) has been proposed for learning parameters undergoing discrete changes. SAOL maintains several learners with different learning rates and selects the best one based on recent performance Daniely et al. (2015). Several of these adaptive methods have provable regret bounds McMahan (2014); Herbster and Warmuth (1998); Hazan and Seshadhri (2007). These typically guarantee low total regret (i.e. regret from time 0 to time $t$) at every time $t$ McMahan (2014). SAOL, on the other hand, attempts to have low static regret on every subinterval, as well as low regret overall Daniely et al. (2015). This allows tracking of discrete changes, but not slow drift.
The remainder of this paper is structured as follows. In Section 2 we formalize the distance metric tracking problem, and Section 3 presents the basic COMID online learner. Section 4 presents our COMID-SADL algorithm, a method of adaptively combining COMID learners with different learning rates. Strongly adaptive bounds on the dynamic regret are presented in Section 5, and results on both synthetic data and a text review dataset are presented in Section 6. Section 7 concludes the paper.
2 Problem Formulation
Metric learning seeks to learn a metric that encourages data points marked as similar to be close and data points marked as different to be far apart. The time-varying Mahalanobis distance at time $t$ is parameterized by $M_t$ as

$$d_{M_t}^2(x, z) = (x - z)^T M_t (x - z), \qquad (1)$$

where $M_t \in \mathbb{R}^{n \times n}$ is positive semidefinite.
Suppose a temporal sequence of similarity constraints is given, where each constraint is the triplet $(x_t, z_t, y_t)$, $x_t$ and $z_t$ are data points in $\mathbb{R}^n$, and the label $y_t = +1$ if the points are similar at time $t$ and $y_t = -1$ if they are dissimilar.
Following Kunapuli and Shavlik (2012), we introduce the following margin based constraints:

$$d_{M_t}^2(x_t, z_t) \le \mu - 1 \ \text{ if } y_t = +1, \qquad d_{M_t}^2(x_t, z_t) \ge \mu + 1 \ \text{ if } y_t = -1, \qquad (2)$$

where $\mu$ is a threshold that controls the margin between similar and dissimilar points. A diagram illustrating these constraints and their effect is shown in Figure 1. These constraints are softened by penalizing violation of the constraints with a convex loss function $\ell$. This gives a loss function

$$L_t(M_t, \mu) = \ell\big(y_t (\mu - u_t^T M_t u_t)\big) + \lambda\, r(M_t), \qquad (3)$$

where $u_t = x_t - z_t$, $r$ is the regularizer and $\lambda$ the regularization parameter. Kunapuli and Shavlik (2012) propose using nuclear norm regularization ($r(M) = \|M\|_*$) to encourage projection of the data onto a low dimensional subspace (feature selection/dimensionality reduction), and we have also had success with the elementwise L1 norm ($r(M) = \|\mathrm{vec}(M)\|_1$). In what follows, we develop an adaptive online method to minimize the loss subject to nonstationary smoothness constraints on the sequence of metric estimates $\{M_t\}$.
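As a minimal sketch of the per-constraint loss described in this section, the snippet below evaluates a squared Mahalanobis distance and a hinge penalty on the margin constraint, plus a nuclear norm regularizer. The hinge choice, the function names, the label encoding (`y = +1` similar, `y = -1` dissimilar), and the default values of `mu` and `lam` are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

def mahalanobis_sq(M, x, z):
    """Squared Mahalanobis distance (x - z)^T M (x - z) for a PSD matrix M."""
    u = x - z
    return float(u @ M @ u)

def constraint_loss(M, x, z, y, mu=2.0, lam=0.1):
    """Hinge penalty on the margin y * (mu - d^2) >= 1 plus a nuclear norm term.

    y = +1 marks a similar pair and y = -1 a dissimilar pair (assumed encoding).
    mu and lam are hypothetical defaults chosen only for illustration.
    """
    margin = y * (mu - mahalanobis_sq(M, x, z))
    hinge = max(0.0, 1.0 - margin)
    # Nuclear norm = sum of singular values (equals the trace when M is PSD).
    nuclear = float(np.sum(np.linalg.svd(M, compute_uv=False)))
    return hinge + lam * nuclear
```

With the identity metric and two points at unit distance, a similar pair incurs only the regularization term, while a dissimilar pair is additionally penalized for sitting inside the margin.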
3 Composite Objective Mirror Descent Update
Viewing the acquisition of new data points as stochastic realizations of the underlying distribution Kunapuli and Shavlik (2012) suggests the use of composite objective stochastic mirror descent techniques (COMID).
In Kunapuli and Shavlik (2012), a closed-form algorithm for solving the COMID update is developed for a variety of common losses and Bregman divergences, involving rank one updates and eigenvalue shrinkage.
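To make the "rank one update followed by eigenvalue shrinkage" structure concrete, the following is a hedged sketch of a single COMID-style step for one special case: hinge loss, squared Frobenius norm as the Bregman divergence, and nuclear norm regularization on a positive semidefinite metric. The function name `comid_step`, the fixed threshold `mu`, and all parameter values are assumptions for illustration; this is not the closed-form algorithm of Kunapuli and Shavlik (2012) in its generality.

```python
import numpy as np

def comid_step(M, x, z, y, eta, lam, mu=2.0):
    """One proximal (COMID-style) update of a PSD metric M.

    Assumes hinge loss and the Frobenius Bregman divergence; mu is held fixed.
    """
    u = x - z
    # Hinge subgradient is nonzero only when the margin constraint is violated.
    violated = 1.0 - y * (mu - float(u @ M @ u)) > 0.0
    grad = y * np.outer(u, u) if violated else np.zeros_like(M)
    # Rank one gradient step, then symmetrize against round-off.
    A = M - eta * grad
    A = 0.5 * (A + A.T)
    # Shrink eigenvalues by eta * lam and clip at zero so the result stays PSD.
    vals, vecs = np.linalg.eigh(A)
    vals = np.maximum(vals - eta * lam, 0.0)
    return (vecs * vals) @ vecs.T
```

For a PSD matrix the nuclear norm equals the trace, which is why the proximal step reduces to a uniform shift of the eigenvalues followed by clipping.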
The output of COMID depends strongly on the choice of the learning rate. Critically, the optimal learning rate depends on the rate of change of the underlying metric Hall and Willett (2015), and thus will need to change with time to adapt to nonstationary drift. Choosing an optimal learning rate sequence is clearly not practical in an online setting with nonstationary drift. We thus introduce COMID-Strongly Adaptive Dynamic Learning (COMID-SADL) as a method to adaptively choose an appropriate learning rate.
Define a set of intervals such that the lengths of the intervals are proportional to powers of two, i.e. , , with an arrangement that is a dyadic partition of the temporal axis. The first interval of length starts at (see Figure 2), and additional intervals of length exist such that the rest of time is covered.
Every interval is associated with a base COMID learner that operates on that interval. Each learner (16) has a constant learning rate proportional to the inverse square root of the length of the interval, i.e. $\eta_I \propto 1/\sqrt{|I|}$ for an interval $I$ of length $|I|$. Each learner (besides the coarsest) is initialized to the last estimate of the learner at the next coarsest level (see Figure 2). This strategy is equivalent to “backdating” the interval learners so as to ensure appropriate convergence has occurred before the interval of interest is reached, and is effectively a quantized square root decay of the learning rate.
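The interval structure can be sketched as follows. This is a simplified illustration assuming an unshifted dyadic partition that starts at time 0 (the arrangement in Figure 2 may place interval boundaries differently), with a learning rate proportional to the inverse square root of the interval length. The names `active_intervals`, `learning_rate`, `base_len`, and `eta0` are hypothetical.

```python
import math

def active_intervals(t, max_level, base_len=1):
    """Return (start, end, length) of the dyadic interval containing t at each level.

    Level j uses intervals of length base_len * 2**j tiling the time axis from 0.
    """
    out = []
    for j in range(max_level + 1):
        length = base_len * 2 ** j
        start = (t // length) * length
        out.append((start, start + length - 1, length))
    return out

def learning_rate(length, eta0=1.0):
    """Constant rate for the learner on an interval of the given length."""
    return eta0 / math.sqrt(length)
```

At any time exactly one interval per level is active, so the number of learners run in parallel grows only with the number of levels.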
Thus, at a given time $t$, a set of intervals/COMID learners is active, running in parallel. Because the metric being learned is changing with time, learners designed for low regret at different scales will have different performance (analogous to the classical bias/variance tradeoff). In other words, there is an optimal scale at a given time.
To select the appropriate scale, we compute weights that are updated based on each learner’s recent estimated regret. Our loss function in (3) is unbounded; however, it is a relaxation of an underlying 0-1 loss. For purposes of updating the weights, we propose using a nonlinearity to create a 0-1 loss with a smooth transition whose scale is set by a parameter. We choose a linear transition as the nonlinearity
We found that using a logistic nonlinearity also gave good results. We set in all our experiments. The weight update, inspired by the multiplicative weight (MW) literature, is given by
These hold for all , where , are the outputs at time $t$ of the learner on interval $I$, and is called the estimated regret of the learner on interval $I$ at time $t$. The initial value of is . Essentially, this assigns high weight to low-loss learners and low weight to high-loss learners.
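A minimal sketch of a multiplicative weight update of this kind is given below. It assumes a clipped linear ramp as the nonlinearity and follows the style of the SAOL update of Daniely et al. (2015), in which a weight is multiplied by one plus a scale-dependent step times the estimated regret. The exact nonlinearity and constants used in the experiments are not reproduced here; `scaled_01`, `update_weight`, and `width` are hypothetical names.

```python
import math

def scaled_01(loss, width=1.0):
    """Map a nonnegative loss to [0, 1] with a linear ramp of the given width."""
    return min(1.0, max(0.0, loss / width))

def update_weight(w, length, combined_loss, learner_loss, width=1.0):
    """Multiplicative weights step: w <- w * (1 + eta_I * r).

    r is the estimated regret: scaled loss of the combined output minus the
    scaled loss of this interval's learner, so r lies in [-1, 1].
    """
    eta_I = min(0.5, 1.0 / math.sqrt(length))
    r = scaled_01(combined_loss, width) - scaled_01(learner_loss, width)
    return w * (1.0 + eta_I * r)
```

Because the scaled losses lie in [0, 1] and the step is at most 1/2, the multiplier is always at least 1/2, so weights remain positive.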
For any given time $t$, the output of the learner on interval $I$ is randomly selected as the output of COMID-SADL with probability proportional to its current weight, normalized over all learners active at time $t$.
COMID-SADL is summarized in Algorithm 1.
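The randomized selection step can be sketched as below, assuming each active interval carries a nonnegative weight and is chosen with probability equal to its weight divided by the sum of active weights. The dictionary layout and function names are illustrative assumptions, not part of Algorithm 1.

```python
import random

def selection_probs(weights):
    """Normalize nonnegative interval weights into selection probabilities."""
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def sample_learner(weights, rng=random):
    """Draw one active interval with probability proportional to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible across runs.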
5 Strongly Adaptive Dynamic Regret
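The standard static regret of an algorithm $\mathcal{B}$ on an interval $I$ is defined as

$$R_{\mathcal{B},\mathrm{static}}(I) = \sum_{t \in I} f_t(\hat{\theta}_t) - \min_{\theta} \sum_{t \in I} f_t(\theta),$$

where $f_t(\theta)$ is a loss with parameter $\theta$. Since in our case the optimal parameter value is changing, the static regret of an algorithm on an interval is not useful. Instead, let $w = \{\theta_t\}$ be an arbitrary sequence of parameters. Then, the dynamic regret of an algorithm $\mathcal{B}$ relative to a comparator sequence $w$ on the interval $I$ is defined as

$$R_{\mathcal{B},w}(I) = \sum_{t \in I} f_t(\hat{\theta}_t) - \sum_{t \in I} f_t(\theta_t),$$

where $\hat{\theta}_t$ are generated by $\mathcal{B}$. This allows for a dynamically changing estimate.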
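The difference between the two notions of regret can be illustrated with a toy scalar example. The sketch below assumes a squared loss and a parameter that switches value halfway through the interval; the loss, the data, and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sq_loss(theta, target):
    """Squared loss of a scalar parameter against a scalar target."""
    return (theta - target) ** 2

def static_regret(estimates, targets):
    """Regret against the best single parameter in hindsight.

    For squared loss the best fixed parameter is the mean of the targets.
    """
    best = sum(targets) / len(targets)
    alg = sum(sq_loss(e, y) for e, y in zip(estimates, targets))
    return alg - sum(sq_loss(best, y) for y in targets)

def dynamic_regret(estimates, comparator, targets):
    """Regret against a time-varying comparator sequence."""
    alg = sum(sq_loss(e, y) for e, y in zip(estimates, targets))
    return alg - sum(sq_loss(c, y) for c, y in zip(comparator, targets))

targets = [0.0, 0.0, 2.0, 2.0]      # the optimal parameter switches at t = 2
estimates = [0.0, 0.0, 0.0, 2.0]    # the learner lags one step behind the switch
```

Here the static regret is zero, because no fixed parameter does better than the lagging learner, while the dynamic regret against the true sequence is positive and exposes the one-step lag.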
In Hall and Willett (2015) the authors derive dynamic regret bounds that hold over all possible sequences such that , i.e. bounding the total amount of variation in the estimated parameter. Without this temporal regularization, minimizing the loss would cause to grossly overfit. In this sense, setting the comparator sequence to the “ground truth sequence" or “batch optimal sequence" both provide meaningful intuitive bounds.
Strongly adaptive regret bounds Daniely et al. (2015) guarantee that static regret is low on every subinterval, instead of only low in the aggregate. We use the notion of dynamic regret to introduce strongly adaptive dynamic regret bounds, proving that dynamic regret is low on every subinterval simultaneously. In the supplementary material, we prove the following:
Theorem 1 (Strongly Adaptive Dynamic Regret).
Let be any arbitrary sequence of metrics on the interval , and define . Then COMID-SADL (Algorithm 1) satisfies
for every subinterval simultaneously. is a constant, and the expectation is with respect to the random output of the algorithm.
In a dynamic setting, bounds of this type are particularly desirable because they allow for changing drift rate and guarantee quick recovery from discrete changes. For instance, suppose discrete switches (large parameter changes or changes in drift rate) occur at times satisfying . Then since , this implies that the total expected dynamic regret on remains low (), while simultaneously guaranteeing that an appropriate learning rate is used on each subinterval .
6.1 Synthetic Data
We run our metric learning algorithms on a synthetic dataset undergoing different types of simulated metric drift. We create a synthetic 2000 point dataset with 2 independent 50-20-30% clusterings (A and B) in disjoint 3-dimensional subspaces of $\mathbb{R}^{25}$. The clusterings are formed as 3-D Gaussian blobs, and the remaining 19-dimensional subspace is filled with iid Gaussian noise.
We create a scenario exhibiting nonstationary drift, combining continuous drifts and shifts between the two clusterings (A and B). To simulate continuous drift, at each time step we perform a small random rotation of the dataset. The drift profile is shown in Figure 3. For the first interval, partition A is used and the dataset is static, with no drift occurring. Then, the partition is changed to B, followed by an interval of first moderate, then fast, and then moderate drift. Finally, the partition reverts back to A, followed by slow drift.
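One way to realize a "small random rotation" is sketched below. It assumes the rotation is generated by the Cayley transform of a small random skew-symmetric matrix; the text does not specify how the rotations were drawn, so the construction, the `scale` parameter, and the toy data sizes are all assumptions.

```python
import numpy as np

def small_rotation(dim, scale, rng):
    """Random rotation near the identity via the Cayley transform."""
    B = rng.standard_normal((dim, dim))
    A = scale * (B - B.T) / 2.0      # skew-symmetric generator
    I = np.eye(dim)
    # (I - A)^{-1} (I + A) is orthogonal with determinant +1 for skew A.
    return np.linalg.solve(I - A, I + A)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))      # toy data: 5 points in 4 dimensions
R = small_rotation(4, 0.01, rng)
X_drifted = X @ R.T                  # rotate every data point by R
```

Smaller values of `scale` give rotations closer to the identity, so the drift rate can be varied over time by changing `scale` between steps.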
We generate a series of constraints from random pairs of points in the dataset, incorporating the simulated drift, running each experiment with 3000 random trials. For each experiment conducted in this section, we evaluate performance using two metrics. We plot the K-nearest neighbor error rate, using the learned embedding at each time point, averaging over all trials. We quantify the clustering performance by plotting the empirical probability that the normalized mutual information (NMI) of the K-means clustering of the unlabeled data points in the learned embedding at each time point exceeds 0.8 (out of a possible 1). We believe clustering NMI, rather than k-NN performance, is a more realistic indicator of metric learning performance, at least in the case where finding a relevant embedding is the primary goal.
In our results, we consider COMID-SADL, nonadaptive COMID (Kunapuli and Shavlik, 2012), LMNN (batch) (Weinberger et al., 2005), and online ITML Davis et al. (2007). All parameters were set via cross validation. For nonadaptive COMID, we set the high learning rate using cross validation for moderate drift, and we set the low learning rate via cross validation in the case of no drift. The results are shown in Figure 3. Online ITML fails due to its bias against low-rank solutions Davis et al. (2007), and the batch method and low learning rate COMID fail due to an inability to adapt. The high learning rate COMID does well at first, but as it is optimized for slow drift it cannot adapt to the changes in drift rate as well or recover quickly from the two partition changes. COMID-SADL, on the other hand, adapts well throughout the entire interval as expected.
6.2 Clustering Product Reviews
As an example real data task, we consider clustering Amazon text reviews, using the Multi-Domain Sentiment Dataset (Blitzer et al., 2007). We use the 11402 reviews from the Electronics and Books categories, and preprocess the data by computing word counts for each review and 2369 commonly occurring words, thus creating 11402 data points in $\mathbb{R}^{2369}$. Two possible clusterings of the reviews are considered: product category (books or electronics) and sentiment (positive: star rating 4/5 or greater, or negative: 2/5 or less).
Figures 4 and 5 show the first two dimensions of the embeddings learned by static COMID for the category and sentiment clusterings respectively. Also shown are the 2-dimensional standard PCA embeddings, and the k-NN classification performance both before embedding and in each embedding. As expected, metric learning is able to find embeddings with improved class separability. We emphasize that while improvements in k-NN classification are observed, we use k-NN merely as a way to quantify the separability of the classes in the learned embeddings. In these experiments, we set the regularizer to the elementwise L1 norm to encourage sparse features.
We then conducted drift experiments where the clustering changes. The change happens after the metric learner for the original clustering has converged, hence the nonadaptive learning rate is effectively zero. For each change, we show the k-NN error rate in the learned COMID-SADL embedding as it adapts to the new clustering. Emphasizing the visualization and computational advantages of a low-dimensional embedding, we computed the k-NN error after projecting the data into the first 5 dimensions of the embedding. Also shown are the results for a learner where an oracle allows reinitialization of the metric to the identity at time zero, and the nonadaptive learner for which the learning rate is not increased. Figure 6 (left) shows the results when the clustering changes from the four class sentiment + type partition to the two class product type only partition, and Figure 6 (right) shows the results when the partition changes from sentiment to product type. In the first case, the similar clustering allows COMID-SADL to significantly outperform even the reinitialized method, and in the second, where the clusterings are unrelated, it remains competitive.
7 Conclusion and Future Work
Learning a metric on a complex dataset enables unsupervised methods and/or a user to home in on the problem of interest while de-emphasizing extraneous information. When the problem of interest or the data distribution is nonstationary, however, the optimal metric can be time-varying. We considered the problem of tracking a nonstationary metric and presented an efficient, strongly adaptive online algorithm, called COMID-SADL, that has strong theoretical regret guarantees. Performance of our algorithm was evaluated both on synthetic and real datasets, demonstrating its ability to learn and adapt quickly in the presence of changes both in the clustering of interest and in the underlying data distribution.
Potential directions for future work include the learning of more expressive metrics beyond the Mahalanobis metric, the incorporation of unlabeled data points in a semi-supervised learning framework (Bilenko et al., 2004), and the incorporation of an active learning framework to select which pairs of data points to obtain labels for at any given time (Settles, 2012).
The Lincoln Laboratory portion of this work was sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
9 Strongly Adaptive Dynamic Regret
We will prove the following general theorem giving strongly adaptive dynamic regret bounds.
Let be an arbitrary sequence of parameters and define as a function of and an interval . Choose a set of learners such that given an interval the learner satisfies
for some constant . Then the strongly adaptive dynamic learner (COMID-SADL) using as the interval learners satisfies
on every interval .
The proof techniques are similar to those found in Daniely et al. (2015); Blum and Mansour (2005), which are in turn similar to the analysis of the Multiplicative Weights Update (MW) method.
Note that where is the indicator function for .
We first prove a pair of lemmas.
for all .
For all , . Thus
Suppose that . Furthermore, note that
since with probability . Thus
Since , the lemma follows by induction.
for every .
Fix . Recall that
Since and for all ,
By Lemma 1 we have
Combining with the expectation of (14) and dividing by ,
since and .
Define the restriction of to an interval as . Note the following lemma from Daniely et al. (2015):
Consider the arbitrary interval . Then, the interval can be partitioned into two finite sequences of disjoint and consecutive intervals, given by and , such that
This enables us to extend the bounds to every arbitrary interval and thus complete the proof.
10 Online DML Dynamic Regret
In this section, we derive the dynamic regret of the COMID metric learning algorithm. Recall that the COMID algorithm is given by
where is any Bregman divergence and is the learning rate parameter. From Hall and Willett (2015) we have:
Let the sequence , be generated via the COMID algorithm, and let be an arbitrary sequence in . Then using gives a dynamic regret
Using a nonincreasing learning rate , we can then prove a bound on the dynamic regret for a quite general set of stochastic optimization problems.
Applying this to our problem, we have
For being the hinge loss and ,
The other two quantities are guaranteed to exist and depend on the choice of Bregman divergence and . Thus,
Corollary 1 (Dynamic Regret: ML COMID).
Let the sequence be generated by (16), and let be an arbitrary sequence with . Then using gives
and setting ,
In other words, we pay a linear penalty on the total amount of variation in the underlying parameter sequence. From (19), it can be seen that the bound-minimizing increases with increasing , indicating the need for an adaptive learning rate.
For comparison, if the metric is in fact static then by standard stochastic mirror descent results Hall and Willett (2015)
Theorem 4 (Static Regret).
If and , then
11 COMID-SADL Bound
Let be a COMID learner at any of the scales used in , with output . Define the relative regret
as the extra loss suffered relative to the algorithm . From the proof of Theorem 1 we have
For any , the following holds simultaneously for all and .
This implies that SADL incurs at most additional scaled 0-1 loss on any interval relative to each of the base learners, all of which have low regret in the convex loss. Due to the nonconvexity of the scaled 0-1 loss, it is difficult to state more for arbitrary .
Theorem 5 (Comid-Sadl).
Let be the COMID algorithm of (16) with . Then there exists a such that the strongly adaptive online learner (COMID-SADL) satisfies
for some constant and every interval .
Note that this bound also holds for the original convex loss .
- Bellet et al.  Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
- Bilenko et al.  Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML, page 11, 2004.
- Bishop  Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Blitzer et al.  John Blitzer, Mark Dredze, Fernando Pereira, et al. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447, 2007.
- Cesa-Bianchi and Lugosi  Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- Daniely et al.  Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. ICML, 2015.
- Davis et al.  Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
- Duchi et al. [2010a] John C Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010a.
- Duchi et al. [2010b] John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14–26. Citeseer, 2010b.
- Goldberger et al.  Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Advances in neural information processing systems, pages 513–520, 2004.
- Hall and Willett  E.C. Hall and R.M. Willett. Online convex optimization in dynamic environments. Selected Topics in Signal Processing, IEEE Journal of, 9(4):647–662, June 2015.
- Hastie et al.  Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
- Hazan and Seshadhri  Elad Hazan and C Seshadhri. Adaptive algorithms for online decision problems. In Electronic Colloquium on Computational Complexity (ECCC), volume 14, 2007.
- Herbster and Warmuth  Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
- Kulis  Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2012.
- Kunapuli and Shavlik  Gautam Kunapuli and Jude Shavlik. Mirror descent for metric learning: a unified approach. In Machine Learning and Knowledge Discovery in Databases, pages 859–874. Springer, 2012.
- Lee and Verleysen  John A Lee and Michel Verleysen. Nonlinear dimensionality reduction. Springer Science & Business Media, 2007.
- McMahan  H Brendan McMahan. Analysis techniques for adaptive online learning. arXiv preprint arXiv:1403.3465, 2014.
- McMahan and Streeter  H Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.
- Settles  Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
- Weinberger and Saul  Kilian Q Weinberger and Lawrence K Saul. Fast solvers and efficient implementations for distance metric learning. In ICML, pages 1160–1167, 2008.
- Weinberger et al.  Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1473–1480, 2005.
- Xing et al.  Eric P Xing, Michael I Jordan, Stuart Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 505–512, 2002.
- Yang and Jin  Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Michigan State University, 2, 2006.