Metric learning has been successfully applied in many fields, e.g., face identification [10.1007/978-3-642-19309-5_55], object recognition [Xu:2018:BDM:3327144.3327333] and medical diagnosis [Boukouvalas2011DistanceML]. To efficiently handle large-scale streaming data, learning a discriminative metric in an online manner (i.e., online metric learning [Chechik:2010:LSO:1756006.1756042, NIPS2008_3446]) has attracted considerable attention. Generally, most online metric learning models focus on fast metric updating strategies [Weinberger2009, LI2018302, NIPS2009_3703, NIPS2011_4392] or fast similarity searching methods [NIPS2008_3446, Davis2007IML1273496, NIPS2009_3703] for large-scale streaming data.
However, these existing online metric learning methods [8392504, 8552662] only focus on instance evolution and ignore the feature evolution that arises in many real applications, where some features vanish and some new features are augmented. Take human motion recognition [DBLP:journals/corr/abs-1904-12602] as an example (i.e., Fig. 1): the emergence of a new Kinect sensor and the sudden damage of a depth camera respectively lead to an increase and a decrease in the feature dimensionality of the input data, which heavily cripples the performance of a pre-trained model [DBLP:journals/corr/abs-1904-12602]. Another interesting example arises in dynamic environment monitoring, where different kinds of sensors (e.g., radioisotope, trace metal and biological sensors [s5010004]) are deployed for full-aspect coverage: some sensors (features) expire whereas some new sensors (features) are deployed as electrochemical conditions and lifespans change. A fixed or static online metric learning model fails to make full use of sensors (features) that evolve in this way. Therefore, our focus in this paper is how to establish a novel metric learning model that simultaneously handles instance evolution together with incremental and decremental features in such online practical systems.
To address the challenges above, as shown in Fig. 2, we present a new online Evolving Metric Learning (EML) model for incremental and decremental features, which can exploit streaming data with both instance and feature evolutions in an online manner. Specifically, the proposed EML model consists of two significant stages, i.e., the Transforming stage (T-stage) and the Inheriting stage (I-stage). 1) In the T-stage, where features are decremental, we propose to extract the important information and data structure from the vanished features and transform them into a low-rank discriminative metric space of the survived features, which can then assist the learning of subsequent metric tasks. Moreover, this stage explores the intrinsic low-rank structure of the streaming data, which efficiently reduces computation and memory costs, especially for high-dimensional large-scale samples. 2) In the I-stage, where features are incremental, based on the discriminative metric space learned in the T-stage, we inherit the metric performance on the survived features and then expand it to include the newly augmented features. Furthermore, to better explore the similarity relations among heterogeneous data, a smoothed Wasserstein distance is applied in both the T-stage and the I-stage, where the evolving features of different stages are strictly unaligned and heterogeneous. For model optimization, we derive an efficient optimization method to solve both stages. Besides, our model can be naturally extended from the one-shot scenario to the multi-shot scenario. Comprehensive experimental results on several benchmark datasets strongly support the effectiveness of the proposed EML model.
The main contributions of this paper are summarized as follows:
We propose an online Evolving Metric Learning (EML) model for incremental and decremental features to tackle both instance and feature evolutions simultaneously. To the best of our knowledge, this is the first attempt to tackle this crucial but rarely-researched challenge in the metric learning field.
We present two stages for both feature and instance evolutions, i.e., the Transforming stage (T-stage) and the Inheriting stage (I-stage), which can not only make full use of the vanished features in the T-stage, but also take advantage of streaming data with new augmented features in the I-stage.
A smoothed Wasserstein distance is incorporated into metric learning to characterize the similarity relations of heterogeneous evolving features in different stages. We derive an alternating direction optimization algorithm to optimize our EML model, and extensive experiments on benchmark datasets validate the effectiveness of the proposed model.
II Related Work
This section first provides a brief overview of online metric learning for instance evolution. Then some representative works on feature evolution are introduced.
II-A Online Metric Learning
Online metric learning has been widely explored for instance evolution on large-scale streaming data, and existing methods can mainly be divided into two categories: Mahalanobis distance-based and bilinear similarity-based methods. Among the Mahalanobis distance-based methods, POLA [ShalevShwartz2004] is the first attempt to learn the optimal metric in an online way. Several variants [NIPS2008_3446, Davis2007IML1273496, 8617698] extend this idea with fast similarity searching strategies; e.g., [NIPS2009_3703] proposes a regularized online metric learning model with a provable regret bound. Besides, pairwise constraints [NIPS2009_3703] and triplet constraints [NIPS2011_4392] are adopted to learn a discriminative metric function; generally, triplet constraints are more effective than pairwise constraints [NIPS2011_4392, Qian2015]. Among the bilinear similarity-based models, OASIS [Chechik:2010:LSO:1756006.1756042] is proposed to learn a similarity metric for image similarity, and SOML [Gao2014SOMLSO] learns a diagonal matrix for high-dimensional cases under a similar setting to OASIS. Besides, an Online Multiple Kernel Similarity model has been presented to tackle multi-modal tasks. However, these recently-proposed Mahalanobis distance-based and bilinear similarity-based methods cannot exploit the discriminative similarity relations of strictly unaligned heterogeneous data across different evolution stages.
II-B Feature Evolution
For feature evolution, under the assumption that samples from both the vanished feature space and the augmented feature space coexist in an overlapping period, [Hou:2017:LFE:3294771.3294906] develops an evolvable feature streaming learning model that reconstructs the vanished features and exploits them along with the newly emerging features for large-scale streaming data. [Hou2018OnePassLW] proposes a one-pass incremental and decremental learning model for streaming data, which consists of a compressing stage and an expanding stage. Different from [Hou:2017:LFE:3294771.3294906], [Hou2018OnePassLW] assumes that there are overlapping features instead of an overlapping period. Similarly to [Hou2018OnePassLW], [Ye2018RectifyHM] focuses on learning the mapping function between two different feature spaces via optimal transport. [Zhang2015TowardsMT, 7465766] intend to deal with trapezoidal data streams where both instances and features can doubly increase; however, the newly emerging data always share overlapping features with the previously existing data. A feature incremental random forest model has also been developed to improve performance on a small amount of data with newly incremental features, enabling the model to generalize well to the emergence of incremental features.
Among the works discussed above, no existing feature evolution model is highly related to ours except OPID (OPIDe) [Hou2018OnePassLW]. However, there are several key differences between [Hou2018OnePassLW] and our EML model: 1) Compared with [Hou2018OnePassLW], our work is the first attempt to explore both instance and feature evolutions simultaneously via the T-stage and I-stage in the metric learning field. 2) Since the evolving features of different stages are strictly unaligned, we utilize the smoothed Wasserstein distance to characterize the similarity relations among the complex and heterogeneous data, rather than the Euclidean distance used in [Hou2018OnePassLW]. 3) Compared with [Hou2018OnePassLW], our low-rank regularizer on the distance matrix efficiently learns a discriminative low-rank metric space while discarding non-informative knowledge of heterogeneous data in different feature evolution stages.
III Evolving Metric Learning (EML)
In this section, we first review online metric learning, and then introduce in detail how to tackle both instance and feature evolutions via the proposed EML model.
III-A Revisiting Online Metric Learning
Metric learning aims to learn an optimal distance measure matrix under different measure functions, e.g., the Mahalanobis distance function: d_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j), where x_i and x_j are the i-th and j-th samples, respectively. M is a symmetric positive semi-definite matrix, which can be decomposed as M = L^T L [NIPS2008_3446], where L ∈ R^{r×d} (r is the rank of M) is the transformation matrix. Then the Mahalanobis distance function between x_i and x_j can be rewritten as d_M(x_i, x_j) = ||L x_i − L x_j||_2^2. Given an online constructed triplet (x_t, x_t^+, x_t^-), M can be updated in an online manner via the Passive-Aggressive algorithm [Crammer:2006:OPA:1248547.1248566], i.e.,
where ℓ(·) is a hinge loss function, the triplet (x_t, x_t^+, x_t^-) indicates that x_t and x_t^+ belong to the same class while x_t and x_t^- belong to different classes, and γ is the regularization parameter.
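A single passive-aggressive update of this kind can be sketched as follows. This is a minimal illustration, not the exact update rule of Eq. (1): it assumes a squared-Mahalanobis triplet hinge loss with unit margin, a gradient-style step with rate `gamma`, and an eigenvalue projection back onto the PSD cone; all function names are ours.

```python
import numpy as np

def mahalanobis_sq(M, xi, xj):
    """Squared Mahalanobis distance (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

def pa_triplet_update(M, x, x_pos, x_neg, margin=1.0, gamma=0.1):
    """One passive-aggressive-style update on the triplet (x, x+, x-).

    If the margin constraint d_M(x, x-) - d_M(x, x+) >= margin already
    holds, M is left unchanged (passive step); otherwise M moves along
    the negative gradient of the hinge loss and is projected back onto
    the PSD cone (aggressive step).
    """
    loss = margin + mahalanobis_sq(M, x, x_pos) - mahalanobis_sq(M, x, x_neg)
    if loss <= 0:
        return M  # passive: constraint satisfied
    dp, dn = x - x_pos, x - x_neg
    grad = np.outer(dp, dp) - np.outer(dn, dn)  # gradient of the hinge term
    M = M - gamma * grad
    # project onto the PSD cone so M stays a valid metric
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

rng = np.random.default_rng(0)
M = np.eye(2)
x, x_pos, x_neg = rng.normal(size=(3, 2))
M = pa_triplet_update(M, x, x_pos, x_neg)
```

The PSD projection keeps the learned matrix a legitimate (pseudo-)metric after each aggressive step.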
However, most existing online metric learning models (e.g., Eq. (1)) only focus on instance evolution with a fixed feature dimensionality, and thus cannot be used in the feature evolution scenario, i.e., streaming data with incremental and decremental features. Furthermore, the distance matrix learned by Eq. (1) is not discriminative enough to explore the similarity relations of complex and heterogeneous samples whose evolving features in different evolution stages are not strictly aligned [xu2018multi].
III-B The Proposed EML Model
In this subsection, we first describe how a smoothed Wasserstein distance is integrated into the online metric formulation (i.e., Eq. (1)) to characterize the similarity relations of heterogeneous data under feature evolution in different stages. Then we elaborate how to tackle feature evolution via the T-stage and I-stage in the one-shot scenario, followed by the extension to the multi-shot case.
III-B1 Online Wasserstein Metric Learning
The Wasserstein distance measures the minimum amount of effort required to optimally transport all the earth from a source to a destination. Formally, given two signatures P and Q with weight vectors u and v, the smoothed Wasserstein distance [Cuturi:2013:SDL:2999792.2999868] between P and Q is:

W(P, Q) = min_{T ≥ 0} ⟨T, D⟩ − λE(T),  s.t.  T1 = u, T^T 1 = v,   (2)
where D is the ground-distance matrix, and each D_ij denotes the cost of moving one unit of earth from the source sample p_i to the target sample q_j; T is the flow network matrix, and each T_ij represents the amount of earth moved from p_i to q_j; λ is a balance parameter, and E(T) is the strictly concave entropic function.
In Eq. (2), the Mahalanobis distance is employed as the ground distance to construct the smoothed Wasserstein distance. Thus, each element D_ij of D in Eq. (2) represents the squared Mahalanobis distance between the source sample p_i and the target sample q_j, i.e., D_ij = (p_i − q_j)^T M (p_i − q_j). Given an online constructed triplet (P_t, P_t^+, P_t^-) via [pmlr-v51-rolet16], where the samples of P_t and P_t^+ belong to the same class and the samples of P_t and P_t^- belong to different classes, substituting the Mahalanobis distance in Eq. (1) with the smoothed Wasserstein distance defined in Eq. (2) yields the online Wasserstein metric learning formulation:
Compared with the triplet (x_t, x_t^+, x_t^-), each signature in (P_t, P_t^+, P_t^-) consists of several samples belonging to the same class rather than only one sample.
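The smoothed Wasserstein distance with a Mahalanobis ground metric can be approximated with Sinkhorn scaling iterations in the spirit of [Cuturi:2013:SDL:2999792.2999868]. The sketch below is illustrative rather than the paper's exact solver: it assumes uniform signature weights, a fixed number of iterations, and names of our choosing.

```python
import numpy as np

def mahalanobis_cost(M, P, Q):
    """Ground-distance matrix D with D[i, j] = (p_i - q_j)^T M (p_i - q_j)."""
    diff = P[:, None, :] - Q[None, :, :]         # shape (n, m, d)
    return np.einsum('nmd,de,nme->nm', diff, M, diff)

def smoothed_wasserstein(M, P, Q, lam=5.0, n_iter=200):
    """Entropy-smoothed Wasserstein distance via Sinkhorn iterations,
    with uniform weights on the samples of each signature."""
    n, m = len(P), len(Q)
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    D = mahalanobis_cost(M, P, Q)
    K = np.exp(-lam * D)                         # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                      # alternate marginal scalings
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = u[:, None] * K * v[None, :]              # flow network matrix
    return float((T * D).sum()), T

rng = np.random.default_rng(1)
P = rng.normal(size=(4, 3))                      # source signature samples
Q = rng.normal(size=(5, 3))                      # target signature samples
dist, T = smoothed_wasserstein(np.eye(3), P, Q)
```

The returned flow matrix T satisfies the two marginal constraints of Eq. (2), and plugging a learned PSD matrix M into `mahalanobis_cost` couples the transport cost with the metric being learned.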
III-B2 Transforming Stage and Inheriting Stage
The two essential stages (i.e., T-stage and I-stage) of our proposed EML model for streaming data with feature evolution are elaborated below.
I. Transforming Stage (T-stage): As shown in Fig. 2, the data stream in the T-stage arrives in batches, where each batch consists of samples and their corresponding labels, and the T-stage comprises a fixed total number of batches. Obviously, each instance in the T-stage contains two kinds of features, i.e., vanished features and survived features, with their corresponding dimensionalities.
If we directly combine both vanished and survived features to learn a unified metric function, the learned metric cannot be used in the I-stage, where some features have vanished and other features are augmented. We thus propose to extract the important information from the vanished features and forward it into the survived features by transforming both into a common discriminative metric space. In other words, we want to train a model using only the survived features to represent the important information of both vanished and survived features.
In each batch of the T-stage, inspired by [pmlr-v51-rolet16], a triplet of signatures is constructed in an online manner for the survived features, where the samples of the first two signatures belong to the same class while the samples of the third signature belong to a different class; each signature contains several samples. Similarly, we construct a triplet for all features (including both vanished and survived features) with the same class relations.
Let two distance matrices be trained on the survived features and on all features in the T-stage, respectively. Since their dimensions are different, it is reasonable to add essential consistency constraints on these two optimal distance matrices to extract the important information from the vanished features and forward it into the survived features. Generally, based on the smoothed Wasserstein metric learning in Eq. (3), the formulation of each batch in the T-stage can be expressed as follows:
where the first two terms denote the triplet losses of smoothed Wasserstein metric learning on the survived features and on all features, respectively; the regularization term explores the intrinsic low-rank structure of heterogeneous samples, and the remaining coefficients are balance parameters. The consistency term in Eq. (4) enforces the consistency constraint between the two distance matrices, which aims to use only the survived features to represent the important information of both vanished and survived features.
Specifically, the consistency term constructs an essential triplet loss of smoothed Wasserstein metric learning across different feature spaces, i.e., all features and survived features. We compute the smoothed Wasserstein distance between heterogeneous distributions drawn from these spaces; for example, one term denotes the smoothed Wasserstein distance between a signature from all features and a signature from survived features, whose ground distance is the Mahalanobis distance between the corresponding source and target samples, and the remaining terms have similar definitions. Formally, the consistency constraint is expressed as:
II. Inheriting Stage (I-stage): As shown in Fig. 2, the data in each batch of the I-stage consist of samples and their corresponding labels, where each instance contains the survived features and the newly augmented features. The goal of this stage is to use the current batch for training and to make predictions for the subsequent batch, which has the same number of samples.
To predict the labels of the subsequent batch, we propose to inherit the metric performance of the optimal distance matrix learned on the survived features in the T-stage, since a set of common survived features exists in both the T-stage and the I-stage. Although we could construct triplets directly from the current batch for training, this simple strategy has two shortcomings: 1) the trained metric model is difficult to extend to the multi-shot scenario; 2) a metric model learned only with the current batch performs worse due to the lack of full usage of the data in the T-stage.
The stacking strategy [Breiman1996, Zhou:2012:EMF:2381019] is employed to better inherit the metric performance learned in the T-stage. Concretely, the survived features are mapped into the transformed discriminative metric space, which can be regarded as a new representation for stacking; each sample is then represented by concatenating this transformed representation with its newly augmented features. We learn an optimal distance matrix on these stacked representations with online constructed triplets, where the samples of the first two signatures of each triplet belong to the same class while the samples of the third belong to a different class, and test the performance on the subsequent batch. Formally, at each iterative step, the objective function of learning in the I-stage can be formulated as:
where the coefficients are balance parameters; in our experiments, they are set to the same values in both Eq. (4) and Eq. (6) for simplicity. The regularization term aims to explore the intrinsic low-rank structure of heterogeneous samples in the I-stage.
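The stacking step above can be sketched as follows, assuming the T-stage metric on survived features factorizes as M_s = L^T L so that L maps survived features into the learned metric space, after which the transformed representation is concatenated with the newly augmented features. The eigendecomposition-based factorization and all names are illustrative choices, not necessarily the authors' implementation.

```python
import numpy as np

def metric_transform(M_s):
    """Factor the learned PSD metric as M_s = L^T L; L maps survived
    features into the discriminative metric space from the T-stage."""
    w, V = np.linalg.eigh(M_s)
    w = np.clip(w, 0, None)              # guard against tiny negative eigenvalues
    return np.sqrt(w)[:, None] * V.T     # L, shape (d_s, d_s)

def stack_features(X_surv, X_aug, L):
    """Inherit the T-stage space: transform survived features by L and
    stack them with the newly augmented features."""
    return np.hstack([X_surv @ L.T, X_aug])

M_s = np.array([[2.0, 0.5],              # toy metric on 2 survived features
                [0.5, 1.0]])
L = metric_transform(M_s)
X_surv = np.random.default_rng(2).normal(size=(6, 2))
X_aug = np.random.default_rng(3).normal(size=(6, 3))
Z = stack_features(X_surv, X_aug, L)     # stacked representation, 6 x 5
```

On the transformed part, Euclidean distances reproduce the T-stage Mahalanobis distances, so the I-stage metric starts from an already discriminative subspace.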
III-B3 Multi-shot Scenario
This subsection extends our model to the multi-shot scenario. Suppose that there are multiple stages for training, i.e., multiple alternating T-stages and I-stages, where each stage contains several batches and the features again consist of two parts, i.e., survived features and augmented features. Notice that the augmented features of one stage become the survived features of the next stage. An illustrative example of the multi-shot scenario is depicted in Fig. 3. Generally, we have two tasks in the multi-shot scenario: 1) Task I: similarly to the task in the one-shot case, we aim to classify the testing data of the final stage with the training data of all stages; 2) Task II: different from the one-shot scenario, we attempt to make predictions for any batch of any training stage.
IV Model Optimization
This section presents an alternating optimization strategy to update our EML model amongst two stages, i.e., T-stage and I-stage, followed by the computational complexity analysis of our model. The whole optimization procedure of our proposed EML model is summarized in Algorithm 1.
Different from traditional Singular Value Thresholding (SVT) [doi:10.1137/080738970], we develop a regularization term to guarantee the low-rank property. As a result, the low-rank term in Eq. (6) can be reformulated with a convex surrogate; similarly, the low-rank optimization of the two distance matrices in Eq. (4) shares the same strategy, with each matrix replaced by its corresponding surrogate.
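Since the paper's exact low-rank surrogate is not fully recoverable from the text, the classic SVT shrinkage operator [doi:10.1137/080738970] that the above departs from can be sketched as a reference point; the operator shrinks every singular value by a threshold, driving small ones to zero and thereby promoting a low-rank solution.

```python
import numpy as np

def svt(M, tau):
    """Singular Value Thresholding operator: shrink each singular
    value of M by tau, zeroing out the small ones (low-rank prior)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.clip(s - tau, 0, None)
    return (U * s) @ Vt

# toy example: a near-rank-1 matrix plus a small full-rank perturbation
A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0]) + 0.05 * np.eye(3)
A_low = svt(A, tau=0.5)   # thresholding removes the small singular values
```

After thresholding, only the dominant singular direction survives, which is the behavior a low-rank regularizer on the distance matrix is meant to induce.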
IV-A Optimizing T-stage via an Alternating Strategy
IV-A1 Updating the Distance Matrix on Survived Features
IV-A2 Updating the Distance Matrix on All Features
With the other distance matrix and the flow matrices fixed, the optimization problem for the remaining distance matrix in Eq. (4) can be denoted as:
Similarly, the corresponding updating operator can be given as:
IV-A3 Updating the Flow Matrices by Fixing the Distance Matrices
When the distance matrices are fixed, Eq. (4) can be split into several independent smoothed Wasserstein distance subproblems, which can be solved by the method of [pmlr-v51-rolet16]. We omit the detailed solving process for simplicity.
IV-B Optimizing I-stage via an Alternating Strategy
IV-B1 Updating the Distance Matrix by Fixing the Flow Matrices
IV-B2 Updating the Flow Matrices by Fixing the Distance Matrix
The optimization procedure for the flow matrices in the I-stage is the same as that in the T-stage: with the distance matrix fixed, we split Eq. (6) into several independent smoothed Wasserstein distance subproblems and solve them via [pmlr-v51-rolet16].
IV-C Computational Complexity Analysis
The main computational cost of our EML model involves the updating operations in both the T-stage and the I-stage. Specifically, in the T-stage, the costs of updating the two distance matrices scale with their corresponding feature dimensionalities; in the I-stage, solving the distance matrix in Eq. (6) has an analogous cost; and the cost of solving the flow matrices in both stages depends on the signature sizes. Since the rank of the learned metric is usually small compared with the number of features and samples, our proposed model is computationally efficient in an online manner.
V Experiments
This section first introduces the detailed experimental configurations and competing methods. Then the experimental results and analyses of our EML model in both one-shot and multi-shot scenarios are provided.
V-A Configurations and Competing Methods
The experimental configurations of our EML model in the one-shot scenario and the competing methods are introduced in detail in this subsection.
V-A1 Experimental Configurations
As shown in Table I, we conduct extensive comparisons on one real-world human motion recognition dataset [DBLP:journals/corr/abs-1904-12602] (i.e., EV-Action) and five synthetic benchmark datasets (http://archive.ics.uci.edu/ml/): three digit datasets (i.e., Mnist, Gisette and USPS), one DNA dataset (i.e., Splice) and one image dataset (i.e., Satimage). Specifically, the EV-Action dataset [DBLP:journals/corr/abs-1904-12602] is a large-scale human action dataset with 5300 samples collected from three sensors, i.e., a depth camera, an RGB camera, and skeleton sensors. EV-Action consists of 20 common action classes, where 10 actions are performed by a single subject and the others are accomplished by the same subjects interacting with other objects. This dataset is a typical real-world application of feature evolution, where the features collected from the depth camera, RGB camera, and skeleton sensors are respectively regarded as the vanished, survived and augmented features. Some human action samples are visualized in Fig. 4.
TABLE II: Comparisons between our model and state-of-the-art methods in terms of accuracy (%) on seven datasets: means and standard errors averaged over fifty random runs in the one-shot scenario. Models with the best performance are bolded. Compared methods: Pegasos [Shalev-Shwartz2011], OPMV [Zhu2015OnePassML], TCA, BDML [Xu:2018:BDM:3327144.3327333], OPML [LI2018302], OPIDe [Hou2018OnePassLW], OPID [Hou2018OnePassLW], and Ours.
For a fair comparison with [Hou2018OnePassLW], we adopt the same experimental settings as [Hou2018OnePassLW] in both one-shot and multi-shot scenarios, elaborated as follows: 1) The number of samples in each batch is the same, and the number of samples in each class is equal across all training and test batches. 2) In the T-stage, the total number of training samples is fixed while the number of samples in each batch varies; accordingly, the numbers of training and test samples also change in the last testing phase. 3) We assign the first portion of features as vanished features, the next portion as survived features, and the rest as newly augmented features; in our experiments, the first quarter and the last quarter of the features are the vanished and augmented features, respectively. 4) All experimental results are averaged over fifty random repetitions.
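The feature-splitting protocol of setting 3) can be sketched as follows; the helper name is ours, and the quarter/half proportions follow the description above.

```python
import numpy as np

def split_evolving_features(X):
    """Split a feature matrix column-wise into vanished / survived /
    augmented parts: first quarter vanished, middle half survived,
    last quarter augmented."""
    d = X.shape[1]
    q = d // 4
    return X[:, :q], X[:, q:d - q], X[:, d - q:]

X = np.arange(24, dtype=float).reshape(3, 8)   # 3 samples, 8 features
vanished, survived, augmented = split_evolving_features(X)
```

With this split, the T-stage sees the vanished and survived columns, while the I-stage sees the survived and augmented columns, mimicking the sensor expiry/deployment scenario.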
V-A2 Competing Methods
We validate the superiority of our model by comparing it with the following methods: one-pass Pegasos [Shalev-Shwartz2011], which assumes that all vanished features are known in the I-stage and all augmented features are known in the T-stage; OPMV [Zhu2015OnePassML], which regards the vanished and survived features as the first view and the survived and augmented features as the second view; TCA, which assumes the data in the T-stage and I-stage come from the source and target domains, respectively; BDML [Xu:2018:BDM:3327144.3327333] and OPML [LI2018302], metric learning methods that only utilize the samples with augmented features and ignore the previously vanished features; and OPID and OPIDe [Hou2018OnePassLW], which constitute a one-pass incremental and decremental model for feature evolution.
V-B Experiments in One-shot Scenario
In this subsection, we present the comprehensive experimental analysis, ablation studies, effects of hyper-parameters and convergence investigations of our EML model in the one-shot scenario, followed by the computational costs of model optimization.
V-B1 Experimental Analysis
The experimental results for the one-shot scenario are shown in Table II. From these results, we have the following observations: 1) Although our model cannot access the vanished features in the T-stage or the augmented features in the I-stage, both the transforming and inheriting strategies efficiently exploit useful information from the vanished features and expand it to the augmented features. 2) Our model can be successfully applied to both high-dimensional (e.g., EV-Action and Gisette) and low-dimensional (e.g., Satimage) feature evolution, which are challenging tasks for exploring the data structure with the existing features. 3) When we use the distance matrix learned in the T-stage to assist training in the I-stage, the testing performance of our model increases significantly, even though the training samples in the I-stage are relatively rare. 4) Our model performs better than OPID and OPIDe [Hou2018OnePassLW], since the T-stage explores important information from the vanished features and the I-stage efficiently inherits the metric performance from the T-stage to take advantage of the augmented features.