Visual classification in an “open universe” setting is often accomplished by mapping each image onto a vector space using a function (“model”) implemented by a deep neural network (DNN). The output of such a function in response to an image is often called its “embedding”[7, 30]. Dissimilarity between a pair of images can then be measured by some type of distance between their embedding vectors. A good embedding is expected to cluster images belonging to the same class in the embedding space.
Figure 1: (a) Model update without backward-compatible representation. (b) Model update with backward-compatible representation.
As images of a new class become available, their embedding vectors are used to spawn a new cluster in the open universe, possibly modifying its metric to avoid crowding, in a form of “life-long learning.” This process is known as indexing. It is common in modern applications to have millions, in some cases billions, of images indexed into hundreds of thousands to millions of clusters. This collection of images is usually referred to as the gallery set. A common use for the indexed gallery set is to identify the closest clusters to one or a set of input images, a process known as visual search or visual retrieval. The set of input images for this task is known as the query set. Besides the gallery and the query set, there is usually a separate large repository of images used for training the embedding model [32, 29], called the embedding training set.
As time goes by, the datasets grow and the quality of the embeddings improves with newly trained models [36, 33, 5, 34]. However, to harvest the benefits of new models, one has to use them to re-process all images in the gallery set, re-generating their embeddings and re-creating the clusters, a process known as "backfilling" or "re-indexing." (The reader may have experienced this process when upgrading photo collection software, whereby the search feature is unavailable until the software has re-indexed the entire collection.) This is a minor inconvenience in a personal photo collection, but for large-scale galleries the cost in time and computation may be prohibitive, thus hampering the continuous update potential of the system. In this paper, we aim to design a system that enables new models to be deployed without having to re-index existing image collections. We call such a system backfill-free, the resulting embedding a backward-compatible representation, and the enabling process backward-compatible training (BCT).
We summarize our contributions as follows: 1) We formalize the problem of backward-compatible representation learning in the context of open-set classification, or visual retrieval. The goal is to enable new models to be deployed without having to re-process the previously indexed gallery set. The core of this problem is backward compatibility, which requires a new model's embedding to be usable for comparisons against the old model's embedding without compromising recognition accuracy. 2) We propose a novel backward-compatible training (BCT) approach that adds an influence loss, which uses the learned classifier of the old embedding model when training the new embedding model. 3) We achieve backward-compatible representation learning with minimal loss of accuracy, enabling backfill-free updates of the models. We empirically verify that BCT is robust to multiple changing factors in training the embedding models, e.g., neural network architectures, loss functions, and data growth. Finally, 4) we show that compatibility between multiple models can be attained via chain-like pairwise BCT training.
1.1 Related Work
Embedding learning and open-set recognition. Open-set visual recognition [24, 25] is relevant to retrieval [6, 3], face recognition [28, 32], and person re-identification [13, 45, 1]. Common approaches extract visual features to instantiate test-time classifiers. Deep neural networks (DNNs) are widely applied to 1) learn embedding models using closed-world classification as a surrogate task [12, 28], with various forms of loss functions [34, 33, 38] and supervision methods [9, 12] to improve generalization; and 2) perform metric learning by enforcing affinity for pairs [26, 1] or triplets [22, 8] of representations in the embedding space. One line of work learns a single metric that is compatible across all tasks in a multi-task learning setting. Supervising representation learning with the classifier weights of other model versions has also been proposed for unsupervised representation learning.
Learning across domains and tasks. In domain adaptation [19, 4, 35], techniques such as MMD and related methods [31, 17, 40] can be used to align the (marginal) distributions of the new and old classes, including those trained adversarially. Knowledge distillation [9, 15] trains new models to learn from existing models, but, unlike backward-compatible representation learning, it does not require the embeddings of the new model and the existing one to be compatible at inference time. Continual learning [21, 15] and life-long learning [23, 11] deal with cases where an existing model evolves over time. Model distillation has been used as a form of regularization when introducing new classes, and old class centers have been used to regularize samples from the new classes. Hou et al. proposed a framework for learning a unified classifier in the incremental setting; other work designed dedicated re-training loss functions. Methods addressing catastrophic forgetting are most closely related to our work, as a common cause of forgetting is the changing of the visual embedding that feeds subsequent classifiers. The problem we address differs in that we aim to achieve backward compatibility between any pair of old and new models. The new model is not required to be initialized by, nor share a similar network architecture with, the old model.
Compatible representations. Prior work discusses possible mappings between feature vectors from multiple models trained on the same dataset; [44, 42, 43] introduce a design where multiple models with different channel widths but the same architecture share common subsets of parameters and representations, which implicitly imposes compatibility among representations from different models.

We propose an approach to solve the problem of backward compatibility in deep learning, in the sense defined in the previous section. We focus on open-universe classification using metric discriminant functions.
We first formulate the problem of backward compatible representation learning, then describe a backward compatible training approach and its implementations.
2.1 Problem Formulation
As a prototypical application, we use the case of a photo collection $\mathcal{G}$, serving the role of the gallery. $\mathcal{G}$ is grouped into a number of classes or identities. We have an embedding model $\phi_{old}$ that maps each image $x$ onto an embedding vector $\phi_{old}(x) \in \mathbb{R}^{d_{old}}$. The embedding model is trained on an embedding training set $\mathcal{T}_{old}$. The embedding of any image produced by $\phi_{old}$ can then be assigned to a class through some distance $d(\cdot,\cdot)$. In the simplest case, dropping the subscript "old," each class $k$ in $\mathcal{G}$ is associated with a "prototype" or cluster center $c_k$. The vector $c_k$ for class $k$ can be obtained by a set function $c_k = S(\{\phi(x_i) \mid y_i = k\})$, where $y_i$ is the class label of image $x_i$. Common choices of the set function include averaging and attention models. A test sample $x$ is assigned to the class $\hat{y} = \arg\min_k d(\phi(x), c_k)$. Later, a new model $\phi_{new}$ with $d_{new}$-dimensional embedding vectors becomes available, for instance trained with additional data in a new embedding training set $\mathcal{T}_{new}$ (which can be a superset of $\mathcal{T}_{old}$), or using a different architecture. The new embedding potentially lives in a different embedding space, and it is possible that $d_{new} \neq d_{old}$.
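The prototype-based assignment above can be sketched in a few lines of numpy; this is an illustrative sketch (function names are ours, not the paper's), assuming averaging as the set function and Euclidean distance:

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """One prototype per class: the mean of its embeddings (averaging set function)."""
    protos = {}
    for y in set(labels):
        members = [e for e, l in zip(embeddings, labels) if l == y]
        protos[y] = np.mean(members, axis=0)
    return protos

def assign(query_embedding, protos):
    """Assign a test sample to the class whose prototype is nearest."""
    return min(protos, key=lambda y: np.linalg.norm(query_embedding - protos[y]))
```

An attention-based set function would replace the plain mean with a weighted one; the assignment rule is unchanged.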
To harvest the benefit of the new embedding model $\phi_{new}$, we wish to use $\phi_{new}$ to process any new images added to the gallery set, as well as the images of the query set. Since the gallery set may acquire additional images and clusters, we denote the updated gallery as $\mathcal{G}_{new}$. The question then becomes how to deal with the images already in $\mathcal{G}$. In order to make the system backfill-free, we wish to directly use the already-computed embeddings $\phi_{old}(x)$ for these images. Our goal, then, is to design a training process for the new embedding model $\phi_{new}$ so that any test image can be assigned to classes, new or old, in $\mathcal{G}_{new}$, without the need to re-compute the old gallery embeddings with $\phi_{new}$, i.e., to backfill. The resulting embedding $\phi_{new}$ is then backward-compatible with $\phi_{old}$.
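The backfill-free protocol amounts to embedding queries with the new model while leaving the stored old-model gallery embeddings untouched. A minimal sketch (our own illustrative names, with `phi_new` standing in for any embedding function):

```python
import numpy as np

def backfill_free_search(query_images, gallery_old_embeddings, phi_new):
    """Embed each query with the new model and compare it directly against the
    stored old-model gallery embeddings -- no re-indexing of the gallery."""
    hits = []
    for x in query_images:
        q = phi_new(x)
        dists = np.linalg.norm(gallery_old_embeddings - q, axis=1)
        hits.append(int(np.argmin(dists)))
    return hits
```

This only retrieves the right clusters if $\phi_{new}$ and $\phi_{old}$ are compatible in the sense formalized next.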
2.2 Criterion for Backward Compatibility
In a strict sense, a model $\phi_{new}$ is backward-compatible with $\phi_{old}$ if

$$d(\phi_{new}(x_i), \phi_{old}(x_j)) \le d(\phi_{old}(x_i), \phi_{old}(x_j)), \quad \forall (i,j)\ \text{s.t.}\ y_i = y_j,$$
$$d(\phi_{new}(x_i), \phi_{old}(x_j)) \ge d(\phi_{old}(x_i), \phi_{old}(x_j)), \quad \forall (i,j)\ \text{s.t.}\ y_i \neq y_j, \tag{1}$$

where $d(\cdot,\cdot)$ is a distance in the embedding space. These constraints formalize the fact that the new embedding, when used to compare against the old embedding, must be at least as good as the old one in separating images from different classes and grouping those from the same class. Note that the trivial solution $\phi_{new} = \phi_{old}$ is backward-compatible. This trivial solution is excluded if the architectures are different, which is usually the case when updating a model. Although, to simplify the discussion, we assume the embedding dimensions of the two models to be the same ($d_{new} = d_{old}$), our method is more general and not bound by this assumption.
The criterion introduced in Eq. (1) entails testing the gallery exhaustively, which is intractable at large scale and in the open-set setting. On the other hand, suppose we have an evaluation metric $M(\phi_Q, \phi_G; Q, G)$ on some testing protocol, e.g., the true positive identification rate for face search, where $Q$ denotes the query set, $G$ denotes the gallery set, and we use $\phi_Q$ to extract the query-set features and $\phi_G$ for the gallery set. Then the empirical compatibility criterion for the application can be defined as

$$M(\phi_{new}, \phi_{old}; Q, G) > M(\phi_{old}, \phi_{old}; Q, G). \tag{2}$$
This criterion can be interpreted as follows: in an open-set recognition task with a fixed query set and a fixed gallery set, when the accuracy obtained by using $\phi_{new}$ for queries, without backfilling the gallery images, surpasses that of using $\phi_{old}$ throughout, we consider backward compatibility achieved and a backfill-free update feasible. Note that simply setting $\phi_{new}$ to $\phi_{old}$ will not satisfy this criterion, since the inequality is strict.
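The empirical criterion is a single strict inequality between two evaluations of the same metric, which makes it trivial to check programmatically. A sketch with our own illustrative names, where `M` is any evaluation metric taking (query model, gallery model, query set, gallery set):

```python
def is_backward_compatible(M, phi_new, phi_old, Q, G):
    """Empirical criterion of Eq. (2): accuracy with new-model queries against the
    old-model gallery must strictly exceed the old model's self-consistent accuracy."""
    return M(phi_new, phi_old, Q, G) > M(phi_old, phi_old, Q, G)
```

Because the inequality is strict, passing the same model for both arguments always fails the check, as noted above.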
2.3 Baseline and paragon
A naive approach to train the new model $\phi_{new}$ to be compatible with $\phi_{old}$, assuming they have the same dimension, is to minimize the distance between their embeddings computed on the same images. This is enforced for every image in $\mathcal{T}_{new}$, which is used to train $\phi_{new}$. This criterion can be framed as an additive regularizer for the empirical loss when training the new embedding:

$$L(\phi_{new}) = L_{task}(\phi_{new}; \mathcal{T}_{new}) + \lambda \sum_{x \in \mathcal{T}_{new}} \lVert \phi_{new}(x) - \phi_{old}(x) \rVert^2. \tag{3}$$

We label the solution of this problem $\phi_{new}^{\ell 2}$. Note that $\phi_{old}$ is fixed during the training of $\phi_{new}$. As we show in Sect. 3.4, $\phi_{new}^{\ell 2}$ does not satisfy Eq. (2), and it will not converge to $\phi_{old}$, since the training set has changed to $\mathcal{T}_{new}$. So this naive approach cannot be used to obtain a backward-compatible representation.
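The naive regularizer can be sketched as follows (an illustrative sketch with our own names; in practice the penalty is computed per mini-batch and backpropagated through `phi_new` only, with `phi_old` frozen):

```python
import numpy as np

def l2_regularized_loss(task_loss, phi_new, phi_old, images, lam=1.0):
    """Naive baseline of Eq. (3): task loss plus an l2 penalty that pulls the new
    embedding toward the frozen old embedding on every training image."""
    penalty = sum(float(np.sum((phi_new(x) - phi_old(x)) ** 2)) for x in images)
    return task_loss + lam * penalty
```

The penalty is zero exactly when the two embeddings coincide on the training images, which is why it biases the new model so locally toward the old one.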
On the other hand, performing the backfill on $\mathcal{G}$ with the new model $\phi_{new}$, trained without any regularization, can be taken as a paragon. Since the embeddings for $\mathcal{G}$ are re-computed, we can fully enjoy the benefit of $\phi_{new}$, albeit at the cost of reprocessing the gallery. This sets the upper bound of accuracy for the backfill-free update, and thus the upper bound of the update gain.
2.4 Backward Compatible Training
We now focus on backward-compatible training for classification using the cross-entropy loss. Let $\phi_w$ be a model parametrized by two disjoint sets of weights, $w_c$ and $w_e$. The first parametrizes the classifier $\kappa_{w_c}$, or the "head" of the model, whereas the second parametrizes the embedding $\phi_{w_e}$, so that the class posterior is $\kappa_{w_c} \circ \phi_{w_e}(x)$. Now, the cross-entropy loss can be written as

$$L(w_c, w_e; \mathcal{T}) = \sum_{(x_i, y_i) \in \mathcal{T}} -\log \kappa_{w_c}\!\left(\phi_{w_e}(x_i)\right)_{y_i}.$$
As for the new model $\phi_{new}$, while ordinary training would yield

$$w_c^{new}, w_e^{new} = \arg\min_{w_c, w_e} L(w_c, w_e; \mathcal{T}_{new}),$$

to ensure backward compatibility we add a second term to the loss that depends on the classifier of the old model:

$$w_c^{new}, w_e^{new} = \arg\min_{w_c, w_e} L(w_c, w_e; \mathcal{T}_{new}) + \lambda\, L(w_c^{old}, w_e; \mathcal{T}_{BCT}).$$
We call the second term the "influence loss," since it biases the solution towards one that can use the old classifier. Note that $w_c^{old}$ in the influence loss is fixed during training. Here, $\mathcal{T}_{BCT}$ is a design parameter referring to the set of images to which we apply the influence loss: it can be either $\mathcal{T}_{old}$ or $\mathcal{T}_{old} \cup \mathcal{T}_{new}$. The approach using $\mathcal{T}_{old} \cup \mathcal{T}_{new}$ as $\mathcal{T}_{BCT}$ is introduced in Sect. 2.5. Note that the classifiers of the new and old models can be different. We call this method backward-compatible training, and the result a backward-compatible representation or embedding, which we evaluate empirically in the next section.
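For a single sample and plain softmax heads, the BCT objective can be sketched as below (a minimal numpy sketch with our own names; real training uses mini-batches, margin-based heads, and autograd, and `W_old` stays frozen):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Numerically stable cross-entropy for one sample."""
    z = logits - logits.max()
    return float(np.log(np.sum(np.exp(z))) - z[label])

def bct_loss(emb, label, W_new, W_old, lam=1.0):
    """BCT objective for one sample: cross-entropy through the new classifier plus
    the influence loss -- cross-entropy of the same embedding through the frozen
    old classifier."""
    return softmax_cross_entropy(W_new @ emb, label) \
        + lam * softmax_cross_entropy(W_old @ emb, label)
```

Setting `lam=0` recovers ordinary training; a positive `lam` biases the embedding toward solutions the old classifier can still score correctly.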
2.5 Learning with Backward Compatible Training
In the proposed backward compatible training framework, there are several design choices to make.
Form of the classifier. The classifiers of the new and old models can be of the same form, for instance softmax, angular softmax, or cosine margin. They can also be of different forms, which is common when better loss formulations are proposed and applied to training new embedding models.
Backward compatibility training dataset.
The most straightforward choice for the dataset $\mathcal{T}_{BCT}$, on which we apply the influence loss, is $\mathcal{T}_{old}$, which was used to train the old embedding $\phi_{old}$. The intuition is that, since the old model was optimized together with its classifier on the original training set $\mathcal{T}_{old}$, a new embedding model with a low influence loss will work with the old model's classifier and thus with the embedding vectors produced by $\phi_{old}$.
The second choice of $\mathcal{T}_{BCT}$ is $\mathcal{T}_{old} \cup \mathcal{T}_{new}$; this means we compute the influence loss not only on the old training data but also on the new training data. However, this choice poses a challenge in computing the loss value for images of classes in $\mathcal{T}_{new}$ that are absent from $\mathcal{T}_{old}$, since the old classifier has no parameters for those classes. We propose two rules for computing the loss value for these images:
Synthesized classifier weights. For classes in $\mathcal{T}_{new}$ that are not among the classes in $\mathcal{T}_{old}$, we create "synthesized" classifier weights by computing the average feature vector of $\phi_{old}$ on the images of each class. This approach is inspired by open-set recognition using the class prototype vector as described in Sect. 2.1, with averaging as the set function. The synthesized classifier weights for the new classes are concatenated with the existing $w_c^{old}$ to form the classifier parameters for the influence loss term.
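Treating each classifier weight row as a class prototype, the synthesis step can be sketched as follows (illustrative names; we assume the old classifier is a plain weight matrix with one row per old class):

```python
import numpy as np

def synthesize_classifier(W_old, images_per_new_class, phi_old):
    """Extend the frozen old classifier with one synthesized weight vector per new
    class, obtained by averaging old-model embeddings of that class's images."""
    synthesized = [np.mean([phi_old(x) for x in imgs], axis=0)
                   for imgs in images_per_new_class]
    return np.vstack([W_old, np.stack(synthesized)])
```

The extended matrix is then used in place of $w_c^{old}$ inside the influence loss, so images of new classes also contribute a well-defined loss value.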
Knowledge distillation. Alternatively, we penalize the KL-divergence between the classifier output probabilities obtained by feeding the old and new embeddings through the existing classifier parameters $w_c^{old}$. This removes the requirement to add the new classes in $\mathcal{T}_{new}$ to the classifier corresponding to $\phi_{old}$.
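This distillation-style term can be sketched per sample as below (our own illustrative names; a temperature on the softmax is a common extension we omit here):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_influence(emb_new, emb_old, W_old):
    """Distillation variant of the influence loss: KL divergence between the old
    classifier's output on the old embedding (teacher) and on the new embedding
    (student)."""
    p = softmax(W_old @ emb_old)   # teacher probabilities
    q = softmax(W_old @ emb_new)   # student probabilities
    return float(np.sum(p * np.log(p / q)))
```

The term vanishes when the two embeddings induce identical class posteriors under the old classifier, so no class labels, and hence no new-class weights, are needed.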
Backward-compatible training is not restricted to a particular neural network architecture or loss function. It only requires that both the old and new embedding models be trained with classification-based losses, which is common in open-set recognition problems [34, 13]. It also requires no modification of the architecture or parameters of the old model $\phi_{old}$.
We assess the effectiveness of the proposed backward-compatible training in face recognition. We start with several baselines, then test the hypothesis that BCT leads to backward-compatible representation learning on two face recognition tasks: face verification and face search. Finally, we demonstrate the potential of BCT by applying it to cases of multi-factor model changes and by showing that it can construct multiple mutually compatible models.
3.1 Datasets and Face Recognition Metrics
We use the IMDB-Face dataset for training face embedding models. IMDB-Face contains about 1.7M images of 59K celebrities. For the open-set test, we use the widely adopted IJB-C face recognition benchmark, which contains images from 3,531 identities, including both still images and video frames. We adopt the two standard testing protocols for face recognition: 1:1 verification and 1:N search (open set). For 1:1 verification, a pair of templates (a template contains one or more face images of the same person) is presented, and the algorithm must decide whether they belong to the same person. The evaluation metric for this protocol is the true acceptance rate (TAR) at different false acceptance rates (FAR); we report TAR at a fixed FAR.

For 1:N search, a set of templates is first indexed as the gallery set. Each template in the query set is then used to search against the indexed templates. The quality metric for this protocol is the true positive identification rate (TPIR) at different false positive identification rates (FPIR); we report TPIR at a fixed FPIR.
3.2 Implementation details
We use 8 NVIDIA Tesla V-100 GPUs to train the embedding models. We use face mis-alignment and color distortion for data augmentation, and standard stochastic gradient descent (SGD) with weight decay to optimize the loss. The learning rate is decayed after 8, 12, and 14 epochs, and training stops after 16 epochs. Unless stated otherwise, we use ResNet-101 as the backbone, with a linear transform after its global average pooling layer to emit 128-dimensional feature vectors, and the Cosine Margin Loss as the loss function in our experiments.
3.3 Measuring Backward-Compatibility
Based on the accuracy of the individual tests on the face recognition dataset, we can test whether a pair of models satisfies the empirical backward compatibility criterion. For a pair of models $(\phi_{new}, \phi_{old})$, on each evaluation protocol we test whether they satisfy Eq. (2). If so, we consider the new model backward-compatible with the old model on the corresponding task. When testing with the IJB-C 1:N protocol, we use the new model to extract embeddings for the query set and the old model to compute embeddings for the gallery set. For the IJB-C 1:1 verification protocol, we use $\phi_{new}$ to extract the embedding of the first template in each pair and $\phi_{old}$ for the second.
To evaluate the relative improvement brought by the backfill-free update, we define the update gain as

$$\Delta = \frac{M(\phi_{new}, \phi_{old}; Q, G) - M(\phi_{old}, \phi_{old}; Q, G)}{M^{*} - M(\phi_{old}, \phi_{old}; Q, G)},$$

where $M^{*}$ stands for the best accuracy achievable by any variant of the new model with backfilling. The update gain indicates the proportional benefit obtained by the backfill-free update compared with an update that performs the backfill regardless of the cost and interruption of service. Note that the update gain is only valid when Eq. (2) is satisfied.
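The update gain is a simple ratio of accuracy differences; as a percentage it can be sketched as (illustrative names for the three accuracy levels):

```python
def update_gain(m_cross, m_old, m_paragon):
    """Update gain as a percentage: the fraction of the paragon's improvement over
    the old model that the backfill-free update recovers. Only meaningful when
    m_cross > m_old, i.e., when Eq. (2) holds."""
    return 100.0 * (m_cross - m_old) / (m_paragon - m_old)
```

For example, if the old model scores 0.70, the backfill-free cross-model test 0.80, and the fully backfilled paragon 0.90, the backfill-free update recovers half of the attainable improvement.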
Table 1: Backward compatibility tests for each comparison pair on IJB-C 1:1 verification (TAR (%) @ FAR) and IJB-C 1:N retrieval (TPIR (%) @ FPIR).
3.4 Baseline comparisons
The first hypothesis to be tested is whether BCT is necessary at all: is it possible to achieve backward compatibility with a more straightforward approach? In this section, we experiment with several baseline approaches and validate the necessity of BCT.
Independently trained $\phi_{old}$ and $\phi_{new}$. The first sanity check is to directly compare the embeddings of two versions of models trained independently. A similar experiment was done for multiple closed-set classification models trained on the same dataset. Here we present two models: the old model is trained on a randomly sampled subset of the IDs in the IMDB-Face dataset, and the new model is trained on the full IMDB-Face dataset. This emulates the case where a new embedding model becomes available as the embedding training dataset grows. Per the experiments in Sect. 3.5, this new model achieves the best accuracy among all new models in the same setting. We directly test the compatibility of this pair of models following the procedure described in Sect. 3.3; the results are shown in Tab. 1. In the backward test for both protocols, we observe accuracy close to zero. Unsurprisingly, independently trained $\phi_{old}$ and $\phi_{new}$ do not naturally satisfy our compatibility criterion.
Does the naive baseline with $\ell_2$ distance work? In Sect. 2.3 we described the naive approach of adding the $\ell_2$ distance between the new and old embeddings as a regularizer when training the new model. We train a new model against the same old model as above, using the loss function of Eq. (3) on the whole IMDB-Face dataset; we name this model $\phi_{new}^{\ell 2}$ to reflect that it is $\ell_2$-regularized towards the old model. The same backward compatibility test is conducted for this pair of models on the two protocols described in the previous baseline, with results in Tab. 1. This approach yields a backward-test accuracy only slightly above zero, meaning the new model is far from satisfying the compatibility criterion. One possible reason is that imposing an $\ell_2$ distance penalty creates a bias that is too local and restrictive to allow the new model to satisfy the compatibility constraints.
Legend: one variant adds a ReLU module after the embedding output of the new model when training with BCT; one old-model variant uses a normalized softmax classifier, with the corresponding new model using a cosine margin classifier and trained by BCT against that old model; another variant uses the standard softmax loss as the training loss.
3.5 Learning with BCT
We now experiment with the proposed BCT framework for backward-compatible representation learning, starting from its basic form described in Sect. 2.4. We use the same old model as in the previous section, and train the new model with the BCT objective of Sect. 2.4. As shown in Tab. 1, this model pair satisfies the backward compatibility criterion of Eq. (2). Additionally, we observe positive update gains on both the 1:1 verification and 1:N search protocols.
We also evaluate a baseline adapted from the continual-learning literature, in which the old model and its classifier are kept fixed and used to output soft labels for newly added samples, which serve as pseudo-labels for training the new model. From Tab. 1 we see that this model pair does not satisfy the empirical backward compatibility criterion, showing that directly adapting methods from continual learning does not work out of the box. However, it does improve the backward-comparison accuracy to some extent, suggesting that the knowledge distillation used in continual learning could be useful within BCT. We investigate this further in the following experiments.
BCT with newly added training data. In Sect. 2.5 we described two instantiations of BCT that can accommodate the new classes in the growing embedding training set: one using the synthesized classifier, and one applying knowledge distillation to bypass obtaining classifier parameters for the new classes. The backward compatibility test results are summarized in Tab. 1. Both new models achieve backward compatibility. By fully utilizing the additional training data, they also attain higher update gains (30.00 and 27.25, respectively) than the basic form of BCT (26.26).
Does BCT hurt the accuracy of new models? A natural question is whether the influence loss is detrimental to the new model's recognition performance. We assess this by performing standard face recognition experiments on the 1:1 and 1:N protocols, extracting embeddings with the new models only. This corresponds to performing the backfill, i.e., the paragon setting described in Sect. 2.3. The results are summarized in Tab. 2. Training without BCT still yields the best accuracy in this setting, so we treat the model trained without BCT as the paragon among all variants of the new model. Notably, the model trained with the basic form of BCT incurs only a small drop in accuracy on both tasks, and the two variants of Sect. 2.5 further reduce the gap.
3.6 Extensions of BCT
In the following experiments we explore whether BCT can be applied to different types of model training and achieve multi-model compatibility.
Other changes in training. Besides increasing the size of the embedding training set, the new model could have a new architecture (e.g., depth), a different supervision loss, or a different embedding dimension. We examine the effect of these factors on BCT. For network architecture, we test a new model using ResNet-152 instead of the ResNet-101 used in previous experiments. For loss type, we test using the Norm-Softmax loss for the old model and the Cosine Margin Loss for the new one. For embedding dimension, we test increasing the dimension from 128 to 256 in the new model. Note that when the new model's feature dimension changes, we cannot directly feed its features to the old classifier; we simply take the first 128 elements of the new model's feature vector to feed into the old classifier during backward-compatible training and testing. We also tried adding a linear transformation during training to match the feature dimensions, but without success.
We also test changing several factors together. The results are shown in Tab. 3. BCT makes most new models backward-compatible, even when several factors change simultaneously. This shows that BCT can serve as a general framework for achieving backward-compatible representation learning.
There are two failure cases where backward compatibility is not achieved: 1) the pair using the softmax loss for the old model and the Cosine Margin Loss for the new model, possibly due to the drastic change in the form of the loss function; and 2) the pair that adds a ReLU activation on the embedding of the new model, possibly due to the distributional shift introduced by the ReLU: the new model's non-negative embedding vectors are difficult to make compatible with the old model. This suggests that additional work is needed to expand the set of models that BCT can support.
Table 5: Backward compatibility test on Market-1501, reporting mean AP (%), whether backward compatibility is achieved, update gain (%), and absolute gain for each comparison pair.
Towards multi-model and sequential compatibility. Here we investigate a simple case of three model versions. The first version is trained on a randomly sampled subset of the IMDB-Face dataset, the second on a larger subset, and the third on the full IMDB-Face dataset. We train the second model using BCT against the first, and the third using BCT against the second; thus the third model receives no direct influence from the first. The backward compatibility test results are shown in Tab. 4 and Fig. 2. We observe that, by training with BCT, the last model is transitively compatible with the first, even though the first model is not directly involved in training it. This shows that transitive compatibility between multiple models is achievable through BCT, which could enable sequential updates of the embedding models.
BCT in other open-set recognition tasks. We validate the BCT method on the person re-identification task using the Market-1501 benchmark. We train an old embedding model on a subset of the training data and two new embedding models on the full training data. Search mean average precision (mean AP) is used as the accuracy metric. Table 5 shows the results of the backward compatibility test. The new model trained with BCT achieves backward compatibility without sacrificing its own search accuracy, suggesting that BCT can serve as a general approach for open-set recognition problems.
We have presented a method for achieving backward-compatible representation learning, illustrated specific instances, and compared them with both baselines and paragons. Our approach has several limitations. The first is the accuracy gap between new models trained with BCT and a new model trained oblivious to previous constraints. Though the gap is reduced by slightly more sophisticated forms of BCT, there is still work to be done in characterizing and achieving the attainable accuracy limits.
-  (2015) An improved deep learning architecture for person re-identification. In , pp. 3908–3916. Cited by: §1.1.
-  (2019) A case for backward compatibility for human-ai teams. arXiv preprint arXiv:1906.01148. Cited by: §1.1.
-  (2016) Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1563–1572. Cited by: §1.1.
-  (2016) Domain separation networks. In Advances in neural information processing systems, pp. 343–351. Cited by: §1.1.
-  (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §1.
-  (2016) Deep image retrieval: learning global representations for image search. In European conference on computer vision, pp. 241–257. Cited by: §1.1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §3.2, §3.6.
-  (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §1.1.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.1, §1.1.
-  (2018) Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1335–1344. Cited by: §1.1.
-  (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 831–839. Cited by: §1.1.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.1, §2.4, §3.6.
-  (2014) Deepreid: deep filter pairing neural network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 152–159. Cited by: §1.1, §2.5.
-  (2016) Convergent learning: do different neural networks learn the same representations?. In Iclr, Cited by: §1.1, §3.4.
-  (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: 7(a), §1.1, §3.5, 0(a).
-  (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220. Cited by: §2.4, §2.5.
-  (2015) Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791. Cited by: §1.1.
-  (2018) IARPA janus benchmark-c: face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pp. 158–165. Cited by: Table 8, Appendix C, §3.1, §3.3.
-  (2017) Open set domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 754–763. Cited by: §1.1.
-  (2010) Large margin multi-task metric learning. In Advances in neural information processing systems, pp. 1867–1875. Cited by: §1.1.
-  (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §1.1.
-  (2015) Deep face recognition. In BMVC, Vol. 1, pp. 6. Cited by: §1.1.
-  (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §1.1.
-  (2012) Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence 35 (7), pp. 1757–1772. Cited by: §1.1.
-  (2014) Probability models for open set recognition. IEEE transactions on pattern analysis and machine intelligence 36 (11), pp. 2317–2324. Cited by: §1.1.
-  (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1.1.
-  Improving CNN classifiers by estimating test-time priors. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §1.1.
-  (2014) Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pp. 1988–1996. Cited by: §1.1, §2.4.
-  (2014-06) Deep learning face representation from predicting 10,000 classes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
-  (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §1.1.
-  (2018) The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 765–780. Cited by: §1.1, §1, §3.1, §3.4, §3.4, §3.6.
-  (2017) Normface: l 2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1041–1049. Cited by: 6(a), §1.1, §1, §2.4, §3.6, 2(a).
-  (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: Appendix A, 6(a), 7(a), §1.1, §1, §2.4, §2.5, §2.5, §3.2, §3.6, §3.6, 2(a).
-  (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §1.1.
-  (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §1, §3.2.
-  (2018-06) Unsupervised feature learning via non-parametric instance discrimination. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.1.
Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1249–1258. Cited by: §1.1.
-  (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424. Cited by: §3.6, Table 5.
-  (2017) Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2272–2281. Cited by: §1.1.
-  (2017) Neural aggregation network for video face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4362–4371. Cited by: §2.1.
-  (2019) Network slimming by slimmable networks: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728. Cited by: §1.1.
-  (2019) Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134. Cited by: §1.1.
-  (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928. Cited by: §1.1.
-  (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE international conference on computer vision, pp. 1116–1124. Cited by: §1.1, §3.6, Table 5.
Appendix A Implementation Details
In Section 2.5, we describe how the influence loss is applied to newly added training data in BCT: 1) computing synthesized classifier weights with the old model, and 2) using knowledge distillation.
When a newly added training example belongs to a class not present in the old embedding training set, we feed the image into the old model and old classifier to obtain its classifier responses. We can then supervise the new model on this image by feeding the new model's embedding to the old classifier and computing the knowledge distillation loss (cross-entropy with temperature-modulated SoftMax) between the two response vectors as the influence loss. Because the cosine margin loss used in our experiments also has a temperature parameter, we set the temperature parameter in knowledge distillation to the same value as the one in the cosine margin loss.
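The distillation form of the influence loss can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function and variable names (`influence_loss`, `old_classifier_W`) are illustrative, plain dot-product logits stand in for the cosine-margin logits, and the temperature `T` is shared between teacher and student as described above.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-modulated SoftMax; higher T gives softer distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def influence_loss(new_emb, old_emb, old_classifier_W, T):
    """Knowledge-distillation influence loss (sketch).

    Both the old model's and the new model's embeddings of the same image
    are passed through the *old* classifier; the loss is the cross-entropy
    between the resulting temperature-softened response vectors.
    """
    teacher = softmax(old_emb @ old_classifier_W.T, T)  # old-model responses
    student = softmax(new_emb @ old_classifier_W.T, T)  # new-model responses
    return -np.sum(teacher * np.log(student + 1e-12), axis=-1).mean()
```

By Gibbs' inequality, this loss is minimized (down to the teacher's entropy) when the new model's responses through the old classifier match the old model's, which is the sense in which it pulls the new embedding toward compatibility.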
Appendix B Partial Backfilling
Partial backfilling occurs when only a part of the gallery has been processed by the new model. We test whether queries from the backward-compatible new model work with partially backfilled gallery sets. In Fig. 3, we illustrate the search accuracy on gallery sets at different backfill ratios. As a higher percentage of the gallery set is backfilled, search accuracy grows toward the accuracy of the fully backfilled case. This suggests that one can upgrade to a new model, immediately benefit from the improved accuracy, and optionally backfill the old gallery gradually in the background until paragon performance is achieved.
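The mixed-gallery setup behind this experiment can be sketched as follows: a fraction of the gallery is re-embedded with the new model while the remainder keeps its old-model embeddings, and new-model queries are searched against this mixture. This is an illustrative NumPy sketch under the assumption of a compatible representation; the names (`search`, `partially_backfilled_gallery`) are ours, not the paper's.

```python
import numpy as np

def search(query, gallery):
    # Nearest-neighbor search by cosine similarity; returns the gallery index.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return int(np.argmax(g @ q))

def partially_backfilled_gallery(old_embs, new_embs, ratio):
    """Gallery in which the first `ratio` fraction of entries has been
    re-indexed with the new model and the rest keeps old-model embeddings.
    Mixing the two only makes sense if the representations are compatible."""
    n = len(old_embs)
    k = int(round(ratio * n))
    return np.concatenate([new_embs[:k], old_embs[k:]], axis=0)
```

With a backward-compatible new model, search accuracy on such a mixed gallery interpolates between the backfill-free and fully backfilled operating points as `ratio` goes from 0 to 1, which is what Fig. 3 measures.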
Appendix C Detailed Benchmark Results
Due to space limitations, in the main paper we report only one operating point for each metric in the evaluation, e.g., for face verification and for face identification. Here we report results at additional operating points on the IJB-C benchmark. In Table 8, we show the performance of the compared baselines and our proposed method. In Table 7 and Table 9, we show the performance of extensions of our proposed backward-compatible training process: Table 7 illustrates extensions of BCT to different model depths, feature dimensions, and supervision losses, while Table 9 shows extensions to multi-model compatibility towards sequential updating.
| | Continual Learning | Domain Adaptation | BCRL |
|---|---|---|---|
| Access to all old model parameters | Yes | Yes | Not required |
| Access to old training data | Not required | Yes | Yes |
| Re-processing of test data | Yes | Yes | Not required |
| Consistent output | Yes | Not required | Yes |
| Compatible representation | Not required | Not required | Yes |
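The "compatible representation" row can be made operational with a simple empirical check: new-model query embeddings, matched against an old-model gallery, should perform at least as well as the old model's own queries against that same gallery. The sketch below assumes top-1 retrieval accuracy as the metric; the helper names are illustrative.

```python
import numpy as np

def top1_accuracy(queries, gallery, q_labels, g_labels):
    # Top-1 retrieval accuracy under cosine-similarity nearest neighbor.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = np.argmax(q @ g.T, axis=1)
    return float(np.mean(g_labels[nearest] == q_labels))

def is_backward_compatible(new_q, old_q, old_gallery, q_labels, g_labels):
    """Empirical compatibility check (sketch): the new model's queries,
    searched against the old-model gallery, must not underperform the old
    model's queries on that same gallery."""
    cross_acc = top1_accuracy(new_q, old_gallery, q_labels, g_labels)
    self_acc = top1_accuracy(old_q, old_gallery, q_labels, g_labels)
    return cross_acc >= self_acc
```

If this check passes, the new model can serve queries against the un-backfilled gallery immediately, which is the backfill-free deployment the table's last column describes.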