Pytorch implementation for "Large-Scale Long-Tailed Recognition in an Open World"
Real-world data often have a long-tailed and open-ended distribution. A practical recognition system must classify among majority and minority classes, generalize from a few known instances, and acknowledge novelty upon a never-seen instance. We define Open Long-Tailed Recognition (OLTR) as learning from such naturally distributed data and optimizing the classification accuracy over a balanced test set which includes head, tail, and open classes. OLTR must handle imbalanced classification, few-shot learning, and open-set recognition in one integrated algorithm, whereas existing classification approaches focus only on one aspect and deliver poorly over the entire class spectrum. The key challenges are how to share visual knowledge between head and tail classes and how to reduce confusion between tail and open classes. We develop an integrated OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world. Our so-called dynamic meta-embedding combines a direct image feature and an associated memory feature, with the feature norm indicating the familiarity to known classes. On three large-scale OLTR datasets we curate from object-centric ImageNet, scene-centric Places, and face-centric MS1M data, our method consistently outperforms the state-of-the-art. Our code, datasets, and models enable future OLTR research and are publicly available at https://liuziwei7.github.io/projects/LongTail.html.
Our visual world is inherently long-tailed and open-ended: The frequency distribution of visual categories in our daily life is long-tailed, with a few common classes and many more rare classes, and we constantly encounter new visual concepts as we navigate in an open world.
| Task Setting | Imbalanced Train/Base Set | #Instances in Tail Class | Balanced Test Set | Open Class | Evaluation: Accuracy Over |
|---|---|---|---|---|---|
| Imbalanced Classification | ✓ | 20~50 | ✓ | ✗ | all classes |
| Few-Shot Learning | ✗ | 1~20 | ✓ | ✗ | novel classes |
| Open-Set Recognition | ✗ | N/A | ✓ | ✓ | all classes |
| Open Long-Tailed Recognition | ✓ | 1~20 | ✓ | ✓ | all classes |
While the natural data distribution contains head, tail, and open classes (Fig. 1), existing classification approaches focus mostly on the head [8, 30] or the tail [55, 27], often in a closed setting [59, 34]. Traditional deep learning models are good at capturing the big data of head classes [26, 20]; more recently, few-shot learning methods have been developed for the small data of tail classes [52, 18].
We formally study Open Long-Tailed Recognition (OLTR) arising in natural data settings. A practical system shall be able to classify among a few common and many rare categories, to generalize the concept of a single category from only a few known instances, and to acknowledge novelty upon an instance of a never-seen category. We define OLTR as learning from long-tailed and open-ended distributed data and evaluating the classification accuracy over a balanced test set which includes head, tail, and open classes in a continuous spectrum (Fig. 1).
OLTR must handle not only imbalanced classification and few-shot learning in the closed world, but also open-set recognition with one integrated algorithm (Tab. 1). Existing classification approaches tend to focus on one aspect and deliver poorly over the entire class spectrum.
The key challenges for OLTR are tail recognition robustness and open-set sensitivity: As the number of training instances drops from thousands in the head classes to a few in the tail classes, the recognition accuracy should remain as high as possible; on the other hand, as the number of instances drops to zero in the open set, recognition relies on the sensitivity to distinguish unknown open classes from known tail classes.
An integrated OLTR algorithm should tackle the two seemingly contradictory aspects of recognition robustness and recognition sensitivity on a continuous category spectrum. To increase the recognition robustness, it must share visual knowledge between head and tail classes; to increase recognition sensitivity, it must reduce the confusion between tail and open classes.
We develop an OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world.
Our so-called dynamic meta-embedding handles tail recognition robustness by combining two components: a direct feature computed from the input image, and an induced feature associated with the visual memory. 1) Our direct feature is a standard embedding that gets updated from the training data by stochastic gradient descent over the classification loss; it lacks sufficient supervision for the rare tail classes. 2) Our memory feature is inspired by meta-learning methods with memories [55, 12, 2] to augment the direct feature from the image. A visual memory holds discriminative centroids of the direct feature. We learn to retrieve a summary of memory activations from the direct feature, combined into a meta-embedding that is enriched particularly for the tail classes.
Our dynamic meta-embedding handles open recognition sensitivity by dynamically calibrating the meta-embedding with respect to the visual memory. The embedding is scaled inversely by its distance to the nearest centroid: The farther away from the memory, the closer to the origin, and the more likely an open-set instance. We also adopt modulated attention to encourage the head and tail classes to use different sets of spatial features. While our meta-embedding relates head and tail classes, our modulated attention maintains discrimination between them.
We make the following major contributions. 1) We formally define the OLTR task, which learns from natural long-tail and open-end distributed data and optimizes the overall accuracy over a balanced test set. It provides a comprehensive and unbiased evaluation of visual recognition algorithms in practical settings. 2) We develop an integrated OLTR algorithm with dynamic meta-embedding. It handles tail recognition robustness by relating visual concepts among head and tail embeddings, and it handles open recognition sensitivity by dynamically calibrating the embedding norm with respect to the visual memory. 3) We curate three large OLTR datasets according to a long-tail distribution from existing representative datasets: object-centric ImageNet, scene-centric MIT Places, and face-centric MS1M datasets. We set up benchmarks for proper OLTR performance evaluation. 4) Our extensive experimentation on these OLTR datasets demonstrates that our method consistently outperforms the state-of-the-art.
Our code, datasets, and models are publicly available at https://liuziwei7.github.io/projects/LongTail.html. Our work fills the void in practical benchmarks for imbalanced classification, few-shot learning, and open-set recognition, enabling future research that is directly transferable to real-world applications.
While OLTR has not been defined in the literature, there are three closely related tasks which are often studied in isolation: imbalanced classification, few-shot learning, and open-set recognition. Tab. 1 summarizes their differences.
Imbalanced Classification. Arising from long-tail distributions of natural data, it has been extensively studied [45, 66, 4, 32, 67, 38, 31, 53, 7]. Classical methods include under-sampling head classes, over-sampling tail classes, and data instance re-weighting. We refer the readers to  for a detailed review. Some recent methods include metric learning [24, 37], hard negative mining [11, 29], and meta learning [17, 59]. The lifted structure loss  introduces margins between many training instances. The range loss  enforces data in the same class to be close and those in different classes to be far apart. The focal loss  induces an online version of hard negative mining. MetaModelNet  learns a meta regression net from head classes and uses it to construct the classifier for tail classes.
Our dynamic meta-embedding combines the strengths of both metric learning and meta learning. On one hand, our direct feature is updated to ensure that centroids for different classes are far from each other; on the other hand, our memory feature is generated on the fly in a meta-learning fashion to effectively transfer knowledge to tail classes.
Few-Shot Learning. It is often formulated as meta learning [50, 6, 41, 46, 14, 61]. Matching Network  learns a transferable feature matching metric to go beyond given classes. Prototypical Network  maintains a set of separable class templates. Feature hallucination  and augmentation  are also shown effective. Since these methods focus on novel classes, they often suffer a moderate performance drop for head classes. There are a few exceptions. Few-shot learning without forgetting  and incremental few-shot learning  attempt to remedy this issue by leveraging the duality between features and classifiers' weights [40, 39]. However, the training sets used in all of these methods are balanced.
In comparison, our OLTR learns from a more natural long-tailed training set. Nevertheless, our work is closely related to meta learning with fast weights and associative memory [22, 49, 55, 12, 2, 36] that enables rapid adaptation. Compared to this prior art, our memory feature has two advantages: 1) It transfers knowledge to both head and tail classes adaptively via a learned concept selector; 2) It is fully integrated into the network without episodic training, and is thus especially suitable for large-scale applications.
Open-Set Recognition. Open-set recognition [48, 3], or out-of-distribution detection [10, 28], aims to re-calibrate the sample confidence in the presence of open classes. One representative technique is OpenMax , which fits a Weibull distribution to the classifier's output logits. However, when there are both open and tail classes, the distribution fitting could confuse the two.
Instead of calibrating the output logits, our OLTR approach incorporates the confidence estimation into feature learning and dynamically re-scales the meta-embedding w.r.t. the learned visual memory.
We propose to map an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world. Our model has two main modules (Fig. 2): dynamic meta-embedding and modulated attention. The former relates and transfers knowledge between head and tail classes, and the latter maintains discrimination between them.
Our dynamic meta-embedding combines a direct image feature and an associated memory feature, with the feature norm indicating the familiarity to known classes.
Consider a convolutional neural network (CNN) with a softmax output layer for classification. The second-to-last layer can be viewed as the feature and the last layer a linear classifier (cf. Fig. 2). The feature and the classifier are jointly trained from big data in an end-to-end fashion. Let $v^{direct}$ denote the direct feature extracted from an input image. The final classification accuracy largely depends on the quality of this direct feature.
While a feed-forward CNN classifier works well with big training data [8, 26], it lacks sufficient supervised updates from small data in our tail classes. We propose to enrich the direct feature with a memory feature $v^{memory}$ that relates visual concepts in a memory module. This mechanism is similar to the memory popular in meta learning [46, 36]. We denote the resulting feature the meta-embedding $v^{meta}$; it is fed to the last layer for classification. Both our memory feature and meta-embedding depend on the direct feature $v^{direct}$.
Unlike the direct feature, the memory feature captures visual concepts from training classes, retrieved from a memory with a much shallower model.
Learning Visual Memory $M$. We follow  on class structure analysis and adopt discriminative centroids as the basic building block. Let $M = \{c_i\}_{i=1}^{K}$ denote the visual memory of all the training data, where $K$ is the number of training classes. Compared to alternatives [60, 52], this memory is appealing for our OLTR task: it is almost effortlessly and jointly learned alongside the direct features, and it considers both intra-class compactness and inter-class discriminativeness.
We compute centroids in two steps. 1) Neighborhood Sampling: We sample both intra-class and inter-class examples to compose a mini-batch during training. These examples are grouped by their class labels, and the centroid of each group is updated by the direct features of this mini-batch. 2) Propagation: We alternately update the direct features and the centroids to minimize the distance between each direct feature and the centroid of its group while maximizing its distance to the other centroids.
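The neighborhood-sampling update above can be sketched as follows; this is a minimal NumPy sketch, where the `momentum` smoothing factor is an illustrative choice rather than a value from the paper:

```python
import numpy as np

def update_centroids(centroids, feats, labels, momentum=0.9):
    """Update per-class discriminative centroids with the direct
    features of a mini-batch, running-mean style.

    centroids: (K, D) array of class centroids, updated in place.
    feats:     (B, D) direct features of the mini-batch.
    labels:    (B,) integer class labels.
    """
    for c in np.unique(labels):
        group_mean = feats[labels == c].mean(axis=0)
        # Smooth the old centroid toward the batch group mean.
        centroids[c] = momentum * centroids[c] + (1 - momentum) * group_mean
    return centroids
```

In the full algorithm this update alternates with gradient steps on the direct features, so centroids and features co-evolve during training.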
Composing Memory Feature $v^{memory}$. For an input image, $v^{memory}$ shall enhance its direct feature when there is not enough training data (as in the tail classes) to learn it well. The memory feature relates the centroids in the memory, transferring knowledge to the tail classes:

$$v^{memory} = o^{T} M = \sum_{i=1}^{K} o_i \, c_i,$$

where $o \in \mathbb{R}^{K}$ are the coefficients hallucinated from the direct feature. We use a lightweight neural network to obtain the coefficients from the direct feature, $o = T_{hal}(v^{direct})$.
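The composition above admits a short sketch. Here `W_hal` stands in for the lightweight hallucination network T_hal, and the softmax normalization of the coefficients is an illustrative choice:

```python
import numpy as np

def memory_feature(v_direct, memory, W_hal):
    """Compose the memory feature as a weighted combination of the
    stored class centroids.

    v_direct: (D,) direct feature of the input image.
    memory:   (K, D) visual memory of K class centroids.
    W_hal:    (K, D) linear layer standing in for T_hal (illustrative).
    """
    logits = W_hal @ v_direct
    o = np.exp(logits - logits.max())
    o = o / o.sum()               # hallucinated coefficients over K centroids
    return o @ memory             # (K,) @ (K, D) -> (D,)
```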
Obtaining Dynamic Meta-Embedding. $v^{meta}$ combines the direct feature and the memory feature, and is fed to the classifier for the final class prediction (Fig. 3):

$$v^{meta} = \frac{1}{\gamma} \cdot \left( v^{direct} + e \otimes v^{memory} \right),$$

where $\otimes$ denotes element-wise multiplication. $\gamma$ is seemingly a redundant scalar for closed-world classification tasks. However, in the OLTR setting, it plays an important role in differentiating the examples of the training classes from those of the open set. $\gamma$ measures the reachability of an input's direct feature to the memory, i.e., the minimum distance between the direct feature and the discriminative centroids:

$$\gamma = \min_{i} \left\| v^{direct} - c_i \right\|.$$

When $\gamma$ is small, the input likely belongs to a training class from which the centroids are derived, and a large reachability weight $1/\gamma$ is assigned to the resulting meta-embedding. Otherwise, the embedding is scaled down, at the extreme to an almost all-zero vector. Such a property is useful for encoding open classes.
We now describe the concept selector $e$ in Eq. (2). The direct feature is often good enough for the data-rich head classes, whereas the memory feature is more important for the data-poor tail classes. To adaptively select between them in a soft manner, we learn a lightweight network $T_{sel}$ with a tanh activation function:

$$e = \tanh\!\left(T_{sel}(v^{direct})\right).$$
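A minimal NumPy sketch ties the selector and the reachability-scaled meta-embedding together; `W_sel` stands in for the lightweight selector network, which is an illustrative simplification:

```python
import numpy as np

def concept_selector(v_direct, W_sel):
    """Soft concept selector e = tanh(T_sel(v_direct));
    W_sel is an illustrative linear stand-in for T_sel."""
    return np.tanh(W_sel @ v_direct)

def meta_embedding(v_direct, v_memory, e, centroids):
    """Dynamic meta-embedding: the farther the direct feature is from
    its nearest centroid (larger gamma), the smaller the output norm."""
    gamma = np.min(np.linalg.norm(centroids - v_direct, axis=1))  # reachability
    return (v_direct + e * v_memory) / gamma
```

For an open-set input far from every centroid, `gamma` is large and the embedding shrinks toward the origin, which is exactly the calibration behavior described above.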
While dynamic meta-embedding facilitates feature sharing between head and tail classes, it is also vital to discriminate between them. The direct feature, e.g., the activation at the second-to-last layer in ResNet , fulfills this requirement to some extent. However, we find it beneficial to further enhance it with spatial attention, since discriminative cues of head and tail classes seem to be distributed at different locations in the image.
Specifically, we propose modulated attention to encourage samples of different classes to use different contexts. First, we compute a self-attention map $SA(f)$ from the input feature map $f$ by self-correlation. It is used as contextual information and added back (through skip connections) to the original feature map. The modulated attention $MA(f)$ is then designed as conditional spatial attention applied to the self-attention map, which allows examples to select different spatial contexts (Fig. 4). The final attention feature map becomes:

$$f^{att} = f + MA(f) \otimes SA(f),$$

where $f$ is a feature map in the CNN, $SA(\cdot)$ is the self-attention operation, and $MA(\cdot)$ is a conditional attention function  with a softmax normalization. Sec. 4.1 shows empirically that our attention design achieves superior performance to the common practice of applying spatial attention to the input feature map. This modulated attention (Fig. 4b) can be plugged into any feature layer of a CNN. Here, we modify the last feature map only.
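The composition described above can be sketched as follows; `self_att` and `cond_att` are placeholder callables for the self-attention and conditional spatial-attention modules, and the exact composition in the released implementation may differ:

```python
import numpy as np

def modulated_attention(f, self_att, cond_att):
    """Sketch of the modulated-attention composition.

    f:        (C, H, W) feature map.
    self_att: callable returning a contextual map of the same shape as f.
    cond_att: callable returning a softmax-normalized (H, W) spatial mask.
    """
    sa = self_att(f)            # contextual information from self-correlation
    mask = cond_att(f)          # conditional spatial attention, sums to 1
    # Skip connection plus spatially modulated context.
    return f + mask[None, :, :] * sa
```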
Cosine Classifier. We adopt the cosine classifier [39, 15] to produce the final classification results. Specifically, we normalize the meta-embeddings $\{v_n^{meta}\}$, where $n$ stands for the $n$-th input, as well as the weight vectors $\{w_k\}$ of the classifier (no bias term). The normalization strategy for the meta-embedding is a non-linear squashing function  which ensures that vectors of small magnitude are shrunk to almost zero while vectors of big magnitude are normalized to a length slightly below one. This function helps amplify the effect of the reachability $\gamma$ (cf. Eq. (2)).
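A sketch of this squashing normalization together with a bias-free cosine classifier, assuming the squashing form popularized by capsule networks (the paper's exact constants are not reproduced here):

```python
import numpy as np

def squash(v):
    """Non-linear squashing: small vectors shrink toward zero,
    large vectors approach (but never exceed) unit length."""
    n2 = np.dot(v, v)
    return (n2 / (1.0 + n2)) * v / (np.sqrt(n2) + 1e-12)

def cosine_logits(v_meta, weights):
    """Cosine classifier: squashed embedding dotted with
    L2-normalized class weight vectors (no bias term)."""
    v = squash(v_meta)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return w @ v
```

Because the squashed norm carries the reachability signal, open-set inputs with tiny embeddings yield uniformly small logits for every class.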
Loss Function. Since all our modules are differentiable, our model can be trained end-to-end by alternately updating the centroids $\{c_i\}$ and the dynamic meta-embedding $v^{meta}$. The final loss function is a combination of the cross-entropy classification loss $L_{CE}$ and the large-margin loss $L_{LM}$ between the embeddings and the centroids:

$$L = L_{CE} + \lambda \cdot L_{LM},$$

where the weight $\lambda$ is set in our experiments by observing the accuracy curve on the validation set.
Datasets. We curate three open long-tailed benchmarks, ImageNet-LT (object-centric), Places-LT (scene-centric), and MS1M-LT (face-centric), respectively.
ImageNet-LT: We construct a long-tailed version of the original ImageNet-2012  by sampling a subset following the Pareto distribution with power value α = 6. Overall, it covers 1000 categories, with at most 1280 images per class and at least 5 images per class. The additional classes of images in ImageNet-2010 are used as the open set. We make the test set balanced.
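The Pareto-style subsampling can be illustrated with a small sketch that decays per-class sample counts by a power law between the maximum and minimum class sizes; the decay schedule below is an illustrative simplification of the construction, not the exact sampling code:

```python
import numpy as np

def longtail_class_sizes(num_classes, max_n, min_n, alpha=6.0):
    """Per-class sample counts decaying by a power law (alpha = 6,
    as in the benchmarks) from max_n down to a floor of min_n."""
    ranks = np.linspace(0.0, 1.0, num_classes)      # 0 = head, 1 = tail
    sizes = max_n * (1.0 - ranks) ** alpha
    return np.maximum(sizes, min_n).astype(int)
```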
Places-LT: A long-tailed version of Places-2  is constructed in a similar way. It covers 365 categories, with a maximum of 4980 and a minimum of 5 images per class. The gap between the head and tail classes is even larger than in ImageNet-LT. We use the test images from Places-Extra69 as the additional open set.
MS1M-LT: To construct a long-tailed version of the MS1M-ArcFace dataset , we sample images for each identity with a probability proportional to the number of images of that identity, resulting in a long-tailed distribution over identities. To inspect the generalization ability of our approach, the performance is evaluated on the MegaFace benchmark , which has no identity overlap with MS1M-ArcFace.
Network Architectures. Following [18, 57, 15], we employ ResNet-10  trained from scratch as our backbone network for ImageNet-LT. To make a fair comparison with , the pre-trained ResNet-152  is used as the backbone network for Places-LT. For MS1M-LT, the popular pre-trained ResNet-50  is the backbone network.
Evaluation Metrics. We evaluate each method under both the closed-set (test set contains no unknown classes) and open-set (test set contains unknown classes) settings to highlight their differences. Under each setting, besides the overall top-1 classification accuracy over all classes, we also calculate the accuracy of three disjoint subsets: many-shot classes (more than 100 training samples each), medium-shot classes (20 to 100 training samples each), and few-shot classes (fewer than 20 training samples each). This helps us understand the detailed characteristics of each method. For the open-set setting, the F-measure is also reported for a balanced treatment of precision and recall . For determining open classes, the probability threshold is initially fixed; a more detailed analysis is provided in Sec. 4.3.
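The shot-wise evaluation protocol can be sketched directly from the thresholds above; this is a minimal sketch, not the official evaluation script:

```python
def shot_subset_accuracy(preds, labels, train_counts):
    """Split test samples by the training frequency of their true class
    (many: >100, medium: 20-100, few: <20 training samples) and report
    per-subset accuracy; subsets with no samples report None."""
    buckets = {"many": [], "medium": [], "few": []}
    for p, y in zip(preds, labels):
        n = train_counts[y]
        key = "many" if n > 100 else ("medium" if n >= 20 else "few")
        buckets[key].append(p == y)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```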
Competing Methods. We choose for comparison state-of-the-art methods from different fields dealing with open long-tailed data, including: (1) metric learning: Lifted Loss , (2) hard negative mining: Focal Loss , (3) feature regularization: Range Loss , (4) few-shot learning: FSLwF , (5) long-tailed modeling: MetaModelNet , and (6) open-set detection: OpenMax . We apply these methods on the same backbone networks as ours for a fair comparison. We also enable them with class-aware mini-batch sampling  for effective learning. Since Model Regression  and MetaModelNet  are the most related to our work, we directly contrast our results to the numbers reported in their papers.
(Table 3 excerpt — MegaFace identification rates: Plain Model 80.64, 71.98, 84.60, 77.72, 73.88, 78.30, 78.70; Range Loss 78.60, 71.36, 83.14, 77.40, 72.17. SUN-LT comparison: Plain Model 48.0; Model Reg. 54.7.)
We first investigate the merit of each module in our framework. The performance is reported as open-set top-1 classification accuracy on ImageNet-LT.
Effectiveness of the Dynamic Meta-Embedding. Recall that the dynamic meta-embedding consists of three main components: memory feature, concept selector, and confidence calibrator. From Fig. 6 (b), we observe that the combination of the memory feature and concept selector leads to large improvements on all three shots. It is because the obtained memory feature transfers useful visual concepts among classes. Another observation is that the confidence calibrator is the most effective on few-shot classes. The reachability estimation inside the confidence calibrator helps distinguish tail classes from open classes.
Effectiveness of the Modulated Attention. We observe from Fig. 6 (a) that, compared to medium-shot classes, the modulated attention contributes more to the discrimination between many-shot and few-shot classes. Fig. 6 (c) further validates that the modulated attention is more effective than directly applying spatial attention to the feature maps. It implies that adaptive context selection is easier to learn than conventional feature selection.
Effectiveness of the Reachability Calibration. To further demonstrate the merit of reachability calibration for open-world setting, we conduct additional experiments following the standard settings in [21, 28] (CIFAR100 + TinyImageNet(resized)). The results are listed in Table 6, where our approach shows favorable performance over standard open-set methods [21, 28].
We extensively evaluate the performance of various representative methods on our benchmarks.
ImageNet-LT. Table 2 (a) shows the performance comparison of different methods. We have the following observations. Firstly, both Lifted Loss  and Focal Loss  greatly boost the performance of few-shot classes by enforcing feature regularization. However, they also sacrifice the performance on many-shot classes since they have no built-in mechanism for adaptively handling samples of different shots. Secondly, OpenMax  improves the results under the open-set setting. However, the accuracy degrades when it is evaluated with the F-measure, which considers both precision and recall in the open set. When the open classes are compounded with the tail classes, it becomes challenging to perform the distribution fitting that  requires. Lastly, though the few-shot learning without forgetting approach  retains the many-shot class accuracy, it has difficulty dealing with imbalanced base classes, which the current few-shot paradigm lacks. As demonstrated in Fig. 7, our approach provides a comprehensive treatment to all the many/medium/few-shot classes as well as the open classes, achieving substantial improvements on all aspects.
Places-LT. Similar observations can be made on the Places-LT benchmark as shown in Table 2 (b). With a much stronger baseline (i.e. pre-trained ResNet-152), our approach still consistently outperforms other alternatives under both the closed-set and open-set settings. The advantage is even more profound under the F-measure.
MS1M-LT. We train on the MS1M-LT dataset and report results on the MegaFace identification track, a standard benchmark in the face recognition field. Since the face identities in the training set and the test set are disjoint, we adopt an indirect way to partition the test set into subsets of different shots: we approximate the pseudo shot of each test sample by counting the number of training samples whose feature similarity to it exceeds a threshold. Apart from many-shot, few-shot, and one-shot subsets, we also obtain a zero-shot subset, for which we cannot find any sufficiently similar samples in the training set. As shown in Table 3 (left), our approach has the most advantage on one-shot and zero-shot identities.
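The pseudo-shot approximation can be sketched as a cosine-similarity count; the threshold value below is illustrative, not the paper's:

```python
import numpy as np

def pseudo_shots(test_feat, train_feats, thresh=0.5):
    """Approximate the 'shot' of a test sample by counting training
    features whose cosine similarity to it exceeds a threshold.
    A count of zero places the sample in the zero-shot subset."""
    t = test_feat / np.linalg.norm(test_feat)
    T = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    return int(np.sum(T @ t > thresh))
```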
SUN-LT. To directly compare with Model Regression  and MetaModelNet , we also test on the SUN-LT benchmark they provided. The final results are listed in Table 3 (right). Instead of learning a series of classifier transformations, our approach transfers visual knowledge among features and achieves an improvement over the prior best. Note that our approach also incurs much less computational cost, since MetaModelNet  requires a recursive training procedure.
Indication for Fairness. Here we report the sensitive attribute performance on MS1M-LT. The last two columns in Table 3 show that our approach achieves across-board improvements on both ‘male’ and ‘female’ sub-groups, which has an implication for effective fairness learning.
Finally we visualize and analyze some influencing aspects in our framework as well as typical failure cases.
What the Memory Feature Has Infused. Here we inspect the visual concepts that the memory feature has infused by visualizing its top activating neurons, as shown in Fig. 8. Specifically, for each input image, we identify the top transferred neurons in its memory feature, and visualize each neuron by a collection of its highest-activated patches  over the whole training set. For example, to classify the top-left image, which belongs to the tail class 'cock', our approach has learned to transfer visual concepts that represent "bird head", "round shape", and "dotted texture", respectively. After feature infusion, the dynamic meta-embedding becomes more informative and discriminative.
Influence of Dataset Longtail-ness. The longtail-ness of the dataset (i.e., the degree of imbalance of the class distribution) could have an impact on model performance. For faster investigation, the weights of the backbone network are frozen here during training. From Fig. 9 (a), we observe that as the dataset becomes more imbalanced (i.e., the power value decreases), our approach undergoes only a moderate performance drop. Dynamic meta-embedding enables effective knowledge transfer between data-abundant and data-scarce classes.
Influence of Open-Set Prob. Threshold. The performance change w.r.t. the open-set probability threshold is demonstrated in Fig. 9 (b). Compared to the plain model  and range loss , the performance of our approach changes steadily as the open-set threshold rises. The reachability estimator in our framework helps calibrate the sample confidence, thus enhancing robustness to open classes.
Influence of the Number of Open Classes. Finally we investigate performance change w.r.t. the number of open classes. Fig. 9 (c) indicates that our approach demonstrates great robustness to the contamination of open classes.
Failure Cases. Since our approach encourages feature infusion among classes, it slightly sacrifices fine-grained discrimination to promote under-represented classes. One typical failure case of our approach is the confusion between many-shot and medium-shot classes. For example, the bottom-right image in Fig. 8 is misclassified as 'airplane' because cross-category traits like "nose shape" and "eye shape" are infused. We plan to explore feature disentanglement  to alleviate this trade-off.
We introduce the OLTR task that learns from natural long-tail open-end distributed data and optimizes the overall accuracy over a balanced test set. We propose an integrated OLTR algorithm, dynamic meta-embedding, in order to share visual knowledge between head and tail classes and to reduce confusion between tail and open classes. We validate our method on three curated large-scale OLTR benchmarks (ImageNet-LT, Places-LT and MS1M-LT). Our publicly available code and data would enable future research that is directly transferable to real-world applications.
Acknowledgements. This research was supported, in part, by SenseTime Group Limited, NSF IIS 1835539, Berkeley Deep Drive, DARPA, and US Government fund through Etegent Technologies on Low-Shot Detection in Remote Sensing Imagery. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
In this supplementary material, we provide details omitted in the main text including:
Section A: intuitive explanation of our approach (Sec. 1 “Introduction” of the main paper.)
Section B: relation to fairness analysis (Sec. 2 “Related Work” of the main paper.)
Section C: more methodology details (Sec. 3 “Approach” of the main paper.)
Section D: detailed experimental setup (Sec. 4 “Experiments” of the main paper.)
Section E: additional visualization of our approach (Sec. 4.3 “Further Analysis” of the main paper.)
In this section, we give an intuitive explanation of our approach to the problem of open long-tailed recognition. From the perspective of knowledge gained from observation (i.e., the training set), head classes, tail classes, and open classes form a continuous spectrum, as illustrated in Fig. 10.
| Direct + Memory Feature | Modulated Attention | Reachability Module |
|---|---|---|
| Transfer knowledge between head/tail classes | Maintain discrimination between head/tail classes | Deal with open classes |
Firstly, we obtain a visual memory by aggregating the knowledge from both head and tail classes. Then the visual concepts stored in the memory are infused back as an associated memory feature to enhance the original direct feature. It can be understood as using induced knowledge (i.e., the memory feature) to assist the direct observation (i.e., the direct feature). We further learn a concept selector to control the amount and type of memory feature to be infused. Since head classes already have abundant direct observation, only a small amount of memory feature is infused for them. In contrast, tail classes suffer from scarce observation, so the associated visual concepts in the memory feature are extremely beneficial. Finally, we calibrate the confidence of open classes by calculating their reachability to the obtained visual memory. In this way, we provide a comprehensive treatment to the full spectrum of head, tail, and open classes, improving the performance on all categories. To summarize, the effects of each component in our approach are listed in Table 4.
The open long-tailed recognition task proposed in our work also has an intrinsic relationship to fairness analysis [13, 63, 33, 35, 1]; their key differences are listed in Table 5. On the problem-setting side, both open long-tailed recognition and fairness analysis aim to tackle the imbalance that exists in real-world data. Open long-tailed recognition focuses on the longtail-ness of both known and unknown categories, while fairness analysis deals with bias in sensitive attributes such as male/female and white/black.
On the methodology side, both open long-tailed recognition and fairness analysis aim to learn transferable representations. Open long-tailed recognition optimizes the overall accuracy over all categories, while fairness analysis optimizes several attribute-wise criteria. The preliminary results in Table 3 demonstrate that our proposed dynamic meta-embedding is also a promising solution for fairness analysis.
| Problem | Imbalanced Aspect | Optimization Objective |
| --- | --- | --- |
| fairness analysis | sensitive attributes | attribute-wise criteria |
| open long-tail recog. | categories | accuracy on all categories |
Notation Summary. We summarize the notations used in the paper in Table 6.
- the original feature map
- the feature map after modulated attention
- the hallucinated coefficients from the visual memory
Obtaining Discriminative Centroids. The step-by-step procedure for obtaining discriminative centroids is further illustrated in Fig. 12.
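At their core, the centroids are per-class means of the direct features maintained during training. The following is a minimal, non-incremental numpy sketch (the actual procedure in Fig. 12 updates centroids alongside training; function and argument names here are hypothetical).

```python
import numpy as np

def compute_centroids(features, labels, num_classes):
    """Per-class mean of direct features; a minimal batch sketch.

    features : (N, d) direct features
    labels   : (N,)   integer class labels in [0, num_classes)
    """
    d = features.shape[1]
    centroids = np.zeros((num_classes, d))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # leave empty classes at zero
            centroids[c] = features[mask].mean(axis=0)
    return centroids
```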
Detailed Loss Functions. Here we elaborate the two loss functions, $\mathcal{L}_{CE}$ and $\mathcal{L}_{LM}$, described in Eqn. 7 of the main paper. Specifically, $\mathcal{L}_{CE}$ is the cross-entropy loss between the dynamic meta-embedding $v^{meta}$ and the ground-truth category label $y$:

$$\mathcal{L}_{CE} = -\log \frac{\exp\big(\phi_y(v^{meta})\big)}{\sum_{k} \exp\big(\phi_k(v^{meta})\big)},$$

where $\phi$ is the cosine classifier described in Eqn. 6 of the main paper. Next we introduce the large-margin loss $\mathcal{L}_{LM}$ between the embedding $v^{meta}$ and the centroids $\{c_i\}$:

$$\mathcal{L}_{LM} = \max\Big(0,\; \|v^{meta} - c_y\|_2 - \min_{j \neq y} \|v^{meta} - c_j\|_2 + m\Big),$$

where $m$ is the margin hyper-parameter, whose value is set empirically in our experiments. With this formulation, we minimize the distance between each embedding and the centroid of its own class, while maximizing the distance between the embedding and the centroids of the classes it does not belong to.
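A triplet-style large-margin loss matching the description above ("pull toward the own centroid, push away from the others") can be sketched as follows. The margin value here is illustrative, not the paper's setting, and the exact loss form in the released code may differ.

```python
import numpy as np

def large_margin_loss(v, centroids, y, margin=1.0):
    """Sketch of a large-margin loss over class centroids.

    v         : (d,)   embedding of one sample
    centroids : (K, d) class centroids
    y         : int    ground-truth class index
    """
    dists = np.linalg.norm(centroids - v, axis=1)
    d_pos = dists[y]                     # distance to own centroid
    d_neg = np.delete(dists, y).min()    # distance to nearest other centroid
    # Hinge: zero loss once d_pos is at least `margin` smaller than d_neg.
    return max(0.0, d_pos - d_neg + margin)
```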
ImageNet-LT. The training set was generated using a Pareto distribution with power value α = 6, yielding a maximum of 1,280 and a minimum of 5 images per class across the 1,000 classes of the ImageNet dataset. Images were randomly selected according to the distribution value of each class. The classes were sorted following the benchmark proposed by Hariharan & Girshick, where the 1,000 classes were randomly split into 389 base classes and 611 novel classes. The 389 largest classes in ImageNet-LT are the same as the base classes in that benchmark, and the remaining 611 classes are the same as the novel classes. We randomly selected 20 training images per class from the original training set as the validation set. The original validation set of ImageNet was used as the testing set in this paper. The dataset specifications are shown in Fig. 11.
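The class-size generation above can be approximated as follows. This is a sketch of the sampling rule (Pareto with power value 6, counts clamped between 5 and 1,280), not the exact released generation script; the function name and the rescaling choice are assumptions.

```python
import numpy as np

def long_tailed_class_sizes(num_classes=1000, alpha=6, max_n=1280, min_n=5, seed=0):
    """Draw per-class image counts from a Pareto distribution and clamp them
    to [min_n, max_n]; an approximation of the released split."""
    rng = np.random.default_rng(seed)
    raw = rng.pareto(alpha, size=num_classes)       # heavy-tailed samples
    sizes = np.clip((max_n * raw).astype(int), min_n, max_n)
    return np.sort(sizes)[::-1]                     # head classes first
```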
Places-LT. The training set was generated similarly to ImageNet-LT, using a Pareto distribution with power value α = 6 and a maximum of 4,980 and a minimum of 5 images per class across the 365 classes of the Places-365-Standard dataset. We used the class-frequency order of the Places-365-Challenge dataset (which is imbalanced) to sort the training classes. We also randomly selected 20 images per class from the original training set as the validation set. The original validation set of Places-365 was used as the testing set in this paper. The dataset specifications are shown in Fig. 13.
MS1M-LT. This dataset was generated from a large-scale face recognition dataset, MS1M-ArcFace. The original dataset contains about 5.8M images of 85K identities. To create a long-tailed version, we sampled images for each identity with a probability proportional to that identity's image count. This results in 887.5K images and 74.5K identities with a long-tailed distribution.
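The proportional subsampling rule can be sketched as below. This is an illustrative reading of "probability proportional to the image numbers of each identity" (normalizing by the largest identity so the keep probability lies in (0, 1]); the normalization constant is an assumption.

```python
import numpy as np

def subsample_long_tail(identity_sizes, seed=0):
    """Keep each image with probability proportional to its identity's size,
    so large identities stay large and small ones shrink further."""
    rng = np.random.default_rng(seed)
    sizes = np.asarray(identity_sizes, dtype=float)
    keep_prob = sizes / sizes.max()                 # in (0, 1]
    # Number of kept images per identity: Binomial(n_i, p_i).
    return rng.binomial(sizes.astype(int), keep_prob)
```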
For the evaluation set, we use MegaFace, one of the largest face recognition benchmarks. It contains 3,530 images from the FaceScrub dataset as the probe set and 1M images as the gallery set. The identification task is to find the top-1 nearest image in the 1M gallery for each sample in the probe set; the identification rate is then the mean of the hit rates. Since the identities in the training and testing sets do not overlap, we adopt an indirect way to partition the testing set into subsets with different numbers of shots. We approximate the pseudo occurrence of each test sample by counting the number of similar training samples (those with similarity greater than a threshold). The similarity is calculated from the feature distance produced by a state-of-the-art face recognition system. Apart from the many-shot, few-shot, and one-shot subsets, we also define a zero-shot subset, for which no similar samples can be found in the training set. The dataset specifications are shown in Fig. 14.
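The pseudo-occurrence counting can be sketched as below. The similarity threshold value from the paper is not reproduced here (it is left as a parameter), and cosine similarity over L2-normalized features is an assumption about the distance used.

```python
import numpy as np

def pseudo_shots(test_feats, train_feats, threshold):
    """Count, per test sample, the training samples whose cosine similarity
    exceeds `threshold`; a count of 0 places the sample in the zero-shot subset.

    test_feats  : (M, d) test features
    train_feats : (N, d) training features
    """
    t = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ tr.T                       # (M, N) cosine similarities
    return (sims > threshold).sum(axis=1)
```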
SUN-LT. We used the same training and testing sets as provided by prior work, with a maximum of 1,132 and a minimum of 1 image per class in the training set and 40 images per class in the testing set. We randomly selected 5 images from the unused training data as our validation set.
All images were first resized to a fixed resolution. During training, images were randomly cropped and then augmented with a random horizontal flip and random color jitter on brightness, contrast, and saturation with a jitter factor of 0.4. During validation and testing, images were center-cropped without further augmentation.
ImageNet-LT. The feature extractor model used in the experiments on ImageNet-LT was a ResNet-10 model initialized from scratch (i.e., random initialization). All different classifiers were also initialized from scratch. Some major hyper-parameters can be found in Table 7.
Places-LT & SUN-LT. We used a two-stage training protocol for the experiments on both Places-LT and SUN-LT. (1) In the first stage, we fine-tuned an ImageNet pre-trained ResNet-152 feature model with a dot-product classifier on the training data of Places-LT and SUN-LT. (2) In the second stage, we used the Places-LT/SUN-LT fine-tuned model as our feature model and froze its convolutional weights. Finally, we fine-tuned the classifiers, initialized from scratch, to produce the experimental results. Some major hyper-parameters can be found in Table 7.
MS1M-LT. We used the ImageNet pre-trained ResNet-50 with a linear classifier and cross-entropy loss to train the face recognition model. Some major hyper-parameters can be found in Table 7.
| Dataset | Initial LR | Epochs | LR Schedule |
| --- | --- | --- | --- |
| ImageNet-LT | 0.1 | 30 | drop 10% every 10 epochs |
| Places-LT | 0.01 | 30 | drop 10% every 10 epochs |
| MS1M-LT | 0.01 | 30 | drop 10% every 10 epochs |
Top-1 Classification Accuracy. For ImageNet-LT, Places-LT, and SUN-LT, since the testing sets are balanced, the top-1 classification accuracy is calculated as the mean accuracy over all closed-set categories, with the contamination of open classes. All open classes are regarded as one unknown class. The prediction for each sample is the class with the highest probability.
F-measure. Following , the F-measure () is calculated as times the product of precision () and recall () divided by the sum of and :
is calculated as true positive (, defined as correct predictions on the closed testing set) over the sum of and false positive (, defined as incorrect predictions on closed testing set):
is calculated as over the sum of and false negative (, defined as number of images from the open set that are predicted as known categories):
Memory Feature in Places-LT. We visualize the memory feature in Places-LT similarly to ImageNet-LT as described in Sec. 4.3 in the main paper. Examples of the infused visual concepts from memory feature in Places-LT are presented in Fig. 15. We observe that memory feature encodes discriminative visual traits for the underlying scene.
Memory Feature in MS-1M. We visualize the memory feature in MS1M-LT by contrasting the least-activated and most-activated average images of the top firing neuron. From Fig. 16, we observe that the memory feature in MS1M-LT infuses several identity-related attributes (e.g., "high cheekbones", "dark skin color", and "narrow eyes") for precise recognition.