Large-Scale Long-Tailed Recognition in an Open World

04/10/2019 ∙ by Ziwei Liu, et al. ∙ UC Berkeley ∙ The Chinese University of Hong Kong

Real world data often have a long-tailed and open-ended distribution. A practical recognition system must classify among majority and minority classes, generalize from a few known instances, and acknowledge novelty upon a never seen instance. We define Open Long-Tailed Recognition (OLTR) as learning from such naturally distributed data and optimizing the classification accuracy over a balanced test set which includes head, tail, and open classes. OLTR must handle imbalanced classification, few-shot learning, and open-set recognition in one integrated algorithm, whereas existing classification approaches focus only on one aspect and deliver poorly over the entire class spectrum. The key challenges are how to share visual knowledge between head and tail classes and how to reduce confusion between tail and open classes. We develop an integrated OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world. Our so-called dynamic meta-embedding combines a direct image feature and an associated memory feature, with the feature norm indicating the familiarity to known classes. On three large-scale OLTR datasets we curate from object-centric ImageNet, scene-centric Places, and face-centric MS1M data, our method consistently outperforms the state-of-the-art. Our code, datasets, and models enable future OLTR research and are publicly available at https://liuziwei7.github.io/projects/LongTail.html.


1 Introduction

Our visual world is inherently long-tailed and open-ended: The frequency distribution of visual categories in our daily life is long-tailed [42], with a few common classes and many more rare classes, and we constantly encounter new visual concepts as we navigate in an open world.

Figure 1: Our task of open long-tailed recognition must learn from long-tail distributed training data in an open world and deal with imbalanced classification, few-shot learning, and open-set recognition over the entire spectrum.

 

Task Setting | Imbalanced Train/Base Set | #Instances in Tail Class | Balanced Test Set | Open Class | Evaluation: Accuracy Over
Imbalanced Classification | yes | 20-50 | yes | no | all classes
Few-Shot Learning | no | 1-20 | yes | no | novel classes
Open-Set Recognition | no | N/A | yes | yes | all classes
Open Long-Tailed Recognition | yes | 1-20 | yes | yes | all classes

Table 1: Comparison between our proposed OLTR task and related existing tasks.

While the natural data distribution contains head, tail, and open classes (Fig. 1), existing classification approaches focus mostly on the head [8, 30] or the tail [55, 27], often in a closed setting [59, 34]. Traditional deep learning models are good at capturing the big data of head classes [26, 20]; more recently, few-shot learning methods have been developed for the small data of tail classes [52, 18].

We formally study Open Long-Tailed Recognition (OLTR) arising in natural data settings. A practical system shall be able to classify among a few common and many rare categories, to generalize the concept of a single category from only a few known instances, and to acknowledge novelty upon an instance of a never seen category. We define OLTR as learning from long-tailed and open-ended distributed data and evaluating the classification accuracy over a balanced test set which includes head, tail, and open classes in a continuous spectrum (Fig. 1).

OLTR must handle not only imbalanced classification and few-shot learning in the closed world, but also open-set recognition with one integrated algorithm (Tab. 1). Existing classification approaches tend to focus on one aspect and deliver poorly over the entire class spectrum.

The key challenges for OLTR are tail recognition robustness and open-set sensitivity: As the number of training instances drops from thousands in the head classes to just a few in the tail classes, the recognition accuracy should remain as high as possible; on the other hand, as the number of instances drops to zero in the open set, the recognition accuracy relies on the sensitivity to distinguish unknown open classes from known tail classes.

An integrated OLTR algorithm should tackle the two seemingly contradictory aspects of recognition robustness and recognition sensitivity on a continuous category spectrum. To increase the recognition robustness, it must share visual knowledge between head and tail classes; to increase recognition sensitivity, it must reduce the confusion between tail and open classes.

We develop an OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world.

Our so-called dynamic meta-embedding handles tail recognition robustness by combining two components: a direct feature computed from the input image, and an induced feature associated with the visual memory.

1) Our direct feature is a standard embedding that gets updated from the training data by stochastic gradient descent over the classification loss. The direct feature lacks sufficient supervision for the rare tail classes.

2) Our memory feature is inspired by meta-learning methods with memories [55, 12, 2] and augments the direct feature from the image. A visual memory holds discriminative centroids of the direct feature. We learn to retrieve a summary of memory activations from the direct feature and combine it into a meta-embedding that is enriched particularly for the tail classes.

Our dynamic meta-embedding handles open recognition sensitivity by dynamically calibrating the meta-embedding with respect to the visual memory. The embedding is scaled inversely by its distance to the nearest centroid: The farther away from the memory, the closer to the origin, and the more likely an open set instance. We also adopt modulated attention  [56] to encourage the head and tail classes to use different sets of spatial features. As our meta-embedding relates head and tail classes, our modulated attention maintains discrimination between them.

We make the following major contributions. 1) We formally define the OLTR task, which learns from natural long-tailed and open-ended distributed data and optimizes the overall accuracy over a balanced test set. It provides a comprehensive and unbiased evaluation of visual recognition algorithms in practical settings. 2) We develop an integrated OLTR algorithm with dynamic meta-embedding. It handles tail recognition robustness by relating visual concepts among head and tail embeddings, and it handles open recognition sensitivity by dynamically calibrating the embedding norm with respect to the visual memory. 3) We curate three large OLTR datasets according to a long-tail distribution from existing representative datasets: object-centric ImageNet, scene-centric MIT Places, and face-centric MS1M. We set up benchmarks for proper OLTR performance evaluation. 4) Our extensive experimentation on these OLTR datasets demonstrates that our method consistently outperforms the state-of-the-art.

Our code, datasets, and models are publicly available at https://liuziwei7.github.io/projects/LongTail.html. Our work fills the void in practical benchmarks for imbalanced classification, few-shot learning, and open-set recognition, enabling future research that is directly transferable to real-world applications.

Figure 2: Method overview. There are two main modules: dynamic meta-embedding and modulated attention. The embedding relates visual concepts between head and tail classes, while the attention discriminates between them. The reachability separates tail and open classes.

2 Related Works

While OLTR has not been defined in the literature, there are three closely related tasks which are often studied in isolation: imbalanced classification, few-shot learning, and open-set recognition. Tab. 1 summarizes their differences.

Imbalanced Classification. Arising from long-tail distributions of natural data, it has been extensively studied  [45, 66, 4, 32, 67, 38, 31, 53, 7]. Classical methods include under-sampling head classes, over-sampling tail classes, and data instance re-weighting. We refer the readers to [19] for a detailed review. Some recent methods include metric learning [24, 37], hard negative mining [11, 29], and meta learning [17, 59]. The lifted structure loss [37] introduces margins between many training instances. The range loss [64] enforces data in the same class to be close and those in different classes to be far apart. The focal loss [29] induces an online version of hard negative mining. MetaModelNet [59] learns a meta regression net from head classes and uses it to construct the classifier for tail classes.

Our dynamic meta-embedding combines the strengths of both metric learning and meta learning. On one hand, our direct feature is updated to ensure centroids for different classes are far from each other; on the other hand, our memory feature is generated on-the-fly in a meta-learning fashion to effectively transfer knowledge to tail classes.

Few-Shot Learning. It is often formulated as meta learning [50, 6, 41, 46, 14, 61]. Matching Network [55] learns a transferable feature matching metric to go beyond the given classes. Prototypical Network [52] maintains a set of separable class templates. Feature hallucination [18] and augmentation [57] are also shown to be effective. Since these methods focus on novel classes, they often suffer a moderate performance drop on head classes. There are a few exceptions. Few-shot learning without forgetting [15] and incremental few-shot learning [43] attempt to remedy this issue by leveraging the duality between features and classifiers' weights [40, 39]. However, the training sets used in all of these methods are balanced.

In comparison, our OLTR learns from a more natural long-tailed training set. Nevertheless, our work is closely related to meta learning with fast weights and associative memory [22, 49, 55, 12, 2, 36] to enable rapid adaptation. Compared to these prior works, our memory feature has two advantages: 1) It transfers knowledge to both head and tail classes adaptively via a learned concept selector; 2) It is fully integrated into the network without episodic training, and is thus especially suitable for large-scale applications.

Open-Set Recognition. Open-set recognition [48, 3], or out-of-distribution detection [10, 28], aims to re-calibrate the sample confidence in the presence of open classes. One of the representative techniques is OpenMax [3], which fits a Weibull distribution to the classifier's output logits. However, when there are both open and tail classes, the distribution fitting could confuse the two.

Instead of calibrating the output logits, our OLTR approach incorporates the confidence estimation into feature learning and dynamically re-scales the meta-embedding w.r.t. the learned visual memory.

3 Our OLTR Model

We propose to map an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world. Our model has two main modules (Fig. 2): dynamic meta-embedding and modulated attention. The former relates and transfers knowledge between head and tail classes and the latter maintains discrimination between them.

3.1 Dynamic Meta-Embedding

Our dynamic meta-embedding combines a direct image feature and an associated memory feature, with the feature norm indicating the familiarity to known classes.

Consider a convolutional neural network (CNN) with a softmax output layer for classification. The second-to-the-last layer can be viewed as the feature and the last layer as a linear classifier (cf. Fig. 2). The feature and the classifier are jointly trained from big data in an end-to-end fashion. Let v_direct denote the direct feature extracted from an input image. The final classification accuracy largely depends on the quality of this direct feature.

While a feed-forward CNN classifier works well with big training data [8, 26], it lacks sufficient supervised updates from small data in our tail classes. We propose to enrich the direct feature v_direct with a memory feature v_memory that relates visual concepts in a memory module. This mechanism is similar to the memory popular in meta learning [46, 36]. We denote the resulting feature the meta-embedding v_meta; it is fed to the last layer for classification. Both our memory feature and meta-embedding depend on the direct feature v_direct.

Unlike the direct feature, the memory feature captures visual concepts from training classes, retrieved from a memory with a much shallower model.

Learning Visual Memory M. We follow [23] on class structure analysis and adopt discriminative centroids as the basic building block. Let M = {c_i}, i = 1, ..., K, denote the visual memory of all the training data, where K is the number of training classes. Compared to alternatives [60, 52], this memory is appealing for our OLTR task: It is almost effortlessly and jointly learned alongside the direct features v_direct, and it considers both intra-class compactness and inter-class discriminativeness.

We compute the centroids {c_i} in two steps. 1) Neighborhood Sampling: We sample both intra-class and inter-class examples to compose a mini-batch during training. These examples are grouped by their class labels and the centroid of each group is updated by the direct features of this mini-batch. 2) Propagation: We alternately update the direct features and the centroids to minimize the distance between each direct feature and the centroid of its group and maximize its distance to the other centroids.
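To make step 1 concrete, below is a minimal sketch of the centroid update, assuming a simple per-class running mean over the direct features in each mini-batch; the momentum value and the function name are illustrative and not taken from the released code.

```python
import torch

def update_centroids(centroids, direct_feats, labels, momentum=0.9):
    """Running-mean style update of the discriminative centroids.

    centroids:    (K, D) tensor, one centroid per training class.
    direct_feats: (B, D) direct features of the current mini-batch.
    labels:       (B,)  class indices of the mini-batch.
    The momentum value is an illustrative choice, not specified in the paper.
    """
    for c in labels.unique():
        mask = labels == c
        batch_mean = direct_feats[mask].mean(dim=0)
        centroids[c] = momentum * centroids[c] + (1.0 - momentum) * batch_mean
    return centroids
```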

Composing the Memory Feature v_memory. For an input image, v_memory shall enhance its direct feature when there is not enough training data (as in the tail classes) to learn it well. The memory feature relates the centroids in the memory, transferring knowledge to the tail classes:

v_memory = o^T M := ∑_i o_i · c_i,    (1)

where o ∈ R^K are the coefficients hallucinated from the direct feature. We use a lightweight neural network T_hal(·) to obtain the coefficients from the direct feature: o = T_hal(v_direct).

Obtaining the Dynamic Meta-Embedding. v_meta combines the direct feature and the memory feature, and is fed to the classifier for the final class prediction (Fig. 3):

v_meta = (1/γ) · (v_direct + e ⊗ v_memory),    (2)

where ⊗ denotes element-wise multiplication. The scalar 1/γ is seemingly redundant for closed-world classification tasks. However, in the OLTR setting, it plays an important role in differentiating the examples of the training classes from those of the open set. γ measures the reachability [47] of an input's direct feature to the memory M — the minimum distance between the direct feature and the discriminative centroids:

γ = min_i ||v_direct − c_i||_2.    (3)

When γ is small, the input likely belongs to a training class from which the centroids are derived, and a large reachability weight 1/γ is assigned to the resulting meta-embedding. Otherwise, the embedding is scaled down, to an almost all-zero vector at the extreme. Such a property is useful for encoding open classes.

We now describe the concept selector e in Eq. (2). The direct feature is often good enough for the data-rich head classes, whereas the memory feature is more important for the data-poor tail classes. To adaptively select them in a soft manner, we learn a lightweight network T_sel(·) with a tanh activation function:

e = tanh(T_sel(v_direct)).    (4)
Figure 3: t-SNE feature visualization of (a) plain ResNet model (b) our dynamic meta-embedding. Ours is more compact for both head and tail classes.
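For readers who prefer code, the following is a minimal PyTorch sketch of Eqs. (1)-(4), assuming the hallucinator T_hal and the selector T_sel are single fully connected layers and that the hallucinated coefficients are softmax-normalized; all module and variable names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class DynamicMetaEmbedding(nn.Module):
    """Sketch of Eqs. (1)-(4): memory feature, concept selector, and reachability."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        # Visual memory M of K discriminative centroids, updated during training.
        self.register_buffer("centroids", torch.zeros(num_classes, feat_dim))
        self.hallucinator = nn.Linear(feat_dim, num_classes)  # T_hal
        self.selector = nn.Linear(feat_dim, feat_dim)          # T_sel

    def forward(self, v_direct):
        # Eq. (3): reachability gamma = distance to the nearest centroid.
        dists = torch.cdist(v_direct, self.centroids)           # (B, K)
        gamma = dists.min(dim=1).values.clamp(min=1e-12)        # (B,)
        # Eq. (1): memory feature as a combination of centroids, with
        # coefficients hallucinated from the direct feature (softmax is assumed).
        o = torch.softmax(self.hallucinator(v_direct), dim=1)   # (B, K)
        v_memory = o @ self.centroids                            # (B, D)
        # Eq. (4): concept selector with a tanh activation.
        e = torch.tanh(self.selector(v_direct))                  # (B, D)
        # Eq. (2): meta-embedding, scaled inversely by the reachability.
        v_meta = (v_direct + e * v_memory) / gamma.unsqueeze(1)
        return v_meta, gamma
```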

3.2 Modulated Attention

While dynamic meta-embedding facilitates feature sharing between head and tail classes, it is also vital to discriminate between them. The direct feature v_direct, e.g., the activation at the second-to-the-last layer in ResNet [20], is able to fulfill this requirement to some extent. However, we find it beneficial to further enhance it with spatial attention, since discriminative cues of head and tail classes seem to be distributed at different locations in the image.

Specifically, we propose modulated attention to encourage samples of different classes to use different contexts. Firstly, we compute a self-attention map SA(f) from the input feature map f by self-correlation [56]. It is used as contextual information and added back (through skip connections) to the original feature map. The modulated attention MA(f) is then designed as conditional spatial attention applied to the self-attention map, MA(f) ⊗ SA(f), which allows examples to select different spatial contexts (Fig. 4). The final attention feature map becomes:

f_att = f + MA(f) ⊗ SA(f),    (5)

where f is a feature map in the CNN, SA(·) is the self-attention operation, and MA(·) is a conditional attention function [54] with a softmax normalization. Sec. 4.1 shows empirically that our attention design achieves superior performance to the common practice of applying spatial attention to the input feature map. This modulated attention (Fig. 4b) could be plugged into any feature layer of a CNN. Here, we modify the last feature map only.
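A compact sketch of Eq. (5) is given below, standing in for SA(·) with a simplified non-local self-attention block and for MA(·) with a softmax-normalized spatial map; both reductions and all layer choices are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class ModulatedAttention(nn.Module):
    """Sketch of Eq. (5): f_att = f + MA(f) * SA(f)."""

    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)   # query projection
        self.phi = nn.Conv2d(channels, channels // 2, 1)     # key projection
        self.g = nn.Conv2d(channels, channels, 1)            # value projection
        self.spatial = nn.Conv2d(channels, 1, 1)             # conditional spatial attention MA

    def forward(self, f):
        b, c, h, w = f.shape
        q = self.theta(f).flatten(2).transpose(1, 2)          # (B, HW, C/2)
        k = self.phi(f).flatten(2)                            # (B, C/2, HW)
        attn = torch.softmax(q @ k, dim=-1)                   # (B, HW, HW)
        v = self.g(f).flatten(2).transpose(1, 2)              # (B, HW, C)
        sa = (attn @ v).transpose(1, 2).reshape(b, c, h, w)   # SA(f), self-attention map
        ma = torch.softmax(self.spatial(f).flatten(2), dim=-1).reshape(b, 1, h, w)
        return f + ma * sa                                     # skip connection back to f
```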

Figure 4: Modulated attention is spatial attention applied on self-attention maps (“attention on attention”). It encourages different classes to use different contexts, which helps maintain the discrimination between head and tail classes.
Figure 5: Results of ablation study. Dynamic meta-embedding contributes most on medium-shot and few-shot classes while modulated attention helps maintain the discrimination of many-shot classes. (The performance is reported with open-set top-1 classification accuracy on ImageNet-LT.)

 

Method | Error (%)
Softmax Pred. [21] | 43.6
Ours | 29.9
ODIN* [28] | 24.6
Ours* | 18.0

Figure 6: Open class detection error (%) comparison. It is performed on the standard open-set benchmark, CIFAR100 + TinyImageNet (resized). "*" denotes the setting where open samples are used to tune algorithmic parameters.

3.3 Learning

Cosine Classifier. We adopt the cosine classifier [39, 15] to produce the final classification results. Specifically, we normalize both the meta-embeddings v_meta^(i), where i indexes the i-th input, and the weight vectors w_k of the classifier (no bias term):

v̂_meta^(i) = (||v_meta^(i)||^2 / (1 + ||v_meta^(i)||^2)) · (v_meta^(i) / ||v_meta^(i)||),   ŵ_k = w_k / ||w_k||.    (6)

The normalization strategy for the meta-embedding is a non-linear squashing function [44] which ensures that vectors of small magnitude are shrunk to almost zero while vectors of big magnitude are normalized to a length slightly below one. This function helps amplify the effect of the reachability 1/γ (cf. Eq. (2)).
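A minimal sketch of this classifier is shown below, assuming the squashing function of [44] and cosine similarity between the squashed embedding and L2-normalized weights; the scale (temperature) factor is an added assumption, common for cosine classifiers but not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(v, eps=1e-12):
    """Non-linear squashing [44]: short vectors shrink toward zero,
    long vectors approach (but stay below) unit length."""
    norm_sq = (v * v).sum(dim=1, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / (norm_sq.sqrt() + eps)

class CosineClassifier(nn.Module):
    """Sketch of Eq. (6): squashed meta-embedding against normalized weights, no bias."""

    def __init__(self, feat_dim, num_classes, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))
        self.scale = scale  # temperature; the value is an illustrative assumption

    def forward(self, v_meta):
        v_hat = squash(v_meta)                   # (B, D)
        w_hat = F.normalize(self.weight, dim=1)  # (K, D)
        return self.scale * v_hat @ w_hat.t()    # logits for softmax / cross-entropy
```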

Loss Function. Since all our modules are differentiable, our model can be trained end-to-end by alternately updating the centroids {c_i} and the dynamic meta-embedding v_meta. The final loss function L is a combination of the cross-entropy classification loss L_CE and the large-margin loss L_LM between the embeddings and the centroids {c_i}:

L = L_CE + λ · L_LM,    (7)

where the weight λ is set in our experiments by observing the accuracy curve on the validation set.

4 Experiments

Datasets. We curate three open long-tailed benchmarks: ImageNet-LT (object-centric), Places-LT (scene-centric), and MS1M-LT (face-centric).


  1. ImageNet-LT: We construct a long-tailed version of the original ImageNet-2012 [8] by sampling a subset following the Pareto distribution with power value α=6. Overall, it has 115.8K images from 1,000 categories, with at most 1,280 and at least 5 images per class. Additional classes of images from ImageNet-2010 are used as the open set. We make the test set balanced.

  2. Places-LT: A long-tailed version of Places-2 [65] is constructed in a similar way. It contains 184.5K images from 365 categories, with a maximum of 4,980 and a minimum of 5 images per class. The gap between the head and tail classes is even larger than in ImageNet-LT. We use the test images from Places-Extra69 as the additional open set.

  3. MS1M-LT: To create a long-tailed version of the MS1M-ArcFace dataset [16, 9], we sample images for each identity with a probability proportional to that identity's image count. This results in 887.5K images and 74.5K identities, with a long-tailed distribution. To inspect the generalization ability of our approach, the performance is evaluated on the MegaFace benchmark [25], which has no identity overlap with MS1M-ArcFace.

Network Architectures. Following [18, 57, 15], we employ a ResNet-10 [20] trained from scratch as our backbone network for ImageNet-LT. To make a fair comparison with [59], a pre-trained ResNet-152 [20] is used as the backbone network for Places-LT. For MS1M-LT, the popular pre-trained ResNet-50 [20] is the backbone network.

Evaluation Metrics. We evaluate the performance of each method under both the closed-set setting (the test set contains no unknown classes) and the open-set setting (the test set contains unknown classes) to highlight their differences. Under each setting, besides the overall top-1 classification accuracy [15] over all classes, we also calculate the accuracy of three disjoint subsets: many-shot classes (each with more than 100 training samples), medium-shot classes (each with 20 to 100 training samples), and few-shot classes (each with fewer than 20 training samples). This helps us understand the detailed characteristics of each method. For the open-set setting, the F-measure is also reported for a balanced treatment of precision and recall, following [3]. For determining open classes, the probability threshold is initially set to a fixed value; a more detailed analysis is provided in Sec. 4.3.
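The protocol can be summarized with the small sketch below, which thresholds the maximum class probability to declare an open-class prediction and reports per-subset accuracy; the threshold value, array layout, and function name are placeholders for illustration (the paper's actual threshold is not repeated here).

```python
import numpy as np

def shot_subset_accuracy(probs, labels, train_counts, open_label=-1, threshold=0.1):
    """Overall and per-subset top-1 accuracy under the open-set setting.

    probs:        (N, K) closed-set class probabilities per test sample.
    labels:       (N,)  ground-truth labels, with `open_label` marking open-set samples.
    train_counts: (K,)  number of training images per class.
    The threshold here is a placeholder, not the value used in the paper.
    """
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = open_label  # declare "unknown"
    correct = preds == labels

    subsets = {
        "many-shot": train_counts > 100,
        "medium-shot": (train_counts >= 20) & (train_counts <= 100),
        "few-shot": train_counts < 20,
    }
    results = {"overall": correct.mean()}
    for name, class_mask in subsets.items():
        sample_mask = np.isin(labels, np.where(class_mask)[0])
        results[name] = correct[sample_mask].mean() if sample_mask.any() else float("nan")
    return results
```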

Competing Methods. We choose for comparison state-of-the-art methods from different fields dealing with open long-tailed data, including: (1) metric learning: Lifted Loss [37]; (2) hard negative mining: Focal Loss [29]; (3) feature regularization: Range Loss [64]; (4) few-shot learning: FSLwF [15]; (5) long-tailed modeling: MetaModelNet [59]; and (6) open-set detection: OpenMax [3]. We apply these methods on the same backbone networks as ours for a fair comparison. We also enable them with class-aware mini-batch sampling [51] for effective learning. Since Model Regression [58] and MetaModelNet [59] are the most related to our work, we directly contrast our results with the numbers reported in their papers.

 

Backbone Net: ResNet-10 (columns 1-4: closed-set setting; columns 5-8: open-set setting)
Methods | Many-shot | Medium-shot | Few-shot | Overall | Many-shot | Medium-shot | Few-shot | F-measure
Plain Model [20] | 40.9 | 10.7 | 0.4 | 20.9 | 40.1 | 10.4 | 0.4 | 0.295
Lifted Loss [37] | 35.8 | 30.4 | 17.9 | 30.8 | 34.8 | 29.3 | 17.4 | 0.374
Focal Loss [29] | 36.4 | 29.9 | 16.0 | 30.5 | 35.7 | 29.3 | 15.6 | 0.371
Range Loss [64] | 35.8 | 30.3 | 17.6 | 30.7 | 34.7 | 29.4 | 17.2 | 0.373
  + OpenMax [3] | - | - | - | - | 35.8 | 30.3 | 17.6 | 0.368
FSLwF [15] | 40.9 | 22.1 | 15.0 | 28.4 | 40.8 | 21.7 | 14.5 | 0.347
Ours | 43.2 | 35.1 | 18.5 | 35.6 | 41.9 | 33.9 | 17.4 | 0.474

(a) Top-1 classification accuracy on ImageNet-LT.

 

Backbone Net: ResNet-152 (columns 1-4: closed-set setting; columns 5-8: open-set setting)
Methods | Many-shot | Medium-shot | Few-shot | Overall | Many-shot | Medium-shot | Few-shot | F-measure
Plain Model [20] | 45.9 | 22.4 | 0.36 | 27.2 | 45.9 | 22.4 | 0.36 | 0.366
Lifted Loss [37] | 41.1 | 35.4 | 24.0 | 35.2 | 41.0 | 35.2 | 23.8 | 0.459
Focal Loss [29] | 41.1 | 34.8 | 22.4 | 34.6 | 41.0 | 34.8 | 22.3 | 0.453
Range Loss [64] | 41.1 | 35.4 | 23.2 | 35.1 | 41.0 | 35.3 | 23.1 | 0.457
  + OpenMax [3] | - | - | - | - | 41.1 | 35.4 | 23.2 | 0.458
FSLwF [15] | 43.9 | 29.9 | 29.5 | 34.9 | 38.1 | 19.5 | 14.8 | 0.375
Ours | 44.7 | 37.0 | 25.3 | 35.9 | 44.6 | 36.8 | 25.2 | 0.464

(b) Top-1 classification accuracy on Places-LT.
Table 2: Benchmarking results on (a) ImageNet-LT and (b) Places-LT. Our approach provides a comprehensive treatment to all the many/medium/few-shot classes as well as the open classes, achieving substantial advantages on all aspects.

 

MegaFace Identification Rate (Backbone Net: ResNet-50; the last two columns are sub-groups)
Methods | Many-shot | Few-shot | One-shot | Zero-shot | Full Test | Male | Female
Plain Model [20] | 80.64 | 71.98 | 84.60 | 77.72 | 73.88 | 78.30 | 78.70
Range Loss [64] | 78.60 | 71.36 | 83.14 | 77.40 | 72.17 | - | -
Ours | 80.82 | 72.44 | 87.60 | 79.50 | 74.51 | 79.04 | 79.08

Method | Acc.
Plain Model [20] | 48.0
Cost-Sensitive [24] | 52.4
Model Reg. [58] | 54.7
MetaModelNet [59] | 57.3
Ours | 58.7

Table 3: Benchmarking results on MegaFace (left) and SUN-LT (right). Our approach achieves the best performance on natural-world datasets when compared to other state-of-the-art methods. Furthermore, our approach achieves across-the-board improvements on both 'male' and 'female' sub-groups.

4.1 Ablation Study

We first investigate the merit of each module in our framework. The performance is reported with open-set top-1 classification accuracy on ImageNet-LT.

Effectiveness of the Dynamic Meta-Embedding. Recall that the dynamic meta-embedding consists of three main components: the memory feature, the concept selector, and the confidence calibrator. From Fig. 5 (b), we observe that the combination of the memory feature and concept selector leads to large improvements on all three shots. This is because the obtained memory feature transfers useful visual concepts among classes. Another observation is that the confidence calibrator is the most effective on few-shot classes. The reachability estimation inside the confidence calibrator helps distinguish tail classes from open classes.

Effectiveness of the Modulated Attention. We observe from Fig. 5 (a) that, compared to medium-shot classes, the modulated attention contributes more to the discrimination between many-shot and few-shot classes. Fig. 5 (c) further validates that the modulated attention is more effective than directly applying spatial attention on feature maps. It implies that adaptive context selection is easier to learn than conventional feature selection.

Figure 7: The absolute F1 score of our method over the plain model. Ours has across-the-board performance gains w.r.t. many/medium/few-shot and open classes.

Effectiveness of the Reachability Calibration. To further demonstrate the merit of reachability calibration for the open-world setting, we conduct additional experiments following the standard settings in [21, 28] (CIFAR100 + TinyImageNet (resized)). The results are listed in Fig. 6, where our approach shows favorable performance over standard open-set methods [21, 28].

Figure 8: Examples of the top infused visual concepts from the memory feature. Except for the bottom right failure case (marked in red), all the other three input images are misclassified by the plain model and correctly classified by our model. For example, to classify the top left image, which belongs to the tail class 'cock', our approach has learned to transfer visual concepts that represent "bird head", "round shape" and "dotted texture", respectively.

4.2 Result Comparisons

We extensively evaluate the performance of various representative methods on our benchmarks.

ImageNet-LT. Table 2 (a) shows the performance comparison of different methods. We have the following observations. Firstly, both Lifted Loss [37] and Focal Loss [29] greatly boost the performance of few-shot classes by enforcing feature regularization. However, they also sacrifice the performance on many-shot classes since there is no built-in mechanism for adaptively handling samples of different shots. Secondly, OpenMax [3] improves the results under the open-set setting. However, the accuracy degrades when it is evaluated with the F-measure, which considers both precision and recall in the open set. When the open classes are compounded with the tail classes, it becomes challenging to perform the distribution fitting that [3] requires. Lastly, though the few-shot learning without forgetting approach [15] retains the many-shot class accuracy, it has difficulty dealing with imbalanced base classes, which the current few-shot paradigm does not cover. As demonstrated in Fig. 7, our approach provides a comprehensive treatment to all the many/medium/few-shot classes as well as the open classes, achieving substantial improvements on all aspects.

Places-LT. Similar observations can be made on the Places-LT benchmark as shown in Table 2 (b). With a much stronger baseline (i.e. pre-trained ResNet-152), our approach still consistently outperforms other alternatives under both the closed-set and open-set settings. The advantage is even more profound under the F-measure.

MS1M-LT.

We train on the MS1M-LT dataset and report results on the MegaFace identification track, which is a standard benchmark in the face recognition field. Since the face identities in the training set and the test set are disjoint, we adopt an indirect way to partition the test set into subsets of different shots. We approximate the pseudo shots of each test sample by counting the number of training samples whose feature similarity to it exceeds a threshold. Apart from the many-shot, few-shot, and one-shot subsets, we also obtain a zero-shot subset, for which we cannot find any sufficiently similar samples in the training set. As shown in Table 3 (left), our approach has the largest advantage on one-shot and zero-shot identities.

SUN-LT. To directly compare with [58] and [59], we also test on the SUN-LT benchmark they provide. The final results are listed in Table 3 (right). Instead of learning a series of classifier transformations, our approach transfers visual knowledge among features and achieves an improvement over the prior best. Note that our approach also incurs much less computational cost, since MetaModelNet [59] requires a recursive training procedure.

Indication for Fairness. Here we report the sensitive-attribute performance on MS1M-LT. The last two columns in Table 3 show that our approach achieves across-the-board improvements on both 'male' and 'female' sub-groups, which has an implication for effective fairness learning.

Figure 9: The influence of (a) dataset longtail-ness, (b) open-set probability threshold, and (c) the number of open classes. As the dataset becomes more imbalanced, our approach only undergoes a moderate performance drop. Our approach also demonstrates great robustness to the contamination of open classes.

4.3 Further Analysis

Finally we visualize and analyze some influencing aspects in our framework as well as typical failure cases.

What the Memory Feature Has Infused. Here we inspect the visual concepts that the memory feature has infused by visualizing its top activating neurons, as shown in Fig. 8. Specifically, for each input image, we identify its top transferred neurons in the memory feature, and each neuron is visualized by a collection of its highest activated patches [62] over the whole training set. For example, to classify the top left image in Fig. 8, which belongs to the tail class 'cock', our approach has learned to transfer visual concepts that represent "bird head", "round shape" and "dotted texture", respectively. After feature infusion, the dynamic meta-embedding becomes more informative and discriminative.

Influence of Dataset Longtail-ness. The longtail-ness of the dataset (i.e., the degree of imbalance of the class distribution) could have an impact on model performance. For faster investigation, the weights of the backbone network are frozen here during training. From Fig. 9 (a), we observe that as the dataset becomes more imbalanced (i.e., the power value α decreases), our approach only undergoes a moderate performance drop. Dynamic meta-embedding enables effective knowledge transfer among data-abundant and data-scarce classes.

Influence of Open-Set Prob. Threshold. The performance change w.r.t. the open-set probability threshold is demonstrated in Fig. 9 (b). Compared to the plain model [20] and range loss [64], the performance of our approach changes steadily as the open-set threshold rises. The reachability estimator in our framework helps calibrate the sample confidence, thus enhancing robustness to open classes.

Influence of the Number of Open Classes. Finally we investigate performance change w.r.t. the number of open classes. Fig. 9 (c) indicates that our approach demonstrates great robustness to the contamination of open classes.

Failure Cases. Since our approach encourages feature infusion among classes, it slightly sacrifices fine-grained discrimination for the promotion of under-represented classes. One typical failure case of our approach is confusion between many-shot and medium-shot classes. For example, the bottom right image in Fig. 8 is misclassified as 'airplane' because some cross-category traits like "nose shape" and "eye shape" are infused. We plan to explore feature disentanglement [5] to alleviate this trade-off.

5 Conclusions

We introduce the OLTR task that learns from natural long-tailed, open-ended distributed data and optimizes the overall accuracy over a balanced test set. We propose an integrated OLTR algorithm, dynamic meta-embedding, in order to share visual knowledge between head and tail classes and to reduce confusion between tail and open classes. We validate our method on three curated large-scale OLTR benchmarks (ImageNet-LT, Places-LT and MS1M-LT). Our publicly available code and data will enable future research that is directly transferable to real-world applications.

Acknowledgements. This research was supported, in part, by SenseTime Group Limited, NSF IIS 1835539, Berkeley Deep Drive, DARPA, and US Government fund through Etegent Technologies on Low-Shot Detection in Remote Sensing Imagery. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

Appendices

In this supplementary material, we provide details omitted in the main text including:


  • Section A: intuitive explanation of our approach (Sec. 1 “Introduction” of the main paper.)

  • Section B: relation to fairness analysis (Sec. 2 “Related Work” of the main paper.)

  • Section C: more methodology details (Sec. 3 “Approach” of the main paper.)

  • Section D: detailed experimental setup (Sec. 4 “Experiments” of the main paper.)

  • Section E: additional visualization of our approach (Sec. 4.3 “Further Analysis” of the main paper.)

Appendix A Intuitive Explanation of Our Approach

In this section, we give an intuitive explanation of our approach to the problem of open long-tailed recognition. From the perspective of knowledge gained from observation (i.e., the training set), head classes, tail classes, and open classes form a continuous spectrum, as illustrated in Fig. 10.

Figure 10: Intuitive explanation of our approach.

 

Direct + Memory Feature | Modulated Attention | Reachability Module
Transfer knowledge between head/tail classes | Maintain discrimination between head/tail classes | Deal with open classes

Table 4: The effects of each component in our approach.

Firstly, we obtain a visual memory by aggregating the knowledge from both head and tail classes. Then the visual concepts stored in the memory are infused back as the associated memory feature to enhance the original direct feature. It can be understood as using induced knowledge (i.e., the memory feature) to assist the direct observation (i.e., the direct feature). We further learn a concept selector to control the amount and type of memory feature to be infused. Since head classes already have abundant direct observations, only a small amount of memory feature is infused for them. On the contrary, tail classes suffer from scarce observations, so the associated visual concepts in the memory feature are extremely beneficial for them. Finally, we calibrate the confidence on open classes by calculating their reachability to the obtained visual memory. In this way, we provide a comprehensive treatment to the full spectrum of head, tail and open classes, improving the performance on all categories. To summarize, the effects of each component in our approach are listed in Table 4.

Appendix B Relation to Fairness Analysis

The open long-tailed recognition task proposed in our work also has an intrinsic relationship to fairness analysis [13, 63, 33, 35, 1]. Their key differences are listed in Table 5. On the problem-setting side, both open long-tailed recognition and fairness analysis aim to tackle the imbalance that exists in real-world data. Open long-tailed recognition focuses on the longtail-ness of both known and unknown categories, while fairness analysis deals with the bias in sensitive attributes such as male/female and white/black.

On the methodology side, both open long-tailed recognition and fairness analysis aim to learn transferable representations. Open long-tailed recognition optimizes for the overall accuracy over all categories, while fairness analysis optimizes for several attribute-wise criteria. The preliminary results in Table 3 demonstrate that our proposed dynamic meta-embedding is also a promising solution for fairness analysis.

 

Problem | Imbalanced Asp. | Optimization Obj.
fairness analysis | sensitive attributes | attribute-wise criteria
open long-tail recog. | categories | acc. on all categories

Table 5: Key differences between fairness analysis and open long-tail recognition. “asp.” stands for aspects while “obj.” stands for objectives.
Figure 11: The dataset statistics of ImageNet-LT.

Appendix C More Methodology Details

Notation Summary. We summarize the notations used in the paper in Table 6.

 

Notation | Meaning
x | input image
y | category label
f | the original feature map
f_att | feature map after modulated attention
– | feature extractor
– | classifier
c_i | discriminative centroid
– | local graph
M | visual memory
v_direct | direct feature
v_memory | memory feature
o | hallucinated coefficients from visual memory
e | concept selector
1/γ | confidence calibrator
v_meta | dynamic meta-embedding

 

Table 6: Summary of notations.

Obtaining Discriminative Centroids. The step-by-step procedure for obtaining discriminative centroids is further illustrated in Fig. 12.

Figure 12: The discriminative centroids constitute our visual memory, which are obtained with two iterative steps, neighborhood sampling and affinity propagation.

Detailed Loss Functions. Here we elaborate the two loss functions L_CE and L_LM described in Eq. (7) of the main paper. Specifically, L_CE is the cross-entropy loss between the prediction from the dynamic meta-embedding v_meta and the ground-truth category label y:

L_CE = − ∑_i y_i log p_i(v_meta),    (8)

where p(v_meta) is the output of the cosine classifier described in Eq. (6) of the main paper. Next we introduce the large-margin loss L_LM between the embedding and the centroids {c_i}:

L_LM = max(0, ||v_meta − c_y||_2 − min_{j≠y} ||v_meta − c_j||_2 + m),    (9)

where m is the margin, whose value is chosen in our experiments. With this formulation, we minimize the distance between each embedding and the centroid of its group and meanwhile maximize the distance between the embedding and the centroids it does not belong to.
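A minimal sketch of the combined objective follows, pairing the cross-entropy term with a hinge-style margin term that pulls each meta-embedding toward its own centroid and away from the nearest other centroid; the hinge form and the `lam` and `margin` values are illustrative assumptions, since the text only states that these hyper-parameters were chosen on the validation set.

```python
import torch
import torch.nn.functional as F

def oltr_loss(logits, v_meta, centroids, labels, lam=0.1, margin=1.0):
    """Sketch of Eq. (7): cross-entropy plus a large-margin term over centroids.

    logits:    (B, K) classifier outputs for the meta-embeddings.
    v_meta:    (B, D) dynamic meta-embeddings.
    centroids: (K, D) discriminative centroids (the visual memory).
    labels:    (B,)  ground-truth class indices.
    `lam` and `margin` are illustrative values, not the paper's settings.
    """
    ce = F.cross_entropy(logits, labels)                       # Eq. (8)
    dists = torch.cdist(v_meta, centroids)                     # (B, K)
    d_own = dists.gather(1, labels.view(-1, 1)).squeeze(1)     # distance to own centroid
    d_other = dists.scatter(1, labels.view(-1, 1), float("inf")).min(dim=1).values
    large_margin = F.relu(d_own - d_other + margin).mean()     # hinge-style Eq. (9)
    return ce + lam * large_margin
```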

Appendix D Experimental Setup

D.1 Open Long-Tail Dataset Preparation

Figure 13: The dataset statistics of Places-LT.
Figure 14: The dataset statistics of MS1M-LT.

ImageNet-LT. The training set was generated using a Pareto distribution [42] with power value α=6, with between 1,280 and 5 images per class from the 1,000 classes of the ImageNet dataset. Images were randomly selected based on the distribution values of each class. The classes were sorted following the benchmark proposed by Hariharan & Girshick [18], where the 1,000 classes were randomly split into 389 base classes and 611 novel classes. The first 389 largest classes in ImageNet-LT are the same as the base classes in that benchmark, and the remaining 611 classes are the same as the novel classes. We randomly selected 20 training images per class from the original training set as the validation set. The original validation set of ImageNet was used as the test set in this paper. The dataset specifications are shown in Fig. 11.

Places-LT. The training set was generated similarly to ImageNet-LT using a Pareto distribution with power value α=6, with between 4,980 and 5 images per class from the 365 classes of the Places-365-Standard dataset. We used the class order of the (imbalanced) Places-365-Challenge dataset to sort the training classes. We also randomly selected 20 images per class from the original training set as the validation set. The original validation set of Places-365 was used as the test set in this paper. The dataset specifications are shown in Fig. 13.
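As an illustration of how such a long-tailed class-size profile can be drawn, here is a small sketch that samples per-class counts from a Pareto distribution with power value 6 and rescales them into the stated per-class range; it is a plausible recipe for building a comparable split, not the authors' exact sampling script.

```python
import numpy as np

def longtail_class_sizes(num_classes=1000, max_per_class=1280, min_per_class=5,
                         alpha=6, seed=0):
    """Draw one Pareto(alpha) value per class and rescale the sorted draws into
    [min_per_class, max_per_class]. Illustrative only; not the released script."""
    rng = np.random.default_rng(seed)
    draws = np.sort(rng.pareto(alpha, size=num_classes))[::-1]       # heavy-tailed, descending
    scaled = (draws - draws.min()) / (draws.max() - draws.min())      # map to [0, 1]
    sizes = min_per_class + scaled * (max_per_class - min_per_class)  # map to class-size range
    return sizes.astype(int)

# Example: class sizes decay from 1,280 for the head class down to 5 for the last one.
sizes = longtail_class_sizes()
print(sizes[:3], sizes[-3:])
```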

MS1M-LT. This dataset was generated from a large-scale face recognition dataset named MS1M-ArcFace. The original dataset contains about 5.8M images with 85K identities. To create a long-tailed version, we sampled images for each identity with a probability proportional to that identity's image count. This results in 887.5K images and 74.5K identities, with a long-tailed distribution.

For the evaluation set, MegaFace is one of the largest face recognition benchmarks. It contains 3,530 images from the FaceScrub dataset as the probe set and 1M images as the gallery set. The identification task is to find the top-1 nearest image in the 1M gallery for each sample in the probe set; the identification rate is then the mean of the hit rates. Since the identities in the training set and the test set do not overlap, we adopt an indirect way to partition the test set into subsets with different shots. We approximate the pseudo occurrences of each test sample by counting the number of training samples whose similarity to it exceeds a threshold. The similarity is calculated as the feature distance produced by a state-of-the-art face recognition system [9]. Apart from the many-shot, few-shot, and one-shot subsets, we also define a zero-shot subset, for which we cannot find similar samples in the training set. The dataset specifications are shown in Fig. 14.
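The pseudo-shot partition can be sketched as below, assuming L2-normalized features and cosine similarity; the similarity threshold and the few-shot/many-shot boundary are not specified in the text and are placeholders here.

```python
import numpy as np

def pseudo_shot_subsets(test_feats, train_feats, sim_threshold=0.5, few_shot_max=20):
    """Bucket each probe into zero/one/few/many-shot by counting similar training samples.

    sim_threshold and few_shot_max are illustrative placeholders, not the paper's values.
    """
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    counts = (test_feats @ train_feats.T > sim_threshold).sum(axis=1)

    subsets = np.full(counts.shape, "many-shot", dtype=object)
    subsets[counts <= few_shot_max] = "few-shot"
    subsets[counts == 1] = "one-shot"
    subsets[counts == 0] = "zero-shot"
    return subsets, counts
```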

SUN-LT. We used the same training and test sets as provided by [59], with between 1,132 and 1 images per class in the training set and 40 images per class in the test set. We randomly selected 5 images from the unused training data as our validation set.

D.2 Data Pre-processing

All images were first resized to a fixed resolution. During training, the images were randomly cropped and then augmented with a random horizontal flip and random color jitter on brightness, contrast, and saturation with a jitter factor of 0.4. During validation and testing, images were center cropped without further augmentation.
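A torchvision sketch of this pipeline is shown below. The resize/crop resolutions and flip probability are not given in the text; the 256/224 sizes and p=0.5 used here are common ImageNet-style defaults assumed purely for illustration.

```python
from torchvision import transforms

# Training-time augmentation as described above; sizes and flip probability are
# assumed defaults (256/224, p=0.5), not values taken from the paper.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

# Validation/test-time preprocessing: center crop, no further augmentation.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```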

Figure 15: Examples of the infused visual concepts from memory feature in Places-LT.
Figure 16: Examples of the infused visual concepts from memory feature in MS1M-LT.

D.3 Training Details

ImageNet-LT. The feature extractor model used in the experiments on ImageNet-LT was a ResNet-10 model initialized from scratch (i.e., random initialization). All different classifiers were also initialized from scratch. Some major hyper-parameters can be found in Table 7.

Places-LT & SUN-LT. We used a two-stage training protocol following [15] when conducting experiments on both Places-LT and SUN-LT. (1) In the first stage, we fine-tuned the ImageNet pre-trained ResNet-152 feature model with a dot-product classifier on the training data of Places-LT and SUN-LT. (2) In the second stage, we used the Places-LT/SUN-LT pre-trained model as our feature model and froze the convolutional weights. Finally, we fine-tuned the classifiers, initialized from scratch, to produce the experimental results. Some major hyper-parameters can be found in Table 7.

MS1M-LT. We used the ImageNet pre-trained ResNet-50 with a linear classifier and cross-entropy loss to train the face recognition model. Some major hyper-parameters can be found in Table 7.

 

 Dataset  Initial LR.  Epoch  LR. Schedule
ImageNet-LT 0.1 30 drop 10% every 10 epochs
Places-LT 0.01 30 drop 10% every 10 epochs
MS1M-LT 0.01 30 drop 10% every 10 epochs

 

Table 7: The major hyper-parameters used in our experiments. “LR.” stands for learning rate.

D.4 Evaluation Protocols

Top-1 Classification Accuracy. For ImageNet-LT, Places-LT, and SUN-LT, since the test sets are balanced, the top-1 classification accuracy is calculated as the mean accuracy over all closed-set categories under the contamination of open classes. All open classes are regarded as one unknown class. Predictions are obtained as the classes with the highest probabilities.

F-measure. Following [3], the F-measure (F) is calculated as two times the product of precision (P) and recall (R) divided by the sum of P and R:

F = 2PR / (P + R).    (10)

P is calculated as the true positives (TP, defined as correct predictions on the closed test set) over the sum of TP and false positives (FP, defined as incorrect predictions on the closed test set):

P = TP / (TP + FP).    (11)

R is calculated as TP over the sum of TP and false negatives (FN, defined as the number of images from the open set that are predicted as known categories):

R = TP / (TP + FN).    (12)
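These definitions translate directly into the small helper below; the argument names are illustrative, and the counts are assumed to be computed with the same open-class threshold used for the accuracy protocol.

```python
def open_set_f_measure(true_pos, false_pos, false_neg):
    """F-measure from Eqs. (10)-(12).

    true_pos:  TP, correct predictions on the closed test set.
    false_pos: FP, incorrect predictions on the closed test set.
    false_neg: FN, open-set images predicted as known categories.
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```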

Appendix E More Visualization

Memory Feature in Places-LT. We visualize the memory feature in Places-LT similarly to ImageNet-LT as described in Sec. 4.3 in the main paper. Examples of the infused visual concepts from memory feature in Places-LT are presented in Fig. 15. We observe that memory feature encodes discriminative visual traits for the underlying scene.

Memory Feature in MS1M-LT. Following [32], we visualize the memory feature in MS1M-LT by contrasting the least activated average image and the most activated average image of the top firing neuron. From Fig. 16, we observe that the memory feature in MS1M-LT infuses several identity-related attributes (e.g., "high cheekbones", "dark skin color" and "narrow eyes") for precise recognition.