ContinualLocalReplacement
The codes for paper "Continual Local Replacement for Few-shot Image Recognition"
The goal of few-shot learning is to learn a model that can recognize novel classes from one or a few training examples. It is challenging mainly due to two aspects: (1) it lacks good feature representations of novel classes; (2) a few labeled data cannot accurately represent the true data distribution. In this work, we use a sophisticated network architecture to learn better feature representations and focus on the second issue. A novel continual local replacement strategy is proposed to address the data deficiency problem. It takes advantage of the content in unlabeled images to continually enhance labeled ones. Specifically, a pseudo labeling strategy is adopted to constantly select semantically similar images on the fly. Original labeled images are locally replaced by the selected images for the next training epoch. In this way, the model can directly learn new semantic information from unlabeled images, and the capacity of supervised signals in the embedding space can be significantly enlarged. This allows the model to improve generalization and learn a better decision boundary for classification. Extensive experiments demonstrate that our approach achieves highly competitive results over existing methods on various few-shot image recognition benchmarks.
While deep learning has achieved remarkable results in image recognition [Krizhevsky et al.2012, He et al.2016], it is highly data-hungry and requires massive labeled training data. In contrast, human-level intelligence can achieve fast learning after observing only one or a few instances [Lake et al.2011]. To close this gap, researchers have devoted efforts to the few-shot learning problem, exploring similarity metrics [Koch et al.2015, Snell et al.2017], meta learning [Finn et al.2017, Rusu et al.2019], and augmentation [Hariharan and Girshick2017, Wang et al.2018, Chen et al.2019d].

However, the few-shot learning problem remains challenging. We argue that the main challenge comes from two aspects: (1) Feature deterioration. The feature representation learned from the training dataset may lose its discriminative property on novel classes. For example, as illustrated in Fig. 1 (a), the ProtoNet [Snell et al.2017] features deteriorate from training classes to novel testing classes. A model which works well on training datasets may not perform well on novel classes. (2) Data deficiency. A single or a few labeled data cannot represent the true data distribution. As a result, it is difficult to learn a good decision boundary even with a decent feature representation. This concept is shown in Fig. 1 (b). Most previous approaches treat their model as a black box and suffer from both issues. Recent work [Chen et al.2019a] shows that existing approaches like ProtoNet and MAML [Finn et al.2017] could be beaten by the standard transfer learning baseline in some experiment settings.
In this paper, we focus on the data deficiency issue and use a sophisticated network architecture (e.g. ResNet) to alleviate feature deterioration, as suggested by Kornblith et al. [Kornblith et al.2019]. Our approach starts from an important observation. As illustrated in Fig. 1 (c), the only difference between the two few-shot testing episodes (i.e. first row and second row) is the labeled images (i.e. the support set), yet the testing accuracy varies significantly. It seems that representative training data play a vital role in few-shot learning. Admittedly, it is hard to precisely define what a representative sample is, since we may not have prior knowledge about the novel classes in practice. However, it should be generally helpful if the model has the chance to see more data.
Figure 1: Challenges on few-shot image recognition. (a) Feature deterioration: t-SNE visualization of ProtoNet's features on MiniImageNet. The discriminative property of the features dramatically decreases from training (left) to novel testing (right) classes. (b) Data deficiency: classifier decision boundary on the double moon toy dataset. It is difficult to learn a good decision boundary from a single labeled instance (marked with a star), but doable with more, representative labeled instances. (c) Representative training data: performance comparison on different support sets. First row: the model performs badly on a randomly chosen labeled image. Second row: the performance improves considerably when a representative training image is used.
Based on this observation, we propose a continual local replacement approach which leverages the content of unlabeled images to constantly alter the labeled images. In particular, some regions of the labeled images are replaced by content from unlabeled images, so that this content can be learned in a supervised way. Note that the labeled and unlabeled images may be semantically different (i.e. from different classes). In this case, the replacement can be seen as artifacts or random erasing [Zhong et al.2020], which brings limited benefits. To select informative unlabeled data, we borrow the idea of pseudo-labeling [Lee2013] and design an algorithm which performs the data selection on the fly before each training epoch. This brings two advantages. First, semantically similar data are selected to enhance labeled images, and thus the model can learn new semantic information. Second, different data are continually selected, which diversifies the semantic information and enlarges the supervised signals in the embedding space.
Our main technical contribution is a continual local replacement strategy which effectively addresses the data deficiency issue and learns a better classifier decision boundary. Our algorithm is built upon standard transfer learning and can be seen as an extension of the baseline method in [Chen et al.2019a]. Extensive experiments show that this simple yet effective approach can improve the baseline performance and achieve state-of-the-art results on various image recognition datasets. Source code has been made available in https://github.com/Lecanyu/ContinualLocalReplacement. We hope this approach can be a strong baseline for future research.
Our method takes unlabeled images to alter labeled ones for few-shot image recognition. Such a strategy is closely related to few-shot image recognition, semi-supervised learning, and data augmentation. We briefly review the most relevant works below.
Few-shot image recognition. The goal of few-shot image recognition is to endow models with the ability to recognize novel classes and datasets where only a limited amount of labeled data is available. Many approaches have been proposed for this goal. Earlier work such as [Fei-Fei et al.2006, Salakhutdinov et al.2012] presented Bayesian models for few-shot image recognition. Recent metric learning approaches [Koch et al.2015, Snell et al.2017, Sung et al.2018] learn metrics from pairwise comparisons between image instances, while meta learning methods [Finn et al.2017, Zhou et al.2018, Rusu et al.2019] learn a good model initialization for few-shot adaptation. In addition, attention-based [Wang et al.2017], graph-based [Garcia and Bruna2018, Liu et al.2019], and memory-based [Santoro et al.2016, Cai et al.2018] strategies have also been investigated.
Semi-supervised learning. Unlike standard supervised learning, semi-supervised learning attempts to leverage unlabeled data to help learning, especially when labeled data are insufficient. [Grandvalet and Bengio2005] proposed entropy regularization to encourage learning low-density separations between classes. The pseudo-label approach [Lee2013] iteratively trains and updates the model on unlabeled data with guessed labels. Perturbation and consistency regularization strategies [Sajjadi et al.2016, Miyato et al.2018] perturb unlabeled data and enforce consistent outputs to exploit the underlying data distribution. A hybrid approach [Berthelot et al.2019] integrates previous strategies and achieves a new state of the art. These semi-supervised methods provide ways to leverage unlabeled data, but they are usually inapplicable to few-shot learning, where the number of labeled data is extremely small. More recently, [Ren et al.2018] tackled few-shot image recognition by proposing a variant of ProtoNet which takes advantage of unlabeled data to further adjust the learned prototypes. In this paper, we combine the idea of pseudo-labeling with image local replacement to address the data deficiency issue in few-shot learning.
Data augmentation. Data augmentation is widely adopted in various machine learning problems. Typical image augmentations include rotating, flipping, cropping, and color jittering. Such augmentations can endow models with better generalization [Krizhevsky et al.2012]. More sophisticated augmentation methods have also been explored, such as image synthesis [Wang et al.2018], random erasing [Zhong et al.2020], and feature hallucination and augmentation [Hariharan and Girshick2017, Chen et al.2019d].

Among these strategies, the most relevant works are [Chen et al.2019b, Chen et al.2019c]. They introduced various patch-level image augmentation methods like mixup and replacement. However, they ignored the diversity of augmented images and used a fixed augmented set, even though a learning-based augmentation strategy is adopted in [Chen et al.2019c]. By contrast, we apply image local replacement on the fly in each training epoch, which directly diversifies the semantic information. Our strategy is simple and effective, and it can be seamlessly integrated with the standard transfer learning procedure.
The few-shot image recognition problem can be described on training and testing datasets. The training dataset has abundant labeled classes $C_{base}$. After training on samples with labels from $C_{base}$, the goal is to produce a model for recognizing a disjoint set of novel classes $C_{novel}$ (i.e. $C_{base} \cap C_{novel} = \emptyset$) for which only a single or a few labeled samples are available. Formally, let $x$ and $y$ denote an image and its label respectively. The training dataset is $D_{train} = \{(x_i, y_i)\}$, where $y_i \in C_{base}$. The testing dataset is $D_{test} = \{(x_j, y_j)\}$, where $y_j \in C_{novel}$. Most recent works on few-shot learning follow the episodic paradigm [Vinyals et al.2016]. In particular, episodic testing consists of hundreds of independent testing episodes. Each episode randomly samples an episodic testing dataset which contains $N$ random classes from $C_{novel}$ and $N(K+M)$ images in total, where $K$ and $M$ are the numbers of labeled images and testing images per class respectively. In the few-shot literature, the labeled images and testing images are also referred to as the support set $S$ and query set $Q$, respectively. To mimic episodic testing, most previous metric learning and meta learning methods like [Snell et al.2017, Finn et al.2017] develop episodic training. Our approach is instead built on standard transfer learning, which trains on $D_{train}$ and fine tunes on $S$. With continual local replacement, this transfer learning approach demonstrates highly competitive performance over existing methods.
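To make the episodic protocol concrete, the following is a minimal sketch of how one testing episode could be sampled. The function name `sample_episode`, its parameters, and the toy dataset are illustrative assumptions and are not taken from the released code.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode from a list of (image, label) pairs."""
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)
    # Pick N classes, then K support and n_query query images per class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for label in classes:
        images = rng.sample(by_class[label], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return support, query

# Toy stand-in for the testing dataset: 10 classes with 30 image ids each.
toy_test_set = [(f"img_{c}_{i}", c) for c in range(10) for i in range(30)]
support_set, query_set = sample_episode(
    toy_test_set, n_way=5, k_shot=1, n_query=15, rng=random.Random(0)
)
```

Sampling support and query images from one draw without replacement keeps the two sets disjoint inside an episode.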
Our primary goal is to provide a chance for the model to see more data. We fulfill it via introducing the content of unlabeled images for learning. The image local replacement is a way to introduce new semantic information.
Figure 2: (a) RandEra | (b) BlockAug | (c) BlockDef
Several replacement methods already exist. For example, a random region of an image can be replaced by another image, as shown in Fig. 2(a). Alternatively, the original image can be divided into several blocks, some of which are either replaced as in Fig. 2(b) or linearly mixed with other images as in Fig. 2(c). All these replacement methods are applicable for our purpose. Note that randomness in the local replacement is crucial for robustness. When two images are semantically different, the replaced regions may be interpreted as partial occlusions. This can be seen as an augmentation that improves the generalization ability [Chen et al.2019b, Zhong et al.2020]. When two images are semantically similar, the model has a chance to learn new semantic information from the replaced regions. Besides the existing replacement methods, other methods could also be designed as long as they satisfy these two objectives. This is beyond the scope of this paper, since our technical contribution mainly lies in the following continual replacement approach.
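As an illustration of block-level replacement in the spirit of Fig. 2(b), the sketch below replaces a random number of grid cells of one image with the corresponding cells of another. It is a simplified stand-in, not the BlockAug implementation: images are plain nested lists, and the 3 x 3 grid and the names `block_replace`, `grid` and `max_blocks` are assumptions made for the example.

```python
import random

def block_replace(image, other, grid=3, max_blocks=4, rng=random):
    """Replace up to max_blocks cells of a grid x grid partition of `image`
    with the corresponding cells of `other` (both H x W nested lists)."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched
    # Both the number and the positions of replaced blocks are random.
    n_blocks = rng.randint(0, min(max_blocks, grid * grid))
    cells = rng.sample(range(grid * grid), n_blocks)
    for cell in cells:
        r, c = divmod(cell, grid)
        # Integer boundaries that tile the image for any H and W.
        top, bottom = r * height // grid, (r + 1) * height // grid
        left, right = c * width // grid, (c + 1) * width // grid
        for y in range(top, bottom):
            out[y][left:right] = other[y][left:right]
    return out, n_blocks

# Toy 6 x 6 images: the labeled image is all zeros, the other all ones.
labeled = [[0] * 6 for _ in range(6)]
unlabeled = [[1] * 6 for _ in range(6)]
mixed, n_replaced = block_replace(labeled, unlabeled, rng=random.Random(0))
```

Capping the count with `max_blocks` bounds the replaced area, so part of the original image always survives.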
In the training stage, we first apply local replacement on the original training set $D_{train}$ to synthesize an augmented training set $D_{aug}$. Specifically, for each image $(x_i, y_i) \in D_{train}$, we randomly select, with probability $p$, another image $(x_j, y_j)$ such that $x_j \neq x_i$ and $y_j = y_i$ (i.e. two different images with the same label). Then, one or several local regions of $x_i$ are replaced by content of $x_j$ to synthesize a new image $\tilde{x}_i$, which keeps the original label $y_i$. We introduce $D_{aug}$ in training because we want the model to understand such local replacement and to be robust to partial occlusions.

After building up $D_{aug}$, both the original images and the synthesized images are simultaneously fed into the CNN for training. This procedure is illustrated in Fig. 3. Specifically, we optimize the loss function in Eq. 1 during training:

$$\min_{\theta_f, \theta_c} \sum_{(x, y) \in D_{train} \cup D_{aug}} \mathcal{L}\big(C(F(x; \theta_f); \theta_c), y\big) \qquad (1)$$

where $\mathcal{L}$ is the standard cross-entropy loss, $F$ denotes the feature extractor backbone with trainable parameters $\theta_f$, and $C$ is a classifier (e.g. a fully connected layer) with parameters $\theta_c$.
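The synthesis of the augmented training set can be sketched as follows. This is an illustrative outline under simplifying assumptions: a same-class partner is always chosen when one exists (the selection probability is not modelled), images are flat lists, and `build_augmented_set`, `replace_fn` and `swap_first_half` are hypothetical names.

```python
import random
from collections import defaultdict

def build_augmented_set(train_set, replace_fn, rng):
    """For every (image, label), pick a different image of the same class and
    synthesize a locally replaced copy that keeps the original label."""
    by_class = defaultdict(list)
    for index, (_, label) in enumerate(train_set):
        by_class[label].append(index)
    augmented = []
    for index, (image, label) in enumerate(train_set):
        partners = [j for j in by_class[label] if j != index]
        if not partners:  # a class with a single image has no partner
            continue
        partner_image = train_set[rng.choice(partners)][0]
        augmented.append((replace_fn(image, partner_image), label))
    return augmented

def swap_first_half(image, other):
    """Toy local replacement: overwrite the first half of a flat image."""
    half = len(image) // 2
    return other[:half] + image[half:]

toy_train = [([1, 1, 1, 1], "cat"), ([2, 2, 2, 2], "cat"),
             ([3, 3, 3, 3], "dog"), ([4, 4, 4, 4], "dog"),
             ([5, 5, 5, 5], "owl")]
toy_aug = build_augmented_set(toy_train, swap_first_half, random.Random(0))
```

Both the original and the synthesized pairs would then be passed to the training loop, matching the union in Eq. 1.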
After training, we fine tune the model to recognize novel classes in $D_{test}$. For each $N$-way testing episode with $K$ support and $M$ query images per class, an additional $R$ images per class are randomly sampled as the unlabeled set $U$. In other words, there are $N(K+M+R)$ images in total, including the support and query sets.

We follow the standard transfer learning procedure to fine tune the trained model on the support set $S$ and the unlabeled set $U$. Specifically, suppose the pre-trained feature extractor is reasonably good. We fix the pre-trained feature extractor parameters $\theta_f$ and create a new classifier $C'$ with randomly initialized parameters $\theta_c'$. Image local replacement is applied constantly while tuning the classifier. In the first fine-tune epoch, the original support set $S$ is fed into the model and the classifier parameters $\theta_c'$ are updated by minimizing the cross-entropy loss. Before each following tuning epoch, the latest model makes predictions for all unlabeled images, and the resulting pseudo labels $\hat{y}_u$ are assigned to the unlabeled set $U$. Then, local regions of a support image $x_s$ are replaced by an unlabeled image $x_u$ with $\hat{y}_u = y_s$ to synthesize a new image $\tilde{x}_s$, which yields an augmented support set $S_{aug}$. $S_{aug}$ is fed into the CNN to fine tune the classifier in the next epoch, and Eq. 2 is optimized:

$$\min_{\theta_c'} \sum_{(\tilde{x}_s, y_s) \in S_{aug}} \mathcal{L}\big(C'(F(\tilde{x}_s; \theta_f); \theta_c'), y_s\big) \qquad (2)$$

where $\mathcal{L}$ is the cross-entropy loss and $F$ is the fixed feature extractor.
The complete fine tuning algorithm is summarized in Alg. 1.
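The fine tuning procedure described above can be outlined roughly as below. This is a hedged sketch rather than a transcription of Alg. 1: the classifier is abstracted behind two hypothetical callables `fit` and `predict`, and `CentroidClassifier` is only a toy stand-in for the fixed backbone plus trainable FC layer.

```python
import random

def continual_local_replacement(support, unlabeled, fit, predict,
                                replace_fn, epochs, rng):
    """Tune a classifier while re-synthesizing the support set every epoch."""
    fit(support)  # first epoch: original support set only
    augmented = list(support)
    for _ in range(epochs - 1):
        # Pseudo-label all unlabeled images with the latest classifier.
        pseudo = {}
        for image in unlabeled:
            pseudo.setdefault(predict(image), []).append(image)
        # Locally replace each support image with a same-pseudo-class image.
        augmented = []
        for image, label in support:
            candidates = pseudo.get(label)
            if candidates:
                image = replace_fn(image, rng.choice(candidates))
            augmented.append((image, label))
        fit(augmented)
    return augmented

class CentroidClassifier:
    """Toy classifier: nearest running class mean of the average pixel value."""
    def __init__(self):
        self.sums, self.counts = {}, {}
    def fit(self, data):
        for image, label in data:
            self.sums[label] = self.sums.get(label, 0.0) + sum(image) / len(image)
            self.counts[label] = self.counts.get(label, 0) + 1
    def predict(self, image):
        value = sum(image) / len(image)
        return min(self.sums,
                   key=lambda c: abs(self.sums[c] / self.counts[c] - value))

clf = CentroidClassifier()
toy_support = [([0.0] * 4, "dark"), ([1.0] * 4, "bright")]
toy_unlabeled = [[0.1] * 4, [0.2] * 4, [0.9] * 4, [0.8] * 4]
last_aug = continual_local_replacement(
    toy_support, toy_unlabeled, clf.fit, clf.predict,
    lambda a, b: b[:2] + a[2:], epochs=3, rng=random.Random(0))
```

Because pseudo labels are recomputed before every epoch, the images chosen for replacement can change from epoch to epoch, which is the continual part of the strategy.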
This approach is robust to wrong predictions on unlabeled images (i.e. pseudo labels that differ from the true labels) because the area of local replacement is controlled. We set a threshold on the maximum area of the original image that is allowed to be locally replaced. With a small threshold, a wrong prediction does not hurt performance because the replaced region can be interpreted as a partial occlusion or artifact [Chen et al.2019b, Zhong et al.2020]. However, small replaced regions introduce little semantic information, so we prefer a larger threshold that lets the model learn new semantic content. Large replaced regions do carry a risk of learning wrong content, and the fine-tune procedure may be trapped in poor local minima. We alleviate this issue by randomly determining the position and the size of the replaced regions. Note that the fine-tune procedure can still converge quickly, since the augmented support set always contains content of the original support set. Empirically, this simple strategy works well and roughly achieves the best trade-off across all experiment benchmarks.
The experiments are conducted on four widely used few-shot learning benchmarks: MiniImageNet [Ravi and Larochelle2017], TieredImageNet [Ren et al.2018], Caltech-256 [Griffin et al.2007] and CUB-200 [Wah et al.2011]. These datasets range from general objects (e.g. MiniImageNet) to fine grained bird species (e.g. CUB-200), and thus provide a comprehensive evaluation of our method. We implement our approach based on a recent testbed [Chen et al.2019a] and follow its basic experimental settings, hyper-parameters and dataset split. We use a ResNet-18 backbone with a single fully connected classifier (FC layer) on all datasets. The ResNet-18 backbone is only trained on the training set and its augmented version. Note that the two are essentially the same dataset, and thus the total capacity of training information is not changed. During the episodic testing stage, the weights of the backbone are fixed and a new FC layer is fine-tuned. We adopt the image block augmentation [Chen et al.2019b] as the local replacement strategy. Additional unlabeled images are randomly sampled, and at most 4 and 6 random blocks are allowed to be replaced during the training and fine tuning stages, respectively.
We follow the episodic testing protocol [Vinyals et al.2016] for the standard few-shot performance evaluation.
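Results under this protocol are typically summarized as a mean accuracy over episodes together with an interval. The sketch below assumes a 95% normal-approximation confidence interval, which is common in this literature but is an assumption here; `summarize_episodes` and the toy accuracies are illustrative.

```python
import math
import statistics

def summarize_episodes(accuracies, z=1.96):
    """Return (mean, half-width) over per-episode accuracies.
    z = 1.96 gives a 95% normal-approximation interval (assumed)."""
    mean = statistics.fmean(accuracies)
    half_width = z * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# Toy accuracies from five hypothetical episodes.
episode_acc = [0.60, 0.64, 0.58, 0.62, 0.66]
mean_acc, ci_half_width = summarize_episodes(episode_acc)
```

With hundreds of episodes, the half-width shrinks with the square root of the episode count, which is why many episodes are sampled.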
Table 1: 5-way few-shot classification accuracy (%) on four benchmarks. "/" marks results that are not reported.

Methods | Arch. | MiniImageNet 1-shot | MiniImageNet 5-shot | CUB-200 1-shot | CUB-200 5-shot | Caltech-256 1-shot | Caltech-256 5-shot | TieredImageNet 1-shot | TieredImageNet 5-shot
---|---|---|---|---|---|---|---|---|---
S.S. ProtoNet [Ren et al.2018] | CONV4 | 50.41 ± 0.31 | 64.39 ± 0.24 | / | / | / | / | 52.39 ± 0.44 | 69.88 ± 0.20
TPN [Liu et al.2019] | CONV4 | 54.72 ± 0.84 | 69.25 ± 0.67 | / | / | / | / | 59.91 ± 0.94 | 73.30 ± 0.75
MAML [Finn et al.2017] | ResNet-18 | 49.61 ± 0.92 | 65.72 ± 0.77 | 69.96 ± 1.01 | 82.70 ± 0.65 | 57.33 ± 1.00 | 75.77 ± 0.70 | / | /
MatchNet [Vinyals et al.2016] | ResNet-18 | 52.91 ± 0.88 | 68.88 ± 0.69 | 72.36 ± 0.90 | 83.64 ± 0.60 | 62.24 ± 0.89 | 77.92 ± 0.66 | / | /
ProtoNet [Snell et al.2017] | ResNet-18 | 54.16 ± 0.82 | 73.68 ± 0.65 | 71.88 ± 0.91 | 87.42 ± 0.48 | 60.17 ± 0.90 | 80.56 ± 0.64 | / | /
RelationNet [Sung et al.2018] | ResNet-18 | 52.48 ± 0.86 | 69.83 ± 0.68 | 67.59 ± 1.02 | 82.75 ± 0.58 | 55.72 ± 0.90 | 77.42 ± 0.68 | / | /
Baseline [Chen et al.2019a] | ResNet-18 | 51.75 ± 0.80 | 74.27 ± 0.63 | 65.51 ± 0.87 | 82.85 ± 0.55 | 57.72 ± 0.88 | 79.06 ± 0.67 | / | /
Baseline++ [Chen et al.2019a] | ResNet-18 | 51.87 ± 0.77 | 75.68 ± 0.63 | 67.02 ± 0.90 | 83.58 ± 0.54 | 56.72 ± 0.90 | 77.24 ± 0.67 | / | /
BlockAug [Chen et al.2019b] | ResNet-18 | 58.80 ± 1.36 | 76.71 ± 0.72 | / | / | / | / | / | /
IDeMe-Net [Chen et al.2019c] | ResNet-18 | 59.14 ± 0.86 | 74.63 ± 0.74 | / | / | / | / | / | /
DEML [Zhou et al.2018] | ResNet-50 | 58.49 ± 0.91 | 71.28 ± 0.69 | 66.95 ± 1.06 | 77.11 ± 0.78 | 62.25 ± 1.00 | 79.52 ± 0.63 | / | /
DualTriNet [Chen et al.2019d] | ResNet-18 | 58.12 ± 1.37 | 76.92 ± 0.69 | 69.61 ± 0.46 | 84.10 ± 0.35 | 63.77 ± 0.62 | 80.53 ± 0.46 | / | /
LEO [Rusu et al.2019] | WRN28-10 | 61.76 ± 0.08 | 77.59 ± 0.12 | / | / | / | / | 66.33 ± 0.05 | 81.44 ± 0.09
MetaOptNet [Lee et al.2019] | ResNet-12 | 61.41 ± 0.61 | 77.88 ± 0.46 | / | / | / | / | 65.36 ± 0.71 | 81.34 ± 0.52
Vanilla (ours) | ResNet-18 | 56.44 ± 0.81 | 78.18 ± 0.60 | 65.91 ± 0.88 | 83.45 ± 0.51 | 59.32 ± 0.88 | 79.95 ± 0.67 | 63.93 ± 0.89 | 84.66 ± 0.62
CLR+Imprinting (ours) | ResNet-18 | 60.59 ± 0.86 | 78.26 ± 0.61 | 73.11 ± 0.92 | 87.46 ± 0.48 | 62.88 ± 0.97 | 81.16 ± 0.64 | 70.13 ± 0.98 | 85.04 ± 0.64
CLR (ours) | ResNet-18 | 61.49 ± 0.83 | 81.00 ± 0.60 | 68.00 ± 0.92 | 84.26 ± 0.53 | 64.04 ± 0.92 | 83.05 ± 0.60 | 70.26 ± 1.00 | 86.78 ± 0.66
Baselines and competitors. There are a lot of existing few-shot learning methods. We compare our approach with the most relevant ones. The competitors include popular baselines like MAML [Finn et al.2017], semi-supervised approaches like S.S. ProtoNet [Ren et al.2018], augmentation-based methods like BlockAug [Chen et al.2019b] and state-of-the-art approaches like LEO [Rusu et al.2019]. Since different backbone architectures could significantly affect the results, we report the used architecture for each method along with the 5-way 1-shot and 5-way 5-shot accuracy.
In addition to our Continual Local Replacement (CLR) approach, we also report the results of two variants: Vanilla and CLR+Imprinting. The vanilla version is a simple transfer learning approach. The model is trained on the training set and its augmented version, but it is fine tuned on the original support set without any local replacement, so it can be seen as a baseline of our approach. Our algorithm can also be easily combined with existing methods like weight imprinting. The imprinting version is inspired by a recent weight initialization method [Qi et al.2018]. We introduce the imprinting technique into our framework since it learns a similarity metric and can be seen as a complement to the linear classifier. In particular, the imprinting method normalizes the embedded features and the weights of the FC layer during training. When the features and weights are normalized, the learning objective is equivalent to maximizing cosine similarity [Qi et al.2018]. In the fine tuning stage, it takes the average of the features on the support set as the initial weights of the new FC layer.

Results. Comparison results on all datasets are shown in Table 1. We can observe that: (1) Our proposed approach can achieve the best performance on all datasets. Our method outperforms the two most related methods, BlockAug [Chen et al.2019b] and IDeMe-Net [Chen et al.2019c], with a clear improvement (3%-7%). This validates the effectiveness of our continual local replacement strategy. (2) Our simple vanilla variant also shows competitive performance. For example, the vanilla method has comparable or better performance than existing methods on MiniImageNet, TieredImageNet and Caltech-256 in 5-way 5-shot testing. This indicates that a deeper backbone can learn better feature representations in transfer learning, and that the feature deterioration issue can be alleviated by applying a sophisticated architecture. A similar conclusion is also observed in several recent works [Kornblith et al.2019, Chen et al.2019a]. With continual local replacement (i.e. CLR or CLR+Imprinting), the performance is further improved, which means that our CLR technique can effectively improve the baseline. (3) The imprinting variant is more effective on fine grained recognition tasks than on general object recognition tasks. This variant demonstrates strong performance on the CUB-200 dataset but is slightly worse than CLR on the other three general object recognition datasets. Interestingly, some previous metric-based learning methods like ProtoNet also show strong results on CUB-200. Since Imprinting and ProtoNet explicitly maximize the cosine and Euclidean similarity respectively, these metric-based learning strategies may be well suited to fine grained classification tasks because they reduce the intra-class variation [Qi et al.2018, Chen et al.2019a].
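The weight imprinting initialization used by the CLR+Imprinting variant can be sketched as follows. It is a minimal illustration on plain Python lists: features are assumed to be already extracted by the backbone, the class mean is renormalized as in [Qi et al.2018], and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def imprint_weights(support_features):
    """Average the normalized support features of each class and renormalize;
    the result initializes one FC weight vector per class."""
    grouped = defaultdict(list)
    for feature, label in support_features:
        grouped[label].append(l2_normalize(feature))
    weights = {}
    for label, features in grouped.items():
        mean = [sum(column) / len(features) for column in zip(*features)]
        weights[label] = l2_normalize(mean)
    return weights

def cosine_predict(weights, feature):
    """With unit-norm weights and features, the FC logit is a cosine similarity."""
    query = l2_normalize(feature)
    return max(weights,
               key=lambda c: sum(w * q for w, q in zip(weights[c], query)))

toy_features = [([2.0, 0.0], "a"), ([0.0, 3.0], "b"), ([1.0, 0.2], "a")]
imprinted = imprint_weights(toy_features)
pred_a = cosine_predict(imprinted, [5.0, 1.0])
pred_b = cosine_predict(imprinted, [0.5, 4.0])
```

Starting the new FC layer from class means gives a reasonable classifier before any fine tuning step is taken.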
We conduct ablation studies to help understand our approach better.
The number of replaced blocks. We adopt the image block augmentation as the local replacement method, so the replaced area can be controlled by the number of replaced blocks. We evaluate our approach with different numbers of replaced blocks in both the training and fine tuning stages. As shown in Tab. 2, the best results are achieved when the maximum numbers of replaced blocks in training and fine tuning are roughly 3 and 6, respectively. This indicates that replacing some blocks to build the augmented training set is necessary to obtain the best results in the fine tuning stage.
Table 2: Accuracy (%) for different maximum numbers of replaced blocks in training (rows) and fine tuning (columns).

Training \ Fine tuning | 0 | 3 | 6 | 9
---|---|---|---|---
0 | 76.96 ± 0.60 | 78.11 ± 0.61 | 78.68 ± 0.61 | 79.16 ± 0.66
3 | 77.16 ± 0.64 | 79.23 ± 0.60 | 80.83 ± 0.60 | 80.14 ± 0.62
6 | 76.44 ± 0.61 | 79.26 ± 0.61 | 80.07 ± 0.62 | 79.23 ± 0.64
9 | 76.41 ± 0.66 | 78.95 ± 0.56 | 79.15 ± 0.60 | 78.87 ± 0.68
The effect of different components. To verify the effectiveness of our CLR strategy, several variants are evaluated and compared: Vanilla, One Time Local Replacement (OTLR), CLR without Pseudo Labeling (CLR w/o PL), CLR with Ground Truth Label (CLR w GT), and Continual All Replacement (CAR). OTLR selects unlabeled images to replace some blocks in labeled images only once. It can be seen as an augmentation method like BlockAug [Chen et al.2019b] without continual replacement. CLR w/o PL randomly selects unlabeled images and applies local replacement continually, whereas CLR w GT uses ground truth labels to select unlabeled images and can be seen as an oracle, i.e. the upper bound of our method. CAR uses the whole unlabeled image to tune the classifier instead of replacing local regions. This strategy can be seen as the original pseudo labeling method.
The evaluation results on MiniImageNet are illustrated in Fig. 5. CLR achieves the best results among the variants without access to ground truth labels, which indicates that all components of our approach contribute to the best performance. Besides, CLR w/o PL performs comparably with Vanilla: it is better than Vanilla in 1-shot but slightly worse in 5-shot testing. This indicates that continual local replacement is robust even when wrong unlabeled images are selected for replacement.
We conduct experiments to quantitatively and qualitatively explain why CLR works.
CLR introduces plentiful semantic information. In every fine-tune epoch, CLR makes the features of locally replaced images vary in the embedding space. This provides a larger semantic space for classifier tuning, and hence a better classification decision boundary can be learned. As illustrated in Fig. 7, four fine-tune epochs are shown with t-SNE visualization [Maaten and Hinton2008]. The stars denote the features of the original support images and the triangles represent those of the locally replaced images. We can see that the positions of the triangles move dramatically inside the big red and green circles. Similar phenomena can also be observed in the other three semantic clusters. In addition, the saliency maps are shown in Fig. 8. The local replacement can be interpreted either as new semantic information (e.g. top-left figure) or as partial occlusion (e.g. top-right figure).
Robustness of CLR. Fig. 6 shows the typical accuracy change during the whole fine-tune procedure. Since the augmented support set contains stable semantic information from the original support set, the learning converges quickly. Moreover, we also evaluate CLR on high-way few-shot testing. The results are shown in Tab. 3. Our method achieves the best results. Note that the performance of CLR improves as more blocks are replaced, up to a maximum of 6 blocks, where it again reaches the best trade-off. This indicates that (1) our approach still works well even though the overall accuracy is relatively low in the high-way setting, and (2) since a maximum of 6 replaced blocks gives the best results here as well, the number of replaced blocks is a hyper-parameter that is insensitive to different datasets and experimental settings.
Table 3: 20-way few-shot classification accuracy (%) on MiniImageNet.

Methods | Arch. | 20-way 1-shot | 20-way 5-shot
---|---|---|---
MatchNet [Vinyals et al.2016] | ResNet-18 | 25.30 ± 0.29 | 36.78 ± 0.25
ProtoNet [Snell et al.2017] | ResNet-18 | 26.50 ± 0.30 | 44.96 ± 0.26
RelationNet [Sung et al.2018] | ResNet-18 | 23.75 ± 0.30 | 39.17 ± 0.25
Baseline [Chen et al.2019a] | ResNet-18 | 24.75 ± 0.28 | 42.03 ± 0.25
Baseline++ [Chen et al.2019a] | ResNet-18 | 25.58 ± 0.27 | 50.85 ± 0.25
CLR replace 0 block | ResNet-18 | 30.62 ± 0.30 | 52.18 ± 0.26
CLR replace 3 blocks | ResNet-18 | 32.66 ± 0.29 | 53.62 ± 0.25
CLR replace 6 blocks (ours) | ResNet-18 | 34.15 ± 0.33 | 55.20 ± 0.26
CLR replace 9 blocks | ResNet-18 | 32.48 ± 0.36 | 53.62 ± 0.28
In this paper, we propose a continual local replacement method for few-shot image recognition. It leverages the content of unlabeled images to synthesize new images for training. To introduce more useful semantic information, a pseudo labeling strategy is applied during the fine-tune stage. It continually selects semantically similar images to locally replace labeled ones. Our strategy is simple yet effective on few-shot image recognition. Extensive experiments show that it can significantly enlarge the capacity of semantic information and achieve new state-of-the-art results.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4080–4088, 2018.
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3379–3386, 2019.
Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.