Deep neural networks (DNNs) have demonstrated remarkable success in important tasks, such as image recognition (He_2016_CVPR) and text processing (devlin-etal-2019-bert). Their stellar performance can be attributed to the following three pillars: a) well-curated datasets, b) tailored network architectures devised by experienced practitioners, c) specialized hardware, i.e., GPUs and TPUs. The adoption of DNNs by practitioners in different fields hinges on a critical question:
Are those pillars still holding ground in real-world tasks?
The first obstacle is that well-curated datasets (e.g., with a uniform number of samples per class) might be hard or even impossible to obtain in certain fields, e.g., medical imaging. Similarly, when downloading images from the web, images of dogs or cats are far more abundant than images of the 'Vaquita' fish. The second pillar is also fragile: devising tailored architectures requires considerable expertise. Neural Architecture Search (NAS) (zoph2016neural) was developed to enable researchers to build architectures that can then generalize to similar tasks. However, obtaining the annotations required for NAS methods is both a laborious and costly process. To that end, self-supervised learning (SSL) has been proposed for extracting representations without labels (bengio2013representation). One major drawback of both NAS and SSL is that they require substantial computational resources, making their adoption harder.
In this work, we propose a novel framework that combines NAS with self-supervised learning and the handling of imbalanced datasets. Our method relies on recent progress in self-supervised learning (zbontar2021barlow) and self-supervised NAS (kaplan2020self). Specifically, the proposed method designs a network architecture using only unlabelled data, which are also (naturally) imbalanced, e.g., data automatically downloaded from the web. We pay particular attention to the resources required, i.e., every component of the proposed framework is designed to run on a single GPU, e.g., on Google Colab. We evaluate our method using both a long-tailed distribution and naturally imbalanced datasets for medical imaging. In both settings, we observe that the proposed method results in accuracy that is similar to or better than well-established handcrafted DNNs, while using only a fraction of their parameters.
2.1 Neural Architecture Search
Neural Architecture Search (NAS) can be roughly separated into three components (elsken2019neural): search space, search strategy, and performance estimation strategy. The first component defines the set of architectures to be explored; the second one determines how to explore the search space; the third one designs a way to estimate the performance at each step. The first approaches to NAS used evolutionary algorithms (real2017large; real2019regularized) and reinforcement learning (zoph2016neural), and outperformed handcrafted neural architectures. One major drawback was the immense computational resources required for running NAS. The first papers that focus on reducing the computational cost construct a supernet that covers all possible operations and train exponentially many sub-networks simultaneously (pham2018efficient; liu2018darts). This approach indeed reduces the computational cost and has been improved by recent works. Specifically, (cai2018proxylessnas) samples architecture paths during the search phase such that only one path is trained at each step. This allows training an architecture of the same depth as the final model, which eliminates the depth gap in performance. Similarly, (Xu2020PC-DARTS) samples a small subset of channels and replaces the rest with skip-connections. Both methods require less memory and reduce the search time by using larger batches. However, it has been shown that such approaches are unfair in the choice of operations, which leads to a deterioration of sub-network performance (chu2019fairnas; fairdarts). Another approach (Liu_2020_CVPR) is based on growing and trimming candidate architectures, combined with a memory-efficient loss. In our work, we follow the first approach: we avoid the depth gap by using final networks of the same depth, and we induce fair competition between operations by leveraging (fairdarts).
The seminal work of DARTS (liu2018darts) constructs a differentiable search space using a cell-based approach. The final architecture is constructed by stacking cells. Each cell is a directed acyclic graph (DAG) with $N$ nodes. Each edge $(i, j)$ is associated with a candidate operation $o^{(i,j)} \in \mathcal{O}$ with input $x^{(i)}$ and output $o^{(i,j)}(x^{(i)})$, where $\mathcal{O}$ is the set of candidate operations. This neural network is referred to as a supernet or a parent network. Such a search space would be discrete, hence a softmax relaxation over the candidate operations is used:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x),$$
where $\alpha = \big\{\alpha_o^{(i,j)}\big\}$ are the operation mixing weights. Then, NAS is reduced to learning these weights. The final discrete architecture is obtained by $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$. The following bi-level optimization problem describes the objective:
$$\min_{\alpha}\; \mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha),$$
where $w$ denotes the ordinary neural network weights, and $\mathcal{L}_{val}$, $\mathcal{L}_{train}$ are loss functions computed on batches from the validation and training sets, respectively. It is hard to solve this task directly. Thus, we approximate $w^{*}(\alpha)$ using only a single training step:
$$w^{*}(\alpha) \approx w - \xi \nabla_{w} \mathcal{L}_{train}(w, \alpha),$$
where $\xi$ is the learning rate of the virtual step.
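The relaxation and the single-step approximation above can be sketched in a few lines of NumPy. This is a toy illustration only: the three "operations" on the edge are hypothetical stand-ins for the real convolution/pooling/skip candidates, and the quadratic training loss and learning rate $\xi = 0.1$ are invented for the example.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a vector of architecture weights."""
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operations on one edge (stand-ins for conv / pool / skip).
ops = [lambda x: 2.0 * x, lambda x: 0.5 * x, lambda x: x]

def mixed_op(x, alpha):
    """DARTS mixed output: softmax-weighted sum of all candidate operations."""
    w = softmax(alpha)
    return sum(w_o * op(x) for w_o, op in zip(w, ops))

alpha = np.array([2.0, -1.0, 0.0])   # architecture weights for this edge
y = mixed_op(1.0, alpha)             # continuous (relaxed) edge output
best = int(np.argmax(alpha))         # discretization: keep the strongest op

# Single-step approximation of the bi-level problem on a toy quadratic:
# L_train(w, alpha) = (w - alpha)^2, so dL_train/dw = 2 (w - alpha).
xi, w, a = 0.1, 0.0, 0.5
w_approx = w - xi * 2.0 * (w - a)    # one virtual gradient step towards w*(alpha)
```

In the real algorithm, $\mathcal{L}_{val}$ is then evaluated at `w_approx` and differentiated with respect to the architecture weights to update them.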
DARTS has the significant overhead of maintaining the supernet during training. To mitigate that, a shallow network is assumed during the search phase, which is then duplicated to obtain the full network for evaluation. However, the weights obtained by a shallow neural network are not appropriate for deep models (9010638). Specifically, skip connections are frequently selected as the candidate operation. An additional drawback is the lack of architecture weights that significantly outperform the others. The recent work of FairDARTS (fairdarts) mitigates those issues with the following two modifications: (a) it replaces the softmax operation with a sigmoid function $\sigma$ to avoid the competition with skip connections as a candidate operation; (b) it encourages sparsity in the architecture weights by using the following zero-one loss: $\mathcal{L}_{0\text{-}1} = -\frac{1}{N}\sum_{i=1}^{N} \big(\sigma(\alpha_i) - 0.5\big)^2$, where $\alpha_i$ are the architecture weights and $N$ is their number. The final loss for the architecture weights is $\mathcal{L} = \mathcal{L}_{val} + w_{0\text{-}1}\, \mathcal{L}_{0\text{-}1}$, where $w_{0\text{-}1}$ controls the strength of the zero-one loss.
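A minimal sketch of the zero-one loss, assuming a flat vector of architecture weights (the weight values below are illustrative only, not from our experiments):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.asarray(a, dtype=float)))

def zero_one_loss(alpha):
    """FairDARTS auxiliary loss: pushes sigmoid(alpha) towards 0 or 1.

    The loss is the negated mean squared distance of sigmoid(alpha)
    from 0.5, so minimizing it rewards confident (near 0/1) weights.
    """
    s = sigmoid(alpha)
    return -np.mean((s - 0.5) ** 2)

# Undecided weights (sigmoid ~ 0.5) give the worst possible value, 0 ...
undecided = zero_one_loss([0.0, 0.0])
# ... while confident weights (sigmoid near 0 or 1) drive the loss down.
confident = zero_one_loss([-6.0, 6.0])
```

Minimizing this term alongside the validation loss sparsifies the architecture, making the final $\arg\max$ discretization less ambiguous.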
2.2 Barlow Twins
Supervised learning has demonstrated success in a number of domains, but it requires a massive amount of annotations, while it ignores the enormous amount of unlabelled data that can provide complementary information. The effort to utilize unsupervised learning is a decades-old endeavor (radford2015unsupervised; Doersch_2015_ICCV; bengio2013representation), with self-supervised learning emerging as a popular method of learning. The idea is to devise a task for which the 'target label' is known, and use losses developed for supervised learning. For instance, predicting the next word in a sentence enables utilizing the virtually unlimited text on the internet for unsupervised training; this is precisely the method used in the recent successful GPT (radford2019language) and BERT models (devlin-etal-2019-bert). Similarly, in visual computing a host of tasks has been used for self-supervised learning (noroozi2016unsupervised; gidaris2018unsupervised; chen2020simple).
By analogy to other successful self-supervised methods (chen2020simple; chen2020exploring; caron2020unsupervised), Barlow Twins (zbontar2021barlow) creates a pair of images for every original image. The pair is created by applying two randomly sampled transformations (e.g., random crop, horizontal flip, color distortion, etc.). Such pairs are created for every sample in the mini-batch. The model extracts the representations $Z^{A}$ and $Z^{B}$ of the two corresponding distorted versions of the original mini-batch. The idea is then to make the cross-correlation between $Z^{A}$ and $Z^{B}$ close to the identity. Specifically, the objective function is
$$\mathcal{L}_{BT} = \sum_{i} \big(1 - \mathcal{C}_{ii}\big)^2 + \lambda \sum_{i} \sum_{j \neq i} \mathcal{C}_{ij}^2,$$
where $\lambda$ is a positive coefficient and $\mathcal{C}$ is the cross-correlation matrix, of size equal to the output dimension, computed between the two outputs:
$$\mathcal{C}_{ij} = \frac{\sum_{b} z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_{b} \big(z^{A}_{b,i}\big)^2}\, \sqrt{\sum_{b} \big(z^{B}_{b,j}\big)^2}},$$
where $b$ indexes the batch samples and $i$, $j$ index the vector dimension of the outputs. In other words, the model is encouraged to produce similar embeddings for the two distorted versions of each image, while the decorrelation of the off-diagonal terms removes redundant information about the samples in the output units. Unlike other recent self-supervised methods, Barlow Twins does not require large batches, which is important when resources are constrained (e.g., a single GPU).
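The loss above can be sketched in a few lines of NumPy. This is a simplified stand-in for the official implementation: the per-dimension batch normalization of the embeddings and the coefficient $\lambda$ follow the description above, and the random embeddings are synthetic.

```python
import numpy as np

def barlow_twins_loss(za, zb, lam=5e-3):
    """Barlow Twins loss: push the cross-correlation matrix of two
    batch-normalized embeddings towards the identity
    (invariance on the diagonal, redundancy reduction off it)."""
    # Normalize each output dimension over the batch.
    za = (za - za.mean(0)) / za.std(0)
    zb = (zb - zb.mean(0)) / zb.std(0)
    n = za.shape[0]
    c = za.T @ zb / n                                  # cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()          # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
# Identical views -> the correlation matrix has a unit diagonal -> loss ~ 0.
perfect = barlow_twins_loss(z, z.copy())
```

Independent embeddings, by contrast, have near-zero diagonal correlations and incur a large invariance penalty.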
We propose a new NAS-based approach for real-world datasets, which might lack labels or be imbalanced. The approach consists of two steps: architecture search and subsequent fine-tuning.
Our method is built on top of FairDARTS, which eliminates the shortcomings of DARTS mentioned in Section 2.1. We also replace the supervised loss with a self-supervised one, namely the recent Barlow Twins loss, which does not require labels; as (NEURIPS2020_e025b627) shows, self-supervision is also beneficial for learning imbalanced datasets. Additionally, we use a supernet with only 3 cells for all steps. This is beneficial for three reasons. Firstly, the training process is efficient and affordable even on slow GPUs, while producing small but powerful architectures. Secondly, the designed architecture is appropriate for the final model, since its depth is unchanged. Thirdly, the weights learned in the first step are fully utilized in the second step (unlike other NAS methods, where the weight values are typically discarded). To fine-tune the designed model, we add another layer on top, which projects the output matrix onto the output classes, and train it with the focal loss (see Appendix A) in a supervised manner. We do not freeze the weights of the rest of the network. Furthermore, we apply the logit adjustment (see Appendix A) to improve the learning of rare classes. An ablation study on the imbalance-handling techniques is in Appendix B.
Our work shares some similarities with the recent Self-Supervised Neural Architecture Search (SSNAS) (kaplan2020self), which we describe next. SSNAS consists of three steps. In the first step, DARTS is used to determine a cell architecture with a shallow neural network. SSNAS assumes unlabelled data and uses SimCLR (chen2020simple) for determining the architecture. In the next step, SSNAS stacks the constructed cells from the previous step to obtain the architecture, which is then trained with the same self-supervised loss. In the last step, the architecture is fine-tuned using an annotated dataset. Despite the similarities, our method differs from SSNAS in four critical ways: (a) using DARTS has several drawbacks, as aforementioned; in addition, DARTS is not robust to initialization and requires several runs, (b) SimCLR must be used with large batches, while Barlow Twins exhibits better performance and can be executed with a smaller batch size, (c) we use a smaller supernet, which improves training time and produces more efficient architectures, (d) we skip the self-supervised pretraining step without incurring a loss in performance (this step adds overhead to the training time). Lastly, our method is specifically developed for imbalanced datasets by leveraging the logit adjustment and the focal loss.
Below, we conduct an empirical validation of the proposed method on both a long-tailed distribution and a naturally imbalanced medical dataset. Furthermore, we validate whether the learned architecture can be used for transfer learning in the crucial domain of medical imaging, where we utilize a recent COVID-19 X-ray dataset. Our empirical evidence confirms that the proposed method can achieve performance similar to well-established networks using a fraction of their parameters.
We apply a first-order version of FairDARTS to accelerate the search. The supernet has 3 cells with 4 nodes each. The search space is the same as in (liu2018darts; kaplan2020self). We use SGD with momentum and weight decay, together with a cosine annealing learning rate scheduler (loshchilov2016sgdr). The experiments are performed on an NVIDIA Tesla K40c. In the fine-tuning step, the focal loss is applied. We use light data augmentation: random image cropping and horizontal flipping. The training runs for a fixed number of epochs or until convergence. The remaining parameters are the same as in the search step.
4.1 Evaluation on a long-tailed distribution
We evaluate the proposed method on a long-tailed version of the CIFAR-10 dataset (krizhevsky2009learning). To this end, we reduce the number of samples in each class according to an exponential function $n'_c = n_c\, \mu^{c}$, where $n_c$ and $n'_c$ are the numbers of samples in class $c$ before and after the transformation, respectively, $\mu \in (0, 1)$, and $c \in C$, where $C$ is the set of classes. The test set stays unchanged. We define the imbalance factor of a dataset as the number of training samples in the largest class divided by that of the smallest one. The imbalanced dataset is used for both the architecture search and the subsequent fine-tuning. No label is used for the architecture search.
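The subsampling above can be sketched as follows. Note that deriving the decay rate $\mu$ from a target imbalance factor is an assumption made for this illustration (consistent with common long-tailed CIFAR protocols), not necessarily the paper's exact recipe; the counts use CIFAR-10's 5,000 training images per class.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts decaying exponentially with the class index.

    mu is chosen so that the ratio largest/smallest equals the
    requested imbalance factor: n_max * mu^(C-1) = n_max / factor.
    """
    mu = imbalance_factor ** (-1.0 / (num_classes - 1))
    return [int(round(n_max * mu ** c)) for c in range(num_classes)]

# CIFAR-10 with imbalance factor 100: 5000 samples in the head class,
# exponentially fewer towards the tail.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=100)
```

The dataset is then built by keeping the first `counts[c]` training samples of each class `c`.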
In Table 1, we summarize the results of our experiments and compare them against previous representative works on long-tailed distributions: ResNet-32 + Focal loss (lin2017focal), ResNet-32 + Sigmoid (SGM) and Balanced Sigmoid (BSGM) Cross Entropy losses (cui2019class), LDAM-DRW (cao2019learning), smDragon and VE2 + smDragon (samuel2021generalized), and SSNAS (kaplan2020self). To provide a fair comparison with SSNAS, we implement it in a common framework with common hyper-parameters. Notably, the NAS-based methods require orders of magnitude fewer parameters. However, SSNAS pays for the reduced parameter count with an increased error. Our method significantly improves over the other methods while using far fewer parameters.
| Method | # Params | Error (%) |
| --- | --- | --- |
| ResNet-32 + Focal | 21.80 | 13.34 |
| ResNet-32 + SGM | 21.80 | 12.97 |
| ResNet-32 + BSGM | 21.80 | 12.51 |
| VE2 + smDragon | 21.80+ | 11.84 |
4.2 Evaluation on naturally imbalanced datasets
We assess the performance of the method on ChestMNIST (chest), which is a naturally imbalanced dataset. The dataset contains 78,468 images of chest X-ray scans with 14 non-exclusive pathologies. The results are presented in Table 2. The accuracies of the models used for comparison (ResNet (He_2016_CVPR), auto-sklearn (feurer2019auto), AutoKeras (jin2019auto), and Google Auto ML) are reported from (medmnist). The proposed method surpasses the best previously reported result while keeping a tiny number of parameters. The smaller number of parameters for SSNAS is due to the large number of skip-connections it selects.
| Method | # Params | Accuracy (%) |
| --- | --- | --- |
| Google Auto ML (28) | - | 94.7 |
| Our method (28) | 0.82 | 94.8 |
As typically done in NAS, we evaluate the optimized architecture on transfer learning using the COVID-19 X-ray dataset (OZTURK2020103792). This dataset consists of chest X-ray images and is naturally imbalanced. It exhibits several difficulties that arise in real-world settings: (a) there is a high imbalance factor, (b) the images differ slightly since they are collected from different sources, (c) some images are noisy, since they carry overlaid labels that are unrelated to the task at hand (see Appendix C). Unfortunately, because such datasets are recent, the only model we can compare against is DarkCovidNet (OZTURK2020103792). Table 3 shows that our architecture transfers successfully to this task, achieving slightly better results than DarkCovidNet but with a smaller resolution and fewer parameters.
| Method | # Params | Accuracy (%) |
| --- | --- | --- |
| Our method (224) | 0.82 | 98.40 |
| Our method (28) | 0.82 | 98.40 |
In this paper, we propose a NAS framework that is well-suited for real-world tasks, where the data are naturally imbalanced and lack label annotations. Our framework designs an architecture from the provided unlabelled data using self-supervised learning. To evaluate our method, we conduct experiments on a long-tailed version of CIFAR-10 as well as on ChestMNIST and a COVID-19 X-ray dataset, two medical datasets that are naturally imbalanced. Across all experiments, we show that the proposed approach provides a more compact architecture while maintaining accuracy on par with strong-performing baselines. We expect our method to provide a reasonable framework for practitioners from different fields who want to capitalize on the success of deep neural networks but do not necessarily have well-curated datasets. In addition, our method is suitable for researchers on a constrained budget (e.g., using only the publicly-available Google Colab).
This project was sponsored by the Department of the Navy, Office of Naval Research (ONR) under grant number N62909-17-1-2111, and by the Hasler Foundation Program: Cyber Human Systems (project number 16066). This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 725594). The project was also supported by a 2019 Google Faculty Research Award.
Appendix A Handling imbalanced datasets
In many real-world applications, gathering balanced datasets is difficult or even impossible. For instance, in medical analysis, only a small group of patients has a specific pathology that the majority of the population does not have. In such settings, making accurate predictions on imbalanced datasets is crucial.
In (menon2020long), the authors propose the logit adjustment of the loss function to handle imbalance, which corrects the output of the model before the softmax operation. Specifically, they introduce the logit-adjusted softmax cross-entropy loss:
$$\ell\big(y, f(x)\big) = -\log \frac{e^{f_y(x) + \tau \log \pi_y}}{\sum_{y'=1}^{K} e^{f_{y'}(x) + \tau \log \pi_{y'}}},$$
where $K$ is the number of classes, $f_y(x)$ is the logit of the given class $y$, $\pi_y$ is the empirical frequency of class $y$, and $\tau > 0$ is a scaling parameter. Therefore, we induce a label-dependent prior offset, which requires a larger margin for the rare classes.
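A minimal NumPy sketch of the adjusted loss (the class frequencies and logits below are hypothetical values for illustration, not the authors' implementation):

```python
import numpy as np

def logit_adjusted_ce(logits, label, class_freqs, tau=1.0):
    """Softmax cross-entropy on logits shifted by tau * log(class prior)."""
    adjusted = logits + tau * np.log(class_freqs)
    adjusted = adjusted - adjusted.max()                 # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum())
    return -log_probs[label]

freqs = np.array([0.90, 0.10])    # class 1 is rare
logits = np.array([1.0, 1.0])     # an undecided model

loss_common = logit_adjusted_ce(logits, 0, freqs)
loss_rare = logit_adjusted_ce(logits, 1, freqs)
```

With equal logits, the rare class incurs the larger loss, so the model is pushed to separate rare classes by a bigger margin, exactly the label-dependent offset described above.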
The focal loss (lin2017focal) is frequently used for imbalanced datasets (sambasivam2021predictive; dong2021recognition). The idea behind the focal loss is to give a lower weight to easily classified samples. In the binary case, we introduce
$$p_t = \begin{cases} p & \text{if } y = 1,\\ 1 - p & \text{otherwise,} \end{cases}$$
where $y \in \{0, 1\}$ is the label and $p$ is the model's estimated probability of the class $y = 1$. The cross-entropy loss is then $\mathrm{CE}(p_t) = -\log(p_t)$. To tackle the imbalance problem, the focal loss adds a modulating factor to the weighted cross-entropy:
$$\mathrm{FL}(p_t) = -\alpha_t \big(1 - p_t\big)^{\gamma} \log(p_t),$$
where $\alpha_t$ are loss weights, which can be inverse class frequencies, and $\gamma \geq 0$ is a tunable parameter. If a sample is misclassified and $p_t$ is small, the modulating factor is close to 1 and the loss is unaffected. However, when $p_t \to 1$, the factor $(1 - p_t)^{\gamma} \to 0$, which gives less weight to this well-classified sample.
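A sketch of the binary focal loss with the commonly used defaults $\alpha = 0.25$ and $\gamma = 2$ (illustrative values, not necessarily those tuned in our experiments):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights well-classified samples.

    p is the model's estimated probability of class y = 1;
    alpha_t plays the role of an inverse-frequency class weight.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)   # confident and correct -> near-zero loss
hard = focal_loss(0.10, 1)   # confident and wrong   -> large loss
```

The `(1 - p_t) ** gamma` factor is what distinguishes this from weighted cross-entropy: easy samples are suppressed quadratically (for $\gamma = 2$), letting the rare, hard samples dominate the gradient.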
Appendix B Ablation study
To show the effectiveness of the components responsible for handling imbalance, we analyse the performance of all combinations of the focal loss and the logit adjustment, as well as their absence. In the latter case, we simply use the cross-entropy loss. The results are summarized in Table B1. The best result is achieved by the combination of the focal loss and the logit adjustment. Removing the latter slightly deteriorates the performance, while removing the focal loss deteriorates it significantly.
| Method | Error (%) |
| --- | --- |
| CE + Logit adj. | 13.05 |
| FL + Logit adj. | 10.91 |
Appendix C COVID-19 X-ray dataset
In Figure C5, we show four representative samples of the COVID-19 X-ray dataset. The images are collected from different sources, and some contain unrelated content overlaid on the image. Likewise, the light intensity, the resolution, and the image format may differ from image to image, which makes learning harder.