NICO: A Dataset Towards Non-I.I.D. Image Classification

06/07/2019 · by Yue He, et al. · Tsinghua University

The I.I.D. hypothesis between training data and testing data is the basis of a large number of image classification methods. This property can hardly be guaranteed in practical cases, where Non-IIDness is common, leading to unstable performance of these models. In the literature, however, the Non-I.I.D. image classification problem is largely understudied. A key reason is the lack of a well-designed dataset to support related research. In this paper, we construct and release a Non-I.I.D. image dataset called NICO, which makes use of contexts to create Non-IIDness consciously. Extensive experimental results and analyses demonstrate that the NICO dataset can well support the training of a ConvNet model from scratch, and that NICO can support various Non-I.I.D. situations with more flexibility than other datasets.




1 Introduction

In recent years, machine learning has achieved remarkable progress, mainly owing to the development of deep neural networks [4, 8, 7, 6]. One basic hypothesis of machine learning models is that the training and testing data should consist of samples that are Independent and Identically Distributed (I.I.D.). However, this ideal hypothesis is fragile in real cases, where we can hardly impose constraints on the testing data distribution. This implies that a model minimizing the empirical error on training data does not necessarily perform well on testing data, leading to the challenge of Non-I.I.D. learning. The problem is more serious when the training samples are not sufficient to approximate the training distribution itself. Developing Non-I.I.D. learning methods that are robust to distribution shift is of paramount significance for both academic research and industrial applications.

Benchmark datasets, providing a common ground for competing approaches, are always important to promote the development of a research direction. Take image classification, a prominent learning task, as an example. Its development has benefited greatly from benchmark datasets such as PASCAL VOC [3], MSCOCO [5], and ImageNet [2]. In particular, it is ImageNet, a large-scale and well-structured image dataset, that successfully demonstrated the capability of deep learning and thereafter significantly accelerated the advancement of deep convolutional neural networks. On these datasets, it is easy to establish an I.I.D. image classification setting by random data splitting, but they do not provide an explicit option to simulate a Non-I.I.D. setting. A dataset that can well support research on Non-I.I.D. image classification is still missing.

In this paper, we construct and release a dataset that is dedicatedly designed for Non-I.I.D. image classification, named NICO (Non-I.I.D. Image dataset with Contexts). The basic idea is to label images with both a main concept and contexts. For example, in the category 'dog', images are divided into different contexts such as 'grass', 'car', and 'beach', meaning the dog is on the grass, in a car, or on the beach respectively. With these contexts, one can easily design a Non-I.I.D. setting by training a model on some contexts and testing it on other, unseen contexts. Meanwhile, the degree of distribution shift can be flexibly controlled by adjusting the proportions of different contexts in training and testing data. To date, NICO contains 19 classes, 188 contexts and nearly 25,000 images in total. The scale is still increasing, and the current scale is already able to support the training of deep convolutional networks from scratch.

Figure 1: NI (represented by bars) and testing error (represented by curves) of each class in Dataset A.
Figure 2: NI of each class in 3 different datasets constructed from ImageNet. Different datasets instantiate the same classes with different subclasses.

The NICO dataset can support, but is not limited to, two typical settings of Non-I.I.D. image classification. One is Targeted Non-I.I.D. image classification, where the testing data distribution is known but different from the training data distribution. The other is General Non-I.I.D. image classification, where the testing data distribution is unknown and different from the training data distribution. Apparently, the latter is much more realistic and challenging: a model learned in one environment may be applied in many other environments, and in this case robustness to unknown distribution shift is a highly favorable characteristic, especially critical in risk-sensitive applications such as medicine and security. However, research in this area is hindered by the lack of a well-structured, reasonably-scaled dataset, which is the motivation for building NICO.

2 Non-I.I.D. Image Classification

2.1 Problem Definition

We first give a formal definition of Non-I.I.D. image classification as follows:

Problem 1

(Non-I.I.D. Image Classification) Given the training data $D_{train} = \{X_{train}, Y_{train}\}$, where $X_{train}$ represents the images and $Y_{train}$ represents the labels, the task is to learn a feature extractor $g$ and a classifier $f$, so that $f(g(X_{test}))$ can predict the labels of testing data $D_{test} = \{X_{test}, Y_{test}\}$ precisely, where $\psi(X_{train}, Y_{train}) \neq \psi(X_{test}, Y_{test})$ and $\psi(\cdot)$ denotes the joint distribution. Moreover, according to the availability of prior knowledge on the testing data, we further define two different tasks. One is Targeted Non-I.I.D. Image Classification, where the testing data distribution is known. The other is General Non-I.I.D. Image Classification, which corresponds to a more realistic scenario where the testing data distribution is unknown.

In order to intuitively quantify the degree of distribution shift between training and testing data, we define the Non-I.I.D. Index as follows:

Definition 1

(Non-I.I.D. Index (NI)) Given a feature extractor $g$ and a class $C$, the degree of distribution shift between training data $D_{train}^{C}$ and testing data $D_{test}^{C}$ is defined as:

$$\mathrm{NI}(C) = \left\lVert \frac{\overline{g(X_{train}^{C})} - \overline{g(X_{test}^{C})}}{\sigma\!\left(g(X^{C})\right)} \right\rVert_{2}$$

where $X^{C} = X_{train}^{C} \cup X_{test}^{C}$, $\overline{\,\cdot\,}$ represents the first-order moment, $\sigma(\cdot)$ is the standard deviation used to normalize the scale of features, and $\lVert \cdot \rVert_{2}$ represents the 2-norm.
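Definition 1 can be computed directly from extracted features. The following NumPy sketch is our own minimal illustration (the function name and the small epsilon guard against zero variance are assumptions, not part of the paper):

```python
import numpy as np

def non_iid_index(feat_train, feat_test):
    """Degree of distribution shift between training and testing features
    for one class, following Definition 1 (NI).

    feat_train, feat_test: (n_samples, n_features) arrays of features
    extracted by g, e.g. activations of the last FC layer of a ConvNet.
    """
    all_feats = np.concatenate([feat_train, feat_test], axis=0)
    # Per-dimension std over the union X^C, used to normalize feature scale.
    sigma = all_feats.std(axis=0) + 1e-12  # epsilon guards against /0
    # Difference of first-order moments (means) of train and test features.
    diff = feat_train.mean(axis=0) - feat_test.mean(axis=0)
    return np.linalg.norm(diff / sigma, ord=2)
```

Identical train/test feature sets give NI = 0, and NI grows as the two feature distributions drift apart.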

2.2 Existence of Non-IIDness

In real cases, the I.I.D. hypothesis can never be strictly satisfied, meaning that Non-IIDness ubiquitously exists in previous datasets [9]. Here we take ImageNet as an example. ImageNet has a hierarchical structure, where each class (e.g. dog) contains multiple subclasses (e.g. different kinds of dogs). For each subclass, it provides training and testing (validation) subsets of images. To verify the Non-IIDness in ImageNet, we select 10 common animal classes (e.g. dog, cat) and construct a new dataset using 10 instantiated subclasses (e.g. Labrador, Persian), each randomly drawn from those classes. Using the training and testing subsets, we train and evaluate a ConvNet on the image classification task. The structure of the ConvNet used in this paper is similar to AlexNet (see Appendices for details), and we take the last FC layer of the ConvNet as the feature extractor $g$. Note that this model structure is used in all subsequent analyses (including on NICO) for fair comparison, and it is selected by trading off performance against the required training data scale; as a base model with sufficient learning capacity, the specific model structure does not affect the conclusions. We repeat this collection procedure 3 times, obtain 3 new datasets (Dataset A, B and C) and calculate the NI and testing error for each class respectively. As an example, we plot the results of Dataset A in Figure 1. We find that: (1) NI is above zero for all classes, which implies that Non-IIDness between training and testing data is ubiquitous even in large-scale datasets like ImageNet. (2) Different classes have different NI values, and a higher NI corresponds to a higher testing error. The strong correlation between NI and testing error is further supported by their high Pearson correlation coefficient and small p-value (2e-15). The showcase and statistical analysis well support the plausible conclusion that the degree of distribution shift quantified by NI is a key factor influencing classification performance.
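The correlation check above can be reproduced on any set of per-class (NI, testing error) pairs. The numbers below are purely illustrative stand-ins, not the paper's measurements:

```python
import numpy as np

# Hypothetical per-class NI values and testing errors, for illustration only.
ni  = np.array([0.8, 1.1, 1.4, 1.9, 2.3])
err = np.array([0.10, 0.14, 0.18, 0.25, 0.31])

# Pearson correlation coefficient between NI and testing error.
r = np.corrcoef(ni, err)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A strongly positive r (close to 1) indicates that classes with larger distribution shift tend to have higher testing error, as reported in the paper.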

2.3 Limitations of Existing Datasets

Throughout the development of computer vision research, benchmark datasets have always played a critical role in both providing a common ground for algorithm evaluation and driving new directions. Specifically, for the image classification task, we can enumerate several milestone datasets such as PASCAL VOC, MSCOCO and ImageNet. However, existing benchmark datasets cannot well support Non-I.I.D. image classification. First of all, despite the manifested Non-IIDness in ImageNet and other datasets, as shown in Figure 1, the overall degree of distribution shift between training and testing data for each class is relatively small, making these datasets less challenging from the angle of Non-I.I.D. image classification. More importantly, there is no explicit way to control the degree of distribution shift between training and testing data in the existing datasets. As illustrated in Figure 2, if we instantiate the same class with different subclasses in ImageNet and obtain 3 datasets with identical structure, the NI of a given class is fairly unstable across different datasets. Without a controllable way to simulate different levels of Non-IIDness, competing approaches cannot be evaluated fairly and systematically on those datasets. That said, a dataset that is dedicatedly designed for Non-I.I.D. image classification is in demand.

3 The NICO Dataset

In this section, we introduce the properties and collection process of the dataset, followed by preliminary empirical results in different Non-I.I.D. settings supported by this dataset.

3.1 Context for Non-I.I.D. Images

The essential idea of generating Non-I.I.D. images is to enrich the labels of an image with both conceptual and contextual labels. Different from previous datasets that only label an image with the major concept (e.g. dog), we also label the concrete context (e.g. on grass) in which the concept appears. Then it is easy to simulate a Non-I.I.D. setting by training and testing the model of a concept with different contexts. A good model for Non-I.I.D. image classification is expected to perform well in both training contexts and testing contexts.
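As a concrete sketch of this idea, a context-aware split assigns each image of a class to the training or testing subset based on its context label. The `(image_id, context)` pair layout and the function name below are our own illustrative assumptions, not part of the NICO release:

```python
def split_by_context(images, train_contexts):
    """Split one class's images into training/testing subsets by context.

    images: list of (image_id, context) pairs -- a hypothetical in-memory
    layout; NICO itself groups images by class and context.
    Images whose context is in train_contexts go to training; all
    remaining contexts form the (unseen-context) testing subset.
    """
    train, test = [], []
    for img, ctx in images:
        (train if ctx in train_contexts else test).append(img)
    return train, test
```

Training on `{"grass"}` while testing on the held-out `{"beach", "car"}` contexts, for instance, yields a split with unseen testing contexts.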

In NICO, we mainly incorporate two kinds of contexts. One is the attributes of a concept (or object), such as color, action, and shape; examples of 'context + concept' pairs include white bear, climbing monkey and double decker. The other kind of context is the background or scene of a concept; examples include cat on snow, horse aside people and airplane in sunrise. Samples of different contexts in the NICO dataset are shown in Figure 3.

Class Size Class Size
Bear 1609 Airplane 930
Bird 1590 Bicycle 1639
Cat 1479 Boat 2156
Cow 1192 Bus 1009
Dog 1624 Car 1026
Elephant 1178 Helicopter 1351
Horse 1258 Motorcycle 1542
Monkey 1117 Train 750
Rat 846 Truck 1000
Sheep 918
Table 1: Data size of each class in NICO.

In order to provide more flexible Non-I.I.D. settings, we tend to select contexts that occur in multiple concepts. Then, for a given concept, a context may occur in both positive samples and negative samples (which are sampled from other concepts). This provides a further degree of freedom: a context included in the training positive samples may or may not appear in the training negative samples, which yields different Non-I.I.D. settings.

Figure 3: Samples with contexts in NICO. Images in the first row are dogs, assigned to the different contexts shown below them. The second and third rows correspond to horses and boats respectively.

3.2 Data Collection and Statistics

Referring to ImageNet, MSCOCO and other classical datasets, we first confirm two superclasses: Animal and Vehicle. For each superclass, we select classes from the 272 candidates in MSCOCO, with the criterion that the selected classes in a superclass should have large inter-class differences. For context selection, we exploit the YFCC100m browser and first derive the frequently co-occurring tag list for a given concept (i.e. class label). We then filter out the tags that occur in only a few concepts. Finally, we manually screen all tags and select the ones that are consistent with our definition of contexts (i.e. object attributes, or backgrounds and scenes).

After obtaining the conceptual and contextual tags, we concatenate a given conceptual tag with each of its contextual tags to form a query, input the query into the Google and Bing image search APIs, and collect the top-ranked images as candidates. Finally, in the screening phase, we select images into the final dataset according to the following criteria: (1) the content of an image should correctly reflect its concept and context; (2) for a given class, the number of images in each context should be adequate and as balanced as possible across contexts. Note that we do not conduct image registration or filtering by object centralization, so that the selected images are more realistic and in the wild than those in ImageNet.

The NICO dataset will be continuously updated and expanded. To date, there are two superclasses, Animal and Vehicle, with 10 classes for Animal and 9 classes for Vehicle. Each class has 9 or 10 contexts. The average size of contexts per class ranges from 83 to 215, and the average size of a class is about 1300 images, which is similar to ImageNet. In total, there are nearly 25,000 images in the NICO dataset. As NICO has a hierarchical structure, it can easily be expanded. More statistics on NICO are reported in Table 1 and the Appendices. It is worth noting that the NICO dataset does not own the copyright of the images. For researchers and educators who wish to use the images for non-commercial research and/or educational purposes, we may provide access through a link.

3.3 Supported Non-I.I.D. Settings

By dividing a class into different contexts, NICO provides the flexibility of simulating Non-I.I.D. settings at different levels. Here we list 4 typical settings.

Setting 1: Minimum bias. Given a class, we can ignore the contexts and randomly split all images of the class into training and testing subsets as positive samples. Then we can randomly sample images belonging to other classes into training and testing subsets as negative samples. In this setting, random sampling leads to minimum distribution shift between training and testing distributions, which simulates a nearly I.I.D. scenario.

Setting 2: Proportional bias. Given a class, when sampling positive samples we use all contexts for both training and testing, but the percentage of each context differs between the training and testing subsets. For example, we can let one context take the majority in the training data while taking the minority in the testing data. The negative sampling process is the same as in Setting 1. In this setting, the level of distribution shift can be tuned by adjusting the proportion difference between training and testing subsets for each context.

Setting 3: Compositional bias. Given a class, not every context that the positive testing samples belong to appears in the training subset. Such a setting is quite common in real scenarios, because available datasets cannot contain all the potential contexts in nature, due to limitations of sampling time and space. Intuitively, the distribution shift from observed contexts to unseen contexts is usually large: the fewer testing contexts observed in training, the higher the distribution shift. An even more radical distribution shift can be achieved by combining compositional bias with proportional bias.

Setting 4: Adversarial bias. Given a class, the positive sampling process is the same as in Setting 3. For negative sampling, we tend to select negative samples from the contexts that have not been (or have been) included in the positive training samples to form the negative training (or testing) subset. In this way, the distribution shift is even higher than in Setting 3, and existing classification models developed under the I.I.D. assumption are more prone to be confused.
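The adversarial negative sampling in Setting 4 can be sketched as follows. The dict-of-contexts layout, the function name, and the simple truncation to `n` samples are hypothetical illustrations, not the paper's exact procedure:

```python
def adversarial_negatives(neg_by_context, pos_train_contexts, n):
    """Adversarial-bias negative sampling (Setting 4 sketch).

    neg_by_context: dict mapping context -> images of other classes
    (hypothetical layout). Negative *training* samples come only from
    contexts NOT seen in the positive training set; negatives that share
    the positive training contexts are reserved for the testing subset,
    so context cues actively mislead an I.I.D.-trained classifier.
    """
    train_pool = [img for ctx, imgs in neg_by_context.items()
                  if ctx not in pos_train_contexts for img in imgs]
    test_pool = [img for ctx, imgs in neg_by_context.items()
                 if ctx in pos_train_contexts for img in imgs]
    return train_pool[:n], test_pool[:n]
```

With this split, a context such as "grass" that marks positives at training time marks negatives at testing time, which is exactly the confusion the setting is designed to create.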

The above 4 settings are designed to generate Non-I.I.D. training and testing subsets. Under each setting, we can conduct either Targeted or General Non-I.I.D. image classification by assuming the distribution of testing subset is known or unknown.

3.4 Empirical Analysis

To verify the effectiveness of NICO in supporting Non-I.I.D. image classification, we conduct a series of empirical analyses. It is worth noting that, in each setting, only the distribution of training or testing data changes, while the structure of the ConvNet and the size of the training data are kept the same.

3.4.1 Minimum Bias Setting

In this setting, we randomly sample 8000 images for training and 2000 images for testing from the Animal and Vehicle superclasses respectively, and measure the average testing accuracy and NI over all classes for each superclass. We find that NI in NICO is much higher than in ImageNet, even though there is no explicit bias (due to random sampling) when we construct the training and testing subsets. This is because the images in NICO are typically non-iconic, with rich contextual information and non-canonical viewpoints, which is more challenging from the perspective of image classification.

3.4.2 Proportional Bias Setting

In this setting, we let all the contexts appear in both training and testing data, and randomly select one dominant context in the training data (or testing data) for each class in one superclass. Such an experimental setting complies with the natural phenomenon that a majority of visual contexts are rare, except for a few common ones [1]. Specifically, we define the dominant ratio as follows:

$$\mathrm{Dominant\ Ratio} = \frac{N_{dominant}}{\overline{N}_{other}}$$

where $N_{dominant}$ refers to the sample size of the dominant context and $\overline{N}_{other}$ refers to the average size of the other contexts, from which we sample uniformly. We conduct two experiments where the dominant ratio of either the training data or the testing data is fixed, and the other one is varied. We plot the results in Figure 4 (a) and Figure 4 (b). From the figures, we can clearly find a consistent pattern: NI becomes higher as the discrepancy between the dominant ratios of training and testing data becomes larger. As a result, by tuning the dominant ratio of the training data (or testing data), we can easily simulate different extents of distribution shift.
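A sampler that realizes a given dominant ratio can be sketched as below. The dict-of-contexts layout, the function name, and the fixed per-context size `n_minor` are our own simplifying assumptions (the paper only specifies the ratio, not the exact sampling routine):

```python
import random

def sample_with_dominant_ratio(by_context, dominant, ratio, n_minor, seed=0):
    """Draw one class's positive samples so that a chosen context dominates.

    by_context: dict mapping context -> list of image ids (hypothetical
    in-memory layout). The dominant context contributes ratio * n_minor
    images; every other context contributes n_minor images, so the
    resulting sample matches the dominant-ratio definition in the text.
    """
    rng = random.Random(seed)
    sample = rng.sample(by_context[dominant], int(ratio * n_minor))
    for ctx, imgs in by_context.items():
        if ctx != dominant:
            sample += rng.sample(imgs, n_minor)  # uniform over other contexts
    return sample
```

Sampling the training subset with ratio 5:1 and the testing subset with ratio 1:1, for example, reproduces the proportion gap used in Figure 4.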

(a) Average NI over all classes in the superclass with respect to various dominant ratios of training data, while the dominant ratio of testing data is fixed to 1:1 (uniform sampling).
(b) Average NI over all classes in the superclass with respect to various dominant ratios of testing data, while the dominant ratio of training data is fixed to 5:1.
Figure 4: NI in the proportional bias setting.
Figure 5: NI in the compositional bias setting: average NI over all classes in the superclass with respect to the number of contexts used in training data.
Figure 6: NI in the combined setting of compositional bias and proportional bias: average NI over all classes in the superclass with respect to various dominant ratios of training data, where contexts in testing data are totally unseen in training.

3.4.3 Compositional Bias Setting

Compared to the proportional bias setting, the compositional bias setting simulates a condition where the knowledge obtained from training data is insufficient to characterize the whole distribution. To do so, we choose a subset of contexts for a given class when constructing the training data and test the model on all the contexts. By varying the number of contexts observed in the training data, we can simulate different extents of information loss and distribution shift. From Figure 5, we find that NI consistently decreases as more contexts are observed in the training data. A more radical distribution shift can be achieved by combining the notions of proportional bias and compositional bias. For a given class in the superclass, we choose 7 contexts for training and the other 3 contexts for testing, and further let one context dominate the training data. By doing so, we obtain a more severe Non-I.I.D. condition between training and testing data than in the previous two settings, as illustrated by the results in Figure 6.

3.4.4 Adversarial Bias Setting

Given a target class, we define a context as a confounding context if it only appears in the negative samples of the training data and the positive samples of the testing data. In this experiment, we choose four classes in the superclass as target classes and report the NI w.r.t. various numbers of confounding contexts in Figure 7. The experimental results indicate that the number of confounding contexts has a consistent influence on the NI of different classes. Given any target class, we can simulate a harsher distribution shift and further confuse the ConvNet by adding more confounding contexts.

Figure 7: NI in the adversarial bias setting: NI of the target class with respect to the number of confounding contexts.

Finally, we show the range of NI in the different Non-I.I.D. settings in Figure 8. We can see that the level of NI in NICO is significantly higher than in ImageNet, and there is an obvious ascending trend from the Minimum Bias to the Adversarial Bias settings, which can be controlled consciously by changing the gap in contexts between training and testing data.

Figure 8: Range of average NI over the superclass for the different settings supported in NICO.

4 Conclusion and Future Work

In this paper, we introduce a new dataset, NICO, for promoting research on Non-I.I.D. image classification. To the best of our knowledge, NICO is the first well-structured Non-I.I.D. image dataset with a reasonable scale to support the training of powerful deep networks, such as ConvNets, from scratch. By incorporating the idea of context, empirical results clearly demonstrate that NICO can provide various Non-I.I.D. settings and create different levels of Non-IIDness consciously.

Our future work will focus on the following. First, both the quality and quantity of NICO will continue to be improved: orthogonal contexts, denoised images and proper area ratios of objects will be explored to make NICO more controllable for tuning bias, and we will expand the scale of the dataset at all levels to meet growing demands. Second, more settings covering different forms of Non-IIDness are expected to be explored, so other visual concepts may be added to NICO as needed, and the ways of using NICO in new settings will be described in detail.


  • [1] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
  • [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [3] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
  • [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  • [5] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
  • [6] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
  • [7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • [8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • [9] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.