Visual Recognition with Deep Learning from Biased Image Datasets

In practice, and more especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven predictive performances on different population segments highlights the representativeness issues possibly induced by a naive aggregation of image datasets. Indeed, sampling bias does not vanish simply by considering larger datasets, and ignoring its impact may completely jeopardize the generalization capacity of the learned prediction rules. In this paper, we show how biasing models, originally introduced for nonparametric estimation in (Gill et al., 1988), and recently revisited from the perspective of statistical learning theory in (Laforgue and Clémençon, 2019), can be applied to remedy these problems in the context of visual recognition. Based on the (approximate) knowledge of the biasing mechanisms at work, our approach consists in reweighting the observations, so as to form a nearly debiased estimator of the target distribution. One key condition for our method to be theoretically valid is that the supports of the distributions generating the biased datasets at disposal must overlap, and cover the support of the target distribution. In order to meet this requirement in practice, we propose to use a low dimensional image representation, shared across the image databases. Finally, we provide numerical experiments highlighting the relevance of our approach whenever the biasing functions are appropriately chosen.



I Introduction

Besides the considerable advances in memory technology and in the computational power of machines, which now make it possible to implement optimization programs that operate on vast classes of rules and thus make deep learning feasible, the spectacular rise in performance of visual recognition algorithms is essentially due to the recent availability of massive labeled image datasets. However, due to the poor control of the data acquisition process, or even the absence of any experimental design to collect the datasets, the distribution of the training examples may differ drastically from that of the data to which the predictive rule will be applied when deployed. The adverse effects of machine-learning algorithms for facial recognition when trained from data suffering from selection bias have been recently highlighted in various works, as recalled in section I below.

[Bias in facial recognition] Most publicly available large face databases — such as [13] or [9] for instance — are composed of images of celebrities and do not represent appropriately the global population, with respect to ethnicity and gender in particular. In [26], face recognition systems learned from such databases are argued to suffer from an “other-race effect”, i.e. that they fail to distinguish individuals from unfamiliar ethnicities, like humans often do. Recently, industrial benchmarks for commercial face recognition algorithms highlighted the discrepancies in accuracy across race and gender and their fairness is now questioned, see [8].

In [34], a balanced dataset over ethnicities, built by discarding selected observations from a larger dataset, has been proposed so as to cope with the representativeness issue mentioned in section I. However, reducing the number of training samples may strongly damage performance and generalization for most computer vision tasks, see e.g. [32].

The purpose of this paper is to investigate how to use several image datasets, each biased with respect to the target distribution in a possibly different manner, for learning visual recognition systems when the corresponding bias functions are approximately known. Learning from training data that are not distributed as the test data, or not of the same format, is a very active research topic in the machine learning literature, see e.g. [23, 1], and in computer vision especially, with a plethora of transfer learning methods (e.g. [30]) or domain adaptation techniques (e.g. [5]). The approach considered in this article is of a different nature and relies on general user-defined biasing functions to relate the training datasets to the testing distribution for computer vision tasks. In section I, the biasing functions would for instance have as arguments the gender and nationality information usually present in identification databases. However, the approach promoted here is very general and applies whenever specific types of selection bias are present in the available training databases and are approximately known. It is based on recent theoretical results on re-weighting techniques for learning from biased training samples, documented in [15] (see also [7]) and [33].

Fig. 1: A sample of: a) the not-MNIST dataset, b) the SynNumbers dataset of [5] with only even numbers, c) the MNIST-M dataset of [5], and d) the well-known MNIST database. The differences between pairs of these datasets constitute examples of “low-level” bias (b-d or c-d) and “high-level” bias (a-d or b-d).

The notion of bias is used with various different meanings. For instance in section I, it is understood as a racial prejudice, due to the possibly great disparity between the performance levels attained for different subgroups of the population. In computer vision, a significant part of the literature considers the case where the differences between the training and test image populations are related to low-level visual features, see e.g. [18, 2, 16, 5], as illustrated for example by the difference between the MNIST and MNIST-M datasets (fig. 1-c and fig. 1-d). The general principle behind the algorithms proposed consists in finding a representation common to training and test image datasets, that is relevant for the task considered.

Here, we focus on selection/sampling bias solely and consider situations where the training and test images have the same format but the distributions of the training image datasets available are different from the target distribution, i.e. the distribution of the statistical population of images to which the learned visual recognition system will be applied, as formulated in [31, 15] using the notion of biasing functions/models. In section I for instance, the selection bias effect results in different proportions of the categories determined by race and gender, whereas in fig. 1, one may guess that the differences between Sub-Synth (b) and MNIST (d) concern the visual appearance and the distribution over numbers both at the same time, since (b) contains a significantly higher proportion of even numbers than (d).

The method proposed in [7] and [15] and investigated here for computer vision tasks relies on user-defined bias functions. By means of an appropriate algorithm, these functions make it possible to weight the images of the training databases so as to form a nearly de-biased estimate of the test distribution, and then to apply empirical risk minimization techniques with generalization guarantees to learn visual recognition systems. These biasing functions are either fixed in advance based on prior information or else estimated from auxiliary information related to the image data. In this paper, we explain at length how this debiasing machinery can be used for visual recognition tasks and illustrate its relevance through various experiments on the well-known datasets CIFAR10 and CIFAR100.

The rest of the paper is organized as follows. Related works are collected in section II, while the methodology considered here for de-biasing image datasets is described at length in section III. In section IV, experimental results are displayed and discussed, providing empirical evidence of the advantages of the approach promoted here. Finally, some concluding remarks are gathered in section V.

II Related Work - State of the Art

Learning visual recognition models in the presence of bias has recently been the subject of attention in the computer vision literature, with some authors explicitly referring to the notion of bias. For example, [12] proposed to correct for gender bias in captioning models by using masks that hide the gender information, while [14] learned to dissociate an objective classification task from some associated confounding bias using a known prior on the bias.

Bias and Transfer Learning. However, since we consider bias to be any significant difference between the training and testing distributions, learning with biased datasets is related to transfer learning. We denote by $\mathcal{X}$ (resp. $\mathcal{Y}$) the input (resp. output) space associated with a computer vision task. [22] define domain adaptation as the specific transfer learning setting where both the task (e.g. object detection or visual recognition) and the input space are the same for training and testing data, but the train and test distributions are different, i.e. $P_{\text{train}}(x, y) \neq P_{\text{test}}(x, y)$ for some $(x, y) \in \mathcal{X} \times \mathcal{Y}$. We denote by $z$ any tuple $(x, y)$. Our work makes assumptions similar to those of domain adaptation, but considers several training sets.

Transferable Features. [17, 5] address domain adaptation by learning a feature extractor that is invariant with respect to the shift between the domains, which makes the approach well suited to problems where only the marginal distribution over $\mathcal{X}$ is different between training and testing. Formally, under this assumption the posterior distributions satisfy $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$ for any $(x, y)$, and we have $P_{\text{train}}(x) \neq P_{\text{test}}(x)$ for some $x$. Consequently, their approach is well suited when the difference is reduced to that of the “low-level” visual appearance of the images (see fig. 1). However, [28] considers domains as types of images (e.g. medical, satellite), which makes learning invariant deep features undesirable. In the context of section I, learning deep invariant features would discard information that is relevant for identification. For those reasons, [28] as well as [10] consider all network layers as possible candidates for adaptation. Taking a different approach, [30, 18, 35] all explicitly model different train and test posterior probabilities for the output, for example through specialized classifiers for both the target and source dataset, see e.g. [35]. Most of the work on deep transfer focuses on adapting feature representations, but another popular approach in transfer learning is to transfer knowledge from instances [22].

Instance Selection. Recent computer vision papers proposed to select specific observations from a large-scale training dataset to improve performance on the testing data. Precisely, [3] selects observations that are close to the target data, as measured by a generic image encoder, while [6] uses a similarity based on low-level characteristics. More similar to our work, [21, 35] use importance weights for training instances, computed from a small set of observations that follow the target distribution. Precisely, [21] transfers information from the predicted distribution over labels of the adapted network, while [35] learns with a reinforcement learning selection policy guided by a performance metric on a validation set. The objective of the importance function is to reweight any training instance $z$ by a quantity proportional to the likelihood ratio $dP_{\text{test}} / dP_{\text{train}}(z)$. While most papers assume the availability of a sample that follows $P_{\text{test}}$ to approximate this likelihood ratio, recent work in statistical machine learning proposes to introduce simple assumptions on the relationship between $P_{\text{train}}$ and $P_{\text{test}}$ [33, 15]. These assumptions only require the machine learning practitioner to have some a priori knowledge of the testing data, which happens naturally in many situations, notably in industrial benchmarks. In the context of section I, facial recognition providers often have accurate knowledge of some characteristics (e.g. nationality) of the target distribution but no access to the target data. [33] assumed that a single training sample covers the whole testing distribution (with a dominated measure assumption), but [15] relaxes that assumption by considering several source datasets.

Multiple Source Datasets. Learning from several source datasets has been studied in various contexts, such as domain adaptation [19, 1], or domain generalization [16]. In the latter work, the authors optimize for all sources simultaneously, and find an interesting trade-off between the different competing objectives. Our approach is closer to the one developed in [15], which consists in appropriately reweighting the observations generated by some biased sources. Under several technical assumptions, the authors have extended the results of [31] and have shown that the unbiased estimate of the target distribution thus obtained can be used to learn reliable decision rules. Note however that an important limitation of these works is that the biasing functions are assumed to be exactly known. In this work, we learn a deep neural network for visual recognition with sampling weights derived from [15], using approximate expressions for the biasing functions. Note that [14] also assumes the knowledge of bias information to adjust a neural network. However, their approach is not directly comparable to ours, insofar as it consists in penalizing the mutual information between the bias and the objective.

III Debiasing the Sources - Methodology

Fig. 2: Given the datasets $\mathcal{D}_k$, it is easy to compute (unbiased) estimates of the $P_k$, through the empirical distributions $\hat{P}_k$. Since the $\omega_k$ are assumed to be known, one can construct $\hat{P}_k / \omega_k$, normalize the distribution (to mimic the effect of $\Omega_k$, which is unknown), and obtain an estimate of $P$ restricted to the support of $P_k$. But if the union of the supports of $P_1$ and $P_2$ is strictly included in the support of $P$, as in case a), it is impossible to estimate $P$ on the part of its support that is not covered. This is expected, as the learner does not have access to any observation valued in this part of the space. If the support of $P$ is included in the union of the supports of $P_1$ and $P_2$, but there is no overlap, as in case b), we have estimates of $P$ on the entire support, but it is impossible to find the right normalization to ensure that the combination of the two per-source estimates is a good estimate of $P$. More precisely, two different relative normalizations yield two distributions defined on the whole support that both agree with the per-source estimates once restricted to each source's support, but from the data available in scenario b) there is no way to know which one is closer to $P$. However, if there is overlap, as in case c), the knowledge of the $\omega_k$ and the observations present in the intersection make it possible to estimate the relative normalization, and to produce an almost unbiased estimate of $P$.

In this section, we formally introduce our methodology to learn reliable decision functions in the context of biased image-generating sources. In Section III-A, we first recall the debiased Empirical Risk Minimization framework developed in [15], which our approach builds upon. In Section III-B, we then highlight that applying this theory to image datasets raises important practical issues, and propose heuristics to bypass these difficulties. In what follows, we use $\mathbb{I}\{\cdot\}$ to denote the indicator of an event.

III-A Debiased Empirical Risk Minimization

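Recall that $Z = (X, Y)$ denotes a generic random variable with distribution $P$ over the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. We consider the supervised learning framework, meaning that $X$ represents some input information supposedly useful to predict the output $Y$. Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, and a hypothesis set $\mathcal{H}$ (e.g., the set of decision functions possibly generated by a fixed neural architecture), our goal is to compute

$$h^* \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \mathbb{E}_P\big[\ell(h(X), Y)\big]. \tag{1}$$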

As $P$ is unknown in general, a usual hypothesis in statistical learning consists in assuming that the learner has access to a dataset composed of $n$ independent realizations generated from $P$. However, as highlighted in the introduction, this i.i.d. assumption is often unrealistic in practice. In this paper, we instead assume that the observations at disposal are generated by several biased sources. Formally, let $K \geq 1$ be the number of sources, and for $k \in \{1, \ldots, K\}$, let $\mathcal{D}_k = \{Z_{k,1}, \ldots, Z_{k,n_k}\}$ be a dataset of size $n_k$ composed of independent realizations generated from the biased distribution $P_k$. We also define $n = \sum_{k=1}^K n_k$, and $\lambda_k = n_k / n$. The distributions $P_k$ are said to be biased, as we assume the existence of (known) biasing functions $\omega_k : \mathcal{Z} \to \mathbb{R}_+$ such that for any $z \in \mathcal{Z}$ we have

$$\frac{dP_k}{dP}(z) = \frac{\omega_k(z)}{\Omega_k}, \tag{2}$$

where $\Omega_k = \mathbb{E}_P[\omega_k(Z)]$ is an (unknown) normalizing factor. This data generation design is referred to as Biased Sampling Models (BSMs), and was originally introduced in [31] and [7] for asymptotic nonparametric estimation of cumulative distribution functions. Note that BSMs are a strict generalization of the standard statistical learning framework, insofar as the latter can be recovered as a special case with the choice $K = 1$ and $\omega_1 \equiv 1$. For general BSMs, it is important to observe that naively performing Empirical Risk Minimization (ERM, see e.g., [4]), which means concatenating the observations without considering the sources of origin and computing

$$\min_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \ell\big(h(X_{k,i}), Y_{k,i}\big), \tag{3}$$

might completely fail. Indeed, minimizing criterion (3) instead of (1) amounts to replacing $P$ with the empirical distribution

$$\hat{P}_n = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \delta_{Z_{k,i}} = \sum_{k=1}^{K} \lambda_k \hat{P}_k, \qquad \text{where } \hat{P}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} \delta_{Z_{k,i}}.$$

However, it can be seen from the above equations that $\hat{P}_n$ is rather an approximation of the mixture distribution $\sum_{k=1}^K \lambda_k P_k$, which might be very different from $P$ (e.g., in the extreme case where and for , we have ).
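To make this failure mode concrete, here is a minimal sketch on a toy discrete problem. The target distribution, the two sources and all variable names are illustrative assumptions and are not taken from the experiments of this paper. Each source observes the target restricted to a subset of values, and the pooled empirical frequencies are compared with the target.

```python
from collections import Counter

# Hypothetical target distribution P over the values 0, 1, 2 (assumption).
target = {0: 0.5, 1: 0.25, 2: 0.25}

# Source 1 only sees {0, 1}, source 2 only sees {1, 2}. Each sample matches
# exactly the target restricted to its support and renormalized.
d1 = [0, 0, 0, 0, 1, 1]  # frequencies (2/3, 1/3, 0)
d2 = [1, 1, 2, 2]        # frequencies (0, 1/2, 1/2)

# Naive ERM implicitly uses the empirical distribution of the concatenation.
pooled = d1 + d2
counts = Counter(pooled)
pooled_freq = {v: counts[v] / len(pooled) for v in target}

# Largest absolute gap between pooled frequencies and the target.
gap = max(abs(pooled_freq[v] - target[v]) for v in target)
```

In this toy case the pooled frequencies over-represent the value 1, which is observed by both sources, so a rule fitted on the concatenation would be tuned to the mixture of the sources rather than to the target.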

To remedy this problem, [15] has developed a debiased ERM procedure for BSMs, under the following two assumptions: (i) the union of the supports of the $P_k$ must contain the support of $P$, and (ii) the supports must sufficiently overlap. See Figure 2 for an intuitive explanation of the assumptions. Formally, the support assumption ensures that the $\omega_k$ cannot be null all at the same time, and thus makes it possible to invert the following equality:

$$\sum_{k=1}^{K} \lambda_k \, dP_k(z) = \left( \sum_{k=1}^{K} \frac{\lambda_k \, \omega_k(z)}{\Omega_k} \right) dP(z). \tag{4}$$

As noticed earlier, it is straightforward to obtain an estimate of $P_k$, through $\hat{P}_k$. The $\omega_k$ being assumed to be known, it is then enough to find estimates $\hat{\Omega}_k$ of the $\Omega_k$, and to plug them into (4), to construct an (almost unbiased) estimate of $P$. As shown in [7], such estimates can be easily computed by solving a system of equations through a Gradient Descent approach, see the Appendix for technical details. Let $\tilde{P}_n$ be the distribution obtained by replacing $P_k$ and $\Omega_k$ in (4) with $\hat{P}_k$ and $\hat{\Omega}_k$ respectively. Debiased ERM consists in approximating $P$ in (1) with $\tilde{P}_n$. It can be shown that it is equivalent to compute

$$\min_{h \in \mathcal{H}} \; \sum_{k=1}^{K} \sum_{i=1}^{n_k} \pi_{k,i} \, \ell\big(h(X_{k,i}), Y_{k,i}\big),$$

$$\text{where} \quad \pi_{k,i} = \frac{1}{n} \left( \sum_{l=1}^{K} \frac{\lambda_l \, \omega_l(Z_{k,i})}{\hat{\Omega}_l} \right)^{-1}. \tag{5}$$

Hence, once the debiasing weights are derived, this approach is not more computationally expensive than standard ERM. Note also that most modern machine learning libraries allow the training observations to be weighted during the learning stage, e.g., through the sample_weight keyword argument in the fit method for scikit-learn [25].
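The weight computation can be sketched on a toy discrete problem as follows. This is a hedged illustration only. Instead of the gradient-descent procedure of the Appendix, it uses a plain self-consistency (fixed-point) iteration on the normalizing factors, which are identifiable only up to a common scale and are therefore rescaled at each step. The function names `debiasing_weights` and `weighted_risk`, the toy data and the iteration count are all assumptions introduced for this example.

```python
def debiasing_weights(datasets, omegas, n_iter=200):
    """Return (pi, W): one weight per pooled observation (summing to 1) and
    the estimated normalizing factors, rescaled so that W[0] == 1."""
    sizes = [len(d) for d in datasets]
    pooled = [z for d in datasets for z in d]
    # omega_vals[j][k] = omega_k evaluated at the j-th pooled observation
    omega_vals = [[w(z) for w in omegas] for z in pooled]
    W = [1.0] * len(datasets)

    def unnormalized(current_W):
        # q_j = 1 / sum_l n_l * omega_l(z_j) / W_l
        return [1.0 / sum(n * o / w for n, o, w in zip(sizes, row, current_W))
                for row in omega_vals]

    for _ in range(n_iter):
        q = unnormalized(W)
        # Self-consistency: W_k should equal the integral of omega_k under
        # the reweighted empirical distribution.
        W = [sum(qj * row[k] for qj, row in zip(q, omega_vals))
             for k in range(len(W))]
        scale = W[0]
        W = [w / scale for w in W]

    q = unnormalized(W)
    total = sum(q)
    return [qj / total for qj in q], W


def weighted_risk(losses, pi):
    """Debiased empirical risk: per-observation losses averaged with pi."""
    return sum(p * l for p, l in zip(pi, losses))


# Toy data (assumption): source 1 observes values in {0, 1}, source 2 in {1, 2}.
datasets = [[0, 0, 0, 0, 1, 1], [1, 1, 2, 2]]
omegas = [lambda z: float(z in (0, 1)), lambda z: float(z in (1, 2))]
pi, W = debiasing_weights(datasets, omegas)

pooled = [z for d in datasets for z in d]
mass = {v: sum(p for p, z in zip(pi, pooled) if z == v) for v in (0, 1, 2)}
```

The toy samples were built from a target putting mass 0.5, 0.25 and 0.25 on the values 0, 1 and 2, and the reweighted mass per value recovers these proportions because the two sources overlap on the value 1. The list `pi` plays the role of per-observation sample weights in any weighted training loop.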

III-B Application to Image-Generating Sources

Thanks to its generality, the debiased ERM procedure can model realistic scenarios, which makes its application to image-generating sources particularly relevant. First, the biasing functions apply to the entire observation $z = (x, y)$. One can thus address biases induced by the measurement devices (cameras with different intrinsic characteristics are expected to produce photos with different film grains), by the subjects (cameras located in different areas of the world are expected to observe different classes of objects/animals), or both at the same time. This rich framework generalizes Covariate Shift, see e.g., [27, 29], a popular bias assumption which only allows the marginal distribution of $X$ to vary across the datasets, the conditional distribution of $Y$ given $X$ being assumed to remain the same as under $P$. A second advantage is that an individual biasing function might be null on some part of $\mathcal{Z}$. This allows for instance to consider biasing functions of the form $\omega_k(z) = \mathbb{I}\{z \in \mathcal{Z}_k\}$, where $\mathcal{Z}_k$ is a subspace of $\mathcal{Z}$, accounting for the fact that dataset $\mathcal{D}_k$ is actually sampled from a subpopulation (e.g., regional animal species). In this case, Importance Sampling typically fails, as the importance weights are given by $\Omega_k / \omega_k(z)$, which explodes for $z \notin \mathcal{Z}_k$. In contrast, debiased ERM combines the different datasets in a clever fashion to produce an estimator valid on the entire set $\mathcal{Z}$. The final asset of debiased ERM is undeniably the theoretical guarantees that come with it. Indeed, it has been shown in [15] that, up to constant factors, the debiased empirical risk minimizer has the same excess risk as an empirical risk minimizer that would have been computed on the basis of an unbiased dataset of size $n$. This remarkable property is however conditioned upon two relatively strong requirements, that we will study and relax in the context of image-generating sources.

First, the biasing functions have to be known. Recall nonetheless that this assumption is less restrictive than it might look at first sight. Indeed, only the $\omega_k$, and not the normalizing factors $\Omega_k$, are assumed to be known. Consider again the case where $\omega_k(z) = \mathbb{I}\{z \in \mathcal{Z}_k\}$, with $\mathcal{Z}_k$ a subspace of $\mathcal{Z}$. The assumption requires knowing $\omega_k$, or equivalently the subpopulation $\mathcal{Z}_k$, but not the relative size of this subpopulation among the global one with respect to $P$, i.e., $\Omega_k = P(Z \in \mathcal{Z}_k)$, which is much more involved to obtain in practice. Still, except in very particular cases such as randomized controlled trials, it is extremely rare to have an exact knowledge of the $\omega_k$. In the following, we conduct numerical experiments where we show that the biasing functions can actually be learned along the training process, without significantly degrading the empirical performances. Our approach thus provides an end-to-end methodology to debias image-generating sources, which only relies on the biased databases, does not use any oracle quantities (the $\omega_k$), and can therefore be fully implemented in practice.

The second restriction regards the overlapping. Indeed, an important caveat of the analysis carried out in [15] is that constant factors depend heavily on the overlapping. Formally, let


These two parameters quantify the overlapping, as shown in Figure 3. Results in [15], see e.g., Proposition 1 therein, typically hold with probability , where is a constant proportional to . As previously discussed, we recover that theoretical guarantees are conditioned upon overlapping: when or , the bound obtained is vacuous. But the formula actually suggests an even more interesting behavior. Besides the threshold, we can see that the more distributions overlap, i.e., the bigger and are, the better the guarantees. On the contrary, if input observations are images, i.e., high dimensional objects whose distributions have non overlapping supports, it is likely that constants and become too small to provide a meaningful bound. To remedy this practical issue, we propose to use biasing functions which take as input a low-dimensional embedding of the images (or the label ) in order to maximize the overlapping and thus ensure good constant factors. We also conduct several sensitivity analyses, which corroborate the fact that increasing the density overlapping brings stability to the procedure, and globally enhances the performances.

Fig. 3: Two overlapping biasing functions.

IV Experiments

We now present experimental results on two standard image recognition datasets, namely CIFAR10 and CIFAR100, from which we extract biased datasets, so as to recreate the framework of Section III-A in a controllable way. The great generality of this setting allows us to model realistic scenarios, such as the following. An image database is actually composed of pictures that have been collected in several batches, by cameras located in different parts of the world. One may think for instance of animal pictures gathered by expeditions in different countries, or of security cameras located in different areas. In this case, the location bias translates into a class bias, and the class proportions greatly differ from one database to the other (we do not observe the same species of animals in Africa and in Europe). We show that our method can efficiently approximate the biasing functions in this case, and provides a satisfactory end-to-end solution to this common bias scenario. A second typical bias scenario occurs when the images originate from cameras with different characteristics, and thus exhibit different film grains. We show that a bias model on a low-dimensional representation of the images (therefore avoiding the overlap difficulty) is able to model such a phenomenon, and that our approach provides satisfactory results despite the bias mechanism at work.

Imbalanced learning is a single-dataset image recognition problem, in which the training set is assumed to contain the classes of interest in a proportion different from the one in the validation set (usually supposed balanced). Long-tail learning, a special instance of imbalanced learning, has recently received a great deal of attention in the computer vision community, see e.g., [20]. In Section IV-B, we propose a multi-dataset version of imbalanced learning, where we assume that each database contains a different proportion of the classes of interest, this proportion being possibly equal to $0$ for certain classes (which is typically not allowed in single-dataset imbalanced learning). Under the assumption that the classes are balanced in the validation dataset, we show that a simple function of $n_{k,y}$, the number of observations with label $y$ in dataset $k$, approximates the bias function well. The good empirical results we obtain provide a proof of concept for our methodology in image recognition problems where the marginal distributions over classes differ between the data acquisition phase and the testing stage. This scenario is often found in practice, e.g., as soon as cameras are located in different places of the world and thus record objects in proportions which are specific to their locations (more cars in the city, more trees in the countryside, …). Note that our goal here is not to achieve the best possible accuracy, since knowing the test proportions would then be enough to reweight the observations accurately, but rather to show that our method achieves reasonable performances, even without knowledge of the $\omega_k$. We also conduct a sensitivity analysis, showing that increasing the overlap indeed improves stability and accuracy.

In section IV-C, we assume that the databases are collected under different acquisition conditions (e.g., quality of the camera used, scene illumination, …), thus generating different types of images. To approximate the bias function, we first embed the images into a small space where the different types are easily separable. Then, we use the available instances in $\mathcal{D}_k$ to estimate the boundaries between the different image types. This experimental design complements the previous one, as bias now applies to (a transformation of) the inputs $x$, and not to the classes $y$, thus highlighting the versatility of our approach in terms of the biases that can be modeled. In our experiments, we consider a scenario where images have distinct backgrounds, but of course the debiasing technique could be applied to any other intrinsic property of the images, such as the illumination of the scene.

IV-A Experimental details

CIFAR10 and CIFAR100 are two standard image recognition datasets, each containing 50K training images and 10K testing images, of resolution $32 \times 32$ pixels. CIFAR10 is a 10-way classification dataset, while CIFAR100 is a 100-way classification dataset. Both the training sets and the testing sets are balanced in terms of the number of images per class. CIFAR10 (resp. CIFAR100) thus contains 5K (resp. 500) images per class in the training split and 1K (resp. 100) images per class in the testing split. Our experimental protocol consists of four steps:

  1. we create the biased datasets $\mathcal{D}_k$ by sampling observations with replacement (and according to the $\omega_k$) from the train split of CIFAR10/100,

  2. we construct estimates $\hat{\omega}_k$ of the biasing functions using the information contained in the datasets $\mathcal{D}_k$, and possibly some additional information on the testing distribution if available,

  3. we compute the $\hat{\Omega}_k$ according to the procedure detailed in the Appendix, and derive the debiasing weights $\pi_{k,i}$, see Equation 5 where we use $\hat{\omega}_k$ instead of $\omega_k$,

  4. we learn a network from scratch using the debiasing weights calculated in step 3.

Steps 1 and 2 are setting-dependent, as they use respectively an exact or incomplete knowledge about the bias mechanism at work. The exact expression is used in step 1 to generate the training databases from the original training split. It is however not used in the subsequent learning procedure. The approximate expression serves as a prior in step 2 to produce the estimates $\hat{\omega}_k$. The latter are typically determined by estimating from the databases the value of an unknown parameter in the approximate expression. Choices for the approximate expression can be motivated by expert knowledge, possibly combined with a preliminary analysis of the biased databases. We provide a detailed description of steps 1 and 2 for each experiment in their respective sections.
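Step 1 of the protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: the population is a stand-in list of (index, label) pairs rather than the actual CIFAR train split, and the helper `sample_biased`, the biasing function `omega_1` and the sample size are hypothetical names and values chosen for the example.

```python
import random


def sample_biased(population, omega, size, seed=0):
    """Draw `size` observations with replacement, each observation z being
    selected with probability proportional to omega(z)."""
    rng = random.Random(seed)
    weights = [omega(z) for z in population]
    return rng.choices(population, weights=weights, k=size)


# Stand-in for a labeled train split: (image index, label) pairs, 10 classes.
population = [(i, i % 10) for i in range(1000)]

# Hypothetical biasing function: this source only observes classes 0 to 4.
omega_1 = lambda z: 1.0 if z[1] < 5 else 0.0

biased = sample_biased(population, omega_1, size=300)
observed_labels = {label for _, label in biased}
```

Because sampling is done with replacement, some observations of the population may appear several times in the biased sample while others are absent, and classes on which the biasing function vanishes are never drawn.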

In step 3, we compute the $\hat{\Omega}_k$, which according to the computations in the Appendix amounts to minimizing the convex function defined in Equation 13. As this function is separable in the observations, we use the PyTorch implementation of Stochastic Gradient Descent (SGD) for iterations, with batch size , fixed learning rate , and momentum .

In step 4, we train the ResNet56 implementation of [11] for 205 epochs using SGD with fixed initial learning rate and momentum , where we divide the learning rate by at epoch and . Our implementation is available at

IV-B Class Imbalance Scenario

In this section, we assume that the conditional probability of $X$ given $Y$ is the same for all the distributions $P_k$. Formally, writing $p_k$ and $p$ for the densities (or class proportions) associated with $P_k$ and $P$, for all $k \in \{1, \ldots, K\}$, and any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, we have $p_k(x \mid y) = p(x \mid y)$. However, the class probabilities $p_k(y)$ may vary from one distribution to the other, and we recover the link to imbalanced learning. This is a common scenario, already studied in [33] for instance. In this context, it can be shown that the biasing function $\omega_k$ depends on $y$ only and is proportional to the likelihood ratio between the class proportion of the biased dataset $p_k(y)$ and that of the testing distribution $p(y)$. Indeed, using (2) we obtain

$$p_k(y) = \frac{\omega_k(y)}{\Omega_k} \, p(y), \qquad \text{i.e.,} \qquad \omega_k(y) = \Omega_k \, \frac{p_k(y)}{p(y)}. \tag{6}$$

Since we assume the classes to be balanced in the test set, we also have $p(y) = 1/C$, where $C$ denotes the number of classes. It then follows from (6) that knowing $p_k$ gives $\omega_k$ up to a constant. Similarly, estimating $p_k$ based on $\mathcal{D}_k$ also provides an estimate of $\omega_k$. Although our experiment considers the classes associated with the datapoints as subjects of the bias, the knowledge of any discrete attribute of the data can be used for debiasing as long as both: 1) realizations of that attribute are known for the training set, and 2) its marginal distribution under the testing distribution is known. Those additional discrete attributes are called strata in [33].

Generation of the biased datasets. Let , and choose which divides (here we take ). We split the original train dataset into subsamples with different class proportions as follows. We first partition the classes into meta-classes , such that each contains

different classes. In the case of CIFAR-10 we have

. Then, we consider that each dataset is composed of classes in meta-classes and only, and that classes coming from the same meta-class are in the same proportion. Formally, for , we have:

with the convention , and where is an overlap parameter. Note that the concatenation of the datasets has the same distribution over the classes (in expectation) as the original unbiased training dataset. A classifier learned on this concatenation will therefore serve as a reference. Note also that some observations of the original dataset may not be present in the resampled dataset, which is a consequence of sampling with replacement. For , we illustrate the distribution over classes of all datasets for different values of in Figure 4.

Fig. 4: Number of elements per class for the datasets in a scenario where and either low (0.05) or high (0.2) for CIFAR10.

Approximating the biasing functions. Let be the number of observations with class in database . A natural approximation of is . Furthermore, with the above construction of the biased datasets, we know that and are the same for all , and that . Plugging these values in Equation 6, and since it is enough to know the up to a constant multiplicative factor (see Section III-A), a natural approximation is

We then solve for the debiasing sampling weights by plugging the approximations ’s into the procedure described in section III.
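The approximation step above can be sketched as follows. This is a minimal illustration, assuming that the approximated biasing function of a dataset is its empirical class frequency (the number of observations of each class divided by the dataset size), which is known only up to a multiplicative constant. The function name `class_biasing_functions` and the toy label arrays are hypothetical and not taken from the paper.

```python
import numpy as np

def class_biasing_functions(labels_per_dataset, num_classes):
    """Return a (K, C) array whose entry [k, y] is n_{k,y} / n_k.

    Assumption: the biasing function of dataset k, evaluated at class y,
    is approximated by the empirical frequency of class y in dataset k.
    """
    omega = np.zeros((len(labels_per_dataset), num_classes))
    for k, labels in enumerate(labels_per_dataset):
        counts = np.bincount(np.asarray(labels), minlength=num_classes)
        omega[k] = counts / counts.sum()
    return omega

# Toy example with K = 3 biased datasets over 3 classes (hypothetical data).
datasets = [
    np.array([0, 0, 0, 1]),
    np.array([1, 1, 2, 2]),
    np.array([2, 2, 2, 0]),
]
omega = class_biasing_functions(datasets, num_classes=3)
```

Each row of `omega` can then be evaluated at the label of any pooled observation to obtain the inputs required by the weight-solving procedure.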

Analysis of the debiasing weights. In Figure 5, we plot the distribution over classes of the datasets weighted by the debiasing procedure and compare it to the test distribution over classes, for 8 different runs of the debiasing procedure. We observe that the distributions are significantly further from the test distribution and noisier as decreases. If we concatenate the databases into a single training dataset, i.e., set constant, then the concatenated dataset is equal to the original training dataset of CIFAR in expectation with respect to our synthetic biased generation procedure. Therefore, it is desirable to obtain uniform weights when solving for the debiasing weights. In Figure 6, we plot the distance between the obtained sampling weights and the uniform weights, as a function of . The result shows that the distance from the optimal weights is higher when is low and lower when is high, which is consistent with the overlap assumptions introduced in Section III.

Fig. 5: Distribution over the classes of the bias-corrected samples (dashed line, 8 different samples) compared to the testing distribution (full line), for both CIFAR10 (top) and CIFAR100 (bottom) and different overlap parameters (left to right).
Fig. 6: Expected distance of the weights as a function of for the CIFAR10 and . The expected distance decreases as grows.

Empirical results. We provide testing accuracies for several ’s in Table I. A slight performance decrease is observed in both cases, stronger in the case of CIFAR100. We see that the predictive performance increases as the overlap parameter grows, while our estimate over 8 runs of the variance of the final accuracy decreases. Our results confirm empirically that a sufficiently large overlap between the databases is necessary for our debiasing procedure.

TABLE I: Accuracies for different ’s on CIFAR for the experiments of Section IV-B, with 2 times the standard deviation over 8 runs in parentheses. Learning with the original training dataset of CIFAR gives the Ref accuracy.

IV-C Image Acquisition Scenario

In many cases, the bias does not originate from class imbalances, but from an intrinsic property of the images, such as the illumination of the scene or the background. In this image acquisition scenario, we simulate the effect of learning with datasets collected on different backgrounds.

Generation of the biased datasets. In the datasets CIFAR10 and CIFAR100, the background varies significantly between images. Some images feature a textured background while others have a uniform saturated background. We generate the biased datasets by grouping the images into bins derived from the average parametric color value, in normalized HSV representation, of a 2-pixel border around the image. For an image , we denote this 3-dimensional representation by . Each coordinate in this 3-dimensional space is then split using the median values of the encodings of the training dataset, giving 8 bins, which split the training images of CIFAR into 8 datasets . Precisely, for any , if is the binary representation of , i.e., , then

To satisfy the overlap condition, we define the biasing function for each image and any , as

where is the projection of on with respect to the norm. We sample a fixed number of elements per dataset with possible redundancy (due to sampling with replacement) using the biased functions as sampling weights.
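The binning step described above can be sketched as follows. This is only an illustration under stated assumptions: images are RGB arrays with values in [0, 1], the hue is averaged naively (ignoring its circular nature), and the order in which the three HSV coordinates map to the bits of the bin index is arbitrary, since the text does not fix it here. The function names `border_hsv_mean` and `assign_bins` are hypothetical, and the random images only stand in for CIFAR.

```python
import colorsys
import numpy as np

def border_hsv_mean(image, width=2):
    """Average HSV value over a `width`-pixel border of an (H, W, 3) RGB image."""
    mask = np.ones(image.shape[:2], dtype=bool)
    mask[width:-width, width:-width] = False      # keep only the border pixels
    border = image[mask]                          # shape (num_border_pixels, 3)
    hsv = np.array([colorsys.rgb_to_hsv(*pixel) for pixel in border])
    return hsv.mean(axis=0)                       # naive average, hue included

def assign_bins(encodings):
    """Split each HSV coordinate at its median and return a bin index in 0..7."""
    medians = np.median(encodings, axis=0)
    bits = (encodings > medians).astype(int)      # one bit per HSV coordinate
    return bits @ np.array([4, 2, 1])             # assumed bit order: H, S, V

# Random stand-in for a set of 32x32 training images (hypothetical data).
rng = np.random.default_rng(0)
images = rng.random((16, 32, 32, 3))
encodings = np.stack([border_hsv_mean(img) for img in images])
bins = assign_bins(encodings)
```

The sketch stops at the bin assignment: the biasing functions used to resample each dataset depend on a formula that is not reproduced in this illustration.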

Fig. 7: Examples of images obtained by separating the dataset CIFAR10 using the median thresholds on the average HSV values of the 2-pixel borders of the images.
Fig. 8: Distribution over the HSV dimensions of the average 2-pixel borders of the bias-corrected samples (dashed line, 8 different samples) compared to the testing distribution (full line), for both CIFAR10 (top) and CIFAR100 (bottom) and different overlap parameters (left to right).

Approximating the biasing functions. We first compute the minimum and maximum of each dataset for the average border values , with

Then, we approximate the biasing functions by


to solve for the sampling weights.

Balanced datasets scenario. We first consider a balanced scenario, where each dataset has the same number of instances, i.e. for all . In this scenario, the debiased dataset is still biased in some sense, since (7) does not take into account the natural distribution of the CIFAR training dataset over the bins (see the top panel of Figure 8). Although the results should improve with the debiasing procedure, they cannot be compared directly to the classification accuracy obtained with the original training dataset of CIFAR. Following the outline of Section IV-B, we provide in Figure 8 the marginal distributions of the bias-corrected samples over the HSV dimensions of the average 2-pixel borders and compare them to the test marginal distributions. We observe that the corrected distributions are closer to the true distributions and less variable when is high, although some expected bias remains. We provide the corresponding accuracies as well as an estimator of their variance for the learned models in Table II. Our results show that the accuracy of the debiasing procedure increases with the overlap.

Fig. 9: Distribution over the bins of the training data (left), of the balanced (middle) and of the imbalanced scenario (right).
TABLE II: Accuracies for different ’s on CIFAR for the balanced scenario of Section IV-C, with 2 times the standard deviation over 8 runs in parentheses.

Imbalanced datasets scenario. Dividing the data over HSV bins splits CIFAR into rather balanced bins, see Figure 9. However, in many practical scenarios, the different sources of data do not provide the same number of observations. In that case, concatenating the databases can severely degrade performance. We study the relative performance of our debiasing methodology against the concatenation of the databases, for a long-tail distribution of the number of elements per dataset, as illustrated in Figure 9. Precisely, we fix where and . In this experiment, we take into account the size of the datasets relative to the size of the bins in the testing data using an additional multiplicative term in . Results are provided in Table III, highlighting the strong impact of the overlap.

TABLE III: Accuracies for different ’s on CIFAR for the imbalanced databases scenario of Section IV-C, with 2 times the standard deviation over 8 runs in parentheses.

V Conclusion

In this work, we applied to visual recognition problems the reweighting procedure of [15] for learning with several biased datasets with known biasing functions ’s. Their approach is not readily applicable to image data, since image datasets typically do not occupy the same portions of the space of all possible images. To circumvent this problem, we propose to express the biasing functions either 1) by using additional information on the images, or 2) on smaller spaces by transforming the information contained in the images. While our work demonstrates the effectiveness of a general methodology to learn with biased data, the choice of the biasing functions is left to the practitioner, and we cannot provide precise information on the expected effects of reweighting the data. For these reasons, we consider the two most promising directions for future work to be: 1) finding a general methodology for choosing relevant approximations to the biasing functions, and 2) predicting the impact of the reweighting procedure before training, which is especially important when dealing with large models.

In this appendix, we provide the technical details about the computation of the . Recall that the are empirical estimators of the , that are meant to be plugged into Equation 4 to produce an unbiased estimate of . For all , we have




Plugging (9) into (8), we obtain that for all

In other words, the vector

is a solution to the system of equations

where is defined as


Therefore, a natural estimator of is given by the solution to the system of equations


where is the empirical counterpart of , i.e., defined as (10) but with instead of .

Note that (and similarly ) is homogeneous of degree , such that an infinite number of solutions might exist. To ensure uniqueness, one can enforce and solve the system composed of the first equations only. This convention has no impact on , as it is equivalent to plug or into Equation 4.
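As a compact reminder, the fixed-point system discussed above can be sketched in the standard biased-sampling notation of [7]. The symbols used below are assumptions made for illustration: ω_k denotes the biasing function of dataset k, Ω_k its normalizing constant, λ_k = n_k / n the proportion of dataset k in the pooled sample, P the target distribution and G the mixture of the biased distributions P_k. The presentation may differ from the numbered equations of this appendix.

```latex
% Assumed notation: dP_k / dP = \omega_k / \Omega_k and \lambda_k = n_k / n.
\begin{align*}
\Omega_k &= \int \omega_k(z)\, dP(z), \qquad
G = \sum_{l=1}^{K} \lambda_l P_l, \qquad
\frac{dG}{dP}(z) = \sum_{l=1}^{K} \frac{\lambda_l}{\Omega_l}\, \omega_l(z), \\
\Omega_k &= \int \frac{\omega_k(z)}{\sum_{l=1}^{K} (\lambda_l / \Omega_l)\, \omega_l(z)}\, dG(z),
\qquad k = 1, \dots, K.
\end{align*}
```

In this form, replacing G by the empirical distribution of the pooled sample yields the empirical system. Multiplying every Ω_k by the same positive constant maps a solution to another solution, which is why a normalization such as fixing one component is needed.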

The remaining question is: how to solve system (11)? Let . We can rewrite system (11) as:


Now, the left-hand side of (12) can be interpreted as the component of the gradient of the function defined as


Thus, solving system (11) is equivalent to solving system (12), and amounts to minimizing over the function , which has been shown in [7] to be strongly convex if the distributions overlap sufficiently. A standard Gradient Descent strategy (whose gradient is given by Equation 12) can thus be used, and converges towards the unique minimum. Recall also that

so that actually rewrites

The function is thus separable in the observations, and a stochastic or minibatch variant of Gradient Descent can be used to reduce the computational cost.
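The optimization described above can be sketched with plain NumPy. This is a minimal full-batch illustration, not the authors' implementation. It assumes the standard objective of [7], namely the average over pooled observations of log Σ_k exp(u_k) ω_k(z_i), minus Σ_k λ_k u_k, with λ_k = n_k / n, and it pins the last coordinate of u to zero as one possible normalization. The function name `solve_debiasing_weights`, the step size and the toy data are hypothetical.

```python
import numpy as np

def solve_debiasing_weights(W, lam, lr=1.0, n_iter=2000):
    """Gradient descent on the assumed convex objective.

    W:   (n, K) array, W[i, k] = omega_k(z_i) for every pooled observation.
    lam: (K,) array of dataset proportions n_k / n.
    Returns the solution u and normalized per-observation weights.
    """
    u = np.zeros(W.shape[1])
    for _ in range(n_iter):
        weighted = W * np.exp(u)                      # exp(u_k) * omega_k(z_i)
        denom = weighted.sum(axis=1, keepdims=True)   # positive if supports cover the data
        grad = (weighted / denom).mean(axis=0) - lam  # gradient of the objective
        grad[-1] = 0.0                                # assumed normalization: u_K = 0
        u -= lr * grad
    pi = 1.0 / (W * np.exp(u)).sum(axis=1)            # inverse of the mixture density ratio
    return u, pi / pi.sum()

# Toy example on a space with three points and a uniform target (hypothetical).
omega = np.array([[1.0, 1.0, 0.0],    # dataset 1 only sees points 0 and 1
                  [0.0, 1.0, 1.0]])   # dataset 2 only sees points 1 and 2
z = np.array([0, 0, 1, 1] + [1, 1, 1, 1, 2, 2, 2, 2])  # pooled observations
W = omega[:, z].T
lam = np.array([4, 8]) / 12
u, pi = solve_debiasing_weights(W, lam)
mass = np.bincount(z, weights=pi, minlength=3)  # reweighted mass of each point
```

In this toy case the two datasets overlap only on the middle point, and the reweighted mass spreads evenly over the three points. Replacing the mean over all observations by a mean over a random minibatch gives the stochastic variant mentioned above.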

We conclude with a remark about the relative sizes of the datasets. In this paper, we have assumed for simplicity that the dataset proportions are constant and equal to the ideal proportions . A natural way to relax this assumption is to consider random empirical proportions that converge to the ideal proportions as goes to infinity. We highlight that our methodology remains valid in this setting, as still converges towards .


This research work was partially supported by the research chair “Good In Tech: Rethinking innovation and technology as drivers of a better world for and by humans”, under the auspices of the “Fondation du Risque” and in partnership with the Institut Mines-Télécom, Sciences Po, Afnor, Ag2r La Mondiale, CGI France, Danone and Sycomore.


  • [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine Learning 79 (1-2), pp. 151–175. Cited by: §I, §II.
  • [2] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 343–351. Cited by: §I.
  • [3] Y. Cui, Y. Song, C. Sun, A. Howard, and S. J. Belongie (2018) Large scale fine-grained categorization and domain-specific transfer learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 4109–4118. Cited by: §II.
  • [4] L. Devroye, L. Györfi, and G. Lugosi (1996) A probabilistic theory of pattern recognition. Springer. Cited by: §III-A.
  • [5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research. Cited by: Fig. 1, §I, §I, §II.
  • [6] W. Ge and Y. Yu (2017) Borrowing treasures from the wealthy: deep transfer learning through selective joint fine-tuning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 10–19. Cited by: §II.
  • [7] R. D. Gill, Y. Vardi, and J. A. Wellner (1988) Large sample theory of empirical distributions in biased sampling models. Annals of Statistics 16 (3), pp. 1069–1112. Cited by: §I, §I, §III-A, §III-A, §V.
  • [8] P. Grother and M. Ngan (2019) Face Recognition Vendor Test (FRVT) - Performance of Automated Gender Classification Algorithms. Technical Report NISTIR 8052, National Institute of Standards and Technology (NIST). Cited by: §I.
  • [9] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-celeb-1m: A dataset and benchmark for large-scale face recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Proceedings, Part III, Lecture Notes in Computer Science, Vol. 9907, pp. 87–102. Cited by: §I.
  • [10] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. S. Feris (2019) SpotTune: transfer learning through adaptive fine-tuning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 4805–4814. Cited by: §II.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: 1512.03385 Cited by: §IV-A.
  • [12] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach (2018) Women also snowboard: overcoming bias in captioning models. In Computer Vision - ECCV 2018 - 15th European Conference, Lecture Notes in Computer Science, Vol. 11207, pp. 793–811. Cited by: §II.
  • [13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §I.
  • [14] B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim (2019-06) Learning not to learn: training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II, §II.
  • [15] P. Laforgue and S. Clémençon (2019) Statistical learning from biased training samples. CoRR abs/1906.12304. External Links: 1906.12304 Cited by: §I, §I, §I, §II, §II, §III-A, §III-B, §III-B, §III, §V.
  • [16] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), pp. 3490–3497. Cited by: §I, §II.
  • [17] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, JMLR Workshop and Conference Proceedings, Vol. 37, pp. 97–105. Cited by: §II.
  • [18] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 136–144. Cited by: §I, §II.
  • [19] Y. Mansour, M. Mohri, and A. Rostamizadeh (2008) Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, pp. 1041–1048. Cited by: §II.
  • [20] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar (2021) Long-tail learning via logit adjustment. In 9th International Conference on Learning Representations, ICLR 2021. Cited by: §IV.
  • [21] J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. V. Le, and R. Pang (2018) Domain adaptive transfer learning with specialist models. CoRR abs/1811.07056. External Links: 1811.07056 Cited by: §II.
  • [22] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §II, §II.
  • [23] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §I.
  • [24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §IV-A.
  • [25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §III-A.
  • [26] P. J. Phillips, F. Jiang, A. Narvekar, J. H. Ayyad, and A. J. O’Toole (2011) An other-race effect for face recognition algorithms. ACM Transactions on Applied Perception 8 (2), pp. 14:1–14:11. Cited by: §I.
  • [27] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2009) Dataset shift in machine learning. The MIT Press. Cited by: §III-B.
  • [28] S. Rebuffi, A. Vedaldi, and H. Bilen (2018) Efficient parametrization of multi-domain deep neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8119–8127. Cited by: §II.
  • [29] M. Sugiyama and M. Kawanabe (2012) Machine learning in non-stationary environments: introduction to covariate shift adaptation. The MIT Press. Cited by: §III-B.
  • [30] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, USA, pp. 4068–4076. External Links: ISBN 9781467383912 Cited by: §I, §II.
  • [31] Y. Vardi (1985-03) Empirical distributions in selection bias models. Ann. Statist. 13 (1), pp. 178–203. External Links: Document Cited by: §I, §II, §III-A.
  • [32] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 3630–3638. Cited by: §I.
  • [33] R. Vogel, M. Achab, S. Clémençon, and C. Tillier (2020) Weighted empirical risk minimization: sample selection bias correction based on importance sampling. CoRR abs/2002.05145. External Links: 2002.05145 Cited by: §I, §II, §IV-B.
  • [34] M. Wang, W. Deng, J. Hu, X. Tao, and Y. Huang (2019) Racial faces in the wild: reducing racial bias by information maximization adaptation network. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, pp. 692–702. Cited by: §I.
  • [35] L. Zhu, S. Arik, Y. Yang, and T. Pfister (2020) Learning to transfer learn: reinforcement learning-based selection for adaptive transfer learning. arXiv preprint arXiv:1908.11406. Cited by: §II, §II.