I Introduction
Besides the considerable advances in memory technology and in the computational power of machines, which now make it possible to run optimization programs over vast classes of rules and thus render deep learning feasible, the spectacular rise in performance of visual recognition algorithms is essentially due to the recent availability of massive labeled image datasets. However, because the data acquisition process is poorly controlled, or even lacks any experimental design, the distribution of the training examples may differ drastically from that of the data to which the predictive rule will be applied once deployed. The adverse effects of machine-learning algorithms for facial recognition trained on data suffering from selection bias have recently been highlighted in various works, as recalled below.
Bias in facial recognition. Most publicly available large face databases — such as [13] or [9] — are composed of images of celebrities and do not appropriately represent the global population, with respect to ethnicity and gender in particular. In [26], face recognition systems learned from such databases are argued to suffer from an "other-race effect", i.e. they fail to distinguish individuals from unfamiliar ethnicities, much as humans often do. Recently, industrial benchmarks for commercial face recognition algorithms have highlighted discrepancies in accuracy across race and gender, and their fairness is now questioned, see [8].
In [34], a dataset balanced over ethnicities, built by discarding selected observations from a larger dataset, has been proposed to cope with the representativeness issue mentioned above. However, reducing the number of training samples may strongly damage performance and generalization for most computer vision tasks, see e.g. [32].
The purpose of this paper is to investigate how to use several image datasets, each biased with respect to the target distribution in a possibly different manner, for learning visual recognition systems, when the corresponding bias functions are approximately known. Whereas learning from training data that are not distributed as, or not of the same format as, the test data is a very active research topic in the machine learning literature, see e.g. [23, 1],
and in computer vision especially, with a plethora of transfer learning methods (e.g. [30]) and domain adaptation techniques (e.g. [5]), the approach considered in this article is of a different nature and relies on general user-defined biasing functions to relate the training datasets to the testing distribution for computer vision tasks. In the context of section I, the biasing functions would for instance take as arguments the gender and nationality information usually present in identification databases. The approach promoted here is however very general, and applies whenever specific types of selection bias are present in the available training databases and are approximately known. It is based on recent theoretical results on reweighting techniques for learning from biased training samples, documented in [15] (see also [7]) and [33].
The notion of bias is used with various meanings. In section I for instance, it is understood as a racial prejudice, due to the possibly great disparity between the performance levels attained for different subgroups of the population. In computer vision, a significant part of the literature considers the case where the differences between the training and test image populations are related to low-level visual features, see e.g. [18, 2, 16, 5], as illustrated by the difference between the MNIST and MNIST-M datasets (fig. 1c and fig. 1d). The general principle behind the proposed algorithms consists in finding a representation common to the training and test image datasets that is relevant for the task considered.
Here, we focus solely on selection/sampling bias, and consider situations where the training and test images have the same format but the distributions of the available training image datasets differ from the target distribution, i.e. the distribution of the statistical population of images to which the learned visual recognition system will be applied, as formulated in [31, 15] using the notion of biasing functions/models. In section I for instance, the selection bias results in different proportions of the categories determined by race and gender, whereas in fig. 1, one may guess that the differences between SubSynth (b) and MNIST (d) concern both the visual appearance and the distribution over digits, since (b) contains a significantly higher proportion of even digits than (d).
The method proposed in [7] and [15], and investigated here for computer vision tasks, relies on user-defined bias functions. By means of an appropriate algorithm, these functions make it possible to weight the images of the training databases so as to form a nearly debiased estimate of the test distribution, and to apply empirical risk minimization techniques with generalization guarantees to learn visual recognition systems. These biasing functions are either fixed in advance based on prior information, or else estimated from auxiliary information related to the image data. In this paper, we explain at length how this debiasing machinery can be used for visual recognition tasks and illustrate its relevance through various experiments on the well-known datasets CIFAR10 and CIFAR100.
The rest of the paper is organized as follows. Related works are collected in section II, while the methodology considered here for debiasing image datasets is described at length in section III. In section IV, experimental results are displayed and discussed, providing empirical evidence of the advantages of the approach promoted here. Finally, some concluding remarks are gathered in section V.
II Related Work: State of the Art
Learning visual recognition models in the presence of bias has recently been the subject of attention in the computer vision literature, with authors explicitly referring to the notion of bias. For example, [12] proposed to correct for gender bias in captioning models by using masks that hide the gender information, while [14] learned to dissociate an objective classification task from an associated confounding bias using a known prior on the bias.
Bias and Transfer Learning. Since we consider bias to be any significant difference between the training and testing distributions, learning with biased datasets is related to transfer learning. We consider an input space and an output space associated to a computer vision task. [22] defines domain adaptation as the specific transfer learning setting where both the task — e.g. object detection or visual recognition — and the input space are the same for the training and testing data, but the training and test distributions differ. Our work makes assumptions similar to those of domain adaptation, but considers several training sets.
Transferable Features. [17, 5] address domain adaptation by learning a feature extractor that is invariant with respect to the shift between domains, which makes the approach well suited to problems where only the marginal distribution of the inputs differs between training and testing, the posterior distribution of the labels given the inputs remaining unchanged. Consequently, their approach is well suited when the difference reduces to the "low-level" visual appearance of the images (see fig. 1). However, [28] considers domains as types of images (e.g. medical, satellite), which makes learning invariant deep features undesirable. In the context of section I, learning deep invariant features would discard information that is relevant for identification. For these reasons, [28] as well as [10] consider all network layers as possible candidates for adaptation. Taking a different approach, [30, 18, 35] all explicitly model different training and test posterior probabilities for the output, for example through specialized classifiers for the target and source datasets, see e.g. [35]. Most of the work on deep transfer focuses on adapting feature representations, but another popular approach in transfer learning is to transfer knowledge from instances [22].
Instance Selection. Recent computer vision papers have proposed to select specific observations from a large-scale training dataset in order to improve performance on the testing data. Precisely, [3] selects observations that are close to the target data, as measured by a generic image encoder, while [6] uses a similarity based on low-level characteristics. Closer to our work, [21, 35] use importance weights for training instances, computed from a small set of observations that follow the target distribution. Precisely, [21] transfers information from the predicted distribution over labels of the adapted network, while [35]
learns a reinforcement learning selection policy guided by a performance metric on a validation set. The objective of the importance function is to reweight any training instance by a quantity proportional to the likelihood ratio between the test and training densities. While most papers assume the availability of a sample following the target distribution to approximate this likelihood ratio, recent work in statistical machine learning proposes instead to introduce simple assumptions on the relationship between the training and testing distributions [33, 15]. These assumptions only require the machine learning practitioner to have some a priori knowledge of the testing data, which arises naturally in many situations, notably in industrial benchmarks. In the context of section I, facial recognition providers often have accurate knowledge of some characteristics (e.g. nationality) of the target distribution but no access to the target data. [33] assumes that a single training sample covers the whole testing distribution (through a dominated-measure assumption), while [15] relaxes that assumption by considering several source datasets.
Multiple Source Datasets. Learning from several source datasets has been studied in various contexts, such as domain adaptation [19, 1] or domain generalization [16]. In the latter work, the authors optimize for all sources simultaneously and find an interesting tradeoff between the competing objectives. Our approach is closer to the one developed in [15], which consists in appropriately reweighting the observations generated by the biased sources. Under several technical assumptions, the authors have extended the results of [31]
and have shown that the resulting unbiased estimate of the target distribution can be used to learn reliable decision rules. An important limitation of these works, however, is that the biasing functions are assumed to be exactly known. In this work, we learn a deep neural network for visual recognition with sampling weights derived from [15], using approximate expressions for the biasing functions. Note that [14] also assumes knowledge of bias information to adjust a neural network. However, their approach is not directly comparable to ours, insofar as it consists in penalizing the mutual information between the bias and the objective.
III Debiasing the Sources: Methodology
In this section, we formally introduce our methodology to learn reliable decision functions in the context of biased image-generating sources. In Section III-A, we first recall the debiased Empirical Risk Minimization framework developed in [15], upon which our approach builds. In Section III-B, we then highlight that applying this theory to image datasets raises important practical issues, and propose heuristics to bypass these difficulties. In what follows, we use $\mathbb{1}\{\cdot\}$ to denote the indicator of an event.
III-A Debiased Empirical Risk Minimization
Recall that $Z$ denotes a generic random variable with distribution $P$ over a space $\mathcal{Z}$. We consider the supervised learning framework, meaning that $Z = (X, Y)$, where $X$ represents some input information supposedly useful to predict the output $Y$. Given a loss function $\ell$ and a hypothesis set $\mathcal{H}$ (e.g., the set of decision functions possibly generated by a fixed neural architecture), our goal is to compute

$$h^\star \in \operatorname{argmin}_{h \in \mathcal{H}} \; \mathbb{E}_{Z \sim P}\big[\ell(h, Z)\big]. \quad (1)$$
As $P$ is unknown in general, a usual hypothesis in statistical learning consists in assuming that the learner has access to a dataset composed of independent realizations of $P$. However, as highlighted in the introduction, this i.i.d. assumption is often unrealistic in practice. In this paper, we instead assume that the observations at our disposal are generated by several biased sources. Formally, let $K$ be the number of sources and, for $i \le K$, let $\mathcal{D}_i$ be a dataset of size $n_i$ composed of independent realizations of a biased distribution $P_i$. We also define $\mathcal{D} = \bigcup_{i \le K} \mathcal{D}_i$ and $n = \sum_{i \le K} n_i$. The distributions $P_i$ are said to be biased, as we assume the existence of (known) biasing functions $\omega_i$ such that for any $z \in \mathcal{Z}$ we have

$$\mathrm{d}P_i(z) \;=\; \frac{\omega_i(z)}{\Omega_i}\, \mathrm{d}P(z), \quad (2)$$

where $\Omega_i = \int \omega_i \,\mathrm{d}P$ is an (unknown) normalizing factor. This data generation design is referred to as Biased Sampling Models (BSMs), and was originally introduced in [31] and [7] for the asymptotic nonparametric estimation of cumulative distribution functions. Note that BSMs are a strict generalization of the standard statistical learning framework, insofar as the latter can be recovered as the special case $K = 1$ and $\omega_1 \equiv 1$. For general BSMs, it is important to observe that naively performing Empirical Risk Minimization (ERM, see e.g., [4]), i.e., concatenating the observations without considering their sources of origin and computing

$$\hat h \in \operatorname{argmin}_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^K \sum_{z \in \mathcal{D}_i} \ell(h, z), \quad (3)$$
might completely fail. Indeed, minimizing criterion (3) instead of (1) amounts to replacing $P$ with the empirical distribution of the pooled sample. However, this empirical distribution is rather an approximation of the mixture $\sum_{i=1}^K (n_i/n)\, P_i$, which might be very different from $P$ (e.g., when each source only observes a small subpopulation).
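As a toy numerical illustration of this failure mode (all names and numbers here are ours, not taken from the paper), consider a balanced binary target distribution and two sources biased toward class 0: the pooled sample approximates the mixture of the sources, not the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P: balanced binary labels (class probabilities 0.5 / 0.5).
# Two biased sources observe the classes with probabilities
# 0.9/0.1 and 0.7/0.3 respectively.
src1 = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
src2 = rng.choice([0, 1], size=1000, p=[0.7, 0.3])

# Naive ERM pools the observations, ignoring their source of origin.
pooled = np.concatenate([src1, src2])
freq0 = (pooled == 0).mean()

# The pooled sample approximates the mixture 0.5*P_1 + 0.5*P_2,
# whose class-0 probability is 0.8 -- far from the target value 0.5.
```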
To remedy this problem, [15] has developed a debiased ERM procedure for BSMs under the following two assumptions: (i) the union of the supports of the $P_i$ must contain the support of $P$, and (ii) the supports must sufficiently overlap. See Figure 2 for an intuitive illustration of these assumptions. Formally, the support assumption ensures that the $\omega_i$ cannot all vanish at the same time, and thus allows to invert the mixture identity obtained by summing (2) over the sources:

$$\mathrm{d}P(z) \;=\; \frac{\sum_{i=1}^K (n_i/n)\, \mathrm{d}P_i(z)}{\sum_{i=1}^K (n_i/n)\, \omega_i(z)/\Omega_i}. \quad (4)$$
As noticed earlier, it is straightforward to estimate the numerator of (4) through the empirical distributions of the datasets $\mathcal{D}_i$. The $\omega_i$ being assumed known, it is then enough to find estimates $\hat\Omega_i$ of the $\Omega_i$ and to plug them into (4) to construct an (almost unbiased) estimate $\hat P$ of $P$. As shown in [7], such estimates can be easily computed by solving a system of equations through a gradient descent approach, see the Appendix for technical details. Debiased ERM then consists in approximating the risk in (1) under $\hat P$, which is equivalent to computing

$$\hat h \in \operatorname{argmin}_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^K \sum_{z \in \mathcal{D}_i} \hat w(z)\, \ell(h, z), \quad \text{where} \quad \hat w(z) = \Big(\sum_{i=1}^K (n_i/n)\, \omega_i(z) / \hat\Omega_i\Big)^{-1}. \quad (5)$$
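The paper solves for the normalizing-constant estimates by gradient descent on a system given in its appendix. As a rough stand-in, the classical fixed-point iteration for Vardi-type normalizing constants from the biased-sampling literature can be sketched as follows (a minimal numpy sketch of ours, not the paper's exact procedure; all names are ours):

```python
import numpy as np

def estimate_normalizers(W, n_k, n_iter=200):
    """Fixed-point estimation of the normalizing factors Omega_k.

    W    : (n, K) array, W[i, k] = omega_k(z_i), the biasing functions
           evaluated at every observation of the pooled training sample.
    n_k  : (K,) array of per-source sample sizes (summing to n).
    Returns estimates of the Omega_k, up to a common scale factor.
    """
    n = W.shape[0]
    lam = n_k / n                      # source sampling fractions n_k / n
    Omega = np.ones(W.shape[1])
    for _ in range(n_iter):
        # mixture density of the pooled biased sample at each z_i:
        # sum_k lam_k * omega_k(z_i) / Omega_k
        mix = W @ (lam / Omega)
        Omega = (W / mix[:, None]).mean(axis=0)
        Omega /= Omega[0]              # Omega is identifiable only up to scale
    return Omega
```

For instance, if one source observes everything uniformly (omega = 1) and another observes every point with weight 2, the recovered ratio of normalizers is 2, as expected from the definition of the Omega_k.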
Hence, once the debiasing weights are derived, this approach is no more computationally expensive than standard ERM. Note also that most modern machine learning libraries allow weighting the training observations during the learning stage, e.g., through the sample_weight keyword argument of the fit method in scikit-learn [25].
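For instance, with scikit-learn, plug-in debiasing weights in the spirit of Equation (5) can be passed directly to `fit`. The helper and the toy data below are our own illustrative sketch (two sources over-sampling opposite classes), not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def debiasing_weights(W, n_k, Omega_hat):
    """Plug-in debiasing weights in the spirit of Eq. (5).

    W         : (n, K) array, W[i, k] = omega_k(z_i) on the pooled sample.
    n_k       : (K,) per-source sample sizes (summing to n).
    Omega_hat : (K,) estimated normalizing factors.
    """
    lam = n_k / n_k.sum()
    return 1.0 / (W @ (lam / Omega_hat))

# Toy pooled sample of n = 200 observations; the biasing functions
# depend on the label only (a class-imbalance bias).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
W = np.column_stack([np.where(y == 1, 2.0, 1.0),   # source 1 favors y = 1
                     np.where(y == 1, 1.0, 2.0)])  # source 2 favors y = 0
n_k = np.array([150, 50])          # source 1 contributed more observations
Omega = np.array([1.5, 1.5])       # exact value under a balanced target

w = debiasing_weights(W, n_k, Omega)   # up-weights the under-represented class
clf = LogisticRegression().fit(X, y, sample_weight=w)  # weighted ERM
```

Here the pooled sample over-represents class 1 (the larger source favors it), so the debiasing weights are larger on class 0, compensating the selection bias during training.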
III-B Application to Image-Generating Sources
Thanks to its generality, the debiased ERM procedure allows modeling realistic scenarios, which makes its application to image-generating sources particularly relevant. First, the biasing functions apply to the entire observation $z = (x, y)$. One can therefore address biases induced by the measurement devices (cameras with different intrinsic characteristics are expected to produce photos with different film grains), by the subjects (cameras located in different areas of the world are expected to observe different classes of objects/animals), or both at the same time. This rich framework generalizes Covariate Shift, see e.g., [27, 29], a popular bias assumption which only allows the marginal distribution of the inputs to vary across the datasets, the conditional probability of the labels given the inputs being assumed to remain constant. A second advantage is that an individual biasing function may be null on some part of the space. This allows for instance to consider biasing functions of the form $\omega_i(z) = \mathbb{1}\{z \in \mathcal{Z}_i\}$, where $\mathcal{Z}_i$ is a subspace of $\mathcal{Z}$, accounting for the fact that dataset $\mathcal{D}_i$ is actually sampled from a subpopulation (e.g., regional animal species). In this case, Importance Sampling typically fails, as the importance weights blow up outside $\mathcal{Z}_i$. In contrast, debiased ERM combines the different datasets in a clever fashion to produce an estimator valid on the entire space. The final asset of debiased ERM is undeniably the theoretical guarantees that come with it. Indeed, it has been shown in [15] that, up to constant factors, the debiased empirical risk minimizer has the same excess risk as an empirical risk minimizer computed from an unbiased dataset of size $n$. This remarkable property is however conditioned upon two relatively strong requirements, which we study and relax in the context of image-generating sources.
First, the biasing functions have to be known. Recall nonetheless that this assumption is less restrictive than it might look at first sight. Indeed, only the $\omega_i$, and not the normalizing factors $\Omega_i$, are assumed to be known. Consider again the case where $\omega_i = \mathbb{1}\{\cdot \in \mathcal{Z}_i\}$, with $\mathcal{Z}_i$ a subspace of $\mathcal{Z}$. The assumption requires knowing $\omega_i$, or equivalently the subpopulation $\mathcal{Z}_i$, but not the relative size $\Omega_i = P(\mathcal{Z}_i)$ of this subpopulation within the global one, which is much more involved to obtain in practice. Still, except in very particular cases such as randomized control trials, it is extremely rare to have an exact knowledge of the $\omega_i$. In the following, we conduct numerical experiments showing that the biasing functions can actually be learned along the training process, without significantly degrading the empirical performance.
Our approach thus provides an end-to-end methodology to debias image-generating sources, which relies only on the biased databases, does not use any oracle quantities (the exact biasing functions), and can therefore be fully implemented in practice.
The second restriction regards the overlap between the sources. Indeed, an important caveat of the analysis carried out in [15] is that the constant factors depend heavily on the amount of overlap. Formally, [15] introduces two parameters that quantify the overlap, as shown in Figure 3. Results in [15], see e.g., Proposition 1 therein, typically hold with a probability involving a constant proportional to these overlap parameters. As previously discussed, we recover that the theoretical guarantees are conditioned upon overlap: when either parameter vanishes, the bound obtained is vacuous. But the formula actually suggests an even more interesting behavior: beyond this threshold effect, the more the distributions overlap, the better the guarantees. On the contrary, if the input observations are raw images, i.e., high-dimensional objects whose distributions have non-overlapping supports, these constants are likely too small to provide a meaningful bound. To remedy this practical issue, we propose to use biasing functions which take as input a low-dimensional embedding of the images (or the label) in order to maximize the overlap and thus ensure good constant factors. We also conduct several sensitivity analyses, which corroborate the fact that increasing the density overlap stabilizes the procedure and globally enhances performance.
IV Experiments
We now present experimental results on two standard image recognition datasets, namely CIFAR10 and CIFAR100, from which we extract biased datasets so as to recreate the framework of Section III-A in a controllable way. The great generality of this setting allows us to model realistic scenarios, such as the following. An image database is composed of pictures that have been collected in several campaigns, by cameras located in different parts of the world. One may think of animal pictures gathered by expeditions in different countries, or of security cameras located in different areas. In this case, the location bias translates into a class bias, and the class proportions greatly differ from one database to the other (we do not observe the same species of animals in Africa and in Europe). We show that our method efficiently approximates the biasing functions in this case, and provides a satisfactory end-to-end solution to this common bias scenario. A second typical bias scenario occurs when the images originate from cameras with different characteristics, and thus exhibit different film grains. We show that a bias model on a low-dimensional representation of the images (thereby avoiding the overlap difficulty) is able to capture such phenomena, and that our approach provides satisfactory results despite the bias mechanism at work.
Imbalanced learning is a single-dataset image recognition problem in which the training set is assumed to contain the classes of interest in proportions different from those in the validation set (usually supposed balanced). Long-tail learning, a special instance of imbalanced learning, has recently received a great deal of attention in the computer vision community, see e.g., [20]. In Section IV-B, we propose a multi-dataset version of imbalanced learning, where we assume that each database contains a different proportion of the classes of interest, this proportion being possibly equal to zero for certain classes (which is typically not allowed in single-dataset imbalanced learning). Under the assumption that the classes are balanced in the validation dataset, we show that the empirical class proportions $n_{i,y}/n_i$, where $n_{i,y}$ is the number of observations with label $y$ in dataset $\mathcal{D}_i$, approximate the biasing functions well. The good empirical results we obtain provide a proof of concept for our methodology in image recognition problems where the marginal distributions over classes differ between the data acquisition phase and the testing stage. This scenario is often found in practice, e.g., as soon as cameras are located in different places of the world and thus record objects in proportions specific to their locations (more cars in the city, more trees in the countryside, …). Note that our goal here is not to achieve the best possible accuracy, otherwise knowing the test proportions would be enough to reweight the observations accurately, but rather to show that our method achieves reasonable performance even without knowledge of the true biasing functions. We also conduct a sensitivity analysis, showing that increasing the overlap indeed improves stability and accuracy.
In Section IV-C, we assume that the databases are collected under different acquisition conditions (e.g., quality of the camera used, scene illumination, …), thus generating different types of images. To approximate the bias function, we first embed the images into a small space where the different types are easily separable. Then, we use the available instances to estimate the boundaries between the different image types. This experimental design complements the previous one, as the bias now applies to (a transformation of) the inputs, and not to the classes, thus highlighting the versatility of our approach in terms of the biases that can be modeled. In our experiments, we consider a scenario where images have distinct backgrounds, but the debiasing technique could of course be applied to any other intrinsic property of the images, such as the illumination of the scene.
IV-A Experimental Details
CIFAR10 and CIFAR100 are two standard image recognition datasets, each containing 50K training images and 10K testing images of resolution 32×32 pixels. CIFAR10 is a 10-way classification dataset, while CIFAR100 is a 100-way classification dataset. Both the training and testing sets are balanced in terms of the number of images per class: CIFAR10 (resp. CIFAR100) thus contains 5K (resp. 500) images per class in the training split and 1K (resp. 100) images per class in the testing split. Our experimental protocol consists in four steps:

1) we create the biased datasets $\mathcal{D}_i$ by sampling observations with replacement (according to the biasing functions $\omega_i$) from the training split of CIFAR10/100;

2) we construct estimates $\hat\omega_i$ of the biasing functions, using the information contained in the datasets $\mathcal{D}_i$ and possibly some additional information on the testing distribution if available;

3) we compute the $\hat\Omega_i$ according to the procedure detailed in the Appendix, and derive the debiasing weights, see Equation (5), where we use $\hat\omega_i$ instead of $\omega_i$;

4) we learn a network from scratch using the debiasing weights calculated in step 3.
Steps 1 and 2 are setting-dependent, as they use respectively exact and incomplete knowledge of the bias mechanism at work. The exact expression of the biasing functions is used in step 1 to generate the training databases from the original training split; it is not used in the subsequent learning procedure. An approximate expression serves as a prior in step 2 to produce the estimates $\hat\omega_i$. The latter are typically determined by estimating from the databases the value of an unknown parameter in the approximate expression. Choices for the approximate expression can be motivated by expert knowledge, possibly combined with a preliminary analysis of the biased databases. We provide a detailed description of steps 1 and 2 for each experiment in their respective sections.
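Step 1 can be sketched as follows (a hypothetical helper of ours: the biasing function is evaluated on each original training observation and used as a sampling weight for drawing with replacement):

```python
import numpy as np

def sample_biased_dataset(omega_values, size, rng=None):
    """Draw a biased dataset from the original training split.

    omega_values : (n,) biasing function evaluated at each original
                   training observation (known up to a constant factor).
    size         : number of observations to draw, with replacement.
    Returns the indices of the sampled observations.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(omega_values, dtype=float)
    p = p / p.sum()                  # normalize into sampling probabilities
    return rng.choice(len(p), size=size, replace=True, p=p)

# Example: a source observing class 0 three times as often as class 1.
labels = np.array([0, 1] * 500)
omega = np.where(labels == 0, 3.0, 1.0)
idx = sample_biased_dataset(omega, size=10_000, rng=np.random.default_rng(0))
```

With these weights, roughly 75% of the drawn observations belong to class 0, reproducing the intended selection bias; some original observations may be drawn several times and others not at all, as noted in the class-imbalance experiment below.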
In step 3, we compute the $\hat\Omega_i$, which according to the computations in the Appendix amounts to minimizing a convex function, defined in Equation 13 therein. As this objective is separable in the observations, we use the PyTorch [24] implementation of Stochastic Gradient Descent (SGD), run for a fixed number of iterations with a fixed batch size, learning rate, and momentum. In step 4, we train the ResNet56 implementation of [11] for 205 epochs using SGD with momentum and a fixed initial learning rate, which is divided by a constant factor at two scheduled epochs. Our implementation is available at https://tinyurl.com/5ahadh7c.
IV-B Class Imbalance Scenario
In this section, we assume that the conditional probability of the label $Y$ given the input $X$ is the same for all the distributions $P_i$: for all $i \le K$ and any $(x, y)$, we have $P_i(Y = y \mid X = x) = P(Y = y \mid X = x)$. However, the class probabilities $P_i(Y = y)$ may vary from one distribution to the other, and we recover the link to imbalanced learning. This is a common scenario, already studied in [33] for instance. In this context, it can be shown that the biasing function depends on the label only, and is proportional to the likelihood ratio between the class proportions of the biased dataset and those of the testing distribution. Indeed, using (2) we obtain

$$\omega_i(y) \;\propto\; \frac{P_i(Y = y)}{P(Y = y)}. \quad (6)$$

Since we assume the classes to be balanced in the test set, $P(Y = y)$ is constant. It then follows from (6) that knowing $P_i(Y = y)$ gives $\omega_i$ up to a constant. Similarly, estimating $P_i(Y = y)$ from $\mathcal{D}_i$ also provides an estimate of $\omega_i$. Although our experiment considers the classes associated to the data points as the subject of the bias, any discrete attribute of the data can be used for debiasing, as long as: 1) the realizations of that attribute are known for the training set, and 2) its marginal distribution under the testing distribution is known. Such additional discrete attributes are called strata in [33].
Generation of the biased datasets. Let $K$ be the number of sources, chosen to divide the number of classes. We split the original training dataset into $K$ subsamples with different class proportions as follows. We first partition the classes into $K$ metaclasses $\mathcal{C}_1, \ldots, \mathcal{C}_K$, each containing the same number of classes. Then, we consider that each dataset $\mathcal{D}_i$ is composed of classes from metaclasses $\mathcal{C}_i$ and $\mathcal{C}_{i+1}$ only (with a cyclic convention on the indices), and that classes coming from the same metaclass are in the same proportion, governed by an overlap parameter. Note that the concatenation of the $K$ datasets has (in expectation) the same distribution over the classes as the original unbiased training dataset. A classifier learned on this concatenation will therefore serve as a reference. Note also that some observations of the original dataset may not be present in the resampled datasets, as a consequence of sampling with replacement. We illustrate the distribution over classes of all datasets for different values of the overlap parameter in Figure 4.
Approximating the biasing functions. Let $n_{i,y}$ be the number of observations with class $y$ in database $\mathcal{D}_i$. A natural approximation of $P_i(Y = y)$ is $n_{i,y}/n_i$. Furthermore, with the above construction of the biased datasets, the dataset sizes $n_i$ are the same for all $i$. Plugging these values into Equation (6), and since it is enough to know the $\omega_i$ up to a constant multiplicative factor (see Section III-A), a natural approximation is $\hat\omega_i(y) = n_{i,y}/n_i$. We then solve for the debiasing sampling weights by plugging the approximations $\hat\omega_i$ into the procedure described in Section III.
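This count-based approximation can be sketched as follows (variable names are ours):

```python
import numpy as np

def estimate_class_bias(labels_per_source, n_classes):
    """Estimate omega_i(y), up to a constant, as n_{i,y} / n_i.

    labels_per_source : list of K 1-D integer label arrays, one per source.
    n_classes         : total number of classes.
    Returns a (K, n_classes) array of approximate biasing-function values,
    each row known only up to a multiplicative constant.
    """
    omega = np.zeros((len(labels_per_source), n_classes))
    for i, labels in enumerate(labels_per_source):
        counts = np.bincount(labels, minlength=n_classes)
        omega[i] = counts / counts.sum()   # empirical class proportions
    return omega
```

A class absent from a source simply receives a zero value, which the debiased ERM procedure handles as long as that class is covered by another source (the support assumption of Section III-A).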
Analysis of the debiasing weights. In Figure 5, we plot the distribution over classes of the datasets weighted by the debiasing procedure, and compare it to the test distribution over classes, for 8 different runs of the procedure. We observe that the weighted distributions drift further from the test distribution and become noisier as the overlap parameter decreases. If we concatenated the databases into a single training dataset, i.e., used constant weights, the concatenated dataset would equal the original CIFAR training dataset in expectation with respect to our synthetic bias generation procedure. It is therefore desirable to recover uniform weights after solving the debiasing procedure. In Figure 6, we plot the distance between the obtained sampling weights and the uniform weights, as a function of the overlap parameter. The result shows that the distance from the optimal weights is higher when the overlap is low and lower when it is high, confirming the overlap assumptions introduced in Section III.
Empirical results. We provide testing accuracies for several values of the overlap parameter in Table I. A slight performance decrease is observed in both cases, stronger for CIFAR100. We see that the predictive performance increases as the overlap grows, while our estimate (over 8 runs) of the variance of the final accuracy decreases. Our results empirically confirm the necessity of a sufficient overlap between the databases for our debiasing procedure.
Table I: Test accuracies (reference and overlap values 0.10 and 0.20) for CIFAR10 and CIFAR100, with 2 times the standard deviation over 8 runs in parentheses. Learning with the original training dataset of CIFAR gives the Ref accuracy.
IV-C Image Acquisition Scenario
In many cases, the bias does not originate from class imbalances, but from an intrinsic property of the images, such as the illumination of the scene or the background. In this image acquisition scenario, we simulate the effect of learning with datasets collected on different backgrounds.
Generation of the biased datasets. In CIFAR10 and CIFAR100, the background varies significantly between images: some feature a textured background while others have a uniform, saturated background. We generate the biased datasets by grouping the images into bins derived from the average color value, in normalized HSV representation, of a 2-pixel border around the image, yielding a 3-dimensional representation of each image. Each coordinate of this 3-dimensional space is then split at the median value of the encodings of the training dataset, giving $2^3 = 8$ bins that split the training images of CIFAR into 8 datasets $\mathcal{D}_1, \ldots, \mathcal{D}_8$, the bin of dataset $\mathcal{D}_i$ being indexed by the binary representation of $i$. To satisfy the overlap condition, the biasing function of each source is not the hard indicator of its bin: it is defined for each image through the distance between the image representation and its projection onto the bin, so that images from neighboring bins retain a positive weight. We sample a fixed number of elements per dataset, with possible redundancy (due to sampling with replacement), using the biasing functions as sampling weights.
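The 3-dimensional border embedding can be sketched as follows (a minimal implementation of ours, using the standard-library `colorsys` module for the normalized HSV conversion; the exact representation used in the paper may differ in details):

```python
import colorsys
import numpy as np

def border_hsv_embedding(img, border=2):
    """Average HSV value of a `border`-pixel frame around the image.

    img : (H, W, 3) float array with RGB values in [0, 1].
    Returns a 3-dimensional embedding of the image background
    in normalized HSV space.
    """
    h, w, _ = img.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[:border, :] = mask[-border:, :] = True   # top and bottom rows
    mask[:, :border] = mask[:, -border:] = True   # left and right columns
    pixels = img[mask]                            # (n_border, 3) RGB values
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    return hsv.mean(axis=0)

# A uniformly red image has border embedding (hue 0, saturation 1, value 1).
red = np.zeros((32, 32, 3))
red[..., 0] = 1.0
z_red = border_hsv_embedding(red)
```

The interior pixels never enter the embedding, so the bias acts only on the background appearance, not on the depicted object itself.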
Approximating the biasing functions. For each dataset, we first compute the minimum and maximum of the average border values along each HSV dimension. We then approximate the biasing functions from the boxes delimited by these extremal values,

(7)

and solve for the sampling weights.
Balanced datasets scenario. We first consider a balanced scenario, where each dataset has the same number of instances. In this scenario, the debiased dataset remains biased in some sense, since (7) does not take into account the natural distribution of the CIFAR training dataset over the bins (see the top figure of Figure 8). Although the results should improve with the debiasing procedure, they cannot be compared directly to the classification accuracy obtained with the original training dataset of CIFAR. Following the outline of Section IV-B, we provide in Figure 8 the marginal distributions of the bias-corrected samples over the HSV dimensions of the average 2-pixel borders, and compare them to the test marginal distributions. We observe that the corrected distributions are closer to the true distributions, and less random, when the overlap is high, although some expected bias remains. We provide the corresponding accuracies, as well as an estimate of their variance, for the learned models in Table II. Our results show that the accuracy of the debiasing procedure increases with the overlap.
[Table II. Classification accuracy (with variance estimates) on CIFAR10 and CIFAR100 in the balanced scenario, for two values of the overlap parameter (1 and 10).]
Imbalanced datasets scenario. Dividing the data over HSV bins splits CIFAR into rather balanced bins, see Figure 9. In many practical scenarios, however, the different sources of data do not provide the same number of observations, and naively concatenating the databases can then lead to very poor results. We therefore study the relative performance of our debiasing methodology against the concatenation of the databases, for a long-tail distribution of the number of elements per dataset, as illustrated in Figure 9. Precisely, we fix where and . In this experiment, we account for the sizes of the datasets relative to the sizes of the bins in the test data through an additional multiplicative term in . Results are provided in Table III, highlighting the strong impact of overlap.
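A long-tail allocation of dataset sizes can be sketched as follows; the constants fixed in the paper's experiment were lost in the text above, so `ratio` is a placeholder choice, not the paper's value.

```python
import numpy as np

def longtail_sizes(n_total, n_datasets, ratio=0.5):
    """Illustrative long-tail allocation: each dataset receives a fixed
    fraction `ratio` of the previous dataset's share of the n_total
    observations, with the rounding remainder assigned to the head."""
    raw = ratio ** np.arange(n_datasets, dtype=float)
    sizes = np.floor(n_total * raw / raw.sum()).astype(int)
    sizes[0] += n_total - sizes.sum()     # assign the rounding remainder
    return sizes
```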
[Table III. Classification accuracy on CIFAR10 and CIFAR100 in the imbalanced scenario, for three values of the overlap parameter (0.1, 1, and 10).]
V Conclusion
In this work, we applied the reweighting procedure of [15] for learning from several biased datasets with known biasing functions to visual recognition problems. Their approach is not readily applicable to image data, since image datasets typically do not occupy the same portions of the space of all possible images. To circumvent this problem, we propose to express the biasing functions either 1) using additional information on the images, or 2) on smaller spaces obtained by transforming the information contained in the images. While our work demonstrates the effectiveness of a general methodology for learning with biased data, the choice of the biasing functions is deferred to the practitioner, and we cannot provide precise information on the expected effects of reweighting the data. For these reasons, we consider the two most promising directions for future work to be: 1) finding a general methodology for choosing relevant approximations of the biasing functions, and 2) predicting the impact of the reweighting procedure before training, which is especially important when dealing with large models.
Appendix
In this appendix, we provide the technical details of the computation of the . Recall that the are empirical estimators of the , which are meant to be plugged into Equation 4 to produce an unbiased estimate of . For all , we have
(8) 
and
(9) 
where is defined as
(10) 
Therefore, a natural estimator of is given by the solution to the system of equations
(11) 
where is the empirical counterpart of , i.e., defined as (10) but with instead of .
Note that (and similarly ) is homogeneous of degree , so that infinitely many solutions may exist.
To ensure uniqueness, one can enforce and solve the system composed of the first equations only.
This convention has no impact on , as it is equivalent to plugging or into Equation 4. The system (11) can then be rewritten as
(12) 
Now, the left-hand side of (12) can be interpreted as the component of the gradient of the function defined as
(13) 
Thus, solving system (11) is equivalent to solving system (12), and amounts to minimizing over the function , which has been shown in [7] to be strongly convex if the distributions overlap sufficiently. A standard Gradient Descent strategy (whose gradient is given by Equation 12) can thus be used, and converges towards the unique minimum. Recall also that
so that actually rewrites
The function is thus separable in the observations, and a stochastic or mini-batch variant of Gradient Descent can be used to reduce the computational cost.
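The system (11) can also be solved by the classical fixed-point iteration for the normalizing constants in multiple biased sampling, the standard scheme for Vardi-type estimators [31], which targets the same solution as the gradient strategy above. The concrete form below follows the biased-sampling literature rather than formulas recovered from this text, and the symbol names are ours.

```python
import numpy as np

def vardi_fixed_point(omega, sizes, n_iter=500, tol=1e-12):
    """Fixed-point iteration for the normalizing constants in multiple
    biased sampling.  omega[i, k] holds the k-th biasing function
    evaluated at the i-th pooled observation; sizes[k] is the k-th
    sample size.  A sketch of the standard Vardi-type scheme, not the
    paper's gradient-descent implementation."""
    n, K = omega.shape
    lam = np.ones(K)
    for _ in range(n_iter):
        # denom[i] = sum_l (n_l / (n * lam_l)) * omega_l(z_i)
        denom = omega @ (sizes / (n * lam))
        new_lam = (omega / denom[:, None]).mean(axis=0)
        new_lam /= new_lam[-1]            # pin the last constant to 1 (uniqueness)
        if np.max(np.abs(new_lam - lam)) < tol:
            return new_lam
        lam = new_lam
    return lam
```

Pinning the last coordinate to 1 implements the uniqueness convention discussed above.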
We conclude with a remark on the datasets' relative sizes. In this paper, we have assumed for simplicity that the dataset proportions are constant and equal to the ideal proportions . A natural way to relax this assumption is to consider random empirical proportions that converge to the ideal proportions as goes to infinity. Our methodology remains valid in this setting, as still converges towards .
Acknowledgments
This research work was partially supported by the research chair “Good In Tech: Rethinking innovation and technology as drivers of a better world for and by humans”, under the auspices of the “Fondation du Risque” and in partnership with the Institut Mines-Télécom, Sciences Po, Afnor, AG2R La Mondiale, CGI France, Danone and Sycomore.
References
 [1] (2010) A theory of learning from different domains. Machine Learning 79 (1-2), pp. 151–175. Cited by: §I, §II.
 [2] (2016) Domain separation networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 343–351. Cited by: §I.
 [3] (2018) Large scale fine-grained categorization and domain-specific transfer learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 4109–4118. Cited by: §II.
 [4] (1996) A probabilistic theory of pattern recognition. Springer. Cited by: §IIIA.
 [5] (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research. Cited by: Fig. 1, §I, §II.
 [6] (2017) Borrowing treasures from the wealthy: deep transfer learning through selective joint finetuning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 10–19. Cited by: §II.
 [7] (1988) Large sample theory of empirical distributions in biased sampling models. Annals of Statistics 16 (3), pp. 1069–1112. Cited by: title, §I, §IIIA, §V.
 [8] (2019) Face Recognition Vendor Test (FRVT) — Performance of Automated Gender Classification Algorithms. Technical report Technical Report NISTIR 8052, National Institute of Standards and Technology (NIST). Cited by: §I.
 [9] (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Proceedings, Part III, Lecture Notes in Computer Science, Vol. 9907, pp. 87–102. Cited by: §I.
 [10] (2019) SpotTune: transfer learning through adaptive finetuning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 1620, 2019, pp. 4805–4814. Cited by: §II.
 [11] (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: 1512.03385 Cited by: §IVA.
 [12] (2018) Women also snowboard: overcoming bias in captioning models. In Computer Vision - ECCV 2018 - 15th European Conference, Lecture Notes in Computer Science, Vol. 11207, pp. 793–811. Cited by: §II.
 [13] (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 0749, University of Massachusetts, Amherst. Cited by: §I.
 [14] (2019) Learning not to learn: training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II.
 [15] (2019) Statistical learning from biased training samples. CoRR abs/1906.12304. External Links: 1906.12304 Cited by: title, §I, §II, §III, §IIIA, §IIIB, §V.
 [16] (2018) Learning to generalize: meta-learning for domain generalization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), pp. 3490–3497. Cited by: §I, §II.
 [17] (2015) Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, JMLR Workshop and Conference Proceedings, Vol. 37, pp. 97–105. Cited by: §II.
 [18] (2016) Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 136–144. Cited by: §I, §II.
 [19] (2008) Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, pp. 1041–1048. Cited by: §II.
 [20] (2021) Long-tail learning via logit adjustment. In 9th International Conference on Learning Representations, ICLR 2021. Cited by: §IV.
 [21] (2018) Domain adaptive transfer learning with specialist models. CoRR abs/1811.07056. External Links: 1811.07056 Cited by: §II.
 [22] (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §II, §II.
 [23] (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §I.
 [24] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §IVA.
 [25] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §IIIA.
 [26] (2011) An other-race effect for face recognition algorithms. ACM Transactions on Applied Perception 8 (2), pp. 14:1–14:11. Cited by: §I.
 [27] (2009) Dataset shift in machine learning. The MIT Press. Cited by: §IIIB.
 [28] (2018) Efficient parametrization of multi-domain deep neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8119–8127. Cited by: §II.
 [29] (2012) Machine learning in nonstationary environments: introduction to covariate shift adaptation. The MIT Press. Cited by: §IIIB.
 [30] (2015) Simultaneous deep transfer across domains and tasks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, USA, pp. 4068–4076. External Links: ISBN 9781467383912 Cited by: §I, §II.
 [31] (1985) Empirical distributions in selection bias models. Ann. Statist. 13 (1), pp. 178–203. Cited by: §I, §II, §IIIA.
 [32] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp. 3630–3638. Cited by: §I.
 [33] (2020) Weighted empirical risk minimization: sample selection bias correction based on importance sampling. CoRR abs/2002.05145. External Links: 2002.05145 Cited by: §I, §II, §IVB.
 [34] (2019) Racial faces in the wild: reducing racial bias by information maximization adaptation network. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, pp. 692–702. Cited by: §I.
 [35] (2020) Learning to transfer learn: reinforcement learning-based selection for adaptive transfer learning. arXiv preprint arXiv:1908.11406. Cited by: §II.