1 Background
Our work relates to several research topics including Quantification and Domain Adaptation (DA).
Quantification.
Quantification [1] is the task of estimating the label distribution in a target dataset using a classifier that was trained on a source dataset. It is usually assumed that the conditional probabilities of input given a label remain fixed. Quantification can answer practical questions, such as how many people in a given population are infected with a disease. The most straightforward Quantification method is based on a confusion matrix, which relates the label distribution estimated by the classifier to the actual label distribution [2, 3, 4]. In this paper we follow this approach.
Domain Adaptation.
DA [5] is the task of adjusting a model from a source dataset to a different, yet related, target dataset. Unlike Quantification, DA focuses on optimizing classification.
DA methods differ in supervision resources. Some methods are supervised, some are semi-supervised and others are unsupervised [6]. This work focuses on the unsupervised case.
We wish to characterize 3 types of DA (see Figure (1)).
1) label distribution, $p(y)$, differs between source and target, but conditional probabilities of input given a label, $p(x|y)$, remain fixed. This scenario is known in the literature as a label shift (Figure (1a)).
2) label distribution remains fixed, but conditional probabilities of input given a label differ. This scenario is known in the literature as a conditional shift.
3) both label distribution and conditional probabilities of input given a label differ (Figure (1c)). This is the most challenging type of DA. A notable example of a method tackling this task using adversarial techniques is presented in [7].
2 Introduction
In scenarios where classes are extremely imbalanced and overlapping (meaning the same input may be assigned, with different probabilities, to different classes), DA is very much needed, even if the target dataset is quite similar to the source.
Due to the overlap, the classifier will generally not learn a definite $p(y|x)$, and so it will rely heavily on the $p(y)$ that it has learned from the source dataset. However, due to the extreme imbalance, even a small label shift will cause significant differences in $p(y|x)$, especially for small classes, thus leading to poor results.
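The sensitivity described above can be illustrated with a short calculation. The following is a minimal sketch, assuming a two-class problem and an input on which the classes overlap completely (likelihood ratio of 1); the prior values are hypothetical and serve only as an illustration.

```python
def posterior_rare(prior_rare, likelihood_ratio=1.0):
    """Posterior of the rare class in a two-class problem, via Bayes' rule.

    likelihood_ratio = p(x | rare) / p(x | prevalent); 1.0 models full overlap.
    """
    numerator = likelihood_ratio * prior_rare
    return numerator / (numerator + (1.0 - prior_rare))


# Hypothetical priors: a shift that is tiny in absolute terms (0.1% -> 0.2%).
p_source = posterior_rare(0.001)
p_target = posterior_rare(0.002)
ratio = p_target / p_source  # relative change of the rare-class posterior
print(p_source, p_target, ratio)
```

Under full overlap the posterior equals the prior, so the small absolute shift doubles the probability the classifier should assign to the rare class.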
A good example of such datasets can be found in the medical field. Consider the task of classifying a list of symptoms to a condition [8, 9]. In this scenario, as symptoms alone cannot usually be used to pinpoint a single condition, the best a symptom-based classifier can do is assign reasonable probabilities to a number of conditions related to the reported symptoms. Some of these conditions may be more prevalent than others by several orders of magnitude. In fact, some conditions may be so rare and overshadowed by other conditions that none of the cases will be classified to them (it is useful to include such conditions in datasets in order to generate a differential diagnosis from the classifier’s outcome, for example). A label shift is expected in this scenario, as condition prevalence often varies significantly between populations and time periods. Less intuitively, a conditional shift is also expected. Symptom distributions given conditions tend to change between populations and time periods due to differences in reporting methods (e.g., two different machine-patient dialogue algorithms that are used to extract symptoms).
Here we present a novel scheme for unsupervised DA which is tailored to imbalanced and overlapping datasets and works under both label and conditional shifts. Before we do so, we introduce a Quantification method which is known in the literature but is underexplored. We rederive it and analyze it in light of a more common approach, in order to set the building blocks for the remainder of the work. Then, we briefly describe a standard technique which employs Quantification to perform DA under a label shift. Finally, we use the Quantification and DA methods mentioned above to derive our novel DA scheme. The scheme is discussed theoretically and demonstrated on datasets, generated from Electronic Health Records (EHRs), involving the classification of a list of symptoms to a condition.
3 The Domain Adaptation Scheme
3.1 Quantification Under a Label Shift
Consider a dataset in which classes are highly imbalanced and overlapping, and a classifier trained on this dataset. We would like to use this classifier in order to estimate the label distribution in an unlabeled target set (at this point it is assumed there is no conditional shift). When using the classifier as is, for reasons explained above, the estimation is expected to be very poor.
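Denoting the label distribution as $p(y_j)$, and the probability that the classifier assigns to class $y_i$ given input $x$ as $\hat{p}(y_i|x)$, the classifier’s estimate of the label distribution is given by $\hat{p}(y_i) = \int \hat{p}(y_i|x)\, p(x)\, dx$.
In order to quantify the classifier’s confusion we define the soft confusion matrix
(1) $C_{ij} = \int \hat{p}(y_i|x)\, p(x|y_j)\, dx$
Next, plugging in the definitions above together with $p(x) = \sum_j p(x|y_j)\, p(y_j)$, we obtain $\hat{p}(y_i) = \sum_j C_{ij}\, p(y_j)$, and so
(2) $p(y_i) = \sum_j (C^{-1})_{ij}\, \hat{p}(y_j)$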
Note that the soft confusion matrix $C$ can be estimated from the source dataset, while the classifier’s estimate of the label distribution, $\hat{p}(y)$, can be estimated from the target dataset using a classifier trained on the source dataset. Hence, Eq. (2) provides a recipe for performing Quantification, i.e. for evaluating the label distribution in a target dataset, without using labels from the target dataset. The classifier does not have to be calibrated. In fact, this procedure may be used to calibrate classifiers on the source dataset as well. Note that this procedure works only if $C$ is nonsingular. $C$ will be singular if the classifier assigns zero probability to one or more classes for every input, or if the expected class distribution given a certain label is a linear combination of the expected class distributions given other labels. $C$ is very unlikely to be singular, but the closer it gets to singularity the more sensitive to noise the procedure becomes.
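The recipe of Eq. (2) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in this work: the function names are hypothetical, every class is assumed to appear in the source labels, and clipping small negative estimates before renormalizing is an added assumption for noisy data.

```python
import numpy as np


def soft_confusion_matrix(probs_source, labels_source, n_classes):
    """C[i, j] = mean predicted probability of class i over source points labeled j."""
    C = np.zeros((n_classes, n_classes))
    for j in range(n_classes):
        C[:, j] = probs_source[labels_source == j].mean(axis=0)
    return C


def estimate_label_distribution(C, probs_target):
    """Solve C p = q, where q is the mean predicted probability on the target."""
    q = probs_target.mean(axis=0)
    p = np.linalg.solve(C, q)
    p = np.clip(p, 0.0, None)  # assumption: guard against negative values caused by noise
    return p / p.sum()


# Toy data (made up): the classifier outputs one fixed vector per true class.
out = np.array([[0.8, 0.2], [0.3, 0.7]])
labels_source = np.array([0] * 5 + [1] * 5)   # balanced source
labels_target = np.array([0] * 9 + [1] * 1)   # shifted target; labels unused below
C = soft_confusion_matrix(out[labels_source], labels_source, n_classes=2)
p_hat = estimate_label_distribution(C, out[labels_target])
print(p_hat)
```

Only the classifier outputs on the target are used in the last step; the target labels appear solely to generate the toy outputs.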
The procedure described above is known in the literature but is usually performed with a hard confusion matrix rather than a soft one (that is, the probability the classifier assigns to each class is replaced with its binary equivalent: 1 for the class with the highest probability and 0 elsewhere).
The soft confusion matrix is mostly mentioned in the literature as a replacement for the hard confusion matrix in case it is singular [4, 10].
Here, we claim that in highly overlapping and imbalanced datasets this should be the confusion matrix used for several reasons:
1) singularity: if none of the data points in the source dataset are classified as a certain class, the hard confusion matrix becomes singular. As mentioned above, this often happens in such datasets since rare classes tend to be underestimated [11].
2) noise: comparing the expected noise of Quantification in the limit of extremely imbalanced and overlapping classes, we find that the noise in the soft approach is smaller. Consider, for simplicity, datasets with two classes, one rare and one prevalent. For the rare class, whose probability is , only of its instances are classified correctly ( is determined by and the nature of the overlap). It is easy to show that under reasonable assumptions, the relative noise of the hard approach for the rare class probability estimation will scale like , while that of the soft estimation will scale like .
3) sensitivity to conditional shift: using the hard approach, when classes are not extremely imbalanced and overlapping, minor conditional shifts will have a small effect on Quantification results (see Figure (2a)). However, the more imbalance and overlap exists, the more Quantification is sensitive to minor conditional shifts (see Figure (2b)). This is due to the fact that rare classes are underestimated. Once of the data points associated with a rare class are classified correctly, even a very small conditional shift (compared to the scale of the class contour) can lead, on one hand to an , and on the other hand to (which results in a singular confusion matrix). The changes in the soft approach will be much smoother.
4) generalization to DA with conditional shifts. An explanation will follow in Section 3.3.
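The singularity argument in point 1) can be checked on a toy example. The sketch below assumes made-up classifier outputs in which the rare class never receives the highest probability; the helper name is hypothetical.

```python
import numpy as np


def confusion_matrix(probs, labels, n_classes, hard=False):
    """Column j holds the mean (soft or one-hot) prediction over points labeled j."""
    if hard:
        # Replace each probability vector by a one-hot vector at its argmax.
        probs = np.eye(n_classes)[probs.argmax(axis=1)]
    return np.stack([probs[labels == j].mean(axis=0) for j in range(n_classes)], axis=1)


# Eight prevalent-class points and two rare-class points (made-up outputs).
# The rare class is never the most probable one.
probs = np.array([[0.9, 0.1]] * 8 + [[0.7, 0.3]] * 2)
labels = np.array([0] * 8 + [1] * 2)

C_hard = confusion_matrix(probs, labels, 2, hard=True)
C_soft = confusion_matrix(probs, labels, 2, hard=False)
print(np.linalg.matrix_rank(C_hard), np.linalg.det(C_soft))
```

Both columns of the hard matrix coincide, so it cannot be inverted, whereas the soft matrix retains the difference between the two classes.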
3.2 From Quantification to DA
So far, we have shown how to estimate the label distribution in a target dataset. In order to calibrate the classifier to the target set, and thus improve its classification quality, we follow a known procedure [12]. Recall that every (calibrated) classifier can be described by Bayes’ law,
(3) $p^s(y_i|x) = \frac{p^s(x|y_i)\, p^s(y_i)}{p^s(x)}$
where the superscript ’s’ emphasizes that these quantities are calibrated to the source dataset. Once we have an estimate of the class distribution in the target dataset, it may be used to calibrate the classifier as follows:
(4) $p^t(y_i|x) = \frac{p^s(y_i|x)\, p^t(y_i) / p^s(y_i)}{\sum_j p^s(y_j|x)\, p^t(y_j) / p^s(y_j)}$
(the superscript ’t’ stands for ’target’) where, again, we have assumed no conditional shift.
This implies that every classifier trained on the source dataset can be calibrated to a target dataset, in which class distribution is potentially very different than that of the source, with only a small number of numerical operations.
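The calibration step amounts to reweighting each class probability by the ratio of target to source priors and renormalizing, as in [12]. A minimal sketch, assuming the classifier outputs are calibrated to the source and that the function name and toy values are illustrative only:

```python
import numpy as np


def recalibrate(probs_source, prior_source, prior_target):
    """Reweight source-calibrated posteriors by prior_target / prior_source, then renormalize."""
    unnormalized = probs_source * (prior_target / prior_source)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)


# Toy example: an input on which a source-calibrated classifier is undecided.
probs = np.array([[0.5, 0.5]])
adapted = recalibrate(probs, np.array([0.5, 0.5]), np.array([0.9, 0.1]))
print(adapted)
```

For an input on which the source classifier is undecided, the adapted output simply follows the target prior.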
3.3 Adding a Conditional Shift
No conditional shift has been assumed above. However, in many cases this assumption is not realistic, hence we wish to relax it. In order to do so, we divide the input space into subspaces and apply the Quantification and calibration schemes to each subspace separately. This will work if for every subspace $S_k$ and every class $y_i$ the following condition is satisfied:
(5) $p^t(x|y_i) = \alpha_{ik}\, p^s(x|y_i)$ for all $x \in S_k$
where $\alpha_{ik}$ are constants. This is because when this condition is satisfied, the conditional shift translates in every single subspace into a label shift,
(6) $p^t(x, y_i|S_k) = p^s(x|y_i, S_k)\, p^t(y_i|S_k)$
where $p^t(y_i|S_k)$ is the probability of class $y_i$ in $S_k$.
In order to satisfy this condition, the shifts in $p(x|y)$ need to be on a larger scale than the size of the subspaces. We are aware of two plausible scenarios where this happens:
1) subclasses: consider a scenario in which classes are composed of separated subclasses, and where the conditional shift results only from different weighting of the subclasses. In this scenario, if the space division corresponds to the subclasses, condition 5 will be satisfied by definition.
In the medical domain, such a scenario is plausible as conditions are often composed of “subconditions”. For example, Viral Upper Respiratory Infections and Bacterial Upper Respiratory Infections are both composed of several “subconditions”, such as conditions localized to the nasal passages, conditions of pharyngeal origin and others.
For different populations or different points in time, the ratio between various subconditions may change (in a different manner for each condition), leading to a conditional shift of the kind this approach can handle.
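2) conditional independence of features with respect to the classes: denoting the input vector as $x = (x_1, \dots, x_n)$ (where $n$ is the number of features), and assuming it can be split into $m$ vectors $x^{(1)}, \dots, x^{(m)}$ which satisfy conditional independence with respect to the classes, we get $p(x|y) = \prod_{l=1}^{m} p(x^{(l)}|y)$. Without loss of generality, suppose the conditional shift results from changes in $p(x^{(1)}|y)$. Then, assuming categorical features, by dividing the input space in a way which corresponds to the different possible combinations of $x^{(1)}$, conditional independence assures that condition 5 is satisfied, as
(7) $p^t(x|y_i) = p^t(x^{(1)}|y_i) \prod_{l=2}^{m} p^s(x^{(l)}|y_i) = \frac{p^t(x^{(1)}|y_i)}{p^s(x^{(1)}|y_i)}\, p^s(x|y_i)$
where the prefactor is constant within each subspace, since $x^{(1)}$ is fixed there.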
In the medical domain, conditional independence of features (or groups of features) is often satisfied or at least approximated. If needed, the groups may be evaluated from the source dataset. Suppose a classifier is trained with data obtained using a certain dialogue algorithm that asks users questions regarding (categorical, usually even binary) symptoms. If at some point in the future this algorithm is changed, this will most definitely lead to shifts in $p(x_j|y)$ for some features $x_j$. Because of the unknown label shift, it is impossible to know in advance which features were changed; however, the dialogue algorithm and the global feature distribution may provide us with leads. In many cases, splitting by the possible values of the most prominent features may suffice.
We note that in order to work separately on different subspaces, soft confusion matrices must be used, as all problems mentioned in Section 3.1 regarding hard confusion matrices worsen when we apply the Quantification and calibration schemes on small subspaces. This is because in certain subspaces, some rare classes will be even rarer. In this context, it is worth mentioning that even though we assume an extremely overlapping dataset, some classes may have no support in some subspaces (meaning that in the source dataset, none of the data points in the subspace are labeled to these classes). In order to avoid singularity (or poor results) one should include in each subspace only classes that are supported in it.
Importantly, while bias decreases with the number of subspaces, variance will increase. This occurs due to the following:
1) the fewer data points that exist in a subspace, the more the scheme is vulnerable to noise. Denoting the number of subspaces as $K$, and assuming they all contain approximately the same number of data points, the noise scales like $\sqrt{K}$.
2) the smaller the subspace, the more uniform the predictions in the subspace become, hence, the closer the confusion matrix gets to singularity. In the limit where the subspace area goes to zero, all predictions in the subspace will be identical and the confusion matrix will become singular.
Therefore, the space cannot be divided into arbitrarily small subspaces.
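The per-subspace scheme of this section can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the scheme as described: every subspace contains source points, classifier outputs are renormalized over the classes supported in the subspace, the linear system is solved by least squares, and estimated priors are clipped to stay positive. All names and toy values are hypothetical.

```python
import numpy as np


def adapt_by_subspace(probs_src, labels_src, sub_src, probs_tgt, sub_tgt, n_classes):
    """Quantify and recalibrate separately in each subspace.

    probs_*: (n, n_classes) classifier outputs; sub_*: integer subspace id per point.
    """
    adapted = np.zeros_like(probs_tgt)
    for k in np.unique(sub_tgt):
        s, t = sub_src == k, sub_tgt == k
        counts = np.bincount(labels_src[s], minlength=n_classes)
        supported = np.flatnonzero(counts > 0)  # classes with source support here
        # Restrict outputs to supported classes and renormalize (assumption).
        ps = probs_src[s][:, supported]
        ps = ps / ps.sum(axis=1, keepdims=True)
        pt = probs_tgt[t][:, supported]
        pt = pt / pt.sum(axis=1, keepdims=True)
        ys = labels_src[s]
        # Soft confusion matrix within the subspace.
        C = np.stack([ps[ys == c].mean(axis=0) for c in supported], axis=1)
        prior_src = counts[supported] / counts.sum()
        # Least squares is an assumption; it tolerates a near-singular C.
        prior_tgt, *_ = np.linalg.lstsq(C, pt.mean(axis=0), rcond=None)
        prior_tgt = np.clip(prior_tgt, 1e-12, None)
        prior_tgt /= prior_tgt.sum()
        w = pt * (prior_tgt / prior_src)  # prior-ratio reweighting
        adapted[np.ix_(np.flatnonzero(t), supported)] = w / w.sum(axis=1, keepdims=True)
    return adapted


# Toy data (made up): outputs depend only on (true class, subspace).
out = {(0, 0): [0.8, 0.2], (1, 0): [0.3, 0.7], (0, 1): [0.6, 0.4], (1, 1): [0.1, 0.9]}


def make(spec):
    """spec: list of (label, subspace, count) triples."""
    y = np.concatenate([np.full(n, c) for c, k, n in spec])
    s = np.concatenate([np.full(n, k) for c, k, n in spec])
    p = np.array([out[(int(c), int(k))] for c, k in zip(y, s)])
    return p, y, s


probs_src, labels_src, sub_src = make([(0, 0, 5), (1, 0, 5), (0, 1, 5), (1, 1, 5)])
probs_tgt, _, sub_tgt = make([(0, 0, 9), (1, 0, 1), (0, 1, 2), (1, 1, 8)])
adapted = adapt_by_subspace(probs_src, labels_src, sub_src, probs_tgt, sub_tgt, 2)
print(adapted[0])
```

In this toy setting the class mix differs between the two subspaces of the target, so a single global prior correction would not recover both; handling each subspace separately does.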
4 Experiments
We conducted two experiments with our scheme. Both involved the same data source and classifier:
Data source
The data used is based on EHR data from Maccabi HMO (Israel) encompassing visits of adults aged 18 years and older to their family physician. Raw EHR cases were transformed into a vector of 1221 binary symptoms (reported vs not reported) which were then fed to the classifier along with sex and age (age was the only non-binary feature). The classifier classified each of them as one of 82 possible conditions. Data was divided randomly into source and target datasets. Following the source-target split, the source dataset was altered by excluding cases. It was decided to alter the source (rather than the target) as it simulates a more realistic scenario. Details pertaining to the alterations are discussed below, separately for each experiment.
Classifier architecture
The classifier was implemented by a 4-layer neural net. ReLU activation was used twice for nonlinearity. Batch Normalization was used twice as well. The last layer was followed by a Softmax activation to ensure that output values are all positive and sum to one. In training, Dropout was applied to avoid overfitting and cross-entropy loss was used (along with the Softmax activation) to assign the classifier a probabilistic nature.
Described below are the experiments we performed.
4.1 Quantification under a Label Shift
To this end, each condition was separately chosen to keep a certain ratio of its original data points. The ratio was selected from a uniform distribution within the range
. Excluded data points were chosen randomly. The process culminated in approximately 11.3M data points in the source dataset and 1.9M in the target.
We compared the classifier’s results without calibration to results obtained with calibrations using the hard and soft approaches. Results are presented in Figure (3). The soft approach performed considerably better than the hard approach.
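The alteration of the source dataset described above can be sketched as a per-class random subsampling. In this sketch the bounds `low` and `high` of the uniform range are hypothetical parameters, and keeping at least one point per class is an added assumption.

```python
import numpy as np


def subsample_per_class(labels, low, high, rng):
    """Return sorted indices kept after each class retains a uniformly drawn fraction."""
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        ratio = rng.uniform(low, high)                  # retained fraction for class c
        n_keep = max(1, int(round(ratio * len(idx))))   # assumption: never drop a class entirely
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))


rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], [100, 50, 10])   # made-up class sizes
kept = subsample_per_class(labels, 0.2, 1.0, rng)
print(np.bincount(labels[kept]))
```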
4.2 DA under both Label and Conditional shifts
To this end we have only considered 8 respiratory conditions: Strep (Streptococcal) Pharyngitis, Upper Respiratory Infection, Viral Pharyngitis, Allergic Rhinitis, Acute Sinusitis, Asthma, Influenza & Pneumonia. As previously ascertained, samples were excluded from the source dataset. Each condition kept a certain ratio of its original data points, such that the source dataset became relatively balanced (a ratio of approximately between the prevalence of the largest and the smallest classes). The exclusion of data points was conducted by assigning each data point with a weight determined by the values of 3 prominent symptoms: runny nose, sore throat & cough (assigned weights were different for each condition). We then randomly selected data points using the weights as exclusion probabilities. This changed the conditional probability of other symptoms in the source dataset as well. Eventually, there were approximately 798K data points in the source dataset and 287K in the target.
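The weighted exclusion described above can be sketched as a table lookup followed by a random draw. The layout of the weight table (one exclusion probability per class and per combination of the selected binary symptoms) and all names below are assumptions made for illustration.

```python
import numpy as np


def exclude_by_weight(X, labels, symptom_cols, weights, rng):
    """Return indices of points that survive the weighted exclusion.

    weights has shape (n_classes, 2, ..., 2): the exclusion probability for a class,
    indexed by the values of the selected binary symptoms.
    """
    keys = tuple(X[:, symptom_cols].astype(int).T)   # one index array per symptom
    p_exclude = weights[(labels,) + keys]            # per-point exclusion probability
    keep = rng.random(len(labels)) >= p_exclude
    return np.flatnonzero(keep)


# Toy example with three symptoms and two classes (made-up values).
X = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
labels = np.array([0, 0, 1, 1])
weights = np.zeros((2, 2, 2, 2))
weights[0, 1] = 1.0   # always exclude class-0 points that report the first symptom
kept = exclude_by_weight(X, labels, [0, 1, 2], weights, np.random.default_rng(0))
print(kept)
```

Because the exclusion probability depends on both the class and the symptom values, the surviving source data exhibits a conditional shift relative to the target in addition to a label shift.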
Results of the classifier without calibration were compared to results obtained with calibrations: first, a calibration using the hard approach without input space division (the hard approach with input space division failed due to singular confusion matrices); then, calibrations using the soft approach, with and without input space division. The input space division involved PCA to reduce the number of dimensions in the input space to 6, followed by KNN to divide the space into 5 subspaces.
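A division of the input space along these lines can be sketched with NumPy alone. The text names KNN for the division step without further detail; the sketch below substitutes k-means clustering as a stand-in, which is an assumption and may differ from the procedure actually used. Function names, the toy data and the parameter values are illustrative.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the data mean and the leading principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    return (X - mean) @ components.T


def assign_subspace(Z, centers):
    """Index of the nearest center for each embedded point."""
    d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans_fit(Z, k, rng, n_iter=50):
    """Plain Lloyd iterations; a stand-in for the division step (assumption)."""
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        assign = assign_subspace(Z, centers)
        for j in range(k):
            if np.any(assign == j):   # leave an empty cluster where it is
                centers[j] = Z[assign == j].mean(axis=0)
    return centers


# Toy data: two well-separated groups in 5 dimensions (made up).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 5)), rng.normal(10.0, 0.1, size=(20, 5))])
mean, components = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, components)
centers = kmeans_fit(Z, k=2, rng=rng)
sub = assign_subspace(Z, centers)
print(sub)
```

The embedding and the centers would be fitted on the source data; target points are then embedded with the same `mean` and `components` and assigned with `assign_subspace`.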
Quantification results are presented in Figure (4) and DA results are presented in Table (1). Due to the conditional shift, both the soft and hard approaches fail to achieve good results without dividing the input space into subspaces. In the Quantification task, the results without dividing the input space were at least considerably better than those obtained by a classifier calibrated to the source dataset (note that the hard approach reached better results than the soft approach; we believe this is because the imbalance in the source dataset was not extreme). In the DA task, the results without dividing the input space were not significantly better than those of the classifier calibrated to the source dataset. The soft approach with input space division was superior to the other methods and yielded very good results for both Quantification and DA.
| Method | Approach | Top1 | Top3 |
|---|---|---|---|
| No DA | | 0.51 | 0.79 |
| DA without input space division | hard | 0.51 | 0.80 |
| DA without input space division | soft | 0.51 | 0.80 |
| DA with input space division | soft | 0.57 | 0.83 |
| Source dataset | | 0.60 | 0.87 |
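The Top1 and Top3 columns are presumably top-k accuracies, i.e. the fraction of cases whose true condition is among the k most probable predictions; this reading is an assumption, as the metric is not defined explicitly. Under that assumption it could be computed as follows (toy values are made up).

```python
import numpy as np


def top_k_accuracy(probs, labels, k):
    """Fraction of points whose true label is among the k highest-probability classes."""
    top_k = np.argsort(-probs, axis=1)[:, :k]
    return float(np.mean((top_k == labels[:, None]).any(axis=1)))


probs = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7], [0.4, 0.35, 0.25]])
labels = np.array([0, 1, 1])
top1 = top_k_accuracy(probs, labels, 1)
top2 = top_k_accuracy(probs, labels, 2)
print(top1, top2)
```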
5 Discussion
In this work we introduced a novel DA approach tailored to a challenging scenario involving both label and conditional shifts in highly imbalanced and overlapping datasets.
This approach is based on embedding the input into a low dimensional space, dividing it into subspaces and calibrating the classifier separately for each subspace.
We have experimented with this scheme and found that it reaches very good results for both Quantification and DA.
A simple method of dividing the input space to subspaces was examined. It involved PCA for dimensionality reduction followed by KNN for the division. It would be interesting to investigate, both theoretically and experimentally, how different methods of space division may improve our results. Additional improvements to our results may be achieved by using ensembles of classifiers, as suggested by [4].
Finally, it would be interesting to further study this approach in light of an Active Learning scheme. The relation between Active Learning and DA has been studied extensively [13]. In the context of this work, it would be exciting to envision new sampling strategies that optimize learning new information by “sacrificing” source-target similarity, but in such a way that we would know in advance that high quality DA is guaranteed.
Potential applications in COVID-19
Researchers are diligently working to estimate both COVID-19 prevalence in given populations and the probability that individuals will be diagnosed with COVID-19 based on reported symptoms [14, 15].
Since the symptom distribution of this disease overlaps with a variety of other respiratory tract conditions, the above-mentioned estimations are highly sensitive to the ratio of COVID-19 prevalence in the source and target datasets. Meanwhile, prevalence in the source is arbitrary and prevalence in the target may change by several orders of magnitude in a short amount of time.
An additional complication may arise from the fact that the conditional (reported) symptom distribution, given each condition, may change between source and target. Two plausible reasons are: 1) the source is based on physician reports while the target is based on user self-reports, and 2) as additional symptoms are associated with COVID-19, the population’s awareness of them increases over time (e.g., loss of the sense of smell and taste).
The scheme we have proposed in this paper can be used to generate reliable estimations, both on the population level and the individual level, under the given circumstances.
Moreover, it is suitable for categorical input, unlike a number of other methods requiring smooth input.
Acknowledgements
The authors would like to thank Omer Sidis, Yael Steuerman, Amit Wolecki, Uri Shalit and Gabi Stanovsky for helpful discussions.
References
[1] P. González, A. Castaño, N. V. Chawla, and J. J. D. Coz, “A review on quantification learning,” ACM Computing Surveys (CSUR), vol. 50, no. 5, pp. 1–40, 2017.
[2] G. J. McLachlan and K. E. Basford, Mixture models: Inference and applications to clustering, vol. 38. M. Dekker New York, 1988.
[3] G. McLachlan, Discriminant analysis and statistical pattern recognition. Wiley, New York, 1992.
[4] Z. C. Lipton, Y.-X. Wang, and A. Smola, “Detecting and correcting for label shift with black box predictors,” arXiv preprint arXiv:1802.03916, 2018.
[5] I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani, Advances in Domain Adaptation Theory. Elsevier, 2019.
[6] W. M. Kouw and M. Loog, “A review of domain adaptation without target labels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019.
[7] B. Chidlovskii, “Using latent codes for class imbalance problem in unsupervised domain adaptation,” arXiv preprint arXiv:1909.08962, 2019.
[8] J. Wu, J. Roy, and W. F. Stewart, “Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches,” Medical care, pp. S106–S113, 2010.
[9] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis,” IEEE journal of biomedical and health informatics, vol. 22, no. 5, pp. 1589–1604, 2017.
[10] S. Garg, Y. Wu, S. Balakrishnan, and Z. C. Lipton, “A unified view of label shift estimation,” arXiv preprint arXiv:2003.07554, 2020.
[11] G. Forman, “Quantifying counts and costs via classification,” Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 164–206, 2008.
[12] M. Saerens, P. Latinne, and C. Decaestecker, “Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure,” Neural computation, vol. 14, no. 1, pp. 21–41, 2002.
[13] G. Csurka, “A comprehensive survey on domain adaptation for visual applications,” in Domain adaptation in computer vision applications, pp. 1–35, Springer, 2017.
[14] A. S. S. Rao and J. A. Vazquez, “Identification of COVID-19 can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine,” Infection Control & Hospital Epidemiology, pp. 1–18, 2020.
[15] A. Imran, I. Posokhova, H. N. Qureshi, U. Masood, S. Riaz, K. Ali, C. N. John, and M. Nabeel, “AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app,” arXiv preprint arXiv:2004.01275, 2020.