1 Introduction
Text classification is one of the most important tasks in natural language processing (NLP). However, many text data sets are unlabelled. Moreover, text data is domain-dependent, and it is difficult to obtain annotated data for every domain of interest. To handle this, researchers apply domain adaptation techniques to text classification. The study [blitzer2007biographies] first applied the structural correspondence learning (SCL) algorithm [blitzer2006domain] to cross-domain sentiment classification. In [pan2010cross], the spectral feature alignment (SFA) algorithm is proposed to reduce the gap between the domain-specific words of the domains. The study [bollegala2015cross] modeled cross-domain classification as an embedding learning task.
Recently, deep adversarial networks [goodfellow2014generative] have achieved success across many tasks. The domain-adversarial neural network (DANN) structure proposed by [ganin2016domain] outperforms the traditional approaches in domain adaptation tasks for sentiment analysis. It uses a domain classifier to learn domain-invariant features. Some studies extended DANN to different multi-source scenarios [liu2017adversarial, zhao2018multiple, chen2018multinomial]. However, they all assume that the label proportions remain unchanged across domains, an assumption that is often not met in real-world tasks. Changes in the label distribution are known as prior probability shift or label shift, and they prevent DANN from learning domain-invariant features. In [zhang2013domain], the authors apply kernel mean matching (KMM) to estimate the label shift. Another attempt to quantify the shift is Black Box Shift Estimation (BBSE) [lipton2018detecting, azizzadenesheli2019regularized], which obtains accurate estimates on high-dimensional datasets. Recently, [li2019target] proposed to address the label shift problem by using distribution matching to estimate label proportions.
In this paper, we implement a domain adversarial network with label proportions estimation (DANLPE) framework, which learns domain-invariant features and estimates the target label proportions. The proportion estimation uses only the validation confusion and the label predictions as inputs. We reduce the label shift by reweighting the samples in the domain classifier based on the estimate. In the experiments, we compare DANLPE with other algorithms and show that it leads in most of the tasks.
2 Problem Setup
Let and be the source and target domains defined on and be a classifier. We use and to denote the feature and label variables. The output of a classifier is denoted by . We use and
to indicate the probability density function of
and . The source and target datasets are represented by and . We split the into training set and validation set . The prior distributions of and are given by and .
The red box in Fig. 1 shows the DANN structure, consisting of a feature extractor , a text classifier and a domain classifier . We expect the feature extractor to capture features satisfying with the help of , which makes the feature distributions of the source and target domains indistinguishable through backpropagation with gradient reversal. However, the performance of degrades if the prior distributions of and differ substantially. To handle this, we implement a prior distribution estimator that estimates the target label proportions and corrects the label shift by weighting the samples fed into according to the updated proportions.
3 Domain Adversarial Network With Label Proportion Estimation
3.1 Moments and Matrices Definition
We first define the validation set
The moments and matrices of
and and their plug-in estimates are denoted as follows. The distributions of the label predictions of and their plug-in estimates are expressed by
3.2 Label Proportions Estimation
The crucial component of DANLPE is the way the label proportion estimate is updated. We define a random vector
to estimate . We assume that perfect domain-invariant features are learnt, so that , which implies . When , the equality holds for every , so we propose the loss function
(1) 
After replacing with the plug-in estimate, we get
(2) 
Now we relate only to the observable data. We set , which implies , and we get that
(3) 
Then we conclude the following implication
Lemma 3.1
By computing the gradient we can derive
(4) 
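The estimation step can be illustrated with a small sketch. The equations are not reproduced in this text, so the sketch assumes a squared-error objective between the distribution of label predictions on the target data and the distribution implied by a candidate proportion vector through the validation confusion matrix. The names `conf`, `target_pred` and `p`, and the column convention of `conf`, are illustrative assumptions rather than the paper's notation.

```python
def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def loss_and_grad(conf, target_pred, p):
    """Assumed objective: ||conf @ p - target_pred||^2 and its gradient in p.

    conf[i][j] is taken to be the validation estimate of
    P(predicted class i | true class j); target_pred is the empirical
    distribution of predicted labels on the target data.
    """
    residual = [a - b for a, b in zip(matvec(conf, p), target_pred)]
    loss = sum(r * r for r in residual)
    # Gradient of the squared norm: 2 * conf^T @ residual.
    grad = [2.0 * sum(conf[i][j] * residual[i] for i in range(len(residual)))
            for j in range(len(p))]
    return loss, grad

# Toy two-class example with a known target proportion vector.
conf = [[0.9, 0.2],
        [0.1, 0.8]]
true_p = [0.3, 0.7]
target_pred = matvec(conf, true_p)

loss_at_truth, grad_at_truth = loss_and_grad(conf, target_pred, true_p)
loss_at_uniform, grad_at_uniform = loss_and_grad(conf, target_pred, [0.5, 0.5])
```

In this toy case the loss vanishes at the true proportions, and at the uniform starting point the gradient is positive for the first class and negative for the second, so a descent step moves the estimate towards the second class.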
The proposed prior is updated by gradient descent using Equation (3). However, since is constrained by , we apply the projected gradient descent
(5) 
where is the learning rate for updating . To avoid negative proportion estimates, we also set a lower bound . Once , we have
(6) 
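A projected gradient step of this kind can be sketched as follows. The exact projection used in Equations (5) and (6) is not recoverable from this text, so the sketch swaps in the standard sort-based Euclidean projection onto the probability simplex and handles the lower bound by shifting the simplex. The function names and the `floor` parameter are hypothetical.

```python
def project_simplex(v, radius=1.0):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = radius}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - radius) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the last index that stays positive
    return [max(x - theta, 0.0) for x in v]

def pgd_step(p, grad, lr, floor=0.0):
    """One gradient step followed by projection onto {x : x_i >= floor, sum(x) = 1}.

    Assumes len(p) * floor < 1 so that the feasible set is non-empty.
    """
    n = len(p)
    moved = [p_i - lr * g_i for p_i, g_i in zip(p, grad)]
    shifted = [m - floor for m in moved]
    projected = project_simplex(shifted, radius=1.0 - n * floor)
    return [x + floor for x in projected]

# A moderate step stays inside the simplex after projection.
p_new = pgd_step([0.5, 0.5], [0.224, -0.168], lr=1.0)

# A very large step is clipped at the lower bound instead of going negative.
p_floor = pgd_step([0.1, 0.9], [5.0, -5.0], lr=1.0, floor=0.05)
```

Shifting by `floor` before projecting keeps the result an exact Euclidean projection onto the bounded set, which is one reasonable way to enforce a lower bound; clipping and renormalizing would be a simpler alternative.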
We define and as the loss functions of and . To eliminate the prior shift, we reweight the samples from in based on their labels. Let , where is the prior distribution of . For a minibatch of size , the instances from and are and , and the sample weight vector of is . We then compute by
(7) 
where denotes the cross-entropy loss.
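The reweighting described above can be sketched in a few lines. The weight formula is not reproduced in this text, so the sketch assumes the usual importance weight: a source sample with label y is weighted by the ratio of the estimated target proportion to the source proportion of y, while target samples keep weight 1. All names are hypothetical.

```python
import math

def source_weights(labels, est_target_prior, source_prior):
    """Assumed per-sample weight: estimated target prior / source prior of the label."""
    return [est_target_prior[y] / source_prior[y] for y in labels]

def weighted_domain_loss(probs_is_source, is_source, weights):
    """Weighted binary cross-entropy of the domain classifier, averaged over the batch."""
    total = 0.0
    for prob, d, w in zip(probs_is_source, is_source, weights):
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ce = -math.log(prob) if d == 1 else -math.log(1.0 - prob)
        total += w * ce
    return total / len(weights)

source_prior = [0.8, 0.2]
est_target_prior = [0.4, 0.6]
src_labels = [0, 0, 1]

w_src = source_weights(src_labels, est_target_prior, source_prior)
weights = w_src + [1.0, 1.0, 1.0]      # three target samples with unit weight
is_source = [1, 1, 1, 0, 0, 0]
probs = [0.5] * 6                      # an undecided domain classifier
loss = weighted_domain_loss(probs, is_source, weights)
```

Here the rare source class is up-weighted by a factor of about 3 and the frequent one is down-weighted to 0.5, so the weighted source batch mimics the estimated target label proportions.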
The complete pseudocode of this learning procedure is given in Algorithm 1. In the first step, it alternately trains a domain adversarial net and performs label proportion estimation to obtain an estimate of the target prior distribution. During this procedure, the effect of label shift is reduced as the estimate of the label proportions improves, which helps the model derive better domain-invariant features. Since the label shift still matters in the early epochs, we need a second step that performs general DANN training with the fixed obtained in the first step and the modified loss function in (6).
The hyperparameters of step 1 in Algorithm 1 are quite flexible. The number of iterations is suitable when the validation loss does not increase noticeably. The role of is to guarantee that we update only once a decent model has been trained. We update every iterations, which reduces the number of predictions on and and accelerates the process. The parameter controls how fast changes. DANLPE is not very sensitive to these hyperparameters. When is fixed as the prior distribution of , step 1 of Algorithm 1 is equivalent to DANN.
4 Experiments
4.1 Experiments on Yelp Data
The Yelp Open Dataset [yelpdata2019] includes 192,609 businesses and 6,685,900 reviews in more than 20 categories. In each review, a user expresses opinions about a business and gives a rating ranging from 1 to 5. We compute the average review rating of each business and label the business with if and if . Businesses with are filtered out to create a gap between the two classes. We select the data of Financial Services (F), HotelTravel (H), BeautySpas (B) and Pets (P) for the tasks. Their label distributions vary, as shown in Fig. 2a. We sample 2800 businesses for each domain, preserving the label proportions, and predict the class using their reviews. In each domain, 10% of the samples are held out as the validation set.
For feature extraction, we obtain 500 words by intersecting the 837 most common words of each domain, and form the bag-of-words representation of each review from the occurrence of these tokens. The feature vector of a business is obtained by averaging the vectors of its reviews. In the DANLPE setting, we implement a standard neural network with 2 hidden layers of 32 dimensions.
takes the output of the first layer as input and has another hidden layer of the same size. Dropout with p = 0.6 is applied to all hidden layers.
We compare DANLPE with SVM, DNN and DANN. DNN is constructed by and DANN by . For DNN, DANN and DANLPE, the learning rate is fixed as and the minibatch size is 64. For optimization, we apply the Adam optimizer [kingma2014adam] with an early stopping strategy. In the first step of DANLPE, we set , , , and .
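The bag-of-words feature construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are obtained by whitespace splitting, "occurrence" is read as binary presence rather than counts, and the function names are hypothetical.

```python
from collections import Counter

def top_words(reviews, k):
    """The k most frequent lower-cased tokens across a domain's reviews."""
    counts = Counter(w for r in reviews for w in r.lower().split())
    return {w for w, _ in counts.most_common(k)}

def shared_vocabulary(domain_reviews, k):
    """Intersect the top-k word sets of all domains and return a sorted vocabulary."""
    vocab = None
    for reviews in domain_reviews.values():
        words = top_words(reviews, k)
        vocab = words if vocab is None else vocab & words
    return sorted(vocab)

def business_vector(reviews, vocab):
    """Average the binary bag-of-words vectors of a business's reviews."""
    vectors = []
    for r in reviews:
        present = set(r.lower().split())
        vectors.append([1.0 if w in present else 0.0 for w in vocab])
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy data standing in for two domains.
domain_reviews = {
    "F": ["good staff good rates", "good staff"],
    "H": ["good room", "clean staff good staff"],
}
vocab = shared_vocabulary(domain_reviews, k=2)
vec = business_vector(["good room", "good staff"], vocab)
```

With the values reported in the text, `k` would be 837 and the intersection over the four domains would contain 500 words.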
Table 1
Task  Accuracy  Estimation Results
P>Q  SVM  DNN  DANN  DANLPE  estimate vs. target  source vs. target
B>H  0.881  0.882  0.884  0.885  0.11±0.03  0.40
B>F  0.869  0.876  0.883  0.880  0.13±0.06  0.32
P>H  0.842  0.863  0.858  0.864  0.13±0.05  0.47
P>F  0.871  0.879  0.880  0.881  0.21±0.04  0.38
H>B  0.862  0.861  0.858  0.872  0.03±0.01  0.40
H>P  0.871  0.878  0.875  0.878  0.08±0.02  0.47
F>B  0.885  0.879  0.877  0.896  0.07±0.02  0.32
F>P  0.840  0.828  0.826  0.846  0.03±0.01  0.38
B>P  0.884  0.892  0.893  0.893  0.02±0.01  0.07
P>B  0.896  0.907  0.908  0.907  0.08±0.03  0.07
H>F  0.881  0.885  0.883  0.884  0.07±0.03  0.08
F>H  0.846  0.839  0.852  0.848  0.17±0.05  0.08
To evaluate the label proportions estimation, we define to be the estimate by DANLPE and and be the label proportions of samples of and .
The results are shown in Table 1. In the first eight tasks, and differ substantially and DANN does not show much improvement over DNN; in some tasks it even degrades the classification performance. In these experiments, DANLPE improves on DANN in seven of the eight tasks, by up to 0.020 in accuracy (F>P), because the label proportion estimate reduces the label shift. In the last four tasks, the label proportions of and are close and DANN achieves the best accuracy in three of them. DANLPE performs comparably to DANN in these tasks since does not introduce much additional shift. It is worth mentioning that, given an accurate label shift estimate, we can also improve the classification accuracy by prior probability adjustment [saerens2002adjusting], reweighting the class importance in . Here we focus on the behavior in the domain adapter and do not discuss this further.
4.2 Experiments on Behavioral Coding Data
One application of text classification is behavioral coding in Motivational Interviewing (MI) [miller2012motivational] counseling. The utterances of a therapist are coded to evaluate the therapist according to the Motivational Interviewing Skill Code (MISC) manual [miller2003manual]. Some of these MISC codes are easily confused with each other, and training a classifier on a subset of these confusable codes helps improve behavioral coding [chen2019improving]. In this experiment, we classify utterances of giving information (GI), simple reflection (RES) and complex reflection (REC) collected from MI sessions on alcohol addiction (A) and drug abuse (D) [atkins2014scaling] and from general psychology conversations (G) at a US university counseling center, with each category containing around 10,000 samples. The label proportions are shown in Fig. 2b. The modules and in the DANLPE structure are similar to those in the Yelp experiment, with 128 dimensions in the hidden layers and a dropout rate of p = 0.4. We replace of the Yelp experiment with a word embedding layer, followed by a bidirectional LSTM layer and an attention mechanism implemented as in [yang2016hierarchical] on top of the LSTM. As shown in Fig. 2b, the data is highly imbalanced, so we evaluate the performance by the average F1 score. In module of both step 1 and step 2 of Algorithm 1, we assign each class a weight inversely proportional to its frequency, which makes the algorithm more robust and improves the F-score.
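The class weighting and the evaluation metric mentioned above can be sketched as follows. The exact normalization of the weights is not stated in the text, so this sketch assumes the common "balanced" form n / (K * count), and it reads "average F1 score" as the unweighted (macro) average over classes. Both choices, and the function names, are assumptions.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class frequency (assumed n / (K * count))."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy imbalanced label set using the three codes from the text.
labels = ["GI"] * 6 + ["RES"] * 3 + ["REC"] * 1
class_weights = inverse_frequency_weights(labels)

f1 = macro_f1(["GI", "GI", "RES", "REC"], ["GI", "RES", "RES", "REC"])
```

The rarest class receives the largest weight, so errors on it contribute more to the classification loss, which is consistent with evaluating by an average over classes rather than by accuracy.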
Table 2
Task  F-score  Estimation Results
P>Q  DNN  DANN  DANLPE  estimate vs. target  source vs. target
A>G  0.496  0.492  0.503  0.14±0.03  0.27
G>A  0.489  0.489  0.496  0.10±0.02  0.27
D>G  0.512  0.508  0.522  0.05±0.01  0.24
G>D  0.552  0.556  0.558  0.06±0.02  0.24
A>D  0.627  0.639  0.644  0.13±0.02  0.25
D>A  0.593  0.594  0.590  0.19±0.04  0.25
The results in Table 2 show that DANLPE achieves the best overall performance and the highest F-score in every task except the last one (D>A), where the estimated proportions do not reduce much of the label shift. DANN only achieves an F-score comparable to that of DNN and even degrades performance in some tasks. The DNN results suggest that the behavioral coding task is harder than the Yelp experiment, because the number of classes is larger and the behavior codes are human-defined and not orthogonal. Nevertheless, DANLPE remains robust and still gives a reasonable proportion estimate for the unlabelled domain.
5 Conclusion and Future Work
In this paper, we proposed the DANLPE framework to handle label shift in DANN for unsupervised domain adaptation of text classification. In DANLPE, we estimate the target label distribution and learn domain-invariant features simultaneously. We derived the formula for updating the label proportion estimate using the validation confusion and the target label predictions. Experiments show that DANLPE estimates the label distribution robustly and improves text classification. In the future, we plan to apply DANLPE to other tasks such as image classification.