1 Introduction

Recent progress in natural language processing and computational social science has pushed political science research into new frontiers. For example, scholars have studied language use in presidential elections Acree et al. (2018), legislative text in Congress de Marchi et al. (2018), and similarities in national constitutions Elkins and Shaffer (2019). However, datasets used by political scientists are mostly homogeneous in terms of subject (e.g., immigration) or document type (e.g., constitutions). Labeled corpora with pertinent documents usually stem from only a single source; this makes it difficult to generalize conclusions derived from them to other sources. On the other hand, corpora spanning multiple decades and sources tend to be unlabeled. These corpora are largely untouched by political scientists. To illustrate some problems that arise with studying such data, Table 1 shows a sample of topics generated by Latent Dirichlet Allocation (LDA) Blei et al. (2003), a popular topic model in social science, trained on 60,000 documents sampled from the Corpus of Historical American English (COHA) Davies (2008). The generated topics are extremely vague and not specific to politics.
Table 1: Sample LDA topics from COHA.

| Topic | Top words |
| --- | --- |
| Topic 1 | like, day, would, a.m., center |
| Topic 2 | two, samour, family, veronica, son |
| Topic 3 | would, hospital, also, car, hyundai |
| Topic 4 | said, people, one, years, think |
| Topic 5 | city, 6-4, last, wine, york |
Additionally, topic model hyperparameters are detailed in Appendix A.
This paper bridges the gap between labeled and unlabeled corpora by framing the problem as one of domain adaptation. We develop adaptive ensembling, an unsupervised domain adaptation framework that learns representations from a single-source, labeled corpus (the source domain) and utilizes them effectively to obtain labels for a multi-source, unlabeled corpus (the target domain). Our method draws upon consistency regularization, a popular technique that stabilizes model predictions under input or weight perturbations Athiwaratkun et al. (2019). At the framework level, we introduce an adaptive, feature-specific approach to optimization; at the model level, we develop a novel text classification model that works well with our framework. To better handle the diachronic nature of our corpora, we also incorporate time-aware training and representations.
Our experiments use the New York Times Annotated Corpus (NYT) Sandhaus (2008) as our source domain corpus and COHA as our target domain corpus. Concretely, we construct two classification tasks: a binary task to determine whether a document is political or non-political; and a multi-label task to categorize a document under three major areas of political science in the US: American Government, Political Economy, and International Relations Goodin (2009). We subsequently introduce an expert-labeled test set from COHA to evaluate our methods.
Our framework, equipped with our best model, significantly outperforms existing domain adaptation algorithms on our tasks. In particular, adaptive ensembling achieves gains of 11.4 and 10.1 macro-averaged F1 on the binary and multi-label tasks, respectively. Qualitatively, adaptive ensembling conditions the optimization process, learns smoother latent representations, and yields precise but diverse topics as demonstrated by LDA on an extracted political subcorpus of COHA. We release our code and datasets at http://github.com/shreydesai/adaptive-ensembling.
2 Motivation from Political Science
Quantitative studies of American public opinion over time have mostly been restricted to surveys such as the American National Election Survey Baldassarri and Gelman (2008); Campbell et al. (1980). However, surveys often do not pose well-formed questions, reflect true voter opinion, or capture mass public opinion Zaller and others (1992); Bishop (2004). Therefore, researchers often seek to compare survey findings with those of mass media as the relationship between public opinion and the media has been widely established Baum and Potter (2008); McCombs (2018). Press media, one form of mass media, manifests itself in large, diachronic collections of newspaper articles; such corpora provide a promising avenue for studying public opinion and testing theories, provided scholars can be confident that the measures they obtain over time are substantively invariant Davidov et al. (2014). However, as alluded to earlier, such diachronic corpora are often unlabeled; political scientists cannot draw conclusions from these corpora in their raw form as they are unable to distinguish between political and non-political articles. We frame this problem as an exchange between two domains: a source, labeled corpus with modern articles (NYT) and a target, unlabeled corpus with decades of articles originating from a multitude of news sources (COHA). Using domain adaptation methods, we can extract a political subcorpus from COHA that would be amenable for the study of public opinion research over time.
3 Unsupervised Domain Adaptation
In this section, we detail the core concepts behind our unsupervised domain adaptation framework. We describe the problem setup (§3.1), an overview of self-ensembling and consistency regularization (§3.2-§3.4), and our novel contributions to this framework (§3.5-§3.6).
3.1 Problem Setup
Let X and Y denote the input and output spaces, respectively. We have access to labeled samples from a source domain and unlabeled samples from a target domain. The goal of unsupervised domain adaptation is to learn a function f: X → Y that maximizes the likelihood of the target domain samples by only leveraging supervision from the source domain samples. We also assume the existence of a small amount of labeled target domain samples in order to create a development set, following existing work in unsupervised domain adaptation Glorot et al. (2011); Chen et al. (2012); French et al. (2018); Zhang et al. (2017).
3.2 Self-Ensembling

Our unsupervised domain adaptation framework builds on top of self-ensembling Laine and Aila (2017), a semi-supervised learning algorithm based on consistency regularization, whereby models are trained to be robust against injected noise Athiwaratkun et al. (2019). Self-ensembling is an interplay between two neural networks: a student network and a teacher network. The inputs to both networks are perturbed separately, and the objective is to measure the consistency of the student network's predictions against the teacher's. Both networks share the same base model architecture and initial parameter values, but follow different training paradigms Laine and Aila (2017). In particular, the student network is updated via backpropagation, then the teacher network is updated with an exponential average of the student network's parameters Tarvainen and Valpola (2017). The networks are trained in an alternating fashion until they converge. During test time, the teacher network is used to infer the labels for target domain samples. Figure 1 visualizes the overall training procedure. Further intuition behind self-ensembling is available in Appendix B.
3.3 Student Training
The student network uses labeled samples from the source domain and unlabeled samples from the target domain to learn domain-invariant features. This is realized by using multiple loss functions, each with its own objective. The supervised loss is simply the cross-entropy loss of the student network outputs given source domain samples:
However, the supervised loss alone prevents the student network from learning anything useful about the target domain. To address this, Laine and Aila (2017) introduce an unsupervised loss to ensure that the student and teacher networks have similar predictions for target domain samples. French et al. (2018) only enforce the consistency constraint for target domain samples, but we propose using both source and target domain samples with separately perturbed inputs; this provides a balanced source of supervision to train our adaptive constants, discussed in §3.5:
The overall objective is a combination of the two loss functions:
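Using standard consistency-regularization notation (the symbols below are our own convention, not necessarily the paper's), the three objectives can be written as:

```latex
% Supervised cross-entropy on labeled source samples (x_s, y_s)
\mathcal{L}_{\mathrm{sup}} = -\,\mathbb{E}_{(x_s, y_s)}\big[\log f_{\theta}(y_s \mid x_s)\big]

% Consistency between student f_\theta and teacher f_{\theta'} on
% separately perturbed copies \tilde{x} and \hat{x} of the same input
\mathcal{L}_{\mathrm{unsup}} = \mathbb{E}_{x}\,\big\| f_{\theta}(\tilde{x}) - f_{\theta'}(\hat{x}) \big\|_2^2

% Overall objective, with a consistency weight w
\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + w\,\mathcal{L}_{\mathrm{unsup}}
```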
3.4 Fixed Ensembling
The teacher network's parameters θ′ form an ensemble of the student network's parameters θ over the course of training:

θ′ ← α θ′ + (1 − α) θ

where α ∈ [0, 1] is a smoothing factor that controls the magnitude of the parameter updates. Since the labels for the target domain samples are inherently unknown, ensembling parameters in the presence of noise helps the teacher network's predictions converge to the true label Tarvainen and Valpola (2017).
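The exponential-average update can be sketched in plain Python (the parameter dictionaries and the value of `alpha` are illustrative):

```python
def ema_update(teacher, student, alpha=0.99):
    """Fixed ensembling: every teacher parameter is an exponential
    moving average of the corresponding student parameter."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

# Toy example with scalar "parameters"
teacher = {"w": 1.0, "b": -1.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, alpha=0.9)  # w -> 0.9, b -> -0.8
```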
Empirically, we find that the highly unstable loss surface presented by textual datasets causes large instabilities in the optimization process. One of the key insights of this paper is that these instabilities are due to the dynamics of the unsupervised loss. Because the unsupervised loss effectively regularizes the source domain representations to work well in the target domain Laine and Aila (2017), performance degrades rapidly if this loss fails to converge. This is a strong indicator that self-ensembling fails to learn useful, shared representations for knowledge transfer between textual domains. Qualitative evidence of the unsupervised loss's instability is shown in Figure 5(a) and further discussed in §7.
3.5 Adaptive Ensembling
We hypothesize that smoothing with a fixed hyperparameter is responsible for said instabilities. For any given weight matrix (or bias vector), each hidden unit can be conceptualized as controlling one highly specific feature or attribute Bau et al. (2019). These units may need to be updated to varying degrees throughout the course of training; therefore, smoothing each unit with a fixed constant severely overlooks dynamics at the parameter level. We propose modifying fixed ensembling by introducing a trainable smoothing constant for each unit, hereafter termed adaptive constants, as opposed to using a fixed smoothing constant:

θ′ ← A ⊙ θ′ + (1 − A) ⊙ θ
where a matrix of adaptive constants is applied element-wise to the teacher and student parameters at each step.
Assume we are training an arbitrary weight matrix in some layer of a fixed network architecture. Both the student and teacher networks have their own copy of this matrix. To ensure each parameter has a corresponding adaptive constant, the matrix of adaptive constants shares the same dimensionality as the student and teacher copies, so the previous update can be applied element-wise, with each parameter smoothed by its own constant.
Because the adaptive constants are designed to stabilize training, it is a natural fit to train them using the unsupervised loss, backpropagating its gradients into the adaptive constants themselves.
This forms a crucial difference between self-ensembling and adaptive ensembling: in the former method, the teacher network has no say in how its parameters are modified. Adaptive ensembling equips the teacher network with fine-grained control over gradient updates, making it far easier to align activations under a noisy setting.
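One adaptive-ensembling step can be sketched with NumPy as follows; squashing the raw constants through a sigmoid to keep them in (0, 1) is our assumption, not a detail stated in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_update(theta_teacher, theta_student, alpha_raw):
    """Adaptive ensembling: the smoothing constants have the same
    shape as the weight matrix, so every parameter gets its own
    constant, applied element-wise."""
    alpha = sigmoid(alpha_raw)  # keep each constant in (0, 1); our assumption
    return alpha * theta_teacher + (1.0 - alpha) * theta_student

teacher_w = np.ones((2, 2))
student_w = np.zeros((2, 2))
alpha_raw = np.zeros((2, 2))  # sigmoid(0) = 0.5 for every unit
new_teacher = adaptive_update(teacher_w, student_w, alpha_raw)
```

In training, `alpha_raw` would itself receive gradients from the unsupervised consistency loss, which is what gives the teacher fine-grained control over its own updates.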
3.6 Temporal Curriculum
Diachronic datasets important in political science can be difficult to adapt to given the minimal vocabulary overlap between the source and target domain documents. Source and target articles mention named entities and events that, for the most part, do not appear across both datasets. To ease the difficulty of domain adaptation, we exploit the temporal information in our datasets to introduce a curriculum Bengio et al. (2009).
In particular, each article comes with metadata that includes the year in which the article was published. Figure 2 shows that COHA articles written closer to the time of NYT articles have a larger vocabulary overlap than those written in the distant past. Intuitively, it is easier to learn features from target domain samples that are more like the source domain samples. Hence, we sort the target domain mini-batches by year; the learning task becomes progressively harder as opposed to confusing the models during the early stages of training.
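The curriculum ordering can be sketched in plain Python (the `(text, year)` pair representation of examples is illustrative):

```python
def temporal_curriculum(batches):
    """Order target-domain mini-batches from most recent to oldest:
    recent COHA articles share the most vocabulary with the modern
    source corpus, so training starts easy and gets harder."""
    def batch_year(batch):
        # Each example is a (text, year) pair; year comes from COHA metadata.
        return sum(year for _, year in batch) / len(batch)
    return sorted(batches, key=batch_year, reverse=True)

batches = [[("a", 1930), ("b", 1932)], [("c", 1980)], [("d", 1955)]]
ordered = temporal_curriculum(batches)
years = [round(sum(y for _, y in b) / len(b)) for b in ordered]  # [1980, 1955, 1931]
```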
4 Model

In this section, we introduce a new convolutional neural network (CNN) as the plug-in model for our unsupervised domain adaptation framework. We motivate the use of CNNs (§4.1), formalize the model input (§4.2), and introduce several novel components for our task (§4.3).
CNNs have emerged as strong baselines for text classification in NLP Kim (2014). CNNs are desirable candidates for our framework as they exhibit a high degree of parameter sharing, significantly reducing the number of parameters to train. In addition, they can be designed to solely optimize the log-likelihood of the training data. Experimentally, we find that models that optimize other distributions (e.g., attention distributions in Transformers Vaswani et al. (2017) or Hierarchical Attention Networks Yang et al. (2016)) do not work well with this framework.
4.2 Model Input
Given a discrete input sequence over a fixed vocabulary, an embedding matrix replaces each word with its respective low-dimensional embedding. The resulting embeddings are stacked row-wise to obtain an input matrix. Following the notion of input perturbation used in consistency regularization algorithms Athiwaratkun et al. (2019), we design several methods to inject noise into the input layer. Each input is first perturbed with additive, isotropic Gaussian noise. Then, we apply dropout on the perturbed inputs to eliminate dependencies on any one word, using a Bernoulli mask applied element-wise to the input matrix.
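A minimal sketch of this input perturbation (NumPy; the noise scale and dropout rate are illustrative values, and we omit the 1/(1-p) rescaling some dropout variants use):

```python
import numpy as np

def perturb_input(X, sigma=0.1, drop_prob=0.2, rng=None):
    """Perturb an embedding matrix X (words x dims) with additive
    isotropic Gaussian noise, then apply an element-wise Bernoulli
    dropout mask."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, sigma, size=X.shape)  # N(0, sigma^2 I)
    mask = rng.random(X.shape) >= drop_prob       # Bernoulli keep-mask
    return (X + noise) * mask

X = np.ones((4, 3))        # toy 4-word input with 3-dim embeddings
X_tilde = perturb_input(X)  # a noisy copy, same shape as X
```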
4.3 Model Architecture
Background: 1D Convolutions
CNNs for text classification generally use 2D convolutions over the input matrix Kim (2014), but architectures using 1D convolutions have also been explored in other contexts, e.g., sequence modeling Bai et al. (2018), machine translation Kalchbrenner et al. (2016), and text generation Yang et al. (2017). Our model draws upon the latter approach for political document classification. CNNs utilizing 1D convolutions are typically autoregressive in nature; that is, each output only depends on inputs at earlier positions, avoiding information leakage into the future. Two approaches have been proposed to achieve this: history-padding Bai et al. (2018, 2019) and masked convolutions Kalchbrenner et al. (2016). Further, each successive convolution uses an exponentially increasing dilation factor, reducing the depth of the network significantly. Below, we elaborate on the components of our model:
Given a model with multiple layers, previous approaches Bai et al. (2018, 2019) history-pad the input with zeros so that each layer's output retains the length of its input, with the amount of padding determined by the dilation factor and filter size. However, we propose history-padding the input with just enough zeros to ensure the convolutions compress the sequence down to one output unit. Formally, this produces an output feature map with a single temporal position per channel; one can use a simple squeeze() operation to obtain a compact feature matrix. Though this is a subtle difference, our approach yields much richer representations for classification.
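The required amount of history-padding can be sketched as follows (plain Python; the formula assumes stride-1 convolutions with the dilation doubling at each layer, which is our reading of the architecture rather than an equation given here):

```python
def history_padding(seq_len, filter_size, num_layers):
    """Number of zeros to prepend so that `num_layers` causal
    convolutions with dilation 2**i at layer i compress the
    sequence to exactly ONE output unit."""
    # Each layer shrinks the length by (filter_size - 1) * 2**i, so the
    # total shrinkage is (filter_size - 1) * (2**num_layers - 1).
    receptive = 1 + (filter_size - 1) * (2 ** num_layers - 1)
    return max(0, receptive - seq_len)

pad = history_padding(seq_len=100, filter_size=2, num_layers=7)
# receptive field = 1 + 1 * (128 - 1) = 128, so pad = 28
```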
In each layer, a kernel convolves across an intermediate sequence, inducing a feature map. Because the input is presented as a sequence, applying the kernel along a one-dimensional axis encourages the feature map to encode temporal features, similar to how the hidden state is formed by applying shared weights across a sequence in recurrent architectures. Further, because the receptive field grows exponentially, the convolutions build hierarchical representations of the input: deeper layers build more abstract representations than shallower ones. We exploit this stateful information by pooling each activation map into a vector and concatenating the vectors row-wise to create a state matrix.
To the best of our knowledge, our paper is the first to explicitly use the temporal state embedded in causal 1D convolution activations as representations for an end task.
To make our model time-aware, we learn representations for the years of the documents (available as metadata in COHA). Such time representations allow the model to reason about content as it appears in different decades. Given a year (e.g., 1954), we normalize it to the closed unit interval using the maximum and minimum observed years in the training dataset, then linearly transform it into a low-dimensional embedding.
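As a concrete sketch (plain Python; the 1922 and 1986 bounds mirror the COHA subset described elsewhere in the paper):

```python
def normalize_year(year, y_min=1922, y_max=1986):
    """Map a publication year onto the closed unit interval [0, 1]
    using the minimum and maximum observed years in training."""
    return (year - y_min) / (y_max - y_min)

t = normalize_year(1954)  # (1954 - 1922) / 64 = 0.5
```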
We concatenate the various components of our model to create a collective representation for classification. We use a 1D convolution to project this representation to the output classes.
We did not observe any performance advantages from using a fully-connected layer to perform the projection, so we opt for a fully-convolutional architecture to minimize the number of parameters Long et al. (2015). Finally, we apply softmax to the output vector to obtain a valid probability distribution over the classes. An example of our model architecture is depicted in Figure 3.
5 Dataset

We present a dataset for identifying political documents with manual annotation from political science graduate students. The dataset is constructed for binary and multi-label tasks: (1) identifying whether a document is political (i.e., containing notable political content) and (2) if so, the area(s) among three major political science subfields in the US: American Government, Political Economy, and International Relations Goodin (2009).
We use NYT as the source dataset as it contains fine-grained descriptors of article content. We sample 4,800 articles with the descriptor US Politics & Government. To obtain non-political articles, we sample 4,800 documents whose descriptors do not overlap with an exhaustive list of political descriptors identified by a political science graduate student. For our multi-label task, the annotator grouped descriptors in NYT that belong to each area label we consider (these descriptors are available in Appendix C).
Our target data are historical documents from COHA, which contains a large collection of news articles since the 1800s. To ensure our dataset is useful for diachronic analysis (e.g., public opinion over time), we sample only from news sources that consistently appear across the decades. Further, we ensure there are at least 8,000 total documents in each decade group; this narrows down our time span to 1922--1986. From this subset, we sample 250 documents from each decade for annotation. Two political science graduate students each annotated a subset of the data.
To train our unsupervised domain adaptation framework, we use 9,600 unlabeled target examples (the same number as NYT). The expert-annotated dataset is divided into three subsets: a training set of 984 documents (only for training the In-Domain classifier discussed in §6.2), a development set of 246 documents, and a test set of 350 documents (50 per decade). The news sources used and label distributions for the expert-annotated dataset are available in Appendix D.
6 Experiments

6.1 Experimental Settings

Our CNN stacks dilated causal convolution layers with ReLU activations, following the architecture in §4.3. We enforce a maximum sequence length and a minimum word count to build the vocabulary. The embedding matrix uses pretrained GloVe embeddings Pennington et al. (2014) with dropout Srivastava et al. (2014). We history-pad our input with a zero vector, obtain the state connections using average pooling, and use a low-dimensional time embedding. The model is optimized with Adam Kingma and Ba (2015); the learning rate, mini-batch size, and remaining hyperparameters were selected using a grid search on the held-out development set.
Table 2: Framework results for the binary and multi-label tasks.
6.2 Framework Results
Using our best model, we benchmark our unsupervised domain adaptation framework against established methods: (1) Marginalized Stacked Denoising Autoencoders (mSDA): denoising autoencoders that marginalize out noise, enabling learning on infinitely many corrupted training samples Chen et al. (2012). (2) Self-Ensembling (SE): a consistency regularization framework that stabilizes student and teacher network predictions under injected noise (discussed in detail in §3.2-§3.4) Laine and Aila (2017); Tarvainen and Valpola (2017); French et al. (2018). (3) Domain-Adversarial Neural Network (DANN): a multi-component framework that learns domain-invariant representations through adversarial training Ganin et al. (2016). We also benchmark against Source Only (classifier trained on the source domain only) and In-Domain (classifier trained on the target domain only) to establish lower and upper performance bounds, respectively Zhang et al. (2017).
Framework results are presented in Table 2. Our method achieves the highest F1 scores for both tasks. The temporal curriculum further improves our results by a large margin, validating its effectiveness for domain adaptation on diachronic corpora. Although DANN achieves higher precision on the multi-label task, its recall suffers considerably.
Table 3: Model ablations for the binary and multi-label tasks.

| Model | Binary Acc | Binary F1 | Multi P | Multi R | Multi F1 |
| --- | --- | --- | --- | --- | --- |
| + seq squeeze | 75.1 | 74.6 | 45.8 | 79.6 | 58.2 |
| + state conn | 80.2 | 76.3 | 45.3 | 85.5 | 59.2 |
| + time emb | 77.4 | 77.1 | 48.2 | 83.5 | 61.1 |
6.3 Model Results
Next, we ablate the various components of our model and evaluate several other strong text classification baselines under our framework: (1) Logistic Regression (LR): we average the word embeddings of each token in the sequence, then use these to train a logistic regression classifier. (2) Bidirectional LSTM (BiLSTM): a bidirectional LSTM obtains forwards and backwards input representations Hochreiter and Schmidhuber (1997); they are concatenated and passed through a fully-connected layer. (3) CNN (2D): a CNN using 2D kernels of several widths obtains representations Kim (2014); they are max-pooled, concatenated row-wise, and passed through a fully-connected layer.
Model ablations and results are presented in Table 3. Our full model achieves the highest F1 scores on both the binary and multi-label tasks, and each component consistently contributes to the overall F1 score. The 2D CNN also has decent F1 scores, showing that our framework works with standard CNN models. Further, the time embedding significantly improves both F1 scores, indicating the model effectively utilizes the unique temporal information present in our corpora.
7 Analysis

In this section, we pose and qualitatively answer numerous probing questions to further understand the strong performance of adaptive ensembling. We analyze several characteristics of the overall framework (§7.1), then qualitatively inspect its performance on our datasets (§7.2).
Are the adaptive constants different across hidden units?
We randomly sample five adaptive constants and track their value trajectories over the course of training. Figure 4 shows that all of them sharply converge and then bounce around the same general neighborhood. This is strong evidence that we cannot use a single fixed hyperparameter to smooth every parameter; rather, we need per-parameter smoothing constants to account for the functionality and behavior of each unit.
How do the adaptive constants change by layer?
Figure 5 shows the distribution of weight and bias adaptive constants for a top, middle, and bottom layer of our CNN. For the weight parameters, the teacher relies heavily on the student (the distribution is skewed towards smaller smoothing rates) in the top layer, but gradually reduces its dependence by learning target domain features in the lower layers (the distribution is skewed towards larger smoothing rates). For the bias parameters, the teacher prominently shifts the student features to work for the target domain in the top layer, but reduces its dependence on the student in the lower layers. This shows why using a fixed hyperparameter does not account for layer-wise dynamics, i.e., each layer requires a specific distribution of values to achieve strong performance.
Do adaptive constants benefit training and latent representations?
Figure 5(a) depicts the unsupervised loss trajectories for self-ensembling (SE) and adaptive ensembling (AE). Compared to SE, the adaptive constants significantly stabilize the unsupervised loss. Next, Figure 5(b) shows the general training curves for AE and domain-adversarial neural networks (DANN). The DANN loss oscillates uncontrollably as the adversarial weight increases, but increasing the unsupervised loss weight for AE does not result in as much instability. We also compare the latent representations learned by SE and AE in Figure 6. While SE shows evidence of feature alignment, AE learns a much smoother manifold where source and target domain representations are intertwined.
Does adaptive ensembling yield better topics?
In Table 1, we showed that applying LDA directly on COHA yields noisy, unrecognizable topics. Here, we use the Source Only model and the adaptive ensembling framework to obtain labels for the unlabeled pool of COHA documents. We extract the political documents, run a topic model on the political subcorpus, and randomly sample topics. The Source Only results are shown in Table 4 and the adaptive ensembling results are shown in Table 5. The Source Only model has poor recall, as most of the topics extracted are vague and not inherently political in nature. In contrast, our framework is able to extract a wide range of clean, identifiable political topics. For example, the first topic reflects documents related to the Vietnam conflict while the third topic reflects documents related to important court proceedings.
Table 4: Topics from the political subcorpus extracted by the Source Only model.

| Topic | Top words |
| --- | --- |
| Topic 1 | dr, women, week, medical, doctors |
| Topic 2 | city, police, street, car, avenue |
| Topic 3 | trial, years, police, prison, court |
| Topic 4 | union, strike, workers, lewis, service |
| Topic 5 | like, man, years, little, week |
Table 5: Topics from the political subcorpus extracted by adaptive ensembling.

| Topic | Top words |
| --- | --- |
| Topic 1 | vietnam, hanoi, atomic, bombing, south |
| Topic 2 | germany, britain, france, europe, soviet |
| Topic 3 | court, justice, commission, law, attorney |
| Topic 4 | tax, oil, prices, petroleum, industry |
| Topic 5 | coal, union, strike, workers, miners |
Does adaptive ensembling preserve the integrity of the original corpus?
In order for political scientists to effectively study latent variables (such as political polarization) over time, the extracted political subcorpus must preserve the integrity of the original corpus. That is, the subcorpus' distribution of documents across years and sources must roughly match that of the original corpus. First, we analyze the document counts for each decade bin, shown in Figure 7. The political subcorpus shows a relatively consistent count across the decades, notably also capturing salient peaks from the 1920s-1930s. Next, we analyze the document counts for each news source. Once again, the political subcorpus features documents from all sources that appear in the original corpus. In addition, the varied distribution across sources is also captured; Time Magazine (TM) has the most documents whereas Wall Street Journal (WSJ) has the least. Together, these results show that the resulting subcorpus is amenable for political science research as it exhibits important characteristics derived from the original COHA corpus.
8 Related Work
Early approaches for unsupervised domain adaptation use shared autoencoders to create cross-domain representations Glorot et al. (2011); Chen et al. (2012). More recently, Ganin et al. (2016) introduce a new paradigm that creates domain-invariant representations through adversarial training. This has gained popularity in NLP Zhang et al. (2017); Fu et al. (2017); Chen et al. (2018); however, the difficulties of adversarial training are well-established Salimans et al. (2016); Arjovsky and Bottou (2017). Consistency regularization methods (e.g., self-ensembling) outperform adversarial methods on visual semi-supervised and domain adaptation tasks Athiwaratkun et al. (2019), but have rarely been applied to textual data Ko et al. (2019). Finally, Huang and Paul (2018) establish the feasibility of using domain adaptation to label documents from discrete time periods. Our work departs from previous work by proposing an adaptive, time-aware approach to consistency regularization provisioned with causal convolutional networks.
9 Conclusion

We present adaptive ensembling, an unsupervised domain adaptation framework capable of identifying political texts in a multi-source, diachronic corpus by only leveraging supervision from a single-source, modern corpus. Our methods outperform strong benchmarks on both binary and multi-label classification tasks. We release our system, as well as an expert-annotated set of political articles from COHA, to facilitate domain adaptation research in NLP and political science research on public opinion over time.
Acknowledgments

The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Thanks as well to Greg Durrett, Katrin Erk, and the anonymous reviewers for their helpful comments. This work was partially supported by NSF Grant IIS-1850153.
- Etch-a-sketching: evaluating the post-primary rhetorical moderation hypothesis. American Politics Research.
- Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations.
- There are many consistent explanations of unlabeled data: why you should average. In International Conference on Learning Representations.
- An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
- Trellis networks for sequence modeling. In International Conference on Learning Representations.
- Partisans without constraint: political polarization and trends in American public opinion. American Journal of Sociology 114 (2), pp. 408–446.
- . In International Conference on Learning Representations.
- The relationships between mass media, public opinion, and foreign policy: toward a theoretical synthesis. Annual Review of Political Science 11, pp. 39–65.
- . In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48.
- The illusion of public opinion: fact and artifact in American public opinion polls. Rowman & Littlefield Publishers.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
- The American voter. University of Chicago Press.
- Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, pp. 1627–1634.
- Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, pp. 557–570.
- A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46.
- Measurement equivalence in cross-national research. Annual Review of Sociology 40, pp. 55–75.
- The Corpus of Contemporary American English: 450 million words, 1990–present.
- Policy and the structure of roll call voting in the US House. Available at SSRN 3262316.
- On measuring textual similarity. Note: work in progress.
- Self-ensembling for visual domain adaptation. In International Conference on Learning Representations.
- Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 425–429.
- Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
- Domain adaptation for large-scale sentiment classification: a deep learning approach. In Proceedings of the 28th International Conference on Machine Learning, pp. 513–520.
- The Oxford handbook of political science. Vol. 11, Oxford University Press.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Examining temporality in document classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 694–699.
- Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
- Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Domain agnostic real-valued specificity prediction. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- Setting the agenda: mass media and public opinion. John Wiley & Sons.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
- Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pp. 2234–2242.
- The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia.
- Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems 30, pp. 1195–1204.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
- Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, pp. 3881–3890.
- Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489.
- The nature and origins of mass opinion. Cambridge University Press.
- Aspect-augmented adversarial networks for domain adaptation. Transactions of the Association for Computational Linguistics 5, pp. 515–528.
Appendix A LDA Topic Model
We experimented with a range of hyperparameters to ensure the Latent Dirichlet Allocation (LDA) model was well suited to our datasets, using the Gensim library (https://radimrehurek.com/gensim/). In particular, we removed all stopwords and extremely rare words (the tail 10–20% of the unigram distribution), and set the number of topics to 50.
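The rare-word filtering can be sketched in plain Python. The exact reading of the tail-10–20% heuristic is an assumption here (we drop the rarest fraction of word *types* by frequency rank), as is the `tail_frac` default; Gensim's `Dictionary.filter_extremes` offers a comparable built-in.

```python
from collections import Counter

def filter_vocab(docs, stopwords, tail_frac=0.15):
    """Drop stopwords and the rarest word types before fitting LDA.

    docs: list of token lists; stopwords: set of words to remove;
    tail_frac: fraction of rarest types to discard (an assumed cutoff).
    """
    counts = Counter(tok for doc in docs for tok in doc)
    ranked = [w for w, _ in counts.most_common()]  # frequent -> rare
    keep = set(ranked[: int(len(ranked) * (1 - tail_frac))])
    keep -= set(stopwords)
    return [[tok for tok in doc if tok in keep] for doc in docs]
```

The filtered token lists would then be converted to bag-of-words vectors and passed to an LDA trainer with `num_topics=50`.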
Appendix B Self-Ensembling
The core intuition behind consistency regularization is that ensembled predictions are more likely to be correct than single predictions Laine and Aila (2017); Tarvainen and Valpola (2017). To this end, Laine and Aila (2017) introduce a student and teacher network that yield single predictions and ensembled predictions, respectively.
After learning from labeled samples, the student may produce varying, dissimilar predictions for unlabeled samples due to the stochastic nature of optimization. One potential solution is to ensemble predictions across time to converge at the most likely prediction Laine and Aila (2017). Tarvainen and Valpola (2017) improve upon this method by showing that ensembling parameters (as opposed to predictions) results in better predictions. Because the teacher’s parameters are smoothed with the student’s learned parameters at each iteration, the teacher effectively becomes an ensemble of the student across time.
Further, to ensure that the features learned from the labeled samples are compatible with the unlabeled samples, Laine and Aila (2017), Tarvainen and Valpola (2017), and French et al. (2018) motivate a consistency-enforcing approach that brings the student's and teacher's predictions closer together. In essence, if a feature learned from samples in the labeled domain is incompatible with samples in the unlabeled domain, the consistency (unsupervised) loss penalizes that incompatibility. The interplay between the two networks therefore creates a robust, domain-invariant feature space that characterizes both labeled and unlabeled samples French et al. (2018). A detailed visualization of the training procedure is presented in Figure 1 in the main body of this paper.
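The two mechanisms above can be sketched as follows: the teacher's parameters are an exponential moving average (EMA) of the student's, and a consistency loss compares the two networks' predictions on the same unlabeled sample. The dict-of-lists parameter representation and the smoothing coefficient `alpha` are illustrative assumptions, not the paper's exact configuration.

```python
def ema_update(teacher, student, alpha=0.99):
    """Smooth teacher parameters toward the student's after each step
    (Tarvainen and Valpola, 2017); the teacher thus becomes an ensemble
    of the student across time. Parameters are name -> list of floats."""
    for name, s_params in student.items():
        teacher[name] = [alpha * t + (1 - alpha) * s
                         for t, s in zip(teacher[name], s_params)]

def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between the two networks' class probabilities
    on an unlabeled sample; features that do not transfer are penalized."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n
```

During training, only the student receives gradients; `ema_update` is applied to the teacher after each optimizer step, and `consistency_loss` is added to the supervised loss on labeled batches.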
Appendix C NYT Descriptors
We build a list of "political" descriptors in NYT to determine (a) which labels we can or cannot sample non-political documents from, and (b) which descriptors fall under the three areas of political science we consider for our multi-label task (American Government, Political Economy, and International Relations).
Because documents can be tagged with multiple descriptors, we build a list of descriptors whose documents have significant overlap with US Politics & Government. The second author, a political science graduate student, filtered this list to 57 descriptors that are political in nature.
For (a), we sample 4,600 non-political documents whose descriptors do not overlap with the 57 political descriptors described above. For (b), the same political science graduate student assigns each descriptor with one or more area labels. We use this label information to build an NYT dataset for our tasks. The 57 political descriptors and their corresponding area labels are tabulated in Table 7.
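Step (a) reduces to a set-disjointness check between each document's descriptors and the curated political list. A minimal sketch, assuming documents are dicts with a `descriptors` field (a hypothetical representation; the NYT corpus itself is XML):

```python
def nonpolitical_pool(docs, political_descriptors):
    """Return documents eligible for non-political sampling: those whose
    NYT descriptors share nothing with the 57 curated political ones."""
    political = set(political_descriptors)
    return [d for d in docs if political.isdisjoint(d["descriptors"])]
```

The 4,600 non-political documents would then be drawn from this pool.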
Appendix D Expert-Annotated Dataset
To create an initial COHA subcorpus of 56,000 documents (8,000 per decade), we sample from the following news sources that appear consistently across decades: Chicago Tribune, Christian Science Monitor, New York Times, Time Magazine, and Wall Street Journal. Note that these NYT articles (up to 1986) do not appear in the NYT annotated corpus Sandhaus (2008) (which begins in 1987), which we use as our source (training) dataset.
From this subcorpus, we perform additional steps to create an expert-annotated dataset (§5). Label distributions for our dataset are presented in Table 6. Although political economy (PE) is severely underrepresented, we experimentally find that these documents have salient features and are not as difficult to classify. In addition, we employ class imbalance penalties to prevent our model from ignoring these documents.
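One common form of class imbalance penalty is inverse-frequency loss weighting, which upweights rare classes such as PE. The paper does not specify its exact scheme, so this sketch is illustrative:

```python
def class_weights(label_counts):
    """Inverse-frequency class weights: weight(c) = N / (K * n_c),
    where N is the total sample count, K the number of classes, and
    n_c the count for class c. Rare classes get larger loss weights."""
    total = sum(label_counts.values())
    k = len(label_counts)
    return {c: total / (k * n) for c, n in label_counts.items()}
```

These weights would multiply the per-class terms of the supervised loss so that underrepresented PE documents are not ignored.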
The source dataset (NYT) was already annotated; to ensure label agreement with our target dataset (COHA), we sampled documents from the source dataset and had our political science graduate students label them for comparison against the original labels. There were minimal problems here: because NYT assigns fine-grained labels to its documents, the politically labeled articles were clearly political, and the non-political ones clearly were not.
The target dataset (COHA) was divided into halves, and each political science graduate student annotated one half. Prior to annotation, they agreed upon a set of rules to minimize bias in the annotation process. In addition, both annotators worked side-by-side during all annotation periods, so they could ask for each other's opinion in cases of confusion. We also took measures to ensure label correctness after annotation was completed: each annotator sampled a batch of their political and non-political annotations and sent it to the other to evaluate. Again, there was little disagreement, as the rules decided upon at the outset covered most edge cases. Quantitatively, we compute Cohen's kappa on a mutually annotated subset Cohen (1960).
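Cohen's kappa corrects the annotators' observed agreement for the agreement expected by chance from their label distributions. A self-contained computation (scikit-learn's `cohen_kappa_score` is an equivalent off-the-shelf option):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa (Cohen, 1960) for two annotators' label sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the marginal label frequencies."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, two annotators agreeing on 3 of 4 binary labels with marginals (2, 2) and (1, 3) yield kappa = 0.5, well below the raw 75% agreement.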
|Presidents and Presidency (US)||✓|
|Presidential Elections (US)||✓|
|War and Revolution||✓|
|Presidential Election of 2000||✓|
|Presidential Election of 2004||✓|
|Law and Legislation||✓|
|Civil War and Guerrilla Warfare||✓|
|International Trade and World Market||✓|
|Presidential Election of 1996||✓|
|Economic Conditions and Trends||✓|
|Bombs and Explosives||✓|
|Arms Sales Abroad||✓|
|United States Economy||✓|
|Missiles and Missile Defense Systems||✓|
|Oil (Petroleum) and Gasoline||✓|
|Appointments and Executive Changes||✓|
|Prisoners of War||✓|
|War Crimes, Genocide and Crimes Against Humanity||✓|
|Vice Presidents and Vice Presidency (US)||✓|
|Arms Control and Limitation and Disarmament||✓|
|Military Bases and Installations||✓|
|Presidential Election of 2008||✓|
|Energy and Power||✓|
|Stocks and Bonds||✓|
|State of the Union Message (US)||✓|
|Wages and Salaries||✓|
|Special Prosecutors (Independent Counsel)||✓|
|White House (Washington, DC)||✓|
|Federal Taxes (US)||✓|
|Social Security (US)||✓|
|Third World and Developing Countries||✓|
|Futures and Options Trading||✓|
|Layoffs and Job Reductions||✓|
|Nazi Policies Toward Jews and Minorities||✓|
|Police Brutality and Misconduct||✓|
|Executive Privilege, Doctrine of||✓|