Many issues covered or discussed by the media and politicians today are so subtle that even word choice may require one to adopt a particular ideological position Iyyer et al. (2014). For example, conservatives tend to use the term tax reform, while liberals use tax simplification. Though objectivity and unbiased reporting remain a cornerstone of professional journalism, several scholars argue that the media displays ideological bias Gentzkow and Shapiro (2010); Groseclose and Milyo (2005); Iyyer et al. (2014). Even if one were to argue that such bias does not necessarily reflect a lack of objectivity, prior research Dardis et al. (2008); Card et al. (2015) notes that the framing of topics can significantly influence policy.
Since manual detection of political ideology is challenging at a large scale, there has been extensive work on developing computational models for automatically inferring the political ideology of articles, blogs, statements, and congressional speeches Gentzkow and Shapiro (2010); Iyyer et al. (2014); Preoţiuc-Pietro et al. (2017); Sim et al. (2013). In this paper, we consider the detection of ideological bias at the news article level, in contrast to recent work by Iyyer et al. (2014), who focus on the sentence level, or the work of Preoţiuc-Pietro et al. (2017), who focus on inferring the ideological bias of social media users. Prior research exists on detecting ideological biases of news articles or documents Gentzkow and Shapiro (2010); Gerrish and Blei (2011); Iyyer et al. (2014). However, these works generally model only the text of the news article. In the online world, news articles do not just contain text but have a rich structure to them. Such an online setting influences the article in subtle ways: (a) the choice of the title, since this is what is seen in snippet views online; (b) links to other news media and sources in the article; and (c) the actual textual content itself. Prior models ignore all of these cues except the textual content. Figure 1 shows an example from The New York Times. Note the presence of hyperlinks in the text, which link to other sources like The Intercept (Figure 1(a)). We hypothesize that such a link structure is reflective of homophily between news sources sharing similar political ideology, homophily which can be exploited to build improved predictive models (see Figure 1(b)). Building on this insight, we propose a new model, MVDAM (Multi-view Document Attention Model), to detect the ideological bias of news articles by leveraging cues from multiple views: the title, the link structure, and the article content. Specifically, our contributions are:
- We propose a generic framework MVDAM to incorporate multiple views of the news article and show that our model outperforms state of the art by percentage points on the F1 score.
- We propose a method to estimate the ideological proportions of sources and rank them by the degree to which they lean towards a particular ideology.
- Finally, differing from most works, which typically focus on congressional speeches, we conduct ideology detection of news articles by assembling a large-scale diverse dataset spanning more than sources.
2 Related Work
Several works study the detection of political ideology through the lens of computational linguistics and natural language processing Laver et al. (2003); Monroe and Maeda (2004); Thomas et al. (2006); Lin et al. (2008); Carroll et al. (2009); Ahmed and Xing (2010); Gentzkow and Shapiro (2010); Gerrish and Blei (2011); Sim et al. (2013); Iyyer et al. (2014); Preoţiuc-Pietro et al. (2017). Gentzkow and Shapiro (2010) first attempt to rate the ideological leaning of news sources by proposing a measure called “slant index”, which captures the degree to which a particular newspaper uses partisan terms or collocations. Gerrish and Blei (2011) predict the voting patterns of Congress members based on supervised topic models, while Ahmed and Xing (2010); Lin et al. (2008) use similar models to predict bias in news articles, blogs, and political speeches. Differing from the above, Sim et al. (2013) propose a novel HMM-based model to infer the ideological proportions of the rhetoric used by political candidates in their campaign speeches, which relies on a fixed lexicon of bigrams.
Iyyer et al. (2014) use recursive neural networks to predict the political ideology of congressional debates and articles in the ideological book corpus (IBC) and demonstrate the importance of compositionality in predicting ideology, where modifier phrases and punctuality affect the political ideological position. Preoţiuc-Pietro et al. (2017) propose models to infer the political ideology of Twitter users based on their everyday language. Most crucially, they also show how to effectively use the relationship between user groups to improve prediction accuracy. Our work draws inspiration from both of these works but differs from them in the following aspects. We leverage the structure of a news article by noting that an article is not just free-form text, but has a rich structure to it. In particular, we model cues from the title, the inferred network, and the content in a joint generic neural variational inference framework to yield improved models for this task. Furthermore, differing from Iyyer et al. (2014), we also incorporate attention mechanisms in our model, which enable us to inspect which sentences (or words) have the most predictive power as captured by our model. Finally, since we work with news articles (which also contain hyperlinks), our setting is naturally different from previous works in general (which mostly focus on congressional debates) and in particular from Iyyer et al. (2014), where only textual content is modeled, and Preoţiuc-Pietro et al. (2017), which focuses on social media users.
3 Dataset Construction
We rely on the data released by AllSides.com (https://www.allsides.com/media-bias/media-bias-ratings) to obtain a list of US-based news sources along with their political ideology ratings: Left, Center, or Right, which specify our target label space. While we acknowledge that there is no “perfect” measure of political ideology, AllSides.com is an apt choice for two main reasons. First, and most importantly, the ratings are based on a blind survey, where readers are asked to rate news content without knowing the identity of the news source or the author being rated. This is also precisely the setting in which our proposed computational models operate (the models have access to the content but are agnostic of the source itself), thus seeking to mirror human judgment closely. Second, these ratings are normalized by AllSides to ensure they closely reflect the popular opinion and political diversity present in the United States. These ratings also correlate with independent measurements made by the Pew Research Center. All these observations suggest that these ratings are fairly robust and generally “reflective of the average judgment of the American People” (https://www.allsides.com/media-bias/about-bias).
Given the set of news sources selected above, we extract the article content for these news sources. We control for time by obtaining article content over a fixed time period for all sources. Specifically, we spider several news sources and perform data cleaning. In particular, the spidering component collates the raw HTML of news sources into a storage engine (MongoDB). We track thousands of US-based news outlets, including popular country-wide news sources as well as many local and state outlets like the Boston Herald (this is part of an ongoing project called MediaRank; more details can be found at http://media-rank.com). However, in this paper, we consider only the US news sources for which we can derive ground truth labels for political ideology. For each of the news sources considered, we extract the title, the cleaned pre-processed content, and the hyperlinks within the article that reveal the network structure. The label for each article is the label assigned to its source as obtained from AllSides. We choose a random sample of articles and create 3 independent splits for training, validation, and test with a roughly balanced label distribution. Note that we do not restrict the articles to be strictly political, since even articles on other topics like health and sports can be reflective of political ideology Hoberman (1977).
Data Pre-processing and Cleaning
Since the labels were derived from the source, we are careful to remove any systematic features in each article which are trivially reflective of the source, since that would result in over-fitting. In particular, we perform the following operations:
(a) Remove source link mentions: When modeling the link structure of an article, we explicitly remove any link to the source itself. We also explicitly remove any systematic link structures in articles that are source-specific. In particular, some sources may always have links to other domains (like their own franchisees or social media sites). These links are removed explicitly by noting their high frequency.
(b) Remove headers, footers, and advertisements: News sources systematically introduce footers and advertisements, which we remove explicitly. For example, every article of The Daily Beast has the following footer, which we filter out: “You can subscribe to the Daily Beast here”.
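The link-cleaning step in (a) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `max_share` and `min_articles` thresholds, and the example domains are all hypothetical, and "high frequency" is interpreted here as "appears in nearly every article of the same source".

```python
from collections import Counter
from urllib.parse import urlparse


def domain(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def clean_links(articles, max_share=0.9, min_articles=3):
    """Filter the outbound links of each article.

    articles: list of (source_domain, [url, ...]) pairs.
    max_share and min_articles are hypothetical thresholds chosen for
    illustration; they are not values taken from the paper.
    """
    n_articles = Counter(src for src, _ in articles)
    # Number of a source's articles in which each outbound domain appears.
    hits = Counter()
    for src, urls in articles:
        for d in {domain(u) for u in urls}:
            hits[(src, d)] += 1

    cleaned = []
    for src, urls in articles:
        kept = []
        for u in urls:
            d = domain(u)
            if d == src:
                continue  # link back to the source itself
            systematic = (n_articles[src] >= min_articles
                          and hits[(src, d)] / n_articles[src] >= max_share)
            if systematic:
                continue  # present in nearly every article of this source
            kept.append(d)
        cleaned.append(kept)
    return cleaned


articles = [
    ("example-news.com", ["https://www.example-news.com/a",
                          "https://twitter.com/x",
                          "https://theintercept.com/s"]),
    ("example-news.com", ["https://twitter.com/y", "https://reuters.com/z"]),
    ("example-news.com", ["https://twitter.com/z"]),
]
cleaned = clean_links(articles)
```

In this toy input the social media domain appears in every article of the source and is therefore dropped, while the occasional links to other outlets are kept.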
4 Models and Methods
Given $X$, which represents a set of multi-modal features of news articles, and a label set $Y$, we would like to model $p(Y \mid X)$.
Overview of MVDAM
We consider a Bayesian approach with stochastic attention units to effectively model textual cues. Bayesian approaches with stochastic attention have been noted to be quite effective at modeling ambiguity as well as avoiding over-fitting, especially in the case of small training data sets Miao et al. (2016). In particular, we assume a latent representation $z$ learned from the multiple modalities in $X$, which is then mapped to the label space $Y$. In the most general setting, instead of learning a deterministic encoding $z$ given $X$, we posit a latent distribution over the hidden representation, $p(z \mid X)$, to model the overall document, where $p(z \mid X)$ is parameterized by a diagonal Gaussian distribution.
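Specifically, consider the distribution $p(Y \mid X)$, which can be written as follows:

$$p(Y \mid X) = \int p(Y \mid z)\, p(z \mid X)\, dz \qquad (1)$$

As noted by Miao et al. (2016), computing the exact posterior is in general intractable. Therefore, we posit a variational distribution $q_\phi(z \mid X)$ and maximize the evidence lower bound $\mathcal{L} \le \log p(Y \mid X)$, namely,

$$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid X)}\big[\log p(Y \mid z)\big] - D_{KL}\big(q_\phi(z \mid X) \,\|\, p(z \mid X)\big) \qquad (2)$$

where $p(Y \mid z)$ denotes a probability distribution over $Y$ given the latent representation $z$, and $p(z \mid X)$ denotes the probability distribution over $z$ conditioned on $X$.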
Equation 2 can be interpreted as consisting of three components, each of which can be modeled separately: (a) Discriminator: $p(Y \mid z)$ can be viewed as a discriminator given the hidden representation $z$. Maximizing the first term is thus equivalent to minimizing the cross-entropy loss between the model’s prediction and the true labels. (b) The second term, the KL divergence term, consists of two components: (1) Approximate Posterior: the term $q_\phi(z \mid X)$, also known as the approximate posterior, parameterizes the latent distribution which encodes the multi-modal features of a document. (2) Prior: the term $p(z \mid X)$ can be viewed as a prior, which can be uninformative (a standard Gaussian prior in the most general case) or any other prior model based on other features. We now discuss how we model each of these components in detail.
4.1 Discriminator
We model the discriminator $p(Y \mid z)$ with a simple feed-forward network: a linear layer that accepts as input the latent hidden representation $z$, followed by a ReLU non-linearity, a second linear layer, and a final soft-max layer.
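A minimal sketch of such a discriminator in plain Python is given below. The layer sizes, the random initialization, and the function names are illustrative assumptions only; a practical implementation would use a deep learning library and learned weights.

```python
import math
import random


def linear(x, W, b):
    """Affine map: W is a list of weight rows, b a list of biases."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def softmax(scores):
    """Numerically stable soft-max."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator(z, W1, b1, W2, b2):
    """linear -> ReLU -> linear -> soft-max over the label set."""
    hidden = [max(0.0, v) for v in linear(z, W1, b1)]
    return softmax(linear(hidden, W2, b2))


rng = random.Random(0)
d, hidden_dim, n_labels = 4, 8, 3  # illustrative sizes, not from the paper
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
      for _ in range(n_labels)]
b2 = [0.0] * n_labels

# Probabilities over (Left, Center, Right) for one latent vector z.
probs = discriminator([0.1, -0.2, 0.3, 0.0], W1, b1, W2, b2)
```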
4.2 Approximate Posterior
Here we model the approximate posterior $q_\phi(z \mid X)$ by an inference network shown succinctly in Figure 2(b). The inference network takes as input the features $X$ and learns a corresponding hidden representation $z$. More specifically, it outputs two components, ($\mu$, $\log \sigma^2$), corresponding to the mean and log-variance of the Gaussian parametrizing the hidden representation. We model this using a “multi-view” network which incorporates hidden representations learned from multiple modalities into a joint representation. Specifically, given $d$-dimensional hidden representations corresponding to multiple modalities $z_{title}$, $z_{net}$, and $z_{content}$, the model first concatenates these representations into a single $3d$-dimensional representation, which is then input through a 2-layer feed-forward network to output a $d$-dimensional mean vector $\mu$ and a $d$-dimensional log-variance vector $\log \sigma^2$ that parameterize the latent distribution governing $z$. We now discuss the models used for capturing each view.
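The fusion step can be sketched as below. This is one plausible reading of the description, assumed for illustration: a shared hidden layer followed by two separate linear output heads for the mean and the log-variance. The parameter names (`W_h`, `W_mu`, `W_lv`), the sizes, and the random weights are hypothetical.

```python
import random


def linear(x, W, b):
    """Affine map: W is a list of weight rows, b a list of biases."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def init_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]


def fuse_views(z_title, z_net, z_content, p):
    """Concatenate the three views and map them to (mean, log-variance)."""
    joint = z_title + z_net + z_content  # list concatenation: 3d values
    hidden = [max(0.0, v) for v in linear(joint, p["W_h"], p["b_h"])]
    mu = linear(hidden, p["W_mu"], p["b_mu"])
    log_var = linear(hidden, p["W_lv"], p["b_lv"])
    return mu, log_var


rng = random.Random(0)
d = 4  # per-view dimension, illustrative
params = {
    "W_h": init_matrix(d, 3 * d, rng), "b_h": [0.0] * d,
    "W_mu": init_matrix(d, d, rng), "b_mu": [0.0] * d,
    "W_lv": init_matrix(d, d, rng), "b_lv": [0.0] * d,
}
mu, log_var = fuse_views([0.1] * d, [0.2] * d, [0.3] * d, params)
```

Dropping a view (for example, using only the title and the network) amounts to concatenating fewer vectors and shrinking the input width of the first layer accordingly.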
4.2.1 Modeling the Title
We learn a latent representation of the title of an article by using a convolutional network. Convolutional networks have been shown to be very effective for modeling short sentences like titles of news articles. In particular, we use the same architecture proposed by Kim (2014). The input words of the title are mapped to word embeddings, concatenated, and passed through convolutional filters of varying window sizes. This is then followed by max-over-time pooling Collobert et al. (2011). The outputs of this layer are input to a fully connected layer of dimension $d$ with drop-out, which outputs $z_{title}$, the latent representation of the title.
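To make the convolution and max-over-time pooling concrete, the sketch below computes one pooled feature per filter over a toy title. It is a simplified illustration: bias terms, drop-out, and the final fully connected layer are omitted, zero-padding of short titles is an assumption of this sketch, and the embeddings and filters are made-up numbers.

```python
def conv_feature(embeddings, filt):
    """Slide one filter over the title, apply ReLU, keep the maximum.

    embeddings: list of word vectors (one per title word).
    filt: list of `window` weight vectors, one per position in the window.
    """
    window = len(filt)
    dim = len(filt[0])
    # Zero-pad titles shorter than the window (an assumption of this sketch).
    padded = embeddings + [[0.0] * dim] * max(0, window - len(embeddings))
    best = 0.0  # starting at 0 folds the ReLU into the max
    for start in range(len(padded) - window + 1):
        score = sum(w * x
                    for k in range(window)
                    for w, x in zip(filt[k], padded[start + k]))
        best = max(best, score)
    return best


def title_features(embeddings, filters):
    """One max-over-time feature per filter, across all window sizes."""
    return [conv_feature(embeddings, f) for f in filters]


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word vectors
filters = [
    [[1.0, 0.0], [0.0, 1.0]],  # window size 2
    [[-1.0, -1.0]],            # window size 1
]
features = title_features(embeddings, filters)
```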
4.2.2 Modeling the Network Structure of articles
Capturing the network structure of an article consists of two steps: (a) learning a network representation of each source based on its social graph $G$; (b) using the learned representation of each source to capture the link structure of a particular article.
We use a state-of-the-art network representation learning algorithm to learn representations of nodes in a social network. In particular, we use Node2Vec Grover and Leskovec (2016), which learns a $d$-dimensional representation of each source given the hyperlink structure graph $G$ with node set $V$. Node2Vec seeks to maximize the log likelihood of observing the neighborhood $N(u)$ of a node, given the node $u$. Let $F$ be a matrix of size $|V| \times d$, where $F(u)$ represents the embedding of node $u$. We then maximize the following likelihood function:

$$\max_F \sum_{u \in V} \log \Pr(N(u) \mid F(u))$$

We model the above likelihood similar to the Skip-gram architecture Mikolov et al. (2013) by assuming that the likelihood of observing a node $v \in N(u)$ is conditionally independent of any other node in the neighborhood given $u$. That is,

$$\log \Pr(N(u) \mid F(u)) = \sum_{v \in N(u)} \log \Pr(F(v) \mid F(u))$$

We then model

$$\Pr(F(v) \mid F(u)) = \frac{\exp(F(u) \cdot F(v))}{\sum_{w \in V} \exp(F(u) \cdot F(w))}$$

Having fully specified the log likelihood function, we can now optimize it using stochastic gradient ascent.
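The skip-gram style objective can be illustrated with a small sketch that evaluates the log likelihood for a given embedding matrix, using a full soft-max over dot products of node embeddings. The embeddings and neighborhoods below are made-up toy values; the node2vec paper itself samples neighborhoods with biased random walks and approximates the normalization with negative sampling, neither of which is shown here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_prob(F, u, v):
    """log Pr(v | u) under a soft-max over dot products of embeddings."""
    scores = [dot(F[u], F[w]) for w in range(len(F))]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[v] - log_norm


def log_likelihood(F, neighborhoods):
    """Sum of log Pr(v | u) over every node u and every neighbor v."""
    return sum(log_prob(F, u, v)
               for u, nbrs in neighborhoods.items()
               for v in nbrs)


# Toy embedding matrix (3 nodes, 2 dimensions) and toy neighborhoods.
F = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
neighborhoods = {0: [1], 1: [0], 2: [1]}
ll = log_likelihood(F, neighborhoods)
```

Gradient ascent on this quantity pulls the embeddings of sources that appear in each other's neighborhoods closer together.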
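Having learned the embedding matrix $F$ for each source node, we now model the link structure of any given article $A$ simply by the average of the network embedding representations for each link referenced in $A$. In particular, we compute $z_{net}$ as $z_{net} = \frac{1}{|L(A)|} \sum_{l \in L(A)} F(l)$, where $L(A)$ denotes the set of links referenced in $A$.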
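The averaging step is simple enough to sketch directly. The function name, the dictionary representation of the embeddings, and the zero-vector fallback for articles without any known link are assumptions of this illustration; the paper does not say how such articles are handled.

```python
def article_network_embedding(linked_sources, F, dim):
    """Average the source embeddings of all links referenced in an article.

    linked_sources: list of source identifiers linked from the article.
    F: dict mapping a source identifier to its embedding vector.
    """
    vectors = [F[s] for s in linked_sources if s in F]
    if not vectors:
        # Assumption of this sketch: no known links maps to a zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


F = {"source_a": [1.0, 3.0], "source_b": [3.0, 5.0]}  # toy embeddings
z_net = article_network_embedding(["source_a", "source_b"], F, dim=2)
z_empty = article_network_embedding([], F, dim=2)
```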
4.2.3 Modeling the Content of articles
To model the content of an article, we use a hierarchical approach with attention. In particular, we compute attention at both levels: (a) words and (b) sentences. We closely follow the approach by Yang et al. (2016) which learns a latent representation of a document using both word and sentence attention models.
We model the article hierarchically, by first representing each sentence $i$ with a hidden representation $s_i$. We model the fact that not all words contribute equally in the sentence through a word-level attention mechanism. We then learn the representation of the article by composing these individual sentence-level representations with a sentence-level attention mechanism.
Learning sentence representations
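We first map each word $w_{it}$ (the $t$-th word of sentence $i$) to its embedding through a lookup embedding matrix $W_e$. We then learn a hidden representation $h_{it}$ of the given sentence centered around word $w_{it}$ by embedding the sentence through a bi-directional GRU as described by Bahdanau et al. (2014). Since not all words contribute equally to the representation of the sentence, we introduce a word-level attention mechanism which attempts to extract relevant words that contribute to the meaning of the sentence. Specifically, following Yang et al. (2016), we learn a word-level attention matrix $W_w$ as follows:

$$u_{it} = \tanh(W_w h_{it} + b_w), \qquad \alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_{t'} \exp(u_{it'}^\top u_w)}, \qquad s_i = \sum_t \alpha_{it} h_{it}$$

where $u_w$ is a learned word-level context vector and $s_i$ is the latent representation of sentence $i$.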
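The attention pooling used at the word level, in the style of Yang et al. (2016), can be sketched as follows: each hidden state is projected through a tanh layer, scored against a context vector, and the soft-max of these scores weights the sum of hidden states. The hidden states, projection, and context vector below are toy values, and the bi-directional GRU that would produce the hidden states is not shown.

```python
import math


def attention_pool(hidden_states, W, b, context):
    """Return (pooled vector, attention weights) for a list of hidden states."""
    scores = []
    for h in hidden_states:
        # Project the hidden state and squash it with tanh.
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
        # Score the projection against the context vector.
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, hidden_states))
              for j in range(dim)]
    return pooled, weights


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word states
W = [[1.0, 0.0], [0.0, 1.0]]                            # identity projection
b = [0.0, 0.0]
context = [1.0, 1.0]
sentence_vector, weights = attention_pool(hidden_states, W, b, context)
```

The same function applies unchanged at the sentence level, with sentence hidden states in place of word hidden states.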
Composing sentence representations
We follow a similar method to learn a latent representation of an article. Given the embedding $s_i$ of each sentence in the article, we learn a hidden representation $h_i$ of the given sentence centered around $s_i$ by embedding the list of sentences through a bi-directional GRU as described by Bahdanau et al. (2014). Once again, since not all sentences contribute equally to the representation of the article, we introduce a sentence-level attention mechanism which attempts to extract relevant sentences that contribute to the meaning of the article. Specifically, following Yang et al. (2016), we learn the weights of a sentence-level attention matrix $W_s$ as

$$u_i = \tanh(W_s h_i + b_s), \qquad \alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_{i'} \exp(u_{i'}^\top u_s)}, \qquad v = \sum_i \alpha_i h_i$$

where $u_s$ is a learned sentence-level context vector and $v$ is the latent representation of the article. In this case we let the hidden representation of the sentence be a stochastic representation, similar to the work by Miao et al. (2016), and use the Gaussian re-parameterization trick to enable training via end-to-end gradient-based methods (using deterministic sentence representations is a special case). Such techniques have been shown to be useful in modeling ambiguity and also generalize well to small training datasets Miao et al. (2016).
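The Gaussian re-parameterization trick mentioned here can be sketched in a few lines: instead of sampling the latent vector directly, one samples standard normal noise and shifts and scales it by the predicted mean and standard deviation, so that gradients can flow through the mean and log-variance. The function name and the example values are illustrative assumptions.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)

# With a very negative log-variance the sample collapses onto the mean,
# which recovers the deterministic special case.
z_deterministic = sample_latent([1.0, -2.0], [-2000.0, -2000.0], rng)

# With unit variance, the empirical mean of many samples approaches mu.
samples = [sample_latent([0.5], [0.0], rng)[0] for _ in range(5000)]
empirical_mean = sum(samples) / len(samples)
```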
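4.3 Prior
The prior models $p(z \mid X)$ in Equation 2. Note that our proposed framework is general and can be used to incorporate a variety of priors. Here, we assume the prior is drawn from a Gaussian distribution with diagonal covariances. The KL divergence term in Equation 2 can thus be analytically computed. In particular, the KL divergence between two $K$-dimensional Gaussian distributions $A$ and $B$ with means $\mu_A, \mu_B$ and diagonal covariances $\mathrm{diag}(\sigma_A^2), \mathrm{diag}(\sigma_B^2)$ is:

$$D_{KL}(A \,\|\, B) = \frac{1}{2} \sum_{k=1}^{K} \left[ \log \frac{\sigma_{B,k}^2}{\sigma_{A,k}^2} + \frac{\sigma_{A,k}^2 + (\mu_{A,k} - \mu_{B,k})^2}{\sigma_{B,k}^2} - 1 \right] \qquad (3)$$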
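As a numerical check on the closed-form KL divergence between two diagonal Gaussians, the sketch below implements the standard formula with each distribution described by a mean vector and a log-variance vector. The function name and the log-variance parameterization are choices made for this illustration.

```python
import math


def kl_diag_gaussians(mu_a, log_var_a, mu_b, log_var_b):
    """KL(A || B) for diagonal Gaussians given means and log-variances."""
    total = 0.0
    for ma, lva, mb, lvb in zip(mu_a, log_var_a, mu_b, log_var_b):
        total += (lvb - lva
                  + (math.exp(lva) + (ma - mb) ** 2) / math.exp(lvb)
                  - 1.0)
    return 0.5 * total


# Identical distributions have zero divergence.
kl_same = kl_diag_gaussians([0.3, -1.0], [0.2, 0.0], [0.3, -1.0], [0.2, 0.0])

# N(1, 1) against N(0, 1) in one dimension gives 0.5.
kl_shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```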
Having described precisely the models for each of the components in Equation 2, we can reformulate the maximization of the variational lower bound as the minimization of the following loss function over the set of all learnable model parameters $\theta$:

$$\mathcal{L}(\theta) = \mathrm{NLL}(\theta) + \lambda\, \mathrm{KLD}(\theta)$$

where NLL is the negative log likelihood loss computed between the predicted label and the true label, KLD is the KL divergence term of Equation 2, and $\lambda$ is a hyper-parameter that controls the amount of regularization offered by the KL divergence term. We use AdaDelta to minimize this loss function.
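For a single training example, the loss can be sketched as the negative log likelihood of the true label plus a weighted KL term. This illustration assumes a standard Gaussian prior, which the text names as the most general uninformative case; the function names, the probabilities, and the weight `lam` are made-up values, not settings from the paper.

```python
import math


def nll(probs, label_index):
    """Negative log likelihood of the true label under predicted probs."""
    return -math.log(probs[label_index])


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def example_loss(probs, label_index, mu, log_var, lam):
    """NLL plus lam times the KL regularizer, for one example."""
    return nll(probs, label_index) + lam * kl_to_standard_normal(mu, log_var)


# Predicted probabilities over (Left, Center, Right); the true label is Left.
loss = example_loss([0.5, 0.25, 0.25], 0, mu=[1.0], log_var=[0.0], lam=0.1)
```

Here the NLL contributes log 2 and the KL term contributes 0.5, scaled by the weight 0.1.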
To place our model in context, we evaluate it against several competitive baselines which model only a single view:
- Chance Baseline: We consider a simple baseline that returns a draw from the label distribution as the prediction.
- CNN (Title): We consider a convolutional net classifier based on exactly the same architecture as Kim (2014) which uses the title of the news article. Convolutional nets have been shown to be extremely effective at classifying short pieces of text and can capture non-linearities in the feature space Kim (2014).
- Feed-forward network (Network): We also consider a simple fully-connected feed-forward neural network using only the network features to characterize the predictive power of the network alone.
- HDAM Model (Content): We use the state-of-the-art hierarchical document attention model proposed by Yang et al. (2016) that models the content of the article using both word- and sentence-level attention mechanisms.
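The chance baseline above can be sketched as sampling each prediction from the empirical label distribution of the training set. The function name, the seed, and the toy label counts are assumptions made for this illustration.

```python
import random
from collections import Counter


def chance_baseline(train_labels, n_predictions, seed=0):
    """Draw predictions from the empirical label distribution."""
    counts = Counter(train_labels)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(labels, weights=weights, k=n_predictions)


train_labels = ["Left"] * 40 + ["Center"] * 30 + ["Right"] * 30  # toy counts
predictions = chance_baseline(train_labels, n_predictions=10)
```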
We consider three different flavors of our proposed model, which differ in the subset of modalities used: (a) Title and Network, (b) Title and Content, and (c) Full model: Title, Network, and Content. We train all of our models and the baselines on the training data set, choosing all hyper-parameters using the validation set. We report the performance of all models on the held-out test set.
We set the embedding latent dimension captured by each view to be including the final latent representation obtained by fusing multiple modalities. In the case of the CNNs, we consider three convolutional filters of window sizes each yielding a dimensional feature map followed by max-over-time pooling, which is then passed through a fully connected layer to yield the output. In all the neural models, we used AdaDelta with an initial learning rate of to learn the parameters of the model via back-propagation.
|MVDAM||Title, Network, Content||80.10||79.56||79.67|
5.1 Results and Analysis
Table 1 shows the results of the evaluation. First, note that the logistic regression classifier and the CNN model using the title outperform the Chance classifier significantly (F1: vs ). Second, modeling only the network structure yields an F1 of , still significantly better than the chance baseline. This confirms our intuition that modeling the network structure can be useful in the prediction of ideology. Third, note that modeling the content (HDAM) significantly outperforms all previous baselines (F1:). This suggests that content cues can be very strong indicators of ideology. Finally, all flavors of our model outperform the baselines. Specifically, observe that incorporating the network cues outperforms all uni-modal models that model only the title, the network, or the content. It is also worth noting that without the network, the title and the content together show only a small improvement over the best performing baseline ( vs ), suggesting that the network yields cues distinct from both the title and the content. Finally, the best performing model effectively uses all three modalities to yield an F1 score of , outperforming the state-of-the-art baseline by percentage points. Altogether, our results suggest the superiority of our model over competitive baselines. In order to obtain deeper insights into our model, we also perform a qualitative analysis of our model’s predictions.
|Article Title||Source Label||Predicted Label|
|Juan Williams Makes the ’Case for Oprah’||Right||Left|
|Tourist dies hiking in Australia Outback heat||Right||Left|
|Back From China, UCLA Basketball Players Plagued by Father||Right||Left|
|Democrat Ralph Northam Elected Governor of Virginia||Right||Left|
|South Africa blighted by racially charged farm murders||Right||Left|
|Lawsuit: Stripper punched man, knocked out his front tooth||Left||Right|
|Here’s How to Keep Fake News Off Twitter||Left||Right|
|We Are All Just Overclocked Chimpanzee||Left||Right|
|Curious Arctic Fox Pups Destroy Hidden Camera In The Most Adorable Way||Left||Right|
|I am American, Jewish, and banned from Israel for my activism||Left||Right|
Visualizing Attention Scores
Figure 3 shows a visualization of sentences based on their attention scores. Note that for a left-leaning article (see Figure 3(a)), the model focuses on sentences involving gun-control, feminists, and transgender. In contrast, a visualization of sentence attention scores for an article which the model predicted as “right-leaning” (see Figure 3(b)) reveals a focus on words like god, religion, etc. These observations qualitatively suggest that the model is able to effectively pick up on content cues present in the article. By examining the distribution over the sentence indices corresponding to the maximum attention scores, we noted that in only about half the instances does the model focus its greatest attention on the beginning of the article, suggesting that the ability to selectively focus on sentences in the news article contributes to the superior performance.
In Table 2, we highlight some of the challenges of our model. In particular, our model finds it quite challenging to identify the political ideology of the source for articles that are non-political and related to global events or entertainment. Examples include instances like Tourist dies hiking in Australia Outback heat or Juan Williams Makes the ’Case for Oprah’. We also note that articles with “click-baity” titles like We are all Just Overclocked Chimpanzees are not necessarily discriminative of the underlying ideology. In summary, while our proposed model significantly advances the state of the art, it also suggests scope for further improvement, especially in identifying political ideologies of articles on topics like Entertainment or Sports. For example, prior research suggests that engagement in particular sports is correlated with political leanings Hoberman (1977), which suggests that improved models might need to capture deeper linguistic and contextual cues.
Ideological Proportions of News Sources
Finally, we compute the expected proportion of an ideology in a given source based on the probability estimates output by our model for the various articles. While one might expect that the degree of “left-ness” (or “right-ness”) for a given source can easily be computed by taking a simple mean of the prediction probability for the given ideology over all articles belonging to the source, such an approach can be inaccurate because the probability estimates output by the model are not necessarily calibrated and therefore cannot be interpreted as confidence values. We therefore use isotonic regression to calibrate the probability scores output by the model. Having calibrated the probability scores, we compute the degree to which a particular news source leans toward an ideology by simply computing the mean calibrated score over all articles corresponding to the source. Table 3 shows the top sources ranked according to their proportions for each ideology. We note that sources like CNN, BuzzFeed, and SF Chronicle are considered more left-leaning than the Washington Post. Similarly, we note that NPR and Reuters are considered to be the most center-aligned, while Breitbart, Infowars, and Blaze are considered to be the most right-aligned by our model. These observations are moderately aligned with survey results that place news sources on the ideology spectrum based on the political beliefs of their consumers (http://www.journalism.org/2014/10/21/political-polarization-media-habits/pj_14-10-21_mediapolarization-08/).
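The calibrate-then-average procedure can be sketched with a small pure-Python isotonic regression based on the pool-adjacent-violators algorithm. This is an illustration under stated assumptions: the calibrator is fit on held-out (score, binary label) pairs for one ideology, it is applied as a step function rather than with interpolation, and all function names, scores, and source names are made up.

```python
from bisect import bisect_right
from collections import defaultdict


def pav(values):
    """Pool adjacent violators: non-decreasing least-squares fit."""
    blocks = []  # each block is [sum of values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds this one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def fit_calibrator(scores, labels):
    """Fit on held-out scores and 0/1 labels; returns (thresholds, values)."""
    pairs = sorted(zip(scores, labels))
    return [s for s, _ in pairs], pav([float(y) for _, y in pairs])


def calibrate(calibrator, score):
    """Step-function lookup of the calibrated probability for a raw score."""
    xs, ys = calibrator
    return ys[max(bisect_right(xs, score) - 1, 0)]


def source_leaning(calibrator, articles):
    """Mean calibrated score per source; articles is a list of (source, score)."""
    by_source = defaultdict(list)
    for source, score in articles:
        by_source[source].append(calibrate(calibrator, score))
    return {s: sum(v) / len(v) for s, v in by_source.items()}


calibrator = fit_calibrator([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
leaning = source_leaning(
    calibrator,
    [("source_a", 0.35), ("source_a", 0.9), ("source_b", 0.05)],
)
```

Sorting the resulting dictionary by value gives a ranking of sources for the chosen ideology.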
We proposed a model to leverage cues from multiple views in the predictive task of detecting the political ideology of news articles. We show that incorporating cues from the title, the link structure, and the content significantly beats the state of the art. Finally, using the predicted probabilities of our model, we draw on methods for probability calibration to rank news sources by their ideological proportions, yielding a ranking that moderately correlates with existing surveys on the ideological placement of news sources. To conclude, our proposed framework effectively leverages cues from multiple views to yield state-of-the-art, interpretable performance and sets the stage for future work which can easily incorporate other modalities like audio, video, and images.
We thank the anonymous reviewers for their comments. This research was supported in part by DARPA Grant D18AP00044 funded under the DARPA YFA program. This work was also partially supported by NSF grants DBI-1355990 and IIS-1546113. The authors are solely responsible for the contents of the paper, and the opinions expressed in this publication do not reflect those of the funding agencies.
- Ahmed and Xing (2010) Amr Ahmed and Eric P Xing. 2010. Staying informed: supervised and semi-supervised multi-view topical analysis of ideological perspective. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1140–1150. Association for Computational Linguistics.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Card et al. (2015) Dallas Card, Amber E Boydstun, Justin H Gross, Philip Resnik, and Noah A Smith. 2015. The media frames corpus: Annotations of frames across issues. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 438–444.
- Carroll et al. (2009) Royce Carroll, Jeffrey B Lewis, James Lo, Keith T Poole, and Howard Rosenthal. 2009. Measuring bias and uncertainty in DW-NOMINATE ideal point estimates via the parametric bootstrap. Political Analysis, 17(3):261–275.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
- Dardis et al. (2008) Frank E Dardis, Frank R Baumgartner, Amber E Boydstun, Suzanna De Boef, and Fuyuan Shen. 2008. Media framing of capital punishment and its impact on individuals’ cognitive responses. Mass Communication & Society, 11(2):115–140.
- Gentzkow and Shapiro (2010) Matthew Gentzkow and Jesse M Shapiro. 2010. What drives media slant? Evidence from US daily newspapers. Econometrica, 78(1):35–71.
- Gerrish and Blei (2011) Sean Gerrish and David M Blei. 2011. Predicting legislative roll calls from text. In Proceedings of the 28th international conference on machine learning (icml-11), pages 489–496.
- Groseclose and Milyo (2005) Tim Groseclose and Jeffrey Milyo. 2005. A measure of media bias. The Quarterly Journal of Economics, 120(4):1191–1237.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Hoberman (1977) John M Hoberman. 1977. Sport and political ideology. Journal of sport and social issues, 1(2):80–114.
- Iyyer et al. (2014) Mohit Iyyer, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. 2014. Political ideology detection using recursive neural networks. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1113–1122.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Laver et al. (2003) Michael Laver, Kenneth Benoit, and John Garry. 2003. Extracting policy positions from political texts using words as data. American Political Science Review, 97(2):311–331.
- Lin et al. (2008) Wei-Hao Lin, Eric Xing, and Alexander Hauptmann. 2008. A joint topic and perspective model for ideological discourse. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 17–32. Springer.
- Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Monroe and Maeda (2004) Burt L Monroe and Ko Maeda. 2004. Talk’s cheap: Text-based estimation of rhetorical ideal-points.
- Preoţiuc-Pietro et al. (2017) Daniel Preoţiuc-Pietro, Ye Liu, Daniel Hopkins, and Lyle Ungar. 2017. Beyond binary labels: Political ideology prediction of Twitter users. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 729–740.
- Sim et al. (2013) Yanchuan Sim, Brice DL Acree, Justin H Gross, and Noah A Smith. 2013. Measuring ideological proportions in political speeches. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 91–101.
- Thomas et al. (2006) Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 327–335. Association for Computational Linguistics.
- Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.