
Who is Addressed in this Comment? Automatically Classifying Meta-Comments in News Comments

User comments have become an essential part of online journalism. However, newsrooms are often overwhelmed by the vast number of diverse comments, for which a manual analysis is barely feasible. Identifying meta-comments that address or mention newsrooms, individual journalists, or moderators and that may call for reactions is particularly critical. In this paper, we present an automated approach to identify and classify meta-comments. We compare comment classification based on manually extracted features with an end-to-end learning approach. We develop, optimize, and evaluate multiple classifiers on a comment dataset of the large German online newsroom SPIEGEL Online and the 'One Million Posts' corpus of DER STANDARD, an Austrian newspaper. Both optimized classification approaches achieved encouraging $F_{0.5}$ values between 76% and 91%. We report the results of a qualitative analysis and discuss how our work contributes to making participation in online journalism more constructive.





1. Introduction

It is becoming increasingly difficult for online newsrooms to handle the vast amount of user comments, which are heterogeneous in content and quality (Sood et al., 2012). For example, one of the most popular German online news sites, SPIEGEL Online, publishes 1.2 million user comments per year, which amounts to more than 3,000 comments per day, not counting blocked comments and comments on social media. For community moderators, manually selecting meaningful, high-quality comments is neither easy nor scalable. Journalists and journalism researchers repeatedly mention this problem: finding particularly useful or high-quality comments is like finding a needle in a haystack (Braun and Gillespie, 2011, p.387) (Heise et al., 2014; Reimer et al., 2015; Park et al., 2016). Developing tools that help moderators, journalists, and newsrooms analyze, filter, and summarize user comments has been identified as a primary challenge for news organizations (Diplaris et al., 2012; Diakopoulos, 2015b, a).

Research has shown that most journalists have a clear sense of what they deem useful user contributions (Loosen et al., 2017). For instance, journalists particularly appreciate user feedback that reports errors in articles, includes additional information on a topic, or contains critique of the quality of an article. Media companies can use this information to improve journalistic work, correct articles, answer frequent questions, or gather feedback on the quality of their news coverage.

A previous study by Loosen et al. (Loosen et al., 2017) demonstrated, through group discussions with journalists and community-moderators, that the prospect of a software system for analyzing user comments was highly welcomed. One feature journalists considered particularly useful is the ability to identify the addressee in comments, for example, the newsroom or media organization, the author of the article being commented on, actors mentioned in the article, or other actors and users. This would help to direct comments that may call for reactions, such as correcting facts, answering questions, or providing additional information, to the newsroom or to individual journalists. This is all the more relevant because user comments that address the author or the newsroom directly are likely to contain elements of media critique or praise (Craft et al., 2016).

Our work aims to develop and evaluate an approach to automatically identify and classify user comments based on whom they address. We focus on comments that are not (only) related to the article but address, for instance, the media company, a journalist, or a community-moderator. We call these comments “meta-comments”. The contribution of this paper is threefold. First, we empirically explore and evaluate the solution space for this classification task based on supervised and end-to-end machine learning approaches with respective hyperparameter optimization. Second, we propose a neural network model for the end-to-end learning which outperforms state-of-the-art comment classification reported by Schabus et al. (Schabus et al., 2017). Third, we give insights into designing comment analytics tools and use cases for the information extracted from meta-comments.

The remainder of the paper is structured as follows. Section 2 introduces the research questions, method, and data. Section 3 describes the data analysis process and the training of different word and comment embedding models. Section 4 outlines the analysis and deduction of machine learning features for a supervised machine learning approach. In Section 5 we experiment with and compare the accuracy of an end-to-end learning approach with traditional machine learning based on manually extracted features. In Section 6, we use the classifier to classify unseen user comments and qualitatively analyze the results. We then discuss the threats to validity (Section 7), related work (Section 8), the implications of our findings (Section 9), and conclude the paper in Section 10.

2. Research Design

2.1. Research Questions

Following Neuberger (Neuberger, 2009), we can differentiate between user comments related to the “object level” and those related to the “meta level”. Comments at the object level refer to what is covered, whereas those at the meta level refer to how something is covered by the newsroom or individual journalists. Actors mentioned or addressed at the object level are often determined by the topic of the respective article, for instance, politicians, companies, or celebrities. Comments addressing the writing performance or giving general feedback to the author of the article belong to the meta level. In this paper, we focus on the meta-addressees and use a hierarchy inspired by Loosen et al. (Loosen et al., 2013):

  • Media: covers the media companies, their editing, and news coverage, for instance, SPIEGEL Online (de), DER STANDARD (at), New York Times (us), or The Guardian (uk).

  • Journalist: refers to the article’s author or other persons involved as editors or reporters.

  • Community-Moderator: refers to those who manage comment sections, read comments, actively participate in discussions, release, or block comments from the comment section.

Our goal is to identify whether a user comment is a meta-comment or not and then to classify meta-comments regarding their meta-addressees. A user comment is a meta-comment if it addresses at least one meta-addressee. We focus on three research questions:

  • RQ1: Which classification approach/configuration is the most accurate for classifying meta-comments?

  • RQ2: What are informative machine learning features among text features, semantic features, and comments’ metadata to identify and classify meta-comments?

  • RQ3: Which information do classified meta-comments contain and how would it be useful?

2.2. Research Method

Figure 1 shows an overview of our methodological framework, which comprises four consecutive phases. To answer RQ1, we first deduced machine learning features for a supervised learning approach from a qualitative content analysis and related work. We trained the word and comment embeddings (Mikolov et al., 2013; Le and Mikolov, 2014) for text features, semantic features (Rumelhart et al., 1988), and for applying transfer learning (Michalski, 1983) on an end-to-end learning approach (LeCun et al., 2015). We manually labeled a training set of user comments posted on SPIEGEL Online and combined it with the “One Million Posts” corpus to optimize the hyperparameter configuration for different classifiers and classification approaches. For RQ2, we calculated the most significant features for each meta-addressee class. For RQ3, we applied the trained classifier on a random subset of unlabeled user comments, read the classified comments, and qualitatively analyzed their content. The details of each step are discussed in the corresponding result section below.

Figure 1. Overview of our research methodology with four main consecutive steps.

2.3. Research Data

To answer our research questions, we used two datasets: (1) user comments posted on SPIEGEL Online (SPON) and (2) the “One Million Posts” (OMP) corpus (Schabus et al., 2017). We selected the SPON news page for two reasons. First, SPON is the most-read online German newspaper according to (ale, 2017). Second, the topics covered are diverse and structured in articles, forums, and comments. We collected a comprehensive sample of published user comments from 01-01-2000 to 28-02-2017 with their respective metadata and all archived articles and forums. The data collection took one week and we did not notice any changes of forum features between old and new forums. Our sample comprises 11,276,843 comments (with title, text, timestamp, username, department, and quoted comments if available), 515,522 articles (with title, introduction, text, date, and partly author names), and 181,399 forums (with title and department). Most SPON articles are signed with an acronym that identifies the author, and the imprint maps these acronyms to full names. However, we could only identify the full author names for 16% of the news articles as many assignments were missing.

Additionally, we used the partly annotated comments of OMP, a dataset that consists of 11,773 labeled and one million unlabeled German online user comments posted on DER STANDARD, an Austrian newspaper website. The authors define the annotation category “feedback” as: “Sometimes users ask questions or give feedback to the author of the article or the newspaper in general, which may require a reply/reaction” (Schabus et al., 2017). This description is equivalent to our meta-comment definition.

3. Data Analysis

We describe the structure of the comment sections as well as the quantitative and qualitative content analyses.

3.1. Structure of the Comment Sections

SPON’s comment section sorts user comments by time. It does not structure the comments in threads. Figure 2 shows an example of a SPON meta-comment. To post a comment on a news article, (1) users have to log in with either a SPON or Facebook account, (2) browse to the article’s forum, and (3) compose a comment with a text and an optional title. Alternatively, users can “Reply / Quote” an existing user comment, which adds its text as a linked quote to the user comment. SPON forum moderators review each comment to check if it complies with the terms of use before it is publicly released on the SPON website. In our dataset, SPON forum moderators also contributed infrequently (1,216 comments) with the username “sysop” to the discussion (2018, 2018).

Figure 2. Example of a meta-comment in the SPON comment section.

DER STANDARD’s comment section structures comments into threads and users can rate existing user comments as “worth reading” or “not worth reading”. There are different filter and sort options. Users can filter the comments to see all postings, top postings, or postings by moderators and sort the comment list by date or rating. Forum moderators use their own name to write comments. They consider themselves as participants as opposed to rigid comment administrators and supplement the discussions through active participation, if they consider it beneficial to a discussion (m.b.H., [n. d.]).

3.2. Quantitative Content Analysis

We describe only the SPON dataset as Schabus et al. (Schabus et al., 2017) report on the OMP dataset in-depth. The number of SPON user comments per year increased steadily from 0.1 million in 2005 to 1.6 million in 2011. From 2011 to 2015, users posted between 1.2 and 1.6 million user comments per year. Users posted the majority of comments in the politics (4.5 million, 39.7%) and economy sections (2.5 million, 21.9%). The other leading sections are sport, panorama, culture, science, technology, life & learning, car, health, career, and traveling. Each of them covers less than one million user comments in total (8.9%). On average, a comment’s title is two words long and its text 69 words. 61% of the comments contain a quote. The title of a SPON article averages seven words and the article text 457 words. Users were able to comment on 32.8% of all articles. On average, one forum (article) contains 66 user comments.

3.3. Qualitative Content Analysis

We conducted a qualitative content analysis of 1,000 randomly selected SPON user comments to better understand and quantify meta-comments and to identify potentially useful machine learning features for our classification task. Each of the 1,000 comments was independently labeled by two human coders. We developed a coding guide for the labeling process in collaboration with communication researchers. It describes the labeling task with examples and defines each meta-addressee class to increase the quality of the manual labeling. Provided with the coding guide, student assistants labeled the comments. The coding guide and further resources are available on our project website. After coding, the inter-coder disagreement was at 5%, which we resolved by majority with a third coder. In this random sample, we found 54 meta-comments (5.4%) of which only five addressed the community-moderator. The second column of Table 1 summarizes the label distribution for this random sample. We interviewed the coders to deduce machine learning features from their observations.

| Label | Random sample | SPON training set | OMP training set |
|---|---|---|---|
| Media | 25 | 404 | 566 |
| Journalist | 33 | 426 | 198 |
| Moderator | 5 | 323 | 421 |
| Meta | 54 | 982 | 1,301 |
| Non-Meta | 946 | 1,127 | 4,737 |
| Total | 1,000 | 2,109 | 6,038 |

Table 1. The number of each label in the random sample, the SPON training set, and the OMP training set.

4. Feature Deduction

We describe the training of word and comment embeddings as well as the machine learning features, which we derived from the insights of our qualitative content analysis.

4.1. Training Word and Comment Embeddings

Word embeddings are a geometric way of capturing the meaning of a word by using low-dimensional vectors (Rumelhart et al., 1988). Their main advantage is that the vector representations of similar words lie close together in the vector space. We used word2vec (Mikolov et al., 2013) to obtain a distributed vector representation for German user comments. As an input, word2vec requires a text corpus as large as possible to produce low-dimensional vectors as an output. Besides word2vec, paragraph2vec (or doc2vec) (Le and Mikolov, 2014) produces document embeddings from comments or articles. We used the Python library gensim (Řehůřek and Sojka, 2010) to generate the embeddings.

We preprocessed the comments in four steps: (1) concatenated each comment’s title with its text, (2) removed stop words, (3) removed punctuation, and (4) converted the text to lower case. We noted that for word2vec, using more than 300 dimensions or a window size of more than 5 unnecessarily increases the training time while not improving the precision of the vector representation (Pennington et al., 2014).
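The four preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `preprocess_comment` and the tiny stop-word set are hypothetical, only ASCII punctuation is handled, and the text is lower-cased before stop words are filtered (a slight reordering of the listed steps) so that capitalized stop words are matched as well.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# a real pipeline would use a full German stop-word list.
GERMAN_STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "nicht"}

# Map every ASCII punctuation character to a space so that
# hyphenated or slash-separated words are split rather than fused.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_comment(title, text, stop_words=GERMAN_STOP_WORDS):
    """Return the token list for one comment."""
    combined = f"{title} {text}"                   # (1) concatenate title and text
    no_punct = combined.translate(_PUNCT_TO_SPACE)  # (3) remove punctuation
    tokens = no_punct.lower().split()               # (4) lower-case, then tokenize
    return [t for t in tokens if t not in stop_words]  # (2) drop stop words


tokens = preprocess_comment("Zensur!", "Der Artikel ist nicht gut, liebe Redaktion.")
print(tokens)
```

Token lists of this shape are the kind of input an embedding trainer expects, one list per comment.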

We used three different word embedding models for the end-to-end learning approach. Table 2 compares our generated SPON model with two other models: the OMP model according to Schabus et al. (Schabus et al., 2017), and the GermanWord model that Müller (Müller, 2015) trained on German Wikipedia and news articles.

We used the SPON user comments to train both (1) word embeddings with word2vec and (2) comment embeddings with doc2vec. We used the word embeddings to enrich a set of keywords and to pre-fill the embedding layer of a neural network (transfer learning) (Michalski, 1983). We used the comment embeddings to extract semantic features. To enable replication, our models are publicly available on our project website.

| Parameter | SPON | OMP | GermanWord |
|---|---|---|---|
| Number of dimensions | 300 | 300 | 300 |
| Vocab size | 212,630 | 129,070 | 608,130 |
| Corpus size in words | 462,269,114 | 31,489,845 | 651,219,519 |
| Min count | 50 | 5 | 5 |
| Window size | 5 | 5 | 5 |
| Training epochs | 5 | 10 | 10 |
| Training method | CBOW | CBOW | Skip-gram |

Table 2. A comparison of the training parameters between the three different word2vec models we used.

4.2. Machine Learning Features

We categorize all used machine learning features for our dataset into three groups: text features, semantic features, and metadata. We indicate the features specific to the SPON dataset with [S].

Text Features

In the following, we list the text features we identified based on the coders’ insights from the qualitative content analysis and related work discussing the criteria media organizations consider when identifying high-quality comments (Jorgensen, 2002; McElroy, 2013; Reader, 2007). Diakopoulos (Diakopoulos, 2015a, b) categorized these criteria into twelve human-centric categories, including emotionality, readability, thoughtfulness, brevity, and novelty.

  • Regular expression pattern: We identified a set of keywords based on word embeddings, which are likely to be used in meta-comments. We followed a two-step approach: (1) manual keyword collection and (2) keyword enrichment with word embeddings. We used the SPON word embeddings and fine-tuned the keywords for the SPON dataset. We started by manually collecting an initial set of keywords with communication researchers. Given the vector representations of the words in comment texts, we enriched the manually collected keywords by finding the most similar words (see Table 3). This shows how word embeddings can capture further words with a similar meaning and common misspellings. We created a regular expression (regex) based on the keywords to match words independently of the grammatical gender. We iteratively searched for user comments that match the regex pattern, assessed the matching comments, and adjusted the regex pattern to minimize unintended matches. We list the translated set of keywords for each meta-addressee class: (media) media, spon, spiegel, spiegelonline, editing, reporting, magazine; (journalist) article, journalism, contribution, author, writer, editor, penpusher, columnist, expert, reporter, spiegel editor, populist, last names of the SPON authors; (community-moderator) censorship, censored, moderation, moderator, admin, sysop.

    Word Similarity Word Similarity
    fleischhauer 1.00 autor 1.00
    fleischauer 0.91 author 0.86
    augstein 0.88 verfasser 0.85
    lobo 0.82 spiegelautor 0.80
    diez 0.80 artikelschreiber 0.80
    matussek 0.77 sponautor 0.80
    fleischhauers 0.77 autorin 0.80
    kuzmany 0.76 verfasserin 0.72
    fleichhauer 0.76 schreiberling 0.71
    münchau 0.76 rezensent 0.70
    dietz 0.75 schreiber 0.69
    nelles 0.73 spiegelredakteur 0.69
    broder 0.73 kommentator 0.68
    mattusek 0.71 kolumnist 0.68
    mattussek 0.71 artikelautor 0.68
    kaden 0.70 redakteur 0.65
    neubacher 0.70 sponredakteur 0.64
    fricke 0.70 artikelverfasser 0.64
    rickens 0.69 forist 0.63
    Table 3. Examples of similar words within the distributed vector space for the last name of the journalist “Mr. Fleischhauer” and the word “autor” (author).
  • Tf-idf: The tf-idf score of a word reveals the importance of this word in a user comment. It weights a word in proportion to how often it occurs in the comment but reduces the weight of words that occur frequently in many documents, such as stop words. We used the tf-idf representation of the comment with unigrams and bigrams without stop words.

  • Count of “Sie” occurrences: In the German language, the formal address of “you” to an unknown person is “Sie” and is written with a capital “S” even if it is situated within a sentence. We count the occurrences of this address within the sentence to separate it from the similar third-person pronoun “sie”. For the identification of each occurrence, we used the regular expression pattern “[^\.!?]\s+Sie”. We assumed that it is an indicator of a reference to the article’s author. However, our coders observed that this formal address often refers to other users. For this, commenters also use the “@” notation to indicate a reference.

  • Number of questions: Questions in comment texts might address the media company, authors, or community-moderators. Our coders mentioned typical user questions such as “Why has my comment been blocked?”. Therefore, we identified and counted the number of questions contained in a comment.

  • Length: We added together the number of characters in the comment title and text. We assumed that meta-comments might differ in their length from other user comments as previous work has also identified brevity as a quality indicator.

  • Average word length: We used the average number of characters per word as a simple measure of text complexity. Users might put more effort in the wording of a user comment and choose more sophisticated and longer words on average in meta-comments.

  • Number of capital letters: We count the number of capital letters. Users often use capital letters to indicate “yelling” in user comments. We assumed that these comments are more likely to complain about meta-addressees. Besides, users also write the names of the media companies in capital letters such as “SPIEGEL” or “DER STANDARD”.

  • Sentiment score: We used the sentiment score (sen, 2017) of the comment title and text, assuming that a high polarity score is an indicator of media-critical statements (Craft et al., 2016).
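Several of the surface-level text features above can be sketched in a few lines. This is an illustrative sketch only: the function name `surface_features` is hypothetical, the “Sie” pattern is the one quoted in the list, and counting question marks is merely an assumed way of identifying questions, since the text does not state how questions were detected. Note that the quoted pattern has no trailing word boundary, so it would also match words that merely start with “Sie”.

```python
import re

# Pattern quoted in the feature list: a capitalized "Sie" that does not
# directly follow sentence-ending punctuation, i.e. not sentence-initial.
SIE_PATTERN = re.compile(r"[^\.!?]\s+Sie")


def surface_features(title, text):
    """Compute simple text features for one comment (title plus text)."""
    full = f"{title} {text}".strip()
    words = re.findall(r"\w+", full)
    return {
        "sie_count": len(SIE_PATTERN.findall(full)),
        # Assumption: one question mark counts as one question.
        "question_count": full.count("?"),
        "length": len(title) + len(text),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "capital_letters": sum(1 for ch in full if ch.isupper()),
    }


features = surface_features(
    "Zensur?",
    "Warum haben Sie meinen Kommentar gesperrt? Sie irren sich, wenn Sie das glauben.",
)
print(features)
```

In the example, the sentence-initial “Sie” is skipped while the two mid-sentence occurrences are counted, which is the distinction the pattern is meant to draw.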

Semantic Features

We used two different semantic features, derived from comment embeddings:

  • Document vector: From paragraph2vec, we obtained a 300-dimensional dense vector representation for each comment in a distributed vector space in which semantically similar comments have a high cosine similarity. We used each dimension of this vector as a feature. Because we trained the comment embeddings only on the SPON user comments, the model infers vector representations for the OMP comments, which were not part of its training data.

  • Vector Space Distance: We utilized the comment embeddings to determine a representative average vector (class vector) for each comment class. We used the cosine distance to each class vector and the most similar class vector as features. We formally describe the semantic distance feature. Let $C$ be the set of all comments and $K$ the set of all comment classes. Further, let $v: C \to \mathbb{R}^{300}$ be the comment embedding function that yields a vector representation for a comment, and let $C_k \subseteq C$ denote the comments labeled with class $k$. Then, for each class $k \in K$ we define a class vector $\bar{v}_k$, which is an average vector as follows:

    $$\bar{v}_k = \frac{1}{|C_k|} \sum_{c \in C_k} v(c)$$

    As a feature for a comment $c \in C$, we used the cosine distance function $d$ to determine the distance $d(v(c), \bar{v}_k)$ for each $k \in K$. Additionally, we identified the class to which the class vector has the minimal distance

    $$k^{*}(c) = \arg\min_{k \in K} d(v(c), \bar{v}_k)$$

    and added it one-hot encoded as a feature.
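The vector space distance feature can be illustrated with a small, dependency-free sketch. The helper names (`mean_vector`, `cosine_distance`, `distance_features`) are hypothetical, the two-dimensional vectors stand in for the 300-dimensional comment embeddings, and all vectors are assumed to be non-zero.

```python
import math


def mean_vector(vectors):
    """Class vector: component-wise average of the comment vectors of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_features(comment_vector, class_vectors):
    """Distances to every class vector, followed by a one-hot nearest-class flag."""
    classes = sorted(class_vectors)  # fixed class order for a stable feature layout
    distances = [cosine_distance(comment_vector, class_vectors[k]) for k in classes]
    nearest = min(range(len(classes)), key=distances.__getitem__)
    one_hot = [1 if i == nearest else 0 for i in range(len(classes))]
    return distances + one_hot


# Toy labeled embeddings (illustrative values only).
class_vectors = {
    "journalist": mean_vector([[1.0, 0.0], [1.0, 0.0]]),
    "media": mean_vector([[0.0, 1.0], [0.0, 2.0]]),
}
print(distance_features([2.0, 0.0], class_vectors))
```

The feature vector has one distance per class plus one indicator per class, so its length is twice the number of classes.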


Metadata

The metadata is the set of additional properties of a user comment. More metadata was available for SPON user comments than for OMP comments. We extracted the following features from the metadata:

  • Comment number [S]: The forum lists the user comments in ascending order of time, assigning each comment a consecutive number. This number is the position of the comment in the list. We added the comment position as a feature, as early user comments might be more likely to identify errors in the article.

  • Department [S]: The SPON page is structured into twelve departments. As users post their comments to an article, we used the department of the article as a feature.

  • Quote contained [S]: Users can reply to comments from other users. With this function, users can quote a previous user’s comment text. We assumed that users who refer to other comments address another user rather than a meta-addressee. This assumption corresponded with our coders’ impressions.

  • Time: We further extracted the time stamp precisely to the minute of each comment. We add both the day of the week and the hour of the day as features.
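As a rough sketch, the metadata features could be assembled as shown below. The function name, the dictionary keys, and the timestamp format `YYYY-MM-DD HH:MM` are assumptions made for illustration; the source does not specify how the timestamps are stored or how categorical values such as the department are encoded.

```python
from datetime import datetime


def metadata_features(comment_number, department, has_quote, timestamp):
    """Build the metadata feature dictionary for one comment.

    `timestamp` is assumed to look like "2017-02-28 14:35".
    """
    posted = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return {
        "comment_number": comment_number,  # position in the forum [S]
        "department": department,          # categorical; would be one-hot encoded [S]
        "quote_contained": int(has_quote),  # 1 if the comment quotes another one [S]
        "weekday": posted.weekday(),       # Monday = 0 ... Sunday = 6
        "hour": posted.hour,               # 0 ... 23
    }


print(metadata_features(3, "politics", False, "2017-02-28 14:35"))
```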

5. Classifier Experimentation and Optimization

We used a supervised machine learning approach for the user comment classification. The classifier derives a classification model from labeled training sets to classify unseen user comments. The training set contains comments labeled as meta-comment (with meta-addressees) or non-meta-comment. Our approach uses four binary classifiers in two steps: (1) a binary classifier for meta-comment / non-meta-comment and (2) three binary classifiers, one for each meta-addressee class. For the second step, we used the one-vs-all classification strategy (Bishop, 2006, p. 182,338), which trains a binary classifier per class.
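The two-step structure can be sketched independently of any particular learning algorithm. In the sketch below, the trained binary classifiers are replaced by hypothetical keyword-based stand-ins (`keyword_classifier`) purely so that the example runs; the keyword lists are illustrative and are not the trained models described in this paper.

```python
from typing import Callable, Dict, Optional, Set

BinaryClassifier = Callable[[str], bool]


def classify_comment(
    comment: str,
    is_meta: BinaryClassifier,
    addressee_classifiers: Dict[str, BinaryClassifier],
) -> Optional[Set[str]]:
    """Step 1: meta vs. non-meta. Step 2: one binary decision per addressee class."""
    if not is_meta(comment):
        return None  # non-meta-comment, step 2 is skipped
    # One-vs-all: each class decides independently, so a comment may
    # receive several meta-addressees or none at all.
    return {label for label, clf in addressee_classifiers.items() if clf(comment)}


def keyword_classifier(keywords):
    """Hypothetical stand-in for a trained binary classifier."""
    def predict(comment: str) -> bool:
        lowered = comment.lower()
        return any(keyword in lowered for keyword in keywords)
    return predict


ADDRESSEE_CLASSIFIERS = {
    "media": keyword_classifier(["spiegel", "redaktion"]),
    "journalist": keyword_classifier(["autor", "redakteur"]),
    "moderator": keyword_classifier(["zensur", "sysop"]),
}
IS_META = keyword_classifier(
    ["spiegel", "redaktion", "autor", "redakteur", "zensur", "sysop"]
)

print(classify_comment("Der Autor hat schlecht recherchiert.", IS_META, ADDRESSEE_CLASSIFIERS))
print(classify_comment("Die Regierung sollte handeln.", IS_META, ADDRESSEE_CLASSIFIERS))
```

Keeping the first step separate means the three addressee classifiers only ever see comments already judged to be meta-comments.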

5.1. Training Set Creation

For the SPON training set, we collected coded comments for each meta-addressee class. Due to the small share of meta-comments, random sampling was not feasible for gathering enough comments per meta-addressee class. For sampling a user comment set with a higher share of meta-comments for annotation, we used (1) regular expressions and (2) cosine similarity between keywords and user comments in the vector space of the comment embeddings. We calculated the average vector of the keywords for each meta-addressee class and labeled the 100 most similar comments to each average vector. With this approach, we captured a heterogeneous set of user comments, for which manual labeling was feasible. We used the non-meta-comments of the random sample as well as the non-meta-comments of the sampling described above.

For the OMP dataset, we followed the same coding procedure to identify the meta-addressees for the 1,301 feedback comments. Table 1 shows the distribution of meta-comments and meta-addressee comments for our SPON and OMP training sets. The latter contains 240 comments that we were unable to assign to a meta-addressee class.

5.2. Classification Approaches

We compare the user comment classification results between a traditional machine learning approach and an end-to-end learning approach based on a neural network model. While the traditional classification approach requires a data representation based on hand-crafted features, neural networks can handle raw text as an input and learn high-level feature representations automatically (Goodfellow et al., 2016). They have been applied with remarkable results in different classification tasks such as object detection in images, machine translation, sentiment analysis, and text classification (Collobert et al., 2011).

Convolutional neural networks have mainly been used for image classification tasks, but researchers have also started using them to solve natural language processing tasks (Kim, 2014). Given the small training set for an end-to-end approach, we used a shallow neural network model and experimented with different numbers of epochs to prevent the model from overfitting. We padded the input comment text to a maximum length of 1,000 words. As shown in Figure 3, after the input layer our network consists of an embedding layer, a 1D convolution layer, a 1D global max pooling layer, a dense layer, and a concluding output layer with a softmax activation. For the other layers, we used the tanh activation function. We applied transfer learning (Michalski, 1983) by pre-initializing the embedding layer of the model with one of the three word2vec models compared in Table 2. While training the model, we froze the weights of the embedding layer.

Due to the small size of our training set, we conducted a stratified 10-fold cross-validation on the training set to acquire reliable results. For assessing the classification results, we report on precision, recall (to compare our results with state-of-the-art results), and the $F_{0.5}$ measure (to weight precision more heavily than recall). For the experiments, we used the Python libraries scikit-learn (Pedregosa et al., 2011) for the traditional approach and Keras (Chollet et al., 2015) for the end-to-end approach.

Figure 3. Neural network architecture with optimized hyperparameters for the user comment classification.

5.3. Hyperparameter Optimization

To answer RQ1, we performed a grid search to optimize the hyperparameters for both classification approaches. A grid search performs an exhaustive search over specified hyperparameter values for a classifier. We evaluated each parameter combination with a stratified three-fold cross-validation to reduce the computational complexity. To enable replication, the relevant source code, including the parameter grids for both approaches, is publicly available on our project website.

We value precision over recall to minimize type I errors (false positives) for the end user so that the comment analyst has to read a minimal number of wrongly-classified meta-comments. The classifier might not catch all meta-comments, but on the other hand, we minimize the time spent by the analyst reading irrelevant comments. We used the $F_\beta$ score as the scoring method for the grid search. It is the weighted harmonic mean of precision and recall (Baeza-Yates and Ribeiro-Neto, 2011, pp.327-328). We specify $\beta = 0.5$ to weight precision more heavily than recall in our evaluation metric. We compare the accuracy of five different classifiers:

  • Support Vector Machine (SVM) is known to be one of the best text classifiers found in the literature (Ben-Hur and Weston, 2010).

  • Decision Tree learning assumes that all features have finite discrete domains and that there is a single target feature representing the classification (i.e., the tree leaves) (Torgo, 2016).

  • Random Forest (Breiman, 2001) is a combination of decision tree classifiers on sub-samples and controls over-fitting.

  • The meta-classifier AdaBoost (Freund and Schapire, 1995) initially fits a classifier on the original dataset and then fits additional copies of the classifier, adjusting the weights for wrongly classified samples.

  • KNeighbors does not construct a general model. Instead, it stores the training data and derives the classification of a point from a majority vote of its nearest neighbors (Cunningham and Delany, 2007).

We additionally varied the number of the most significant features for each classifier to 10, 50, and “all features”. We conducted multiple grid search runs and added more fine-grained values into the parameter ranges to find the parameters for the best results.

The performance of neural networks is dependent on their architecture as well as the right hyperparameter selection. To optimize the neural network architecture, we also performed a grid search over the combined dataset and evaluated each configuration with a stratified three-fold cross-validation. We achieved the best results with the neural network architecture depicted in Figure 3, trained with a batch size of 32 for 5 epochs.
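The precision-weighted F score used as the scoring method in this section (the $F_{0.5}$ values reported in the abstract) follows the standard definition $F_\beta = (1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$ for precision $P$ and recall $R$. The short sketch below computes it directly; the function name `f_beta` is chosen here for illustration, and the example values are the precision and recall of the traditional approach on the combined dataset in Table 4.

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily, beta > 1 weights recall
    more heavily, and beta = 1 yields the usual F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision 0.88 and recall 0.82 (traditional approach, combined dataset).
print(round(f_beta(0.88, 0.82), 2))  # 0.87

# With beta = 0.5, trading recall for precision pays off:
print(f_beta(0.9, 0.5) > f_beta(0.5, 0.9))  # True
```

With $\beta = 0.5$, a classifier with high precision and moderate recall scores higher than one with the two values swapped, which matches the stated goal of sparing the analyst from reading wrongly classified comments.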

5.4. User Comment Classification

The grid search results showed that SVM with a linear kernel using all machine learning features achieves the best results for the SPON dataset, the OMP dataset, and the combined dataset. The best value of the penalty parameter was the same for the SPON and the combined dataset but differed for the OMP dataset. The results in Table 4 show that the traditional classification approach outperforms the end-to-end learning approach for the SPON dataset ($F_{0.5}$ of 0.91 vs. at most 0.86) and the combined dataset (0.87 vs. at most 0.85). The end-to-end approach outperforms the traditional approach on the OMP dataset (0.85 vs. 0.82) when pre-initialized with either the SPON word embedding model or the OMP model. However, the performance difference between the traditional and the end-to-end approach is negligible.

The results show a higher score if we use pre-trained word embeddings based on user comments rather than embeddings based on the Wikipedia and news corpora. It is also striking that we achieve the same scores with both the SPON and OMP embeddings. Schabus et al. (Schabus et al., 2017) have also compared different classification approaches on the Feedback category of the OMP dataset, where they achieved a best precision of 0.75, a recall of 0.71, and an F1-score of 0.63. All of our classification results outperformed their state-of-the-art results, by up to 11 percentage points for precision and 12 percentage points for recall.

P = precision, R = recall; SPON = SPIEGEL Online, OMP = One Million Posts.

| User Comment Classification Approach | SPON P | SPON R | SPON F_0.5 | OMP P | OMP R | OMP F_0.5 | Combined P | Combined R | Combined F_0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Traditional (with manual features) | 0.91 | 0.91 | 0.91 | 0.83 | 0.81 | 0.82 | 0.88 | 0.82 | 0.87 |
| End-to-End (with SPON embeddings) | 0.86 | 0.87 | 0.86 | 0.86 | 0.81 | 0.85 | 0.85 | 0.83 | 0.85 |
| End-to-End (with One Million embeddings) | 0.84 | 0.82 | 0.84 | 0.85 | 0.83 | 0.85 | 0.84 | 0.81 | 0.84 |
| End-to-End (with GermanWord embeddings) | 0.73 | 0.77 | 0.73 | 0.77 | 0.73 | 0.76 | 0.73 | 0.74 | 0.73 |

Table 4. User comment classification (meta / non-meta) results of a stratified 10-fold cross validation for three different training set compositions.
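The scores reported alongside precision and recall are consistent with the F_0.5 measure, which weights precision more heavily than recall. As a small worked check (a sketch, using the standard F-beta formula), the value for the traditional approach on the combined dataset can be recomputed from its precision and recall:

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall of the traditional approach on the combined dataset.
score = f_beta(0.88, 0.82)
rounded = round(score, 2)
```

With beta = 0.5, a drop in precision lowers the score more than an equal drop in recall, which suits a setting where falsely flagged comments cost analysts time.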

5.5. Meta-Comment Classification

For the second step, we classified meta-comments with regard to their meta-addressees for the SPON and OMP datasets. We used SVM with a linear kernel and the penalty parameter value that achieved the best results for the user comment classification. Table 5 shows the results for both datasets as well as the classification results using different feature groups, which we describe later. The SPON dataset classification achieved high F_0.5 scores of at least 0.84 for all meta-addressee classes. The scores for the SPON dataset are higher than those for the OMP dataset. For the Media and the Moderator class, the differences between the datasets are minor (F_0.5 of 0.84 versus 0.79 and 0.84 versus 0.78, respectively).

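P = precision, R = recall.

| Dataset | Feature Combination | Meta P | Meta R | Meta F_0.5 | Media P | Media R | Media F_0.5 | Journalist P | Journalist R | Journalist F_0.5 | Moderator P | Moderator R | Moderator F_0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPIEGEL Online | All | 0.91 | 0.91 | 0.91 | 0.85 | 0.81 | 0.84 | 0.88 | 0.76 | 0.86 | 0.84 | 0.87 | 0.84 |
| SPIEGEL Online | Without regex patterns | 0.82 | 0.80 | 0.82 | 0.80 | 0.63 | 0.76 | 0.73 | 0.55 | 0.68 | 0.84 | 0.68 | 0.80 |
| SPIEGEL Online | Only regex patterns | 0.90 | 0.93 | 0.91 | 0.84 | 0.85 | 0.84 | 0.89 | 0.69 | 0.84 | 0.82 | 0.86 | 0.82 |
| SPIEGEL Online | Only semantic features | 0.77 | 0.71 | 0.76 | 0.76 | 0.36 | 0.62 | 0.68 | 0.42 | 0.61 | 0.75 | 0.34 | 0.60 |
| One Million | All | 0.85 | 0.79 | 0.84 | 0.79 | 0.82 | 0.79 | 0.78 | 0.39 | 0.65 | 0.81 | 0.67 | 0.78 |
| One Million | Without regex patterns | 0.81 | 0.80 | 0.81 | 0.76 | 0.83 | 0.77 | 0.79 | 0.38 | 0.65 | 0.82 | 0.68 | 0.78 |
| One Million | Only regex patterns | 0.88 | 0.44 | 0.73 | 0.74 | 0.53 | 0.69 | 0.89 | 0.09 | 0.31 | 0.85 | 0.07 | 0.25 |
| One Million | Only semantic features | 0.73 | 0.62 | 0.70 | 0.63 | 0.80 | 0.66 | 0.74 | 0.17 | 0.45 | 0.73 | 0.47 | 0.66 |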

Table 5. User comment and meta-comment classification results of a stratified 10-fold cross-validation for both training sets, using an SVM classifier with different feature groups.

We also performed a cross-dataset classification. We trained the binary classifiers with the SPON dataset (training set) and classified the labeled user comments of the OMP dataset (test set) and vice versa. Table 6 shows the results. The scores are higher for all classes when trained on the OMP dataset and applied to the SPON dataset. The recall values were low for all classes (at most 0.38) when using the SPON training set.

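P = precision, R = recall.

| Training Set | Test Set | Meta P | Meta R | Meta F_0.5 | Media P | Media R | Media F_0.5 | Journalist P | Journalist R | Journalist F_0.5 | Moderator P | Moderator R | Moderator F_0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPIEGEL Online | One Million Posts | 0.90 | 0.38 | 0.71 | 0.82 | 0.22 | 0.53 | 0.38 | 0.33 | 0.37 | 0.59 | 0.34 | 0.51 |
| One Million Posts | SPIEGEL Online | 0.89 | 0.71 | 0.85 | 0.63 | 0.88 | 0.67 | 0.82 | 0.60 | 0.76 | 0.87 | 0.75 | 0.84 |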

Table 6. Cross-dataset classification results of an SVM classifier trained with the SPIEGEL Online data and applied on the OMP dataset and vice versa.

We tested the accuracy of the meta-comment classifier on unseen comments by classifying a random sample of 100,000 SPON comments regarding the three meta-addressee classes. The classifier assigned a label to a comment when the confidence score was greater than 0.8. In a comment analytics tool, this threshold could be a user-adjustable parameter. Instead of ranking the labeled comments according to the confidence score, we randomly selected 300 meta-comments (100 per meta-addressee). Following the coding guide (Section 3.3), the same coders manually checked whether the classification was correct. This application would be similar to a desirable use case for comment analysts (Loosen et al., 2017). We achieved the following accuracy: 0.94 (Media), 0.64 (Journalist), and 0.67 (Moderator).
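The thresholding and sampling steps can be sketched as follows. The function names, the example confidence scores, and the fixed random seed are assumptions made for this illustration; only the threshold of 0.8 and the sample size of 100 per class come from the description above.

```python
import random

def assign_labels(confidences, threshold=0.8):
    """Return the classes whose confidence score exceeds the threshold."""
    return [name for name, score in confidences.items() if score > threshold]

def sample_for_review(labeled_ids, per_class=100, seed=0):
    """Draw a random sample of labeled comment ids per class for manual checking."""
    rng = random.Random(seed)
    return {
        name: rng.sample(ids, min(per_class, len(ids)))
        for name, ids in labeled_ids.items()
    }

# Hypothetical per-class confidence scores for a single comment.
scores = {"Media": 0.93, "Journalist": 0.41, "Moderator": 0.80}
labels = assign_labels(scores)

# Hypothetical ids of comments that received the Media label.
review = sample_for_review({"Media": list(range(500))}, per_class=100)
```

Exposing `threshold` as a parameter mirrors the idea of letting analysts trade precision against the number of comments surfaced.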

5.6. Feature Significance

To answer RQ2, we calculated the analysis of variance (ANOVA) F-value for each single machine learning feature and sorted the features accordingly, as shown in Table 7. For the SPON dataset, the most significant feature for the meta-comment identification is the meta property “department_career”. In our training set, we found only 35 meta-comments posted on the career department. The results show that our extended regular expression set is a significant feature of the SPIEGEL dataset and achieves an F_0.5 score of 91% for the meta-comment class as well as F_0.5 scores between 82% and 84% for the meta-addressee classes. The regex patterns for each meta-addressee class are the most important features respectively. Other essential features are the tf-idf scores of uni-grams. Not a single tf-idf bigram is in the list.

| Dataset | Meta | F | Media | F | Journalist | F | Moderator | F |
|---|---|---|---|---|---|---|---|---|
| SPIEGEL Online | department_carreer | 437 | regex_media_matches | 390 | regex_journalist_matches | 167 | regex_moderator_matches | 680 |
| SPIEGEL Online | regex_journalist_matches | 328 | keyword_spon | 181 | regex_moderator_matches | 162 | keyword_sysop | 206 |
| SPIEGEL Online | regex_media_matches | 206 | tfidf_spiegel | 110 | tfidf_herr | 58 | tfidf_zensiert | 95 |
| SPIEGEL Online | regex_moderator_matches | 138 | keyword_spiegel | 84 | keyword_sysop | 48 | tfidf_sysop | 80 |
| SPIEGEL Online | keyword_spon | 123 | keyword_redaktion | 66 | keyword_zensiert | 40 | keyword_zensiert | 77 |
| SPIEGEL Online | tfidf_spiegel | 84 | tfidf_redaktion | 53 | tfidf_zensiert | 40 | tfidf_beitrag | 71 |
| SPIEGEL Online | keyword_artikel | 83 | tfidf_medien | 51 | department_carreer | 35 | keyword_zensur | 71 |
| SPIEGEL Online | text_capitalletters | 80 | tfidf_spon | 50 | keyword_spon | 32 | tfidf_beiträge | 60 |
| SPIEGEL Online | tfidf_artikel | 78 | keyword_sysop | 48 | regex_media_matches | 32 | keyword_moderation | 59 |
| SPIEGEL Online | keyword_spiegel | 78 | regex_moderator_matches | 43 | keyword_zensur | 31 | keyword_beitrag | 59 |
| One Million Posts | tfidf_standard | 302 | tfidf_standard | 181 | tfidf_herr | 173 | semantic_min_dist_moderator | 174 |
| One Million Posts | regex_journalist_matches | 257 | regex_media_matches | 67 | tfidf_rauscher | 147 | tfidf_postings | 91 |
| One Million Posts | semantic_min_dist_non-meta | 212 | semantic_min_dist_moderator | 54 | semantic_min_dist_journalist | 82 | tfidf_gelöscht | 67 |
| One Million Posts | semantic_min_dist_meta | 212 | tfidf_artikel | 51 | tfidf_herr rauscher | 77 | tfidf_posting | 53 |
| One Million Posts | keyword_artikel | 207 | text_avgwordlength | 48 | tfidf_frau | 76 | tfidf_artikel | 52 |
| One Million Posts | tfidf_artikel | 194 | tfidf_postings | 47 | text_num_sie | 63 | semantic_sem_16 | 49 |
| One Million Posts | keyword_redaktion | 88 | semantic_min_dist_media | 45 | semantic_sem_236 | 46 | tfidf_posts | 48 |
| One Million Posts | regex_media_matches | 81 | keyword_contained_artikel | 42 | tfidf_standard | 42 | tfidf_standard | 48 |
| One Million Posts | tfidf_redaktion | 79 | tfidf_gelöscht | 41 | semantic_sem_158 | 40 | regex_journalist_matches | 47 |
| One Million Posts | text_avgwordlength | 65 | keyword_contained_redaktion | 40 | keyword_contained_Rau | 36 | semantic_min_dist_media | 47 |

Table 7. Top ten single features for classifying user and meta-comments according to their ANOVA F-value.
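The ranking rests on the one-way ANOVA F-value of each feature, i.e. the ratio of the variation between the class means to the variation within the classes. The following is a minimal sketch of that computation for a single feature; the feature values are invented for illustration and the function name is hypothetical.

```python
def anova_f_value(groups):
    """One-way ANOVA F-value of a single feature across class groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical values of one feature (e.g. a regex match count) per class.
meta = [2, 3, 2, 4]
non_meta = [0, 1, 0, 0]
f_value = anova_f_value([meta, non_meta])
```

A feature whose values differ strongly between the classes and vary little within them receives a high F-value and therefore a high rank.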

In the OMP dataset, the minimal semantic distance is among the top ten significant features for all classes. “Herr Rauscher” (Mr. Rauscher) is a journalist for the Austrian news site. The tf-idf bigram score for “herr rauscher” is significant for the Journalist class. Also, the regex sets for Journalist and Media are among the top features. The text feature average word length appears in the list of the Meta and Media class. The text feature occurrence of “Sie” appears in the Journalist class.

For both datasets, we can see that the names of the media company are significant features: “spon”, “spiegel”, and “standard”. We assume that the bigram “der standard” is not in the list because we removed stop words, which also contain the German article “der” (the). The words “artikel” (article), “redaktion” (editorial office), and “herr” (Mr.) are significant features for both datasets.
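The assumed effect of stop-word removal on bigrams can be illustrated with a small sketch. The stop-word set is a tiny illustrative subset and the example sentence is invented; neither is taken from the actual preprocessing pipeline.

```python
# Tiny illustrative subset of German stop words, not the list used in the paper.
STOP_WORDS = {"der", "die", "das"}

def bigrams_without_stop_words(text):
    """Lowercase, drop stop words, then build bigrams from the remaining tokens."""
    tokens = [token for token in text.lower().split() if token not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Invented example sentence.
grams = bigrams_without_stop_words("Der Standard zensiert Herr Rauscher")
```

Because “der” is removed before the bigrams are formed, “der standard” can never appear as a feature, whereas “herr rauscher” survives.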

In Table 5, we compare four different feature groups using an SVM classifier with a linear kernel and a fixed penalty parameter as the baseline. We again performed a stratified 10-fold cross-validation to obtain the precision, recall, and F_0.5 score for the classification.

For the SPON dataset, the regex-based features alone achieve high results, and the improvement from further features is minor. Adding the remaining features increased the F_0.5 score by at most 2% (for Moderator). For the Journalist class, the regex patterns are an essential feature, and the F_0.5 score dropped drastically when they were removed. Semantic features by themselves achieve an F_0.5 score of up to 76% on SPON meta-comments.

In the OMP dataset, the regex features are not relevant for the classes Journalist and Moderator and barely relevant for Meta and Media (F_0.5 of 0.84 versus 0.81 and 0.79 versus 0.77 without regex patterns, respectively). The Journalist class achieves the lowest F_0.5 score of 0.65. The Media and Moderator classes achieve similar F_0.5 scores of 0.79 and 0.78.

6. Qualitative Insights into Classified Meta-Comments

To answer RQ3, we describe examples from the content of correctly classified meta-comments (true positives) from both datasets, a qualitative method inspired by Kurtanović and Maalej (Kurtanović and Maalej, 2017). The purpose of this qualitative analysis is to understand the content and the potential usefulness of meta-comments. We classified meta-comments for each meta-addressee class and dataset and identified different information types. We translate the user comments into English.

6.1. Comments Addressing the Media

The meta-comments addressing the media criticize the prioritization of the media company. These users demand justification for the attention the authors pay to a particular topic (e.g. #1,#2), report an error in the article text (e.g. #3), and praise the media coverage (e.g. #4):

#1 SPON: “[…], but it gets a whole article in the Spiegel. Please, someone explain this over-dramatization! It shows, however, that the drug policy and the anti-drug laws are lacking in goals and are, therefore, practically nonsense, but both have a lot of support from the press (Spiegel?). […]”

#2 SPON: “[…] it’s just disgusting, how journalists in Germany keep themselves busy and can seriously make a big thing out of this farce. Words fail me, that something like this does not appear as a 3-line message in the furthest corner of a tabloid newspaper, […]”

#3 OMP: “ “They complete reconnaissance aircrafts.” How does such an article come about? Is this proofread or will you press Enter after the last word and go to the coffee machine?”

#4 OMP: “Thanks, mka for the background. Most media have always only reported on the prayer room, and nebulously mentioned that the day before firefighters and a police officer had been injured, but neither how, where, in what context. Like this article, I want journalism.”

6.2. Comments Addressing the Journalist

The listed classified meta-comments addressing the journalist contain praise (#5), recommendations for other readers (#5), further questions (#6), missing information (#7), critiques (#6,#8), and corrections of factual errors (#8):

#5 SPON: “I find it very good that parents are reminded about that. All parents should read this article! […]”

#6 SPON: “Mr. Fleischhauer, what do the colleagues say about your comment? […] Are you insane?”

#7 OMP: “One should not forget in an article like this to mention who’s really to blame […]”

#8 OMP: “[…] The author of this short note (either APA or Standard) has obviously very poor geography skills: the Traunstein is a very distinctive mountain in Austria […]”

6.3. Comments Addressing the Moderator

The authors of the following meta-comments complain and ask the moderator for the rationale behind blocking previous comments (#9,#10,#12). One user requests a feedback feature for moderators so that users understand the rationale behind their decisions (#9,#11):

#9 SPON: “[…] It would be beneficial, if you could receive brief feedback on the censored contributions, why the censorship occurred. If e.g. in a longer post a part does not conform to the guidelines, one could replace it with a “[because of xxx]”, where instead of xxx it says “insulting other participants” or “glorification of violence” or whatever. A few template formulations would be enough. Then one would at least know why a contribution was censored and could be addressed in future contributions.”

#10 SPON: “It seems as if postings with the reference to “censorship” were systematically deleted here in the forum. Would you like us to spread this fact in other forums, blogs, etc.? Where among other things has this post remained: [link to a screenshot] Nothing against a deletion of unclean and unlawful contributions. […]”

#11 OMP: “Uiui, Standard deletes already published comments. I would like to know how…”

#12 OMP: “Haha and DER STANDARD actually censored a posting from me again. Why? […]”

7. Threats to Validity and Limitations

We discuss limitations to the internal and external validity of this study. Regarding the internal validity, this study contains multiple coding tasks, and human coders can cause noise in the training set data. We dealt with that issue by designing a coding guide over many iterations (Neuendorf, 2016). It defines, with examples, the criteria for a comment to belong to a specific meta-addressee class. However, annotating 1,000 random user comments is tedious. Some user comments are long, and the comment classes occur at imbalanced frequencies. In addition, the internal media responsibilities are unclear, so the coders sometimes had to assume the addressee. For example, SPON uses the username “sysop” to reply to single user questions, but it is unclear who composes these comments. This uncertainty caused disagreements between the peer-coders.

Addressees in comments are a broad field: users also address and mention, for instance, celebrities, institutions, other users, or the general public. This study focuses only on the identification and classification of German meta-comments. It is also possible to categorize meta-comments into a different set of addressee classes, which would lead to different results. We sampled part of our SPON training set based on regular expressions due to the small share of meta-comments. This procedure affected the ANOVA F-value as well as the significance of word-based features for the SPON dataset.

Regarding external validity, our work uses comments from the news sites SPIEGEL Online and DER STANDARD. User comments posted on respective Facebook or Twitter pages might use different terms or have a different style of writing. The accuracy of our classifier might be different.

The cross-dataset classification in Table 6 is an initial step to check whether the automatic classification can be used for comments on other media companies’ sites without using labeled data from their site. When training the traditional classifier on the OMP dataset and testing it on the SPON comments, we achieved a promising F_0.5 score of 0.85 for the Meta class. However, as we used user comments from only two different datasets, further evaluation is needed before this statement can be generalized.

8. Related Work

The question of who is addressed in user comments has been tackled in different studies, by different means, and for different purposes. We are currently carrying out a systematic literature review, covering the state of current research on the content analysis of user comments in online news media. To date, we have found related works that consider the variety of addressees of user comments. Most of these works conducted a qualitative content analysis and manually identified the addressees.

Collins and Nerlich (Collins and Nerlich, 2015) manually labeled direct references to other users and to the author to investigate public deliberation. Gervais (Gervais, 2015) studied incivility in online user comments. Bergt and Welker (Bergt and Welker, 2013) conducted a manual content analysis of 4,840 German user comments to check whether users refer to the quality criteria of news coverage and how it is integrated. They found that 5.9% of user comments refer to quality criteria. Lopez-Gonzalez and Guerrero-Sole (Lopez-Gonzalez and Guerrero-Sole, 2014) carried out a manual content analysis to analyze how much hate speech users direct towards the medium. They found that 2.84% of comments address the medium.

Macovei (Macovei, 2013) conducted a case study and manually analyzed 1,000 Romanian reader comments on articles about a protest. In this respect, she qualitatively analyzed the users’ expressions towards the newspaper, the authors, or to other users. Manosevitch and Walker (Manosevitch and Walker, 2009) analyzed the potential of the readers’ comments section as a constructive space for public discourse. In this regard, they manually analyzed the social process of deliberation of 124 comments where they identified how users address other users, post questions, and address an article’s content.

Rowe (Rowe, 2015) explores the differences in deliberative quality between news website users and Facebook users. To measure interactivity, he manually labeled comments that refer to other users. Al-Rawi (Al-Rawi, 2017) also analyzes the sentiment of Facebook comments. He studies the most recurrent words and phrases to assess the overall sentiment towards the topics being addressed. Carvalho et al. (Carvalho et al., 2011) have analyzed comments on political debates, in which they manually identified “opinion targets”. Opinion targets can be politicians participating in the televised debates, other relevant media personalities, or other commentators. Further, they manually annotated how human entities are mentioned in user comments, for instance, by name, position, or nickname. Word embeddings capture such variation automatically.

Park et al. (Park et al., 2016) developed a system for supporting comment moderators that identifies high-quality comments by using different analytic scores. One feature is based on the LIWC dictionary to measure users’ personal experiences. Instead of measuring quality from the users’ perspective, we focused on identifying meta-comments, with a supervised learning approach. Djuric et al. (Djuric et al., 2015) have utilized comment embeddings with paragraph2Vec to classify hate speech in comments.

Schabus et al. (Schabus et al., 2017) created the OMP dataset, which contains annotated comments for different categories. In our work, we reused the “Feedback” category as meta-comments and were able to outperform their classification results. Fast et al. (Fast et al., 2016) and Park et al. (Park et al., 2018) developed a prototype that analyzes user comments with respect to concepts. Their prototype uses word embeddings to extend the keywords given by the user to generalize a concept. Hullman et al. (Hullman et al., 2015) conducted a qualitative content analysis of user comments on presented visualizations and found that over one third of the analyzed comments provided direct critical feedback on the journalistic content. They also suggest improving the design of commenting interfaces by grouping user comments according to their reference. Google and Jigsaw have established a project called Perspective (per, 2018) that uses machine learning to automatically detect toxic language in user comments. They published an experimental model that identifies attacks on the article’s author in user comments, which are a subset of meta-comments according to our definition. To the best of our knowledge, no other work presents an automatic approach for the identification and classification of meta-comments.

9. Discussion

This paper focuses on automatically identifying and classifying meta-comments – while maximizing the accuracy and generalizability of the automated approach. Our classification approach was inspired by previous work by Maalej and Nabil (Maalej and Nabil, 2015) who classified app reviews in the domain of mobile app stores into four different feedback categories. We discuss the findings from both the technical and the application perspectives.

Using and Improving the Approach on Different Datasets

We expect our supervised learning approach to be applicable to other comment sections and other languages as it only requires the comment text and basic metadata. Applying our approach to other languages would require as many user comments as possible to precisely capture word similarities with word embeddings in that language. Additionally, a training set of a similar size to ours would be needed. The remainder of the process is language independent. One advantage of our approach is that it operates without common natural language processing methods such as lemmatization, named entity recognition, or part-of-speech tagging, which depend on pre-trained language specific models. Although word embeddings are also language specific, we can train them unsupervised on a large corpus of user comments to find words that users use in a similar context. However, it is unclear whether our approach is generalizable in other domains, for example, as part of online courses where students’ comments might address teaching materials, instructors, forum-moderators, or other students; or an online store where users’ comments might address vendors, developers, or delivery services.

We used transfer learning (Michalski, 1983) in the end-to-end classification by pre-initializing the embedding layer with pre-trained weights from the word embeddings. This approach did not use any hand-crafted features and achieved encouraging results with F_0.5 scores of 0.73 to 0.86. Typically, neural networks need large training sets to outperform traditional approaches (Goodfellow et al., 2016). Traditional approaches often perform better on small training sets as domain experts implicitly incorporate significant information through hand-crafted features (Chollet, 2018). We assume that for our experiments the hand-crafted keywords for the SPON dataset provided a considerable advantage, whereas the end-to-end approach has to derive high-level features from many training samples. We presume that, given more training data, an end-to-end classification would outperform traditional approaches. More sophisticated features from the comment thread, comment ratings, user profiles, user comment history, or the respective article might improve the accuracy, but this would require additional metadata from the comment section.
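Pre-initializing an embedding layer amounts to building a weight matrix whose rows are the pre-trained vectors of the vocabulary words. The sketch below shows one common way to do this; the reserved padding row, the random initialization of unknown words, and the toy vectors are assumptions for illustration and are not taken from the described experiments.

```python
import random

def build_embedding_matrix(vocabulary, pretrained, dim, seed=0):
    """Row i + 1 holds the vector for vocabulary[i]; row 0 is reserved for padding."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim]  # padding row
    for word in vocabulary:
        if word in pretrained:
            # Copy the pre-trained vector for known words.
            matrix.append(list(pretrained[word]))
        else:
            # Words without a pre-trained vector start from small random values.
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix

# Toy 3-dimensional vectors standing in for embeddings trained on user comments.
pretrained = {"artikel": [0.1, 0.2, 0.3], "redaktion": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(["artikel", "zensur", "redaktion"], pretrained, dim=3)
```

Such a matrix would then be handed to the embedding layer of the network as its initial weights, so that training starts from word similarities learned on unlabeled comments rather than from scratch.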

Application and Utilization of User Feedback

While this work is empirical and exploratory in nature, our intermediate goal is to develop a tool for user comment analysis that we plan to evaluate with domain experts in future work. Our qualitative insights into identified meta-comments showed that our classification can capture meta-comments with diverse constructive feedback. A comment analysis tool can aggregate and forward the identified meta-comments to the concerned stakeholders. Further, it can enable moderators and journalists to directly reply to users to allow direct participation in the forum conversations while reducing the effort of manually searching for response-worthy user comments.

Media houses can utilize user feedback from the meta-comments. The commenters addressing the media houses demand a transparent prioritization of topics by the news. They further seek to understand journalistic production routines and the sources used for an online article. To meet this demand, media houses might aim to explain newsroom working routines. An article recommendation system could utilize user recommendations as an input to highlight articles for other user groups. Journalists could reply to questions and aggregate frequent questions into a “frequently asked questions” section. Journalists could incorporate additional information provided by users either by integrating it into the article or by linking to it. A new perspective might inspire journalists to produce an additional news article. Identifying meta-comments could help journalists to double-check factual errors and fix them immediately.

In comments addressing the moderator, users actively ask for the rationale behind blocking their comments. Users even show interest in improving their contributions if moderators provided feedback about their decisions. Forum moderators could reply to deescalate the dialog with unruly users. The online forum development team could consider user feature requests, for instance, a reply function for forum moderators to educate users and provide feedback about what constitutes a desirable high-quality contribution. The dialogue between users and moderators could further help to improve the netiquette for user contributions.

Our classification approach is able to identify meta-comments that stakeholders deem useful, as they contain diverse user feedback and complaints, corrections, additional information, open questions, or clarification and feature requests. Feedback information of meta-comments could be further classified and clustered into categories, for example, as bug reports regarding the article, questions to the author, or forum feature requests. Subsequently, such automatic classification could help forward user comments to the relevant person responsible. In summary, identifying meta-comments would support stakeholders in extracting valuable information from user comments. It is also a crucial prerequisite for fostering a better dialog between media providers and users, and it increases the chances that response-worthy user comments are found at all.

10. Conclusion

With the emergence of user comments in online news media, news organizations are in need of tools to cope with the number of user comments. Researchers have found that journalists appreciate user feedback that, for instance, reports errors in articles, includes additional information on a topic, or contains critique of the quality of an article. In this paper, we present a preliminary approach to automatically identify and classify comments that are not (only) related to the news article but address, for instance, the media company, a journalist, or a community moderator. We call these comments “meta-comments”.

By using a supervised machine learning approach, we achieved encouraging results with F_0.5 scores between 76% and 91%. We found similarities between the most significant features of two large datasets. We computed word and comment embeddings based on 11 million German user comments for enriching text features, deriving semantic features, and transfer learning. The end-to-end learning approach outperformed the traditional approach on the “One Million Posts” dataset. We gained further qualitative insights into the content of automatically identified meta-comments. Finally, in our discussion, we highlight the training of word embedding models based on user comments as an important step for applying our approach to other languages. We further discuss use cases for stakeholders, such as considering the users’ forum feature requests when further developing the news comment section.


We thank V. Biryuk, J. Hennings, and H. Immler for their support with the manual labeling of the collected German user comments.


  • sen (2017) 2017. TextBlob: Simplified Text Processing - TextBlob 0.13.0 documentation. Website. (2017).
  • ale (2017) 2017. Top Sites in Germany - Alexa. Website. (2017).
  • per (2018) 2018. perspectiveapi: Perspective is an API that uses machine learning models to score the perceived impact a comment might have on a conversation. (July 2018). original-date: 2017-02-23T10:55:14Z.
  • 2018 (2018) SPIEGEL ONLINE. 2018. Besondere Nutzungsbedingungen für Ihre Beiträge. Spiegel Online (May 2018).
  • Al-Rawi (2017) Ahmed Al-Rawi. 2017. Assessing public sentiments and news preferences on Al Jazeera and Al Arabiya. International Communication Gazette 79, 1 (2017), 26–44.
  • Baeza-Yates and Ribeiro-Neto (2011) Ricardo Baeza-Yates and Berthier Ribeiro-Neto. 2011. Modern Information Retrieval: The Concepts and Technology behind Search (ACM Press Books). (2011).
  • Ben-Hur and Weston (2010) Asa Ben-Hur and Jason Weston. 2010. A user’s guide to support vector machines. Data mining techniques for the life sciences (2010), 223–239.
  • Bergt and Welker (2013) Swenja Bergt and Martin Welker. 2013. Online-Feedback als Teil redaktioneller Qualitätsprozesse von Tageszeitungen–eine Inhaltsanalyse von Leserkommentaren. OnlineDiskurse. Theorien und Methoden transmedialer OnlineDiskursforschung (2013), 346–363.
  • Bishop (2006) Christopher M Bishop. 2006. Pattern recognition and machine learning. Springer.
  • Braun and Gillespie (2011) Joshua Braun and Tarleton Gillespie. 2011. Hosting the public discourse, hosting the public: When online news and social media converge. Journalism Practice 5, 4 (2011), 383–398.
  • Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
  • Carvalho et al. (2011) Paula Carvalho, Luís Sarmento, Jorge Teixeira, and Mário J Silva. 2011. Liars and saviors in a sentiment annotated corpus of comments to political debates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2. Association for Computational Linguistics, 564–568.
  • Chollet (2018) François Chollet. 2018. Deep learning with Python. Manning Publications.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras. (2015).
  • Collins and Nerlich (2015) Luke Collins and Brigitte Nerlich. 2015. Examining user comments for deliberative democracy: A corpus-driven analysis of the climate change debate online. Environmental Communication 9, 2 (2015), 189–207.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
  • Craft et al. (2016) Stephanie Craft, Tim P Vos, and J David Wolfgang. 2016. Reader comments as press criticism: Implications for the journalistic field. Journalism 17, 6 (2016), 677–693.
  • Cunningham and Delany (2007) Padraig Cunningham and Sarah Jane Delany. 2007. k-Nearest neighbour classifiers. Multiple Classifier Systems 34 (2007), 1–17.
  • Diakopoulos (2015a) Nicholas Diakopoulos. 2015a. The Editor’s Eye: Curation and Comment Relevance on the New York Times. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 1153–1157.
  • Diakopoulos (2015b) Nicholas Diakopoulos. 2015b. Picking the NYT picks: Editorial criteria and automation in the curation of online news comments. Editors‘ Note (2015), 147.
  • Diplaris et al. (2012) Sotiris Diplaris, Symeon Papadopoulos, Ioannis Kompatsiaris, Nicolaus Heise, Jochen Spangenberg, Nic Newman, and Hakim Hacid. 2012. Making sense of it all: an attempt to aid journalists in analysing and filtering user generated content. In Proceedings of the 21st International Conference on World Wide Web. ACM, 1241–1246.
  • Djuric et al. (2015) Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web. ACM, 29–30.
  • Fast et al. (2016) Ethan Fast, Binbin Chen, and Michael S Bernstein. 2016. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 4647–4657.
  • Freund and Schapire (1995) Yoav Freund and Robert E Schapire. 1995. A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory. Springer, 23–37.
  • Gervais (2015) Bryan T Gervais. 2015. Incivility online: Affective and behavioral reactions to uncivil political posts in a web-based experiment. Journal of Information Technology & Politics 12, 2 (2015), 167–185.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
  • Heise et al. (2014) Nele Heise, Julius Reimer, Wiebke Loosen, Jan-Hinrik Schmidt, Christina Heller, and Anne Quader. 2014. Publikumsinklusion bei der Süddeutschen Zeitung. (2014).
  • Hullman et al. (2015) Jessica Hullman, Nicholas Diakopoulos, Elaheh Momeni, and Eytan Adar. 2015. Content, context, and critique: Commenting on a data visualization blog. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 1170–1175.
  • Jorgensen (2002) Karin Wahl Jorgensen. 2002. Understanding the Conditions for Public Discourse: four rules for selecting letters to the editor. Journalism Studies 3, 1 (Jan. 2002), 69–81.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
  • Kurtanović and Maalej (2017) Zijad Kurtanović and Walid Maalej. 2017. Mining user rationale from software reviews. In Requirements Engineering Conference (RE), 2017 IEEE 25th International. IEEE, 61–70.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436.
  • Loosen et al. (2017) Wiebke Loosen, Marlo Häring, Zijad Kurtanovic, Lisa Merten, Julius Reimer, Lies van Roessel, and Walid Maalej. 2017. Making sense of user comments: Identifying journalists’ requirements for a comment analysis framework. In Studies in Communication and Media. 333–364.
  • Loosen et al. (2013) Wiebke Loosen, Jan-Hinrik Schmidt, Nele Heise, Julius Reimer, and Mareike Scheler. 2013. Publikumsinklusion bei der Tagesschau. Fallstudienbericht aus dem DFG-Projekt “Die (Wieder-) Entdeckung des Publikums” (Arbeitspapiere des Hans-Bredow-Instituts Nr. 26), Hamburg. www.hans-bredow-institut.de/webfm_send/709 [19.04.2013] (2013).
  • Maalej and Nabil (2015) Walid Maalej and Hadeer Nabil. 2015. Bug report, feature request, or simply praise? on automatically classifying app reviews. In 2015 IEEE 23rd international requirements engineering conference (RE). IEEE, 116–125.
  • Macovei (2013) Elena-Irina Macovei. 2013. Neo-Nazis Sympathizers on the Forums of the Romanian Online Publications. Styles of Communication 5, 1 (2013).
  • Manosevitch and Walker (2009) Edith Manosevitch and Dana Walker. 2009. Reader comments to online opinion journalism: A space of public deliberation. In International Symposium on Online Journalism, Vol. 10. 1–30.
  • m.b.H. ([n. d.]) STANDARD Verlagsgesellschaft m.b.H. [n. d.]. Die Community-Moderatoren. ([n. d.]).
  • McElroy (2013) Kathleen McElroy. 2013. Where old (gatekeepers) meets new (media): Herding reader comments into print. Journalism Practice 7, 6 (2013), 755–771.
  • Michalski (1983) Ryszard S Michalski. 1983. A theory and methodology of inductive learning. In Machine Learning, Volume I. Elsevier, 83–134.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Müller (2015) Andreas Müller. 2015. Analyse von Wort-Vektoren deutscher Textkorpora. (7 2015).
  • Neuberger (2009) Christoph Neuberger. 2009. Internet, Journalismus und Öffentlichkeit. Analyse des Medienumbruchs. S. 19-105. Christoph Neuberger, Christian Nuernbergk (2009), 79.
  • Neuendorf (2016) Kimberly A Neuendorf. 2016. The content analysis guidebook. Sage.
  • Park et al. (2018) Deokgun Park, Seungyeon Kim, Jurim Lee, Jaegul Choo, Nicholas Diakopoulos, and Niklas Elmqvist. 2018. ConceptVector: text visual analytics via interactive lexicon building using word embedding. IEEE transactions on visualization and computer graphics 24, 1 (2018), 361–370.
  • Park et al. (2016) Deokgun Park, Simranjit Sachar, Nicholas Diakopoulos, and Niklas Elmqvist. 2016. Supporting comment moderators in identifying high quality online news comments. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 1114–1125.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, Oct (2011), 2825–2830.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP, Vol. 14. 1532–1543.
  • Reader (2007) Bill Reader. 2007. Air Mail: NPR Sees “Community” in Letters From Listeners. Journal of Broadcasting & Electronic Media 51, 4 (Dec. 2007), 651–669.
  • Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50.
  • Reimer et al. (2015) Julius Reimer, Nele Heise, Wiebke Loosen, Jan-Hinrik Schmidt, Jonas Klein, Ariane Attrodt, and Anne Quader. 2015. Publikumsinklusion beim “Freitag”. Fallstudienbericht aus dem DFG-Projekt “Die (Wieder-) Entdeckung des Publikums”. Hamburg: Hans-Bredow-Institut (2015).
  • Rowe (2015) Ian Rowe. 2015. Deliberation 2.0: Comparing the deliberative quality of online news user comments across platforms. Journal of Broadcasting & Electronic Media 59, 4 (2015), 539–555.
  • Rumelhart et al. (1988) David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. 1988. Learning representations by back-propagating errors. Cognitive modeling 5, 3 (1988), 1.
  • Schabus et al. (2017) Dietmar Schabus, Marcin Skowron, and Martin Trapp. 2017. One Million Posts: A Data Set of German Online Discussions. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Tokyo, Japan, 1241–1244.
  • Sood et al. (2012) Sara Owsley Sood, Elizabeth F Churchill, and Judd Antin. 2012. Automatic identification of personal insults on social news sites. Journal of the Association for Information Science and Technology 63, 2 (2012), 270–285.
  • Torgo (2016) Luis Torgo. 2016. Data mining with R: learning with case studies. CRC press.