Dialect Diversity in Text Summarization on Twitter

07/15/2020 · by L. Elisa Celis, et al.

Extractive summarization algorithms can be used on Twitter data to return a set of posts that succinctly capture a topic. However, Twitter datasets have a significant fraction of posts written in different English dialects. We study the dialect bias in the summaries of such datasets generated by common summarization algorithms and observe that, for datasets that have sentences from more than one dialect, most summarization algorithms return summaries that under-represent the minority dialect. To correct for this bias, we propose a framework that takes an existing summarization algorithm as a blackbox and, using a small set of dialect-diverse sentences, returns a summary that is relatively more dialect-diverse. Crucially, our approach does not need the sentences in the dataset to have dialect labels, ensuring that the diversification process is independent of dialect classification and language identification models. We show the efficacy of our approach on Twitter datasets containing posts written in dialects used by different social groups defined by race, region or gender; in all cases, our approach leads to improved dialect diversity compared to the standard summarization approaches.


1 Introduction

The popularity of social media has led to a centralized discussion on a variety of topics. This has encouraged the participation of people from different communities in online discussions, helping induce a more diverse and robust dialogue and giving voice to marginalized communities [38]. Twitter, for example, receives around 500 million posts per day, written in more than 50 languages [71]. Within English, Twitter sees a large number of posts from different dialects; this diversity has even encouraged linguists to use Twitter posts to study dialects, for example, to map regional dialect variation [28, 21] or to construct parsing tools for minority dialects [8, 47, 34]. Yet, automated language tools are often unable to handle the dialect diversity on Twitter, leading to issues such as disparate accuracy of language identification between posts written in African-American English (AAE) and standard English [7], or dialect-based discrepancies in abusive speech detection [65, 60].

Summarization algorithms for social media platforms, like Twitter, perform the task of condensing a large number of posts into a small representative sample. They are useful because they provide users with a synopsis of long discussions on these platforms. At the same time, it is important that a synopsis sufficiently represents posts written in different dialects, since dialects are representative of the participating communities. Studies have shown that a lack of representational diversity can exacerbate negative stereotypes and lead to downstream biases [35, 67]. Summarization algorithms, in particular, can aggravate negative stereotypes by providing a false perception of the ground truth [35]. Hence, it is crucial for automatically-generated text summaries to be diverse.

1.1 Our Contributions

We analyze standard summarization algorithms that represent the range of paradigms used for extractive summarization on platforms such as Twitter; this includes frequency-based algorithms (TF-IDF [46], Hybrid TF-IDF [31]), graph-based algorithms (LexRank [23], TextRank [49]), algorithms that reduce summary redundancy (Centroid-Word2Vec [63]), and pre-trained supervised approaches (SummaRuNNer [55]). All of these algorithms use different structural properties of the documents (Twitter posts, in our case) to score their importance. Our primary evaluation datasets are the TwitterAAE dataset [6] and the CrowdFlower AI Gender dataset. We observe that, for random and topic-specific collections from the TwitterAAE dataset, most algorithms return summaries that under-represent the AAE dialect. For the CrowdFlower AI Gender dataset, all algorithms other than TF-IDF return gender-biased summaries (Section 2). This trend is observed even for datasets that have posts written by users from different regions (Crises NLP dataset [30]).

To address the dialect bias and utilize the effectiveness of the existing summarization algorithms, we propose a framework that takes any existing summarization algorithm as a blackbox and returns a summary that is more dialect-diverse than the one the summarization algorithm would return without intervention. Along with the blackbox algorithm, our approach needs a small dialect-diverse control set of posts as part of the input; the generated summary is diverse in a similar manner as the control set (Section 3). Importantly, and in contrast to existing work, by using similarity metrics with items in the control set, the framework bypasses the need for dialect labels in the training or test data. Empirically, we show that our framework improves the dialect diversity of the generated summary for all Twitter datasets and discuss the deviation of the summaries generated by our framework from those generated by the blackbox algorithms (Section 4).

1.2 Related Work

Bias in NLP.

Recent studies have explored the presence of social biases in various language processing models. Pre-trained encoders [50, 9, 19] have been shown to exhibit gender, racial, and intersectional biases [10, 11, 69, 48, 54]. Downstream tasks can suffer from social biases as well; this includes gender and racial bias in sentiment-analysis systems [36], image captioning models [26], language identification [7, 45], hate/abusive speech detection [65, 60], and even speech recognition [70]. Considering the significance of these language tasks, techniques to mitigate biases in some of the above NLP applications have been proposed [5, 10, 68, 73, 74]. Our work aims to identify and mitigate dialect bias in the application of text summarization, which (to the best of our knowledge) has not been addressed in prior work.

Existing text summarization algorithms.

The importance of a sentence in a collection can be quantified in many different ways. Algorithms such as TF-IDF [46] rank sentences based on word and document frequencies. To improve the performance of TF-IDF for summarization over Twitter posts, [31] propose a Hybrid TF-IDF algorithm, which calculates word frequency over the entire collection. Other unsupervised algorithms, such as LexRank [23], TextRank [49], and centroid-based approaches [63, 51, 18, 59], quantify the importance of a sentence based on how well it represents the collection. LexRank and TextRank define a graph over the posts, quantifying the edges using pairwise similarity, and score sentences based on their centrality in the graph.

To ensure a diverse summary, many algorithms also define non-redundancy as a secondary goal of summarization, i.e., the sentences in the summary should be representative of the entire original collection. This includes the Maximal Marginal Relevance (MMR) algorithm [12], Maximum Coverage and Minimum Redundant (MCMR) models [1], Determinantal Point Processes [37], and latent-variable-based approaches [58, 39]. However, reducing redundancy has been shown to be ineffective in ensuring diversity with respect to specific attributes, such as gender or race, in a variety of applications [13, 15]. In a similar vein, we show that non-redundancy also does not lead to dialect diversity by evaluating the algorithm proposed by Rossiello et al. [63] (Centroid-Word2Vec). This approach uses pre-trained encoders and scores sentences based on the distance of their features from the centroid of the collection. (Footnote 1: We will use Word2Vec embeddings, pre-trained on a large Twitter dataset [24], for this algorithm.) While adding the sentences with the highest scores to the summary, the algorithm also checks for redundancy; if a candidate sentence is very similar to a sentence already present in the summary, it is discarded (similar to the greedy MMR approach).

We choose TextRank and Hybrid TF-IDF for our diversity analysis because they have been shown to produce better summaries (as evaluated using ROUGE metrics and manually-generated summaries) for Twitter datasets, compared to other frequency, graph, and latent variable based approaches [31, 56]. We analyze Centroid-Word2Vec to show the ineffectiveness of non-redundancy-based approaches. Further, Rossiello et al. [63] observe that Centroid-Word2Vec performs better than other approaches on the DUC-2004 task. TF-IDF and LexRank are also commonly used for Twitter datasets and serve as baselines for our analysis. The original papers for most of these algorithms primarily focused on evaluation on DUC tasks or CNN/DailyMail datasets; however, the documents in these datasets correspond to news articles and do not usually have significant dialect diversity within them.

Beyond unsupervised approaches, supervised techniques classify whether a sentence is important or not [44, 20, 55, 32]. These models are trained on datasets for which summaries are available, such as news articles [64, 27], and models pre-trained on these datasets do not always generalize well to other domains. We will evaluate the diversity of one such pre-trained model, SummaRuNNer [55]. (Footnote 2: Our work focuses on extractive summarization only, i.e., algorithms that use the sentences from the collection to create a summary. Abstractive summarization, on the other hand, aims to capture the semantic information of the dataset, and summary creation can involve paraphrasing the sentences in the dataset [42, 53]. Automated diversity evaluation for abstractive summarization algorithms is therefore more difficult, since sentences in the summary are not necessarily from the original collection; hence, we focus on extractive summarization only.)

As Twitter posts usually have metadata associated with them, some algorithms use this metadata to return summaries that are also diverse with respect to time of posts [17], and/or user-network [25, 22, 43, 2]. However, since our goal is to analyze the impact of dialect variation on summarization, we focus on techniques that aim to summarize using only the collection of posts.

Prior algorithms that aim to ensure unbiased summarization usually assume the existence of labels or partitions with respect to a given attribute (in this case, dialect). For example, [14, 41] use labels to construct fairness constraints or scoring functions to guarantee appropriate diversity in the output summary. However, dialect labels are not always available (or even desirable [4]) and automated dialect classification is a difficult task [34]. With the rapidly-evolving nature of dialects on social media, it is not reasonable to depend on robust dialect classification models for diversity in summarization. Using a dialect-diverse set of examples helps us skirt around this issue. This approach of using a visibly-diverse control set has also been employed for image summarization [15].

2 Dialect diversity analysis of standard text-summarization approaches

We first examine the dialect diversity of TF-IDF, Hybrid TF-IDF, LexRank, TextRank, Centroid-Word2Vec, and SummaRuNNer. (Footnote 3: Mathematical and implementation details of all algorithms are given in Appendix A.) All algorithms take as input a collection of Twitter posts and the desired summary size, and return a summary of the given size for the given collection.

2.1 Datasets

TwitterAAE dataset:

Our primary evaluation dataset is the large TwitterAAE dataset (Footnote 4: http://slanglab.cs.umass.edu/TwitterAAE), curated by Blodgett et al. [6]. The dataset overall contains around 60 million Twitter posts from 2013, and for each post the timestamp, user-id, and geo-location are available as well. Blodgett et al. [6] used census data to learn demographic language models for the following population categories: non-Hispanic Whites, non-Hispanic Blacks, Hispanics, and Asians; using the learned models, they report the probability of each post being written by a user of a given population category. We pre-process the dataset to remove posts for which the probability of belonging to the non-Hispanic African-American English language model or the non-Hispanic White English language model is less than 0.99. The filtered dataset contains around 102k posts belonging to the non-Hispanic African-American English language model and 1.06 million posts belonging to the non-Hispanic White English language model; for simplicity, we will refer to the two groups of posts as AAE and WHE posts in the rest of the paper.
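A minimal sketch of this filtering step is given below. The column names p_aae and p_white for the demographic-model probabilities are hypothetical placeholders, not the names used in the released dataset.

import numpy as np
import pandas as pd

def filter_by_dialect_confidence(posts: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """Keep only posts that one of the two language models assigns probability >= threshold."""
    confident = (posts["p_aae"] >= threshold) | (posts["p_white"] >= threshold)
    filtered = posts[confident].copy()
    # Record the high-confidence dialect label for later evaluation.
    filtered["dialect"] = np.where(filtered["p_aae"] >= threshold, "AAE", "WHE")
    return filtered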

We also isolate 35 keywords that occur in a non-trivial fraction of posts in both AAE and WHE partitions to study topic-based summarization. The keywords and the fraction of AAE posts in the subset of the dataset containing them are given in Figure 2.

CrowdFlower AI Gender dataset:

Dialect variation with respect to gender has received relatively little academic attention; nevertheless, prior studies have established that there is a recognizable difference between posts by men and posts by women on Twitter [57, 52]. Correspondingly, we look at the diversity of summarization algorithms with respect to the fraction of posts by men and women in the generated summaries. The CrowdFlower AI Gender dataset (Footnote 5: https://data.world/crowdflower/gender-classifier-data) has around 20,000 posts, along with some user information, including whether the account belongs to a male user, a female user, or a brand, and the account location. We remove posts with a location outside the US to maintain a certain regional uniformity. The filtered dataset contains 6176 posts, with 34% from male accounts and 35% from female accounts; the rest are labeled as posts by brands or “unknown”. (Footnote 6: We also evaluate the algorithms on the Crises NLP dataset to assess the dialect diversity of generated summaries in the case of region-based dialect variation. This dataset contains crisis-related tweets from 19 different crises (e.g., earthquakes, floods) around the world that took place between 2013 and 2015 [30]. The analysis with respect to this dataset is presented in Appendix E.)

For all datasets, we also perform additional pre-processing, such as removing URLs, converting posts to lower-case, replacing user mentions with the token ATMENTION, and handling special characters. However, we do not remove hashtags, since they are semantically part of the posts.
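The sketch below illustrates these pre-processing steps; it is a minimal example and not the exact implementation used in our experiments.

import re

def preprocess_tweet(text: str) -> str:
    """Light-weight cleaning applied before summarization (illustrative only)."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)     # remove URLs
    text = text.lower()                                    # lower-case the post
    text = re.sub(r"@\w+", "ATMENTION", text)              # replace user mentions
    text = re.sub(r"[^a-zA-Z0-9#'\s]", " ", text)          # handle special characters, keep hashtags
    return re.sub(r"\s+", " ", text).strip()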

Figure 1: TwitterAAE Evaluation 1. Plots (a) and (b) present the dialect diversity of generated summaries when the collection being summarized has 8.7% and 50% AAE posts, respectively; each point represents the mean fraction of AAE posts in a summary of the given size, with standard error as error bars. Plot (c) presents the dialect diversity in summaries of size 50 versus the dialect diversity of the original collection.

2.2 Evaluation details

Despite the filtering, the TwitterAAE dataset is prohibitively large for graph-based algorithms, due to the infeasibility of graph construction for large datasets. Hence, we limit our simulations to collections of size 5000 and generate summaries of sizes up to 200 for these collections.

TwitterAAE Evaluation 1:

We sample collections of 5000 posts from the TwitterAAE dataset, varying the fraction of AAE posts in the collection each time. The percentage of AAE posts in the collection is varied from 8.7% (i.e., the percentage of AAE posts in the entire dataset) to 90%. We run the standard summarization algorithms on each sampled collection and record the fraction of AAE posts in the generated summary. For each fraction, we repeat the process 50 times and report the mean and standard error of the fraction of AAE posts in the generated summaries.
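A minimal sketch of this evaluation loop is shown below; summarize stands in for any of the blackbox summarization algorithms, and the AAE/WHE post lists are assumed to come from the filtered dataset described in Section 2.1.

import random
from statistics import mean, stdev

def twitteraae_evaluation_1(aae_posts, whe_posts, summarize,
                            frac_aae=0.087, collection_size=5000,
                            summary_size=50, trials=50):
    """Mean and standard error of the AAE fraction in summaries of random collections."""
    n_aae = int(frac_aae * collection_size)
    fractions = []
    for _ in range(trials):
        # Sample a collection with the desired AAE fraction, keeping dialect labels.
        sample = ([(t, "AAE") for t in random.sample(aae_posts, n_aae)] +
                  [(t, "WHE") for t in random.sample(whe_posts, collection_size - n_aae)])
        random.shuffle(sample)
        labels = dict(sample)
        summary = summarize([text for text, _ in sample], summary_size)
        fractions.append(sum(labels[text] == "AAE" for text in summary) / summary_size)
    return mean(fractions), stdev(fractions) / trials ** 0.5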

TwitterAAE Evaluation 2:

Next, using the 35 common keywords in this dataset, we extract the subset of posts containing a given keyword. Once again, we run the summarization algorithms on the extracted subsets and report the difference between the fraction of AAE posts in the generated summary and the fraction of AAE posts in the subset of the dataset containing the keyword. This evaluation aims to assess the dialect diversity of summaries generated for topic-specific collections and also lets us verify whether the observations of Evaluation 1 extend to non-random collections.

Crowdflower-Gender Evaluation 1:

For this dataset, since the size is relatively small, we use the summarization algorithms on the entire dataset and report the fraction of posts written by men (amongst posts written by non-brands) for different summary sizes.

Remark 2.1.

For the CrowdFlower AI Gender dataset, the evaluation is with respect to the presented gender of the user who created the post, while for the TwitterAAE dataset, the evaluation is with respect to the dialect label of the post. The evaluation methods for the two datasets are different, but the goal is the same, i.e., to assess the representational diversity of the generated summaries. The dialects we consider in this paper are those adopted by social groups, and the disparate treatment of these dialects is closely related to the disparate treatment of the groups using them. While the AAE dialect is not necessarily used only by African-Americans, it is primarily associated with them, and studies have shown that disparate treatment of the AAE dialect can lead to racial bias [61, 33].

Figure 2: TwitterAAE Evaluation 2. Plot (a) reports the mean and standard deviation of the difference between the AAE fraction in the summary and the AAE fraction in the collection of posts containing the keyword, as a function of summary size. Plot (b) presents the fraction of AAE posts in size-50 summaries for different keywords, as well as the fraction of AAE posts in the subset of posts containing each keyword.

2.3 Results

The results for TwitterAAE Evaluation 1 are presented in Figure 1. Plots 1a, b show that for small summary sizes (less than 100), all algorithms mostly return summaries that have a smaller fraction of AAE posts than the original collection. For larger summary sizes, summaries generated by Hybrid TF-IDF are relatively more dialect diverse. Even when the fraction of AAE posts in the original collection is increased beyond 0.5, the fraction of AAE posts in size 50 summaries from all algorithms is less than the fraction of AAE posts in the original collection, as evident from Figure 1c.

The results for TwitterAAE Evaluation 2 are presented in Figure 2. For many keywords, the summaries generated by all algorithms have lower dialect diversity than the original collection. For example, for “funny” and “blessed”, the AAE fraction in summaries generated by all algorithms is less than the AAE fraction in the collection containing the keyword. There are also keyword-specific collections where the summaries are relatively more diverse; e.g., for the keyword “morning”, summaries generated by Hybrid TF-IDF and TextRank have better dialect diversity than the original collection. However, the high variance in Plot 2a shows that the summaries generated by all algorithms are not guaranteed to be sufficiently diverse across keywords.

For Crowdflower-Gender Evaluation 1 (Figure 4a), summaries generated by all algorithms other than TF-IDF have an unbalanced fraction of posts from male and female accounts. LexRank, TextRank, Hybrid TF-IDF, and Centroid-Word2Vec return summaries with a larger fraction of posts from female users, while SummaRuNNer returns summaries with a larger fraction of posts from male users.

2.4 Discussion

The dialect bias in the generated summaries is, in many cases, likely due to the fact that the scoring mechanism of all algorithms is affected by structural aspects of the dialect, such as vocabulary, length, and connectivity, which can differ across dialects. Frequency-based algorithms weight each word in a post according to its frequency; however, given that different dialects have different vocabulary sizes and different average post lengths [7], quantifying the importance of a word by its frequency can favor one dialect over the other. Similarly, for graph-based approaches on TwitterAAE datasets, the subgraph of WHE posts seems to have better clustering properties than the subgraph of AAE posts (see Appendix C.1 for more discussion on the structural differences between the dialects). Scoring sentences based on structural properties, in this case, leads to representational disparities since the algorithms do not take into account the structural differences across the dialects.

The performance of Centroid-Word2Vec shows that ensuring non-redundancy does not necessarily lead to dialect diversity, and the lack of diversity of SummaRuNNer summaries also demonstrates that pre-trained supervised models do not necessarily generalize to other domains. Furthermore, even though the diversity of frequency-based approaches seems relatively better than that of other algorithms in some cases, they are still not completely reliable for dialect diversity; Hybrid TF-IDF does not return sufficiently diverse summaries for the CrowdFlower gender dataset, and TF-IDF does not return sufficiently diverse summaries for the TwitterAAE dataset. It is also important to note that summaries generated by Centroid-Word2Vec have been shown to correlate better with human-generated summaries than those generated by frequency-based algorithms [63]. Hence, it is important to explore ways to exploit the utility of algorithms like Centroid-Word2Vec and, at the same time, ensure that the generated summaries are dialect-diverse.

Figure 3: TwitterAAE Evaluation 1 for our framework. The first row presents the evaluation of our model on collections containing 8.7% AAE posts, using Centroid-Word2Vec as the blackbox algorithm A; the second row presents the same evaluation on collections containing 50% AAE posts. Plots (a), (d) present the fraction of AAE posts in the summary for different summary sizes. Plots (b), (e) present the diversity variation with respect to α, and plots (c), (f) present the ROUGE-1 F-score between summaries generated using our model and Centroid-Word2Vec.

3 Our model to mitigate dialect bias

We propose a simple framework to correct for the dialect bias of standard summarization algorithms. Let S denote a collection of sentences. Our approach uses any standard summarization algorithm, denoted by A, as a blackbox to return a score A(s) for each s ∈ S. This score represents the importance of sentence s in the collection, and we assume that the larger the score, the more important the sentence. We also need a function to measure the pairwise similarity between sentences; we will call this function sim. An example of such a similarity function is presented later. To implicitly ensure dialect diversity in the results, we use a diversity control set T, i.e., a small set of sentences that has sufficient representation from each dialect (for example, an equal number of posts from all relevant dialects). We return a diverse and relevant summary by appropriately combining the importance score from the blackbox and the diversity with respect to the control set in the following manner. Given a hyper-parameter α ∈ [0, 1], for each x ∈ T, let f_x denote the following score function:

f_x(s) := α · A(s) + (1 − α) · sim(s, x), for all s ∈ S.

Let σ_x represent the list of sentences in S sorted by f_x in decreasing order, and let σ_x(i) denote the sentence with the i-th largest score under f_x. Based on these scores, we rank the sentences in S in the following order: first we return the sentences that have the largest score for each x ∈ T, i.e., the set {σ_x(1)} over all x ∈ T; next, we return the set {σ_x(2)} over all x ∈ T, and so on. Sentences within each set can be ranked by their scores from algorithm A. At every step, for each x we check if a sentence has already been returned; if so, we replace it with the sentence with the next-highest score under f_x. A summary based on this ranking can then be generated. The complete implementation of this algorithm is provided in Appendix B (Algorithm 1).

Our primary choice of α will be 0.5. We will call our algorithm, with α = 0.5 and blackbox A, A-balanced. For example, our algorithm with A as Centroid-Word2Vec and α = 0.5 will be called Centroid-Word2Vec-balanced. The idea of summarization based on a linear combination of scores that correspond to different goals has been used in other contexts. For topic-focused summarization, Vanderwende et al. [72] score each word by linearly adding its frequency and topic relevance score. Even MMR computes a linear combination of the importance and non-redundancy scores, the latter measured as the maximum similarity to an existing summary sentence. Using a small set of diverse examples to generate a diverse summary has also been shown to be effective for image summarization [15].
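The following sketch illustrates the ranking procedure described above; it is a simplified version of Algorithm 1 (Appendix B) that, for brevity, omits the blackbox-score tie-break within each round. The blackbox scores and the similarity function are assumed to be given.

import numpy as np

def balanced_summary(sentences, blackbox_scores, control_set, sim, k, alpha=0.5):
    """Select k sentences round-robin across the diversity control set,
    where blackbox_scores[i] = A(sentences[i]) and sim(s, x) is the pairwise similarity."""
    rankings = []
    for x in control_set:
        # f_x(s) = alpha * A(s) + (1 - alpha) * sim(s, x), indices sorted best-first.
        f_x = [alpha * blackbox_scores[i] + (1 - alpha) * sim(s, x)
               for i, s in enumerate(sentences)]
        rankings.append(list(np.argsort(f_x)[::-1]))
    summary, chosen = [], set()
    k = min(k, len(sentences))
    while len(summary) < k:
        for ranking in rankings:
            if len(summary) >= k:
                break
            # Highest-scoring sentence for this control element not yet returned.
            for i in ranking:
                if i not in chosen:
                    chosen.add(i)
                    summary.append(sentences[i])
                    break
    return summary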

Method | AAE fraction in summary | ROUGE-1 (Recall / Precision / F-score) | ROUGE-L (Recall / Precision / F-score)

Keyword: “funny”
Collection containing keyword | 0.11 | - | -
TF-IDF | 0.04 | - | -
TF-IDF-balanced | 0.10 | 0.76 / 0.79 / 0.78 | 0.77 / 0.73 / 0.75
Hybrid-TF-IDF | 0.04 | - | -
Hybrid-TF-IDF-balanced | 0.02 | 0.89 / 0.39 / 0.54 | 0.78 / 0.21 / 0.33
LexRank | 0.04 | - | -
LexRank-balanced | 0.22 | 0.53 / 0.55 / 0.54 | 0.41 / 0.36 / 0.38
TextRank | 0.06 | - | -
TextRank-balanced | 0.04 | 0.94 / 0.23 / 0.43 | 0.92 / 0.15 / 0.25
SummaRuNNer | 0.03 | - | -
SummaRuNNer-balanced | 0.10 | 0.61 / 0.51 / 0.56 | 0.45 / 0.41 / 0.43
Centroid-Word2Vec | 0.02 | - | -
Centroid-Word2Vec-balanced | 0.10 | 0.68 / 0.67 / 0.67 | 0.57 / 0.49 / 0.53

Keyword: “twitter”
Collection containing keyword | 0.10 | - | -
TF-IDF | 0.10 | - | -
TF-IDF-balanced | 0.16 | 0.72 / 0.76 / 0.74 | 0.71 / 0.69 / 0.70
Hybrid-TF-IDF | 0.11 | - | -
Hybrid-TF-IDF-balanced | 0.08 | 0.85 / 0.46 / 0.59 | 0.69 / 0.34 / 0.45
LexRank | 0.04 | - | -
LexRank-balanced | 0.22 | 0.49 / 0.53 / 0.51 | 0.33 / 0.28 / 0.30
TextRank | 0.08 | - | -
TextRank-balanced | 0.12 | 0.96 / 0.63 / 0.76 | 0.93 / 0.61 / 0.73
SummaRuNNer | 0.14 | - | -
SummaRuNNer-balanced | 0.26 | 0.57 / 0.53 / 0.55 | 0.42 / 0.39 / 0.40
Centroid-Word2Vec | 0.06 | - | -
Centroid-Word2Vec-balanced | 0.12 | 0.64 / 0.66 / 0.65 | 0.51 / 0.44 / 0.47

Table 1: TwitterAAE Evaluation 2. The performance of our model for the keywords “twitter” and “funny”, using different blackbox algorithms A. The size of the generated summary is 50. The ROUGE scores are computed for summaries generated by our model (A-balanced) against summaries generated by A.

4 Empirical analysis of our model

We repeat the evaluations proposed in Section 2 for our framework. Recall that TwitterAAE Evaluation 1 assesses the diversity of summaries generated for random collections from the TwitterAAE dataset with a varying fraction of AAE posts; TwitterAAE Evaluation 2 assesses the diversity of summaries generated for keyword-specific collections; and Crowdflower-Gender Evaluation 1 assesses the diversity of summaries generated for the CrowdFlower AI Gender dataset.

To construct diversity control sets, we sample small sets from the same domain as the evaluation dataset and assess their performance on a dialect clustering task. The sets that perform well on this task are chosen as diversity control sets; the control set used for the TwitterAAE evaluations contains 28 posts, with an equal number of AAE and WHE posts, and the set used for the Crowdflower-Gender evaluations contains 40 posts, with an equal number of posts written by male and female user accounts. Details of the construction process and the exact control sets used are provided in Appendices C and D.

We use the following similarity function for a given pair of sentences s1, s2: sim(s1, s2) := cos(v_{s1}, v_{s2}), the cosine similarity between their feature vectors, where v_s denotes the feature vector of sentence s. To obtain feature vectors for the sentences, we use a publicly-available Word2Vec model pre-trained on a corpus of 400 million Twitter posts [24]. First, we use the Word2Vec model to get feature vectors for the words in a sentence and then aggregate them by computing a weighted average, where the weight assigned to a word is proportional to the smooth inverse frequency of the word (see Arora et al. [3] for details).
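A sketch of this sentence-embedding step is given below, assuming the pre-trained word vectors and corpus word frequencies are already loaded; following Arora et al. [3], each word is weighted by a/(a + p(w)) (the principal-component removal step of that method is omitted here).

import numpy as np

def sentence_vector(sentence, word_vectors, word_freq, a=1e-3):
    """SIF-weighted average of word embeddings; word_vectors maps word -> vector,
    word_freq maps word -> relative frequency p(w) in the corpus."""
    words = [w for w in sentence.split() if w in word_vectors]
    if not words:
        return None
    weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
    vectors = np.array([word_vectors[w] for w in words])
    return weights @ vectors / weights.sum()

def sim(s1, s2, word_vectors, word_freq):
    """Cosine similarity between the SIF-weighted embeddings of two sentences."""
    v1 = sentence_vector(s1, word_vectors, word_freq)
    v2 = sentence_vector(s2, word_vectors, word_freq)
    if v1 is None or v2 is None:
        return 0.0
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))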

We also compare the summaries generated by the standard algorithms and by our framework using ROUGE recall, precision, and F1-scores [40]. We report ROUGE-1 scores, which quantify the amount of unigram overlap between the generated summary and the reference summary, and ROUGE-L scores, which are based on the longest common subsequence of the generated and reference summaries.
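For illustration, this comparison can be computed with the rouge-score package as sketched below; the paper does not specify which ROUGE implementation was used, so this is only one possible choice.

from rouge_score import rouge_scorer

def compare_summaries(blackbox_summary, balanced_summary):
    """ROUGE-1 and ROUGE-L of the balanced summary, using the blackbox summary as reference."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
    scores = scorer.score(" ".join(blackbox_summary), " ".join(balanced_summary))
    return {name: (s.recall, s.precision, s.fmeasure) for name, s in scores.items()}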

Figure 4: Crowdflower-Gender Evaluation 1. Fraction of non-brand posts from male user accounts in summaries generated by the standard summarization algorithms (a) and by our framework (b).

4.1 Results

The performance of our model for TwitterAAE Evaluation 1, using Centroid-Word2Vec as the blackbox algorithm, is presented in Figure 3. Plots 3a and 3d show that using our model with α = 0.5 (Centroid-Word2Vec-balanced) leads to improved dialect diversity in the summary. For the case when the initial collection has 50% AAE posts, Centroid-Word2Vec-balanced generates summaries that have 40% AAE posts; to achieve better dialect diversity in the summary, the value of α needs to be increased (Plot 3e). The comparison using other algorithms is presented in Appendix C.3.

The performance on TwitterAAE Evaluation 2 for two keywords, “twitter” and “funny”, is presented in Table 1. We see that, in most cases, our model leads to a higher fraction of AAE posts in the summary compared to the blackbox algorithm alone. However, α = 0.5 is not always the ideal choice; e.g., for the keyword “funny” with Hybrid-TF-IDF or TextRank as the blackbox, the dialect diversity of our model is less than the dialect diversity of the summary from Hybrid-TF-IDF or TextRank alone. In such cases, either α or the fraction of AAE posts in the diversity control set can be made larger; the variation with respect to these parameters is presented in Appendix C.4, along with the results for other keywords.

The performance of our model for Crowdflower-Gender Evaluation 1 is presented in Figure 4b. Once again, Centroid-Word2Vec-balanced returns summaries that are relatively more balanced with respect to male and female users than the summaries generated by Centroid-Word2Vec alone. The results using other blackbox algorithms and different α values are presented in Appendix D.

The ROUGE scores for TwitterAAE Evaluation 1 are presented in Figures 3c, f. As expected, the similarity between the summary generated by our model and the summary generated by Centroid-Word2Vec decreases as the value of α increases. For summary size 200, the ROUGE-1 F-score between the compared summaries is greater than 0.7, implying significant word overlap between the two summaries. The ROUGE scores in Table 1 show that, for TwitterAAE Evaluation 2, if the required diversity correction is small, then the recall scores tend to be large. For Centroid-Word2Vec, the recall is greater than 0.64, implying that the summary from our algorithm covers at least 64% of the words in the summary of the blackbox algorithm. However, in cases where the summaries generated by the blackbox algorithm originally have low dialect diversity, the recall scores tend to be small (for example, LexRank-balanced); in these cases, a larger deviation from the original summaries is necessary to ensure sufficient dialect diversity. Note that this ROUGE assessment does not necessarily quantify the usability or the accuracy of the summaries generated by our model; it simply measures the amount of deviation from the summaries of the standard algorithms.

5 Conclusion

This paper addresses the issue of dialect diversity in automatically-generated summaries for Twitter datasets. We show that standard summarization algorithms often return summaries that are dialect-biased. To address this bias, we propose a framework that uses a small set of dialect-diverse posts to improve the diversity of the generated summaries. By using state-of-the-art summarization algorithms as blackboxes, we seek to exploit the utility of these algorithms, and by using a diverse set of examples, we ensure that the fairness framework is independent of dialect labels and classification tools.

The context of our analysis is, however, limited to extractive algorithms over Twitter datasets; future work in this direction should inspect diversity for domains beyond Twitter and develop ways to evaluate the diversity of summaries from abstractive summarization algorithms. Finally, the construction of benchmark dialect-diverse collections, along with manually-generated summaries for these collections, would also help better assess the accuracy of the summaries generated by our algorithm and by future approaches.

Broader Impact

Using a small set of dialect-diverse Twitter posts to improve the dialect diversity of the summaries generated by standard summarization algorithms appears to be effective in our framework. Considering the importance of the diversity control set to our framework, the societal and policy impact of the composition of the control set requires careful deliberation.

While we provide a mechanism to construct such a diversity control set for the datasets in the Appendix, in general, the choice of such a set will be context-dependent, and the ability of the framework to mitigate bias in the summary will depend on whether the control set is appropriately chosen. Dialects represent communities, and the composition of the diversity control set should ensure sufficient representation of all the user dialects and participating communities of the application; correspondingly, to guarantee that the control set is sufficiently diverse, decisions regarding its composition should be made in a responsible manner. This could involve additional steps such as a regular public audit of these sets, as well as ways to obtain and incorporate community feedback on their composition.

Summarization algorithms are often implemented by organizations and engineers that collect the data (for example, Twitter), and users impacted by these algorithms usually have little influence on the design decisions. By ensuring that the design of control sets takes into account community feedback, our framework lets the user have a say in the representational diversity of the generated summaries. Such a participatory design has been encouraged in fairness literature as it leads to a more cooperative framework [66, 16].

Finally, it is important to note that using a misrepresentative control set can lead to worse diversity results; for example, using sentences in the control set that do not belong to the same domain as the dataset will lead to an overall worse summary. The fairness-accuracy tradeoff (in our case dialect diversity vs ROUGE-scores) should be taken into account while deciding the composition of the control set. An ideal implementation of our framework would involve an active dialogue between the users and engineers, with the user-feedback ensuring representational diversity in the control set and the engineers quantifying and discussing the feasibility of various control sets.

References

  • [1] R. M. Alguliev, R. M. Aliguliyev, M. S. Hajirahimova, and C. A. Mehdiyev (2011) MCMR: maximum coverage and minimum redundant text summarization model. Expert Systems with Applications 38 (12), pp. 14514–14522. Cited by: §1.2.
  • [2] N. Alsaedi, P. Burnap, and O. Rana (2016) Automatic summarization of real world events using twitter. In Tenth International AAAI Conference on Web and Social Media, Cited by: §1.2.
  • [3] S. Arora, Y. Liang, and T. Ma (2016) A simple but tough-to-beat baseline for sentence embeddings. Cited by: §C.2.1, §4.
  • [4] C. J. Beukeboom and C. Burgers (2019) How stereotypes are shared through language: a review and introduction of the social categories and stereotypes communication (SCSC) framework. Review of Communication Research 7, pp. 1–37. Cited by: §1.2.
  • [5] S. L. Blodgett, S. Barocas, H. Daumé, and H. Wallach (2020) Language (technology) is power: a critical survey of “bias” in nlp. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), External Links: Link Cited by: §1.2.
  • [6] S. L. Blodgett, L. Green, and B. O’Connor (2016) Demographic dialectal variation in social media: a case study of african-american english. arXiv preprint arXiv:1608.08868. Cited by: §1.1, §2.1.
  • [7] S. L. Blodgett and B. O’Connor (2017) Racial disparity in natural language processing: a case study of social media african-american english. arXiv preprint arXiv:1707.00061. Cited by: §1.2, §1, §2.4.
  • [8] S. L. Blodgett, J. Wei, and B. O’Connor (2018) Twitter universal dependency parsing for african-american and mainstream american english. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1415–1425. Cited by: §C.2, §1.
  • [9] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: ISSN 2307-387X Cited by: §C.2.1, §1.2.
  • [10] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pp. 4349–4357. Cited by: §1.2.
  • [11] A. Caliskan, J. J. Bryson, and A. Narayanan (2017) Semantics derived automatically from language corpora contain human-like biases. Science 356 (6334), pp. 183–186. Cited by: §1.2.
  • [12] J. Carbinell and J. Goldstein (2017) The use of mmr, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR Forum, Vol. 51, pp. 209–210. Cited by: §1.2.
  • [13] L. E. Celis, A. Deshpande, T. Kathuria, and N. K. Vishnoi (2016) How to be fair and diverse?. arXiv preprint arXiv:1610.07183. Cited by: §1.2.
  • [14] L. E. Celis, V. Keswani, D. Straszak, A. Deshpande, T. Kathuria, and N. K. Vishnoi (2018) Fair and diverse dpp-based data summarization. arXiv preprint arXiv:1802.04023. Cited by: §1.2.
  • [15] L. E. Celis and V. Keswani (2019) Implicit diversity in image summarization. arXiv preprint arXiv:1901.10265. Cited by: Appendix B, §1.2, §1.2, §3.
  • [16] S. Chancellor, S. Guha, J. Kaye, J. King, N. Salehi, S. Schoenebeck, and E. Stowell (2019) The relationships between data, power, and justice in cscw research. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing, pp. 102–105. Cited by: Broader Impact.
  • [17] A. Chellal and M. Boughanem (2018) Optimization framework model for retrospective tweet summarization. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 704–711. Cited by: §1.2.
  • [18] V. T. Chou, L. Kent, J. A. Góngora, S. Ballerini, and C. D. Hoover (2019) Towards automatic extractive text summarization of a-133 single audit reports with machine learning. arXiv preprint arXiv:1911.06197. Cited by: Appendix A, §1.2.
  • [19] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §C.2.1, §1.2.
  • [20] Y. Dong, Y. Shen, E. Crawford, H. van Hoof, and J. C. K. Cheung (2018) Banditsum: extractive summarization as a contextual bandit. arXiv preprint arXiv:1809.09672. Cited by: §1.2.
  • [21] G. Doyle (2014) Mapping dialectal variation by querying social media. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 98–106. Cited by: §1.
  • [22] Y. Duan, Z. Chen, F. Wei, M. Zhou, and H. Y. Shum (2012) Twitter topic summarization by ranking tweets using social influence and content quality. In Proceedings of COLING 2012, pp. 763–780. Cited by: §1.2.
  • [23] G. Erkan and D. R. Radev (2004) Lexrank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, pp. 457–479. Cited by: Appendix A, §1.1, §1.2.
  • [24] F. Godin (2019) Improving and interpreting neural networks for word-level prediction tasks in natural language processing. Ph.D. Thesis, Ghent University, Belgium. Cited by: Appendix A, §C.2.1, §C.2.1, §4, footnote 1.
  • [25] R. He and X. Duan (2018) Twitter summarization based on social network and sparse reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.2.
  • [26] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach (2018) Women also snowboard: overcoming bias in captioning models. In European Conference on Computer Vision, pp. 793–811. Cited by: §1.2.
  • [27] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Advances in neural information processing systems, pp. 1693–1701. Cited by: Appendix A, §1.2.
  • [28] Y. Huang, D. Guo, A. Kasakoff, and J. Grieve (2016) Understanding us regional linguistic variation with twitter data analysis. Computers, Environment and Urban Systems 59, pp. 244–255. Cited by: §1.
  • [29] M. Imran, C. Castillo, J. Lucas, P. Meier, and S. Vieweg (2014) AIDR: artificial intelligence for disaster response. In Proceedings of the 23rd International Conference on World Wide Web, pp. 159–162. Cited by: Appendix E.
  • [30] M. Imran, P. Mitra, and C. Castillo (2016) Twitter as a lifeline: human-annotated twitter corpora for nlp of crisis-related messages. arXiv preprint arXiv:1605.05894. Cited by: Appendix E, §1.1, footnote 6.
  • [31] D. Inouye and J. K. Kalita (2011) Comparing twitter summarization algorithms for multiple post summaries. In 2011 IEEE Third international conference on privacy, security, risk and trust and 2011 IEEE third international conference on social computing, pp. 298–306. Cited by: Appendix A, Appendix A, §1.1, §1.2, §1.2.
  • [32] A. Jadhav and V. Rajan (2018) Extractive summarization with swap-net: sentences and words from alternating pointer networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 142–151. Cited by: §1.2.
  • [33] T. Jones, J. R. Kalbfeld, R. Hancock, and R. Clark (2019) Testifying while black: an experimental study of court reporter accuracy in transcription of african american english. Language 95 (2), pp. e216–e252. Cited by: Remark 2.1.
  • [34] A. Jørgensen, D. Hovy, and A. Søgaard (2015) Challenges of studying and processing dialects in social media. In Proceedings of the Workshop on Noisy User-generated Text, pp. 9–18. Cited by: §1.2, §1.
  • [35] M. Kay, C. Matuszek, and S. A. Munson (2015) Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3819–3828. Cited by: §1.
  • [36] S. Kiritchenko and S. M. Mohammad (2018) Examining gender and race bias in two hundred sentiment analysis systems. arXiv preprint arXiv:1805.04508. Cited by: §1.2.
  • [37] A. Kulesza and B. Taskar (2012) Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083. Cited by: §1.2.
  • [38] M. Lavan (2015) The negro tweets his presence: black twitter as social and political watchdog. Modern Language Studies, pp. 56–65. Cited by: §1.
  • [39] J. Lee, S. Park, C. Ahn, and D. Kim (2009) Automatic generic document summarization based on non-negative matrix factorization. Information Processing & Management 45 (1), pp. 20–34. Cited by: §1.2.
  • [40] C. Lin and E. Hovy (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157. Cited by: §4.
  • [41] H. Lin and J. Bilmes (2011) A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 510–520. Cited by: §1.2.
  • [42] H. Lin and V. Ng (2019) Abstractive summarization: a survey of the state of the art. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9815–9822. Cited by: footnote 2.
  • [43] X. Liu, Y. Li, F. Wei, and M. Zhou (2012) Graph-based multi-tweet summarization using social signals. In Proceedings of COLING 2012, pp. 1699–1714. Cited by: §1.2.
  • [44] Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. Cited by: §1.2.
  • [45] K. Lu, P. Mardziel, F. Wu, P. Amancharla, and A. Datta (2018) Gender bias in neural natural language processing. arXiv preprint arXiv:1807.11714. Cited by: §1.2.
  • [46] H. P. Luhn (1957) A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development 1 (4), pp. 309–317. Cited by: Appendix A, §1.1, §1.2.
  • [47] T. Lynn, K. Scannell, and E. Maguire (2015) Minority language twitter: part-of-speech tagging and analysis of irish tweets. Cited by: §1.
  • [48] C. May, A. Wang, S. Bordia, S. R. Bowman, and R. Rudinger (2019) On measuring social biases in sentence encoders. arXiv preprint arXiv:1903.10561. Cited by: §1.2.
  • [49] R. Mihalcea and P. Tarau (2004) Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404–411. Cited by: Appendix A, §1.1, §1.2.
  • [50] T. Mikolov, K. Chen, G. S. Corrado, and J. A. Dean (2015-May 19) Computing numeric representations of words in a high-dimensional space. Google Patents. Note: US Patent 9,037,464 Cited by: §C.2.1, §1.2.
  • [51] D. Miller (2019) Leveraging bert for extractive text summarization on lectures. arXiv preprint arXiv:1906.04165. Cited by: Appendix A, §1.2.
  • [52] Z. Miller, B. Dickinson, and W. Hu (2012) Gender prediction on twitter using stream algorithms with n-gram character features. Cited by: §2.1.
  • [53] N. Moratanch and S. Chitrakala (2016) A survey on abstractive text summarization. In 2016 International Conference on Circuit, power and computing technologies (ICCPCT), pp. 1–7. Cited by: footnote 2.
  • [54] M. Nadeem, A. Bethke, and S. Reddy (2020) StereoSet: measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456. Cited by: §1.2.
  • [55] R. Nallapati, F. Zhai, and B. Zhou (2017) Summarunner: a recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: Appendix A, §1.1, §1.2.
  • [56] M. Nguyen, D. V. Lai, H. T. Nguyen, and M. Le Nguyen (2018) Tsix: a human-involved-creation dataset for tweet summarization. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: Appendix A, Appendix A, Appendix A, §1.2.
  • [57] M. Ott (2016) Tweet like a girl: a corpus analysis of gendered language in social media. Yale University, apr. Cited by: §2.1.
  • [58] M. G. Ozsoy, F. N. Alpaslan, and I. Cicekli (2011) Text summarization using latent semantic analysis. Journal of Information Science 37 (4), pp. 405–417. Cited by: §1.2.
  • [59] A. Padmakumar and A. Saran (2016) Unsupervised text summarization using sentence embeddings. Technical report Technical Report, University of Texas at Austin. Cited by: §1.2.
  • [60] J. H. Park, J. Shin, and P. Fung (2018) Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231. Cited by: §1.2, §1.
  • [61] J. R. Rickford (2016) Raciolinguistics: how language shapes our ideas about race. Oxford University Press. Cited by: Remark 2.1.
  • [62] A. Rosenberg and J. Hirschberg (2007) V-measure: a conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp. 410–420. Cited by: §C.2.1.
  • [63] G. Rossiello, P. Basile, and G. Semeraro (2017) Centroid-based text summarization through compositionality of word embeddings. In Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pp. 12–21. Cited by: Appendix A, Appendix A, §1.1, §1.2, §1.2, §1.2, §2.4.
  • [64] E. Sandhaus (2008) The new york times annotated corpus. Linguistic Data Consortium, Philadelphia 6 (12), pp. e26752. Cited by: §1.2.
  • [65] M. Sap, D. Card, S. Gabriel, Y. Choi, and N. A. Smith (2019) The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1668–1678. Cited by: §1.2, §1.
  • [66] H. Sassaman, J. Lee, J. Irvine, and S. Narayan (2020) Creating community-based tech policy: case studies, lessons learned, and what technologists and communities can do together. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 685–685. Cited by: Broader Impact.
  • [67] M. Snyder, E. D. Tanke, and E. Berscheid (1977) Social perception and interpersonal behavior: on the self-fulfilling nature of social stereotypes.. Journal of Personality and social Psychology 35 (9), pp. 656. Cited by: §1.
  • [68] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. Belding, K. Chang, and W. Y. Wang (2019) Mitigating gender bias in natural language processing: literature review. arXiv preprint arXiv:1906.08976. Cited by: §1.2.
  • [69] Y. C. Tan and L. E. Celis (2019) Assessing social and intersectional biases in contextualized word representations. In Advances in Neural Information Processing Systems, pp. 13209–13220. Cited by: §1.2.
  • [70] R. Tatman (2017) Gender and dialect bias in youtube’s automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 53–59. Cited by: §1.2.
  • [71] (2020) Twitter usage statistics. Note: https://www.internetlivestats.com/twitter-statistics Cited by: §1.
  • [72] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova (2007) Beyond sumbasic: task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management 43 (6), pp. 1606–1618. Cited by: §3.
  • [73] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457. Cited by: §1.2.
  • [74] J. Zhao, Y. Zhou, Z. Li, W. Wang, and K. Chang (2018) Learning gender-neutral word embeddings. arXiv preprint arXiv:1809.01496. Cited by: §1.2.

Appendix A Details of summarization algorithms explored in Section 2

We first examine the dialect diversity of standard text summarization approaches; we limit our analysis to algorithms that have either previously been employed for Twitter datasets or the ones that represent recent effective approaches for extractive summarization. The algorithms we consider take as input a collection of Twitter posts and return a summary of the collection.

TF-IDF.

This is a well-known baseline for information retrieval [46] that uses the frequency of the words in a sentence to quantify their weight in the sentence. At the same time, if a word is very common and occurs in a lot of sentences, then it is likely that the word is part of the grammatical structure, and hence the inverse document frequency is also taken into account while calculating its score. (Footnote 7: Internally implemented using the python sklearn and networkx libraries.) For any sentence s in collection S, let W(s) denote the set of words in the sentence. Then the weight assigned to this sentence is

score(s) = Σ_{w ∈ W(s)} tf(w, s) · log(|S| / df(w)),

where tf(w, s) is the number of times w occurs in s and df(w) is the number of sentences in which w occurs.

Hybrid TF-IDF.

The standard TF-IDF has been noted to have poor performance for Twitter posts, primarily due to the lack of generalization of Twitter posts as documents [56]. Correspondingly, a Hybrid TF-IDF approach [31] was proposed. The primary difference is that the Hybrid TF-IDF approach calculates word frequency considering the entire collection as a single document (Footnote 7). For any sentence s in collection S, the weight assigned to s is

score(s) = Σ_{w ∈ W(s)} tf(w, S) · log(|S| / df(w)),

where tf(w, S) is the number of times w occurs in the entire collection S.
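A minimal sketch of these two scoring rules (as reconstructed above; the length normalization used by some Hybrid TF-IDF variants is omitted) is:

import math
from collections import Counter

def tfidf_and_hybrid_scores(collection):
    """Per-sentence TF-IDF and Hybrid TF-IDF scores for a collection of posts."""
    tokenized = [post.split() for post in collection]
    df = Counter(w for words in tokenized for w in set(words))      # document frequency
    corpus_tf = Counter(w for words in tokenized for w in words)    # collection-level term frequency
    n = len(collection)
    tfidf, hybrid = [], []
    for words in tokenized:
        tf = Counter(words)
        tfidf.append(sum(tf[w] * math.log(n / df[w]) for w in tf))
        hybrid.append(sum(corpus_tf[w] * math.log(n / df[w]) for w in set(words)))
    return tfidf, hybrid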

LexRank.

This unsupervised summarizer constructs a graph over the dataset, with similarity between sentences quantifying the edge-weights of the graph [23]. The similarity between a pair of sentences is measured using cosine distance between their TF-IDF word vectors. Using the PageRank algorithm, sentences are ranked based on how “central” or well-connected they are within the graph. (Footnote 8: https://github.com/crabcamp/lexrank)
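A LexRank-style scoring step can be sketched as follows; this is an illustration, not the referenced implementation, and it omits the similarity thresholding used in the original algorithm.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_scores(collection):
    """PageRank centrality over a sentence graph weighted by TF-IDF cosine similarity."""
    tfidf = TfidfVectorizer().fit_transform(collection)
    similarity = cosine_similarity(tfidf)            # edge weights of the sentence graph
    graph = nx.from_numpy_array(similarity)
    ranks = nx.pagerank(graph, weight="weight")
    return [ranks[i] for i in range(len(collection))]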

TextRank.

An extension of the LexRank approach, TextRank quantifies the similarity between sentences using a modified score of word and document frequency [49]. TextRank also uses the PageRank algorithm to rank the sentences in the collection; however, it has been shown to achieve slightly better performance for some standard datasets [56] (Footnote 7).

Centroid-Word2Vec.

Rossiello et al. [63] propose a centroid-based summarization algorithm that scores sentences based on the distance of their TF-IDF features from the centroid of the dataset (also similar to [51, 18]). While adding the sentences with the highest scores to the summary, the algorithm also checks for redundancy based on the similarity of features extracted from a pre-trained model; if a candidate sentence is very similar to the sentences already added to the summary, it is discarded (similar to the MMR approach). We will use Word2Vec embeddings, pre-trained on a large Twitter dataset [24], for this algorithm. For our framework, we need the algorithm to return scores for all sentences; hence, to efficiently incorporate non-redundancy, we add to the score of each sentence a non-redundancy score as well. The non-redundancy score is calculated as the minimum cosine distance between the feature of the current sentence and the feature of any sentence that had a better importance score. By comparing to this algorithm, we will also show that algorithms that just ensure non-redundancy are not sufficient to ensure dialect diversity. (Footnote 9: https://github.com/TextSummarizer/TextSummarizer)
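A sketch of this scoring step, assuming the sentence feature vectors have already been computed (e.g., with the Word2Vec sentence embeddings described in Section 4), could look like:

import numpy as np

def centroid_scores(sentence_vectors):
    """Centroid-based importance with the non-redundancy adjustment described above;
    sentence_vectors is an (n, d) array of sentence embeddings."""
    X = sentence_vectors / np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    centroid = X.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    importance = X @ centroid                      # cosine similarity to the centroid
    order = np.argsort(importance)[::-1]           # sentences sorted by importance, best first
    scores = importance.copy()
    for rank, i in enumerate(order[1:], start=1):
        better = order[:rank]
        # Non-redundancy: minimum cosine distance to any sentence with a better importance score.
        scores[i] += np.min(1.0 - X[i] @ X[better].T)
    return scores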

SummaRuNNer.

Finally, we use a recent recurrent neural network based method, SummaRuNNer [55], that considers summarization to be a sequential classification problem over the sequence of sentences in the dataset and generates summaries comparable to the state of the art for the CNN/DailyMail dataset [27]. Since it is not possible to train this model on the Twitter datasets we consider (due to the non-availability of dataset-summary pairs for our setting), we use the model pre-trained on a standard summarization evaluation dataset. (Footnote 10: https://github.com/hpzhao/SummaRuNNer) The dialect diversity of summaries from this algorithm will also show that pre-trained algorithms do not necessarily generalize well to other domains.

Inouye and Kalita [31] empirically analyze the performance of TF-IDF, Hybrid TF-IDF, LexRank, and TextRank on small Twitter datasets (containing only around 1500 tweets for 50 trending topics, which is not sufficient for a diversity analysis). Their findings suggest that simple frequency-based summarizers, such as Hybrid TF-IDF, produce better results for Twitter summarization than TF-IDF, LexRank, and TextRank (as evaluated using ROUGE metrics and manually-generated summaries). For larger and more recent Twitter datasets, Nguyen et al. [56] found that TextRank and Hybrid TF-IDF have similar performance. Similarly, Rossiello et al. [63] showed that the centroid-based approach performs better than LexRank, frequency-based, and RNN-based models on the DUC-2004 task. The original papers for most of these algorithms primarily focused on evaluation on DUC tasks or CNN/DailyMail datasets; however, the documents in these datasets correspond to news articles from a particular agency and do not usually have significant dialect diversity within them.

Input: Dataset of sentences S, query q, blackbox algorithm A, similarity function sim, diversity control set T, parameter α ∈ [0, 1], number of sentences to be returned k
for all s ∈ S, x ∈ T do
    f_x(s) ← α · A(s) + (1 − α) · sim(s, x)
end for
M ← ∅                                          ▷ Summary under construction
while |M| < k do
    C ← ∅
    for all x ∈ T do                            ▷ Find the best remaining sentence for each x
        s_x ← argmax over s ∈ S \ M of f_x(s)
        if s_x ∉ C then                         ▷ Checking duplicates
            C ← C ∪ {s_x}
            score(s_x) ← A(s_x)                 ▷ Scores used for tie-breaks
        end if
    end for
    if |M| + |C| ≤ k then                       ▷ If all of C can be added
        M ← M ∪ C
    else                                        ▷ Tie-break when k is not a multiple of |T|
        C′ ← the (k − |M|) sentences from C with the highest values of score
        M ← M ∪ C′
    end if
end while
return M
Algorithm 1: Algorithm for our model in Section 3

Appendix B Full Algorithm

The full algorithm for the model proposed in Section 3 is presented in Algorithm 1. Our framework is similar to the query-based image summarization framework used in [15].
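Below is a minimal Python sketch of the selection loop in Algorithm 1, under the reading given in the listing above: the blackbox supplies an importance score per sentence, each diversity-control element repeatedly "pulls in" its most similar unselected sentence, and blackbox ranks break ties when only part of a round fits into the summary. Function and variable names are illustrative; this is a sketch, not our released implementation.

```python
import numpy as np

def diverse_summary(sentences, blackbox_scores, control_set, sim, k):
    """sentences: list[str]; blackbox_scores: list[float]; control_set: list[str];
    sim(a, b) -> similarity between two sentences; k: summary size."""
    # Rank of each sentence under the blackbox (0 = most important); used for tie-breaks.
    rank = {s: r for r, s in enumerate(
        sorted(range(len(sentences)), key=lambda i: -blackbox_scores[i]))}
    summary, remaining = [], set(range(len(sentences)))
    while len(summary) < k and remaining:
        images = []
        for t in control_set:                                    # find the image of each control element
            img = max(remaining, key=lambda i: sim(sentences[i], t))
            if img not in images:                                # checking duplicates
                images.append(img)
        if len(summary) + len(images) <= k:                      # the whole round fits
            chosen = images
        else:                                                    # tie-break by blackbox rank
            chosen = sorted(images, key=lambda i: rank[i])[:k - len(summary)]
        summary.extend(chosen)
        remaining -= set(chosen)
    return [sentences[i] for i in summary]
```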

b.1 Time complexity of the algorithm

Let T_A denote the time taken by the blackbox algorithm A to score all elements of S. Since the algorithm needs to create a similarity matrix of size |S| × |T|, there is an additive term of at least |S|·|T|. Furthermore, selecting the best element for each column of this matrix can be done in two ways, i.e., either by sorting each column or by maintaining a max-heap over each column; in both cases, this adds a multiplicative log|S| factor. Overall, the time complexity of Algorithm 1 is O(T_A + |S|·|T|·log|S|).

b.2 Machine specifications for simulations

The machine used for running the simulations has the following specifications: 16 cores, 64 GB RAM, 60 GB disk space and Amazon Linux AMI.

Appendix C Other details and results for TwitterAAE dataset

c.1 Structural Properties of AAE and WHE dialects

As mentioned earlier, posts written in different dialects can differ structurally. In this section, we list certain structural differences between posts written in the AAE and WHE dialects. The pairwise similarity between vectors is calculated using the cosine distance, i.e.,

$\mathrm{dist}(u, v) = 1 - \frac{u \cdot v}{\|u\|\,\|v\|}.$

The smaller the distance, the more similar the vectors. We calculate the average pairwise distance for a set of 1000 randomly selected AAE posts and for a set of 1000 randomly selected WHE posts. To quantify how well the graphs based on pairwise similarity are connected, we also report the spectral gap (i.e., the difference between the largest and second-largest eigenvalues) of the weighted adjacency matrix of the graph. The larger the spectral gap, the better connected the graph built from the posts of that dialect. Once again, the spectral gap is calculated for the set of 1000 randomly selected AAE posts and the set of 1000 randomly selected WHE posts.
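For concreteness, the two metrics can be computed as follows; this is a sketch under the stated definitions, and the exact normalization used to produce Table 2 may differ.

```python
# Average pairwise cosine distance within a set of posts, and the spectral gap of
# the weighted adjacency (similarity) matrix built from their feature vectors.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

def avg_pairwise_distance(features):
    d = cosine_distances(features)                 # d[i, j] = 1 - cos(u_i, u_j)
    n = d.shape[0]
    return d[np.triu_indices(n, k=1)].mean()       # average over distinct pairs

def spectral_gap(features):
    w = cosine_similarity(features)                # weighted adjacency matrix
    np.fill_diagonal(w, 0.0)                       # no self-loops
    eig = np.sort(np.linalg.eigvalsh(w))[::-1]     # eigenvalues, largest first
    return eig[0] - eig[1]                         # largest minus second largest
```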

Features | AAE | WHE
Number of posts in dataset | 102k | 1.07 million
Vocabulary size | 57k | 258k
Average length of post | 8.08 (6.19) | 11.40 (7.04)
Average pairwise similarity between TF-IDF vectors | 0.99 (0.03) | 0.98 (0.04)
Average pairwise similarity between Word2Vec vectors | 0.58 (0.13) | 0.52 (0.13)
Graph-based metrics
Average spectral gap for graph using TF-IDF vectors | 7.53 (0.94) | 12.57 (0.67)
Average spectral gap for graph using Word2Vec vectors | 395.00 (6.82) | 461.17 (4.17)
Table 2: Structural differences between AAE and WHE posts. For average metrics, the standard error is given in parentheses.

These structural differences can lead to summarization disparities. For frequency-based methods, quantifying the importance of a word only by its frequency in a post can favor one dialect over the other. Indeed, the vocabulary size of all AAE posts is around 57k while the vocabulary size of all WHE posts is around 258k, and the average length of an AAE post is 8.1, while the average length of a WHE post is 11.4.

Similarly, for graph-based approaches on TwitterAAE datasets, the subgraph of WHE posts seems to have better clustering properties than the subgraph of AAE posts, leading to better representation of the WHE dialect in the summaries of graph-based approaches. We observe that the spectral gap of the sub-graph of AAE posts is always smaller than the spectral gap of the sub-graph of WHE posts, i.e., the sub-graph of WHE posts is better connected than the sub-graph of AAE posts. This implies that a summarization algorithm based on PageRank, when choosing the next sentence for the summary, is more likely to choose a WHE tweet than an AAE tweet. While scoring sentences based on structural properties of the language is effective when the dataset is written in a single dialect, here it leads to representational disparities since the algorithm does not take into account the structural differences across dialects.

c.2 Choice and efficacy of diversity control set

Before we empirically analyze our model, we need methods to construct a diversity control set. For this analysis, we limit ourselves to assessing diversity with respect to the AAE and WHE dialects. We employ a smaller processed version of the TwitterAAE dataset, containing 250 AAE posts and 250 WHE posts, provided by Blodgett et al. [8] to develop tools for AAE language parsing. We use this small dataset to select diversity control sets and evaluate the efficacy of our model.

Figure 5: Efficacy of using a diversity control set to identify posts from different dialects, i.e., how effective a diversity control set of a given size is at clustering posts of the different dialects into different clusters. (a) Maximum AUC score vs control set size: the average maximum AUC score achieved by a control set across folds. (b) Mean AUC score vs control set size: the mean AUC score achieved by a control set across folds. (c) V-measure score vs control set size: the mean V-measure across folds, as an alternative measure.

c.2.1 Evaluation details

The size of the diversity control set should ideally be much smaller than the evaluation dataset; this assists in better selection and curation of the control sets. Correspondingly, we restrict the size of the control sets in our simulations to at most 50.

We use a 5-fold cross-validation setup for this simulation. For each fold, we have a validation partition of 400 posts (containing an equal number of AAE and WHE posts) and a train partition of 100 posts (containing an equal number of AAE and WHE posts); we use the train partition to construct a diversity control set. We randomly block-sample a set of posts from the train partition, making sure that the set has an equal number of AAE and WHE posts, and use it as the diversity control set; let T denote this set of posts and let V denote the validation partition. Then, for each post in V, we calculate its similarity to every post in T, and assign to it the label (AAE or WHE) of the tweet in T most similar to it. Finally, for this task, we report the AUC score and V-measure between the assigned and true labels of the posts in V. AUC refers to the area under the Receiver Operating Characteristic (ROC) curve; it is a measure commonly used to evaluate the performance of a binary classification task. V-measure, on the other hand, is used to evaluate clustering tasks [62]; it combines homogeneity (the extent to which AAE clusters contain AAE posts) and completeness (the extent to which all AAE posts are assigned to AAE clusters). This process is repeated 50 times for each fold, and we record the max, mean and standard deviation of the AUC scores and V-measures across all repetitions.
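One plausible instantiation of this evaluation procedure is sketched below (the exact scoring used in our simulations may differ slightly); labels are encoded as 1 for AAE and 0 for WHE, and the AUC is computed from the similarity margin between the closest AAE and closest WHE control posts.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, v_measure_score

def evaluate_control_set(val_feats, val_labels, ctrl_feats, ctrl_labels, sim):
    """val_feats/ctrl_feats: feature vectors; labels: 1 = AAE, 0 = WHE;
    sim(x, y) -> similarity between two feature vectors."""
    ctrl_labels = np.asarray(ctrl_labels)
    margins, assigned = [], []
    for v in val_feats:
        sims = np.array([sim(v, c) for c in ctrl_feats])
        # Continuous score for AUC: closest-AAE similarity minus closest-WHE similarity.
        margins.append(sims[ctrl_labels == 1].max() - sims[ctrl_labels == 0].max())
        # Hard assignment for V-measure: label of the most similar control post.
        assigned.append(int(ctrl_labels[sims.argmax()]))
    return roc_auc_score(val_labels, margins), v_measure_score(val_labels, assigned)
```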

To calculate the similarity between two sentences, we use pre-trained word and sentence embeddings to obtain feature vectors for the sentences, and then measure the similarity as the cosine similarity between the feature vectors. We employ the following popular and robust pre-trained embeddings for this task.

Word2Vec Embeddings.

Word2Vec is a popular model for learning word representations. Introduced by Mikolov et al. [50], Word2Vec embeddings have been used to improve performance on various NLP tasks, as well as to provide robust encodings of the vocabularies of different domains. To obtain sentence embeddings from Word2Vec representations, we use the aggregation method suggested by Arora et al. [3]: the word representations are aggregated by computing a weighted average of the embeddings of the words in the sentence, where the weight assigned to a word is proportional to the smooth inverse frequency of the word. We use a publicly-available Word2Vec model pre-trained on a corpus of 400 million posts [24].
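A minimal sketch of this aggregation, assuming pre-computed word vectors and relative word frequencies, and omitting the principal-component removal step of Arora et al. [3]; the smoothing constant a = 1e-3 is an assumed default.

```python
import numpy as np

def sif_sentence_embedding(sentence, word_vectors, word_freq, a=1e-3):
    """word_vectors: dict word -> np.ndarray; word_freq: dict word -> relative frequency."""
    vecs, weights = [], []
    for w in sentence.lower().split():
        if w in word_vectors:
            vecs.append(word_vectors[w])
            weights.append(a / (a + word_freq.get(w, 0.0)))   # smooth inverse frequency weight
    if not vecs:
        return None                                           # no known words in the sentence
    return np.average(np.array(vecs), axis=0, weights=weights)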

FastText Embeddings.

The FastText model is an extension of the Word2Vec model that also incorporates character-level information when encoding a word [9]. In particular, it is useful in settings where the word in consideration is not in the vocabulary but is close to a vocabulary word, i.e., within a few characters of it or a slang representation of it, making FastText useful for social media settings. To obtain sentence embeddings, we use the same aggregation method described above. For FastText, we also use a publicly-available model pre-trained on a corpus of 400 million posts [24].

BERT Sentence Embeddings.

BERT, or Bidirectional Encoder Representations from Transformers [19], is also a pre-trained language representation model. The output of the hidden states of a pre-trained BERT model can be used to directly obtain sentence embeddings, and we use these as well for our task.
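A sketch of one way to obtain such sentence embeddings from a pre-trained BERT model; the specific checkpoint ("bert-base-uncased") and the mean-pooling of the last hidden layer are illustrative choices, not necessarily the exact configuration used in our experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bert_sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)          # mean-pooled sentence embedding
```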

c.2.2 Results

The results for this task are presented in Figure 5. The plots show that diversity control sets are indeed suitable for differentiating between posts of different dialects; certain control sets achieve AUC scores greater than 0.8. Furthermore, the average AUC score is greater than 0.65 for diversity control sets of size greater than 10, implying that even small diversity control sets are suitable for this task.

The max AUC score evaluation also gives us certain choices of diversity control sets to use for the empirical evaluation of our model. These diversity control sets are provided in the Supplementary Material. Given that the diversity control sets do perform fairly well on this clustering task, they should be able to improve the diversity of standard summarization algorithms, when used in our framework.

In terms of word representations, using the Word2Vec model achieves slightly better performance than the FastText and BERT representations. Correspondingly, we use Word2Vec representations for the empirical analysis of our model as well. Word2Vec and FastText perform relatively better than BERT primarily because the models we use have been pre-trained on Twitter data, making them more suitable for the datasets we consider.

AAE tweets
“ATMENTION yea dats more like it b4 I make a trip up der”
“these n***s talmbout money but . really ain’t getting no money .. I be laughing at these n***s cause that shit funny ATMENTION”
“Me and Pay got matching coupes, me and kid fucked ya boo”
“ATMENTION he bites his lips and manages to kick off his remaining clothes”
“Our Dog Is A Big Baby And A Wanna Be Thug EMOJI”
“Its a Damn Shame’ iont GangBang but i beat a N*** Blue Black”
“ATMENTION yes, my amazon . Lol Im good . Pop-a-lock came by . Thx!”
“ATMENTION: ATMENTION You talking now? RIGHT? im typing nd texting not talking”
“Soon as u think you gotcha 1 you find out she fckin erbody!!”
“ATMENTION lmaooooooooooooooooooooooo, that was the funniest shit ever to hit twitter dawg :D swearrr .. But yall do yall thang”
“Yea Ill Be Good In Bed But Ill Be Bad To Ya!”
“ATMENTION nope tell her get dressed im bouta come get her lol”
“Now omw to get my hair done for coronation tomorrow”
“Ohhhh Hell Naw Dis Bitch Shay Got My Last Name * Johnson *”
WHE tweets
“You don’t have to keep on smiling that smile that’s driving me wild”
“ATMENTION it’s probably dead because he hasn’t texted me back either”
“ATMENTION amen . Honestly have trouble watching that movie . Just because of her.”
“I need to get on a laptop so I can change my tumblr bio”
“Shout out to the blue collar workers . Gotta love it”
“Jax keeps curling up on my bed and tossing and turning repeatedly . Like he cant get comfy . #Soocute #Puppylove”
“ATMENTION you just can’t go wrong with Chili’s . They serve a mean chips and salsa”
“ATMENTION Tenuta hasn’t been good since he left GT and he hates recruiting”
“ATMENTION: Probably the coolest thing I can do ATMENTION yeah, pretty frickin’ sweet! Thanks”
“ATMENTION you said we were hanging all day…Lol I don’t have a car alslo”
“I want a love like off The Vow .. #perfect #oneday”
“Philosophy is the worst thing to ever happen to the world”
“How come I can never get in a " gunning " fight with anyone? #Jealous”
“’Poor poor Merle, bravo for Michael Rooker and Norman Reedus’s performance on last night’s show.’
Table 3: Diversity control set for simulations on TwitterAAE dataset

c.3 Evaluation of our model on random collections of TwitterAAE datasets

For random collections of the TwitterAAE dataset with different fractions of AAE tweets, we use our model to generate summaries of different sizes. The results for TF-IDF are given in Figures 6 and 7; for Hybrid TF-IDF, see Figures 8 and 9; for LexRank, see Figures 10 and 11; for TextRank, see Figures 12 and 13; for SummaRuNNer, see Figures 14 and 15. In certain cases, the value of the trade-off parameter had to be larger than 0.7 to ensure sufficient diversity in the generated summary; the captions of the figures note when the value used is anything other than 0.5.

Figure 6: Evaluation of our model on datasets containing 8.7% AAE tweets using TF-IDF as the blackbox algorithm. Panels: (a) AAE fraction vs summary size, (b) AAE fraction in summary vs trade-off parameter, (c) Rouge-1 F-score vs trade-off parameter.
Figure 7: Evaluation of our model on datasets containing 50% AAE tweets using TF-IDF as the blackbox algorithm. Panels as in Figure 6.
Figure 8: Evaluation of our model on datasets containing 8.7% AAE tweets using Hybrid TF-IDF as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 6.
Figure 9: Evaluation of our model on datasets containing 50% AAE tweets using Hybrid TF-IDF as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 6.
Figure 10: Evaluation of our model on datasets containing 8.7% AAE tweets using LexRank as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 6.
Figure 11: Evaluation of our model on datasets containing 50% AAE tweets using LexRank as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 6.
Figure 12: Evaluation of our model on datasets containing 8.7% AAE tweets using TextRank as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 6.
Figure 13: Evaluation of our model on datasets containing 50% AAE tweets using TextRank as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 6.
Figure 14: Evaluation of our model on datasets containing 8.7% AAE tweets using SummaRuNNer as the blackbox algorithm. Panels as in Figure 6.
Figure 15: Evaluation of our model on datasets containing 50% AAE tweets using SummaRuNNer as the blackbox algorithm. Panels as in Figure 6.

c.4 Evaluation of our model on keyword-specific collections of TwitterAAE datasets

Next, we present the results of our model on collections of the TwitterAAE dataset containing the keywords used in Section 2. The results for TF-IDF are given in Figure 17; for Hybrid TF-IDF, see Figure 18; for LexRank, see Figure 19; for TextRank, see Figure 20; for Centroid-Word2Vec, see Figure 16; for SummaRuNNer, see Figure 21.

Figure 16: Evaluation of our model on keyword-specific datasets using Centroid-Word2Vec as the blackbox algorithm. Panels: (a) AAE fraction vs summary size, (b) AAE fraction in summary vs trade-off parameter, (c) Rouge-1 F-score vs trade-off parameter, (d) AAE fraction for different keywords and summary size = 50.
Figure 17: Evaluation of our model on keyword-specific datasets using TF-IDF as the blackbox algorithm. Panels as in Figure 16.
Figure 18: Evaluation of our model on keyword-specific datasets using Hybrid TF-IDF as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 16.
Figure 19: Evaluation of our model on keyword-specific datasets using LexRank as the blackbox algorithm. Panels as in Figure 16.
Figure 20: Evaluation of our model on keyword-specific datasets using TextRank as the blackbox algorithm; a non-default value of the trade-off parameter is used for the balanced algorithm. Panels as in Figure 16.
Figure 21: Evaluation of our model on keyword-specific datasets using SummaRuNNer as the blackbox algorithm. Panels as in Figure 16.

c.5 Evaluation of our model using different diversity set compositions

We also present the evaluation for the setting where the diversity control set has an unequal fraction of AAE and WHE posts. Figure 22 shows the results for random collections in which the fraction of AAE posts is 50%. As expected, the fraction of AAE posts in the summary increases as the fraction of AAE posts in the control set increases. This is thus another parameter that can be tuned to obtain the desired fraction of AAE posts in the summary.

Figure 22: Evaluation of our model using different control set compositions. Panels: (a) AAE fraction in summary vs control set composition, (b) Rouge-1 F-score vs control set composition.

Appendix D Other details and results for Crowdflower Gender AI dataset

d.1 Diversity control set used for Crowdflower Gender AI dataset

The diversity control set used for Crowdflower Gender evaluation is presented in Table 4.

Tweets by female user-accounts
“jameslykins haha man! the struggle is reeeeeal! ”
“red lips and rosy cheeks”
“#mood spirit of jezebel control revelation 21820, 26 a war goes on in todays church, and the ”
“where the hell did october go? halloween is already this weekend. ”
“my lipstick looked like shit and my hair is usually a mess but im still cute tho so ”
“say she gon ride for me , ill buy the tires for you ”
“so excited to start the islam section in my religions class ”
“wow blessed my 200 kate spade bag is ripping and ive only used it twice a week since the end of september .”
“all ive done today is lie around and homework tbh”
“of course you want to blame me for not finishing college and thus bringing this debt to myself of course”
“misskchrista everyone was obsessed with rhys though, no one really knew the other two xxx”
“papisaysyes at first i thought this said, my dick is on drugs and i still dont know which is worse lol”
“huge announcement and #career change for 2016. #goals #dreams #nymakeupartist ”
“practice random acts of kindness and make it a habit #aldubpredictions”
“ sammanthae glad i can make you laugh i miss you and love you too!!”
“nba i play basketball to escape reality. between the exercise and the diff personalities memories are made!”
“z100newyork please let me attend the future now vip party tonight i love demi and nick #z100futurenow ”
“#win 2 random jumbies stuffed animals #giveaway us only 1113 bassgiraffe ”
“daynachirps thats a great point. thanks for the reminder. #contentchat”
“ive told bri all this time it would happen and it finally did ”
Tweets by male user accounts
“warrenm ill be using my new mbp. i do see dells 5k line needs 2 thunderbolt connections to make it a true 5k display. not the case here?”
“logic301 salute on the new visuals my g! dope as fuck”
“i liked a youtube video official somewhere over the rainbow 2011 israel iz kamakawiwoole”
“laughs and cries at the same time cause true ”
“akeboshi night and day”
“now you all know the monster mash, but now for something really scary, the climate mash ”
“i hate when u tell someone u love them and they ignore u ”
“the finger hahsah ”
“the corruption of the wash. d.c. crowd is now of epic proportions. enlist gt join us ”
“i wish i went to school closer to mark a schwab . beating up doors and walls looks like a lot of fun.”
“keepherwarm kobrakiddlng aimhbread now ill let you know that ive known a guy my whole life who dated several girls and then later on”
“xavierleon fr like wtf are they taking that they just cant fucking dye and busting through doors?! ”
“heh, i just remember people actually think that se and hp are intentionally sabotaging the football team.”
“we must lessen the auditory deprivation! i agree earlier the implantation, the better! ”
“#repost seekthetruth with repostapp. repost ugly by nature 85 of the #tampons, cotton and ”
“the #ceo needs to embrace and sell social to the team or else is goes nowhere. bernieborges #h2hchat #ibminsight ”
“if you scored a touchdown on sunday and didnt dab, hit them folks, or do that hotline bling dance, it shouldnt have counted.”
“zbierband yo zbb, played our last seasonal gig at st. jude. good times had by all. remember the more you drink, the better we sound!”
“i hate writing on the first page of a notebook i feel like im ruining something so perfect”
“we schools should be given credit for growth in the apr, but growth is not the destination. michael jones moboe. ”
Table 4: Diversity control set for simulations on Crowdflower Gender AI dataset

d.2 Evaluation of our model with different blackbox algorithms

The performance of our model using different blackbox algorithms is presented here. The results for Hybrid TF-IDF are given in Figure 23; for LexRank, see Figure 24; for TextRank, see Figure 25; for Centroid-Word2Vec, see Figure 26.

Figure 23: Evaluation of our model on the Crowdflower Gender AI dataset using Hybrid TF-IDF as the blackbox algorithm. Panels: (a) gender fraction vs summary size, (b) gender fraction vs trade-off parameter, (c) Rouge-1 F-score vs trade-off parameter.
Figure 24: Evaluation of our model on the Crowdflower Gender AI dataset using LexRank as the blackbox algorithm. Panels as in Figure 23.
Figure 25: Evaluation of our model on the Crowdflower Gender AI dataset using TextRank as the blackbox algorithm. Panels as in Figure 23.
Figure 26: Evaluation of our model on the Crowdflower Gender AI dataset using Centroid-Word2Vec as the blackbox algorithm. Panels as in Figure 23.

Appendix E Other datasets

Crises NLP dataset:

This dataset contains crisis-related tweets from 19 different crises (e.g., earthquakes, floods) around the world that took place between 2013 and 2015 [30]. The tweets were collected using the AIDR (Artificial Intelligence for Disaster Response) open-source platform [29]. Most of the rows in the original dataset contain only tweet-ids for the corresponding crises, many of which are difficult to retrieve. However, for around 50,000 tweets, the text of the tweet is available along with some additional information: the user-id, the crisis corresponding to the post, and the (human-annotated) category to which the post belongs, for example, whether the post has information on "missing, trapped, or found people", or concerns "infrastructure and utilities damage". Overall there are 8 annotation categories.

We use this dataset because the posts come from locations around the world, often discuss a common kind of crisis, and are usually written in a regional English dialect. To restrict our analysis to a single topic, we only use posts corresponding to earthquakes. In total, there are 8500 posts for the following earthquakes (distribution given in parentheses): the 2015 earthquake in Nepal (35%), the 2014 earthquake in Chile (23%), the 2014 earthquake in California (20%) and the 2013 earthquake in Pakistan (22%).

Evaluation and Results:

Once again, we first use all algorithms to evaluate the diversity of the generated summaries; here we report the fraction of posts in the summary from different regions. Figure 27 shows that the summaries generated by all algorithms under- or over-represent posts from some regions. Our model, using Centroid-Word2Vec as the blackbox algorithm, once again returns summaries in which the fraction of posts from each region is closer to the fraction of posts from that region in the original collection (Figure 28).

Figure 27: Evaluation of all algorithms on Crises NLP dataset.
Figure 28: Evaluation of our model on Crises NLP dataset using Centroid-Word2Vec as algorithm .