Zero-shot Entity and Tweet Characterization with Designed Conditional Prompts and Contexts

Online news and social media have been the de facto mediums for disseminating information globally since the beginning of the last decade. However, bias in content and purpose of intentions are not regulated, and managing bias is the responsibility of content consumers. In this regard, understanding the stances and biases of news sources towards specific entities becomes important. To address this problem, we use pretrained language models, which have been shown to bring about good results with no task-specific training or few-shot training. In this work, we approach the problem of characterizing Named Entities and Tweets as an open-ended text classification and open-ended fact probing problem. We evaluate the zero-shot language model capabilities of Generative Pretrained Transformer 2 (GPT-2) to characterize Entities and Tweets subjectively with human psychology-inspired and logical conditional prefixes and contexts. First, we fine-tune the GPT-2 model on a sufficiently large news corpus and evaluate subjective characterization of popular entities in the corpus by priming with prefixes. Second, we fine-tune GPT-2 with a Tweets corpus from a few popular hashtags and evaluate characterizing tweets by priming the language model with prefixes, questions, and contextual synopsis prompts. Entity characterization results were positive across measures and human evaluation.




1. Introduction

Online News and Social Media content have maximum reach and readers. The content often influences the reader, often changing worldviews and consequent decisions. Each news source comes with its own perspective and biases, and it is detrimental to assume that readers would discern the import and authenticity of the content. There is an increasing need to computationally characterize the diverse perspectives on content from online media and social media.

In the recent past, researchers have devised various approaches to tag wide-ranging, malicious and biased content and alert the readers (Hern, 2018; Sherr, 2018; YouTube, 2018). Major social media organizations have implemented checks and transparency measures to spot affective content and provide metainformation of the content for alert users to report. Widely accepted empirical ways of detecting bias and misinformation content include validating against some ground truth, checking the writing style, inherent nature of information, and credibility of the source. Adversarial biased contents evolve, making their classification with hard-labels impractical, requiring the detection systems to adapt to the changes. Given the evolving nature of the adversarial content, we focus on the subjective approach to solve content characterization. In the near future, massively trained multi-task learner language models could play a crucial role in building reliable and resilient Internet content validation systems.

We define Entity characterization as saying something about a person pertinent in a timeline or a context, and Tweet characterization as saying something about the intention or purpose of a Tweet. We construct human-psychology-inspired design prompts for entity characterization. The prompts consist of common constructs humans use in speech and writing when describing a person. Along similar lines, we construct contextual synopses, questions, and templates that a human uses or is aware of when interpreting an informative sentence such as a Tweet.

In this work, we evaluate approaches for zero-shot characterization of Entities and Tweets. On news corpora, we evaluate entity characterization with an approach named “Designed Conditional Prefix-prompts.” News articles have rich descriptive content on popular entities like persons, places, etc. and are a good source for language models to learn about entities. We finetune GPT-2 on a large corpus of news articles from seven different news media houses. We acquire news article URLs from the Global Database of Events, Language, and Tone (GDELT Project) and scrape the web-page content. Next, we compile a list of popular entities across news media houses for our experiments. We use common English language phrases that describe an entity as prefix-prompts. Similarly, we also experiment with characterizing entities that appear in the tweets corpus. We hypothesize that Pretrained Language Models (PLMs) finetuned with a large corpus of descriptive knowledge about entities can be leveraged as proxy experts to characterize entities subjectively with respect to the corpus.

We evaluate Tweet characterization on a corpus of Tweets from popular hashtags with an approach named “Designed Contextual Questions and Templates.” We collate a Tweets corpus with varied Emotions, Emotionally Manipulative Language (EML), Prejudice, Dogmatism, and Call-to-Action, which collectively represent biased perspectives, to fine-tune the PLM and evaluate zero-shot subjective Tweet characterization. We hypothesize that PLMs finetuned with a particular set of concepts can be leveraged as an expert to deduce such concepts.

2. Related Work

Information shared on social media might contain various forms of manipulation to stir targeted emotion in the reader. Researchers have focused on using crowdsourced methods for subjective judgments like identifying emotionally manipulative language (EML) (Huffaker et al., 2020), propaganda (Barrón-Cedeno et al., 2019), bias (Spinde et al., 2020), or prejudice (Wei et al., 2020). The state-of-the-art approach to identify emotionally manipulative language is based on a crowdsourced method that neutralizes the effect of intrinsically manipulative language by measuring EML through comparison. Apart from the crowdsourced approach, lexicon-based approaches have been used to identify emotional content in a text snippet. Classification-based methods backed by crowdsourced data are limited in their ability to generalize knowledge past what exists in the training data, making them vulnerable to new patterns and references not explicitly trained for.

Recent few-shot approaches to Text Classification with language models have shown promising results. Schick and Schütze (Schick and Schütze, 2021a) perform text classification by training with a masked-token pattern (the masked token being the label to be predicted) or a cloze question pattern. Classification training starts with a small set of data, and with a semi-supervised approach, the training data is extended with soft labels. An ensemble of language models is used during each step of the semi-supervised training. The final classifier is trained on the sufficiently large soft-labeled data resulting from the iterations of semi-supervised training. The authors experimented with different amounts of initial training data; the key observation from this work is that good results are obtained even in the zero-shot setting, i.e., without training data, which supports our approach of experimenting with zero-shot.

In Schick and Schütze (Schick and Schütze, 2021b), pattern-exploiting training is applied to SuperGLUE language tasks with a small language model, ALBERT. The results are as good as those of GPT-3, a large language model. A few SuperGLUE tasks, when converted to cloze questions, require the prediction of multiple masked tokens, and an approach for this is detailed. The observation is that small language models lead to good results by leveraging cloze questions in text classification. We experimented with designed Contextual Synopsis, Boolean Questions, and Open-ended Questions in Tweet characterization in a zero-shot setting.

In Schick et al. (Schick et al., 2020), labels for text classification are mined from the language model itself. The auto-generated labels are near-synonyms of the labels in the training data. The observation is that language models have sufficient knowledge to generate tokens in a context. We have experimented with subjective entity characterization in a zero-shot setting.

In Hambardzumyan et al. (Hambardzumyan et al., 2021), text classification is trained by learning an embedding in continuous space which, when it precedes the masked token (a classification label), instructs the language model to generate a highly probable and appropriate token at the location of the mask. The critical observation is that the underlying pretrained language model has sufficient prompt knowledge, and inferring an appropriate embedding to solve classification tasks is possible.

In Gao et al. (Gao et al., 2021), the authors state that zero-shot classification has limitations and propose a suite of few-shot approaches for classification tasks: fine-tuning the PLM with prompt-based learning and demonstrations, and automatic prompt, label, and template generation. We evaluated whether PLMs finetuned with a domain dataset have sufficient knowledge to characterize entities and tweets.

Common-sense knowledge mining with zero-shot learning is another area that has given promising results recently. In Davison et al. (Davison et al., 2019), a uni-directional language model (GPT-2) is first used to convert information triplets to a valid and highly probable sentence, and then a bi-directional language model (BERT-large) is used to score the validity of the given fact. This is done by calculating the weighted point-wise mutual information (PMI) for a given relation. The testing was done on PLMs or general models rather than finetuned ones to avoid biasing towards a given database. The results suggest that the unsupervised techniques outperform the currently available supervised approaches. This backs our approach of not training the model on downstream tasks and using just a language model finetuned on contextual data to extract knowledge based on a given prompt.

This leads to probing facts from pretrained language models (PLMs). Language models have numerous advantages over knowledge bases: they do not require any schema engineering, are easy to extend with more data, and are robust for unsupervised training. In Petroni et al. (Petroni et al., 2019), the authors demonstrated the strong ability of PLMs to recall factual knowledge without any task-specific fine-tuning and their usefulness as open-domain QA systems. In Kassner and Schütze (Kassner and Schütze, 2020), the authors use the LAnguage Model Analysis (LAMA) benchmark datasets of Petroni et al. (Petroni et al., 2019) and show that negating an affirmative sentence has no effect on the model’s prediction of the masked word. Moreover, prepending a given sentence with any misprime(s) misleads the model into predicting wrong tokens. The authors of Jiang et al. (Jiang et al., 2020) extend the LAMA benchmark dataset to extract information in various languages. They custom-made templates to be converted into sentences in different languages and prompted the model to fill in the masked word(s). In Kumar and Talukdar (Kumar and Talukdar, 2021), the authors employed few-shot learning with as few as 25 annotated examples. The examples were permuted with their proposed modification of the genetic algorithm. The resulting sequence was then used as a prompt to the model along with a new sample, which was to be predicted for the intended task.

The authors of Bragg et al. (Bragg et al., 2021) introduce a prompt-based model for few-shot learning where the key idea is to pose prompts in the form of multiple-choice questions (MCQs). This backs our approach of characterizing tweets with templates consisting of a context synopsis and the text to be classified, followed by a multiple-choice question.

Nishida et al. (Nishida et al., 2020) introduced a sequentially finetuned BERT model for reading comprehension tasks. They used an unsupervised learning method to adapt the BERT model to the target domain, after which the model was further finetuned on a source-domain reading comprehension (RC) dataset. This way, BERT was able to handle RC tasks on a domain different from the one it was pretrained on. This backs our approach of using domain adaptation of Pretrained Language Models (PLMs) as a tool for getting domain-specific answers when prompting with task-described inputs.

Domain-Adaptive Pretraining (DAPT) of PLMs has been shown to greatly improve a model’s performance on tasks from the target domain. A PLM was separately finetuned on datasets from four different domains, namely biomedical, computer science, news, and reviews, and was then tested on two tasks from each domain. For each task, the domain-adapted model performed better (Gururangan et al., 2020).

These approaches back our way of formulating various prompts and extracting knowledge captured by the PLMs extended with specific domain data.

3. GPT-2 Domain Adaptation

The pretrained GPT-2 model was used for all of our experiments and was finetuned on the News and Tweets corpora. Entity Characterization experiments are executed on both the News and Tweets domain-adapted language models. Tweet Characterization experiments are executed on the Tweets domain-adapted language model. The fine-tuning of the GPT-2 PLM was carried out with the default hyper-parameters set by the Hugging Face Transformers implementation.

3.1. News Corpus

GDELT is a global dataset that continuously monitors Broadcast, Print, and Web news worldwide in multiple languages and has recorded events from 2015 onward. The GDELT Database is integrated with Google Big-Query for fetching data. We downloaded news article URLs from the GDELT for seven different media houses and scraped the content of URLs. Each media house has its style of presenting content. Articles containing native language snippets were discarded and only the English language content was extracted, creating a corpus of articles for each Media House. Table 1 shows the count and size on disk of articles from each Media House.

We finetune the pretrained GPT-2 (345M) on the news corpora from seven different media houses, training a separate model for each Media House.

Media House | No. of Articles | Size of Training Data (MB)
Media House A
Media House B
Media House C
Media House D
Media House E
Media House F
Media House F
Table 1. Scraped Articles from each Media House

Dataset Category | Number of Tweets
Government Policy | 1,076,795
Economically Weaker Section Abuse | 88,416
Agriculturists Voice | 56,271
Table 2. Tweets Corpora

3.2. Tweets Corpus

We used a custom collated dataset from three categories of social media movements for tweet characterization: Government Policy, Agriculturists’ Voice, and Economically Weaker Section Abuse. The data collected for our analysis spans a period of years and includes prolonged heated discussions in the context of the related country. We used snowball sampling (Goodman, 1961) of the trending hashtags to collect tweets of events for our experiment. The events and the collected data are listed in Table 2. These datasets were collated from Twitter during the peaks of their respective timelines. All these movements caused considerable turmoil online and were hence used for domain adaptation on informal texts. The data was extracted using the standard Twitter developer APIs with the then-trending hashtags.

The data collated from the online universe contained a lot of noise, including multi-lingual and code-mixed data. Since we wish to focus only on Roman-script English tweets in our experiments, we cleaned the data (the GPT-2 PLM is trained on an English corpus, and our goal was to generate characterizations in the English language). Moreover, the dataset contained content such as URLs and user mentions, which were not needed for characterization. To get the tweets ready for the model’s domain adaptation, we passed them through the following steps:

  • Lower case the tweets

  • Remove user mentions

  • Remove hashtags

  • Replace the emojis with their
    corresponding texts, surrounded by colons

  • Remove punctuation

After completing the above steps, only Tweets with greater than 70% of words found in the English dictionary are considered. Subsequently, we finetuned the pretrained GPT-2 (774M) model with the cleaned data.
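The cleaning steps and the 70% dictionary filter described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: `EMOJI_NAMES` and `ENGLISH_WORDS` are tiny hypothetical stand-ins for a full emoji table and English dictionary, URL removal is added because the text says URLs were not needed, and the colon and underscore are kept during punctuation removal so that emoji names survive.

```python
import re
import string

# Hypothetical stand-ins: a real run would use a full emoji table and a full
# English word list; neither is specified in the text.
EMOJI_NAMES = {"🔥": "fire", "🙏": "folded_hands"}
ENGLISH_WORDS = {"we", "stand", "with", "the", "farmers", "today"}

# Assumption: ':' and '_' are kept so emoji names like ':folded_hands:' survive.
PUNCT_TO_REMOVE = "".join(ch for ch in string.punctuation if ch not in ":_")


def clean_tweet(text: str) -> str:
    """Apply the listed cleaning steps in order and normalise whitespace."""
    text = text.lower()                           # lower case
    text = re.sub(r"https?://\S+", " ", text)     # URLs (assumed extra step)
    text = re.sub(r"@\w+", " ", text)             # user mentions
    text = re.sub(r"#\w+", " ", text)             # hashtags
    for symbol, name in EMOJI_NAMES.items():      # emoji -> :name:
        text = text.replace(symbol, f" :{name}: ")
    text = text.translate(str.maketrans("", "", PUNCT_TO_REMOVE))
    return " ".join(text.split())


def english_ratio(cleaned: str) -> float:
    """Share of non-emoji words that appear in the dictionary."""
    words = [w for w in cleaned.split() if not w.startswith(":")]
    if not words:
        return 0.0
    return sum(w in ENGLISH_WORDS for w in words) / len(words)


def keep_tweet(text: str, threshold: float = 0.70) -> bool:
    """Keep a tweet only if more than `threshold` of its words are English."""
    return english_ratio(clean_tweet(text)) > threshold
```

For example, `keep_tweet("We stand with the farmers today! 🙏 @someone #Protest")` keeps the tweet under the toy dictionary, while a tweet written entirely in romanised non-English words is dropped.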

4. Characterization with Designed Prompts

The theory and approach we have pursued in this work are strongly aligned with the idea of “programming in natural language” detailed by Reynolds and McDonell (Reynolds and McDonell, 2021). The language model would fail because “the probability distribution produced in response to a prompt is not a distribution over ways a person would continue that prompt, it’s the distribution over the ways any person could continue that prompt”. Hence, prompt design should constrain the entailment generation (i.e., the predicted sequence of the language model) to produce the desired effect and circumvent the irrelevant. Task-agnostic prompts are less effective than task-specific prompts.

Two main types of prompts can be used as inputs for a language model, namely cloze prompts and prefix prompts. In cloze prompts, as in (Petroni et al., 2019), the “answer” token is masked and the model predicts the masked token. In prefix prompts (Li and Liang, 2021; Lester et al., 2021), also known as priming, the language model acts as a sequence generator: a prompt is fed as input and the model is left to generate conditional text auto-regressively.

The GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) datasets were introduced by Wang et al. to evaluate natural language understanding tasks. The datasets contain context that the model has to understand in order to subsequently perform a task. We take a similar approach in two forms. One, we append designed conditional questions, suited for priming, to tweets to generate characterizing entailments. Two, a synopsis describing a particular characterization concept is followed by the tweet and a conditional question, together forming a broad context to prime the language model in order to characterize the tweet.

In this work we have designed conditional prefix prompts for the “Entity Characterization” task and detailed context and conditional question for the “Tweet Characterization” task.

4.1. Entity Characterization

We experiment with text entailment generated by domain-adapted GPT-2 PLMs for subjective characterization of the entity in the designed input. The models are finetuned with formal texts in news corpora and informal texts in tweets corpora. The input for an experiment consisted of an entity appended with a Designed Conditional Prefix-prompt. Entailment text was generated from the tokens GPT-2 predicted for the input, and we evaluated the effect of the prefix-prompt on the entailment text in characterizing the entity subjectively. We considered raw outputs for evaluation and did not post-process them into proper sentences. Instead, we check whether the raw output contains adjectives that are relevant and well-known for the corresponding entity.

News articles are written with reasonable research and insight; hence, we consider them formal text. Tweets are general thoughts shared by a large group of users, with limited or good insight; hence, we consider them informal text.

First, for entity characterization with formal text corpus, we experiment with entities in news corpus articles. The experiments considered the top 10 entities from all the seven media houses and eight designed prefix prompts shown in Table 3. Ten outputs were generated for each media house per (entity, prefix-prompt) pair, and evaluated.

“is a very” “is known as”
“can be described as a” “is regarded as a”
“lacks” “is called the”
“probably is a” “can be inferred as a”
Table 3. Designed Conditional Prefix-prompts
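As a rough illustration of the experimental grid, the sketch below enumerates (entity, prefix-prompt) inputs from the prompts in Table 3. The entity and media house identifiers are placeholders, and joining entity and prompt with a single space is an assumption; the text does not give the exact input format.

```python
from itertools import product

# The eight designed conditional prefix-prompts from Table 3.
PREFIX_PROMPTS = [
    "is a very", "is known as",
    "can be described as a", "is regarded as a",
    "lacks", "is called the",
    "probably is a", "can be inferred as a",
]

# Placeholders: the real entity names and media houses are anonymised.
ENTITIES = [f"P{i}" for i in range(1, 11)]
MEDIA_HOUSES = [f"M{i}" for i in range(1, 8)]
OUTPUTS_PER_PAIR = 10


def build_inputs(entities, prompts):
    """Return one model input string per (entity, prefix-prompt) pair."""
    return [f"{entity} {prompt}" for entity, prompt in product(entities, prompts)]


inputs = build_inputs(ENTITIES, PREFIX_PROMPTS)

# Ten outputs per pair on each of the seven media-house models.
total_outputs = len(inputs) * len(MEDIA_HOUSES) * OUTPUTS_PER_PAIR
outputs_per_prompt = len(ENTITIES) * len(MEDIA_HOUSES) * OUTPUTS_PER_PAIR
```

Under these settings the grid yields 80 inputs, 700 outputs per prefix-prompt, and 5,600 outputs overall, which agrees with the per-prompt totals of 700 and the overall total of 5,600 in the results tables.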

The GPT-2 PLM is finetuned on the news corpus of a single media house (the formal domain) to create a domain-adapted language model. With this model, raw entailments are generated for each pair of an entity and a designed conditional prefix shown in Table 3. Following are examples of input pairs:

Considering entailments as characterizing entities, entity characterization can be shown as:

One domain-adapted language model is created for each of the seven media house corpora shown in Table 1, and we subsequently conduct experiments on each of them. We evaluate how well the prefix-prompts in the input characterize an entity through the entailments generated by the formal-text domain-adapted language model.

Second, for entity characterization with the tweets corpus, an informal-domain-adapted language model is created from a category of tweets shown in Table 2, which we consider to be the informal text domain. Considering that the language model adapted with informal texts generates the characterization of entities, the characterization of entities can be shown as:

One domain-adapted language model is created for each of the three tweet category corpora shown in Table 2. The four most frequently appearing entities across the corpora were chosen for the experiments. Ten outputs were generated for each (entity, prefix-prompt) pair on each model, and evaluation was done on the results with a higher number of adjectives. We evaluated how well the prefix-prompts in the input characterize an entity through the entailments generated by the informal-text domain-adapted language model.

4.2. Tweet Characterization

To understand the effectiveness of contextual prompts on tweet characterization, we focus on the domain adaptation of a pretrained language model. The model is finetuned on social media corpora as shown in Table 2, and prompted to generate outputs to mine commonsense knowledge and reasoning stored in the language model.

For the first set of experiments, as shown in Table 4, we exploit the BoolQ (Clark et al., 2019) dataset format, where we first give a tweet as context followed by a Yes/No format question. The posed questions try to exploit the commonsense knowledge stored in the GPT-2 model as shown by the authors (Radford et al., 2019). Our main aim is to investigate high-level concepts like advocacy, hyper-advocacy, disinformation, propaganda and stance. These were all explicitly asked about in the corresponding questions.

Type 1 Experiment Designed Templates
<Tweet>. Q: Is it true that preceding sentence advocates a cause? A: <True or False or Subjective Text>
<Tweet>. Q: Is it true that preceding sentence hyper-advocates a cause ? A: <True or False or Subjective Text>
<Tweet>. Q: Is it true that preceding sentence is a disinformation ? A: <True or False or Subjective Text>
<Tweet>. Q: Is it true that preceding sentence is a about a propaganda ? A: <True or False or Subjective Text>
<Tweet>. Q: Is it true that preceding sentence favors a cause ? A: <True or False or Subjective Text>
<Tweet>. Q: Is it true that preceding sentence is against a cause? A: <True or False or Subjective Text>
<Tweet>. Q: The preceding statement is advocating a cause. True or False? A: <True or False or Subjective Text>
Table 4. Tweet Characterization by “Commonsense Reasoning” similar to BoolQ (Clark et al., 2019)
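A Type 1 prompt can be assembled by filling the template with a tweet and leaving the answer slot open for the model to continue. The sketch below is illustrative only: `build_boolq_prompt` and `read_answer` are hypothetical helpers, and mapping a generated continuation to True, False, or Subjective by its first word is an assumed reading rule, not one stated in the text.

```python
# Two of the Type 1 questions from Table 4, copied verbatim.
TYPE1_QUESTIONS = [
    "Is it true that preceding sentence advocates a cause?",
    "Is it true that preceding sentence is against a cause?",
]


def build_boolq_prompt(tweet: str, question: str) -> str:
    """Fill '<Tweet>. Q: <question> A:' and leave the answer to the model."""
    return f"{tweet.strip()}. Q: {question} A:"


def read_answer(continuation: str) -> str:
    """Map a generated continuation to True, False, or Subjective.

    Assumption: only the first word is inspected; anything that is not a
    clear yes/no is treated as subjective text.
    """
    words = continuation.strip().split()
    first = words[0].strip(".,!").lower() if words else ""
    if first in {"true", "yes"}:
        return "True"
    if first in {"false", "no"}:
        return "False"
    return "Subjective"


prompt = build_boolq_prompt("we stand with the farmers", TYPE1_QUESTIONS[0])
```

The resulting `prompt` ends in `A:`, so whatever the domain-adapted model generates next is read as the answer.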

Next, we employ a multiple choice question (MCQ) format, similar to the AG’s News dataset, expecting a one-word answer or an option as the predicted output from the domain-adapted PLM (Table 5). We aim to extract commonsense knowledge from the generative model given a task description (Radford et al., 2019).

Type 2 Experiment Designed Templates
<Tweet>. Q: Can the preceding sentence be classified as
one of the following: information, rhetoric, advocacy,
hyper-advocacy, dogma, or propaganda?
A: <One of the listed options or Subjective Text>
<Tweet>. Classify the preceding sentence as one of the following:
A) information
B) disinformation
C) advocacy
D) hyper-advocacy
E) propaganda
F) none of the above
<One of the above option or Subjective Text>
Table 5. Tweet Characterization by “MCQ” similar to AG’s News

Subsequently, our next set of experiments, as shown in Table 6, focuses on asking the PLM to classify a given tweet into specific low-level concepts. The low-level concepts we test our model on are classes that, unlike advocacy and hyper-advocacy, are not domain-specific but are standard terms in articles and communicative English, for example, call-to-action, dogma, EML, emotion, sentiment, and propaganda. This experiment also tests the model for text classification, knowledge mining, and sentiment analysis.

Type 3 Experiment Designed Template
<Tweet>. Q: For what cause is the above Tweet CTA? A: <Subjective Text>
<Tweet>. Q: For what cause is the above Tweet call-to-action? A: <Subjective Text>
<Tweet>. Q: Do you find the above Tweet as having dogmatic content? A: <Subjective Text>
<Tweet>. Q: Do you find the above Tweet as having rhetoric content? A: <Subjective Text>
<Tweet>. Q: Is the Tweet call to action for a cause/protest? A: <Subjective Text>
<Tweet>. Q: What is the dominant emotion in the above <Tweet>? A: <Subjective Text>
<Tweet>. Q: What is the sentiment of the above <Tweet>? A: <Subjective Text>
<Tweet>. Q: Is the above Tweet a propaganda? A: <Subjective Text>

Table 6. Tweet Characterization by “General Commonsense Reasoning” similar to MultiRC (Khashabi et al., 2018)

Our last set of experiments includes prompts similar to Reading Comprehension (RC), where the questions are based on a given passage, as shown in Table 7. For example, each prompt starts by defining a concept like dogmatism, followed by a few examples that can be classified as having dogmatic content, and then poses a question for the model to answer, building on the grounds shown by Zhang et al. (Zhang et al., 2018). We further narrow the question domain by listing the set of emotions to classify into and by asking yes/no questions for dogmatic content, call-to-action, and emotionally manipulative language.

Our primary aim in all the above experiments was to test and analyze the GPT-2 model on tasks at the intersection of Commonsense Knowledge Mining (CKM), Commonsense Reasoning (CR), Text Classification (TC), and Question Answering (QA). Given a specific task description and without any transfer learning, the model should generate expected/logical outputs, as claimed by the authors in (Radford et al., 2019).

Type 4 Experiment Designed Template
<Descriptive Synopsis on Dogmatism>
Question: John says ”<Tweet>”. Does John’s saying contain dogmatic content?
<Descriptive Synopsis on Emotionally Manipulative Language>
Question: John says ”<Tweet>”. Does John’s saying contain Emotionally Manipulative Language?
<Descriptive Synopsis on Emotional Analysis. Definitions of Aggressive, Optimism, Love, Submission, Fear, Surprise, Sadness and Disgust>
Question: Jhon says ”<Tweet>”. Does Jhon’s saying contain Aggressive/Optimism/Love/Submission/Fear/Surprise/Sadness/Disgust emotion?
<15 Definitions of Call-to-action followed by example texts>
Question: John says ”<Tweet>”. Can John’s saying be classified as Call-To-Action?
Table 7. Tweet Characterization by “Larger-context Commonsense Reasoning” similar to ReCoRD (Zhang et al., 2018)
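The Type 4 prompts place a descriptive synopsis before the question, so the model sees the concept definition as context. A minimal sketch of how such a prompt might be assembled is given below; the synopsis string is a placeholder because the actual synopsis texts are not reproduced here, and the newline separator and straight quotation marks are assumptions.

```python
# Placeholder: the actual synopsis texts are not given in the text.
SYNOPSIS_DOGMATISM = "<Descriptive Synopsis on Dogmatism>"

# Question stems taken from the Type 4 templates in Table 7.
QUESTION_STEMS = {
    "dogmatism": "Does {speaker}'s saying contain dogmatic content?",
    "eml": "Does {speaker}'s saying contain Emotionally Manipulative Language?",
    "call_to_action": "Can {speaker}'s saying be classified as Call-To-Action?",
}


def build_context_prompt(synopsis: str, tweet: str, concept: str,
                         speaker: str = "John") -> str:
    """Concatenate the synopsis and the speaker-framed question for a tweet."""
    stem = QUESTION_STEMS[concept].format(speaker=speaker)
    return f'{synopsis.strip()}\nQuestion: {speaker} says "{tweet.strip()}". {stem}'


example = build_context_prompt(
    SYNOPSIS_DOGMATISM, "we stand with the farmers", "dogmatism"
)
```

Keeping the stems in one dictionary makes it easy to pose the same tweet against each concept in turn while holding the framing constant.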

5. Results

5.1. Entity Characterization with News Datasets

Generating entailments is similar to Natural Language Generation. With designed prefix-prompts, we are interested in evaluating the quality of free-form text generation that characterizes an entity. Hence, we evaluated the efficiency of prefix-prompts in generating relevant, entity-characterizing entailments with the following measures:

  1. Iterations needed to generate ten valid English sentence entailments - Table 8

  2. The ratio of entailments of negative and positive sentiment for each prefix-prompt across media houses - Table 9

  3. Percentage of the positive sentiment of each entity in each media house - Table 10

  4. Presence of adjective POS tags in entailments. Table 11.

  5. Human evaluation of entity relevant and characteristic entailments - Table 12 and 13

  6. Cluster analysis of outputs with Sentence Embeddings

The GPT-2 fine-tuning datasets contained noise such as hashtags, emoji, and others. Generation was therefore repeated until clear English sentences were obtained. Table 8 shows the number of failed outputs it took to generate ten good outputs across all seven Media Houses. The “is a very” and “is regarded as a” prefix-prompts required fewer iterations than the “lacks” and “is called the” prefix-prompts.

Prefix Prompts Fail Count
is a very 89
is known as 189
can be described as a 93
is regarded as a 60
lacks 287
is called the 349
probably is a 141
can be inferred as a 126
Table 8. Failed outputs per Prefix-prompt
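The fail counts in Table 8 come from repeating generation until ten valid outputs are collected. One way such a loop might look is sketched below; `generate` and `is_valid` are hypothetical callables standing in for the language model and for whatever validity check was applied, neither of which is specified here.

```python
def collect_valid_outputs(generate, is_valid, needed=10, max_attempts=1000):
    """Sample until `needed` valid outputs are found; also count failures."""
    valid, failed = [], 0
    for _ in range(max_attempts):
        if len(valid) == needed:
            break
        candidate = generate()
        if is_valid(candidate):
            valid.append(candidate)
        else:
            failed += 1
    return valid, failed


# Toy stand-ins: every other sample is noise that starts with a hashtag.
samples = iter(["#tag :fire:", "a respected leader"] * 20)
valid, failed = collect_valid_outputs(
    generate=lambda: next(samples),
    is_valid=lambda text: text[0].isalpha(),
)
```

Summing `failed` over all entities and media houses for one prefix-prompt would give a count comparable to a row of Table 8.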

The sentiment of all entailments was analyzed using the AllenNLP Sentiment Analyzer. As shown in Table 9, “is regarded as a”, “is known as”, and “is called the” are the three most positive sentiment generating prefix-prompts, while “lacks” is the most negative sentiment generating prefix-prompt.

Prefix-prompt | Negative | Positive | %age of Positive
is a very | 34 | 666 | 95.14
is known as | 21 | 679 | 97.00
can be described as a | 37 | 663 | 94.71
is regarded as a | 17 | 683 | 97.57
lacks | 375 | 325 | 46.43
is called the | 21 | 679 | 97.00
probably is a | 110 | 590 | 84.29
can be inferred as a | 30 | 670 | 95.71
Table 9. Sentiment of entailments for each Prefix-prompt
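The percentage column of Table 9 can be recomputed from the two count columns. The sketch below assumes the first count is the number of negative entailments and the second the number of positive ones, which is the reading under which the reported percentages are reproduced.

```python
# prefix-prompt: (negative count, positive count), as read from Table 9.
SENTIMENT_COUNTS = {
    "is a very": (34, 666),
    "is known as": (21, 679),
    "can be described as a": (37, 663),
    "is regarded as a": (17, 683),
    "lacks": (375, 325),
    "is called the": (21, 679),
    "probably is a": (110, 590),
    "can be inferred as a": (30, 670),
}


def positive_share(negative: int, positive: int) -> float:
    """Percentage of positive entailments, rounded to two decimals."""
    return round(100 * positive / (negative + positive), 2)


shares = {p: positive_share(*counts) for p, counts in SENTIMENT_COUNTS.items()}
most_negative = min(shares, key=shares.get)
most_positive = max(shares, key=shares.get)
```

Each row sums to 700 entailments, and the recomputed shares match the reported column, with “lacks” lowest and “is regarded as a” highest.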

Table 10 shows the percentage of positive sentiment entailments for each entity for each Media House. For some entities, the share of positive sentiments was consistently low, whereas for others it was high. Media houses M1, M2, and M6 produced more positive sentiments towards entities P1, P6, and P10, who belong to the same Political Party.

Media Houses
M2 M7 M5 M1 M4 M3 M6
P1 92.5 91.3 88.8 92.5 90 91.3 95
P2 93.8 90 90 88.8 92.5 87.5 85
P3 87.5 92.5 93.8 93.8 87.5 85 86.3
P4 87.5 83.8 87.5 86.3 92.5 87.5 87.5
P5 87.5 91.3 92.5 90 88.8 93.8 93.8
P6 91.3 93.8 90 88.8 90 91.3 93.8
P7 85 76.3 82.5 82.5 78.8 85 85
P8 83.8 82.5 81.3 82.5 78.8 80 82.5
P9 88.8 90 87.5 83.8 93.8 88.8 92.5
P10 96.3 86.3 90 93.8 91.3 91.3 90
Table 10. Positive Sentiment percentage of Entities across Media Houses

Adjectives are usually used to praise or criticize someone. All entailments for each prefix-prompt are analyzed with NLTK POS tagging to check for the presence of Adjective POS tags (JJ), Superlative Adjective POS tags (JJS), and Comparative Adjective POS tags (JJR). Table 11 shows the presence of these POS tags in the 700 entailments for each prefix-prompt. The “is a very” and “can be described as a” prefix-prompts yield the most outputs with adjective POS tags, while “is known as” and “is called the” yield the fewest.

Prefix-prompt POS Absent POS Present
is a very 152 548
is known as 328 372
can be described as a 162 538
is regarded as a 169 531
lacks 276 424
is called the 351 349
probably is a 276 424
can be inferred as a 181 519
Table 11. Adjective POS tags in 700 Characterization Outputs
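The presence check behind Table 11 reduces to looking for the JJ, JJR, or JJS tags in each tagged entailment. The sketch below assumes each entailment has already been tagged into (word, tag) pairs, the shape returned by `nltk.pos_tag`; the two hand-tagged outputs are illustrative and not taken from the experiments.

```python
# Penn Treebank tags for adjectives, comparatives, and superlatives.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def has_adjective(tagged_tokens) -> bool:
    """True if any (word, tag) pair carries an adjective tag."""
    return any(tag in ADJECTIVE_TAGS for _, tag in tagged_tokens)


def count_presence(tagged_outputs):
    """Return (absent, present) counts over a list of tagged outputs."""
    present = sum(has_adjective(tokens) for tokens in tagged_outputs)
    return len(tagged_outputs) - present, present


# Hand-tagged illustrative outputs (not from the experiments).
tagged_outputs = [
    [("a", "DT"), ("strong", "JJ"), ("leader", "NN")],
    [("the", "DT"), ("minister", "NN")],
]
absent, present = count_presence(tagged_outputs)
```

Applied to the 700 entailments of one prefix-prompt, `count_presence` would produce one row of Table 11.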

Human evaluation was carried out on the following two attributes, with one human evaluator for each media house.

  1. If the entailment is relevant to the entity and factually correct.

  2. If the entailment is valid and describes the characteristic of the entity.

A third candidate attribute was the correctness of the characteristics described by the entailment. Entity characterizations are cognitive qualities of a person and depend heavily on individual perception, so an evaluator’s own perception of the entity could strongly bias their judgment of whether a characterization is correct. For this reason, the third attribute should only be marked by a domain expert.

Table 12 shows that 35.98% of all outputs were relevant to the entities. This score falls short of the 50% benchmark for GPT-2. However, because the 35.98% comes from a short experiment, it may improve over a larger number of experiments. Of the relevant outputs, about 74% describe characteristics of the entities, so the prefix-prompts are roughly 74% efficient at eliciting characteristic details about an entity.

Comparing the individual prefix-prompts by the number of relevant and characterizing outputs, as shown in Table 13, “is a very” is the best-performing prefix-prompt: 55.43% of its entailments are relevant to the entity, and 81.71% of those relevant entailments characterize the entity. In contrast, the “is called the” prefix-prompt performs worst on both scales.

                 Non-Relevant  Only Relevant  Relevant & Characterize Entity  Total Relevant
No. of outputs   3585          528            1487                            2015
%age of outputs  64.02         9.43           26.55                           35.98
Table 12. Relevant and characterizing outputs generated
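The percentages in Table 12 follow from the raw counts by simple arithmetic. The short sketch below reproduces them; the variable names are illustrative, and the reading of the four columns (non-relevant, only relevant, relevant and characterizing, total relevant) is inferred from the counts adding up.

```python
# Counts taken from Table 12.
non_relevant = 3585
only_relevant = 528
relevant_and_characterizing = 1487

# Total relevant outputs and total outputs overall.
relevant = only_relevant + relevant_and_characterizing
total = non_relevant + relevant


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


print("total outputs:", total)
print("relevant (% of all outputs):", pct(relevant, total))
print("characterizing (% of relevant outputs):",
      pct(relevant_and_characterizing, relevant))
```

The second ratio is the source of the roughly 74% efficiency figure quoted for the prefix-prompts.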
Prefix-prompt          Relevant & Characterizing (%)  Relevant (%)  Characterizing among Relevant (%)
is a very              45.29                          55.43         81.71
is known as            25                             33.43         74.78
can be described as a  29.43                          38.57         76.30
is regarded as a       31                             43.14         71.86
lacks                  25.71                          33.43         76.91
is called the          15.71                          24.57         63.94
probably is a          19.86                          29.43         67.48
can be inferred as a   20.43                          29.86         68.42
Table 13. Percentage of Relevant and Characterizing outputs for each Prefix-prompt for all Media Houses

Cluster analysis of outputs. Sentence Transformers can be used to calculate sentence embeddings for each output statement. We use pretrained Sentence-BERT (SBERT) embeddings to calculate a 768-dimensional vector for each output, irrespective of the sentence length. Sentence embedding vectors represent the complete sentence with more focus on the context, which makes them more useful for evaluating entity characterization entailments.

The k-means clustering algorithm is used to group similar sentence vectors together. The clusters are then validated against the human annotations and the other results, as shown in Table 14. The optimal number of clusters was computed from the Distortion, Silhouette, and Calinski-Harabasz scores. Distortion is “the sum of the squared distances between each observation vector in cluster and its dominating centroid”. The Silhouette score “quantifies to what extent a given cluster is cohesive and separate”. The Calinski-Harabasz score is the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion over all clusters. On comparing the scores across the metrics, the Silhouette score performed the best. Hence, the number of clusters (four, labelled 0 to 3 in Table 14) is chosen based on the Silhouette score.
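Choosing the number of clusters by Silhouette score can be sketched in pure Python as below. This is an illustration under stated assumptions, not the authors' pipeline: the 2-D toy points stand in for the 768-dimensional SBERT vectors, the k-means initialisation (evenly spaced input points) is chosen only to keep the example deterministic, and Euclidean distance is assumed. In practice a library implementation such as scikit-learn's `KMeans` and `silhouette_score` would normally be used.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are k evenly spaced input points."""
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: three well-separated groups of 2-D points (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

scores = {k: silhouette_score(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On this toy data the score peaks at the true number of groups, which is the selection rule the paragraph above describes.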

Table 14 shows the count of outputs in each cluster across the Sentiment, Adjective, and Relevance dimensions. The key observations from the cluster analysis are:

  1. Cluster 0 has the most Relevant and Characterizing Entity outputs

  2. Cluster 1 has the most negative sentiment outputs

  3. Cluster 3 has the most adjective-absent outputs

  4. Cluster 3 has the most irrelevant outputs

  5. All clusters have a high number of positive sentiment outputs

The key observation concerns Cluster 3, which has both the most adjective-absent outputs and the most irrelevant outputs.

Cluster  Negative Sentiment  Positive Sentiment  Adjective Absent  Adjective Present  Non-Relevant  Only Relevant  Relevant and Characterize Entity
0        98                  1381                323               1156               690           143            646
1        283                 749                 305               727                683           82             267
2        145                 1279                315               1109               966           173            285
3        119                 1546                952               713                1246          130            289
Table 14. Cluster Analysis of Outputs
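The cluster observations can be re-derived from the Table 14 counts. The sketch below is illustrative only; the column assignment (negative, positive, adjective absent, adjective present, non-relevant, only relevant, relevant and characterizing) is an inference from the fact that each of the three groups sums to the same cluster size and that the relevance totals match Table 12.

```python
# Column order assumed for the Table 14 rows (inferred, see lead-in).
COLUMNS = (
    "negative", "positive",
    "adj_absent", "adj_present",
    "non_relevant", "only_relevant", "relevant_and_characterizing",
)

# Counts copied from Table 14, keyed by cluster id.
table_14 = {
    0: (98, 1381, 323, 1156, 690, 143, 646),
    1: (283, 749, 305, 727, 683, 82, 267),
    2: (145, 1279, 315, 1109, 966, 173, 285),
    3: (119, 1546, 952, 713, 1246, 130, 289),
}


def column(name):
    """Return {cluster: count} for one named column."""
    idx = COLUMNS.index(name)
    return {cluster: row[idx] for cluster, row in table_14.items()}


def argmax_cluster(name):
    """Cluster with the largest count in the named column."""
    values = column(name)
    return max(values, key=values.get)


for name in ("relevant_and_characterizing", "negative", "adj_absent", "non_relevant"):
    print(name, "-> cluster", argmax_cluster(name))

# Consistency check: relevance counts summed over clusters.
print("non-relevant total:", sum(column("non_relevant").values()))
```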

In addition to the above results, we repeated all the experiments on the off-the-shelf GPT-2 PLM and found that the generated entailments were incoherent, too random, and nonsensical.

The critical takeaway from the entity characterization evaluation is that the “is a very” prefix-prompt performs best across all output analyses when characterizing entities with the GPT-2 PLM adapted to the formal-text domain.

Model Used (Cohen’s kappa)
Government Policy 0.94 47.65
Economically Weaker Section Abuse 0.74 72.65
Agriculturists Voice 0.66 60.15
Vanilla GPT-2 0.87 49.21
Table 15. Annotation results upon agreement between the domain experts
“can be described as a” “can be inferred as” “is a very” “is called the” “is known as” “is regarded as a” “lacks” “probably is a”
0.64 0.73 0.57 0.77 0.67 0.72 0.62 0.68
0.51 0.50 0.43 0.52 0.50 0.51 0.51 0.44
%age adjectives 79.17 79.17 93.75 83.33 97.92 89.58 95.83 77.08
%age positive 95.83 85.42 87.50 93.75 91.67 97.92 60.42 91.67
%age negative 4.17 14.58 12.50 6.25 8.33 2.08 39.58 8.33
64.58 50 60.4 56.25 62.5 70.83 54.16 58.3
Table 16. Effects of Prefix-prompts on Tweet Corpus Entities

5.2. Entity Characterization with Tweets Datasets

We take the top four outputs for each (entity, prefix-prompt) pair, ranked by the adjectives present. Two domain experts then evaluated them. The annotation criterion was whether the generated entailment conveys information relevant to the entity while characterizing it. Table 15 shows the Cohen’s kappa score for inter-annotator agreement on the outputs of each dataset-adapted GPT-2 model. We also generated outputs with the vanilla GPT-2 model to quantify the effect of adaptation. The scores for all four models indicate at least substantial agreement (Cohen’s kappa score > 0.6) between the two experts. We found that the generated characterizations were not only relevant but also adapted to the specific movement on which the model had been finetuned, including the majority opinion about the entity in that dataset. In contrast, the vanilla GPT-2 PLM generated outputs that were relevant to the entity only in a general sense.
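For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their individual label frequencies. The sketch below is a minimal stdlib implementation; the two label lists are hypothetical and are not the experts' annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: 1 = relevant and characterizing, 0 = not.
expert_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
expert_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

kappa = cohens_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.3f}")
```

Here the experts agree on 8 of 10 items, chance agreement is 0.52, and kappa is (0.8 - 0.52) / (1 - 0.52), which is just under the 0.6 threshold used in the text.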

Since characterization is often expressed through adjectives, we extracted the adjectives in each output and aggregated them to show the significance of the characterization in the outputs. To infer the effect of adaptation, we computed the Word2Vec centroid distance between the adjective sets of the adapted-model and vanilla-model outputs. Along similar lines, we also computed the SBERT centroid distances between the outputs of the adapted models and those of the vanilla model. Next, we calculated the percentage of outputs containing adjectives and the sentiment of the outputs. The centroid distances are greater than zero, which suggests that the outputs reflect what was learnt from the training datasets, and a high percentage of outputs contain adjectives. To verify the effect of the prefix-prompts, we also generated outputs from the entities alone, without the custom-tailored prompts; the experts observed that these entailments were random and did not characterize the entity. Table 16 summarizes the analysis for each prefix-prompt.
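A centroid distance between two sets of embeddings can be sketched as follows. The text does not state which distance metric was used, so Euclidean distance here is an assumption, and the 2-D toy vectors are hypothetical stand-ins for Word2Vec or SBERT embeddings.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def centroid_distance(set_a, set_b):
    """Euclidean distance between the centroids of two vector sets (assumed metric)."""
    ca, cb = centroid(set_a), centroid(set_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Toy embeddings (hypothetical): adapted-model outputs vs vanilla-model outputs.
adapted_vectors = [[1.0, 0.0], [3.0, 0.0]]   # centroid (2, 0)
vanilla_vectors = [[2.0, 3.0], [2.0, 5.0]]   # centroid (2, 4)

distance = centroid_distance(adapted_vectors, vanilla_vectors)
print("centroid distance:", distance)
```

A distance of zero would mean the two sets share the same centroid; any positive value indicates that the adapted outputs have shifted away from the vanilla ones on average.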

Based on the user mentions present in the tweets, we pick the two most frequent entities each from the ruling party and the opposition party. Entities 1 and 2 are from the ruling party, and Entities 3 and 4 are from the opposition party. The model trained on the “Government Policy” dataset produced the most varied sentiment outputs across the entities, as shown in Table 17.

The key takeaway is that for the informal text corpora, the finetuned models characterize the entities best with the prefix-prompt “is regarded as a” and worst with “can be inferred as”.

Tweets Corpora
Economically Weaker Section Abuse
Entity 1 90.63% 87.50% 96.88%
Entity 2 96.88% 84.38% 90.63%
Entity 3 90.63% 71.88% 93.75%
Entity 4 84.38% 78.13% 90.63%
Table 17. Positive Sentiment Outputs for each Entity across corpora

5.3. Tweet Characterization Results

The four sets of experiments focused on analysing generative PLMs as text classifiers with abstract labels such as dogma, advocacy, and hyper-advocacy. Each set of experiments after the first was motivated by the failure of the previous one. We started with a basic true/false question-answer format in which each question explicitly named the label used to classify the given sentence or tweet.

When the generative language model did not produce the expected results, we changed to a multiple-choice (MCQ) question-answer format in which all the candidate labels were listed as options. These experiments also failed: we expected a one-letter or at least a one-word answer, but the outputs were nowhere near even an “inferred” answer.

Subsequently, we devised a new set of questions that resorted to a lower-level classification: the model was asked to classify into general concepts that are commonly used and shared in English. Again, however, the answers were very subjective and not close to what could be expected from the question(s) posed.

Lastly, the fourth set of experiments introduced a low-level general concept together with its definition and a few related examples, forming a reading-comprehension format followed by a question asking whether a given tweet can be classified as the introduced concept. Nevertheless, the model produced outputs that did not make sense for the question asked and were not useful for the task defined in the prompt.
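The structural difference between the prompt formats described above can be sketched with small template builders. The function names and the exact wording of each template are hypothetical and are shown only to illustrate the three distinct shapes (true/false, multiple choice, reading comprehension); they are not the templates used in the experiments.

```python
def true_false_prompt(tweet, label):
    """True/false format: the question names the label explicitly."""
    return (
        f'Tweet: "{tweet}"\n'
        f"Question: Is this tweet an example of {label}? True or False?\n"
        "Answer:"
    )


def mcq_prompt(tweet, labels):
    """MCQ format: every candidate label is listed as a lettered option."""
    options = "\n".join(f"{chr(ord('A') + i)}. {label}" for i, label in enumerate(labels))
    return (
        f'Tweet: "{tweet}"\n'
        "Question: Which option best describes this tweet?\n"
        f"{options}\n"
        "Answer:"
    )


def comprehension_prompt(concept, definition, examples, tweet):
    """Reading-comprehension format: concept, definition, examples, then a question."""
    example_lines = "\n".join(f"Example: {example}" for example in examples)
    return (
        f"{concept}: {definition}\n"
        f"{example_lines}\n"
        f'Tweet: "{tweet}"\n'
        f"Question: Can this tweet be classified as {concept}?\n"
        "Answer:"
    )


# Placeholder inputs; real tweets, definitions, and examples would be substituted.
labels = ["dogma", "advocacy", "hyper-advocacy"]
prompt = mcq_prompt("<tweet text>", labels)
print(prompt)
```

Each prompt ends with an open "Answer:" slot, so the language model's continuation is read as the predicted label.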

From all these failures to characterize the tweets, we concluded that the model, even though trained on 8 million webpages of formal text plus a large corpus of informal text, failed to classify tweets into abstract terms such as dogma. This suggests that more formal, supervised training is needed for the model to comprehend those abstract labels.

6. Conclusions and Future Work

In the zero-shot setting, we observed that well-designed linguistic prompts combined with finetuned language models lead to good results on entity characterization. In this work, we attempt to solve NLP tasks in a zero-shot setting by exploring linguistic options for prompt design before constructing task-specific approaches. Furthermore, focusing on zero-shot experiments should yield a better understanding of PLMs and insights into novel ways of building language models.

The open question from our evaluation is tweet characterization, for which none of the broad range of templates we experimented with gave convincing results. Tweet characterization therefore probably requires task-specific training.

