Diving Deep into Clickbaits: Who Use Them to What Extents in Which Topics with What Effects?

03/28/2017 ∙ by Md Main Uddin Rony, et al. ∙ 0

The use of alluring headlines (clickbait) to tempt readers has become a growing practice. To survive in the highly competitive media industry, most online media, including mainstream outlets, have started following this practice. Although the widespread practice of clickbait puts readers' trust in media at risk, a large-scale analysis to reveal this fact is still absent. In this paper, we analyze 1.67 million Facebook posts created by 153 media organizations to understand the extent of clickbait practice, its impact, and user engagement, using a clickbait detection model that we developed. The model uses distributed sub-word embeddings learned from a large corpus. The accuracy of the model is 98.3%. Using this model, we further study the distribution of topics in clickbait and non-clickbait contents.




I Introduction

The term clickbait refers to a form of web content that employs writing formulas and linguistic techniques in headlines to trick readers into clicking links [1, 2], but does not deliver on its promises (footnote 1: https://www.wired.com/2015/12/psychology-of-clickbait/). Media scholars and pundits consistently show clickbait content in a bad light, but the industry based on this type of content has been growing rapidly and reaching more and more people across the world [3, 4]. Taboola, one of the key providers of clickbait content, claims (footnote 2: https://www.taboola.com/press-release/taboola-crosses-one-billion-user-mark-second-only-facebook-world’s-largest-discovery) to have doubled its monthly reach to one billion unique users in a single year from March 2015. The growth of the clickbait industry appears to have a clear impact on the media ecosystem, as many traditional media organizations have started to use clickbait techniques to attract readers and generate revenue. However, media analysts suggest that news media risk losing readers’ trust and depleting brand value by using clickbait techniques that may boost advertising revenue only temporarily. According to a study performed by Facebook (footnote 3: https://www.nytimes.com/2014/08/26/business/media/facebook-takes-steps-against-click-bait-articles.html), users “preferred headlines that helped them decide if they wanted to read the full article before they had to click through”. The study in [5] shows that clickbait headlines lead to negative reactions among media users.

Compared to the reach of clickbait content and its impact on the online media ecosystem, the amount of research done on this topic is very small. No large-scale study has been conducted to examine the extent to which different types of media use clickbait techniques. Little is known about the extent to which clickbait headlines contribute to user engagement on social networking platforms, which are major distributors of web content. This study seeks to fill this gap by examining the use of clickbait techniques in headlines by mainstream and unreliable media organizations on the social network. Some of the questions we answer in this paper are: (i) to what extent do mainstream and unreliable media organizations use clickbait? (ii) does the topic distribution differ between clickbait and non-clickbait contents? (iii) which type of headline, clickbait or non-clickbait, generates more user engagement (e.g., shares, comments, reactions)?

We first create a set of supervised clickbait classification models to identify clickbait headlines. Instead of following the traditional bag-of-words and hand-crafted feature set approaches, we take a more recent deep learning path that does not require feature engineering. Specifically, we use a distributed subword embedding technique [6, 7] to transform the words in the corpus into fixed-dimensional embeddings. These embeddings are used to map sentences to a vector space over which a softmax function is applied as a classifier. Our best performing model achieves 98.3% accuracy on a labeled dataset. We use this model to analyze a larger dataset, a collection of approximately 1.67 million Facebook posts created during 2014–2016 by mainstream and unreliable media organizations. In addition to identifying the clickbait headlines in the corpus, we also use the embeddings to measure the distance between the headline and the first paragraph, known as the intro, of a news article. We use a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (e.g., biterms) to understand the distribution of topics in the clickbait and non-clickbait contents of each media. Finally, using the data on Facebook reactions, comments, and shares, we analyze the role clickbait plays in user engagement and information spread. The main contributions of this paper are:

• We collect a large data corpus of 1.67 million Facebook posts by 153 U.S.-based media organizations. Details of the corpus are explained in Section II. We make the corpus available for research purposes (footnote 4: URL will be added after acceptance).

• We prepare distributed subword based embeddings for the words present in the corpus. In Section III, we provide a comparison between these word embeddings and the word2vec [8, 9] embeddings created from the Google News dataset with respect to clickbait detection. We plan to make these embeddings publicly available upon acceptance of the paper.

• We perform a detailed analysis of the clickbait practice in the social network from multiple perspectives. Section IV presents qualitative, quantitative, and impact analyses of clickbait and non-clickbait contents.

II Dataset

We use two datasets in this paper. Below, we describe the datasets and explain the collection process.

Headlines: This dataset is curated by Chakraborty et al. [2]. It contains headlines of news articles which appeared in ‘WikiNews’, ‘New York Times’, ‘The Guardian’, ‘The Hindu’, ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’, and ‘ViralStories’ (footnote 5: https://github.com/bhargaviparanjape/clickbait/tree/master/dataset). Each of these headlines is manually labeled as either clickbait or non-clickbait by at least three volunteers. The dataset contains both clickbait and non-clickbait headlines. We used this labeled dataset to develop an automatic clickbait classification model (details in Section III). An earlier version of this dataset was used in [2, 10]. It consisted of manually labeled headlines with an even distribution of clickbait and non-clickbait headlines.

Media Corpus: For large-scale analysis, using the Facebook Graph API (footnote 6: https://developers.facebook.com/docs/graph-api), we accumulated all the Facebook posts created by a set of mainstream and unreliable media between January 2014 and December 2016. The mainstream set consists of the most circulated print media (footnote 7: https://en.wikipedia.org/wiki/List_of_newspapers_in_the_United_States) and the most-watched broadcast media (footnote 8: www.indiewire.com/2016/12/cnn-fox-news-msnbc-nbc-ratings-2016-winners-losers-1201762864/), according to Nielsen ratings [11]. The unreliable set is a collection of conspiracy, clickbait, satire, and junk science based media organizations. The category of each unreliable media is cross-checked against two sources [12, 13]. Figure 2 shows the number of media organizations in each category in the dataset along with the percentage. A Facebook post may contain a photo, a video, or a link to an external source. In this paper, we limit ourselves to the link and video type posts only, which reduces the collected posts to a corpus of 1.67 million (Table I). For each post, we collect the headline (title of a video or headline of an article) and the status message. For a collection of link type posts, we also collected the bodies of the corresponding news articles. All these contents (headlines, messages, bodies) were used to train domain-specific word embeddings (details in Section III). We also gather the Facebook reaction (Like, Love, Haha, Wow, Sad, Angry) statistics of each post. Table I shows the distribution of the corpus.

Fig. 1: This figure shows the difference between the number of posts per day from an average mainstream (print, broadcast) media and the same from an average unreliable media during January 2014 – December 2016. The green areas indicate that during these time periods, on average, a mainstream media posted more Facebook contents per day than an unreliable media. The blue areas indicate the opposite. The general observation is that media organizations are sharing contents on Facebook more actively now than they did earlier.

| Media      | Category     | Link    | Video | Total   |
|------------|--------------|---------|-------|---------|
| Mainstream | Broadcast    | 324028  | 32924 | 356952  |
| Mainstream | Print        | 516713  | 14129 | 530842  |
| Unreliable | Clickbait    | 371834  | 4099  | 375933  |
| Unreliable | Conspiracy   | 309122  | 5841  | 314963  |
| Unreliable | Junk Science | 51923   | 649   | 52572   |
| Unreliable | Satire       | 41046   | 151   | 41197   |
| Total      |              | 1614666 | 57793 | 1672459 |

TABLE I: Distribution of the Media Corpus

Fig. 2: Category distribution of the Media Corpus

III Clickbait Detection

The key purpose of this study is to systematically quantify the extent to which traditional print and broadcast media, as well as “alternative” media that are often portrayed as unreliable, use clickbait properties in contents published on the web. The first step towards that goal is to identify clickbait and non-clickbait headlines.

III-A Problem Definition

We define the clickbait identification task as a supervised binary classification problem where the set of classes is $C = \{\text{clickbait}, \text{non-clickbait}\}$. Formally, given $X$, a set of all sentences, and a training set $D$ of labeled sentences $\langle s, c \rangle$, where $\langle s, c \rangle \in X \times C$, we want to learn a function $\gamma$ such that $\gamma: X \rightarrow C$; in other words, it maps sentences to $C$. In the following sections, we describe the modeling of the problem and compare the performance of multiple learning techniques.

III-B Problem Modeling

In text classification, a traditional approach is to use the bag-of-words (BOW) model to transform text into feature vectors before applying learning algorithms. The work in [2] followed this approach and used the BOW model along with a collection of hand-crafted rules to prepare the feature set. However, inspired by the recent success of deep learning methods in text classification, we use distributed subword embeddings as features instead of applying the BOW model. Specifically, we use an extension of the continuous skip-gram model [8] which takes into account subword (substring of a word) information [6]. We refer to this subword-aware model as Skip-Gram (capitalized), to distinguish it from the standard skip-gram model. Below, we explain how Skip-Gram is used to generate word embeddings.

III-B1 Skip-Gram

Given a large corpus $W$, represented as a sequence of words $w_1, \dots, w_T$, the objective of the standard skip-gram model is to maximize the log-likelihood

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

where the context $\mathcal{C}_t$ is the set of indices of words surrounding $w_t$. In other words, given a word $w_t$, the model wants to maximize the correct prediction of its context words $w_c$, $c \in \mathcal{C}_t$. The probability of observing a context word $w_c$ given $w_t$ is parametrized using the word vectors. The output of the model is an embedding for each word which captures semantic and contextual information of the word. The subword-aware Skip-Gram works in a slightly different way. Rather than treating each word as a unit, it breaks words down into subwords and tries to correctly predict the context subwords of a given subword. This extension allows representations to be shared across words, which makes it possible to learn reliable representations for rare words. Consider the following example.

Fig. 3: The Skip-Gram model architecture. The training objective is to learn subword vector representations that are good at predicting the nearby subwords.

Example 1.

“the quick brown fox jumped over the lazy dog”: take the word “quick” as an example. Assuming a subword length of three, the subwords are {qui, uic, ick}. The Skip-Gram model learns to predict qui and ick in the context, given uic as the input.
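The subword split in Example 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `char_ngrams` and `subword_training_pairs` and the `window` parameter are hypothetical, a single n-gram length is used, and the word-boundary markers and multiple n-gram lengths used by the model in [6] are omitted.

```python
def char_ngrams(word, n=3):
    """Return the contiguous character n-grams of `word`, in order."""
    if len(word) < n:
        return [word]  # assumption: short words are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def subword_training_pairs(word, n=3, window=1):
    """Pair each subword (input) with its neighbouring subwords (context).

    `window` is a hypothetical parameter: how many positions to the left
    and right count as context.
    """
    grams = char_ngrams(word, n)
    pairs = []
    for i, center in enumerate(grams):
        lo = max(0, i - window)
        hi = min(len(grams), i + window + 1)
        context = [grams[j] for j in range(lo, hi) if j != i]
        pairs.append((center, context))
    return pairs


if __name__ == "__main__":
    print(char_ngrams("quick"))
    print(subword_training_pairs("quick"))
```

For the word “quick”, the middle subword is paired with its two neighbours, which mirrors the prediction task described in the example.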

Figure 3 shows the architecture of the Skip-Gram model. Using a neural network, the model learns the mapping between the input and the output. The weights to the hidden layer form the vector representations of the subwords. The embedding of a word is formed by the sum of the vector representations of its subwords. Formally, given a word $w$ and its set of subwords $\mathcal{G}_w$, we can calculate the embedding of $w$ using the following equation:

$$v_w = \sum_{g \in \mathcal{G}_w} z_g$$

where $v_w$ is the embedding of $w$ and $z_g$ is the vector representation of subword $g$. Further details of the Skip-Gram model can be found in [6].
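As a minimal sketch of this summation, the snippet below builds a word embedding from a toy table of subword vectors. The vectors, the fixed subword length of three, and the name `word_embedding` are illustrative assumptions, not values or code from the paper; subwords without a vector are simply skipped.

```python
def word_embedding(word, subword_vectors, n=3):
    """Sum the vectors of the word's character n-grams (its subwords)."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    dim = len(next(iter(subword_vectors.values())))
    total = [0.0] * dim
    for gram in grams:
        vec = subword_vectors.get(gram)
        if vec is None:
            continue  # assumption: unseen subwords contribute nothing
        total = [a + b for a, b in zip(total, vec)]
    return total


# Toy 2-dimensional subword vectors, made up for illustration only.
TOY_SUBWORD_VECTORS = {
    "qui": [1.0, 0.0],
    "uic": [0.0, 2.0],
    "ick": [0.5, 0.5],
}

if __name__ == "__main__":
    print(word_embedding("quick", TOY_SUBWORD_VECTORS))
```

Because the embedding is a sum over subwords, a rare or unseen word still receives a representation as long as some of its subwords were observed during training.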

III-B2 Pre-trained Vectors

Note that Skip-Gram does not require labels to learn the embeddings of the words in a corpus. This means that one can apply the model to any large corpus of text to learn word embeddings, irrespective of whether the corpus is labeled or not. Learning from a large text corpus yields richer word embeddings which capture a lot of semantic, conceptual, and contextual information. We use the texts (headlines, messages, bodies) from the Media Corpus to learn word embeddings using this model. In Section III-C, we compare our pre-trained vectors with word vectors which were trained on about 100 billion words [9] from the Google News dataset.

III-B3 Classification

For a labeled sentence $\langle s, c \rangle$, we average the embeddings of the words present in $s$ to form the hidden representation of $s$. These sentence representations are used to train a linear classifier. Specifically, we use the softmax function to compute the probability distribution over the classes in $C$. The classification process is described in detail in [7].
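The forward pass of such a classifier can be sketched as follows. This is an illustrative outline, not the authors' implementation: the toy embeddings, weights, and biases are made-up values, the function names are hypothetical, and training of the linear layer is not shown.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the embeddings of the tokens that have a known vector."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token in the sentence has an embedding")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def predict_proba(tokens, embeddings, weights, biases):
    """Linear layer on the averaged sentence vector, followed by softmax."""
    hidden = sentence_vector(tokens, embeddings)
    scores = [
        sum(w * x for w, x in zip(row, hidden)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy values for illustration; class order is (clickbait, non-clickbait).
TOY_EMBEDDINGS = {
    "you": [1.0, 0.0], "won't": [1.0, 1.0], "believe": [0.0, 1.0],
    "senate": [-1.0, 0.0], "passes": [-1.0, -1.0], "bill": [0.0, -1.0],
}
TOY_WEIGHTS = [[1.0, 1.0], [-1.0, -1.0]]
TOY_BIASES = [0.0, 0.0]

if __name__ == "__main__":
    print(predict_proba(["you", "won't", "believe"],
                        TOY_EMBEDDINGS, TOY_WEIGHTS, TOY_BIASES))
```

Averaging keeps the sentence representation the same size regardless of headline length, which is what allows a single linear layer to be applied to every headline.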

III-C Evaluation

We use the Headlines dataset to evaluate our classification model. Section II provides the description of the dataset. We perform 10-fold cross-validation to evaluate various methods with respect to accuracy, precision, recall, f-measure, area under the ROC curve (ROC-AUC), and Cohen’s κ. Table II shows the performance of the methods. To reduce the effect of randomness, we repeat each experiment multiple times and present the average. There are seven methods in total. We categorize them based on the use of pre-trained vectors. Note that we report the performance of Chakraborty et al. [2] and Anand et al. [10] in the table. We group Anand et al. with the methods which use pre-trained vectors because they used word embeddings trained on about 100 billion words from the Google News dataset using the Continuous Bag of Words architecture [9]. Each of those word embeddings has 300 dimensions. Both of these works [2, 10] used a smaller and earlier version of the Headlines dataset. Moreover, the training and test sets of the earlier dataset are not available, so we could not compare our methods with them on the same test bed.
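For readers less familiar with these measures, the sketch below computes precision, recall, f-measure, accuracy, and Cohen's κ from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts in the usage example are hypothetical; ROC-AUC is left out because it needs ranked scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts.

    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives, with clickbait treated as the positive class.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Made-up counts for a balanced toy test set of 100 headlines.
    print(binary_metrics(tp=45, fp=5, fn=5, tn=45))
```

When both the true labels and the predictions are evenly split between the two classes, chance agreement is 0.5 and κ reduces to 2 × accuracy − 1, which is roughly in line with the Skip-Gram rows of Table II.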

The Skip-Gram model, even without pre-trained vectors, significantly outperforms the BOW based approach of Chakraborty et al. It achieves an f-measure of 0.975 (compared to 0.93 for Chakraborty et al.) and a κ score of 0.952. With pre-trained vectors, Skip-Gram performs even better. We used the word embeddings provided by [9] as well as embeddings learned from our own Media Corpus. For the latter, we experimented with three combinations: pre-trained vectors learned from the headlines only, from headlines and messages, and from headlines, bodies, and messages. We use the same embedding size for all three combinations. Skip-Gram with pre-trained vectors from headlines, bodies, and messages performs the best among all the variations that were applied to the full Headlines dataset. We realize that the differences in the measure values among the methods are small. However, we consider even a small improvement significant at such a high performance level.

The Media Corpus provides fewer unique embeddings than the Google News dataset. One interesting observation is that, even though our Media Corpus is significantly smaller than the Google News dataset, it contributes more to the clickbait classification task. A plausible explanation is that the embeddings from the Media Corpus carry more domain-specific knowledge than those from the Google News dataset. We plan to extend this corpus with more Facebook posts and release it along with the pre-trained vectors for research purposes upon acceptance of the paper.

With this clickbait classification model [Skip-Gram + (Headline + Body + Message)], we move forward and perform a large-scale study of the clickbait practice by a range of media on a social network (Facebook).

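| Group                       | Method                                  | Precision | Recall | F-measure | Accuracy | Cohen’s κ | ROC-AUC |
|-----------------------------|-----------------------------------------|-----------|--------|-----------|----------|-----------|---------|
| Without Pre-trained Vectors | *Chakraborty et al. [2]                 | 0.95      | 0.90   | 0.93      | 0.93     | NA        | 0.97    |
| Without Pre-trained Vectors | Skip-Gram                               | 0.976     | 0.975  | 0.975     | 0.976    | 0.952     | 0.976   |
| With Pre-trained Vectors    | *Anand et al. [10]                      | 0.984     | 0.978  | 0.982     | 0.982    | NA        | 0.998   |
| With Pre-trained Vectors    | Skip-Gram + Google_word2vec             | 0.977     | 0.977  | 0.977     | 0.976    | 0.951     | 0.976   |
| With Pre-trained Vectors    | Skip-Gram + (Headline)                  | 0.981     | 0.981  | 0.981     | 0.981    | 0.962     | 0.981   |
| With Pre-trained Vectors    | Skip-Gram + (Headline + Message)        | 0.982     | 0.982  | 0.982     | 0.982    | 0.964     | 0.982   |
| With Pre-trained Vectors    | Skip-Gram + (Headline + Body + Message) | 0.983     | 0.983  | 0.983     | 0.983    | 0.965     | 0.983   |

* Their experiments were performed on a smaller and earlier version of the Headlines dataset.

TABLE II: Performance of the methods on the Headlines dataset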

IV Practice of Using Clickbait in Social Network

We analyze the clickbait practice on Facebook using the Media Corpus from three perspectives.

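| Media      | Category     | Clickbait | Non-clickbait | Clickbait (%) |
|------------|--------------|-----------|---------------|---------------|
| Mainstream | Broadcast    | 169752    | 187200        | 47.56         |
| Mainstream | Print        | 128022    | 402820        | 24.12         |
| Unreliable | Clickbait    | 172271    | 203662        | 45.82         |
| Unreliable | Conspiracy   | 90389     | 224574        | 28.7          |
| Unreliable | Junk Science | 23637     | 28935         | 44.96         |
| Unreliable | Satire       | 21798     | 19399         | 52.91         |

TABLE III: Amount of clickbaits in various media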

IV-A Quantitative Analysis

To understand the extent of clickbait practice by different media and their categories, we applied the clickbait detection model to their contents, specifically to the headline or title of the link and video type posts. From now on, we use the term headline to denote both the headline of a link content (article) and the title of a video content. Table III shows the amounts of clickbait and non-clickbait in the headlines of mainstream and unreliable media. Summing the rows of Table III, out of 887,794 posts by mainstream media, 297,774 (33.54%) have clickbait headlines. In unreliable media, the ratio is 39.26% (308,095 clickbait headlines out of 784,665). Based on these statistics, the percentage appears to be surprisingly high for the mainstream. We zoom into the categories of these two media to analyze the primary proponents of the clickbait practice. We find that between the two categories of mainstream media, broadcast uses clickbait 47.56% of the time whereas print uses it only 24.12% of the time. We further zoom in to understand the high percentage in the broadcast category. We manually categorize the broadcast media in the Media Corpus into news oriented broadcast media (e.g., CNN, NBC) and non-news (lifestyle, entertainment, sports, etc.) broadcast media (e.g., HGTV, E!). We find that the ratio of clickbait to non-clickbait is considerably higher in non-news broadcast media, whereas in news oriented media it is close to that of print media. Figure 5 shows the kernel density estimation of the clickbait percentage for both news and non-news broadcast media. It clearly shows the difference in clickbait practice between these two sub-categories. The clickbait percentages of most news type broadcast media are concentrated in a narrow range, whereas the percentage for non-news type broadcast media spans a wider range. In the case of unreliable media, unsurprisingly, all the categories have a high percentage of clickbait in their headlines. In Figure 4, we show the percentage of clickbait in video and link type posts for each of the media categories. Satire leads in both link and video type posts. Print and conspiracy media have the lowest clickbait practice among all the media categories in link and video type posts, respectively. Table V shows the top-5 clickbait proponents in each media category.
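The per-category and aggregate percentages can be recomputed directly from the counts in Table III. The sketch below does so; the dictionary layout and the function names `clickbait_share` and `group_share` are illustrative choices, while the counts are copied from the table.

```python
# (clickbait, non-clickbait) headline counts copied from Table III.
TABLE_III = {
    ("Mainstream", "Broadcast"): (169752, 187200),
    ("Mainstream", "Print"): (128022, 402820),
    ("Unreliable", "Clickbait"): (172271, 203662),
    ("Unreliable", "Conspiracy"): (90389, 224574),
    ("Unreliable", "Junk Science"): (23637, 28935),
    ("Unreliable", "Satire"): (21798, 19399),
}


def clickbait_share(clickbait, non_clickbait):
    """Percentage of headlines classified as clickbait."""
    return 100.0 * clickbait / (clickbait + non_clickbait)


def group_share(table, media):
    """Aggregate clickbait percentage over all categories of one media group."""
    clickbait = sum(c for (m, _), (c, _n) in table.items() if m == media)
    non_clickbait = sum(n for (m, _), (_c, n) in table.items() if m == media)
    return clickbait_share(clickbait, non_clickbait)


if __name__ == "__main__":
    for (media, category), (c, n) in TABLE_III.items():
        print(f"{media:<10} {category:<12} {clickbait_share(c, n):6.2f}%")
    for media in ("Mainstream", "Unreliable"):
        print(f"{media:<10} overall      {group_share(TABLE_III, media):6.2f}%")
```

The aggregate figure for a media group is a pooled percentage over all its posts, so large categories such as print weigh more heavily than small ones such as satire.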

Fig. 4: Percentage of clickbaits in link and video headlines.
Fig. 5: Broadcast (News) vs. Broadcast (Non-news).
Fig. 6: Frequency of link re-post by different media.

IV-B Qualitative

Topic distribution: To understand the topics in the clickbait and non-clickbait contents, we applied topic modeling to all the headlines of each category. One concern about applying traditional topic modeling algorithms (e.g., Latent Dirichlet Allocation, Latent Semantic Analysis) to our corpus is that they rely on document-level word co-occurrence patterns to discover the topics of a document. They may therefore struggle with the sparsity of word co-occurrence patterns, which becomes a dominant factor in short texts such as headlines. For this reason we use Biterm Topic Modeling (BTM) [14], which generates topics by directly modeling the aggregated word co-occurrence patterns of short documents.
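To make the notion of aggregated co-occurrence patterns concrete, the sketch below extracts biterms (unordered word pairs that co-occur in the same short text) from tokenized headlines and counts them over a collection. It covers only this extraction step; the topic inference of BTM [14] is not shown, and the function names and toy headlines are hypothetical.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return the unordered word pairs co-occurring in one short text."""
    pairs = []
    for first, second in combinations(tokens, 2):
        if first == second:
            continue  # assumption: a word paired with itself is not a biterm
        pairs.append(tuple(sorted((first, second))))
    return pairs


def biterm_counts(documents):
    """Aggregate biterm counts over a whole collection of short texts."""
    counts = Counter()
    for tokens in documents:
        counts.update(biterms(tokens))
    return counts


if __name__ == "__main__":
    # Made-up tokenized headlines for illustration.
    headlines = [["police", "man", "shot"], ["man", "shot", "dead"]]
    print(biterm_counts(headlines).most_common(3))
```

Pooling the pairs over the whole collection is what sidesteps the sparsity of any single headline: each headline contributes only a few pairs, but the collection-level counts are dense enough to model.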

| Media      | Clickbait                                                                | Non-Clickbait                                                                      |
|------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------|
| Print      | best, thing, day, new, 2015, cleveland, la, 2016, know, year             | new, san, la, jose, police, county, vega, get, bay, school                         |
| Print      | trump, woman, donald, new, get, say, make, people, thing, know           | police, man, cleveland, new, killed, woman, la, shooting, shot, get                |
| Print      | trump, new, get, woman, donald, make, star, say, man, chicago            | news, trump, new, man, say, york, woman, hawaii, police, killed                    |
| Print      | new, best, thing, year, get, kid, day, woman, make, trump                | trump, new, u, clinton, say, state, win, donald, take, world                       |
| Print      | boston, trump, donald, new, say, make, clinton, woman, get, 2016         | boston, new, say, trump, sox, chronicle, win, red, get, state                      |
| Broadcast  | new, movie, star, make, swift, time, video, best, get, like              | police, man, new, found, woman, killed, arrested, say, shooting, death             |
| Broadcast  | new, get, baby, kardashian, jenner, star, first, make, love, say         | trump, clinton, say, new, obama, u, gop, news, campaign, hillary                   |
| Broadcast  | woman, episode, new, trump, man, black, get, video, full, girl           | new, u, say, police, found, killed, dead, nbc, year, dy                            |
| Broadcast  | trump, history, know, thing, donald, clinton, get, make, best, say       | win, new, say, game, first, get, team, player, take, back                          |
| Broadcast  | day, photo, national, way, best, like, food, dog, thing, geographic      | national, geographic, photo, new, shark, day, classic, fs1undisputed, home, found  |
| Unreliable | trump, hillary, donald, clinton, obama, get, make, say, one, news        | obama, eagle, muslim, police, say, gun, u, cop, man, patriot                       |
| Unreliable | video, people, american, black, obama, muslim, u, america, cop, white    | trump, hillary, clinton, obama, new, say, campaign, news, donald, republican       |
| Unreliable | chick, trump, eagle, right, woman, hillary, say, get, people, make       | u, obama, video, war, isi, new, military, american, world, muslim                  |
| Unreliable | man, people, thing, woman, make, year, like, get, way, new               | new, truth, obama, say, u, republican, police, broadcast, man, american            |
| Unreliable | day, reunionfather, human, food, way, health, thing, reason, life, make  | human, cancer, health, new, vaccine, u, study, food, found, world                  |

TABLE IV: Topic model of Clickbait and Non-clickbait headlines in different media (one topic per row, five topics per media)

Table  IV shows topics in clickbait and non-clickbait contents for each media category. Each topic is represented by a set of words. The words are ordered by their significance in the corresponding topic. The modeling indicates that clickbait headlines in print and broadcast media vary in tones and subject matters from their non-clickbait headlines to a great extent. Clickbait headlines in these media represent more personalized, sensationalized and entertaining topics, while non-clickbait headlines highlight topics of collective problems such as public policies and civic affairs. But this variation is not much evident in unreliable media that use clickbait headlines indiscriminately across all topics.

The model highlights some differences in clickbait topics between print and broadcast media. Most clickbait topics in print media, four out of five, are about U.S. President Donald Trump’s views on women. Each of these four topics includes all of these four words: Trump, woman, make, new. A manual search shows that print news media often used clickbait techniques (e.g., question-based headlines) in stories about Trump and women. For instance, “Did Donald Trump really say those things?” was the headline of a Washington Post article dated July 25, 2016. The headline of a New York Times story from May 14, 2016, reads: “Crossing the Line: How Donald Trump Behaved With Women in Private.”

Most clickbait topics in broadcast media are about entertainment (e.g., Taylor Swift’s new music video; Kardashian’s new baby) and lifestyle (e.g., food and health). Two topics appear to touch on Donald Trump and his opponent Hillary Clinton. Clickbait topics in unreliable media, however, range from politics to lifestyle. At least three topics appear to be about politics, with key words including Trump, Hillary, Obama, Muslim, Cop, and Woman. One topic is about food and health while another is unclear.

Non-clickbait topics remain similar across all three media types, which primarily focus on law and order, and U.S. presidential election campaign. Twelve out of 15 topics – all five in print, three in broadcast, and four in unreliable – are about these two areas. One broadcast topic appears to be about sports and one is unclear. One unreliable topic is about food and health.

Headline-Body similarity: One limitation of Skip-Gram is that it considers only the headline when determining whether it is clickbait or not. The body of the news article is not considered as a factor in judging the headline. An attractive headline can be highly relevant to the body of a news article, or it can be very loosely related to it. Our model is not capable of making this distinction. A metric is required to measure the similarity between the headline and the content to determine whether the headline fairly represents the content. In the future, we want to systematically incorporate headline-body similarity into the definition of clickbaitiness. Nonetheless, here we measure how similar the clickbait and non-clickbait headlines are to the corresponding bodies using a simple approach. We assume that the first paragraph of an article represents the summary of the whole news article [15] and use cosine similarity to measure the similarity between the headline and the sentences in the first paragraph. We use the bag-of-words model to transform the sentences into vectors before applying cosine similarity. In the future, we plan to use our word embeddings to create the vectors instead. Figure 7 shows the kernel density estimation of the headline-body similarity in clickbait and non-clickbait contents posted by different media. One observation is that in print media, non-clickbait headlines are closer to their summary than clickbait headlines. In broadcast media the difference is less clear, and in unreliable media the difference is almost absent.
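A bag-of-words cosine similarity of this kind can be sketched as below. Several choices here are assumptions made for illustration: lowercasing and whitespace tokenization stand in for whatever preprocessing was actually used, the intro is treated as a single bag of words rather than sentence by sentence, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count token occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    shared = bag_a.keys() & bag_b.keys()
    dot = sum(bag_a[token] * bag_b[token] for token in shared)
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def headline_intro_similarity(headline, intro):
    """Similarity between a headline and an article's first paragraph."""
    return cosine_similarity(bag_of_words(headline), bag_of_words(intro))


if __name__ == "__main__":
    # Made-up headline and intro for illustration.
    print(headline_intro_similarity(
        "city council approves new budget",
        "the city council on monday approved a new budget for next year",
    ))
```

Because the vectors are raw word counts, two texts that express the same idea with different words score zero, which is one reason embedding-based vectors are an attractive replacement.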

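| Media      | Name                | Clickbait | Non-clickbait | Clickbait (%) |
|------------|---------------------|-----------|---------------|---------------|
| Overall    | VH1                 | 13760     | 1339          | 91.13         |
| Overall    | AmplifyingGlass     | 692       | 71            | 90.69         |
| Overall    | MTV                 | 42313     | 4492          | 90.4          |
| Overall    | ClickHole           | 8250      | 930           | 89.87         |
| Overall    | Reductress          | 3984      | 484           | 89.17         |
| Broadcast  | VH1                 | 13760     | 1339          | 91.13         |
| Broadcast  | MTV                 | 42313     | 4492          | 90.4          |
| Broadcast  | Bravo TV            | 8263      | 1242          | 86.93         |
| Broadcast  | Food Network        | 2990      | 492           | 85.87         |
| Broadcast  | OWN                 | 474       | 118           | 80.07         |
| Print      | Washington Post     | 13905     | 15158         | 47.84         |
| Print      | New York Post       | 11977     | 13910         | 46.27         |
| Print      | Dallas Morning News | 3982      | 8232          | 32.6          |
| Print      | USA Today           | 8538      | 20282         | 29.63         |
| Print      | Houston Chronicle   | 8481      | 21618         | 28.18         |
| Unreliable | AmplifyingGlass     | 692       | 71            | 90.69         |
| Unreliable | ClickHole           | 8250      | 930           | 89.87         |
| Unreliable | Reductress          | 3984      | 484           | 89.17         |
| Unreliable | Food Babe           | 2387      | 638           | 78.91         |
| Unreliable | Chicks on the Right | 14185     | 4977          | 74.03         |

TABLE V: Top-5 clickbait proponents in each media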
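| Media      | Category     | Clickbait Status | Non-clickbait Link | Clickbait Status (%) |
|------------|--------------|------------------|--------------------|----------------------|
| Mainstream | Broadcast    | 84192            | 176177             | 32.34                |
| Mainstream | Print        | 164669           | 379504             | 30.26                |
| Unreliable | Clickbait    | 91747            | 157886             | 36.75                |
| Unreliable | Conspiracy   | 46851            | 190477             | 19.74                |
| Unreliable | Junk Science | 12764            | 28349              | 31.05                |
| Unreliable | Satire       | 7425             | 14453              | 33.94                |

TABLE VI: Presence of clickbait in the status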
Fig. 7: Headline-Body similarity in clickbait and non-clickbait contents.

IV-C Impact

To measure the reachability and user engagement of clickbait and non-clickbait contents, we use Facebook reactions, comments, and shares as metrics. Figure 8 shows the number of comments, shares, and reactions (the sum of Like, Love, Haha, Wow, Sad, and Angry) of an average clickbait and non-clickbait post in each media category. Blue areas indicate that, on average, a clickbait post (link or video) receives more attention (reactions/shares/comments) than a non-clickbait post. Green areas indicate the opposite. In general, clickbait contents receive more attention and reach more users. One exception is the broadcast media.

We also analyze how often a news article is re-posted on Facebook. Figure 6 shows the number of times a link is re-posted by a media. Each bar represents a news link. The height indicates how many times this link was posted on Facebook by the colored media category. We only consider the links which were re-posted at least 20 times. Compared to others, conspiracy media organizations repeat the same link more often. This is observed for both clickbait and non-clickbait. Clickbait media seem to repeatedly post the same clickbait links more than others.
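The re-post count behind such a figure amounts to counting how often each media category posts the same link and keeping the links above a threshold. The sketch below assumes each post is available as a `(category, url)` pair; that record layout, the function name `frequent_reposts`, and the example URLs are hypothetical, while the threshold of 20 comes from the text.

```python
from collections import Counter


def frequent_reposts(posts, min_count=20):
    """Count posts per (media category, link) and keep the frequent ones.

    posts: iterable of (category, url) pairs, one per Facebook post.
    """
    counts = Counter((category, url) for category, url in posts)
    return {key: n for key, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Made-up posts for illustration.
    sample = (
        [("Conspiracy", "http://example.com/a")] * 21
        + [("Print", "http://example.com/b")] * 3
    )
    print(frequent_reposts(sample))
```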

Besides headlines, media organizations also use clickbait in the Facebook status message itself. Table VI shows the use of clickbait status messages for non-clickbait articles by different media. A general observation is that media try to allure readers with clickbaity status messages even for non-clickbaity news contents. Unsurprisingly, the clickbait media category leads in this practice.

Fig. 8: Top: Print media, Middle: Broadcast media, Bottom: Unreliable media. Blue areas indicate that on average, a clickbait post (link or video) receives more attention (reactions/shares/comments) than a non-clickbait post. Green areas indicate the opposite.

V Related Work

Even though clickbait is a relatively nascent term, its traces can be found in several journalistic concepts such as tabloidization and content trivialization. The linguistic techniques and presentation styles, employed typically in clickbait headlines and articles, derived from the tabloid press that baits readers with sensational language and appealing topics such as celebrity gossip, humor, fear and sex [1]. Clickbait articles are also similar to tabloid press articles in terms of story focus, which puts emphasis on the entertaining elements of an event rather than the informative elements. The Internet and especially the social media have made it easier for the clickbait practitioners to create, publish in a larger scale and reach to a broader audience with a higher speed than before [16]. In the last several years, academicians and media studied this phenomenon from several perspectives.

Clickbait– Properties, Practice and Effects: There have been a small number of studies–some conducted by academic researchers and others by media firms–which examined correlations between headline attributes and degree of user engagement with content. Some media market analysts and commentators  [17] discussed various aspects of this practice. However, no research has been found, which gauges the extents of clickbait practices by mainstream and alternative media outlets on the web. Nor have we found any study that examined if clickbait techniques help increase user engagement on social media.

A journalism professor [1] manually examined the content of four online sections of the Spanish newspaper El Pais (footnote 9: http://elpais.com), which apparently used clickbait features to capture attention. The corpus included only articles published in June 2015. The articles in the corpus appeared to emphasize anecdotal aspects, issues of little value, and curiosities. The study identified various linguistic techniques used in the headlines of these articles, such as orality markers and interaction (e.g., direct appeal to the reader), vocabulary and word games (e.g., informal language, generic words or buzzwords), and morphosyntax (e.g., simple structures).

Researchers at the University of Texas’s Engaging News Project  [5] conducted an experiment on U.S. adults to examine their reactions to clickbait (e.g., question-based headlines) and traditional news headlines in political articles. They found that clickbait headlines led to more negative reactions among users than non-clickbait headlines. Interestingly, the same users were slightly more engaged with non-traditional media that tend to use clickbait techniques more often. This finding questions the conventional belief that user reactions may predict user engagement, and warrants large-scale investigations.

Chartbeat, an analytics firm that provides market intelligence to media organizations, tested headlines from over 100 websites for their effectiveness in engaging users with content [18]. The study examined ‘common tropes’ in headlines (a majority of which are considered clickbait techniques) and found that some of these tropes are more effective than others. Some media pundits interpreted the findings of this study as evidence that clickbait is detrimental to traditional news brands.

HubSpot and Outbrain, two content marketing platforms that distribute clickbait content across the web, examined millions of headlines to identify attributes that contribute to traffic growth, increased engagement, and conversion of readers into subscribers. The study suggested that clickbait techniques may increase engagement temporarily [19], but an article must deliver on the promises made in its headline for users to return and convert.

Automated Clickbait Detection: [20, 2, 10, 21] study automated detection of clickbait headlines using natural language processing and machine learning. [21] collects headlines from Buzzfeed, Clickhole, and The New York Times (NYT) and uses Logistic Regression to create a supervised clickbait detection model. It treats all Buzzfeed and Clickhole headlines as clickbait and all NYT headlines as non-clickbait. We argue that this makes the model susceptible to personal bias, as it overlooks the fact that many Buzzfeed articles are original and non-clickbaity and that clickbait practices also exist in NYT [22]. Moreover, Buzzfeed and NYT usually write on very different topics, so the model might have been trained merely as a topic classifier. [20] attempts to detect clickbaity tweets on Twitter by using common words occurring in clickbait and by extracting some tweet-specific features. [2] uses a dataset of manually labeled headlines to train several supervised models for clickbait detection. These methods depend heavily on a rich set of hand-crafted features, which take a good amount of time to engineer and are sometimes specific to the domain (for example, tweet-related features are specific to Twitter data and inapplicable to other domains). [10] presents a clickbait detection model that uses word embeddings and a Recurrent Neural Network (RNN). These works consider the structure and semantics of a headline to determine whether it is clickbait. However, one important aspect, the body of the news, is not considered in these works at all. We argue that the headline alone does not fully determine whether an article is clickbait. If a headline represents the body fairly, it should not be considered clickbait. Consider, as an example, the title ‘The Top 10 Mistakes Of Entrepreneurs’ (www.forbes.com/sites/roberthof/2016/02/23/guy-kawasaki-the-top-10-mistakes-of-entrepreneurs). The headline is as clickbaity as it can be. However, the body actually contains reasonably decent material, which might be interesting to many people.
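To make the notion of hand-crafted headline features concrete, the following is a minimal sketch of what such a feature extractor might look like. The word lists and feature names are hypothetical choices made for illustration; they are not the feature sets used in [2] or [20], and they are not part of our model.

```python
import re

# Hypothetical word lists chosen for illustration only;
# they are NOT the features used in [2] or [20].
SECOND_PERSON = {"you", "your", "you're", "yourself"}
CURIOSITY_WORDS = {"this", "these", "why", "how", "what", "reasons", "things"}


def headline_features(headline: str) -> dict:
    """Map a headline to a few surface features often associated with clickbait."""
    # Lowercase and keep runs of letters, digits and straight apostrophes.
    tokens = re.findall(r"[a-z0-9']+", headline.lower())
    return {
        "starts_with_number": bool(tokens) and tokens[0].isdigit(),
        "has_number": any(t.isdigit() for t in tokens),
        "second_person": any(t in SECOND_PERSON for t in tokens),
        "curiosity_word": any(t in CURIOSITY_WORDS for t in tokens),
        "ends_with_question": headline.rstrip().endswith("?"),
        "num_tokens": len(tokens),
    }


if __name__ == "__main__":
    print(headline_features("The Top 10 Mistakes Of Entrepreneurs"))
```

Such features look only at the headline. They would flag the Forbes title above because of its listicle form, regardless of whether the article body delivers on the headline, which is the limitation discussed in this section.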

Clickbait Generation: [23, 24, 25] present automated clickbait generation tools. [23] trains an RNN model using million headlines collected from Buzzfeed, Gawker, Jezebel, Huffington Post and Upworthy. The model is then used to produce new clickbait headlines.

VI Conclusion

In this paper, we introduce a word-embedding based clickbait detection system built on our own collected corpus of news headlines and contents. We show that our model performs better than the embeddings based on the Google News dataset. Our analysis also reveals that mainstream media are increasingly adopting clickbait practices. Close scrutiny of the social media posts further reveals that broadcast media use clickbait at a higher rate than print media, and that non-news broadcast media contribute most to it. Our study also shows that unreliable media use clickbait at a higher rate, which is to be expected. Moreover, results from our topic modeling indicate that clickbait practice is prevalent in personalized and entertaining areas. In the future, we want to incorporate the content of the news in defining the clickbaitiness of a headline. We believe such a system would help social networking platforms curb the problem of clickbait and provide a better user experience.