Clickbait is a text or thumbnail link designed to entice users to access the linked online content, which often fails to fulfill the promise made by the title. Online media outlets adopt clickbait techniques in a bid to expand their reach and subsequently increase revenue through ad monetization. As the adverse effects of clickbait attract more and more attention, researchers have started to explore machine learning techniques to automatically detect clickbait. For example, some previous approaches, such as Chakraborty et al. (2016, 2017) and Wei and Wan (2017), built text classifiers via feature engineering, while others, such as Rony et al. (2017); Zheng et al. (2018); Anand et al. (2017); Zhou (2017); Kumar et al. (2018), built text classifiers using deep learning models. Despite the success of previous approaches, they all assume that all the training data is stored locally and hence that the detection models can be built locally.
However, in reality, different organizations generally hold different parts of the data and cannot share data with each other Yang et al. (2019). For example, a social network company, such as Twitter or Weibo, often has the need to automatically monitor the quality of content via a clickbait detection model. While it may have access to a title such as "LeBron James was dragged along the street", it does not have access to the corresponding externally linked content, such as "A man walked down the street with LeBron James' autobiography in his hand. The cover of the autobiography is a picture of James."
In the above scenario, a satisfactory model is difficult to obtain using prior approaches, since the local data alone is not sufficient for clickbait detection. In our experiments, we found that the correlation between titles and contents is a key factor for clickbait detection, while most prior work fails to exploit this correlation.
In response to the above problem, we propose Federated Hierarchical Hybrid Networks, which are Hierarchical Hybrid Networks trained by Clickbait Federated Learning. Hierarchical Hybrid Networks exploit the connection between title and content, which is a key factor for clickbait detection. Clickbait Federated Learning can effectively utilize data from two parties for model training without requiring the two parties to agree on their network structures. Our experimental results show that Federated Hierarchical Hybrid Networks outperform other clickbait detection models and are comparable to models trained in the ideal situation.
We organize the paper as follows. We first review related work. After that, we present the details of our framework and give a detailed description of our algorithm. Finally, we evaluate our algorithm on a dataset and conclude with a discussion of future work.
2 Related Work
2.1 Clickbait Detection
As mentioned above, there has been a substantial amount of prior work on automatic clickbait detection. Chakraborty et al. (2016) collected extensive titles for both the clickbait and non-clickbait categories and manually extracted features for SVM, Decision Tree, and Random Forest classifiers. As an extension, Chakraborty et al. (2017) used another dataset collected from Twitter to conduct further analysis of clickbait. Wei and Wan (2017) redefined the clickbait problem and identified ambiguous and misleading titles separately; they crawled a total of 40,000 articles from four major Chinese news sites and used an SVM classifier for text classification with manual feature extraction.
FastText Grave et al. (2017), TextCNN Kim (2014), TextRNN Cho et al. (2014), and the Self-Attentive Network Yang et al. (2016); Lin et al. (2017) are classic deep learning models. Rony et al. (2017) applied a model similar to FastText, which uses distributed sub-word embeddings learned from a large corpus to detect clickbait. Zheng et al. (2018) proposed CBCNN, whose architecture is similar to TextCNN. Anand et al. (2017) applied a model similar to TextRNN to clickbait detection; their model combines distributed word embeddings with character-level word embeddings. The first attempt to apply a Self-Attentive Network to clickbait detection is the work of Zhou (2017).
The connection between title and content is an important feature in the clickbait detection task. To the best of our knowledge, only one prior work takes advantage of this connection Kumar et al. (2018). Kumar et al. utilized not only the similarity between title and description but also the similarity between description and image. Their work emphasizes the similarity between different parts of the article, while we focus on whether the title is ambiguous or misleading (Marquez (1980) divided news headlines into three types: accurate, ambiguous, and misleading), which is an important indicator of clickbait. Notably, due to the flexibility of natural language, similarity alone cannot accurately measure the connection between title and content for clickbait detection.
2.2 Federated Learning
The Data Island problem has attracted more and more attention. McMahan et al. (2016) and Konecný et al. (2016a) advocated an alternative that leaves the training data distributed on mobile devices and learns a shared model by aggregating locally computed updates. Konecný et al. (2016b) proposed two ways to reduce the uplink communication costs, which improved the efficiency of their federated learning framework. Smith et al. (2017) proposed a novel systems-aware optimization framework for federated multi-task learning, which achieved significant speedups compared to alternatives in the federated setting. Yang et al. (2019) introduced a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning. Zhuo et al. (2019) proposed federated reinforcement learning, which considers the privacy requirement and builds a Q-network for each agent with the help of other agents.
3 Problem Definition
Our problem can be defined as follows: given as input a set T of (title, label) pairs held by party A, and a set C of contents held by party B, find a model M that is built on both T and C while A cannot share data with B. This problem is called the Data Island problem.
According to the above setting, in a clickbait detection task, A is the social network company, which needs to monitor the quality of content on its platform, and B is the media company. Each post on A is a link to an external article and contains a textual description of that article. The description is composed of a sequence of words t = (w_1, ..., w_n) and can be regarded as the title of the article. y ∈ {0, 1} is a label specifying whether the corresponding title is clickbait or not. We denote the set of all (title, label) pairs as T. The content c, held by B, is the detailed text corresponding to title t from A, and is composed of a sequence of words c = (w'_1, ..., w'_m). We denote the set of all contents as C. An article a is made up of a title t and a content c. By an id mapping, we can match the title and the content from the same article a. The model M: (t, c) → y represents the automatic clickbait detection process.
In a word, A hosts T, and B hosts C. They cannot share data with each other, meaning that T and C cannot be aggregated for model training.
4 Our Approach
Inspired by human practices in clickbait detection, we propose Hierarchical Hybrid Networks, which exploit the connection between title and content; most previous approaches either fail to consider both or only take their similarity into account. However, like all prior work, Hierarchical Hybrid Networks only work in the ideal situation (data sharing). So we propose a novel model training method, Clickbait Federated Learning, as an effective solution to the Data Island problem. This method can effectively utilize data from two parties for model training. What is more, Clickbait Federated Learning does not require the two parties to agree on the network structures. After training Hierarchical Hybrid Networks with Clickbait Federated Learning, we obtain Federated Hierarchical Hybrid Networks, which provide a solution to the Data Island problem.
4.1 Hierarchical Hybrid Networks
As shown in Figure 1, Hierarchical Hybrid Networks consist of four parts: a title feature extractor, a content feature extractor, a connection extractor, and a classification network. Hierarchical Hybrid Networks assume the ideal situation (data sharing).
Before modeling, we apply the following preprocessing procedure: removing illegal characters and stop words, word tokenization, etc. After this process, we get a standardized title and content for each article. We denote a batch of standardized titles as t and a batch of standardized contents as c.
We then feed t and c as inputs to Hierarchical Hybrid Networks to get the prediction ŷ. We denote the title feature extractor, content feature extractor, connection extractor, and classification network as f_t, f_c, f_e, and f_y, with parameters θ_t, θ_c, θ_e, and θ_y respectively. After feeding t into f_t, we obtain the title feature vector u. In a similar way, we obtain the content feature vector v. We concatenate u and v and feed the result to f_e to obtain the connection vector e. Based on e, f_y makes a prediction. With the label y, we calculate the loss and update the parameters in the training process. In the predicting process, we get the corresponding label based on ŷ. The above process is shown in Equation 1:

u = f_t(t; θ_t),  v = f_c(c; θ_c),  e = f_e([u; v]; θ_e),  ŷ = f_y(e; θ_y).  (1)
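The four-component pipeline described above can be sketched numerically. The following is a minimal NumPy illustration, not the actual implementation: the feature extractors are replaced by mean-pooling plus a projection (the paper uses Self-Attentive Networks for them), and all dimensions and weights are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(tokens, W):
    # Stand-in feature extractor: mean-pool word vectors and project.
    # (A placeholder for the Self-Attentive Network extractor.)
    return np.tanh(tokens.mean(axis=0) @ W)

d_emb, d_feat = 100, 64
W_t = rng.normal(size=(d_emb, d_feat))       # title extractor parameters (illustrative)
W_c = rng.normal(size=(d_emb, d_feat))       # content extractor parameters
W_conn = rng.normal(size=(2 * d_feat, 32))   # connection extractor, simplified to one dense layer
W_cls = rng.normal(size=(32, 2))             # classification network

title = rng.normal(size=(12, d_emb))         # 12 embedded title tokens (random stand-ins)
content = rng.normal(size=(300, d_emb))      # 300 embedded content tokens

u = extract_features(title, W_t)             # title feature vector u
v = extract_features(content, W_c)           # content feature vector v
e = np.tanh(np.concatenate([u, v]) @ W_conn) # connection vector e from [u; v]
logits = e @ W_cls
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over {non-clickbait, clickbait}
pred = int(probs.argmax())
```

The key structural point is that u and v are computed independently before any interaction, which is what later makes the two extractors separable across parties.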
4.1.1 Feature Extractor
Since clickbait detection is a text classification task, we need a feature extractor to extract features from the text for classification. We apply a Self-Attentive Network to implement f_t and f_c, as shown in Figure 2. Given a title (or content) that contains N tokens, we first map each token w_i, where i ∈ [1, N], to its corresponding word embedding x_i through a word embedding matrix (100-dimensional GloVe embeddings pre-trained on Wikipedia data Pennington et al. (2014)).
After that, we use a bi-directional LSTM Hochreiter and Schmidhuber (1997) to encode the contextual information from both directions of each token into its hidden state. The resulting BiLSTM hidden state for each token is the concatenation of its forward and backward hidden states, as shown in Equation 2:

h_i = [→h_i ; ←h_i].  (2)
We concatenate all hidden states and get H = (h_1, ..., h_N). The token-level attention vector α represents the weights of the tokens. W_a and u_a are the weight matrix and the context vector of the attention mechanism; both are parameters to be trained. The process is shown in Equation 3:

α = softmax(u_a⊤ tanh(W_a H⊤)),  s = Σ_i α_i h_i.  (3)
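The attention step can be illustrated in isolation. In this sketch the BiLSTM is replaced by randomly generated hidden states, and `W_a` and `u_a` stand in for the attention weight matrix and context vector; all sizes are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

N, d_h, d_a = 10, 16, 8            # tokens, hidden size, attention size (illustrative)
H = rng.normal(size=(N, d_h))      # h_1..h_N: stand-ins for BiLSTM hidden states
W_a = rng.normal(size=(d_h, d_a))  # attention weight matrix
u_a = rng.normal(size=(d_a,))      # attention context vector

# Token-level attention: score each hidden state against the context vector,
# normalize with softmax, then take the weighted sum as the sentence vector.
scores = np.tanh(H @ W_a) @ u_a          # one score per token, shape (N,)
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()              # attention weights, sum to 1
s = alpha @ H                            # attended representation, shape (d_h,)
```

Subtracting `scores.max()` before exponentiating is the standard numerically stable softmax; it leaves the weights unchanged.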
4.1.2 Connection Extractor
The clickbait problem stems from the content of an article failing to fulfill the promise made by its title. So when detecting clickbait, humans combine the title with the content and make the final judgment based on the complex connection between them. Inspired by this practice, we design a connection extractor implemented by a Convolutional Neural Network, as shown in Figure 3.
After we obtain the vectors u and v from the feature extractors, we concatenate them into a matrix M (with L rows). We apply a convolution operation to get the connection features. w and b are the parameters of a filter and f is the activation function. A feature r_i is learned from the i-th row to the (i+h-1)-th row of M. The feature map r is the concatenation of all such features. The process is shown in Equation 4:

r_i = f(w · M_{i:i+h-1} + b),  r = [r_1, ..., r_{L-h+1}].  (4)

We then apply a max-pooling operation over the feature map r to obtain r̂ = max(r) as the final feature corresponding to this particular filter of height h. The concatenation of the final features of all filters is the connection vector e we need, as shown below:

e = [r̂^(1), ..., r̂^(K)].
Through this connection extractor, the connection vector contains not only the connection between title and content but also the features of the title and content themselves.
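A minimal sketch of the convolution-plus-max-pooling step, assuming u and v are stacked as the rows of M; the dimensions, filter count, and random weights are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 8
u = rng.normal(size=(d,))          # title feature vector (stand-in values)
v = rng.normal(size=(d,))          # content feature vector
M = np.stack([u, v])               # 2 x d matrix whose rows we convolve over

h, n_filters = 2, 4                # filter height and filter count (illustrative)
W = rng.normal(size=(n_filters, h, d))  # filter weights
b = rng.normal(size=(n_filters,))       # filter biases

# One feature map per filter: slide a window of h rows over M, apply ReLU,
# then max-pool each filter's feature map down to a single value.
feats = []
for k in range(n_filters):
    fmap = [np.maximum(0.0, (W[k] * M[i:i + h]).sum() + b[k])
            for i in range(M.shape[0] - h + 1)]
    feats.append(max(fmap))
r_hat = np.array(feats)            # connection vector: one pooled value per filter
```

Because ReLU is used here, every pooled feature is non-negative; the choice of activation is an assumption, since the paper only names an activation f.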
4.1.3 Classification Network
The classification network f_y is a fully connected neural network with parameters W_y and b_y. The process is given by the following equation:

ŷ = softmax(W_y e + b_y).
4.2 Clickbait Federated Learning
Hierarchical Hybrid Networks assume the ideal situation of aggregating the data together and training the model, but in the Data Island setting we cannot aggregate the data to train them. Prior work on federated learning trains a shared model with data scattered across a large number of nodes, whereas the problem we focus on is training a model with data stored in two companies. So we propose a novel model training method: Clickbait Federated Learning. This method can effectively utilize data from two parties for model training and does not require the two parties to agree on the network structures. As a result, it is a convenient and general method.
According to the assumptions in the Problem Definition, A has the titles and labels while B has the contents (both parties can access the labels, since the labels are generated for model training and can therefore be exchanged). A cannot share data with B. Our target is to find a way to train the model in the Data Island setting.
Clickbait Federated Learning consists of two parts. In reality, the data stored in different places usually does not completely overlap, so we first need to identify the overlapping data for training and give each sample a unique id. Because the data cannot be shared, we use encryption for this alignment (Step 1). We then determine every batch of every epoch and synchronize it on both sides, so that the same samples are used in every batch of every epoch (Step 2). This requires the coordination of both sides.
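The exact encryption scheme for sample alignment is not specified here; the following is a simplified illustration using salted hashing, where each side exchanges only blinded ids and intersects them. The `blind` helper, the shared salt, and the ids are all hypothetical, and a real deployment would use a proper private set intersection protocol rather than a shared salt.

```python
import hashlib

def blind(ids, salt):
    # Hash each record id with a shared salt so raw ids are never exchanged.
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

salt = "shared-secret"                     # agreed out of band (illustrative)
ids_a = ["post1", "post2", "post3"]        # ids held by the social network side
ids_b = ["post2", "post3", "post4"]        # ids held by the media side

h_a, h_b = blind(ids_a, salt), blind(ids_b, salt)
common = sorted(set(h_a) & set(h_b))       # blinded ids are exchanged and intersected
aligned_a = [h_a[h] for h in common]       # each side recovers its own matching records
aligned_b = [h_b[h] for h in common]
```

Sorting the common hashes gives both sides the same sample order, which is what Step 2's batch synchronization relies on.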
Second, we apply the preprocessing procedure to the training data and obtain standardized titles and standardized contents (Step 3). Then we initialize the feature extractors on both sides (Step 4). The actual implementation of the feature extractors is not critical: A can implement its feature extractor with a Convolutional Neural Network or a Recurrent Neural Network, and the same holds for B. Neither A nor B knows the specific implementation of the other, which helps ensure data privacy and security. After that, we initialize the classification network on the side that holds the labels. At this point, the preparatory work before training is done.
Next come the repeated training steps. As shown in Algorithm 1 and Algorithm 2, the training steps differ between A and B. In chronological order: after we feed the corresponding batches t and c into f_t and f_c to obtain u and v for A and B respectively (Step 5), B sends v to A (Step 6 in B). When A receives v, A concatenates u and v and feeds the result to the networks on its side (Steps 6, 7 in A). Based on the prediction and the label, A calculates the relevant derivatives and updates the parameters on its side (Steps 8, 9 in A). After that, A sends the gradient with respect to v back to B (Step 10 in A).
When B receives this gradient (Step 7 in B), B updates the parameters of f_c using the dot product of the received gradient and the local derivative ∂v/∂θ_c (Steps 8, 9 in B). These training steps are repeated until the federated model, composed of f_t, f_c, and the networks on A's side, converges, and then we have obtained the clickbait detection model.
As described above, part of the training sequence is critical in Clickbait Federated Learning: Step 6 in B must precede Step 6 in A, and Step 10 in A must precede Step 7 in B. If the model converges, A sends a termination signal to B. Hence, Clickbait Federated Learning requires the coordination of both sides while maintaining data privacy.
According to the chain rule, the above dot product equals ∂L/∂θ_c, which is exactly the derivative computed in the ideal situation of aggregating the data together and training the model. So, theoretically, the effect of the federated model equals that of a model with the same architecture trained in the ideal situation.
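The chain-rule argument can be checked numerically on a toy model. Here both parties are reduced to linear maps (an illustrative simplification, not the paper's architecture): B computes a feature vector, A computes the loss and returns the gradient with respect to that vector, and B's chain-rule update matches a centralized finite-difference gradient of the full loss.

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(size=(5,))          # raw content features held by party B
W_b = rng.normal(size=(3, 5))      # party B's feature extractor (linear, illustrative)
w_a = rng.normal(size=(3,))        # party A's classifier weights
y = 1.0                            # label held by party A

# --- Federated forward/backward ---
v = W_b @ x                        # B: compute feature vector (Step 5)
score = w_a @ v                    # A: receives v and predicts (Steps 6-7)
loss = 0.5 * (score - y) ** 2      # A: loss computed on A's side
g_v = (score - y) * w_a            # A: dL/dv, sent back to B (Step 10)
grad_Wb_fed = np.outer(g_v, x)     # B: chain rule gives dL/dW_b locally

# --- Centralized reference: numerical gradient of the full loss w.r.t. W_b ---
def full_loss(Wb):
    return 0.5 * (w_a @ (Wb @ x) - y) ** 2

eps = 1e-6
grad_num = np.zeros_like(W_b)
for i in range(W_b.shape[0]):
    for j in range(W_b.shape[1]):
        Wp = W_b.copy(); Wp[i, j] += eps
        Wm = W_b.copy(); Wm[i, j] -= eps
        grad_num[i, j] = (full_loss(Wp) - full_loss(Wm)) / (2 * eps)

max_err = np.abs(grad_Wb_fed - grad_num).max()
```

The two gradients agree to numerical precision, which is the content of the equivalence claim: only intermediate activations and gradients cross the boundary, never raw data, yet the parameter updates are identical to centralized training.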
We can also assume that B holds the titles and labels while A holds the contents; we can still obtain an excellent federated model with Clickbait Federated Learning. What is more, the titles and contents can be generalized to other data types, so Clickbait Federated Learning can be generalized to any similar multi-input classification task with non-shared data. Hence, Clickbait Federated Learning is a general vertical federated learning method and represents a solution to the Data Island problem.
4.3 Federated Hierarchical Hybrid Networks
In Clickbait Federated Learning, the connection extractor is not necessary, since Clickbait Federated Learning is just a model training method. But if we append a connection extractor implemented by a Convolutional Neural Network on A's side, and apply Self-Attentive Networks to implement the feature extractors in A and B, we get Federated Hierarchical Hybrid Networks, whose architecture is the same as that of Hierarchical Hybrid Networks, as shown in Figure 4. Hence, Federated Hierarchical Hybrid Networks, which present a solution to the Data Island problem, can be considered as Hierarchical Hybrid Networks trained by Clickbait Federated Learning.
5 Experiments
5.1 Dataset
We use the dataset provided by the Clickbait Challenge 2017 (http://www.clickbait-challenge.org/), a classic clickbait detection competition. The provided dataset contains posts from the social media platform Twitter, which is often used by media outlets to publish links to their websites. Each post ("tweet") is a short message (up to 140 characters), which can be accompanied by a link and a picture.
Each instance in the dataset includes id, postText (the content of the tweet), targetTitle (the title of the actual article), targetParagraphs (the actual content of the article), targetDescription (the description from the meta tags of the article), targetKeywords (the keywords from the meta tags of the article), targetCaptions (all captions in the article), postMedia (the image that was posted alongside the tweet), truthClass (the clickbait label evaluated by five human evaluators), etc.
According to the source of the data, the above dataset is divided into two parts. The statistics of the two datasets are shown in Table 1. In our experiments, dataset A is the test set, and dataset B is used as the training set and the validation set.
5.2 Experiment Design
As mentioned above, our problem is to build a model between A and B in the Data Island situation (no data sharing). According to our approach, we design two experiments. One is to train Hierarchical Hybrid Networks and compare their performance with other clickbait detection models in the ideal situation.
The other is to train models in the Data Island situation. In this situation, A can only train a model on titles and labels using the traditional training method, and B can only train a model on contents and labels. With Clickbait Federated Learning, we train a series of federated learning models with the same model architectures as the above models.
In our experiment, we choose postText as the title and targetParagraphs as the content. Two evaluation metrics are used in this work: ROC-AUC and F1-score.
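For reference, both metrics can be computed from first principles. This is a plain-Python sketch (in practice a library such as scikit-learn would be used); `roc_auc` uses the rank interpretation of AUC, the probability that a random positive is scored above a random negative.

```python
def f1_score(y_true, y_pred):
    # Harmonic mean of precision and recall on the positive (clickbait) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def roc_auc(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])              # precision 0.5, recall 0.5
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])      # 3 of 4 pairs ranked correctly
```

The pairwise AUC formula is quadratic in the number of examples; rank-based implementations are used for large datasets, but the values coincide.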
5.3 Experimental Results
To avoid the effect of randomness, we perform all our experiments using 5-fold cross-validation on the above dataset, so the following experimental results reflect the average performance of each model on the test set. All of these models use 100-dimensional GloVe embeddings pre-trained on Wikipedia data Pennington et al. (2014).
5.3.1 The Ideal Situation
In this setting, we can aggregate the title and content together for model training. We train five different models in this situation using the traditional training method. TextCNN (t&c), TextRNN (t&c), TextSAN (t&c), and FastText (t&c), where (t&c) indicates that a model takes both title and content as input, share the same overall architecture: each has a title feature extractor, a content feature extractor, and a classification network. The difference between them is the implementation of the title and content feature extractors.
As shown in Table 2, Hierarchical Hybrid Networks achieve the best performance. We must give credit to the connection extractor, since it is the only difference between Hierarchical Hybrid Networks and TextSAN (t&c). We infer that the connection extractor extracts the complex connection between title and content effectively while retaining the original feature information in u and v. This illustrates the importance of the correlation between title and content in the clickbait detection task.
5.3.2 The Data Island Situation
In this setting, we cannot aggregate the title and content together for model training. So when using the traditional training method, we can only utilize the title or the content. (t) means the model only accepts the title as input, while (c) means the model only accepts the content as input. In this situation, we train five federated learning models with Clickbait Federated Learning and eight models with the traditional training method.
Since A has the titles and labels, we train TextCNN (t), TextRNN (t), TextSAN (t), and FastText (t) independently. All of them have a title feature extractor and a classification network; the only difference between them is the implementation of the feature extractor.
We also train TextCNN (c), TextRNN (c), TextSAN (c), and FastText (c) independently on B. Similarly, all of them have a content feature extractor and a classification network; the only difference between them is the implementation of the feature extractor.
According to the experimental results in Table 3, the title is more valuable for the clickbait detection task than the content, since TextCNN (t), TextRNN (t), TextSAN (t), and FastText (t) performed better than their content-only counterparts. This makes sense because the clickbait problem arises from the title of the article being ambiguous or misleading Marquez (1980). We can also see that the federated learning models perform better still, which means that the content carries valuable information for the clickbait detection task, and which shows that Clickbait Federated Learning can effectively utilize non-shared data. We conclude that, via Clickbait Federated Learning, A and B can cooperate effectively and obtain a better clickbait detection model regardless of the implementation of the feature extractors. The best-performing model in Table 3 is FedHHN.
As shown in Table 4, the ROC-AUC and F1-score of the models trained by Clickbait Federated Learning in the Data Island situation are close to those of the models trained by the traditional training method in the ideal situation, which matches our chain-rule analysis in Section 4.2. This illustrates that Clickbait Federated Learning can effectively utilize non-shared data in the Data Island situation and train a federated learning model comparable to a model with the same architecture trained in the ideal situation. We thus conclude that Clickbait Federated Learning represents a desirable solution to the Data Island problem.
What is more, the models trained on both title and content perform better than those trained only on the title or only on the content, which is consistent with the belief that a clickbait detection model needs the title and content together as input. The fact that FedHHN performs better than TextCNN (t&c), TextRNN (t&c), TextSAN (t&c), and FastText (t&c), which are trained in the ideal situation, shows the superiority of Federated Hierarchical Hybrid Networks.
6 Conclusion
In this paper, we propose Federated Hierarchical Hybrid Networks for clickbait detection. Federated Hierarchical Hybrid Networks can be considered as Hierarchical Hybrid Networks trained by Clickbait Federated Learning. Hierarchical Hybrid Networks utilize not only the features of the title and content but also the complex connection between them for detecting clickbait. Clickbait Federated Learning can effectively utilize non-shared data in the Data Island setting and train a federated model comparable to a model with the same architecture trained using the traditional training method in the ideal situation. It thus represents a desirable solution to the Data Island problem, and the method can be extended to any similar multi-input classification task with non-shared data. Our experimental results show that Federated Hierarchical Hybrid Networks perform well on clickbait detection tasks.
However, the federated model we get from Clickbait Federated Learning depends on the input from both sides when it makes predictions. How to get a federated model without this dependency is our future work. Moreover, we are also interested in how to apply Clickbait Federated Learning in other multi-input classification tasks with non-shared data.
- Anand et al. (2017) Ankesh Anand, Tanmoy Chakraborty, and Noseong Park. 2017. We used neural networks to detect clickbaits: You won’t believe what happened next! In Advances in Information Retrieval - 39th European Conference on IR Research, ECIR 2017, Aberdeen, UK, April 8-13, 2017, Proceedings, pages 541–547.
- Chakraborty et al. (2016) Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. 2016. Stop clickbait: Detecting and preventing clickbaits in online news media. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016, San Francisco, CA, USA, August 18-21, 2016, pages 9–16.
- Chakraborty et al. (2017) Abhijnan Chakraborty, Rajdeep Sarkar, Ayushi Mrigen, and Niloy Ganguly. 2017. Tabloids in the era of social media?: Understanding the production and consumption of clickbaits in twitter. PACMHCI, 1(CSCW):30:1–30:21.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734.
- Grave et al. (2017) Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 427–431.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751.
- Konecný et al. (2016a) Jakub Konecný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016a. Federated optimization: Distributed machine learning for on-device intelligence. CoRR, abs/1610.02527.
- Konecný et al. (2016b) Jakub Konecný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016b. Federated learning: Strategies for improving communication efficiency. CoRR, abs/1610.05492.
- Kumar et al. (2018) Vaibhav Kumar, Dhruv Khattar, Siddhartha Gairola, Yash Kumar Lal, and Vasudeva Varma. 2018. Identifying clickbait: A multi-strategy approach using neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pages 1225–1228.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. CoRR, abs/1703.03130.
- Marquez (1980) F. T. Marquez. 1980. How accurate are the headlines? Journal of Communication, 30(3):30–36.
- McMahan et al. (2016) H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. 2016. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.
- Rony et al. (2017) Md Main Uddin Rony, Naeemul Hassan, and Mohammad Yousuf. 2017. Diving deep into clickbaits: Who use them to what extents in which topics with what effects? In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, Sydney, Australia, July 31 - August 03, 2017, pages 232–239.
- Smith et al. (2017) Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4427–4437.
- Wei and Wan (2017) Wei Wei and Xiaojun Wan. 2017. Learning to identify ambiguous and misleading news headlines. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4172–4178.
- Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM TIST, 10(2):12:1–12:19.
- Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1480–1489.
- Zheng et al. (2018) Hai-Tao Zheng, Jin-Yuan Chen, Xin Yao, Arun Kumar Sangaiah, Yong Jiang, and Cong-Zhi Zhao. 2018. Clickbait convolutional neural network. Symmetry, 10(5):138.
- Zhou (2017) Yiwei Zhou. 2017. Clickbait detection in tweets using self-attentive network. CoRR, abs/1710.05364.
- Zhuo et al. (2019) Hankz Hankui Zhuo, Wenfeng Feng, Qian Xu, Qiang Yang, and Yufeng Lin. 2019. Federated reinforcement learning. CoRR, abs/1901.08277.