Federated Hierarchical Hybrid Networks for Clickbait Detection

06/03/2019 ∙ by Feng Liao, et al. ∙ Arizona State University

Online media outlets adopt clickbait techniques to lure readers into clicking on articles, in a bid to expand their reach and subsequently increase revenue through ad monetization. As the adverse effects of clickbait attract more and more attention, researchers have started to explore machine learning techniques to detect clickbait automatically. Previous work on clickbait detection assumes that all of the training data is available locally during training. In many real-world applications, however, training data is stored in a distributed fashion by different parties (e.g., different parties maintain data with different feature spaces), and the parties cannot share their data with each other due to data privacy concerns. It is challenging to federally build high-quality models for detecting clickbait effectively without data sharing. In this paper, we propose a federated training framework, called Federated Hierarchical Hybrid Networks, to build clickbait detection models in a setting where the titles and contents are stored by different parties and their relationship must be exploited for clickbait detection. We empirically demonstrate the effectiveness of our approach by comparing it to state-of-the-art approaches on datasets from social media.


1 Introduction

Clickbait is a text or thumbnail link designed to entice users to access the linked online content, which often fails to fulfill the promise made by the title. Online media outlets adopt clickbait techniques in a bid to expand their reach and subsequently increase revenue through ad monetization. As the adverse effects of clickbait attract more and more attention, researchers have started to explore machine learning techniques to detect clickbait automatically. For example, previous approaches such as Chakraborty et al. (2016, 2017); Wei and Wan (2017) built text classifiers via feature engineering, while Rony et al. (2017); Zheng et al. (2018); Anand et al. (2017); Zhou (2017); Kumar et al. (2018) built text classifiers using deep learning models. Despite the success of previous approaches, they all assume that all of the training data is stored locally, so that the detection models can be built locally.

However, in reality, different organizations generally hold different parts of the data and cannot share data with each other Yang et al. (2019). For example, a social network company such as Twitter or Weibo often needs to automatically monitor the quality of content via a clickbait detection model. While such a company may have access to a title such as "LeBron James was dragged along the street", it does not have access to the corresponding externally linked content, such as "A man walked down the street with LeBron James’ autobiography in his hand. The cover of the autobiography is a picture of James."

In the above scenario, a qualified model is difficult to obtain using prior work, since the local data alone is not sufficient for clickbait detection. In our experiments, we found that the correlation between titles and contents is a key factor for clickbait detection, while most prior work fails to exploit this correlation.

In response to the above problem, we propose Federated Hierarchical Hybrid Networks, i.e., Hierarchical Hybrid Networks trained by Clickbait Federated Learning. Hierarchical Hybrid Networks exploit the connection between title and content, which is a key factor for clickbait detection. Clickbait Federated Learning can effectively utilize data from two parties for model training without requiring the two parties to agree on their network structures. Our experimental results show that Federated Hierarchical Hybrid Networks outperform other clickbait detection models and are comparable to models trained in the ideal situation.

We organize the paper as follows. We first review related work. After that, we present the details of our framework and then give a detailed description of our algorithm. Finally, we evaluate our algorithm on a real-world dataset and conclude with a discussion of future work.

2 Related Work

2.1 Clickbait Detection

As mentioned, there has been a substantial amount of prior work on automatic clickbait detection. Chakraborty et al. collected a large set of titles for both clickbait and non-clickbait categories and manually extracted features for SVM, Decision Tree, and Random Forest classifiers Chakraborty et al. (2016). As an extension, Chakraborty et al. used another dataset collected from Twitter to conduct further analysis of clickbait Chakraborty et al. (2017). Wei and Wan redefined the clickbait problem and identified ambiguous and misleading titles separately; they crawled a total of 40,000 articles from four major Chinese news sites and used an SVM classifier for text classification with manual feature extraction Wei and Wan (2017).

FastText Grave et al. (2017), TextCNN Kim (2014), TextRNN Cho et al. (2014), and Self-Attentive Networks Yang et al. (2016); Lin et al. (2017) are classic deep learning models. Rony et al. applied a model similar to FastText, which uses distributed sub-word embeddings learned from a large corpus to detect clickbait Rony et al. (2017). Zheng et al. proposed CBCNN, whose architecture is similar to TextCNN Zheng et al. (2018). Anand et al. applied a model similar to TextRNN to clickbait detection; their model combines distributed word embeddings with character-level word embeddings Anand et al. (2017). The first attempt to apply a Self-Attentive Network to clickbait detection was made in the work of Zhou Zhou (2017).

The connection between title and content is an important feature in the clickbait detection task. To the best of our knowledge, only one prior work takes advantage of the connection Kumar et al. (2018). In the work of Kumar et al., they utilized not only the similarity between title and description but also the similarity between description and image. Their work emphasizes the similarity between different parts of the article while we focus on whether the title is ambiguous or misleading (Marquez divided news headlines into three types: accurate, ambiguous and misleading Marquez (1980)), which is an important indicator of clickbait. Notably, due to the flexibility of natural language, similarity cannot accurately measure the connection between title and content for clickbait detection.

2.2 Federated Learning

The Data Island problem has attracted more and more attention. McMahan et al. and Konecný et al. advocated an alternative that leaves the training data distributed on mobile devices and learns a shared model by aggregating locally computed updates McMahan et al. (2016); Konecný et al. (2016a). Konecný et al. also proposed two ways to reduce uplink communication costs, improving the efficiency of their federated learning framework Konecný et al. (2016b). Smith et al. proposed a novel systems-aware optimization framework for federated multi-task learning that achieved significant speedups over alternatives in the federated setting Smith et al. (2017). Yang et al. introduced a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning Yang et al. (2019). Zhuo et al. proposed federated reinforcement learning, which considers the privacy requirement and builds a Q-network for each agent with the help of other agents Zhuo et al. (2019).

3 Problem Definition

Our problem can be defined as follows: given as input a set of title–label pairs held by one party and a set of contents held by another party, find a model that must be built on the data of both parties while the parties cannot share their data with each other. This problem is called the Data Island problem.

In a clickbait detection task under this setting, the first party is a social network company, which needs to monitor the quality of content on its platform, and the second party is a media company. Each post contains a link to an external article together with a textual description of that article. The description is composed of a sequence of words and can be regarded as the title of the article. Each title carries a label specifying whether it is clickbait or not; we denote the set of all title–label pairs held by the first party as the title set. The detailed content corresponding to a title is the body of the article, also composed of a sequence of words; we denote the set of all contents held by the second party as the content set. An article is made up of a title and its content, and through a mapping between the two sets we can find the title and the content that come from the same article. The model to be learned represents the automatic clickbait detection process.

In a word, one party hosts the titles and labels, and the other hosts the contents. They cannot share data with each other, meaning that the titles and contents cannot be aggregated for model training.
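
As a concrete (hypothetical) illustration of this setting, the two parties' stores might look as follows; all ids and records below are invented for illustration:

```python
# Party 1 (the social network) holds (title, label) pairs keyed by a
# shared article id; party 2 (the media company) holds the contents.
party_1 = {  # titles + labels, no access to contents
    "a1": {"title": "LeBron James was dragged along the street", "label": 1},
    "a2": {"title": "Lakers win season opener", "label": 0},
}

party_2 = {  # contents only, no access to titles or labels
    "a1": {"content": "A man walked down the street with LeBron James' "
                      "autobiography in his hand."},
    "a3": {"content": "Full transcript of the post-game interview."},
}

# Only articles present on BOTH sides are usable for federated training.
overlap = sorted(party_1.keys() & party_2.keys())
print(overlap)
```

Neither dictionary ever crosses the party boundary; only intermediate quantities derived from the overlapping ids are exchanged during training.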

4 Our Approach

Inspired by human practices in clickbait detection, we propose Hierarchical Hybrid Networks, which exploit the connection between title and content, while most previous approaches either fail to consider both or only take their similarity into account. However, like all prior work, Hierarchical Hybrid Networks only work in the ideal situation (data sharing). We therefore propose a novel training method, Clickbait Federated Learning, as an effective solution to the Data Island problem. This method can effectively utilize data from two parties for model training and, moreover, does not require the two parties to agree on their network structures. Training Hierarchical Hybrid Networks with Clickbait Federated Learning yields Federated Hierarchical Hybrid Networks, which provide a solution to the Data Island problem.

4.1 Hierarchical Hybrid Networks

Figure 1: The architecture of Hierarchical Hybrid Networks.

As shown in Figure 1, Hierarchical Hybrid Networks consist of four parts: a title feature extractor, a content feature extractor, a connection extractor, and a classification network. Hierarchical Hybrid Networks assume the ideal situation (data sharing).

Before modeling, we apply the following preprocessing procedure: removing illegal characters and stop words, word tokenization, etc. After this process, we obtain a standardized title and content for each article. We denote a batch of standardized titles as T and a batch of standardized contents as C.

We then feed T and C as inputs to Hierarchical Hybrid Networks to get the prediction ŷ. We denote the title feature extractor, content feature extractor, connection extractor, and classification network as g, h, c, and f respectively, and their parameters as θ_g, θ_h, θ_c, and θ_f. Feeding T into g, we obtain the title feature vector v_T. In a similar way, we obtain the content feature vector v_C. We concatenate v_T and v_C and feed the result to c to obtain the connection vector v. Based on v, the classification network f makes a prediction. With the label y, we calculate the loss and update the parameters during training. During prediction, we obtain the corresponding label based on ŷ. The above process is shown in Equation 1:

v_T = g(T), v_C = h(C), v = c([v_T; v_C]), ŷ = f(v) (1)
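
To make the composition concrete, here is a minimal NumPy sketch of the four-module pipeline (title feature extractor, content feature extractor, connection extractor, classification network). All dimensions, weight matrices, and the tanh/softmax choices are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the four modules; dimensions are illustrative only.
d_t, d_c, d_v, n_cls = 8, 8, 6, 2
W_g = rng.normal(size=(d_t, 16))          # "title feature extractor" g
W_h = rng.normal(size=(d_c, 16))          # "content feature extractor" h
W_c = rng.normal(size=(d_v, d_t + d_c))   # "connection extractor" c
W_f = rng.normal(size=(n_cls, d_v))       # "classification network" f

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

T = rng.normal(size=16)  # one (already-encoded) title, invented input
C = rng.normal(size=16)  # one (already-encoded) content, invented input

v_T = np.tanh(W_g @ T)                          # v_T = g(T)
v_C = np.tanh(W_h @ C)                          # v_C = h(C)
v = np.tanh(W_c @ np.concatenate([v_T, v_C]))   # v = c([v_T; v_C])
y_hat = softmax(W_f @ v)                        # ŷ = f(v)
```

The key design point is that v is computed from the concatenation of v_T and v_C, so the downstream classifier sees title and content jointly rather than through two independent scores.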

4.1.1 Feature Extractor

Since clickbait detection is a text classification task, we need a feature extractor to extract features from the text for classification. We apply a Self-Attentive Network to implement both the title feature extractor and the content feature extractor, as shown in Figure 2. Given a title (or content) that contains N tokens, we first map each token w_i, where i ∈ [1, N], to its corresponding word embedding e_i through a word embedding matrix (100-dimensional pre-trained GloVe embeddings of Wikipedia data Pennington et al. (2014)).

Figure 2: The implementation of the title feature extractor and content feature extractor.

After that, we use a bi-directional LSTM Hochreiter and Schmidhuber (1997) to encode the contextual information from both directions of each token into its hidden state. The resulting BiLSTM hidden state for each token is the concatenation of its forward and backward hidden states, as shown in the following equation:

h_i = [→h_i ; ←h_i] (2)

We concatenate all hidden states to get H = [h_1, ..., h_N]. The token-level attention vector α represents the weights of the tokens; W_att and u are the network and the context vector of the attention mechanism, and both are parameters to train. The feature vector v is the attention-weighted sum of the hidden states. The process is shown in Equation 3:

α = softmax(uᵀ tanh(W_att H)), v = H α (3)
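
The attention-pooling step can be sketched in NumPy as follows; the BiLSTM is replaced by random stand-in hidden states, and the parameter names W_att and u are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 8                      # N tokens, hidden size d (toy values)
H = rng.normal(size=(N, d))      # stand-in for BiLSTM hidden states h_1..h_N

W_att = rng.normal(size=(d, d))  # attention network (trainable parameter)
u = rng.normal(size=d)           # context vector (trainable parameter)

scores = np.tanh(H @ W_att) @ u          # one scalar score per token
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()              # token-level attention weights
v = alpha @ H                            # weighted sum: the feature vector
```

Each token contributes to the sentence representation in proportion to its learned attention weight, which is what lets the extractor focus on the clickbait-indicative tokens.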

4.1.2 Connection Extractor

The clickbait problem stems from the content of an article failing to fulfill the promise made by its title. So when detecting clickbait, humans combine the title with the content and make the final judgment based on the complex connection between the two. Inspired by this practice, we design a connection extractor implemented by a Convolutional Neural Network, as shown in Figure 3.

Figure 3: The implementation of the connection extractor.

After we obtain the vectors v_T and v_C from the feature extractors, we concatenate them to get z = [v_T; v_C]. We apply a convolution operation to extract the connection features: for a filter of height k with weights w and bias b, where tanh is the activation function, a feature c_i is learned from the i-th row to the (i+k−1)-th row of z. The feature map c is the concatenation of all such features. The process is shown in Equation 4:

c_i = tanh(w · z[i : i+k−1] + b), c = [c_1, c_2, ..., c_{n−k+1}] (4)

We then apply a max pooling operation over the feature map c to obtain ĉ = max(c) as the final feature corresponding to this particular filter of height k. The concatenation of the final features of all filters is the connection vector v we need, as shown below:

v = [ĉ^(1), ĉ^(2), ..., ĉ^(m)] (5)

Through this connection extractor, the connection vector v contains not only the connection between title and content but also the features of the title and the content themselves.
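
A minimal sketch of one convolutional filter over the concatenated feature vector, assuming a tanh activation and invented dimensions (in practice multiple filters of different heights would be applied and their pooled outputs concatenated):

```python
import numpy as np

rng = np.random.default_rng(2)
v_T = rng.normal(size=6)          # toy title feature vector
v_C = rng.normal(size=6)          # toy content feature vector
z = np.concatenate([v_T, v_C])    # concatenated input, length 12

k = 3                             # filter height (window of k rows)
w = rng.normal(size=k)            # filter weights
b = 0.1                           # filter bias

# Slide the filter over every window of k consecutive entries of z.
features = np.array([np.tanh(w @ z[i:i + k] + b)
                     for i in range(len(z) - k + 1)])

c_hat = features.max()            # max pooling: one scalar per filter
```

Because the filter windows straddle the boundary between v_T and v_C, some features mix title and content dimensions, which is one way the extractor can pick up cross-signals between the two.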

4.1.3 Classification Network

The classification network f is a fully connected neural network with parameters W_f and b_f. The process is given by the following equation:

ŷ = softmax(W_f v + b_f) (6)

4.2 Clickbait Federated Learning

Input: titles and labels
Output: the parameters of the title feature extractor, the connection extractor, and the classification network of the federated model

1:  identify the overlapping data shared with the other party;
2:  determine every batch and synchronize it with the other party;
3:  preprocess the titles to get standardized titles;
4:  initialize the title feature extractor, the connection extractor, and the classification network;
5:  input the corresponding batch to the title feature extractor to get the title feature vector v_T;
6:  wait for the other party to send the content feature vector v_C;
7:  concatenate v_T and v_C, then input the result to the connection extractor and the classification network to get the prediction and the loss;
8:  calculate the derivatives of the loss with respect to the local parameters and with respect to v_C;
9:  update the local parameters with their derivatives;
10:  send the derivative of the loss with respect to v_C to the other party;
11:  repeat Steps 5 to 10 until the model converges;
12:  return the local parameters;
Algorithm 1 Clickbait Federated Learning at the party holding titles and labels

The hypothesis of Hierarchical Hybrid Networks is the ideal situation: aggregating the data together and training the model. But in the Data Island setting, we cannot aggregate the data to train Hierarchical Hybrid Networks. Prior work on federated learning trains a shared model with data scattered across a large number of nodes, whereas the problem we focus on is training a model with data stored at two companies. So we propose a novel model training method: Clickbait Federated Learning. This method can effectively utilize data from two parties for model training and does not require the two parties to agree on their network structures. As a result, it is a convenient and general method.

According to the assumptions in the Problem Definition, one party holds the titles and labels while the other holds the contents (both parties can access the labels, since labels are generated for model training and can therefore be exchanged). The parties cannot share their data with each other. Our target is to find a way to train the model in this Data Island setting.

As shown in Algorithm 1 and Algorithm 2, Clickbait Federated Learning consists of two parts. In reality, the data stored in different places usually does not completely overlap, so we need to identify the overlapping data for training and give each overlapping sample a unique id; because data cannot be shared, we use encryption for this step (Step 1). We then determine every batch of every epoch and synchronize it on both sides, so that the same samples are used in every batch of every epoch (Step 2). This requires the coordination of both sides.
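
Step 2 can be sketched as follows: if both parties shuffle the shared overlapping ids with an agreed-upon seed, the batches line up exactly without exchanging any raw data. The ids and parameters here are invented for illustration:

```python
import random

def make_batches(ids, seed, batch_size):
    """Deterministically shuffle and batch a list of sample ids."""
    ids = sorted(ids)                      # canonical order on both sides
    random.Random(seed).shuffle(ids)       # same seed -> same permutation
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

overlap = ["a1", "a4", "a7", "a9", "b2", "b5"]   # shared overlapping ids
batches_party_1 = make_batches(overlap, seed=42, batch_size=2)
batches_party_2 = make_batches(overlap, seed=42, batch_size=2)

# Batch t on one side contains exactly the same sample ids as batch t
# on the other side, so the exchanged feature vectors stay aligned.
assert batches_party_1 == batches_party_2
```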

Input: contents
Output: the parameters of the content feature extractor of the federated model

1:  identify the overlapping data shared with the other party;
2:  determine every batch and synchronize it with the other party;
3:  preprocess the contents to get standardized contents;
4:  initialize the content feature extractor;
5:  input the corresponding batch to the content feature extractor to get the content feature vector v_C;
6:  send v_C to the other party;
7:  wait for the other party to send the derivative of the loss with respect to v_C;
8:  calculate the derivative of the loss with respect to the local parameters;
9:  update the local parameters with their derivatives;
10:  repeat Steps 5 to 9 until the model converges;
11:  return the local parameters;
Algorithm 2 Clickbait Federated Learning at the party holding contents

Secondly, we apply the preprocessing procedure to the training data and obtain standardized titles and standardized contents respectively (Step 3). Then we initialize the feature extractors on both sides (Step 4). The actual implementation of the feature extractors is not critical: either party can apply a Convolutional Neural Network or a Recurrent Neural Network to implement its feature extractor, and neither party knows the specific implementation of the other. This ensures data privacy and security. After that, the side holding the labels initializes the classification network. At this point, the preparatory work before training is done.

Next come the repetitive training steps. As shown in Algorithm 1 and Algorithm 2, the training steps differ between the two parties. In chronological order: after each party inputs the corresponding batch to its feature extractor to get the title feature vector v_T and the content feature vector v_C respectively (Step 5), the content-holding party sends v_C to the label-holding party (Step 6 in Algorithm 2). When the label-holding party receives v_C, it concatenates v_T with v_C and then feeds the result to the connection extractor and the classification network (Steps 6, 7 in Algorithm 1). Based on the prediction and the label, the label-holding party calculates the relevant derivatives and updates the parameters of its networks (Steps 8, 9 in Algorithm 1). After that, it sends the derivative of the loss with respect to v_C to the content-holding party (Step 10 in Algorithm 1).

When the content-holding party receives this derivative (Step 7 in Algorithm 2), it updates the parameters of its feature extractor with the dot product of the received derivative and the derivative of v_C with respect to its own parameters (Steps 8, 9 in Algorithm 2). These training steps repeat until the federated model, which is composed of the two feature extractors, the connection extractor, and the classification network, converges; we have then obtained the clickbait detection model.

As described above, part of the training sequence is critical in Clickbait Federated Learning: Step 6 in Algorithm 2 must precede Step 6 in Algorithm 1, and Step 10 in Algorithm 1 must precede Step 7 in Algorithm 2. If the model converges, the label-holding party sends a termination signal to the content-holding party. Hence, Clickbait Federated Learning requires the coordination of both sides while maintaining data privacy.

According to the chain rule, the above dot product equals the derivative of the loss with respect to the content-side parameters, which is exactly the derivative obtained in the ideal situation of aggregating the data together and training the model. So, theoretically, the federated model matches the model with the same architecture trained in the ideal situation.
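
This equivalence can be checked numerically. The sketch below assumes, for simplicity, a linear content-side extractor v_C = θ_C x_C (an invented toy model, not the paper's network): the chain-rule product of the received gradient with the local Jacobian matches a monolithic numerical differentiation of the same composition:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_C = rng.normal(size=(4, 6))   # content-side parameters θ_C
x_C = rng.normal(size=6)            # one content sample
grad_top = rng.normal(size=4)       # dL/dv_C received from the label side

# Content side: v_C = theta_C @ x_C, so by the chain rule
# dL/dθ_C = outer(dL/dv_C, x_C). This is the "dot product" update.
grad_federated = np.outer(grad_top, x_C)

# "Ideal situation": differentiate the full composition monolithically,
# here by central finite differences on a loss surrogate whose gradient
# with respect to v_C is exactly grad_top.
loss = lambda th: grad_top @ (th @ x_C)
eps = 1e-6
grad_numeric = np.zeros_like(theta_C)
for i in range(theta_C.shape[0]):
    for j in range(theta_C.shape[1]):
        d = np.zeros_like(theta_C)
        d[i, j] = eps
        grad_numeric[i, j] = (loss(theta_C + d) - loss(theta_C - d)) / (2 * eps)

assert np.allclose(grad_federated, grad_numeric, atol=1e-4)
```

The content party thus recovers exactly the gradient it would have computed had the data been pooled, while only the intermediate vector v_C and its gradient cross the party boundary.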

The roles can also be reversed, with the labels held alongside the contents instead of the titles; Clickbait Federated Learning still yields an excellent federated model. What is more, the two inputs can be generalized to other data types: Clickbait Federated Learning can be applied to any similar multi-input classification task with non-shared data. Hence, Clickbait Federated Learning is a general vertical federated learning method and represents a solution to the Data Island problem.

4.3 Federated Hierarchical Hybrid Networks

In Clickbait Federated Learning, the connection extractor is not necessary, since Clickbait Federated Learning is just a model training method. But if we implement the connection extractor with a Convolutional Neural Network at the label-holding party and apply Self-Attentive Networks to implement the feature extractors at both parties, we get Federated Hierarchical Hybrid Networks, whose architecture is the same as that of Hierarchical Hybrid Networks, as shown in Figure 4. Hence, Federated Hierarchical Hybrid Networks, which present a solution to the Data Island problem, can be considered as Hierarchical Hybrid Networks trained by Clickbait Federated Learning.

Figure 4: The architecture of Federated Hierarchical Hybrid Networks.

5 Experiments

5.1 Dataset

In this paper, we use the dataset provided by The Clickbait Challenge 2017 (http://www.clickbait-challenge.org/), a classic clickbait detection competition. The provided dataset contains posts from the social media platform Twitter, which is often used by media outlets to publish links to their websites. Each post, or "tweet", is a short message (up to 140 characters) that can be accompanied by a link and a picture.

Each instance in the dataset includes the content of the tweet, the title of the actual article, the actual content of the article, the description from the meta tags of the article, the keywords from the meta tags of the article, all captions in the article, the image that was posted alongside the tweet, the clickbait label evaluated by five human evaluators, etc.

Dataset tweets clickbait non-clickbait
A 2459 762 1697
B 19538 4761 14777
Table 1: Statistics of the datasets.

According to the source of the data, the above dataset is divided into two parts. The statistics of the two datasets are shown in Table 1. In our experiments, dataset A is used as the test set, and dataset B is used as the training set and the validation set.

5.2 Experiment Design

As mentioned, our problem is to build a model across the two parties in the Data Island situation (no data sharing). According to our approach, we design two experiments. The first trains Hierarchical Hybrid Networks and compares its performance with other clickbait detection models in the ideal situation.

The second trains models in the Data Island situation. In this situation, each party only has its own data: with the traditional training method, we can train a model only on titles and labels, or only on contents and labels. With Clickbait Federated Learning, we train a series of federated learning models whose architectures have the same implementations as the models above.

In our experiments, we choose the title of the actual article as the title and the actual content of the article as the content. Two evaluation metrics are used in this work: ROC-AUC and F1-score.
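
For reference, both metrics can be computed from first principles; the minimal pure-Python versions below are for illustration only, and a library implementation would normally be used instead:

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """Rank-based AUC: probability a random positive outscores a random
    negative (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))          # prints 0.8
print(roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))   # prints 1.0
```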

5.3 Experimental Results

To avoid randomness effects, we perform all our experiments using 5-fold cross-validation on the above dataset, so the following experimental results reflect the average performance of each model on the test set. All of these models use 100-dimensional pre-trained GloVe embeddings of Wikipedia data Pennington et al. (2014).

5.3.1 The Ideal Situation

In this setting, we can aggregate the title and content together for model training. We train five different models in this situation with the traditional training method. TextCNN (T&C), TextRNN (T&C), TextSAN (T&C), and FastText (T&C), where (T&C) indicates that the model takes both title and content as input, share the same model architecture: each has a title feature extractor, a content feature extractor, and a classification network. The only difference between them is the implementation of the title and content feature extractors.

Model ROC-AUC F1
HHN 0.67224 0.53155
TextCNN (T&C) 0.66185 0.51734
TextRNN (T&C) 0.65026 0.49115
TextSAN (T&C) 0.64851 0.48759
FastText (T&C) 0.64489 0.47466
Table 2: The experimental results in the ideal situation.

As shown in Table 2, Hierarchical Hybrid Networks achieve the best performance. We must give credit to the connection extractor, since it is the only difference between Hierarchical Hybrid Networks and TextSAN (T&C). We infer that the connection extractor effectively extracts the complex connection between title and content while retaining the original feature information of both. This illustrates the importance of the correlation between title and content in the clickbait detection task.

5.3.2 The Data Island

In this setting, we cannot aggregate the title and content together for model training, so when using the traditional training method we can only utilize the title or the content. (T) means the model accepts only the title as input, while (C) means the model accepts only the content as input. In this situation, we train five federated learning models with Clickbait Federated Learning and eight models with the traditional training method.

Since the first party has the titles and labels, we train TextCNN (T), TextRNN (T), TextSAN (T), and FastText (T) on it independently. All of them have a title feature extractor and a classification network; the only difference between them is the implementation of the feature extractor.

We also train TextCNN (C), TextRNN (C), TextSAN (C), and FastText (C) independently on the party holding the contents. Similarly, all of them have a content feature extractor and a classification network; the only difference between them is the implementation of the feature extractor.

Model ROC-AUC F1
FedCNN 0.65694 0.50881
TextCNN (T) 0.63943 0.47120
TextCNN (C) 0.61256 0.44583
FedRNN 0.64627 0.47744
TextRNN (T) 0.63306 0.47102
TextRNN (C) 0.60779 0.42497
FedSAN 0.64888 0.48603
TextSAN (T) 0.63555 0.46201
TextSAN (C) 0.59554 0.40951
FedFastText 0.64224 0.46843
FastText (T) 0.63287 0.45195
FastText (C) 0.59350 0.39497
FedHHN 0.66835 0.52898
Table 3: The experimental results in the Data Island situation.

According to the experimental results in Table 3, the title is more valuable than the content for the clickbait detection task, since TextCNN (T), TextRNN (T), TextSAN (T), and FastText (T) performed better than their content-only counterparts. This makes sense because the clickbait problem arises from the title of the article being ambiguous or misleading Marquez (1980). We can also see that the federated learning models perform better still, which means that the content carries valuable information for the clickbait detection task and that Clickbait Federated Learning can effectively utilize non-shared data. We can conclude that the two parties can cooperate effectively through Clickbait Federated Learning and obtain a better clickbait detection model regardless of the implementation of the feature extractors. The best performer in Table 3 is FedHHN.

Model ROC-AUC F1
TextCNN (T&C) 0.66185 0.51734
FedCNN 0.65694 0.50881
TextRNN (T&C) 0.65026 0.49115
FedRNN 0.64627 0.47744
TextSAN (T&C) 0.64851 0.48759
FedSAN 0.64888 0.48603
FastText (T&C) 0.64489 0.47466
FedFastText 0.64224 0.46843
HHN 0.67224 0.53155
FedHHN 0.66835 0.52898
Table 4: The experimental results of the models trained by Clickbait Federated Learning in the Data Island situation and the models trained in the ideal situation.

As shown in Table 4, the ROC-AUC and F1-score of the models trained by Clickbait Federated Learning in the Data Island situation are close to those of the models trained by the traditional training method in the ideal situation, which matches our chain-rule analysis in Section 4.2. This illustrates that Clickbait Federated Learning can effectively utilize non-shared data in the Data Island situation and train a federated learning model that is comparable to a model with the same architecture trained with the traditional training method in the ideal situation. We thus conclude that Clickbait Federated Learning is a desirable solution to the Data Island problem.

What is more, the models trained on both title and content perform better than the models trained only on titles or only on contents, which is consistent with the belief that a clickbait detection model needs title and content together as input. The fact that FedHHN outperforms TextCNN (T&C), TextRNN (T&C), TextSAN (T&C), and FastText (T&C), which are trained in the ideal situation, shows the superiority of Federated Hierarchical Hybrid Networks.

6 Conclusion

In this paper, we propose Federated Hierarchical Hybrid Networks for clickbait detection. Federated Hierarchical Hybrid Networks can be considered as Hierarchical Hybrid Networks trained by Clickbait Federated Learning. Hierarchical Hybrid Networks utilize not only the features of the title and content but also the complex connection between the title and content for detecting clickbait. Clickbait Federated Learning can effectively utilize non-shared data in the Data Island setting and train a federated model which is comparable to the model with the same architecture trained using the traditional training method in the ideal situation. It thus represents a desirable solution for the Data Island problem and this method can be extended to any similar multi-input classification tasks with non-shared data. Our experimental results show that Federated Hierarchical Hybrid Networks performed well on clickbait detection tasks.

However, the federated model we get from Clickbait Federated Learning depends on the input from both sides when it makes predictions. How to get a federated model without this dependency is our future work. Moreover, we are also interested in how to apply Clickbait Federated Learning in other multi-input classification tasks with non-shared data.

References