
Privacy-Preserving Online Content Moderation: A Federated Learning Use Case

Users are exposed daily to a large volume of harmful content on various social network platforms. One solution is developing online moderation tools using Machine Learning techniques. However, the processing of user data by online platforms requires compliance with privacy policies. Federated Learning (FL) is an ML paradigm where the training is performed locally on the users' devices. Although the FL framework complies, in theory, with the GDPR policies, privacy leaks can still occur. For instance, an attacker accessing the final trained model can successfully perform unwanted inference of the data belonging to the users who participated in the training process. In this paper, we propose a privacy-preserving FL framework for online content moderation that incorporates Differential Privacy (DP). To demonstrate the feasibility of our approach, we focus on detecting harmful content on Twitter - but the overall concept can be generalized to other types of misbehavior. We simulate a text classifier - in FL fashion - which can detect tweets with harmful content. We show that the performance of the proposed FL framework can be close to the centralized approach - for both the DP and non-DP FL versions. Moreover, it has a high performance even if a small number of clients (each with a small number of data points) are available for the FL training. When reducing the number of clients (from 50 to 10) or the data points per client (from 1K to 0.1K), the classifier can still achieve 81% AUC. We also evaluate the classifier on four other Twitter datasets that capture different types of user misbehavior and still obtain a promising performance (61% to 80% AUC). Finally, we assess the overhead on the users' devices during the FL training phase and show that the local training does not introduce excessive CPU utilization and memory consumption overhead.





1 Introduction

Users of all ages are exposed to a large volume of information from various Online Social Networks (OSNs). The content is often questionable or even harmful for users of any age, expressing abusive behavior, extreme sarcasm, cyberbullying, racism, and offensive or hate speech. OSN platforms try to protect users by setting special terms and conditions, blocking malicious accounts, and flagging or even taking down harmful content. Despite these efforts, harmful content is still present. Researchers and developers have made a great effort to develop automated detection tools mainly based on Machine Learning (ML) algorithms founta2018large; 10.1145/2684822.2685316; Waseem2016HatefulSO; chatzakou2017mean. These ML models are first trained on large annotated datasets and then deployed online. One challenge is creating large labeled datasets suitable for deep learning training. The data are large (from millions of users), multi-modal (text, video, and images, or a combination of those), and they change dynamically. Additionally, it is challenging for the platforms and researchers to collect and process these data in the first place. The users' online data are private and sensitive, which is why the EU has imposed strict policies to protect users' privacy (GDPR and accompanying national legislation).

In this paper, we investigate whether privacy–preserving ML methods can effectively detect harmful online content while complying with privacy policies. For this purpose, we propose and evaluate a privacy–preserving Federated Learning (FL) framework for training text classifiers able to detect harmful content. FL is a collaborative ML training process where in each round, the training phase is performed locally at users’ devices (the FL “clients”), and only the model parameters are sent to the central server (the FL “aggregator”). The central server aggregates the received information and updates the global model mcmahan2017communicationefficient. Therefore, FL has access to local, up-to-date user data and does not require such data to be globally collected by a central unit for storage and ML training; data that are usually massive in volume and a potential target for cyber-attacks, theft, and prying on.

Although the FL paradigm complies, in theory, with the GDPR policies (since the raw data never leave the users’ devices), privacy leakages can still occur. Prior studies have shown that the FL framework is vulnerable to membership inference and backdoor attacks NEURIPS2021_91cff01a; LDP_CDP. In this paper, we consider Differential Privacy (DP) as a defense mechanism against membership and attribute inference attacks dwork2006; Dwork. DP provides privacy guarantees (at the user level) against data (or membership) inference attacks by an external attacker who has access to the trained model. We incorporate the DP model proposed in NEURIPS2021_91cff01a which is a generalization of DP for the FL framework.

Our central research question is whether harmful online content can be detected efficiently and effectively by a privacy–preserving FL framework. To answer this, we bootstrap our ML text model from a modified version of the classifier presented in founta2018unified. We evaluate it when trained in an FL fashion (with and without DP) on different Twitter datasets from five studies of Twitter user misbehavior by generalizing the classification problem as detecting harmful or normal behavior. We compare the classifier’s FL performance with the centralized version that has access to all data. Finally, we assess a typical user device’s overhead while training the classifier locally to examine whether the FL approach slows down the device.

This work makes the following contributions:

  • We are the first to propose a methodology for applying privacy–preserving FL in the context of harmful content detection. Moreover, we provide a simulation methodology for using centralized datasets to test the performance of an FL framework.

  • We show that the performance of the proposed FL framework can be close to the centralized approach – for both the DP and non–DP FL versions. The FL classification performance on a total of 100K tweets has only a 10% difference in AUC compared to the centralized approach. For instance, by training the classifier (without DP) for only twenty FL rounds on fifty clients, we achieve 83% AUC. Moreover, when reducing the number of clients (from fifty to ten) or the data points per client (from 1K to 0.1K), the classifier can still achieve 81% AUC. In other words, we can achieve high performance even if few clients (with few data locally) are available.

  • Our further evaluation of the classifier on four smaller Twitter datasets of other types of misbehavior shows promising performance, ranging from 61% to 80% AUC. This means that the classifier can generalize and detect different types of misbehavior.

  • Finally, we show that the FL training process does not introduce excessive system overhead – in terms of CPU utilization and memory consumption – on the users' devices.

  • The simulation and the classifier code are made available to the research community.

Figure 1: FL Differential Private Framework and Dataflow

2 Related Work

2.1 Machine Learning for Automatic Detection and Filtering of Harmful Content

Harmful content can be found in a text, visual (image, video), audio (songs, recordings) format, or a combination of those. We define any violent, abusive, sexual, disrespectful, hateful, illegal content, or any content that may harm the user as “harmful”. One solution to protect users from such content is adopting automatic detection or filtering using Machine Learning techniques in online moderation tools.

Several studies have investigated misbehavior on Twitter. chatzakou2017mean proposes a deep-learning architecture to classify various types of abusive behavior (bullying and aggression) on Twitter. The authors propose a methodology for extracting textual, user, and network-based features from Twitter accounts to identify patterns of abusive behavior, and apply it to a large dataset of 1.6M tweets collected over a period of three months. founta2018unified presents a unified deep learning classifier to detect abusive texts on Twitter. One of the evaluation datasets was the one presented in founta2018large, with 100K tweets labeled as "Abusive", "Hate", "Normal", and "Spam" using crowdsourcing annotation techniques. The unified classifier consists of two different classifiers whose results are combined to give the final result. One classifier is a text classification model, and the other treats domain-specific metadata (i.e., the user's friend network, number of retweets, etc.). The authors tested the unified classifier with several abusive Twitter datasets and achieved high performance. In this work, we adopt a simplified version of the proposed classifier by replicating the model for the text classification task – since we use no metadata as training input but only text stored on a user's device.

Yenala et al. proposed a deep learning architecture for detecting inappropriate language in query-completion suggestions in search engines and in users' conversations in messengers yenala2018deep. They used a sequential model of Convolutional Neural Networks and Bi-directional LSTMs for the search-engine-suggestions use case, and an LSTM plus Bidirectional LSTM sequential model for messenger conversations. They show that the suggested architecture outperforms pattern-based and hand-crafted feature-based architectures. The authors in 10.1145/3442442.3452314 collected a dataset of YouTube comments to assess the exposure of kids and adolescents to inappropriate comments on YouTube. They built a model consisting of five high-accuracy classifiers using Natural Language Processing and ML to classify the comments into five age-inappropriate classes (Toxic, Obscene, Insult, Threat, Identity hate). The model acts as a binary classifier that flags input as inappropriate if it falls into at least one of the five classes.

Papadamou et al. built a deep learning classifier to detect videos with inappropriate content that targets toddlers on YouTube with high accuracy (84.3%) papadamou2019disturbed. The authors in 9073060 created a dataset with three different categories of videos: "Original Videos", "Explicit Fake Videos", and "Violent Fake Videos". They trained a deep learning classifier to detect videos with content inappropriate for kids with an accuracy of more than 90%. Additionally, Papadamou et al. collected YouTube videos related to pseudoscientific content and used the resulting dataset to train a deep learning classifier that detects misinformation videos on YouTube with an accuracy of 79% papadamou2021it. These studies used video processing techniques to extract information from the videos but also collected other related information (i.e., the video's title, comments, caption, etc.). In contrast, our work focuses on detecting inappropriate text content.

2.2 Federated Learning and Differential Privacy

McMahan et al. introduced Federated Learning (FL) as a distributed approach for training machine learning models without sharing an individual's data with a central unit mcmahan2017communicationefficient. The idea is to train local models on clients' devices with their on-device available data and only share locally computed updates with the central server. The server collects the locally computed updates from the clients and aggregates them to update the global model. A client device in an FL setting can range from a mobile device, a laptop, a desktop, or an IoT device to a company's data server.

Since FL's introduction, many studies have described FL applications in real settings. Gboard yang2018applied uses FL for training, evaluating, and deploying a model that gives optimized web, GIF, and Sticker query suggestions. Gboard also used FL to train a model for next-word prediction hard2018federated. Next-word prediction is used on the keyboard to suggest words for the user to type next, based on the text already typed. In DBLP:journals/corr/abs-1903-10635, the authors applied FL to train a neural network to learn out-of-vocabulary (OOV) words, to avoid annoying users by auto-correcting OOV words as misspellings. FL has also been used to train an image-classification model that decides whether a patient has the COVID-19 virus using x-ray images from several hospitals, thereby preserving the patients' privacy liu2020experiments. The performance obtained when training the models using FL was slightly worse than training with a centralized approach.

Several studies have shown that maintaining the raw data locally does not sufficiently protect the users' privacy, so data leakages can still occur in the FL framework. There are two main potential threats to data privacy: data inference attacks performed (i) by the other clients – or even the central aggregator – during the training phase, and (ii) by an external attacker who has access to the final trained model. One of the proposed ways to provide privacy-preserving guarantees to FL is Differential Privacy (DP). DP was first introduced by dwork2006; Dwork as a privacy–preserving technique for learning tasks on statistical databases. It can limit the information leakage regarding the data records used in the learning phase. DP provides statistical guarantees against data inference attacks performed by an adversary who has access to the output of the learning algorithm. These privacy guarantees are achieved by adding noise to the learning process to limit the data records' influence on the algorithm's final output. Two main variations of the DP methodology have been incorporated into the FL framework toward privacy–preserving FL: Central Differential Privacy (CDP) and Local Differential Privacy (LDP) LDP_CDP; hybrid approaches have also been proposed lately Hierarchical_Federated_Learning. In CDP, the agents send the model updates to the central server, which performs the DP noise addition NEURIPS2021_91cff01a. This implies that the central server is a trusted system entity, namely, it will not perform malicious inferences on the clients' data. In LDP, the DP noise addition is performed locally by the clients – before sending the updates to the central server LDP-Fed. In this context, no trusted entity is required.
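To make the CDP variant concrete, the following is a minimal NumPy sketch (not the paper's implementation) of a trusted server's aggregation step: each client update is clipped to a fixed L2 norm, the clipped updates are averaged, and Gaussian noise calibrated to the clipping bound is added. The function name, the fixed (non-adaptive) clipping, and the noise scaling are illustrative assumptions.

```python
import numpy as np

def cdp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=0.5, rng=None):
    """Illustrative CDP-style aggregation: clip each client's update to
    `clip_norm` (L2), average, then add Gaussian noise at the server."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for u in client_updates:
        norm = np.linalg.norm(u)
        # Scale down any update whose L2 norm exceeds the clipping bound.
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise std is proportional to the per-client sensitivity of the average.
    sigma = noise_multiplier * clip_norm / len(client_updates)
    return avg + rng.normal(0.0, sigma, size=avg.shape)
```

Under LDP, by contrast, each client would add its own noise before transmission, so the server never sees an un-noised update.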

3 Conceptual Framework

To further explain the idea of applying the FL paradigm to online moderation tools, we present our conceptual framework in Figure 1. Regarding the threat model, we assume that the only trusted entity is the central aggregator. Under the Central Differential Privacy protocol NEURIPS2021_91cff01a – which we use in this study – the central aggregator is responsible for adding the DP noise to the model updates that it receives from the clients in an FL round. This implies that the aggregator is a trusted entity, but the other participants may not be. Hence, possible adversaries are either some of the clients or an external entity, which may perform data inference attacks either during the training phase or through the final model.

3.1 System Components

Client Device: The user's device that accesses the Online Social Network application (e.g., Twitter).

Browser Add-On: filters the users’ online activity, conducts DOM tree analysis, and sends the selected data (i.e., tweets) to the Labeling Module.

Labeling Module: aggregates the labels obtained from the Auto-Labeling and User Feedback modules. The Auto-Labeling module can use semi-supervised learning techniques to label the data automatically. The User Feedback module asks the user to label the data.

Local Database: stores the labeled data.

FL Module: schedules and executes FL tasks on the user device. An FL task defines and executes the local training.

Data properties computing: the module that computes the metadata of the user’s dataset (i.e., size of data, etc.), accompanied by other device information (e.g., battery, internet connection type, device capabilities, etc.).

Cloud Server: a unit owned by a trusted party that coordinates the FL training.

FL Task Configuration: generates the FL Task description, which contains the baseline model for training, the criteria for the clients to participate in this task, and the FL parameters (e.g., number of FL rounds, the number of clients to participate, etc.).

Scheduler: advertises the FL task to the available clients and manages the communication with the clients.

Client Selection Mechanism: checks if the client’s device complies with the criteria set by the FL Task Config module.

Model Aggregator: aggregates the clients’ model updates and applies the aggregated update to the global model.

3.2 Data-Flow

Figure 1 shows the data flow of the proposed framework. Specifically:

(A1) The user accesses Twitter through the device’s browser, (A2) and sends an HTTP request to Twitter. (A3) The Browser Add–On’s DOM Tree Analysis module receives the Twitter newsfeed page DOM tree, filters the user activity, and selects data for labeling. (A4) The Labeling module receives the data (e.g. a tweet text). The Auto Labeling module automatically labels the data. The User Feedback module asks the user to label it. (A5) Then, the two labeling modules send the {tweet, label} pairs to the Labeling Aggregator, which defines a final label for the tweet using an aggregation method, and (A6) stores the labeled data at the Local Database.

When there is a pending FL task at the server, (B1) the FL Task Config module sends the task description to the Scheduler. (B2) The Scheduler sends the task descriptions to the available clients. (B3) The client’s FL Scheduler receives it, and forwards it to the Data properties Computing module, (B4) which sends the device properties back to it. (B5) The FL scheduler sends the properties to the Scheduler, (B6) which forwards them to the Client Selection Mechanism to tell if the client will participate in the training or not. (B7) The mechanism module sends its positive or negative decision to the Scheduler, (B8) which announces to the client’s FL scheduler its participation in training with the global model to train or closes the connection with it.

For participating clients, (B9) the FL Scheduler sends the global model and the task description to the FL Task module, and (B10) requests the local dataset. (B11) The Local Database sends the dataset, and the FL Task module starts the local ML training with Differential Privacy (DP) adaptive clipping. The Adaptive Clipping module receives the local model's updates, clips them, and (B12) sends them to the Model Aggregator. The Model Aggregator aggregates the updates, adds DP noise (i.e., it adds noise to the updates' sum), and applies the updates to the global model. Finally, (B13) it sends the model to the FL Task Config module for use in the next round of the FL training.

Figure 2: Federated Learning Setup Pipeline

4 Federated Learning Setup

4.1 General Assumptions

Since we do not have access to the raw Twitter data of millions of users, the true distribution of harmful tweets across users is unknown. Thus, we have to simulate the users' browsing history. For this purpose, we construct artificial clients by splitting a centralized Twitter dataset – which contains harmful tweets – into a number of disjoint sets. Moreover, we assume a homogeneous population of clients, namely, all clients have the same number of total tweets with the same harmful-to-normal ratio (i.e., the same class ratio). Additionally, we assume that the clients selected for the FL training remain available during the whole FL phase.

4.2 Federated Learning Training

We used TensorFlow Federated (TFF), an open-source framework for computations on decentralized data, to simulate the FL training process for our experiments. The FL algorithm we used for aggregating the clients' model updates is Federated Averaging mcmahan2017communicationefficient. TFF provides an implementation of Differential Privacy for FL training, which we use to add CDP to our FL training simulation. Figure 2 presents our pipeline to simulate the FL training. We describe next the FL pipeline's steps and main components.
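A single Federated Averaging round can be sketched as a dataset-size-weighted average of the clients' model weights. This NumPy sketch (the function name is illustrative; the actual experiments use TFF's built-in implementation) shows the server-side computation:

```python
import numpy as np

def federated_averaging(client_weights, client_sizes):
    """One illustrative FedAvg step: average the clients' model weights,
    weighting each client by the size of its local dataset.

    client_weights: list of per-client weight lists (one array per layer).
    client_sizes:   number of local training examples per client.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]
```

In the DP variant, the per-round client contributions would additionally be clipped and noised before this averaging, as described in Section 3.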

4.3 Text classifier

We use a simplified version of the unified classification model described in founta2018unified, where only the text-classification path is enabled. We chose this classifier since it showed high performance (80% to 93% AUC) across many harmful-tweet datasets, and we use the simplified version to give a lighter computational task to the user's device. The input of the classifier is the text of the tweets. We used TensorFlow Keras for the implementation of the classifier. The sequential ML pipeline starts with an Embedding layer, for which we use the GloVe embedding with the highest dimension (200). A Recurrent Neural Network layer follows, with a gated recurrent unit (GRU), 128 units, and a dropout of p=0.5. The output layer is a dense classification layer with a single neuron and the sigmoid activation function. We set the parameters as proposed in founta2018unified. The TFF framework offers a function (tff.learning.from_keras_model) that wraps a Keras model for its use in the federated training simulation.
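The architecture described above can be sketched in Keras as follows. The vocabulary size and the use of a randomly initialized Embedding layer are illustrative assumptions (the actual setup initializes the 200-dimensional embeddings from pretrained GloVe vectors); the GRU width, dropout, and output layer follow the text.

```python
import tensorflow as tf

def build_text_classifier(vocab_size=20000, embedding_dim=200):
    """Illustrative sketch of the text-only path: Embedding -> GRU -> sigmoid.
    In the paper's setup the Embedding is initialized with 200-d GloVe vectors;
    here it is randomly initialized for self-containment."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.GRU(128, dropout=0.5),                 # 128 units, p=0.5
        tf.keras.layers.Dense(1, activation="sigmoid"),        # binary output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model
```

A model built this way could then be passed to TFF's Keras wrapper for the federated simulation.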

4.4 Creating artificial clients for FL

We needed a decentralized dataset with a sufficient number of harmful and normal texts to simulate the FL training of the text classifier. Since we could not find a dataset fulfilling our criteria, we converted existing centralized datasets from past studies into artificial federated datasets. For this purpose, given a dataset with two classes of tweets (harmful and normal) and a sufficient number of harmful tweets, we do the following:

First, we create a test set with a size of 10% of the dataset, under the condition that 8% of the tweets in the test set are harmful. In other words, the harmful:normal class ratio in the test set is 8:92. We apply this percentage (8%) based on the results of previous studies founta2018large; chatzakou2017mean, which showed that the percentage of harmful content on Twitter is around 8%. Then, we create the clients using the remaining 90% of the dataset. In our simulation, the clients are represented by sets of tweets (the clients' local data). To evaluate FL on different populations of clients, we control the class ratio (harmful:normal) in the clients' data. We also set the total number of tweets per client. Finally, given the clients' class ratio and data size, we compute the maximum number of clients we can construct.
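The procedure above can be sketched in plain Python. This is a simplified illustration of the split (function and parameter names are ours, not from the paper's code); it assumes the dataset is given as two lists of harmful and normal tweets and that the harmful ratio is strictly positive.

```python
import random

def make_federated_split(harmful, normal, client_size=1000,
                         harmful_ratio=0.5, seed=42):
    """Illustrative client construction: hold out a 10% test set with an
    8:92 harmful:normal ratio, then carve the remainder into homogeneous
    clients of `client_size` tweets each at `harmful_ratio`."""
    rng = random.Random(seed)
    harmful, normal = harmful[:], normal[:]
    rng.shuffle(harmful)
    rng.shuffle(normal)
    # Test set: 10% of the data, 8% of it harmful.
    test_size = (len(harmful) + len(normal)) // 10
    n_harm = int(0.08 * test_size)
    test_set = harmful[:n_harm] + normal[:test_size - n_harm]
    harmful, normal = harmful[n_harm:], normal[test_size - n_harm:]
    # Homogeneous clients: identical size and class ratio everywhere.
    h_per = int(harmful_ratio * client_size)
    n_per = client_size - h_per
    n_clients = min(len(harmful) // h_per, len(normal) // n_per)
    clients = [harmful[i * h_per:(i + 1) * h_per] +
               normal[i * n_per:(i + 1) * n_per]
               for i in range(n_clients)]
    return clients, test_set
```

The final `min(...)` expresses the "maximum number of clients we can construct" for a given class ratio and client size.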

5 Experimental Evaluation

5.1 Training Setup

To address the research questions of this work, we conducted experiments with the following training setups:
FL training: For the FL training setup, we follow the method described in Section 4.4 – given the parameters (clients' data size, percentage of harmful tweets) – to construct the federated dataset. Then, we set the FL rounds and the number of participating clients in each round. Finally, we use the TFF framework to simulate the FL training. We refer to Local training as the training of the model on the client's device, using the whole client's dataset as the local training set.

Centralized training: This is the traditional ML training setup where the text classifier is trained with a single train set. Regarding the train–test split, we construct the test set following the same procedure described in Section 4.4. That is, we initially split the dataset into a test set of 10% size with class ratio 8:92 (i.e. 8% harmful tweets). Then, from the remaining 90% of the dataset, we construct the train set. We set a class ratio and a training–set size and then we randomly select a subset of tweets that satisfies these properties.

In both setups, we train the text classifier described in Section 4.3, and we compute the weighted classification metrics. We set the parameters (epochs=7, batch size=10, Adam optimizer, learning rate=0.001) after experimenting with different values for tuning and applying early stopping. We run all the experiments on a server with an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and 62GiB RAM, except for the "overhead on client's device" experiment (Section 5.7), which we run on a Dell laptop with an Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz and 8GB RAM.
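Since AUC is the headline metric throughout the evaluation, the following is a small self-contained illustration of how it can be computed (the experiments themselves use standard library implementations): the rank-based formulation, equivalent to the Mann-Whitney U statistic, with mid-ranks for tied scores.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC (Mann-Whitney U / (n_pos * n_neg)).
    labels: 0/1 ground truth; scores: classifier probabilities.
    Ties in `scores` are handled with mid-ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Find the run of tied scores and assign the mid-rank to all of them.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # 1-based mid-rank
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```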

5.2 FL simulation parameters

The FL evaluation is based on the following three simulation parameters:

  • Size of harmful class in each client: With this parameter, we control the size of the harmful class in each client's dataset. We consider a homogeneous population with the same class ratio (harmful:normal). Generally, as studies showed, around 8% of Twitter's online content is harmful chatzakou2017mean; founta2018large. That said, there are often controversial topics where the users' behavior is highly polarized, for instance, COVID-19 vaccination, the Russian invasion of Ukraine, and several conspiracy theories. We expect that the browsing history of users interested in these topics will contain a higher proportion of harmful content.

  • Client dataset size: the number of tweets at a client device. These tweets can represent either the user’s browsing history or tweets posted, retweeted, etc., by the user.

  • Number of FL clients: the number of clients available for the FL training.

We experiment with different values of the simulation parameters to explore how they affect the FL classification performance.

5.3 Datasets

(a) Harmful Class Ratio
(b) Client Dataset Size
(c) Number of Clients
Figure 3: Evaluation of non-DP FL. (a) 50 clients, 1K data points per client; (b) 50 clients, balanced data per client (i.e. 50% harmful data); (c) 1K data points and balanced data per client

We select the following datasets for the experimental evaluation based on past studies of misbehavior on Twitter. For all datasets, in order to keep the FL task lighter for the user device, we binarize the classification problem by merging the several harmful classes into a single “harmful” class. We report below the original classes together with the final binary ones.

Abusive Dataset founta2018large initially contains 100K tweets, labeled as "Abusive", "Hate", "Normal", and "Spam". We remove the 14,030 tweets labeled as "Spam" – following the same methodology as founta2018unified – because more sophisticated techniques exist to handle spam profiles. The resulting dataset consists of 31.6% "Abusive", 5.8% "Hate", and 62.6% "Normal" tweets. Final binary classes: 37.4% "Harmful" and 62.6% "Normal".

Sarcastic Dataset 10.1145/2684822.2685316 contains tweet texts classified into two classes, labeled "Sarcastic" (10.5%) and "None" (89.5%). Final binary classes: 10.5% "Harmful" and 89.5% "Normal".

Hateful Dataset Waseem2016HatefulSO is a dataset of tweets categorized into the "Racism" (12%), "Sexism" (20%), and "Normal" (68%) classes. Final binary classes: 32% "Harmful" and 68% "Normal".

Offensive Dataset consists of tweets categorized into three classes: "Hate" (6%), "Offensive" (77%), and "Normal" (17%). Final binary classes: 83% "Harmful" and 17% "Normal".

Cyberbully Dataset chatzakou2017mean is a smaller dataset, with tweets falling into the "Bully" (8.5%), "Aggressive" (5.5%), and "Normal" (86%) classes. Final binary classes: 14% "Harmful" and 86% "Normal".

We also preprocess the tweet texts by removing tags, URLs, numbers, punctuation characters, non-ASCII characters, etc. Moreover, we convert the text to lowercase and collapse consecutive white spaces into a single one. We also remove English stop words and words that appear only once in the dataset (likely misspelled words).
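The preprocessing steps above can be sketched with standard-library regular expressions. This is an illustrative approximation (the stop-word list shown is a tiny subset for brevity, and the exact cleaning rules of the original pipeline may differ):

```python
import re
from collections import Counter

# Illustrative subset of English stop words; a full list would be used in practice.
STOP_WORDS = {"the", "a", "an", "this", "is", "are", "and", "or", "to", "of"}

def preprocess(tweets):
    """Sketch of the cleaning pipeline: strip tags, URLs, numbers, punctuation,
    and non-ASCII characters; lowercase; collapse whitespace; drop stop words
    and words that occur only once across the dataset."""
    cleaned = []
    for t in tweets:
        t = re.sub(r"@\w+|#\w+", " ", t)           # user tags and hashtags
        t = re.sub(r"https?://\S+", " ", t)        # URLs
        t = t.encode("ascii", "ignore").decode()   # non-ASCII characters
        t = re.sub(r"[^a-zA-Z\s]", " ", t)         # numbers and punctuation
        t = re.sub(r"\s+", " ", t.lower()).strip() # lowercase, single spaces
        cleaned.append([w for w in t.split() if w not in STOP_WORDS])
    # Drop hapax words (appearing exactly once in the whole dataset).
    counts = Counter(w for toks in cleaned for w in toks)
    return [" ".join(w for w in toks if counts[w] > 1) for toks in cleaned]
```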

5.4 Evaluation of non–DP FL models

In the following experiments, we evaluate the non–DP FL framework on the “Abusive” dataset only. We chose this dataset because its size allowed experimentation with various FL simulation parameters.

5.4.1 Percent of "harmful" data in the clients. Here, we evaluate the FL classification when we vary the percent of "harmful" data in the clients' datasets, using values from 10% to 50%. For a given "%harmful" value, we first randomly select fifty clients and then train the classifier on these clients for twenty FL rounds. Each client dataset consists of 1K data points. Finally, we repeat the experiment five times to obtain average scores.

We also ran experiments with the Centralized training setup by varying the percent of "harmful" text in the training set, randomly selecting 50K tweets as the training set. We chose 50K samples to make the centralized classification performance comparable with the previously mentioned FL training (fifty clients with 1K tweets each). We repeated the training three times for each %harmful value.

Results Discussion: In Figure 3(a), we present the average AUC values (test evaluation). We note that by increasing the examples of the "harmful" class by 5 times (i.e., from 10% to 50%), we obtain a 9% increase in AUC (from 74% to 83%). In the case of a 10% harmful class size, we got a 95% score in precision, recall, and F1-score. Interestingly, in the case of a 50% harmful class size, we obtained a precision of 93%, recall of 89%, and F1-score of 90%, which shows a decrease of 1%, 6%, and 4%, respectively. The training dataset is imbalanced when only 10% of the clients' data is harmful. To understand this reduction in the model's performance, we also calculated the metrics only on the harmful class (the minority class), where we observed a 30% increase in recall but also a 40% negative impact on precision (with a 10% harmful class size we got a recall of 50% and a precision of 82%; with 50% we got 77% and 40%, respectively). This means that a balanced dataset (with a 50% harmful class size) improves the recall of the harmful class: i.e., it helps the model learn the harmful class better. This is what drives the AUC up as well (in the weighted metrics as well as in the harmful-only case).

In the centralized approach, the classifier shows high performance, with only a 3% AUC difference between the 10% and 50% of harmful class size (90%, and 93% AUC, respectively). Finally, we get the best FL classification performance for balanced clients datasets (only 10% AUC difference with the centralized training).

5.4.2 Client's dataset size. We assume a homogeneous setting where all clients have the same dataset size. We evaluate the classifier performance for client dataset sizes in [0.1K, 0.5K, 1K]. We run the FL training setup for twenty FL rounds using the same randomly selected fifty clients. Each client has a balanced dataset. We repeat the FL training twenty times for the runs with 0.1K and 0.5K data points, and five times for 1K data points. We present the average AUC metric in Figure 3(b).

Results discussion: Increasing the client dataset size by ten times (from 0.1K to 1K data points) leads to an overall improvement of the performance metrics, with a 2% gain in AUC (from 81% to 83%). We also observed a 2% improvement in F1-score (from 88% to 90%), a 4% improvement in accuracy (from 85% to 89%) and in recall (from 85% to 89%), and a 1% improvement in precision (from 92% to 93%). The results show that increasing the data by five times did not significantly improve the performance; the model performs similarly with 0.1K data points per client. Therefore, the experiment shows that FL training can build an effective model (81% AUC) even with 100 data points per client.

5.4.3 Number of FL Clients. In this experiment, we run the FL training setup while varying the number of available clients, i.e., 10, 20, 30, 40, and 50. Each client has a balanced 1K dataset, and the FL training runs for twenty rounds with the same randomly selected clients. We run the FL training five times for each number of clients and present the average test AUC in Figure 2(c).

Results Discussion: Increasing the number of clients participating in the FL training by five times (i.e., from 10 to 50) increases the AUC by 2% (from 81% to 83%). Additionally, the accuracy, precision, recall, and F1-score increase by 3%, 1%, 3%, and 2%, respectively (from 86%, 92%, 86%, 88% to 89%, 93%, 89%, 90%). Interestingly, even with only ten users/clients, the system can build an effective model: the model performs similarly well regardless of the number of clients participating in the FL training.

Dataset     Setup         Accuracy      AUC           F1 Score
Abusive     FL (50)       0.85 (0.01)   0.81 (0.01)   0.88 (0.01)
            Centralized   0.92 (0.01)   0.92 (1e-3)   0.94 (0.01)
Sarcastic   FL (50)       0.73 (0.01)   0.66 (0.01)   0.79 (0.01)
            Centralized   0.76 (0.05)   0.75 (0.03)   0.83 (0.03)
Hateful     FL (50)       0.85 (0.02)   0.61 (0.01)   0.87 (0.01)
            Centralized   0.79 (0.02)   0.79 (0.01)   0.85 (0.01)
Offensive   FL (37)       0.78 (0.02)   0.78 (0.01)   0.83 (0.02)
            Centralized   0.92 (0.01)   0.92 (1e-3)   0.94 (0.01)
Cyberbully  FL (16)       0.94 (1e-3)   0.80 (0.01)   0.94 (1e-3)
            Centralized   0.91 (0.03)   0.91 (0.02)   0.93 (0.02)

Table 1: Comparing the FL and centralized approaches. Average values over five repetitions (std in parentheses) for five different datasets. Each client has 0.1K data points and balanced data (50% harmful class). The FL rows indicate the number of clients used.

5.5 Generalization on other Twitter datasets

Building on the first round of experiments, we test the FL training setup with four other datasets (see dataset details in Section 5.3) to explore how well the classifier generalizes. For each dataset, we run both the FL and the centralized training for five repetitions each, and then compare the average performance.

We run the FL training for twenty rounds, with the same clients participating in each round. Each client has a balanced dataset of 100 tweets. We set the dataset size to 100 due to the datasets’ size limitations, and based on the previous finding that 100 data points per client are sufficient for effective FL training. We randomly select fifty clients when the dataset size allows it; for small datasets, we build the maximum possible number of clients, i.e., 37 and 16 clients for the Offensive and Cyberbully datasets, respectively. For the centralized training, we use a training set size that matches the total data used in the FL training for the corresponding dataset. We did not perform hyperparameter tuning when training the model with the different datasets. We present the average evaluation metrics (test phase) for both setups in Table 1.
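The balanced per-client split described above can be sketched as follows; the function name, pool sizes, and seed are illustrative, not the paper's implementation:

```python
import random

def make_clients(harmful, normal, shard_size=100, max_clients=50, seed=0):
    """Split two labeled tweet pools into balanced per-client shards
    (half harmful, half normal, as in the 50%-harmful-class setting)."""
    rng = random.Random(seed)
    rng.shuffle(harmful)
    rng.shuffle(normal)
    half = shard_size // 2
    # Small datasets bound the client count, not max_clients.
    n_clients = min(max_clients, len(harmful) // half, len(normal) // half)
    clients = []
    for i in range(n_clients):
        shard = (harmful[i * half:(i + 1) * half]
                 + normal[i * half:(i + 1) * half])
        rng.shuffle(shard)
        clients.append(shard)
    return clients

# E.g. a pool with only 800 harmful and 2000 normal tweets yields 16 clients.
clients = make_clients(list(range(800)), list(range(800, 2800)))
```

This mirrors how a small dataset limits the number of clients that can be built: the minority class, divided by half the shard size, caps the client count before `max_clients` does.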


Results Discussion: Across all five datasets, we observe an AUC between 61% and 81%. We get the best AUC when training with the Abusive dataset (81%), and with the smallest dataset, Cyberbully, we achieve an AUC of 80%. Training with the Offensive, Sarcastic, and Hateful datasets, we get an AUC of 78%, 66%, and 61%, respectively. Additionally, the model’s AUC decreases by 9% (the minimum, on Sarcastic) to 18% (the maximum, on Hateful) when trained with the FL approach compared to the centralized one. However, the results show that the classifier can generalize and achieve acceptable performance on different types of misbehavior, even without hyperparameter tuning.

5.6 FL with Central Differential Privacy

Figure 4: Comparing DP and non-DP FL. Evaluation of (ε, δ)-DP FL for different ε values and fixed δ. Experiments with 628 total clients; 100 data points per client; 50% harmful class (balanced data). For the non-DP FL, we perform client selection (per FL round) with the same sampling values used for the DP FL.

Central Differential Privacy provides privacy guarantees (at the user level) against data (or membership) inference attacks by an external attacker who has access to the trained model. We apply CDP to our FL training setup (our implementation is based on the TensorFlow Privacy library). TensorFlow modifies the Federated Averaging algorithm to add CDP based on the study of [NEURIPS2021_91cff01a]. The modifications are the following: (i) each client clips its model updates before transmitting them to the server; (ii) during the aggregation of the clients’ updates, the server adds noise to the sum of the updates before averaging.
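A minimal sketch of these two modifications, with an illustrative clip norm (the noise multiplier 1.1 matches one of the settings reported later, but this is a simplified stand-in for the library's aggregator, not its actual code):

```python
import numpy as np

def clip_update(update, clip_norm=1.0):
    """Client side: scale the update so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    if norm > clip_norm:
        update = update * (clip_norm / norm)
    return update

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Server side: sum the clipped updates, add Gaussian noise calibrated
    to the clip norm, and only then average."""
    rng = rng or np.random.default_rng(0)
    total = np.sum([clip_update(u, clip_norm) for u in updates], axis=0)
    total += rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return total / len(updates)
```

Clipping bounds any single client's influence on the sum, so the Gaussian noise scaled to `noise_multiplier * clip_norm` can mask each individual contribution.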

The TensorFlow Privacy library provides an implementation that returns the necessary DP parameters (i.e., noise multiplier, sampling size) to achieve a specific (ε, δ)-DP for the FL training setup. This implementation is based on the Moments Accountant method [rdpaccountant], which assesses the (ε, δ)-DP of the model. Lower ε values mean that we offer stronger privacy to the clients participating in the FL training. The noise multiplier defines the amount of noise added to the sum of the model updates, and the sampling size refers to randomly selecting a subset of the available clients to participate in each round. The client sampling adds to the privacy guarantee of the training, since we do not use a fixed set of clients in every round.
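For intuition on how the noise scale relates to (ε, δ), the classical analytic Gaussian-mechanism bound can be sketched as below. Note that this closed form (valid for ε < 1, single release) is looser than the RDP/Moments Accountant the library actually uses, so it will not reproduce the noise multipliers reported later:

```python
import math

def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Noise standard deviation giving (epsilon, delta)-DP for one release
    of a query with the given L2 sensitivity (classical analytic bound)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
```

The inverse relationship is the key point: halving ε roughly doubles the required noise, which is why strict privacy budgets degrade utility in the experiments below.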

We run an experiment to assess the privacy guarantee and utility trade-off. For this experiment, we use the Abusive dataset and split and distribute the data to clients as described in Section 4.4. We run the FL training setup for 100 rounds, and each client has a 0.1K balanced dataset. These FL parameters give the maximum available number of clients, i.e., 628 clients. We use Poisson sampling, which yields a different number of clients participating in each round, with a mean set to the sampling size value.
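Poisson client sampling can be sketched as follows; the helper name and seed are illustrative:

```python
import random

def poisson_sample(n_clients, expected_size, seed=None):
    """Each client joins the round independently with probability
    q = expected_size / n_clients, so the round size varies around the
    mean expected_size instead of being fixed."""
    q = expected_size / n_clients
    rng = random.Random(seed)
    return [c for c in range(n_clients) if rng.random() < q]

# E.g. 628 available clients and a sampling size (mean) of 25 per round.
round_clients = poisson_sample(628, 25, seed=7)
```

Because each client's participation is an independent coin flip, an attacker cannot know whether a given user was in a given round, which is what strengthens the privacy accounting.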

We evaluate the DP classifier with different ε values while keeping δ fixed; we define δ using the formula suggested in [Abadi_2016]. For each ε value, we obtain the DP parameters – necessary for achieving the given (ε, δ)-DP – using the TensorFlow Privacy library. For ε values of [1.5, 3, 5, 10], we get the following DP parameters {sampling size, noise multiplier}: 1.5={23, 1.15}, 3={25, 0.875}, 5={66, 1.1}, 10={37, 0.612}, respectively. We repeated the simulations ten times for ε set to [1.5, 3] and five times for ε set to [5, 10]. We present the average AUC achieved in Figure 3(a) (the green line shows the mean and the red the median).

To investigate the trade-off between utility and privacy, we run a set of experiments with the FL training setup using the same parameters as before (i.e., client datasets, sampling size, number of FL rounds) but without adding DP. In Figure 3(a), we present the average AUC values (over five repetitions) for the non-DP model. We evaluated the model’s performance every ten rounds of the FL training, for both the non-DP model and the DP model with ε values 3 (medium) and 5 (medium-high). We present the average AUC values in Figures 3(b) and 3(c), respectively.

Results Discussion: Figure 3(a) shows that adding DP with a strict privacy guarantee (i.e., the lowest tested ε, 1.5) causes a 20% decrease in AUC compared to the non-DP model performance. Experimenting with such low ε values, we observed that we do not get a robust model with stable behavior (i.e., four out of ten repetitions gave a 10% to 30% AUC). Additionally, we observed that the classifier can tolerate a noise multiplier near the value 1; adding more noise does not allow the classifier to learn during the training. With medium DP levels, ε = 3 and ε = 5, we get an average AUC of 75% and 80%, respectively, approaching the non-DP model’s performance. Figures 3(b) and 3(c) show that the DP-model training requires more FL rounds to converge (i.e., 100 rounds), while the non-DP model’s performance increases rapidly and reaches an acceptable AUC within 20-30 rounds. The performance of the non-private model additionally confirms our previous observation that altering the number of FL participants (i.e., the sampling size) does not affect the model’s performance. Finally, we observe that by training the model for 100 FL rounds, we get an 85% AUC. In other words, the performance improves by 4% over the case we present in Figure 2(b), i.e., fifty clients with a 0.1K balanced dataset each.

5.7 Overhead on Client’s Device

We run an experiment to measure the extra overhead incurred on the client’s device when participating in the FL training. Specifically, we assess the overhead during the local training that happens within one FL round on the client’s device.

Figure 5: CPU and memory consumption (sampled every 2 seconds) on the client’s device during the FL training. The client’s training set consists of 0.1K data points.

We run the local training on a laptop (see laptop specifications in Section 5.1), using the whole client dataset as the training set. Since the experiments with 0.1K data points per client showed that we can obtain a well-performing classifier, we set the client dataset size to 0.1K. While training the model locally, we monitor the machine’s resource utilization (memory consumption and CPU utilization) and collect the logs every two seconds. We repeated the training ten times, keeping the CPU otherwise ‘idle’ during the training by not running other applications. Figure 5 shows the device’s CPU utilization percentage and memory consumption (in MB) during the local training, averaged over the repetitions.
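A resource-monitoring loop of this kind can be sketched with the Python standard library; the sampling interval, the busy loop standing in for the local training step, and the helper name are illustrative (the `resource` module is Unix-only, and `ru_maxrss` reports peak rather than current RSS):

```python
import resource   # Unix-only standard-library module
import threading
import time

def monitor(stop, log, interval=2.0):
    """Sample process CPU utilization (%) and peak RSS while training runs."""
    last_cpu, last_wall = time.process_time(), time.monotonic()
    while not stop.is_set():
        time.sleep(interval)
        cpu, wall = time.process_time(), time.monotonic()
        util = 100.0 * (cpu - last_cpu) / (wall - last_wall)
        # ru_maxrss is in KB on Linux (bytes on macOS); convert to MB here.
        rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        log.append((util, rss_mb))
        last_cpu, last_wall = cpu, wall

stop, log = threading.Event(), []
sampler = threading.Thread(target=monitor, args=(stop, log, 0.1), daemon=True)
sampler.start()

# Stand-in workload for the local training step on the 0.1K shard.
deadline = time.monotonic() + 0.5
while time.monotonic() < deadline:
    sum(i * i for i in range(1000))

stop.set()
sampler.join()
```

Running the sampler in a background thread keeps the measurement loop from perturbing the training loop it observes, beyond the small cost of the periodic samples themselves.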

Results discussion: In Figure 5, we see the duration of the whole training phase. From second 0 to 8, the CPU utilization increases roughly linearly up to 20%. Then, there is a rapid increase (from second 8 to 10) in which the CPU utilization reaches its peak, before decreasing again at the end of the training phase. The average CPU utilization during the training remains moderate across all repetitions.

The memory consumption varies during the training. There is a warm-up phase (from second 0 to 10, when the training phase begins) where the memory consumption increases by 100MB. There is a decrease in memory consumption at second 12 (as also happens with the CPU utilization), resulting from one of the repetitions completing the training faster than the rest. Overall, the results show that the local training occupies the device only for a short time and does not introduce a severe overhead for the client device.

6 Conclusion

People of all ages make heavy use of Online Social Networks and are often exposed to harmful content and various types of misbehavior (e.g., hate speech, cyberbullying, sarcasm, offensive language). Online content moderation tools provide countermeasures against such content but at the same time require processing sensitive user data. The FL paradigm, together with Differential Privacy techniques, provides a distributed and privacy-preserving ML training framework that complies with privacy policies (e.g., the GDPR).

In this work, we proposed a privacy-preserving (DP) FL framework for content moderation on Twitter. This DP FL paradigm protects the users’ privacy and can be easily adapted to other social media platforms and other types of misbehavior. The experimental results – over five Twitter datasets – show that (i) for both the DP and non-DP FL variations, the text classification performance is close to the centralized approach; (ii) the framework achieves high performance even if only a small number of clients (with small local datasets) are available for the FL training; and (iii) the FL training does not burden the user’s device in terms of CPU and memory consumption.

7 Acknowledgments

This project has been funded by the European Union’s Horizon 2020 Research and Innovation program under the Cybersecurity CONCORDIA project (Grant Agreement No. 830927) and the Marie Skłodowska–Curie AERAS project (Grant Agreement No. 872735).


Ethical Considerations

This work followed the principles and guidelines for conducting ethical information research and using shared data. The proposed methodology complies with the GDPR and ePrivacy regulations. We did not collect data from Twitter; we use existing Twitter datasets that have already been published by other academic studies, obtained by requesting access from their publishers. For this reason, we will not publicly release any dataset used in this study. We did not use or present any identifiable user information from the datasets (e.g., Twitter user IDs). We applied text preprocessing to clean the tweets of any information that could identify specific Twitter accounts (see Section 5.3); hence, the training data of the text classifier did not contain Twitter usernames. Finally, we implemented and executed the experiments locally – on our own devices – without using any cloud computation services, so we did not upload any of the datasets to the cloud.